WorldWideScience

Sample records for studies snps imputation

  1. Analysis of Case-Control Association Studies: SNPs, Imputation and Haplotypes

    KAUST Repository

    Chatterjee, Nilanjan

    2009-11-01

    Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the "retrospective" likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg-Equilibrium (HWE), gene-gene and gene-environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights to existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data. © Institute of Mathematical Statistics, 2009.

  2. Analysis of Case-Control Association Studies: SNPs, Imputation and Haplotypes

    KAUST Repository

    Chatterjee, Nilanjan; Chen, Yi-Hau; Luo, Sheng; Carroll, Raymond J.

    2009-01-01

    Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the "retrospective" likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg-Equilibrium (HWE), gene-gene and gene-environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights to existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data. © Institute of Mathematical Statistics, 2009.
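
    As a small illustration of the Hardy-Weinberg equilibrium (HWE) assumption that the abstract above says the retrospective likelihood can exploit, the sketch below computes the expected genotype frequencies for a biallelic SNP from one allele frequency. The function name is hypothetical and the code is not taken from the article; it shows only the textbook HWE relation.

```python
def hwe_genotype_frequencies(p):
    """Expected (AA, Aa, aa) genotype frequencies under HWE.

    p is the frequency of allele A; the function name is illustrative only.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    # Under HWE the two alleles of a genotype are drawn independently.
    return (p * p, 2.0 * p * q, q * q)


if __name__ == "__main__":
    print(hwe_genotype_frequencies(0.3))
```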

  3. Quick, “Imputation-free” meta-analysis with proxy-SNPs

    Directory of Open Access Journals (Sweden)

    Meesters Christian

    2012-09-01

    Background: Meta-analysis (MA) is widely used to pool genome-wide association studies (GWASes) in order to (a) increase the power to detect strong or weak genotype effects or (b) verify results. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS consortia to avoid losing too many SNPs in a MA. YAMAS (Yet Another Meta Analysis Software), however, enables cross-GWAS conclusions prior to finished and polished imputation runs, which are ultimately time-consuming. Results: Here we present a fast method to avoid forfeiting SNPs present in only a subset of studies, without relying on imputation. This is accomplished by using reference linkage disequilibrium data from the 1000 Genomes/HapMap projects to find proxy-SNPs together with in-phase alleles for SNPs missing in at least one study. MA is conducted by combining association effect estimates of a SNP and those of its proxy-SNPs. Our algorithm is implemented in the MA software YAMAS. Association results from GWAS analysis applications can be used as input files for MA, tremendously speeding up MA compared to the conventional imputation approach. We show that our proxy algorithm is well-powered and yields valuable ad hoc results, possibly providing an incentive for follow-up studies. We propose our method as a quick screening step prior to imputation-based MA, as well as an additional main approach for studies without available reference data matching the ethnicities of study participants. As a proof of principle, we analyzed six dbGaP Type II Diabetes GWAS and found that the proxy algorithm clearly outperforms naïve MA on the p-value level: for 17 out of 23 we observe an improvement on the p-value level by a factor of more than two, and a maximum improvement by a factor of 2127. Conclusions: YAMAS is an efficient and fast meta-analysis program which offers various methods, including conventional MA as well as inserting proxy-SNPs
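
    The abstract above does not say how the effect estimates of a SNP and its proxy-SNPs are combined. As a hedged illustration of one standard way to combine per-study association estimates, the sketch below implements a generic fixed-effect, inverse-variance meta-analysis. It is not the YAMAS algorithm; the function name and example numbers are assumptions.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance pooling of per-study effect estimates.

    Generic textbook method, shown for illustration only (not YAMAS code).
    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    if not betas or len(betas) != len(standard_errors):
        raise ValueError("need equally long, non-empty lists")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value


if __name__ == "__main__":
    # Hypothetical log-odds ratios and standard errors from three studies.
    print(inverse_variance_meta([0.12, 0.20, 0.08], [0.05, 0.10, 0.07]))
```

    In a proxy-based setting, a study lacking the target SNP would contribute the estimate of an in-phase proxy-SNP instead; how such estimates should be weighted is beyond this sketch.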

  4. Comparing strategies for selection of low-density SNPs for imputation-mediated genomic prediction in U. S. Holsteins.

    Science.gov (United States)

    He, Jun; Xu, Jiaqi; Wu, Xiao-Lin; Bauck, Stewart; Lee, Jungjae; Morota, Gota; Kachman, Stephen D; Spangler, Matthew L

    2018-04-01

    SNP chips are commonly used for genotyping animals in genomic selection but strategies for selecting low-density (LD) SNPs for imputation-mediated genomic selection have not been addressed adequately. The main purpose of the present study was to compare the performance of eight LD (6K) SNP panels, each selected by a different strategy exploiting a combination of three major factors: evenly-spaced SNPs, increased minor allele frequencies, and SNP-trait associations either for single traits independently or for all the three traits jointly. The imputation accuracies from 6K to 80K SNP genotypes were between 96.2 and 98.2%. Genomic prediction accuracies obtained using imputed 80K genotypes were between 0.817 and 0.821 for daughter pregnancy rate, between 0.838 and 0.844 for fat yield, and between 0.850 and 0.863 for milk yield. The two SNP panels optimized on the three major factors had the highest genomic prediction accuracy (0.821-0.863), and these accuracies were very close to those obtained using observed 80K genotypes (0.825-0.868). Further exploration of the underlying relationships showed that genomic prediction accuracies did not respond linearly to imputation accuracies, but were significantly affected by genotype (imputation) errors of SNPs in association with the traits to be predicted. SNPs optimal for map coverage and MAF were favorable for obtaining accurate imputation of genotypes whereas trait-associated SNPs improved genomic prediction accuracies. Thus, optimal LD SNP panels were the ones that combined both strengths. The present results have practical implications on the design of LD SNP chips for imputation-enabled genomic prediction.

  5. Sequence imputation of HPV16 genomes for genetic association studies.

    Directory of Open Access Journals (Sweden)

    Benjamin Smith

    Human Papillomavirus type 16 (HPV16) causes over half of all cervical cancer and some HPV16 variants are more oncogenic than others. The genetic basis for the extraordinary oncogenic properties of HPV16 compared to other HPVs is unknown. In addition, we neither know which nucleotides vary across and within HPV types and lineages, nor which of the single nucleotide polymorphisms (SNPs) determine oncogenicity. A reference set of 62 HPV16 complete genome sequences was established and used to examine patterns of evolutionary relatedness amongst variants using a pairwise identity heatmap and HPV16 phylogeny. A BLAST-based algorithm was developed to impute complete genome data from partial sequence information using the reference database. To interrogate the oncogenic risk of determined and imputed HPV16 SNPs, odds-ratios for each SNP were calculated in a case-control viral genome-wide association study (VWAS) using biopsy-confirmed high-grade cervix neoplasia and self-limited HPV16 infections from Guanacaste, Costa Rica. HPV16 variants display evolutionarily stable lineages that contain conserved diagnostic SNPs. The imputation algorithm indicated that an average of 97.5±1.03% of SNPs could be accurately imputed. The VWAS revealed specific HPV16 viral SNPs associated with variant lineages and elevated odds ratios; however, individual causal SNPs could not be distinguished with certainty due to the nature of HPV evolution. Conserved and lineage-specific SNPs can be imputed with a high degree of accuracy from limited viral polymorphic data due to the lack of recombination and the stochastic mechanism of variation accumulation in the HPV genome. However, to determine the role of novel variants or non-lineage-specific SNPs by VWAS will require direct sequence analysis. The investigation of patterns of genetic variation and the identification of diagnostic SNPs for lineages of HPV16 variants provides a valuable resource for future studies of HPV16

  6. Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy.

    Science.gov (United States)

    Johnson, Eric O; Hancock, Dana B; Levy, Joshua L; Gaddis, Nathan C; Saccone, Nancy L; Bierut, Laura J; Page, Grier P

    2013-05-01

    A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2 % of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30 % of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.
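
    The contrast between imputing from the union and from the intersection of genotyped SNPs can be made concrete with plain set operations. The sketch below is only an illustration of that idea; the array names and SNP identifiers are hypothetical and no real array content is implied.

```python
def imputation_input_sets(array_snps):
    """Return (union, intersection, shared_fraction) of genotyped SNP sets.

    array_snps maps an array name to the set of SNP IDs it genotypes.
    Names and IDs are illustrative assumptions, not real array manifests.
    """
    if not array_snps:
        raise ValueError("at least one array is required")
    snp_sets = [set(snps) for snps in array_snps.values()]
    union = set().union(*snp_sets)            # SNPs on one or more arrays
    intersection = set.intersection(*snp_sets)  # SNPs on all arrays
    shared_fraction = len(intersection) / len(union) if union else 0.0
    return union, intersection, shared_fraction


if __name__ == "__main__":
    arrays = {
        "array_A": {"rs1", "rs2", "rs3", "rs4"},
        "array_B": {"rs2", "rs3", "rs5"},
    }
    union, intersection, frac = imputation_input_sets(arrays)
    print(sorted(union), sorted(intersection), frac)
```

    Using only the intersection as imputation input gives every sample the same set of observed SNPs, which is the condition under which the abstract reports that the bias disappeared.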

  7. Comparison of three boosting methods in parent-offspring trios for genotype imputation using simulation study

    Directory of Open Access Journals (Sweden)

    Abbas Mikhchi

    2016-01-01

    Background: Genotype imputation is an important process for predicting unknown genotypes; it uses a reference population with dense genotypes to predict missing genotypes for both human and animal genetic variations at low cost. Machine learning methods, especially boosting methods, have been used in genetic studies to explore the underlying genetic profile of disease and to build models capable of predicting missing values of a marker. Methods: In this study, strategies and factors affecting the imputation accuracy of parent-offspring trios were compared, imputing from a lower-density SNP panel (5 K) to a high-density (10 K) SNP panel using three different boosting methods, namely TotalBoost (TB), LogitBoost (LB) and AdaBoost (AB). The methods were applied to simulated data to impute the untyped SNPs in parent-offspring trios. Four datasets were simulated: G1 (100 trios with 5 K SNPs), G2 (100 trios with 10 K SNPs), G3 (500 trios with 5 K SNPs), and G4 (500 trios with 10 K SNPs). In all four datasets, all parents were genotyped completely, and offspring were genotyped with a lower-density panel. Results: Comparison of the three methods showed that LB outperformed AB and TB in imputation accuracy. Computation time differed between methods; AB was the fastest algorithm. Higher SNP densities increased imputation accuracy. Larger numbers of trios (i.e., 500) improved the performance of LB and TB. Conclusions: All three methods perform well in terms of imputation accuracy, and the dense chip is recommended for imputation of parent-offspring trios.

  8. Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies

    Directory of Open Access Journals (Sweden)

    McElwee Joshua

    2009-06-01

    Background: Although high-throughput genotyping arrays have made whole-genome association studies (WGAS) feasible, only a small proportion of SNPs in the human genome are actually surveyed in such studies. In addition, various SNP arrays assay different sets of SNPs, which leads to challenges in comparing results and merging data for meta-analyses. Genome-wide imputation of untyped markers allows us to address these issues in a direct fashion. Methods: 384 Caucasian American liver donors were genotyped using Illumina 650Y (Ilmn650Y) arrays, from which we also derived genotypes from the Ilmn317K array. On these data, we compared two imputation methods: MACH and BEAGLE. We imputed 2.5 million HapMap Release22 SNPs, and conducted GWAS on ~40,000 liver mRNA expression traits (eQTL analysis). In addition, 200 Caucasian American and 200 African American subjects were genotyped using the Affymetrix 500 K array plus a custom 164 K fill-in chip. We then imputed the HapMap SNPs and quantified the accuracy by randomly masking observed SNPs. Results: MACH and BEAGLE perform similarly with respect to imputation accuracy. The Ilmn650Y results in excellent imputation performance, and it outperforms Affx500K or Ilmn317K sets. For Caucasian Americans, 90% of the HapMap SNPs were imputed at 98% accuracy. As expected, imputation of poorly tagged SNPs (untyped SNPs in weak LD with typed markers) was not as successful. It was more challenging to impute genotypes in the African American population, given (1) shorter LD blocks and (2) admixture with Caucasian populations in this population. To address issue (2), we pooled HapMap CEU and YRI data as an imputation reference set, which greatly improved overall performance. The approximately 40,000 phenotypes scored in these populations provide a path to determine empirically how the power to detect associations is affected by the imputation procedures. That is, at a fixed false discovery rate, the number of cis

  9. Local exome sequences facilitate imputation of less common variants and increase power of genome wide association studies.

    Directory of Open Access Journals (Sweden)

    Peter K Joshi

    The analysis of less common variants in genome-wide association studies promises to elucidate complex trait genetics but is hampered by low power to reliably detect association. We show that addition of population-specific exome sequence data to global reference data allows more accurate imputation, particularly of less common SNPs (minor allele frequency 1-10%) in two very different European populations. The imputation improvement corresponds to an increase in effective sample size of 28-38%, for SNPs with a minor allele frequency in the range 1-3%.

  10. Using imputed genotype data in the joint score tests for genetic association and gene-environment interactions in case-control studies.

    Science.gov (United States)

    Song, Minsun; Wheeler, William; Caporaso, Neil E; Landi, Maria Teresa; Chatterjee, Nilanjan

    2018-03-01

    Genome-wide association studies (GWAS) are now routinely imputed for untyped single nucleotide polymorphisms (SNPs) based on various powerful statistical algorithms for imputation trained on reference datasets. The use of predicted allele counts for imputed SNPs as the dosage variable is known to produce a valid score test for genetic association. In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic associations incorporating gene-environment interactions. We focus on case-control association studies where inference for an underlying logistic regression model can be performed using alternative methods that rely to varying degrees on an assumption of gene-environment independence in the underlying population. As increasingly large-scale GWAS are being performed through consortium efforts where it is preferable to share only summary-level information across studies, we also describe simple mechanisms for implementing score tests based on standard meta-analysis of "one-step" maximum-likelihood estimates across studies. Applications of the methods in simulation studies and a dataset from a GWAS of lung cancer illustrate the ability of the proposed methods to maintain type-I error rates for the underlying testing procedures. For analysis of imputed SNPs, similar to typed SNPs, the retrospective methods can lead to considerable efficiency gain for modeling of gene-environment interactions under the assumption of gene-environment independence. Methods are made available for public use through the CGEN R software package. © 2017 WILEY PERIODICALS, INC.
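
    The "predicted allele count" used as a dosage variable is the expected number of copies of the counted allele under the imputed genotype probabilities. The sketch below shows that calculation for a single biallelic SNP; it is not taken from the CGEN package, and the function and argument names are assumptions.

```python
def expected_dosage(p_hom_ref, p_het, p_hom_alt, tol=1e-6):
    """Expected count (0 to 2) of the alternate allele for one imputed genotype.

    Arguments are the imputed probabilities of carrying 0, 1, and 2 copies.
    Illustrative helper only; names are hypothetical.
    """
    total = p_hom_ref + p_het + p_hom_alt
    if abs(total - 1.0) > tol:
        raise ValueError("genotype probabilities must sum to 1")
    # E[count] = 0 * P(0 copies) + 1 * P(1 copy) + 2 * P(2 copies)
    return p_het + 2.0 * p_hom_alt


if __name__ == "__main__":
    print(expected_dosage(0.1, 0.3, 0.6))
```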

  11. Imputation of genotypes in Danish two-way crossbred pigs using low density panels

    DEFF Research Database (Denmark)

    Xiang, Tao; Christensen, Ole Fredslund; Legarra, Andres

    Genotype imputation is commonly used as an initial step of genomic selection. Studies on humans, plants and ruminants suggested many factors would affect the performance of imputation. However, studies rarely investigated pigs, especially crossbred pigs. In this study, different scenarios...... of imputation from 5K SNPs to 7K SNPs on Danish Landrace, Yorkshire, and crossbred Landrace-Yorkshire were compared. In conclusion, genotype imputation on crossbreds performs equally well as in purebreds, when parental breeds are used as the reference panel. When the size of reference is considerably large...... SNPs. This dataset will be analyzed for genomic selection in a future study...

  12. GACT: a Genome build and Allele definition Conversion Tool for SNP imputation and meta-analysis in genetic association studies.

    Science.gov (United States)

    Sulovari, Arvis; Li, Dawei

    2014-07-19

    Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires same allele definition (nomenclature) and genome build among individual studies. Similarly, imputation, commonly-used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in use of different genome builds and allele definitions. Incorrect assumptions of identical allele definition among combined GWAS lead to a large portion of discarded genotypes or incorrect association findings. There is no published tool that predicts and converts among all major allele definitions. In this study, we have developed a tool, GACT, which stands for Genome build and Allele definition Conversion Tool, that predicts and inter-converts between any of the common SNP allele definitions and between the major genome builds. In addition, we assessed several factors that may affect imputation quality, and our results indicated that inclusion of singletons in the reference had detrimental effects while ambiguous SNPs had no measurable effect. Unexpectedly, exclusion of genotypes with missing rate > 0.001 (40% of study SNPs) showed no significant decrease of imputation quality (even significantly higher when compared to the imputation with singletons in the reference), especially for rare SNPs. GACT is a new, powerful, and user-friendly tool with both command-line and interactive online versions that can accurately predict, and convert between any of the common allele definitions and between genome builds for genome-wide meta-analysis and imputation of genotypes from SNP-arrays or deep

  13. The multiple imputation method: a case study involving secondary data analysis.

    Science.gov (United States)

    Walani, Salimah R; Cleland, Charles M

    2015-05-01

    To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
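
    After an analysis is run separately on each multiply imputed dataset (five in the study above), the results are conventionally combined with Rubin's rules. The abstract does not state the pooling formula, so the sketch below is a generic illustration of those rules; the function name and example values are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the point estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching inputs from at least two imputations")
    pooled = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


if __name__ == "__main__":
    # Hypothetical regression coefficients and variances from five imputations.
    est = [0.10, 0.12, 0.09, 0.11, 0.13]
    var = [0.004, 0.004, 0.005, 0.004, 0.005]
    print(pool_rubin(est, var))
```

    The between-imputation term is what carries the uncertainty due to the missing values, which is why analysing a single imputed dataset tends to understate standard errors.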

  14. Fine scale mapping of the 17q22 breast cancer locus using dense SNPs, genotyped within the Collaborative Oncological Gene-Environment Study (COGs)

    OpenAIRE

    Darabi, Hatef; Beesley, Jonathan; Droit, Arnaud; Kar, Siddhartha; Nord, Silje; Moradi Marjaneh, Mahdi; Soucy, Penny; Michailidou, Kyriaki; Ghoussaini, Maya; Fues Wahl, Hanna; Bolla, Manjeet K.; Wang, Qin; Dennis, Joe; Alonso, M Rosario; Andrulis, Irene L.

    2016-01-01

    Genome-wide association studies have found SNPs at 17q22 to be associated with breast cancer risk. To identify potential causal variants related to breast cancer risk, we performed a high resolution fine-mapping analysis that involved genotyping 517 SNPs using a custom Illumina iSelect array (iCOGS) followed by imputation of genotypes for 3,134 SNPs in more than 89,000 participants of European ancestry from the Breast Cancer Association Consortium (BCAC). We identified 28 highly correlated co...

  15. Imputation-based analysis of association studies: candidate regions and quantitative traits.

    Directory of Open Access Journals (Sweden)

    Bertrand Servin

    2007-07-01

    We introduce a new framework for the analysis of association studies, designed to allow untyped variants to be more effectively and directly tested for association with a phenotype. The idea is to combine knowledge on patterns of correlation among SNPs (e.g., from the International HapMap project or resequencing data in a candidate region of interest) with genotype data at tag SNPs collected on a phenotyped study sample, to estimate ("impute") unmeasured genotypes, and then assess association between the phenotype and these estimated genotypes. Compared with standard single-SNP tests, this approach results in increased power to detect association, even in cases in which the causal variant is typed, with the greatest gain occurring when multiple causal variants are present. It also provides more interpretable explanations for observed associations, including assessing, for each SNP, the strength of the evidence that it (rather than another correlated SNP) is causal. Although we focus on association studies with quantitative phenotype and a relatively restricted region (e.g., a candidate gene), the framework is applicable and computationally practical for whole genome association studies. Methods described here are implemented in a software package, Bim-Bam, available from the Stephens Lab website http://stephenslab.uchicago.edu/software.html.

  16. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies.

    Directory of Open Access Journals (Sweden)

    Clive J Hoggart

    2008-07-01

    Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.

  17. Study of 25 X-chromosome SNPs in the Portuguese

    DEFF Research Database (Denmark)

    Pereira, Vania; Tomas Mas, Carmen; Amorim, António

    2011-01-01

    The importance of X-chromosome markers in individual identifications, population genetics, forensics and kinship testing is getting wide recognition. In this work, we studied the distributions of 25 X-chromosome single nucleotide polymorphisms (X-SNPs) in population samples from Northern, Central...... and Southern Portugal (n=305). The data were also compared with previous data from the Mediterranean area confirming a general genetic homogeneity among populations in the region. The X-SNP distribution in the three Portuguese regional samples did not show any significant substructure and the X...

  18. Sensitivity analysis in multiple imputation in effectiveness studies of psychotherapy.

    Science.gov (United States)

    Crameri, Aureliano; von Wyl, Agnes; Koemeda, Margit; Schulthess, Peter; Tschuschke, Volker

    2015-01-01

    The importance of preventing and treating incomplete data in effectiveness studies is nowadays emphasized. However, most of the publications focus on randomized clinical trials (RCT). One flexible technique for statistical inference with missing data is multiple imputation (MI). Since methods such as MI rely on the assumption of missing data being at random (MAR), a sensitivity analysis for testing the robustness against departures from this assumption is required. In this paper we present a sensitivity analysis technique based on posterior predictive checking, which takes into consideration the concept of clinical significance used in the evaluation of intra-individual changes. We demonstrate the possibilities this technique can offer with the example of irregular longitudinal data collected with the Outcome Questionnaire-45 (OQ-45) and the Helping Alliance Questionnaire (HAQ) in a sample of 260 outpatients. The sensitivity analysis can be used to (1) quantify the degree of bias introduced by missing not at random data (MNAR) in a worst reasonable case scenario, (2) compare the performance of different analysis methods for dealing with missing data, or (3) detect the influence of possible violations to the model assumptions (e.g., lack of normality). Moreover, our analysis showed that ratings from the patient's and therapist's version of the HAQ could significantly improve the predictive value of the routine outcome monitoring based on the OQ-45. Since analysis dropouts always occur, repeated measurements with the OQ-45 and the HAQ analyzed with MI are useful to improve the accuracy of outcome estimates in quality assurance assessments and non-randomized effectiveness studies in the field of outpatient psychotherapy.

  19. Is missing geographic positioning system data in accelerometry studies a problem, and is imputation the solution?

    DEFF Research Database (Denmark)

    Meseck, Kristin; Jankowska, Marta M; Schipperijn, Jasper

    2016-01-01

    The main purpose of the present study was to assess the impact of global positioning system (GPS) signal lapse on physical activity analyses, discover any existing associations between missing GPS data and environmental and demographics attributes, and to determine whether imputation is an accurate...

  20. Performance of genotype imputation for low frequency and rare variants from the 1000 genomes.

    Science.gov (United States)

    Zheng, Hou-Feng; Rong, Jing-Jing; Liu, Ming; Han, Fang; Zhang, Xing-Wei; Richards, J Brent; Wang, Li

    2015-01-01

    Genotype imputation is now routinely applied in genome-wide association studies (GWAS) and meta-analyses. However, most of the imputations have been run using HapMap samples as reference, imputation of low frequency and rare variants (minor allele frequency (MAF) 1000 Genomes panel) are available to facilitate imputation of these variants. Therefore, in order to estimate the performance of low frequency and rare variants imputation, we imputed 153 individuals, each of whom had 3 different genotype array data including 317k, 610k and 1 million SNPs, to three different reference panels: the 1000 Genomes pilot March 2010 release (1KGpilot), the 1000 Genomes interim August 2010 release (1KGinterim), and the 1000 Genomes phase1 November 2010 and May 2011 release (1KGphase1) by using IMPUTE version 2. The differences between these three releases of the 1000 Genomes data are the sample size, ancestry diversity, number of variants and their frequency spectrum. We found that both reference panel and GWAS chip density affect the imputation of low frequency and rare variants. 1KGphase1 outperformed the other 2 panels, at higher concordance rate, higher proportion of well-imputed variants (info>0.4) and higher mean info score in each MAF bin. Similarly, 1M chip array outperformed 610K and 317K. However for very rare variants (MAF ≤ 0.3%), only 0-1% of the variants were well imputed. We conclude that the imputation of low frequency and rare variants improves with larger reference panels and higher density of genome-wide genotyping arrays. Yet, despite a large reference panel size and dense genotyping density, very rare variants remain difficult to impute.

  1. Practical considerations for sensitivity analysis after multiple imputation applied to epidemiological studies with incomplete data

    Science.gov (United States)

    2012-01-01

    Background: Multiple imputation as usually implemented assumes that data are Missing At Random (MAR), meaning that the underlying missing data mechanism, given the observed data, is independent of the unobserved data. To explore the sensitivity of the inferences to departures from the MAR assumption, we applied the method proposed by Carpenter et al. (2007). This approach aims to approximate inferences under a Missing Not At Random (MNAR) mechanism by reweighting estimates obtained after multiple imputation, where the weights depend on the assumed degree of departure from the MAR assumption. Methods: The method is illustrated with epidemiological data from a surveillance system of hepatitis C virus (HCV) infection in France during the 2001–2007 period. The subpopulation studied included 4343 HCV-infected patients who reported drug use. Risk factors for severe liver disease were assessed. After performing complete-case and multiple imputation analyses, we applied the sensitivity analysis to 3 risk factors of severe liver disease: past excessive alcohol consumption, HIV co-infection and infection with HCV genotype 3. Results: In these data, the association between severe liver disease and HIV was underestimated if, given the observed data, the chance of observing HIV status is high when this is positive. Inferences for the two other risk factors were robust to plausible local departures from the MAR assumption. Conclusions: We have demonstrated the practical utility of, and advocate, a pragmatic, widely applicable approach to exploring plausible departures from the MAR assumption post multiple imputation. We have developed guidelines for applying this approach to epidemiological studies. PMID:22681630

  2. New insights into the pharmacogenomics of antidepressant response from the GENDEP and STAR*D studies: rare variant analysis and high-density imputation.

    Science.gov (United States)

    Fabbri, C; Tansey, K E; Perlis, R H; Hauser, J; Henigsberg, N; Maier, W; Mors, O; Placentino, A; Rietschel, M; Souery, D; Breen, G; Curtis, C; Sang-Hyuk, L; Newhouse, S; Patel, H; Guipponi, M; Perroud, N; Bondolfi, G; O'Donovan, M; Lewis, G; Biernacka, J M; Weinshilboum, R M; Farmer, A; Aitchison, K J; Craig, I; McGuffin, P; Uher, R; Lewis, C M

    2017-11-21

    Genome-wide association studies have generally failed to identify polymorphisms associated with antidepressant response. Possible reasons include limited coverage of genetic variants, which this study tried to address through exome genotyping and dense imputation. A meta-analysis of Genome-Based Therapeutic Drugs for Depression (GENDEP) and Sequenced Treatment Alternatives to Relieve Depression (STAR*D) studies was performed at the single-nucleotide polymorphism (SNP), gene and pathway levels. Coverage of genetic variants was increased compared with previous studies by adding exome genotypes to previously available genome-wide data and using the Haplotype Reference Consortium panel for imputation. Standard quality control was applied. Phenotypes were symptom improvement and remission after 12 weeks of antidepressant treatment. Significant findings were investigated in NEWMEDS consortium samples and Pharmacogenomic Research Network Antidepressant Medication Pharmacogenomic Study (PGRN-AMPS) for replication. A total of 7 062 950 SNPs were analyzed in GENDEP (n=738) and STAR*D (n=1409). rs116692768 (P=1.80e-08, ITGA9 (integrin α9)) and rs76191705 (P=2.59e-08, NRXN3 (neurexin 3)) were significantly associated with symptom improvement during citalopram/escitalopram treatment. At the gene level, no consistent effect was found. At the pathway level, the Gene Ontology (GO) terms GO:0005694 (chromosome) and GO:0044427 (chromosomal part) were associated with improvement (corrected P=0.007 and 0.045, respectively). The association between rs116692768 and symptom improvement was replicated in PGRN-AMPS (P=0.047), whereas that of rs76191705 was not. The two SNPs did not replicate in NEWMEDS. ITGA9 codes for a membrane receptor for neurotrophins and NRXN3 is a transmembrane neuronal adhesion receptor involved in synaptic differentiation. Despite their meaningful biological rationale for being involved in antidepressant effect, replication was partial. Further studies may help in clarifying

  3. Construction and application of a Korean reference panel for imputing classical alleles and amino acids of human leukocyte antigen genes.

    Science.gov (United States)

    Kim, Kwangwoo; Bang, So-Young; Lee, Hye-Soon; Bae, Sang-Cheol

    2014-01-01

    Genetic variations of human leukocyte antigen (HLA) genes within the major histocompatibility complex (MHC) locus are strongly associated with disease susceptibility and prognosis for many diseases, including many autoimmune diseases. In this study, we developed a Korean HLA reference panel for imputing classical alleles and amino acid residues of several HLA genes. An HLA reference panel has potential for use in identifying and fine-mapping disease associations with the MHC locus in East Asian populations, including Koreans. A total of 413 unrelated Korean subjects were analyzed for single nucleotide polymorphisms (SNPs) at the MHC locus and six HLA genes, including HLA-A, -B, -C, -DRB1, -DPB1, and -DQB1. The HLA reference panel was constructed by phasing the 5,858 MHC SNPs, 233 classical HLA alleles, and 1,387 amino acid residue markers from 1,025 amino acid positions as binary variables. The imputation accuracy of the HLA reference panel was assessed by measuring concordance rates between imputed and genotyped alleles of the HLA genes from a subset of the study subjects and East Asian HapMap individuals. Average concordance rates were 95.6% and 91.1% at 2-digit and 4-digit allele resolutions, respectively. The imputation accuracy was minimally affected by SNP density of a test dataset for imputation. In conclusion, the Korean HLA reference panel we developed was highly suitable for imputing HLA alleles and amino acids from MHC SNPs in East Asians, including Koreans.
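
    The concordance measure reported above (agreement between imputed and genotyped alleles at 2-digit and 4-digit resolution) can be illustrated with a small sketch. It assumes allele names in the usual colon-separated form such as `A*02:01`, and that typed and imputed alleles have already been paired position by position; the function names are hypothetical and this is not the authors' evaluation code.

```python
def truncate_allele(allele, digits):
    """Reduce a name such as 'A*02:01' to 2-digit ('A*02') or 4-digit ('A*02:01') resolution."""
    gene, fields = allele.split("*")
    n_fields = digits // 2  # 2-digit keeps the first field, 4-digit the first two
    return gene + "*" + ":".join(fields.split(":")[:n_fields])


def concordance_rate(typed, imputed, digits):
    """Fraction of alleles whose imputed call matches the typed call at the given resolution."""
    if len(typed) != len(imputed) or not typed:
        raise ValueError("typed and imputed must be non-empty and of equal length")
    matches = sum(
        truncate_allele(t, digits) == truncate_allele(i, digits)
        for t, i in zip(typed, imputed)
    )
    return matches / len(typed)
```

    Because a 4-digit match implies a 2-digit match, the 4-digit rate can never exceed the 2-digit rate, which is consistent with the 91.1% versus 95.6% figures in the abstract.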

  4. Construction and application of a Korean reference panel for imputing classical alleles and amino acids of human leukocyte antigen genes.

    Directory of Open Access Journals (Sweden)

    Kwangwoo Kim

    Genetic variations of human leukocyte antigen (HLA) genes within the major histocompatibility complex (MHC) locus are strongly associated with disease susceptibility and prognosis for many diseases, including many autoimmune diseases. In this study, we developed a Korean HLA reference panel for imputing classical alleles and amino acid residues of several HLA genes. An HLA reference panel has potential for use in identifying and fine-mapping disease associations with the MHC locus in East Asian populations, including Koreans. A total of 413 unrelated Korean subjects were analyzed for single nucleotide polymorphisms (SNPs) at the MHC locus and six HLA genes, including HLA-A, -B, -C, -DRB1, -DPB1, and -DQB1. The HLA reference panel was constructed by phasing the 5,858 MHC SNPs, 233 classical HLA alleles, and 1,387 amino acid residue markers from 1,025 amino acid positions as binary variables. The imputation accuracy of the HLA reference panel was assessed by measuring concordance rates between imputed and genotyped alleles of the HLA genes from a subset of the study subjects and East Asian HapMap individuals. Average concordance rates were 95.6% and 91.1% at 2-digit and 4-digit allele resolutions, respectively. The imputation accuracy was minimally affected by SNP density of a test dataset for imputation. In conclusion, the Korean HLA reference panel we developed was highly suitable for imputing HLA alleles and amino acids from MHC SNPs in East Asians, including Koreans.

  5. Assessing accuracy of genotype imputation in American Indians.

    Directory of Open Access Journals (Sweden)

    Alka Malhotra

    Genotype imputation is commonly used in genetic association studies to test untyped variants using information on linkage disequilibrium (LD) with typed markers. Imputing genotypes requires a suitable reference population in which the LD pattern is known, most often one selected from HapMap. However, some populations, such as American Indians, are not represented in HapMap. In the present study, we assessed accuracy of imputation using HapMap reference populations in a genome-wide association study in Pima Indians. Data from six randomly selected chromosomes were used. Genotypes in the study population were masked (either 1% or 20% of SNPs available for a given chromosome). The masked genotypes were then imputed using the software Markov Chain Haplotyping Algorithm. Using four HapMap reference populations, average genotype error rates ranged from 7.86% for Mexican Americans to 22.30% for Yoruba. In contrast, use of the original Pima Indian data as a reference resulted in an average error rate of 1.73%. Our results suggest that the use of HapMap reference populations results in substantial inaccuracy in the imputation of genotypes in American Indians. A possible solution would be to densely genotype or sequence a reference American Indian population.
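
    The mask-then-impute evaluation described above can be sketched in a few lines. This is an illustrative simplification, not the study's procedure: it treats genotypes as a flat list of 0/1/2 codes rather than masking whole SNPs per chromosome, and the function names and the `seed` parameter are hypothetical. The imputation step itself is left to an external tool.

```python
import random


def mask_genotypes(genotypes, fraction, seed=0):
    """Return (masked copy, masked indices); masked entries are set to None."""
    rng = random.Random(seed)
    n_mask = round(len(genotypes) * fraction)
    idx = sorted(rng.sample(range(len(genotypes)), n_mask))
    masked = list(genotypes)
    for i in idx:
        masked[i] = None
    return masked, idx


def genotype_error_rate(truth, imputed, idx):
    """Fraction of masked positions where the imputed genotype differs from the truth."""
    if not idx:
        raise ValueError("no masked positions to evaluate")
    errors = sum(truth[i] != imputed[i] for i in idx)
    return errors / len(idx)
```

    Only the masked positions enter the error rate, so unmasked genotypes that were never imputed cannot inflate the apparent accuracy.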

  6. Fine scale mapping of the 17q22 breast cancer locus using dense SNPs, genotyped within the Collaborative Oncological Gene-Environment Study (COGs).

    Science.gov (United States)

    Darabi, Hatef; Beesley, Jonathan; Droit, Arnaud; Kar, Siddhartha; Nord, Silje; Moradi Marjaneh, Mahdi; Soucy, Penny; Michailidou, Kyriaki; Ghoussaini, Maya; Fues Wahl, Hanna; Bolla, Manjeet K; Wang, Qin; Dennis, Joe; Alonso, M Rosario; Andrulis, Irene L; Anton-Culver, Hoda; Arndt, Volker; Beckmann, Matthias W; Benitez, Javier; Bogdanova, Natalia V; Bojesen, Stig E; Brauch, Hiltrud; Brenner, Hermann; Broeks, Annegien; Brüning, Thomas; Burwinkel, Barbara; Chang-Claude, Jenny; Choi, Ji-Yeob; Conroy, Don M; Couch, Fergus J; Cox, Angela; Cross, Simon S; Czene, Kamila; Devilee, Peter; Dörk, Thilo; Easton, Douglas F; Fasching, Peter A; Figueroa, Jonine; Fletcher, Olivia; Flyger, Henrik; Galle, Eva; García-Closas, Montserrat; Giles, Graham G; Goldberg, Mark S; González-Neira, Anna; Guénel, Pascal; Haiman, Christopher A; Hallberg, Emily; Hamann, Ute; Hartman, Mikael; Hollestelle, Antoinette; Hopper, John L; Ito, Hidemi; Jakubowska, Anna; Johnson, Nichola; Kang, Daehee; Khan, Sofia; Kosma, Veli-Matti; Kriege, Mieke; Kristensen, Vessela; Lambrechts, Diether; Le Marchand, Loic; Lee, Soo Chin; Lindblom, Annika; Lophatananon, Artitaya; Lubinski, Jan; Mannermaa, Arto; Manoukian, Siranoush; Margolin, Sara; Matsuo, Keitaro; Mayes, Rebecca; McKay, James; Meindl, Alfons; Milne, Roger L; Muir, Kenneth; Neuhausen, Susan L; Nevanlinna, Heli; Olswold, Curtis; Orr, Nick; Peterlongo, Paolo; Pita, Guillermo; Pylkäs, Katri; Rudolph, Anja; Sangrajrang, Suleeporn; Sawyer, Elinor J; Schmidt, Marjanka K; Schmutzler, Rita K; Seynaeve, Caroline; Shah, Mitul; Shen, Chen-Yang; Shu, Xiao-Ou; Southey, Melissa C; Stram, Daniel O; Surowy, Harald; Swerdlow, Anthony; Teo, Soo H; Tessier, Daniel C; Tomlinson, Ian; Torres, Diana; Truong, Thérèse; Vachon, Celine M; Vincent, Daniel; Winqvist, Robert; Wu, Anna H; Wu, Pei-Ei; Yip, Cheng Har; Zheng, Wei; Pharoah, Paul D P; Hall, Per; Edwards, Stacey L; Simard, Jacques; French, Juliet D; Chenevix-Trench, Georgia; Dunning, Alison M

    2016-09-07

    Genome-wide association studies have found SNPs at 17q22 to be associated with breast cancer risk. To identify potential causal variants related to breast cancer risk, we performed a high resolution fine-mapping analysis that involved genotyping 517 SNPs using a custom Illumina iSelect array (iCOGS) followed by imputation of genotypes for 3,134 SNPs in more than 89,000 participants of European ancestry from the Breast Cancer Association Consortium (BCAC). We identified 28 highly correlated common variants, in a 53 kb region spanning two introns of the STXBP4 gene, that are strong candidates for driving breast cancer risk (lead SNP rs2787486 (OR = 0.92; CI 0.90-0.94; P = 8.96 × 10^-15)) and are correlated with two previously reported risk-associated variants at this locus, SNPs rs6504950 (OR = 0.94, P = 2.04 × 10^-9, r^2 = 0.73 with lead SNP) and rs1156287 (OR = 0.93, P = 3.41 × 10^-11, r^2 = 0.83 with lead SNP). Analyses indicate only one causal SNP in the region and several enhancer elements targeting STXBP4 are located within the 53 kb association signal. Expression studies in breast tumor tissues found SNP rs2787486 to be associated with increased STXBP4 expression, suggesting this may be a target gene of this locus.

  7. Studies on interaction of colloidal silver nanoparticles (SNPs) with five different bacterial species.

    Science.gov (United States)

    Khan, S Sudheer; Mukherjee, Amitava; Chandrasekaran, N

    2011-10-01

    Silver nanoparticles (SNPs) are being increasingly used in many consumer products like textile fabrics, cosmetics, washing machines, food and drug products owing to their excellent antimicrobial properties. Here we have studied the adsorption and toxicity of SNPs on bacterial species such as Pseudomonas aeruginosa, Micrococcus luteus, Bacillus subtilis, Bacillus barbaricus and Klebsiella pneumoniae. The influence of zeta potential on the adsorption of SNPs on bacterial cell surface was investigated at acidic, neutral and alkaline pH and with varying salt (NaCl) concentrations (0.05, 0.1, 0.5, 1 and 1.5 M). The survival rate of bacterial species decreased with increase in adsorption of SNPs. Maximum adsorption and toxicity was observed at pH 5, and NaCl concentration of 0.5 M, thereby resulting in less toxicity. The zeta potential study suggests that the adsorption of SNPs on the cell surface was related to electrostatic force of attraction. The equilibrium and kinetics of the adsorption process were also studied. The adsorption equilibrium isotherms fitted well to the Langmuir model. The kinetics of adsorption fitted best to a pseudo-first-order model. These findings form a basis for interpreting the interaction of nanoparticles with environmental bacterial species. Copyright © 2011 Elsevier B.V. All rights reserved.
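
    For readers unfamiliar with the two models named in this abstract, the standard textbook forms of the Langmuir isotherm and pseudo-first-order (Lagergren) kinetics are sketched below. The parameter names (`q_max`, `k_l`, `q_eq`, `k1`) are conventional symbols chosen here for illustration; the abstract reports no fitted values, so none are given.

```python
import math


def langmuir_q(c_eq, q_max, k_l):
    """Langmuir isotherm: adsorbed amount at equilibrium concentration c_eq.

    q = q_max * k_l * c_eq / (1 + k_l * c_eq)
    """
    return q_max * k_l * c_eq / (1.0 + k_l * c_eq)


def pseudo_first_order_q(t, q_eq, k1):
    """Pseudo-first-order kinetics: adsorbed amount at time t.

    q(t) = q_eq * (1 - exp(-k1 * t))
    """
    return q_eq * (1.0 - math.exp(-k1 * t))
```

    Both curves saturate: the isotherm approaches `q_max` as concentration grows, and the kinetic curve approaches `q_eq` as time grows.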

  8. Genome-wide association study with 1000 genomes imputation identifies signals for nine sex hormone-related phenotypes.

    Science.gov (United States)

    Ruth, Katherine S; Campbell, Purdey J; Chew, Shelby; Lim, Ee Mun; Hadlow, Narelle; Stuckey, Bronwyn G A; Brown, Suzanne J; Feenstra, Bjarke; Joseph, John; Surdulescu, Gabriela L; Zheng, Hou Feng; Richards, J Brent; Murray, Anna; Spector, Tim D; Wilson, Scott G; Perry, John R B

    2016-02-01

    Genetic factors contribute strongly to sex hormone levels, yet knowledge of the regulatory mechanisms remains incomplete. Genome-wide association studies (GWAS) have identified only a small number of loci associated with sex hormone levels, with several reproductive hormones yet to be assessed. The aim of the study was to identify novel genetic variants contributing to the regulation of sex hormones. We performed GWAS using genotypes imputed from the 1000 Genomes reference panel. The study used genotype and phenotype data from a UK twin register. We included 2913 individuals (up to 294 males) from the Twins UK study, excluding individuals receiving hormone treatment. Phenotypes were standardised for age, sex, BMI, stage of menstrual cycle and menopausal status. We tested 7,879,351 autosomal SNPs for association with levels of dehydroepiandrosterone sulphate (DHEAS), oestradiol, free androgen index (FAI), follicle-stimulating hormone (FSH), luteinizing hormone (LH), prolactin, progesterone, sex hormone-binding globulin and testosterone. Eight independent genetic variants reached genome-wide significance (P < 5 × 10^-8), with minor allele frequencies of 1.3-23.9%. Novel signals included variants for progesterone (P = 7.68 × 10^-12), oestradiol (P = 1.63 × 10^-8) and FAI (P = 1.50 × 10^-8). A genetic variant near the FSHB gene was identified which influenced both FSH (P = 1.74 × 10^-8) and LH (P = 3.94 × 10^-9) levels. A separate locus on chromosome 7 was associated with both DHEAS (P = 1.82 × 10^-14) and progesterone (P = 6.09 × 10^-14). This study highlights loci that are relevant to reproductive function and suggests overlap in the genetic basis of hormone regulation.

  9. All SNPs are not created equal: Genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs

    NARCIS (Netherlands)

    Schork, A.J.; Thompson, W.K.; Pham, P.; Torkamani, A.; Roddey, J.C.; Sullivan, P.F.; Kelsoe, J.; O'Donovan, M.C.; Furberg, H.; Absher, D.; Agudo, A.; Almgren, P.; Ardissino, D.; Assimes, T.L.; Bandinelli, S.; Barzan, L.; Bencko, V.; Benhamou, S.; Benjamin, E.J.; Bernardinelli, L.; Bis, J.; Boehnke, M.; Boerwinkle, E.; Boomsma, D.I.; Brennan, P.; Canova, C.; Castellsagué, X.; Chanock, S.; Chasman, D.I.; Conway, D.I.; Dackor, J.; de Geus, E.J.C.; Duan, J.; Elosua, R.; Everett, B.; Fabianova, E.; Ferrucci, L.; Foretova, L.; Fortmann, S.P.; Franceschini, N.; Frayling, T.M.; Furberg, C.; Gejman, P.V.; Groop, L.; Gu, F.; Guralnik, J.; Hankinson, S.E.; Haritunians, T.; Healy, C.; Hofman, A.; Holcátová, I.; Hunter, D.J.; Hwang, S.J.; Ioannidis, J.P.A.; Iribarren, C.; Jackson, A.U.; Janout, V.; Kaprio, J.; Kim, Y.; Kjaerheim, K.; Knowles, J.W.; Kraft, P.; Ladenvall, C.; Lagiou, P.; Lanthrop, M.; Lerman, C.; Levinson, D.F.; Levy, D.; Li, M.D.; Lin, D.Y.; Lips, E.H.; Lissowska, J.; Lowry, R.B.; Lucas, G.; Macfarlane, T.V.; Maes, H.H.M.; Mannucci, P.M.; Mates, D.; Mauri, F.; McGovern, J.A.; McKay, J.D.; McKnight, B.; Melander, O.; Merlini, P.A.; Milaneschi, Y.; Mohlke, K.L.; O'Donnell, C.J.; Pare, G.; Penninx, B.W.J.H.; Perry, J.R.B.; Posthuma, D.; Preis, S.R.; Psaty, B.; Quertermous, T.; Ramachandran, V.S.; Richiardi, L.; Ridker, P.M.; Rose, J.; Rudnai, P.; Salomaa, V.; Sanders, A.R.; Schwartz, S.M.; Shi, J.; Smit, J.H.; Stringham, H.M.; Szeszenia-Dabrowska, N.; Tanaka, T.; Taylor, K.; Thacker, E.E.; Thornton, L.; Tiemeier, H.; Tuomilehto, J.; Uitterlinden, A.G.; van Duijn, C.M.; Vink, J.M.; Vogelzangs, N.; Voight, B.F.; Walter, S.; Willemsen, G.; Zaridze, D.; Znaor, A.; Akil, H.; Anjorin, A.; Backlund, L.; Badner, J.A.; Barchas, J.D.; Barrett, T.; Bass, N.; Bauer, M.; Bellivier, F.; Bergen, S.E.; Berrettini, W.; Blackwood, D.; Bloss, C.S.; Breen, G.; Breuer, R.; Bunner, W.E.; Burmeister, M.; Byerley, W. F.; Caesar, S.; Chambert, K.; Cichon, S.; St Clair, D.; Collier, D.A.; Corvin, A.; Coryell, W.H.; Craddock, N.; Craig, D.W.; Daly, M.; Day, R.; Degenhardt, F.; Djurovic, S.; Dudbridge, F.; Edenberg, H.J.; Elkin, A.; Etain, B.; Farmer, A.E.; Ferreira, M.A.; Ferrier, I.; Flickinger, M.; Foroud, T.; Frank, J.; Fraser, C.; Frisén, L.; Gershon, E.S.; Gill, M.; Gordon-Smith, K.; Green, E.K.; Greenwood, T.A.; Grozeva, D.; Guan, W.; Gurling, H.; Gustafsson, O.; Hamshere, M.L.; Hautzinger, M.; Herms, S.; Hipolito, M.; Holmans, P.A.; Hultman, C. M.; Jamain, S.; Jones, E.G.; Jones, I.; Jones, L.; Kandaswamy, R.; Kennedy, J.L.; Kirov, G. K.; Koller, D.L.; Kwan, P.; Landén, M.; Langstrom, N.; Lathrop, M.; Lawrence, J.; Lawson, W.B.; Leboyer, M.; Lee, P.H.; Li, J.; Lichtenstein, P.; Lin, D.; Liu, C.; Lohoff, F.W.; Lucae, S.; Mahon, P.B.; Maier, W.; Martin, N.G.; Mattheisen, M.; Matthews, K.; Mattingsdal, M.; McGhee, K.A.; McGuffin, P.; McInnis, M.G.; McIntosh, A.; McKinney, R.; McLean, A.W.; McMahon, F.J.; McQuillin, A.; Meier, S.; Melle, I.; Meng, F.; Mitchell, P.B.; Montgomery, G.W.; Moran, J.; Morken, G.; Morris, D.W.; Moskvina, V.; Muglia, P.; Mühleisen, T.W.; Muir, W.J.; Müller-Myhsok, B.; Myers, R.M.; Nievergelt, C.M.; Nikolov, I.; Nimgaonkar, V.L.; Nöthen, M.M.; Nurnberger, J.I.; Nwulia, E.A.; O'Dushlaine, C.; Osby, U.; Óskarsson, H.; Owen, M.J.; Petursson, H.; Pickard, B.S.; Porgeirsson, P.; Potash, J.B.; Propping, P.; Purcell, S.M.; Quinn, E.; Raychaudhuri, S.; Rice, J.; Rietschel, M.; Ruderfer, D.; Schalling, M.; Schatzberg, A.F.; Scheftner, W.A.; Schofield, P.R.; Schulze, T.G.; Schumacher, J.; Schwarz, M.M.; Scolnick, E.; Scott, L.J.; Shilling, P.D.; Sigurdsson, E.; Sklar, P.; Smith, E.N.; Stefansson, H.; Stefansson, K.; Steffens, M; Steinberg, S.; Strauss, J.; Strohmaier, J.; Szelinger, S.; Thompson, R.C.; Tozzi, F.; Treutlein, J.; Vincent, J.B.; Watson, S.J.; Wienker, T.F.; Williamson, R.; Witt, S.H.; Wright, A.; Xu, W.; Young, A.H.; Zandi, P.P.; Zhang, P.; Zöllner, S.; Agartz, I.; Albus, M.; Alexander, M.; Amdur, R. L.; Amin, F.; Bitter, I.; Black, D.W.; Børglum, A.D.; Brown, M.A.; Bruggeman, R.; Buccola, N.G.; Cahn, W.; Cantor, R.M.; Carr, V.J.; Catts, S. V.; Choudhury, K.; Cloninger, C. R.; Cormican, P.; Danoy, P. A.; Datta, S.; DeHert, M.; Demontis, D.; Dikeos, D.; Donnelly, P.; Donohoe, G.; Duong, L.; Dwyer, S.; Fanous, A.; Fink-Jensen, A.; Freedman, R.; Freimer, N.B.; Friedl, M.; Georgieva, L.; Giegling, I.; Glenthoj, B.; Godard, S.; Golimbet, V.; de Haan, L.; Hansen, M.; Hansen, T.; Hartmann, A.M.; Henskens, F. A.; Hougaard, D. M.; Ingason, A.; Jablensky, A. V.; Jakobsen, K.D.; Jay, M.; Jönsson, E.G.; Jürgens, G.; Kahn, R.S.; Keller, M.C.; Kendler, K.S.; Kenis, G.; Kenny, E.; Konnerth, H.; Konte, B.; Krabbendam, L.; Krasucki, R.; Lasseter, V. K.; Laurent, C.; Lencz, T.; Lerer, F. B.; Liang, K. Y.; Lieberman, J. A.; Linszen, D.H.; Lönnqvist, J.; Loughland, C. M.; Maclean, A. W.; Maher, B.S.; Malhotra, A.K.; Mallet, J.; Malloy, P.; McGrath, J. J.; McLean, D. E.; Michie, P. T.; Milanova, V.; Mors, O.; Mortensen, P.B.; Mowry, B. J.; Myin-Germeys, I.; Neale, B.; Nertney, D. A.; Nestadt, G.; Nielsen, J.; Nordentoft, M.; Norton, N.; O'Neill, F.; Olincy, A.; Olsen, L.; Ophoff, R.A.; Orntoft, T. F.; van Os, J.; Pantelis, C.; Papadimitriou, G.; Pato, C.N.; Peltonen, L.; Pickard, B.; Pietilainen, O.P.; Pimm, J.; Pulver, A. E.; Puri, V.; Quested, D.; Rasmussen, H.B.; Rethelyi, J.M.; Ribble, R.; Riley, B.P.; Rossin, L.; Ruggeri, M.; Rujescu, D.; Schall, U.; Schwab, S. G.; Scott, R.J.; Silverman, J.M.; Spencer, C. C.; Strange, A.; Strengman, E.; Stroup, T.S.; Suvisaari, J.; Terenius, L.; Thirumalai, S.; Timm, S.; Toncheva, D.; Tosato, S.; van den Oord, E.J.; Veldink, J.; Visscher, P.M.; Walsh, D.; Wang, A. G.; Werge, T.; Wiersma, D.; Wildenauer, D. B.; Williams, H.J.; Williams, N.M.; van Winkel, R.; Wormley, B.; Zammit, S.; Schork, N.J.; Andreassen, O.A.; Dale, A.M.

    2013-01-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery

  10. All SNPs Are Not Created Equal: Genome-Wide Association Studies Reveal a Consistent Pattern of Enrichment among Functionally Annotated SNPs

    NARCIS (Netherlands)

    Schork, Andrew J.; Thompson, Wesley K.; Pham, Phillip; Torkamani, Ali; Roddey, J. Cooper; Sullivan, Patrick F.; Kelsoe, John R.; O'Donovan, Michael C.; Furberg, Helena; Schork, Nicholas J.; Andreassen, Ole A.; Dale, Anders M.; Absher, Devin; Agudo, Antonio; Almgren, Peter; Ardissino, Diego; Assimes, Themistocles L.; Bandinelli, Stephania; Barzan, Luigi; Bencko, Vladimir; Benhamou, Simone; Benjamin, Emelia J.; Bernardinelli, Luisa; Bis, Joshua; Boehnke, Michael; Boerwinkle, Eric; Boomsma, Dorret I.; Brennan, Paul; Canova, Cristina; Castellsagué, Xavier; Chanock, Stephen; Chasman, Daniel; Conway, David I.; Dackor, Jennifer; de Geus, Eco J. C.; Duan, Jubao; Elosua, Roberto; Everett, Brendan; Fabianova, Eleonora; Ferrucci, Luigi; Foretova, Lenka; Fortmann, Stephen P.; Franceschini, Nora; Frayling, Timothy; Furberg, Curt; Gejman, Pablo V.; Groop, Leif; Gu, Fangyi; de Haan, Lieuwe; Linszen, Don H.

    2013-01-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery

  11. Randomly and Non-Randomly Missing Renal Function Data in the Strong Heart Study: A Comparison of Imputation Methods.

    Directory of Open Access Journals (Sweden)

    Nawar Shara

    Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989-1991), 2 (1993-1995), and 3 (1998-1999) was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results.

  12. Biomarker Detection in Association Studies: Modeling SNPs Simultaneously via Logistic ANOVA

    KAUST Repository

    Jung, Yoonsuh; Huang, Jianhua Z.; Hu, Jianhua

    2014-01-01

    In genome-wide association studies, the primary task is to detect biomarkers in the form of Single Nucleotide Polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and some other important clinical/environmental factors. However, the extremely large number of SNPs relative to the sample size inhibits application of classical methods such as multiple logistic regression. Currently the most commonly used approach is still to analyze one SNP at a time. In this paper, we propose to consider the genotypes of the SNPs simultaneously via a logistic analysis of variance (ANOVA) model, which expresses the logit transformed mean of SNP genotypes as the summation of the SNP effects, effects of the disease phenotype and/or other clinical variables, and the interaction effects. We use a reduced-rank representation of the interaction-effect matrix for dimensionality reduction, and employ the L1-penalty in a penalized likelihood framework to filter out the SNPs that have no associations. We develop a Majorization-Minimization algorithm for computational implementation. In addition, we propose a modified BIC criterion to select the penalty parameters and determine the rank number. The proposed method is applied to a Multiple Sclerosis data set and simulated data sets and shows promise in biomarker detection.

  13. Biomarker Detection in Association Studies: Modeling SNPs Simultaneously via Logistic ANOVA

    KAUST Repository

    Jung, Yoonsuh

    2014-10-02

    In genome-wide association studies, the primary task is to detect biomarkers in the form of Single Nucleotide Polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and some other important clinical/environmental factors. However, the extremely large number of SNPs relative to the sample size inhibits application of classical methods such as multiple logistic regression. Currently the most commonly used approach is still to analyze one SNP at a time. In this paper, we propose to consider the genotypes of the SNPs simultaneously via a logistic analysis of variance (ANOVA) model, which expresses the logit transformed mean of SNP genotypes as the summation of the SNP effects, effects of the disease phenotype and/or other clinical variables, and the interaction effects. We use a reduced-rank representation of the interaction-effect matrix for dimensionality reduction, and employ the L1-penalty in a penalized likelihood framework to filter out the SNPs that have no associations. We develop a Majorization-Minimization algorithm for computational implementation. In addition, we propose a modified BIC criterion to select the penalty parameters and determine the rank number. The proposed method is applied to a Multiple Sclerosis data set and simulated data sets and shows promise in biomarker detection.

  14. All SNPs are not created equal: genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs

    DEFF Research Database (Denmark)

    Schork, Andrew J; Thompson, Wesley K; Pham, Phillip

    2013-01-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False...... Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate...... in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate...
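
    The general idea of a stratified false discovery rate can be illustrated by applying the Benjamini-Hochberg adjustment separately within each annotation category. This is only a sketch under that assumption: the study itself uses linkage disequilibrium-weighted annotations and a True Discovery Rate estimate, neither of which is reproduced here, and the function names and stratum labels are hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(pvalues, strata):
    """Apply Benjamini-Hochberg separately within each stratum (annotation category)."""
    members = defaultdict(list)
    for i, stratum in enumerate(strata):
        members[stratum].append(i)
    out = [None] * len(pvalues)
    for idx in members.values():
        for i, q in zip(idx, bh_qvalues([pvalues[i] for i in idx])):
            out[i] = q
    return out
```

    In an enriched stratum the same nominal p-value receives a smaller adjusted value than it would in a pooled analysis, which is the intuition behind the increased discovery rate described in the abstract.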

  15. Meta-analysis of genome-wide studies identifies WNT16 and ESR1 SNPs associated with bone mineral density in premenopausal women.

    Science.gov (United States)

    Koller, Daniel L; Zheng, Hou-Feng; Karasik, David; Yerges-Armstrong, Laura; Liu, Ching-Ti; McGuigan, Fiona; Kemp, John P; Giroux, Sylvie; Lai, Dongbing; Edenberg, Howard J; Peacock, Munro; Czerwinski, Stefan A; Choh, Audrey C; McMahon, George; St Pourcain, Beate; Timpson, Nicholas J; Lawlor, Debbie A; Evans, David M; Towne, Bradford; Blangero, John; Carless, Melanie A; Kammerer, Candace; Goltzman, David; Kovacs, Christopher S; Prior, Jerilynn C; Spector, Tim D; Rousseau, Francois; Tobias, Jon H; Akesson, Kristina; Econs, Michael J; Mitchell, Braxton D; Richards, J Brent; Kiel, Douglas P; Foroud, Tatiana

    2013-03-01

    Previous genome-wide association studies (GWAS) have identified common variants in genes associated with variation in bone mineral density (BMD), although most have been carried out in combined samples of older women and men. Meta-analyses of these results have identified numerous single-nucleotide polymorphisms (SNPs) of modest effect at genome-wide significance levels in genes involved in both bone formation and resorption, as well as other pathways. We performed a meta-analysis restricted to premenopausal white women from four cohorts (n = 4061 women, aged 20 to 45 years) to identify genes influencing peak bone mass at the lumbar spine and femoral neck. After imputation, age- and weight-adjusted bone-mineral density (BMD) values were tested for association with each SNP. Association of an SNP in the WNT16 gene (rs3801387; p = 1.7 × 10^-9) and multiple SNPs in the ESR1/C6orf97 region (rs4870044; p = 1.3 × 10^-8) achieved genome-wide significance levels for lumbar spine BMD. These SNPs, along with others demonstrating suggestive evidence of association, were then tested for association in seven replication cohorts that included premenopausal women of European, Hispanic-American, and African-American descent (combined n = 5597 for femoral neck; n = 4744 for lumbar spine). When the data from the discovery and replication cohorts were analyzed jointly, the evidence was more significant (WNT16 joint p = 1.3 × 10^-11; ESR1/C6orf97 joint p = 1.4 × 10^-10). Multiple independent association signals were observed with spine BMD at the ESR1 region after conditioning on the primary signal. Analyses of femoral neck BMD also supported association with SNPs in WNT16 and ESR1/C6orf97 (p < 1 × 10^-5). Our results confirm that several of the genes contributing to BMD variation across a broad age range in both sexes have effects of similar magnitude on BMD of the spine in premenopausal women. These data support the hypothesis that variants in these genes of known skeletal function also affect BMD during the premenopausal period. Copyright © 2013 American Society for Bone and Mineral Research.

  16. META-ANALYSIS OF GENOME-WIDE STUDIES IDENTIFIES WNT16 AND ESR1 SNPS ASSOCIATED WITH BONE MINERAL DENSITY IN PREMENOPAUSAL WOMEN

    Science.gov (United States)

    Koller, Daniel L.; Zheng, Hou-Feng; Karasik, David; Yerges-Armstrong, Laura; Liu, Ching-Ti; McGuigan, Fiona; Kemp, John P.; Giroux, Sylvie; Lai, Dongbing; Edenberg, Howard J.; Peacock, Munro; Czerwinski, Stefan A.; Choh, Audrey C.; McMahon, George; St Pourcain, Beate; Timpson, Nicholas J.; Lawlor, Debbie A; Evans, David M; Towne, Bradford; Blangero, John; Carless, Melanie A.; Kammerer, Candace; Goltzman, David; Kovacs, Christopher S.; Prior, Jerilynn C.; Spector, Tim D.; Rousseau, Francois; Tobias, Jon H.; Akesson, Kristina; Econs, Michael J.; Mitchell, Braxton D.; Richards, J. Brent; Kiel, Douglas P.; Foroud, Tatiana

    2013-01-01

    Previous genome-wide association studies (GWAS) have identified common variants in genes associated with variation in bone mineral density (BMD), although most have been carried out in combined samples of older women and men. Meta-analyses of these results have identified numerous SNPs of modest effect at genome-wide significance levels in genes involved in both bone formation and resorption, as well as other pathways. We performed a meta-analysis restricted to premenopausal white women from four cohorts (n = 4,061 women, ages 20 to 45) to identify genes influencing peak bone mass at the lumbar spine and femoral neck. Following imputation, age- and weight-adjusted BMD values were tested for association with each SNP. Association of an SNP in the WNT16 gene (rs3801387; p = 1.7 × 10^-9) and multiple SNPs in the ESR1/C6orf97 region (rs4870044; p = 1.3 × 10^-8) achieved genome-wide significance levels for lumbar spine BMD. These SNPs, along with others demonstrating suggestive evidence of association, were then tested for association in seven Replication cohorts that included premenopausal women of European, Hispanic-American, and African-American descent (combined n = 5,597 for femoral neck; 4,744 for lumbar spine). When the data from the Discovery and Replication cohorts were analyzed jointly, the evidence was more significant (WNT16 joint p = 1.3 × 10^-11; ESR1/C6orf97 joint p = 1.4 × 10^-10). Multiple independent association signals were observed with spine BMD at the ESR1 region after conditioning on the primary signal. Analyses of femoral neck BMD also supported association with SNPs in WNT16 and ESR1/C6orf97 (p < 1 × 10^-5). Our results confirm that several of the genes contributing to BMD variation across a broad age range in both sexes have effects of similar magnitude on BMD of the spine in premenopausal women. These data support the hypothesis that variants in these genes of known skeletal function also affect BMD during the premenopausal period. PMID:23074152

  17. Evaluating geographic imputation approaches for zip code level data: an application to a study of pediatric diabetes

    Directory of Open Access Journals (Sweden)

    Puett Robin C

    2009-10-01

    Background: There is increasing interest in the study of place effects on health, facilitated in part by geographic information systems. Incomplete or missing address information reduces geocoding success. Several geographic imputation methods have been suggested to overcome this limitation. Accuracy evaluation of these methods can be focused at the level of individuals and at higher group levels (e.g., spatial distribution). Methods: We evaluated the accuracy of eight geo-imputation methods for address allocation from ZIP codes to census tracts at the individual and group level. The spatial apportioning approaches underlying the imputation methods included four fixed (deterministic) and four random (stochastic) allocation methods using land area, total population, population under age 20, and race/ethnicity as weighting factors. Data included more than 2,000 geocoded cases of diabetes mellitus among youth aged 0-19 in four U.S. regions. The imputed distribution of cases across tracts was compared to the true distribution using a chi-squared statistic. Results: At the individual level, population-weighted (total or under age 20) fixed allocation showed the greatest level of accuracy, with correct census tract assignments averaging 30.01% across all regions, followed by the race/ethnicity-weighted random method (23.83%). In the true distribution of cases across census tracts, 58.2% of tracts exhibited no cases, 26.2% had one case, 9.5% had two cases, and less than 3% had three or more. This distribution was best captured by random allocation methods, with no significant differences (p-value > 0.90). However, significant differences in distributions based on fixed allocation methods were found (p-value ...). Conclusion: Fixed imputation methods seemed to yield greatest accuracy at the individual level, suggesting use for studies on area-level environmental exposures. Fixed methods result in artificial clusters in single census tracts. For studies ...
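
    The contrast between fixed and random allocation in the record above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the exact allocation rules, so "fixed" is taken here to mean "always pick the tract with the largest weight" and "random" to mean "draw a tract with probability proportional to its weight". The tract names and population weights are hypothetical.

```python
import random

def fixed_allocation(tract_weights):
    """Deterministic sketch: always choose the tract with the largest weight."""
    return max(tract_weights, key=tract_weights.get)

def random_allocation(tract_weights, rng):
    """Stochastic sketch: draw one tract with probability proportional to its weight."""
    tracts = list(tract_weights)
    weights = [tract_weights[t] for t in tracts]
    return rng.choices(tracts, weights=weights, k=1)[0]

# Hypothetical ZIP code overlapping three census tracts, weighted by total population.
weights = {"tract_A": 500, "tract_B": 1500, "tract_C": 1000}

rng = random.Random(0)  # seeded so the sketch is reproducible
fixed = [fixed_allocation(weights) for _ in range(3000)]
drawn = [random_allocation(weights, rng) for _ in range(3000)]

# Fixed allocation piles every case into a single tract; random allocation spreads them.
print({t: fixed.count(t) for t in weights})
print({t: drawn.count(t) for t in weights})
```

    Under these assumptions the fixed rule places all cases in one tract, which is consistent with the abstract's remark that fixed methods produce artificial clusters, while the random rule approximates the weighting distribution across tracts.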

  18. ParaHaplo 3.0: A program package for imputation and a haplotype-based whole-genome association study using hybrid parallel computing

    Directory of Open Access Journals (Sweden)

    Kamatani Naoyuki

    2011-05-01

    Background: Missing-genotype imputation and haplotype reconstruction are valuable in genome-wide association studies (GWASs). By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and used for GWASs. Since millions of single nucleotide polymorphisms need to be imputed in a GWAS, faster methods for genotype imputation and haplotype reconstruction are required. Results: We developed a program package for parallel computation of genotype imputation and haplotype reconstruction. Our program package, ParaHaplo 3.0, is intended for use in workstation clusters using the Intel Message Passing Interface. We compared the performance of ParaHaplo 3.0 on the Japanese in Tokyo, Japan, and Han Chinese in Beijing, China, samples of the HapMap dataset. A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo. Conclusions: ParaHaplo 3.0 is an invaluable tool for conducting haplotype-based GWASs. The need for faster genotype imputation and haplotype reconstruction using parallel computing will become increasingly important as the data sizes of such projects continue to increase. ParaHaplo executable binaries and program sources are available at http://en.sourceforge.jp/projects/parallelgwas/releases/.

  19. Missing data imputation: focusing on single imputation.

    Science.gov (United States)

    Zhang, Zhongheng

    2016-01-01

    Complete case analysis is widely used for handling missing data, and it is the default method in many statistical packages. However, this method may introduce bias, and some useful information will be omitted from the analysis. Many imputation methods have therefore been developed to close this gap. The present article focuses on single imputation. Imputations with the mean, median and mode are simple but, like complete case analysis, can introduce bias in the mean and deviation. Furthermore, they ignore relationships with other variables. Regression imputation can preserve the relationship between missing values and other variables. Many sophisticated methods exist to handle missing values in longitudinal data. This article focuses primarily on how to implement R code to perform single imputation, while avoiding complex mathematical calculations.
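
    The difference between constant-value and regression imputation described above can be made concrete with a small sketch. The article itself works in R; the Python below is an independent illustration, and the function names and toy data are hypothetical.

```python
from statistics import mean, median, mode

def impute_constant(values, how="mean"):
    """Replace every missing value (None) with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[how](observed)
    return [fill if v is None else v for v in values]

def impute_regression(x, y):
    """Replace missing y values with predictions from a least-squares line fitted on complete pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar = mean(xi for xi, _ in pairs)
    y_bar = mean(yi for _, yi in pairs)
    sxx = sum((xi - x_bar) ** 2 for xi, _ in pairs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

# Toy data: y rises with x, and the last y value is missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, None]

print(impute_constant(y, "mean"))   # fills with the observed mean, ignoring x
print(impute_regression(x, y))      # fills with the value predicted from x
```

    In this toy case mean imputation fills the gap with 5.0, pulling the value toward the center of the observed data, whereas the regression fill follows the trend in x. This is the sense in which regression imputation preserves the relationship with other variables.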

  20. Association study of FOXO3A SNPs and aging phenotypes in Danish oldest-old individuals

    DEFF Research Database (Denmark)

    Soerensen, Mette; Nygaard, Marianne; Dato, Serena

    2015-01-01

    FOXO3A variation has repeatedly been reported to associate with human longevity, yet only few studies have investigated whether FOXO3A variation also associates with aging-related traits. Here, we investigate the association of 15 FOXO3A tagging single nucleotide polymorphisms (SNPs) in 1088 oldest-old Danes (age 92-93) with 4 phenotypes known to predict their survival: cognitive function, hand grip strength, activity of daily living (ADL), and self-rated health. Based on previous studies in humans and foxo animal models, we also explore self-reported diabetes, cancer, cardiovascular disease ... borderline significance (P = 0.054), while ADL did not (P = 0.396). Although the single-SNP associations did not formally replicate in another study population of oldest-old Danes (n = 1279, age 94-100), the estimates were of similar direction of effect as observed in the Discovery sample. A pooled analysis ...

  1. A genome-wide study of common SNPs and CNVs in cognitive performance in the CANTAB

    Science.gov (United States)

    Need, Anna C.; Attix, Deborah K.; McEvoy, Jill M.; Cirulli, Elizabeth T.; Linney, Kristen L.; Hunt, Priscilla; Ge, Dongliang; Heinzen, Erin L.; Maia, Jessica M.; Shianna, Kevin V.; Weale, Michael E.; Cherkas, Lynn F.; Clement, Gail; Spector, Tim D.; Gibson, Greg; Goldstein, David B.

    2009-01-01

    Psychiatric disorders such as schizophrenia are commonly accompanied by cognitive impairments that are treatment resistant and crucial to functional outcome. There has been great interest in studying cognitive measures as endophenotypes for psychiatric disorders, with the hope that their genetic basis will be clearer. To investigate this, we performed a genome-wide association study involving 11 cognitive phenotypes from the Cambridge Neuropsychological Test Automated Battery. We showed these measures to be heritable by comparing the correlation in 100 monozygotic and 100 dizygotic twin pairs. The full battery was tested in ∼750 subjects, and for spatial and verbal recognition memory, we investigated a further 500 individuals to search for smaller genetic effects. We were unable to find any genome-wide significant associations with either SNPs or common copy number variants. Nor could we formally replicate any polymorphism that has been previously associated with cognition, although we found a weak signal of lower than expected P-values for variants in a set of 10 candidate genes. We additionally investigated SNPs in genomic loci that have been shown to harbor rare variants that associate with neuropsychiatric disorders, to see if they showed any suggestion of association when considered as a separate set. Only NRXN1 showed evidence of significant association with cognition. These results suggest that common genetic variation does not strongly influence cognition in healthy subjects and that cognitive measures do not represent a more tractable genetic trait than clinical endpoints such as schizophrenia. We discuss a possible role for rare variation in cognitive genomics. PMID:19734545

  2. ICSNPathway: identify candidate causal SNPs and pathways from genome-wide association study by one analytical framework.

    Science.gov (United States)

    Zhang, Kunlin; Chang, Suhua; Cui, Sijia; Guo, Liyuan; Zhang, Liuyan; Wang, Jing

    2011-07-01

    Genome-wide association studies (GWAS) are widely used to identify genes involved in human complex diseases or other traits. One key challenge in GWAS data interpretation is to identify causal SNPs and provide solid evidence on how they affect the trait. Current research focuses on identifying candidate causal variants among the most significant SNPs of a GWAS, but support for the biological mechanisms, as represented by pathways, is lacking. Although pathway-based analysis (PBA) has been designed to identify disease-related pathways by analyzing the full list of SNPs from a GWAS, it does not emphasize interpreting causal SNPs. To our knowledge, no web server is yet available that addresses this challenge of GWAS data interpretation within one analytical framework. ICSNPathway was developed to identify candidate causal SNPs and their corresponding candidate causal pathways from GWAS by integrating linkage disequilibrium (LD) analysis, functional SNP annotation and PBA. ICSNPathway provides a feasible solution to bridge the gap between GWAS and the study of disease mechanisms by generating hypotheses of the form SNP → gene → pathway(s). The ICSNPathway server is freely available at http://icsnpathway.psych.ac.cn/.

  3. SNPs in Multi-Species Conserved Sequences (MCS) as useful markers in association studies: a practical approach

    Directory of Open Access Journals (Sweden)

    Pericak-Vance Margaret A

    2007-08-01

    Background: Although genes play a key role in many complex diseases, the specific genes involved in most complex diseases remain largely unidentified. Their discovery will hinge on the identification of key sequence variants that are conclusively associated with disease. While much attention has been focused on variants in protein-coding DNA, variants in noncoding regions may also play many important roles in complex disease by altering gene regulation. Since the vast majority of noncoding genomic sequence is of unknown function, this increases the challenge of identifying "functional" variants that cause disease. However, evolutionary conservation can be used as a guide to indicate regions of noncoding or coding DNA that are likely to have biological function, and thus may be more likely to harbor SNP variants with functional consequences. To help bias marker selection in favor of such variants, we devised a process that prioritizes annotated SNPs for genotyping studies based on their location within Multi-species Conserved Sequences (MCSs) and used this process to select SNPs in a region of linkage to a complex disease. This allowed us to evaluate the utility of the chosen SNPs for further association studies. Previously, a region of chromosome 1q43 was linked to Multiple Sclerosis (MS) in a genome-wide screen. We chose annotated SNPs in the region based on location within MCSs (termed MCS-SNPs). We then obtained genotypes for 478 MCS-SNPs in 989 individuals from MS families. Results: Analysis of our MCS-SNP genotypes from the 1q43 region and comparison to HapMap data confirmed that annotated SNPs in MCS regions are frequently polymorphic and show subtle signatures of selective pressure, consistent with previous reports of genome-wide variation in conserved regions. We also present an online tool that allows MCS data to be directly exported to the UCSC genome browser so that MCS-SNPs can be easily identified within genomic regions of ...

  4. Japan PGx Data Science Consortium Database: SNPs and HLA genotype data from 2994 Japanese healthy individuals for pharmacogenomics studies.

    Science.gov (United States)

    Kamitsuji, Shigeo; Matsuda, Takashi; Nishimura, Koichi; Endo, Seiko; Wada, Chisa; Watanabe, Kenji; Hasegawa, Koichi; Hishigaki, Haretsugu; Masuda, Masatoshi; Kuwahara, Yusuke; Tsuritani, Katsuki; Sugiura, Kenkichi; Kubota, Tomoko; Miyoshi, Shinji; Okada, Kinya; Nakazono, Kazuyuki; Sugaya, Yuki; Yang, Woosung; Sawamoto, Taiji; Uchida, Wataru; Shinagawa, Akira; Fujiwara, Tsutomu; Yamada, Hisaharu; Suematsu, Koji; Tsutsui, Naohisa; Kamatani, Naoyuki; Liou, Shyh-Yuh

    2015-06-01

    The Japan Pharmacogenomics Data Science Consortium (JPDSC) has assembled a database for conducting pharmacogenomics (PGx) studies in Japanese subjects. The database contains the genotypes of 2.5 million single-nucleotide polymorphisms (SNPs) and 5 human leukocyte antigen loci from 2994 Japanese healthy volunteers, as well as 121 kinds of clinical information, including self-reports, physiological data, hematological data and biochemical data. In this article, the reliability of our data was evaluated by principal component analysis (PCA) and association analysis for hematological and biochemical traits by using genome-wide SNP data. PCA of the SNPs showed that all the samples were collected from the Japanese population and that the samples were separated into two major clusters by birthplace, Okinawa and other than Okinawa, as had been previously reported. Among 87 SNPs that have been reported to be associated with 18 hematological and biochemical traits in genome-wide association studies (GWAS), the associations of 56 SNPs were replicated using our database. Statistical power simulations showed that the sample size of the JPDSC control database is large enough to detect genetic markers having a relatively strong association even when the case sample size is small. The JPDSC database will be useful as control data for conducting PGx studies to explore genetic markers to improve the safety and efficacy of drugs either during clinical development or in post-marketing.

  5. In silico screening, genotyping, molecular dynamics simulation and activity studies of SNPs in pyruvate kinase M2.

    Directory of Open Access Journals (Sweden)

    Ponnusamy Kalaiarasan

    The roles of 29 non-synonymous, 15 intronic, and 3 near-UTR single nucleotide polymorphisms (SNPs) and 2 mutations of human pyruvate kinase (PK) M2 were investigated by in silico and in vitro functional studies. Prediction of deleterious substitutions with sequence-homology-based and structure-based servers (SIFT, PANTHER, SNPs&GO, PhD-SNP, SNAP and PolyPhen) showed that 19% were common to all of these programs. SNPeffect and HOPE indicated three substitutions (C31F, Q310P and S437Y) to be deleterious and functionally important in silico. In vitro activity assays showed that the C31F and S437Y variants of PKM2 had reduced activity, while the Q310P variant was catalytically inactive. The allosteric activation due to binding of fructose 1,6-bisphosphate (FBP) was compromised in the S437Y nsSNP variant protein. This was corroborated by a molecular dynamics (MD) simulation study, which was also carried out for the other two variant proteins. The 5 intronic SNPs of PKM2 associated with sporadic breast cancer in a case-control study, when subjected to different computational analyses, indicated that 3 SNPs (rs2856929, rs8192381 and rs8192431) could generate an alternative transcript by influencing splicing factor binding to PKM2. We propose that these potentially functional and important variations, both within exons and introns, could have a bearing on cancer metabolism, since PKM2 has been implicated in cancer in the recent past.

  6. SCREENING LOW FREQUENCY SNPS FROM GENOME WIDE ASSOCIATION STUDY REVEALS A NEW RISK ALLELE FOR PROGRESSION TO AIDS

    Science.gov (United States)

    Le Clerc, Sigrid; Coulonges, Cédric; Delaneau, Olivier; Van Manen, Danielle; Herbeck, Joshua T.; Limou, Sophie; An, Ping; Martinson, Jeremy J.; Spadoni, Jean-Louis; Therwath, Amu; Veldink, Jan H.; van den Berg, Leonard H.; Taing, Lieng; Labib, Taoufik; Mellak, Safa; Montes, Matthieu; Delfraissy, Jean-François; Schächter, François; Winkler, Cheryl; Froguel, Philippe; Mullins, James I.; Schuitemaker, Hanneke; Zagury, Jean-François

    2011-01-01

    Background: Seven genome-wide association studies (GWAS) have been published in AIDS, and only associations in the HLA region on chromosome 6 and CXCR6 have passed genome-wide significance. Methods: We reanalyzed the data from three previously published GWAS, targeting specifically low frequency SNPs (minor allele frequency (MAF) < 5%). Two groups composed of 365 slow progressors (SP) and 147 rapid progressors (RP) from Europe and the US were compared with a control group of 1394 seronegative individuals using Eigenstrat corrections. Results: Of the 8584 SNPs with MAF < 5% in cases and controls (Bonferroni threshold = 5.8×10^-6), four SNPs showed statistical evidence of association with the SP phenotype. The best result was for HCP5 rs2395029 (p = 8.54×10^-15, OR = 3.41) in the HLA locus, in partial linkage disequilibrium with two additional chromosome 6 associations in C6orf48 (p = 3.03×10^-10, OR = 2.9) and NOTCH4 (p = 9.08×10^-7, OR = 2.32). The fourth association corresponded to rs2072255 located in RICH2 (p = 3.30×10^-6, OR = 0.43) in chromosome 17. Using HCP5 rs2395029 as a covariate, the C6orf48 and NOTCH4 signals disappeared, but the RICH2 signal still remained significant. Conclusion: Besides the already known chromosome 6 associations, the analysis of low frequency SNPs brought up a new association in the RICH2 gene. Interestingly, RICH2 interacts with BST-2, known to be a major restriction factor for HIV-1 infection. Our study has thus identified a new candidate gene for AIDS molecular etiology and confirms the interest of singling out low frequency SNPs in order to exploit GWAS data. PMID:21107268
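
    The Bonferroni threshold quoted in this record can be reproduced with a one-line calculation. The abstract gives only the number of SNPs tested (8584) and the resulting threshold; the family-wise error rate of 0.05 used below is an assumption, chosen because it is the conventional value and is consistent with the reported figure.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# alpha = 0.05 is an assumption; 8584 is the number of low-frequency SNPs in the abstract.
threshold = bonferroni_threshold(0.05, 8584)
print(f"{threshold:.1e}")  # 5.8e-06

# p-values reported in the abstract for the four associated SNPs.
reported = {
    "HCP5 rs2395029": 8.54e-15,
    "C6orf48": 3.03e-10,
    "NOTCH4": 9.08e-07,
    "RICH2 rs2072255": 3.30e-06,
}
for name, p in reported.items():
    print(name, p < threshold)
```

    Under that assumption all four reported p-values fall below the threshold, matching the abstract's statement that four SNPs showed statistical evidence of association.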

  7. Fine scale mapping of the 17q22 breast cancer locus using dense SNPs, genotyped within the Collaborative Oncological Gene-Environment Study (COGs)

    NARCIS (Netherlands)

    H. Darabi (Hatef); J. Beesley (Jonathan); A. Droit (Arnaud); S. Kar (Siddhartha); S. Nord (Silje); M.M. Marjaneh (Mahdi Moradi); Soucy, P. (Penny); K. Michailidou (Kyriaki); M. Ghoussaini (Maya); Wahl, H.F. (Hanna Fues); M.K. Bolla (Manjeet K.); Wang, Q. (Qin); J. Dennis (Joe); M.R. Alonso (Rosario); I.L. Andrulis (Irene); H. Anton-Culver (Hoda); Arndt, V. (Volker); M.W. Beckmann (Matthias); J. Benítez (Javier); N.V. Bogdanova (Natalia); S.E. Bojesen (Stig); H. Brauch (Hiltrud); H. Brenner (Hermann); A. Broeks (Annegien); T. Brüning (Thomas); B. Burwinkel (Barbara); J. Chang-Claude (Jenny); Choi, J.-Y. (Ji-Yeob); D. Conroy (Don); F.J. Couch (Fergus); A. Cox (Angela); S.S. Cross (Simon); K. Czene (Kamila); P. Devilee (Peter); T. Dörk (Thilo); D.F. Easton (Douglas F.); P.A. Fasching (Peter); J.D. Figueroa (Jonine); O. Fletcher (Olivia); H. Flyger (Henrik); Galle, E. (Eva); M. García-Closas (Montserrat); Giles, G.G. (Graham G.); M.S. Goldberg (Mark); A. González-Neira (Anna); P. Guénel (Pascal); C.A. Haiman (Christopher A.); Hallberg, E. (Emily); U. Hamann (Ute); J.M. Hartman (Joost); A. Hollestelle (Antoinette); J.L. Hopper (John); H. Ito (Hidemi); A. Jakubowska (Anna); Johnson, N. (Nichola); D. Kang (Daehee); S. Khan (Sofia); V-M. Kosma (Veli-Matti); Kriege, M. (Mieke); V. Kristensen (Vessela); Lambrechts, D. (Diether); L. Le Marchand (Loic); Lee, S.C. (Soo Chin); A. Lindblom (Annika); A. Lophatananon (Artitaya); J. Lubinski (Jan); A. Mannermaa (Arto); S. Manoukian (Siranoush); S. Margolin (Sara); K. Matsuo (Keitaro); Mayes, R. (Rebecca); McKay, J. (James); A. Meindl (Alfons); R.L. Milne (Roger); K.R. Muir (K.); S.L. Neuhausen (Susan); H. Nevanlinna (Heli); C. Olswold (Curtis); Orr, N. (Nick); P. Peterlongo (Paolo); G. Pita (Guillermo); K. Pykäs (Katri); Rudolph, A. (Anja); Sangrajrang, S. (Suleeporn); Sawyer, E.J. (Elinor J.); M.K. Schmidt (Marjanka); R.K. Schmutzler (Rita); C.M. Seynaeve (Caroline); Shah, M. (Mitul); C.-Y. Shen (Chen-Yang); X.-O. 
Shu (Xiao-Ou); M.C. Southey (Melissa); Stram, D.O. (Daniel O.); H. Surowy (Harald); A.J. Swerdlow (Anthony ); S.-H. Teo (Soo-Hwang); D.C. Tessier (Daniel C.); I.P. Tomlinson (Ian); D. Torres (Diana); T. Truong (Thérèse); C. Vachon (Celine); D. Vincent (Daniel); R. Winqvist (Robert); A.H. Wu (Anna); P.-E. Wu (Pei-Ei); C.H. Yip (Cheng Har); W. Zheng (Wei); P.D.P. Pharoah (Paul); P. Hall (Per); S.L. Edwards (Stacey); J. Simard (Jacques); J.D. French (Juliet); G. Chenevix-Trench (Georgia); A.M. Dunning (Alison)

    2016-01-01

    Genome-wide association studies have found SNPs at 17q22 to be associated with breast cancer risk. To identify potential causal variants related to breast cancer risk, we performed a high resolution fine-mapping analysis that involved genotyping 517 SNPs using a custom Illumina iSelect

  8. Improving the detection of pathways in genome-wide association studies by combined effects of SNPs from Linkage Disequilibrium blocks

    OpenAIRE

    Zhao, Huiying; Nyholt, Dale R.; Yang, Yuanhao; Wang, Jihua; Yang, Yuedong

    2017-01-01

    Genome-wide association studies (GWAS) have successfully identified single variants associated with diseases. To increase the power of GWAS, gene-based and pathway-based tests are commonly employed to detect more risk factors. However, the gene- and pathway-based association tests may be biased towards genes or pathways containing a large number of single-nucleotide polymorphisms (SNPs) with small P-values caused by high linkage disequilibrium (LD) correlations. To address such bias, numerous...

  9. Improving the detection of pathways in genome-wide association studies by combined effects of SNPs from Linkage Disequilibrium blocks.

    Science.gov (United States)

    Zhao, Huiying; Nyholt, Dale R; Yang, Yuanhao; Wang, Jihua; Yang, Yuedong

    2017-06-14

    Genome-wide association studies (GWAS) have successfully identified single variants associated with diseases. To increase the power of GWAS, gene-based and pathway-based tests are commonly employed to detect more risk factors. However, the gene- and pathway-based association tests may be biased towards genes or pathways containing a large number of single-nucleotide polymorphisms (SNPs) with small P-values caused by high linkage disequilibrium (LD) correlations. To address such bias, numerous pathway-based methods have been developed. Here we propose a novel method, DGAT-path, to divide all SNPs assigned to genes in each pathway into LD blocks, and to sum the chi-square statistics of LD blocks for assessing the significance of the pathway by permutation tests. The method proved robust, with a type I error rate more than 1.6 times lower than that of other methods. Meanwhile, the method displays a higher power and is not biased by the pathway size. The applications to the GWAS summary statistics for schizophrenia and breast cancer indicate that the detected top pathways contain more genes close to associated SNPs than those detected by other methods. As a result, the method identified 17 and 12 significant pathways containing 20 and 21 novel associated genes, respectively, for the two diseases. The method is available online at http://sparks-lab.org/server/DGAT-path.
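
    The general idea of summing block-level chi-square statistics and judging the sum by permutation can be sketched as follows. This is not the DGAT-path implementation: the abstract does not say how block statistics are computed or what is permuted, so the sketch simply assumes that per-block statistics are already available for the observed data and for each permutation. All names and numbers are hypothetical.

```python
def pathway_statistic(block_chi2):
    """Pathway score: the sum of the chi-square statistics of its LD blocks."""
    return sum(block_chi2)

def permutation_p_value(observed_blocks, permuted_blocks):
    """Empirical p-value: share of permutations whose score reaches the observed score.

    The +1 in numerator and denominator counts the observed data as one
    permutation, so the estimate can never be exactly zero.
    """
    observed = pathway_statistic(observed_blocks)
    exceed = sum(1 for perm in permuted_blocks if pathway_statistic(perm) >= observed)
    return (exceed + 1) / (len(permuted_blocks) + 1)

# Hypothetical pathway with three LD blocks and four permutations of the data.
observed_blocks = [5.0, 3.0, 2.0]            # score 10.0
permuted_blocks = [
    [1.0, 1.0, 1.0],                          # score 3.0
    [4.0, 4.0, 4.0],                          # score 12.0, exceeds the observed score
    [2.0, 2.0, 2.0],                          # score 6.0
    [0.5, 1.5, 3.0],                          # score 5.0
]

print(permutation_p_value(observed_blocks, permuted_blocks))  # 0.4
```

    Working per LD block rather than per SNP means a run of highly correlated SNPs contributes one term to the score instead of many, which is the bias the abstract says the method is meant to address. A real analysis would use far more than four permutations.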

  10. Improving accuracy of rare variant imputation with a two-step imputation approach

    DEFF Research Database (Denmark)

    Kreiner-Møller, Eskil; Medina-Gomez, Carolina; Uitterlinden, André G

    2015-01-01

    Genotype imputation has been the pillar of the success of genome-wide association studies (GWAS) for identifying common variants associated with common diseases. However, most GWAS have been run using only 60 HapMap samples as reference for imputation, meaning less frequent and rare variants not being comprehensively scrutinized. Next-generation arrays ensuring sufficient coverage together with new reference panels, as the 1000 Genomes panel, are emerging to facilitate imputation of low frequent single-nucleotide polymorphisms (minor allele frequency (MAF) ... reference sample genotyped on a dense array and hereafter to the 1000 Genomes reference panel. We show that mean imputation quality, measured by the r^2 using this approach, increases by 28% for variants with a MAF between 1 and 5% as compared with direct imputation to 1000 Genomes reference. Similarly ...

  11. Assessing and comparison of different machine learning methods in parent-offspring trios for genotype imputation.

    Science.gov (United States)

    Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi

    2016-06-21

    Genotype imputation is an important tool for prediction of unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available; they can either employ universal machine learning methods or deploy algorithms dedicated to inferring missing genotypes. In this research, the performance of eight machine learning methods (Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine, Radial Basis Function, Random Forest, AdaBoost, LogitBoost, and TotalBoost) was compared in terms of imputation accuracy, computation time and the factors affecting imputation accuracy. The methods were applied to real and simulated datasets to impute the un-typed SNPs in parent-offspring trios. The tested methods show that imputation of parent-offspring trios can be accurate. The Random Forest and Support Vector Machine were more accurate than the other machine learning methods. The TotalBoost performed slightly worse than the other methods. The running times differed between methods. The ELM was always the fastest algorithm. As the sample size increases, the RBF requires a long imputation time. The methods tested in this research can be an alternative for imputation of un-typed SNPs in data with a low missing rate. However, it is recommended that other machine learning methods be used for imputation. Copyright © 2016 Elsevier Ltd. All rights reserved.

  12. Candidate gene analysis using imputed genotypes: cell cycle single-nucleotide polymorphisms and ovarian cancer risk

    DEFF Research Database (Denmark)

    Goode, Ellen L; Fridley, Brooke L; Vierkant, Robert A

    2009-01-01

    Polymorphisms in genes critical to cell cycle control are outstanding candidates for association with ovarian cancer risk; numerous genes have been interrogated by multiple research groups using differing tagging single-nucleotide polymorphism (SNP) sets. To maximize information gleaned from......, and rs3212891; CDK2 rs2069391, rs2069414, and rs17528736; and CCNE1 rs3218036. These results exemplify the utility of imputation in candidate gene studies and lend evidence to a role of cell cycle genes in ovarian cancer etiology, suggest a reduced set of SNPs to target in additional cases and controls....

  13. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

    Directory of Open Access Journals (Sweden)

    Lotz Meredith J

    2008-01-01

    Background: Gene expression data frequently contain missing values; however, most down-stream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. Results: We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrixes and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Conclusion: Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA ...

  14. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes.

    Science.gov (United States)

    Brock, Guy N; Shaffer, John R; Blakesley, Richard E; Lotz, Meredith J; Tseng, George C

    2008-01-10

    Gene expression data frequently contain missing values; however, most down-stream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures x time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrixes and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity

  15. Gaussian mixture clustering and imputation of microarray data.

    Science.gov (United States)

    Ouyang, Ming; Welsh, William J; Georgopoulos, Panos

    2004-04-12

    In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.
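    The first evaluation metric named in this abstract, root mean squared error between true and imputed values, can be illustrated with a minimal sketch. The function name `rmse` and the plain-list inputs are assumptions made for illustration; the abstract does not say whether the authors normalise the error, so the un-normalised form is shown.

```python
import math

def rmse(true_values, imputed_values):
    """Root mean squared error over the entries that were masked and then imputed.

    Both arguments are equal-length sequences of numbers; this is the plain,
    un-normalised RMSE (an assumption, the abstract does not specify a variant).
    """
    if len(true_values) != len(imputed_values) or len(true_values) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((t - i) ** 2 for t, i in zip(true_values, imputed_values))
    return math.sqrt(squared_error / len(true_values))
```

    In a typical evaluation, known entries are hidden, imputed by each candidate method, and the method with the smallest value of this metric is judged most accurate on that data set.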

  16. Multiple imputation using linked proxy outcome data resulted in important bias reduction and efficiency gains: a simulation study.

    Science.gov (United States)

    Cornish, R P; Macleod, J; Carpenter, J R; Tilling, K

    2017-01-01

    When an outcome variable is missing not at random (MNAR: probability of missingness depends on outcome values), estimates of the effect of an exposure on this outcome are often biased. We investigated the extent of this bias and examined whether the bias can be reduced through incorporating proxy outcomes obtained through linkage to administrative data as auxiliary variables in multiple imputation (MI). Using data from the Avon Longitudinal Study of Parents and Children (ALSPAC) we estimated the association between breastfeeding and IQ (continuous outcome), incorporating linked attainment data (proxies for IQ) as auxiliary variables in MI models. Simulation studies explored the impact of varying the proportion of missing data (from 20 to 80%), the correlation between the outcome and its proxy (0.1-0.9), the strength of the missing data mechanism, and having a proxy variable that was incomplete. Incorporating a linked proxy for the missing outcome as an auxiliary variable reduced bias and increased efficiency in all scenarios, even when 80% of the outcome was missing. Using an incomplete proxy was similarly beneficial. High correlations (> 0.5) between the outcome and its proxy substantially reduced the missing information. Consistent with this, ALSPAC analysis showed inclusion of a proxy reduced bias and improved efficiency. Gains with additional proxies were modest. In longitudinal studies with loss to follow-up, incorporating proxies for this study outcome obtained via linkage to external sources of data as auxiliary variables in MI models can give practically important bias reduction and efficiency gains when the study outcome is MNAR.

  17. Public Undertakings and Imputability

    DEFF Research Database (Denmark)

    Ølykke, Grith Skovgaard

    2013-01-01

    In this article, the issue of imputability to the State of public undertakings’ decision-making is analysed and discussed in the context of the DSBFirst case. DSBFirst is owned by the independent public undertaking DSB and the private undertaking FirstGroup plc and won the contracts in the 2008...... Oeresund tender for the provision of passenger transport by railway. From the start, the services were provided at a loss, and in the end a part of DSBFirst was wound up. In order to frame the problems illustrated by this case, the jurisprudence-based imputability requirement in the definition of State aid...... in Article 107(1) TFEU is analysed. It is concluded that where the public undertaking transgresses the control system put in place by the State, conditions for imputability are not fulfilled, and it is argued that in the current state of law, there is no conditional link between the level of control...

  18. LinkImputeR: user-guided genotype calling and imputation for non-model organisms.

    Science.gov (United States)

    Money, Daniel; Migicovsky, Zoë; Gardner, Kyle; Myles, Sean

    2017-07-10

    Genomic studies such as genome-wide association and genomic selection require genome-wide genotype data. All existing technologies used to create these data result in missing genotypes, which are often then inferred using genotype imputation software. However, existing imputation methods most often make use only of genotypes that are successfully inferred after having passed a certain read depth threshold. Because of this, any read information for genotypes that did not pass the threshold, and were thus set to missing, is ignored. Most genomic studies also choose read depth thresholds and quality filters without investigating their effects on the size and quality of the resulting genotype data. Moreover, almost all genotype imputation methods require ordered markers and are therefore of limited utility in non-model organisms. Here we introduce LinkImputeR, a software program that exploits the read count information that is normally ignored, and makes use of all available DNA sequence information for the purposes of genotype calling and imputation. It is specifically designed for non-model organisms since it requires neither ordered markers nor a reference panel of genotypes. Using next-generation DNA sequence (NGS) data from apple, cannabis and grape, we quantify the effect of varying read count and missingness thresholds on the quantity and quality of genotypes generated from LinkImputeR. We demonstrate that LinkImputeR can increase the number of genotype calls by more than an order of magnitude, can improve genotyping accuracy by several percent and can thus improve the power of downstream analyses. Moreover, we show that the effects of quality and read depth filters can differ substantially between data sets and should therefore be investigated on a per-study basis. By exploiting DNA sequence data that is normally ignored during genotype calling and imputation, LinkImputeR can significantly improve both the quantity and quality of genotype data generated from

  19. Exploring the deleterious SNPs in XRCC4 gene using computational approach and studying their association with breast cancer in the population of West India.

    Science.gov (United States)

    Singh, Preety K; Mistry, Kinnari N; Chiramana, Haritha; Rank, Dharamshi N; Joshi, Chaitanya G

    2018-05-20

    The non-homologous end joining (NHEJ) pathway has a pivotal role in the repair of double-strand DNA breaks that may lead to carcinogenesis. XRCC4 is one of the essential proteins of this pathway, and single-nucleotide polymorphisms (SNPs) of this gene are reported to be associated with cancer risks. In our study, we first used computational approaches to predict the damaging variants of the XRCC4 gene. Tools predicted the rs79561451 (S110P) nsSNP as the most deleterious SNP. Along with this SNP, we analysed two other SNPs (rs3734091 and rs6869366) to study their association with breast cancer in the population of West India. Variant rs3734091 was found to be significantly associated with breast cancer, while variant rs6869366 did not show any association. These SNPs may influence the susceptibility of individuals to breast cancer in this population. Copyright © 2018 Elsevier B.V. All rights reserved.

  20. Estimating the accuracy of geographical imputation

    Directory of Open Access Journals (Sweden)

    Boscoe Francis P

    2008-01-01

    Background: To reduce the number of non-geocoded cases, researchers and organizations sometimes include cases geocoded to postal code centroids along with cases geocoded with the greater precision of a full street address. Some analysts then use the postal code to assign information to the cases from finer-level geographies such as a census tract. Assignment is commonly completed using either a postal centroid or by a geographical imputation method which assigns a location by using both the demographic characteristics of the case and the population characteristics of the postal delivery area. To date no systematic evaluation of geographical imputation methods ("geo-imputation") has been completed. The objective of this study was to determine the accuracy of census tract assignment using geo-imputation. Methods: Using a large dataset of breast, prostate and colorectal cancer cases reported to the New Jersey Cancer Registry, we determined how often cases were assigned to the correct census tract using alternate strategies of demographic based geo-imputation, and using assignments obtained from postal code centroids. Assignment accuracy was measured by comparing the tract assigned with the tract originally identified from the full street address. Results: Assigning cases to census tracts using the race/ethnicity population distribution within a postal code resulted in more correctly assigned cases than when using postal code centroids. The addition of age characteristics increased the match rates even further. Match rates were highly dependent on both the geographic distribution of race/ethnicity groups and population density. Conclusion: Geo-imputation appears to offer some advantages and no serious drawbacks as compared with the alternative of assigning cases to census tracts based on postal code centroids. For a specific analysis, researchers will still need to consider the potential impact of geocoding quality on their results and evaluate
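    The demographic geo-imputation idea described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact assignment rule, so the sketch assumes a tract is drawn at random with probability proportional to the population of the case's race/ethnicity group in each tract of the postal code. The function name `geo_impute`, the dictionary layout and the group labels are hypothetical.

```python
import random

def geo_impute(tract_populations, group, rng=None):
    """Pick a census tract for a case known only by postal code.

    tract_populations: {tract_id: {group_label: population_count}} for the
    tracts overlapping the case's postal code (hypothetical layout).
    group: the case's demographic group, e.g. a race/ethnicity label.
    The tract is sampled with probability proportional to the group's
    population in each tract (assumed rule, not taken from the study).
    """
    rng = rng or random.Random()
    tracts = sorted(tract_populations)
    weights = [tract_populations[t].get(group, 0) for t in tracts]
    if sum(weights) == 0:
        raise ValueError("no population recorded for this group in any tract")
    return rng.choices(tracts, weights=weights, k=1)[0]
```

    Accuracy in the sense used by the study would then be the share of cases whose imputed tract matches the tract obtained from the full street address.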

  1. Analysis of 60 reported glioma risk SNPs replicates published GWAS findings but fails to replicate associations from published candidate-gene studies.

    Science.gov (United States)

    Walsh, Kyle M; Anderson, Erik; Hansen, Helen M; Decker, Paul A; Kosel, Matt L; Kollmeyer, Thomas; Rice, Terri; Zheng, Shichun; Xiao, Yuanyuan; Chang, Jeffrey S; McCoy, Lucie S; Bracci, Paige M; Wiemels, Joe L; Pico, Alexander R; Smirnov, Ivan; Lachance, Daniel H; Sicotte, Hugues; Eckel-Passow, Jeanette E; Wiencke, John K; Jenkins, Robert B; Wrensch, Margaret R

    2013-02-01

    Genomewide association studies (GWAS) and candidate-gene studies have implicated single-nucleotide polymorphisms (SNPs) in at least 45 different genes as putative glioma risk factors. Attempts to validate these associations have yielded variable results and few genetic risk factors have been consistently replicated. We conducted a case-control study of Caucasian glioma cases and controls from the University of California San Francisco (810 cases, 512 controls) and the Mayo Clinic (852 cases, 789 controls) in an attempt to replicate previously reported genetic risk factors for glioma. Sixty SNPs selected from the literature (eight from GWAS and 52 from candidate-gene studies) were successfully genotyped on an Illumina custom genotyping panel. Eight SNPs in/near seven different genes (TERT, EGFR, CCDC26, CDKN2A, PHLDB1, RTEL1, TP53) were significantly associated with glioma risk in the combined dataset (P 0.05). Although several confirmed associations are located near genes long known to be involved in gliomagenesis (e.g., EGFR, CDKN2A, TP53), these associations were first discovered by the GWAS approach and are in noncoding regions. These results highlight that the deficiencies of the candidate-gene approach lay in selecting both appropriate genes and relevant SNPs within these genes. © 2012 WILEY PERIODICALS, INC.

  2. Cannabis-dependence risk relates to synergism between neuroticism and proenkephalin SNPs associated with amygdala gene expression: case-control study.

    Directory of Open Access Journals (Sweden)

    Didier Jutras-Aswad

    Many young people experiment with cannabis, yet only a subgroup progress to dependence, suggesting individual differences that could relate to factors such as genetics and behavioral traits. Dopamine receptor D2 (DRD2) and proenkephalin (PENK) genes have been implicated in animal studies with cannabis exposure. Whether polymorphisms of these genes are associated with cannabis dependence and related behavioral traits is unknown. Healthy young adults (18-27 years) with cannabis dependence and without a dependence diagnosis were studied (N = 50/group) in relation to a priori-determined single nucleotide polymorphisms (SNPs) of the DRD2 and PENK genes. Negative affect, Impulsive Risk Taking and Neuroticism-Anxiety temperamental traits, positive and negative reward-learning performance and stop-signal reaction times were examined. The findings replicated the known association between the rs6277 DRD2 SNP and decisions associated with negative reinforcement outcomes. Moreover, PENK variants (rs2576573 and rs2609997) significantly related to Neuroticism and cannabis dependence. Cigarette smoking is common in cannabis users, but it was not associated with PENK SNPs, as also validated in another cohort (N = 247 smokers, N = 312 non-smokers). Neuroticism mediated (15.3%-19.5%) the genetic risk to cannabis dependence and interacted with risk SNPs, resulting in a 9-fold increased risk for cannabis dependence. Molecular characterization of the postmortem human brain in a different population revealed an association between PENK SNPs and PENK mRNA expression in the central amygdala nucleus, emphasizing the functional relevance of the SNPs in a brain region strongly linked to negative affect. Overall, the findings suggest an important role for Neuroticism as an endophenotype linking PENK polymorphisms to cannabis-dependence vulnerability, synergistically amplifying the apparent genetic risk.

  3. A second generation human haplotype map of over 3.1 million SNPs.

    Science.gov (United States)

    Frazer, Kelly A; Ballinger, Dennis G; Cox, David R; Hinds, David A; Stuve, Laura L; Gibbs, Richard A; Belmont, John W; Boudreau, Andrew; Hardenbol, Paul; Leal, Suzanne M; Pasternak, Shiran; Wheeler, David A; Willis, Thomas D; Yu, Fuli; Yang, Huanming; Zeng, Changqing; Gao, Yang; Hu, Haoran; Hu, Weitao; Li, Chaohua; Lin, Wei; Liu, Siqi; Pan, Hao; Tang, Xiaoli; Wang, Jian; Wang, Wei; Yu, Jun; Zhang, Bo; Zhang, Qingrun; Zhao, Hongbin; Zhao, Hui; Zhou, Jun; Gabriel, Stacey B; Barry, Rachel; Blumenstiel, Brendan; Camargo, Amy; Defelice, Matthew; Faggart, Maura; Goyette, Mary; Gupta, Supriya; Moore, Jamie; Nguyen, Huy; Onofrio, Robert C; Parkin, Melissa; Roy, Jessica; Stahl, Erich; Winchester, Ellen; Ziaugra, Liuda; Altshuler, David; Shen, Yan; Yao, Zhijian; Huang, Wei; Chu, Xun; He, Yungang; Jin, Li; Liu, Yangfan; Shen, Yayun; Sun, Weiwei; Wang, Haifeng; Wang, Yi; Wang, Ying; Xiong, Xiaoyan; Xu, Liang; Waye, Mary M Y; Tsui, Stephen K W; Xue, Hong; Wong, J Tze-Fei; Galver, Luana M; Fan, Jian-Bing; Gunderson, Kevin; Murray, Sarah S; Oliphant, Arnold R; Chee, Mark S; Montpetit, Alexandre; Chagnon, Fanny; Ferretti, Vincent; Leboeuf, Martin; Olivier, Jean-François; Phillips, Michael S; Roumy, Stéphanie; Sallée, Clémentine; Verner, Andrei; Hudson, Thomas J; Kwok, Pui-Yan; Cai, Dongmei; Koboldt, Daniel C; Miller, Raymond D; Pawlikowska, Ludmila; Taillon-Miller, Patricia; Xiao, Ming; Tsui, Lap-Chee; Mak, William; Song, You Qiang; Tam, Paul K H; Nakamura, Yusuke; Kawaguchi, Takahisa; Kitamoto, Takuya; Morizono, Takashi; Nagashima, Atsushi; Ohnishi, Yozo; Sekine, Akihiro; Tanaka, Toshihiro; Tsunoda, Tatsuhiko; Deloukas, Panos; Bird, Christine P; Delgado, Marcos; Dermitzakis, Emmanouil T; Gwilliam, Rhian; Hunt, Sarah; Morrison, Jonathan; Powell, Don; Stranger, Barbara E; Whittaker, Pamela; Bentley, David R; Daly, Mark J; de Bakker, Paul I W; Barrett, Jeff; Chretien, Yves R; Maller, Julian; McCarroll, Steve; Patterson, Nick; Pe'er, Itsik; Price, Alkes; Purcell, Shaun; Richter, 
Daniel J; Sabeti, Pardis; Saxena, Richa; Schaffner, Stephen F; Sham, Pak C; Varilly, Patrick; Altshuler, David; Stein, Lincoln D; Krishnan, Lalitha; Smith, Albert Vernon; Tello-Ruiz, Marcela K; Thorisson, Gudmundur A; Chakravarti, Aravinda; Chen, Peter E; Cutler, David J; Kashuk, Carl S; Lin, Shin; Abecasis, Gonçalo R; Guan, Weihua; Li, Yun; Munro, Heather M; Qin, Zhaohui Steve; Thomas, Daryl J; McVean, Gilean; Auton, Adam; Bottolo, Leonardo; Cardin, Niall; Eyheramendy, Susana; Freeman, Colin; Marchini, Jonathan; Myers, Simon; Spencer, Chris; Stephens, Matthew; Donnelly, Peter; Cardon, Lon R; Clarke, Geraldine; Evans, David M; Morris, Andrew P; Weir, Bruce S; Tsunoda, Tatsuhiko; Mullikin, James C; Sherry, Stephen T; Feolo, Michael; Skol, Andrew; Zhang, Houcan; Zeng, Changqing; Zhao, Hui; Matsuda, Ichiro; Fukushima, Yoshimitsu; Macer, Darryl R; Suda, Eiko; Rotimi, Charles N; Adebamowo, Clement A; Ajayi, Ike; Aniagwu, Toyin; Marshall, Patricia A; Nkwodimmah, Chibuzor; Royal, Charmaine D M; Leppert, Mark F; Dixon, Missy; Peiffer, Andy; Qiu, Renzong; Kent, Alastair; Kato, Kazuto; Niikawa, Norio; Adewole, Isaac F; Knoppers, Bartha M; Foster, Morris W; Clayton, Ellen Wright; Watkin, Jessica; Gibbs, Richard A; Belmont, John W; Muzny, Donna; Nazareth, Lynne; Sodergren, Erica; Weinstock, George M; Wheeler, David A; Yakub, Imtaz; Gabriel, Stacey B; Onofrio, Robert C; Richter, Daniel J; Ziaugra, Liuda; Birren, Bruce W; Daly, Mark J; Altshuler, David; Wilson, Richard K; Fulton, Lucinda L; Rogers, Jane; Burton, John; Carter, Nigel P; Clee, Christopher M; Griffiths, Mark; Jones, Matthew C; McLay, Kirsten; Plumb, Robert W; Ross, Mark T; Sims, Sarah K; Willey, David L; Chen, Zhu; Han, Hua; Kang, Le; Godbout, Martin; Wallenburg, John C; L'Archevêque, Paul; Bellemare, Guy; Saeki, Koji; Wang, Hongguang; An, Daochang; Fu, Hongbo; Li, Qing; Wang, Zhen; Wang, Renwu; Holden, Arthur L; Brooks, Lisa D; McEwen, Jean E; Guyer, Mark S; Wang, Vivian Ota; Peterson, Jane L; Shi, Michael; 
Spiegel, Jack; Sung, Lawrence M; Zacharia, Lynn F; Collins, Francis S; Kennedy, Karen; Jamieson, Ruth; Stewart, John

    2007-10-18

    We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.
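    The coverage figures in this abstract are expressed through the squared correlation between two SNPs, the standard linkage disequilibrium measure D² / (p_A (1 − p_A) p_B (1 − p_B)) with D = p_AB − p_A p_B. A minimal sketch of that textbook formula follows; the function name `ld_r_squared` and its arguments are illustrative assumptions, and the code is not taken from the HapMap analysis itself.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2.
    p_a, p_b: frequencies of allele A and allele B.
    """
    d = p_ab - p_a * p_b                      # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom <= 0:
        raise ValueError("both SNPs must be polymorphic (0 < frequency < 1)")
    return d * d / denom
```

    The "maximum r²" for an untyped SNP would then be the largest such value over the typed SNPs in its neighbourhood, averaged across untyped SNPs to give the summary quoted above.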

  4. Multiple imputation in the presence of non-normal data.

    Science.gov (United States)

    Lee, Katherine J; Carlin, John B

    2017-02-20

    Multiple imputation (MI) is becoming increasingly popular for handling missing data. Standard approaches for MI assume normality for continuous variables (conditionally on the other variables in the imputation model). However, it is unclear how to impute non-normally distributed continuous variables. Using simulation and a case study, we compared various transformations applied prior to imputation, including a novel non-parametric transformation, to imputation on the raw scale and using predictive mean matching (PMM) when imputing non-normal data. We generated data from a range of non-normal distributions, and set 50% to missing completely at random or missing at random. We then imputed missing values on the raw scale, following a zero-skewness log, Box-Cox or non-parametric transformation and using PMM with both type 1 and 2 matching. We compared inferences regarding the marginal mean of the incomplete variable and the association with a fully observed outcome. We also compared results from these approaches in the analysis of depression and anxiety symptoms in parents of very preterm compared with term-born infants. The results provide novel empirical evidence that the decision regarding how to impute a non-normal variable should be based on the nature of the relationship between the variables of interest. If the relationship is linear in the untransformed scale, transformation can introduce bias irrespective of the transformation used. However, if the relationship is non-linear, it may be important to transform the variable to accurately capture this relationship. A useful alternative is to impute the variable using PMM with type 1 matching. Copyright © 2016 John Wiley & Sons, Ltd.

  5. Assessment of heterogeneity between European Populations: a Baltic and Danish replication case-control study of SNPs from a recent European ulcerative colitis genome wide association study.

    Science.gov (United States)

    Andersen, Vibeke; Ernst, Anja; Sventoraityte, Jurgita; Kupcinskas, Limas; Jacobsen, Bent A; Krarup, Henrik B; Vogel, Ulla; Jonaitis, Laimas; Denapiene, Goda; Kiudelis, Gediminas; Balschun, Tobias; Franke, Andre

    2011-10-13

    Differences in the genetic architecture of inflammatory bowel disease between different European countries and ethnicities have previously been reported. In the present study, we wanted to assess the role of 11 newly identified UC risk variants, derived from a recent European UC genome wide association study (GWAS) (Franke et al., 2010), for 1) association with UC in the Nordic countries, 2) for population heterogeneity between the Nordic countries and the rest of Europe, and, 3) eventually, to drive some of the previous findings towards overall genome-wide significance. Eleven SNPs were replicated in a Danish sample consisting of 560 UC patients and 796 controls and nine missing SNPs of the German GWAS study were successfully genotyped in the Baltic sample comprising 441 UC cases and 1156 controls. The independent replication data was then jointly analysed with the original data and systematic comparisons of the findings between ethnicities were made. Pearson's χ², Breslow-Day (BD) and Cochran-Mantel-Haenszel (CMH) tests were used for association analyses and heterogeneity testing. The rs5771069 (IL17REL) SNP was not associated with UC in the Danish panel. The rs5771069 (IL17REL) SNP was significantly associated with UC in the combined Baltic, Danish and Norwegian UC study sample driven by the Norwegian panel (OR = 0.89, 95% CI: 0.79-0.98, P = 0.02). No association was found between rs7809799 (SMURF1/KPNA7) and UC (OR = 1.20, 95% CI: 0.95-1.52, P = 0.10) or between UC and all other remaining SNPs. We had 94% chance of detecting an association for rs7809799 (SMURF1/KPNA7) in the combined replication sample, whereas the power was 55% or lower for the remaining SNPs. Statistically significant PBD was found for OR heterogeneity between the combined Baltic, Danish, and Norwegian panel versus the combined German, British, Belgian, and Greek panel (rs7520292 (P = 0.001), rs12518307 (P = 0.007), and rs2395609 (TCP11) (P = 0.01), respectively). No SNP reached genome
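    Combining odds ratios across country panels, as this abstract does with Cochran-Mantel-Haenszel methods, rests on the Mantel-Haenszel pooled odds ratio, Σ(a·d/n) / Σ(b·c/n) over strata. The sketch below shows only that textbook estimator; the function name `mantel_haenszel_or` and the tuple layout are assumptions for illustration, and it does not reproduce the study's test statistics or the Breslow-Day heterogeneity test.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across 2x2 tables.

    strata: iterable of (a, b, c, d) counts per stratum (e.g. per country panel):
      a = cases carrying the allele,     b = controls carrying the allele,
      c = cases not carrying the allele, d = controls not carrying the allele.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue                      # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these tables")
    return numerator / denominator
```

    With a single stratum the estimator reduces to the ordinary odds ratio a·d / (b·c); with several strata, larger panels receive more weight.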

  6. Assessment of heterogeneity between European Populations: a Baltic and Danish replication case-control study of SNPs from a recent European ulcerative colitis genome wide association study

    Directory of Open Access Journals (Sweden)

    Jonaitis Laimas

    2011-10-01

    Background: Differences in the genetic architecture of inflammatory bowel disease between different European countries and ethnicities have previously been reported. In the present study, we wanted to assess the role of 11 newly identified UC risk variants, derived from a recent European UC genome wide association study (GWAS) (Franke et al., 2010), for 1) association with UC in the Nordic countries, 2) for population heterogeneity between the Nordic countries and the rest of Europe, and, 3) eventually, to drive some of the previous findings towards overall genome-wide significance. Methods: Eleven SNPs were replicated in a Danish sample consisting of 560 UC patients and 796 controls and nine missing SNPs of the German GWAS study were successfully genotyped in the Baltic sample comprising 441 UC cases and 1156 controls. The independent replication data was then jointly analysed with the original data and systematic comparisons of the findings between ethnicities were made. Pearson's χ², Breslow-Day (BD) and Cochran-Mantel-Haenszel (CMH) tests were used for association analyses and heterogeneity testing. Results: The rs5771069 (IL17REL) SNP was not associated with UC in the Danish panel. The rs5771069 (IL17REL) SNP was significantly associated with UC in the combined Baltic, Danish and Norwegian UC study sample driven by the Norwegian panel (OR = 0.89, 95% CI: 0.79-0.98, P = 0.02). No association was found between rs7809799 (SMURF1/KPNA7) and UC (OR = 1.20, 95% CI: 0.95-1.52, P = 0.10) or between UC and all other remaining SNPs. We had 94% chance of detecting an association for rs7809799 (SMURF1/KPNA7) in the combined replication sample, whereas the power was 55% or lower for the remaining SNPs. Statistically significant PBD was found for OR heterogeneity between the combined Baltic, Danish, and Norwegian panel versus the combined German, British, Belgian, and Greek panel (rs7520292 (P = 0.001), rs12518307 (P = 0.007), and rs2395609 (TCP11) (P = 0

  7. Exploring the Interplay between Rescue Drugs, Data Imputation, and Study Outcomes: Conceptual Review and Qualitative Analysis of an Acute Pain Data Set.

    Science.gov (United States)

    Singla, Neil K; Meske, Diana S; Desjardins, Paul J

    2017-12-01

    In placebo-controlled acute surgical pain studies, provisions must be made for study subjects to receive adequate analgesic therapy. As such, most protocols allow study subjects to receive a pre-specified regimen of open-label analgesic drugs (rescue drugs) as needed. The selection of an appropriate rescue regimen is a critical experimental design choice. We hypothesized that a rescue regimen that is too liberal could lead to all study arms receiving similar levels of pain relief (thereby confounding experimental results), while a regimen that is too stringent could lead to a high subject dropout rate (giving rise to a preponderance of missing data). Despite the importance of rescue regimen as a study design feature, there exist no published review articles or meta-analysis focusing on the impact of rescue therapy on experimental outcomes. Therefore, when selecting a rescue regimen, researchers must rely on clinical factors (what analgesics do patients usually receive in similar surgical scenarios) and/or anecdotal evidence. In the following article, we attempt to bridge this gap by reviewing and discussing the experimental impacts of rescue therapy on a common acute surgical pain population: first metatarsal bunionectomy. The function of this analysis is to (1) create a framework for discussion and future exploration of rescue as a methodological study design feature, (2) discuss the interplay between data imputation techniques and rescue drugs, and (3) inform the readership regarding the impact of data imputation techniques on the validity of study conclusions. Our findings indicate that liberal rescue may degrade assay sensitivity, while stringent rescue may lead to unacceptably high dropout rates.

  8. Discovery and fine-mapping of adiposity loci using high density imputation of genome-wide association studies in individuals of African ancestry: African Ancestry Anthropometry Genetics Consortium.

    Science.gov (United States)

    Ng, Maggie C Y; Graff, Mariaelisa; Lu, Yingchang; Justice, Anne E; Mudgal, Poorva; Liu, Ching-Ti; Young, Kristin; Yanek, Lisa R; Feitosa, Mary F; Wojczynski, Mary K; Rand, Kristin; Brody, Jennifer A; Cade, Brian E; Dimitrov, Latchezar; Duan, Qing; Guo, Xiuqing; Lange, Leslie A; Nalls, Michael A; Okut, Hayrettin; Tajuddin, Salman M; Tayo, Bamidele O; Vedantam, Sailaja; Bradfield, Jonathan P; Chen, Guanjie; Chen, Wei-Min; Chesi, Alessandra; Irvin, Marguerite R; Padhukasahasram, Badri; Smith, Jennifer A; Zheng, Wei; Allison, Matthew A; Ambrosone, Christine B; Bandera, Elisa V; Bartz, Traci M; Berndt, Sonja I; Bernstein, Leslie; Blot, William J; Bottinger, Erwin P; Carpten, John; Chanock, Stephen J; Chen, Yii-Der Ida; Conti, David V; Cooper, Richard S; Fornage, Myriam; Freedman, Barry I; Garcia, Melissa; Goodman, Phyllis J; Hsu, Yu-Han H; Hu, Jennifer; Huff, Chad D; Ingles, Sue A; John, Esther M; Kittles, Rick; Klein, Eric; Li, Jin; McKnight, Barbara; Nayak, Uma; Nemesure, Barbara; Ogunniyi, Adesola; Olshan, Andrew; Press, Michael F; Rohde, Rebecca; Rybicki, Benjamin A; Salako, Babatunde; Sanderson, Maureen; Shao, Yaming; Siscovick, David S; Stanford, Janet L; Stevens, Victoria L; Stram, Alex; Strom, Sara S; Vaidya, Dhananjay; Witte, John S; Yao, Jie; Zhu, Xiaofeng; Ziegler, Regina G; Zonderman, Alan B; Adeyemo, Adebowale; Ambs, Stefan; Cushman, Mary; Faul, Jessica D; Hakonarson, Hakon; Levin, Albert M; Nathanson, Katherine L; Ware, Erin B; Weir, David R; Zhao, Wei; Zhi, Degui; Arnett, Donna K; Grant, Struan F A; Kardia, Sharon L R; Oloapde, Olufunmilayo I; Rao, D C; Rotimi, Charles N; Sale, Michele M; Williams, L Keoki; Zemel, Babette S; Becker, Diane M; Borecki, Ingrid B; Evans, Michele K; Harris, Tamara B; Hirschhorn, Joel N; Li, Yun; Patel, Sanjay R; Psaty, Bruce M; Rotter, Jerome I; Wilson, James G; Bowden, Donald W; Cupples, L Adrienne; Haiman, Christopher A; Loos, Ruth J F; North, Kari E

    2017-04-01

    Genome-wide association studies (GWAS) have identified >300 loci associated with measures of adiposity, including body mass index (BMI) and waist-to-hip ratio adjusted for BMI (WHRadjBMI), but few have been identified through screening of African ancestry genomes. We performed large-scale meta-analyses and replications in up to 52,895 individuals for BMI and up to 23,095 individuals for WHRadjBMI from the African Ancestry Anthropometry Genetics Consortium (AAAGC), using 1000 Genomes phase 1 imputed GWAS to improve coverage of both common and low-frequency variants in the low linkage disequilibrium African ancestry genomes. In the sex-combined analyses, we identified one novel locus (TCF7L2/HABP2) for WHRadjBMI and eight previously established loci at P [...] African ancestry individuals. An additional novel locus (SPRYD7/DLEU2) was identified for WHRadjBMI when combined with European GWAS. In the sex-stratified analyses, we identified three novel loci for BMI (INTS10/LPL and MLC1 in men, IRX4/IRX2 in women) and four for WHRadjBMI (SSX2IP, CASC8, PDE3B and ZDHHC1/HSD11B2 in women) in individuals of African ancestry or both African and European ancestry. For four of the novel variants, the minor allele frequency was low ([...] African ancestry sex-combined and sex-stratified analyses, 26 BMI loci and 17 WHRadjBMI loci contained ≤20 variants in the credible sets that jointly account for 99% posterior probability of driving the associations. The lead variants in 13 of these loci had a high probability of being causal. Compared with our previous HapMap imputed GWAS for BMI and WHRadjBMI, which included up to 71,412 and 27,350 African ancestry individuals, respectively, our results suggest that 1000 Genomes imputation showed modest improvement in identifying GWAS loci, including low-frequency variants. Trans-ethnic meta-analyses further improved fine mapping of putative causal variants in loci shared between the African and European ancestry populations.
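    The abstract above refers to credible sets that jointly account for 99% of the posterior probability of driving an association. A minimal sketch of how such a set can be built is given below, assuming per-variant posterior probabilities are already available; the function name `credible_set`, the variant labels and the probabilities are hypothetical and are not taken from the study.

```python
def credible_set(posteriors, coverage=0.99):
    """Return the smallest list of variants whose normalised posterior
    probabilities sum to at least `coverage` (greedy, largest first)."""
    if not posteriors:
        return []
    total = sum(posteriors.values())
    # Rank variants from most to least probable.
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for variant, prob in ranked:
        selected.append(variant)
        cumulative += prob / total
        if cumulative >= coverage:
            break
    return selected


# Hypothetical posterior probabilities for one locus (illustrative only).
example = {"var_a": 0.90, "var_b": 0.06, "var_c": 0.035, "var_d": 0.005}
print(credible_set(example))
```

    A locus whose credible set is short (the abstract uses ≤20 variants as its cut-off) is one where fine mapping has narrowed the candidates substantially.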

  9. Improved imputation accuracy of rare and low-frequency variants using population-specific high-coverage WGS-based imputation reference panel.

    Science.gov (United States)

    Mitt, Mario; Kals, Mart; Pärn, Kalle; Gabriel, Stacey B; Lander, Eric S; Palotie, Aarno; Ripatti, Samuli; Morris, Andrew P; Metspalu, Andres; Esko, Tõnu; Mägi, Reedik; Palta, Priit

    2017-06-01

    Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF) ≥5% and low-frequency variants (0.5% ≤ MAF < 5%) across diverse populations, but the imputation of rare variation (MAF < 0.5%) is still rather limited. In the current study, we compare the imputation accuracy achieved with reference panels from diverse populations against that of a population-specific, high-coverage (30×) whole-genome sequencing (WGS) based reference panel comprising 2244 Estonian individuals (0.25% of adult Estonians). Although the Estonian-specific panel contains fewer haplotypes and variants, the imputation confidence and accuracy of imputed low-frequency and rare variants were significantly higher. The results indicate the utility of population-specific reference panels for human genetic studies.
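    The frequency classes used in this abstract (common, low-frequency, rare) can be expressed as a small helper. The sketch below uses the thresholds stated above; the function name `maf_class` is hypothetical, and treating the input as an alternate-allele frequency that is folded to the minor allele is an assumption made for illustration.

```python
def maf_class(allele_freq):
    """Classify a variant by minor allele frequency (MAF).

    Thresholds follow the abstract: common (MAF >= 5%),
    low-frequency (0.5% <= MAF < 5%), rare (MAF < 0.5%).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    # Fold the frequency so that it always refers to the minor allele.
    maf = min(allele_freq, 1.0 - allele_freq)
    if maf >= 0.05:
        return "common"
    if maf >= 0.005:
        return "low-frequency"
    return "rare"


for freq in (0.20, 0.97, 0.001):
    print(freq, maf_class(freq))
```

    Note that a monomorphic site (MAF of 0) falls into the "rare" class in this sketch; a real pipeline would probably filter such sites out beforehand.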

  10. Genome-Wide Association Study to Identify Single Nucleotide Polymorphisms (SNPs) Associated With the Development of Erectile Dysfunction in African-American Men After Radiotherapy for Prostate Cancer

    International Nuclear Information System (INIS)

    Kerns, Sarah L.; Ostrer, Harry; Stock, Richard; Li, William; Moore, Julian; Pearlman, Alexander; Campbell, Christopher; Shao Yongzhao; Stone, Nelson; Kusnetz, Lynda; Rosenstein, Barry S.

    2010-01-01

    Purpose: To identify single nucleotide polymorphisms (SNPs) associated with erectile dysfunction (ED) among African-American prostate cancer patients treated with external beam radiation therapy. Methods and Materials: A cohort of African-American prostate cancer patients treated with external beam radiation therapy was observed for the development of ED by use of the five-item Sexual Health Inventory for Men (SHIM) questionnaire. Final analysis included 27 cases (post-treatment SHIM score ≤7) and 52 control subjects (post-treatment SHIM score ≥16). A genome-wide association study was performed using approximately 909,000 SNPs genotyped on Affymetrix 6.0 arrays (Affymetrix, Santa Clara, CA). Results: We identified SNP rs2268363, located in the follicle-stimulating hormone receptor (FSHR) gene, as significantly associated with ED after correcting for multiple comparisons (unadjusted p = 5.46 × 10^-8, Bonferroni p = 0.028). We identified four additional SNPs that tended toward a significant association with an unadjusted p value < 10^-6. Inference of population substructure showed that cases had a higher proportion of African ancestry than control subjects (77% vs. 60%, p = 0.005). A multivariate logistic regression model that incorporated estimated ancestry and four of the top-ranked SNPs was a more accurate classifier of ED than a model that included only clinical variables. Conclusions: To our knowledge, this is the first genome-wide association study to identify SNPs associated with adverse effects resulting from radiotherapy. It is important to note that the SNP that proved to be significantly associated with ED is located within a gene whose encoded product plays a role in male gonad development and function. Another key finding of this project is that the four SNPs most strongly associated with ED were specific to persons of African ancestry and would therefore not have been identified had a cohort of European ancestry been screened. This study demonstrates

  11. An association study of 13 SNPs from seven candidate genes with pediatric asthma and a preliminary study for genetic testing by multiple variants in Taiwanese population.

    Science.gov (United States)

    Wang, Jiu-Yao; Liou, Ya-Huei; Wu, Ying-Jye; Hsiao, Ya-Hsin; Wu, Lawrence Shih-Hsin

    2009-03-01

    Asthma is one of the most common chronic diseases in children. It is caused by complex interactions between various genetic factors and exposures to environmental allergens and irritants. Because of the heterogeneity of the disease and the genetic and cultural differences among populations, proper association studies and genetic testing for asthma susceptibility genes are difficult to perform. We assessed 13 single-nucleotide polymorphisms (SNPs) in seven well-known asthma susceptibility genes and looked for association with pediatric asthma using 449 asthmatic subjects and 512 non-asthmatic subjects in a Taiwanese population. CD14-159 C/T and MS4A2 Glu237Gly were found to differ in genotype/allele frequencies between the control group and asthma patients. Moreover, the genotype synergistic analysis showed that the joint contribution of two functional SNPs conferred greater risk of, or greater protection from, asthma attack. Our study provided a genotype synergistic method for studying gene-gene interaction on a polymorphism basis and for genetic testing using multiple polymorphisms.

  12. Imputing amino acid polymorphisms in human leukocyte antigens.

    Directory of Open Access Journals (Sweden)

    Xiaoming Jia

    DNA sequence variation within human leukocyte antigen (HLA) genes mediates susceptibility to a wide range of human diseases. The complex genetic structure of the major histocompatibility complex (MHC) makes it difficult, however, to collect genotyping data in large cohorts. Long-range linkage disequilibrium between HLA loci and SNP markers across the MHC region offers an alternative approach through imputation to interrogate HLA variation in existing GWAS data sets. Here we describe a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I (HLA-A, -B, -C) and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1) loci. To characterize performance of SNP2HLA, we constructed two European ancestry reference panels, one based on data collected in HapMap-CEPH pedigrees (90 individuals) and another based on data collected by the Type 1 Diabetes Genetics Consortium (T1DGC, 5,225 individuals). We imputed HLA alleles in an independent data set from the British 1958 Birth Cohort (N = 918) with gold standard four-digit HLA types and SNPs genotyped using the Affymetrix GeneChip 500 K and Illumina Immunochip microarrays. We demonstrate that the sample size of the reference panel, rather than SNP density of the genotyping platform, is critical to achieve high imputation accuracy. Using the larger T1DGC reference panel, the average accuracy at four-digit resolution is 94.7% using the low-density Affymetrix GeneChip 500 K, and 96.7% using the high-density Illumina Immunochip. For amino acid polymorphisms within HLA genes, we achieve 98.6% and 99.3% accuracy using the Affymetrix GeneChip 500 K and Illumina Immunochip, respectively. Finally, we demonstrate how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals, giving a specific example from type 1 diabetes.

  13. Multiply-Imputed Synthetic Data: Advice to the Imputer

    Directory of Open Access Journals (Sweden)

    Loong Bronwyn

    2017-12-01

    Several statistical agencies have started to use multiply-imputed synthetic microdata to create public-use data in major surveys. The purpose of doing this is to protect the confidentiality of respondents’ identities and sensitive attributes, while allowing standard complete-data analyses of microdata. A key challenge, faced by advocates of synthetic data, is demonstrating that valid statistical inferences can be obtained from such synthetic data for non-confidential questions. Large discrepancies between observed-data and synthetic-data analytic results for such questions may arise because of uncongeniality; that is, differences in the types of inputs available to the imputer, who has access to the actual data, and to the analyst, who has access only to the synthetic data. Here, we discuss a simple, but possibly canonical, example of uncongeniality when using multiple imputation to create synthetic data, which specifically addresses the choices made by the imputer. An initial, unanticipated but not surprising, conclusion is that non-confidential design information used to impute synthetic data should be released with the confidential synthetic data to allow users of synthetic data to avoid possible grossly conservative inferences.

  14. Multiple imputation and its application

    CERN Document Server

    Carpenter, James

    2013-01-01

    A practical guide to analysing partially observed data. Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods. This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors' aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and associated algorithms and its application to increasingly complex data structures. Multiple Imputation and its Application: Discusses the issues ...
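    As a concrete illustration of how results from several imputed data sets are usually combined, the sketch below applies the standard pooling rules for multiple imputation (commonly called Rubin's rules). The rules are general statistical background rather than something quoted in the record above; the function name `pool_rubin` and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)        # average of the per-imputation estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates of one coefficient from three imputed data sets.
estimate, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, total_var)
```

    The between-imputation term is what carries the extra uncertainty due to the missing values; analysing a single imputed data set would omit it.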

  15. Bootstrap inference when using multiple imputation.

    Science.gov (United States)

    Schomaker, Michael; Heumann, Christian

    2018-04-16

    Many modern estimators require bootstrapping to calculate confidence intervals because either no analytic standard error is available or the distribution of the parameter of interest is nonsymmetric. It remains however unclear how to obtain valid bootstrap inference when dealing with multiple imputation to address missing data. We present 4 methods that are intuitively appealing, easy to implement, and combine bootstrap estimation with multiple imputation. We show that 3 of the 4 approaches yield valid inference, but that the performance of the methods varies with respect to the number of imputed data sets and the extent of missingness. Simulation studies reveal the behavior of our approaches in finite samples. A topical analysis from HIV treatment research, which determines the optimal timing of antiretroviral treatment initiation in young children, demonstrates the practical implications of the 4 methods in a sophisticated and realistic setting. This analysis suffers from missing data and uses the g-formula for inference, a method for which no standard errors are available. Copyright © 2018 John Wiley & Sons, Ltd.

  16. Flexible Imputation of Missing Data

    CERN Document Server

    van Buuren, Stef

    2012-01-01

    Missing data form a problem in every scientific discipline, yet the techniques required to handle them are complicated and often lacking. One of the great ideas in statistical science--multiple imputation--fills gaps in the data with plausible values, the uncertainty of which is coded in the data itself. It also solves other problems, many of which are missing data problems in disguise. Flexible Imputation of Missing Data is supported by many examples using real data taken from the author's vast experience of collaborative research, and presents a practical guide for handling missing data unde

  17. Design of a bovine low-density SNP array optimized for imputation.

    Directory of Open Access Journals (Sweden)

    Didier Boichard

    The Illumina BovineLD BeadChip was designed to support imputation to higher density genotypes in dairy and beef breeds by including single-nucleotide polymorphisms (SNPs) that had a high minor allele frequency as well as uniform spacing across the genome except at the ends of the chromosome where densities were increased. The chip also includes SNPs on the Y chromosome and mitochondrial DNA loci that are useful for determining subspecies classification and certain paternal and maternal breed lineages. The total number of SNPs was 6,909. Accuracy of imputation to Illumina BovineSNP50 genotypes using the BovineLD chip was over 97% for most dairy and beef populations. The BovineLD imputations were about 3 percentage points more accurate than those from the Illumina GoldenGate Bovine3K BeadChip across multiple populations. The improvement was greatest when neither parent was genotyped. The minor allele frequencies were similar across taurine beef and dairy breeds as was the proportion of SNPs that were polymorphic. The new BovineLD chip should facilitate low-cost genomic selection in taurine beef and dairy cattle.

  18. R package imputeTestbench to compare imputations methods for univariate time series

    OpenAIRE

    Bokde, Neeraj; Kulat, Kishore; Beck, Marcus W; Asencio-Cortés, Gualberto

    2016-01-01

    This paper describes the R package imputeTestbench that provides a testbench for comparing imputation methods for missing data in univariate time series. The imputeTestbench package can be used to simulate the amount and type of missing data in a complete dataset and compare filled data using different imputation methods. The user has the option to simulate missing data by removing observations completely at random or in blocks of different sizes. Several default imputation methods are includ...
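    The workflow described above (remove observations from a complete series, fill the gaps with competing methods, compare the filled series with the original) can be sketched in a few lines. The code below is an illustration in Python and does not use or reproduce the imputeTestbench API; all function names, the synthetic sine series and the choice of two simple imputation methods are assumptions.

```python
import math
import random


def remove_at_random(series, fraction, rng):
    """Mark a fraction of interior points as missing (None), completely at random."""
    n_missing = int(round(fraction * len(series)))
    # Endpoints are kept so that linear interpolation is always defined.
    missing = set(rng.sample(range(1, len(series) - 1), n_missing))
    return [None if i in missing else x for i, x in enumerate(series)]


def impute_mean(series):
    """Replace every missing value with the mean of the observed values."""
    observed = [x for x in series if x is not None]
    fill = sum(observed) / len(observed)
    return [fill if x is None else x for x in series]


def impute_linear(series):
    """Fill each gap by linear interpolation between its observed neighbours."""
    filled = list(series)
    known = [i for i, x in enumerate(series) if x is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


def rmse(truth, filled):
    """Root-mean-square error between the complete and the filled series."""
    return math.sqrt(sum((t - f) ** 2 for t, f in zip(truth, filled)) / len(truth))


rng = random.Random(0)  # fixed seed so the run is repeatable
complete = [math.sin(2 * math.pi * t / 24) for t in range(96)]
gappy = remove_at_random(complete, 0.2, rng)
for name, method in [("mean", impute_mean), ("linear", impute_linear)]:
    print(name, round(rmse(complete, method(gappy)), 4))
```

    Removing observations in contiguous blocks, the other option mentioned in the abstract, would only require a different removal function; the comparison loop stays the same.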

  19. Highly accurate sequence imputation enables precise QTL mapping in Brown Swiss cattle.

    Science.gov (United States)

    Frischknecht, Mirjam; Pausch, Hubert; Bapst, Beat; Signer-Hasler, Heidi; Flury, Christine; Garrick, Dorian; Stricker, Christian; Fries, Ruedi; Gredler-Grandl, Birgit

    2017-12-29

    Within the last few years a large amount of genomic information has become available in cattle. Densities of genomic information vary from a few thousand variants up to whole genome sequence information. In order to combine genomic information from different sources and infer genotypes for a common set of variants, genotype imputation is required. In this study we evaluated the accuracy of imputation from high density chips to whole genome sequence data in Brown Swiss cattle. Using four popular imputation programs (Beagle, FImpute, Impute2, Minimac) and various compositions of reference panels, the accuracy of the imputed sequence variant genotypes was high and differences between the programs and scenarios were small. We imputed sequence variant genotypes for more than 1600 Brown Swiss bulls and performed genome-wide association studies for milk fat percentage at two stages of lactation. We found one and three quantitative trait loci for early and late lactation fat content, respectively. Known causal variants that were imputed from the sequenced reference panel were among the most significantly associated variants of the genome-wide association study. Our study demonstrates that whole-genome sequence information can be imputed at high accuracy in cattle populations. Using imputed sequence variant genotypes in genome-wide association studies may facilitate causal variant detection.

  20. An imputation/copula-based stochastic individual tree growth model for mixed species Acadian forests: a case study using the Nova Scotia permanent sample plot network

    Directory of Open Access Journals (Sweden)

    John A. Kershaw Jr

    2017-09-01

    Background: A novel approach to modelling individual tree growth dynamics is proposed. The approach combines multiple imputation and copula sampling to produce a stochastic individual tree growth and yield projection system. Methods: The Nova Scotia, Canada permanent sample plot network is used as a case study to develop and test the modelling approach. Predictions from this model are compared to predictions from the Acadian variant of the Forest Vegetation Simulator, a widely used statistical individual tree growth and yield model. Results: Diameter and height growth rates were predicted with error rates consistent with those produced using statistical models. Mortality and ingrowth error rates were higher than those observed for diameter and height, but also were within the bounds produced by traditional approaches for predicting these rates. Ingrowth species composition was very poorly predicted. The model was capable of reproducing a wide range of stand dynamic trajectories and in some cases reproduced trajectories that the statistical model was incapable of reproducing. Conclusions: The model has potential to be used as a benchmarking tool for evaluating statistical and process models and may provide a mechanism to separate signal from noise and improve our ability to analyze and learn from large regional datasets that often have underlying flaws in sample design.

  1. The Ability of Different Imputation Methods to Preserve the Significant Genes and Pathways in Cancer

    Directory of Open Access Journals (Sweden)

    Rosa Aghdam

    2017-12-01

    Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% of genes are selected at random under two types of missingness mechanism, ignorable and non-ignorable, with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes are tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/.

  2. The Ability of Different Imputation Methods to Preserve the Significant Genes and Pathways in Cancer.

    Science.gov (United States)

    Aghdam, Rosa; Baghfalaki, Taban; Khosravi, Pegah; Saberi Ansari, Elnaz

    2017-12-01

    Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% of genes are selected at random under two types of missingness mechanism, ignorable and non-ignorable, with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes are tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/. Copyright © 2017. Production and hosting by Elsevier B.V.

  3. Evaluation and application of summary statistic imputation to discover new height-associated loci.

    Science.gov (United States)

    Rüeger, Sina; McDaid, Aaron; Kutalik, Zoltán

    2018-05-01

    As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and practical utility has not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation boasts a 3- to 5-fold lower root-mean-square error, and better distinguishes true associations from null ones: We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian

  4. Missing value imputation: with application to handwriting data

    Science.gov (United States)

    Xu, Zhen; Srihari, Sargur N.

    2015-01-01

    Missing values make pattern analysis difficult, particularly with limited available data. In longitudinal research, missing values accumulate, thereby aggravating the problem. Here we consider how to deal with temporal data with missing values in handwriting analysis. In the task of studying development of individuality of handwriting, we encountered the fact that feature values are missing for several individuals at several time instances. Six algorithms, i.e., random imputation, mean imputation, most likely independent value imputation, and three methods based on Bayesian network (static Bayesian network, parameter EM, and structural EM), are compared with children's handwriting data. We evaluate the accuracy and robustness of the algorithms under different ratios of missing data and missing values, and useful conclusions are given. Specifically, static Bayesian network is used for our data which contain around 5% missing data to provide adequate accuracy and low computational cost.

  5. Whole-Genome Sequencing Coupled to Imputation Discovers Genetic Signals for Anthropometric Traits

    NARCIS (Netherlands)

    I. Tachmazidou (Ioanna); Süveges, D. (Dániel); J. Min (Josine); G.R.S. Ritchie (Graham R.S.); Steinberg, J. (Julia); K. Walter (Klaudia); V. Iotchkova (Valentina); J.A. Schwartzentruber (Jeremy); J. Huang (Jian); Y. Memari (Yasin); McCarthy, S. (Shane); Crawford, A.A. (Andrew A.); C. Bombieri (Cristina); M. Cocca (Massimiliano); A.-E. Farmaki (Aliki-Eleni); T.R. Gaunt (Tom); P. Jousilahti (Pekka); M.N. Kooijman (Marjolein ); Lehne, B. (Benjamin); G. Malerba (Giovanni); S. Männistö (Satu); A. Matchan (Angela); M.C. Medina-Gomez (Carolina); S. Metrustry (Sarah); A. Nag (Abhishek); I. Ntalla (Ioanna); L. Paternoster (Lavinia); N.W. Rayner (Nigel William); C. Sala (Cinzia); W.R. Scott (William R.); H.A. Shihab (Hashem A.); L. Southam (Lorraine); B. St Pourcain (Beate); M. Traglia (Michela); K. Trajanoska (Katerina); Zaza, G. (Gialuigi); W. Zhang (Weihua); M.S. Artigas; Bansal, N. (Narinder); M. Benn (Marianne); Chen, Z. (Zhongsheng); P. Danecek (Petr); Lin, W.-Y. (Wei-Yu); A. Locke (Adam); J. Luan (Jian'An); A.K. Manning (Alisa); Mulas, A. (Antonella); C. Sidore (Carlo); A. Tybjaerg-Hansen; A. Varbo (Anette); M. Zoledziewska (Magdalena); C. Finan (Chris); Hatzikotoulas, K. (Konstantinos); A.E. Hendricks (Audrey E.); J.P. Kemp (John); A. Moayyeri (Alireza); Panoutsopoulou, K. (Kalliope); Szpak, M. (Michal); S.G. Wilson (Scott); M. Boehnke (Michael); F. Cucca (Francesco); Di Angelantonio, E. (Emanuele); C. Langenberg (Claudia); C.M. Lindgren (Cecilia M.); McCarthy, M.I. (Mark I.); A.P. Morris (Andrew); B.G. Nordestgaard (Børge); R.A. Scott (Robert); M.D. Tobin (Martin); N.J. Wareham (Nick); P.R. Burton (Paul); J.C. Chambers (John); Smith, G.D. (George Davey); G.V. Dedoussis (George); J.F. Felix (Janine); O.H. Franco (Oscar); Gambaro, G. (Giovanni); P. Gasparini (Paolo); C.J. Hammond (Christopher J.); A. Hofman (Albert); V.W.V. Jaddoe (Vincent); M.E. Kleber (Marcus); J.S. Kooner (Jaspal S.); M. Perola (Markus); C.L. Relton (Caroline); S.M. Ring (Susan); F. 
Rivadeneira Ramirez (Fernando); V. Salomaa (Veikko); T.D. Spector (Timothy); O. Stegle (Oliver); D. Toniolo (Daniela); A.G. Uitterlinden (André); I.E. Barroso (Inês); C.M.T. Greenwood (Celia); Perry, J.R.B. (John R.B.); Walker, B.R. (Brian R.); A.S. Butterworth (Adam); Y. Xue (Yali); R. Durbin (Richard); K.S. Small (Kerrin); N. Soranzo (Nicole); N.J. Timpson (Nicholas); E. Zeggini (Eleftheria)

    2016-01-01

    Deep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the

  6. Whole-Genome Sequencing Coupled to Imputation Discovers Genetic Signals for Anthropometric Traits

    DEFF Research Database (Denmark)

    Tachmazidou, Ioanna; Süveges, Dániel; Min, Josine L

    2017-01-01

    Deep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the broader alleli...

  7. Replication and Characterization of Association between ABO SNPs and Red Blood Cell Traits by Meta-Analysis in Europeans.

    Directory of Open Access Journals (Sweden)

    Stela McLachlan

    Red blood cell (RBC) traits are routinely measured in clinical practice as important markers of health. Deviations from the physiological ranges are usually a sign of disease, although variation between healthy individuals also occurs, at least partly due to genetic factors. Recent large scale genetic studies identified loci associated with one or more of these traits; further characterization of known loci and identification of new loci is necessary to better understand their role in health and disease and to identify potential molecular mechanisms. We performed meta-analysis of Metabochip association results for six RBC traits, namely hemoglobin concentration (Hb), hematocrit (Hct), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV) and red blood cell count (RCC), in 11 093 Europeans from seven studies of the UCL-LSHTM-Edinburgh-Bristol (UCLEB) Consortium. We identified 394 non-overlapping SNPs in five loci at genome-wide significance: 6p22.1-6p21.33 (with HFE among others), 6q23.2 (with HBS1L among others), 6q23.3 (contains no genes), 9q34.3 (only ABO gene) and 22q13.1 (with TMPRSS6 among others), replicating previous findings of association with RBC traits at these loci and extending them by imputation to 1000 Genomes. We further characterized associations between ABO SNPs and three traits: hemoglobin, hematocrit and red blood cell count, replicating them in an independent cohort. Conditional analyses indicated the independent association of each of these traits with ABO SNPs and a role for blood group O in mediating the association. The 15 most significant RBC-associated ABO SNPs were also associated with five cardiometabolic traits, with discordance in the direction of effect between groups of traits, suggesting that ABO may act through more than one mechanism to influence cardiometabolic risk.

  8. Data imputation analysis for Cosmic Rays time series

    Science.gov (United States)

    Fernandes, R. C.; Lucio, P. S.; Fernandez, J. H.

    2017-05-01

    Missing data in Galactic Cosmic Rays (GCR) time series are inevitable, owing to mechanical and human failure, technical problems, and differing periods of operation of GCR stations. The aim of this study was to perform multiple dataset imputation in order to depict the observational dataset. The study used the monthly time series of GCR Climax (CLMX) and Roma (ROME) from 1960 to 2004 to simulate scenarios of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of missing data compared to the observed ROME series, with 50 replicates. The CLMX station was then used as a proxy for allocation of these scenarios. Three different methods for monthly dataset imputation were selected: AMÉLIA II, which runs the bootstrap Expectation Maximization algorithm; MICE, which runs an algorithm via Multivariate Imputation by Chained Equations; and MTSDI, an Expectation Maximization algorithm-based method for imputation of missing values in multivariate normal time series. The synthetic time series were also compared with the observed ROME series using several skill measures, such as RMSE, NRMSE, Agreement Index, R, R2, F-test and t-test. The results showed that for CLMX and ROME, the R2 and R statistics were equal to 0.98 and 0.96, respectively. It was observed that increases in the number of gaps generate loss of quality of the time series. Data imputation was more efficient with the MTSDI method, with negligible errors and the best skill coefficients. The results suggest a limit of about 60% missing data for imputation of monthly averages. It is noteworthy that the CLMX, ROME and KIEL stations present no missing data in the target period. This methodology allowed reconstructing 43 time series.
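    Several of the skill measures listed in this abstract are simple to compute. The sketch below covers RMSE, NRMSE, an agreement index, R and R2 under stated assumptions: the NRMSE is normalised by the range of the observations, and the "Agreement Index" is taken to be Willmott's index of agreement. The abstract does not say which variants the authors used, and the function name `skill_measures` is hypothetical.

```python
import math


def skill_measures(observed, predicted):
    """Return RMSE, range-normalised RMSE, Willmott's index of agreement,
    Pearson's R and R squared for two equally long series."""
    n = len(observed)
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    sq_err = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    rmse = math.sqrt(sq_err / n)
    # Assumption: normalise by the range of the observed values.
    nrmse = rmse / (max(observed) - min(observed))
    # Willmott's index of agreement: 1 means perfect agreement.
    potential = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2
                    for o, p in zip(observed, predicted))
    agreement = 1 - sq_err / potential
    # Pearson correlation between observed and predicted values.
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted))
    ss_o = sum((o - o_mean) ** 2 for o in observed)
    ss_p = sum((p - p_mean) ** 2 for p in predicted)
    r = cov / math.sqrt(ss_o * ss_p)
    return {"rmse": rmse, "nrmse": nrmse, "agreement": agreement, "r": r, "r2": r * r}


# Hypothetical series: the prediction is the observation shifted by one unit.
print(skill_measures([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]))
```

    The example shows why more than one measure is reported: a constant offset leaves R at 1 while RMSE and the agreement index both register the error.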

  9. Fine scale mapping of the 17q22 breast cancer locus using dense SNPs, genotyped within the Collaborative Oncological Gene-Environment Study (COGs)

    DEFF Research Database (Denmark)

    Darabi, Hatef; Beesley, Jonathan; Droit, Arnaud

    2016-01-01

    for driving breast cancer risk (lead SNP rs2787486 (OR = 0.92; CI 0.90-0.94; P = 8.96 × 10(-15))) and are correlated with two previously reported risk-associated variants at this locus, SNPs rs6504950 (OR = 0.94, P = 2.04 × 10(-09), r(2) = 0.73 with lead SNP) and rs1156287 (OR = 0.93, P = 3.41 × 10(-11), r(2......) = 0.83 with lead SNP). Analyses indicate only one causal SNP in the region and several enhancer elements targeting STXBP4 are located within the 53 kb association signal. Expression studies in breast tumor tissues found SNP rs2787486 to be associated with increased STXBP4 expression, suggesting...

  10. UGbS-Flex, a novel bioinformatics pipeline for imputation-free SNP discovery in polyploids without a reference genome: finger millet as a case study.

    Science.gov (United States)

    Qi, Peng; Gimode, Davis; Saha, Dipnarayan; Schröder, Stephan; Chakraborty, Debkanta; Wang, Xuewen; Dida, Mathews M; Malmberg, Russell L; Devos, Katrien M

    2018-06-15

    Research on orphan crops is often hindered by a lack of genomic resources. With the advent of affordable sequencing technologies, genotyping an entire genome or, for large-genome species, a representative fraction of the genome has become feasible for any crop. Nevertheless, most genotyping-by-sequencing (GBS) methods are geared towards obtaining large numbers of markers at low sequence depth, which excludes their application in heterozygous individuals. Furthermore, bioinformatics pipelines often lack the flexibility to deal with paired-end reads or to be applied in polyploid species. UGbS-Flex combines publicly available software with in-house python and perl scripts to efficiently call SNPs from genotyping-by-sequencing reads irrespective of the species' ploidy level, breeding system and availability of a reference genome. Noteworthy features of the UGbS-Flex pipeline are an ability to use paired-end reads as input, an effective approach to cluster reads across samples with enhanced outputs, and maximization of SNP calling. We demonstrate use of the pipeline for the identification of several thousand high-confidence SNPs with high representation across samples in an F3-derived F2 population in the allotetraploid finger millet. Robust high-density genetic maps were constructed using the time-tested mapping program MAPMAKER which we upgraded to run efficiently and in a semi-automated manner in a Windows Command Prompt Environment. We exploited comparative GBS with one of the diploid ancestors of finger millet to assign linkage groups to subgenomes and demonstrate the presence of chromosomal rearrangements. The paper combines GBS protocol modifications, a novel flexible GBS analysis pipeline, UGbS-Flex, recommendations to maximize SNP identification, updated genetic mapping software, and the first high-density maps of finger millet.
    The modules used in the UGbS-Flex pipeline and for genetic mapping were applied to finger millet, an allotetraploid selfing species

  11. Identification of SNPs in chemerin gene and association with ...

    African Journals Online (AJOL)

    Chemerin is a novel adipokine that regulates adipogenesis and adipocyte metabolism via its own receptor. In this study, two novel SNPs (868A>G in exon 2 and 2692C>T in exon 5) of chemerin gene were identified by PCR-SSCP and DNA sequencing technology. The allele frequencies of the novel SNPs were determined ...

  12. Association between invasive ovarian cancer susceptibility and 11 best candidate SNPs from breast cancer genome-wide association study

    DEFF Research Database (Denmark)

    Song, Honglin; Ramus, Susan J; Kjaer, Susanne Krüger

    2009-01-01

    Because both ovarian and breast cancer are hormone-related and are known to have some predisposition genes in common, we evaluated 11 of the most significant hits (six with confirmed associations with breast cancer) from the breast cancer genome-wide association study for association with invasiv...

  13. Comparison of different methods for imputing genome-wide marker genotypes in Swedish and Finnish Red Cattle

    DEFF Research Database (Denmark)

    Ma, Peipei; Brøndum, Rasmus Froberg; Qin, Zahng

    2013-01-01

    This study investigated the imputation accuracy of different methods, considering both the minor allele frequency and relatedness between individuals in the reference and test data sets. Two data sets from the combined population of Swedish and Finnish Red Cattle were used to test the influence...... coefficient was lower when the minor allele frequency was lower. The results indicate that Beagle and IMPUTE2 provide the most robust and accurate imputation accuracies, but considering computing time and memory usage, FImpute is another alternative method....

  14. Synthetic Multiple-Imputation Procedure for Multistage Complex Samples

    Directory of Open Access Journals (Sweden)

    Zhou Hanzhi

    2016-03-01

    Multiple imputation (MI) is commonly used when item-level missing data are present. However, MI requires that survey design information be built into the imputation models. For multistage stratified clustered designs, this requires dummy variables to represent strata as well as primary sampling units (PSUs) nested within each stratum in the imputation model. Such a modeling strategy is not only operationally burdensome but also inferentially inefficient when there are many strata in the sample design. Complexity only increases when sampling weights need to be modeled. This article develops a general-purpose analytic strategy for population inference from complex sample designs with item-level missingness. In a simulation study, the proposed procedures demonstrate efficient estimation and good coverage properties. We also consider an application to accommodate missing body mass index (BMI) data in the analysis of BMI percentiles using National Health and Nutrition Examination Survey (NHANES) III data. We argue that the proposed methods offer an easy-to-implement solution to problems that are not well-handled by current MI techniques. Note that, while the proposed method borrows from the MI framework to develop its inferential methods, it is not designed as an alternative strategy to release multiply imputed datasets for complex sample design data, but rather as an analytic strategy in and of itself.

  15. A periodic pattern of SNPs in the human genome

    DEFF Research Database (Denmark)

    Madsen, Bo Eskerod; Villesen, Palle; Wiuf, Carsten

    2007-01-01

    By surveying a filtered, high-quality set of SNPs in the human genome, we have found that SNPs positioned 1, 2, 4, 6, or 8 bp apart are more frequent than SNPs positioned 3, 5, 7, or 9 bp apart. The observed pattern is not restricted to genomic regions that are known to cause sequencing or alignment errors, for example, transposable elements (SINE, LINE, and LTR), tandem repeats, and large duplicated regions. However, we found that the pattern is almost entirely confined to what we define as "periodic DNA." Periodic DNA is a genomic region with a high degree of periodicity in nucleotide usage... periodic DNA. Our results suggest that not all SNPs in the human genome are created by independent single nucleotide mutations, and that care should be taken in analysis of SNPs from periodic DNA. The latter may have important consequences for SNP and association studies.

  16. Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.

    Directory of Open Access Journals (Sweden)

    Ariel W Chan

    Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and

  17. Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.

    Science.gov (United States)

    Chan, Ariel W; Hamblin, Martha T; Jannink, Jean-Luc

    2016-01-01

    Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the

  18. Increasing imputation and prediction accuracy for Chinese Holsteins using joint Chinese-Nordic reference population

    DEFF Research Database (Denmark)

    Ma, Peipei; Lund, Mogens Sandø; Ding, X

    2015-01-01

    This study investigated the effect of including Nordic Holsteins in the reference population on the imputation accuracy and prediction accuracy for Chinese Holsteins. The data used in this study include 85 Chinese Holstein bulls genotyped with both 54K chip and 777K (HD) chip, 2862 Chinese cows...... was improved slightly when using the marker data imputed based on the combined HD reference data, compared with using the marker data imputed based on the Chinese HD reference data only. On the other hand, when using the combined reference population including 4398 Nordic Holstein bulls, the accuracy...... to increase reference population rather than increasing marker density...

  19. Genotype Imputation for Latinos Using the HapMap and 1000 Genomes Project Reference Panels

    Directory of Open Access Journals (Sweden)

    Xiaoyi Gao

    2012-06-01

    Genotype imputation is a vital tool in genome-wide association studies (GWAS) and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous and fastest growing minority group in the US. However, genotype imputation resources for Latinos are rather limited compared to individuals of European ancestry at present, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angeles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR+CEU+YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation-based analysis in Latinos.

  20. In Silico Analysis of FMR1 Gene Missense SNPs.

    Science.gov (United States)

    Tekcan, Akin

    2016-06-01

    The FMR1 gene, a member of the fragile X-related gene family, is responsible for fragile X syndrome (FXS). Missense single-nucleotide polymorphisms (SNPs) are responsible for many complex diseases. The effect of FMR1 gene missense SNPs is unknown. The aim of this study, using in silico techniques, was to analyze all known missense mutations that can affect the functionality of the FMR1 gene, leading to mental retardation (MR) and FXS. Data on the human FMR1 gene were collected from the Ensembl database (release 81), the National Center for Biotechnology Information dbSNP Short Genetic Variations database, the 1000 Genomes Browser, and the NHLBI Exome Sequencing Project Exome Variant Server. In silico analysis was then performed. One hundred twenty different missense SNPs of the FMR1 gene were determined. Of these, 11.66 % of the FMR1 gene missense SNPs were in highly conserved domains, and 83.33 % were in domains with high variety. The results of the in silico prediction analysis showed that 31.66 % of the FMR1 gene SNPs were disease related and that 50 % of SNPs had a pathogenic effect. The results of the structural and functional analysis revealed that although the R138Q mutation did not seem to have a damaging effect on the protein, the G266E and I304N SNPs appeared to disturb the interaction between the domains and affect the function of the protein. This is the first study to analyze all missense SNPs of the FMR1 gene. The results indicate the applicability of a bioinformatics approach to FXS and other FMR1-related diseases. I think that the analysis of FMR1 gene missense SNPs using bioinformatics methods would help diagnosis of FXS and other FMR1-related diseases.

  1. Flexible Modeling of Survival Data with Covariates Subject to Detection Limits via Multiple Imputation.

    Science.gov (United States)

    Bernhardt, Paul W; Wang, Huixia Judy; Zhang, Daowen

    2014-01-01

    Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.

  2. Data driven estimation of imputation error-a strategy for imputation with a reject option

    DEFF Research Database (Denmark)

    Bak, Nikolaj; Hansen, Lars Kai

    2016-01-01

    Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values i...

  3. Imputation methods for filling missing data in urban air pollution data for Malaysia

    Directory of Open Access Journals (Sweden)

    Nur Afiqah Zakaria

    2018-06-01

    Air quality measurement data obtained from continuous ambient air quality monitoring (CAAQM) stations usually contain missing data. The missing observations usually occur due to machine failure, routine maintenance and human error. In this study, hourly monitoring data of CO, O3, PM10, SO2, NOx, NO2, ambient temperature and humidity were used to evaluate four imputation methods (Mean Top Bottom, Linear Regression, Multiple Imputation and Nearest Neighbour). Missing data were simulated in the air pollutant observations at four percentages, i.e. 5%, 10%, 15% and 20%. Four performance measures, namely the Mean Absolute Error, Root Mean Squared Error, Coefficient of Determination and Index of Agreement, were used to describe the goodness of fit of the imputation methods. Based on the performance measures, the Mean Top Bottom method was selected as the most appropriate imputation method for filling in the missing values in air pollutant data.
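
    A small sketch may help make the comparison in this abstract concrete. It rests on two assumptions that the abstract does not confirm: "Mean Top Bottom" is read here as filling a gap with the mean of the nearest observed values above and below it, and the Index of Agreement is taken to be Willmott's d. The function names and the eight-value series are hypothetical, and the Coefficient of Determination is computed as the squared Pearson correlation.

```python
import math


def mean_top_bottom(series):
    """Fill each None with the mean of the nearest observed values on either side.

    Assumed reading of "Mean Top Bottom"; gaps at an edge use the one neighbour
    that exists, and a series with no observed values is returned unchanged.
    """
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        above = next((series[j] for j in range(i - 1, -1, -1)
                      if series[j] is not None), None)
        below = next((series[j] for j in range(i + 1, len(series))
                      if series[j] is not None), None)
        neighbours = [v for v in (above, below) if v is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


def mae(observed, predicted):
    """Mean Absolute Error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root Mean Squared Error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of Determination as the squared Pearson correlation."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


def index_of_agreement(observed, predicted):
    """Willmott's index of agreement d (1.0 means perfect agreement)."""
    mean_o = sum(observed) / len(observed)
    numerator = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    denominator = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2
                      for o, p in zip(observed, predicted))
    return 1 - numerator / denominator


# Hypothetical hourly readings; two values are hidden to simulate gaps.
observed = [10.0, 12.0, 15.0, 14.0, 11.0, 9.0, 13.0, 16.0]
with_gaps = [10.0, 12.0, None, 14.0, 11.0, None, 13.0, 16.0]
filled = mean_top_bottom(with_gaps)

print("filled:", filled)
print("MAE:", mae(observed, filled))
print("RMSE:", round(rmse(observed, filled), 4))
print("R2:", round(r_squared(observed, filled), 4))
print("d:", round(index_of_agreement(observed, filled), 4))
```

    The measures are computed over the whole series here; restricting them to the masked positions is an equally reasonable choice and would give larger error values.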

  4. Development of a spreadsheet for SNPs typing using Microsoft EXCEL.

    Science.gov (United States)

    Hashiyada, Masaki; Itakura, Yukio; Takahashi, Shirushi; Sakai, Jun; Funayama, Masato

    2009-04-01

    Single-nucleotide polymorphisms (SNPs) have some characteristics that make them very appropriate for forensic studies and applications. In our institute, SNP typing was performed with TaqMan SNP Genotyping Assays using the ABI PRISM 7500 FAST Real-Time PCR System (Applied Biosystems) and Sequence Detection Software ver. 1.4 (Applied Biosystems). The TaqMan method requires two positive controls (alleles 1 and 2) and one negative control to analyze each SNP locus. Therefore, up to 24 loci per person can be analyzed on a 96-well plate at the same time. If SNP analysis is to be applied to biometric authentication, 48 or more loci are required to identify a person. In this study, we designed a spreadsheet package using Microsoft EXCEL, with population data taken from our studies of 120 SNPs. On the spreadsheet, we defined SNP types using "template files" instead of positive and negative controls. The template files consisted of the results of 94 unknown samples and two negative controls for each of the 120 SNP loci we had previously studied. Using these files, the spreadsheet could analyze 96 SNPs on a 96-well plate simultaneously.

  5. Imputation of missing data in time series for air pollutants

    Science.gov (United States)

    Junger, W. L.; Ponce de Leon, A.

    2015-02-01

    Missing data are major concerns in epidemiological studies of the health effects of environmental air pollutants. This article presents an imputation-based method that is suitable for multivariate time series data, which uses the EM algorithm under the assumption of normal distribution. Different approaches are considered for filtering the temporal component. A simulation study was performed to assess validity and performance of proposed method in comparison with some frequently used methods. Simulations showed that when the amount of missing data was as low as 5%, the complete data analysis yielded satisfactory results regardless of the generating mechanism of the missing data, whereas the validity began to degenerate when the proportion of missing values exceeded 10%. The proposed imputation method exhibited good accuracy and precision in different settings with respect to the patterns of missing observations. Most of the imputations obtained valid results, even under missing not at random. The methods proposed in this study are implemented as a package called mtsdi for the statistical software system R.

  6. Identification of functional SNPs in the 5-prime flanking sequences of human genes

    Directory of Open Access Journals (Sweden)

    Lenhard Boris

    2005-02-01

    Background: Over 4 million single nucleotide polymorphisms (SNPs) are currently reported to exist within the human genome. Only a small fraction of these SNPs alter gene function or expression, and therefore might be associated with a cell phenotype. These functional SNPs are consequently important in understanding human health. Information related to functional SNPs in candidate disease genes is critical for cost effective genetic association studies, which attempt to understand the genetics of complex diseases like diabetes, Alzheimer's, etc. Robust methods for the identification of functional SNPs are therefore crucial. We report one such experimental approach. Results: Sequences conserved between mouse and human genomes, within 5 kilobases of the 5-prime end of 176 GPCR genes, were screened for SNPs. Sequences flanking these SNPs were scored for transcription factor binding sites. Allelic pairs resulting in a significant score difference were predicted to influence the binding of transcription factors (TFs). Ten such SNPs were selected for mobility shift assays (EMSA), resulting in 7 of them exhibiting a reproducible shift. The full-length promoter regions with 4 of the 7 SNPs were cloned in a Luciferase based plasmid reporter system. Two out of the 4 SNPs exhibited differential promoter activity in several human cell lines. Conclusions: We propose a method for effective selection of functional, regulatory SNPs that are located in evolutionary conserved 5-prime flanking regions (5'-FR) of human genes and influence the activity of the transcriptional regulatory region. Some SNPs behave differently in different cell types.

  7. Missing value imputation for epistatic MAPs

    LENUS (Irish Health Repository)

    Ryan, Colm

    2010-04-20

    Background: Epistatic miniarray profiling (E-MAPs) is a high-throughput approach capable of quantifying aggravating or alleviating genetic interactions between gene pairs. The datasets resulting from E-MAP experiments typically take the form of a symmetric pairwise matrix of interaction scores. These datasets have a significant number of missing values - up to 35% - that can reduce the effectiveness of some data analysis techniques and prevent the use of others. An effective method for imputing interactions would therefore increase the types of possible analysis, as well as increase the potential to identify novel functional interactions between gene pairs. Several methods have been developed to handle missing values in microarray data, but it is unclear how applicable these methods are to E-MAP data because of their pairwise nature and the significantly larger number of missing values. Here we evaluate four alternative imputation strategies, three local (Nearest neighbor-based) and one global (PCA-based), that have been modified to work with symmetric pairwise data. Results: We identify different categories for the missing data based on their underlying cause, and show that values from the largest category can be imputed effectively. We compare local and global imputation approaches across a variety of distinct E-MAP datasets, showing that both are competitive and preferable to filling in with zeros. In addition we show that these methods are effective in an E-MAP from a different species, suggesting that pairwise imputation techniques will be increasingly useful as analogous epistasis mapping techniques are developed in different species. We show that strongly alleviating interactions are significantly more difficult to predict than strongly aggravating interactions. Finally we show that imputed interactions, generated using nearest neighbor methods, are enriched for annotations in the same manner as measured interactions. Therefore our method potentially

  8. Association study of IL10, IL1beta, and IL1RN and schizophrenia using tag SNPs from a comprehensive database: suggestive association with rs16944 at IL1beta.

    Science.gov (United States)

    Shirts, Brian H; Wood, Joel; Yolken, Robert H; Nimgaonkar, Vishwajit L

    2006-12-01

    Genetic association studies of several candidate cytokine genes have been motivated by evidence of immune dysfunction among patients with schizophrenia. Intriguing but inconsistent associations have been reported with polymorphisms of three positional candidate genes, namely IL1beta, IL1RN, and IL10. We used comprehensive sequencing data from the Seattle SNPs database to select tag SNPs that represent all common polymorphisms in the Caucasian population at these loci. Associations with 28 tag SNPs were evaluated in 478 cases and 501 unscreened control individuals, while accounting for population sub-structure using the genomic control method. The samples were also stratified by gender, diagnostic category, and exposure to infectious agents. Significant association was not detected after correcting for multiple comparisons. However, meta-analysis of our data combined with previously published association studies of rs16944 (IL1beta -511) suggests that the C allele confers modest risk for schizophrenia among individuals reporting Caucasian ancestry, but not Asians (Caucasians, n=819 cases, 1292 controls; p=0.0013, OR=1.24, 95% CI 1.09, 1.41).

  9. Cost reduction for web-based data imputation

    KAUST Repository

    Li, Zhixu; Shang, Shuo; Xie, Qing; Zhang, Xiangliang

    2014-01-01

    Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity

  10. A nonparametric multiple imputation approach for missing categorical data

    Directory of Open Access Journals (Sweden)

    Muhan Zhou

    2017-06-01

    Background: Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. Methods: We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value and the other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from the two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results: The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions: We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with

  11. Assessment of imputation methods using varying ecological information to fill the gaps in a tree functional trait database

    Science.gov (United States)

    Poyatos, Rafael; Sus, Oliver; Vilà-Cabrera, Albert; Vayreda, Jordi; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi

    2016-04-01

    Plant functional traits are increasingly being used in ecosystem ecology thanks to the growing availability of large ecological databases. However, these databases usually contain a large fraction of missing data because measuring plant functional traits systematically is labour-intensive and because most databases are compilations of datasets with different sampling designs. As a result, within a given database, there is an inevitable variability in the number of traits available for each data entry and/or the species coverage in a given geographical area. The presence of missing data may severely bias trait-based analyses, such as the quantification of trait covariation or trait-environment relationships and may hamper efforts towards trait-based modelling of ecosystem biogeochemical cycles. Several data imputation (i.e. gap-filling) methods have been recently tested on compiled functional trait databases, but the performance of imputation methods applied to a functional trait database with a regular spatial sampling has not been thoroughly studied. Here, we assess the effects of data imputation on five tree functional traits (leaf biomass to sapwood area ratio, foliar nitrogen, maximum height, specific leaf area and wood density) in the Ecological and Forest Inventory of Catalonia, an extensive spatial database (covering 31900 km2). We tested the performance of species mean imputation, single imputation by the k-nearest neighbors algorithm (kNN) and a multiple imputation method, Multivariate Imputation with Chained Equations (MICE) at different levels of missing data (10%, 30%, 50%, and 80%). We also assessed the changes in imputation performance when additional predictors (species identity, climate, forest structure, spatial structure) were added in kNN and MICE imputations. We evaluated the imputed datasets using a battery of indexes describing departure from the complete dataset in trait distribution, in the mean prediction error, in the correlation matrix

  12. Fully conditional specification in multivariate imputation

    NARCIS (Netherlands)

    van Buuren, S.; Brand, J. P.L.; Groothuis-Oudshoorn, C. G.M.; Rubin, D. B.

    2006-01-01

    The use of the Gibbs sampler with fully conditionally specified models, where the distribution of each variable given the other variables is the starting point, has become a popular method to create imputations in incomplete multivariate data. The theoretical weakness of this approach is that the

  13. Effects of Different Missing Data Imputation Techniques on the Performance of Undiagnosed Diabetes Risk Prediction Models in a Mixed-Ancestry Population of South Africa.

    Directory of Open Access Journals (Sweden)

    Katya L Masconi

    Imputation techniques used to handle missing data are based on the principle of replacement. It is widely advocated that multiple imputation is superior to other imputation methods; however, studies have suggested that simple methods for filling missing data can be just as accurate as complex methods. The objective of this study was to implement a number of simple and more complex imputation methods, and assess the effect of these techniques on the performance of undiagnosed diabetes risk prediction models during external validation. Data from the Cape Town Bellville-South cohort served as the basis for this study. Imputation methods and models were identified via recent systematic reviews. Models' discrimination was assessed and compared using C-statistic and non-parametric methods, before and after recalibration through simple intercept adjustment. The study sample consisted of 1256 individuals, of whom 173 were excluded due to previously diagnosed diabetes. Of the final 1083 individuals, 329 (30.4%) had missing data. Family history had the highest proportion of missing data (25%). Imputation of the outcome, undiagnosed diabetes, was highest in stochastic regression imputation (163 individuals). Overall, deletion resulted in the lowest model performances while simple imputation yielded the highest C-statistic for the Cambridge Diabetes Risk model, Kuwaiti Risk model, Omani Diabetes Risk model and Rotterdam Predictive model. Multiple imputation only yielded the highest C-statistic for the Rotterdam Predictive model, which was matched by simpler imputation methods. Deletion was confirmed as a poor technique for handling missing data. However, despite the emphasized disadvantages of simpler imputation methods, this study showed that implementing these methods results in similar predictive utility for undiagnosed diabetes when compared to multiple imputation.

  14. Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information

    Science.gov (United States)

    Poyatos, Rafael; Sus, Oliver; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi

    2018-05-01

    The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km2). We simulated gaps at different missingness levels (10-80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits. MICE informed by relevant ecological variables allowed us to fill the gaps in
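
    One of the methods compared in this record, k nearest neighbours (kNN) imputation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `knn_impute`, the unweighted Euclidean distance, the use of only fully observed rows as donors, and the toy trait matrix are all hypothetical choices made for the example.

```python
import math

def knn_impute(rows, target, k=3):
    """Fill None values in column `target` with the mean of that column
    over the k nearest fully observed rows (the "donors")."""
    donors = [r for r in rows if all(v is not None for v in r)]
    if not donors:
        raise ValueError("kNN imputation needs at least one complete row")

    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(list(row))
            continue
        # Distance uses only the columns observed in the incomplete row.
        cols = [j for j, v in enumerate(row) if v is not None and j != target]

        def distance(donor, row=row, cols=cols):
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in cols))

        nearest = sorted(donors, key=distance)[:k]
        new_row = list(row)
        new_row[target] = sum(d[target] for d in nearest) / len(nearest)
        filled.append(new_row)
    return filled

# Toy trait matrix (hypothetical values): column 0 is a predictor trait,
# column 1 is the trait with a gap in the last row.
traits = [[1.0, 10.0], [2.0, 20.0], [10.0, 100.0], [1.5, None]]
completed = knn_impute(traits, target=1, k=2)
```

    In practice the traits would usually be standardised before computing distances, since otherwise the trait with the largest numeric range dominates the neighbour search.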

  15. Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information

    Directory of Open Access Journals (Sweden)

    R. Poyatos

    2018-05-01

    The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km2). We simulated gaps at different missingness levels (10–80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits. MICE informed by relevant ecological variables

  16. Population differentiation in allele frequencies of obesity-associated SNPs.

    Science.gov (United States)

    Mao, Linyong; Fang, Yayin; Campbell, Michael; Southerland, William M

    2017-11-10

    Obesity is emerging as a global health problem, with more than one-third of the world's adult population being overweight or obese. In this study, we investigated worldwide population differentiation in allele frequencies of obesity-associated SNPs (single nucleotide polymorphisms). We collected a total of 225 obesity-associated SNPs from a public database. Their population-level allele frequencies were derived based on the genotype data from the 1000 Genomes Project (phase 3). We used a hypergeometric model to assess whether the effect allele at a given SNP is significantly enriched or depleted in each of the 26 populations surveyed in the 1000 Genomes Project with respect to the overall pooled population. Our results indicate that 195 out of 225 SNPs (86.7%) possess effect alleles significantly enriched or depleted in at least one of the 26 populations. Populations within the same continental group exhibit similar allele enrichment/depletion patterns whereas inter-continental populations show distinct patterns. Among the 225 SNPs, 15 SNPs cluster in the first intron region of the FTO gene, which is a major gene associated with body-mass index (BMI) and fat mass. African populations exhibit much smaller blocks of LD (linkage disequilibrium) among these 15 SNPs while European and Asian populations have larger blocks. To estimate the cumulative effect of all variants associated with obesity, we developed the personal composite genetic risk score for obesity. Our results indicate that the East Asian populations have the lowest averages of the composite risk scores, whereas three European populations have the highest averages. In addition, the population-level average of composite genetic risk scores is significantly correlated (R² = 0.35, P = 0.0060) with obesity prevalence. We have detected substantial population differentiation in allele frequencies of obesity-associated SNPs. The results will help elucidate the genetic basis which may contribute to population
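
    The hypergeometric enrichment/depletion test mentioned in this record can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and all allele counts are hypothetical, and the sketch assumes the question is how likely it is to observe at least (or at most) k effect alleles among the n alleles sampled in one population, given K effect alleles among N alleles in the pooled sample.

```python
from math import comb

def hypergeom_pmf(i, N, K, n):
    """P(X = i) when drawing n alleles without replacement from N alleles,
    K of which are effect alleles."""
    return comb(K, i) * comb(N - K, n - i) / comb(N, n)

def enrichment_p(k, N, K, n):
    """Upper tail P(X >= k): small values suggest enrichment."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))

def depletion_p(k, N, K, n):
    """Lower tail P(X <= k): small values suggest depletion."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(0, min(k, K, n) + 1))

# Hypothetical counts: 5000 pooled alleles, 1500 of them effect alleles;
# one population contributes 200 alleles, 85 of them effect alleles.
p_enriched = enrichment_p(85, N=5000, K=1500, n=200)
p_depleted = depletion_p(85, N=5000, K=1500, n=200)
```

    Exact integer binomial coefficients keep the sketch dependency-free; for genome-scale counts a statistics library's hypergeometric distribution would be the more practical choice.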

  17. Family-based Association Analyses of Imputed Genotypes Reveal Genome-Wide Significant Association of Alzheimer’s disease with OSBPL6, PTPRG and PDCL3

    Science.gov (United States)

    Herold, Christine; Hooli, Basavaraj V.; Mullin, Kristina; Liu, Tian; Roehr, Johannes T; Mattheisen, Manuel; Parrado, Antonio R.; Bertram, Lars; Lange, Christoph; Tanzi, Rudolph E.

    2015-01-01

    The genetic basis of Alzheimer's disease (AD) is complex and heterogeneous. Over 200 highly penetrant pathogenic variants in the genes APP, PSEN1 and PSEN2 cause a subset of early-onset familial Alzheimer's disease (EOFAD). On the other hand, susceptibility to late-onset forms of AD (LOAD) is indisputably associated to the ε4 allele in the gene APOE, and more recently to variants in more than two-dozen additional genes identified in the large-scale genome-wide association studies (GWAS) and meta-analyses reports. Taken together however, although the heritability in AD is estimated to be as high as 80%, a large proportion of the underlying genetic factors still remain to be elucidated. In this study we performed a systematic family-based genome-wide association and meta-analysis on close to 15 million imputed variants from three large collections of AD families (~3,500 subjects from 1,070 families). Using a multivariate phenotype combining affection status and onset age, meta-analysis of the association results revealed three single nucleotide polymorphisms (SNPs) that achieved genome-wide significance for association with AD risk: rs7609954 in the gene PTPRG (P-value = 3.98·10−08), rs1347297 in the gene OSBPL6 (P-value = 4.53·10−08), and rs1513625 near PDCL3 (P-value = 4.28·10−08). In addition, rs72953347 in OSBPL6 (P-value = 6.36·10−07) and two SNPs in the gene CDKAL1 showed marginally significant association with LOAD (rs10456232, P-value: 4.76·10−07; rs62400067, P-value: 3.54·10−07). In summary, family-based GWAS meta-analysis of imputed SNPs revealed novel genomic variants in (or near) PTPRG, OSBPL6, and PDCL3 that influence risk for AD with genome-wide significance. PMID:26830138

  18. Inclusion of Population-specific Reference Panel from India to the 1000 Genomes Phase 3 Panel Improves Imputation Accuracy.

    Science.gov (United States)

    Ahmad, Meraj; Sinha, Anubhav; Ghosh, Sreya; Kumar, Vikrant; Davila, Sonia; Yajnik, Chittaranjan S; Chandak, Giriraj R

    2017-07-27

    Imputation is a computational method based on the principle of haplotype sharing allowing enrichment of genome-wide association study datasets. It depends on the haplotype structure of the population and density of the genotype data. The 1000 Genomes Project led to the generation of imputation reference panels which have been used globally. However, recent studies have shown that population-specific panels provide better enrichment of genome-wide variants. We compared the imputation accuracy using the 1000 Genomes phase 3 reference panel and a panel generated from genome-wide data on 407 individuals from Western India (WIP). The concordance of imputed variants was cross-checked with next-generation re-sequencing data on a subset of genomic regions. Further, using the genome-wide data from 1880 individuals, we demonstrate that WIP works better than the 1000 Genomes phase 3 panel and, when merged with it, significantly improves the imputation accuracy throughout the minor allele frequency range. We also show that imputation using only the South Asian component of the 1000 Genomes phase 3 panel works as well as the merged panel, making it a computationally less intensive job. Thus, our study stresses that imputation accuracy using the 1000 Genomes phase 3 panel can be further improved by including population-specific reference panels from South Asia.

  19. Comparison of missing value imputation methods in time series: the case of Turkish meteorological data

    Science.gov (United States)

    Yozgatligil, Ceylan; Aslan, Sipan; Iyigun, Cem; Batmaz, Inci

    2013-04-01

    This study aims to compare several imputation methods to complete the missing values of spatio-temporal meteorological time series. To this end, six imputation methods are assessed with respect to various criteria including accuracy, robustness, precision, and efficiency for artificially created missing data in monthly total precipitation and mean temperature series obtained from the Turkish State Meteorological Service. Of these methods, simple arithmetic average, normal ratio (NR), and NR weighted with correlations comprise the simple ones, whereas multilayer perceptron type neural network and multiple imputation strategy adopted by Monte Carlo Markov Chain based on expectation-maximization (EM-MCMC) are computationally intensive ones. In addition, we propose a modification on the EM-MCMC method. Besides using a conventional accuracy measure based on squared errors, we also suggest the correlation dimension (CD) technique of nonlinear dynamic time series analysis which takes spatio-temporal dependencies into account for evaluating imputation performances. Depending on the detailed graphical and quantitative analysis, it can be said that although computational methods, particularly EM-MCMC method, are computationally inefficient, they seem favorable for imputation of meteorological time series with respect to different missingness periods considering both measures and both series studied. To conclude, using the EM-MCMC algorithm for imputing missing values before conducting any statistical analyses of meteorological data will definitely decrease the amount of uncertainty and give more robust results. Moreover, the CD measure can be suggested for the performance evaluation of missing data imputation particularly with computational methods since it gives more precise results in meteorological time series.
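
    Of the simple methods listed in this record, the normal ratio (NR) method lends itself to a short sketch. The version below assumes the common textbook form, in which each neighbouring station's observation is rescaled by the ratio of the target station's long-term normal to that neighbour's normal and the rescaled values are averaged; the study itself may use a different variant. The function name and the station values are hypothetical.

```python
def normal_ratio_estimate(target_normal, neighbour_normals, neighbour_values):
    """Estimate a missing value at a target station.

    target_normal     -- long-term mean (the "normal") at the target station
    neighbour_normals -- long-term means at the neighbouring stations
    neighbour_values  -- neighbours' observations for the missing period
    """
    if not neighbour_values or len(neighbour_normals) != len(neighbour_values):
        raise ValueError("need one normal per neighbouring observation")
    scaled = [
        target_normal / normal * value
        for normal, value in zip(neighbour_normals, neighbour_values)
    ]
    return sum(scaled) / len(scaled)

# Hypothetical monthly precipitation (mm): the target station's normal is 100,
# and two neighbours with normals 50 and 200 recorded 10 and 40.
estimate = normal_ratio_estimate(100.0, [50.0, 200.0], [10.0, 40.0])
```

    The correlation-weighted variant named in the abstract would replace the plain average with a weighted one; the weights are not specified in the record, so they are left out here.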

  20. Accuracy of hemoglobin A1c imputation using fasting plasma glucose in diabetes research using electronic health records data

    Directory of Open Access Journals (Sweden)

    Stanley Xu

    2014-05-01

    In studies that use electronic health record data, imputation of important data elements such as glycated hemoglobin (A1c) has become common. However, few studies have systematically examined the validity of various imputation strategies for missing A1c values. We derived a complete dataset using an incident diabetes population that has no missing values in A1c, fasting and random plasma glucose (FPG and RPG), age, and gender. We then created missing A1c values under two assumptions: missing completely at random (MCAR) and missing at random (MAR). We then imputed A1c values, compared the imputed values to the true A1c values, and used these data to assess the impact of A1c on initiation of antihyperglycemic therapy. Under MCAR, imputation of A1c based on FPG (1) estimated a continuous A1c within ± 1.88% of the true A1c 68.3% of the time; (2) estimated a categorical A1c within ± one category from the true A1c about 50% of the time. Including RPG in imputation slightly improved the precision but did not improve the accuracy. Under MAR, including gender and age in addition to FPG improved the accuracy of imputed continuous A1c but not categorical A1c. Moreover, imputation of up to 33% of missing A1c values did not change the accuracy and precision and did not alter the impact of A1c on initiation of antihyperglycemic therapy. When using A1c values as a predictor variable, a simple imputation algorithm based only on age, sex, and fasting plasma glucose gave acceptable results.

  1. PRIMAL: Fast and accurate pedigree-based imputation from sequence data in a founder population.

    Directory of Open Access Journals (Sweden)

    Oren E Livne

    2015-03-01

    Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.

  2. iVAR: a program for imputing missing data in multivariate time series using vector autoregressive models.

    Science.gov (United States)

    Liu, Siwei; Molenaar, Peter C M

    2014-12-01

    This article introduces iVAR, an R program for imputing missing data in multivariate time series on the basis of vector autoregressive (VAR) models. We conducted a simulation study to compare iVAR with three methods for handling missing data: listwise deletion, imputation with sample means and variances, and multiple imputation ignoring time dependency. The results showed that iVAR produces better estimates for the cross-lagged coefficients than do the other three methods. We demonstrate the use of iVAR with an empirical example of time series electrodermal activity data and discuss the advantages and limitations of the program.

  3. Dealing with missing data in a multi-question depression scale: a comparison of imputation methods

    Directory of Open Access Journals (Sweden)

    Stuart Heather

    2006-12-01

    Background: Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputation techniques for dealing with missing data in the Zung Self-reported Depression scale (SDS). Methods: 1580 participants from a surgical outcomes study completed the SDS. The SDS is a 20 question scale that respondents complete by circling a value of 1 to 4 for each question. The sum of the responses is calculated and respondents are classified as exhibiting depressive symptoms when their total score is over 40. Missing values were simulated by randomly selecting questions whose values were then deleted (a missing completely at random simulation). Additionally, a missing at random and missing not at random simulation were completed. Six imputation methods were then considered: (1) multiple imputation, (2) single regression, (3) individual mean, (4) overall mean, (5) participant's preceding response, and (6) random selection of a value from 1 to 4. For each method, the imputed mean SDS score and standard deviation were compared to the population statistics. The Spearman correlation coefficient, percent misclassified and the Kappa statistic were also calculated. Results: When 10% of values are missing, all the imputation methods except random selection produce Kappa statistics greater than 0.80 indicating 'near perfect' agreement. MI produces the most valid imputed values with a high Kappa statistic (0.89), although both single regression and individual mean imputation also produced favorable results. As the percent of missing information increased to 30%, or when unbalanced missing data were introduced, MI maintained a high Kappa statistic. The individual mean and single regression method produced Kappas in the 'substantial agreement' range
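
    The individual mean method and the scoring rule described in this record can be sketched as follows. This is an illustration under assumptions, not the study's code: "individual mean" is taken to mean the average of the items that the same participant did answer, and the function names and example responses are hypothetical. The cutoff (total score over 40) and the 20-item, 1-to-4 format come from the abstract.

```python
def impute_individual_mean(responses):
    """Replace unanswered items (None) with the mean of the participant's
    answered items."""
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("participant answered no items")
    mean = sum(answered) / len(answered)
    return [mean if r is None else r for r in responses]

def has_depressive_symptoms(responses, cutoff=40):
    """Classify a participant from the total score after imputation."""
    return sum(impute_individual_mean(responses)) > cutoff

# Hypothetical participant: 18 items answered with 3, two items skipped.
participant = [3] * 18 + [None, None]
completed = impute_individual_mean(participant)
flagged = has_depressive_symptoms(participant)
```

    Because the imputed value is a mean, the completed scale can contain non-integer item scores; whether to round them before summing is a design choice the abstract does not settle.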

  4. Multiple Imputation of a Randomly Censored Covariate Improves Logistic Regression Analysis.

    Science.gov (United States)

    Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A

    2016-01-01

    Randomly censored covariates arise frequently in epidemiologic studies. The most commonly used methods, including complete case and single imputation or substitution, suffer from inefficiency and bias. They make strong parametric assumptions or they consider limit of detection censoring only. We employ multiple imputation, in conjunction with semi-parametric modeling of the censored covariate, to overcome these shortcomings and to facilitate robust estimation. We develop a multiple imputation approach for randomly censored covariates within the framework of a logistic regression model. We use the non-parametric estimate of the covariate distribution or the semiparametric Cox model estimate in the presence of additional covariates in the model. We evaluate this procedure in simulations, and compare its operating characteristics to those from the complete case analysis and a survival regression approach. We apply the procedures to an Alzheimer's study of the association between amyloid positivity and maternal age of onset of dementia. Multiple imputation achieves lower standard errors and higher power than the complete case approach under heavy and moderate censoring and is comparable under light censoring. The survival regression approach achieves the highest power among all procedures, but does not produce interpretable estimates of association. Multiple imputation offers a favorable alternative to complete case analysis and ad hoc substitution methods in the presence of randomly censored covariates within the framework of logistic regression.

  5. Clustering with Missing Values: No Imputation Required

    Science.gov (United States)

    Wagstaff, Kiri

    2004-01-01

    Clustering algorithms can identify groups in large data sets, such as star catalogs and hyperspectral images. In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization). Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them. In contrast, we present a method for encoding partially observed features as a set of supplemental soft constraints and introduce the KSC algorithm, which incorporates constraints into the clustering process. In experiments on artificial data and data from the Sloan Digital Sky Survey, we show that soft constraints are an effective way to enable clustering with missing values.

  6. BRITS: Bidirectional Recurrent Imputation for Time Series

    OpenAIRE

    Cao, Wei; Wang, Dong; Li, Jian; Zhou, Hao; Li, Lei; Li, Yitan

    2018-01-01

    Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contains many missing values. Given multiple correlated time series data, how to fill in missing values and to predict their class labels? Existing imputation methods often impose strong assumptions of the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing va...

  7. Multiple imputation by chained equations for systematically and sporadically missing multilevel data.

    Science.gov (United States)

    Resche-Rigon, Matthieu; White, Ian R

    2018-06-01

    In multilevel settings such as individual participant data meta-analysis, a variable is 'systematically missing' if it is wholly missing in some clusters and 'sporadically missing' if it is partly missing in some clusters. Previously proposed methods to impute incomplete multilevel data handle either systematically or sporadically missing data, but frequently both patterns are observed. We describe a new multiple imputation by chained equations (MICE) algorithm for multilevel data with arbitrary patterns of systematically and sporadically missing variables. The algorithm is described for multilevel normal data but can easily be extended for other variable types. We first propose two methods for imputing a single incomplete variable: an extension of an existing method and a new two-stage method which conveniently allows for heteroscedastic data. We then discuss the difficulties of imputing missing values in several variables in multilevel data using MICE, and show that even the simplest joint multilevel model implies conditional models which involve cluster means and heteroscedasticity. However, a simulation study finds that the proposed methods can be successfully combined in a multilevel MICE procedure, even when cluster means are not included in the imputation models.

  8. VIGAN: Missing View Imputation with Generative Adversarial Networks.

    Science.gov (United States)

    Shang, Chao; Palmer, Aaron; Sun, Jiangwen; Chen, Ko-Shin; Lu, Jin; Bi, Jinbo

    2017-01-01

    In an era when big data are becoming the norm, there is less concern with the quantity but more with the quality and completeness of the data. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. In particular, when certain samples miss an entire view of data, it creates the missing view problem. Classic multiple imputation or matrix completion methods are hardly effective here, because the specific view offers no information on which to base imputation for such samples. The commonly-used simple method of removing samples with a missing view can dramatically reduce sample size, thus diminishing the statistical power of a subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly-sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. Then, by optimizing the GAN and DAE jointly, our model enables the knowledge integration for domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach by comparing against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further proves the effectiveness and usability of this approach in life science.

  9. Effect of imputing markers from a low-density chip on the reliability of genomic breeding values in Holstein populations

    DEFF Research Database (Denmark)

    Dassonneville, R; Brøndum, Rasmus Froberg; Druet, T

    2011-01-01

    The purpose of this study was to investigate the imputation error and loss of reliability of direct genomic values (DGV) or genomically enhanced breeding values (GEBV) when using genotypes imputed from a 3,000-marker single nucleotide polymorphism (SNP) panel to a 50,000-marker SNP panel. Data...... of missing markers and prediction of breeding values were performed using 2 different reference populations in each country: either a national reference population or a combined EuroGenomics reference population. Validation for accuracy of imputation and genomic prediction was done based on national test...... with a national reference data set gave an absolute loss of 0.05 in mean reliability of GEBV in the French study, whereas a loss of 0.03 was obtained for reliability of DGV in the Nordic study. When genotypes were imputed using the EuroGenomics reference, a loss of 0.02 in mean reliability of GEBV was detected...

  10. A New Missing Data Imputation Algorithm Applied to Electrical Data Loggers

    Directory of Open Access Journals (Sweden)

    Concepción Crespo Turrado

    2015-12-01

    Nowadays, data collection is a key process in the study of electrical power networks when searching for harmonics and a lack of balance among phases. In this context, the lack of data of any of the main electrical variables (phase-to-neutral voltage, phase-to-phase voltage, and current in each phase and power factor) adversely affects any time series study performed. When this occurs, a data imputation process must be accomplished in order to substitute the data that is missing for estimated values. This paper presents a novel missing data imputation method based on multivariate adaptive regression splines (MARS) and compares it with the well-known technique called multivariate imputation by chained equations (MICE). The results obtained demonstrate how the proposed method outperforms the MICE algorithm.

  11. On multivariate imputation and forecasting of decadal wind speed missing data.

    Science.gov (United States)

    Wesonga, Ronald

    2015-01-01

    This paper demonstrates the application of multiple imputation by chained equations and time series forecasting to wind speed data. The study was motivated by the high prevalence of missing historic wind speed data. Multiple imputation by chained equations, based on the fully conditional specification, provided reliable imputations of the missing wind speed data. Further, the forecasting model shows a smoothing parameter, alpha (0.014), close to zero, confirming that recent past observations are more suitable for use to forecast wind speeds. The maximum decadal wind speed for Entebbe International Airport was estimated to be 17.6 metres per second at a 0.05 level of significance with a bound on the error of estimation of 10.8 metres per second. The large bound on the error of estimation confirms the dynamic tendencies of wind speed at the airport under study.

  12. SNPs in PPARG associate with type 2 diabetes and interact with physical activity

    DEFF Research Database (Denmark)

    Oskari Kilpeläinen, Tuomas; Lakka, Timo A; Laaksonen, David E

    2008-01-01

    To study the associations of seven single-nucleotide polymorphisms (SNPs) in the peroxisome proliferator-activated receptor gamma (PPARG) gene with the conversion from impaired glucose tolerance (IGT) to type 2 diabetes (T2D), and the interactions of the SNPs with physical activity (PA).

  13. Time Series Imputation via L1 Norm-Based Singular Spectrum Analysis

    Science.gov (United States)

    Kalantari, Mahdi; Yarmohammadi, Masoud; Hassani, Hossein; Silva, Emmanuel Sirimal

    Missing values in time series data is a well-known and important problem which many researchers have studied extensively in various fields. In this paper, a new nonparametric approach for missing value imputation in time series is proposed. The main novelty of this research is applying the L1 norm-based version of Singular Spectrum Analysis (SSA), namely L1-SSA which is robust against outliers. The performance of the new imputation method has been compared with many other established methods. The comparison is done by applying them to various real and simulated time series. The obtained results confirm that the SSA-based methods, especially L1-SSA can provide better imputation in comparison to other methods.

  14. Assessment of heterogeneity between European Populations: a Baltic and Danish replication case-control study of SNPs from a recent European ulcerative colitis genome wide association study

    DEFF Research Database (Denmark)

    Andersen, Vibeke; Ernst, Anja; Sventoraityte, Jurgita

    2011-01-01

    Background: Differences in the genetic architecture of inflammatory bowel disease between different European countries and ethnicities have previously been reported. In the present study, we wanted to assess the role of 11 newly identified UC risk variants, derived from a recent European UC genome-wide association study (GWAS) (Franke et al., 2010), for 1) association with UC in the Nordic countries, 2) for population heterogeneity between the Nordic countries and the rest of Europe, and, 3) eventually, to drive some of the previous findings towards overall genome-wide significance. Methods... the combined Baltic, Danish, and Norwegian panel versus the combined German, British, Belgian, and Greek panel (rs7520292 (P = 0.001), rs12518307 (P = 0.007), and rs2395609 (TCP11) (P = 0.01), respectively). No SNP reached genome-wide significance in the combined analyses of all the panels. Conclusions...

  15. UniFIeD Univariate Frequency-based Imputation for Time Series Data

    OpenAIRE

    Friese, Martina; Stork, Jörg; Ramos Guerra, Ricardo; Bartz-Beielstein, Thomas; Thaker, Soham; Flasch, Oliver; Zaefferer, Martin

    2013-01-01

    This paper introduces UniFIeD, a new data preprocessing method for time series. UniFIeD can cope with large intervals of missing data. A scalable test function generator, which allows the simulation of time series with different gap sizes, is also presented. An experimental study demonstrates that (i) UniFIeD shows significantly better performance than simple imputation methods and (ii) UniFIeD is able to handle situations where advanced imputation methods fail. The results are indep...

  16. Blood lead levels, iron metabolism gene polymorphisms and homocysteine: a gene-environment interaction study.

    Science.gov (United States)

    Kim, Kyoung-Nam; Lee, Mee-Ri; Lim, Youn-Hee; Hong, Yun-Chul

    2017-12-01

    Homocysteine has been causally associated with various adverse health outcomes. Evidence supporting the relationship between lead and homocysteine levels has been accumulating, but most prior studies have not focused on the interaction with genetic polymorphisms. From a community-based prospective cohort, we analysed 386 participants (aged 41-71 years) with information regarding blood lead and plasma homocysteine levels. Blood lead levels were measured between 2001 and 2003, and plasma homocysteine levels were measured in 2007. Interactions of lead levels with 42 genotyped single-nucleotide polymorphisms (SNPs) in five genes (TF, HFE, CBS, BHMT and MTR) were assessed via a 2-degree of freedom (df) joint test and a 1-df interaction test. In secondary analyses using imputation, we further assessed 58 imputed SNPs in the TF and MTHFR genes. Blood lead concentrations were positively associated with plasma homocysteine levels (p=0.0276). Six SNPs in the TF and MTR genes were screened using the 2-df joint test, and among them, three SNPs in the TF gene showed interactions with lead with respect to homocysteine levels through the 1-df interaction test (plead levels. Blood lead levels were positively associated with plasma homocysteine levels measured 4-6 years later, and three SNPs in the TF gene modified the association. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  17. Impact of SNPs on Protein Phosphorylation Status in Rice (Oryza sativa L.)

    Directory of Open Access Journals (Sweden)

    Shoukai Lin

    2016-11-01

    Single nucleotide polymorphisms (SNPs) are widely used in functional genomics and genetics research. The high-quality sequence of the rice genome has provided a genome-wide SNP and proteome resource. However, the impact of SNPs on protein phosphorylation status in rice is not fully understood. In this paper, we first updated the rice SNP resource based on the new rice genome Ver. 7.0, then systematically analyzed the potential impact of non-synonymous SNPs (nsSNPs) on protein phosphorylation status. There were 3,897,312 SNPs in the Ver. 7.0 rice genome, among which 9.9% were nsSNPs. Meanwhile, a total of 2,508,261 phosphorylated sites were predicted in the rice proteome. Interestingly, we observed that 150,197 (39.1%) nsSNPs could influence protein phosphorylation status, among which 52.2% might induce changes of protein kinase (PK) types for adjacent phosphorylation sites. We constructed a database, SNP_rice, to deposit the updated rice SNP resource and phosSNPs information. It was freely available to academic researchers at http://bioinformatics.fafu.edu.cn. As a case study, we detected five nsSNPs that potentially influenced heterotrimeric G protein phosphorylation status in rice, indicating that genetic polymorphisms affect signal transduction by influencing the phosphorylation status of heterotrimeric G proteins. The results of this work could be a useful resource for future experimental identification and provide interesting information for better rice breeding.

  18. A suggested approach for imputation of missing dietary data for young children in daycare.

    Science.gov (United States)

    Stevens, June; Ou, Fang-Shu; Truesdale, Kimberly P; Zeng, Donglin; Vaughn, Amber E; Pratt, Charlotte; Ward, Dianne S

    2015-01-01

    Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare making accurate proxy reports by parents difficult. The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Data were from children aged 2-5 years in the My Parenting SOS project (n=308; 870 24-h diet recalls). Mixed models were used to simultaneously predict breakfast, dinner, and evening snacks (B+D+ES); lunch; and daytime snacks for all children after adjusting for age, sex, and body mass index (BMI). From these models, we imputed the missing weekday daycare lunches by interpolation using the mean lunch to B+D+ES [L/(B+D+ES)] ratio among non-daycare children on weekdays and the L/(B+D+ES) ratio for all children on weekends. Daytime snack data were used to impute snacks. The reported mean (± standard deviation) weekday intake was lower for daycare children [725 (±324) kcal] compared to non-daycare children [1,048 (±463) kcal]. Weekend intake for all children was 1,173 (±427) kcal. After imputation, weekday caloric intake for daycare children was 1,230 (±409) kcal. Daily intakes that included imputed data were associated with age and sex but not with BMI. This work indicates that imputation is a promising method for improving the precision of daily nutrient data from young children.
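    The ratio step described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it skips the mixed models entirely, assumes the "mean ratio" is the mean of per-child L/(B+D+ES) ratios, and uses hypothetical function names and made-up kcal values.

```python
from statistics import mean

def mean_lunch_ratio(reference_records):
    """Mean of lunch / (B+D+ES) over reference children.

    reference_records: list of (lunch_kcal, other_kcal) pairs, where
    other_kcal is breakfast + dinner + evening snacks (B+D+ES).
    Assumption: the mean of per-child ratios is used, not a ratio of means.
    """
    return mean(lunch / other for lunch, other in reference_records)

def impute_lunch(other_kcal, ratio):
    """Impute a missing lunch as (B+D+ES) multiplied by the reference ratio."""
    return other_kcal * ratio

# Hypothetical reference children with fully reported intake (kcal).
reference = [(300.0, 600.0), (400.0, 800.0), (250.0, 500.0)]
ratio = mean_lunch_ratio(reference)

# A daycare child with a missing lunch and 700 kcal reported for B+D+ES.
imputed_lunch = impute_lunch(700.0, ratio)
print(ratio, imputed_lunch)
```

    A single shared ratio keeps the sketch short; the abstract indicates that separate ratios were used for weekdays and weekends.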

  19. A suggested approach for imputation of missing dietary data for young children in daycare

    Directory of Open Access Journals (Sweden)

    June Stevens

    2015-12-01

    Background: Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare making accurate proxy reports by parents difficult. Objective: The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Design: Data were from children aged 2-5 years in the My Parenting SOS project (n=308; 870 24-h diet recalls). Mixed models were used to simultaneously predict breakfast, dinner, and evening snacks (B+D+ES); lunch; and daytime snacks for all children after adjusting for age, sex, and body mass index (BMI). From these models, we imputed the missing weekday daycare lunches by interpolation using the mean lunch to B+D+ES [L/(B+D+ES)] ratio among non-daycare children on weekdays and the L/(B+D+ES) ratio for all children on weekends. Daytime snack data were used to impute snacks. Results: The reported mean (± standard deviation) weekday intake was lower for daycare children [725 (±324) kcal] compared to non-daycare children [1,048 (±463) kcal]. Weekend intake for all children was 1,173 (±427) kcal. After imputation, weekday caloric intake for daycare children was 1,230 (±409) kcal. Daily intakes that included imputed data were associated with age and sex but not with BMI. Conclusion: This work indicates that imputation is a promising method for improving the precision of daily nutrient data from young children.

  20. A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation.

    Science.gov (United States)

    Välikangas, Tommi; Suomi, Tomi; Elo, Laura L

    2017-05-31

    Label-free mass spectrometry (MS) has developed into an important tool applied in various fields of biological and life sciences. Several software packages exist to process the raw MS data into quantified protein abundances, including open source and commercial solutions. Each package includes a set of unique algorithms for different tasks of the MS data processing workflow. While many of these algorithms have been compared separately, a thorough and systematic evaluation of their overall performance is missing. Moreover, systematic information is lacking about the amount of missing values produced by the different proteomics software and the capabilities of different data imputation methods to account for them. In this study, we evaluated the performance of five popular quantitative label-free proteomics software workflows using four different spike-in data sets. Our extensive testing included the number of proteins quantified and the number of missing values produced by each workflow, the accuracy of detecting differential expression and logarithmic fold change and the effect of different imputation and filtering methods on the differential expression results. We found that the Progenesis software performed consistently well in the differential expression analysis and produced few missing values. The missing values produced by the other software decreased their performance, but this difference could be mitigated using proper data filtering or imputation methods. Among the imputation methods, we found that the local least squares (lls) regression imputation consistently increased the performance of the software in the differential expression analysis, and a combination of both data filtering and local least squares imputation increased performance the most in the tested data sets. © The Author 2017. Published by Oxford University Press.

  1. Improved Correction of Misclassification Bias With Bootstrap Imputation.

    Science.gov (United States)

    van Walraven, Carl

    2018-07-01

    Diagnostic codes used in administrative database research can create bias due to misclassification. Quantitative bias analysis (QBA) can correct for this bias, requires only code sensitivity and specificity, but may return invalid results. Bootstrap imputation (BI) can also address misclassification bias but traditionally requires multivariate models to accurately estimate disease probability. This study compared misclassification bias correction using QBA and BI. Serum creatinine measures were used to determine severe renal failure status in 100,000 hospitalized patients. Prevalence of severe renal failure in 86 patient strata and its association with 43 covariates was determined and compared with results in which renal failure status was determined using diagnostic codes (sensitivity 71.3%, specificity 96.2%). Differences in results (misclassification bias) were then corrected with QBA or BI (using progressively more complex methods to estimate disease probability). In total, 7.4% of patients had severe renal failure. Imputing disease status with diagnostic codes exaggerated prevalence estimates [median relative change (range), 16.6% (0.8%-74.5%)] and its association with covariates [median (range) exponentiated absolute parameter estimate difference, 1.16 (1.01-2.04)]. QBA produced invalid results 9.3% of the time and increased bias in estimates of both disease prevalence and covariate associations. BI decreased misclassification bias with increasingly accurate disease probability estimates. QBA can produce invalid results and increase misclassification bias. BI avoids invalid results and can importantly decrease misclassification bias when accurate disease probability estimates are used.
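    To illustrate how a correction that needs only sensitivity and specificity can return an invalid result, here is a minimal sketch. The abstract does not state which QBA formula was used; the sketch assumes the standard Rogan-Gladen prevalence estimator, and the function name and the 2% observed prevalence are hypothetical.

```python
def corrected_prevalence(observed, sensitivity, specificity):
    """Rogan-Gladen estimator: (observed + Sp - 1) / (Se + Sp - 1).

    Assumption: this stands in for the QBA correction in the abstract.
    The result is not clipped, so values outside [0, 1] remain visible.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    return (observed + specificity - 1.0) / denom

# Code accuracy reported in the abstract.
se, sp = 0.713, 0.962

# Prevalence the codes would report if the true prevalence were 7.4%.
true_prev = 0.074
observed = true_prev * se + (1.0 - true_prev) * (1.0 - sp)
recovered = corrected_prevalence(observed, se, sp)

# Hypothetical stratum where codes flag only 2% of patients: this is below
# the false-positive rate (1 - sp), so the corrected value is negative.
invalid = corrected_prevalence(0.02, se, sp)
print(observed, recovered, invalid)
```

    Whenever the observed proportion falls below 1 - specificity (or above the sensitivity), this estimator leaves the interval [0, 1], which is one way an "invalid result" can arise.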

  2. Imputation of variants from the 1000 Genomes Project modestly improves known associations and can identify low-frequency variant-phenotype associations undetected by HapMap based imputation.

    Science.gov (United States)

    Wood, Andrew R; Perry, John R B; Tanaka, Toshiko; Hernandez, Dena G; Zheng, Hou-Feng; Melzer, David; Gibbs, J Raphael; Nalls, Michael A; Weedon, Michael N; Spector, Tim D; Richards, J Brent; Bandinelli, Stefania; Ferrucci, Luigi; Singleton, Andrew B; Frayling, Timothy M

    2013-01-01

    Genome-wide association (GWA) studies have been limited by the reliance on common variants present on microarrays or imputable from the HapMap Project data. More recently, the completion of the 1000 Genomes Project has provided variant and haplotype information for several million variants derived from sequencing over 1,000 individuals. To help understand the extent to which more variants (including low frequency (1% ≤ MAF 1000 Genomes imputation, respectively, and 9 and 11 that reached a stricter, likely conservative, threshold of P1000 Genomes genotype data modestly improved the strength of known associations. Of 20 associations detected at P1000 Genomes imputed data and one was nominally more strongly associated in HapMap imputed data. We also detected an association between a low frequency variant and phenotype that was previously missed by HapMap based imputation approaches. An association between rs112635299 and alpha-1 globulin near the SERPINA gene represented the known association between rs28929474 (MAF = 0.007) and alpha1-antitrypsin that predisposes to emphysema (P = 2.5 × 10^-12). Our data provide important proof of principle that 1000 Genomes imputation will detect novel, low frequency-large effect associations.

  3. Imputation Accuracy from Low to Moderate Density Single Nucleotide Polymorphism Chips in a Thai Multibreed Dairy Cattle Population

    Directory of Open Access Journals (Sweden)

    Danai Jattawa

    2016-04-01

    The objective of this study was to investigate the accuracy of imputation from low density (LDC) to moderate density SNP chips (MDC) in a Thai Holstein-Other multibreed dairy cattle population. Dairy cattle with complete pedigree information (n = 1,244) from 145 dairy farms were genotyped with GeneSeek GGP20K (n = 570), GGP26K (n = 540) and GGP80K (n = 134) chips. After checking for single nucleotide polymorphism (SNP) quality, 17,779 SNP markers in common between the GGP20K, GGP26K, and GGP80K were used to represent MDC. Animals were divided into two groups, a reference group (n = 912) and a test group (n = 332). The SNP markers chosen for the test group were those located in positions corresponding to GeneSeek GGP9K (n = 7,652). The LDC to MDC genotype imputation was carried out using three different software packages, namely Beagle 3.3 (population-based algorithm), FImpute 2.2 (combined family- and population-based algorithms) and Findhap 4 (combined family- and population-based algorithms). Imputation accuracies within and across chromosomes were calculated as ratios of correctly imputed SNP markers to overall imputed SNP markers. Imputation accuracy for the three software packages ranged from 76.79% to 93.94%. FImpute had higher imputation accuracy (93.94%) than Findhap (84.64%) and Beagle (76.79%). Imputation accuracies were similar and consistent across chromosomes for FImpute, but not for Findhap and Beagle. Most chromosomes that showed either high (73%) or low (80%) imputation accuracies were the same chromosomes that had above and below average linkage disequilibrium (LD; defined here as the correlation between pairs of adjacent SNP within chromosomes less than or equal to 1 Mb apart). Results indicated that FImpute was more suitable than Findhap and Beagle for genotype imputation in this Thai multibreed population. Perhaps additional increments in imputation accuracy could be achieved by increasing the completeness of pedigree information.
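    The accuracy measure named in the abstract above (correctly imputed markers divided by all imputed markers) can be sketched as below. The genotype coding (0, 1, 2 copies of an allele), the function name, and the example vectors are assumptions made for illustration only.

```python
def imputation_accuracy(true_genotypes, imputed_genotypes):
    """Fraction of imputed SNP markers whose genotype matches the true one."""
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype vectors must have the same length")
    if not true_genotypes:
        raise ValueError("no imputed markers to evaluate")
    correct = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
    return correct / len(true_genotypes)

# Hypothetical masked markers for one animal: truth versus imputed calls.
truth = [0, 1, 2, 1, 0, 2, 1, 1]
imputed = [0, 1, 2, 0, 0, 2, 1, 2]
accuracy = imputation_accuracy(truth, imputed)
print(f"{accuracy:.2%}")
```

    Computing this per chromosome, as the abstract describes, only requires grouping the marker pairs by chromosome before calling the function.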

  4. Imputation of missing genotypes within LD-blocks relying on the basic coalescent and beyond: consideration of population growth and structure.

    Science.gov (United States)

    Kabisch, Maria; Hamann, Ute; Lorenzo Bermejo, Justo

    2017-10-17

    Genotypes not directly measured in genetic studies are often imputed to improve statistical power and to increase mapping resolution. The accuracy of standard imputation techniques strongly depends on the similarity of linkage disequilibrium (LD) patterns in the study and reference populations. Here we develop a novel approach for genotype imputation in low-recombination regions that relies on the coalescent and allows population demographic factors to be accounted for explicitly. To test the new method, study and reference haplotypes were simulated and gene trees were inferred under the basic coalescent and also considering population growth and structure. The reference haplotypes that first coalesced with study haplotypes were used as templates for genotype imputation. Computer simulations were complemented with the analysis of real data. Genotype concordance rates were used to compare the accuracies of coalescent-based and standard (IMPUTE2) imputation. Simulations revealed that, in LD-blocks, imputation accuracy relying on the basic coalescent was higher and less variable than with IMPUTE2. Explicit consideration of population growth and structure, even if present, did not practically improve accuracy. The advantage of coalescent-based over standard imputation increased with the minor allele frequency and decreased with population stratification. Results based on real data indicated that, even in low-recombination regions, further research is needed to incorporate recombination in coalescence inference, in particular for studies with genetically diverse and admixed individuals. To exploit the full potential of coalescent-based methods for the imputation of missing genotypes in genetic studies, further methodological research is needed to reduce computer time, to take into account recombination, and to implement these methods in user-friendly computer programs. Here we provide reproducible code which takes advantage of publicly available software to facilitate

  5. Novel SNPs polymorphism of bovine CACNA2D1 gene and their ...

    African Journals Online (AJOL)

    In this study, the bovine CACNA2D1 gene was taken as a candidate gene for mastitis resistance. The objective of this study was to identify single nucleotide polymorphisms (SNPs) in the bovine CACNA2D1 gene and evaluate the association of these SNPs with mastitis in cattle. Through DNA sequencing and PCR-RFLP ...

  6. Cost reduction for web-based data imputation

    KAUST Repository

    Li, Zhixu

    2014-01-01

    Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity of these keywords and the data complexity on the Web, different queries may retrieve different answers to the same absent field value. To decide the most probable right answer to each absent field value, existing methods issue quite a few available imputation queries for each absent value, and then vote to decide the most probable right answer. As a result, a large number of imputation queries must be issued to fill all absent values in an incomplete data set, which brings a large overhead. In this paper, we work on reducing the cost of Web-based Data Imputation in two aspects: First, we propose a query execution scheme which can secure the most probable right answer to an absent field value by issuing as few imputation queries as possible. Second, we recognize and prune, a priori, queries that will probably fail to return any answers. Our extensive experimental evaluation shows that our proposed techniques substantially reduce the cost of Web-based Imputation without hurting its high imputation accuracy. © 2014 Springer International Publishing Switzerland.

  7. Multiple imputation to account for missing data in a survey: estimating the prevalence of osteoporosis.

    Science.gov (United States)

    Kmetic, Andrew; Joseph, Lawrence; Berger, Claudie; Tenenhouse, Alan

    2002-07-01

    Nonresponse bias is a concern in any epidemiologic survey in which a subset of selected individuals declines to participate. We reviewed multiple imputation, a widely applicable and easy to implement Bayesian methodology to adjust for nonresponse bias. To illustrate the method, we used data from the Canadian Multicentre Osteoporosis Study, a large cohort study of 9423 randomly selected Canadians, designed in part to estimate the prevalence of osteoporosis. Although subjects were randomly selected, only 42% of individuals who were contacted agreed to participate fully in the study. The study design included a brief questionnaire for those invitees who declined further participation in order to collect information on the major risk factors for osteoporosis. These risk factors (which included age, sex, previous fractures, family history of osteoporosis, and current smoking status) were then used to estimate the missing osteoporosis status for nonparticipants using multiple imputation. Both ignorable and nonignorable imputation models are considered. Our results suggest that selection bias in the study is of concern, but only slightly, in very elderly (age 80+ years), both women and men. Epidemiologists should consider using multiple imputation more often than is current practice.

  8. Limitations in Using Multiple Imputation to Harmonize Individual Participant Data for Meta-Analysis.

    Science.gov (United States)

    Siddique, Juned; de Chavez, Peter J; Howe, George; Cruden, Gracelyn; Brown, C Hendricks

    2018-02-01

    Individual participant data (IPD) meta-analysis is a meta-analysis in which the individual-level data for each study are obtained and used for synthesis. A common challenge in IPD meta-analysis is when variables of interest are measured differently in different studies. The term harmonization has been coined to describe the procedure of placing variables on the same scale in order to permit pooling of data from a large number of studies. Using data from an IPD meta-analysis of 19 adolescent depression trials, we describe a multiple imputation approach for harmonizing 10 depression measures across the 19 trials by treating those depression measures that were not used in a study as missing data. We then apply diagnostics to address the fit of our imputation model. Even after reducing the scale of our application, we were still unable to produce accurate imputations of the missing values. We describe those features of the data that made it difficult to harmonize the depression measures and provide some guidelines for using multiple imputation for harmonization in IPD meta-analysis.

  9. Multiple imputation strategies for zero-inflated cost data in economic evaluations : which method works best?

    NARCIS (Netherlands)

    MacNeil Vroomen, Janet; Eekhout, Iris; Dijkgraaf, Marcel G; van Hout, Hein; de Rooij, Sophia E; Heymans, Martijn W; Bosmans, Judith E

    2016-01-01

    Cost and effect data often have missing data because economic evaluations are frequently added onto clinical studies where cost data are rarely the primary outcome. The objective of this article was to investigate which multiple imputation strategy is most appropriate to use for missing

  10. Imputation by the mean score should be avoided when validating a Patient Reported Outcomes questionnaire by a Rasch model in presence of informative missing data

    LENUS (Irish Health Repository)

    Hardouin, Jean-Benoit

    2011-07-14

    Background: Nowadays, more and more clinical scales consisting of responses given by the patients to some items (Patient Reported Outcomes - PRO) are validated with models based on Item Response Theory, and more specifically, with a Rasch model. In the validation sample, missing data are frequent. The aim of this paper is to compare sixteen methods for handling the missing data (mainly based on simple imputation) in the context of psychometric validation of PRO by a Rasch model. The main indexes used for validation by a Rasch model are compared. Methods: A simulation study was performed that allowed several cases to be considered, notably whether or not the missing values are informative and the rate of missing data. Results: Several imputation methods produce bias on psychometric indexes (generally, the imputation methods artificially improve the psychometric qualities of the scale). In particular, this is the case with the method based on the Personal Mean Score (PMS), which is the most commonly used imputation method in practice. Conclusions: Several imputation methods should be avoided, in particular PMS imputation. From a general point of view, it is important to use an imputation method that considers both the ability of the patient (measured for example by his/her score), and the difficulty of the item (measured for example by its rate of favourable responses). Another recommendation is to always consider the addition of a random process in the imputation method, because such a process reduces the bias. Last, the analysis performed without imputation of the missing data (available case analysis) is an interesting alternative to simple imputation in this context.

  11. Genome-wide SNPs lead to strong signals of geographic structure and relatedness patterns in the major arbovirus vector, Aedes aegypti.

    Science.gov (United States)

    Rašić, Gordana; Filipović, Igor; Weeks, Andrew R; Hoffmann, Ary A

    2014-04-11

    Genetic markers are widely used to understand the biology and population dynamics of disease vectors, but often markers are limited in the resolution they provide. In particular, the delineation of population structure, fine scale movement and patterns of relatedness are often obscured unless numerous markers are available. To address this issue in the major arbovirus vector, the yellow fever mosquito (Aedes aegypti), we used double digest Restriction-site Associated DNA (ddRAD) sequencing for the discovery of genome-wide single nucleotide polymorphisms (SNPs). We aimed to characterize the new SNP set and to test the resolution against previously described microsatellite markers in detecting broad and fine-scale genetic patterns in Ae. aegypti. We developed bioinformatics tools that support the customization of restriction enzyme-based protocols for SNP discovery. We showed that our approach for RAD library construction achieves unbiased genome representation that reflects true evolutionary processes. In Ae. aegypti samples from three continents we identified more than 18,000 putative SNPs. They were widely distributed across the three Ae. aegypti chromosomes, with 47.9% found in intergenic regions and 17.8% in exons of over 2,300 genes. The pattern of their imputed effects in ORFs and UTRs was consistent with those found in a recent transcriptome study. We demonstrated that individual mosquitoes from Indonesia, Australia, Vietnam and Brazil can be assigned with a very high degree of confidence to their region of origin using a large SNP panel. We also showed that familial relatedness of samples from a 0.4 km² area could be confidently established with a subset of SNPs. Using a cost-effective customized RAD sequencing approach supported by our bioinformatics tools, we characterized over 18,000 SNPs in field samples of the dengue fever mosquito Ae. aegypti. The variants were annotated and positioned onto the three Ae. aegypti chromosomes. The new SNP set provided much

  12. Genome-wide screen for universal individual identification SNPs based on the HapMap and 1000 Genomes databases.

    Science.gov (United States)

    Huang, Erwen; Liu, Changhui; Zheng, Jingjing; Han, Xiaolong; Du, Weian; Huang, Yuanjian; Li, Chengshi; Wang, Xiaoguang; Tong, Dayue; Ou, Xueling; Sun, Hongyu; Zeng, Zhaoshu; Liu, Chao

    2018-04-03

    Differences among SNP panels for individual identification in SNP selection and populations have led to few common SNPs, compromising their universal applicability. To screen all universal SNPs, we performed genome-wide SNP mining in multiple populations based on the HapMap and 1000 Genomes databases. SNPs with high minor allele frequencies (MAF) in 37 populations were selected. With MAF from ≥0.35 to ≥0.43, the number of selected SNPs decreased from 2769 to 0. A total of 117 SNPs with MAF ≥0.39 have no linkage disequilibrium with each other in every population. For 116 of the 117 SNPs, cumulative match probability (CMP) ranged from 2.01 × 10^-48 to 1.93 × 10^-50 and cumulative exclusion probability (CEP) ranged from 0.9999999996653 to 0.9999999999945. In 134 tested Han samples, 110 of the 117 SNPs remained within high MAF and conformed to Hardy-Weinberg equilibrium, with CMP = 4.70 × 10^-47 and CEP = 0.999999999862. By analyzing the same number of autosomal SNPs as in the HID-Ion AmpliSeq Identity Panel, i.e. 90 randomized out of the 110 SNPs, our panel yielded preferable CMP and CEP. Taken together, the 110-SNP panel is advantageous for forensic testing, and this study provided plenty of highly informative SNPs for compiling final universal panels.
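    As a rough illustration of why high-MAF, mutually unlinked SNPs drive the cumulative match probability down so quickly, the sketch below multiplies per-SNP match probabilities. It is not the authors' pipeline: it assumes biallelic SNPs, Hardy-Weinberg genotype frequencies, and independence between SNPs, and the function names and the uniform MAF panel are hypothetical.

```python
import math

def snp_match_probability(maf):
    """Chance that two unrelated people share a genotype at one biallelic SNP.

    Assumes Hardy-Weinberg genotype frequencies p^2, 2pq and q^2; the match
    probability is the sum of the squared genotype frequencies.
    """
    p, q = maf, 1.0 - maf
    return (p * p) ** 2 + (2.0 * p * q) ** 2 + (q * q) ** 2

def cumulative_match_probability(mafs):
    """Product of per-SNP match probabilities, assuming independent SNPs."""
    return math.prod(snp_match_probability(m) for m in mafs)

# Hypothetical panel: 117 SNPs that all have a MAF of exactly 0.5.
panel = [0.5] * 117
cmp_value = cumulative_match_probability(panel)
print(snp_match_probability(0.5), cmp_value)
```

    The per-SNP match probability is lowest at a MAF of 0.5, which is why screening for SNPs with MAF close to 0.5 in every population gives the most discriminating panel for a fixed number of markers.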

  13. Imputation of sequence variants for identification of genetic risks for Parkinson's disease: a meta-analysis of genome-wide association studies

    NARCIS (Netherlands)

    Nalls, M.A.; Plagnol, V.; Hernandez, D.G.; Sharma, M.; Sheerin, U.M.; Saad, M.; Simon-Sanchez, J.; Schulte, C.; Lesage, S.; Sveinbjornsdottir, S.; Stefansson, K.; Martinez, M.; Hardy, J.; Heutink, P.; Brice, A.; Gasser, T.; Singleton, A.B.; Wood, N.W.; Bloem, B.R.; Post, B.; Scheffer, H.; Warrenburg, B.P.C. van de; et al.,

    2011-01-01

    BACKGROUND: Genome-wide association studies (GWAS) for Parkinson's disease have linked two loci (MAPT and SNCA) to risk of Parkinson's disease. We aimed to identify novel risk loci for Parkinson's disease. METHODS: We did a meta-analysis of datasets from five Parkinson's disease GWAS from the USA

  14. Imputation of sequence variants for identification of genetic risks for Parkinson's disease: a meta-analysis of genome-wide association studies

    NARCIS (Netherlands)

    Nalls, Michael A.; Plagnol, Vincent; Hernandez, Dena G.; Sharma, Manu; Sheerin, Una-Marie; Saad, Mohamad; Simon-Sanchez, Javier; Schulte, Claudia; Lesage, Suzanne; Sveinbjornsdottir, Sigurlaug; Arepalli, Sampath; Barker, Roger; Ben-Shlomo, Yoav; Berendse, Henk W.; Berg, Daniela; Bhatia, Kailash; de Bie, Rob M. A.; Biffi, Alessandro; Bloem, Bas; Bochdanovits, Zoltan; Bonin, Michael; Bras, Jose M.; Brockmann, Kathrin; Brooks, Janet; Burn, David J.; Charlesworth, Gavin; Chen, Honglei; Chinnery, Patrick F.; Chong, Sean; Clarke, Carl E.; Cookson, Mark R.; Cooper, J. Mark; Corvol, Jean Christophe; Counsell, Carl; Damier, Philippe; Dartigues, Jean-Francois; Deloukas, Panos; Deuschl, Guenther; Dexter, David T.; van Dijk, Karin D.; Dillman, Allissa; Durif, Frank; Duerr, Alexandra; Edkins, Sarah; Evans, Jonathan R.; Foltynie, Thomas; Gao, Jianjun; Gardner, Michelle; Gibbs, J. Raphael; Goate, Alison; Gray, Emma; Guerreiro, Rita; Gustafsson, Omar; Harris, Clare; van Hilten, Jacobus J.; Hofman, Albert; Hollenbeck, Albert; Holton, Janice; Hu, Michele; Huang, Xuemei; Huber, Heiko; Hudson, Gavin; Hunt, Sarah E.; Huttenlocher, Johanna; Illig, Thomas; Jonsson, Palmi V.; Lambert, Jean-Charles; Langford, Cordelia; Lees, Andrew; Lichtner, Peter; Limousin, Patricia; Lopez, Grisel; Lorenz, Delia; McNeill, Alisdair; Moorby, Catriona; Moore, Matthew; Morris, Huw R.; Morrison, Karen E.; Mudanohwo, Ese; O'Sullivan, Sean S.; Pearson, Justin; Perlmutter, Joel S.; Petursson, Hjoervar; Pollak, Pierre; Post, Bart; Potter, Simon; Ravina, Bernard; Revesz, Tamas; Riess, Olaf; Rivadeneira, Fernando; Rizzu, Patrizia; Ryten, Mina; Sawcer, Stephen; Schapira, Anthony; Scheffer, Hans; Shaw, Karen; Shoulson, Ira; Sidransky, Ellen; Smith, Colin; Spencer, Chris C. A.; Stefansson, Hreinn; Stockton, Joanna D.; Strange, Amy; Talbot, Kevin; Tanner, Carlie M.; Tashakkori-Ghanbaria, Avazeh; Tison, Francois; Trabzuni, Daniah; Traynor, Bryan J.; Uitterlinden, Andre G.; Velseboer, Daan; Vidailhet, Marie; Walker, Robert; van de Warrenburg, Bart; Wickremaratchi, Mirdhu; Williams, Nigel; Williams-Gray, Caroline H.; Winder-Rhodes, Sophie; Stefansson, Kari; Martinez, Maria; Hardy, John; Heutink, Peter; Brice, Alexis; Gasser, Thomas; Singleton, Andrew B.; Wood, Nicholas W.

    2011-01-01

    Background Genome-wide association studies (GWAS) for Parkinson's disease have linked two loci (MAPT and SNCA) to risk of Parkinson's disease. We aimed to identify novel risk loci for Parkinson's disease. Methods We did a meta-analysis of datasets from five Parkinson's disease GWAS from the USA and

  15. Factors associated with low birth weight in Nepal using multiple imputation

    Directory of Open Access Journals (Sweden)

    Usha Singh

    2017-02-01

    Background: Survey data from low income countries on birth weight usually pose a persistent problem. The studies conducted on birth weight have acknowledged missing data on birth weight, but they are not included in the analysis. Furthermore, other missing data presented on determinants of birth weight are not addressed. Thus, this study tries to identify determinants that are associated with low birth weight (LBW) using multiple imputation to handle missing data on birth weight and its determinants. Methods: The child dataset from the Nepal Demographic and Health Survey (NDHS), 2011, was utilized in this study. A total of 5,240 children were born between 2006 and 2011, out of which 87% had at least one measured variable missing and 21% had no recorded birth weight. All the analyses were carried out in R version 3.1.3. The transform-then-impute method was applied to check for interaction between explanatory variables and imputed missing data. The survey package was applied to each imputed dataset to account for survey design and sampling method. Survey logistic regression was applied to identify the determinants associated with LBW. Results: The prevalence of LBW was 15.4% after imputation. Women with the highest autonomy on their own health compared to those with health decisions involving husband or others (adjusted odds ratio (OR) 1.87, 95% confidence interval (95% CI) = 1.31, 2.67), and husband and women together (adjusted OR 1.57, 95% CI = 1.05, 2.35) were less likely to give birth to LBW infants. Mothers using highly polluting cooking fuels (adjusted OR 1.49, 95% CI = 1.03, 2.22) were more likely to give birth to LBW infants than mothers using non-polluting cooking fuels. Conclusion: The findings of this study suggested that obtaining the prevalence of LBW from only the sample of measured birth weight and ignoring missing data results in underestimation.

  16. Use of Multiple Imputation Method to Improve Estimation of Missing Baseline Serum Creatinine in Acute Kidney Injury Research

    Science.gov (United States)

    Peterson, Josh F.; Eden, Svetlana K.; Moons, Karel G.; Ikizler, T. Alp; Matheny, Michael E.

    2013-01-01

    Background and objectives: Baseline creatinine (BCr) is frequently missing in AKI studies. Common surrogate estimates can misclassify AKI and adversely affect the study of related outcomes. This study examined whether multiple imputation improved accuracy of estimating missing BCr beyond current recommendations to apply assumed estimated GFR (eGFR) of 75 ml/min per 1.73 m² (eGFR 75). Design, setting, participants, & measurements: From 41,114 unique adult admissions (13,003 with and 28,111 without BCr data) at Vanderbilt University Hospital between 2006 and 2008, a propensity score model was developed to predict likelihood of missing BCr. Propensity scoring identified 6502 patients with highest likelihood of missing BCr among 13,003 patients with known BCr to simulate a “missing” data scenario while preserving actual reference BCr. Within this cohort (n=6502), the ability of various multiple-imputation approaches to estimate BCr and classify AKI was compared with that of eGFR 75. Results: All multiple-imputation methods except the basic one more closely approximated actual BCr than did eGFR 75. Total AKI misclassification was lower with multiple imputation (full multiple imputation + serum creatinine) (9.0%) than with eGFR 75 (12.3%; P […] creatinine) (15.3%) versus eGFR 75 (40.5%; P<0.001). Multiple imputation improved specificity and positive predictive value for detecting AKI at the expense of modestly decreasing sensitivity relative to eGFR 75. Conclusions: Multiple imputation can improve accuracy in estimating missing BCr and reduce misclassification of AKI beyond currently proposed methods. PMID:23037980

  17. Genotyping of 75 SNPs using arrays for individual identification in five population groups.

    Science.gov (United States)

    Hwa, Hsiao-Lin; Wu, Lawrence Shih Hsin; Lin, Chun-Yen; Huang, Tsun-Ying; Yin, Hsiang-I; Tseng, Li-Hui; Lee, James Chun-I

    2016-01-01

    Single nucleotide polymorphism (SNP) typing offers promise to forensic genetics. Various strategies and panels for analyzing SNP markers for individual identification have been published. However, the best panels with fewer identity SNPs for all major population groups are still under discussion. This study aimed to find more autosomal SNPs with high heterozygosity for individual identification among Asian populations. Ninety-six autosomal SNPs of 502 DNA samples from unrelated individuals of five population groups (208 Taiwanese Han, 83 Filipinos, 62 Thais, 69 Indonesians, and 80 individuals with European, Near Eastern, or South Asian ancestry) were analyzed using arrays in an initial screening, and 75 SNPs (group A, 46 newly selected SNPs; group B, 29 SNPs based on a previous SNP panel) were selected for further statistical analyses. Some SNPs with high heterozygosity from Asian populations were identified. The combined random match probability of the best 40 and 45 SNPs was between 3.16 × 10^-17 and 7.75 × 10^-17 and between 2.33 × 10^-19 and 7.00 × 10^-19, respectively, in all five populations. These loci offer comparable power to short tandem repeats (STRs) for routine forensic profiling. In this study, we demonstrated the population genetic characteristics and forensic parameters of 75 SNPs with high heterozygosity from five population groups. This SNP panel can provide valuable genotypic information and can be helpful in forensic casework for individual identification among these populations.

  18. The use of multiple imputation for the accurate measurements of individual feed intake by electronic feeders.

    Science.gov (United States)

    Jiao, S; Tiezzi, F; Huang, Y; Gray, K A; Maltecca, C

    2016-02-01

    Obtaining accurate individual feed intake records is the key first step in achieving genetic progress toward more efficient nutrient utilization in pigs. Feed intake records collected by electronic feeding systems contain errors (erroneous and abnormal values exceeding certain cutoff criteria), which are due to feeder malfunction or animal-feeder interaction. In this study, we examined the use of a novel data-editing strategy involving multiple imputation to minimize the impact of errors and missing values on the quality of feed intake data collected by an electronic feeding system. Accuracy of feed intake data adjustment obtained from the conventional linear mixed model (LMM) approach was compared with 2 alternative implementations of multiple imputation by chained equation, denoted as MI (multiple imputation) and MICE (multiple imputation by chained equation). The 3 methods were compared under 3 scenarios, where 5, 10, and 20% feed intake error rates were simulated. Each of the scenarios was replicated 5 times. Accuracy of the alternative error adjustment was measured as the correlation between the true daily feed intake (DFI; daily feed intake in the testing period) or true ADFI (the mean DFI across testing period) and the adjusted DFI or adjusted ADFI. In the editing process, error cutoff criteria are used to define if a feed intake visit contains errors. To investigate the possibility that the error cutoff criteria may affect any of the 3 methods, the simulation was repeated with 2 alternative error cutoff values. Multiple imputation methods outperformed the LMM approach in all scenarios with mean accuracies of 96.7, 93.5, and 90.2% obtained with MI and 96.8, 94.4, and 90.1% obtained with MICE compared with 91.0, 82.6, and 68.7% using LMM for DFI. Similar results were obtained for ADFI. Furthermore, multiple imputation methods consistently performed better than LMM regardless of the cutoff criteria applied to define errors. 
In conclusion, multiple imputation

  19. Linkage Disequilibrium between STRPs and SNPs across the Human Genome

    OpenAIRE

    Payseur, Bret A.; Place, Michael; Weber, James L.

    2008-01-01

    Patterns of linkage disequilibrium (LD) reveal the action of evolutionary processes and provide crucial information for association mapping of disease genes. Although recent studies have described the landscape of LD among single nucleotide polymorphisms (SNPs) from across the human genome, associations involving other classes of molecular variation remain poorly understood. In addition to recombination and population history, mutation rate and process are expected to shape LD. To test this i...

  20. SNP-VISTA: An Interactive SNPs Visualization Tool

    Energy Technology Data Exchange (ETDEWEB)

    Shah, Nameeta; Teplitsky, Michael V.; Pennacchio, Len A.; Hugenholtz, Philip; Hamann, Bernd; Dubchak, Inna L.

    2005-07-05

    Recent advances in sequencing technologies promise better diagnostics for many diseases as well as better understanding of evolution of microbial populations. Single Nucleotide Polymorphisms (SNPs) are established genetic markers that aid in the identification of loci affecting quantitative traits and/or disease in a wide variety of eukaryotic species. With today's technological capabilities, it is possible to re-sequence a large set of appropriate candidate genes in individuals with a given disease and then screen for causative mutations. In addition, SNPs have been used extensively in efforts to study the evolution of microbial populations, and the recent application of random shotgun sequencing to environmental samples makes possible more extensive SNP analysis of co-occurring and co-evolving microbial populations. The program is available at http://genome.lbl.gov/vista/snpvista.

  1. A Comparison of Joint Model and Fully Conditional Specification Imputation for Multilevel Missing Data

    Science.gov (United States)

    Mistler, Stephen A.; Enders, Craig K.

    2017-01-01

    Multiple imputation methods can generally be divided into two broad frameworks: joint model (JM) imputation and fully conditional specification (FCS) imputation. JM draws missing values simultaneously for all incomplete variables using a multivariate distribution, whereas FCS imputes variables one at a time from a series of univariate conditional…

  2. Differential network analysis with multiply imputed lipidomic data.

    Directory of Open Access Journals (Sweden)

    Maiju Kujala

    The importance of lipids for cell function and health has been widely recognized, e.g., a disorder in the lipid composition of cells has been related to atherosclerosis-caused cardiovascular disease (CVD). Lipidomics analyses are characterized by a large, yet not huge, number of mutually correlated measured variables, and their associations to outcomes are potentially of a complex nature. Differential network analysis provides a formal statistical method capable of inferential analysis to examine differences in network structures of the lipids under two biological conditions. It also guides us to identify potential relationships requiring further biological investigation. We provide a recipe to conduct a permutation test on association scores resulting from partial least squares regression with multiply imputed lipidomic data from the LUdwigshafen RIsk and Cardiovascular Health (LURIC) study, paying particular attention to the left-censored missing values typical for a wide range of data sets in life sciences. Left-censored missing values are low-level concentrations that are known to exist somewhere between zero and a lower limit of quantification. To make full use of the LURIC data with the missing values, we utilize state-of-the-art multiple imputation techniques and propose solutions to the challenges that incomplete data sets bring to differential network analysis. The customized network analysis helps us to understand the complexities of the underlying biological processes by identifying lipids and lipid classes that interact with each other, and by recognizing the most important differentially expressed lipids between two subgroups of coronary artery disease (CAD) patients: the patients that had a fatal CVD event and the ones who remained stable during two-year follow-up.

  3. Missing in space: an evaluation of imputation methods for missing data in spatial analysis of risk factors for type II diabetes.

    Science.gov (United States)

    Baker, Jannah; White, Nicole; Mengersen, Kerrie

    2014-11-20

    Spatial analysis is increasingly important for identifying modifiable geographic risk factors for disease. However, spatial health data from surveys are often incomplete, ranging from missing data for only a few variables, to missing data for many variables. For spatial analyses of health outcomes, selection of an appropriate imputation method is critical in order to produce the most accurate inferences. We present a cross-validation approach to select between three imputation methods for health survey data with correlated lifestyle covariates, using as a case study, type II diabetes mellitus (DM II) risk across 71 Queensland Local Government Areas (LGAs). We compare the accuracy of mean imputation to imputation using multivariate normal and conditional autoregressive prior distributions. Choice of imputation method depends upon the application and is not necessarily the most complex method. Mean imputation was selected as the most accurate method in this application. Selecting an appropriate imputation method for health survey data, after accounting for spatial correlation and correlation between covariates, allows more complete analysis of geographic risk factors for disease with more confidence in the results to inform public policy decision-making.
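
    The idea of scoring an imputation method by cross-validation, as described in the abstract above, can be sketched for the simplest candidate, mean imputation. The Python example below is a minimal illustration under stated assumptions: it uses leave-one-out validation on a single variable, ignores spatial correlation entirely, and the function names and data are hypothetical. It does not reproduce the multivariate normal or conditional autoregressive models that the study compares.

```python
from math import sqrt
from statistics import mean


def mean_impute(values):
    """Replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def loo_mean_imputation_rmse(values):
    """Leave-one-out RMSE of mean imputation on the observed entries.

    Each observed value is hidden in turn, predicted by the mean of the
    remaining observed values, and compared with the hidden truth.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    squared_errors = []
    for i, truth in enumerate(observed):
        rest = observed[:i] + observed[i + 1:]
        squared_errors.append((mean(rest) - truth) ** 2)
    return sqrt(mean(squared_errors))


# Hypothetical survey variable with one missing area-level value.
example = [1.0, 2.0, 3.0, None]
print(mean_impute(example))
print(loo_mean_imputation_rmse(example))
```

    Running the same hold-out loop with a different imputation function would give a comparable error score for each candidate, and the method with the lowest score would be the one selected for that application.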

  4. Genome Wide Association Study to Identify SNPs and CNPs Associated with Development of Radiation Injury in Prostate Cancer Patients Treated with Radiotherapy

    Science.gov (United States)

    2012-10-01

    association tests, we obtained low genomic inflation factors of 1.02 for the ED patients and 1.00 for the urinary morbidity patients, suggesting...study (GWAS) to identify genetic factors associated with urinary morbidity following radiotherapy for prostate cancer. Methods: Prostate cancer...increased urinary frequency, incomplete bladder emptying, weak urinary stream and incontinence, as well as more serious events such as bladder necrosis or
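
    The genomic inflation factor mentioned in the excerpt above is conventionally computed as the median of the observed association test statistics divided by the median of a chi-square distribution with one degree of freedom (about 0.455). The Python sketch below shows that standard calculation starting from p-values. It assumes 1-df tests and p-values strictly between 0 and 1; the function names and the example p-values are hypothetical and are not taken from the report.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the square of the standard normal 25th percentile.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.25) ** 2


def chi2_from_pvalue(p: float) -> float:
    """Convert a two-sided p-value into the matching 1-df chi-square statistic."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p-value must lie in (0, 1]")
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # lower-tail quantile, avoids 1 - p rounding
    return z * z


def genomic_inflation_factor(pvalues) -> float:
    """Median observed chi-square statistic over the expected median under the null."""
    return median(chi2_from_pvalue(p) for p in pvalues) / CHI2_1DF_MEDIAN


# Hypothetical null scan: 1000 p-values spread evenly over (0, 1).
null_pvalues = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation_factor(null_pvalues), 3))
```

    Values close to 1, such as those quoted in the excerpt, indicate little systematic inflation of the test statistics, while values well above 1 usually point to population stratification or other confounding.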

  5. Comprehensive genetic study of fatty acids helps explain the role of noncoding inflammatory bowel disease associated SNPs and fatty acid metabolism in disease pathogenesis.

    Science.gov (United States)

    Jezernik, Gregor; Potočnik, Uroš

    2018-03-01

    Fatty acids and their derivatives play an important role in inflammation. Diet and genetics influence fatty acid profiles. Abnormalities of fatty acid profiles have been observed in inflammatory bowel diseases (IBD), a group of complex diseases defined by chronic gastrointestinal inflammation. IBD associated fatty acid profile abnormalities were observed independently of nutritional status or disease activity, suggesting a common genetic background. However, no study so far has attempted to look for overlap between IBD loci and fatty acid associated loci or investigate the genetics of fatty acid profiles in IBD. To this end, we conducted a comprehensive genetic study of fatty acid profiles in IBD using iCHIP, a custom microarray platform designed for deep sequencing of immune-mediated disease associated loci. This study identifies 10 loci associated with fatty acid profiles in IBD. The most significant associations were a locus near CBS (p = 7.62 × 10^-8) and a locus in LRRK2 (p = 1.4 × 10^-7). Of note, this study replicates the FADS gene cluster locus, previously associated with both fatty acid profiles and IBD pathogenesis. Furthermore, we identify 18 carbon chain trans-fatty acids (p = 1.12 × 10^-3), total trans-fatty acids (p = 4.49 × 10^-3), palmitic acid (p = 5.85 × 10^-3) and arachidonic acid (p = 8.58 × 10^-3) as significantly associated with IBD pathogenesis. Copyright © 2018 Elsevier Ltd. All rights reserved.

  6. Multivariate imaging-genetics study of MRI gray matter volume and SNPs reveals biological pathways correlated with brain structural differences in Attention Deficit Hyperactivity Disorder

    Directory of Open Access Journals (Sweden)

    Sabin Khadka

    2016-07-01

    Background: Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder affecting children, adolescents, and adults. Its etiology is not well-understood, but it is increasingly believed to result from diverse pathophysiologies that affect the structure and function of specific brain circuits. Although one of the best-studied neurobiological abnormalities in ADHD is reduced fronto-striatal-cerebellar gray matter volume, its specific genetic correlates are largely unknown. Methods: In this study, T1-weighted MR images of brain structure were collected from 198 adolescents (63 ADHD-diagnosed). A multivariate parallel independent component analysis technique (Para-ICA) identified imaging-genetic relationships between regional gray matter volume and single nucleotide polymorphism data. Results: Para-ICA analyses extracted 14 components from genetic data and 9 from MR data. An iterative cross-validation using randomly-chosen sub-samples indicated acceptable stability of these ICA solutions. A series of partial correlation analyses controlling for age, sex, and ethnicity revealed two genotype-phenotype component pairs that differed significantly between ADHD and non-ADHD groups, after a Bonferroni correction for multiple comparisons. The brain phenotype component not only included structures frequently found to have abnormally low volume in previous ADHD studies, but was also significantly associated with ADHD differences in symptom severity and performance on cognitive tests frequently found to be impaired in patients diagnosed with the disorder. Pathway analysis of the genotype component identified several different biological pathways linked to these structural abnormalities in ADHD. Conclusions: Some of these pathways implicate well-known dopaminergic neurotransmission and neurodevelopment hypothesized to be abnormal in ADHD. Other more recently implicated pathways included glutamatergic and GABAergic physiological systems

  7. Association Study with 77 SNPs Confirms the Robust Role for the rs10830963/G of MTNR1B Variant and Identifies Two Novel Associations in Gestational Diabetes Mellitus Development.

    Directory of Open Access Journals (Sweden)

    Klara Rosta

    Genetic variation in human maternal DNA contributes to the susceptibility for development of gestational diabetes mellitus (GDM). We assessed 77 maternal single nucleotide gene polymorphisms (SNPs) for associations with GDM or plasma glucose levels at OGTT in pregnancy. 960 pregnant women (after dropouts 820: case/control: m'99WHO: 303/517, IADPSG: 287/533) were enrolled in two countries into this case-control study. After genomic DNA isolation the 820 samples were collected in a GDM biobank and assessed using the KASP (LGC Genomics) genotyping assay. Logistic regression risk models were used to calculate ORs according to IADPSG/m'99WHO criteria based on standard OGTT values. The most important risk alleles associated with GDM were rs10830963/G of MTNR1B (OR = 1.84/1.64 [IADPSG/m'99WHO], p = 0.0007/0.006), rs7754840/C (OR = 1.51/NS, p = 0.016) of CDKAL1 and rs1799884/T (OR = 1.4/1.56, p = 0.04/0.006) of GCK. The rs13266634/T (SLC30A8, OR = 0.74/0.71, p = 0.05/0.02) and rs7578326/G (LOC646736/IRS1, OR = 0.62/0.60, p = 0.001/0.006) variants were associated with lower risk to develop GDM. Carrying a minor allele of the rs10830963 (MTNR1B); rs7903146 (TCF7L2); rs1799884 (GCK) SNPs was associated with increased plasma glucose levels at routine OGTT. We confirmed the robust association of MTNR1B rs10830963/G variant with GDM binary and glycemic traits in this Caucasian case-control study. As novel associations we report the minor, G allele of the rs7578326 SNP in the LOC646736/IRS1 region as a significant and the rs13266634/T SNP (SLC30A8) as a suggestive protective variant against GDM development. Genetic susceptibility appears to be more preponderant in individuals who meet both the modified '99WHO and the IADPSG GDM diagnostic criteria.

  8. Multiple Imputation of Predictor Variables Using Generalized Additive Models

    NARCIS (Netherlands)

    de Jong, Roel; van Buuren, Stef; Spiess, Martin

    2016-01-01

    The sensitivity of multiple imputation methods to deviations from their distributional assumptions is investigated using simulations, where the parameters of scientific interest are the coefficients of a linear regression model, and values in predictor variables are missing at random. The

  9. Comparison of different Methods for Univariate Time Series Imputation in R

    OpenAIRE

    Moritz, Steffen; Sardá, Alexis; Bartz-Beielstein, Thomas; Zaefferer, Martin; Stork, Jörg

    2015-01-01

    Missing values in datasets are a well-known problem and there are quite a lot of R packages offering imputation functions. But while imputation in general is well covered within R, it is hard to find functions for imputation of univariate time series. The problem is, most standard imputation techniques can not be applied directly. Most algorithms rely on inter-attribute correlations, while univariate time series imputation needs to employ time dependencies. This paper provides an overview of ...
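
    The point made in the abstract above, that a univariate series has no other attributes to borrow from and must rely on its own time dependencies, can be illustrated with linear interpolation, one of the simplest time-aware methods. The sketch below is written in Python rather than R and is a generic, hypothetical helper; it is not the implementation of any of the R packages the paper compares.

```python
def interpolate_linear(series):
    """Fill gaps (None) in an evenly spaced series by linear interpolation.

    Interior gaps are filled on the straight line between the nearest
    observed neighbours; gaps at either end copy the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]   # leading gap: carry first observation back
        elif right is None:
            filled[i] = series[left]    # trailing gap: carry last observation forward
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


# Hypothetical series with an interior gap and a trailing gap.
print(interpolate_linear([1.0, None, None, 4.0, None]))
```

    Unlike mean imputation, the filled values depend on where the gap sits in time, which is the property the abstract says univariate methods need to exploit.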

  10. Multiple Improvements of Multiple Imputation Likelihood Ratio Tests

    OpenAIRE

    Chan, Kin Wai; Meng, Xiao-Li

    2017-01-01

    Multiple imputation (MI) inference handles missing data by first properly imputing the missing values $m$ times, and then combining the $m$ analysis results from applying a complete-data procedure to each of the completed datasets. However, the existing method for combining likelihood ratio tests has multiple defects: (i) the combined test statistic can be negative in practice when the reference null distribution is a standard $F$ distribution; (ii) it is not invariant to re-parametrization; ...
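
    The combining step described in the abstract above is easiest to see for a single scalar estimate, where the standard tool is Rubin's rules. The Python sketch below pools $m$ point estimates and their variances in that way. It is offered only as background on how MI results are combined: it does not implement the likelihood ratio test combination that the paper improves, and the function name and numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data results for one scalar parameter with Rubin's rules.

    Returns the pooled estimate and its total variance, where the total is
    the within-imputation variance plus (1 + 1/m) times the
    between-imputation variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two imputations")
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average of the m completed-data variances
    between = variance(estimates)   # sample variance of the m estimates (divides by m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, total


# Hypothetical results from m = 3 imputed datasets.
estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, total_variance)
```

    The between-imputation term is what carries the extra uncertainty due to the missing data; analysing a single imputed dataset would leave it out.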

  11. A web-based approach to data imputation

    KAUST Repository

    Li, Zhixu

    2013-10-24

    In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple-level and the database-level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques. © 2013 Springer Science+Business Media New York.

  12. Semantic Modeling for SNPs Associated with Ethnic Disparities in HapMap Samples

    Directory of Open Access Journals (Sweden)

    HyoYoung Kim

    2014-03-01

    Single-nucleotide polymorphisms (SNPs) have been emerging out of the efforts to research human diseases and ethnic disparities. A semantic network is needed for in-depth understanding of the impacts of SNPs, because phenotypes are modulated by complex networks, including biochemical and physiological pathways. We identified ethnicity-specific SNPs by eliminating overlapped SNPs from HapMap samples, and the ethnicity-specific SNPs were mapped to the UCSC RefGene lists. Ethnicity-specific genes were identified as follows: 22 genes in the USA (CEU) individuals, 25 genes in the Japanese (JPT) individuals, and 332 genes in the African (YRI) individuals. To analyze the biologically functional implications for ethnicity-specific SNPs, we focused on constructing a semantic network model. Entities for the network represented by "Gene," "Pathway," "Disease," "Chemical," "Drug," "ClinicalTrials," "SNP," and relationships between entity-entity were obtained through curation. Our semantic modeling for ethnicity-specific SNPs showed interesting results in the three categories, including three diseases ("AIDS-associated nephropathy," "Hypertension," and "Pelvic infection"), one drug ("Methylphenidate"), and five pathways ("Hemostasis," "Systemic lupus erythematosus," "Prostate cancer," "Hepatitis C virus," and "Rheumatoid arthritis"). We found ethnicity-specific genes using the semantic modeling, and the majority of our findings were consistent with the previous studies - that an understanding of genetic variability explained ethnicity-specific disparities.

  13. Domain altering SNPs in the human proteome and their impact on signaling pathways.

    Directory of Open Access Journals (Sweden)

    Yichuan Liu

    Single nucleotide polymorphisms (SNPs) constitute an important mode of genetic variations observed in the human genome. A small fraction of SNPs, about four thousand out of the ten million, has been associated with genetic disorders and complex diseases. The present study focuses on SNPs that fall on protein domains, 3D structures that facilitate connectivity of proteins in cell signaling and metabolic pathways. We scanned the human proteome using the PROSITE web tool and identified proteins with SNP-containing domains. We showed that SNPs that fall on protein domains are highly statistically enriched among SNPs linked to hereditary disorders and complex diseases. Proteins whose domains are dramatically altered by the presence of an SNP are even more likely to be present among proteins linked to hereditary disorders. Proteins with domain-altering SNPs comprise highly connected nodes in cellular pathways such as the focal adhesion, the axon guidance pathway and the autoimmune disease pathways. Statistical enrichment of domain/motif signatures in interacting protein pairs indicates extensive loss of connectivity of cell signaling pathways due to domain-altering SNPs, potentially leading to hereditary disorders.

  14. V-MitoSNP: visualization of human mitochondrial SNPs

    Directory of Open Access Journals (Sweden)

    Tsui Ke-Hung

    2006-08-01

    Background: Mitochondrial single nucleotide polymorphisms (mtSNPs) constitute important data when trying to shed some light on human diseases and cancers. Unfortunately, providing relevant mtSNP genotyping information in mtDNA databases in a neatly organized and transparent visual manner still remains a challenge. Amongst the many methods reported for SNP genotyping, determining the restriction fragment length polymorphisms (RFLPs) is still one of the most convenient and cost-saving methods. In this study, we prepared a visualization of the mtDNA genome that integrates the RFLP genotyping information with mitochondria related cancers and diseases in a user-friendly, intuitive and interactive manner. The inherent problem associated with mtDNA sequences in BLAST of the NCBI database was also solved. Description: V-MitoSNP provides complete mtSNP information for four different kinds of inputs: (1) color-coded visual input by selecting genes of interest on the genome graph, (2) keyword search by locus, disease and mtSNP rs# ID, (3) visualized input of nucleotide range by clicking the selected region of the mtDNA sequence, and (4) sequences mtBLAST. The V-MitoSNP output provides 500 bp (base pairs) flanking sequences for each SNP coupled with the RFLP enzyme and the corresponding natural or mismatched primer sets. The output format enables users to see the SNP genotype pattern of the RFLP by virtual electrophoresis of each mtSNP. The rate of successful design of enzymes and primers for RFLPs in all mtSNPs was 99.1%. The RFLP information was validated by actual agarose electrophoresis and showed successful results for all mtSNPs tested. The mtBLAST function in V-MitoSNP provides the gene information within the input sequence rather than providing the complete mitochondrial chromosome as in the NCBI BLAST database. All mtSNPs with rs number entries in NCBI are integrated in the corresponding SNP in V-MitoSNP. Conclusion: V-MitoSNP is a web

  15. Optimisation and validation of methods to assess single nucleotide polymorphisms (SNPs) in archival histological material

    DEFF Research Database (Denmark)

    Andreassen, C N; Sørensen, Flemming Brandt; Overgaard

    2004-01-01

    only archival specimens are available. This study was conducted to validate protocols optimised for assessment of SNPs based on paraffin embedded, formalin fixed tissue samples. PATIENTS AND METHODS: In 137 breast cancer patients, three TGFB1 SNPs were assessed based on archival histological specimens...... precipitation). RESULTS: Assessment of SNPs based on archival histological material is encumbered by a number of obstacles and pitfalls. However, these can be widely overcome by careful optimisation of the methods used for sample selection, DNA extraction and PCR. Within 130 samples that fulfil the criteria...

  16. Missing data treatments matter: an analysis of multiple imputation for anterior cervical discectomy and fusion procedures.

    Science.gov (United States)

    Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Cui, Jonathan J; Basques, Bryce A; Albert, Todd J; Grauer, Jonathan N

    2018-04-09

    The presence of missing data is a limitation of large datasets, including the National Surgical Quality Improvement Program (NSQIP). In addressing this issue, most studies use complete case analysis, which excludes cases with missing data, thus potentially introducing selection bias. Multiple imputation, a statistically rigorous approach that approximates missing data and preserves sample size, may be an improvement over complete case analysis. The present study aims to evaluate the impact of using multiple imputation in comparison with complete case analysis for assessing the associations between preoperative laboratory values and adverse outcomes following anterior cervical discectomy and fusion (ACDF) procedures. This is a retrospective review of prospectively collected data. Patients undergoing one-level ACDF were identified in NSQIP 2012-2015. Perioperative adverse outcome variables assessed included the occurrence of any adverse event, severe adverse events, and hospital readmission. Missing preoperative albumin and hematocrit values were handled using complete case analysis and multiple imputation. These preoperative laboratory levels were then tested for associations with 30-day postoperative outcomes using logistic regression. A total of 11,999 patients were included. Of this cohort, 63.5% of patients had missing preoperative albumin and 9.9% had missing preoperative hematocrit. When using complete case analysis, only 4,311 patients were studied. The removed patients were significantly younger, healthier, of a common body mass index, and male. Logistic regression analysis failed to identify either preoperative hypoalbuminemia or preoperative anemia as significantly associated with adverse outcomes. When employing multiple imputation, all 11,999 patients were included. Preoperative hypoalbuminemia was significantly associated with the occurrence of any adverse event and severe adverse events. Preoperative anemia was significantly associated with the

  17. Screening and Evaluation of Deleterious SNPs in APOE Gene of Alzheimer’s Disease

    Directory of Open Access Journals (Sweden)

    Tariq Ahmad Masoodi

    2012-01-01

    Introduction. Apolipoprotein E (APOE) is an important risk factor for Alzheimer’s disease (AD) and is present in 30–50% of patients who develop late-onset AD. Several single-nucleotide polymorphisms (SNPs) are present in the APOE gene, which act as biomarkers for exploring the genetic basis of this disease. The objective of this study is to identify deleterious nsSNPs associated with the APOE gene. Methods. The SNPs were retrieved from dbSNP. Using I-Mutant, protein stability change was calculated. The potentially functional nonsynonymous (ns) SNPs and their effect on the protein were predicted by PolyPhen and SIFT, respectively. FASTSNP was used for functional analysis and estimation of risk score. The functional impact on the APOE protein was evaluated by using Swiss PDB viewer and NOMAD-Ref server. Results. Six nsSNPs were found to be least stable by I-Mutant 2.0 with DDG value of >−1.0. Four nsSNPs showed a highly deleterious tolerance index score of 0.00. Nine nsSNPs were found to be probably damaging with position-specific independent counts (PSIC) score of ≥2.0. Seven nsSNPs were found to be highly polymorphic with a risk score of 3-4. The total energies and root-mean-square deviation (RMSD) values were higher for three mutant-type structures compared to the native modeled structure. Conclusion. We conclude that three nsSNPs, namely rs11542041, rs11542040, and rs11542034, are potentially functional polymorphisms.

  18. Geographic differences in allele frequencies of susceptibility SNPs for cardiovascular disease

    Directory of Open Access Journals (Sweden)

    Kullo Iftikhar J

    2011-04-01

    Background: We hypothesized that the frequencies of risk alleles of SNPs mediating susceptibility to cardiovascular diseases differ among populations of varying geographic origin and that population-specific selection has operated on some of these variants. Methods: From the database of genome-wide association studies (GWAS), we selected 36 cardiovascular phenotypes including coronary heart disease, hypertension, and stroke, as well as related quantitative traits (e.g., body mass index and plasma lipid levels). We identified 292 SNPs in 270 genes associated with a disease or trait at P -8. As part of the Human Genome-Diversity Project (HGDP), 158 (54.1%) of these SNPs have been genotyped in 938 individuals belonging to 52 populations from seven geographic areas. A measure of population differentiation, FST, was calculated to quantify differences in risk allele frequencies (RAFs) among populations and geographic areas. Results: Large differences in RAFs were noted in populations of Africa, East Asia, America and Oceania, when compared with other geographic regions. The mean global FST (0.1042) for 158 SNPs among the populations was not significantly higher than the mean global FST of 158 autosomal SNPs randomly sampled from the HGDP database. Significantly higher global FST (P FST of 2036 putatively neutral SNPs. For four of these SNPs, additional evidence of selection was noted based on the integrated Haplotype Score. Conclusion: Large differences in RAFs for a set of common SNPs that influence risk of cardiovascular disease were noted between the major world populations. Pairwise comparisons revealed RAF differences for at least eight SNPs that might be due to population-specific selection or demographic factors. These findings are relevant to a better understanding of geographic variation in the prevalence of cardiovascular disease.
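    To make the FST measure in the record above concrete, here is a minimal sketch of one common definition, FST = (HT - HS) / HT, for a single biallelic SNP. The abstract does not say which estimator the authors used, so the equal population weighting, the function name `global_fst` and the example frequencies are all assumptions made for illustration.

```python
def global_fst(risk_allele_freqs):
    """Wright-style F_ST for one biallelic SNP.

    risk_allele_freqs: risk allele frequency (RAF) in each population.
    Assumption: every population gets equal weight and no sample-size
    correction is applied, so this is a sketch rather than a full estimator.
    """
    n = len(risk_allele_freqs)
    p_bar = sum(risk_allele_freqs) / n
    # Expected heterozygosity if all populations were pooled.
    h_total = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within the individual populations.
    h_within = sum(2 * p * (1 - p) for p in risk_allele_freqs) / n
    if h_total == 0:
        return 0.0  # the SNP is monomorphic everywhere
    return (h_total - h_within) / h_total


# Hypothetical RAFs in two populations (not taken from the study).
print(global_fst([0.2, 0.8]))
```

    Identical frequencies in all populations give 0, and populations fixed for opposite alleles give 1.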

  19. Association between SNPs within candidate genes and compounds related to boar taint and reproduction

    DEFF Research Database (Denmark)

    Moe, Maren; Lien, Sigbjørn; Aasmundstad, Torunn

    2009-01-01

    BACKGROUND: Boar taint is an unpleasant odour and flavour of the meat from some uncastrated male pigs primarily caused by elevated levels of androstenone and skatole in adipose tissue. Androstenone is produced in the same biochemical pathway as testosterone and estrogens, which represents...... of this study was to detect SNPs in boar taint candidate genes and to perform association studies for both single SNPs and haplotypes with levels of boar taint compounds and phenotypes related to reproduction. RESULTS: An association study involving 275 SNPs in 121 genes and compounds related to boar taint...... and reproduction were carried out in Duroc and Norwegian Landrace boars. Phenotypes investigated were levels of androstenone, skatole and indole in adipose tissue, levels of androstenone, testosterone, estrone sulphate and 17beta-estradiol in plasma, and length of bulbo urethralis gland. The SNPs were genotyped...

  20. Imputing historical statistics, soils information, and other land-use data to crop area

    Science.gov (United States)

    Perry, C. R., Jr.; Willis, R. W.; Lautenschlager, L.

    1982-01-01

    In foreign crop condition monitoring, satellite acquired imagery is routinely used. To facilitate interpretation of this imagery, it is advantageous to have estimates of the crop types and their extent for small area units, i.e., grid cells on a map represent, at 60 deg latitude, an area nominally 25 by 25 nautical miles in size. The feasibility of imputing historical crop statistics, soils information, and other ancillary data to crop area for a province in Argentina is studied.

  1. Inference for multivariate regression model based on multiply imputed synthetic data generated via posterior predictive sampling

    Science.gov (United States)

    Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.

    2017-06-01

    The recent popularity of the use of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods of generating and analyzing such data, but almost always relying on asymptotic distributions and in consequence not being adequate for small sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based on exact distributions, this procedure may even be used in small sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with the results obtained from an adaptation of Reiter's combination rule to multiply imputed synthetic datasets, and an application to the 2000 Current Population Survey is discussed.

  2. Trend in BMI z-score among Private Schools’ Students in Delhi using Multiple Imputation for Growth Curve Model

    Directory of Open Access Journals (Sweden)

    Vinay K Gupta

    2016-06-01

    Objective: The aim of the study is to assess the trend in mean BMI z-score among private schools’ students from their anthropometric records when there were missing values in the outcome. Methodology: The anthropometric measurements of students from class 1 to 12 were taken from the records of two private schools in Delhi, India from 2005 to 2010. These records comprise unbalanced longitudinal data, that is, not all the students had measurements recorded in each year. The trend in mean BMI z-score was estimated through a growth curve model. Prior to that, missing values of BMI z-score were imputed through multiple imputation using the same model. A complete case analysis was also performed after excluding missing values to compare the results with those obtained from analysis of multiply imputed data. Results: The mean BMI z-score among school students significantly decreased over time in imputed data (β= -0.2030, se=0.0889, p=0.0232) after adjusting for age, gender, class and school. Complete case analysis also shows a decrease in mean BMI z-score, though it was not statistically significant (β= -0.2861, se=0.0987, p=0.065). Conclusions: The estimates obtained from multiple imputation analysis were better than those of complete data after excluding missing values in terms of lower standard errors. We showed that anthropometric measurements from school records can be used to monitor the weight status of children and adolescents and that multiple imputation using a growth curve model can be useful while analyzing such data.
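    The record above reports a single slope and standard error from multiply imputed data. A minimal sketch of how per-imputation results are usually combined (Rubin's rules) follows; the abstract does not state the pooling method, so its use here is an assumption, and the function name `pool_rubin` and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine m per-imputation estimates with Rubin's rules.

    estimates:  the coefficient estimated on each imputed dataset
    std_errors: the matching standard errors
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    # Within-imputation variance: average squared standard error.
    within = mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the estimates.
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Made-up slopes from three imputed datasets (not the study's values).
slope, se = pool_rubin([-0.20, -0.22, -0.18], [0.09, 0.09, 0.09])
print(round(slope, 4), round(se, 4))
```

    The between-imputation term is what carries the extra uncertainty due to the missing values into the pooled standard error.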

  3. Single nucleotide polymorphisms (SNPs) in coding regions of canine dopamine- and serotonin-related genes

    Directory of Open Access Journals (Sweden)

    Lingaas Frode

    2008-01-01

    Background: Polymorphisms in genes of regulating enzymes, transporters and receptors of the neurotransmitters of the central nervous system have been associated with altered behaviour, and single nucleotide polymorphisms (SNPs) represent the most frequent type of genetic variation. The serotonin and dopamine signalling systems have a central influence on different behavioural phenotypes, both of invertebrates and vertebrates, and this study was undertaken in order to explore genetic variation that may be associated with variation in behaviour. Results: Single nucleotide polymorphisms in canine genes related to behaviour were identified by individually sequencing eight dogs (Canis familiaris) of different breeds. Eighteen genes from the dopamine and the serotonin systems were screened, revealing 34 SNPs distributed in 14 of the 18 selected genes. A total of 24,895 bp coding sequence was sequenced yielding an average frequency of one SNP per 732 bp (1/732). A total of 11 non-synonymous SNPs (nsSNPs), which may be involved in alteration of protein function, were detected. Of these 11 nsSNPs, six resulted in a substitution of amino acid residue with concomitant change in structural parameters. Conclusion: We have identified a number of coding SNPs in behaviour-related genes, several of which change the amino acids of the proteins. Some of the canine SNPs exist in codons that are evolutionarily conserved between five compared species, and predictions indicate that they may have a functional effect on the protein. The reported coding SNP frequency of the studied genes falls within the range of SNP frequencies reported earlier in the dog and other mammalian species. Novel SNPs are presented and the results show a significant genetic variation in expressed sequences in this group of genes. The results can contribute to an improved understanding of the genetics of behaviour.

  4. Comprehensive exploration of the effects of miRNA SNPs on monocyte gene expression.

    Directory of Open Access Journals (Sweden)

    Nicolas Greliche

    We aimed to assess whether pri-miRNA SNPs (miSNPs) could influence monocyte gene expression, either through marginal association or by interacting with polymorphisms located in 3'UTR regions (3utrSNPs). We then conducted a genome-wide search for marginal miSNPs effects and pairwise miSNPs × 3utrSNPs interactions in a sample of 1,467 individuals for which genome-wide monocyte expression and genotype data were available. Statistical associations that survived multiple testing correction were tested for replication in an independent sample of 758 individuals with both monocyte gene expression and genotype data. In both studies, the hsa-mir-1279 rs1463335 was found to modulate in cis the expression of LYZ and in trans the expression of CNTN6, CTRC, COPZ2, KRT9, LRRFIP1, NOD1, PCDHA6, ST5 and TRAF3IP2 genes, supporting the role of hsa-mir-1279 as a regulator of several genes in monocytes. In addition, we identified two robust miSNPs × 3utrSNPs interactions, one involving HLA-DPB1 rs1042448 and hsa-mir-219-1 rs107822, the second the H1F0 rs1894644 and hsa-mir-659 rs5750504, modulating the expression of the associated genes. As some of the aforementioned genes have previously been reported to reside at disease-associated loci, our findings provide novel arguments supporting the hypothesis that the genetic variability of miRNAs could also contribute to the susceptibility to human diseases.

  5. A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods.

    Science.gov (United States)

    Ratcliffe, B; El-Dien, O G; Klápště, J; Porth, I; Chen, C; Jaquish, B; El-Kassaby, Y A

    2015-12-01

    Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3-40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31-0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04-0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated.
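    The record above contrasts mean imputation with k-nearest-neighbor imputation of missing SNP genotypes. The sketch below shows the difference on a toy genotype matrix; it is not the authors' implementation. The 0/1/2 genotype coding, the mean squared distance over shared markers, and the names `impute_mean` and `impute_knn` are assumptions chosen for illustration.

```python
def impute_mean(genotypes):
    """Replace each missing call (None) with the marker's mean genotype."""
    result = [row[:] for row in genotypes]
    for j in range(len(genotypes[0])):
        observed = [row[j] for row in genotypes if row[j] is not None]
        col_mean = sum(observed) / len(observed)  # assumes >= 1 observed call
        for row in result:
            if row[j] is None:
                row[j] = col_mean
    return result


def _distance(a, b):
    """Mean squared genotype difference over markers observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sum((x - y) ** 2 for x, y in shared) / len(shared)


def impute_knn(genotypes, k=2):
    """Replace each missing call with the mean of the k most similar
    individuals that have that marker observed (assumes one exists)."""
    result = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (_distance(row, other), other[j])
                for idx, other in enumerate(genotypes)
                if idx != i and other[j] is not None
            )[:k]
            result[i][j] = sum(g for _, g in donors) / len(donors)
    return result


# Toy data: rows are trees, columns are SNPs coded 0/1/2, None = missing.
toy = [
    [0, 0, 2, None],
    [0, 0, 2, 2],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
]
print(impute_mean(toy)[0][3], impute_knn(toy, k=1)[0][3])
```

    In the toy matrix the first tree matches the second tree at every shared marker, so the nearest-neighbor fill borrows that tree's genotype, whereas the mean fill is pulled toward the two dissimilar trees.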

  6. The Use of Imputed Sibling Genotypes in Sibship-Based Association Analysis: On Modeling Alternatives, Power and Model Misspecification

    NARCIS (Netherlands)

    Minica, C.C.; Dolan, C.V.; Willemsen, G.; Vink, J.M.; Boomsma, D.I.

    2013-01-01

    When phenotypic, but no genotypic data are available for relatives of participants in genetic association studies, previous research has shown that family-based imputed genotypes can boost the statistical power when included in such studies. Here, using simulations, we compared the performance of

  7. TRIP: An interactive retrieving-inferring data imputation approach

    KAUST Repository

    Li, Zhixu

    2016-06-25

    Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitative string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1]. Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1]. This web-based retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead [1]. © 2016 IEEE.

  8. TRIP: An interactive retrieving-inferring data imputation approach

    KAUST Repository

    Li, Zhixu; Qin, Lu; Cheng, Hong; Zhang, Xiangliang; Zhou, Xiaofang

    2016-01-01

    Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitative string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1]. Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1]. This web-based retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead [1]. © 2016 IEEE.

  9. Imputed prices of greenhouse gases and land forests

    International Nuclear Information System (INIS)

    Uzawa, Hirofumi

    1993-01-01

    The theory of dynamic optimum formulated by Maeler gives us the basic theoretical framework within which it is possible to analyse the economic and, possibly, political circumstances under which the phenomenon of global warming occurs, and to search for the policy and institutional arrangements whereby it would be effectively arrested. The analysis developed here is an application of Maeler's theory to atmospheric quality. In the analysis a central role is played by the concept of imputed price in the dynamic context. Our determination of imputed prices of atmospheric carbon dioxide and land forests takes into account the difference in the stages of economic development. Indeed, the ratios of the imputed prices of atmospheric carbon dioxide and land forests over the per capita level of real national income are identical for all countries involved. (3 figures, 2 tables) (Author)

  10. Imputation of genotypes from low density (50,000 markers) to high density (700,000 markers) of cows from research herds in Europe, North America, and Australasia using 2 reference populations

    DEFF Research Database (Denmark)

    Pryce, J E; Johnston, J; Hayes, B J

    2014-01-01

    detection in genome-wide association studies and the accuracy of genomic selection may increase when the low-density genotypes are imputed to higher density. Genotype data were available from 10 research herds: 5 from Europe [Denmark, Germany, Ireland, the Netherlands, and the United Kingdom (UK)], 2 from...... reference populations. Although it was not possible to use a combined reference population, which would probably result in the highest accuracies of imputation, differences arising from using 2 high-density reference populations on imputing 50,000-marker genotypes of 583 animals (from the UK) were...... information exploited. The UK animals were also included in the North American data set (n = 1,579) that was imputed to high density using a reference population of 2,018 bulls. After editing, 591,213 genotypes on 5,999 animals from 10 research herds remained. The correlation between imputed allele...

  11. Consortium analysis of 7 candidate SNPs for ovarian cancer

    DEFF Research Database (Denmark)

    Ramus, S.J.; Vierkant, R.A.; Johnatty, S.E.

    2008-01-01

    The Ovarian Cancer Association Consortium selected 7 candidate single nucleotide polymorphisms (SNPs), for which there is evidence from previous studies of an association with variation in ovarian cancer or breast cancer risks. The SNPs selected for analysis were F31I (rs2273535) in AURKA, N372H...... (rs144848) in BRCA2, rs2854344 in intron 17 of RB1, rs2811712 5' flanking CDKN2A, rs523349 in the 3' UTR of SRD5A2, D302H (rs1045485) in CASP8 and L10P (rs1982073) in TGFB1. Fourteen studies genotyped 4,624 invasive epithelial ovarian cancer cases and 8,113 controls of white non-Hispanic origin...... was suggestive although no longer statistically significant (ordinal OR 0.92, 95% CI 0.79-1.06). This SNP has also been shown to have an association with decreased risk in breast cancer. There was a suggestion of an association for AURKA, when one study that caused significant study heterogeneity was excluded...

  12. Missing Data Imputation of Solar Radiation Data under Different Atmospheric Conditions

    Science.gov (United States)

    Turrado, Concepción Crespo; López, María del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; de Cos Juez, Francisco Javier

    2014-01-01

    Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW. PMID:25356644
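    Inverse Distance Weighting, one of the baseline methods named in the record above, can be sketched in a few lines: a missing reading is estimated as a weighted mean of simultaneous readings at other stations, with weights 1/d^p. The station coordinates, readings, the exponent p = 2 and the function name `idw_impute` below are illustrative assumptions, not values from the study.

```python
from math import hypot


def idw_impute(target_xy, neighbours, power=2):
    """Estimate a missing reading at target_xy by inverse distance weighting.

    neighbours: list of ((x, y), value) pairs for stations that did report.
    Assumption: planar coordinates in consistent units, power = 2 by default.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in neighbours:
        distance = hypot(x - target_xy[0], y - target_xy[1])
        if distance == 0:
            return value  # a co-located station reported: use it directly
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations at 1 and 2 distance units from the failed sensor.
print(idw_impute((0.0, 0.0), [((1.0, 0.0), 100.0), ((2.0, 0.0), 200.0)]))
```

    The nearer station dominates the estimate because its weight is four times that of the station twice as far away.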

  13. Missing Data Imputation of Solar Radiation Data under Different Atmospheric Conditions

    Directory of Open Access Journals (Sweden)

    Concepción Crespo Turrado

    2014-10-01

    Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW.

  14. Missing data imputation of solar radiation data under different atmospheric conditions.

    Science.gov (United States)

    Turrado, Concepción Crespo; López, María Del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; Juez, Francisco Javier de Cos

    2014-10-29

    Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW.

  15. Using imputation to provide location information for nongeocoded addresses.

    Directory of Open Access Journals (Sweden)

    Frank C Curriero

    2010-02-01

    The importance of geography as a source of variation in health research continues to receive sustained attention in the literature. The inclusion of geographic information in such research often begins by adding data to a map which is predicated by some knowledge of location. A precise level of spatial information is conventionally achieved through geocoding, the geographic information system (GIS) process of translating mailing address information to coordinates on a map. The geocoding process is not without its limitations, though, since there is always a percentage of addresses which cannot be converted successfully (nongeocodable). This raises concerns regarding bias since traditionally the practice has been to exclude nongeocoded data records from analysis. In this manuscript we develop and evaluate a set of imputation strategies for dealing with missing spatial information from nongeocoded addresses. The strategies are developed assuming a known zip code with increasing use of collateral information, namely the spatial distribution of the population at risk. Strategies are evaluated using prostate cancer data obtained from the Maryland Cancer Registry. We consider total case enumerations at the Census county, tract, and block group level as the outcome of interest when applying and evaluating the methods. Multiple imputation is used to provide estimated total case counts based on complete data (geocodes plus imputed nongeocodes) with a measure of uncertainty. Results indicate that the imputation strategy based on using available population-based age, gender, and race information performed the best overall at the county, tract, and block group levels. The procedure allows for the potentially biased and likely under reported outcome, case enumerations based on only the geocoded records, to be presented with a statistically adjusted count (imputed count) with a measure of uncertainty that is based on all the case data, the geocodes and imputed

  16. Evidence of Stage- and Age-Related Heterogeneity of Non-HLA SNPs and Risk of Islet Autoimmunity and Type 1 Diabetes: The Diabetes Autoimmunity Study in the Young

    Directory of Open Access Journals (Sweden)

    Brittni N. Frederiksen

    2013-01-01

    Previously, we examined 20 non-HLA SNPs for association with islet autoimmunity (IA) and/or progression to type 1 diabetes (T1D). Our objective was to investigate fourteen additional non-HLA T1D candidate SNPs for stage- and age-related heterogeneity in the etiology of T1D. Of 1634 non-Hispanic white DAISY children genotyped, 132 developed IA (positive for GAD, insulin, or IA-2 autoantibodies at two or more consecutive visits); 50 IA positive children progressed to T1D. Cox regression was used to analyze risk of IA and progression to T1D in IA positive children. Restricted cubic splines were used to model SNPs when there was evidence that risk was not constant with age. C1QTNF6 (rs229541) predicted increased IA risk (HR: 1.57, CI: 1.20–2.05) but not progression to T1D (HR: 1.13, CI: 0.75–1.71). SNP (rs10517086) appears to exhibit an age-related effect on risk of IA, with increased risk before age 2 years (age 2 HR: 1.67, CI: 1.08–2.56) but not older ages (age 4 HR: 0.84, CI: 0.43–1.62). C1QTNF6 (rs229541), SNP (rs10517086), and UBASH3A (rs3788013) were associated with development of T1D. This prospective investigation of non-HLA T1D candidate loci shows that some SNPs may exhibit stage- and age-related heterogeneity in the etiology of T1D.

  17. Multiple imputation of missing passenger boarding data in the national census of ferry operators

    Science.gov (United States)

    2008-08-01

    This report presents findings from the 2006 National Census of Ferry Operators (NCFO) augmented with imputed values for passengers and passenger miles. Due to the imputation procedures used to calculate missing data, totals in Table 1 may not corresp...

  18. The effects of non-synonymous single nucleotide polymorphisms (nsSNPs) on protein-protein interactions.

    Science.gov (United States)

    Yates, Christopher M; Sternberg, Michael J E

    2013-11-01

    Non-synonymous single nucleotide polymorphisms (nsSNPs) are single base changes leading to a change to the amino acid sequence of the encoded protein. Many of these variants are associated with disease, so nsSNPs have been well studied, with studies looking at the effects of nsSNPs on individual proteins, for example, on stability and enzyme active sites. In recent years, the impact of nsSNPs upon protein-protein interactions has also been investigated, giving a greater insight into the mechanisms by which nsSNPs can lead to disease. In this review, we summarize these studies, looking at the various mechanisms by which nsSNPs can affect protein-protein interactions. We focus on structural changes that can impair interaction, changes to disorder, gain of interaction, and post-translational modifications before looking at some examples of nsSNPs at human-pathogen protein-protein interfaces and the analysis of nsSNPs from a network perspective. © 2013.

  19. Imputation of single nucleotide polymorphism genotypes of Hereford cattle: reference panel size, family relationship and population structure

    Science.gov (United States)

    The objective of this study is to investigate single nucleotide polymorphism (SNP) genotypes imputation of Hereford cattle. Purebred Herefords were from two sources, Line 1 Hereford (N=240) and representatives of Industry Herefords (N=311). Using different reference panels of 62 and 494 males with 1...

  20. Identification and analysis of Single Nucleotide Polymorphisms (SNPs) in the mosquito Anopheles funestus, malaria vector

    Directory of Open Access Journals (Sweden)

    Hemingway Janet

    2007-01-01

    Full Text Available Background: Single nucleotide polymorphisms (SNPs) are the most common source of genetic variation in eukaryotic species and have become an important marker for genetic studies. The mosquito Anopheles funestus is one of the major malaria vectors in Africa and yet, prior to this study, no SNPs have been described for this species. Here we report a genome-wide set of SNP markers for use in genetic studies on this important human disease vector. Results: DNA fragments from 50 genes were amplified and sequenced from 21 specimens of An. funestus. A third of specimens were field collected in Malawi, a third from a colony of Mozambican origin and a third from a colony of Angolan origin. A total of 494 SNPs including 303 within the coding regions of genes and 5 indels were identified. The physical positions of these SNPs in the genome are known. There were on average 7 SNPs per kilobase, similar to that observed in An. gambiae and Drosophila melanogaster. Transitions outnumbered transversions, at a ratio of 2:1. The increased frequency of transition substitutions in coding regions is likely due to the structure of the genetic code and selective constraints. Synonymous sites within coding regions showed a higher polymorphism rate than non-coding introns or 3' and 5' flanking DNA, with most of the substitutions in coding regions being observed at the 3rd codon position. A positive correlation in the level of polymorphism was observed between coding and non-coding regions within a gene. By genotyping a subset of 30 SNPs, we confirmed the validity of the SNPs identified during this study. Conclusion: This set of SNP markers represents a useful tool for genetic studies in An. funestus, and will be useful in identifying candidate genes that affect diverse ranges of phenotypes that impact on vector control, such as insecticide resistance, mosquito behavior and vector competence.
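
    The transition/transversion ratio reported in this record can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names and the (ref, alt) input format are assumptions made for illustration only. A transition swaps two purines (A, G) or two pyrimidines (C, T); every other single-base substitution is a transversion.

```python
# Hypothetical helper names; input is a list of (ref, alt) base pairs.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(substitutions) -> float:
    """Ratio of transitions to transversions in a list of (ref, alt) pairs."""
    subs = list(substitutions)
    ts = sum(1 for ref, alt in subs if is_transition(ref, alt))
    tv = len(subs) - ts
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ts / tv
```

    For example, a toy list with two transitions (A>G, C>T) and one transversion (A>C) gives a ratio of 2.0, the same 2:1 proportion quoted in the abstract.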

  1. A reduced number of mtSNPs saturates mitochondrial DNA haplotype diversity of worldwide population groups.

    Science.gov (United States)

    Salas, Antonio; Amigo, Jorge

    2010-05-03

    The high levels of variation characterising the mitochondrial DNA (mtDNA) molecule are due ultimately to its high average mutation rate; moreover, mtDNA variation is deeply structured in different populations and ethnic groups. There is growing interest in selecting a reduced number of mtDNA single nucleotide polymorphisms (mtSNPs) that account for the maximum level of discrimination power in a given population. Applications of the selected mtSNP panel range from anthropologic and medical studies to forensic genetic casework. This study proposes a new simulation-based method that explores the ability of different mtSNP panels to yield the maximum levels of discrimination power. The method explores subsets of mtSNPs of different sizes randomly chosen from a preselected panel of mtSNPs based on frequency. More than 2,000 complete genomes representing three main continental human population groups (Africa, Europe, and Asia) and two admixed populations ("African-Americans" and "Hispanics") were collected from GenBank and the literature, and were used as training sets. Haplotype diversity was measured for each combination of mtSNP and compared with existing mtSNP panels available in the literature. The data indicates that only a reduced number of mtSNPs ranging from six to 22 are needed to account for 95% of the maximum haplotype diversity of a given population sample. However, only a small proportion of the best mtSNPs are shared between populations, indicating that there is not a perfect set of "universal" mtSNPs suitable for all population contexts. The discrimination power provided by these mtSNPs is much higher than the power of the mtSNP panels proposed in the literature to date. Some mtSNP combinations also yield high diversity values in admixed populations. The proposed computational approach for exploring combinations of mtSNPs that optimise the discrimination power of a given set of mtSNPs is more efficient than previous empirical approaches. In contrast to

  2. Linkage disequilibrium between STRPs and SNPs across the human genome.

    Science.gov (United States)

    Payseur, Bret A; Place, Michael; Weber, James L

    2008-05-01

    Patterns of linkage disequilibrium (LD) reveal the action of evolutionary processes and provide crucial information for association mapping of disease genes. Although recent studies have described the landscape of LD among single nucleotide polymorphisms (SNPs) from across the human genome, associations involving other classes of molecular variation remain poorly understood. In addition to recombination and population history, mutation rate and process are expected to shape LD. To test this idea, we measured associations between short-tandem-repeat polymorphisms (STRPs), which can mutate rapidly and recurrently, and SNPs in 721 regions across the human genome. We directly compared STRP-SNP LD with SNP-SNP LD from the same genomic regions in the human HapMap populations. The intensity of STRP-SNP LD, measured by the average of D', was reduced, consistent with the action of recurrent mutation. Nevertheless, a higher fraction of STRP-SNP pairs than SNP-SNP pairs showed significant LD, on both short (up to 50 kb) and long (cM) scales. These results reveal the substantial effects of mutational processes on LD at STRPs and provide important measures of the potential of STRPs for association mapping of disease genes.
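
    The D' statistic mentioned in this record can be sketched for the simplest case of two biallelic loci. This is an illustration under stated assumptions, not the study's method: STRPs are multi-allelic and would need a multi-allelic extension of D', and the function name and frequency inputs here are hypothetical.

```python
def d_prime(p_ab: float, p_a: float, p_b: float) -> float:
    """Absolute Lewontin's D' for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # a monomorphic locus carries no LD information
    return abs(d) / d_max
```

    Dividing D by its maximum attainable value given the allele frequencies is what makes D' comparable across marker pairs with different allele frequencies.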

  3. Auxiliary variables in multiple imputation in regression with missing X: a warning against including too many in small sample research

    Directory of Open Access Journals (Sweden)

    Hardt Jochen

    2012-12-01

    Full Text Available Background: Multiple imputation is becoming increasingly popular. Theoretical considerations as well as simulation studies have shown that the inclusion of auxiliary variables is generally of benefit. Methods: A simulation study of a linear regression with a response Y and two predictors X1 and X2 was performed on data with n = 50, 100 and 200 using complete cases or multiple imputation with 0, 10, 20, 40 and 80 auxiliary variables. Mechanisms of missingness were either 100% MCAR or 50% MAR + 50% MCAR. Auxiliary variables had low (r=.10) vs. moderate correlations (r=.50) with X’s and Y. Results: The inclusion of auxiliary variables can improve a multiple imputation model. However, inclusion of too many variables leads to downward bias of regression coefficients and decreases precision. When the correlations are low, inclusion of auxiliary variables is not useful. Conclusion: More research on auxiliary variables in multiple imputation should be performed. A preliminary rule of thumb could be that the ratio of variables to cases with complete data should not go below 1 : 3.

  4. Common non-synonymous SNPs associated with breast cancer susceptibility

    DEFF Research Database (Denmark)

    Milne, Roger L; Burwinkel, Barbara; Michailidou, Kyriaki

    2014-01-01

    Candidate variant association studies have been largely unsuccessful in identifying common breast cancer susceptibility variants, although most studies have been underpowered to detect associations of a realistic magnitude. We assessed 41 common non-synonymous single-nucleotide polymorphisms (ns......SNPs) for which evidence of association with breast cancer risk had been previously reported. Case-control data were combined from 38 studies of white European women (46 450 cases and 42 600 controls) and analyzed using unconditional logistic regression. Strong evidence of association was observed for three ns...... associations reached genome-wide statistical significance in a combined analysis of available data, including independent data from nine genome-wide association studies (GWASs): for ATXN7-K264R, OR = 1.07 (95% CI = 1.05-1.10, P = 1.0 × 10^-8); for AKAP9-M463I, OR = 1.05 (95% CI = 1.04-1.07, P = 2.0 × 10...

  5. Portability of tag SNPs across isolated population groups: an example from India.

    Science.gov (United States)

    Sarkar Roy, N; Farheen, S; Roy, N; Sengupta, S; Majumder, P P

    2008-01-01

    Isolated population groups are useful in conducting association studies of complex diseases to avoid various pitfalls, including those arising from population stratification. Since DNA resequencing is expensive, it is recommended that genotyping be carried out at tagSNP (tSNP) loci. For this, tSNPs identified in one isolated population need to be used in another. Unless tSNPs are highly portable across populations this strategy may result in loss of information in association studies. We examined the issue of tSNP portability by sampling individuals from 10 isolated ethnic groups from India. We generated DNA resequencing data pertaining to 3 genomic regions and identified tSNPs in each population. We defined an index of tSNP portability and showed that portability is low across isolated Indian ethnic groups. The extent of portability did not significantly correlate with genetic similarity among the populations studied here. We also analyzed our data with sequence data from individuals of African and European descent. Our results indicated that it may be necessary to carry out resequencing in a small number of individuals to discover SNPs and identify tSNPs in the specific isolated population in which a disease association study is to be conducted.

  6. Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

    Science.gov (United States)

    Nguyen, Thanh-Tung; Huang, Joshua; Wu, Qingyao; Nguyen, Thuy; Li, Mark

    2015-01-01

    Single-nucleotide polymorphism (SNP) selection and identification are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data is very high dimensional and a large portion of the SNPs in the data is irrelevant to the disease. Advanced machine learning methods have been successfully used in genome-wide association studies (GWAS) for identification of genetic variants that have relatively big effects in some common, complex diseases. Among them, the most successful one is Random Forests (RF). Despite performing well in terms of prediction accuracy on some data sets of moderate size, RF still struggles in GWAS when selecting informative SNPs and building accurate prediction models. In this paper, we propose to use a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection for GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs into two groups. The informative SNPs group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building trees for the forest, only those SNPs from the two sub-groups are taken into account. The feature subspaces always contain highly informative SNPs when used to split a node of a tree. This approach enables one to generate more accurate trees with a lower prediction error, while possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprised of 408,803 SNPs and Alzheimer case-control data comprised of 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduced prediction errors and outperformed

  7. Four new single nucleotide polymorphisms (SNPs) of toll-like ...

    African Journals Online (AJOL)

    In order to reveal the single nucleotide polymorphisms (SNPs), genotypes and allelic frequencies of each mutation site of TLR7 gene in Chinese native duck breeds, SNPs of duck TLR7 gene were detected by DNA sequencing. The genotypes of 465 native ducks from eight key protected duck breeds were determined by ...

  8. Investigation of SNPs in the porcine desmoglein 1 gene

    DEFF Research Database (Denmark)

    Daugaard, L.; Andresen, Lars Ole; Fredholm, M.

    2007-01-01

    epidermitis were diagnosed clinically as affected or unaffected. Two regions of the desmoglein I gene were sequenced and genotypes of the SNPs were established. Seven SNPs (823T>C, 828A>G, 829A>G, 830A>T, 831A>T, 838A>C and 1139C>T) were found in the analysed sequences and the allele frequencies were...... the location of single nucleotide polymorphisms (SNPs) in the porcine desmoglein I gene (PIG)DSGI in correlation to the cleavage site as well as if the genotype of the SNPs is correlated to susceptibility or resistance to the disease. Results: DNA from 32 affected and 32 unaffected piglets with exudative...... the genotypes of two out of seven SNPs found in the porcine desmoglein I gene and the susceptibility to exudative epidermitis....

  9. A Bayesian Hierarchical Model for Relating Multiple SNPs within Multiple Genes to Disease Risk

    Directory of Open Access Journals (Sweden)

    Lewei Duan

    2013-01-01

    Full Text Available A variety of methods have been proposed for studying the association of multiple genes thought to be involved in a common pathway for a particular disease. Here, we present an extension of a Bayesian hierarchical modeling strategy that allows for multiple SNPs within each gene, with external prior information at either the SNP or gene level. The model involves variable selection at the SNP level through latent indicator variables and Bayesian shrinkage at the gene level towards a prior mean vector and covariance matrix that depend on external information. The entire model is fitted using Markov chain Monte Carlo methods. Simulation studies show that the approach is capable of recovering many of the truly causal SNPs and genes, depending upon their frequency and size of their effects. The method is applied to data on 504 SNPs in 38 candidate genes involved in DNA damage response in the WECARE study of second breast cancers in relation to radiotherapy exposure.

  10. Genome-wide association study of primary tooth eruption identifies pleiotropic loci associated with height and craniofacial distances

    DEFF Research Database (Denmark)

    Fatemifar, Ghazaleh; Hoggart, Clive J; Paternoster, Lavinia

    2013-01-01

    Twin and family studies indicate that the timing of primary tooth eruption is highly heritable, with estimates typically exceeding 80%. To identify variants involved in primary tooth eruption, we performed a population-based genome-wide association study of 'age at first tooth' and 'number of teeth......' using 5998 and 6609 individuals, respectively, from the Avon Longitudinal Study of Parents and Children (ALSPAC) and 5403 individuals from the 1966 Northern Finland Birth Cohort (NFBC1966). We tested 2 446 724 SNPs imputed in both studies. Analyses were controlled for the effect of gestational age, sex...

  11. Combinations of SNPs Related to Signal Transduction in Bipolar Disorder

    DEFF Research Database (Denmark)

    Koefoed, Pernille; Andreassen, Ole A; Bennike, Bente

    2011-01-01

    of complex diseases, it may be useful to look at combinations of genotypes. Genes related to signal transmission, e.g., ion channel genes, may be of interest in this respect in the context of bipolar disorder. In the present study, we analysed 803 SNPs in 55 genes related to aspects of signal transmission...... and calculated all combinations of three genotypes from the 3×803 SNP genotypes for 1355 controls and 607 patients with bipolar disorder. Four clusters of patient-specific combinations were identified. Permutation tests indicated that some of these combinations might be related to bipolar disorder. The WTCCC...... in the clusters in the two datasets. The present analyses of the combinations of SNP genotypes support a role for both genetic heterogeneity and interactions in the genetic architecture of bipolar disorder....

  12. Screening for SNPs with Allele-Specific Methylation based on Next-Generation Sequencing Data.

    Science.gov (United States)

    Hu, Bo; Ji, Yuan; Xu, Yaomin; Ting, Angela H

    2013-05-01

    Allele-specific methylation (ASM) has long been studied but mainly documented in the context of genomic imprinting and X chromosome inactivation. Taking advantage of the next-generation sequencing technology, we conduct a high-throughput sequencing experiment with four prostate cell lines to survey the whole genome and identify single nucleotide polymorphisms (SNPs) with ASM. A Bayesian approach is proposed to model the counts of short reads for each SNP conditional on its genotypes of multiple subjects, leading to a posterior probability of ASM. We flag SNPs with high posterior probabilities of ASM by accounting for multiple comparisons based on posterior false discovery rates. Applying the Bayesian approach to the in-house prostate cell line data, we identify 269 SNPs as candidates of ASM. A simulation study is carried out to demonstrate the quantitative performance of the proposed approach.
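
    Flagging SNPs by posterior false discovery rate, as described in this record, can be sketched with one common rule. This is an assumption for illustration and may differ from the authors' exact procedure: rank SNPs by posterior probability and keep the largest top-ranked set whose average posterior probability of being a false call, 1 - p, stays at or below a target level. The function name and the alpha default are hypothetical.

```python
def flag_by_posterior_fdr(post_probs, alpha=0.05):
    """Return indices of items flagged at posterior FDR <= alpha.

    post_probs: posterior probability that each SNP shows the effect.
    The estimated FDR of a flagged set is the mean of (1 - p) over the set.
    """
    order = sorted(range(len(post_probs)),
                   key=lambda i: post_probs[i], reverse=True)
    flagged = []
    running_error = 0.0
    for k, i in enumerate(order, start=1):
        running_error += 1.0 - post_probs[i]
        # Probabilities are visited in decreasing order, so the running
        # mean error never decreases; stop at the first violation.
        if running_error / k > alpha:
            break
        flagged.append(i)
    return flagged
```

    With posterior probabilities 0.99, 0.5, 0.97 and 0.2 and alpha = 0.05, only the first and third items are flagged, because adding the 0.5 item would raise the estimated FDR of the set to 0.18.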

  13. An Imputation Model for Dropouts in Unemployment Data

    Directory of Open Access Journals (Sweden)

    Nilsson Petra

    2016-09-01

    Full Text Available Incomplete unemployment data is a fundamental problem when evaluating labour market policies in several countries. Many unemployment spells end for unknown reasons; in the Swedish Public Employment Service’s register as many as 20 percent. This leads to an ambiguity regarding destination states (employment, unemployment, retired, etc.). According to complete combined administrative data, the employment rate among dropouts was close to 50 for the years 1992 to 2006, but from 2007 the employment rate has dropped to 40 or less. This article explores an imputation approach. We investigate imputation models estimated both on survey data from 2005/2006 and on complete combined administrative data from 2005/2006 and 2011/2012. The models are evaluated in terms of their ability to make correct predictions. The models have relatively high predictive power.

  14. Towards a more efficient representation of imputation operators in TPOT

    OpenAIRE

    Garciarena, Unai; Mendiburu, Alexander; Santana, Roberto

    2018-01-01

    Automated Machine Learning encompasses a set of meta-algorithms intended to design and apply machine learning techniques (e.g., model selection, hyperparameter tuning, model assessment, etc.). TPOT, a software tool for optimizing machine learning pipelines based on genetic programming (GP), is a novel example of this kind of application. Recently we have proposed a way to introduce imputation methods as part of TPOT. While our approach was able to deal with problems with missing data, it can prod...

  15. DTW-APPROACH FOR UNCORRELATED MULTIVARIATE TIME SERIES IMPUTATION

    OpenAIRE

    Phan , Thi-Thu-Hong; Poisson Caillault , Emilie; Bigand , André; Lefebvre , Alain

    2017-01-01

    Missing data are inevitable in almost all domains of applied sciences. Data analysis with missing values can lead to a loss of efficiency and unreliable results, especially for large missing sub-sequence(s). Some well-known methods for multivariate time series imputation require high correlations between series or their features. In this paper, we propose an approach based on the shape-behaviour relation in low/un-correlated multivariate time series under an assumption of...

  16. Which DTW Method Applied to Marine Univariate Time Series Imputation

    OpenAIRE

    Phan , Thi-Thu-Hong; Caillault , Émilie; Lefebvre , Alain; Bigand , André

    2017-01-01

    Missing data are ubiquitous in all domains of applied sciences. Processing datasets containing missing values can lead to a loss of efficiency and unreliable results, especially for large missing sub-sequence(s). Therefore, the aim of this paper is to build a framework for filling missing values in univariate time series and to perform a comparison of different similarity metrics used for the imputation task. This makes it possible to suggest the most suitable methods for the imp...

  17. In-silico analysis of non-synonymous-SNPs of STEAP2: To provoke the progression of prostate cancer

    Directory of Open Access Journals (Sweden)

    Naveed Muhammad

    2016-01-01

    Full Text Available As a novel biomarker from the STEAP family, STEAP2 encodes six transmembrane epithelial antigens to prostate cancer. The overexpression of STEAP2 is predicted as the second most common cancer in the world that is responsible for male cancer-related deaths. Nonsynonymous SNPs are an important group of SNPs which lead to alterations in encoded polypeptides. Changes in the amino acid sequence of gene products can lead to abnormal tissue function. The present study firstly sorted out those SNPs which exist in the coding region of STEAP2 and evaluated their impact through computational tools. Secondly, the three-dimensional structure of STEAP2 was formed through I-TASSER and validated by different software. Genomic data has been retrieved from the 1000 Genome project and Ensembl and subsequently analysed using computational tools. Out of 177 non-synonymous single nucleotide polymorphisms (nsSNPs) within the coding region, 42 missense SNPs have been predicted as deleterious by all analyses. Our research shows a well-designed computational methodology to inspect the prostate cancer associated nsSNPs. It can be concluded that these nsSNPs can play their role in the up-regulation of STEAP2 which further leads to progression of prostate cancer. It can benefit scientists in the handling of cancer-associated diseases related to STEAP2 through developing novel drug therapies.

  18. Handling missing data in cluster randomized trials: A demonstration of multiple imputation with PAN through SAS

    Directory of Open Access Journals (Sweden)

    Jiangxiu Zhou

    2014-09-01

    Full Text Available The purpose of this study is to demonstrate a way of dealing with missing data in clustered randomized trials by doing multiple imputation (MI) with the PAN package in R through SAS. The procedure for doing MI with PAN through SAS is demonstrated in detail in order for researchers to be able to use this procedure with their own data. An illustration of the technique with empirical data was also included. In this illustration the PAN results were compared with pairwise deletion and three types of MI: (1) Normal Model (NM)-MI ignoring the cluster structure; (2) NM-MI with dummy-coded cluster variables (fixed cluster structure); and (3) a hybrid NM-MI which imputes half the time ignoring the cluster structure, and the other half including the dummy-coded cluster variables. The empirical analysis showed that using PAN and the other strategies produced comparable parameter estimates. However, the dummy-coded MI overestimated the intraclass correlation, whereas MI ignoring the cluster structure and the hybrid MI underestimated the intraclass correlation. When compared with PAN, the p-value and standard error for the treatment effect were higher with dummy-coded MI, and lower with MI ignoring the cluster structure, the hybrid MI approach, and pairwise deletion. Previous studies have shown that NM-MI is not appropriate for handling missing data in clustered randomized trials. This approach, in addition to the pairwise deletion approach, leads to a biased intraclass correlation and faulty statistical conclusions. Imputation in clustered randomized trials should be performed with PAN. We have demonstrated an easy way for using PAN through SAS.

  19. Using the Superpopulation Model for Imputations and Variance Computation in Survey Sampling

    Directory of Open Access Journals (Sweden)

    Petr Novák

    2012-03-01

    Full Text Available This study is aimed at variance computation techniques for estimates of population characteristics based on survey sampling and imputation. We use the superpopulation regression model, which means that the target variable values for each statistical unit are treated as random realizations of a linear regression model with weighted variance. We focus on regression models with one auxiliary variable and no intercept, which have many applications and straightforward interpretation in business statistics. Furthermore, we deal with cases where the estimates are not independent and thus the covariance must be computed. We also consider chained regression models with auxiliary variables as random variables instead of constants.

  20. A suggested approach for imputation of missing dietary data for young children in daycare

    OpenAIRE

    Stevens, June; Ou, Fang-Shu; Truesdale, Kimberly P.; Zeng, Donglin; Vaughn, Amber E.; Pratt, Charlotte; Ward, Dianne S.

    2015-01-01

    Background: Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare, making accurate proxy reports by parents difficult. Objective: The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Design: Data were from children aged 2-5 years in the My Parenting...

  1. A spatial haplotype copying model with applications to genotype imputation.

    Science.gov (United States)

    Yang, Wen-Yun; Hormozdiari, Farhad; Eskin, Eleazar; Pasaniuc, Bogdan

    2015-05-01

    Ever since its introduction, the haplotype copy model has proven to be one of the most successful approaches for modeling genetic variation in human populations, with applications ranging from ancestry inference to genotype phasing and imputation. Motivated by coalescent theory, this approach assumes that any chromosome (haplotype) can be modeled as a mosaic of segments copied from a set of chromosomes sampled from the same population. At the core of the model is the assumption that any chromosome from the sample is equally likely to contribute a priori to the copying process. Motivated by recent works that model genetic variation in a geographic continuum, we propose a new spatial-aware haplotype copy model that jointly models geography and the haplotype copying process. We extend hidden Markov models of haplotype diversity such that at any given location, haplotypes that are closest in the genetic-geographic continuum map are a priori more likely to contribute to the copying process than distant ones. Through simulations starting from the 1000 Genomes data, we show that our model achieves superior accuracy in genotype imputation over the standard spatial-unaware haplotype copy model. In addition, we show the utility of our model in selecting a small personalized reference panel for imputation that leads to both improved accuracy as well as to a lower computational runtime than the standard approach. Finally, we show our proposed model can be used to localize individuals on the genetic-geographical map on the basis of their genotype data.

  2. A Note on the Effect of Data Clustering on the Multiple-Imputation Variance Estimator: A Theoretical Addendum to the Lewis et al. article in JOS 2014

    Directory of Open Access Journals (Sweden)

    He Yulei

    2016-03-01

    Full Text Available Multiple imputation is a popular approach to handling missing data. Although it was originally motivated by survey nonresponse problems, it has been readily applied to other data settings. However, its general behavior still remains unclear when applied to survey data with complex sample designs, including clustering. Recently, Lewis et al. (2014) compared single- and multiple-imputation analyses for certain incomplete variables in the 2008 National Ambulatory Medical Care Survey, which has a nationally representative, multistage, and clustered sampling design. Their study results suggested that the increase of the variance estimate due to multiple imputation compared with single imputation largely disappears for estimates with large design effects. We complement their empirical research by providing some theoretical reasoning. We consider data sampled from an equally weighted, single-stage cluster design and characterize the process using a balanced, one-way normal random-effects model. Assuming that the missingness is completely at random, we derive analytic expressions for the within- and between-multiple-imputation variance estimators for the mean estimator, and thus conveniently reveal the impact of design effects on these variance estimators. We propose approximations for the fraction of missing information in clustered samples, extending previous results for simple random samples. We discuss some generalizations of this research and its practical implications for data release by statistical agencies.
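
    The within- and between-imputation variance components discussed in this record are the ingredients of Rubin's combining rules. The sketch below shows the standard rules only; it does not reproduce the paper's cluster-design derivations, and the function name and return layout are assumptions. With m completed data sets, W is the mean of the per-data-set variance estimates, B is the sample variance of the point estimates, and the total variance is T = W + (1 + 1/m) B.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine m multiply-imputed analyses with Rubin's rules.

    estimates: point estimate from each completed data set
    variances: squared standard error from each completed data set
    Returns (pooled estimate, W, B, T, approximate fraction of missing info).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (m - 1 divisor)
    t = w + (1 + 1 / m) * b        # total variance
    # Large-sample approximation to the fraction of missing information.
    fmi = (1 + 1 / m) * b / t if t > 0 else 0.0
    return q_bar, w, b, t, fmi
```

    The ratio (1 + 1/m) B / T is the share of the total variance attributable to missing data, which is the quantity the abstract relates to design effects.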

  3. Potentially functional SNPs (pfSNPs) as novel genomic predictors of 5-FU response in metastatic colorectal cancer patients.

    Directory of Open Access Journals (Sweden)

    Jingbo Wang

    Full Text Available 5-Fluorouracil (5-FU) and its pro-drug Capecitabine have been widely used in treating colorectal cancer. However, not all patients will respond to the drug, hence there is a need to develop reliable early predictive biomarkers for 5-FU response. Here, we report a novel potentially functional Single Nucleotide Polymorphism (pfSNP) approach to identify SNPs that may serve as predictive biomarkers of response to 5-FU in Chinese metastatic colorectal cancer (CRC) patients. 1547 pfSNPs and one variable number tandem repeat (VNTR) in 139 genes in 5-FU drug (both PK and PD pathway) and colorectal cancer disease pathways were examined in 2 groups of CRC patients. Shrinkage of liver metastasis measured by RECIST criteria was used as the clinical end point. Four non-responder-specific pfSNPs were found to account for 37.5% of all non-responders (P<0.0003). Five additional pfSNPs were identified from a multivariate model (AUC under ROC = 0.875) that was applied for all other pfSNPs, excluding the non-responder-specific pfSNPs. These pfSNPs, which can differentiate the other non-responders from responders, mainly reside in tumor suppressor genes or genes implicated in colorectal cancer risk. Hence, a total of 9 novel SNPs with potential functional significance may be able to distinguish non-responders from responders to 5-FU. These pfSNPs may be useful biomarkers for predicting response to 5-FU.

  4. No prognostic value added by vitamin D pathway SNPs to current prognostic system for melanoma survival.

    Directory of Open Access Journals (Sweden)

    Li Luo

    The prognostic improvement attributed to genetic markers over the current prognostic system has not been well studied for melanoma. The goal of this study is to evaluate the added prognostic value of Vitamin D Pathway (VitD) SNPs to currently known clinical and demographic factors such as age, sex, Breslow thickness, mitosis and ulceration (CDF). We utilized two large independent well-characterized melanoma studies: the Genes, Environment, and Melanoma (GEM) and MD Anderson studies, and performed variable selection of VitD pathway SNPs and CDF using the Random Survival Forest (RSF) method in addition to Cox proportional hazards models. Harrell's C-index was used to compare the predictive performance of the models. The population-based GEM study enrolled 3,578 incident cases of cutaneous melanoma (CM), and the hospital-based MD Anderson study consisted of 1,804 CM patients. Including both VitD SNPs and CDF yielded a C-index of 0.85, which provided a slight but not significant improvement over CDF alone (C-index = 0.83) in the GEM study. Similar results were observed in the independent MD Anderson study (C-index = 0.84 and 0.83, respectively). The Cox model identified no significant associations after adjusting for multiplicity. Our results do not support clinically significant prognostic improvements attributable to VitD pathway SNPs over the current prognostic system for melanoma survival.
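
    Harrell's C-index, used above to compare models, is the proportion of comparable patient pairs in which the patient with the higher predicted risk has the shorter survival time. The sketch below is a plain illustration of that definition, not the study's code; the function name `harrell_c` and the toy data are hypothetical, and tied survival times are simply skipped.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored data (illustrative sketch).

    A pair (i, j) is comparable when subject i has the shorter time and
    experienced the event; ties in risk score count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical toy data: survival times, event indicators (1 = event, 0 = censored),
# and model risk scores (higher = worse predicted prognosis).
c_index = harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1])
print(c_index)
```

    A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why a change from 0.83 to 0.85 is read as a small gain.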

  5. Table S1 Basic characteristics of 32 SNPs of neurotransmitter ...

    Indian Academy of Sciences (India)

    Microsoft User

    Basic characteristics of 32 SNPs in neurotransmitter-related genes. Gene .... rs45435444, rs80837467 and rs80980072, significant differences (P. *** * ... At the same age and environments, skin lesion scores on the ears (P < 0.001), front (P <.

  6. Forensic genetic informativeness of an SNP panel consisting of 19 multi-allelic SNPs.

    Science.gov (United States)

    Gao, Zehua; Chen, Xiaogang; Zhao, Yuancun; Zhao, Xiaohong; Zhang, Shu; Yang, Yiwen; Wang, Yufang; Zhang, Ji

    2018-05-01

    Current research focusing on forensic personal identification, phenotype inference and ancestry information on single-nucleotide polymorphisms (SNPs) has been widely reported. In the present study, we focused on tetra-allelic SNPs in the Chinese Han population. A total of 48 tetra-allelic SNPs were screened out from the Chinese Han population of the 1000 Genomes Database, including Chinese Han in Beijing (CHB) and Chinese Han South (CHS). Considering the forensic genetic requirement for the polymorphisms, only 11 tetra-allelic SNPs with a heterozygosity >0.06 were selected for further multiplex panel construction. In order to meet the demands of personal identification and parentage identification, an additional 8 tri-allelic SNPs were combined into the final multiplex panel. To ensure application in the degraded DNA analysis, all the PCR products were designed to be 87-188 bp. Employing multiple PCR reactions and SNaPshot minisequencing, 511 unrelated Chinese Han individuals from Sichuan were genotyped. The combined match probability (CMP), combined discrimination power (CDP), and cumulative probability of exclusion (CPE) of the panel were 6.07 × 10^-11, 0.9999999999393 and 0.996764, respectively. Based on the population data retrieved from the 1000 Genomes Project, Fst values between Chinese Han in Sichuan (SCH) and all the populations included in the 1000 Genomes Project were calculated. The results indicated that two SNPs in this panel may contain ancestry information and may be used as markers of forensic biogeographical ancestry inference. Copyright © 2018 Elsevier B.V. All rights reserved.
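
    The reported CMP and CDP are complementary: 1 - 6.07 × 10^-11 = 0.9999999999393. As a rough sketch of how such panel statistics are conventionally computed, assuming independent loci and using hypothetical genotype frequencies (not the study's data), the per-locus match probability is the sum of squared genotype frequencies and the combined value is their product:

```python
from math import prod

def match_probability(genotype_freqs):
    """Probability that two random individuals share a genotype at one locus."""
    return sum(f * f for f in genotype_freqs)

def combined_match_probability(loci):
    """Product of per-locus match probabilities (assumes independent loci)."""
    return prod(match_probability(freqs) for freqs in loci)

# Hypothetical genotype frequencies at two loci.
loci = [[0.25, 0.5, 0.25], [0.5, 0.5]]
cmp_value = combined_match_probability(loci)
cdp_value = 1 - cmp_value

# Consistency check on the values reported in the abstract.
reported_cdp = 1 - 6.07e-11
print(cmp_value, cdp_value, reported_cdp)
```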

  7. Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.

    Science.gov (United States)

    Ritz, Cecilia; Edén, Patrik

    2008-01-19

    For 2-dye microarray platforms, some missing values may arise from an unmeasurably low RNA expression in one channel only. Information about such "one-channel depletion" has so far not been included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias of the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates was reduced by up to 51%, depending on the dataset. By including more information in the imputation step, we more accurately estimate missing expression values.

  8. Imputation of the rare HOXB13 G84E mutation and cancer risk in a large population-based cohort.

    Directory of Open Access Journals (Sweden)

    Thomas J Hoffmann

    2015-01-01

    An efficient approach to characterizing the disease burden of rare genetic variants is to impute them into large well-phenotyped cohorts with existing genome-wide genotype data using large sequenced reference panels. The success of this approach hinges on the accuracy of rare variant imputation, which remains controversial. For example, a recent study suggested that one cannot adequately impute the HOXB13 G84E mutation associated with prostate cancer risk (carrier frequency of 0.0034 in European ancestry participants in the 1000 Genomes Project). We show that by utilizing the 1000 Genomes Project data plus an enriched reference panel of mutation carriers we were able to accurately impute the G84E mutation into a large cohort of 83,285 non-Hispanic White participants from the Kaiser Permanente Research Program on Genes, Environment and Health Genetic Epidemiology Research on Adult Health and Aging cohort. Imputation authenticity was confirmed via a novel classification and regression tree method, and then empirically validated analyzing a subset of these subjects plus an additional 1,789 men from Kaiser specifically genotyped for the G84E mutation (r2 = 0.57, 95% CI = 0.37–0.77). We then show the value of this approach by using the imputed data to investigate the impact of the G84E mutation on age-specific prostate cancer risk and on risk of fourteen other cancers in the cohort. The age-specific risk of prostate cancer among G84E mutation carriers was higher than among non-carriers. Risk estimates from Kaplan-Meier curves were 36.7% versus 13.6% by age 72, and 64.2% versus 24.2% by age 80, for G84E mutation carriers and non-carriers, respectively (p = 3.4x10^-12). The G84E mutation was also associated with an increase in risk for the fourteen other most common cancers considered collectively (p = 5.8x10^-4) and more so in cases diagnosed with multiple cancer types, both those including and not including prostate cancer, strongly suggesting

  9. Comprehensive survey of SNPs in the Affymetrix exon array using the 1000 Genomes dataset.

    Directory of Open Access Journals (Sweden)

    Eric R Gamazon

    Microarray gene expression data has been used in genome-wide association studies to allow researchers to study gene regulation as well as other complex phenotypes including disease risks and drug response. To reach scientifically sound conclusions from these studies, however, it is necessary to get reliable summarization of gene expression intensities. Among various factors that could affect expression profiling using a microarray platform, single nucleotide polymorphisms (SNPs) in target mRNA may lead to reduced signal intensity measurements and result in spurious results. The recently released 1000 Genomes Project dataset provides an opportunity to evaluate the distribution of both known and novel SNPs in the International HapMap Project lymphoblastoid cell lines (LCLs). We mapped the 1000 Genomes Project genotypic data to the Affymetrix GeneChip Human Exon 1.0ST array (exon array), which had been used in our previous studies and for which gene expression data had been made publicly available. We also evaluated the potential impact of these SNPs on the differentially spliced probesets we had identified previously. Though the 1000 Genomes Project data allowed a comprehensive survey of the SNPs in this particular array, the same approach can certainly be applied to other microarray platforms. Furthermore, we present a detailed catalogue of SNP-containing probesets (exon-level) and transcript clusters (gene-level), which can be considered in evaluating findings using the exon array as well as benefit the design of follow-up experiments and data re-analysis.

  10. Using beta coefficients to impute missing correlations in meta-analysis research: Reasons for caution.

    Science.gov (United States)

    Roth, Philip L; Le, Huy; Oh, In-Sue; Van Iddekinge, Chad H; Bobko, Philip

    2018-06-01

    Meta-analysis has become a well-accepted method for synthesizing empirical research about a given phenomenon. Many meta-analyses focus on synthesizing correlations across primary studies, but some primary studies do not report correlations. Peterson and Brown (2005) suggested that researchers could use standardized regression weights (i.e., beta coefficients) to impute missing correlations. Indeed, their beta estimation procedures (BEPs) have been used in meta-analyses in a wide variety of fields. In this study, the authors evaluated the accuracy of BEPs in meta-analysis. We first examined how use of BEPs might affect results from a published meta-analysis. We then developed a series of Monte Carlo simulations that systematically compared the use of existing correlations (that were not missing) to data sets that incorporated BEPs (that impute missing correlations from corresponding beta coefficients). These simulations estimated ρ̄ (mean population correlation) and SDρ (true standard deviation) across a variety of meta-analytic conditions. Results from both the existing meta-analysis and the Monte Carlo simulations revealed that BEPs were associated with potentially large biases when estimating ρ̄ and even larger biases when estimating SDρ. Using only existing correlations often substantially outperformed use of BEPs and virtually never performed worse than BEPs. Overall, the authors urge a return to the standard practice of using only existing correlations in meta-analysis. (PsycINFO Database Record (c) 2018 APA, all rights reserved).

  11. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints

    OpenAIRE

    Schwessinger, R; Suciu, MC; McGowan, SJ; Telenius, J; Taylor, S; Higgs, DR; Hughes, JR

    2017-01-01

    In the era of genome-wide association studies (GWAS) and personalized medicine, predicting the impact of single nucleotide polymorphisms (SNPs) in regulatory elements is an important goal. Current approaches to determine the potential of regulatory SNPs depend on inadequate knowledge of cell-specific DNA binding motifs. Here, we present Sasquatch, a new computational approach that uses DNase footprint data to estimate and visualize the effects of noncoding variants on transcription factor bin...

  12. An evaluation of the performance of tag SNPs derived from HapMap in a Caucasian population.

    Directory of Open Access Journals (Sweden)

    Alexandre Montpetit

    2006-03-01

    The Haplotype Map (HapMap) project recently generated genotype data for more than 1 million single-nucleotide polymorphisms (SNPs) in four population samples. The main application of the data is in the selection of tag single-nucleotide polymorphisms (tSNPs) to use in association studies. The usefulness of this selection process needs to be verified in populations outside those used for the HapMap project. In addition, it is not known how well the data represent the general population, as only 90-120 chromosomes were used for each population and since the genotyped SNPs were selected so as to have high frequencies. In this study, we analyzed more than 1,000 individuals from Estonia. The population of this northern European country has been influenced by many different waves of migrations from Europe and Russia. We genotyped 1,536 randomly selected SNPs from two 500-kbp ENCODE regions on Chromosome 2. We observed that the tSNPs selected from the CEPH (Centre d'Etude du Polymorphisme Humain) from Utah (CEU) HapMap samples (derived from US residents with northern and western European ancestry) captured most of the variation in the Estonia sample. (Between 90% and 95% of the SNPs with a minor allele frequency of more than 5% have an r2 of at least 0.8 with one of the CEU tSNPs.) Using the reverse approach, tags selected from the Estonia sample could almost equally well describe the CEU sample. Finally, we observed that the sample size, the allelic frequency, and the SNP density in the dataset used to select the tags each have important effects on the tagging performance. Overall, our study supports the use of HapMap data in other Caucasian populations, but the SNP density and the bias towards high-frequency SNPs have to be taken into account when designing association studies.
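
    The tagging criterion quoted above (r2 of at least 0.8 with a tag SNP) refers to the standard pairwise linkage-disequilibrium measure. A minimal sketch of that measure from haplotype and allele frequencies follows; the function names `ld_r2` and `is_tagged` and the frequencies are hypothetical, and the 0.8 threshold is the one stated in the abstract.

```python
def ld_r2(p_ab, p_a, p_b):
    """Pairwise LD r^2 from the AB haplotype frequency and the two allele frequencies."""
    d = p_ab - p_a * p_b                         # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def is_tagged(p_ab, p_a, p_b, threshold=0.8):
    """True when a SNP is captured by a tag SNP at the given r^2 threshold."""
    return ld_r2(p_ab, p_a, p_b) >= threshold

# Hypothetical frequencies: partial LD between two SNPs.
r2_partial = ld_r2(p_ab=0.35, p_a=0.4, p_b=0.5)
# Hypothetical frequencies: allele A always occurs together with allele B.
r2_perfect = ld_r2(p_ab=0.3, p_a=0.3, p_b=0.3)
print(r2_partial, r2_perfect)
```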

  13. Different methods for analysing and imputing missing values in wind speed series; The problem of information quality in wind speed series: methodologies for analysis and imputation of missing data

    Energy Technology Data Exchange (ETDEWEB)

    Ferreira, A. M.

    2004-07-01

    This study concerns different methods for analysing and imputing missing values in wind speed series. The EM algorithm and a methodology derived from the sequential hot deck were used. Series with imputed missing values are compared with the original, complete series using several criteria, such as the wind potential, and there appears to be a significant goodness of fit between the estimates and the real values. (Author)

  14. Imputation and quality control steps for combining multiple genome-wide datasets

    Directory of Open Access Journals (Sweden)

    Shefali S Verma

    2014-12-01

    The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes), and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
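
    The imputation software estimates allelic R2 without access to the true genotypes. When true genotypes are known (for example, at deliberately masked markers), the analogous empirical quantity is the squared Pearson correlation between imputed dosages and true allele counts. The sketch below shows only that empirical version, which is an assumption of this example and not the eMERGE pipeline's procedure; function names and data are hypothetical.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0, 1 or 2 copies of one allele."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)

# Hypothetical masked marker: true allele counts and imputed dosages.
true_genotypes = [0, 1, 2, 0, 1, 0]
imputed_dosages = [0.1, 0.9, 1.8, 0.0, 1.2, 0.2]

r2 = squared_correlation(imputed_dosages, true_genotypes)
maf = minor_allele_frequency(true_genotypes)
print(r2, maf)
```

    Computing both values per marker and grouping markers by MAF gives the kind of R2-versus-frequency relationship the abstract describes.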

  15. Multiple imputation to account for measurement error in marginal structural models

    Science.gov (United States)

    Edwards, Jessie K.; Cole, Stephen R.; Westreich, Daniel; Crane, Heidi; Eron, Joseph J.; Mathews, W. Christopher; Moore, Richard; Boswell, Stephen L.; Lesko, Catherine R.; Mugavero, Michael J.

    2015-01-01

    Background: Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and non-differential measurement error in a marginal structural model. Methods: We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. Results: In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality [hazard ratio (HR): 1.2 (95% CI: 0.6, 2.3)]. The HR for current smoking and therapy (0.4 (95% CI: 0.2, 0.7)) was similar to the HR for no smoking and therapy (0.4; 95% CI: 0.2, 0.6). Conclusions: Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies. PMID: 26214338

  16. Multiple Imputation to Account for Measurement Error in Marginal Structural Models.

    Science.gov (United States)

    Edwards, Jessie K; Cole, Stephen R; Westreich, Daniel; Crane, Heidi; Eron, Joseph J; Mathews, W Christopher; Moore, Richard; Boswell, Stephen L; Lesko, Catherine R; Mugavero, Michael J

    2015-09-01

    Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and nondifferential measurement error in a marginal structural model. We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3,686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality (hazard ratio [HR]: 1.2 [95% confidence interval [CI] = 0.6, 2.3]). The HR for current smoking and therapy [0.4 (95% CI = 0.2, 0.7)] was similar to the HR for no smoking and therapy (0.4; 95% CI = 0.2, 0.6). Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies.

  17. The association between individual SNPs or haplotypes of matrix metalloproteinase 1 and gastric cancer susceptibility, progression and prognosis.

    Directory of Open Access Journals (Sweden)

    Yong-Xi Song

    BACKGROUND: The single nucleotide polymorphisms (SNPs) in matrix metalloproteinase 1 (MMP-1) play important roles in some cancers. This study examined the associations between individual SNPs or haplotypes in MMP-1 and susceptibility, clinicopathological parameters and prognosis of gastric cancer in a large sample of the Han population in northern China. METHODS: In this case-controlled study, there were 404 patients with gastric cancer and 404 healthy controls. Seven SNPs were genotyped using the MALDI-TOF MS system. Then, SPSS software, Haploview 4.2 software, Haplo.stats software and THESIAS software were used to estimate the association between individual SNPs or haplotypes of MMP-1 and gastric cancer susceptibility, progression and prognosis. RESULTS: Among the seven SNPs, no individual SNP was correlated with gastric cancer risk. Moreover, only the rs470206 genotype had a correlation with histologic grades, and the patients with GA/AA had better cell differentiation compared to the patients with genotype GG (OR=0.573; 95%CI: 0.353-0.929; P=0.023). Then, we constructed a four-marker haplotype block that contained 4 common haplotypes: TCCG, GCCG, TTCG and TTTA. However, none of the four common haplotypes was correlated with gastric cancer risk, and we did not find any relationship between these haplotypes and clinicopathological parameters in gastric cancer. Furthermore, neither individual SNPs nor haplotypes had an association with the survival of patients with gastric cancer. CONCLUSIONS: This study evaluated polymorphisms of the MMP-1 gene in gastric cancer with a MALDI-TOF MS method in a large northern Chinese case-controlled cohort. Our results indicated that these seven SNPs of MMP-1 might not be useful as significant markers to predict gastric cancer susceptibility, progression or prognosis, at least in the Han population in northern China.

  18. Genome-wide single nucleotide polymorphisms (SNPs) for a model invasive ascidian Botryllus schlosseri.

    Science.gov (United States)

    Gao, Yangchun; Li, Shiguo; Zhan, Aibin

    2018-04-01

    Invasive species cause huge damages to ecology, environment and economy globally. The comprehensive understanding of invasion mechanisms, particularly genetic bases of micro-evolutionary processes responsible for invasion success, is essential for reducing potential damages caused by invasive species. The golden star tunicate, Botryllus schlosseri, has become a model species in invasion biology, mainly owing to its high invasiveness nature and small well-sequenced genome. However, the genome-wide genetic markers have not been well developed in this highly invasive species, thus limiting the comprehensive understanding of genetic mechanisms of invasion success. Using restriction site-associated DNA (RAD) tag sequencing, here we developed a high-quality resource of 14,119 out of 158,821 SNPs for B. schlosseri. These SNPs were relatively evenly distributed at each chromosome. SNP annotations showed that the majority of SNPs (63.20%) were located at intergenic regions, and 21.51% and 14.58% were located at introns and exons, respectively. In addition, the potential use of the developed SNPs for population genomics studies was primarily assessed, such as the estimate of observed heterozygosity (H_O), expected heterozygosity (H_E), nucleotide diversity (π), Wright's inbreeding coefficient (F_IS) and effective population size (Ne). Our developed SNP resource would provide future studies the genome-wide genetic markers for genetic and genomic investigations, such as genetic bases of micro-evolutionary processes responsible for invasion success.
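
    Three of the population statistics listed above are linked by a simple relation: the inbreeding coefficient is one minus the ratio of observed to expected heterozygosity. The following is a single-locus sketch under that textbook definition, with hypothetical genotypes and no small-sample correction; it is not the software used in the study.

```python
from collections import Counter

def heterozygosity_stats(genotypes):
    """Observed/expected heterozygosity and inbreeding coefficient at one locus.

    genotypes: list of (allele1, allele2) tuples, one per diploid individual.
    """
    n = len(genotypes)
    h_obs = sum(a != b for a, b in genotypes) / n            # fraction of heterozygotes
    allele_counts = Counter(allele for g in genotypes for allele in g)
    h_exp = 1 - sum((c / (2 * n)) ** 2 for c in allele_counts.values())
    f_is = 1 - h_obs / h_exp                                 # positive = heterozygote deficit
    return h_obs, h_exp, f_is

# Hypothetical genotypes for four individuals at one SNP.
h_obs, h_exp, f_is = heterozygosity_stats(
    [("A", "A"), ("A", "G"), ("G", "G"), ("A", "A")]
)
print(h_obs, h_exp, f_is)
```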

  19. Treatments of Missing Values in Large National Data Affect Conclusions: The Impact of Multiple Imputation on Arthroplasty Research.

    Science.gov (United States)

    Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Su, Edwin P; Grauer, Jonathan N

    2018-03-01

    Despite the advantages of large, national datasets, one continuing concern is missing data values. Complete case analysis, where only cases with complete data are analyzed, is commonly used rather than more statistically rigorous approaches such as multiple imputation. This study characterizes the potential selection bias introduced using complete case analysis and compares the results of common regressions using both techniques following unicompartmental knee arthroplasty. Patients undergoing unicompartmental knee arthroplasty were extracted from the 2005 to 2015 National Surgical Quality Improvement Program. As examples, the demographics of patients with and without missing preoperative albumin and hematocrit values were compared. Missing data were then treated with both complete case analysis and multiple imputation (an approach that reproduces the variation and associations that would have been present in a full dataset) and the conclusions of common regressions for adverse outcomes were compared. A total of 6117 patients were included, of which 56.7% were missing at least one value. Younger, female, and healthier patients were more likely to have missing preoperative albumin and hematocrit values. The use of complete case analysis removed 3467 patients from the study in comparison with multiple imputation which included all 6117 patients. The 2 methods of handling missing values led to differing associations of low preoperative laboratory values with commonly studied adverse outcomes. The use of complete case analysis can introduce selection bias and may lead to different conclusions in comparison with the statistically rigorous multiple imputation approach. Joint surgeons should consider the methods of handling missing values when interpreting arthroplasty research. Copyright © 2017 Elsevier Inc. All rights reserved.

  20. Cohort-specific imputation of gene expression improves prediction of warfarin dose for African Americans.

    Science.gov (United States)

    Gottlieb, Assaf; Daneshjou, Roxana; DeGorter, Marianne; Bourgeois, Stephane; Svensson, Peter J; Wadelius, Mia; Deloukas, Panos; Montgomery, Stephen B; Altman, Russ B

    2017-11-24

    Genome-wide association studies are useful for discovering genotype-phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into "gene level" effects. Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression, on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort.

  1. Cohort-specific imputation of gene expression improves prediction of warfarin dose for African Americans

    Directory of Open Access Journals (Sweden)

    Assaf Gottlieb

    2017-11-01

    Background: Genome-wide association studies are useful for discovering genotype–phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into “gene level” effects. Methods: Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression, on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. Results: We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Conclusions: Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort

  2. Phosphorylation states of cell cycle and DNA repair proteins can be altered by the nsSNPs

    International Nuclear Information System (INIS)

    Savas, Sevtap; Ozcelik, Hilmi

    2005-01-01

    Phosphorylation is a reversible post-translational modification that affects the intrinsic properties of proteins, such as structure and function. Non-synonymous single nucleotide polymorphisms (nsSNPs) result in the substitution of the encoded amino acids and thus are likely to alter the phosphorylation motifs in the proteins. In this study, we used the web-based NetPhos tool to predict candidate nsSNPs that either introduce or remove putative phosphorylation sites in proteins that act in DNA repair and cell cycle pathways. Our results demonstrated that a total of 15 nsSNPs (16.9%) were likely to alter the putative phosphorylation patterns of 14 proteins. Three of these SNPs (CDKN1A-S31R, OGG1-S326C, and XRCC3-T241M) have already been found to be associated with altered cancer risk. We believe that this set of nsSNPs constitutes an excellent resource for further molecular and genetic analyses. The novel systematic approach used in this study will accelerate the understanding of how naturally occurring human SNPs may alter protein function through the modification of phosphorylation mechanisms and contribute to disease susceptibility.

  3. Identification of pummelo cultivars by using a panel of 25 selected SNPs and 12 DNA segments.

    Directory of Open Access Journals (Sweden)

    Bo Wu

    Pummelo cultivars are usually difficult to identify morphologically, especially when fruits are unavailable. The problem was addressed in this study with the use of two methods: high resolution melting analysis of SNPs and sequencing of DNA segments. In the first method, a set of 25 SNPs with high polymorphic information content were selected from SNPs predicted by analyzing ESTs and sequenced DNA segments. High resolution melting analysis was then used to genotype 260 accessions including 55 from Myanmar, and 178 different genotypes were thus identified. A total of 99 cultivars were assigned to 86 different genotypes since the known somatic mutants were identical to their original genotypes at the analyzed SNP loci. The Myanmar samples were genotypically different from each other and from all other samples, indicating they were derived from sexual propagation. Statistical analysis showed that the set of SNPs was powerful enough for identifying at least 1000 pummelo genotypes, though the discrimination power varied in different pummelo groups and populations. In the second method, 12 genomic DNA segments of 24 representative pummelo accessions were sequenced. Analysis of the sequences revealed the existence of a high haplotype polymorphism in pummelo, and statistical analysis showed that the segments could be used as genetic barcodes that should be informative enough to allow reliable identification of 1200 pummelo cultivars. The high level of haplotype diversity and an apparent population structure shown by DNA segments and by SNP genotypes, respectively, were discussed in relation to the origin and domestication of the pummelo species.

  4. Meta-analysis of genome-wide association studies in African Americans provides insights into the genetic architecture of type 2 diabetes

    DEFF Research Database (Denmark)

    Ng, Maggie C Y; Shriner, Daniel; Chen, Brian H

    2014-01-01

    . In order to investigate the genetic architecture of T2D in African Americans, the MEta-analysis of type 2 DIabetes in African Americans (MEDIA) Consortium examined 17 GWAS on T2D comprising 8,284 cases and 15,543 controls in African Americans in stage 1 analysis. Single nucleotide polymorphisms (SNPs......) association analysis was conducted in each study under the additive model after adjustment for age, sex, study site, and principal components. Meta-analysis of approximately 2.6 million genotyped and imputed SNPs in all studies was conducted using an inverse variance-weighted fixed effect model. Replications...... for linkage disequilibrium, enabling fine mapping of causal variants in trans-ethnic meta-analysis studies....
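    The inverse variance-weighted fixed effect model named in this abstract can be sketched as follows. This is a minimal illustration, not the consortium's pipeline: the function name `fixed_effect_meta` and the three per-study effect sizes and standard errors are hypothetical values chosen only to show the arithmetic.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Pool per-study effect estimates using inverse-variance weights."""
    # Each study is weighted by the reciprocal of its sampling variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value


# Hypothetical log-odds ratios and standard errors for one SNP in three studies.
example_betas = [0.10, 0.20, 0.15]
example_ses = [0.05, 0.10, 0.05]
pooled_beta, pooled_se, z, p_value = fixed_effect_meta(example_betas, example_ses)
print(pooled_beta, pooled_se, z, p_value)
```

    Studies with smaller standard errors dominate the pooled estimate, which is why large cohorts drive the result of a fixed effect meta-analysis.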

  5. On Matrix Sampling and Imputation of Context Questionnaires with Implications for the Generation of Plausible Values in Large-Scale Assessments

    Science.gov (United States)

    Kaplan, David; Su, Dan

    2016-01-01

    This article presents findings on the consequences of matrix sampling of context questionnaires for the generation of plausible values in large-scale assessments. Three studies are conducted. Study 1 uses data from PISA 2012 to examine several different forms of missing data imputation within the chained equations framework: predictive mean…

  6. A comparison between genotyping-by-sequencing and array-based scoring of SNPs for genomic prediction accuracy in winter wheat.

    Science.gov (United States)

    Elbasyoni, Ibrahim S; Lorenz, A J; Guttieri, M; Frels, K; Baenziger, P S; Poland, J; Akhunov, E

    2018-05-01

    The utilization of DNA molecular markers in plant breeding to maximize selection response via marker-assisted selection (MAS) and genomic selection (GS) has revolutionized plant breeding. A key factor affecting GS applicability is the choice of molecular marker platform. Genotyping-by-sequencing scored SNPs (GBS-scored SNPs) provide a large number of markers, albeit with high rates of missing data. Array-scored SNPs are of high quality, but the cost per sample is substantially higher. The objectives of this study were to 1) compare GBS-scored SNPs and array-scored SNPs for genomic selection applications, and 2) compare estimates of genomic kinship and population structure calculated using the two marker platforms. SNPs were compared in a diversity panel consisting of 299 hard winter wheat (Triticum aestivum L.) accessions that were part of a multi-year, multi-environment association mapping study. The panel was phenotyped in Ithaca, Nebraska for heading date, plant height, days to physiological maturity and grain yield in 2012 and 2013. The panel was genotyped using GBS-scored SNPs and array-scored SNPs. Results indicate that GBS-scored SNPs are comparable to or better than array-scored SNPs for genomic prediction applications. Both platforms identified the same genetic patterns in the panel, where 90% of the lines were classified to common genetic groups. Overall, we concluded that GBS-scored SNPs have the potential to be the marker platform of choice for genetic diversity and genomic selection in winter wheat. Copyright © 2018 Elsevier B.V. All rights reserved.

  7. Predicting deleterious nsSNPs: an analysis of sequence and structural attributes

    Directory of Open Access Journals (Sweden)

    Saqi Mansoor AS

    2006-04-01

    Full Text Available Abstract. Background: There has been an explosion in the number of single nucleotide polymorphisms (SNPs) within public databases. In this study we focused on non-synonymous protein coding single nucleotide polymorphisms (nsSNPs), some associated with disease and others which are thought to be neutral. We describe the distribution of both types of nsSNPs using structural and sequence based features and assess the relative value of these attributes as predictors of function using machine learning methods. We also address the common problem of balance within machine learning methods and show the effect of imbalance on nsSNP function prediction. We show that nsSNP function prediction can be significantly improved by 100% undersampling of the majority class. The learnt rules were then applied to make predictions of function on all nsSNPs within Ensembl. Results: The measure of prediction success is greatly affected by the level of imbalance in the training dataset. We found the balanced dataset that included all attributes produced the best prediction. The performance as measured by the Matthews correlation coefficient (MCC) varied between 0.49 and 0.25 depending on the imbalance. As previously observed, the degree of sequence conservation at the nsSNP position is the single most useful attribute. In addition to conservation, structural predictions made using a balanced dataset can be of value. Conclusion: The predictions for all nsSNPs within Ensembl, based on a balanced dataset using all attributes, are available as a DAS annotation. Instructions for adding the track to Ensembl are at http://www.brightstudy.ac.uk/das_help.html
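    The two ideas this abstract relies on, the Matthews correlation coefficient and undersampling of the majority class, can be sketched as below. This is an illustration only: the function names, the confusion-matrix counts and the reading of "100% undersampling" as "reduce the majority class to the size of the minority class" are assumptions, not details taken from the paper.

```python
import math
import random


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is reported as 0 when any marginal total is 0.
    return numerator / denominator if denominator else 0.0


def undersample_majority(majority, minority, seed=0):
    """Randomly shrink the majority class to the minority class size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


# Hypothetical imbalanced test set: 10 disease nsSNPs, 90 neutral nsSNPs.
# A classifier that labels everything "neutral" has 90% accuracy but MCC 0.
always_neutral = matthews_corrcoef(tp=0, tn=90, fp=0, fn=10)
perfect = matthews_corrcoef(tp=10, tn=90, fp=0, fn=0)

# Balance a hypothetical training set before fitting any classifier.
neutral_ids = list(range(90))
disease_ids = list(range(90, 100))
balanced_neutral, balanced_disease = undersample_majority(neutral_ids, disease_ids)
print(always_neutral, perfect, len(balanced_neutral), len(balanced_disease))
```

    Unlike accuracy, the MCC does not reward a classifier for simply predicting the majority class, which makes it a natural measure when class sizes differ.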

  8. Association of p21 SNPs and risk of cervical cancer among Chinese women

    International Nuclear Information System (INIS)

    Wang, Ning; Wang, Shizhuo; Zhang, Qiao; Lu, Yanming; Wei, Heng; Li, Wei; Zhang, Shulan; Yin, Duo; Ou, Yangling

    2012-01-01

    The p21 codon 31 single nucleotide polymorphism (SNP), rs1801270, has been linked to cervical cancer but with controversial results. The aims of this study were to investigate the role of p21 SNP-rs1801270 and other untested p21 SNPs in the risk of cervical cancer in a Chinese population. We genotyped five p21 SNPs (rs762623, rs2395655, rs1801270, rs3176352, and rs1059234) using peripheral blood DNA from 393 cervical cancer patients and 434 controls. The frequency of the rs1801270 A allele in patients (0.421) was significantly lower than that in controls (0.494, p = 0.003). The frequency of the rs3176352 C allele in cases (0.319) was significantly lower than that in controls (0.417, p < 0.001). The allele frequencies of the other three p21 SNPs were not statistically significantly different between patients and controls. The rs1801270 AA genotype was associated with a decreased risk for the development of cervical cancer (OR = 0.583, 95%CI: 0.399 - 0.853, P = 0.005). We observed that three p21 SNPs (rs1801270, rs3176352, and rs1059234) were in linkage disequilibrium (LD) and thus haplotype analysis was performed. The AGT haplotype (which includes the rs1801270A allele) was the most frequent haplotype among all subjects, and both homozygosity and heterozygosity for the AGT haplotype provided a protective effect against the development of cervical cancer. We show an association between the p21 SNP rs1801270A allele and a decreased risk for cervical cancer in a population of Chinese women. The AGT haplotype formed by three p21 SNPs in LD (rs1801270, rs3176352 and rs1059234) also provided a protective effect against the development of cervical cancer in this population.
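    The allele-frequency comparison reported for rs1801270 can be approximated with a 2x2 chi-square test on allele counts. The sketch below rests on assumptions: the abstract does not state which test was used, and the allele counts are reconstructed here by multiplying the rounded frequencies (0.421 and 0.494) by twice the number of patients (393) and controls (434), so they may differ slightly from the study's actual counts.

```python
import math


def allele_chi_square(case_minor, case_total, control_minor, control_total):
    """Pearson chi-square test (1 df, no continuity correction) on allele counts."""
    a, b = case_minor, case_total - case_minor
    c, d = control_minor, control_total - control_minor
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Each diploid subject contributes two alleles.
case_alleles = 2 * 393
control_alleles = 2 * 434
# Approximate A-allele counts reconstructed from the rounded frequencies.
case_a = round(0.421 * case_alleles)
control_a = round(0.494 * control_alleles)

chi2, p_value = allele_chi_square(case_a, case_alleles, control_a, control_alleles)
print(chi2, p_value)
```

    With these reconstructed counts the p-value should land close to the reported p = 0.003, which suggests, but does not prove, that a test of this kind underlies the published figure.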

  9. Design of a High Density SNP Genotyping Assay in the Pig Using SNPs Identified and Characterized by Next Generation Sequencing Technology

    DEFF Research Database (Denmark)

    Ramos, Antonio M; Crooijmans, Richard P M A; Nabeel, Nabeel A

    2009-01-01

    Background The dissection of complex traits of economic importance to the pig industry requires the availability of a significant number of genetic markers, such as single nucleotide polymorphisms (SNPs). This study was conducted to discover several hundreds of thousands of porcine SNPs using nex...

  10. Identification of SNPs associated with muscle yield and quality traits using allelic-imbalance analyses of pooled RNA-Seq samples in rainbow trout

    Science.gov (United States)

    Coding/functional SNPs change the biological function of a gene and, therefore, could serve as “large-effect” genetic markers. In this study, we used two bioinformatics pipelines, GATK and SAMtools, for discovering coding/functional SNPs with allelic-imbalances associated with total body weight, mus...

  11. Canonical Single Nucleotide Polymorphisms (SNPs) for High-Resolution Subtyping of Shiga-Toxin Producing Escherichia coli (STEC) O157:H7.

    Directory of Open Access Journals (Sweden)

    Sean M Griffing

    Full Text Available The objective of this study was to develop a canonical, parsimoniously-informative SNP panel for subtyping Shiga-toxin producing Escherichia coli (STEC) O157:H7 that would be consistent with epidemiological, PFGE, and MLVA clustering of human specimens. Our group had previously identified 906 putative discriminatory SNPs, which were pared down to 391 SNPs based on their prevalence in a test set. The 391 SNPs were screened using a high-throughput form of TaqMan PCR against a set of clinical isolates that represent the most diverse collection of O157:H7 isolates from outbreaks and sporadic cases examined to date. Another 30 SNPs identified by others were also screened using the same method. Two additional targets were tested using standard TaqMan PCR endpoint analysis. These 423 SNPs were reduced to a 32 SNP panel with almost the same discriminatory value. While the panel partitioned our diverse set of isolates in a manner that was consistent with epidemiological data and PFGE and MLVA phylogenies, it resulted in fewer subtypes than either existing method and insufficient epidemiological resolution in 10 of 47 clusters. Therefore, another round of SNP discovery was undertaken using comparative genomic resequencing of pooled DNA from the 10 clusters with insufficient resolution. This process identified 4,040 potential SNPs and suggested one of the ten clusters was incorrectly grouped. After its removal, there were 2,878 SNPs, of which only 63 were previously identified and 438 occurred across multiple clusters. Among highly clonal bacteria like STEC O157:H7, linkage disequilibrium greatly limits the number of parsimoniously informative SNPs. Therefore, it is perhaps unsurprising that our panel accounted for the potential discriminatory value of numerous other SNPs reported in the literature. We concluded that published O157:H7 SNPs are insufficient for effective epidemiological subtyping. However, the 438 multi-cluster SNPs we identified may provide

  12. Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets

    Directory of Open Access Journals (Sweden)

    Min-Wei Huang

    2018-01-01

    Full Text Available Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem; it provides estimates for the missing values through a reasoning process based on the (complete) observed data. However, if the observed data contain some noisy information or outliers, the estimates of the missing values may not be reliable or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are used in order to find the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation over the numerical data type of medical datasets, and specific combinations of instance selection and imputation methods can improve the imputation results over the mixed data type of medical datasets. However, instance selection does not have a definitely positive impact on the imputation result for categorical medical datasets.
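    A k-nearest-neighbour imputation step of the kind abbreviated KNNI in this abstract can be sketched as follows for numerical data. This is a simplified illustration under stated assumptions: the function name `knn_impute`, the use of Euclidean distance over the observed attributes, and the toy records are all hypothetical, and the instance selection stage studied in the paper is not reproduced.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("at least one complete row is required")
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        # Distances are measured only on attributes observed in this row.
        observed = [i for i, value in enumerate(row) if value is not None]

        def distance(other, row=row, observed=observed):
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in observed))

        neighbours = sorted(complete, key=distance)[:k]
        filled = [
            value if value is not None
            else sum(n[i] for n in neighbours) / len(neighbours)
            for i, value in enumerate(row)
        ]
        imputed.append(filled)
    return imputed


# Hypothetical records: the last one is missing its second attribute.
records = [[1.0, 2.0], [1.1, 2.2], [5.0, 10.0], [1.05, None]]
result = knn_impute(records, k=2)
print(result[3])
```

    The outlying record `[5.0, 10.0]` is ignored here because it is not among the two nearest neighbours; with a larger k it would pull the estimate upward, which is the kind of distortion that motivates removing noisy instances before imputation.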

  13. 48 CFR 1830.7002-4 - Determining imputed cost of money.

    Science.gov (United States)

    2010-10-01

    ... money. 1830.7002-4 Section 1830.7002-4 Federal Acquisition Regulations System NATIONAL AERONAUTICS AND... Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...

  14. [Imputing missing data in public health: general concepts and application to dichotomous variables].

    Science.gov (United States)

    Hernández, Gilma; Moriña, David; Navarro, Albert

    The presence of missing data in collected variables is common in health surveys, but the subsequent imputation thereof at the time of analysis is not. Working with imputed data may have certain benefits regarding the precision of the estimators and the unbiased identification of associations between variables. The imputation process is probably still little understood by many non-statisticians, who view this process as highly complex and with an uncertain goal. To clarify these questions, this note aims to provide a straightforward, non-exhaustive overview of the imputation process to enable public health researchers to ascertain its strengths, all in the context of dichotomous variables, which are commonplace in public health. To illustrate these concepts, an example in which missing data are handled by means of simple and multiple imputation is introduced. Copyright © 2017 SESPAS. Publicado por Elsevier España, S.L.U. All rights reserved.
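    After multiple imputation, the estimates from the m completed datasets are usually combined with Rubin's rules. The sketch below shows that pooling step for a proportion of a dichotomous variable; it is not taken from the note itself, and the function name `pool_rubin` and the five estimates and variances are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine estimates from m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variance across imputations (divisor m - 1)
    total_variance = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, math.sqrt(total_variance)


# Hypothetical prevalence estimates from m = 5 imputed datasets.
example_estimates = [0.30, 0.32, 0.28, 0.31, 0.29]
example_variances = [0.0004] * 5
pooled_estimate, pooled_se = pool_rubin(example_estimates, example_variances)
print(pooled_estimate, pooled_se)
```

    The between-imputation term is what distinguishes multiple from simple (single) imputation: a single imputed dataset treats the filled-in values as if they had been observed and so understates the standard error.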

  15. Imputing data that are missing at high rates using a boosting algorithm

    Energy Technology Data Exchange (ETDEWEB)

    Cauthen, Katherine Regina [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lambert, Gregory [Apple Inc., Cupertino, CA (United States); Ray, Jaideep [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Lefantzi, Sophia [Sandia National Lab. (SNL-CA), Livermore, CA (United States)

    2016-09-01

    Traditional multiple imputation approaches may perform poorly for datasets with high rates of missingness unless many m imputations are used. This paper implements an alternative machine learning-based approach to imputing data that are missing at high rates. Here, we use boosting to create a strong learner from a weak learner fitted to a dataset missing many observations. This approach may be applied to a variety of types of learners (models). The approach is demonstrated by application to a spatiotemporal dataset for predicting dengue outbreaks in India from meteorological covariates. A Bayesian spatiotemporal CAR model is boosted to produce imputations, and the overall RMSE from a k-fold cross-validation is used to assess imputation accuracy.

  16. GRIMP: A web- and grid-based tool for high-speed analysis of large-scale genome-wide association using imputed data.

    NARCIS (Netherlands)

    K. Estrada Gil (Karol); A. Abuseiris (Anis); F.G. Grosveld (Frank); A.G. Uitterlinden (André); T.A. Knoch (Tobias); F. Rivadeneira Ramirez (Fernando)

    2009-01-01

    The current fast growth of genome-wide association studies (GWAS) combined with now common computationally expensive imputation requires the online access of large user groups to high-performance computing resources capable of analyzing rapidly and efficiently millions of genetic

  17. A new strategy for enhancing imputation quality of rare variants from next-generation sequencing data via combining SNP and exome chip data

    NARCIS (Netherlands)

    Y.J. Kim (Young Jin); J. Lee (Juyoung); B.-J. Kim (Bong-Jo); T. Park (Taesung); G.R. Abecasis (Gonçalo); M.A.A. De Almeida (Marcio); D. Altshuler (David); J.L. Asimit (Jennifer L.); G. Atzmon (Gil); M. Barber (Mathew); A. Barzilai (Ari); N.L. Beer (Nicola L.); G.I. Bell (Graeme I.); J. Below (Jennifer); T. Blackwell (Tom); J. Blangero (John); M. Boehnke (Michael); D.W. Bowden (Donald W.); N.P. Burtt (Noël); J.C. Chambers (John); H. Chen (Han); P. Chen (Ping); P.S. Chines (Peter); S. Choi (Sungkyoung); C. Churchhouse (Claire); P. Cingolani (Pablo); B.K. Cornes (Belinda); N.J. Cox (Nancy); A.G. Day-Williams (Aaron); A. Duggirala (Aparna); J. Dupuis (Josée); T. Dyer (Thomas); S. Feng (Shuang); J. Fernandez-Tajes (Juan); T. Ferreira (Teresa); T.E. Fingerlin (Tasha E.); J. Flannick (Jason); J.C. Florez (Jose); P. Fontanillas (Pierre); T.M. Frayling (Timothy); C. Fuchsberger (Christian); E. Gamazon (Eric); K. Gaulton (Kyle); S. Ghosh (Saurabh); B. Glaser (Benjamin); A.L. Gloyn (Anna); R.L. Grossman (Robert L.); J. Grundstad (Jason); C. Hanis (Craig); A. Heath (Allison); H. Highland (Heather); M. Horikoshi (Momoko); I.-S. Huh (Ik-Soo); J.R. Huyghe (Jeroen R.); M.K. Ikram (Kamran); K.A. Jablonski (Kathleen); Y. Jun (Yang); N. Kato (Norihiro); J. Kim (Jayoun); Y.J. Kim (Young Jin); B.-J. Kim (Bong-Jo); J. Lee (Juyoung); C.R. King (C. Ryan); J.S. Kooner (Jaspal S.); M.-S. Kwon (Min-Seok); H.K. Im (Hae Kyung); M. Laakso (Markku); K.K.-Y. Lam (Kevin Koi-Yau); J. Lee (Jaehoon); S. Lee (Selyeong); S. Lee (Sungyoung); D.M. Lehman (Donna M.); H. Li (Heng); C.M. Lindgren (Cecilia); X. Liu (Xuanyao); O.E. Livne (Oren E.); A.E. Locke (Adam E.); A. Mahajan (Anubha); J.B. Maller (Julian B.); A.K. Manning (Alisa K.); T.J. Maxwell (Taylor J.); A. Mazoure (Alexander); M.I. McCarthy (Mark); J.B. Meigs (James B.); B. Min (Byungju); K.L. Mohlke (Karen); A.P. Morris (Andrew); S. Musani (Solomon); Y. Nagai (Yoshihiko); M.C.Y. Ng (Maggie C.Y.); D. Nicolae (Dan); S. Oh (Sohee); N.D. Palmer (Nicholette); T. Park (Taesung); T.I. Pollin (Toni I.); I. Prokopenko (Inga); D. Reich (David); M.A. Rivas (Manuel); L.J. Scott (Laura); M. Seielstad (Mark); Y.S. Cho (Yoon Shin); X. Sim (Xueling); R. Sladek (Rob); P. Smith (Philip); I. Tachmazidou (Ioanna); E.S. Tai (Shyong); Y.Y. Teo (Yik Ying); T.M. Teslovich (Tanya M.); J. Torres (Jason); V. Trubetskoy (Vasily); S.M. Willems (Sara); A.L. Williams (Amy L.); J.G. Wilson (James); S. Wiltshire (Steven); S. Won (Sungho); A.R. Wood (Andrew); W. Xu (Wang); J. Yoon (Joon); M. Zawistowski (Matthew); E. Zeggini (Eleftheria); W. Zhang (Weihua); S. Zöllner (Sebastian)

    2015-01-01

    Background: Rare variants have gathered increasing attention as a possible alternative source of missing heritability. Since next generation sequencing technology is not yet cost-effective for large-scale genomic studies, a widely used alternative approach is imputation. However, the

  18. Missing data imputation using statistical and machine learning methods in a real breast cancer problem.

    Science.gov (United States)

    Jerez, José M; Molina, Ignacio; García-Laencina, Pedro J; Alba, Emilio; Ribelles, Nuria; Martín, Miguel; Franco, Leonardo

    2010-10-01

    Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set. Imputation methods based on statistical techniques, e.g., mean, hot-deck and multiple imputation, and machine learning techniques, e.g., multi-layer perceptron (MLP), self-organisation maps (SOM) and k-nearest neighbour (KNN), were applied to data collected through the "El Álamo-I" project, and the results were then compared to those obtained from the listwise deletion (LD) imputation method. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracies of predictions on early cancer relapse were measured using artificial neural networks (ANNs), in which different ANNs were estimated using the data sets with imputed missing values. The imputation methods based on machine learning algorithms outperformed imputation statistical methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model. The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures. Copyright © 2010 Elsevier B.V. All rights reserved.

  19. The more from East-Asian, the better: risk prediction of colorectal cancer risk by GWAS-identified SNPs among Japanese.

    Science.gov (United States)

    Abe, Makiko; Ito, Hidemi; Oze, Isao; Nomura, Masatoshi; Ogawa, Yoshihiro; Matsuo, Keitaro

    2017-12-01

    Little is known about differences in genetic predisposition to colorectal cancer (CRC) between ethnicities, although many genetic traits common to CRC have been identified. This study investigated whether adding SNPs identified in GWAS of East Asian populations could improve risk prediction in Japanese individuals and explored the possible application of genetic risk groups as an instrument of risk communication. A total of 558 patients with histologically verified colorectal cancer and 1116 first-visit outpatients were included in the derivation study, and 547 cases and 547 controls in the replication study. In each population, we evaluated prediction models for the risk of CRC that combined the genetic risk group based on SNPs from GWASs in European populations, and a similarly developed model that added SNPs from GWASs in East Asian populations. We examined whether adding East Asian-specific SNPs would improve the discrimination. Six SNPs (rs6983267, rs4779584, rs4444235, rs9929218, rs10936599, rs16969681) from 23 SNPs identified by European-based GWAS and five SNPs (rs704017, rs11196172, rs10774214, rs647161, rs2423279) among ten SNPs identified by Asian-based GWAS were selected for the CRC risk prediction model. Compared with a 6-SNP-based model, an 11-SNP model including Asian GWAS-SNPs showed improved discrimination capacity in receiver operating characteristic analysis. The 11-SNP model resulted in a statistically significant improvement in both the derivation (P = 0.0039) and replication studies (P = 0.0018) compared with the 6-SNP model. We estimated the cumulative risk of CRC by genetic risk group based on 11 SNPs and found that the cumulative risk at age 80 is approximately 13% in the high-risk group and 6% in the low-risk group. We constructed a more efficient CRC risk prediction model with 11 SNPs including newly identified East Asian-based GWAS SNPs (rs704017, rs11196172, rs10774214, rs647161, rs2423279). Risk grouping based on 11 SNPs depicted a lifetime difference in CRC risk. This might be useful for

  20. Development of a multiplex PCR assay detecting 52 autosomal SNPs

    DEFF Research Database (Denmark)

    Sanchez Sanchez, Juan Jose; Phillips, C.; Børsting, Claus

    2006-01-01

    for amplifying 52 genomic DNA fragments, each containing one SNP, in a single tube, and accurately genotyping the PCR product mixture using two single base extension reactions. This multiplex approach reduces the cost of SNP genotyping and requires as little as 0.5 ng of genomic DNA to detect 52 SNPs. We used...

  1. Estimating Stand Height and Tree Density in Pinus taeda plantations using in-situ data, airborne LiDAR and k-Nearest Neighbor Imputation

    Directory of Open Access Journals (Sweden)

    CARLOS ALBERTO SILVA

    Full Text Available ABSTRACT Accurate forest inventory is of great economic importance to optimize the entire supply chain management in pulp and paper companies. The aim of this study was to estimate stand dominant and mean heights (HD and HM) and tree density (TD) of Pinus taeda plantations located in South Brazil using in-situ measurements, airborne Light Detection and Ranging (LiDAR) data and non-parametric k-nearest neighbor (k-NN) imputation. Forest inventory attributes and LiDAR-derived metrics were calculated at 53 regular sample plots and we used imputation models to retrieve the forest attributes at plot and landscape levels. The best LiDAR-derived metrics to predict HD, HM and TD were H99TH, HSD, SKE and HMIN. The imputation model using the selected metrics was more effective for retrieving height than tree density. The model coefficients of determination (adj.R2) and root mean squared differences (RMSD) for HD, HM and TD were 0.90, 0.94, 0.38m and 6.99, 5.70, 12.92%, respectively. Our results show that LiDAR and k-NN imputation can be used to predict stand heights with high accuracy in Pinus taeda. However, further studies are needed to improve the prediction accuracy for TD and to evaluate and compare the cost of acquisition and processing of LiDAR data against conventional inventory procedures.

  2. Estimating Stand Height and Tree Density in Pinus taeda plantations using in-situ data, airborne LiDAR and k-Nearest Neighbor Imputation.

    Science.gov (United States)

    Silva, Carlos Alberto; Klauberg, Carine; Hudak, Andrew T; Vierling, Lee A; Liesenberg, Veraldo; Bernett, Luiz G; Scheraiber, Clewerson F; Schoeninger, Emerson R

    2018-01-01

    Accurate forest inventory is of great economic importance to optimize the entire supply chain management in pulp and paper companies. The aim of this study was to estimate stand dominant and mean heights (HD and HM) and tree density (TD) of Pinus taeda plantations located in South Brazil using in-situ measurements, airborne Light Detection and Ranging (LiDAR) data and non-parametric k-nearest neighbor (k-NN) imputation. Forest inventory attributes and LiDAR-derived metrics were calculated at 53 regular sample plots and we used imputation models to retrieve the forest attributes at plot and landscape levels. The best LiDAR-derived metrics to predict HD, HM and TD were H99TH, HSD, SKE and HMIN. The imputation model using the selected metrics was more effective for retrieving height than tree density. The model coefficients of determination (adj.R2) and root mean squared differences (RMSD) for HD, HM and TD were 0.90, 0.94, 0.38m and 6.99, 5.70, 12.92%, respectively. Our results show that LiDAR and k-NN imputation can be used to predict stand heights with high accuracy in Pinus taeda. However, further studies are needed to improve the prediction accuracy for TD and to evaluate and compare the cost of acquisition and processing of LiDAR data against conventional inventory procedures.

  3. Natural functional SNPs in miR-155 alter its expression level, blood cell counts and immune responses

    Directory of Open Access Journals (Sweden)

    Congcong Li

    2016-08-01

    Full Text Available miR-155 has been confirmed to be a key factor in immune responses in humans and other mammals. Therefore, investigation of variations in miR-155 could be useful for understanding the differences in immunity between individuals. In this study, four SNPs in miR-155 were identified in mice (Mus musculus) and humans (Homo sapiens). In mice, the four SNPs were closely linked and formed two miR-155 haplotypes (A and B). Ten distinct types of blood parameters were associated with miR-155 expression under normal conditions. Additionally, 4 and 14 blood parameters were significantly different between these two genotypes under normal and lipopolysaccharide (LPS) stimulation conditions, respectively. Moreover, the expression levels of miR-155, the inflammatory response to LPS stimulation and the lethal ratio following Salmonella typhimurium infection were significantly increased in mice harboring the AA genotype. Further, two SNPs, one in the loop region and the other near the 3' terminal of pre-miR-155, were confirmed to be responsible for the differential expression of miR-155 in mice. Interestingly, two additional SNPs, one in the loop region and the other in the middle of miR-155*, modulated the function of miR-155 in humans. Predictions of secondary RNA structure using RNAfold showed that these SNPs affected the structure of miR-155 in both mice and humans. Our results provide novel evidence of the natural functional SNPs of miR-155 in both mice and humans, which may affect the expression levels of mature miR-155 by modulating its secondary structure. The SNPs of human miR-155 may be considered as causal mutations for some immune-related diseases in the clinic. The two genotypes of mice could be used as natural models for studying the mechanisms of immune diseases caused by abnormal expression of miR-155 in humans.

  4. Nonparametric autocovariance estimation from censored time series by Gaussian imputation.

    Science.gov (United States)

    Park, Jung Wook; Genton, Marc G; Ghosh, Sujit K

    2009-02-01

    One of the most frequently used methods to model the autocovariance function of a second-order stationary time series is to use the parametric framework of autoregressive and moving average models developed by Box and Jenkins. However, such parametric models, though very flexible, may not always be adequate to model autocovariance functions with sharp changes. Furthermore, if the data do not follow the parametric model and are censored at a certain value, the estimation results may not be reliable. We develop a Gaussian imputation method to estimate an autocovariance structure via nonparametric estimation of the autocovariance function in order to address both censoring and incorrect model specification. We demonstrate the effectiveness of the technique in terms of bias and efficiency with simulations under various rates of censoring and underlying models. We describe its application to a time series of silicon concentrations in the Arctic.

  5. Traffic Speed Data Imputation Method Based on Tensor Completion

    Directory of Open Access Journals (Sweden)

    Bin Ran

    2015-01-01

    Full Text Available Traffic speed data plays a key role in Intelligent Transportation Systems (ITS); however, missing traffic data would affect the performance of ITS as well as Advanced Traveler Information Systems (ATIS). In this paper, we handle this issue by a novel tensor-based imputation approach. Specifically, tensor pattern is adopted for modeling traffic speed data and then High accurate Low Rank Tensor Completion (HaLRTC), an efficient tensor completion method, is employed to estimate the missing traffic speed data. This proposed method is able to recover missing entries from given entries, which may be noisy, considering severe fluctuation of traffic speed data compared with traffic volume. The proposed method is evaluated on Performance Measurement System (PeMS) database, and the experimental results show the superiority of the proposed approach over state-of-the-art baseline approaches.

  6. Traffic speed data imputation method based on tensor completion.

    Science.gov (United States)

    Ran, Bin; Tan, Huachun; Feng, Jianshuai; Liu, Ying; Wang, Wuhong

    2015-01-01

    Traffic speed data plays a key role in Intelligent Transportation Systems (ITS); however, missing traffic data would affect the performance of ITS as well as Advanced Traveler Information Systems (ATIS). In this paper, we handle this issue by a novel tensor-based imputation approach. Specifically, tensor pattern is adopted for modeling traffic speed data and then High accurate Low Rank Tensor Completion (HaLRTC), an efficient tensor completion method, is employed to estimate the missing traffic speed data. This proposed method is able to recover missing entries from given entries, which may be noisy, considering severe fluctuation of traffic speed data compared with traffic volume. The proposed method is evaluated on Performance Measurement System (PeMS) database, and the experimental results show the superiority of the proposed approach over state-of-the-art baseline approaches.

  7. Association of SNPs with the efficacy and safety of immunosuppressant therapy after heart transplantation.

    Science.gov (United States)

    Sánchez-Lázaro, Ignacio; Herrero, María José; Jordán-De Luna, Consuelo; Bosó, Virginia; Almenar, Luis; Rojas, Luis; Martínez-Dolz, Luis; Megías-Vericat, Juan E; Sendra, Luis; Miguel, Antonio; Poveda, José L; Aliño, Salvador F

    2015-01-01

    The aim was to study the possible influence of SNPs on the efficacy and safety of calcineurin inhibitors after heart transplantation. In 60 heart transplant patients treated with tacrolimus or cyclosporine, we studied a panel of 36 SNPs correlated with a series of clinical parameters during the first post-transplantation year. The presence of serious infections was correlated with ABCB1 rs1128503 (p = 0.012); the CC genotype reduced the probability of infections and was also associated with lower blood cyclosporine concentrations. Lower renal function levels were found in patients with rs9282564 AG (p = 0.003), related to higher blood cyclosporine levels. A tendency toward increased graft rejection (p = 0.05) was correlated with rs2066844 CC in NOD2/CARD15, a gene related to lymphocyte activation. Pharmacogenetics can help identify patients at increased risk of clinical complications. Original submitted 30 January 2015; revision submitted 27 March 2015.

  8. RTEL1 tagging SNPs and haplotypes were associated with glioma development.

    Science.gov (United States)

    Li, Gang; Jin, Tianbo; Liang, Hongjuan; Zhang, Zhiguo; He, Shiming; Tu, Yanyang; Yang, Haixia; Geng, Tingting; Cui, Guangbin; Chen, Chao; Gao, Guodong

    2013-05-17

    Glioma is the most prevalent primary solid tumor of the central nervous system, and certain single-nucleotide polymorphisms (SNPs) may be related to increased glioma risk and have implications in carcinogenesis. The present case-control study was carried out to elucidate how common variants contribute to glioma susceptibility. Ten candidate tagging SNPs (tSNPs) with a minor allele frequency (MAF) > 5% in the HapMap Asian population were selected from seven genes whose polymorphisms have been reported in the literature and in reliable databases to be related to gliomas. The selected tSNPs were genotyped in 629 glioma patients and 645 controls from a Han Chinese population using the multiplexed SNP MassEXTEND assay calibrated. Two tSNPs in the RTEL1 gene were significantly associated with glioma risk (rs6010620, P=0.0016, OR: 1.32, 95% CI: 1.11-1.56; rs2297440, P=0.001, OR: 1.33, 95% CI: 1.12-1.58) by the χ2 test. The genotype "GG" of rs6010620 acted as a protective genotype for glioma (OR, 0.46; 95% CI, 0.31-0.7; P=0.0002), as did the genotype "CC" of rs2297440 (OR, 0.47; 95% CI, 0.31-0.71; P=0.0003). Furthermore, haplotype "GCT" in the RTEL1 gene was associated with a reduced risk of glioma (OR, 0.7; 95% CI, 0.57-0.86; Fisher's P=0.0005; Pearson's P=0.0005), and haplotype "ATT" was associated with an increased risk of glioma (OR, 1.32; 95% CI, 1.12-1.57; Fisher's P=0.0013; Pearson's P=0.0013). Two single variants in the RTEL1 gene, the genotype "GG" of rs6010620 and the genotype "CC" of rs2297440, together with the two haplotypes GCT and ATT, were thus associated with glioma development. Screening for these RTEL1 tagging SNPs and haplotypes might be used to evaluate the risk of glioma development. The virtual slides for this article can be found here: http://www.diagnosticpathology.diagnomx.eu/vs/1993021136961998.
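
    The odds ratios and 95% confidence intervals quoted in this record are the usual summaries of a 2×2 table of allele (or genotype) counts by case-control status. A minimal sketch of that calculation follows, assuming the Woolf (log-scale) interval, which the abstract does not specify. The function name and the counts are hypothetical and are not taken from the study.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-scale) confidence interval for a 2x2 table.

    a, b: counts of the allele of interest and of the other allele in cases
    c, d: the same two counts in controls
    z:    normal quantile (1.96 gives an approximate 95% interval)
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical allele counts, for illustration only.
or_, lo, hi = odds_ratio_ci(a=400, b=858, c=330, d=960)
```

    An interval that excludes 1 corresponds to a nominally significant association; an odds ratio below 1 reads as protective and one above 1 as risk-increasing.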

  9. An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.

    Science.gov (United States)

    Liu, Yuzhe; Gopalakrishnan, Vanathi

    2017-03-01

    Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
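
    Two of the four methods named in this record, mean imputation and k-nearest neighbors, are simple enough to sketch directly. The version below is only an illustration under stated assumptions: missing values are encoded as None, distance is Euclidean over the columns two rows share, and the function names and toy data are hypothetical rather than the authors' implementation.

```python
import math

def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k nearest donor rows."""
    def distance(r, s):
        # Compare only the columns observed in both rows.
        shared = [(x, y) for x, y in zip(r, s) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = []
    for i, r in enumerate(rows):
        new_row = list(r)
        for j, v in enumerate(r):
            if v is None:
                # Donors are the other rows that have column j observed.
                donors = [(distance(r, s), s[j]) for t, s in enumerate(rows)
                          if t != i and s[j] is not None]
                donors.sort(key=lambda pair: pair[0])
                nearest = [value for _, value in donors[:k]]
                new_row[j] = sum(nearest) / len(nearest)
        filled.append(new_row)
    return filled

# Toy data: the second row is missing its second feature.
data = [[1.0, 2.0], [1.1, None], [0.9, 1.8], [5.0, 9.0]]
mean_filled = mean_impute(data)
knn_filled = knn_impute(data, k=2)
```

    On this toy table, mean imputation is pulled toward the outlying fourth row, while the k-nearest-neighbors estimate borrows only from the two rows that resemble the incomplete one.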

  10. Screening for SNPs with Allele-Specific Methylation based on Next-Generation Sequencing Data

    OpenAIRE

    Hu, Bo; Ji, Yuan; Xu, Yaomin; Ting, Angela H

    2013-01-01

    Allele-specific methylation (ASM) has long been studied but mainly documented in the context of genomic imprinting and X chromosome inactivation. Taking advantage of the next-generation sequencing technology, we conduct a high-throughput sequencing experiment with four prostate cell lines to survey the whole genome and identify single nucleotide polymorphisms (SNPs) with ASM. A Bayesian approach is proposed to model the counts of short reads for each SNP conditional on its genotypes of multip...

  11. Identification of novel single nucleotide polymorphisms (SNPs) in deer (Odocoileus spp.) using the BovineSNP50 BeadChip.

    Directory of Open Access Journals (Sweden)

    Gwilym D Haynes

    Single nucleotide polymorphisms (SNPs) are growing in popularity as a genetic marker for investigating evolutionary processes. A panel of SNPs is often developed by comparing large quantities of DNA sequence data across multiple individuals to identify polymorphic sites. For non-model species, this is particularly difficult, as performing the necessary large-scale genomic sequencing often exceeds the resources available for the project. In this study, we trial the Bovine SNP50 BeadChip developed in cattle (Bos taurus) for identifying polymorphic SNPs in cervids Odocoileus hemionus (mule deer and black-tailed deer) and O. virginianus (white-tailed deer) in the Pacific Northwest. We found that 38.7% of loci could be genotyped, of which 5% (n = 1068) were polymorphic. Of these 1068 polymorphic SNPs, a mixture of putatively neutral loci (n = 878) and loci under selection (n = 190) were identified with the F(ST)-outlier method. A range of population genetic analyses were implemented using these SNPs and a panel of 10 microsatellite loci. The three types of deer could readily be distinguished with both the SNP and microsatellite datasets. This study demonstrates that commercially developed SNP chips are a viable means of SNP discovery for non-model organisms, even when used between very distantly related species (the Bovidae and Cervidae families diverged some 25.1-30.1 million years before present).

  12. In silico analysis of consequences of non-synonymous SNPs of Slc11a2 gene in Indian bovines

    Directory of Open Access Journals (Sweden)

    Shreya M. Patel

    2015-09-01

    The aim of our study was to analyze the consequences of non-synonymous SNPs in Slc11a2 gene using bioinformatic tools. There is a current need of efficient bioinformatic tools for in-depth analysis of data generated by the next generation sequencing technologies. SNPs are known to play an imperative role in understanding the genetic basis of many genetic diseases. Slc11a2 is one of the major metal transporter families in mammals and plays a critical role in host defenses. In this study, we performed a comprehensive analysis of the impact of all non-synonymous SNPs in this gene using multiple tools like SIFT, PROVEAN, I-Mutant and PANTHER. Among the total 124 SNPs obtained from amplicon sequencing of Slc11a2 gene by Ion Torrent PGM involving 10 individuals of Gir cattle and Murrah buffalo each, we found 22 non-synonymous. Comparing the prediction of these 4 methods, 5 nsSNPs (G369R, Y374C, A377V, Q385H and N492S) were identified as deleterious. In addition, while tested out for polar interactions with other amino acids in the protein, from above 5, Y374C, Q385H and N492S showed a change in interaction pattern and further confirmed by an increase in total energy after energy minimizations in case of mutant protein compared to the native.

  13. CLC-2 single nucleotide polymorphisms (SNPs) as potential modifiers of cystic fibrosis disease severity

    Science.gov (United States)

    Blaisdell, Carol J; Howard, Timothy D; Stern, Augustus; Bamford, Penelope; Bleecker, Eugene R; Stine, O Colin

    2004-01-01

    Background Cystic fibrosis (CF) lung disease manifest by impaired chloride secretion leads to eventual respiratory failure. Candidate genes that may modify CF lung disease severity include alternative chloride channels. The objectives of this study are to identify single nucleotide polymorphisms (SNPs) in the airway epithelial chloride channel, CLC-2, and correlate these polymorphisms with CF lung disease. Methods The CLC-2 promoter, intron 1 and exon 20 were examined for SNPs in adult CF dF508/dF508 homozygotes with mild and severe lung disease (forced expiratory volume at one second (FEV1) > 70% and < 40%). Results PCR amplification of genomic CLC-2 and sequence analysis revealed 1 polymorphism in the hClC -2 promoter, 4 in intron 1, and none in exon 20. Fisher's analysis within this data set, did not demonstrate a significant relationship between the severity of lung disease and SNPs in the CLC-2 gene. Conclusions CLC-2 is not a key modifier gene of CF lung phenotype. Further studies evaluating other phenotypes associated with CF may be useful in the future to assess the ability of CLC-2 to modify CF disease severity. PMID:15507145

  14. CLC-2 single nucleotide polymorphisms (SNPs) as potential modifiers of cystic fibrosis disease severity

    Directory of Open Access Journals (Sweden)

    Bleecker Eugene R

    2004-10-01

    Abstract Background Cystic fibrosis (CF) lung disease manifest by impaired chloride secretion leads to eventual respiratory failure. Candidate genes that may modify CF lung disease severity include alternative chloride channels. The objectives of this study are to identify single nucleotide polymorphisms (SNPs) in the airway epithelial chloride channel, CLC-2, and correlate these polymorphisms with CF lung disease. Methods The CLC-2 promoter, intron 1 and exon 20 were examined for SNPs in adult CF dF508/dF508 homozygotes with mild and severe lung disease (forced expiratory volume at one second (FEV1) > 70% and < 40%). Results PCR amplification of genomic CLC-2 and sequence analysis revealed 1 polymorphism in the hClC -2 promoter, 4 in intron 1, and none in exon 20. Fisher's analysis within this data set, did not demonstrate a significant relationship between the severity of lung disease and SNPs in the CLC-2 gene. Conclusions CLC-2 is not a key modifier gene of CF lung phenotype. Further studies evaluating other phenotypes associated with CF may be useful in the future to assess the ability of CLC-2 to modify CF disease severity.

  15. Supplementary data: SNPs in genes with copy number variation: A ...

    Indian Academy of Sciences (India)

    The bases at equivalent positions of the duplicon(s) for each SNP are shown in table 1 for HBA1 and table 2 (a, b) for PSORS1 and GH1.

    Table 1. SNPs of haemoglobin: α-locus 1 (NCBI Build 126).

    SNP ID      Nucleotide change  Location  Wild type bases
                                             HbA1  HbA2  HbZ  HbQ1  HbM
    rs28928888  T>C                exon 1    T     T     C    T     C

  16. Genome-wide association study of smoking behaviors in COPD patients

    Science.gov (United States)

    Siedlinski, Mateusz; Cho, Michael H.; Bakke, Per; Gulsvik, Amund; Lomas, David A.; Anderson, Wayne; Kong, Xiangyang; Rennard, Stephen I.; Beaty, Terri H.; Hokanson, John E.; Crapo, James D.; Silverman, Edwin K.

    2012-01-01

    Background Cigarette smoking is a major risk factor for COPD and COPD severity. Previous genome-wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) associated with the number of cigarettes smoked per day (CPD) and a Dopamine Beta-Hydroxylase (DBH) locus associated with smoking cessation in multiple populations. Objective To identify SNPs associated with lifetime average and current CPD, age at smoking initiation, and smoking cessation in COPD subjects. Methods GWAS were conducted in 4 independent cohorts encompassing 3,441 ever-smoking COPD subjects (GOLD stage II or higher). Untyped SNPs were imputed using HapMap (phase II) panel. Results from all cohorts were meta-analyzed. Results Several SNPs near the HLA region on chromosome 6p21 and in an intergenic region on chromosome 2q21 showed associations with age at smoking initiation, both with the lowest p=2×10⁻⁷. No SNPs were associated with lifetime average CPD, current CPD or smoking cessation with p<10⁻⁶. Nominally significant associations with candidate SNPs within alpha-nicotinic acetylcholine receptors 3/5 (CHRNA3/CHRNA5; e.g. p=0.00011 for SNP rs1051730) and Cytochrome P450 2A6 (CYP2A6; e.g. p=2.78×10⁻⁵ for a nonsynonymous SNP rs1801272) regions were observed for lifetime average CPD, however only CYP2A6 showed evidence of significant association with current CPD. A candidate SNP (rs3025343) in the DBH was significantly (p=0.015) associated with smoking cessation. Conclusion We identified two candidate regions associated with age at smoking initiation in COPD subjects. Associations of CHRNA3/CHRNA5 and CYP2A6 loci with CPD and DBH with smoking cessation are also likely of importance in the smoking behaviors of COPD patients. PMID:21685187

  17. The use of the bootstrap in the analysis of case-control studies with missing data

    DEFF Research Database (Denmark)

    Siersma, Volkert Dirk; Johansen, Christoffer

    2004-01-01

    nonparametric bootstrap, bootstrap confidence intervals, missing values, multiple imputation, matched case-control study

  18. Genetic association of SNPs in the FTO gene and predisposition to obesity in Malaysian Malays

    International Nuclear Information System (INIS)

    Apalasamy, Y.D.; Ming, M.F.; Rampal, S.; Bulgiba, A.; Mohamed, Z.

    2012-01-01

    The common variants in the fat mass- and obesity-associated (FTO) gene have been previously found to be associated with obesity in various adult populations. The objective of the present study was to investigate whether the single nucleotide polymorphisms (SNPs) and linkage disequilibrium (LD) blocks in various regions of the FTO gene are associated with predisposition to obesity in Malaysian Malays. Thirty-one FTO SNPs were genotyped in 587 (158 obese and 429 non-obese) Malaysian Malay subjects. Obesity traits and lipid profiles were measured and single-marker association testing, LD testing, and haplotype association analysis were performed. LD analysis of the FTO SNPs revealed the presence of 57 regions with complete LD (D' = 1.0). In addition, we detected the association of rs17817288 with low-density lipoprotein cholesterol. The FTO gene may therefore be involved in lipid metabolism in Malaysian Malays. Two haplotype blocks were present in this region of the FTO gene, but no particular haplotype was found to be significantly associated with an increased risk of obesity in Malaysian Malays.
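
    The statement "complete LD (D' = 1.0)" refers to Lewontin's normalized disequilibrium coefficient for a pair of biallelic loci. As a rough sketch of how that quantity is computed, assuming the haplotype and allele frequencies are already known (with unphased genotypes they would first have to be estimated, which is not shown), and with a hypothetical function name:

```python
def d_prime(p_ab, p_a, p_b):
    """Lewontin's |D'| for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A
    p_b:  frequency of allele B
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0

# Allele A (frequency 0.3) always travels with allele B (frequency 0.5).
complete_ld = d_prime(p_ab=0.3, p_a=0.3, p_b=0.5)
# Haplotype frequency equal to the product of allele frequencies: no LD.
no_ld = d_prime(p_ab=0.15, p_a=0.3, p_b=0.5)
```

    A value of 1.0 arises whenever at least one of the four possible haplotypes is absent, which is why D' can equal 1 even when the two SNPs have quite different allele frequencies.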

  19. TGFβ1 SNPs and radio-induced toxicity in prostate cancer patients

    International Nuclear Information System (INIS)

    Fachal, Laura; Gómez-Caamaño, Antonio; Sánchez-García, Manuel; Carballo, Ana; Peleteiro, Paula; Lobato-Busto, Ramón; Carracedo, Ángel; Vega, Ana

    2012-01-01

    Background and purpose: We have performed a case-control study in 413 prostate cancer patients to test for association between TGFβ1 and the development of late normal-tissue toxicity among prostate cancer patients treated with three-dimensional conformational radiotherapy (3D-CRT). Materials and methods: Late gastrointestinal and genitourinary toxicities were assessed for at least two years after radiotherapy in 413 patients according to CTCAEvs3 scores. Codominant genotypic tests and haplotypic analyses were undertaken to evaluate the correlation between TGFβ1 SNPs rs1800469, rs1800470 and rs1800472 and radio-induced toxicity. Results: Neither the SNPs nor the haplotypes were found to be associated with the risk of late toxicity. Conclusions: We were able to exclude up to a 2-fold increase in the risk of developing late gastrointestinal and genitourinary radio-induced toxicity due to the TGFβ1 SNPs rs1800469 and rs1800470, as well as the two most frequent TGFβ1 haplotypes.

  20. Genetic association of SNPs in the FTO gene and predisposition to obesity in Malaysian Malays

    Energy Technology Data Exchange (ETDEWEB)

    Apalasamy, Y.D. [Pharmacogenomics Laboratory, Department of Pharmacology, Faculty of Medicine, University of Malaya, Kuala Lumpur (Malaysia); Ming, M.F.; Rampal, S.; Bulgiba, A. [Julius Centre University of Malaya, Department of Social and Preventive Medicine, Faculty of Medicine, University of Malaya, Kuala Lumpur (Malaysia); Mohamed, Z. [Pharmacogenomics Laboratory, Department of Pharmacology, Faculty of Medicine, University of Malaya, Kuala Lumpur (Malaysia)

    2012-08-24

    The common variants in the fat mass- and obesity-associated (FTO) gene have been previously found to be associated with obesity in various adult populations. The objective of the present study was to investigate whether the single nucleotide polymorphisms (SNPs) and linkage disequilibrium (LD) blocks in various regions of the FTO gene are associated with predisposition to obesity in Malaysian Malays. Thirty-one FTO SNPs were genotyped in 587 (158 obese and 429 non-obese) Malaysian Malay subjects. Obesity traits and lipid profiles were measured and single-marker association testing, LD testing, and haplotype association analysis were performed. LD analysis of the FTO SNPs revealed the presence of 57 regions with complete LD (D' = 1.0). In addition, we detected the association of rs17817288 with low-density lipoprotein cholesterol. The FTO gene may therefore be involved in lipid metabolism in Malaysian Malays. Two haplotype blocks were present in this region of the FTO gene, but no particular haplotype was found to be significantly associated with an increased risk of obesity in Malaysian Malays.

  1. Enrichment of risk SNPs in regulatory regions implicate diverse tissues in Parkinson's disease etiology.

    Science.gov (United States)

    Coetzee, Simon G; Pierce, Steven; Brundin, Patrik; Brundin, Lena; Hazelett, Dennis J; Coetzee, Gerhard A

    2016-07-27

    Recent genome-wide association studies (GWAS) of Parkinson's disease (PD) revealed at least 26 risk loci, with associated single nucleotide polymorphisms (SNPs) located in non-coding DNA having unknown functions in risk. In order to explore in which cell types these SNPs (and their correlated surrogates at r² ≥ 0.8) could alter cellular function, we assessed their location overlap with histone modification regions that indicate transcription regulation in 77 diverse cell types. We found statistically significant enrichment of risk SNPs at 12 loci in active enhancers or promoters. We investigated 4 risk loci in depth that were most significantly enriched (−log_e P > 14) and contained 8 putative enhancers in the different cell types. These enriched loci, along with eQTL associations, were unexpectedly present in non-neuronal cell types. These included lymphocytes, mesendoderm, liver- and fat-cells, indicating that cell types outside the brain are involved in the genetic predisposition to PD. Annotating regulatory risk regions within specific cell types may unravel new putative risk mechanisms and molecular pathways that contribute to PD development.

  2. Enrichment of risk SNPs in regulatory regions implicate diverse tissues in Parkinson’s disease etiology

    Science.gov (United States)

    Coetzee, Simon G.; Pierce, Steven; Brundin, Patrik; Brundin, Lena; Hazelett, Dennis J.; Coetzee, Gerhard A.

    2016-01-01

    Recent genome-wide association studies (GWAS) of Parkinson’s disease (PD) revealed at least 26 risk loci, with associated single nucleotide polymorphisms (SNPs) located in non-coding DNA having unknown functions in risk. In order to explore in which cell types these SNPs (and their correlated surrogates at r² ≥ 0.8) could alter cellular function, we assessed their location overlap with histone modification regions that indicate transcription regulation in 77 diverse cell types. We found statistically significant enrichment of risk SNPs at 12 loci in active enhancers or promoters. We investigated 4 risk loci in depth that were most significantly enriched (−log_e P > 14) and contained 8 putative enhancers in the different cell types. These enriched loci, along with eQTL associations, were unexpectedly present in non-neuronal cell types. These included lymphocytes, mesendoderm, liver- and fat-cells, indicating that cell types outside the brain are involved in the genetic predisposition to PD. Annotating regulatory risk regions within specific cell types may unravel new putative risk mechanisms and molecular pathways that contribute to PD development. PMID:27461410

  3. SNPs of melanocortin 4 receptor (MC4R) associated with body weight in Beagle dogs.

    Science.gov (United States)

    Zeng, Ruixia; Zhang, Yibo; Du, Peng

    2014-01-01

    Melanocortin 4 receptor (MC4R), which is associated with inherited human obesity, is involved in the food intake and body weight of mammals. To study the relationships between MC4R gene polymorphism and body weight in Beagle dogs, we detected and compared the nucleotide sequence of the whole coding region and 3'- and 5'- flanking regions of the dog MC4R gene (1214 bp). In 120 Beagle dogs, two SNPs (A420C, C895T) were identified and their relation with body weight was analyzed with the RFLP-PCR method. The results showed that the SNP A420C, which changes amino acid 101 of the MC4R protein from asparagine to threonine, was significantly associated with the canine body weight trait, while body weight variations associated with the MC4R nonsense mutation at C895T were significant in female dogs. This suggests that the two SNPs might affect the function of the MC4R gene in relation to body weight in Beagle dogs. Therefore, MC4R is a candidate gene for selecting dogs of different sizes, with the MC4R SNPs (A420C, C895T) being potentially valuable as genetic markers.

  4. Genetic association of SNPs in the FTO gene and predisposition to obesity in Malaysian Malays

    Directory of Open Access Journals (Sweden)

    Y.D. Apalasamy

    2012-12-01

    The common variants in the fat mass- and obesity-associated (FTO) gene have been previously found to be associated with obesity in various adult populations. The objective of the present study was to investigate whether the single nucleotide polymorphisms (SNPs) and linkage disequilibrium (LD) blocks in various regions of the FTO gene are associated with predisposition to obesity in Malaysian Malays. Thirty-one FTO SNPs were genotyped in 587 (158 obese and 429 non-obese) Malaysian Malay subjects. Obesity traits and lipid profiles were measured and single-marker association testing, LD testing, and haplotype association analysis were performed. LD analysis of the FTO SNPs revealed the presence of 57 regions with complete LD (D’ = 1.0). In addition, we detected the association of rs17817288 with low-density lipoprotein cholesterol. The FTO gene may therefore be involved in lipid metabolism in Malaysian Malays. Two haplotype blocks were present in this region of the FTO gene, but no particular haplotype was found to be significantly associated with an increased risk of obesity in Malaysian Malays.

  5. PCA-based bootstrap confidence interval tests for gene-disease association involving multiple SNPs

    Directory of Open Access Journals (Sweden)

    Xue Fuzhong

    2010-01-01

    Abstract Background Genetic association study is currently the primary vehicle for identification and characterization of disease-predisposing variant(s), which usually involves multiple single-nucleotide polymorphisms (SNPs) available. However, SNP-wise association tests raise concerns over multiple testing. Haplotype-based methods have the advantage of being able to account for correlations between neighbouring SNPs, yet assuming Hardy-Weinberg equilibrium (HWE) and a potentially large number of degrees of freedom can harm their statistical power and robustness. Approaches based on principal component analysis (PCA) are preferable in this regard but their performance varies with methods of extracting principal components (PCs). Results PCA-based bootstrap confidence interval test (PCA-BCIT), which directly uses the PC scores to assess gene-disease association, was developed and evaluated for three ways of extracting PCs, i.e., cases only (CAES), controls only (COES) and cases and controls combined (CES). Extraction of PCs with COES is preferred to that with CAES and CES. Performance of the test was examined via simulations as well as analyses on data of rheumatoid arthritis and heroin addiction; the test maintained the nominal level under the null hypothesis and showed comparable performance with the permutation test. Conclusions PCA-BCIT is a valid and powerful method for assessing gene-disease association involving multiple SNPs.
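
    The abstract does not spell out the decision rule of PCA-BCIT, so the following is only one plausible reading, offered as a sketch: extract the first principal component from controls only, score cases and controls on it, bootstrap a percentile confidence interval for the mean score in each group, and declare association when the two intervals do not overlap. The function names, the pooled-mean centering, the use of a single component, and the simulated genotypes are all assumptions made for illustration; the published procedure may differ in each of these respects.

```python
import numpy as np

def pca_bootstrap_ci_test(cases, controls, n_boot=2000, alpha=0.05, seed=0):
    """Compare bootstrap percentile CIs of the mean first-PC score.

    cases, controls: arrays of shape (n_subjects, n_snps), genotypes coded 0/1/2.
    Returns (case_ci, control_ci, reject); reject is True when the two
    intervals do not overlap.
    """
    rng = np.random.default_rng(seed)
    # Loadings of the first principal component, from controls only.
    cov = np.cov(controls, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)      # eigenvalues come back ascending
    pc1 = eigvecs[:, -1]
    # Score every subject on that component after pooled-mean centering.
    center = np.vstack([cases, controls]).mean(axis=0)
    case_scores = (cases - center) @ pc1
    control_scores = (controls - center) @ pc1

    def percentile_ci(scores):
        # Resample subjects with replacement and collect the mean score.
        idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
        means = scores[idx].mean(axis=1)
        lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return float(lo), float(hi)

    case_ci = percentile_ci(case_scores)
    control_ci = percentile_ci(control_scores)
    reject = case_ci[0] > control_ci[1] or case_ci[1] < control_ci[0]
    return case_ci, control_ci, reject

def simulate(rng, n, freq, n_snps=5):
    """Hypothetical genotypes for SNPs in strong LD with one another."""
    base = rng.binomial(2, freq, size=(n, 1)).repeat(n_snps, axis=1)
    noise = rng.binomial(2, 0.5, size=(n, n_snps))
    swap = rng.random((n, n_snps)) < 0.1  # 10% of entries are re-drawn
    return np.where(swap, noise, base).astype(float)

demo_rng = np.random.default_rng(1)
controls = simulate(demo_rng, 300, 0.3)   # allele frequency 0.3 in controls
cases = simulate(demo_rng, 300, 0.6)      # allele frequency 0.6 in cases
case_ci, control_ci, reject = pca_bootstrap_ci_test(cases, controls)
```

    Because the loadings are computed once from the controls, the sign of the component is arbitrary but fixed, so it has no effect on whether the two intervals overlap.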

  6. In-Silico Computing of the Most Deleterious nsSNPs in HBA1 Gene.

    Directory of Open Access Journals (Sweden)

    Sayed AbdulAzeez

    α-Thalassemia (α-thal) is a genetic disorder caused by the substitution of a single amino acid or large deletions in the HBA1 and/or HBA2 genes. Modern bioinformatics tools were used as a systematic in-silico approach to predict the deleterious SNPs in the HBA1 gene and their pathogenic impact on the function and structure of the HBA1 protein. A total of 389 SNPs in HBA1 were retrieved from the dbSNP database, which includes: 201 non-coding synonymous (nsSNPs), 43 human active SNPs, 16 intronic SNPs, 11 mRNA 3' UTR SNPs, 9 coding synonymous SNPs, 9 5' UTR SNPs and other types. A structural homology-based method (PolyPhen) and sequence homology-based tools (SIFT, SNPs&Go, PROVEAN and PANTHER) revealed that 2.4% of the nsSNPs are pathogenic. A total of 5 nsSNPs (G60V, K17M, K17T, L92F and W15R) were predicted to be responsible for the structural and functional modifications of HBA1 protein. It is evident from the comprehensive in-silico analysis that two nsSNPs in HBA1, G60V and W15R, are highly deleterious. These two pathogenic nsSNPs can be considered for wet-lab confirmatory analysis.

  7. RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

    KAUST Repository

    Kim, Ji-Sung; Gao, Xin; Rzhetsky, Andrey

    2018-01-01

    are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race

  8. Improving accuracy of genomic prediction in Brangus cattle by adding animals with imputed low-density SNP genotypes.

    Science.gov (United States)

    Lopes, F B; Wu, X-L; Li, H; Xu, J; Perkins, T; Genho, J; Ferretti, R; Tait, R G; Bauck, S; Rosa, G J M

    2018-02-01

    Reliable genomic prediction of breeding values for quantitative traits requires the availability of a sufficient number of animals with genotypes and phenotypes in the training set. As of 31 October 2016, there were 3,797 Brangus animals with genotypes and phenotypes. These Brangus animals were genotyped using different commercial SNP chips. Of them, the largest group consisted of 1,535 animals genotyped by the GGP-LDV4 SNP chip. The remaining 2,262 genotypes were imputed to the SNP content of the GGP-LDV4 chip, so that the number of animals available for training the genomic prediction models was more than doubled. The present study showed that the pooling of animals with either original or imputed 40K SNP genotypes substantially increased genomic prediction accuracies on the ten traits. By supplementing imputed genotypes, the relative gains in genomic prediction accuracies on estimated breeding values (EBV) were from 12.60% to 31.27%, and the relative gain in genomic prediction accuracies on de-regressed EBV was slightly smaller (i.e. 0.87%-18.75%). The present study also compared the performance of five genomic prediction models and two cross-validation methods. The five genomic models predicted EBV and de-regressed EBV of the ten traits similarly well. Of the two cross-validation methods, leave-one-out cross-validation maximized the number of animals at the stage of training for genomic prediction. Genomic prediction accuracy (GPA) on the ten quantitative traits was validated in 1,106 newly genotyped Brangus animals based on the SNP effects estimated in the previous set of 3,797 Brangus animals, and they were slightly lower than GPA in the original data. The present study was the first to leverage currently available genotype and phenotype resources in order to harness genomic prediction in Brangus beef cattle. © 2018 Blackwell Verlag GmbH.

  9. APCR, factor V gene known and novel SNPs and adverse pregnancy outcomes in an Irish cohort of pregnant women

    LENUS (Irish Health Repository)

    Sedano-Balbas, Sara

    2010-03-10

    Abstract Background Activated Protein C Resistance (APCR), a poor anticoagulant response of APC in haemostasis, is the commonest heritable thrombophilia. Adverse outcomes during pregnancy have been linked to APCR. This study determined the frequency of APCR, factor V gene known and novel SNPs and adverse outcomes in a group of pregnant women. Methods Blood samples collected from 907 pregnant women were tested using the Coatest® Classic and Modified functional haematological tests to establish the frequency of APCR. PCR-Restriction Enzyme Analysis (PCR-REA), PCR-DNA probe hybridisation analysis and DNA sequencing were used for molecular screening of known mutations in the factor V gene in subjects determined to have APCR based on the Coatest® Classic and/or Modified functional haematological tests. Glycosylase Mediated Polymorphism Detection (GMPD), a SNP screening technique and DNA sequencing, were used to identify SNPs in the factor V gene of 5 APCR subjects. Results Sixteen percent of the study group had an APCR phenotype. Factor V Leiden (FVL), FV Cambridge, and haplotype (H) R2 alleles were identified in this group. Thirty-three SNPs; 9 silent SNPs and 24 missense SNPs, of which 20 SNPs were novel, were identified in the 5 APCR subjects. Adverse pregnancy outcomes were found at a frequency of 35% in the group with APCR based on Classic Coatest® test only and at 45% in the group with APCR based on the Modified Coatest® test. Forty-eight percent of subjects with FVL had adverse outcomes while in the group of subjects with no FVL, adverse outcomes occurred at a frequency of 37%. Conclusions Known mutations and novel SNPs in the factor V gene were identified in the study cohort determined to have APCR in pregnancy. Further studies are required to investigate the contribution of these novel SNPs to the APCR phenotype. Adverse outcomes including early pregnancy loss (EPL), preeclampsia (PET) and intrauterine growth restriction (IUGR) were not significantly more

  10. Simple nuclear norm based algorithms for imputing missing data and forecasting in time series

    OpenAIRE

    Butcher, Holly Louise; Gillard, Jonathan William

    2017-01-01

    There has been much recent progress on the use of the nuclear norm for the so-called matrix completion problem (the problem of imputing missing values of a matrix). In this paper we investigate the use of the nuclear norm for modelling time series, with particular attention to imputing missing data and forecasting. We introduce a simple alternating projections type algorithm based on the nuclear norm for these tasks, and consider a number of practical examples.

  11. Missing value imputation for microarray gene expression data using histone acetylation information

    Directory of Open Access Journals (Sweden)

    Feng Jihua

    2008-05-01

    Full Text Available Abstract Background It is an important pre-processing step to accurately estimate missing values in microarray data, because complete datasets are required in numerous expression profile analyses in bioinformatics. Although several methods have been suggested, their performance is not satisfactory for datasets with high missing percentages. Results The paper explores the feasibility of doing missing value imputation with the help of gene regulatory mechanisms. An imputation framework called the histone acetylation information aided imputation method (HAIimpute method) is presented. It incorporates histone acetylation information into the conventional KNN (k-nearest neighbor) and LLS (local least squares) imputation algorithms for final prediction of the missing values. The experimental results indicated that the use of acetylation information can provide significant improvements in microarray imputation accuracy. The HAIimpute methods consistently improve on widely used methods such as KNN and LLS in terms of normalized root mean squared error (NRMSE). Meanwhile, the genes imputed by HAIimpute methods are more correlated with the original complete genes in terms of Pearson correlation coefficients. Furthermore, the proposed methods also outperform GOimpute, one of the existing related methods that use functional similarity as the external information. Conclusion We demonstrated that the use of histone acetylation information could greatly improve imputation performance, especially at high missing percentages. This idea can be generalized to various imputation methods to improve their performance. Moreover, as more knowledge accumulates on gene regulatory mechanisms beyond histone acetylation, the performance of our approach can be further improved and verified.
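
    The conventional KNN baseline named in this abstract can be sketched as follows. This is a minimal illustration of plain k-nearest-neighbor imputation, not the HAIimpute method itself; the function name `knn_impute`, the use of Euclidean distance over the observed columns, and the assumption that at least one fully observed row exists are all choices made here for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries in each row with the mean of the k nearest complete rows.

    Assumes at least one row has no missing values (hypothetical helper).
    """
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        # Columns where this row has a measured value.
        seen = [j for j, v in enumerate(row) if v is not None]

        def dist(other):
            # Euclidean distance restricted to the observed columns.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in seen))

        nearest = sorted(complete, key=dist)[:k]
        filled.append([
            sum(n[j] for n in nearest) / len(nearest) if v is None else v
            for j, v in enumerate(row)
        ])
    return filled


expression = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
    [1.0, 2.0, None],  # gene with one missing measurement
]
print(knn_impute(expression, k=2)[3])
```

    In this toy matrix the two nearest complete rows are the first two, so the missing entry is filled with the mean of 3.0 and 3.2.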

  12. The utility of imputed matched sets. Analyzing probabilistically linked databases in a low information setting.

    Science.gov (United States)

    Thomas, A M; Cook, L J; Dean, J M; Olson, L M

    2014-01-01

    To compare results from high probability matched sets versus imputed matched sets across differing levels of linkage information. A series of linkages with varying amounts of available information were performed on two simulated datasets derived from multiyear motor vehicle crash (MVC) and hospital databases, where true matches were known. Distributions of high probability and imputed matched sets were compared against the true match population for occupant age, MVC county, and MVC hour. Regression models were fit to simulated log hospital charges and hospitalization status. High probability and imputed matched sets were not significantly different from occupant age, MVC county, and MVC hour in high information settings (p > 0.999). In low information settings, high probability matched sets were significantly different from occupant age and MVC county (p < …), whereas imputed matched sets were not (p > 0.493). High information settings saw no significant differences in inference of simulated log hospital charges and hospitalization status between the two methods. High probability and imputed matched sets were significantly different from the outcomes in low information settings; however, imputed matched sets were more robust. The level of information available to a linkage is an important consideration. High probability matched sets are suitable for high to moderate information settings and for situations involving case-specific analysis. Conversely, imputed matched sets are preferable for low information settings when conducting population-based analyses.

  13. Missing Value Imputation Based on Gaussian Mixture Model for the Internet of Things

    Directory of Open Access Journals (Sweden)

    Xiaobo Yan

    2015-01-01

    Full Text Available This paper addresses missing value imputation for the Internet of Things (IoT). The IoT is now widely used in a variety of domains, such as transportation, logistics and healthcare. However, missing values are very common in the IoT for a variety of reasons, so the experimental data are often incomplete. As a result, some work that relies on IoT data cannot be carried out normally, and the accuracy and reliability of data analysis results are reduced. Based on the characteristics of the data and the features of missing data in the IoT, this paper divides missing data into three types and defines three corresponding missing value imputation problems. We then propose three new models to solve the corresponding problems: a model of missing value imputation based on context and linear mean (MCL), a model of missing value imputation based on binary search (MBS), and a model of missing value imputation based on a Gaussian mixture model (MGI). Experimental results showed that the three models can improve the accuracy, reliability, and stability of missing value imputation greatly and effectively.

  14. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data.

    Science.gov (United States)

    Luo, Yuan; Szolovits, Peter; Dighe, Anand S; Baron, Jason M

    2018-06-01

    A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data. We extracted clinical laboratory test results for 13 commonly measured analytes (clinical laboratory tests). We imputed missing test results for the 13 analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP), and 3D-MICE. 3D-MICE utilizes both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate imputation method performance, we randomly masked selected test results and imputed these masked results alongside results missing from our original data. We compared predicted results to measured results for masked data points. 3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone. 3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support.
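
    The masking-based evaluation described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the abstract does not say how the root-mean-square error was normalized, so dividing by the standard deviation of the measured values is an assumption, and `mask_values`, `mean_impute` and `nrmse` are hypothetical helper names. Mean imputation stands in for MICE, GP or 3D-MICE only to keep the example short.

```python
import math
import random


def mask_values(values, fraction, seed=0):
    """Hide a random fraction of the observed entries; return the masked copy and the hidden indices."""
    rng = random.Random(seed)
    observed = [i for i, v in enumerate(values) if v is not None]
    k = max(1, int(len(observed) * fraction))
    hidden = sorted(rng.sample(observed, k))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def mean_impute(values):
    """Placeholder imputer: replace every missing entry with the observed mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def nrmse(measured, predicted):
    """RMSE divided by the standard deviation of the measured values (assumed normalization)."""
    n = len(measured)
    mse = sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n
    mean = sum(measured) / n
    variance = sum((m - mean) ** 2 for m in measured) / n
    return math.sqrt(mse / variance)


results = [4.1, 4.3, None, 4.0, 4.6, 4.4, None, 4.2, 4.5, 4.7]
masked, hidden = mask_values(results, fraction=0.25)
imputed = mean_impute(masked)
score = nrmse([results[i] for i in hidden], [imputed[i] for i in hidden])
print(f"NRMSE on masked points: {score:.3f}")
```

    Scoring only the deliberately masked points matters because they are the only missing entries for which a measured value is available for comparison.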

  15. In Vitro vs In Silico Detected SNPs for the Development of a Genotyping Array: What Can We Learn from a Non-Model Species?

    Science.gov (United States)

    Lepoittevin, Camille; Frigerio, Jean-Marc; Garnier-Géré, Pauline; Salin, Franck; Cervera, María-Teresa; Vornam, Barbara; Harvengt, Luc; Plomion, Christophe

    2010-01-01

    Background There is considerable interest in the high-throughput discovery and genotyping of single nucleotide polymorphisms (SNPs) to accelerate genetic mapping and enable association studies. This study provides an assessment of EST-derived and resequencing-derived SNP quality in maritime pine (Pinus pinaster Ait.), a conifer characterized by a huge genome size (∼23.8 Gb/C). Methodology/Principal Findings A 384-SNPs GoldenGate genotyping array was built from i/ 184 SNPs originally detected in a set of 40 re-sequenced candidate genes (in vitro SNPs), chosen on the basis of functionality scores, presence of neighboring polymorphisms, minor allele frequencies and linkage disequilibrium and ii/ 200 SNPs screened from ESTs (in silico SNPs) selected based on the number of ESTs used for SNP detection, the SNP minor allele frequency and the quality of SNP flanking sequences. The global success rate of the assay was 66.9%, and a conversion rate (considering only polymorphic SNPs) of 51% was achieved. In vitro SNPs showed significantly higher genotyping-success and conversion rates than in silico SNPs (+11.5% and +18.5%, respectively). The reproducibility was 100%, and the genotyping error rate very low (0.54%, dropping down to 0.06% when removing four SNPs showing elevated error rates). Conclusions/Significance This study demonstrates that ESTs provide a resource for SNP identification in non-model species, which do not require any additional bench work and little bio-informatics analysis. However, the time and cost benefits of in silico SNPs are counterbalanced by a lower conversion rate than in vitro SNPs. This drawback is acceptable for population-based experiments, but could be dramatic in experiments involving samples from narrow genetic backgrounds. In addition, we showed that both the visual inspection of genotyping clusters and the estimation of a per SNP error rate should help identify markers that are not suitable to the GoldenGate technology in species

  16. In vitro vs in silico detected SNPs for the development of a genotyping array: what can we learn from a non-model species?

    Directory of Open Access Journals (Sweden)

    Camille Lepoittevin

    2010-06-01

    Full Text Available There is considerable interest in the high-throughput discovery and genotyping of single nucleotide polymorphisms (SNPs) to accelerate genetic mapping and enable association studies. This study provides an assessment of EST-derived and resequencing-derived SNP quality in maritime pine (Pinus pinaster Ait.), a conifer characterized by a huge genome size (approximately 23.8 Gb/C). A 384-SNPs GoldenGate genotyping array was built from i/ 184 SNPs originally detected in a set of 40 re-sequenced candidate genes (in vitro SNPs), chosen on the basis of functionality scores, presence of neighboring polymorphisms, minor allele frequencies and linkage disequilibrium and ii/ 200 SNPs screened from ESTs (in silico SNPs) selected based on the number of ESTs used for SNP detection, the SNP minor allele frequency and the quality of SNP flanking sequences. The global success rate of the assay was 66.9%, and a conversion rate (considering only polymorphic SNPs) of 51% was achieved. In vitro SNPs showed significantly higher genotyping-success and conversion rates than in silico SNPs (+11.5% and +18.5%, respectively). The reproducibility was 100%, and the genotyping error rate very low (0.54%, dropping down to 0.06% when removing four SNPs showing elevated error rates). This study demonstrates that ESTs provide a resource for SNP identification in non-model species, which do not require any additional bench work and little bio-informatics analysis. However, the time and cost benefits of in silico SNPs are counterbalanced by a lower conversion rate than in vitro SNPs. This drawback is acceptable for population-based experiments, but could be dramatic in experiments involving samples from narrow genetic backgrounds. In addition, we showed that both the visual inspection of genotyping clusters and the estimation of a per SNP error rate should help identify markers that are not suitable to the GoldenGate technology in species characterized by a large and complex genome.

  17. SNPs in genes implicated in radiation response are associated with radiotoxicity and evoke roles as predictive and prognostic biomarkers

    International Nuclear Information System (INIS)

    Alsbeih, Ghazi; El-Sebaie, Medhat; Al-Harbi, Najla; Al-Hadyan, Khaled; Shoukri, Mohamed; Al-Rajhi, Nasser

    2013-01-01

    Biomarkers are needed to individualize cancer radiation treatment. Therefore, we have investigated the association between various risk factors, including single nucleotide polymorphisms (SNPs) in candidate genes and late complications to radiotherapy in our nasopharyngeal cancer patients. A cohort of 155 patients was included. Normal tissue fibrosis was scored using RTOG/EORTC grading system. A total of 45 SNPs in 11 candidate genes (ATM, XRCC1, XRCC3, XRCC4, XRCC5, PRKDC, LIG4, TP53, HDM2, CDKN1A, TGFB1) were genotyped by direct genomic DNA sequencing. Patients with severe fibrosis (cases, G3-4, n = 48) were compared to controls (G0-2, n = 107). Univariate analysis showed significant association (P < 0.05) with radiation complications for 6 SNPs (ATM G/A rs1801516, HDM2 promoter T/G rs2279744 and T/A rs1196333, XRCC1 G/A rs25487, XRCC5 T/C rs1051677 and TGFB1 C/T rs1800469). In addition, Kaplan-Meier analyses have also highlighted significant association between genotypes and length of patients’ follow-up after radiotherapy. Multivariate logistic regression has further sustained these results suggesting predictive and prognostic roles of SNPs. Univariate and multivariate analyses suggest that radiation toxicity in radiotherapy patients is associated with certain SNPs, in genes including HDM2 promoter studied for the 1st time. These results support the use of SNPs as genetic predictive markers for clinical radiosensitivity and evoke a prognostic role for length of patients’ follow-up after radiotherapy.

  18. 118 SNPs of folate-related genes and risks of spina bifida and conotruncal heart defects

    Directory of Open Access Journals (Sweden)

    Shaw Gary M

    2009-06-01

    Full Text Available Abstract Background Folic acid taken in early pregnancy reduces risks for delivering offspring with several congenital anomalies. The mechanism by which folic acid reduces risk is unknown. Investigations into genetic variation that influences transport and metabolism of folate will help fill this data gap. We focused on 118 SNPs involved in folate transport and metabolism. Methods Using data from a California population-based registry, we investigated whether risks of spina bifida or conotruncal heart defects were influenced by 118 single nucleotide polymorphisms (SNPs) associated with the complex folate pathway. This case-control study included 259 infants with spina bifida and a random sample of 359 nonmalformed control infants born during 1983–86 or 1994–95. It also included 214 infants with conotruncal heart defects born during 1983–86. Infant genotyping was performed blinded to case or control status using a designed SNPlex assay. We examined single SNP effects for each of the 118 SNPs, as well as haplotypes, for each of the two outcomes. Results Few odds ratios (ORs) revealed sizable departures from 1.0. With respect to spina bifida, we observed ORs with 95% confidence intervals that did not include 1.0 for the following SNPs (heterozygous or homozygous relative to the reference genotype): BHMT (rs3733890) OR = 1.8 (1.1–3.1); CBS (rs2851391) OR = 2.0 (1.2–3.1); CBS (rs234713) OR = 2.9 (1.3–6.7); MTHFD1 (rs2236224) OR = 1.7 (1.1–2.7); MTHFD1 (hcv11462908) OR = 0.2 (0–0.9); MTHFD2 (rs702465) OR = 0.6 (0.4–0.9); MTHFD2 (rs7571842) OR = 0.6 (0.4–0.9); MTHFR (rs1801133) OR = 2.0 (1.2–3.1); MTRR (rs162036) OR = 3.0 (1.5–5.9); MTRR (rs10380) OR = 3.4 (1.6–7.1); MTRR (rs1801394) OR = 0.7 (0.5–0.9); MTRR (rs9332) OR = 2.7 (1.3–5.3); TYMS (rs2847149) OR = 2.2 (1.4–3.5); TYMS (rs1001761) OR = 2.4 (1.5–3.8); and TYMS (rs502396) OR = 2.1 (1.3–3.3). However, multiple SNPs observed for a given gene showed evidence of linkage disequilibrium indicating
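
    Odds ratios with 95% confidence intervals of the kind listed in this abstract can be computed from a 2×2 genotype-by-status table. The sketch below uses the standard log-odds (Woolf) interval; the abstract does not state which interval method the authors used, so that choice is an assumption, and the counts are hypothetical, not taken from the study.

```python
import math


def odds_ratio_ci(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Odds ratio and Woolf (log-scale) confidence interval for a 2x2 table.

    All four cell counts must be non-zero.
    """
    odds_ratio = (case_exposed * ctrl_unexposed) / (case_unexposed * ctrl_exposed)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se = math.sqrt(1 / case_exposed + 1 / case_unexposed
                   + 1 / ctrl_exposed + 1 / ctrl_unexposed)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts: variant carriers vs reference genotype, cases vs controls.
or_, lo, hi = odds_ratio_ci(20, 10, 10, 20)
print(f"OR = {or_:.1f} ({lo:.1f}-{hi:.1f})")
# An interval that excludes 1.0 corresponds to the reporting criterion used in the abstract.
print("excludes 1.0:", lo > 1.0 or hi < 1.0)
```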

  19. Genotyping of Brucella species using clade specific SNPs

    Directory of Open Access Journals (Sweden)

    Foster Jeffrey T

    2012-06-01

    Full Text Available Abstract Background Brucellosis is a worldwide disease of mammals caused by Alphaproteobacteria in the genus Brucella. The genus is genetically monomorphic, requiring extensive genotyping to differentiate isolates. We utilized two different genotyping strategies to characterize isolates. First, we developed a microarray-based assay based on 1000 single nucleotide polymorphisms (SNPs) that were identified from whole genome comparisons of two B. abortus isolates, one B. melitensis, and one B. suis. We then genotyped a diverse collection of 85 Brucella strains at these SNP loci and generated a phylogenetic tree of relationships. Second, we developed a selective primer-extension assay system using capillary electrophoresis that targeted 17 high value SNPs across 8 major branches of the phylogeny and determined their genotypes in a large collection (n = 340) of diverse isolates. Results Our 1000 SNP microarray readily distinguished B. abortus, B. melitensis, and B. suis, differentiating B. melitensis and B. suis into two clades each. Brucella abortus was divided into four major clades. Our capillary-based SNP genotyping confirmed all major branches from the microarray assay and assigned all samples to defined lineages. Isolates from these lineages and closely related isolates, among the most commonly encountered lineages worldwide, can now be quickly and easily identified and genetically characterized. Conclusions We have identified clade-specific SNPs in Brucella that can be used for rapid assignment into major groups below the species level in the three main Brucella species. Our assays represent SNP genotyping approaches that can reliably determine the evolutionary relationships of bacterial isolates without the need for whole genome sequencing of all isolates.

  20. Typing of Y chromosome SNPs with multiplex PCR methods

    DEFF Research Database (Denmark)

    Sanchez Sanchez, Juan Jose; Børsting, Claus; Morling, Niels

    2005-01-01

    We describe a method for the simultaneous typing of Y-chromosome single nucleotide polymorphism (SNP) markers by means of multiplex polymerase chain reaction (PCR) strategies that allow the detection of 35 Y chromosome SNPs on 25 amplicons from 100 to 200 pg of chromosomal deoxyribonucleic acid...... factors for the creation of larger SNP typing PCR multiplexes include careful selection of primers for the primary amplification and the SBE reaction, use of DNA primers with homogenous composition, and balancing the primer concentrations for both the amplification and the SBE reactions....

  1. Multi-generational imputation of single nucleotide polymorphism marker genotypes and accuracy of genomic selection.

    Science.gov (United States)

    Toghiani, S; Aggrey, S E; Rekaya, R

    2016-07-01

    Availability of high-density single nucleotide polymorphism (SNP) genotyping platforms provided unprecedented opportunities to enhance breeding programmes in livestock, poultry and plant species, and to better understand the genetic basis of complex traits. Using this genomic information, genomic breeding values (GEBVs) can be estimated that are more accurate than conventional breeding values. The superiority of genomic selection is possible only when high-density SNP panels are used to track genes and QTLs affecting the trait. Unfortunately, even with the continuous decrease in genotyping costs, only a small fraction of the population has been genotyped with these high-density panels. It is often the case that a larger portion of the population is genotyped with low-density and low-cost SNP panels and then imputed to a higher density. Accuracy of SNP genotype imputation tends to be high when minimum requirements are met. Nevertheless, a certain rate of genotype imputation errors is unavoidable. Thus, it is reasonable to assume that the accuracy of GEBVs will be affected by imputation errors, especially by their cumulative effects over time. To evaluate the impact of multi-generational selection on the accuracy of SNP genotype imputation and the reliability of resulting GEBVs, a simulation was carried out under varying updating of the reference population, distance between the reference and testing sets, and the approach used for the estimation of GEBVs. Using fixed reference populations, imputation accuracy decayed by about 0.5% per generation. In fact, after 25 generations, the accuracy was only 7% lower than the first generation. When the reference population was updated by either 1% or 5% of the top animals in the previous generations, decay of imputation accuracy was substantially reduced. These results indicate that low-density panels are useful, especially when the generational interval between reference and testing population is small. As the generational interval

  2. Exceptional longevity and muscle and fitness related genotypes: a functional in vitro analysis and case-control association replication study with SNPs THRH rs7832552, IL6 rs1800795 and ACSL1 rs6552828

    Directory of Open Access Journals (Sweden)

    Noriyuki eFuku

    2015-05-01

    Full Text Available There are several gene variants that are candidates to influence functional capacity in long-lived individuals. As such, their potential association with exceptional longevity (EL, i.e., reaching 100+ years) deserves analysis. Among them are rs7832552 in the thyrotropin-releasing hormone receptor (TRHR) gene, rs1800795 in the interleukin-6 (IL6) gene and rs6552828 in the coenzyme A synthetase long-chain 1 (ACSL1) gene. To gain insight into their functionality (which is yet unknown), here we determined for the first time luciferase gene reporter activity at the muscle tissue level in rs7832552 and rs6552828. We then compared allele/genotype frequencies of the 3 abovementioned variants among centenarians [n=138, age range 100-111 years (114 women)] and healthy controls [n=334, 20-50 years (141 women)] of the same ethnic and geographic origin (Spain). We also studied healthy centenarians [n=79, 100-104 years (40 women)] and controls [n=316, 27-81 years (156 women)] from Italy, and centenarians [n=742, 100-116 years (623 women)] and healthy controls [n=499, 23-59 years (356 women)] from Japan. The TRHR rs7832552 T-allele and ACSL1 rs6552828 A-allele up-regulated luciferase activity compared to the C and G-allele, respectively (P≤0.001). Yet we found no significant association of EL with rs7832552, rs1800795 or rs6552828 in any of the 3 cohorts. Further research is needed with larger cohorts of centenarians of different origin as well as with younger old people.

  3. A multianalytical approach to evaluate the association of 55 SNPs in 28 genes with obesity risk in North Indian adults.

    Science.gov (United States)

    Srivastava, Apurva; Mittal, Balraj; Prakash, Jai; Srivastava, Pranjal; Srivastava, Nimisha; Srivastava, Neena

    2017-03-01

    The aim of the study was to investigate the association of 55 SNPs in 28 genes with obesity risk in a North Indian population using a multianalytical approach. Overall, 480 subjects from the North Indian population were studied using strict inclusion/exclusion criteria. SNP genotyping was carried out on the Sequenom MassARRAY platform (Sequenom, San Diego, CA) and validated by TaqMan® allelic discrimination (Applied Biosystems®). Statistical analyses were performed using SPSS software version 19.0, SNPStats, GMDR software (version 6) and GENEMANIA. Logistic regression analysis of 55 SNPs revealed significant associations (P < …) with obesity risk, whereas the remaining 6 SNPs revealed no association (P > .05). The pathway-wise G-score revealed the significant role (P = .0001) of food intake-energy expenditure pathway genes. In CART analysis, the combined genotypes of FTO rs9939609 and TCF7L2 rs7903146 revealed the highest risk for BMI linked obesity. The analysis of the FTO-IRX3 locus revealed high LD and high order gene-gene interactions for BMI linked obesity. The interaction network of all of the associated genes in the present study generated by GENEMANIA revealed direct and indirect connections. In addition, the analysis with centralized obesity revealed that none of the SNPs except for FTO rs17818902 were significantly associated (P < …) … obesity risk in the North Indian population. © 2016 Wiley Periodicals, Inc.

  4. Design of a High Density SNP Genotyping Assay in the Pig Using SNPs Identified and Characterized by Next Generation Sequencing Technology

    Science.gov (United States)

    Ramos, Antonio M.; Crooijmans, Richard P. M. A.; Affara, Nabeel A.; Amaral, Andreia J.; Archibald, Alan L.; Beever, Jonathan E.; Bendixen, Christian; Churcher, Carol; Clark, Richard; Dehais, Patrick; Hansen, Mark S.; Hedegaard, Jakob; Hu, Zhi-Liang; Kerstens, Hindrik H.; Law, Andy S.; Megens, Hendrik-Jan; Milan, Denis; Nonneman, Danny J.; Rohrer, Gary A.; Rothschild, Max F.; Smith, Tim P. L.; Schnabel, Robert D.; Van Tassell, Curt P.; Taylor, Jeremy F.; Wiedmann, Ralph T.; Schook, Lawrence B.; Groenen, Martien A. M.

    2009-01-01

    Background The dissection of complex traits of economic importance to the pig industry requires the availability of a significant number of genetic markers, such as single nucleotide polymorphisms (SNPs). This study was conducted to discover several hundreds of thousands of porcine SNPs using next generation sequencing technologies and use these SNPs, as well as others from different public sources, to design a high-density SNP genotyping assay. Methodology/Principal Findings A total of 19 reduced representation libraries derived from four swine breeds (Duroc, Landrace, Large White, Pietrain) and a Wild Boar population and three restriction enzymes (AluI, HaeIII and MspI) were sequenced using Illumina's Genome Analyzer (GA). The SNP discovery effort resulted in the de novo identification of over 372K SNPs. More than 549K SNPs were used to design the Illumina Porcine 60K+SNP iSelect Beadchip, now commercially available as the PorcineSNP60. A total of 64,232 SNPs were included on the Beadchip. Results from genotyping the 158 individuals used for sequencing showed a high overall SNP call rate (97.5%). Of the 62,621 loci that could be reliably scored, 58,994 were polymorphic yielding a SNP conversion success rate of 94%. The average minor allele frequency (MAF) for all scorable SNPs was 0.274. Conclusions/Significance Overall, the results of this study indicate the utility of using next generation sequencing technologies to identify large numbers of reliable SNPs. In addition, the validation of the PorcineSNP60 Beadchip demonstrated that the assay is an excellent tool that will likely be used in a variety of future studies in pigs. PMID:19654876
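
    The SNP conversion success rate quoted in this abstract follows directly from the two locus counts it reports. A minimal check, where the function name `conversion_rate` is introduced here only for illustration:

```python
def conversion_rate(polymorphic, scorable):
    """Fraction of reliably scored loci that turned out to be polymorphic."""
    return polymorphic / scorable


# Counts taken from the abstract: 58,994 polymorphic loci out of 62,621 scorable loci.
rate = conversion_rate(58_994, 62_621)
print(f"{rate:.1%}")  # → 94.2%, consistent with the reported 94%
```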

  5. Fiscal 1999 research achievement report on the development of SNPs related technologies; 1999 nendo SNPs kanren gijutsu kaihatsu seika hokokusho

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2001-03-01

    Efforts are made to develop specimen processing technologies for modifying and enabling various kinds of specimens to automatically undergo SNP (single nucleotide polymorphism) analysis for medicine development and clinical diagnostic activities and to develop technologies and apparatuses to enable rapid, inexpensive, and simple search and analysis of SNPs using DNA (deoxyribonucleic acid) chips and mass spectrometry. Activities are conducted in the four fields involving (1) the development of a practical clinical system for rapid detection and analysis of SNPs, (2) research and development of an SNP scoring system using bar-coded oligonucleotides and magnetic beads, (3) research and development of a high-speed SNP analysis system using a mass spectrometer, and (4) the development of a high throughput SNP analysis line. Efforts exerted in field (1) involve a protein fixation method using plasma polymerization and its application to DNA arrays, development of an SNP detection method using human genomes, construction of a rapid DNA detection device using an electric field, development of an SNP analysis system using the solid phase HPA (hybridization protection assay) method, and SNP analysis using solid phase ligation. (NEDO)

  6. Highlights from the 15th International Congress of Twin Studies/Twin Research: Differentiating MZ Co-twins Via SNPs; Mistaken Infant Twin-Singleton Hospital Registration; Narcolepsy With Cataplexy; Hearing Loss and Language Learning/Media Mentions: Broadway Musical Recalls Conjoined Hilton Twins; High Fashion Pair; Twins Turn 102; Insights From a Conjoined Twin Survivor.

    Science.gov (United States)

    Segal, Nancy L

    2015-02-01

    Highlights from the 15th International Congress of Twin Studies are presented. The congress was held November 16-19, 2014 in Budapest, Hungary. This report is followed by summaries of research addressing the differentiation of MZ co-twins by single nucleotide polymorphisms (SNPs), an unusual error in infant twin-singleton hospital registration, twins with childhood-onset narcolepsy with cataplexy, and the parenting effects of hearing loss in one co-twin. Media interest in twins covers a new Broadway musical based on the conjoined twins Violet and Daisy Hilton, male twins becoming famous in fashion, twins who turned 102 and unique insights from a conjoined twin survivor. This article is dedicated to the memory of Elizabeth (Liz) Hamel, DZA twin who met her co-twin for the first time at age seventy-eight years. Liz and her co-twin, Ann Hunt, are listed in the 2015 Guinness Book of Records as the longest separated twins in the world.

  7. Rapid multiplex high resolution melting method to analyze inflammatory related SNPs in preterm birth

    Directory of Open Access Journals (Sweden)

    Pereyra Silvana

    2012-01-01

    Full Text Available Abstract Background Complex traits like cancer, diabetes, obesity or schizophrenia arise from an intricate interaction between genetic and environmental factors. Complex disorders often cluster in families without a clear-cut pattern of inheritance. Genome-wide association studies focus on the detection of tens or hundreds of individual markers contributing to complex diseases. In order to test if a subset of single nucleotide polymorphisms (SNPs) from candidate genes is associated with a condition of interest in a particular individual or group of people, new techniques are needed. High-resolution melting (HRM) analysis is a new method in which polymerase chain reaction (PCR) and mutation scanning are carried out simultaneously in a closed tube, making the procedure fast, inexpensive and easy. Preterm birth (PTB) is considered a complex disease, where genetic and environmental factors interact to bring about the delivery of a newborn before 37 weeks of gestation. It is accepted that inflammation plays an important role in pregnancy and PTB. Methods Here, we used real time-PCR followed by HRM analysis to simultaneously identify several gene variations involved in inflammatory pathways in preterm labor. SNPs from TLR4, IL6, IL1 beta and IL12RB genes were analyzed in a case-control study. The results were confirmed either by sequencing or by PCR followed by restriction fragment length polymorphism. Results We were able to simultaneously recognize the variations of four genes with accuracy similar to that of other methods. In order to obtain non-overlapping melting temperatures, the key step in this strategy was primer design. Genotypic frequencies found for each SNP are in concordance with those previously described in similar populations. None of the studied SNPs were associated with PTB. Conclusions Several gene variations related to the same inflammatory pathway were screened through a new flexible, fast and inexpensive method with the purpose of analyzing

  8. A combined genotype of three SNPs in the bovine PPARD gene is related to growth performance in Chinese cattle

    Directory of Open Access Journals (Sweden)

    J. Huang

    2017-10-01

    Full Text Available PPARD is involved in multiple biological processes, especially those associated with energy metabolism. PPARD regulates lipid metabolism by up-regulating the expression of genes associated with adipogenesis. This makes PPARD a significant candidate gene for production traits of livestock animals. Association studies between PPARD polymorphisms and production traits have been reported in pigs but are limited for other animals, including cattle. Here, we investigated the expression profile and polymorphism of bovine PPARD as well as their association with growth traits in Chinese cattle. Our results showed that the highest expression of PPARD was detected in kidney, followed by adipose, which is consistent with its involvement in energy metabolism. Three SNPs of PPARD were detected and, according to the result of Hardy–Weinberg equilibrium analysis (P < 0.05), appear to have undergone selection pressure. Moreover, all of these SNPs showed moderate diversity (0.25 < PIC < 0.5), indicating their relatively high selection potential. Association analysis suggested that individuals with the GAAGTT combined genotype of the three SNPs detected showed optimal values in all of the growth traits analyzed. These results revealed that the GAAGTT combined genotype of the three SNPs detected in the bovine PPARD gene was a significant potential genetic marker for marker-assisted selection in Chinese cattle. However, this should be further verified in larger populations before being applied to breeding.
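
    The two summary statistics quoted in this abstract (polymorphism information content and a Hardy–Weinberg equilibrium test) can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, the PIC formula is the standard Botstein form, the HWE test is the usual 1-d.f. chi-square for a biallelic SNP, and the inputs in the usage note are made-up values.

```python
import math


def pic(allele_freqs):
    """Polymorphism information content: 1 - sum(p_i^2) - sum_{i<j} 2 p_i^2 p_j^2."""
    homozygosity = sum(p * p for p in allele_freqs)
    cross = 0.0
    n = len(allele_freqs)
    for i in range(n - 1):
        for j in range(i + 1, n):
            cross += 2.0 * allele_freqs[i] ** 2 * allele_freqs[j] ** 2
    return 1.0 - homozygosity - cross


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for departure from HWE at a biallelic SNP.

    Assumes both alleles are present (otherwise an expected count is zero).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))
```

    For example, `pic([0.5, 0.5])` gives 0.375, which falls in the "moderate diversity" band (0.25 < PIC < 0.5) mentioned above, and genotype counts in exact Hardy–Weinberg proportions such as `hwe_chi_square(25, 50, 25)` give a statistic of 0.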

  9. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints.

    Science.gov (United States)

    Schwessinger, Ron; Suciu, Maria C; McGowan, Simon J; Telenius, Jelena; Taylor, Stephen; Higgs, Doug R; Hughes, Jim R

    2017-10-01

    In the era of genome-wide association studies (GWAS) and personalized medicine, predicting the impact of single nucleotide polymorphisms (SNPs) in regulatory elements is an important goal. Current approaches to determine the potential of regulatory SNPs depend on inadequate knowledge of cell-specific DNA binding motifs. Here, we present Sasquatch, a new computational approach that uses DNase footprint data to estimate and visualize the effects of noncoding variants on transcription factor binding. Sasquatch performs a comprehensive k-mer-based analysis of DNase footprints to determine any k-mer's potential for protein binding in a specific cell type and how this may be changed by sequence variants. Therefore, Sasquatch uses an unbiased approach, independent of known transcription factor binding sites and motifs. Sasquatch only requires a single DNase-seq data set per cell type, from any genotype, and produces consistent predictions from data generated by different experimental procedures and at different sequence depths. Here we demonstrate the effectiveness of Sasquatch using previously validated functional SNPs and benchmark its performance against existing approaches. Sasquatch is available as a versatile webtool incorporating publicly available data, including the human ENCODE collection. Thus, Sasquatch provides a powerful tool and repository for prioritizing likely regulatory SNPs in the noncoding genome. © 2017 Schwessinger et al.; Published by Cold Spring Harbor Laboratory Press.

  10. A computational prospect to aspirin side effects: aspirin and COX-1 interaction analysis based on non-synonymous SNPs.

    Science.gov (United States)

    Marjan, Mojtabavi Naeini; Hamzeh, Mesrian Tanha; Rahman, Emamzadeh; Sadeq, Vallian

    2014-08-01

    Aspirin (ASA) is a commonly used nonsteroidal anti-inflammatory drug (NSAID), which exerts its therapeutic effects through inhibition of cyclooxygenase (COX) isoform 2 (COX-2), while the inhibition of COX-1 by ASA leads to apparent side effects. In the present study, the relationship between COX-1 non-synonymous single nucleotide polymorphisms (nsSNPs) and aspirin-related side effects was investigated. The functional impacts of 37 nsSNPs on the aspirin inhibition potency of COX-1 were computationally analyzed with COX-1/aspirin molecular docking, and each SNP was scored based on the DOCK Amber score. The data predicted that 22 nsSNPs could reduce COX-1 inhibition, while 15 nsSNPs showed an increased inhibition level in comparison to the regular COX-1 protein. For comparison, the Amber scores for two Arg119 mutants (R119A and R119Q) were also calculated. Moreover, among the nsSNP variants, rs117122585 had the Amber score closest to that of the R119A mutant. A separate docking computation validated the score and indicated a new binding position for ASA in which the acetyl group was located 3.86 Å from the Ser529 OH group. This could predict an associated loss of activity of ASA through this nsSNP variant. Our data represent a computational sub-population pattern for aspirin COX-1 related side effects, and provide a basis for further research on COX-1/ASA interaction. Copyright © 2014 Elsevier Ltd. All rights reserved.

  11. An efficient method to transcription factor binding sites imputation via simultaneous completion of multiple matrices with positional consistency.

    Science.gov (United States)

    Guo, Wei-Li; Huang, De-Shuang

    2017-08-22

    Transcription factors (TFs) are DNA-binding proteins that have a central role in regulating gene expression. Identification of DNA-binding sites of TFs is a key task in understanding transcriptional regulation, cellular processes and disease. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) enables genome-wide identification of in vivo TF binding sites. However, it is still difficult to map every TF in every cell line owing to cost and biological material availability, which poses an enormous obstacle for integrated analysis of gene regulation. To address this problem, we propose a novel computational approach, TFBSImpute, for predicting additional TF binding profiles by leveraging information from available ChIP-seq TF binding data. TFBSImpute fuses the dataset into a 3-mode tensor and imputes missing TF binding signals via simultaneous completion of multiple TF binding matrices with positional consistency. By assessing the performance of imputation methods against observed ChIP-seq TF binding profiles, we show that signals predicted by our method achieve overall similarity with experimental data and that TFBSImpute significantly outperforms baseline approaches. In addition, motif analysis shows that TFBSImpute performs better than the baselines in capturing binding motifs enriched in observed data, indicating that the higher performance of TFBSImpute is not simply due to averaging related samples. We anticipate that our approach will constitute a useful complement to experimental mapping of TF binding, which is beneficial for further study of regulation mechanisms and disease.

  12. Utility of X-chromosome SNPs in relationship testing

    DEFF Research Database (Denmark)

    Tomas, Carmen; Sanchez, Juan Jose; Castro, J.A.

    2008-01-01

    of the SBE primers varied between 18 and 85 nucleotides. We analyzed the allele and haplotype frequencies in 1078 unrelated males. All the SNPs were polymorphic and the lowest minor allele frequency was 0.103. All the haplotypes were unique. The forensic parameters were calculated in the Danish and Somali...... populations. In the Danish population (N = 93), the power of discrimination (PD) in females was one in 4.4 × 10^9 individuals and the PD in males was one in 2.6 × 10^6. The PD in Somalis (N = 91) was one in 2.7 × 10^9 in females and one in 1.7 × 10^6 in males. Finally, we present an example of how the 25 X...

  13. Novel SNPs of WNK1 and AKR1C3 are associated with preeclampsia.

    Science.gov (United States)

    Sun, Cheng-Juan; Li, Lin; Li, Xueyan; Zhang, Wei-Yuan; Liu, Xiao-Wei

    2018-08-20

    Preeclampsia is a hypertensive disorder of pregnancy and is one of the most common causes of poor perinatal outcomes. Preeclampsia increases the risk of hypertension in the future. Variants of WNK1 (lysine deficient protein kinase 1), ADRB2 (β2 adrenergic receptor), NEDD4L (ubiquitin-protein ligase NEDD4-like) and KLK1 (kallikrein 1) contribute to hypertension, and AKR1C3 (aldo-keto reductase family 1 member C3) is associated with preeclampsia. The association between single nucleotide polymorphisms (SNPs) in these five candidate preeclampsia susceptibility genes and the related traits in Chinese individuals was investigated. In this study, 13 SNPs of the five genes were genotyped in 276 preeclampsia patients and 229 age- and area-matched normal pregnancies in women of Chinese Northern Han origin. The 95% confidence interval (CI) and odds ratio (OR) were estimated by binary logistic regression. No obvious linkage disequilibrium or haplotypes were observed among these SNPs. Those with GG genotype and allele G of AKR1C3 (rs10508293) had a decreased risk of preeclampsia (adjusted OR = 3.011, 95% CI = 1.758-5.159, and adjusted OR = 1.745, 95% CI = 1.349-2.257, respectively). The AA genotype and allele A of WNK1 (rs1468326) were significantly associated with an increased risk of preeclampsia (adjusted OR = 2.307, 95% CI = 1.206-3.443, and adjusted OR = 1.663, 95% CI = 1.283-2.157, respectively). The findings indicate that the GG genotype of AKR1C3 rs10508293 is associated with a decreased risk of preeclampsia and the AA genotype of WNK1 rs1468326 is associated with an increased risk of preeclampsia. Copyright © 2018 Elsevier B.V. All rights reserved.
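
    As a reading aid for the OR and 95% CI figures above, the following sketch computes an unadjusted odds ratio with a Wald confidence interval from a 2x2 table. It is a simplification, not the study's analysis: the abstract reports adjusted ORs from binary logistic regression, the function name is hypothetical, and the counts in the usage note are invented for illustration.

```python
import math


def odds_ratio_wald_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio and Wald CI; z = 1.96 gives a 95% interval.

    Assumes all four cell counts are positive.
    """
    a, b, c, d = exposed_cases, unexposed_cases, exposed_controls, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

    With the invented counts `odds_ratio_wald_ci(20, 10, 10, 20)` the odds ratio is 4.0 and the interval lies entirely above 1. An OR above 1 with a CI that excludes 1 conventionally indicates increased odds in the exposed group.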

  14. Nearest neighbor imputation using spatial-temporal correlations in wireless sensor networks.

    Science.gov (United States)

    Li, YuanYuan; Parker, Lynne E

    2014-01-01

    Missing data is common in Wireless Sensor Networks (WSNs), especially with multi-hop communications. There are many reasons for this phenomenon, such as unstable wireless communications, synchronization issues, and unreliable sensors. Unfortunately, missing data creates a number of problems for WSNs. First, since most sensor nodes in the network are battery-powered, it is too expensive to have the nodes retransmit missing data across the network. Data re-transmission may also cause time delays when detecting abnormal changes in an environment. Furthermore, localized reasoning techniques on sensor nodes (such as machine learning algorithms to classify states of the environment) are generally not robust enough to handle missing data. Since sensor data collected by a WSN is generally correlated in time and space, we illustrate how replacing missing sensor values with spatially and temporally correlated sensor values can significantly improve the network's performance. However, our studies show that it is important to determine which nodes are spatially and temporally correlated with each other. Simple techniques based on Euclidean distance are not sufficient for complex environmental deployments. Thus, we have developed a novel Nearest Neighbor (NN) imputation method that estimates missing data in WSNs by learning spatial and temporal correlations between sensor nodes. To improve the search time, we utilize a k-d tree data structure, which is a non-parametric, data-driven binary search tree. Instead of using traditional mean and variance of each dimension for k-d tree construction, and Euclidean distance for k-d tree search, we use weighted variances and weighted Euclidean distances based on measured percentages of missing data. We have evaluated this approach through experiments on sensor data from a volcano dataset collected by a network of Crossbow motes, as well as experiments using sensor data from a highway traffic monitoring application. Our experimental
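
    The idea of nearest-neighbour imputation with a distance weighted by how often each dimension is missing can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: it uses a brute-force search instead of a k-d tree, the weighting scheme (weight = fraction of rows in which the dimension was observed) is an assumption, missing readings are encoded as `None`, and all function names are hypothetical.

```python
import math


def observed_fraction_weights(rows):
    """Weight each dimension by the fraction of rows in which it was observed."""
    dims = len(rows[0])
    return [sum(r[d] is not None for r in rows) / len(rows) for d in range(dims)]


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance over the dimensions observed in both rows."""
    total, used = 0.0, 0
    for x, y, w in zip(a, b, weights):
        if x is not None and y is not None:
            total += w * (x - y) ** 2
            used += 1
    return math.sqrt(total) if used else math.inf


def impute_nearest(rows):
    """Fill each missing value from the nearest fully observed row.

    Requires at least one row without missing values.
    """
    weights = observed_fraction_weights(rows)
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        nearest = min(complete, key=lambda c: weighted_distance(r, c, weights))
        filled.append([n if x is None else x for x, n in zip(r, nearest)])
    return filled
```

    For instance, `impute_nearest([[1.0, 10.0], [2.0, 20.0], [1.1, None]])` fills the gap in the third row with 10.0, because the first row is closest on the observed dimension. Replacing the linear scan with a k-d tree changes only the search cost, not the imputed values.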

  15. Genome-wide association study of susceptibility loci for breast cancer in Sardinian population.

    Science.gov (United States)

    Palomba, Grazia; Loi, Angela; Porcu, Eleonora; Cossu, Antonio; Zara, Ilenia; Budroni, Mario; Dei, Mariano; Lai, Sandra; Mulas, Antonella; Olmeo, Nina; Ionta, Maria Teresa; Atzori, Francesco; Cuccuru, Gianmauro; Pitzalis, Maristella; Zoledziewska, Magdalena; Olla, Nazario; Lovicu, Mario; Pisano, Marina; Abecasis, Gonçalo R; Uda, Manuela; Tanda, Francesco; Michailidou, Kyriaki; Easton, Douglas F; Chanock, Stephen J; Hoover, Robert N; Hunter, David J; Schlessinger, David; Sanna, Serena; Crisponi, Laura; Palmieri, Giuseppe

    2015-05-10

    Despite progress in identifying genes associated with breast cancer, many more risk loci exist. Genome-wide association analyses in genetically-homogeneous populations, such as that of Sardinia (Italy), could represent an additional approach to detect low penetrance alleles. We performed a genome-wide association study comparing 1431 Sardinian patients with non-familial, BRCA1/2-mutation-negative breast cancer to 2171 healthy Sardinian blood donors. DNA was genotyped using GeneChip Human Mapping 500 K Arrays or Genome-Wide Human SNP Arrays 6.0. To increase genomic coverage, genotypes of additional SNPs were imputed using data from HapMap Phase II. After quality control filtering of genotype data, 1367 cases (9 men) and 1658 controls (1156 men) were analyzed on a total of 2,067,645 SNPs. Overall, 33 genomic regions (67 candidate SNPs) were associated with breast cancer risk at the p < 10(-6) level. Twenty of these regions contained defined genes, including one already associated with breast cancer risk: TOX3. With a lower threshold for preliminary significance to p < 10(-5), we identified 11 additional SNPs in FGFR2, a well-established breast cancer-associated gene. Ten candidate SNPs were selected, excluding those already associated with breast cancer, for technical validation as well as replication in 1668 samples from the same population. Only SNP rs345299, located in intron 1 of VAV3, remained suggestively associated (p-value, 1.16 x 10(-5)), but it did not associate with breast cancer risk in pooled data from two large, mixed-population cohorts. This study indicated the role of TOX3 and FGFR2 as breast cancer susceptibility genes in BRCA1/2-wild-type breast cancer patients from Sardinian population.

  16. Genome-wide association study of susceptibility loci for breast cancer in Sardinian population

    International Nuclear Information System (INIS)

    Palomba, Grazia; Loi, Angela; Porcu, Eleonora; Cossu, Antonio; Zara, Ilenia

    2015-01-01

    Despite progress in identifying genes associated with breast cancer, many more risk loci exist. Genome-wide association analyses in genetically-homogeneous populations, such as that of Sardinia (Italy), could represent an additional approach to detect low penetrance alleles. We performed a genome-wide association study comparing 1431 Sardinian patients with non-familial, BRCA1/2-mutation-negative breast cancer to 2171 healthy Sardinian blood donors. DNA was genotyped using GeneChip Human Mapping 500 K Arrays or Genome-Wide Human SNP Arrays 6.0. To increase genomic coverage, genotypes of additional SNPs were imputed using data from HapMap Phase II. After quality control filtering of genotype data, 1367 cases (9 men) and 1658 controls (1156 men) were analyzed on a total of 2,067,645 SNPs. Overall, 33 genomic regions (67 candidate SNPs) were associated with breast cancer risk at the p < 10^-6 level. Twenty of these regions contained defined genes, including one already associated with breast cancer risk: TOX3. With a lower threshold for preliminary significance to p < 10^-5, we identified 11 additional SNPs in FGFR2, a well-established breast cancer-associated gene. Ten candidate SNPs were selected, excluding those already associated with breast cancer, for technical validation as well as replication in 1668 samples from the same population. Only SNP rs345299, located in intron 1 of VAV3, remained suggestively associated (p-value, 1.16 x 10^-5), but it did not associate with breast cancer risk in pooled data from two large, mixed-population cohorts. This study indicated the role of TOX3 and FGFR2 as breast cancer susceptibility genes in BRCA1/2-wild-type breast cancer patients from Sardinian population. The online version of this article (doi:10.1186/s12885-015-1392-9) contains supplementary material, which is available to authorized users.

  17. WASP: a Web-based Allele-Specific PCR assay designing tool for detecting SNPs and mutations

    Directory of Open Access Journals (Sweden)

    Assawamakin Anunchai

    2007-08-01

    Full Text Available Abstract Background Allele-specific (AS) Polymerase Chain Reaction is a convenient and inexpensive method for genotyping Single Nucleotide Polymorphisms (SNPs) and mutations. It is applied in many recent studies including population genetics, molecular genetics and pharmacogenomics. Using known AS primer design tools to create primers is a cumbersome process for inexperienced users, since information about the SNP/mutation must be acquired from public databases prior to the design. Furthermore, most of these tools do not offer mismatch enhancement of the designed primers. The available web applications do not provide a user-friendly graphical input interface or intuitive visualization of their primer results. Results This work presents a web-based AS primer design application called WASP. This tool can efficiently design AS primers for human SNPs as well as mutations. To assist scientists with collecting necessary information about target polymorphisms, this tool provides a local SNP database containing over 10 million SNPs of various populations from public domain databases, namely NCBI dbSNP, HapMap and JSNP. This database is tightly integrated with the tool so that users can perform the design for existing SNPs without going off the site. To guarantee specificity of AS primers, the proposed system incorporates a primer specificity enhancement technique widely used in experimental protocols. In particular, WASP makes use of different destabilizing effects by introducing one deliberate 'mismatch' at the penultimate (second to last) base from the 3'-end of AS primers to improve the resulting AS primers. Furthermore, WASP offers a graphical user interface through scalable vector graphics (SVG) drawing that allows users to select SNPs and graphically visualize designed primers and their conditions. Conclusion WASP offers a tool for designing AS primers for both SNPs and mutations. By integrating the database for known SNPs (using gene ID or rs number

  18. A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method.

    Science.gov (United States)

    Yang, Jun-He; Cheng, Ching-Hsue; Chan, Chia-Pan

    2017-01-01

    Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir's water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting model summarily has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir's water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.

  19. A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method

    Directory of Open Access Journals (Sweden)

    Jun-He Yang

    2017-01-01

    Full Text Available Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir’s water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting model summarily has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir’s water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.

  20. Saturated linkage map construction in Rubus idaeus using genotyping by sequencing and genome-independent imputation

    Directory of Open Access Journals (Sweden)

    Ward Judson A

    2013-01-01

    Full Text Available Abstract Background Rapid development of highly saturated genetic maps aids molecular breeding, which can accelerate gain per breeding cycle in woody perennial plants such as Rubus idaeus (red raspberry). Recently, robust genotyping methods based on high-throughput sequencing were developed, which provide high marker density, but result in some genotype errors and a large number of missing genotype values. Imputation can reduce the number of missing values and can correct genotyping errors, but current methods of imputation require a reference genome and thus are not an option for most species. Results Genotyping by Sequencing (GBS) was used to produce highly saturated maps for a R. idaeus pseudo-testcross progeny. While low coverage and high variance in sequencing resulted in a large number of missing values for some individuals, a novel method of imputation based on maximum likelihood marker ordering from initial marker segregation overcame the challenge of missing values, and made map construction computationally tractable. The two resulting parental maps contained 4521 and 2391 molecular markers spanning 462.7 and 376.6 cM respectively over seven linkage groups. Detection of precise genomic regions with segregation distortion was possible because of map saturation. Microsatellites (SSRs) linked these results to published maps for cross-validation and map comparison. Conclusions GBS together with genome-independent imputation provides a rapid method for genetic map construction in any pseudo-testcross progeny. Our method of imputation estimates the correct genotype call of missing values and corrects genotyping errors that lead to inflated map size and reduced precision in marker placement. Comparison of SSRs to published R. idaeus maps showed that the linkage maps constructed with GBS and our method of imputation were robust, and marker positioning reliable. The high marker density allowed identification of genomic regions with segregation

  1. Comparison of HapMap and 1000 Genomes Reference Panels in a Large-Scale Genome-Wide Association Study

    DEFF Research Database (Denmark)

    de Vries, Paul S; Sabater-Lleal, Maria; Chasman, Daniel I

    2017-01-01

    An increasing number of genome-wide association (GWA) studies are now using the higher resolution 1000 Genomes Project reference panel (1000G) for imputation, with the expectation that 1000G imputation will lead to the discovery of additional associated loci when compared to HapMap imputation. In...

  2. Analyzing the changing gender wage gap based on multiply imputed right censored wages

    OpenAIRE

    Gartner, Hermann; Rässler, Susanne

    2005-01-01

    "In order to analyze the gender wage gap with the German IAB-employment register we have to solve the problem of censored wages at the upper limit of the social security system. We treat this problem as a missing data problem. We regard the missingness mechanism as not missing at random (NMAR, according to Little and Rubin, 1987, 2002) as well as missing by design. The censored wages are multiply imputed by draws of a random variable from a truncated distribution. The multiple imputation is b...

  3. In silico analysis of SNPs of SYK gene Involved in Oral Cancer

    Directory of Open Access Journals (Sweden)

    Sarita Swain

    2017-12-01

    Full Text Available Oral cancer is the sixth most common cancer in the world. Oral cancer is cancer of the oral cavity and pharynx, including cancer of the lip, tongue, salivary glands, gum, floor and other areas of the mouth. The aim of the study is to identify SNPs using dbSNP and to predict the effect of mutations using PredictSNP. Gene associations were obtained with STRING. The diseases and drugs associated with the genes were obtained from WebGestalt. Binding sites were predicted with CASTp. Ligand-protein interaction was modelled using AutoDock and visualised with Discovery Studio, PyMOL and LigPlot. From this report we found that oral cancer differs from person to person based on their genes and genetic interactions and expression, which recommends that clinicians opt for personalized medicine rather than generalized medicine for patients with oral cancer. Given the importance of the genetic background of oral cancer patients, further studies can be done by mining non-synonymous SNPs associated with genes that cause oral cancer.

  4. LD2SNPing: linkage disequilibrium plotter and RFLP enzyme mining for tag SNPs

    Directory of Open Access Journals (Sweden)

    Cheng Yu-Huei

    2009-06-01

    Full Text Available Abstract Background Linkage disequilibrium (LD) mapping is commonly used to evaluate markers for genome-wide association studies. Most types of LD software focus strictly on LD analysis and visualization, but lack supporting services for genotyping. Results We developed a freeware called LD2SNPing, which provides a complete package of mining tools for genotyping and LD analysis environments. The software provides SNP ID- and gene-centric online retrievals for SNP information and tag SNP selection from dbSNP/NCBI and HapMap, respectively. Restriction fragment length polymorphism (RFLP) enzyme information for SNP genotype is available to all SNP IDs and tag SNPs. Single and multiple SNP inputs are possible in order to perform LD analysis by online retrieval from HapMap and NCBI. An LD statistics section provides D, D', r², δQ, ρ, and the P values of the Hardy-Weinberg Equilibrium for each SNP marker, and Chi-square and likelihood-ratio tests for the pair-wise association of two SNPs in LD calculation. Finally, 2D and 3D plots, as well as plain-text output of the results, can be selected. Conclusion LD2SNPing thus provides a novel visualization environment for multiple SNP input, which facilitates SNP association studies. The software, user manual, and tutorial are freely available at http://bio.kuas.edu.tw/LD2NPing.
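
    Three of the pairwise LD statistics listed in this abstract (D, D' and r²) follow directly from haplotype and allele frequencies. The sketch below uses the standard textbook definitions; it is not taken from the LD2SNPing software, the function name is hypothetical, and it assumes both loci are biallelic and polymorphic (allele frequencies strictly between 0 and 1).

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    d = p_ab - p_a * p_b
    # D is normalised by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2
```

    With `ld_stats(0.5, 0.5, 0.5)` the two alleles always occur together, so D = 0.25, D' = 1 and r² = 1; with `ld_stats(0.25, 0.5, 0.5)` the loci are independent and all three values are 0. This sketch keeps the sign of D', whereas many tools report its absolute value.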

  5. Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis

    NARCIS (Netherlands)

    Eekhout, I.; Wiel, M.A. van de; Heymans, M.W.

    2017-01-01

    Background. Multiple imputation is a recommended method to handle missing data. For significance testing after multiple imputation, Rubin’s Rules (RR) are easily applied to pool parameter estimates. In a logistic regression model, to consider whether a categorical covariate with more than two levels
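
    For a single scalar parameter, the pooling step that Rubin's Rules refer to can be sketched as below. This covers only the basic scalar case: the multi-parameter tests needed for a categorical covariate with more than two levels are not shown, the function name is hypothetical, and the numbers in the usage note are invented.

```python
def pool_rubin(estimates, variances):
    """Pool one parameter across m >= 2 imputed datasets with Rubin's Rules.

    estimates: per-dataset point estimates (e.g. a log odds ratio)
    variances: per-dataset squared standard errors
    Returns (pooled estimate, within variance, between variance, total variance).
    """
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    # Total variance adds the between-imputation variance, inflated by (1 + 1/m).
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total
```

    For example, `pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` gives a pooled estimate of 2.0, a within-imputation variance of 0.5, a between-imputation variance of 1.0, and a total variance of 0.5 + (4/3) × 1.0.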

  6. Multiple imputation of rainfall missing data in the Iberian Mediterranean context

    Science.gov (United States)

    Miró, Juan Javier; Caselles, Vicente; Estrela, María José

    2017-11-01

    Given the increasing need for complete rainfall data networks, diverse methods for filling gaps in observed precipitation series have been proposed in recent years, progressively more advanced than the traditional approaches to the problem. The present study validates 10 methods (6 linear, 2 non-linear and 2 hybrid) that allow multiple imputation, i.e., filling at the same time the missing data of multiple incomplete series in a dense network of neighboring stations. These were applied to daily and monthly rainfall in two sectors of the Júcar River Basin Authority (east Iberian Peninsula), which is characterized by high spatial irregularity and difficulty of rainfall estimation. A classification of precipitation according to its genetic origin was applied as pre-processing, and a quantile-mapping adjustment as a post-processing technique. The results showed in general a better performance for the non-linear and hybrid methods, highlighting that the non-linear PCA (NLPCA) method considerably outperforms the Self Organizing Maps (SOM) method among non-linear approaches. Among linear methods, the Regularized Expectation Maximization method (RegEM) was the best, but far from NLPCA. Applying EOF filtering as post-processing of NLPCA (hybrid approach) yielded the best results.

  7. Multiple imputation for estimating the risk of developing dementia and its impact on survival.

    Science.gov (United States)

    Yu, Binbing; Saczynski, Jane S; Launer, Lenore

    2010-10-01

    Dementia, Alzheimer's disease in particular, is one of the major causes of disability and decreased quality of life among the elderly and a leading obstacle to successful aging. Given the profound impact on public health, much research has focused on the age-specific risk of developing dementia and the impact on survival. Early work has discussed various methods of estimating age-specific incidence of dementia, among which the illness-death model is popular for modeling disease progression. In this article we use multiple imputation to fit multi-state models for survival data with interval censoring and left truncation. This approach allows semi-Markov models in which survival after dementia depends on onset age. Such models can be used to estimate the cumulative risk of developing dementia in the presence of the competing risk of dementia-free death. Simulations are carried out to examine the performance of the proposed method. Data from the Honolulu Asia Aging Study are analyzed to estimate the age-specific and cumulative risks of dementia and to examine the effect of major risk factors on dementia onset and death.

  8. SNPsnap: a Web-based tool for identification and annotation of matched SNPs

    DEFF Research Database (Denmark)

    Pers, Tune Hannes; Timshel, Pascal; Hirschhorn, Joel N.

    2015-01-01

    -localization of GWAS signals to gene-dense and high linkage disequilibrium (LD) regions, and correlations of gene size, location and function. The SNPsnap Web server enables SNP-based enrichment analysis by providing matched sets of SNPs that can be used to calibrate background expectations. Specifically, SNPsnap...... efficiently identifies sets of randomly drawn SNPs that are matched to a set of query SNPs based on allele frequency, number of SNPs in LD, distance to nearest gene and gene density. Availability and implementation : SNPsnap server is available at http://www.broadinstitute.org/mpg/snpsnap/. Contact: joelh...

  9. Common non-synonymous SNPs associated with breast cancer susceptibility: findings from the Breast Cancer Association Consortium

    OpenAIRE

    Milne, Roger L.; Burwinkel, Barbara; Michailidou, Kyriaki; Arias-Perez, Jose-Ignacio; Zamora, M. Pilar; Menéndez-Rodríguez, Primitiva; Hardisson, David; Mendiola, Marta; González-Neira, Anna; Pita, Guillermo; Alonso, M. Rosario; Dennis, Joe; Wang, Qin; Bolla, Manjeet K.; Swerdlow, Anthony

    2014-01-01

    Candidate variant association studies have been largely unsuccessful in identifying common breast cancer susceptibility variants, although most studies have been underpowered to detect associations of a realistic magnitude. We assessed 41 common non-synonymous single-nucleotide polymorphisms (nsSNPs) for which evidence of association with breast cancer risk had been previously reported. Case-control data were combined from 38 studies of white European women (46 450 cases and 42 60...

  10. Common non-synonymous SNPs associated with breast cancer susceptibility: findings from the Breast Cancer Association Consortium

    OpenAIRE

    Milne, Roger L.; Burwinkel, Barbara; Michailidou, Kyriaki; Arias-Perez, Jose-Ignacio; Zamora, M. Pilar; Menéndez-Rodríguez, Primitiva; Hardisson, David; Mendiola, Marta; González-Neira, Anna; Pita, Guillermo; Alonso, M. Rosario; Dennis, Joe; Wang, Qin; Bolla, Manjeet K.; Swerdlow, Anthony

    2014-01-01

    Candidate variant association studies have been largely unsuccessful in identifying common breast cancer susceptibility variants, although most studies have been underpowered to detect associations of a realistic magnitude. We assessed 41 common non-synonymous single-nucleotide polymorphisms (nsSNPs) for which evidence of association with breast cancer risk had been previously reported. Case-control data were combined from 38 studies of white European women (46 450 cases and 42 600 controls) ...

  11. Common non-synonymous SNPs associated with breast cancer susceptibility: findings from the Breast Cancer Association Consortium

    OpenAIRE

    Milne, Roger L; Burwinkel, Barbara; Michailidou, Kyriaki; Arias-Perez, Jose-Ignacio; Zamora, M Pilar; Menéndez-Rodríguez, Primitiva; Hardisson, David; Mendiola, Marta; González-Neira, Anna; Pita, Guillermo; Alonso, M Rosario; Dennis, Joe; Wang, Qin; Bolla, Manjeet K; Swerdlow, Anthony

    2014-01-01

    Candidate variant association studies have been largely unsuccessful in identifying common breast cancer susceptibility variants, although most studies have been underpowered to detect associations of a realistic magnitude. We assessed 41 common non-synonymous single-nucleotide polymorphisms (nsSNPs) for which evidence of association with breast cancer risk had been previously reported. Case-control data were combined from 38 studies of white European women (46 450 cases and 42 600 controls) a...

  12. Meta-analysis of SNPs involved in variance heterogeneity using Levene's test for equal variances

    Science.gov (United States)

    Deng, Wei Q; Asma, Senay; Paré, Guillaume

    2014-01-01

    Meta-analysis is a commonly used approach to increase the sample size for genome-wide association searches when individual studies are otherwise underpowered. Here, we present a meta-analysis procedure to estimate the heterogeneity of the quantitative trait variance attributable to genetic variants using Levene's test without needing to exchange individual-level data. The meta-analysis of Levene's test offers the opportunity to combine the considerable sample size of a genome-wide meta-analysis to identify the genetic basis of phenotypic variability and to prioritize single-nucleotide polymorphisms (SNPs) for gene–gene and gene–environment interactions. The use of Levene's test has several advantages, including robustness to departure from the normality assumption, freedom from the influence of the main effects of SNPs, and no assumption of an additive genetic model. We conducted a meta-analysis of the log-transformed body mass index of 5892 individuals and identified a variant with a highly suggestive Levene's test P-value of 4.28E-06 near the NEGR1 locus known to be associated with extreme obesity. PMID:23921533
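    As a rough illustration of the per-study quantity being combined, Levene's W statistic can be computed across genotype groups as sketched below. This is a minimal sketch of the classical statistic only, not the authors' meta-analysis procedure; the function name `levene_w` and the toy genotype groups are hypothetical, and no P-value is computed.

```python
from statistics import mean

def levene_w(groups, center=mean):
    """Levene's W statistic for k groups of trait values.

    center=mean gives the classical Levene statistic; passing
    statistics.median gives the Brown-Forsythe variant.
    """
    # Absolute deviation of each observation from its own group's centre.
    z = [[abs(x - center(g)) for x in g] for g in groups]
    n = sum(len(g) for g in z)      # total sample size
    k = len(z)                      # number of groups (e.g. genotypes)
    z_group = [mean(g) for g in z]  # mean deviation per group
    z_all = sum(sum(g) for g in z) / n
    between = sum(len(g) * (zg - z_all) ** 2 for g, zg in zip(z, z_group))
    within = sum((x - zg) ** 2 for g, zg in zip(z, z_group) for x in g)
    return (n - k) / (k - 1) * between / within

# Hypothetical trait values grouped by genotype (toy numbers).
same_spread = [[1, 2, 3], [11, 12, 13]]
different_spread = [[1, 2, 3], [0, 10, 20]]
w_same = levene_w(same_spread)
w_diff = levene_w(different_spread)
```

    Because the statistic is built from absolute deviations around each group's own centre, shifting a group's mean leaves W unchanged, which is the sense in which the test is free from the influence of SNP main effects.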

  13. Determining effects of non-synonymous SNPs on protein-protein interactions using supervised and semi-supervised learning.

    Directory of Open Access Journals (Sweden)

    Nan Zhao

    2014-05-01

    Single nucleotide polymorphisms (SNPs) are among the most common types of genetic variation in complex genetic disorders. A growing number of studies link the functional role of SNPs with the networks and pathways mediated by the disease-associated genes. For example, many non-synonymous missense SNPs (nsSNPs) have been found near or inside the protein-protein interaction (PPI) interfaces. Determining whether such nsSNP will disrupt or preserve a PPI is a challenging task to address, both experimentally and computationally. Here, we present this task as three related classification problems, and develop a new computational method, called the SNP-IN tool (non-synonymous SNP INteraction effect predictor). Our method predicts the effects of nsSNPs on PPIs, given the interaction's structure. It leverages supervised and semi-supervised feature-based classifiers, including our new Random Forest self-learning protocol. The classifiers are trained based on a dataset of comprehensive mutagenesis studies for 151 PPI complexes, with experimentally determined binding affinities of the mutant and wild-type interactions. Three classification problems were considered: (1) a 2-class problem (strengthening/weakening PPI mutations), (2) another 2-class problem (mutations that disrupt/preserve a PPI), and (3) a 3-class classification (detrimental/neutral/beneficial mutation effects). In total, 11 different supervised and semi-supervised classifiers were trained and assessed resulting in a promising performance, with the weighted f-measure ranging from 0.87 for Problem 1 to 0.70 for the most challenging Problem 3. By integrating prediction results of the 2-class classifiers into the 3-class classifier, we further improved its performance for Problem 3. To demonstrate the utility of SNP-IN tool, it was applied to study the nsSNP-induced rewiring of two disease-centered networks. The accurate and balanced performance of SNP-IN tool makes it readily available to study the

  14. Determining Effects of Non-synonymous SNPs on Protein-Protein Interactions using Supervised and Semi-supervised Learning

    Science.gov (United States)

    Zhao, Nan; Han, Jing Ginger; Shyu, Chi-Ren; Korkin, Dmitry

    2014-01-01

    Single nucleotide polymorphisms (SNPs) are among the most common types of genetic variation in complex genetic disorders. A growing number of studies link the functional role of SNPs with the networks and pathways mediated by the disease-associated genes. For example, many non-synonymous missense SNPs (nsSNPs) have been found near or inside the protein-protein interaction (PPI) interfaces. Determining whether such nsSNP will disrupt or preserve a PPI is a challenging task to address, both experimentally and computationally. Here, we present this task as three related classification problems, and develop a new computational method, called the SNP-IN tool (non-synonymous SNP INteraction effect predictor). Our method predicts the effects of nsSNPs on PPIs, given the interaction's structure. It leverages supervised and semi-supervised feature-based classifiers, including our new Random Forest self-learning protocol. The classifiers are trained based on a dataset of comprehensive mutagenesis studies for 151 PPI complexes, with experimentally determined binding affinities of the mutant and wild-type interactions. Three classification problems were considered: (1) a 2-class problem (strengthening/weakening PPI mutations), (2) another 2-class problem (mutations that disrupt/preserve a PPI), and (3) a 3-class classification (detrimental/neutral/beneficial mutation effects). In total, 11 different supervised and semi-supervised classifiers were trained and assessed resulting in a promising performance, with the weighted f-measure ranging from 0.87 for Problem 1 to 0.70 for the most challenging Problem 3. By integrating prediction results of the 2-class classifiers into the 3-class classifier, we further improved its performance for Problem 3. To demonstrate the utility of SNP-IN tool, it was applied to study the nsSNP-induced rewiring of two disease-centered networks. The accurate and balanced performance of SNP-IN tool makes it readily available to study the rewiring of

  15. Relative efficiency of joint-model and full-conditional-specification multiple imputation when conditional models are compatible: The general location model.

    Science.gov (United States)

    Seaman, Shaun R; Hughes, Rachael A

    2018-06-01

    Estimating the parameters of a regression model of interest is complicated by missing data on the variables in that model. Multiple imputation is commonly used to handle these missing data. Joint model multiple imputation and full-conditional specification multiple imputation are known to yield imputed data with the same asymptotic distribution when the conditional models of full-conditional specification are compatible with that joint model. We show that this asymptotic equivalence of imputation distributions does not imply that joint model multiple imputation and full-conditional specification multiple imputation will also yield asymptotically equally efficient inference about the parameters of the model of interest, nor that they will be equally robust to misspecification of the joint model. When the conditional models used by full-conditional specification multiple imputation are linear, logistic and multinomial regressions, these are compatible with a restricted general location joint model. We show that multiple imputation using the restricted general location joint model can be substantially more asymptotically efficient than full-conditional specification multiple imputation, but this typically requires very strong associations between variables. When associations are weaker, the efficiency gain is small. Moreover, full-conditional specification multiple imputation is shown to be potentially much more robust than joint model multiple imputation using the restricted general location model to misspecification of that model when there is substantial missingness in the outcome variable.

  16. Identification of single nucleotide polymorphisms (SNPs) at candidate genes involved in abiotic stress in two Prosopis species of hybrids

    OpenAIRE

    Maria F. Pomponio; Susana Marcucci Poltri; Diego Lopez Lauenstein; Susana Torales

    2014-01-01

    Aim of the study: Identify and compare SNPs on candidate genes related to abiotic stress in Prosopis chilensis, Prosopis flexuosa and interspecific hybrids. Area of the study: Chaco árido, Argentina. Material and Methods: Fragments from 6 candidate genes were sequenced in 60 genotypes. DNA polymorphisms were analyzed. Main Results: The analysis revealed that the hybrids had the highest rate of polymorphism, followed by P. flexuosa and P. chilensis; the values found are comparable to other forest...

  17. Identification of single nucleotide polymorphisms (SNPs) at candidate genes involved in abiotic stress in two Prosopis species of hybrids

    Directory of Open Access Journals (Sweden)

    Maria F. Pomponio

    2014-12-01

    Aim of the study: Identify and compare SNPs on candidate genes related to abiotic stress in Prosopis chilensis, Prosopis flexuosa and interspecific hybrids. Area of the study: Chaco árido, Argentina. Material and Methods: Fragments from 6 candidate genes were sequenced in 60 genotypes. DNA polymorphisms were analyzed. Main Results: The analysis revealed that the hybrids had the highest rate of polymorphism, followed by P. flexuosa and P. chilensis; the values found are comparable to other forest tree species. Research highlights: This approach will help to study genetic diversity variation on natural populations for assessing the effects of environmental changes. Keywords: SNPs; abiotic stress; interspecific variation; molecular markers.

  18. Applying an efficient K-nearest neighbor search to forest attribute imputation

    Science.gov (United States)

    Andrew O. Finley; Ronald E. McRoberts; Alan R. Ek

    2006-01-01

    This paper explores the utility of an efficient nearest neighbor (NN) search algorithm for applications in multi-source kNN forest attribute imputation. The search algorithm reduces the number of distance calculations between a given target vector and each reference vector, thereby, decreasing the time needed to discover the NN subset. Results of five trials show gains...

  19. Estimating cavity tree and snag abundance using negative binomial regression models and nearest neighbor imputation methods

    Science.gov (United States)

    Bianca N.I. Eskelson; Hailemariam Temesgen; Tara M. Barrett

    2009-01-01

    Cavity tree and snag abundance data are highly variable and contain many zero observations. We predict cavity tree and snag abundance from variables that are readily available from forest cover maps or remotely sensed data using negative binomial (NB), zero-inflated NB, and zero-altered NB (ZANB) regression models as well as nearest neighbor (NN) imputation methods....

  20. Mapping change of older forest with nearest-neighbor imputation and Landsat time-series

    Science.gov (United States)

    Janet L. Ohmann; Matthew J. Gregory; Heather M. Roberts; Warren B. Cohen; Robert E. Kennedy; Zhiqiang. Yang

    2012-01-01

    The Northwest Forest Plan (NWFP), which aims to conserve late-successional and old-growth forests (older forests) and associated species, established new policies on federal lands in the Pacific Northwest USA. As part of monitoring for the NWFP, we tested nearest-neighbor imputation for mapping change in older forest, defined by threshold values for forest attributes...

  1. Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data.

    Science.gov (United States)

    Rahman, Shah Atiqur; Huang, Yuxiao; Claassen, Jan; Heintzman, Nathaniel; Kleinberg, Samantha

    2015-12-01

    Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead the measurement of a variable such as blood glucose may depend on its prior values as well as that of other variables. These dependencies exist across time as well, but current methods have yet to incorporate these temporal relationships as well as multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time lagged correlations both within and across variables by combining two imputation methods, based on an extension to k-NN and the Fourier transform. This enables imputation of missing values even when all data at a time point is missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring) the proposed method has the highest imputation accuracy. This was true for up to half the data being missing and when consecutive missing values are a significant fraction of the overall time series length. Copyright © 2015 Elsevier Inc. All rights reserved.

  2. Learning-Based Adaptive Imputation Method with kNN Algorithm for Missing Power Data

    Directory of Open Access Journals (Sweden)

    Minkyung Kim

    2017-10-01

    This paper proposes a learning-based adaptive imputation method (LAI) for imputing missing power data in an energy system. This method estimates the missing power data by using the pattern that appears in the collected data. Here, in order to capture the patterns from past power data, we newly model a feature vector by using past data and its variations. The proposed LAI then learns the optimal length of the feature vector and the optimal historical length, which are significant hyperparameters of the proposed method, by utilizing intentional missing data. Based on a weighted distance between feature vectors representing a missing situation and past situation, missing power data are estimated by referring to the k most similar past situations in the optimal historical length. We further extend the proposed LAI to alleviate the effect of unexpected variation in power data and refer to this new approach as the extended LAI method (eLAI). The eLAI selects a method between linear interpolation (LI) and the proposed LAI to improve accuracy under unexpected variations. Finally, from a simulation under various energy consumption profiles, we verify that the proposed eLAI achieves about a 74% reduction of the average imputation error in an energy system, compared to the existing imputation methods.

  3. Missing value imputation in DNA microarrays based on conjugate gradient method.

    Science.gov (United States)

    Dorri, Fatemeh; Azmi, Paeiz; Dorri, Faezeh

    2012-02-01

    Analysis of gene expression profiles needs a complete matrix of gene array values; consequently, imputation methods have been suggested. In this paper, an algorithm that is based on the conjugate gradient (CG) method is proposed to estimate missing values. The k-nearest neighbors of the missing entry are first selected based on absolute values of their Pearson correlation coefficient. Then a subset of genes among the k-nearest neighbors is labeled as the best similar ones. The CG algorithm with this subset as its input is then used to estimate the missing values. Our proposed CG based algorithm (CGimpute) is evaluated on different data sets. The results are compared with sequential local least squares (SLLSimpute), Bayesian principal component analysis (BPCAimpute), local least squares imputation (LLSimpute), iterated local least squares imputation (ILLSimpute) and adaptive k-nearest neighbors imputation (KNNKimpute) methods. The average of normalized root mean squares error (NRMSE) and relative NRMSE in different data sets with various missing rates shows CGimpute outperforms other methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
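    The neighbor-selection step described above can be sketched as follows. This is a minimal sketch of ranking candidate genes by absolute Pearson correlation only; it does not reproduce the paper's subset labeling or its conjugate gradient estimation, and the names `pearson` and `select_neighbors` and the toy expression rows are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation over positions observed (not None) in both rows."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    # A constant row has no defined correlation; treat it as uncorrelated.
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def select_neighbors(target, candidates, col, k):
    """Indices of the k candidate rows with the largest |r| to the target.

    Only candidates that have an observed value in column `col`
    (the target's missing position) are eligible.
    """
    eligible = [i for i, c in enumerate(candidates) if c[col] is not None]
    ranked = sorted(eligible,
                    key=lambda i: abs(pearson(target, candidates[i])),
                    reverse=True)
    return ranked[:k]

# Toy expression rows; the target gene is missing its last value.
target = [1.0, 2.0, 3.0, None]
candidates = [
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with the target
    [3.0, 2.0, 1.0, 0.0],   # perfectly anti-correlated
    [1.0, 3.0, 2.0, 5.0],   # weakly correlated
]
neighbors = select_neighbors(target, candidates, col=3, k=2)
```

    Ranking on the absolute value keeps strongly anti-correlated genes as neighbors, so any later estimation step has to account for the sign of the relationship rather than averaging neighbor values directly.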

  4. Statistical Analysis of a Class: Monte Carlo and Multiple Imputation Spreadsheet Methods for Estimation and Extrapolation

    Science.gov (United States)

    Fish, Laurel J.; Halcoussis, Dennis; Phillips, G. Michael

    2017-01-01

    The Monte Carlo method and related multiple imputation methods are traditionally used in math, physics and science to estimate and analyze data and are now becoming standard tools in analyzing business and financial problems. However, few sources explain the application of the Monte Carlo method for individuals and business professionals who are…

  5. Targeted Metabolic Engineering Guided by Computational Analysis of Single-Nucleotide Polymorphisms (SNPs)

    DEFF Research Database (Denmark)

    Udatha, D B R K Gupta; Rasmussen, Simon; Sicheritz-Pontén, Thomas

    2013-01-01

    The non-synonymous SNPs, the so-called non-silent SNPs, which are single-nucleotide variations in the coding regions that give "birth" to amino acid mutations, are often involved in the modulation of protein function. Understanding the effect of individual amino acid mutations on a protein...

  6. Application of SNPs for population genetics of nonmodel organisms: new opportunities and challenges

    DEFF Research Database (Denmark)

    Helyar, S.J.; Hansen, Jakob Hemmer; Bekkevold, Dorte

    2011-01-01

    Recent improvements in the speed, cost and accuracy of next generation sequencing are revolutionizing the discovery of single nucleotide polymorphisms (SNPs). SNPs are increasingly being used as an addition to the molecular ecology toolkit in nonmodel organisms, but their efficient use remains...

  7. SNPs in the coding region of the metastasis-inducing gene MACC1 and clinical outcome in colorectal cancer

    Directory of Open Access Journals (Sweden)

    Schmid Felicitas

    2012-07-01

    Background: Colorectal cancer is one of the main cancers in the Western world. About 90% of the deaths arise from formation of distant metastasis. The expression of the newly identified gene metastasis associated in colon cancer 1 (MACC1) is a prognostic indicator for colon cancer metastasis. Here, we analyzed for the first time the impact of single nucleotide polymorphisms (SNPs) in the coding region of MACC1 for clinical outcome of colorectal cancer patients. Additionally, we screened met proto-oncogene (Met), the transcriptional target gene of MACC1, for mutations. Methods: We sequenced the coding exons of MACC1 in 154 colorectal tumors (stages I, II and III) and the crucial exons of Met in 60 colorectal tumors (stages I, II and III). We analyzed the association of MACC1 polymorphisms with clinical data, including metachronous metastasis, UICC stages, tumor invasion, lymph node metastasis and patients’ survival (n = 154, stages I, II and III). Furthermore, we performed biological assays in order to evaluate the functional impact of MACC1 SNPs on the motility of colorectal cancer cells. Results: We genotyped three MACC1 SNPs in the coding region. Thirteen percent of the tumors had the genotype cg (rs4721888, L31V), 48% a ct genotype (rs975263, S515L) and 84% a gc or cc genotype (rs3735615, R804T). We found no association of these SNPs with clinicopathological parameters or with patients’ survival, when analyzing the entire patients’ cohort. An increased risk for a shorter metastasis-free survival of patients with a ct genotype (rs975263) was observed in younger colon cancer patients with stage I or II (P = 0.041, n = 18). In cell culture, MACC1 SNPs did not affect MACC1-induced cell motility and proliferation. Conclusion: In summary, the identification of coding MACC1 SNPs in primary colorectal tumors does not improve the prediction for metastasis formation or for patients’ survival compared to MACC1 expression analysis alone. The ct genotype (rs

  8. Two-pass imputation algorithm for missing value estimation in gene expression time series.

    Science.gov (United States)

    Tsiporkova, Elena; Boeva, Veselka

    2007-10-01

    Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides for a choice between three different

  9. Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data.

    Science.gov (United States)

    Sehgal, Muhammad Shoaib B; Gondal, Iqbal; Dooley, Laurence S

    2005-05-15

    Microarray data are used in a range of application areas in biology, although they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods. The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performed better not only for randomly occurring but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values compared with other methods for both series types of data, for the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE

  10. Comparison of HapMap and 1000 Genomes Reference Panels in a Large-Scale Genome-Wide Association Study.

    Directory of Open Access Journals (Sweden)

    Paul S de Vries

    An increasing number of genome-wide association (GWA) studies are now using the higher resolution 1000 Genomes Project reference panel (1000G) for imputation, with the expectation that 1000G imputation will lead to the discovery of additional associated loci when compared to HapMap imputation. In order to assess the improvement of 1000G over HapMap imputation in identifying associated loci, we compared the results of GWA studies of circulating fibrinogen based on the two reference panels. Using both HapMap and 1000G imputation we performed a meta-analysis of 22 studies comprising the same 91,953 individuals. We identified six additional signals using 1000G imputation, while 29 loci were associated using both HapMap and 1000G imputation. One locus identified using HapMap imputation was not significant using 1000G imputation. The genome-wide significance threshold of 5×10^-8 is based on the number of independent statistical tests using HapMap imputation, and 1000G imputation may lead to further independent tests that should be corrected for. When using a stricter Bonferroni correction for the 1000G GWA study (P-value < 2.5×10^-8), the number of loci significant only using HapMap imputation increased to 4 while the number of loci significant only using 1000G decreased to 5. In conclusion, 1000G imputation enabled the identification of 20% more loci than HapMap imputation, although the advantage of 1000G imputation became less clear when a stricter Bonferroni correction was used. More generally, our results provide insights that are applicable to the implementation of other dense reference panels that are under development.
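    The two significance thresholds quoted above follow from dividing a family-wise error rate by a number of independent tests. The sketch below shows that arithmetic under stated assumptions: a family-wise error rate of 0.05 and roughly one million (HapMap) versus two million (1000G) independent tests. Neither figure is given in the abstract; they are simply the values that reproduce the two quoted thresholds, and the example P-value is invented.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

def is_significant(p_value, threshold):
    """A locus counts as significant if its P-value is below the threshold."""
    return p_value < threshold

ALPHA = 0.05                      # assumed family-wise error rate
HAPMAP_TESTS = 1_000_000          # assumed number of independent tests
G1000_TESTS = 2_000_000           # assumed: twice as many independent tests

hapmap_cutoff = bonferroni_threshold(ALPHA, HAPMAP_TESTS)   # about 5e-8
g1000_cutoff = bonferroni_threshold(ALPHA, G1000_TESTS)     # about 2.5e-8

# A hypothetical locus with P = 4e-8 passes the conventional threshold
# but fails the stricter one.
example_p = 4e-8
passes_conventional = is_significant(example_p, hapmap_cutoff)
passes_strict = is_significant(example_p, g1000_cutoff)
```

    A locus whose P-value falls between the two cutoffs is exactly the kind of signal that is counted as significant under the conventional threshold and lost under the stricter one.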

  11. Genomic Selection Using Extreme Phenotypes and Pre-Selection of SNPs in Large Yellow Croaker (Larimichthys crocea).

    Science.gov (United States)

    Dong, Linsong; Xiao, Shijun; Chen, Junwei; Wan, Liang; Wang, Zhiyong

    2016-10-01

    Genomic selection (GS) is an effective method to improve predictive accuracies of genetic values. However, high cost in genotyping will limit the application of this technology in some species. Therefore, it is necessary to find some methods to reduce the genotyping costs in genomic selection. Large yellow croaker is one of the most commercially important marine fish species in southeast China and Eastern Asia. In this study, genotyping-by-sequencing was used to construct the libraries for the NGS sequencing and find 29,748 SNPs in the genome. Two traits, eviscerated weight (EW) and the ratio between eviscerated weight and whole body weight (REW), were chosen to study. Two strategies to reduce the costs were proposed as follows: selecting extreme phenotypes (EP) for genotyping in reference population or pre-selecting SNPs to construct low-density marker panels in candidates. Three methods of pre-selection of SNPs, i.e., pre-selecting SNPs by absolute effects (SE), by single marker analysis (SMA), and by fixed intervals of sequence number (EL), were studied. The results showed that using EP was a feasible method to save the genotyping costs in reference population. Heritability did not seem to have obvious influences on the predictive abilities estimated by EP. Using SMA was the most feasible method to save the genotyping costs in candidates. In addition, the combination of EP and SMA in genomic selection also showed good results, especially for trait of REW. We also described how to apply the new methods in genomic selection and compared the genotyping costs before and after using the new methods. Our study may not only offer a reference for aquatic genomic breeding but also offer a reference for genomic prediction in other species including livestock and plants, etc.

  12. Association of six CpG-SNPs in the inflammation-related genes with coronary heart disease

    OpenAIRE

    Chen, Xiaomin; Chen, Xiaoying; Xu, Yan; Yang, William; Wu, Nan; Ye, Huadan; Yang, Jack Y.; Hong, Qingxiao; Xin, Yanfei; Yang, Mary Qu; Deng, Youping; Duan, Shiwei

    2016-01-01

    Background Chronic inflammation has been widely considered to be the major risk factor of coronary heart disease (CHD). The goal of our study was to explore the possible association with CHD for inflammation-related single nucleotide polymorphisms (SNPs) involved in cytosine-phosphate-guanine (CpG) dinucleotides. A total of 784 CHD patients and 739 non-CHD controls were recruited from Zhejiang Province, China. Using the Sequenom MassARRAY platform, we measured the genotypes of six inflammatio...

  13. Overview of some projects of SNPS for global space communication

    International Nuclear Information System (INIS)

    Ivanov, E.; Ghitaykin, V.; Ionkin, V.; Dubinin, A.; Pyshko, A.

    2001-01-01

    In this presentation we focused on three variants of prospective concepts of SNPS. They are intended to solve tasks of global space communication (GSC) as nearest future tasks in space. Modern concepts of the application of power technology in space rely on using an onboard source of energy to support self-transportation of the vehicle into geostationary orbit (GSO). There are three more prospective systems as follows: a gas cooled nuclear reactor with a hybrid thermal engine and machine power converter; a nuclear reactor cooled by liquid metal and with a thermoelectric power generating system; a nuclear reactor with Li cooling and a thermionic and thermoelectric power generator on board. The choice of a concept must fit strong requirements such as: the space nuclear power unit is aimed to be used in a powerful mission; the space power unit must be able to maintain the dual-mode regime of vehicle operation (self-transportation and long life in geosynchronous orbit [GEO]); the nuclear reactor of the unit must be safe and it must be designed in such a way that it will ensure minimum size of the complete system; the elements of the considered technology can be used for the creation of NPPI and with other sources of heat (for example, radioisotope); the degree of technical and technological readiness of units of the thermal and power circuit of the installation is estimated to be high and is defined by a number of technological developments in air, space and nuclear branches; the nuclear reactor and heat transfer equipment should work in a normal mode, which can be very reliably confirmed for other high-temperature nuclear systems. Considering these concepts, we practically consider one possible strategy for developing a complex system of nuclear power engineering. It is the strategy of step-by-step development of space engineering with real application in commercial, scientific and other powerful missions in the nearest and deep space. As starting point of this activity is

  14. Computational screening and molecular dynamics simulation of disease associated nsSNPs in CENP-E

    Energy Technology Data Exchange (ETDEWEB)

    Kumar, Ambuj [Bioinformatics Division, School of Bio Sciences and Technology, Vellore Institute of Technology University, Vellore 632014, Tamil Nadu (India)]; Purohit, Rituraj, E-mail: riturajpurohit@gmail.com [Bioinformatics Division, School of Bio Sciences and Technology, Vellore Institute of Technology University, Vellore 632014, Tamil Nadu (India)]

    2012-10-15

    Aneuploidy and chromosomal instability (CIN) are hallmarks of most solid tumors. Mutations in centromere proteins have been observed to promote aneuploidy and tumorigenesis. Recent studies reported that centromere-associated protein-E (CENP-E) is involved in inducing cancers. In this study we investigated the pathogenic effect of 132 nsSNPs reported in CENP-E using a computational platform. The Y63H point mutation was found to be associated with cancer using the SIFT, Polyphen, PhD-SNP, MutPred, CanPredict and Dr. Cancer tools. Further, we investigated the binding affinity of the ATP molecule to the CENP-E motor domain. Complementarity scores obtained from docking studies showed a significant loss in ATP binding affinity of the mutant structure. Molecular dynamics simulation was carried out to examine the structural consequences of the Y63H mutation. Root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration (Rg), solvent accessibility surface area (SASA), energy value, hydrogen bond (NH bond), eigenvector projection, trace of covariance matrix and atom density analysis results showed a notable loss in stability for the mutant structure. The Y63H mutation was also shown to disrupt the native conformation of the ATP binding region in the CENP-E motor domain. Docking studies for the remaining 18 mutations at the 63rd residue position, as well as the other two computationally predicted disease-associated mutations S22L and P69S, were also carried out to investigate their effect on the ATP binding affinity of the CENP-E motor domain. Our study provides a promising computational methodology to study the tumorigenic consequences of nsSNPs that have not been characterized and a clear clue for wet lab scientists.

  15. Computational screening and molecular dynamics simulation of disease associated nsSNPs in CENP-E

    International Nuclear Information System (INIS)

    Kumar, Ambuj; Purohit, Rituraj

    2012-01-01

    Aneuploidy and chromosomal instability (CIN) are hallmarks of most solid tumors. Mutations in centromere proteins have been observed to promote aneuploidy and tumorigenesis. Recent studies reported that centromere-associated protein-E (CENP-E) is involved in inducing cancers. In this study we investigated the pathogenic effect of 132 nsSNPs reported in CENP-E using a computational platform. The Y63H point mutation was found to be associated with cancer using the SIFT, Polyphen, PhD-SNP, MutPred, CanPredict and Dr. Cancer tools. Further, we investigated the binding affinity of the ATP molecule to the CENP-E motor domain. Complementarity scores obtained from docking studies showed a significant loss in ATP binding affinity of the mutant structure. Molecular dynamics simulation was carried out to examine the structural consequences of the Y63H mutation. Root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration (Rg), solvent accessibility surface area (SASA), energy value, hydrogen bond (NH bond), eigenvector projection, trace of covariance matrix and atom density analysis results showed a notable loss in stability for the mutant structure. The Y63H mutation was also shown to disrupt the native conformation of the ATP binding region in the CENP-E motor domain. Docking studies for the remaining 18 mutations at the 63rd residue position, as well as the other two computationally predicted disease-associated mutations S22L and P69S, were also carried out to investigate their effect on the ATP binding affinity of the CENP-E motor domain. Our study provides a promising computational methodology to study the tumorigenic consequences of nsSNPs that have not been characterized and a clear clue for wet lab scientists.

  16. Estimating Classification Errors Under Edit Restrictions in Composite Survey-Register Data Using Multiple Imputation Latent Class Modelling (MILC)

    Directory of Open Access Journals (Sweden)

    Boeschoten Laura

    2017-12-01

    Both registers and surveys can contain classification errors. These errors can be estimated by making use of a composite data set. We propose a new method based on latent class modelling to estimate the number of classification errors across several sources while taking into account impossible combinations with scores on other variables. Furthermore, the latent class model, by multiply imputing a new variable, enhances the quality of statistics based on the composite data set. The performance of this method is investigated by a simulation study, which shows that whether or not the method can be applied depends on the entropy R2 of the latent class model and the type of analysis a researcher is planning to do. Finally, the method is applied to public data from Statistics Netherlands.

  17. Effective selection of informative SNPs and classification on the HapMap genotype data

    Directory of Open Access Journals (Sweden)

    Wang Lipo

    2007-12-01

    Background: Since single nucleotide polymorphisms (SNPs) are genetic variations which determine the difference between any two unrelated individuals, SNPs can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype data, as few informative SNPs as possible are required from the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify the three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria in West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT), from which 100,736 SNPs were obtained and the top 82 SNPs could completely classify the three populations. Results: In this paper, we propose to first rank each feature (SNP) using a ranking measure, i.e., a modified t-test or F-statistics. Then, from the ranking list, we form different feature subsets by sequentially choosing different numbers of features (e.g., 1, 2, 3, ..., 100) with top ranking values, and train and test them with a classifier, e.g., the support vector machine (SVM), thereby finding the subset which has the highest classification accuracy. Compared to the classification method of Park et al., we obtain a better result, i.e., good classification of the 3 populations using on average 64 SNPs. Conclusion: Experimental results show that both the modified t-test and the F-statistics methods are very effective in ranking SNPs by their classification capabilities. Combined with the SVM classifier, a desirable feature subset (with the minimum size and most informativeness) can be quickly found in a greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals.
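    The ranking step described in the record above can be illustrated with a minimal sketch. It assumes genotypes coded as 0, 1 or 2 copies of the minor allele and uses a plain one-way ANOVA F-statistic rather than the modified statistics of the paper; the SNP names, the toy genotypes and the helper names `f_statistic` and `rank_snps` are hypothetical. The subsequent step of training an SVM on the top-ranked subsets is not shown.

```python
from statistics import mean

def f_statistic(values, labels):
    """One-way ANOVA F-statistic of one SNP's genotype codes across populations."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Between-group and within-group mean squares
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    if within == 0:
        # No variation inside any group: perfectly separating or uninformative
        return float("inf") if between > 0 else 0.0
    return between / within

def rank_snps(genotypes, labels):
    """Return SNP names sorted from most to least population-informative."""
    scores = {snp: f_statistic(vals, labels) for snp, vals in genotypes.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: three individuals per population
labels = ["CEU"] * 3 + ["YRI"] * 3 + ["CHB+JPT"] * 3
genotypes = {
    "snp_a": [0, 0, 1, 2, 2, 1, 1, 1, 0],  # partly informative
    "snp_b": [0, 0, 0, 1, 1, 1, 2, 2, 2],  # separates the groups perfectly
    "snp_c": [1, 0, 2, 1, 0, 2, 1, 0, 2],  # same distribution in every group
}
ranking = rank_snps(genotypes, labels)
print(ranking)
```

    In a full pipeline, the top 1, 2, 3, ... SNPs of such a ranking would each be passed to a classifier and the smallest subset reaching the best accuracy would be kept.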

  18. SNPs in the 5'-regulatory region of the tyrosinase gene do not affect plumage color in ducks (Anas platyrhynchos).

    Science.gov (United States)

    Zhang, N N; Hu, J W; Liu, H H; Xu, H Y; He, H; Li, L

    2015-12-29

    Tyrosinase, encoded by the TYR gene, is the rate-limiting enzyme in the production of melanin pigment. In this study, plumage color separation was observed in Cherry Valley duck line D and F1 and F2 hybrid generations of Liancheng white ducks. Gene sequencing and bioinformatic analysis were applied to the 5'-regulatory region of TYR, to explore the connection between TYR sequence variation and duck plumage color. Four SNPs were found in the 5'-regulatory region. The SNPs were in tight linkage and formed three haplotypes. However, the genotype distribution in groups with different plumage color was not significantly different, and there were no changes in the transcription factor binding sites between the different genotypes. In conclusion, these SNP variations may not cause the differences in feather color observed in this test group.

  19. Enrichment of minor allele of SNPs and genetic prediction of type 2 diabetes risk in British population.

    Directory of Open Access Journals (Sweden)

    Xiaoyun Lei

    Type 2 diabetes (T2D) is a complex disorder characterized by high blood sugar, insulin resistance, and relative lack of insulin. The collective effects of genome-wide minor alleles of common SNPs, or the minor allele content (MAC) in an individual, have been linked with quantitative variations of complex traits and diseases. Here we studied MAC in T2D using previously published SNP datasets and found higher MAC in cases relative to matched controls. A set of 357 SNPs was found to have the best predictive accuracy in a British population. A weighted risk score calculated by using this set produced an area under the curve (AUC) score of 0.86, which is comparable to risk models built by phenotypic markers. These results identify a novel genetic risk element in T2D susceptibility and provide a potentially useful genetic method to identify individuals with high risk of T2D.
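    As a rough illustration of how a weighted risk score and its AUC relate, the following sketch scores each individual as a weighted sum of minor allele counts and computes the AUC as the probability that a randomly chosen case outscores a randomly chosen control. The weights, genotypes and function names are hypothetical and are not taken from the study.

```python
def weighted_risk_score(minor_allele_counts, weights):
    """Weighted sum of per-SNP minor allele counts (0, 1 or 2)."""
    return sum(w * c for w, c in zip(weights, minor_allele_counts))

def auc(case_scores, control_scores):
    """Fraction of case/control pairs in which the case scores higher (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical per-SNP weights and genotypes for three SNPs
weights = [0.5, 1.0, 0.25]
cases = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
controls = [[0, 0, 1], [1, 0, 0], [2, 1, 0]]

case_scores = [weighted_risk_score(g, weights) for g in cases]
control_scores = [weighted_risk_score(g, weights) for g in controls]
print(auc(case_scores, control_scores))
```

    An AUC of 0.5 corresponds to a score that does not separate cases from controls at all, and 1.0 to perfect separation.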

  20. Risk Prediction for Epithelial Ovarian Cancer in 11 United States–Based Case-Control Studies: Incorporation of Epidemiologic Risk Factors and 17 Confirmed Genetic Loci

    Science.gov (United States)

    Clyde, Merlise A.; Palmieri Weber, Rachel; Iversen, Edwin S.; Poole, Elizabeth M.; Doherty, Jennifer A.; Goodman, Marc T.; Ness, Roberta B.; Risch, Harvey A.; Rossing, Mary Anne; Terry, Kathryn L.; Wentzensen, Nicolas; Whittemore, Alice S.; Anton-Culver, Hoda; Bandera, Elisa V.; Berchuck, Andrew; Carney, Michael E.; Cramer, Daniel W.; Cunningham, Julie M.; Cushing-Haugen, Kara L.; Edwards, Robert P.; Fridley, Brooke L.; Goode, Ellen L.; Lurie, Galina; McGuire, Valerie; Modugno, Francesmary; Moysich, Kirsten B.; Olson, Sara H.; Pearce, Celeste Leigh; Pike, Malcolm C.; Rothstein, Joseph H.; Sellers, Thomas A.; Sieh, Weiva; Stram, Daniel; Thompson, Pamela J.; Vierkant, Robert A.; Wicklund, Kristine G.; Wu, Anna H.; Ziogas, Argyrios; Tworoger, Shelley S.; Schildkraut, Joellen M.

    2016-01-01

    Previously developed models for predicting absolute risk of invasive epithelial ovarian cancer have included a limited number of risk factors and have had low discriminatory power (area under the receiver operating characteristic curve (AUC) < 0.60). Because of this, we developed and internally validated a relative risk prediction model that incorporates 17 established epidemiologic risk factors and 17 genome-wide significant single nucleotide polymorphisms (SNPs) using data from 11 case-control studies in the United States (5,793 cases; 9,512 controls) from the Ovarian Cancer Association Consortium (data accrued from 1992 to 2010). We developed a hierarchical logistic regression model for predicting case-control status that included imputation of missing data. We randomly divided the data into an 80% training sample and used the remaining 20% for model evaluation. The AUC for the full model was 0.664. A reduced model without SNPs performed similarly (AUC = 0.649). Both models performed better than a baseline model that included age and study site only (AUC = 0.563). The best predictive power was obtained in the full model among women younger than 50 years of age (AUC = 0.714); however, the addition of SNPs increased the AUC the most for women older than 50 years of age (AUC = 0.638 vs. 0.616). Adapting this improved model to estimate absolute risk and evaluating it in prospective data sets is warranted. PMID:27698005

  1. Improved Ancestry Estimation for both Genotyping and Sequencing Data using Projection Procrustes Analysis and Genotype Imputation

    Science.gov (United States)

    Wang, Chaolong; Zhan, Xiaowei; Liang, Liming; Abecasis, Gonçalo R.; Lin, Xihong

    2015-01-01

    Accurate estimation of individual ancestry is important in genetic association studies, especially when a large number of samples are collected from multiple sources. However, existing approaches developed for genome-wide SNP data do not work well with modest amounts of genetic data, such as in targeted sequencing or exome chip genotyping experiments. We propose a statistical framework to estimate individual ancestry in a principal component ancestry map generated by a reference set of individuals. This framework extends and improves upon our previous method for estimating ancestry using low-coverage sequence reads (LASER 1.0) to analyze either genotyping or sequencing data. In particular, we introduce a projection Procrustes analysis approach that uses high-dimensional principal components to estimate ancestry in a low-dimensional reference space. Using extensive simulations and empirical data examples, we show that our new method (LASER 2.0), combined with genotype imputation on the reference individuals, can substantially outperform LASER 1.0 in estimating fine-scale genetic ancestry. Specifically, LASER 2.0 can accurately estimate fine-scale ancestry within Europe using either exome chip genotypes or targeted sequencing data with off-target coverage as low as 0.05×. Under the framework of LASER 2.0, we can estimate individual ancestry in a shared reference space for samples assayed at different loci or by different techniques. Therefore, our ancestry estimation method will accelerate discovery in disease association studies not only by helping model ancestry within individual studies but also by facilitating combined analysis of genetic data from multiple sources. PMID:26027497

  2. Combination of individual tree detection and area-based approach in imputation of forest variables using airborne laser data

    Science.gov (United States)

    Vastaranta, Mikko; Kankare, Ville; Holopainen, Markus; Yu, Xiaowei; Hyyppä, Juha; Hyyppä, Hannu

    2012-01-01

    The two main approaches to deriving forest variables from laser-scanning data are the statistical area-based approach (ABA) and individual tree detection (ITD). With ITD it is feasible to acquire single tree information, as in field measurements. Here, ITD was used for measuring training data for the ABA. In addition to automatic ITD (ITD auto), we tested a combination of ITD auto and visual interpretation (ITD visual). ITD visual had two stages: in the first, ITD auto was carried out and in the second, the results of the ITD auto were visually corrected by interpreting three-dimensional laser point clouds. The field data comprised 509 circular plots (r = 10 m) that were divided equally for testing and training. ITD-derived forest variables were used for training the ABA and the accuracies of the k-most similar neighbor (k-MSN) imputations were evaluated and compared with the ABA trained with traditional measurements. The root-mean-squared error (RMSE) in the mean volume was 24.8%, 25.9%, and 27.2% with the ABA trained with field measurements, ITD auto, and ITD visual, respectively. When ITD methods were applied in acquiring training data, the mean volume, basal area, and basal area-weighted mean diameter were underestimated in the ABA by 2.7-9.2%. This project constituted a pilot study for using ITD measurements as training data for the ABA. Further studies are needed to reduce the bias and to determine the accuracy obtained in imputation of species-specific variables. The method could be applied in areas with sparse road networks or when the costs of fieldwork must be minimized.

  3. Discovery and Fine-Mapping of Glycaemic and Obesity-Related Trait Loci Using High-Density Imputation.

    Directory of Open Access Journals (Sweden)

    Momoko Horikoshi

    2015-07-01

    Reference panels from the 1000 Genomes (1000G) Project Consortium provide near complete coverage of common and low-frequency genetic variation with minor allele frequency ≥0.5% across European ancestry populations. Within the European Network for Genetic and Genomic Epidemiology (ENGAGE) Consortium, we have undertaken the first large-scale meta-analysis of genome-wide association studies (GWAS), supplemented by 1000G imputation, for four quantitative glycaemic and obesity-related traits, in up to 87,048 individuals of European ancestry. We identified two loci for body mass index (BMI) at genome-wide significance, and two for fasting glucose (FG), none of which has been previously reported in larger meta-analysis efforts to combine GWAS of European ancestry. Through conditional analysis, we also detected multiple distinct signals of association mapping to established loci for waist-hip ratio adjusted for BMI (RSPO3) and FG (GCK and G6PC2). The index variant for one association signal at the G6PC2 locus is a low-frequency coding allele, H177Y, which has recently been demonstrated to have a functional role in glucose regulation. Fine-mapping analyses revealed that the non-coding variants most likely to drive association signals at established and novel loci were enriched for overlap with enhancer elements, which for FG mapped to promoter and transcription factor binding sites in pancreatic islets, in particular. Our study demonstrates that 1000G imputation and genetic fine-mapping of common and low-frequency variant association signals at GWAS loci, integrated with genomic annotation in relevant tissues, can provide insight into the functional and regulatory mechanisms through which their effects on glycaemic and obesity-related traits are mediated.

  4. Discovery and Fine-Mapping of Glycaemic and Obesity-Related Trait Loci Using High-Density Imputation.

    Science.gov (United States)

    Horikoshi, Momoko; Mägi, Reedik; van de Bunt, Martijn; Surakka, Ida; Sarin, Antti-Pekka; Mahajan, Anubha; Marullo, Letizia; Thorleifsson, Gudmar; Hägg, Sara; Hottenga, Jouke-Jan; Ladenvall, Claes; Ried, Janina S; Winkler, Thomas W; Willems, Sara M; Pervjakova, Natalia; Esko, Tõnu; Beekman, Marian; Nelson, Christopher P; Willenborg, Christina; Wiltshire, Steven; Ferreira, Teresa; Fernandez, Juan; Gaulton, Kyle J; Steinthorsdottir, Valgerdur; Hamsten, Anders; Magnusson, Patrik K E; Willemsen, Gonneke; Milaneschi, Yuri; Robertson, Neil R; Groves, Christopher J; Bennett, Amanda J; Lehtimäki, Terho; Viikari, Jorma S; Rung, Johan; Lyssenko, Valeriya; Perola, Markus; Heid, Iris M; Herder, Christian; Grallert, Harald; Müller-Nurasyid, Martina; Roden, Michael; Hypponen, Elina; Isaacs, Aaron; van Leeuwen, Elisabeth M; Karssen, Lennart C; Mihailov, Evelin; Houwing-Duistermaat, Jeanine J; de Craen, Anton J M; Deelen, Joris; Havulinna, Aki S; Blades, Matthew; Hengstenberg, Christian; Erdmann, Jeanette; Schunkert, Heribert; Kaprio, Jaakko; Tobin, Martin D; Samani, Nilesh J; Lind, Lars; Salomaa, Veikko; Lindgren, Cecilia M; Slagboom, P Eline; Metspalu, Andres; van Duijn, Cornelia M; Eriksson, Johan G; Peters, Annette; Gieger, Christian; Jula, Antti; Groop, Leif; Raitakari, Olli T; Power, Chris; Penninx, Brenda W J H; de Geus, Eco; Smit, Johannes H; Boomsma, Dorret I; Pedersen, Nancy L; Ingelsson, Erik; Thorsteinsdottir, Unnur; Stefansson, Kari; Ripatti, Samuli; Prokopenko, Inga; McCarthy, Mark I; Morris, Andrew P

    2015-07-01

    Reference panels from the 1000 Genomes (1000G) Project Consortium provide near complete coverage of common and low-frequency genetic variation with minor allele frequency ≥0.5% across European ancestry populations. Within the European Network for Genetic and Genomic Epidemiology (ENGAGE) Consortium, we have undertaken the first large-scale meta-analysis of genome-wide association studies (GWAS), supplemented by 1000G imputation, for four quantitative glycaemic and obesity-related traits, in up to 87,048 individuals of European ancestry. We identified two loci for body mass index (BMI) at genome-wide significance, and two for fasting glucose (FG), none of which has been previously reported in larger meta-analysis efforts to combine GWAS of European ancestry. Through conditional analysis, we also detected multiple distinct signals of association mapping to established loci for waist-hip ratio adjusted for BMI (RSPO3) and FG (GCK and G6PC2). The index variant for one association signal at the G6PC2 locus is a low-frequency coding allele, H177Y, which has recently been demonstrated to have a functional role in glucose regulation. Fine-mapping analyses revealed that the non-coding variants most likely to drive association signals at established and novel loci were enriched for overlap with enhancer elements, which for FG mapped to promoter and transcription factor binding sites in pancreatic islets, in particular. Our study demonstrates that 1000G imputation and genetic fine-mapping of common and low-frequency variant association signals at GWAS loci, integrated with genomic annotation in relevant tissues, can provide insight into the functional and regulatory mechanisms through which their effects on glycaemic and obesity-related traits are mediated.

  5. Comparison of results from different imputation techniques for missing data from an anti-obesity drug trial

    DEFF Research Database (Denmark)

    Jørgensen, Anders W.; Lundstrøm, Lars H; Wetterslev, Jørn

    2014-01-01

    BACKGROUND: In randomised trials of medical interventions, the most reliable analysis follows the intention-to-treat (ITT) principle. However, the ITT analysis requires that missing outcome data have to be imputed. Different imputation techniques may give different results and some may lead to bias...... of handling missing data in a 60-week placebo controlled anti-obesity drug trial on topiramate. METHODS: We compared an analysis of complete cases with datasets where missing body weight measurements had been replaced using three different imputation methods: LOCF, baseline carried forward (BOCF) and MI...
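    Two of the single-imputation methods named in the record above, last observation carried forward (LOCF) and baseline observation carried forward (BOCF), can be sketched as follows. This is a minimal illustration under stated assumptions: the visit series and function names are hypothetical, missing visits are marked with `None`, and the baseline value is assumed to be observed. Multiple imputation (MI) is not shown.

```python
def locf(series):
    """Replace each missing value with the most recent observed value."""
    filled, last_seen = [], None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled

def bocf(series):
    """Replace each missing value with the baseline (first) value."""
    baseline = series[0]
    return [baseline if value is None else value for value in series]

# Hypothetical body weight (kg) at successive visits; None marks a missed visit
weights_kg = [100.0, 97.5, None, 95.0, None, None]

print(locf(weights_kg))
print(bocf(weights_kg))
```

    For a participant who loses weight and then drops out, LOCF keeps the last recorded loss while BOCF assumes a return to baseline, which is one way the choice of method can change the estimated treatment effect.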

  6. A genome-wide investigation of SNPs and CNVs in schizophrenia.

    Directory of Open Access Journals (Sweden)

    Anna C Need

    2009-02-01

    We report a genome-wide assessment of single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) in schizophrenia. We investigated SNPs using 871 patients and 863 controls, following up the top hits in four independent cohorts comprising 1,460 patients and 12,995 controls, all of European origin. We found no genome-wide significant associations, nor could we provide support for any previously reported candidate gene or genome-wide associations. We went on to examine CNVs using a subset of 1,013 cases and 1,084 controls of European ancestry, and a further set of 60 cases and 64 controls of African ancestry. We found that eight cases and zero controls carried deletions greater than 2 Mb, of which two, at 8p22 and 16p13.11-p12.4, are newly reported here. A further evaluation of 1,378 controls identified no deletions greater than 2 Mb, suggesting a high prior probability of disease involvement when such deletions are observed in cases. We also provide further evidence for some smaller, previously reported, schizophrenia-associated CNVs, such as those in NRXN1 and APBA2. We could not provide strong support for the hypothesis that schizophrenia patients have a significantly greater "load" of large (>100 kb), rare CNVs, nor could we find common CNVs that associate with schizophrenia. Finally, we did not provide support for the suggestion that schizophrenia-associated CNVs may preferentially disrupt genes in neurodevelopmental pathways. Collectively, these analyses provide the first integrated study of SNPs and CNVs in schizophrenia and support the emerging view that rare deleterious variants may be more important in schizophrenia predisposition than common polymorphisms. While our analyses do not suggest that implicated CNVs impinge on particular key pathways, we do support the contribution of specific genomic regions in schizophrenia, presumably due to recurrent mutation. On balance, these data suggest that very few schizophrenia patients

  7. Reduced Representation Libraries from DNA Pools Analysed with Next Generation Semiconductor Based-Sequencing to Identify SNPs in Extreme and Divergent Pigs for Back Fat Thickness

    Directory of Open Access Journals (Sweden)

    Samuele Bovo

    2015-01-01

    The aim of this study was to identify single nucleotide polymorphisms (SNPs) that could be associated with back fat thickness (BFT) in pigs. To achieve this goal, we evaluated the potential and limits of an experimental design that combined several methodologies. DNA samples from two groups of Italian Large White pigs with divergent estimated breeding value (EBV) for BFT were separately pooled and sequenced, after preparation of reduced representation libraries (RRLs), on the Ion Torrent technology. Using SNAPE for SNP calling in sequenced DNA pools, 39,165 SNPs were identified; 1/4 of them were novel variants not reported in dbSNP. Combining sequencing data with Illumina PorcineSNP60 BeadChip genotyping results on the same animals, 661 genomic positions overlapped with a good approximation of minor allele frequency estimation. A total of 54 SNPs showing enriched alleles in one or the other RRL might be potential markers associated with BFT. Some of these SNPs were close to genes involved in obesity-related phenotypes.

  8. Association of CAPN10 SNPs and haplotypes with polycystic ovary syndrome among South Indian Women.

    Directory of Open Access Journals (Sweden)

    Shilpi Dasgupta

    Polycystic Ovary Syndrome (PCOS) is known to be characterized by metabolic disorder in which hyperinsulinemia and peripheral insulin resistance are central features. Given the physiological overlap between PCOS and type-2 diabetes (T2DM), and calpain 10 gene (CAPN10) being a strong candidate for T2DM, a number of studies have analyzed CAPN10 SNPs among PCOS women, yielding contradictory results. Our study is the first of its kind to investigate the association pattern of CAPN10 polymorphisms (UCSNP-44, 43, 56, 19 and 63) with PCOS among Indian women. 250 PCOS cases and 299 controls from Southern India were recruited for this study. Allele and genotype frequencies of the SNPs were determined and compared between the cases and controls. Results show significant association of UCSNP-44 genotype CC with PCOS (p = 0.007) with highly significant odds ratio when compared to TC (OR = 2.51, p = 0.003, 95% CI = 1.37-4.61) as well as TT (OR = 1.94, p = 0.016, 95% CI = 1.13-3.34). While the haplotype carrying the SNP-44 and SNP-19 variants (21121) exhibited a 2-fold increase in the risk for PCOS (OR = 2.37, p = 0.03), the haplotype containing SNP-56 and SNP-19 variants (11221) seems to have a protective role against PCOS (OR = 0.20, p = 0.004). Our results support the earlier evidence for a possible role of UCSNP-44 of the CAPN10 gene in the manifestation of PCOS.
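    Odds ratios with 95% confidence intervals of the kind reported in the record above are commonly derived from a 2x2 table of genotype counts. The sketch below uses the standard log odds ratio with a Woolf (normal approximation) interval; the counts are hypothetical and are not the study's data, and the abstract does not state which interval method the authors used.

```python
import math

def odds_ratio_ci(cases_with, cases_without, controls_with, controls_without, z=1.96):
    """Odds ratio and Woolf confidence interval from a 2x2 table of counts."""
    odds_ratio = (cases_with * controls_without) / (cases_without * controls_with)
    # Standard error of the log odds ratio
    se = math.sqrt(1 / cases_with + 1 / cases_without
                   + 1 / controls_with + 1 / controls_without)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Hypothetical counts: carriers vs. non-carriers of a genotype among cases and controls
odds_ratio, lower, upper = odds_ratio_ci(40, 60, 20, 80)
print(f"OR = {odds_ratio:.2f}, 95% CI = {lower:.2f}-{upper:.2f}")
```

    An interval that excludes 1 corresponds to an association that is significant at roughly the 5% level under this approximation.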

  9. Interactions between SNPs affecting inflammatory response genes are associated with multiple myeloma disease risk and survival

    DEFF Research Database (Denmark)

    Nielsen, Kaspar René; Rodrigo-Domingo, Maria; Steffensen, Rudi

    2017-01-01

    The origin of multiple myeloma depends on interactions with stromal cells in the course of normal B-cell differentiation and evolution of immunity. The concept of the present study is that genes involved in MM pathogenesis, such as immune response genes, can be identified by screening for single......3L1 gene promoters. The occurrence of single polymorphisms, haplotypes and SNP-SNP interactions were statistically analyzed for association with disease risk and outcome following high-dose therapy. Identified genes that carried SNPs or haplotypes that were identified as risk or prognostic factors......= .005). The 'risk genes' were analyzed for expression in normal B-cell subsets (N = 6) from seven healthy donors and we found TNFA and IL-6 expressed both in naïve and in memory B cells when compared to preBI, II, immature and plasma cells. The 'prognosis genes' CHI3L1, IL-6 and IL-10 were differential...

  10. Estimation of Tree Lists from Airborne Laser Scanning Using Tree Model Clustering and k-MSN Imputation

    Directory of Open Access Journals (Sweden)

    Jörgen Wallerman

    2013-04-01

    Individual tree crowns may be delineated from airborne laser scanning (ALS) data by segmentation of surface models or by 3D analysis. Segmentation of surface models benefits from using a priori knowledge about the proportions of tree crowns, which has not yet been utilized for 3D analysis to any great extent. In this study, an existing surface segmentation method was used as a basis for a new tree model 3D clustering method applied to ALS returns in 104 circular field plots with 12 m radius in pine-dominated boreal forest (64°14'N, 19°50'E). For each cluster below the tallest canopy layer, a parabolic surface was fitted to model a tree crown. The tree model clustering identified more trees than segmentation of the surface model, especially smaller trees below the tallest canopy layer. Stem attributes were estimated with k-Most Similar Neighbours (k-MSN) imputation of the clusters based on field-measured trees. The accuracy at plot level from the k-MSN imputation (stem density root mean square error or RMSE 32.7%; stem volume RMSE 28.3%) was similar to the corresponding results from the surface model (stem density RMSE 33.6%; stem volume RMSE 26.1%) with leave-one-out cross-validation for one field plot at a time. Three-dimensional analysis of ALS data should also be evaluated in multi-layered forests since it identified a larger number of small trees below the tallest canopy layer.

  11. Estimating past hepatitis C infection risk from reported risk factor histories: implications for imputing age of infection and modeling fibrosis progression

    Directory of Open Access Journals (Sweden)

    Busch Michael P

    2007-12-01

    Background: Chronic hepatitis C virus infection is prevalent and often causes hepatic fibrosis, which can progress to cirrhosis and cause liver cancer or liver failure. Study of fibrosis progression often relies on imputing the time of infection, often as the reported age of first injection drug use. We sought to examine the accuracy of such imputation and implications for modeling factors that influence progression rates. Methods: We analyzed cross-sectional data on hepatitis C antibody status and reported risk factor histories from two large studies, the Women's Interagency HIV Study and the Urban Health Study, using modern survival analysis methods for current status data to model past infection risk year by year. We compared fitted distributions of past infection risk to reported age of first injection drug use. Results: Although injection drug use appeared to be a very strong risk factor, models for both studies showed that many subjects had considerable probability of having been infected substantially before or after their reported age of first injection drug use. Persons reporting younger age of first injection drug use were more likely to have been infected after, and persons reporting older age of first injection drug use were more likely to have been infected before. Conclusion: In cross-sectional studies of fibrosis progression where date of HCV infection is estimated from risk factor histories, modern methods such as multiple imputation should be used to account for the substantial uncertainty about when infection occurred. The models presented here can provide the inputs needed by such methods. Using reported age of first injection drug use as the time of infection in studies of fibrosis progression is likely to produce a spuriously strong association of younger age of infection with slower rate of progression.

  12. Association of six CpG-SNPs in the inflammation-related genes with coronary heart disease.

    Science.gov (United States)

    Chen, Xiaomin; Chen, Xiaoying; Xu, Yan; Yang, William; Wu, Nan; Ye, Huadan; Yang, Jack Y; Hong, Qingxiao; Xin, Yanfei; Yang, Mary Qu; Deng, Youping; Duan, Shiwei

    2016-07-25

    Chronic inflammation has been widely considered to be the major risk factor for coronary heart disease (CHD). The goal of our study was to explore the possible association with CHD of inflammation-related single nucleotide polymorphisms (SNPs) located in cytosine-phosphate-guanine (CpG) dinucleotides. A total of 784 CHD patients and 739 non-CHD controls were recruited from Zhejiang Province, China. Using the Sequenom MassARRAY platform, we measured the genotypes of six inflammation-related CpG-SNPs: IL1B rs16944, IL1R2 rs2071008, PLA2G7 rs9395208, FAM5C rs12732361, CD40 rs1800686, and CD36 rs2065666. Allele and genotype frequencies were compared between CHD and non-CHD individuals using the CLUMP22 software with 10,000 Monte Carlo simulations. Allelic tests showed that PLA2G7 rs9395208 and CD40 rs1800686 were significantly associated with CHD. Moreover, IL1B rs16944, PLA2G7 rs9395208, and CD40 rs1800686 were shown to be associated with CHD under the dominant model. Further gender-based subgroup tests showed that one SNP (CD40 rs1800686) was associated with CHD in females and two SNPs (FAM5C rs12732361 and CD36 rs2065666) were associated with CHD in males. The age-based subgroup tests indicated that PLA2G7 rs9395208, IL1B rs16944, and CD40 rs1800686 were associated with CHD among individuals younger than 55, younger than 65, and over 65, respectively. In conclusion, all six inflammation-related CpG-SNPs (rs16944, rs2071008, rs12732361, rs2065666, rs9395208, and rs1800686) were associated with CHD in the combined or subgroup tests, suggesting an important role of inflammation in the risk of CHD.

  13. Molecular genetics of nicotine dependence and abstinence: whole genome association using 520,000 SNPs

    Directory of Open Access Journals (Sweden)

    Walther Donna

    2007-04-01

    Background: Classical genetic studies indicate that nicotine dependence is a substantially heritable complex disorder. Genetic vulnerabilities to nicotine dependence largely overlap with genetic vulnerabilities to dependence on other addictive substances. Successful abstinence from nicotine displays substantial heritable components as well. Some of the heritability for the ability to quit smoking appears to overlap with the genetics of nicotine dependence and some does not. We now report genome-wide association studies of nicotine-dependent individuals who were successful in abstaining from cigarette smoking, nicotine-dependent individuals who were not successful in abstaining, and ethnically matched control subjects free from substantial lifetime use of any addictive substance. Results: These data, and their comparison with data that we have previously obtained from comparisons of four other substance-dependent vs control samples, support two main ideas: 1) Single nucleotide polymorphisms (SNPs) whose allele frequencies distinguish nicotine-dependent from control individuals identify a set of genes that overlaps significantly with the set of genes that contain markers whose allelic frequencies distinguish the four other substance-dependent vs control groups (p […] vs unsuccessful abstainers cluster in small genomic regions in ways that are highly unlikely to be due to chance (Monte Carlo p […]). Conclusion: These clustered SNPs nominate candidate genes for successful abstinence from smoking that are implicated in interesting functions: cell adhesion, enzymes, transcriptional regulators, neurotransmitters and receptors, and regulation of DNA, RNA and proteins. As these observations are replicated, they will provide an increasingly strong basis for understanding mechanisms of successful abstinence, for identifying individuals more or less likely to succeed in smoking cessation efforts and for tailoring therapies so that genotypes can help match smokers

  14. Analysis of the genetic structure of the Malay population: Ancestry-informative marker SNPs in the Malay of Peninsular Malaysia.

    Science.gov (United States)

    Yahya, Padillah; Sulong, Sarina; Harun, Azian; Wan Isa, Hatin; Ab Rajab, Nur-Shafawati; Wangkumhang, Pongsakorn; Wilantho, Alisa; Ngamphiw, Chumpol; Tongsima, Sissades; Zilfalil, Bin Alwi

    2017-09-01

    Malay, the main ethnic group in Peninsular Malaysia, is represented by various sub-ethnic groups such as Melayu Banjar, Melayu Bugis, Melayu Champa, Melayu Java, Melayu Kedah, Melayu Kelantan, Melayu Minang and Melayu Patani. Using data retrieved from the MyHVP (Malaysian Human Variome Project) database, a total of 135 individuals from these sub-ethnic groups were profiled using the Affymetrix GeneChip Mapping Xba 50-K single nucleotide polymorphism (SNP) array to identify SNPs that were ancestry-informative markers (AIMs) for Malays of Peninsular Malaysia. Prior to selecting the AIMs, the genetic structure of Malays was explored with reference to 11 other populations obtained from the Pan-Asian SNP Consortium database using principal component analysis (PCA) and ADMIXTURE. Iterative pruning principal component analysis (ipPCA) was further used to identify sub-groups of Malays. Subsequently, we constructed an AIMs panel for Malays using the informativeness for assignment (I_n) of genetic markers, and the K-nearest neighbor (KNN) classifier was used to train the classification models. A model of 250 SNPs ranked by I_n correctly classified Malay individuals with an accuracy of up to 90%. The identified panel of SNPs could be utilized as a panel of AIMs to ascertain the specific ancestry of Malays, which may be useful in disease association studies, biomedical research or forensic investigations. Copyright © 2017 Elsevier B.V. All rights reserved.
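
    The ranking step described in this abstract can be illustrated with a minimal sketch. It assumes the commonly used definition of informativeness for assignment, I_n = sum over alleles j of [ -p_j log p_j + sum over populations i of (p_ij / K) log p_ij ], where p_ij is the frequency of allele j in population i and p_j is its average over the K populations. The abstract does not state which variant the authors used, and the function names and example frequencies below are hypothetical.

```python
import math


def informativeness_for_assignment(freqs_by_pop):
    """Compute I_n for one marker.

    freqs_by_pop: list of K lists; each inner list holds the allele
    frequencies of the marker in one population (each summing to 1).
    """
    k = len(freqs_by_pop)
    n_alleles = len(freqs_by_pop[0])
    total = 0.0
    for j in range(n_alleles):
        # Average frequency of allele j across the K populations.
        p_bar = sum(pop[j] for pop in freqs_by_pop) / k
        if p_bar > 0:
            total -= p_bar * math.log(p_bar)
        for pop in freqs_by_pop:
            if pop[j] > 0:  # 0 * log(0) is treated as 0
                total += (pop[j] / k) * math.log(pop[j])
    return total


def rank_markers(markers):
    """markers: dict of marker name -> freqs_by_pop.

    Returns marker names sorted from most to least informative.
    """
    return sorted(
        markers,
        key=lambda name: informativeness_for_assignment(markers[name]),
        reverse=True,
    )


# Hypothetical biallelic markers in two populations.
example = {
    "snp_fixed": [[1.0, 0.0], [0.0, 1.0]],   # fixed difference
    "snp_mild": [[0.6, 0.4], [0.4, 0.6]],    # modest difference
    "snp_flat": [[0.3, 0.7], [0.3, 0.7]],    # no difference
}
ranking = rank_markers(example)
```

    A panel would then be formed from the top-ranked markers and passed to a classifier such as KNN; that step is omitted here.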

  15. Imputing forest carbon stock estimates from inventory plots to a nationally continuous coverage

    Directory of Open Access Journals (Sweden)

    Wilson Barry Tyler

    2013-01-01

    The U.S. has been providing national-scale estimates of forest carbon (C) stocks and stock change to meet United Nations Framework Convention on Climate Change (UNFCCC) reporting requirements for years. Although these currently are provided as national estimates by pool and year to meet greenhouse gas monitoring requirements, there is growing need to disaggregate these estimates to finer scales to enable strategic forest management and monitoring activities focused on various ecosystem services such as C storage enhancement. Through application of a nearest-neighbor imputation approach, spatially extant estimates of forest C density were developed for the conterminous U.S. using the U.S.’s annual forest inventory. Results suggest that an existing forest inventory plot imputation approach can be readily modified to provide raster maps of C density across a range of pools (e.g., live tree to soil organic carbon) and spatial scales (e.g., sub-county to biome). Comparisons among imputed maps indicate strong regional differences across C pools. The C density of pools closely related to detrital input (e.g., dead wood) is often highest in forests suffering from recent mortality events (e.g., beetle infestations) such as those in the northern Rocky Mountains. In contrast, live tree carbon density is often highest on the highest quality forest sites such as those found in the Pacific Northwest. Validation results suggest strong agreement between the estimates produced from the forest inventory plots and those from the imputed maps, particularly when the C pool is closely associated with the imputation model (e.g., aboveground live biomass and live tree basal area), with weaker agreement for detrital pools (e.g., standing dead trees). Forest inventory imputed plot maps provide an efficient and flexible approach to monitoring diverse C pools at national (e.g., UNFCCC) and regional scales (e.g., Reducing Emissions from Deforestation and Forest

  16. Massively parallel sequencing of 165 ancestry informative SNPs in two Chinese Tibetan-Burmese minority ethnicities.

    Science.gov (United States)

    Wang, Zheng; He, Guanglin; Luo, Tao; Zhao, Xueying; Liu, Jing; Wang, Mengge; Zhou, Di; Chen, Xu; Li, Chengtao; Hou, Yiping

    2018-05-01

    The Tibeto-Burman languages, one subfamily of the Sino-Tibetan languages, are spoken by over 60 million people all over East Asia. Yet the ethnic origin and genetic architecture of Tibeto-Burman speaking populations remain largely unexplored. In the present study, 169 Chinese individuals from Tibeto-Burman speaking populations (two ethnic groups: Tibetan and Yi) in four different geographic regions in western China were analyzed using the Precision ID Ancestry Panel (165 AISNPs) and the Ion PGM System. The performance and corresponding forensic statistical parameters of this AISNPs panel were investigated. Comprehensive population genetic comparisons (143 populations based on Kidd's SNPs, 92 populations based on Seldin's SNPs and 31 populations based on the Precision ID Ancestry Panel) and ancestry inference were further performed. Sequencing performance demonstrated that the Precision ID Ancestry Panel is effective and robust. Forensic characteristics suggested that this panel can be used not only for ancestry estimation of Tibeto-Burman populations but also for individual identification. Tibetan and Yi shared a common genetic ancestry but experienced a complex history of gene flow, local adaptation, and isolation, which shaped the specific genetic landscape of Highlander and Lowlander populations. Tibeto-Burman populations and other East Asian populations showed sufficient genetic differences to be distinguished into three distinct groups. Furthermore, analysis of population structure revealed significant genetic differences between populations from different continents and strong genetic affinity among populations within a continent. Additional population-specific AISNPs and a more comprehensive database with sufficient reference population data remain necessary to achieve finer-scale resolution among geographically proximate populations in East Asia. Copyright © 2018 Elsevier B.V. All rights

  17. A consensus linkage map of the grass carp (Ctenopharyngodon idella) based on microsatellites and SNPs

    Directory of Open Access Journals (Sweden)

    Li Jiale

    2010-02-01

    Background: Grass carp (Ctenopharyngodon idella) belongs to the family Cyprinidae, which includes more than 2000 fish species. It is one of the most important freshwater food fish species in world aquaculture. A linkage map is an essential framework for mapping traits of interest and is often the first step towards understanding genome evolution. The aim of this study is to construct a first-generation genetic map of grass carp using microsatellites and SNPs, to generate a new resource for mapping QTL for economically important traits, and to conduct a comparative mapping analysis to gain new insights into the evolution of fish genomes. Results: We constructed a first-generation linkage map of grass carp with a mapping panel containing two F1 families comprising 192 progeny. Sixteen SNPs in genes and 263 microsatellite markers were mapped to twenty-four linkage groups (LGs). The number of LGs corresponded to the haploid chromosome number of grass carp. The sex-specific maps were 1149.4 and 888.8 cM long in females and males, respectively, whereas the sex-averaged map spanned 1176.1 cM. The average resolution of the map was 4.2 cM/locus. BLAST searches of sequences of mapped grass carp markers against the whole genome sequence of zebrafish revealed a substantial macrosynteny relationship and extensive colinearity of markers between grass carp and zebrafish. Conclusions: The linkage map of grass carp presented here is the first linkage map of a food fish species in the family Cyprinidae based on co-dominant markers. This map provides a valuable resource for mapping phenotypic variation, serves as a reference for comparative genomics and for understanding the evolution of fish genomes, and could complement the grass carp genome sequencing project.

  18. Identification and analysis of genome-wide SNPs provide insight into signatures of selection and domestication in channel catfish (Ictalurus punctatus).

    Directory of Open Access Journals (Sweden)

    Luyang Sun

    Domestication and selection for important performance traits can impact the genome, which is most often reflected by reduced heterozygosity in and surrounding genes related to traits affected by selection. In this study, analysis of the genomic impact caused by domestication and artificial selection was conducted by investigating the signatures of selection using single nucleotide polymorphisms (SNPs) in channel catfish (Ictalurus punctatus). A total of 8.4 million candidate SNPs were identified by using next generation sequencing. On average, the channel catfish genome harbors one SNP per 116 bp. Approximately 6.6 million, 5.3 million, 4.9 million, 7.1 million and 6.7 million SNPs were detected in the Marion, Thompson, USDA103, Hatchery strain, and wild population, respectively. The allele frequencies of 407,861 SNPs differed significantly between the domestic and wild populations. With these SNPs, 23 genomic regions with putative selective sweeps were identified that included 11 genes. Although the function of the majority of the genes remains unknown in catfish, several genes with known function related to aquaculture performance traits were included in the regions with selective sweeps. These included hypoxia-inducible factor 1β (HIF1β) and the transporter gene ATP-binding cassette sub-family B member 5 (ABCB5). HIF1β is important for the response to hypoxia, and tolerance to low oxygen levels is a critical aquaculture trait. The large numbers of SNPs identified in this study are valuable for the development of high-density SNP arrays for genetic and genomic studies of performance traits in catfish.

  19. Semiautomatic imputation of activity travel diaries : use of global positioning system traces, prompted recall, and context-sensitive learning algorithms

    NARCIS (Netherlands)

    Moiseeva, A.; Jessurun, A.J.; Timmermans, H.J.P.; Stopher, P.

    2016-01-01

    Anastasia Moiseeva, Joran Jessurun and Harry Timmermans (2010), ‘Semiautomatic Imputation of Activity Travel Diaries: Use of Global Positioning System Traces, Prompted Recall, and Context-Sensitive Learning Algorithms’, Transportation Research Record: Journal of the Transportation Research Board,

  20. Using mi impute chained to fit ANCOVA models in randomized trials with censored dependent and independent variables

    DEFF Research Database (Denmark)

    Andersen, Andreas; Rieckmann, Andreas

    2016-01-01

    In this article, we illustrate how to use mi impute chained with intreg to fit an analysis of covariance (ANCOVA) of censored and nondetectable immunological concentrations measured in a randomized pretest–posttest design.

  1. Imputation of microsatellite alleles from dense SNP genotypes for parental verification

    Directory of Open Access Journals (Sweden)

    Matthew McClure

    2012-08-01

    Microsatellite (MS) markers have recently been used for parental verification and are still the international standard despite higher cost, error rate, and turnaround time compared with single nucleotide polymorphism (SNP)-based assays. Despite domestic and international interest from producers and research communities, no viable means currently exist to verify parentage for an individual unless all familial connections were analyzed using the same DNA marker type (MS or SNP). A simple and cost-effective method was devised to impute MS alleles from SNP haplotypes within breeds. For some MS, imputation results may allow inference across breeds. A total of 347 dairy cattle representing 4 dairy breeds (Brown Swiss, Guernsey, Holstein, and Jersey) were used to generate reference haplotypes. This approach has been verified (>98% accurate) for imputing the International Society of Animal Genetics (ISAG) recommended panel of 12 MS for cattle parentage verification across a validation set of 1,307 dairy animals. Implementation of this method will allow producers and breed associations to transition to SNP-based parentage verification utilizing MS genotypes from historical data on parents where SNP genotypes are missing. This approach may be applicable to additional cattle breeds and other species that wish to migrate from MS- to SNP-based parental verification.
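
    The idea of imputing an MS allele from a flanking SNP haplotype can be sketched as a simple lookup. This is only an illustration under stated assumptions, not the authors' procedure: it assumes phased SNP haplotypes are already available as strings, that each reference haplotype is paired with an observed MS allele size, and that an imputed call is accepted only when one allele dominates. The function names, the `min_support` threshold and the example data are all hypothetical.

```python
from collections import Counter, defaultdict


def build_reference(reference_pairs):
    """Build a haplotype -> Counter of MS alleles table.

    reference_pairs: iterable of (snp_haplotype, ms_allele) tuples taken
    from animals genotyped with both marker types, e.g. ("ACGT", 117).
    """
    table = defaultdict(Counter)
    for haplotype, allele in reference_pairs:
        table[haplotype][allele] += 1
    return table


def impute_ms_allele(table, haplotype, min_support=0.9):
    """Return the MS allele most often seen with `haplotype`.

    Returns None when the haplotype was never observed in the reference
    set or when no single allele reaches `min_support` of observations.
    """
    counts = table.get(haplotype)
    if not counts:
        return None
    allele, n = counts.most_common(1)[0]
    return allele if n / sum(counts.values()) >= min_support else None


# Hypothetical reference data: haplotype string and MS allele size in bp.
reference = build_reference([
    ("ACGT", 117), ("ACGT", 117), ("ACGT", 117),
    ("ACGA", 119), ("ACGA", 121),
])
call_clear = impute_ms_allele(reference, "ACGT")      # consistent haplotype
call_ambiguous = impute_ms_allele(reference, "ACGA")  # two alleles, no call
call_unseen = impute_ms_allele(reference, "TTTT")     # not in reference
```

    Building one table per breed would mirror the within-breed imputation the abstract describes; whether a table can be shared across breeds would have to be checked per marker.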

  2. TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION.

    Science.gov (United States)

    Allen, Genevera I; Tibshirani, Robert

    2010-06-01

    Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so-called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.

  3. Data Editing and Imputation in Business Surveys Using “R”

    Directory of Open Access Journals (Sweden)

    Elena Romascanu

    2014-06-01

    Purpose – Missing data are a recurring problem that can cause bias or lead to inefficient analyses. The objective of this paper is a direct comparison between two statistical software environments, R and SPSS, in order to take full advantage of existing automated methods for data editing and imputation in business surveys (with properly designed consistency rules) as a partial alternative to manual data editing. Approach – We compare methods for editing survey data in R using the ‘editrules’ and ‘survey’ packages, which contain transformations commonly used in official statistics; visualization of missing-value patterns using the ‘Amelia’ and ‘VIM’ packages; and imputation approaches for longitudinal data using ‘VIMGUI’. We also compare the performance of another statistical package, SPSS, on the same features. Findings – Business statistics data received by National Institutes of Statistics (NIS) are not ready for direct analysis because the collected data sets contain in-record inconsistencies, errors and missing values. The automatic methods in R packages make it possible to locate the erroneous fields in edit-violating records and to verify the results after imputation of missing values, giving users a flexible, less time-consuming approach that is easier to automate in R than with SPSS macro syntax, even in situations where macros are very handy.

  4. High-throughput SNP genotyping: combining tag SNPs and molecular beacons

    CSIR Research Space (South Africa)

    Barreiro, LB

    2009-10-01

    In the last decade, molecular beacons have emerged to become a widely used tool in the multiplex typing of single nucleotide polymorphisms (SNPs). Improvements in detection technologies in instrumentation and chemistries to label these probes have...

  5. Association analysis of IL10, TNF-α and IL23R-IL12RB2 SNPs with Behçet's disease risk in Western Algeria

    Directory of Open Access Journals (Sweden)

    Ouahiba Khaib Dit Naib

    2013-10-01

    Objective: We have conducted the first study of the association of interleukin (IL)-10, tumor necrosis factor alpha (TNF-α) and IL23R-IL12RB2 region SNPs with Behçet's disease (BD) in Western Algeria. Methods: A total of 51 BD patients and 96 unrelated controls from the western region of Algeria were genotyped by direct sequencing for 11 SNPs, including 2 SNPs from the IL10 promoter [c.-819T>C (rs1800871), c.-592A>C (rs1800872)], 6 SNPs from the TNF-α promoter [c.-1211T>C (rs1799964), c.-1043C>A (rs1800630), c.-1037C>T (rs1799724), c.-556G>A (rs1800750), c.-488G>A (rs1800629) and c.-418G>A (rs361525)], and 3 SNPs from the IL23R-IL12RB2 region [g.67747415A>C (rs12119179), g.67740092G>A (rs11209032) and g.67760140T>C (rs924080)]. Results: The minor alleles c.-819T and c.-592A were significantly associated with BD (OR = 2.18; 95% CI 1.28-3.73, p = 0.003), whereas there was a weaker association between the TNF-α promoter SNPs or the IL23R-IL12RB2 region and disease risk. Conclusion: Unlike the TNF-α and the IL23R-IL12RB2 region SNPs, the two IL10 SNPs were strongly associated with BD. The -819T and -592A alleles and the -819TT, -819CT, -592AA and -592CA genotypes seem to be highly involved in the risk of developing BD in the population of Western Algeria.
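
    For readers unfamiliar with how an allelic odds ratio and its confidence interval are obtained, the following sketch uses the standard 2x2 formula with a Wald (Woolf) interval on the log scale. The abstract does not say which interval method the authors used, and the allele counts below are made up for illustration; they are not the study's data.

```python
import math


def odds_ratio_ci(case_minor, case_major, ctrl_minor, ctrl_major, z=1.96):
    """Allelic odds ratio with a Wald (Woolf) confidence interval.

    Arguments are allele counts in a 2x2 table; z=1.96 gives a 95% CI.
    All four counts must be positive.
    """
    odds_ratio = (case_minor * ctrl_major) / (case_major * ctrl_minor)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / case_minor + 1 / case_major + 1 / ctrl_minor + 1 / ctrl_major
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts (not taken from the study).
or_value, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
```

    An interval that excludes 1, as reported for the IL10 alleles in the abstract, corresponds to a significant association at the chosen level.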

  6. A SNP Harvester Analysis to Better Detect SNPs of CCDC158 Gene That Are Associated with Carcass Quality Traits in Hanwoo

    Directory of Open Access Journals (Sweden)

    Jea-Young Lee

    2013-06-01

    The purpose of this study was to investigate interaction effects of genes using a Harvester method. A sample of Korean cattle, Hanwoo (n = 476), sired by 50 Korean proven bulls, was chosen from the National Livestock Research Institute (NLRI) of Korea. The steers were born between the spring of 1998 and the autumn of 2002 and reared under a progeny-testing program at the Daekwanryeong and Namwon branches of NLRI. The steers were slaughtered at approximately 24 months of age and carcass quality traits were measured. A SNP Harvester method was applied with a support vector machine (SVM) to detect significant SNPs in the CCDC158 gene and interaction effects between the SNPs that were associated with average daily gain, cold carcass weight, longissimus dorsi muscle area, and marbling score. The statistical significance of the major SNP combinations was evaluated with χ²-statistics. The genotype combination of three SNPs, g.34425+102A>T (AA), g.4102636T>G (GT), and g.11614+19G>T (GG), had a greater effect than the rest of the SNP combinations, e.g. 0.82 vs. 0.75 kg, 343 vs. 314 kg, 80.4 vs. 74.7 cm², and 7.35 vs. 5.01 for the four respective traits (p<0.001). Also, the estimates were greater than those for the single SNPs analyzed (the greatest estimates were 0.76 kg, 320 kg, 75.5 cm², and 5.31, respectively). This result suggests that the SNP Harvester method is a good option when multiple SNPs and interaction effects are tested. The significant SNPs could be applied to improve meat quality of Hanwoo via marker-assisted selection.

  7. A novel method for in silico identification of regulatory SNPs in human genome.

    Science.gov (United States)

    Li, Rong; Zhong, Dexing; Liu, Ruiling; Lv, Hongqiang; Zhang, Xinman; Liu, Jun; Han, Jiuqiang

    2017-02-21

    Regulatory single nucleotide polymorphisms (rSNPs), a kind of functional noncoding genetic variant, can affect gene expression in a regulatory way, and they are thought to be associated with increased susceptibility to complex diseases. Here a novel computational approach to identify potential rSNPs is presented. Unlike most other rSNP-finding methods, which are based on the hypothesis that SNPs causing large allele-specific changes in transcription factor binding affinities are more likely to play regulatory roles, we use a set of documented, experimentally verified rSNPs and nonfunctional background SNPs to train classifiers, so that the discriminating features are found. To characterize variants, an extensive range of characteristics, such as sequence context, DNA structure and evolutionary conservation, are analyzed. A support vector machine is adopted to build the classifier model, together with an ensemble method to deal with unbalanced data. The 10-fold cross-validation result shows that our method can achieve a sensitivity of ~78% and a specificity of ~82%. Furthermore, our method performs better than some other algorithms based on the aforementioned hypothesis in handling false positives. The original data and the MATLAB source code involved are available at https://sourceforge.net/projects/rsnppredict/. Copyright © 2016 Elsevier Ltd. All rights reserved.
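
    Because the abstract reports performance as sensitivity and specificity rather than plain accuracy, a short sketch of how those two figures are computed from predicted labels may help. The labels below are invented for illustration (1 = rSNP, 0 = background SNP) and the function name is hypothetical; nothing here reproduces the authors' classifier.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # rSNPs found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # rSNPs missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # background kept out
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical, deliberately unbalanced test fold: 4 rSNPs, 10 background.
truth = [1, 1, 1, 1] + [0] * 10
preds = [1, 1, 1, 0] + [0] * 8 + [1, 1]
sens, spec = sensitivity_specificity(truth, preds)
```

    With unbalanced classes, reporting both numbers avoids the situation where a classifier that labels everything as background still scores a high accuracy.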

  8. Characterization of the linkage disequilibrium structure and identification of tagging-SNPs in five DNA repair genes

    International Nuclear Information System (INIS)

    Allen-Brady, Kristina; Camp, Nicola J

    2005-01-01

    Characterization of the linkage disequilibrium (LD) structure of candidate genes is the basis for an effective association study of complex diseases such as cancer. In this study, we report the LD and haplotype architecture and tagging-single nucleotide polymorphisms (tSNPs) for five DNA repair genes: ATM, MRE11A, XRCC4, NBS1 and RAD50. The genes ATM, MRE11A, and XRCC4 were characterized using a panel of 94 unrelated female subjects (47 breast cancer cases, 47 controls) obtained from high-risk breast cancer families. A similar LD structure and tSNP analysis was performed for NBS1 and RAD50, using publicly available genotyping data. We studied a total of 61 SNPs at an average marker density of 10 kb. Using a matrix decomposition algorithm, based on principal component analysis, we captured >90% of the intragenic variation for each gene. Our results revealed that three of the five genes did not conform to a haplotype block structure (MRE11A, RAD50 and XRCC4). Instead, the data fit a more flexible LD group paradigm, where SNPs in high LD are not required to be contiguous. Traditional haplotype blocks assume recombination is the only dynamic at work. For ATM, MRE11A and XRCC4 we repeated the analysis in cases and controls separately to determine whether LD structure was consistent across breast cancer cases and controls. No substantial difference in LD structures was found. This study suggests that appropriate SNP selection for an association study involving candidate genes should allow for both mutation and recombination, which shape the population-level genomic structure. Furthermore, LD structure characterization in either breast cancer cases or controls appears to be sufficient for future cancer studies utilizing these genes.
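
    As background for the LD analysis summarized above, the pairwise statistics D, D' and r² for two biallelic SNPs can be computed from phased haplotype counts as sketched below. This is a textbook calculation and not the matrix decomposition method the authors describe; the function name and the counts are hypothetical.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Pairwise LD between two biallelic SNPs from haplotype counts.

    n_AB, n_Ab, n_aB, n_ab: counts of the four phased haplotypes.
    Returns (D, D_prime, r_squared). Raises ValueError if either SNP
    is monomorphic, since the statistics are undefined in that case.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both SNPs must be polymorphic")
    d = n_AB / n - p_A * p_B
    # Largest |D| attainable given the allele frequencies.
    if d > 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r_squared


# Hypothetical counts: complete LD, then no LD.
complete = ld_stats(50, 0, 0, 50)
independent = ld_stats(25, 25, 25, 25)
```

    SNP pairs with high r² carry largely redundant information, which is why one tagging SNP can stand in for others in the same LD group even when they are not contiguous.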

  9. Characterization of genomic variations in SNPs of PE_PGRS genes reveals deletions and insertions in extensively drug resistant (XDR) M. tuberculosis strains from Pakistan

    KAUST Repository

    Kanji, Akbar; Hasan, Zahra; Ali, Asho; McNerney, Ruth; Mallard, Kim; Coll, Francesc; Hill-Cawthorne, Grant A.; Nair, Mridul; Clark, Taane G.; Zaver, Ambreen; Jafri, Sana; Hasan, Rumina

    2015-01-01

    Genetic diversity in PE_PGRS genes contributes to antigenic variability and may result in increased immunogenicity of strains. This is the first study identifying variations in nsSNPs and INDELs in the PE_PGRS genes of XDR-TB strains from Pakistan. It highlights common genetic variations which may contribute to persistence.

  10. A Markov blanket-based method for detecting causal SNPs in GWAS

    Directory of Open Access Journals (Sweden)

    Han Bing

    2010-04-01

    Background: Detecting epistatic interactions associated with complex and common diseases can help to improve prevention, diagnosis and treatment of these diseases. With the development of genome-wide association studies (GWAS), designing powerful and robust computational methods for identifying epistatic interactions associated with common diseases has become a great challenge for the bioinformatics community, because the study of epistatic interactions often deals with large genotype data sets and a huge number of combinations of all the possible genetic factors. Most existing computational detection methods are based on the classification capacity of SNP sets, which may fail to identify SNP sets that are strongly associated with the diseases and may introduce many false positives. In addition, most methods are not suitable for genome-wide scale studies due to their computational complexity. Results: We propose a new Markov blanket-based method, DASSO-MB (Detection of ASSOciations using Markov Blanket), to detect epistatic interactions in case-control GWAS. The Markov blanket of a target variable T can completely shield T from all other variables. Thus, we can guarantee that the SNP set detected by DASSO-MB has a strong association with diseases and contains the fewest false positives. Furthermore, DASSO-MB uses a heuristic search strategy, calculating the association between variables to avoid the time-consuming training process of other machine-learning methods. We apply our algorithm to simulated datasets and a real case-control dataset. We compare DASSO-MB to other commonly used methods and show that our method significantly outperforms them and is capable of finding SNPs strongly associated with diseases. Conclusions: Our study shows that DASSO-MB can identify a minimal set of causal SNPs associated with diseases, which contains fewer false positives than other existing methods. Given the huge size of genomic dataset

  11. Comparação de métodos de imputação única e múltipla usando como exemplo um modelo de risco para mortalidade cirúrgica Comparison of simple and multiple imputation methods using a risk model for surgical mortality as example

    Directory of Open Access Journals (Sweden)

    Luciana Neves Nunes

    2010-12-01

    INTRODUCTION: Loss of information is a frequent problem in studies in the health field. In the literature this loss is called missing data. Imputing the missing data creates artificially complete data sets that can be analyzed with traditional statistical techniques. The objective of this article was to compare, in an example based on real data, the use of three different imputation techniques. METHODS: The data came from a study on the development of a surgical risk model, with a sample size of 450 patients. The imputation methods used were two single imputations and one multiple imputation (MI), and the assumed nonresponse mechanism was MAR (Missing at Random). RESULTS: The variable with missing data was serum albumin, with 27.1% missing. The models obtained with the single imputations were similar to each other but differed from those obtained with the MI-imputed data in terms of which variables were included in the models. CONCLUSIONS: The results indicate that taking into account the relationship between albumin and other observed variables makes a difference, since different models were obtained with single and multiple imputation. Single imputation underestimates variability, producing narrower confidence intervals. It is important to consider imputation methods when data are missing, especially MI, which takes the variability between imputations into account in the model estimates.
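
    The point that multiple imputation carries between-imputation variability into the final estimates can be made concrete with Rubin's rules for pooling. The sketch below pools one coefficient across m imputed data sets; the abstract does not describe the authors' software or settings, and the function name, estimates and variances are hypothetical.

```python
def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets (Rubin's rules).

    estimates: point estimates, one per imputed data set.
    variances: squared standard errors, one per imputed data set.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(variances) / m                       # average sampling variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between            # Rubin's total variance
    return pooled, total


# Hypothetical coefficient for serum albumin from m = 3 imputed data sets.
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

    A single imputation effectively reports only the within-imputation term (0.5 in this example), which is why its confidence intervals come out narrower than those from multiple imputation.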

  12. Common non-synonymous SNPs associated with breast cancer susceptibility: findings from the Breast Cancer Association Consortium.

    Science.gov (United States)

    Milne, Roger L; Burwinkel, Barbara; Michailidou, Kyriaki; Arias-Perez, Jose-Ignacio; Zamora, M Pilar; Menéndez-Rodríguez, Primitiva; Hardisson, David; Mendiola, Marta; González-Neira, Anna; Pita, Guillermo; Alonso, M Rosario; Dennis, Joe; Wang, Qin; Bolla, Manjeet K; Swerdlow, Anthony; Ashworth, Alan; Orr, Nick; Schoemaker, Minouk; Ko, Yon-Dschun; Brauch, Hiltrud; Hamann, Ute; Andrulis, Irene L; Knight, Julia A; Glendon, Gord; Tchatchou, Sandrine; Matsuo, Keitaro; Ito, Hidemi; Iwata, Hiroji; Tajima, Kazuo; Li, Jingmei; Brand, Judith S; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Lambrechts, Diether; Peuteman, Gilian; Christiaens, Marie-Rose; Smeets, Ann; Jakubowska, Anna; Lubinski, Jan; Jaworska-Bieniek, Katarzyna; Durda, Katazyna; Hartman, Mikael; Hui, Miao; Yen Lim, Wei; Wan Chan, Ching; Marme, Federick; Yang, Rongxi; Bugert, Peter; Lindblom, Annika; Margolin, Sara; García-Closas, Montserrat; Chanock, Stephen J; Lissowska, Jolanta; Figueroa, Jonine D; Bojesen, Stig E; Nordestgaard, Børge G; Flyger, Henrik; Hooning, Maartje J; Kriege, Mieke; van den Ouweland, Ans M W; Koppert, Linetta B; Fletcher, Olivia; Johnson, Nichola; dos-Santos-Silva, Isabel; Peto, Julian; Zheng, Wei; Deming-Halverson, Sandra; Shrubsole, Martha J; Long, Jirong; Chang-Claude, Jenny; Rudolph, Anja; Seibold, Petra; Flesch-Janys, Dieter; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Grip, Mervi; Cox, Angela; Cross, Simon S; Reed, Malcolm W R; Schmidt, Marjanka K; Broeks, Annegien; Cornelissen, Sten; Braaf, Linde; Kang, Daehee; Choi, Ji-Yeob; Park, Sue K; Noh, Dong-Young; Simard, Jacques; Dumont, Martine; Goldberg, Mark S; Labrèche, France; Fasching, Peter A; Hein, Alexander; Ekici, Arif B; Beckmann, Matthias W; Radice, Paolo; Peterlongo, Paolo; Azzollini, Jacopo; Barile, Monica; Sawyer, Elinor; Tomlinson, Ian; Kerin, Michael; Miller, Nicola; Hopper, John L; Schmidt, Daniel F; Makalic, Enes; Southey, Melissa C; Hwang Teo, Soo; Har Yip, Cheng; 
Sivanandan, Kavitta; Tay, Wan-Ting; Shen, Chen-Yang; Hsiung, Chia-Ni; Yu, Jyh-Cherng; Hou, Ming-Feng; Guénel, Pascal; Truong, Therese; Sanchez, Marie; Mulot, Claire; Blot, William; Cai, Qiuyin; Nevanlinna, Heli; Muranen, Taru A; Aittomäki, Kristiina; Blomqvist, Carl; Wu, Anna H; Tseng, Chiu-Chen; Van Den Berg, David; Stram, Daniel O; Bogdanova, Natalia; Dörk, Thilo; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Zhang, Ben; Couch, Fergus J; Toland, Amanda E; Yannoukakos, Drakoulis; Sangrajrang, Suleeporn; McKay, James; Wang, Xianshu; Olson, Janet E; Vachon, Celine; Purrington, Kristen; Severi, Gianluca; Baglietto, Laura; Haiman, Christopher A; Henderson, Brian E; Schumacher, Fredrick; Le Marchand, Loic; Devilee, Peter; Tollenaar, Robert A E M; Seynaeve, Caroline; Czene, Kamila; Eriksson, Mikael; Humphreys, Keith; Darabi, Hatef; Ahmed, Shahana; Shah, Mitul; Pharoah, Paul D P; Hall, Per; Giles, Graham G; Benítez, Javier; Dunning, Alison M; Chenevix-Trench, Georgia; Easton, Douglas F

    2014-11-15

    Candidate variant association studies have been largely unsuccessful in identifying common breast cancer susceptibility variants, although most studies have been underpowered to detect associations of a realistic magnitude. We assessed 41 common non-synonymous single-nucleotide polymorphisms (nsSNPs) for which evidence of association with breast cancer risk had been previously reported. Case-control data were combined from 38 studies of white European women (46 450 cases and 42 600 controls) and analyzed using unconditional logistic regression. Strong evidence of association was observed for three nsSNPs: ATXN7-K264R at 3p21 [rs1053338, per allele OR = 1.07, 95% confidence interval (CI) = 1.04-1.10, P = 2.9 × 10^-6], AKAP9-M463I at 7q21 (rs6964587, OR = 1.05, 95% CI = 1.03-1.07, P = 1.7 × 10^-6) and NEK10-L513S at 3p24 (rs10510592, OR = 1.10, 95% CI = 1.07-1.12, P = 5.1 × 10^-17). The first two associations reached genome-wide statistical significance in a combined analysis of available data, including independent data from nine genome-wide association studies (GWASs): for ATXN7-K264R, OR = 1.07 (95% CI = 1.05-1.10, P = 1.0 × 10^-8); for AKAP9-M463I, OR = 1.05 (95% CI = 1.04-1.07, P = 2.0 × 10^-10). Further analysis of other common variants in these two regions suggested that intronic SNPs nearby are more strongly associated with disease risk. We have thus identified a novel susceptibility locus at 3p21, and confirmed previous suggestive evidence that rs6964587 at 7q21 is associated with risk. The third locus, rs10510592, is located in an established breast cancer susceptibility region; the association was substantially attenuated after adjustment for the known GWAS hit. Thus, each of the associated nsSNPs is likely to be a marker for another, non-coding, variant causally related to breast cancer risk. Further fine-mapping and functional studies are required to identify the underlying risk-modifying variants and the genes through which they act. © The
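
    As a reading aid, the relation between a reported odds ratio, its 95% confidence interval and the P value can be sketched as follows. This assumes a Wald test on the log odds ratio with a normal reference distribution, which the abstract does not state; the helper name `z_and_p_from_or_ci` is hypothetical. Because the published OR and CI are rounded to two decimals, the recovered P value should only land near the published one, not on it.

```python
import math


def z_and_p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Recover the Wald z statistic and two-sided P value from an OR and its 95% CI."""
    # The CI is symmetric on the log scale, so its width gives the standard error.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p


# Rounded values reported for ATXN7-K264R (rs1053338) in the abstract above.
se, z, p = z_and_p_from_or_ci(1.07, 1.04, 1.10)
```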

  13. In silico analysis of single nucleotide polymorphism (SNPs) in human β-globin gene.

    Directory of Open Access Journals (Sweden)

    Mohammed Alanazi

    Full Text Available Single amino acid substitutions in the globin chain are the most common forms of genetic variations that produce hemoglobinopathies--the most widespread inherited disorders worldwide. Several hemoglobinopathies result from homozygosity or compound heterozygosity to beta-globin (HBB) gene mutations, such as that producing sickle cell hemoglobin (HbS), HbC, HbD and HbE. Several of these mutations are deleterious and result in moderate to severe hemolytic anemia, with associated complications, requiring lifelong care and management. Even though many hemoglobinopathies result from single amino acid changes producing similar structural abnormalities, there are functional differences in the generated variants. Using in silico methods, we examined the genetic variations that can alter the expression and function of the HBB gene. Using a sequence homology-based Sorting Intolerant from Tolerant (SIFT) server we have searched for the SNPs, which showed that 200 (80%) non-synonymous polymorphisms were found to be deleterious. The structure-based method via PolyPhen server indicated that 135 (40%) non-synonymous polymorphisms may modify protein function and structure. The Pupa Suite software showed that the SNPs will have a phenotypic consequence on the structure and function of the altered protein. Structure analysis was performed on the key mutations that occur in the native protein coded by the HBB gene that causes hemoglobinopathies such as: HbC (E→K), HbD (E→Q), HbE (E→K) and HbS (E→V). Atomic Non-Local Environment Assessment (ANOLEA), Yet Another Scientific Artificial Reality Application (YASARA), CHARMM-GUI webserver for macromolecular dynamics and mechanics, and Normal Mode Analysis, Deformation and Refinement (NOMAD-Ref) of Gromacs server were used to perform molecular dynamics simulations and energy minimization calculations on β-Chain residue of the HBB gene before and after mutation. Furthermore, in the native and altered protein models, amino acid

  14. Who cares and how much? The imputed economic contribution to the Canadian healthcare system of middle-aged and older unpaid caregivers providing care to the elderly.

    Science.gov (United States)

    Hollander, Marcus J; Liu, Guiping; Chappell, Neena L

    2009-01-01

    Canadians provide significant amounts of unpaid care to elderly family members and friends with long-term health problems. While some information is available on the nature of the tasks unpaid caregivers perform, and the amounts of time they spend on these tasks, the contribution of unpaid caregivers is often hidden. (It is recognized that some caregiving may be for short periods of time or may entail matters better described as "help" or "assistance," such as providing transportation. However, we use caregiving to cover the full range of unpaid care provided from some basic help to personal care.) Aggregate estimates of the market costs to replace the unpaid care provided are important to governments for policy development as they provide a means to situate the contributions of unpaid caregivers within Canada's healthcare system. The purpose of this study was to obtain an assessment of the imputed costs of replacing the unpaid care provided by Canadians to the elderly. (Imputed costs is used to refer to costs that would be incurred if the care provided by an unpaid caregiver was, instead, provided by a paid caregiver, on a direct hour-for-hour substitution basis.) The economic value of unpaid care as understood in this study is defined as the cost to replace the services provided by unpaid caregivers at rates for paid care providers.

  15. Analysis of association of clinical aspects and IL1B tagSNPs with severe preeclampsia.

    Science.gov (United States)

    Leme Galvão, Larissa Paes; Menezes, Filipe Emanuel; Mendonca, Caio; Barreto, Ikaro; Alvim-Pereira, Claudia; Alvim-Pereira, Fabiano; Gurgel, Ricardo

    2016-01-01

    This study investigates the association between IL1B genotypes using a tag SNP (single nucleotide polymorphism) approach, maternal and environmental factors in Brazilian women with severe preeclampsia. A case-control study with a total of 456 patients (169 preeclamptic women and 287 controls) was conducted in the two reference maternity hospitals of Sergipe state, Northeast Brazil. A questionnaire was administered and DNA was extracted to genotype the population for four tag SNPs of IL1B: rs1143643, rs1143633, rs1143634 and rs1143630. Haplotype association analysis and p-values were calculated using the THESIAS test. Odds ratio (OR) estimation, confidence interval (CI) and multivariate logistic regression were performed. High pregestational body mass index (pre-BMI), first gestation, cesarean section, more than six medical visits, low level of consciousness on admission and the TC and TT genotypes in rs1143630 of IL1B showed association with the preeclamptic group in univariate analysis. After multivariate logistic regression, pre-BMI, first gestation and low level of consciousness on admission remained associated. We identified an association between clinical variables and preeclampsia. Univariate analysis suggested that inflammatory process-related genes, such as IL1B, may be involved and should be targeted in further studies. The identification of the genetic background involved in preeclampsia host response modulation is mandatory in order to understand the preeclampsia process.

  16. Two Novel SNPs of PPARγ Significantly Affect Weaning Growth Traits of Nanyang Cattle.

    Science.gov (United States)

    Huang, Jieping; Chen, Ningbo; Li, Xin; An, Shanshan; Zhao, Minghui; Sun, Taihong; Hao, Ruijie; Ma, Yun

    2018-01-02

    Peroxisome-proliferator-activated receptor gamma (PPARγ) is a key transcription factor that controls adipocyte differentiation and energy in mammals. Therefore, PPARγ is a potential factor influencing animal growth traits. This study primarily evaluates PPARγ as a candidate gene for growth traits of cattle and identifies potential molecular markers for cattle breeding. Consistent with previous studies, PPARγ mRNA was expressed at extremely high levels mainly in adipose tissues, as shown by quantitative real-time polymerase chain reaction analysis. Three novel SNPs of the bovine PPARγ gene were identified in 514 individuals from six Chinese cattle breeds: SNP1 (AC_000179.1 g.57386668 C > G) in intron 2 and SNP2 (AC_000179.1 g.57431964 C > T) and SNP3 (AC_000179.1 g.57431994 T > C) in exon 7. The present study also investigated genetic characteristics of these SNP loci in six populations. Association analysis showed that the SNP1 and SNP3 loci significantly affect weaning growth traits, especially body weight, of Nanyang cattle. These results revealed that SNP1 and SNP3 are potential molecular markers for cattle breeding.

  17. Association of ESR1 gene tagging SNPs with breast cancer risk

    Science.gov (United States)

    Dunning, Alison M.; Healey, Catherine S.; Baynes, Caroline; Maia, Ana-Teresa; Scollen, Serena; Vega, Ana; Rodríguez, Raquel; Barbosa-Morais, Nuno L.; Ponder, Bruce A.J.; Low, Yen-Ling; Bingham, Sheila; Haiman, Christopher A.; Le Marchand, Loic; Broeks, Annegien; Schmidt, Marjanka K.; Hopper, John; Southey, Melissa; Beckmann, Matthias W.; Fasching, Peter A.; Peto, Julian; Johnson, Nichola; Bojesen, Stig E.; Nordestgaard, Børge; Milne, Roger L.; Benitez, Javier; Hamann, Ute; Ko, Yon; Schmutzler, Rita K.; Burwinkel, Barbara; Schürmann, Peter; Dörk, Thilo; Heikkinen, Tuomas; Nevanlinna, Heli; Lindblom, Annika; Margolin, Sara; Mannermaa, Arto; Kosma, Veli-Matti; Chen, Xiaoqing; Spurdle, Amanda; Chang-Claude, Jenny; Flesch-Janys, Dieter; Couch, Fergus J.; Olson, Janet E.; Severi, Gianluca; Baglietto, Laura; Børresen-Dale, Anne-Lise; Kristensen, Vessela; Hunter, David J.; Hankinson, Susan E.; Devilee, Peter; Vreeswijk, Maaike; Lissowska, Jolanta; Brinton, Louise; Liu, Jianjun; Hall, Per; Kang, Daehee; Yoo, Keun-Young; Shen, Chen-Yang; Yu, Jyh-Cherng; Anton-Culver, Hoda; Ziogas, Argyrios; Sigurdson, Alice; Struewing, Jeff; Easton, Douglas F.; Garcia-Closas, Montserrat; Humphreys, Manjeet K.; Morrison, Jonathan; Pharoah, Paul D.P.; Pooley, Karen A.; Chenevix-Trench, Georgia

    2009-01-01

    We have conducted a three-stage, comprehensive single nucleotide polymorphism (SNP)-tagging association study of ESR1 gene variants (SNPs) in more than 55 000 breast cancer cases and controls from studies within the Breast Cancer Association Consortium (BCAC). No large risks or highly significant associations were revealed. SNP rs3020314, tagging a region of ESR1 intron 4, is associated with an increase in breast cancer susceptibility with a dominant mode of action in European populations. Carriers of the c-allele have an odds ratio (OR) of 1.05 [95% Confidence Intervals (CI) 1.02–1.09] relative to t-allele homozygotes, P = 0.004. There is significant heterogeneity between studies, P = 0.002. The increased risk appears largely confined to oestrogen receptor-positive tumour risk. The region tagged by SNP rs3020314 contains sequence that is more highly conserved across mammalian species than the rest of intron 4, and it may subtly alter the ratio of two mRNA splice forms. PMID:19126777

  18. FunctSNP: an R package to link SNPs to functional knowledge and dbAutoMaker: a suite of Perl scripts to build SNP databases

    Directory of Open Access Journals (Sweden)

    Watson-Haigh Nathan S

    2010-06-01

    Full Text Available Abstract Background: Whole genome association studies using highly dense single nucleotide polymorphisms (SNPs) are a set of methods to identify DNA markers associated with variation in a particular complex trait of interest. One of the main outcomes from these studies is a subset of statistically significant SNPs. Finding the potential biological functions of such SNPs can be an important step towards further use in human and agricultural populations (e.g., for identifying genes related to susceptibility to complex diseases or genes playing key roles in development or performance). The current challenge is that the information holding the clues to SNP functions is distributed across many different databases. Efficient bioinformatics tools are therefore needed to seamlessly integrate up-to-date functional information on SNPs. Many web services have arisen to meet the challenge but most work only within the framework of human medical research. Although we acknowledge the importance of human research, we identify there is a need for SNP annotation tools for other organisms. Description: We introduce an R package called FunctSNP, which is the user interface to custom built species-specific databases. The local relational databases contain SNP data together with functional annotations extracted from online resources. FunctSNP provides a unified bioinformatics resource to link SNPs with functional knowledge (e.g., genes, pathways, ontologies). We also introduce dbAutoMaker, a suite of Perl scripts, which can be scheduled to run periodically to automatically create/update the customised SNP databases. We illustrate the use of FunctSNP with a livestock example, but the approach and software tools presented here can be applied also to human and other organisms. Conclusions: Finding the potential functional significance of SNPs is important when further using the outcomes from whole genome association studies. FunctSNP is unique in that it is the only R

  19. Mining the 3′UTR of Autism-implicated Genes for SNPs Perturbing MicroRNA Regulation

    Institute of Scientific and Technical Information of China (English)

    Varadharajan Vaishnavi; Mayakannan Manikandan; Arasambattu Kannan Munirajan

    2014-01-01

    Autism spectrum disorder (ASD) refers to a group of childhood neurodevelopmental disorders with polygenic etiology. The expression of many genes implicated in ASD is tightly regulated by various factors including microRNAs (miRNAs), a class of noncoding RNAs 22 nucleotides in length that function to suppress translation by pairing with ‘miRNA recognition elements’ (MREs) present in the 3′ untranslated region (3′UTR) of target mRNAs. This emphasizes the role played by miRNAs in regulating neurogenesis, brain development and differentiation and hence any perturbations in this regulatory mechanism might affect these processes as well. Recently, single nucleotide polymorphisms (SNPs) present within 3′UTRs of mRNAs have been shown to modulate existing MREs or even create new MREs. Therefore, we hypothesized that SNPs perturbing miRNA-mediated gene regulation might lead to aberrant expression of autism-implicated genes, thus resulting in disease predisposition or pathogenesis in at least a subpopulation of ASD individuals. We developed a systematic computational pipeline that integrates data from well-established databases. By following a stringent selection criterion, we identified 9 MRE-modulating SNPs and another 12 MRE-creating SNPs in the 3′UTR of autism-implicated genes. These high-confidence candidate SNPs may play roles in ASD and hence would be valuable for further functional validation.

  20. Partition dataset according to amino acid type improves the prediction of deleterious non-synonymous SNPs

    International Nuclear Information System (INIS)

    Yang, Jing; Li, Yuan-Yuan; Li, Yi-Xue; Ye, Zhi-Qiang

    2012-01-01

    Highlights: ► Proper dataset partition can improve the prediction of deleterious nsSNPs. ► Partition according to original residue type at nsSNP is a good criterion. ► Similar strategy is supposed promising in other machine learning problems. -- Abstract: Many non-synonymous SNPs (nsSNPs) are associated with diseases, and numerous machine learning methods have been applied to train classifiers for sorting disease-associated nsSNPs from neutral ones. The continuously accumulated nsSNP data allows us to further explore better prediction approaches. In this work, we partitioned the training data into 20 subsets according to either original or substituted amino acid type at the nsSNP site. Using support vector machine (SVM), training classification models on each subset resulted in an overall accuracy of 76.3% or 74.9% depending on the two different partition criteria, while training on the whole dataset obtained an accuracy of only 72.6%. Moreover, the dataset was also randomly divided into 20 subsets, but the corresponding accuracy was only 73.2%. Our results demonstrated that partitioning the whole training dataset into subsets properly, i.e., according to the residue type at the nsSNP site, will improve the performance of the trained classifiers significantly, which should be valuable in developing better tools for predicting the disease-association of nsSNPs.
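
    The partition-then-train idea can be sketched structurally as follows. This is an illustration only: the abstract's classifiers are SVMs trained on features it does not list, whereas the placeholder learner here simply predicts the majority label of its subset, and the record format, function names and toy labels are all hypothetical.

```python
from collections import Counter, defaultdict


def train_majority(labels):
    """Placeholder learner (stands in for an SVM): predict the most common label."""
    return Counter(labels).most_common(1)[0][0]


def train_partitioned(records):
    """Train one model per original amino acid type at the nsSNP site.

    records: iterable of (original_residue, label) pairs (hypothetical format).
    """
    by_residue = defaultdict(list)
    for residue, label in records:
        by_residue[residue].append(label)
    return {residue: train_majority(labels) for residue, labels in by_residue.items()}


def predict(models, residue, fallback):
    """Use the residue-specific model, or a whole-dataset fallback if none exists."""
    return models.get(residue, fallback)


# Toy training set: two residue types with different label distributions.
training = [
    ("R", "deleterious"), ("R", "deleterious"), ("R", "neutral"),
    ("A", "neutral"), ("A", "neutral"), ("A", "neutral"), ("A", "deleterious"),
]
models = train_partitioned(training)
fallback = train_majority([label for _, label in training])
```

    With a real learner, each subset model can pick up residue-specific patterns that a single model trained on the whole dataset would average away, which is the effect the reported accuracies point to.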

  1. Estimation of caries experience by multiple imputation and direct standardization

    NARCIS (Netherlands)

    Schuller, A. A.; Van Buuren, S.

    2014-01-01

    Valid estimates of caries experience are needed to monitor oral population health. Obtaining such estimates in practice is often complicated by nonresponse and missing data. The goal of this study was to estimate caries experiences in a population of children aged 5 and 11 years, in the presence of

  2. Estimation of Caries Experience by Multiple Imputation and Direct Standardization

    NARCIS (Netherlands)

    Schuller, A. A.; van Buuren, S.

    2014-01-01

    Valid estimates of caries experience are needed to monitor oral population health. Obtaining such estimates in practice is often complicated by nonresponse and missing data. The goal of this study was to estimate caries experiences in a population of children aged 5 and 11 years, in the presence of

  3. New insights into the Lake Chad Basin population structure revealed by high-throughput genotyping of mitochondrial DNA coding SNPs.

    Directory of Open Access Journals (Sweden)

    María Cerezo

    Full Text Available BACKGROUND: Located in the Sudan belt, the Chad Basin forms a remarkable ecosystem, where several unique agricultural and pastoral techniques have been developed. Both from an archaeological and a genetic point of view, this region has been interpreted to be the center of a bidirectional corridor connecting West and East Africa, as well as a meeting point for populations coming from North Africa through the Saharan desert. METHODOLOGY/PRINCIPAL FINDINGS: Samples from twelve ethnic groups from the Chad Basin (n = 542) have been high-throughput genotyped for 230 coding region mitochondrial DNA (mtDNA) Single Nucleotide Polymorphisms (mtSNPs) using Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight (MALDI-TOF) mass spectrometry. This set of mtSNPs allowed for much better phylogenetic resolution than previous studies of this geographic region, enabling new insights into its population history. Notable haplogroup (hg) heterogeneity has been observed in the Chad Basin mirroring the different demographic histories of these ethnic groups. As estimated using a Bayesian framework, nomadic populations showed negative growth which was not always correlated to their estimated effective population sizes. Nomads also showed lower diversity values than sedentary groups. CONCLUSIONS/SIGNIFICANCE: Compared to sedentary populations, nomads showed signals of stronger genetic drift occurring in their ancestral populations. These populations, however, retained more haplotype diversity in their hypervariable segments I (HVS-I), but not their mtSNPs, suggesting a more ancestral ethnogenesis. Whereas the nomadic population showed a higher Mediterranean influence signaled mainly by sub-lineages of M1, R0, U6, and U5, the other populations showed a more consistent sub-Saharan pattern. Although lifestyle may have an influence on diversity patterns and hg composition, analysis of molecular variance has not identified these differences. The present study indicates that

  4. Transcriptome characterization and high throughput SSRs and SNPs discovery in Cucurbita pepo (Cucurbitaceae).

    Science.gov (United States)

    Blanca, José; Cañizares, Joaquín; Roig, Cristina; Ziarsolo, Pello; Nuez, Fernando; Picó, Belén

    2011-02-10

    Cucurbita pepo belongs to the Cucurbitaceae family. The "Zucchini" types rank among the highest-valued vegetables worldwide, and other C. pepo and related Cucurbita spp. are food staples and rich sources of fat and vitamins. A broad range of genomic tools are today available for other cucurbits that have become models for the study of different metabolic processes. However, these tools are still lacking in the Cucurbita genus, thus limiting gene discovery and the process of breeding. We report the generation of a total of 512,751 C. pepo EST sequences, using 454 GS FLX Titanium technology. ESTs were obtained from normalized cDNA libraries (root, leaves, and flower tissue) prepared using two varieties with contrasting phenotypes for plant, flowering and fruit traits, representing the two C. pepo subspecies: subsp. pepo cv. Zucchini and subsp. ovifera cv. Scallop. De novo assembly was performed to generate a collection of 49,610 Cucurbita unigenes (average length of 626 bp) that represent the first transcriptome of the species. Over 60% of the unigenes were functionally annotated and assigned to one or more Gene Ontology terms. The distributions of Cucurbita unigenes followed tendencies similar to those reported for Arabidopsis or melon, suggesting that the dataset may represent the whole Cucurbita transcriptome. About 34% of the unigenes were found to have known orthologs in Arabidopsis or melon, including genes potentially involved in disease resistance, flowering and fruit quality. Furthermore, a set of 1,882 unigenes with SSR motifs and 9,043 high confidence SNPs between Zucchini and Scallop were identified, of which 3,538 SNPs met criteria for use with high throughput genotyping platforms, and 144 could be detected as CAPS. A set of markers was validated, 80% of which were polymorphic in a set of variable C. pepo and C. moschata accessions. We present the first broad survey of gene sequences and allelic variation in C. pepo, where limited prior genomic

  5. Novel approach identifies SNPs in SLC2A10 and KCNK9 with evidence for parent-of-origin effect on body mass index.

    Directory of Open Access Journals (Sweden)

    Clive J Hoggart

    2014-07-01

    Full Text Available The phenotypic effect of some single nucleotide polymorphisms (SNPs) depends on their parental origin. We present a novel approach to detect parent-of-origin effects (POEs) in genome-wide genotype data of unrelated individuals. The method exploits increased phenotypic variance in the heterozygous genotype group relative to the homozygous groups. We applied the method to >56,000 unrelated individuals to search for POEs influencing body mass index (BMI). Six lead SNPs were carried forward for replication in five family-based studies (of ∼4,000 trios). Two SNPs replicated: the paternal rs2471083-C allele (located near the imprinted KCNK9 gene) and the paternal rs3091869-T allele (located near the SLC2A10 gene) increased BMI equally (beta = 0.11 (SD), P<0.0027) compared to the respective maternal alleles. Real-time PCR experiments of lymphoblastoid cell lines from the CEPH families showed that expression of both genes was dependent on parental origin of the SNP alleles (P<0.01). Our scheme opens new opportunities to exploit GWAS data of unrelated individuals to identify POEs and demonstrates that they play an important role in adult obesity.
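
    The intuition behind the variance comparison: under a parent-of-origin effect, heterozygotes are a mixture of individuals who inherited the allele paternally and maternally, so their phenotypes spread more widely than those of either homozygous group. The sketch below only computes a descriptive variance ratio on toy data; it is not the authors' test statistic, and the function name, genotype coding (0, 1, 2 copies of an allele) and BMI values are hypothetical.

```python
from statistics import variance


def het_variance_ratio(genotypes, phenotypes):
    """Ratio of heterozygote phenotypic variance to pooled homozygote variance."""
    groups = {0: [], 1: [], 2: []}
    for g, y in zip(genotypes, phenotypes):
        groups[g].append(y)
    het = variance(groups[1])
    # Pool the two homozygous groups, weighting each by its degrees of freedom.
    hom_groups = [ys for g, ys in groups.items() if g != 1 and len(ys) >= 2]
    df = sum(len(ys) - 1 for ys in hom_groups)
    pooled_hom = sum((len(ys) - 1) * variance(ys) for ys in hom_groups) / df
    return het / pooled_hom


# Toy data: BMI-like values with a wider spread among heterozygotes.
genotypes = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
phenotypes = [24.0, 25.0, 26.0, 22.0, 24.0, 28.0, 30.0, 25.0, 26.0, 27.0]
ratio = het_variance_ratio(genotypes, phenotypes)
```

    A ratio well above 1 flags a SNP for follow-up; a genome-wide scan would additionally need a formal variance test and multiple-testing control.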

  6. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.

    Science.gov (United States)

    Lazar, Cosmin; Gatto, Laurent; Ferro, Myriam; Bruley, Christophe; Burger, Thomas

    2016-04-01

    Missing values are a genuine issue in label-free quantitative proteomics. Recent works have surveyed the different statistical methods to conduct imputation and have compared them on real or simulated data sets and recommended a list of missing value imputation methods for proteomics application. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may be of different natures and (ii) each imputation method is devoted to a specific type of missingness mechanism. As a result, we believe that the question at stake is not to find the most accurate imputation method in general but instead the most appropriate one. We describe a series of comparisons that support our views: For instance, we show that a supposedly "under-performing" method (i.e., giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state-of-the-art). This leads us to formulate a few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.

  7. Missing data in clinical trials: control-based mean imputation and sensitivity analysis.

    Science.gov (United States)

    Mehrotra, Devan V; Liu, Fang; Permutt, Thomas

    2017-09-01

    In some randomized (drug versus placebo) clinical trials, the estimand of interest is the between-treatment difference in population means of a clinical endpoint that is free from the confounding effects of "rescue" medication (e.g., HbA1c change from baseline at 24 weeks that would be observed without rescue medication regardless of whether or when the assigned treatment was discontinued). In such settings, a missing data problem arises if some patients prematurely discontinue from the trial or initiate rescue medication while in the trial, the latter necessitating the discarding of post-rescue data. We caution that the commonly used mixed-effects model repeated measures analysis with the embedded missing at random assumption can deliver an exaggerated estimate of the aforementioned estimand of interest. This happens, in part, due to implicit imputation of an overly optimistic mean for "dropouts" (i.e., patients with missing endpoint data of interest) in the drug arm. We propose an alternative approach in which the missing mean for the drug arm dropouts is explicitly replaced with either the estimated mean of the entire endpoint distribution under placebo (primary analysis) or a sequence of increasingly more conservative means within a tipping point framework (sensitivity analysis); patient-level imputation is not required. A supplemental "dropout = failure" analysis is considered in which a common poor outcome is imputed for all dropouts followed by a between-treatment comparison using quantile regression. All analyses address the same estimand and can adjust for baseline covariates. Three examples and simulation results are used to support our recommendations. Copyright © 2017 John Wiley & Sons, Ltd.
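
    The mean-level replacement described above can be sketched as a weighted mixture: the drug-arm mean combines the observed mean of patients with endpoint data and an imputed mean for dropouts, taken from placebo in the primary analysis and shifted by an offset in a tipping-point sweep. This is a simplified illustration under stated assumptions, not the authors' estimator: it ignores covariate adjustment and variance estimation, takes the placebo mean as a given input, and the function name, `delta` parameter and data are hypothetical.

```python
from statistics import mean


def control_based_drug_mean(drug_observed, n_drug_dropouts, placebo_mean, delta=0.0):
    """Drug-arm mean with the dropouts' mean replaced by placebo_mean + delta."""
    n_obs = len(drug_observed)
    dropout_fraction = n_drug_dropouts / (n_obs + n_drug_dropouts)
    imputed_dropout_mean = placebo_mean + delta  # delta = 0 is the primary analysis
    return (1 - dropout_fraction) * mean(drug_observed) + dropout_fraction * imputed_dropout_mean


# Hypothetical HbA1c changes from baseline (lower is better).
drug_observed = [-1.0, -1.2, -0.8, -1.0]   # four patients with endpoint data
placebo_mean = -0.2                        # assumed placebo-arm mean
drug_mean = control_based_drug_mean(drug_observed, 1, placebo_mean)
effect = drug_mean - placebo_mean

# Tipping-point sweep: progressively less favourable means for the dropouts.
sweep = [control_based_drug_mean(drug_observed, 1, placebo_mean, d) - placebo_mean
         for d in (0.0, 0.5, 1.0)]
```

    Only arm-level means are replaced, which matches the abstract's remark that patient-level imputation is not required.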

  8. Imputation of Baseline LDL Cholesterol Concentration in Patients with Familial Hypercholesterolemia on Statins or Ezetimibe.

    Science.gov (United States)

    Ruel, Isabelle; Aljenedil, Sumayah; Sadri, Iman; de Varennes, Émilie; Hegele, Robert A; Couture, Patrick; Bergeron, Jean; Wanneh, Eric; Baass, Alexis; Dufour, Robert; Gaudet, Daniel; Brisson, Diane; Brunham, Liam R; Francis, Gordon A; Cermakova, Lubomira; Brophy, James M; Ryomoto, Arnold; Mancini, G B John; Genest, Jacques

    2018-02-01

    Familial hypercholesterolemia (FH) is the most frequent genetic disorder seen clinically and is characterized by increased LDL cholesterol (LDL-C) (>95th percentile), family history of increased LDL-C, premature atherosclerotic cardiovascular disease (ASCVD) in the patient or in first-degree relatives, presence of tendinous xanthomas or premature corneal arcus, or presence of a pathogenic mutation in the LDLR, PCSK9, or APOB genes. A diagnosis of FH has important clinical implications with respect to lifelong risk of ASCVD and requirement for intensive pharmacological therapy. The concentration of baseline LDL-C (untreated) is essential for the diagnosis of FH but is often not available because the individual is already on statin therapy. To validate a new algorithm to impute baseline LDL-C, we examined 1297 patients. The baseline LDL-C was compared with the imputed baseline obtained within 18 months of the initiation of therapy. We compared the percent reduction in LDL-C on treatment from baseline with the published percent reductions. After eliminating individuals with missing data, nonstandard doses of statins, or medications other than statins or ezetimibe, we provide data on 951 patients. The mean ± SE baseline LDL-C was 243.0 (2.2) mg/dL [6.28 (0.06) mmol/L], and the mean ± SE imputed baseline LDL-C was 244.2 (2.6) mg/dL [6.31 (0.07) mmol/L] (P = 0.48). There was no difference in response according to the patient's sex or in percent reduction between observed and expected for individual doses or types of statin or ezetimibe. We provide a validated estimation of baseline LDL-C for patients with FH that may help clinicians in making a diagnosis. © 2017 American Association for Clinical Chemistry.

  9. Construction of High Density Sweet Cherry (Prunus avium L.) Linkage Maps Using Microsatellite Markers and SNPs Detected by Genotyping-by-Sequencing (GBS).

    Directory of Open Access Journals (Sweden)

    Verónica Guajardo

    Linkage maps are valuable tools in genetic and genomic studies. For sweet cherry, linkage maps have been constructed using mainly microsatellite markers (SSRs) and, recently, using single nucleotide polymorphism markers (SNPs) from a cherry 6K SNP array. Genotyping-by-sequencing (GBS), a new methodology based on high-throughput sequencing, holds great promise for identification of a high number of SNPs and construction of high density linkage maps. In this study, GBS was used to identify SNPs from an intra-specific sweet cherry cross. A total of 8,476 high quality SNPs were selected for mapping. The physical position for each SNP was determined using the peach genome, Peach v1.0, as reference, and a homogeneous distribution of markers along the eight peach scaffolds was obtained. On average, 65.6% of the SNPs were present in genic regions and 49.8% were located in exonic regions. In addition to the SNPs, a group of SSRs was also used for construction of linkage maps. Parental and consensus high density maps were constructed by genotyping 166 siblings from a 'Rainier' x 'Rivedel' (Ra x Ri) cross. Using the Ra x Ri population, 462, 489 and 985 markers were mapped into eight linkage groups in 'Rainier', 'Rivedel' and the Ra x Ri map, respectively, with 80% of mapped SNPs located in genic regions. The maps obtained spanned 549.5, 582.6 and 731.3 cM for the 'Rainier', 'Rivedel' and consensus maps, respectively, with an average distance of 1.2 cM between adjacent markers for both the 'Rainier' and 'Rivedel' maps and of 0.7 cM for the Ra x Ri map. High synteny and co-linearity were observed between the obtained maps and with Peach v1.0. These new high density linkage maps provide valuable information on the sweet cherry genome, and serve as the basis for identification of QTLs and genes relevant for the breeding of the species.

  10. Identifying tagging SNPs for African specific genetic variation from the African Diaspora Genome.

    Science.gov (United States)

    Johnston, Henry Richard; Hu, Yi-Juan; Gao, Jingjing; O'Connor, Timothy D; Abecasis, Gonçalo R; Wojcik, Genevieve L; Gignoux, Christopher R; Gourraud, Pierre-Antoine; Lizee, Antoine; Hansen, Mark; Genuario, Rob; Bullis, Dave; Lawley, Cindy; Kenny, Eimear E; Bustamante, Carlos; Beaty, Terri H; Mathias, Rasika A; Barnes, Kathleen C; Qin, Zhaohui S

    2017-04-21

    A primary goal of The Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) is to develop an 'African Diaspora Power Chip' (ADPC), a genotyping array consisting of tagging SNPs, useful in comprehensively identifying African specific genetic variation. This array is designed based on the novel variation identified in 642 CAAPA samples of African ancestry with high coverage whole genome sequence data (~30× depth). This novel variation extends the pattern of variation catalogued in the 1000 Genomes and Exome Sequencing Projects to a spectrum of populations representing the wide range of West African genomic diversity. These individuals from CAAPA also comprise a large swath of the African Diaspora population and incorporate historical genetic diversity covering nearly the entire Atlantic coast of the Americas. Here we show the results of designing and producing such a microchip array. This novel array covers African specific variation far better than other commercially available arrays, and will enable better GWAS analyses for researchers with individuals of African descent in their study populations. A recent study cataloging variation in continental African populations suggests this type of African-specific genotyping array is both necessary and valuable for facilitating large-scale GWAS in populations of African ancestry.

  11. Imputing Variants in HLA-DR Beta Genes Reveals That HLA-DRB1 Is Solely Associated with Rheumatoid Arthritis and Systemic Lupus Erythematosus.

    Directory of Open Access Journals (Sweden)

    Kwangwoo Kim

    The genetic association of HLA-DRB1 with rheumatoid arthritis (RA) and systemic lupus erythematosus (SLE) is well documented, but association with other HLA-DR beta genes (HLA-DRB3, HLA-DRB4 and HLA-DRB5) has not been thoroughly studied, despite their similar functions and chromosomal positions. We examined variants in all functional HLA-DR beta genes in RA and SLE patients and controls, down to the amino-acid level, to better understand disease association with the HLA-DR locus. To this end, we improved an existing HLA reference panel to impute variants in all protein-coding HLA-DR beta genes. Using the reference panel, HLA variants were inferred from high-density SNP data of 9,271 RA-control subjects and 5,342 SLE-control subjects. Disease association tests were performed by logistic regression and log-likelihood ratio tests. After imputation using the newly constructed HLA reference panel and statistical analysis, we observed that HLA-DRB1 variants better accounted for the association between MHC and susceptibility to RA and SLE than did the other three HLA-DRB variants. Moreover, there were no secondary effects in HLA-DRB3, HLA-DRB4, or HLA-DRB5 in RA or SLE. Of all the HLA-DR beta chain paralogs, those encoded by HLA-DRB1 solely or dominantly influence susceptibility to RA and SLE.

  12. The development of a high density linkage map for black tiger shrimp (Penaeus monodon) based on cSNPs.

    Directory of Open Access Journals (Sweden)

    Matthew Baranski

    Transcriptome sequencing using Illumina RNA-seq was performed on populations of black tiger shrimp from India. Samples were collected from (i) four landing centres around the east coastline (EC) of India, (ii) survivors of a severe WSSV infection during pond culture (SUR) and (iii) the Andaman Islands (AI) in the Bay of Bengal. Equal quantities of purified total RNA from homogenates of hepatopancreas, muscle, nervous tissue, intestinal tract, heart, gonad, gills, pleopod and lymphoid organs were combined to create AI, EC and SUR pools for RNA sequencing. De novo transcriptome assembly resulted in 136,223 contigs (minimum size 100 base pairs, bp) with a total length of 61 Mb, an average length of 446 bp and an average coverage of 163× across all pools. Approximately 16% of contigs were annotated with BLAST hit information and gene ontology annotations. A total of 473,620 putative SNPs/indels were identified. An Illumina iSelect genotyping array containing 6,000 SNPs was developed and used to genotype 1024 offspring belonging to seven full-sibling families. A total of 3959 SNPs were mapped to 44 linkage groups. The linkage groups consisted of between 16-129 and 13-130 markers, of length between 139-10.8 and 109.1-10.5 cM and with intervals averaging between 1.2 and 0.9 cM for the female and male maps respectively. The female map was 28% longer than the male map (4060 and 2917 cM respectively) with a 1.6 higher recombination rate observed for female compared to male meioses. This approach has substantially increased expressed sequence and DNA marker resources for tiger shrimp and is a useful resource for QTL mapping and association studies for evolutionarily and commercially important traits.

  13. Risk-Association of Five SNPs in TOX3/LOC643714 with Breast Cancer in Southern China

    Directory of Open Access Journals (Sweden)

    Xuanqiu He

    2014-01-01

    The specific mechanism by which low-risk genetic variants confer breast cancer risk is currently unclear, with contradictory evidence on the role of single nucleotide polymorphisms (SNPs) in TOX3/LOC643714 as a breast cancer susceptibility locus. Investigations of this locus using a Chinese population may indicate whether the findings initially identified in a European population are generalizable to other populations, and may provide new insight into the role of genetic variants in the etiology of breast cancer. In this case-control study, 623 Chinese female breast cancer patients and 620 cancer-free controls were recruited to investigate the role of five SNPs in TOX3/LOC643714 (rs8051542, rs12443621, rs3803662, rs4784227, and rs3112612). Linkage disequilibrium (LD) pattern analysis was performed. Additionally, we evaluated how these common SNPs influence the risk of specific types of breast cancer, as defined by estrogen receptor (ER) status, progesterone receptor (PR) status and human epidermal growth factor receptor 2 (HER2) status. Significant associations with breast cancer risk were observed for rs4784227 and rs8051542 with odds ratios (OR) of 1.31 (95% confidence interval (CI), 1.10–1.57) and 1.26 (95% CI, 1.02–1.56), respectively, per T allele. The T-rs8051542 allele was significantly associated with ER-positive and HER2-negative carriers. No significant association existed between rs12443621, rs3803662, and rs3112612 polymorphisms and risk of breast cancer. Our results support the hypothesis that the applicability of a common susceptibility locus must be confirmed among genetically different populations, which may together explain an appreciable fraction of the genetic etiology of breast cancer.

  14. A survey of endogenous retrovirus (ERV) sequences in the vicinity of multiple sclerosis (MS)-associated single nucleotide polymorphisms (SNPs).

    Science.gov (United States)

    Brütting, Christine; Emmer, Alexander; Kornhuber, Malte; Staege, Martin S

    2016-08-01

    Although multiple sclerosis (MS) is one of the most common central nervous system diseases in young adults, little is known about its etiology. Several human endogenous retroviruses (ERVs) are considered to play a role in MS. We are interested in which ERVs can be identified in the vicinity of MS associated genetic marker to find potential initiators of MS. We analysed the chromosomal regions surrounding 58 single nucleotide polymorphisms (SNPs) that are associated with MS identified in one of the last major genome wide association studies. We scanned these regions for putative endogenous retrovirus sequences with large open reading frames (ORFs). We observed that more retrovirus-related putative ORFs exist in the relatively close vicinity of SNP marker indices in multiple sclerosis compared to control SNPs. We found very high homologies to HERV-K, HCML-ARV, XMRV, Galidia ERV, HERV-H/env62 and XMRV-like mouse endogenous retrovirus mERV-XL. The associated genes (CYP27B1, CD6, CD58, MPV17L2, IL12RB1, CXCR5, PTGER4, TAGAP, TYK2, ICAM3, CD86, GALC, GPR65 as well as the HLA DRB1*1501) are mainly involved in the immune system, but also in vitamin D regulation. The most frequently detected ERV sequences are related to the multiple sclerosis-associated retrovirus, the human immunodeficiency virus 1, HERV-K, and the Simian foamy virus. Our data shows that there is a relation between MS associated SNPs and the number of retroviral elements compared to control. Our data identifies new ERV sequences that have not been associated with MS, so far.

  15. Missing Value Imputation Based on Gaussian Mixture Model for the Internet of Things

    OpenAIRE

    Yan, Xiaobo; Xiong, Weiqing; Hu, Liang; Wang, Feng; Zhao, Kuo

    2015-01-01

    This paper addresses missing value imputation for the Internet of Things (IoT). Nowadays, the IoT has been used widely and commonly by a variety of domains, such as transportation and logistics domain and healthcare domain. However, missing values are very common in the IoT for a variety of reasons, which results in the fact that the experimental data are incomplete. As a result of this, some work, which is related to the data of the IoT, can’t be carried out normally. And it leads to the red...

  16. Non-imputability, criminal dangerousness and curative safety measures: myths and realities

    Directory of Open Access Journals (Sweden)

    Frank Harbottle Quirós

    2017-04-01

    Curative safety measures are imposed in criminal proceedings on persons found non-imputable, provided that a prognosis affirmatively concludes that they are criminally dangerous. Although this statement seems elementary, several myths about these legal institutions persist in judicial practice, and their versions may vary, to a greater or lesser extent, between countries. In this context, the present article formulates ten myths based on the experience of Costa Rica and offers explanations that seek to weaken or dispel them, inviting the reader to reflect on them.

  17. Is really endogenous ghrelin a hunger signal in chickens? Association of GHSR SNPs with increase appetite, growth traits, expression and serum level of GHRL, and GH.

    Science.gov (United States)

    El-Magd, Mohammed Abu; Saleh, Ayman A; Abdel-Hamid, Tamer M; Saleh, Rasha M; Afifi, Mohammed A

    2016-10-01

    Chicken growth hormone secretagogue receptor (GHSR) is a receptor for ghrelin (GHRL), a peptide hormone produced by chicken proventriculus, which stimulates growth hormone (GH) release and food intake. The purpose of this study was to search for single nucleotide polymorphisms (SNPs) in exon 2 of GHSR gene and to analyze their effect on the appetite, growth traits and expression levels of GHSR, GHRL, and GH genes as well as serum levels of GH and GHRL in Mandara chicken. Two adjacent SNPs, A239G and G244A, were detected in exon 2 of GHSR gene. G244A SNP was non-synonymous mutation and led to replacement of lysine amino acid (aa) by arginine aa, while A239G SNP was synonymous mutation. The combined genotypes of A239G and G244A SNPs produced three haplotypes; GG/GG, GG/AG, AG/AG, which associated significantly (P4 to 16w. Chickens with the homozygous GG/GG haplotype showed higher growth performance than other chickens. The two SNPs were also correlated with mRNA levels of GHSR and GH (in pituitary gland), and GHRL (in proventriculus and hypothalamus) as well as with serum level of GH and GHRL. Also, chickens with GG/GG haplotype showed higher mRNA and serum levels. This is the first study to demonstrate that SNPs in GHSR can increase appetite, growth traits, expression and level of GHRL, suggesting a hunger signal role for endogenous GHRL. Copyright © 2016 Elsevier Inc. All rights reserved.

  18. MiRNA-Related SNPs and Risk of Esophageal Adenocarcinoma and Barrett's Esophagus: Post Genome-Wide Association Analysis in the BEACON Consortium.

    Directory of Open Access Journals (Sweden)

    Matthew F Buas

    Incidence of esophageal adenocarcinoma (EA) has increased substantially in recent decades. Multiple risk factors have been identified for EA and its precursor, Barrett's esophagus (BE), such as reflux, European ancestry, male sex, obesity, and tobacco smoking, and several germline genetic variants were recently associated with disease risk. Using data from the Barrett's and Esophageal Adenocarcinoma Consortium (BEACON) genome-wide association study (GWAS) of 2,515 EA cases, 3,295 BE cases, and 3,207 controls, we examined single nucleotide polymorphisms (SNPs) that potentially affect the biogenesis or biological activity of microRNAs (miRNAs), small non-coding RNAs implicated in post-transcriptional gene regulation, and deregulated in many cancers, including EA. Polymorphisms in three classes of genes were examined for association with risk of EA or BE: miRNA biogenesis genes (157 SNPs, 21 genes); miRNA gene loci (234 SNPs, 210 genes); and miRNA-targeted mRNAs (177 SNPs, 158 genes). Nominal associations (P0.50, and we did not find evidence for interactions between variants analyzed and two risk factors for EA/BE (smoking and obesity). This analysis provides the most extensive assessment to date of miRNA-related SNPs in relation to risk of EA and BE. While common genetic variants within components of the miRNA biogenesis core pathway appear unlikely to modulate susceptibility to EA or BE, further studies may be warranted to examine potential associations between unassessed variants in miRNA genes and targets with disease risk.

  19. Characterization of genomic variations in SNPs of PE_PGRS genes reveals deletions and insertions in extensively drug resistant (XDR) M. tuberculosis strains from Pakistan

    KAUST Repository

    Kanji, Akbar

    2015-03-01

    Background: Mycobacterium tuberculosis (MTB) PE_PGRS genes belong to the PE multi-gene family. Although the function of the members of the PE_PGRS multi-gene family is not yet known, it is hypothesized that the PE_PGRS genes may be associated with genetic variability. Material and methods: Whole genome sequencing analysis was performed on (n= 37) extensively drug resistant (XDR) MTB strains from Pakistan which included Central Asian (n= 23), East African Indian (n= 2), X3 (n= 1), T group (n= 3) and Orphan (n= 8) MTB strains. Results: By analyzing 42 PE_PGRS genes, 111 SNPs were identified, of which 13 were non-synonymous SNPs (nsSNPs). The nsSNPs identified in the PE_PGRS genes were as follows: 6, 9, 10 and 55 present in each of the CAS, EAI, Orphan, T1 and X3 XDR MTB strains studied. Deletions in PE_PGRS genes: 19, 21 and 23 were observed in 7 (35.0%) CAS1 and 3 (37.5%) in Orphan XDR MTB strains, while deletions in the PE_PGRS genes: 49 and 50 were observed in 36 (95.0%) CAS1 and all CAS, CAS2 and Orphan XDR MTB strains. An insertion in PE_PGRS6 gene was observed in all CAS, EAI3 and Orphan, while insertions in the PE_PGRS genes 19 and 33 were observed in 19 (95%) CAS1 and all CAS, CAS2, EAI3 and Orphan XDR MTB strains. Conclusion: Genetic diversity in PE_PGRS genes contributes to antigenic variability and may result in increased immunogenicity of strains. This is the first study identifying variations in nsSNPs, Insertions and Deletions in the PE_PGRS genes of XDR-TB strains from Pakistan. It highlights common genetic variations which may contribute to persistence.

  20. Serum urate gene associations with incident gout, measured in the Framingham Heart Study, are modified by renal disease and not by body mass index.

    Science.gov (United States)

    Reynolds, Richard J; Vazquez, Ana I; Srinivasasainagendra, Vinodh; Klimentidis, Yann C; Bridges, S Louis; Allison, David B; Singh, Jasvinder A

    2016-02-01

    We hypothesized that serum urate-associated SNPs, individually or collectively, interact with BMI and renal disease to contribute to risk of incident gout. We measured the incidence of gout and associated comorbidities using the original and offspring cohorts of the Framingham Heart Study. We used direct and imputed genotypes for eight validated serum urate loci. We fit binomial regression models of gout incidence as a function of the covariates, age, type 2 diabetes, sex, and all main and interaction effects of the eight serum urate SNPs with BMI and renal disease. Models were also fit with a genetic risk score for serum urate levels which corresponds to the sum of risk alleles at the eight SNPs. Model covariates, age (P = 5.95E-06), sex (P = 2.46E-39), diabetes (P = 2.34E-07), BMI (P = 1.14E-11) and the SNPs, rs1967017 (P = 9.54E-03), rs13129697 (P = 4.34E-07), rs2199936 (P = 7.28E-03) and rs675209 (P = 4.84E-02) were all associated with incident gout. No BMI by SNP or BMI by serum urate genetic risk score interactions were statistically significant, but renal disease by rs1106766 was statistically significant (P = 6.12E-03). We demonstrated that minor alleles of rs1106766 (intergenic, INHBC) were negatively associated with the risk of incident gout in subjects without renal disease, but not for individuals with renal disease. These analyses demonstrate that a significant component of the risk of gout may involve complex interplay between genes and environment.
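
The genetic risk score described above is the sum of risk alleles across the eight serum urate SNPs. A minimal sketch of that sum follows, assuming each genotype is coded as a risk-allele count between 0 and 2, with fractional dosages allowed for imputed genotypes. The function name and the dosages in the usage lines are hypothetical, and only the five rsIDs named in the abstract are shown.

```python
from typing import Mapping, Sequence


def genetic_risk_score(dosages: Mapping[str, float], snps: Sequence[str]) -> float:
    """Unweighted genetic risk score: sum of risk-allele counts over `snps`.

    `dosages` maps an rsID to the number of risk alleles carried (0, 1 or 2
    for directly typed genotypes, possibly fractional for imputed ones).
    """
    score = 0.0
    for snp in snps:
        dosage = dosages[snp]  # KeyError if a SNP is missing: no silent zero-fill
        if not 0.0 <= dosage <= 2.0:
            raise ValueError(f"{snp}: dosage {dosage} outside [0, 2]")
        score += dosage
    return score


# Hypothetical subject; the dosages are made up for illustration.
subject = {
    "rs1967017": 1.0,
    "rs13129697": 2.0,
    "rs2199936": 0.0,
    "rs675209": 1.5,   # fractional value, as an imputed genotype might give
    "rs1106766": 0.0,
}
score = genetic_risk_score(subject, list(subject))
```

Raising on a missing SNP rather than treating it as zero keeps an incomplete genotype from being mistaken for a low-risk one.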

  1. Association of OCT derived drusen measurements with AMD associated-genotypic SNPs in Amish population.

    Science.gov (United States)

    Chavali, Venkata Ramana Murthy; Diniz, Bruno; Huang, Jiayan; Ying, Gui-Shuang; Sadda, SriniVas R; Stambolian, Dwight

    To investigate the association of OCT derived drusen measures in Amish age-related macular degeneration (AMD) patients with known loci for macular degeneration. Members of the Old Order Amish community in Pennsylvania ages 50 and older were assessed for drusen area, volume and regions of retinal pigment epithelium (RPE) atrophy using a Cirrus High-Definition OCT. Measurements were obtained in the macula region within a central circle (CC) of 3 mm diameter and a surrounding perifoveal ring (PR) of 3 to 5 mm diameter using the Cirrus OCT RPE analysis software. Other demographic information including age, gender and smoking status were collected. Study subjects were further genotyped to determine their risk for the AMD associated SNPs in SYN3, LIPC, ARMS2, C3, CFB, CETP, CFI and CFH genes using TaqMan genotyping assays. The association of genotypes with OCT measures were assessed using linear trend p-values calculated from univariate and multivariate generalized linear models. 432 eyes were included in the analysis. Multivariate analysis (adjusted by age, gender and smoking status) confirmed the known significant association between AMD and macular drusen with the number of CFH risk alleles for drusen area (area increased 0.12 mm² for a risk allele increase, pAmish AMD population.

  2. Chromosome 5p Region SNPs Are Associated with Risk of NSCLC among Women

    International Nuclear Information System (INIS)

    Dyke, A. L. V.

    2009-01-01

    In a population-based case-control study, we explored the associations between 42 polymorphisms in seven genes in this region and non-small cell lung cancer (NSCLC) risk among Caucasian (364 cases; 380 controls) and African American (95 cases; 103 controls) women. Two TERT region SNPs, rs2075786 and rs2853677, conferred an increased risk of developing NSCLC, especially among African American women, and TERT-rs2735940 was associated with a decreased risk of lung cancer among African Americans. Five of the 20 GHR polymorphisms and SEPP1-rs6413428 were associated with a marginally increased risk of NSCLC among Caucasians. Random forest analysis reinforced the importance of GHR among Caucasians and identified AMACR, TERT, and GHR among African Americans, which were also significant using gene-based risk scores. Smoking-SNP interactions were explored, and haplotype in TERT and GHR associated with NSCLC risk were identified. The roles of TERT, GHR, AMACR and SEPP1 genes in lung carcinogenesis warrant further exploration

  3. Characterization of genomic variations in SNPs of PE_PGRS genes reveals deletions and insertions in extensively drug resistant (XDR) M. tuberculosis strains from Pakistan

    KAUST Repository

    Kanji, Akbar

    2015-01-21

    Background Mycobacterium tuberculosis (MTB) PE_PGRS genes belong to the PE multigene family. Although the function of PE_PGRS genes is unknown, it is hypothesized that the PE_PGRS genes may be associated with antigenic variability in MTB. Material and methods Whole genome sequencing analysis was performed on (n = 37) extensively drug-resistant (XDR) MTB strains from Pakistan, which included Lineage 1 (East African Indian, n = 2); Other lineage 1 (n = 3); Lineage 3 (Central Asian, n = 24); Other lineage 3 (n = 4); Lineage 4 (X3, n = 1) and T group (n = 3) MTB strains. Results There were 107 SNPs identified from the analysis of 42 PE_PGRS genes; of these, 13 were non-synonymous SNPs (nsSNPs). The nsSNPs identified in PE_PGRS genes – 6, 9 and 10 – were common in all EAI, CAS, Other lineages (1 and 3), T1 and X3. Deletions (DELs) in PE_PGRS genes – 3 and 19 – were observed in 17 (80.9%) CAS1 and 6 (85.7%) in Other lineages (1 and 3) XDR MTB strains, while DELs in the PE_PGRS49 were observed in all CAS1, CAS, CAS2 and Other lineages (1 and 3) XDR MTB strains. All CAS, EAI and Other lineages (1 and 3) strains showed insertions (INS) in PE_PGRS6 gene, while INS in the PE_PGRS genes 19 and 33 were observed in 20 (95.2%) CAS1, all CAS, CAS2, EAI and Other lineages (1 and 3) XDR MTB strains. Conclusion Genetic diversity in PE_PGRS genes contributes to antigenic variability and may result in increased immunogenicity of strains. This is the first study identifying variations in nsSNPs and INDELs in the PE_PGRS genes of XDR-TB strains from Pakistan. It highlights common genetic variations which may contribute to persistence.

  4. Chosen single nucleotide polymorphisms (SNPs) of enamel formation genes and dental caries in a population of Polish children.

    Science.gov (United States)

    Gerreth, Karolina; Zaorska, Katarzyna; Zabel, Maciej; Borysewicz-Lewicka, Maria; Nowicki, Michał

    2017-09-01

    It is increasingly emphasized that the influence of a host's factors in the etiology of dental caries are of most interest, particularly those concerned with genetic aspect. The aim of the study was to analyze the genotype and allele frequencies of single nucleotide polymorphisms (SNPs) in AMELX, AMBN, TUFT1, TFIP11, MMP20 and KLK4 genes and to prove their association with dental caries occurrence in a population of Polish children. The study was performed in 96 children (48 individuals with caries - "cases" and 48 free of this disease - "controls"), aged 20-42 months, chosen out of 262 individuals who had dental examination performed and attended 4 day nurseries located in Poznań (Poland). From both groups oral swab was collected for molecular evaluation. Eleven selected SNPs markers were genotyped by Sanger sequencing. Genotype and allele frequencies were calculated and a standard χ2 analysis was used to test for deviation from Hardy-Weinberg equilibrium. The association of genetic variations with caries susceptibility or resistance was assessed by the Fisher's exact test and p ≤ 0.05 was considered statistically significant. Five markers were significantly associated with caries incidence in children in the study: rs17878486 in AMELX (p caries occurrence in Polish children.
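
The χ2 test for deviation from Hardy-Weinberg equilibrium mentioned above compares observed genotype counts with those expected from the estimated allele frequencies. The sketch below shows the standard one-degree-of-freedom version for a biallelic marker. The function name and the genotype counts in the usage line are hypothetical and are not taken from the study.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> tuple[float, float]:
    """Chi-square statistic and p-value (1 df) for deviation from HWE.

    Arguments are observed counts of the two homozygotes and the heterozygote
    for a biallelic marker with alleles A and B.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)   # estimated frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Skip classes with zero expectation (monomorphic marker).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for 48 children: 30 AA, 10 AB, 8 BB.
chi2, p_value = hwe_chi_square(30, 10, 8)
```

With groups as small as 48, expected counts for the rarer homozygote can fall below 5, in which case an exact test is usually preferred over the χ2 approximation.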

  5. Combining item response theory with multiple imputation to equate health assessment questionnaires.

    Science.gov (United States)

    Gu, Chenyang; Gutman, Roee

    2017-09-01

    The assessment of patients' functional status across the continuum of care requires a common patient assessment tool. However, assessment tools that are used in various health care settings differ and cannot be easily contrasted. For example, the Functional Independence Measure (FIM) is used to evaluate the functional status of patients who stay in inpatient rehabilitation facilities, the Minimum Data Set (MDS) is collected for all patients who stay in skilled nursing facilities, and the Outcome and Assessment Information Set (OASIS) is collected if they choose home health care provided by home health agencies. All three instruments or questionnaires include functional status items, but the specific items, rating scales, and instructions for scoring different activities vary between the different settings. We consider equating different health assessment questionnaires as a missing data problem, and propose a variant of predictive mean matching method that relies on Item Response Theory (IRT) models to impute unmeasured item responses. Using real data sets, we simulated missing measurements and compared our proposed approach to existing methods for missing data imputation. We show that, for all of the estimands considered, and in most of the experimental conditions that were examined, the proposed approach provides valid inferences, and generally has better coverages, relatively smaller biases, and shorter interval estimates. The proposed method is further illustrated using a real data set. © 2016, The International Biometric Society.

  6. FCMPSO: An Imputation for Missing Data Features in Heart Disease Classification

    Science.gov (United States)

    Salleh, Mohd Najib Mohd; Ashikin Samat, Nurul

    2017-08-01

    The application of data mining and machine learning to uncover hidden knowledge is becoming highly influential in clinical research. Heart disease is a leading cause of death worldwide, and early prevention through efficient methods can help to reduce mortality. Medical data may contain many uncertainties, as they are fuzzy and vague in nature. Imprecise feature data, such as absent or missing values, can degrade the quality of classification results, although the remaining complete features can still provide useful information. Therefore, an imputation approach based on Fuzzy C-Means and Particle Swarm Optimization (FCMPSO) is developed for the preprocessing stage to fill in the missing values. The completed dataset is then used to train a Decision Tree classifier. The experiment uses a heart disease dataset, and performance is analysed using accuracy, precision, and ROC values. Results show that the performance of the Decision Tree improves after FCMPSO is applied for imputation.

  7. Partial F-tests with multiply imputed data in the linear regression framework via coefficient of determination.

    Science.gov (United States)

    Chaurasia, Ashok; Harel, Ofer

    2015-02-10

    Tests for regression coefficients such as global, local, and partial F-tests are common in applied research. In the framework of multiple imputation, there are several papers addressing tests for regression coefficients. However, for simultaneous hypothesis testing, the existing methods are computationally intensive because they involve calculation with vectors and (inversion of) matrices. In this paper, we propose a simple method based on the scalar entity, coefficient of determination, to perform (global, local, and partial) F-tests with multiply imputed data. The proposed method is evaluated using simulated data and applied to suicide prevention data. Copyright © 2014 John Wiley & Sons, Ltd.
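
For orientation, the complete-data partial F statistic can be written purely in terms of the coefficients of determination of the full and reduced models, which is the scalar quantity the abstract refers to. The sketch below shows only that textbook identity for a single complete dataset. How the authors combine it across multiply imputed datasets is not described in the abstract and is not attempted here, and the function and parameter names are assumptions.

```python
def partial_f_from_r2(r2_full: float, r2_reduced: float,
                      n: int, p_full: int, q: int) -> float:
    """Partial F statistic from R^2 values (complete-data case).

    r2_full    -- R^2 of the full model
    r2_reduced -- R^2 of the reduced model (q predictors dropped)
    n          -- number of observations
    p_full     -- number of predictors in the full model, intercept excluded
    q          -- number of coefficients being tested

    F = ((R2_full - R2_reduced) / q) / ((1 - R2_full) / (n - p_full - 1)),
    compared against an F distribution with (q, n - p_full - 1) df.
    """
    df_resid = n - p_full - 1
    if q <= 0 or df_resid <= 0:
        raise ValueError("need q > 0 and n > p_full + 1")
    if not 0.0 <= r2_reduced <= r2_full < 1.0:
        raise ValueError("expected 0 <= r2_reduced <= r2_full < 1")
    return ((r2_full - r2_reduced) / q) / ((1.0 - r2_full) / df_resid)


# Illustrative numbers only: dropping one of two predictors lowers R^2
# from 0.5 to 0.4 with 103 observations.
f_stat = partial_f_from_r2(0.5, 0.4, n=103, p_full=2, q=1)
```

Setting `r2_reduced` to 0 and `q` to `p_full` gives the global F-test as a special case.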

  8. Random Forest as an Imputation Method for Education and Psychology Research: Its Impact on Item Fit and Difficulty of the Rasch Model

    Science.gov (United States)

    Golino, Hudson F.; Gomes, Cristiano M. A.

    2016-01-01

    This paper presents a non-parametric imputation technique, named random forest, from the machine learning field. The random forest procedure has two main tuning parameters: the number of trees grown in the prediction and the number of predictors used. Fifty experimental conditions were created in the imputation procedure, with different…

  9. Missing Value Imputation Improves Mortality Risk Prediction Following Cardiac Surgery: An Investigation of an Australian Patient Cohort.

    Science.gov (United States)

    Karim, Md Nazmul; Reid, Christopher M; Tran, Lavinia; Cochrane, Andrew; Billah, Baki

    2017-03-01

    The aim of this study was to evaluate the impact of missing values on the prediction performance of the model predicting 30-day mortality following cardiac surgery as an example. Information from 83,309 eligible patients, who underwent cardiac surgery, recorded in the Australia and New Zealand Society of Cardiac and Thoracic Surgeons (ANZSCTS) database registry between 2001 and 2014, was used. An existing 30-day mortality risk prediction model developed from ANZSCTS database was re-estimated using the complete cases (CC) analysis and using multiple imputation (MI) analysis. Agreement between the risks generated by the CC and MI analysis approaches was assessed by the Bland-Altman method. Performances of the two models were compared. One or more missing predictor variables were present in 15.8% of the patients in the dataset. The Bland-Altman plot demonstrated significant disagreement between the risk scores (prisk of mortality. Compared to CC analysis, MI analysis resulted in an average of 8.5% decrease in standard error, a measure of uncertainty. The MI model provided better prediction of mortality risk (observed: 2.69%; MI: 2.63% versus CC: 2.37%, Pvalues improved the 30-day mortality risk prediction following cardiac surgery. Copyright © 2016 Australian and New Zealand Society of Cardiac and Thoracic Surgeons (ANZSCTS) and the Cardiac Society of Australia and New Zealand (CSANZ). Published by Elsevier B.V. All rights reserved.
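
Agreement between two sets of predicted risks, as assessed above with the Bland-Altman method, is usually summarised by the mean difference (bias) and the 95% limits of agreement. A minimal sketch of that calculation follows. The function name and the risk values in the usage lines are hypothetical and do not come from the ANZSCTS data.

```python
from statistics import mean, stdev
from typing import Sequence


def bland_altman_limits(a: Sequence[float], b: Sequence[float]) -> tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias   -- mean of the pairwise differences a[i] - b[i]
    limits -- bias +/- 1.96 * sample standard deviation of the differences
    """
    if len(a) != len(b):
        raise ValueError("inputs must be paired (same length)")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical predicted risks (%) for four patients from two models.
risk_cc = [1.0, 2.0, 3.0, 4.0]
risk_mi = [1.0, 1.0, 3.0, 3.0]
bias, lower, upper = bland_altman_limits(risk_cc, risk_mi)
```

A bias far from zero, or limits that are wide relative to the risks themselves, indicates that the two analyses cannot be used interchangeably.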

  10. Multiple sclerosis susceptibility-associated SNPs do not influence disease severity measures in a cohort of Australian MS patients.

    Directory of Open Access Journals (Sweden)

    Cathy J Jensen

    Full Text Available Recent association studies in multiple sclerosis (MS) have identified and replicated several single nucleotide polymorphism (SNP) susceptibility loci, including CLEC16A, IL2RA, IL7R, RPL5, CD58, CD40 and chromosome 12q13-14, in addition to the well-established allele HLA-DR15. There is potential that these genetic susceptibility factors could also modulate MS disease severity, as demonstrated previously for the MS risk allele HLA-DR15. We investigated this hypothesis in a cohort of 1006 well-characterised MS patients from South-Eastern Australia. We tested the MS-associated SNPs for association with five measures of disease severity incorporating disability, age of onset, cognition and brain atrophy. We observed trends towards association between the RPL5 risk SNP and time between first demyelinating event and relapse, and between the CD40 risk SNP and symbol digit test score. No associations were significant after correction for multiple testing. We found no evidence for the hypothesis that these new MS disease risk-associated SNPs influence disease severity.

  11. Screening of Missense SNPs in Coding Regions of COX-2 as a Key Enzyme Involved in Cancer

    Directory of Open Access Journals (Sweden)

    Sodabeh Jahanbakhsh-Godehkahriz

    2013-09-01

    Full Text Available Background & Objectives: Non-synonymous single nucleotide polymorphisms (nsSNPs) that disrupt protein function are used as markers in linkage and association studies of human proteins that might be involved in diseases and cancers. Methods: To study the functional effect of nsSNPs on cyclooxygenase-2 (COX-2) amino acids, the nucleotide sequences encoding the COX-2 gene in cancers were extracted from the NCBI (gi|223941909) data bank (283 cases) and analyzed by the SIFT, I-Mutant 2.0, SNP and GO, PANTHER and FASTSNP servers. These servers run programs that predict the effects of missense amino acid substitutions on protein function and stability. Results: COX-2 is an essential enzyme for the production of pro-inflammatory prostaglandins, which are relevant to cancer development and progression. Substitutions at some positions of COX-2, such as R228H and S428A, were linked in most cancers to altered protein function through disruption of the enzyme active site. Conclusion: Amino acid substitutions resulting from COX-2 nsSNPs have an important role in human disease. Substitutions located in the catalytic domain are important for the enzymatic function of COX-2 and are associated with higher expression of COX-2.

  12. Typing of 49 autosomal SNPs by single base extension and capillary electrophoresis for forensic genetic testing

    DEFF Research Database (Denmark)

    Børsting, Claus; Tomas Mas, Carmen; Morling, Niels

    2012-01-01

    of the amplicons range from 65 to 115 bp. The high sensitivity and the short amplicon sizes make the assay very suitable for typing of degraded DNA samples, and the low mutation rate of SNPs makes the assay very useful for relationship testing. Combined, these advantages make the assay well suited for disaster...

  13. Typing of 49 autosomal SNPs by SNaPshot in the Slovenian population

    DEFF Research Database (Denmark)

    Drobnic, Katja; Børsting, Claus; Rockenbauer, Eszter

    2010-01-01

    A total of 157 unrelated individuals residing in Slovenia were typed for 49 of the autosomal single nucleotide polymorphisms (SNPs) in the SNPforID 52plex with the SNaPshot assay. We obtained full SNP profiles in all but one individual and perfect concordance was obtained in duplicated analyses...

  14. Functional SNPs in the human ficolin (FCN) genes reveal distinct geographical patterns

    DEFF Research Database (Denmark)

    Hummelshøj, Tina; Munthe-Fog, Lea; Madsen, Hans O

    2008-01-01

    -Xaa-Yaa repeats and a Trp279STOP introduces a stop codon, thereby destroying the fibrinogen-like domain of Ficolin-1. In contrast to FCN1 and FCN2, the number of SNPs in FCN3 was very low. In conclusion, large ethnic differences in the FCN genes that will affect the concentration, structure, and function...

  15. Analysis of 49 autosomal SNPs in three ethnic groups from Iran

    DEFF Research Database (Denmark)

    Sharafi Farzad, M; Tomas Mas, Carmen; Børsting, C

    2013-01-01

    Asian populations in the MDS plot drawn from the FST values. Statistical parameters of forensic interest calculated for the Iranian ethnic groups showed values of the same order of magnitudes as those obtained for Asians. The mean match probability calculated for the 49 SNPs ranged from 1.7x10...

  16. Analysis of SNPs of MC4R , GNB3 and FTO gene polymorphism in ...

    African Journals Online (AJOL)

    Analysis of SNPs of MC4R , GNB3 and FTO gene polymorphism in obese Saudi subjects. Said Salama Moselhy, Yasmeen A Alhetari, Archana Iyer, Etimad A Huwait, Maryam A AL-Ghamdi, Shareefa AL-Ghamdi, Khadijah Saeed Balamash, Ashraf A Basuni, Mohamed N Alama, Taha A Kumosani, Soonham Sami Yaghmoor ...

  17. In vitro human keratinocyte migration rates are associated with SNPs in the KRT1 interval.

    Directory of Open Access Journals (Sweden)

    Heng Tao

    Full Text Available Efforts to develop effective therapeutic treatments for promoting fast wound healing after injury to the epidermis are hindered by a lack of understanding of the factors involved. Re-epithelialization is an essential step of wound healing involving the migration of epidermal keratinocytes over the wound site. Here, we examine genetic variants in the keratin-1 (KRT1) locus for association with migration rates of human epidermal keratinocytes (HEK) isolated from different individuals. Although the role of intermediate filament genes, including KRT1, in wound-activated keratinocytes is well established, this is the first study to examine whether genetic variants in humans contribute to differences in the migration rates of these cells. Using an in vitro scratch wound assay, we observe quantifiable variation in HEK migration rates in two independent sets of samples: 24 samples in the first set and 17 samples in the second set. We analyze genetic variants in the KRT1 interval and identify SNPs significantly associated with HEK migration rates in both sample sets. Additionally, we show in the first set of samples that the average migration rate of HEK cells homozygous for one common haplotype pattern in the KRT1 interval is significantly faster than that of HEK cells homozygous for a second common haplotype pattern. Our study demonstrates that genetic variants in the KRT1 interval contribute to quantifiable differences in the migration rates of keratinocytes isolated from different individuals. Furthermore, we show that in vitro cell assays can successfully be used to deconstruct complex traits into simple biological model systems for genetic association studies.

  18. Genomic Selection for Drought Tolerance Using Genome-Wide SNPs in Maize

    Directory of Open Access Journals (Sweden)

    Thirunavukkarasu Nepolean

    2017-04-01

    Full Text Available Traditional breeding strategies that select superior genotypes on the basis of phenotypic traits have had limited success, as this direct selection is hindered by low heritability, genetic interactions such as epistasis, environmental-genotype interactions, and polygenic effects. New genomic tools have given breeders a way to select superior breeds. Genomic selection (GS) has emerged as one of the most important approaches for predicting genotype performance. Here, we tested the breeding values of 240 maize subtropical lines phenotyped for drought in different environments using 29,619 cured SNPs. Prediction accuracies of seven genomic selection models (ridge regression, LASSO, elastic net, random forest, reproducing kernel Hilbert space, Bayes A and Bayes B) were tested for their agronomic traits. Though prediction accuracies of Bayes B, Bayes A and RKHS were comparable, Bayes B outperformed the other models, giving the highest Pearson correlation coefficient in all three environments. From Bayes B, a set of the top 1053 significant SNPs with higher marker effects was selected across all datasets to validate the genes and QTLs. Of these 1053 SNPs, 77 were associated with 10 drought-responsive transcription factors. These transcription factors were associated with different physiological and molecular functions (stomatal closure, root development, hormonal signaling and photosynthesis). Of the models tested, Bayes B showed the highest level of prediction accuracy for our data sets. Our experiments also highlighted several SNPs based on their performance and relative importance to drought tolerance. These results are important for the selection of superior genotypes and candidate genes for breeding drought-tolerant maize hybrids.
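
    The abstract above reports prediction accuracy as a Pearson correlation coefficient. A minimal sketch of that metric follows, assuming accuracy is computed between observed phenotypes and model-predicted values for the same lines; the function name `pearson_r` is hypothetical and none of the seven models is reproduced here.

```python
from math import sqrt


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Sums of cross-products and squared deviations about the means.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_o * var_p)
```

    In a cross-validation setting, one would typically compute this value on held-out lines only, so that the accuracy is not inflated by the training data.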

  19. 21 CFR 1404.630 - May the Office of National Drug Control Policy impute conduct of one person to another?

    Science.gov (United States)

    2010-04-01

    ... 21 Food and Drugs 9 2010-04-01 2010-04-01 false May the Office of National Drug Control Policy impute conduct of one person to another? 1404.630 Section 1404.630 Food and Drugs OFFICE OF NATIONAL DRUG CONTROL POLICY GOVERNMENTWIDE DEBARMENT AND SUSPENSION (NONPROCUREMENT) General Principles Relating to Suspension and Debarment Actions § 1404.630...

  20. Mapping wildland fuels and forest structure for land management: a comparison of nearest neighbor imputation and other methods

    Science.gov (United States)

    Kenneth B. Pierce; Janet L. Ohmann; Michael C. Wimberly; Matthew J. Gregory; Jeremy S. Fried

    2009-01-01

    Land managers need consistent information about the geographic distribution of wildland fuels and forest structure over large areas to evaluate fire risk and plan fuel treatments. We compared spatial predictions for 12 fuel and forest structure variables across three regions in the western United States using gradient nearest neighbor (GNN) imputation, linear models (...

  1. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

    DEFF Research Database (Denmark)

    Huang, Jie; Howie, Bryan; Mccarthy, Shane

    2015-01-01

    Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low de...

  2. 29 CFR 1471.630 - May the Federal Mediation and Conciliation Service impute conduct of one person to another?

    Science.gov (United States)

    2010-07-01

    ... 29 Labor 4 2010-07-01 2010-07-01 false May the Federal Mediation and Conciliation Service impute...) FEDERAL MEDIATION AND CONCILIATION SERVICE GOVERNMENTWIDE DEBARMENT AND SUSPENSION (NONPROCUREMENT) General Principles Relating to Suspension and Debarment Actions § 1471.630 May the Federal Mediation and...

  3. Age at menopause: imputing age at menopause for women with a hysterectomy with application to risk of postmenopausal breast cancer

    Science.gov (United States)

    Rosner, Bernard; Colditz, Graham A.

    2011-01-01

    Purpose: Age at menopause, a major marker in the reproductive life, may bias results for evaluation of breast cancer risk after menopause. Methods: We follow 38,948 premenopausal women in 1980 and identify 2,586 who reported hysterectomy without bilateral oophorectomy, and 31,626 who reported natural menopause during 22 years of follow-up. We evaluate risk factors for natural menopause, impute age at natural menopause for women reporting hysterectomy without bilateral oophorectomy and estimate the hazard of reaching natural menopause in the next 2 years. We apply this imputed age at menopause to both increase sample size and to evaluate the relation between postmenopausal exposures and risk of breast cancer. Results: Age, cigarette smoking, age at menarche, pregnancy history, body mass index, history of benign breast disease, and history of breast cancer were each significantly related to age at natural menopause; duration of oral contraceptive use and family history of breast cancer were not. The imputation increased sample size substantially and although some risk factors after menopause were weaker in the expanded model (height, and alcohol use), use of hormone therapy is less biased. Conclusions: Imputing age at menopause increases sample size, broadens generalizability making it applicable to women with hysterectomy, and reduces bias. PMID:21441037

  4. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

    NARCIS (Netherlands)

    J. Huang (Jie); B. Howie (Bryan); S. McCarthy (Shane); Y. Memari (Yasin); K. Walter (Klaudia); J.L. Min (Josine L.); P. Danecek (Petr); G. Malerba (Giovanni); E. Trabetti (Elisabetta); H.-F. Zheng (Hou-Feng); G. Gambaro (Giovanni); J.B. Richards (Brent); R. Durbin (Richard); N.J. Timpson (Nicholas); J. Marchini (Jonathan); N. Soranzo (Nicole); S.H. Al Turki (Saeed); A. Amuzu (Antoinette); C. Anderson (Carl); R. Anney (Richard); D. Antony (Dinu); M.S. Artigas; M. Ayub (Muhammad); S. Bala (Senduran); J.C. Barrett (Jeffrey); I.E. Barroso (Inês); P.L. Beales (Philip); M. Benn (Marianne); J. Bentham (Jamie); S. Bhattacharya (Shoumo); E. Birney (Ewan); D.H.R. Blackwood (Douglas); M. Bobrow (Martin); E. Bochukova (Elena); P.F. Bolton (Patrick F.); R. Bounds (Rebecca); C. Boustred (Chris); G. Breen (Gerome); M. Calissano (Mattia); K. Carss (Keren); J.P. Casas (Juan Pablo); J.C. Chambers (John C.); R. Charlton (Ruth); K. Chatterjee (Krishna); L. Chen (Lu); A. Ciampi (Antonio); S. Cirak (Sebahattin); P. Clapham (Peter); G. Clement (Gail); G. Coates (Guy); M. Cocca (Massimiliano); D.A. Collier (David); C. Cosgrove (Catherine); T. Cox (Tony); N.J. Craddock (Nick); L. Crooks (Lucy); S. Curran (Sarah); D. Curtis (David); A. Daly (Allan); I.N.M. Day (Ian N.M.); A.G. Day-Williams (Aaron); G.V. Dedoussis (George); T. Down (Thomas); Y. Du (Yuanping); C.M. van Duijn (Cornelia); I. Dunham (Ian); T. Edkins (Ted); R. Ekong (Rosemary); P. Ellis (Peter); D.M. Evans (David); I.S. Farooqi (I. Sadaf); D.R. Fitzpatrick (David R.); P. Flicek (Paul); J. Floyd (James); A.R. Foley (A. Reghan); C.S. Franklin (Christopher S.); M. Futema (Marta); L. Gallagher (Louise); P. Gasparini (Paolo); T.R. Gaunt (Tom); M. Geihs (Matthias); D. Geschwind (Daniel); C.M.T. Greenwood (Celia); H. Griffin (Heather); D. Grozeva (Detelina); X. Guo (Xiaosen); X. Guo (Xueqin); H. Gurling (Hugh); D. Hart (Deborah); A.E. Hendricks (Audrey E.); P.A. Holmans (Peter A.); L. Huang (Liren); T. Hubbard (Tim); S.E. 
Humphries (Steve E.); M.E. Hurles (Matthew); P.G. Hysi (Pirro); V. Iotchkova (Valentina); A. Isaacs (Aaron); D.K. Jackson (David K.); Y. Jamshidi (Yalda); J. Johnson (Jon); C. Joyce (Chris); K.J. Karczewski (Konrad); J. Kaye (Jane); T. Keane (Thomas); J.P. Kemp (John); K. Kennedy (Karen); A. Kent (Alastair); J. Keogh (Julia); F. Khawaja (Farrah); M.E. Kleber (Marcus); M. Van Kogelenberg (Margriet); A. Kolb-Kokocinski (Anja); J.S. Kooner (Jaspal S.); G. Lachance (Genevieve); C. Langenberg (Claudia); C. Langford (Cordelia); D. Lawson (Daniel); I. Lee (Irene); E.M. van Leeuwen (Elisa); M. Lek (Monkol); R. Li (Rui); Y. Li (Yingrui); J. Liang (Jieqin); H. Lin (Hong); R. Liu (Ryan); J. Lönnqvist (Jouko); L.R. Lopes (Luis R.); M.C. Lopes (Margarida); J. Luan; D.G. MacArthur (Daniel G.); M. Mangino (Massimo); G. Marenne (Gaëlle); W. März (Winfried); J. Maslen (John); A. Matchan (Angela); I. Mathieson (Iain); P. McGuffin (Peter); A.M. McIntosh (Andrew); A.G. McKechanie (Andrew G.); A. McQuillin (Andrew); S. Metrustry (Sarah); N. Migone (Nicola); H.M. Mitchison (Hannah M.); A. Moayyeri (Alireza); J. Morris (James); R. Morris (Richard); D. Muddyman (Dawn); F. Muntoni; B.G. Nordestgaard (Børge G.); K. Northstone (Kate); M.C. O'donovan (Michael); S. O'Rahilly (Stephen); A. Onoufriadis (Alexandros); K. Oualkacha (Karim); M.J. Owen (Michael J.); A. Palotie (Aarno); K. Panoutsopoulou (Kalliope); V. Parker (Victoria); J.R. Parr (Jeremy R.); L. Paternoster (Lavinia); T. Paunio (Tiina); F. Payne (Felicity); S.J. Payne (Stewart J.); J.R.B. Perry (John); O.P.H. Pietiläinen (Olli); V. Plagnol (Vincent); R.C. Pollitt (Rebecca C.); S. Povey (Sue); M.A. Quail (Michael A.); L. Quaye (Lydia); L. Raymond (Lucy); K. Rehnström (Karola); C.K. Ridout (Cheryl K.); S.M. Ring (Susan); G.R.S. Ritchie (Graham R.S.); N. Roberts (Nicola); R.L. Robinson (Rachel L.); D.B. Savage (David); P.J. Scambler (Peter); S. Schiffels (Stephan); M. Schmidts (Miriam); N. Schoenmakers (Nadia); R.H. 
Scott (Richard H.); R.A. Scott (Robert); R.K. Semple (Robert K.); E. Serra (Eva); S.I. Sharp (Sally I.); A.C. Shaw (Adam C.); H.A. Shihab (Hashem A.); S.-Y. Shin (So-Youn); D. Skuse (David); K.S. Small (Kerrin); C. Smee (Carol); G.D. Smith; L. Southam (Lorraine); O. Spasic-Boskovic (Olivera); T.D. Spector (Timothy); D. St. Clair (David); B. St Pourcain (Beate); J. Stalker (Jim); E. Stevens (Elizabeth); J. Sun (Jianping); G. Surdulescu (Gabriela); J. Suvisaari (Jaana); P. Syrris (Petros); I. Tachmazidou (Ioanna); R. Taylor (Rohan); J. Tian (Jing); M.D. Tobin (Martin); D. Toniolo (Daniela); M. Traglia (Michela); A. Tybjaerg-Hansen; A.M. Valdes; A.M. Vandersteen (Anthony M.); A. Varbo (Anette); P. Vijayarangakannan (Parthiban); P.M. Visscher (Peter); L.V. Wain (Louise); J.T. Walters (James); G. Wang (Guangbiao); J. Wang (Jun); Y. Wang (Yu); K. Ward (Kirsten); E. Wheeler (Eleanor); P.H. Whincup (Peter); T. Whyte (Tamieka); H.J. Williams (Hywel J.); K.A. Williamson (Kathleen); C. Wilson (Crispian); S.G. Wilson (Scott); K. Wong (Kim); C. Xu (Changjiang); J. Yang (Jian); G. Zaza (Gianluigi); E. Zeggini (Eleftheria); F. Zhang (Feng); P. Zhang (Pingbo); W. Zhang (Weihua)

    2015-01-01

    Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced

  5. Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels

    NARCIS (Netherlands)

    van Leeuwen, E.M.; Karssen, L.C.; Deelen, J.; Isaacs, A.; Medina-Gomez, C.; Mbarek, H.; Kanterakis, A.; Trompet, S.; Postmus, I.; Verweij, N.; van Enckevort, D.; Huffman, J.E.; White, C.C.; Feitosa, M.F.; Bartz, T.M.; Manichaikul, A.; Joshi, P.K.; Peloso, G.M.; Deelen, P.; Dijk, F.; Willemsen, G.; de Geus, E.J.C.; Milaneschi, Y.; Penninx, B.W.J.H.; Francioli, L.C.; Menelaou, A.; Pulit, S.L.; Rivadeneira, F.; Hofman, A.; Oostra, B.A.; Franco, O.H.; Mateo Leach, I.; Beekman, M.; de Craen, A.J.; Uh, H.W.; Trochet, H.; Hocking, L.J.; Porteous, D.J.; Sattar, N.; Packard, C.J.; Buckley, B.M.; Brody, J.A.; Bis, J.C.; Rotter, J.I.; Mychaleckyj, J.C.; Campbell, H.; Duan, Q.; Lange, L.A.; Wilson, J.F.; Hayward, C.; Polasek, O.; Vitart, V.; Rudan, I.; Wright, A.F.; Rich, S.S.; Psaty, B.M.; Borecki, I.B.; Kearney, P.M.; Stott, D.J.; Cupples, L.A.; Jukema, J.W.; van der Harst, P.; Sijbrands, E.J.; Hottenga, J.J.; Uitterlinden, A.G.; Swertz, M.A.; van Ommen, G.J.B; Bakker, P.I.W.; Slagboom, P.E.; Boomsma, D.I.; Wijmenga, C.; van Duijn, C.M.

    2015-01-01

    Variants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (∼35,000 samples) with the population-specific reference panel created

  6. 31 CFR 19.630 - May the Department of the Treasury impute conduct of one person to another?

    Science.gov (United States)

    2010-07-01

    ... 31 Money and Finance: Treasury 1 2010-07-01 2010-07-01 false May the Department of the Treasury impute conduct of one person to another? 19.630 Section 19.630 Money and Finance: Treasury Office of the Secretary of the Treasury GOVERNMENTWIDE DEBARMENT AND SUSPENSION (NONPROCUREMENT) General Principles...

  7. Temperature Switch PCR (TSP): Robust assay design for reliable amplification and genotyping of SNPs

    Directory of Open Access Journals (Sweden)

    Mather Diane E

    2009-12-01

    Full Text Available Background: Many research and diagnostic applications rely upon the assay of individual single nucleotide polymorphisms (SNPs). Thus, methods to improve the speed and efficiency of single-marker SNP genotyping are highly desirable. Here, we describe the method of temperature-switch PCR (TSP), a biphasic four-primer PCR system with a universal primer design that permits amplification of the target locus in the first phase of thermal cycling before switching to the detection of the alleles. TSP can simplify assay design for a range of commonly used single-marker SNP genotyping methods, and reduce the requirement for individual assay optimization and operator expertise in the deployment of SNP assays. Results: We demonstrate the utility of TSP for the rapid construction of robust and convenient endpoint SNP genotyping assays based on allele-specific PCR and high resolution melt analysis by generating a total of 11,232 data points. The TSP assays were performed under standardised reaction conditions, requiring minimal optimization of individual assays. High genotyping accuracy was verified by 100% concordance of TSP genotypes in a blinded study with an independent genotyping method. Conclusion: Theoretically, TSP can be directly incorporated into the design of assays for most current single-marker SNP genotyping methods. TSP provides several technological advances for single-marker SNP genotyping, including simplified assay design and development, increased assay specificity and genotyping accuracy, and opportunities for assay automation. By reducing the requirement for operator expertise, TSP provides opportunities to deploy a wider range of single-marker SNP genotyping methods in the laboratory. TSP has broad applications and can be deployed in any animal and plant species.

  8. Population genomic analyses based on 1 million SNPs in commercial egg layers.

    Directory of Open Access Journals (Sweden)

    Mahmood Gholami

    Full Text Available Identifying signatures of selection can provide valuable insight into the genes or genomic regions that are or have been under selective pressure, which can lead to a better understanding of genotype-phenotype relationships. A common strategy for selection signature detection is to compare samples from several populations and search for genomic regions with outstanding genetic differentiation. Wright's fixation index, FST, is a useful index for evaluating genetic differentiation between populations. The aim of this study was to detect selective signatures between different chicken groups based on SNP-wise FST calculation. A total of 96 individuals of three commercial layer breeds and 14 non-commercial fancy breeds were genotyped with three different 600K SNP-chips. After filtering, a total of 1 million SNPs were available for FST calculation. Averages of FST values were calculated for overlapping windows. These were then compared between commercial egg layers and non-commercial fancy breeds, as well as between white egg layers and brown egg layers. Comparing non-commercial and commercial breeds resulted in the detection of 630 selective signatures, while 656 selective signatures were detected in the comparison between the commercial egg-layer breeds. Annotation of selection signature regions revealed various genes corresponding to production traits for which layer breeds were selected. Among them were NCOA1, SREBF2 and RALGAPA1, associated with reproductive traits, broodiness and egg production. Furthermore, several of the detected genes were associated with growth and carcass traits, including POMC, PRKAB2, SPP1, IGF2, CAPN1, TGFb2 and IGFBP2. Our approach demonstrates that including different populations with a specific breeding history can provide a unique opportunity for a better understanding of farm animal selection.
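
    The SNP-wise FST calculation and the averaging over overlapping windows can be sketched as follows. This is a simplified illustration, assuming two equally weighted populations and the basic heterozygosity form FST = (HT - HS) / HT; the abstract does not say which FST estimator, window size or step the authors used, so the function names and parameters here are hypothetical.

```python
def wright_fst(p1, p2):
    """FST for one biallelic SNP from allele frequencies in two populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # SNP is fixed for the same allele in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def windowed_mean(values, size, step):
    """Mean of `values` in windows of `size` SNPs, advancing by `step` SNPs.

    Windows overlap whenever step < size.
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    return [
        sum(values[start:start + size]) / size
        for start in range(0, len(values) - size + 1, step)
    ]
```

    Windows whose mean FST falls in the extreme upper tail of the genome-wide distribution would then be candidate selection signatures.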

  9. Presence of SNPs in GDF9 mRNA of Iranian Afshari Sheep

    Directory of Open Access Journals (Sweden)

    Talat Saiedi

    2012-01-01

    Full Text Available Background: Multiple births occur frequently in some Iranian sheep breeds, while infertility scarcely occurs. Mutation detection in major fecundity genes has been explored in most Iranian sheep flocks over the last decade. However, previously reported single nucleotide polymorphisms (SNPs) for bone morphogenetic protein receptor (BMPR)-1B and growth differentiation factor 9 (GDF9) known to affect fertility have not been detected. This study was conducted to assess whether any significant mutations in GDF9 were extracted from slaughtered ewe ovaries of the Iranian Afshari sheep breed. Materials and Methods: Ovaries defined as poor, fair, and excellent quality based on external visual appearance of follicles were used for histology and RNA extraction processes. High quality RNAs underwent reverse transcriptase-polymerase chain reaction (RT-PCR) from GDF9 mRNA, and the products were sequenced. Results: No streak ovaries, which are considered indicators of infertility due to homozygosity for some mutations in GDF9 and BMP15, were found. Sequencing results from GDF9 cDNA showed that G2 (C471T), G3 (G477A), and G4 (G721A) mutations were observed in 1, 4, and 1 out of 12 ewes, respectively. Though all 3 mutations were previously reported, this is the first report on their presence in Iranian breeds. The first and second mutations do not alter the amino acids, while G4 is a non-conservative mutation leading to E241K in the prohormone. Conclusion: As the G4 mutation was observed only in ovaries defined superficially as top quality, it could be considered one of the reasons for the higher ovulation rate in some sheep. Furthermore, since multiple mutations were observed in some cases, it might be possible that combinations of minor mutations in GDF9 and BMP15 interact to affect fecundity in some Iranian sheep breeds.

  10. A Nonparametric, Multiple Imputation-Based Method for the Retrospective Integration of Data Sets

    Science.gov (United States)

    Carrig, Madeline M.; Manrique-Vallier, Daniel; Ranby, Krista W.; Reiter, Jerome P.; Hoyle, Rick H.

    2015-01-01

    Complex research questions often cannot be addressed adequately with a single data set. One sensible alternative to the high cost and effort associated with the creation of large new data sets is to combine existing data sets containing variables related to the constructs of interest. The goal of the present research was to develop a flexible, broadly applicable approach to the integration of disparate data sets that is based on nonparametric multiple imputation and the collection of data from a convenient, de novo calibration sample. We demonstrate proof of concept for the approach by integrating three existing data sets containing items related to the extent of problematic alcohol use and associations with deviant peers. We discuss both necessary conditions for the approach to work well and potential strengths and weaknesses of the method compared to other data set integration approaches. PMID:26257437

  11. Impute DC link (IDCL) cell based power converters and control thereof

    Science.gov (United States)

    Divan, Deepakraj M.; Prasai, Anish; Hernendez, Jorge; Moghe, Rohit; Iyer, Amrit; Kandula, Rajendra Prasad

    2016-04-26

    Power flow controllers based on Imputed DC Link (IDCL) cells are provided. The IDCL cell is a self-contained power electronic building block (PEBB). IDCL cells may be stacked in series and parallel to achieve power flow control at higher voltage and current levels. Each IDCL cell may comprise a gate drive, a voltage sharing module, and a thermal management component in order to facilitate easy integration of the cell into a variety of applications. By providing direct AC conversion, the IDCL cell based AC/AC converters reduce device count, eliminate the use of electrolytic capacitors that have life and reliability issues, and improve system efficiency compared with a similarly rated back-to-back inverter system.

  12. A genome-wide association study of atopic dermatitis identifies loci with overlapping effects on asthma and psoriasis.

    Science.gov (United States)

    Weidinger, Stephan; Willis-Owen, Saffron A G; Kamatani, Yoichiro; Baurecht, Hansjörg; Morar, Nilesh; Liang, Liming; Edser, Pauline; Street, Teresa; Rodriguez, Elke; O'Regan, Grainne M; Beattie, Paula; Fölster-Holst, Regina; Franke, Andre; Novak, Natalija; Fahy, Caoimhe M; Winge, Mårten C G; Kabesch, Michael; Illig, Thomas; Heath, Simon; Söderhäll, Cilla; Melén, Erik; Pershagen, Göran; Kere, Juha; Bradley, Maria; Lieden, Agne; Nordenskjold, Magnus; Harper, John I; McLean, W H Irwin; Brown, Sara J; Cookson, William O C; Lathrop, G Mark; Irvine, Alan D; Moffatt, Miriam F

    2013-12-01

    Atopic dermatitis (AD) is the most common dermatological disease of childhood. Many children with AD have asthma and AD shares regions of genetic linkage with psoriasis, another chronic inflammatory skin disease. We present here a genome-wide association study (GWAS) of childhood-onset AD in 1563 European cases with known asthma status and 4054 European controls. Using Illumina genotyping followed by imputation, we generated 268 034 consensus genotypes and in excess of 2 million single nucleotide polymorphisms (SNPs) for analysis. Association signals were assessed for replication in a second panel of 2286 European cases and 3160 European controls. Four loci achieved genome-wide significance for AD and replicated consistently across all cohorts. These included the epidermal differentiation complex (EDC) on chromosome 1, the genomic region proximal to LRRC32 on chromosome 11, the RAD50/IL13 locus on chromosome 5 and the major histocompatibility complex (MHC) on chromosome 6; reflecting action of classical HLA alleles. We observed variation in the contribution towards co-morbid asthma for these regions of association. We further explored the genetic relationship between AD, asthma and psoriasis by examining previously identified susceptibility SNPs for these diseases. We found considerable overlap between AD and psoriasis together with variable coincidence between allergic rhinitis (AR) and asthma. Our results indicate that the pathogenesis of AD incorporates immune and epidermal barrier defects with combinations of specific and overlapping effects at individual loci.

  13. Sub-populations within the major European and African derived haplogroups R1b3 and E3a are differentiated by previously phylogenetically undefined Y-SNPs.

    Science.gov (United States)

    Sims, Lynn M; Garvey, Dennis; Ballantyne, Jack

    2007-01-01

    Single nucleotide polymorphisms on the Y chromosome (Y-SNPs) have been widely used in the study of human migration patterns and evolution. Potential forensic applications of Y-SNPs include their use in predicting the ethnogeographic origin of the donor of a crime scene sample, or exclusion of suspects of sexual assaults (the evidence of which often comprises male/female mixtures and may involve multiple perpetrators), paternity testing, and identification of non- and half-siblings. In this study, we used a population of 118 African- and 125 European-Americans to evaluate 12 previously phylogenetically undefined Y-SNPs for their ability to further differentiate individuals who belong to the major African (E3a)- and European (R1b3, I)-derived haplogroups. Ten of these markers define seven new sub-clades (equivalent to E3a7a, E3a8, E3a8a, E3a8a1, R1b3h, R1b3i, and R1b3i1 using the Y Chromosome Consortium nomenclature) within haplogroups E and R. Interestingly, during the course of this study we evaluated M222, a sub-R1b3 marker rarely used, and found that this sub-haplogroup in effect defines the Y-STR Irish Modal Haplotype (IMH). The new bi-allelic markers described here are expected to find application in human evolutionary studies and forensic genetics. (c) 2006 Wiley-Liss, Inc.

  14. Genome-wide association study of retinopathy in individuals without diabetes.

    Directory of Open Access Journals (Sweden)

    Richard A Jensen

    Mild retinopathy (microaneurysms or dot-blot hemorrhages) is observed in persons without diabetes or hypertension and may reflect microvascular disease in other organs. We conducted a genome-wide association study (GWAS) of mild retinopathy in persons without diabetes. A working group agreed on phenotype harmonization, covariate selection and analytic plans for within-cohort GWAS. An inverse-variance weighted fixed effects meta-analysis was performed with GWAS results from six cohorts of 19,411 Caucasians. The primary analysis included individuals without diabetes and secondary analyses were stratified by hypertension status. We also singled out the results from single nucleotide polymorphisms (SNPs) previously shown to be associated with diabetes and hypertension, the two most common causes of retinopathy. No SNPs reached genome-wide significance in the primary analysis or the secondary analysis of participants with hypertension. SNP, rs12155400, in the histone deacetylase 9 gene (HDAC9) on chromosome 7, was associated with retinopathy in analysis of participants without hypertension, -1.3 ± 0.23 (beta ± standard error), p = 6.6×10^-9. Evidence suggests this was a false positive finding. The minor allele frequency was low (∼2%), the quality of the imputation was moderate (r^2 ∼0.7), and no other common variants in the HDAC9 gene were associated with the outcome. SNPs found to be associated with diabetes and hypertension in other GWAS were not associated with retinopathy in persons without diabetes or in subgroups with or without hypertension. This GWAS of retinopathy in individuals without diabetes showed little evidence of genetic associations. Further studies are needed to identify genes associated with these signs in order to help unravel novel pathways and determinants of microvascular diseases.
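
    The inverse-variance weighted fixed effects meta-analysis named in the abstract above can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the function name and the per-cohort numbers are hypothetical, and the two-sided p-value assumes a normal approximation for the pooled estimate.

```python
import math


def fixed_effect_meta(betas, ses):
    """Pool per-cohort effect estimates with inverse-variance weights.

    betas: per-cohort effect estimates (hypothetical inputs)
    ses:   matching standard errors, all strictly positive
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value


# Hypothetical estimates for one SNP from three cohorts.
beta, se, z, p = fixed_effect_meta([-0.9, -1.4, -1.2], [0.40, 0.45, 0.50])
print(f"beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.2e}")
```

    Under a fixed effects model every cohort is assumed to estimate the same underlying effect, so more precise cohorts simply receive more weight; between-cohort heterogeneity is not modelled here.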

  15. LS-SNP/PDB: annotated non-synonymous SNPs mapped to Protein Data Bank structures.

    Science.gov (United States)

    Ryan, Michael; Diekhans, Mark; Lien, Stephanie; Liu, Yun; Karchin, Rachel

    2009-06-01

    LS-SNP/PDB is a new WWW resource for genome-wide annotation of human non-synonymous (amino acid changing) SNPs. It serves high-quality protein graphics rendered with UCSF Chimera molecular visualization software. The system is kept up-to-date by an automated, high-throughput build pipeline that systematically maps human nsSNPs onto Protein Data Bank structures and annotates several biologically relevant features. LS-SNP/PDB is available at (http://ls-snp.icm.jhu.edu/ls-snp-pdb) and via links from protein data bank (PDB) biology and chemistry tabs, UCSC Genome Browser Gene Details and SNP Details pages and PharmGKB Gene Variants Downloads/Cross-References pages.

  16. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs

    DEFF Research Database (Denmark)

    Radhakrishnan, Sabarinathan; Tafer, Hakim; Seemann, Ernst Stefan

    2013-01-01

    into structural effects of SNPs. The global measures employed so far suffer from limited accuracy of folding programs on large RNAs and are computationally too demanding for genome-wide applications. Here, we present a strategy that focuses on the local regions of maximal structural change between mutant and wild......-type. These local regions are approximated in a "screening mode" that is intended for genome-wide applications. Furthermore, localized regions are identified as those with maximal discrepancy. The mutation effects are quantified in terms of empirical P values. To this end, the RNAsnp software uses extensive...... precomputed tables of the distribution of SNP effects as function of length and GC content. RNAsnp thus achieves both a noise reduction and speed-up of several orders of magnitude over shuffling-based approaches. On a data set comprising 501 SNPs associated with human-inherited diseases, we predict 54 to have...

  17. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs

    DEFF Research Database (Denmark)

    Lee, S Hong; Ripke, Stephan; Neale, Benjamin M

    2013-01-01

    Most psychiatric disorders are moderately to highly heritable. The degree to which genetic variation is unique to individual disorders or shared across disorders is unclear. To examine shared genetic etiology, we use genome-wide genotype data from the Psychiatric Genomics Consortium (PGC) for cases...... and controls in schizophrenia, bipolar disorder, major depressive disorder, autism spectrum disorders (ASD) and attention-deficit/hyperactivity disorder (ADHD). We apply univariate and bivariate methods for the estimation of genetic variation within and covariation between disorders. SNPs explained 17......-29% of the variance in liability. The genetic correlation calculated using common SNPs was high between schizophrenia and bipolar disorder (0.68 ± 0.04 s.e.), moderate between schizophrenia and major depressive disorder (0.43 ± 0.06 s.e.), bipolar disorder and major depressive disorder (0.47 ± 0.06 s.e.), and ADHD...

  18. Novel SNPs in the exon region of bovine DKK4 gene and their association with body measurement traits in Qinchuan cattle.

    Science.gov (United States)

    Gao, J B; Li, Y K; Yang, N; Ma, X H; Adoligbe, C; Jiang, B J; Fu, C Z; Cheng, G; Zan, L S

    2013-02-28

    The aim of this study was to determine whether single nucleotide polymorphisms (SNPs) of bovine Dickkopf homolog 4 (DKK4) are associated with body measurement traits in Qinchuan cattle. By using PCR-SSCP technology and DNA sequencing, we discovered 5 DKK4 SNPs in Qinchuan cattle, including -65G>A and -77G>T in the 5'-untranslated region, 1532C>G and 1533T>C in exon 2, and 2088C>T in exon 3. The sequencing map showed that 1532C>G and 1533T>C were in close linkage disequilibrium and were treated as 1532C>G-1533T>C in this study. Allele frequencies were calculated and analyzed by the chi-square test, which showed that -65G>A and 1532C>G-1533T>C were in Hardy-Weinberg equilibrium (P > 0.05), whereas -77G>T and 2088C>T were not in all 633 tested Qinchuan cattle individuals (P A; 0.472, 1.894, and 0.361 at -77G>T; 0.476, 1.908, and 0.363 at 1532C>G-1533T>C; and 0.218, 1.279, and 0.195 at 2088C>T. We also evaluated the potential association of these SNPs with body measurement traits in all 633 individuals; the results suggest that several SNPs in Qinchuan cattle DKK4 were significantly associated with body length, hip height, rump length, hip width, heart girth, and pin bone width (P bovine DKK4 could be used as candidate gene for Qinchuan cattle breeding.
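
    The chi-square test of Hardy-Weinberg equilibrium mentioned in the abstract above can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: the function name and genotype counts are hypothetical, the locus is assumed bi-allelic, and the test uses one degree of freedom.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb: observed genotype counts at a bi-allelic locus.
    Returns (chi2, p_value) with one degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped individual is required")
    # Allele frequencies estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts; P > 0.05 is read as no detected departure.
chi2, p_value = hwe_chi_square(310, 260, 63)
print(f"chi2={chi2:.3f} p={p_value:.3f}")
```

    With small expected counts an exact test is generally preferred over the chi-square approximation.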

  19. A Genome Wide Association Study Links Glutamate Receptor Pathway to Sporadic Creutzfeldt-Jakob Disease Risk

    Science.gov (United States)

    Sanchez-Juan, Pascual; Bishop, Matthew T.; Kovacs, Gabor G.; Calero, Miguel; Aulchenko, Yurii S.; Ladogana, Anna; Boyd, Alison; Lewis, Victoria; Ponto, Claudia; Calero, Olga; Poleggi, Anna; Carracedo, Ángel; van der Lee, Sven J.; Ströbel, Thomas; Rivadeneira, Fernando; Hofman, Albert; Haïk, Stéphane; Combarros, Onofre; Berciano, José; Uitterlinden, Andre G.; Collins, Steven J.; Budka, Herbert; Brandel, Jean-Philippe; Laplanche, Jean Louis; Pocchiari, Maurizio; Zerr, Inga; Knight, Richard S. G.; Will, Robert G.; van Duijn, Cornelia M.

    2015-01-01

    We performed a genome-wide association (GWA) study in 434 sporadic Creutzfeldt-Jakob disease (sCJD) patients and 1939 controls from the United Kingdom, Germany and The Netherlands. The findings were replicated in an independent sample of 1109 sCJD cases and 2264 controls provided by a multinational consortium. From the initial GWA analysis we selected 23 SNPs for further genotyping in 1109 sCJD cases from seven different countries. Five SNPs were significantly associated with sCJD after correction for multiple testing. Subsequently these five SNPs were genotyped in 2264 controls. The pooled analysis, including 1543 sCJD cases and 4203 controls, yielded two genome-wide significant results: rs6107516 (p-value=7.62×10^-9), a variant tagging the prion protein gene (PRNP); and rs6951643 (p-value=1.66×10^-8), tagging the Glutamate Receptor Metabotropic 8 gene (GRM8). Next we analysed the data stratifying by country of origin, combining samples from the pooled analysis with genotypes from the 1000 Genomes Project and imputed genotypes from the Rotterdam Study (total n=12967). The meta-analysis of the results showed that rs6107516 (p-value=3.00×10^-8) and rs6951643 (p-value=3.91×10^-5) remained the two most significantly associated SNPs. Rs6951643 is located in an intronic region of GRM8, a gene that was additionally tagged by a cluster of 12 SNPs within our top 100 ranked results. GRM8 encodes mGluR8, a protein which belongs to the metabotropic glutamate receptor family, recently shown to be involved in the transduction of cellular signals triggered by the prion protein. Pathway enrichment analyses performed with both Ingenuity Pathway Analysis and ALIGATOR postulate glutamate receptor signalling as one of the main pathways associated with sCJD. In summary, we have detected GRM8 as a novel, non-PRNP, genome-wide significant marker associated with heightened disease risk, providing additional evidence supporting a role of glutamate receptors in sCJD pathogenesis.
PMID:25918841

  20. Novel SNPs in HSPB8 gene and their association with heat tolerance traits in Sahiwal indigenous cattle.

    Science.gov (United States)

    Verma, Nishant; Gupta, Ishwar Dayal; Verma, Archana; Kumar, Rakesh; Das, Ramendra; Vineeth, M R

    2016-01-01

    Heat shock proteins (HSPs) are expressed in