WorldWideScience

Sample records for protein-coding gene prediction

  1. Computational prediction of over-annotated protein-coding genes in the genome of Agrobacterium tumefaciens strain C58

    Science.gov (United States)

    Yu, Jia-Feng; Sui, Tian-Xiang; Wang, Hong-Mei; Wang, Chun-Ling; Jing, Li; Wang, Ji-Hua

    2015-12-01

    Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. Project supported by the National Natural Science Foundation of China (Grant Nos. 61302186 and 61271378) and the Funding from the State Key Laboratory of Bioelectronics of Southeast University.
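
    The accuracy and Matthews correlation coefficient (MCC) quoted in this abstract are standard confusion-matrix statistics. As a minimal sketch, assuming a binary coding/non-coding classification, they can be computed as follows; the function names and the example counts are illustrative and are not taken from the paper.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all sequences that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Hypothetical counts: 45 coding and 45 non-coding sequences classified
# correctly, 5 errors of each kind.
print(accuracy(45, 45, 5, 5))
print(matthews_corrcoef(45, 45, 5, 5))
```

    MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the two classes differ greatly in size.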

  2. Computational prediction of over-annotated protein-coding genes in the genome of Agrobacterium tumefaciens strain C58

    International Nuclear Information System (INIS)

    Yu Jia-Feng; Sui Tian-Xiang; Wang Ji-Hua; Wang Hong-Mei; Wang Chun-Ling; Jing Li

    2015-01-01

    Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. (special topic)

  3. Histone modification profiles are predictive for tissue/cell-type specific expression of both protein-coding and microRNA genes

    Directory of Open Access Journals (Sweden)

    Zhang Michael Q

    2011-05-01

    Background: Gene expression is regulated at both the DNA sequence level and through modification of chromatin. However, the effect of chromatin on tissue/cell-type specific gene regulation (TCSR) is largely unknown. In this paper, we present a method to elucidate the relationship between histone modification/variation (HMV) and TCSR. Results: A classifier for differentiating CD4+ T cell-specific genes from housekeeping genes using HMV data was built. We found HMV in both promoter and gene body regions to be predictive of genes which are targets of TCSR. For example, the histone modification types H3K4me3 and H3K27ac were identified as the most predictive for CpG-related promoters, whereas H3K4me3 and H3K79me3 were the most predictive for nonCpG-related promoters. However, genes targeted by TCSR can be predicted using other types of HMVs as well. Such redundancy implies that multiple types of underlying regulatory elements, such as enhancers or intragenic alternative promoters, which can regulate gene expression in a tissue/cell-type specific fashion, may be marked by the HMVs. Finally, we show that the predictive power of HMV for TCSR is not limited to protein-coding genes in CD4+ T cells, as we successfully predicted TCSR targeted genes in muscle cells, as well as microRNA genes with expression specific to CD4+ T cells, by the same classifier which was trained on HMV data of protein-coding genes in CD4+ T cells. Conclusion: We have begun to understand the HMV patterns that guide gene expression in both tissue/cell-type specific and ubiquitous manners.

  4. De novo origin of human protein-coding genes.

    Directory of Open Access Journals (Sweden)

    Dong-Dong Wu

    2011-11-01

    The de novo origin of a new protein-coding gene from non-coding DNA is considered to be a very rare occurrence in genomes. Here we identify 60 new protein-coding genes that originated de novo on the human lineage since divergence from the chimpanzee. The functionality of these genes is supported by both transcriptional and proteomic evidence. RNA-seq data indicate that these genes have their highest expression levels in the cerebral cortex and testes, which might suggest that these genes contribute to phenotypic traits that are unique to humans, such as improved cognitive ability. Our results are inconsistent with the traditional view that the de novo origin of new genes is very rare, thus there should be greater appreciation of the importance of the de novo origination of genes.

  5. De Novo Origin of Human Protein-Coding Genes

    Science.gov (United States)

    Wu, Dong-Dong; Irwin, David M.; Zhang, Ya-Ping

    2011-01-01

    The de novo origin of a new protein-coding gene from non-coding DNA is considered to be a very rare occurrence in genomes. Here we identify 60 new protein-coding genes that originated de novo on the human lineage since divergence from the chimpanzee. The functionality of these genes is supported by both transcriptional and proteomic evidence. RNA–seq data indicate that these genes have their highest expression levels in the cerebral cortex and testes, which might suggest that these genes contribute to phenotypic traits that are unique to humans, such as improved cognitive ability. Our results are inconsistent with the traditional view that the de novo origin of new genes is very rare, thus there should be greater appreciation of the importance of the de novo origination of genes. PMID:22102831

  6. Revisiting the missing protein-coding gene catalog of the domestic dog

    Directory of Open Access Journals (Sweden)

    Galibert Francis

    2009-02-01

    Background: Among mammals for which there is a high sequence coverage, the whole genome assembly of the dog is unique in that it predicts a low number of protein-coding genes, ~19,000, compared to the over 20,000 reported for other mammalian species. Of particular interest are the more than 400 genes annotated in primate and rodent genomes but missing in dog. Results: Using over 14,000 orthologous genes between human, chimpanzee, mouse, rat and dog, we built multiple pairwise synteny maps to infer short orthologous intervals that were targeted for characterizing the canine missing genes. Based on gene prediction and a functionality test using the ratio of replacement to silent nucleotide substitution rates (dN/dS), we provide compelling structural and functional evidence for the identification of 232 new protein-coding genes in the canine genome and 69 gene losses, characterized as undetected genes or pseudogenes. Gene loss phyletic pattern analysis using ten species from chicken to human allowed us to characterize 28 canine-specific gene losses that have functional orthologs continuously from chicken or marsupials through human, and 10 genes that arose specifically in the evolutionary lineage leading to rodents and primates. Conclusion: This study demonstrates the central role of comparative genomics for refining gene catalogs and exploring the evolutionary history of gene repertoires, particularly as applied to the characterization of species-specific gene gains and losses.

  7. IN-MACA-MCC: Integrated Multiple Attractor Cellular Automata with Modified Clonal Classifier for Human Protein Coding and Promoter Prediction

    Directory of Open Access Journals (Sweden)

    Kiran Sree Pokkuluri

    2014-01-01

    Protein coding and promoter region predictions are very important challenges of bioinformatics (Attwood and Teresa, 2000). The identification of these regions plays a crucial role in understanding the genes. Many novel computational and mathematical methods have been introduced, and existing methods are being refined, for predicting each of the regions separately; still, there is scope for improvement. We propose a classifier that is built with MACA (multiple attractor cellular automata) and MCC (modified clonal classifier) to predict both regions with a single classifier. The proposed classifier is trained and tested with Fickett and Tung (1992) datasets for protein coding region prediction for DNA sequences of lengths 54, 108, and 162. This classifier is trained and tested with MMCRI datasets for protein coding region prediction for DNA sequences of lengths 252 and 354. The proposed classifier is trained and tested with promoter sequences from the DBTSS (Yamashita et al., 2006) dataset and nonpromoters from the EID (Saxonov et al., 2000) and UTRdb (Pesole et al., 2002) datasets. The proposed model can predict both regions with an average accuracy of 90.5% for promoter and 89.6% for protein coding region predictions. The specificity and sensitivity values of promoter and protein coding region predictions are 0.89 and 0.92, respectively.

  8. Proteogenomics of rare taxonomic phyla: A prospective treasure trove of protein coding genes.

    Science.gov (United States)

    Kumar, Dhirendra; Mondal, Anupam Kumar; Kutum, Rintu; Dash, Debasis

    2016-01-01

    Sustainable innovations in sequencing technologies have resulted in a torrent of microbial genome sequencing projects. However, the prokaryotic genomes sequenced so far are unequally distributed along their phylogenetic tree; few phyla contain the majority, the rest only a few representatives. Accurate genome annotation lags far behind genome sequencing. While automated computational prediction, aided by comparative genomics, remains a popular choice for genome annotation, substantial fraction of these annotations are erroneous. Proteogenomics utilizes protein level experimental observations to annotate protein coding genes on a genome wide scale. Benefits of proteogenomics include discovery and correction of gene annotations regardless of their phylogenetic conservation. This not only allows detection of common, conserved proteins but also the discovery of protein products of rare genes that may be horizontally transferred or taxonomy specific. Chances of encountering such genes are more in rare phyla that comprise a small number of complete genome sequences. We collated all bacterial and archaeal proteogenomic studies carried out to date and reviewed them in the context of genome sequencing projects. Here, we present a comprehensive list of microbial proteogenomic studies, their taxonomic distribution, and also urge for targeted proteogenomics of underexplored taxa to build an extensive reference of protein coding genes. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  9. Expression of protein-coding genes embedded in ribosomal DNA

    DEFF Research Database (Denmark)

    Johansen, Steinar D; Haugen, Peik; Nielsen, Henrik

    2007-01-01

    Ribosomal DNA (rDNA) is a specialised chromosomal location that is dedicated to high-level transcription of ribosomal RNA genes. Interestingly, rDNAs are frequently interrupted by parasitic elements, some of which carry protein genes. These are non-LTR retrotransposons and group II introns that e...... in the nucleolus....

  10. Promoter Analysis Reveals Globally Differential Regulation of Human Long Non-Coding RNA and Protein-Coding Genes

    KAUST Repository

    Alam, Tanvir

    2014-10-02

    Transcriptional regulation of protein-coding genes is increasingly well-understood on a global scale, yet no comparable information exists for long non-coding RNA (lncRNA) genes, which were recently recognized to be as numerous as protein-coding genes in mammalian genomes. We performed a genome-wide comparative analysis of the promoters of human lncRNA and protein-coding genes, finding global differences in specific genetic and epigenetic features relevant to transcriptional regulation. These two groups of genes are hence subject to separate transcriptional regulatory programs, including distinct transcription factor (TF) proteins that significantly favor lncRNA, rather than coding-gene, promoters. We report a specific signature of promoter-proximal transcriptional regulation of lncRNA genes, including several distinct transcription factor binding sites (TFBS). Experimental DNase I hypersensitive site profiles are consistent with active configurations of these lncRNA TFBS sets in diverse human cell types. TFBS ChIP-seq datasets confirm the binding events that we predicted using computational approaches for a subset of factors. For several TFs known to be directly regulated by lncRNAs, we find that their putative TFBSs are enriched at lncRNA promoters, suggesting that the TFs and the lncRNAs may participate in a bidirectional feedback loop regulatory network. Accordingly, cells may be able to modulate lncRNA expression levels independently of mRNA levels via distinct regulatory pathways. Our results also raise the possibility that, given the historical reliance on protein-coding gene catalogs to define the chromatin states of active promoters, a revision of these chromatin signature profiles to incorporate expressed lncRNA genes is warranted in the future.

  11. Codon usage and expression level of human mitochondrial 13 protein coding genes across six continents.

    Science.gov (United States)

    Chakraborty, Supriyo; Uddin, Arif; Mazumder, Tarikul Huda; Choudhury, Monisha Nath; Malakar, Arup Kumar; Paul, Prosenjit; Halder, Binata; Deka, Himangshu; Mazumder, Gulshana Akthar; Barbhuiya, Riazul Ahmed; Barbhuiya, Masuk Ahmed; Devi, Warepam Jesmi

    2017-12-02

    The study of codon usage coupled with phylogenetic analysis is an important tool to understand the genetic and evolutionary relationship of a gene. The 13 protein coding genes of human mitochondria are involved in electron transport chain for the generation of energy currency (ATP). However, no work has yet been reported on the codon usage of the mitochondrial protein coding genes across six continents. To understand the patterns of codon usage in mitochondrial genes across six different continents, we used bioinformatic analyses to analyze the protein coding genes. The codon usage bias was low as revealed from high ENC value. Correlation between codon usage and GC3 suggested that all the codons ending with G/C were positively correlated with GC3 but vice versa for A/T ending codons with the exception of ND4L and ND5 genes. Neutrality plot revealed that for the genes ATP6, COI, COIII, CYB, ND4 and ND4L, natural selection might have played a major role while mutation pressure might have played a dominant role in the codon usage bias of ATP8, COII, ND1, ND2, ND3, ND5 and ND6 genes. Phylogenetic analysis indicated that evolutionary relationships in each of 13 protein coding genes of human mitochondria were different across six continents and further suggested that geographical distance was an important factor for the origin and evolution of 13 protein coding genes of human mitochondria. Copyright © 2017 Elsevier B.V. and Mitochondria Research Society. All rights reserved.
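
    GC3, used in this abstract as a correlate of codon usage, is the G+C fraction at third codon positions. A minimal sketch of the simplest form of that calculation is given below; it assumes an in-frame coding sequence and counts every complete codon, whereas some analyses exclude Met, Trp and stop codons. The function name and the example sequences are illustrative only.

```python
def gc3(cds):
    """G+C fraction at third codon positions of an in-frame coding sequence.

    Any trailing incomplete codon is ignored. This simple version counts
    all complete codons (an assumption; some tools exclude Met, Trp and
    stop codons).
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # length covered by whole codons
    third_bases = [cds[i + 2] for i in range(0, usable, 3)]
    if not third_bases:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "GC" for base in third_bases) / len(third_bases)

# Toy sequences, not real mitochondrial genes.
print(gc3("ATGGCC"))     # third positions: G, C
print(gc3("ATGAAATTT"))  # third positions: G, A, T
```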

  12. Hominoid-specific de novo protein-coding genes originating from long non-coding RNAs.

    Directory of Open Access Journals (Sweden)

    Chen Xie

    2012-09-01

    Tinkering with pre-existing genes has long been known as a major way to create new genes. Recently, however, motherless protein-coding genes have been found to have emerged de novo from ancestral non-coding DNAs. How these genes originated is not well addressed to date. Here we identified 24 hominoid-specific de novo protein-coding genes with precise origination timing in vertebrate phylogeny. Strand-specific RNA-Seq analyses were performed in five rhesus macaque tissues (liver, prefrontal cortex, skeletal muscle, adipose, and testis), which were then integrated with public transcriptome data from human, chimpanzee, and rhesus macaque. On the basis of comparing the RNA expression profiles in the three species, we found that most of the hominoid-specific de novo protein-coding genes encoded polyadenylated non-coding RNAs in rhesus macaque or chimpanzee with a similar transcript structure and correlated tissue expression profile. According to the rule of parsimony, the majority of these hominoid-specific de novo protein-coding genes appear to have acquired a regulated transcript structure and expression profile before acquiring coding potential. Interestingly, although the expression profile was largely correlated, the coding genes in human often showed higher transcriptional abundance than their non-coding counterparts in rhesus macaque. The major findings we report in this manuscript are robust and insensitive to the parameters used in the identification and analysis of de novo genes. Our results suggest that at least a portion of long non-coding RNAs, especially those with active and regulated transcription, may serve as a birth pool for protein-coding genes, which are then further optimized at the transcriptional level.

  13. Selfish DNA in protein-coding genes of Rickettsia.

    Science.gov (United States)

    Ogata, H; Audic, S; Barbe, V; Artiguenave, F; Fournier, P E; Raoult, D; Claverie, J M

    2000-10-13

    Rickettsia conorii, the aetiological agent of Mediterranean spotted fever, is an intracellular bacterium transmitted by ticks. Preliminary analyses of the nearly complete genome sequence of R. conorii have revealed 44 occurrences of a previously undescribed palindromic repeat (150 base pairs long) throughout the genome. Unexpectedly, this repeat was found inserted in-frame within 19 different R. conorii open reading frames likely to encode functional proteins. We found the same repeat in proteins of other Rickettsia species. The finding of a mobile element inserted in many unrelated genes suggests the potential role of selfish DNA in the creation of new protein sequences.

  14. Discovery of rare protein-coding genes in model methylotroph Methylobacterium extorquens AM1.

    Science.gov (United States)

    Kumar, Dhirendra; Mondal, Anupam Kumar; Yadav, Amit Kumar; Dash, Debasis

    2014-12-01

    Proteogenomics involves the use of MS to refine annotation of protein-coding genes and discover genes in a genome. We carried out a comprehensive proteogenomic analysis of Methylobacterium extorquens AM1 (ME-AM1) from publicly available proteomics data with the aim of improving annotation for methylotrophs, organisms capable of surviving on reduced carbon compounds such as methanol. Besides identifying 2482 (50%) proteins, 29 new genes were discovered and 66 annotated gene models were revised in the ME-AM1 genome. One such novel gene is identified with 75 peptides, lacks a homolog in other methylobacteria but has glycosyl transferase and lipopolysaccharide biosynthesis protein domains, indicating its potential role in outer membrane synthesis. Many novel genes are present only in ME-AM1 among methylobacteria. Distant homologs of these genes in unrelated taxonomic classes and the low GC-content of a few genes suggest lateral gene transfer as a potential mode of their origin. Annotations of methylotrophy-related genes were also improved by the discovery of a short gene in the methylotrophy gene island and by redefining a gene important for pyrroquinoline quinone synthesis, essential for methylotrophy. The combined use of proteogenomics and rigorous bioinformatics analysis greatly enhanced the annotation of protein-coding genes in the model methylotroph ME-AM1 genome. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  15. Promoter Analysis Reveals Globally Differential Regulation of Human Long Non-Coding RNA and Protein-Coding Genes

    KAUST Repository

    Alam, Tanvir; Medvedeva, Yulia A.; Jia, Hui; Brown, James B.; Lipovich, Leonard; Bajic, Vladimir B.

    2014-01-01

    raise the possibility that, given the historical reliance on protein-coding gene catalogs to define the chromatin states of active promoters, a revision of these chromatin signature profiles to incorporate expressed lncRNA genes is warranted

  16. A dual origin of the Xist gene from a protein-coding gene and a set of transposable elements.

    Directory of Open Access Journals (Sweden)

    Eugeny A Elisaphenko

    2008-06-01

    X-chromosome inactivation, which occurs in female eutherian mammals, is controlled by a complex X-linked locus termed the X-inactivation center (XIC). Previously it was proposed that genes of the XIC evolved, at least in part, as a result of pseudogenization of protein-coding genes. In this study we show that the key XIC gene Xist, which displays fragmentary homology to a protein-coding gene Lnx3, emerged de novo in early eutherians by integration of mobile elements which gave rise to simple tandem repeats. The Xist gene promoter region and four out of ten exons found in eutherians retain homology to exons of the Lnx3 gene. The remaining six Xist exons, including those with simple tandem repeats detectable in their structure, have similarity to different transposable elements. Integration of mobile elements into Xist accompanies the overall evolution of the gene and presumably continues in contemporary eutherian species. Additionally, we showed that the combination of remnants of protein-coding sequences and mobile elements is not unique to the Xist gene and is found in other XIC genes producing non-coding nuclear RNA.

  17. The small RNA content of human sperm reveals pseudogene-derived piRNAs complementary to protein-coding genes

    DEFF Research Database (Denmark)

    Pantano, Lorena; Jodar, Meritxell; Bak, Mads

    2015-01-01

    -specific genes. The most abundant class of small noncoding RNAs in sperm are PIWI-interacting RNAs (piRNAs). Surprisingly, we found that human sperm cells contain piRNAs processed from pseudogenes. Clusters of piRNAs from human testes contain pseudogenes transcribed in the antisense strand and processed...... into small RNAs. Several human protein-coding genes contain antisense predicted targets of pseudogene-derived piRNAs in the male germline and these piRNAs are still found in mature sperm. Our study provides the most extensive data set and annotation of human sperm small RNAs to date and is a resource...... for further functional studies on the roles of sperm small RNAs. In addition, we propose that some of the pseudogene-derived human piRNAs may regulate expression of their parent gene in the male germline....

  18. A human-specific de novo protein-coding gene associated with human brain functions.

    Directory of Open Access Journals (Sweden)

    Chuan-Yun Li

    2010-03-01

    To understand whether any human-specific new genes may be associated with human brain functions, we computationally screened the genetic vulnerability factors identified through Genome-Wide Association Studies and linkage analyses of nicotine addiction and found one human-specific de novo protein-coding gene, FLJ33706 (alternative gene symbol C20orf203). Cross-species analysis revealed interesting evolutionary paths of how this gene had originated from noncoding DNA sequences: insertion of repeat elements, especially Alu, contributed to the formation of the first coding exon and six standard splice junctions on the branch leading to humans and chimpanzees, and two subsequent substitutions in the human lineage escaped two stop codons and created an open reading frame of 194 amino acids. We experimentally verified FLJ33706's mRNA and protein expression in the brain. Real-Time PCR in multiple tissues demonstrated that FLJ33706 was most abundantly expressed in brain. Human polymorphism data suggested that FLJ33706 encodes a protein under purifying selection. A specifically designed antibody detected its protein expression across human cortex, cerebellum and midbrain. An immunohistochemistry study in normal human brain cortex revealed the localization of FLJ33706 protein in neurons. Elevated expression of FLJ33706 was detected in Alzheimer's brain samples, suggesting a role for this novel gene in human-specific pathogenesis of Alzheimer's disease. FLJ33706 provides the strongest evidence so far that human-specific de novo genes can have protein-coding potential and differential protein expression, and be involved in human brain functions.
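
    The abstract describes substitutions that removed stop codons and thereby opened a reading frame. As a rough illustration of that idea only, the sketch below scans a sequence in frame for the first standard stop codon; the function name and the toy sequences are assumptions made for this example and do not come from the FLJ33706 locus.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code

def first_in_frame_stop(seq, frame=0):
    """Return the 0-based codon index of the first in-frame stop codon,
    or None if the reading frame stays open to the end of the sequence."""
    seq = seq.upper()
    for codon_index, start in enumerate(range(frame, len(seq) - 2, 3)):
        if seq[start:start + 3] in STOP_CODONS:
            return codon_index
    return None

# Toy example: a single T->C substitution turns the stop codon TAA into CAA
# and leaves the frame open.
ancestral = "ATGAAATAAGGG"
derived = "ATGAAACAAGGG"
print(first_in_frame_stop(ancestral))
print(first_in_frame_stop(derived))
```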

  19. Both noncoding and protein-coding RNAs contribute to gene expression evolution in the primate brain.

    Science.gov (United States)

    Babbitt, Courtney C; Fedrigo, Olivier; Pfefferle, Adam D; Boyle, Alan P; Horvath, Julie E; Furey, Terrence S; Wray, Gregory A

    2010-01-18

    Despite striking differences in cognition and behavior between humans and our closest primate relatives, several studies have found little evidence for adaptive change in protein-coding regions of genes expressed primarily in the brain. Instead, changes in gene expression may underlie many cognitive and behavioral differences. Here, we used digital gene expression: tag profiling (here called Tag-Seq, also called DGE:tag profiling) to assess changes in global transcript abundance in the frontal cortex of the brains of 3 humans, 3 chimpanzees, and 3 rhesus macaques. A substantial fraction of transcripts we identified as differentially transcribed among species were not assayed in previous studies based on microarrays. Differentially expressed tags within coding regions are enriched for gene functions involved in synaptic transmission, transport, oxidative phosphorylation, and lipid metabolism. Importantly, because Tag-Seq technology provides strand-specific information about all polyadenylated transcripts, we were able to assay expression in noncoding intragenic regions, including both sense and antisense noncoding transcripts (relative to nearby genes). We find that many noncoding transcripts are conserved in both location and expression level between species, suggesting a possible functional role. Lastly, we examined the overlap between differential gene expression and signatures of positive selection within putative promoter regions, a sign that these differences represent adaptations during human evolution. Comparative approaches may provide important insights into genes responsible for differences in cognitive functions between humans and nonhuman primates, as well as highlighting new candidate genes for studies investigating neurological disorders.

  20. A study on climatic adaptation of dipteran mitochondrial protein coding genes

    Directory of Open Access Journals (Sweden)

    Debajyoti Kabiraj

    2017-10-01

    Diptera, the true flies, are frequently found in nature, and their habitat extends all over the world, including Antarctica and the polar regions. The number of documented species in the order Diptera is quite high and is thought to be 14% of the total animal species present on Earth [1]. Most studies of Diptera have focused on taxa of economic and medical importance, such as the fruit flies Ceratitis capitata and Bactrocera spp. (Tephritidae), which are serious agricultural pests; the blowflies (Calliphoridae) and oestrid flies (Oestridae), which can cause myiasis; the anopheles mosquitoes (Culicidae), which are the vectors of malaria; and leaf-miners (Agromyzidae), which are vegetable and horticultural pests [2]. The insect mitochondrial genome contains 13 protein-coding genes, 22 tRNAs and 2 rRNAs. The mitochondrion, a remnant of an alpha-proteobacterium, is responsible for both energy production and thermoregulation of the cell through the bi-genomic system; thus adaptation to different climatic conditions might have been compensated by complementary changes in both genomes [3,4]. In this study we collected complete mitochondrial genomes and occurrence data for one hundred thirteen such dipteran insects from different databases and a literature survey. Our understanding of the genetic basis of climatic adaptation in Diptera is limited to basic information on the occurrence locations of those species and the mitogenetic factors underlying changes in conspicuous phenotypes. To examine this hypothesis, we performed nucleotide substitution analysis of the 13 protein-coding genes of mitochondrial DNA, individually and combined, using different software, for monophyletic as well as paraphyletic groups of dipteran species. Moreover, we also calculated the codon adaptation index for all dipteran mitochondrial protein-coding genes. Following this work, we classified our sample organisms according to their location data from GBIF (https

  1. Natural selection in avian protein-coding genes expressed in brain.

    Science.gov (United States)

    Axelsson, Erik; Hultin-Rosenberg, Lina; Brandström, Mikael; Zwahlén, Martin; Clayton, David F; Ellegren, Hans

    2008-06-01

    The evolution of birds from theropod dinosaurs took place approximately 150 million years ago, and was associated with a number of specific adaptations that are still evident among extant birds, including feathers, song and extravagant secondary sexual characteristics. Knowledge about the molecular evolutionary background to such adaptations is lacking. Here, we analyse the evolution of > 5000 protein-coding gene sequences expressed in zebra finch brain by comparison to orthologous sequences in chicken. Mean d(N)/d(S) is 0.085 and genes with their maximal expression in the eye and central nervous system have the lowest mean d(N)/d(S) value, while those expressed in digestive and reproductive tissues exhibit the highest. We find that fast-evolving genes (those which have higher than expected rate of nonsynonymous substitution, indicative of adaptive evolution) are enriched for biological functions such as fertilization, muscle contraction, defence response, response to stress, wounding and endogenous stimulus, and cell death. After alignment to mammalian orthologues, we identify a catalogue of 228 genes that show a significantly higher rate of protein evolution in the two bird lineages than in mammals. These accelerated bird genes, representing candidates for avian-specific adaptations, include genes implicated in vocal learning and other cognitive processes. Moreover, colouration genes evolve faster in birds than in mammals, which may have been driven by sexual selection for extravagant plumage characteristics.

  2. Inheritance-mode specific pathogenicity prioritization (ISPP) for human protein coding genes.

    Science.gov (United States)

    Hsu, Jacob Shujui; Kwan, Johnny S H; Pan, Zhicheng; Garcia-Barcelo, Maria-Mercè; Sham, Pak Chung; Li, Miaoxin

    2016-10-15

    Exome sequencing studies have facilitated the detection of causal genetic variants in yet-unsolved Mendelian diseases. However, the identification of disease causal genes among a list of candidates in an exome sequencing study is still not fully settled, and it is often difficult to prioritize candidate genes for follow-up studies. The inheritance mode provides crucial information for understanding Mendelian diseases, but none of the existing gene prioritization tools fully utilize this information. We examined the characteristics of Mendelian disease genes under different inheritance modes. The results suggest that Mendelian disease genes with autosomal dominant (AD) inheritance mode are more sensitive to haploinsufficiency and de novo mutations, whereas autosomal recessive (AR) genes have significantly more non-synonymous variants and regulatory transcript isoforms. In addition, the X-linked (XL) Mendelian disease genes have fewer non-synonymous and synonymous variants. As a result, we derived a new scoring system for prioritizing candidate genes for Mendelian diseases according to the inheritance mode. Our scoring system assigned to each annotated protein-coding gene (N = 18 859) three pathogenic scores according to the inheritance mode (AD, AR and XL). This inheritance mode-specific framework achieved higher accuracy (area under curve = 0.84) in XL mode. The inheritance-mode specific pathogenicity prioritization (ISPP) outperformed other well-known methods including Haploinsufficiency, Recessive, Network centrality, Genic Intolerance, Gene Damage Index and Gene Constraint scores. This systematic study suggests that genes manifesting disease inheritance modes tend to have unique characteristics. ISPP is included in KGGSeq v1.0 (http://grass.cgs.hku.hk/limx/kggseq/), and source code is available from (https://github.com/jacobhsu35/ISPP.git). Contact: mxli@hku.hk. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author

  3. Conserved syntenic clusters of protein coding genes are missing in birds.

    Science.gov (United States)

    Lovell, Peter V; Wirthlin, Morgan; Wilhelm, Larry; Minx, Patrick; Lazar, Nathan H; Carbone, Lucia; Warren, Wesley C; Mello, Claudio V

    2014-01-01

    Birds are one of the most highly successful and diverse groups of vertebrates, having evolved a number of distinct characteristics, including feathers and wings, a sturdy lightweight skeleton and unique respiratory and urinary/excretion systems. However, the genetic basis of these traits is poorly understood. Using comparative genomics based on extensive searches of 60 avian genomes, we have found that birds lack approximately 274 protein coding genes that are present in the genomes of most vertebrate lineages and are for the most part organized in conserved syntenic clusters in non-avian sauropsids and in humans. These genes are located in regions associated with chromosomal rearrangements, and are largely present in crocodiles, suggesting that their loss occurred subsequent to the split of dinosaurs/birds from crocodilians. Many of these genes are associated with lethality in rodents, human genetic disorders, or biological functions targeting various tissues. Functional enrichment analysis combined with orthogroup analysis and paralog searches revealed enrichments that were shared by non-avian species, present only in birds, or shared between all species. Together these results provide a clearer definition of the genetic background of extant birds, extend the findings of previous studies on missing avian genes, and provide clues about molecular events that shaped avian evolution. They also have implications for fields that largely benefit from avian studies, including development, immune system, oncogenesis, and brain function and cognition. With regards to the missing genes, birds can be considered ‘natural knockouts’ that may become invaluable model organisms for several human diseases.

  4. PanCoreGen - Profiling, detecting, annotating protein-coding genes in microbial genomes.

    Science.gov (United States)

    Paul, Sandip; Bhardwaj, Archana; Bag, Sumit K; Sokurenko, Evgeni V; Chattopadhyay, Sujay

    2015-12-01

    A large amount of genomic data, especially from multiple isolates of a single species, has opened new vistas for microbial genomics analysis. Analyzing the pan-genome (i.e. the sum of genetic repertoire) of microbial species is crucial in understanding the dynamics of molecular evolution, where virulence evolution is of major interest. Here we present PanCoreGen - a standalone application for pan- and core-genomic profiling of microbial protein-coding genes. PanCoreGen overcomes key limitations of the existing pan-genomic analysis tools, and develops an integrated annotation-structure for a species-specific pan-genomic profile. It provides important new features for annotating draft genomes/contigs and detecting unidentified genes in annotated genomes. It also generates user-defined group-specific datasets within the pan-genome. Interestingly, analyzing an example-set of Salmonella genomes, we detect potential footprints of adaptive convergence of horizontally transferred genes in two human-restricted pathogenic serovars - Typhi and Paratyphi A. Overall, PanCoreGen represents a state-of-the-art tool for microbial phylogenomics and pathogenomics study. Copyright © 2015 Elsevier Inc. All rights reserved.

  5. PanCoreGen – profiling, detecting, annotating protein-coding genes in microbial genomes

    Science.gov (United States)

    Bhardwaj, Archana; Bag, Sumit K; Sokurenko, Evgeni V.

    2015-01-01

    A large amount of genomic data, especially from multiple isolates of a single species, has opened new vistas for microbial genomics analysis. Analyzing pan-genome (i.e. the sum of genetic repertoire) of microbial species is crucial in understanding the dynamics of molecular evolution, where virulence evolution is of major interest. Here we present PanCoreGen – a standalone application for pan- and core-genomic profiling of microbial protein-coding genes. PanCoreGen overcomes key limitations of the existing pan-genomic analysis tools, and develops an integrated annotation-structure for species-specific pan-genomic profile. It provides important new features for annotating draft genomes/contigs and detecting unidentified genes in annotated genomes. It also generates user-defined group-specific datasets within the pan-genome. Interestingly, analyzing an example-set of Salmonella genomes, we detect potential footprints of adaptive convergence of horizontally transferred genes in two human-restricted pathogenic serovars – Typhi and Paratyphi A. Overall, PanCoreGen represents a state-of-the-art tool for microbial phylogenomics and pathogenomics study. PMID:26456591

  6. The small RNA content of human sperm reveals pseudogene-derived piRNAs complementary to protein-coding genes

    Science.gov (United States)

    Pantano, Lorena; Jodar, Meritxell; Bak, Mads; Ballescà, Josep Lluís; Tommerup, Niels; Oliva, Rafael; Vavouri, Tanya

    2015-01-01

    At the end of mammalian sperm development, sperm cells expel most of their cytoplasm and dispose of the majority of their RNA. Yet, hundreds of RNA molecules remain in mature sperm. The biological significance of the vast majority of these molecules is unclear. To better understand the processes that generate sperm small RNAs and what roles they may have, we sequenced and characterized the small RNA content of sperm samples from two human fertile individuals. We detected 182 microRNAs, some of which are highly abundant. The most abundant microRNA in sperm is miR-1246 with predicted targets among sperm-specific genes. The most abundant class of small noncoding RNAs in sperm are PIWI-interacting RNAs (piRNAs). Surprisingly, we found that human sperm cells contain piRNAs processed from pseudogenes. Clusters of piRNAs from human testes contain pseudogenes transcribed in the antisense strand and processed into small RNAs. Several human protein-coding genes contain antisense predicted targets of pseudogene-derived piRNAs in the male germline and these piRNAs are still found in mature sperm. Our study provides the most extensive data set and annotation of human sperm small RNAs to date and is a resource for further functional studies on the roles of sperm small RNAs. In addition, we propose that some of the pseudogene-derived human piRNAs may regulate expression of their parent gene in the male germline. PMID:25904136

  7. Ribosome Profiling Reveals Pervasive Translation Outside of Annotated Protein-Coding Genes

    Directory of Open Access Journals (Sweden)

    Nicholas T. Ingolia

    2014-09-01

    Ribosome profiling suggests that ribosomes occupy many regions of the transcriptome thought to be noncoding, including 5′ UTRs and long noncoding RNAs (lncRNAs). Apparent ribosome footprints outside of protein-coding regions raise the possibility of artifacts unrelated to translation, particularly when they occupy multiple, overlapping open reading frames (ORFs). Here, we show hallmarks of translation in these footprints: copurification with the large ribosomal subunit, response to drugs targeting elongation, trinucleotide periodicity, and initiation at early AUGs. We develop a metric for distinguishing between 80S footprints and nonribosomal sources using footprint size distributions, which validates the vast majority of footprints outside of coding regions. We present evidence for polypeptide production beyond annotated genes, including the induction of immune responses following human cytomegalovirus (HCMV) infection. Translation is pervasive on cytosolic transcripts outside of conserved reading frames, and direct detection of this expanded universe of translated products enables efforts at understanding how cells manage and exploit its consequences.
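
    One of the hallmarks listed above, trinucleotide periodicity, can be illustrated with a small sketch: if footprints come from translating ribosomes, their 5' ends should fall predominantly in one of the three codon frames of the ORF. The function name, coordinates, and counts below are hypothetical and are not the metric or data used in the study.

```python
from collections import Counter


def frame_fractions(footprint_starts, orf_start):
    """Fraction of footprint 5' ends in each codon frame (0, 1, 2) of an ORF.

    A strong bias toward a single frame is the signature of
    trinucleotide periodicity; a flat distribution is not.
    """
    if not footprint_starts:
        raise ValueError("need at least one footprint position")
    counts = Counter((pos - orf_start) % 3 for pos in footprint_starts)
    total = sum(counts.values())
    return [counts.get(frame, 0) / total for frame in range(3)]


# Hypothetical footprint 5' positions on a transcript whose ORF starts at 100.
starts = [100, 103, 106, 109, 112, 101, 115, 118, 104, 121]
print(frame_fractions(starts, orf_start=100))
```

    In practice footprint positions are usually offset to the ribosomal P-site before frames are assigned; that step is omitted here for brevity.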

  8. Comparison of protein coding gene contents of the fungal phyla Pezizomycotina and Saccharomycotina

    DEFF Research Database (Denmark)

    Arvas, Mikko; Kivioja, Teemu; Mitchell, Alex

    2007-01-01

    Saccharomycotina are slightly better characterised and predicted to encode mainly enzymes. The genes specific to Saccharomycotina are enriched in transcription and mitochondrion related functions. Especially mitochondrial ribosomal proteins seem to have diverged from those of Pezizomycotina. In addition, we...

  9. Evaluation of the efficacy of twelve mitochondrial protein-coding genes as barcodes for mollusk DNA barcoding.

    Science.gov (United States)

    Yu, Hong; Kong, Lingfeng; Li, Qi

    2016-01-01

    In this study, we evaluated the efficacy of 12 mitochondrial protein-coding genes from 238 mitochondrial genomes of 140 molluscan species as potential DNA barcodes for mollusks. Three barcoding methods (distance, monophyly and character-based methods) were used in species identification. The species recovery rates based on genetic distances for the 12 genes ranged from 70.83 to 83.33%. There were no significant differences in intra- or interspecific variability among the 12 genes. The monophyly and character-based methods provided higher resolution than the distance-based method in species delimitation. Especially in closely related taxa, the character-based method showed some advantages. The results suggested that, besides the standard COI barcode, the other 11 mitochondrial protein-coding genes could also potentially be used as molecular diagnostics for molluscan species discrimination. Our results also showed that the combination of mitochondrial genes did not enhance the efficacy for species identification and a single mitochondrial gene would be fully competent.
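
    As a rough illustration of a distance-based recovery rate like the one reported above, the sketch below assigns each sequence to the species of its nearest neighbour under the uncorrected p-distance and counts how often that matches its own label. The study's exact distance model and identification criteria are not given in the abstract, so this nearest-neighbour scheme, the function names, and the toy sequences are all assumptions.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_species(query, references):
    """Species label of the reference sequence closest to the query."""
    return min(references, key=lambda item: p_distance(query, item[1]))[0]


def recovery_rate(records):
    """Fraction of sequences whose nearest other sequence has the same label."""
    hits = 0
    for i, (label, seq) in enumerate(records):
        others = records[:i] + records[i + 1:]
        if nearest_species(seq, others) == label:
            hits += 1
    return hits / len(records)


# Toy aligned barcode fragments, made up for illustration only.
records = [
    ("species_A", "ATGCATGCAT"),
    ("species_A", "ATGCATGCAA"),
    ("species_B", "TTGCTTGCTT"),
    ("species_B", "TTGCTTGCTA"),
    ("species_C", "GGGCATGGAT"),  # singleton: it cannot be recovered
]
print(recovery_rate(records))
```

    Species represented by a single sequence always count as failures under this scheme, which is one reason recovery rates depend on sampling as well as on the marker.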

  10. Non-Protein Coding RNAs

    CERN Document Server

    Walter, Nils G; Batey, Robert T

    2009-01-01

    This book assembles chapters from experts in the Biophysics of RNA to provide a broadly accessible snapshot of the current status of this rapidly expanding field. The 2006 Nobel Prize in Physiology or Medicine was awarded to the discoverers of RNA interference, highlighting just one example of a large number of non-protein coding RNAs. Because non-protein coding RNAs outnumber protein coding genes in mammals and other higher eukaryotes, it is now thought that the complexity of organisms is correlated with the fraction of their genome that encodes non-protein coding RNAs. Essential biological processes as diverse as cell differentiation, suppression of infecting viruses and parasitic transposons, higher-level organization of eukaryotic chromosomes, and gene expression itself are found to largely be directed by non-protein coding RNAs. The biophysical study of these RNAs employs X-ray crystallography, NMR, ensemble and single molecule fluorescence spectroscopy, optical tweezers, cryo-electron microscopy, and ot...

  11. Bioinformatics analysis identify novel OB fold protein coding genes in C. elegans.

    Directory of Open Access Journals (Sweden)

    Daryanaz Dargahi

    BACKGROUND: The C. elegans genome has been extensively annotated by the WormBase consortium that uses state of the art bioinformatics pipelines, functional genomics and manual curation approaches. As a result, the identification of novel genes in silico in this model organism is becoming more challenging requiring new approaches. The Oligonucleotide-oligosaccharide binding (OB) fold is a highly divergent protein family, in which protein sequences, in spite of having the same fold, share very little sequence identity (5-25%). Therefore, evidence from sequence-based annotation may not be sufficient to identify all the members of this family. In C. elegans, the number of OB-fold proteins reported is remarkably low (n=46) compared to other evolutionary-related eukaryotes, such as yeast S. cerevisiae (n=344) or fruit fly D. melanogaster (n=84). Gene loss during evolution or differences in the level of annotation for this protein family, may explain these discrepancies. METHODOLOGY/PRINCIPAL FINDINGS: This study examines the possibility that novel OB-fold coding genes exist in the worm. We developed a bioinformatics approach that uses the most sensitive sequence-sequence, sequence-profile and profile-profile similarity search methods followed by 3D-structure prediction as a filtering step to eliminate false positive candidate sequences. We have predicted 18 coding genes containing the OB-fold that have remarkably partially been characterized in C. elegans. CONCLUSIONS/SIGNIFICANCE: This study raises the possibility that the annotation of highly divergent protein fold families can be improved in C. elegans. Similar strategies could be implemented for large scale analysis by the WormBase consortium when novel versions of the genome sequence of C. elegans, or other evolutionary related species are being released. This approach is of general interest to the scientific community since it can be used to annotate any genome.

  12. Vertebrate gene predictions and the problem of large genes

    DEFF Research Database (Denmark)

    Wang, Jun; Li, ShengTing; Zhang, Yong

    2003-01-01

    To find unknown protein-coding genes, annotation pipelines use a combination of ab initio gene prediction and similarity to experimentally confirmed genes or proteins. Here, we show that although the ab initio predictions have an intrinsically high false-positive rate, they also have a consistent...

  13. A systematic genome-wide analysis of zebrafish protein-coding gene function

    NARCIS (Netherlands)

    Kettleborough, R.N.; Busch-Nentwich, E.M.; Harvey, S.A.; Dooley, C.M.; de Bruijn, E.; van Eeden, F.; Sealy, I.; White, R.J.; Herd, C.; Nijman, I.J.; Fenyes, F.; Mehroke, S.; Scahill, C.; Gibbons, R.; Wali, N.; Carruthers, S.; Hall, A.; Yen, J.; Cuppen, E.; Stemple, D.L.

    2013-01-01

    Since the publication of the human reference genome, the identities of specific genes associated with human diseases are being discovered at a rapid rate. A central problem is that the biological activity of these genes is often unclear. Detailed investigations in model vertebrate organisms,

  14. Novel methods for the molecular discrimination of Fasciola spp. on the basis of nuclear protein-coding genes.

    Science.gov (United States)

    Shoriki, Takuya; Ichikawa-Seki, Madoka; Suganuma, Keisuke; Naito, Ikunori; Hayashi, Kei; Nakao, Minoru; Aita, Junya; Mohanta, Uday Kumar; Inoue, Noboru; Murakami, Kenji; Itagaki, Tadashi

    2016-06-01

    Fasciolosis is an economically important disease of livestock caused by Fasciola hepatica, Fasciola gigantica, and aspermic Fasciola flukes. The aspermic Fasciola flukes have been discriminated morphologically from the two other species by the absence of sperm in their seminal vesicles. To date, the molecular discrimination of F. hepatica and F. gigantica has relied on the nucleotide sequences of the internal transcribed spacer 1 (ITS1) region. However, ITS1 genotypes of aspermic Fasciola flukes cannot be clearly differentiated from those of F. hepatica and F. gigantica. Therefore, more precise and robust methods are required to discriminate Fasciola spp. In this study, we developed PCR restriction fragment length polymorphism and multiplex PCR methods to discriminate F. hepatica, F. gigantica, and aspermic Fasciola flukes on the basis of the nuclear protein-coding genes, phosphoenolpyruvate carboxykinase and DNA polymerase delta, which are single locus genes in most eukaryotes. All aspermic Fasciola flukes used in this study had mixed fragment pattern of F. hepatica and F. gigantica for both of these genes, suggesting that the flukes are descended through hybridization between the two species. These molecular methods will facilitate the identification of F. hepatica, F. gigantica, and aspermic Fasciola flukes, and will also prove useful in etiological studies of fasciolosis. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  15. Natural selection on protein-coding genes in the human genome

    DEFF Research Database (Denmark)

    Bustamante, Carlos D.; Fledel-Alon, Adi; Williamson, Scott

    2005-01-01

    Comparisons of DNA polymorphism within species to divergence between species enables the discovery of molecular adaptation in evolutionarily constrained genes as well as the differentiation of weak from strong purifying selection [1-4]. The extent to which weak negative and positive darwinian selection have driven the molecular evolution of different species varies greatly [5-16], with some species, such as Drosophila melanogaster, showing strong evidence of pervasive positive selection [6-9], and others, such as the selfing weed Arabidopsis thaliana, showing an excess of deleterious variation within local populations [9, 10]. Here we contrast patterns of coding sequence polymorphism identified by direct sequencing of 39 humans for over 11,000 genes to divergence between humans and chimpanzees, and find strong evidence that natural selection has shaped...

  16. ChIPBase v2.0: decoding transcriptional regulatory networks of non-coding RNAs and protein-coding genes from ChIP-seq data.

    Science.gov (United States)

    Zhou, Ke-Ren; Liu, Shun; Sun, Wen-Ju; Zheng, Ling-Ling; Zhou, Hui; Yang, Jian-Hua; Qu, Liang-Hu

    2017-01-04

    The abnormal transcriptional regulation of non-coding RNAs (ncRNAs) and protein-coding genes (PCGs) contributes to various biological processes and is linked with human diseases, but the underlying mechanisms remain elusive. In this study, we developed ChIPBase v2.0 (http://rna.sysu.edu.cn/chipbase/) to explore the transcriptional regulatory networks of ncRNAs and PCGs. ChIPBase v2.0 has been expanded with ∼10 200 curated ChIP-seq datasets, which represents an approximately 20-fold expansion compared with the previously released version. We identified thousands of binding motif matrices and their binding sites from ChIP-seq data of DNA-binding proteins and predicted millions of transcriptional regulatory relationships between transcription factors (TFs) and genes. We constructed the 'Regulator' module to predict hundreds of TFs and histone modifications that were involved in or affected transcription of ncRNAs and PCGs. Moreover, we built a web-based tool, Co-Expression, to explore the co-expression patterns between DNA-binding proteins and various types of genes by integrating the gene expression profiles of ∼10 000 tumor samples and ∼9100 normal tissues and cell lines. ChIPBase also provides a ChIP-Function tool and a genome browser to predict functions of diverse genes and visualize various ChIP-seq data. This study will greatly expand our understanding of the transcriptional regulation of ncRNAs and PCGs. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  17. Analysis of antisense expression by whole genome tiling microarrays and siRNAs suggests mis-annotation of Arabidopsis orphan protein-coding genes.

    Directory of Open Access Journals (Sweden)

    Casey R Richardson

    2010-05-01

    MicroRNAs (miRNAs) and trans-acting small-interfering RNAs (tasi-RNAs) are small (20-22 nt long) RNAs (smRNAs) generated from hairpin secondary structures or antisense transcripts, respectively, that regulate gene expression by Watson-Crick pairing to a target mRNA and altering expression by mechanisms related to RNA interference. The high sequence homology of plant miRNAs to their targets has been the mainstay of miRNA prediction algorithms, which are limited in their predictive power for other kingdoms because miRNA complementarity is less conserved yet transitive processes (production of antisense smRNAs) are active in eukaryotes. We hypothesize that antisense transcription and associated smRNAs are biomarkers which can be computationally modeled for gene discovery. We explored rice (Oryza sativa) sense and antisense gene expression in publicly available whole genome tiling array transcriptome data and sequenced smRNA libraries (as well as C. elegans) and found evidence of transitivity of MIRNA genes similar to that found in Arabidopsis. Statistical analysis of antisense transcript abundances, presence of antisense ESTs, and association with smRNAs suggests several hundred Arabidopsis 'orphan' hypothetical genes are non-coding RNAs. Consistent with this hypothesis, we found novel Arabidopsis homologues of some MIRNA genes on the antisense strand of previously annotated protein-coding genes. A Support Vector Machine (SVM) was applied using thermodynamic energy of binding plus novel expression features of sense/antisense transcription topology and siRNA abundances to build a prediction model of miRNA targets. The SVM when trained on targets could predict the "ancient" (deeply conserved) class of validated Arabidopsis MIRNA genes with an accuracy of 84%, and 76% for "new" rapidly-evolving MIRNA genes. Antisense and smRNA expression features and computational methods may identify novel MIRNA genes and other non-coding RNAs in plants and potentially other

  18. Phylogenetic relationships within Echinococcus and Taenia tapeworms (Cestoda: Taeniidae): an inference from nuclear protein-coding genes.

    Science.gov (United States)

    Knapp, Jenny; Nakao, Minoru; Yanagida, Tetsuya; Okamoto, Munehiro; Saarma, Urmas; Lavikainen, Antti; Ito, Akira

    2011-12-01

    The family Taeniidae of tapeworms is composed of two genera, Echinococcus and Taenia, which obligately parasitize mammals including humans. Inferring phylogeny via molecular markers is the only way to trace back their evolutionary histories. However, molecular dating approaches are lacking so far. Here we established new markers from nuclear protein-coding genes for RNA polymerase II second largest subunit (rpb2), phosphoenolpyruvate carboxykinase (pepck) and DNA polymerase delta (pold). Bayesian inference and maximum likelihood analyses of the concatenated gene sequences allowed us to reconstruct phylogenetic trees for taeniid parasites. The tree topologies clearly demonstrated that Taenia is paraphyletic and that the clade of Echinococcus oligarthrus and Echinococcus vogeli is sister to all other members of Echinococcus. Both species are endemic in Central and South America, and their definitive hosts originated from carnivores that immigrated from North America after the formation of the Panamanian land bridge about 3 million years ago (Ma). A time-calibrated phylogeny was estimated by a Bayesian relaxed-clock method based on the assumption that the most recent common ancestor of E. oligarthrus and E. vogeli existed during the late Pliocene (3.0 Ma). The results suggest that a clade of Taenia including human-pathogenic species diversified primarily in the late Miocene (11.2 Ma), whereas Echinococcus started to diversify later, in the end of the Miocene (5.8 Ma). Close genetic relationships among the members of Echinococcus imply that the genus is a young group in which speciation and global radiation occurred rapidly. Copyright © 2011 Elsevier Inc. All rights reserved.

  19. Improvement of genome assembly completeness and identification of novel full-length protein-coding genes by RNA-seq in the giant panda genome.

    Science.gov (United States)

    Chen, Meili; Hu, Yibo; Liu, Jingxing; Wu, Qi; Zhang, Chenglin; Yu, Jun; Xiao, Jingfa; Wei, Fuwen; Wu, Jiayan

    2015-12-11

    High-quality and complete gene models are the basis of whole genome analyses. The giant panda (Ailuropoda melanoleuca) genome was the first genome sequenced on the basis of solely short reads, but the genome annotation had lacked the support of transcriptomic evidence. In this study, we applied RNA-seq to globally improve the genome assembly completeness and to detect novel expressed transcripts in 12 tissues from giant pandas, by using a transcriptome reconstruction strategy that combined reference-based and de novo methods. Several aspects of genome assembly completeness in the transcribed regions were effectively improved by the de novo assembled transcripts, including genome scaffolding, the detection of small-size assembly errors, the extension of scaffold/contig boundaries, and gap closure. Through expression and homology validation, we detected three groups of novel full-length protein-coding genes. A total of 12.62% of the novel protein-coding genes were validated by proteomic data. GO annotation analysis showed that some of the novel protein-coding genes were involved in pigmentation, anatomical structure formation and reproduction, which might be related to the development and evolution of the black-white pelage, pseudo-thumb and delayed embryonic implantation of giant pandas. The updated genome annotation will help further giant panda studies from both structural and functional perspectives.

  20. Emerging putative associations between non-coding RNAs and protein-coding genes in Neuropathic Pain. Added value from re-using microarray data.

    Directory of Open Access Journals (Sweden)

    Enrico Capobianco

    2016-10-01

    Regeneration of injured nerves is likely occurring in the peripheral nervous system, but not in the central nervous system. Although protein-coding gene expression has been assessed during nerve regeneration, little is currently known about the role of non-coding RNAs (ncRNAs). This leaves open questions about the potential effects of ncRNAs at transcriptome level. Due to the limited availability of human neuropathic pain data, we have identified the most comprehensive time-course gene expression profile referred to sciatic nerve injury, and studied in a rat model, using two neuronal tissues, namely dorsal root ganglion (DRG) and sciatic nerve (SN). We have developed a methodology to identify differentially expressed bioentities starting from microarray probes, and re-purposing them to annotate ncRNAs, while analyzing the expression profiles of protein-coding genes. The approach is designed to reuse microarray data and perform first profiling and then meta-analysis through three main steps. First, we used contextual analysis to identify what we considered putative or potential protein coding targets for selected ncRNAs. Relevance was therefore assigned to differential expression of neighbor protein-coding genes, with neighborhood defined by a fixed genomic distance from long or antisense ncRNA loci, and of parent genes associated with pseudogenes. Second, connectivity among putative targets was used to build networks, in turn useful to conduct inference at interactomic scale. Last, network paths were annotated to assess relevance to neuropathic pain. We found significant differential expression in long-intergenic ncRNAs (32 lincRNAs in SN, and 8 in DRG), antisense RNA (31 asRNA in SN, and 12 in DRG) and pseudogenes (456 in SN, 56 in DRG). In particular, contextual analysis centered on pseudogenes revealed some targets with known association to neurodegeneration and/or neurogenesis processes. While modules of the olfactory receptors were clearly

  1. A novel TaqMan® assay for Nosema ceranae quantification in honey bee, based on the protein coding gene Hsp70.

    Science.gov (United States)

    Cilia, Giovanni; Cabbri, Riccardo; Maiorana, Giacomo; Cardaio, Ilaria; Dall'Olio, Raffaele; Nanetti, Antonio

    2018-04-01

    Nosema ceranae is now a widespread honey bee pathogen with high incidence in apiculture. Rapid and reliable detection and quantification methods are a matter of concern for the research community, which nowadays relies mainly on biomolecular techniques such as PCR, RT-PCR or HRMA. The aim of this technical paper is to provide a new qPCR assay, based on the highly-conserved protein coding gene Hsp70, to detect and quantify the microsporidian Nosema ceranae affecting the western honey bee Apis mellifera. The validation steps to assess efficiency, sensitivity, specificity and robustness of the assay are also described. Copyright © 2018 Elsevier GmbH. All rights reserved.

  2. Nucleotide sequence of the Escherichia coli pyrE gene and of the DNA in front of the protein-coding region

    DEFF Research Database (Denmark)

    Poulsen, Peter; Jensen, Kaj Frank; Valentin-Hansen, Poul

    1983-01-01

    Orotate phosphoribosyltransferase (EC 2.4.2.10) was purified to electrophoretic homogeneity from a strain of Escherichia coli containing the pyrE gene cloned on a multicopy plasmid. The relative molecular masses (Mr) of the native enzyme and its subunit were estimated by means of gel filtration... leader segment in front of the protein-coding region. This leader contains a structure with features characteristic for a (translated?) rho-independent transcriptional terminator, which is preceded by a cluster of uridylate residues. This indicates that the frequency of pyrE transcription is regulated...

  3. A compendium of transcription factor and Transcriptionally active protein coding gene families in cowpea (Vigna unguiculata L.).

    Science.gov (United States)

    Misra, Vikram A; Wang, Yu; Timko, Michael P

    2017-11-22

    Cowpea (Vigna unguiculata (L.) Walp.) is the most important food and forage legume in the semi-arid tropics of sub-Saharan Africa where approximately 80% of worldwide production takes place primarily on low-input, subsistence farm sites. Among the major goals of cowpea breeding and improvement programs are the rapid manipulation of agronomic traits for seed size and quality and improved resistance to abiotic and biotic stresses to enhance productivity. Knowing the suite of transcription factors (TFs) and transcriptionally active proteins (TAPs) that control various critical plant cellular processes would contribute tremendously to these improvement aims. We used a computational approach that employed three different predictive pipelines to data mine the cowpea genome and identified over 4400 genes representing 136 different TF and TAP families. We compare the information content of cowpea to two evolutionarily close species, common bean (Phaseolus vulgaris) and soybean (Glycine max), to gauge the relative informational content. Our data indicate that, correcting for genome size, cowpea has fewer TF and TAP genes than common bean (4408/5291) and soybean (4408/11,065). Members of the GROWTH-REGULATING FACTOR (GRF) and Auxin/indole-3-acetic acid (Aux/IAA) gene families appear to be over-represented in the genome relative to common bean and soybean, whereas members of the MADS (Minichromosome maintenance deficient 1 (MCM1), AGAMOUS, DEFICIENS, and serum response factor (SRF)) and C2C2-YABBY appear to be under-represented. Analysis of the APETALA2-Ethylene Responsive Element Binding Protein (AP2-EREBP), NAC (NAM (no apical meristem), ATAF1, 2 (Arabidopsis transcription activation factor), CUC (cup-shaped cotyledon)), and WRKY families, known to be important in defense signaling, revealed changes and phylogenetic rearrangements relative to common bean and soybean that suggest these groups may have evolved different functions. The availability of detailed

  4. The complete mitochondrial genome of the land snail Cornu aspersum (Helicidae: Mollusca): intra-specific divergence of protein-coding genes and phylogenetic considerations within Euthyneura.

    Directory of Open Access Journals (Sweden)

    Juan Diego Gaitán-Espitia

    The complete sequences of three mitochondrial genomes from the land snail Cornu aspersum were determined. The mitogenome has a length of 14050 bp, and it encodes 13 protein-coding genes, 22 transfer RNA genes and two ribosomal RNA genes. It also includes nine small intergenic spacers and a large AT-rich intergenic spacer. The intra-specific divergence analysis revealed that COX1 has the lowest genetic differentiation, while the most divergent genes were NADH1, NADH3 and NADH4. With the exception of Euhadra herklotsi, the structural comparisons showed the same gene order within the family Helicidae, and nearly identical gene organization to that found in the order Pulmonata. Phylogenetic reconstruction recovered Basommatophora as a polyphyletic group, and Eupulmonata and Pulmonata as paraphyletic groups. Bayesian and Maximum Likelihood analyses showed that C. aspersum is a close relative of Cepaea nemoralis, and together with the other Helicidae species forms a sister group of Albinaria caerulea, supporting the monophyly of the Stylommatophora clade.

  5. Arabidopsis RNASE THREE LIKE2 Modulates the Expression of Protein-Coding Genes via 24-Nucleotide Small Interfering RNA-Directed DNA Methylation.

    Science.gov (United States)

    Elvira-Matelot, Emilie; Hachet, Mélanie; Shamandi, Nahid; Comella, Pascale; Sáez-Vásquez, Julio; Zytnicki, Matthias; Vaucheret, Hervé

    2016-02-01

    RNaseIII enzymes catalyze the cleavage of double-stranded RNA (dsRNA) and have diverse functions in RNA maturation. Arabidopsis thaliana RNASE THREE LIKE2 (RTL2), which carries one RNaseIII and two dsRNA binding (DRB) domains, is a unique Arabidopsis RNaseIII enzyme resembling the budding yeast small interfering RNA (siRNA)-producing Dcr1 enzyme. Here, we show that RTL2 modulates the production of a subset of small RNAs and that this activity depends on both its RNaseIII and DRB domains. However, the mode of action of RTL2 differs from that of Dcr1. Whereas Dcr1 directly cleaves dsRNAs into 23-nucleotide siRNAs, RTL2 likely cleaves dsRNAs into longer molecules, which are subsequently processed into small RNAs by the DICER-LIKE enzymes. Depending on the dsRNA considered, RTL2-mediated maturation either improves (RTL2-dependent loci) or reduces (RTL2-sensitive loci) the production of small RNAs. Because the vast majority of RTL2-regulated loci correspond to transposons and intergenic regions producing 24-nucleotide siRNAs that guide DNA methylation, RTL2 depletion modifies DNA methylation in these regions. Nevertheless, 13% of RTL2-regulated loci correspond to protein-coding genes. We show that changes in 24-nucleotide siRNA levels also affect DNA methylation levels at such loci and inversely correlate with mRNA steady state levels, thus implicating RTL2 in the regulation of protein-coding gene expression. © 2016 American Society of Plant Biologists. All rights reserved.

  6. A novel bidirectional expression system for simultaneous expression of both the protein-coding genes and short hairpin RNAs in mammalian cells

    International Nuclear Information System (INIS)

    Hung, C.-F.; Cheng, T.-L.; Wu, R.-H.; Teng, C.-F.; Chang, W.-T.

    2006-01-01

    RNA interference (RNAi) is an extremely powerful and widely used gene silencing approach for reverse functional genomics and molecular therapeutics. In mammals, the conserved poly(ADP-ribose) polymerase 2 (PARP-2)/RNase P bidirectional control promoter simultaneously expresses both the PARP-2 protein and RNase P RNA by RNA polymerase II- and III-dependent mechanisms, respectively. To explore this unique bidirectional control system in RNAi-mediated gene silencing strategy, we have constructed two novel bidirectional expression vectors, pbiHsH1 and pbiMmH1, which contained the PARP-2/RNase P bidirectional control promoters from human and mouse, for simultaneous expression of both the protein-coding genes and short hairpin RNAs. Analyses of the dual transcriptional activities indicated that these two bidirectional expression vectors could not only express enhanced green fluorescent protein as a functional reporter but also simultaneously transcribe shLuc for inhibiting the firefly luciferase expression. In addition, to extend its utility for the establishment of inherited stable clones, we have also reconstructed this bidirectional expression system with the blasticidin S deaminase gene, an effective dominant drug resistance selectable marker, and examined both the selection and inhibition efficiencies in drug resistance and gene expression. Moreover, we have further demonstrated that this bidirectional expression system could efficiently co-regulate the functionally important genes, such as overexpression of tumor suppressor protein p53 and inhibition of anti-apoptotic protein Bcl-2 at the same time. In summary, the bidirectional expression vectors, pbiHsH1 and pbiMmH1, should provide a simple, convenient, and efficient novel tool for manipulating the gene function in mammalian cells

  7. Annotation of the protein coding regions of the equine genome

    DEFF Research Database (Denmark)

    Hestand, Matthew S.; Kalbfleisch, Theodore S.; Coleman, Stephen J.

    2015-01-01

    Current gene annotation of the horse genome is largely derived from in silico predictions and cross-species alignments. Only a small number of genes are annotated based on equine EST and mRNA sequences. To expand the number of equine genes annotated from equine experimental evidence, we sequenced m...... and appear to be small errors in the equine reference genome, since they are also identified as homozygous variants by genomic DNA resequencing of the reference horse. Taken together, we provide a resource of equine mRNA structures and protein coding variants that will enhance equine and cross...

  8. Dataset of the first transcriptome assembly of the tree crop “yerba mate” (Ilex paraguariensis) and systematic characterization of protein coding genes

    Directory of Open Access Journals (Sweden)

    Patricia M. Aguilera

    2018-04-01

    This contribution contains data associated with the research article entitled “Exploring the genes of yerba mate (Ilex paraguariensis A. St.-Hil.) by NGS and de novo transcriptome assembly” (Debat et al., 2014) [1]. By means of a bioinformatic approach involving extensive NGS data analyses, we provide a resource encompassing the full transcriptome assembly of yerba mate, the first available reference for the Ilex L. genus. This dataset (Supplementary files 1 and 2) consolidates the transcriptome-wide assembled sequences of I. paraguariensis with further comprehensive annotation of the protein coding genes of yerba mate via the integration of Arabidopsis thaliana databases. The generated data is pivotal for the characterization of agronomically relevant genes in the tree crop yerba mate, a non-model species, and related taxa in Ilex. The raw sequencing data dissected here is available at the DDBJ/ENA/GenBank (NCBI Resource Coordinators, 2016) [2] Sequence Read Archive (SRA) under the accession SRP043293, and the assembled sequences have been deposited at the Transcriptome Shotgun Assembly Sequence Database (TSA) under the accession GFHV00000000.

  9. Nonsynonymous substitution rate (Ka) is a relatively consistent parameter for defining fast-evolving and slow-evolving protein-coding genes

    Directory of Open Access Journals (Sweden)

    Wang Lei

    2011-02-01

    Background: Mammalian genome sequence data are being acquired in large quantities and at enormous speeds. We now have a tremendous opportunity to better understand which genes are the most variable or conserved, and what their particular functions and evolutionary dynamics are, through comparative genomics. Results: We chose human and eleven other high-coverage mammalian genome data, as well as an avian genome as an outgroup, to analyze orthologous protein-coding genes using nonsynonymous (Ka) and synonymous (Ks) substitution rates. After evaluating eight commonly-used methods of Ka and Ks calculation, we observed that these methods yielded a nearly uniform result when estimating Ka, but not Ks (or Ka/Ks). When sorting genes based on Ka, we noticed that fast-evolving and slow-evolving genes often belonged to different functional classes, with respect to species-specificity and lineage-specificity. In particular, we identified two functional classes of genes in the acquired immune system. Fast-evolving genes coded for signal-transducing proteins, such as receptors, ligands, cytokines, and CDs (cluster of differentiation, mostly surface proteins), whereas the slow-evolving genes were for function-modulating proteins, such as kinases and adaptor proteins. In addition, among slow-evolving genes that had functions related to the central nervous system, neurodegenerative disease-related pathways were enriched significantly in most mammalian species. We also confirmed that gene expression was negatively correlated with evolution rate, i.e. slow-evolving genes were expressed at higher levels than fast-evolving genes. Our results indicated that the functional specializations of the three major mammalian clades were: sensory perception and oncogenesis in primates, reproduction and hormone regulation in large mammals, and immunity and angiotensin in rodents. Conclusion: Our study suggests that Ka calculation, which is less biased compared to Ks and Ka

  10. High abundance of Serine/Threonine-rich regions predicted to be hyper-O-glycosylated in the secretory proteins coded by eight fungal genomes

    Directory of Open Access Journals (Sweden)

    González Mario

    2012-09-01

    Background: O-glycosylation of secretory proteins has been found to be an important factor in fungal biology and virulence. It consists of the addition of short glycosidic chains to Ser or Thr residues in the protein backbone via O-glycosidic bonds. Secretory proteins in fungi frequently display Ser/Thr-rich regions that could be sites of extensive O-glycosylation. We have analyzed in silico the complete sets of putatively secretory proteins coded by eight fungal genomes (Botrytis cinerea, Magnaporthe grisea, Sclerotinia sclerotiorum, Ustilago maydis, Aspergillus nidulans, Neurospora crassa, Trichoderma reesei, and Saccharomyces cerevisiae) in search of Ser/Thr-rich regions as well as regions predicted to be highly O-glycosylated by NetOGlyc (http://www.cbs.dtu.dk). Results: By comparison with experimental data, NetOGlyc was found to overestimate the number of O-glycosylation sites in fungi by a factor of 1.5, but to be quite reliable in the prediction of highly O-glycosylated regions. About half of secretory proteins have at least one Ser/Thr-rich region, with a Ser/Thr content of at least 40% over an average length of 40 amino acids. Most secretory proteins in filamentous fungi were predicted to be O-glycosylated, sometimes in dozens or even hundreds of sites. Residues predicted to be O-glycosylated have a tendency to be grouped together, forming hyper-O-glycosylated regions of varying length. Conclusions: About one fourth of secretory fungal proteins were predicted to have at least one hyper-O-glycosylated region, which consists of 45 amino acids on average and displays at least one O-glycosylated Ser or Thr every four residues. These putative highly O-glycosylated regions can be found anywhere along the proteins but have a slight tendency to be at either one of the two ends.
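    The Ser/Thr-rich regions described in this abstract can be illustrated with a simple sliding-window scan. The sketch below is only an illustration and not the authors' procedure: the function name, the fixed window of 20 residues, and the merging of overlapping windows are assumptions; only the 40% Ser/Thr threshold comes from the abstract.

```python
def ser_thr_rich_regions(protein, window=20, min_fraction=0.4):
    """Return merged (start, end) spans whose windows are rich in Ser/Thr.

    `window` and the merging rule are illustrative assumptions; the 40%
    threshold follows the figure quoted in the abstract.
    """
    protein = protein.upper()
    regions = []
    for start in range(0, len(protein) - window + 1):
        segment = protein[start:start + window]
        fraction = (segment.count("S") + segment.count("T")) / window
        if fraction >= min_fraction:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous span: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


example = "A" * 30 + "ST" * 10 + "A" * 30
spans = ser_thr_rich_regions(example)
```

    Spans use half-open, zero-based coordinates, so `example[start:end]` recovers each flagged region. A real analysis would tune the window size and handle signal peptides separately.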

  11. Expression of the Long Intergenic Non-Protein Coding RNA 665 (LINC00665) Gene and the Cell Cycle in Hepatocellular Carcinoma Using The Cancer Genome Atlas, the Gene Expression Omnibus, and Quantitative Real-Time Polymerase Chain Reaction.

    Science.gov (United States)

    Wen, Dong-Yue; Lin, Peng; Pang, Yu-Yan; Chen, Gang; He, Yun; Dang, Yi-Wu; Yang, Hong

    2018-05-05

    BACKGROUND Long non-coding RNAs (lncRNAs) have a role in physiological and pathological processes, including cancer. The aim of this study was to investigate the expression of the long intergenic non-protein coding RNA 665 (LINC00665) gene and the cell cycle in hepatocellular carcinoma (HCC) using database analysis including The Cancer Genome Atlas (TCGA), the Gene Expression Omnibus (GEO), and quantitative real-time polymerase chain reaction (qPCR). MATERIAL AND METHODS Expression levels of LINC00665 were compared between human tissue samples of HCC and adjacent normal liver, clinicopathological correlations were made using TCGA and the GEO, and qPCR was performed to validate the findings. Other public databases were searched for other genes associated with LINC00665 expression, including The Atlas of Noncoding RNAs in Cancer (TANRIC), the Multi Experiment Matrix (MEM), Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and protein-protein interaction (PPI) networks. RESULTS Overexpression of LINC00665 in patients with HCC was significantly associated with gender, tumor grade, stage, and tumor cell type. Overexpression of LINC00665 in patients with HCC was significantly associated with overall survival (OS) (HR=1.477; 95% CI: 1.046-2.086). Bioinformatics analysis identified 469 related genes and further analysis supported a hypothesis that LINC00665 regulates pathways in the cell cycle to facilitate the development and progression of HCC through ten identified core genes: CDK1, BUB1B, BUB1, PLK1, CCNB2, CCNB1, CDC20, ESPL1, MAD2L1, and CCNA2. CONCLUSIONS Overexpression of the lncRNA LINC00665 may be involved in the regulation of cell cycle pathways in HCC through ten identified hub genes.

  12. Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models.

    Science.gov (United States)

    Mahony, Shaun; McInerney, James O; Smith, Terry J; Golden, Aaron

    2004-03-05

    Many current gene prediction methods use only one model to represent protein-coding regions in a genome, and so are less likely to predict the location of genes that have an atypical sequence composition. It is likely that future improvements in gene finding will involve the development of methods that can adequately deal with intra-genomic compositional variation. This work explores a new approach to gene-prediction, based on the Self-Organizing Map, which has the ability to automatically identify multiple gene models within a genome. The current implementation, named RescueNet, uses relative synonymous codon usage as the indicator of protein-coding potential. While its raw accuracy rate can be less than other methods, RescueNet consistently identifies some genes that other methods do not, and should therefore be of interest to gene-prediction software developers and genome annotation teams alike. RescueNet is recommended for use in conjunction with, or as a complement to, other gene prediction methods.
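    The abstract names relative synonymous codon usage (RSCU) as RescueNet's indicator of coding potential. As a minimal sketch of that quantity only (not of RescueNet or its Self-Organizing Map), RSCU for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The function name and the use of the standard genetic code are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seq):
    """Return {codon: RSCU} for an in-frame DNA sequence (hypothetical helper)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Group sense codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values


ala_usage = rscu("GCTGCTGCTGCC")
```

    Here alanine is encoded by three GCT codons and one GCC, so GCT is used three times as often as uniform usage would predict. Codons containing ambiguous bases are simply ignored.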

  13. Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes.

    Science.gov (United States)

    Peng, Hui; Lan, Chaowang; Liu, Yuansheng; Liu, Tao; Blumenstein, Michael; Li, Jinyan

    2017-10-03

    Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases, and applies the newly vectorized data to a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors. The first is composed of 45 elements, characterizing the information entropies of the disease gene distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein-coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease gene enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements the first one to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease-related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.

  14. In-depth comparative analysis of malaria parasite genomes reveals protein-coding genes linked to human disease in Plasmodium falciparum genome.

    Science.gov (United States)

    Liu, Xuewu; Wang, Yuanyuan; Liang, Jiao; Wang, Luojun; Qin, Na; Zhao, Ya; Zhao, Gang

    2018-05-02

    Plasmodium falciparum is the most virulent malaria parasite capable of parasitizing human erythrocytes. The identification of genes related to this capability can enhance our understanding of the molecular mechanisms underlying human malaria and lead to the development of new therapeutic strategies for malaria control. With the availability of several malaria parasite genome sequences, performing computational analysis is now a practical strategy to identify genes contributing to this disease. Here, we developed and used a virtual genome method to assign 33,314 genes from three human malaria parasites, namely, P. falciparum, P. knowlesi and P. vivax, and three rodent malaria parasites, namely, P. berghei, P. chabaudi and P. yoelii, to 4605 clusters. Each cluster consisted of genes whose protein sequences were significantly similar and was considered as a virtual gene. Comparing the enriched values of all clusters in human malaria parasites with those in rodent malaria parasites revealed 115 P. falciparum genes putatively responsible for parasitizing human erythrocytes. These genes are mainly located in the chromosome internal regions and participate in many biological processes, including membrane protein trafficking and thiamine biosynthesis. Meanwhile, 289 P. berghei genes were included in the rodent parasite-enriched clusters. Most are located in subtelomeric regions and encode erythrocyte surface proteins. Comparing cluster values in P. falciparum with those in P. vivax and P. knowlesi revealed 493 candidate genes linked to virulence. Some of them encode proteins present on the erythrocyte surface and participate in cytoadhesion, virulence factor trafficking, or erythrocyte invasion, but many genes with unknown function were also identified. Cerebral malaria is characterized by accumulation of infected erythrocytes at the trophozoite stage in the brain microvasculature. To discover cerebral malaria-related genes, fast Fourier transformation (FFT) was introduced to extract

  15. Cohort-specific imputation of gene expression improves prediction of warfarin dose for African Americans.

    Science.gov (United States)

    Gottlieb, Assaf; Daneshjou, Roxana; DeGorter, Marianne; Bourgeois, Stephane; Svensson, Peter J; Wadelius, Mia; Deloukas, Panos; Montgomery, Stephen B; Altman, Russ B

    2017-11-24

    Genome-wide association studies are useful for discovering genotype-phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into "gene level" effects. Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression, on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort.

  16. Cohort-specific imputation of gene expression improves prediction of warfarin dose for African Americans

    Directory of Open Access Journals (Sweden)

    Assaf Gottlieb

    2017-11-01

    Background: Genome-wide association studies are useful for discovering genotype–phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into “gene level” effects. Methods: Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression, on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. Results: We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Conclusions: Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort

  17. Retrotransposons and non-protein coding RNAs

    DEFF Research Database (Denmark)

    Mourier, Tobias; Willerslev, Eske

    2009-01-01

    does not merely represent spurious transcription. We review examples of functional RNAs transcribed from retrotransposons, and address the collection of non-protein coding RNAs derived from transposable element sequences, including numerous human microRNAs and the neuronal BC RNAs. Finally, we review...

  18. Genome-wide targeted prediction of ABA responsive genes in rice based on over-represented cis-motif in co-expressed genes.

    Science.gov (United States)

    Lenka, Sangram K; Lohia, Bikash; Kumar, Abhay; Chinnusamy, Viswanathan; Bansal, Kailash C

    2009-02-01

    Abscisic acid (ABA), the popular plant stress hormone, plays a key role in the regulation of a subset of stress-responsive genes. These genes respond to ABA through specific transcription factors that bind to cis-regulatory elements present in their promoters. We discovered that the CGMCACGTGB motif, which contains the ABA Responsive Element (ABRE) core (ACGT), is over-represented among the promoters of ABA-responsive co-expressed genes in rice. A targeted gene prediction strategy using this motif led to the identification of 402 protein-coding genes potentially regulated by an ABA-dependent molecular genetic network. RT-PCR analysis of 45 arbitrarily chosen genes from the predicted 402 genes confirmed 80% accuracy of our prediction. Plant Gene Ontology (GO) analysis of ABA-responsive genes showed enrichment of signal transduction and stress-related genes among diverse functional categories.
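    The motif CGMCACGTGB in this abstract uses IUPAC degenerate base codes (M = A or C, B = C, G or T). A minimal sketch of scanning a promoter sequence for such a motif is shown below. It is an illustration, not the authors' pipeline: the function names are hypothetical, only the forward strand is searched, and the example promoter string is made up.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_regex(motif):
    """Compile a degenerate motif into a regex that also finds overlapping hits."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_motif(promoter, motif="CGMCACGTGB"):
    """Return zero-based start positions of the motif on the forward strand."""
    pattern = motif_regex(motif)
    return [m.start() for m in pattern.finditer(promoter.upper())]


hits = find_motif("TTTCGACACGTGCAAA")  # made-up promoter fragment
```

    The lookahead group lets overlapping occurrences be reported. Scanning the reverse complement as well would be a natural extension for real promoter sets.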

  19. What does a worm want with 20,000 genes?

    OpenAIRE

    Hodgkin, Jonathan

    2001-01-01

    The number of genes predicted for the Caenorhabditis elegans genome is remarkably high: approximately 20,000, if both protein-coding and RNA-coding genes are counted. This article discusses possible explanations for such a high value.

  20. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants.

    Science.gov (United States)

    Fu, Wenqing; O'Connor, Timothy D; Jun, Goo; Kang, Hyun Min; Abecasis, Goncalo; Leal, Suzanne M; Gabriel, Stacey; Rieder, Mark J; Altshuler, David; Shendure, Jay; Nickerson, Deborah A; Bamshad, Michael J; Akey, Joshua M

    2013-01-10

    Establishing the age of each mutation segregating in contemporary human populations is important to fully understand our evolutionary history and will help to facilitate the development of new approaches for disease-gene discovery. Large-scale surveys of human genetic variation have reported signatures of recent explosive population growth, notable for an excess of rare genetic variants, suggesting that many mutations arose recently. To more quantitatively assess the distribution of mutation ages, we resequenced 15,336 genes in 6,515 individuals of European American and African American ancestry and inferred the age of 1,146,401 autosomal single nucleotide variants (SNVs). We estimate that approximately 73% of all protein-coding SNVs and approximately 86% of SNVs predicted to be deleterious arose in the past 5,000-10,000 years. The average age of deleterious SNVs varied significantly across molecular pathways, and disease genes contained a significantly higher proportion of recently arisen deleterious SNVs than other genes. Furthermore, European Americans had an excess of deleterious variants in essential and Mendelian disease genes compared to African Americans, consistent with weaker purifying selection due to the Out-of-Africa dispersal. Our results better delimit the historical details of human protein-coding variation, show the profound effect of recent human history on the burden of deleterious SNVs segregating in contemporary populations, and provide important practical information that can be used to prioritize variants in disease-gene discovery.

  1. MOCAT: a metagenomics assembly and gene prediction toolkit.

    Science.gov (United States)

    Kultima, Jens Roat; Sunagawa, Shinichi; Li, Junhua; Chen, Weineng; Chen, Hua; Mende, Daniel R; Arumugam, Manimozhiyan; Pan, Qi; Liu, Binghang; Qin, Junjie; Wang, Jun; Bork, Peer

    2012-01-01

    MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.

  2. MOCAT: a metagenomics assembly and gene prediction toolkit.

    Directory of Open Access Journals (Sweden)

    Jens Roat Kultima

    MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.

  3. Transduplication resulted in the incorporation of two protein-coding sequences into the Turmoil-1 transposable element of C. elegans

    Directory of Open Access Journals (Sweden)

    Pupko Tal

    2008-10-01

    Transposable elements may acquire unrelated gene fragments into their sequences in a process called transduplication. Transduplication of protein-coding genes is common in plants, but is unknown in animals. Here, we report that the Turmoil-1 transposable element in C. elegans has incorporated two protein-coding sequences into its inverted terminal repeat (ITR) sequences. The ITRs of Turmoil-1 contain a conserved RNA recognition motif (RRM) that originated from the rsp-2 gene and a fragment from the protein-coding region of the cpg-3 gene. We further report that an open reading frame specific to C. elegans may have been created as a result of a Turmoil-1 insertion. Mutations at the 5' splice site of this open reading frame may have reactivated the transduplicated RRM motif. Reviewers: This article was reviewed by Dan Graur and William Martin. For the full reviews, please go to the Reviewers' Reports section.

  4. lncRNA Gene Signatures for Prediction of Breast Cancer Intrinsic Subtypes and Prognosis

    Directory of Open Access Journals (Sweden)

    Silu Zhang

    2018-01-01

    Background: Breast cancer is intrinsically heterogeneous and is commonly classified into four main subtypes associated with distinct biological features and clinical outcomes. However, currently available data resources and methods are limited in identifying molecular subtyping on protein-coding genes, and little is known about the roles of long non-coding RNAs (lncRNAs), which occupy 98% of the whole genome. lncRNAs may also play important roles in subgrouping cancer patients and are associated with clinical phenotypes. Methods: The purpose of this project was to identify lncRNA gene signatures that are associated with breast cancer subtypes and clinical outcomes. We identified lncRNA gene signatures from The Cancer Genome Atlas (TCGA) RNAseq data that are associated with breast cancer subtypes by an optimized 1-Norm SVM feature selection algorithm. We evaluated the prognostic performance of these gene signatures with a semi-supervised principal component (superPC) method. Results: Although lncRNAs can independently predict breast cancer subtypes with satisfactory accuracy, a combined gene signature including both coding and non-coding genes will give the best clinically relevant prediction performance. We highlighted eight potential biomarkers (three from coding genes and five from non-coding genes) that are significantly associated with survival outcomes. Conclusion: Our proposed methods are a novel means of identifying subtype-specific coding and non-coding potential biomarkers that are both clinically relevant and biologically significant.

  5. Gene function prediction based on Gene Ontology Hierarchy Preserving Hashing.

    Science.gov (United States)

    Zhao, Yingwen; Fu, Guangyuan; Wang, Jun; Guo, Maozu; Yu, Guoxian

    2018-02-23

    Gene Ontology (GO) uses structured vocabularies (or terms) to describe the molecular functions, biological roles, and cellular locations of gene products in a hierarchical ontology. GO annotations associate genes with GO terms and indicate that the given gene products carry out the biological functions described by the relevant terms. However, predicting correct GO annotations for genes from the massive set of terms defined by GO is a difficult challenge. To address this challenge, we introduce a Gene Ontology Hierarchy Preserving Hashing (HPHash) based semantic method for gene function prediction. HPHash first measures the taxonomic similarity between GO terms. It then uses a hierarchy preserving hashing technique to keep the hierarchical order between GO terms, and to optimize a series of hashing functions to encode massive GO terms via compact binary codes. After that, HPHash utilizes these hashing functions to project the gene-term association matrix into a low-dimensional one and performs semantic similarity based gene function prediction in the low-dimensional space. Experimental results on three model species (Homo sapiens, Mus musculus and Rattus norvegicus) for interspecies gene function prediction show that HPHash performs better than other related approaches and that it is robust to the number of hash functions. In addition, we also take HPHash as a plugin for BLAST based gene function prediction. In these experiments, HPHash again significantly improves the prediction performance. The codes of HPHash are available at: http://mlda.swu.edu.cn/codes.php?name=HPHash. Copyright © 2018 Elsevier Inc. All rights reserved.

  6. Locating protein-coding sequences under selection for additional, overlapping functions in 29 mammalian genomes

    DEFF Research Database (Denmark)

    Lin, Michael F; Kheradpour, Pouya; Washietl, Stefan

    2011-01-01

    conservation compared to typical protein-coding genes—especially at synonymous sites. In this study, we use genome alignments of 29 placental mammals to systematically locate short regions within human ORFs that show conspicuously low estimated rates of synonymous substitution across these species. The 29......-species alignment provides statistical power to locate more than 10,000 such regions with resolution down to nine-codon windows, which are found within more than a quarter of all human protein-coding genes and contain ~2% of their synonymous sites. We collect numerous lines of evidence that the observed...... synonymous constraint in these regions reflects selection on overlapping functional elements including splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. Our results show that overlapping functional elements are common in mammalian...

  7. MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes

    Directory of Open Access Journals (Sweden)

    Yang Yi-Fan

    2007-03-01

    Full Text Available Abstract. Background: Despite remarkable success in the computational prediction of genes in Bacteria and Archaea, a lack of comprehensive understanding of prokaryotic gene structures prevents further elucidation of differences among genomes. It remains of interest to develop new ab initio algorithms that not only accurately predict genes but also facilitate comparative studies of prokaryotic genomes. Results: This paper describes a new prokaryotic gene-finding algorithm based on a comprehensive statistical model of protein-coding Open Reading Frames (ORFs) and Translation Initiation Sites (TISs). The former is based on a linguistic "Entropy Density Profile" (EDP) model of coding DNA sequence and the latter comprises several relevant features related to translation initiation. They are combined to form a so-called Multivariate Entropy Distance (MED) algorithm, MED 2.0, that incorporates several strategies in the iterative program. The iterations enable us to develop a non-supervised learning process and to obtain a set of genome-specific parameters for the gene structure before making the prediction of genes. Conclusion: Results of extensive tests show that MED 2.0 achieves competitively high performance in gene prediction for both 5' and 3' end matches, compared to the current best prokaryotic gene finders. The advantage of MED 2.0 is particularly evident for GC-rich genomes and archaeal genomes. Furthermore, the genome-specific parameters given by MED 2.0 match the current understanding of prokaryotic genomes and may serve as tools for comparative genomic studies. In particular, MED 2.0 is shown to reveal divergent translation initiation mechanisms in archaeal genomes while making a more accurate prediction of TISs compared to the existing gene finders and the current GenBank annotation.
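
    The abstract above does not define the Entropy Density Profile. As a minimal sketch, assuming one formulation recalled from the EDP literature and not taken from this record, each amino acid i with frequency p_i contributes s_i = -(1/H) p_i log p_i, where H is the Shannon entropy of the composition. The function name and the toy peptide are hypothetical, and this is not the MED 2.0 program itself.

```python
import math
from collections import Counter


def entropy_density_profile(protein):
    """Return {amino_acid: s_i} with s_i = -(1/H) * p_i * log(p_i).

    Assumed formulation of the EDP; H is the Shannon entropy of the
    amino-acid composition, so the s_i values sum to 1.
    """
    counts = Counter(protein)
    total = len(protein)
    freqs = {aa: n / total for aa, n in counts.items()}
    entropy = -sum(p * math.log(p) for p in freqs.values())
    if entropy == 0:
        # A sequence made of a single residue type has no defined profile.
        raise ValueError("EDP is undefined for zero-entropy sequences")
    return {aa: -p * math.log(p) / entropy for aa, p in freqs.items()}


# Hypothetical toy peptide: two residue types at equal frequency.
profile = entropy_density_profile("AACC")
```

    Because the profile is normalised by H, sequences of different lengths can be compared on the same scale, which is presumably what makes a distance between profiles usable as a coding/non-coding feature.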

  8. Comparative analysis of codon usage patterns and identification of predicted highly expressed genes in five Salmonella genomes

    Directory of Open Access Journals (Sweden)

    Mondal U

    2008-01-01

    Full Text Available Purpose: To analyse codon usage patterns of five complete genomes of Salmonella, predict highly expressed genes, examine horizontally transferred pathogenicity-related genes to detect their presence in the strains, and scrutinize the nature of highly expressed genes to infer their lifestyle. Methods: Protein coding genes, ribosomal protein genes, and pathogenicity-related genes were analysed with Codon W and CAI (codon adaptation index) Calculator. Results: Translational efficiency plays a role in codon usage variation in Salmonella genes. Low bias was noticed in most of the genes. GC3 (guanine cytosine at third position) composition does not influence codon usage variation in the genes of these Salmonella strains. Among the clusters of orthologous groups (COGs), translation, ribosomal structure and biogenesis [J], and energy production and conversion [C] contained the highest number of potentially highly expressed (PHX) genes. Correspondence analysis reveals the conserved nature of the genes. Highly expressed genes were detected. Conclusions: Selection for translational efficiency is the major source of variation of codon usage in the genes of Salmonella. Evolution of pathogenicity-related genes as a unit suggests their ability to infect and exist as a pathogen. The presence of many PHX genes in the information storage and processing category of COGs indicated their lifestyle and revealed that they were not subjected to genome reduction.
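
    The two quantities named in this abstract, GC3 and the codon adaptation index, can be illustrated with a minimal sketch. It uses the standard definition of CAI as the geometric mean of relative adaptiveness values w (each codon's count divided by the count of the most used synonymous codon in a reference set). The two-amino-acid codon table, the reference sequence, and the 0.5 pseudocount are hypothetical simplifications; this is not the Codon W or CAI Calculator implementation.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny synonymous-codon table (two amino acids only).
SYNONYMS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
}


def codons(seq):
    """Split a coding sequence into complete codons, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def gc3(seq):
    """Fraction of codons carrying G or C at the third position."""
    cs = codons(seq)
    return sum(c[2] in "GC" for c in cs) / len(cs)


def relative_adaptiveness(reference_seqs):
    """w[codon] = count / count of the most used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s))
    w = {}
    for family in SYNONYMS.values():
        best = max(counts[c] for c in family)
        for c in family:
            # Assumed pseudocount of 0.5 so unseen codons do not give log(0).
            w[c] = max(counts[c], 0.5) / max(best, 0.5)
    return w


def cai(seq, w):
    """Geometric mean of w over the codons of seq that appear in w."""
    logs = [math.log(w[c]) for c in codons(seq) if c in w]
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference set standing in for highly expressed genes.
w = relative_adaptiveness(["AAGAAGAAGAAAGAGGAGGAA"])
```

    A gene built only from the preferred codons of the reference set scores 1.0, and the score falls as rarer synonymous codons are used.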

  9. Properties of Sequence Conservation in Upstream Regulatory and Protein Coding Sequences among Paralogs in Arabidopsis thaliana

    Science.gov (United States)

    Richardson, Dale N.; Wiehe, Thomas

    Whole genome duplication (WGD) has catalyzed the formation of new species, genes with novel functions, altered expression patterns, complexified signaling pathways and has provided organisms a level of genetic robustness. We studied the long-term evolution and interrelationships of 5’ upstream regulatory sequences (URSs), protein coding sequences (CDSs) and expression correlations (EC) of duplicated gene pairs in Arabidopsis. Three distinct methods revealed significant evolutionary conservation between paralogous URSs and were highly correlated with microarray-based expression correlation of the respective gene pairs. Positional information on exact matches between sequences unveiled the contribution of micro-chromosomal rearrangements on expression divergence. A three-way rank analysis of URS similarity, CDS divergence and EC uncovered specific gene functional biases. Transcription factor activity was associated with gene pairs exhibiting conserved URSs and divergent CDSs, whereas a broad array of metabolic enzymes was found to be associated with gene pairs showing diverged URSs but conserved CDSs.

  10. Combining gene prediction methods to improve metagenomic gene annotation

    Directory of Open Access Journals (Sweden)

    Rosen Gail L

    2011-01-01

    Full Text Available Abstract. Background: Traditional gene annotation methods rely on characteristics that may not be available in short reads generated from next generation technology, resulting in suboptimal performance for metagenomic (environmental) samples. Therefore, in recent years, new programs have been developed that optimize performance on short reads. In this work, we benchmark three metagenomic gene prediction programs and combine their predictions to improve metagenomic read gene annotation. Results: We not only analyze the programs' performance at different read-lengths like similar studies, but also separate different types of reads, including intra- and intergenic regions, for analysis. The main deficiencies are in the algorithms' ability to predict non-coding regions and gene edges, resulting in more false-positives and false-negatives than desired. In fact, the specificities of the algorithms are notably worse than the sensitivities. By combining the programs' predictions, we show significant improvement in specificity at minimal cost to sensitivity, resulting in 4% improvement in accuracy for 100 bp reads with ~1% improvement in accuracy for 200 bp reads and above. To correctly annotate the start and stop of the genes, we find that a consensus of all the predictors performs best for shorter read lengths while a unanimous agreement is better for longer read lengths, boosting annotation accuracy by 1-8%. We also demonstrate use of the classifier combinations on a real dataset. Conclusions: To optimize the performance for both prediction and annotation accuracies, we conclude that the consensus of all methods (or a majority vote) is the best for reads 400 bp and shorter, while using the intersection of GeneMark and Orphelia predictions is the best for reads 500 bp and longer. We demonstrate that most methods predict over 80% coding (including partially coding) reads on a real human gut sample sequenced by Illumina technology.
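
    The combination rule stated in the conclusions can be sketched as follows. This is an illustrative reading, not the authors' code: the third predictor is not named in the abstract, so `predictor_c` is a placeholder, and applying the majority vote to every read shorter than 500 bp is an assumption, since the abstract leaves the 400 to 500 bp range unspecified.

```python
def majority_vote(calls):
    """True when more than half of the predictors call the read coding."""
    return sum(calls.values()) * 2 > len(calls)


def intersection(calls, predictors):
    """True only when every listed predictor calls the read coding."""
    return all(calls[name] for name in predictors)


def combine(calls, read_length, long_read_min=500):
    """Pick the combination rule by read length.

    long_read_min=500 follows the abstract's "500 bp and longer";
    using the majority vote below that value is an assumption.
    """
    if read_length >= long_read_min:
        return intersection(calls, ["genemark", "orphelia"])
    return majority_vote(calls)


# Hypothetical per-read calls (True = predicted coding).
calls = {"genemark": True, "orphelia": False, "predictor_c": True}
short_read_call = combine(calls, read_length=100)
long_read_call = combine(calls, read_length=600)
```

    The stricter intersection rule trades sensitivity for specificity, which matches the abstract's observation that specificity is the weaker of the two for the individual programs.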

  11. Computational prediction of miRNA genes from small RNA sequencing data

    Directory of Open Access Journals (Sweden)

    Wenjing Kang

    2015-01-01

    Full Text Available Next-generation sequencing now, for the first time, allows researchers to gauge the depth and variation of entire transcriptomes. However, now that rare transcripts present in cells at single copies can be detected, more advanced computational tools are needed to accurately annotate and profile them. miRNAs are 22-nucleotide small RNAs (sRNAs) that post-transcriptionally reduce the output of protein coding genes. They have established roles in numerous biological processes, including cancers and other diseases. During miRNA biogenesis, the sRNAs are sequentially cleaved from precursor molecules that have a characteristic hairpin RNA structure. The vast majority of new miRNA genes that are discovered are mined from small RNA sequencing (sRNA-seq), which can detect more than a billion RNAs in a single run. However, given that many of the detected RNAs are degradation products from all types of transcripts, the accurate identification of miRNAs remains a non-trivial computational problem. Here we review the tools available to predict animal miRNAs from sRNA sequencing data. We present tools for generalist and specialist use cases, including prediction from massively pooled data or in species without a reference genome. We also present wet-lab methods used to validate predicted miRNAs, and approaches to computationally benchmark prediction accuracy. For each tool, we reference validation experiments and benchmarking efforts. Last, we discuss the future of the field.

  12. Deep Sequencing Reveals Uncharted Isoform Heterogeneity of the Protein-Coding Transcriptome in Cerebral Ischemia.

    Science.gov (United States)

    Bhattarai, Sunil; Aly, Ahmed; Garcia, Kristy; Ruiz, Diandra; Pontarelli, Fabrizio; Dharap, Ashutosh

    2018-06-03

    Gene expression in cerebral ischemia has been a subject of intense investigations for several years. Studies utilizing probe-based high-throughput methodologies such as microarrays have contributed significantly to our existing knowledge but lacked the capacity to dissect the transcriptome in detail. Genome-wide RNA-sequencing (RNA-seq) enables comprehensive examinations of transcriptomes for attributes such as strandedness, alternative splicing, alternative transcription start/stop sites, and sequence composition, thus providing a very detailed account of gene expression. Leveraging this capability, we conducted an in-depth, genome-wide evaluation of the protein-coding transcriptome of the adult mouse cortex after transient focal ischemia at 6, 12, or 24 h of reperfusion using RNA-seq. We identified a total of 1007 transcripts at 6 h, 1878 transcripts at 12 h, and 1618 transcripts at 24 h of reperfusion that were significantly altered as compared to sham controls. With isoform-level resolution, we identified 23 splice variants arising from 23 genes that were novel mRNA isoforms. For a subset of genes, we detected reperfusion time-point-dependent splice isoform switching, indicating an expression and/or functional switch for these genes. Finally, for 286 genes across all three reperfusion time-points, we discovered multiple, distinct, simultaneously expressed and differentially altered isoforms per gene that were generated via alternative transcription start/stop sites. Of these, 165 isoforms derived from 109 genes were novel mRNAs. Together, our data unravel the protein-coding transcriptome of the cerebral cortex at an unprecedented depth to provide several new insights into the flexibility and complexity of stroke-related gene transcription and transcript organization.

  13. Blood Gene Expression Predicts Bronchiolitis Obliterans Syndrome

    Directory of Open Access Journals (Sweden)

    Richard Danger

    2018-01-01

    Full Text Available Bronchiolitis obliterans syndrome (BOS), the main manifestation of chronic lung allograft dysfunction, leads to poor long-term survival after lung transplantation. Identifying predictors of BOS is essential to prevent the progression of dysfunction before irreversible damage occurs. By using a large set of 107 samples from lung recipients, we performed microarray gene expression profiling of whole blood to identify early biomarkers of BOS, including samples from 49 patients with stable function for at least 3 years, 32 samples collected at least 6 months before BOS diagnosis (prediction group), and 26 samples at or after BOS diagnosis (diagnosis group). An independent set from 25 lung recipients was used for validation by quantitative PCR (13 stables, 11 in the prediction group, and 8 in the diagnosis group). We identified 50 transcripts differentially expressed between stable and BOS recipients. Three genes, namely POU class 2 associating factor 1 (POU2AF1), T-cell leukemia/lymphoma protein 1A (TCL1A), and B cell lymphocyte kinase, were validated as predictive biomarkers of BOS more than 6 months before diagnosis, with areas under the curve of 0.83, 0.77, and 0.78 respectively. These genes allow stratification based on BOS risk (log-rank test p < 0.01) and are not associated with time posttransplantation. This is the first published large-scale gene expression analysis of blood after lung transplantation. The three-gene blood signature could provide clinicians with new tools to improve follow-up and adapt treatment of patients likely to develop BOS.

  14. Predicting Hydrologic Function With Aquatic Gene Fragments

    Science.gov (United States)

    Good, S. P.; URycki, D. R.; Crump, B. C.

    2018-03-01

    Recent advances in microbiology techniques, such as genetic sequencing, allow for rapid and cost-effective collection of large quantities of genetic information carried within water samples. Here we posit that the unique composition of aquatic DNA material within a water sample contains relevant information about hydrologic function at multiple temporal scales. In this study, machine learning was used to develop discharge prediction models trained on the relative abundance of bacterial taxa classified into operational taxonomic units (OTUs) based on 16S rRNA gene sequences from six large arctic rivers. We term this approach "genohydrology," and show that OTU relative abundances can be used to predict river discharge at monthly and longer timescales. Based on a single DNA sample from each river, the average Nash-Sutcliffe efficiency (NSE) for predicted mean monthly discharge values throughout the year was 0.84, while the NSE for predicted discharge values across different return intervals was 0.67. These are considerable improvements over predictions based only on the area-scaled mean specific discharge of five similar rivers, which had average NSE values of 0.64 and -0.32 for seasonal and recurrence interval discharge values, respectively. The genohydrology approach demonstrates that genetic diversity within the aquatic microbiome is a large and underutilized data resource with benefits for prediction of hydrologic function.
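
    The Nash-Sutcliffe efficiency reported above has a standard definition, NSE = 1 - Σ(obs - pred)² / Σ(obs - mean(obs))², which the following minimal sketch computes. The discharge values are hypothetical and serve only to show how the score behaves.

```python
def nash_sutcliffe(observed, predicted):
    """NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - residual / variance


# Hypothetical discharge values, used only to illustrate the score.
observed = [1.0, 2.0, 3.0]
nse_perfect = nash_sutcliffe(observed, [1.0, 2.0, 3.0])
nse_mean_only = nash_sutcliffe(observed, [2.0, 2.0, 2.0])
nse_reversed = nash_sutcliffe(observed, [3.0, 2.0, 1.0])
```

    A perfect prediction scores 1, predicting the observed mean scores 0, and anything worse than the mean is negative. That is how to read the -0.32 quoted for the area-scaled baseline: it performed worse than simply using the mean of the observations.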

  15. Predicting cellular growth from gene expression signatures.

    Directory of Open Access Journals (Sweden)

    Edoardo M Airoldi

    2009-01-01

    Full Text Available Maintaining balanced growth in a changing environment is a fundamental systems-level challenge for cellular physiology, particularly in microorganisms. While the complete set of regulatory and functional pathways supporting growth and cellular proliferation is not yet known, portions of it are well understood. In particular, cellular proliferation is governed by mechanisms that are highly conserved from unicellular to multicellular organisms, and the disruption of these processes in metazoans is a major factor in the development of cancer. In this paper, we develop statistical methodology to identify quantitative aspects of the regulatory mechanisms underlying cellular proliferation in Saccharomyces cerevisiae. We find that the expression levels of a small set of genes can be exploited to predict the instantaneous growth rate of any cellular culture with high accuracy. The predictions obtained in this fashion are robust to changing biological conditions, experimental methods, and technological platforms. The proposed model is also effective in predicting growth rates for the related yeast Saccharomyces bayanus and the highly diverged yeast Schizosaccharomyces pombe, suggesting that the underlying regulatory signature is conserved across a wide range of unicellular evolution. We investigate the biological significance of the gene expression signature that the predictions are based upon from multiple perspectives: by perturbing the regulatory network through the Ras/PKA pathway, observing strong upregulation of growth rate even in the absence of appropriate nutrients, and by discovering putative transcription factor binding sites, observing enrichment in growth-correlated genes. More broadly, the proposed methodology enables biological insights about growth at an instantaneous time scale, inaccessible by direct experimental methods. Data and tools enabling others to apply our methods are available at http://function.princeton.edu/growthrate.

  16. Genomic Prediction of Gene Bank Wheat Landraces

    Directory of Open Access Journals (Sweden)

    José Crossa

    2016-07-01

    Full Text Available This study examines genomic prediction within 8416 Mexican landrace accessions and 2403 Iranian landrace accessions stored in gene banks. The Mexican and Iranian collections were evaluated in separate field trials, including an optimum environment for several traits, and in two separate environments (drought, D, and heat, H) for the highly heritable traits days to heading (DTH) and days to maturity (DTM). Analyses accounting and not accounting for population structure were performed. Genomic prediction models include genotype × environment interaction (G × E). Two alternative prediction strategies were studied: (1) random cross-validation of the data in 20% training (TRN) and 80% testing (TST) (TRN20-TST80) sets, and (2) two types of core sets, “diversity” and “prediction”, including 10% and 20%, respectively, of the total collections. Accounting for population structure decreased prediction accuracy by 15–20% as compared to prediction accuracy obtained when not accounting for population structure. Accounting for population structure gave prediction accuracies for traits evaluated in one environment for TRN20-TST80 that ranged from 0.407 to 0.677 for Mexican landraces, and from 0.166 to 0.662 for Iranian landraces. Prediction accuracy of the 20% diversity core set was similar to accuracies obtained for TRN20-TST80, ranging from 0.412 to 0.654 for Mexican landraces, and from 0.182 to 0.647 for Iranian landraces. The predictive core set gave similar prediction accuracy as the diversity core set for Mexican collections, but slightly lower for Iranian collections. Prediction accuracy when incorporating G × E for DTH and DTM for Mexican landraces for TRN20-TST80 was around 0.60, which is greater than without the G × E term. For Iranian landraces, accuracies were 0.55 for the G × E model with TRN20-TST80. Results show promising prediction accuracies for potential use in germplasm enhancement and rapid introgression of exotic germplasm.
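
    The TRN20-TST80 partition described in this abstract can be illustrated with a minimal sketch of a single random split. The function name, the fixed seed, and the accession identifiers are hypothetical, and the study's actual cross-validation scheme (number of repeats, stratification) is not given here.

```python
import random


def trn20_tst80_split(accessions, train_fraction=0.2, seed=0):
    """Randomly assign train_fraction of accessions to TRN, the rest to TST."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(accessions)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical accession identifiers.
accessions = [f"acc{i}" for i in range(100)]
train, test = trn20_tst80_split(accessions)
```

    Training on the smaller share is the point of the design: it mimics a gene bank in which only a fraction of the accessions can be phenotyped and the rest must be predicted.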

  17. Modeling compositional dynamics based on GC and purine contents of protein-coding sequences

    KAUST Repository

    Zhang, Zhang; Yu, Jun

    2010-01-01

    Background: Understanding the compositional dynamics of genomes and their coding sequences is of great significance in gaining clues into molecular evolution and a large number of publicly available genome sequences have allowed us to quantitatively predict deviations of empirical data from their theoretical counterparts. However, the quantification of theoretical compositional variations for a wide diversity of genomes remains a major challenge. Results: To model the compositional dynamics of protein-coding sequences, we propose two simple models that take into account both mutation and selection effects, which act differently at the three codon positions, and use both GC and purine contents as compositional parameters. The two models concern the theoretical composition of nucleotides, codons, and amino acids, with no prerequisite of homologous sequences or their alignments. We evaluated the two models by quantifying theoretical compositions of a large collection of protein-coding sequences (including 46 of Archaea, 686 of Bacteria, and 826 of Eukarya), yielding consistent theoretical compositions across all the collected sequences. Conclusions: We show that the compositions of nucleotides, codons, and amino acids are largely determined by both GC and purine contents and suggest that deviations of the observed from the expected compositions may reflect compositional signatures that arise from a complex interplay between mutation and selection via DNA replication and repair mechanisms. Reviewers: This article was reviewed by Zhaolei Zhang (nominated by Mark Gerstein), Guruprasad Ananda (nominated by Kateryna Makova), and Daniel Haft. 2010 Zhang and Yu; licensee BioMed Central Ltd.
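
    The two compositional parameters used by these models, GC content and purine content at each of the three codon positions, can be measured from a coding sequence as in the sketch below. It computes observed values only and does not implement the paper's two theoretical models; the function name and the example sequence are hypothetical.

```python
def position_content(seq):
    """Return GC and purine (A or G) fractions at codon positions 1, 2 and 3."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    result = []
    for pos in range(3):
        bases = seq[pos:usable:3]
        result.append({
            "gc": sum(b in "GC" for b in bases) / len(bases),
            "purine": sum(b in "AG" for b in bases) / len(bases),
        })
    return result


# Hypothetical two-codon sequence: ATG GCC.
content = position_content("ATGGCC")
```

    Keeping the three positions separate matters because, as the abstract notes, mutation and selection act differently at each one.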

  18. Modeling compositional dynamics based on GC and purine contents of protein-coding sequences

    KAUST Repository

    Zhang, Zhang

    2010-11-08

    Background: Understanding the compositional dynamics of genomes and their coding sequences is of great significance in gaining clues into molecular evolution and a large number of publicly available genome sequences have allowed us to quantitatively predict deviations of empirical data from their theoretical counterparts. However, the quantification of theoretical compositional variations for a wide diversity of genomes remains a major challenge. Results: To model the compositional dynamics of protein-coding sequences, we propose two simple models that take into account both mutation and selection effects, which act differently at the three codon positions, and use both GC and purine contents as compositional parameters. The two models concern the theoretical composition of nucleotides, codons, and amino acids, with no prerequisite of homologous sequences or their alignments. We evaluated the two models by quantifying theoretical compositions of a large collection of protein-coding sequences (including 46 of Archaea, 686 of Bacteria, and 826 of Eukarya), yielding consistent theoretical compositions across all the collected sequences. Conclusions: We show that the compositions of nucleotides, codons, and amino acids are largely determined by both GC and purine contents and suggest that deviations of the observed from the expected compositions may reflect compositional signatures that arise from a complex interplay between mutation and selection via DNA replication and repair mechanisms. Reviewers: This article was reviewed by Zhaolei Zhang (nominated by Mark Gerstein), Guruprasad Ananda (nominated by Kateryna Makova), and Daniel Haft. 2010 Zhang and Yu; licensee BioMed Central Ltd.

  19. Ecological transition predictably associated with gene degeneration.

    Science.gov (United States)

    Wessinger, Carolyn A; Rausher, Mark D

    2015-02-01

    Gene degeneration or loss can significantly contribute to phenotypic diversification, but may generate genetic constraints on future evolutionary trajectories, potentially restricting phenotypic reversal. Such constraints may manifest as directional evolutionary trends when parallel phenotypic shifts consistently involve gene degeneration or loss. Here, we demonstrate that widespread parallel evolution in Penstemon from blue to red flowers predictably involves the functional inactivation and degeneration of the enzyme flavonoid 3',5'-hydroxylase (F3'5'H), an anthocyanin pathway enzyme required for the production of blue floral pigments. Other types of genetic mutations do not consistently accompany this phenotypic shift. This pattern may be driven by the relatively large mutational target size of degenerative mutations to this locus and the apparent lack of associated pleiotropic effects. The consistent degeneration of F3'5'H may provide a mechanistic explanation for the observed asymmetry in the direction of flower color evolution in Penstemon: Blue to red transitions are common, but reverse transitions have not been observed. Although phenotypic shifts in this system are likely driven by natural selection, internal constraints may generate predictable genetic outcomes and may restrict future evolutionary trajectories. © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  20. A comparative analysis of soft computing techniques for gene prediction.

    Science.gov (United States)

    Goel, Neelam; Singh, Shailendra; Aseri, Trilok Chand

    2013-07-01

    The rapid growth of genomic sequence data for both human and nonhuman species has made the analysis of these sequences, especially the prediction of genes in them, very important, and it is currently the focus of many research efforts. Besides its scientific interest to the molecular biology and genomics community, gene prediction is of considerable importance in human health and medicine. A variety of gene prediction techniques have been developed for eukaryotes over the past few years. This article reviews and analyzes the application of certain soft computing techniques in gene prediction. First, the problem of gene prediction and its challenges are described. These are followed by different soft computing techniques along with their application to gene prediction. In addition, a comparative analysis of different soft computing techniques for gene prediction is given. Finally, some limitations of the current research activities and future research directions are provided. Copyright © 2013 Elsevier Inc. All rights reserved.

  1. Clinicopathologic and gene expression parameters predict liver cancer prognosis

    International Nuclear Information System (INIS)

    Hao, Ke; Zhong, Hua; Greenawalt, Danielle; Ferguson, Mark D; Ng, Irene O; Sham, Pak C; Poon, Ronnie T; Molony, Cliona; Schadt, Eric E; Dai, Hongyue; Luk, John M; Lamb, John; Zhang, Chunsheng; Xie, Tao; Wang, Kai; Zhang, Bin; Chudin, Eugene; Lee, Nikki P; Mao, Mao

    2011-01-01

    The prognosis of hepatocellular carcinoma (HCC) varies following surgical resection and the large variation remains largely unexplained. Studies have revealed the ability of clinicopathologic parameters and gene expression to predict HCC prognosis. However, there has been little systematic effort to compare the performance of these two types of predictors or combine them in a comprehensive model. Tumor and adjacent non-tumor liver tissues were collected from 272 ethnic Chinese HCC patients who received curative surgery. We combined clinicopathologic parameters and gene expression data (from both tissue types) in predicting HCC prognosis. Cross-validation and independent studies were employed to assess prediction. HCC prognosis was significantly associated with six clinicopathologic parameters, which can partition the patients into good- and poor-prognosis groups. Within each group, gene expression data further divide patients into distinct prognostic subgroups. Our predictive genes significantly overlap with previously published gene sets predictive of prognosis. Moreover, the predictive genes were enriched for genes that underwent normal-to-tumor gene network transformation. Previously documented liver eSNPs underlying the HCC predictive gene signatures were enriched for SNPs that associated with HCC prognosis, providing support that these genes are involved in key processes of tumorigenesis. When applied individually, clinicopathologic parameters and gene expression offered similar predictive power for HCC prognosis. In contrast, a combination of the two types of data dramatically improved the power to predict HCC prognosis. Our results also provided a framework for understanding the impact of gene expression on the processes of tumorigenesis and clinical outcome

  2. Natural variation of rice blast resistance gene Pi-d2

    Science.gov (United States)

    Studying natural variation of rice resistance (R) genes in cultivated and wild rice relatives can predict resistance stability to rice blast fungus. In the present study, the protein coding regions of rice R gene Pi-d2 in 35 rice accessions of subgroups, aus (AUS), indica (IND), temperate japonica (...

  3. An algorithm to discover gene signatures with predictive potential

    Directory of Open Access Journals (Sweden)

    Hallett Robin M

    2010-09-01

    Full Text Available Abstract. Background: The advent of global gene expression profiling has generated unprecedented insight into our molecular understanding of cancer, including breast cancer. For example, human breast cancer patients display significant diversity in terms of their survival, recurrence, metastasis as well as response to treatment. These patient outcomes can be predicted by the transcriptional programs of their individual breast tumors. Predictive gene signatures allow us to correctly classify human breast tumors into various risk groups as well as to more accurately target therapy to ensure more durable cancer treatment. Results: Here we present a novel algorithm to generate gene signatures with predictive potential. The method first classifies the expression intensity for each gene, as determined by global gene expression profiling, as low, average or high. The matrix containing the classified data for each gene is then used to score the expression of each gene based on its individual ability to predict the patient characteristic of interest. Finally, all examined genes are ranked based on their predictive ability and the most highly ranked genes are included in the master gene signature, which is then ready for use as a predictor. This method was used to accurately predict the survival outcomes in a cohort of human breast cancer patients. Conclusions: We confirmed the capacity of our algorithm to generate gene signatures with bona fide predictive ability. The simplicity of our algorithm will enable biological researchers to quickly generate valuable gene signatures without specialized software or extensive bioinformatics training.
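
    The three steps in this abstract (discretise expression, score each gene, rank and keep the top genes) can be sketched as follows. The abstract gives neither the thresholds nor the scoring rule, so both are assumptions made for illustration: expression is called low or high when it lies more than half a standard deviation from the gene's mean, and a gene's score is the absolute difference in event rate between its high and low groups. All names and data are hypothetical.

```python
import statistics


def discretize(values, width=0.5):
    """Label each value low/average/high relative to mean +/- width * sd (assumed rule)."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    labels = []
    for v in values:
        if v < mu - width * sd:
            labels.append("low")
        elif v > mu + width * sd:
            labels.append("high")
        else:
            labels.append("average")
    return labels


def gene_score(labels, outcomes):
    """Assumed score: |event rate in the high group - event rate in the low group|."""
    def event_rate(label):
        hits = [o for lab, o in zip(labels, outcomes) if lab == label]
        return sum(hits) / len(hits) if hits else None

    high, low = event_rate("high"), event_rate("low")
    if high is None or low is None:
        return 0.0
    return abs(high - low)


def build_signature(expression, outcomes, size):
    """Rank genes by score and return the top `size` genes plus all scores."""
    scores = {g: gene_score(discretize(v), outcomes) for g, v in expression.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:size], scores


# Hypothetical expression values for six patients; outcome 1 = event occurred.
expression = {"GENE_A": [1, 1, 1, 9, 9, 9], "GENE_B": [1, 9, 1, 9, 1, 9]}
outcomes = [0, 0, 0, 1, 1, 1]
signature, scores = build_signature(expression, outcomes, size=1)
```

    Discretising first makes the score insensitive to the absolute scale of each gene, which may be part of why the authors describe the method as usable without specialised software.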

  4. Exploring the Optimal Strategy to Predict Essential Genes in Microbes

    Directory of Open Access Journals (Sweden)

    Yao Lu

    2011-12-01

    Full Text Available Accurately predicting essential genes is important in many aspects of biology, medicine and bioengineering. In previous research, we have developed a machine learning based integrative algorithm to predict essential genes in bacterial species. This algorithm lends itself to two approaches for predicting essential genes: learning the traits from known essential genes in the target organism, or transferring essential gene annotations from a closely related model organism. However, for an understudied microbe, each approach has its potential limitations. The first is constricted by the often small number of known essential genes. The second is limited by the availability of model organisms and by evolutionary distance. In this study, we aim to determine the optimal strategy for predicting essential genes by examining four microbes with well-characterized essential genes. Our results suggest that, unless the known essential genes are few, learning from the known essential genes in the target organism usually outperforms transferring essential gene annotations from a related model organism. In fact, the required number of known essential genes is surprisingly small to make accurate predictions. In prokaryotes, when the number of known essential genes is greater than 2% of total genes, this approach already comes close to its optimal performance. In eukaryotes, achieving the same best performance requires over 4% of total genes, reflecting the increased complexity of eukaryotic organisms. Combining the two approaches resulted in an increased performance when the known essential genes are few. Our investigation thus provides key information on accurately predicting essential genes and will greatly facilitate annotations of microbial genomes.

  5. A hybrid approach of gene sets and single genes for the prediction of survival risks with gene expression data.

    Science.gov (United States)

    Seok, Junhee; Davis, Ronald W; Xiao, Wenzhong

    2015-01-01

    Accumulated biological knowledge is often encoded as gene sets, collections of genes associated with similar biological functions or pathways. The use of gene sets in the analyses of high-throughput gene expression data has been intensively studied and applied in clinical research. However, the main interest remains in finding modules of biological knowledge, or corresponding gene sets, significantly associated with disease conditions. Risk prediction from censored survival times using gene sets hasn't been well studied. In this work, we propose a hybrid method that uses both single gene and gene set information together to predict patient survival risks from gene expression profiles. In the proposed method, gene sets provide context-level information that is poorly reflected by single genes. Complementarily, single genes help to supplement incomplete information of gene sets due to our imperfect biomedical knowledge. Through the tests over multiple data sets of cancer and trauma injury, the proposed method showed robust and improved performance compared with the conventional approaches with only single genes or gene sets solely. Additionally, we examined the prediction result in the trauma injury data, and showed that the modules of biological knowledge used in the prediction by the proposed method were highly interpretable in biology. A wide range of survival prediction problems in clinical genomics is expected to benefit from the use of biological knowledge.

  6. Automated Eukaryotic Gene Structure Annotation Using EVidenceModeler and the Program to Assemble Spliced Alignments

    Energy Technology Data Exchange (ETDEWEB)

    Haas, B J; Salzberg, S L; Zhu, W; Pertea, M; Allen, J E; Orvis, J; White, O; Buell, C R; Wortman, J R

    2007-12-10

    EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.

  7. Gene prediction validation and functional analysis of redundant pathways

    DEFF Research Database (Denmark)

    Sønderkær, Mads

    2011-01-01

    have employed a large mRNA-seq data set to improve and validate ab initio predicted gene models. This direct experimental evidence also provides reliable determinations of UTR regions and polyadenylation sites, which are not easily predicted in plants. Furthermore, once an annotated genome sequence...... is available, gene expression by mRNA-Seq enables acquisition of a more complete overview of gene isoform usage in complex enzymatic pathways enabling the identification of key genes. Metabolism in potatoes This information is useful e.g. for crop improvement based on manipulation of agronomically important...

  8. Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms.

    Science.gov (United States)

    Mattick, John S

    2003-10-01

    The central dogma of biology holds that genetic information normally flows from DNA to RNA to protein. As a consequence it has been generally assumed that genes generally code for proteins, and that proteins fulfil not only most structural and catalytic but also most regulatory functions, in all cells, from microbes to mammals. However, the latter may not be the case in complex organisms. A number of startling observations about the extent of non-protein-coding RNA (ncRNA) transcription in the higher eukaryotes and the range of genetic and epigenetic phenomena that are RNA-directed suggests that the traditional view of the structure of genetic regulatory systems in animals and plants may be incorrect. ncRNA dominates the genomic output of the higher organisms and has been shown to control chromosome architecture, mRNA turnover and the developmental timing of protein expression, and may also regulate transcription and alternative splicing. This paper re-examines the available evidence and suggests a new framework for considering and understanding the genomic programming of biological complexity, autopoietic development and phenotypic variation. Copyright 2003 Wiley Periodicals, Inc.

  9. A network approach to predict pathogenic genes for Fusarium graminearum.

    Science.gov (United States)

    Liu, Xiaoping; Tang, Wei-Hua; Zhao, Xing-Ming; Chen, Luonan

    2010-10-04

    Fusarium graminearum is the pathogenic agent of Fusarium head blight (FHB), which is a destructive disease on wheat and barley, thereby causing huge economic loss and health problems to humans by contaminating foods. Identifying pathogenic genes can shed light on pathogenesis underlying the interaction between F. graminearum and its plant host. However, it is difficult to detect pathogenic genes for this destructive pathogen by time-consuming and expensive molecular biological experiments in lab. On the other hand, computational methods provide an alternative way to solve this problem. Since pathogenesis is a complicated procedure that involves complex regulations and interactions, the molecular interaction network of F. graminearum can give clues to potential pathogenic genes. Furthermore, the gene expression data of F. graminearum before and after its invasion into plant host can also provide useful information. In this paper, a novel systems biology approach is presented to predict pathogenic genes of F. graminearum based on molecular interaction network and gene expression data. With a small number of known pathogenic genes as seed genes, a subnetwork that consists of potential pathogenic genes is identified from the protein-protein interaction network (PPIN) of F. graminearum, where the genes in the subnetwork are further required to be differentially expressed before and after the invasion of the pathogenic fungus. Therefore, the candidate genes in the subnetwork are expected to be involved in the same biological processes as seed genes, which implies that they are potential pathogenic genes. The prediction results show that most of the pathogenic genes of F. graminearum are enriched in two important signal transduction pathways, including G protein coupled receptor pathway and MAPK signaling pathway, which are known to be related to pathogenesis in other fungi. In addition, several pathogenic genes predicted by our method are verified in other pathogenic fungi, which

  10. Reranking candidate gene models with cross-species comparison for improved gene prediction

    Directory of Open Access Journals (Sweden)

    Pereira Fernando CN

    2008-10-01

    Full Text Available Abstract Background Most gene finders score candidate gene models with state-based methods, typically HMMs, by combining local properties (coding potential, splice donor and acceptor patterns, etc.). Competing models with similar state-based scores may be distinguishable with additional information. In particular, functional and comparative genomics datasets may help to select among competing models of comparable probability by exploiting features likely to be associated with the correct gene models, such as conserved exon/intron structure or protein sequence features. Results We have investigated the utility of a simple post-processing step for selecting among a set of alternative gene models, using global scoring rules to rerank competing models for more accurate prediction. For each gene locus, we first generate the K best candidate gene models using the gene finder Evigan, and then rerank these models using comparisons with putative orthologous genes from closely-related species. Candidate gene models with lower scores in the original gene finder may be selected if they exhibit strong similarity to probable orthologs in coding sequence, splice site location, or signal peptide occurrence. Experiments on Drosophila melanogaster demonstrate that reranking based on cross-species comparison outperforms the best gene models identified by Evigan alone, and also outperforms the comparative gene finders GeneWise and Augustus+. Conclusion Reranking gene models with cross-species comparison improves gene prediction accuracy. This straightforward method can be readily adapted to incorporate additional lines of evidence, as it requires only a ranked source of candidate gene models.
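
    The reranking idea can be illustrated with a very small sketch. It assumes each of the K candidate models already carries a gene-finder score and a single ortholog-similarity value, and it combines them with a weighted sum. The abstract does not specify the actual global scoring rules, so the `rerank` function, the tuple layout, and the `weight` parameter are hypothetical.

```python
def rerank(candidates, weight=1.0):
    """Reorder candidate gene models by finder score plus weighted ortholog similarity.

    candidates: list of (model_name, finder_score, ortholog_similarity) tuples.
    The additive combination is an illustrative assumption, not the paper's rule.
    """
    return sorted(candidates, key=lambda c: c[1] + weight * c[2], reverse=True)


# Toy K = 3 candidates for one locus: m1 has the best finder score,
# but m2 agrees much better with a putative ortholog.
candidates = [("m1", 0.90, 0.10), ("m2", 0.85, 0.80), ("m3", 0.60, 0.30)]
reranked_names = [name for name, _, _ in rerank(candidates)]
original_names = [name for name, _, _ in rerank(candidates, weight=0.0)]
```

    With `weight=0.0` the original gene-finder order is preserved, which makes it easy to see how much the comparative evidence changes the selection.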

  11. Neural Inductive Matrix Completion for Predicting Disease-Gene Associations

    KAUST Repository

    Hou, Siqing

    2018-05-21

    In silico prioritization of undiscovered associations can help find causal genes of newly discovered diseases. Some existing methods are based on known associations, and side information of diseases and genes. We exploit the possibility of using a neural network model, Neural inductive matrix completion (NIMC), in disease-gene prediction. Compared to the state-of-the-art inductive matrix completion method, using neural networks allows us to learn latent features from non-linear functions of input features. Previous methods use disease features only from mining text. Compared to text mining, disease ontology is a more informative way of discovering correlation of diseases, from which we can calculate the similarities between diseases and help increase the performance of predicting disease-gene associations. We compare the proposed method with other state-of-the-art methods for predicting associated genes for diseases from the Online Mendelian Inheritance in Man (OMIM) database. Results show that both new features and the proposed NIMC model can improve the chance of recovering an unknown associated gene in the top 100 predicted genes. Best results are obtained by using both the new features and the new model. Results also show the proposed method does better in predicting associated genes for newly discovered diseases.

  12. Semi-supervised prediction of gene regulatory networks using ...

    Indian Academy of Sciences (India)

    2015-09-28

    Sep 28, 2015 ... Use of computational methods to predict gene regulatory networks (GRNs) from gene expression data is a challenging ... two types of methods differ primarily based on whether ..... negligible, allowing us to draw the qualitative conclusions .... research will be conducted to develop additional biologically.

  13. Prediction and analysis of three gene families related to leaf rust (Puccinia triticina) resistance in wheat (Triticum aestivum L.).

    Science.gov (United States)

    Peng, Fred Y; Yang, Rong-Cai

    2017-06-20

    The resistance to leaf rust (Lr) caused by Puccinia triticina in wheat (Triticum aestivum L.) has been well studied over the past decades with over 70 Lr genes being mapped on different chromosomes and numerous QTLs (quantitative trait loci) being detected or mapped using DNA markers. Such resistance is often divided into race-specific and race-nonspecific resistance. The race-nonspecific resistance can be further divided into resistance to most or all races of the same pathogen and resistance to multiple pathogens. At the molecular level, these three types of resistance may cover across the whole spectrum of pathogen specificities that are controlled by genes encoding different protein families in wheat. The objective of this study is to predict and analyze genes in three such families: NBS-LRR (nucleotide-binding sites and leucine-rich repeats or NLR), START (Steroidogenic Acute Regulatory protein [STaR] related lipid-transfer) and ABC (ATP-Binding Cassette) transporter. The focus of the analysis is on the patterns of relationships between these protein-coding genes within the gene families and QTLs detected for leaf rust resistance. We predicted 526 ABC, 1117 NLR and 144 START genes in the hexaploid wheat genome through a domain analysis of wheat proteome. Of the 1809 SNPs from leaf rust resistance QTLs in seedling and adult stages of wheat, 126 SNPs were found within coding regions of these genes or their neighborhood (5 Kb upstream from transcription start site [TSS] or downstream from transcription termination site [TTS] of the genes). Forty-three of these SNPs for adult resistance and 18 SNPs for seedling resistance reside within coding or neighboring regions of the ABC genes whereas 14 SNPs for adult resistance and 29 SNPs for seedling resistance reside within coding or neighboring regions of the NLR gene. Moreover, we found 17 nonsynonymous SNPs for adult resistance and five SNPs for seedling resistance in the ABC genes, and five nonsynonymous SNPs for

  14. Embryo quality predictive models based on cumulus cells gene expression

    Directory of Open Access Journals (Sweden)

    Devjak R

    2016-06-01

    Full Text Available Since the introduction of in vitro fertilization (IVF) in clinical practice of infertility treatment, the indicators for high quality embryos have been investigated. Cumulus cells (CC) have a specific gene expression profile according to the developmental potential of the oocyte they are surrounding, and therefore, specific gene expression could be used as a biomarker. The aim of our study was to combine more than one biomarker to observe improvement in prediction value of embryo development. In this study, 58 CC samples from 17 IVF patients were analyzed. This study was approved by the Republic of Slovenia National Medical Ethics Committee. Gene expression analysis [quantitative real time polymerase chain reaction (qPCR)] for five genes, analyzed according to embryo quality level, was performed. Two prediction models were tested for embryo quality prediction: a binary logistic and a decision tree model. As the main outcome, gene expression levels for five genes were taken and the area under the curve (AUC) for two prediction models were calculated. Among tested genes, AMHR2 and LIF showed significant expression difference between high quality and low quality embryos. These two genes were used for the construction of two prediction models: the binary logistic model yielded an AUC of 0.72 ± 0.08 and the decision tree model yielded an AUC of 0.73 ± 0.03. Two different prediction models yielded similar predictive power to differentiate high and low quality embryos. In terms of eventual clinical decision making, the decision tree model resulted in easy-to-interpret rules that are highly applicable in clinical practice.
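
    For readers unfamiliar with the AUC values quoted above, the following sketch shows one standard way to compute an AUC from model scores: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name, the 0/1 label coding, and the toy numbers are assumptions for illustration and are not taken from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: predicted scores, higher meaning more likely positive.
    labels: 1 for positive (e.g. high quality embryo), 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

    An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which gives context for the reported values of about 0.72 and 0.73.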

  15. Microbial Functional Gene Diversity Predicts Groundwater Contamination and Ecosystem Functioning.

    Science.gov (United States)

    He, Zhili; Zhang, Ping; Wu, Linwei; Rocha, Andrea M; Tu, Qichao; Shi, Zhou; Wu, Bo; Qin, Yujia; Wang, Jianjun; Yan, Qingyun; Curtis, Daniel; Ning, Daliang; Van Nostrand, Joy D; Wu, Liyou; Yang, Yunfeng; Elias, Dwayne A; Watson, David B; Adams, Michael W W; Fields, Matthew W; Alm, Eric J; Hazen, Terry C; Adams, Paul D; Arkin, Adam P; Zhou, Jizhong

    2018-02-20

    Contamination from anthropogenic activities has significantly impacted Earth's biosphere. However, knowledge about how environmental contamination affects the biodiversity of groundwater microbiomes and ecosystem functioning remains very limited. Here, we used a comprehensive functional gene array to analyze groundwater microbiomes from 69 wells at the Oak Ridge Field Research Center (Oak Ridge, TN), representing a wide pH range and uranium, nitrate, and other contaminants. We hypothesized that the functional diversity of groundwater microbiomes would decrease as environmental contamination (e.g., uranium or nitrate) increased or at low or high pH, while some specific populations capable of utilizing or resistant to those contaminants would increase, and thus, such key microbial functional genes and/or populations could be used to predict groundwater contamination and ecosystem functioning. Our results indicated that functional richness/diversity decreased as uranium (but not nitrate) increased in groundwater. In addition, about 5.9% of specific key functional populations targeted by a comprehensive functional gene array (GeoChip 5) increased significantly ( P contamination and ecosystem functioning. This study indicates great potential for using microbial functional genes to predict environmental contamination and ecosystem functioning. IMPORTANCE Disentangling the relationships between biodiversity and ecosystem functioning is an important but poorly understood topic in ecology. Predicting ecosystem functioning on the basis of biodiversity is even more difficult, particularly with microbial biomarkers. As an exploratory effort, this study used key microbial functional genes as biomarkers to provide predictive understanding of environmental contamination and ecosystem functioning. The results indicated that the overall functional gene richness/diversity decreased as uranium increased in groundwater, while specific key microbial guilds increased significantly as

  16. Predictability of Genetic Interactions from Functional Gene Modules

    Directory of Open Access Journals (Sweden)

    Jonathan H. Young

    2017-02-01

    Full Text Available Characterizing genetic interactions is crucial to understanding cellular and organismal response to gene-level perturbations. Such knowledge can inform the selection of candidate disease therapy targets, yet experimentally determining whether genes interact is technically nontrivial and time-consuming. High-fidelity prediction of different classes of genetic interactions in multiple organisms would substantially alleviate this experimental burden. Under the hypothesis that functionally related genes tend to share common genetic interaction partners, we evaluate a computational approach to predict genetic interactions in Homo sapiens, Drosophila melanogaster, and Saccharomyces cerevisiae. By leveraging knowledge of functional relationships between genes, we cross-validate predictions on known genetic interactions and observe high predictive power of multiple classes of genetic interactions in all three organisms. Additionally, our method suggests high-confidence candidate interaction pairs that can be directly experimentally tested. A web application is provided for users to query genes for predicted novel genetic interaction partners. Finally, by subsampling the known yeast genetic interaction network, we found that novel genetic interactions are predictable even when knowledge of currently known interactions is minimal.

  17. Prediction of regulatory gene pairs using dynamic time warping and gene ontology.

    Science.gov (United States)

    Yang, Andy C; Hsu, Hui-Huang; Lu, Ming-Da; Tseng, Vincent S; Shih, Timothy K

    2014-01-01

    Selecting informative genes is the most important task for data analysis on microarray gene expression data. In this work, we aim at identifying regulatory gene pairs from microarray gene expression data. However, microarray data often contain multiple missing expression values. Missing value imputation is thus needed before further processing for regulatory gene pairs becomes possible. We develop a novel approach to first impute missing values in microarray time series data by combining k-Nearest Neighbour (KNN), Dynamic Time Warping (DTW) and Gene Ontology (GO). After missing values are imputed, we then perform gene regulation prediction based on our proposed DTW-GO distance measurement of gene pairs. Experimental results show that our approach is more accurate when compared with existing missing value imputation methods on real microarray data sets. Furthermore, our approach can also discover more regulatory gene pairs that are known in the literature than other methods.
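
    As background for the Dynamic Time Warping component mentioned above, here is a minimal sketch of the classic DTW distance between two expression time series. It covers plain DTW only; the KNN imputation step and the Gene Ontology term of the authors' DTW-GO distance are not reproduced, and the function name and the absolute-difference local cost are illustrative choices.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] matched again
                                     cost[i][j - 1],      # b[j-1] matched again
                                     cost[i - 1][j - 1])  # both series advance
    return cost[n][m]


# Two profiles with the same shape but different timing align at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted_level = dtw_distance([0, 0], [1, 1])
```

    Because DTW allows one time point to match several points in the other series, two genes whose expression follows the same pattern with a delay can still obtain a small distance.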

  18. Global discriminative learning for higher-accuracy computational gene prediction.

    Directory of Open Access Journals (Sweden)

    Axel Bernal

    2007-03-01

    Full Text Available Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.

  19. Random Subspace Aggregation for Cancer Prediction with Gene Expression Profiles

    Directory of Open Access Journals (Sweden)

    Liying Yang

    2016-01-01

    Full Text Available Background. Precisely predicting cancer is crucial for cancer treatment. Gene expression profiles make it possible to analyze patterns between genes and cancers on the genome-wide scale. Gene expression data analysis, however, is confronted with enormous challenges for its characteristics, such as high dimensionality, small sample size, and low Signal-to-Noise Ratio. Results. This paper proposes a method, termed RS_SVM, to predict gene expression profiles via aggregating SVM trained on random subspaces. After choosing gene features through statistical analysis, RS_SVM randomly selects feature subsets to yield random subspaces, trains SVM classifiers accordingly, and then aggregates the SVM classifiers to capture the advantage of ensemble learning. Experiments on eight real gene expression datasets are performed to validate the RS_SVM method. Experimental results show that RS_SVM achieved better classification accuracy and generalization performance in contrast with single SVM, K-nearest neighbor, decision tree, Bagging, AdaBoost, and the state-of-the-art methods. Experiments also explored the effect of subspace size on prediction performance. Conclusions. The proposed RS_SVM method yielded superior performance in analyzing gene expression profiles, which demonstrates that RS_SVM provides a good channel for such biological data.

  20. A deep auto-encoder model for gene expression prediction.

    Science.gov (United States)

    Xie, Rui; Wen, Jia; Quitadamo, Andrew; Cheng, Jianlin; Shi, Xinghua

    2017-11-17

    Gene expression is a key intermediate level through which genotypes lead to a particular trait. Gene expression is affected by various factors including genotypes of genetic variants. With an aim of delineating the genetic impact on gene expression, we build a deep auto-encoder model to assess how well genetic variants will contribute to gene expression changes. This new deep learning model is a regression-based predictive model based on the MultiLayer Perceptron and Stacked Denoising Auto-encoder (MLP-SAE). The model is trained using a stacked denoising auto-encoder for feature selection and a multilayer perceptron framework for backpropagation. We further improve the model by introducing dropout to prevent overfitting and improve performance. To demonstrate the usage of this model, we apply MLP-SAE to a real genomic dataset with genotypes and gene expression profiles measured in yeast. Our results show that the MLP-SAE model with dropout outperforms other models including Lasso, Random Forests and the MLP-SAE model without dropout. Using the MLP-SAE model with dropout, we show that gene expression quantifications predicted by the model solely based on genotypes align well with true gene expression patterns. We provide a deep auto-encoder model for predicting gene expression from SNP genotypes. This study demonstrates that deep learning is appropriate for tackling another genomic problem, i.e., building predictive models to understand genotypes' contribution to gene expression. With the emerging availability of richer genomic data, we anticipate that deep learning models will play a bigger role in modeling and interpreting genomics.

  1. Combining Gene Signatures Improves Prediction of Breast Cancer Survival

    Science.gov (United States)

    Zhao, Xi; Naume, Bjørn; Langerød, Anita; Frigessi, Arnoldo; Kristensen, Vessela N.; Børresen-Dale, Anne-Lise; Lingjærde, Ole Christian

    2011-01-01

    Background Several gene sets for prediction of breast cancer survival have been derived from whole-genome mRNA expression profiles. Here, we develop a statistical framework to explore whether combination of the information from such sets may improve prediction of recurrence and breast cancer specific death in early-stage breast cancers. Microarray data from two clinically similar cohorts of breast cancer patients are used as training (n = 123) and test set (n = 81), respectively. Gene sets from eleven previously published gene signatures are included in the study. Principal Findings To investigate the relationship between breast cancer survival and gene expression on a particular gene set, a Cox proportional hazards model is applied using partial likelihood regression with an L2 penalty to avoid overfitting and using cross-validation to determine the penalty weight. The fitted models are applied to an independent test set to obtain a predicted risk for each individual and each gene set. Hierarchical clustering of the test individuals on the basis of the vector of predicted risks results in two clusters with distinct clinical characteristics in terms of the distribution of molecular subtypes, ER, PR status, TP53 mutation status and histological grade category, and associated with significantly different survival probabilities (recurrence: p = 0.005; breast cancer death: p = 0.014). Finally, principal components analysis of the gene signatures is used to derive combined predictors used to fit a new Cox model. This model classifies test individuals into two risk groups with distinct survival characteristics (recurrence: p = 0.003; breast cancer death: p = 0.001). The latter classifier outperforms all the individual gene signatures, as well as Cox models based on traditional clinical parameters and the Adjuvant! Online for survival prediction. Conclusion Combining the predictive strength of multiple gene signatures improves prediction of breast
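
    The L2-penalized Cox model described in this abstract can be written out as follows. This is the textbook ridge-penalized partial log-likelihood under the assumption of no tied event times, not a formula quoted from the paper; the symbols $x_i$ (expression vector of individual $i$ on a given gene set), $\delta_i$ (event indicator), $R(t_i)$ (risk set at time $t_i$) and $\lambda$ (penalty weight) are notation introduced here.

```latex
\ell_{\lambda}(\beta)
  = \sum_{i:\,\delta_i = 1}
    \left[ x_i^{\top}\beta
           - \log \sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right) \right]
    - \frac{\lambda}{2}\,\lVert \beta \rVert_2^{2},
\qquad
\hat{\beta}_{\lambda} = \arg\max_{\beta}\; \ell_{\lambda}(\beta)
```

    In this notation the penalty weight $\lambda$ is the quantity chosen by cross-validation, and the predicted risk for a test individual with expression vector $x$ is the linear score $x^{\top}\hat{\beta}_{\lambda}$, computed once per gene set.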

  2. Combining gene signatures improves prediction of breast cancer survival.

    Directory of Open Access Journals (Sweden)

    Xi Zhao

    Full Text Available BACKGROUND: Several gene sets for prediction of breast cancer survival have been derived from whole-genome mRNA expression profiles. Here, we develop a statistical framework to explore whether combination of the information from such sets may improve prediction of recurrence and breast cancer specific death in early-stage breast cancers. Microarray data from two clinically similar cohorts of breast cancer patients are used as training (n = 123) and test set (n = 81), respectively. Gene sets from eleven previously published gene signatures are included in the study. PRINCIPAL FINDINGS: To investigate the relationship between breast cancer survival and gene expression on a particular gene set, a Cox proportional hazards model is applied using partial likelihood regression with an L2 penalty to avoid overfitting and using cross-validation to determine the penalty weight. The fitted models are applied to an independent test set to obtain a predicted risk for each individual and each gene set. Hierarchical clustering of the test individuals on the basis of the vector of predicted risks results in two clusters with distinct clinical characteristics in terms of the distribution of molecular subtypes, ER, PR status, TP53 mutation status and histological grade category, and associated with significantly different survival probabilities (recurrence: p = 0.005; breast cancer death: p = 0.014). Finally, principal components analysis of the gene signatures is used to derive combined predictors used to fit a new Cox model. This model classifies test individuals into two risk groups with distinct survival characteristics (recurrence: p = 0.003; breast cancer death: p = 0.001). The latter classifier outperforms all the individual gene signatures, as well as Cox models based on traditional clinical parameters and the Adjuvant! Online for survival prediction. CONCLUSION: Combining the predictive strength of multiple gene signatures improves

  3. Bioinformatic prediction and functional characterization of human KIAA0100 gene

    Directory of Open Access Journals (Sweden)

    He Cui

    2017-02-01

    Full Text Available Our previous study demonstrated that the human KIAA0100 gene was a novel acute monocytic leukemia-associated antigen (MLAA) gene. However, the functional characterization of the human KIAA0100 gene has remained unknown to date. Here, firstly, bioinformatic prediction of the human KIAA0100 gene was carried out using online software; secondly, human KIAA0100 gene expression was downregulated by the clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated (Cas) 9 system in U937 cells. Cell proliferation and apoptosis were next evaluated in KIAA0100-knockdown U937 cells. The bioinformatic prediction showed that the human KIAA0100 gene was located on 17q11.2, and the human KIAA0100 protein was located in the secretory pathway. In addition, the human KIAA0100 protein contained a signal peptide, a transmembrane region, three types of secondary structures (alpha helix, extended strand, and random coil), and four domains from mitochondrial protein 27 (FMP27). The observation on functional characterization of the human KIAA0100 gene revealed that its downregulation inhibited cell proliferation and promoted cell apoptosis in U937 cells. To summarize, these results suggest that the human KIAA0100 gene possibly comes within the mitochondrial genome; moreover, it is a novel anti-apoptotic factor related to carcinogenesis or progression in acute monocytic leukemia, and may be a potential target for immunotherapy against acute monocytic leukemia.

  4. A network approach to predict pathogenic genes for Fusarium graminearum.

    Directory of Open Access Journals (Sweden)

    Xiaoping Liu

    Full Text Available Fusarium graminearum is the pathogenic agent of Fusarium head blight (FHB), which is a destructive disease on wheat and barley, thereby causing huge economic loss and health problems to humans by contaminating foods. Identifying pathogenic genes can shed light on pathogenesis underlying the interaction between F. graminearum and its plant host. However, it is difficult to detect pathogenic genes for this destructive pathogen by time-consuming and expensive molecular biological experiments in lab. On the other hand, computational methods provide an alternative way to solve this problem. Since pathogenesis is a complicated procedure that involves complex regulations and interactions, the molecular interaction network of F. graminearum can give clues to potential pathogenic genes. Furthermore, the gene expression data of F. graminearum before and after its invasion into plant host can also provide useful information. In this paper, a novel systems biology approach is presented to predict pathogenic genes of F. graminearum based on molecular interaction network and gene expression data. With a small number of known pathogenic genes as seed genes, a subnetwork that consists of potential pathogenic genes is identified from the protein-protein interaction network (PPIN) of F. graminearum, where the genes in the subnetwork are further required to be differentially expressed before and after the invasion of the pathogenic fungus. Therefore, the candidate genes in the subnetwork are expected to be involved in the same biological processes as seed genes, which implies that they are potential pathogenic genes. The prediction results show that most of the pathogenic genes of F. graminearum are enriched in two important signal transduction pathways, including G protein coupled receptor pathway and MAPK signaling pathway, which are known to be related to pathogenesis in other fungi. In addition, several pathogenic genes predicted by our method are verified in other

  5. Gene Prediction in Metagenomic Fragments with Deep Learning

    Directory of Open Access Journals (Sweden)

    Shao-Wu Zhang

    2017-01-01

    Next generation sequencing technologies used in metagenomics yield numerous sequencing fragments which come from thousands of different species. Accurately identifying genes from metagenomic fragments is one of the most fundamental issues in metagenomics. In this article, by fusing multiple features (i.e., monocodon usage, monoamino acid usage, ORF length coverage, and Z-curve features) and using a deep stacking network learning model, we present a novel method (called Meta-MFDL) to predict metagenomic genes. The results with 10-fold cross-validation (CV) and independent tests show that Meta-MFDL is a powerful tool for identifying genes from metagenomic fragments.
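    Two of the feature types named in this abstract can be sketched briefly. The code below is an illustration, not the Meta-MFDL implementation: it assumes monocodon usage means the relative frequency of each of the 64 codons, and it uses the standard phase-specific Z-curve components (x = purine vs. pyrimidine, y = amino vs. keto, z = weak vs. strong hydrogen bonding) computed at each codon position. Function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
CODONS = ["".join(p) for p in product(BASES, repeat=3)]
CODON_SET = set(CODONS)


def monocodon_usage(orf):
    """Relative frequency of each of the 64 codons, in CODONS order."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3          # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_SET)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]


def z_curve_features(orf):
    """Nine Z-curve components: (x, y, z) for each of the three codon positions."""
    orf = orf.upper()
    features = []
    for phase in range(3):
        bases = orf[phase::3]                 # bases at this codon position
        n = len(bases) or 1
        a, c, g, t = (bases.count(b) / n for b in BASES)
        features += [
            (a + g) - (c + t),                # x: purine vs. pyrimidine
            (a + c) - (g + t),                # y: amino vs. keto
            (a + t) - (g + c),                # z: weak vs. strong H-bonding
        ]
    return features


fragment = "ATGAAATAA"                        # hypothetical toy ORF
feature_vector = monocodon_usage(fragment) + z_curve_features(fragment)
print(len(feature_vector))
```

    Concatenating such vectors with the other features would give the fused input for a classifier; the learning model itself is not sketched here.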

  6. Gene-specific function prediction for non-synonymous mutations in monogenic diabetes genes.

    Directory of Open Access Journals (Sweden)

    Quan Li

    The rapid progress of genomic technologies has been providing new opportunities to address the need for molecular diagnosis of maturity-onset diabetes of the young (MODY). However, whether a new mutation causes MODY can be questionable. A number of in silico methods have been developed to predict functional effects of rare human mutations. The purpose of this study is to compare the performance of different bioinformatics methods in the functional prediction of nonsynonymous mutations in each MODY gene, and to provide reference matrices to assist the molecular diagnosis of MODY. Our study showed that the prediction scores of the diabetes mutations by different methods were highly correlated, but that the methods complemented rather than replaced each other. The available in silico methods for the prediction of diabetes mutations had varied performances across different genes. Applying the gene-specific thresholds defined by this study may increase the performance of in silico prediction of disease-causing mutations.

  7. Multiple Suboptimal Solutions for Prediction Rules in Gene Expression Data

    Directory of Open Access Journals (Sweden)

    Osamu Komori

    2013-01-01

    This paper discusses mathematical and statistical aspects of analysis methods applied to microarray gene expression data. We focus on pattern recognition to extract informative features embedded in the data for prediction of phenotypes. It has been pointed out that there are severe difficulties due to the imbalance between the number of observed genes and the number of observed subjects. We reanalyze published microarray gene expression data to detect many other gene sets with almost the same performance. We conclude that, at the current stage, it is not possible to extract only the informative genes with high performance from among all observed genes. We investigate why this difficulty persists even though analysis methods and learning algorithms are actively being proposed in statistical machine learning. We focus on the mutual coherence, or the absolute value of the Pearson correlations between two genes, and describe the distributions of the correlation for the selected set of genes and the total set. We show that the problem of finding informative genes in high-dimensional data is ill-posed and that the difficulty is closely related to the mutual coherence.
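    The quantity this abstract centres on can be made concrete with a short sketch. It assumes, following the abstract's wording, that mutual coherence is taken as the largest absolute Pearson correlation over all pairs of genes; the data layout (a dict of per-gene expression lists) and the function names are hypothetical, and genes with constant expression are assumed to have been removed beforehand.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)            # assumes non-constant profiles


def mutual_coherence(expression):
    """Largest absolute pairwise correlation among the genes in `expression`."""
    return max(
        abs(pearson(expression[g], expression[h]))
        for g, h in combinations(expression, 2)
    )


# Hypothetical toy data: three genes measured in four subjects.
toy = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 1, 3, 2]}
print(mutual_coherence(toy))
```

    A coherence near 1 means at least two genes carry nearly the same signal, which is one way to see why many different gene sets can reach almost the same predictive performance.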

  8. The prediction of candidate genes for cervix related cancer through gene ontology and graph theoretical approach.

    Science.gov (United States)

    Hindumathi, V; Kranthi, T; Rao, S B; Manimaran, P

    2014-06-01

    With rapidly changing technology, prediction of candidate genes has become an indispensable task in recent years, mainly in the field of biological research. The empirical methods for candidate gene prioritization that help to explore the potential pathway between genetic determinants and complex diseases are highly cumbersome and labor intensive. In such a scenario, predicting potential targets for a disease state through in silico approaches is of interest to researchers. The prodigious availability of protein interaction data coupled with gene annotation eases the accurate determination of disease-specific candidate genes. In our work we have prioritized the cervix related cancer candidate genes by employing the approach of Csaba Ortutay and co-workers, which identifies candidate genes through graph theoretical centrality measures and gene ontology. With the advantage of the human protein interaction data, cervical cancer gene sets and the ontological terms, we were able to predict 15 novel candidates for cervical carcinogenesis. The disease relevance of the anticipated candidate genes was corroborated through a literature survey. Also, the presence of drugs for these candidates was detected through the Therapeutic Target Database (TTD) and DrugMap Central (DMC), which affirms that they may serve as potential drug targets for cervical cancer.
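    As a rough illustration of centrality-based prioritization, the sketch below ranks genes in a protein interaction network by degree centrality and reports those not already in a known disease gene set. The abstract does not say which centrality measures were used or how gene ontology terms were combined with them, so degree centrality alone is an assumption made for brevity, and all names and gene labels are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Fraction of the other nodes that each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:                            # ignore self-interactions
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbours.items()}


def rank_candidates(edges, known_genes):
    """Genes outside `known_genes`, most central first (ties broken by name)."""
    centrality = degree_centrality(edges)
    candidates = [(g, c) for g, c in centrality.items() if g not in known_genes]
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Hypothetical toy network; G1 stands in for a known disease gene.
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3")]
for gene, score in rank_candidates(edges, {"G1"}):
    print(gene, round(score, 3))
```

    Other measures such as betweenness or closeness could be substituted for the degree score without changing the ranking step.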

  9. Inductive matrix completion for predicting gene-disease associations.

    Science.gov (United States)

    Natarajan, Nagarajan; Dhillon, Inderjit S

    2014-06-15

    Most existing methods for predicting causal disease genes rely on a specific type of evidence, and are therefore limited in terms of applicability. More often than not, the type of evidence available for diseases varies: for example, we may know linked genes, keywords associated with the disease obtained by mining text, or co-occurrence of disease symptoms in patients. Similarly, the type of evidence available for genes varies: for example, specific microarray probes convey information only for certain sets of genes. In this article, we apply a novel matrix-completion method called Inductive Matrix Completion to the problem of predicting gene-disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene-disease associations. We construct features from different biological sources such as microarray expression data and disease-related textual data. A crucial advantage of the method is that it is inductive; it can be applied to diseases not seen at training time, unlike traditional matrix-completion approaches and network-based inference methods that are transductive. Comparison with state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database shows that the proposed approach is substantially better: it has close to a one-in-four chance of recovering a true association in the top 100 predictions, compared to the recently proposed Catapult method (second best) that has [...] bigdata.ices.utexas.edu/project/gene-disease. © The Author 2014. Published by Oxford University Press.
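    The inductive property described in this abstract follows from how an association is scored: in inductive matrix completion the score for gene i and disease j is the bilinear form x_i^T W H^T y_j, where x_i and y_j are feature vectors and W, H are learned factor matrices. The sketch below shows only this scoring step with made-up toy matrices; the training procedure, the feature construction and the function name are not taken from the abstract.

```python
def imc_score(gene_features, disease_features, W, H):
    """Score x^T W H^T y for one gene-disease pair.

    W has one row per gene feature, H one row per disease feature,
    and both have the same number of columns (the latent rank).
    """
    rank = len(W[0])
    gene_latent = [
        sum(gene_features[a] * W[a][r] for a in range(len(gene_features)))
        for r in range(rank)
    ]
    disease_latent = [
        sum(disease_features[b] * H[b][r] for b in range(len(disease_features)))
        for r in range(rank)
    ]
    return sum(g * d for g, d in zip(gene_latent, disease_latent))


# Toy values chosen for illustration only; real W and H are learned.
W = [[1, 2], [3, 4]]          # 2 gene features, rank 2
H = [[1, 0], [0, 1], [1, 1]]  # 3 disease features, rank 2
print(imc_score([1, 0], [0, 1, 1], W, H))
```

    Because the score depends on a disease only through its feature vector, a disease absent from training can still be scored, provided features are available for it.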

  10. Predictions of Gene Family Distributions in Microbial Genomes: Evolution by Gene Duplication and Modification

    International Nuclear Information System (INIS)

    Yanai, Itai; Camacho, Carlos J.; DeLisi, Charles

    2000-01-01

    A universal property of microbial genomes is the considerable fraction of genes that are homologous to other genes within the same genome. The process by which these homologues are generated is not well understood, but sequence analysis of 20 microbial genomes unveils a recurrent distribution of gene family sizes. We show that a simple evolutionary model based on random gene duplication and point mutations fully accounts for these distributions and permits predictions for the number of gene families in genomes not yet complete. Our findings are consistent with the notion that a genome evolves from a set of precursor genes to a mature size by gene duplications and increasing modifications. (c) 2000 The American Physical Society
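    A toy simulation can show how duplication and divergence produce a distribution of gene family sizes. This is not the authors' model, whose rules and parameters are not given in the abstract; it assumes that each step duplicates a uniformly chosen gene and that, with a fixed probability, the copy has mutated enough to count as the founder of a new family. All names and parameter values are hypothetical.

```python
import random
from collections import Counter


def simulate_families(n_precursors, n_steps, p_diverge, seed=0):
    """Return a Counter mapping family size -> number of families."""
    rng = random.Random(seed)
    genes = list(range(n_precursors))     # each entry is a gene's family label
    next_family = n_precursors
    for _ in range(n_steps):
        family = rng.choice(genes)        # duplicate a uniformly chosen gene
        if rng.random() < p_diverge:      # copy diverged beyond recognisable homology
            family = next_family
            next_family += 1
        genes.append(family)
    family_sizes = Counter(genes)         # family label -> number of members
    return Counter(family_sizes.values())


distribution = simulate_families(n_precursors=10, n_steps=500, p_diverge=0.1)
for size in sorted(distribution):
    print(size, distribution[size])
```

    Because larger families are more likely to be picked for duplication, runs of this kind tend to give many small families and a few large ones.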

  11. Predictions of Gene Family Distributions in Microbial Genomes: Evolution by Gene Duplication and Modification

    Energy Technology Data Exchange (ETDEWEB)

    Yanai, Itai; Camacho, Carlos J.; DeLisi, Charles

    2000-09-18

    A universal property of microbial genomes is the considerable fraction of genes that are homologous to other genes within the same genome. The process by which these homologues are generated is not well understood, but sequence analysis of 20 microbial genomes unveils a recurrent distribution of gene family sizes. We show that a simple evolutionary model based on random gene duplication and point mutations fully accounts for these distributions and permits predictions for the number of gene families in genomes not yet complete. Our findings are consistent with the notion that a genome evolves from a set of precursor genes to a mature size by gene duplications and increasing modifications. (c) 2000 The American Physical Society.

  12. Prediction of epigenetically regulated genes in breast cancer cell lines

    Energy Technology Data Exchange (ETDEWEB)

    Loss, Leandro A; Sadanandam, Anguraj; Durinck, Steffen; Nautiyal, Shivani; Flaucher, Diane; Carlton, Victoria EH; Moorhead, Martin; Lu, Yontao; Gray, Joe W; Faham, Malek; Spellman, Paul; Parvin, Bahram

    2010-05-04

    panel of breast cancer cell lines. Subnetwork enrichment of these genes has identified 35 common regulators with 6 or more predicted markers. In addition to identifying epigenetically regulated genes, we show evidence of differentially expressed methylation patterns between the basal and luminal subtypes. Our results indicate that the proposed computational protocol is a viable platform for identifying epigenetically regulated genes. Our protocol has generated a list of predictors including COL1A2, TOP2A, TFF1, and VAV3, genes whose key roles in epigenetic regulation are documented in the literature. Subnetwork enrichment of these predicted markers further suggests that epigenetic regulation of individual genes occurs in a coordinated fashion and through common regulators.

  13. Prediction of human protein function according to Gene Ontology categories

    DEFF Research Database (Denmark)

    Jensen, Lars Juhl; Gupta, Ramneek; Stærfeldt, Hans Henrik

    2003-01-01

    developed a method for prediction of protein function for a subset of classes from the Gene Ontology classification scheme. This subset includes several pharmaceutically interesting categories: transcription factors, receptors, ion channels, stress and immune response proteins, hormones and growth factors...

  14. Bioinformatics Prediction of Polyketide Synthase Gene Clusters from Mycosphaerella fijiensis.

    Science.gov (United States)

    Noar, Roslyn D; Daub, Margaret E

    2016-01-01

    Mycosphaerella fijiensis, causal agent of black Sigatoka disease of banana, is a Dothideomycete fungus closely related to fungi that produce polyketides important for plant pathogenicity. We utilized the M. fijiensis genome sequence to predict PKS genes and their gene clusters and make bioinformatics predictions about the types of compounds produced by these clusters. Eight PKS gene clusters were identified in the M. fijiensis genome, placing M. fijiensis into the 23rd percentile for the number of PKS genes compared to other Dothideomycetes. Analysis of the PKS domains identified three of the PKS enzymes as non-reducing and two as highly reducing. Gene clusters contained types of genes frequently found in PKS clusters including genes encoding transporters, oxidoreductases, methyltransferases, and non-ribosomal peptide synthases. Phylogenetic analysis identified a putative PKS cluster encoding melanin biosynthesis. None of the other clusters were closely aligned with genes encoding known polyketides, however three of the PKS genes fell into clades with clusters encoding alternapyrone, fumonisin, and solanapyrone produced by Alternaria and Fusarium species. A search for homologs among available genomic sequences from 103 Dothideomycetes identified close homologs (>80% similarity) for six of the PKS sequences. One of the PKS sequences was not similar (< 60% similarity) to sequences in any of the 103 genomes, suggesting that it encodes a unique compound. Comparison of the M. fijiensis PKS sequences with those of two other banana pathogens, M. musicola and M. eumusae, showed that these two species have close homologs to five of the M. fijiensis PKS sequences, but three others were not found in either species. RT-PCR and RNA-Seq analysis showed that the melanin PKS cluster was down-regulated in infected banana as compared to growth in culture. Three other clusters, however were strongly upregulated during disease development in banana, suggesting that they may encode

  15. Bioinformatics Prediction of Polyketide Synthase Gene Clusters from Mycosphaerella fijiensis.

    Directory of Open Access Journals (Sweden)

    Roslyn D Noar

    Mycosphaerella fijiensis, causal agent of black Sigatoka disease of banana, is a Dothideomycete fungus closely related to fungi that produce polyketides important for plant pathogenicity. We utilized the M. fijiensis genome sequence to predict PKS genes and their gene clusters and make bioinformatics predictions about the types of compounds produced by these clusters. Eight PKS gene clusters were identified in the M. fijiensis genome, placing M. fijiensis into the 23rd percentile for the number of PKS genes compared to other Dothideomycetes. Analysis of the PKS domains identified three of the PKS enzymes as non-reducing and two as highly reducing. Gene clusters contained types of genes frequently found in PKS clusters including genes encoding transporters, oxidoreductases, methyltransferases, and non-ribosomal peptide synthases. Phylogenetic analysis identified a putative PKS cluster encoding melanin biosynthesis. None of the other clusters were closely aligned with genes encoding known polyketides, however three of the PKS genes fell into clades with clusters encoding alternapyrone, fumonisin, and solanapyrone produced by Alternaria and Fusarium species. A search for homologs among available genomic sequences from 103 Dothideomycetes identified close homologs (>80% similarity) for six of the PKS sequences. One of the PKS sequences was not similar (< 60% similarity) to sequences in any of the 103 genomes, suggesting that it encodes a unique compound. Comparison of the M. fijiensis PKS sequences with those of two other banana pathogens, M. musicola and M. eumusae, showed that these two species have close homologs to five of the M. fijiensis PKS sequences, but three others were not found in either species. RT-PCR and RNA-Seq analysis showed that the melanin PKS cluster was down-regulated in infected banana as compared to growth in culture. Three other clusters, however were strongly upregulated during disease development in banana, suggesting that

  16. Combining many interaction networks to predict gene function and analyze gene lists.

    Science.gov (United States)

    Mostafavi, Sara; Morris, Quaid

    2012-05-01

    In this article, we review how interaction networks can be used alone or in combination in an automated fashion to provide insight into gene and protein function. We describe the concept of a "gene-recommender system" that can be applied to any large collection of interaction networks to make predictions about gene or protein function based on a query list of proteins that share a function of interest. We discuss these systems in general and focus on one specific system, GeneMANIA, that has unique features and uses different algorithms from the majority of other systems. © 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  17. The role of gene-gene interaction in the prediction of criminal behavior.

    Science.gov (United States)

    Boutwell, Brian B; Menard, Scott; Barnes, J C; Beaver, Kevin M; Armstrong, Todd A; Boisvert, Danielle

    2014-04-01

    A host of research has examined the possibility that environmental risk factors might condition the influence of genes on various outcomes. Less research, however, has been aimed at exploring the possibility that genetic factors might interact to impact the emergence of human traits. Even fewer studies exist examining the interaction of genes in the prediction of behavioral outcomes. The current study expands this body of research by testing the interaction between genes involved in neural transmission. Our findings suggest that certain dopamine genes interact to increase the odds of criminogenic outcomes in a national sample of Americans. Copyright © 2014 Elsevier Inc. All rights reserved.

  18. GOPET: A tool for automated predictions of Gene Ontology terms

    Directory of Open Access Journals (Sweden)

    Glatting Karl-Heinz

    2006-03-01

    Background: Vast progress in sequencing projects has called for annotation on a large scale. A number of methods have been developed to address this challenging task. These methods, however, either apply to specific subsets, or their predictions are not formalised, or they do not provide precise confidence values for their predictions. Description: We recently established a learning system for automated annotation, trained with a broad variety of different organisms to predict the standardised annotation terms from Gene Ontology (GO). Now, this method has been made available to the public via our web-service GOPET (Gene Ontology term Prediction and Evaluation Tool). It supplies annotation for sequences of any organism. For each predicted term an appropriate confidence value is provided. The basic method had been developed for predicting molecular function GO-terms. It is now expanded to predict biological process terms. This web service is available via http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar. Conclusion: Our web service gives experimental researchers as well as the bioinformatics community a valuable sequence annotation device. Additionally, GOPET also provides less significant annotation data which may serve as an extended discovery platform for the user.

  19. Microbial Functional Gene Diversity Predicts Groundwater Contamination and Ecosystem Functioning

    Directory of Open Access Journals (Sweden)

    Zhili He

    2018-02-01

    Contamination from anthropogenic activities has significantly impacted Earth’s biosphere. However, knowledge about how environmental contamination affects the biodiversity of groundwater microbiomes and ecosystem functioning remains very limited. Here, we used a comprehensive functional gene array to analyze groundwater microbiomes from 69 wells at the Oak Ridge Field Research Center (Oak Ridge, TN), representing a wide pH range and uranium, nitrate, and other contaminants. We hypothesized that the functional diversity of groundwater microbiomes would decrease as environmental contamination (e.g., uranium or nitrate) increased or at low or high pH, while some specific populations capable of utilizing or resistant to those contaminants would increase, and thus, such key microbial functional genes and/or populations could be used to predict groundwater contamination and ecosystem functioning. Our results indicated that functional richness/diversity decreased as uranium (but not nitrate) increased in groundwater. In addition, about 5.9% of specific key functional populations targeted by a comprehensive functional gene array (GeoChip 5) increased significantly (P < 0.05) as uranium or nitrate increased, and their changes could be used to successfully predict uranium and nitrate contamination and ecosystem functioning. This study indicates great potential for using microbial functional genes to predict environmental contamination and ecosystem functioning.

  20. Distinct gene number-genome size relationships for eukaryotes and non-eukaryotes: gene content estimation for dinoflagellate genomes.

    Directory of Open Access Journals (Sweden)

    Yubo Hou

    The ability to predict gene content is highly desirable for characterization of not-yet sequenced genomes like those of dinoflagellates. Using data from completely sequenced and annotated genomes from phylogenetically diverse lineages, we investigated the relationship between gene content and genome size using regression analyses. Distinct relationships between log10-transformed protein-coding gene number (Y') versus log10-transformed genome size (X', genome size in kbp) were found for eukaryotes and non-eukaryotes. Eukaryotes best fit a logarithmic model, Y' = ln(-46.200 + 22.678X'), whereas non-eukaryotes fit a linear model, Y' = 0.045 + 0.977X', both with high significance (p[...]0.91). Total gene number shows similar trends in both groups to their respective protein coding regressions. The distinct correlations reflect lower and decreasing gene-coding percentages as genome size increases in eukaryotes (82%-1%) compared to higher and relatively stable percentages in prokaryotes and viruses (97%-47%). The eukaryotic regression models project that the smallest dinoflagellate genome (3 x 10^6 kbp) contains 38,188 protein-coding (40,086 total) genes and the largest (245 x 10^6 kbp) 87,688 protein-coding (92,013 total) genes, corresponding to 1.8% and 0.05% gene-coding percentages. These estimates do not likely represent extraordinarily high functional diversity of the encoded proteome but rather highly redundant genomes as evidenced by high gene copy numbers documented for various dinoflagellate species.
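    The two regression models quoted in this abstract can be evaluated directly. The sketch below assumes the garbled eukaryotic formula reads Y' = ln(-46.200 + 22.678X') with a natural logarithm, that X' is log10 of genome size in kbp, and that Y' is log10 of protein-coding gene number; the function names are hypothetical. With these rounded coefficients the results come out somewhat above the abstract's quoted estimates of 38,188 and 87,688 genes, so the published figures may rest on unrounded coefficients or a slightly different fit.

```python
import math


def eukaryote_protein_genes(genome_kbp):
    """Protein-coding gene number from Y' = ln(-46.200 + 22.678 X')."""
    x = math.log10(genome_kbp)
    inner = -46.200 + 22.678 * x
    if inner <= 0:
        raise ValueError("genome size is below the range where the model is defined")
    y = math.log(inner)                   # natural logarithm, as assumed above
    return 10 ** y


def non_eukaryote_protein_genes(genome_kbp):
    """Protein-coding gene number from Y' = 0.045 + 0.977 X'."""
    x = math.log10(genome_kbp)
    return 10 ** (0.045 + 0.977 * x)


# Genome sizes in kbp: the two dinoflagellate extremes from the abstract.
for size_kbp in (3e6, 245e6):
    print(f"{size_kbp:.3g} kbp -> {eukaryote_protein_genes(size_kbp):,.0f} genes")
```

    The logarithmic form makes predicted gene number grow much more slowly than genome size, which matches the falling gene-coding percentage the abstract reports for large eukaryotic genomes.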

  1. Microbial Functional Gene Diversity Predicts Groundwater Contamination and Ecosystem Functioning

    Science.gov (United States)

    Zhang, Ping; Wu, Linwei; Rocha, Andrea M.; Shi, Zhou; Wu, Bo; Qin, Yujia; Wang, Jianjun; Yan, Qingyun; Curtis, Daniel; Ning, Daliang; Van Nostrand, Joy D.; Wu, Liyou; Watson, David B.; Adams, Michael W. W.; Alm, Eric J.; Adams, Paul D.; Arkin, Adam P.

    2018-01-01

    Contamination from anthropogenic activities has significantly impacted Earth’s biosphere. However, knowledge about how environmental contamination affects the biodiversity of groundwater microbiomes and ecosystem functioning remains very limited. Here, we used a comprehensive functional gene array to analyze groundwater microbiomes from 69 wells at the Oak Ridge Field Research Center (Oak Ridge, TN), representing a wide pH range and uranium, nitrate, and other contaminants. We hypothesized that the functional diversity of groundwater microbiomes would decrease as environmental contamination (e.g., uranium or nitrate) increased or at low or high pH, while some specific populations capable of utilizing or resistant to those contaminants would increase, and thus, such key microbial functional genes and/or populations could be used to predict groundwater contamination and ecosystem functioning. Our results indicated that functional richness/diversity decreased as uranium (but not nitrate) increased in groundwater. In addition, about 5.9% of specific key functional populations targeted by a comprehensive functional gene array (GeoChip 5) increased significantly (P < 0.05) as uranium or nitrate increased, and their changes could be used to successfully predict uranium and nitrate contamination and ecosystem functioning. This study indicates great potential for using microbial functional genes to predict environmental contamination and ecosystem functioning. PMID:29463661

  2. Evolutionary modeling and prediction of non-coding RNAs in Drosophila.

    Directory of Open Access Journals (Sweden)

    Robert K Bradley

    2009-08-01

    We performed benchmarks of phylogenetic grammar-based ncRNA gene prediction, experimenting with eight different models of structural evolution and two different programs for genome alignment. We evaluated our models using alignments of twelve Drosophila genomes. We find that ncRNA prediction performance can vary greatly between different gene predictors and subfamilies of ncRNA gene. Our estimates for false positive rates are based on simulations which preserve local islands of conservation; using these simulations, we predict a higher rate of false positives than previous computational ncRNA screens have reported. Using one of the tested prediction grammars, we provide an updated set of ncRNA predictions for D. melanogaster and compare them to previously-published predictions and experimental data. Many of our predictions show correlations with protein-coding genes. We found significant depletion of intergenic predictions near the 3' end of coding regions and furthermore depletion of predictions in the first intron of protein-coding genes. Some of our predictions are colocated with larger putative unannotated genes: for example, 17 of our predictions showing homology to the RFAM family snoR28 appear in a tandem array on the X chromosome; the 4.5 Kbp spanned by the predicted tandem array is contained within a FlyBase-annotated cDNA.

  3. Predictive modelling of gene expression from transcriptional regulatory elements.

    Science.gov (United States)

    Budden, David M; Hurley, Daniel G; Crampin, Edmund J

    2015-07-01

    Predictive modelling of gene expression provides a powerful framework for exploring the regulatory logic underpinning transcriptional regulation. Recent studies have demonstrated the utility of such models in identifying dysregulation of gene and miRNA expression associated with abnormal patterns of transcription factor (TF) binding or nucleosomal histone modifications (HMs). Despite the growing popularity of such approaches, a comparative review of the various modelling algorithms and feature extraction methods is lacking. We define and compare three methods of quantifying pairwise gene-TF/HM interactions and discuss their suitability for integrating the heterogeneous chromatin immunoprecipitation (ChIP)-seq binding patterns exhibited by TFs and HMs. We then construct log-linear and ϵ-support vector regression models from various mouse embryonic stem cell (mESC) and human lymphoblastoid (GM12878) data sets, considering both ChIP-seq- and position weight matrix- (PWM)-derived in silico TF-binding. The two algorithms are evaluated both in terms of their modelling prediction accuracy and ability to identify the established regulatory roles of individual TFs and HMs. Our results demonstrate that TF-binding and HMs are highly predictive of gene expression as measured by mRNA transcript abundance, irrespective of algorithm or cell type selection and considering both ChIP-seq and PWM-derived TF-binding. As we encourage other researchers to explore and develop these results, our framework is implemented using open-source software and made available as a preconfigured bootable virtual environment. © The Author 2014. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

  4. Identification of rat genes by TWINSCAN gene prediction, RT-PCR, and direct sequencing

    DEFF Research Database (Denmark)

    Wu, Jia Qian; Shteynberg, David; Arumugam, Manimozhiyan

    2004-01-01

    The publication of a draft sequence of a third mammalian genome--that of the rat--suggests a need to rethink genome annotation. New mammalian sequences will not receive the kind of labor-intensive annotation efforts that are currently being devoted to human. In this paper, we demonstrate an alternative approach: reverse transcription-polymerase chain reaction (RT-PCR) and direct sequencing based on dual-genome de novo predictions from TWINSCAN. We tested 444 TWINSCAN-predicted rat genes that showed significant homology to known human genes implicated in disease but that were partially... in the single-intron experiment. Spliced sequences were amplified in 46 cases (34%). We conclude that this procedure for elucidating gene structures with native cDNA sequences is cost-effective and will become even more so as it is further optimized.

  5. Dinucleotide controlled null models for comparative RNA gene prediction

    Directory of Open Access Journals (Sweden)

    Gesell Tanja

    2008-05-01

    Background: Comparative prediction of RNA structures can be used to identify functional noncoding RNAs in genomic screens. It was shown recently by Babak et al. [BMC Bioinformatics. 8:33] that RNA gene prediction programs can be biased by the genomic dinucleotide content, in particular those programs using a thermodynamic folding model including stacking energies. As a consequence, there is need for dinucleotide-preserving control strategies to assess the significance of such predictions. While there have been randomization algorithms for single sequences for many years, the problem has remained challenging for multiple alignments and there is currently no algorithm available. Results: We present a program called SISSIz that simulates multiple alignments of a given average dinucleotide content. Meeting additional requirements of an accurate null model, the randomized alignments are on average of the same sequence diversity and preserve local conservation and gap patterns. We make use of a phylogenetic substitution model that includes overlapping dependencies and site-specific rates. Using fast heuristics and a distance based approach, a tree is estimated under this model which is used to guide the simulations. The new algorithm is tested on vertebrate genomic alignments and the effect on RNA structure predictions is studied. In addition, we directly combined the new null model with the RNAalifold consensus folding algorithm giving a new variant of a thermodynamic structure based RNA gene finding program that is not biased by the dinucleotide content. Conclusion: SISSIz implements an efficient algorithm to randomize multiple alignments preserving dinucleotide content. It can be used to get more accurate estimates of false positive rates of existing programs, to produce negative controls for the training of machine learning based programs, or as standalone RNA gene finding program. Other applications in comparative genomics that require

  6. Dinucleotide controlled null models for comparative RNA gene prediction.

    Science.gov (United States)

    Gesell, Tanja; Washietl, Stefan

    2008-05-27

    Comparative prediction of RNA structures can be used to identify functional noncoding RNAs in genomic screens. It was shown recently by Babak et al. [BMC Bioinformatics. 8:33] that RNA gene prediction programs can be biased by the genomic dinucleotide content, in particular those programs using a thermodynamic folding model including stacking energies. As a consequence, there is need for dinucleotide-preserving control strategies to assess the significance of such predictions. While there have been randomization algorithms for single sequences for many years, the problem has remained challenging for multiple alignments and there is currently no algorithm available. We present a program called SISSIz that simulates multiple alignments of a given average dinucleotide content. Meeting additional requirements of an accurate null model, the randomized alignments are on average of the same sequence diversity and preserve local conservation and gap patterns. We make use of a phylogenetic substitution model that includes overlapping dependencies and site-specific rates. Using fast heuristics and a distance based approach, a tree is estimated under this model which is used to guide the simulations. The new algorithm is tested on vertebrate genomic alignments and the effect on RNA structure predictions is studied. In addition, we directly combined the new null model with the RNAalifold consensus folding algorithm giving a new variant of a thermodynamic structure based RNA gene finding program that is not biased by the dinucleotide content. SISSIz implements an efficient algorithm to randomize multiple alignments preserving dinucleotide content. It can be used to get more accurate estimates of false positive rates of existing programs, to produce negative controls for the training of machine learning based programs, or as standalone RNA gene finding program. Other applications in comparative genomics that require randomization of multiple alignments can be considered. 
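
    To make the notion of "dinucleotide content" concrete, the following is a minimal sketch and not the SISSIz algorithm. It counts overlapping dinucleotides in a single sequence and applies a plain mononucleotide shuffle, which keeps the base counts but in general does not keep the dinucleotide counts. The function names and the toy sequence are illustrative assumptions.

```python
from collections import Counter
import random


def mono_counts(seq):
    """Count single nucleotides."""
    return Counter(seq)


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides (positions i and i+1)."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def mono_shuffle(seq, seed=0):
    """Shuffle single bases; base counts are kept, neighbour pairs are not."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


seq = "ACGCGCGTTACGCGAT"  # toy sequence, purely illustrative
shuffled = mono_shuffle(seq)

print(dinucleotide_counts(seq))
print(dinucleotide_counts(shuffled))
```

    Comparing the two printed counters shows why a control that only shuffles bases is not a suitable null model for methods that are sensitive to stacking (neighbouring-base) effects.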

  7. Deep developmental transcriptome sequencing uncovers numerous new genes and enhances gene annotation in the sponge Amphimedon queenslandica.

    Science.gov (United States)

    Fernandez-Valverde, Selene L; Calcino, Andrew D; Degnan, Bernard M

    2015-05-15

    The demosponge Amphimedon queenslandica is amongst the few early-branching metazoans with an assembled and annotated draft genome, making it an important species in the study of the origin and early evolution of animals. Current gene models in this species are largely based on in silico predictions and low coverage expressed sequence tag (EST) evidence. Amphimedon queenslandica protein-coding gene models are improved using deep RNA-Seq data from four developmental stages and CEL-Seq data from 82 developmental samples. Over 86% of previously predicted genes are retained in the new gene models, although 24% have additional exons; there is also a marked increase in the total number of annotated 3' and 5' untranslated regions (UTRs). Importantly, these new developmental transcriptome data reveal numerous previously unannotated protein-coding genes in the Amphimedon genome, increasing the total gene number by 25%, from 30,060 to 40,122. In general, Amphimedon genes have introns that are markedly smaller than those in other animals and most of the alternatively spliced genes in Amphimedon undergo intron-retention; exon-skipping is the least common mode of alternative splicing. Finally, in addition to canonical polyadenylation signal sequences, Amphimedon genes are enriched in a number of unique AT-rich motifs in their 3' UTRs. The inclusion of developmental transcriptome data has substantially improved the structure and composition of protein-coding gene models in Amphimedon queenslandica, providing a more accurate and comprehensive set of genes for functional and comparative studies. These improvements reveal the Amphimedon genome is comprised of a remarkably high number of tightly packed genes. These genes have small introns and there is pervasive intron retention amongst alternatively spliced transcripts. These aspects of the sponge genome are more similar to unicellular opisthokont genomes than to other animal genomes.

  8. RNA editing differently affects protein-coding genes in D. melanogaster and H. sapiens.

    Science.gov (United States)

    Grassi, Luigi; Leoni, Guido; Tramontano, Anna

    2015-07-14

    When an RNA editing event occurs within a coding sequence it can lead to a different encoded amino acid. The biological significance of these events remains an open question: they can modulate protein functionality, increase the complexity of transcriptomes or arise from a loose specificity of the involved enzymes. We analysed the editing events in coding regions that do or do not produce a change in the encoded amino acid (nonsynonymous and synonymous events, respectively) in D. melanogaster and in H. sapiens and compared them with the appropriate random models. Interestingly, our results show that the phenomenon has rather different characteristics in the two organisms. For example, we confirm the observation that editing events occur more frequently in non-coding than in coding regions, and report that this effect is much more evident in H. sapiens. Additionally, in this latter organism, editing events tend to affect less conserved residues. The less frequently occurring editing events in Drosophila tend to avoid drastic amino acid changes. Interestingly, we find that, in Drosophila, changes from less frequently used codons to more frequently used ones are favoured, while this is not the case in H. sapiens.
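
    The distinction between synonymous and nonsynonymous events can be illustrated with a minimal sketch that classifies a single-base change within a codon using the standard genetic code. The function name is an assumption, and the A-to-G substitutions in the usage lines are chosen only because A-to-I editing is read as G in sequencing data; the abstract does not say which editing types were analysed.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_edit(codon, position, new_base):
    """Return 'synonymous' or 'nonsynonymous' for a one-base change.

    codon is a three-letter DNA codon, position is 0, 1 or 2,
    and new_base is the base observed after editing.
    """
    edited = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[edited]:
        return "synonymous"
    return "nonsynonymous"


print(classify_edit("AAA", 2, "G"))  # AAA and AAG both encode lysine
print(classify_edit("CAG", 1, "G"))  # CAG (glutamine) becomes CGG (arginine)
```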

  9. Discrete Ramanujan transform for distinguishing the protein coding regions from other regions.

    Science.gov (United States)

    Hua, Wei; Wang, Jiasong; Zhao, Jian

    2014-01-01

    Based on the study of the Ramanujan sum and Ramanujan coefficient, this paper suggests the concepts of discrete Ramanujan transform and spectrum. Using the Voss numerical representation, one maps a symbolic DNA strand to a numerical DNA sequence and deduces the discrete Ramanujan spectrum of the numerical DNA sequence. It is well known that the discrete Fourier power spectrum of a protein coding sequence has an important feature of 3-base periodicity, which is widely used for DNA sequence analysis by the technique of discrete Fourier transform. The analysis is performed by testing the signal-to-noise ratio at frequency N/3 as a criterion, where N is the length of the sequence. The results presented in this paper show that the property of 3-base periodicity can be identified as a prominent spike of the discrete Ramanujan spectrum at period 3 only for the protein coding regions. The signal-to-noise ratio for the discrete Ramanujan spectrum is defined for numerical measurement. Therefore, the discrete Ramanujan spectrum and the signal-to-noise ratio of a DNA sequence can be used for distinguishing the protein coding regions from the noncoding regions. All the exon and intron sequences in whole chromosomes 1, 2, 3 and 4 of Caenorhabditis elegans have been tested and the histograms and tables from the computational results illustrate the reliability of our method. In addition, a theoretical analysis leads to the conclusion that the algorithm for calculating the discrete Ramanujan spectrum has lower computational complexity and higher computational accuracy. The computational experiments show that the technique of using the discrete Ramanujan spectrum for classifying different DNA sequences is a fast and effective method. Copyright © 2014 Elsevier Ltd. All rights reserved.
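
    The Fourier-based criterion mentioned in the abstract (not the Ramanujan transform itself) can be sketched as follows. The sequence is mapped to four binary Voss indicator sequences, the power at frequency N/3 is summed over the four bases, and the result is divided by the mean power. Averaging over k = 1..N-1 (excluding the zero frequency) is an assumption made here; the paper may define the signal-to-noise ratio differently. Function names are illustrative.

```python
import cmath


def voss_indicators(seq):
    """Voss representation: one 0/1 indicator sequence per base."""
    return {b: [1.0 if s == b else 0.0 for s in seq] for b in "ACGT"}


def dft_power(seq, k):
    """Total power at DFT frequency index k, summed over the four bases."""
    n = len(seq)
    total = 0.0
    for x in voss_indicators(seq).values():
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(x))
        total += abs(coeff) ** 2
    return total


def period3_snr(seq):
    """Power at frequency N/3 divided by the mean power over k = 1..N-1."""
    n = len(seq)
    if n < 6 or n % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3 and >= 6")
    peak = dft_power(seq, n // 3)
    mean_power = sum(dft_power(seq, k) for k in range(1, n)) / (n - 1)
    return peak / mean_power


print(period3_snr("ATG" * 20))   # strongly 3-periodic toy sequence
print(period3_snr("ACGT" * 15))  # 4-periodic toy sequence, no period-3 signal
```

    This direct evaluation costs O(N^2) operations and is meant only for short toy inputs; a fast Fourier transform would be the usual choice for real sequences.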

  10. A new method for species identification via protein-coding and non-coding DNA barcodes by combining machine learning with bioinformatic methods.

    Directory of Open Access Journals (Sweden)

    Ai-bing Zhang

    Full Text Available Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95%CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95%CI: 96.60-99.37%) for 1094 brown algae queries, both using ITS barcodes.

  11. A new method for species identification via protein-coding and non-coding DNA barcodes by combining machine learning with bioinformatic methods.

    Science.gov (United States)

    Zhang, Ai-bing; Feng, Jie; Ward, Robert D; Wan, Ping; Gao, Qiang; Wu, Jun; Zhao, Wei-zhong

    2012-01-01

    Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95%CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95%CI: 96.60-99.37%) for 1094 brown algae queries, both using ITS barcodes.

  12. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation.

    Science.gov (United States)

    Pujar, Shashikant; O'Leary, Nuala A; Farrell, Catherine M; Loveland, Jane E; Mudge, Jonathan M; Wallin, Craig; Girón, Carlos G; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; Martin, Fergal J; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Suner, Marie-Marthe; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bruford, Elspeth A; Bult, Carol J; Frankish, Adam; Murphy, Terence; Pruitt, Kim D

    2018-01-04

    The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.

  13. Identification of a robust gene signature that predicts breast cancer outcome in independent data sets

    International Nuclear Information System (INIS)

    Korkola, James E; Waldman, Frederic M; Blaveri, Ekaterina; DeVries, Sandy; Moore, Dan H II; Hwang, E Shelley; Chen, Yunn-Yi; Estep, Anne LH; Chew, Karen L; Jensen, Ronald H

    2007-01-01

    Breast cancer is a heterogeneous disease, presenting with a wide range of histologic, clinical, and genetic features. Microarray technology has shown promise in predicting outcome in these patients. We profiled 162 breast tumors using expression microarrays to stratify tumors based on gene expression. A subset of 55 tumors with extensive follow-up was used to identify gene sets that predicted outcome. The predictive gene set was further tested in previously published data sets. We used different statistical methods to identify three gene sets associated with disease-free survival. A fourth gene set, consisting of 21 genes in common to all three sets, also had the ability to predict patient outcome. To validate the predictive utility of this derived gene set, it was tested in two published data sets from other groups. This gene set resulted in significant separation of patients on the basis of survival in these data sets, correctly predicting outcome in 62–65% of patients. By comparing outcome prediction within subgroups based on ER status, grade, and nodal status, we found that our gene set was most effective in predicting outcome in ER-positive and node-negative tumors. This robust gene selection with extensive validation has identified a predictive gene set that may have clinical utility for outcome prediction in breast cancer patients.

  14. mPUMA: a computational approach to microbiota analysis by de novo assembly of operational taxonomic units based on protein-coding barcode sequences.

    Science.gov (United States)

    Links, Matthew G; Chaban, Bonnie; Hemmingsen, Sean M; Muirhead, Kevin; Hill, Janet E

    2013-08-15

    Formation of operational taxonomic units (OTU) is a common approach to data aggregation in microbial ecology studies based on amplification and sequencing of individual gene targets. The de novo assembly of OTU sequences has been recently demonstrated as an alternative to widely used clustering methods, providing robust information from experimental data alone, without any reliance on an external reference database. Here we introduce mPUMA (microbial Profiling Using Metagenomic Assembly, http://mpuma.sourceforge.net), a software package for identification and analysis of protein-coding barcode sequence data. It was developed originally for Cpn60 universal target sequences (also known as GroEL or Hsp60). Using an unattended process that is independent of external reference sequences, mPUMA forms OTUs by DNA sequence assembly and is capable of tracking OTU abundance. mPUMA processes microbial profiles both in terms of the direct DNA sequence as well as in the translated amino acid sequence for protein coding barcodes. By forming OTUs and calculating abundance through an assembly approach, mPUMA is capable of generating inputs for several popular microbiota analysis tools. Using SFF data from sequencing of a synthetic community of Cpn60 sequences derived from the human vaginal microbiome, we demonstrate that mPUMA can faithfully reconstruct all expected OTU sequences and produce compositional profiles consistent with actual community structure. mPUMA enables analysis of microbial communities while empowering the discovery of novel organisms through OTU assembly.

  15. Predictive gene testing for Huntington disease and other neurodegenerative disorders.

    Science.gov (United States)

    Wedderburn, S; Panegyres, P K; Andrew, S; Goldblatt, J; Liebeck, T; McGrath, F; Wiltshire, M; Pestell, C; Lee, J; Beilby, J

    2013-12-01

    Controversies exist around predictive testing (PT) programmes in neurodegenerative disorders. This study sets out to answer the following questions relating to Huntington disease (HD) and other neurodegenerative disorders: differences between these patients in their PT journeys, why and when individuals withdraw from PT, and decision-making processes regarding reproductive genetic testing. A case series analysis of patients having PT from the multidisciplinary Western Australian centre for PT over the past 20 years was performed using internationally recognised guidelines for predictive gene testing in neurodegenerative disorders. Of 740 at-risk patients, 518 applied for PT: 466 at risk of HD, 52 at risk of other neurodegenerative disorders - spinocerebellar ataxias, hereditary prion disease and familial Alzheimer disease. Thirteen percent withdrew from PT - 80.32% of withdrawals occurred during counselling stages. Major withdrawal reasons related to timing in the patients' lives or unknown as the patient did not disclose the reason. Thirty-eight HD individuals had reproductive genetic testing: 34 initiated prenatal testing (of which eight withdrew from the process) and four initiated pre-implantation genetic diagnosis. There was no recorded or other evidence of major psychological reactions or suicides during PT. People withdrew from PT in relation to life stages and reasons that are unknown. Our findings emphasise the importance of: (i) adherence to internationally recommended guidelines for PT; (ii) the role of the multidisciplinary team in risk minimisation; and (iii) patient selection. © 2013 The Authors; Internal Medicine Journal © 2013 Royal Australasian College of Physicians.

  16. Network-based prediction and knowledge mining of disease genes.

    Science.gov (United States)

    Carson, Matthew B; Lu, Hui

    2015-01-01

    In recent years, high-throughput protein interaction identification methods have generated a large amount of data. When combined with the results from other in vivo and in vitro experiments, a complex set of relationships between biological molecules emerges. The growing popularity of network analysis and data mining has allowed researchers to recognize indirect connections between these molecules. Due to the interdependent nature of network entities, evaluating proteins in this context can reveal relationships that may not otherwise be evident. We examined the human protein interaction network as it relates to human illness using the Disease Ontology. After calculating several topological metrics, we trained an alternating decision tree (ADTree) classifier to identify disease-associated proteins. Using a bootstrapping method, we created a tree to highlight conserved characteristics shared by many of these proteins. Subsequently, we reviewed a set of non-disease-associated proteins that were misclassified by the algorithm with high confidence and searched for evidence of a disease relationship. Our classifier was able to predict disease-related genes with 79% area under the receiver operating characteristic (ROC) curve (AUC), which indicates the tradeoff between sensitivity and specificity and is a good predictor of how a classifier will perform on future data sets. We found that a combination of several network characteristics including degree centrality, disease neighbor ratio, eccentricity, and neighborhood connectivity help to distinguish between disease- and non-disease-related proteins. Furthermore, the ADTree allowed us to understand which combinations of strongly predictive attributes contributed most to protein-disease classification. In our post-processing evaluation, we found several examples of potential novel disease-related proteins and corresponding literature evidence. 
In addition, we showed that first- and second-order neighbors in the PPI network
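
    Two of the network characteristics named in this abstract can be illustrated on a toy graph. The sketch below computes degree centrality (degree divided by the number of other nodes) and a disease neighbor ratio, taken here to mean the fraction of a protein's direct neighbors that are annotated as disease-associated. That definition, the protein labels P1..P5 and the function names are assumptions for illustration and may differ from the paper's.

```python
def build_graph(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_centrality(graph, node):
    """Degree of the node divided by the number of other nodes."""
    return len(graph[node]) / (len(graph) - 1)


def disease_neighbor_ratio(graph, node, disease_nodes):
    """Fraction of direct neighbors that are disease-associated (assumed definition)."""
    neighbors = graph[node]
    if not neighbors:
        return 0.0
    return len(neighbors & disease_nodes) / len(neighbors)


# Hypothetical interaction network and disease annotation.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
graph = build_graph(edges)
disease = {"P2", "P3"}

for protein in sorted(graph):
    print(protein,
          degree_centrality(graph, protein),
          disease_neighbor_ratio(graph, protein, disease))
```

    Per-protein values of this kind could serve as input features for a classifier; the classifier itself (an alternating decision tree in the abstract) is not reproduced here.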

  17. Tree ferns: monophyletic groups and their relationships as revealed by four protein-coding plastid loci.

    Science.gov (United States)

    Korall, Petra; Pryer, Kathleen M; Metzgar, Jordan S; Schneider, Harald; Conant, David S

    2006-06-01

    Tree ferns are a well-established clade within leptosporangiate ferns. Most of the 600 species (in seven families and 13 genera) are arborescent, but considerable morphological variability exists, spanning the giant scaly tree ferns (Cyatheaceae), the low, erect plants (Plagiogyriaceae), and the diminutive endemics of the Guayana Highlands (Hymenophyllopsidaceae). In this study, we investigate phylogenetic relationships within tree ferns based on analyses of four protein-coding, plastid loci (atpA, atpB, rbcL, and rps4). Our results reveal four well-supported clades, with genera of Dicksoniaceae (sensu ) interspersed among them: (A) (Loxomataceae, (Culcita, Plagiogyriaceae)), (B) (Calochlaena, (Dicksonia, Lophosoriaceae)), (C) Cibotium, and (D) Cyatheaceae, with Hymenophyllopsidaceae nested within. How these four groups are related to one another, to Thyrsopteris, or to Metaxyaceae is weakly supported. Our results show that Dicksoniaceae and Cyatheaceae, as currently recognised, are not monophyletic and new circumscriptions for these families are needed.

  18. An evolutionary model for protein-coding regions with conserved RNA structure

    DEFF Research Database (Denmark)

    Pedersen, Jakob Skou; Forsberg, Roald; Meyer, Irmtraud Margret

    2004-01-01

    in the RNA structure. The overlap of these fundamental dependencies is sufficient to cause "contagious" context dependencies which cascade across many nucleotide sites. Such large-scale dependencies challenge the use of traditional phylogenetic models in evolutionary inference because they explicitly assume...... components of traditional phylogenetic models. We applied this to a data set of full-genome sequences from the hepatitis C virus where five RNA structures are mapped within the coding region. This allowed us to partition the effects of selection on different structural elements and to test various hypotheses......Here we present a model of nucleotide substitution in protein-coding regions that also encode the formation of conserved RNA structures. In such regions, apparent evolutionary context dependencies exist, both between nucleotides occupying the same codon and between nucleotides forming a base pair...

  19. Allele-Selective Transcriptome Recruitment to Polysomes Primed for Translation: Protein-Coding and Noncoding RNAs, and RNA Isoforms.

    Directory of Open Access Journals (Sweden)

    Roshan Mascarenhas

    Full Text Available mRNA translation into proteins is highly regulated, but the role of mRNA isoforms, noncoding RNAs (ncRNAs), and genetic variants remains poorly understood. mRNA levels on polysomes have been shown to correlate well with expressed protein levels, pointing to polysomal loading as a critical factor. To study regulation and genetic factors of protein translation we measured levels and allelic ratios of mRNAs and ncRNAs (including microRNAs) in lymphoblast cell lines (LCL) and in polysomal fractions. We first used targeted assays to measure polysomal loading of mRNA alleles, confirming reported genetic effects on translation of OPRM1 and NAT1, and detecting no effect of rs1045642 (3435C>T) in ABCB1 (MDR1) on polysomal loading while supporting previous results showing increased mRNA turnover of the 3435T allele. Use of high-throughput sequencing of complete transcript profiles (RNA-Seq) in three LCLs revealed significant differences in polysomal loading of individual RNA classes and isoforms. Correlated polysomal distribution between protein-coding and non-coding RNAs suggests interactions between them. Allele-selective polysome recruitment revealed strong genetic influence for multiple RNAs, attributable either to differential expression of RNA isoforms or to differential loading onto polysomes, the latter defining a direct genetic effect on translation. Genes identified by different allelic RNA ratios between cytosol and polysomes were enriched with published expression quantitative trait loci (eQTLs) affecting RNA functions, and associations with clinical phenotypes. Polysomal RNA-Seq combined with allelic ratio analysis provides a powerful approach to study polysomal RNA recruitment and regulatory variants affecting protein translation.

  20. A genome-wide gene function prediction resource for Drosophila melanogaster.

    Directory of Open Access Journals (Sweden)

    Han Yan

    2010-08-01

    Full Text Available Predicting gene functions by integrating large-scale biological data remains a challenge for systems biology. Here we present a resource for Drosophila melanogaster gene function predictions. We trained function-specific classifiers to optimize the influence of different biological datasets for each functional category. Our model predicted GO terms and KEGG pathway memberships for Drosophila melanogaster genes with high accuracy, as affirmed by cross-validation, supporting literature evidence, and large-scale RNAi screens. The resulting resource of prioritized associations between Drosophila genes and their potential functions offers a guide for experimental investigations.

  1. Identifying the Gene Signatures from Gene-Pathway Bipartite Network Guarantees the Robust Model Performance on Predicting the Cancer Prognosis

    Directory of Open Access Journals (Sweden)

    Li He

    2014-01-01

    Full Text Available To improve the prediction of cancer prognosis in clinical research, various algorithms have been developed to construct predictive models with the gene signatures detected by DNA microarrays. Due to the heterogeneity of the clinical samples, the list of differentially expressed genes (DEGs) generated by statistical methods or machine learning algorithms often includes a number of false positive genes, which are not associated with the phenotypic differences between the compared clinical conditions, and this reduces the reliability of the predictive models. In this study, we proposed a strategy that combines a statistical algorithm with gene-pathway bipartite networks to generate reliable lists of cancer-related DEGs, and we constructed models using support vector machines for predicting the prognosis of three types of cancers, namely, breast cancer, acute myeloma leukemia, and glioblastoma. Our results demonstrated that, combined with the gene-pathway bipartite networks, our proposed strategy can efficiently generate reliable cancer-related DEG lists for constructing the predictive models. In addition, the model performance in the swap analysis was similar to that in the original analysis, indicating the robustness of the models in predicting the cancer outcomes.

  2. Identifying novel genes in C. elegans using SAGE tags

    Directory of Open Access Journals (Sweden)

    Chen Nansheng

    2010-12-01

    Full Text Available Background: Despite extensive efforts devoted to predicting protein-coding genes in genome sequences, many bona fide genes have not been found and many existing gene models are not accurate in all sequenced eukaryote genomes. This situation is partly explained by the fact that gene prediction programs have been developed based on our incomplete understanding of gene feature information such as splicing and promoter characteristics. Additionally, full-length cDNAs of many genes and their isoforms are hard to obtain due to their low level or rare expression. In order to obtain full-length sequences of all protein-coding genes, alternative approaches are required. Results: In this project, we have developed a method of reconstructing full-length cDNA sequences based on short expressed sequence tags which is called sequence tag-based amplification of cDNA ends (STACE). Expressed tags are used as anchors for retrieving full-length transcripts in two rounds of PCR amplification. We have demonstrated the application of STACE in reconstructing full-length cDNA sequences using expressed tags mined in an array of serial analysis of gene expression (SAGE) of C. elegans cDNA libraries. We have successfully applied STACE to recover sequence information for 12 genes, for two of which we found isoforms. STACE was used to successfully recover full-length cDNA sequences for seven of these genes. Conclusions: The STACE method can be used to effectively reconstruct full-length cDNA sequences of genes that are under-represented in cDNA sequencing projects and have been missed by existing gene prediction methods, but their existence has been suggested by short sequence tags such as SAGE tags.

  3. Benchmarking of gene prediction programs for metagenomic data.

    Science.gov (United States)

    Yok, Non; Rosen, Gail

    2010-01-01

    This manuscript presents the most rigorous benchmarking of gene annotation algorithms for metagenomic datasets to date. We compare three different programs: GeneMark, MetaGeneAnnotator (MGA) and Orphelia. The comparisons are based on their performances over simulated fragments from one hundred species of diverse lineages. We defined four different types of fragments; two types come from the inter- and intra-coding regions and the other types are from the gene edges. Hoff et al. used only 12 species in their comparison; therefore, their sample is too small to represent an environmental sample. Also, no predecessor has separately examined fragments that contain gene edges as opposed to intra-coding regions. A general observation in our results is that the performance of all these programs improves as we increase the length of the fragment. On the other hand, intra-coding fragments of our data show low annotation error in all of the programs compared to the gene edge fragments. Overall, we found an upper-bound performance by combining all the methods.

  4. A Gene Expression Profile of BRCAness That Predicts for Responsiveness to Platinum and PARP Inhibitors

    Science.gov (United States)

    2017-02-01

    affecting the function of Fanconi Anemia (FA) genes (FANCA/B/C/D2/E/F/G/I/J/L/M, PALB2) or DNA damage response genes involved in HR 5 (ATM, ATR...Award Number: W81XWH-10-1-0585 TITLE: A Gene Expression Profile of BRCAness That Predicts for Responsiveness to Platinum and PARP Inhibitors...To) 15 July 2010 – 2 Nov. 2016 4. TITLE AND SUBTITLE A Gene Expression Profile of BRCAness That Predicts for Responsiveness to Platinum and PARP

  5. Genome-wide prediction and analysis of human tissue-selective genes using microarray expression data

    Directory of Open Access Journals (Sweden)

    Teng Shaolei

    2013-01-01

    Full Text Available Background: Understanding how genes are expressed specifically in particular tissues is a fundamental question in developmental biology. Many tissue-specific genes are involved in the pathogenesis of complex human diseases. However, experimental identification of tissue-specific genes is time consuming and difficult. The accurate predictions of tissue-specific gene targets could provide useful information for biomarker development and drug target identification. Results: In this study, we have developed a machine learning approach for predicting the human tissue-specific genes using microarray expression data. The lists of known tissue-specific genes for different tissues were collected from the UniProt database, and the expression data retrieved from the previously compiled dataset according to the lists were used for input vector encoding. Random Forests (RFs) and Support Vector Machines (SVMs) were used to construct accurate classifiers. The RF classifiers were found to outperform SVM models for tissue-specific gene prediction. The results suggest that the candidate genes for brain or liver specific expression can provide valuable information for further experimental studies. Our approach was also applied for identifying tissue-selective gene targets for different types of tissues. Conclusions: A machine learning approach has been developed for accurately identifying the candidate genes for tissue specific/selective expression. The approach provides an efficient way to select some interesting genes for developing new biomedical markers and improve our knowledge of tissue-specific expression.

  6. Predictive value of MSH2 gene expression in colorectal cancer treated with capecitabine

    DEFF Research Database (Denmark)

    Jensen, Lars H; Danenberg, Kathleen D; Danenberg, Peter V

    2007-01-01

    was associated with a hazard ratio of 0.5 (95% confidence interval, 0.23-1.11; P = 0.083) in survival analysis. CONCLUSION: The higher gene expression of MSH2 in responders and the trend for predicting overall survival indicates a predictive value of this marker in the treatment of advanced CRC with capecitabine.......PURPOSE: The objective of the present study was to evaluate the gene expression of the DNA mismatch repair gene MSH2 as a predictive marker in advanced colorectal cancer (CRC) treated with first-line capecitabine. PATIENTS AND METHODS: Microdissection of paraffin-embedded tumor tissue, RNA...

  7. Grimontia indica AK16T, sp. nov., isolated from a seawater sample reports the presence of pathogenic genes similar to Vibrio Genus

    Digital Repository Service at National Institute of Oceanography (India)

    Aditya, S.; Bhumika, V.; Khatri, I.; Srinivas, T.N.R.; Subramanian, S.; Korpole, S.; AnilKumar, P.

    sequence was amplified using standard primers. The amplified product was sequenced using Genetic Analyzer ABI 3130XL (Applied Biosystems, California, USA). High quality sequence region was used for BLAST (Basic Local Alignment Search Tool) [13] against 16S... (PLOS ONE, January 2014, Volume 9, Issue 1, e85590) ...content) in 13 contigs. A total of 3,777 genes were predicted, of which 3,651 were protein-coding genes while 126 were RNA...

  8. Testing the predictive value of peripheral gene expression for nonremission following citalopram treatment for major depression.

    Science.gov (United States)

    Guilloux, Jean-Philippe; Bassi, Sabrina; Ding, Ying; Walsh, Chris; Turecki, Gustavo; Tseng, George; Cyranowski, Jill M; Sibille, Etienne

    2015-02-01

    Major depressive disorder (MDD) in general, and anxious-depression in particular, are characterized by poor rates of remission with first-line treatments, contributing to the chronic illness burden suffered by many patients. Prospective research is needed to identify the biomarkers predicting nonremission prior to treatment initiation. We collected blood samples from a discovery cohort of 34 adult MDD patients with co-occurring anxiety and 33 matched, nondepressed controls at baseline and after 12 weeks (of citalopram plus psychotherapy treatment for the depressed cohort). Samples were processed on gene arrays and group differences in gene expression were investigated. Exploratory analyses suggest that at pretreatment baseline, nonremitting patients differ from controls with gene function and transcription factor analyses potentially related to elevated inflammation and immune activation. In a second phase, we applied an unbiased machine learning prediction model and corrected for model-selection bias. Results show that baseline gene expression predicted nonremission with 79.4% corrected accuracy with a 13-gene model. The same gene-only model predicted nonremission after 8 weeks of citalopram treatment with 76% corrected accuracy in an independent validation cohort of 63 MDD patients treated with citalopram at another institution. Together, these results demonstrate the potential, but also the limitations, of baseline peripheral blood-based gene expression to predict nonremission after citalopram treatment. These results not only support their use in future prediction tools but also suggest that increased accuracy may be obtained with the inclusion of additional predictors (eg, genetics and clinical scales).

  9. Exploring gene expression signatures for predicting disease free survival after resection of colorectal cancer liver metastases.

    Directory of Open Access Journals (Sweden)

    Nikol Snoeren

    BACKGROUND AND OBJECTIVES: This study was designed to identify and validate gene signatures that can predict disease free survival (DFS) in patients undergoing a radical resection for their colorectal liver metastases (CRLM). METHODS: Tumor gene expression profiles were collected from 119 patients undergoing surgery for their CRLM in the Paul Brousse Hospital (France) and the University Medical Center Utrecht (The Netherlands). Patients were divided into high and low risk groups. A randomly selected training set was used to find predictive gene signatures. The ability of these gene signatures to predict DFS was tested in an independent validation set comprising the remaining patients. Furthermore, 5 known clinical risk scores were tested in our complete patient cohort. RESULT: No gene signature was found that significantly predicted DFS in the validation set. In contrast, three out of five clinical risk scores were able to predict DFS in our patient cohort. CONCLUSIONS: No gene signature was found that could predict DFS in patients undergoing CRLM resection. Three out of five clinical risk scores were able to predict DFS in our patient cohort. These results emphasize the need for validating risk scores in independent patient groups and suggest improved designs for future studies.

  10. Comprehensive transcriptome analysis reveals novel genes involved in cardiac glycoside biosynthesis and mlncRNAs associated with secondary metabolism and stress response in Digitalis purpurea

    Directory of Open Access Journals (Sweden)

    Wu Bin

    2012-01-01

    Background: Digitalis purpurea is an important ornamental and medicinal plant. There is considerable interest in exploring its transcriptome. Results: Through high-throughput 454 sequencing and subsequent assembly, we obtained 23532 genes, of which 15626 encode conserved proteins. We determined 140 unigenes to be candidates involved in cardiac glycoside biosynthesis. They could be grouped into 30 families, of which 29 were identified for the first time in D. purpurea. We identified 2660 mRNA-like npcRNA (mlncRNA) candidates, an emerging class of regulators, using a computational mlncRNA identification pipeline, and 13 microRNA-producing unigenes based on sequence conservation and hairpin structure-forming capability. Twenty-five protein-coding unigenes were predicted to be targets of these microRNAs. Among the mlncRNA candidates, only 320 could be grouped into 140 families with at least two members in a family. The majority of D. purpurea mlncRNAs were species-specific, and many of them showed tissue-specific expression and responded to cold and dehydration stresses. We identified 417 protein-coding genes with regions significantly homologous or complementary to 375 mlncRNAs. These include five genes involved in secondary metabolism. A positive correlation was found in gene expression between protein-coding genes and the homologous mlncRNAs in response to cold and dehydration stresses, while the correlation was negative when protein-coding genes and mlncRNAs were complementary to each other. Conclusions: Through comprehensive transcriptome analysis, we not only identified 29 novel gene families potentially involved in the biosynthesis of cardiac glycosides but also characterized a large number of mlncRNAs. Our results suggest the importance of mlncRNAs in secondary metabolism and stress response in D. purpurea.

  11. Effects of using coding potential, sequence conservation and mRNA structure conservation for predicting pyrrolysine containing genes

    DEFF Research Database (Denmark)

    Have, Christian Theil; Zambach, Sine; Christiansen, Henning

    2013-01-01

    for prediction of pyrrolysine incorporating genes in genomes of bacteria and archaea leading to insights about the factors driving pyrrolysine translation and identification of new gene candidates. The method predicts known conserved genes with high recall and predicts several other promising candidates...... for experimental verification. The method is implemented as a computational pipeline which is available on request....

  12. Effects of sample size on robustness and prediction accuracy of a prognostic gene signature

    Directory of Open Access Journals (Sweden)

    Kim Seon-Young

    2009-05-01

    Background: Little overlap between independently developed gene signatures and poor inter-study applicability of gene signatures are two of the major concerns raised in the development of microarray-based prognostic gene signatures. One recent study suggested that thousands of samples are needed to generate a robust prognostic gene signature. Results: A data set of 1,372 samples was generated by combining eight breast cancer gene expression data sets produced using the same microarray platform and, using this data set, the effects of varying sample sizes on several performance measures of a prognostic gene signature were investigated. The overlap between independently developed gene signatures increased linearly with more samples, attaining an average overlap of 16.56% with 600 samples. The concordance between outcomes predicted by different gene signatures also increased with more samples, up to 94.61% with 300 samples. The accuracy of outcome prediction also increased with more samples. Finally, analysis using only Estrogen Receptor-positive (ER+) patients attained higher prediction accuracy than using both patient groups, suggesting that sub-type specific analysis can lead to the development of better prognostic gene signatures. Conclusion: Increasing sample sizes generated a gene signature with better stability, better concordance in outcome prediction, and better prediction accuracy. However, the degree of performance improvement from increased sample size differed between signature overlap and concordance in outcome prediction, suggesting that the sample size required for a study should be determined according to the specific aims of the study.
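
    The overlap figure quoted above can be made concrete with a small sketch. The abstract does not say how overlap was measured, so the definition here (shared genes as a percentage of the smaller signature) is an assumption, and the function name and toy gene lists are purely illustrative.

```python
def percent_overlap(signature_a, signature_b):
    """Shared genes as a percentage of the smaller signature.

    Assumption: the source abstract does not define its overlap measure;
    this is one common choice, not necessarily the one the authors used.
    """
    a, b = set(signature_a), set(signature_b)
    if not a or not b:
        raise ValueError("signatures must be non-empty")
    return 100.0 * len(a & b) / min(len(a), len(b))


# Toy signatures for illustration only (not taken from the study).
sig_1 = ["ESR1", "MKI67", "AURKA", "CCNB1", "BIRC5"]
sig_2 = ["MKI67", "BIRC5", "ERBB2", "GRB7", "MYBL2"]

print(percent_overlap(sig_1, sig_2))  # two of five genes shared: 40.0
```

    Normalizing by the smaller signature keeps the score at 100% when one signature is fully contained in the other; dividing by the union (a Jaccard index) would be an equally defensible choice.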

  13. Bioinformatics analysis of the predicted polyprenol reductase genes in higher plants

    Science.gov (United States)

    Basyuni, M.; Wati, R.

    2018-03-01

    The present study applies bioinformatics methods to analyze twenty-four predicted polyprenol reductase genes from higher plants in GenBank and to predict their structure, composition, similarity, subcellular localization, and phylogeny. The physicochemical properties of plant polyprenol showed diversity among the observed genes. The percentage of the secondary structure of plant polyprenol genes followed the ratio order of α helix > random coil > extended chain structure. The values for chloroplast transit peptides, but not signal peptides, were too low, indicating that there are few chloroplast transit peptides in plant polyprenol reductase genes. The likelihood of a potential transit peptide varied among the plant polyprenol reductases, suggesting the importance of understanding the variety of peptide components of plant polyprenol genes. To clarify this finding, a phylogenetic tree was drawn. The phylogenetic tree shows several branches, suggesting that plant polyprenol reductase genes group into divergent clusters.

  14. PHENOstruct: Prediction of human phenotype ontology terms using heterogeneous data sources.

    Science.gov (United States)

    Kahanda, Indika; Funk, Christopher; Verspoor, Karin; Ben-Hur, Asa

    2015-01-01

    The human phenotype ontology (HPO) was recently developed as a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein-coding genes have HPO annotations. However, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. In this work we demonstrate the performance advantage of the structured SVM approach, which was shown to be highly effective for Gene Ontology term prediction, in comparison to several baseline methods. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large scale literature mining data.

  15. A Third Approach to Gene Prediction Suggests Thousands of Additional Human Transcribed Regions

    Science.gov (United States)

    Glusman, Gustavo; Qin, Shizhen; El-Gewely, M. Raafat; Siegel, Andrew F; Roach, Jared C; Hood, Leroy; Smit, Arian F. A

    2006-01-01

    The identification and characterization of the complete ensemble of genes is a main goal of deciphering the digital information stored in the human genome. Many algorithms for computational gene prediction have been described, ultimately derived from two basic concepts: (1) modeling gene structure and (2) recognizing sequence similarity. Successful hybrid methods combining these two concepts have also been developed. We present a third orthogonal approach to gene prediction, based on detecting the genomic signatures of transcription, accumulated over evolutionary time. We discuss four algorithms based on this third concept: Greens and CHOWDER, which quantify mutational strand biases caused by transcription-coupled DNA repair, and ROAST and PASTA, which are based on strand-specific selection against polyadenylation signals. We combined these algorithms into an integrated method called FEAST, which we used to predict the location and orientation of thousands of putative transcription units not overlapping known genes. Many of the newly predicted transcriptional units do not appear to code for proteins. The new algorithms are particularly apt at detecting genes with long introns and lacking sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many apparent “genomic deserts.” PMID:16543943

  16. Prediction of highly expressed genes in microbes based on chromatin accessibility

    DEFF Research Database (Denmark)

    Willenbrock, Hanni; Ussery, David

    2007-01-01

    BACKGROUND: It is well known that gene expression is dependent on chromatin structure in eukaryotes and it is likely that chromatin can play a role in bacterial gene expression as well. Here, we use a nucleosomal position preference measure of anisotropic DNA flexibility to predict highly expressed...

  17. An Approach for Predicting Essential Genes Using Multiple Homology Mapping and Machine Learning Algorithms.

    Science.gov (United States)

    Hua, Hong-Li; Zhang, Fa-Zhan; Labena, Abraham Alemayehu; Dong, Chuan; Jin, Yan-Ting; Guo, Feng-Biao

    Investigation of essential genes is important for understanding the minimal gene sets of a cell and for discovering potential drug targets. In this study, a novel approach based on multiple homology mapping and machine learning was introduced to predict essential genes. We focused on 25 bacteria which have characterized essential genes. The predictions yielded a highest area under the receiver operating characteristic (ROC) curve (AUC) of 0.9716 in a tenfold cross-validation test. Proper features were utilized to construct models to make predictions in distantly related bacteria. The accuracy of predictions was evaluated via the consistency between predictions and known essential genes of the target species. The highest AUC of 0.9552 and an average AUC of 0.8314 were achieved when making predictions across organisms. An independent dataset from Synechococcus elongatus, which was released recently, was obtained for further assessment of the performance of our model. The AUC score of predictions is 0.7855, which is higher than that of other methods. This research shows that features obtained by homology mapping alone can achieve good or even better results than integrated features. Meanwhile, the work indicates that a machine learning-based method can assign more effective weight coefficients than an empirical formula based on biological knowledge.
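
    Since the record above reports its results as AUC values, a minimal sketch of how an AUC can be computed may help. It uses the rank interpretation (the probability that a randomly chosen essential gene scores higher than a randomly chosen non-essential one). The labels and scores are made-up illustration data, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positives and negatives.

    Equals the probability that a random positive outscores a random
    negative; ties count as one half. O(P*N), fine for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = known essential gene, 0 = non-essential gene.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

print(round(roc_auc(labels, scores), 4))  # 8 of 9 pairs ordered correctly: 0.8889
```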

  18. A seven-gene CpG-island methylation panel predicts breast cancer progression

    International Nuclear Information System (INIS)

    Li, Yan; Melnikov, Anatoliy A.; Levenson, Victor; Guerra, Emanuela; Simeone, Pasquale; Alberti, Saverio; Deng, Youping

    2015-01-01

    DNA methylation regulates gene expression, through the inhibition/activation of gene transcription of methylated/unmethylated genes. Hence, DNA methylation profiling can capture pivotal features of gene expression in cancer tissues from patients at the time of diagnosis. In this work, we analyzed a breast cancer case series, to identify DNA methylation determinants of metastatic versus non-metastatic tumors. CpG-island methylation was evaluated on a 56-gene cancer-specific biomarker microarray in metastatic versus non-metastatic breast cancers in a multi-institutional case series of 123 breast cancer patients. Global statistical modeling and unsupervised hierarchical clustering were applied to identify a multi-gene binary classifier with high sensitivity and specificity. Network analysis was utilized to quantify the connectivity of the identified genes. Seven genes (BRCA1, DAPK1, MSH2, CDKN2A, PGR, PRKCDBP, RANKL) were found informative for prognosis of metastatic diffusion and were used to calculate classifier accuracy versus the entire data-set. Individual-gene performances showed sensitivities of 63–79 %, 53–84 % specificities, positive predictive values of 59–83 % and negative predictive values of 63–80 %. When modelled together, these seven genes reached a sensitivity of 93 %, 100 % specificity, a positive predictive value of 100 % and a negative predictive value of 93 %, with high statistical power. Unsupervised hierarchical clustering independently confirmed these findings, in close agreement with the accuracy measurements. Network analyses indicated tight interrelationship between the identified genes, suggesting this to be a functionally-coordinated module, linked to breast cancer progression. Our findings identify CpG-island methylation profiles with deep impact on clinical outcome, paving the way for use as novel prognostic assays in clinical settings. The online version of this article (doi:10.1186/s12885-015-1412-9) contains supplementary
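
    The four measures quoted in this abstract all derive from a 2x2 confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and were chosen only to mirror the reported pattern, since the abstract does not give the underlying numbers.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary-classifier measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all metastatic cases
        "specificity": tn / (tn + fp),  # true negatives among all non-metastatic cases
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts (not from the study): no false positives, one false negative.
m = diagnostic_metrics(tp=14, fp=0, tn=14, fn=1)
for name, value in m.items():
    print(f"{name}: {value:.0%}")
```

    With zero false positives, specificity and positive predictive value are both 100% by construction, which is why those two figures move together in results of this kind.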

  19. Concerted down-regulation of immune-system related genes predicts metastasis in colorectal carcinoma

    International Nuclear Information System (INIS)

    Fehlker, Marion; Huska, Matthew R; Jöns, Thomas; Andrade-Navarro, Miguel A; Kemmner, Wolfgang

    2014-01-01

    This study aimed at the identification of prognostic gene expression markers in early primary colorectal carcinomas without metastasis at the time point of surgery by analyzing genome-wide gene expression profiles using oligonucleotide microarrays. Cryo-conserved tumor specimens from 45 patients with early colorectal cancers were examined, with the majority of them being UICC stage II or earlier and with a follow-up time of 41–115 months. Gene expression profiling was performed using Whole Human Genome 4x44K Oligonucleotide Microarrays. Validation of microarray data was performed on five of the genes in a smaller cohort. Using a novel algorithm based on the recursive application of support vector machines (SVMs), we selected a signature of 44 probes that discriminated between patients developing later metastasis and patients with a good prognosis. Interestingly, almost half of the genes were related to the patients’ immune response and showed reduced expression in the metastatic cases. Whereas up to now gene signatures containing genes with various biological functions have been described for prediction of metastasis in CRC, in this study metastasis could be well predicted by a set of gene expression markers consisting exclusively of genes related to the MHC class II complex involved in immune response. Thus, our data emphasize that the proper function of a comprehensive network of immune response genes is of vital importance for the survival of colorectal cancer patients.

  20. Genomic Features That Predict Allelic Imbalance in Humans Suggest Patterns of Constraint on Gene Expression Variation

    Science.gov (United States)

    Fédrigo, Olivier; Haygood, Ralph; Mukherjee, Sayan; Wray, Gregory A.

    2009-01-01

    Variation in gene expression is an important contributor to phenotypic diversity within and between species. Although this variation often has a genetic component, identification of the genetic variants driving this relationship remains challenging. In particular, measurements of gene expression usually do not reveal whether the genetic basis for any observed variation lies in cis or in trans to the gene, a distinction that has direct relevance to the physical location of the underlying genetic variant, and which may also impact its evolutionary trajectory. Allelic imbalance measurements identify cis-acting genetic effects by assaying the relative contribution of the two alleles of a cis-regulatory region to gene expression within individuals. Identification of patterns that predict commonly imbalanced genes could therefore serve as a useful tool and also shed light on the evolution of cis-regulatory variation itself. Here, we show that sequence motifs, polymorphism levels, and divergence levels around a gene can be used to predict commonly imbalanced genes in a human data set. Reduction of this feature set to four factors revealed that only one factor significantly differentiated between commonly imbalanced and nonimbalanced genes. We demonstrate that these results are consistent between the original data set and a second published data set in humans obtained using different technical and statistical methods. Finally, we show that variation in the single allelic imbalance-associated factor is partially explained by the density of genes in the region of a target gene (allelic imbalance is less probable for genes in gene-dense regions), and, to a lesser extent, the evenness of expression of the gene across tissues and the magnitude of negative selection on putative regulatory regions of the gene. 
    These results suggest that the genomic distribution of functional cis-regulatory variants in the human genome is nonrandom, perhaps due to local differences in evolutionary

  1. Prediction of highly expressed genes in microbes based on chromatin accessibility

    Directory of Open Access Journals (Sweden)

    Ussery David W

    2007-02-01

    Background: It is well known that gene expression is dependent on chromatin structure in eukaryotes and it is likely that chromatin can play a role in bacterial gene expression as well. Here, we use a nucleosomal position preference measure of anisotropic DNA flexibility to predict highly expressed genes in microbial genomes. We compare these predictions with those based on codon adaptation index (CAI) values, and also with experimental data for 6 different microbial genomes, with a particular interest in experimental data from Escherichia coli. Moreover, position preference is examined further in 328 sequenced microbial genomes. Results: We find that absolute gene expression levels are correlated with the position preference in many microbial genomes. It is postulated that in these regions, the DNA may be more accessible to the transcriptional machinery. Moreover, ribosomal proteins and ribosomal RNA are encoded by DNA having significantly lower position preference values than other genes in fast-replicating microbes. Conclusion: This insight into DNA structure-dependent gene expression in microbes may be exploited for predicting the expression of non-translated genes such as non-coding RNAs that may not be predicted by any of the conventional codon usage bias approaches.
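
    For readers unfamiliar with the codon adaptation index used as the comparison baseline above, here is a minimal sketch of the usual definition: each codon gets a weight equal to its count divided by the count of the most frequent synonymous codon in a reference set of highly expressed genes, and the CAI of a gene is the geometric mean of those weights. The reference sequence is a toy example, and skipping codons absent from the reference is a simplification assumed here; this is not the implementation used in the study.

```python
import math
from collections import Counter

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(CODONS, AMINO))


def split_codons(seq):
    """Split a DNA sequence into complete codons, dropping any trailing bases."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs):
    """Weight per codon: its count / count of the commonest synonymous codon."""
    counts = Counter(
        c for s in reference_seqs for c in split_codons(s) if c in CODON_TABLE
    )
    best = {}
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {codon: n / best[CODON_TABLE[codon]] for codon, n in counts.items()}


def cai(seq, weights):
    """Geometric mean of codon weights; Met, Trp and stop codons are ignored."""
    logs = []
    for codon in split_codons(seq):
        aa = CODON_TABLE.get(codon)
        if aa in (None, "*", "M", "W"):
            continue
        w = weights.get(codon)
        if w is None:
            continue  # simplification: codons unseen in the reference are skipped
        logs.append(math.log(w))
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy reference set standing in for highly expressed genes (illustrative only).
reference = ["GCT" * 3 + "GCC" + "AAA" * 2 + "AAG"]
weights = relative_adaptiveness(reference)

print(cai("GCT" + "AAA", weights))            # only preferred codons: 1.0
print(round(cai("GCC" + "AAG", weights), 3))  # rarer codons score lower: 0.408
```

    Because CAI depends on codon usage in translated sequence, it cannot score non-coding RNA genes, which is the limitation the abstract's conclusion points to.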

  2. Accurate prediction of secondary metabolite gene clusters in filamentous fungi

    DEFF Research Database (Denmark)

    Andersen, Mikael Rørdam; Nielsen, Jakob Blæsbjerg; Klitgaard, Andreas

    2013-01-01

    Biosynthetic pathways of secondary metabolites from fungi are currently subject to an intense effort to elucidate the genetic basis for these compounds due to their large potential within pharmaceutics and synthetic biochemistry. The preferred method is methodical gene deletions to identify...... used A. nidulans for our method development and validation due to the wealth of available biochemical data, but the method can be applied to any fungus with a sequenced and assembled genome, thus supporting further secondary metabolite pathway elucidation in the fungal kingdom....

  3. Gene prediction and RFX transcriptional regulation analysis using comparative genomics

    OpenAIRE

    Chu, Jeffrey Shih Chieh

    2011-01-01

    Regulatory Factor X (RFX) is a family of transcription factors (TF) that is conserved in all metazoans, in some fungi, and in only a few single-cellular organisms. Seven members are found in mammals, nine in fishes, three in fruit flies, and a single member in nematodes and fungi. RFX is involved in many different roles in humans, but a particular function that is conserved in many metazoans is its regulation of ciliogenesis. Probing over 150 genomes for the presence of RFX and ciliary genes ...

  4. Clustering gene expression data based on predicted differential effects of GV interaction.

    Science.gov (United States)

    Pan, Hai-Yan; Zhu, Jun; Han, Dan-Fu

    2005-02-01

    Microarrays have become a popular biotechnology in biological and medical research. However, systematic and stochastic variabilities in microarray data are expected and unavoidable, with the result that the raw measurements carry inherent "noise" within microarray experiments. Currently, logarithmic ratios are usually analyzed by various clustering methods directly, which may introduce biased interpretation in identifying groups of genes or samples. In this paper, a statistical method based on mixed model approaches is proposed for microarray data cluster analysis. The underlying rationale of this method is to partition the observed total gene expression level into various variations caused by different factors using an ANOVA model, and to predict the differential effects of GV (gene by variety) interaction using the adjusted unbiased prediction (AUP) method. The predicted GV interaction effects can then be used as the inputs of cluster analysis. We illustrate the application of our method with a gene expression dataset and demonstrate the utility of our approach using an external validation.
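
    The idea of partitioning expression into gene, variety, and gene-by-variety terms can be illustrated with a deliberately simplified sketch. It assumes a balanced table of cell means and a plain fixed-effects decomposition; the mixed-model AUP predictor described in the abstract is more involved and is not reproduced here. The function name and the toy matrix are hypothetical.

```python
def gv_interaction(matrix):
    """Interaction residuals for a balanced gene x variety table of cell means.

    Model assumed for illustration: y[g][v] = mu + G[g] + V[v] + GV[g][v],
    so GV[g][v] = y[g][v] - gene_mean[g] - variety_mean[v] + grand_mean.
    """
    n_g, n_v = len(matrix), len(matrix[0])
    grand = sum(sum(row) for row in matrix) / (n_g * n_v)
    gene_means = [sum(row) / n_v for row in matrix]
    variety_means = [sum(matrix[g][v] for g in range(n_g)) / n_g for v in range(n_v)]
    return [
        [matrix[g][v] - gene_means[g] - variety_means[v] + grand for v in range(n_v)]
        for g in range(n_g)
    ]


# Toy log-expression values: rows are genes, columns are varieties.
expr = [
    [1.0, 3.0],
    [2.0, 2.0],
]

gv = gv_interaction(expr)
print(gv)  # [[-0.5, 0.5], [0.5, -0.5]]
```

    Rows of the resulting interaction matrix, rather than raw log ratios, would then be passed to the clustering step, so that genes are grouped by how their response differs across varieties and not by overall expression level.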

  5. Pathway analysis of gene signatures predicting metastasis of node-negative primary breast cancer

    International Nuclear Information System (INIS)

    Yu, Jack X; Sieuwerts, Anieta M; Zhang, Yi; Martens, John WM; Smid, Marcel; Klijn, Jan GM; Wang, Yixin; Foekens, John A

    2007-01-01

    Published prognostic gene signatures in breast cancer have few genes in common. Here we provide a rationale for this observation by studying the prognostic power and the underlying biological pathways of different gene signatures. Gene signatures to predict the development of metastases in estrogen receptor-positive and estrogen receptor-negative tumors were identified using 500 re-sampled training sets and mapping to Gene Ontology Biological Process to identify over-represented pathways. The Global Test program confirmed that gene expression profiles in the common pathways were associated with the metastasis of the patients. The apoptotic pathway and cell division, or cell growth regulation and G-protein coupled receptor signal transduction, were most significantly associated with the metastatic capability of estrogen receptor-positive or estrogen receptor-negative tumors, respectively. A gene signature derived from the common pathways predicted metastasis in an independent cohort. Mapping of the pathways represented by different published prognostic signatures showed that they share 53% of the identified pathways. We show that divergent gene sets classifying patients for the same clinical endpoint represent similar biological processes and that pathway-derived signatures can be used to predict prognosis. Furthermore, our study reveals that the underlying biology related to aggressiveness of estrogen receptor subgroups of breast cancer is quite different.

  6. Analysis and prediction of gene splice sites in four Aspergillus genomes

    DEFF Research Database (Denmark)

    Wang, Kai; Ussery, David; Brunak, Søren

    2009-01-01

    Several Aspergillus fungal genomic sequences have been published, with many more in progress. Obviously, it is essential to have high-quality, consistently annotated sets of proteins from each of the genomes, in order to make meaningful comparisons. We have developed a dedicated, publicly available......, splice site prediction program called NetAspGene, for the genus Aspergillus. Gene sequences from Aspergillus fumigatus, the most common mould pathogen, were used to build and test our model. Compared to many animals and plants, Aspergillus contains smaller introns; thus we have applied a larger window...... better splice site prediction than other available tools. NetAspGene will be very helpful for the study in Aspergillus splice sites and especially in alternative splicing. A webpage for NetAspGene is publicly available at http://www.cbs.dtu.dk/services/NetAspGene....

  7. Oxytocin receptor gene variation predicts subjective responses to MDMA.

    Science.gov (United States)

    Bershad, Anya K; Weafer, Jessica J; Kirkpatrick, Matthew G; Wardle, Margaret C; Miller, Melissa A; de Wit, Harriet

    2016-12-01

    3,4-Methylenedioxymethamphetamine (MDMA, "ecstasy") enhances desire to socialize and feelings of empathy, which are thought to be related to increased oxytocin levels. Thus, variation in the oxytocin receptor gene (OXTR) may influence responses to the drug. Here, we examined the influence of a single nucleotide polymorphism (SNP) in OXTR on responses to MDMA in humans. Based on findings that carriers of the A allele at rs53576 exhibit reduced sensitivity to oxytocin-induced social behavior, we hypothesized that these individuals would show reduced subjective responses to MDMA, including sociability. In this three-session, double-blind, within-subjects study, healthy volunteers with past MDMA experience (N = 68) received MDMA (0, 0.75 mg/kg, and 1.5 mg/kg) and provided self-report ratings of sociability, anxiety, and drug effects. These responses were examined in relation to rs53576. MDMA (1.5 mg/kg) did not increase sociability in individuals with the A/A genotype as it did in G allele carriers. The genotypic groups did not differ in responses at the lower MDMA dose, or in cardiovascular or other subjective responses. These findings are consistent with the idea that MDMA-induced sociability is mediated by oxytocin, and that variation in the oxytocin receptor gene may influence responses to the drug.

  8. Reduce manual curation by combining gene predictions from multiple annotation engines, a case study of start codon prediction.

    Directory of Open Access Journals (Sweden)

    Thomas H A Ederveen

    Nowadays, prokaryotic genomes are sequenced faster than the capacity to manually curate gene annotations. Automated genome annotation engines (AGEs) provide users with a straightforward and complete solution for predicting ORF coordinates and function. For many labs, the use of AGEs is therefore essential to decrease the time necessary for annotating a given prokaryotic genome. However, it is not uncommon for AGEs to provide different and sometimes conflicting predictions. Combining multiple AGEs might allow for more accurate predictions. Here we analyzed the ab initio open reading frame (ORF) calling performance of different AGEs based on curated genome annotations of eight strains from different bacterial species with GC% ranging from 35-52%. We present a case study which demonstrates a novel way of comparative genome annotation, using combinations of AGEs in a pre-defined order (or path) to predict ORF start codons. The order of AGE combinations is from high to low specificity, where the specificity is based on the eight genome annotations. For each AGE combination we are able to derive a so-called projected confidence value, which is the average specificity of ORF start codon prediction based on the eight genomes. The projected confidence enables estimating the likelihood of a correct prediction for a particular ORF start codon by a particular AGE combination, pinpointing ORFs whose start codons are notoriously difficult to predict. We correctly predict start codons for 90.5±4.8% of the genes in a genome (based on the eight genomes) with an accuracy of 81.1±7.6%. Our consensus-path methodology allows a marked improvement over majority voting (9.7±4.4%), and with an optimal path, ORF start prediction sensitivity is gained while maintaining a high specificity.
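
    One way to picture the consensus-path idea is sketched below. It assumes that an AGE combination "fires" when every engine in it reports the same start position, and that combinations are tried in order of decreasing specificity; the abstract does not spell out these rules, so treat them as assumptions. The engine names, positions, and path are hypothetical.

```python
from collections import Counter


def consensus_path_start(predictions, path):
    """Return (start, combination) for the first combination whose engines all agree.

    predictions: dict mapping engine name -> predicted start position (or None).
    path: combinations of engine names, ordered from high to low specificity.
    """
    for combo in path:
        starts = {predictions.get(engine) for engine in combo}
        if len(starts) == 1 and None not in starts:
            return next(iter(starts)), combo
    return None, None


def majority_vote_start(predictions):
    """Baseline: the start position reported by the most engines."""
    counts = Counter(s for s in predictions.values() if s is not None)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical predictions for one ORF from five engines.
predictions = {
    "engine_a": 120, "engine_b": 120,
    "engine_c": 150, "engine_d": 150, "engine_e": 150,
}
# Hypothetical path; in the study the order comes from specificity on curated genomes.
path = [
    ("engine_a", "engine_b", "engine_c"),
    ("engine_a", "engine_b"),
    ("engine_c",),
]

print(consensus_path_start(predictions, path))  # (120, ('engine_a', 'engine_b'))
print(majority_vote_start(predictions))         # 150
```

    The example is built so that the two strategies disagree: a high-specificity pair can override a larger group of engines, which is the kind of case where an ordered path could outperform simple voting.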

  9. Cis-regulatory somatic mutations and gene-expression alteration in B-cell lymphomas.

    Science.gov (United States)

    Mathelier, Anthony; Lefebvre, Calvin; Zhang, Allen W; Arenillas, David J; Ding, Jiarui; Wasserman, Wyeth W; Shah, Sohrab P

    2015-04-23

    With the rapid increase of whole-genome sequencing of human cancers, an important opportunity to analyze and characterize somatic mutations lying within cis-regulatory regions has emerged. A focus on protein-coding regions to identify nonsense or missense mutations disruptive to protein structure and/or function has led to important insights; however, the impact on gene expression of mutations lying within cis-regulatory regions remains under-explored. We analyzed somatic mutations from 84 matched tumor-normal whole genomes from B-cell lymphomas with accompanying gene expression measurements to elucidate the extent to which these cancers are disrupted by cis-regulatory mutations. We characterize mutations overlapping a high quality set of well-annotated transcription factor binding sites (TFBSs), covering a similar portion of the genome as protein-coding exons. Our results indicate that cis-regulatory mutations overlapping predicted TFBSs are enriched in promoter regions of genes involved in apoptosis or growth/proliferation. By integrating gene expression data with mutation data, our computational approach culminates with identification of cis-regulatory mutations most likely to participate in dysregulation of the gene expression program. The impact can be measured along with protein-coding mutations to highlight key mutations disrupting gene expression and pathways in cancer. Our study yields specific genes with disrupted expression triggered by genomic mutations in either the coding or the regulatory space. It implies that mutated regulatory components of the genome contribute substantially to cancer pathways. Our analyses demonstrate that identifying genomically altered cis-regulatory elements coupled with analysis of gene expression data will augment biological interpretation of mutational landscapes of cancers.

  10. Predicting Genes Involved in Human Cancer Using Network Contextual Information

    Directory of Open Access Journals (Sweden)

    Rahmani Hossein

    2012-03-01

    Protein-Protein Interaction (PPI) networks have been widely used for the task of predicting proteins involved in cancer. Previous research has shown that functional information about the protein for which a prediction is made, proximity to specific other proteins in the PPI network, as well as local network structure are informative features in this respect. In this work, we introduce two new types of input features, reflecting additional information: (1) Functional Context: the functions of proteins interacting with the target protein (rather than the protein itself); and (2) Structural Context: the relative position of the target protein with respect to specific other proteins selected according to a novel ANOVA (analysis of variance) based measure. We also introduce a selection strategy to pinpoint the most informative features. Results show that the proposed feature types and feature selection strategy yield informative features. A standard machine learning method (Naive Bayes) that uses the features proposed here outperforms the current state-of-the-art methods by more than 5% with respect to F-measure. In addition, manual inspection confirms the biological relevance of the top-ranked features.

  11. Prioritizing orphan proteins for further study using phylogenomics and gene expression profiles in Streptomyces coelicolor

    Directory of Open Access Journals (Sweden)

    Takano Eriko

    2011-09-01

    Background: Streptomyces coelicolor, a model organism of antibiotic producing bacteria, has one of the largest genomes of the bacterial kingdom, including 7825 predicted protein coding genes. A large number of these genes, nearly 34%, are functionally orphan (hypothetical proteins with unknown function). However, in gene expression time course data, many of these functionally orphan genes show interesting expression patterns. Results: In this paper, we analyzed all functionally orphan genes of Streptomyces coelicolor and identified a list of "high priority" orphans by combining gene expression analysis and additional phylogenetic information (i.e. the level of evolutionary conservation of each protein). Conclusions: The prioritized orphan genes are promising candidates to be examined experimentally in the lab for further characterization of their function.

  12. An approach for reduction of false predictions in reverse engineering of gene regulatory networks.

    Science.gov (United States)

    Khan, Abhinandan; Saha, Goutam; Pal, Rajat Kumar

    2018-05-14

    A gene regulatory network discloses the regulatory interactions amongst genes, at a particular condition of the human body. The accurate reconstruction of such networks from time-series genetic expression data using computational tools offers a stiff challenge for contemporary computer scientists. This is crucial to facilitate the understanding of the proper functioning of a living organism. Unfortunately, the computational methods produce many false predictions along with the correct predictions, which is unwanted. Investigations in the domain focus on the identification of as many correct regulations as possible in the reverse engineering of gene regulatory networks to make it more reliable and biologically relevant. One way to achieve this is to reduce the number of incorrect predictions in the reconstructed networks. In the present investigation, we have proposed a novel scheme to decrease the number of false predictions by suitably combining several metaheuristic techniques. We have implemented the same using a dataset ensemble approach (i.e. combining multiple datasets) also. We have employed the proposed methodology on real-world experimental datasets of the SOS DNA Repair network of Escherichia coli and the IMRA network of Saccharomyces cerevisiae. Subsequently, we have experimented upon somewhat larger, in silico networks, namely, DREAM3 and DREAM4 Challenge networks, and 15-gene and 20-gene networks extracted from the GeneNetWeaver database. To study the effect of multiple datasets on the quality of the inferred networks, we have used four datasets in each experiment. The obtained results are encouraging enough as the proposed methodology can reduce the number of false predictions significantly, without using any supplementary prior biological information for larger gene regulatory networks. It is also observed that if a small amount of prior biological information is incorporated here, the results improve further w.r.t. the prediction of true positives.

  13. Adipose gene expression prior to weight loss can differentiate and weakly predict dietary responders.

    Directory of Open Access Journals (Sweden)

    David M Mutch

    BACKGROUND: The ability to identify obese individuals who will successfully lose weight in response to dietary intervention will revolutionize disease management. Therefore, we asked whether it is possible to identify subjects who will lose weight during dietary intervention using only a single gene expression snapshot. METHODOLOGY/PRINCIPAL FINDINGS: The present study involved 54 female subjects from the Nutrient-Gene Interactions in Human Obesity-Implications for Dietary Guidelines (NUGENOB) trial to determine whether subcutaneous adipose tissue gene expression could be used to predict weight loss prior to the 10-week consumption of a low-fat hypocaloric diet. Using several statistical tests revealed that the gene expression profiles of responders (8-12 kg weight loss) could always be differentiated from non-responders (<4 kg weight loss). We also assessed whether this differentiation was sufficient for prediction. Using a bottom-up (i.e. black-box) approach, standard class prediction algorithms were able to predict dietary responders with up to 61.1%+/-8.1% accuracy. Using a top-down approach (i.e. using differentially expressed genes to build a classifier) improved prediction accuracy to 80.9%+/-2.2%. CONCLUSION: Adipose gene expression profiling prior to the consumption of a low-fat diet is able to differentiate responders from non-responders as well as serve as a weak predictor of subjects destined to lose weight. While the degree of prediction accuracy currently achieved with a gene expression snapshot is perhaps insufficient for clinical use, this work reveals that the comprehensive molecular signature of adipose tissue paves the way for the future of personalized nutrition.

  14. Phylogenomic detection and functional prediction of genes potentially important for plant meiosis.

    Science.gov (United States)

    Zhang, Luoyan; Kong, Hongzhi; Ma, Hong; Yang, Ji

    2018-02-15

    Meiosis is a specialized type of cell division necessary for sexual reproduction in eukaryotes. A better understanding of the cytological procedures of meiosis has been achieved by comprehensive cytogenetic studies in plants, while the genetic mechanisms regulating meiotic progression remain incompletely understood. The increasing accumulation of complete genome sequences and large-scale gene expression datasets has provided a powerful resource for phylogenomic inference and unsupervised identification of genes involved in plant meiosis. By integrating sequence homology and expression data, 164, 131, 124 and 162 genes potentially important for meiosis were identified in the genomes of Arabidopsis thaliana, Oryza sativa, Selaginella moellendorffii and Pogonatum aloides, respectively. The predicted genes were assigned to 45 meiotic GO terms, and their functions were related to different processes occurring during meiosis in various organisms. Most of the predicted meiotic genes underwent lineage-specific duplication events during plant evolution, with about 30% of the predicted genes retaining only a single copy in higher plant genomes. The results of this study provided clues to design experiments for better functional characterization of meiotic genes in plants, promoting the phylogenomic approach to the evolutionary dynamics of the plant meiotic machineries. Copyright © 2017 Elsevier B.V. All rights reserved.

  15. The integration of weighted human gene association networks based on link prediction.

    Science.gov (United States)

    Yang, Jian; Yang, Tinghong; Wu, Duzhi; Lin, Limei; Yang, Fan; Zhao, Jing

    2017-01-31

    Physical and functional interplays between genes or proteins have important biological meaning for cellular functions. Some efforts have been made to construct weighted gene association meta-networks by integrating multiple biological resources, where the weight indicates the confidence of the interaction. However, it is found that these existing human gene association networks share only quite limited overlapped interactions, suggesting their incompleteness and noise. Here we proposed a workflow to construct a weighted human gene association network using information of six existing networks, including two weighted specific PPI networks and four gene association meta-networks. We applied link prediction algorithm to predict possible missing links of the networks, cross-validation approach to refine each network and finally integrated the refined networks to get the final integrated network. The common information among the refined networks increases notably, suggesting their higher reliability. Our final integrated network owns much more links than most of the original networks, meanwhile its links still keep high functional relevance. Being used as background network in a case study of disease gene prediction, the final integrated network presents good performance, implying its reliability and application significance. Our workflow could be insightful for integrating and refining existing gene association data.
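
    The link prediction step mentioned in this record can be illustrated with a short sketch. The abstract does not say which link prediction algorithm was used, so the Jaccard neighbourhood index below is a swapped-in, commonly used stand-in, and the function name and edge-list input format are assumptions made for illustration.

```python
from itertools import combinations

def jaccard_link_scores(edges):
    """Score every unlinked node pair by the Jaccard index of their
    neighbour sets; higher scores suggest a more plausible missing link."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already linked, nothing to predict
        shared = neighbors[u] & neighbors[v]
        if shared:
            union = neighbors[u] | neighbors[v]
            scores[(u, v)] = len(shared) / len(union)
    return scores
```

    For the toy network A-B, A-C, B-C, C-D, the unlinked pairs (A, D) and (B, D) each share the single neighbour C and receive a score of 0.5. In an integration workflow of the kind described, such scores could serve as candidate weights for predicted links before cross-validation.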

  16. Genomic Prediction and Association Mapping of Curd-Related Traits in Gene Bank Accessions of Cauliflower.

    Science.gov (United States)

    Thorwarth, Patrick; Yousef, Eltohamy A A; Schmid, Karl J

    2018-02-02

    Genetic resources are an important source of genetic variation for plant breeding. Genome-wide association studies (GWAS) and genomic prediction greatly facilitate the analysis and utilization of useful genetic diversity for improving complex phenotypic traits in crop plants. We explored the potential of GWAS and genomic prediction for improving curd-related traits in cauliflower (Brassica oleracea var. botrytis) by combining 174 randomly selected cauliflower gene bank accessions from two different gene banks. The collection was genotyped with genotyping-by-sequencing (GBS) and phenotyped for six curd-related traits at two locations and three growing seasons. A GWAS analysis based on 120,693 single-nucleotide polymorphisms identified a total of 24 significant associations for curd-related traits. The potential for genomic prediction was assessed with a genomic best linear unbiased prediction model and BayesB. Prediction abilities ranged from 0.10 to 0.66 for different traits and did not differ between prediction methods. Imputation of missing genotypes only slightly improved prediction ability. Our results demonstrate that GWAS and genomic prediction in combination with GBS and phenotyping of highly heritable traits can be used to identify useful quantitative trait loci and genotypes among genetically diverse gene bank material for subsequent utilization as genetic resources in cauliflower breeding. Copyright © 2018 Thorwarth et al.

  17. Genomic Prediction and Association Mapping of Curd-Related Traits in Gene Bank Accessions of Cauliflower

    Directory of Open Access Journals (Sweden)

    Patrick Thorwarth

    2018-02-01

    Genetic resources are an important source of genetic variation for plant breeding. Genome-wide association studies (GWAS) and genomic prediction greatly facilitate the analysis and utilization of useful genetic diversity for improving complex phenotypic traits in crop plants. We explored the potential of GWAS and genomic prediction for improving curd-related traits in cauliflower (Brassica oleracea var. botrytis) by combining 174 randomly selected cauliflower gene bank accessions from two different gene banks. The collection was genotyped with genotyping-by-sequencing (GBS) and phenotyped for six curd-related traits at two locations and three growing seasons. A GWAS analysis based on 120,693 single-nucleotide polymorphisms identified a total of 24 significant associations for curd-related traits. The potential for genomic prediction was assessed with a genomic best linear unbiased prediction model and BayesB. Prediction abilities ranged from 0.10 to 0.66 for different traits and did not differ between prediction methods. Imputation of missing genotypes only slightly improved prediction ability. Our results demonstrate that GWAS and genomic prediction in combination with GBS and phenotyping of highly heritable traits can be used to identify useful quantitative trait loci and genotypes among genetically diverse gene bank material for subsequent utilization as genetic resources in cauliflower breeding.

  18. Proteome Profiling Outperforms Transcriptome Profiling for Coexpression Based Gene Function Prediction

    Energy Technology Data Exchange (ETDEWEB)

    Wang, Jing; Ma, Zihao; Carr, Steven A.; Mertins, Philipp; Zhang, Hui; Zhang, Zhen; Chan, Daniel W.; Ellis, Matthew J. C.; Townsend, R. Reid; Smith, Richard D.; McDermott, Jason E.; Chen, Xian; Paulovich, Amanda G.; Boja, Emily S.; Mesri, Mehdi; Kinsinger, Christopher R.; Rodriguez, Henry; Rodland, Karin D.; Liebler, Daniel C.; Zhang, Bing

    2016-11-11

    Coexpression of mRNAs under multiple conditions is commonly used to infer cofunctionality of their gene products despite well-known limitations of this “guilt-by-association” (GBA) approach. Recent advancements in mass spectrometry-based proteomic technologies have enabled global expression profiling at the protein level; however, whether proteome profiling data can outperform transcriptome profiling data for coexpression based gene function prediction has not been systematically investigated. Here, we address this question by constructing and analyzing mRNA and protein coexpression networks for three cancer types with matched mRNA and protein profiling data from The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Our analyses revealed a marked difference in wiring between the mRNA and protein coexpression networks. Whereas protein coexpression was driven primarily by functional similarity between coexpressed genes, mRNA coexpression was driven by both cofunction and chromosomal colocalization of the genes. Functionally coherent mRNA modules were more likely to have their edges preserved in corresponding protein networks than functionally incoherent mRNA modules. Proteomic data strengthened the link between gene expression and function for at least 75% of Gene Ontology (GO) biological processes and 90% of KEGG pathways. A web application Gene2Net (http://cptac.gene2net.org) developed based on the three protein coexpression networks revealed novel gene-function relationships, such as linking ERBB2 (HER2) to lipid biosynthetic process in breast cancer, identifying PLG as a new gene involved in complement activation, and identifying AEBP1 as a new epithelial-mesenchymal transition (EMT) marker. Our results demonstrate that proteome profiling outperforms transcriptome profiling for coexpression based gene function prediction. Proteomics should be integrated if not preferred in gene function and human disease studies

  19. Gene expression variation to predict 10-year survival in lymph-node-negative breast cancer

    International Nuclear Information System (INIS)

    Karlsson, Elin; Delle, Ulla; Danielsson, Anna; Olsson, Björn; Abel, Frida; Karlsson, Per; Helou, Khalil

    2008-01-01

    It is of great significance to find better markers to correctly distinguish between high-risk and low-risk breast cancer patients since the majority of breast cancer cases are at present being overtreated. 46 tumours from node-negative breast cancer patients were studied with gene expression microarrays. A t-test was carried out in order to find a set of genes where the expression might predict clinical outcome. Two classifiers were used for evaluation of the gene lists, a correlation-based classifier and a Voting Features Interval (VFI) classifier. We then evaluated the predictive accuracy of this expression signature on tumour sets from two similar studies on lymph-node negative patients. They had both developed gene expression signatures superior to current methods in classifying node-negative breast tumours. These two signatures were also tested on our material. A list of 51 genes whose expression profiles could predict clinical outcome with high accuracy in our material (96% or 89% accuracy in cross-validation, depending on type of classifier) was developed. When tested on two independent data sets, the expression signature based on the 51 identified genes had good predictive qualities in one of the data sets (74% accuracy), whereas their predictive value on the other data set was poor, presumably due to the fact that only 23 of the 51 genes were found in that material. We also found that previously developed expression signatures could predict clinical outcome well to moderately well in our material (72% and 61%, respectively). The list of 51 genes derived in this study might have potential for clinical utility as a prognostic gene set, and may include candidate genes of potential relevance for clinical outcome in breast cancer. According to the predictions by this expression signature, 30 of the 46 patients may have benefited from different adjuvant treatment than they received. The research on these tumours was approved by the Medical Faculty Research
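
    The t-test based gene selection mentioned in this record can be sketched as follows. The abstract does not state which t-test variant or which cut-off was used, so Welch's unequal-variance statistic and a simple top-N cut are assumptions made for illustration, as are the function names and the input layout (one list of expression values per gene, with a parallel list of outcome labels).

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t-statistic for two lists of expression values.
    Each group needs at least two values and non-zero pooled variance."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

def rank_genes(expression, labels, top_n):
    """Rank genes by |t| between samples labelled True and False and
    return the names of the top_n genes."""
    scores = {}
    for gene, values in expression.items():
        good = [v for v, flag in zip(values, labels) if flag]
        poor = [v for v, flag in zip(values, labels) if not flag]
        scores[gene] = abs(welch_t(good, poor))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

    Selecting genes and evaluating a classifier on the same samples inflates accuracy estimates, so in practice the ranking would be repeated inside each cross-validation fold.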

  20. Predictive networks: a flexible, open source, web application for integration and analysis of human gene networks.

    Science.gov (United States)

    Haibe-Kains, Benjamin; Olsen, Catharina; Djebbari, Amira; Bontempi, Gianluca; Correll, Mick; Bouton, Christopher; Quackenbush, John

    2012-01-01

    Genomics provided us with an unprecedented quantity of data on the genes that are activated or repressed in a wide range of phenotypes. We have increasingly come to recognize that defining the networks and pathways underlying these phenotypes requires both the integration of multiple data types and the development of advanced computational methods to infer relationships between the genes and to estimate the predictive power of the networks through which they interact. To address these issues we have developed Predictive Networks (PN), a flexible, open-source, web-based application and data services framework that enables the integration, navigation, visualization and analysis of gene interaction networks. The primary goal of PN is to allow biomedical researchers to evaluate experimentally derived gene lists in the context of large-scale gene interaction networks. The PN analytical pipeline involves two key steps. The first is the collection of a comprehensive set of known gene interactions derived from a variety of publicly available sources. The second is to use these 'known' interactions together with gene expression data to infer robust gene networks. The PN web application is accessible from http://predictivenetworks.org. The PN code base is freely available at https://sourceforge.net/projects/predictivenets/.

  1. Can Thrifty Gene(s) or Predictive Fetal Programming for Thriftiness Lead to Obesity?

    Directory of Open Access Journals (Sweden)

    Ulfat Baig

    2011-01-01

    Obesity and related disorders are thought to have their roots in metabolic “thriftiness” that evolved to combat periodic starvation. The association of low birth weight with obesity in later life caused a shift in the concept from thrifty gene to thrifty phenotype or anticipatory fetal programming. The assumption of thriftiness is implicit in obesity research. We examine here, with the help of a mathematical model, the conditions for evolution of thrifty genes or fetal programming for thriftiness. The model suggests that a thrifty gene cannot exist in a stable polymorphic state in a population. The conditions for evolution of thrifty fetal programming are restricted if the correlation between intrauterine and lifetime conditions is poor. Such a correlation is not observed in natural courses of famine. If there is fetal programming for thriftiness, it could have evolved in anticipation of social factors affecting nutrition that can result in a positive correlation.

  2. EvoCor: a platform for predicting functionally related genes using phylogenetic and expression profiles.

    Science.gov (United States)

    Dittmar, W James; McIver, Lauren; Michalak, Pawel; Garner, Harold R; Valdez, Gregorio

    2014-07-01

    The wealth of publicly available gene expression and genomic data provides unique opportunities for computational inference to discover groups of genes that function to control specific cellular processes. Such genes are likely to have co-evolved and be expressed in the same tissues and cells. Unfortunately, the expertise and computational resources required to compare tens of genomes and gene expression data sets make this type of analysis difficult for the average end-user. Here, we describe the implementation of a web server that predicts genes involved in affecting specific cellular processes together with a gene of interest. We termed the server 'EvoCor', to denote that it detects functional relationships among genes through evolutionary analysis and gene expression correlation. This web server integrates profiles of sequence divergence derived by a Hidden Markov Model (HMM) and tissue-wide gene expression patterns to determine putative functional linkages between pairs of genes. This server is easy to use and freely available at http://pilot-hmm.vbi.vt.edu/. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  3. A gene signature in histologically normal surgical margins is predictive of oral carcinoma recurrence

    International Nuclear Information System (INIS)

    Reis, Patricia P; Simpson, Colleen; Goldstein, David; Brown, Dale; Gilbert, Ralph; Gullane, Patrick; Irish, Jonathan; Jurisica, Igor; Kamel-Reid, Suzanne; Waldron, Levi; Perez-Ordonez, Bayardo; Pintilie, Melania; Galloni, Natalie Naranjo; Xuan, Yali; Cervigne, Nilva K; Warner, Giles C; Makitie, Antti A

    2011-01-01

    Oral Squamous Cell Carcinoma (OSCC) is a major cause of cancer death worldwide, which is mainly due to recurrence leading to treatment failure and patient death. Histological status of surgical margins is a currently available assessment for recurrence risk in OSCC; however histological status does not predict recurrence, even in patients with histologically negative margins. Therefore, molecular analysis of histologically normal resection margins and the corresponding OSCC may aid in identifying a gene signature predictive of recurrence. We used a meta-analysis of 199 samples (OSCCs and normal oral tissues) from five public microarray datasets, in addition to our microarray analysis of 96 OSCCs and histologically normal margins from 24 patients, to train a gene signature for recurrence. Validation was performed by quantitative real-time PCR using 136 samples from an independent cohort of 30 patients. We identified 138 significantly over-expressed genes (> 2-fold, false discovery rate of 0.01) in OSCC. By penalized likelihood Cox regression, we identified a 4-gene signature with prognostic value for recurrence in our training set. This signature comprised the invasion-related genes MMP1, COL4A1, P4HA2, and THBS2. Over-expression of this 4-gene signature in histologically normal margins was associated with recurrence in our training cohort (p = 0.0003, logrank test) and in our independent validation cohort (p = 0.04, HR = 6.8, logrank test). Gene expression alterations occur in histologically normal margins in OSCC. Over-expression of the 4-gene signature in histologically normal surgical margins was validated and highly predictive of recurrence in an independent patient cohort. Our findings may be applied to develop a molecular test, which would be clinically useful to help predict which patients are at a higher risk of local recurrence

  4. MITEs in the promoters of effector genes allow prediction of novel virulence genes in Fusarium oxysporum

    NARCIS (Netherlands)

    Schmidt, S.M.; Houterman, P.M.; Schreiver, I.; Ma, L.; Amyotte, S.; Chellappan, B.; Boeren, S.; Takken, F.L.W.; Rep, M.

    2013-01-01

    Background: The plant-pathogenic fungus Fusarium oxysporum f. sp. lycopersici (Fol) has accessory, lineage-specific (LS) chromosomes that can be transferred horizontally between strains. A single LS chromosome in the Fol4287 reference strain harbors all known Fol effector genes. Transfer of this

  5. Gene prediction in metagenomic fragments: A large scale machine learning approach

    Directory of Open Access Journals (Sweden)

    Morgenstern Burkhard

    2008-04-01

    Background: Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that are different from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions. Results: We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that this open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability. Conclusion: Large scale machine learning methods are well-suited for gene
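
    Two of the raw inputs named in this record, monocodon usage and GC-content, can be computed as in the sketch below. This covers only feature extraction; the linear discriminants and the neural network of the two-stage method are not reproduced, and the function names are hypothetical.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in a DNA fragment (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def codon_usage(orf):
    """Relative frequency of each codon in an in-frame ORF.
    Trailing bases that do not form a complete codon are ignored."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}
```

    For the toy ORF ATGAAAAAATAG, codon_usage returns 0.25 for ATG, 0.5 for AAA and 0.25 for TAG, and gc_content returns 2/12. Dicodon usage could be obtained in the same way by counting adjacent codon pairs.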

  6. Probability-based collaborative filtering model for predicting gene-disease associations.

    Science.gov (United States)

    Zeng, Xiangxiang; Ding, Ningxiang; Rodríguez-Patón, Alfonso; Zou, Quan

    2017-12-28

    Accurately predicting pathogenic human genes has been challenging in recent research. Considering extensive gene-disease data verified by biological experiments, we can apply computational methods to perform accurate predictions with reduced time and expenses. We propose a probability-based collaborative filtering model (PCFM) to predict pathogenic human genes. Several kinds of data sets, containing data of humans and data of other nonhuman species, are integrated in our model. Firstly, on the basis of a typical latent factorization model, we propose model I with an average heterogeneous regularization. Secondly, we develop modified model II with personal heterogeneous regularization to enhance the accuracy of aforementioned models. In this model, vector space similarity or Pearson correlation coefficient metrics and data on related species are also used. We compared the results of PCFM with the results of four state-of-arts approaches. The results show that PCFM performs better than other advanced approaches. PCFM model can be leveraged for predictions of disease genes, especially for new human genes or diseases with no known relationships.

  7. Computational prediction and experimental validation of Ciona intestinalis microRNA genes

    Directory of Open Access Journals (Sweden)

    Pasquinelli Amy E

    2007-11-01

    Background: This study reports the first collection of validated microRNA genes in the sea squirt, Ciona intestinalis. MicroRNAs are processed from hairpin precursors to ~22 nucleotide RNAs that base pair to target mRNAs and inhibit expression. As a member of the subphylum Urochordata (Tunicata) whose larval form has a notochord, the sea squirt is situated at the emergence of vertebrates, and therefore may provide information about the evolution of molecular regulators of early development. Results: In this study, computational methods were used to predict 14 microRNA gene families in Ciona intestinalis. The microRNA prediction algorithm utilizes configurable microRNA sequence conservation and stem-loop specificity parameters, grouping by miRNA family, and phylogenetic conservation to the related species, Ciona savignyi. Expression of 8 of the 9 putative microRNAs tested in the adult tissue of Ciona intestinalis was validated by Northern blot analyses. Additionally, a target prediction algorithm was implemented, which identified a high confidence list of 240 potential target genes. Over half of the predicted targets can be grouped into the gene ontology categories of metabolism, transport, regulation of transcription, and cell signaling. Conclusion: The computational techniques implemented in this study can be applied to other organisms and serve to increase the understanding of the origins of non-coding RNAs, embryological and cellular developmental pathways, and the mechanisms for microRNA-controlled gene regulatory networks.

  8. An ensemble method to predict target genes and pathways in uveal melanoma

    Directory of Open Access Journals (Sweden)

    Wei Chao

    2018-04-01

    This work proposes to predict target genes and pathways for uveal melanoma (UM) based on an ensemble method and pathway analyses. Methods: The ensemble method integrated a correlation method (Pearson correlation coefficient, PCC), a causal inference method (IDA) and a regression method (Lasso) utilizing the Borda count election method. Subsequently, to validate the performance of the PIL method, comparisons between a database of confirmed interactions and the predicted miRNA targets were performed. Ultimately, pathway enrichment analysis was conducted on target genes in the top 1,000 miRNA-mRNA interactions to identify target pathways for UM patients. Results: Thirty-eight of the predicted interactions were matched with the confirmed interactions, indicating that the ensemble method was a suitable and feasible approach to predict miRNA targets. We obtained 50 seed miRNA-mRNA interactions of UM patients and extracted target genes from these interactions, such as ASPG, BSDC1 and C4BP. The 601 target genes in the top 1,000 miRNA-mRNA interactions were enriched in 12 target pathways, of which Phototransduction was the most significant one. Conclusion: The target genes and pathways might provide a new way to reveal the molecular mechanism of UM and support targeted treatment and prevention of this malignant tumor.
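
    The Borda count step named in this abstract can be illustrated with a minimal sketch. The abstract does not give the scoring details, so the point scheme below (n - 1 points for the top rank, 0 for the last), the tie-break rule, and the interaction labels are all assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine several best-first rankings of the same items with a Borda count."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # Assumed scheme: the top-ranked item earns n - 1 points, the last earns 0.
            scores[item] += n - 1 - position
    # Highest total first; ties are broken alphabetically so the result is reproducible.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical miRNA-mRNA interactions, ranked best-first by three methods.
pcc_rank = ["miR-a:G1", "miR-b:G2", "miR-c:G3"]
ida_rank = ["miR-b:G2", "miR-a:G1", "miR-c:G3"]
lasso_rank = ["miR-a:G1", "miR-c:G3", "miR-b:G2"]

consensus = borda_aggregate([pcc_rank, ida_rank, lasso_rank])
print(consensus)
```

    In this toy input the first interaction collects 5 points, the second 3 and the third 1, so the consensus order follows the majority of the three rankings rather than any single method.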

  9. Primate-specific spliced PMCHL RNAs are non-protein coding in human and macaque tissues

    Directory of Open Access Journals (Sweden)

    Delerue-Audegond Audrey

    2008-12-01

    Background: Brain-expressed genes that were created in the primate lineage are obvious candidates for investigating the molecular mechanisms that contributed to neural reorganization and the emergence of new behavioural functions in Homo sapiens. PMCHL1 arose from retroposition of a pro-melanin-concentrating hormone (PMCH) antisense mRNA on the ancestral human chromosome 5p14 when platyrrhines and catarrhines diverged. Mutations before divergence of hylobatidae led to creation of new exons and finally PMCHL1 duplicated in an ancestor of hominids to generate PMCHL2 at the human chromosome 5q13. A complex pattern of spliced and unspliced PMCHL RNAs was found in human brain and testis. Results: Several novel spliced PMCHL transcripts have been characterized in human testis and fetal brain, identifying an additional exon and novel splice sites. Sequencing of PMCHL genes in several non-human primates allowed phylogenetic analyses, which revealed that the initial retroposition event took place within an intron of the brain cadherin (CDH12) gene, soon after platyrrhine/catarrhine divergence, i.e. 30–35 Mya, and was concomitant with the insertion of an AluSg element. Sequence analysis of the spliced PMCHL transcripts identified only short ORFs of less than 300 bp, with low (VMCH-p8 and protein variants) or no evolutionary conservation. Western blot analyses of human and macaque tissues expressing PMCHL RNA failed to reveal any protein corresponding to VMCH-p8 and protein variants encoded by spliced transcripts. Conclusion: Our present results improve our knowledge of the gene structure and the evolutionary history of the primate-specific chimeric PMCHL genes. These genes produce multiple spliced transcripts, bearing short, non-conserved and apparently non-translated ORFs that may function as mRNA-like non-coding RNAs.

  10. CAsubtype: An R Package to Identify Gene Sets Predictive of Cancer Subtypes and Clinical Outcomes.

    Science.gov (United States)

    Kong, Hualei; Tong, Pan; Zhao, Xiaodong; Sun, Jielin; Li, Hua

    2018-03-01

    In the past decade, molecular classification of cancer has gained high popularity owing to its high predictive power on clinical outcomes as compared with traditional methods commonly used in clinical practice. In particular, using gene expression profiles, recent studies have successfully identified a number of gene sets for the delineation of cancer subtypes that are associated with distinct prognosis. However, identification of such gene sets remains a laborious task due to the lack of tools with flexibility, integration and ease of use. To reduce the burden, we have developed an R package, CAsubtype, to efficiently identify gene sets predictive of cancer subtypes and clinical outcomes. By integrating more than 13,000 annotated gene sets, CAsubtype provides a comprehensive repertoire of candidates for new cancer subtype identification. For easy data access, CAsubtype further includes the gene expression and clinical data of more than 2000 cancer patients from TCGA. CAsubtype first employs principal component analysis to identify gene sets (from user-provided or package-integrated ones) with robust principal components representing significantly large variation between cancer samples. Based on these principal components, CAsubtype visualizes the sample distribution in low-dimensional space for better understanding of the distinction between samples and classifies samples into subgroups with prevalent clustering algorithms. Finally, CAsubtype performs survival analysis to compare the clinical outcomes between the identified subgroups, assessing their clinical value as potentially novel cancer subtypes. In conclusion, CAsubtype is a flexible and well-integrated tool in the R environment to identify gene sets for cancer subtype identification and clinical outcome prediction. 
Its simple R commands and comprehensive data sets enable efficient examination of the clinical value of any given gene set, thus facilitating hypothesis generating and testing in biological and

  11. CpG + CpNpG Analysis of Protein-Coding Sequences from Tomato

    DEFF Research Database (Denmark)

    Hobolth, Asger; Nielsen, Rasmus; Wang, Ying

    2006-01-01

    We develop codon-based models for simultaneously inferring the mutational effects of CpG and CpNpG methylation in coding regions. In a data set of 369 tomato genes, we show that there is very little effect of CpNpG methylation but a strong effect of CpG methylation affecting almost all genes. We further show that the CpNpG and CpG effects are largely uncorrelated. Our results suggest different roles of CpG and CpNpG methylation, with CpNpG methylation possibly playing a specialized role in defense against transposons and RNA viruses.

  12. Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets

    Directory of Open Access Journals (Sweden)

    Karacali Bilge

    2007-10-01

    Background: Independently derived expression profiles of the same biological condition often have few genes in common. In this study, we created populations of expression profiles from publicly available microarray datasets of cancer (breast, lymphoma and renal) samples linked to clinical information with an iterative machine learning algorithm. ROC curves were used to assess the prediction error of each profile for classification. We compared the prediction error of profiles correlated with molecular phenotype against profiles correlated with relapse-free status. Prediction error of profiles identified with supervised univariate feature selection algorithms was compared to that of profiles selected randomly from (a) all genes on the microarray platform and (b) a list of known disease-related genes (a priori selection). We also determined the relevance of expression profiles on test arrays from independent datasets, measured on either the same or different microarray platforms. Results: Highly discriminative expression profiles were produced on both simulated gene expression data and expression data from breast cancer and lymphoma datasets on the basis of ER and BCL-6 expression, respectively. Use of relapse-free status to identify profiles for prognosis prediction resulted in poorly discriminative decision rules. Supervised feature selection resulted in more accurate classifications than random or a priori selection; however, the difference in prediction error decreased as the number of features increased. These results held when decision rules were applied across datasets to samples profiled on the same microarray platform. Conclusion: Our results show that many gene sets predict molecular phenotypes accurately. Given this, expression profiles identified using different training datasets should be expected to show little agreement. 
In addition, we demonstrate the difficulty in predicting relapse directly from microarray data using supervised machine

  13. Entropy-based gene ranking without selection bias for the predictive classification of microarray data

    Directory of Open Access Journals (Sweden)

    Serafini Maria

    2003-11-01

    Background: We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process). Results: With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles. Conclusions: Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.

  14. Minimal gene selection for classification and diagnosis prediction based on gene expression profile

    Directory of Open Access Journals (Sweden)

    Alireza Mehridehnavi

    2013-01-01

    Conclusion: We have shown that using the two most significant genes, ranked by their S/N ratios, together with a suitable selection of training samples, can classify DLBCL patients with fairly good results. With these methods we could compensate for the small number of patients, improve classification accuracy, and reduce computational complexity and therefore running time.

  15. Gene expression prediction by soft integration and the elastic net-best performance of the DREAM3 gene expression challenge.

    Directory of Open Access Journals (Sweden)

    Mika Gustafsson

    BACKGROUND: Predicting gene expression is an important endeavour within computational systems biology. It can be a way to explore how drugs affect the system, as well as a framework for finding which genes are interrelated in a certain process. A practical problem, however, is how to assess and discriminate among the various algorithms which have been developed for this purpose. Therefore, in 2008 the DREAM project issued a challenge for predicting gene expression values, and here we present the algorithm with the best performance. METHODOLOGY/PRINCIPAL FINDINGS: We develop an algorithm by exploring various regression schemes with different model selection procedures. It turns out that the most effective scheme is based on least squares, with a penalty term of a recently developed form called the "elastic net". Key components in the algorithm are the integration of expression data from experimental conditions other than those presented for the challenge and the utilization of transcription factor binding data for guiding the inference process towards known interactions. Also of importance is a cross-validation procedure in which each form of external data is used only to the extent that it increases the expected performance. CONCLUSIONS/SIGNIFICANCE: Our algorithm demonstrates both the possibility of extracting information from large-scale expression data for the prediction of gene levels and the benefits of integrating different data sources for improving the inference. We believe the former is an important message to those still doubtful about the possibilities of computational approaches, while the latter is part of an important way forward for the future development of the field of computational systems biology.

  16. PROTEIN-CODING GENES AS MOLECULAR MARKERS FOR ECOLOGICALLY DISTINCT POPULATIONS: THE CASE OF TWO BACILLUS SPECIES. (R825348)

    Science.gov (United States)

    The perspectives, information and conclusions conveyed in research project abstracts, progress reports, final reports, journal abstracts and journal publications convey the viewpoints of the principal investigator and may not represent the views and policies of ORD and EPA. Concl...

  17. A 65‑gene signature for prognostic prediction in colon adenocarcinoma.

    Science.gov (United States)

    Jiang, Hui; Du, Jun; Gu, Jiming; Jin, Liugen; Pu, Yong; Fei, Bojian

    2018-04-01

    The aim of the present study was to examine the molecular factors associated with the prognosis of colon cancer. Gene expression datasets were downloaded from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus databases to screen differentially expressed genes (DEGs) between colon cancer samples and normal samples. Survival‑related genes were selected from the DEGs using the Cox regression method. A co‑expression network of survival‑related genes was then constructed, and functional clusters were extracted from this network. The significantly enriched functions and pathways of the genes in the network were identified. Using Bayesian discriminant analysis, a prognostic prediction system was established to distinguish the positive from negative prognostic samples. The discrimination efficacy of the system was validated in the GSE17538 dataset using Kaplan‑Meier survival analysis. A total of 636 and 1,892 DEGs between the colon cancer samples and normal samples were screened from the TCGA and GSE44861 dataset, respectively. There were 155 survival‑related genes selected. The co‑expression network of survival‑related genes included 138 genes, 534 lines (connections) and five functional clusters, including the signaling pathway, cellular response to cAMP, and immune system process functional clusters. The molecular function, cellular components and biological processes were the significantly enriched functions. The peroxisome proliferator‑activated receptor signaling pathway, Wnt signaling pathway, B cell receptor signaling pathway, and cytokine‑cytokine receptor interactions were the significant pathways. A prognostic prediction system based on a 65‑gene signature was established using this co‑expression network. Its discriminatory effect was validated in the TCGA dataset (P=3.56e‑12) and the GSE17538 dataset (P=1.67e‑6). The 65‑gene signature included kallikrein‑related peptidase 6 (KLK6), collagen type XI α1 (COL11A1), cartilage

  18. The Development of Three Long Universal Nuclear Protein-Coding Locus Markers and Their Application to Osteichthyan Phylogenetics with Nested PCR

    Science.gov (United States)

    Zhang, Peng

    2012-01-01

    Background Universal nuclear protein-coding locus (NPCL) markers that are applicable across diverse taxa and show good phylogenetic discrimination have broad applications in molecular phylogenetic studies. For example, RAG1, a representative NPCL marker, has been successfully used to make phylogenetic inferences within all major osteichthyan groups. However, such markers with broad working range and high phylogenetic performance are still scarce. It is necessary to develop more universal NPCL markers comparable to RAG1 for osteichthyan phylogenetics. Methodology/Principal Findings We developed three long universal NPCL markers (>1.6 kb each) based on single-copy nuclear genes (KIAA1239, SACS and TTN) that possess large exons and exhibit the appropriate evolutionary rates. We then compared their phylogenetic utilities with that of the reference marker RAG1 in 47 jawed vertebrate species. In comparison with RAG1, each of the three long universal markers yielded similar topologies and branch supports, all in congruence with the currently accepted osteichthyan phylogeny. To compare their phylogenetic performance visually, we also estimated the phylogenetic informativeness (PI) profile for each of the four long universal NPCL markers. The PI curves indicated that SACS performed best over the whole timescale, while RAG1, KIAA1239 and TTN exhibited similar phylogenetic performances. In addition, we compared the success of nested PCR and standard PCR when amplifying NPCL marker fragments. The amplification success rate and efficiency of the nested PCR were overwhelmingly higher than those of standard PCR. Conclusions/Significance Our work clearly demonstrates the superiority of nested PCR over the conventional PCR in phylogenetic studies and develops three long universal NPCL markers (KIAA1239, SACS and TTN) with the nested PCR strategy. The three markers exhibit high phylogenetic utilities in osteichthyan phylogenetics and can be widely used as pilot genes for

  19. The predictive value of the 70-gene signature for adjuvant chemotherapy in early breast cancer

    NARCIS (Netherlands)

    Knauer, Michael; Mook, Stella; Rutgers, Emiel J. T.; Bender, Richard A.; Hauptmann, Michael; van de Vijver, Marc J.; Koornstra, Rutger H. T.; Bueno-de-Mesquita, Jolien M.; Linn, Sabine C.; van 't Veer, Laura J.

    2010-01-01

    Multigene assays have been developed and validated to determine the prognosis of breast cancer. In this study, we assessed the additional predictive value of the 70-gene MammaPrint signature for chemotherapy (CT) benefit in addition to endocrine therapy (ET) from pooled study series. For 541

  20. [The value of 5-HTT gene polymorphism for the assessment and prediction of male adolescence violence].

    Science.gov (United States)

    Yu, Yue; Liu, Xiang; Yang, Zhen-xing; Qiu, Chang-jian; Ma, Xiao-hong

    2012-08-01

    To establish an adolescent violent crime prediction model, and to assess the value of serotonin transporter (5-HTT) gene polymorphism for the assessment and prediction of violent crime. Investigative tools were used to analyze the difference in personality dimensions, social support, coping styles, aggressiveness, impulsivity, and family condition scale between 223 adolescents with violent behavior and 148 adolescents without violent behavior. The distribution of 5-HTT gene polymorphisms (5-HTTLPR and 5-HTTVNTR) was compared between the two groups. The role of 5-HTT gene polymorphism in adolescent personality, impulsivity and aggression scales was also analyzed. Stepwise logistic regression was used to establish a predictive model for adolescent violent crime. Significant differences were found between the violence group and the control group on multiple dimensions of the psychology and environment scales. However, no statistical difference was found with regard to the 5-HTT genotypes and alleles between adolescents with violent behaviors and normal controls. The rate of prediction accuracy was not significantly improved when 5-HTT gene polymorphism was taken into the model. The violent crime of adolescents was closely related to social and environmental factors. No association was found between 5-HTT polymorphisms and adolescent violent criminal behavior.

  1. Predicting gene regulatory networks of soybean nodulation from RNA-Seq transcriptome data.

    Science.gov (United States)

    Zhu, Mingzhu; Dahmen, Jeremy L; Stacey, Gary; Cheng, Jianlin

    2013-09-22

    High-throughput RNA sequencing (RNA-Seq) is a revolutionary technique to study the transcriptome of a cell under various conditions at a systems level. Despite the wide application of RNA-Seq techniques to generate experimental data in the last few years, few computational methods are available to analyze this huge amount of transcription data. The computational methods for constructing gene regulatory networks from RNA-Seq expression data of hundreds or even thousands of genes are particularly lacking and urgently needed. We developed an automated bioinformatics method to predict gene regulatory networks from the quantitative expression values of differentially expressed genes based on RNA-Seq transcriptome data of a cell in different stages and conditions, integrating transcriptional, genomic and gene function data. We applied the method to the RNA-Seq transcriptome data generated for soybean root hair cells in three different development stages of nodulation after rhizobium infection. The method predicted a soybean nodulation-related gene regulatory network consisting of 10 regulatory modules common for all three stages, and 24, 49 and 70 modules separately for the first, second and third stage, each containing both a group of co-expressed genes and several transcription factors collaboratively controlling their expression under different conditions. 8 of 10 common regulatory modules were validated by at least two kinds of validations, such as independent DNA binding motif analysis, gene function enrichment test, and previous experimental data in the literature. We developed a computational method to reliably reconstruct gene regulatory networks from RNA-Seq transcriptome data. The method can generate valuable hypotheses for interpreting biological data and designing biological experiments such as ChIP-Seq, RNA interference, and yeast two hybrid experiments.

  2. Coregulation of terpenoid pathway genes and prediction of isoprene production in Bacillus subtilis using transcriptomics

    Energy Technology Data Exchange (ETDEWEB)

    Hess, Becky M.; Xue, Junfeng; Markillie, Lye Meng; Taylor, Ronald C.; Wiley, H. S.; Ahring, Birgitte K.; Linggi, Bryan E.

    2013-06-19

    The isoprenoid pathway converts pyruvate to isoprene and related isoprenoid compounds in plants and some bacteria. Currently, this pathway is of great interest because of the critical role that isoprenoids play in basic cellular processes as well as the industrial value of metabolites such as isoprene. Although the regulation of several pathway genes has been described, there is a paucity of information regarding the system level regulation and control of the pathway. To address this limitation, we examined Bacillus subtilis grown under multiple conditions and then determined the relationship between altered isoprene production and the pattern of gene expression. We found that terpenoid genes appeared to fall into two distinct subsets with opposing correlations with respect to the amount of isoprene produced. The group whose expression levels positively correlated with isoprene production included dxs, the gene responsible for the commitment step in the pathway, as well as ispD, and two genes that participate in the mevalonate pathway, yhfS and pksG. The subset of terpenoid genes that inversely correlated with isoprene production included ispH, ispF, hepS, uppS, ispE, and dxr. A genome wide partial least squares regression model was created to identify other genes or pathways that contribute to isoprene production. This analysis showed that a subset of 213 regulated genes was sufficient to create a predictive model of isoprene production under different conditions and showed correlations at the transcriptional level. We conclude that gene expression levels alone are sufficiently informative about the metabolic state of a cell that produces increased isoprene and can be used to build a model which accurately predicts production of this secondary metabolite across many simulated environmental conditions.

  3. PHENOstruct: Prediction of human phenotype ontology terms using heterogeneous data sources [v1; ref status: indexed, http://f1000r.es/5j2]

    Directory of Open Access Journals (Sweden)

    Indika Kahanda

    2015-07-01

    The human phenotype ontology (HPO) was recently developed as a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein-coding genes have HPO annotations, but researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. In this work we demonstrate the performance advantage of the structured SVM approach, which was shown to be highly effective for Gene Ontology term prediction, in comparison to several baseline methods. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large-scale literature mining data.

  4. HOX Gene Promoter Prediction and Inter-genomic Comparison: An Evo-Devo Study

    Directory of Open Access Journals (Sweden)

    Marla A. Endriga

    2010-10-01

    Homeobox genes direct the anterior-posterior axis of the body plan in eukaryotic organisms. Promoter regions upstream of the Hox genes jumpstart the transcription process. CpG islands found within the promoter regions can cause silencing of these promoters. The locations of the promoter regions and the CpG islands of Homo sapiens sapiens (human), Pan troglodytes (chimpanzee), Mus musculus (mouse), and Rattus norvegicus (brown rat) are compared and related to the possible influence on the specification of the mammalian body plan. The sequences of each gene in Hox clusters A-D of the mammals considered were retrieved from Ensembl, and the locations of promoter regions and CpG islands were predicted using Exon Finder. The predicted promoter sequences were confirmed via BLAST and verified against the Eukaryotic Promoter Database. The significance of the locations was determined using the Kruskal-Wallis test. Among the four clusters, only promoter locations in cluster B showed a significant difference. HOX B genes have been linked with the control of genes that direct the development of axial morphology, particularly of the vertebral column bones. The magnitude of variation among the body plans of closely related species can thus be partially attributed to the promoter kind, location and number, and gene inactivation via CpG methylation.

  5. Effectiveness of gene expression profiling for response prediction of rectal cancer to preoperative radiotherapy

    International Nuclear Information System (INIS)

    Ojima, Eiki; Inoue, Yasuhiro; Miki, Chikao; Kusunoki, Masato; Mori, Masaki

    2007-01-01

    Our aim was to determine whether the expression levels of specific genes could predict clinical radiosensitivity in human colorectal cancer. Radioresistant colorectal cancer cell lines were established by repeated X-ray exposure (total, 100 Gy), and the gene expressions of the parent and radioresistant cell lines were compared in a microarray analysis. To verify the microarray data, we carried out a reverse transcriptase-polymerase chain reaction analysis of identified genes in clinical samples from 30 irradiated rectal cancer patients. A comparison of the intensity data for the parent and three radioresistant cell lines revealed 17 upregulated and 142 downregulated genes in all radioresistant cell lines. Next, we focused on two upregulated genes, PTMA (prothymosin α) and EIF5a2 (eukaryotic translation initiation factor 5A), in the radioresistant cell lines. In clinical samples, the expression of PTMA was significantly higher in the minor effect group than in the major effect group (P=0.004), but there were no significant differences in EIF5a2 expression between the two groups. We identified radiation-related genes in colorectal cancer and demonstrated that PTMA may play an important role in radiosensitivity. Our findings suggest that PTMA may be a novel marker for predicting the effectiveness of radiotherapy in clinical cases. (author)

  6. Paired hormone response elements predict caveolin-1 as a glucocorticoid target gene.

    Directory of Open Access Journals (Sweden)

    Marinus F van Batenburg

    2010-01-01

    Glucocorticoids act in part via glucocorticoid receptor binding to hormone response elements (HREs), but their direct target genes in vivo are still largely unknown. We developed the criterion that genomic occurrence of paired HREs at an inter-HRE distance of less than 200 bp predicts hormone responsiveness, based on synergy of multiple HREs and HRE information from known target genes. This criterion predicts a substantial number of novel responsive genes when applied to genomic regions 10 kb upstream of genes. Multiple-tissue in situ hybridization showed that mRNA expression of 6 out of 10 selected genes was induced in a tissue-specific manner in mice treated with a single dose of corticosterone, with the spleen being the most responsive organ. Caveolin-1 was strongly responsive in several organs, and the HRE pair in its upstream region showed increased occupancy by glucocorticoid receptor in response to corticosterone. Our approach allowed for discovery of novel tissue-specific glucocorticoid target genes, which may exemplify responses underlying the permissive actions of glucocorticoids.
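
    The paired-HRE criterion described in this abstract (two HREs less than 200 bp apart within 10 kb upstream of a gene) can be sketched as a simple position filter. This is only an illustration under stated assumptions: HRE hits are assumed to be already detected and given as single coordinates (distance in bp upstream of the gene), the distance is measured between those coordinates, and the gene names and function names are hypothetical.

```python
def paired_sites(positions, max_gap=200):
    """Return adjacent site pairs whose spacing is strictly below max_gap (bp)."""
    ordered = sorted(positions)
    # Checking adjacent sites is enough: if any two sites are closer than
    # max_gap, the sites adjacent in sorted order between them are as well.
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a < max_gap]


def predicted_responsive(upstream_hits, max_gap=200, window=10_000):
    """Map each gene to its HRE pairs inside the upstream window.

    upstream_hits: dict of gene name -> list of HRE positions, expressed as
    distances (bp) upstream of the gene (an assumed input format).
    Genes without any qualifying pair are left out of the result.
    """
    result = {}
    for gene, hits in upstream_hits.items():
        in_window = [p for p in hits if 0 <= p <= window]
        pairs = paired_sites(in_window, max_gap)
        if pairs:
            result[gene] = pairs
    return result


# Hypothetical hit positions for three genes.
hits = {
    "geneA": [1200, 1350, 8000],  # 1200 and 1350 are 150 bp apart
    "geneB": [500, 4000],         # no close pair
    "geneC": [9900, 10050],       # second site lies outside the 10 kb window
}
predicted = predicted_responsive(hits)
print(predicted)
```

    Only the first gene passes the filter here, because its two nearest sites are 150 bp apart and both fall inside the upstream window.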

  7. Prediction of operon-like gene clusters in the Arabidopsis thaliana genome based on co-expression analysis of neighboring genes.

    Science.gov (United States)

    Wada, Masayoshi; Takahashi, Hiroki; Altaf-Ul-Amin, Md; Nakamura, Kensuke; Hirai, Masami Y; Ohta, Daisaku; Kanaya, Shigehiko

    2012-07-15

    Operon-like arrangements of genes occur in eukaryotes ranging from yeasts and filamentous fungi to nematodes, plants, and mammals. In plants, several examples of operon-like gene clusters involved in metabolic pathways have recently been characterized, e.g. the cyclic hydroxamic acid pathways in maize, the avenacin biosynthesis gene clusters in oat, the thalianol pathway in Arabidopsis thaliana, and the diterpenoid momilactone cluster in rice. Such operon-like gene clusters are defined by their co-regulation or neighboring positions within the immediate vicinity of chromosomal regions. A comprehensive analysis of the expression of neighboring genes is therefore a crucial step in revealing the complete set of operon-like gene clusters within a genome. Genome-wide prediction of operon-like gene clusters should contribute to functional annotation efforts and also provide novel insight into evolutionary aspects of acquiring certain biological functions. We predicted co-expressed gene clusters by comparing the Pearson correlation coefficient of neighboring genes and randomly selected gene pairs, based on a statistical method that takes false discovery rate (FDR) into consideration, for 1469 microarray gene expression datasets of A. thaliana. We estimated that A. thaliana contains 100 operon-like gene clusters in total. We predicted 34 statistically significant gene clusters consisting of 3 to 22 genes each, based on a stringent FDR threshold of 0.1. Functional relationships among genes in individual clusters were estimated by sequence similarity and functional annotation of genes. Duplicated gene pairs (determined based on BLAST with a cutoff of E... Operon-like clusters tend to include genes encoding bio-machinery associated with ribosomes, the ubiquitin/proteasome system, secondary metabolic pathways, lipid and fatty-acid metabolism, and the lipid transfer system. Copyright © 2012 Elsevier B.V. All rights reserved.
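
    The comparison at the heart of this abstract, Pearson correlation between chromosomal neighbors versus randomly chosen gene pairs, can be sketched as follows. This is a toy illustration, not the authors' pipeline: the FDR-based significance step is omitted, the expression profiles are invented, and the function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def neighbor_correlations(profiles):
    """Correlation of each gene with the next one; profiles are in chromosomal order."""
    return [pearson(profiles[i], profiles[i + 1]) for i in range(len(profiles) - 1)]


def random_pair_correlations(profiles, n_pairs, seed=0):
    """Background distribution: correlations of randomly drawn distinct gene pairs."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(profiles)), 2)
        correlations.append(pearson(profiles[i], profiles[j]))
    return correlations


# Invented expression profiles for four neighboring genes across four arrays.
profiles = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [4, 3, 2, 1],
    [1, 3, 2, 4],
]
neighbors = neighbor_correlations(profiles)
background = random_pair_correlations(profiles, n_pairs=5)
print(neighbors)
```

    A run of neighboring genes whose correlations sit well above the background distribution would be a candidate co-expressed cluster; deciding what counts as "well above" is the part the abstract handles with its FDR threshold.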

  8. Intra- and interspecies gene expression models for predicting drug response in canine osteosarcoma.

    Science.gov (United States)

    Fowles, Jared S; Brown, Kristen C; Hess, Ann M; Duval, Dawn L; Gustafson, Daniel L

    2016-02-19

    Genomics-based predictors of drug response have the potential to improve outcomes associated with cancer therapy. Osteosarcoma (OS), the most common primary bone cancer in dogs, is commonly treated with adjuvant doxorubicin or carboplatin following amputation of the affected limb. We evaluated the use of gene expression-based models built in an intra- or interspecies manner to predict chemosensitivity and treatment outcome in canine OS. Models were built and evaluated using microarray gene expression and drug sensitivity data from human and canine cancer cell lines, and canine OS tumor datasets. The "COXEN" method was utilized to filter gene signatures between human and dog datasets based on strong co-expression patterns. Models were built using linear discriminant analysis via the misclassification penalized posterior algorithm. The best doxorubicin model involved genes identified in human lines that were co-expressed and trained on canine OS tumor data, which accurately predicted clinical outcome in 73% of dogs (p = 0.0262, binomial). The best carboplatin model utilized canine lines for gene identification and model training, with canine OS tumor data for co-expression. Dogs whose treatment matched our predictions had significantly better clinical outcomes than those whose treatment did not (p = 0.0006, Log Rank), and this predictor was significantly associated with longer disease-free intervals in a Cox multivariate analysis (hazard ratio = 0.3102, p = 0.0124). Our data show that intra- and interspecies gene expression models can successfully predict response in canine OS, which may improve outcome in dogs and serve as pre-clinical validation for similar methods in human cancer research.

  9. Cell-specific prediction and application of drug-induced gene expression profiles.

    Science.gov (United States)

    Hodos, Rachel; Zhang, Ping; Lee, Hao-Chih; Duan, Qiaonan; Wang, Zichen; Clark, Neil R; Ma'ayan, Avi; Wang, Fei; Kidd, Brian; Hu, Jianying; Sontag, David; Dudley, Joel

    2018-01-01

    Gene expression profiling of in vitro drug perturbations is useful for many biomedical discovery applications including drug repurposing and elucidation of drug mechanisms. However, limited data availability across cell types has hindered our capacity to leverage or explore the cell-specificity of these perturbations. While recent efforts have generated a large number of drug perturbation profiles across a variety of human cell types, many gaps remain in this combinatorial drug-cell space. Hence, we asked whether it is possible to fill these gaps by predicting cell-specific drug perturbation profiles using available expression data from related conditions--i.e. from other drugs and cell types. We developed a computational framework that first arranges existing profiles into a three-dimensional array (or tensor) indexed by drugs, genes, and cell types, and then uses either local (nearest-neighbors) or global (tensor completion) information to predict unmeasured profiles. We evaluate prediction accuracy using a variety of metrics, and find that the two methods have complementary performance, each superior in different regions in the drug-cell space. Predictions achieve correlations of 0.68 with true values, and maintain accurate differentially expressed genes (AUC 0.81). Finally, we demonstrate that the predicted profiles add value for making downstream associations with drug targets and therapeutic classes.
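
    The local (nearest-neighbors) idea mentioned in this abstract can be illustrated with a small sketch. It is not the authors' framework: the dictionary representation of the tensor, the cosine similarity between cell types, the function names, and the toy values are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cell_similarity(tensor, cell_a, cell_b):
    """Compare two cell types using the drugs that were measured in both."""
    drugs_a = {d for (d, c) in tensor if c == cell_a}
    drugs_b = {d for (d, c) in tensor if c == cell_b}
    shared = sorted(drugs_a & drugs_b)
    if not shared:
        return 0.0
    va = [x for d in shared for x in tensor[(d, cell_a)]]
    vb = [x for d in shared for x in tensor[(d, cell_b)]]
    return cosine(va, vb)

def predict_profile(tensor, drug, cell, k=2):
    """Fill the (drug, cell) gap by averaging the drug's profiles in the
    k cell types most similar to `cell`."""
    candidates = [c for (d, c) in tensor if d == drug and c != cell]
    if not candidates:
        raise ValueError("drug has no measured profile in any other cell type")
    ranked = sorted(candidates,
                    key=lambda c: cell_similarity(tensor, cell, c),
                    reverse=True)[:k]
    n_genes = len(tensor[(drug, ranked[0])])
    return [sum(tensor[(drug, c)][g] for c in ranked) / len(ranked)
            for g in range(n_genes)]

# Hypothetical sparse tensor: (drug, cell type) -> expression change per gene.
tensor = {
    ("drugA", "cell1"): [1.0, 0.0],
    ("drugA", "cell2"): [1.0, 0.2],
    ("drugA", "cell3"): [0.9, 0.1],
    ("drugB", "cell1"): [0.5, 0.5],
    ("drugB", "cell2"): [0.25, 0.5],
}
nearest_only = predict_profile(tensor, "drugB", "cell3", k=1)
two_neighbors = predict_profile(tensor, "drugB", "cell3", k=2)
```

    A global method such as tensor completion would instead fit a low-rank model to all measured entries at once; the abstract reports that the two approaches perform best in different regions of the drug-cell space.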

  10. In silico prediction of novel therapeutic targets using gene-disease association data.

    Science.gov (United States)

    Ferrero, Enrico; Dunham, Ian; Sanseau, Philippe

    2017-08-29

    Target identification and validation is a pressing challenge in the pharmaceutical industry, with many of the programmes that fail for efficacy reasons showing poor association between the drug target and the disease. Computational prediction of successful targets could have a considerable impact on attrition rates in the drug discovery pipeline by significantly reducing the initial search space. Here, we explore whether gene-disease association data from the Open Targets platform is sufficient to predict therapeutic targets that are actively being pursued by pharmaceutical companies or are already on the market. To test our hypothesis, we train four different classifiers (a random forest, a support vector machine, a neural network and a gradient boosting machine) on partially labelled data and evaluate their performance using nested cross-validation and testing on an independent set. We then select the best performing model and use it to make predictions on more than 15,000 genes. Finally, we validate our predictions by mining the scientific literature for proposed therapeutic targets. We observe that the data types with the best predictive power are animal models showing a disease-relevant phenotype, differential expression in diseased tissue and genetic association with the disease under investigation. On a test set, the neural network classifier achieves over 71% accuracy with an AUC of 0.76 when predicting therapeutic targets in a semi-supervised learning setting. We use this model to gain insights into current and failed programmes and to predict 1431 novel targets, of which a highly significant proportion has been independently proposed in the literature. Our in silico approach shows that data linking genes and diseases is sufficient to predict novel therapeutic targets effectively and confirms that this type of evidence is essential for formulating or strengthening hypotheses in the target discovery process. Ultimately, more rapid and automated target

  11. Prediction of metastasis from low-malignant breast cancer by gene expression profiling

    DEFF Research Database (Denmark)

    Thomassen, Mads; Tan, Qihua; Eiriksdottir, Freyja

    2007-01-01

    Promising results for prediction of outcome in breast cancer have been obtained by genome wide gene expression profiling. Some studies have suggested that an extensive overtreatment of breast cancer patients might be reduced by risk assessment with gene expression profiling. A patient group hardly examined in these studies is the low-risk patients for whom outcome is very difficult to predict with currently used methods. These patients do not receive adjuvant treatment according to the guidelines of the Danish Breast Cancer Cooperative Group (DBCG). In this study, 26 tumors from low-risk patients...... with different characteristics and risk, expression-based classification specifically developed in low-risk patients have higher predictive power in this group.

  12. Detecting microRNA activity from gene expression data

    LENUS (Irish Health Repository)

    Madden, Stephen F

    2010-05-18

    Background: MicroRNAs (miRNAs) are non-coding RNAs that regulate gene expression by binding to the messenger RNA (mRNA) of protein coding genes. They control gene expression by either inhibiting translation or inducing mRNA degradation. A number of computational techniques have been developed to identify the targets of miRNAs. In this study we used predicted miRNA-gene interactions to analyse mRNA gene expression microarray data to predict miRNAs associated with particular diseases or conditions. Results: Here we combine correspondence analysis, between group analysis and co-inertia analysis (CIA) to determine which miRNAs are associated with differences in gene expression levels in microarray data sets. Using a database of miRNA target predictions from TargetScan, TargetScanS, PicTar4way, PicTar5way, and miRanda and combining these data with gene expression levels from sets of microarrays, this method produces a ranked list of miRNAs associated with a specified split in samples. We applied this to three different microarray datasets: a papillary thyroid carcinoma dataset, an in-house dataset of lipopolysaccharide treated mouse macrophages, and a multi-tissue dataset. In each case we were able to identify miRNAs of biological importance. Conclusions: We describe a technique to integrate gene expression data and miRNA target predictions from multiple sources.

  13. Detecting microRNA activity from gene expression data.

    LENUS (Irish Health Repository)

    Madden, Stephen F

    2010-01-01

    BACKGROUND: MicroRNAs (miRNAs) are non-coding RNAs that regulate gene expression by binding to the messenger RNA (mRNA) of protein coding genes. They control gene expression by either inhibiting translation or inducing mRNA degradation. A number of computational techniques have been developed to identify the targets of miRNAs. In this study we used predicted miRNA-gene interactions to analyse mRNA gene expression microarray data to predict miRNAs associated with particular diseases or conditions. RESULTS: Here we combine correspondence analysis, between group analysis and co-inertia analysis (CIA) to determine which miRNAs are associated with differences in gene expression levels in microarray data sets. Using a database of miRNA target predictions from TargetScan, TargetScanS, PicTar4way, PicTar5way, and miRanda and combining these data with gene expression levels from sets of microarrays, this method produces a ranked list of miRNAs associated with a specified split in samples. We applied this to three different microarray datasets: a papillary thyroid carcinoma dataset, an in-house dataset of lipopolysaccharide treated mouse macrophages, and a multi-tissue dataset. In each case we were able to identify miRNAs of biological importance. CONCLUSIONS: We describe a technique to integrate gene expression data and miRNA target predictions from multiple sources.

  14. Muscle myeloid type I interferon gene expression may predict therapeutic responses to rituximab in myositis patients.

    Science.gov (United States)

    Nagaraju, Kanneboyina; Ghimbovschi, Svetlana; Rayavarapu, Sree; Phadke, Aditi; Rider, Lisa G; Hoffman, Eric P; Miller, Frederick W

    2016-09-01

    To identify muscle gene expression patterns that predict rituximab responses and assess the effects of rituximab on muscle gene expression in PM and DM. In an attempt to understand the molecular mechanism of response and non-response to rituximab therapy, we performed Affymetrix gene expression array analyses on muscle biopsy specimens taken before and after rituximab therapy from eight PM and two DM patients in the Rituximab in Myositis study. We also analysed selected muscle-infiltrating cell phenotypes in these biopsies by immunohistochemical staining. Partek and Ingenuity pathway analyses assessed the gene pathways and networks. Myeloid type I IFN signature genes were expressed at higher levels at baseline in the skeletal muscle of rituximab responders than in non-responders, whereas classic non-myeloid IFN signature genes were expressed at higher levels in non-responders at baseline. Also, rituximab responders had a greater reduction of the myeloid and non-myeloid type I IFN signatures than non-responders. The decrease in the type I IFN signature following administration of rituximab may be associated with the decreases in muscle-infiltrating CD19(+) B cells and CD68(+) macrophages in responders. Our findings suggest that high levels of myeloid type I IFN gene expression in skeletal muscle predict responses to rituximab in PM/DM and that rituximab responders also have a greater decrease in the expression of these genes. These data add further evidence to recent studies defining the type I IFN signature as both a predictor of therapeutic responses and a biomarker of myositis disease activity. Published by Oxford University Press on behalf of the British Society for Rheumatology 2016. This work is written by US Government employees and is in the public domain in the US.

  15. Predictive gene signatures: molecular markers distinguishing colon adenomatous polyp and carcinoma.

    Directory of Open Access Journals (Sweden)

    Janice E Drew

    Cancers exhibit abnormal molecular signatures associated with disease initiation and progression. Molecular signatures could improve cancer screening, detection, drug development and selection of appropriate drug therapies for individual patients. Typically only very small amounts of tissue are available from patients for analysis, and biopsy samples exhibit broad heterogeneity that cannot be captured using a single marker. This report details application of an in-house custom designed GenomeLab System multiplex gene expression assay, the hCellMarkerPlex, to assess predictive gene signatures of normal, adenomatous polyp and carcinoma colon tissue using archived tissue bank material. The hCellMarkerPlex incorporates twenty-one gene markers: epithelial (EZR, KRT18, NOX1, SLC9A2), proliferation (PCNA, CCND1, MS4A12), differentiation (B4GANLT2, CDX1, CDX2), apoptotic (CASP3, NOX1, NTN1), fibroblast (FSP1, COL1A1), structural (ACTG2, CNN1, DES), gene transcription (HDAC1), stem cell (LGR5), endothelial (VWF) and mucin production (MUC2). Gene signatures distinguished normal, adenomatous polyp and carcinoma. Individual gene targets significantly contributing to molecular tissue types (classifier genes) were further characterised using real-time PCR, in-situ hybridisation and immunohistochemistry, revealing aberrant epithelial expression of MS4A12, LGR5, CDX2, NOX1 and SLC9A2 prior to development of carcinoma. The identified gene signatures reveal aberrant epithelial expression of genes prior to cancer development using in-house custom designed gene expression multiplex assays. This approach may be used to assist in objective classification of disease initiation, staging, progression and therapeutic responses using biopsy material.

  16. PRGdb 3.0: a comprehensive platform for prediction and analysis of plant disease resistance genes.

    Science.gov (United States)

    Osuna-Cruz, Cristina M; Paytuvi-Gallart, Andreu; Di Donato, Antimo; Sundesha, Vicky; Andolfo, Giuseppe; Aiese Cigliano, Riccardo; Sanseverino, Walter; Ercolano, Maria R

    2018-01-04

    The Plant Resistance Genes database (PRGdb; http://prgdb.org) has been redesigned with a new user interface, new sections, new tools and new data for genetic improvement, allowing easy access not only to the plant science research community but also to breeders who want to improve plant disease resistance. The home page offers an overview of easy-to-read search boxes that streamline data queries and directly show plant species for which data from candidate or cloned genes have been collected. Bulk data files and curated resistance gene annotations are made available for each plant species hosted. The new Gene Model view offers detailed information on each cloned resistance gene structure to highlight shared attributes with other genes. PRGdb 3.0 offers 153 reference resistance genes and 177 072 annotated candidate Pathogen Receptor Genes (PRGs). Compared to the previous release, the number of putative genes has been increased from 106 to 177 K from 76 sequenced Viridiplantae and algae genomes. The DRAGO 2 tool, which automatically annotates and predicts PRGs from DNA and amino acid sequences with high accuracy and sensitivity, has been added. BLAST search has been implemented to offer users the opportunity to annotate and compare their own sequences. The improved section on plant diseases displays useful information linked to genes and genomes to connect complementary data and better address specific needs. Through a revised and enlarged collection of data, the development of new tools and a renewed portal, PRGdb 3.0 engages the plant science community in developing a consensus plan to improve knowledge and strategies to fight diseases that afflict main crops and other plants. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  17. Protein-Protein Interactions Prediction Based on Iterative Clique Extension with Gene Ontology Filtering

    Directory of Open Access Journals (Sweden)

    Lei Yang

    2014-01-01

    Cliques (maximal complete subnets) in a protein-protein interaction (PPI) network are an important resource used to analyze protein complexes and functional modules. Clique-based methods of predicting PPIs compensate for the data deficiencies of biological experiments. However, clique-based prediction methods depend only on the topology of the network, and the false-positive and false-negative interactions in a network usually interfere with prediction. Therefore, we propose a method combining clique-based prediction and gene ontology (GO) annotations to overcome this shortcoming and improve the accuracy of predictions. According to different GO correcting rules, we generate two predicted interaction sets which guarantee the quality and quantity of predicted protein interactions. The proposed method is applied to the PPI network from the Database of Interacting Proteins (DIP), and most of the predicted interactions are verified by another biological database, BioGRID. The predicted protein interactions are appended to the original protein network, which leads to clique extension and shows the significance of biological meaning.
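
    One way to picture clique-based prediction with a GO filter is the sketch below. It is only an illustration under stated assumptions, not the method of the paper: the rule "a protein that interacts with all but one member of a clique probably also interacts with the remaining member", the shared-GO-term filter, the function name, and the toy network are all hypothetical.

```python
def predict_by_clique_extension(adj, clique, go_terms):
    """Propose missing interactions that would extend `clique` by one protein.

    adj      -- dict: protein -> set of interaction partners (symmetric)
    clique   -- list of proteins that are pairwise connected in `adj`
    go_terms -- dict: protein -> set of GO annotation labels
    """
    members = set(clique)
    predictions = []
    for protein in sorted(adj):
        if protein in members:
            continue
        # Clique members this outside protein is NOT yet linked to.
        missing = [m for m in clique if m not in adj[protein]]
        if len(missing) != 1:
            continue  # already fully connected, or too weakly attached
        partner = missing[0]
        # Assumed GO correcting rule: require at least one shared annotation.
        if go_terms.get(protein, set()) & go_terms.get(partner, set()):
            predictions.append(tuple(sorted((protein, partner))))
    return predictions

# Hypothetical toy network: A, B, C form a clique; D touches A and B only.
adj = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
}
go_terms = {
    "C": {"GO:term1"},          # placeholder labels, not real GO identifiers
    "D": {"GO:term1", "GO:term2"},
    "E": {"GO:term2"},
}
predicted = predict_by_clique_extension(adj, ["A", "B", "C"], go_terms)
```

    Appending the predicted pair to the network would turn the three-protein clique into a four-protein one, which corresponds to the clique extension the abstract refers to.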

  18. Predictive gene lists for breast cancer prognosis: A topographic visualisation study

    Directory of Open Access Journals (Sweden)

    Lowe David

    2008-04-01

    Background: The controversy surrounding the non-uniqueness of predictive gene lists (PGL) of small selected subsets of genes from very large potential candidates, as available in DNA microarray experiments, is now widely acknowledged [1]. Many of these studies have focused on constructing discriminative semi-parametric models and as such are also subject to the issue of random correlations of sparse model selection in high dimensional spaces. In this work we outline a different approach based around an unsupervised patient-specific nonlinear topographic projection in predictive gene lists. Methods: We construct nonlinear topographic projection maps based on inter-patient gene-list relative dissimilarities. The Neuroscale, Stochastic Neighbor Embedding (SNE) and Locally Linear Embedding (LLE) techniques have been used to construct two-dimensional projective visualisation plots of 70-dimensional PGLs per patient. Classifiers are also constructed to identify the prognosis indicator of each patient using the resulting projections from those visualisation techniques and to investigate whether, a posteriori, two prognosis groups are separable on the evidence of the gene lists. A literature-proposed predictive gene list for breast cancer is benchmarked against a separate gene list using the above methods. Generalisation ability is investigated by using the mapping capability of Neuroscale to visualise the follow-up study, but based on the projections derived from the original dataset. Results: The results indicate that small subsets of patient-specific PGLs have insufficient prognostic dissimilarity to permit a distinction between the two prognosis groups of patients. Uncertainty and diversity across multiple gene expressions prevent unambiguous or even confident patient grouping. Comparative projections across different PGLs provide similar results. Conclusion: The random correlation effect to an arbitrary outcome induced by small subset selection from very high

  19. Influence of the Leader protein coding region of foot-and-mouth disease virus on virus replication

    DEFF Research Database (Denmark)

    Belsham, Graham

    2013-01-01

    The foot-and-mouth disease virus (FMDV) Leader (L) protein is produced in two forms, Lab and Lb, differing only at their amino-termini, due to the use of separate initiation codons, usually 84 nt apart. It has been shown previously, and confirmed here, that precise deletion of the Lab coding......, in the context of the virus lacking the Lb coding region, was also tolerated by the virus within BHK cells. However, precise loss of the Lb coding sequence alone blocked FMDV replication in primary bovine thyroid cells. Thus, the requirement for the Leader protein coding sequences is highly dependent...... on the nature and extent of the residual Leader protein sequences and on the host cell system used. FMDVs precisely lacking Lb and with the Lab initiation codon modified may represent safer seed viruses for vaccine production....

  20. Experimental Definition and Validation of Protein Coding Transcripts in Chlamydomonas reinhardtii

    Energy Technology Data Exchange (ETDEWEB)

    Kourosh Salehi-Ashtiani; Jason A. Papin

    2012-01-13

    Algal fuel sources promise unsurpassed yields in a carbon neutral manner that minimizes resource competition between agriculture and fuel crops. Many challenges must be addressed before algal biofuels can be accepted as a component of the fossil fuel replacement strategy. One significant challenge is that the cost of algal fuel production must become competitive with existing fuel alternatives. Algal biofuel production presents the opportunity to fine-tune microbial metabolic machinery for an optimal blend of biomass constituents and desired fuel molecules. Genome-scale model-driven algal metabolic design promises to facilitate both goals by directing the utilization of metabolites in the complex, interconnected metabolic networks to optimize production of the compounds of interest. Using Chlamydomonas reinhardtii as a model, we developed a systems-level methodology bridging metabolic network reconstruction with annotation and experimental verification of enzyme encoding open reading frames. We reconstructed a genome-scale metabolic network for this alga and devised a novel light-modeling approach that enables quantitative growth prediction for a given light source, resolving wavelength and photon flux. We experimentally verified transcripts accounted for in the network and physiologically validated model function through simulation and generation of new experimental growth data, providing high confidence in network contents and predictive applications. The network offers insight into algal metabolism and potential for genetic engineering and efficient light source design, a pioneering resource for studying light-driven metabolism and quantitative systems biology. Our approach to generating a predictive metabolic model integrated with cloned open reading frames provides a cost-effective platform to generate metabolic engineering resources. While the generated resources are specific to algal systems, the approach that we have developed is not specific to algae and

  1. MultiLoc2: integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction

    Directory of Open Access Journals (Sweden)

    Kohlbacher Oliver

    2009-09-01

    Background: Knowledge of subcellular localization of proteins is crucial to proteomics, drug target discovery and systems biology since localization and biological function are highly correlated. In recent years, numerous computational prediction methods have been developed. Nevertheless, there is still a need for prediction methods that show more robustness and higher accuracy. Results: We extended our previous MultiLoc predictor by incorporating phylogenetic profiles and Gene Ontology terms. Two different datasets were used for training the system, resulting in two versions of this high-accuracy prediction method. One version is specialized for globular proteins and predicts up to five localizations, whereas a second version covers all eleven main eukaryotic subcellular localizations. In a benchmark study with five localizations, MultiLoc2 performs considerably better than other methods for animal and plant proteins and comparably for fungal proteins. Furthermore, MultiLoc2 performs clearly better when using a second dataset that extends the benchmark study to all eleven main eukaryotic subcellular localizations. Conclusion: MultiLoc2 is an extensive high-performance subcellular protein localization prediction system. By incorporating phylogenetic profiles and Gene Ontology terms, MultiLoc2 yields higher accuracies compared to its previous version. Moreover, it outperforms other prediction systems in two benchmark studies. MultiLoc2 is available as a user-friendly and free web service at: http://www-bs.informatik.uni-tuebingen.de/Services/MultiLoc2.

  2. Adipose Gene Expression Prior to Weight Loss Can Differentiate and Weakly Predict Dietary Responders

    Science.gov (United States)

    Mutch, David M.; Temanni, M. Ramzi; Henegar, Corneliu; Combes, Florence; Pelloux, Véronique; Holst, Claus; Sørensen, Thorkild I. A.; Astrup, Arne; Martinez, J. Alfredo; Saris, Wim H. M.; Viguerie, Nathalie; Langin, Dominique; Zucker, Jean-Daniel; Clément, Karine

    2007-01-01

    Background: The ability to identify obese individuals who will successfully lose weight in response to dietary intervention will revolutionize disease management. Therefore, we asked whether it is possible to identify subjects who will lose weight during dietary intervention using only a single gene expression snapshot. Methodology/Principal Findings: The present study involved 54 female subjects from the Nutrient-Gene Interactions in Human Obesity-Implications for Dietary Guidelines (NUGENOB) trial to determine whether subcutaneous adipose tissue gene expression could be used to predict weight loss prior to the 10-week consumption of a low-fat hypocaloric diet. Several statistical tests revealed that the gene expression profiles of responders (8–12 kg weight loss) could always be differentiated from non-responders (... diet is able to differentiate responders from non-responders as well as serve as a weak predictor of subjects destined to lose weight. While the degree of prediction accuracy currently achieved with a gene expression snapshot is perhaps insufficient for clinical use, this work reveals that the comprehensive molecular signature of adipose tissue paves the way for the future of personalized nutrition. PMID:18094752

  3. CRC-113 gene expression signature for predicting prognosis in patients with colorectal cancer.

    Science.gov (United States)

    Nguyen, Minh Nam; Choi, Tae Gyu; Nguyen, Dinh Truong; Kim, Jin-Hwan; Jo, Yong Hwa; Shahid, Muhammad; Akter, Salima; Aryal, Saurav Nath; Yoo, Ji Youn; Ahn, Yong-Joo; Cho, Kyoung Min; Lee, Ju-Seog; Choe, Wonchae; Kang, Insug; Ha, Joohun; Kim, Sung Soo

    2015-10-13

    Colorectal cancer (CRC) is the third leading cause of global cancer mortality. Recent studies have proposed several gene signatures to predict CRC prognosis, but none of those have proven reliable for predicting prognosis in clinical practice yet due to poor reproducibility and molecular heterogeneity. Here, we have established a prognostic signature of 113 probe sets (CRC-113) that include potential biomarkers and reflect the biological and clinical characteristics. Robustness and accuracy were significantly validated in external data sets from 19 centers in five countries. In multivariate analysis, CRC-113 gene signature showed a stronger prognostic value for survival and disease recurrence in CRC patients than current clinicopathological risk factors and molecular alterations. We also demonstrated that the CRC-113 gene signature reflected both genetic and epigenetic molecular heterogeneity in CRC patients. Furthermore, incorporation of the CRC-113 gene signature into a clinical context and molecular markers further refined the selection of the CRC patients who might benefit from postoperative chemotherapy. Conclusively, CRC-113 gene signature provides new possibilities for improving prognostic models and personalized therapeutic strategies.

  4. The Spike-and-Slab Lasso Generalized Linear Models for Prediction and Associated Genes Detection.

    Science.gov (United States)

    Tang, Zaixiang; Shen, Yueping; Zhang, Xinyan; Yi, Nengjun

    2017-01-01

    Large-scale "omics" data have been increasingly used as an important resource for prognostic prediction of diseases and detection of associated genes. However, there are considerable challenges in analyzing high-dimensional molecular data, including the large number of potential molecular predictors, limited number of samples, and small effect of each predictor. We propose new Bayesian hierarchical generalized linear models, called spike-and-slab lasso GLMs, for prognostic prediction and detection of associated genes using large-scale molecular data. The proposed model employs a spike-and-slab mixture double-exponential prior for coefficients that can induce weak shrinkage on large coefficients, and strong shrinkage on irrelevant coefficients. We have developed a fast and stable algorithm to fit large-scale hierarchical GLMs by incorporating expectation-maximization (EM) steps into the fast cyclic coordinate descent algorithm. The proposed approach integrates nice features of two popular methods, i.e., penalized lasso and Bayesian spike-and-slab variable selection. The performance of the proposed method is assessed via extensive simulation studies. The results show that the proposed approach can provide not only more accurate estimates of the parameters, but also better prediction. We demonstrate the proposed procedure on two cancer data sets: a well-known breast cancer data set consisting of 295 tumors and expression data of 4919 genes, and the ovarian cancer data set from TCGA with 362 tumors and expression data of 5336 genes. Our analyses show that the proposed procedure can generate powerful models for predicting outcomes and detecting associated genes. The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/). Copyright © 2017 by the Genetics Society of America.
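
    For orientation, a spike-and-slab mixture double-exponential prior of the kind this abstract describes is commonly written as below. The notation is an assumption and does not come from the abstract itself: \gamma_j \in \{0, 1\} is a latent inclusion indicator for coefficient \beta_j, and s_0 and s_1 are the spike and slab scales.

```latex
\beta_j \mid \gamma_j \;\sim\; \mathrm{DE}(\beta_j \mid 0, S_j)
  \;=\; \frac{1}{2 S_j} \exp\!\left( -\frac{|\beta_j|}{S_j} \right),
\qquad
S_j \;=\; (1 - \gamma_j)\, s_0 + \gamma_j\, s_1,
\qquad
0 < s_0 < s_1 .
```

    Under this reading, a coefficient assigned to the spike (\gamma_j = 0) receives the small scale s_0 and is shrunk strongly toward zero, while one assigned to the slab (\gamma_j = 1) receives the larger scale s_1 and is shrunk only weakly, which matches the behavior the abstract attributes to the prior.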

  5. Genome sequence analysis of predicted polyprenol reductase gene from mangrove plant Kandelia obovata

    Science.gov (United States)

    Basyuni, M.; Sagami, H.; Baba, S.; Oku, H.

    2018-03-01

    It has previously been reported that dolichols, rather than polyprenols, predominate in mangrove leaves and roots. The occurrence of larger amounts of dolichol in the leaves of mangrove plants therefore implies that polyprenol reductase, which is responsible for the conversion of polyprenol to dolichol, may be active in mangrove leaves. Here we report an early assessment of probable polyprenol reductase genes from the genome sequence of the mangrove plant Kandelia obovata. The functional assignment of the genes was based on a homology search of the sequences against the non-redundant (nr) peptide database of NCBI using Blastx. The degree of sequence identity between each DNA sequence and known polyprenol reductases was confirmed using the Blastx probability E-value, total score, and identity. The genome sequence data yielded three partial sequences, termed c23157 (700 bp), c23901 (960 bp), and c24171 (531 bp). The c23157 gene showed the highest similarity (61%) to predicted polyprenol reductase 2-like from Gossypium raimondii, with an E-value of 2e-100. The second gene, c23901, exhibited high similarity (78%) to the steroid 5-alpha-reductase Det2 from Jatropha curcas, with an E-value of 2e-140. Furthermore, the c24171 gene showed the highest similarity (79%) to polyprenol reductase 2 isoform X1 from J. curcas, with an E-value of 7e-21. The present study suggests that c23157, c23901, and c24171 may encode polyprenol reductases and therefore represent new predicted polyprenol reductase genes from K. obovata.

  6. Using gene co-expression network analysis to predict biomarkers for chronic lymphocytic leukemia

    Directory of Open Access Journals (Sweden)

    Borlawsky Tara B

    2010-10-01

    Background: Chronic lymphocytic leukemia (CLL) is the most common adult leukemia. It is a highly heterogeneous disease, and can be divided roughly into indolent and progressive stages based on classic clinical markers. Immunoglobulin heavy chain variable region (IgVH) mutational status was found to be associated with patient survival outcome, and biomarkers linked to the IgVH status have been a focus in the CLL prognosis research field. However, biomarkers highly correlated with IgVH mutational status which can accurately predict the survival outcome are yet to be discovered. Results: In this paper, we investigate the use of gene co-expression network analysis to identify potential biomarkers for CLL. Specifically, we focused on the co-expression network involving ZAP70, a well-characterized biomarker for CLL. We selected 23 microarray datasets corresponding to multiple types of cancer from the Gene Expression Omnibus (GEO) and used the frequent network mining algorithm CODENSE to identify highly connected gene co-expression networks spanning the entire genome, then evaluated the genes in the co-expression network in which ZAP70 is involved. We then applied a set of feature selection methods to further select genes which are capable of predicting IgVH mutation status from the ZAP70 co-expression network. Conclusions: We have identified a set of genes that are potential CLL prognostic biomarkers (IL2RB, CD8A, CD247, LAG3 and KLRK1), which can predict CLL patient IgVH mutational status with high accuracy. Their prognostic capabilities were cross-validated by applying these biomarker candidates to classify patients into different outcome groups using a CLL microarray dataset with clinical information.

  7. Methylation of cancer-stem-cell-associated Wnt target genes predicts poor prognosis in colorectal cancer patients

    NARCIS (Netherlands)

    de Sousa E Melo, Felipe; Colak, Selcuk; Buikhuisen, Joyce; Koster, Jan; Cameron, Kate; de Jong, Joan H.; Tuynman, Jurriaan B.; Prasetyanti, Pramudita R.; Fessler, Evelyn; van den Bergh, Saskia P.; Rodermond, Hans; Dekker, Evelien; van der Loos, Chris M.; Pals, Steven T.; van de Vijver, Marc J.; Versteeg, Rogier; Richel, Dick J.; Vermeulen, Louis; Medema, Jan Paul

    2011-01-01

    Gene signatures derived from cancer stem cells (CSCs) predict tumor recurrence for many forms of cancer. Here, we derived a gene signature for colorectal CSCs defined by high Wnt signaling activity, which in agreement with previous observations predicts poor prognosis. Surprisingly, however, we

  8. New Markov Model Approaches to Deciphering Microbial Genome Function and Evolution: Comparative Genomics of Laterally Transferred Genes

    Energy Technology Data Exchange (ETDEWEB)

    Borodovsky, M.

    2013-04-11

    Algorithmic methods for gene prediction have been developed and successfully applied to many different prokaryotic genome sequences. As the set of genes in a particular genome is not homogeneous with respect to DNA sequence composition features, the GeneMark.hmm program utilizes two Markov models representing distinct classes of protein coding genes denoted "typical" and "atypical". Atypical genes are those whose DNA features deviate significantly from those classified as typical and they represent approximately 10% of any given genome. In addition to the inherent interest of more accurately predicting genes, the atypical status of these genes may also reflect their separate evolutionary ancestry from other genes in that genome. We hypothesize that atypical genes are largely comprised of those genes that have been relatively recently acquired through lateral gene transfer (LGT). If so, what fraction of atypical genes are such bona fide LGTs? We have made atypical gene predictions for all fully completed prokaryotic genomes; we have been able to compare these results to other "surrogate" methods of LGT prediction.

  9. Integrative Analysis of Gene Expression Data Including an Assessment of Pathway Enrichment for Predicting Prostate Cancer

    Directory of Open Access Journals (Sweden)

    Pingzhao Hu

    2006-01-01

    biological pathways. In particular, we observed that by integrating information from the insulin signalling pathway into our prediction model, we achieved better prediction of prostate cancer. Conclusions: Our data integration methodology provides an efficient way to identify biologically sound and statistically significant pathways from gene expression data. The significant gene expression phenotypes identified in our study have the potential to characterize complex genetic alterations in prostate cancer.

  10. TYK2 protein-coding variants protect against rheumatoid arthritis and autoimmunity, with no evidence of major pleiotropic effects on non-autoimmune complex traits.

    Directory of Open Access Journals (Sweden)

    Dorothée Diogo

    Despite the success of genome-wide association studies (GWAS) in detecting a large number of loci for complex phenotypes such as rheumatoid arthritis (RA) susceptibility, the lack of information on the causal genes leaves important challenges to interpret GWAS results in the context of the disease biology. Here, we genetically fine-map the RA risk locus at 19p13 to define causal variants, and explore the pleiotropic effects of these same variants in other complex traits. First, we combined Immunochip dense genotyping (n = 23,092 case/control samples), Exomechip genotyping (n = 18,409 case/control samples) and targeted exon-sequencing (n = 2,236 case/control samples) to demonstrate that three protein-coding variants in TYK2 (tyrosine kinase 2) independently protect against RA: P1104A (rs34536443, OR = 0.66, P = 2.3 x 10^-21), A928V (rs35018800, OR = 0.53, P = 1.2 x 10^-9), and I684S (rs12720356, OR = 0.86, P = 4.6 x 10^-7). Second, we show that the same three TYK2 variants protect against systemic lupus erythematosus (SLE; P(omnibus) = 6 x 10^-18), and provide suggestive evidence that two of the TYK2 variants (P1104A and A928V) may also protect against inflammatory bowel disease (IBD; P(omnibus) = 0.005). Finally, in a phenome-wide association study (PheWAS) assessing >500 phenotypes using electronic medical records (EMR) in >29,000 subjects, we found no convincing evidence for association of P1104A and A928V with complex phenotypes other than autoimmune diseases such as RA, SLE and IBD. Together, our results demonstrate the role of TYK2 in the pathogenesis of RA, SLE and IBD, and provide supporting evidence for TYK2 as a promising drug target for the treatment of autoimmune diseases.

  11. A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data

    Directory of Open Access Journals (Sweden)

    Ruzzo Walter L

    2006-03-01

    Background: As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources. Methods: In this paper, we address this issue by proposing a general framework for gene function prediction based on the k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its simplicity, flexibility to incorporate different data types and adaptability to irregular feature spaces. A weakness of traditional KNN methods, especially when handling heterogeneous data, is that performance is subject to the often ad hoc choice of similarity metric. To address this weakness, we apply regression methods to infer a similarity metric as a weighted combination of a set of base similarity measures, which helps to locate the neighbors that are most likely to be in the same class as the target gene. We also suggest a novel voting scheme to generate confidence scores that estimate the accuracy of predictions. The method gracefully extends to multi-way classification problems. Results: We apply this technique to gene function prediction according to three well-known Escherichia coli classification schemes suggested by biologists, using information derived from microarray and genome sequencing data. We demonstrate that our algorithm dramatically outperforms the naive KNN methods and is competitive with support vector machine (SVM) algorithms for integrating heterogeneous data. We also show that by combining different data sources, prediction accuracy can improve significantly. Conclusion: Our extension of KNN with automatic feature weighting, multi-class prediction, and probabilistic inference enhances prediction accuracy significantly while remaining efficient, intuitive and flexible. This general framework can also be applied to similar classification problems involving heterogeneous datasets.
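
    The idea of a similarity metric built as a weighted combination of base similarity measures can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights are assumed to have been estimated beforehand (the abstract says by regression, which is not reproduced here), the confidence score is a plain vote fraction rather than the paper's voting scheme, and all function names are hypothetical.

```python
from collections import Counter


def combined_similarity(base_similarities, weights):
    """Weighted sum of several base similarity scores for one candidate neighbor."""
    return sum(w * s for w, s in zip(weights, base_similarities))


def knn_predict(neighbors, weights, k):
    """Predict a class label for a target gene.

    neighbors: list of (base_similarities, label) pairs, where base_similarities
               holds one score per data source (e.g. expression, sequence).
    weights:   one weight per data source, assumed to be learned elsewhere.
    Returns (predicted_label, vote_fraction among the k most similar neighbors).
    """
    ranked = sorted(
        neighbors,
        key=lambda item: combined_similarity(item[0], weights),
        reverse=True,
    )
    top = ranked[:k]
    votes = Counter(label for _, label in top)
    label, count = votes.most_common(1)[0]
    return label, count / len(top)


if __name__ == "__main__":
    candidates = [
        ([0.9, 0.1], "A"),
        ([0.8, 0.2], "A"),
        ([0.1, 0.9], "B"),
        ([0.2, 0.95], "B"),
        ([0.3, 0.3], "A"),
    ]
    print(knn_predict(candidates, weights=[1.0, 0.0], k=3))
    print(knn_predict(candidates, weights=[0.0, 1.0], k=3))
```

    Changing the weights changes which neighbors count as closest, which is why learning them from data, instead of fixing a single ad hoc metric, can matter for heterogeneous inputs.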

  12. Predictive models for mutations in mismatch repair genes: implication for genetic counseling in developing countries

    Directory of Open Access Journals (Sweden)

    Monteiro Santos Erika

    2012-02-01

    Background: Lynch syndrome (LS) is the most common form of inherited predisposition to colorectal cancer (CRC), accounting for 2-5% of all CRC. LS is an autosomal dominant disease characterized by mutations in the mismatch repair genes mutL homolog 1 (MLH1), mutS homolog 2 (MSH2), postmeiotic segregation increased 1 (PMS1), post-meiotic segregation increased 2 (PMS2) and mutS homolog 6 (MSH6). Mutation risk prediction models can be incorporated into clinical practice, facilitating the decision-making process and identifying individuals for molecular investigation. This is extremely important in countries with limited economic resources. This study aims to evaluate sensitivity and specificity of five predictive models for germline mutations in repair genes in a sample of individuals with suspected Lynch syndrome. Methods: Blood samples from 88 patients were analyzed through sequencing MLH1, MSH2 and MSH6 genes. The probability of detecting a mutation was calculated using the PREMM, Barnetson, MMRpro, Wijnen and Myriad models. To evaluate the sensitivity and specificity of the models, receiver operating characteristic curves were constructed. Results: Of the 88 patients included in this analysis, 31 mutations were identified: 16 were found in the MSH2 gene, 15 in the MLH1 gene and no pathogenic mutations were identified in the MSH6 gene. It was observed that the AUC for the PREMM (0.846), Barnetson (0.850), MMRpro (0.821) and Wijnen (0.807) models did not present significant statistical difference. The Myriad model presented lower AUC (0.704) than the four other models evaluated. Considering thresholds of ≥ 5%, the models' sensitivity varied between 1 (Myriad) and 0.87 (Wijnen) and specificity ranged from 0 (Myriad) to 0.38 (Barnetson). Conclusions: The Barnetson, PREMM, MMRpro and Wijnen models present similar AUC. The AUC of the Myriad model is statistically inferior to the four other models.
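
    The sensitivity and specificity figures reported at the ≥ 5% threshold follow from a simple count of true and false calls. The sketch below shows that calculation on made-up numbers; the function name and the example data are illustrative assumptions and are not taken from the study.

```python
def sensitivity_specificity(probabilities, has_mutation, threshold=0.05):
    """Compute (sensitivity, specificity) when every patient whose predicted
    mutation probability is >= threshold is referred for testing."""
    tp = fn = tn = fp = 0
    for p, truth in zip(probabilities, has_mutation):
        predicted_positive = p >= threshold
        if truth and predicted_positive:
            tp += 1
        elif truth:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical predicted probabilities and sequencing results.
    probs = [0.50, 0.03, 0.20, 0.01]
    truth = [True, True, False, False]
    print(sensitivity_specificity(probs, truth))
    # A model that scores every patient above the threshold flags everyone.
    print(sensitivity_specificity([0.9, 0.9, 0.9, 0.9], truth))
```

    A model that places every patient above the threshold reaches a sensitivity of 1 only by accepting a specificity of 0, which is the pattern the abstract reports for the Myriad model at this cutoff.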

  13. Predictive models for mutations in mismatch repair genes: implication for genetic counseling in developing countries

    Energy Technology Data Exchange (ETDEWEB)

    Monteiro Santos, Erika Maria [Graduation Program, AC Camargo Hospital, Sao Paulo (Brazil); International Center of Research and Training (CIPE), AC Camargo Hospital, Sao Paulo (Brazil); Silva Junior, Wilson Araujo da [Sao Paulo University, Department of Genetics, Medical School of Ribeirao Preto, Ribeirao Preto (Brazil); Carraro, Dirce Maria [Graduation Program, AC Camargo Hospital, Sao Paulo (Brazil); International Center of Research and Training (CIPE), AC Camargo Hospital, Sao Paulo (Brazil); Rossi, Benedito Mauro; Valentin, Mev Dominguez [Graduation Program, AC Camargo Hospital, Sao Paulo (Brazil); Carneiro, Felipe [Graduation Program, AC Camargo Hospital, Sao Paulo (Brazil); International Center of Research and Training (CIPE), AC Camargo Hospital, Sao Paulo (Brazil); Oliveira, Ligia Petrolini de [Graduation Program, AC Camargo Hospital, Sao Paulo (Brazil); Oliveira Ferreira, Fabio de; Junior, Samuel Aguiar [Graduation Program, AC Camargo Hospital, Sao Paulo (Brazil); Hereditary Colorectal Cancer Registry, AC Camargo Hospital, Sao Paulo (Brazil); Nakagawa, Wilson Toshihiko [Hereditary Colorectal Cancer Registry, AC Camargo Hospital, Sao Paulo (Brazil); Gomy, Israel [Graduation Program, AC Camargo Hospital, Sao Paulo (Brazil); Sao Paulo University, Department of Genetics, Medical School of Ribeirao Preto, Ribeirao Preto (Brazil); Faria Ferraz, Victor Evangelista de [Sao Paulo University, Department of Genetics, Medical School of Ribeirao Preto, Ribeirao Preto (Brazil)

    2012-02-09

    Lynch syndrome (LS) is the most common form of inherited predisposition to colorectal cancer (CRC), accounting for 2-5% of all CRC. LS is an autosomal dominant disease characterized by mutations in the mismatch repair genes mutL homolog 1 (MLH1), mutS homolog 2 (MSH2), postmeiotic segregation increased 1 (PMS1), post-meiotic segregation increased 2 (PMS2) and mutS homolog 6 (MSH6). Mutation risk prediction models can be incorporated into clinical practice, facilitating the decision-making process and identifying individuals for molecular investigation. This is extremely important in countries with limited economic resources. This study aims to evaluate sensitivity and specificity of five predictive models for germline mutations in repair genes in a sample of individuals with suspected Lynch syndrome. Blood samples from 88 patients were analyzed through sequencing MLH1, MSH2 and MSH6 genes. The probability of detecting a mutation was calculated using the PREMM, Barnetson, MMRpro, Wijnen and Myriad models. To evaluate the sensitivity and specificity of the models, receiver operating characteristic curves were constructed. Of the 88 patients included in this analysis, 31 mutations were identified: 16 were found in the MSH2 gene, 15 in the MLH1 gene and no pathogenic mutations were identified in the MSH6 gene. It was observed that the AUC for the PREMM (0.846), Barnetson (0.850), MMRpro (0.821) and Wijnen (0.807) models did not present significant statistical difference. The Myriad model presented lower AUC (0.704) than the four other models evaluated. Considering thresholds of ≥ 5%, the models sensitivity varied between 1 (Myriad) and 0.87 (Wijnen) and specificity ranged from 0 (Myriad) to 0.38 (Barnetson). The Barnetson, PREMM, MMRpro and Wijnen models present similar AUC. The AUC of the Myriad model is statistically inferior to the four other models.

  14. Predictive models for mutations in mismatch repair genes: implication for genetic counseling in developing countries

    International Nuclear Information System (INIS)

    Monteiro Santos, Erika Maria; Silva Junior, Wilson Araujo da; Carraro, Dirce Maria; Rossi, Benedito Mauro; Valentin, Mev Dominguez; Carneiro, Felipe; Oliveira, Ligia Petrolini de; Oliveira Ferreira, Fabio de; Junior, Samuel Aguiar; Nakagawa, Wilson Toshihiko; Gomy, Israel; Faria Ferraz, Victor Evangelista de

    2012-01-01

    Lynch syndrome (LS) is the most common form of inherited predisposition to colorectal cancer (CRC), accounting for 2-5% of all CRC. LS is an autosomal dominant disease characterized by mutations in the mismatch repair genes mutL homolog 1 (MLH1), mutS homolog 2 (MSH2), postmeiotic segregation increased 1 (PMS1), post-meiotic segregation increased 2 (PMS2) and mutS homolog 6 (MSH6). Mutation risk prediction models can be incorporated into clinical practice, facilitating the decision-making process and identifying individuals for molecular investigation. This is extremely important in countries with limited economic resources. This study aims to evaluate sensitivity and specificity of five predictive models for germline mutations in repair genes in a sample of individuals with suspected Lynch syndrome. Blood samples from 88 patients were analyzed through sequencing MLH1, MSH2 and MSH6 genes. The probability of detecting a mutation was calculated using the PREMM, Barnetson, MMRpro, Wijnen and Myriad models. To evaluate the sensitivity and specificity of the models, receiver operating characteristic curves were constructed. Of the 88 patients included in this analysis, 31 mutations were identified: 16 were found in the MSH2 gene, 15 in the MLH1 gene and no pathogenic mutations were identified in the MSH6 gene. It was observed that the AUC for the PREMM (0.846), Barnetson (0.850), MMRpro (0.821) and Wijnen (0.807) models did not present significant statistical difference. The Myriad model presented lower AUC (0.704) than the four other models evaluated. Considering thresholds of ≥ 5%, the models sensitivity varied between 1 (Myriad) and 0.87 (Wijnen) and specificity ranged from 0 (Myriad) to 0.38 (Barnetson). The Barnetson, PREMM, MMRpro and Wijnen models present similar AUC. The AUC of the Myriad model is statistically inferior to the four other models

  15. A postprocessing method in the HMC framework for predicting gene function based on biological instrumental data

    Science.gov (United States)

    Feng, Shou; Fu, Ping; Zheng, Wenbin

    2018-03-01

    Predicting gene function based on biological instrumental data is a complicated and challenging hierarchical multi-label classification (HMC) problem. When using local approach methods to solve this problem, a preliminary results processing method is usually needed. This paper proposed a novel preliminary results processing method called the nodes interaction method. The nodes interaction method revises the preliminary results and guarantees that the predictions are consistent with the hierarchy constraint. This method exploits the label dependency and considers the hierarchical interaction between nodes when making decisions based on the Bayesian network in its first phase. In the second phase, this method further adjusts the results according to the hierarchy constraint. Implementing the nodes interaction method in the HMC framework also enhances the HMC performance for solving the gene function prediction problem based on the Gene Ontology (GO), the hierarchy of which is a directed acyclic graph that is more difficult to tackle. The experimental results validate the promising performance of the proposed method compared to state-of-the-art methods on eight benchmark yeast data sets annotated by the GO.
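
    The hierarchy constraint mentioned in this abstract can be illustrated with a generic post-processing step: a term's score is never allowed to exceed the score of any of its ancestors in the directed acyclic graph. This is a common, simple rule shown here only as a sketch under that assumption; it is not the nodes interaction method itself, whose Bayesian-network phase is not described in enough detail in the abstract to reproduce.

```python
def enforce_hierarchy(scores, parents):
    """Return scores adjusted so that each node's score is at most the
    (adjusted) score of every one of its parents in a DAG.

    scores:  dict mapping node -> preliminary prediction score in [0, 1].
    parents: dict mapping node -> list of parent nodes (roots may be omitted).
    """
    corrected = {}

    def resolve(node):
        # Memoized walk toward the roots; a node with several parents
        # is capped by the lowest-scoring one.
        if node not in corrected:
            value = scores[node]
            for parent in parents.get(node, ()):
                value = min(value, resolve(parent))
            corrected[node] = value
        return corrected[node]

    for node in scores:
        resolve(node)
    return corrected


if __name__ == "__main__":
    preliminary = {"root": 0.9, "a": 0.6, "b": 0.8, "c": 0.7}
    dag = {"a": ["root"], "b": ["root"], "c": ["a", "b"]}
    print(enforce_hierarchy(preliminary, dag))
```

    After this step, thresholding the scores at any cutoff yields a label set in which every predicted term has all of its ancestors predicted as well.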

  16. An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data.

    Science.gov (United States)

    Nidheesh, N; Abdul Nazeer, K A; Ameer, P M

    2017-12-01

    Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids. We propose an improved, density based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select data points which belong to dense regions and which are adequately separated in feature space as the initial centroids. We compared the proposed algorithm to a set of eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm which is being used for cancer data classification, based on the performances on a set of datasets comprising ten cancer gene expression datasets. The proposed algorithm has shown better overall performance than the others. There is a pressing need in the Biomedical domain for simple, easy-to-use and more accurate Machine Learning tools for cancer subtype prediction. The proposed algorithm is simple, easy-to-use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data. Copyright © 2017 Elsevier Ltd. All rights reserved.
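
    The key idea stated above, choosing initial centroids from dense regions that are adequately separated, can be sketched with a deterministic greedy selection. The density estimate (neighbor count within a fixed radius), the radius and min_separation parameters, and the tie-breaking by index are all assumptions made for illustration; the abstract does not specify the authors' actual procedure.

```python
import math


def initial_centroids(points, k, radius, min_separation):
    """Pick up to k initial centroids deterministically.

    Density of a point is the number of points within `radius` of it
    (including itself). Points are visited from densest to sparsest, with
    ties broken by original index, and a point is accepted only if it lies
    at least `min_separation` away from every centroid chosen so far.
    """
    density = [
        sum(1 for q in points if math.dist(p, q) <= radius) for p in points
    ]
    order = sorted(range(len(points)), key=lambda i: (-density[i], i))
    chosen = []
    for i in order:
        if all(math.dist(points[i], c) >= min_separation for c in chosen):
            chosen.append(points[i])
        if len(chosen) == k:
            break
    return chosen


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
    print(initial_centroids(data, k=2, radius=1.5, min_separation=5.0))
```

    Because nothing in the selection is random, repeated runs on the same data start K-Means from the same centroids and therefore give the same clustering.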

  17. Semi-supervised prediction of gene regulatory networks using machine learning algorithms.

    Science.gov (United States)

    Patel, Nihir; Wang, Jason T L

    2015-10-01

    Use of computational methods to predict gene regulatory networks (GRNs) from gene expression data is a challenging task. Many studies have been conducted using unsupervised methods to fulfill the task; however, such methods usually yield low prediction accuracies due to the lack of training data. In this article, we propose semi-supervised methods for GRN prediction by utilizing two machine learning algorithms, namely, support vector machines (SVM) and random forests (RF). The semi-supervised methods make use of unlabelled data for training. We investigated inductive and transductive learning approaches, both of which adopt an iterative procedure to obtain reliable negative training data from the unlabelled data. We then applied our semi-supervised methods to gene expression data of Escherichia coli and Saccharomyces cerevisiae, and evaluated the performance of our methods using the expression data. Our analysis indicated that the transductive learning approach outperformed the inductive learning approach for both organisms. However, there was no conclusive difference identified in the performance of SVM and RF. Experimental results also showed that the proposed semi-supervised methods performed better than existing supervised methods for both organisms.

  18. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data.

    Science.gov (United States)

    Hawkins, Troy; Chitale, Meghana; Luban, Stanislav; Kihara, Daisuke

    2009-02-15

    Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (≥80%). We also applied our predictions to the protein-protein interaction network of the Malaria plasmodium (Plasmodium falciparum). High confidence GO biological process predictions (≥90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments. PFP is available as a web service at http

  19. Integrative annotation of 21,037 human genes validated by full-length cDNA clones.

    Directory of Open Access Journals (Sweden)

    Tadashi Imanishi

    2004-06-01

    The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA

  20. Expression of estrogen-related gene markers in breast cancer tissue predicts aromatase inhibitor responsiveness.

    Directory of Open Access Journals (Sweden)

    Irene Moy

    Aromatase inhibitors (AIs) are the most effective class of drugs in the endocrine treatment of breast cancer, with an approximate 50% treatment response rate. Our objective was to determine whether intratumoral expression levels of estrogen-related genes are predictive of AI responsiveness in postmenopausal women with breast cancer. Primary breast carcinomas were obtained from 112 women who received AI therapy after failing adjuvant tamoxifen therapy and developing recurrent breast cancer. Tumor ERα and PR protein expression were analyzed by immunohistochemistry (IHC). Messenger RNA (mRNA) levels of 5 estrogen-related genes (AKR1C3, aromatase, ERα, and 2 estradiol/ERα target genes, BRCA1 and PR) were measured by real-time PCR. Tumor protein and mRNA levels were compared with breast cancer progression rates to determine predictive accuracy. Responsiveness to AI therapy (defined as the combined complete response, partial response, and stable disease rates for at least 6 months) was 51%; rates were 56% in ERα-IHC-positive and 14% in ERα-IHC-negative tumors. Levels of ERα, PR, or BRCA1 mRNA were independently predictive for responsiveness to AI. In cross-validated analyses, a combined measurement of tumor ERα and PR mRNA levels yielded a superior specificity (36%) and identical sensitivity (96%) compared with the current clinical practice (ERα/PR-IHC). In patients with ERα/PR-IHC-negative tumors, analysis of mRNA expression revealed either non-significant trends or statistically significant positive predictive values for AI responsiveness. In conclusion, expression levels of estrogen-related mRNAs are predictive for AI responsiveness in postmenopausal women with breast cancer, and mRNA expression analysis may improve patient selection.

  1. Prediction of novel target genes and pathways involved in irinotecan-resistant colorectal cancer.

    Directory of Open Access Journals (Sweden)

    Precious Takondwa Makondi

    Acquired drug resistance to the chemotherapeutic drug irinotecan (the active metabolite of which is SN-38) is one of the significant obstacles in the treatment of advanced colorectal cancer (CRC). The molecular mechanism or targets mediating irinotecan resistance are still unclear. It is urgent to find irinotecan response biomarkers to improve CRC patients' therapy. The Gene Expression Omnibus dataset GSE42387, which contains the gene expression profiles of parental and irinotecan-resistant HCT-116 cell lines, was used. Differentially expressed genes (DEGs) between parental and irinotecan-resistant cells, protein-protein interactions (PPIs), gene ontologies (GOs) and pathways were analyzed to identify the overall biological changes. The most common DEGs in the PPIs, GOs and pathways were identified and were validated clinically by their ability to predict overall survival and disease free survival. The gene-gene expression correlation and gene-resistance correlation were also evaluated in CRC patients using The Cancer Genome Atlas (TCGA) data. A total of 135 DEGs were identified, of which 36 were upregulated and 99 were downregulated. After mapping the PPI networks, the GOs and the pathways, nine genes (GNAS, PRKACB, MECOM, PLA2G4C, BMP6, BDNF, DLG4, FGF2 and FGF9) were found to be commonly enriched. Signal transduction was the most significant GO and the MAPK pathway was the most significant pathway. The five genes (FGF2, FGF9, PRKACB, MECOM and PLA2G4C) in the MAPK pathway were all contained in signal transduction, and the levels of those genes were upregulated. FGF2, FGF9 and MECOM expression were highly associated with CRC patients' survival rate, but PRKACB and PLA2G4C were not. In addition, FGF9 was also associated with irinotecan resistance and poor disease free survival. FGF2, FGF9 and PRKACB were positively correlated with each other while MECOM correlated positively with FGF9 and PLA2G4C, and correlated negatively with FGF2 and PRKACB after doing gene-gene

  2. Establishment of a 12-gene expression signature to predict colon cancer prognosis

    Directory of Open Access Journals (Sweden)

    Dalong Sun

    2018-06-01

    A robust and accurate gene expression signature is essential to assist oncologists to determine which subset of patients at similar Tumor-Lymph Node-Metastasis (TNM) stage has high recurrence risk and could benefit from adjuvant therapies. Here we applied a two-step supervised machine-learning method and established a 12-gene expression signature to precisely predict colon adenocarcinoma (COAD) prognosis by using COAD RNA-seq transcriptome data from The Cancer Genome Atlas (TCGA). The predictive performance of the 12-gene signature was validated with two independent gene expression microarray datasets: GSE39582 includes 566 COAD cases for the development of six molecular subtypes with distinct clinical, molecular and survival characteristics; GSE17538 is a dataset containing 232 colon cancer patients for the generation of a metastasis gene expression profile to predict recurrence and death in COAD patients. The signature could effectively separate the poor prognosis patients from the good prognosis group (disease specific survival (DSS): Kaplan Meier (KM) Log Rank p = 0.0034; overall survival (OS): KM Log Rank p = 0.0336) in GSE17538. For patients with proficient mismatch repair system (pMMR) in GSE39582, the signature could also effectively distinguish the high risk group from the low risk group (OS: KM Log Rank p = 0.005; Relapse free survival (RFS): KM Log Rank p = 0.022). Interestingly, advanced stage patients were significantly enriched in the high 12-gene score group (Fisher’s exact test p = 0.0003). After stage stratification, the signature could still distinguish poor prognosis patients in GSE17538 from good prognosis within stage II (Log Rank p = 0.01) and stage II & III (Log Rank p = 0.017) in the outcome of DFS. Within stage III or II/III pMMR patients treated with Adjuvant Chemotherapies (ACT), patients with higher 12-gene score showed poorer prognosis (III, OS: KM Log Rank p = 0.046; III & II, OS: KM Log Rank p = 0.041). Among stage II/III pMMR patients

  3. A Seasonal Time-Series Model Based on Gene Expression Programming for Predicting Financial Distress

    Directory of Open Access Journals (Sweden)

    Ching-Hsue Cheng

    2018-01-01

    Financial distress prediction is an important and challenging research topic in the financial field. Many methods currently exist for predicting firm bankruptcy and financial crisis, including artificial intelligence and traditional statistical methods, and past studies have shown that the prediction results of artificial intelligence methods are better than those of traditional statistical methods. Financial statements are quarterly reports; hence, the financial crisis of companies is seasonal time-series data, and the attribute data affecting the financial distress of companies is nonlinear and nonstationary time-series data with fluctuations. Therefore, this study employed the nonlinear attribute selection method to build a nonlinear financial distress prediction model: that is, this paper proposed a novel seasonal time-series gene expression programming model for predicting the financial distress of companies. The proposed model has several advantages including the following: (i) the proposed model is different from the previous models lacking the concept of time series; (ii) the proposed integrated attribute selection method can find the core attributes and reduce high dimensional data; and (iii) the proposed model can generate the rules and mathematical formulas of financial distress for providing references to the investors and decision makers. The result shows that the proposed method is better than the listed classifiers under three criteria; hence, the proposed model has competitive advantages in predicting the financial distress of companies.

  4. A Seasonal Time-Series Model Based on Gene Expression Programming for Predicting Financial Distress

    Science.gov (United States)

    2018-01-01

    Financial distress prediction is an important and challenging research topic in the financial field. Many methods currently exist for predicting firm bankruptcy and financial crisis, including artificial intelligence and traditional statistical methods, and past studies have shown that the prediction results of artificial intelligence methods are better than those of traditional statistical methods. Financial statements are quarterly reports; hence, the financial crisis of companies is seasonal time-series data, and the attribute data affecting the financial distress of companies is nonlinear and nonstationary time-series data with fluctuations. Therefore, this study employed the nonlinear attribute selection method to build a nonlinear financial distress prediction model: that is, this paper proposed a novel seasonal time-series gene expression programming model for predicting the financial distress of companies. The proposed model has several advantages including the following: (i) the proposed model is different from the previous models lacking the concept of time series; (ii) the proposed integrated attribute selection method can find the core attributes and reduce high dimensional data; and (iii) the proposed model can generate the rules and mathematical formulas of financial distress for providing references to the investors and decision makers. The result shows that the proposed method is better than the listed classifiers under three criteria; hence, the proposed model has competitive advantages in predicting the financial distress of companies. PMID:29765399

  5. Predicting gene function using hierarchical multi-label decision tree ensembles

    Directory of Open Access Journals (Sweden)

    Kocev Dragi

    2010-01-01

    Background: S. cerevisiae, A. thaliana and M. musculus are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability. Results: We study the use of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use. Conclusions: Our results suggest that decision tree based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.
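
    The idea of predictions that respect a hierarchy of gene functions can be illustrated with a minimal sketch. It assumes a tree-shaped hierarchy stored as a child-to-parent mapping; the function name with_ancestors and the FunCat-style codes are hypothetical and are not taken from the paper. The sketch only shows the consistency rule (an ORF annotated with a function is also annotated with every ancestor of that function), not the tree-learning algorithm itself.

```python
def with_ancestors(labels, parent):
    """Close a label set under the hierarchy: add every ancestor of each label.

    `parent` maps each function code to its parent code (None for top-level
    codes). This assumes a tree-shaped hierarchy; a DAG such as GO would need
    a mapping to a list of parents instead.
    """
    closed = set()
    for label in labels:
        # Walk upward until the root, or until a label whose ancestors
        # were already added on an earlier walk.
        while label is not None and label not in closed:
            closed.add(label)
            label = parent.get(label)
    return closed


# Hypothetical FunCat-style codes, for illustration only.
PARENT = {"01": None, "01.01": "01", "01.01.03": "01.01", "02": None}

print(sorted(with_ancestors({"01.01.03"}, PARENT)))
```

    Closing the training labels this way is one simple means of keeping multi-label targets consistent with the hierarchy before any model is fitted.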

  6. A Seasonal Time-Series Model Based on Gene Expression Programming for Predicting Financial Distress.

    Science.gov (United States)

    Cheng, Ching-Hsue; Chan, Chia-Pang; Yang, Jun-He

    2018-01-01

    Financial distress prediction is an important and challenging research topic in the financial field. Many methods currently exist for predicting firm bankruptcy and financial crisis, including artificial intelligence and traditional statistical methods, and past studies have shown that the prediction results of artificial intelligence methods are better than those of traditional statistical methods. Financial statements are quarterly reports; hence, the financial crisis of companies is seasonal time-series data, and the attribute data affecting the financial distress of companies is nonlinear and nonstationary time-series data with fluctuations. Therefore, this study employed the nonlinear attribute selection method to build a nonlinear financial distress prediction model: that is, this paper proposed a novel seasonal time-series gene expression programming model for predicting the financial distress of companies. The proposed model has several advantages including the following: (i) the proposed model is different from the previous models lacking the concept of time series; (ii) the proposed integrated attribute selection method can find the core attributes and reduce high dimensional data; and (iii) the proposed model can generate the rules and mathematical formulas of financial distress for providing references to the investors and decision makers. The result shows that the proposed method is better than the listed classifiers under three criteria; hence, the proposed model has competitive advantages in predicting the financial distress of companies.

  7. False positive reduction in protein-protein interaction predictions using gene ontology annotations

    Directory of Open Access Journals (Sweden)

    Lin Yen-Han

    2007-07-01

    Background: Many crucial cellular operations such as metabolism, signalling, and regulation are based on protein-protein interactions. However, the lack of robust protein-protein interaction information is a challenge. One reason for the lack of solid protein-protein interaction information is poor agreement between experimental findings and computational sets that, in turn, comes from the large number of false positive predictions in computational approaches. Reducing false positive predictions and enhancing the true positive fraction of computationally predicted protein-protein interaction datasets based on highly confident experimental results has not been adequately investigated. Results: Gene Ontology (GO) annotations were used to reduce false positive protein-protein interaction (PPI) pairs resulting from computational predictions. Using experimentally obtained PPI pairs as a training dataset, eight top-ranking keywords were extracted from GO molecular function annotations. The sensitivity of these keywords is 64.21% in the yeast experimental dataset and 80.83% in the worm experimental dataset. The specificities, a measure of recovery power, of these keywords applied to four predicted PPI datasets for each studied organism are 48.32% and 46.49% (by average of four datasets) in yeast and worm, respectively. Based on the eight top-ranking keywords and co-localization of interacting proteins, a set of two knowledge rules was deduced and applied to remove false positive protein pairs. The 'strength', a measure of improvement provided by the rules, was defined based on the signal-to-noise ratio and implemented to measure the applicability of the knowledge rules to the predicted PPI datasets. Depending on the employed PPI-predicting methods, the strength varies between two and ten-fold of randomly removing protein pairs from the datasets. Conclusion: Gene Ontology annotations along with the deduced knowledge rules could be implemented to partially

  8. A Computational Gene Expression Score for Predicting Immune Injury in Renal Allografts.

    Directory of Open Access Journals (Sweden)

    Tara K Sigdel

    Whole genome microarray meta-analyses of 1030 kidney, heart, lung and liver allograft biopsies identified a common immune response module (CRM) of 11 genes that define acute rejection (AR) across different engrafted tissues. We evaluated if the CRM genes can provide a molecular microscope to quantify graft injury in acute rejection (AR) and predict risk of progressive interstitial fibrosis and tubular atrophy (IFTA) in histologically normal kidney biopsies. Computational modeling was done on tissue qPCR based gene expression measurements for the 11 CRM genes in 146 independent renal allografts from 122 unique patients with AR (n = 54) and no-AR (n = 92). 24 demographically matched patients with no-AR had 6 and 24 month paired protocol biopsies; all had histologically normal 6 month biopsies, and 12 had evidence of progressive IFTA (pIFTA) on their 24 month biopsies. Results were correlated with demographic, clinical and pathology variables. The 11 gene qPCR based tissue CRM score (tCRM) was significantly increased in AR (5.68 ± 0.91) when compared to STA (1.29 ± 0.28; p < 0.001) and pIFTA (7.94 ± 2.278 versus 2.28 ± 0.66; p = 0.04), with greatest significance for CXCL9 and CXCL10 in AR (p < 0.001) and CD6 (p < 0.01), CXCL9 (p < 0.05), and LCK (p < 0.01) in pIFTA. tCRM was a significant independent correlate of biopsy confirmed AR (p < 0.001; AUC of 0.900; 95% CI = 0.705-903). Gene expression modeling of 6 month biopsies across 7/11 genes (CD6, INPP5D, ISG20, NKG7, PSMB9, RUNX3, and TAP1) significantly (p = 0.037) predicted the development of pIFTA at 24 months. Genome-wide tissue gene expression data mining has supported the development of a tCRM-qPCR based assay for evaluating graft immune inflammation. The tCRM score quantifies injury in AR and stratifies patients at increased risk of future pIFTA prior to any perturbation of graft function or histology.

  9. Predicting acute cardiac rejection from donor heart and pre-transplant recipient blood gene expression.

    Science.gov (United States)

    Hollander, Zsuzsanna; Chen, Virginia; Sidhu, Keerat; Lin, David; Ng, Raymond T; Balshaw, Robert; Cohen-Freue, Gabriela V; Ignaszewski, Andrew; Imai, Carol; Kaan, Annemarie; Tebbutt, Scott J; Wilson-McManus, Janet E; McMaster, Robert W; Keown, Paul A; McManus, Bruce M

    2013-02-01

    Acute rejection in cardiac transplant patients remains a contributory factor to limited survival of implanted hearts. Currently, there are no biomarkers in clinical use that can predict, at the time of transplantation, the likelihood of post-transplant acute cellular rejection. Such a development would be of great value in personalizing immunosuppressive treatment. Recipient age, donor age, cold ischemic time, warm ischemic time, panel-reactive antibody, gender mismatch, blood type mismatch and human leukocyte antigens (HLA-A, -B and -DR) mismatch between recipients and donors were tested in 53 heart transplant patients for their power to predict post-transplant acute cellular rejection. Donor transplant biopsy and recipient pre-transplant blood were also examined for the presence of genomic biomarkers in 7 rejection and 11 non-rejection patients, using non-targeted data mining techniques. The biomarker based on the 8 clinical variables had an area under the receiver operating characteristic curve (AUC) of 0.53. The pre-transplant recipient blood gene-based panel did not yield better performance, but the donor heart tissue gene-based panel had an AUC = 0.78. A combination of 25 probe sets from the transplant donor biopsy and 18 probe sets from the pre-transplant recipient whole blood had an AUC = 0.90. Biologic pathways implicated include VEGF- and EGFR-signaling, and MAPK. Based on this study, the best predictive biomarker panel contains genes from recipient whole blood and donor myocardial tissue. This panel provides clinically relevant prediction power and, if validated, may personalize immunosuppressive treatment and rejection monitoring. Copyright © 2013 International Society for Heart and Lung Transplantation. Published by Elsevier Inc. All rights reserved.
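
    The AUC values quoted above (0.53, 0.78, 0.90) can be read as the probability that a randomly chosen rejection patient receives a higher biomarker score than a randomly chosen non-rejection patient. The following minimal sketch computes the AUC from that pairwise definition; the function name auc and the toy scores are illustrative assumptions, not data or code from the study.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve from its pairwise (Mann-Whitney) definition.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half a win.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy scores (hypothetical): two rejection and two non-rejection patients.
print(auc([0.9, 0.3], [0.5, 0.1]))
```

    An AUC near 0.5, as reported for the clinical-variable biomarker, means the scores rank rejection and non-rejection patients little better than chance.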

  10. CORECLUST: identification of the conserved CRM grammar together with prediction of gene regulation.

    Science.gov (United States)

    Nikulova, Anna A; Favorov, Alexander V; Sutormin, Roman A; Makeev, Vsevolod J; Mironov, Andrey A

    2012-07-01

    Identification of transcriptional regulatory regions and tracing their internal organization are important for understanding the eukaryotic cell machinery. Cis-regulatory modules (CRMs) of higher eukaryotes are believed to possess a regulatory 'grammar', or preferred arrangement of binding sites, that is crucial for proper regulation and thus tends to be evolutionarily conserved. Here, we present a method CORECLUST (COnservative REgulatory CLUster STructure) that predicts CRMs based on a set of positional weight matrices. Given regulatory regions of orthologous and/or co-regulated genes, CORECLUST constructs a CRM model by revealing the conserved rules that describe the relative location of binding sites. The constructed model may be consequently used for the genome-wide prediction of similar CRMs, and thus detection of co-regulated genes, and for the investigation of the regulatory grammar of the system. Compared with related methods, CORECLUST shows better performance at identification of CRMs conferring muscle-specific gene expression in vertebrates and early-developmental CRMs in Drosophila.
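
    As background for the positional weight matrices that CORECLUST takes as input, here is a minimal sketch of how a single matrix is typically built from per-position base counts and used to scan a sequence for its best-scoring site. It is not the CORECLUST algorithm, which additionally models the arrangement of sites; the function names, the pseudocount, the uniform background of 0.25 and the toy motif are all assumptions made for illustration.

```python
import math


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds weight matrix.

    `counts` is a list of dicts, one per motif position, mapping a base to
    the number of times it was observed there. A uniform background is assumed.
    """
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def best_site(sequence, pwm):
    """Return (start, score) of the highest-scoring window, or None if the
    sequence is shorter than the motif. The sequence must contain only ACGT."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(pwm[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy motif (hypothetical): ten observed sites, all reading TATA.
toy_pwm = log_odds_pwm([{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}])
print(best_site("GGTATAGG", toy_pwm))
```

    A CRM predictor would combine hits from several such matrices and then score their relative positions; that second step is where a regulatory grammar enters.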

  11. Analysis of pan-genome to identify the core genes and essential genes of Brucella spp.

    Science.gov (United States)

    Yang, Xiaowen; Li, Yajie; Zang, Juan; Li, Yexia; Bie, Pengfei; Lu, Yanli; Wu, Qingmin

    2016-04-01

    Brucella spp. are facultative intracellular pathogens that cause a contagious zoonotic disease, which can result in abortion or sterility in susceptible animal hosts and grave, debilitating illness in humans. To decipher the survival mechanism of Brucella spp. in vivo, 42 complete Brucella genomes from NCBI were analyzed for the pan-genome and core genome by identifying the composition and function of the genomes. The results showed that the total 132,143 protein-coding genes in these genomes were divided into 5369 clusters. Among these, 1710 clusters were associated with the core genome, 1182 clusters with strain-specific genes and 2477 clusters with dispensable genomes. COG analysis indicated that 44% of the core genes were devoted to metabolism, which were mainly responsible for energy production and conversion (COG category C), and amino acid transport and metabolism (COG category E). Meanwhile, approximately 35% of the core genes were under positive selection. In addition, 1252 potential essential genes were predicted in the core genome by comparison with a prokaryote database of essential genes. The results suggested that the core genes in Brucella genomes are relatively conserved, and that energy and amino acid metabolism play a more important role in the process of growth and reproduction in Brucella spp. This study might help us to better understand the mechanisms of Brucella persistent infection and provide some clues for further exploring the gene modules of the intracellular survival in Brucella spp.
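
    The three-way split of gene clusters described above can be sketched as follows. The thresholds are the conventional ones and are an assumption here, since the abstract does not state them: a cluster is core if it occurs in every genome, strain-specific if it occurs in exactly one, and dispensable otherwise. The function name and the toy cluster identifiers are hypothetical.

```python
def partition_pangenome(clusters, n_genomes):
    """Split gene clusters into core, dispensable and strain-specific sets.

    `clusters` maps a cluster id to the genomes in which it was found.
    Assumed definitions: core = present in all genomes, strain-specific =
    present in exactly one genome, dispensable = everything in between.
    """
    parts = {"core": [], "dispensable": [], "strain_specific": []}
    for cluster_id, genomes in clusters.items():
        n_present = len(set(genomes))
        if n_present == n_genomes:
            parts["core"].append(cluster_id)
        elif n_present == 1:
            parts["strain_specific"].append(cluster_id)
        else:
            parts["dispensable"].append(cluster_id)
    return parts


# Toy input (hypothetical cluster ids and genome names), three genomes.
toy = {
    "c1": ["g1", "g2", "g3"],
    "c2": ["g1", "g3"],
    "c3": ["g2"],
}
print(partition_pangenome(toy, n_genomes=3))
```

    The three categories are mutually exclusive, which is consistent with the reported counts: 1710 + 1182 + 2477 = 5369 clusters.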

  12. Pancreatic cancer circulating tumour cells express a cell motility gene signature that predicts survival after surgery

    International Nuclear Information System (INIS)

    Sergeant, Gregory; Eijsden, Rudy van; Roskams, Tania; Van Duppen, Victor; Topal, Baki

    2012-01-01

    (95% CI) = 1.366 (1.004 – 1.861)). Pancreatic CTC isolated from blood samples using FACS-based negative depletion express a cell motility gene signature. Expression of this newly defined cell motility gene signature in the primary tumour can predict survival of patients undergoing surgical resection for pancreatic cancer. ClinicalTrials.gov NCT00495924

  13. A hemocyte gene expression signature correlated with predictive capacity of oysters to survive Vibrio infections

    Directory of Open Access Journals (Sweden)

    Rosa Rafael

    2012-06-01

    Background: The complex balance between environmental and host factors is an important determinant of susceptibility to infection. Disturbances of this equilibrium may result in multifactorial diseases as illustrated by the summer mortality syndrome, a worldwide and complex phenomenon that affects the oyster Crassostrea gigas. The summer mortality syndrome reveals a physiological intolerance making this oyster species susceptible to diseases. Exploration of the genetic basis governing oyster resistance or susceptibility to infections is thus a major goal for understanding field mortality events. In this context, we used high-throughput genomic approaches to identify genetic traits that may characterize inherent survival capacities in C. gigas. Results: Using digital gene expression (DGE), we analyzed the transcriptomes of hemocytes (immunocompetent cells) of oysters able or not able to survive infections by Vibrio species shown to be involved in summer mortalities. Hemocytes were nonlethally collected from oysters before Vibrio experimental infection, and two DGE libraries were generated from individuals that survived or did not survive. Exploration of DGE data and microfluidic qPCR analyses at the individual level showed an extraordinary polymorphism in gene expression, but also a set of hemocyte-expressed genes whose basal mRNA levels discriminate oyster capacity to survive infections by the pathogenic V. splendidus LGP32. Finally, we identified a signature of 14 genes that predicted oyster survival capacity. Their expression is likely driven by distinct transcriptional regulation processes associated or not associated with gene copy number variation (CNV). Conclusions: We provide here for the first time in oyster a gene expression survival signature that represents a useful tool for understanding mortality events and for assessing genetic traits of interest for disease resistance selection programs.

  14. Reduced Set of Virulence Genes Allows High Accuracy Prediction of Bacterial Pathogenicity in Humans

    Science.gov (United States)

    Iraola, Gregorio; Vazquez, Gustavo; Spangenberg, Lucía; Naya, Hugo

    2012-01-01

    Although there have been great advances in understanding bacterial pathogenesis, there is still a lack of integrative information about what makes a bacterium a human pathogen. The advent of high-throughput sequencing technologies has dramatically increased the number of completed bacterial genomes, for both known human pathogenic and non-pathogenic strains; this information is now available to investigate genetic features that determine pathogenic phenotypes in bacteria. In this work we determined presence/absence patterns of different virulence-related genes among more than finished bacterial genomes from both human pathogenic and non-pathogenic strains, belonging to different taxonomic groups (e.g., Actinobacteria, Gammaproteobacteria, Firmicutes). An accuracy of 95% using a cross-fold validation scheme with in-fold feature selection is obtained when classifying human pathogens and non-pathogens. A reduced subset of highly informative genes () is presented and applied to an external validation set. The statistical model was implemented in the BacFier v1.0 software (freely available at ), which displays not only the prediction (pathogen/non-pathogen) and an associated probability for pathogenicity, but also the presence/absence vector for the analyzed genes, so it is possible to decipher the subset of virulence genes responsible for the classification on the analyzed genome. Furthermore, we discuss the biological relevance for bacterial pathogenesis of the core set of genes, corresponding to eight functional categories, all with evident and documented association with the phenotypes of interest. Also, we analyze which functional categories of virulence genes were more distinctive for pathogenicity in each taxonomic group, which seems to be a completely new kind of information and could lead to important evolutionary conclusions. PMID:22916122

  15. A six-gene signature predicts survival of patients with localized pancreatic ductal adenocarcinoma.

    Directory of Open Access Journals (Sweden)

    Jeran K Stratford

    2010-07-01

    Pancreatic ductal adenocarcinoma (PDAC) remains a lethal disease. For patients with localized PDAC, surgery is the best option, but with a median survival of less than 2 years and a difficult and prolonged postoperative course for most, there is an urgent need to better identify patients who have the most aggressive disease. We analyzed the gene expression profiles of primary tumors from patients with localized compared to metastatic disease and identified a six-gene signature associated with metastatic disease. We evaluated the prognostic potential of this signature in a training set of 34 patients with localized and resected PDAC and selected a cut-point associated with outcome using X-tile. We then applied this cut-point to an independent test set of 67 patients with localized and resected PDAC and found that our signature was independently predictive of survival and superior to established clinical prognostic factors such as grade, tumor size, and nodal status, with a hazard ratio of 4.1 (95% confidence interval [CI] 1.7-10.0). Patients defined to be high-risk patients by the six-gene signature had a 1-year survival rate of 55% compared to 91% in the low-risk group. Our six-gene signature may be used to better stage PDAC patients and assist in the difficult treatment decisions of surgery and to select patients whose tumor biology may benefit most from neoadjuvant therapy. The use of this six-gene signature should be investigated in prospective patient cohorts, and if confirmed, in future PDAC clinical trials, its potential as a biomarker should be investigated. Genes in this signature, or the pathways that they fall into, may represent new therapeutic targets. Please see later in the article for the Editors' Summary.

  16. Gene network inherent in genomic big data improves the accuracy of prognostic prediction for cancer patients.

    Science.gov (United States)

    Kim, Yun Hak; Jeong, Dae Cheon; Pak, Kyoungjune; Goh, Tae Sik; Lee, Chi-Seung; Han, Myoung-Eun; Kim, Ji-Young; Liangwen, Liu; Kim, Chi Dae; Jang, Jeon Yeob; Cha, Wonjae; Oh, Sae-Ock

    2017-09-29

    Accurate prediction of prognosis is critical for therapeutic decisions regarding cancer patients. Many previously developed prognostic scoring systems have limitations in reflecting recent progress in the field of cancer biology such as microarray, next-generation sequencing, and signaling pathways. To develop a new prognostic scoring system for cancer patients, we used mRNA expression and clinical data in various independent breast cancer cohorts (n=1214) from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) and Gene Expression Omnibus (GEO). A new prognostic score that reflects gene network inherent in genomic big data was calculated using Network-Regularized high-dimensional Cox-regression (Net-score). We compared its discriminatory power with those of two previously used statistical methods: stepwise variable selection via univariate Cox regression (Uni-score) and Cox regression via Elastic net (Enet-score). The Net scoring system showed better discriminatory power in prediction of disease-specific survival (DSS) than other statistical methods (p=0 in METABRIC training cohort, p=0.000331, 4.58e-06 in two METABRIC validation cohorts) when accuracy was examined by log-rank test. Notably, comparison of C-index and AUC values in receiver operating characteristic analysis at 5 years showed fewer differences between training and validation cohorts with the Net scoring system than other statistical methods, suggesting minimal overfitting. The Net-based scoring system also successfully predicted prognosis in various independent GEO cohorts with high discriminatory power. In conclusion, the Net-based scoring system showed better discriminative power than previous statistical methods in prognostic prediction for breast cancer patients. This new system will mark a new era in prognosis prediction for cancer patients.
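
    The C-index mentioned above measures how well a risk score orders patients by survival time. A minimal sketch of Harrell's concordance index for right-censored data follows; it is a plain quadratic-time version written for clarity, the function name and toy data are hypothetical, and it is not the implementation used in the study.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i had the event strictly before
    patient j's last observed time. The pair is concordant when the patient
    who failed earlier also has the higher risk score; ties in risk count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical): follow-up in months, event flag, risk score.
print(concordance_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0]))
```

    A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so a small gap between training and validation C-index is what the abstract interprets as limited overfitting.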

  17. Meta4: a web application for sharing and annotating metagenomic gene predictions using web services.

    Science.gov (United States)

    Richardson, Emily J; Escalettes, Franck; Fotheringham, Ian; Wallace, Robert J; Watson, Mick

    2013-01-01

    Whole-genome shotgun metagenomics experiments produce DNA sequence data from entire ecosystems, and provide a huge amount of novel information. Gene discovery projects require up-to-date information about sequence homology and domain structure for millions of predicted proteins to be presented in a simple, easy-to-use system. There is a lack of simple, open, flexible tools that allow the rapid sharing of metagenomics datasets with collaborators in a format they can easily interrogate. We present Meta4, a flexible and extensible web application that can be used to share and annotate metagenomic gene predictions. Proteins and predicted domains are stored in a simple relational database, with a dynamic front-end which displays the results in an internet browser. Web services are used to provide up-to-date information about the proteins from homology searches against public databases. Information about Meta4 can be found on the project website, code is available on Github, a cloud image is available, and an example implementation can be seen at.

  18. Systems-based biological concordance and predictive reproducibility of gene set discovery methods in cardiovascular disease.

    Science.gov (United States)

    Azuaje, Francisco; Zheng, Huiru; Camargo, Anyela; Wang, Haiying

    2011-08-01

    The discovery of novel disease biomarkers is a crucial challenge for translational bioinformatics. Demonstration of both their classification power and reproducibility across independent datasets are essential requirements to assess their potential clinical relevance. Small datasets and multiplicity of putative biomarker sets may explain lack of predictive reproducibility. Studies based on pathway-driven discovery approaches have suggested that, despite such discrepancies, the resulting putative biomarkers tend to be implicated in common biological processes. Investigations of this problem have been mainly focused on datasets derived from cancer research. We investigated the predictive and functional concordance of five methods for discovering putative biomarkers in four independently-generated datasets from the cardiovascular disease domain. A diversity of biosignatures was identified by the different methods. However, we found strong biological process concordance between them, especially in the case of methods based on gene set analysis. With a few exceptions, we observed lack of classification reproducibility using independent datasets. Partial overlaps between our putative sets of biomarkers and the primary studies exist. Despite the observed limitations, pathway-driven or gene set analysis can predict potentially novel biomarkers and can jointly point to biomedically-relevant underlying molecular mechanisms. Copyright © 2011 Elsevier Inc. All rights reserved.

  19. Characterizing haploinsufficiency of SHELL gene to improve fruit form prediction in introgressive hybrids of oil palm.

    Science.gov (United States)

    Teh, Chee-Keng; Muaz, Siti Dalila; Tangaya, Praveena; Fong, Po-Yee; Ong, Ai-Ling; Mayes, Sean; Chew, Fook-Tim; Kulaveerasingam, Harikrishna; Appleton, David

    2017-06-08

    The fundamental trait in selective breeding of oil palm (Elaeis guineensis Jacq.) is the shell thickness surrounding the kernel. The monogenic shell thickness is inversely correlated to mesocarp thickness, where the crude palm oil accumulates. Commercial thin-shelled tenera derived from thick-shelled dura × shell-less pisifera generally contain 30% higher oil per bunch. Two mutations, sh MPOB (M1) and sh AVROS (M2), in the SHELL gene (a type II MADS-box transcription factor), mainly present in AVROS and Nigerian origins, were reported to be responsible for different fruit forms. In this study, we have tested 1,339 samples maintained in Sime Darby Plantation using both mutations. Five genotype-phenotype discrepancies and eight controls were then re-tested with all five reported mutations (sh AVROS, sh MPOB, sh MPOB2, sh MPOB3 and sh MPOB4) within the same gene. The integration of genotypic data, pedigree records and shell formation model further explained the haploinsufficiency effect on the SHELL gene with different numbers of functional copies. Some rare mutations were also identified, suggesting a need to further confirm the existence of cis-compound mutations in the gene. With this, the prediction accuracy of fruit forms can be further improved, especially in introgressive hybrids of oil palm. Understanding causative variant segregation is extremely important, even for monogenic traits such as shell thickness in oil palm.

  20. Predictive minimum description length principle approach to inferring gene regulatory networks.

    Science.gov (United States)

    Chaitankar, Vijender; Zhang, Chaoyang; Ghosh, Preetam; Gong, Ping; Perkins, Edward J; Deng, Youping

    2011-01-01

    Reverse engineering of gene regulatory networks using information theory models has received much attention due to its simplicity, low computational cost, and capability of inferring large networks. One of the major problems with information theory models is to determine the threshold that defines the regulatory relationships between genes. The minimum description length (MDL) principle has been implemented to overcome this problem. The description length of the MDL principle is the sum of model length and data encoding length. A user-specified fine tuning parameter is used as control mechanism between model and data encoding, but it is difficult to find the optimal parameter. In this work, we propose a new inference algorithm that incorporates mutual information (MI), conditional mutual information (CMI), and predictive minimum description length (PMDL) principle to infer gene regulatory networks from DNA microarray data. In this algorithm, the information theoretic quantities MI and CMI determine the regulatory relationships between genes and the PMDL principle method attempts to determine the best MI threshold without the need of a user-specified fine tuning parameter. The performance of the proposed algorithm is evaluated using both synthetic time series data sets and a biological time series data set (Saccharomyces cerevisiae). The results show that the proposed algorithm produced fewer false edges and significantly improved the precision when compared to existing MDL algorithm.
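
    The mutual information (MI) quantity on which this family of methods relies can be sketched for two discretized expression profiles as follows. This is only the plug-in MI estimate; the PMDL step that selects the MI threshold is not reproduced. The function name and the toy profiles are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of mutual information (in bits) between two
    equally long sequences of discrete expression levels."""
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi


# Toy discretized profiles (hypothetical): 0 = low, 1 = high expression.
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]   # identical to gene_a
gene_c = [0, 1, 0, 1]   # statistically independent of gene_a

print(mutual_information(gene_a, gene_b), mutual_information(gene_a, gene_c))
```

    In a threshold-based network, an edge is drawn between two genes when their MI exceeds the threshold; the abstract's contribution is choosing that threshold without a user-specified tuning parameter.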

  1. Radiation-induced gene expression in human subcutaneous fibroblasts is predictive of radiation-induced fibrosis

    DEFF Research Database (Denmark)

    Rødningen, Olaug Kristin; Børresen-Dale, Anne-Lise; Alsner, Jan

    2008-01-01

    BACKGROUND AND PURPOSE: Breast cancer patients show a large variation in normal tissue reactions after ionizing radiation (IR) therapy. One of the most common long-term adverse effects of ionizing radiotherapy is radiation-induced fibrosis (RIF), and several attempts have been made over the last...... years to develop predictive assays for RIF. Our aim was to identify basal and radiation-induced transcriptional profiles in fibroblasts from breast cancer patients that might be related to the individual risk of RIF in these patients. MATERIALS AND METHODS: Fibroblast cell lines from 31 individuals......-treated fibroblasts. Transcriptional differences in basal and radiation-induced gene expression profiles were investigated using 15K cDNA microarrays, and results analyzed by both SAM and PAM. RESULTS: Sixty differentially expressed genes were identified by applying SAM on 10 patients with the highest risk of RIF...

  2. Angiotensinogen gene polymorphism predicts hypertension, and iridological constitutional classification enhances the risk for hypertension in Koreans.

    Science.gov (United States)

    Cho, Joo-Jang; Hwang, Woo-Jun; Hong, Seung-Heon; Jeong, Hyun-Ja; Lee, Hye-Jung; Kim, Hyung-Min; Um, Jae-Young

    2008-05-01

    This study investigated the relationship between iridological constitution and angiotensinogen (AGN) gene polymorphism in hypertensives. In addition to angiotensin converting enzyme gene, AGN genotype is also one of the most well studied genetic markers of hypertension. Furthermore, iridology, one of complementary and alternative medicine, is the diagnosis of the medical conditions through noting irregularities of the pigmentation in the iris. Iridological constitution has a strong familial aggregation and is implicated in heredity. Therefore, the study classified 87 hypertensive patients with familial history of cerebral infarction and controls (n = 88) according to Iris constitution, and determined AGN genotype. As a result, the AGN/TT genotype was associated with hypertension (chi2 = 13.413, p iridological constitutional classification increased the relative risk for hypertension in the subjects with AGN/T allele. These results suggest that AGN polymorphism predicts hypertension, and iridological constitutional classification enhances the risk for hypertension associated with AGN/T in a Korean population.

  3. A 7 gene expression score predicts for radiation response in cancer cervix

    International Nuclear Information System (INIS)

    Rajkumar, Thangarajan; Vijayalakshmi, Neelakantan; Sabitha, Kesavan; Shirley, Sundersingh; Selvaluxmy, Ganesharaja; Bose, Mayil Vahanan; Nambaru, Lavanya

    2009-01-01

    Cervical cancer is the most common cancer among Indian women. The current recommendations are to treat the stage IIB, IIIA, IIIB and IVA with radical radiotherapy and weekly cisplatin based chemotherapy. However, radiotherapy alone can help cure more than 60% of stage IIB and up to 40% of stage IIIB patients. Archival RNA samples from 15 patients who had achieved complete remission and stayed disease free for more than 36 months (No Evidence of Disease or NED group) and 10 patients who had failed radical radiotherapy (Failed group) were included in the study. The RNA samples were amplified, labelled and hybridized to Stanford microarray chips and analyzed using BRB Array Tools software and Significance Analysis of Microarray (SAM) analysis. 20 genes were selected for further validation using Relative Quantitation (RQ) Taqman assay in a Taqman Low-Density Array (TLDA) format. The RQ value was calculated, using each of the NED samples once as a calibrator. A scoring system was developed based on the RQ value for the genes. Using a seven gene based scoring system, it was possible to distinguish between the tumours which were likely to respond to the radiotherapy and those likely to fail. The mean score ± 2 SE (standard error of mean) was used and at a cut-off score of greater than 5.60, the sensitivity, specificity, Positive predictive value (PPV) and Negative predictive value (NPV) were 0.64, 1.0, 1.0, 0.67, respectively, for the low risk group. We have identified a 7 gene signature which could help identify patients with cervical cancer who can be treated with radiotherapy alone. However, this needs to be validated in a larger patient population.
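
    The four reported measures all derive from a 2 × 2 confusion matrix. The sketch below shows the arithmetic with hypothetical counts; they are not the study's data and are not meant to reproduce its values.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all actual positives
        "specificity": tn / (tn + fp),  # true negatives among all actual negatives
        "ppv": tp / (tp + fp),          # true positives among all positive calls
        "npv": tn / (tn + fn),          # true negatives among all negative calls
    }

# Hypothetical counts chosen only for illustration.
m = diagnostic_metrics(tp=8, fp=0, tn=10, fn=4)
```

    With no false positives, specificity and PPV are both 1.0 by construction, which matches the pattern of the reported values.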

  4. Using purine skews to predict genes in AT-rich poxviruses

    Directory of Open Access Journals (Sweden)

    Upton Chris

    2005-02-01

    Full Text Available Abstract Background Clusters or runs of purines on the mRNA synonymous strand have been found in many different organisms including orthopoxviruses. The purine bias that is exhibited by these clusters can be observed using a purine skew and in the case of poxviruses, these skews can be used to help determine the coding strand of a particular segment of the genome. Combined with previous findings that minor ORFs have lower than average aspartate and glutamate composition and higher than average serine composition, purine content can be used to predict the likelihood of a poxvirus ORF being a "real gene". Results Using purine skews and a "quality" measure designed to incorporate previous findings about minor ORFs, we have found that in our training case (vaccinia virus strain Copenhagen), 59 of 65 minor (small and unlikely to be real genes) ORFs were correctly classified as being minor. Of the 201 major (large and likely to be real genes) vaccinia ORFs, 192 were correctly classified as being major. Performing a similar analysis with the entomopoxvirus amsacta moorei (AMEV), it was found that 4 major ORFs were incorrectly classified as minor and 9 minor ORFs were incorrectly classified as major. The purine abundance observed for major ORFs in vaccinia virus was found to stem primarily from the first codon position with both the second and third codon positions containing roughly equal amounts of purines and pyrimidines. Conclusion Purine skews and a "quality" measure can be used to predict functional ORFs and purine skews in particular can be used to determine which of two overlapping ORFs is most likely to be the real gene if neither of the two ORFs has orthologs in other poxviruses.
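
    A purine skew of the kind described can be sketched as a running count along a strand. The abstract does not give the exact definition used, so the form below (+1 per purine, -1 per pyrimidine) is an assumption, and the sequence is a toy example.

```python
def cumulative_purine_skew(seq):
    """Running total along a strand: +1 for a purine (A, G), -1 for a pyrimidine (C, T)."""
    skew = []
    total = 0
    for base in seq.upper():
        if base in "AG":
            total += 1
        elif base in "CT":
            total -= 1
        skew.append(total)  # ambiguous bases such as N leave the total unchanged
    return skew

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

# Toy purine-rich sequence (hypothetical, not a poxvirus ORF).
seq = "AAGGAGCT"
forward = cumulative_purine_skew(seq)
reverse = cumulative_purine_skew(reverse_complement(seq))
```

    A strand whose skew rises is purine-rich and its complement falls by the same amount, which is the asymmetry that lets a skew suggest the coding strand.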

  5. Prediction of lymphatic metastasis based on gene expression profile analysis after brachytherapy for early-stage oral tongue carcinoma

    International Nuclear Information System (INIS)

    Watanabe, Hiroshi; Mogushi, Kaoru; Miura, Masahiko; Yoshimura, Ryo-ichi; Kurabayashi, Tohru; Shibuya, Hitoshi; Tanaka, Hiroshi; Noda, Shuhei; Iwakawa, Mayumi; Imai, Takashi

    2008-01-01

    Background and purpose: The management of lymphatic metastasis of early-stage oral tongue carcinoma patients is crucial for its prognosis. The purpose of this study was to evaluate the predictive ability of lymphatic metastasis after brachytherapy (BRT) for early-stage tongue carcinoma based on gene expression profiling. Patients and methods: Pre-therapeutic biopsies from 39 patients with T1 or T2 tongue cancer were analyzed for gene expression signatures using Codelink Uniset Human 20K Bioarray. All patients were treated with low dose-rate BRT for their primary lesions and underwent strict follow-up under a wait-and-see policy for cervical lymphatic metastasis. Candidate genes were selected for predicting lymph-node status in the reference group by the permutation test. Predictive accuracy was further evaluated by the prediction strength (PS) scoring system using an independent validation group. Results: We selected a set of 19 genes whose expression differed significantly between classes with or without lymphatic metastasis in the reference group. The lymph-node status in the validation group was predicted by the PS scoring system with an accuracy of 76%. Conclusions: Gene expression profiling using 19 genes in primary tumor tissues may allow prediction of lymphatic metastasis after BRT for early-stage oral tongue carcinoma

  6. Expression Pattern Similarities Support the Prediction of Orthologs Retaining Common Functions after Gene Duplication Events

    Science.gov (United States)

    Haberer, Georg; Panda, Arup; Das Laha, Shayani; Ghosh, Tapas Chandra; Schäffner, Anton R.

    2016-01-01

    The identification of functionally equivalent, orthologous genes (functional orthologs) across genomes is necessary for accurate transfer of experimental knowledge from well-characterized organisms to others. This frequently relies on automated, coding sequence-based approaches such as OrthoMCL, Inparanoid, and KOG, which usually work well for one-to-one homologous states. However, this strategy does not reliably work for plants due to the occurrence of extensive gene/genome duplication. Frequently, for one query gene, multiple orthologous genes are predicted in the other genome, and it is not clear a priori from sequence comparison and similarity which one preserves the ancestral function. We have studied 11 organ-dependent and stress-induced gene expression patterns of 286 Arabidopsis lyrata duplicated gene groups and compared them with the respective Arabidopsis (Arabidopsis thaliana) genes to predict putative expressologs and nonexpressologs based on gene expression similarity. Promoter sequence divergence as an additional tool to substantiate functional orthology only partially overlapped with expressolog classification. By cloning eight A. lyrata homologs and complementing them in the respective four Arabidopsis loss-of-function mutants, we experimentally proved that predicted expressologs are indeed functional orthologs, while nonexpressologs or nonfunctionalized orthologs are not. Our study demonstrates that even a small set of gene expression data in addition to sequence homologies are instrumental in the assignment of functional orthologs in the presence of multiple orthologs. PMID:27303025
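
    Ranking duplicated candidates by expression similarity can be sketched as follows. The abstract does not state which similarity measure was used, so Pearson correlation is an assumption here, and the profiles and gene-copy names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_candidates(query_profile, candidates):
    """Sort candidate orthologs by expression similarity to the query, best first."""
    scores = {name: pearson(query_profile, prof) for name, prof in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical expression levels across five conditions.
query = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "copy1": [2.0, 4.1, 5.9, 8.0, 10.2],  # follows the query pattern
    "copy2": [5.0, 4.0, 3.0, 2.0, 1.0],   # opposite pattern
}
ranking = rank_candidates(query, candidates)
```

    In this toy case the top-ranked copy would be the expressolog candidate, while the anticorrelated copy would be a nonexpressolog.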

  7. A statistical method for predicting splice variants between two groups of samples using GeneChip® expression array data

    Directory of Open Access Journals (Sweden)

    Olson James M

    2006-04-01

    Full Text Available Abstract Background Alternative splicing of pre-messenger RNA results in RNA variants with combinations of selected exons. It is one of the essential biological functions and regulatory components in higher eukaryotic cells. Some of these variants are detectable with the Affymetrix GeneChip® that uses multiple oligonucleotide probes (i.e. probe set), since the target sequences for the multiple probes are adjacent within each gene. Hybridization intensity from a probe correlates with abundance of the corresponding transcript. Although the multiple-probe feature in the current GeneChip® was designed to assess expression values of individual genes, it also measures transcriptional abundance for a sub-region of a gene sequence. This additional capacity motivated us to develop a method to predict alternative splicing, taking advantage of extensive repositories of GeneChip® gene expression array data. Results We developed a two-step approach to predict alternative splicing from GeneChip® data. First, we clustered the probes from a probe set into pseudo-exons based on similarity of probe intensities and physical adjacency. A pseudo-exon is defined as a sequence in the gene within which multiple probes have comparable probe intensity values. Second, for each pseudo-exon, we assessed the statistical significance of the difference in probe intensity between two groups of samples. Differentially expressed pseudo-exons are predicted to be alternatively spliced. We applied our method to empirical data generated from GeneChip® Hu6800 arrays, which include 7129 probe sets and twenty probes per probe set. The dataset consists of sixty-nine medulloblastoma (27 metastatic and 42 non-metastatic) samples and four cerebellum samples as normal controls. We predicted that 577 genes would be alternatively spliced when we compared normal cerebellum samples to medulloblastomas, and predicted that thirteen genes would be alternatively spliced when we compared metastatic

  8. ABC gene-ranking for prediction of drug-induced cholestasis in rats

    Directory of Open Access Journals (Sweden)

    Yauheniya Cherkas

    drugs that behaved very differently, and were distinct from both non-cholestatic and cholestatic drugs (ketoconazole, dipyridamole, cyproheptadine and aniline), and many postulated human cholestatic drugs that in rat showed no evidence of cholestasis (chlorpromazine, erythromycin, niacin, captopril, dapsone, rifampicin, glibenclamide, simvastatin, furosemide, tamoxifen, and sulfamethoxazole). Most of these latter drugs were noted previously by other groups as showing cholestasis only in humans. The results of this work suggest that the ABC procedure and similar statistical approaches can be instrumental in combining data to compare toxicants across toxicogenomics databases, extract similarities among responses and reduce unexplained data variation. Keywords: Cluster analysis, Cholestasis, Gene signature, Microarray, Prediction, Toxicogenomics

  9. Computational Prediction of MicroRNAs from Toxoplasma gondii Potentially Regulating the Hosts’ Gene Expression

    Directory of Open Access Journals (Sweden)

    Müşerref Duygu Saçar

    2014-10-01

    Full Text Available MicroRNAs (miRNAs) were discovered two decades ago, yet there is still a great need for further studies elucidating their genesis and targeting in different phyla. Since experimental discovery and validation of miRNAs is difficult, computational predictions are indispensable and today most computational approaches employ machine learning. Toxoplasma gondii, a parasite residing within the cells of its hosts like human, uses miRNAs for its post-transcriptional gene regulation. It may also regulate its hosts’ gene expression, which has been shown in brain cancer. Since previous studies have shown that overexpressed miRNAs within the host are causal for disease onset, we hypothesized that T. gondii could export miRNAs into its host cell. We computationally predicted all hairpins from the genome of T. gondii and used mouse and human models to filter possible candidates. These were then further compared to known miRNAs in human and rodents and their expression was examined for T. gondii grown in mouse and human hosts, respectively. We found that among the millions of potential hairpins in T. gondii, only a few thousand pass filtering using a human or mouse model and that even fewer of those are expressed. Since they are expressed and differentially expressed in rodents and human, we suggest that there is a chance that T. gondii may export miRNAs into its hosts for direct regulation.

  10. Prediction of Associations between microRNAs and Gene Expression in Glioma Biology.

    Directory of Open Access Journals (Sweden)

    Stefan Wuchty

    Full Text Available Despite progress in the determination of miR interactions, their regulatory role in cancer is only beginning to be unraveled. Utilizing gene expression data from 27 glioblastoma samples we found that the mere knowledge of physical interactions between specific mRNAs and miRs can be used to determine associated regulatory interactions, allowing us to identify 626 associated interactions, involving 128 miRs that putatively modulate the expression of 246 mRNAs. Experimentally determining the expression of miRs, we found an over-representation of over(under)-expressed miRs with various predicted mRNA target sequences. Such significantly associated miRs that putatively bind over-expressed genes strongly tend to have binding sites near the 3'UTR of the corresponding mRNAs, suggesting that the presence of the miRs near the translation stop site may be a factor in their regulatory ability. Our analysis predicted a significant association between miR-128 and the protein kinase WEE1, which we subsequently validated experimentally by showing that the over-expression of the naturally under-expressed miR-128 in glioma cells resulted in the inhibition of WEE1 in glioblastoma cells.

  11. Can survival prediction be improved by merging gene expression data sets?

    Directory of Open Access Journals (Sweden)

    Haleh Yasrebi

    Full Text Available BACKGROUND: High-throughput gene expression profiling technologies, generating a wealth of data, are increasingly used for characterization of tumor biopsies for clinical trials. By applying machine learning algorithms to such clinically documented data sets, one hopes to improve tumor diagnosis, prognosis, as well as prediction of treatment response. However, the limited number of patients enrolled in a single trial study limits the power of machine learning approaches due to over-fitting. One could partially overcome this limitation by merging data from different studies. Nevertheless, such data sets differ from each other with regard to technical biases, patient selection criteria and follow-up treatment. It is therefore not clear at all whether the advantage of increased sample size outweighs the disadvantage of higher heterogeneity of merged data sets. Here, we present a systematic study to answer this question specifically for breast cancer data sets. We use survival prediction based on Cox regression as an assay to measure the added value of merged data sets. RESULTS: Using time-dependent Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) and hazard ratio as performance measures, we see overall no significant improvement or deterioration of survival prediction with merged data sets as compared to individual data sets. This apparently was due to the fact that a few genes with strong prognostic power were not available on all microarray platforms and thus were not retained in the merged data sets. Surprisingly, we found that the overall best performance was achieved with a single-gene predictor consisting of CYB5D1. CONCLUSIONS: Merging did not deteriorate performance on average despite (a) the diversity of microarray platforms used, (b) the heterogeneity of patient cohorts, (c) the heterogeneity of breast cancer disease, (d) substantial variation of time to death or relapse, (e) the reduced number of genes in the merged data

  12. Chronic and Acute Stress, Gender, and Serotonin Transporter Gene-Environment Interactions Predicting Depression Symptoms in Youth

    Science.gov (United States)

    Hammen, Constance; Brennan, Patricia A.; Keenan-Miller, Danielle; Hazel, Nicholas A.; Najman, Jake M.

    2010-01-01

    Background: Many recent studies of serotonin transporter gene by environment effects predicting depression have used stress assessments with undefined or poor psychometric methods, possibly contributing to wide variation in findings. The present study attempted to distinguish between effects of acute and chronic stress to predict depressive…

  13. Bioinformatic Prediction of Gene Functions Regulated by Quorum Sensing in the Bioleaching Bacterium Acidithiobacillus ferrooxidans

    Directory of Open Access Journals (Sweden)

    Alvaro Banderas

    2013-08-01

    Full Text Available The biomining bacterium Acidithiobacillus ferrooxidans oxidizes sulfide ores and promotes metal solubilization. The efficiency of this process depends on the attachment of cells to surfaces, a process regulated by quorum sensing (QS) cell-to-cell signalling in many Gram-negative bacteria. At. ferrooxidans has a functional QS system and the presence of AHLs enhances its attachment to pyrite. However, direct targets of the QS transcription factor AfeR remain unknown. In this study, a bioinformatic approach was used to infer possible AfeR direct targets based on the particular palindromic features of the AfeR binding site. A set of Hidden Markov Models designed to maintain palindromic regions and vary non-palindromic regions was used to screen for putative binding sites. By annotating the context of each predicted binding site (PBS), we classified them according to their positional coherence relative to other putative genomic structures such as start codons, RNA polymerase promoter elements and intergenic regions. We further used the Multiple EM for Motif Elicitation algorithm (MEME) to further filter out low homology PBSs. In summary, 75 target-genes were identified, 34 of which have a higher confidence level. Among the identified genes, we found afeR itself, zwf, genes encoding glycosyltransferase activities, metallo-beta lactamases, and active transport-related proteins. Glycosyltransferases and Zwf (Glucose 6-phosphate-1-dehydrogenase) might be directly involved in polysaccharide biosynthesis and attachment to minerals by At. ferrooxidans cells during the bioleaching process.

  14. Bioinformatic Prediction of Gene Functions Regulated by Quorum Sensing in the Bioleaching Bacterium Acidithiobacillus ferrooxidans

    Science.gov (United States)

    Banderas, Alvaro; Guiliani, Nicolas

    2013-01-01

    The biomining bacterium Acidithiobacillus ferrooxidans oxidizes sulfide ores and promotes metal solubilization. The efficiency of this process depends on the attachment of cells to surfaces, a process regulated by quorum sensing (QS) cell-to-cell signalling in many Gram-negative bacteria. At. ferrooxidans has a functional QS system and the presence of AHLs enhances its attachment to pyrite. However, direct targets of the QS transcription factor AfeR remain unknown. In this study, a bioinformatic approach was used to infer possible AfeR direct targets based on the particular palindromic features of the AfeR binding site. A set of Hidden Markov Models designed to maintain palindromic regions and vary non-palindromic regions was used to screen for putative binding sites. By annotating the context of each predicted binding site (PBS), we classified them according to their positional coherence relative to other putative genomic structures such as start codons, RNA polymerase promoter elements and intergenic regions. We further used the Multiple EM for Motif Elicitation algorithm (MEME) to further filter out low homology PBSs. In summary, 75 target-genes were identified, 34 of which have a higher confidence level. Among the identified genes, we found afeR itself, zwf, genes encoding glycosyltransferase activities, metallo-beta lactamases, and active transport-related proteins. Glycosyltransferases and Zwf (Glucose 6-phosphate-1-dehydrogenase) might be directly involved in polysaccharide biosynthesis and attachment to minerals by At. ferrooxidans cells during the bioleaching process. PMID:23959118

  15. Calibration of Multiple In Silico Tools for Predicting Pathogenicity of Mismatch Repair Gene Missense Substitutions

    Science.gov (United States)

    Thompson, Bryony A.; Greenblatt, Marc S.; Vallee, Maxime P.; Herkert, Johanna C.; Tessereau, Chloe; Young, Erin L.; Adzhubey, Ivan A.; Li, Biao; Bell, Russell; Feng, Bingjian; Mooney, Sean D.; Radivojac, Predrag; Sunyaev, Shamil R.; Frebourg, Thierry; Hofstra, Robert M.W.; Sijmons, Rolf H.; Boucher, Ken; Thomas, Alun; Goldgar, David E.; Spurdle, Amanda B.; Tavtigian, Sean V.

    2015-01-01

    Classification of rare missense substitutions observed during genetic testing for patient management is a considerable problem in clinical genetics. The Bayesian integrated evaluation of unclassified variants is a solution originally developed for BRCA1/2. Here, we take a step toward an analogous system for the mismatch repair (MMR) genes (MLH1, MSH2, MSH6, and PMS2) that confer colon cancer susceptibility in Lynch syndrome by calibrating in silico tools to estimate prior probabilities of pathogenicity for MMR gene missense substitutions. A qualitative five-class classification system was developed and applied to 143 MMR missense variants. This identified 74 missense substitutions suitable for calibration. These substitutions were scored using six different in silico tools (Align-Grantham Variation Grantham Deviation, multivariate analysis of protein polymorphisms [MAPP], Mut-Pred, PolyPhen-2.1, Sorting Intolerant From Tolerant, and Xvar), using curated MMR multiple sequence alignments where possible. The output from each tool was calibrated by regression against the classifications of the 74 missense substitutions; these calibrated outputs are interpretable as prior probabilities of pathogenicity. MAPP was the most accurate tool and MAPP + PolyPhen-2.1 provided the best-combined model (R2 = 0.62 and area under receiver operating characteristic = 0.93). The MAPP + PolyPhen-2.1 output is sufficiently predictive to feed as a continuous variable into the quantitative Bayesian integrated evaluation for clinical classification of MMR gene missense substitutions. PMID:22949387

  16. In silico prediction of functional loss of cst3 gene in hereditary cerebral amyloid angiopathy

    Directory of Open Access Journals (Sweden)

    Piyush Choudhary

    2013-12-01

    Full Text Available This study computationally identifies missense mutations in the CST3 (CYSTATIN 3 or CYSTATIN C) gene. Missense mutations in the CST3 gene lead to hereditary cerebral amyloid angiopathy. The analysis began with SIFT, followed by PolyPhen-2 and I-Mutant 2.0, using 24 variants of the Homo sapiens CST3 gene derived from dbSNP. The analysis showed that 5 variants (Y60C, C123Y, L19P, Y88C, L94Q) were found to be less stable and damaging by SIFT, PolyPhen-2 and I-Mutant 2.0. Furthermore, the outputs of SNP&GO, together with those of PhD-SNP (Predictor of Human Deleterious Single Nucleotide Polymorphisms) and PANTHER, predict 5 variants (Y60C, Y88C, C123Y, L19P, and L94Q) as having clinical impact in causing the disease. These findings should be helpful to medical practitioners in the treatment of cerebral amyloid angiopathy.

  17. The Physalis peruviana leaf transcriptome: assembly, annotation and gene model prediction

    Directory of Open Access Journals (Sweden)

    Garzón-Martínez Gina A

    2012-04-01

    Full Text Available Abstract Background Physalis peruviana commonly known as Cape gooseberry is a member of the Solanaceae family that has an increasing popularity due to its nutritional and medicinal values. A broad range of genomic tools is available for other Solanaceae, including tomato and potato. However, limited genomic resources are currently available for Cape gooseberry. Results We report the generation of a total of 652,614 P. peruviana Expressed Sequence Tags (ESTs), using 454 GS FLX Titanium technology. ESTs, with an average length of 371 bp, were obtained from a normalized leaf cDNA library prepared using a Colombian commercial variety. De novo assembling was performed to generate a collection of 24,014 isotigs and 110,921 singletons, with an average length of 1,638 bp and 354 bp, respectively. Functional annotation was performed using NCBI’s BLAST tools and Blast2GO, which identified putative functions for 21,191 assembled sequences, including gene families involved in all the major biological processes and molecular functions as well as defense response and amino acid metabolism pathways. Gene model predictions in P. peruviana were obtained by using the genomes of Solanum lycopersicum (tomato) and Solanum tuberosum (potato). We predict 9,436 P. peruviana sequences with multiple-exon models and conserved intron positions with respect to the potato and tomato genomes. Additionally, to study species diversity we developed 5,971 SSR markers from assembled ESTs. Conclusions We present the first comprehensive analysis of the Physalis peruviana leaf transcriptome, which will provide valuable resources for development of genetic tools in the species. Assembled transcripts with gene models could serve as potential candidates for marker discovery with a variety of applications including: functional diversity, conservation and improvement to increase productivity and fruit quality. P. peruviana was estimated to be phylogenetically branched out before the

  18. The Physalis peruviana leaf transcriptome: assembly, annotation and gene model prediction.

    Science.gov (United States)

    Garzón-Martínez, Gina A; Zhu, Z Iris; Landsman, David; Barrero, Luz S; Mariño-Ramírez, Leonardo

    2012-04-25

    Physalis peruviana commonly known as Cape gooseberry is a member of the Solanaceae family that has an increasing popularity due to its nutritional and medicinal values. A broad range of genomic tools is available for other Solanaceae, including tomato and potato. However, limited genomic resources are currently available for Cape gooseberry. We report the generation of a total of 652,614 P. peruviana Expressed Sequence Tags (ESTs), using 454 GS FLX Titanium technology. ESTs, with an average length of 371 bp, were obtained from a normalized leaf cDNA library prepared using a Colombian commercial variety. De novo assembling was performed to generate a collection of 24,014 isotigs and 110,921 singletons, with an average length of 1,638 bp and 354 bp, respectively. Functional annotation was performed using NCBI's BLAST tools and Blast2GO, which identified putative functions for 21,191 assembled sequences, including gene families involved in all the major biological processes and molecular functions as well as defense response and amino acid metabolism pathways. Gene model predictions in P. peruviana were obtained by using the genomes of Solanum lycopersicum (tomato) and Solanum tuberosum (potato). We predict 9,436 P. peruviana sequences with multiple-exon models and conserved intron positions with respect to the potato and tomato genomes. Additionally, to study species diversity we developed 5,971 SSR markers from assembled ESTs. We present the first comprehensive analysis of the Physalis peruviana leaf transcriptome, which will provide valuable resources for development of genetic tools in the species. Assembled transcripts with gene models could serve as potential candidates for marker discovery with a variety of applications including: functional diversity, conservation and improvement to increase productivity and fruit quality. P. peruviana was estimated to be phylogenetically branched out before the divergence of five other Solanaceae family members, S

  19. Whole genome transcript profiling of drug induced steatosis in rats reveals a gene signature predictive of outcome.

    Directory of Open Access Journals (Sweden)

    Nishika Sahini

    Full Text Available Drug induced steatosis (DIS) is characterised by excess triglyceride accumulation in the form of lipid droplets (LD) in liver cells. To explore mechanisms underlying DIS we interrogated the publicly available microarray data from the Japanese Toxicogenomics Project (TGP) to study comprehensively whole genome gene expression changes in the liver of treated rats. For this purpose a total of 17 and 12 drugs which are diverse in molecular structure and mode of action were considered based on their ability to cause either steatosis or phospholipidosis, respectively, while 7 drugs served as negative controls. In our efforts we focused on 200 genes which are considered to be mechanistically relevant in the process of lipid droplet biogenesis in hepatocytes as recently published (Sahini and Borlak, 2014). Based on mechanistic considerations we identified 19 genes which displayed dose dependent responses while 10 genes showed time dependency. Importantly, the present study defined 9 genes (ANGPTL4, FABP7, FADS1, FGF21, GOT1, LDLR, GK, STAT3, and PKLR) as signature genes to predict DIS. Moreover, cross tabulation revealed 9 genes to be regulated ≥10 times amongst the various conditions and included genes linked to glucose metabolism, lipid transport and lipogenesis as well as signalling events. Additionally, a comparison between drugs causing phospholipidosis and/or steatosis revealed 26 genes to be regulated in common including 4 signature genes to predict DIS (PKLR, GK, FABP7 and FADS1). Furthermore, a comparison between in vivo single dose (3, 6, 9 and 24 h) and findings from rat hepatocyte studies (2 h, 8 h, 24 h) identified 10 genes which are regulated in common and contained 2 DIS signature genes (FABP7, FGF21). Altogether, our studies provide comprehensive information on mechanistically linked gene expression changes of a range of drugs causing steatosis and phospholipidosis and encourage the screening of DIS signature genes at the preclinical stage.

  20. Landscape genetics as a tool for conservation planning: predicting the effects of landscape change on gene flow.

    Science.gov (United States)

    van Strien, Maarten J; Keller, Daniela; Holderegger, Rolf; Ghazoul, Jaboury; Kienast, Felix; Bolliger, Janine

    2014-03-01

    For conservation managers, it is important to know whether landscape changes lead to increasing or decreasing gene flow. Although the discipline of landscape genetics assesses the influence of landscape elements on gene flow, no studies have yet used landscape-genetic models to predict gene flow resulting from landscape change. A species that has already been severely affected by landscape change is the large marsh grasshopper (Stethophyma grossum), which inhabits moist areas in fragmented agricultural landscapes in Switzerland. From transects drawn between all population pairs within maximum dispersal distance (landscape composition as well as some measures of habitat configuration. Additionally, a complete sampling of all populations in our study area allowed incorporating measures of population topology. These measures together with the landscape metrics formed the predictor variables in linear models with gene flow as response variable (F(ST) and mean pairwise assignment probability). With a modified leave-one-out cross-validation approach, we selected the model with the highest predictive accuracy. With this model, we predicted gene flow under several landscape-change scenarios, which simulated construction, rezoning or restoration projects, and the establishment of a new population. For some landscape-change scenarios, significant increase or decrease in gene flow was predicted, while for others little change was forecast. Furthermore, we found that the measures of population topology strongly increase model fit in landscape genetic analysis. This study demonstrates the use of predictive landscape-genetic models in conservation and landscape planning.

  1. Prediction of metabolic flux distribution from gene expression data based on the flux minimization principle.

    Directory of Open Access Journals (Sweden)

    Hyun-Seob Song

    Full Text Available Prediction of possible flux distributions in a metabolic network provides detailed phenotypic information that links metabolism to cellular physiology. To estimate metabolic steady-state fluxes, the most common approach is to solve a set of macroscopic mass balance equations subjected to stoichiometric constraints while attempting to optimize an assumed optimal objective function. This assumption is justifiable in specific cases but may be invalid when tested across different conditions, cell populations, or other organisms. With an aim to providing a more consistent and reliable prediction of flux distributions over a wide range of conditions, in this article we propose a framework that uses the flux minimization principle to predict active metabolic pathways from mRNA expression data. The proposed algorithm minimizes a weighted sum of flux magnitudes, while biomass production can be bounded to fit an ample range from very low to very high values according to the analyzed context. We have formulated the flux weights as a function of the corresponding enzyme reaction's gene expression value, enabling the creation of context-specific fluxes based on a generic metabolic network. In case studies of wild-type Saccharomyces cerevisiae, and wild-type and mutant Escherichia coli strains, our method achieved high prediction accuracy, as gauged by correlation coefficients and sums of squared error, with respect to the experimentally measured values. In contrast to other approaches, our method was able to provide quantitative predictions for both model organisms under a variety of conditions. Our approach requires no prior knowledge or assumption of a context-specific metabolic functionality and does not require trial-and-error parameter adjustments. Thus, our framework is of general applicability for modeling the transcription-dependent metabolism of bacteria and yeasts.
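
    The optimization described in this abstract can be written, as a rough sketch only, in the following form. The symbols are assumptions introduced here for illustration: S is the stoichiometric matrix, v the flux vector, g_j the expression value of the gene(s) behind reaction j, and f an unspecified mapping from expression to weight (the abstract does not state its form).

```latex
\begin{aligned}
\min_{v}\quad & \sum_{j} w_j \,\lvert v_j \rvert, \qquad w_j = f(g_j) \\
\text{subject to}\quad & S\,v = 0, \\
& v_j^{\min} \le v_j \le v_j^{\max} \quad \text{for all } j, \\
& \mu^{\min} \le v_{\mathrm{biomass}} \le \mu^{\max}.
\end{aligned}
```

    A problem of this shape becomes a linear program once each flux is split into non-negative forward and reverse parts, v_j = v_j^+ - v_j^-, so that |v_j| can be replaced by v_j^+ + v_j^-.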

  2. Mining predicted essential genes of Brugia malayi for nematode drug targets.

    Directory of Open Access Journals (Sweden)

    Sanjay Kumar

    Full Text Available We report results from the first genome-wide application of a rational drug target selection methodology to a metazoan pathogen genome, the completed draft sequence of Brugia malayi, a parasitic nematode responsible for human lymphatic filariasis. More than 1.5 billion people worldwide are at risk of contracting lymphatic filariasis and onchocerciasis, a related filarial disease. Drug treatments for filariasis have not changed significantly in over 20 years, and with the risk of resistance rising, there is an urgent need for the development of new anti-filarial drug therapies. The recent publication of the draft genomic sequence for B. malayi enables a genome-wide search for new drug targets. However, there is no functional genomics data in B. malayi to guide the selection of potential drug targets. To circumvent this problem, we have utilized the free-living model nematode Caenorhabditis elegans as a surrogate for B. malayi. Sequence comparisons between the two genomes allow us to map C. elegans orthologs to B. malayi genes. Using these orthology mappings and by incorporating the extensive genomic and functional genomic data, including genome-wide RNAi screens, that already exist for C. elegans, we identify potentially essential genes in B. malayi. Further incorporation of human host genome sequence data and a custom algorithm for prioritization enables us to collect and rank nearly 600 drug target candidates. Previously identified potential drug targets cluster near the top of our prioritized list, lending credibility to our methodology. Over-represented Gene Ontology terms, predicted InterPro domains, and RNAi phenotypes of C. elegans orthologs associated with the potential target pool are identified. By virtue of the selection procedure, the potential B. malayi drug targets highlight components of key processes in nematode biology such as central metabolism, molting and regulation of gene expression.

  3. Integrating circadian activity and gene expression profiles to predict chronotoxicity of Drosophila suzukii response to insecticides.

    Science.gov (United States)

    Hamby, Kelly A; Kwok, Rosanna S; Zalom, Frank G; Chiu, Joanna C

    2013-01-01

    Native to Southeast Asia, Drosophila suzukii (Matsumura) is a recent invader that infests intact ripe and ripening fruit, leading to significant crop losses in the U.S., Canada, and Europe. Since current D. suzukii management strategies rely heavily on insecticide usage and insecticide detoxification gene expression is under circadian regulation in the closely related Drosophila melanogaster, we set out to determine if integrative analysis of daily activity patterns and detoxification gene expression can predict chronotoxicity of D. suzukii to insecticides. Locomotor assays were performed under conditions that approximate a typical summer or winter day in Watsonville, California, where D. suzukii was first detected in North America. As expected, daily activity patterns of D. suzukii appeared quite different between 'summer' and 'winter' conditions due to differences in photoperiod and temperature. In the 'summer', D. suzukii assumed a more bimodal activity pattern, with maximum activity occurring at dawn and dusk. In the 'winter', activity was unimodal and restricted to the warmest part of the circadian cycle. Expression analysis of six detoxification genes and acute contact bioassays were performed at multiple circadian times, but only in conditions approximating Watsonville summer, the cropping season, when most insecticide applications occur. Five of the genes tested exhibited rhythmic expression, with the majority showing peak expression at dawn (ZT0, 6am). We observed significant differences in the chronotoxicity of D. suzukii towards malathion, with highest susceptibility at ZT0 (6am), corresponding to peak expression of cytochrome P450s that may be involved in bioactivation of malathion. High activity levels were not found to correlate with high insecticide susceptibility as initially hypothesized. Chronobiology and chronotoxicity of D. suzukii provide valuable insights for monitoring and control efforts, because insect activity as well as insecticide timing

  4. A novel gene network inference algorithm using predictive minimum description length approach.

    Science.gov (United States)

    Chaitankar, Vijender; Ghosh, Preetam; Perkins, Edward J; Gong, Ping; Deng, Youping; Zhang, Chaoyang

    2010-05-28

    Reverse engineering of gene regulatory networks using information theory models has received much attention due to its simplicity, low computational cost, and capability of inferring large networks. One of the major problems with information theory models is to determine the threshold which defines the regulatory relationships between genes. The minimum description length (MDL) principle has been implemented to overcome this problem. The description length of the MDL principle is the sum of model length and data encoding length. A user-specified fine tuning parameter is used as control mechanism between model and data encoding, but it is difficult to find the optimal parameter. In this work, we proposed a new inference algorithm which incorporated mutual information (MI), conditional mutual information (CMI) and predictive minimum description length (PMDL) principle to infer gene regulatory networks from DNA microarray data. In this algorithm, the information theoretic quantities MI and CMI determine the regulatory relationships between genes and the PMDL principle method attempts to determine the best MI threshold without the need for a user-specified fine tuning parameter. The performance of the proposed algorithm was evaluated using both synthetic time series data sets and a biological time series data set for the yeast Saccharomyces cerevisiae. The benchmark quantities precision and recall were used as performance measures. The results show that the proposed algorithm produced fewer false edges and significantly improved the precision, as compared to the existing algorithm. For further analysis the performance of the algorithms was observed over different sizes of data. We have proposed a new algorithm that implements the PMDL principle for inferring gene regulatory networks from time series DNA microarray data that eliminates the need for a fine tuning parameter. The evaluation results obtained from both synthetic and actual biological data sets show that the

  5. Predicting spatial and temporal gene expression using an integrative model of transcription factor occupancy and chromatin state.

    Directory of Open Access Journals (Sweden)

    Bartek Wilczynski

    Full Text Available Precise patterns of spatial and temporal gene expression are central to metazoan complexity and act as a driving force for embryonic development. While there has been substantial progress in dissecting and predicting cis-regulatory activity, our understanding of how information from multiple enhancer elements converge to regulate a gene's expression remains elusive. This is in large part due to the number of different biological processes involved in mediating regulation as well as limited availability of experimental measurements for many of them. Here, we used a Bayesian approach to model diverse experimental regulatory data, leading to accurate predictions of both spatial and temporal aspects of gene expression. We integrated whole-embryo information on transcription factor recruitment to multiple cis-regulatory modules, insulator binding and histone modification status in the vicinity of individual gene loci, at a genome-wide scale during Drosophila development. The model uses Bayesian networks to represent the relation between transcription factor occupancy and enhancer activity in specific tissues and stages. All parameters are optimized in an Expectation Maximization procedure providing a model capable of predicting tissue- and stage-specific activity of new, previously unassayed genes. Performing the optimization with subsets of input data demonstrated that neither enhancer occupancy nor chromatin state alone can explain all gene expression patterns, but taken together allow for accurate predictions of spatio-temporal activity. Model predictions were validated using the expression patterns of more than 600 genes recently made available by the BDGP consortium, demonstrating an average 15-fold enrichment of genes expressed in the predicted tissue over a naïve model. We further validated the model by experimentally testing the expression of 20 predicted target genes of unknown expression, resulting in an accuracy of 95% for temporal

  6. Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes.

    Directory of Open Access Journals (Sweden)

    Christof Winter

    Full Text Available Predicting the clinical outcome of cancer patients based on the expression of marker genes in their tumors has received increasing interest in the past decade. Accurate predictors of outcome and response to therapy could be used to personalize and thereby improve therapy. However, state of the art methods used so far often found marker genes with limited prediction accuracy, limited reproducibility, and unclear biological relevance. To address this problem, we developed a novel computational approach to identify genes prognostic for outcome that couples gene expression measurements from primary tumor samples with a network of known relationships between the genes. Our approach ranks genes according to their prognostic relevance using both expression and network information in a manner similar to Google's PageRank. We applied this method to gene expression profiles which we obtained from 30 patients with pancreatic cancer, and identified seven candidate marker genes prognostic for outcome. Compared to genes found with state of the art methods, such as Pearson correlation of gene expression with survival time, we improve the prediction accuracy by up to 7%. Accuracies were assessed using support vector machine classifiers and Monte Carlo cross-validation. We then validated the prognostic value of our seven candidate markers using immunohistochemistry on an independent set of 412 pancreatic cancer samples. Notably, signatures derived from our candidate markers were independently predictive of outcome and superior to established clinical prognostic factors such as grade, tumor size, and nodal status. As the amount of genomic data of individual tumors grows rapidly, our algorithm meets the need for powerful computational approaches that are key to exploit these data for personalized cancer therapies in clinical practice.
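
    The idea of ranking genes by combining expression and network information "in a manner similar to Google's PageRank" can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name network_rank, the toy network, and the use of an expression-derived prior as the teleport vector are all assumptions made here for illustration.

```python
def network_rank(adjacency, prior, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank-style power iteration with a gene-specific prior.

    adjacency[j][i] == 1 means gene j is linked to gene i (hypothetical network).
    prior[i] is a non-negative relevance score for gene i, e.g. the absolute
    correlation of its expression with survival (an assumption of this sketch).
    """
    n = len(prior)
    total = sum(prior)  # must be positive
    p = [x / total for x in prior]
    out_degree = [sum(row) for row in adjacency]
    rank = p[:]
    for _ in range(max_iter):
        # Score held by genes without any link is redistributed via the prior.
        dangling = sum(rank[j] for j in range(n) if out_degree[j] == 0)
        new_rank = []
        for i in range(n):
            incoming = sum(
                adjacency[j][i] * rank[j] / out_degree[j]
                for j in range(n)
                if out_degree[j] > 0
            )
            new_rank.append((1 - damping) * p[i] + damping * (incoming + dangling * p[i]))
        converged = sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy example: gene 0 is a hub linked to genes 1-3 but has a weak prior.
adjacency = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
prior = [0.1, 0.5, 0.5, 0.5]
scores = network_rank(adjacency, prior)
```

    In this toy network the hub gene ends up with the highest score even though its own prior is the lowest, which is the kind of effect a network-aware ranking is meant to capture.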

  7. Gene expression patterns in formalin-fixed, paraffin-embedded core biopsies predict docetaxel chemosensitivity in breast cancer patients.

    Science.gov (United States)

    Chang, Jenny C; Makris, Andreas; Gutierrez, M Carolina; Hilsenbeck, Susan G; Hackett, James R; Jeong, Jennie; Liu, Mei-Lan; Baker, Joffre; Clark-Langone, Kim; Baehner, Frederick L; Sexton, Krsytal; Mohsin, Syed; Gray, Tara; Alvarez, Laura; Chamness, Gary C; Osborne, C Kent; Shak, Steven

    2008-03-01

    Previously, we had identified gene expression patterns that predicted response to neoadjuvant docetaxel. Other studies have validated that a high Recurrence Score (RS) by the 21-gene RT-PCR assay is predictive of worse prognosis but better response to chemotherapy. We investigated whether tumor expression of these 21 genes and other candidate genes can predict response to docetaxel. Core biopsies from 97 patients were obtained before treatment with neoadjuvant docetaxel (4 cycles, 100 mg/m² every 3 weeks). Three 10-µm FFPE sections were submitted for quantitative RT-PCR assays of 192 genes that were selected from our previous work and the literature. Of the 97 patients, 81 (84%) had sufficient invasive cancer, 80 (82%) had sufficient RNA for the quantitative RT-PCR assay, and 72 (74%) had clinical response data. Mean age was 48.5 years, and the median tumor size was 6 cm. Clinical complete responses (CR) were observed in 12 (17%), partial responses in 41 (57%), stable disease in 17 (24%), and progressive disease in 2 patients (3%). A significant relationship (P<0.05) between gene expression and CR was observed for 14 genes, including CYBA. CR was associated with lower expression of the ER gene group and higher expression of the proliferation gene group from the 21 gene assay. Of note, CR was more likely with a high RS (P=0.008). We have established molecular profiles of sensitivity to docetaxel. RT-PCR technology provides a potential platform for a predictive test of docetaxel chemosensitivity using small amounts of routinely processed material.

  8. Prediction of essential proteins based on subcellular localization and gene expression correlation.

    Science.gov (United States)

    Fan, Yetian; Tang, Xiwei; Hu, Xiaohua; Wu, Wei; Ping, Qing

    2017-12-01

    Essential proteins are indispensable to the survival and development process of living organisms. To understand the functional mechanisms of essential proteins, which can be applied to the analysis of disease and design of drugs, it is important to identify essential proteins from a set of proteins first. As traditional experimental methods designed to test out essential proteins are usually expensive and laborious, computational methods, which utilize biological and topological features of proteins, have attracted more attention in recent years. Protein-protein interaction networks, together with other biological data, have been explored to improve the performance of essential protein prediction. The proposed method SCP is evaluated on Saccharomyces cerevisiae datasets and compared with five other methods. The results show that our method SCP outperforms the other five methods in terms of accuracy of essential protein prediction. In this paper, we propose a novel algorithm named SCP, which combines the ranking by a modified PageRank algorithm based on subcellular compartments information, with the ranking by Pearson correlation coefficient (PCC) calculated from gene expression data. Experiments show that subcellular localization information is promising in boosting essential protein prediction.

  9. Gene expression signatures that predict radiation exposure in mice and humans.

    Directory of Open Access Journals (Sweden)

    Holly K Dressman

    2007-04-01

    Full Text Available The capacity to assess environmental inputs to biological phenotypes is limited by methods that can accurately and quantitatively measure these contributions. One such example can be seen in the context of exposure to ionizing radiation. We have made use of gene expression analysis of peripheral blood (PB) mononuclear cells to develop expression profiles that accurately reflect prior radiation exposure. We demonstrate that expression profiles can be developed that not only predict radiation exposure in mice but also distinguish the level of radiation exposure, ranging from 50 cGy to 1,000 cGy. Likewise, a molecular signature of radiation response developed solely from irradiated human patient samples can predict and distinguish irradiated human PB samples from nonirradiated samples with an accuracy of 90%, sensitivity of 85%, and specificity of 94%. We further demonstrate that a radiation profile developed in the mouse can correctly distinguish PB samples from irradiated and nonirradiated human patients with an accuracy of 77%, sensitivity of 82%, and specificity of 75%. Taken together, these data demonstrate that molecular profiles can be generated that are highly predictive of different levels of radiation exposure in mice and humans. We suggest that this approach, with additional refinement, could provide a method to assess the effects of various environmental inputs into biological phenotypes as well as providing a more practical application of a rapid molecular screening test for the diagnosis of radiation exposure.
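
    For readers less familiar with the three figures quoted above, the sketch below shows how accuracy, sensitivity, and specificity are derived from predicted and true labels. The function name and the toy labels are hypothetical; they are not data from the study.

```python
def classification_metrics(truth, predicted):
    """Return accuracy, sensitivity and specificity for binary labels.

    truth and predicted are equal-length sequences of 0/1 values, where 1
    stands for "irradiated" and 0 for "nonirradiated" in this toy setting.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of irradiated samples detected
        "specificity": tn / (tn + fp),  # fraction of nonirradiated samples cleared
    }


# Made-up labels: 4 irradiated and 6 nonirradiated samples.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(truth, predicted)
```

    Because accuracy depends on the mix of irradiated and nonirradiated samples, it is usually read together with the other two values rather than on its own.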

  10. Response-predictive gene expression profiling of glioma progenitor cells in vitro.

    Directory of Open Access Journals (Sweden)

    Sylvia Moeckel

    Full Text Available High-grade gliomas are amongst the most deadly human tumors. Treatment results are disappointing. Still, in several trials around 20% of patients respond to therapy. To date, diagnostic strategies to identify patients that will profit from a specific therapy do not exist. In this study, we used serum-free short-term treated in vitro cell cultures to predict treatment response in vitro. This approach allowed us (a) to enrich specimens for brain tumor initiating cells and (b) to confront cells with a therapeutic agent before expression profiling. As a proof of principle we analyzed gene expression in 18 short-term serum-free cultures of high-grade gliomas enhanced for brain tumor initiating cells (BTIC) before and after in vitro treatment with the tyrosine kinase inhibitor Sunitinib. Profiles from treated progenitor cells allowed us to predict therapy-induced impairment of proliferation in vitro. For the tyrosine kinase inhibitor Sunitinib used in this dataset, the approach revealed additional predictive information in comparison to the evaluation of classical signaling analysis.

  11. DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data.

    Science.gov (United States)

    Arango-Argoty, Gustavo; Garner, Emily; Pruden, Amy; Heath, Lenwood S; Vikesland, Peter; Zhang, Liqing

    2018-02-01

    Growing concerns about increasing rates of antibiotic resistance call for expanded and comprehensive global monitoring. Advancing methods for monitoring of environmental media (e.g., wastewater, agricultural waste, food, and water) is especially needed for identifying potential resources of novel antibiotic resistance genes (ARGs), hot spots for gene exchange, and as pathways for the spread of ARGs and human exposure. Next-generation sequencing now enables direct access and profiling of the total metagenomic DNA pool, where ARGs are typically identified or predicted based on the "best hits" of sequence searches against existing databases. Unfortunately, this approach produces a high rate of false negatives. To address such limitations, we propose here a deep learning approach, taking into account a dissimilarity matrix created using all known categories of ARGs. Two deep learning models, DeepARG-SS and DeepARG-LS, were constructed for short read sequences and full gene length sequences, respectively. Evaluation of the deep learning models over 30 antibiotic resistance categories demonstrates that the DeepARG models can predict ARGs with both high precision (> 0.97) and recall (> 0.90). The models displayed an advantage over the typical best hit approach, yielding consistently lower false negative rates and thus higher overall recall (> 0.9). As more data become available for under-represented ARG categories, the DeepARG models' performance can be expected to be further enhanced due to the nature of the underlying neural networks. Our newly developed ARG database, DeepARG-DB, encompasses ARGs predicted with a high degree of confidence and extensive manual inspection, greatly expanding current ARG repositories. The deep learning models developed here offer more accurate antimicrobial resistance annotation relative to current bioinformatics practice. DeepARG does not require strict cutoffs, which enables identification of a much broader diversity of ARGs. The

  12. Gene expression programming for prediction of scour depth downstream of sills

    Science.gov (United States)

    Azamathulla, H. Md.

    2012-08-01

    Summary: Local scour is crucial in the degradation of river bed and the stability of grade control structures, stilling basins, aprons, ski-jump bucket spillways, bed sills, weirs, check dams, etc. This short communication presents gene-expression programming (GEP), which is an extension to genetic programming (GP), as an alternative approach to predict scour depth downstream of sills. Published data were compiled from the literature for the scour depth downstream of sills. The proposed GEP approach gives satisfactory results (R² = 0.967 and RMSE = 0.088) compared to the existing predictors (Chinnarasri and Kositgittiwong, 2008) with R² = 0.87 and RMSE = 2.452 for relative scour depth.

  13. Boolean Dynamic Modeling Approaches to Study Plant Gene Regulatory Networks: Integration, Validation, and Prediction.

    Science.gov (United States)

    Velderraín, José Dávila; Martínez-García, Juan Carlos; Álvarez-Buylla, Elena R

    2017-01-01

    Mathematical models based on dynamical systems theory are well-suited tools for the integration of available molecular experimental data into coherent frameworks in order to propose hypotheses about the cooperative regulatory mechanisms driving developmental processes. Computational analysis of the proposed models using well-established methods enables testing the hypotheses by contrasting predictions with observations. Within such framework, Boolean gene regulatory network dynamical models have been extensively used in modeling plant development. Boolean models are simple and intuitively appealing, ideal tools for collaborative efforts between theorists and experimentalists. In this chapter we present protocols used in our group for the study of diverse plant developmental processes. We focus on conceptual clarity and practical implementation, providing directions to the corresponding technical literature.
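
    As a minimal illustration of what a Boolean gene regulatory network model looks like in practice, the sketch below updates a three-gene toy network synchronously and enumerates its attractors. The genes A, B, C and their update rules are hypothetical and are not taken from the chapter.

```python
from itertools import product


def step(state):
    """Synchronous update of a hypothetical three-gene network.

    Assumed rules: A is repressed by C, B is activated by A,
    and C requires both A and B.
    """
    a, b, c = state
    return (not c, a, a and b)


def find_attractors(update, n_genes):
    """Follow every initial state until it revisits a state; collect the cycles."""
    attractors = set()
    for start in product([False, True], repeat=n_genes):
        seen = []
        state = start
        while state not in seen:
            seen.append(state)
            state = update(state)
        cycle = seen[seen.index(state):]  # states visited from the first repeat on
        k = cycle.index(min(cycle))       # rotate so equal cycles compare equal
        attractors.add(tuple(cycle[k:] + cycle[:k]))
    return attractors


attractors = find_attractors(step, 3)
for cycle in sorted(attractors):
    print(len(cycle), [tuple(int(x) for x in s) for s in cycle])
```

    Exhaustive enumeration is only feasible for small networks, since the state space doubles with every added gene; fixed points appear here as cycles of length one.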

  14. A Machine Learned Classifier That Uses Gene Expression Data to Accurately Predict Estrogen Receptor Status

    Science.gov (United States)

    Bastani, Meysam; Vos, Larissa; Asgarian, Nasimeh; Deschenes, Jean; Graham, Kathryn; Mackey, John; Greiner, Russell

    2013-01-01

    Background: Selecting the appropriate treatment for breast cancer requires accurately determining the estrogen receptor (ER) status of the tumor. However, the standard for determining this status, immunohistochemical analysis of formalin-fixed paraffin embedded samples, suffers from numerous technical and reproducibility issues. Assessment of ER-status based on RNA expression can provide more objective, quantitative and reproducible test results. Methods: To learn a parsimonious RNA-based classifier of hormone receptor status, we applied a machine learning tool to a training dataset of gene expression microarray data obtained from 176 frozen breast tumors, whose ER-status was determined by applying ASCO-CAP guidelines to standardized immunohistochemical testing of formalin fixed tumor. Results: This produced a three-gene classifier that can predict the ER-status of a novel tumor, with a cross-validation accuracy of 93.17±2.44%. When applied to an independent validation set and to four other public databases, some on different platforms, this classifier obtained over 90% accuracy in each. In addition, we found that this prediction rule separated the patients' recurrence-free survival curves with a hazard ratio lower than the one based on the IHC analysis of ER-status. Conclusions: Our efficient and parsimonious classifier lends itself to high throughput, highly accurate and low-cost RNA-based assessments of ER-status, suitable for routine high-throughput clinical use. This analytic method provides a proof-of-principle that may be applicable to developing effective RNA-based tests for other biomarkers and conditions. PMID:24312637

  15. A machine learned classifier that uses gene expression data to accurately predict estrogen receptor status.

    Directory of Open Access Journals (Sweden)

    Meysam Bastani

    Full Text Available BACKGROUND: Selecting the appropriate treatment for breast cancer requires accurately determining the estrogen receptor (ER) status of the tumor. However, the standard for determining this status, immunohistochemical analysis of formalin-fixed paraffin embedded samples, suffers from numerous technical and reproducibility issues. Assessment of ER-status based on RNA expression can provide more objective, quantitative and reproducible test results. METHODS: To learn a parsimonious RNA-based classifier of hormone receptor status, we applied a machine learning tool to a training dataset of gene expression microarray data obtained from 176 frozen breast tumors, whose ER-status was determined by applying ASCO-CAP guidelines to standardized immunohistochemical testing of formalin fixed tumor. RESULTS: This produced a three-gene classifier that can predict the ER-status of a novel tumor, with a cross-validation accuracy of 93.17±2.44%. When applied to an independent validation set and to four other public databases, some on different platforms, this classifier obtained over 90% accuracy in each. In addition, we found that this prediction rule separated the patients' recurrence-free survival curves with a hazard ratio lower than the one based on the IHC analysis of ER-status. CONCLUSIONS: Our efficient and parsimonious classifier lends itself to high throughput, highly accurate and low-cost RNA-based assessments of ER-status, suitable for routine high-throughput clinical use. This analytic method provides a proof-of-principle that may be applicable to developing effective RNA-based tests for other biomarkers and conditions.

  16. Dopamine Gene Profiling to Predict Impulse Control and Effects of Dopamine Agonist Ropinirole.

    Science.gov (United States)

    MacDonald, Hayley J; Stinear, Cathy M; Ren, April; Coxon, James P; Kao, Justin; Macdonald, Lorraine; Snow, Barry; Cramer, Steven C; Byblow, Winston D

    2016-07-01

    Dopamine agonists can impair inhibitory control and cause impulse control disorders for those with Parkinson disease (PD), although mechanistically this is not well understood. In this study, we hypothesized that the extent of such drug effects on impulse control is related to specific dopamine gene polymorphisms. This double-blind, placebo-controlled study aimed to examine the effect of single doses of 0.5 and 1.0 mg of the dopamine agonist ropinirole on impulse control in healthy adults of typical age for PD onset. Impulse control was measured by stop signal RT on a response inhibition task and by an index of impulsive decision-making on the Balloon Analogue Risk Task. A dopamine genetic risk score quantified basal dopamine neurotransmission from the influence of five genes: catechol-O-methyltransferase, dopamine transporter, and those encoding receptors D1, D2, and D3. With placebo, impulse control was better for the high versus low genetic risk score groups. Ropinirole modulated impulse control in a manner dependent on genetic risk score. For the lower score group, both doses improved response inhibition (decreased stop signal RT) whereas the lower dose reduced impulsiveness in decision-making. Conversely, the higher score group showed a trend for worsened response inhibition on the lower dose whereas both doses increased impulsiveness in decision-making. The implications of the present findings are that genotyping can be used to predict impulse control and whether it will improve or worsen with the administration of dopamine agonists.

  17. Multiple genetic interaction experiments provide complementary information useful for gene function prediction.

    Directory of Open Access Journals (Sweden)

    Magali Michaut

    Full Text Available Genetic interactions help map biological processes and their functional relationships. A genetic interaction is defined as a deviation from the expected phenotype when combining multiple genetic mutations. In Saccharomyces cerevisiae, most genetic interactions are measured under a single phenotype - growth rate in standard laboratory conditions. Recently genetic interactions have been collected under different phenotypic readouts and experimental conditions. How different are these networks and what can we learn from their differences? We conducted a systematic analysis of quantitative genetic interaction networks in yeast performed under different experimental conditions. We find that networks obtained using different phenotypic readouts, in different conditions and from different laboratories overlap less than expected and provide significant unique information. To exploit this information, we develop a novel method to combine individual genetic interaction data sets and show that the resulting network improves gene function prediction performance, demonstrating that individual networks provide complementary information. Our results support the notion that using diverse phenotypic readouts and experimental conditions will substantially increase the amount of gene function information produced by genetic interaction screens.

  18. AUC-based biomarker ensemble with an application on gene scores predicting low bone mineral density.

    Science.gov (United States)

    Zhao, X G; Dai, W; Li, Y; Tian, L

    2011-11-01

    The area under the receiver operating characteristic (ROC) curve (AUC), long regarded as a 'golden' measure for the predictiveness of a continuous score, has propelled the need to develop AUC-based predictors. However, the AUC-based ensemble methods are rather scant, largely due to the fact that the associated objective function is neither continuous nor concave. Indeed, there is no reliable numerical algorithm identifying optimal combination of a set of biomarkers to maximize the AUC, especially when the number of biomarkers is large. We have proposed novel AUC-based statistical ensemble methods for combining multiple biomarkers to differentiate a binary response of interest. Specifically, we propose to replace the non-continuous and non-convex AUC objective function by a convex surrogate loss function, whose minimizer can be efficiently identified. With the established framework, the lasso and other regularization techniques enable feature selection. Extensive simulations have demonstrated the superiority of the new methods to the existing methods. The proposal has been applied to a gene expression dataset to construct gene expression scores to differentiate elderly women with low bone mineral density (BMD) from those with normal BMD. The AUCs of the resulting scores in the independent test dataset have been satisfactory. Aiming to directly maximize the AUC, the proposed AUC-based ensemble method provides an efficient means of generating a stable combination of multiple biomarkers, which is especially useful in high-dimensional settings. Contact: lutian@stanford.edu. Supplementary data are available at Bioinformatics online.
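
    The idea of swapping the discontinuous AUC objective for a convex surrogate can be illustrated with a small sketch. The abstract does not name the surrogate the authors use, so the pairwise logistic loss, the plain gradient descent, and all function names below are assumptions for illustration only; no lasso penalty is included.

```python
import math


def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def empirical_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    total = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(scores_pos) * len(scores_neg))


def surrogate_loss(w, X_pos, X_neg):
    """Mean pairwise logistic loss, a smooth convex stand-in for 1 - AUC."""
    loss = 0.0
    for xp in X_pos:
        for xn in X_neg:
            margin = dot(w, xp) - dot(w, xn)
            loss += math.log1p(math.exp(-margin))
    return loss / (len(X_pos) * len(X_neg))


def fit(X_pos, X_neg, lr=0.5, steps=200):
    """Minimise the surrogate by gradient descent, starting from w = 0."""
    d = len(X_pos[0])
    n_pairs = len(X_pos) * len(X_neg)
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xp in X_pos:
            for xn in X_neg:
                diff = [a - b for a, b in zip(xp, xn)]
                # d/dw log(1 + exp(-w.diff)) = -diff / (1 + exp(w.diff))
                coef = -1.0 / (1.0 + math.exp(dot(w, diff)))
                for j in range(d):
                    grad[j] += coef * diff[j] / n_pairs
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w
```

    The fitted weights give a linear biomarker combination whose scores can be passed to `empirical_auc`; because the surrogate is convex, simple first-order methods suffice, which is the practical appeal of this family of approaches.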

  19. Interactions of adolescent social experiences and dopamine genes to predict physical intimate partner violence perpetration.

    Directory of Open Access Journals (Sweden)

    Laura M Schwab-Reese

    We examined the interactions between three dopamine gene alleles (DAT1, DRD2, DRD4) previously associated with violent behavior and two components of the adolescent environment (exposure to violence, school social environment) to predict adulthood physical intimate partner violence (IPV) perpetration among white men and women. We used data from Wave IV of the National Longitudinal Study of Adolescent to Adult Health, a cohort study following individuals from adolescence to adulthood. Based on the prior literature, we categorized participants as at risk for each of the three dopamine genes using this coding scheme: two 10-R alleles for DAT1; at least one A-1 allele for DRD2; at least one 7-R or 8-R allele for DRD4. Adolescent exposure to violence and school social environment were measured in 1994 and 1995 when participants were in high school or middle school. Intimate partner violence perpetration was measured in 2008 when participants were 24 to 32 years old. We used simple and multivariable logistic regression models, including interactions of genes and the adolescent environments, for the analysis. Presence of risk alleles was not independently associated with IPV perpetration, but increasing exposure to violence and disconnection from the school social environment were associated with physical IPV perpetration. The effects of these adolescent experiences on physical IPV perpetration varied by dopamine risk allele status. Among individuals with non-risk dopamine alleles, increased exposure to violence during adolescence and perception of disconnection from the school environment were significantly associated with increased odds of physical IPV perpetration, but individuals with high risk alleles, overall, did not experience the same increase. Our results suggested the effects of adolescent environment on adulthood physical IPV perpetration varied by genetic factors. This analysis did not find a direct link between risk alleles and violence, but

  20. Complete genome sequence analysis of the fish pathogen Flavobacterium columnare provides insights into antibiotic resistance and pathogenicity related genes.

    Science.gov (United States)

    Zhang, Yulei; Zhao, Lijuan; Chen, Wenjie; Huang, Yunmao; Yang, Ling; Sarathbabu, V; Wu, Zaohe; Li, Jun; Nie, Pin; Lin, Li

    2017-10-01

    Here we analyzed the complete genome sequence of a highly virulent Flavobacterium columnare Pf1 strain isolated in our laboratory. The complete genome consists of a 3,171,081 bp circular DNA with 2784 predicted protein-coding genes. Among these, 286 genes were predicted as antibiotic resistance genes, including 32 RND-type efflux pump related genes which were associated with the export of aminoglycosides, indicating inducible aminoglycoside resistance in F. columnare. On the other hand, 328 genes were predicted as pathogenicity related genes which could be classified as virulence factors, gliding motility proteins, adhesins, and many putative secreted proteases. These genes were probably involved in the colonization, invasion and destruction of fish tissues during F. columnare infection. The complete genome sequence provides a basis for explaining the interactions between F. columnare and the infected fish. The predicted antibiotic resistance and pathogenicity related genes will shed new light on the development of more efficient preventive strategies against infection by F. columnare, which is a major worldwide fish pathogen. Copyright © 2017 Elsevier Ltd. All rights reserved.

  1. Melanopsin gene variations interact with season to predict sleep onset and chronotype.

    Science.gov (United States)

    Roecklein, Kathryn A; Wong, Patricia M; Franzen, Peter L; Hasler, Brant P; Wood-Vasey, W Michael; Nimgaonkar, Vishwajit L; Miller, Megan A; Kepreos, Kyle M; Ferrell, Robert E; Manuck, Stephen B

    2012-10-01

    The human melanopsin gene has been reported to mediate risk for seasonal affective disorder (SAD), which is hypothesized to be caused by decreased photic input during winter when light levels fall below threshold, resulting in differences in circadian phase and/or sleep. However, it is unclear if melanopsin increases risk of SAD by causing differences in sleep or circadian phase, or if those differences are symptoms of the mood disorder. To determine if melanopsin sequence variations are associated with differences in sleep-wake behavior among those not suffering from a mood disorder, the authors tested associations between melanopsin gene polymorphisms and self-reported sleep timing (sleep onset and wake time) in a community sample (N = 234) of non-Hispanic Caucasian participants (age 30-54 yrs) with no history of psychological, neurological, or sleep disorders. The authors also tested the effect of melanopsin variations on differences in preferred sleep and activity timing (i.e., chronotype), which may reflect differences in circadian phase, sleep homeostasis, or both. Daylength on the day of assessment was measured and included in analyses. DNA samples were genotyped for melanopsin gene polymorphisms using fluorescence polarization. P10L genotype interacted with daylength to predict self-reported sleep onset (interaction p ... sleep onset among those with the TT genotype was later in the day when individuals were assessed on longer days and earlier in the day on shorter days, whereas individuals in the other genotype groups (i.e., CC and CT) did not show this interaction effect. P10L genotype also interacted in an analogous way with daylength to predict self-reported morningness (interaction p ... sleep onset and chronotype as a function of daylength, whereas other genotypes at P10L do not seem to have effects that vary by daylength. A better understanding of how melanopsin confers heightened responsivity to daylength may improve our understanding of a broad range of

  2. Gene expression markers in circulating tumor cells may predict bone metastasis and response to hormonal treatment in breast cancer.

    Science.gov (United States)

    Wang, Haiying; Molina, Julian; Jiang, John; Ferber, Matthew; Pruthi, Sandhya; Jatkoe, Timothy; Derecho, Carlo; Rajpurohit, Yashoda; Zheng, Jian; Wang, Yixin

    2013-11-01

    Circulating tumor cells (CTCs) have recently attracted attention due to their potential as prognostic and predictive markers for the clinical management of metastatic breast cancer patients. The isolation of CTCs from patients may enable the molecular characterization of these cells, which may help establish a minimally invasive assay for the prediction of metastasis and further optimization of treatment. Molecular markers of proven clinical value may therefore be useful in predicting disease aggressiveness and response to treatment. In our earlier study, we identified a gene signature in breast cancer that appears to be significantly associated with bone metastasis. Among the genes that constitute this signature, trefoil factor 1 (TFF1) was identified as the most differentially expressed gene associated with bone metastasis. In this study, we investigated 25 candidate gene markers in the CTCs of metastatic breast cancer patients with different metastatic sites. The panel of the 25 markers was investigated in 80 baseline samples (first blood draw of CTCs) and 30 follow-up samples. In addition, 40 healthy blood donors (HBDs) were analyzed as controls. The assay was performed using quantitative reverse transcriptase polymerase chain reaction (qRT-PCR) with RNA extracted from CTCs captured by the CellSearch system. Our study indicated that 12 of the genes were uniquely expressed in CTCs and 10 were highly expressed in the CTCs obtained from patients compared to those obtained from HBDs. Among these genes, the expression of keratin 19 was highly correlated with the CTC count. The TFF1 expression in CTCs was a strong predictor of bone metastasis and the patients with a high expression of estrogen receptor β in CTCs exhibited a better response to hormonal treatment. Molecular characterization of these genes in CTCs may provide a better understanding of the mechanism underlying tumor metastasis and identify gene markers in CTCs for predicting disease progression and

  3. Peripheral neuropathy predicts nuclear gene defect in patients with mitochondrial ophthalmoplegia.

    Science.gov (United States)

    Horga, Alejandro; Pitceathly, Robert D S; Blake, Julian C; Woodward, Catherine E; Zapater, Pedro; Fratter, Carl; Mudanohwo, Ese E; Plant, Gordon T; Houlden, Henry; Sweeney, Mary G; Hanna, Michael G; Reilly, Mary M

    2014-12-01

    Progressive external ophthalmoplegia is a common clinical feature in mitochondrial disease caused by nuclear DNA defects and single, large-scale mitochondrial DNA deletions and is less frequently associated with point mutations of mitochondrial DNA. Peripheral neuropathy is also a frequent manifestation of mitochondrial disease, although its prevalence and characteristics vary considerably among the different syndromes and genetic aetiologies. Based on clinical observations, we systematically investigated whether the presence of peripheral neuropathy could predict the underlying genetic defect in patients with progressive external ophthalmoplegia. We analysed detailed demographic, clinical and neurophysiological data from 116 patients with genetically-defined mitochondrial disease and progressive external ophthalmoplegia. Seventy-eight patients (67%) had a single mitochondrial DNA deletion, 12 (10%) had a point mutation of mitochondrial DNA and 26 (22%) had mutations in either POLG, C10orf2 or RRM2B, or had multiple mitochondrial DNA deletions in muscle without an identified nuclear gene defect. Seventy-seven patients had neurophysiological studies; of these, 16 patients (21%) had a large-fibre peripheral neuropathy. The prevalence of peripheral neuropathy was significantly lower in patients with a single mitochondrial DNA deletion (2%) as compared to those with a point mutation of mitochondrial DNA or with a nuclear DNA defect (44% and 52%, respectively; P ... peripheral neuropathy as the only independent predictor associated with a nuclear DNA defect (P=0.002; odds ratio 8.43, 95% confidence interval 2.24-31.76). Multinomial logistic regression analysis identified peripheral neuropathy, family history and hearing loss as significant predictors of the genotype, and the same three variables showed the highest performance in genotype classification in a decision tree analysis. Of these variables, peripheral neuropathy had the highest specificity (91%), negative

  4. High-Throughput Gene Expression Profiles to Define Drug Similarity and Predict Compound Activity.

    Science.gov (United States)

    De Wolf, Hans; Cougnaud, Laure; Van Hoorde, Kirsten; De Bondt, An; Wegner, Joerg K; Ceulemans, Hugo; Göhlmann, Hinrich

    2018-04-01

    By adding biological information, beyond the chemical properties and desired effect of a compound, uncharted compound areas and connections can be explored. In this study, we add transcriptional information for 31K compounds of Janssen's primary screening deck, using the HT L1000 platform and assess (a) the transcriptional connection score for generating compound similarities, (b) machine learning algorithms for generating target activity predictions, and (c) the scaffold hopping potential of the resulting hits. We demonstrate that the transcriptional connection score is best computed from the significant genes only and should be interpreted within its confidence interval for which we provide the stats. These guidelines help to reduce noise, increase reproducibility, and enable the separation of specific and promiscuous compounds. The added value of machine learning is demonstrated for the NR3C1 and HSP90 targets. Support Vector Machine models yielded balanced accuracy values ≥80% when the expression values from DDIT4 & SERPINE1 and TMEM97 & SPR were used to predict the NR3C1 and HSP90 activity, respectively. Combining both models resulted in 22 new and confirmed HSP90-independent NR3C1 inhibitors, providing two scaffolds (i.e., pyrimidine and pyrazolo-pyrimidine), which could potentially be of interest in the treatment of depression (i.e., inhibiting the glucocorticoid receptor (i.e., NR3C1), while leaving its chaperone, HSP90, unaffected). As such, the initial hit rate increased by a factor 300, as less, but more specific chemistry could be screened, based on the upfront computed activity predictions.

  5. Group spike-and-slab lasso generalized linear models for disease prediction and associated genes detection by incorporating pathway information.

    Science.gov (United States)

    Tang, Zaixiang; Shen, Yueping; Li, Yan; Zhang, Xinyan; Wen, Jia; Qian, Chen'ao; Zhuang, Wenzhuo; Shi, Xinghua; Yi, Nengjun

    2018-03-15

    Large-scale molecular data have been increasingly used as an important resource for prognostic prediction of diseases and detection of associated genes. However, standard approaches for omics data analysis ignore the group structure among genes encoded in functional relationships or pathway information. We propose new Bayesian hierarchical generalized linear models, called group spike-and-slab lasso GLMs, for predicting disease outcomes and detecting associated genes by incorporating large-scale molecular data and group structures. The proposed model employs a mixture double-exponential prior for coefficients that induces self-adaptive shrinkage amount on different coefficients. The group information is incorporated into the model by setting group-specific parameters. We have developed a fast and stable deterministic algorithm to fit the proposed hierarchical GLMs, which can perform variable selection within groups. We assess the performance of the proposed method on several simulated scenarios, by varying the overlap among groups, group size, number of non-null groups, and the correlation within group. Compared with existing methods, the proposed method provides not only more accurate estimates of the parameters but also better prediction. We further demonstrate the application of the proposed procedure on three cancer datasets by utilizing pathway structures of genes. Our results show that the proposed method generates powerful models for predicting disease outcomes and detecting associated genes. The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/). Contact: nyi@uab.edu. Supplementary data are available at Bioinformatics online.

  6. Prediction of target genes for miR-140-5p in pulmonary arterial hypertension using bioinformatics methods.

    Science.gov (United States)

    Li, Fangwei; Shi, Wenhua; Wan, Yixin; Wang, Qingting; Feng, Wei; Yan, Xin; Wang, Jian; Chai, Limin; Zhang, Qianqian; Li, Manxiang

    2017-12-01

    The expression of microRNA (miR)-140-5p is known to be reduced in both pulmonary arterial hypertension (PAH) patients and monocrotaline-induced PAH models in rats. Identification of target genes for miR-140-5p with bioinformatics analysis may reveal new pathways and connections in PAH. This study aimed to explore downstream target genes and relevant signaling pathways regulated by miR-140-5p to provide theoretical evidence for further research on the role of miR-140-5p in PAH. Multiple downstream target genes and upstream transcription factors (TFs) of miR-140-5p were predicted in the analysis. Gene ontology (GO) enrichment analysis indicated that downstream target genes of miR-140-5p were enriched in many biological processes, such as biological regulation, signal transduction, response to chemical stimulus, stem cell proliferation, and cell surface receptor signaling pathways. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis found that downstream target genes were mainly located in the Notch, TGF-beta, PI3K/Akt, and Hippo signaling pathways. According to the TF-miRNA-mRNA network, the important downstream target genes of miR-140-5p were PPI, TGF-betaR1, smad4, JAG1, ADAM10, FGF9, PDGFRA, VEGFA, LAMC1, TLR4, and CREB. After thoroughly reviewing published literature, we found that 23 target genes and seven signaling pathways were truly inhibited by miR-140-5p in various tissues or cells; most of these verified targets were in accordance with our present prediction. Other predicted targets still need further verification in vivo and in vitro.

  7. dPORE-miRNA: Polymorphic regulation of microRNA genes

    KAUST Repository

    Schmeier, Sebastian

    2011-02-04

    Background: MicroRNAs (miRNAs) are short non-coding RNA molecules that act as post-transcriptional regulators and affect the regulation of protein-coding genes. Mostly transcribed by PolII, miRNA genes are regulated at the transcriptional level similarly to protein-coding genes. In this study we focus on human miRNAs. These miRNAs are involved in a variety of pathways and can affect many diseases. Our interest is on possible deregulation of the transcription initiation of the miRNA encoding genes, which is facilitated by variations in the genomic sequence of transcriptional control regions (promoters). Methodology: Our aim is to provide an online resource to facilitate the investigation of the potential effects of single nucleotide polymorphisms (SNPs) on miRNA gene regulation. We analyzed SNPs overlapped with predicted transcription factor binding sites (TFBSs) in promoters of miRNA genes. We also accounted for the creation of novel TFBSs due to polymorphisms not present in the reference genome. The resulting changes in the original TFBSs and potential creation of new TFBSs were incorporated into the Dragon Database of Polymorphic Regulation of miRNA genes (dPORE-miRNA). Conclusions: The dPORE-miRNA database enables researchers to explore potential effects of SNPs on the regulation of miRNAs. dPORE-miRNA can be interrogated with regards to: a/miRNAs (their targets, or involvement in diseases, or biological pathways), b/SNPs, or c/transcription factors. dPORE-miRNA can be accessed at http://cbrc.kaust.edu.sa/dpore and http://apps.sanbi.ac.za/dpore/. Its use is free for academic and non-profit users. © 2011 Schmeier et al.

  8. dPORE-miRNA: Polymorphic regulation of microRNA genes

    KAUST Repository

    Schmeier, Sebastian; Schaefer, Ulf; MacPherson, Cameron R.; Bajic, Vladimir B.

    2011-01-01

    Background: MicroRNAs (miRNAs) are short non-coding RNA molecules that act as post-transcriptional regulators and affect the regulation of protein-coding genes. Mostly transcribed by PolII, miRNA genes are regulated at the transcriptional level similarly to protein-coding genes. In this study we focus on human miRNAs. These miRNAs are involved in a variety of pathways and can affect many diseases. Our interest is on possible deregulation of the transcription initiation of the miRNA encoding genes, which is facilitated by variations in the genomic sequence of transcriptional control regions (promoters). Methodology: Our aim is to provide an online resource to facilitate the investigation of the potential effects of single nucleotide polymorphisms (SNPs) on miRNA gene regulation. We analyzed SNPs overlapped with predicted transcription factor binding sites (TFBSs) in promoters of miRNA genes. We also accounted for the creation of novel TFBSs due to polymorphisms not present in the reference genome. The resulting changes in the original TFBSs and potential creation of new TFBSs were incorporated into the Dragon Database of Polymorphic Regulation of miRNA genes (dPORE-miRNA). Conclusions: The dPORE-miRNA database enables researchers to explore potential effects of SNPs on the regulation of miRNAs. dPORE-miRNA can be interrogated with regards to: a/miRNAs (their targets, or involvement in diseases, or biological pathways), b/SNPs, or c/transcription factors. dPORE-miRNA can be accessed at http://cbrc.kaust.edu.sa/dpore and http://apps.sanbi.ac.za/dpore/. Its use is free for academic and non-profit users. © 2011 Schmeier et al.
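
    The core operation described in the Methodology, finding SNPs that fall inside predicted transcription factor binding sites in miRNA promoters, amounts to an interval-overlap lookup. The sketch below is a minimal illustration and not the dPORE-miRNA pipeline: the record types, the 1-based inclusive coordinate convention, and all identifiers are hypothetical. It does not cover the second case the abstract mentions, novel sites created by a polymorphism.

```python
from collections import defaultdict
from typing import NamedTuple


class Site(NamedTuple):
    """A predicted TFBS; coordinates are assumed 1-based and inclusive."""
    chrom: str
    start: int
    end: int
    factor: str


class Snp(NamedTuple):
    """A single-nucleotide polymorphism at one genomic position."""
    snp_id: str
    chrom: str
    pos: int


def snps_in_sites(snps, sites):
    """Map each SNP id to the factors whose predicted sites contain it."""
    by_chrom = defaultdict(list)
    for site in sites:
        by_chrom[site.chrom].append(site)

    hits = defaultdict(list)
    for snp in snps:
        # .get avoids creating empty entries for chromosomes without sites
        for site in by_chrom.get(snp.chrom, []):
            if site.start <= snp.pos <= site.end:
                hits[snp.snp_id].append(site.factor)
    return dict(hits)
```

    A linear scan per chromosome is adequate for promoter-sized regions; for genome-wide inputs a sorted-interval or interval-tree lookup would be the more usual choice.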

  9. Prediction of the prognosis of breast cancer in routine histologic specimens using a simplified, low-cost gene expression signature

    DEFF Research Database (Denmark)

    Marcell, S.A.; Balazs, A.; Emese, A.

    2013-01-01

    Prediction of the prognosis of breast cancer in routine histologic specimens using a simplified, low-cost gene expression signature Background: Grade 2 breast carcinomas do not form a uniform prognostic group. Aim: To extend the number of patients and the investigated genes of a previously...... grade 2 breast carcinomas into prognostic groups. Gene expression was investigated by polymerase chain reaction in 249 formalin-fixed, paraffin-embedded breast tumors. The results were correlated with relapse-free survival. Results: Histologically grade 2 carcinomas were split into good and a poor...... identified prognostic signature described by the authors that reflect chromosomal instability in order to refine characterization of grade 2 breast cancers and identify driver genes. Methods: Using publicly available databases, the authors selected 9 target and 3 housekeeping genes that are capable to divide...

  10. Prediction of the contact sensitizing potential of chemicals using analysis of gene expression changes in human THP-1 monocytes.

    Science.gov (United States)

    Arkusz, Joanna; Stępnik, Maciej; Sobala, Wojciech; Dastych, Jarosław

    2010-11-10

    The aim of this study was to find differentially regulated genes in THP-1 monocytic cells exposed to sensitizers and nonsensitizers and to investigate if such genes could be reliable markers for an in vitro predictive method for the identification of skin sensitizing chemicals. Changes in expression of 35 genes in the THP-1 cell line following treatment with chemicals of different sensitizing potential (from nonsensitizers to extreme sensitizers) were assessed using real-time PCR. Verification of 13 candidate genes by testing a large number of chemicals (an additional 22 sensitizers and 8 nonsensitizers) revealed that prediction of contact sensitization potential was possible based on evaluation of changes in three genes: IL8, HMOX1 and PAIMP1. In total, changes in expression of these genes allowed correct detection of sensitization potential of 21 out of 27 (78%) test sensitizers. The gene expression levels inside potency groups varied and did not allow estimation of sensitization potency of test chemicals. Results of this study indicate that evaluation of changes in expression of proposed biomarkers in THP-1 cells could be a valuable model for preliminary screening of chemicals to discriminate an appreciable majority of sensitizers from nonsensitizers. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.

  11. Predicting the use of corporal punishment: Child aggression, parent religiosity, and the BDNF gene.

    Science.gov (United States)

    Avinun, Reut; Davidov, Maayan; Mankuta, David; Knafo-Noam, Ariel

    2018-03-01

    Corporal punishment (CP) has been associated with deleterious child outcomes, highlighting the importance of understanding its underpinnings. Although several factors have been linked with parents' CP use, genetic influences on CP have rarely been studied, and an integrative view examining the interplay between different predictors of CP is missing. We focused on the separate and joint effects of religiosity, child aggression, parent's gender, and a valine (Val) to methionine (Met) substitution in the brain-derived neurotrophic factor (BDNF) gene. Data came from a twin sample (51% male, aged 6.5 years). We used mothers' and fathers' self-reports of CP and religiosity, and the other parent's report on child aggression. Complete data were available for 244 mothers and their 466 children, and for 217 fathers and their 409 children. The random split method was employed to examine replicability. For mothers, only the effect of religiosity appeared to replicate. For fathers, several effects predicting CP use replicated in both samples: child aggression, child sex, religiosity, and a three-way (GxExE) interaction implicating fathers' BDNF genotype, child aggression and religiosity. Religious fathers who carried the Met allele and had an aggressive child used CP more frequently; in contrast, secular fathers' CP use was not affected by their BDNF genotype or child aggression. Results were also repeated longitudinally in a subsample with age 8-9 data. Findings highlight the utility of a bio-ecological approach for studying CP use by shedding light on pertinent gene-environment interaction processes. Possible implications for intervention and public policy are discussed. © 2017 Wiley Periodicals, Inc.

  12. In Silico Analysis of Microarray-Based Gene Expression Profiles Predicts Tumor Cell Response to Withanolides

    Directory of Open Access Journals (Sweden)

    Thomas Efferth

    2012-05-01

    Withania somnifera (L.) Dunal (Indian ginseng, winter cherry, Solanaceae) is widely used in traditional medicine. Roots are either chewed or used to prepare beverages (aqueous decocts). The major secondary metabolites of Withania somnifera are the withanolides, which are C-28-steroidal lactone triterpenoids. Withania somnifera extracts exert chemopreventive and anticancer activities in vitro and in vivo. The aims of the present in silico study were, firstly, to investigate whether tumor cells develop cross-resistance between standard anticancer drugs and withanolides and, secondly, to elucidate the molecular determinants of sensitivity and resistance of tumor cells towards withanolides. Using IC50 concentrations of eight different withanolides (withaferin A, withaferin A diacetate, 3-azerininylwithaferin A, withafastuosin D diacetate, 4-B-hydroxy-withanolide E, isowithanololide E, withafastuosin E, and withaperuvin) and 19 established anticancer drugs, we analyzed the cross-resistance profile of 60 tumor cell lines. The cell lines revealed cross-resistance between the eight withanolides. Consistent cross-resistance between withanolides and nitrosoureas (carmustin, lomustin, and semimustin) was also observed. Then, we performed transcriptomic microarray-based COMPARE and hierarchical cluster analyses of mRNA expression to identify mRNA expression profiles predicting sensitivity or resistance towards withanolides. Genes from diverse functional groups were significantly associated with response of tumor cells to withaferin A diacetate, e.g. genes functioning in DNA damage and repair, stress response, cell growth regulation, extracellular matrix components, cell adhesion and cell migration, constituents of the ribosome, cytoskeletal organization and regulation, signal transduction, transcription factors, and others.

  13. Ortholog-based screening and identification of genes related to intracellular survival.

    Science.gov (United States)

    Yang, Xiaowen; Wang, Jiawei; Bing, Guoxia; Bie, Pengfei; De, Yanyan; Lyu, Yanli; Wu, Qingmin

    2018-04-20

    Bioinformatics and comparative genomics analysis methods were used to predict unknown pathogen genes based on homology with identified or functionally clustered genes. In this study, the genes of common pathogens were analyzed to screen and identify genes associated with intracellular survival through sequence similarity, phylogenetic tree analysis and the λ-Red recombination system test method. In total, 38,952 protein-coding genes of common pathogens were divided into 19,775 clusters. As demonstrated through a COG analysis, information storage and processing genes might play an important role in intracellular survival. Only 19 clusters were present in facultative intracellular pathogens, and not all were present in extracellular pathogens. Construction of a phylogenetic tree selected 18 of these 19 clusters. Comparisons with the DEG database and previous research revealed that seven other clusters are considered essential gene clusters and that seven other clusters are associated with intracellular survival. Moreover, this study confirmed that clusters screened by orthologs with similar function could be replaced with an approved uvrY gene and its orthologs, and the results revealed that the usg gene is associated with intracellular survival. The study improves the current understanding of the characteristics of intracellular pathogens and allows further exploration of the intracellular survival-related gene modules in these pathogens. Copyright © 2018. Published by Elsevier B.V.

  14. A common polymorphism in a Williams syndrome gene predicts amygdala reactivity and extraversion in healthy adults

    Science.gov (United States)

    Swartz, Johnna R.; Waller, Rebecca; Bogdan, Ryan; Knodt, Annchen R.; Sabhlok, Aditi; Hyde, Luke W.; Hariri, Ahmad R.

    2015-01-01

    Background: Williams syndrome (WS), a genetic disorder resulting from hemizygous microdeletion of chromosome 7q11.23, has emerged as a model for identifying the genetic architecture of socioemotional behavior. Recently, common polymorphisms in GTF2I, which is found within the WS microdeletion, have been associated with reduced social anxiety in the general population. Identifying neural phenotypes affected by these polymorphisms will help advance our understanding not only of this specific genetic association but also the broader neurogenetic mechanisms of variability in socioemotional behavior. Methods: Through an ongoing parent protocol, the Duke Neurogenetics Study, we measured threat-related amygdala reactivity to fearful and angry facial expressions using functional MRI (fMRI), assessed trait personality using the Revised NEO Personality Inventory, and imputed GTF2I rs13227433 from saliva-derived DNA using custom Illumina arrays. Participants included 808 non-Hispanic Caucasian, African American, and Asian university students. Results: The GTF2I rs13227433 AA genotype, previously associated with lower social anxiety, predicted decreased threat-related amygdala reactivity. An indirect effect of GTF2I genotype on the warmth facet of extraversion was mediated by decreased threat-related amygdala reactivity in women but not men. Conclusions: A common polymorphism in the WS gene GTF2I associated with reduced social anxiety predicts decreased threat-related amygdala reactivity, which mediates an association between genotype and increased warmth in women. These results are consistent with reduced threat-related amygdala reactivity in WS and suggest that common variation in GTF2I contributes to broader variability in socioemotional brain function and behavior, with implications for understanding the neurogenetic bases of WS as well as social anxiety. PMID:26853120

  15. Interaction between serotonin transporter gene variants and life events predicts response to antidepressants in the GENDEP project

    DEFF Research Database (Denmark)

    Keers, R.; Uher, R.; Huezo-Diaz, P.

    2011-01-01

    , and several polymorphisms in the serotonin transporter gene (SLC6A4) have been genotyped including the serotonin transporter-linked polymorphic region (5-HTTLPR). Stressful life events were shown to predict a significantly better response to escitalopram but had no effect on response to nortriptyline...

  16. Interactions between Serotonin Transporter Gene Haplotypes and Quality of Mothers' Parenting Predict the Development of Children's Noncompliance

    Science.gov (United States)

    Sulik, Michael J.; Eisenberg, Nancy; Lemery-Chalfant, Kathryn; Spinrad, Tracy L.; Silva, Kassondra M.; Eggum, Natalie D.; Betkowski, Jennifer A.; Kupfer, Anne; Smith, Cynthia L.; Gaertner, Bridget; Stover, Daryn A.; Verrelli, Brian C.

    2012-01-01

    The LPR and STin2 polymorphisms of the serotonin transporter gene (SLC6A4) were combined into haplotypes that, together with quality of maternal parenting, were used to predict initial levels and linear change in children's (N = 138) noncompliance and aggression from age 18-54 months. Quality of mothers' parenting behavior was observed when…

  17. Arabidopsis CPR5 is a senescence-regulatory gene with pleiotropic functions as predicted by the evolutionary theory of senescence

    NARCIS (Netherlands)

    Jing, Hai-Chun; Anderson, Lisa; Sturre, Marcel J. G.; Hille, Jacques; Dijkwel, Paul P.

    2007-01-01

    Arabidopsis CPR5 is a senescence-regulatory gene with pleiotropic functions as predicted by the evolutionary theory of senescence Hai-Chun Jing1,2, Lisa Anderson3, Marcel J.G. Sturre1, Jacques Hille1 and Paul P. Dijkwel1,* 1Molecular Biology of Plants, Groningen Biomolecular Sciences and

  18. Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information

    Directory of Open Access Journals (Sweden)

    Lemke Ney

    2009-09-01

    Full Text Available Abstract Background The identification of essential genes is important for the understanding of the minimal requirements for cellular life and for practical purposes, such as drug design. However, the experimental techniques for essential gene discovery are labor-intensive and time-consuming. Considering these experimental constraints, a computational approach capable of accurately predicting essential genes would be of great value. We therefore present here a machine learning-based computational approach relying on network topological features, cellular localization and biological process information for prediction of essential genes. Results We constructed a decision tree-based meta-classifier and trained it on datasets with individual and grouped attributes (network topological features, cellular compartments and biological processes) to generate various predictors of essential genes. We showed that the predictors with better performances are those generated by datasets with integrated attributes. Using the predictor with all attributes, i.e., network topological features, cellular compartments and biological processes, we obtained the best predictor of essential genes that was then used to classify yeast genes with unknown essentiality status. Finally, we generated decision trees by training the J48 algorithm on datasets with all network topological features, cellular localization and biological process information to discover cellular rules for essentiality. We found that the number of protein physical interactions, the nuclear localization of proteins and the number of regulating transcription factors are the most important factors determining gene essentiality. Conclusion We were able to demonstrate that network topological features, cellular localization and biological process information are reliable predictors of essential genes. Moreover, by constructing decision trees based on these data, we could discover cellular rules governing

  19. SVM classifier to predict genes important for self-renewal and pluripotency of mouse embryonic stem cells

    Directory of Open Access Journals (Sweden)

    Xu Huilei

    2010-12-01

    Full Text Available Abstract Background Mouse embryonic stem cells (mESCs) are derived from the inner cell mass of a developing blastocyst and can be cultured indefinitely in vitro. Their distinct features are their ability to self-renew and to differentiate to all adult cell types. Genes that maintain mESCs self-renewal and pluripotency identity are of interest to stem cell biologists. Although significant steps have been made toward the identification and characterization of such genes, the list is still incomplete and controversial. For example, the overlap among candidate self-renewal and pluripotency genes across different RNAi screens is surprisingly small. Meanwhile, machine learning approaches have been used to analyze multi-dimensional experimental data and integrate results from many studies, yet they have not been applied to specifically tackle the task of predicting and classifying self-renewal and pluripotency gene membership. Results For this study we developed a classifier, a supervised machine learning framework for predicting self-renewal and pluripotency mESCs stemness membership genes (MSMG) using support vector machines (SVM). The data used to train the classifier were derived from mESCs-related studies using mRNA microarrays, measuring gene expression in various stages of early differentiation, as well as ChIP-seq studies applied to mESCs profiling genome-wide binding of key transcription factors, such as Nanog, Oct4, and Sox2, to the regulatory regions of other genes. Comparison to other classification methods using the leave-one-out cross-validation method was employed to evaluate the accuracy and generality of the classification. Finally, two sets of candidate genes from genome-wide RNA interference screens are used to test the generality and potential application of the classifier. Conclusions Our results reveal that an SVM approach can be useful for prioritizing genes for functional validation experiments and complement the analyses of high
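
    The leave-one-out cross-validation mentioned in this abstract can be sketched as follows. This is a minimal illustration only: a nearest-centroid classifier stands in for the SVM to keep the example dependency-free, and all function names and the toy data are hypothetical, not taken from the study.

```python
def nearest_centroid_fit(X, y):
    """Compute one mean vector (centroid) per class label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def nearest_centroid_predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def leave_one_out_accuracy(X, y):
    """Hold out each sample once, train on the rest, and score the held-out sample."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = nearest_centroid_fit(train_X, train_y)
        correct += nearest_centroid_predict(model, X[i]) == y[i]
    return correct / len(X)


# Toy feature vectors (hypothetical): two well-separated classes.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]
loo_accuracy = leave_one_out_accuracy(X, y)
```

    Any classifier exposing a fit and a predict step could be swapped in for the stand-in; the outer loop is what makes the estimate leave-one-out.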

  20. Assessment of the predictive accuracy of five in silico prediction tools, alone or in combination, and two metaservers to classify long QT syndrome gene mutations.

    Science.gov (United States)

    Leong, Ivone U S; Stuckey, Alexander; Lai, Daniel; Skinner, Jonathan R; Love, Donald R

    2015-05-13

    Long QT syndrome (LQTS) is an autosomal dominant condition predisposing to sudden death from malignant arrhythmia. Genetic testing identifies many missense single nucleotide variants of uncertain pathogenicity. Establishing genetic pathogenicity is an essential prerequisite to family cascade screening. Many laboratories use in silico prediction tools, either alone or in combination, or metaservers, in order to predict pathogenicity; however, their accuracy in the context of LQTS is unknown. We evaluated the accuracy of five in silico programs and two metaservers in the analysis of LQTS 1-3 gene variants. The in silico tools SIFT, PolyPhen-2, PROVEAN, SNPs&GO and SNAP, either alone or in all possible combinations, and the metaservers Meta-SNP and PredictSNP, were tested on 312 KCNQ1, KCNH2 and SCN5A gene variants that have previously been characterised by either in vitro or co-segregation studies as either "pathogenic" (283) or "benign" (29). The accuracy, sensitivity, specificity and Matthews Correlation Coefficient (MCC) were calculated to determine the best combination of in silico tools for each LQTS gene, and when all genes are combined. The best combination of in silico tools for KCNQ1 is PROVEAN, SNPs&GO and SIFT (accuracy 92.7%, sensitivity 93.1%, specificity 100% and MCC 0.70). The best combination of in silico tools for KCNH2 is SIFT and PROVEAN or PROVEAN, SNPs&GO and SIFT. Both combinations have the same scores for accuracy (91.1%), sensitivity (91.5%), specificity (87.5%) and MCC (0.62). In the case of SCN5A, SNAP and PROVEAN provided the best combination (accuracy 81.4%, sensitivity 86.9%, specificity 50.0%, and MCC 0.32). When all three LQT genes are combined, SIFT, PROVEAN and SNAP is the combination with the best performance (accuracy 82.7%, sensitivity 83.0%, specificity 80.0%, and MCC 0.44). Both metaservers performed better than the single in silico tools; however, they did not perform better than the best performing combination of in silico
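
    The four metrics reported in this abstract all derive from a 2x2 confusion matrix. The sketch below shows the standard formulas; the function name and the example counts are hypothetical and chosen only to mirror the class imbalance of the study (283 pathogenic, 29 benign), not to reproduce its results.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, specificity and MCC from confusion counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # MCC is conventionally set to 0 when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts: 283 pathogenic (tp + fn) and 29 benign (tn + fp) variants.
example = binary_metrics(tp=235, fn=48, tn=23, fp=6)

# A predictor that calls every variant pathogenic.
always_pathogenic = binary_metrics(tp=283, fn=0, tn=0, fp=29)
```

    With only 29 of 312 variants benign, the always-pathogenic predictor still reaches about 90.7% accuracy while its MCC is 0, which is why MCC is a useful companion to accuracy on data this imbalanced.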

  1. Multi-gene genetic programming based predictive models for municipal solid waste gasification in a fluidized bed gasifier.

    Science.gov (United States)

    Pandey, Daya Shankar; Pan, Indranil; Das, Saptarshi; Leahy, James J; Kwapinski, Witold

    2015-03-01

    A multi-gene genetic programming technique is proposed as a new method to predict syngas yield production and the lower heating value for municipal solid waste gasification in a fluidized bed gasifier. The study shows that the predicted outputs of the municipal solid waste gasification process are in good agreement with the experimental dataset and also generalise well to validation (untrained) data. Published experimental datasets are used for model training and validation purposes. The results show the effectiveness of the genetic programming technique for solving complex nonlinear regression problems. The multi-gene genetic programming model is also compared with a single-gene genetic programming model to show the relative merits and demerits of the technique. This study demonstrates that the genetic programming based data-driven modelling strategy can be a good candidate for developing models for other types of fuels as well. Copyright © 2014 Elsevier Ltd. All rights reserved.

  2. LncRNAs: emerging players in gene regulation and disease ...

    Indian Academy of Sciences (India)

    and Glavac 2013), accounting for about 20,000 protein coding ... general information on lncRNAs' feature (Da Sacco et al. 2012). ..... mal cells, stabilized Zeb2 intron encompasses an internal ..... cially growth-control genes and cell mobility-induced genes ..... RNAs in development and disease of the central nervous system.

  3. On splice site prediction using weight array models: a comparison of smoothing techniques

    International Nuclear Information System (INIS)

    Taher, Leila; Meinicke, Peter; Morgenstern, Burkhard

    2007-01-01

    In most eukaryotic genes, protein-coding exons are separated by non-coding introns which are removed from the primary transcript by a process called 'splicing'. The positions where introns are cut and exons are spliced together are called 'splice sites'. Thus, computational prediction of splice sites is crucial for gene finding in eukaryotes. Weight array models are a powerful probabilistic approach to splice site detection. Parameters for these models are usually derived from m-tuple frequencies in trusted training data and subsequently smoothed to avoid zero probabilities. In this study we compare three different ways of parameter estimation for m-tuple frequencies, namely (a) non-smoothed probability estimation, (b) standard pseudo counts and (c) a Gaussian smoothing procedure that we recently developed
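
    To make the pseudo-count option (b) concrete, here is a minimal sketch of a first-order weight array model, in which each position is conditioned on the preceding nucleotide. The function names, the toy training sites and the choice of a pseudo count of 1 are assumptions made for illustration; the Gaussian smoothing procedure (c) is not reproduced here.

```python
import math

ALPHABET = "ACGT"


def train_wam(sites, pseudo=1.0):
    """Estimate a first-order weight array model from equal-length aligned sites.

    Returns (first, cond) where first[a] = P(x_0 = a) and
    cond[i][prev][a] = P(x_i = a | x_{i-1} = prev) for i >= 1.
    """
    length = len(sites[0])
    first = {a: pseudo for a in ALPHABET}
    # cond[0] is never used for scoring; it is kept so indices match positions.
    cond = [{p: {a: pseudo for a in ALPHABET} for p in ALPHABET} for _ in range(length)]
    for site in sites:
        first[site[0]] += 1
        for i in range(1, length):
            cond[i][site[i - 1]][site[i]] += 1
    total = sum(first.values())
    first = {a: count / total for a, count in first.items()}
    for table in cond:
        for row in table.values():
            row_total = sum(row.values())
            for a in row:
                row[a] /= row_total
    return first, cond


def site_log_prob(seq, first, cond):
    """Log-probability of a candidate site under the trained model."""
    log_p = math.log(first[seq[0]])
    for i in range(1, len(seq)):
        log_p += math.log(cond[i][seq[i - 1]][seq[i]])
    return log_p


# Toy aligned sites (hypothetical), not real splice-site training data.
first, cond = train_wam(["AGGT", "AGGT", "CGGT"])
score_seen = site_log_prob("AGGT", first, cond)
score_unseen = site_log_prob("TTTT", first, cond)
```

    Because every count starts at the pseudo count, a site never observed in training still receives a finite log-probability; with a pseudo count of 0 this sketch would divide by zero for contexts absent from the training data.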

  4. PRGPred: A platform for prediction of domains of resistance gene analogue (RGA) in Arecaceae developed using machine learning algorithms

    Directory of Open Access Journals (Sweden)

    MATHODIYIL S. MANJULA

    2015-12-01

    Full Text Available Plant disease resistance genes (R-genes) are responsible for initiation of defense mechanisms against various phytopathogens. The majority of plant R-genes are members of very large multi-gene families, which encode structurally related proteins containing nucleotide binding site (NBS) domains and C-terminal leucine rich repeats (LRR). Other classes possess an extracellular LRR domain, a transmembrane domain and sometimes an intracellular serine/threonine kinase domain. R-proteins work in pathogen perception and/or the activation of conserved defense signaling networks. In the present study, sequences representing resistance gene analogues (RGAs) of coconut, arecanut, oil palm and date palm were collected from NCBI, sorted based on domains and assembled into a database. The sequences were analyzed in the PRINTS database to find the conserved domains and their motifs present in the RGAs. Based on these domains, we have also developed a tool to predict the domains of palm R-genes using various machine learning algorithms. The model files were selected based on the performance of the best classifier in training and testing. All this information is stored and made available in the online 'PRGpred' database and prediction tool.

  5. No specific gene expression signature in human granulosa and cumulus cells for prediction of oocyte fertilisation and embryo implantation.

    Directory of Open Access Journals (Sweden)

    Tanja Burnik Papler

    Full Text Available In human IVF procedures, objective and reliable biomarkers of oocyte and embryo quality are needed in order to increase the use of single embryo transfer (SET) and thus prevent multiple pregnancies. During folliculogenesis there is an intense bi-directional communication between oocyte and follicular cells. For this reason, the gene expression profile of follicular cells could be an important indicator and biomarker of oocyte and embryo quality. The objective of this study was to identify gene expression signature(s) in human granulosa (GC) and cumulus (CC) cells predictive of successful embryo implantation and oocyte fertilization. Forty-one patients were included in the study and individual GC and CC samples were collected; oocytes were cultivated separately, allowing a correlation with IVF outcome, and elective SET was performed. Gene expression analysis was performed using microarrays, followed by a quantitative real-time PCR validation. After statistical analysis of microarray data, there were no significantly differentially expressed genes (FDR<0.05) between non-fertilized and fertilized oocytes or between non-implanted and implanted embryos in either cell type. Furthermore, the results of quantitative real-time PCR were consistent with the microarray data, as there were no significant differences in expression of the genes selected for validation. In conclusion, we did not find biomarkers for prediction of oocyte fertilization and embryo implantation in IVF procedures in the present study.
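
    The abstract does not say which procedure produced the FDR threshold. Assuming the widely used Benjamini-Hochberg adjustment, a minimal sketch looks like this; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-gene p-values from a differential expression test.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adjusted]
```

    A gene is called differentially expressed when its adjusted value falls below the chosen threshold, here 0.05; a result like the one reported corresponds to no gene passing that cut.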

  6. Novel Accurate Bacterial Discrimination by MALDI-Time-of-Flight MS Based on Ribosomal Proteins Coding in S10-spc-alpha Operon at Strain Level S10-GERMS

    Science.gov (United States)

    Tamura, Hiroto; Hotta, Yudai; Sato, Hiroaki

    2013-08-01

    Matrix-assisted laser-desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is one of the most widely used mass-based approaches for bacterial identification and classification because of the simple sample preparation and extremely rapid analysis within a few minutes. To establish an accurate MALDI-TOF MS bacterial discrimination method at the strain level, the ribosomal subunit proteins coded in the S10-spc-alpha operon, which encodes half of the ribosomal subunit proteins and is highly conserved in eubacterial genomes, were selected as reliable biomarkers. This method, named the S10-GERMS method, revealed that the strains of genus Pseudomonas were successfully identified and discriminated at species and strain levels, respectively; therefore, the S10-GERMS method was further applied to discriminate the pathovar of P. syringae. The eight selected biomarkers (L24, L30, S10, S12, S14, S16, S17, and S19) suggested the rapid discrimination of P. syringae at the strain (pathovar) level. The S10-GERMS method appears to be a powerful tool for rapid and reliable bacterial discrimination and successful phylogenetic characterization. In this article, an overview of the utilization of results from the S10-GERMS method is presented, highlighting the characterization of the Lactobacillus casei group and discrimination of the bacteria of genera Bacillus and Sphingopyxis despite only two and one base difference in the 16S rRNA gene sequence, respectively.

  7. DNA methylation of the oxytocin receptor gene predicts neural response to ambiguous social stimuli

    Directory of Open Access Journals (Sweden)

    Allison Jack

    2012-10-01

    Full Text Available Oxytocin and its receptor (OXTR) play an important role in a variety of social perceptual and affiliative processes. Individual variability in social information processing likely has a strong heritable component, and as such, many investigations have established an association between common genetic variants of OXTR and variability in the social phenotype. However, to date, these investigations have primarily focused only on changes in the sequence of DNA without considering the role of epigenetic factors. DNA methylation is an epigenetic mechanism by which cells control transcription through modification of chromatin structure. DNA methylation of OXTR decreases expression of the gene and high levels of methylation have been associated with autism spectrum disorders. This link between epigenetic variability and social phenotype allows for the possibility that social processes are under epigenetic control. We hypothesized that the level of DNA methylation of OXTR would predict individual variability in social perception. Using the brain’s sensitivity to displays of animacy as a neural endophenotype of social perception, we found significant associations between the degree of OXTR methylation and brain activity evoked by the perception of animacy. Our results suggest that consideration of DNA methylation may substantially improve our ability to explain individual differences in imaging genetic association studies.

  8. Predicting human miRNA target genes using a novel evolutionary methodology

    KAUST Repository

    Aigli, Korfiati; Kleftogiannis, Dimitrios A.; Konstantinos, Theofilatos; Spiros, Likothanassis; Athanasios, Tsakalidis; Seferina, Mavroudi

    2012-01-01

    The discovery of miRNAs had a great impact on traditional biology. Typically, miRNAs have the potential to bind to the 3' untranslated region (UTR) of their mRNA target genes for cleavage or translational repression. The experimental identification of their targets has many drawbacks including cost, time and low specificity and these are the reasons why many computational approaches have been developed so far. However, existing computational approaches do not include any advanced feature selection technique and they are facing problems concerning their classification performance and their interpretability. In the present paper, we propose a novel hybrid methodology which combines genetic algorithms and support vector machines in order to locate the optimal feature subset while achieving high classification performance. The proposed methodology was compared with two of the most promising existing methodologies in the problem of predicting human miRNA targets. Our approach outperforms existing methodologies in terms of classification performances while selecting a much smaller feature subset. © 2012 Springer-Verlag.

  9. Predicting human miRNA target genes using a novel evolutionary methodology

    KAUST Repository

    Aigli, Korfiati

    2012-01-01

    The discovery of miRNAs had a great impact on traditional biology. Typically, miRNAs have the potential to bind to the 3' untranslated region (UTR) of their mRNA target genes for cleavage or translational repression. The experimental identification of their targets has many drawbacks including cost, time and low specificity and these are the reasons why many computational approaches have been developed so far. However, existing computational approaches do not include any advanced feature selection technique and they are facing problems concerning their classification performance and their interpretability. In the present paper, we propose a novel hybrid methodology which combines genetic algorithms and support vector machines in order to locate the optimal feature subset while achieving high classification performance. The proposed methodology was compared with two of the most promising existing methodologies in the problem of predicting human miRNA targets. Our approach outperforms existing methodologies in terms of classification performances while selecting a much smaller feature subset. © 2012 Springer-Verlag.

  10. Optimized outcome prediction in breast cancer by combining the 70-gene signature with clinical risk prediction algorithms

    NARCIS (Netherlands)

    Drukker, C.A.; Nijenhuis, M.V.; Bueno de Mesquita, J.M.; Retel, V.P.; Retel, Valesca; van Harten, Willem H.; van Tinteren, H.; Wesseling, J.; Schmidt, M.K.; van 't Veer, L.J.; Sonke, G.S.; Rutgers, E.J.T.; van de Vijver, M.J.; Linn, S.C.

    2014-01-01

    Clinical guidelines for breast cancer treatment differ in their selection of patients at a high risk of recurrence who are eligible to receive adjuvant systemic treatment (AST). The 70-gene signature is a molecular tool to better guide AST decisions. The aim of this study was to evaluate whether

  11. Predicting response to primary chemotherapy: gene expression profiling of paraffin-embedded core biopsy tissue.

    Science.gov (United States)

    Mina, Lida; Soule, Sharon E; Badve, Sunil; Baehner, Fredrick L; Baker, Joffre; Cronin, Maureen; Watson, Drew; Liu, Mei-Lan; Sledge, George W; Shak, Steve; Miller, Kathy D

    2007-06-01

    Primary chemotherapy provides an ideal opportunity to correlate gene expression with response to treatment. We used paraffin-embedded core biopsies from a completed phase II trial to identify genes that correlate with response to primary chemotherapy. Patients with newly diagnosed stage II or III breast cancer were treated with sequential doxorubicin 75 mg/m2 q2 wks x 3 and docetaxel 40 mg/m2 weekly x 6; treatment order was randomly assigned. Pretreatment core biopsy samples were interrogated for genes that might correlate with pathologic complete response (pCR). In addition to the individual genes, the correlation of the Oncotype DX Recurrence Score with pCR was examined. Of 70 patients enrolled in the parent trial, core biopsy samples with sufficient RNA for gene analyses were available from 45 patients; 9 (20%) had inflammatory breast cancer (IBC). Six (14%) patients achieved a pCR. Twenty-two of the 274 candidate genes assessed correlated with pCR (p < 0.05). Genes correlating with pCR could be grouped into three large clusters: angiogenesis-related genes, proliferation-related genes, and invasion-related genes. Expression of estrogen receptor (ER)-related genes and Recurrence Score did not correlate with pCR. In an exploratory analysis we compared gene expression in IBC to non-inflammatory breast cancer; twenty-four (9%) of the genes were differentially expressed (p < 0.05), 5 were upregulated and 19 were downregulated in IBC. Gene expression analysis on core biopsy samples is feasible and identifies candidate genes that correlate with pCR to primary chemotherapy. Gene expression in IBC differs significantly from non-inflammatory breast cancer.

  12. Use of tiling array data and RNA secondary structure predictions to identify noncoding RNA genes

    DEFF Research Database (Denmark)

    Weile, Christian; Gardner, Paul P; Hedegaard, Mads M

    2007-01-01

    neuroblastoma cell line SK-N-AS. Using this strategy, we identify thousands of human candidate RNA genes. To further verify the expression of these genes, we focused on candidate genes that had stable hairpin structures or a high level of covariance. Using northern blotting, we verify the expression of 2 out...

  13. ColoLipidGene: signature of lipid metabolism-related genes to predict prognosis in stage-II colon cancer patients

    Science.gov (United States)

    Vargas, Teodoro; Moreno-Rubio, Juan; Herranz, Jesús; Cejas, Paloma; Molina, Susana; González-Vallinas, Margarita; Mendiola, Marta; Burgos, Emilio; Aguayo, Cristina; Custodio, Ana B.; Machado, Isidro; Ramos, David; Gironella, Meritxell; Espinosa-Salinas, Isabel; Ramos, Ricardo; Martín-Hernández, Roberto; Risueño, Alberto; De Las Rivas, Javier; Reglero, Guillermo; Yaya, Ricardo; Fernández-Martos, Carlos; Aparicio, Jorge; Maurel, Joan; Feliu, Jaime; de Molina, Ana Ramírez

    2015-01-01

    Lipid metabolism plays an essential role in carcinogenesis due to the requirements of tumoral cells to sustain increased structural, energetic and biosynthetic precursor demands for cell proliferation. We investigated the association between expression of lipid metabolism-related genes and clinical outcome in intermediate-stage colon cancer patients with the aim of identifying a metabolic profile associated with greater malignancy and increased risk of relapse. The expression profile of 70 lipid metabolism-related genes was determined in 77 patients with stage II colon cancer. Cox regression analyses using c-index methodology were applied to identify a metabolic-related signature associated with prognosis. The metabolic signature was further confirmed in two independent validation sets of 120 patients and, additionally, in a group of 264 patients from a public database. The combined analysis of 4 genes, ABCA1, ACSL1, AGPAT1 and SCD, constitutes a metabolic signature (ColoLipidGene) able to accurately stratify stage II colon cancer patients with 5-fold higher risk of relapse with strong statistical power in the four independent groups of patients. The identification of a group of 4 genes that predict survival in intermediate-stage colon cancer patients allows delineation of a high-risk group that may benefit from adjuvant therapy, and avoids the toxic and unnecessary chemotherapy in patients classified as low-risk group. PMID:25749516
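
    The abstract does not detail how the c-index was computed. As a rough illustration, Harrell's concordance index for right-censored follow-up data can be sketched as below; the function name and the toy follow-up values are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the higher-risk patient relapsed first.

    times: follow-up time per patient; events: 1 if relapse observed, 0 if censored;
    risk_scores: model output where larger means higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy follow-up data (hypothetical): the third patient is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_good = concordance_index(times, events, [0.9, 0.7, 0.5, 0.1])
c_bad = concordance_index(times, events, [0.1, 0.5, 0.7, 0.9])
```

    A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking of patients by risk, so candidate gene signatures can be compared by how far their c-index rises above 0.5.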

  14. In Silico Prediction of Horizontal Gene Transfer Events in Lactobacillus bulgaricus and Streptococcus thermophilus Reveals Protocooperation in Yogurt Manufacturing

    Science.gov (United States)

    Liu, Mengjin; Siezen, Roland J.; Nauta, Arjen

    2009-01-01

    Lactobacillus bulgaricus and Streptococcus thermophilus, used in yogurt starter cultures, are well known for their stability and protocooperation during their coexistence in milk. In this study, we show that a close interaction between the two species also takes place at the genetic level. We performed an in silico analysis, combining gene composition and gene transfer mechanism-associated features, and predicted horizontally transferred genes in both L. bulgaricus and S. thermophilus. Putative horizontal gene transfer (HGT) events that have occurred between the two bacterial species include the transfer of exopolysaccharide (EPS) biosynthesis genes, transferred from S. thermophilus to L. bulgaricus, and the gene cluster cbs-cblB(cglB)-cysE for the metabolism of sulfur-containing amino acids, transferred from L. bulgaricus or Lactobacillus helveticus to S. thermophilus. The HGT event for the cbs-cblB(cglB)-cysE gene cluster was analyzed in detail, with respect to both evolutionary and functional aspects. It can be concluded that during the coexistence of both yogurt starter species in a milk environment, agonistic coevolution at the genetic level has probably been involved in the optimization of their combined growth and interactions. PMID:19395564

  15. In silico prediction of horizontal gene transfer events in Lactobacillus bulgaricus and Streptococcus thermophilus reveals protocooperation in yogurt manufacturing.

    Science.gov (United States)

    Liu, Mengjin; Siezen, Roland J; Nauta, Arjen

    2009-06-01

    Lactobacillus bulgaricus and Streptococcus thermophilus, used in yogurt starter cultures, are well known for their stability and protocooperation during their coexistence in milk. In this study, we show that a close interaction between the two species also takes place at the genetic level. We performed an in silico analysis, combining gene composition and gene transfer mechanism-associated features, and predicted horizontally transferred genes in both L. bulgaricus and S. thermophilus. Putative horizontal gene transfer (HGT) events that have occurred between the two bacterial species include the transfer of exopolysaccharide (EPS) biosynthesis genes, transferred from S. thermophilus to L. bulgaricus, and the gene cluster cbs-cblB(cglB)-cysE for the metabolism of sulfur-containing amino acids, transferred from L. bulgaricus or Lactobacillus helveticus to S. thermophilus. The HGT event for the cbs-cblB(cglB)-cysE gene cluster was analyzed in detail, with respect to both evolutionary and functional aspects. It can be concluded that during the coexistence of both yogurt starter species in a milk environment, agonistic coevolution at the genetic level has probably been involved in the optimization of their combined growth and interactions.

  16. Predicting survival in patients with metastatic kidney cancer by gene-expression profiling in the primary tumor.

    Science.gov (United States)

    Vasselli, James R; Shih, Joanna H; Iyengar, Shuba R; Maranchie, Jodi; Riss, Joseph; Worrell, Robert; Torres-Cabala, Carlos; Tabios, Ray; Mariotti, Andra; Stearman, Robert; Merino, Maria; Walther, McClellan M; Simon, Richard; Klausner, Richard D; Linehan, W Marston

    2003-06-10

    To identify potential molecular determinants of tumor biology and possible clinical outcomes, global gene-expression patterns were analyzed in the primary tumors of patients with metastatic renal cell cancer by using cDNA microarrays. We used grossly dissected tumor masses that included tumor, blood vessels, connective tissue, and infiltrating immune cells to obtain a gene-expression "profile" from each primary tumor. Two patterns of gene expression were found within this uniformly staged patient population, which correlated with a significant difference in overall survival between the two patient groups. Subsets of genes most significantly associated with survival were defined, and vascular cell adhesion molecule-1 (VCAM-1) was the gene most predictive for survival. Therefore, despite the complex biological nature of metastatic cancer, basic clinical behavior as defined by survival may be determined by the gene-expression patterns expressed within the compilation of primary gross tumor cells. We conclude that survival in patients with metastatic renal cell cancer can be correlated with the expression of various genes based solely on the expression profile in the primary kidney tumor.

  17. The cure: design and evaluation of a crowdsourcing game for gene selection for breast cancer survival prediction.

    Science.gov (United States)

    Good, Benjamin M; Loguercio, Salvatore; Griffith, Obi L; Nanis, Max; Wu, Chunlei; Su, Andrew I

    2014-07-29

    Molecular signatures for predicting breast cancer prognosis could greatly improve care through personalization of treatment. Computational analyses of genome-wide expression datasets have identified such signatures, but these signatures leave much to be desired in terms of accuracy, reproducibility, and biological interpretability. Methods that take advantage of structured prior knowledge (eg, protein interaction networks) show promise in helping to define better signatures, but most knowledge remains unstructured. Crowdsourcing via scientific discovery games is an emerging methodology that has the potential to tap into human intelligence at scales and in modes unheard of before. The main objective of this study was to test the hypothesis that knowledge linking expression patterns of specific genes to breast cancer outcomes could be captured from players of an open, Web-based game. We envisioned capturing knowledge both from the player's prior experience and from their ability to interpret text related to candidate genes presented to them in the context of the game. We developed and evaluated an online game called The Cure that captured information from players regarding genes for use as predictors of breast cancer survival. Information gathered from game play was aggregated using a voting approach, and used to create rankings of genes. The top genes from these rankings were evaluated using annotation enrichment analysis, comparison to prior predictor gene sets, and by using them to train and test machine learning systems for predicting 10 year survival. Between its launch in September 2012 and September 2013, The Cure attracted more than 1000 registered players, who collectively played nearly 10,000 games. Gene sets assembled through aggregation of the collected data showed significant enrichment for genes known to be related to key concepts such as cancer, disease progression, and recurrence. In terms of the predictive accuracy of models trained using this
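
    The abstract says only that game-play information was aggregated using a voting approach. One simple reading, given here as an assumption rather than the authors' actual scheme, is to give each gene one vote per game in which a player selected it and to rank genes by vote count; the function name and the example selections are hypothetical.

```python
from collections import Counter


def rank_genes_by_votes(games):
    """Rank genes by the number of games in which they were selected.

    games: iterable of gene-symbol lists, one list per completed game.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    votes = Counter()
    for picked in games:
        votes.update(set(picked))  # at most one vote per gene per game
    return sorted(votes, key=lambda gene: (-votes[gene], gene))


# Hypothetical selections from three games.
games = [["BRCA1", "TP53"], ["TP53", "ESR1"], ["TP53", "BRCA1", "BRCA1"]]
ranking = rank_genes_by_votes(games)
```

    The top entries of such a ranking could then serve as the candidate gene set passed on to enrichment analysis or model training.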

  18. Genomic instability of osteosarcoma cell lines in culture: impact on the prediction of metastasis relevant genes.

    Science.gov (United States)

    Muff, Roman; Rath, Prisni; Ram Kumar, Ram Mohan; Husmann, Knut; Born, Walter; Baudis, Michael; Fuchs, Bruno

    2015-01-01

    Osteosarcoma is a rare but highly malignant cancer of the bone. As a consequence, the number of established cell lines used for experimental in vitro and in vivo osteosarcoma research is limited and the value of these cell lines relies on their stability during culture. Here we investigated the stability in gene expression by microarray analysis and array genomic hybridization of three low metastatic cell lines and derivatives thereof with increased metastatic potential using cells of different passages. The osteosarcoma cell lines showed altered gene expression during in vitro culture, and it was more pronounced in two metastatic cell lines compared to the respective parental cells. Chromosomal instability contributed in part to the altered gene expression in SAOS and LM5 cells with low and high metastatic potential. To identify metastasis-relevant genes in a background of passage-dependent altered gene expression, genes involved in "Pathways in cancer" that were consistently regulated under all passage comparisons were evaluated. Genes belonging to "Hedgehog signaling pathway" and "Wnt signaling pathway" were significantly up-regulated, and IHH, WNT10B and TCF7 were found up-regulated in all three metastatic compared to the parental cell lines. Considerable instability during culture in terms of gene expression and chromosomal aberrations was observed in osteosarcoma cell lines. The use of cells from different passages and a search for genes consistently regulated in early and late passages allows the analysis of metastasis-relevant genes despite the observed instability in gene expression in osteosarcoma cell lines during culture.

  19. Genic regions of a large salamander genome contain long introns and novel genes

    Directory of Open Access Journals (Sweden)

    Bryant Susan V

    2009-01-01

    Background: The basis of genome size variation remains an outstanding question because DNA sequence data are lacking for organisms with large genomes. Sixteen BAC clones from the Mexican axolotl (Ambystoma mexicanum: c-value = 32 × 10^9 bp) were isolated and sequenced to characterize the structure of genic regions. Results: Annotation of genes within BACs showed that axolotl introns are on average 10× longer than orthologous vertebrate introns and they are predicted to contain more functional elements, including miRNAs and snoRNAs. Loci were discovered within BACs for two novel EST transcripts that are differentially expressed during spinal cord regeneration and skin metamorphosis. Unexpectedly, a third novel gene was also discovered while manually annotating BACs. Analysis of human-axolotl protein-coding sequences suggests there are 2% more lineage specific genes in the axolotl genome than the human genome, but the great majority (86%) of genes between axolotl and human are predicted to be 1:1 orthologs. Considering that axolotl genes are on average 5× larger than human genes, the genic component of the salamander genome is estimated to be incredibly large, approximately 2.8 gigabases! Conclusion: This study shows that a large salamander genome has a correspondingly large genic component, primarily because genes have incredibly long introns. These intronic sequences may harbor novel coding and non-coding sequences that regulate biological processes that are unique to salamanders.

  20. Accurate prediction of the functional significance of single nucleotide polymorphisms and mutations in the ABCA1 gene.

    Directory of Open Access Journals (Sweden)

    Liam R Brunham

    2005-12-01

    The human genome contains an estimated 100,000 to 300,000 DNA variants that alter an amino acid in an encoded protein. However, our ability to predict which of these variants are functionally significant is limited. We used a bioinformatics approach to define the functional significance of genetic variation in the ABCA1 gene, a cholesterol transporter crucial for the metabolism of high density lipoprotein cholesterol. To predict the functional consequence of each coding single nucleotide polymorphism and mutation in this gene, we calculated a substitution position-specific evolutionary conservation score for each variant, which considers site-specific variation among evolutionarily related proteins. To test the bioinformatics predictions experimentally, we evaluated the biochemical consequence of these sequence variants by examining the ability of cell lines stably transfected with the ABCA1 alleles to elicit cholesterol efflux. Our bioinformatics approach correctly predicted the functional impact of greater than 94% of the naturally occurring variants we assessed. The bioinformatics predictions were significantly correlated with the degree of functional impairment of ABCA1 mutations (r² = 0.62, p = 0.0008). These results have allowed us to define the impact of genetic variation on ABCA1 function and to suggest that the in silico evolutionary approach we used may be a useful tool in general for predicting the effects of DNA variation on gene function. In addition, our data suggest that considering patterns of positive selection, along with patterns of negative selection such as evolutionary conservation, may improve our ability to predict the functional effects of amino acid variation.

  1. An 80-gene set to predict response to preoperative chemoradiotherapy for rectal cancer by principle component analysis.

    Science.gov (United States)

    Empuku, Shinichiro; Nakajima, Kentaro; Akagi, Tomonori; Kaneko, Kunihiko; Hijiya, Naoki; Etoh, Tsuyoshi; Shiraishi, Norio; Moriyama, Masatsugu; Inomata, Masafumi

    2016-05-01

    Preoperative chemoradiotherapy (CRT) for locally advanced rectal cancer not only improves the postoperative local control rate, but also induces downstaging. However, it has not been established how to individually select patients who receive effective preoperative CRT. The aim of this study was to identify a predictor of response to preoperative CRT for locally advanced rectal cancer. This study is additional to our multicenter phase II study evaluating the safety and efficacy of preoperative CRT using oral fluorouracil (UMIN ID: 03396). From April 2009 to August 2011, 26 biopsy specimens obtained prior to CRT were analyzed by cyclopedic microarray analysis. Response to CRT was evaluated according to a histological grading system using surgically resected specimens. To decide on the number of genes for dividing patients into responder and non-responder groups, we statistically analyzed the data using a dimension reduction method, principal component analysis. Of the 26 cases, 11 were responders and 15 non-responders. No significant difference was found in clinical background data between the two groups. We determined that the optimal number of genes for the prediction of response was 80 of 40,000 and the functions of these genes were analyzed. When comparing non-responders with responders, genes expressed at a high level functioned in alternative splicing, whereas those expressed at a low level functioned in the septin complex. Thus, an 80-gene expression set that predicts response to preoperative CRT for locally advanced rectal cancer was identified using a novel statistical method.

  2. Common and rare variants in the exons and regulatory regions of osteoporosis-related genes improve osteoporotic fracture risk prediction.

    Science.gov (United States)

    Lee, Seung Hun; Kang, Moo Il; Ahn, Seong Hee; Lim, Kyeong-Hye; Lee, Gun Eui; Shin, Eun-Soon; Lee, Jong-Eun; Kim, Beom-Jun; Cho, Eun-Hee; Kim, Sang-Wook; Kim, Tae-Ho; Kim, Hyun-Ju; Yoon, Kun-Ho; Lee, Won Chul; Kim, Ghi Su; Koh, Jung-Min; Kim, Shin-Yoon

    2014-11-01

    Osteoporotic fracture risk is highly heritable, but genome-wide association studies have explained only a small proportion of the heritability to date. Genetic data may improve prediction of fracture risk in osteopenic subjects and assist early intervention and management. To detect common and rare variants in coding and regulatory regions related to osteoporosis-related traits, and to investigate whether genetic profiling improves the prediction of fracture risk. This cross-sectional study was conducted in three clinical units in Korea. Postmenopausal women with extreme phenotypes (n = 982) were used for the discovery set, and 3895 participants were used for the replication set. We performed targeted resequencing of 198 genes. Genetic risk scores from common variants (GRS-C) and from common and rare variants (GRS-T) were calculated. Nineteen common variants in 17 genes (of the discovered 34 functional variants in 26 genes) and 31 rare variants in five genes (of the discovered 87 functional variants in 15 genes) were associated with one or more osteoporosis-related traits. Accuracy of fracture risk classification was improved in the osteopenic patients by adding GRS-C to fracture risk assessment models (6.8%; P risk in an osteopenic individual.

  3. Locus-specific ribosomal RNA gene silencing in nucleolar dominance.

    Directory of Open Access Journals (Sweden)

    Michelle S Lewis

    2007-08-01

    The silencing of one parental set of rRNA genes in a genetic hybrid is an epigenetic phenomenon known as nucleolar dominance. We showed previously that silencing is restricted to the nucleolus organizer regions (NORs), the loci where rRNA genes are tandemly arrayed, and does not spread to or from neighboring protein-coding genes. One hypothesis is that nucleolar dominance is the net result of hundreds of silencing events acting one rRNA gene at a time. A prediction of this hypothesis is that rRNA gene silencing should occur independent of chromosomal location. An alternative hypothesis is that the regulatory unit in nucleolar dominance is the NOR, rather than each individual rRNA gene, in which case NOR localization may be essential for rRNA gene silencing. To test these alternative hypotheses, we examined the fates of rRNA transgenes integrated at ectopic locations. The transgenes were accurately transcribed in all independent transgenic Arabidopsis thaliana lines tested, indicating that NOR localization is not required for rRNA gene expression. Upon crossing the transgenic A. thaliana lines as ovule parents with A. lyrata to form F1 hybrids, a new system for the study of nucleolar dominance, the endogenous rRNA genes located within the A. thaliana NORs are silenced. However, rRNA transgenes escaped silencing in multiple independent hybrids. Collectively, our data suggest that rRNA gene activation can occur in a gene-autonomous fashion, independent of chromosomal location, whereas rRNA gene silencing in nucleolar dominance is locus-dependent.

  4. Transcriptional dissection of melanoma identifies a high-risk subtype underlying TP53 family genes and epigenome deregulation

    Science.gov (United States)

    Badal, Brateil; Solovyov, Alexander; Di Cecilia, Serena; Chan, Joseph Minhow; Chang, Li-Wei; Iqbal, Ramiz; Aydin, Iraz T.; Rajan, Geena S.; Chen, Chen; Abbate, Franco; Arora, Kshitij S.; Tanne, Antoine; Gruber, Stephen B.; Johnson, Timothy M.; Fullen, Douglas R.; Phelps, Robert; Bhardwaj, Nina; Bernstein, Emily; Ting, David T.; Brunner, Georg; Schadt, Eric E.; Greenbaum, Benjamin D.; Celebi, Julide Tok

    2017-01-01

    BACKGROUND. Melanoma is a heterogeneous malignancy. We set out to identify the molecular underpinnings of high-risk melanomas, those that are likely to progress rapidly, metastasize, and result in poor outcomes. METHODS. We examined transcriptome changes from benign states to early-, intermediate-, and late-stage tumors using a set of 78 treatment-naive melanocytic tumors consisting of primary melanomas of the skin and benign melanocytic lesions. We utilized a next-generation sequencing platform that enabled a comprehensive analysis of protein-coding and -noncoding RNA transcripts. RESULTS. Gene expression changes unequivocally discriminated between benign and malignant states, and a dual epigenetic and immune signature emerged defining this transition. To our knowledge, we discovered previously unrecognized melanoma subtypes. A high-risk primary melanoma subset was distinguished by a 122-epigenetic gene signature (“epigenetic” cluster) and TP53 family gene deregulation (TP53, TP63, and TP73). This subtype associated with poor overall survival and showed enrichment of cell cycle genes. Noncoding repetitive element transcripts (LINEs, SINEs, and ERVs) that can result in immunostimulatory signals recapitulating a state of “viral mimicry” were significantly repressed. The high-risk subtype and its poor predictive characteristics were validated in several independent cohorts. Additionally, primary melanomas distinguished by specific immune signatures (“immune” clusters) were identified. CONCLUSION. The TP53 family of genes and genes regulating the epigenetic machinery demonstrate strong prognostic and biological relevance during progression of early disease. Gene expression profiling of protein-coding and -noncoding RNA transcripts may be a better predictor for disease course in melanoma. This study outlines the transcriptional interplay of the cancer cell’s epigenome with the immune milieu with potential for future therapeutic targeting. FUNDING

  5. Multiclass Prediction with Partial Least Square Regression for Gene Expression Data: Applications in Breast Cancer Intrinsic Taxonomy

    Directory of Open Access Journals (Sweden)

    Chi-Cheng Huang

    2013-01-01

    Multiclass prediction remains an obstacle for high-throughput data analysis such as microarray gene expression profiles. Despite recent advancements in machine learning and bioinformatics, most classification tools were limited to the applications of binary responses. Our aim was to apply partial least square (PLS) regression for breast cancer intrinsic taxonomy, of which five distinct molecular subtypes were identified. The PAM50 signature genes were used as predictive variables in PLS analysis, and the latent gene component scores were used in binary logistic regression for each molecular subtype. The 139 prototypical arrays for PAM50 development were used as training dataset, and three independent microarray studies with Han Chinese origin were used for independent validation (n=535). The agreement between PAM50 centroid-based single sample prediction (SSP) and PLS-regression was excellent (weighted Kappa: 0.988) within the training samples, but deteriorated substantially in independent samples, which could be attributed to the much larger number of samples left unclassified by PLS-regression. If these unclassified samples were removed, the agreement between PAM50 SSP and PLS-regression improved enormously (weighted Kappa: 0.829, as opposed to 0.541 when unclassified samples were analyzed). Our study ascertained the feasibility of PLS-regression in multi-class prediction, and distinct clinical presentations and prognostic discrepancies were observed across breast cancer molecular subtypes.
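    The agreement statistic quoted in this record, weighted Kappa, can be illustrated with a minimal sketch. The abstract does not state which weighting scheme or category ordering the authors used, so the linear disagreement weights, the category order, and the function name `weighted_kappa` below are assumptions made purely for illustration.

```python
def weighted_kappa(labels_a, labels_b, categories, weights="linear"):
    """Weighted Cohen's kappa between two label sequences.

    categories: ordered list of the possible labels (ordering is an assumption).
    weights: "linear" or "quadratic" disagreement weights (assumed schemes).
    """
    k = len(categories)
    n = len(labels_a)
    if k < 2 or n == 0 or n != len(labels_b):
        raise ValueError("need >= 2 categories and equal-length, non-empty inputs")
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (rater A label, rater B label).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 1 if weights == "linear" else 2

    def disagreement(i, j):
        # 0 on the diagonal, 1 for the two most distant categories.
        return (abs(i - j) / (k - 1)) ** power

    num = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    den = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    if den == 0:
        raise ValueError("kappa is undefined when all labels fall in one category")
    return 1.0 - num / den


if __name__ == "__main__":
    # Hypothetical subtype calls from two classifiers on six samples.
    calls_a = ["s1", "s2", "s3", "s1", "s2", "s3"]
    calls_b = ["s1", "s2", "s3", "s1", "s3", "s3"]
    print(weighted_kappa(calls_a, calls_b, ["s1", "s2", "s3"]))
```

    A value of 1 means perfect agreement and 0 means agreement no better than chance. Because molecular subtypes are not naturally ordered, an unweighted kappa may be the more defensible choice in practice.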

  6. Construction of a novel multi-gene assay (42-gene classifier) for prediction of late recurrence in ER-positive breast cancer patients.

    Science.gov (United States)

    Tsunashima, Ryo; Naoi, Yasuto; Shimazu, Kenzo; Kagara, Naofumi; Shimoda, Masashi; Tanei, Tomonori; Miyake, Tomohiro; Kim, Seung Jin; Noguchi, Shinzaburo

    2018-05-04

    Prediction models for late (> 5 years) recurrence in ER-positive breast cancer need to be developed for the accurate selection of patients for extended hormonal therapy. We attempted to develop such a prediction model focusing on the differences in gene expression between breast cancers with early and late recurrence. For the training set, 779 ER-positive breast cancers treated with tamoxifen alone for 5 years were selected from the databases (GSE6532, GSE12093, GSE17705, and GSE26971). For the validation set, 221 ER-positive breast cancers treated with adjuvant hormonal therapy for 5 years with or without chemotherapy at our hospital were included. Gene expression was assayed by DNA microarray analysis (Affymetrix U133 plus 2.0). With the 42 genes differentially expressed in early and late recurrence breast cancers in the training set, a prediction model (42GC) for late recurrence was constructed. The patients classified by 42GC into the late recurrence-like group showed a significantly (P = 0.006) higher late recurrence rate as expected but a significantly (P = 1.62 × 10^-13) lower rate for early recurrence than the non-late recurrence-like group. These observations were confirmed for the validation set, i.e., P = 0.020 for late recurrence and P = 5.70 × 10^-5 for early recurrence. We developed a unique prediction model (42GC) for late recurrence by focusing on the biological differences between breast cancers with early and late recurrence. Interestingly, patients in the late recurrence-like group by 42GC were at low risk for early recurrence.

  7. Prediction of disease-related genes based on weighted tissue-specific networks by using DNA methylation.

    Science.gov (United States)

    Li, Min; Zhang, Jiayi; Liu, Qing; Wang, Jianxin; Wu, Fang-Xiang

    2014-01-01

    Predicting disease-related genes is one of the most important tasks in bioinformatics and systems biology. With the advances in high-throughput techniques, a large number of protein-protein interactions are available, which makes it possible to identify disease-related genes at the network level. However, network-based identification of disease-related genes is still a challenge because considerable false positives still exist in the currently available protein interaction networks (PIN). Considering the fact that the majority of genetic disorders tend to manifest only in a single or a few tissues, we constructed tissue-specific networks (TSN) by integrating PIN and tissue-specific data. We further weighted the constructed tissue-specific network (WTSN) by using DNA methylation, as it plays an irreplaceable role in the development of complex diseases. A PageRank-based method was developed to identify disease-related genes from the constructed networks. To validate the effectiveness of the proposed method, we constructed PIN, weighted PIN (WPIN), TSN, and WTSN for colon cancer and leukemia, respectively. The experimental results on colon cancer and leukemia show that the combination of tissue-specific data and DNA methylation can help to identify disease-related genes more accurately. Moreover, the PageRank-based method was effective in predicting disease-related genes in the case studies of colon cancer and leukemia. Tissue-specific data and DNA methylation are two important factors in the study of human diseases. The same method implemented on the WTSN can achieve better results than when it is implemented on the original PIN, WPIN, or TSN. The PageRank-based method outperforms the degree centrality-based method for identifying disease-related genes from WTSN.
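    To make the idea of ranking genes by PageRank on a weighted interaction network concrete, here is a minimal sketch using plain power iteration. It is not the authors' implementation: the abstract does not describe how methylation-derived weights enter the computation, so the edge weights, the damping factor of 0.85, the gene names, and the function name `weighted_pagerank` are all assumptions for illustration.

```python
def weighted_pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on an undirected weighted graph.

    edges: iterable of (node_a, node_b, weight) with weight > 0.
    Returns a dict mapping node -> score; the scores sum to 1.
    """
    # Build a symmetric weighted adjacency map.
    adj = {}
    for a, b, w in edges:
        adj.setdefault(a, {})
        adj.setdefault(b, {})
        adj[a][b] = adj[a].get(b, 0.0) + w
        adj[b][a] = adj[b].get(a, 0.0) + w

    nodes = sorted(adj)
    n = len(nodes)
    # Total edge weight at each node; every node has at least one edge.
    strength = {u: sum(adj[u].values()) for u in nodes}

    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Teleportation term, then weight-proportional propagation.
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            for v, w in adj[u].items():
                new[v] += damping * rank[u] * w / strength[u]
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical weighted network: geneA is a hub with strong edges.
    toy_edges = [
        ("geneA", "geneB", 1.0),
        ("geneA", "geneC", 1.0),
        ("geneA", "geneD", 1.0),
        ("geneB", "geneC", 0.2),
    ]
    scores = weighted_pagerank(toy_edges)
    for gene in sorted(scores, key=scores.get, reverse=True):
        print(gene, round(scores[gene], 4))
```

    Unlike degree centrality, which only counts a node's own edges, PageRank lets a node inherit importance from well-connected neighbours, which is one plausible reason the two rankings can differ on the same network.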

  8. Hierarchy in gene expression is predictive of risk, progression, and outcome in adult acute myeloid leukemia

    Science.gov (United States)

    Tripathi, Shubham; Deem, Michael W.

    2015-02-01

    Cancer progresses with a change in the structure of the gene network in normal cells. We define a measure of organizational hierarchy in gene networks of affected cells in adult acute myeloid leukemia (AML) patients. With a retrospective cohort analysis based on the gene expression profiles of 116 AML patients, we find that the likelihood of future cancer relapse and the level of clinical risk are directly correlated with the level of organization in the cancer related gene network. We also explore the variation of the level of organization in the gene network with cancer progression. We find that this variation is non-monotonic, which implies the fitness landscape in the evolution of AML cancer cells is non-trivial. We further find that the hierarchy in gene expression at the time of diagnosis may be a useful biomarker in AML prognosis.

  9. Hierarchy in gene expression is predictive of risk, progression, and outcome in adult acute myeloid leukemia

    International Nuclear Information System (INIS)

    Tripathi, Shubham; Deem, Michael W

    2015-01-01

    Cancer progresses with a change in the structure of the gene network in normal cells. We define a measure of organizational hierarchy in gene networks of affected cells in adult acute myeloid leukemia (AML) patients. With a retrospective cohort analysis based on the gene expression profiles of 116 AML patients, we find that the likelihood of future cancer relapse and the level of clinical risk are directly correlated with the level of organization in the cancer related gene network. We also explore the variation of the level of organization in the gene network with cancer progression. We find that this variation is non-monotonic, which implies the fitness landscape in the evolution of AML cancer cells is non-trivial. We further find that the hierarchy in gene expression at the time of diagnosis may be a useful biomarker in AML prognosis. (paper)

  10. A Predictive Coexpression Network Identifies Novel Genes Controlling the Seed-to-Seedling Phase Transition in Arabidopsis thaliana.

    Science.gov (United States)

    Silva, Anderson Tadeu; Ribone, Pamela A; Chan, Raquel L; Ligterink, Wilco; Hilhorst, Henk W M

    2016-04-01

    The transition from a quiescent dry seed to an actively growing photoautotrophic seedling is a complex and crucial trait for plant propagation. This study provides a detailed description of global gene expression in seven successive developmental stages of seedling establishment in Arabidopsis (Arabidopsis thaliana). Using the transcriptome signature from these developmental stages, we obtained a coexpression gene network that highlights interactions between known regulators of the seed-to-seedling transition and predicts the functions of uncharacterized genes in seedling establishment. The coexpressed gene data sets together with the transcriptional module indicate biological functions related to seedling establishment. Characterization of the homeodomain leucine zipper I transcription factor AtHB13, which is expressed during the seed-to-seedling transition, demonstrated that this gene regulates some of the network nodes and affects late seedling establishment. Knockout mutants for athb13 showed increased primary root length as compared with wild-type (Columbia-0) seedlings, suggesting that this transcription factor is a negative regulator of early root growth, possibly repressing cell division and/or cell elongation or the length of time that cells elongate. The signal transduction pathways present during the early phases of the seed-to-seedling transition anticipate the control of important events for a vigorous seedling, such as root growth. This study demonstrates that a gene coexpression network together with transcriptional modules can provide insights that are not derived from comparative transcript profiling alone. © 2016 American Society of Plant Biologists. All Rights Reserved.

  11. Efficient CRISPR/Cas9-Mediated Versatile, Predictable, and Donor-Free Gene Knockout in Human Pluripotent Stem Cells.

    Science.gov (United States)

    Liu, Zhongliang; Hui, Yi; Shi, Lei; Chen, Zhenyu; Xu, Xiangjie; Chi, Liankai; Fan, Beibei; Fang, Yujiang; Liu, Yang; Ma, Lin; Wang, Yiran; Xiao, Lei; Zhang, Quanbin; Jin, Guohua; Liu, Ling; Zhang, Xiaoqing

    2016-09-13

    Loss-of-function studies in human pluripotent stem cells (hPSCs) require efficient methodologies for lesion of genes of interest. Here, we introduce a donor-free paired gRNA-guided CRISPR/Cas9 knockout strategy (paired-KO) for efficient and rapid gene ablation in hPSCs. Through paired-KO, we succeeded in targeting all genes of interest with high biallelic targeting efficiencies. More importantly, during paired-KO, the cleaved DNA was repaired mostly through direct end joining without insertions/deletions (precise ligation), and thus makes the lesion product predictable. The paired-KO remained highly efficient for one-step targeting of multiple genes and was also efficient for targeting of microRNA, while for long non-coding RNA over 8 kb, cleavage of a short fragment of the core promoter region was sufficient to eradicate downstream gene transcription. This work suggests that the paired-KO strategy is a simple and robust system for loss-of-function studies for both coding and non-coding genes in hPSCs. Copyright © 2016 The Author(s). Published by Elsevier Inc. All rights reserved.

  12. Tumour gene expression predicts response to cetuximab in patients with KRAS wild-type metastatic colorectal cancer.

    Science.gov (United States)

    Baker, J B; Dutta, D; Watson, D; Maddala, T; Munneke, B M; Shak, S; Rowinsky, E K; Xu, L-A; Harbison, C T; Clark, E A; Mauro, D J; Khambata-Ford, S

    2011-02-01

    Although it is accepted that metastatic colorectal cancers (mCRCs) that carry activating mutations in KRAS are unresponsive to anti-epidermal growth factor receptor (EGFR) monoclonal antibodies, a significant fraction of KRAS wild-type (wt) mCRCs are also unresponsive to anti-EGFR therapy. Genes encoding EGFR ligands amphiregulin (AREG) and epiregulin (EREG) are promising gene expression-based markers but have not been incorporated into a test to dichotomise KRAS wt mCRC patients with respect to sensitivity to anti-EGFR treatment. We used RT-PCR to test 110 candidate gene expression markers in primary tumours from 144 KRAS wt mCRC patients who received monotherapy with the anti-EGFR antibody cetuximab. Results were correlated with multiple clinical endpoints: disease control, objective response, and progression-free survival (PFS). Expression of many of the tested candidate genes, including EREG and AREG, strongly associate with all clinical endpoints. Using multivariate analysis with two-layer five-fold cross-validation, we constructed a four-gene predictive classifier. Strikingly, patients below the classifier cutpoint had PFS and disease control rates similar to those of patients with KRAS mutant mCRC. Gene expression appears to identify KRAS wt mCRC patients who receive little benefit from cetuximab. It will be important to test this model in an independent validation study.

  13. Discovering biomarkers from gene expression data for predicting cancer subgroups using neural networks and relational fuzzy clustering

    Directory of Open Access Journals (Sweden)

    Sharma Animesh

    2007-01-01

    Background: The four heterogeneous childhood cancers, neuroblastoma, non-Hodgkin lymphoma, rhabdomyosarcoma, and Ewing sarcoma, present a similar histology of small round blue cell tumor (SRBCT) and thus often lead to misdiagnosis. Identification of biomarkers for distinguishing these cancers is a well studied problem. Existing methods typically evaluate each gene separately and do not take into account the nonlinear interaction between genes and the tools that are used to design the diagnostic prediction system. Consequently, more genes are usually identified as necessary for prediction. We propose a general scheme for finding a small set of biomarkers to design a diagnostic system for accurate classification of the cancer subgroups. We use multilayer networks with online gene selection ability and relational fuzzy clustering to identify a small set of biomarkers for accurate classification of the training and blind test cases of a well studied data set. Results: Our method discerned just seven biomarkers that precisely categorized the four subgroups of cancer both in training and blind samples. For the same problem, others suggested 19–94 genes. These seven biomarkers include three novel genes (NAB2, LSP1 and EHD1), not identified by others, with distinct class-specific signatures and important roles in cancer biology, including cellular proliferation, transendothelial migration and trafficking of MHC class antigens. Interestingly, NAB2 is downregulated in other tumors including non-Hodgkin lymphoma and neuroblastoma, but we observed moderate to high upregulation in a few cases of Ewing sarcoma and rhabdomyosarcoma, suggesting that NAB2 might be mutated in these tumors. These genes can discover the subgroups correctly with unsupervised learning, can differentiate non-SRBCT samples and they perform equally well with other machine learning tools including support vector machines. These biomarkers lead to four simple human interpretable

  14. Predicting Recurrence and Progression of Noninvasive Papillary Bladder Cancer at Initial Presentation Based on Quantitative Gene Expression Profiles

    DEFF Research Database (Denmark)

    Birkhahn, M.; Mitra, A.P.; Williams, Johan

    2010-01-01

    Background: Currently, tumor grade is the best predictor of outcome at first presentation of noninvasive papillary (Ta) bladder cancer. However, reliable predictors of Ta tumor recurrence and progression for individual patients, which could optimize treatment and follow-up schedules based on specific tumor biology, are yet to be identified. Objective: To identify genes predictive for recurrence and progression in Ta bladder cancer at first presentation using a quantitative, pathway-specific approach. Design, setting, and participants: Retrospective study of patients with Ta G2/3 bladder tumors ... % specificity. Since this is a small retrospective study using medium-throughput profiling, larger confirmatory studies are needed. Conclusions: Gene expression profiling across relevant cancer pathways appears to be a promising approach for Ta bladder tumor outcome prediction at initial diagnosis...

  15. Genomic instability of osteosarcoma cell lines in culture: impact on the prediction of metastasis relevant genes.

    Directory of Open Access Journals (Sweden)

    Roman Muff

    Osteosarcoma is a rare but highly malignant cancer of the bone. As a consequence, the number of established cell lines used for experimental in vitro and in vivo osteosarcoma research is limited and the value of these cell lines relies on their stability during culture. Here we investigated the stability in gene expression by microarray analysis and array genomic hybridization of three low metastatic cell lines and derivatives thereof with increased metastatic potential using cells of different passages. The osteosarcoma cell lines showed altered gene expression during in vitro culture, and it was more pronounced in two metastatic cell lines compared to the respective parental cells. Chromosomal instability contributed in part to the altered gene expression in SAOS and LM5 cells with low and high metastatic potential. To identify metastasis-relevant genes in a background of passage-dependent altered gene expression, genes involved in "Pathways in cancer" that were consistently regulated under all passage comparisons were evaluated. Genes belonging to "Hedgehog signaling pathway" and "Wnt signaling pathway" were significantly up-regulated, and IHH, WNT10B and TCF7 were found up-regulated in all three metastatic compared to the parental cell lines. Considerable instability during culture in terms of gene expression and chromosomal aberrations was observed in osteosarcoma cell lines. The use of cells from different passages and a search for genes consistently regulated in early and late passages allows the analysis of metastasis-relevant genes despite the observed instability in gene expression in osteosarcoma cell lines during culture.

  16. A multiplexed miRNA and transgene expression platform for simultaneous repression and expression of protein coding sequences.

    Science.gov (United States)

    Seyhan, Attila A

    2016-01-01

    Knockdown of single or multiple gene targets by RNA interference (RNAi) is necessary to overcome escape mutants or isoform redundancy. It is also necessary to use multiple RNAi reagents to knock down multiple targets. It is also desirable to express a transgene or positive regulatory elements and inhibit a target gene in a coordinated fashion. This study reports a flexible multiplexed RNAi and transgene platform using endogenous intronic primary microRNAs (pri-miRNAs) as a scaffold located in the green fluorescent protein (GFP) as a model for any functional transgene. The multiplexed intronic miRNA - GFP transgene platform was designed to co-express multiple small RNAs within the polycistronic cluster from a Pol II promoter at more moderate levels to reduce potential vector toxicity. The native intronic miRNAs are co-transcribed with a precursor GFP mRNA as a single transcript and presumably cleaved out of the precursor (pre-)mRNA by the RNA splicing machinery, the spliceosome. The spliced intron with miRNA hairpins will be further processed into mature miRNAs or small interfering RNAs (siRNAs) capable of triggering RNAi effects, while the ligated exons become a mature messenger RNA for the translation of the functional GFP protein. Data show that this approach led to robust RNAi-mediated silencing of multiple Renilla Luciferase (R-Luc)-tagged target genes and coordinated expression of functional GFP from a single transcript in transiently transfected HeLa cells. The results demonstrated that this design facilitates the coordinated expression of all mature miRNAs either as individual miRNAs or as multiple miRNAs and the associated protein. The data suggest that it is possible to simultaneously deliver multiple negative (miRNA or shRNA) and positive (transgene) regulatory elements. Because many cellular processes require simultaneous repression and activation of downstream pathways, this approach offers a platform technology to achieve that dual manipulation efficiently.

  17. Gene trio signatures as molecular markers to predict response to doxorubicin cyclophosphamide neoadjuvant chemotherapy in breast cancer patients

    Directory of Open Access Journals (Sweden)

    M.C. Barros Filho

    2010-12-01

    In breast cancer patients submitted to neoadjuvant chemotherapy (4 cycles of doxorubicin and cyclophosphamide, AC), expression of groups of three genes (gene trio signatures) could distinguish responsive from non-responsive tumors, as demonstrated by cDNA microarray profiling in a previous study by our group. In the current study, we determined whether the expression of the same genes would retain its predictive strength when analyzed by a more accessible technique (real-time RT-PCR). We evaluated 28 samples already analyzed by cDNA microarray, as a technical validation procedure, and 14 tumors, as an independent biological validation set. All patients received neoadjuvant chemotherapy (4 AC). Among five trio combinations previously identified, defined by nine genes individually investigated (BZRP, CLPTM1, MTSS1, NOTCH1, NUP210, PRSS11, RPL37A, SMYD2, and XLHSRF-1), the most accurate were the RPL37A, XLHSRF-1-based trios, with NOTCH1 or NUP210. Both trios correctly separated 86% of tumors (87% sensitivity and 80% specificity for predicting response), according to their response to chemotherapy (82% in a leave-one-out cross-validation method). Using the pre-established features obtained by linear discriminant analysis, 71% of samples from the biological validation set were also correctly classified by both trios (72% sensitivity; 66% specificity). Furthermore, we explored other gene combinations to achieve a higher accuracy in the technical validation group (as a training set). A new trio, MTSS1, RPL37 and SMYD2, correctly classified 93% of samples from the technical validation group (95% sensitivity and 80% specificity; 86% accuracy by the cross-validation method) and 79% from the biological validation group (72% sensitivity and 100% specificity). Therefore, the combined expression of MTSS1, RPL37 and SMYD2, as evaluated by real-time RT-PCR, is a potential candidate to predict response to neoadjuvant doxorubicin and cyclophosphamide in breast cancer.
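    The sensitivity, specificity and overall accuracy figures quoted in this abstract are all derived from a 2x2 confusion table of predicted versus observed response. A minimal sketch of that calculation follows; the function name and the counts are hypothetical and chosen only for illustration, since the abstract reports percentages but not the underlying counts.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-table counts.

    tp/fn: responders predicted as responsive / non-responsive
    tn/fp: non-responders predicted as non-responsive / responsive
    """
    sensitivity = tp / (tp + fn)             # fraction of responders detected
    specificity = tn / (tn + fp)             # fraction of non-responders detected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for a 14-sample set (not taken from the paper).
sens, spec, acc = classification_metrics(tp=8, fn=3, tn=3, fp=0)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

    With small validation sets such as this one, a single reclassified sample shifts each percentage by several points, which is worth keeping in mind when comparing trios.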

  18. Advanced colorectal adenoma related gene expression signature may predict prognostic for colorectal cancer patients with adenoma-carcinoma sequence.

    Science.gov (United States)

    Li, Bing; Shi, Xiao-Yu; Liao, Dai-Xiang; Cao, Bang-Rong; Luo, Cheng-Hua; Cheng, Shu-Jun

    2015-01-01

    There are still no absolute parameters predicting progression of adenoma into cancer. The present study aimed to characterize functional differences on the multistep carcinogenetic process from the adenoma-carcinoma sequence. All samples were collected and mRNA expression profiling was performed by using Agilent Microarray high-throughput gene-chip technology. Then, the characteristics of mRNA expression profiles of adenoma-carcinoma sequence were described with bioinformatics software, and we analyzed the relationship between gene expression profiles of adenoma-adenocarcinoma sequence and clinical prognosis of colorectal cancer. The mRNA expressions of adenoma-carcinoma sequence were significantly different between high-grade intraepithelial neoplasia group and adenocarcinoma group. The biological process of gene ontology function enrichment analysis on differentially expressed genes between high-grade intraepithelial neoplasia group and adenocarcinoma group showed that genes enriched in the extracellular structure organization, skeletal system development, biological adhesion and itself regulated growth regulation, with the P value after FDR correction of less than 0.05. In addition, IPR-related protein mainly focused on the insulin-like growth factor binding proteins. The variable trends of gene expression profiles for adenoma-carcinoma sequence were mainly concentrated in high-grade intraepithelial neoplasia and adenocarcinoma. The differentially expressed genes are significantly correlated between high-grade intraepithelial neoplasia group and adenocarcinoma group. Bioinformatics analysis is an effective way to study the gene expression profiles in the adenoma-carcinoma sequence, and may provide an effective tool to involve colorectal cancer research strategy into colorectal adenoma or advanced adenoma.

  19. Prediction of Metastasis and Recurrence in Colorectal Cancer Based on Gene Expression Analysis: Ready for the Clinic?

    International Nuclear Information System (INIS)

    Shibayama, Masaki; Maak, Matthias; Nitsche, Ulrich; Gotoh, Kengo; Rosenberg, Robert; Janssen, Klaus-Peter

    2011-01-01

    Cancers of the colon and rectum, which rank among the most frequent human tumors, are currently treated by surgical resection in locally restricted tumor stages. However, disease recurrence and formation of local and distant metastasis frequently occur even in cases with successful curative resection of the primary tumor (R0). Recent technological advances in molecular diagnostic analysis have led to a wealth of knowledge about the changes in gene transcription in all stages of colorectal tumors. Differential gene expression, or transcriptome analysis, has been proposed by many groups to predict disease recurrence, clinical outcome, and also response to therapy, in addition to the well-established clinico-pathological factors. However, the clinical usability of gene expression profiling as a reliable and robust prognostic tool that allows evidence-based clinical decisions is currently under debate. In this review, we will discuss the most recent data on the prognostic significance and potential clinical application of genome wide expression analysis in colorectal cancer

  20. Prediction of Metastasis and Recurrence in Colorectal Cancer Based on Gene Expression Analysis: Ready for the Clinic?

    Energy Technology Data Exchange (ETDEWEB)

    Shibayama, Masaki [Sysmex Corporation, Central Research Laboratories, Kobe 651-2271 (Japan); Maak, Matthias; Nitsche, Ulrich [Chirurgische Klinik, Klinikum Rechts der Isar der TUM, München 81657 (Germany); Gotoh, Kengo [Sysmex Corporation, Central Research Laboratories, Kobe 651-2271 (Japan); Rosenberg, Robert; Janssen, Klaus-Peter, E-mail: klaus-peter.janssen@lrz.tum.de [Chirurgische Klinik, Klinikum Rechts der Isar der TUM, München 81657 (Germany)

    2011-07-07

    Cancers of the colon and rectum, which rank among the most frequent human tumors, are currently treated by surgical resection in locally restricted tumor stages. However, disease recurrence and formation of local and distant metastasis frequently occur even in cases with successful curative resection of the primary tumor (R0). Recent technological advances in molecular diagnostic analysis have led to a wealth of knowledge about the changes in gene transcription in all stages of colorectal tumors. Differential gene expression, or transcriptome analysis, has been proposed by many groups to predict disease recurrence, clinical outcome, and also response to therapy, in addition to the well-established clinico-pathological factors. However, the clinical usability of gene expression profiling as a reliable and robust prognostic tool that allows evidence-based clinical decisions is currently under debate. In this review, we will discuss the most recent data on the prognostic significance and potential clinical application of genome wide expression analysis in colorectal cancer.

  1. Advanced colorectal adenoma related gene expression signature may predict prognostic for colorectal cancer patients with adenoma-carcinoma sequence

    OpenAIRE

    Li, Bing; Shi, Xiao-Yu; Liao, Dai-Xiang; Cao, Bang-Rong; Luo, Cheng-Hua; Cheng, Shu-Jun

    2015-01-01

    Background: There are still no absolute parameters predicting progression of adenoma into cancer. The present study aimed to characterize functional differences on the multistep carcinogenetic process from the adenoma-carcinoma sequence. Methods: All samples were collected and mRNA expression profiling was performed by using Agilent Microarray high-throughput gene-chip technology. Then, the characteristics of mRNA expression profiles of adenoma-carcinoma sequence were described with bioinform...

  2. Prognostic and predictive value of VHL gene alteration in renal cell carcinoma: a meta-analysis and review.

    Science.gov (United States)

    Kim, Bum Jun; Kim, Jung Han; Kim, Hyeong Su; Zang, Dae Young

    2017-02-21

    The von Hippel-Lindau (VHL) gene is often inactivated in sporadic renal cell carcinoma (RCC) by mutation or promoter hypermethylation. The prognostic or predictive value of VHL gene alteration is not well established. We conducted this meta-analysis to evaluate the association between the VHL alteration and clinical outcomes in patients with RCC. We searched PUBMED, MEDLINE and EMBASE for articles including the following terms in their titles, abstracts, or keywords: 'kidney or renal', 'carcinoma or cancer or neoplasm or malignancy', 'von Hippel-Lindau or VHL', 'alteration or mutation or methylation', and 'prognostic or predictive'. There were six studies fulfilling inclusion criteria and a total of 663 patients with clear cell RCC were included in the study: 244 patients who received anti-vascular endothelial growth factor (VEGF) therapy in the predictive value analysis and 419 in the prognostic value analysis. Out of 663 patients, 410 (61.8%) had VHL alteration. The meta-analysis showed no association between the VHL gene alteration and overall response rate (relative risk = 1.47 [95% CI, 0.81-2.67], P = 0.20) or progression free survival (hazard ratio = 1.02 [95% CI, 0.72-1.44], P = 0.91) in patients with RCC who received VEGF-targeted therapy. There was also no correlation between the VHL alteration and overall survival (HR = 0.80 [95% CI, 0.56-1.14], P = 0.21). In conclusion, this meta-analysis indicates that VHL gene alteration has no prognostic or predictive value in patients with clear cell RCC.
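    The patient counts and intervals quoted in this abstract can be cross-checked with simple arithmetic: the two analysis groups (244 and 419) sum to 663, 410 of 663 is 61.8%, and each reported 95% confidence interval for a ratio spans 1, which is what "no association" means for a relative risk or hazard ratio. The short sketch below only restates numbers from the abstract; the variable and function names are illustrative.

```python
predictive_n = 244   # patients in the predictive value analysis (from the abstract)
prognostic_n = 419   # patients in the prognostic value analysis (from the abstract)
altered_n = 410      # patients with VHL alteration (from the abstract)

total_n = predictive_n + prognostic_n
altered_pct = round(100 * altered_n / total_n, 1)


def interval_includes_one(lower, upper):
    """A ratio whose 95% CI spans 1 is not significantly different from 1."""
    return lower <= 1.0 <= upper


# 95% confidence intervals as reported in the abstract.
intervals = {
    "overall response rate (RR)": (0.81, 2.67),
    "progression-free survival (HR)": (0.72, 1.44),
    "overall survival (HR)": (0.56, 1.14),
}
all_include_one = all(interval_includes_one(lo, hi) for lo, hi in intervals.values())

print(total_n, altered_pct, all_include_one)
```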

  3. Gene Expression Signature TOPFOX Reflecting Chromosomal Instability Refines Prediction of Prognosis in Grade 2 Breast Cancer

    DEFF Research Database (Denmark)

    Szasz, A.; Li, Qiyuan; Sztupinszki, Z.

    2011-01-01

    Purpose: To assess the ability of genes selected from those reflecting chromosomal instability to identify good and poor prognostic subsets of Grade 2 breast carcinomas. Methods: We selected genes for splitting grade 2 tumours into low and high grade type groups by using public databases. Patient...

  4. SVMRFE based approach for prediction of most discriminatory gene target for type II diabetes

    Directory of Open Access Journals (Sweden)

    Atul Kumar

    2017-06-01

    Type II diabetes is a chronic condition that affects the way the body metabolizes sugar, its important source of fuel, and it is becoming a chronic disease all over the world. It is now necessary to identify new potential drug targets that not only control the disease but can also treat it. Support vector machines (SVMs) are classifiers with the potential to separate discriminatory genes from non-discriminatory genes. SVMRFE, a modification of SVM, ranks the genes by their discriminatory power and eliminates the genes that are not involved in causing the disease. A gene regulatory network was built from the top-ranked coding genes to identify their role in causing diabetes. To further validate the results, a pathway study was performed to identify the involvement of the coding genes in type II diabetes. The genes obtained from this study showed a significant involvement in causing the disease and may be used as potential drug targets.

  5. Predicting Essential Genes and Proteins Based on Machine Learning and Network Topological Features: A Comprehensive Review

    Science.gov (United States)

    Zhang, Xue; Acencio, Marcio Luis; Lemke, Ney

    2016-01-01

    Essential proteins/genes are indispensable to the survival or reproduction of an organism, and the deletion of such essential proteins will result in lethality or infertility. The identification of essential genes is very important not only for understanding the minimal requirements for survival of an organism, but also for finding human disease genes and new drug targets. Experimental methods for identifying essential genes are costly, time-consuming, and laborious. With the accumulation of sequenced genomes data and high-throughput experimental data, many computational methods for identifying essential proteins are proposed, which are useful complements to experimental methods. In this review, we show the state-of-the-art methods for identifying essential genes and proteins based on machine learning and network topological features, point out the progress and limitations of current methods, and discuss the challenges and directions for further research. PMID:27014079

  6. Predictive Value of Gene Polymorphisms on Recurrence after the Withdrawal of Antithyroid Drugs in Patients with Graves’ Disease

    Directory of Open Access Journals (Sweden)

    Jia Liu

    2017-09-01

    Graves’ disease (GD) is one of the most common endocrine diseases. Antithyroid drug (ATD) treatment is frequently used as the first-choice therapy for GD patients in most countries because of its superior safety and tolerance. However, GD patients treated with ATDs have a relatively high recurrence rate after drug withdrawal, which is a main limitation of ATD treatment. It is of great importance to identify predictors of a higher recurrence risk for GD patients, which may facilitate an appropriate therapeutic approach for a given patient at the time of GD diagnosis. Genetic factors are widely believed to play an important role in the pathogenesis of GD. An increasing number of studies have investigated the relationship between gene polymorphisms and the recurrence risk in GD patients. In this article, we review the current literature to highlight the predictive value of gene polymorphisms for recurrence risk in GD patients after ATD withdrawal. Some gene polymorphisms, such as CTLA4 rs231775 and human leukocyte antigen polymorphisms (DRB1*03, DQA1*05, and DQB1*02), might be associated with a high recurrence risk in GD patients. Further prospective studies on patients of different ethnicities, especially studies with large sample sizes and long-term follow-up, should be conducted to confirm the predictive roles of gene polymorphisms.

  7. The accuracy of survival time prediction for patients with glioma is improved by measuring mitotic spindle checkpoint gene expression.

    Directory of Open Access Journals (Sweden)

    Li Bie

    Identification of gene expression changes that improve prediction of survival time across all glioma grades would be clinically useful. Four Affymetrix GeneChip datasets from the literature, containing data from 771 glioma samples representing all WHO grades and eight normal brain samples, were used in an ANOVA model to screen for transcript changes that correlated with grade. Observations were confirmed and extended using qPCR assays on RNA derived from 38 additional glioma samples and eight normal samples for which survival data were available. RNA levels of eight major mitotic spindle assembly checkpoint (SAC) genes (BUB1, BUB1B, BUB3, CENPE, MAD1L1, MAD2L1, CDC20, TTK) significantly correlated with glioma grade and six also significantly correlated with survival time. In particular, the level of BUB1B expression was highly correlated with survival time (p<0.0001), and significantly outperformed all other measured parameters, including two standards: WHO grade and MIB-1 (Ki-67) labeling index. Measurement of the expression levels of a small set of SAC genes may complement histological grade and other clinical parameters for predicting survival time.

  8. Increased production of free fatty acids in Aspergillus oryzae by disruption of a predicted acyl-CoA synthetase gene.

    Science.gov (United States)

    Tamano, Koichi; Bruno, Kenneth S; Koike, Hideaki; Ishii, Tomoko; Miura, Ai; Umemura, Myco; Culley, David E; Baker, Scott E; Machida, Masayuki

    2015-04-01

    Fatty acids are attractive molecules as source materials for the production of biodiesel fuel. Previously, we attained a 2.4-fold increase in fatty acid production by increasing the expression of fatty acid synthesis-related genes in Aspergillus oryzae. In this study, we achieved an additional increase in the production of fatty acids by disrupting a predicted acyl-CoA synthetase gene in A. oryzae. The A. oryzae genome is predicted to encode six acyl-CoA synthetase genes and disruption of AO090011000642, one of the six genes, resulted in a 9.2-fold higher accumulation (corresponding to an increased production of 0.23 mmol/g dry cell weight) of intracellular fatty acid in comparison to the wild-type strain. Furthermore, by introducing a niaD marker from Aspergillus nidulans to the disruptant, as well as changing the concentration of nitrogen in the culture medium from 10 to 350 mM, fatty acid productivity reached 0.54 mmol/g dry cell weight. Analysis of the relative composition of the major intracellular free fatty acids caused by disruption of AO090011000642 in comparison to the wild-type strain showed an increase in stearic acid (7 to 26 %), decrease in linoleic acid (50 to 27 %), and no significant changes in palmitic or oleic acid (each around 20-25 %).

  9. FocusHeuristics - expression-data-driven network optimization and disease gene prediction.

    Science.gov (United States)

    Ernst, Mathias; Du, Yang; Warsow, Gregor; Hamed, Mohamed; Endlich, Nicole; Endlich, Karlhans; Murua Escobar, Hugo; Sklarz, Lisa-Madeleine; Sender, Sina; Junghanß, Christian; Möller, Steffen; Fuellen, Georg; Struckmann, Stephan

    2017-02-16

    Identifying genes that contribute to disease phenotypes remains a challenge for bioinformatics. Static knowledge on biological networks is often combined with the dynamics observed in gene expression levels over disease development, to find markers for diagnostics and therapy, and also putative disease-modulatory drug targets and drugs. The basis of current methods ranges from a focus on expression levels (Limma) to concentrating on network characteristics (PageRank, HITS/Authority Score), and both (DeMAND, Local Radiality). We present an integrative approach (the FocusHeuristics) that is thoroughly evaluated based on public expression data and molecular disease characteristics provided by DisGeNet. The FocusHeuristics combines three scores, i.e. the log fold change and another two, based on the sum and difference of log fold changes of genes/proteins linked in a network. A gene is kept when one of the scores to which it contributes is above a threshold. Our FocusHeuristics is both a predictor for gene-disease association and a bioinformatics method to reduce biological networks to their disease-relevant parts, by highlighting the dynamics observed in expression data. The FocusHeuristics is slightly, but significantly, better than other methods in its more successful identification of disease-associated genes as measured by AUC, and it delivers mechanistic explanations for its choice of genes.
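    The three-score filter described in this abstract can be sketched roughly as follows. This is an illustrative reading, not the published implementation: the use of absolute values, the separate thresholds per score, the function name `focus_filter` and the toy data are all assumptions made here.

```python
def focus_filter(lfc, edges, t_node, t_sum, t_diff):
    """Keep genes that contribute to at least one score above its threshold.

    lfc:   dict mapping gene -> log fold change
    edges: iterable of (gene_a, gene_b) pairs linked in the network
    """
    kept = set()
    # Score 1: the gene's own log fold change.
    for gene, value in lfc.items():
        if abs(value) > t_node:
            kept.add(gene)
    # Scores 2 and 3: sum and difference of log fold changes across a link.
    for a, b in edges:
        link_sum = abs(lfc[a] + lfc[b])
        link_diff = abs(lfc[a] - lfc[b])
        if link_sum > t_sum or link_diff > t_diff:
            kept.update((a, b))
    return kept


# Toy data (hypothetical): five genes and four network links.
lfc = {"A": 2.1, "B": 0.9, "C": -0.8, "D": 0.1, "E": -0.2}
edges = [("B", "C"), ("C", "D"), ("D", "E"), ("A", "D")]
kept = focus_filter(lfc, edges, t_node=2.0, t_sum=1.5, t_diff=1.5)
print(sorted(kept))
```

    In this toy network, B and C are individually unremarkable but are retained because they change in opposite directions across a link, which is the kind of case the difference score is meant to capture.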

  10. Phylogenetic Reconstruction of the Calosphaeriales and Togniniales Using Five Genes and Predicted RNA Secondary Structures of ITS, and Flabellascus tenuirostris gen. et sp. nov.

    Science.gov (United States)

    Réblová, Martina; Jaklitsch, Walter M; Réblová, Kamila; Štěpánek, Václav

    2015-01-01

    The Calosphaeriales is revisited with new collection data, living cultures, morphological studies of ascoma centrum, secondary structures of the internal transcribed spacer (ITS) rDNA and phylogeny based on novel DNA sequences of five nuclear ribosomal and protein-coding loci. Morphological features, molecular evidence and information from predicted RNA secondary structures of ITS converged upon robust phylogenies of the Calosphaeriales and Togniniales. The current concept of the Calosphaeriales includes the Calosphaeriaceae and Pleurostomataceae encompassing five monophyletic genera, Calosphaeria, Flabellascus gen. nov., Jattaea, Pleurostoma and Togniniella, strongly supported by Bayesian and Maximum Likelihood methods. The structural elements of ITS1 form characteristic patterns that are phylogenetically conserved, corroborate observations based on morphology and have a high predictive value at the generic level. Three major clades containing 44 species of Phaeoacremonium were recovered in the closely related Togniniales based on ITS, actin and β-tubulin sequences. They are newly characterized by sexual and RNA structural characters and ecology. This approach is a first step towards understanding of the molecular systematics of Phaeoacremonium and possibly its new classification. In the Calosphaeriales, Jattaea aphanospora sp. nov. and J. ribicola sp. nov. are introduced, Calosphaeria taediosa is combined in Jattaea and epitypified. The sexual morph of Phaeoacremonium cinereum was encountered for the first time on decaying wood and obtained in vitro. In order to achieve a single nomenclature, the genera of asexual morphs linked with the Calosphaeriales are transferred to synonymy of their sexual morphs following the principle of priority, i.e. Calosphaeriophora to Calosphaeria, Phaeocrella to Togniniella and Pleurostomophora to Pleurostoma. Three new combinations are proposed, i.e. Pleurostoma ochraceum comb. nov., P. repens comb. nov. and P. richardsiae comb

  11. A computational method based on the integration of heterogeneous networks for predicting disease-gene associations.

    Directory of Open Access Journals (Sweden)

    Xingli Guo

    The identification of disease-causing genes is a fundamental challenge in human health and of great importance in improving medical care, and provides a better understanding of gene functions. Recent computational approaches based on the interactions among human proteins and disease similarities have shown their power in tackling the issue. In this paper, a novel systematic and global method that integrates two heterogeneous networks for prioritizing candidate disease-causing genes is provided, based on the observation that genes causing the same or similar diseases tend to lie close to one another in a network of protein-protein interactions. In this method, the association score function between a query disease and a candidate gene is defined as the weighted sum of all the association scores between similar diseases and neighbouring genes. Moreover, the topological correlation of these two heterogeneous networks can be incorporated into the definition of the score function, and finally an iterative algorithm is designed for this issue. This method was tested with 10-fold cross-validation on all 1,126 diseases that have at least one known causal gene, and it ranked the correct gene as one of the top ten in 622 of all the 1,428 cases, significantly outperforming a state-of-the-art method called PRINCE. The results brought about by this method were applied to study three multi-factorial disorders: breast cancer, Alzheimer disease and diabetes mellitus type 2, and some suggestions of novel causal genes and candidate disease-causing subnetworks were provided for further investigation.
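    The "weighted sum" score mentioned in this abstract can be illustrated with a single, non-iterative pass. This is only a sketch under stated assumptions: the product weighting (disease similarity times gene-link weight), the function name `association_score` and all toy values are hypothetical, and the paper's topological correlation term and iterative algorithm are not reproduced here.

```python
def association_score(disease, gene, disease_sim, gene_weight, known):
    """Weighted sum of known associations between similar diseases and neighbouring genes.

    disease_sim: dict disease -> {similar disease: similarity}
    gene_weight: dict gene -> {neighbouring gene: interaction weight}
    known:       dict (disease, gene) -> prior association score
    """
    total = 0.0
    for d2, sim in disease_sim.get(disease, {}).items():
        for g2, weight in gene_weight.get(gene, {}).items():
            total += sim * weight * known.get((d2, g2), 0.0)
    return total


# Toy networks (hypothetical values).
disease_sim = {"d1": {"d2": 0.8, "d3": 0.2}}
gene_weight = {"g1": {"g2": 0.5, "g3": 1.0}, "g4": {"g3": 0.1}}
known = {("d2", "g2"): 1.0, ("d3", "g3"): 1.0}

scores = {g: association_score("d1", g, disease_sim, gene_weight, known)
          for g in ("g1", "g4")}
ranking = sorted(scores, key=scores.get, reverse=True)

# Top-ten success rate reported in the abstract: 622 of 1,428 cases.
top_ten_pct = round(100 * 622 / 1428, 1)
print(ranking, top_ten_pct)
```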

  12. Comparison of loline alkaloid gene clusters across fungal endophytes: predicting the co-regulatory sequence motifs and the evolutionary history.

    Science.gov (United States)

    Kutil, Brandi L; Greenwald, Charles; Liu, Gang; Spiering, Martin J; Schardl, Christopher L; Wilkinson, Heather H

    2007-10-01

    LOL, a fungal secondary metabolite gene cluster found in Epichloë and Neotyphodium species, is responsible for production of insecticidal loline alkaloids. To analyze the genetic architecture and to predict the evolutionary history of LOL, we compared five clusters from four fungal species (single clusters from Epichloë festucae, Neotyphodium sp. PauTG-1, Neotyphodium coenophialum, and two clusters we previously characterized in Neotyphodium uncinatum). Using PhyloCon to compare putative lol gene promoter regions, we have identified four motifs conserved across the lol genes in all five clusters. Each motif has significant similarity to known fungal transcription factor binding sites in the TRANSFAC database. Conservation of these motifs is further support for the hypothesis that the lol genes are co-regulated. Interestingly, the history of asexual Neotyphodium spp. includes multiple interspecific hybridization events. Comparing clusters from three Neotyphodium species and E. festucae allowed us to determine which Epichloë ancestors are the most likely contributors of LOL in these asexual species. For example, while no present day Epichloë typhina isolates are known to produce lolines, our data support the hypothesis that the E. typhina ancestor(s) of three asexual endophyte species contained a LOL gene cluster. Thus, these data support a model of evolution in which the polymorphism in loline alkaloid production phenotypes among endophyte species is likely due to the loss of the trait over time.

  13. Transcription factor binding site enrichment analysis predicts drivers of altered gene expression in nonalcoholic steatohepatitis

    Czech Academy of Sciences Publication Activity Database

    Lake, A.D.; Chaput, A.L.; Novák, Petr; Cherrington, N.J.; Smith, C.L.

    2016-01-01

    Vol. 122, December 15 (2016), pp. 62-71. ISSN 0006-2952. Institutional support: RVO:60077344. Keywords: Transcription factor; Liver; Gene expression; Bioinformatics. Subject RIV: CE - Biochemistry. Impact factor: 4.581 (year: 2016)

  14. An Individual-Based Diploid Model Predicts Limited Conditions Under Which Stochastic Gene Expression Becomes Advantageous

    KAUST Repository

    Matsumoto, Tomotaka; Mineta, Katsuhiko; Osada, Naoki; Araki, Hitoshi

    2015-01-01

    Recent studies suggest the existence of a stochasticity in gene expression (SGE) in many organisms, and its non-negligible effect on their phenotype and fitness. To date, however, how SGE affects the key parameters of population genetics

  15. PREDICTION OF THE COURSE OF OSTEOARTHROSIS FROM mTOR (MAMMALIAN TARGET OF RAPAMYCIN) GENE EXPRESSION

    Directory of Open Access Journals (Sweden)

    E V Chetina

    2012-01-01

    Results. Analysis of gene expression in the outpatients with OA identified two subgroups: in one subgroup (n = 13), mTOR expression was considerably lower than in the control group; the expression of ATG1 and p21 did not differ greatly from the control, and that of caspase 3 and TNF-α was significantly higher. The other outpatients (n = 20) and all the examined patients needing endoprosthetic replacement were found to have higher gene expression of mTOR, ATG1, p21, caspase 3, and TNF-α than the control group. Before endoprosthetic replacement, severe joint destruction in patients with OA was associated with enhanced gene expression of mTOR, ATG1, p21, and caspase 3. Conclusion. In early-stage disease, increased mTOR gene expression may serve as a prognostic marker of the severity of the disease and articular cartilage destruction.

  16. TBC2target: A Resource of Predicted Target Genes of Tea Bioactive Compounds

    Directory of Open Access Journals (Sweden)

    Shihua Zhang

    2018-02-01

    Tea is one of the most popular non-alcoholic beverages consumed worldwide. Numerous bioactive constituents of tea were confirmed to possess healthy benefits via the mechanisms of regulating gene expressions or protein activities. However, a complete interacting profile between tea bioactive compounds (TBCs) and their target genes is lacking, which put an obstacle in the study of healthy function of tea. To fill this gap, we developed a database of target genes of TBCs (TBC2target, http://camellia.ahau.edu.cn/TBC2target) based on a pharmacophore mapping approach. In TBC2target, 6,226 interactions between 240 TBCs and 673 target genes were documented. TBC2target contains detailed information about each interacting entry, such as TBC, CAS number, PubChem CID, source of compound (e.g., green, black), compound type, target gene(s) of TBC, gene symbol, gene ID, ENSEMBL ID, PDB ID, TBC bioactivity and the reference. Using the TBC-target associations, we constructed a bipartite network and provided users the global network and local sub-network visualization and topological analyses. The entire database is free for online browsing, searching and downloading. In addition, TBC2target provides a BLAST search function to facilitate use of the database. The particular strengths of TBC2target are the inclusion of the comprehensive TBC-target interactions, and the capacity to visualize and analyze the interacting networks, which may help uncovering the beneficial effects of tea on human health as a central resource in tea health community.

  17. TBC2target: A Resource of Predicted Target Genes of Tea Bioactive Compounds.

    Science.gov (United States)

    Zhang, Shihua; Zhang, Liang; Wang, Yijun; Yang, Jian; Liao, Mingzhi; Bi, Shoudong; Xie, Zhongwen; Ho, Chi-Tang; Wan, Xiaochun

    2018-01-01

    Tea is one of the most popular non-alcoholic beverages consumed worldwide. Numerous bioactive constituents of tea were confirmed to possess healthy benefits via the mechanisms of regulating gene expressions or protein activities. However, a complete interacting profile between tea bioactive compounds (TBCs) and their target genes is lacking, which put an obstacle in the study of healthy function of tea. To fill this gap, we developed a database of target genes of TBCs (TBC2target, http://camellia.ahau.edu.cn/TBC2target) based on a pharmacophore mapping approach. In TBC2target, 6,226 interactions between 240 TBCs and 673 target genes were documented. TBC2target contains detailed information about each interacting entry, such as TBC, CAS number, PubChem CID, source of compound (e.g., green, black), compound type, target gene(s) of TBC, gene symbol, gene ID, ENSEMBL ID, PDB ID, TBC bioactivity and the reference. Using the TBC-target associations, we constructed a bipartite network and provided users the global network and local sub-network visualization and topological analyses. The entire database is free for online browsing, searching and downloading. In addition, TBC2target provides a BLAST search function to facilitate use of the database. The particular strengths of TBC2target are the inclusion of the comprehensive TBC-target interactions, and the capacity to visualize and analyze the interacting networks, which may help uncovering the beneficial effects of tea on human health as a central resource in tea health community.

  18. Predicting multi-level drug response with gene expression profile in multiple myeloma using hierarchical ordinal regression.

    Science.gov (United States)

    Zhang, Xinyan; Li, Bingzong; Han, Huiying; Song, Sha; Xu, Hongxia; Hong, Yating; Yi, Nengjun; Zhuang, Wenzhuo

    2018-05-10

    Multiple myeloma (MM), like other cancers, is caused by the accumulation of genetic abnormalities. Heterogeneity exists in the patients' response to treatments, for example, bortezomib. This urges efforts to identify biomarkers from numerous molecular features and build predictive models for identifying patients that can benefit from a certain treatment scheme. However, previous studies treated the multi-level ordinal drug response as a binary response where only responsive and non-responsive groups are considered. It is desirable to directly analyze the multi-level drug response, rather than combining the response to two groups. In this study, we present a novel method to identify significantly associated biomarkers and then develop ordinal genomic classifier using the hierarchical ordinal logistic model. The proposed hierarchical ordinal logistic model employs the heavy-tailed Cauchy prior on the coefficients and is fitted by an efficient quasi-Newton algorithm. We apply our hierarchical ordinal regression approach to analyze two publicly available datasets for MM with five-level drug response and numerous gene expression measures. Our results show that our method is able to identify genes associated with the multi-level drug response and to generate powerful predictive models for predicting the multi-level response. The proposed method allows us to jointly fit numerous correlated predictors and thus build efficient models for predicting the multi-level drug response. The predictive model for the multi-level drug response can be more informative than the previous approaches. Thus, the proposed approach provides a powerful tool for predicting multi-level drug response and has important impact on cancer studies.

  19. Evolutionary relationships between miRNA genes and their activity.

    Science.gov (United States)

    Zhu, Yan; Skogerbø, Geir; Ning, Qianqian; Wang, Zhen; Li, Biqing; Yang, Shuang; Sun, Hong; Li, Yixue

    2012-12-22

    The emergence of vertebrates is characterized by a strong increase in miRNA families. MicroRNAs interact broadly with many transcripts, and the evolution of such a system is intriguing. However, evolutionary questions concerning the origin of miRNA genes and their subsequent evolution remain unexplained. In order to systematically understand the evolutionary relationship between miRNA genes and their function, we classified known human miRNAs into eight groups based on their evolutionary ages estimated by the maximum parsimony method. New miRNA genes with new functional sequences accumulated more dynamically in vertebrates than in Drosophila. Different levels of evolutionary selection were observed over miRNA gene sequences with different times of origin. Most genic miRNAs differ from their host genes in time of origin: there is no particular relationship between the age of a miRNA and the age of its host gene, and genic miRNAs are mostly younger than the corresponding host genes. MicroRNAs that originated over different time-scales are often predicted or verified to target the same or overlapping sets of genes, opening the possibility of substantial functional redundancy among miRNAs of different ages. A higher degree of tissue specificity and a lower expression level were found in young miRNAs. Our data showed that, compared with protein-coding genes, miRNA genes are more dynamic in terms of emergence and decay. Evolution patterns are quite different between miRNAs of different ages. MicroRNA activity is under tight control, with well-regulated expression increased and targeting decreased over time. Our work calls attention to the study of miRNA activity with a consideration of their origin time.

  20. Characterization and detection of a widely distributed gene cluster that predicts anaerobic choline utilization by human gut bacteria.

    Science.gov (United States)

    Martínez-del Campo, Ana; Bodea, Smaranda; Hamer, Hilary A; Marks, Jonathan A; Haiser, Henry J; Turnbaugh, Peter J; Balskus, Emily P

    2015-04-14

    Elucidation of the molecular mechanisms underlying the human gut microbiota's effects on health and disease has been complicated by difficulties in linking metabolic functions associated with the gut community as a whole to individual microorganisms and activities. Anaerobic microbial choline metabolism, a disease-associated metabolic pathway, exemplifies this challenge, as the specific human gut microorganisms responsible for this transformation have not yet been clearly identified. In this study, we established the link between a bacterial gene cluster, the choline utilization (cut) cluster, and anaerobic choline metabolism in human gut isolates by combining transcriptional, biochemical, bioinformatic, and cultivation-based approaches. Quantitative reverse transcription-PCR analysis and in vitro biochemical characterization of two cut gene products linked the entire cluster to growth on choline and supported a model for this pathway. Analyses of sequenced bacterial genomes revealed that the cut cluster is present in many human gut bacteria, is predictive of choline utilization in sequenced isolates, and is widely but discontinuously distributed across multiple bacterial phyla. Given that bacterial phylogeny is a poor marker for choline utilization, we were prompted to develop a degenerate PCR-based method for detecting the key functional gene choline TMA-lyase (cutC) in genomic and metagenomic DNA. Using this tool, we found that new choline-metabolizing gut isolates universally possessed cutC. We also demonstrated that this gene is widespread in stool metagenomic data sets. Overall, this work represents a crucial step toward understanding anaerobic choline metabolism in the human gut microbiota and underscores the importance of examining this microbial community from a function-oriented perspective. Anaerobic choline utilization is a bacterial metabolic activity that occurs in the human gut and is linked to multiple diseases. 
While bacterial genes responsible for

  1. Phylogenetic classification of Pleurothecium and Pleurotheciella gen. nov. and its dactylaria-like anamorph (Sordariomycetes) based on nuclear ribosomal and protein-coding genes

    Czech Academy of Sciences Publication Activity Database

    Réblová, Martina; Seifert, K. A.; Fournier, J.; Štěpánek, Václav

    2012-01-01

    Vol. 104, No. 6 (2012), pp. 1299-1314 ISSN 0027-5514 R&D Projects: GA ČR GAP506/12/0038 Institutional support: RVO:67985939 ; RVO:61388971 Keywords: holoblastic denticulate conidiogenesis * life cycles * Sterigmatobotrys Subject RIV: EF - Botanics; EE - Microbiology, Virology (MBU-M) Impact factor: 2.110, year: 2012

  2. Methylation of miRNA genes and oncogenesis.

    Science.gov (United States)

    Loginov, V I; Rykov, S V; Fridman, M V; Braga, E A

    2015-02-01

    Interaction between microRNA (miRNA) and messenger RNA of target genes at the posttranscriptional level provides fine-tuned dynamic regulation of cell signaling pathways. Each miRNA can be involved in regulating hundreds of protein-coding genes, and, conversely, a number of different miRNAs usually target a structural gene. Epigenetic gene inactivation associated with methylation of promoter CpG-islands is common to both protein-coding genes and miRNA genes. Here, data on functions of miRNAs in development of tumor-cell phenotype are reviewed. Genomic organization of promoter CpG-islands of the miRNA genes located in inter- and intragenic areas is discussed. The literature and our own results on frequency of CpG-island methylation in miRNA genes from tumors are summarized, and data regarding a link between such modification and changed activity of miRNA genes and, consequently, protein-coding target genes are presented. Moreover, the impact of miRNA gene methylation on key oncogenetic processes as well as affected signaling pathways is discussed.

  3. Identification of epigenetically regulated genes that predict patient outcome in neuroblastoma

    International Nuclear Information System (INIS)

    Carén, Helena; Djos, Anna; Nethander, Maria; Sjöberg, Rose-Marie; Kogner, Per; Enström, Camilla; Nilsson, Staffan; Martinsson, Tommy

    2011-01-01

    Epigenetic mechanisms such as DNA methylation and histone modifications are important regulators of gene expression and are frequently involved in silencing tumor suppressor genes. In order to identify genes that are epigenetically regulated in neuroblastoma tumors, we treated four neuroblastoma cell lines with the demethylating agent 5-Aza-2'-deoxycytidine (5-Aza-dC) either separately or in conjunction with the histone deacetylase inhibitor trichostatin A (TSA). Expression was analyzed using whole-genome expression arrays to identify genes activated by the treatment. These data were then combined with data from genome-wide DNA methylation arrays to identify candidate genes silenced in neuroblastoma due to DNA methylation. We present eight genes (KRT19, PRKCDBP, SCNN1A, POU2F2, TGFBI, COL1A2, DHRS3 and DUSP23) that are methylated in neuroblastoma, most of them not previously reported as such, some of which also distinguish between biological subsets of neuroblastoma tumors. Differential methylation was observed for the genes SCNN1A (p < 0.001), PRKCDBP (p < 0.001) and KRT19 (p < 0.01). Among these, the mRNA expression of KRT19 and PRKCDBP was significantly lower in patients that have died from the disease compared with patients with no evidence of disease (fold change -8.3, p = 0.01 for KRT19 and fold change -2.4, p = 0.04 for PRKCDBP). In our study, a low methylation frequency of SCNN1A, PRKCDBP and KRT19 is significantly associated with favorable outcome in neuroblastoma. It is likely that analysis of specific DNA methylation will be one of several methods in future patient therapy stratification protocols for treatment of childhood neuroblastomas

  4. Predicting statistical properties of open reading frames in bacterial genomes.

    Directory of Open Access Journals (Sweden)

    Katharina Mir

    Full Text Available An analytical model based on the statistical properties of Open Reading Frames (ORFs) of eubacterial genomes, such as codon composition and sequence length of all reading frames, was developed. This new model predicts the average length, maximum length as well as the length distribution of the ORFs of 70 species with GC contents varying between 21% and 74%. Furthermore, the number of annotated genes is predicted with high accordance. However, the ORF length distribution in the five alternative reading frames shows interesting deviations from the predicted distribution. In particular, long ORFs appear more often than expected statistically. The unexpected depletion of stop codons in these alternative open reading frames cannot completely be explained by a biased codon usage in the +1 frame. While it is unknown if the stop codon depletion has a biological function, it could be due to a protein coding capacity of alternative ORFs exerting a selection pressure which prevents the fixation of stop codon mutations. The comparison of the analytical model with bacterial genomes, therefore, leads to a hypothesis suggesting novel gene candidates which can now be investigated in subsequent wet lab experiments.

  5. Gene

    Data.gov (United States)

    U.S. Department of Health & Human Services — Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes,...

  6. Exceptions to the rule: case studies in the prediction of pathogenicity for genetic variants in hereditary cancer genes.

    Science.gov (United States)

    Rosenthal, E T; Bowles, K R; Pruss, D; van Kan, A; Vail, P J; McElroy, H; Wenstrup, R J

    2015-12-01

    Based on current consensus guidelines and standard practice, many genetic variants detected in clinical testing are classified as disease causing based on their predicted impact on the normal expression or function of the gene in the absence of additional data. However, our laboratory has identified a subset of such variants in hereditary cancer genes for which compelling contradictory evidence emerged after the initial evaluation following the first observation of the variant. Three representative examples of variants in BRCA1, BRCA2 and MSH2 that are predicted to disrupt splicing, prematurely truncate the protein, or remove the start codon were evaluated for pathogenicity by analyzing clinical data with multiple classification algorithms. Available clinical data for all three variants contradicts the expected pathogenic classification. These variants illustrate potential pitfalls associated with standard approaches to variant classification as well as the challenges associated with monitoring data, updating classifications, and reporting potentially contradictory interpretations to the clinicians responsible for translating test outcomes to appropriate clinical action. It is important to address these challenges now as the model for clinical testing moves toward the use of large multi-gene panels and whole exome/genome analysis, which will dramatically increase the number of genetic variants identified. © 2015 The Authors. Clinical Genetics published by John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  7. Use of Artificial Intelligence and Machine Learning Algorithms with Gene Expression Profiling to Predict Recurrent Nonmuscle Invasive Urothelial Carcinoma of the Bladder.

    Science.gov (United States)

    Bartsch, Georg; Mitra, Anirban P; Mitra, Sheetal A; Almal, Arpit A; Steven, Kenneth E; Skinner, Donald G; Fry, David W; Lenehan, Peter F; Worzel, William P; Cote, Richard J

    2016-02-01

    Due to the high recurrence risk of nonmuscle invasive urothelial carcinoma it is crucial to distinguish patients at high risk from those with indolent disease. In this study we used a machine learning algorithm to identify the genes in patients with nonmuscle invasive urothelial carcinoma at initial presentation that were most predictive of recurrence. We used the genes in a molecular signature to predict recurrence risk within 5 years after transurethral resection of bladder tumor. Whole genome profiling was performed on 112 frozen nonmuscle invasive urothelial carcinoma specimens obtained at first presentation on Human WG-6 BeadChips (Illumina®). A genetic programming algorithm was applied to evolve classifier mathematical models for outcome prediction. Cross-validation based resampling and gene use frequencies were used to identify the most prognostic genes, which were combined into rules used in a voting algorithm to predict the sample target class. Key genes were validated by quantitative polymerase chain reaction. The classifier set included 21 genes that predicted recurrence. Quantitative polymerase chain reaction was done for these genes in a subset of 100 patients. A 5-gene combined rule incorporating a voting algorithm yielded 77% sensitivity and 85% specificity to predict recurrence in the training set, and 69% and 62%, respectively, in the test set. A singular 3-gene rule was constructed that predicted recurrence with 80% sensitivity and 90% specificity in the training set, and 71% and 67%, respectively, in the test set. Using primary nonmuscle invasive urothelial carcinoma from initial occurrences genetic programming identified transcripts in reproducible fashion, which were predictive of recurrence. These findings could potentially impact nonmuscle invasive urothelial carcinoma management. Copyright © 2016 American Urological Association Education and Research, Inc. Published by Elsevier Inc. All rights reserved.

  8. Gene expression signature in organized and growth arrested mammary acini predicts good outcome in breast cancer

    Energy Technology Data Exchange (ETDEWEB)

    Fournier, Marcia V.; Martin, Katherine J.; Kenny, Paraic A.; Xhaja, Kris; Bosch, Irene; Yaswen, Paul; Bissell, Mina J.

    2006-02-08

    To understand how non-malignant human mammary epithelial cells (HMEC) transit from a disorganized proliferating to an organized growth arrested state, and to relate this process to the changes that occur in breast cancer, we studied gene expression changes in non-malignant HMEC grown in three-dimensional cultures, and in a previously published panel of microarray data for 295 breast cancer samples. We hypothesized that the gene expression pattern of organized and growth arrested mammary acini would share similarities with breast tumors with good prognoses. Using Affymetrix HG-U133A microarrays, we analyzed the expression of 22,283 gene transcripts in two HMEC cell lines, 184 (finite life span) and HMT3522 S1 (immortal non-malignant), on successive days post-seeding in a laminin-rich extracellular matrix assay. Both HMECs underwent growth arrest in G0/G1 and differentiated into polarized acini between days 5 and 7. We identified gene expression changes with the same temporal pattern in both lines. We show that genes that are significantly lower in the organized, growth arrested HMEC than in their proliferating counterparts can be used to classify breast cancer patients into poor and good prognosis groups with high accuracy. This study represents a novel unsupervised approach to identifying breast cancer markers that may be of use clinically.

  9. Gene Expression Differences Predict Treatment Outcome of Merkel Cell Carcinoma Patients

    Directory of Open Access Journals (Sweden)

    Loren Masterson

    2014-01-01

    Full Text Available Due to the rarity of Merkel cell carcinoma (MCC), prospective clinical trials have not been practical. This study aimed to identify biomarkers with prognostic significance. While sixty-two patients were identified who were treated for MCC at our institution, only seventeen patients had adequate formalin-fixed paraffin-embedded archival tissue and follow-up to be included in the study. Patients were stratified into good, moderate, or poor prognosis. Laser capture microdissection was used to isolate tumor cells for subsequent RNA isolation and gene expression analysis with Affymetrix GeneChip Human Exon 1.0 ST arrays. Among the 191 genes demonstrating significant differential expression between prognostic groups, keratin 20 and neurofilament protein have previously been identified in studies of MCC and were significantly upregulated in tumors from patients with a poor prognosis. Immunohistochemistry further established that keratin 20 was overexpressed in the poor prognosis tumors. In addition, novel genes of interest such as phospholipase A2 group X, kinesin family member 3A, tumor protein D52, mucin 1, and KIT were upregulated in specimens from patients with poor prognosis. Our pilot study identified several gene expression differences which could be used in the future as prognostic biomarkers in MCC patients.

  10. Prediction of graft-versus-host disease in humans by donor gene-expression profiling.

    Directory of Open Access Journals (Sweden)

    Chantal Baron

    2007-01-01

    Full Text Available BACKGROUND: Graft-versus-host disease (GVHD) results from recognition of host antigens by donor T cells following allogeneic hematopoietic cell transplantation (AHCT). Notably, histoincompatibility between donor and recipient is necessary but not sufficient to elicit GVHD. Therefore, we tested the hypothesis that some donors may be "stronger alloresponders" than others, and consequently more likely to elicit GVHD. METHODS AND FINDINGS: To this end, we measured the gene-expression profiles of CD4(+) and CD8(+) T cells from 50 AHCT donors with microarrays. We report that pre-AHCT gene-expression profiling segregates donors whose recipient suffered from GVHD or not. Using quantitative PCR, established statistical tests, and analysis of multiple independent training-test datasets, we found that for chronic GVHD the "dangerous donor" trait (occurrence of GVHD in the recipient) is under polygenic control and is shaped by the activity of genes that regulate transforming growth factor-beta signaling and cell proliferation. CONCLUSIONS: These findings strongly suggest that the donor gene-expression profile has a dominant influence on the occurrence of GVHD in the recipient. The ability to discriminate strong and weak alloresponders using gene-expression profiling could pave the way to personalized transplantation medicine.

  11. Identification of Gene Networks for Residual Feed Intake in Angus Cattle Using Genomic Prediction and RNA-seq.

    Science.gov (United States)

    Weber, Kristina L; Welly, Bryan T; Van Eenennaam, Alison L; Young, Amy E; Porto-Neto, Laercio R; Reverter, Antonio; Rincon, Gonzalo

    2016-01-01

    Improvement in feed conversion efficiency can improve the sustainability of beef cattle production, but genomic selection for feed efficiency affects many underlying molecular networks and physiological traits. This study describes the differences between steer progeny of two influential Angus bulls with divergent genomic predictions for residual feed intake (RFI). Eight steer progeny of each sire were phenotyped for growth and feed intake from 8 mo. of age (average BW 254 kg, with a mean difference between sire groups of 4.8 kg) until slaughter at 14-16 mo. of age (average BW 534 kg, sire group difference of 28.8 kg). Terminal samples from pituitary gland, skeletal muscle, liver, adipose, and duodenum were collected from each steer for transcriptome sequencing. Gene expression networks were derived using partial correlation and information theory (PCIT), including differentially expressed (DE) genes, tissue specific (TS) genes, transcription factors (TF), and genes associated with RFI from a genome-wide association study (GWAS). Relative to progeny of the high RFI sire, progeny of the low RFI sire had -0.56 kg/d finishing period RFI (P = 0.05), -1.08 finishing period feed conversion ratio (P = 0.01), +3.3 kg^0.75 finishing period metabolic mid-weight (MMW; P = 0.04), +28.8 kg final body weight (P = 0.01), -12.9 feed bunk visits per day (P = 0.02) with +0.60 min/visit duration (P = 0.01), and +0.0045 carcass specific gravity (weight in air/weight in air-weight in water, a predictor of carcass fat content; P = 0.03). RNA-seq identified 633 DE genes between sire groups among 17,016 expressed genes. PCIT analysis identified >115,000 significant co-expression correlations between genes and 25 TF hubs, i.e. controllers of clusters of DE, TS, and GWAS SNP genes. Pathway analysis suggests low RFI bull progeny possess heightened gut inflammation and reduced fat deposition. This multi-omics analysis shows how differences in RFI genomic breeding values can impact other

  12. Identification of Gene Networks for Residual Feed Intake in Angus Cattle Using Genomic Prediction and RNA-seq.

    Directory of Open Access Journals (Sweden)

    Kristina L Weber

    Full Text Available Improvement in feed conversion efficiency can improve the sustainability of beef cattle production, but genomic selection for feed efficiency affects many underlying molecular networks and physiological traits. This study describes the differences between steer progeny of two influential Angus bulls with divergent genomic predictions for residual feed intake (RFI). Eight steer progeny of each sire were phenotyped for growth and feed intake from 8 mo. of age (average BW 254 kg, with a mean difference between sire groups of 4.8 kg) until slaughter at 14-16 mo. of age (average BW 534 kg, sire group difference of 28.8 kg). Terminal samples from pituitary gland, skeletal muscle, liver, adipose, and duodenum were collected from each steer for transcriptome sequencing. Gene expression networks were derived using partial correlation and information theory (PCIT), including differentially expressed (DE) genes, tissue specific (TS) genes, transcription factors (TF), and genes associated with RFI from a genome-wide association study (GWAS). Relative to progeny of the high RFI sire, progeny of the low RFI sire had -0.56 kg/d finishing period RFI (P = 0.05), -1.08 finishing period feed conversion ratio (P = 0.01), +3.3 kg^0.75 finishing period metabolic mid-weight (MMW; P = 0.04), +28.8 kg final body weight (P = 0.01), -12.9 feed bunk visits per day (P = 0.02) with +0.60 min/visit duration (P = 0.01), and +0.0045 carcass specific gravity (weight in air/weight in air-weight in water, a predictor of carcass fat content; P = 0.03). RNA-seq identified 633 DE genes between sire groups among 17,016 expressed genes. PCIT analysis identified >115,000 significant co-expression correlations between genes and 25 TF hubs, i.e. controllers of clusters of DE, TS, and GWAS SNP genes. Pathway analysis suggests low RFI bull progeny possess heightened gut inflammation and reduced fat deposition. This multi-omics analysis shows how differences in RFI genomic breeding values can impact other

  13. Gene expression profiles in paraffin-embedded core biopsy tissue predict response to chemotherapy in women with locally advanced breast cancer.

    Science.gov (United States)

    Gianni, Luca; Zambetti, Milvia; Clark, Kim; Baker, Joffre; Cronin, Maureen; Wu, Jenny; Mariani, Gabriella; Rodriguez, Jaime; Carcangiu, Marialuisa; Watson, Drew; Valagussa, Pinuccia; Rouzier, Roman; Symmans, W Fraser; Ross, Jeffrey S; Hortobagyi, Gabriel N; Pusztai, Lajos; Shak, Steven

    2005-10-10

    We sought to identify gene expression markers that predict the likelihood of chemotherapy response. We also tested whether chemotherapy response is correlated with the 21-gene Recurrence Score assay that quantifies recurrence risk. Patients with locally advanced breast cancer received neoadjuvant paclitaxel and doxorubicin. RNA was extracted from the pretreatment formalin-fixed paraffin-embedded core biopsies. The expression of 384 genes was quantified using reverse transcriptase polymerase chain reaction and correlated with pathologic complete response (pCR). The performance of genes predicting for pCR was tested in patients from an independent neoadjuvant study where gene expression was obtained using DNA microarrays. Of 89 assessable patients (mean age, 49.9 years; mean tumor size, 6.4 cm), 11 (12%) had a pCR. Eighty-six genes correlated with pCR (unadjusted P < .05); pCR was more likely with higher expression of proliferation-related genes and immune-related genes, and with lower expression of estrogen receptor (ER) -related genes. In 82 independent patients treated with neoadjuvant paclitaxel and doxorubicin, DNA microarray data were available for 79 of the 86 genes. In univariate analysis, 24 genes correlated with pCR with P < .05 (false discovery, four genes) and 32 genes showed correlation with P < .1 (false discovery, eight genes). The Recurrence Score was positively associated with the likelihood of pCR (P = .005), suggesting that the patients who are at greatest recurrence risk are more likely to have chemotherapy benefit. Quantitative expression of ER-related genes, proliferation genes, and immune-related genes are strong predictors of pCR in women with locally advanced breast cancer receiving neoadjuvant anthracyclines and paclitaxel.

  14. Identifying Growth Conditions for Nicotiana benthamiana Resulting in Predictable Gene Expression of Promoter-GUS Fusion

    Science.gov (United States)

    Sandoval, V.; Barton, K.; Longhurst, A.

    2012-12-01

    Revoluta (Rev) is a transcription factor that establishes leaf polarity in Arabidopsis thaliana. Previous work in Dr. Barton's lab showed that Revoluta binds to the ZPR3 promoter, thereby activating expression of the ZPR3 gene in Arabidopsis thaliana. Using this knowledge, two separate DNA constructs were made, one carrying the rev gene and the other carrying the ZPR3 promoter fused to the GUS gene. When inoculated into Nicotiana benthamiana (tobacco), the pMDC32 plasmid produces the Rev protein. Rev binds to the ZPR3 promoter, thereby activating transcription of the GUS gene, which can only be expressed in the presence of Rev. When the GUS protein comes in contact with X-Gluc, it produces a blue stain (see Figure 1). In the past, GUS expression in tobacco has been variable; we therefore hypothesized that changing the growing conditions and leaf age might improve how well it is expressed.

  15. Prediction of the Ebola Virus Infection Related Human Genes Using Protein-Protein Interaction Network.

    Science.gov (United States)

    Cao, HuanHuan; Zhang, YuHang; Zhao, Jia; Zhu, Liucun; Wang, Yi; Li, JiaRui; Feng, Yuan-Ming; Zhang, Ning

    2017-01-01

    Ebola hemorrhagic fever (EHF) is caused by Ebola virus (EBOV). EBOV infection in humans is reported to have a high fatality rate. However, the factors associating EBOV with its host remain unclear. According to the "guilt by association" (GBA) principle, proteins that interact with each other are very likely to have similar or identical functions. Based on this assumption, we sought to identify EBOV infection-related human genes in a protein-protein interaction network using Dijkstra's algorithm. We hope this could contribute to the discovery of novel effective treatments. In total, 15 genes were selected as potential EBOV infection-related human genes. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  16. Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project

    Directory of Open Access Journals (Sweden)

    McDonagh Paul D

    2003-06-01

    Full Text Available Background: Seattle Biomedical Research Institute (SBRI), as part of the Leishmania Genome Network (LGN), is sequencing chromosomes of the trypanosomatid protozoan species Leishmania major. At SBRI, chromosomal sequence is annotated using a combination of trained and untrained non-consensus gene-prediction algorithms with ARTEMIS, an annotation platform with rich and user-friendly interfaces. Results: Here we describe a methodology used to import results from three different protein-coding gene-prediction algorithms (GLIMMER, TESTCODE and GENESCAN) into the ARTEMIS sequence viewer and annotation tool. Comparison of these methods, along with the CODONUSAGE algorithm built into ARTEMIS, shows the importance of combining methods to more accurately annotate the L. major genomic sequence. Conclusion: An improvised and powerful tool for gene prediction has been developed by importing data from widely-used algorithms into an existing annotation platform. This approach is especially fruitful in the Leishmania genome project, where there is a large proportion of novel genes requiring manual annotation.

  17. Prediction of novel target genes and pathways involved in bevacizumab-resistant colorectal cancer

    Science.gov (United States)

    Makondi, Precious Takondwa; Lee, Chia-Hwa; Huang, Chien-Yu; Chu, Chi-Ming; Chang, Yu-Jia

    2018-01-01

    Bevacizumab combined with cytotoxic chemotherapy is the backbone of metastatic colorectal cancer (mCRC) therapy; however, its treatment efficacy is hampered by therapeutic resistance. Therefore, understanding the mechanisms underlying bevacizumab resistance is crucial to increasing the therapeutic efficacy of bevacizumab. The Gene Expression Omnibus (GEO) database (dataset GSE86525) was used to identify the key genes and pathways involved in bevacizumab-resistant mCRC. The GEO2R web tool was used to identify differentially expressed genes (DEGs). Functional and pathway enrichment analyses of the DEGs were performed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). Protein–protein interaction (PPI) networks were established using the Search Tool for the Retrieval of Interacting Genes/Proteins database (STRING) and visualized using Cytoscape software. A total of 124 DEGs were obtained, 57 of which were upregulated and 67 downregulated. PPI network analysis showed that seven upregulated genes and nine downregulated genes exhibited high PPI degrees. In the functional enrichment, the DEGs were mainly enriched in negative regulation of phosphate metabolic process and positive regulation of cell cycle process gene ontologies (GOs); the enriched pathways were the phosphoinositide 3-kinase-serine/threonine kinase signaling pathway, bladder cancer, and microRNAs in cancer. Cyclin-dependent kinase inhibitor 1A (CDKN1A), toll-like receptor 4 (TLR4), CD19 molecule (CD19), breast cancer 1, early onset (BRCA1), platelet-derived growth factor subunit A (PDGFA), and matrix metallopeptidase 1 (MMP1) were the DEGs involved in the pathways and the PPIs. The clinical validation of the DEGs in mCRC (TNM clinical stages 3 and 4) revealed that high PDGFA expression levels were associated with poor overall survival, whereas high BRCA1 and MMP1 expression levels were associated with favorable progression-free survival (PFS). The identified genes and pathways

  18. Prediction of novel target genes and pathways involved in bevacizumab-resistant colorectal cancer.

    Directory of Open Access Journals (Sweden)

    Precious Takondwa Makondi

    Bevacizumab combined with cytotoxic chemotherapy is the backbone of metastatic colorectal cancer (mCRC) therapy; however, its treatment efficacy is hampered by therapeutic resistance. Therefore, understanding the mechanisms underlying bevacizumab resistance is crucial to increasing the therapeutic efficacy of bevacizumab. The Gene Expression Omnibus (GEO) database (dataset, GSE86525) was used to identify the key genes and pathways involved in bevacizumab-resistant mCRC. The GEO2R web tool was used to identify differentially expressed genes (DEGs). Functional and pathway enrichment analyses of the DEGs were performed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). Protein-protein interaction (PPI) networks were established using the Search Tool for the Retrieval of Interacting Genes/Proteins database (STRING) and visualized using Cytoscape software. A total of 124 DEGs were obtained, 57 of which were upregulated and 67 were downregulated. PPI network analysis showed that seven upregulated genes and nine downregulated genes exhibited high PPI degrees. In the functional enrichment, the DEGs were mainly enriched in negative regulation of phosphate metabolic process and positive regulation of cell cycle process gene ontologies (GOs); the enriched pathways were the phosphoinositide 3-kinase-serine/threonine kinase signaling pathway, bladder cancer, and microRNAs in cancer. Cyclin-dependent kinase inhibitor 1A (CDKN1A), toll-like receptor 4 (TLR4), CD19 molecule (CD19), breast cancer 1, early onset (BRCA1), platelet-derived growth factor subunit A (PDGFA), and matrix metallopeptidase 1 (MMP1) were the DEGs involved in the pathways and the PPIs. The clinical validation of the DEGs in mCRC (TNM clinical stages 3 and 4) revealed that high PDGFA expression levels were associated with poor overall survival, whereas high BRCA1 and MMP1 expression levels were associated with favorable progression-free survival (PFS). The identified genes and pathways

  19. The completion of the Mammalian Gene Collection (MGC)

    Science.gov (United States)

    Temple, Gary; Gerhard, Daniela S.; Rasooly, Rebekah; Feingold, Elise A.; Good, Peter J.; Robinson, Cristen; Mandich, Allison; Derge, Jeffrey G.; Lewis, Jeanne; Shoaf, Debonny; Collins, Francis S.; Jang, Wonhee; Wagner, Lukas; Shenmen, Carolyn M.; Misquitta, Leonie; Schaefer, Carl F.; Buetow, Kenneth H.; Bonner, Tom I.; Yankie, Linda; Ward, Ming; Phan, Lon; Astashyn, Alex; Brown, Garth; Farrell, Catherine; Hart, Jennifer; Landrum, Melissa; Maidak, Bonnie L.; Murphy, Michael; Murphy, Terence; Rajput, Bhanu; Riddick, Lillian; Webb, David; Weber, Janet; Wu, Wendy; Pruitt, Kim D.; Maglott, Donna; Siepel, Adam; Brejova, Brona; Diekhans, Mark; Harte, Rachel; Baertsch, Robert; Kent, Jim; Haussler, David; Brent, Michael; Langton, Laura; Comstock, Charles L.G.; Stevens, Michael; Wei, Chaochun; van Baren, Marijke J.; Salehi-Ashtiani, Kourosh; Murray, Ryan R.; Ghamsari, Lila; Mello, Elizabeth; Lin, Chenwei; Pennacchio, Christa; Schreiber, Kirsten; Shapiro, Nicole; Marsh, Amber; Pardes, Elizabeth; Moore, Troy; Lebeau, Anita; Muratet, Mike; Simmons, Blake; Kloske, David; Sieja, Stephanie; Hudson, James; Sethupathy, Praveen; Brownstein, Michael; Bhat, Narayan; Lazar, Joseph; Jacob, Howard; Gruber, Chris E.; Smith, Mark R.; McPherson, John; Garcia, Angela M.; Gunaratne, Preethi H.; Wu, Jiaqian; Muzny, Donna; Gibbs, Richard A.; Young, Alice C.; Bouffard, Gerard G.; Blakesley, Robert W.; Mullikin, Jim; Green, Eric D.; Dickson, Mark C.; Rodriguez, Alex C.; Grimwood, Jane; Schmutz, Jeremy; Myers, Richard M.; Hirst, Martin; Zeng, Thomas; Tse, Kane; Moksa, Michelle; Deng, Merinda; Ma, Kevin; Mah, Diana; Pang, Johnson; Taylor, Greg; Chuah, Eric; Deng, Athena; Fichter, Keith; Go, Anne; Lee, Stephanie; Wang, Jing; Griffith, Malachi; Morin, Ryan; Moore, Richard A.; Mayo, Michael; Munro, Sarah; Wagner, Susan; Jones, Steven J.M.; Holt, Robert A.; Marra, Marco A.; Lu, Sun; Yang, Shuwei; Hartigan, James; Graf, Marcus; Wagner, Ralf; Letovksy, Stanley; Pulido, Jacqueline C.; Robison, Keith; Esposito, Dominic; Hartley, James; Wall, Vanessa E.; Hopkins, Ralph F.; Ohara, Osamu; Wiemann, Stefan

    2009-01-01

    Since its start, the Mammalian Gene Collection (MGC) has sought to provide at least one full-protein-coding sequence cDNA clone for every human and mouse gene with a RefSeq transcript, and at least 6200 rat genes. The MGC cloning effort initially relied on random expressed sequence tag screening of cDNA libraries. Here, we summarize our recent progress using directed RT-PCR cloning and DNA synthesis. The MGC now contains clones with the entire protein-coding sequence for 92% of human and 89% of mouse genes with curated RefSeq (NM-accession) transcripts, and for 97% of human and 96% of mouse genes with curated RefSeq transcripts that have one or more PubMed publications, in addition to clones for more than 6300 rat genes. These high-quality MGC clones and their sequences are accessible without restriction to researchers worldwide. PMID:19767417

  20. Comparative transcriptome analyses of three medicinal Forsythia species and prediction of candidate genes involved in secondary metabolisms.

    Science.gov (United States)

    Sun, Luchao; Rai, Amit; Rai, Megha; Nakamura, Michimi; Kawano, Noriaki; Yoshimatsu, Kayo; Suzuki, Hideyuki; Kawahara, Nobuo; Saito, Kazuki; Yamazaki, Mami

    2018-05-07

    The three Forsythia species, F. suspensa, F. viridissima and F. koreana, have been used as herbal medicines in China, Japan and Korea for centuries and they are known to be rich sources of numerous pharmaceutical metabolites, including forsythin, forsythoside A, arctigenin, rutin and other phenolic compounds. In this study, de novo transcriptome sequencing and assembly was performed on these species. Using leaf and flower tissues of F. suspensa, F. viridissima and F. koreana, 1.28-2.45 Gbp of Illumina-based paired-end reads were obtained and assembled into 81,913, 88,491 and 69,458 unigenes, respectively. Classification of the annotated unigenes in gene ontology terms and KEGG pathways was used to compare the transcriptomes of the three Forsythia species. The expression analysis of orthologous genes across all three species showed that expression in leaf tissues was highly correlated. The candidate genes presumably involved in the biosynthetic pathway of lignans and phenylethanoid glycosides were screened as co-expressed genes. They are highly expressed in the leaves of F. viridissima and F. koreana. Furthermore, the three unigenes annotated as acyltransferase were predicted to be associated with the biosynthesis of acteoside and forsythoside A from the expression pattern and phylogenetic analysis. This study is the first report on comparative transcriptome analyses of the medicinally important Forsythia genus and will serve as an important resource to facilitate further studies on biosynthesis and regulation of therapeutic compounds in Forsythia species.

  1. Predicting Recurrence and Progression of Noninvasive Papillary Bladder Cancer at Initial Presentation Based on Quantitative Gene Expression Profiles

    DEFF Research Database (Denmark)

    Birkhahn, M.; Mitra, A.P.; Williams, Johan

    2010-01-01

    Background: Currently, tumor grade is the best predictor of outcome at first presentation of noninvasive papillary (Ta) bladder cancer. However, reliable predictors of Ta tumor recurrence and progression for individual patients, which could optimize treatment and follow-up schedules based...... on specific tumor biology, are yet to be identified. Objective: To identify genes predictive for recurrence and progression in Ta bladder cancer at first presentation using a quantitative, pathway-specific approach. Design, setting, and participants: Retrospective study of patients with Ta G2/3 bladder tumors...... at initial presentation with three distinct clinical outcomes: absence of recurrence (n = 16), recurrence without progression (n = 16), and progression to carcinoma in situ or invasive disease (n = 16). Measurements: Expressions of 24 genes that feature in relevant pathways that are deregulated in bladder...

  2. Do prion protein gene polymorphisms induce apoptosis in non ...

    Indian Academy of Sciences (India)

    2016-08-26

    Genetic variations such as single nucleotide polymorphisms (SNPs) in the prion protein coding gene, Prnp, greatly affect susceptibility to prion diseases in mammals. Here, the coding region of Prnp was screened for polymorphisms in the red-eared turtle, Trachemys scripta. Four polymorphisms, L203V, N205I, ...

  3. Classification and Diagnostic Output Prediction of Cancer Using Gene Expression Profiling and Supervised Machine Learning Algorithms

    DEFF Research Database (Denmark)

    Yoo, C.; Gernaey, Krist

    2008-01-01

    importance in the projection (VIP) information of the DPLS method. The power of the gene selection method and the proposed supervised hierarchical clustering method is illustrated on three microarray data sets of leukemia, breast, and colon cancer. Supervised machine learning algorithms thus enable...

  4. Gene expression analysis predicts insect venom anaphylaxis in indolent systemic mastocytosis

    NARCIS (Netherlands)

    Niedoszytko, M.; Bruinenberg, M.; van Doormaal, J. J.; de Monchy, J. G. R.; Nedoszytko, B.; Koppelman, G. H.; Nawijn, M. C.; Wijmenga, C.; Jassem, E.; Oude Elberink, J. N. G.

    Background: Anaphylaxis to insect venom (Hymenoptera) is most severe in patients with mastocytosis and may even lead to death. However, not all patients with mastocytosis suffer from anaphylaxis. The aim of the study was to analyze differences in gene expression between patients with indolent

  5. Dopamine Receptor D4 Gene Variation Predicts Preschoolers' Developing Theory of Mind

    Science.gov (United States)

    Lackner, Christine; Sabbagh, Mark A.; Hallinan, Elizabeth; Liu, Xudong; Holden, Jeanette J. A.

    2012-01-01

    Individual differences in preschoolers' understanding that human action is caused by internal mental states, or representational theory of mind (RTM), are heritable, as are developmental disorders such as autism in which RTM is particularly impaired. We investigated whether polymorphisms of genes affecting dopamine (DA) utilization and metabolism…

  6. Conservation of transcription factor binding events predicts gene expression across species

    Science.gov (United States)

    Hemberg, Martin; Kreiman, Gabriel

    2011-01-01

    Recent technological advances have made it possible to determine the genome-wide binding sites of transcription factors (TFs). Comparisons across species have suggested a relatively low degree of evolutionary conservation of experimentally defined TF binding events (TFBEs). Using binding data for six different TFs in hepatocytes and embryonic stem cells from human and mouse, we demonstrate that evolutionary conservation of TFBEs within orthologous proximal promoters is closely linked to function, defined as expression of the target genes. We show that (i) there is a significantly higher degree of conservation of TFBEs when the target gene is expressed in both species; (ii) there is increased conservation of binding events for groups of TFs compared to individual TFs; and (iii) conserved TFBEs have a greater impact on the expression of their target genes than non-conserved ones. These results link conservation of structural elements (TFBEs) to conservation of function (gene expression) and suggest a higher degree of functional conservation than implied by previous studies. PMID:21622661

  7. GENECODIS-Grid: An online grid-based tool to predict functional information in gene lists

    International Nuclear Information System (INIS)

    Nogales, R.; Mejia, E.; Vicente, C.; Montes, E.; Delgado, A.; Perez Griffo, F. J.; Tirado, F.; Pascual-Montano, A.

    2007-01-01

    In this work we introduce GeneCodis-Grid, a grid-based alternative to a bioinformatics tool named Genecodis that integrates different sources of biological information to search for biological features (annotations) that frequently co-occur in a set of genes and rank them by statistical significance. GeneCodis-Grid is a web-based application that takes advantage of two independent grid networks and a computer cluster managed by a meta-scheduler and a web server that hosts the application. The mining of concurrent biological annotations provides significant information for the functional analysis of gene lists obtained by high-throughput experiments in biology. Due to the large popularity of this tool, which has registered more than 13000 visits since its publication in January 2007, there is a strong need to allow users from different sites to access the system simultaneously. In addition, the complexity of some of the statistical tests used in this approach has made this technique a good candidate for implementation in an opportunistic Grid environment. (Author)

  8. Genome wide gene expression regulation by HIP1 Protein Interactor, HIPPI: Prediction and validation

    Directory of Open Access Journals (Sweden)

    Lahiri Ansuman

    2011-09-01

    Background: HIP1 Protein Interactor (HIPPI) is a pro-apoptotic protein that induces Caspase8-mediated apoptosis in cells. We have shown earlier that HIPPI could interact with a specific 9 bp sequence motif, defined as the HIPPI binding site (HBS), present in the upstream promoter of the Caspase1 gene and regulate its expression. We also have shown that HIPPI, without any known nuclear localization signal, could be transported to the nucleus by HIP1, an NLS-containing nucleo-cytoplasmic shuttling protein. Thus our present work aims at the investigation of the role of HIPPI as a global transcription regulator. Results: We carried out a genome-wide search for the presence of HBS in the upstream sequences of genes. Our result suggests that HBS was predominantly located within 2 Kb upstream from the transcription start site. Transcription factors like CREBP1, TBP, OCT1, EVI1 and P53 half site were significantly enriched in the 100 bp vicinity of HBS, indicating that they might co-operate with HIPPI for transcription regulation. To illustrate the role of HIPPI on the transcriptome, we performed gene expression profiling by microarray. Exogenous expression of HIPPI in HeLa cells resulted in up-regulation of 580 genes (p HIP1 was knocked down. HIPPI-P53 interaction was necessary for HIPPI-mediated up-regulation of the Caspase1 gene. Finally, we analyzed published microarray data obtained with post mortem brains of Huntington's disease (HD) patients to investigate the possible involvement of HIPPI in HD pathogenesis. We observed that along with the transcription factors like CREB, P300, SREBP1, Sp1 etc. which are already known to be involved in HD, the HIPPI binding site was also significantly over-represented in the upstream sequences of genes altered in HD. Conclusions: Taken together, the results suggest that HIPPI could act as an important transcription regulator in cell regulating a vast array of genes, particularly transcription factors and at least, in part, play a

  9. An Individual-Based Diploid Model Predicts Limited Conditions Under Which Stochastic Gene Expression Becomes Advantageous

    KAUST Repository

    Matsumoto, Tomotaka

    2015-11-24

    Recent studies suggest the existence of a stochasticity in gene expression (SGE) in many organisms, and its non-negligible effect on their phenotype and fitness. To date, however, how SGE affects the key parameters of population genetics is not well understood. SGE can increase the phenotypic variation and act as a load for individuals, if they are at the adaptive optimum in a stable environment. On the other hand, part of the phenotypic variation caused by SGE might become advantageous if individuals at the adaptive optimum become genetically less-adaptive, for example due to an environmental change. Furthermore, SGE of unimportant genes might have little or no fitness consequences. Thus, SGE can be advantageous, disadvantageous, or selectively neutral depending on its context. In addition, there might be a genetic basis that regulates the magnitude of SGE, which is often referred to as “modifier genes,” but little is known about the conditions under which such an SGE-modifier gene evolves. In the present study, we conducted individual-based computer simulations to examine these conditions in a diploid model. In the simulations, we considered a single locus that determines organismal fitness for simplicity, and that SGE on the locus creates fitness variation in a stochastic manner. We also considered another locus that modifies the magnitude of SGE. Our results suggested that SGE was always deleterious in stable environments and increased the fixation probability of deleterious mutations in this model. Even under frequently changing environmental conditions, only very strong natural selection made SGE adaptive. These results suggest that the evolution of SGE-modifier genes requires a strict balance among the strength of natural selection, magnitude of SGE, and frequency of environmental changes. However, the degree of dominance affected the condition under which SGE becomes advantageous, indicating a better opportunity for the evolution of SGE in different genetic

  10. Evaluating bacterial gene-finding HMM structures as probabilistic logic programs

    DEFF Research Database (Denmark)

    Mørk, Søren; Holmes, Ian

    2012-01-01

    , a probabilistic dialect of Prolog. Results: We evaluate Hidden Markov Model structures for bacterial protein-coding gene potential, including a simple null model structure, three structures based on existing bacterial gene finders and two novel model structures. We test standard versions as well as ADPH length...

  11. SNP discovery in candidate adaptive genes using exon capture in a free-ranging alpine ungulate

    Science.gov (United States)

    Gretchen H. Roffler; Stephen J. Amish; Seth Smith; Ted Cosart; Marty Kardos; Michael K. Schwartz; Gordon Luikart

    2016-01-01

    Identification of genes underlying genomic signatures of natural selection is key to understanding adaptation to local conditions. We used targeted resequencing to identify SNP markers in 5321 candidate adaptive genes associated with known immunological, metabolic and growth functions in ovids and other ungulates. We selectively targeted 8161 exons in protein-coding...

  12. From essential to persistent genes: a functional approach to constructing synthetic life

    DEFF Research Database (Denmark)

    Acevedo-Rocha, Carlos G.; Fang, Gang; Schmidt, Markus

    2013-01-01

    A central undertaking in synthetic biology (SB) is the quest for the ‘minimal genome’. However, ‘minimal sets’ of essential genes are strongly context-dependent and, in all prokaryotic genomes sequenced to date, not a single protein-coding gene is entirely conserved. Furthermore, a lack...

  13. The non-protein coding breast cancer susceptibility locus Mcs5a acts in a non-mammary cell-autonomous fashion through the immune system and modulates T-cell homeostasis and functions.

    Science.gov (United States)

    Smits, Bart M G; Sharma, Deepak; Samuelson, David J; Woditschka, Stephan; Mau, Bob; Haag, Jill D; Gould, Michael N

    2011-08-16

    Mechanisms underlying low-penetrance, common, non-protein coding variants in breast cancer risk loci are largely undefined. We showed previously that the non-protein coding mammary carcinoma susceptibility locus Mcs5a/MCS5A modulates breast cancer risk in rats and women. The Mcs5a allele from the Wistar-Kyoto (WKy) rat strain consists of two genetically interacting elements that have to be present on the same chromosome to confer mammary carcinoma resistance. We also found that the two interacting elements of the resistant allele are required for the downregulation of transcript levels of the Fbxo10 gene specifically in T-cells. Here we describe mechanisms through which Mcs5a may reduce mammary carcinoma susceptibility. We performed mammary carcinoma multiplicity studies with three mammary carcinoma-inducing treatments, namely 7,12-dimethylbenz(a)anthracene (DMBA) and N-nitroso-N-methylurea (NMU) carcinogenesis, and mammary ductal infusion of retrovirus expressing the activated HER2/neu oncogene. We used mammary gland and bone marrow transplantation assays to assess the target tissue of Mcs5a activity. We used immunophenotyping assays on well-defined congenic rat lines carrying susceptible and resistant Mcs5a alleles to identify changes in T-cell homeostasis and function associated with resistance. We show that Mcs5a acts beyond the initial step of mammary epithelial cell transformation, during early cancer progression. We show that Mcs5a controls susceptibility in a non-mammary cell-autonomous manner through the immune system. The resistant Mcs5a allele was found to be associated with an overabundance of γδ T-cell receptor (TCR)+ T-cells as well as a CD62L (L-selectin)-high population of all T-cell classes. In contrast to in mammary carcinoma, γδTCR+ T-cells are the predominant T-cell type in the mammary gland and were found to be overabundant in the mammary epithelium of Mcs5a resistant congenic rats. Most of them simultaneously expressed the CD4, CD8, and CD161

  14. Bayesian mixture models for assessment of gene differential behaviour and prediction of pCR through the integration of copy number and gene expression data.

    Directory of Open Access Journals (Sweden)

    Filippo Trentini

    We consider jointly modeling microarray RNA expression and DNA copy number data. We propose Bayesian mixture models that define latent Gaussian probit scores for the DNA and RNA, and integrate between the two platforms via a regression of the RNA probit scores on the DNA probit scores. Such a regression conveniently allows us to include additional sample-specific covariates such as biological conditions and clinical outcomes. The two developed methods aim, respectively, to make inference on the differential behaviour of genes in patients showing different subtypes of breast cancer and to predict the pathological complete response (pCR) of patients, borrowing strength across the genomic platforms. Posterior inference is carried out via MCMC simulations. We demonstrate the proposed methodology using a published data set consisting of 121 breast cancer patients.

  15. Urban landscape genetics: canopy cover predicts gene flow between white-footed mouse (Peromyscus leucopus) populations in New York City.

    Science.gov (United States)

    Munshi-South, Jason

    2012-03-01

    In this study, I examine the influence of urban canopy cover on gene flow between 15 white-footed mouse (Peromyscus leucopus) populations in New York City parklands. Parks in the urban core are often highly fragmented, leading to rapid genetic differentiation of relatively nonvagile species. However, a diverse array of 'green' spaces may provide dispersal corridors through 'grey' urban infrastructure. I identify urban landscape features that promote genetic connectivity in an urban environment and compare the success of two different landscape connectivity approaches at explaining gene flow. Gene flow was associated with 'effective distances' between populations that were calculated based on per cent tree canopy cover using two different approaches: (i) isolation by effective distance (IED) that calculates the single best pathway to minimize passage through high-resistance (i.e. low canopy cover) areas, and (ii) isolation by resistance (IBR), an implementation of circuit theory that identifies all low-resistance paths through the landscape. IBR, but not IED, models were significantly associated with three measures of gene flow (Nm from F(ST) , BayesAss+ and Migrate-n) after factoring out the influence of isolation by distance using partial Mantel tests. Predicted corridors for gene flow between city parks were largely narrow, linear parklands or vegetated spaces that are not managed for wildlife, such as cemeteries and roadway medians. These results have implications for understanding the impacts of urbanization trends on native wildlife, as well as for urban reforestation efforts that aim to improve urban ecosystem processes. © 2012 Blackwell Publishing Ltd.
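    The difference between the two effective-distance approaches named in this abstract can be illustrated with a toy graph. The sketch below is not the author's pipeline: the graphs, node numbering and unit edge resistances are hypothetical, and the function names are made up for illustration. It computes a least-cost path distance (the idea behind IED) and a circuit-theory resistance distance (the idea behind IBR) for the same pair of nodes, assuming a connected graph and distinct endpoints.

```python
import heapq


def least_cost_distance(n, edges, src, dst):
    """Cost of the single cheapest path (Dijkstra); edges are (u, v, resistance)."""
    adj = {i: [] for i in range(n)}
    for u, v, r in edges:
        adj[u].append((v, r))
        adj[v].append((u, r))
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, r in adj[u]:
            nd = d + r
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


def resistance_distance(n, edges, src, dst):
    """Effective resistance between src and dst, using all paths at once."""
    # Weighted graph Laplacian with conductance = 1 / resistance.
    lap = [[0.0] * n for _ in range(n)]
    for u, v, r in edges:
        c = 1.0 / r
        lap[u][u] += c
        lap[v][v] += c
        lap[u][v] -= c
        lap[v][u] -= c
    # Ground dst (voltage 0), inject unit current at src, solve for voltages
    # with Gauss-Jordan elimination on the reduced system.
    keep = [i for i in range(n) if i != dst]
    m = len(keep)
    a = [[lap[i][j] for j in keep] + [1.0 if i == src else 0.0] for i in keep]
    for col in range(m):
        pivot = max(range(col, m), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(m):
            if row != col:
                factor = a[row][col] / a[col][col]
                for k in range(col, m + 1):
                    a[row][k] -= factor * a[col][k]
    voltages = {node: a[i][m] / a[i][i] for i, node in enumerate(keep)}
    return voltages[src]  # voltage at src per unit current = effective resistance


# Hypothetical landscapes: one corridor versus two parallel corridors.
one_corridor = [(0, 1, 1.0), (1, 2, 1.0)]
two_corridors = [(0, 1, 1.0), (1, 3, 1.0), (0, 2, 1.0), (2, 3, 1.0)]

lc_one = least_cost_distance(3, one_corridor, 0, 2)
rd_one = resistance_distance(3, one_corridor, 0, 2)
lc_two = least_cost_distance(4, two_corridors, 0, 3)
rd_two = resistance_distance(4, two_corridors, 0, 3)
```

    In this toy case the least-cost distance is the same for both landscapes, while the resistance distance drops when a second corridor is added, which is the sense in which a circuit-theory model accounts for all low-resistance paths rather than only the single best one.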

  16. Aberrant gene methylation in non-neoplastic mucosa as a predictive marker of ulcerative colitis-associated CRC.

    Science.gov (United States)

    Scarpa, Marco; Scarpa, Melania; Castagliuolo, Ignazio; Erroi, Francesca; Kotsafti, Andromachi; Basato, Silvia; Brun, Paola; D'Incà, Renata; Rugge, Massimo; Angriman, Imerio; Castoro, Carlo

    2016-03-01

    BACKGROUND: Promoter hypermethylation plays a major role in cancer through transcriptional silencing of critical genes. The aim of our study is to evaluate the methylation status of these genes in the colonic mucosa without dysplasia or adenocarcinoma at the different steps of sporadic and UC-related carcinogenesis and to investigate the possible role of genomic methylation as a marker of CRC. The expression of Dnmts 1 and 3A was significantly increased in UC-related carcinogenesis compared to non-inflammatory colorectal carcinogenesis. In non-neoplastic colonic mucosa, the number of methylated genes was significantly higher in patients with CRC and in those with UC-related CRC compared to the HC and UC patients and patients with dysplastic lesions of the colon. The number of methylated genes in non-neoplastic colonic mucosa predicted the presence of CRC with good accuracy in both non-inflammatory and inflammation-related CRC. Colonic mucosal samples were collected from healthy subjects (HC) (n = 30) and from patients with ulcerative colitis (UC) (n = 29), UC and dysplasia (n = 14), UC and cancer (n = 10), dysplastic adenoma (n = 14), and colon adenocarcinoma (n = 10). DNA methyltransferases-1, -3a, -3b, mRNA expression were quantified by real time qRT-PCR. The methylation status of CDH13, APC, MLH1, MGMT1 and RUNX3 gene promoters was assessed by methylation-specific PCR. Methylation status of APC, CDH13, MGMT, MLH1 and RUNX3 in the non-neoplastic mucosa may be used as a marker of CRC: these preliminary results could allow for the adjustment of a patient's surveillance interval and the selection of UC patients who should undergo intensive surveillance.

  17. Identification and Validation of a New Set of Five Genes for Prediction of Risk in Early Breast Cancer

    Directory of Open Access Journals (Sweden)

    Giorgio Mustacchi

    2013-05-01

    Molecular tests predicting the outcome of breast cancer patients based on gene expression levels can be used to assist in making treatment decisions after consideration of conventional markers. In this study we identified a subset of 20 mRNAs differentially regulated in breast cancer by analyzing several publicly available array gene expression data sets using R/Bioconductor packages. Using RTqPCR we evaluated 261 consecutive invasive breast cancer cases, not selected for age, adjuvant treatment, nodal and estrogen receptor status, from paraffin-embedded sections. The biological samples dataset was split into a training set (137 cases) and a validation set (124 cases). The gene signature was developed on the training set and a multivariate stepwise Cox analysis selected five genes independently associated with DFS: FGF18 (HR = 1.13, p = 0.05), BCL2 (HR = 0.57, p = 0.001), PRC1 (HR = 1.51, p = 0.001), MMP9 (HR = 1.11, p = 0.08), SERF1a (HR = 0.83, p = 0.007). These five genes were combined into a linear score (signature) weighted according to the coefficients of the Cox model, as: 0.125FGF18 − 0.560BCL2 + 0.409PRC1 + 0.104MMP9 − 0.188SERF1A (HR = 2.7, 95% CI = 1.9–4.0, p < 0.001). The signature was then evaluated on the validation set, assessing the discrimination ability by a Kaplan Meier analysis, using the same cut-offs classifying patients at low, intermediate or high risk of disease relapse as defined on the training set (p < 0.001). Our signature, after a further clinical validation, could be proposed as a prognostic signature for disease-free survival in breast cancer patients where the indication for adjuvant chemotherapy added to endocrine treatment is uncertain.
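    The weighted linear score quoted in this abstract can be written out as a short calculation. The sketch below uses the five coefficients exactly as printed; everything else is an assumption made for illustration: the function name, the input format, and the example expression values are hypothetical, and the abstract does not state how expression values were normalized or what the risk cut-offs were, so no risk classification is attempted.

```python
# Cox-model coefficients as printed in the abstract.
COEFFICIENTS = {
    "FGF18": 0.125,
    "BCL2": -0.560,
    "PRC1": 0.409,
    "MMP9": 0.104,
    "SERF1A": -0.188,
}


def signature_score(expression):
    """Weighted linear score for one patient.

    `expression` maps gene symbol -> expression value (hypothetical units;
    the normalization used in the study is not given in the abstract).
    """
    missing = set(COEFFICIENTS) - set(expression)
    if missing:
        raise KeyError(f"missing expression values for: {sorted(missing)}")
    return sum(COEFFICIENTS[gene] * expression[gene] for gene in COEFFICIENTS)


# Made-up expression values, for illustration only.
example_patient = {"FGF18": 1.0, "BCL2": 2.0, "PRC1": 1.0, "MMP9": 0.0, "SERF1A": 1.0}
score = signature_score(example_patient)
```

    Genes with positive coefficients (FGF18, PRC1, MMP9) raise the score and genes with negative coefficients (BCL2, SERF1A) lower it, consistent with the direction of the hazard ratios reported for each gene.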

  18. Prediction of the damage-associated non-synonymous single nucleotide polymorphisms in the human MC1R gene.

    Science.gov (United States)

    Hepp, Diego; Gonçalves, Gislene Lopes; de Freitas, Thales Renato Ochotorena

    2015-01-01

    The melanocortin 1 receptor (MC1R) is involved in the control of melanogenesis. Polymorphisms in this gene have been associated with variation in skin and hair color and with elevated risk for the development of melanoma. Here we used 11 computational tools based on different approaches to predict the damage-associated non-synonymous single nucleotide polymorphisms (nsSNPs) in the coding region of the human MC1R gene. Among the 92 nsSNPs arranged according to the predictions, 62% were classified as damaging in more than five tools. The classification was significantly correlated with the scores of two consensus programs. Alleles associated with the red hair color (RHC) phenotype and with the risk of melanoma were examined. The R variants D84E, R142H, R151C, I155T, R160W and D294H were classified as damaging by the majority of the tools, while the r variants V60L, V92M and R163Q were predicted as neutral in most of the programs. The combination of the prediction tools results in 14 nsSNPs indicated as the most damaging mutations in MC1R (L48P, R67W, H70Y, P72L, S83P, R151H, S172I, L206P, T242I, G255R, P256S, C273Y, C289R and R306H); C273Y was shown to be highly damaging in SIFT, Polyphen-2, MutPred, PANTHER and PROVEAN scores. The computational analysis proved capable of identifying the potentially damaging nsSNPs in MC1R, which are candidates for further laboratory studies of the functional and pharmacological significance of the alterations in the receptor and the phenotypic outcomes.
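    The "damaging in more than five tools" criterion described in this abstract amounts to a simple vote count across predictors. The sketch below shows only that counting step; it is not the authors' code. The tool labels (`tool_1` to `tool_11`) and the verdicts are placeholders, not real predictions for any MC1R variant, and the function names are hypothetical.

```python
def count_damaging_calls(calls):
    """Number of tools whose verdict for a variant is 'damaging'."""
    return sum(1 for verdict in calls.values() if verdict == "damaging")


def is_consensus_damaging(calls, threshold=5):
    """True when strictly more than `threshold` tools call the variant damaging."""
    return count_damaging_calls(calls) > threshold


# Placeholder verdicts from 11 unnamed tools (illustration only).
six_of_eleven = {f"tool_{i}": ("damaging" if i <= 6 else "neutral") for i in range(1, 12)}
five_of_eleven = {f"tool_{i}": ("damaging" if i <= 5 else "neutral") for i in range(1, 12)}

flag_six = is_consensus_damaging(six_of_eleven)
flag_five = is_consensus_damaging(five_of_eleven)
```

    With 11 tools, "more than five" is the same as a strict majority, so a variant flagged by exactly five tools is not counted as damaging under this rule.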

  19. A biological network-based regularized artificial neural network model for robust phenotype prediction from gene expression data.

    Science.gov (United States)

    Kang, Tianyu; Ding, Wei; Zhang, Luoyan; Ziemek, Daniel; Zarringhalam, Kourosh

    2017-12-19

    Stratification of patient subpopulations that respond favorably to treatment or experience an adverse reaction is an essential step toward development of new personalized therapies and diagnostics. It is currently feasible to generate omic-scale biological measurements for all patients in a study, providing an opportunity for machine learning models to identify molecular markers for disease diagnosis and progression. However, the high variability of genetic background in human populations hampers the reproducibility of omic-scale markers. In this paper, we develop a biological network-based regularized artificial neural network model for prediction of phenotype from transcriptomic measurements in clinical trials. To improve model sparsity and the overall reproducibility of the model, we incorporate regularization for simultaneous shrinkage of gene sets based on active upstream regulatory mechanisms into the model. We benchmark our method against various regression, support vector machine and artificial neural network models and demonstrate the ability of our method to predict clinical outcomes using clinical trial data on acute rejection in kidney transplantation and response to Infliximab in ulcerative colitis. We show that integration of prior biological knowledge into the classification, as developed in this paper, significantly improves the robustness and generalizability of predictions to independent datasets. We provide Java code for our algorithm along with a parsed version of the STRING DB database. In summary, we present a method for prediction of clinical phenotypes using baseline genome-wide expression data that makes use of prior biological knowledge on gene-regulatory interactions in order to increase robustness and reproducibility of omic-scale markers. The integrated group-wise regularization method increases the interpretability of biological signatures and gives stable performance estimates across independent test sets.

  20. Prediction of the damage-associated non-synonymous single nucleotide polymorphisms in the human MC1R gene.

    Directory of Open Access Journals (Sweden)

    Diego Hepp

    The melanocortin 1 receptor (MC1R) is involved in the control of melanogenesis. Polymorphisms in this gene have been associated with variation in skin and hair color and with elevated risk for the development of melanoma. Here we used 11 computational tools based on different approaches to predict the damage-associated non-synonymous single nucleotide polymorphisms (nsSNPs) in the coding region of the human MC1R gene. Among the 92 nsSNPs arranged according to the predictions, 62% were classified as damaging by more than five tools. The classification was significantly correlated with the scores of two consensus programs. Alleles associated with the red hair color (RHC) phenotype and with the risk of melanoma were examined. The R variants D84E, R142H, R151C, I155T, R160W and D294H were classified as damaging by the majority of the tools, while the r variants V60L, V92M and R163Q were predicted as neutral by most of the programs. The combination of the prediction tools results in 14 nsSNPs indicated as the most damaging mutations in MC1R (L48P, R67W, H70Y, P72L, S83P, R151H, S172I, L206P, T242I, G255R, P256S, C273Y, C289R and R306H); C273Y was shown to be highly damaging by SIFT, Polyphen-2, MutPred, PANTHER and PROVEAN scores. The computational analysis proved capable of identifying the potentially damaging nsSNPs in MC1R, which are candidates for further laboratory studies of the functional and pharmacological significance of the alterations in the receptor and the phenotypic outcomes.
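    The "damaging in more than five tools" criterion amounts to a simple vote count across tools. The sketch below illustrates that counting step only; the variant names, tool names and calls are placeholders, not results from the study, and `consensus_calls` is a hypothetical helper.

```python
def consensus_calls(predictions, min_tools=6):
    """Count damaging calls per variant and flag those reaching min_tools.

    predictions: dict mapping variant -> dict mapping tool -> bool
                 (True means the tool called the variant damaging)
    min_tools:   6 corresponds to "more than five tools"
    """
    result = {}
    for variant, calls in predictions.items():
        n_damaging = sum(1 for is_damaging in calls.values() if is_damaging)
        result[variant] = (n_damaging, n_damaging >= min_tools)
    return result

# Placeholder data for 11 tools; these are NOT the published predictions
predictions = {
    "variant_1": {f"tool_{i}": i <= 9 for i in range(1, 12)},  # 9 damaging calls
    "variant_2": {f"tool_{i}": i <= 3 for i in range(1, 12)},  # 3 damaging calls
}
summary = consensus_calls(predictions)
```

    A real analysis would first map each tool's score to a damaging/neutral call using that tool's own threshold, which this sketch assumes has already been done.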

  1. Motif-independent prediction of a secondary metabolism gene cluster using comparative genomics: application to sequenced genomes of Aspergillus and ten other filamentous fungal species.

    Science.gov (United States)

    Takeda, Itaru; Umemura, Myco; Koike, Hideaki; Asai, Kiyoshi; Machida, Masayuki

    2014-08-01

    Despite their biological importance, a significant number of genes for secondary metabolite biosynthesis (SMB) remain undetected due largely to the fact that they are highly diverse and are not expressed under a variety of cultivation conditions. Several software tools including SMURF and antiSMASH have been developed to predict fungal SMB gene clusters by finding core genes encoding polyketide synthase, nonribosomal peptide synthetase and dimethylallyltryptophan synthase as well as several others typically present in the cluster. In this work, we have devised a novel comparative genomics method to identify SMB gene clusters that is independent of motif information of the known SMB genes. The method detects SMB gene clusters by searching for a similar order of genes and their presence in nonsyntenic blocks. With this method, we were able to identify many known SMB gene clusters with the core genes in the genomic sequences of 10 filamentous fungi. Furthermore, we have also detected SMB gene clusters without core genes, including the kojic acid biosynthesis gene cluster of Aspergillus oryzae. By varying the detection parameters of the method, a significant difference in the sequence characteristics was detected between the genes residing inside the clusters and those outside the clusters. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  2. ETS Gene Fusions as Predictive Biomarkers of Resistance to Radiation Therapy for Prostate Cancer

    Science.gov (United States)

    2016-05-01

    phenotype in preclinical models of prostate cancer, 2) to explore the mechanism of interaction between ERG (the predominant ETS...established this axis as a potential therapeutic target. 15. SUBJECT TERMS Prostate cancer, ETS gene fusions, ERG, radiation resistance, DNA...interaction between ERG (the predominant ETS gene fusion product) and the DNA repair protein DNA-PK, and 3) to

  3. Survivin gene levels in the peripheral blood of patients with gastric cancer independently predict survival

    Directory of Open Access Journals (Sweden)

    Scalerta Romano

    2009-12-01

    Background: The detection of circulating tumor cells (CTC) is considered a promising tool for improving risk stratification in patients with solid tumors. We investigated whether the expression of CTC related genes adds any prognostic power to the TNM staging system in patients with gastric carcinoma. Methods: Seventy patients with TNM stage I to IV gastric carcinoma were retrospectively enrolled. Peripheral blood samples were tested by means of quantitative real time PCR (qrtPCR) for the expression of four CTC related genes: carcinoembryonic antigen (CEA), cytokeratin-19 (CK19), vascular endothelial growth factor (VEGF) and Survivin (BIRC5). Results: Gene expression of Survivin, CK19, CEA and VEGF was higher than in normal controls in 98.6%, 97.1%, 42.9% and 38.6% of cases, respectively, suggesting a potential diagnostic value of both Survivin and CK19. At multivariable survival analysis, TNM staging and Survivin mRNA levels were retained as independent prognostic factors, demonstrating that Survivin expression in the peripheral blood adds prognostic information to the TNM system. In contrast with previously published data, the transcript abundance of CEA, CK19 and VEGF was not associated with patients' clinical outcome. Conclusions: Gene expression levels of Survivin add significant prognostic value to the current TNM staging system. The validation of these findings in larger prospective and multicentric series might lead to the implementation of this biomarker in the routine clinical setting in order to optimize risk stratification and ultimately personalize the therapeutic management of these patients.

  4. Cancer-Predicting Gene Expression Changes in Colonic Mucosa of Western Diet Fed Mlh1 +/- Mice

    Science.gov (United States)

    Dermadi Bebek, Denis; Valo, Satu; Reyhani, Nima; Ollila, Saara; Päivärinta, Essi; Peltomäki, Päivi; Mutanen, Marja; Nyström, Minna

    2013-01-01

    Colorectal cancer (CRC) is the second most common cause of cancer-related deaths in the Western world and interactions between genetic and environmental factors, including diet, are suggested to play a critical role in its etiology. We conducted a long-term feeding experiment in the mouse to address gene expression and methylation changes arising in histologically normal colonic mucosa as putative cancer-predisposing events available for early detection. The expression of 94 growth-regulatory genes previously linked to human CRC was studied at two time points (5 weeks and 12 months of age) in the heterozygote Mlh1 +/- mice, an animal model for human Lynch syndrome (LS), and wild type Mlh1 +/+ littermates, fed by either Western-style (WD) or AIN-93G control diet. In mice fed with WD, proximal colon mucosa, the predominant site of cancer formation in LS, exhibited a significant expression decrease in tumor suppressor genes, Dkk1, Hoxd1, Slc5a8, and Socs1, the latter two only in the Mlh1 +/- mice. Reduced mRNA expression was accompanied by increased promoter methylation of the respective genes. The strongest expression decrease (7.3 fold) together with a significant increase in its promoter methylation was seen in Dkk1, an antagonist of the canonical Wnt signaling pathway. Furthermore, the inactivation of Dkk1 seems to predispose to neoplasias in the proximal colon. This and the fact that Mlh1 which showed only modest methylation was still expressed in both Mlh1 +/- and Mlh1 +/+ mice indicate that the expression decreases and the inactivation of Dkk1 in particular is a prominent early marker for colon oncogenesis. PMID:24204690

  5. Cancer-predicting gene expression changes in colonic mucosa of Western diet fed Mlh1+/- mice.

    Directory of Open Access Journals (Sweden)

    Marjaana Pussila

    Colorectal cancer (CRC) is the second most common cause of cancer-related deaths in the Western world and interactions between genetic and environmental factors, including diet, are suggested to play a critical role in its etiology. We conducted a long-term feeding experiment in the mouse to address gene expression and methylation changes arising in histologically normal colonic mucosa as putative cancer-predisposing events available for early detection. The expression of 94 growth-regulatory genes previously linked to human CRC was studied at two time points (5 weeks and 12 months of age) in the heterozygote Mlh1(+/-) mice, an animal model for human Lynch syndrome (LS), and wild type Mlh1(+/+) littermates, fed by either Western-style (WD) or AIN-93G control diet. In mice fed with WD, proximal colon mucosa, the predominant site of cancer formation in LS, exhibited a significant expression decrease in tumor suppressor genes, Dkk1, Hoxd1, Slc5a8, and Socs1, the latter two only in the Mlh1(+/-) mice. Reduced mRNA expression was accompanied by increased promoter methylation of the respective genes. The strongest expression decrease (7.3 fold) together with a significant increase in its promoter methylation was seen in Dkk1, an antagonist of the canonical Wnt signaling pathway. Furthermore, the inactivation of Dkk1 seems to predispose to neoplasias in the proximal colon. This and the fact that Mlh1 which showed only modest methylation was still expressed in both Mlh1(+/-) and Mlh1(+/+) mice indicate that the expression decreases and the inactivation of Dkk1 in particular is a prominent early marker for colon oncogenesis.

  6. Characterizing haploinsufficiency of SHELL gene to improve fruit form prediction in introgressive hybrids of oil palm

    OpenAIRE

    Teh, Chee Keng; Muaz, Siti Dalila; Tangaya, Praveena; Fong, Po-Yee; Ong, Ai Ling; Mayes, Sean; Chew, Fook Tim; Kulaveerasingam, Harikrishna; Appleton, David Ross

    2017-01-01

    The fundamental trait in selective breeding of oil palm (Elaeis guineensis Jacq.) is the shell thickness surrounding the kernel. The monogenic shell thickness is inversely correlated to mesocarp thickness, where the crude palm oil accumulates. Commercial thin-shelled tenera derived from thick-shelled dura × shell-less pisifera generally contain 30% higher oil per bunch. Two mutations, sh MPOB (M1) and sh AVROS (M2) in the SHELL gene, a type II MADS-box transcription factor mainly present in ...

  7. Melanopsin Gene Variations Interact With Season to Predict Sleep Onset and Chronotype

    OpenAIRE

    Roecklein, Kathryn A.; Wong, Patricia M.; Franzen, Peter L.; Hasler, Brant P.; Wood-Vasey, W. Michael; Nimgaonkar, Vishwajit L.; Miller, Megan A.; Kepreos, Kyle M.; Ferrell, Robert E.; Manuck, Stephen B.

    2012-01-01

    The human melanopsin gene has been reported to mediate risk for seasonal affective disorder (SAD), which is hypothesized to be caused by decreased photic input during winter when light levels fall below threshold, resulting in differences in circadian phase and/or sleep. However, it is unclear if melanopsin increases risk of SAD by causing differences in sleep or circadian phase, or if those differences are symptoms of the mood disorder. To determine if melanopsin sequence variations are asso...

  8. A method of predicting changes in human gene splicing induced by genetic variants in context of cis-acting elements

    Directory of Open Access Journals (Sweden)

    Hicks Chindo

    2010-01-01

    Background: Polymorphic variants and mutations disrupting canonical splicing isoforms are among the leading causes of human hereditary disorders. While there is substantial evidence of aberrant splicing causing Mendelian diseases, the implication of such events in multi-genic disorders is yet to be well understood. We have developed a new tool (SpliceScan II) for predicting the effects of genetic variants on splicing and cis-regulatory elements. The novel Bayesian non-canonical 5'GC splice site (SS) sensor used in our tool allows inference on non-canonical exons. Results: Our tool performed favorably when compared with the existing methods in the context of genes linked to the Autism Spectrum Disorder (ASD). SpliceScan II was able to predict more aberrant splicing isoforms triggered by the mutations, as documented in DBASS5 and DBASS3 aberrant splicing databases, than other existing methods. Detrimental effects behind some of the polymorphic variations previously associated with Alzheimer's and breast cancer could be explained by changes in predicted splicing patterns. Conclusions: We have developed SpliceScan II, an effective and sensitive tool for predicting the detrimental effects of genomic variants on splicing leading to Mendelian and complex hereditary disorders. The method could potentially be used to screen resequenced patient DNA to identify de novo mutations and polymorphic variants that could contribute to a genetic disorder.

  9. Association of TLL1 Gene Polymorphism (rs1503298, T > C) with Coronary Heart Disease in PREDICT, UDACS and ED Cohorts

    International Nuclear Information System (INIS)

    Zain, M.; Cooper, J. A.; Li, K. W.; Palmen, J.; Acharya, J.; Howard, P.; Ireland, H.; Humphries, S. E.; Awan, F. R.; Baig, S. M.; Elkeles, R. S.

    2014-01-01

    Objective: To determine the sequence variant of TLL1 gene (rs1503298, T > C) in three British cohorts (PREDICT, UDACS and ED) of patients with type-2 Diabetes mellitus (T2DM) in order to assess its association with coronary heart disease (CHD). Study Design: Analytical study. Place and Duration of Study: UCL, London, UK. Participants were genotyped in 2011-2012 for TLL1 SNP. Samples and related information were previously collected in 2001-2003 for PREDICT, and in 2001-2002 for UDACS and ED groups. Methodology: Patients included in PREDICT (n=600), UDACS (n=1020) and ED (n=1240) had Diabetes. TLL1 SNP (rs1503298, T > C) was genotyped using TaqMan technology. Allele frequencies were compared using the χ² test, and tested for Hardy-Weinberg equilibrium. The risk of disease was assessed from Odds ratios (OR) with 95% Confidence Intervals (95% CI). Moreover, for the PREDICT cohort, the SNP association was tested with Coronary Artery Calcification (CAC) scores. Results: No significant association was found for this SNP with CHD or CAC scores in these cohorts. Conclusion: This SNP could not be confirmed as a risk factor for CHD in T2DM patients. However, the small sample size available gives low power to detect a modest effect on risk, which is a limitation. Further studies in larger samples would be useful. (author)
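    The two statistics named in the methodology, a χ² test for Hardy-Weinberg equilibrium and an odds ratio with a 95% confidence interval, can be sketched as follows. This is a minimal illustration using the standard textbook formulas (the log-based Wald interval for the OR is an assumption, since the abstract does not say which interval was used); all counts are invented toy numbers, not data from the PREDICT, UDACS or ED cohorts.

```python
import math

def hwe_chi_square(n_tt, n_tc, n_cc):
    """Chi-square statistic (1 degree of freedom) for Hardy-Weinberg equilibrium."""
    n = n_tt + n_tc + n_cc
    p = (2 * n_tt + n_tc) / (2 * n)  # frequency of the T allele
    q = 1 - p                        # frequency of the C allele
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_tt, n_tc, n_cc)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a log-based Wald confidence interval.

    a: carriers with disease      b: carriers without disease
    c: non-carriers with disease  d: non-carriers without disease
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Toy genotype counts chosen to sit exactly at equilibrium (p = 0.6)
chi2 = hwe_chi_square(360, 480, 160)

# Toy 2x2 table; not real cohort data
odds_ratio, lower, upper = odds_ratio_ci(30, 70, 20, 80)
```

    An association is usually called non-significant at the 5% level when the 95% interval includes 1, which is the kind of result the abstract reports.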

  10. On the total number of genes and their length distribution in complete microbial genomes

    DEFF Research Database (Denmark)

    Skovgaard, Marie; Jensen, L.J.; Brunak, Søren

    2001-01-01

    In sequenced microbial genomes, some of the annotated genes are actually not protein-coding genes, but rather open reading frames that occur by chance. Therefore, the number of annotated genes is higher than the actual number of genes for most of these microbes. Comparison of the length distribution of the annotated genes with the length distribution of those matching a known protein reveals that too many short genes are annotated in many genomes. Here we estimate the true number of protein-coding genes for sequenced genomes. Although it is often claimed that Escherichia coli has about 4300 genes, we show that it probably has only ~3800 genes, and that a similar discrepancy exists for almost all published genomes.

  11. Gene expression signature of normal cell-of-origin predicts ovarian tumor outcomes.

    Directory of Open Access Journals (Sweden)

    Melissa A Merritt

    The potential role of the cell-of-origin in determining the tumor phenotype has been raised, but not adequately examined. We hypothesized that distinct cells-of-origin may play a role in determining ovarian tumor phenotype and outcome. Here we describe a new cell culture medium for in vitro culture of paired normal human ovarian (OV) and fallopian tube (FT) epithelial cells from donors without cancer. While these cells have been cultured individually for short periods of time, to our knowledge this is the first long-term culture of both cell types from the same donors. Through analysis of the gene expression profiles of the cultured OV/FT cells we identified a normal cell-of-origin gene signature that classified primary ovarian cancers into OV-like and FT-like subgroups; this classification correlated with significant differences in clinical outcomes. The identification of a prognostically significant gene expression signature derived solely from normal untransformed cells is consistent with the hypothesis that the normal cell-of-origin may be a source of ovarian tumor heterogeneity and the associated differences in tumor outcome.

  12. Accurate microRNA target prediction correlates with protein repression levels

    Directory of Open Access Journals (Sweden)

    Simossis Victor A

    2009-09-01

    Background: MicroRNAs are small endogenously expressed non-coding RNA molecules that regulate target gene expression through translation repression or messenger RNA degradation. MicroRNA regulation is performed through pairing of the microRNA to sites in the messenger RNA of protein coding genes. Since experimental identification of miRNA target genes poses difficulties, computational microRNA target prediction is one of the key means in deciphering the role of microRNAs in development and disease. Results: DIANA-microT 3.0 is an algorithm for microRNA target prediction which is based on several parameters calculated individually for each microRNA and combines conserved and non-conserved microRNA recognition elements into a final prediction score, which correlates with protein production fold change. Specifically, for each predicted interaction the program reports a signal to noise ratio and a precision score which can be used as an indication of the false positive rate of the prediction. Conclusion: Recently, several computational target prediction programs were benchmarked based on a set of microRNA target genes identified by the pSILAC method. In this assessment DIANA-microT 3.0 was found to achieve the highest precision among the most widely used microRNA target prediction programs, reaching approximately 66%. The DIANA-microT 3.0 prediction results are available online in a user friendly web server at http://www.microrna.gr/microT

  13. Predicting incomplete gene microarray data with the use of supervised learning algorithms

    CSIR Research Space (South Africa)

    Twala, B

    2010-10-01

    ...that prediction using supervised learning can be improved in probabilistic terms given incomplete microarray data. This imputation approach is based on the a priori probability of each value determined from the instances at that node of a decision tree (PDT...

  14. The importance of virulence prediction and gene networks in microbial risk assessment

    DEFF Research Database (Denmark)

    Wassenaar, Gertrude Maria; Gamieldien, Junaid; Shatkin, JoAnne

    2007-01-01

    For microbial risk assessment, it is necessary to recognize and predict virulence of bacterial pathogens, including their ability to contaminate foods. Hazard characterization requires data on strain variability regarding virulence and survival during food processing. Moreover, information...... and characterization of microbial hazards, including emerging pathogens, in the context of microbial risk assessment....

  15. Gene expression analysis in predicting the effectiveness of insect venom immunotherapy

    NARCIS (Netherlands)

    Niedoszytko, M.; Bruinenberg, M.; de Monchy, J.; Wijmenga, C.; Platteel, M.; Jassem, E.; Oude Elberink, Joanna N.G.

    Background: Venom immunotherapy (VIT) enables longtime prevention of insect venom allergy in the majority of patients. However, in some, the risk of a resystemic reaction increases after completion of treatment. No reliable factors predicting individual lack of efficacy of VIT are currently

  16. The prevalence of gene duplications and their ancient origin in Rhodobacter sphaeroides 2.4.1

    Directory of Open Access Journals (Sweden)

    Cho Hyuk

    2010-12-01

    Background: Rhodobacter sphaeroides 2.4.1 is a metabolically versatile organism that belongs to the α-3 subdivision of Proteobacteria. The present study aimed to identify the extent, history, and role of gene duplications in R. sphaeroides 2.4.1, an organism that possesses two chromosomes. Results: A protein similarity search (BLASTP) identified 1247 orfs (~29.4% of the total protein coding orfs) that are present in 2 or more copies, 37.5% (234 gene-pairs) of which exist in duplicate copies. The distribution of the duplicate gene-pairs in all Clusters of Orthologous Groups (COGs) differed significantly when compared to the COG distribution across the whole genome. Location plots revealed clusters of gene duplications that possessed the same COG classification. Phylogenetic analyses were performed to determine a tree topology predicting either a Type-A or Type-B phylogenetic relationship. A Type-A phylogenetic relationship shows that a copy of the protein-pair matches more with an ortholog from a species closely related to R. sphaeroides while a Type-B relationship predicts the highest match between both copies of the R. sphaeroides protein-pair. The results revealed that ~77% of the proteins exhibited a Type-A phylogenetic relationship, demonstrating the ancient origin of these gene duplications. Additional analyses on three other strains of R. sphaeroides revealed varying levels of gene loss and retention in these strains. Also, analyses on common gene pairs among the four strains revealed that these genes experience similar functional constraints and undergo purifying selection. Conclusions: Although the results suggest that the level of gene duplication in organisms with complex genome structuring (more than one chromosome) seems to be not markedly different from that in organisms with only a single chromosome, these duplications may have aided in genome reorganization in this group of eubacteria prior to the formation of R. sphaeroides as gene

  17. Assessment of the Prognostic and Treatment-Predictive Performance of the Combined HOXB13:IL17BR-MGI Gene Expression Signature in the Trans-ATAC Cohort

    Science.gov (United States)

    2013-12-01

    Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004; 351: 2817–26. 7...Barlow WE, Shak S, et al, for The Breast Cancer Intergroup of North America. Pr