WorldWideScience

Sample records for automated genome annotations

  1. BEACON: automated tool for Bacterial GEnome Annotation ComparisON

    KAUST Repository

    Kalkatawi, Manal Matoq Saeed

    2015-08-18

    Background: Genome annotation is one way of summarizing the existing knowledge about genomic characteristics of an organism. There has been an increased interest during the last several decades in computer-based structural and functional genome annotation. Many methods for this purpose have been developed for eukaryotes and prokaryotes. Our study focuses on comparison of functional annotations of prokaryotic genomes. To the best of our knowledge there is no fully automated system for detailed comparison of functional genome annotations generated by different annotation methods (AMs). Results: The presence of many AMs and development of new ones introduce needs to: (a) compare different annotations for a single genome, and (b) generate annotation by combining individual ones. To address these issues we developed an Automated Tool for Bacterial GEnome Annotation ComparisON (BEACON) that benefits both AM developers and annotation analysers. BEACON provides detailed comparison of gene function annotations of prokaryotic genomes obtained by different AMs and generates extended annotations through combination of individual ones. For the illustration of BEACON’s utility, we provide a comparison analysis of multiple different annotations generated for four genomes and show on these examples that the extended annotation can increase the number of genes annotated by putative functions up to 27%, while the number of genes without any function assignment is reduced. Conclusions: We developed BEACON, a fast tool for an automated and a systematic comparison of different annotations of single genomes. The extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/
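
    The idea of comparing several annotations of one genome and merging them into an extended annotation can be sketched as follows. This is a minimal illustration, not BEACON's implementation: the function name `compare_and_extend`, the input layout, the `UNINFORMATIVE` label list, and the "first informative label wins" merge rule are all assumptions made for the example.

```python
from collections import Counter

# Labels treated as "no function assigned" (an assumption for this sketch).
UNINFORMATIVE = {"", "hypothetical protein", "unknown function"}


def compare_and_extend(annotations):
    """Compare and merge per-gene function labels from several methods.

    annotations: {method_name: {gene_id: function_text}}
    Returns (extended, agreement):
      extended  maps each gene to the first informative label found
                across methods, or None if no method gave one;
      agreement counts genes by the number of methods that gave an
                informative label.
    """
    genes = set()
    for per_gene in annotations.values():
        genes.update(per_gene)

    extended = {}
    agreement = Counter()
    for gene in sorted(genes):
        informative = [
            per_gene[gene]
            for per_gene in annotations.values()
            if per_gene.get(gene, "").strip().lower() not in UNINFORMATIVE
        ]
        extended[gene] = informative[0] if informative else None
        agreement[len(informative)] += 1
    return extended, agreement
```

    In this sketch a gene left as "hypothetical protein" by one method gains a putative function as soon as any other method supplies one, which is the kind of gain the abstract reports for the extended annotation.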

  2. An automated annotation tool for genomic DNA sequences using GeneScan and BLAST

    Indian Academy of Sciences (India)

    Andrew M. Lynn; Chakresh Kumar Jain; K. Kosalai; Pranjan Barman; Nupur Thakur; Harish Batra; Alok Bhattacharya

    2001-04-01

    Genomic sequence data are often available well before the annotated sequence is published. We present a method for analysis of genomic DNA to identify coding sequences using the GeneScan algorithm and characterize these resultant sequences by BLAST. The routines are used to develop a system for automated annotation of genome DNA sequences.

  3. A pipeline for automated annotation of yeast genome sequences by a conserved-synteny approach

    Directory of Open Access Journals (Sweden)

    Proux-Wéra Estelle

    2012-09-01

    Background: Yeasts are a model system for exploring eukaryotic genome evolution. Next-generation sequencing technologies are poised to vastly increase the number of yeast genome sequences, both from resequencing projects (population studies) and from de novo sequencing projects (new species). However, the annotation of genomes presents a major bottleneck for de novo projects, because it still relies on a process that is largely manual. Results: Here we present the Yeast Genome Annotation Pipeline (YGAP), an automated system designed specifically for new yeast genome sequences lacking transcriptome data. YGAP does automatic de novo annotation, exploiting homology and synteny information from other yeast species stored in the Yeast Gene Order Browser (YGOB) database. The basic premises underlying YGAP's approach are that data from other species already tells us what genes we should expect to find in any particular genomic region and that we should also expect that orthologous genes are likely to have similar intron/exon structures. Additionally, it is able to detect probable frameshift sequencing errors and can propose corrections for them. YGAP searches intelligently for introns, and detects tRNA genes and Ty-like elements. Conclusions: In tests on Saccharomyces cerevisiae and on the genomes of Naumovozyma castellii and Tetrapisispora blattae newly sequenced with Roche-454 technology, YGAP outperformed another popular annotation program (AUGUSTUS). For S. cerevisiae and N. castellii, 91-93% of YGAP's predicted gene structures were identical to those in previous manually curated gene sets. YGAP has been implemented as a webserver with a user-friendly interface at http://wolfe.gen.tcd.ie/annotation.

  4. Evaluation of Three Automated Genome Annotations for Halorhabdus utahensis

    DEFF Research Database (Denmark)

    Bakke, Peter; Carney, Nick; DeLoache, Will

    2009-01-01

    in databases such as NCBI and used to validate subsequent annotation errors. We submitted the genome sequence of halophilic archaeon Halorhabdus utahensis to be analyzed by three genome annotation services. We have examined the output from each service in a variety of ways in order to compare the methodology...

  5. Automatic annotation of organellar genomes with DOGMA

    Energy Technology Data Exchange (ETDEWEB)

    Wyman, Stacia; Jansen, Robert K.; Boore, Jeffrey L.

    2004-06-01

    Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that allows the use of comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of specified length can be extracted, as well as the nucleotide sequences and amino acid sequences of the genes.
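
    Extracting intergenic regions of a specified minimum length from a set of gene coordinates can be sketched as below. This is an illustration only and not DOGMA's code: the function name `intergenic_regions`, the 0-based half-open coordinates, and the treatment of the sequence as linear (organellar genomes are usually circular) are assumptions of the example.

```python
def intergenic_regions(genome, genes, min_length=1):
    """Return gaps between annotated genes on a linear sequence.

    genome: nucleotide string
    genes:  iterable of (start, end) intervals, 0-based and half-open
    Each result is (start, end, sequence) for a gap of at least
    `min_length` bases. Overlapping genes are handled by tracking the
    furthest gene end seen so far.
    """
    regions = []
    cursor = 0  # first position not yet covered by any gene
    for start, end in sorted(genes):
        if start - cursor >= min_length:
            regions.append((cursor, start, genome[cursor:start]))
        cursor = max(cursor, end)
    # Gap between the last gene and the end of the sequence.
    if len(genome) - cursor >= min_length:
        regions.append((cursor, len(genome), genome[cursor:]))
    return regions
```

    For a circular genome, the leading and trailing gaps returned here would need to be joined into a single region.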

  6. An Introduction to Genome Annotation.

    Science.gov (United States)

    Campbell, Michael S; Yandell, Mark

    2015-12-17

    Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.

  7. Human Genome Annotation

    Science.gov (United States)

    Gerstein, Mark

    A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the genome that does not code for canonical genes, concentrating on intergenic features such as structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs) based on processing next-generation sequencing experiments. I will further explain how we cluster together groups of sites to create larger annotations. Next, I will discuss a comprehensive pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome and analyze their distribution with respect to age, protein family, and chromosomal location. Throughout, I will try to introduce some of the computational algorithms and approaches that are required for genome annotation. Much of this work has been carried out in the framework of the ENCODE, modENCODE, and 1000 genomes projects.

  8. The Development of PIPA: An Integrated and Automated Pipeline for Genome-Wide Protein Function Annotation

    Science.gov (United States)

    2008-01-25

    protein function annotation Chenggang Yu1, Nela Zavaljevski1, Valmik Desai1, Seth Johnson2, Fred J Stevens3 and Jaques Reifman*1 Address: 1Biotechnology...cyu@bioanalysis.org; Nela Zavaljevski - nelaz@bioanalysis.org; Valmik Desai - valmik@bioanalysis.org; Seth Johnson - sjohnson@exonhit-usa.com; Fred J

  9. NCBI prokaryotic genome annotation pipeline.

    Science.gov (United States)

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

  10. Bioinformatics for plant genome annotation

    NARCIS (Netherlands)

    Fiers, M.W.E.J.

    2006-01-01

    Large amounts of genome sequence data are available and much more will become available in the near future. A DNA sequence alone has, however, limited use. Genome annotation is required to assign biological interpretation to the DNA sequence. This thesis describ

  11. Fish the ChIPs: a pipeline for automated genomic annotation of ChIP-Seq data

    Directory of Open Access Journals (Sweden)

    Minucci Saverio

    2011-10-01

    Background: High-throughput sequencing is generating massive amounts of data at a pace that largely exceeds the throughput of data analysis routines. Here we introduce Fish the ChIPs (FC), a computational pipeline aimed at a broad public of users and designed to perform complete ChIP-Seq data analysis of an unlimited number of samples, thus increasing throughput, reproducibility and saving time. Results: Starting from short read sequences, FC performs the following steps: (1) quality controls, (2) alignment to a reference genome, (3) peak calling, (4) genomic annotation, (5) generation of raw signal tracks for visualization on the UCSC and IGV genome browsers. FC exploits some of the fastest and most effective tools available today. Installation on a Mac platform requires very basic computational skills while configuration and usage are supported by a user-friendly graphical user interface. Alternatively, FC can be compiled from the source code on any Unix machine and then run with the possibility of customizing each single parameter through a simple configuration text file that can be generated using a dedicated user-friendly web-form. Considering the execution time, FC can be run on a desktop machine, even though the use of a computer cluster is recommended for analyses of large batches of data. FC is perfectly suited to work with data coming from Illumina Solexa Genome Analyzers or ABI SOLiD and its usage can potentially be extended to any sequencing platform. Conclusions: Compared to existing tools, FC has two main advantages that make it suitable for a broad range of users. First of all, it can be installed and run by wet biologists on a Mac machine. Besides, it can handle an unlimited number of samples, being convenient for large analyses. In this context, computational biologists can increase reproducibility of their ChIP-Seq data analyses while saving time for downstream analyses. Reviewers: This article was reviewed by Gavin Huttley, George

  12. DIYA: A Bacterial Annotation Pipeline for any Genomics Lab

    Science.gov (United States)

    2009-02-12

    microbial genomes overnight (Mardis, 2008). These technologies have created many new small ‘genome centers’ ( Zwick , 2005). DIYA (Do-It- Yourself...2008) The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation. BMC Bioinformatics, 9, 52. Zwick ,M.E

  13. Correction of the Caulobacter crescentus NA1000 genome annotation.

    Directory of Open Access Journals (Sweden)

    Bert Ely

    Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks in third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%.

  14. Correction of the Caulobacter crescentus NA1000 genome annotation.

    Science.gov (United States)

    Ely, Bert; Scott, LaTia Etheredge

    2014-01-01

    Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks in third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%.
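
    The third codon position GC content used in this analysis can be computed per reading frame along a sequence, which is the quantity a GC frame plot displays. The sketch below is a simplified illustration, not the Artemis or MICheck implementation; the function names `gc3` and `gc3_by_frame` and the default window and step sizes are assumptions.

```python
def gc3(seq, frame=0):
    """Fraction of G/C among third codon positions of `seq`.

    `frame` is 0, 1 or 2 and gives the offset of the first codon.
    Returns 0.0 when the sequence holds no complete codon in that frame.
    """
    seq = seq.upper()
    thirds = seq[frame + 2::3]  # every third base, starting at codon position 3
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


def gc3_by_frame(seq, window=120, step=30):
    """Yield (window_start, [gc3 for frames 0, 1, 2]) along the sequence.

    `step` should be a multiple of 3 so that frame numbers refer to the
    same reading frame in every window.
    """
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        yield start, [gc3(chunk, frame) for frame in range(3)]
```

    In a GC-rich genome, the frame whose value peaks across a region is the likely coding frame there, so an annotated gene in a different frame, or in a region with no peak, is a candidate for manual review.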

  15. Improving pan-genome annotation using whole genome multiple alignment

    Directory of Open Access Journals (Sweden)

    Salzberg Steven L

    2011-06-01

    Background: Rapid annotation and comparisons of genomes from multiple isolates (pan-genomes) is becoming commonplace due to advances in sequencing technology. Genome annotations can contain inconsistencies and errors that hinder comparative analysis even within a single species. Tools are needed to compare and improve annotation quality across sets of closely related genomes. Results: We introduce a new tool, Mugsy-Annotator, that identifies orthologs and evaluates annotation quality in prokaryotic genomes using whole genome multiple alignment. Mugsy-Annotator identifies anomalies in annotated gene structures, including inconsistently located translation initiation sites and disrupted genes due to draft genome sequencing or pseudogenes. An evaluation of species pan-genomes using the tool indicates that such anomalies are common, especially at translation initiation sites. Mugsy-Annotator reports alternate annotations that improve consistency and are candidates for further review. Conclusions: Whole genome multiple alignment can be used to efficiently identify orthologs and annotation problem areas in a bacterial pan-genome. Comparisons of annotated gene structures within a species may show more variation than is actually present in the genome, indicating errors in genome annotation. Our new tool Mugsy-Annotator assists re-annotation efforts by highlighting edits that improve annotation consistency.

  16. Automated Eukaryotic Gene Structure Annotation Using EVidenceModeler and the Program to Assemble Spliced Alignments

    Energy Technology Data Exchange (ETDEWEB)

    Haas, B J; Salzberg, S L; Zhu, W; Pertea, M; Allen, J E; Orvis, J; White, O; Buell, C R; Wortman, J R

    2007-12-10

    EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.

  17. Automating Ontological Annotation with WordNet

    Energy Technology Data Exchange (ETDEWEB)

    Sanfilippo, Antonio P.; Tratz, Stephen C.; Gregory, Michelle L.; Chappell, Alan R.; Whitney, Paul D.; Posse, Christian; Paulson, Patrick R.; Baddeley, Bob L.; Hohimer, Ryan E.; White, Amanda M.

    2006-01-22

    Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation. WordNet provides a potentially ideal solution to this problem as it offers a highly structured lexical conceptual representation that has been extensively used to develop word sense disambiguation algorithms. However, WordNet has not been designed as an ontology, and while it can be easily turned into one, the result of doing this would present users with serious practical limitations due to the great number of concepts (synonym sets) it contains. Moreover, mapping WordNet to an existing ontology may be difficult and requires substantial labor. We propose to overcome these limitations by developing an analytical platform that (1) provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes, (2) leverages the lexical richness of WordNet to give an extensive characterization of concept class in terms of lexical instances, and (3) integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text. The ensuing framework makes available an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.

  18. Genome Annotation Transfer Utility (GATU): rapid annotation of viral genomes using a closely related reference genome

    Directory of Open Access Journals (Sweden)

    Upton Chris

    2006-06-01

    Background: Since DNA sequencing has become easier and cheaper, an increasing number of closely related viral genomes have been sequenced. However, many of these have been deposited in GenBank without annotations, severely limiting their value to researchers. While maintaining comprehensive genomic databases for a set of virus families at the Viral Bioinformatics Resource Center http://www.biovirus.org and Viral Bioinformatics – Canada http://www.virology.ca, we found that researchers were unnecessarily spending time annotating viral genomes that were close relatives of already annotated viruses. We have therefore designed and implemented a novel tool, Genome Annotation Transfer Utility (GATU), to transfer annotations from a previously annotated reference genome to a new target genome, thereby greatly reducing this laborious task. Results: GATU transfers annotations from a reference genome to a closely related target genome, while still giving the user final control over which annotations should be included. GATU also detects open reading frames present in the target but not the reference genome and provides the user with a variety of bioinformatics tools to quickly determine if these ORFs should also be included in the annotation. After this process is complete, GATU saves the newly annotated genome as a GenBank, EMBL or XML-format file. The software is coded in Java and runs on a variety of computer platforms. Its user-friendly Graphical User Interface is specifically designed for users trained in the biological sciences. Conclusion: GATU greatly simplifies the initial stages of genome annotation by using a closely related genome as a reference. It is not intended to be a gene prediction tool or a "complete" annotation system, but we have found that it significantly reduces the time required for annotation of genes and mature peptides as well as helping to standardize gene names between related organisms by transferring reference genome
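
    Detecting open reading frames in a target genome, as mentioned in the Results, can be sketched with a six-frame scan. This is a generic illustration rather than GATU's method (GATU is written in Java and its algorithm is not described here); the function names, the ATG-only start rule, the standard stop codons, and the `min_codons` default are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def _orfs_one_strand(seq, min_codons):
    """Find ATG..stop ORFs in the three frames of one strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # Length is counted in codons, excluding the stop codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end) ORFs in forward-strand coordinates.

    Coordinates are 0-based and half-open; for "-" strand ORFs they
    give the span on the forward strand that the ORF occupies.
    """
    seq = seq.upper()
    n = len(seq)
    found = [("+", s, e) for s, e in _orfs_one_strand(seq, min_codons)]
    rc = reverse_complement(seq)
    found += [("-", n - e, n - s) for s, e in _orfs_one_strand(rc, min_codons)]
    return sorted(found, key=lambda orf: orf[1])
```

    ORFs found this way in the target that do not overlap any transferred reference annotation would be the candidates a user then inspects before adding them to the annotation.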

  19. Automated analysis and annotation of basketball video

    Science.gov (United States)

    Saur, Drew D.; Tan, Yap-Peng; Kulkarni, Sanjeev R.; Ramadge, Peter J.

    1997-01-01

    Automated analysis and annotation of video sequences are important for digital video libraries, content-based video browsing and data mining projects. A successful video annotation system should provide users with useful video content summary in a reasonable processing time. Given the wide variety of video genres available today, automatically extracting meaningful video content for annotation still remains hard by using current available techniques. However, a wide range video has inherent structure such that some prior knowledge about the video content can be exploited to improve our understanding of the high-level video semantic content. In this paper, we develop tools and techniques for analyzing structured video by using the low-level information available directly from MPEG compressed video. Being able to work directly in the video compressed domain can greatly reduce the processing time and enhance storage efficiency. As a testbed, we have developed a basketball annotation system which combines the low-level information extracted from MPEG stream with the prior knowledge of basketball video structure to provide high level content analysis, annotation and browsing for events such as wide- angle and close-up views, fast breaks, steals, potential shots, number of possessions and possession times. We expect our approach can also be extended to structured video in other domains.

  20. Software for computing and annotating genomic ranges.

    Directory of Open Access Journals (Sweden)

    Michael Lawrence

    We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.
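
    Overlap detection between two sets of genomic ranges, one of the range operations named above, can be illustrated with a small sketch. The Bioconductor packages are R libraries, and this Python example does not reproduce their API or their algorithms; the `GRange` type, the `find_overlaps` function, and the 1-based inclusive coordinates are assumptions chosen for the illustration.

```python
from bisect import bisect_right
from typing import NamedTuple


class GRange(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # inclusive


def find_overlaps(queries, subjects):
    """Return (query_index, subject_index) pairs sharing at least one base."""
    # Group subjects by chromosome and sort them by start position.
    by_chrom = {}
    for j, s in enumerate(subjects):
        by_chrom.setdefault(s.chrom, []).append((s.start, s.end, j))
    for entries in by_chrom.values():
        entries.sort()

    hits = []
    for i, q in enumerate(queries):
        entries = by_chrom.get(q.chrom, [])
        # Only subjects that start at or before q.end can overlap q.
        upper = bisect_right(entries, (q.end, float("inf"), float("inf")))
        for start, end, j in entries[:upper]:
            if end >= q.start:
                hits.append((i, j))
    return hits
```

    The sorted-start pruning keeps the sketch short; a scalable implementation would typically use an interval tree or a similar index so that subjects ending before the query are skipped as well.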

  1. Restauro-G: A Rapid Genome Re-Annotation System for Comparative Genomics

    Institute of Scientific and Technical Information of China (English)

    Satoshi Tamaki; Kazuharu Arakawa; Nobuaki Kono; Masaru Tomita

    2007-01-01

    Annotations of complete genome sequences submitted directly from sequencing projects are diverse in terms of annotation strategies and update frequencies. These inconsistencies make comparative studies difficult. To allow rapid data preparation of a large number of complete genomes, automation and speed are important for genome re-annotation. Here we introduce an open-source rapid genome re-annotation software system, Restauro-G, specialized for bacterial genomes. Restauro-G re-annotates a genome by similarity searches utilizing the BLAST-Like Alignment Tool, referring to protein databases such as UniProt KB, NCBI nr, NCBI COGs, Pfam, and PSORTb. Re-annotation by Restauro-G achieved over 98% accuracy for most bacterial chromosomes in comparison with the original manually curated annotation of EMBL releases. Restauro-G was developed in the generic bioinformatics workbench G-language Genome Analysis Environment and is distributed at http://restauro-g.iab.keio.ac.jp/ under the GNU General Public License.

  2. EuCAP, a Eukaryotic Community Annotation Package, and its application to the rice genome

    Directory of Open Access Journals (Sweden)

    Hamilton John P

    2007-10-01

    Background: Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate (1) the submission of gene annotation to an annotation project, (2) the review of the submitted models by project annotators, and (3) the incorporation of the submitted models in the ongoing annotation effort. Results: We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CA) has been the annotation of gene families. Annotation can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website http://rice.tigr.org, as well as in the Community Annotation track of the Genome Browser. Conclusion: We have applied EuCAP to rice. As of July 2007, the

  3. Challenges in Whole-Genome Annotation of Pyrosequenced Eukaryotic Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Kuo, Alan; Grigoriev, Igor

    2009-04-17

    Pyrosequencing technologies such as 454/Roche and Solexa/Illumina vastly lower the cost of nucleotide sequencing compared to the traditional Sanger method, and thus promise to greatly expand the number of sequenced eukaryotic genomes. However, the new technologies also bring new challenges such as shorter reads and new kinds and higher rates of sequencing errors, which complicate genome assembly and gene prediction. At JGI we are deploying 454 technology for the sequencing and assembly of ever-larger eukaryotic genomes. Here we describe our first whole-genome annotation of a purely 454-sequenced fungal genome that is larger than a yeast (>30 Mbp). The pezizomycotine (filamentous ascomycote) Aspergillus carbonarius belongs to the Aspergillus section Nigri species complex, members of which are significant as platforms for bioenergy and bioindustrial technology, as members of soil microbial communities and players in the global carbon cycle, and as agricultural toxigens. Application of a modified version of the standard JGI Annotation Pipeline has so far predicted ~10k genes. ~12% of these preliminary annotations suffer a potential frameshift error, which is somewhat higher than the ~9% rate in the Sanger-sequenced and conventionally assembled and annotated genome of fellow Aspergillus section Nigri member A. niger. Also, >90% of A. niger genes have potential homologs in the A. carbonarius preliminary annotation. We conclude, and with further annotation and comparative analysis expect to confirm, that 454 sequencing strategies provide a promising substrate for annotation of modestly sized eukaryotic genomes. We will also present results of annotation of a number of other pyrosequenced fungal genomes of bioenergy interest.

  4. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes

    Science.gov (United States)

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-...

  5. DNAVis: interactive visualization of comparative genome annotations

    NARCIS (Netherlands)

    Fiers, M.W.E.J.; Wetering, van de H.; Peeters, T.H.J.M.; Wijk, van J.J.; Nap, J.P.H.

    2006-01-01

    The software package DNAVis offers a fast, interactive and real-time visualization of DNA sequences and their comparative genome annotations. DNAVis implements advanced methods of information visualization such as linked views, perspective walls and semantic zooming, in addition to the display of he

  6. Towards Viral Genome Annotation Standards, Report from the 2010 NCBI Annotation Workshop.

    Science.gov (United States)

    Brister, James Rodney; Bao, Yiming; Kuiken, Carla; Lefkowitz, Elliot J; Le Mercier, Philippe; Leplae, Raphael; Madupu, Ramana; Scheuermann, Richard H; Schobel, Seth; Seto, Donald; Shrivastava, Susmita; Sterk, Peter; Zeng, Qiandong; Klimke, William; Tatusova, Tatiana

    2010-10-01

    Improvements in DNA sequencing technologies portend a new era in virology and could possibly lead to a giant leap in our understanding of viral evolution and ecology. Yet, as viral genome sequences begin to fill the world's biological databases, it is critically important to recognize that the scientific promise of this era is dependent on consistent and comprehensive genome annotation. With this in mind, the NCBI Genome Annotation Workshop recently hosted a study group tasked with developing sequence, function, and metadata annotation standards for viral genomes. This report describes the issues involved in viral genome annotation and reviews policy recommendations presented at the NCBI Annotation Workshop.

  7. AGeS: a software system for microbial genome sequence annotation.

    Directory of Open Access Journals (Sweden)

    Kamal Kumar

    Full Text Available BACKGROUND: The annotation of genomes from next-generation sequencing platforms needs to be rapid, high-throughput, and fully integrated and automated. Although a few Web-based annotation services have recently become available, they may not be the best solution for researchers that need to annotate a large number of genomes, possibly including proprietary data, and store them locally for further analysis. To address this need, we developed a standalone software application, the Annotation of microbial Genome Sequences (AGeS) system, which incorporates publicly available and in-house-developed bioinformatics tools and databases, many of which are parallelized for high-throughput performance. METHODOLOGY: The AGeS system supports three main capabilities. The first is the storage of input contig sequences and the resulting annotation data in a central, customized database. The second is the annotation of microbial genomes using an integrated software pipeline, which first analyzes contigs from high-throughput sequencing by locating genomic regions that code for proteins, RNA, and other genomic elements through the Do-It-Yourself Annotation (DIYA) framework. The identified protein-coding regions are then functionally annotated using the in-house-developed Pipeline for Protein Annotation (PIPA). The third capability is the visualization of annotated sequences using GBrowse. To date, we have implemented these capabilities for bacterial genomes. AGeS was evaluated by comparing its genome annotations with those provided by three other methods. Our results indicate that the software tools integrated into AGeS provide annotations that are in general agreement with those provided by the compared methods. This is demonstrated by a >94% overlap in the number of identified genes, a significant number of identical annotated features, and a >90% agreement in enzyme function predictions.

  8. Fuzzy Emotional Semantic Analysis and Automated Annotation of Scene Images

    Directory of Open Access Journals (Sweden)

    Jianfang Cao

    2015-01-01

    Full Text Available With the advances in electronic and imaging techniques, the production of digital images has rapidly increased, and the extraction and automated annotation of emotional semantics implied by images have become issues that must be urgently addressed. To better simulate human subjectivity and ambiguity for understanding scene images, the current study proposes an emotional semantic annotation method for scene images based on fuzzy set theory. A fuzzy membership degree was calculated to describe the emotional degree of a scene image and was implemented using the Adaboost algorithm and a back-propagation (BP) neural network. The automated annotation method was trained and tested using scene images from the SUN Database. The annotation results were then compared with those based on artificial annotation. Our method showed an annotation accuracy rate of 91.2% for basic emotional values and 82.4% after extended emotional values were added, which correspond to increases of 5.5% and 8.9%, respectively, compared with the results from using a single BP neural network algorithm. Furthermore, the retrieval accuracy rate based on our method reached approximately 89%. This study attempts to lay a solid foundation for the automated emotional semantic annotation of more types of images and therefore is of practical significance.

  9. Fuzzy emotional semantic analysis and automated annotation of scene images.

    Science.gov (United States)

    Cao, Jianfang; Chen, Lichao

    2015-01-01

    With the advances in electronic and imaging techniques, the production of digital images has rapidly increased, and the extraction and automated annotation of emotional semantics implied by images have become issues that must be urgently addressed. To better simulate human subjectivity and ambiguity for understanding scene images, the current study proposes an emotional semantic annotation method for scene images based on fuzzy set theory. A fuzzy membership degree was calculated to describe the emotional degree of a scene image and was implemented using the Adaboost algorithm and a back-propagation (BP) neural network. The automated annotation method was trained and tested using scene images from the SUN Database. The annotation results were then compared with those based on artificial annotation. Our method showed an annotation accuracy rate of 91.2% for basic emotional values and 82.4% after extended emotional values were added, which correspond to increases of 5.5% and 8.9%, respectively, compared with the results from using a single BP neural network algorithm. Furthermore, the retrieval accuracy rate based on our method reached approximately 89%. This study attempts to lay a solid foundation for the automated emotional semantic annotation of more types of images and therefore is of practical significance.

  10. Genome cartography through domain annotation.

    Science.gov (United States)

    Ponting, C P; Dickens, N J

    2001-01-01

    The evolutionary history of eukaryotic proteins involves rapid sequence divergence, addition and deletion of domains, and fusion and fission of genes. Although the protein repertoires of distantly related species differ greatly, their domain repertoires do not. To account for the great diversity of domain contexts and an unexpected paucity of ortholog conservation, we must categorize the coding regions of completely sequenced genomes into domain families, as well as protein families.

  11. Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation.

    Directory of Open Access Journals (Sweden)

    Oscar Beijbom

    Full Text Available Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images captured at four Pacific coral reefs. Inter- and intra- annotator variability among six human experts was quantified and compared to semi- and fully- automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys.

  12. Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation

    Science.gov (United States)

    Beijbom, Oscar; Edmunds, Peter J.; Roelfsema, Chris; Smith, Jennifer; Kline, David I.; Neal, Benjamin P.; Dunlap, Matthew J.; Moriarty, Vincent; Fan, Tung-Yung; Tan, Chih-Jui; Chan, Stephen; Treibitz, Tali; Gamst, Anthony; Mitchell, B. Greg; Kriegman, David

    2015-01-01

    Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images captured at four Pacific coral reefs. Inter- and intra- annotator variability among six human experts was quantified and compared to semi- and fully- automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys. PMID:26154157

  13. Annotation-Based Whole Genomic Prediction and Selection

    DEFF Research Database (Denmark)

    Kadarmideen, Haja; Do, Duy Ngoc; Janss, Luc;

    in their contribution to estimated genomic variances and in prediction of genomic breeding values by applying SNP annotation approaches to feed efficiency. Ensembl Variant Predictor (EVP) and Pig QTL database were used as the source of genomic annotation for 60K chip. Genomic prediction was performed using the Bayes...... prove useful for less heritable traits such as diseases and fertility...

  14. Bioinformatics Assisted Gene Discovery and Annotation of Human Genome

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    As the sequencing stage of the Human Genome Project nears its end, work has begun on discovering novel genes from genome sequences and annotating their biological functions. This paper reviews the major bioinformatics tools and technologies currently available for large-scale gene discovery and annotation from human genome sequences. Some ideas about possible future developments are also provided.

  15. Applied bioinformatics: Genome annotation and transcriptome analysis

    DEFF Research Database (Denmark)

    Gupta, Vikas

    and dhurrin, which have not previously been characterized in blueberries. There are more than 44,500 spider species with distinct habitats and unique characteristics. Spiders are masters of producing silk webs to catch prey and using venom to neutralize. The exploration of the genetics behind these properties...... has just started. We have assembled and annotated the first two spider genomes to facilitate our understanding of spiders at the molecular level. The need for analyzing the large and increasing amount of sequencing data has increased the demand for efficient, user friendly, and broadly applicable...

  16. Annotation of selection strengths in viral genomes

    DEFF Research Database (Denmark)

    McCauley, Stephen; de Groot, Saskia; Mailund, Thomas;

    2007-01-01

    Motivation: Viral genomes tend to code in overlapping reading frames to maximize information content. This may result in atypical codon bias and particular evolutionary constraints. Due to the fast mutation rate of viruses, there is additional strong evidence for varying selection between intra...... reading frames. We introduce an evolutionary model capable of accounting for varying levels of selection along the genome, and incorporate it into our prior single sequence HMM methodology, extending it now to a phylogenetic HMM. Given an alignment of several homologous viruses to a reference sequence, we...... may thus achieve an annotation both of coding regions as well as selection strengths, allowing us to investigate different selection patterns and hypotheses. Results: We illustrate our method by applying it to a multiple alignment of four HIV2 sequences, as well as four Hepatitis B sequences. We...

  17. Towards a Library of Standard Operating Procedures (SOPs) for (meta)genomic annotation

    Energy Technology Data Exchange (ETDEWEB)

    Kyrpides, Nikos; Angiuoli, Samuel V.; Cochrane, Guy; Field, Dawn; Garrity, George; Gussman, Aaron; Kodira, Chinnappa D.; Klimke, William; Kyrpides, Nikos; Madupu, Ramana; Markowitz, Victor; Tatusova, Tatiana; Thomson, Nick; White, Owen

    2008-04-01

    Genome annotations describe the features of genomes and accompany sequences in genome databases. The methodologies used to generate genome annotation are diverse and typically vary amongst groups. Descriptions of the annotation procedure are helpful in interpreting genome annotation data. Standard Operating Procedures (SOPs) for genome annotation describe the processes that generate genome annotations. Some groups are currently documenting procedures but standards are lacking for structure and content of annotation SOPs. In addition, there is no central repository to store and disseminate procedures and protocols for genome annotation. We highlight the importance of SOPs for genome annotation and endorse a central online repository of SOPs.

  18. Annotation of the protein coding regions of the equine genome

    DEFF Research Database (Denmark)

    Hestand, Matthew S.; Kalbfleisch, Theodore S.; Coleman, Stephen J.;

    2015-01-01

    Current gene annotation of the horse genome is largely derived from in silico predictions and cross-species alignments. Only a small number of genes are annotated based on equine EST and mRNA sequences. To expand the number of equine genes annotated from equine experimental evidence, we sequenced...

  19. Gene calling and bacterial genome annotation with BG7.

    Science.gov (United States)

    Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo

    2015-01-01

    New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa, but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and imperfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences, which are the elements directly related to biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. The tool tolerates sequence errors and retains its capabilities when annotating highly fragmented genomes or mixed sequences coming from several genomes (such as those obtained from metagenomic samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services).

  20. A Human-Curated Annotation of the Candida albicans Genome.

    Directory of Open Access Journals (Sweden)

    2005-07-01

    Full Text Available Recent sequencing and assembly of the genome for the fungal pathogen Candida albicans used simple automated procedures for the identification of putative genes. We have reviewed the entire assembly, both by hand and with additional bioinformatic resources, to accurately map and describe 6,354 genes and to identify 246 genes whose original database entries contained sequencing errors (or possibly mutations) that affect their reading frame. Comparison with other fungal genomes permitted the identification of numerous fungus-specific genes that might be targeted for antifungal therapy. We also observed that, compared to other fungi, the protein-coding sequences in the C. albicans genome are especially rich in short sequence repeats. Finally, our improved annotation permitted a detailed analysis of several multigene families, and comparative genomic studies showed that C. albicans has a far greater catabolic range, encoding respiratory Complex 1, several novel oxidoreductases and ketone body degrading enzymes, malonyl-CoA and enoyl-CoA carriers, several novel amino acid degrading enzymes, a variety of secreted catabolic lipases and proteases, and numerous transporters to assimilate the resulting nutrients. The results of these efforts will ensure that the Candida research community has uniform and comprehensive genomic information for medical research as well as for future diagnostic and therapeutic applications.

  1. Improving microbial genome annotations in an integrated database context.

    Directory of Open Access Journals (Sweden)

    I-Min A Chen

    Full Text Available Effective comparative analysis of microbial genomes requires a consistent and complete view of biological data. Consistency regards the biological coherence of annotations, while completeness regards the extent and coverage of functional characterization for genomes. We have developed tools that allow scientists to assess and improve the consistency and completeness of microbial genome annotations in the context of the Integrated Microbial Genomes (IMG) family of systems. All publicly available microbial genomes are characterized in IMG using different functional annotation and pathway resources, thus providing a comprehensive framework for identifying and resolving annotation discrepancies. A rule-based system for predicting phenotypes in IMG provides a powerful mechanism for validating functional annotations, whereby the phenotypic traits of an organism are inferred based on the presence of certain metabolic reactions and pathways and compared to experimentally observed phenotypes. The IMG family of systems is available at http://img.jgi.doe.gov/.

  2. Annotation of the protein coding regions of the equine genome

    DEFF Research Database (Denmark)

    Hestand, Matthew S.; Kalbfleisch, Theodore S.; Coleman, Stephen J.

    2015-01-01

    Current gene annotation of the horse genome is largely derived from in silico predictions and cross-species alignments. Only a small number of genes are annotated based on equine EST and mRNA sequences. To expand the number of equine genes annotated from equine experimental evidence, we sequenced m...... and appear to be small errors in the equine reference genome, since they are also identified as homozygous variants by genomic DNA resequencing of the reference horse. Taken together, we provide a resource of equine mRNA structures and protein coding variants that will enhance equine and cross...

  3. Automation and Validation of Annotation for Hindi Anaphora Resolution

    Directory of Open Access Journals (Sweden)

    Pardeep Singh

    2015-10-01

    Full Text Available The process of labelling any language genre so that useful information can be extracted from it is called annotation. This provides syntactic information about a word or a word phrase. In this paper, an effort has been made to provide an algorithm for semiautomatic annotation of Hindi text that caters to anaphora resolution only. The study was conducted on twelve files of Ranchi Express available in the EMILLE corpus. The corpus is originally tagged for demonstrative pronouns. The detection of the pronouns is supported by the incorporation of seven tags. However, the semantic interpretation of the demonstrative pronouns is not supported in the original corpus. In this paper an effort has been made to automate the process of tagging as well as the handling of semantic information through additional tags. The study was conducted on 1485 demonstrative pronouns. The average precision, recall, and F-measure are 74, 71, and 72, respectively.
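
    As a quick consistency check on the figures quoted above, the F-measure can be recomputed as the harmonic mean of precision and recall. This is only a minimal sketch: it assumes the reported values are percentages and that the F-measure is the balanced F1 score, and the abstract does not say whether the averages were taken per file. The function name `f_measure` is illustrative and does not come from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract, assumed here to be percentages.
reported_precision = 74.0
reported_recall = 71.0

f1 = f_measure(reported_precision, reported_recall)
print(round(f1, 2))  # compare with the reported F-measure of 72
```

    Under these assumptions the recomputed value rounds to the reported figure, so the three numbers are mutually consistent.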

  4. Solving the Problem: Genome Annotation Standards before the Data Deluge

    Science.gov (United States)

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

    2011-01-01

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819

  5. Using Apollo to browse and edit genome annotations.

    Science.gov (United States)

    Misra, Sima; Harris, Nomi

    2006-01-01

    An annotation is any feature that can be tied to genomic sequence, such as an exon, transcript, promoter, or transposable element. As biological knowledge increases, annotations of different types need to be added and modified, and links to other sources of information need to be incorporated, to allow biologists to easily access all of the available sequence analysis data and design appropriate experiments. The Apollo genome browser and editor offers biologists these capabilities. Apollo can display many different types of computational evidence, such as alignments and similarities based on BLAST searches (UNITS 3.3 & 3.4), and enables biologists to utilize computational evidence to create and edit gene models and other genomic features, e.g., using experimental evidence to refine exon-intron structures predicted by gene prediction algorithms. This protocol describes simple ways to browse genome annotation data, as well as techniques for editing annotations and loading data from different sources.

  6. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome

    Directory of Open Access Journals (Sweden)

    Childs Kevin L

    2010-11-01

    Full Text Available Abstract Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence.

  7. Genome annotations - KOME | LSDB Archive [Life Science Database Archive metadata]

    Lifescience Database Archive (English)

    Full Text Available ....zip File URL: ftp://ftp.biosciencedbc.jp/archive/kome/LATEST/kome_genome_annotat...

  8. Scripps Genome ADVISER: Annotation and Distributed Variant Interpretation SERver.

    Directory of Open Access Journals (Sweden)

    Phillip H Pham

    Full Text Available Interpretation of human genomes is a major challenge. We present the Scripps Genome ADVISER (SG-ADVISER) suite, which aims to fill the gap between data generation and genome interpretation by performing holistic, in-depth annotations and functional predictions on all variant types and effects. The SG-ADVISER suite includes a de-identification tool, a variant annotation web-server, and a user interface for inheritance and annotation-based filtration. SG-ADVISER allows users with no bioinformatics expertise to manipulate large volumes of variant data with ease, without the need to download large reference databases, install software, or use a command line interface. SG-ADVISER is freely available at genomics.scripps.edu/ADVISER.

  9. Improving the Caenorhabditis elegans genome annotation using machine learning.

    Directory of Open Access Journals (Sweden)

    Gunnar Rätsch

    2007-02-01

    Full Text Available For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

  10. MIPS: analysis and annotation of genome information in 2007.

    Science.gov (United States)

    Mewes, H W; Dietmann, S; Frishman, D; Gregory, R; Mannhaupt, G; Mayer, K F X; Münsterkötter, M; Ruepp, A; Spannagl, M; Stümpflen, V; Rattei, T

    2008-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases. Also, the criteria for the transfer of information from known to orthologous sequences are diverse. Coping with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB), (ii) the comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat) and (iii) the compilation of manually curated databases for protein interactions based on scrutinized information from the literature to serve as an accepted set of reliable annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described as well as the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  11. Comparative genomics in cyprinids: Common carp EST's help the annotation of the zebrafish genome

    NARCIS (Netherlands)

    Christoffels, A.; Bartfai, R.; Srinivasan, H.; Komen, J.

    2006-01-01

    Background - Automatic annotation of sequenced eukaryotic genomes integrates a combination of methodologies such as ab-initio methods and alignment of homologous genes and/or proteins. For example, annotation of the zebrafish genome within Ensembl relies heavily on available cDNA and protein sequenc

  12. A Manual Curation Strategy to Improve Genome Annotation: Application to a Set of Haloarchael Genomes

    Directory of Open Access Journals (Sweden)

    Friedhelm Pfeiffer

    2015-06-01

    Full Text Available Genome annotation errors are a persistent problem that impedes research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to only general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly crosschecked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt), to the benefit of the scientific community. The enhanced annotations are also publicly available via HaloLex.
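
    The abstract does not describe how the checks for start codon misassignments and disrupted genes were carried out; the curation was manual and relied on comparison with homologs. Purely as an illustration of the kind of structural sanity check that can flag candidates for such review, the sketch below tests a coding sequence for an unexpected first codon, an in-frame internal stop, and a missing terminal stop. The function name `check_cds`, the set of accepted start codons, and the example sequences are assumptions made for this sketch, not part of the published method.

```python
# Assumed codon sets for this sketch (coding strand, DNA alphabet).
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq: str) -> list[str]:
    """Return a list of structural problems found in a coding sequence."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("unexpected start codon")
    # A stop codon before the last position suggests a disrupted gene.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    return problems


# Hypothetical toy sequences, not taken from any of the genomes above.
print(check_cds("ATGAAATAA"))
print(check_cds("ATGTAAAAATAA"))
```

    A check like this can only point to entries worth a closer look. Deciding which of several candidate start codons is correct still requires the homology-based comparison that the abstract describes.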

  13. Genome annotation of a Saccharomyces sp. lager brewer's yeast

    Directory of Open Access Journals (Sweden)

    Patricia Marcela De León-Medina

    2016-09-01

    Full Text Available The genome of lager brewer's yeast is a hybrid, with Saccharomyces eubayanus and Saccharomyces cerevisiae as sub-genomes. Due to their specific use in the beer industry, relatively little information is available. The genome of brewing yeast was sequenced and annotated in this study. We obtained a genome size of 22.7 Mbp that consisted of 133 scaffolds, with 65 scaffolds larger than 10 kbp. With respect to the annotation, 9939 genes were obtained, and when they were submitted to a local alignment, we found that 53.93% of these genes corresponded to S. cerevisiae, while another 42.86% originated from S. eubayanus. Our results confirm that our strain is a hybrid of at least two different genomes.

  14. Analysis of high-throughput sequencing and annotation strategies for phage genomes.

    Directory of Open Access Journals (Sweden)

    Matthew R Henn

    Full Text Available BACKGROUND: Bacterial viruses (phages) play a critical role in shaping microbial populations as they influence both host mortality and horizontal gene transfer. As such, they have a significant impact on local and global ecosystem function and human health. Despite their importance, little is known about the genomic diversity harbored in phages, as methods to capture complete phage genomes have been hampered by the lack of knowledge about the target genomes, and difficulties in generating sufficient quantities of genomic DNA for sequencing. Of the approximately 550 phage genomes currently available in the public domain, fewer than 5% are marine phage. METHODOLOGY/PRINCIPAL FINDINGS: To advance the study of phage biology through comparative genomic approaches we used marine cyanophage as a model system. We compared DNA preparation methodologies (DNA extraction directly from either phage lysates or CsCl purified phage particles), and sequencing strategies that utilize either Sanger sequencing of a linker amplification shotgun library (LASL) or of a whole genome shotgun library (WGSL), or 454 pyrosequencing methods. We demonstrate that genomic DNA sample preparation directly from a phage lysate, combined with 454 pyrosequencing, is best suited for phage genome sequencing at scale, as this method is capable of capturing complete continuous genomes with high accuracy. In addition, we describe an automated annotation informatics pipeline that delivers high-quality annotation and yields few false positives and negatives in ORF calling. CONCLUSIONS/SIGNIFICANCE: These DNA preparation, sequencing and annotation strategies enable a high-throughput approach to the burgeoning field of phage genomics.

  15. Intra-species sequence comparisons for annotating genomes

    Energy Technology Data Exchange (ETDEWEB)

    Boffelli, Dario; Weer, Claire V.; Weng, Li; Lewis, Keith D.; Shoukry, Malak I.; Pachter, Lior; Keys, David N.; Rubin, Edward M.

    2004-07-15

    Analysis of sequence variation among members of a single species offers a potential approach to identify functional DNA elements responsible for biological features unique to that species. Due to its high rate of allelic polymorphism and ease of genetic manipulability, we chose the sea squirt, Ciona intestinalis, to explore intra-species sequence comparisons for genome annotation. A large number of C. intestinalis specimens were collected from four continents and a set of genomic intervals amplified, resequenced and analyzed to determine the mutation rates at each nucleotide in the sequence. We found that regions with low mutation rates efficiently demarcated functionally constrained sequences: these include a set of noncoding elements, which we showed in C. intestinalis transgenic assays to act as tissue-specific enhancers, as well as the location of coding sequences. This illustrates that comparisons of multiple members of a species can be used for genome annotation, suggesting a path for the annotation of the sequenced genomes of organisms occupying uncharacterized phylogenetic branches of the animal kingdom, and raises the possibility that the resequencing of a large number of Homo sapiens individuals might be used to annotate the human genome and identify sequences defining traits unique to our species. The sequence data from this study have been submitted to GenBank under accession nos. AY667278-AY667407.

  16. Missing genes in the annotation of prokaryotic genomes

    Directory of Open Access Journals (Sweden)

    Feng Wu-chun

    2010-03-01

    Full Text Available Abstract Background Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore the question arises as to whether current genome annotations have systematically missing, small genes. Results We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Conclusions Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
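
    The ORF enumeration step mentioned in this abstract (all ORFs of at least 33 aa) could be sketched roughly as follows. This is an illustration only, not the authors' method: the abstract does not define how an ORF is delimited, so the sketch assumes ATG-to-stop ORFs, the three standard stop codons, and a scan of all six reading frames. The function names `find_orfs` and `reverse_complement` are hypothetical.

```python
# Minimal six-frame ORF scan (assumption: ORF = ATG ... first in-frame stop).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=33):
    """Return (strand, frame, start, end) for each ORF with >= min_aa codons.

    Coordinates are 0-based on the scanned strand; `end` is exclusive and
    includes the stop codon. The stop codon is not counted towards min_aa.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG of the open ORF
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_aa:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


# Toy sequence: ATG + 32 alanine codons + stop = 33 codons before the stop.
toy = "ATG" + "GCT" * 32 + "TAA"
print(find_orfs(toy))
```

    The length threshold is a parameter so that other cut-offs (for example the 100 aa boundary the abstract uses to describe small genes) can be tried; the cross-genome comparison and taxonomic-diversity criteria described in the abstract are not covered by this sketch.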

  17. Genome Annotation in a Community College Cell Biology Lab

    Science.gov (United States)

    Beagley, C. Timothy

    2013-01-01

    The Biology Department at Salt Lake Community College has used the IMG-ACT toolbox to introduce a genome mapping and annotation exercise into the laboratory portion of its Cell Biology course. This project provides students with an authentic inquiry-based learning experience while introducing them to computational biology and contemporary learning…

  18. MUTAGEN: Multi-user tool for annotating GENomes

    DEFF Research Database (Denmark)

    Brugger, K.; Redder, P.; Skovgaard, Marie

    2003-01-01

    MUTAGEN is a free prokaryotic annotation system. It offers the advantages of genome comparison, graphical sequence browsers, search facilities and open-source for user-specific adjustments. The web-interface allows several users to access the system from standard desktop computers. The Sulfolobus...

  19. Annotation of the Clostridium Acetobutylicum Genome

    Energy Technology Data Exchange (ETDEWEB)

    Daly, M. J.

    2004-06-09

    The genome sequence of the solvent producing bacterium Clostridium acetobutylicum ATCC824, has been determined by the shotgun approach. The genome consists of a 3.94 Mb chromosome and a 192 kb megaplasmid that contains the majority of genes responsible for solvent production. Comparison of C. acetobutylicum to Bacillus subtilis reveals significant local conservation of gene order, which has not been seen in comparisons of other genomes with similar, or, in some cases, closer, phylogenetic proximity. This conservation allows the prediction of many previously undetected operons in both bacteria.

  20. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals.

    Science.gov (United States)

    Nikolaichik, Yevgeny; Damienikan, Aliaksandr U

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a 'gene by gene' approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn't fit with regulatory information allowed us to correct product and gene names for over 300 loci.

  1. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals

    Directory of Open Access Journals (Sweden)

    Yevgeny Nikolaichik

    2016-05-01

    Full Text Available The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci.

  2. MITOS: improved de novo metazoan mitochondrial genome annotation.

    Science.gov (United States)

    Bernt, Matthias; Donath, Alexander; Jühling, Frank; Externbrink, Fabian; Florentz, Catherine; Fritzsch, Guido; Pütz, Joern; Middendorf, Martin; Stadler, Peter F

    2013-11-01

    About 2000 completely sequenced mitochondrial genomes are available from the NCBI RefSeq data base together with manually curated annotations of their protein-coding genes, rRNAs, and tRNAs. This annotation information, which has accumulated over two decades, has been obtained with a diverse set of computational tools and annotation strategies. Despite all efforts of manual curation it is still plagued by misassignments of reading directions, erroneous gene names, and missing as well as false positive annotations in particular for the RNA genes. Taken together, this causes substantial problems for fully automatic pipelines that aim to use these data comprehensively for studies of animal phylogenetics and the molecular evolution of mitogenomes. The MITOS pipeline is designed to compute a consistent de novo annotation of the mitogenomic sequences. We show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality. At the same time we avoid biases, inconsistencies of nomenclature, and typos originating from manual curation strategies. The MITOS pipeline is accessible online at http://mitos.bioinf.uni-leipzig.de.

  3. VIGOR, an annotation program for small viral genomes

    Directory of Open Access Journals (Sweden)

    Wang Shiliang

    2010-09-01

    Full Text Available Abstract Background The decrease in cost for sequencing and improvement in technologies has made it easier and more common for the re-sequencing of large genomes as well as parallel sequencing of small genomes. It is possible to completely sequence a small genome within days and this increases the number of publicly available genomes. Among the types of genomes being rapidly sequenced are those of microbial and viral genomes responsible for infectious diseases. However, accurate gene prediction is a challenge that persists for decoding a newly sequenced genome. Therefore, accurate and efficient gene prediction programs are highly desired for rapid and cost effective surveillance of RNA viruses through full genome sequencing. Results We have developed VIGOR (Viral Genome ORF Reader), a web application tool for gene prediction in influenza virus, rotavirus, rhinovirus and coronavirus subtypes. VIGOR detects protein coding regions based on sequence similarity searches and can accurately detect genome specific features such as frame shifts, overlapping genes, embedded genes, and can predict mature peptides within the context of a single polypeptide open reading frame. Genotyping capability for influenza and rotavirus is built into the program. We compared VIGOR to previously described gene prediction programs, ZCURVE_V, GeneMarkS and FLAN. The specificity and sensitivity of VIGOR are greater than 99% for the RNA viral genomes tested. Conclusions VIGOR is a user friendly web-based genome annotation program for five different viral agents, influenza, rotavirus, rhinovirus, coronavirus and SARS coronavirus. This is the first gene prediction program for rotavirus and rhinovirus for public access. VIGOR is able to accurately predict protein coding genes for the above five viral types and has the capability to assign function to the predicted open reading frames and genotype influenza virus. The prediction software was designed for performing high

  4. MIPS: analysis and annotation of proteins from whole genomes.

    Science.gov (United States)

    Mewes, H W; Amid, C; Arnold, R; Frishman, D; Güldener, U; Mannhaupt, G; Münsterkötter, M; Pagel, P; Strack, N; Stümpflen, V; Warfsmann, J; Ruepp, A

    2004-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  5. Evolutionarily conserved substrate substructures for automated annotation of enzyme superfamilies.

    Directory of Open Access Journals (Sweden)

    Ranyee A Chiang

    Full Text Available The evolution of enzymes affects how well a species can adapt to new environmental conditions. During enzyme evolution, certain aspects of molecular function are conserved while other aspects can vary. Aspects of function that are more difficult to change or that need to be reused in multiple contexts are often conserved, while those that vary may indicate functions that are more easily changed or that are no longer required. In analogy to the study of conservation patterns in enzyme sequences and structures, we have examined the patterns of conservation and variation in enzyme function by analyzing graph isomorphisms among enzyme substrates of a large number of enzyme superfamilies. This systematic analysis of substrate substructures establishes the conservation patterns that typify individual superfamilies. Specifically, we determined the chemical substructures that are conserved among all known substrates of a superfamily and the substructures that are reacting in these substrates and then examined the relationship between the two. Across the 42 superfamilies that were analyzed, substantial variation was found in how much of the conserved substructure is reacting, suggesting that superfamilies may not be easily grouped into discrete and separable categories. Instead, our results suggest that many superfamilies may need to be treated individually for analyses of evolution, function prediction, and guiding enzyme engineering strategies. Annotating superfamilies with these conserved and reacting substructure patterns provides information that is orthogonal to information provided by studies of conservation in superfamily sequences and structures, thereby improving the precision with which we can predict the functions of enzymes of unknown function and direct studies in enzyme engineering. Because the method is automated, it is suitable for large-scale characterization and comparison of fundamental functional capabilities of both characterized

  6. cDNA2Genome: A tool for mapping and annotating cDNAs

    Directory of Open Access Journals (Sweden)

    Suhai Sandor

    2003-09-01

    Full Text Available Abstract Background In recent years several high-throughput cDNA sequencing projects have been funded worldwide with the aim of identifying and characterizing the structure of complete novel human transcripts. However some of these cDNAs are error prone due to frameshifts and stop codon errors caused by low sequence quality, or to cloning of truncated inserts, among other reasons. Therefore, accurate CDS prediction from these sequences first requires the identification of potentially problematic cDNAs in order to speed up the subsequent annotation process. Results cDNA2Genome is an application for the automatic high-throughput mapping and characterization of cDNAs. It utilizes current annotation data and the most up to date databases, especially in the case of ESTs and mRNAs, in conjunction with a vast number of approaches to gene prediction in order to perform a comprehensive assessment of the cDNA exon-intron structure. The final result of cDNA2Genome is an XML file containing all relevant information obtained in the process. This XML output can easily be used for further analysis such as program pipelines, or the integration of results into databases. The web interface to cDNA2Genome also presents this data in HTML, where the annotation is additionally shown in a graphical form. cDNA2Genome has been implemented under the W3H task framework which allows the combination of bioinformatics tools in tailor-made analysis task flows as well as the sequential or parallel computation of many sequences for large-scale analysis. Conclusions cDNA2Genome represents a new versatile and easily extensible approach to the automated mapping and annotation of human cDNAs. The underlying approach allows sequential or parallel computation of sequences for high-throughput analysis of cDNAs.

  7. nGASP - the nematode genome annotation assessment project

    Energy Technology Data Exchange (ETDEWEB)

    Coghlan, A; Fiedler, T J; McKay, S J; Flicek, P; Harris, T W; Blasiar, D; Allen, J; Stein, L D

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders.

  8. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy.

    Science.gov (United States)

    Pruitt, Kim D; Tatusova, Tatiana; Brown, Garth R; Maglott, Donna R

    2012-01-01

    The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16,000 organisms, 2.4 × 10^6 genomic records, 13 × 10^6 proteins and 2 × 10^6 RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).

  9. Assembly, Annotation, and Analysis of Multiple Mycorrhizal Fungal Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Initiative Consortium, Mycorrhizal Genomics; Kuo, Alan; Grigoriev, Igor; Kohler, Annegret; Martin, Francis

    2013-03-08

    Mycorrhizal fungi play critical roles in host plant health, soil community structure and chemistry, and carbon and nutrient cycling, all areas of intense interest to the US Dept. of Energy (DOE) Joint Genome Institute (JGI). To this end we are building on our earlier sequencing of the Laccaria bicolor genome by partnering with INRA-Nancy and the mycorrhizal research community in the MGI to sequence and analyze dozens of mycorrhizal genomes of all Basidiomycota and Ascomycota orders and multiple ecological types (ericoid, orchid, and ectomycorrhizal). JGI has developed and deployed high-throughput sequencing techniques, and Assembly, RNASeq, and Annotation Pipelines. In 2012 alone we sequenced, assembled, and annotated 12 draft or improved genomes of mycorrhizae, and predicted ~232831 genes and ~15011 multigene families. All of this data is publicly available on JGI MycoCosm (http://jgi.doe.gov/fungi/), which provides access to both the genome data and tools with which to analyze the data. Preliminary comparisons of the current total of 14 public mycorrhizal genomes suggest that 1) short secreted proteins potentially involved in symbiosis are more enriched in some orders than in others amongst the mycorrhizal Agaricomycetes, 2) there are wide ranges of numbers of genes involved in certain functional categories, such as signal transduction and post-translational modification, and 3) novel gene families are specific to some ecological types.

  10. Scaling up genome annotation using MAKER and work queue.

    Science.gov (United States)

    Thrasher, Andrew; Musgrave, Zachary; Kachmarck, Brian; Thain, Douglas; Emrich, Scott

    2014-01-01

    Next generation sequencing technologies have enabled sequencing many genomes. Because of the overall increasing demand and the inherent parallelism available in many required analyses, these bioinformatics applications should ideally run on clusters, clouds and/or grids. We present a modified annotation framework that achieves a speed-up of 45x using 50 workers using a Caenorhabditis japonica test case. We also evaluate these modifications within the Amazon EC2 cloud framework. The underlying genome annotation (MAKER) is parallelised as an MPI application. Our framework enables it to now run without MPI while utilising a wide variety of distributed computing resources. This parallel framework also allows easy explicit data transfer, which helps overcome a major limitation of bioinformatics tools that often rely on shared file systems. Combined, our proposed framework can be used, even during early stages of development, to easily run sequence analysis tools on clusters, grids and clouds.
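
    As a quick reading aid for the figures quoted in this abstract, the reported 45x speed-up on 50 workers corresponds to a parallel efficiency of 90%, using the usual definition efficiency = speed-up / workers. The helper name below is hypothetical, and the calculation assumes the speed-up is measured against a single worker, which the abstract does not state.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / workers


# Figures taken from the abstract: 45x speed-up using 50 workers.
efficiency = parallel_efficiency(45, 50)
print(f"{efficiency:.0%}")  # 90%
```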

  11. Sequencing and annotated analysis of an Estonian human genome.

    Science.gov (United States)

    Lilleoja, Rutt; Sarapik, Aili; Reimann, Ene; Reemann, Paula; Jaakma, Ülle; Vasar, Eero; Kõks, Sulev

    2012-02-01

    In the present study we describe the sequencing and annotated analysis of the individual genome of an Estonian. Using SOLiD technology we generated 2,449,441,916 50-bp reads. Bioscope version 1.3 was used for mapping and pairing of reads to the NCBI human genome reference (build 36, hg18). Bioscope also enables the annotation of the results of variant (tertiary) analysis. The average mapping of reads was 75.5% with total coverage of 107.72 Gb, resulting in mean fold coverage of 34.6. We found 3,482,975 SNPs, of which 352,492 were novel. 21,222 SNPs were in coding regions: 10,649 were synonymous SNPs, 10,360 were nonsynonymous missense SNPs, 155 were nonsynonymous nonsense SNPs and 58 were nonsynonymous frameshifts. We identified 219 CNVs with total base pair coverage of 37,326,300 bp and 87,451 large insertion/deletion polymorphisms covering 10,152,256 bp of the genome. In addition, we found 285,864 small insertion/deletion polymorphisms, of which 133,969 were novel. Finally, we identified 53 inversions, 19 overlapped genes and 2 overlapped exons. Interestingly, we found a region in chromosome 6 to be enriched with coding SNPs and CNVs. This study confirms previous findings that our genomes are more complex and variable than previously thought. Therefore, sequencing of personal genomes followed by annotation would improve the analysis of heritability of phenotypes and our understanding of the functions of the genome.
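
    The counts reported in this abstract can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative, and the "implied reference size" is simply total coverage divided by mean fold coverage, which is an assumption about how the authors derived the 34.6x figure.

```python
# Coding-region SNP categories as listed in the abstract.
coding_snps = {
    "synonymous": 10_649,
    "missense": 10_360,
    "nonsense": 155,
    "frameshift": 58,
}
reported_coding_total = 21_222

# The four categories should add up to the reported coding-region total.
category_sum = sum(coding_snps.values())
print(category_sum == reported_coding_total)  # True

# Total coverage (Gb) divided by mean fold coverage gives the genome size
# that the two figures jointly imply.
total_coverage_gb = 107.72
mean_fold_coverage = 34.6
implied_genome_gb = total_coverage_gb / mean_fold_coverage
print(f"implied reference size: {implied_genome_gb:.2f} Gb")

# Share of SNPs described as novel.
novel_fraction = 352_492 / 3_482_975
print(f"novel SNP fraction: {novel_fraction:.1%}")
```

    The category counts sum exactly to the reported 21,222 coding SNPs, and the coverage figures imply a reference of roughly 3.1 Gb.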

  12. High-throughput proteogenomics of Ruegeria pomeroyi: seeding a better genomic annotation for the whole marine Roseobacter clade

    Directory of Open Access Journals (Sweden)

    Christie-Oleza Joseph A

    2012-02-01

    Full Text Available Abstract Background The structural and functional annotation of genomes is now heavily based on data obtained using automated pipeline systems. The key for an accurate structural annotation consists of blending similarities between closely related genomes with biochemical evidence of the genome interpretation. In this work we applied high-throughput proteogenomics to Ruegeria pomeroyi, a member of the Roseobacter clade, an abundant group of marine bacteria, as a seed for the annotation of the whole clade. Results A large dataset of peptides from R. pomeroyi was obtained after searching over 1.1 million MS/MS spectra against a six-frame translated genome database. We identified 2006 polypeptides, of which thirty-four were encoded by open reading frames (ORFs) that had not previously been annotated. From the pool of 'one-hit-wonders', i.e. those ORFs specified by only one peptide detected by tandem mass spectrometry, we could confirm the probable existence of five additional new genes after proving that the corresponding RNAs were transcribed. We also identified the most-N-terminal peptide of 486 polypeptides, of which sixty-four had originally been wrongly annotated. Conclusions By extending these re-annotations to the other thirty-six Roseobacter isolates sequenced to date (twenty different genera), we propose the correction of the assigned start codons of 1082 homologous genes in the clade. In addition, we also report the presence of novel genes within operons encoding determinants of the important tricarboxylic acid cycle, a feature that seems to be characteristic of some Roseobacter genomes. The detection of their corresponding products in large amounts raises the question of their function. Their discoveries point to a possible theory for protein evolution that will rely on high expression of orphans in bacteria: their putative poor efficiency could be counterbalanced by a higher level of expression. Our proteogenomic analysis will increase

  13. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects

    Directory of Open Access Journals (Sweden)

    Holt Carson

    2011-12-01

    Full Text Available Abstract Background Second-generation sequencing technologies are precipitating major shifts with regards to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies. Results We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review. Conclusions MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.

  14. ArrayIDer: automated structural re-annotation pipeline for DNA microarrays

    Directory of Open Access Journals (Sweden)

    McCarthy Fiona M

    2009-01-01

    Full Text Available Abstract Background Systems biology modeling from microarray data requires the most contemporary structural and functional array annotation. However, microarray annotations, especially for non-commercial, non-traditional biomedical model organisms, are often dated. In addition, most microarray analysis tools do not readily accept EST clone names, which are abundantly represented on arrays. Manual re-annotation of microarrays is impracticable and so we developed a computational re-annotation tool (ArrayIDer) to retrieve the most recent accession mapping files from public databases based on EST clone names or accessions and rapidly generate database accessions for entire microarrays. Results We utilized the Fred Hutchinson Cancer Research Centre 13K chicken cDNA array – a widely-used non-commercial chicken microarray – to demonstrate the principle that ArrayIDer could markedly improve annotation. We structurally re-annotated 55% of the entire array. Moreover, we decreased non-chicken functional annotations 2-fold. One beneficial consequence of our re-annotation was to identify 290 pseudogenes, of which 66 were previously incorrectly annotated. Conclusion ArrayIDer allows rapid automated structural re-annotation of entire arrays and provides multiple accession types for use in subsequent functional analysis. This information is especially valuable for systems biology modeling in the non-traditional biomedical model organisms.

  15. ARC: Automated Resource Classifier for agglomerative functional classification of prokaryotic proteins using annotation texts

    Indian Academy of Sciences (India)

    Muthiah Gnanamani; Naveen Kumar; Srinivasan Ramachandran

    2007-08-01

    Functional classification of proteins is central to comparative genomics. The need for algorithms tuned to enable integrative interpretation of analytical data is felt globally. The availability of a general, automated software with built-in flexibility will significantly aid this activity. We have prepared ARC (Automated Resource Classifier), which is an open source software meeting the user requirements of flexibility. The default classification scheme based on keyword match is agglomerative and directs entries into any of the 7 basic non-overlapping functional classes: Cell wall, Cell membrane and Transporters ($\mathcal{C}$), Cell division ($\mathcal{D}$), Information ($\mathcal{I}$), Translocation ($\mathcal{L}$), Metabolism ($\mathcal{M}$), Stress ($\mathcal{R}$), Signal and communication ($\mathcal{S}$) and 2 ancillary classes: Others ($\mathcal{O}$) and Hypothetical ($\mathcal{H}$). The keyword library of ARC was built serially by first drawing keywords from Bacillus subtilis and Escherichia coli K12. In subsequent steps, this library was further enriched by collecting terms from archaeal representative Archaeoglobus fulgidus, Gene Ontology, and Gene Symbols. ARC is 94.04% successful on 6,75,663 annotated proteins from 348 prokaryotes. Three examples are provided to illuminate the current perspectives on mycobacterial physiology and costs of proteins in 333 prokaryotes. ARC is available at http://arc.igib.res.in.
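
    The keyword-match scheme described in this abstract can be illustrated with a minimal Python sketch. The keyword lists, the `classify` function, the first-match-wins ordering and the rule that "hypothetical" is checked first are all assumptions made for illustration; ARC's actual keyword library and agglomerative logic are not given in the abstract and may differ.

```python
# Hypothetical, tiny keyword library keyed by the class symbols used in the
# abstract. ARC's real library is much larger and is not reproduced here.
KEYWORDS = {
    "C": ["cell wall", "membrane", "transporter", "permease"],
    "D": ["cell division", "septum"],
    "I": ["ribosomal", "polymerase", "trna"],
    "L": ["translocase", "secretion"],
    "M": ["dehydrogenase", "synthase", "reductase"],
    "R": ["heat shock", "stress", "chaperone"],
    "S": ["sensor", "signal", "chemotaxis"],
}


def classify(annotation: str) -> str:
    """Assign one class symbol to a free-text protein annotation.

    Assumed rules: annotations containing "hypothetical" go to H, the first
    basic class with a matching keyword wins (so classes never overlap),
    and anything unmatched falls into O (Others).
    """
    text = annotation.lower()
    if "hypothetical" in text:
        return "H"
    for symbol, words in KEYWORDS.items():
        if any(word in text for word in words):
            return symbol
    return "O"


if __name__ == "__main__":
    for name in ["ABC transporter permease",
                 "30S ribosomal protein S1",
                 "hypothetical protein",
                 "phage tail fiber"]:
        print(name, "->", classify(name))
```

    Returning on the first match is one simple way to keep the basic classes non-overlapping; a real classifier would need a principled way to resolve annotations that match keywords from several classes.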

  16. Large-scale prokaryotic gene prediction and comparison to genome annotation

    DEFF Research Database (Denmark)

    Nielsen, Pernille; Krogh, Anders Stærmose

    2005-01-01

    Motivation: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. It makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome...... comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene. Results: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic...... genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to 60% of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between annotated and predicted confirms...

  17. AGAPE (Automated Genome Analysis PipelinE) for pan-genome analysis of Saccharomyces cerevisiae.

    Directory of Open Access Journals (Sweden)

    Giltae Song

    Full Text Available The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced the genomes of 25 strains that are commonly used in the yeast research community using advanced sequencing technology at high quality. We also developed a pipeline for automated pan-genome analysis, which integrates the steps of assembly, annotation, and variation calling. To assign strain-specific functional annotations, we identified genes that were not present in the reference genome. We classified these according to their presence or absence across strains and characterized each group of genes with known functional and phenotypic features. The functional roles of novel genes not found in the reference genome and associated with strains or groups of strains appear to be consistent with anticipated adaptations in specific lineages. As more S. cerevisiae strain genomes are released, our analysis can be used to collate genome data and relate it to lineage-specific patterns of genome evolution. Our new tool set will enhance our understanding of genomic and functional evolution in S. cerevisiae, and will be available to the yeast genetics and molecular biology community.

  18. AGAPE (Automated Genome Analysis PipelinE) for pan-genome analysis of Saccharomyces cerevisiae.

    Science.gov (United States)

    Song, Giltae; Dickins, Benjamin J A; Demeter, Janos; Engel, Stacia; Gallagher, Jennifer; Choe, Kisurb; Dunn, Barbara; Snyder, Michael; Cherry, J Michael

    2015-01-01

    The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced the genomes of 25 strains that are commonly used in the yeast research community using advanced sequencing technology at high quality. We also developed a pipeline for automated pan-genome analysis, which integrates the steps of assembly, annotation, and variation calling. To assign strain-specific functional annotations, we identified genes that were not present in the reference genome. We classified these according to their presence or absence across strains and characterized each group of genes with known functional and phenotypic features. The functional roles of novel genes not found in the reference genome and associated with strains or groups of strains appear to be consistent with anticipated adaptations in specific lineages. As more S. cerevisiae strain genomes are released, our analysis can be used to collate genome data and relate it to lineage-specific patterns of genome evolution. Our new tool set will enhance our understanding of genomic and functional evolution in S. cerevisiae, and will be available to the yeast genetics and molecular biology community.

  19. Use of Modern Chemical Protein Synthesis and Advanced Fluorescent Assay Techniques to Experimentally Validate the Functional Annotation of Microbial Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Kent, Stephen [University of Chicago

    2012-07-20

    The objective of this research program was to prototype methods for the chemical synthesis of predicted protein molecules in annotated microbial genomes. High throughput chemical methods were to be used to make large numbers of predicted proteins and protein domains, based on microbial genome sequences. Microscale chemical synthesis methods for the parallel preparation of peptide-thioester building blocks were developed; these peptide segments are used for the parallel chemical synthesis of proteins and protein domains. Ultimately, it is envisaged that these synthetic molecules would be ‘printed’ in spatially addressable arrays. The unique ability of total synthesis to precision label protein molecules with dyes and with chemical or biochemical ‘tags’ can be used to facilitate novel assay technologies adapted from state-of-the art single molecule fluorescence detection techniques. In the future, in conjunction with modern laboratory automation this integrated set of techniques will enable high throughput experimental validation of the functional annotation of microbial genomes.

  20. Automated annotation of chemical names in the literature with tunable accuracy

    Directory of Open Access Journals (Sweden)

    Zhang Jun D

    2011-11-01

    Full Text Available Abstract Background A significant portion of the biomedical and chemical literature refers to small molecules. The accurate identification and annotation of compound names that are relevant to the topic of the given literature can establish links between scientific publications and various chemical and life science databases. Manual annotation is the preferred method for these works because well-trained indexers can understand the paper topics as well as recognize key terms. However, considering the hundreds of thousands of new papers published annually, an automatic annotation system with high precision and relevance can be a useful complement to manual annotation. Results An automated chemical name annotation system, MeSH Automated Annotations (MAA), was developed to annotate small molecule names in scientific abstracts with tunable accuracy. This system aims to reproduce the MeSH term annotations on biomedical and chemical literature that would be created by indexers. When comparing automated free text matching to manual indexing of 26 thousand MEDLINE abstracts, more than 40% of the annotations were false-positive (FP) cases. To reduce the FP rate, MAA incorporated several filters to remove "incorrect" annotations caused by nonspecific, partial, and low relevance chemical names. In part, relevance was measured by the position of the chemical name in the text. Tunable accuracy was obtained by adding or restricting the sections of the text scanned for chemical names. The best precision obtained was 96% with a 28% recall rate. The best performance of MAA, as measured with the F statistic, was 66%, which compares favorably to other chemical name annotation systems. Conclusions Accurate chemical name annotation can help researchers not only identify important chemical names in abstracts, but also match unindexed and unstructured abstracts to chemical records.
The current work is tested against MEDLINE, but the algorithm is not specific to this
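
    To make the precision/recall trade-off in this record concrete, the short Python sketch below computes an F-measure from a precision and a recall value. It assumes that the "F statistic" in the abstract is the balanced F1 score (harmonic mean of precision and recall); the abstract does not state which variant was used, and `f_beta` is an illustrative helper, not part of MAA.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure: weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 score; beta > 1 weights recall more.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Operating point quoted in the abstract: 96% precision, 28% recall.
    print(round(f_beta(0.96, 0.28), 4))
```

    Under the F1 assumption, the 96% precision / 28% recall setting yields an F value of roughly 0.43, so the reported best F of 66% would correspond to a different filter setting with a more even balance of precision and recall.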

  1. IMG ER: A System for Microbial Genome Annotation Expert Review and Curation

    Energy Technology Data Exchange (ETDEWEB)

    Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.

    2009-05-25

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  2. MEGAnnotator: a user-friendly pipeline for microbial genomes assembly and annotation.

    Science.gov (United States)

    Lugli, Gabriele Andrea; Milani, Christian; Mancabelli, Leonardo; van Sinderen, Douwe; Ventura, Marco

    2016-04-01

    Genome annotation is one of the key actions that must be undertaken in order to decipher the genetic blueprint of organisms. Thus, a correct and reliable annotation is essential in rendering genomic data valuable. Here, we describe a bioinformatics pipeline based on freely available software programs coordinated by a multithreaded script named MEGAnnotator (Multithreaded Enhanced prokaryotic Genome Annotator). This pipeline allows the generation of multiple annotated formats fulfilling the NCBI guidelines for assembled microbial genome submission, based on DNA shotgun sequencing reads, and minimizes manual intervention, while also reducing waiting times between software program executions and improving final quality of both assembly and annotation outputs. MEGAnnotator provides an efficient way to pre-arrange the assembly and annotation work required to process NGS genome sequence data. The script improves the final quality of microbial genome annotation by reducing ambiguous annotations. Moreover, the MEGAnnotator platform allows the user to perform a partial annotation of pre-assembled genomes and includes an option to accomplish metagenomic data set assemblies. MEGAnnotator platform will be useful for microbiologists interested in genome analyses of bacteria as well as those investigating the complexity of microbial communities that do not possess the necessary skills to prepare their own bioinformatics pipeline.

  3. Apollo2Go: a web service adapter for the Apollo genome viewer to enable distributed genome annotation

    Directory of Open Access Journals (Sweden)

    Mayer Klaus FX

    2007-08-01

    Full Text Available Abstract Background Apollo, a genome annotation viewer and editor, has become a widely used genome annotation and visualization tool for distributed genome annotation projects. When using Apollo for annotation, database updates are carried out by uploading intermediate annotation files into the respective database. This non-direct database upload is laborious and evokes problems of data synchronicity. Results To overcome these limitations we extended the Apollo data adapter with a generic, configurable web service client that is able to retrieve annotation data in a GAME-XML-formatted string and pass it on to Apollo's internal input routine. Conclusion This Apollo web service adapter, Apollo2Go, simplifies the data exchange in distributed projects and aims to render the annotation process more comfortable. The Apollo2Go software is freely available from ftp://ftpmips.gsf.de/plants/apollo_webservice.

  4. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).

    Science.gov (United States)

    Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

    2015-01-01

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.

  5. Improving Automated Annotation of Benthic Survey Images Using Wide-band Fluorescence

    Science.gov (United States)

    Beijbom, Oscar; Treibitz, Tali; Kline, David I.; Eyal, Gal; Khen, Adi; Neal, Benjamin; Loya, Yossi; Mitchell, B. Greg; Kriegman, David

    2016-03-01

    Large-scale imaging techniques are used increasingly for ecological surveys. However, manual analysis can be prohibitively expensive, creating a bottleneck between collected images and desired data-products. This bottleneck is particularly severe for benthic surveys, where millions of images are obtained each year. Recent automated annotation methods may provide a solution, but reflectance images do not always contain sufficient information for adequate classification accuracy. In this work, the FluorIS, a low-cost modified consumer camera, was used to capture wide-band wide-field-of-view fluorescence images during a field deployment in Eilat, Israel. The fluorescence images were registered with standard reflectance images, and an automated annotation method based on convolutional neural networks was developed. Our results demonstrate a 22% reduction of classification error-rate when using both images types compared to only using reflectance images. The improvements were large, in particular, for coral reef genera Platygyra, Acropora and Millepora, where classification recall improved by 38%, 33%, and 41%, respectively. We conclude that convolutional neural networks can be used to combine reflectance and fluorescence imagery in order to significantly improve automated annotation accuracy and reduce the manual annotation bottleneck.

  6. Genomic variant annotation workflow for clinical applications [version 2; referees: 2 approved]

    Directory of Open Access Journals (Sweden)

    Thomas Thurnherr

    2016-10-01

    Full Text Available Annotation and interpretation of DNA aberrations identified through next-generation sequencing is becoming an increasingly important task, even more so in the context of data analysis pipelines for medical applications, where genomic aberrations are associated with phenotypic and clinical features. Here we describe a workflow to identify potential gene targets in aberrated genes or pathways and their corresponding drugs. To this end, we provide the R/Bioconductor package rDGIdb, an R wrapper to query the drug-gene interaction database (DGIdb). DGIdb accumulates drug-gene interaction data from 15 different resources and allows filtering on different levels. The rDGIdb package makes these resources and tools available to R users. Moreover, rDGIdb queries can be automated through incorporation of the rDGIdb package into NGS sequencing pipelines.

  7. Annotated bibliography of films in automation, data processing, and computer science

    CERN Document Server

    Soloman, Martin B Jr

    2015-01-01

    With the rapid development of computer science and the expanding use of computers in all facets of American life, there has been made available a wide range of instructional and informational films on automation, data processing, and computer science. Here is the first annotated bibliography of these and related films, gathered from industrial, institutional, and other sources. This bibliography annotates 244 films, alphabetically arranged by title, with a detailed subject index. Information is also provided concerning the intended audience, rental-purchase data, ordering procedures, and such s

  8. Expressed Peptide Tags: An additional layer of data for genome annotation

    Energy Technology Data Exchange (ETDEWEB)

    Savidor, Alon [ORNL; Donahoo, Ryan S [ORNL; Hurtado-Gonzales, Oscar [University of Tennessee, Knoxville (UTK); Verberkmoes, Nathan C [ORNL; Shah, Manesh B [ORNL; Lamour, Kurt H [ORNL; McDonald, W Hayes [ORNL

    2006-01-01

    While genome sequencing is becoming ever more routine, genome annotation remains a challenging process. Identification of the coding sequences within the genomic milieu presents a tremendous challenge, especially for eukaryotes with their complex gene architectures. Here we present a method to assist the annotation process through the use of proteomic data and bioinformatics. Mass spectra of digested protein preparations of the organism of interest were acquired and searched against a protein database created by a six frame translation of the genome. The identified peptides were mapped back to the genome, compared to the current annotation, and then categorized as supporting or extending the current genome annotation. We named the classified peptides Expressed Peptide Tags (EPTs). The well annotated bacterium Rhodopseudomonas palustris was used as a control for the method and showed high degree of correlation between EPT mapping and the current annotation, with 86% of the EPTs confirming existing gene calls and less than 1% of the EPTs expanding on the current annotation. The eukaryotic plant pathogens Phytophthora ramorum and Phytophthora sojae, whose genomes have been recently sequenced and are much less well annotated, were also subjected to this method. A series of algorithmic steps were taken to increase the confidence of EPT identification for these organisms, including generation of smaller sub-databases to be searched against, and definition of EPT criteria that accommodates the more complex eukaryotic gene architecture. As expected, the analysis of the Phytophthora species showed less correlation between EPT mapping and their current annotation. While ~77% of Phytophthora EPTs supported the current annotation, a portion of them (7.2% and 12.6% for P. ramorum and P. sojae, respectively) suggested modification to current gene calls or identified novel genes that were missed by the current genome annotation of these organisms.
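
    The six-frame translation step mentioned in this abstract can be sketched in Python as follows. This is an illustration only: it assumes the standard genetic code and plain A/C/G/T input, the function names are made up for this example, and the authors' actual database construction (for instance, how frames are split into searchable entries) is not described in the abstract.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(genome: str) -> dict:
    """Translate both strands in all three reading frames.

    Keys are "+1".."+3" for the forward strand and "-1".."-3" for the
    reverse strand; stop codons appear as "*".
    """
    genome = genome.upper()
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

    In a proteomics search setting, each translated frame would typically be cut at stop codons into candidate peptides before being written to a protein database; that step is omitted here.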

  9. VESPA: Software to Facilitate Genomic Annotation of Prokaryotic Organisms Through Integration of Proteomic and Transcriptomic Data

    Energy Technology Data Exchange (ETDEWEB)

    Peterson, Elena S.; McCue, Lee Ann; Rutledge, Alexandra C.; Jensen, Jeffrey L.; Walker, Julia; Kobold, Mark A.; Webb, Samantha R.; Payne, Samuel H.; Ansong, Charles; Adkins, Joshua N.; Cannon, William R.; Webb-Robertson, Bobbie-Jo M.

    2012-04-25

    Visual Exploration and Statistics to Promote Annotation (VESPA) is an interactive visual analysis software tool that facilitates the discovery of structural mis-annotations in prokaryotic genomes. VESPA integrates high-throughput peptide-centric proteomics data and oligo-centric or RNA-Seq transcriptomics data into a genomic context. The data may be interrogated via visual analysis across multiple levels of genomic resolution, linked searches, exports and interaction with BLAST to rapidly identify location of interest within the genome and evaluate potential mis-annotations.

  10. The DOE-JGI Standard Operating Procedure for the Annotations of the Microbial Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Mavromatis, Konstantinos; Ivanova, Natalia; Chen, I-Min A.; Szeto, Ernest; Markowitz, Victor; Kyrpides, Nikos C.

    2009-05-20

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes prediction of protein coding and RNA genes, as well as repeats and assignment of product names to these genes.

  11. TAPDANCE: An automated tool to identify and annotate transposon insertion CISs and associations between CISs from next generation sequence data

    Directory of Open Access Journals (Sweden)

    Sarver Aaron L

    2012-06-01

    Full Text Available Abstract Background Next generation sequencing approaches applied to the analyses of transposon insertion junction fragments generated in high throughput forward genetic screens have created the need for clear informatics and statistical approaches to deal with the massive amount of data currently being generated. Previous approaches utilized to (1) map junction fragments within the genome and (2) identify Common Insertion Sites (CISs) within the genome are not practical due to the volume of data generated by current sequencing technologies. Previous approaches applied to this problem also required significant manual annotation. Results We describe Transposon Annotation Poisson Distribution Association Network Connectivity Environment (TAPDANCE) software, which automates the identification of CISs within transposon junction fragment insertion data. Starting with barcoded sequence data, the software identifies and trims sequences and maps putative genomic sequence to a reference genome using the bowtie short read mapper. Poisson distribution statistics are then applied to assess and rank genomic regions showing significant enrichment for transposon insertion. Novel methods of counting insertions are used to ensure that the results presented have the expected characteristics of informative CISs. A persistent MySQL database is generated and utilized to keep track of sequences, mappings and common insertion sites. Additionally, associations between phenotypes and CISs are also identified using Fisher’s exact test with multiple testing correction. In a case study using previously published data we show that the TAPDANCE software identifies CISs as previously described, prioritizes them based on p-value, allows holistic visualization of the data within genome browser software and identifies relationships present in the structure of the data. Conclusions The TAPDANCE process is fully automated, performs similarly to previous labor intensive approaches
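
    The idea of using Poisson statistics to flag genomic regions enriched for insertions can be sketched in Python as below. The uniform-background model, the fixed window and the function names are assumptions for illustration; the abstract does not specify TAPDANCE's window definitions, counting rules or multiple-testing procedure.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """Upper tail P(X >= k) for X ~ Poisson(lam), via the complement of the CDF."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def window_pvalue(observed: int, total_insertions: int,
                  genome_size: int, window_size: int) -> float:
    """P-value for seeing at least `observed` insertions in one window.

    Assumes insertions fall uniformly at random across the genome, so the
    expected count per window is total_insertions * window_size / genome_size.
    """
    lam = total_insertions * window_size / genome_size
    return poisson_sf(observed, lam)


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 insertions in a 1 Mb genome, 1 kb window
    # (expected count 1.0) containing 5 insertions.
    print(window_pvalue(5, 1000, 1_000_000, 1000))
```

    Computing the tail as one minus the CDF loses precision for extremely small p-values; a production tool would more likely rely on a statistics library and would correct for the number of windows tested.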

  12. Prototype semantic infrastructure for automated small molecule classification and annotation in lipidomics

    Directory of Open Access Journals (Sweden)

    Dumontier Michel

    2011-07-01

    Full Text Available Abstract Background The development of high-throughput experimentation has led to astronomical growth in biologically relevant lipids and lipid derivatives identified, screened, and deposited in numerous online databases. Unfortunately, efforts to annotate, classify, and analyze these chemical entities have largely remained in the hands of human curators using manual or semi-automated protocols, leaving many novel entities unclassified. Since chemical function is often closely linked to structure, accurate structure-based classification and annotation of chemical entities is imperative to understanding their functionality. Results As part of an exploratory study, we have investigated the utility of semantic web technologies in automated chemical classification and annotation of lipids. Our prototype framework consists of two components: an ontology and a set of federated web services that operate upon it. The formal lipid ontology we use here extends a part of the LiPrO ontology and draws on the lipid hierarchy in the LIPID MAPS database, as well as literature-derived knowledge. The federated semantic web services that operate upon this ontology are deployed within the Semantic Annotation, Discovery, and Integration (SADI) framework. Structure-based lipid classification is enacted by two core services. Firstly, a structural annotation service detects and enumerates relevant functional groups for a specified chemical structure. A second service reasons over lipid ontology class descriptions using the attributes obtained from the annotation service and identifies the appropriate lipid classification. We extend the utility of these core services by combining them with additional SADI services that retrieve associations between lipids and proteins and identify publications related to specified lipid types.
We analyze the performance of SADI-enabled eicosanoid classification relative to the LIPID MAPS classification and reflect on the contribution of

  13. Improved annotation through genome-scale metabolic modeling of Aspergillus oryzae

    DEFF Research Database (Denmark)

    Vongsangnak, Wanwipa; Olsen, Peter; Hansen, Kim;

    2008-01-01

    to a genome scale metabolic model of A. oryzae. Results: Our assembled EST sequences we identified 1,046 newly predicted genes in the A. oryzae genome. Furthermore, it was possible to assign putative protein functions to 398 of the newly predicted genes. Noteworthy, our annotation strategy resulted......Background: Since ancient times the filamentous fungus Aspergillus oryzae has been used in the fermentation industry for the production of fermented sauces and the production of industrial enzymes. Recently, the genome sequence of A. oryzae with 12,074 annotated genes was released but the number...... of hypothetical proteins accounted for more than 50% of the annotated genes. Considering the industrial importance of this fungus, it is therefore valuable to improve the annotation and further integrate genomic information with biochemical and physiological information available for this microorganism and other...

  14. Re-annotation and re-analysis of the Campylobacter jejuni NCTC11168 genome sequence

    Directory of Open Access Journals (Sweden)

    Dorrell Nick

    2007-06-01

    Full Text Available Abstract Background Campylobacter jejuni is the leading bacterial cause of human gastroenteritis in the developed world. To improve our understanding of this important human pathogen, the C. jejuni NCTC11168 genome was sequenced and published in 2000. The original annotation was a milestone in Campylobacter research, but is outdated. We now describe the complete re-annotation and re-analysis of the C. jejuni NCTC11168 genome using current database information, novel tools and annotation techniques not used during the original annotation. Results Re-annotation was carried out using sequence database searches such as FASTA, along with programs such as TMHMM for additional support. The re-annotation also utilises sequence data from additional Campylobacter strains and species not available during the original annotation. Re-annotation was accompanied by a full literature search that was incorporated into the updated EMBL file [EMBL: AL111168]. The C. jejuni NCTC11168 re-annotation reduced the total number of coding sequences from 1654 to 1643, of which 90.0% have additional information regarding the identification of new motifs and/or relevant literature. Re-annotation has led to 18.2% of coding sequence product functions being revised. Conclusions Major updates were made to genes involved in the biosynthesis of important surface structures such as lipooligosaccharide, capsule and both O- and N-linked glycosylation. This re-annotation will be a key resource for Campylobacter research and will also provide a prototype for the re-annotation and re-interpretation of other bacterial genomes.

  15. Automated design of genomic Southern blot probes

    Directory of Open Access Journals (Sweden)

    Komiyama Noboru H

    2010-01-01

    experimentally validate a number of these automated designs by Southern blotting. The majority of probes we tested performed well confirming our in silico prediction methodology and the general usefulness of the software for automated genomic Southern probe design. Conclusions Software and supplementary information are freely available at: http://www.genes2cognition.org/software/southern_blot

  16. Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology

    Science.gov (United States)

    Hu, Yang; Zhou, Wenyang; Ren, Jun; Dong, Lixiang

    2016-01-01

    Increasing evidence indicates that functional annotation of the human genome at the molecular level and the phenotype level is very important for systematic analysis of genes. In this study, we present a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs can be annotated by the text mining tool Open Biomedical Annotator (OBA), and each Entrez gene can be mapped to a Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the GeneRIF records about human genes, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource for the human genome. High consistency of term frequency of individual genes (Pearson correlation = 0.6401, p = 2.2e-16) and gene frequency of individual terms (Pearson correlation = 0.1298, p = 3.686e-14) in GeneRIFs and GOA shows our annotation resource is very reliable. PMID:27635398

  17. Using Microbial Genome Annotation as a Foundation for Collaborative Student Research

    Science.gov (United States)

    Reed, Kelynne E.; Richardson, John M.

    2013-01-01

    We used the Integrated Microbial Genomes Annotation Collaboration Toolkit as a framework to incorporate microbial genomics research into a microbiology and biochemistry course in a way that promoted student learning of bioinformatics and research skills and emphasized teamwork and collaboration as evidenced through multiple assessment mechanisms.…

  18. Discovery and Characterization of Chromatin States for Systematic Annotation of the Human Genome

    Science.gov (United States)

    Ernst, Jason; Kellis, Manolis

    A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal chromatin states in human T cells, based on recurrent and spatially coherent combinations of chromatin marks. We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, large-scale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.

  19. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd.

    Science.gov (United States)

    Irshad, H; Montaser-Kouhsari, L; Waltz, G; Bucur, O; Nowak, J A; Dong, F; Knoblauch, N W; Beck, A H

    2015-01-01

    The development of tools in computational pathology to assist physicians and biomedical scientists in the diagnosis of disease requires access to high-quality annotated images for algorithm learning and evaluation. Generating high-quality expert-derived annotations is time-consuming and expensive. We explore the use of crowdsourcing for rapidly obtaining annotations for two core tasks in computational pathology: nucleus detection and nucleus segmentation. We designed and implemented crowdsourcing experiments using the CrowdFlower platform, which provides access to a large set of labor channel partners that access and manage millions of contributors worldwide. We obtained annotations from four types of annotators and compared concordance across these groups. We obtained: crowdsourced annotations for nucleus detection and segmentation on a total of 810 images; annotations using automated methods on 810 images; annotations from research fellows for detection and segmentation on 477 and 455 images, respectively; and expert pathologist-derived annotations for detection and segmentation on 80 and 63 images, respectively. For the crowdsourced annotations, we evaluated performance across a range of contributor skill levels (1, 2, or 3). The crowdsourced annotations (4,860 images in total) were completed in only a fraction of the time and cost required for obtaining annotations using traditional methods. For the nucleus detection task, the research fellow-derived annotations showed the strongest concordance with the expert pathologist-derived annotations (F-M = 93.68%), followed by the crowdsourced contributor levels 1, 2, and 3 and the automated method, which showed relatively similar performance (F-M = 87.84%, 88.49%, 87.26%, and 86.99%, respectively). For the nucleus segmentation task, the crowdsourced contributor level 3-derived annotations, research fellow-derived annotations, and automated method showed the strongest concordance with the expert pathologist
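
    The concordance figures above are reported as F-M values. As a minimal sketch, assuming F-M denotes the usual F-measure (the harmonic mean of precision and recall) and that detections have already been matched against the expert annotations, the score could be computed as follows. The function name and the count-based inputs are illustrative and are not taken from the study.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return the F-measure (harmonic mean of precision and recall).

    Inputs are hypothetical counts of detected nuclei that match the
    reference annotation (true_pos), detections with no match (false_pos),
    and reference nuclei that were missed (false_neg).
    """
    if true_pos == 0:
        # No correct detections: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts for illustration only, not data from the study.
    print(f"F-measure: {100 * f_measure(90, 10, 10):.2f}%")
```

    How detections are matched to reference nuclei (for example, by a distance threshold) is a separate design choice that this sketch does not cover.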

  20. Exploring an Annotated Sequence Assembly of the Perennial Ryegrass Genome for Genomic Regions Enriched for Trait Associated Variants

    DEFF Research Database (Denmark)

    Byrne, Stephen; Cericola, Fabio; Janss, Luc;

    2015-01-01

    Perennial ryegrass (Lolium perenne L.) is an outbreeding diploid species and one of the most important forage crops used in temperate agriculture. We have developed a draft sequence assembly of the perennial ryegrass genome and annotated it with the aid of RNA-seq data from various genotypes, plant...... components, and treatments. We predicted 39,795 high quality proteins originating from 28,182 genetic loci. We wanted to use the annotated assembly to study if SNPs falling within various annotation classes explain differing proportions of the variance for traits such as heading date and rust resistance...

  1. VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data

    Directory of Open Access Journals (Sweden)

    Peterson Elena S

    2012-04-01

    Abstract Background The procedural aspects of genome sequencing and assembly have become relatively inexpensive, yet the full, accurate structural annotation of these genomes remains a challenge. Next-generation sequencing transcriptomics (RNA-Seq), global microarrays, and tandem mass spectrometry (MS/MS)-based proteomics have demonstrated immense value to genome curators as individual sources of information; however, integrating these data types to validate and improve structural annotation remains a major challenge. Current visual and statistical analytic tools are focused on a single data type, or existing software tools are retrofitted to analyze new data forms. We present Visual Exploration and Statistics to Promote Annotation (VESPA), a new interactive visual analysis software tool focused on assisting scientists with the annotation of prokaryotic genomes through the integration of proteomics and transcriptomics data with current genome location coordinates. Results VESPA is a desktop Java™ application that integrates high-throughput proteomics data (peptide-centric) and transcriptomics (probe or RNA-Seq) data into a genomic context, all of which can be visualized at three levels of genomic resolution. Data is interrogated via searches linked to the genome visualizations to find regions with high likelihood of mis-annotation. Search results are linked to exports for further validation outside of VESPA, or potential coding regions can be analyzed concurrently with the software through interaction with BLAST. VESPA is demonstrated on two use cases (Yersinia pestis Pestoides F and Synechococcus sp. PCC 7002) to show the rapid manner in which mis-annotations can be found and explored in VESPA using either proteomics data alone, or in combination with transcriptomic data. Conclusions VESPA is an interactive visual analytics tool that integrates high-throughput data into a genomic context to facilitate the discovery of structural mis-annotations

  2. Genome-wide functional annotation of Phomopsis longicolla isolate MSPL 10-6

    Directory of Open Access Journals (Sweden)

    Omar Darwish

    2016-06-01

    Phomopsis seed decay of soybean is caused primarily by the seed-borne fungal pathogen Phomopsis longicolla (syn. Diaporthe longicolla). This disease severely decreases soybean seed quality, reduces seedling vigor and stand establishment, and suppresses yield. It is one of the most economically important soybean diseases. In this study we annotated the entire genome of P. longicolla isolate MSPL 10-6, which was isolated from field-grown soybean seed in Mississippi, USA. This study represents the first reported genome-wide functional annotation of a seed-borne fungal pathogen in the Diaporthe–Phomopsis complex. The P. longicolla genome annotation will enable research into the genetic basis of fungal infection of soybean seed and provide information for the study of soybean–fungal interactions. The genome annotation will also be a valuable resource for the research and agricultural communities. It will aid in the development of new control strategies for this pathogen. The annotations are available at http://bioinformatics.towson.edu/phomopsis_longicolla/download.html. The NCBI accession number is AYRD00000000.

  3. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud.

    Science.gov (United States)

    Duvick, Jon; Standage, Daniel S; Merchant, Nirav; Brendel, Volker P

    2016-04-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today's pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant's Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching.

  4. Whole genome sequence and genome annotation of Colletotrichum acutatum, causal agent of anthracnose in pepper plants in South Korea

    Directory of Open Access Journals (Sweden)

    Joon-Hee Han

    2016-06-01

    Colletotrichum acutatum is a destructive fungal pathogen which causes anthracnose in a wide range of crops. Here we report the whole genome sequence and annotation of C. acutatum strain KC05, isolated from an infected pepper in Kangwon, South Korea. Genomic DNA from the KC05 strain was used for the whole genome sequencing using a PacBio sequencer and the MiSeq system. The KC05 genome was determined to be 52,190,760 bp in size with a G + C content of 51.73% in 27 scaffolds and to contain 13,559 genes with an average length of 1516 bp. Gene prediction and annotation were performed by incorporating RNA-Seq data. The genome sequence of the KC05 was deposited at DDBJ/ENA/GenBank under the accession number LUXP00000000.
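
    Several records in this listing summarize an assembly by its total size and G + C content. As a minimal sketch of how such summary figures are typically derived, the following computes both from a set of scaffold sequences. The function names and the toy sequences are hypothetical, and ambiguous bases such as N are assumed to be excluded from the G + C denominator.

```python
def gc_content(seq: str) -> float:
    """Return the G + C percentage of a nucleotide sequence.

    Only unambiguous bases (A, C, G, T) are counted; this exclusion of
    ambiguity codes such as N is an assumption of this sketch.
    """
    seq = seq.upper()
    counted = sum(1 for base in seq if base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / counted


def assembly_summary(scaffolds: dict) -> tuple:
    """Return (total length in bp, overall G + C percentage)."""
    total_bp = sum(len(s) for s in scaffolds.values())
    overall_gc = gc_content("".join(scaffolds.values()))
    return total_bp, overall_gc


if __name__ == "__main__":
    # Toy scaffolds for illustration only, not real genome data.
    toy = {"scaffold_1": "ATGCGC", "scaffold_2": "GGATNN"}
    size, gc = assembly_summary(toy)
    print(f"{size} bp, G + C = {gc:.2f}%")
```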

  5. Genome sequencing and annotation of Cellulomonas sp. HZM

    Directory of Open Access Journals (Sweden)

    Patric Chua

    2015-09-01

    We report the draft genome sequence of Cellulomonas sp. HZM, isolated from a tropical peat swamp forest. The draft genome size is 3,559,280 bp with a G + C content of 73% and contains 3 rRNA sequences (single copies of 5S, 16S and 23S rRNA).

  6. Protein annotation in the era of personal genomics

    DEFF Research Database (Denmark)

    Holberg Blicher, Thomas; Gupta, Ramneek; Wesolowska, Agata;

    2010-01-01

    the differences between many individuals of the same species, humans in particular, the focus needs to be on the functional impact of individual residue variation. To fulfil the promises of personal genomics, we need to start asking not only what is in a genome but also how millions of small differences between...

  7. Comparative Annotation of Viral Genomes with Non-Conserved Gene Structure

    DEFF Research Database (Denmark)

    de Groot, Saskia; Mailund, Thomas; Hein, Jotun

    2007-01-01

    allows for coding in unidirectional nested and overlapping reading frames, to annotate two homologous aligned viral genomes. Our method does not insist on conserved gene structure between the two sequences, thus making it applicable for the pairwise comparison of more distantly related sequences. Results......: We apply our method to 15 pairwise alignments of six different HIV2 genomes. Given sufficient evolutionary distance between the two sequences, we achieve sensitivity of about 84% and specificity of about 97%. We additionally annotate three pairwise alignments of the more distantly related HIV1...... and HIV2, as well as of two different Hepatitis Viruses, attaining results of ~87% sensitivity and ~98.5% specificity. We subsequently incorporate prior knowledge by "knowing" the gene structure of one sequence and annotating the other conditional on it. Boosting accuracy close to perfect we demonstrate...

  8. The draft genome sequence and annotation of the desert woodrat Neotoma lepida

    Directory of Open Access Journals (Sweden)

    Michael Campbell

    2016-09-01

    We present the de novo draft genome sequence for a vertebrate mammalian herbivore, the desert woodrat (Neotoma lepida). This species is of ecological and evolutionary interest with respect to ingestion, microbial detoxification and hepatic metabolism of toxic plant secondary compounds from the highly toxic creosote bush (Larrea tridentata) and the juniper shrub (Juniperus monosperma). The draft genome sequence and annotation have been deposited at GenBank under the accession LZPO01000000.

  9. The draft genome sequence and annotation of the desert woodrat Neotoma lepida.

    Science.gov (United States)

    Campbell, Michael; Oakeson, Kelly F; Yandell, Mark; Halpert, James R; Dearing, Denise

    2016-09-01

    We present the de novo draft genome sequence for a vertebrate mammalian herbivore, the desert woodrat (Neotoma lepida). This species is of ecological and evolutionary interest with respect to ingestion, microbial detoxification and hepatic metabolism of toxic plant secondary compounds from the highly toxic creosote bush (Larrea tridentata) and the juniper shrub (Juniperus monosperma). The draft genome sequence and annotation have been deposited at GenBank under the accession LZPO01000000.

  10. The 2008 update of the Aspergillus nidulans genome annotation: A community effort

    NARCIS (Netherlands)

    Wortman, J.R.; Gilsenan, J.M.; Joardar, V.; Deegan, J.; Clutterbuck, J.; Andersen, M.R.; Archer, D.; Bencina, M.; Braus, G.; Coutinho, P.; von Döhren, H.; Doonan, J.; Driessen, A.J.M.; Durek, P.; Espeso, E.; Fekete, E.; Flipphi, M.; Estrada, C.G.; Geysens, S.; Goldman, G.; de Groot, P.W.J.; Hansen, K.; Harris, S.D.; Heinekamp, T.; Helmstaedt, K.; Henrissat, B.; Hofmann, G.; Homan, T.; Horio, T.; Horiuchi, H.; James, S.; Jones, M.; Karaffa, L.; Karányi, Z.; Kato, M.; Keller, N.; Kelly, D.E.; Kiel, J.A.K.W.; Kim, J.M.; van der Klei, I.J.; Klis, F.M.; Kovalchuk, A.; Kraševec, N.; Kubicek, C.P.; Liu, B.; MacCabe, A.; Meyer, V.; Mirabito, P.; Miskei, M.; Mos, M.; Mullins, J.; Nelson, D.R.; Nielsen, J.; Oakley, B.R.; Osmani, S.A.; Pakula, T.; Paszewski, A.; Paulsen, I.; Pilsyk, S.; Pócsi, I.; Punt, P.J.; Ram, A.F.J.; Ren, Q.; Robellet, X.; Robson, G.; Seiboth, B.; van Solingen, P.; Specht, T.; Sun, J.; Taheri-Talesh, N.; Takeshita, N.; Ussery, D.; vanKuyk, P.A.; Visser, H.; van de Vondervoort, P.J.I.; de Vries, R.P.; Walton, J.; Xiang, X.; Xiong, Y.; Zeng, A.P.; Brandt, B.W.; Cornell, M.J.; van den Hondel, C.A.M.J.J.; Visser, J.; Oliver, S.G.; Turner, G.

    2009-01-01

    The identification and annotation of protein-coding genes is one of the primary goals of whole-genome sequencing projects, and the accuracy of predicting the primary protein products of gene expression is vital to the interpretation of the available data and the design of downstream functional appli

  11. The 2008 update of the Aspergillus nidulans genome annotation: A community effort

    DEFF Research Database (Denmark)

    Wortman, Jennifer Russo; Gilsenan, Jane Mabey; Joardar, Vinita

    2009-01-01

    The identification and annotation of protein-coding genes is one of the primary goals of whole-genome sequencing projects, and the accuracy of predicting the primary protein products of gene expression is vital to the interpretation of the available data and the design of downstream functional ap...

  12. Toward an Upgraded Honey Bee (Apis mellifera L.) Genome Annotation Using Proteogenomics.

    Science.gov (United States)

    McAfee, Alison; Harpur, Brock A; Michaud, Sarah; Beavis, Ronald C; Kent, Clement F; Zayed, Amro; Foster, Leonard J

    2016-02-05

    The honey bee is a key pollinator in agricultural operations as well as a model organism for studying the genetics and evolution of social behavior. The Apis mellifera genome has been sequenced and annotated twice over, enabling proteomics and functional genomics methods for probing relevant aspects of its biology. One troubling trend that emerged from proteomic analyses is that honey bee peptide samples consistently result in lower peptide identification rates compared with other organisms. This suggests that the genome annotation can be improved, or that atypical biological processes are interfering with the mass spectrometry workflow. First, we tested whether high levels of polymorphisms could explain some of the missed identifications by searching spectra against the reference proteome (OGSv3.2) versus a customized proteome of a single honey bee, but our results indicate that this contribution was minor. Likewise, error-tolerant peptide searches led us to eliminate unexpected post-translational modifications as a major factor in missed identifications. We then used a proteogenomic approach with ~1500 raw files to search for missing genes and new exons, to revive discarded annotations and to identify over 2000 new coding regions. These results will contribute to a more comprehensive genome annotation and facilitate continued research on this important insect.

  13. Genome sequencing and annotation of Aeromonas sp. HZM

    Directory of Open Access Journals (Sweden)

    Patric Chua

    2015-09-01

    We report the draft genome sequence of Aeromonas sp. strain HZM, isolated from tropical peat swamp forest soil. The draft genome size is 4,451,364 bp with a G + C content of 61.7% and contains 10 rRNA sequences (eight copies of 5S rRNA genes, single copy of 16S and 23S rRNA each). The genome sequence can be accessed at DDBJ/EMBL/GenBank under the accession no. JEMQ00000000.

  14. CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences

    Directory of Open Access Journals (Sweden)

    Spraggins Thomas A

    2007-04-01

    Abstract Background Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace) recovered using methylation filtration technology and providing annotation and analysis of the sequence data. Description CGKB, Cowpea Genespace/Genomics Knowledge Base, is an annotation knowledge base developed under the CGI. The database is based on information derived from 298,848 cowpea genespace sequences (GSS) isolated by methylation filtering of genomic DNA. The CGKB consists of three knowledge bases: GSS annotation and comparative genomics knowledge base, GSS enzyme and metabolic pathway knowledge base, and GSS simple sequence repeats (SSRs) knowledge base for molecular marker discovery. A homology-based approach was applied for annotations of the GSS, mainly using BLASTX against four public FASTA formatted protein databases (NCBI GenBank Proteins, UniProtKB-Swiss-Prot, UniProtKB-PIR (Protein Information Resource), and UniProtKB-TrEMBL). Comparative genome analysis was done by BLASTX searches of the cowpea GSS against four plant proteomes from Arabidopsis thaliana, Oryza sativa, Medicago truncatula, and Populus trichocarpa. The possible exons and introns on each cowpea GSS were predicted using the HMM-based Genscan gene prediction program and the

  15. Mercator: a fast and simple web server for genome scale functional annotation of plant sequence data.

    Science.gov (United States)

    Lohse, Marc; Nagel, Axel; Herter, Thomas; May, Patrick; Schroda, Michael; Zrenner, Rita; Tohge, Takayuki; Fernie, Alisdair R; Stitt, Mark; Usadel, Björn

    2014-05-01

    Next-generation technologies generate an overwhelming amount of gene sequence data. Efficient annotation tools are required to make these data amenable to functional genomics analyses. The Mercator pipeline automatically assigns functional terms to protein or nucleotide sequences. It uses the MapMan 'BIN' ontology, which is tailored for functional annotation of plant 'omics' data. The classification procedure performs parallel sequence searches against reference databases, compiles the results and computes the most likely MapMan BINs for each query. In the current version, the pipeline relies on manually curated reference classifications originating from the three reference organisms (Arabidopsis, Chlamydomonas, rice), various other plant species that have a reviewed SwissProt annotation, and more than 2000 protein domain and family profiles at InterPro, CDD and KOG. Functional annotations predicted by Mercator achieve accuracies above 90% when benchmarked against manual annotation. In addition to mapping files for direct use in the visualization software MapMan, Mercator provides graphical overview charts, detailed annotation information in a convenient web browser interface and a MapMan-to-GO translation table to export results as GO terms. Mercator is available free of charge via http://mapman.gabipd.org/web/guest/app/Mercator.

  16. Genome sequencing and annotation of Amycolatopsis azurea DSM 43854T

    Directory of Open Access Journals (Sweden)

    Indu Khatri

    2014-12-01

    We report the 9.2 Mb genome of the azureomycin A and B antibiotic producing strain Amycolatopsis azurea isolated from a Japanese soil sample. The draft genome of strain DSM 43854T consists of 9,223,451 bp with a G + C content of 69.0% and the genome contains 3 rRNA genes (5S–23S–16S) and 58 aminoacyl-tRNA synthetase genes. Homology searches revealed PKS gene clusters putatively responsible for the biosynthesis of naptomycin, macbecin, rifamycin, mitomycin, maduropeptin enediyne, neocarzinostatin enediyne, C-1027 enediyne, calicheamicin enediyne, landomycin, simocyclinone, medermycin, granaticin, polyketomycin, teicoplanin, balhimycin, vancomycin, staurosporine, rubradirin and complestatin.

  17. Citrus sinensis annotation project (CAP): a comprehensive database for sweet orange genome.

    Science.gov (United States)

    Wang, Jia; Chen, Dijun; Lei, Yang; Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-Ling

    2014-01-01

    Citrus is one of the most important and widely grown fruit crops, with global production ranking first among all the fruit crops in the world. Sweet orange accounts for more than half of Citrus production, both as fresh fruit and as processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to researchers of sweet orange and other woody plants, and is freely available at http://citrus.hzau.edu.cn/.

  18. BambooGDB: a bamboo genome database with functional annotation and an analysis platform.

    Science.gov (United States)

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success with the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights into bamboo genetics and evolution. To further extend our understanding of the bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and an analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo, make up the main contents of this database. Based on these sequence data, a comprehensive functional annotation of the bamboo genome was made. In addition, an analytical platform composed of comparative genomic analysis, protein-protein interaction networks, pathway analysis and visualization of genomic data was also constructed. As discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for the bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org.

  19. Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger

    Directory of Open Access Journals (Sweden)

    Grigoriev Igor V

    2009-02-01

    Abstract Background Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institute (JGI) and another predicted protein set from another A. niger sequence. Tandem mass spectra (MS/MS) were acquired from 1D gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). Results 405 identified peptide sequences were mapped to 214 different A. niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. Conclusion This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST) data has been. A comparison of the published genome from another strain of A. niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method.

  20. Genome sequencing and annotation of multidrug resistant Mycobacterium tuberculosis (MDR-TB) PR10 strain

    Directory of Open Access Journals (Sweden)

    Mohd Zakihalani A. Halim

    2016-03-01

    Here, we report the draft genome sequence and annotation of a multidrug resistant Mycobacterium tuberculosis strain PR10 (MDR-TB PR10) isolated from a patient diagnosed with tuberculosis. The draft genome of MDR-TB PR10 is 4.34 Mbp in size with a G + C content of 65.6% and consists of 4637 predicted genes. The determinants were categorized by RAST into 400 subsystems with 4286 coding sequences and 50 RNAs. The whole genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession number CP010968.

  1. Evolutionary interrogation of human biology in well-annotated genomic framework of rhesus macaque.

    Science.gov (United States)

    Zhang, Shi-Jian; Liu, Chu-Jun; Yu, Peng; Zhong, Xiaoming; Chen, Jia-Yu; Yang, Xinzhuang; Peng, Jiguang; Yan, Shouyu; Wang, Chenqu; Zhu, Xiaotong; Xiong, Jingwei; Zhang, Yong E; Tan, Bertrand Chin-Ming; Li, Chuan-Yun

    2014-05-01

    With genome sequence and composition highly analogous to human, rhesus macaque represents a unique reference for evolutionary studies of human biology. Here, we developed a comprehensive genomic framework of rhesus macaque, the RhesusBase2, for evolutionary interrogation of human genes and the associated regulations. A total of 1,667 next-generation sequencing (NGS) data sets were processed, integrated, and evaluated, generating 51.2 million new functional annotation records. With extensive NGS annotations, RhesusBase2 refined the fine-scale structures in 30% of the macaque Ensembl transcripts, reporting an accurate, up-to-date set of macaque gene models. On the basis of these annotations and accurate macaque gene models, we further developed an NGS-oriented Molecular Evolution Gateway to access and visualize macaque annotations in reference to human orthologous genes and associated regulations (www.rhesusbase.org/molEvo). We highlighted the application of this well-annotated genomic framework in generating hypothetical link of human-biased regulations to human-specific traits, by using mechanistic characterization of the DIEXF gene as an example that provides novel clues to the understanding of digestive system reduction in human evolution. On a global scale, we also identified a catalog of 9,295 human-biased regulatory events, which may represent novel elements that have a substantial impact on shaping human transcriptome and possibly underpin recent human phenotypic evolution. Taken together, we provide an NGS data-driven, information-rich framework that will broadly benefit genomics research in general and serves as an important resource for in-depth evolutionary studies of human biology.

  2. TEnest: automated chronological annotation and visualization of nested plant transposable elements.

    Science.gov (United States)

    Kronmiller, Brent A; Wise, Roger P

    2008-01-01

    Organisms with a high density of transposable elements (TEs) exhibit nesting, with subsequent repeats found inside previously inserted elements. Nesting splits the sequence structure of TEs and makes annotation of repetitive areas challenging. We present TEnest, a repeat identification and display tool made specifically for highly repetitive genomes. TEnest identifies repetitive sequences and reconstructs separated sections to provide full-length repeats and, for long-terminal repeat (LTR) retrotransposons, calculates age since insertion based on LTR divergence. TEnest provides a chronological insertion display to give an accurate visual representation of TE integration history showing timeline, location, and families of each TE identified, thus creating a framework from which evolutionary comparisons can be made among various regions of the genome. A database of repeats has been developed for maize (Zea mays), rice (Oryza sativa), wheat (Triticum aestivum), and barley (Hordeum vulgare) to illustrate the potential of TEnest software. All currently finished maize bacterial artificial chromosomes totaling 29.3 Mb were analyzed with TEnest to provide a characterization of the repeat insertions. Sixty-seven percent of the maize genome was found to be made up of TEs; of these, 95% are LTR retrotransposons. The rate of solo LTR formation is shown to be dissimilar across retrotransposon families. Phylogenetic analysis of TE families reveals specific events of extreme TE proliferation, which may explain the high quantities of certain TE families found throughout the maize genome. The TEnest software package is available for use on PlantGDB under the tools section (http://www.plantgdb.org/prj/TE_nest/TE_nest.html); the source code is available from (http://wiselab.org).
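    Dating an LTR retrotransposon from LTR divergence rests on the fact that both LTRs are identical at insertion and then diverge independently, so age is approximately K / (2r) for divergence K and substitution rate r. The sketch below illustrates that arithmetic only; the Jukes-Cantor correction and the default rate of 1.3e-8 substitutions per site per year are assumptions made here, not details taken from the abstract.

```python
import math


def ltr_divergence(ltr5, ltr3):
    """Jukes-Cantor distance between two aligned LTRs (non-ACGT columns ignored)."""
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to equal length")
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable positions")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("divergence too high for Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_age_years(ltr5, ltr3, rate=1.3e-8):
    """Age = K / (2 * rate); the rate is an assumed value, adjust per lineage."""
    return ltr_divergence(ltr5, ltr3) / (2.0 * rate)


# Hypothetical aligned LTR pair: 3 mismatches in 100 sites.
age = insertion_age_years("A" * 100, "C" * 3 + "A" * 97)
```

    Nested elements can then be ordered chronologically by comparing these ages, with the outer (older) element expected to show the larger divergence.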

  3. Identification and annotation of promoter regions in microbial genome sequences on the basis of DNA stability

    Indian Academy of Sciences (India)

    Vetriselvi Rangannan; Manju Bansal

    2007-08-01

    Analysis of various predicted structural properties of promoter regions in prokaryotic as well as eukaryotic genomes had earlier indicated that they have several common features, such as lower stability, higher curvature and less bendability, when compared with their neighboring regions. Based on the difference in stability between neighboring upstream and downstream regions in the vicinity of experimentally determined transcription start sites, a promoter prediction algorithm has been developed to identify prokaryotic promoter sequences in whole genomes. The average free energy (E) over known promoter sequences and the difference (D) between E and the average free energy over the entire genome (G) are used to search for promoters in the genomic sequences. Using these cutoff values to predict promoter regions across the entire Escherichia coli genome, we achieved a reliability of 70% when the predicted promoters were cross-verified against the 960 transcription start sites (TSSs) listed in the Ecocyc database. Annotation of the whole E. coli genome for promoter regions could be carried out with 49% accuracy. The method is quite general and can be used to annotate the promoter regions of other prokaryotic genomes.
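    The cutoff-based search described above can be sketched as a sliding-window scan: compute the average free energy E of each window, compare it with the genome-wide average G, and flag windows that are less stable than both cutoffs allow. This is an illustration only; the toy energy function, window size and cutoff values below are placeholders chosen for the example, not the parameters used by the authors.

```python
def step_energy(dinuc):
    """Toy stacking energy: GC-rich steps are more stable (more negative)."""
    return -(1.0 + 0.5 * sum(base in "GC" for base in dinuc))


def mean_energy(seq):
    """Average toy free energy over all overlapping dinucleotide steps."""
    steps = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return sum(step_energy(s) for s in steps) / len(steps)


def candidate_promoters(genome, window=10, e_cutoff=-1.3, d_cutoff=0.3):
    """Return window starts where E >= e_cutoff and E - G >= d_cutoff."""
    g = mean_energy(genome)  # genome-wide average, G in the text
    hits = []
    for start in range(len(genome) - window + 1):
        e = mean_energy(genome[start:start + window])
        if e >= e_cutoff and (e - g) >= d_cutoff:
            hits.append(start)
    return hits


# Hypothetical sequence: an AT-rich (less stable) island inside GC-rich DNA.
genome = "GC" * 20 + "AT" * 5 + "GC" * 20
hits = candidate_promoters(genome)
```

    In this toy sequence only windows overlapping the AT-rich island are reported, which mirrors the observation that promoter regions are less stable than their neighborhood.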

  4. Toward the automated generation of genome-scale metabolic networks in the SEED

    Directory of Open Access Journals (Sweden)

    Gould John

    2007-04-01

    Background: Current methods for the automated generation of genome-scale metabolic networks focus on genome annotation and preliminary biochemical reaction network assembly, but do not adequately address the process of identifying and filling gaps in the reaction network, and verifying that the network is suitable for systems level analysis. Thus, current methods are only sufficient for generating draft-quality networks, and refinement of the reaction network is still largely a manual, labor-intensive process. Results: We have developed a method for generating genome-scale metabolic networks that produces substantially complete reaction networks, suitable for systems level analysis. Our method partitions the reaction space of central and intermediary metabolism into discrete, interconnected components that can be assembled and verified in isolation from each other, and then integrated and verified at the level of their interconnectivity. We have developed a database of components that are common across organisms, and have created tools for automatically assembling appropriate components for a particular organism based on the metabolic pathways encoded in the organism's genome. This focuses manual efforts on that portion of an organism's metabolism that is not yet represented in the database. We have demonstrated the efficacy of our method by reverse-engineering and automatically regenerating the reaction network from a published genome-scale metabolic model for Staphylococcus aureus. Additionally, we have verified that our method capitalizes on the database of common reaction network components created for S. aureus, by using these components to generate substantially complete reconstructions of the reaction networks from three other published metabolic models (Escherichia coli, Helicobacter pylori, and Lactococcus lactis). We have implemented our tools and database within the SEED, an open-source software environment for comparative

  5. Functional annotation by identification of local surface similarities: a novel tool for structural genomics

    Directory of Open Access Journals (Sweden)

    Zanzoni Andreas

    2005-08-01

    Background: Protein function is often dependent on subsets of solvent-exposed residues that may exist in a similar three-dimensional configuration in non-homologous proteins, thus having different order and/or spacing in the sequence. Hence, functional annotation by means of sequence or fold similarity is not adequate for such cases. Results: We describe a method for the function-related annotation of protein structures by means of the detection of local structural similarity with a library of annotated functional sites. An automatic procedure was used to annotate the function of local surface regions. Next, we employed a sequence-independent algorithm to compare exhaustively these functional patches with a larger collection of protein surface cavities. After tuning and validating the algorithm on a dataset of well annotated structures, we applied it to a list of protein structures that are classified as being of unknown function in the Protein Data Bank. By this strategy, we were able to provide functional clues to proteins that do not show any significant sequence or global structural similarity with proteins in the current databases. Conclusion: This method is able to spot structural similarities associated with function-related similarities, independently of sequence or fold resemblance, and is therefore a valuable tool for the functional analysis of uncharacterized proteins. Results are available at http://cbm.bio.uniroma2.it/surface/structuralGenomics.html

  6. Functional annotation from the genome sequence of the giant panda.

    Science.gov (United States)

    Huo, Tong; Zhang, Yinjie; Lin, Jianping

    2012-08-01

    The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis.

  7. Identification of novel biomass-degrading enzymes from genomic dark matter: Populating genomic sequence space with functional annotation.

    Science.gov (United States)

    Piao, Hailan; Froula, Jeff; Du, Changbin; Kim, Tae-Wan; Hawley, Erik R; Bauer, Stefan; Wang, Zhong; Ivanova, Nathalia; Clark, Douglas S; Klenk, Hans-Peter; Hess, Matthias

    2014-08-01

    Although recent nucleotide sequencing technologies have significantly enhanced our understanding of microbial genomes, the function of ∼35% of genes identified in a genome currently remains unknown. To improve the understanding of microbial genomes and consequently of microbial processes it will be crucial to assign a function to this "genomic dark matter." Due to the urgent need for additional carbohydrate-active enzymes for improved production of transportation fuels from lignocellulosic biomass, we screened the genomes of more than 5,500 microorganisms for hypothetical proteins that are located in the proximity of already known cellulases. We identified, synthesized and expressed a total of 17 putative cellulase genes with insufficient sequence similarity to currently known cellulases to be identified as such using traditional sequence annotation techniques that rely on significant sequence similarity. The recombinant proteins of the newly identified putative cellulases were subjected to enzymatic activity assays to verify their hydrolytic activity towards cellulose and lignocellulosic biomass. Eleven (65%) of the tested enzymes had significant activity towards at least one of the substrates. This high success rate highlights that a gene context-based approach can be used to assign function to genes that are otherwise categorized as "genomic dark matter" and to identify biomass-degrading enzymes that have little sequence similarity to already known cellulases. The ability to assign function to genes that have no related sequence representatives with functional annotation will be important to enhance our understanding of microbial processes and to identify microbial proteins for a wide range of applications.

  8. Genome-wide functional annotation and structural verification of metabolic ORFeome of Chlamydomonas reinhardtii

    Directory of Open Access Journals (Sweden)

    Fan Changyu

    2011-06-01

    Background: Recent advances in the field of metabolic engineering have been expedited by the availability of genome sequences and metabolic modelling approaches. The complete sequencing of the C. reinhardtii genome has made this unicellular alga a good candidate for metabolic engineering studies; however, the annotation of the relevant genes has not been validated and the much-needed metabolic ORFeome is currently unavailable. We describe our efforts on the functional annotation of the ORF models released by the Joint Genome Institute (JGI), prediction of their subcellular localizations, and experimental verification of their structural annotation at the genome scale. Results: We assigned enzymatic functions to the translated JGI ORF models of C. reinhardtii by reciprocal BLAST searches of the putative proteome against the UniProt and AraCyc enzyme databases. The best match for each translated ORF was identified and the EC numbers were transferred onto the ORF models. Enzymatic functional assignment was extended to the paralogs of the ORFs by clustering ORFs using BLASTCLUST. In total, we assigned 911 enzymatic functions, including 886 EC numbers, to 1,427 transcripts. We further annotated the enzymatic ORFs by prediction of their subcellular localization. The majority of the ORFs are predicted to be compartmentalized in the cytosol and chloroplast. We verified the structure of the metabolism-related ORF models by reverse transcription-PCR of the functionally annotated ORFs. Following amplification and cloning, we carried out 454FLX and Sanger sequencing of the ORFs. Based on alignment of the 454FLX reads to the ORF predicted sequences, we obtained more than 90% coverage for more than 80% of the ORFs. In total, 1,087 ORF models were verified by 454 and Sanger sequencing methods. We obtained expression evidence for 98% of the metabolic ORFs in the algal cells grown under constant light in the presence of acetate. Conclusions: We functionally

  9. Whole genome shotgun sequencing of Brassica oleracea and its application to gene discovery and annotation in Arabidopsis

    OpenAIRE

    Ayele, Mulu; Haas, Brian J.; Kumar, Nikhil; Wu, Hank; Xiao, Yongli; Van Aken, Susan; Utterback, Teresa R.; WORTMAN, Jennifer R.; White, Owen R.; Town, Christopher D

    2005-01-01

    Through comparative studies of the model organism Arabidopsis thaliana and its close relative Brassica oleracea, we have identified conserved regions that represent potentially functional sequences overlooked by previous Arabidopsis genome annotation methods. A total of 454,274 whole genome shotgun sequences covering 283 Mb (0.44×) of the estimated 650 Mb Brassica genome were searched against the Arabidopsis genome, and conserved Arabidopsis genome sequences (CAGSs) were identified. Of these ...

  10. Design and implementation of a database for Brucella melitensis genome annotation.

    Science.gov (United States)

    De Hertogh, Benoît; Lahlimi, Leïla; Lambert, Christophe; Letesson, Jean-Jacques; Depiereux, Eric

    2008-03-18

    The genome sequences of three Brucella biovars and of some species close to Brucella sp. have become available, leading to new relationship analysis. Moreover, the automatic genome annotation of the pathogenic bacterium Brucella melitensis has been manually corrected by a consortium of experts, leading to 899 modifications of start site predictions among the 3198 open reading frames (ORFs) examined. This new annotation, coupled with the results of automatic annotation tools of the complete genome sequences of the B. melitensis genome (including BLASTs to 9 genomes close to Brucella), provides numerous data sets related to predicted functions, biochemical properties and phylogenetic comparisons. To make these results available, alphaPAGe, a functional auto-updatable database of the corrected sequence genome of B. melitensis, has been built, using the entity-relationship (ER) approach and a multi-purpose database structure. A friendly graphical user interface has been designed, and users can retrieve different kinds of information through three levels of queries: (1) the basic search uses the classical keywords or sequence identifiers; (2) the original advanced search engine allows combining (by using logical operators) numerous criteria: (a) keywords (textual comparison) related to the pCDS's function, family domains and cellular localization; (b) physico-chemical characteristics (numerical comparison) such as isoelectric point or molecular weight and structural criteria such as the nucleic length or the number of transmembrane helices (TMH); (c) similarity scores with Escherichia coli and 10 species phylogenetically close to B. melitensis; (3) complex queries can be performed by using a SQL field, which allows all queries respecting the database's structure. The database is publicly available through a Web server at the following URL: http://www.fundp.ac.be/urbm/bioinfo/aPAGe.

  11. Advancing Trypanosoma brucei genome annotation through ribosome profiling and spliced leader mapping.

    Science.gov (United States)

    Parsons, Marilyn; Ramasamy, Gowthaman; Vasconcelos, Elton J R; Jensen, Bryan C; Myler, Peter J

    2015-08-01

    Since the initial publication of the trypanosomatid genomes, curation has been ongoing. Here we make use of existing Trypanosoma brucei ribosome profiling data to provide evidence of ribosome occupancy (and likely translation) of mRNAs from 225 currently unannotated coding sequences (CDSs). A small number of these putative genes correspond to extra copies of previously annotated genes, but 85% are novel. The median size of these novel CDSs is small (81 aa), indicating that past annotation work has excelled at detecting large CDSs. Of the unique CDSs confirmed here, over half have candidate orthologues in other trypanosomatid genomes, most of which were not yet annotated as protein-coding genes. Nonetheless, approximately one-third of the new CDSs were found only in T. brucei subspecies. Using ribosome footprints, RNA-Seq and spliced leader mapping data, we updated previous work to definitively revise the start sites for 414 CDSs as compared to the current gene models. The data pointed to several regions of the genome that had sequence errors that altered coding region boundaries. Finally, we consolidated this data with our previous work to propose elimination of 683 putative genes as protein-coding and arrive at a view of the translatome of slender bloodstream and procyclic culture form T. brucei.

  12. Genome sequencing and annotation of Amycolatopsis vancoresmycina strain DSM 44592T

    Directory of Open Access Journals (Sweden)

    Navjot Kaur

    2014-12-01

    We report the 9.0-Mb draft genome of Amycolatopsis vancoresmycina strain DSM 44592T, which was isolated from an Indian soil sample and produces the antibiotic vancoresmycin. The draft genome of strain DSM 44592T consists of 9,037,069 bp with a G+C content of 71.79%, 8340 predicted protein-coding genes and 57 RNAs. RAST annotation indicates that strains Streptomyces sp. AA4 (score 521), Saccharomonospora viridis DSM 43017 (score 400) and Actinosynnema mirum DSM 43827 (score 372) are the closest neighbors of strain DSM 44592T.

  13. Ab initio gene identification: prokaryote genome annotation with GeneScan and GLIMMER

    Indian Academy of Sciences (India)

    Gautam Aggarwal; Ramakrishna Ramaswamy

    2002-02-01

    We compare the annotation of three complete genomes using the ab initio methods of gene identification GeneScan and GLIMMER. The annotation given in GenBank, the standard against which these are compared, has been made using GeneMark. We find a number of novel genes which are predicted by both methods used here, as well as a number of genes that are predicted by GeneMark, but are not identified by either of the nonconsensus methods that we have used. The three organisms studied here are all prokaryotic species with fairly compact genomes. The Fourier measure forms the basis for an efficient non-consensus method for gene prediction, and the algorithm GeneScan exploits this measure. We have bench-marked this program as well as GLIMMER using 3 complete prokaryotic genomes. An effort has also been made to study the limitations of these techniques for complete genome analysis. GeneScan and GLIMMER are of comparable accuracy insofar as gene-identification is concerned, with sensitivities and specificities typically greater than 0.9. The number of false predictions (both positive and negative) is higher for GeneScan as compared to GLIMMER, but in a significant number of cases, similar results are provided by the two techniques. This suggests that there could be some as-yet unidentified additional genes in these three genomes, and also that some of the putative identifications made hitherto might require re-evaluation. All these cases are discussed in detail.
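    Fourier-based gene finders exploit the period-3 signal that codon structure imprints on coding DNA. The sketch below computes a generic version of such a measure: the spectral power at frequency 1/3, summed over the four base-indicator sequences and normalized by length. It is an illustration of the idea only; the exact measure, windowing and thresholds used by GeneScan are not given in the abstract and are not reproduced here.

```python
import cmath


def period3_power(seq):
    """Power at frequency 1/3 summed over base-indicator sequences, divided by length."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the 0/1 indicator sequence for this base at f = 1/3.
        coeff = sum(cmath.exp(-2j * cmath.pi * n / 3)
                    for n, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total / len(seq)


# Hypothetical inputs: a perfectly periodic sequence versus a homopolymer.
periodic = period3_power("ATG" * 30)
flat = period3_power("A" * 90)
```

    A sequence with strong codon-position bias gives a high value, whereas a sequence with no period-3 structure gives a value near zero, which is what allows coding and non-coding windows to be separated without a training set.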

  14. The physics of DNA and the annotation of the Plasmodium falciparum genome.

    Science.gov (United States)

    Yeramian, E

    2000-09-19

    A gene identification procedure is formulated, based on large-scale structural analyses of genomic sequences. The structural property is the physical - thermal - stability of the DNA double-helix, as described by the classical helix-coil model. The analyses are detailed for the Plasmodium falciparum genome, which represents one of the most difficult cases for the gene identification problem (notably because of the extreme AT-richness of the genome). In this genome, the coding domains (either uninterrupted genes or exons in split genes) are accurately identified as regions of high thermal stability. The conclusion is based on the study of the available cloned genes, of which 17 examples are described in detail. These examples demonstrate that the physical criterion is valid for the detection of coding regions whose lengths extend from a few base pairs up to several thousand base pairs. Accordingly, the structural analyses can provide a powerful and convenient tool for the identification of complex genes in the P. falciparum genome. The limits of such a scheme are discussed. The gene identification procedure is applied to the completely sequenced chromosomes (2 and 3), and the results are compared with the database annotations. The structural analyses suggest more or less extensive revision to the annotations, and also allow new putative genes to be identified in the chromosome sequences. Several examples of such new genes are described in detail.

  15. Bacillus pumilus SAFR-032 Genome Revisited: Sequence Update and Re-Annotation

    Science.gov (United States)

    Stepanov, Victor G.; Tirumalai, Madhan R.; Montazari, Saied; Checinska, Aleksandra; Venkateswaran, Kasthuri

    2016-01-01

    Bacillus pumilus strain SAFR-032 is a non-pathogenic spore-forming bacterium exhibiting an anomalously high persistence in bactericidal environments. In its dormant state, it is capable of withstanding doses of ultraviolet (UV) radiation or hydrogen peroxide, which are lethal for the vast majority of microorganisms. This unusual resistance profile has made SAFR-032 a reference strain for studies of bacterial spore resistance. The complete genome sequence of B. pumilus SAFR-032 was published in 2007 early in the genomics era. Since then, the SAFR-032 strain has frequently been used as a source of genetic/genomic information that was regarded as representative of the entire B. pumilus species group. Recently, our ongoing studies of conservation of gene distribution patterns in the complete genomes of various B. pumilus strains revealed indications of misassembly in the B. pumilus SAFR-032 genome. Synteny-driven local genome resequencing confirmed that the original SAFR-032 sequence contained assembly errors associated with long sequence repeats. The genome sequence was corrected according to the new findings. In addition, a significantly improved annotation is now available. Gene orders were compared and portions of the genome arrangement were found to be similar in a wide spectrum of Bacillus strains. PMID:27351589

  16. Bacillus pumilus SAFR-032 Genome Revisited: Sequence Update and Re-Annotation.

    Directory of Open Access Journals (Sweden)

    Victor G Stepanov

    Bacillus pumilus strain SAFR-032 is a non-pathogenic spore-forming bacterium exhibiting an anomalously high persistence in bactericidal environments. In its dormant state, it is capable of withstanding doses of ultraviolet (UV) radiation or hydrogen peroxide, which are lethal for the vast majority of microorganisms. This unusual resistance profile has made SAFR-032 a reference strain for studies of bacterial spore resistance. The complete genome sequence of B. pumilus SAFR-032 was published in 2007 early in the genomics era. Since then, the SAFR-032 strain has frequently been used as a source of genetic/genomic information that was regarded as representative of the entire B. pumilus species group. Recently, our ongoing studies of conservation of gene distribution patterns in the complete genomes of various B. pumilus strains revealed indications of misassembly in the B. pumilus SAFR-032 genome. Synteny-driven local genome resequencing confirmed that the original SAFR-032 sequence contained assembly errors associated with long sequence repeats. The genome sequence was corrected according to the new findings. In addition, a significantly improved annotation is now available. Gene orders were compared and portions of the genome arrangement were found to be similar in a wide spectrum of Bacillus strains.

  17. Bacillus pumilus SAFR-032 Genome Revisited: Sequence Update and Re-Annotation.

    Science.gov (United States)

    Stepanov, Victor G; Tirumalai, Madhan R; Montazari, Saied; Checinska, Aleksandra; Venkateswaran, Kasthuri; Fox, George E

    2016-01-01

    Bacillus pumilus strain SAFR-032 is a non-pathogenic spore-forming bacterium exhibiting an anomalously high persistence in bactericidal environments. In its dormant state, it is capable of withstanding doses of ultraviolet (UV) radiation or hydrogen peroxide, which are lethal for the vast majority of microorganisms. This unusual resistance profile has made SAFR-032 a reference strain for studies of bacterial spore resistance. The complete genome sequence of B. pumilus SAFR-032 was published in 2007 early in the genomics era. Since then, the SAFR-032 strain has frequently been used as a source of genetic/genomic information that was regarded as representative of the entire B. pumilus species group. Recently, our ongoing studies of conservation of gene distribution patterns in the complete genomes of various B. pumilus strains revealed indications of misassembly in the B. pumilus SAFR-032 genome. Synteny-driven local genome resequencing confirmed that the original SAFR-032 sequence contained assembly errors associated with long sequence repeats. The genome sequence was corrected according to the new findings. In addition, a significantly improved annotation is now available. Gene orders were compared and portions of the genome arrangement were found to be similar in a wide spectrum of Bacillus strains.

  18. Re-annotation of the genome sequence of Helicobacter pylori 26695.

    Science.gov (United States)

    Resende, Tiago; Correia, Daniela M; Rocha, Miguel; Rocha, Isabel

    2013-11-15

    Helicobacter pylori is a pathogenic bacterium that colonizes the human epithelia, causing duodenal and gastric ulcers, and gastric cancer. The genome of H. pylori 26695 has been previously sequenced and annotated. In addition, two genome-scale metabolic models have been developed. In order to maintain accurate and relevant information on coding sequences (CDS) and to retrieve new information, the assignment of new functions to Helicobacter pylori 26695's genes was performed in this work. The use of software tools, on-line databases and an annotation pipeline for inspecting each gene allowed the attribution of validated EC numbers and TC numbers to metabolic genes encoding enzymes and transport proteins, respectively. In total, 1212 protein-encoding genes were identified in this annotation, 712 of them metabolic and 500 non-metabolic, while 191 new functions were assigned to the CDS of this bacterium. This information provides relevant biological information for the scientific community dealing with this organism and can be used as the basis for a new metabolic model reconstruction.

  19. Annotation of the Asian Citrus Psyllid Genome Reveals a Reduced Innate Immune System.

    Science.gov (United States)

    Arp, Alex P; Hunter, Wayne B; Pelz-Stelinski, Kirsten S

    2016-01-01

    Citrus production worldwide is currently facing significant losses due to citrus greening disease, also known as Huanglongbing. The citrus greening bacterium, Candidatus Liberibacter asiaticus (CLas), is a persistent propagative pathogen transmitted by the Asian citrus psyllid, Diaphorina citri Kuwayama (Hemiptera: Liviidae). Hemipterans characterized to date lack a number of insect immune genes, including those associated with the Imd pathway targeting Gram-negative bacteria. The D. citri draft genome was used to characterize the immune defense genes present in D. citri. Predicted mRNAs identified by screening the published D. citri annotated draft genome were manually searched using a custom database of immune genes from previously annotated insect genomes. Toll and JAK/STAT pathways, general defense genes Dual oxidase, Nitric oxide synthase, prophenoloxidase, and cellular immune defense genes were present in D. citri. In contrast, D. citri lacked genes for the Imd pathway, most antimicrobial peptides, 1,3-β-glucan recognition proteins (GNBPs), and complete peptidoglycan recognition proteins. These data suggest that D. citri has a reduced immune capability similar to that observed in A. pisum, P. humanus, and R. prolixus. The absence of immune system genes from the D. citri genome may facilitate CLas infections, and is possibly compensated for by the psyllid's relationship with its microbial endosymbionts.

  20. MIPS: analysis and annotation of proteins from whole genomes in 2005.

    Science.gov (United States)

    Mewes, H W; Frishman, D; Mayer, K F X; Münsterkötter, M; Noubibou, O; Pagel, P; Rattei, T; Oesterheld, M; Ruepp, A; Stümpflen, V

    2006-01-01

    The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system is maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines: (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information enabling the representation of complex information such as functional modules or networks (Genome Research Environment System), (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix, and the CABiNet network analysis framework, and (iii) the compilation and manual annotation of information related to interactions such as protein-protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.gsf.de).

  1. Gene fusions and gene duplications: relevance to genomic annotation and functional analysis

    Directory of Open Access Journals (Sweden)

    Riley Monica

    2005-03-01

    Full Text Available Abstract Background Escherichia coli, a model organism, provides information for annotation of other genomes. Our analysis of its genome has shown that proteins encoded by fused genes need special attention. Such composite (multimodular) proteins consist of two or more components (modules) encoding distinct functions. Multimodular proteins have been found to complicate both annotation and generation of sequence similar groups. Previous work overstated the number of multimodular proteins in E. coli. This work corrects the identification of modules by including sequence information from proteins in 50 sequenced microbial genomes. Results Multimodular E. coli K-12 proteins were identified from sequence similarities between their component modules and non-fused proteins in 50 genomes and from the literature. We found 109 multimodular proteins in E. coli containing either two or three modules. Most modules had standalone sequence relatives in other genomes. The separated modules together with all the single (un-fused) proteins constitute the sum of all unimodular proteins of E. coli. Pairwise sequence relationships among all E. coli unimodular proteins generated 490 sequence similar, paralogous groups. Groups ranged in size from 92 to 2 members and had varying degrees of relatedness among their members. Some E. coli enzyme groups were compared to homologs in other bacterial genomes. Conclusion The deleterious effects of multimodular proteins on annotation and on the formation of groups of paralogs are emphasized. To improve annotation results, all multimodular proteins in an organism should be detected and when known each function should be connected with its location in the sequence of the protein. When transferring functions by sequence similarity, alignment locations must be noted, particularly when alignments cover only part of the sequences, in order to enable transfer of the correct function. Separating multimodular proteins into module units makes

  2. The Fast Changing Landscape of Sequencing Technologies and Their Impact on Microbial Genome Assemblies and Annotation

    Energy Technology Data Exchange (ETDEWEB)

    Mavromatis, K [U.S. Department of Energy, Joint Genome Institute; Land, Miriam L [ORNL; Brettin, Thomas S [ORNL; Quest, Daniel J [ORNL; Copeland, A [U.S. Department of Energy, Joint Genome Institute; Clum, Alicia [U.S. Department of Energy, Joint Genome Institute; Goodwin, Lynne A. [Los Alamos National Laboratory (LANL); Woyke, Tanja [U.S. Department of Energy, Joint Genome Institute; Lapidus, Alla L. [U.S. Department of Energy, Joint Genome Institute; Klenk, Hans-Peter [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Cottingham, Robert W [ORNL; Kyrpides, Nikos C [U.S. Department of Energy, Joint Genome Institute

    2012-01-01

    Background: The emergence of next generation sequencing (NGS) has provided the means for rapid and high throughput sequencing and data generation at low cost, while concomitantly creating a new set of challenges. The number of available assembled microbial genomes continues to grow rapidly and their quality reflects the quality of the sequencing technology used, but also of the analysis software employed for assembly and annotation. Methodology/Principal Findings: In this work, we have explored the quality of the microbial draft genomes across various sequencing technologies. We have compared the draft and finished assemblies of 133 microbial genomes sequenced at the Department of Energy-Joint Genome Institute and finished at the Los Alamos National Laboratory using a variety of combinations of sequencing technologies, reflecting the transition of the institute from Sanger-based sequencing platforms to NGS platforms. The quality of the public assemblies and of the associated gene annotations was evaluated using various metrics. Results obtained with the different sequencing technologies, as well as their effects on downstream processes, were analyzed. Our results demonstrate that the Illumina HiSeq 2000 sequencing system, the primary sequencing technology currently used for de novo genome sequencing and assembly at JGI, has various advantages in terms of total sequence throughput and cost, but it also introduces challenges for the downstream analyses. In all cases, assembly results, although on average of high quality, need to be viewed critically, and sources of error in them should be considered prior to analysis. Conclusion: These data follow the evolution of microbial sequencing and downstream processing at the JGI from draft genome sequences with large gaps corresponding to missing genes of significant biological role to assemblies with multiple small gaps (Illumina) and finally to assemblies that generate almost complete genomes (Illumina+PacBio).

  3. Genome, functional gene annotation, and nuclear transformation of the heterokont oleaginous alga Nannochloropsis oceanica CCMP1779.

    Directory of Open Access Journals (Sweden)

    Astrid Vieler

    Full Text Available Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica-specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis

  4. Discovery and annotation of small proteins using genomics, proteomics and computational approaches

    Energy Technology Data Exchange (ETDEWEB)

    Yang, Xiaohan; Tschaplinski, Timothy J.; Hurst, Gregory B.; Jawdy, Sara; Abraham, Paul E.; Lankford, Patricia K.; Adams, Rachel M.; Shah, Manesh B.; Hettich, Robert L.; Lindquist, Erika; Kalluri, Udaya C.; Gunter, Lee E.; Pennacchio, Christa; Tuskan, Gerald A.

    2011-03-02

    Small proteins (10-200 amino acids, aa, in length) encoded by short open reading frames (sORF) play important regulatory roles in various biological processes, including tumor progression, stress response, flowering, and hormone signaling. However, ab initio discovery of small proteins has been relatively overlooked. Recent advances in deep transcriptome sequencing make it possible to efficiently identify sORFs at the genome level. In this study, we obtained 2.6 million expressed sequence tag (EST) reads from Populus deltoides leaf transcriptome and reconstructed full-length transcripts from the EST sequences. We identified an initial set of 12,852 sORFs encoding proteins of 10-200 aa in length. Three computational approaches were then used to enrich for bona fide protein-coding sORFs from the initial sORF set: (1) coding potential prediction, (2) evolutionary conservation between P. deltoides and other plant species, and (3) gene family clustering within P. deltoides. As a result, a high-confidence sORF candidate set containing 1469 genes was obtained. Analysis of the protein domains, non-protein-coding RNA motifs, sequence length distribution, and protein mass spectrometry data supported this high-confidence sORF set. In the high-confidence sORF candidate set, known protein domains were identified in 1282 genes (higher-confidence sORF candidate set), out of which 611 genes, designated as highest-confidence candidate sORF set, were supported by proteomics data. Of the 611 highest-confidence candidate sORF genes, 56 were new to the current Populus genome annotation. This study not only demonstrates that there are potential sORF candidates to be annotated in sequenced genomes, but also presents an efficient strategy for discovery of sORFs in species with no genome annotation yet available.
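
    The initial step described above, collecting sORFs whose products fall in a fixed length window, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `find_sorfs`, the forward-strand-only scan, and the simple ATG-to-stop ORF definition are all assumptions made for illustration, and the later enrichment steps (coding potential, conservation, gene family clustering) are not modelled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_sorfs(seq, min_aa=10, max_aa=200):
    """Return (start, end, frame) for forward-strand ORFs whose encoded
    protein has between min_aa and max_aa residues (stop codon excluded).

    Hypothetical helper: an ORF here is the first ATG in a frame up to the
    next in-frame stop codon; `end` is exclusive and includes the stop.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3  # residues encoded before the stop
                if min_aa <= n_aa <= max_aa:
                    hits.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return hits
```

    A real analysis would also scan the reverse complement and handle ambiguous bases; the sketch only shows how the length window acts as a filter.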

  5. SeqMule: automated pipeline for analysis of human exome/genome sequencing data.

    Science.gov (United States)

    Guo, Yunfei; Ding, Xiaolei; Shen, Yufeng; Lyon, Gholson J; Wang, Kai

    2015-09-18

    Next-generation sequencing (NGS) technology has greatly helped us identify disease-contributory variants for Mendelian diseases. However, users are often faced with issues such as software compatibility, complicated configuration, and lack of access to high-performance computing facilities. Discrepancies exist among aligners and variant callers. We developed a computational pipeline, SeqMule, to perform automated variant calling from NGS data on human genomes and exomes. SeqMule integrates computational-cluster-free parallelization capability built on top of the variant callers, and facilitates normalization/intersection of variant calls to generate a consensus set with high confidence. SeqMule integrates 5 alignment tools and 5 variant calling algorithms and accepts various combinations, all through a one-line command, therefore allowing highly flexible yet fully automated variant calling. On a modern machine (2 Intel Xeon X5650 CPUs, 48 GB memory), when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates a consensus call set that improves over single callers, as measured by both Mendelian error rate and consistency. SeqMule supports Sun Grid Engine for parallel processing, offers a turn-key solution for deployment on Amazon Web Services, and provides quality checks, Mendelian error checks, consistency evaluation, and HTML-based reports. SeqMule is available at http://seqmule.openbioinformatics.org.
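
    The idea of intersecting calls from several variant callers to obtain a consensus set can be sketched as follows. This is an illustration only and does not reproduce SeqMule's implementation or command line: the function name `consensus_calls`, the caller labels, and the representation of a variant as a `(chrom, pos, ref, alt)` tuple are assumptions, and the variants are assumed to be already normalized so that identical events compare equal.

```python
from collections import Counter

def consensus_calls(call_sets, min_support=2):
    """Return the sorted variants reported by at least `min_support` callers.

    `call_sets` maps a caller name (hypothetical labels) to an iterable of
    (chrom, pos, ref, alt) tuples that are assumed to be pre-normalized.
    """
    counts = Counter()
    for calls in call_sets.values():
        counts.update(set(calls))  # each caller votes once per variant
    return sorted(v for v, n in counts.items() if n >= min_support)

# Toy input: three hypothetical callers with partly overlapping calls.
calls = {
    "caller_a": [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")],
    "caller_b": [("chr1", 100, "A", "G")],
    "caller_c": [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
                 ("chr2", 5, "G", "A")],
}
majority = consensus_calls(calls, min_support=2)
unanimous = consensus_calls(calls, min_support=3)
```

    Raising `min_support` trades sensitivity for confidence: requiring all callers to agree gives a strict intersection, while a lower threshold keeps variants that some callers miss.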

  6. Automated genome mining for natural products

    Directory of Open Access Journals (Sweden)

    Zajkowski James

    2009-06-01

    Full Text Available Abstract Background Discovery of new medicinal agents from natural sources has largely been an adventitious process based on screening of plant and microbial extracts combined with bioassay-guided identification and natural product structure elucidation. Increasingly rapid and more cost-effective genome sequencing technologies coupled with advanced computational power have converged to transform this trend toward a more rational and predictive pursuit. Results We have developed a rapid method of scanning genome sequences for multiple polyketide, nonribosomal peptide, and mixed combination natural products with output in a text format that can be readily converted to two and three dimensional structures using conventional software. Our open-source and web-based program can assemble various small molecules composed of twenty standard amino acids and twenty two other chain-elongation intermediates used in nonribosomal peptide systems, and four acyl-CoA extender units incorporated into polyketides by reading a hidden Markov model of DNA. This process evaluates and selects the substrate specificities along the assembly line of nonribosomal synthetases and modular polyketide synthases. Conclusion Using this approach we have predicted the structures of natural products from a diverse range of bacteria based on a limited number of signature sequences. In accelerating direct DNA to metabolomic analysis, this method bridges the interface between chemists and biologists and enables rapid scanning for compounds with potential therapeutic value.

  7. Citrus sinensis annotation project (CAP): a comprehensive database for sweet orange genome.

    Directory of Open Access Journals (Sweden)

    Jia Wang

    Full Text Available Citrus is one of the most important and widely grown fruit crops, with global production ranking first among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production both in fresh fruit and processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to the researchers of sweet orange and other woody plants, which is freely available at http://citrus.hzau.edu.cn/.

  8. GO-FAANG meeting: a Gathering On Functional Annotation of Animal Genomes.

    Science.gov (United States)

    Tuggle, Christopher K; Giuffra, Elisabetta; White, Stephen N; Clarke, Laura; Zhou, Huaijun; Ross, Pablo J; Acloque, Hervé; Reecy, James M; Archibald, Alan; Bellone, Rebecca R; Boichard, Michèle; Chamberlain, Amanda; Cheng, Hans; Crooijmans, Richard P M A; Delany, Mary E; Finno, Carrie J; Groenen, Martien A M; Hayes, Ben; Lunney, Joan K; Petersen, Jessica L; Plastow, Graham S; Schmidt, Carl J; Song, Jiuzhou; Watson, Mick

    2016-10-01

    The Functional Annotation of Animal Genomes (FAANG) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of domesticated and non-model organisms (www.faang.org). The workshop gathered together from around the world a group of 100+ genome scientists, administrators, representatives of funding agencies and commodity groups to discuss the latest advancements of the consortium, new perspectives, next steps and implementation plans. The workshop was streamed live and recorded, and all talks, along with speaker slide presentations, are available at www.faang.org. In this report, we describe the major activities and outcomes of this meeting. We also provide updates on ongoing efforts to implement discussions and decisions taken at GO-FAANG to guide future FAANG activities. In summary, reference datasets are being established under pilot projects; plans for tissue sets, morphological classification and methods of sample collection for different tissues were organized; and core assays and data and meta-data analysis standards were established.

  9. Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence

    Directory of Open Access Journals (Sweden)

    Luo Ming-Cheng

    2011-01-01

    Full Text Available Abstract Background Many plants have large and complex genomes with an abundance of repeated sequences. Many plants are also polyploid. Both of these attributes typify the genome architecture in the tribe Triticeae, whose members include economically important wheat, rye and barley. Large genome sizes, an abundance of repeated sequences, and polyploidy present challenges to genome-wide SNP discovery using next-generation sequencing (NGS) of total genomic DNA by making alignment and clustering of short reads generated by the NGS platforms difficult, particularly in the absence of a reference genome sequence. Results An annotation-based, genome-wide SNP discovery pipeline is reported using NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another genotype generated with SOLiD or Solexa are then mapped to the annotated Roche 454 reads to identify putative SNPs. A pipeline program package, AGSNP, was developed and used for genome-wide SNP discovery in Aegilops tauschii, the diploid source of the wheat D genome, which has a genome size of 4.02 Gb, of which 90% is repetitive sequences. Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 were sequenced primarily with SOLiD, although some Solexa and Roche 454 genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 putative SNPs were discovered in uncharacterized single-copy regions, and another 145,907 putative SNPs were discovered in repeat junctions. These SNPs were dispersed across the entire Ae. tauschii genome. To assess the false positive SNP discovery rate, DNA

  10. High-density rhesus macaque oligonucleotide microarray design using early-stage rhesus genome sequence information and human genome annotations

    Directory of Open Access Journals (Sweden)

    Magness Charles L

    2007-01-01

    a closely related species. Conclusion The number of different genes represented on microarrays for unfinished genomes can be greatly increased by matching known gene transcript annotations from a closely related species with sequence data from the unfinished genome. Signal intensity on both EST- and genome-derived arrays was highly correlated with probe distance from the 3' UTR, information often missing from ESTs yet present in early-stage genome projects.

  11. The de novo genome assembly and annotation of a female domestic dromedary of North African origin.

    Science.gov (United States)

    Fitak, Robert R; Mohandesan, Elmira; Corander, Jukka; Burger, Pamela A

    2016-01-01

    The single-humped dromedary (Camelus dromedarius) is the most numerous and widespread of domestic camel species and is a significant source of meat, milk, wool, transportation and sport for millions of people. Dromedaries are particularly well adapted to hot, desert conditions and harbour a variety of biological and physiological characteristics with evolutionary, economic and medical importance. To understand the genetic basis of these traits, an extensive resource of genomic variation is required. In this study, we assembled, at 65× coverage, a 2.06 Gb draft genome of a female dromedary whose ancestry can be traced to an isolated population from the Canary Islands. We annotated 21,167 protein-coding genes and estimated ~33.7% of the genome to be repetitive. A comparison with the recently published draft genome of an Arabian dromedary resulted in 1.91 Gb of aligned sequence with a divergence of 0.095%. An evaluation of our genome with the reference revealed that our assembly contains more error-free bases (91.2%) and fewer scaffolding errors. We identified ~1.4 million single-nucleotide polymorphisms with a mean density of 0.71 × 10^-3 per base. An analysis of demographic history indicated that changes in effective population size corresponded with recent glacial epochs. Our de novo assembly provides a useful resource of genomic variation for future studies of the camel's adaptations to arid environments and economically important traits. Furthermore, these results suggest that draft genome assemblies constructed with only two differently sized sequencing libraries can be comparable to those sequenced using additional library sizes, highlighting that additional resources might be better placed in technologies alternative to short-read sequencing to physically anchor scaffolds to genome maps.

  12. New local potential useful for genome annotation and 3D modeling

    Energy Technology Data Exchange (ETDEWEB)

    Chandonia, John-Marc; Cohen, Fred E.

    2003-07-17

    A new potential energy function representing the conformational preferences of sequentially local regions of a protein backbone is presented. This potential is derived from secondary structure probabilities such as those produced by neural network-based prediction methods. The potential is applied to the problem of remote homolog identification, in combination with a distance dependent inter-residue potential and position-based scoring matrices. This fold recognition jury is implemented in a Java application called JThread. These methods are benchmarked on several test sets, including one released entirely after development and parameterization of JThread. In benchmark tests to identify known folds structurally similar (but not identical) to the native structure of a sequence, JThread performs significantly better than PSI-BLAST, with 10 percent more structures correctly identified as the most likely structural match in a fold library, and 20 percent more structures correctly narrowed down to a set of five possible candidates. JThread also significantly improves the average sequence alignment accuracy, from 53 percent to 62 percent of residues correctly aligned. Reliable fold assignments and alignments are identified, making the method useful for genome annotation. JThread is applied to predicted open reading frames (ORFs) from the genomes of Mycoplasma genitalium and Drosophila melanogaster, identifying 20 new structural annotations in the former and 801 in the latter.

  13. Subfunction partitioning, the teleost radiation and the annotation of the human genome.

    Science.gov (United States)

    Postlethwait, John; Amores, Angel; Cresko, William; Singer, Amy; Yan, Yi-Lin

    2004-10-01

    Half of all vertebrate species are teleost fish. What accounts for this explosion of biodiversity? Recent evidence and advances in evolutionary theory suggest that genomic features could have played a significant role in the teleost radiation. This review examines evidence for an ancient whole-genome duplication (tetraploidization) event that probably occurred just before the teleost radiation. The partitioning of ancestral subfunctions between gene copies arising from this duplication could have contributed to the genetic isolation of populations, to lineage-specific diversification of developmental programs, and ultimately to phenotypic variation among teleost fish. Beyond its importance for understanding mechanisms that generate biodiversity, the partitioning of subfunctions between teleost co-orthologs of human genes can facilitate the identification of tissue-specific conserved noncoding regions and can simplify the analysis of ancestral gene functions obscured by pleiotropy or haploinsufficiency. Applying these principles on a genomic scale can accelerate the functional annotation of the human genome and understanding of the roles of human genes in health and disease.

  14. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies

    Science.gov (United States)

    Wang, Qian; He, Beixin Julie; Zhao, Hongyu

    2016-01-01

    Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions were highly enriched for GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline. PMID:27058395

  15. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies.

    Directory of Open Access Journals (Sweden)

    Qiongshi Lu

    2016-04-01

    Full Text Available Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions were highly enriched for GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline.

  16. Semantic Assembly and Annotation of Draft RNAseq Transcripts without a Reference Genome.

    Science.gov (United States)

    Ptitsyn, Andrey; Temanni, Ramzi; Bouchard, Christelle; Anderson, Peter A V

    2015-01-01

    Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the overlapping nucleotide context between short fragments (reads). However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license.

  17. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production.

    Directory of Open Access Journals (Sweden)

    Nupoor Chowdhary

    Full Text Available Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but it still has major drawbacks, such as sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation has been carried out as an attempt to update the incomplete genome information that causes gaps in knowledge, especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs) and 46 proteins previously annotated as putative, and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform to the original genome, with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly

  18. Genomic organization, annotation, and ligand-receptor inferences of chicken chemokines and chemokine receptor genes based on comparative genomics

    Directory of Open Access Journals (Sweden)

    Sze Sing-Hoi

    2005-03-01

    Full Text Available Abstract Background Chemokines and their receptors play important roles in host defense, organogenesis, hematopoiesis, and neuronal communication. Forty-two chemokines and 19 cognate receptors have been found in the human genome. Prior to this report, only 11 chicken chemokines and 7 receptors had been reported. The objectives of this study were to systematically identify chicken chemokines and their cognate receptor genes in the chicken genome and to annotate these genes and ligand-receptor binding by a comparative genomics approach. Results Twenty-three chemokine and 14 chemokine receptor genes were identified in the chicken genome. All of the chicken chemokines contained a conserved CC, CXC, CX3C, or XC motif, whereas all the chemokine receptors had seven conserved transmembrane helices, four extracellular domains with a conserved cysteine, and a conserved DRYLAIV sequence in the second intracellular domain. The number of coding exons in these genes and the syntenies are highly conserved between human, mouse, and chicken although the amino acid sequence homologies are generally low between mammalian and chicken chemokines. Chicken genes were named with the systematic nomenclature used in humans and mice based on phylogeny, synteny, and sequence homology. Conclusion The independent nomenclature of chicken chemokines and chemokine receptors suggests that the chicken may have ligand-receptor pairings similar to mammals. All identified chicken chemokines and their cognate receptors were identified in the chicken genome except CCR9, whose ligand was not identified in this study. The organization of these genes suggests that there were a substantial number of these genes present before divergence between aves and mammals and more gene duplications of CC, CXC, CCR, and CXCR subfamilies in mammals than in aves after the divergence.

  19. Ontology for Genome Comparison and Genomic Rearrangements

    Directory of Open Access Journals (Sweden)

    Anil Wipat

    2006-04-01

    Full Text Available We present an ontology for describing genomes, genome comparisons, their evolution and biological function. This ontology will support the development of novel genome comparison algorithms and aid the community in discussing genomic evolution. It provides a framework for communication about comparative genomics, and a basis upon which further automated analysis can be built. The nomenclature defined by the ontology will foster clearer communication between biologists, and also standardize terms used by data publishers in the results of analysis programs. The overriding aim of this ontology is the facilitation of consistent annotation of genomes through computational methods, rather than human annotators. To this end, the ontology includes definitions that support computer analysis and automated transfer of annotations between genomes, rather than relying upon human mediation.

  20. Transcriptator: An Automated Computational Pipeline to Annotate Assembled Reads and Identify Non Coding RNA.

    Directory of Open Access Journals (Sweden)

    Kumar Parijat Tripathi

    Full Text Available RNA-seq is a new tool to measure RNA transcript counts, using high-throughput sequencing at an extraordinary accuracy. It provides quantitative means to explore the transcriptome of an organism of interest. However, interpreting this extremely large data into biological knowledge is a problem, and biologist-friendly tools are lacking. In our lab, we developed Transcriptator, a web application based on a computational Python pipeline with a user-friendly Java interface. This pipeline uses the web services available for BLAST (Basic Local Alignment Search Tool), QuickGO and DAVID (Database for Annotation, Visualization and Integrated Discovery) tools. It offers a report on statistical analysis of functional and Gene Ontology (GO) annotation enrichment. It helps users to identify enriched biological themes, particularly GO terms, pathways, domains, gene/protein features and information related to protein-protein interactions. It clusters the transcripts based on functional annotations and generates a tabular report of functional and gene ontology annotations for each transcript submitted to the web server. The implementation of QuickGO web services in our pipeline enables users to carry out GO-Slim analysis, whereas the integration of PORTRAIT (Prediction of transcriptomic non coding RNA (ncRNA) by ab initio methods) helps to identify the non coding RNAs and their regulatory role in the transcriptome. In summary, Transcriptator is a useful software for both NGS and array data. It helps users to characterize the de-novo assembled reads obtained from NGS experiments for non-referenced organisms, while it also performs the functional enrichment analysis of differentially expressed transcripts/genes for both RNA-seq and micro-array experiments. It generates easy to read tables and interactive charts for better understanding of the data. The pipeline is modular in nature, and provides an opportunity to add new plugins in the future. Web application is

  1. Genome-wide Annotation, Identification, and Global Transcriptomic Analysis of Regulatory or Small RNA Gene Expression in Staphylococcus aureus

    Directory of Open Access Journals (Sweden)

    Ronan K. Carroll

    2016-02-01

    Full Text Available In Staphylococcus aureus, hundreds of small regulatory or small RNAs (sRNAs) have been identified, yet this class of molecule remains poorly understood and severely understudied. sRNA genes are typically absent from genome annotation files, and as a consequence, their existence is often overlooked, particularly in global transcriptomic studies. To facilitate improved detection and analysis of sRNAs in S. aureus, we generated updated GenBank files for three commonly used S. aureus strains (MRSA252, NCTC 8325, and USA300), in which we added annotations for >260 previously identified sRNAs. These files, the first to include genome-wide annotation of sRNAs in S. aureus, were then used as a foundation to identify novel sRNAs in the community-associated methicillin-resistant strain USA300. This analysis led to the discovery of 39 previously unidentified sRNAs. Investigating the genomic loci of the newly identified sRNAs revealed a surprising degree of inconsistency in genome annotation in S. aureus, which may be hindering the analysis and functional exploration of these elements. Finally, using our newly created annotation files as a reference, we perform a global analysis of sRNA gene expression in S. aureus and demonstrate that the newly identified tsr25 is the most highly upregulated sRNA in human serum. This study provides an invaluable resource to the S. aureus research community in the form of our newly generated annotation files, while at the same time presenting the first examination of differential sRNA expression in pathophysiologically relevant conditions.

  2. EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries

    Directory of Open Access Journals (Sweden)

    Pardinas Jose R

    2008-04-01

    Full Text Available Abstract Background Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. Results We have developed "EST Express", a suite of analytical tools that identify and annotate ESTs originating from specific mRNA populations. The software consists of a user-friendly GUI powered by PHP and MySQL that allows for online collaboration between researchers and continuity with UniGene, Entrez Gene and RefSeq. Two key features of the software are a novel, simplified Entrez Gene parser and tools to manage cDNA library sequencing projects. We have tested the software on a large data set (2,016 samples) produced by subtractive hybridization. Conclusion EST Express is an open-source, cross-platform web server application that imports sequences from cDNA libraries, such as those generated through subtractive hybridization or yeast two-hybrid screens. It then provides several layers of annotation based on Entrez Gene and RefSeq to allow the user to highlight useful genes and manage cDNA library projects.

  3. The RAST Server: Rapid Annotations using Subsystems Technology

    Directory of Open Access Journals (Sweden)

    Overbeek Ross A

    2008-02-01

    Full Text Available Abstract Background The number of prokaryotic genome sequences becoming available is growing steadily and is growing faster than our ability to accurately annotate them. Description We describe a fully automated service for annotating bacterial and archaeal genomes. The service identifies protein-encoding, rRNA and tRNA genes, assigns functions to the genes, predicts which subsystems are represented in the genome, uses this information to reconstruct the metabolic network and makes the output easily downloadable for the user. In addition, the annotated genome can be browsed in an environment that supports comparative analysis with the annotated genomes maintained in the SEED environment. The service normally makes the annotated genome available within 12–24 hours of submission, but ultimately the quality of such a service will be judged in terms of accuracy, consistency, and completeness of the produced annotations. We summarize our attempts to address these issues and discuss plans for incrementally enhancing the service. Conclusion By providing accurate, rapid annotation freely to the community we have created an important community resource. The service has now been utilized by over 120 external users annotating over 350 distinct genomes.

  4. A kingdom-specific protein domain HMM library for improved annotation of fungal genomes

    Directory of Open Access Journals (Sweden)

    Oliver Stephen G

    2007-04-01

    Full Text Available Abstract Background Pfam is a general-purpose database of protein domain alignments and profile Hidden Markov Models (HMMs), which is very popular for the annotation of sequence data produced by genome sequencing projects. Pfam provides models that are often very general in terms of the taxa that they cover and it has previously been suggested that such general models may lack some of the specificity or selectivity that would be provided by kingdom-specific models. Results Here we present a general approach to create domain libraries of HMMs for sub-taxa of a kingdom. Taking fungal species as an example, we construct a domain library of HMMs (called Fungal Pfam or FPfam) using sequences from 30 genomes, consisting of 24 species from the ascomycetes group and two basidiomycetes, Ustilago maydis, a fungal pathogen of maize, and the white rot fungus Phanerochaete chrysosporium. In addition, we include the Microsporidion Encephalitozoon cuniculi, an obligate intracellular parasite, and two non-fungal species, the oomycetes Phytophthora sojae and Phytophthora ramorum, both plant pathogens. We evaluate the performance in terms of coverage against the original 30 genomes used in training FPfam and against five more recently sequenced fungal genomes that can be considered as an independent test set. We show that kingdom-specific models such as FPfam can find instances of both novel and well characterized domains, increase overall coverage and detect more domains per sequence with typically higher bitscores than Pfam for the same domain families. An evaluation of the effect of changing E-values on the coverage shows that the performance of FPfam is consistent over the range of E-values applied. Conclusion Kingdom-specific models are shown to provide improved coverage. However, as the models become more specific, some sequences found by Pfam may be missed by the models in FPfam and some of the families represented in the test set are not present in FPfam

  5. Towards fully automated structure-based function prediction in structural genomics: a case study.

    Science.gov (United States)

    Watson, James D; Sanderson, Steve; Ezersky, Alexandra; Savchenko, Alexei; Edwards, Aled; Orengo, Christine; Joachimiak, Andrzej; Laskowski, Roman A; Thornton, Janet M

    2007-04-13

    As the global Structural Genomics projects have picked up pace, the number of structures annotated in the Protein Data Bank as hypothetical protein or unknown function has grown significantly. A major challenge now involves the development of computational methods to assign functions to these proteins accurately and automatically. As part of the Midwest Center for Structural Genomics (MCSG) we have developed a fully automated functional analysis server, ProFunc, which performs a battery of analyses on a submitted structure. The analyses combine a number of sequence-based and structure-based methods to identify functional clues. After the first stage of the Protein Structure Initiative (PSI), we review the success of the pipeline and the importance of structure-based function prediction. As a dataset, we have chosen all structures solved by the MCSG during the 5 years of the first PSI. Our analysis suggests that two of the structure-based methods are particularly successful and provide examples of local similarity that is difficult to identify using current sequence-based methods. No one method is successful in all cases, so, through the use of a number of complementary sequence and structural approaches, the ProFunc server increases the chances that at least one method will find a significant hit that can help elucidate function. Manual assessment of the results is a time-consuming process and subject to individual interpretation and human error. We present a method based on the Gene Ontology (GO) schema using GO-slims that can allow the automated assessment of hits with a success rate approaching that of expert manual assessment.

  6. Gene discovery in the hamster: a comparative genomics approach for gene annotation by sequencing of hamster testis cDNAs

    Directory of Open Access Journals (Sweden)

    Khan Shafiq A

    2003-06-01

    Full Text Available Abstract Background Complete genome annotation will likely be achieved through computer-based analysis of available genome sequences combined with direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish) genomes. Results 735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic locations of each sequence were mapped in human, mouse, rat and pufferfish, where applicable, and the structure of their cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque. Conclusion The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that they may play roles in the development and/or function of testicular cells.

  7. Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots

    Directory of Open Access Journals (Sweden)

    Sujai eKumar

    2013-11-01

    Full Text Available Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratisation of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory reared model organisms. They are often small, and cannot be isolated free of their environment - whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programmes. Here we present an approach to extracting from mixed DNA sequence data subsets that correspond to single species' genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimised assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from http://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer.
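
    The partitioning idea in this abstract (GC proportion, read coverage and a best-match taxon per contig) can be illustrated with a minimal sketch. It is not the blobology pipeline itself, which relies on plots for visual selection; the function names, the fixed thresholds and the example contigs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Proportion of G and C bases among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def bin_contigs(contigs, gc_split=0.5, cov_split=20.0):
    """Group contigs by (taxon, GC class, coverage class).

    contigs: iterable of (name, sequence, mean_coverage, taxon) tuples.
    gc_split and cov_split are assumed cut-offs; in practice they would be
    chosen by inspecting the GC-coverage plot.
    """
    bins = defaultdict(list)
    for name, seq, cov, taxon in contigs:
        gc_label = "highGC" if gc_fraction(seq) >= gc_split else "lowGC"
        cov_label = "highCov" if cov >= cov_split else "lowCov"
        bins[(taxon, gc_label, cov_label)].append(name)
    return dict(bins)


# Hypothetical draft-assembly contigs with made-up coverage and taxon labels.
contigs = [
    ("contig1", "ATATATGC", 50.0, "Nematoda"),
    ("contig2", "GCGCGCGA", 5.0, "Proteobacteria"),
    ("contig3", "ATTTAAGC", 45.0, "Nematoda"),
]
bins = bin_contigs(contigs)
for key, names in sorted(bins.items()):
    print(key, names)
```

    Contigs that share a bin could then be assembled separately, together with the reads that map to them; a real analysis would use far longer contigs and coverage computed from read alignments.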

  8. Transcriptator: An Automated Computational Pipeline to Annotate Assembled Reads and Identify Non Coding RNA

    Science.gov (United States)

    Zuccaro, Antonio; Guarracino, Mario Rosario

    2015-01-01

    RNA-seq is a new tool to measure RNA transcript counts, using high-throughput sequencing at an extraordinary accuracy. It provides quantitative means to explore the transcriptome of an organism of interest. However, interpreting this extremely large data into biological knowledge is a problem, and biologist-friendly tools are lacking. In our lab, we developed Transcriptator, a web application based on a computational Python pipeline with a user-friendly Java interface. This pipeline uses the web services available for BLAST (Basic Local Alignment Search Tool), QuickGO and DAVID (Database for Annotation, Visualization and Integrated Discovery) tools. It offers a report on statistical analysis of functional and Gene Ontology (GO) annotation enrichment. It helps users to identify enriched biological themes, particularly GO terms, pathways, domains, gene/protein features and information related to protein-protein interactions. It clusters the transcripts based on functional annotations and generates a tabular report of functional and gene ontology annotations for each transcript submitted to the web server. The implementation of QuickGO web services in our pipeline enables users to carry out GO-Slim analysis, whereas the integration of PORTRAIT (Prediction of transcriptomic non coding RNA (ncRNA) by ab initio methods) helps to identify the non coding RNAs and their regulatory role in the transcriptome. In summary, Transcriptator is a useful software for both NGS and array data. It helps users to characterize the de-novo assembled reads obtained from NGS experiments for non-referenced organisms, while it also performs the functional enrichment analysis of differentially expressed transcripts/genes for both RNA-seq and micro-array experiments. It generates easy to read tables and interactive charts for better understanding of the data. The pipeline is modular in nature, and provides an opportunity to add new plugins in the future. Web application is freely

  9. Discovery of germline-related genes in Cephalochordate amphioxus: A genome wide survey using genome annotation and transcriptome data.

    Science.gov (United States)

    Yue, Jia-Xing; Li, Kun-Lung; Yu, Jr-Kai

    2015-12-01

    The generation of germline cells is a critical process in the reproduction of multicellular organisms. Studies in animal models have identified a common repertoire of genes that play essential roles in primordial germ cell (PGC) formation. However, comparative studies also indicate that the timing and regulation of this core genetic program vary considerably in different animals, raising the intriguing questions regarding the evolution of PGC developmental mechanisms in metazoans. Cephalochordates (commonly called amphioxus or lancelets) represent one of the invertebrate chordate groups and can provide important information about the evolution of developmental mechanisms in the chordate lineage. In this study, we used genome and transcriptome data to identify germline-related genes in two distantly related cephalochordate species, Branchiostoma floridae and Asymmetron lucayanum. Branchiostoma and Asymmetron diverged more than 120 MYA, and the most conspicuous difference between them is their gonadal morphology. We used important germline developmental genes in several model animals to search the amphioxus genome and transcriptome dataset for conserved homologs. We also annotated the assembled transcriptome data using Gene Ontology (GO) terms to facilitate the discovery of putative genes associated with germ cell development and reproductive functions in amphioxus. We further confirmed the expression of 14 genes in developing oocytes or mature eggs using whole mount in situ hybridization, suggesting their potential functions in amphioxus germ cell development. The results of this global survey provide a useful resource for testing potential functions of candidate germline-related genes in cephalochordates and for investigating differences in gonad developmental mechanisms between Branchiostoma and Asymmetron species.

  10. Automated generation of program translation and verification tools using annotated grammars

    NARCIS (Netherlands)

    Ordonez Camacho, D.; Mens, K.; Brand, M.G.J. van den; Vinju, J.J.

    2010-01-01

    Automatically generating program translators from source and target language specifications is a non-trivial problem. In this paper we focus on the problem of automating the process of building translators between operations languages, a family of DSLs used to program satellite operations procedures

  11. ChIP-Seq-Annotated Heliconius erato Genome Highlights Patterns of cis-Regulatory Evolution in Lepidoptera.

    Science.gov (United States)

    Lewis, James J; van der Burg, Karin R L; Mazo-Vargas, Anyi; Reed, Robert D

    2016-09-13

    Uncovering phylogenetic patterns of cis-regulatory evolution remains a fundamental goal for evolutionary and developmental biology. Here, we characterize the evolution of regulatory loci in butterflies and moths using chromatin immunoprecipitation sequencing (ChIP-seq) annotation of regulatory elements across three stages of head development. In the process we provide a high-quality, functionally annotated genome assembly for the butterfly, Heliconius erato. Comparing cis-regulatory element conservation across six lepidopteran genomes, we find that regulatory sequences evolve at a pace similar to that of protein-coding regions. We also observe that elements active at multiple developmental stages are markedly more conserved than elements with stage-specific activity. Surprisingly, we also find that stage-specific proximal and distal regulatory elements evolve at nearly identical rates. Our study provides a benchmark for genome-wide patterns of regulatory element evolution in insects, and it shows that developmental timing of activity strongly predicts patterns of regulatory sequence evolution.

  12. Proteomics-based confirmation of protein expression and correction of annotation errors in the Brucella abortus genome

    Directory of Open Access Journals (Sweden)

    Tomaki Fadi

    2010-05-01

    Full Text Available Abstract Background Brucellosis is a major bacterial zoonosis affecting domestic livestock and wild mammals, as well as humans around the globe. While conducting proteomics studies to better understand Brucella abortus virulence, we consolidated the proteomic data collected and compared it to publicly available genomic data. Results The proteomic data was compiled from several independent comparative studies of Brucella abortus that used either outer membrane blebs, cytosols, or whole bacteria grown in media, as well as intracellular bacteria recovered at different times following macrophage infection. We identified a total of 621 bacterial proteins that were differentially expressed in a condition-specific manner. For 305 of these proteins we provide the first experimental evidence of their expression. Using a custom-built protein sequence database, we uncovered 7 annotation errors. We provide experimental evidence of expression of 5 genes that were originally annotated as non-expressed pseudogenes, as well as start site annotation errors for 2 other genes. Conclusions An essential element for ensuring correct functional studies is the correspondence between reported genome sequences and subsequent proteomics studies. In this study, we have used proteomics evidence to confirm expression of multiple proteins previously considered to be putative, as well as to correct annotation errors in the genome of Brucella abortus strain 2308.

  13. Multimedia input in automated image annotation and content-based retrieval

    Science.gov (United States)

    Srihari, Rohini K.

    1995-03-01

    This research explores the interaction of linguistic and photographic information in an integrated text/image database. By utilizing linguistic descriptions of a picture (speech and text input) coordinated with pointing references to the picture, we extract information useful in two aspects: image interpretation and image retrieval. In the image interpretation phase, objects and regions mentioned in the text are identified; the annotated image is stored in a database for future use. We incorporate techniques from our previous research on photo understanding using accompanying text: a system, PICTION, which identifies human faces in a newspaper photograph based on the caption. In the image retrieval phase, images matching natural language queries are presented to a user in a ranked order. This phase combines the output of (1) the image interpretation/annotation phase, (2) statistical text retrieval methods, and (3) image retrieval methods (e.g., color indexing). The system allows both point and click querying on a given image as well as intelligent querying across the entire text/image database.

  14. Comparative genomic analysis of the family Iridoviridae: re-annotating and defining the core set of iridovirus genes

    Directory of Open Access Journals (Sweden)

    Upton Chris

    2007-01-01

    Full Text Available Abstract Background Members of the family Iridoviridae can cause severe diseases resulting in significant economic and environmental losses. Very little is known about how iridoviruses cause disease in their host. In the present study, we describe the re-analysis of the Iridoviridae family of complex DNA viruses using a variety of comparative genomic tools to yield a greater consensus among the annotated sequences of its members. Results A series of genomic sequence comparisons were made among, and between the Ranavirus and Megalocytivirus genera in order to identify novel conserved ORFs. Of these two genera, the Megalocytivirus genomes required the greatest number of altered annotations. Prior to our re-analysis, the Megalocytivirus species orange-spotted grouper iridovirus and rock bream iridovirus shared 99% sequence identity, but only 82 out of 118 potential ORFs were annotated; in contrast, we predict that these species share an identical complement of genes. These annotation changes allowed the redefinition of the group of core genes shared by all iridoviruses. Seven new core genes were identified, bringing the total number to 26. Conclusion Our re-analysis of genomes within the Iridoviridae family provides a unifying framework to understand the biology of these viruses. Further re-defining the core set of iridovirus genes will continue to lead us to a better understanding of the phylogenetic relationships between individual iridoviruses as well as giving us a much deeper understanding of iridovirus replication. In addition, this analysis will provide a better framework for characterizing and annotating currently unclassified iridoviruses.
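
    The notion of a core gene set shared by all members of a family, as described above, amounts to intersecting the gene complements of the individual genomes. The sketch below assumes that genes have already been grouped into ortholog families by some comparative analysis; the genome names and ortholog labels are hypothetical placeholders, not data from the study.

```python
def core_genes(genomes):
    """Return the ortholog groups present in every genome.

    genomes: dict mapping a genome name to an iterable of ortholog group IDs.
    """
    gene_sets = [set(groups) for groups in genomes.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Hypothetical annotations: each genome lists the ortholog groups it contains.
annotations = {
    "virus_A": ["ortholog_1", "ortholog_2", "ortholog_3"],
    "virus_B": ["ortholog_1", "ortholog_2", "ortholog_4"],
    "virus_C": ["ortholog_1", "ortholog_2", "ortholog_3", "ortholog_5"],
}
core = core_genes(annotations)
print(sorted(core))
```

    Because the intersection can only shrink when an ORF is missed in one genome, re-annotating previously overlooked ORFs can enlarge the core set, which is consistent with the increase in core genes reported in the abstract.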

  15. Genomic annotation of the meningioma tumor suppressor locus on chromosome 1p34.

    Science.gov (United States)

    Sulman, Erik P; White, Peter S; Brodeur, Garrett M

    2004-01-29

    Meningioma is a frequently occurring tumor of the meninges surrounding the central nervous system. Loss of the short arm of chromosome 1 (1p) is the second most frequent chromosomal abnormality observed in these tumors. Previously, we identified a 3.7 megabase (Mb) region of consistent deletion on 1p33-p34 in a panel of 157 tumors. Loss of this region was associated with advanced disease and predictive for tumor relapse. In this report, a high-resolution integrated map of the region was constructed (CompView) to identify all markers in the smallest region of overlapping deletion (SRO). A regional somatic cell hybrid panel was used to more precisely localize those markers identified in CompView as within or overlapping the region. Additional deletion mapping using microsatellites localized to the region narrowed the SRO to approximately 2.8 Mb. The 88 markers remaining in the SRO were used to screen genomic databases to identify large-insert clones. Clones were assembled into a physical map of the region by PCR-based, sequence-tagged site (STS) content mapping. A sequence from clones was used to validate STS content by electronic PCR and to identify transcripts. A minimal tiling path of 43 clones was constructed across the SRO. Sequence data from the most current sequence assembly were used for further validation. A total of 59 genes were ordered within the SRO. In all, 17 of these were selected as likely candidates based on annotation using Gene Ontology Consortium terms, including the MUTYH, PRDX1, FOXD2, FOXE3, PTCH2, and RAD54L genes. This annotation of a putative tumor suppressor locus provides a resource for further analysis of meningioma candidate genes.

  16. Automated training for algorithms that learn from genomic data.

    Science.gov (United States)

    Cilingir, Gokcen; Broschat, Shira L

    2015-01-01

    Supervised machine learning algorithms are used by life scientists for a variety of objectives. Expert-curated public gene and protein databases are major resources for gathering data to train these algorithms. While these data resources are continuously updated, generally, these updates are not incorporated into published machine learning algorithms which thereby can become outdated soon after their introduction. In this paper, we propose a new model of operation for supervised machine learning algorithms that learn from genomic data. By defining these algorithms in a pipeline in which the training data gathering procedure and the learning process are automated, one can create a system that generates a classifier or predictor using information available from public resources. The proposed model is explained using three case studies on SignalP, MemLoci, and ApicoAP in which existing machine learning models are utilized in pipelines. Given that the vast majority of the procedures described for gathering training data can easily be automated, it is possible to transform valuable machine learning algorithms into self-evolving learners that benefit from the ever-changing data available for gene products and to develop new machine learning algorithms that are similarly capable.

  17. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3.

    Science.gov (United States)

    Han, Mira V; Thomas, Gregg W C; Lugo-Martinez, Jose; Hahn, Matthew W

    2013-08-01

    Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.

  18. Integrative analysis of functional genomic annotations and sequencing data to identify rare causal variants via hierarchical modeling

    Directory of Open Access Journals (Sweden)

    Marinela Capanu

    2015-05-01

    Full Text Available Identifying the small number of rare causal variants contributing to disease has been a major focus of investigation in recent years, but represents a formidable statistical challenge due to the rare frequencies with which these variants are observed. In this commentary we draw attention to a formal statistical framework, namely hierarchical modeling, to combine functional genomic annotations with sequencing data with the objective of enhancing our ability to identify rare causal variants. Using simulations we show that in all configurations studied, the hierarchical modeling approach has superior discriminatory ability compared to a recently proposed aggregate measure of deleteriousness, the Combined Annotation-Dependent Depletion (CADD) score, supporting our premise that aggregate functional genomic measures can more accurately identify causal variants when used in conjunction with sequencing data through a hierarchical modeling approach.

  19. Automated annotation of functional imaging experiments via multi-label classification

    Directory of Open Access Journals (Sweden)

    Matthew D Turner

    2013-12-01

    Full Text Available Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes, k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. Naive Bayes methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text.

  20. Automated annotation of functional imaging experiments via multi-label classification.

    Science.gov (United States)

    Turner, Matthew D; Chakrabarti, Chayan; Jones, Thomas B; Xu, Jiawei F; Fox, Peter T; Luger, George F; Laird, Angela R; Turner, Jessica A

    2013-01-01

    Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes (NB), k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. NB methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text.
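
    The two ideas named in this abstract, the binary relevance transformation and exact match scoring, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the label names are hypothetical, no classifier is trained here, and the functions are not taken from the authors' software.

```python
from typing import Dict, List, Set


def binary_relevance_targets(label_sets: List[Set[str]]) -> Dict[str, List[int]]:
    """Turn multi-label targets into one 0/1 target vector per label.

    Each vector can then be handed to an independent binary classifier
    (for example naive Bayes), which is the binary relevance approach.
    """
    labels = sorted(set().union(*label_sets))
    return {lab: [int(lab in s) for s in label_sets] for lab in labels}


def exact_match_ratio(expected: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Fraction of abstracts whose predicted label set equals the expert's set."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for e, p in zip(expected, predicted) if e == p)
    return hits / len(expected)


if __name__ == "__main__":
    # Hypothetical expert labels for two abstracts.
    expert = [{"visual", "button_press"}, {"auditory"}]
    # Hypothetical classifier output: the second abstract is mislabeled.
    guess = [{"visual", "button_press"}, {"visual"}]
    print(binary_relevance_targets(expert))
    print(exact_match_ratio(expert, guess))
```

    Exact match is a strict criterion because a single missing or extra label makes the whole abstract count as wrong, which helps explain why the reported scores span such a wide range.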

  1. Discovering gene annotations in biomedical text databases

    Directory of Open Access Journals (Sweden)

    Ozsoyoglu Gultekin

    2008-03-01

    Full Text Available Abstract Background Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. Results In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. Conclusion GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision.
The semantic pattern matching framework provides a more flexible pattern matching scheme than "exact matching", with the advantage of locating approximate

  2. Scaling Out and Evaluation of OBSecAn, an Automated Section Annotator for Semi-Structured Clinical Documents, on a Large VA Clinical Corpus.

    Science.gov (United States)

    Tran, Le-Thuy T; Divita, Guy; Redd, Andrew; Carter, Marjorie E; Samore, Matthew; Gundlapalli, Adi V

    2015-01-01

    "Identifying and labeling" (annotating) sections improves the effectiveness of extracting information stored in the free text of clinical documents. OBSecAn, an automated ontology-based section annotator, was developed to identify and label sections of semi-structured clinical documents from the Department of Veterans Affairs (VA). In the first stage, the algorithm reads and parses the document to obtain and store information regarding sections in a structure that supports the hierarchy of sections. The second stage detects and corrects errors in the parsed structure. The third stage produces the section annotation output using the final parsed tree. In this study, we present the OBSecAn method and its scaling to a million-document corpus and evaluate its performance in identifying family history sections. We identify high-yield sections for this use case from note titles such as primary care and demonstrate a median rate of 99% in correctly identifying a family history section.

  3. Draft genome sequence and annotation of Lactobacillus acetotolerans BM-LA14527, a beer-spoilage bacteria.

    Science.gov (United States)

    Liu, Junyan; Li, Lin; Peters, Brian M; Li, Bing; Deng, Yang; Xu, Zhenbo; Shirtliff, Mark E

    2016-09-01

    Lactobacillus acetotolerans is a hard-to-culture beer-spoilage bacterium capable of entering into the viable putative nonculturable (VPNC) state. As part of an initial strategy to investigate the phenotypic behavior of L. acetotolerans, draft genome sequencing was performed. Results demonstrated a total of 1824 predicted annotated genes, with several potential VPNC- and beer-spoilage-associated genes identified. Importantly, this is the first genome sequence of L. acetotolerans as beer-spoilage bacteria and it may aid in further analysis of L. acetotolerans and other beer-spoilage bacteria, with direct implications for food safety control in the beer brewing industry.

  4. CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences

    Directory of Open Access Journals (Sweden)

    Liu Chang

    2012-12-01

    Full Text Available Abstract Background The complete sequences of chloroplast genomes provide a wealth of information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, and powerful computational tools for annotating the genome sequences are urgently needed. Results We have developed a web server CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extraction of protein and mRNA sequences for a given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tools. The edited annotations can then be uploaded to CPGAVAS for update and re-analyses repeatedly. Using known chloroplast genome sequences as test set, we show that CPGAVAS performs comparably to another application DOGMA, while having several superior functionalities. Conclusions CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensable tool for researchers studying chloroplast genomes. The software is freely accessible from http://www.herbalgenomics.org/cpgavas.

  5. Statistical analysis of genomic protein family and domain controlled annotations for functional investigation of classified gene lists

    Directory of Open Access Journals (Sweden)

    Masseroli Marco

    2007-03-01

    Full Text Available Abstract Background The increasing protein family and domain based annotations constitute important information to understand protein functions and gain insight into relations among their codifying genes. To allow analyzing of gene proteomic annotations, we implemented novel modules within GFINDer, a Web system we previously developed that dynamically aggregates functional and phenotypic annotations of user-uploaded gene lists and allows performing their statistical analysis and mining. Results Exploiting protein information in Pfam and InterPro databanks, we developed and added in GFINDer original modules specifically devoted to the exploration and analysis of functional signatures of gene protein products. They allow annotating numerous user-classified nucleotide sequence identifiers with controlled information on related protein families, domains and functional sites, classifying them according to such protein annotation categories, and statistically analyzing the obtained classifications. In particular, when uploaded nucleotide sequence identifiers are subdivided in classes, the Statistics Protein Families&Domains module allows estimating relevance of Pfam or InterPro controlled annotations for the uploaded genes by highlighting protein signatures significantly more represented within user-defined classes of genes. In addition, the Logistic Regression module allows identifying protein functional signatures that better explain the considered gene classification. Conclusion Novel GFINDer modules provide genomic protein family and domain analyses supporting better functional interpretation of gene classes, for instance defined through statistical and clustering analyses of gene expression results from microarray experiments. They can hence help understanding fundamental biological processes and complex cellular mechanisms influenced by protein domain composition, and contribute to unveil new biomedical knowledge about the codifying genes.
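
    The idea of flagging protein signatures that are "significantly more represented" within a gene class can be illustrated with an over-representation test. The abstract does not say which test GFINDer applies, so the hypergeometric upper tail used below is an assumption, and the function name and example counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, class_size: int, annotated_total: int, population: int) -> float:
    """P(X >= k) for a hypergeometric draw without replacement.

    population      : number of uploaded genes
    annotated_total : genes in the population carrying the signature
    class_size      : genes in the user-defined class
    k               : genes in the class carrying the signature
    """
    if not (0 <= k <= class_size <= population and 0 <= annotated_total <= population):
        raise ValueError("inconsistent counts")
    upper = min(class_size, annotated_total)
    total = comb(population, class_size)
    favourable = sum(
        comb(annotated_total, i) * comb(population - annotated_total, class_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Hypothetical counts: 200 uploaded genes, 20 carry a given domain,
    # and 8 of the 30 genes in one class carry it.
    p = hypergeom_upper_tail(k=8, class_size=30, annotated_total=20, population=200)
    print(f"p = {p:.3g}")
```

    A small p-value suggests the signature is enriched in the class. When many signatures are tested at once, a multiple-testing correction would normally be applied on top of this.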

  6. Heterogeneous data analysis for annotation of microRNAs and novel genome assembly

    NARCIS (Netherlands)

    Zhang, Yanju

    2011-01-01

    This thesis is the collection of four published papers demonstrating annotation of genes and microRNAs with the aid of bioinformatics, in particular using heterogeneous data integration. Gene annotation is the process of detecting the structure and biological function of the raw DNA sequences; while

  7. Homology-based annotation of non-coding RNAs in the genomes of Schistosoma mansoni and Schistosoma japonicum

    Directory of Open Access Journals (Sweden)

    Santana Clara

    2009-10-01

    Full Text Available Abstract Background Schistosomes are trematode parasites of the phylum Platyhelminthes. They are considered the most important of the human helminth parasites in terms of morbidity and mortality. Draft genome sequences are now available for Schistosoma mansoni and Schistosoma japonicum. Non-coding RNA (ncRNA) plays a crucial role in gene expression regulation, cellular function and defense, homeostasis, and pathogenesis. The genome-wide annotation of ncRNAs is a non-trivial task unless well-annotated genomes of closely related species are already available. Results A homology search for structured ncRNA in the genome of S. mansoni resulted in 23 types of ncRNAs with conserved primary and secondary structure. Among these, we identified rRNA, snRNA, SL RNA, SRP, tRNAs and RNase P, and also possibly MRP and 7SK RNAs. In addition, we confirmed five miRNAs that have recently been reported in S. japonicum and found two additional homologs of known miRNAs. The tRNA complement of S. mansoni is comparable to that of the free-living planarian Schmidtea mediterranea, although for some amino acids differences of more than a factor of two are observed: Leu, Ser, and His are overrepresented, while Cys, Meth, and Ile are underrepresented in S. mansoni. On the other hand, the number of tRNAs in the genome of S. japonicum is reduced by more than a factor of four. Both schistosomes have a complete set of minor spliceosomal snRNAs. Several ncRNAs that are expected to exist in the S. mansoni genome were not found, among them the telomerase RNA, vault RNAs, and Y RNAs. Conclusion The ncRNA sequences and structures presented here represent the most complete dataset of ncRNA from any lophotrochozoan reported so far. This data set provides an important reference for further analysis of the genomes of schistosomes and indeed eukaryotic genomes at large.

  8. Annotation Of Novel And Conserved MicroRNA Genes In The Build 10 Sus scrofa Reference Genome And Determination Of Their Expression Levels In Ten Different Tissues

    DEFF Research Database (Denmark)

    Thomsen, Bo; Nielsen, Mathilde; Hedegaard, Jakob

    The DNA template used in the pig genome sequencing project was provided by a Duroc pig named TJ Tabasco. In an effort to annotate microRNA (miRNA) genes in the reference genome we have conducted deep sequencing to determine the miRNA transcriptomes in ten different tissues isolated from Pinky......, a genetically identical clone of TJ Tabasco. The purpose was to generate miRNA sequences that are highly homologous to the reference genome sequence, which along with computational prediction will improve confidence in the genomic annotation of miRNA genes. Based on homology searches of the sequence data...

  9. Automated comparative auditing of NCIT genomic roles using NCBI.

    Science.gov (United States)

    Cohen, Barry; Oren, Marc; Min, Hua; Perl, Yehoshua; Halper, Michael

    2008-12-01

    Biomedical research has identified many human genes and various knowledge about them. The National Cancer Institute Thesaurus (NCIT) represents such knowledge as concepts and roles (relationships). Due to the rapid advances in this field, it is to be expected that the NCIT's Gene hierarchy will contain role errors. A comparative methodology to audit the Gene hierarchy with the use of the National Center for Biotechnology Information's (NCBI's) Entrez Gene database is presented. The two knowledge sources are accessed via a pair of Web crawlers to ensure up-to-date data. Our algorithms then compare the knowledge gathered from each, identify discrepancies that represent probable errors, and suggest corrective actions. The primary focus is on two kinds of gene-roles: (1) the chromosomal locations of genes, and (2) the biological processes in which genes play a role. Regarding chromosomal locations, the discrepancies revealed are striking and systematic, suggesting a structurally common origin. In regard to the biological processes, difficulties arise because genes frequently play roles in multiple processes, and processes may have many designations (such as synonymous terms). Our algorithms make use of the roles defined in the NCIT Biological Process hierarchy to uncover many probable gene-role errors in the NCIT. These results show that automated comparative auditing is a promising technique that can identify a large number of probable errors and corrections for them in a terminological genomic knowledge repository, thus facilitating its overall maintenance.
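
    The comparison step described in this abstract amounts to checking one knowledge source against another and reporting disagreements. A minimal sketch follows, assuming both sources have already been reduced to gene-to-location mappings; the gene names and cytogenetic bands are hypothetical placeholders, and nothing here reproduces the authors' crawlers or algorithms.

```python
from typing import Dict, List, Tuple

Discrepancy = Tuple[str, str, str]  # (gene, value in source A, value in source B)


def audit_locations(source_a: Dict[str, str], source_b: Dict[str, str]) -> List[Discrepancy]:
    """List genes present in both sources whose recorded locations disagree.

    Values are compared after trimming whitespace and lowercasing, so purely
    cosmetic differences are not reported as probable errors.
    """
    discrepancies = []
    for gene in sorted(source_a.keys() & source_b.keys()):
        a = source_a[gene].strip().lower()
        b = source_b[gene].strip().lower()
        if a != b:
            discrepancies.append((gene, source_a[gene], source_b[gene]))
    return discrepancies


if __name__ == "__main__":
    # Hypothetical records standing in for the NCIT and Entrez Gene views.
    ncit = {"GENE_A": "1p34", "GENE_B": "7q31"}
    entrez = {"GENE_A": "1p34 ", "GENE_B": "7q22", "GENE_C": "2p13"}
    for gene, ncit_value, entrez_value in audit_locations(ncit, entrez):
        print(f"{gene}: NCIT={ncit_value} Entrez={entrez_value}")
```

    Location strings compare cleanly after light normalization. Biological process roles are harder, as the abstract notes, because a gene can take part in several processes and one process may go by several synonymous names.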

  10. ChIP-Seq-Annotated Heliconius erato Genome Highlights Patterns of cis-Regulatory Evolution in Lepidoptera

    Directory of Open Access Journals (Sweden)

    James J. Lewis

    2016-09-01

    Full Text Available Uncovering phylogenetic patterns of cis-regulatory evolution remains a fundamental goal for evolutionary and developmental biology. Here, we characterize the evolution of regulatory loci in butterflies and moths using chromatin immunoprecipitation sequencing (ChIP-seq) annotation of regulatory elements across three stages of head development. In the process we provide a high-quality, functionally annotated genome assembly for the butterfly, Heliconius erato. Comparing cis-regulatory element conservation across six lepidopteran genomes, we find that regulatory sequences evolve at a pace similar to that of protein-coding regions. We also observe that elements active at multiple developmental stages are markedly more conserved than elements with stage-specific activity. Surprisingly, we also find that stage-specific proximal and distal regulatory elements evolve at nearly identical rates. Our study provides a benchmark for genome-wide patterns of regulatory element evolution in insects, and it shows that developmental timing of activity strongly predicts patterns of regulatory sequence evolution.

  11. Improved genome annotation through untargeted detection of pathway-specific metabolites

    Directory of Open Access Journals (Sweden)

    Banfield Jillian F

    2011-06-01

    Full Text Available Abstract Background Mass spectrometry-based metabolomics analyses have the potential to complement sequence-based methods of genome annotation, but only if raw mass spectral data can be linked to specific metabolic pathways. In untargeted metabolomics, the measured mass of a detected compound is used to define the location of the compound in chemical space, but uncertainties in mass measurements lead to "degeneracies" in chemical space since multiple chemical formulae correspond to the same measured mass. We compare two methods to eliminate these degeneracies. One method relies on natural isotopic abundances, and the other relies on the use of stable-isotope labeling (SIL) to directly determine C and N atom counts. Both depend on combinatorial explorations of the "chemical space" consisting of all possible chemical formulae composed of biologically relevant chemical elements. Results Of 1532 metabolic pathways curated in the MetaCyc database, 412 contain a metabolite having a chemical formula unique to that metabolic pathway. Thus, chemical formulae alone can suffice to infer the presence of some metabolic pathways. Of 248,928 unique chemical formulae selected from the PubChem database, more than 95% had at least one degeneracy on the basis of accurate mass information alone. Consideration of natural isotopic abundance reduced degeneracy to 64%, but mainly for formulae less than 500 Da in molecular weight, and only if the error in the relative isotopic peak intensity was less than 10%. Knowledge of exact C and N atom counts as determined by SIL enabled reduced degeneracy, allowing for determination of unique chemical formula for 55% of the PubChem formulae. Conclusions To facilitate the assignment of chemical formulae to unknown mass-spectral features, profiling can be performed on cultures uniformly labeled with stable isotopes of nitrogen (15N) or carbon (13C).
This makes it possible to accurately count the number of carbon and nitrogen atoms in
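
    The notion of mass "degeneracy" in this abstract, and the way known C and N counts reduce it, can be illustrated by brute-force enumeration. The sketch below is restricted to C, H, N and O, ignores chemical valence rules, and assumes a mass tolerance well under 0.5 Da; the function name and example mass are illustrative and do not come from the paper.

```python
from typing import List, Optional, Tuple

# Monoisotopic masses (Da) of the most abundant isotopes.
C, H, N, O = 12.0, 1.00782503, 14.0030740, 15.9949146

Formula = Tuple[int, int, int, int]  # (C, H, N, O) atom counts


def candidate_formulae(mass: float, tol: float,
                       fixed_c: Optional[int] = None,
                       fixed_n: Optional[int] = None) -> List[Formula]:
    """Enumerate CHNO formulae whose monoisotopic mass lies within tol of mass.

    fixed_c / fixed_n mimic atom counts known from stable-isotope labeling.
    """
    limit = mass + tol
    found = []
    c_range = [fixed_c] if fixed_c is not None else range(int(limit // C) + 1)
    for c in c_range:
        rest_c = limit - c * C
        if rest_c < 0:
            continue
        n_range = [fixed_n] if fixed_n is not None else range(int(rest_c // N) + 1)
        for n in n_range:
            rest_n = rest_c - n * N
            if rest_n < 0:
                continue
            for o in range(int(rest_n // O) + 1):
                heavy = c * C + n * N + o * O
                # With tol below half a hydrogen mass, only the nearest H count can fit.
                h = round((mass - heavy) / H)
                if h < 0:
                    continue
                if abs(heavy + h * H - mass) <= tol:
                    found.append((c, h, n, o))
    return found


if __name__ == "__main__":
    measured = 180.0634  # close to the monoisotopic mass of C6H12O6
    unconstrained = candidate_formulae(measured, tol=0.005)
    constrained = candidate_formulae(measured, tol=0.005, fixed_c=6, fixed_n=0)
    print(len(unconstrained), len(constrained))
```

    Fixing the carbon and nitrogen counts leaves only the hydrogen and oxygen counts free, so the candidate list can only shrink relative to the unconstrained search.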

  12. Maize microarray annotation database

    Directory of Open Access Journals (Sweden)

    Berger Dave K

    2011-10-01

    Full Text Available Abstract Background Microarray technology has matured over the past fifteen years into a cost-effective solution with established data analysis protocols for global gene expression profiling. The Agilent-016047 maize 44 K microarray was custom-designed from EST sequences, but only reporter sequences with EST accession numbers are publicly available. The following information is lacking: (a) reporter - gene model match, (b) number of reporters per gene model, (c) potential for cross hybridization, (d) sense/antisense orientation of reporters, (e) position of reporter on B73 genome sequence (for eQTL studies), and (f) functional annotations of genes represented by reporters. To address this, we developed a strategy to annotate the Agilent-016047 maize microarray, and built a publicly accessible annotation database. Description Genomic annotation of the 42,034 reporters on the Agilent-016047 maize microarray was based on BLASTN results of the 60-mer reporter sequences and their corresponding ESTs against the maize B73 RefGen v2 "Working Gene Set" (WGS) predicted transcripts and the genome sequence. The agreement between the EST, WGS transcript and gDNA BLASTN results was used to assign the reporters into six genomic annotation groups. These annotation groups were: (i) "annotation by sense gene model" (23,668 reporters); (ii) "annotation by antisense gene model" (4,330); (iii) "annotation by gDNA" without a WGS transcript hit (1,549); (iv) "annotation by EST", in which case the EST from which the reporter was designed, but not the reporter itself, has a WGS transcript hit (3,390); (v) "ambiguous annotation" (2,608); and (vi) "inconclusive annotation" (6,489). Functional annotations of reporters were obtained by BLASTX and Blast2GO analysis of corresponding WGS transcripts against GenBank. The annotations are available in the Maize Microarray Annotation Database http://MaizeArrayAnnot.bi.up.ac.za/, as well as through a GBrowse annotation file that can be uploaded to

  13. Joint Genome Institute's Automation Approach and History

    Energy Technology Data Exchange (ETDEWEB)

    Roberts, Simon

    2006-07-05

    Department of Energy/Joint Genome Institute (DOE/JGI) collaborates with DOE national laboratories and community users, to advance genome science in support of the DOE missions of clean bio-energy, carbon cycling, and bioremediation.

  14. Partitioning SNPs Identified By GBS into Genome Annotation Classes and Calculating SNP-Explained Variances for Heading Date and Disease Resistance from the Resulting Genomic Relationship Matrices - Lolium perenne

    DEFF Research Database (Denmark)

    Byrne, Stephen; Cericola, Fabio; Janss, Luc;

    2015-01-01

    , and an average protein Annotation Edit Distance (AED) of 0.14. Genotyping-By-Sequencing (GBS) data was generated after genome complexity reduction with ApeKI for 995 breeding families. Data was aligned against the annotated sequence assembly, and we identified variants at over 1.8 million positions, which were......,273 SNPs), genes with NB-ARC domains (9,056 SNPs), intron (168,023 SNPs), and inter-genic (1,420,866 SNPs). Genomic relationship matrices were created for each annotation class and SNP-explained variances for heading date and disease resistance were calculated...

  15. Promoter prediction and annotation of microbial genomes based on DNA sequence and structural responses to superhelical stress

    Directory of Open Access Journals (Sweden)

    Benham Craig J

    2006-05-01

    Full Text Available Abstract Background In our previous studies, we found that the sites in prokaryotic genomes which are most susceptible to duplex destabilization under the negative superhelical stresses that occur in vivo are statistically highly significantly associated with intergenic regions that are known or inferred to contain promoters. In this report we investigate how this structural property, either alone or together with other structural and sequence attributes, may be used to search prokaryotic genomes for promoters. Results We show that the propensity for stress-induced DNA duplex destabilization (SIDD) is closely associated with specific promoter regions. The extent of destabilization in promoter-containing regions is found to be bimodally distributed. When compared with DNA curvature, deformability, thermostability or sequence motif scores within the -10 region, SIDD is found to be the most informative DNA property regarding promoter locations in the E. coli K12 genome. SIDD properties alone perform better at detecting promoter regions than other programs trained on this genome. Because this approach has a very low false positive rate, it can be used to predict with high confidence the subset of promoters that are strongly destabilized. When SIDD properties are combined with -10 motif scores in a linear classification function, they predict promoter regions with better than 80% accuracy. When these methods were tested with promoter and non-promoter sequences from Bacillus subtilis, they achieved similar or higher accuracies. We also present a strictly SIDD-based predictor for annotating promoter sequences in complete microbial genomes. Conclusion In this report we show that the propensity to undergo stress-induced duplex destabilization (SIDD) is a distinctive structural attribute of many prokaryotic promoter sequences. We have developed methods to identify promoter sequences in prokaryotic genomes that use SIDD either as a sole predictor or in

  16. A combined approach for genome wide protein function annotation/prediction

    DEFF Research Database (Denmark)

    Benso, Alfredo; Di Carlo, Stefano; Ur Rehman, Hafeez;

    2013-01-01

    proteins are discovered. On the other hand, proteins are the prominent stakeholders in almost all biological processes, and therefore the need to precisely know their functions for a better understanding of the underlying biological mechanism is inevitable. The challenge of annotating uncharacterized...

  17. KSHV 2.0: a comprehensive annotation of the Kaposi's sarcoma-associated herpesvirus genome using next-generation sequencing reveals novel genomic and functional features.

    Directory of Open Access Journals (Sweden)

    Carolina Arias

    2014-01-01

    Full Text Available Productive herpesvirus infection requires a profound, time-controlled remodeling of the viral transcriptome and proteome. To gain insights into the genomic architecture and gene expression control in Kaposi's sarcoma-associated herpesvirus (KSHV), we performed a systematic genome-wide survey of viral transcriptional and translational activity throughout the lytic cycle. Using mRNA-sequencing and ribosome profiling, we found that transcripts encoding lytic genes are promptly bound by ribosomes upon lytic reactivation, suggesting their regulation is mainly transcriptional. Our approach also uncovered new genomic features such as ribosome occupancy of viral non-coding RNAs, numerous upstream and small open reading frames (ORFs), and unusual strategies to expand the virus coding repertoire that include alternative splicing, dynamic viral mRNA editing, and the use of alternative translation initiation codons. Furthermore, we provide a refined and expanded annotation of transcription start sites, polyadenylation sites, splice junctions, and initiation/termination codons of known and new viral features in the KSHV genomic space which we have termed KSHV 2.0. Our results represent a comprehensive genome-scale image of gene regulation during lytic KSHV infection that substantially expands our understanding of the genomic architecture and coding capacity of the virus.

  18. Metalloproteomics: High-Throughput Structural and Functional Annotation of Proteins in Structural Genomics

    Energy Technology Data Exchange (ETDEWEB)

    Shi, W.; Zhan, C.; Ignatov, A.; Manjasetty, B.; Marinkovic, N.; Sullivan, M.; Huang, R.; Chance, M.; Li, H.; et al.

    2005-01-01

    A high-throughput method for measuring transition metal content based on quantitation of X-ray fluorescence signals was used to analyze 654 proteins selected as targets by the New York Structural GenomiX Research Consortium. Over 10% showed the presence of transition metal atoms in stoichiometric amounts; these totals as well as the abundance distribution are similar to those of the Protein Data Bank. Bioinformatics analysis of the identified metalloproteins in most cases supported the metalloprotein annotation; identification of the conserved metal binding motif was also shown to be useful in verifying structural models of the proteins. Metalloproteomics provides a rapid structural and functional annotation for these sequences and is shown to be approximately 95% accurate in predicting the presence or absence of stoichiometric metal content. The project's goal is to assay at least 1 member from each Pfam family; approximately 500 Pfam families have been characterized with respect to transition metal content so far.

  19. An automated Genomes-to-Natural Products platform (GNP) for the discovery of modular natural products.

    Science.gov (United States)

    Johnston, Chad W; Skinnider, Michael A; Wyatt, Morgan A; Li, Xiang; Ranieri, Michael R M; Yang, Lian; Zechel, David L; Ma, Bin; Magarvey, Nathan A

    2015-09-28

    Bacterial natural products are a diverse and valuable group of small molecules, and genome sequencing indicates that the vast majority remain undiscovered. The prediction of natural product structures from biosynthetic assembly lines can facilitate their discovery, but highly automated, accurate, and integrated systems are required to mine the broad spectrum of sequenced bacterial genomes. Here we present a genome-guided natural products discovery tool to automatically predict, combinatorialize and identify polyketides and nonribosomal peptides from biosynthetic assembly lines using LC-MS/MS data of crude extracts in a high-throughput manner. We detail the directed identification and isolation of six genetically predicted polyketides and nonribosomal peptides using our Genome-to-Natural Products platform. This highly automated, user-friendly programme provides a means of realizing the potential of genetically encoded natural products.

  20. Rapid annotation of anonymous sequences from genome projects using semantic similarities and a weighting scheme in gene ontology.

    Directory of Open Access Journals (Sweden)

    Paolo Fontana

    BACKGROUND: Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. METHODOLOGY: We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that is able to quickly process thousands of sequences for functional inference. The tool exploits for the first time an integrated approach which combines clustering of GO terms, based on their semantic similarities, with a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods and in this work we have based ARGOT processing on BLAST results. CONCLUSIONS: The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome and a small subset of proteins for purposes of comparison with other available tools. The algorithm was proven to outperform existing methods and to be suitable for function prediction of single proteins due to its high degree of sensitivity, specificity and coverage.

  1. An Innovative Plant Genomics and Gene Annotation Program for High School, Community College, and University Faculty

    Science.gov (United States)

    Hacisalihoglu, Gokhan; Hilgert, Uwe; Nash, E. Bruce; Micklos, David A.

    2008-01-01

    Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day…

  2. VibrioBase: A Model for Next-Generation Genome and Annotation Database Development

    Directory of Open Access Journals (Sweden)

    Siew Woh Choo

    2014-01-01

    To facilitate the ongoing research of Vibrio spp., a dedicated platform for the Vibrio research community is needed to host the fast-growing amount of genomic data and facilitate the analysis of these data. We present VibrioBase, a useful resource platform, providing all basic features of a sequence database with the addition of unique analysis tools which could be valuable for the Vibrio research community. VibrioBase currently houses a total of 252 Vibrio genomes, presented in a user-friendly manner that enables the analysis of these genomic data, particularly in the field of comparative genomics. Besides general data browsing features, VibrioBase offers analysis tools such as BLAST interfaces and the JBrowse genome browser. Other important features of this platform include our newly developed in-house tools, the pairwise genome comparison (PGC) tool and the pathogenomics profiling tool (PathoProT). The PGC tool is useful in the identification and comparative analysis of two genomes, whereas PathoProT is designed for comparative pathogenomics analysis of Vibrio strains. Both of these tools will enable researchers with little experience in bioinformatics to get meaningful information from Vibrio genomes with ease. We have tested the validity and suitability of these tools and features for use in the next-generation database development.

  3. The Drosophila melanogaster PeptideAtlas facilitates the use of peptide data for improved fly proteomics and genome annotation

    Directory of Open Access Journals (Sweden)

    King Nichole L

    2009-02-01

    Abstract Background Crucial foundations of any quantitative systems biology experiment are correct genome and proteome annotations. Protein databases compiled from high quality empirical protein identifications that are in turn based on correct gene models increase the correctness, sensitivity, and quantitative accuracy of systems biology genome-scale experiments. Results In this manuscript, we present the Drosophila melanogaster PeptideAtlas, a fly proteomics and genomics resource of unsurpassed depth. Based on peptide mass spectrometry data collected in our laboratory the portal http://www.drosophila-peptideatlas.org allows querying fly protein data observed with respect to gene model confirmation and splice site verification as well as for the identification of proteotypic peptides suited for targeted proteomics studies. Additionally, the database provides consensus mass spectra for observed peptides along with qualitative and quantitative information about the number of observations of a particular peptide and the sample(s) in which it was observed. Conclusion PeptideAtlas is an open access database for the Drosophila community that has several features and applications that support (1) reduction of the complexity inherently associated with performing targeted proteomic studies, (2) designing and accelerating shotgun proteomics experiments, (3) confirming or questioning gene models, and (4) adjusting gene models such that they are in line with observed Drosophila peptides. While the database consists of proteomic data it is not required that the user is a proteomics expert.

  4. Updated genome assembly and annotation of Paenibacillus larvae, the agent of American foulbrood disease of honey bees

    Directory of Open Access Journals (Sweden)

    de Graaf Dirk C

    2011-09-01

    Abstract Background As scientists continue to pursue various 'omics-based research, there is a need for high quality data for the most fundamental 'omics of all: genomics. The bacterium Paenibacillus larvae is the causative agent of the honey bee disease American foulbrood. If untreated, it can lead to the demise of an entire hive; the highly social nature of bees also leads to easy disease spread, between both individuals and colonies. Biologists have studied this organism since the early 1900s, and a century later, the molecular mechanism of infection remains elusive. Transcriptomics and proteomics, because of their ability to analyze multiple genes and proteins in a high-throughput manner, may be very helpful to its study. However, the power of these methodologies is severely limited without a complete genome; we undertake to address that deficiency here. Results We used the Illumina GAIIx platform and conventional Sanger sequencing to generate a 182-fold sequence coverage of the P. larvae genome, and assembled the data using ABySS into a total of 388 contigs spanning 4.5 Mbp. Comparative genomics analysis against fully-sequenced soil bacteria P. JDR2 and P. vortex showed that regions of poor conservation may contain putative virulence factors. We used GLIMMER to predict 3568 gene models, and named them based on homology revealed by BLAST searches; proteases, hemolytic factors, toxins, and antibiotic resistance enzymes were identified in this way. Finally, mass spectrometry was used to provide experimental evidence that at least 35% of the genes are expressed at the protein level. Conclusions This update on the genome of P. larvae and annotation represents an immense advancement from what we had previously known about this species. We provide here a reliable resource that can be used to elucidate the mechanism of infection, and by extension, more effective methods to control and cure this widespread honey bee disease.
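
    The "182-fold sequence coverage" quoted in the abstract above is an average depth: total sequenced bases divided by genome length. The sketch below only illustrates that arithmetic. The function names are hypothetical, and treating the 4.5 Mbp contig span as the genome size is an assumption made for illustration, not a figure confirmed by the record.

```python
def fold_coverage(total_bases: int, genome_size: int) -> float:
    """Average depth of coverage: sequenced bases divided by genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def bases_for_coverage(target_fold: int, genome_size: int) -> int:
    """Number of bases that must be sequenced to reach a target average depth."""
    return target_fold * genome_size


# Assumption: the 4.5 Mbp spanned by the contigs stands in for the genome size.
GENOME_SIZE = 4_500_000

# Bases implied by a 182-fold average depth over that length.
implied_bases = bases_for_coverage(182, GENOME_SIZE)
print(implied_bases)
print(fold_coverage(implied_bases, GENOME_SIZE))
```

    Under that assumption, 182-fold depth corresponds to roughly 0.8 Gbp of raw sequence.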

  5. Carbohydrate catabolic flexibility in the mammalian intestinal commensal Lactobacillus ruminis revealed by fermentation studies aligned to genome annotations

    LENUS (Irish Health Repository)

    2011-08-30

    Abstract Background Lactobacillus ruminis is a poorly characterized member of the Lactobacillus salivarius clade that is part of the intestinal microbiota of pigs, humans and other mammals. Its variable abundance in humans and animals may be linked to historical changes over time and geographical differences in dietary intake of complex carbohydrates. Results In this study, we investigated the ability of nine L. ruminis strains of human and bovine origin to utilize fifty carbohydrates including simple sugars, oligosaccharides, and prebiotic polysaccharides. The growth patterns were compared with metabolic pathways predicted by annotation of a high quality draft genome sequence of ATCC 25644 (human isolate) and the complete genome of ATCC 27782 (bovine isolate). All of the strains tested utilized prebiotics including fructooligosaccharides (FOS), soybean-oligosaccharides (SOS) and 1,3:1,4-β-D-gluco-oligosaccharides to varying degrees. Six strains isolated from humans utilized FOS-enriched inulin, as well as FOS. In contrast, three strains isolated from cows grew poorly in FOS-supplemented medium. In general, carbohydrate utilisation patterns were strain-dependent and also varied depending on the degree of polymerisation or complexity of structure. Six putative operons were identified in the genome of the human isolate ATCC 25644 for the transport and utilisation of the prebiotics FOS, galacto-oligosaccharides (GOS), SOS, and 1,3:1,4-β-D-gluco-oligosaccharides. One of these comprised a novel FOS utilisation operon with predicted capacity to degrade chicory-derived FOS. However, only three of these operons were identified in the ATCC 27782 genome, which might account for the utilisation of only SOS and 1,3:1,4-β-D-gluco-oligosaccharides. Conclusions This study has provided definitive genome-based evidence to support the fermentation patterns of nine strains of Lactobacillus ruminis, and has linked them to gene distribution patterns in strains from different sources.

  6. Dictionary-driven protein annotation.

    Science.gov (United States)

    Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel

    2002-09-01

    Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were

  7. Assembly and annotation of full mitochondrial genomes for the corn rootworm species, Diabrotica virgifera virgifera and D. barberi (Insecta: Coleoptera: Chrysomelidae), using Next Generation Sequence data

    Science.gov (United States)

    Complete mitochondrial genomes for two corn rootworm species, Diabrotica v. virgifera (16,747 bp) and D. barberi (16,632 bp; Insecta: Coleoptera: Chrysomelidae), were assembled from Illumina HiSeq2000 read data. Annotation indicated that the order and orientation of 13 protein coding genes (PCGs), and...

  8. Genome sequencing and annotation of Acinetobacter gerneri strain MTCC 9824T

    Directory of Open Access Journals (Sweden)

    Nitin Kumar Singh

    2014-12-01

    The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 4.4 Mb genome of Acinetobacter gerneri strain MTCC 9824T. The genome has a G + C content of 38.0% and includes 3 rRNA genes (5S, 23S, 16S) and 64 aminoacyl-tRNA synthetase genes.
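
    The G + C content reported in genome announcements like the one above is the percentage of guanine and cytosine among the unambiguous bases of the sequence. A minimal sketch of that calculation follows. The function name is hypothetical, and skipping ambiguity codes such as N is one common convention assumed here, not something stated in the record.

```python
def gc_content(sequence: str) -> float:
    """Return the G + C percentage of a DNA sequence.

    Only unambiguous bases (A, C, G, T) are counted; ambiguity codes such
    as N are skipped, which is an assumption of this sketch.
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


# Toy sequence for illustration only, not data from any genome listed here.
print(round(gc_content("ATGCGCTANN"), 1))
```

    For a multi-contig draft genome, the same ratio would be taken over the concatenated contigs rather than averaged per contig.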

  9. Genome sequencing and annotation of Afipia septicemium strain OHSU_II

    Directory of Open Access Journals (Sweden)

    Philip Yang

    2014-12-01

    We report the 5.1 Mb noncontiguous draft genome of Afipia septicemium strain OHSU_II, isolated from blood of a female patient. The genome consists of a 5,087,893 bp circular chromosome with no identifiable autonomous plasmid, has a G + C content of 61.09%, and contains 4898 protein-coding genes and 49 RNA genes including 3 rRNA genes and 46 tRNA genes.

  10. Genome sequencing and annotation of Acinetobacter gyllenbergii strain MTCC 11365T

    Directory of Open Access Journals (Sweden)

    Nitin Kumar Singh

    2014-12-01

    The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 4.3 Mb genome of Acinetobacter gyllenbergii strain MTCC 11365T. The draft genome of A. gyllenbergii has a G + C content of 41.0% and includes 3 rRNA genes (5S, 23S, 16S) and 67 aminoacyl-tRNA synthetase genes.

  11. A hybrid approach for the automated finishing of bacterial genomes.

    Science.gov (United States)

    Bashir, Ali; Klammer, Aaron A; Robins, William P; Chin, Chen-Shan; Webster, Dale; Paxinos, Ellen; Hsu, David; Ashby, Meredith; Wang, Susana; Peluso, Paul; Sebra, Robert; Sorenson, Jon; Bullard, James; Yen, Jackie; Valdovino, Marie; Mollova, Emilia; Luong, Khai; Lin, Steven; LaMay, Brianna; Joshi, Amruta; Rowe, Lori; Frace, Michael; Tarr, Cheryl L; Turnsek, Maryann; Davis, Brigid M; Kasarskis, Andrew; Mekalanos, John J; Waldor, Matthew K; Schadt, Eric E

    2012-07-01

    Advances in DNA sequencing technology have improved our ability to characterize most genomic diversity. However, accurate resolution of large structural events is challenging because of the short read lengths of second-generation technologies. Third-generation sequencing technologies, which can yield longer multikilobase reads, have the potential to address limitations associated with genome assembly. Here we combine sequencing data from second- and third-generation DNA sequencing technologies to assemble the two-chromosome genome of a recent Haitian cholera outbreak strain into two nearly finished contigs at >99.9% accuracy. Complex regions with clinically relevant structure were completely resolved. In separate control assemblies on experimental and simulated data for the canonical N16961 cholera reference strain, we obtained 14 scaffolds of greater than 1 kb for the experimental data and 8 scaffolds of greater than 1 kb for the simulated data, which allowed us to correct several errors in contigs assembled from the short-read data alone. This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly.

  12. Genome sequencing and annotation of Acinetobacter guillouiae strain MSP 4-18

    Directory of Open Access Journals (Sweden)

    Nitin Kumar Singh

    2014-12-01

    The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 4.8 Mb genome of Acinetobacter guillouiae MSP 4-18, isolated from a mangrove soil sample from Parangipettai (11°30′N, 79°47′E), Tamil Nadu, India. The draft genome of A. guillouiae MSP 4-18 has a G + C content of 38.0% and includes 3 rRNA genes (5S, 23S, 16S) and 69 aminoacyl-tRNA synthetase genes.

  13. neXtA5: accelerating annotation of articles via automated approaches in neXtProt.

    Science.gov (United States)

    Mottin, Luc; Gobeill, Julien; Pasche, Emilie; Michel, Pierre-André; Cusin, Isabelle; Gaudet, Pascale; Ruch, Patrick

    2016-01-01

    The rapid increase in the number of published articles poses a challenge for curated databases to remain up-to-date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. Our system, neXtA5, is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE over some predefined axes. This report focuses on three axes: Diseases, the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named-entities as stored in the Biomed database. The two ranked lists are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvement is the following: +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Process when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular protein-protein interactions, which require specific relationship extraction capabilities. In parallel, user-friendly interfaces powered with a set of JSON web services are currently being
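
    The merging step described in the abstract above, a linear combination of two ranked lists evaluated by mean reciprocal rank, can be sketched as follows. This is an illustration under stated assumptions, not the neXtA5 implementation: the function names, the PMID-to-score dictionaries, and the weight `alpha` are hypothetical, and both score sets are assumed to be already normalized to the same range.

```python
def merge_rankings(search_scores, axis_scores, alpha=0.5):
    """Linearly combine two PMID -> score maps and return PMIDs, best first.

    alpha weights the search-engine score and (1 - alpha) the axis-specific
    score; a PMID missing from one map contributes 0.0 for that map.
    """
    pmids = set(search_scores) | set(axis_scores)
    combined = {
        pmid: alpha * search_scores.get(pmid, 0.0)
        + (1 - alpha) * axis_scores.get(pmid, 0.0)
        for pmid in pmids
    }
    # Sort by descending combined score; break ties by PMID for determinism.
    return sorted(combined, key=lambda pmid: (-combined[pmid], pmid))


def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Mean over queries of 1 / rank of the first relevant item (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, pmid in enumerate(ranked, start=1):
            if pmid in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Hypothetical scores for three made-up PMIDs.
search = {"pmid_a": 1.0, "pmid_b": 0.5}
axis = {"pmid_b": 1.0, "pmid_c": 0.2}
merged = merge_rankings(search, axis, alpha=0.5)
print(merged)
print(mean_reciprocal_rank([merged], [{"pmid_a"}]))
```

    Tuning `alpha` separately for each annotation axis would correspond to the per-axis coefficient fitting the abstract mentions.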

  14. Functional annotation of rare gene aberration drivers of pancreatic cancer | Office of Cancer Genomics

    Science.gov (United States)

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC).

  15. Mapping and annotating obesity-related genes in pig and human genomes.

    Science.gov (United States)

    Martelli, Pier Luigi; Fontanesi, Luca; Piovesan, Damiano; Fariselli, Piero; Casadio, Rita

    2014-01-01

    Background. Obesity is a major health problem in both developed and emerging countries. Obesity is a complex disease whose etiology involves genetic factors in strong interplay with environmental determinants and lifestyle. The discovery of genetic factors and biological pathways underlying human obesity is hampered by the difficulty in controlling the genetic background of human cohorts. Animal models are then necessary to further dissect the genetics of obesity. Pig has emerged as one of the most attractive models, because of the similarity with humans in the mechanisms regulating the fat deposition. Results. We collected the genes related to obesity in humans and to fat deposition traits in pig. We localized them on both human and pig genomes, building a map useful to interpret comparative studies on obesity. We characterized the collected genes structurally and functionally with BAR+ and mapped them on KEGG pathways and on STRING protein interaction network. Conclusions. The collected set consists of 361 obesity related genes in human and pig genomes. All genes were mapped on the human genome, and 54 could not be localized on the pig genome (release 2012). Only for 3 human genes there is no counterpart in pig, confirming that this animal is a good model for human obesity studies. Obesity related genes are mostly involved in regulation and signaling processes/pathways and relevant connection emerges between obesity-related genes and diseases such as cancer and infectious diseases.

  16. Automated alignment-based curation of gene models in filamentous fungi

    OpenAIRE

    2014-01-01

    Background Automated gene-calling is still an error-prone process, particularly for the highly plastic genomes of fungal species. Improvement through quality control and manual curation of gene models is a time-consuming process that requires skilled biologists and is only marginally performed. The wealth of available fungal genomes has not yet been exploited by an automated method that applies quality control of gene models in order to obtain more accurate genome annotations. Results We prov...

  17. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions

    Science.gov (United States)

    Han, Ying; Hazelett, Dennis J.; Wiklund, Fredrik; Schumacher, Fredrick R.; Stram, Daniel O.; Berndt, Sonja I.; Wang, Zhaoming; Rand, Kristin A.; Hoover, Robert N.; Machiela, Mitchell J.; Yeager, Merideth; Burdette, Laurie; Chung, Charles C.; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C.; Key, Timothy J.; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L.; Kolb, Suzanne; Gapstur, Susan M.; Diver, W. Ryan; Stevens, Victoria L.; Strom, Sara S.; Pettaway, Curtis A.; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A.; Yeboah, Edward D.; Tettey, Yao; Biritwum, Richard B.; Adjei, Andrew A.; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P.; Isaacs, William B.; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L.; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M.; Ingles, Sue A.; Kittles, Rick A.; Murphy, Adam B.; Blot, William J.; Signorello, Lisa B.; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M. Cristina; Wu, Suh-Yuh; Hennis, Anselm J. M.; Rybicki, Benjamin A.; Neslund-Dudas, Christine; Hsing, Ann W.; Chu, Lisa; Goodman, Phyllis J.; Klein, Eric A.; Zheng, S. Lilly; Witte, John S.; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L.; Hunter, David J.; Gronberg, Henrik; Cook, Michael B.; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J.; Easton, Douglas F.; Henderson, Brian E.; Coetzee, Gerhard A.; Conti, David V.; Haiman, Christopher A.

    2015-01-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10⁻⁴ to 5.6 × 10⁻³) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10⁻⁶) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation. PMID:26162851

  18. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions.

    Science.gov (United States)

    Han, Ying; Hazelett, Dennis J; Wiklund, Fredrik; Schumacher, Fredrick R; Stram, Daniel O; Berndt, Sonja I; Wang, Zhaoming; Rand, Kristin A; Hoover, Robert N; Machiela, Mitchell J; Yeager, Merideth; Burdette, Laurie; Chung, Charles C; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C; Key, Timothy J; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L; Kolb, Suzanne; Gapstur, Susan M; Diver, W Ryan; Stevens, Victoria L; Strom, Sara S; Pettaway, Curtis A; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A; Yeboah, Edward D; Tettey, Yao; Biritwum, Richard B; Adjei, Andrew A; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P; Isaacs, William B; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M; Ingles, Sue A; Kittles, Rick A; Murphy, Adam B; Blot, William J; Signorello, Lisa B; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M Cristina; Wu, Suh-Yuh; Hennis, Anselm J M; Rybicki, Benjamin A; Neslund-Dudas, Christine; Hsing, Ann W; Chu, Lisa; Goodman, Phyllis J; Klein, Eric A; Zheng, S Lilly; Witte, John S; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L; Hunter, David J; Gronberg, Henrik; Cook, Michael B; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J; Easton, Douglas F; Henderson, Brian E; Coetzee, Gerhard A; Conti, David V; Haiman, Christopher A

    2015-10-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10⁻⁴ to 5.6 × 10⁻³) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10⁻⁶) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation.

  19. Emerging applications of read profiles towards the functional annotation of the genome

    DEFF Research Database (Denmark)

    Pundhir, Sachin; Poirazi, Panayiota; Gorodkin, Jan

    2015-01-01

    is typically a result of the protocol designed to address specific research questions. The sequencing results in reads, which when mapped to a reference genome often leads to the formation of distinct patterns (read profiles). Interpretation of these read profiles is essential for their analysis in relation...... to the research question addressed. Several strategies have been employed at varying levels of abstraction ranging from a somewhat ad hoc to a more systematic analysis of read profiles. These include methods which can compare read profiles, e.g., from direct (non-sequence based) alignments to classification...

  20. Comparisons of Shewanella strains based on genome annotations, modeling and experiments

    Energy Technology Data Exchange (ETDEWEB)

    Ong, Wai Kit; Vu, Trang; Lovendahl, Klaus N.; Llull, Jenna; Serres, Margaret; Romine, Margaret F.; Reed, Jennifer L.

    2014-01-01

    Shewanella is a genus of facultatively anaerobic, Gram-negative bacteria that have highly adaptable metabolism which allows them to thrive in diverse environments. This quality makes them attractive target bacteria for research in bioremediation and microbial fuel cell applications. Constraint-based modeling is a useful tool for helping researchers gain insights into the metabolic capabilities of these bacteria. However, Shewanella oneidensis MR-1 is the only strain with a genome-scale metabolic model constructed out of the 22 sequenced Shewanella strains.

  1. Genome sequencing and annotation of Geobacillus sp. 1017, a hydrocarbon-oxidizing thermophilic bacterium isolated from a heavy oil reservoir (China)

    Directory of Open Access Journals (Sweden)

    Vitaly V. Kadnikov

    2017-03-01

    The draft genome sequence of Geobacillus sp. strain 1017, a thermophilic aerobic oil-oxidizing bacterium isolated from formation water of the Dagang high-temperature oilfield, China, is presented here. The genome comprised 3.6 Mbp, with a G + C content of 51.74%. The strain had a number of genes responsible for numerous metabolic and transport systems, exopolysaccharide biosynthesis, and decomposition of sugars and aromatic compounds, as well as genes related to resistance to metals and metalloids. The genome sequence is available at DDBJ/EMBL/GenBank under the accession no. MQMG00000000. This genome is annotated for elucidation of the genomic and phenotypic diversity of new thermophilic alkane-oxidizing bacteria of the genus Geobacillus.

  2. “Controlled, cross-species dataset for exploring biases in genome annotation and modification profiles”

    Directory of Open Access Journals (Sweden)

    Alison McAfee

    2015-12-01

    Since the sequencing of the honey bee genome, proteomics by mass spectrometry has become increasingly popular for biological analyses of this insect; but we have observed that the number of honey bee protein identifications is consistently low compared to other organisms [1]. In this dataset, we use nanoelectrospray ionization-coupled liquid chromatography–tandem mass spectrometry (nLC–MS/MS) to systematically investigate the root cause of low honey bee proteome coverage. To this end, we present here data from three key experiments: a controlled, cross-species analysis of samples from Apis mellifera, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Mus musculus and Homo sapiens; a proteomic analysis of an individual honey bee whose genome was also sequenced; and a cross-tissue honey bee proteome comparison. The cross-species dataset was interrogated to determine relative proteome coverages between species, and the other two datasets were used to search for polymorphic sequences and to compare protein cleavage profiles, respectively.

  3. Optimizing high performance computing workflow for protein functional annotation.

    Science.gov (United States)

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

  4. Leveraging Genomic Annotations and Pleiotropic Enrichment for Improved Replication Rates in Schizophrenia GWAS

    DEFF Research Database (Denmark)

    Wang, Yunpeng; Thompson, Wesley K; Schork, Andrew J;

    2016-01-01

    , pleiotropy) for each single nucleotide polymorphism (SNP) to enable more accurate estimation of replication probabilities, conditional on the observed test statistic ("z-score") of the SNP. We use a multiple logistic regression on z-scores to combine information from auxiliary information to derive...... a "relative enrichment score" for each SNP. For each stratum of these relative enrichment scores, we obtain nonparametric estimates of posterior expected test statistics and replication probabilities as a function of discovery z-scores, using a resampling-based approach that repeatedly and randomly partitions...... to the recent genome-wide association study (GWAS) of SCZ (n = 82,315), obtaining a good fit between the model-based and observed effect sizes and replication probabilities. We observed that SNPs with low enrichment scores replicate with a lower probability than SNPs with high enrichment scores even when both...

  5. Rapid high resolution genotyping of Francisella tularensis by whole genome sequence comparison of annotated genes ("MLST+").

    Directory of Open Access Journals (Sweden)

    Markus H Antwerpen

    The zoonotic disease tularemia is caused by the bacterium Francisella tularensis. This pathogen is considered a category A select agent with potential to be misused in bioterrorism. Molecular typing based on DNA sequence, such as canSNP typing or MLVA, has become the accepted standard for this organism. Due to the organism's highly clonal nature, the current typing methods have reached their limit of discrimination for classifying closely related subpopulations within the subspecies F. tularensis ssp. holarctica. We introduce a new gene-by-gene approach, MLST+, based on whole genome data of 15 sequenced F. tularensis ssp. holarctica strains and apply this approach to investigate an epidemic of lethal tularemia among non-human primates in two animal facilities in Germany. Due to the high resolution of MLST+ we are able to demonstrate that three independent clones of this highly infectious pathogen were responsible for these spatially and temporally restricted outbreaks.

  6. Annotation of loci from genome-wide association studies using tissue-specific quantitative interaction proteomics

    DEFF Research Database (Denmark)

    Lundby, Alicia; Rossin, Elizabeth J.; Steffensen, Annette B.;

    2014-01-01

    Genome-wide association studies (GWAS) have identified thousands of loci associated with complex traits, but it is challenging to pinpoint causal genes in these loci and to exploit subtle association signals. We used tissue-specific quantitative interaction proteomics to map a network of five genes...... involved in the Mendelian disorder long QT syndrome (LQTS). We integrated the LQTS network with GWAS loci from the corresponding common complex trait, QT-interval variation, to identify candidate genes that were subsequently confirmed in Xenopus laevis oocytes and zebrafish. We used the LQTS protein...... to propose candidates in GWAS loci for functional studies and to systematically filter subtle association signals using tissue-specific quantitative interaction proteomics....

  7. Correcting Inconsistencies and Errors in Bacterial Genome Metadata Using an Automated Curation Tool in Excel (AutoCurE).

    Science.gov (United States)

    Schmedes, Sarah E; King, Jonathan L; Budowle, Bruce

    2015-01-01

    Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes quickly and with relative ease, at a substantially reduced cost per nucleotide and hence per genome. More than 3,000 bacterial genomes have been sequenced and are available at the finished status. Publicly available genomes can be readily downloaded; however, it is challenging to verify the specific supporting data contained within the download and to identify errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local database curation of supporting data that accompany downloaded genomes from the National Center for Biotechnology Information. AutoCurE provides an automated approach to curate local genomic databases by flagging inconsistencies or errors, comparing the downloaded supporting data to the genome reports to verify genome name, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and genome reports and if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for local database curation for large-scale genome data prior to downstream analyses.
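
    The flagging idea described in this record can be illustrated with a minimal sketch. This is not AutoCurE itself (which runs in Excel); the field names, the dictionary layout, and the case-insensitive comparison are all assumptions made for illustration only.

```python
# Hypothetical sketch of metadata flagging; field names are illustrative,
# not the nine fields that AutoCurE actually checks.
def flag_record(downloaded, report,
                fields=("genome_name", "refseq_accession", "bioproject_id")):
    """Compare a downloaded genome's metadata with its genome-report entry.

    Returns a list of (field, reason) tuples, where reason is
    "missing" or "mismatch".
    """
    flags = []
    for field in fields:
        got = downloaded.get(field)
        expected = report.get(field)
        if not got or not expected:
            # Either side lacks a value, so the field cannot be verified.
            flags.append((field, "missing"))
        elif got.strip().lower() != expected.strip().lower():
            # Assumption: comparison ignores case and surrounding whitespace.
            flags.append((field, "mismatch"))
    return flags
```

    A record with no flags would pass through to downstream analysis unchanged; any flagged field would be reviewed by hand.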

  8. Genetic fine-mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci

    Science.gov (United States)

    Mahajan, Anubha; Locke, Adam; Rayner, N William; Robertson, Neil; Scott, Robert A; Prokopenko, Inga; Scott, Laura J; Green, Todd; Sparso, Thomas; Thuillier, Dorothee; Yengo, Loic; Grallert, Harald; Wahl, Simone; Frånberg, Mattias; Strawbridge, Rona J; Kestler, Hans; Chheda, Himanshu; Eisele, Lewin; Gustafsson, Stefan; Steinthorsdottir, Valgerdur; Thorleifsson, Gudmar; Qi, Lu; Karssen, Lennart C; van Leeuwen, Elisabeth M; Willems, Sara M; Li, Man; Chen, Han; Fuchsberger, Christian; Kwan, Phoenix; Ma, Clement; Linderman, Michael; Lu, Yingchang; Thomsen, Soren K; Rundle, Jana K; Beer, Nicola L; van de Bunt, Martijn; Chalisey, Anil; Kang, Hyun Min; Voight, Benjamin F; Abecasis, Goncalo R; Almgren, Peter; Baldassarre, Damiano; Balkau, Beverley; Benediktsson, Rafn; Blüher, Matthias; Boeing, Heiner; Bonnycastle, Lori L; Borringer, Erwin P; Burtt, Noël P; Carey, Jason; Charpentier, Guillaume; Chines, Peter S; Cornelis, Marilyn C; Couper, David J; Crenshaw, Andrew T; van Dam, Rob M; Doney, Alex SF; Dorkhan, Mozhgan; Edkins, Sarah; Eriksson, Johan G; Esko, Tonu; Eury, Elodie; Fadista, João; Flannick, Jason; Fontanillas, Pierre; Fox, Caroline; Franks, Paul W; Gertow, Karl; Gieger, Christian; Gigante, Bruna; Gottesman, Omri; Grant, George B; Grarup, Niels; Groves, Christopher J; Hassinen, Maija; Have, Christian T; Herder, Christian; Holmen, Oddgeir L; Hreidarsson, Astradur B; Humphries, Steve E; Hunter, David J; Jackson, Anne U; Jonsson, Anna; Jørgensen, Marit E; Jørgensen, Torben; Kerrison, Nicola D; Kinnunen, Leena; Klopp, Norman; Kong, Augustine; Kovacs, Peter; Kraft, Peter; Kravic, Jasmina; Langford, Cordelia; Leander, Karin; Liang, Liming; Lichtner, Peter; Lindgren, Cecilia M; Lindholm, Eero; Linneberg, Allan; Liu, Ching-Ti; Lobbens, Stéphane; Luan, Jian’an; Lyssenko, Valeriya; Männistö, Satu; McLeod, Olga; Meyer, Julia; Mihailov, Evelin; Mirza, Ghazala; Mühleisen, Thomas W; Müller-Nurasyid, Martina; Navarro, Carmen; Nöthen, Markus M; Oskolkov, Nikolay N; Owen, 
Katharine R; Palli, Domenico; Pechlivanis, Sonali; Perry, John RB; Platou, Carl GP; Roden, Michael; Ruderfer, Douglas; Rybin, Denis; van der Schouw, Yvonne T; Sennblad, Bengt; Sigurðsson, Gunnar; Stančáková, Alena; Steinbach, Gerald; Storm, Petter; Strauch, Konstantin; Stringham, Heather M; Sun, Qi; Thorand, Barbara; Tikkanen, Emmi; Tonjes, Anke; Trakalo, Joseph; Tremoli, Elena; Tuomi, Tiinamaija; Wennauer, Roman; Wood, Andrew R; Zeggini, Eleftheria; Dunham, Ian; Birney, Ewan; Pasquali, Lorenzo; Ferrer, Jorge; Loos, Ruth JF; Dupuis, Josée; Florez, Jose C; Boerwinkle, Eric; Pankow, James S; van Duijn, Cornelia; Sijbrands, Eric; Meigs, James B; Hu, Frank B; Thorsteinsdottir, Unnur; Stefansson, Kari; Lakka, Timo A; Rauramaa, Rainer; Stumvoll, Michael; Pedersen, Nancy L; Lind, Lars; Keinanen-Kiukaanniemi, Sirkka M; Korpi-Hyövälti, Eeva; Saaristo, Timo E; Saltevo, Juha; Kuusisto, Johanna; Laakso, Markku; Metspalu, Andres; Erbel, Raimund; Jöckel, Karl-Heinz; Moebus, Susanne; Ripatti, Samuli; Salomaa, Veikko; Ingelsson, Erik; Boehm, Bernhard O; Bergman, Richard N; Collins, Francis S; Mohlke, Karen L; Koistinen, Heikki; Tuomilehto, Jaakko; Hveem, Kristian; Njølstad, Inger; Deloukas, Panagiotis; Donnelly, Peter J; Frayling, Timothy M; Hattersley, Andrew T; de Faire, Ulf; Hamsten, Anders; Illig, Thomas; Peters, Annette; Cauchi, Stephane; Sladek, Rob; Froguel, Philippe; Hansen, Torben; Pedersen, Oluf; Morris, Andrew D; Palmer, Collin NA; Kathiresan, Sekar; Melander, Olle; Nilsson, Peter M; Groop, Leif C; Barroso, Inês; Langenberg, Claudia; Wareham, Nicholas J; O’Callaghan, Christopher A; Gloyn, Anna L; Altshuler, David; Boehnke, Michael; Teslovich, Tanya M; McCarthy, Mark I; Morris, Andrew P

    2015-01-01

    We performed fine-mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in/near KCNQ1. “Credible sets” of variants most likely to drive each distinct signal mapped predominantly to non-coding sequence, implying that T2D association is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine-mapping implicated rs10830963 as driving T2D association. We confirmed that this T2D-risk allele increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D-risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease. PMID:26551672

  9. Genetic fine mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci.

    Science.gov (United States)

    Gaulton, Kyle J; Ferreira, Teresa; Lee, Yeji; Raimondo, Anne; Mägi, Reedik; Reschen, Michael E; Mahajan, Anubha; Locke, Adam; Rayner, N William; Robertson, Neil; Scott, Robert A; Prokopenko, Inga; Scott, Laura J; Green, Todd; Sparso, Thomas; Thuillier, Dorothee; Yengo, Loic; Grallert, Harald; Wahl, Simone; Frånberg, Mattias; Strawbridge, Rona J; Kestler, Hans; Chheda, Himanshu; Eisele, Lewin; Gustafsson, Stefan; Steinthorsdottir, Valgerdur; Thorleifsson, Gudmar; Qi, Lu; Karssen, Lennart C; van Leeuwen, Elisabeth M; Willems, Sara M; Li, Man; Chen, Han; Fuchsberger, Christian; Kwan, Phoenix; Ma, Clement; Linderman, Michael; Lu, Yingchang; Thomsen, Soren K; Rundle, Jana K; Beer, Nicola L; van de Bunt, Martijn; Chalisey, Anil; Kang, Hyun Min; Voight, Benjamin F; Abecasis, Gonçalo R; Almgren, Peter; Baldassarre, Damiano; Balkau, Beverley; Benediktsson, Rafn; Blüher, Matthias; Boeing, Heiner; Bonnycastle, Lori L; Bottinger, Erwin P; Burtt, Noël P; Carey, Jason; Charpentier, Guillaume; Chines, Peter S; Cornelis, Marilyn C; Couper, David J; Crenshaw, Andrew T; van Dam, Rob M; Doney, Alex S F; Dorkhan, Mozhgan; Edkins, Sarah; Eriksson, Johan G; Esko, Tonu; Eury, Elodie; Fadista, João; Flannick, Jason; Fontanillas, Pierre; Fox, Caroline; Franks, Paul W; Gertow, Karl; Gieger, Christian; Gigante, Bruna; Gottesman, Omri; Grant, George B; Grarup, Niels; Groves, Christopher J; Hassinen, Maija; Have, Christian T; Herder, Christian; Holmen, Oddgeir L; Hreidarsson, Astradur B; Humphries, Steve E; Hunter, David J; Jackson, Anne U; Jonsson, Anna; Jørgensen, Marit E; Jørgensen, Torben; Kao, Wen-Hong L; Kerrison, Nicola D; Kinnunen, Leena; Klopp, Norman; Kong, Augustine; Kovacs, Peter; Kraft, Peter; Kravic, Jasmina; Langford, Cordelia; Leander, Karin; Liang, Liming; Lichtner, Peter; Lindgren, Cecilia M; Lindholm, Eero; Linneberg, Allan; Liu, Ching-Ti; Lobbens, Stéphane; Luan, Jian'an; Lyssenko, Valeriya; Männistö, Satu; McLeod, Olga; Meyer, Julia; Mihailov, Evelin; Mirza, Ghazala; 
Mühleisen, Thomas W; Müller-Nurasyid, Martina; Navarro, Carmen; Nöthen, Markus M; Oskolkov, Nikolay N; Owen, Katharine R; Palli, Domenico; Pechlivanis, Sonali; Peltonen, Leena; Perry, John R B; Platou, Carl G P; Roden, Michael; Ruderfer, Douglas; Rybin, Denis; van der Schouw, Yvonne T; Sennblad, Bengt; Sigurðsson, Gunnar; Stančáková, Alena; Steinbach, Gerald; Storm, Petter; Strauch, Konstantin; Stringham, Heather M; Sun, Qi; Thorand, Barbara; Tikkanen, Emmi; Tonjes, Anke; Trakalo, Joseph; Tremoli, Elena; Tuomi, Tiinamaija; Wennauer, Roman; Wiltshire, Steven; Wood, Andrew R; Zeggini, Eleftheria; Dunham, Ian; Birney, Ewan; Pasquali, Lorenzo; Ferrer, Jorge; Loos, Ruth J F; Dupuis, Josée; Florez, Jose C; Boerwinkle, Eric; Pankow, James S; van Duijn, Cornelia; Sijbrands, Eric; Meigs, James B; Hu, Frank B; Thorsteinsdottir, Unnur; Stefansson, Kari; Lakka, Timo A; Rauramaa, Rainer; Stumvoll, Michael; Pedersen, Nancy L; Lind, Lars; Keinanen-Kiukaanniemi, Sirkka M; Korpi-Hyövälti, Eeva; Saaristo, Timo E; Saltevo, Juha; Kuusisto, Johanna; Laakso, Markku; Metspalu, Andres; Erbel, Raimund; Jöcke, Karl-Heinz; Moebus, Susanne; Ripatti, Samuli; Salomaa, Veikko; Ingelsson, Erik; Boehm, Bernhard O; Bergman, Richard N; Collins, Francis S; Mohlke, Karen L; Koistinen, Heikki; Tuomilehto, Jaakko; Hveem, Kristian; Njølstad, Inger; Deloukas, Panagiotis; Donnelly, Peter J; Frayling, Timothy M; Hattersley, Andrew T; de Faire, Ulf; Hamsten, Anders; Illig, Thomas; Peters, Annette; Cauchi, Stephane; Sladek, Rob; Froguel, Philippe; Hansen, Torben; Pedersen, Oluf; Morris, Andrew D; Palmer, Collin N A; Kathiresan, Sekar; Melander, Olle; Nilsson, Peter M; Groop, Leif C; Barroso, Inês; Langenberg, Claudia; Wareham, Nicholas J; O'Callaghan, Christopher A; Gloyn, Anna L; Altshuler, David; Boehnke, Michael; Teslovich, Tanya M; McCarthy, Mark I; Morris, Andrew P

    2015-12-01

    We performed fine mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in or near KCNQ1. 'Credible sets' of the variants most likely to drive each distinct signal mapped predominantly to noncoding sequence, implying that association with T2D is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine mapping implicated rs10830963 as driving T2D association. We confirmed that the T2D risk allele for this SNP increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease.

  10. VariOtator, a Software Tool for Variation Annotation with the Variation Ontology.

    Science.gov (United States)

    Schaafsma, Gerard C P; Vihinen, Mauno

    2016-04-01

    The Variation Ontology (VariO) is used for describing and annotating types, effects, consequences, and mechanisms of variations. To facilitate easy and consistent annotations, the online application VariOtator was developed. For variation type annotations, VariOtator is fully automated, accepting variant descriptions in Human Genome Variation Society (HGVS) format, and generating VariO terms, either with or without full lineage, that is, all parent terms. When a coding DNA variant description with a reference sequence is provided, VariOtator checks the description first with Mutalyzer and then generates the predicted RNA and protein descriptions with their respective VariO annotations. For the other sublevels, function, structure, and property, annotations cannot be automated, and VariOtator generates annotation based on provided details. For VariO terms relating to structure and property, one can use attribute terms as modifiers and evidence code terms for annotating experimental evidence. There is an online batch version, and stand-alone batch versions to be used with a Leiden Open Variation Database (LOVD) download file. A SOAP Web service allows client programs to access VariOtator programmatically. Thus, systematic variation effect and type annotations can be efficiently generated to allow easy use and integration of variations and their consequences.

  11. Inconsistencies of genome annotations in apicomplexan parasites revealed by 5'-end-one-pass and full-length sequences of oligo-capped cDNAs

    Directory of Open Access Journals (Sweden)

    Sugano Sumio

    2009-07-01

    Background Apicomplexan parasites are causative agents of various diseases including malaria and have been targets of extensive genomic sequencing. We generated 5'-EST collections for six apicomplexa parasites using our full-length oligo-capping cDNA library method. To improve upon the current genome annotations, as well as to validate the importance for physical cDNA clone resources, we generated a large-scale collection of full-length cDNAs for several apicomplexa parasites. Results In this study, we used a total of 61,056 5'-end-single-pass cDNA sequences from Plasmodium falciparum, P. vivax, P. yoelii, P. berghei, Cryptosporidium parvum, and Toxoplasma gondii. We compared these partially sequenced cDNA sequences with the currently annotated gene models and observed significant inconsistencies between the two datasets. In particular, we found that on average 14% of the exons in the current gene models were not supported by any cDNA evidence, and that 16% of the current gene models may contain at least one mis-annotation and should be re-evaluated. We also identified a large number of transcripts that had been previously unidentified. For 732 cDNAs in T. gondii, the entire sequences were determined in order to evaluate the annotated gene models at the complete full-length transcript level. We found that 41% of the T. gondii gene models contained at least one inconsistency. We also identified and confirmed by RT-PCR 140 previously unidentified transcripts found in the intergenic regions of the current gene annotations. We show that the majority of these discrepancies are due to questionable predictions of one or two extra exons in the upstream or downstream regions of the genes. Conclusion Our data indicate that the current gene models are likely to still be incomplete and have much room for improvement. Our unique full-length cDNA information is especially useful for further refinement of the annotations for the genomes of

  12. An integrated pipeline for next generation sequencing and annotation of the complete mitochondrial genome of the giant intestinal fluke, Fasciolopsis buski (Lankester, 1857) Looss, 1899.

    Science.gov (United States)

    Biswal, Devendra Kumar; Ghatani, Sudeep; Shylla, Jollin A; Sahu, Ranjana; Mullapudi, Nandita; Bhattacharya, Alok; Tandon, Veena

    2013-01-01

    Helminths include both parasitic nematodes (roundworms) and platyhelminths (trematode and cestode flatworms) that are abundant, and are of clinical importance. The genetic characterization of parasitic flatworms using advanced molecular tools is central to the diagnosis and control of infections. Although the nuclear genome houses suitable genetic markers (e.g., in ribosomal (r) DNA) for species identification and molecular characterization, the mitochondrial (mt) genome consistently provides a rich source of novel markers for informative systematics and epidemiological studies. In the last decade, there have been some important advances in mtDNA genomics of helminths, especially lung flukes, liver flukes and intestinal flukes. Fasciolopsis buski, often called the giant intestinal fluke, is one of the largest digenean trematodes infecting humans and found primarily in Asia, in particular the Indian subcontinent. Next-generation sequencing (NGS) technologies now provide opportunities for high throughput sequencing, assembly and annotation within a short span of time. Herein, we describe a high-throughput sequencing and bioinformatics pipeline for mt genomics for F. buski that emphasizes the utility of short read NGS platforms such as Ion Torrent and Illumina in successfully sequencing and assembling the mt genome using innovative approaches for PCR primer design as well as assembly. We took advantage of our NGS whole genome sequence data (unpublished so far) for F. buski and its comparison with available data for the Fasciola hepatica mtDNA as the reference genome for design of precise and specific primers for amplification of mt genome sequences from F. buski. A long-range PCR was carried out to create an NGS library enriched in mt DNA sequences. Two different NGS platforms were employed for complete sequencing, assembly and annotation of the F. buski mt genome. The complete mt genome sequences of the intestinal fluke comprise 14,118 bp and is thus the shortest

  13. An integrated pipeline for next generation sequencing and annotation of the complete mitochondrial genome of the giant intestinal fluke, Fasciolopsis buski (Lankester, 1857) Looss, 1899

    Directory of Open Access Journals (Sweden)

    Devendra Kumar Biswal

    2013-11-01

    Helminths include both parasitic nematodes (roundworms) and platyhelminths (trematode and cestode flatworms) that are abundant, and are of clinical importance. The genetic characterization of parasitic flatworms using advanced molecular tools is central to the diagnosis and control of infections. Although the nuclear genome houses suitable genetic markers (e.g., in ribosomal (r) DNA) for species identification and molecular characterization, the mitochondrial (mt) genome consistently provides a rich source of novel markers for informative systematics and epidemiological studies. In the last decade, there have been some important advances in mtDNA genomics of helminths, especially lung flukes, liver flukes and intestinal flukes. Fasciolopsis buski, often called the giant intestinal fluke, is one of the largest digenean trematodes infecting humans and found primarily in Asia, in particular the Indian subcontinent. Next-generation sequencing (NGS) technologies now provide opportunities for high throughput sequencing, assembly and annotation within a short span of time. Herein, we describe a high-throughput sequencing and bioinformatics pipeline for mt genomics for F. buski that emphasizes the utility of short read NGS platforms such as Ion Torrent and Illumina in successfully sequencing and assembling the mt genome using innovative approaches for PCR primer design as well as assembly. We took advantage of our NGS whole genome sequence data (unpublished so far) for F. buski and its comparison with available data for the Fasciola hepatica mtDNA as the reference genome for design of precise and specific primers for amplification of mt genome sequences from F. buski. A long-range PCR was carried out to create an NGS library enriched in mt DNA sequences. Two different NGS platforms were employed for complete sequencing, assembly and annotation of the F. buski mt genome. The complete mt genome sequences of the intestinal fluke comprise 14,118 bp and is thus the

  14. Automated LC-HRMS(/MS) approach for the annotation of fragment ions derived from stable isotope labeling-assisted untargeted metabolomics.

    Science.gov (United States)

    Neumann, Nora K N; Lehner, Sylvia M; Kluger, Bernhard; Bueschl, Christoph; Sedelmaier, Karoline; Lemmens, Marc; Krska, Rudolf; Schuhmacher, Rainer

    2014-08-05

    Structure elucidation of biological compounds is still a major bottleneck of untargeted LC-HRMS approaches in metabolomics research. The aim of the present study was to combine stable isotope labeling and tandem mass spectrometry for the automated interpretation of the elemental composition of fragment ions and thereby facilitate the structural characterization of metabolites. The software tool FragExtract was developed and evaluated with LC-HRMS/MS spectra of both native ¹²C- and uniformly ¹³C (U-¹³C)-labeled analytical standards of 10 fungal substances in pure solvent and spiked into fungal culture filtrate of Fusarium graminearum, respectively. Furthermore, the developed approach is exemplified with nine unknown biochemical compounds contained in F. graminearum samples derived from an untargeted metabolomics experiment. The mass difference between the corresponding fragment ions present in the MS/MS spectra of the native and U-¹³C-labeled compound enabled the assignment of the number of carbon atoms to each fragment signal and allowed the generation of meaningful putative molecular formulas for each fragment ion, which in turn also helped determine the elemental composition of the precursor ion. Compared to laborious manual analysis of the MS/MS spectra, the presented algorithm marks an important step toward efficient fragment signal elucidation and structure annotation of metabolites in future untargeted metabolomics studies. Moreover, as demonstrated for a fungal culture sample, FragExtract also assists the characterization of unknown metabolites, which are not contained in databases, and thus makes a significant contribution to untargeted metabolomics research.
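
    The carbon-counting step described in this record follows from simple arithmetic: each carbon in a uniformly labeled fragment shifts its mass by the ¹³C−¹²C difference (about 1.00335 Da). The sketch below is not FragExtract's code; the function name, the charge handling, and the 0.01 Da tolerance are assumptions chosen for illustration.

```python
# Mass difference between 13C and 12C in daltons.
C13_C12_DELTA = 1.0033548


def carbon_count(mz_native, mz_labeled, charge=1, tol=0.01):
    """Estimate the number of carbon atoms in a fragment ion from the m/z
    shift between its native and uniformly 13C-labeled forms.

    Returns None when the shift is not close to a whole number of carbons.
    The tolerance `tol` (in Da) is an illustrative assumption.
    """
    # Convert the m/z shift to a mass shift for multiply charged ions.
    shift = (mz_labeled - mz_native) * abs(charge)
    n = round(shift / C13_C12_DELTA)
    if n < 0 or abs(shift - n * C13_C12_DELTA) > tol:
        return None
    return n
```

    Knowing the carbon count of each fragment narrows the set of candidate molecular formulas, which is the benefit the abstract attributes to pairing native and labeled spectra.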

  15. Automated whole-genome multiple alignment of rat, mouse, and human

    Energy Technology Data Exchange (ETDEWEB)

    Brudno, Michael; Poliakov, Alexander; Salamov, Asaf; Cooper, Gregory M.; Sidow, Arend; Rubin, Edward M.; Solovyev, Victor; Batzoglou, Serafim; Dubchak, Inna

    2004-07-04

    We have built a whole genome multiple alignment of the three currently available mammalian genomes using a fully automated pipeline which combines the local/global approach of the Berkeley Genome Pipeline and the LAGAN program. The strategy is based on progressive alignment, and consists of two main steps: (1) alignment of the mouse and rat genomes; and (2) alignment of human to either the mouse-rat alignments from step 1, or the remaining unaligned mouse and rat sequences. The resulting alignments demonstrate high sensitivity, with 87% of all human gene-coding areas aligned in both mouse and rat. The specificity is also high: <7% of the rat contigs are aligned to multiple places in human and 97% of all alignments with human sequence > 100kb agree with a three-way synteny map built independently using predicted exons in the three genomes. At the nucleotide level <1% of the rat nucleotides are mapped to multiple places in the human sequence in the alignment; and 96.5% of human nucleotides within all alignments agree with the synteny map. The alignments are publicly available online, with visualization through the novel Multi-VISTA browser that we also present.

  16. Annotated English

    CERN Document Server

    Hernandez-Orallo, Jose

    2010-01-01

    This document presents Annotated English, a system of diacritical symbols which turns English pronunciation into a precise and unambiguous process. The annotations are defined and located in such a way that the original English text is not altered (not even a letter), thus allowing for a consistent reading and learning of the English language with and without annotations. The annotations are based on a set of general rules that keep the frequency of annotations from being dramatically high. This lets the reader easily associate annotations with exceptions, and makes it possible to shape, internalise and consolidate some rules for the English language which otherwise are weakened by the enormous number of exceptions in English pronunciation. The advantages of this annotation system are manifold. Any existing text can be annotated without a significant increase in size. This means that we can get an annotated version of any document or book with the same number of pages and font size. Since no letter is affected, the ...

  17. DEFINITION OF A SEMANTIC PLATAFORM FOR AUTOMATED CODE GENERATION BASED ON UML CLASS DIAGRAMS AND DSL SEMANTIC ANNOTATIONS

    Directory of Open Access Journals (Sweden)

    ANDRÉS MUÑETÓN

    2012-01-01

    This work proposes a semantic platform of services that implement the steps of a method for automatic code generation. The method is based on semantic information and on MDA (model-driven architecture). Code generation is achieved by semantically relating operations in UML (unified modeling language) class diagrams to implemented operations. The relation between operations is established by querying for implemented operations that have the same postcondition as the operation under implementation. The resulting code is a sequence of invocations of implemented operations that, taken together, achieve the postcondition of the operation under implementation. The semantics are specified through a DSL (domain-specific language), which is also defined in this article. The services of the platform and the method are tested through a case study.

  18. Functional annotations of diabetes nephropathy susceptibility loci through analysis of genome-wide renal gene expression in rat models of diabetes mellitus

    DEFF Research Database (Denmark)

    Hu, Yaomin; Kaisaki, Pamela J; Argoud, Karène;

    2009-01-01

    to hyperglycaemia and renal structural changes of positional candidate genes at selected diabetic nephropathy (DN) susceptibility loci. METHODS: Both Affymetrix and Illumina technologies were used to identify significant quantitative changes in the abundance of over 15,000 transcripts in kidney of models...... number of protein coding sequences of unknown function which can be considered as functional and, when they map to DN loci, positional candidates for DN. Further expression analysis of rat orthologs of human DN positional candidate genes provided functional annotations of known and novel genes...... that are responsive to hyperglycaemia and may contribute to renal functional and/or structural alterations. CONCLUSION: Combining transcriptomics in animal models and comparative genomics provides important information to improve functional annotations of disease susceptibility loci in humans and experimental support...

  19. AutoFACT: An Automatic Functional Annotation and Classification Tool

    Directory of Open Access Journals (Sweden)

    Lang B Franz

    2005-06-01

    Abstract Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at http://megasun.bch.umontreal.ca/Software/AutoFACT.htm.

  20. A Tool for Multiple Targeted Genome Deletions that Is Precise, Scar-Free, and Suitable for Automation.

    Science.gov (United States)

    Aubrey, Wayne; Riley, Michael C; Young, Michael; King, Ross D; Oliver, Stephen G; Clare, Amanda

    2015-01-01

    Many advances in synthetic biology require the removal of a large number of genomic elements from a genome. Most existing deletion methods leave behind markers, and as there are a limited number of markers, such methods can only be applied a fixed number of times. Deletion methods that recycle markers generally are either imprecise (remove untargeted sequences), or leave scar sequences which can cause genome instability and rearrangements. No existing marker recycling method is automation-friendly. We have developed a novel openly available deletion tool that consists of: 1) a method for deleting genomic elements that can be repeatedly used without limit, is precise, scar-free, and suitable for automation; and 2) software to design the method's primers. Our tool is sequence agnostic and could be used to delete large numbers of coding sequences, promoter regions, transcription factor binding sites, terminators, etc in a single genome. We have validated our tool on the deletion of non-essential open reading frames (ORFs) from S. cerevisiae. The tool is applicable to arbitrary genomes, and we provide primer sequences for the deletion of: 90% of the ORFs from the S. cerevisiae genome, 88% of the ORFs from S. pombe genome, and 85% of the ORFs from the L. lactis genome.

  1. A Tool for Multiple Targeted Genome Deletions that Is Precise, Scar-Free, and Suitable for Automation.

    Directory of Open Access Journals (Sweden)

    Wayne Aubrey

    Many advances in synthetic biology require the removal of a large number of genomic elements from a genome. Most existing deletion methods leave behind markers, and as there are a limited number of markers, such methods can only be applied a fixed number of times. Deletion methods that recycle markers generally are either imprecise (remove untargeted sequences), or leave scar sequences which can cause genome instability and rearrangements. No existing marker recycling method is automation-friendly. We have developed a novel openly available deletion tool that consists of: 1) a method for deleting genomic elements that can be repeatedly used without limit, is precise, scar-free, and suitable for automation; and 2) software to design the method's primers. Our tool is sequence agnostic and could be used to delete large numbers of coding sequences, promoter regions, transcription factor binding sites, terminators, etc in a single genome. We have validated our tool on the deletion of non-essential open reading frames (ORFs) from S. cerevisiae. The tool is applicable to arbitrary genomes, and we provide primer sequences for the deletion of: 90% of the ORFs from the S. cerevisiae genome, 88% of the ORFs from S. pombe genome, and 85% of the ORFs from the L. lactis genome.

  2. A Tool for Multiple Targeted Genome Deletions that Is Precise, Scar-Free, and Suitable for Automation

    Science.gov (United States)

    Aubrey, Wayne; Riley, Michael C.; Young, Michael; King, Ross D.; Oliver, Stephen G.; Clare, Amanda

    2015-01-01

    Many advances in synthetic biology require the removal of a large number of genomic elements from a genome. Most existing deletion methods leave behind markers, and as there are a limited number of markers, such methods can only be applied a fixed number of times. Deletion methods that recycle markers generally are either imprecise (remove untargeted sequences), or leave scar sequences which can cause genome instability and rearrangements. No existing marker recycling method is automation-friendly. We have developed a novel openly available deletion tool that consists of: 1) a method for deleting genomic elements that can be repeatedly used without limit, is precise, scar-free, and suitable for automation; and 2) software to design the method’s primers. Our tool is sequence agnostic and could be used to delete large numbers of coding sequences, promoter regions, transcription factor binding sites, terminators, etc in a single genome. We have validated our tool on the deletion of non-essential open reading frames (ORFs) from S. cerevisiae. The tool is applicable to arbitrary genomes, and we provide primer sequences for the deletion of: 90% of the ORFs from the S. cerevisiae genome, 88% of the ORFs from S. pombe genome, and 85% of the ORFs from the L. lactis genome. PMID:26630677

  3. Multiplex Automated Genome Engineering

    Institute of Scientific and Technical Information of China (English)

    李丹; 高海军

    2015-01-01

    Genome editing is widely used in genome engineering research. Site-specific nuclease technologies and the CRISPR/Cas system have contributed greatly to single-gene editing, but because genomes are very large, these technologies have certain limitations. Multiplex Automated Genome Engineering (MAGE) is a new, fast, and efficient genome editing technology that can act on multiple genes simultaneously and has been used for knockout and replacement of Escherichia coli genes. This review describes the principle of MAGE, its operating protocol, and recent technical advances, and discusses its development trends in light of its applications.

  4. Re-annotation of the physical map of Glycine max for polyploid-like regions by BAC end sequence driven whole genome shotgun read assembly

    Directory of Open Access Journals (Sweden)

    Shultz Jeffry

    2008-07-01

    Abstract Background Many of the world's most important food crops have either polyploid genomes or homeologous regions derived from segmental shuffling following polyploid formation. The soybean (Glycine max) genome has been shown to be composed of approximately four thousand short interspersed homeologous regions with 1, 2 or 4 copies per haploid genome by RFLP analysis, microsatellite anchors to BACs and by contigs formed from BAC fingerprints. Despite these similar regions, the genome has been sequenced by whole genome shotgun sequence (WGS). Here the aim was to use BAC end sequences (BES) derived from three minimum tile paths (MTP) to examine the extent and homogeneity of polyploid-like regions within contigs and the extent of correlation between the polyploid-like regions inferred from fingerprinting and the polyploid-like sequences inferred from WGS matches. Results Results show that when sequence divergence was 1–10%, the copy number of homeologous regions could be identified from sequence variation in WGS reads overlapping BES. Homeolog sequence variants (HSVs) were single nucleotide polymorphisms (SNPs; 89%) and single nucleotide indels (SNIs; 10%). Larger indels were rare but present (1%). Simulations that had predicted fingerprints of homeologous regions could be separated when divergence exceeded 2% were shown to be false. We show that a 5–10% sequence divergence is necessary to separate homeologs by fingerprinting. BES compared to WGS traces showed polyploid-like regions with less than 1% sequence divergence exist at 2.3% of the locations assayed. Conclusion The use of HSVs like SNPs and SNIs to characterize BACs will improve contig building methods. The implications for bioinformatic and functional annotation of polyploid and paleopolyploid genomes show that a combined approach of BAC fingerprint based physical maps, WGS sequence and HSV-based partitioning of BAC clones from homeologous regions to separate contigs will allow reliable de

  5. Semantic annotation of medical images

    Science.gov (United States)

    Seifert, Sascha; Kelm, Michael; Moeller, Manuel; Mukherjee, Saikat; Cavallaro, Alexander; Huber, Martin; Comaniciu, Dorin

    2010-03-01

    Diagnosis and treatment planning for patients can be significantly improved by comparing with clinical images of other patients with similar anatomical and pathological characteristics. This requires the images to be annotated using common vocabulary from clinical ontologies. Current approaches to such annotation are typically manual, consuming extensive clinician time, and cannot be scaled to large amounts of imaging data in hospitals. On the other hand, automated image analysis, while very scalable, does not leverage standardized semantics and thus cannot be used across specific applications. In our work, we describe an automated and context-sensitive workflow based on an image parsing system complemented by an ontology-based context-sensitive annotation tool. A unique characteristic of our framework is that it brings together the diverse paradigms of machine learning based image analysis and ontology based modeling for accurate and scalable semantic image annotation.

  6. Annotation of a hybrid partial genome of the Coffee Rust (Hemileia vastatrix) contributes to the gene repertoire catalogue of the Pucciniales

    Directory of Open Access Journals (Sweden)

    Marco Aurelio Cristancho

    2014-10-01

    Coffee leaf rust caused by the fungus Hemileia vastatrix is the most damaging disease to coffee worldwide. The pathogen has recently appeared in multiple outbreaks in coffee producing countries resulting in significant yield losses and increases in costs related to its control. New races/isolates are constantly emerging as evidenced by the presence of the fungus in plants that were previously resistant. Genomic studies are opening new avenues for the study of the evolution of pathogens, the detailed description of plant-pathogen interactions and the development of molecular techniques for the identification of individual isolates. For this purpose we sequenced 8 different H. vastatrix isolates using NGS technologies and gathered partial genome assemblies due to the large repetitive content in the coffee rust hybrid genome; 74.4% of the assembled contigs harbor repetitive sequences. A hybrid assembly of 333 Mb was built based on the 8 isolates; this assembly was used for subsequent analyses. Analysis of the conserved gene space showed that the hybrid H. vastatrix genome, though highly fragmented, had a satisfactory level of completion with 91.94% of core protein-coding orthologous genes present. RNA-Seq from urediniospores was used to guide the de novo annotation of the H. vastatrix gene complement. In total, 14,445 genes organized in 3,921 families were uncovered; a considerable proportion of the predicted proteins (73.8%) were homologous to other Pucciniales species genomes. Several gene families related to the fungal lifestyle were identified, particularly 483 predicted secreted proteins that represent candidate effector genes and will provide interesting hints to decipher virulence in the coffee rust fungus. The genome sequence of Hva will serve as a template to understand the molecular mechanisms used by this fungus to attack the coffee plant, to study the diversity of this species and for the development of molecular markers to distinguish

  7. Annotation of a hybrid partial genome of the coffee rust (Hemileia vastatrix) contributes to the gene repertoire catalog of the Pucciniales.

    Science.gov (United States)

    Cristancho, Marco A; Botero-Rozo, David Octavio; Giraldo, William; Tabima, Javier; Riaño-Pachón, Diego Mauricio; Escobar, Carolina; Rozo, Yomara; Rivera, Luis F; Durán, Andrés; Restrepo, Silvia; Eilam, Tamar; Anikster, Yehoshua; Gaitán, Alvaro L

    2014-01-01

    Coffee leaf rust caused by the fungus Hemileia vastatrix is the most damaging disease to coffee worldwide. The pathogen has recently appeared in multiple outbreaks in coffee producing countries resulting in significant yield losses and increases in costs related to its control. New races/isolates are constantly emerging as evidenced by the presence of the fungus in plants that were previously resistant. Genomic studies are opening new avenues for the study of the evolution of pathogens, the detailed description of plant-pathogen interactions and the development of molecular techniques for the identification of individual isolates. For this purpose we sequenced 8 different H. vastatrix isolates using NGS technologies and gathered partial genome assemblies due to the large repetitive content in the coffee rust hybrid genome; 74.4% of the assembled contigs harbor repetitive sequences. A hybrid assembly of 333 Mb was built based on the 8 isolates; this assembly was used for subsequent analyses. Analysis of the conserved gene space showed that the hybrid H. vastatrix genome, though highly fragmented, had a satisfactory level of completion with 91.94% of core protein-coding orthologous genes present. RNA-Seq from urediniospores was used to guide the de novo annotation of the H. vastatrix gene complement. In total, 14,445 genes organized in 3921 families were uncovered; a considerable proportion of the predicted proteins (73.8%) were homologous to other Pucciniales species genomes. Several gene families related to the fungal lifestyle were identified, particularly 483 predicted secreted proteins that represent candidate effector genes and will provide interesting hints to decipher virulence in the coffee rust fungus. The genome sequence of Hva will serve as a template to understand the molecular mechanisms used by this fungus to attack the coffee plant, to study the diversity of this species and for the development of molecular markers to distinguish races/isolates.

  8. Graph-based sequence annotation using a data integration approach.

    Science.gov (United States)

    Pesch, Robert; Lysenko, Artem; Hindle, Matthew; Hassani-Pak, Keywan; Thiele, Ralf; Rawlings, Christopher; Köhler, Jacob; Taubert, Jan

    2008-08-25

    The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database Ara-Cyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system which is freely available from http://ondex.sf.net/.

  9. Algal functional annotation tool

    Energy Technology Data Exchange (ETDEWEB)

    Lopez, D. [UCLA; Casero, D. [UCLA; Cokus, S. J. [UCLA; Merchant, S. S. [UCLA; Pellegrini, M. [UCLA

    2012-07-01

    The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion.

  10. Whole-Genome Sequencing and Annotation of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum annuum (Bird's Eye Chili) and Capsicum chinense (Yellow Lantern Chili), Respectively.

    Science.gov (United States)

    Gan, Huan You; Gan, Han Ming; Savka, Michael A; Triassi, Alexander J; Wheatley, Matthew S; Naqvi, Kubra F; Foxhall, Taylor E; Anauo, Michael J; Baldwin, Mariah L; Burkhardt, Russell N; O'Bryon, Isabelle G; Dailey, Lucas K; Busairi, Nurfatini Idayu; Keith, Robert C; Khair, Megat Hazmah Megat Mazhar; Rasul, Muhammad Zamir Mohd; Rosdi, Nur Aiman Mohd; Mountzouros, James R; Rhoads, Aleigha C; Selochan, Melissa A; Tautanov, Timur B; Polter, Steven J; Marks, Kayla D; Caraballo, Alexander A; Hudson, André O

    2015-01-01

    Here, we report the genome sequences of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum spp. Annotation revealed gene clusters for the synthesis of bacilysin, lichensin, and bacillibactin and sporulation killing factor (skfA) in Bacillus safensis RIT372 and turnerbactin and carotenoid in Pseudomonas oryzihabitans RIT370.

  11. Mitochondrial Disease Sequence Data Resource (MSeqDR): A global grass-roots consortium to facilitate deposition, curation, annotation, and integrated analysis of genomic data for the mitochondrial disease clinical and research communities

    NARCIS (Netherlands)

    M.J. Falk (Marni J.); L. Shen (Lishuang); M. Gonzalez (Michael); J. Leipzig (Jeremy); M.T. Lott (Marie T.); A.P.M. Stassen (Alphons P.M.); M.A. Diroma (Maria Angela); D. Navarro-Gomez (Daniel); P. Yeske (Philip); R. Bai (Renkui); R.G. Boles (Richard G.); V. Brilhante (Virginia); D. Ralph (David); J.T. DaRe (Jeana T.); R. Shelton (Robert); S.F. Terry (Sharon); Z. Zhang (Zhe); W.C. Copeland (William C.); M. van Oven (Mannis); H. Prokisch (Holger); D.C. Wallace; M. Attimonelli (Marcella); D. Krotoski (Danuta); S. Zuchner (Stephan); X. Gai (Xiaowu); S. Bale (Sherri); J. Bedoyan (Jirair); D.M. Behar (Doron); P. Bonnen (Penelope); L. Brooks (Lisa); C. Calabrese (Claudia); S. Calvo (Sarah); P.F. Chinnery (Patrick); J. Christodoulou (John); D. Church (Deanna); R. Clima (Rosanna); B.H. Cohen (Bruce H.); R.G.H. Cotton (Richard); I.F.M. de Coo (René); O. Derbenevoa (Olga); J.T. den Dunnen (Johan); D. Dimmock (David); G. Enns (Gregory); G. Gasparre (Giuseppe); A. Goldstein (Amy); I. Gonzalez (Iris); K. Gwinn (Katrina); S. Hahn (Sihoun); R.H. Haas (Richard H.); H. Hakonarson (Hakon); M. Hirano (Michio); D. Kerr (Douglas); D. Li (Dong); M. Lvova (Maria); F. Macrae (Finley); D. Maglott (Donna); E. McCormick (Elizabeth); G. Mitchell (Grant); V.K. Mootha (Vamsi K.); Y. Okazaki (Yasushi); A. Pujol (Aurora); M. Parisi (Melissa); J.C. Perin (Juan Carlos); E.A. Pierce (Eric A.); V. Procaccio (Vincent); S. Rahman (Shamima); H. Reddi (Honey); H. Rehm (Heidi); E. Riggs (Erin); R.J.T. Rodenburg (Richard); Y. Rubinstein (Yaffa); R. Saneto (Russell); M. Santorsola (Mariangela); C. Scharfe (Curt); C. Sheldon (Claire); E.A. Shoubridge (Eric); D. Simone (Domenico); B. Smeets (Bert); J.A.M. Smeitink (Jan); C. Stanley (Christine); A. Suomalainen (Anu); M.A. Tarnopolsky (Mark); I. Thiffault (Isabelle); D.R. Thorburn (David R.); J.V. Hove (Johan Van); L. Wolfe (Lynne); L.-J. Wong (Lee-Jun)

    2015-01-01

    Success rates for genomic analyses of highly heterogeneous disorders can be greatly improved if a large cohort of patient data is assembled to enhance collective capabilities for accurate sequence variant annotation, analysis, and interpretation. Indeed, molecular diagnostics requires th

  12. TU-CD-BRB-07: Identification of Associations Between Radiologist-Annotated Imaging Features and Genomic Alterations in Breast Invasive Carcinoma, a TCGA Phenotype Research Group Study

    Energy Technology Data Exchange (ETDEWEB)

    Rao, A; Net, J [University of Miami, Miami, Florida (United States); Brandt, K [Mayo Clinic, Rochester, Minnesota (United States); Huang, E [National Cancer Institute, NIH, Bethesda, MD (United States); Freymann, J; Kirby, J [Leidos Biomedical Research Inc., Frederick, MD (United States); Burnside, E [University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin (United States); Morris, E; Sutton, E [Memorial Sloan Kettering Cancer Center, New York, NY (United States); Bonaccio, E [Roswell Park Cancer Institute, Buffalo, NY (United States); Giger, M; Jaffe, C [Univ Chicago, Chicago, IL (United States); Ganott, M; Zuley, M [University of Pittsburgh Medical Center - Magee Womens Hospital, Pittsburgh, Pennsylvania (United States); Le-Petross, H [MD Anderson Cancer Center, Houston, TX (United States); Dogan, B [UT MDACC, Houston, TX (United States); Whitman, G [UTMDACC, Houston, TX (United States)

    2015-06-15

    Purpose: To determine associations between radiologist-annotated MRI features and genomic measurements in breast invasive carcinoma (BRCA) from the Cancer Genome Atlas (TCGA). Methods: 98 TCGA patients with BRCA were assessed by a panel of radiologists (TCGA Breast Phenotype Research Group) based on a variety of mass and non-mass features according to the Breast Imaging Reporting and Data System (BI-RADS). Batch corrected gene expression data was obtained from the TCGA Data Portal. The Kruskal-Wallis test was used to assess correlations between categorical image features and tumor-derived genomic features (such as gene pathway activity, copy number and mutation characteristics). Image-derived features were also correlated with estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2/neu) status. Multiple hypothesis correction was done using Benjamini-Hochberg FDR. Associations at an FDR of 0.1 were selected for interpretation. Results: ER status was associated with rim enhancement and peritumoral edema. PR status was associated with internal enhancement. Several components of the PI3K/Akt pathway were associated with rim enhancement as well as heterogeneity. In addition, several components of cell cycle regulation and cell division were associated with imaging characteristics. TP53 and GATA3 mutations were associated with lesion size. MRI features associated with TP53 mutation status were rim enhancement and peritumoral edema. Rim enhancement was associated with activity of RB1, PIK3R1, MAP3K1, AKT1, PI3K, and PIK3CA. Margin status was associated with HIF1A/ARNT, Ras/GTP/PI3K, KRAS, and GADD45A. Axillary lymphadenopathy was associated with RB1 and BCL2L1. Peritumoral edema was associated with Aurora A/GADD45A, BCL2L1, CCNE1, and FOXA1. Heterogeneous internal nonmass enhancement was associated with EGFR, PI3K, AKT1, HF/MET, and EGFR/Erbb4/neuregulin 1. Diffuse nonmass enhancement was associated with HGF/MET/MUC20/SHIP
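
    The multiple-hypothesis step named in this abstract (Benjamini-Hochberg at an FDR of 0.1) can be sketched as follows. This is a minimal illustration of the standard step-up procedure, not the study's code; the function name and the p-values in the usage note are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return one boolean per p-value: True if the hypothesis is rejected.

    Standard Benjamini-Hochberg step-up rule: sort the p-values, find the
    largest rank k (1-based) with p_(k) <= (k / m) * fdr, and reject the
    k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected
```

    With illustrative (made-up) p-values, `benjamini_hochberg([0.9, 0.001, 0.3, 0.04, 0.02])` keeps the three smallest. Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value passes a later one.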

  13. Genomic typing of Escherichia coli O157:H7 by semi-automated fluorescent AFLP analysis.

    Science.gov (United States)

    Zhao, S; Mitchell, S E; Meng, J; Kresovich, S; Doyle, M P; Dean, R E; Casa, A M; Weller, J W

    2000-02-01

    Escherichia coli serotype O157:H7 isolates were analyzed using a relatively new DNA fingerprinting method, amplified fragment length polymorphism (AFLP). Total genomic DNA was digested with two restriction endonucleases (EcoRI and MseI), and compatible oligonucleotide adapters were ligated to the ends of the resulting DNA fragments. Subsets of fragments from the total pool of cleaved DNA were then amplified by the polymerase chain reaction (PCR) using selective primers that extended beyond the adapter and restriction site sequences. One of the primers from each set was labeled with a fluorescent dye, which enabled amplified fragments to be detected and sized automatically on an automated DNA sequencer. Three AFLP primer sets generated a total of thirty-seven unique genotypes among the 48 E. coli O157:H7 isolates tested. Prior fingerprinting analysis of large restriction fragments from these same isolates by pulsed-field gel electrophoresis (PFGE) resulted in only 21 unique DNA profiles. Also, AFLP fingerprinting was successful for one DNA sample that was not typable by PFGE, presumably because of template degradation. AFLP analysis, therefore, provided greater genetic resolution and was less sensitive to DNA quality than PFGE. Consequently, this DNA typing technology should be very useful for genetic subtyping of bacterial pathogens in epidemiologic studies.
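
    The digestion step described in this abstract can be illustrated with a small in-silico double digest. This is a simplified sketch, not the authors' protocol: it only cuts a sequence at EcoRI (G^AATTC) and MseI (T^TAA) sites and compares fragment-length profiles, and it omits adapter ligation and selective PCR. The function names and the toy sequences in the usage note are hypothetical.

```python
# Recognition site and cut offset (bases from the start of the site)
# for the two enzymes named in the abstract.
ENZYMES = {"EcoRI": ("GAATTC", 1), "MseI": ("TTAA", 1)}


def cut_positions(seq, site, offset):
    """Return every cut position produced by one recognition site."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def digest(seq, enzymes=ENZYMES):
    """Cut seq at all sites of all enzymes and return the fragments in order."""
    cuts = sorted({p for site, off in enzymes.values()
                   for p in cut_positions(seq, site, off)})
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def fingerprint(seq):
    """Sorted fragment lengths: a crude stand-in for a banding pattern."""
    return sorted(len(fragment) for fragment in digest(seq))
```

    For the toy sequence `"AAAGAATTCCCCTTAAGGG"` the digest yields three fragments. A single base change that destroys the MseI site (`"AAAGAATTCCCCTTGAGGG"`) yields two, so the two fingerprints differ, which is the kind of difference a fragment-length typing method detects.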

  14. Human genomic DNA analysis using a semi-automated sample preparation, amplification, and electrophoresis separation platform.

    Science.gov (United States)

    Raisi, Fariba; Blizard, Benjamin A; Raissi Shabari, Akbar; Ching, Jesus; Kintz, Gregory J; Mitchell, Jim; Lemoff, Asuncion; Taylor, Mike T; Weir, Fred; Western, Linda; Wong, Wendy; Joshi, Rekha; Howland, Pamela; Chauhan, Avinash; Nguyen, Peter; Petersen, Kurt E

    2004-03-01

    The growing importance of analyzing the human genome to detect hereditary and infectious diseases associated with specific DNA sequences has motivated us to develop automated devices to integrate sample preparation, real-time PCR, and microchannel electrophoresis (MCE). In this report, we present results from an optimized compact system capable of processing a raw sample of blood, extracting the DNA, and performing a multiplexed PCR reaction. Finally, an innovative electrophoretic separation was performed on the post-PCR products using a unique MCE system. The sample preparation system extracted and lysed white blood cells (WBC) from whole blood, producing DNA of sufficient quantity and quality for a polymerase chain reaction (PCR). Separation of multiple amplicons was achieved in a microfabricated channel 30 µm × 100 µm in cross section and 85 mm in length filled with a replaceable methyl cellulose matrix operated under denaturing conditions at 50 °C. By incorporating fluorescent-labeled primers in the PCR, the amplicons were identified by a two-color (multiplexed) fluorescence detection system. Two base-pair resolution of single-stranded DNA (PCR products) was achieved. We believe that this integrated system provides a unique solution for DNA analysis.

  15. CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations.

    Science.gov (United States)

    Park, Julie; Costanzo, Maria C; Balakrishnan, Rama; Cherry, J Michael; Hong, Eurie L

    2012-01-01

    The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. DATABASE URL: http://www.yeastgenome.org.
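
    The comparison described in this abstract, pairing a literature-based GO term with a computational prediction and examining their relationship within the ontology, can be sketched on a toy graph. This is an illustration under stated assumptions, not the CvManGO implementation: the ontology below is a heavily simplified fragment, the category names are invented for the example, and the abstract does not say which relationships the real method flags.

```python
# Toy ontology: each term maps to its direct parents (simplified, illustrative).
TOY_PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
    "catalytic activity": ["molecular_function"],
    "kinase activity": ["catalytic activity"],
}


def ancestors(term, parents):
    """Return all terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def compare_terms(manual, computational, parents):
    """Classify how a computational term relates to a manual term."""
    if manual == computational:
        return "same"
    if manual in ancestors(computational, parents):
        # The prediction is a descendant of (more specific than) the manual term.
        return "computational_more_specific"
    if computational in ancestors(manual, parents):
        return "manual_more_specific"
    return "different_lineage"


def genes_to_review(pairs, parents):
    """Hypothetical review rule: flag genes whose prediction adds or conflicts."""
    flagged = {"computational_more_specific", "different_lineage"}
    return sorted(gene for gene, (manual, computational) in pairs.items()
                  if compare_terms(manual, computational, parents) in flagged)
```

    Given a mapping from hypothetical gene names to (manual, computational) term pairs, `genes_to_review` returns the genes whose prediction is either more specific than the curated term or lies in a different branch, which is one plausible way to prioritize literature review.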

  16. Quantifying Variability of Manual Annotation in Cryo-Electron Tomograms.

    Science.gov (United States)

    Hecksel, Corey W; Darrow, Michele C; Dai, Wei; Galaz-Montoya, Jesús G; Chin, Jessica A; Mitchell, Patrick G; Chen, Shurui; Jakana, Jemba; Schmid, Michael F; Chiu, Wah

    2016-06-01

    Although acknowledged to be variable and subjective, manual annotation of cryo-electron tomography data is commonly used to answer structural questions and to create a "ground truth" for evaluation of automated segmentation algorithms. Validation of such annotation is lacking, but is critical for understanding the reproducibility of manual annotations. Here, we used voxel-based similarity scores for a variety of specimens, ranging in complexity and segmented by several annotators, to quantify the variation among their annotations. In addition, we have identified procedures for merging annotations to reduce variability, thereby increasing the reliability of manual annotation. Based on our analyses, we find that it is necessary to combine multiple manual annotations to increase the confidence level for answering structural questions. We also make recommendations to guide algorithm development for automated annotation of features of interest.

  17. Genome-wide and functional annotation of human E3 ubiquitin ligases identifies MULAN, a mitochondrial E3 that regulates the organelle's dynamics and signaling.

    Directory of Open Access Journals (Sweden)

    Wei Li

    Specificity of protein ubiquitylation is conferred by E3 ubiquitin (Ub) ligases. We have annotated approximately 617 putative E3s and substrate-recognition subunits of E3 complexes encoded in the human genome. The limited knowledge of the function of members of the large E3 superfamily prompted us to generate genome-wide E3 cDNA and RNAi expression libraries designed for functional screening. An imaging-based screen using these libraries to identify E3s that regulate mitochondrial dynamics uncovered MULAN/FLJ12875, a RING finger protein whose ectopic expression and knockdown both interfered with mitochondrial trafficking and morphology. We found that MULAN is a mitochondrial protein: two transmembrane domains mediate its localization to the organelle's outer membrane. MULAN is oriented such that its E3-active, C-terminal RING finger is exposed to the cytosol, where it has access to other components of the Ub system. Both an intact RING finger and the correct subcellular localization were required for regulation of mitochondrial dynamics, suggesting that MULAN's downstream effectors are proteins that are either integral to, or associated with, mitochondria and that become modified with Ub. Interestingly, MULAN had previously been identified as an activator of NF-kappaB, thus providing a link between mitochondrial dynamics and mitochondria-to-nucleus signaling. These findings suggest the existence of a new, Ub-mediated mechanism responsible for integration of mitochondria into the cellular environment.

  18. Genome sequencing and annotation of Laceyella sacchari strain GS 1-1, isolated from hot spring, Chumathang, Leh, India

    Directory of Open Access Journals (Sweden)

    Navjot Kaur

    2014-12-01

    We report the 3.3-Mb draft genome of Laceyella sacchari strain GS 1-1, isolated from a hot spring water sample from Chumathang, Leh, India. The draft genome of strain GS 1-1 consists of 3,324,316 bp with a G + C content of 48.8%, 3429 predicted protein-coding genes and 75 RNAs. Geobacillus thermodenitrificans strain NG80-2, Geobacillus kaustophilus strain HTA426 and Geobacillus sp. strain G11MC16 are the closest neighbors of strain GS 1-1.

  19. Interspecific Comparison and annotation of two complete mitochondrial genome sequences from the plant pathogenic fungus Mycosphaerella graminicola

    Energy Technology Data Exchange (ETDEWEB)

    Millenbaugh, Bonnie A; Pangilinan, Jasmyn L.; Torriani, Stefano F.F.; Goodwin, Stephen B.; Kema, Gert H.J.; McDonald, Bruce A.

    2007-12-07

    The mitochondrial genomes of two isolates of the wheat pathogen Mycosphaerella graminicola were sequenced completely and compared to identify polymorphic regions. This organism is of interest because it is phylogenetically distant from other fungi with sequenced mitochondrial genomes and it has shown discordant patterns of nuclear and mitochondrial diversity. The mitochondrial genome of M. graminicola is a circular molecule of approximately 43,960 bp containing the typical genes coding for 14 proteins related to oxidative phosphorylation, one RNA polymerase, two rRNA genes and a set of 27 tRNAs. The mitochondrial DNA of M. graminicola lacks the gene encoding the putative ribosomal protein (rps5-like), commonly found in fungal mitochondrial genomes. Most of the tRNA genes were clustered with a gene order conserved with many other ascomycetes. A sample of thirty-five additional strains representing the known global mt diversity was partially sequenced to measure overall mitochondrial variability within the species. Little variation was found, confirming previous RFLP-based findings of low mitochondrial diversity. The mitochondrial sequence of M. graminicola is the first reported from the family Mycosphaerellaceae or the order Capnodiales. The sequence also provides a tool to better understand the development of fungicide resistance and the conflicting pattern of high nuclear and low mitochondrial diversity in global populations of this fungus.

  20. Whole-Genome Sequence and Annotation of Salmonella enterica subsp. enterica Serovar Enteritidis Phage Type 8 Strain EN1660

    Science.gov (United States)

    Perry, Benjamin J.; Fitzgerald, Stephen F.; Kröger, Carsten

    2017-01-01

    ABSTRACT The genome of Salmonella enterica subspecies enterica serovar Enteritidis phage type 8 strain EN1660, isolated from an outbreak in Thunder Bay, Canada, was sequenced to 46-fold coverage using an Illumina MiSeq with 300-bp paired-end sequencing chemistry to produce 28 contigs with an N50 value of 490,721 bp. PMID:28126943
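
    For readers unfamiliar with the N50 statistic quoted above, the sketch below shows one common way to compute it: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The contig lengths in the usage line are hypothetical and unrelated to the EN1660 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the cumulative length covers half the assembly.
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one contig length")


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```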

  1. antiSMASH : rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences

    NARCIS (Netherlands)

    Medema, Marnix H.; Blin, Kai; Cimermancic, Peter; de Jager, Victor; Zakrzewski, Piotr; Fischbach, Michael A.; Weber, Tilmann; Takano, Eriko; Breitling, Rainer

    2011-01-01

    Bacterial and fungal secondary metabolism is a rich source of novel bioactive compounds with potential pharmaceutical applications as antibiotics, anti-tumor drugs or cholesterol-lowering drugs. To find new drug candidates, microbiologists are increasingly relying on sequencing genomes of a wide var

  2. An atlas of bovine gene expression reveals novel distinctive tissue characteristics and evidence for improving genome annotation

    Science.gov (United States)

    Background A comprehensive transcriptome survey, or gene atlas, provides information essential for a complete understanding of the genomic biology of an organism. We present an atlas of RNA abundance for 92 adult, juvenile and fetal cattle tissues and three cattle cell lines. Results The Bovine Gene...

  3. Genome-wide annotation, expression profiling, and protein interaction studies of the core cell-cycle genes in Phalaenopsis aphrodite.

    Science.gov (United States)

    Lin, Hsiang-Yin; Chen, Jhun-Chen; Wei, Miao-Ju; Lien, Yi-Chen; Li, Huang-Hsien; Ko, Swee-Suak; Liu, Zin-Huang; Fang, Su-Chiung

    2014-01-01

    Orchidaceae is one of the most abundant and diverse families in the plant kingdom and its unique developmental patterns have drawn the attention of many evolutionary biologists. Particular areas of interest have included the co-evolution of pollinators and distinct floral structures, and symbiotic relationships with mycorrhizal flora. However, comprehensive studies to decipher the molecular basis of growth and development in orchids remain scarce. Cell proliferation governed by cell-cycle regulation is fundamental to growth and development of the plant body. We took advantage of recently released transcriptome information to systematically isolate and annotate the core cell-cycle regulators in the moth orchid Phalaenopsis aphrodite. Our data verified that Phalaenopsis cyclin-dependent kinase A (CDKA) is an evolutionarily conserved CDK. Expression profiling studies suggested that core cell-cycle genes functioning during the G1/S, S, and G2/M stages were preferentially enriched in the meristematic tissues that have high proliferation activity. In addition, subcellular localization and pairwise interaction analyses of various combinations of CDKs and cyclins, and of E2 promoter-binding factors and dimerization partners confirmed interactions of the functional units. Furthermore, our data showed that expression of the core cell-cycle genes was coordinately regulated during pollination-induced reproductive development. The data obtained establish a fundamental framework for study of the cell-cycle machinery in Phalaenopsis orchids.

  4. Automated integration of genomic physical mapping data via parallel simulated annealing

    Energy Technology Data Exchange (ETDEWEB)

    Slezak, T.

    1994-06-01

    The Human Genome Center at the Lawrence Livermore National Laboratory (LLNL) is nearing closure on a high-resolution physical map of human chromosome 19. We have built automated tools to assemble 15,000 fingerprinted cosmid clones into 800 contigs with minimal spanning paths identified. These islands are being ordered, oriented, and spanned by a variety of other techniques including: Fluorescence In Situ Hybridization (FISH) at 3 levels of resolution, ECO restriction fragment mapping across all contigs, and a multitude of different hybridization and PCR techniques to link cosmid, YAC, BAC, PAC, and P1 clones. The FISH data provide us with partial order and distance data as well as orientation. We made the observation that map builders need a much rougher presentation of data than do map readers; the former wish to see raw data since these can expose errors or interesting biology. We further noted that by ignoring our length and distance data we could simplify our problem into one that could be readily attacked with optimization techniques. The data integration problem could then be seen as an M x N ordering of our N cosmid clones which "intersect" M larger objects by defining "intersection" to mean either contig/map membership or hybridization results. Clearly, the goal of making an integrated map is now to rearrange the N cosmid clone "columns" such that the number of gaps on the object "rows" is minimized. Our FISH partially-ordered cosmid clones provide us with a set of constraints that cannot be violated by the rearrangement process. We solved the optimization problem via simulated annealing performed on a network of 40+ Unix machines in parallel, using a server/client model built on explicit socket calls. For current maps we can create a map in about 4 hours on the parallel net versus 4+ days on a single workstation. Our biologists are now using this software on a daily basis to guide their efforts toward final closure.
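
    The column-ordering formulation in this abstract can be sketched as follows. This is a minimal single-process illustration, not the LLNL software: the gap count (runs of 1s per row, minus one), the swap move, the geometric cooling schedule and all parameter names are assumptions, and the parallel server/client layer is omitted.

```python
import math
import random


def count_gaps(matrix, order):
    """Sum over rows of (number of runs of 1s) - 1 under the given column order."""
    gaps = 0
    for row in matrix:
        runs, prev = 0, 0
        for col in order:
            if row[col] and not prev:
                runs += 1
            prev = row[col]
        gaps += max(runs - 1, 0)
    return gaps


def respects(order, constraints):
    """True if every (a, b) constraint has column a placed before column b."""
    pos = {col: i for i, col in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in constraints)


def anneal(matrix, start, constraints=(), steps=20000,
           t_start=2.0, t_end=0.01, seed=0):
    """Search for a column order with few gaps that never violates a constraint."""
    if not respects(start, constraints):
        raise ValueError("start order violates a constraint")
    rng = random.Random(seed)
    order = list(start)
    cost = count_gaps(matrix, order)
    best, best_cost = list(order), cost
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        if respects(order, constraints):
            new_cost = count_gaps(matrix, order)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = list(order), cost
                continue
        order[i], order[j] = order[j], order[i]  # reject: undo the swap
    return best, best_cost


if __name__ == "__main__":
    # Hypothetical 3 x 6 membership matrix; (0, 5) stands in for a FISH ordering constraint.
    matrix = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1], [1, 0, 0, 0, 1, 0]]
    print(anneal(matrix, list(range(6)), constraints=[(0, 5)]))
```

    Rejecting any swap that breaks a constraint keeps every visited order feasible, which is one simple way to honour constraints that "cannot be violated"; a penalty term would be an alternative design.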

  5. Versatile annotation and publication quality visualization of protein complexes using POLYVIEW-3D

    Directory of Open Access Journals (Sweden)

    Meller Jaroslaw

    2007-08-01

    Background Macromolecular visualization as well as automated structural and functional annotation tools play an increasingly important role in the post-genomic era, contributing significantly towards the understanding of molecular systems and processes. For example, three dimensional (3D) models help in exploring protein active sites and functional hot spots that can be targeted in drug design. Automated annotation and visualization pipelines can also reveal other functionally important attributes of macromolecules. These goals are dependent on the availability of advanced tools that better integrate the existing databases, annotation servers and other resources with state-of-the-art rendering programs. Results We present a new tool for protein structure analysis, with the focus on annotation and visualization of protein complexes, which is an extension of our previously developed POLYVIEW web server. By integrating the web technology with state-of-the-art software for macromolecular visualization, such as the PyMol program, POLYVIEW-3D enables combining versatile structural and functional annotations with a simple web-based interface for creating publication quality structure rendering, as well as animated images for Powerpoint™, web sites and other electronic resources. The service is platform independent and no plug-ins are required. Several examples of how POLYVIEW-3D can be used for structural and functional analysis in the context of protein-protein interactions are presented to illustrate the available annotation options. Conclusion POLYVIEW-3D server features the PyMol image rendering that provides detailed and high quality presentation of macromolecular structures, with an easy to use web-based interface. POLYVIEW-3D also provides a wide array of options for automated structural and functional analysis of proteins and their complexes. Thus, the POLYVIEW-3D server may become an important resource for researchers and educators in

  6. Functional genomics tools applied to plant metabolism: a survey on plant respiration, its connections and the annotation of complex gene functions

    Directory of Open Access Journals (Sweden)

    Wagner L. Araújo

    2012-09-01

    The application of post-genomic techniques in plant respiration studies has greatly improved our ability to assign functions to gene products. It has also revealed previously unappreciated interactions between distal elements of metabolism. Such results have reinforced the need to consider plant respiratory metabolism as part of a complex network, and making sense of such interactions will ultimately require the construction of predictive and mechanistic models. Transcriptomics, proteomics, metabolomics and the quantification of metabolic flux will be of great value in creating such models both by facilitating the annotation of complex gene function, determining their structure and by furnishing the quantitative data required to test them. In this review we highlight how these experimental approaches have contributed to our current understanding of plant respiratory metabolism and its interplay with associated processes (e.g. photosynthesis, photorespiration and nitrogen metabolism). We also discuss how data from these techniques may be integrated, with the ultimate aim of identifying mechanisms that control and regulate plant respiration and discovering novel gene functions with potential biotechnological implications.

  7. Functional annotation of rheumatoid arthritis and osteoarthritis associated genes by integrative genome-wide gene expression profiling analysis.

    Directory of Open Access Journals (Sweden)

    Zhan-Chun Li

    BACKGROUND: Rheumatoid arthritis (RA) and osteoarthritis (OA) are two major types of joint diseases that share multiple common symptoms. However, their pathological mechanism remains largely unknown. The aim of our study is to identify RA- and OA-related genes and gain an insight into the underlying genetic basis of these diseases. METHODS: We collected 11 whole genome-wide expression profiling datasets from RA and OA cohorts and performed a meta-analysis to comprehensively investigate their expression signatures. This method can avoid some pitfalls of single dataset analyses. RESULTS AND CONCLUSION: We found that several biological pathways (i.e., the immunity, inflammation and apoptosis related pathways) are commonly involved in the development of both RA and OA, whereas several other pathways (i.e., vasopressin-related pathway, regulation of autophagy, endocytosis, calcium transport and endoplasmic reticulum stress related pathways) present significant difference between RA and OA. This study provides novel insights into the molecular mechanisms underlying these diseases, thereby aiding their diagnosis and treatment.

  8. Genome-Wide Annotation and Comparative Analysis of Cytochrome P450 Monooxygenases in Basidiomycete Biotrophic Plant Pathogens.

    Directory of Open Access Journals (Sweden)

    Lehlohonolo Benedict Qhanya

    Fungi are an exceptional source of diverse and novel cytochrome P450 monooxygenases (P450s), heme-thiolate proteins, with catalytic versatility. Agaricomycotina saprophytes have yielded most of the available information on basidiomycete P450s. This resulted in observing similar P450 family types in basidiomycetes with few differences in P450 families among Agaricomycotina saprophytes. The present study demonstrated the presence of unique P450 family patterns in basidiomycete biotrophic plant pathogens that could possibly have originated from the adaptation of these species to different ecological niches (host influence). Systematic analysis of P450s in basidiomycete biotrophic plant pathogens belonging to three different orders, Agaricomycotina (Armillaria mellea), Pucciniomycotina (Melampsora laricis-populina, M. lini, Mixia osmundae and Puccinia graminis) and Ustilaginomycotina (Ustilago maydis, Sporisorium reilianum and Tilletiaria anomala), revealed the presence of numerous putative P450s ranging from 267 (A. mellea) to 14 (M. osmundae). Analysis of P450 families revealed the presence of 41 new P450 families and 27 new P450 subfamilies in these biotrophic plant pathogens. Order-level comparison of P450 families between biotrophic plant pathogens revealed the presence of unique P450 family patterns in these organisms, possibly reflecting the characteristics of their order. Further comparison of P450 families with basidiomycete non-pathogens confirmed that biotrophic plant pathogens harbour the unique P450 families in their genomes. The CYP63, CYP5037, CYP5136, CYP5137 and CYP5341 P450 families were expanded in A. mellea when compared to other Agaricomycotina saprophytes and the CYP5221 and CYP5233 P450 families in P. graminis and M. laricis-populina. The present study revealed that expansion of these P450 families is due to paralogous evolution of member P450s. The presence of unique P450 families in these organisms serves as evidence of how a host

  9. Genome-Wide Annotation and Comparative Analysis of Cytochrome P450 Monooxygenases in Basidiomycete Biotrophic Plant Pathogens.

    Science.gov (United States)

    Qhanya, Lehlohonolo Benedict; Matowane, Godfrey; Chen, Wanping; Sun, Yuxin; Letsimo, Elizabeth Mpholoseng; Parvez, Mohammad; Yu, Jae-Hyuk; Mashele, Samson Sitheni; Syed, Khajamohiddin

    2015-01-01

    Fungi are an exceptional source of diverse and novel cytochrome P450 monooxygenases (P450s), heme-thiolate proteins, with catalytic versatility. Agaricomycotina saprophytes have yielded most of the available information on basidiomycete P450s. This resulted in observing similar P450 family types in basidiomycetes with few differences in P450 families among Agaricomycotina saprophytes. The present study demonstrated the presence of unique P450 family patterns in basidiomycete biotrophic plant pathogens that could possibly have originated from the adaptation of these species to different ecological niches (host influence). Systematic analysis of P450s in basidiomycete biotrophic plant pathogens belonging to three different orders, Agaricomycotina (Armillaria mellea), Pucciniomycotina (Melampsora laricis-populina, M. lini, Mixia osmundae and Puccinia graminis) and Ustilaginomycotina (Ustilago maydis, Sporisorium reilianum and Tilletiaria anomala), revealed the presence of numerous putative P450s ranging from 267 (A. mellea) to 14 (M. osmundae). Analysis of P450 families revealed the presence of 41 new P450 families and 27 new P450 subfamilies in these biotrophic plant pathogens. Order-level comparison of P450 families between biotrophic plant pathogens revealed the presence of unique P450 family patterns in these organisms, possibly reflecting the characteristics of their order. Further comparison of P450 families with basidiomycete non-pathogens confirmed that biotrophic plant pathogens harbour the unique P450 families in their genomes. The CYP63, CYP5037, CYP5136, CYP5137 and CYP5341 P450 families were expanded in A. mellea when compared to other Agaricomycotina saprophytes and the CYP5221 and CYP5233 P450 families in P. graminis and M. laricis-populina. The present study revealed that expansion of these P450 families is due to paralogous evolution of member P450s. The presence of unique P450 families in these organisms serves as evidence of how a host

  10. CycADS: an annotation database system to ease the development and update of BioCyc databases.

    Science.gov (United States)

    Vellozo, Augusto F; Véron, Amélie S; Baa-Puyoulet, Patrice; Huerta-Cepas, Jaime; Cottret, Ludovic; Febvay, Gérard; Calevro, Federica; Rahbé, Yvan; Douglas, Angela E; Gabaldón, Toni; Sagot, Marie-France; Charles, Hubert; Colella, Stefano

    2011-01-01

    In recent years, genomes from an increasing number of organisms have been sequenced, but their annotation remains a time-consuming process. The BioCyc databases offer a framework for the integrated analysis of metabolic networks. The Pathway Tools software suite allows the automated construction of a database starting from an annotated genome, but it requires prior integration of all annotations into a specific summary file or into a GenBank file. To allow the easy creation and update of a BioCyc database starting from the multiple genome annotation resources available over time, we have developed an ad hoc data management system that we called Cyc Annotation Database System (CycADS). CycADS is centred on a specific database model and on a set of Java programs to import, filter and export relevant information. Data from GenBank and other annotation sources (including for example: KAAS, PRIAM, Blast2GO and PhylomeDB) are collected into a database to be subsequently filtered and extracted to generate a complete annotation file. This file is then used to build an enriched BioCyc database using the PathoLogic program of Pathway Tools. The CycADS pipeline for annotation management was used to build the AcypiCyc database for the pea aphid (Acyrthosiphon pisum) whose genome was recently sequenced. The AcypiCyc database webpage also includes, for comparative analyses, two other metabolic reconstruction BioCyc databases generated using CycADS: TricaCyc for Tribolium castaneum and DromeCyc for Drosophila melanogaster. Linked to its flexible design, CycADS offers a powerful software tool for the generation and regular updating of enriched BioCyc databases. The CycADS system is particularly suited for metabolic gene annotation and network reconstruction in newly sequenced genomes. Because of the uniform annotation used for metabolic network reconstruction, CycADS is particularly useful for comparative analysis of the metabolism of different organisms.
Database URL: http://www.cycadsys.org.

  11. Genomic Sequence Comparisons, 1987-2003 Final Report

    Energy Technology Data Exchange (ETDEWEB)

    George M. Church

    2004-07-29

    This project was to develop new DNA sequencing and RNA and protein quantitation methods and related genome annotation tools. The project began in 1987 with the development of multiplex sequencing (published in Science in 1988), and one of the first automated sequencing methods. This led to the first commercial genome sequence in 1994 and to the establishment of the main commercial participants (GTC then Agencourt) in the public DOE/NIH genome project. In collaboration with GTC we contributed to one of the first complete DOE genome sequences, in 1997, that of Methanobacterium thermoautotrophicum, a species of great relevance to energy-rich gas production.

  12. An Approach to Function Annotation for Proteins of Unknown Function (PUFs) in the Transcriptome of Indian Mulberry.

    Directory of Open Access Journals (Sweden)

    K H Dhanyalakshmi

    Modern sequencing technologies are generating large volumes of information at the transcriptome and genome level. Translation of this information into biological meaning lags far behind, so a significant portion of the proteins discovered remain proteins of unknown function (PUFs). Attempts to uncover the functional significance of PUFs are limited by the lack of easy, high-throughput functional annotation tools. Here, we report an approach to assign putative functions to PUFs identified in the transcriptome of mulberry, a perennial tree commonly cultivated as host of silkworm. We utilized the mulberry PUFs generated from leaf tissues exposed to drought stress at whole plant level. A sequence and structure based computational analysis predicted the probable function of the PUFs. For rapid and easy annotation of PUFs, we developed an automated pipeline by integrating diverse bioinformatics tools, designated as PUFs Annotation Server (PUFAS), which also provides a web service API (Application Programming Interface) for a large-scale analysis up to a genome. The expression analysis of three selected PUFs annotated by the pipeline revealed abiotic stress responsiveness of the genes, and hence their potential role in stress acclimation pathways. The automated pipeline developed here could be extended to assign functions to PUFs from any organism in general. PUFAS web server is available at http://caps.ncbs.res.in/pufas/ and the web service is accessible at http://capservices.ncbs.res.in/help/pufas.

  13. Implementation of a semi-automated strategy for the annotation of metabolomic fingerprints generated by liquid chromatography-high resolution mass spectrometry from biological samples.

    Science.gov (United States)

    Courant, Frédérique; Royer, Anne-Lise; Chéreau, Sylvain; Morvan, Marie-Line; Monteau, Fabrice; Antignac, Jean-Philippe; Le Bizec, Bruno

    2012-11-07

    Metabolomics aims at detecting and semi-quantifying small molecular weight metabolites in biological samples in order to characterise the metabolic changes resulting from one or more given factors and/or to develop models based on diagnostic biomarker candidates. Nevertheless, whatever the objective of a metabolomic study, one critical step consists in the structural identification of mass spectrometric features revealed by statistical analysis and this remains a real challenge. Indeed, this requires both an understanding of the studied biological system, the correct use of various analytical information (retention time, molecular weight experimentally measured, isotopic golden rules, MS/MS fragment pattern interpretation…), or querying online databases. In gas chromatography-electro-ionisation (EI)-mass spectrometry, EI leads to a very reproducible fragmentation allowing establishment of universal EI mass spectra databases (for example, the NIST database -National Institute of Standards and Technology) and thus facilitates the identification step. Unfortunately, the situation is different when working with liquid chromatography-mass spectrometry (LC-MS) since atmospheric pressure ionisation exhibits high inter-instrument variability regarding fragmentation. Therefore, the constitution of LC-MS "in-house" spectral databases appears relevant in this context. The present study describes the procedure developed and applied to increment 133 and 130 metabolites in databanks dedicated to analyses performed with LC-HRMS in positive and negative electrospray ionisation, and the use of these databanks for annotating quickly untargeted metabolomics fingerprints. This study also describes the optimization of the parameters controlling the automatic processing in order to obtain a fast and reliable annotation of a maximum of organic compounds. This strategy was applied to bovine kidney samples collected from control animals or animals treated with steroid hormones. Thirty

  14. A unified gene catalog for the laboratory mouse reference genome.

    Science.gov (United States)

    Zhu, Y; Richardson, J E; Hale, P; Baldarelli, R M; Reed, D J; Recla, J M; Sinclair, R; Reddy, T B K; Bult, C J

    2015-08-01

    We report here a semi-automated process by which mouse genome feature predictions and curated annotations (i.e., genes, pseudogenes, functional RNAs, etc.) from Ensembl, NCBI and Vertebrate Genome Annotation database (Vega) are reconciled with the genome features in the Mouse Genome Informatics (MGI) database (http://www.informatics.jax.org) into a comprehensive and non-redundant catalog. Our gene unification method employs an algorithm (fjoin--feature join) for efficient detection of genome coordinate overlaps among features represented in two annotation data sets. Following the analysis with fjoin, genome features are binned into six possible categories (1:1, 1:0, 0:1, 1:n, n:1, n:m) based on coordinate overlaps. These categories are subsequently prioritized for assessment of annotation equivalencies and differences. The version of the unified catalog reported here contains more than 59,000 entries, including 22,599 protein-coding genes, 12,455 pseudogenes, and 24,007 other feature types (e.g., microRNAs, lincRNAs, etc.). More than 23,000 of the entries in the MGI gene catalog have equivalent gene models in the annotation files obtained from NCBI, Vega, and Ensembl. 12,719 of the features are unique to NCBI relative to Ensembl/Vega; 11,957 are unique to Ensembl/Vega relative to NCBI, and 3095 are unique to MGI. More than 4000 genome features fall into categories that require manual inspection to resolve structural differences in the gene models from different annotation sources. Using the MGI unified gene catalog, researchers can easily generate a comprehensive report of mouse genome features from a single source and compare the details of gene and transcript structure using MGI's mouse genome browser.
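
    The six overlap categories named in this abstract can be illustrated with a short sketch. It is not fjoin itself: overlaps are found by a naive all-pairs comparison rather than an efficient join, strand is ignored, coordinates are assumed inclusive, and the feature identifiers and positions in the usage example are hypothetical. Features are grouped into connected clusters of overlaps and each cluster is labelled by how many features it holds from each data set.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (chrom, start, end) features share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def label(n_a, n_b):
    """Map cluster sizes (features from set A, set B) to a category string."""
    if n_b == 0:
        return "1:0"
    if n_a == 0:
        return "0:1"
    if n_a == 1 and n_b == 1:
        return "1:1"
    if n_a == 1:
        return "1:n"
    if n_b == 1:
        return "n:1"
    return "n:m"


def bin_features(set_a, set_b):
    """Return (category, a_ids, b_ids) tuples; inputs map id -> (chrom, start, end)."""
    parent = {("A", fid): ("A", fid) for fid in set_a}
    parent.update({("B", fid): ("B", fid) for fid in set_b})

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for id_a, feat_a in set_a.items():
        for id_b, feat_b in set_b.items():
            if overlaps(feat_a, feat_b):
                parent[find(("A", id_a))] = find(("B", id_b))

    clusters = defaultdict(lambda: ([], []))
    for side, fid in list(parent):
        clusters[find((side, fid))][0 if side == "A" else 1].append(fid)
    return [(label(len(a), len(b)), sorted(a), sorted(b))
            for a, b in clusters.values()]


if __name__ == "__main__":
    set_a = {"a1": ("chr1", 100, 200), "a2": ("chr1", 500, 600)}
    set_b = {"b1": ("chr1", 150, 250), "b2": ("chr2", 1, 50)}
    for row in bin_features(set_a, set_b):
        print(row)
```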

  15. Facilitating functional annotation of chicken microarray data

    Directory of Open Access Journals (Sweden)

    Gresham Cathy R

    2009-10-01

    Background Modeling results from chicken microarray studies is challenging for researchers due to little functional annotation associated with these arrays. The Affymetrix GeneChip chicken genome array, one of the biggest arrays that serve as a key research tool for the study of chicken functional genomics, is among the few arrays that link gene products to Gene Ontology (GO). However, the GO annotation data presented by Affymetrix is incomplete; for example, they do not show references linked to manually annotated functions. In addition, there is no tool that allows microarray researchers to directly retrieve functional annotations for their datasets from the annotated arrays. This costs researchers time spent searching multiple GO databases for functional information. Results We have improved the breadth of functional annotations of the gene products associated with probesets on the Affymetrix chicken genome array by 45% and the quality of annotation by 14%. We have also identified the most significant diseases and disorders, different types of genes, and known drug targets represented on the Affymetrix chicken genome array. To facilitate functional annotation of other arrays and microarray experimental datasets we developed an Array GO Mapper (AGOM) tool to help researchers to quickly retrieve corresponding functional information for their dataset. Conclusion Results from this study will directly facilitate annotation of other chicken arrays and microarray experimental datasets. Researchers will be able to quickly model their microarray dataset into more reliable biological functional information by using the AGOM tool. The diseases, disorders, gene types and drug targets revealed in the study will allow researchers to learn more about how genes function in complex biological systems and may lead to new drug discovery and development of therapies. The GO annotation data generated will be available for public use via AgBase website and

  16. Transcript annotation prioritization and screening system (TrAPSS) for mutation screening.

    Science.gov (United States)

    O'Leary, Brian M; Davis, Steven G; Smith, Michael F; Brown, Bartley; Kemp, Mathew B; Almabrazi, Hakeem; Grundstad, Jason A; Burns, Thomas; Leontiev, Vladimir; Andorf, Jeaneen; Clark, Abbot F; Sheffield, Val C; Casavant, Thomas L; Scheetz, Todd E; Stone, Edwin M; Braun, Terry A

    2007-12-01

    When searching for disease-causing mutations with polymerase chain reaction (PCR)-based methods, candidate genes are usually screened in their entirety, exon by exon. Genomic resources (i.e. www.ncbi.nih.gov, www.ensembl.org, and genome.ucsc.edu) largely support this paradigm for mutation screening by making it easy to view and access sequence data associated with genes in their genomic context. However, the administrative burden of conducting mutation screening in potentially hundreds of genes and thousands of exons in thousands of patients is significant, even with the use of public genome resources. For example, the manual design of oligonucleotide primers for all exons of the 10 Leber's congenital amaurosis (LCA) genes (149 exons) represents a significant information management challenge. The Transcript Annotation Prioritization and Screening System (TrAPSS) is designed to accelerate mutation screening by (1) providing a gene-based local cache of candidate disease genes in a genomic context, (2) automating tasks associated with optimizing candidate disease gene screening and information management, and (3) providing the implementation of an algorithmic technique to utilize large amounts of heterogeneous genome annotation (e.g. conserved protein functional domains) so as to prioritize candidate genes.

  17. Draft Genome Sequence of Bacillus licheniformis Strain YNP1-TSU Isolated from Whiterock Springs in Yellowstone National Park.

    Science.gov (United States)

    O'Hair, Joshua A; Li, Hui; Thapa, Santosh; Scholz, Matthew B; Zhou, Suping

    2017-03-02

    Novel cellulolytic microorganisms can potentially influence second-generation biofuel production. This paper reports the draft genome sequence of Bacillus licheniformis strain YNP1-TSU, isolated from hydrothermal-vegetative microbiomes inside Yellowstone National Park. The assembled sequence contigs predicted 4,230 coding genes, 66 tRNAs, and 10 rRNAs through automated annotation.

  18. Draft Genome Sequence of Bacillus licheniformis Strain YNP1-TSU Isolated from Whiterock Springs in Yellowstone National Park

    Science.gov (United States)

    O'Hair, Joshua A.; Li, Hui; Thapa, Santosh; Scholz, Matthew B.

    2017-01-01

    ABSTRACT Novel cellulolytic microorganisms can potentially influence second-generation biofuel production. This paper reports the draft genome sequence of Bacillus licheniformis strain YNP1-TSU, isolated from hydrothermal-vegetative microbiomes inside Yellowstone National Park. The assembled sequence contigs predicted 4,230 coding genes, 66 tRNAs, and 10 rRNAs through automated annotation. PMID:28254968

  19. Large-scale annotation of small-molecule libraries using public databases.

    Science.gov (United States)

    Zhou, Yingyao; Zhou, Bin; Chen, Kaisheng; Yan, S Frank; King, Frederick J; Jiang, Shumei; Winzeler, Elizabeth A

    2007-01-01

    While many large publicly accessible databases provide excellent annotation for biological macromolecules, the same is not true for small chemical compounds. Commercial data sources also fail to provide an annotation interface for large numbers of compounds and tend to be too costly to be widely available to biomedical researchers. Therefore, using annotation information for the selection of lead compounds from a modern-day high-throughput screening (HTS) campaign presently occurs only on a very limited scale. The recent rapid expansion of the NIH PubChem database provides an opportunity to link existing biological databases with compound catalogs and provides relevant information that could potentially improve the information garnered from large-scale screening efforts. Using the 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) as a model, we determined that approximately 4% of the library contained compounds with potential annotation in such databases as PubChem and the World Drug Index (WDI) as well as related databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and ChemIDplus. Furthermore, the exact structure match analysis showed that 32% of GNF compounds can be linked to third-party databases via PubChem. We also showed that annotations such as MeSH (medical subject headings) terms can be applied to in-house HTS databases in identifying signature biological inhibition profiles of interest as well as expediting the assay validation process. The automated annotation of thousands of screening hits in batch is becoming feasible and has the potential to play an essential role in the hit-to-lead decision-making process.

  20. Analysis of antisense expression by whole genome tiling microarrays and siRNAs suggests mis-annotation of Arabidopsis orphan protein-coding genes.

    Directory of Open Access Journals (Sweden)

    Casey R Richardson

    BACKGROUND: MicroRNAs (miRNAs) and trans-acting small-interfering RNAs (tasi-RNAs) are small (20-22 nt long) RNAs (smRNAs) generated from hairpin secondary structures or antisense transcripts, respectively, that regulate gene expression by Watson-Crick pairing to a target mRNA and altering expression by mechanisms related to RNA interference. The high sequence homology of plant miRNAs to their targets has been the mainstay of miRNA prediction algorithms, which are limited in their predictive power for other kingdoms because miRNA complementarity is less conserved yet transitive processes (production of antisense smRNAs) are active in eukaryotes. We hypothesize that antisense transcription and associated smRNAs are biomarkers which can be computationally modeled for gene discovery. PRINCIPAL FINDINGS: We explored rice (Oryza sativa) sense and antisense gene expression in publicly available whole genome tiling array transcriptome data and sequenced smRNA libraries (as well as C. elegans) and found evidence of transitivity of MIRNA genes similar to that found in Arabidopsis. Statistical analysis of antisense transcript abundances, presence of antisense ESTs, and association with smRNAs suggests several hundred Arabidopsis 'orphan' hypothetical genes are non-coding RNAs. Consistent with this hypothesis, we found novel Arabidopsis homologues of some MIRNA genes on the antisense strand of previously annotated protein-coding genes. A Support Vector Machine (SVM) was applied using thermodynamic energy of binding plus novel expression features of sense/antisense transcription topology and siRNA abundances to build a prediction model of miRNA targets. The SVM when trained on targets could predict the "ancient" (deeply conserved) class of validated Arabidopsis MIRNA genes with an accuracy of 84%, and 76% for "new" rapidly-evolving MIRNA genes.
CONCLUSIONS: Antisense and smRNA expression features and computational methods may identify novel MIRNA genes and other non

  1. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.).

    Science.gov (United States)

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-02-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp.

  2. The first mitochondrial genome of the sepsid fly Nemopoda mamaevi Ozerov, 1997 (Diptera: Sciomyzoidea: Sepsidae), with mitochondrial genome phylogeny of cyclorrhapha.

    Directory of Open Access Journals (Sweden)

    Xuankun Li

    Sepsid flies (Diptera: Sepsidae) are important model insects for sexual selection research. In order to develop mitochondrial (mt) genome data for this significant group, we sequenced the first complete mt genome of the sepsid fly Nemopoda mamaevi Ozerov, 1997. The circular 15,878 bp mt genome is typical of Diptera, containing all 37 genes usually present in bilaterian animals. We discovered inaccurate annotations of fly mt genomes previously deposited on GenBank and thus re-annotated all published mt genomes of Cyclorrhapha. These re-annotations were based on comparative analysis of homologous genes, and provide a statistical analysis of start and stop codon positions. We further detected two 18 bp of conserved intergenic sequences from tRNAGlu-tRNAPhe and ND1-tRNASer(UCN) across Cyclorrhapha, which are the mtTERM binding site motifs. Additionally, we compared automated annotation software MITOS with hand annotation method. Phylogenetic trees based on the mt genome data from Cyclorrhapha were inferred by Maximum-likelihood and Bayesian methods, strongly supported a close relationship between Sepsidae and the Tephritoidea.

  3. The first mitochondrial genome of the sepsid fly Nemopoda mamaevi Ozerov, 1997 (Diptera: Sciomyzoidea: Sepsidae), with mitochondrial genome phylogeny of cyclorrhapha.

    Science.gov (United States)

    Li, Xuankun; Ding, Shuangmei; Cameron, Stephen L; Kang, Zehui; Wang, Yuyu; Yang, Ding

    2015-01-01

    Sepsid flies (Diptera: Sepsidae) are important model insects for sexual selection research. In order to develop mitochondrial (mt) genome data for this significant group, we sequenced the first complete mt genome of the sepsid fly Nemopoda mamaevi Ozerov, 1997. The circular 15,878 bp mt genome is typical of Diptera, containing all 37 genes usually present in bilaterian animals. We discovered inaccurate annotations of fly mt genomes previously deposited on GenBank and thus re-annotated all published mt genomes of Cyclorrhapha. These re-annotations were based on comparative analysis of homologous genes, and provide a statistical analysis of start and stop codon positions. We further detected two 18 bp of conserved intergenic sequences from tRNAGlu-tRNAPhe and ND1-tRNASer(UCN) across Cyclorrhapha, which are the mtTERM binding site motifs. Additionally, we compared automated annotation software MITOS with hand annotation method. Phylogenetic trees based on the mt genome data from Cyclorrhapha were inferred by Maximum-likelihood and Bayesian methods, strongly supported a close relationship between Sepsidae and the Tephritoidea.

  4. Comparative genomic mapping of the bovine Fragile Histidine Triad (FHIT) tumour suppressor gene: characterization of a 2 Mb BAC contig covering the locus, complete annotation of the gene, analysis of cDNA and of physiological expression profiles

    Directory of Open Access Journals (Sweden)

    Boussaha Mekki

    2006-05-01

    Abstract Background The Fragile Histidine Triad gene (FHIT) is an oncosuppressor implicated in many human cancers, including vesical tumors. FHIT is frequently hit by deletions caused by fragility at FRA3B, the most active of human common fragile sites, where FHIT lies. Vesical tumors also affect cattle, including animals grazing in the wild on bracken fern; compounds released by the fern are known to induce chromosome fragility and may trigger cancer with the interplay of latent Papilloma virus. Results The bovine FHIT was characterized by assembling a contig of 78 BACs. Sequence tags were designed on human exons and introns and used directly to select bovine BACs, or compared with sequence data in the bovine genome database or in the trace archive of the bovine genome sequencing project, and adapted before use. FHIT is split into ten exons, as in man, with exons 5 to 9 coding for a 149-amino-acid protein. VISTA global alignments between bovine genomic contigs retrieved from the bovine genome database and the human FHIT region were performed. Conservation was extremely high over a 2 Mb region spanning the whole FHIT locus, including the size of introns. Thus, the bovine FHIT covers about 1.6 Mb compared to 1.5 Mb in man. Expression was analyzed by RT-PCR and Northern blot, and was found to be ubiquitous. Four cDNA isoforms were isolated and sequenced; they originate from an alternative usage of three variants of exon 4, revealing a size very close to the major human FHIT cDNAs. Conclusion A comparative genomic approach made it possible to assemble a contig of 78 BACs and to completely annotate a 1.6 Mb region spanning the bovine FHIT gene. The findings confirmed the very high level of conservation between human and bovine genomes and the importance of comparative mapping to speed the annotation process of the recently sequenced bovine genome.
The detailed knowledge of the genomic FHIT region will allow to study the role of FHIT in bovine cancerogenesis

  5. Revisiting the reference genomes of human pathogenic Cryptosporidium species: reannotation of C. parvum Iowa and a new C. hominis reference

    Science.gov (United States)

    Isaza, Juan P.; Galván, Ana Luz; Polanco, Victor; Huang, Bernice; Matveyev, Andrey V.; Serrano, Myrna G.; Manque, Patricio; Buck, Gregory A.; Alzate, Juan F.

    2015-01-01

    Cryptosporidium parvum and C. hominis are the most relevant species of this genus for human health. Both cause a self-limiting diarrhea in immunocompetent individuals, but cause potentially life-threatening disease in the immunocompromised. Despite the importance of these pathogens, only one reference genome of each has been analyzed and published. These two reference genomes were sequenced using automated capillary sequencing; as yet, no next generation sequencing technology has been applied to improve their assemblies and annotations. For C. hominis, the main challenge that prevents a larger number of genomes from being sequenced is its resistance to axenic culture. In the present study, we employed next generation technology to analyse the genomic DNA and RNA to generate a new reference genome sequence of a C. hominis strain isolated directly from human stool and a new genome annotation of the C. parvum Iowa reference genome. PMID:26549794

  6. All SNPs are not created equal: genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs

    DEFF Research Database (Denmark)

    Schork, Andrew J; Thompson, Wesley K; Pham, Phillip;

    2013-01-01

    (TDR = 1-FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment...

  7. MannDB: A microbial annotation database for protein characterization

    Energy Technology Data Exchange (ETDEWEB)

    Zhou, C; Lam, M; Smith, J; Zemla, A; Dyer, M; Kuczmarski, T; Vitalis, E; Slezak, T

    2006-05-19

    MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high

  8. A framework for automated enrichment of functionally significant inverted repeats in whole genomes

    Directory of Open Access Journals (Sweden)

    Frank Ronald L

    2010-10-01

    Abstract Background RNA transcripts from genomic sequences showing dyad symmetry typically adopt hairpin-like, cloverleaf, or similar structures that act as recognition sites for proteins. Such structures often are the precursors of non-coding RNA (ncRNA) sequences like microRNA (miRNA) and small-interfering RNA (siRNA) that have recently garnered more functional significance than in the past. Genomic DNA contains hundreds of thousands of such inverted repeats (IRs) with varying degrees of symmetry. But by collecting statistically significant information from a known set of ncRNA, we can sort these IRs into those that are likely to be functional. Results A novel method was developed to scan genomic DNA for partially symmetric inverted repeats and the resulting set was further refined to match miRNA precursors (pre-miRNA) with respect to their density of symmetry, statistical probability of the symmetry, length of stems in the predicted hairpin secondary structure, and the GC content of the stems. This method was applied on the Arabidopsis thaliana genome and validated against the set of 190 known Arabidopsis pre-miRNA in the miRBase database. A preliminary scan for IRs identified 186 of the known pre-miRNA but with 714700 pre-miRNA candidates. This large number of IRs was further refined to 483908 candidates with 183 pre-miRNA identified and further still to 165371 candidates with 171 pre-miRNA identified (i.e. with 90% of the known pre-miRNA retained). Conclusions 165371 candidates for potentially functional miRNA is still too large a set to warrant wet lab analyses, such as northern blotting, on all of them. Hence additional filters are needed to further refine the number of candidates while still retaining most of the known miRNA. These include detection of promoters and terminators, homology analyses, location of candidate relative to coding regions, and better secondary structure prediction algorithms.
The software developed is designed to easily
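
    The scan-then-filter idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' software: the function names, the fixed stem length, the loop-size bounds, and the mismatch threshold are all assumptions chosen for illustration, and the published method also scores density and statistical probability of symmetry, which this sketch omits.

```python
# Illustrative sketch only; parameters and names are hypothetical,
# not taken from the published method.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_inverted_repeats(seq, stem_len=6, min_loop=3, max_loop=20, max_mismatch=1):
    """Find stem pairs that could fold into a hairpin.

    A hit is a left stem at position i and a right stem at position j such
    that the right stem matches the reverse complement of the left stem
    with at most max_mismatch mismatches (a partially symmetric repeat).
    Returns tuples: (i, j, loop_length, mismatches, stem_gc_fraction).
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - stem_len + 1):
        left = seq[i:i + stem_len]
        if any(base not in COMPLEMENT for base in left):
            continue  # skip windows containing ambiguous bases such as N
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop
            if j + stem_len > len(seq):
                break
            right = seq[j:j + stem_len]
            mismatches = sum(a != b for a, b in zip(target, right))
            if mismatches <= max_mismatch:
                stems = left + right
                gc = (stems.count("G") + stems.count("C")) / len(stems)
                hits.append((i, j, loop, mismatches, gc))
    return hits


if __name__ == "__main__":
    # Toy sequence: stem GACTGC, 4-base loop, then its reverse complement.
    for hit in find_inverted_repeats("TTGACTGCAAAAGCAGTCTT", max_mismatch=0):
        print(hit)
```

    Downstream filters such as minimum stem GC content would then be applied to the returned tuples; the trade-off reported in the abstract (171 of 190 known pre-miRNA retained, i.e. 90%) is the kind of sensitivity figure such filtering has to preserve.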

  9. Concept annotation in the CRAFT corpus

    Directory of Open Access Journals (Sweden)

    Bada Michael

    2012-07-01

    Abstract Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems.
The corpus, annotation guidelines, and other associated resources are

  10. High-throughput automated microfluidic sample preparation for accurate microbial genomics

    Science.gov (United States)

    Kim, Soohong; De Jonghe, Joachim; Kulesa, Anthony B.; Feldman, David; Vatanen, Tommi; Bhattacharyya, Roby P.; Berdy, Brittany; Gomez, James; Nolan, Jill; Epstein, Slava; Blainey, Paul C.

    2017-01-01

    Low-cost shotgun DNA sequencing is transforming the microbial sciences. Sequencing instruments are so effective that sample preparation is now the key limiting factor. Here, we introduce a microfluidic sample preparation platform that integrates the key steps in cells to sequence library sample preparation for up to 96 samples and reduces DNA input requirements 100-fold while maintaining or improving data quality. The general-purpose microarchitecture we demonstrate supports workflows with arbitrary numbers of reaction and clean-up or capture steps. By reducing the sample quantity requirements, we enabled low-input (∼10,000 cells) whole-genome shotgun (WGS) sequencing of Mycobacterium tuberculosis and soil micro-colonies with superior results. We also leveraged the enhanced throughput to sequence ∼400 clinical Pseudomonas aeruginosa libraries and demonstrate excellent single-nucleotide polymorphism detection performance that explained phenotypically observed antibiotic resistance. Fully-integrated lab-on-chip sample preparation overcomes technical barriers to enable broader deployment of genomics across many basic research and translational applications. PMID:28128213

  11. Harmonization and semantic annotation of data dictionaries from the Pharmacogenomics Research Network: a case study.

    Science.gov (United States)

    Zhu, Qian; Freimuth, Robert R; Lian, Zonghui; Bauer, Scott; Pathak, Jyotishman; Tao, Cui; Durski, Matthew J; Chute, Christopher G

    2013-04-01

    The Pharmacogenomics Research Network (PGRN) is a collaborative partnership of research groups funded by NIH to discover and understand how the genome contributes to an individual's response to medication. Since traditional biomedical research studies and clinical trials are often conducted independently, common and standardized representations for data are seldom used. This leads to heterogeneity in data representation, which hinders data reuse, data integration and meta-analyses. This study demonstrates harmonization and semantic annotation work for pharmacogenomics data dictionaries collected from PGRN research groups. A semi-automated system was developed to support the harmonization/annotation process, which includes four individual steps: (1) pre-processing PGRN variables; (2) decomposing and normalizing variable descriptions; (3) semantically annotating words and phrases using controlled terminologies; and (4) grouping PGRN variables into categories based on the annotation results and semantic types, for a total of 1514 PGRN variables. Our results demonstrate that there is a significant amount of variability in how pharmacogenomics data is represented and that additional standardization efforts are needed. This represents a critical first step toward identifying and creating data standards for pharmacogenomics studies.
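
    The four steps listed in this abstract can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the variable names, the tiny TERMINOLOGY lookup table, and the semantic types are hypothetical stand-ins for the controlled terminologies the study used, and the real system was semi-automated rather than a simple dictionary lookup.

```python
import re
from collections import defaultdict

# Hypothetical mini-terminology: token -> (concept label, semantic type).
# A real system would query controlled terminologies instead.
TERMINOLOGY = {
    "age": ("Age", "Demographic"),
    "sex": ("Sex", "Demographic"),
    "dose": ("Dose", "Medication"),
    "warfarin": ("Warfarin", "Medication"),
}


def preprocess(name: str) -> str:
    """Step 1: split camelCase and underscores, then lowercase."""
    name = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)
    return re.sub(r"[_\W]+", " ", name).strip().lower()


def decompose(description: str) -> list:
    """Step 2: break a description into normalized word tokens."""
    return [tok for tok in re.split(r"[^a-z0-9]+", description.lower()) if tok]


def annotate(tokens: list) -> list:
    """Step 3: attach (concept, semantic type) to tokens found in the lookup."""
    return [TERMINOLOGY[tok] for tok in tokens if tok in TERMINOLOGY]


def group(variables: dict) -> dict:
    """Step 4: group variable names by the semantic types of their annotations."""
    groups = defaultdict(list)
    for name, description in variables.items():
        tokens = decompose(preprocess(name) + " " + description)
        types = {sem_type for _, sem_type in annotate(tokens)}
        for sem_type in sorted(types) or ["Unannotated"]:
            groups[sem_type].append(name)
    return dict(groups)


if __name__ == "__main__":
    example = {
        "patientAge": "Age at enrollment",
        "warfarin_dose": "Weekly warfarin dose (mg)",
        "site_id": "Clinic identifier",
    }
    print(group(example))
```

    Variables with no matching term fall into an "Unannotated" bucket, which is where manual review would be needed in a semi-automated workflow.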

  12. MicroScope in 2017: an expanding and evolving integrated resource for community expertise of microbial genomes.

    Science.gov (United States)

    Vallenet, David; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Lajus, Aurélie; Josso, Adrien; Mercier, Jonathan; Renaux, Alexandre; Rollin, Johan; Rouy, Zoe; Roche, David; Scarpelli, Claude; Médigue, Claudine

    2017-01-04

    The annotation of genomes from NGS platforms needs to be automated and fully integrated. However, maintaining consistency and accuracy in genome annotation is a challenging problem because millions of protein database entries are not assigned reliable functions. This shortcoming limits the knowledge that can be extracted from genomes and metabolic models. Launched in 2005, the MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Effective comparative analysis requires a consistent and complete view of biological data, and therefore, support for reviewing the quality of functional annotation is critical. MicroScope allows users to analyze microbial (meta)genomes together with post-genomic experiment results if any (i.e. transcriptomics, re-sequencing of evolved strains, mutant collections, phenotype data). It combines tools and graphical interfaces to analyze genomes and to perform the expert curation of gene functions in a comparative context. Starting with a short overview of the MicroScope system, this paper focuses on some major improvements of the Web interface, mainly for the submission of genomic data and on original tools and pipelines that have been developed and integrated in the platform: computation of pan-genomes and prediction of biosynthetic gene clusters. Today the resource contains data for more than 6000 microbial genomes, and among the 2700 personal accounts (65% of which are now from foreign countries), 14% of the users are performing expert annotations, on at least a weekly basis, contributing to improve the quality of microbial genome annotations.

  13. MicroScope in 2017: an expanding and evolving integrated resource for community expertise of microbial genomes

    Science.gov (United States)

    Vallenet, David; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Lajus, Aurélie; Josso, Adrien; Mercier, Jonathan; Renaux, Alexandre; Rollin, Johan; Rouy, Zoe; Roche, David; Scarpelli, Claude; Médigue, Claudine

    2017-01-01

    The annotation of genomes from NGS platforms needs to be automated and fully integrated. However, maintaining consistency and accuracy in genome annotation is a challenging problem because millions of protein database entries are not assigned reliable functions. This shortcoming limits the knowledge that can be extracted from genomes and metabolic models. Launched in 2005, the MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Effective comparative analysis requires a consistent and complete view of biological data, and therefore, support for reviewing the quality of functional annotation is critical. MicroScope allows users to analyze microbial (meta)genomes together with post-genomic experiment results if any (i.e. transcriptomics, re-sequencing of evolved strains, mutant collections, phenotype data). It combines tools and graphical interfaces to analyze genomes and to perform the expert curation of gene functions in a comparative context. Starting with a short overview of the MicroScope system, this paper focuses on some major improvements of the Web interface, mainly for the submission of genomic data and on original tools and pipelines that have been developed and integrated in the platform: computation of pan-genomes and prediction of biosynthetic gene clusters. Today the resource contains data for more than 6000 microbial genomes, and among the 2700 personal accounts (65% of which are now from foreign countries), 14% of the users are performing expert annotations, on at least a weekly basis, contributing to improve the quality of microbial genome annotations. PMID:27899624

  14. MEETING: Chlamydomonas Annotation Jamboree - October 2003

    Energy Technology Data Exchange (ETDEWEB)

    Grossman, Arthur R

    2007-04-13

    Shotgun sequencing of the nuclear genome of Chlamydomonas reinhardtii (Chlamydomonas throughout) was performed at an approximate 10X coverage by JGI. Roughly half of the genome is now contained on 26 scaffolds, all of which are at least 1.6 Mb, and the coverage of the genome is ~95%. There are now over 200,000 cDNA sequence reads that we have generated as part of the Chlamydomonas genome project (Grossman, 2003; Shrager et al., 2003; Grossman et al. 2007; Merchant et al., 2007); other sequences have also been generated by the Kazusa sequence group (Asamizu et al., 1999; Asamizu et al., 2000) or individual laboratories that have focused on specific genes. Shrager et al. (2003) placed the reads into distinct contigs (an assemblage of reads with overlapping nucleotide sequences), and contigs that group together as part of the same genes have been designated ACEs (assembly of contigs generated from EST information). All of the reads have also been mapped to the Chlamydomonas nuclear genome and the cDNAs and their corresponding genomic sequences have been reassembled, and the resulting assemblage is called an ACEG (an Assembly of contiguous EST sequences supported by genomic sequence) (Jain et al., 2007). Most of the unique genes or ACEGs are also represented by gene models that have been generated by the Joint Genome Institute (JGI, Walnut Creek, CA). These gene models have been placed onto the DNA scaffolds and are presented as a track on the Chlamydomonas genome browser associated with the genome portal (http://genome.jgi-psf.org/Chlre3/Chlre3.home.html). Ultimately, the meeting grant awarded by DOE has helped enormously in the development of an annotation pipeline (a set of guidelines used in the annotation of genes) and resulted in high quality annotation of over 4,000 genes; the annotators were from both Europe and the USA. Some of the people who led the annotation initiative were Arthur Grossman, Olivier Vallon, and Sabeeha Merchant (with many individual

  15. Ontological Annotation with WordNet

    Energy Technology Data Exchange (ETDEWEB)

    Sanfilippo, Antonio P.; Tratz, Stephen C.; Gregory, Michelle L.; Chappell, Alan R.; Whitney, Paul D.; Posse, Christian; Paulson, Patrick R.; Baddeley, Bob; Hohimer, Ryan E.; White, Amanda M.

    2006-06-06

    Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation. WordNet provides a potentially ideal solution to this problem as it offers a highly structured lexical conceptual representation that has been extensively used to develop word sense disambiguation algorithms. However, WordNet has not been designed as an ontology, and while it can be easily turned into one, the result of doing this would present users with serious practical limitations due to the great number of concepts (synonym sets) it contains. Moreover, mapping WordNet to an existing ontology may be difficult and requires substantial labor. We propose to overcome these limitations by developing an analytical platform that (1) provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes, (2) leverages the lexical richness of WordNet to give an extensive characterization of concept class in terms of lexical instances, and (3) integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text. The ensuing framework makes available an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.

  16. Snpdat: Easy and rapid annotation of results from de novo snp discovery projects for model and non-model organisms

    Directory of Open Access Journals (Sweden)

    Doran Anthony G

    2013-02-01

    Abstract Background Single nucleotide polymorphisms (SNPs) are the most abundant genetic variant found in vertebrates and invertebrates. SNP discovery has become a highly automated, robust and relatively inexpensive process allowing the identification of many thousands of mutations for model and non-model organisms. Annotating large numbers of SNPs can be a difficult and complex process. Many tools available are optimised for use with organisms densely sampled for SNPs, such as humans. There are currently few tools available that are species non-specific or support non-model organism data. Results Here we present SNPdat, a high throughput analysis tool that can provide a comprehensive annotation of both novel and known SNPs for any organism with a draft sequence and annotation. Using a dataset of 4,566 SNPs identified in cattle using high-throughput DNA sequencing we demonstrate the annotations performed and the statistics that can be generated by SNPdat. Conclusions SNPdat provides users with a simple tool for annotation of genomes that are either not supported by other tools or have a small number of annotated SNPs available. SNPdat can also be used to analyse datasets from organisms which are densely sampled for SNPs. As a command line tool it can easily be incorporated into existing SNP discovery pipelines and fills a niche for analyses involving non-model organisms that are not supported by many available SNP annotation tools. SNPdat will be of great interest to scientists involved in SNP discovery and analysis projects, particularly those with limited bioinformatics experience.
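
    The positional lookup at the core of SNP annotation can be sketched as follows. This is a minimal illustration and not SNPdat's implementation or interface: the `Feature` tuple, the `annotate_snps` function, the "intergenic" label, and the sample coordinates are all hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical annotated feature (e.g. one gene from an annotation file)."""
    chrom: str
    start: int  # 1-based, inclusive (assumption)
    end: int    # 1-based, inclusive (assumption)
    name: str


def annotate_snps(snps, features):
    """Label each (chrom, pos) SNP with the names of the features it falls in.

    SNPs that overlap no feature are labelled "intergenic".
    """
    # Group features by chromosome so each SNP is only compared
    # against features on its own sequence.
    by_chrom = {}
    for feat in features:
        by_chrom.setdefault(feat.chrom, []).append(feat)

    annotated = []
    for chrom, pos in snps:
        hits = [f.name for f in by_chrom.get(chrom, [])
                if f.start <= pos <= f.end]
        annotated.append((chrom, pos, hits or ["intergenic"]))
    return annotated


# Made-up example data.
features = [Feature("chr1", 100, 200, "geneA")]
snps = [("chr1", 150), ("chr1", 250), ("chr2", 5)]
for record in annotate_snps(snps, features):
    print(record)
```

    A linear scan per SNP is adequate for a sketch; for many thousands of SNPs a sorted index or interval tree would be the more usual choice.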

  17. ECR Browser: A Tool For Visualizing And Accessing Data From Comparisons Of Multiple Vertebrate Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Loots, G G; Ovcharenko, I; Stubbs, L; Nobrega, M A

    2004-01-06

    The increasing number of vertebrate genomes being sequenced in draft or finished form provides a unique opportunity to study and decode the language of DNA sequence through comparative genome alignments. However, novel tools and strategies are required to accommodate this increasing volume of genomic information and to facilitate experimental annotation of genome function. Here we present the ECR Browser, a tool that provides easy and dynamic access to whole genome alignments of human, mouse, rat and fish sequences. This web-based tool (http://ecrbrowser.dcode.org) provides the starting point for discovery of novel genes, identification of distant gene regulatory elements and prediction of transcription factor binding sites. The genome alignment portal of the ECR Browser also permits fast and automated alignment of any user-submitted sequence to the genome of choice. The interconnection of the ECR Browser with other DNA sequence analysis tools creates a unique portal for studying and exploring vertebrate genomes.

  18. JAFA: a protein function annotation meta-server

    DEFF Research Database (Denmark)

    Friedberg, Iddo; Harder, Tim; Godzik, Adam

    2006-01-01

    With the high number of sequences and structures streaming in from genomic projects, there is a need for more powerful and sophisticated annotation tools. Most problematic of the annotation efforts is predicting gene and protein function. Over the past few years there has been considerable progre...

  19. The DNA sequence, annotation and analysis of human chromosome 3

    DEFF Research Database (Denmark)

    Muzny, Donna M; Scherer, Steven E; Kaul, Rajinder

    2006-01-01

    After the completion of a draft human genome sequence, the International Human Genome Sequencing Consortium has proceeded to finish and annotate each of the 24 chromosomes comprising the human genome. Here we describe the sequencing and analysis of human chromosome 3, one of the largest human chr...

  20. Rapid identification of sequences for orphan enzymes to power accurate protein annotation.

    Directory of Open Access Journals (Sweden)

    Kevin R Ramkissoon

    The power of genome sequencing depends on the ability to understand what those genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.

  1. Rapid identification of sequences for orphan enzymes to power accurate protein annotation.

    Science.gov (United States)

    Ramkissoon, Kevin R; Miller, Jennifer K; Ojha, Sunil; Watson, Douglas S; Bomar, Martha G; Galande, Amit K; Shearer, Alexander G

    2013-01-01

    The power of genome sequencing depends on the ability to understand what those genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.

  2. Insights into the annotated genome sequence of Methanoculleus bourgensis MS2(T), related to dominant methanogens in biogas-producing plants.

    Science.gov (United States)

    Maus, Irena; Wibberg, Daniel; Stantscheff, Robbin; Stolze, Yvonne; Blom, Jochen; Eikmeyer, Felix-Gregor; Fracowiak, Jochen; König, Helmut; Pühler, Alfred; Schlüter, Andreas

    2015-05-10

    The final step of the biogas production process, the methanogenesis, is frequently dominated by members of the genus Methanoculleus. In particular, the species Methanoculleus bourgensis was identified to play a role in different biogas reactor systems. The genome of the type strain M. bourgensis MS2(T), originally isolated from a sewage sludge digestor, was completely sequenced to analyze putative adaptive genome features conferring competitiveness within biogas reactor environments to the strain. Sequencing and assembly of the M. bourgensis MS2(T) genome yielded a chromosome with a size of 2,789,773 bp. Comparative analysis of M. bourgensis MS2(T) and Methanoculleus marisnigri JR1 revealed significant similarities. The absence of genes for a putative ammonium uptake system may indicate that M. bourgensis MS2(T) is adapted to environments rich in ammonium/ammonia. Specific genes featuring predicted functions in the context of osmolyte production were detected in the genome of M. bourgensis MS2(T). Mapping of metagenome sequences derived from a production-scale biogas plant revealed that M. bourgensis MS2(T) almost completely comprises the genetic information of dominant methanogens present in the biogas reactor analyzed. Hence, availability of the M. bourgensis MS2(T) genome sequence may be valuable regarding further research addressing the performance of Methanoculleus species in agricultural biogas plants.

  3. Ubiquitous Annotation Systems

    DEFF Research Database (Denmark)

    Hansen, Frank Allan

    2006-01-01

    Ubiquitous annotation systems allow users to annotate physical places, objects, and persons with digital information. Especially in the field of location based information systems much work has been done to implement adaptive and context-aware systems, but few efforts have focused on the general requirements for linking information to objects in both physical and digital space. This paper surveys annotation techniques from open hypermedia systems, Web based annotation systems, and mobile and augmented reality systems to illustrate different approaches to four central challenges ubiquitous annotation systems have to deal with: anchoring, structuring, presentation, and authoring. Through a number of examples each challenge is discussed and HyCon, a context-aware hypermedia framework developed at the University of Aarhus, Denmark, is used to illustrate an integrated approach to ubiquitous annotations...

  4. Visualization for genomics: the Microbial Genome Viewer.

    NARCIS (Netherlands)

    Kerkhoven, R.; Enckevort, F.H.J. van; Boekhorst, J.; Molenaar, D.; Siezen, R.J.

    2004-01-01

    SUMMARY: A Web-based visualization tool, the Microbial Genome Viewer, is presented that allows the user to combine complex genomic data in a highly interactive way. This Web tool enables the interactive generation of chromosome wheels and linear genome maps from genome annotation data stored in a My

  5. Annotate-it: a Swiss-knife approach to annotation, analysis and interpretation of single nucleotide variation in human disease.

    Science.gov (United States)

    Sifrim, Alejandro; Van Houdt, Jeroen Kj; Tranchevent, Leon-Charles; Nowakowska, Beata; Sakai, Ryo; Pavlopoulos, Georgios A; Devriendt, Koen; Vermeesch, Joris R; Moreau, Yves; Aerts, Jan

    2012-01-01

    The increasing size and complexity of exome/genome sequencing data require new tools for clinical geneticists to discover disease-causing variants. Bottlenecks in identifying the causative variation include poor cross-sample querying, constantly changing functional annotation and not considering existing knowledge concerning the phenotype. We describe a methodology that facilitates exploration of patient sequencing data towards identification of causal variants under different genetic hypotheses. Annotate-it facilitates handling, analysis and interpretation of high-throughput single nucleotide variant data. We demonstrate our strategy using three case studies. Annotate-it is freely available and test data are accessible to all users at http://www.annotate-it.org.

  6. The GATO gene annotation tool for research laboratories

    Directory of Open Access Journals (Sweden)

    A. Fujita

    2005-11-01

    Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from anywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. Minimum free disk space required is 2 MB.

  7. The UCSC genome browser database

    DEFF Research Database (Denmark)

    Kuhn, R M; Karolchik, D; Zweig, A S

    2007-01-01

    The University of California, Santa Cruz Genome Browser Database contains, as of September 2006, sequence and annotation data for the genomes of 13 vertebrate and 19 invertebrate species. The Genome Browser displays a wide variety of annotations at all scales from the single nucleotide level up t...

  8. The UCSC Genome Browser Database

    DEFF Research Database (Denmark)

    Hinrichs, A S; Karolchik, D; Baertsch, R

    2006-01-01

    The University of California Santa Cruz Genome Browser Database (GBD) contains sequence and annotation data for the genomes of about a dozen vertebrate species and several major model organisms. Genome annotations typically include assembly data, sequence composition, genes and gene predictions, ...

  9. Biosynthesis of Akaeolide and Lorneic Acids and Annotation of Type I Polyketide Synthase Gene Clusters in the Genome of Streptomyces sp. NPS554

    Directory of Open Access Journals (Sweden)

    Tao Zhou

    2015-01-01

    The incorporation pattern of biosynthetic precursors into two structurally unique polyketides, akaeolide and lorneic acid A, was elucidated by feeding experiments with 13C-labeled precursors. In addition, draft genome sequencing of the producer, Streptomyces sp. NPS554, was performed and the biosynthetic gene clusters for these polyketides were identified. The putative gene clusters contain all the polyketide synthase (PKS) domains necessary for assembly of the carbon skeletons. Combined with the 13C-labeling results, gene function prediction enabled us to propose biosynthetic pathways involving unusual carbon-carbon bond formation reactions. Genome analysis also indicated the presence of at least ten orphan type I PKS gene clusters that might be responsible for the production of new polyketides.

  10. PREPACT 2.0: Predicting C-to-U and U-to-C RNA Editing in Organelle Genome Sequences with Multiple References and Curated RNA Editing Annotation

    OpenAIRE

    2013-01-01

    RNA editing is vast in some genetic systems, with up to thousands of targeted C-to-U and U-to-C substitutions in mitochondria and chloroplasts of certain plants. Efficient prognoses of RNA editing in organelle genomes will help to reveal overlooked cases of editing. We present PREPACT 2.0 (http://www.prepact.de) with numerous enhancements of our previously developed Plant RNA Editing Prediction & Analysis Computer Tool. Reference organelle transcriptomes for editing prediction have been exten...

  11. Annotating Coloured Petri Nets

    DEFF Research Database (Denmark)

    Lindstrøm, Bo; Wells, Lisa Marie

    2002-01-01

    Coloured Petri nets (CP-nets) can be used for several fundamentally different purposes like functional analysis, performance analysis, and visualisation. To be able to use the corresponding tool extensions and libraries it is sometimes necessary to include extra auxiliary information in the CP-ne... a method which makes it possible to associate auxiliary information, called annotations, with tokens without modifying the colour sets of the CP-net. Annotations are pieces of information that are not essential for determining the behaviour of the system being modelled, but are rather added to support a certain use of the CP-net. We define the semantics of annotations by describing a translation from a CP-net and the corresponding annotation layers to another CP-net where the annotations are an integrated part of the CP-net.

  12. MAGPIE/EGRET Annotation of the 2.9-Mb Drosophila melanogaster Adh Region

    Science.gov (United States)

    Gaasterland, Terry; Sczyrba, Alexander; Thomas, Elizabeth; Aytekin-Kurban, Gulriz; Gordon, Paul; Sensen, Christoph W.

    2000-01-01

    Our challenge in annotating the 2.91-Mb Adh region of the Drosophila melanogaster genome was to identify genetic and genomic features automatically, completely, and precisely within a 6-week period. To do so, we augmented the MAGPIE microbial genome annotation system to handle eukaryotic genomic sequence data. The new configuration required the integration of eukaryotic gene-finding tools and DNA repeat tools into the automatic data collection module. It also required us to define in MAGPIE new strategies to combine data about eukaryotic exon predictions with functional data to refine the exon predictions. At the heart of the resulting new eukaryotic genome annotation system is a reverse comparison of public protein and complementary DNA sequences against the input genome to identify missing exons and to refine exon boundaries. The software modules that add eukaryotic genome annotation capability to MAGPIE are available as EGRET (Eukaryotic Genome Rapid Evaluation Tool). PMID:10779489

  13. ParsEval: parallel comparison and analysis of gene structure annotations

    Directory of Open Access Journals (Sweden)

    Standage Daniel S

    2012-08-01

    Abstract Background Accurate gene structure annotation is a fundamental but somewhat elusive goal of genome projects, as witnessed by the fact that (model) genomes typically undergo several cycles of re-annotation. In many cases, it is not only different versions of annotations that need to be compared but also different sources of annotation of the same genome, derived from distinct gene prediction workflows. Such comparisons are of interest to annotation providers, prediction software developers, and end-users, who all need to assess what is common and what is different among distinct annotation sources. We developed ParsEval, a software application for pairwise comparison of sets of gene structure annotations. ParsEval calculates several statistics that highlight the similarities and differences between the two sets of annotations provided. These statistics are presented in an aggregate summary report, with additional details provided as individual reports specific to non-overlapping, gene-model-centric genomic loci. Genome browser styled graphics embedded in these reports help visualize the genomic context of the annotations. Output from ParsEval is both easily read and parsed, enabling systematic identification of problematic gene models for subsequent focused analysis. Results ParsEval is capable of analyzing annotations for large eukaryotic genomes on typical desktop or laptop hardware. In comparison to existing methods, ParsEval exhibits a considerable performance improvement, both in terms of runtime and memory consumption. Reports from ParsEval can provide relevant biological insights into the gene structure annotations being compared. Conclusions Implemented in C, ParsEval provides the quickest and most feature-rich solution for genome annotation comparison to date. The source code is freely available (under an ISC license) at http://parseval.sourceforge.net/.
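
    A pairwise comparison of two annotation sets for one locus can be sketched as follows. The abstract does not list which statistics ParsEval reports, so the three used here (exactly shared exons, nucleotide-level sensitivity, nucleotide-level precision) are assumptions chosen for illustration, as are the function names and the sample coordinates; this is not ParsEval's code or output format.

```python
def covered_positions(intervals):
    """Return the set of positions covered by (start, end) inclusive intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def compare_annotations(reference, prediction):
    """Compare two lists of exon intervals annotated on the same sequence.

    Returns the number of exons with identical coordinates in both sets,
    plus nucleotide-level sensitivity (shared / reference) and
    precision (shared / prediction).
    """
    ref_pos = covered_positions(reference)
    pred_pos = covered_positions(prediction)
    shared = ref_pos & pred_pos
    return {
        "shared_exons": len(set(reference) & set(prediction)),
        "sensitivity": len(shared) / len(ref_pos) if ref_pos else 0.0,
        "precision": len(shared) / len(pred_pos) if pred_pos else 0.0,
    }


# Made-up exon coordinates for one locus from two hypothetical sources.
reference = [(1, 10), (21, 30)]
prediction = [(1, 10), (26, 35)]
print(compare_annotations(reference, prediction))
```

    Expanding intervals into position sets keeps the sketch short but costs memory proportional to locus length; an implementation aimed at whole genomes would work on interval boundaries instead.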

  14. Personnalisation de Systèmes OLAP Annotés

    CERN Document Server

    Jerbi, Houssem; Ravat, Franck; Teste, Olivier

    2010-01-01

    This paper deals with personalization of annotated OLAP systems. Data constellation is extended to support annotations and user preferences. Annotations reflect the decision-maker experience whereas user preferences enable users to focus on the most interesting data. User preferences allow annotated contextual recommendations helping the decision-maker during his/her multidimensional navigations.

  15. TransportDB 2.0: a database for exploring membrane transporters in sequenced genomes from all domains of life

    Science.gov (United States)

    Elbourne, Liam D. H.; Tetu, Sasha G.; Hassan, Karl A.; Paulsen, Ian T.

    2017-01-01

    All cellular life contains an extensive array of membrane transport proteins. The vast majority of these transporters have not been experimentally characterized. We have developed a bioinformatic pipeline to identify and annotate complete sets of transporters in any sequenced genome. This pipeline is now fully automated enabling it to better keep pace with the accelerating rate of genome sequencing. This manuscript describes TransportDB 2.0 (http://www.membranetransport.org/transportDB2/), a completely updated version of TransportDB, which provides access to the large volumes of data generated by our automated transporter annotation pipeline. The TransportDB 2.0 web portal has been rebuilt to utilize contemporary JavaScript libraries, providing a highly interactive interface to the annotation information, and incorporates analysis tools that enable users to query the database on a number of levels. For example, TransportDB 2.0 includes tools that allow users to select annotated genomes of interest from the thousands of species held in the database and compare their complete transporter complements. PMID:27899676

  16. TransportDB 2.0: a database for exploring membrane transporters in sequenced genomes from all domains of life.

    Science.gov (United States)

    Elbourne, Liam D H; Tetu, Sasha G; Hassan, Karl A; Paulsen, Ian T

    2017-01-04

    All cellular life contains an extensive array of membrane transport proteins. The vast majority of these transporters have not been experimentally characterized. We have developed a bioinformatic pipeline to identify and annotate complete sets of transporters in any sequenced genome. This pipeline is now fully automated enabling it to better keep pace with the accelerating rate of genome sequencing. This manuscript describes TransportDB 2.0 (http://www.membranetransport.org/transportDB2/), a completely updated version of TransportDB, which provides access to the large volumes of data generated by our automated transporter annotation pipeline. The TransportDB 2.0 web portal has been rebuilt to utilize contemporary JavaScript libraries, providing a highly interactive interface to the annotation information, and incorporates analysis tools that enable users to query the database on a number of levels. For example, TransportDB 2.0 includes tools that allow users to select annotated genomes of interest from the thousands of species held in the database and compare their complete transporter complements.

  17. Model and Interoperability using Meta Data Annotations

    Science.gov (United States)

    David, O.

    2011-12-01

    Software frameworks and architectures need meta data to efficiently support model integration. Modelers have to know the context of a model, often stepping into modeling semantics and auxiliary information usually not provided in a concise structure and universal format, consumable by a range of (modeling) tools. XML often seems the obvious solution for capturing meta data, but its wide adoption to facilitate model interoperability is limited by XML schema fragmentation, complexity, and verbosity outside of a data-automation process. Ontologies seem to overcome those shortcomings; however, the practical significance of their use remains to be demonstrated. OMS version 3 took a different approach for meta data representation. The fundamental building block of a modular model in OMS is a software component representing a single physical process, calibration method, or data access approach. Here, programming language features known as Annotations or Attributes were adopted. Within other (non-modeling) frameworks it has been observed that annotations lead to cleaner and leaner application code. Framework-supported model integration, traditionally accomplished using Application Programming Interface (API) calls, is now achieved using descriptive code annotations. Fully annotated components for various hydrological and Ag-system models now provide information directly for (i) model assembly and building, (ii) data flow analysis for implicit multi-threading or visualization, (iii) automated and comprehensive model documentation of component dependencies, physical data properties, (iv) automated model and component testing, calibration, and optimization, and (v) automated audit-traceability to account for all model resources leading to a particular simulation result. Such a non-invasive methodology leads to models and modeling components with only minimal dependencies on the modeling framework but a strong reference to its originating code. Since models and
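
    The idea of describing a component through declarative metadata rather than framework API calls can be imitated in a few lines. The sketch below is an analogy only and does not reproduce the OMS annotations: Python has no direct equivalent of the Annotations or Attributes mentioned in the abstract, so dataclass field metadata stands in for them, and the helper names `In`, `Out` and `describe`, the component, its units, and its arithmetic are all hypothetical.

```python
from dataclasses import dataclass, field, fields


def In(description, unit=None):
    """Hypothetical marker: declare an input field and attach its metadata."""
    return field(default=0.0,
                 metadata={"role": "in", "description": description, "unit": unit})


def Out(description, unit=None):
    """Hypothetical marker: declare an output field and attach its metadata."""
    return field(default=0.0,
                 metadata={"role": "out", "description": description, "unit": unit})


@dataclass
class Evapotranspiration:
    """Toy component; it depends on no framework base class or API call."""
    temperature: float = In("mean air temperature", unit="degC")
    radiation: float = In("net radiation", unit="MJ/m2/day")
    pet: float = Out("potential evapotranspiration", unit="mm/day")

    def execute(self):
        # Placeholder arithmetic for illustration, not a physical formula.
        self.pet = max(0.0, 0.1 * self.radiation + 0.05 * self.temperature)


def describe(component_cls):
    """Collect declared inputs and outputs by reading field metadata only."""
    summary = {"in": [], "out": []}
    for f in fields(component_cls):
        role = f.metadata.get("role")
        if role in summary:
            summary[role].append((f.name, f.metadata["description"], f.metadata["unit"]))
    return summary


print(describe(Evapotranspiration))
component = Evapotranspiration(temperature=20.0, radiation=10.0)
component.execute()
print(component.pet)
```

    Because `describe` never instantiates or runs the component, the same metadata could in principle feed documentation, wiring, or test generation, which is the kind of reuse the abstract attributes to annotated components.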

  18. DepecheMood: a Lexicon for Emotion Analysis from Crowd-Annotated News

    OpenAIRE

    2014-01-01

    While many lexica annotated with words polarity are available for sentiment analysis, very few tackle the harder task of emotion analysis and are usually quite limited in coverage. In this paper, we present a novel approach for extracting - in a totally automated way - a high-coverage and high-precision lexicon of roughly 37 thousand terms annotated with emotion scores, called DepecheMood. Our approach exploits in an original way 'crowd-sourced' affective annotation implicitly provided by rea...

  19. Genome Sequencing

    DEFF Research Database (Denmark)

    Sato, Shusei; Andersen, Stig Uggerhøj

    2014-01-01

    The current Lotus japonicus reference genome sequence is based on a hybrid assembly of Sanger TAC/BAC, Sanger shotgun and Illumina shotgun sequencing data generated from the Miyakojima-MG20 accession. It covers nearly all expressed L. japonicus genes and has been annotated mainly based on transcr...

  20. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads

    Energy Technology Data Exchange (ETDEWEB)

    Martin, Jeffrey; Bruno, Vincent M.; Fang, Zhide; Meng, Xiandong; Blow, Matthew; Zhang, Tao; Sherlock, Gavin; Snyder, Michael; Wang, Zhong

    2010-11-19

    Background: Comprehensive annotation and quantification of transcriptomes are outstanding problems in functional genomics. While high throughput mRNA sequencing (RNA-Seq) has emerged as a powerful tool for addressing these problems, its success is dependent upon the availability and quality of reference genome sequences, thus limiting the organisms to which it can be applied. Results: Here, we describe Rnnotator, an automated software pipeline that generates transcript models by de novo assembly of RNA-Seq data without the need for a reference genome. We have applied the Rnnotator assembly pipeline to two yeast transcriptomes and compared the results to the reference gene catalogs of these organisms. The contigs produced by Rnnotator are highly accurate (95%) and reconstruct full-length genes for the majority of the existing gene models (54.3%). Furthermore, our analyses revealed many novel transcribed regions that are absent from well annotated genomes, suggesting Rnnotator serves as a complementary approach to analysis based on a reference genome for comprehensive transcriptomics. Conclusions: These results demonstrate that the Rnnotator pipeline is able to reconstruct full-length transcripts in the absence of a complete reference genome.

  1. GeneViTo: Visualizing gene-product functional and structural features in genomic datasets

    Directory of Open Access Journals (Sweden)

    Promponas Vasilis J

    2003-10-01

    Abstract Background The availability of increasing amounts of sequence data from completely sequenced genomes boosts the development of new computational methods for automated genome annotation and comparative genomics. Therefore, there is a need for tools that facilitate the visualization of raw data and results produced by bioinformatics analysis, providing new means for interactive genome exploration. Visual inspection can be used as a basis to assess the quality of various analysis algorithms and to aid in-depth genomic studies. Results GeneViTo is a JAVA-based computer application that serves as a workbench for genome-wide analysis through visual interaction. The application deals with various experimental information concerning both DNA and protein sequences (derived from public sequence databases or proprietary data sources) and meta-data obtained by various prediction algorithms, classification schemes or user-defined features. Interaction with a Graphical User Interface (GUI) allows easy extraction of genomic and proteomic data referring to the sequence itself, sequence features, or general structural and functional features. Emphasis is laid on the potential comparison between annotation and prediction data in order to offer a supplement to the provided information, especially in cases of "poor" annotation, or an evaluation of available predictions. Moreover, desired information can be output in high quality JPEG image files for further elaboration and scientific use. A compilation of properly formatted GeneViTo input data for demonstration is available to interested readers for two completely sequenced prokaryotes, Chlamydia trachomatis and Methanococcus jannaschii. Conclusions GeneViTo offers an inspectional view of genomic functional elements, concerning data stemming both from database annotation and analysis tools for an overall analysis of existing genomes. The application is compatible with Linux or Windows ME-2000-XP operating

  2. Current and future trends in marine image annotation software

    Science.gov (United States)

    Gomes-Pereira, Jose Nuno; Auger, Vincent; Beisiegel, Kolja; Benjamin, Robert; Bergmann, Melanie; Bowden, David; Buhl-Mortensen, Pal; De Leo, Fabio C.; Dionísio, Gisela; Durden, Jennifer M.; Edwards, Luke; Friedman, Ariell; Greinert, Jens; Jacobsen-Stout, Nancy; Lerner, Steve; Leslie, Murray; Nattkemper, Tim W.; Sameoto, Jessica A.; Schoening, Timm; Schouten, Ronald; Seager, James; Singh, Hanumant; Soubigou, Olivier; Tojeira, Inês; van den Beld, Inge; Dias, Frederico; Tempera, Fernando; Santos, Ricardo S.

    2016-12-01

    Given the need to describe, analyze and index large quantities of marine imagery data for exploration and monitoring activities, a range of specialized image annotation tools have been developed worldwide. Image annotation - the process of transposing objects or events represented in a video or still image to the semantic level, may involve human interactions and computer-assisted solutions. Marine image annotation software (MIAS) have enabled over 500 publications to date. We review the functioning, application trends and developments, by comparing general and advanced features of 23 different tools utilized in underwater image analysis. MIAS requiring human input are basically a graphical user interface, with a video player or image browser that recognizes a specific time code or image code, allowing to log events in a time-stamped (and/or geo-referenced) manner. MIAS differ from similar software by the capability of integrating data associated to video collection, the most simple being the position coordinates of the video recording platform. MIAS have three main characteristics: annotating events in real time, posteriorly to annotation and interact with a database. These range from simple annotation interfaces, to full onboard data management systems, with a variety of toolboxes. Advanced packages allow to input and display data from multiple sensors or multiple annotators via intranet or internet. Posterior human-mediated annotation often include tools for data display and image analysis, e.g. length, area, image segmentation, point count; and in a few cases the possibility of browsing and editing previous dive logs or to analyze the annotations. The interaction with a database allows the automatic integration of annotations from different surveys, repeated annotation and collaborative annotation of shared datasets, browsing and querying of data. Progress in the field of automated annotation is mostly in post processing, for stable platforms or still images

  3. Semantic annotation of mutable data.

    Science.gov (United States)

    Morris, Robert A; Dou, Lei; Hanken, James; Kelly, Maureen; Lowery, David B; Ludäscher, Bertram; Macklin, James A; Morris, Paul J

    2013-01-01

    Electronic annotation of scientific data is very similar to annotation of documents. Both types of annotation amplify the original object, add related knowledge to it, and dispute or support assertions in it. In each case, annotation is a framework for discourse about the original object, and, in each case, an annotation needs to clearly identify its scope and its own terminology. However, electronic annotation of data differs from annotation of documents: the content of the annotations, including expectations and supporting evidence, is more often shared among members of networks. Any consequent actions taken by the holders of the annotated data could be shared as well. But even those current annotation systems that admit data as their subject often make it difficult or impossible to annotate at fine-enough granularity to use the results in this way for data quality control. We address these kinds of issues by offering simple extensions to an existing annotation ontology and describe how the results support an interest-based distribution of annotations. We are using the result to design and deploy a platform that supports annotation services overlaid on networks of distributed data, with particular application to data quality control. Our initial instance supports a set of natural science collection metadata services. An important application is the support for data quality control and provision of missing data. A previous proof of concept demonstrated such use based on data annotations modeled with XML-Schema.

  4. Semantic annotation of mutable data.

    Directory of Open Access Journals (Sweden)

    Robert A Morris

    Full Text Available Electronic annotation of scientific data is very similar to annotation of documents. Both types of annotation amplify the original object, add related knowledge to it, and dispute or support assertions in it. In each case, annotation is a framework for discourse about the original object, and, in each case, an annotation needs to clearly identify its scope and its own terminology. However, electronic annotation of data differs from annotation of documents: the content of the annotations, including expectations and supporting evidence, is more often shared among members of networks. Any consequent actions taken by the holders of the annotated data could be shared as well. But even those current annotation systems that admit data as their subject often make it difficult or impossible to annotate at fine-enough granularity to use the results in this way for data quality control. We address these kinds of issues by offering simple extensions to an existing annotation ontology and describe how the results support an interest-based distribution of annotations. We are using the result to design and deploy a platform that supports annotation services overlaid on networks of distributed data, with particular application to data quality control. Our initial instance supports a set of natural science collection metadata services. An important application is the support for data quality control and provision of missing data. A previous proof of concept demonstrated such use based on data annotations modeled with XML-Schema.

  5. Semantic annotation of morphological descriptions: an overall strategy

    Directory of Open Access Journals (Sweden)

    Cui Hong

    2010-05-01

    Full Text Available Abstract Background Large volumes of morphological descriptions of whole organisms have been created as print or electronic text in a human-readable format. Converting the descriptions into computer-readable formats gives a new life to the valuable knowledge on biodiversity. Research in this area started 20 years ago, yet not sufficient progress has been made to produce an automated system that requires only minimal human intervention but works on descriptions of various plant and animal groups. This paper attempts to examine the hindering factors by identifying the mismatches between existing research and the characteristics of morphological descriptions. Results This paper reviews the techniques that have been used for automated annotation, reports exploratory results on characteristics of morphological descriptions as a genre, and identifies challenges facing automated annotation systems. Based on these criteria, the paper proposes an overall strategy for converting descriptions of various taxon groups with the least human effort. Conclusions A combined unsupervised and supervised machine learning strategy is needed to construct domain ontologies and lexicons and to ultimately achieve automated semantic annotation of morphological descriptions. Further, we suggest that each effort in creating a new description or annotating an individual description collection should be shared and contribute to the "biodiversity information commons" for the Semantic Web. This cannot be done without a sound strategy and a close partnership between and among information scientists and biologists.

  6. A Machine Learning Based Analytical Framework for Semantic Annotation Requirements

    CERN Document Server

    Hassanzadeh, Hamed; 10.5121/ijwest.2011.2203

    2011-01-01

    The Semantic Web is an extension of the current web in which information is given well-defined meaning. The perspective of Semantic Web is to promote the quality and intelligence of the current web by changing its contents into machine understandable form. Therefore, semantic level information is one of the cornerstones of the Semantic Web. The process of adding semantic metadata to web resources is called Semantic Annotation. There are many obstacles against the Semantic Annotation, such as multilinguality, scalability, and issues which are related to diversity and inconsistency in content of different web pages. Due to the wide range of domains and the dynamic environments that the Semantic Annotation systems must be performed on, the problem of automating annotation process is one of the significant challenges in this domain. To overcome this problem, different machine learning approaches such as supervised learning, unsupervised learning and more recent ones like, semi-supervised learning and active learn...

  7. UCSC genome browser tutorial.

    Science.gov (United States)

    Zweig, Ann S; Karolchik, Donna; Kuhn, Robert M; Haussler, David; Kent, W James

    2008-08-01

    The University of California Santa Cruz (UCSC) Genome Bioinformatics website consists of a suite of free, open-source, on-line tools that can be used to browse, analyze, and query genomic data. These tools are available to anyone who has an Internet browser and an interest in genomics. The website provides a quick and easy-to-use visual display of genomic data. It places annotation tracks beneath genome coordinate positions, allowing rapid visual correlation of different types of information. Many of the annotation tracks are submitted by scientists worldwide; the others are computed by the UCSC Genome Bioinformatics group from publicly available sequence data. It also allows users to upload and display their own experimental results or annotation sets by creating a custom track. The suite of tools, downloadable data files, and links to documentation and other information can be found at http://genome.ucsc.edu/.

  8. Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs.

    Directory of Open Access Journals (Sweden)

    Norihiro Maeda

    2006-04-01

    Full Text Available The international FANTOM consortium aims to produce a comprehensive picture of the mammalian transcriptome, based upon an extensive cDNA collection and functional annotation of full-length enriched cDNAs. The previous dataset, FANTOM2, comprised 60,770 full-length enriched cDNAs. Functional annotation revealed that this cDNA dataset contained only about half of the estimated number of mouse protein-coding genes, indicating that a number of cDNAs still remained to be collected and identified. To pursue the complete gene catalog that covers all predicted mouse genes, cloning and sequencing of full-length enriched cDNAs has been continued since FANTOM2. In FANTOM3, 42,031 newly isolated cDNAs were subjected to functional annotation, and the annotation of 4,347 FANTOM2 cDNAs was updated. To accomplish accurate functional annotation, we improved our automated annotation pipeline by introducing new coding sequence prediction programs and developed a Web-based annotation interface for simplifying the annotation procedures to reduce manual annotation errors. Automated coding sequence and function prediction was followed with manual curation and review by expert curators. A total of 102,801 full-length enriched mouse cDNAs were annotated. Out of 102,801 transcripts, 56,722 were functionally annotated as protein coding (including partial or truncated transcripts), providing to our knowledge the greatest current coverage of the mouse proteome by full-length cDNAs. The total number of distinct non-protein-coding transcripts increased to 34,030. The FANTOM3 annotation system, consisting of automated computational prediction, manual curation, and final expert curation, facilitated the comprehensive characterization of the mouse transcriptome, and could be applied to the transcriptomes of other species.

  9. Cheating. An Annotated Bibliography.

    Science.gov (United States)

    Wildemuth, Barbara M., Comp.

    This 89-item, annotated bibliography was compiled to provide access to research and discussions of cheating and, specifically, cheating on tests. It is not limited to any educational level, nor is it confined to any specific curriculum area. Two data bases were searched by computer, and a library search was conducted. A computer search of the…

  10. Annotated bibliography traceability

    NARCIS (Netherlands)

    Narain, G.

    2006-01-01

    This annotated bibliography contains summaries of articles and chapters of books, which are relevant to traceability. After each summary there is a part about the relevancy of the paper for the LEI project. The aim of the LEI-project is to gain insight in several aspects of traceability in order to

  11. Annotation of Regular Polysemy

    DEFF Research Database (Denmark)

    Martinez Alonso, Hector

    Regular polysemy has received a lot of attention from the theory of lexical semantics and from computational linguistics. However, there is no consensus on how to represent the sense of underspecified examples at the token level, namely when annotating or disambiguating senses of metonymic words...

  12. Collaborative Movie Annotation

    Science.gov (United States)

    Zad, Damon Daylamani; Agius, Harry

    In this paper, we focus on metadata for self-created movies like those found on YouTube and Google Video, the duration of which are increasing in line with falling upload restrictions. While simple tags may have been sufficient for most purposes for traditionally very short video footage that contains a relatively small amount of semantic content, this is not the case for movies of longer duration which embody more intricate semantics. Creating metadata is a time-consuming process that takes a great deal of individual effort; however, this effort can be greatly reduced by harnessing the power of Web 2.0 communities to create, update and maintain it. Consequently, we consider the annotation of movies within Web 2.0 environments, such that users create and share that metadata collaboratively and propose an architecture for collaborative movie annotation. This architecture arises from the results of an empirical experiment where metadata creation tools, YouTube and an MPEG-7 modelling tool, were used by users to create movie metadata. The next section discusses related work in the areas of collaborative retrieval and tagging. Then, we describe the experiments that were undertaken on a sample of 50 users. Next, the results are presented which provide some insight into how users interact with existing tools and systems for annotating movies. Based on these results, the paper then develops an architecture for collaborative movie annotation.

  13. Annotated Bibliography. First Edition.

    Science.gov (United States)

    Haring, Norris G.

    An annotated bibliography which presents approximately 300 references from 1951 to 1973 on the education of severely/profoundly handicapped persons. Citations are grouped alphabetically by author's name within the following categories: characteristics and treatment, gross motor development, sensory and motor development, physical therapy for the…

  14. Annotation: The Savant Syndrome

    Science.gov (United States)

    Heaton, Pamela; Wallace, Gregory L.

    2004-01-01

    Background: Whilst interest has focused on the origin and nature of the savant syndrome for over a century, it is only within the past two decades that empirical group studies have been carried out. Methods: The following annotation briefly reviews relevant research and also attempts to address outstanding issues in this research area.…

  15. Annotation of Ehux ESTs

    Energy Technology Data Exchange (ETDEWEB)

    Kuo, Alan; Grigoriev, Igor

    2009-06-12

    22 percent of ESTs do not align with scaffolds. The EST Pipeline assembles 17126 consensi from the nonaligned ESTs. The Annotation Pipeline predicts 8564 ORFs on the consensi. Domain analysis of ORFs reveals missing genes. Cluster analysis reveals missing genes. Expression analysis reveals potential strain-specific genes.

  16. Solar Tutorial and Annotation Resource (STAR)

    Science.gov (United States)

    Showalter, C.; Rex, R.; Hurlburt, N. E.; Zita, E. J.

    2009-12-01

    We have written a software suite designed to facilitate solar data analysis by scientists, students, and the public, anticipating enormous datasets from future instruments. Our “STAR” suite includes an interactive learning section explaining 15 classes of solar events. Users learn software tools that exploit humans’ superior ability (over computers) to identify many events. Annotation tools include time slice generation to quantify loop oscillations, the interpolation of event shapes using natural cubic splines (for loops, sigmoids, and filaments) and closed cubic splines (for coronal holes). Learning these tools in an environment where examples are provided prepares new users to comfortably utilize annotation software with new data. Upon completion of our tutorial, users are presented with media of various solar events and asked to identify and annotate the images, to test their mastery of the system. Goals of the project include public input into the data analysis of very large datasets from future solar satellites, and increased public interest and knowledge about the Sun. In 2010, the Solar Dynamics Observatory (SDO) will be launched into orbit. SDO’s advancements in solar telescope technology will generate a terabyte per day of high-quality data, requiring innovation in data management. While major projects develop automated feature recognition software, so that computers can complete much of the initial event tagging and analysis, still, that software cannot annotate features such as sigmoids, coronal magnetic loops, coronal dimming, etc., due to large amounts of data concentrated in relatively small areas. Previously, solar physicists manually annotated these features, but with the imminent influx of data it is unrealistic to expect specialized researchers to examine every image that computers cannot fully process. A new approach is needed to efficiently process these data. Providing analysis tools and data access to students and the public have proven

  17. GIFtS: annotation landscape analysis with GeneCards

    Directory of Open Access Journals (Sweden)

    Dalah Irina

    2009-10-01

    Full Text Available Abstract Background Gene annotation is a pivotal component in computational genomics, encompassing prediction of gene function, expression analysis, and sequence scrutiny. Hence, quantitative measures of the annotation landscape constitute a pertinent bioinformatics tool. GeneCards® is a gene-centric compendium of rich annotative information for over 50,000 human gene entries, building upon 68 data sources, including Gene Ontology (GO), pathways, interactions, phenotypes, publications and many more. Results We present the GeneCards Inferred Functionality Score (GIFtS) which allows a quantitative assessment of a gene's annotation status, by exploiting the unique wealth and diversity of GeneCards information. The GIFtS tool, linked from the GeneCards home page, facilitates browsing the human genome by searching for the annotation level of a specified gene, retrieving a list of genes within a specified range of GIFtS value, obtaining random genes with a specific GIFtS value, and experimenting with the GIFtS weighting algorithm for a variety of annotation categories. The bimodal shape of the GIFtS distribution suggests a division of the human gene repertoire into two main groups: the high-GIFtS peak consists almost entirely of protein-coding genes; the low-GIFtS peak consists of genes from all of the categories. Cluster analysis of GIFtS annotation vectors provides the classification of gene groups by detailed positioning in the annotation arena. GIFtS also provide measures which enable the evaluation of the databases that serve as GeneCards sources. An inverse correlation is found (for GIFtS>25) between the number of genes annotated by each source, and the average GIFtS value of genes associated with that source. Three typical source prototypes are revealed by their GIFtS distribution: genome-wide sources, sources comprising mainly highly annotated genes, and sources comprising mainly poorly annotated genes. The degree of accumulated knowledge for a

  18. Semi-automated curation of metabolic models via flux balance analysis: a case study with Mycoplasma gallisepticum.

    Directory of Open Access Journals (Sweden)

    Eddy J Bautista

    Full Text Available Primarily used for metabolic engineering and synthetic biology, genome-scale metabolic modeling shows tremendous potential as a tool for fundamental research and curation of metabolism. Through a novel integration of flux balance analysis and genetic algorithms, a strategy to curate metabolic networks and facilitate identification of metabolic pathways that may not be directly inferable solely from genome annotation was developed. Specifically, metabolites involved in unknown reactions can be determined, and potentially erroneous pathways can be identified. The procedure developed allows for new fundamental insight into metabolism, as well as acting as a semi-automated curation methodology for genome-scale metabolic modeling. To validate the methodology, a genome-scale metabolic model for the bacterium Mycoplasma gallisepticum was created. Several reactions not predicted by the genome annotation were postulated and validated via the literature. The model predicted an average growth rate of 0.358±0.12[Formula: see text], closely matching the experimentally determined growth rate of M. gallisepticum of 0.244±0.03[Formula: see text]. This work presents a powerful algorithm for facilitating the identification and curation of previously known and new metabolic pathways, as well as presenting the first genome-scale reconstruction of M. gallisepticum.
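
    The abstract above does not spell out its optimization problem. As a hedged reminder, flux balance analysis is usually written as the textbook linear program below; the symbols are the conventional ones and are assumptions here, not notation taken from the paper, and the genetic-algorithm layer the authors describe is not shown.

```latex
% Textbook flux balance analysis (assumed standard form, not quoted from the paper).
% S: stoichiometric matrix (metabolites x reactions), v: vector of reaction fluxes,
% c: objective weights (typically selecting the biomass reaction),
% v^{min}, v^{max}: lower and upper flux bounds.
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v \\
\text{subject to}\quad & S\,v = 0, \\
& v^{\min} \le v \le v^{\max}.
\end{aligned}
```

    The steady-state constraint S v = 0 requires that every internal metabolite be produced and consumed at equal rates. The predicted growth rate is the optimal value of the objective.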

  19. Management Tool for Semantic Annotations in WSDL

    Science.gov (United States)

    Boissel-Dallier, Nicolas; Lorré, Jean-Pierre; Benaben, Frédérick

    Semantic Web Services add features to automate web services discovery and composition. A new standard called SAWSDL emerged recently as a W3C recommendation to add semantic annotations within web service descriptions (WSDL). In order to manipulate such information in Java program we need an XML parser. Two open-source libraries already exist (SAWSDL4J and Woden4SAWSDL) but they don't meet all our specific needs such as support for WSDL 1.1 and 2.0. This paper presents a new tool, called EasyWSDL, which is able to handle semantic annotations as well as to manage the full WSDL description thanks to a plug-in mechanism. This tool allows us to read/edit/create a WSDL description and related annotations thanks to a uniform API, in both 1.1 and 2.0 versions. This document compares these three libraries and presents its integration into Dragon the OW2 open-source SOA governance tool.

  20. galaxieEST: addressing EST identity through automated phylogenetic analysis

    Directory of Open Access Journals (Sweden)

    Larsson Karl-Henrik

    2004-07-01

    Full Text Available Abstract Background Research involving expressed sequence tags (ESTs) is intricately coupled to the existence of large, well-annotated sequence repositories. Comparatively complete and satisfactory annotated public sequence libraries are, however, available only for a limited range of organisms, rendering the absence of sequences and gene structure information a tangible problem for those working with taxa lacking an EST or genome sequencing project. Paralogous genes belonging to the same gene family but distinguished by derived characteristics are particularly prone to misidentification and erroneous annotation; high but incomplete levels of sequence similarity are typically difficult to interpret and have formed the basis of many unsubstantiated assumptions of orthology. In these cases, a phylogenetic study of the query sequence together with the most similar sequences in the database may be of great value to the identification process. In order to facilitate this laborious procedure, a project to employ automated phylogenetic analysis in the identification of ESTs was initiated. Results galaxieEST is an open source Perl-CGI script package designed to complement traditional similarity-based identification of EST sequences through employment of automated phylogenetic analysis. It uses a series of BLAST runs as a sieve to retrieve nucleotide and protein sequences for inclusion in neighbour joining and parsimony analyses; the output includes the BLAST output, the results of the phylogenetic analyses, and the corresponding multiple alignments. galaxieEST is available as an on-line web service for identification of fungal ESTs and for download / local installation for use with any organism group at http://galaxie.cgb.ki.se/galaxieEST.html. Conclusions By addressing sequence relatedness in addition to similarity, galaxieEST provides an integrative view on EST origin and identity, which may prove particularly useful in cases where similarity searches

  1. Genome organization of the SARS-CoV

    DEFF Research Database (Denmark)

    Xu, Jing; Hu, Jianfei; Wang, Jing;

    2003-01-01

    Annotation of the genome sequence of the SARS-CoV (severe acute respiratory syndrome-associated coronavirus) is indispensable to understand its evolution and pathogenesis. We have performed a full annotation of the SARS-CoV genome sequences by using annotation programs publicly available or devel...

  2. Automated cell analysis tool for a genome-wide RNAi screen with support vector machine based supervised learning

    Science.gov (United States)

    Remmele, Steffen; Ritzerfeld, Julia; Nickel, Walter; Hesser, Jürgen

    2011-03-01

    RNAi-based high-throughput microscopy screens have become an important tool in biological sciences in order to decrypt mostly unknown biological functions of human genes. However, manual analysis is impossible for such screens since the amount of image data sets can often be in the hundred thousands. Reliable automated tools are thus required to analyse the fluorescence microscopy image data sets usually containing two or more reaction channels. The herein presented image analysis tool is designed to analyse an RNAi screen investigating the intracellular trafficking and targeting of acylated Src kinases. In this specific screen, a data set consists of three reaction channels and the investigated cells can appear in different phenotypes. The main issue of the image processing task is an automatic cell segmentation which has to be robust and accurate for all different phenotypes and a successive phenotype classification. The cell segmentation is done in two steps by segmenting the cell nuclei first and then using a classifier-enhanced region growing on basis of the cell nuclei to segment the cells. The classification of the cells is realized by a support vector machine which has to be trained manually using supervised learning. Furthermore, the tool is brightness invariant allowing different staining quality and it provides a quality control that copes with typical defects during preparation and acquisition. A first version of the tool has already been successfully applied for an RNAi-screen containing three hundred thousand image data sets and the SVM extended version is designed for additional screens.

  3. Translational genomics for plant breeding with the genome sequence explosion.

    Science.gov (United States)

    Kang, Yang Jae; Lee, Taeyoung; Lee, Jayern; Shim, Sangrea; Jeong, Haneul; Satyawan, Dani; Kim, Moon Young; Lee, Suk-Ha

    2016-04-01

    The use of next-generation sequencers and advanced genotyping technologies has propelled the field of plant genomics in model crops and plants and enhanced the discovery of hidden bridges between genotypes and phenotypes. The newly generated reference sequences of unstudied minor plants can be annotated by the knowledge of model plants via translational genomics approaches. Here, we reviewed the strategies of translational genomics and suggested perspectives on the current databases of genomic resources and the database structures of translated information on the new genome. As a draft picture of phenotypic annotation, translational genomics on newly sequenced plants will provide valuable assistance for breeders and researchers who are interested in genetic studies.

  4. GFF-Ex: a genome feature extraction package

    OpenAIRE

    Rastogi, Achal; Gupta, Dinesh

    2014-01-01

    Background Genomic features of whole genome sequences emerging from various sequencing and annotation projects are represented and stored in several formats. Amongst these formats, the GFF (Generic/General Feature Format) has emerged as a widely accepted, portable and successfully used flat file format for genome annotation storage. With an increasing interest in genome annotation projects and secondary and meta-analysis, there is a need for efficient tools to extract sequences of interests f...
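
    As a rough illustration of what extracting feature sequences from a GFF file involves, here is a minimal Python sketch. It is not GFF-Ex's implementation: the names Feature, parse_gff and extract, the toy genome and the toy records are all hypothetical. It relies only on the nine tab-separated GFF columns, with 1-based inclusive coordinates.

```python
from typing import Dict, Iterator, List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical container for the GFF columns used in this sketch."""
    seqid: str
    ftype: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    strand: str


def parse_gff(lines: List[str]) -> Iterator[Feature]:
    """Yield one Feature per nine-column GFF record, skipping comments."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # malformed record; a real tool might report it instead
        yield Feature(cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6])


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract(feature: Feature, genome: Dict[str, str]) -> str:
    """Return the feature's sequence, reverse-complemented on the minus strand."""
    seq = genome[feature.seqid][feature.start - 1:feature.end]
    if feature.strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


# Toy data, invented for illustration only.
genome = {"chr1": "ATGCCCGGGTTTAAACCC"}
gff = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t1\t6\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tgene\t7\t12\t.\t-\t.\tID=gene2",
    "chr1\tdemo\texon\t1\t3\t.\t+\t.\tID=exon1",
]
genes = [extract(f, genome) for f in parse_gff(gff) if f.ftype == "gene"]
print(genes)
```

    Converting the 1-based inclusive GFF interval to a Python slice (start - 1 to end) is where off-by-one errors usually arise, so the sketch isolates it in extract.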

  5. A Web-based High-Throughput Tool for Next-Generation Sequence Annotation

    Science.gov (United States)

    2011-06-01

    annotation of a newly sequenced complete genome, can help devise new strategies in diagnostics and forensics. Moreover, these annotations, coupled...References 1. Hall, N., “Advanced sequencing technologies and their wider impact in microbiology”, The Journal of Experimental Biology, 210(9), pp. 1518–1525

  6. Comparison of three microarray probe annotation pipelines: differences in strategies and their effect on downstream analysis

    NARCIS (Netherlands)

    Neerincx, P.B.T.; Casel, P.; Prickett, D.; Nie, H.; Watson, M.; Leunissen, J.A.M.; Groenen, M.A.M.; Klopp, C.

    2009-01-01

    Background - Reliable annotation linking oligonucleotide probes to target genes is essential for functional biological analysis of microarray experiments. We used the IMAD, OligoRAP and sigReannot pipelines to update the annotation for the ARK-Genomics Chicken 20 K array as part of a joined EADGENE/

  7. Semantic annotation of biological concepts interplaying microbial cellular responses

    Directory of Open Access Journals (Sweden)

    Carreira Rafael

    2011-11-01

    Full Text Available Abstract Background Automated extraction systems have become a time saving necessity in Systems Biology. Considerable human effort is needed to model, analyse and simulate biological networks. Thus, one of the challenges posed to Biomedical Text Mining tools is that of learning to recognise a wide variety of biological concepts with different functional roles to assist in these processes. Results Here, we present a novel corpus concerning the integrated cellular responses to nutrient starvation in the model-organism Escherichia coli. Our corpus is a unique resource in that it annotates biomedical concepts that play a functional role in expression, regulation and metabolism. Namely, it includes annotations for genetic information carriers (genes and DNA, RNA molecules), proteins (transcription factors, enzymes and transporters), small metabolites, physiological states and laboratory techniques. The corpus consists of 130 full-text papers with a total of 59043 annotations for 3649 different biomedical concepts; the two dominant classes are genes (highest number of unique concepts) and compounds (most frequently annotated concepts), whereas other important cellular concepts such as proteins account for no more than 10% of the annotated concepts. Conclusions To the best of our knowledge, a corpus that details such a wide range of biological concepts has never been presented to the text mining community. The inter-annotator agreement statistics provide evidence of the importance of a consolidated background when dealing with such complex descriptions, the ambiguities naturally arising from the terminology and their impact for modelling purposes. Availability is granted for the full-text corpora of 130 freely accessible documents, the annotation scheme and the annotation guidelines. Also, we include a corpus of 340 abstracts.

  8. EST-PAC a web package for EST annotation and protein sequence prediction

    Directory of Open Access Journals (Sweden)

    Strahm Yvan

    2006-10-01

    Full Text Available Abstract With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequence tags (ESTs) from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC, a web-oriented multi-platform software package for expressed sequence tag (EST) annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: (1) searching local or remote biological databases for sequence similarities using Blast services, (2) predicting protein coding sequence from EST data and (3) annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results and management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics.

  9. Code Generation for Protocols from CPN models Annotated with Pragmatics

    DEFF Research Database (Denmark)

    Simonsen, Kent Inge; Kristensen, Lars Michael; Kindler, Ekkart

    Model-driven engineering (MDE) provides a foundation for automatically generating software based on models. Models allow software designs to be specified focusing on the problem domain and abstracting from the details of underlying implementation platforms. When applied in the context of formal...... modelling languages, MDE further has the advantage that models are amenable to model checking which allows key behavioural properties of the software design to be verified. The combination of formally verified models and automated code generation contributes to a high degree of assurance that the resulting...... of the same model and sufficiently detailed to serve as a basis for automated code generation when annotated with code generation pragmatics. Pragmatics are syntactical annotations designed to make the CPN models descriptive and to address the problem that models with enough details for generating code from...

  10. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation.

    Science.gov (United States)

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quickly and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as a resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manually reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific

  11. Complete genome sequence of an attenuated Sparfloxacin-resistant Streptococcus agalactiae strain 138spar

    Science.gov (United States)

    The complete genome of a sparfloxacin-resistant Streptococcus agalactiae vaccine strain 138spar is 1,838,126 bp in size. The genome has 1892 coding sequences and 82 RNAs. The annotation of the genome is added by the NCBI Prokaryotic Genome Annotation Pipeline. The publishing of this genome will allo...

  12. Genephony: a knowledge management tool for genome-wide research

    Science.gov (United States)

    Nuzzo, Angelo; Riva, Alberto

    2009-01-01

    Background One of the consequences of the rapid and widespread adoption of high-throughput experimental technologies is an exponential increase of the amount of data produced by genome-wide experiments. Researchers increasingly need to handle very large volumes of heterogeneous data, including both the data generated by their own experiments and the data retrieved from publicly available repositories of genomic knowledge. Integration, exploration, manipulation and interpretation of data and information therefore need to become as automated as possible, since their scale and breadth are, in general, beyond the limits of what individual researchers and the basic data management tools in normal use can handle. This paper describes Genephony, a tool we are developing to address these challenges. Results We describe how Genephony can be used to manage large datasets of genomic information, integrating them with existing knowledge repositories. We illustrate its functionalities with an example of a complex annotation task, in which a set of SNPs coming from a genotyping experiment is annotated with genes known to be associated to a phenotype of interest. We show how, thanks to the modular architecture of Genephony and its user-friendly interface, this task can be performed in a few simple steps. Conclusion Genephony is an online tool for the manipulation of large datasets of genomic information. It can be used as a browser for genomic data, as a high-throughput annotation tool, and as a knowledge discovery tool. It is designed to be easy to use, flexible and extensible. Its knowledge management engine provides fine-grained control over individual data elements, as well as efficient operations on large datasets. PMID:19728881

  13. Systems genetics and genome-wide association approaches for analysis of feed intake, feed efficiency, and performance in beef cattle

    DEFF Research Database (Denmark)

    Santana, M.H.A.; Freua, M.C.; Do, D. N.;

    2016-01-01

    , were annotated and the biological processes underlying the traits were inferred from Database for Annotation, Visualization and Integrated Discovery (DAVID) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Our results indicated several putative genomic regions associated with the target...

  14. Warehouse automation

    OpenAIRE

    Pogačnik, Jure

    2017-01-01

    An automated high bay warehouse is commonly used for storing large number of material with a high throughput. In an automated warehouse pallet movements are mainly performed by a number of automated devices like conveyors systems, trolleys, and stacker cranes. From the introduction of the material to the automated warehouse system to its dispatch the system requires no operator input or intervention since all material movements are done automatically. This allows the automated warehouse to op...

  15. Pride-asap: Automatic fragment ion annotation of identified PRIDE spectra

    OpenAIRE

    Hulstaert, Niels; Reisinger, Florian; Rameseder, Jonathan; Barsnes, Harald; Vizcaíno, Juan Antonio; Martens, Lennart

    2013-01-01

    We present an open source software application and library written in Java that provides a uniform annotation of identified spectra stored in the PRIDE database. Pride-asap can be run in a command line mode for automated processing of multiple PRIDE experiments, but also has a graphical user interface that allows end users to annotate the spectra in PRIDE experiments and to inspect the results in detail. Pride-asap binaries, source code and additional information can be downloaded from http:/...

  16. Automatic annotation of lecture videos for multimedia driven pedagogical platforms

    Directory of Open Access Journals (Sweden)

    Ali Shariq Imran

    2016-12-01

    Full Text Available Today’s eLearning websites are heavily loaded with multimedia content, which is often unstructured, unedited, unsynchronized, and lacks inter-links among different multimedia components. Hyperlinking different media modalities may provide a solution for quick navigation and easy retrieval of pedagogical content in media-driven eLearning websites. In addition, finding meta-data information to describe and annotate media content in eLearning platforms is a challenging, laborious, error-prone and time-consuming task. Thus annotations for multimedia, especially lecture videos, have become an important part of video learning objects. To address this issue, this paper proposes three major contributions, namely automated video annotation, 3-dimensional (3D) tag clouds, and the hyper interactive presenter (HIP) eLearning platform. Combining the existing state-of-the-art SIFT together with a tag cloud, a novel approach for automatic lecture video annotation for the HIP is proposed. New video annotations are implemented automatically, providing the needed random access in lecture videos within the platform, and a 3D tag cloud is proposed as a new user interaction mechanism. A preliminary study of the usefulness of the system has been carried out, and the initial results suggest that 70% of the students opted for using HIP as their preferred eLearning platform at Gjøvik University College (GUC).

  17. Use of Annotations for Component and Framework Interoperability

    Science.gov (United States)

    David, O.; Lloyd, W.; Carlson, J.; Leavesley, G. H.; Geter, F.

    2009-12-01

    western United States at the USDA NRCS National Water and Climate Center. PRMS is a component based modular precipitation-runoff model developed to evaluate the impacts of various combinations of precipitation, climate, and land use on streamflow and general basin hydrology. The new OMS 3.0 PRMS model source code is more concise and flexible as a result of using the new framework’s annotation based approach. The fully annotated components are now providing information directly for (i) model assembly and building, (ii) dataflow analysis for implicit multithreading, (iii) automated and comprehensive model documentation of component dependencies, physical data properties, (iv) automated model and component testing, and (v) automated audit-traceability to account for all model resources leading to a particular simulation result. Experience to date has demonstrated the multi-purpose value of using annotations. Annotations are also a feasible and practical method to enable interoperability among models and modeling frameworks. As a prototype example, model code annotations were used to generate binding and mediation code to allow the use of OMS 3.0 model components within the OpenMI context.

  18. Mesotext. Framing and exploring annotations

    NARCIS (Netherlands)

    Boot, P.; Boot, P.; Stronks, E.

    2007-01-01

    From the introduction: Annotation is an important item on the wish list for digital scholarly tools. It is one of John Unsworth’s primitives of scholarship (Unsworth 2000). Especially in linguistics, a number of tools have been developed that facilitate the creation of annotations to source material

  19. HBVRegDB: Annotation, comparison, detection and visualization of regulatory elements in hepatitis B virus sequences

    Directory of Open Access Journals (Sweden)

    Firth Andrew E

    2007-12-01

    Full Text Available Abstract Background The many Hepadnaviridae sequences available have widely varied functional annotation. The genomes are very compact (~3.2 kb) but contain multiple layers of functional regulatory elements in addition to coding regions. Key regions are subject to purifying selection, as mutations in these regions will produce non-functional viruses. Results These genomic sequences have been organized into a structured database to facilitate research at the molecular level. HBVRegDB is a comparative genomic analysis tool with an integrated underlying sequence database. The database contains genomic sequence data from representative viruses. In addition to INSDC and RefSeq annotation, HBVRegDB also contains expert and systematically calculated annotations (e.g. promoters) and comparative genome analysis results (e.g. blastn, tblastx). It also contains analyses based on curated HBV alignments. Information about conserved regions – including primary conservation (e.g. CDS-Plotcon) and RNA secondary structure predictions (e.g. Alidot) – is integrated into the database. A large amount of data is graphically presented using the GBrowse (Generic Genome Browser) adapted for analysis of viral genomes. Flexible query access is provided based on any annotated genomic feature. Novel regulatory motifs can be found by analysing the annotated sequences. Conclusion HBVRegDB serves as a knowledge database and as a comparative genomic analysis tool for molecular biologists investigating HBV. It is publicly available and complementary to other viral and HBV focused datasets and tools (http://hbvregdb.otago.ac.nz). The availability of multiple and highly annotated sequences of viral genomes in one database combined with comparative analysis tools facilitates detection of novel genomic elements.

  20. Collective dynamics of social annotation

    CERN Document Server

    Cattuto, Ciro; Baldassarri, Andrea; Schehr, G; Loreto, Vittorio

    2009-01-01

    The enormous increase of popularity and use of the WWW has led in the recent years to important changes in the ways people communicate. An interesting example of this fact is provided by the now very popular social annotation systems, through which users annotate resources (such as web pages or digital photographs) with text keywords dubbed tags. Understanding the rich emerging structures resulting from the uncoordinated actions of users calls for an interdisciplinary effort. In particular concepts borrowed from statistical physics, such as random walks, and the complex networks framework, can effectively contribute to the mathematical modeling of social annotation systems. Here we show that the process of social annotation can be seen as a collective but uncoordinated exploration of an underlying semantic space, pictured as a graph, through a series of random walks. This modeling framework reproduces several aspects, so far unexplained, of social annotation, among which the peculiar growth of the size of the...
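    The random-walk picture described above can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the ring-shaped "semantic graph", the function names, and the parameters are hypothetical and are not taken from the paper. Each simulated post is one short random walk, and the set of distinct nodes visited so far plays the role of the tag vocabulary.

```python
import random


def ring_graph(n, k=2):
    """Hypothetical semantic graph: n nodes on a ring, each linked to k neighbours per side."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)}


def random_walk(graph, start, length, rng):
    """Return the list of nodes visited by a walk of `length` nodes starting at `start`."""
    node = start
    path = [node]
    for _ in range(length - 1):
        node = rng.choice(graph[node])
        path.append(node)
    return path


def vocabulary_growth(graph, n_posts, walk_length, seed=0):
    """Number of distinct nodes (tags) seen after each post, one random walk per post."""
    rng = random.Random(seed)
    nodes = list(graph)
    seen = set()
    growth = []
    for _ in range(n_posts):
        start = rng.choice(nodes)  # assumption: each post starts at a uniformly chosen node
        seen.update(random_walk(graph, start, walk_length, rng))
        growth.append(len(seen))
    return growth


growth = vocabulary_growth(ring_graph(200), n_posts=50, walk_length=5)
```

    Plotting `growth` against the post index gives a vocabulary-growth curve for this toy graph; the shape depends entirely on the assumed graph and starting rule.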

  1. Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets

    Science.gov (United States)

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M. Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J.

    2014-01-01

    Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. PMID:24501397

  2. SelenoDB 2.0: annotation of selenoprotein genes in animals and their genetic diversity in humans.

    Science.gov (United States)

    Romagné, Frédéric; Santesmasses, Didac; White, Louise; Sarangi, Gaurab K; Mariotti, Marco; Hübler, Ron; Weihmann, Antje; Parra, Genís; Gladyshev, Vadim N; Guigó, Roderic; Castellano, Sergi

    2014-01-01

    SelenoDB (http://www.selenodb.org) aims to provide high-quality annotations of selenoprotein genes, proteins and SECIS elements. Selenoproteins are proteins that contain the amino acid selenocysteine (Sec) and the first release of the database included annotations for eight species. Since the release of SelenoDB 1.0 many new animal genomes have been sequenced. The annotations of selenoproteins in new genomes usually contain many errors in major databases. For this reason, we have now fully annotated selenoprotein genes in 58 animal genomes. We provide manually curated annotations for human selenoproteins, whereas we use an automatic annotation pipeline to annotate selenoprotein genes in other animal genomes. In addition, we annotate the homologous genes containing cysteine (Cys) instead of Sec. Finally, we have surveyed genetic variation in the annotated genes in humans. We use exon capture and resequencing approaches to identify single-nucleotide polymorphisms in more than 50 human populations around the world. We thus present a detailed view of the genetic divergence of Sec- and Cys-containing genes in animals and their diversity in humans. The addition of these datasets into the second release of the database provides a valuable resource for addressing medical and evolutionary questions in selenium biology.

  3. Proteomic detection of non-annotated protein-coding genes in Pseudomonas fluorescens Pf0-1.

    Science.gov (United States)

    Kim, Wook; Silby, Mark W; Purvine, Sam O; Nicoll, Julie S; Hixson, Kim K; Monroe, Matt; Nicora, Carrie D; Lipton, Mary S; Levy, Stuart B

    2009-12-24

    Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of possible functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations that predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes that were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologs in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.
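    Mapping peptides back to a genome without relying on predicted genes can be sketched as a six-frame translation followed by a substring search. This is an illustrative simplification, not the pipeline used in the study: the function names are hypothetical, only exact matches are found, and the standard codon assignments are assumed.

```python
# Standard codon table built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(genome):
    """Map (strand, offset) to the translated sequence for all six reading frames."""
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def map_peptide(peptide, genome):
    """Return (strand, offset, residue_index) for the first exact hit in each frame."""
    hits = []
    for (strand, offset), protein in six_frames(genome).items():
        pos = protein.find(peptide)
        if pos != -1:
            hits.append((strand, offset, pos))
    return hits
```

    A hit on the "-" strand that overlaps an annotated gene on the "+" strand corresponds to the antisense case described in the abstract; deciding that requires comparing hit coordinates with the existing annotation, which is outside this sketch.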

  4. Proteomic detection of non-annotated protein-coding genes in Pseudomonas fluorescens Pf0-1.

    Directory of Open Access Journals (Sweden)

    Wook Kim

    Full Text Available Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of possible functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations that predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes that were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologs in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.

  5. Proteomic Detection of Non-Annotated Protein-Coding Genes in Pseudomonas fluorescens Pf0-1

    Energy Technology Data Exchange (ETDEWEB)

    Kim, Wook; Silby, Mark W.; Purvine, Samuel O.; Nicoll, Julie S.; Hixson, Kim K.; Monroe, Matthew E.; Nicora, Carrie D.; Lipton, Mary S.; Levy, Stuart B.

    2009-12-24

    Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of (possible) functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations which predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes which were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologues in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.

  6. Algal Functional Annotation Tool: a web-based analysis suite to functionally interpret large gene lists using integrated annotation and expression data

    Directory of Open Access Journals (Sweden)

    Merchant Sabeeha S

    2011-07-01

    Full Text Available Abstract Background Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. Description The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of
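    The abstract does not say which statistic is used to score enrichment of a functional term in a gene list. One common choice for this kind of analysis is the one-sided hypergeometric test, sketched below as an assumption; the function name and argument names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the functional term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Example call with made-up numbers: 40 of 15000 genes carry the term,
# and 6 of a 200-gene list carry it.
p_value = hypergeom_enrichment_p(population=15000, annotated=40, sample=200, hits=6)
```

    In practice many terms are tested for the same gene list, so a multiple-testing correction would normally be applied to the resulting p-values.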

  7. Accounting Automation

    OpenAIRE

    Laynebaril1

    2017-01-01

    “Accounting Automation” Please respond to the following: Imagine you are a consultant hired to convert a manual accounting system to an automated system. Suggest the key advantages and disadvantages of automating a manual accounting system. Identify the most important step in the conversion process. Provide a rationale for your response. ...

  8. Technology for the human genome project: State-of-the-art and the future state-of-the-art

    Energy Technology Data Exchange (ETDEWEB)

    Garner, H.

    1995-12-31

    The Genome Center at Southwestern Medical Center is producing a high resolution map of chromosome 11 based on a strategy called Genome Sequence Sampling (GSS). Starting from a low resolution YAC map, cosmids are selected or sub-cloned at a high redundancy (20x) and simultaneously restriction mapped and end sequenced. The data are then integrated to make a high resolution map that is annotated with sequence (~350 bp) every ~2 kbp. We have constructed several specialized hardware and software systems, implemented those devices in a production environment, and have been fine tuning each system and their work in synchrony. This seminar will discuss the design of the process, hardware, software and laboratories for high throughput operations research, insights illuminated as automation was infused in the lab and our near term objectives. For the future, we are exploring new 'chip' based nanovolume biology applications, analysis software and new generations of automation systems.

  9. Home Automation

    OpenAIRE

    Ahmed, Zeeshan

    2010-01-01

    In this paper I briefly discuss the importance of home automation systems. Going into the details, I present a designed and implemented real-time, software- and hardware-oriented home automation research project, capable of automating a house's electricity and providing a security system to detect unexpected behavior.

  10. Pattern matching in indeterminate and Arc-annotated sequences.

    Science.gov (United States)

    Aumi, Md Tanvir Islam; Moosa, Tanaeem M; Rahman, M Sohel

    2013-08-01

    In this paper, we present efficient algorithms for finding indeterminate Arc-Annotated patterns in indeterminate Arc-Annotated references. Our algorithms run in O(m + (nm)/w) time, where n and m are respectively the lengths of our reference and pattern strings and w is the target machine word size. Here we have assumed the alphabet size to be constant, because indeterminate Arc-Annotated sequences are used to model biological sequences. Clearly, for short patterns (m no larger than w), our algorithms run in linear time, and efficient algorithms for matching short patterns to reference genomes have huge applications in practical settings. We have also applied our algorithms to scan the ncRNAs without pseudoknots. We scanned three whole human chromosomes and it took only 2.5 - 4 minutes to scan one whole chromosome for an ncRNA family. Some relevant patents are discussed in.
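    A running time of the form O(m + nm/w) is typical of bit-parallel matching, where the state for all m pattern positions is packed into machine words. The sketch below is a plain Shift-And matcher for indeterminate sequences and is offered only as an illustration of that idea: it ignores arc annotations entirely, it is not the authors' algorithm, and the function names are hypothetical. Each position of the text and the pattern is given as a string of the symbols allowed there, and two positions match when they share at least one symbol. Python integers are unbounded, so the word size w is hidden inside the integer arithmetic.

```python
def build_masks(pattern):
    """For each symbol, a bitmask with bit j set when pattern position j allows it."""
    masks = {}
    for j, symbols in enumerate(pattern):
        for c in symbols:
            masks[c] = masks.get(c, 0) | (1 << j)
    return masks


def shift_and_indeterminate(text, pattern):
    """Return start indices where every pattern position shares a symbol with the text."""
    m = len(pattern)
    if m == 0:
        return []
    masks = build_masks(pattern)
    accept = 1 << (m - 1)
    state = 0
    hits = []
    for i, symbols in enumerate(text):
        # Union of the masks of all symbols allowed at this text position.
        position_mask = 0
        for c in symbols:
            position_mask |= masks.get(c, 0)
        # Extend every partial match by one position, and start a new one at bit 0.
        state = ((state << 1) | 1) & position_mask
        if state & accept:
            hits.append(i - m + 1)
    return hits


text = ["A", "C", "AG", "T", "A", "CT", "GT"]
pattern = ["A", "CG", "AT"]
matches = shift_and_indeterminate(text, pattern)  # [0, 4]
```

    With a constant alphabet, building `position_mask` costs constant work per text position, and each state update touches about m/w machine words, which is where the nm/w term comes from.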

  11. META2: Intercellular DNA Methylation Pairwise Annotation and Integrative Analysis

    Directory of Open Access Journals (Sweden)

    Binhua Tang

    2016-01-01

    Full Text Available Genome-wide deciphering intercellular differential DNA methylation as well as its roles in transcriptional regulation remains elusive in cancer epigenetics. Here we developed a toolkit META2 for DNA methylation annotation and analysis, which aims to perform integrative analysis on differentially methylated loci and regions through deep mining and statistical comparison methods. META2 contains multiple versatile functions for investigating and annotating DNA methylation profiles. Benchmarked with T-47D cell, we interrogated the association within differentially methylated CpG (DMC) and region (DMR) candidate count and region length and identified major transition zones as clues for inferring statistically significant DMRs; together we validated those DMRs with the functional annotation. Thus META2 can provide a comprehensive analysis approach for epigenetic research and clinical study.

  12. Sentiment Analysis of Document Based on Annotation

    CERN Document Server

    Shukla, Archana

    2011-01-01

    I present a tool that assesses the quality or usefulness of a document based on its annotations. Annotations may include comments, notes, observations, highlights, underlines, explanations, questions, requests for help, etc. Comments are used for evaluative purposes, while the others are used for summarization or expansion. Comments may also be made on another annotation; such annotations are referred to as meta-annotations. Not all annotations receive equal weight. The tool considers highlights, underlines and comments to infer the collective sentiment of annotators, which is classified as positive, negative or objective. The tool computes collective sentiment in two ways: it counts all annotations present on the document, and it computes sentiment scores for all annotations that include comments, in order to obtain the collective sentiment about the document or to judge its quality. I demonstrate the use of the tool on a research paper.

  13. KEGG as a reference resource for gene and protein annotation.

    Science.gov (United States)

    Kanehisa, Minoru; Sato, Yoko; Kawashima, Masayuki; Furumichi, Miho; Tanabe, Mao

    2016-01-04

    KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an integrated database resource for biological interpretation of genome sequences and other high-throughput data. Molecular functions of genes and proteins are associated with ortholog groups and stored in the KEGG Orthology (KO) database. The KEGG pathway maps, BRITE hierarchies and KEGG modules are developed as networks of KO nodes, representing high-level functions of the cell and the organism. Currently, more than 4000 complete genomes are annotated with KOs in the KEGG GENES database, which can be used as a reference data set for KO assignment and subsequent reconstruction of KEGG pathways and other molecular networks. As an annotation resource, the following improvements have been made. First, each KO record is re-examined and associated with protein sequence data used in experiments of functional characterization. Second, the GENES database now includes viruses, plasmids, and the addendum category for functionally characterized proteins that are not represented in complete genomes. Third, new automatic annotation servers, BlastKOALA and GhostKOALA, are made available utilizing the non-redundant pangenome data set generated from the GENES database. As a resource for translational bioinformatics, various data sets are created for antimicrobial resistance and drug interaction networks.

  14. ODMedit: uniform semantic annotation for data integration in medicine based on a public metadata repository

    Directory of Open Access Journals (Sweden)

    Martin Dugas

    2016-06-01

    Background: The volume and complexity of patient data, especially in personalised medicine, is steadily increasing, both regarding clinical data and genomic profiles: typically more than 1,000 items (e.g., laboratory values, vital signs, diagnostic tests etc.) are collected per patient in clinical trials. In oncology hundreds of mutations can potentially be detected for each patient by genomic profiling. Therefore data integration from multiple sources constitutes a key challenge for medical research and healthcare. Methods: Semantic annotation of data elements can help to identify matching data elements in different sources and thereby supports data integration. Millions of different annotations are required due to the semantic richness of patient data. These annotations should be uniform, i.e., two matching data elements shall contain the same annotations. However, large terminologies like SNOMED CT or UMLS don’t provide uniform coding. It is proposed to develop semantic annotations of medical data elements based on a large-scale public metadata repository. To achieve uniform codes, semantic annotations shall be re-used if a matching data element is available in the metadata repository. Results: A web-based tool called ODMedit (https://odmeditor.uni-muenster.de/) was developed to create data models with uniform semantic annotations. It contains ~800,000 terms with semantic annotations which were derived from ~5,800 models from the portal of medical data models (MDM). The tool was successfully applied to manually annotate 22 forms with 292 data items from CDISC and to update 1,495 data models of the MDM portal. Conclusion: Uniform manual semantic annotation of data models is feasible in principle, but requires a large-scale collaborative effort due to the semantic richness of patient data. A web-based tool for these annotations is available, which is linked to a public metadata repository.

  15. Complete genome sequence of a virulent Streptococcus agalactiae strain 138P isolated from diseased Nile tilapia

    Science.gov (United States)

    The complete genome of a virulent Streptococcus agalactiae strain 138P is 1838701 bp in size, containing 1831 genes. The genome has 1593 coding sequences, 152 pseudo genes, 16 rRNAs, 69 tRNAs, and 1 non-coding RNA. The annotation of the genome is added by the NCBI Prokaryotic Genome Annotation Pipel...

  16. Draft Genome Sequence of Lactobacillus rhamnosus 2166.

    OpenAIRE

    Karlyshev, Andrey V.; Melnikov, Vyacheslav G.; Kosarev, Igor V.; Abramov, Vyacheslav M.

    2014-01-01

    In this report, we present a draft sequence of the genome of Lactobacillus rhamnosus strain 2166, a potential novel probiotic. Genome annotation and read mapping onto a reference genome of L. rhamnosus strain GG allowed for the identification of the differences and similarities in the genomic contents and gene arrangements of these strains.

  17. Publication Production: An Annotated Bibliography.

    Science.gov (United States)

    Firman, Anthony H.

    1994-01-01

    Offers brief annotations of 52 articles and papers on document production (from the Society for Technical Communication's journal and proceedings) on 9 topics: information processing, document design, using color, typography, tables, illustrations, photography, printing and binding, and production management. (SR)

  18. Genome Sequencing, Structural Analysis and Functional Annotations of Bacillus bombysepticus from the Silkworm, Bombyx mori

    Institute of Scientific and Technical Information of China (English)

    程廷才; 林平; 金盛凯; 付博华; 龙仁文; 夏庆友

    2014-01-01

    Bacillus bombysepticus is one of the common bacterial pathogens causing bacterial black thorax septicemia in the silkworm (Bombyx mori). This paper focuses on the sequence features and the structural and functional annotation of the B. bombysepticus genome, based on sequencing and assembly of its genomic sequences. We generated 2.1 Gb of raw data from sequencing, with more than 360-fold coverage of the genome, and assembled them into a 5.87 Mb genome comprising two replicons: one nuclear genome (5,295.783 kb) and one plasmid genome (577.809 kb). Repeat sequences account for 1.1792% of the nuclear genome, which contains 136 non-coding RNAs and 5,298 predicted protein-coding genes. Functional annotation of the B. bombysepticus genome using the KEGG and COG databases showed that most genes are involved in transport, amino acid metabolism and carbohydrate metabolism. Genomic synteny analysis indicated that B. bombysepticus is closely related to bacteria of the genus Bacillus. These results will help identify the key virulence factors of B. bombysepticus and facilitate the understanding of its infection mechanism and interaction with the host.

  19. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor

    Directory of Open Access Journals (Sweden)

    Hankus Lukasz

    2006-10-01

    Background: Repbase is a reference database of eukaryotic repetitive DNA, which includes prototypic sequences of repeats and basic information described in annotations. Updating and maintenance of the database requires specialized tools, which we have created and made available for use with Repbase, and which may be useful as a template for other curated databases. Results: We describe the software tools RepbaseSubmitter and Censor, which are designed to facilitate updating and screening the content of Repbase. RepbaseSubmitter is a Java-based interface for formatting and annotating Repbase entries. It eliminates many common formatting errors, and automates actions such as calculation of sequence lengths and composition, thus facilitating curation of Repbase sequences. In addition, it has several features for predicting protein coding regions in sequences; searching and including Pubmed references in Repbase entries; and searching the NCBI taxonomy database for correct inclusion of species information and taxonomic position. Censor is a tool to rapidly identify repetitive elements by comparison to known repeats. It uses WU-BLAST for speed and sensitivity, and can conduct DNA-DNA, DNA-protein, or translated DNA-translated DNA searches of genomic sequence. Defragmented output includes a map of repeats present in the query sequence, with the options to report masked query sequence(s), repeat sequences found in the query, and alignments. Conclusion: Censor and RepbaseSubmitter are available as both web-based services and downloadable versions. They can be found at http://www.girinst.org/repbase/submission.html (RepbaseSubmitter) and http://www.girinst.org/censor/index.php (Censor).

  20. Genome reannotation of Escherichia coli CFT073 with new insights into virulence

    Directory of Open Access Journals (Sweden)

    Hu Gang-Qing

    2009-11-01

    Background: The genome of uropathogenic Escherichia coli (UPEC) strain CFT073, a human pathogen, was sequenced and published in 2002, a significant step in pathogenic bacterial genomics research. However, the current RefSeq annotation of this pathogen is now outdated to some degree, owing to missing or incorrect annotation of some essential genes associated with its virulence. We carried out a systematic reannotation by combining automated annotation tools with manual efforts to provide a comprehensive understanding of virulence for the CFT073 genome. Results: The reannotation excluded 608 coding sequences from the RefSeq annotation. Meanwhile, a total of 299 coding sequences were newly added; about one third of them are found in genomic island (GI) regions, while more than one fifth are located in virulence-related regions, the pathogenicity islands (PAIs). Furthermore, the translational initiation sites (TISs) of 341 genes were relocated, which resulted in high-quality gene start annotation. In addition, 94 pseudogenes annotated in RefSeq were thoroughly inspected and updated. The number of miscellaneous genes (sRNAs) has been updated from 6 in RefSeq to 46 in the reannotation. Based on the adjustments in the reannotation, subsequent analyses were conducted through both general and case studies on new virulence factors or new virulence-associated genes that are crucial during the urinary tract infection (UTI) process, including invasion, colonization, nutrient uptake and population density control. Furthermore, miscellaneous RNAs collected in the reannotation are believed to contribute to the virulence of strain CFT073. The reannotation, including the nucleotide data, the original RefSeq annotation and all reannotated results, is freely available via http://mech.ctb.pku.edu.cn/CFT073/. Conclusion: As a result, the reannotation presents a more comprehensive picture of the mechanisms of uropathogenicity of UPEC strain CFT073

  1. SpectroGene: A Tool for Proteogenomic Annotations Using Top-Down Spectra.

    Science.gov (United States)

    Kolmogorov, Mikhail; Liu, Xiaowen; Pevzner, Pavel A

    2016-01-01

    In the past decade, proteogenomics has emerged as a valuable technique that contributes to the state-of-the-art in genome annotation; however, previous proteogenomic studies were limited to bottom-up mass spectrometry and did not take advantage of top-down approaches. We show that top-down proteogenomics allows one to address the problems that remained beyond the reach of traditional bottom-up proteogenomics. In particular, we show that top-down proteogenomics leads to the discovery of previously unannotated genes even in extensively studied bacterial genomes and present SpectroGene, a software tool for genome annotation using top-down tandem mass spectra. We further show that top-down proteogenomics searches (against the six-frame translation of a genome) identify nearly all proteoforms found in traditional top-down proteomics searches (against the annotated proteome). SpectroGene is freely available at http://github.com/fenderglass/SpectroGene .
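
    The six-frame translation mentioned above can be illustrated with a minimal Python sketch. This is not SpectroGene's code; the function names are hypothetical, and the standard genetic code is assumed (the bacterial table shares the same codon-to-amino-acid mapping and differs only in permitted start codons).

```python
# Minimal sketch of a six-frame translation (illustrative, not SpectroGene's code).
# Assumption: standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to translated protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, template in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for label in sorted(frames):
    print(label, frames[label])
```

    A proteogenomic search would then match spectra against all six translated frames rather than against an annotated proteome only, which is what allows unannotated genes to be found.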

  2. : a database of ciliate genome rearrangements.

    Science.gov (United States)

    Burns, Jonathan; Kukushkin, Denys; Lindblad, Kelsi; Chen, Xiao; Jonoska, Nataša; Landweber, Laura F

    2016-01-01

    Ciliated protists exhibit nuclear dimorphism through the presence of somatic macronuclei (MAC) and germline micronuclei (MIC). In some ciliates, DNA from precursor segments in the MIC genome rearranges to form transcriptionally active genes in the mature MAC genome, making these ciliates model organisms to study the process of somatic genome rearrangement. Similar broad scale, somatic rearrangement events occur in many eukaryotic cells and tumors. The (http://oxytricha.princeton.edu/mds_ies_db) is a database of genome recombination and rearrangement annotations, and it provides tools for visualization and comparative analysis of precursor and product genomes. The database currently contains annotations for two completely sequenced ciliate genomes: Oxytricha trifallax and Tetrahymena thermophila.

  3. Automated methods of predicting the function of biological sequences using GO and BLAST

    Directory of Open Access Journals (Sweden)

    Baumann Ute

    2005-11-01

    Background: With the exponential increase in genomic sequence data there is a need to develop automated approaches to deducing the biological functions of novel sequences with high accuracy. Our aim is to demonstrate how accuracy benchmarking can be used in a decision-making process evaluating competing designs of biological function predictors. We utilise the Gene Ontology (GO), a directed acyclic graph of functional terms, to annotate sequences with functional information describing their biological context. Initially we examine the effect on accuracy scores of increasing the allowed distance between predicted terms and a test set of curator-assigned terms. Next we evaluate several annotator methods using accuracy benchmarking. Given an unannotated sequence we use the Basic Local Alignment Search Tool (BLAST) to find similar sequences that have already been assigned GO terms by curators. A number of methods were developed that utilise terms associated with the best five matching sequences. These methods were compared against a benchmark method of simply using terms associated with the best BLAST-matched sequence (best BLAST approach). Results: The precision and recall of estimates increase rapidly as the amount of distance permitted between a predicted term and a correct term assignment increases. Accuracy benchmarking allows a comparison of annotation methods. A covering graph approach performs poorly, except where the term assignment rate is high. A term distance concordance approach has a similar accuracy to the best BLAST approach, demonstrating lower precision but higher recall. However, a discriminant function method has higher precision and recall than the best BLAST approach and other methods shown here. Conclusion: Allowing term predictions to be counted correct if closely related to a correct term decreases the reliability of the accuracy score. As such we recommend using accuracy measures that require exact matching of predicted
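
    The effect of the allowed term distance on precision and recall can be sketched as follows. The abstract does not give the exact distance definition, so this sketch assumes the shortest path between two terms with edges treated as undirected; the toy ontology and its identifiers are hypothetical, not real GO terms.

```python
from collections import deque

# Hypothetical toy ontology (child -> parents); these are not real GO identifiers.
PARENTS = {
    "T:root": [],
    "T:binding": ["T:root"],
    "T:catalysis": ["T:root"],
    "T:dna_binding": ["T:binding"],
    "T:rna_binding": ["T:binding"],
    "T:kinase": ["T:catalysis"],
}


def build_neighbours(parents):
    """Turn the child -> parents map into an undirected adjacency map."""
    neighbours = {term: set() for term in parents}
    for child, term_parents in parents.items():
        for parent in term_parents:
            neighbours[child].add(parent)
            neighbours[parent].add(child)
    return neighbours


def distance(neighbours, a, b):
    """Shortest path length between two terms (assumed distance measure)."""
    seen = {a: 0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return seen[node]
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return float("inf")


def precision_recall(neighbours, predicted, curated, max_dist):
    """Count a term as matched if some term on the other side lies within max_dist."""
    def near(term, others):
        return any(distance(neighbours, term, o) <= max_dist for o in others)

    matched_pred = sum(near(p, curated) for p in predicted)
    matched_true = sum(near(c, predicted) for c in curated)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_true / len(curated) if curated else 0.0
    return precision, recall


nb = build_neighbours(PARENTS)
predicted = ["T:binding", "T:kinase"]
curated = ["T:rna_binding"]
for d in (0, 1, 4):
    print(d, precision_recall(nb, predicted, curated, d))
```

    With exact matching (distance 0) neither prediction counts; relaxing the distance makes first the parent term and eventually an unrelated term count as correct, which illustrates why relaxed matching inflates scores.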

  4. INDIGO - INtegrated data warehouse of microbial genomes with examples from the red sea extremophiles.

    KAUST Repository

    Alam, Intikhab

    2013-12-06

    The next generation sequencing technologies substantially increased the throughput of microbial genome sequencing. To functionally annotate newly sequenced microbial genomes, a variety of experimental and computational methods are used. Integration of information from different sources is a powerful approach to enhance such annotation. Functional analysis of microbial genomes, necessary for downstream experiments, crucially depends on this annotation but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes.

  5. Automated Annotation of Microbial and Human Flavonoid-Derived Metabolites

    NARCIS (Netherlands)

    Mihaleva, V.V.; Ünlü, F.; Vervoort, J.J.M.; Ridder, L.O.

    2015-01-01

    Flavonoids are a class of natural compounds essentially produced by plants that are part of animal and human diets and have assumed health-promoting benefits. Upon human consumption, these flavonoids are to a modest extent absorbed in the small intestines. The major part arrives in the colon where t

  6. SHARP: genome-scale identification of gene-protein-reaction associations in cyanobacteria.

    Science.gov (United States)

    Krishnakumar, S; Durai, Dilip A; Wangikar, Pramod P; Viswanathan, Ganesh A

    2013-11-01

    A genome-scale metabolic model provides an overview of an organism's metabolic capability. These genome-specific metabolic reconstructions are based on identification of gene to protein to reaction (GPR) associations and, in turn, on homology with annotated genes from other organisms. Cyanobacteria are photosynthetic prokaryotes which have diverged appreciably from their nonphotosynthetic counterparts. They also show significant evolutionary divergence from plants, which are well studied for their photosynthetic apparatus. We argue that context-specific sequence and domain similarity can add to the repertoire of the GPR associations and significantly expand our view of the metabolic capability of cyanobacteria. We took an approach that combines the results of context-specific sequence-to-sequence similarity search with those of sequence-to-profile searches. We employ PSI-BLAST for the former, and CDD, Pfam, and COG for the latter. An optimization algorithm was devised to arrive at a weighting scheme to combine the different lines of evidence, with KEGG-annotated GPRs as training data. We present the algorithm in the form of software, "Systematic, Homology-based Automated Re-annotation for Prokaryotes (SHARP)." We predicted 3,781 new GPR associations for the 10 prokaryotes considered, of which eight are cyanobacteria species. These new GPR associations fall in several metabolic pathways and were used to annotate 7,718 gaps in the metabolic network. These new annotations led to the discovery of several pathways that may be active, thereby providing new directions for metabolic engineering of these species for production of useful products. A metabolic model developed on such a reconstructed network is likely to give better phenotypic predictions.

  7. Pragmatics annotated coloured petri nets for protocol software generation and verification

    DEFF Research Database (Denmark)

    Simonsen, Kent Inge Fagerland; Kristensen, Lars M.; Kindler, Ekkart

    2016-01-01

    Pragmatics Annotated Coloured Petri Nets (PA-CPNs) are a restricted class of Coloured Petri Nets (CPNs) developed to support automated generation of protocol software. The practical application of PA-CPNs and the supporting PetriCode software tool have been discussed and evaluated in earlier papers...

  8. Pragmatics Annotated Coloured Petri Nets for Protocol Software Generation and Verification

    DEFF Research Database (Denmark)

    Fagerland Simonsen, Kent Inge; Kristensen, Lars Michael; Kindler, Ekkart

    2015-01-01

    PetriCode is a tool that supports automated generation of protocol software from a restricted class of Coloured Petri Nets (CPNs) called Pragmatics Annotated Coloured Petri Nets (PA-CPNs). PetriCode and PA-CPNs have been designed with five main requirements in mind, which include the same model...

  9. The UCSC Archaeal Genome Browser: 2012 update

    OpenAIRE

    Chan, Patricia P.; Holmes, Andrew D.; Smith, Andrew M.; Tran, Danny; Lowe, Todd M.

    2011-01-01

    The UCSC Archaeal Genome Browser (http://archaea.ucsc.edu) offers a graphical web-based resource for exploration and discovery within archaeal and other selected microbial genomes. By bringing together existing gene annotations, gene expression data, multiple-genome alignments, pre-computed sequence comparisons and other specialized analysis tracks, the genome browser is a powerful aggregator of varied genomic information. The genome browser environment maintains the current look-and-feel of ...

  10. Objective-guided image annotation.

    Science.gov (United States)

    Mao, Qi; Tsang, Ivor Wai-Hung; Gao, Shenghua

    2013-04-01

    Automatic image annotation, which is usually formulated as a multi-label classification problem, is one of the major tools used to enhance the semantic understanding of web images. Many multimedia applications (e.g., tag-based image retrieval) can greatly benefit from image annotation. However, the insufficient performance of image annotation methods prevents these applications from being practical. On the other hand, specific measures are usually designed to evaluate how well one annotation method performs for a specific objective or application, but most image annotation methods do not consider optimization of these measures, so that they are inevitably trapped into suboptimal performance of these objective-specific measures. To address this issue, we first summarize a variety of objective-guided performance measures under a unified representation. Our analysis reveals that macro-averaging measures are very sensitive to infrequent keywords, and the Hamming measure is easily affected by skewed distributions. We then propose a unified multi-label learning framework, which directly optimizes a variety of objective-specific measures of multi-label learning tasks. Specifically, we first present a multilayer hierarchical structure of learning hypotheses for multi-label problems based on which a variety of loss functions with respect to objective-guided measures are defined. And then, we formulate these loss functions as relaxed surrogate functions and optimize them by structural SVMs. According to the analysis of various measures and the high time complexity of optimizing micro-averaging measures, in this paper, we focus on example-based measures that are tailor-made for image annotation tasks but are seldom explored in the literature. Experiments show consistency with the formal analysis on two widely used multi-label datasets, and demonstrate the superior performance of our proposed method over state-of-the-art baseline methods in terms of example-based measures on four
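
    To make the contrast between the measures concrete, the following Python sketch computes a macro-averaged F1, an example-based F1 and the Hamming loss on a toy set of image tags. It only evaluates the measures and does not reproduce the structural SVM optimization described above; the tag names and data are hypothetical, and scoring an undefined F1 as 0.0 is an assumed convention.

```python
# Illustrative evaluation measures for multi-label annotation (toy data, hypothetical tags).

def f1(tp, fp, fn):
    """F1 from counts; an undefined F1 (no positives at all) is scored 0.0 by assumption."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Average of per-label F1 scores: every label weighs the same, however rare."""
    scores = []
    for lab in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(lab in t and lab in p for t, p in pairs)
        fp = sum(lab not in t and lab in p for t, p in pairs)
        fn = sum(lab in t and lab not in p for t, p in pairs)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


def example_f1(y_true, y_pred):
    """Average of per-image F1 scores: every image weighs the same."""
    scores = [f1(len(t & p), len(p - t), len(t - p)) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


def hamming_loss(y_true, y_pred, labels):
    """Fraction of (image, label) decisions that are wrong."""
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * len(labels))


labels = ["sky", "tree", "zebra"]
y_true = [{"sky", "tree"}, {"sky"}, {"sky", "tree"}, {"zebra"}]
y_pred = [{"sky", "tree"}, {"sky"}, {"sky", "tree"}, {"sky"}]

print("macro F1:     ", round(macro_f1(y_true, y_pred, labels), 3))
print("example F1:   ", round(example_f1(y_true, y_pred), 3))
print("Hamming loss: ", round(hamming_loss(y_true, y_pred, labels), 3))
```

    In this toy case a single missed rare tag ("zebra") removes a full third of the macro-averaged score, while the Hamming loss stays small because most label decisions are trivially correct negatives.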

  11. Whole genome sequencing of Streptococcus pneumoniae: development, evaluation and verification of targets for serogroup and serotype prediction using an automated pipeline

    Directory of Open Access Journals (Sweden)

    Georgia Kapatai

    2016-09-01

    Streptococcus pneumoniae typically express one of 92 serologically distinct capsule polysaccharide (cps) types (serotypes). Some of these serotypes are closely related to each other; using the commercially available typing antisera, these are assigned to common serogroups containing types that show cross-reactivity. In this serotyping scheme, factor antisera are used to allocate serotypes within a serogroup, based on patterns of reactions. This serotyping method is technically demanding, requires considerable experience and the reading of the results can be subjective. This study describes the analysis of the S. pneumoniae capsular operon genetic sequence to determine serotype distinguishing features and the development, evaluation and verification of an automated whole genome sequence (WGS)-based serotyping bioinformatics tool, PneumoCaT (Pneumococcal Capsule Typing). Initially, WGS data from 871 S. pneumoniae isolates were mapped to reference cps locus sequences for the 92 serotypes. Thirty-two of 92 serotypes could be unambiguously identified based on sequence similarities within the cps operon. The remaining 60 were allocated to one of 20 ‘genogroups’ that broadly correspond to the immunologically defined serogroups. By comparing the cps reference sequences for each genogroup, unique molecular differences were determined for serotypes within 18 of the 20 genogroups and verified using the set of 871 isolates. This information was used to design a decision-tree style algorithm within the PneumoCaT bioinformatics tool to predict to serotype level for 89/94 (92 + 2 molecular types/subtypes) from WGS data and to serogroup level for serogroups 24 and 32, which currently comprise 2.1% of UK referred, invasive isolates submitted to the National Reference Laboratory (NRL), Public Health England (June 2014–July 2015). PneumoCaT was evaluated with an internal validation set of 2065 UK isolates covering 72/92 serotypes, including 19 non-typeable isolates

  12. ASAP: Amplification, sequencing & annotation of plastomes

    Directory of Open Access Journals (Sweden)

    Folta Kevin M

    2005-12-01

    Background: Availability of DNA sequence information is vital for pursuing structural, functional and comparative genomics studies in plastids. Traditionally, the first step in mining the valuable information within a chloroplast genome requires sequencing a chloroplast plasmid library or BAC clones. These activities involve complicated preparatory procedures like chloroplast DNA isolation or identification of the appropriate BAC clones to be sequenced. Rolling circle amplification (RCA) is being used currently to amplify the chloroplast genome from purified chloroplast DNA and the resulting products are sheared and cloned prior to sequencing. Herein we present a universal high-throughput, rapid PCR-based technique to amplify, sequence and assemble plastid genome sequence from diverse species in a short time and at reasonable cost from total plant DNA, using the large inverted repeat region from strawberry and peach as proof of concept. The method exploits the highly conserved coding regions or intergenic regions of plastid genes. Using an informatics approach, chloroplast DNA sequence information from 5 available eudicot plastomes was aligned to identify the most conserved regions. Cognate primer pairs were then designed to generate ~1–1.2 kb overlapping amplicons from the inverted repeat region in 14 diverse genera. Results: 100% coverage of the inverted repeat region was obtained from Arabidopsis, tobacco, orange, strawberry, peach, lettuce, tomato and Amaranthus. Over 80% coverage was obtained from distant species, including Ginkgo, loblolly pine and Equisetum. Sequence from the inverted repeat region of strawberry and peach plastome was obtained, annotated and analyzed. Additionally, a polymorphic region identified from gel electrophoresis was sequenced from tomato and Amaranthus. Sequence analysis revealed large deletions in these species relative to tobacco plastome thus exhibiting the utility of this method for structural and

  13. Collective dynamics of social annotation.

    Science.gov (United States)

    Cattuto, Ciro; Barrat, Alain; Baldassarri, Andrea; Schehr, Gregory; Loreto, Vittorio

    2009-06-30

    The enormous increase of popularity and use of the worldwide web has led in the recent years to important changes in the ways people communicate. An interesting example of this fact is provided by the now very popular social annotation systems, through which users annotate resources (such as web pages or digital photographs) with keywords known as "tags." Understanding the rich emergent structures resulting from the uncoordinated actions of users calls for an interdisciplinary effort. In particular concepts borrowed from statistical physics, such as random walks (RWs), and complex networks theory, can effectively contribute to the mathematical modeling of social annotation systems. Here, we show that the process of social annotation can be seen as a collective but uncoordinated exploration of an underlying semantic space, pictured as a graph, through a series of RWs. This modeling framework reproduces several aspects, thus far unexplained, of social annotation, among which are the peculiar growth of the size of the vocabulary used by the community and its complex network structure that represents an externalization of semantic structures grounded in cognition and that are typically hard to access.
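
    The idea of vocabulary growth through repeated random walks can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the ring-shaped graph, the fixed walk length and the fixed starting node are hypothetical stand-ins, not the semantic graph or the parameters used by the authors.

```python
import random


def ring_graph(n, k=2):
    """Toy stand-in for a semantic graph: each node links to k neighbours on each side."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)}


def vocabulary_growth(graph, n_walks, walk_len, start=0, seed=0):
    """Run repeated random walks from `start`; each visited node counts as a used tag.

    Returns the number of distinct tags seen after each walk (one walk ~ one post).
    """
    rng = random.Random(seed)
    seen = set()
    growth = []
    for _ in range(n_walks):
        node = start
        seen.add(node)
        for _ in range(walk_len):
            node = rng.choice(graph[node])
            seen.add(node)
        growth.append(len(seen))
    return growth


growth = vocabulary_growth(ring_graph(200), n_walks=50, walk_len=10)
print(growth[:5], "...", growth[-1])
```

    Plotting the returned list against the walk index would show how quickly new tags keep appearing as more walks are added, which is the kind of vocabulary-growth curve the abstract refers to.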

  14. Bovine Genome Database: new tools for gleaning function from the Bos taurus genome.

    Science.gov (United States)

    Elsik, Christine G; Unni, Deepak R; Diesh, Colin M; Tayal, Aditi; Emery, Marianne L; Nguyen, Hung N; Hagen, Darren E

    2016-01-01

    We report an update of the Bovine Genome Database (BGD) (http://BovineGenome.org). The goal of BGD is to support bovine genomics research by providing genome annotation and data mining tools. We have developed new genome and annotation browsers using JBrowse and WebApollo for two Bos taurus genome assemblies, the reference genome assembly (UMD3.1.1) and the alternate genome assembly (Btau_4.6.1). Annotation tools have been customized to highlight priority genes for annotation, and to aid annotators in selecting gene evidence tracks from 91 tissue specific RNAseq datasets. We have also developed BovineMine, based on the InterMine data warehousing system, to integrate the bovine genome, annotation, QTL, SNP and expression data with external sources of orthology, gene ontology, gene interaction and pathway information. BovineMine provides powerful query building tools, as well as customized query templates, and allows users to analyze and download genome-wide datasets. With BovineMine, bovine researchers can use orthology to leverage the curated gene pathways of model organisms, such as human, mouse and rat. BovineMine will be especially useful for gene ontology and pathway analyses in conjunction with GWAS and QTL studies.

  15. The draft genome sequence of the nematode Caenorhabditis briggsae, a companion to C. elegans.

    Science.gov (United States)

    Gupta, Bhagwati P; Sternberg, Paul W

    2003-01-01

    The publication of the draft genome sequence of Caenorhabditis briggsae improves the annotation of the genome of its close relative Caenorhabditis elegans and will facilitate comparative genomics and the study of the evolutionary changes during development.

  16. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

    Science.gov (United States)

    Pruitt, Kim D; Tatusova, Tatiana; Maglott, Donna R

    2005-01-01

    The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) provides a non-redundant collection of sequences representing genomic data, transcripts and proteins. Although the goal is to provide a comprehensive dataset representing the complete sequence information for any given species, the database pragmatically includes sequence data that are currently publicly available in the archival databases. The database incorporates data from over 2400 organisms and includes over one million proteins representing significant taxonomic diversity spanning prokaryotes, eukaryotes and viruses. Nucleotide and protein sequences are explicitly linked, and the sequences are linked to other resources including the NCBI Map Viewer and Gene. Sequences are annotated to include coding regions, conserved domains, variation, references, names, database cross-references, and other features using a combined approach of collaboration and other input from the scientific community, automated annotation, propagation from GenBank and curation by NCBI staff.

  17. Library Automation

    OpenAIRE

    Dhakne, B. N.; Giri, V. V.; Waghmode, S. S.

    2010-01-01

    New technologies provide libraries with several new materials, media and modes of storing and communicating information. Library automation reduces the drudgery of repeated manual effort in library routines. Library automation covers collection, storage, administration, processing, preservation, communication, etc.

  18. Structured RNAs and synteny regions in the pig genome

    DEFF Research Database (Denmark)

    Anthon, Christian; Tafer, Hakim; Havgaard, Jakob Hull;

    2014-01-01

    for Laurasiatheria (pig, cow, dolphin, horse, cat, dog, hedgehog). CONCLUSIONS: We have obtained one of the most comprehensive annotations for structured ncRNAs of a mammalian genome, which is likely to play central roles in both health modelling and production. The core annotation is available in Ensembl 70......BACKGROUND: Annotating mammalian genomes for noncoding RNAs (ncRNAs) is nontrivial since far from all ncRNAs are known and the computational models are resource demanding. Currently, the human genome holds the best mammalian ncRNA annotation, a result of numerous efforts by several groups. However...

  19. NCBI viral genomes resource.

    Science.gov (United States)

    Brister, J Rodney; Ako-Adjei, Danso; Bao, Yiming; Blinkova, Olga

    2015-01-01

    Recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. Yet, any potential benefits from the billowing cloud of next generation sequence data hinge upon well implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. The NCBI Viral Genomes Resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. The resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/ and catalogs all publicly available virus genome sequences and curates reference genome sequences. As the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences. The rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets.

  20. New in protein structure and function annotation: hotspots, single nucleotide polymorphisms and the 'Deep Web'.

    Science.gov (United States)

    Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard

    2009-05-01

    The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation.

  1. Determining similarity of scientific entities in annotation datasets.

    Science.gov (United States)

    Palma, Guillermo; Vidal, Maria-Esther; Haag, Eric; Raschid, Louiqa; Thor, Andreas

    2015-01-01

    Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug-drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called 'AnnSim' that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1-1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/
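    The 1-1 maximum weight bipartite matching mentioned in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the brute-force search (practical only for a handful of annotations) and the Dice-style normalization by the two annotation-set sizes are assumptions made here for illustration, since the abstract gives neither the solver nor the exact formula.

```python
from itertools import permutations


def max_weight_matching(sim):
    """Return the total weight of the best 1-1 matching.

    sim[i][j] is the similarity between annotation i of the first entity
    and annotation j of the second. Brute force, for tiny inputs only.
    """
    rows, cols = len(sim), len(sim[0])
    if rows > cols:
        # Transpose so that every row can be matched to a distinct column.
        sim = [list(column) for column in zip(*sim)]
        rows, cols = cols, rows
    best = 0.0
    for assignment in permutations(range(cols), rows):
        total = sum(sim[i][assignment[i]] for i in range(rows))
        best = max(best, total)
    return best


def annsim(sim):
    """Assumed normalization: twice the matching weight over both set sizes."""
    n_first, n_second = len(sim), len(sim[0])
    return 2.0 * max_weight_matching(sim) / (n_first + n_second)


if __name__ == "__main__":
    # Hypothetical similarities: 2 annotations on one entity, 3 on the other.
    example = [
        [0.9, 0.1, 0.3],
        [0.2, 0.8, 0.4],
    ]
    print(round(annsim(example), 2))
```

    For realistic annotation sets an assignment solver such as the Hungarian algorithm would replace the permutation loop; the brute-force version is kept here only because it is easy to check by hand.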

  2. An Annotation Scheme for Reichenbach's Verbal Tense Structure

    CERN Document Server

    Derczynski, Leon

    2012-01-01

    In this paper we present RTMML, a markup language for the tenses of verbs and temporal relations between verbs. There is a richness to tense in language that is not fully captured by existing temporal annotation schemata. Following Reichenbach we present an analysis of tense in terms of abstract time points, with the aim of supporting automated processing of tense and temporal relations in language. This allows for precise reasoning about tense in documents, and the deduction of temporal relations between the times and verbal events in a discourse. We define the syntax of RTMML, and demonstrate the markup in a range of situations.

  3. Generating Protocol Software from CPN Models Annotated with Pragmatics

    DEFF Research Database (Denmark)

    Simonsen, Kent Inge; Kristensen, Lars M.; Kindler, Ekkart

    2013-01-01

    and verify protocol software, but limited work exists on using CPN models of protocols as a basis for automated code generation. The contribution of this paper is a method for generating protocol software from a class of CPN models annotated with code generation pragmatics. Our code generation method...... consists of three main steps: automatically adding so-called derived pragmatics to the CPN model, computing an abstract template tree, which associates pragmatics with code templates, and applying the templates to generate code which can then be compiled. We illustrate our method using a unidirectional...

  4. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA

    Directory of Open Access Journals (Sweden)

    Camon Evelyn B

    2005-05-01

    Background: The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and the number of proteins requiring characterization increase, the manual processing capability can become overloaded. Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA depend largely on exploiting the knowledge in existing resources such as InterPro. However, in recent years, text mining has been hailed as a potentially useful tool to aid the curation process. To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge. BioCreAtIvE task 2 was an experiment to test if automatically derived classification using information retrieval and extraction could assist expert biologists in the annotation of the GO vocabulary to the proteins in the UniProt Knowledgebase. GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an...

  5. Mouse genome database 2016.

    Science.gov (United States)

    Bult, Carol J; Eppig, Janan T; Blake, Judith A; Kadin, James A; Richardson, Joel E

    2016-01-01

    The Mouse Genome Database (MGD; http://www.informatics.jax.org) is the primary community model organism database for the laboratory mouse and serves as the source for key biological reference data related to mouse genes, gene functions, phenotypes and disease models with a strong emphasis on the relationship of these data to human biology and disease. As the cost of genome-scale sequencing continues to decrease and new technologies for genome editing become widely adopted, the laboratory mouse is more important than ever as a model system for understanding the biological significance of human genetic variation and for advancing the basic research needed to support the emergence of genome-guided precision medicine. Recent enhancements to MGD include new graphical summaries of biological annotations for mouse genes, support for mobile access to the database, tools to support the annotation and analysis of sets of genes, and expanded support for comparative biology through the expansion of homology data.

  6. The UCSC Genome Browser Database

    DEFF Research Database (Denmark)

    Karolchik, D; Kuhn, R M; Baertsch, R

    2008-01-01

    The University of California, Santa Cruz, Genome Browser Database (GBD) provides integrated sequence and annotation data for a large collection of vertebrate and model organism genomes. Seventeen new assemblies have been added to the database in the past year, for a total coverage of 19 vertebrat...

  7. Automation or De-automation

    Science.gov (United States)

    Gorlach, Igor; Wessel, Oliver

    2008-09-01

    In the global automotive industry, for decades, vehicle manufacturers have continually increased the level of automation of production systems in order to be competitive. However, there is a new trend to decrease the level of automation, especially in final car assembly, for reasons of economy and flexibility. In this research, the final car assembly lines at three production sites of Volkswagen are analysed in order to determine the best level of automation for each, in terms of manufacturing costs, productivity, quality and flexibility. The case study is based on the methodology proposed by the Fraunhofer Institute. The results of the analysis indicate that fully automated assembly systems are not necessarily the best option in terms of cost, productivity and quality combined, which is attributed to high complexity of final car assembly systems; some de-automation is therefore recommended. On the other hand, the analysis shows that low automation can result in poor product quality due to reasons related to plant location, such as inadequate workers' skills, motivation, etc. Hence, the automation strategy should be formulated on the basis of analysis of all relevant aspects of the manufacturing process, such as costs, quality, productivity and flexibility in relation to the local context. A more balanced combination of automated and manual assembly operations provides better utilisation of equipment, reduces production costs and improves throughput.

  8. Annotated Bibliography on Humanistic Education

    Science.gov (United States)

    Ganung, Cynthia

    1975-01-01

    Part I of this annotated bibliography deals with books and articles on such topics as achievement motivation, process education, transactional analysis, discipline without punishment, role-playing, interpersonal skills, self-acceptance, moral education, self-awareness, values clarification, and non-verbal communication. Part II focuses on…

  9. Meaningful Assessment: An Annotated Bibliography.

    Science.gov (United States)

    Thrond, Mary A.

    The annotated bibliography contains citations of nine references on alternative student assessment methods in second language programs, particularly at the secondary school level. The references include a critique of conventional reading comprehension assessment, a discussion of performance assessment, a proposal for a multi-trait, multi-method…

  10. Teacher Evaluation: An Annotated Bibliography.

    Science.gov (United States)

    McKenna, Bernard H.; And Others

    In his introduction to the 86-item annotated bibliography by Mueller and Poliakoff, McKenna discusses his views on teacher evaluation and his impressions of the documents cited. He observes, in part, that the current concern is with the process of evaluation and that most researchers continue to believe that student achievement is the most…

  11. Child Development: An Annotated Bibliography.

    Science.gov (United States)

    Dickerson, LaVerne Thornton, Comp.

    This annotated bibliography focuses on recent publications dealing with factors that influence child growth and development, rather than the developmental processes themselves. Topics include: general sources on child development; physical and perceptual-motor development; cognitive development; social and personality development; and play.…

  12. Nikos Kazantzakis: An Annotated Bibliography.

    Science.gov (United States)

    Qiu, Kui

    This research paper consists of an annotated bibliography about Nikos Kazantzakis, one of the major modern Greek writers and author of "The Last Temptation of Christ,""Zorba the Greek," and many other works. Because of Kazantzakis' position in world literature there are many critical works about him; however, bibliographical control of these works…

  13. Annotation and Curation of Uncharacterized proteins- Challenges

    Directory of Open Access Journals (Sweden)

    Johny Ijaq

    2015-03-01

    Hypothetical proteins are proteins that are predicted to be expressed from an open reading frame (ORF), constituting a substantial fraction of proteomes in both prokaryotes and eukaryotes. Genome projects have led to the identification of many therapeutic targets, the putative function of the protein and their interactions. In this review we have listed various methods. Annotation linked to structural and functional prediction of hypothetical proteins assists in the discovery of new structures and functions serving as markers and pharmacological targets for drug design, discovery and screening. Mass spectrometry is an analytical technique for validating protein characterisation. Matrix-assisted laser desorption ionization–mass spectrometry (MALDI-MS) is an efficient analytical method. Microarrays and protein expression profiles help in understanding biological systems through a systems-wide study of proteins and their interactions with other proteins and non-proteinaceous molecules to control complex processes in cells, tissues and even whole organisms. Next generation sequencing technology accelerates multiple areas of genomics research.

  14. Annotating BI Visualization Dashboards: Needs and Challenges

    OpenAIRE

    Elias, Micheline; Bezerianos, Anastasia

    2012-01-01

    International audience; Annotations have been identified as an important aid in analysis record-keeping and recently data discovery. In this paper we discuss the use of annotations on visualization dashboards, with a special focus on business intelligence (BI) analysis. In-depth interviews with experts lead to new annotation needs for multi-chart visualization systems, on which we based the design of a dashboard prototype that supports data and context aware annotations. We focus particularly ...

  15. Are clickthrough data reliable as image annotations?

    NARCIS (Netherlands)

    Tsikrika, T.; Diou, C.; Vries, A.P. de; Delopoulos, A.

    2009-01-01

    We examine the reliability of clickthrough data as concept-based image annotations, by comparing them against manual annotations, for different concept categories. Our analysis shows that, for many concepts, the image annotations generated by using clickthrough data are reliable, with up to 90% of t

  16. Annotating images by mining image search results

    NARCIS (Netherlands)

    Wang, X.J.; Zhang, L.; Li, X.; Ma, W.Y.

    2008-01-01

    Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results

  17. Functional annotation of a full-length mouse cDNA collection

    Energy Technology Data Exchange (ETDEWEB)

    Kawai, J.; Shinagawa, A.; Shibata, K.; Yoshino, M.; Itoh, M.; Ishii, Y.; Arakawa, T.; Hara, A.; Fukunishi, Y.; Konno, H.; Adachi, J.; Fukuda, S.; Aizawa, K.; Izawa, M.; Nishi, K.; Kiyosawa, H.; Kondo, S.; Yamanaka, I.; Saito, T.; Okazaki, Y.; Gojobori, T.; Bono, H.; Kasukawa, T.; Saito, R.; Kadota, K.; Matsuda, H.; Ashburner, M.; Batalov, S.; Casavant, T.; Fleischmann, W.; Gaasterland, T.; Gissi, C.; King, B.; Kochiwa, H.; Kuehl, P.; Lewis, S.; Matsuo, Y.; Nikaido, I.; Pesole, G.; Quackenbush, J.; Schriml, L.M.; Staubli, F.; Suzuki, R.; Tomita, M.; Wagner, L.; Washio, T.; Sakai, K.; Okido, T.; Furuno, M.; Aono, H.; Baldarelli, R.; Barsh, G.; Blake, J.; Boffelli, D.; Bojunga, N.; Carninci, P.; de Bonaldo, M.F.; Brownstein, M.J.; Bult, C.; Fletcher, C.; Fujita, M.; Gariboldi, M.; Gustincich, S.; Hill, D.; Hofmann, M.; Hume, D.A.; Kamiya, M.; Lee, N.H.; Lyons, P.; Marchionni, L.; Mashima, J.; Mazzarelli, J.; Mombaerts, P.; Nordone, P.; Ring, B.; Ringwald, M.; Rodriguez, I.; Sakamoto, N.; Sasaki, H.; Sato, K.; Schonbach, C.; Seya, T.; Shibata, Y.; Storch, K.-F.; Suzuki, H.; Toyo-oka, K.; Wang, K.H.; Weitz, C.; Whittaker, C.; Wilming, L.; Wynshaw-Boris, A.; Yoshida, K.; Hasegawa, Y.; Kawaji, H.; Kohtsuki, S.; Hayashizaki, Y.; RIKEN Genome Exploration Research Group Phase II T; FANTOM Consortium

    2001-01-01

    The RIKEN Mouse Gene Encyclopedia Project, a systematic approach to determining the full coding potential of the mouse genome, involves collection and sequencing of full-length complementary DNAs and physical mapping of the corresponding genes to the mouse genome. We organized an international functional annotation meeting (FANTOM) to annotate the first 21,076 cDNAs to be analyzed in this project. Here we describe the first RIKEN clone collection, which is one of the largest described for any organism. Analysis of these cDNAs extends known gene families and identifies new ones.

  18. Augmented annotation and orthologue analysis for Oryctolagus cuniculus: Better Bunny

    Directory of Open Access Journals (Sweden)

    Craig Douglas B

    2012-05-01

    Background: The rabbit is an important model organism used in a wide range of biomedical research. However, the rabbit genome is still sparsely annotated, thus prohibiting extensive functional analysis of gene sets derived from whole-genome experiments. We developed a web-based application that provides augmented annotation and orthologue analysis for rabbit genes. Importantly, the application allows comprehensive functional analysis through the use of orthologous relationships. Results: Using data extracted from several public bioinformatics repositories we created Better Bunny, a database and query tool that extensively augments the available functional annotation for rabbit genes. Using the complete set of target genes from a commercial rabbit gene expression microarray as our benchmark, we are able to obtain functional information for 88 % of the genes on the microarray. Previously, functional information was available for fewer than 10 % of the rabbit genes. Conclusions: We have developed a freely available, web-accessible bioinformatics tool that enables investigators to quickly and easily perform extensive functional analysis of rabbit genes (http://cptweb.cpt.wayne.edu). The software application fills a critical void for a wide range of biomedical research that relies on the rabbit model and requires characterization of biological function for large sets of genes.

  19. The automation of science.

    Science.gov (United States)

    King, Ross D; Rowland, Jem; Oliver, Stephen G; Young, Michael; Aubrey, Wayne; Byrne, Emma; Liakata, Maria; Markham, Magdalena; Pir, Pinar; Soldatova, Larisa N; Sparkes, Andrew; Whelan, Kenneth E; Clare, Amanda

    2009-04-03

    The basis of science is the hypothetico-deductive method and the recording of experiments in sufficient detail to enable reproducibility. We report the development of Robot Scientist "Adam," which advances the automation of both. Adam has autonomously generated functional genomics hypotheses about the yeast Saccharomyces cerevisiae and experimentally tested these hypotheses by using laboratory automation. We have confirmed Adam's conclusions through manual experiments. To describe Adam's research, we have developed an ontology and logical language. The resulting formalization involves over 10,000 different research units in a nested treelike structure, 10 levels deep, that relates the 6.6 million biomass measurements to their logical description. This formalization describes how a machine contributed to scientific knowledge.

  20. Multi-annotation discursive de corpus écrit

    OpenAIRE

    Péry-Woodley, Marie-Paule

    2011-01-01

    National audience; On the basis of the experience acquired in the course of the ANNODIS project, the following questions are discussed: - what is the annotation campaign for? building an annotated " reference corpus" vs. annotation as an experiment; - defining annotation tasks. Naïve vs. expert annotation; - the annotation manual : from linguistic model to annotation protocol; - automatic pre-processing vs. manual annotation. Segmentation, tagging and mark-ups: steps in corpus preparation; - ...

  1. Pragmatics Annotated Coloured Petri Nets for Protocol Software Generation and Verification

    DEFF Research Database (Denmark)

    Simonsen, Kent Inge; Kristensen, Lars Michael; Kindler, Ekkart

    This paper presents the formal definition of Pragmatics Annotated Coloured Petri Nets (PA-CPNs). PA-CPNs represent a class of Coloured Petri Nets (CPNs) that are designed to support automated code generation of protocol software. PA-CPNs restrict the structure of CPN models and allow Petri net...... elements to be annotated with so-called pragmatics, which are exploited for code generation. The approach and tool for generating code is called PetriCode and has been discussed and evaluated in earlier work already. The contribution of this paper is to give a formal definition for PA-CPNs; in addition...

  2. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    Energy Technology Data Exchange (ETDEWEB)

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2008-10-27

    Hypothetical and conserved hypothetical genes account for >30 percent of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved hypothetical (9.5 percent) along with 887 hypothetical genes (24.4 percent). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 hypothetical and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC-MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. 1212 of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.
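    The proportions quoted in this abstract can be re-derived from its own gene counts. The short sketch below does only that arithmetic; the constant and function names are illustrative and are not taken from the study.

```python
# Gene counts as stated in the abstract for D. vulgaris Hildenborough.
TOTAL_GENES = 3634
CONSERVED_HYPOTHETICAL = 347
HYPOTHETICAL = 887
TRANSCRIBED = 1212
WITH_DETECTED_PROTEIN = 786


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# The two categories together form the set of genes that was profiled.
profiled = CONSERVED_HYPOTHETICAL + HYPOTHETICAL

if __name__ == "__main__":
    print(profiled)                                       # 1234
    print(percent(CONSERVED_HYPOTHETICAL, TOTAL_GENES))   # 9.5
    print(percent(HYPOTHETICAL, TOTAL_GENES))             # 24.4
    print(percent(TRANSCRIBED, profiled))
    print(percent(WITH_DETECTED_PROTEIN, profiled))
```

    The first three values reproduce the figures given in the abstract (1234 genes, 9.5 and 24.4 percent); the last two express the transcribed and protein-producing genes as a share of the profiled set.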

  3. Certifying cost annotations in compilers

    CERN Document Server

    Amadio, Roberto M; Régis-Gianas, Yann; Saillard, Ronan

    2010-01-01

    We discuss the problem of building a compiler which can lift in a provably correct way pieces of information on the execution cost of the object code to cost annotations on the source code. To this end, we need a clear and flexible picture of: (i) the meaning of cost annotations, (ii) the method to prove them sound and precise, and (iii) the way such proofs can be composed. We propose a so-called labelling approach to these three questions. As a first step, we examine its application to a toy compiler. This formal study suggests that the labelling approach has good compositionality and scalability properties. In order to provide further evidence for this claim, we report our successful experience in implementing and testing the labelling approach on top of a prototype compiler written in OCAML for (a large fragment of) the C language.

  4. Corpus Annotation for Parser Evaluation

    OpenAIRE

    CARROLL, JOHN; Minnen, Guido; Briscoe, Ted

    1999-01-01

    We describe a recently developed corpus annotation scheme for evaluating parsers that avoids shortcomings of current methods. The scheme encodes grammatical relations between heads and dependents, and has been used to mark up a new public-domain corpus of naturally occurring English text. We show how the corpus can be used to evaluate the accuracy of a robust parser, and relate the corpus to extant resources.
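    One common way to score a parser against a corpus of grammatical relations is to compare sets of (relation, head, dependent) triples. The sketch below assumes that representation and uses precision, recall and F1; the abstract does not state which measures the authors report, and the relation labels and sentence in the example are hypothetical.

```python
def evaluate_relations(gold, predicted):
    """Compare predicted (relation, head, dependent) triples with gold ones.

    Returns (precision, recall, f1). Empty inputs yield 0.0 for the
    measures that would otherwise divide by zero.
    """
    gold_set, predicted_set = set(gold), set(predicted)
    correct = len(gold_set & predicted_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical annotation of "the cat sleeps soundly".
    gold = [
        ("subj", "sleeps", "cat"),
        ("det", "cat", "the"),
        ("mod", "sleeps", "soundly"),
    ]
    predicted = [
        ("subj", "sleeps", "cat"),
        ("det", "cat", "the"),
        ("obj", "sleeps", "soundly"),  # wrong relation label
        ("mod", "cat", "soundly"),     # wrong head
    ]
    p, r, f = evaluate_relations(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

    Scoring on relation triples rather than on bracketings means a parser is credited for each correct head-dependent link even when the rest of its analysis differs from the gold standard.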

  5. Computational annotation of genes differentially expressed along olive fruit development

    Directory of Open Access Journals (Sweden)

    Martinelli Federico

    2009-10-01

    used to query all known KEGG (Kyoto Encyclopaedia of Genes and Genomes) metabolic pathways for characterizing and positioning retrieved EST records. The integration of the olive sequence datasets within the MapMan platform for microarray analysis allowed the identification of specific biosynthetic pathways useful for the definition of key functional categories in time course analyses for gene groups. Conclusion: The bioinformatic annotation of all gene sequences was useful to shed light on metabolic pathways and transcriptional aspects related to carbohydrates, fatty acids, secondary metabolites, transcription factors and hormones as well as response to biotic and abiotic stresses throughout olive drupe development. These results represent a first step toward both functional genomics and systems biology research for understanding the gene functions and regulatory networks in olive fruit growth and ripening.

  6. The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4).

    Science.gov (United States)

    Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Tennessen, Kristin; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

    2016-01-01

    The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provided via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation is followed by functional annotation including assignment of protein product names and connection to various protein family databases.

  7. Figure 2 from Integrative Genomics Viewer: Visualizing Big Data | Office of Cancer Genomics

    Science.gov (United States)

    Grouping and sorting genomic data in IGV. The IGV user interface displaying 202 glioblastoma samples from TCGA. Samples are grouped by tumor subtype (second annotation column) and data type (first annotation column) and sorted by copy number of the EGFR locus (middle column). Adapted from Figure 1; Robinson et al. 2011

  8. Leveraging Comparative Genomics to Identify and Functionally Characterize Genes Associated with Sperm Phenotypes in Python bivittatus (Burmese Python)

    OpenAIRE

    Kristopher J. L. Irizarry; Josep Rutllant

    2016-01-01

    Comparative genomics approaches provide a means of leveraging functional genomics information from a highly annotated model organism’s genome (such as the mouse genome) in order to make physiological inferences about the role of genes and proteins in a less characterized organism’s genome (such as the Burmese python). We employed a comparative genomics approach to produce the functional annotation of Python bivittatus genes encoding proteins associated with sperm phenotypes. We identify 129 g...

  9. Draft genome sequence of an aflatoxigenic Aspergillus species, A. bombycis

    Science.gov (United States)

    The genome of the A. bombycis Type strain was sequenced using a Personal Genome Machine, followed by annotation of its predicted genes. The genome size for A. bombycis was found to be approximately 37 Mb and contained 12,266 genes. This announcement introduces a sequenced genome for an aflatoxigenic...

  10. Identification and systematic annotation of tissue-specific differentially methylated regions using the Illumina 450k array

    NARCIS (Netherlands)

    Slieker, R.C.; Bos, S.D.; Goeman, J.J.; Bovee, J.V.; Talens, R.P.; Breggen, R. van der; Suchiman, H.E.; Lameijer, E.W.; Putter, H.; Akker, E.B. van den; Zhang, Y.; Jukema, J.W.; Slagboom, P.E.; Meulenbelt, I.; Heijmans, B.T.

    2013-01-01

    BACKGROUND: DNA methylation has been recognized as a key mechanism in cell differentiation. Various studies have compared tissues to characterize epigenetically regulated genomic regions, but due to differences in study design and focus there still is no consensus as to the annotation of genomic reg

  11. Identification and systematic annotation of tissue-specific differentially methylated regions using the Illumina 450k array

    NARCIS (Netherlands)

    Slieker, R.C.; Bos, S.D.; Goeman, J.J.; Bovee, J.V.M.G.; Talens, R.P.; Van der Breggen, R.; Suchiman, H.E.D.; Lameijer, E.W.; Putter, H.; Van den Akker, E.B.; Zhang, Y.; Jukema, J.W.; Slagboom, P.E.; Meulenbelt, I.; Heijmans, B.T.

    2013-01-01

    Background: DNA methylation has been recognized as a key mechanism in cell differentiation. Various studies have compared tissues to characterize epigenetically regulated genomic regions, but due to differences in study design and focus there still is no consensus as to the annotation of genomic reg

  12. Functional annotation of hierarchical modularity.

    Directory of Open Access Journals (Sweden)

    Kanchana Padmanabhan

    In biological networks of molecular interactions in a cell, network motifs that are biologically relevant are also functionally coherent, or form functional modules. These functionally coherent modules combine in a hierarchical manner into larger, less cohesive subsystems, thus revealing one of the essential design principles of system-level cellular organization and function: hierarchical modularity. Arguably, hierarchical modularity has not been explicitly taken into consideration by most, if not all, functional annotation systems. As a result, the existing methods would often fail to assign a statistically significant functional coherence score to biologically relevant molecular machines. We developed a methodology for hierarchical functional annotation. Given the hierarchical taxonomy of functional concepts (e.g., Gene Ontology) and the association of individual genes or proteins with these concepts (e.g., GO terms), our method will assign a Hierarchical Modularity Score (HMS) to each node in the hierarchy of functional modules; the HMS and its p-value measure the functional coherence of each module in the hierarchy. While existing methods annotate each module with a set of "enriched" functional terms in a bag of genes, our complementary method provides the hierarchical functional annotation of the modules and their hierarchically organized components. A hierarchical organization of functional modules often comes as a by-product of cluster analysis of gene expression data or protein interaction data. Otherwise, our method will automatically build such a hierarchy by directly incorporating the functional taxonomy information into the hierarchy search process and by allowing multi-functional genes to be part of more than one component in the hierarchy. In addition, its underlying HMS scoring metric ensures that functional specificity of the terms across different levels of the hierarchical taxonomy is properly treated. We have evaluated our

  13. SeqAnt: A web service to rapidly identify and annotate DNA sequence variations

    Directory of Open Access Journals (Sweden)

    Patel Viren

    2010-09-01

    Background The enormous throughput and low cost of second-generation sequencing platforms now allow research and clinical geneticists to routinely perform single experiments that identify tens of thousands to millions of variant sites. Existing methods to annotate variant sites using information from publicly available databases via web browsers are too slow to be useful for the large sequencing datasets being routinely generated by geneticists. Because sequence annotation of variant sites is required before functional characterization can proceed, the lack of a high-throughput pipeline to efficiently annotate variant sites can act as a significant bottleneck in genetics research. Results SeqAnt (Sequence Annotator) is an open source web service and software package that rapidly annotates DNA sequence variants and identifies recessive or compound heterozygous loci in human, mouse, fly, and worm genome sequencing experiments. Variants are characterized with respect to their functional type, frequency, and evolutionary conservation. Annotated variants can be viewed in a web browser, downloaded in a tab-delimited text file, or directly uploaded in BED format to the UCSC genome browser. To demonstrate the speed of SeqAnt, we annotated a series of publicly available datasets that ranged in size from 37 to 3,439,107 variant sites. The total time to completely annotate these data ranged from 0.17 seconds to 28 minutes 49.8 seconds. Conclusion SeqAnt is an open source web service and software package that overcomes a critical bottleneck facing research and clinical geneticists using second-generation sequencing platforms. SeqAnt will prove especially useful for those investigators who lack dedicated bioinformatics personnel or infrastructure in their laboratories.
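
    The BED export mentioned above can be illustrated with a minimal sketch. It is not SeqAnt's code: the `Variant` record and the `ref>alt` label in the name column are hypothetical choices made for illustration. The only fixed point is the BED convention itself, which uses 0-based start coordinates and exclusive end coordinates.

```python
from typing import Iterable, NamedTuple


class Variant(NamedTuple):
    """Hypothetical variant record; not SeqAnt's internal representation."""
    chrom: str  # chromosome name, e.g. "chr1"
    pos: int    # 1-based position of the first reference base
    ref: str    # reference allele
    alt: str    # alternate allele


def variants_to_bed(variants: Iterable[Variant]) -> list[str]:
    """Convert 1-based variant sites into tab-delimited BED lines."""
    lines = []
    for v in variants:
        start = v.pos - 1          # BED start coordinates are 0-based
        end = start + len(v.ref)   # BED end coordinates are exclusive
        name = f"{v.ref}>{v.alt}"  # assumed label for the optional name column
        lines.append(f"{v.chrom}\t{start}\t{end}\t{name}")
    return lines


if __name__ == "__main__":
    demo = [Variant("chr1", 100, "A", "G"), Variant("chr2", 5, "AT", "A")]
    print("\n".join(variants_to_bed(demo)))
```

    A single-base substitution at 1-based position 100 therefore becomes the interval 99 to 100, which is the most common source of off-by-one errors when preparing custom tracks.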

  14. Ensembl Genomes 2013: scaling up access to genome-wide data

    Science.gov (United States)

    Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species. The project exploits and extends technologies for genome annotation, analysis and dissemination, developed in the context of the vertebrate-focused Ensembl project, and provi...

  15. A computational approach for the annotation of hydrogen-bonded base interactions in crystallographic structures of the ribozymes

    Energy Technology Data Exchange (ETDEWEB)

    Hamdani, Hazrina Yusof, E-mail: hazrina@mfrlab.org [School of Biosciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi (Malaysia); Advanced Medical and Dental Institute, Universiti Sains Malaysia, Bertam, Kepala Batas (Malaysia); Artymiuk, Peter J., E-mail: p.artymiuk@sheffield.ac.uk [Dept. of Molecular Biology and Biotechnology, Firth Court, University of Sheffield, S10 T2N Sheffield (United Kingdom); Firdaus-Raih, Mohd, E-mail: firdaus@mfrlab.org [School of Biosciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi (Malaysia)

    2015-09-25

    A fundamental understanding of the atomic-level interactions in ribonucleic acid (RNA) and how they contribute towards RNA architecture is an important knowledge platform to develop through the discovery of motifs, from simple arrangements of base pairs to more complex arrangements such as triples and larger patterns involving non-standard interactions. The network of hydrogen bond interactions is important in connecting bases to form potential tertiary motifs. Therefore, there is an urgent need for the development of automated methods for annotating RNA 3D structures based on hydrogen bond interactions. COnnection tables Graphs for Nucleic ACids (COGNAC) is an automated annotation system, based on graph-theoretical approaches, that has been developed for the identification of RNA 3D motifs. The program searches for patterns in the unbroken networks of hydrogen bonds in RNA structures and is capable of annotating base pairs and higher-order base interactions, ranging from triples to sextuples. COGNAC was able to discover 22 out of 32 quadruple occurrences in the Haloarcula marismortui large ribosomal subunit (PDB ID: 1FFK) and two out of three occurrences of quintuple interactions reported by the non-canonical interactions in RNA (NCIR) database. These and several other interactions of interest will be discussed in this paper. These examples demonstrate that the COGNAC program can serve as an automated annotation system that can be used to annotate conserved base-base interactions, and that its output could be added as additional information to established RNA secondary structure prediction methods.
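
    The idea of searching unbroken hydrogen-bond networks can be sketched as a connected-component search. This is only an illustration under simplifying assumptions, not the COGNAC algorithm, which the abstract does not detail: bases are plain string labels, each hydrogen-bonded contact is an undirected edge, and a group is named purely by its size. The function names and the example residue labels are hypothetical.

```python
from collections import defaultdict

# Names for groups of mutually connected bases, by group size.
SIZE_NAMES = {2: "pair", 3: "triple", 4: "quadruple", 5: "quintuple", 6: "sextuple"}


def hbond_components(edges):
    """Return the connected components of an undirected base-contact graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited base.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


def classify(components):
    """Label each component by size; sizes outside 2..6 are reported as 'other'."""
    return [(sorted(c), SIZE_NAMES.get(len(c), "other")) for c in components]


if __name__ == "__main__":
    # Hypothetical contacts: G10-C25 and C25-A40 form a triple, U5-A60 a pair.
    contacts = [("G10", "C25"), ("C25", "A40"), ("U5", "A60")]
    for bases, label in classify(hbond_components(contacts)):
        print(label, bases)
```

    A real annotator would also need geometric criteria to decide which atom pairs count as hydrogen bonds in the first place; that step is deliberately left out here.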

  16. MSeqDR: A Centralized Knowledge Repository and Bioinformatics Web Resource to Facilitate Genomic Investigations in Mitochondrial Disease.

    Science.gov (United States)

    Shen, Lishuang; Diroma, Maria Angela; Gonzalez, Michael; Navarro-Gomez, Daniel; Leipzig, Jeremy; Lott, Marie T; van Oven, Mannis; Wallace, Douglas C; Muraresku, Colleen Clarke; Zolkipli-Cunningham, Zarazuela; Chinnery, Patrick F; Attimonelli, Marcella; Zuchner, Stephan; Falk, Marni J; Gai, Xiaowu

    2016-06-01

    MSeqDR is the Mitochondrial Disease Sequence Data Resource, a centralized and comprehensive genome and phenome bioinformatics resource built by the mitochondrial disease community to facilitate clinical diagnosis and research investigations of individual patient phenotypes, genomes, genes, and variants. A central Web portal (https://mseqdr.org) integrates community knowledge from expert-curated databases with genomic and phenotype data shared by clinicians and researchers. MSeqDR also functions as a centralized application server for Web-based tools to analyze data across both mitochondrial and nuclear DNA, including investigator-driven whole exome or genome dataset analyses through MSeqDR-Genesis. MSeqDR-GBrowse genome browser supports interactive genomic data exploration and visualization with custom tracks relevant to mtDNA variation and mitochondrial disease. MSeqDR-LSDB is a locus-specific database that currently manages 178 mitochondrial diseases, 1,363 genes associated with mitochondrial biology or disease, and 3,711 pathogenic variants in those genes. MSeqDR Disease Portal allows hierarchical tree-style disease exploration to evaluate their unique descriptions, phenotypes, and causative variants. Automated genomic data submission tools are provided that capture ClinVar compliant variant annotations. PhenoTips will be used for phenotypic data submission on deidentified patients using human phenotype ontology terminology. The development of a dynamic informed patient consent process to guide data access is underway to realize the full potential of these resources.

  17. PLAN: a web platform for automating high-throughput BLAST searches and for managing and mining results

    Directory of Open Access Journals (Sweden)

    Zhao Xuechun

    2007-02-01

    Background BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Results Personal BLAST Navigator (PLAN) is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1) query and target sequence database management, (2) automated high-throughput BLAST searching, (3) indexing and searching of results, (4) filtering results online, (5) managing results of personal interest in favorite categories, (6) automated sequence annotation (such as NCBI NR and ontology-based annotation). PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc., with greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. Conclusion PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results
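
    Result filtering of the kind listed in task (4) can be sketched as follows. This is not PLAN's implementation; it assumes hits are available in the 12-column tabular format written by NCBI BLAST+ with `-outfmt 6`, and the function name and cutoff values are hypothetical defaults chosen for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5, min_identity=90.0):
    """Keep hits whose E-value and percent identity pass both cutoffs."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    return [row for row in reader
            if float(row["evalue"]) <= max_evalue
            and float(row["pident"]) >= min_identity]


if __name__ == "__main__":
    # Two made-up hits: the first passes both cutoffs, the second fails both.
    text = ("q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t350\n"
            "q1\ts2\t75.0\t80\t20\t1\t1\t80\t5\t84\t0.002\t40\n")
    for hit in filter_hits(io.StringIO(text)):
        print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

    Passing a file handle rather than a path keeps the function usable on in-memory results as well as on files on disk.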

  18. Information theory applied to the sparse gene ontology annotation network to predict novel gene function

    Science.gov (United States)

    Tao, Ying; Li, Jianrong

    2010-01-01

    Motivation Despite advances in the gene annotation process, the functions of a large portion of the gene products remain insufficiently characterized. In addition, the “in silico” prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomics approaches. Results We propose a novel approach, Information Theory-based Semantic Similarity (ITSS), to automatically predict molecular functions of genes based on Gene Ontology annotations. We have demonstrated using a 10-fold cross-validation that the ITSS algorithm obtains prediction accuracies (Precision 97%, Recall 77%) comparable to other machine learning algorithms when applied to similarly dense annotated portions of the GO datasets. In addition, this method can generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms failed to do so. As a result, our technique generates an order of magnitude more gene function predictions than previous methods. Further, this paper presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions for an evaluation than the generally used cross-validation evaluations. By manually assessing a random sample of 100 predictions conducted in a historical roll-back evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43%–58%) can be achieved for the human GO Annotation file dated 2003. Availability The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset are available at http://phenos.bsd.uchicago.edu/mphenogo/prediction_result_2005.txt. PMID:17646340
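
    Information-theoretic measures over an ontology commonly start from the information content of a term, IC(t) = log2(N / n_t), where n_t is the number of genes annotated to term t and N is the total number of annotated genes. The sketch below shows only that generic building block. It is not the ITSS algorithm, whose details the abstract does not give, and the term labels and counts are invented for illustration.

```python
import math


def information_content(term_counts, total_genes):
    """Map each term to log2(total_genes / count); rarer terms score higher."""
    return {term: math.log2(total_genes / count)
            for term, count in term_counts.items()
            if count > 0}  # terms with no annotations have undefined IC


if __name__ == "__main__":
    # Hypothetical counts: the root term covers every gene, so its IC is zero.
    counts = {"root_term": 8, "broad_term": 4, "specific_term": 1}
    for term, ic in information_content(counts, total_genes=8).items():
        print(term, ic)
```

    Because a term annotated to every gene carries no information, sparsely annotated, specific terms dominate any similarity score built on this quantity.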

  19. Automating Finance

    Science.gov (United States)

    Moore, John

    2007-01-01

    In past years, higher education's financial management side has been riddled with manual processes and aging mainframe applications. This article discusses schools which had taken advantage of an array of technologies that automate billing, payment processing, and refund processing in the case of overpayment. The investments are well worth it:…

  20. Comprehensive annotation of Glossina pallidipes salivary gland hypertrophy virus from Ethiopian tsetse flies: a proteogenomics approach

    NARCIS (Netherlands)

    Abd-Alla, Adly M.M.; Kariithi, H.M.; Cousserans, F.; Parker, N.J.; Ince, Ikbal Agah; Scully, Erin D.; Boeren, J.A.; Geib, Scott M.; Mekonnen, Solomon; Vlak, J.M.; Parker, A.G.; Vreysen, M.J.B.; Bergoin, M.

    2016-01-01

    Glossina pallidipes salivary gland hypertrophy virus (GpSGHV; family Hytrosaviridae) can establish asymptomatic and symptomatic infection in its tsetse fly host. Here, we present a comprehensive annotation of the genome of an Ethiopian GpSGHV isolate (GpSGHV-Eth) compared with the reference Ugandan

  1. SNPsnap: a Web-based tool for identification and annotation of matched SNPs

    DEFF Research Database (Denmark)

    Pers, Tune Hannes; Timshel, Pascal; Hirschhorn, Joel N.

    2015-01-01

    Summary : An important computational step following genome-wide association studies (GWAS) is to assess whether disease or trait-associated single-nucleotide polymorphisms (SNPs) enrich for particular biological annotations. SNP-based enrichment analysis needs to account for biases such as co......@broadinstitute.org Supplementary information : Supplementary data are available at Bioinformatics online....

  2. Virome Assembly and Annotation: A Surprise in the Namib Desert

    Science.gov (United States)

    Hesse, Uljana; van Heusden, Peter; Kirby, Bronwyn M.; Olonade, Israel; van Zyl, Leonardo J.; Trindade, Marla

    2017-01-01

    Sequencing, assembly, and annotation of environmental virome samples is challenging. Methodological biases and differences in species abundance result in fragmentary read coverage; sequence reconstruction is further complicated by the mosaic nature of viral genomes. In this paper, we focus on biocomputational aspects of virome analysis, emphasizing latent pitfalls in sequence annotation. Using simulated viromes that mimic environmental data challenges we assessed the performance of five assemblers (CLC-Workbench, IDBA-UD, SPAdes, RayMeta, ABySS). Individual analyses of relevant scaffold length fractions revealed shortcomings of some programs in reconstruction of viral genomes with excessive read coverage (IDBA-UD, RayMeta), and in accurate assembly of scaffolds ≥50 kb (SPAdes, RayMeta, ABySS). The CLC-Workbench assembler performed best in terms of genome recovery (including highly covered genomes) and correct reconstruction of large scaffolds; and was used to assemble a virome from a copper rich site in the Namib Desert. We found that scaffold network analysis and cluster-specific read reassembly improved reconstruction of sequences with excessive read coverage, and that strict data filtering for non-viral sequences prior to downstream analyses was essential. In this study we describe novel viral genomes identified in the Namib Desert copper site virome. Taxonomic affiliations of diverse proteins in the dataset and phylogenetic analyses of circovirus-like proteins indicated links to the marine habitat. Considering additional evidence from this dataset we hypothesize that viruses may have been carried from the Atlantic Ocean into the Namib Desert by fog and wind, highlighting the impact of the extended environment on an investigated niche in metagenome studies. PMID:28167933

  3. Virome Assembly and Annotation: A Surprise in the Namib Desert.

    Science.gov (United States)

    Hesse, Uljana; van Heusden, Peter; Kirby, Bronwyn M; Olonade, Israel; van Zyl, Leonardo J; Trindade, Marla

    2017-01-01

    Sequencing, assembly, and annotation of environmental virome samples is challenging. Methodological biases and differences in species abundance result in fragmentary read coverage; sequence reconstruction is further complicated by the mosaic nature of viral genomes. In this paper, we focus on biocomputational aspects of virome analysis, emphasizing latent pitfalls in sequence annotation. Using simulated viromes that mimic environmental data challenges we assessed the performance of five assemblers (CLC-Workbench, IDBA-UD, SPAdes, RayMeta, ABySS). Individual analyses of relevant scaffold length fractions revealed shortcomings of some programs in reconstruction of viral genomes with excessive read coverage (IDBA-UD, RayMeta), and in accurate assembly of scaffolds ≥50 kb (SPAdes, RayMeta, ABySS). The CLC-Workbench assembler performed best in terms of genome recovery (including highly covered genomes) and correct reconstruction of large scaffolds; and was used to assemble a virome from a copper rich site in the Namib Desert. We found that scaffold network analysis and cluster-specific read reassembly improved reconstruction of sequences with excessive read coverage, and that strict data filtering for non-viral sequences prior to downstream analyses was essential. In this study we describe novel viral genomes identified in the Namib Desert copper site virome. Taxonomic affiliations of diverse proteins in the dataset and phylogenetic analyses of circovirus-like proteins indicated links to the marine habitat. Considering additional evidence from this dataset we hypothesize that viruses may have been carried from the Atlantic Ocean into the Namib Desert by fog and wind, highlighting the impact of the extended environment on an investigated niche in metagenome studies.

  4. Omics data management and annotation.

    Science.gov (United States)

    Harel, Arye; Dalah, Irina; Pietrokovski, Shmuel; Safran, Marilyn; Lancet, Doron

    2011-01-01

    Technological Omics breakthroughs, including next generation sequencing, bring avalanches of data which need to undergo effective data management to ensure integrity, security, and maximal knowledge-gleaning. Data management system requirements include flexible input formats, diverse data entry mechanisms and views, user friendliness, attention to standards, hardware and software platform definition, as well as robustness. Relevant solutions elaborated by the scientific community include Laboratory Information Management Systems (LIMS) and standardization protocols facilitating data sharing and managing. In project planning, special consideration has to be made when choosing relevant Omics annotation sources, since many of them overlap and require sophisticated integration heuristics. The data modeling step defines and categorizes the data into objects (e.g., genes, articles, disorders) and creates an application flow. A data storage/warehouse mechanism must be selected, such as file-based systems and relational databases, the latter typically used for larger projects. Omics project life cycle considerations must include the definition and deployment of new versions, incorporating either full or partial updates. Finally, quality assurance (QA) procedures must validate data and feature integrity, as well as system performance expectations. We illustrate these data management principles with examples from the life cycle of the GeneCards Omics project (http://www.genecards.org), a comprehensive, widely used compendium of annotative information about human genes. For example, the GeneCards infrastructure has recently been changed from text files to a relational database, enabling better organization and views of the growing data. Omics data handling benefits from the wealth of Web-based information, the vast amount of public domain software, increasingly affordable hardware, and effective use of data management and annotation principles as outlined in this chapter.

  5. The integrated microbial genome resource of analysis.

    Science.gov (United States)

    Checcucci, Alice; Mengoni, Alessio

    2015-01-01

    Integrated Microbial Genomes and Metagenomes (IMG) is a biocomputational system that provides information and support for annotation and comparative analysis of microbial genomes and metagenomes. IMG has been developed by the US Department of Energy (DOE)-Joint Genome Institute (JGI). The IMG platform contains both draft and complete genomes, sequenced by the Joint Genome Institute, as well as other publicly available genomes. Genomes of strains belonging to the Archaea, Bacteria, and Eukarya domains are present, as well as those of viruses and plasmids. Here, we describe some essential features of the IMG system and a case study for pangenome analysis.

  6. MODBASE: a database of annotated comparative protein structure models and associated resources

    OpenAIRE

    Pieper, Ursula; Eswar, Narayanan; Davis, Fred P.; Braberg, Hannes; Madhusudhan, M. S.; Rossi, Andrea; Marti-Renom, Marc; Karchin, Rachel; Webb, Ben M.; Eramian, David; Shen, Min-Yi; Kelly, Libusha; Melo, Francisco; Sali, Andrej

    2005-01-01

    MODBASE is a database of annotated comparative protein structure models for all available protein sequences that can be matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on MODELLER for fold assignment, sequence–structure alignment, model building and model assessment. MODBASE is updated regularly to reflect the growth in protein sequence and structure databases, and improvements in the software for calculat...

  7. Knowledge Annotation: Making Implicit Knowledge Explicit

    CERN Document Server

    Dingli, Alexiei

    2011-01-01

    Did you ever read something in a book, feel the need to comment, take up a pencil and scribble something on the book's text? If you did, you just annotated a book. But that process has now become something fundamental and revolutionary in these days of computing. Annotation is all about adding further information to text, pictures, movies and even to physical objects. In practice, anything which can be identified either virtually or physically can be annotated. In this book, we will delve into what makes annotations, and analyse their significance for the future evolutions of the web. We wil

  8. RIKEN mouse genome encyclopedia.

    Science.gov (United States)

    Hayashizaki, Yoshihide

    2003-01-01

    We have been working to establish a comprehensive mouse full-length cDNA collection and sequence database covering as many genes as we can, named the RIKEN mouse genome encyclopedia. Recently we have been constructing higher-level annotation (Functional ANnoTation Of Mouse cDNA; FANTOM) based not only on homology searches but also on expression data profiles, mapping information and protein-protein databases. More than 1,000,000 clones prepared from 163 tissues were end-sequenced and classified into 159,789 clusters, and 60,770 representative clones were fully sequenced. In conclusion, the 60,770 sequences contained 33,409 that were unique. The next generation of life science is clearly based on all of the genome information and resources. Based on our cDNA clones we developed additional systems to explore gene function: a cDNA microarray system to print all of these cDNA clones, a protein-protein interaction screening system, a protein-DNA interaction screening system and so on. The integrated database of all this information is very useful not only for analysis of gene transcriptional networks but also for connecting genes to phenotypes to facilitate the positional candidate approach. In this talk, the prospects for the application of these genome resources will be discussed. More information is available at the web page: http://genome.gsc.riken.go.jp/.

  9. 1D and 2D annotation enrichment: a statistical method integrating quantitative proteomics with complementary high-throughput data

    Directory of Open Access Journals (Sweden)

    Cox Juergen

    2012-11-01

    Quantitative proteomics now provides abundance ratios for thousands of proteins upon perturbations. These need to be functionally interpreted and correlated to other types of quantitative genome-wide data such as the corresponding transcriptome changes. We describe a new method, 2D annotation enrichment, which compares quantitative data from any two 'omics' types in the context of categorical annotation of the proteins or genes. Suitable genome-wide categories are membership of proteins in biochemical pathways, their annotation with gene ontology terms, sub-cellular localization, the presence of protein domains or the membership in protein complexes. 2D annotation enrichment detects annotation terms whose members show consistent behavior in one or both of the data dimensions. This consistent behavior can be a correlation between the two data types, such as simultaneous up- or down-regulation in both data dimensions, or a lack thereof, such as regulation in one dimension but no change in the other. For the statistical formulation of the test we introduce a two-dimensional generalization of the nonparametric two-sample test. The false discovery rate is stringently controlled by correcting for multiple hypothesis testing. We also describe one-dimensional annotation enrichment, which can be applied to single omics data. The 1D and 2D annotation enrichment algorithms are freely available as part of the Perseus software.
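
    The multiple-testing correction step can be illustrated with the Benjamini-Hochberg procedure. The abstract does not name the correction it uses, so treating it as Benjamini-Hochberg is an assumption made here because it is a common way to control the false discovery rate; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # One made-up p-value per annotation term.
    p_by_term = {"term_a": 0.01, "term_b": 0.04, "term_c": 0.03, "term_d": 0.5}
    q_values = benjamini_hochberg(list(p_by_term.values()))
    for term, q in zip(p_by_term, q_values):
        print(term, round(q, 4))
```

    Terms whose adjusted value falls below the chosen FDR level would be reported as enriched; the per-term p-values themselves would come from the one- or two-dimensional test described in the abstract.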

  10. TEnest 2.0: computational annotation and visualization of nested transposable elements.

    Science.gov (United States)

    Kronmiller, Brent A; Wise, Roger P

    2013-01-01

    Grass genomes harbor a diverse and complex content of repeated sequences. Most of these repeats occur as abundant transposable elements (TEs), which present unique challenges to sequencing, assembling, and annotating genomes. Multiple copies of Long Terminal Repeat (LTR) retrotransposons can hinder sequence assembly and also cause problems with gene annotation. TEs can also contain protein-encoding genes, the ancient remnants of which can mislead gene identification software if not correctly masked. Hence, accurate assembly is crucial for gene annotation. We present TEnest v2.0. TEnest computationally annotates and chronologically displays nested transposable elements. Utilizing organism-specific TE databases as a reference for reconstructing degraded TEs to their ancestral state, annotation of repeats is accomplished by iterative sequence alignment. The output is a graphical display of the chronological nesting structure and the coordinate positions of all TE insertions. Both Linux command-line and Web versions of the TEnest software are available at www.wiselab.org and www.plantgdb.org/tool/, respectively.

  11. The AnnoLite and AnnoLyze programs for comparative annotation of protein structures

    Directory of Open Access Journals (Sweden)

    Dopazo Joaquín

    2007-05-01

    Background Advances in structural biology, including structural genomics, have resulted in a rapid increase in the number of experimentally determined protein structures. However, about half of the structures deposited by the structural genomics consortia have little or no information about their biological function. Therefore, there is a need for tools for automatically and comprehensively annotating the function of protein structures. We aim to provide such tools by applying comparative protein structure annotation that relies on detectable relationships between protein structures to transfer functional annotations. Here we introduce two programs, AnnoLite and AnnoLyze, which use the structural alignments deposited in the DBAli database. Description AnnoLite predicts the SCOP, CATH, EC, InterPro, PfamA, and GO terms with an average sensitivity of ~90% and average precision of ~80%. AnnoLyze predicts ligand binding site and domain interaction patches with an average sensitivity of ~70% and average precision of ~30%, correctly localizing binding sites for small molecules in ~95% of its predictions. Conclusion The AnnoLite and AnnoLyze programs for comparative annotation of protein structures can reliably and automatically annotate new protein structures. The programs are fully accessible via the Internet as part of the DBAli suite of tools at http://salilab.org/DBAli/.
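
    The sensitivity and precision figures quoted above follow the usual definitions: sensitivity is the fraction of true annotations that were predicted, and precision is the fraction of predictions that are true. The sketch below computes both for one structure; it is a generic illustration with invented term labels, not the evaluation code behind the reported averages.

```python
def sensitivity_precision(predicted, actual):
    """Return (sensitivity, precision) for two collections of annotation terms."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty sets so the ratios are always defined.
    sensitivity = true_positives / len(actual) if actual else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    # Hypothetical terms: two of three true terms recovered, four predicted.
    predicted_terms = {"term_a", "term_b", "term_c", "term_d"}
    actual_terms = {"term_a", "term_b", "term_e"}
    print(sensitivity_precision(predicted_terms, actual_terms))
```

    Averaging these per-structure values over a benchmark set gives summary figures of the kind reported in the abstract.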

  12. Heating automation

    OpenAIRE

    Tomažič, Tomaž

    2013-01-01

    This degree paper presents usage and operation of peripheral devices with microcontroller for heating automation. The main goal is to make a quality system control for heating three house floors and with that, increase efficiency of heating devices and lower heating expenses. Heat pump, furnace, boiler pump, two floor-heating pumps and two radiator pumps need to be controlled by this system. For work, we have chosen a development kit stm32f4 - discovery with five temperature sensors, LCD disp...

  13. Automation Security

    OpenAIRE

    Mirzoev, Dr. Timur

    2014-01-01

    Web-based Automated Process Control systems are a new type of application that uses the Internet to control industrial processes with access to real-time data. Supervisory control and data acquisition (SCADA) networks contain computers and applications that perform key functions in providing essential services and commodities (e.g., electricity, natural gas, gasoline, water, waste treatment, transportation) to all Americans. As such, they are part of the nation's critical infrastructu...

  14. Marketing automation

    OpenAIRE

    Raluca Dania TODOR

    2017-01-01

    The automation of the marketing process seems nowadays to be the only solution to face the major changes brought by the fast evolution of technology and the continuous increase in supply and demand. In order to achieve the desired marketing results, businesses have to employ digital marketing and communication services. These services are efficient and measurable thanks to the marketing technology used to track, score and implement each campaign. Due to the...

  15. BioSAVE: Display of scored annotation within a sequence context

    Directory of Open Access Journals (Sweden)

    Adryan Boris

    2008-03-01

    Full Text Available Abstract Background Visualization of sequence annotation is a common feature in many bioinformatics tools. For many applications it is desirable to restrict the display of such annotation according to a score cutoff, as biological interpretation can be difficult in the presence of the entire data. Unfortunately, many visualisation solutions are somewhat static in the way they handle such score cutoffs. Results We present BioSAVE, a sequence annotation viewer with on-the-fly selection of visualisation thresholds for each feature. BioSAVE is a versatile OS X program for visual display of scored features (annotation) within a sequence context. The program reads sequence and additional supplementary annotation data (e.g., position weight matrix matches, conservation scores, structural domains) from a variety of commonly used file formats and displays them graphically. Onscreen controls then allow for live customisation of these graphics, including on-the-fly selection of visualisation thresholds for each feature. Conclusion Possible applications of the program include display of transcription factor binding sites in a genomic context or the visualisation of structural domain assignments in protein sequences and many more. The dynamic visualisation of these annotations is useful, e.g., for the determination of cutoff values of predicted features to match experimental data. Program, source code and exemplary files are freely available at the BioSAVE homepage.

  16. Novel definition files for human GeneChips based on GeneAnnot

    Directory of Open Access Journals (Sweden)

    Ferrari Sergio

    2007-11-01

    Full Text Available Abstract Background Improvements in genome sequence annotation revealed discrepancies in the original probeset/gene assignment in Affymetrix microarrays and the existence of differences between annotations and effective alignments of probes and transcription products. In the current generation of Affymetrix human GeneChips, most probesets include probes matching transcripts from more than one gene and probes which do not match any transcribed sequence. Results We developed a novel set of custom Chip Definition Files (CDF) and the corresponding Bioconductor libraries for Affymetrix human GeneChips, based on the information contained in the GeneAnnot database. GeneAnnot-based CDFs are composed of unique custom-probesets, including only probes matching a single gene. Conclusion GeneAnnot-based custom CDFs solve the problem of a reliable reconstruction of expression levels and eliminate the existence of more than one probeset per gene, which often leads to discordant expression signals for the same transcript when gene differential expression is the focus of the analysis. GeneAnnot CDFs are freely distributed and fully compliant with Affymetrix standards and all available software for gene expression analysis. The CDF libraries are available from http://www.xlab.unimo.it/GA_CDF, along with supplementary information (CDF libraries, installation guidelines and R code, CDF statistics, and analysis results).

  17. PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data.

    Science.gov (United States)

    Hart, Steven N; Moore, Raymond M; Zimmermann, Michael T; Oliver, Gavin R; Egan, Jan B; Bryce, Alan H; Kocher, Jean-Pierre A

    2015-01-01

    Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advanced far beyond expectations and now we are able to generate data faster than it can be interpreted. Materials and Methods. We have developed PANDA (Pathway AND Annotation) Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations). Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators. Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user's own data. Another use-case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma. Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with context of biological pathways. The application can be downloaded or used directly from our website: http://bioinformaticstools.mayo.edu/research/panda-viewer/.

  18. PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data

    Directory of Open Access Journals (Sweden)

    Steven N. Hart

    2015-05-01

    Full Text Available Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advanced far beyond expectations and now we are able to generate data faster than it can be interpreted. Materials and Methods. We have developed PANDA (Pathway AND Annotation) Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations). Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators. Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user’s own data. Another use-case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma. Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with context of biological pathways. The application can be downloaded or used directly from our website: http://bioinformaticstools.mayo.edu/research/panda-viewer/.

  19. Comprehensive Annotation of Mature Peptides and Genotypes for Zika Virus

    Science.gov (United States)

    Sun, Guangyu; Baumgarth, Nicole; Klem, Edward B.; Scheuermann, Richard H.

    2017-01-01

    The rapid spread of Zika virus (ZIKV) has caused much concern in the global health community, due in part to a link to fetal microcephaly and other neurological illnesses. While an increasing amount of ZIKV genomic sequence data is being generated, an understanding of the virus molecular biology is still greatly lacking. A significant step towards establishing ZIKV proteomics would be the compilation of all proteins produced by the virus, and the resultant virus genotypes. Here we report for the first time such data, using new computational methods for the annotation of mature peptide proteins, genotypes, and recombination events for all ZIKV genomes. The data is made publicly available through the Virus Pathogen Resource at www.viprbrc.org. PMID:28125631

  20. The UCSC Genome Browser database: 2016 update.

    Science.gov (United States)

    Speir, Matthew L; Zweig, Ann S; Rosenbloom, Kate R; Raney, Brian J; Paten, Benedict; Nejad, Parisa; Lee, Brian T; Learned, Katrina; Karolchik, Donna; Hinrichs, Angie S; Heitner, Steve; Harte, Rachel A; Haeussler, Maximilian; Guruvadoo, Luvina; Fujita, Pauline A; Eisenhart, Christopher; Diekhans, Mark; Clawson, Hiram; Casper, Jonathan; Barber, Galt P; Haussler, David; Kuhn, Robert M; Kent, W James

    2016-01-01

    For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for various assemblies, including bonobo and zebrafish; new gene annotation sets; improvements to track and assembly hub support; and a new interactive tool, the "Data Integrator", for intersecting data from multiple tracks. We have greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment.

  1. Construction of coffee transcriptome networks based on gene annotation semantics.

    Science.gov (United States)

    Castillo, Luis F; Galeano, Narmer; Isaza, Gustavo A; Gaitán, Alvaro

    2012-07-24

    Gene annotation is a process that encompasses multiple approaches on the analysis of nucleic acids or protein sequences in order to assign structural and functional characteristics to gene models. When thousands of gene models are being described in an organism genome, construction and visualization of gene networks impose novel challenges in the understanding of complex expression patterns and the generation of new knowledge in genomics research. In order to take advantage of accumulated text data after conventional gene sequence analysis, this work applied semantics in combination with visualization tools to build transcriptome networks from a set of coffee gene annotations. A set of selected coffee transcriptome sequences, chosen by the quality of the sequence comparison reported by Basic Local Alignment Search Tool (BLAST) and Interproscan, were filtered out by coverage, identity, length of the query, and e-values. Meanwhile, term descriptors for molecular biology and biochemistry were obtained along the Wordnet dictionary in order to construct a Resource Description Framework (RDF) using Ruby scripts and Methontology to find associations between concepts. Relationships between sequence annotations and semantic concepts were graphically represented through a total of 6845 oriented vectors, which were reduced to 745 non-redundant associations. A large gene network connecting transcripts by way of relational concepts was created where detailed connections remain to be validated for biological significance based on current biochemical and genetics frameworks. Besides reusing text information in the generation of gene connections and for data mining purposes, this tool development opens the possibility to visualize complex and abundant transcriptome data, and triggers the formulation of new hypotheses in metabolic pathways analysis.

  2. ePIANNO: ePIgenomics ANNOtation tool.

    Directory of Open Access Journals (Sweden)

    Chia-Hsin Liu

    Full Text Available Recently, with the development of next generation sequencing (NGS), the combination of chromatin immunoprecipitation (ChIP) and NGS, namely ChIP-seq, has become a powerful technique to capture potential genomic binding sites of regulatory factors, histone modifications and chromatin accessible regions. For most researchers, additional information including genomic variations on the TF binding site, allele frequency of variation between different populations, variation associated disease, and other neighbour TF binding sites is essential to generate a proper hypothesis or a meaningful conclusion. Many ChIP-seq datasets have been deposited in the public domain to help researchers make new discoveries. However, researchers are often intimidated by the complexity of the data structure and the size of the data volume. Such information would be more useful if it could be combined or downloaded with ChIP-seq data. To meet such demands, we built a webtool: ePIgenomic ANNOtation tool (ePIANNO, http://epianno.stat.sinica.edu.tw/index.html). ePIANNO is a web server that combines SNP information of populations (1000 Genomes Project) and gene-disease association information of GWAS (NHGRI) with ChIP-seq (hmChIP, ENCODE, and ROADMAP epigenomics) data. ePIANNO has a user-friendly website interface allowing researchers to explore, navigate, and extract data quickly. We use two examples to demonstrate how users could use functions of the ePIANNO webserver to explore useful information about TF related genomic variants. Users could use our query functions to search target regions, transcription factors, or annotations. ePIANNO may help users to generate hypotheses or explore potential biological functions for their studies.

  3. Towards an event annotated corpus of Polish

    Directory of Open Access Journals (Sweden)

    Michał Marcińczuk

    2015-12-01

    Full Text Available The paper presents a typology of events built on the basis of the TimeML specification adapted to the Polish language. Some changes were introduced to the definition of the event categories and a motivation for event categorization was formulated. The event annotation task is presented on two levels: the ontology level (language independent) and text mentions (language dependent). The various types of event mentions in Polish text are discussed. A procedure for annotation of event mentions in Polish texts is presented and evaluated. In the evaluation a randomly selected set of documents from the Corpus of Wrocław University of Technology (called KPWr) was annotated by two linguists and the annotator agreement was calculated. The evaluation was done in two iterations. After the first evaluation we revised and improved the annotation procedure. The second evaluation showed a significant improvement of the agreement between annotators. The current work was focused on annotation and categorisation of event mentions in text. The future work will be focused on description of events with a set of attributes, arguments and relations.

  4. Ground Truth Annotation in T Analyst

    DEFF Research Database (Denmark)

    2015-01-01

    This video shows how to annotate the ground truth tracks in the thermal videos. The ground truth tracks are produced to be able to compare them to tracks obtained from a Computer Vision tracking approach. The program used for annotation is T-Analyst, which is developed by Aliaksei Laureshyn, Ph...

  5. Creating Gaze Annotations in Head Mounted Displays

    DEFF Research Database (Denmark)

    Mardanbeigi, Diako; Qvarfordt, Pernilla

    2015-01-01

    To facilitate distributed communication in mobile settings, we developed GazeNote for creating and sharing gaze annotations in head mounted displays (HMDs). With gaze annotations it is possible to point out objects of interest within an image and add a verbal description. To create an annotation, ...

  6. Annotation of regular polysemy and underspecification

    DEFF Research Database (Denmark)

    Martínez Alonso, Héctor; Pedersen, Bolette Sandford; Bel, Núria

    2013-01-01

    We present the result of an annotation task on regular polysemy for a series of semantic classes or dot types in English, Danish and Spanish. This article describes the annotation process, the results in terms of inter-encoder agreement, and the sense distributions obtained with two methods...

  7. Harnessing Collaborative Annotations on Online Formative Assessments

    Science.gov (United States)

    Lin, Jian-Wei; Lai, Yuan-Cheng

    2013-01-01

    This paper harnesses collaborative annotations by students as learning feedback on online formative assessments to improve the learning achievements of students. Through the developed Web platform, students can conduct formative assessments, collaboratively annotate, and review historical records in a convenient way, while teachers can generate…

  8. The surplus value of semantic annotations

    NARCIS (Netherlands)

    M. Marx

    2010-01-01

    We compare the costs of semantic annotation of textual documents to its benefits for information processing tasks. Semantic annotation can improve the performance of retrieval tasks and facilitates an improved search experience through faceted search, focused retrieval, better document summaries, an

  9. Manual Annotation of Translational Equivalence The Blinker Project

    CERN Document Server

    Melamed, I D

    1998-01-01

    Bilingual annotators were paid to link roughly sixteen thousand corresponding words between on-line versions of the Bible in modern French and modern English. These annotations are freely available to the research community from http://www.cis.upenn.edu/~melamed . The annotations can be used for several purposes. First, they can be used as a standard data set for developing and testing translation lexicons and statistical translation models. Second, researchers in lexical semantics will be able to mine the annotations for insights about cross-linguistic lexicalization patterns. Third, the annotations can be used in research into certain recently proposed methods for monolingual word-sense disambiguation. This paper describes the annotated texts, the specially-designed annotation tool, and the strategies employed to increase the consistency of the annotations. The annotation process was repeated five times by different annotators. Inter-annotator agreement rates indicate that the annotations are reasonably rel...

  10. Protein function annotation by local binding site surface similarity.

    Science.gov (United States)

    Spitzer, Russell; Cleves, Ann E; Varela, Rocco; Jain, Ajay N

    2014-04-01

    Hundreds of protein crystal structures exist for proteins whose function cannot be confidently determined from sequence similarity. Surflex-PSIM, a previously reported surface-based protein similarity algorithm, provides an alternative method for hypothesizing function for such proteins. The method now supports fully automatic binding site detection and is fast enough to screen comprehensive databases of protein binding sites. The binding site detection methodology was validated on apo/holo cognate protein pairs, correctly identifying 91% of ligand binding sites in holo structures and 88% in apo structures where corresponding sites existed. For correctly detected apo binding sites, the cognate holo site was the most similar binding site 87% of the time. PSIM was used to screen a set of proteins that had poorly characterized functions at the time of crystallization, but were later biochemically annotated. Using a fully automated protocol, this set of 8 proteins was screened against ∼60,000 ligand binding sites from the PDB. PSIM correctly identified functional matches that predated query protein biochemical annotation for five out of the eight query proteins. A panel of 12 currently unannotated proteins was also screened, resulting in a large number of statistically significant binding site matches, some of which suggest likely functions for the poorly characterized proteins.

  11. Genome Exploitation and Bioinformatics Tools

    Science.gov (United States)

    de Jong, Anne; van Heel, Auke J.; Kuipers, Oscar P.

    Bioinformatic tools can greatly improve the efficiency of bacteriocin screening efforts by limiting the number of strains. Different classes of bacteriocins can be detected in genomes by looking at different features. Finding small bacteriocins can be especially challenging due to low homology and because small open reading frames (ORFs) are often omitted from annotations. In this chapter, several bioinformatic tools/strategies to identify bacteriocins in genomes are discussed.

  12. Structured RNAs and synteny regions in the pig genome

    DEFF Research Database (Denmark)

    Anthon, Christian; Tafer, Hakim; Havgaard, Jakob H

    2014-01-01

    for Laurasiatheria (pig, cow, dolphin, horse, cat, dog, hedgehog). CONCLUSIONS: We have obtained one of the most comprehensive annotations for structured ncRNAs of a mammalian genome, which is likely to play central roles in both health modelling and production. The core annotation is available in Ensembl 70...

  13. Automated Detection of Solar Eruptions

    CERN Document Server

    Hurlburt, Neal

    2015-01-01

    Observation of the solar atmosphere reveals a wide range of motions, from small scale jets and spicules to global-scale coronal mass ejections. Identifying and characterizing these motions are essential to advancing our understanding of the drivers of space weather. Both automated and visual identifications are currently used in identifying CMEs. To date, eruptions near the solar surface (which may be precursors to CMEs) have been identified primarily by visual inspection. Here we report on EruptionPatrol (EP): a software module that is designed to automatically identify eruptions from data collected by SDO/AIA. We describe the method underlying the module and compare its results to previous identifications found in the Heliophysics Event Knowledgebase. EP identifies eruption events that are consistent with those found by human annotations, but in a significantly more consistent and quantitative manner. Eruptions are found to be distributed within 15Mm of the solar surface. They possess peak speeds ranging from...

  14. A Common XML-based Framework for Syntactic Annotations

    CERN Document Server

    Ide, Nancy; Erjavec, Tomaz

    2009-01-01

    It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and that standards for linguistic annotation are becoming increasingly mandatory. To answer this need, we have developed a framework comprised of an abstract model for a variety of different annotation types (e.g., morpho-syntactic tagging, syntactic annotation, co-reference annotation, etc.), which can be instantiated in different ways depending on the annotator's approach and goals. In this paper we provide an overview of the framework, demonstrate its applicability to syntactic annotation, and show how it can contribute to comparative evaluation of parser output and diverse syntactic annotation schemes.

  15. AgBase: a functional genomics resource for agriculture

    Directory of Open Access Journals (Sweden)

    Hill David P

    2006-09-01

    Full Text Available Abstract Background Many agricultural species and their pathogens have sequenced genomes and more are in progress. Agricultural species provide food, fiber, xenotransplant tissues, biopharmaceuticals and biomedical models. Moreover, many agricultural microorganisms are human zoonoses. However, systems biology from functional genomics data is hindered in agricultural species because agricultural genome sequences have relatively poor structural and functional annotation and agricultural research communities are smaller with limited funding compared to many model organism communities. Description To facilitate systems biology in these traditionally agricultural species we have established "AgBase", a curated, web-accessible, public resource (http://www.agbase.msstate.edu) for structural and functional annotation of agricultural genomes. The AgBase database includes a suite of computational tools to use GO annotations. We use standardized nomenclature following the Human Genome Organization Gene Nomenclature guidelines and are currently functionally annotating chicken, cow and sheep gene products using the Gene Ontology (GO). The computational tools we have developed accept and batch process data derived from different public databases (with different accession codes), return all existing GO annotations, provide a list of products without GO annotation, identify potential orthologs, model functional genomics data using GO and assist proteomics analysis of ESTs and EST assemblies. Our journal database helps prevent redundant manual GO curation. We encourage and publicly acknowledge GO annotations from researchers and provide a service for researchers interested in GO and analysis of functional genomics data. Conclusion The AgBase database is the first database dedicated to functional genomics and systems biology analysis for agriculturally important species and their pathogens. We use experimental data to improve structural annotation of genomes and to

  16. A methodology to annotate systems biology markup language models with the synthetic biology open language.

    Science.gov (United States)

    Roehner, Nicholas; Myers, Chris J

    2014-02-21

    Recently, we have begun to witness the potential of synthetic biology, noted here in the form of bacteria and yeast that have been genetically engineered to produce biofuels, manufacture drug precursors, and even invade tumor cells. The success of these projects, however, has often failed in translation and application to new projects, a problem exacerbated by a lack of engineering standards that combine descriptions of the structure and function of DNA. To address this need, this paper describes a methodology to connect the systems biology markup language (SBML) to the synthetic biology open language (SBOL), existing standards that describe biochemical models and DNA components, respectively. Our methodology involves first annotating SBML model elements such as species and reactions with SBOL DNA components. A graph is then constructed from the model, with vertices corresponding to elements within the model and edges corresponding to the cause-and-effect relationships between these elements. Lastly, the graph is traversed to assemble the annotating DNA components into a composite DNA component, which is used to annotate the model itself and can be referenced by other composite models and DNA components. In this way, our methodology can be used to build up a hierarchical library of models annotated with DNA components. Such a library is a useful input to any future genetic technology mapping algorithm that would automate the process of composing DNA components to satisfy a behavioral specification. Our methodology for SBML-to-SBOL annotation is implemented in the latest version of our genetic design automation (GDA) software tool, iBioSim.
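
    The three steps described in this record (annotate model elements with DNA components, build a cause-and-effect graph, traverse it to assemble a composite) can be illustrated with a minimal Python sketch. This is not the iBioSim implementation: it assumes a topological traversal is an acceptable stand-in for the paper's traversal, and the function name, element names and component identifiers are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def assemble_composite(annotations, edges):
    """Concatenate annotating DNA components in cause-before-effect order.

    annotations: dict mapping a model element to its list of DNA component ids.
    edges: iterable of (cause, effect) pairs between model elements.
    """
    # TopologicalSorter expects each node mapped to its predecessors.
    predecessors = {element: set() for element in annotations}
    for cause, effect in edges:
        predecessors.setdefault(cause, set())
        predecessors.setdefault(effect, set()).add(cause)

    composite = []
    for element in TopologicalSorter(predecessors).static_order():
        composite.extend(annotations.get(element, []))
    return composite


if __name__ == "__main__":
    # Hypothetical model: a simple chain of three annotated elements.
    annotations = {
        "inducer_binding": ["promoter_A"],
        "transcription": ["rbs_1", "cds_repressor"],
        "repression": ["terminator_1"],
    }
    edges = [
        ("inducer_binding", "transcription"),
        ("transcription", "repression"),
    ]
    print(assemble_composite(annotations, edges))
```

    In this sketch the composite is simply an ordered list of component identifiers; a real tool would also have to handle cycles and branching, which the topological ordering used here does not address.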

  17. Efficient assembly and annotation of the transcriptome of catfish by RNA-Seq analysis of a doubled haploid homozygote

    Directory of Open Access Journals (Sweden)

    Liu Shikai

    2012-11-01

    Full Text Available Abstract Background Upon the completion of whole genome sequencing, thorough genome annotation that associates genome sequences with biological meanings is essential. Genome annotation depends on the availability of transcript information as well as orthology information. In teleost fish, genome annotation is seriously hindered by genome duplication. Because of gene duplications, one cannot establish orthologies simply by homology comparisons. Rather, intense phylogenetic analysis or structural analysis of orthologies is required for the identification of genes. To conduct phylogenetic analysis and orthology analysis, full-length transcripts are essential. Generation of large numbers of full-length transcripts using traditional transcript sequencing is very difficult and extremely costly. Results In this work, we took advantage of a doubled haploid catfish, which has two sets of identical chromosomes and in theory should have no allelic variations. As such, transcript sequences generated from next-generation sequencing can be favorably assembled into full-length transcripts. Deep sequencing of the doubled haploid channel catfish transcriptome was performed using the Illumina HiSeq 2000 platform, yielding over 300 million high-quality trimmed reads totaling 27 Gbp. Assembly of these reads generated 370,798 non-redundant transcript-derived contigs. Functional annotation of the assembly allowed identification of 25,144 unique protein-encoding genes. A total of 2,659 unique genes were identified as putative duplicated genes in the catfish genome because the assembly of the corresponding transcripts harbored PSVs or MSVs (in the form of pseudo-SNPs in the assembly). Of the 25,144 contigs with unique protein hits, around 20,000 contigs matched 50% length of reference proteins, and over 14,000 transcripts were identified as full-length with complete open reading frames. The characterization of consensus sequences surrounding start codon and the stop

  18. Making web annotations persistent over time

    Energy Technology Data Exchange (ETDEWEB)

    Sanderson, Robert [Los Alamos National Laboratory; Van De Sompel, Herbert [Los Alamos National Laboratory

    2010-01-01

    As Digital Libraries (DL) become more aligned with the web architecture, their functional components need to be fundamentally rethought in terms of URIs and HTTP. Annotation, a core scholarly activity enabled by many DL solutions, exhibits a clearly unacceptable characteristic when existing models are applied to the web: due to the representations of web resources changing over time, an annotation made about a web resource today may no longer be relevant to the representation that is served from that same resource tomorrow. We assume the existence of archived versions of resources, and combine the temporal features of the emerging Open Annotation data model with the capability offered by the Memento framework that allows seamless navigation from the URI of a resource to archived versions of that resource, and arrive at a solution that provides guarantees regarding the persistence of web annotations over time. More specifically, we provide theoretical solutions and proof-of-concept experimental evaluations for two problems: reconstructing an existing annotation so that the correct archived version is displayed for all resources involved in the annotation, and retrieving all annotations that involve a given archived version of a web resource.

  19. Genome Sequence of Actinobacillus suis Type Strain ATCC 33415T.

    Science.gov (United States)

    Calcutt, Michael J; Foecking, Mark F; Mhlanga-Mutangadura, Tendai; Reilly, Thomas J

    2014-09-18

    The assembled and annotated genome of Actinobacillus suis ATCC 33415(T) is reported here. The 2,501,598-bp genome encodes 2,246 open reading frames (ORFs) with strain variable incursion of an integrative conjugative element into a tRNA locus. Comparative analysis of the deduced gene set should inform our understanding of pathogenesis, genomic plasticity, and serotype variation.

  20. The GLOBE-Consortium: The Next-Generation Genome Viewer

    NARCIS (Netherlands)

    T.A. Knoch (Tobias); H.J.F.M.M. Eussen (Bert); M.J. Moorhouse (Michael)

    2006-01-01

    The GLOBE 3D Genome Viewer is the novel system-biology oriented genome browser necessary to access, present, annotate, and to simulate the holistic genome complexity in a unique gateway towards a real understanding, educative presentation and curative manipulation planning of this tr

  1. From genomes to pangenomes: understanding variation among individuals and species

    OpenAIRE

    Contreras-Moreira, Bruno; Vinuesa, Pablo

    2017-01-01

    This tutorial illustrates how to analyze pan-genomes using GET_HOMOLOGUES and GET_HOMOLOGUES-EST. After a short introduction, where the main concepts are illustrated, the remaining sections cover the installation and typical operations required to analyze and annotate genomes and transcriptomes from a pan-genome perspective, in which individuals or species contribute genetic material to a pool.

  2. AgBase: a functional genomics resource for agriculture

    OpenAIRE

    2006-01-01

    Background Many agricultural species and their pathogens have sequenced genomes and more are in progress. Agricultural species provide food, fiber, xenotransplant tissues, biopharmaceuticals and biomedical models. Moreover, many agricultural microorganisms are human zoonoses. However, systems biology from functional genomics data is hindered in agricultural species because agricultural genome sequences have relatively poor structural and functional annotation and agricultural researc...

  3. Dana-Farber Cancer Institute | Office of Cancer Genomics

    Science.gov (United States)

    Functional Annotation of Cancer Genomes. Principal Investigator: William C. Hahn, M.D., Ph.D. The comprehensive characterization of cancer genomes has provided, and will continue to provide, an increasingly complete catalog of genetic alterations in specific cancers. However, most epithelial cancers harbor hundreds of genetic alterations as a consequence of genomic instability. Therefore, the functional consequences of the majority of mutations remain unclear.