WorldWideScience

Sample records for automated genome annotations

  1. BEACON: automated tool for Bacterial GEnome Annotation ComparisON

    KAUST Repository

    Kalkatawi, Manal Matoq Saeed

    2015-08-18

    Background Genome annotation is one way of summarizing the existing knowledge about genomic characteristics of an organism. There has been an increased interest during the last several decades in computer-based structural and functional genome annotation. Many methods for this purpose have been developed for eukaryotes and prokaryotes. Our study focuses on comparison of functional annotations of prokaryotic genomes. To the best of our knowledge there is no fully automated system for detailed comparison of functional genome annotations generated by different annotation methods (AMs). Results The presence of many AMs and development of new ones introduce needs to: a/ compare different annotations for a single genome, and b/ generate annotation by combining individual ones. To address these issues we developed an Automated Tool for Bacterial GEnome Annotation ComparisON (BEACON) that benefits both AM developers and annotation analysers. BEACON provides detailed comparison of gene function annotations of prokaryotic genomes obtained by different AMs and generates extended annotations through combination of individual ones. For the illustration of BEACON’s utility, we provide a comparison analysis of multiple different annotations generated for four genomes and show on these examples that the extended annotation can increase the number of genes annotated by putative functions up to 27 %, while the number of genes without any function assignment is reduced. Conclusions We developed BEACON, a fast tool for an automated and a systematic comparison of different annotations of single genomes. The extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/

  2. An automated annotation tool for genomic DNA sequences using GeneScan and BLAST

    Indian Academy of Sciences (India)

    Andrew M. Lynn; Chakresh Kumar Jain; K. Kosalai; Pranjan Barman; Nupur Thakur; Harish Batra; Alok Bhattacharya

    2001-04-01

    Genomic sequence data are often available well before the annotated sequence is published. We present a method for analysis of genomic DNA to identify coding sequences using the GeneScan algorithm and characterize these resultant sequences by BLAST. The routines are used to develop a system for automated annotation of genome DNA sequences.

  3. Evaluation of Three Automated Genome Annotations for Halorhabdus utahensis

    DEFF Research Database (Denmark)

    Bakke, Peter; Carney, Nick; DeLoache, Will;

    2009-01-01

    Genome annotations are accumulating rapidly and depend heavily on automated annotation systems. Many genome centers offer annotation systems but no one has compared their output in a systematic way to determine accuracy and inherent errors. Errors in the annotations are routinely deposited in...... databases such as NCBI and used to validate subsequent annotation errors. We submitted the genome sequence of halophilic archaeon Halorhabdus utahensis to be analyzed by three genome annotation services. We have examined the output from each service in a variety of ways in order to compare the methodology...... and effectiveness of the annotations, as well as to explore the genes, pathways, and physiology of the previously unannotated genome. The annotation services differ considerably in gene calls, features, and ease of use. We had to manually identify the origin of replication and the species...

  4. The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

    OpenAIRE

    Stevens Fred J; Johnson Seth; Desai Valmik; Zavaljevski Nela; Yu Chenggang; Reifman Jaques

    2008-01-01

    Abstract Background Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeli...

  5. Automated annotation of microbial proteomes in SWISS-PROT.

    Science.gov (United States)

    Gattiker, Alexandre; Michoud, Karine; Rivoire, Catherine; Auchincloss, Andrea H; Coudert, Elisabeth; Lima, Tania; Kersey, Paul; Pagni, Marco; Sigrist, Christian J A; Lachaize, Corinne; Veuthey, Anne Lise; Gasteiger, Elisabeth; Bairoch, Amos

    2003-02-01

    Large-scale sequencing of prokaryotic genomes demands the automation of certain annotation tasks currently manually performed in the production of the SWISS-PROT protein knowledgebase. The HAMAP project, or 'High-quality Automated and Manual Annotation of microbial Proteomes', aims to integrate manual and automatic annotation methods in order to enhance the speed of the curation process while preserving the quality of the database annotation. Automatic annotation is only applied to entries that belong to manually defined orthologous families and to entries with no identifiable similarities (ORFans). Many checks are enforced in order to prevent the propagation of wrong annotation and to spot problematic cases, which are channelled to manual curation. The results of this annotation are integrated in SWISS-PROT, and a website is provided at http://www.expasy.org/sprot/hamap/. PMID:12798039

  6. EuCAP, a Eukaryotic Community Annotation Package, and its application to the rice genome

    OpenAIRE

    Hamilton John P; Campbell Matthew; Thibaud-Nissen Françoise; Zhu Wei; Buell C

    2007-01-01

    Abstract Background Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging k...

  7. NCBI prokaryotic genome annotation pipeline.

    Science.gov (United States)

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/. PMID:27342282

  8. Fish the ChIPs: a pipeline for automated genomic annotation of ChIP-Seq data

    Directory of Open Access Journals (Sweden)

    Minucci Saverio

    2011-10-01

    Full Text Available Abstract Background High-throughput sequencing is generating massive amounts of data at a pace that largely exceeds the throughput of data analysis routines. Here we introduce Fish the ChIPs (FC, a computational pipeline aimed at a broad public of users and designed to perform complete ChIP-Seq data analysis of an unlimited number of samples, thus increasing throughput, reproducibility and saving time. Results Starting from short read sequences, FC performs the following steps: 1 quality controls, 2 alignment to a reference genome, 3 peak calling, 4 genomic annotation, 5 generation of raw signal tracks for visualization on the UCSC and IGV genome browsers. FC exploits some of the fastest and most effective tools today available. Installation on a Mac platform requires very basic computational skills while configuration and usage are supported by a user-friendly graphic user interface. Alternatively, FC can be compiled from the source code on any Unix machine and then run with the possibility of customizing each single parameter through a simple configuration text file that can be generated using a dedicated user-friendly web-form. Considering the execution time, FC can be run on a desktop machine, even though the use of a computer cluster is recommended for analyses of large batches of data. FC is perfectly suited to work with data coming from Illumina Solexa Genome Analyzers or ABI SOLiD and its usage can potentially be extended to any sequencing platform. Conclusions Compared to existing tools, FC has two main advantages that make it suitable for a broad range of users. First of all, it can be installed and run by wet biologists on a Mac machine. Besides it can handle an unlimited number of samples, being convenient for large analyses. In this context, computational biologists can increase reproducibility of their ChIP-Seq data analyses while saving time for downstream analyses. Reviewers This article was reviewed by Gavin Huttley, George

  9. A Factor Graph Approach to Automated GO Annotation.

    Science.gov (United States)

    Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463

  10. Genome re-annotation: a wiki solution?

    OpenAIRE

    Salzberg, Steven L.

    2007-01-01

    The annotation of most genomes becomes outdated over time, owing in part to our ever-improving knowledge of genomes and in part to improvements in bioinformatics software. Unfortunately, annotation is rarely if ever updated and resources to support routine reannotation are scarce. Wiki software, which would allow many scientists to edit each genome's annotation, offers one possible solution.

  11. Automated Eukaryotic Gene Structure Annotation Using EVidenceModeler and the Program to Assemble Spliced Alignments

    Energy Technology Data Exchange (ETDEWEB)

    Haas, B J; Salzberg, S L; Zhu, W; Pertea, M; Allen, J E; Orvis, J; White, O; Buell, C R; Wortman, J R

    2007-12-10

    EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.

  12. Automating Ontological Annotation with WordNet

    Energy Technology Data Exchange (ETDEWEB)

    Sanfilippo, Antonio P.; Tratz, Stephen C.; Gregory, Michelle L.; Chappell, Alan R.; Whitney, Paul D.; Posse, Christian; Paulson, Patrick R.; Baddeley, Bob L.; Hohimer, Ryan E.; White, Amanda M.

    2006-01-22

    Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation. WordNet provides a potentially ideal solution to this problem as it offers a highly structured lexical conceptual representation that has been extensively used to develop word sense disambiguation algorithms. However, WordNet has not been designed as an ontology, and while it can be easily turned into one, the result of doing this would present users with serious practical limitations due to the great number of concepts (synonym sets) it contains. Moreover, mapping WordNet to an existing ontology may be difficult and requires substantial labor. We propose to overcome these limitations by developing an analytical platform that (1) provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes, (2) leverages the lexical richness of WordNet to give an extensive characterization of concept class in terms of lexical instances, and (3) integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text. The ensuing framework makes available an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.

  13. Automated analysis and annotation of basketball video

    Science.gov (United States)

    Saur, Drew D.; Tan, Yap-Peng; Kulkarni, Sanjeev R.; Ramadge, Peter J.

    1997-01-01

    Automated analysis and annotation of video sequences are important for digital video libraries, content-based video browsing and data mining projects. A successful video annotation system should provide users with useful video content summary in a reasonable processing time. Given the wide variety of video genres available today, automatically extracting meaningful video content for annotation still remains hard by using current available techniques. However, a wide range video has inherent structure such that some prior knowledge about the video content can be exploited to improve our understanding of the high-level video semantic content. In this paper, we develop tools and techniques for analyzing structured video by using the low-level information available directly from MPEG compressed video. Being able to work directly in the video compressed domain can greatly reduce the processing time and enhance storage efficiency. As a testbed, we have developed a basketball annotation system which combines the low-level information extracted from MPEG stream with the prior knowledge of basketball video structure to provide high level content analysis, annotation and browsing for events such as wide- angle and close-up views, fast breaks, steals, potential shots, number of possessions and possession times. We expect our approach can also be extended to structured video in other domains.

  14. Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

    Energy Technology Data Exchange (ETDEWEB)

    Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana; Purvine, Samuel O.; Sanford, James; Monroe, Matthew E.; Brewer, Heather M.; Payne, Samuel H.; Ansong, Charles; Frank, Bryan C.; Smith, Richard D.; Peterson, Scott; Motin, Vladimir L.; Adkins, Joshua N.

    2012-03-27

    Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent

  15. Comparative omics-driven genome annotation refinement: application across Yersiniae.

    Directory of Open Access Journals (Sweden)

    Alexandra C Schrimpe-Rutledge

    Full Text Available Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. The annotation process is now performed almost exclusively in an automated fashion to balance the large number of sequences generated. One possible way of reducing errors inherent to automated computational annotations is to apply data from omics measurements (i.e. transcriptional and proteomic to the un-annotated genome with a proteogenomic-based approach. Here, the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species. Transcriptomic and proteomic data derived from highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis Pestoides F, and Y. pseudotuberculosis PB1/+ was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 incorrect (i.e., observed frameshifts, extended start sites, and translated pseudogenes protein-coding sequences within the three current genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes, including the insertion-ablated argD, underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, a transcriptional regulator, and many hypothetical proteins that were missed during annotation.

  16. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine.

    Science.gov (United States)

    Elsik, Christine G; Tayal, Aditi; Diesh, Colin M; Unni, Deepak R; Emery, Marianne L; Nguyen, Hung N; Hagen, Darren E

    2016-01-01

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search. PMID:26578564

  17. Software for computing and annotating genomic ranges.

    Directory of Open Access Journals (Sweden)

    Michael Lawrence

    Full Text Available We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.

  18. EuCAP, a Eukaryotic Community Annotation Package, and its application to the rice genome

    Directory of Open Access Journals (Sweden)

    Hamilton John P

    2007-10-01

    Full Text Available Abstract Background Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate 1 the submission of gene annotation to an annotation project, 2 the review of the submitted models by project annotators, and 3 the incorporation of the submitted models in the ongoing annotation effort. Results We have developed the Eukaryotic Community Annotation Package (EuCAP, an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CA has been the annotation of gene families. Annotation can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website http://rice.tigr.org, as well as in the Community Annotation track of the Genome Browser. Conclusion We have applied EuCAP to rice. As of July 2007, the

  19. Restauro-G: A Rapid Genome Re-Annotation System for Comparative Genomics

    Institute of Scientific and Technical Information of China (English)

    Satoshi Tamaki; Kazuharu Arakawa; Nobuaki Kono; Masaru Tomita

    2007-01-01

    Annotations of complete genome sequences submitted directly from sequencing projects are diverse in terms of annotation strategies and update frequencies. These inconsistencies make comparative studies difficult. To allow rapid data preparation of a large number of complete genomes, automation and speed are important for genome re-annotation. Here we introduce an open-source rapid genome re-annotation software system, Restauro-G, specialized for bacterial genomes. Restauro-G re-annotates a genome by similarity searches utilizing the BLAST-Like Alignment Tool, referring to protein databases such as UniProt KB, NCBI nr, NCBI COGs, Pfam, and PSORTb. Re-annotation by Restauro-G achieved over 98% accuracy for most bacterial chromosomes in comparison with the original manually curated annotation of EMBL releases. Restauro-G was developed in the generic bioinformatics workbench G-language Genome Analysis Environment and is distributed at http://restauro-g.iab.keio.ac.jp/ under the GNU General Public License.

  20. JGI Plant Genomics Gene Annotation Pipeline

    Energy Technology Data Exchange (ETDEWEB)

    Shu, Shengqiang; Rokhsar, Dan; Goodstein, David; Hayes, David; Mitros, Therese

    2014-07-14

    Plant genomes vary in size and are highly complex with a high amount of repeats, genome duplication and tandem duplication. Gene encodes a wealth of information useful in studying organism and it is critical to have high quality and stable gene annotation. Thanks to advancement of sequencing technology, many plant species genomes have been sequenced and transcriptomes are also sequenced. To use these vastly large amounts of sequence data to make gene annotation or re-annotation in a timely fashion, an automatic pipeline is needed. JGI plant genomics gene annotation pipeline, called integrated gene call (IGC), is our effort toward this aim with aid of a RNA-seq transcriptome assembly pipeline. It utilizes several gene predictors based on homolog peptides and transcript ORFs. See Methods for detail. Here we present genome annotation of JGI flagship green plants produced by this pipeline plus Arabidopsis and rice except for chlamy which is done by a third party. The genome annotations of these species and others are used in our gene family build pipeline and accessible via JGI Phytozome portal whose URL and front page snapshot are shown below.

  1. The discrepancies in the results of bioinformatics tools for genomic structural annotation

    Science.gov (United States)

    Pawełkowicz, Magdalena; Nowak, Robert; Osipowski, Paweł; Rymuszka, Jacek; Świerkula, Katarzyna; Wojcieszek, Michał; Przybecki, Zbigniew

    2014-11-01

    A major focus of sequencing project is to identify genes in genomes. However it is necessary to define the variety of genes and the criteria for identifying them. In this work we present discrepancies and dependencies from the application of different bioinformatic programs for structural annotation performed on the cucumber data set from Polish Consortium of Cucumber Genome Sequencing. We use Fgenesh, GenScan and GeneMark to automated structural annotation, the results have been compared to reference annotation.

  2. Improving gene annotation of complete viral genomes

    OpenAIRE

    Mills, Ryan; Rozanov, Michael; Lomsadze, Alexandre; Tatusova, Tatiana; Borodovsky, Mark

    2003-01-01

    Gene annotation in viruses often relies upon similarity search methods. These methods possess high specificity but some genes may be missed, either those unique to a particular genome or those highly divergent from known homologs. To identify potentially missing viral genes we have analyzed all complete viral genomes currently available in GenBank with a specialized and augmented version of the gene finding program GeneMarkS. In particular, by implementing genome-specific self-training protoc...

  3. Challenges in Whole-Genome Annotation of Pyrosequenced Eukaryotic Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Kuo, Alan; Grigoriev, Igor

    2009-04-17

    Pyrosequencing technologies such as 454/Roche and Solexa/Illumina vastly lower the cost of nucleotide sequencing compared to the traditional Sanger method, and thus promise to greatly expand the number of sequenced eukaryotic genomes. However, the new technologies also bring new challenges such as shorter reads and new kinds and higher rates of sequencing errors, which complicate genome assembly and gene prediction. At JGI we are deploying 454 technology for the sequencing and assembly of ever-larger eukaryotic genomes. Here we describe our first whole-genome annotation of a purely 454-sequenced fungal genome that is larger than a yeast (>30 Mbp). The pezizomycotine (filamentous ascomycote) Aspergillus carbonarius belongs to the Aspergillus section Nigri species complex, members of which are significant as platforms for bioenergy and bioindustrial technology, as members of soil microbial communities and players in the global carbon cycle, and as agricultural toxigens. Application of a modified version of the standard JGI Annotation Pipeline has so far predicted ~;;10k genes. ~;;12percent of these preliminary annotations suffer a potential frameshift error, which is somewhat higher than the ~;;9percent rate in the Sanger-sequenced and conventionally assembled and annotated genome of fellow Aspergillus section Nigri member A. niger. Also,>90percent of A. niger genes have potential homologs in the A. carbonarius preliminary annotation. Weconclude, and with further annotation and comparative analysis expect to confirm, that 454 sequencing strategies provide a promising substrate for annotation of modestly sized eukaryotic genomes. We will also present results of annotation of a number of other pyrosequenced fungal genomes of bioenergy interest.

  4. AGeS: A Software System for Microbial Genome Sequence Annotation

    Science.gov (United States)

    Kumar, Kamal; Desai, Valmik; Cheng, Li; Khitrov, Maxim; Grover, Deepak; Satya, Ravi Vijaya; Yu, Chenggang; Zavaljevski, Nela; Reifman, Jaques

    2011-01-01

    Background The annotation of genomes from next-generation sequencing platforms needs to be rapid, high-throughput, and fully integrated and automated. Although a few Web-based annotation services have recently become available, they may not be the best solution for researchers that need to annotate a large number of genomes, possibly including proprietary data, and store them locally for further analysis. To address this need, we developed a standalone software application, the Annotation of microbial Genome Sequences (AGeS) system, which incorporates publicly available and in-house-developed bioinformatics tools and databases, many of which are parallelized for high-throughput performance. Methodology The AGeS system supports three main capabilities. The first is the storage of input contig sequences and the resulting annotation data in a central, customized database. The second is the annotation of microbial genomes using an integrated software pipeline, which first analyzes contigs from high-throughput sequencing by locating genomic regions that code for proteins, RNA, and other genomic elements through the Do-It-Yourself Annotation (DIYA) framework. The identified protein-coding regions are then functionally annotated using the in-house-developed Pipeline for Protein Annotation (PIPA). The third capability is the visualization of annotated sequences using GBrowse. To date, we have implemented these capabilities for bacterial genomes. AGeS was evaluated by comparing its genome annotations with those provided by three other methods. Our results indicate that the software tools integrated into AGeS provide annotations that are in general agreement with those provided by the compared methods. This is demonstrated by a >94% overlap in the number of identified genes, a significant number of identical annotated features, and a >90% agreement in enzyme function predictions. PMID:21408217

  5. Fuzzy Emotional Semantic Analysis and Automated Annotation of Scene Images

    Directory of Open Access Journals (Sweden)

    Jianfang Cao

    2015-01-01

    Full Text Available With the advances in electronic and imaging techniques, the production of digital images has rapidly increased, and the extraction and automated annotation of emotional semantics implied by images have become issues that must be urgently addressed. To better simulate human subjectivity and ambiguity for understanding scene images, the current study proposes an emotional semantic annotation method for scene images based on fuzzy set theory. A fuzzy membership degree was calculated to describe the emotional degree of a scene image and was implemented using the Adaboost algorithm and a back-propagation (BP neural network. The automated annotation method was trained and tested using scene images from the SUN Database. The annotation results were then compared with those based on artificial annotation. Our method showed an annotation accuracy rate of 91.2% for basic emotional values and 82.4% after extended emotional values were added, which correspond to increases of 5.5% and 8.9%, respectively, compared with the results from using a single BP neural network algorithm. Furthermore, the retrieval accuracy rate based on our method reached approximately 89%. This study attempts to lay a solid foundation for the automated emotional semantic annotation of more types of images and therefore is of practical significance.

  6. Fuzzy emotional semantic analysis and automated annotation of scene images.

    Science.gov (United States)

    Cao, Jianfang; Chen, Lichao

    2015-01-01

    With the advances in electronic and imaging techniques, the production of digital images has rapidly increased, and the extraction and automated annotation of emotional semantics implied by images have become issues that must be urgently addressed. To better simulate human subjectivity and ambiguity for understanding scene images, the current study proposes an emotional semantic annotation method for scene images based on fuzzy set theory. A fuzzy membership degree was calculated to describe the emotional degree of a scene image and was implemented using the Adaboost algorithm and a back-propagation (BP) neural network. The automated annotation method was trained and tested using scene images from the SUN Database. The annotation results were then compared with those based on artificial annotation. Our method showed an annotation accuracy rate of 91.2% for basic emotional values and 82.4% after extended emotional values were added, which correspond to increases of 5.5% and 8.9%, respectively, compared with the results from using a single BP neural network algorithm. Furthermore, the retrieval accuracy rate based on our method reached approximately 89%. This study attempts to lay a solid foundation for the automated emotional semantic annotation of more types of images and therefore is of practical significance. PMID:25838818

  7. Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation.

    Science.gov (United States)

    Beijbom, Oscar; Edmunds, Peter J; Roelfsema, Chris; Smith, Jennifer; Kline, David I; Neal, Benjamin P; Dunlap, Matthew J; Moriarty, Vincent; Fan, Tung-Yung; Tan, Chih-Jui; Chan, Stephen; Treibitz, Tali; Gamst, Anthony; Mitchell, B Greg; Kriegman, David

    2015-01-01

    Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images captured at four Pacific coral reefs. Inter- and intra- annotator variability among six human experts was quantified and compared to semi- and fully- automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys. PMID:26154157

  8. Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation.

    Directory of Open Access Journals (Sweden)

    Oscar Beijbom

    Full Text Available Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images captured at four Pacific coral reefs. Inter- and intra- annotator variability among six human experts was quantified and compared to semi- and fully- automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys.

  9. Applied bioinformatics: Genome annotation and transcriptome analysis

    DEFF Research Database (Denmark)

    Gupta, Vikas

    Next generation sequencing (NGS) has revolutionized the field of genomics and its wide range of applications has resulted in the genome-wide analysis of hundreds of species and the development of thousands of computational tools. This thesis represents my work on NGS analysis of four species, Lotus...... japonicus (Lotus), Vaccinium corymbosum (blueberry), Stegodyphus mimosarum (spider) and Trifolium occidentale (clover). From a bioinformatics data analysis perspective, my work can be divided into three parts; genome annotation, small RNA, and gene expression analysis. Lotus is a legume of significant...... agricultural and biological importance. Its capacity to form symbiotic relationships with rhizobia and microrrhizal fungi has fascinated researchers for years. Lotus has a small genome of approximately 470 Mb and a short life cycle of 2 to 3 months, which has made Lotus a model legume plant for many molecular...

  10. An automated system designed for large scale NMR data deposition and annotation: application to over 600 assigned chemical shift data entries to the BioMagResBank from the Riken Structural Genomics/Proteomics Initiative internal database

    International Nuclear Information System (INIS)

    Biomolecular NMR chemical shift data are key information for the functional analysis of biomolecules and the development of new techniques for NMR studies utilizing chemical shift statistical information. Structural genomics projects are major contributors to the accumulation of protein chemical shift information. The management of the large quantities of NMR data generated by each project in a local database and the transfer of the data to the public databases are still formidable tasks because of the complicated nature of NMR data. Here we report an automated and efficient system developed for the deposition and annotation of a large number of data sets including 1H, 13C and 15N resonance assignments used for the structure determination of proteins. We have demonstrated the feasibility of our system by applying it to over 600 entries from the internal database generated by the RIKEN Structural Genomics/Proteomics Initiative (RSGI) to the public database, BioMagResBank (BMRB). We have assessed the quality of the deposited chemical shifts by comparing them with those predicted from the PDB coordinate entry for the corresponding protein. The same comparison for other matched BMRB/PDB entries deposited from 2001–2011 has been carried out and the results suggest that the RSGI entries greatly improved the quality of the BMRB database. Since the entries include chemical shifts acquired under strikingly similar experimental conditions, these NMR data can be expected to be a promising resource to improve current technologies as well as to develop new NMR methods for protein studies.

  11. Bioinformatics Assisted Gene Discovery and Annotation of Human Genome

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    As the sequencing stage of human genome project is near the end, the work has begun for discovering novel genes from genome sequences and annotating their biological functions. Here are reviewed current major bioinformatics tools and technologies available for large scale gene discovery and annotation from human genome sequences. Some ideas about possible future development are also provided.

  12. Descartes' fly: the geometry of genomic annotation.

    Science.gov (United States)

    Kim, J

    2001-03-01

    The completion of the Drosophila melanogaster genome marks another significant milestone in the growth of sequence information. But it also contributes to the ever-widening gap between sequence information and biological knowledge. One important approach to reducing this gap is theoretical inference through computational technologies. Many computer programs have been designed to annotate genomic sequence information with biologically relevant information. Here, I suggest that all of these methods have a common structure in which the sequence fragments are "coordinated" by some method of description such as Hidden Markov models. The key to the algorithms lies in constructing the most efficient set of coordinates that allow extrapolation and interpolation from existing knowledge. Efficient extrapolation and interpolation are produced if the sequence fragments acquire a natural geometrical structure in the coordinated description. Finding such a coordinate frame is an inductive problem with no algorithmic solution. The greater part of the problem of genomic annotation lies in biological modeling of the data rather than in algorithmic improvements. PMID:11793243

  13. Close Clustering Based Automated Color Image Annotation

    CERN Document Server

    Garg, Ankit; Asawa, Krishna

    2010-01-01

    Most image-search approaches today are based on the text based tags associated with the images which are mostly human generated and are subject to various kinds of errors. The results of a query to the image database thus can often be misleading and may not satisfy the requirements of the user. In this work we propose our approach to automate this tagging process of images, where image results generated can be fine filtered based on a probabilistic tagging mechanism. We implement a tool which helps to automate the tagging process by maintaining a training database, wherein the system is trained to identify certain set of input images, the results generated from which are used to create a probabilistic tagging mechanism. Given a certain set of segments in an image it calculates the probability of presence of particular keywords. This probability table is further used to generate the candidate tags for input images.

  14. A Human-Curated Annotation of the Candida albicans Genome.

    Directory of Open Access Journals (Sweden)

    2005-07-01

    Full Text Available Recent sequencing and assembly of the genome for the fungal pathogen Candida albicans used simple automated procedures for the identification of putative genes. We have reviewed the entire assembly, both by hand and with additional bioinformatic resources, to accurately map and describe 6,354 genes and to identify 246 genes whose original database entries contained sequencing errors (or possibly mutations that affect their reading frame. Comparison with other fungal genomes permitted the identification of numerous fungus-specific genes that might be targeted for antifungal therapy. We also observed that, compared to other fungi, the protein-coding sequences in the C. albicans genome are especially rich in short sequence repeats. Finally, our improved annotation permitted a detailed analysis of several multigene families, and comparative genomic studies showed that C. albicans has a far greater catabolic range, encoding respiratory Complex 1, several novel oxidoreductases and ketone body degrading enzymes, malonyl-CoA and enoyl-CoA carriers, several novel amino acid degrading enzymes, a variety of secreted catabolic lipases and proteases, and numerous transporters to assimilate the resulting nutrients. The results of these efforts will ensure that the Candida research community has uniform and comprehensive genomic information for medical research as well as for future diagnostic and therapeutic applications.

  15. DIYA: a bacterial annotation pipeline for any genomics lab

    OpenAIRE

    Stewart, Andrew C.; Osborne, Brian; Read, Timothy D

    2009-01-01

    Summary:DIYA (Do-It-Yourself Annotator) is a modular and configurable open source pipeline software, written in Perl, used for the rapid annotation of bacterial genome sequences. The software is currently used to take DNA contigs as input, either in the form of complete genomes or the result of shotgun sequencing, and produce an annotated sequence in Genbank file format as output. Availability: Distribution and source code are available at (https://sourceforge.net/projects/diyg/). Contact: tr...

  16. TOPSAN: a collaborative annotation environment for structural genomics

    Directory of Open Access Journals (Sweden)

    Weekes Dana

    2010-08-01

    Full Text Available Abstract Background Many protein structures determined in high-throughput structural genomics centers, despite their significant novelty and importance, are available only as PDB depositions and are not accompanied by a peer-reviewed manuscript. Because of this they are not accessible by the standard tools of literature searches, remaining underutilized by the broad biological community. Results To address this issue we have developed TOPSAN, The Open Protein Structure Annotation Network, a web-based platform that combines the openness of the wiki model with the quality control of scientific communication. TOPSAN enables research collaborations and scientific dialogue among globally distributed participants, the results of which are reviewed by experts and eventually validated by peer review. The immediate goal of TOPSAN is to harness the combined experience, knowledge, and data from such collaborations in order to enhance the impact of the astonishing number and diversity of structures being determined by structural genomics centers and high-throughput structural biology. Conclusions TOPSAN combines features of automated annotation databases and formal, peer-reviewed scientific research literature, providing an ideal vehicle to bridge a gap between rapidly accumulating data from high-throughput technologies and a much slower pace for its analysis and integration with other, relevant research.

  17. BG7: a new approach for bacterial genome annotation designed for next generation sequencing data.

    Directory of Open Access Journals (Sweden)

    Pablo Pareja-Tobes

    Full Text Available BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even in the step of preliminary assembly of the genome. It is especially efficient detecting unexpected genes horizontally acquired from bacterial or archaeal distant genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in not fully assembled genomes. BG7 current version - which is developed in Java, takes advantage of Amazon Web Services (AWS cloud computing features, but it can also be run locally in any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge amount of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak in which BG7 was used to get the first annotations right the next day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Besides, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future.

  18. BG7: a new approach for bacterial genome annotation designed for next generation sequencing data.

    Science.gov (United States)

    Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel

    2012-01-01

    BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even in the step of preliminary assembly of the genome. It is especially efficient detecting unexpected genes horizontally acquired from bacterial or archaeal distant genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in not fully assembled genomes. BG7 current version - which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally in any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge amount of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak in which BG7 was used to get the first annotations right the next day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Besides, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future. PMID:23185310

  19. Close Clustering Based Automated Color Image Annotation

    OpenAIRE

    Garg, Ankit; Dwivedi, Rahul; Asawa, Krishna

    2010-01-01

    Most image-search approaches today are based on the text based tags associated with the images which are mostly human generated and are subject to various kinds of errors. The results of a query to the image database thus can often be misleading and may not satisfy the requirements of the user. In this work we propose our approach to automate this tagging process of images, where image results generated can be fine filtered based on a probabilistic tagging mechanism. We implement a tool which...

  20. Improving microbial genome annotations in an integrated database context.

    Directory of Open Access Journals (Sweden)

    I-Min A Chen

    Full Text Available Effective comparative analysis of microbial genomes requires a consistent and complete view of biological data. Consistency regards the biological coherence of annotations, while completeness regards the extent and coverage of functional characterization for genomes. We have developed tools that allow scientists to assess and improve the consistency and completeness of microbial genome annotations in the context of the Integrated Microbial Genomes (IMG family of systems. All publicly available microbial genomes are characterized in IMG using different functional annotation and pathway resources, thus providing a comprehensive framework for identifying and resolving annotation discrepancies. A rule based system for predicting phenotypes in IMG provides a powerful mechanism for validating functional annotations, whereby the phenotypic traits of an organism are inferred based on the presence of certain metabolic reactions and pathways and compared to experimentally observed phenotypes. The IMG family of systems are available at http://img.jgi.doe.gov/.

  1. Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation

    OpenAIRE

    Oscar Beijbom; Edmunds, Peter J.; Chris Roelfsema; Jennifer Smith; Kline, David I.; Neal, Benjamin P.; Matthew J Dunlap; Vincent Moriarty; Tung-Yung Fan; Chih-Jui Tan; Stephen Chan; Tali Treibitz; Anthony Gamst; B. Greg Mitchell; David Kriegman

    2015-01-01

    Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images capture...

  2. Annotation of the protein coding regions of the equine genome

    DEFF Research Database (Denmark)

    Hestand, Matthew S.; Kalbfleisch, Theodore S.; Coleman, Stephen J.; Zeng, Zheng; Liu, Jinze; Orlando, Ludovic Antoine Alexandre; MacLeod, James N.

    2015-01-01

    Current gene annotation of the horse genome is largely derived from in silico predictions and cross-species alignments. Only a small number of genes are annotated based on equine EST and mRNA sequences. To expand the number of equine genes annotated from equine experimental evidence, we sequenced m...... appear to be small errors in the equine reference genome, since they are also identified as homozygous variants by genomic DNA resequencing of the reference horse. Taken together, we provide a resource of equine mRNA structures and protein coding variants that will enhance equine and cross...

  3. A proteogenomic update to Yersinia: enhancing genome annotation

    Directory of Open Access Journals (Sweden)

    Huang Shih-Ting

    2010-08-01

    Full Text Available Abstract Background Modern biomedical research depends on a complete and accurate proteome. With the widespread adoption of new sequencing technologies, genome sequences are generated at a near exponential rate, diminishing the time and effort that can be invested in genome annotation. The resulting gene set contains numerous errors in even the most basic form of annotation: the primary structure of the proteins. Results The application of experimental proteomics data to genome annotation, called proteogenomics, can quickly and efficiently discover misannotations, yielding a more accurate and complete genome annotation. We present a comprehensive proteogenomic analysis of the plague bacterium, Yersinia pestis KIM. We discover non-annotated genes, correct protein boundaries, remove spuriously annotated ORFs, and make major advances towards accurate identification of signal peptides. Finally, we apply our data to 21 other Yersinia genomes, correcting and enhancing their annotations. Conclusions In total, 141 gene models were altered and have been updated in RefSeq and Genbank, which can be accessed seamlessly through any NCBI tool (e.g. blast or downloaded directly. Along with the improved gene models we discover new, more accurate means of identifying signal peptides in proteomics data.

  4. Automation and Validation of Annotation for Hindi Anaphora Resolution

    Directory of Open Access Journals (Sweden)

    Pardeep Singh

    2015-10-01

    Full Text Available The process of labelling any language genre by which one can extract useful information is called annotation. This provides syntactic information about a word or a word phrase. In this paper, an effort has been made to provide the algorithm for semiautomatic annotation for Hindi text to cater anaphora resolution only. The study was conducted on twelve files of Ranchi Express available in EMILLE corpus. The corpus is originally tagged for demonstrative pronouns. The detection of the pronouns is supported by the incorporation of seven tags. However the semantic interpretation of the demonstrative pronoun is not supported in the original corpus. In this paper an effort has been made to automate the process of tagging as well as the handling of semantic information through addition tags. It was conducted on 1485 demonstrative pronouns. The average accuracy of precision, recall and F measure is 74, 71 and 72 respectively.

  5. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome

    Directory of Open Access Journals (Sweden)

    Childs Kevin L

    2010-11-01

    Full Text Available Abstract Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence.

  6. SNP annotation-based whole genomic prediction and selection

    DEFF Research Database (Denmark)

    Do, Duy Ngoc; Janss, Luc; Jensen, Just;

    2015-01-01

    into a training (968 pigs) and a validation dataset (304 pigs) by assigning records as before and after January 1, 2012, respectively. SNP were annotated by 14 different classes using Ensembl variant effect prediction. Predictive accuracy and prediction bias were calculated using Bayesian Power LASSO...... SNP to total genomic variance was similar among annotated classes across different traits. Predictive performance of SNP classes did not significantly differ from randomized SNP groups. Genomic prediction has accuracy comparable to observed phenotype, and use of genomic prediction can be cost...... effective by replacing feed intake measurement. Genomic annotation had less impact on predictive accuracy traits considered here but may be different for other traits. It is the first study to provide useful insights into biological classes of SNP driving the whole genomic prediction for complex traits in...

  7. Curation of the genome annotation of Pichia pastoris (Komagataella phaffii) CBS7435 from gene level to protein function.

    Science.gov (United States)

    Valli, Minoska; Tatto, Nadine E; Peymann, Armin; Gruber, Clemens; Landes, Nils; Ekker, Heinz; Thallinger, Gerhard G; Mattanovich, Diethard; Gasser, Brigitte; Graf, Alexandra B

    2016-09-01

    As manually curated and non-automated BLAST analysis of the published Pichia pastoris genome sequences revealed many differences between the gene annotations of the strains GS115 and CBS7435, RNA-Seq analysis, supported by proteomics, was performed to improve the genome annotation. Detailed analysis of sequence alignment and protein domain predictions were made to extend the functional genome annotation to all P. pastoris sequences. This allowed the identification of 492 new ORFs, 4916 hypothetical UTRs and the correction of 341 incorrect ORF predictions, which were mainly due to the presence of upstream ATG or erroneous intron predictions. Moreover, 175 previously erroneously annotated ORFs need to be removed from the annotation. In total, we have annotated 5325 ORFs. Regarding the functionality of those genes, we improved all gene and protein descriptions. Thereby, the percentage of ORFs with functional annotation was increased from 48% to 73%. Furthermore, we defined functional groups, covering 25 biological cellular processes of interest, by grouping all genes that are part of the defined process. All data are presented in the newly launched genome browser and database available at www.pichiagenome.org In summary, we present a wide spectrum of curation of the P. pastoris genome annotation from gene level to protein function. PMID:27388471

  8. Rapid reconstruction and annotation of vaccinium macrocarpon organellar genomes

    Science.gov (United States)

    The mitochondria genome of the American cranberry (Vaccinium macrocarpon Ait.) cultivar “HyRed” is being reconstructed using sequence 454 shotgun data and in silico procedures. We previously reconstructed and annotated a putative plastid cranberry genome with an approximate length of 176kb. The reli...

  9. Combined evidence annotation of transposable elements in genome sequences.

    Directory of Open Access Journals (Sweden)

    Hadi Quesneville

    2005-07-01

    Full Text Available Transposable elements (TEs are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced from combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated "TE models" in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1, and we found a substantially higher number of TEs (n = 6,013 than previously identified (n = 1,572. Most of the new TEs derive from small fragments of a few hundred nucleotides long and highly abundant families not previously annotated (e.g., INE-1. We also estimated that 518 TE copies (8.6% are inserted into at least one other TE, forming a nest of elements. The pipeline allows rapid and thorough annotation of even the most complex TE models, including highly deleted and/or nested elements such as those often found in heterochromatic sequences. Our pipeline can be easily adapted to other genome sequences, such as those of the D. melanogaster heterochromatin or other

  10. Computational evaluation of TIS annotation for prokaryotic genomes

    OpenAIRE

    Zhu Huaiqiu; Ju Li-Ning; Zheng Xiaobin; Hu Gang-Qing; She Zhen-Su

    2008-01-01

    Abstract Background Accurate annotation of translation initiation sites (TISs) is essential for understanding the translation initiation mechanism. However, the reliability of TIS annotation in widely used databases such as RefSeq is uncertain due to the lack of experimental benchmarks. Results Based on a homogeneity assumption that gene translation-related signals are uniformly distributed across a genome, we have established a computational method for a large-scale quantitative assessment o...

  11. Genome sequencing and annotation of Serratia sp. strain TEL.

    Science.gov (United States)

    Lephoto, Tiisetso E; Gray, Vincent M

    2015-12-01

    We present the annotation of the draft genome sequence of Serratia sp. strain TEL (GenBank accession number KP711410). This organism was isolated from entomopathogenic nematode Oscheius sp. strain TEL (GenBank accession number KM492926) collected from grassland soil and has a genome size of 5,000,541 bp and 542 subsystems. The genome sequence can be accessed at DDBJ/EMBL/GenBank under the accession number LDEG00000000. PMID:26697332

  12. Genome annotation of a Saccharomyces sp. lager brewer's yeast.

    Science.gov (United States)

    De León-Medina, Patricia Marcela; Elizondo-González, Ramiro; Damas-Buenrostro, Luis Cástulo; Geertman, Jan-Maarten; Van den Broek, Marcel; Galán-Wong, Luis Jesús; Ortiz-López, Rocío; Pereyra-Alférez, Benito

    2016-09-01

    The genome of lager brewer's yeast is a hybrid, with Saccharomyces eubayanus and Saccharomyces cerevisiae as sub-genomes. Due to their specific use in the beer industry, relatively little information is available. The genome of brewing yeast was sequenced and annotated in this study. We obtained a genome size of 22.7 Mbp that consisted of 133 scaffolds, with 65 scaffolds larger than 10 kbp. With respect to the annotation, 9939 genes were obtained, and when they were submitted to a local alignment, we found that 53.93% of these genes corresponded to S. cerevisiae, while another 42.86% originated from S. eubayanus. Our results confirm that our strain is a hybrid of at least two different genomes. PMID:27330999

  13. Analysis of high-throughput sequencing and annotation strategies for phage genomes.

    Directory of Open Access Journals (Sweden)

    Matthew R Henn

    Full Text Available BACKGROUND: Bacterial viruses (phages play a critical role in shaping microbial populations as they influence both host mortality and horizontal gene transfer. As such, they have a significant impact on local and global ecosystem function and human health. Despite their importance, little is known about the genomic diversity harbored in phages, as methods to capture complete phage genomes have been hampered by the lack of knowledge about the target genomes, and difficulties in generating sufficient quantities of genomic DNA for sequencing. Of the approximately 550 phage genomes currently available in the public domain, fewer than 5% are marine phage. METHODOLOGY/PRINCIPAL FINDINGS: To advance the study of phage biology through comparative genomic approaches we used marine cyanophage as a model system. We compared DNA preparation methodologies (DNA extraction directly from either phage lysates or CsCl purified phage particles, and sequencing strategies that utilize either Sanger sequencing of a linker amplification shotgun library (LASL or of a whole genome shotgun library (WGSL, or 454 pyrosequencing methods. We demonstrate that genomic DNA sample preparation directly from a phage lysate, combined with 454 pyrosequencing, is best suited for phage genome sequencing at scale, as this method is capable of capturing complete continuous genomes with high accuracy. In addition, we describe an automated annotation informatics pipeline that delivers high-quality annotation and yields few false positives and negatives in ORF calling. CONCLUSIONS/SIGNIFICANCE: These DNA preparation, sequencing and annotation strategies enable a high-throughput approach to the burgeoning field of phage genomics.

  14. Apollo2Go: a web service adapter for the Apollo genome viewer to enable distributed genome annotation

    OpenAIRE

    Mayer Klaus FX; Spannagl Manuel; Ernst Rebecca; Klee Kathrin

    2007-01-01

    Abstract Background Apollo, a genome annotation viewer and editor, has become a widely used genome annotation and visualization tool for distributed genome annotation projects. When using Apollo for annotation, database updates are carried out by uploading intermediate annotation files into the respective database. This non-direct database upload is laborious and evokes problems of data synchronicity. Results To overcome these limitations we extended the Apollo data adapter with a generic, co...

  15. Intra-species sequence comparisons for annotating genomes

    Energy Technology Data Exchange (ETDEWEB)

    Boffelli, Dario; Weer, Claire V.; Weng, Li; Lewis, Keith D.; Shoukry, Malak I.; Pachter, Lior; Keys, David N.; Rubin, Edward M.

    2004-07-15

    Analysis of sequence variation among members of a single species offers a potential approach to identify functional DNA elements responsible for biological features unique to that species. Due to its high rate of allelic polymorphism and ease of genetic manipulability, we chose the sea squirt, Ciona intestinalis, to explore intra-species sequence comparisons for genome annotation. A large number of C. intestinalis specimens were collected from four continents and a set of genomic intervals amplified, resequenced and analyzed to determine the mutation rates at each nucleotide in the sequence. We found that regions with low mutation rates efficiently demarcated functionally constrained sequences: these include a set of noncoding elements, which we showed in C intestinalis transgenic assays to act as tissue-specific enhancers, as well as the location of coding sequences. This illustrates that comparisons of multiple members of a species can be used for genome annotation, suggesting a path for the annotation of the sequenced genomes of organisms occupying uncharacterized phylogenetic branches of the animal kingdom and raises the possibility that the resequencing of a large number of Homo sapiens individuals might be used to annotate the human genome and identify sequences defining traits unique to our species. The sequence data from this study has been submitted to GenBank under accession nos. AY667278-AY667407.

  16. GENCODE: the reference human genome annotation for The ENCODE Project.

    Science.gov (United States)

    Harrow, Jennifer; Frankish, Adam; Gonzalez, Jose M; Tapanari, Electra; Diekhans, Mark; Kokocinski, Felix; Aken, Bronwen L; Barrell, Daniel; Zadissa, Amonida; Searle, Stephen; Barnes, If; Bignell, Alexandra; Boychenko, Veronika; Hunt, Toby; Kay, Mike; Mukherjee, Gaurab; Rajan, Jeena; Despacio-Reyes, Gloria; Saunders, Gary; Steward, Charles; Harte, Rachel; Lin, Michael; Howald, Cédric; Tanzer, Andrea; Derrien, Thomas; Chrast, Jacqueline; Walters, Nathalie; Balasubramanian, Suganthi; Pei, Baikang; Tress, Michael; Rodriguez, Jose Manuel; Ezkurdia, Iakes; van Baren, Jeltje; Brent, Michael; Haussler, David; Kellis, Manolis; Valencia, Alfonso; Reymond, Alexandre; Gerstein, Mark; Guigó, Roderic; Hubbard, Tim J

    2012-09-01

    The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers. PMID:22955987

  17. GANESH: Software for Customized Annotation of Genome Regions

    OpenAIRE

    Huntley, Derek; Hummerich, Holger; Smedley, Damian; Kittivoravitkul, Sasivimol; McCarthy, Mark; Little, Peter; Sergot, Marek

    2003-01-01

    GANESH is a software package designed to support the genetic analysis of regions of human and other genomes. It provides a set of components that may be assembled to construct a self-updating database of DNA sequence, mapping data, and annotations of possible genome features. Once one or more remote sources of data for the target region have been identified, all sequences for that region are downloaded, assimilated, and subjected to a (configurable) set of standard database-searching an...

  18. Genome Annotation in a Community College Cell Biology Lab

    Science.gov (United States)

    Beagley, C. Timothy

    2013-01-01

    The Biology Department at Salt Lake Community College has used the IMG-ACT toolbox to introduce a genome mapping and annotation exercise into the laboratory portion of its Cell Biology course. This project provides students with an authentic inquiry-based learning experience while introducing them to computational biology and contemporary learning…

  19. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals.

    Science.gov (United States)

    Nikolaichik, Yevgeny; Damienikan, Aliaksandr U

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a 'gene by gene' approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn't fit with regulatory information allowed us to correct product and gene names for over 300 loci. PMID:27257541

  20. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals

    Science.gov (United States)

    Damienikan, Aliaksandr U.

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci. PMID:27257541

  1. Annotation of the Clostridium Acetobutylicum Genome

    Energy Technology Data Exchange (ETDEWEB)

    Daly, M. J.

    2004-06-09

    The genome sequence of the solvent producing bacterium Clostridium acetobutylicum ATCC824, has been determined by the shotgun approach. The genome consists of a 3.94 Mb chromosome and a 192 kb megaplasmid that contains the majority of genes responsible for solvent production. Comparison of C. acetobutylicum to Bacillus subtilis reveals significant local conservation of gene order, which has not been seen in comparisons of other genomes with similar, or, in some cases, closer, phylogenetic proximity. This conservation allows the prediction of many previously undetected operons in both bacteria.

  2. Annotation-Based Whole Genomic Prediction and Selection

    DEFF Research Database (Denmark)

    Kadarmideen, Haja; Do, Duy Ngoc; Janss, Luc;

    using the BayesCπ method and applied to 1,272 Duroc pigs with both genotypic and phenotypic records including residual (RFI) and daily feed intake (DFI), average daily gain (ADG) and back fat (BF)). Records were split into a training (968 pigs) and a validation dataset (304 pigs). SNPs were annotated by...... 14 different classes. Predictive accuracy was 0.531, 0.532, 0.302, and 0.344 for DFI, RFI, ADG and BF, respectively. The contribution per SNP to total genomic variance was similar among annotated classes across different traits. Predictive performance of SNP classes did not significantly differ from...

  3. An Improved microRNA Annotation of the Canine Genome

    Science.gov (United States)

    Swofford, Ross; Johnson, Jeremy; Alföldi, Jessica; Lindblad-Toh, Kerstin; Swarbreck, David; Moxon, Simon; Di Palma, Federica

    2016-01-01

    The domestic dog, Canis familiaris, is a valuable model for studying human diseases. The publication of the latest Canine genome build and annotation, CanFam3.1 provides an opportunity to enhance our understanding of gene regulation across tissues in the dog model system. In this study, we used the latest dog genome assembly and small RNA sequencing data from 9 different dog tissues to predict novel miRNAs in the dog genome, as well as to annotate conserved miRNAs from the miRBase database that were missing from the current dog annotation. We used both miRCat and miRDeep2 algorithms to computationally predict miRNA loci. The resulting, putative hairpin sequences were analysed in order to discard false positives, based on predicted secondary structures and patterns of small RNA read alignments. Results were further divided into high and low confidence miRNAs, using the same criteria. We generated tissue specific expression profiles for the resulting set of 811 loci: 720 conserved miRNAs, (207 of which had not been previously annotated in the dog genome) and 91 novel miRNA loci. Comparative analyses revealed 8 putative homologues of some novel miRNA in ferret, and one in microbat. All miRNAs were also classified into the genic and intergenic categories, based on the Ensembl RefSeq gene annotation for CanFam3.1. This additionally allowed us to identify four previously undescribed MiRtrons among our total set of miRNAs. We additionally annotated piRNAs, using proTRAC on the same input data. We thus identified 263 putative clusters, most of which (211 clusters) were found to be expressed in testis. Our results represent an important improvement of the dog genome annotation, paving the way to further research on the evolution of gene regulation, as well as on the contribution of post-transcriptional regulation to pathological conditions. PMID:27119849

  4. Annotation of selection strengths in viral genomes

    DEFF Research Database (Denmark)

    McCauley, Stephen; de Groot, Saskia; Mailund, Thomas;

    2007-01-01

    overlapping reading frames. We introduce an evolutionary model capable of accounting for varying levels of selection along the genome, and incorporate it into our prior single sequence HMM methodology, extending it now to a phylogenetic HMM. Given an alignment of several homologous viruses to a reference...

  5. cDNA2Genome: A tool for mapping and annotating cDNAs

    Directory of Open Access Journals (Sweden)

    Suhai Sandor

    2003-09-01

    Full Text Available Abstract Background In the last years several high-throughput cDNA sequencing projects have been funded worldwide with the aim of identifying and characterizing the structure of complete novel human transcripts. However some of these cDNAs are error prone due to frameshifts and stop codon errors caused by low sequence quality, or to cloning of truncated inserts, among other reasons. Therefore, accurate CDS prediction from these sequences first require the identification of potentially problematic cDNAs in order to speed up the posterior annotation process. Results cDNA2Genome is an application for the automatic high-throughput mapping and characterization of cDNAs. It utilizes current annotation data and the most up to date databases, especially in the case of ESTs and mRNAs in conjunction with a vast number of approaches to gene prediction in order to perform a comprehensive assessment of the cDNA exon-intron structure. The final result of cDNA2Genome is an XML file containing all relevant information obtained in the process. This XML output can easily be used for further analysis such us program pipelines, or the integration of results into databases. The web interface to cDNA2Genome also presents this data in HTML, where the annotation is additionally shown in a graphical form. cDNA2Genome has been implemented under the W3H task framework which allows the combination of bioinformatics tools in tailor-made analysis task flows as well as the sequential or parallel computation of many sequences for large-scale analysis. Conclusions cDNA2Genome represents a new versatile and easily extensible approach to the automated mapping and annotation of human cDNAs. The underlying approach allows sequential or parallel computation of sequences for high-throughput analysis of cDNAs.

  6. Aspekte der bioinformatischen Analyse und Annotation des Genoms von Rhodopirellula baltica

    OpenAIRE

    Teeling, Hanno

    2004-01-01

    This thesis focuses on the bioinformatic analysis and annotation of the genome of the marine planctomycete Rhodopirellula baltica. A comprehensive bioinformatic pipeline was set up and established that comprises gene prediction, annotation and visualization tools. Considerable effort was put into the manual annotation process.The annotation of the genome of Rhodopirellula baltica revealed that this organism is specialized on the aerobic degradation of complex carbohydrates. Its genome harbors...

  7. Aldo-keto reductase (AKR) superfamily: genomics and annotation.

    Science.gov (United States)

    Mindnich, Rebekka D; Penning, Trevor M

    2009-07-01

    Aldo-keto reductases (AKRs) are phase I metabolising enzymes that catalyse the reduced nicotinamide adenine dinucleotide (phosphate) (NAD(P)H)-dependent reduction of carbonyl groups to yield primary and secondary alcohols on a wide range of substrates, including aliphatic and aromatic aldehydes and ketones, ketoprostaglandins, ketosteroids and xenobiotics. In so doing they functionalise the carbonyl group for conjugation (phase II enzyme reactions). Although functionally diverse, AKRs form a protein superfamily based on their high sequence identity and common protein fold, the (alpha/beta) 8 -barrel structure. Well over 150 AKR enzymes, from diverse organisms, have been annotated so far and given systematic names according to a nomenclature that is based on multiple protein sequence alignment and degree of identity. Annotation of non-vertebrate AKRs at the National Center for Biotechnology Information or Vertebrate Genome Annotation (vega) database does not often include the systematic nomenclature name, so the most comprehensive overview of all annotated AKRs is found on the AKR website (http://www.med.upenn.edu/akr/). This site also hosts links to more detailed and specialised information (eg on crystal structures, gene expression and single nucleotide polymorphisms [SNPs]). The protein-based AKR nomenclature allows unambiguous identification of a given enzyme but does not reflect the wealth of genomic and transcriptomic variation that exists in the various databases. In this context, identification of putative new AKRs and their distinction from pseudogenes are challenging. This review provides a short summary of the characteristic features of AKR biochemistry and structure that have been reviewed in great detail elsewhere, and focuses mainly on nomenclature and database entries of human AKRs that so far have not been subject to systematic annotation. Recent developments in the annotation of SNP and transcript variance in AKRs are also summarised. PMID:19706366

  8. Aldo-keto reductase (AKR superfamily: Genomics and annotation

    Directory of Open Access Journals (Sweden)

    Mindnich Rebekka D

    2009-07-01

    Full Text Available Abstract Aldo-keto reductases (AKRs are phase I metabolising enzymes that catalyse the reduced nicotinamide adenine dinucleotide (phosphate (NAD(PH-dependent reduction of carbonyl groups to yield primary and secondary alcohols on a wide range of substrates, including aliphatic and aromatic aldehydes and ketones, ketoprostaglan-dins, ketosteroids and xenobiotics. In so doing they functionalise the carbonyl group for conjugation (phase II enzyme reactions. Although functionally diverse, AKRs form a protein superfamily based on their high sequence identity and common protein fold, the (α/(β8-barrel structure. Well over 150 AKR enzymes, from diverse organisms, have been annotated so far and given systematic names according to a nomenclature that is based on multiple protein sequence alignment and degree of identity. Annotation of non-vertebrate AKRs at the National Center for Biotechnology Information or Vertebrate Genome Annotation (vega database does not often include the systematic nomenclature name, so the most comprehensive overview of all annotated AKRs is found on the AKR website (http://www.med.upenn.edu/akr/. This site also hosts links to more detailed and specialised information (eg on crystal structures, gene expression and single nucleotide polymorphisms [SNPs]. The protein-based AKR nomenclature allows unambiguous identification of a given enzyme but does not reflect the wealth of genomic and transcriptomic variation that exists in the various databases. In this context, identification of putative new AKRs and their distinction from pseudogenes are challenging. This review provides a short summary of the characteristic features of AKR biochemistry and structure that have been reviewed in great detail elsewhere, and focuses mainly on nomenclature and database entries of human AKRs that so far have not been subject to systematic annotation. Recent developments in the annotation of SNP and transcript variance in AKRs are also summarised.

  9. nGASP - the nematode genome annotation assessment project

    Energy Technology Data Exchange (ETDEWEB)

    Coghlan, A; Fiedler, T J; McKay, S J; Flicek, P; Harris, T W; Blasiar, D; Allen, J; Stein, L D

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders. While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C

  10. Genome Sequencing and Annotation of Mycobacterium tuberculosis PR08 strain

    Directory of Open Access Journals (Sweden)

    Mohammad Maaruf Jaafar

    2016-03-01

    Full Text Available Mycobacterium tuberculosis is an acid fast bacterial species in the family Mycobacteriaceae and is the causative agent of most cases of tuberculosis. Here, we report the genomic features of Mycobacterium tuberculosis isolated from the cerebrospinal fluid (CSF of a patient diagnosed with both pulmonary and extrapulmonary tuberculosis (TB. The isolated strain was identified as Mycobacterium tuberculosis PR08 (MTB PR08. Genomic DNA of the MTB PR08 strain was extracted and subjected to whole genome sequencing using MiSeq (Illumina, CA,USA. The draft genome size of MTB PR08 strain is 4,292,364 bp with a G + C content of 65.2%. This strain was annotated to have 4723 genes and 48 RNAs. This whole genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession number CP010895.

  11. Assembly, Annotation, and Analysis of Multiple Mycorrhizal Fungal Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Initiative Consortium, Mycorrhizal Genomics; Kuo, Alan; Grigoriev, Igor; Kohler, Annegret; Martin, Francis

    2013-03-08

    Mycorrhizal fungi play critical roles in host plant health, soil community structure and chemistry, and carbon and nutrient cycling, all areas of intense interest to the US Dept. of Energy (DOE) Joint Genome Institute (JGI). To this end we are building on our earlier sequencing of the Laccaria bicolor genome by partnering with INRA-Nancy and the mycorrhizal research community in the MGI to sequence and analyze dozens of mycorrhizal genomes of all Basidiomycota and Ascomycota orders and multiple ecological types (ericoid, orchid, and ectomycorrhizal). JGI has developed and deployed high-throughput sequencing techniques, and Assembly, RNASeq, and Annotation Pipelines. In 2012 alone we sequenced, assembled, and annotated 12 draft or improved genomes of mycorrhizae, and predicted ~;;232831 genes and ~;;15011 multigene families, All of this data is publicly available on JGI MycoCosm (http://jgi.doe.gov/fungi/), which provides access to both the genome data and tools with which to analyze the data. Preliminary comparisons of the current total of 14 public mycorrhizal genomes suggest that 1) short secreted proteins potentially involved in symbiosis are more enriched in some orders than in others amongst the mycorrhizal Agaricomycetes, 2) there are wide ranges of numbers of genes involved in certain functional categories, such as signal transduction and post-translational modification, and 3) novel gene families are specific to some ecological types.

  12. SVA: software for annotating and visualizing sequenced human genomes

    Science.gov (United States)

    Ge, Dongliang; Ruzzo, Elizabeth K.; Shianna, Kevin V.; He, Min; Pelak, Kimberly; Heinzen, Erin L.; Need, Anna C.; Cirulli, Elizabeth T.; Maia, Jessica M.; Dickson, Samuel P.; Zhu, Mingfu; Singh, Abanish; Allen, Andrew S.; Goldstein, David B.

    2011-01-01

    Summary: Here we present Sequence Variant Analyzer (SVA), a software tool that assigns a predicted biological function to variants identified in next-generation sequencing studies and provides a browser to visualize the variants in their genomic contexts. SVA also provides for flexible interaction with software implementing variant association tests allowing users to consider both the bioinformatic annotation of identified variants and the strength of their associations with studied traits. We illustrate the annotation features of SVA using two simple examples of sequenced genomes that harbor Mendelian mutations. Availability and implementation: Freely available on the web at http://www.svaproject.org. Contact: d.ge@duke.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21624899

  13. Rfam: annotating non-coding RNAs in complete genomes

    OpenAIRE

    Griffiths-Jones, Sam; Moxon, Simon; Marshall, Mhairi; Khanna, Ajay; Eddy, Sean R.; Bateman, Alex

    2004-01-01

    Rfam is a comprehensive collection of non-coding RNA (ncRNA) families, represented by multiple sequence alignments and profile stochastic context-free grammars. Rfam aims to facilitate the identification and classification of new members of known sequence families, and distributes annotation of ncRNAs in over 200 complete genome sequences. The data provide the first glimpses of conservation of multiple ncRNA families across a wide taxonomic range. A small number of large families are essentia...

  14. Refined annotation and assembly of the Tetrahymena thermophila genome sequence through EST analysis, comparative genomic hybridization, and targeted gap closure

    Directory of Open Access Journals (Sweden)

    Lee Suzanne R

    2008-11-01

    Full Text Available Abstract Background Tetrahymena thermophila, a widely studied model for cellular and molecular biology, is a binucleated single-celled organism with a germline micronucleus (MIC and somatic macronucleus (MAC. The recent draft MAC genome assembly revealed low sequence repetitiveness, a result of the epigenetic removal of invasive DNA elements found only in the MIC genome. Such low repetitiveness makes complete closure of the MAC genome a feasible goal, which to achieve would require standard closure methods as well as removal of minor MIC contamination of the MAC genome assembly. Highly accurate preliminary annotation of Tetrahymena's coding potential was hindered by the lack of both comparative genomic sequence information from close relatives and significant amounts of cDNA evidence, thus limiting the value of the genomic information and also leaving unanswered certain questions, such as the frequency of alternative splicing. Results We addressed the problem of MIC contamination using comparative genomic hybridization with purified MIC and MAC DNA probes against a whole genome oligonucleotide microarray, allowing the identification of 763 genome scaffolds likely to contain MIC-limited DNA sequences. We also employed standard genome closure methods to essentially finish over 60% of the MAC genome. For the improvement of annotation, we have sequenced and analyzed over 60,000 verified EST reads from a variety of cellular growth and development conditions. Using this EST evidence, a combination of automated and manual reannotation efforts led to updates that affect 16% of the current protein-coding gene models. By comparing EST abundance, many genes showing apparent differential expression between these conditions were identified. Rare instances of alternative splicing and uses of the non-standard amino acid selenocysteine were also identified. Conclusion We report here significant progress in genome closure and reannotation of Tetrahymena

  15. High-throughput proteogenomics of Ruegeria pomeroyi: seeding a better genomic annotation for the whole marine Roseobacter clade

    Directory of Open Access Journals (Sweden)

    Christie-Oleza Joseph A

    2012-02-01

    Full Text Available Abstract Background The structural and functional annotation of genomes is now heavily based on data obtained using automated pipeline systems. The key for an accurate structural annotation consists of blending similarities between closely related genomes with biochemical evidence of the genome interpretation. In this work we applied high-throughput proteogenomics to Ruegeria pomeroyi, a member of the Roseobacter clade, an abundant group of marine bacteria, as a seed for the annotation of the whole clade. Results A large dataset of peptides from R. pomeroyi was obtained after searching over 1.1 million MS/MS spectra against a six-frame translated genome database. We identified 2006 polypeptides, of which thirty-four were encoded by open reading frames (ORFs that had not previously been annotated. From the pool of 'one-hit-wonders', i.e. those ORFs specified by only one peptide detected by tandem mass spectrometry, we could confirm the probable existence of five additional new genes after proving that the corresponding RNAs were transcribed. We also identified the most-N-terminal peptide of 486 polypeptides, of which sixty-four had originally been wrongly annotated. Conclusions By extending these re-annotations to the other thirty-six Roseobacter isolates sequenced to date (twenty different genera, we propose the correction of the assigned start codons of 1082 homologous genes in the clade. In addition, we also report the presence of novel genes within operons encoding determinants of the important tricarboxylic acid cycle, a feature that seems to be characteristic of some Roseobacter genomes. The detection of their corresponding products in large amounts raises the question of their function. Their discoveries point to a possible theory for protein evolution that will rely on high expression of orphans in bacteria: their putative poor efficiency could be counterbalanced by a higher level of expression. Our proteogenomic analysis will increase

  16. Creating reference gene annotation for the mouse C57BL6/J genome assembly

    OpenAIRE

    Mudge, Jonathan M; Harrow, Jennifer

    2015-01-01

    Annotation on the reference genome of the C57BL6/J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principle focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, al...

  17. The Saccharomyces Genome Database: Exploring Genome Features and Their Annotations.

    Science.gov (United States)

    Cherry, J Michael

    2015-12-01

    Genomic-scale assays result in data that provide information over the entire genome. Such base pair resolution data cannot be summarized easily except via a graphical viewer. A genome browser is a tool that displays genomic data and experimental results as horizontal tracks. Genome browsers allow searches for a chromosomal coordinate or a feature, such as a gene name, but they do not allow searches by function or upstream binding site. Entry into a genome browser requires that you identify the gene name or chromosomal coordinates for a region of interest. A track provides a representation for genomic results and is displayed as a row of data shown as line segments to indicate regions of the chromosome with a feature. Another type of track presents a graph or wiggle plot that indicates the processed signal intensity computed for a particular experiment or set of experiments. Wiggle plots are typical for genomic assays such as the various next-generation sequencing methods (e.g., chromatin immunoprecipitation [ChIP]-seq or RNA-seq), where it represents a peak of DNA binding, histone modification, or the mapping of an RNA sequence. Here we explore the browser that has been built into the Saccharomyces Genome Database (SGD). PMID:26631126

  18. ARC: Automated Resource Classifier for agglomerative functional classification of prokaryotic proteins using annotation texts

    Indian Academy of Sciences (India)

    Muthiah Gnanamani; Naveen Kumar; Srinivasan Ramachandran

    2007-08-01

    Functional classification of proteins is central to comparative genomics. The need for algorithms tuned to enable integrative interpretation of analytical data is felt globally. The availability of a general, automated software with built-in flexibility will significantly aid this activity. We have prepared ARC (Automated Resource Classifier), which is an open source software meeting the user requirements of flexibility. The default classification scheme based on keyword match is agglomerative and directs entries into any of the 7 basic non-overlapping functional classes: Cell wall, Cell membrane and Transporters ($\\mathcal{C}$), Cell division ($\\mathcal{D}$), Information ($\\mathcal{I}$), Translocation ($\\mathcal{L}$), Metabolism ($\\mathcal{M}$), Stress($\\mathcal{R}$), Signal and communication($\\mathcal{S}$) and 2 ancillary classes: Others ($\\mathcal{O}$) and Hypothetical ($\\mathcal{H}$). The keyword library of ARC was built serially by first drawing keywords from Bacillus subtilis and Escherichia coli K12. In subsequent steps, this library was further enriched by collecting terms from archaeal representative Archaeoglobus fulgidus, Gene Ontology, and Gene Symbols. ARC is 94.04% successful on 6,75,663 annotated proteins from 348 prokaryotes. Three examples are provided to illuminate the current perspectives on mycobacterial physiology and costs of proteins in 333 prokaryotes. ARC is available at http://arc.igib.res.in.

  19. Protein annotation in the era of personal genomics

    DEFF Research Database (Denmark)

    Holberg Blicher, Thomas; Gupta, Ramneek; Wesolowska, Agata;

    2010-01-01

    the differences between many individuals of the same species-humans in particular-the focus needs be on the functional impact of individual residue variation. To fulfil the promises of personal genomics, we need to start asking not only what is in a genome but also how millions of small differences......Protein annotation provides a condensed and systematic view on the function of individual proteins. It has traditionally dealt with sorting proteins into functional categories, which for example has proven to be successful for the comparison of different species. However, if we are to understand...... between individual genomes affect protein function and in turn human health. Copyright © 2010 Elsevier Ltd. All rights reserved....

  20. AGAPE (Automated Genome Analysis PipelinE for pan-genome analysis of Saccharomyces cerevisiae.

    Directory of Open Access Journals (Sweden)

    Giltae Song

    Full Text Available The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced the genomes of 25 strains that are commonly used in the yeast research community using advanced sequencing technology at high quality. We also developed a pipeline for automated pan-genome analysis, which integrates the steps of assembly, annotation, and variation calling. To assign strain-specific functional annotations, we identified genes that were not present in the reference genome. We classified these according to their presence or absence across strains and characterized each group of genes with known functional and phenotypic features. The functional roles of novel genes not found in the reference genome and associated with strains or groups of strains appear to be consistent with anticipated adaptations in specific lineages. As more S. cerevisiae strain genomes are released, our analysis can be used to collate genome data and relate it to lineage-specific patterns of genome evolution. Our new tool set will enhance our understanding of genomic and functional evolution in S. cerevisiae, and will be available to the yeast genetics and molecular biology community.

  1. MarinegenomicsDB: an integrated genome viewer for community-based annotation of genomes.

    Science.gov (United States)

    Koyanagi, Ryo; Takeuchi, Takeshi; Hisata, Kanako; Gyoja, Fuki; Shoguchi, Eiichi; Satoh, Nori; Kawashima, Takeshi

    2013-10-01

    We constructed a web-based genome annotation platform, MarinegenomicsDB, to integrate genome data from various marine organisms including the pearl oyster Pinctada fucata and the coral Acropora digitifera. This newly developed viewer application provides open access to published data and a user-friendly environment for community-based manual gene annotation. Development on a flexible framework enables easy expansion of the website on demand. To date, more than 2000 genes have been annotated using this system. In the future, the website will be expanded to host a wider variety of data, more species, and different types of genome-wide analyses. The website is available at the following URL: http://marinegenomics.oist.jp. PMID:24125644

  2. Re-annotation of the woodland strawberry (Fragaria vesca) genome

    OpenAIRE

    Darwish, Omar; Shahan, Rachel; Liu, Zhongchi; Slovin, Janet P; Alkharouf, Nadim W

    2015-01-01

    Background Fragaria vesca is a low-growing, small-fruited diploid strawberry species commonly called woodland strawberry. It is native to temperate regions of Eurasia and North America and while it produces edible fruits, it is most highly useful as an experimental perennial plant system that can serve as a model for the agriculturally important Rosaceae family. A draft of the F. vesca genome sequence was published in 2011 [Nat Genet 43:223,2011]. The first generation annotation (version 1.1)...

  3. RNA-Seq improves annotation of protein-coding genes in the cucumber genome

    OpenAIRE

    Fei Zhangjun; Yan Pengcheng; Huang Sanwen; Zhang Zhonghua; Li Zhen; Lin Kui

    2011-01-01

    Abstract Background As more and more genomes are sequenced, genome annotation becomes increasingly important in bridging the gap between sequence and biology. Gene prediction, which is at the center of genome annotation, usually integrates various resources to compute consensus gene structures. However, many newly sequenced genomes have limited resources for gene predictions. In an effort to create high-quality gene models of the cucumber genome (Cucumis sativus var. sativus), based on the EV...

  4. Use of Modern Chemical Protein Synthesis and Advanced Fluorescent Assay Techniques to Experimentally Validate the Functional Annotation of Microbial Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Kent, Stephen [University of Chicago

    2012-07-20

    The objective of this research program was to prototype methods for the chemical synthesis of predicted protein molecules in annotated microbial genomes. High throughput chemical methods were to be used to make large numbers of predicted proteins and protein domains, based on microbial genome sequences. Microscale chemical synthesis methods for the parallel preparation of peptide-thioester building blocks were developed; these peptide segments are used for the parallel chemical synthesis of proteins and protein domains. Ultimately, it is envisaged that these synthetic molecules would be ‘printed’ in spatially addressable arrays. The unique ability of total synthesis to precision label protein molecules with dyes and with chemical or biochemical ‘tags’ can be used to facilitate novel assay technologies adapted from state-of-the art single molecule fluorescence detection techniques. In the future, in conjunction with modern laboratory automation this integrated set of techniques will enable high throughput experimental validation of the functional annotation of microbial genomes.

  5. MEGAnnotator: a user-friendly pipeline for microbial genomes assembly and annotation.

    Science.gov (United States)

    Lugli, Gabriele Andrea; Milani, Christian; Mancabelli, Leonardo; van Sinderen, Douwe; Ventura, Marco

    2016-04-01

    Genome annotation is one of the key actions that must be undertaken in order to decipher the genetic blueprint of organisms. Thus, a correct and reliable annotation is essential in rendering genomic data valuable. Here, we describe a bioinformatics pipeline based on freely available software programs coordinated by a multithreaded script named MEGAnnotator (Multithreaded Enhanced prokaryotic Genome Annotator). This pipeline allows the generation of multiple annotated formats fulfilling the NCBI guidelines for assembled microbial genome submission, based on DNA shotgun sequencing reads, and minimizes manual intervention, while also reducing waiting times between software program executions and improving final quality of both assembly and annotation outputs. MEGAnnotator provides an efficient way to pre-arrange the assembly and annotation work required to process NGS genome sequence data. The script improves the final quality of microbial genome annotation by reducing ambiguous annotations. Moreover, the MEGAnnotator platform allows the user to perform a partial annotation of pre-assembled genomes and includes an option to accomplish metagenomic data set assemblies. MEGAnnotator platform will be useful for microbiologists interested in genome analyses of bacteria as well as those investigating the complexity of microbial communities that do not possess the necessary skills to prepare their own bioinformatics pipeline. PMID:26936607

  6. IMG ER: A System for Microbial Genome Annotation Expert Review and Curation

    Energy Technology Data Exchange (ETDEWEB)

    Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.

    2009-05-25

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  7. Apollo2Go: a web service adapter for the Apollo genome viewer to enable distributed genome annotation

    Directory of Open Access Journals (Sweden)

    Mayer Klaus FX

    2007-08-01

    Full Text Available Abstract Background Apollo, a genome annotation viewer and editor, has become a widely used genome annotation and visualization tool for distributed genome annotation projects. When using Apollo for annotation, database updates are carried out by uploading intermediate annotation files into the respective database. This non-direct database upload is laborious and evokes problems of data synchronicity. Results To overcome these limitations we extended the Apollo data adapter with a generic, configurable web service client that is able to retrieve annotation data in a GAME-XML-formatted string and pass it on to Apollo's internal input routine. Conclusion This Apollo web service adapter, Apollo2Go, simplifies the data exchange in distributed projects and aims to render the annotation process more comfortable. The Apollo2Go software is freely available from ftp://ftpmips.gsf.de/plants/apollo_webservice.

  8. Synergistic use of plant-prokaryote comparative genomics for functional annotations

    Directory of Open Access Journals (Sweden)

    Waller Jeffrey C

    2011-06-01

    Full Text Available Abstract Background Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown or vaguely known function, and a large number are wrongly annotated. Many of these ‘unknown’ proteins are common to prokaryotes and plants. We set out to predict and experimentally test the functions of such proteins. Our approach to functional prediction integrates comparative genomics based mainly on microbial genomes with functional genomic data from model microorganisms and post-genomic data from plants. This approach bridges the gap between automated homology-based annotations and the classical gene discovery efforts of experimentalists, and is more powerful than purely computational approaches to identifying gene-function associations. Results Among Arabidopsis genes, we focused on those (2,325 in total that (i are unique or belong to families with no more than three members, (ii occur in prokaryotes, and (iii have unknown or poorly known functions. Computer-assisted selection of promising targets for deeper analysis was based on homology-independent characteristics associated in the SEED database with the prokaryotic members of each family. In-depth comparative genomic analysis was performed for 360 top candidate families. From this pool, 78 families were connected to general areas of metabolism and, of these families, specific functional predictions were made for 41. Twenty-one predicted functions have been experimentally tested or are currently under investigation by our group in at least one prokaryotic organism (nine of them have been validated, four invalidated, and eight are in progress. Ten additional predictions have been independently validated by other groups. Discovering the function of very widespread but hitherto enigmatic proteins such as the YrdC or YgfZ families illustrates the power of our approach

  9. The Effect of Human Genome Annotation Complexity on RNA-Seq Gene Expression Quantification

    Science.gov (United States)

    Wu, Po-Yen; Phan, John H.; Wang, May D.

    2016-01-01

    Next-generation sequencing (NGS) has brought human genomic research to an unprecedented era. RNA-Seq is a branch of NGS that can be used to quantify gene expression and depends on accurate annotation of the human genome (i.e., the definition of genes and all of their variants or isoforms). Multiple annotations of the human genome exist with varying complexity. However, it is not clear how the choice of genome annotation influences RNA-Seq gene expression quantification. We assess the effect of different genome annotations in terms of (1) mapping quality, (2) quantification variation, (3) quantification accuracy (i.e., by comparing to qRT-PCR data), and (4) the concordance of detecting differentially expressed genes. External validation with qRT-PCR suggests that more complex genome annotations result in higher quantification variation.

  10. Coordinated international action to accelerate genome-to-phenome with FAANG, the Functional Annotation of Animal Genomes project : open letter

    NARCIS (Netherlands)

    Archibald, A.L.; Bottema, C.D.; Brauning, R.; Burgess, S.C.; Burt, D.W.; Casas, E.; Cheng, H.H.; Clarke, L.; Couldrey, C.; Dalrymple, B.P.; Elsik, C.G.; Foissac, S.; Giuffra, E.; Groenen, M.A.M.; Hayes, B.J.; Huang, L.S.; Khatib, H.; Kijas, J.W.; Kim, H.; Lunney, J.K.; McCarthy, F.M.; McEwan, J.; Moore, S.; Nanduri, B.; Notredame, C.; Palti, Y.; Plastow, G.S.; Reecy, J.M.; Rohrer, G.; Sarropoulou, E.; Schmidt, C.J.; Silverstein, J.; Tellam, R.L.; Tixier-Boichard, M.; Tosser-klopp, G.; Tuggle, C.K.; Vilkki, J.; White, S.N.; Zhao, S.; Zhou, H.

    2015-01-01

    We describe the organization of a nascent international effort, the Functional Annotation of Animal Genomes (FAANG) project, whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species.

  11. Large-scale prokaryotic gene prediction and comparison to genome annotation

    DEFF Research Database (Denmark)

    Nielsen, Pernille; Krogh, Anders Stærmose

    2005-01-01

    Motivation: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. It makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome...... comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene. Results: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic...

  12. LocusTrack: Integrated visualization of GWAS results and genomic annotation

    OpenAIRE

    Cuellar-Partida, Gabriel; Renteria, Miguel E.; Macgregor, Stuart

    2015-01-01

    Background Genome-wide association studies (GWAS) are an important tool for the mapping of complex traits and diseases. Visual inspection of genomic annotations may be used to generate insights into the biological mechanisms underlying GWAS-identified loci. Results We developed LocusTrack, a web-based application that annotates and creates plots of regional GWAS results and incorporates user-specified tracks that display annotations such as linkage disequilibrium (LD), phylogenetic conservati...

  13. Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes

    Directory of Open Access Journals (Sweden)

    Danchin Antoine

    2002-02-01

    Full Text Available Abstract Background Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene is imbedded in the approach. Results We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank. Conclusions The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific

  14. The DOE-JGI Standard Operating Procedure for the Annotations of the Microbial Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Mavromatis, Konstantinos; Ivanova, Natalia; Chen, I-Min A.; Szeto, Ernest; Markowitz, Victor; Kyrpides, Nikos C.

    2009-05-20

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes prediction of protein coding and RNA genes, as well as repeats and assignment of product names to these genes.

  15. The DOE-JGI Standard Operating Procedure for the Annotations of the Microbial Genomes

    OpenAIRE

    Mavromatis, Konstantinos

    2010-01-01

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes predic...

  16. The DOE-JGI Standard Operating Procedure for the Annotations of Microbial Genomes.

    Science.gov (United States)

    Mavromatis, Konstantinos; Ivanova, Natalia N; Chen, I-Min A; Szeto, Ernest; Markowitz, Victor M; Kyrpides, Nikos C

    2009-01-01

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes prediction of protein coding and RNA genes, as well as repeats and assignment of product names to these genes. PMID:21304638

  17. The DOE-JGI Standard Operating Procedure for the Annotations of Microbial Genomes

    OpenAIRE

    Mavromatis, Konstantinos; Natalia N. Ivanova; Chen, I-Min A.; Szeto, Ernest; Markowitz, Victor M.; Kyrpides, Nikos C.

    2009-01-01

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes predic...

  18. Annotated bibliography of films in automation, data processing, and computer science

    CERN Document Server

    Soloman, Martin B Jr

    2015-01-01

    With the rapid development of computer science and the expanding use of computers in all facets of American life, there has been made available a wide range of instructional and informational films on automation, data processing, and computer science. Here is the first annotated bibliography of these and related films, gathered from industrial, institutional, and other sources.This bibliography annotates 244 films, alphabetically arranged by title, with a detailed subject index. Information is also provided concerning the intended audience, rental-purchase data, ordering procedures, and such s

  19. AphidBase: A centralized bioinformatic resource for annotation of the pea aphid genome

    OpenAIRE

    Legeai, Fabrice; Shigenobu, Shuji; Gauthier, Jean-Pierre; Colbourne, John; Rispe, Claude; Collin, Olivier; Richards, Stephen; Wilson, Alex C. C.; Tagu, Denis

    2010-01-01

    AphidBase is a centralized bioinformatic resource that was developed to facilitate community annotation of the pea aphid genome by the International Aphid Genomics Consortium (IAGC). The AphidBase Information System designed to organize and distribute genomic data and annotations for a large international community was constructed using open source software tools from the Generic Model Organism Database (GMOD). The system includes Apollo and GBrowse utilities as well as a wiki, blast search c...

  20. Experimental-confirmation and functional-annotation of predicted proteins in the chicken genome

    Directory of Open Access Journals (Sweden)

    McCarthy Fiona M

    2007-11-01

    Full Text Available Abstract Background The chicken genome was sequenced because of its phylogenetic position as a non-mammalian vertebrate, its use as a biomedical model especially to study embryology and development, its role as a source of human disease organisms and its importance as the major source of animal derived food protein. However, genomic sequence data is, in itself, of limited value; generally it is not equivalent to understanding biological function. The benefit of having a genome sequence is that it provides a basis for functional genomics. However, the sequence data currently available is poorly structurally and functionally annotated and many genes do not have standard nomenclature assigned. Results We analysed eight chicken tissues and improved the chicken genome structural annotation by providing experimental support for the in vivo expression of 7,809 computationally predicted proteins, including 30 chicken proteins that were only electronically predicted or hypothetical translations in human. To improve functional annotation (based on Gene Ontology, we mapped these identified proteins to their human and mouse orthologs and used this orthology to transfer Gene Ontology (GO functional annotations to the chicken proteins. The 8,213 orthology-based GO annotations that we produced represent an 8% increase in currently available chicken GO annotations. Orthologous chicken products were also assigned standardized nomenclature based on current chicken nomenclature guidelines. Conclusion We demonstrate the utility of high-throughput expression proteomics for rapid experimental structural annotation of a newly sequenced eukaryote genome. These experimentally-supported predicted proteins were further annotated by assigning the proteins with standardized nomenclature and functional annotation. This method is widely applicable to a diverse range of species. Moreover, information from one genome can be used to improve the annotation of other genomes and

  1. GeneLook: a novel ab initio gene identification system suitable for automated annotation of prokaryotic sequences.

    Science.gov (United States)

    Nishi, Tatsunari; Ikemura, Toshimichi; Kanaya, Shigehiko

    2005-02-14

    With the rapid increases in the amounts of sequence data for prokaryotic genomes, it has become important to develop systems for automated and accurate genome annotation. We present herein a novel ab initio gene identification system, GeneLook, that predicts protein-coding open reading frames (ORFs) with high sensitivity and specificity with no prior knowledge of the sequence composition. The system predicts protein-coding ORFs in two stages, seed ORF selection and main prediction. In the selection of reliable seed ORFs containing at least 200 codons, GeneLook predicts translation start sites and operon structures through searches for ribosome-binding sites and a novel operon prediction algorithm. The codon and nucleotide frequencies of seed ORFs are then used to determine values for two new coding-potential parameters for identification of protein-coding ORFs of at least 34 codons and for another parameter that improves the prediction accuracy for GC-rich genomes. In the main prediction, GeneLook uses these parameters to identify the most likely genes of a given minimal length. We assessed the performance of GeneLook with two indices, sensitivity and specificity that are defined as true positives (TP)/(TP+false negatives) and TP/(TP+false positives), respectively. This system predicted protein-coding ORFs for Escherichia coli and Bacillus subtilis with sensitivities of 96.5% and 96.2%, respectively, and specificities of 96.9% and 96.1%, respectively. The system also identified 94.1% of annotated genes of the Pseudomonas aeruginosa genome, which is GC-rich, with high specificity (97.2%). Furthermore, GeneLook identified protein-coding ORFs with high accuracy from a wide variety of prokaryotic genomes. PMID:15716020

  2. Xylella fastidiosa comparative genomic database is an information resource to explore the annotation, genomic features, and biology of different strains

    Directory of Open Access Journals (Sweden)

    Alessandro M. Varani

    2012-01-01

    Full Text Available The Xylella fastidiosa comparative genomic database is a scientific resource with the aim to provide a user-friendly interface for accessing high-quality manually curated genomic annotation and comparative sequence analysis, as well as for identifying and mapping prophage-like elements, a marked feature of Xylella genomes. Here we describe a database and tools for exploring the biology of this important plant pathogen. The hallmarks of this database are the high quality genomic annotation, the functional and comparative genomic analysis and the identification and mapping of prophage-like elements. It is available from web site http://www.xylella.lncc.br.

  3. Comparative genomics in cyprinids: common carp ESTs help the annotation of the zebrafish genome

    Directory of Open Access Journals (Sweden)

    Srinivasan Hamsa

    2006-12-01

    Full Text Available Abstract Background Automatic annotation of sequenced eukaryotic genomes integrates a combination of methodologies such as ab-initio methods and alignment of homologous genes and/or proteins. For example, annotation of the zebrafish genome within Ensembl relies heavily on available cDNA and protein sequences from two distantly related fish species and other vertebrates that have diverged several hundred million years ago. The scarcity of genomic information from other cyprinids provides the impetus to leverage EST collections to understand gene structures in this diverse teleost group. Results We have generated 6,050 ESTs from the differentiating testis of common carp (Cyprinus carpio and clustered them with 9,303 non-gonadal ESTs from CarpBase as well as 1,317 ESTs and 652 common carp mRNAs from GenBank. Over 28% of the resulting 8,663 unique transcripts are exclusively testis-derived ESTs. Moreover, 974 of these transcripts did not match any sequence in the zebrafish or fathead minnow EST collection. A total of 1,843 unique common carp sequences could be stringently mapped to the zebrafish genome (version 5, of which 1,752 matched coding sequences of zebrafish genes with or without potential splice variants. We show that 91 common carp transcripts map to intergenic and intronic regions on the zebrafish genome assembly and regions annotated with non-teleost sequences. Interestingly, an additional 42 common carp transcripts indicate the potential presence of new splicing variants not found in zebrafish databases so far. The fact that common carp transcripts help the identification or confirmation of these coding regions in zebrafish exemplifies the usefulness of sequences from closely related species for the annotation of model genomes. We also demonstrate that 5' UTR sequences of common carp and zebrafish orthologs share a significant level of similarity based on preservation of motif arrangements for as many as 10 ab-initio motifs. Conclusion

  4. TAPDANCE: An automated tool to identify and annotate transposon insertion CISs and associations between CISs from next generation sequence data

    Directory of Open Access Journals (Sweden)

    Sarver Aaron L

    2012-06-01

    Full Text Available Abstract Background Next generation sequencing approaches applied to the analyses of transposon insertion junction fragments generated in high throughput forward genetic screens has created the need for clear informatics and statistical approaches to deal with the massive amount of data currently being generated. Previous approaches utilized to 1 map junction fragments within the genome and 2 identify Common Insertion Sites (CISs within the genome are not practical due to the volume of data generated by current sequencing technologies. Previous approaches applied to this problem also required significant manual annotation. Results We describe Transposon Annotation Poisson Distribution Association Network Connectivity Environment (TAPDANCE software, which automates the identification of CISs within transposon junction fragment insertion data. Starting with barcoded sequence data, the software identifies and trims sequences and maps putative genomic sequence to a reference genome using the bowtie short read mapper. Poisson distribution statistics are then applied to assess and rank genomic regions showing significant enrichment for transposon insertion. Novel methods of counting insertions are used to ensure that the results presented have the expected characteristics of informative CISs. A persistent mySQL database is generated and utilized to keep track of sequences, mappings and common insertion sites. Additionally, associations between phenotypes and CISs are also identified using Fisher’s exact test with multiple testing correction. In a case study using previously published data we show that the TAPDANCE software identifies CISs as previously described, prioritizes them based on p-value, allows holistic visualization of the data within genome browser software and identifies relationships present in the structure of the data. Conclusions The TAPDANCE process is fully automated, performs similarly to previous labor intensive approaches

  5. Proteogenomics: the needs and roles to be filled by proteomics in genome annotation

    Energy Technology Data Exchange (ETDEWEB)

    Ansong, Charles; Purvine, Samuel O.; Adkins, Joshua N.; Lipton, Mary S.; Smith, Richard D.

    2008-01-01

    While genome sequencing efforts reveal the basic building blocks of life, a genome sequence alone is insufficient for elucidating biological function. Genome annotation – the process of identifying genes and assigning function to each gene in a genome sequence – provides the means to elucidate biological function from sequence. Current state-of-the-art high throughput genome annotation uses a combination of comparative (sequence similarity data) and non-comparative (ab initio gene prediction algorithms) methods to identify open reading frames in genome sequences. Because approaches used to validate the presence of these open reading frames are typically based on the information derived from the annotated genomes, they cannot independently and unequivocally determine whether a predicted open reading frame is translated into a protein. With the ability to directly measure peptides arising from expressed proteins, high throughput liquid chromatography-tandem mass spectrometry-based proteomics, approaches can be used to verify coding regions of a genomic sequence. Here, we highlight several ways in which high throughput tandem mass spectrometry-based proteomics can improve the quality of genome annotations and suggest that it could be efficiently applied during the initial gene calling process so that the improvements are propagated through the subsequent functional annotation process.

  6. Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation.

    Science.gov (United States)

    Sharma, Virag; Elghafari, Anas; Hiller, Michael

    2016-06-20

    Identifying coding genes is an essential step in genome annotation. Here, we utilize existing whole genome alignments to detect conserved coding exons and then map gene annotations from one genome to many aligned genomes. We show that genome alignments contain thousands of spurious frameshifts and splice site mutations in exons that are truly conserved. To overcome these limitations, we have developed CESAR (Coding Exon-Structure Aware Realigner) that realigns coding exons, while considering reading frame and splice sites of each exon. CESAR effectively avoids spurious frameshifts in conserved genes and detects 91% of shifted splice sites. This results in the identification of thousands of additional conserved exons and 99% of the exons that lack inactivating mutations match real exons. Finally, to demonstrate the potential of using CESAR for comparative gene annotation, we applied it to 188 788 exons of 19 865 human genes to annotate human genes in 99 other vertebrates. These comparative gene annotations are available as a resource (http://bds.mpi-cbg.de/hillerlab/CESAR/). CESAR (https://github.com/hillerlab/CESAR/) can readily be applied to other alignments to accurately annotate coding genes in many other vertebrate and invertebrate genomes. PMID:27016733

  7. Exploring an Annotated Sequence Assembly of the Perennial Ryegrass Genome for Genomic Regions Enriched for Trait Associated Variants

    DEFF Research Database (Denmark)

    Byrne, Stephen; Cericola, Fabio; Janss, Luc;

    2015-01-01

    Perennial ryegrass (Lolium perenne L.) is an outbreeding diploid species and one of the most important forage crops used in temperate agriculture. We have developed a draft sequence assembly of the perennial ryegrass genome and annotated it with the aid of RNA-seq data from various genotypes, plant.......3 SNPs per gene. SNPs were partitioned according to various annotation features and genomic relationship matrices were created for each annotation class. The SNP-explained variances for heading date and disease resistance were calculated for each class....

  8. Improved annotation through genome-scale metabolic modeling of Aspergillus oryzae

    DEFF Research Database (Denmark)

    Vongsangnak, Wanwipa; Olsen, Peter; Hansen, Kim;

    2008-01-01

    Background: Since ancient times the filamentous fungus Aspergillus oryzae has been used in the fermentation industry for the production of fermented sauces and the production of industrial enzymes. Recently, the genome sequence of A. oryzae with 12,074 annotated genes was released but the number...... of hypothetical proteins accounted for more than 50% of the annotated genes. Considering the industrial importance of this fungus, it is therefore valuable to improve the annotation and further integrate genomic information with biochemical and physiological information available for this microorganism and other...... related fungi. Here we proposed the gene prediction by construction of an A. oryzae Expressed Sequence Tag (EST) library, sequencing and assembly. We enhanced the function assignment by our developed annotation strategy. The resulting better annotation was used to reconstruct the metabolic network leading...

  9. GenoQuery: a new querying module for functional annotation in a genomic warehouse

    OpenAIRE

    Lemoine, Frédéric; Labedan, Bernard; Froidevaux, Christine

    2008-01-01

    Motivation: We have to cope with both a deluge of new genome sequences and a huge amount of data produced by high-throughput approaches used to exploit these genomic features. Crossing and comparing such heterogeneous and disparate data will help improving functional annotation of genomes. This requires designing elaborate integration systems such as warehouses for storing and querying these data. Results: We have designed a relational genomic warehouse with an original multi-layer architectu...

  10. Automated design of genomic Southern blot probes

    Directory of Open Access Journals (Sweden)

    Komiyama Noboru H

    2010-01-01

    experimentally validate a number of these automated designs by Southern blotting. The majority of probes we tested performed well confirming our in silico prediction methodology and the general usefulness of the software for automated genomic Southern probe design. Conclusions Software and supplementary information are freely available at: http://www.genes2cognition.org/software/southern_blot

  11. Using Microbial Genome Annotation as a Foundation for Collaborative Student Research

    Science.gov (United States)

    Reed, Kelynne E.; Richardson, John M.

    2013-01-01

    We used the Integrated Microbial Genomes Annotation Collaboration Toolkit as a framework to incorporate microbial genomics research into a microbiology and biochemistry course in a way that promoted student learning of bioinformatics and research skills and emphasized teamwork and collaboration as evidenced through multiple assessment mechanisms.…

  12. Discovery and Characterization of Chromatin States for Systematic Annotation of the Human Genome

    Science.gov (United States)

    Ernst, Jason; Kellis, Manolis

    A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal chromatin states in human T cells, based on recurrent and spatially coherent combinations of chromatin marks.We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, largescale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.

  13. A Fugu–Human Genome Synteny Viewer: web software for graphical display and annotation reports of synteny between Fugu genomic sequence and human genes

    OpenAIRE

    Halling-Brown, Mark; Sansom, Clare; Moss, David S; Elgar, Greg; Edwards, Yvonne J.K.

    2004-01-01

    A web server has been developed to access annotation and graphical reports of synteny and gene order between the Fugu genome and human genes. In this system, the assembled Fugu genomic sequences (also known as scaffolds) are annotated. The annotations for each Fugu scaffold are computed, stored and made publicly available. The annotations describe matches to human homologous genes. For each significant human gene match on the Fugu scaffold, the corresponding human chromosome map and measures ...

  14. Creating reference gene annotation for the mouse C57BL6/J genome assembly.

    Science.gov (United States)

    Mudge, Jonathan M; Harrow, Jennifer

    2015-10-01

    Annotation on the reference genome of the C57BL6/J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principle focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we will outline how the C57BL6/J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species. PMID:26187010

  15. Polymorphism Identification and Improved Genome Annotation of Brassica rapa Through Deep RNA Sequencing

    OpenAIRE

    Devisetty, Upendra Kumar; Covington, Michael F.; An V Tat; Lekkala, Saradadevi; Maloof, Julin N.

    2014-01-01

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system...

  16. Unsupervised Decoding of Long-Term, Naturalistic Human Neural Recordings with Automated Video and Audio Annotations

    Science.gov (United States)

    Wang, Nancy X. R.; Olson, Jared D.; Ojemann, Jeffrey G.; Rao, Rajesh P. N.; Brunton, Bingni W.

    2016-01-01

    Fully automated decoding of human activities and intentions from direct neural recordings is a tantalizing challenge in brain-computer interfacing. Implementing Brain Computer Interfaces (BCIs) outside carefully controlled experiments in laboratory settings requires adaptive and scalable strategies with minimal supervision. Here we describe an unsupervised approach to decoding neural states from naturalistic human brain recordings. We analyzed continuous, long-term electrocorticography (ECoG) data recorded over many days from the brain of subjects in a hospital room, with simultaneous audio and video recordings. We discovered coherent clusters in high-dimensional ECoG recordings using hierarchical clustering and automatically annotated them using speech and movement labels extracted from audio and video. To our knowledge, this represents the first time techniques from computer vision and speech processing have been used for natural ECoG decoding. Interpretable behaviors were decoded from ECoG data, including moving, speaking and resting; the results were assessed by comparison with manual annotation. Discovered clusters were projected back onto the brain revealing features consistent with known functional areas, opening the door to automated functional brain mapping in natural settings. PMID:27148018

  17. Unsupervised Decoding of Long-Term, Naturalistic Human Neural Recordings with Automated Video and Audio Annotations.

    Science.gov (United States)

    Wang, Nancy X R; Olson, Jared D; Ojemann, Jeffrey G; Rao, Rajesh P N; Brunton, Bingni W

    2016-01-01

    Fully automated decoding of human activities and intentions from direct neural recordings is a tantalizing challenge in brain-computer interfacing. Implementing Brain Computer Interfaces (BCIs) outside carefully controlled experiments in laboratory settings requires adaptive and scalable strategies with minimal supervision. Here we describe an unsupervised approach to decoding neural states from naturalistic human brain recordings. We analyzed continuous, long-term electrocorticography (ECoG) data recorded over many days from the brain of subjects in a hospital room, with simultaneous audio and video recordings. We discovered coherent clusters in high-dimensional ECoG recordings using hierarchical clustering and automatically annotated them using speech and movement labels extracted from audio and video. To our knowledge, this represents the first time techniques from computer vision and speech processing have been used for natural ECoG decoding. Interpretable behaviors were decoded from ECoG data, including moving, speaking and resting; the results were assessed by comparison with manual annotation. Discovered clusters were projected back onto the brain revealing features consistent with known functional areas, opening the door to automated functional brain mapping in natural settings. PMID:27148018

  18. Genome-wide functional annotation of Phomopsis longicolla isolate MSPL 10-6.

    Science.gov (United States)

    Darwish, Omar; Li, Shuxian; Matthews, Benjamin; Alkharouf, Nadim

    2016-06-01

    Phomopsis seed decay of soybean is caused primarily by the seed-borne fungal pathogen Phomopsis longicolla (syn. Diaporthe longicolla). This disease severely decreases soybean seed quality, reduces seedling vigor and stand establishment, and suppresses yield. It is one of the most economically important soybean diseases. In this study we annotated the entire genome of P. longicolla isolate MSPL 10-6, which was isolated from field-grown soybean seed in Mississippi, USA. This study represents the first reported genome-wide functional annotation of a seed borne fungal pathogen in the Diaporthe-Phomopsis complex. The P. longicolla genome annotation will enable research into the genetic basis of fungal infection of soybean seed and provide information for the study of soybean-fungal interactions. The genome annotation will also be a valuable resource for the research and agricultural communities. It will aid in the development of new control strategies for this pathogen. The annotations can be found from: http://bioinformatics.towson.edu/phomopsis_longicolla/download.html. NCBI accession number is: AYRD00000000. PMID:27222801

  19. Whole genome sequence and genome annotation of Colletotrichum acutatum, causal agent of anthracnose in pepper plants in South Korea

    Directory of Open Access Journals (Sweden)

    Joon-Hee Han

    2016-06-01

    Full Text Available Colletotrichum acutatum is a destructive fungal pathogen which causes anthracnose in a wide range of crops. Here we report the whole genome sequence and annotation of C. acutatum strain KC05, isolated from an infected pepper in Kangwon, South Korea. Genomic DNA from the KC05 strain was used for the whole genome sequencing using a PacBio sequencer and the MiSeq system. The KC05 genome was determined to be 52,190,760 bp in size with a G + C content of 51.73% in 27 scaffolds and to contain 13,559 genes with an average length of 1516 bp. Gene prediction and annotation were performed by incorporating RNA-Seq data. The genome sequence of the KC05 was deposited at DDBJ/ENA/GenBank under the accession number LUXP00000000.

  20. Whole genome sequence and genome annotation of Colletotrichum acutatum, causal agent of anthracnose in pepper plants in South Korea.

    Science.gov (United States)

    Han, Joon-Hee; Chon, Jae-Kyung; Ahn, Jong-Hwa; Choi, Ik-Young; Lee, Yong-Hwan; Kim, Kyoung Su

    2016-06-01

    Colletotrichum acutatum is a destructive fungal pathogen which causes anthracnose in a wide range of crops. Here we report the whole genome sequence and annotation of C. acutatum strain KC05, isolated from an infected pepper in Kangwon, South Korea. Genomic DNA from the KC05 strain was used for the whole genome sequencing using a PacBio sequencer and the MiSeq system. The KC05 genome was determined to be 52,190,760 bp in size with a G + C content of 51.73% in 27 scaffolds and to contain 13,559 genes with an average length of 1516 bp. Gene prediction and annotation were performed by incorporating RNA-Seq data. The genome sequence of the KC05 was deposited at DDBJ/ENA/GenBank under the accession number LUXP00000000. PMID:27114908

  1. Genome annotations - KOME | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available ated to rice genomes in GenBank blastn/blastx results. Related to rice genomes is defined as follows: The word... Oryza sativa or rice is included in the definition of the entry and the word BAC, PAC, chromosome Genomic

  2. The 2008 update of the Aspergillus nidulans genome annotation: A community effort

    DEFF Research Database (Denmark)

    Wortman, Jennifer Russo; Gilsenan, Jane Mabey; Joardar, Vinita;

    2009-01-01

    The identification and annotation of protein-coding genes is one of the primary goals of whole-genome sequencing projects, and the accuracy of predicting the primary protein products of gene expression is vital to the interpretation of the available data and the design of downstream functional...... applications. Nevertheless, the comprehensive annotation of eukaryotic genomes remains a considerable challenge. Many genomes submitted to public databases, including those of major model organisms, contain significant numbers of wrong and incomplete gene predictions. We present a community-based reannotation...... of the Aspergillus nidulans genome with the primary goal of increasing the number and quality of protein functional assignments through the careful review of experts in the field of fungal biology....

  3. Citrus sinensis Annotation Project (CAP): A Comprehensive Database for Sweet Orange Genome

    OpenAIRE

    Wang, Jia; Chen, DiJun; Lei, Yang; Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-ling

    2014-01-01

    Citrus is one of the most important and widely grown fruit crop with global production ranking firstly among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production both in fresh fruit and processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-...

  4. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system

    OpenAIRE

    I-Min A Chen; Markowitz, Victor M.; Palaniappan, Krishna; Szeto, Ernest; Chu, Ken; Huang, Jinghua; Ratner, Anna; Pillay, Manoj; Hadjithomas, Michalis; Huntemann, Marcel; Mikhailova, Natalia; Ovchinnikova, Galina; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2016-01-01

    Background The exponential growth of genomic data from next generation technologies renders traditional manual expert curation effort unsustainable. Many genomic systems have included community annotation tools to address the problem. Most of these systems adopted a “Wiki-based” approach to take advantage of existing wiki technologies, but encountered obstacles in issues such as usability, authorship recognition, information reliability and incentive for community participation. Results Here,...

  5. iNGS: a prototype tool for genome interpretation and annotation

    OpenAIRE

    Navas-Delgado, Ismael; García Godoy, María Jesús; Arjona-Pulido, Fátima; Castillo-Castillo, Trinidad; Ramos-Ostio, Ana Isabel; Ifantes Díaz, Sarai; Medina García, Ana; Aldana-Montes, José F.

    2013-01-01

    Currently, clinical interpretation of whole-genome NGS genetic findings are very low-throughput because of a lack of computational tools/software. The current bottleneck of whole-genome and whole-exome sequencing projects is in structured data management and sophisticated computational analysis of experimental data. In this work, we have started designing a platform for integrating, in a first step, existing analysis tools and adding annotations from public databases to the findings of these ...

  6. Evolutionary Interrogation of Human Biology in Well-Annotated Genomic Framework of Rhesus Macaque

    OpenAIRE

    Zhang, Shi-Jian; Liu, Chu-Jun; Yu, Peng; Zhong, Xiaoming; Chen, Jia-Yu; Yang, Xinzhuang; Peng, Jiguang; Yan, Shouyu; Wang, Chenqu; Zhu, Xiaotong; Xiong, Jingwei; Zhang, Yong E.; Tan, Bertrand Chin-Ming; Li, Chuan-Yun

    2014-01-01

    With genome sequence and composition highly analogous to human, rhesus macaque represents a unique reference for evolutionary studies of human biology. Here, we developed a comprehensive genomic framework of rhesus macaque, the RhesusBase2, for evolutionary interrogation of human genes and the associated regulations. A total of 1,667 next-generation sequencing (NGS) data sets were processed, integrated, and evaluated, generating 51.2 million new functional annotation records. With extensive N...

  7. The draft genome sequence and annotation of the desert woodrat Neotoma lepida.

    Science.gov (United States)

    Campbell, Michael; Oakeson, Kelly F; Yandell, Mark; Halpert, James R; Dearing, Denise

    2016-09-01

    We present the de novo draft genome sequence for a vertebrate mammalian herbivore, the desert woodrat (Neotoma lepida). This species is of ecological and evolutionary interest with respect to ingestion, microbial detoxification and hepatic metabolism of toxic plant secondary compounds from the highly toxic creosote bush (Larrea tridentata) and the juniper shrub (Juniperus monosperma). The draft genome sequence and annotation have been deposited at GenBank under the accession LZPO01000000. PMID:27408812

  8. SALAD database: a motif-based database of protein annotations for plant comparative genomics

    OpenAIRE

    Mihara, Motohiro; Itoh, Takeshi; Izawa, Takeshi

    2009-01-01

    Proteins often have several motifs with distinct evolutionary histories. Proteins with similar motifs have similar biochemical properties and thus related biological functions. We constructed a unique comparative genomics database termed the SALAD database (http://salad.dna.affrc.go.jp/salad/) from plant-genome-based proteome data sets. We extracted evolutionarily conserved motifs by MEME software from 209 529 protein-sequence annotation groups selected by BLASTP from the proteome data sets o...

  9. Genome Sequence and Annotation of Colletotrichum higginsianum, a Causal Agent of Crucifer Anthracnose Disease.

    Science.gov (United States)

    Zampounis, Antonios; Pigné, Sandrine; Dallery, Jean-Félix; Wittenberg, Alexander H J; Zhou, Shiguo; Schwartz, David C; Thon, Michael R; O'Connell, Richard J

    2016-01-01

    Colletotrichum higginsianum is an ascomycete fungus causing anthracnose disease on numerous cultivated plants in the family Brassicaceae, as well as the model plant Arabidopsis thaliana We report an assembly of the nuclear genome and gene annotation of this pathogen, which was obtained using a combination of PacBio long-read sequencing and optical mapping. PMID:27540062

  10. Complete Genome Sequence and Annotation of the Treponema pallidum subsp. pallidum Chicago Strain ▿

    Science.gov (United States)

    Giacani, Lorenzo; Jeffrey, Brendan M.; Molini, Barbara J.; Le, HoaVan T.; Lukehart, Sheila A.; Centurion-Lara, Arturo; Rockey, Daniel D.

    2010-01-01

    In syphilis research, the Nichols strain of Treponema pallidum, isolated in 1912, has been the most widely studied. Recently, important differences among T. pallidum strains emerged; therefore, we sequenced and annotated the Chicago strain genome to facilitate and encourage the use of this strain in studying the pathogenesis of syphilis. PMID:20348263

  11. Genome sequence and annotation of Streptomyces sp. W6 - putative natural producer of annimycin antibiotic

    Czech Academy of Sciences Publication Activity Database

    Chrudimský, Tomáš; Chroňáková, Alica; Petříčková, Kateřina; Petříček, Miroslav; Krištůfek, Václav

    Praha : Institute of Chemical Technology Prague, 2014. s. 102-103. ISBN 978-80-7080-887-0. [BioTech 2014 & 6 th Czech-Swiss Symposium with Exhibition. 11.06.2014-14.06.2014, Praha] Institutional support: RVO:60077344 ; RVO:61388971 Keywords : genome sequence * annotation * Streptomyces sp. W6 * annimycin antibiotic Subject RIV: EE - Microbiology, Virology

  12. Toward an Upgraded Honey Bee (Apis mellifera L.) Genome Annotation Using Proteogenomics.

    Science.gov (United States)

    McAfee, Alison; Harpur, Brock A; Michaud, Sarah; Beavis, Ronald C; Kent, Clement F; Zayed, Amro; Foster, Leonard J

    2016-02-01

    The honey bee is a key pollinator in agricultural operations as well as a model organism for studying the genetics and evolution of social behavior. The Apis mellifera genome has been sequenced and annotated twice over, enabling proteomics and functional genomics methods for probing relevant aspects of their biology. One troubling trend that emerged from proteomic analyses is that honey bee peptide samples consistently result in lower peptide identification rates compared with other organisms. This suggests that the genome annotation can be improved, or atypical biological processes are interfering with the mass spectrometry workflow. First, we tested whether high levels of polymorphisms could explain some of the missed identifications by searching spectra against the reference proteome (OGSv3.2) versus a customized proteome of a single honey bee, but our results indicate that this contribution was minor. Likewise, error-tolerant peptide searches lead us to eliminate unexpected post-translational modifications as a major factor in missed identifications. We then used a proteogenomic approach with ∼1500 raw files to search for missing genes and new exons, to revive discarded annotations and to identify over 2000 new coding regions. These results will contribute to a more comprehensive genome annotation and facilitate continued research on this important insect. PMID:26718741

  13. Complete Genome Sequence and Annotation of the Treponema pallidum subsp. pallidum Chicago Strain ▿

    OpenAIRE

    Giacani, Lorenzo; Jeffrey, Brendan M; Molini, Barbara J.; Le, HoaVan T.; Lukehart, Sheila A.; Centurion-Lara, Arturo; Rockey, Daniel D.

    2010-01-01

    In syphilis research, the Nichols strain of Treponema pallidum, isolated in 1912, has been the most widely studied. Recently, important differences among T. pallidum strains emerged; therefore, we sequenced and annotated the Chicago strain genome to facilitate and encourage the use of this strain in studying the pathogenesis of syphilis.

  14. ProtAnnot: an App for Integrated Genome Browser to display how alternative splicing and transcription affect proteins

    Science.gov (United States)

    Mall, Tarun; Eckstein, John; Norris, David; Vora, Hiral; Freese, Nowlan H.; Loraine, Ann E.

    2016-01-01

    Summary: One gene can produce multiple transcript variants encoding proteins with different functions. To facilitate visual analysis of transcript variants, we developed ProtAnnot, which shows protein annotations in the context of genomic sequence. ProtAnnot searches InterPro and displays profile matches (protein annotations) alongside gene models, exposing how alternative promoters, splicing and 3′ end processing add, remove, or remodel functional motifs. To draw attention to these effects, ProtAnnot color-codes exons by frame and displays a cityscape graphic summarizing exonic sequence at each position. These techniques make visual analysis of alternative transcripts faster and more convenient for biologists. Availability and implementation: ProtAnnot is a plug-in App for Integrated Genome Browser, an open source desktop genome browser available from http://www.bioviz.org. Contact: aloraine@uncc.edu PMID:27153567

  15. CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L. methylation filtered genomic genespace sequences

    Directory of Open Access Journals (Sweden)

    Spraggins Thomas A

    2007-04-01

    Full Text Available Abstract Background Cowpea [Vigna unguiculata (L. Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI, funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace recovered using methylation filtration technology and providing annotation and analysis of the sequence data. Description CGKB, Cowpea Genespace/Genomics Knowledge Base, is an annotation knowledge base developed under the CGI. The database is based on information derived from 298,848 cowpea genespace sequences (GSS isolated by methylation filtering of genomic DNA. The CGKB consists of three knowledge bases: GSS annotation and comparative genomics knowledge base, GSS enzyme and metabolic pathway knowledge base, and GSS simple sequence repeats (SSRs knowledge base for molecular marker discovery. A homology-based approach was applied for annotations of the GSS, mainly using BLASTX against four public FASTA formatted protein databases (NCBI GenBank Proteins, UniProtKB-Swiss-Prot, UniprotKB-PIR (Protein Information Resource, and UniProtKB-TrEMBL. Comparative genome analysis was done by BLASTX searches of the cowpea GSS against four plant proteomes from Arabidopsis thaliana, Oryza sativa, Medicago truncatula, and Populus trichocarpa. The possible exons and introns on each cowpea GSS were predicted using the HMM-based Genscan gene predication program and the

  16. Annotated features of domestic cat ??? Felis catus genome

    OpenAIRE

    Tamazian, Gaik; Simonov, Serguei; Dobrynin, Pavel; Makunin, Alexey; Logachev, Anton; Komissarov, Aleksey; Shevchenko, Andrey; Brukhin, Vladimir; Cherkasov, Nikolay; Svitin, Anton; Goldman, David B. (David Benjamin); Antunes, Agostinho; Quilez, Javier; Lorente-Gald??s, Bel??n, 1981-; Alkan, Can

    2014-01-01

    Background. Domestic cats enjoy an extensive veterinary medical surveillance which has described nearly 250 genetic diseases analogous to human disorders. Feline infectious agents offer powerful natural models of deadly human diseases, which include feline immunodeficiency virus, feline sarcoma virus and feline leukemia virus. A rich veterinary literature of feline disease pathogenesis and the demonstration of a highly conserved ancestral mammal genome organization make the cat genome annotat...

  17. Genome sequencing and annotation of Acinetobacter junii strain MTCC 11364

    Directory of Open Access Journals (Sweden)

    Indu Khatri

    2014-12-01

    Full Text Available The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 3.5 Mb draft genome of the Acinetobacter junii strain MTCC 11364. The genome has a G + C content of 38.0% and includes 3 rRNA genes (5S, 23S, 16S and 64 aminoacyl-tRNA synthetase genes.

  18. Genome sequencing, annotation of Citrobacter freundii strain GTC 09479

    Directory of Open Access Journals (Sweden)

    Kazuyuki Kimura

    2014-12-01

    Full Text Available We report the 4.9-Mb genome sequence of Citrobacter freundii strain GTC 09479, isolated from urine sample collected during the year 1983 at Gifu University Graduate School of Medicine, Japan. This draft genome consist of 4,899,578 bp with 51.62% G + C, 4,574 predicted CDSs, 72 tRNAs and 10 rRNAs.

  19. Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger

    Directory of Open Access Journals (Sweden)

    Grigoriev Igor V

    2009-02-01

    Full Text Available Abstract Background Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institutes (JGI and another predicted protein set from another A.niger sequence. Tandem mass spectra (MS/MS were acquired from 1d gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR. Results 405 identified peptide sequences were mapped to 214 different A.niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6% of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. Conclusion This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST data has been. A comparison of the published genome from another strain of A.niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method.

  20. Comparative Annotation of Viral Genomes with Non-Conserved Gene Structure

    DEFF Research Database (Denmark)

    de Groot, Saskia; Mailund, Thomas; Hein, Jotun

    2007-01-01

    -finding program, which allows for coding in unidirectional nested and overlapping reading frames, to annotate two homologous aligned viral genomes. Our method does not insist on conserved gene structure between the two sequences, thus making it applicable for the pairwise comparison of more distantly related......Motivation: Detecting genes in viral genomes is a complex task. Due to the biological necessity of them being constrained in length, RNA viruses in particular tend to code in overlapping reading frames. Since one amino acid is encoded by a triplet of nucleic acids, up to three genes may be coded...... sequences. Results: We apply our method to 15 pairwise alignments of six different HIV2 genomes. Given sufficient evolutionary distance between the two sequences, we achieve sensitivity of about 84% and specificity of about 97%. We additionally annotate three pairwise alignments of the more distantly...

  1. Genome sequencing and annotation of Amycolatopsis azurea DSM 43854T

    Science.gov (United States)

    Khatri, Indu; Subramanian, Srikrishna; Mayilraj, Shanmugam

    2014-01-01

    We report the 9.2 Mb genome of the azureomycin A and B antibiotic producing strain Amycolatopsis azurea isolated from a Japanese soil sample. The draft genome of strain DSM 43854T consists of 9,223,451 bp with a G + C content of 69.0% and the genome contains 3 rRNA genes (5S–23S–16S) and 58 aminoacyl-tRNA synthetase genes. The homology searches revealed that the PKS gene clusters are supposed to be responsible for the biosynthesis of naptomycin, macbecin, rifamycin, mitomycin, maduropeptin enediyne, neocarzinostatin enediyne, C-1027 enediyne, calicheamicin enediyne, landomycin, simocyclinone, medermycin, granaticin, polyketomycin, teicoplanin, balhimycin, vancomycin, staurosporine, rubradirin and complestatin. PMID:26484067

  2. Genome sequencing and annotation of Amycolatopsis azurea DSM 43854T

    Directory of Open Access Journals (Sweden)

    Indu Khatri

    2014-12-01

    Full Text Available We report the 9.2 Mb genome of the azureomycin A and B antibiotic producing strain Amycolatopsis azurea isolated from a Japanese soil sample. The draft genome of strain DSM 43854T consists of 9,223,451 bp with a G + C content of 69.0% and the genome contains 3 rRNA genes (5S–23S–16S and 58 aminoacyl-tRNA synthetase genes. The homology searches revealed that the PKS gene clusters are supposed to be responsible for the biosynthesis of naptomycin, macbecin, rifamycin, mitomycin, maduropeptin enediyne, neocarzinostatin enediyne, C-1027 enediyne, calicheamicin enediyne, landomycin, simocyclinone, medermycin, granaticin, polyketomycin, teicoplanin, balhimycin, vancomycin, staurosporine, rubradirin and complestatin.

  3. Genome sequencing and annotation of multidrug resistant Mycobacterium tuberculosis (MDR-TB PR10 strain

    Directory of Open Access Journals (Sweden)

    Mohd Zakihalani A. Halim

    2016-03-01

    Full Text Available Here, we report the draft genome sequence and annotation of a multidrug resistant Mycobacterium tuberculosis strain PR10 (MDR-TB PR10 isolated from a patient diagnosed with tuberculosis. The size of the draft genome MDR-TB PR10 is 4.34 Mbp with 65.6% of G + C content and consists of 4637 predicted genes. The determinants were categorized by RAST into 400 subsystems with 4286 coding sequences and 50 RNAs. The whole genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession number CP010968.

  4. Update on human genome completion and annotations: Gene nomenclature

    Directory of Open Access Journals (Sweden)

    Nebert Daniel W

    2003-11-01

    Full Text Available Abstract Why is agreeing on one particular name for each gene important? As one genome after another becomes sequenced, it is imperative to consider the complexity of genes, genetic architecture, gene expression, gene-gene and gene-product interactions and evolutionary relatedness across species. To agree on a particular gene name not only makes one's own research easier, but will also be helpful to the present generation, as well as future generations, of graduate students and postdoctoral fellows who are about to enter genomics research.

  5. Identification and annotation of promoter regions in microbial genome sequences on the basis of DNA stability

    Indian Academy of Sciences (India)

    Vetriselvi Rangannan; Manju Bansal

    2007-08-01

    Analysis of various predicted structural properties of promoter regions in prokaryotic as well as eukaryotic genomes had earlier indicated that they have several common features, such as lower stability, higher curvature and less bendability, when compared with their neighboring regions. Based on the difference in stability between neighboring upstream and downstream regions in the vicinity of experimentally determined transcription start sites, a promoter prediction algorithm has been developed to identify prokaryotic promoter sequences in whole genomes. The average free energy (E) over known promoter sequences and the difference (D) between E and the average free energy over the entire genome (G) are used to search for promoters in the genomic sequences. Using these cutoff values to predict promoter regions across entire Escherichia coli genome, we achieved a reliability of 70% when the predicted promoters were cross verified against the 960 transcription start sites (TSSs) listed in the Ecocyc database. Annotation of the whole E. coli genome for promoter region could be carried out with 49% accuracy. The method is quite general and it can be used to annotate the promoter regions of other prokaryotic genomes.

  6. Genome-wide functional annotation and structural verification of metabolic ORFeome of Chlamydomonas reinhardtii

    Directory of Open Access Journals (Sweden)

    Fan Changyu

    2011-06-01

    Full Text Available Abstract Background Recent advances in the field of metabolic engineering have been expedited by the availability of genome sequences and metabolic modelling approaches. The complete sequencing of the C. reinhardtii genome has made this unicellular alga a good candidate for metabolic engineering studies; however, the annotation of the relevant genes has not been validated and the much-needed metabolic ORFeome is currently unavailable. We describe our efforts on the functional annotation of the ORF models released by the Joint Genome Institute (JGI, prediction of their subcellular localizations, and experimental verification of their structural annotation at the genome scale. Results We assigned enzymatic functions to the translated JGI ORF models of C. reinhardtii by reciprocal BLAST searches of the putative proteome against the UniProt and AraCyc enzyme databases. The best match for each translated ORF was identified and the EC numbers were transferred onto the ORF models. Enzymatic functional assignment was extended to the paralogs of the ORFs by clustering ORFs using BLASTCLUST. In total, we assigned 911 enzymatic functions, including 886 EC numbers, to 1,427 transcripts. We further annotated the enzymatic ORFs by prediction of their subcellular localization. The majority of the ORFs are predicted to be compartmentalized in the cytosol and chloroplast. We verified the structure of the metabolism-related ORF models by reverse transcription-PCR of the functionally annotated ORFs. Following amplification and cloning, we carried out 454FLX and Sanger sequencing of the ORFs. Based on alignment of the 454FLX reads to the ORF predicted sequences, we obtained more than 90% coverage for more than 80% of the ORFs. In total, 1,087 ORF models were verified by 454 and Sanger sequencing methods. We obtained expression evidence for 98% of the metabolic ORFs in the algal cells grown under constant light in the presence of acetate. Conclusions We functionally

  7. Emerging applications of read profiles towards the functional annotation of the genome

    DEFF Research Database (Denmark)

    Pundhir, Sachin; Poirazi, Panayiota; Gorodkin, Jan

    2015-01-01

    Functional annotation of the genome is important to understand the phenotypic complexity of various species. The road toward functional annotation involves several challenges ranging from experiments on individual molecules to large-scale analysis of high-throughput sequencing (HTS) data. HTS data...... is typically a result of the protocol designed to address specific research questions. The sequencing results in reads, which when mapped to a reference genome often leads to the formation of distinct patterns (read profiles). Interpretation of these read profiles is essential for their analysis in relation...... to the research question addressed. Several strategies have been employed at varying levels of abstraction ranging from a somewhat ad hoc to a more systematic analysis of read profiles. These include methods which can compare read profiles, e.g., from direct (non-sequence based) alignments to classification...

  8. AnnTools: a comprehensive and versatile annotation toolkit for genomic variants

    OpenAIRE

    Makarov, Vladimir; O'Grady, Tina; Cai, Guiqing; Lihm, Jayon; Buxbaum, Joseph D; Yoon, Seungtai

    2012-01-01

    Summary: AnnTools is a versatile bioinformatics application designed for comprehensive annotation of a full spectrum of human genome variation: novel and known single-nucleotide substitutions (SNP/SNV), short insertions/deletions (INDEL) and structural variants/copy number variation (SV/CNV). The variants are interpreted by interrogating data compiled from 15 constantly updated sources. In addition to detailed functional characterization of the coding variants, AnnTools searches for overlaps ...

  9. Leveraging Genomic Annotations and Pleiotropic Enrichment for Improved Replication Rates in Schizophrenia GWAS

    OpenAIRE

    2016-01-01

    Most of the genetic architecture of schizophrenia (SCZ) has not yet been identified. Here, we apply a novel statistical algorithm called Covariate-Modulated Mixture Modeling (CM3), which incorporates auxiliary information (heterozygosity, total linkage disequilibrium, genomic annotations, pleiotropy) for each single nucleotide polymorphism (SNP) to enable more accurate estimation of replication probabilities, conditional on the observed test statistic (“z-score”) of the SNP. We use a multiple...

  10. ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes

    OpenAIRE

    Hu, Gang-Qing; Zheng, Xiaobin; Yang, Yi-Fan; Ortet, Philippe; She, Zhen-Su; Zhu, Huaiqiu

    2007-01-01

    Correct annotation of translation initiation site (TIS) is essential for both experiments and bioinformatics studies of prokaryotic translation initiation mechanism as well as understanding of gene regulation and gene structure. Here we describe a comprehensive database ProTISA, which collects TIS confirmed through a variety of available evidences for prokaryotic genomes, including Swiss-Prot experiments record, literature, conserved domain hits and sequence alignment between orthologous gene...

  11. The Genome Sequence of Leishmania (Leishmania) amazonensis: Functional Annotation and Extended Analysis of Gene Models

    OpenAIRE

    Real, Fernando; Vidal, Ramon Oliveira; Carazzolle, Marcelo Falsarella; Mondego, Jorge Maurício Costa; Costa, Gustavo Gilson Lacerda; Herai, Roberto Hirochi; Würtele, Martin; de Carvalho, Lucas Miguel; e Ferreira, Renata Carmona; Mortara, Renato Arruda; Barbiéri, Clara Lucia; Mieczkowski, Piotr; da Silveira, José Franco; Briones, Marcelo Ribeiro da Silva; Pereira, Gonçalo Amarante Guimarães

    2013-01-01

    We present the sequencing and annotation of the Leishmania (Leishmania) amazonensis genome, an etiological agent of human cutaneous leishmaniasis in the Amazon region of Brazil. L. (L.) amazonensis shares features with Leishmania (L.) mexicana but also exhibits unique characteristics regarding geographical distribution and clinical manifestations of cutaneous lesions (e.g. borderline disseminated cutaneous leishmaniasis). Predicted genes were scored for orthologous gene families and conserved...

  12. Transcript Mapping and Genome Annotation of Ascidian mtDNA Using EST Data

    OpenAIRE

    Gissi, Carmela; Pesole, Graziano

    2003-01-01

    Mitochondrial transcripts of two ascidian species were reconstructed through sequence assembly of publicly available ESTs resembling mitochondrial DNA sequences (mt-ESTs). This strategy allowed us to analyze processing and mapping of the mitochondrial transcripts and to investigate the gene organization of a previously uncharacterized mitochondrial genome (mtDNA). This new strategy would greatly facilitate the sequencing and annotation of mtDNAs. In Ciona intestinalis, the assembled mt-...

  13. Automated correction of genome sequence errors

    OpenAIRE

    Gajer, Pawel; Schatz, Michael; Salzberg, Steven L

    2004-01-01

    By using information from an assembly of a genome, a new program called AutoEditor significantly improves base calling accuracy over that achieved by previous algorithms. This in turn improves the overall accuracy of genome sequences and facilitates the use of these sequences for polymorphism discovery. We describe the algorithm and its application in a large set of recent genome sequencing projects. The number of erroneous base calls in these projects was reduced by 80%. In an analysis of ov...

  14. Genome sequencing and annotation of Amycolatopsis vancoresmycina strain DSM 44592T

    Directory of Open Access Journals (Sweden)

    Navjot Kaur

    2014-12-01

    Full Text Available We report the 9.0-Mb draft genome of Amycolatopsis vancoresmycina strain DSM 44592T, isolated from Indian soil sample; produces antibiotic vancoresmycin. Draft genome of strain DSM44592T consists of 9,037,069 bp with a G+C content of 71.79% and 8340 predicted protein coding genes and 57 RNAs. RAST annotation indicates that strains Streptomyces sp. AA4 (score 521, Saccharomonospora viridis DSM 43017 (score 400 and Actinosynnema mirum DSM 43827 (score 372 are the closest neighbors of the strain DSM 44592T.

  15. Ab initio gene identification: prokaryote genome annotation with GeneScan and GLIMMER

    Indian Academy of Sciences (India)

    Gautam Aggarwal; Ramakrishna Ramaswamy

    2002-02-01

    We compare the annotation of three complete genomes using the ab initio methods of gene identification GeneScan and GLIMMER. The annotation given in GenBank, the standard against which these are compared, has been made using GeneMark. We find a number of novel genes which are predicted by both methods used here, as well as a number of genes that are predicted by GeneMark, but are not identified by either of the nonconsensus methods that we have used. The three organisms studied here are all prokaryotic species with fairly compact genomes. The Fourier measure forms the basis for an efficient non-consensus method for gene prediction, and the algorithm GeneScan exploits this measure. We have bench-marked this program as well as GLIMMER using 3 complete prokaryotic genomes. An effort has also been made to study the limitations of these techniques for complete genome analysis. GeneScan and GLIMMER are of comparable accuracy insofar as gene-identification is concerned, with sensitivities and specificities typically greater than 0.9. The number of false predictions (both positive and negative) is higher for GeneScan as compared to GLIMMER, but in a significant number of cases, similar results are provided by the two techniques. This suggests that there could be some as-yet unidentified additional genes in these three genomes, and also that some of the putative identifications made hitherto might require re-evaluation. All these cases are discussed in detail.

  16. Sequencing and genome annotation of honey bee microsporidia parasite, Nosema apis and comparative genome analysis with its sympatric congener, N. ceranae

    Science.gov (United States)

    Here we present a draft genome sequence and annotation of the honey bee microsporidian parasite, Nosema apis. We applied the whole-genome shotgun (WGS) sequencing approach to sequence and assemble the genome of N. apis to 22-fold sequence coverage. We predicted 2927 protein-coding genes in the N. ...

  17. The fast changing landscape of sequencing technologies and their impact on microbial genome assemblies and annotation.

    Directory of Open Access Journals (Sweden)

    Konstantinos Mavromatis

    Full Text Available BACKGROUND: The emergence of next generation sequencing (NGS has provided the means for rapid and high throughput sequencing and data generation at low cost, while concomitantly creating a new set of challenges. The number of available assembled microbial genomes continues to grow rapidly and their quality reflects the quality of the sequencing technology used, but also of the analysis software employed for assembly and annotation. METHODOLOGY/PRINCIPAL FINDINGS: In this work, we have explored the quality of the microbial draft genomes across various sequencing technologies. We have compared the draft and finished assemblies of 133 microbial genomes sequenced at the Department of Energy-Joint Genome Institute and finished at the Los Alamos National Laboratory using a variety of combinations of sequencing technologies, reflecting the transition of the institute from Sanger-based sequencing platforms to NGS platforms. The quality of the public assemblies and of the associated gene annotations was evaluated using various metrics. Results obtained with the different sequencing technologies, as well as their effects on downstream processes, were analyzed. Our results demonstrate that the Illumina HiSeq 2000 sequencing system, the primary sequencing technology currently used for de novo genome sequencing and assembly at JGI, has various advantages in terms of total sequence throughput and cost, but it also introduces challenges for the downstream analyses. In all cases assembly results although on average are of high quality, need to be viewed critically and consider sources of errors in them prior to analysis. CONCLUSION: These data follow the evolution of microbial sequencing and downstream processing at the JGI from draft genome sequences with large gaps corresponding to missing genes of significant biological role to assemblies with multiple small gaps (Illumina and finally to assemblies that generate almost complete genomes (Illumina+PacBio.

  18. The Fast Changing Landscape of Sequencing Technologies and Their Impact on Microbial Genome Assemblies and Annotation

    Energy Technology Data Exchange (ETDEWEB)

    Mavromatis, K [U.S. Department of Energy, Joint Genome Institute; Land, Miriam L [ORNL; Brettin, Thomas S [ORNL; Quest, Daniel J [ORNL; Copeland, A [U.S. Department of Energy, Joint Genome Institute; Clum, Alicia [U.S. Department of Energy, Joint Genome Institute; Goodwin, Lynne A. [Los Alamos National Laboratory (LANL); Woyke, Tanja [U.S. Department of Energy, Joint Genome Institute; Lapidus, Alla L. [U.S. Department of Energy, Joint Genome Institute; Klenk, Hans-Peter [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Cottingham, Robert W [ORNL; Kyrpides, Nikos C [U.S. Department of Energy, Joint Genome Institute

    2012-01-01

    Background: The emergence of next generation sequencing (NGS) has provided the means for rapid and high throughput sequencing and data generation at low cost, while concomitantly creating a new set of challenges. The number of available assembled microbial genomes continues to grow rapidly and their quality reflects the quality of the sequencing technology used, but also of the analysis software employed for assembly and annotation. Methodology/Principal Findings: In this work, we have explored the quality of the microbial draft genomes across various sequencing technologies. We have compared the draft and finished assemblies of 133 microbial genomes sequenced at the Department of Energy-Joint Genome Institute and finished at the Los Alamos National Laboratory using a variety of combinations of sequencing technologies, reflecting the transition of the institute from Sanger-based sequencing platforms to NGS platforms. The quality of the public assemblies and of the associated gene annotations was evaluated using various metrics. Results obtained with the different sequencing technologies, as well as their effects on downstream processes, were analyzed. Our results demonstrate that the Illumina HiSeq 2000 sequencing system, the primary sequencing technology currently used for de novo genome sequencing and assembly at JGI, has various advantages in terms of total sequence throughput and cost, but it also introduces challenges for the downstream analyses. In all cases assembly results although on average are of high quality, need to be viewed critically and consider sources of errors in them prior to analysis. Conclusion: These data follow the evolution of microbial sequencing and downstream processing at the JGI from draft genome sequences with large gaps corresponding to missing genes of significant biological role to assemblies with multiple small gaps (Illumina) and finally to assemblies that generate almost complete genomes (Illumina+PacBio).

  19. Emerging applications of read profiles towards the functional annotation of the genome

    Directory of Open Access Journals (Sweden)

    Sachin Pundhir

    2015-05-01

    Full Text Available Functional annotation of the genome in various species is important to understand their phenotypic complexity. The road towards functional annotation involves several challenges ranging from experiments on individual molecules to large-scale analysis of high-throughput sequencing (HTS data. HTS data is typically a result of the protocol designed to address specific research questions. The sequencing results in reads, which when mapped to a reference genome often leads to the formation of distinct patterns (read profiles. Interpretation of these read profiles are essential for the analysis in relation to the research question addressed. Several strategies have been employed at varying levels of abstraction ranging from a somewhat ad hoc to a more systematic analysis of read profiles. These include methods which can compare read profiles, e.g. from direct (non-sequence based alignments to classification of patterns into functional groups. In this review, we highlight the emerging applications of read profiles for the annotation of non-coding RNA and cis-regulatory regions such as enhancers and promoters. We also discuss the biological rationale behind their formation.

  20. Genome, functional gene annotation, and nuclear transformation of the heterokont oleaginous alga Nannochloropsis oceanica CCMP1779.

    Directory of Open Access Journals (Sweden)

    Astrid Vieler

    Full Text Available Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica-specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis

  1. Discovery and annotation of small proteins using genomics, proteomics and computational approaches

    Energy Technology Data Exchange (ETDEWEB)

    Yang, Xiaohan; Tschaplinski, Timothy J.; Hurst, Gregory B.; Jawdy, Sara; Abraham, Paul E.; Lankford, Patricia K.; Adams, Rachel M.; Shah, Manesh B.; Hettich, Robert L.; Lindquist, Erika; Kalluri, Udaya C.; Gunter, Lee E.; Pennacchio, Christa; Tuskan, Gerald A.

    2011-03-02

    Small proteins (10 200 amino acids aa in length) encoded by short open reading frames (sORF) play important regulatory roles in various biological processes, including tumor progression, stress response, flowering, and hormone signaling. However, ab initio discovery of small proteins has been relatively overlooked. Recent advances in deep transcriptome sequencing make it possible to efficiently identify sORFs at the genome level. In this study, we obtained 2.6 million expressed sequence tag (EST) reads from Populus deltoides leaf transcriptome and reconstructed full-length transcripts from the EST sequences. We identified an initial set of 12,852 sORFs encoding proteins of 10 200 aa in length. Three computational approaches were then used to enrich for bona fide protein-coding sORFs from the initial sORF set: (1) codingpotential prediction, (2) evolutionary conservation between P. deltoides and other plant species, and (3) gene family clustering within P. deltoides. As a result, a high-confidence sORF candidate set containing 1469 genes was obtained. Analysis of the protein domains, non-protein-coding RNA motifs, sequence length distribution, and protein mass spectrometry data supported this high-confidence sORF set. In the high-confidence sORF candidate set, known protein domains were identified in 1282 genes (higher-confidence sORF candidate set), out of which 611 genes, designated as highest-confidence candidate sORF set, were supported by proteomics data. Of the 611 highest-confidence candidate sORF genes, 56 were new to the current Populus genome annotation. This study not only demonstrates that there are potential sORF candidates to be annotated in sequenced genomes, but also presents an efficient strategy for discovery of sORFs in species with no genome annotation yet available.

  2. Citrus sinensis annotation project (CAP: a comprehensive database for sweet orange genome.

    Directory of Open Access Journals (Sweden)

    Jia Wang

    Full Text Available Citrus is one of the most important and widely grown fruit crop with global production ranking firstly among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production both in fresh fruit and processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia, and constructed the Citrus sinensis annotation project (CAP to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to the researchers of sweet orange and other woody plants, which is freely available at http://citrus.hzau.edu.cn/.

  3. Snat: a SNP annotation tool for bovine by integrating various sources of genomic information

    Directory of Open Access Journals (Sweden)

    Liu Jian-Feng

    2011-10-01

    Full Text Available Abstract Background Most recently, with maturing of bovine genome sequencing and high throughput SNP genotyping technologies, a large number of significant SNPs associated with economic important traits can be identified by genome-wide association studies (GWAS. To further determine true association findings in GWAS, the common strategy is to sift out most promising SNPs for follow-up replication studies. Hence it is crucial to explore the functional significance of the candidate SNPs in order to screen and select the potential functional ones. To systematically prioritize these statistically significant SNPs and facilitate follow-up replication studies, we developed a bovine SNP annotation tool (Snat based on a web interface. Results With Snat, various sources of genomic information are integrated and retrieved from several leading online databases, including SNP information from dbSNP, gene information from Entrez Gene, protein features from UniProt, linkage information from AnimalQTLdb, conserved elements from UCSC Genome Browser Database and gene functions from Gene Ontology (GO, KEGG PATHWAY and Online Mendelian Inheritance in Animals (OMIA. Snat provides two different applications, including a CGI-based web utility and a command-line version, to access the integrated database, target any single nucleotide loci of interest and perform multi-level functional annotations. For further validation of the practical significance of our study, SNPs involved in two commercial bovine SNP chips, i.e., the Affymetrix Bovine 10K chip array and the Illumina 50K chip array, have been annotated by Snat, and the corresponding outputs can be directly downloaded from Snat website. Furthermore, a real dataset involving 20 identified SNPs associated with milk yield in our recent GWAS was employed to demonstrate the practical significance of Snat. Conclusions To our best knowledge, Snat is one of first tools focusing on SNP annotation for livestock. Snat confers

  4. High-density rhesus macaque oligonucleotide microarray design using early-stage rhesus genome sequence information and human genome annotations

    Directory of Open Access Journals (Sweden)

    Magness Charles L

    2007-01-01

    a closely related species. Conclusion The number of different genes represented on microarrays for unfinished genomes can be greatly increased by matching known gene transcript annotations from a closely related species with sequence data from the unfinished genome. Signal intensity on both EST- and genome-derived arrays was highly correlated with probe distance from the 3' UTR, information often missing from ESTs yet present in early-stage genome projects.

  5. The Saccharomyces Genome Database: Gene Product Annotation of Function, Process, and Component.

    Science.gov (United States)

    Cherry, J Michael

    2015-12-01

    An ontology is a highly structured form of controlled vocabulary. Each entry in the ontology is commonly called a term. These terms are used when talking about an annotation. However, each term has a definition that, like the definition of a word found within a dictionary, provides the complete usage and detailed explanation of the term. It is critical to consult a term's definition because the distinction between terms can be subtle. The use of ontologies in biology started as a way of unifying communication between scientific communities and to provide a standard dictionary for different topics, including molecular functions, biological processes, mutant phenotypes, chemical properties and structures. The creation of ontology terms and their definitions often requires debate to reach agreement but the result has been a unified descriptive language used to communicate knowledge. In addition to terms and definitions, ontologies require a relationship used to define the type of connection between terms. In an ontology, a term can have more than one parent term, the term above it in an ontology, as well as more than one child, the term below it in the ontology. Many ontologies are used to construct annotations in the Saccharomyces Genome Database (SGD), as in all modern biological databases; however, Gene Ontology (GO), a descriptive system used to categorize gene function, is the most extensively used ontology in SGD annotations. Examples included in this protocol illustrate the structure and features of this ontology. PMID:26631125

  6. Automated genome mining for natural products

    Directory of Open Access Journals (Sweden)

    Zajkowski James

    2009-06-01

    Full Text Available Abstract Background Discovery of new medicinal agents from natural sources has largely been an adventitious process based on screening of plant and microbial extracts combined with bioassay-guided identification and natural product structure elucidation. Increasingly rapid and more cost-effective genome sequencing technologies coupled with advanced computational power have converged to transform this trend toward a more rational and predictive pursuit. Results We have developed a rapid method of scanning genome sequences for multiple polyketide, nonribosomal peptide, and mixed combination natural products with output in a text format that can be readily converted to two and three dimensional structures using conventional software. Our open-source and web-based program can assemble various small molecules composed of twenty standard amino acids and twenty two other chain-elongation intermediates used in nonribosomal peptide systems, and four acyl-CoA extender units incorporated into polyketides by reading a hidden Markov model of DNA. This process evaluates and selects the substrate specificities along the assembly line of nonribosomal synthetases and modular polyketide synthases. Conclusion Using this approach we have predicted the structures of natural products from a diverse range of bacteria based on a limited number of signature sequences. In accelerating direct DNA to metabolomic analysis, this method bridges the interface between chemists and biologists and enables rapid scanning for compounds with potential therapeutic value.

  7. The de novo genome assembly and annotation of a female domestic dromedary of North African origin.

    Science.gov (United States)

    Fitak, Robert R; Mohandesan, Elmira; Corander, Jukka; Burger, Pamela A

    2016-01-01

    The single-humped dromedary (Camelus dromedarius) is the most numerous and widespread of domestic camel species and is a significant source of meat, milk, wool, transportation and sport for millions of people. Dromedaries are particularly well adapted to hot, desert conditions and harbour a variety of biological and physiological characteristics with evolutionary, economic and medical importance. To understand the genetic basis of these traits, an extensive resource of genomic variation is required. In this study, we assembled at 65× coverage, a 2.06 Gb draft genome of a female dromedary whose ancestry can be traced to an isolated population from the Canary Islands. We annotated 21 167 protein-coding genes and estimated ~33.7% of the genome to be repetitive. A comparison with the recently published draft genome of an Arabian dromedary resulted in 1.91 Gb of aligned sequence with a divergence of 0.095%. An evaluation of our genome with the reference revealed that our assembly contains more error-free bases (91.2%) and fewer scaffolding errors. We identified ~1.4 million single-nucleotide polymorphisms with a mean density of 0.71 × 10(-3) per base. An analysis of demographic history indicated that changes in effective population size corresponded with recent glacial epochs. Our de novo assembly provides a useful resource of genomic variation for future studies of the camel's adaptations to arid environments and economically important traits. Furthermore, these results suggest that draft genome assemblies constructed with only two differently sized sequencing libraries can be comparable to those sequenced using additional library sizes, highlighting that additional resources might be better placed in technologies alternative to short-read sequencing to physically anchor scaffolds to genome maps. PMID:26178449

  8. New local potential useful for genome annotation and 3D modeling

    Energy Technology Data Exchange (ETDEWEB)

    Chandonia, John-Marc; Cohen, Fred E.

    2003-07-17

    A new potential energy function representing the conformational preferences of sequentially local regions of a protein backbone is presented. This potential is derived from secondary structure probabilities such as those produced by neural network-based prediction methods. The potential is applied to the problem of remote homolog identification, in combination with a distance dependent inter-residue potential and position-based scoring matrices. This fold recognition jury is implemented in a Java application called JThread. These methods are benchmarked on several test sets, including one released entirely after development and parameterization of JThread. In benchmark tests to identify known folds structurally similar (but not identical) to the native structure of a sequence, JThread performs significantly better than PSI-BLAST, with 10 percent more structures correctly identified as the most likely structural match in a fold library, and 20 percent more structures correctly narrowed down to a set of five possible candidates. JThread also significantly improves the average sequence alignment accuracy, from 53 percent to 62 percent of residues correctly aligned. Reliable fold assignments and alignments are identified, making the method useful for genome annotation. JThread is applied to predicted open reading frames (ORFs) from the genomes of Mycoplasma genitalium and Drosophila melanogaster, identifying 20 new structural annotations in the former and 801 in the latter.

  9. DNA-binding protein prediction using plant specific support vector machines: validation and application of a new genome annotation tool.

    Science.gov (United States)

    Motion, Graham B; Howden, Andrew J M; Huitema, Edgar; Jones, Susan

    2015-12-15

    There are currently 151 plants with draft genomes available but levels of functional annotation for putative protein products are low. Therefore, accurate computational predictions are essential to annotate genomes in the first instance, and to provide focus for the more costly and time consuming functional assays that follow. DNA-binding proteins are an important class of proteins that require annotation, but current computational methods are not applicable for genome wide predictions in plant species. Here, we explore the use of species and lineage specific models for the prediction of DNA-binding proteins in plants. We show that a species specific support vector machine model based on Arabidopsis sequence data is more accurate (accuracy 81%) than a generic model (74%), and based on this we develop a plant specific model for predicting DNA-binding proteins. We apply this model to the tomato proteome and demonstrate its ability to perform accurate high-throughput prediction of DNA-binding proteins. In doing so, we have annotated 36 currently uncharacterised proteins by assigning a putative DNA-binding function. Our model is publically available and we propose it be used in combination with existing tools to help increase annotation levels of DNA-binding proteins encoded in plant genomes. PMID:26304539

  10. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud[OPEN

    Science.gov (United States)

    Merchant, Nirav

    2016-01-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today’s pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant’s Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching. PMID:27020957

  11. Discovery and annotation of small proteins using genomics, proteomics, and computational approaches

    Energy Technology Data Exchange (ETDEWEB)

    Yang, Xiaohan [ORNL; Tschaplinski, Timothy J [ORNL; Hurst, Gregory {Greg} B [ORNL; Jawdy, Sara [ORNL; Abraham, Paul E [ORNL; Lankford, Patricia K [ORNL; Adams, Rachel M [ORNL; Shah, Manesh B [ORNL; Hettich, Robert {Bob} L [ORNL; Kalluri, Udaya C [ORNL; Gunter, Lee E [ORNL; Pennacchio, Christa [U.S. Department of Energy, Joint Genome Institute; Tuskan, Gerald A [ORNL

    2011-01-01

    Small proteins (10 200 amino acids (AA) in length) encoded by short open reading frames (sORF) play important regulatory roles in various biological processes, including tumor progression, stress response, flowering and hormone signaling. However, ab initio discovery of small proteins has been relatively overlooked. Recent advances in deep transcriptome sequencing make it possible to efficiently identify sORFs at the genome level. In this study, we obtained ~2.6 million expressed sequence tag (EST) reads from Populus deltoides leaf transcriptome and reconstructed full-length transcripts from the EST sequences. We identified an initial set of 12,852 sORFs encoding proteins of 10 200 AA in length. Three computational approaches were then used to enrich for bona fide protein-coding sORFs from the initial sORF set: 1) coding-potential prediction, 2) evolutionary conservation between P. deltoides and other plant species, and 3) gene family clustering within P. deltoides. As a result, a high-confidence sORF candidate set containing 1,469 genes was obtained. Analysis of the protein domains, non-protein-coding RNA motifs, sequence length distribution, and protein mass spectrometry data supported this high-confidence sORF set. In the high-confidence sORF candidate set, known protein domains were identified in 1,282 genes (higher-confidence sORF candidate set), out of which 611 genes, designated as highest-confidence candidate sORF set, were also supported by proteomics data. This study not only demonstrates that there are potential sORF candidates to be annotated in sequenced genomes, but also presents an efficient strategy for discovery of sORFs in species with no genome annotation yet available.

  12. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production

    Science.gov (United States)

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac

  13. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production.

    Directory of Open Access Journals (Sweden)

    Nupoor Chowdhary

    Full Text Available Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2 production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs. Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs, 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach, we strongly

  14. MADAP, a flexible clustering tool for the interpretation of one-dimensional genome annotation data.

    Science.gov (United States)

    Schmid, Christoph D; Sengstag, Thierry; Bucher, Philipp; Delorenzi, Mauro

    2007-07-01

    A recurring task in the analysis of mass genome annotation data from high-throughput technologies is the identification of peaks or clusters in a noisy signal profile. Examples of such applications are the definition of promoters on the basis of transcription start site profiles, the mapping of transcription factor binding sites based on ChIP-chip data and the identification of quantitative trait loci (QTL) from whole genome SNP profiles. Input to such an analysis is a set of genome coordinates associated with counts or intensities. The output consists of a discrete number of peaks with respective volumes, extensions and center positions. We have developed for this purpose a flexible one-dimensional clustering tool, called MADAP, which we make available as a web server and as standalone program. A set of parameters enables the user to customize the procedure to a specific problem. The web server, which returns results in textual and graphical form, is useful for small to medium-scale applications, as well as for evaluation and parameter tuning in view of large-scale applications, requiring a local installation. The program written in C++ can be freely downloaded from ftp://ftp.epd.unil.ch/pub/software/unix/madap. The MADAP web server can be accessed at http://www.isrec.isb-sib.ch/madap/. PMID:17526516

  15. Gene discovery in the hamster: a comparative genomics approach for gene annotation by sequencing of hamster testis cDNAs

    Directory of Open Access Journals (Sweden)

    Khan Shafiq A

    2003-06-01

    Full Text Available Abstract Background Complete genome annotation will likely be achieved through a combination of computer-based analysis of available genome sequences combined with direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish genomes. Results 735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic locations of each sequence were mapped in human, mouse, rat and pufferfish, where applicable, and the structure of their cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque. Conclusion The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that they may play roles in the development and/or function of testicular cells.

  16. Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots

    Directory of Open Access Journals (Sweden)

    Sujai eKumar

    2013-11-01

    Full Text Available Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratisation of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory reared model organisms. They are often small, and cannot be isolated free of their environment - whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programmes. Here we present an approach to extracting from mixed DNA sequence data subsets that correspond to single species' genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage and biological (best-matching sequence in annotated databases indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimised assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots. We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from http://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer.

  17. EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries

    Directory of Open Access Journals (Sweden)

    Pardinas Jose R

    2008-04-01

    Full Text Available Abstract Background Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. Results We have developed "EST Express", a suite of analytical tools that identify and annotate ESTs originating from specific mRNA populations. The software consists of a user-friendly GUI powered by PHP and MySQL that allows for online collaboration between researchers and continuity with UniGene, Entrez Gene and RefSeq. Two key features of the software include a novel, simplified Entrez Gene parser and tools to manage cDNA library sequencing projects. We have tested the software on a large data set (2,016 samples produced by subtractive hybridization. Conclusion EST Express is an open-source, cross-platform web server application that imports sequences from cDNA libraries, such as those generated through subtractive hybridization or yeast two-hybrid screens. It then provides several layers of annotation based on Entrez Gene and RefSeq to allow the user to highlight useful genes and manage cDNA library projects.

  18. Automated genome mining of ribosomal peptide natural products

    Energy Technology Data Exchange (ETDEWEB)

    Mohimani, Hosein; Kersten, Roland; Liu, Wei; Wang, Mingxun; Purvine, Samuel O.; Wu, Si; Brewer, Heather M.; Pasa-Tolic, Ljiljana; Bandeira, Nuno; Moore, Bradley S.; Pevzner, Pavel A.; Dorrestein, Pieter C.

    2014-07-31

    Ribosomally synthesized and posttranslationally modified peptides (RiPPs), especially from microbial sources, are a large group of bioactive natural products that are a promising source of new (bio)chemistry and bioactivity (1). In light of exponentially increasing microbial genome databases and improved mass spectrometry (MS)-based metabolomic platforms, there is a need for computational tools that connect natural product genotypes predicted from microbial genome sequences with their corresponding chemotypes from metabolomic datasets. Here, we introduce RiPPquest, a tandem mass spectrometry database search tool for identification of microbial RiPPs and apply it for lanthipeptide discovery. RiPPquest uses genomics to limit search space to the vicinity of RiPP biosynthetic genes and proteomics to analyze extensive peptide modifications and compute p-values of peptide-spectrum matches (PSMs). We highlight RiPPquest by connection of multiple RiPPs from extracts of Streptomyces to their gene clusters and by the discovery of a new class III lanthipeptide, informatipeptin, from Streptomyces viridochromogenes DSM 40736 as the first natural product to be identified in an automated fashion by genome mining. The presented tool is available at cy-clo.ucsd.edu.

  19. Integration and Querying of Genomic and Proteomic Semantic Annotations for Biomedical Knowledge Extraction.

    Science.gov (United States)

    Masseroli, Marco; Canakoglu, Arif; Ceri, Stefano

    2016-01-01

    Understanding complex biological phenomena involves answering complex biomedical questions on multiple biomolecular information simultaneously, which are expressed through multiple genomic and proteomic semantic annotations scattered in many distributed and heterogeneous data sources; such heterogeneity and dispersion hamper the biologists' ability of asking global queries and performing global evaluations. To overcome this problem, we developed a software architecture to create and maintain a Genomic and Proteomic Knowledge Base (GPKB), which integrates several of the most relevant sources of such dispersed information (including Entrez Gene, UniProt, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome, and OMIM). Our solution is general, as it uses a flexible, modular, and multilevel global data schema based on abstraction and generalization of integrated data features, and a set of automatic procedures for easing data integration and maintenance, also when the integrated data sources evolve in data content, structure, and number. These procedures also assure consistency, quality, and provenance tracking of all integrated data, and perform the semantic closure of the hierarchical relationships of the integrated biomedical ontologies. At http://www.bioinformatics.deib.polimi.it/GPKB/, a Web interface allows graphical easy composition of queries, although complex, on the knowledge base, supporting also semantic query expansion and comprehensive explorative search of the integrated data to better sustain biomedical knowledge extraction. PMID:27045824

  20. Leveraging Genomic Annotations and Pleiotropic Enrichment for Improved Replication Rates in Schizophrenia GWAS.

    Science.gov (United States)

    Wang, Yunpeng; Thompson, Wesley K; Schork, Andrew J; Holland, Dominic; Chen, Chi-Hua; Bettella, Francesco; Desikan, Rahul S; Li, Wen; Witoelar, Aree; Zuber, Verena; Devor, Anna; Nöthen, Markus M; Rietschel, Marcella; Chen, Qiang; Werge, Thomas; Cichon, Sven; Weinberger, Daniel R; Djurovic, Srdjan; O'Donovan, Michael; Visscher, Peter M; Andreassen, Ole A; Dale, Anders M

    2016-01-01

    Most of the genetic architecture of schizophrenia (SCZ) has not yet been identified. Here, we apply a novel statistical algorithm called Covariate-Modulated Mixture Modeling (CM3), which incorporates auxiliary information (heterozygosity, total linkage disequilibrium, genomic annotations, pleiotropy) for each single nucleotide polymorphism (SNP) to enable more accurate estimation of replication probabilities, conditional on the observed test statistic ("z-score") of the SNP. We use a multiple logistic regression on z-scores to combine information from auxiliary information to derive a "relative enrichment score" for each SNP. For each stratum of these relative enrichment scores, we obtain nonparametric estimates of posterior expected test statistics and replication probabilities as a function of discovery z-scores, using a resampling-based approach that repeatedly and randomly partitions meta-analysis sub-studies into training and replication samples. We fit a scale mixture of two Gaussians model to each stratum, obtaining parameter estimates that minimize the sum of squared differences of the scale-mixture model with the stratified nonparametric estimates. We apply this approach to the recent genome-wide association study (GWAS) of SCZ (n = 82,315), obtaining a good fit between the model-based and observed effect sizes and replication probabilities. We observed that SNPs with low enrichment scores replicate with a lower probability than SNPs with high enrichment scores even when both they are genome-wide significant (p < 5x10-8). There were 693 and 219 independent loci with model-based replication rates ≥80% and ≥90%, respectively. Compared to analyses not incorporating relative enrichment scores, CM3 increased out-of-sample yield for SNPs that replicate at a given rate. This demonstrates that replication probabilities can be more accurately estimated using prior enrichment information with CM3. PMID:26808560

  1. Leveraging Genomic Annotations and Pleiotropic Enrichment for Improved Replication Rates in Schizophrenia GWAS.

    Directory of Open Access Journals (Sweden)

    Yunpeng Wang

    2016-01-01

    Full Text Available Most of the genetic architecture of schizophrenia (SCZ has not yet been identified. Here, we apply a novel statistical algorithm called Covariate-Modulated Mixture Modeling (CM3, which incorporates auxiliary information (heterozygosity, total linkage disequilibrium, genomic annotations, pleiotropy for each single nucleotide polymorphism (SNP to enable more accurate estimation of replication probabilities, conditional on the observed test statistic ("z-score" of the SNP. We use a multiple logistic regression on z-scores to combine information from auxiliary information to derive a "relative enrichment score" for each SNP. For each stratum of these relative enrichment scores, we obtain nonparametric estimates of posterior expected test statistics and replication probabilities as a function of discovery z-scores, using a resampling-based approach that repeatedly and randomly partitions meta-analysis sub-studies into training and replication samples. We fit a scale mixture of two Gaussians model to each stratum, obtaining parameter estimates that minimize the sum of squared differences of the scale-mixture model with the stratified nonparametric estimates. We apply this approach to the recent genome-wide association study (GWAS of SCZ (n = 82,315, obtaining a good fit between the model-based and observed effect sizes and replication probabilities. We observed that SNPs with low enrichment scores replicate with a lower probability than SNPs with high enrichment scores even when both they are genome-wide significant (p < 5x10-8. There were 693 and 219 independent loci with model-based replication rates ≥80% and ≥90%, respectively. Compared to analyses not incorporating relative enrichment scores, CM3 increased out-of-sample yield for SNPs that replicate at a given rate. This demonstrates that replication probabilities can be more accurately estimated using prior enrichment information with CM3.

  2. Multimedia input in automated image annotation and content-based retrieval

    Science.gov (United States)

    Srihari, Rohini K.

    1995-03-01

    This research explores the interaction of linguistic and photographic information in an integrated text/image database. By utilizing linguistic descriptions of a picture (speech and text input) coordinated with pointing references to the picture, we extract information useful in two aspects: image interpretation and image retrieval. In the image interpretation phase, objects and regions mentioned in the text are identified; the annotated image is stored in a database for future use. We incorporate techniques from our previous research on photo understanding using accompanying text: a system, PICTION, which identifies human faces in a newspaper photograph based on the caption. In the image retrieval phase, images matching natural language queries are presented to a user in a ranked order. This phase combines the output of (1) the image interpretation/annotation phase, (2) statistical text retrieval methods, and (3) image retrieval methods (e.g., color indexing). The system allows both point and click querying on a given image as well as intelligent querying across the entire text/image database.

  3. Manual annotation and analysis of the defensin gene cluster in the C57BL/6J mouse reference genome

    Directory of Open Access Journals (Sweden)

    Dougan Gordon

    2009-12-01

    Full Text Available Abstract Background Host defense peptides are a critical component of the innate immune system. Human alpha- and beta-defensin genes are subject to copy number variation (CNV and historically the organization of mouse alpha-defensin genes has been poorly defined. Here we present the first full manual genomic annotation of the mouse defensin region on Chromosome 8 of the reference strain C57BL/6J, and the analysis of the orthologous regions of the human and rat genomes. Problems were identified with the reference assemblies of all three genomes. Defensins have been studied for over two decades and their naming has become a critical issue due to incorrect identification of defensin genes derived from different mouse strains and the duplicated nature of this region. Results The defensin gene cluster region on mouse Chromosome 8 A2 contains 98 gene loci: 53 are likely active defensin genes and 22 defensin pseudogenes. Several TATA box motifs were found for human and mouse defensin genes that likely impact gene expression. Three novel defensin genes belonging to the Cryptdin Related Sequences (CRS family were identified. All additional mouse defensin loci on Chromosomes 1, 2 and 14 were annotated and unusual splice variants identified. Comparison of the mouse alpha-defensins in the three main mouse reference gene sets Ensembl, Mouse Genome Informatics (MGI, and NCBI RefSeq reveals significant inconsistencies in annotation and nomenclature. We are collaborating with the Mouse Genome Nomenclature Committee (MGNC to establish a standardized naming scheme for alpha-defensins. Conclusions Prior to this analysis, there was no reliable reference gene set available for the mouse strain C57BL/6J defensin genes, demonstrating that manual intervention is still critical for the annotation of complex gene families and heavily duplicated regions. Accurate gene annotation is facilitated by the annotation of pseudogenes and regulatory elements. Manually curated gene

  4. Integrative analysis of functional genomic annotations and sequencing data to identify rare causal variants via hierarchical modeling

    Directory of Open Access Journals (Sweden)

    Marinela eCapanu

    2015-05-01

    Full Text Available Identifying the small number of rare causal variants contributing to disease has beena major focus of investigation in recent years, but represents a formidable statisticalchallenge due to the rare frequencies with which these variants are observed. In thiscommentary we draw attention to a formal statistical framework, namely hierarchicalmodeling, to combine functional genomic annotations with sequencing data with theobjective of enhancing our ability to identify rare causal variants. Using simulations weshow that in all configurations studied, the hierarchical modeling approach has superiordiscriminatory ability compared to a recently proposed aggregate measure of deleteriousness,the Combined Annotation-Dependent Depletion (CADD score, supportingour premise that aggregate functional genomic measures can more accurately identifycausal variants when used in conjunction with sequencing data through a hierarchicalmodeling approach

  5. Automated training for algorithms that learn from genomic data.

    Science.gov (United States)

    Cilingir, Gokcen; Broschat, Shira L

    2015-01-01

    Supervised machine learning algorithms are used by life scientists for a variety of objectives. Expert-curated public gene and protein databases are major resources for gathering data to train these algorithms. While these data resources are continuously updated, generally, these updates are not incorporated into published machine learning algorithms which thereby can become outdated soon after their introduction. In this paper, we propose a new model of operation for supervised machine learning algorithms that learn from genomic data. By defining these algorithms in a pipeline in which the training data gathering procedure and the learning process are automated, one can create a system that generates a classifier or predictor using information available from public resources. The proposed model is explained using three case studies on SignalP, MemLoci, and ApicoAP in which existing machine learning models are utilized in pipelines. Given that the vast majority of the procedures described for gathering training data can easily be automated, it is possible to transform valuable machine learning algorithms into self-evolving learners that benefit from the ever-changing data available for gene products and to develop new machine learning algorithms that are similarly capable. PMID:25695053

  6. The genome sequence of Leishmania (Leishmania) amazonensis: functional annotation and extended analysis of gene models.

    Science.gov (United States)

    Real, Fernando; Vidal, Ramon Oliveira; Carazzolle, Marcelo Falsarella; Mondego, Jorge Maurício Costa; Costa, Gustavo Gilson Lacerda; Herai, Roberto Hirochi; Würtele, Martin; de Carvalho, Lucas Miguel; Carmona e Ferreira, Renata; Mortara, Renato Arruda; Barbiéri, Clara Lucia; Mieczkowski, Piotr; da Silveira, José Franco; Briones, Marcelo Ribeiro da Silva; Pereira, Gonçalo Amarante Guimarães; Bahia, Diana

    2013-12-01

    We present the sequencing and annotation of the Leishmania (Leishmania) amazonensis genome, an etiological agent of human cutaneous leishmaniasis in the Amazon region of Brazil. L. (L.) amazonensis shares features with Leishmania (L.) mexicana but also exhibits unique characteristics regarding geographical distribution and clinical manifestations of cutaneous lesions (e.g. borderline disseminated cutaneous leishmaniasis). Predicted genes were scored for orthologous gene families and conserved domains in comparison with other human pathogenic Leishmania spp. Carboxypeptidase, aminotransferase, and 3'-nucleotidase genes and ATPase, thioredoxin, and chaperone-related domains were represented more abundantly in L. (L.) amazonensis and L. (L.) mexicana species. Phylogenetic analysis revealed that these two species share groups of amastin surface proteins unique to the genus that could be related to specific features of disease outcomes and host cell interactions. Additionally, we describe a hypothetical hybrid interactome of potentially secreted L. (L.) amazonensis proteins and host proteins under the assumption that parasite factors mimic their mammalian counterparts. The model predicts an interaction between an L. (L.) amazonensis heat-shock protein and mammalian Toll-like receptor 9, which is implicated in important immune responses such as cytokine and nitric oxide production. The analysis presented here represents valuable information for future studies of leishmaniasis pathogenicity and treatment. PMID:23857904

  7. Comparative annotation of functional regions in the human genome using epigenomic data.

    Science.gov (United States)

    Won, Kyoung-Jae; Zhang, Xian; Wang, Tao; Ding, Bo; Raha, Debasish; Snyder, Michael; Ren, Bing; Wang, Wei

    2013-04-01

    Epigenetic regulation is dynamic and cell-type dependent. The recently available epigenomic data in multiple cell types provide an unprecedented opportunity for a comparative study of epigenetic landscape. We developed a machine-learning method called ChroModule to annotate the epigenetic states in eight ENCyclopedia Of DNA Elements cell types. The trained model successfully captured the characteristic histone-modification patterns associated with regulatory elements, such as promoters and enhancers, and showed superior performance on identifying enhancers compared with the state-of-art methods. In addition, given the fixed number of epigenetic states in the model, ChroModule allows straightforward illustration of epigenetic variability in multiple cell types. Using this feature, we found that invariable and variable epigenetic states across cell types correspond to housekeeping functions and stimulus response, respectively. Especially, we observed that enhancers, but not the other regulatory elements, dictate cell specificity, as similar cell types share common enhancers, and cell-type-specific enhancers are often bound by transcription factors playing critical roles in that cell type. More interestingly, we found some genomic regions are dormant in cell type but primed to become active in other cell types. These observations highlight the usefulness of ChroModule in comparative analysis and interpretation of multiple epigenomes. PMID:23482391

  8. Gene Ontology annotation of the rice blast fungus, Magnaporthe oryzae

    Directory of Open Access Journals (Sweden)

    Deng Jixin

    2009-02-01

    Full Text Available Abstract Background Magnaporthe oryzae, the causal agent of blast disease of rice, is the most destructive disease of rice worldwide. The genome of this fungal pathogen has been sequenced and an automated annotation has recently been updated to Version 6 http://www.broad.mit.edu/annotation/genome/magnaporthe_grisea/MultiDownloads.html. However, a comprehensive manual curation remains to be performed. Gene Ontology (GO annotation is a valuable means of assigning functional information using standardized vocabulary. We report an overview of the GO annotation for Version 5 of M. oryzae genome assembly. Methods A similarity-based (i.e., computational GO annotation with manual review was conducted, which was then integrated with a literature-based GO annotation with computational assistance. For similarity-based GO annotation a stringent reciprocal best hits method was used to identify similarity between predicted proteins of M. oryzae and GO proteins from multiple organisms with published associations to GO terms. Significant alignment pairs were manually reviewed. Functional assignments were further cross-validated with manually reviewed data, conserved domains, or data determined by wet lab experiments. Additionally, biological appropriateness of the functional assignments was manually checked. Results In total, 6,286 proteins received GO term assignment via the homology-based annotation, including 2,870 hypothetical proteins. Literature-based experimental evidence, such as microarray, MPSS, T-DNA insertion mutation, or gene knockout mutation, resulted in 2,810 proteins being annotated with GO terms. Of these, 1,673 proteins were annotated with new terms developed for Plant-Associated Microbe Gene Ontology (PAMGO. In addition, 67 experiment-determined secreted proteins were annotated with PAMGO terms. Integration of the two data sets resulted in 7,412 proteins (57% being annotated with 1,957 distinct and specific GO terms. Unannotated proteins

  9. Draft genome sequence and annotation of Lactobacillus acetotolerans BM-LA14527, a beer-spoilage bacteria.

    Science.gov (United States)

    Liu, Junyan; Li, Lin; Peters, Brian M; Li, Bing; Deng, Yang; Xu, Zhenbo; Shirtliff, Mark E

    2016-09-01

    Lactobacillus acetotolerans is a hard-to-culture beer-spoilage bacterium capable of entering into the viable putative nonculturable (VPNC) state. As part of an initial strategy to investigate the phenotypic behavior of L. acetotolerans, draft genome sequencing was performed. Results demonstrated a total of 1824 predicted annotated genes, with several potential VPNC- and beer-spoilage-associated genes identified. Importantly, this is the first genome sequence of L. acetotolerans as beer-spoilage bacteria and it may aid in further analysis of L. acetotolerans and other beer-spoilage bacteria, with direct implications for food safety control in the beer brewing industry. PMID:27559043

  10. CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences

    Directory of Open Access Journals (Sweden)

    Liu Chang

    2012-12-01

    Full Text Available Abstract Background The complete sequences of chloroplast genomes provide wealthy information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, powerful computational tools annotating the genome sequences are in urgent need. Results We have developed a web server CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR are identified using tRNAscan, ARAGORN and vmatch respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extractions of protein and mRNA sequences for given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tools. The edited annotations can then be uploaded to CPGAVAS for update and re-analyses repeatedly. Using known chloroplast genome sequences as test set, we show that CPGAVAS performs comparably to another application DOGMA, while having several superior functionalities. Conclusions CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensible tool for researchers studying chloroplast genomes. The software is freely accessible from http://www.herbalgenomics.org/cpgavas.

  11. Automated annotation of functional imaging experiments via multi-label classification

    Directory of Open Access Journals (Sweden)

    MatthewDTurner

    2013-12-01

    Full Text Available Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO, the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes, k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization. Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. Naive Bayes methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text.

  12. Homology-based annotation of non-coding RNAs in the genomes of Schistosoma mansoni and Schistosoma japonicum

    Directory of Open Access Journals (Sweden)

    Santana Clara

    2009-10-01

    Full Text Available Abstract Background Schistosomes are trematode parasites of the phylum Platyhelminthes. They are considered the most important of the human helminth parasites in terms of morbidity and mortality. Draft genome sequences are now available for Schistosoma mansoni and Schistosoma japonicum. Non-coding RNA (ncRNA plays a crucial role in gene expression regulation, cellular function and defense, homeostasis, and pathogenesis. The genome-wide annotation of ncRNAs is a non-trivial task unless well-annotated genomes of closely related species are already available. Results A homology search for structured ncRNA in the genome of S. mansoni resulted in 23 types of ncRNAs with conserved primary and secondary structure. Among these, we identified rRNA, snRNA, SL RNA, SRP, tRNAs and RNase P, and also possibly MRP and 7SK RNAs. In addition, we confirmed five miRNAs that have recently been reported in S. japonicum and found two additional homologs of known miRNAs. The tRNA complement of S. mansoni is comparable to that of the free-living planarian Schmidtea mediterranea, although for some amino acids differences of more than a factor of two are observed: Leu, Ser, and His are overrepresented, while Cys, Meth, and Ile are underrepresented in S. mansoni. On the other hand, the number of tRNAs in the genome of S. japonicum is reduced by more than a factor of four. Both schistosomes have a complete set of minor spliceosomal snRNAs. Several ncRNAs that are expected to exist in the S. mansoni genome were not found, among them the telomerase RNA, vault RNAs, and Y RNAs. Conclusion The ncRNA sequences and structures presented here represent the most complete dataset of ncRNA from any lophotrochozoan reported so far. This data set provides an important reference for further analysis of the genomes of schistosomes and indeed eukaryotic genomes at large.

  13. Beyond genomic variation - comparison and functional annotation of three Brassica rapa genomes: a turnip, a rapid cycling and a Chinese cabbage

    OpenAIRE

    Lin, Ke; Zhang, Ningwen; Severing, Edouard I.; Nijveen, Harm; Cheng, Feng; Visser, Richard GF; Wang, Xiaowu; de Ridder, Dick; Bonnema, Guusje

    2014-01-01

    Background Brassica rapa is an economically important crop species. During its long breeding history, a large number of morphotypes have been generated, including leafy vegetables such as Chinese cabbage and pakchoi, turnip tuber crops and oil crops. Results To investigate the genetic variation underlying this morphological variation, we re-sequenced, assembled and annotated the genomes of two B. rapa subspecies, turnip crops (turnip) and a rapid cycling. We then analysed the two resulting ge...

  14. Computational prediction of over-annotated protein-coding genes in the genome of Agrobacterium tumefaciens strain C58

    Science.gov (United States)

    Yu, Jia-Feng; Sui, Tian-Xiang; Wang, Hong-Mei; Wang, Chun-Ling; Jing, Li; Wang, Ji-Hua

    2015-12-01

    Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. Project supported by the National Natural Science Foundation of China (Grant Nos. 61302186 and 61271378) and the Funding from the State Key Laboratory of Bioelectronics of Southeast University.

  15. Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium.

    Science.gov (United States)

    Gaudet, Pascale; Livstone, Michael S; Lewis, Suzanna E; Thomas, Paul D

    2011-09-01

    The goal of the Gene Ontology (GO) project is to provide a uniform way to describe the functions of gene products from organisms across all kingdoms of life and thereby enable analysis of genomic data. Protein annotations are either based on experiments or predicted from protein sequences. Since most sequences have not been experimentally characterized, most available annotations need to be based on predictions. To make as accurate inferences as possible, the GO Consortium's Reference Genome Project is using an explicit evolutionary framework to infer annotations of proteins from a broad set of genomes from experimental annotations in a semi-automated manner. Most components in the pipeline, such as selection of sequences, building multiple sequence alignments and phylogenetic trees, retrieving experimental annotations and depositing inferred annotations, are fully automated. However, the most crucial step in our pipeline relies on software-assisted curation by an expert biologist. This curation tool, Phylogenetic Annotation and INference Tool (PAINT) helps curators to infer annotations among members of a protein family. PAINT allows curators to make precise assertions as to when functions were gained and lost during evolution and record the evidence (e.g. experimentally supported GO annotations and phylogenetic information including orthology) for those assertions. In this article, we describe how we use PAINT to infer protein function in a phylogenetic context with emphasis on its strengths, limitations and guidelines. We also discuss specific examples showing how PAINT annotations compare with those generated by other highly used homology-based methods. PMID:21873635

  16. Non-gaussian distributions affect identification of expression patterns, functional annotation, and prospective classification in human cancer genomes.

    Directory of Open Access Journals (Sweden)

    Nicholas F Marko

    Full Text Available INTRODUCTION: Gene expression data is often assumed to be normally-distributed, but this assumption has not been tested rigorously. We investigate the distribution of expression data in human cancer genomes and study the implications of deviations from the normal distribution for translational molecular oncology research. METHODS: We conducted a central moments analysis of five cancer genomes and performed empiric distribution fitting to examine the true distribution of expression data both on the complete-experiment and on the individual-gene levels. We used a variety of parametric and nonparametric methods to test the effects of deviations from normality on gene calling, functional annotation, and prospective molecular classification using a sixth cancer genome. RESULTS: Central moments analyses reveal statistically-significant deviations from normality in all of the analyzed cancer genomes. We observe as much as 37% variability in gene calling, 39% variability in functional annotation, and 30% variability in prospective, molecular tumor subclassification associated with this effect. CONCLUSIONS: Cancer gene expression profiles are not normally-distributed, either on the complete-experiment or on the individual-gene level. Instead, they exhibit complex, heavy-tailed distributions characterized by statistically-significant skewness and kurtosis. The non-Gaussian distribution of this data affects identification of differentially-expressed genes, functional annotation, and prospective molecular classification. These effects may be reduced in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis highlights two unreliable assumptions of translational cancer gene expression analysis: that "small" departures from normality in the expression data distributions are analytically-insignificant and that "robust" gene-calling algorithms can fully compensate for these effects.

  17. Automated Comparative Auditing of NCIT Genomic Roles Using NCBI

    Science.gov (United States)

    Cohen, Barry; Oren, Marc; Min, Hua; Perl, Yehoshua; Halper, Michael

    2008-01-01

    Biomedical research has identified many human genes and various knowledge about them. The National Cancer Institute Thesaurus (NCIT) represents such knowledge as concepts and roles (relationships). Due to the rapid advances in this field, it is to be expected that the NCIT’s Gene hierarchy will contain role errors. A comparative methodology to audit the Gene hierarchy with the use of the National Center for Biotechnology Information’s (NCBI’s) Entrez Gene database is presented. The two knowledge sources are accessed via a pair of Web crawlers to ensure up-to-date data. Our algorithms then compare the knowledge gathered from each, identify discrepancies that represent probable errors, and suggest corrective actions. The primary focus is on two kinds of gene-roles: (1) the chromosomal locations of genes, and (2) the biological processes in which genes plays a role. Regarding chromosomal locations, the discrepancies revealed are striking and systematic, suggesting a structurally common origin. In regard to the biological processes, difficulties arise because genes frequently play roles in multiple processes, and processes may have many designations (such as synonymous terms). Our algorithms make use of the roles defined in the NCIT Biological Process hierarchy to uncover many probable gene-role errors in the NCIT. These results show that automated comparative auditing is a promising technique that can identify a large number of probable errors and corrections for them in a terminological genomic knowledge repository, thus facilitating its overall maintenance. PMID:18486558

  18. Protein function annotation by homology-based inference

    OpenAIRE

    Loewenstein, Yaniv; Raimondo, Domenico; Redfern, Oliver C.; Watson, James; Frishman, Dmitrij; Linial, Michal; Orengo, Christine; Thornton, Janet; Tramontano, Anna

    2009-01-01

    With many genomes now sequenced, computational annotation methods to characterize genes and proteins from their sequence are increasingly important. The BioSapiens Network has developed tools to address all stages of this process, and here we review progress in the automated prediction of protein function based on protein sequence and structure.

  19. ChIP-Seq-Annotated Heliconius erato Genome Highlights Patterns of cis-Regulatory Evolution in Lepidoptera.

    Science.gov (United States)

    Lewis, James J; van der Burg, Karin R L; Mazo-Vargas, Anyi; Reed, Robert D

    2016-09-13

    Uncovering phylogenetic patterns of cis-regulatory evolution remains a fundamental goal for evolutionary and developmental biology. Here, we characterize the evolution of regulatory loci in butterflies and moths using chromatin immunoprecipitation sequencing (ChIP-seq) annotation of regulatory elements across three stages of head development. In the process we provide a high-quality, functionally annotated genome assembly for the butterfly, Heliconius erato. Comparing cis-regulatory element conservation across six lepidopteran genomes, we find that regulatory sequences evolve at a pace similar to that of protein-coding regions. We also observe that elements active at multiple developmental stages are markedly more conserved than elements with stage-specific activity. Surprisingly, we also find that stage-specific proximal and distal regulatory elements evolve at nearly identical rates. Our study provides a benchmark for genome-wide patterns of regulatory element evolution in insects, and it shows that developmental timing of activity strongly predicts patterns of regulatory sequence evolution. PMID:27626657

  20. Sequencing and annotation of the Wolbachia endosymbiont of Diaphorina citri by the CG-HLB Genome Resources group reveals candidate sources of interaction with the insect host

    OpenAIRE

    Saha, Surya; Hunter, Wayne; Lindeberg, Magdalen

    2014-01-01

    The Citrus Greening – Huanglongbing (CG-HLB) Genome Resources group serves as a bioinformatics resource for diverse projects related to the biology of CG-HLB.  A major recent project concerns the generation and annotation of a draft genome sequence for the Wolbachia endosymbiont (wDi) of the Asian citrus psyllid, of particular interest given the potential for control of psyllid behavior through manipulation of its bacterial endosymbionts.   The Wolbachia draft genome was assembled and contigs...

  1. MetaGeneAnnotator: Detecting Species-Specific Patterns of Ribosomal Binding Site for Precise Gene Prediction in Anonymous Prokaryotic and Phage Genomes

    OpenAIRE

    Noguchi, Hideki; Taniguchi, Takeaki; Itoh, Takehiko

    2008-01-01

    Recent advances in DNA sequencers are accelerating genome sequencing, especially in microbes, and complete and draft genomes from various species have been sequenced in rapid succession. Here, we present a comprehensive gene prediction tool, the MetaGeneAnnotator (MGA), which precisely predicts all kinds of prokaryotic genes from a single or a set of anonymous genomic sequences having a variety of lengths. The MGA integrates statistical models of prophage genes, in addition to those of bacter...

  2. Annotation of two large contiguous regions from the Haemonchus contortus genome using RNA-seq and comparative analysis with Caenorhabditis elegans.

    Directory of Open Access Journals (Sweden)

    Roz Laing

    Full Text Available The genomes of numerous parasitic nematodes are currently being sequenced, but their complexity and size, together with high levels of intra-specific sequence variation and a lack of reference genomes, makes their assembly and annotation a challenging task. Haemonchus contortus is an economically significant parasite of livestock that is widely used for basic research as well as for vaccine development and drug discovery. It is one of many medically and economically important parasites within the strongylid nematode group. This group of parasites has the closest phylogenetic relationship with the model organism Caenorhabditis elegans, making comparative analysis a potentially powerful tool for genome annotation and functional studies. To investigate this hypothesis, we sequenced two contiguous fragments from the H. contortus genome and undertook detailed annotation and comparative analysis with C. elegans. The adult H. contortus transcriptome was sequenced using an Illumina platform and RNA-seq was used to annotate a 409 kb overlapping BAC tiling path relating to the X chromosome and a 181 kb BAC insert relating to chromosome I. In total, 40 genes and 12 putative transposable elements were identified. 97.5% of the annotated genes had detectable homologues in C. elegans of which 60% had putative orthologues, significantly higher than previous analyses based on EST analysis. Gene density appears to be less in H. contortus than in C. elegans, with annotated H. contortus genes being an average of two-to-three times larger than their putative C. elegans orthologues due to a greater intron number and size. Synteny appears high but gene order is generally poorly conserved, although areas of conserved microsynteny are apparent. C. elegans operons appear to be partially conserved in H. contortus. Our findings suggest that a combination of RNA-seq and comparative analysis with C. elegans is a powerful approach for the annotation and analysis of strongylid

  3. Joint Genome Institute's Automation Approach and History

    Energy Technology Data Exchange (ETDEWEB)

    Roberts, Simon

    2006-07-05

    Department of Energy/Joint Genome Institute (DOE/JGI) collaborates with DOE national laboratories and community users, to advance genome science in support of the DOE missions of clean bio-energy, carbon cycling, and bioremediation.

  4. High-density rhesus macaque oligonucleotide microarray design using early-stage rhesus genome sequence information and human genome annotations

    OpenAIRE

    Magness Charles L; Thomas Matthew J; Proll Sean C; Paeper Bryan; Korth Marcus J; Wallace James C; Iadonato Shawn P; Nelson Charles; Katze Michael G

    2007-01-01

    Abstract Background Until recently, few genomic reagents specific for non-human primate research have been available. To address this need, we have constructed a macaque-specific high-density oligonucleotide microarray by using highly fragmented low-pass sequence contigs from the rhesus genome project together with the detailed sequence and exon structure of the human genome. Using this method, we designed oligonucleotide probes to over 17,000 distinct rhesus/human gene orthologs and increase...

  5. An Innovative Plant Genomics and Gene Annotation Program for High School, Community College, and University Faculty

    OpenAIRE

    Hacisalihoglu, Gokhan; Hilgert, Uwe; Nash, E. Bruce; Micklos, David A.

    2008-01-01

    Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day “Plant Genomics and Gene Annotation” workshop was held at Florida A&M University in Tallahassee, FL, to enhance participants' knowledge and understand...

  6. The Drosophila melanogaster PeptideAtlas facilitates the use of peptide data for improved fly proteomics and genome annotation

    Directory of Open Access Journals (Sweden)

    King Nichole L

    2009-02-01

    Full Text Available Abstract Background Crucial foundations of any quantitative systems biology experiment are correct genome and proteome annotations. Protein databases compiled from high quality empirical protein identifications that are in turn based on correct gene models increase the correctness, sensitivity, and quantitative accuracy of systems biology genome-scale experiments. Results In this manuscript, we present the Drosophila melanogaster PeptideAtlas, a fly proteomics and genomics resource of unsurpassed depth. Based on peptide mass spectrometry data collected in our laboratory the portal http://www.drosophila-peptideatlas.org allows querying fly protein data observed with respect to gene model confirmation and splice site verification as well as for the identification of proteotypic peptides suited for targeted proteomics studies. Additionally, the database provides consensus mass spectra for observed peptides along with qualitative and quantitative information about the number of observations of a particular peptide and the sample(s in which it was observed. Conclusion PeptideAtlas is an open access database for the Drosophila community that has several features and applications that support (1 reduction of the complexity inherently associated with performing targeted proteomic studies, (2 designing and accelerating shotgun proteomics experiments, (3 confirming or questioning gene models, and (4 adjusting gene models such that they are in line with observed Drosophila peptides. While the database consists of proteomic data it is not required that the user is a proteomics expert.

  7. Updated genome assembly and annotation of Paenibacillus larvae, the agent of American foulbrood disease of honey bees

    Directory of Open Access Journals (Sweden)

    de Graaf Dirk C

    2011-09-01

    Full Text Available Abstract Background As scientists continue to pursue various 'omics-based research, there is a need for high quality data for the most fundamental 'omics of all: genomics. The bacterium Paenibacillus larvae is the causative agent of the honey bee disease American foulbrood. If untreated, it can lead to the demise of an entire hive; the highly social nature of bees also leads to easy disease spread, between both individuals and colonies. Biologists have studied this organism since the early 1900s, and a century later, the molecular mechanism of infection remains elusive. Transcriptomics and proteomics, because of their ability to analyze multiple genes and proteins in a high-throughput manner, may be very helpful to its study. However, the power of these methodologies is severely limited without a complete genome; we undertake to address that deficiency here. Results We used the Illumina GAIIx platform and conventional Sanger sequencing to generate a 182-fold sequence coverage of the P. larvae genome, and assembled the data using ABySS into a total of 388 contigs spanning 4.5 Mbp. Comparative genomics analysis against fully-sequenced soil bacteria P. JDR2 and P. vortex showed that regions of poor conservation may contain putative virulence factors. We used GLIMMER to predict 3568 gene models, and named them based on homology revealed by BLAST searches; proteases, hemolytic factors, toxins, and antibiotic resistance enzymes were identified in this way. Finally, mass spectrometry was used to provide experimental evidence that at least 35% of the genes are expressed at the protein level. Conclusions This update on the genome of P. larvae and annotation represents an immense advancement from what we had previously known about this species. We provide here a reliable resource that can be used to elucidate the mechanism of infection, and by extension, more effective methods to control and cure this widespread honey bee disease.

  8. Carbohydrate catabolic flexibility in the mammalian intestinal commensal Lactobacillus ruminis revealed by fermentation studies aligned to genome annotations

    LENUS (Irish Health Repository)

    2011-08-30

    Abstract Background Lactobacillus ruminis is a poorly characterized member of the Lactobacillus salivarius clade that is part of the intestinal microbiota of pigs, humans and other mammals. Its variable abundance in human and animals may be linked to historical changes over time and geographical differences in dietary intake of complex carbohydrates. Results In this study, we investigated the ability of nine L. ruminis strains of human and bovine origin to utilize fifty carbohydrates including simple sugars, oligosaccharides, and prebiotic polysaccharides. The growth patterns were compared with metabolic pathways predicted by annotation of a high quality draft genome sequence of ATCC 25644 (human isolate) and the complete genome of ATCC 27782 (bovine isolate). All of the strains tested utilized prebiotics including fructooligosaccharides (FOS), soybean-oligosaccharides (SOS) and 1,3:1,4-β-D-gluco-oligosaccharides to varying degrees. Six strains isolated from humans utilized FOS-enriched inulin, as well as FOS. In contrast, three strains isolated from cows grew poorly in FOS-supplemented medium. In general, carbohydrate utilisation patterns were strain-dependent and also varied depending on the degree of polymerisation or complexity of structure. Six putative operons were identified in the genome of the human isolate ATCC 25644 for the transport and utilisation of the prebiotics FOS, galacto-oligosaccharides (GOS), SOS, and 1,3:1,4-β-D-Gluco-oligosaccharides. One of these comprised a novel FOS utilisation operon with predicted capacity to degrade chicory-derived FOS. However, only three of these operons were identified in the ATCC 27782 genome that might account for the utilisation of only SOS and 1,3:1,4-β-D-Gluco-oligosaccharides. Conclusions This study has provided definitive genome-based evidence to support the fermentation patterns of nine strains of Lactobacillus ruminis, and has linked it to gene distribution patterns in strains from different sources

  9. An Innovative Plant Genomics and Gene Annotation Program for High School, Community College, and University Faculty

    Science.gov (United States)

    Hacisalihoglu, Gokhan; Hilgert, Uwe; Nash, E. Bruce; Micklos, David A.

    2008-01-01

    Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day…

  10. Assembly and annotation of full mitochondrial genomes for the corn rootworm species, Diabrotica virgifera virgifera and D. barberi (Insecta: Coleoptera: Chrysomelidae), using Next Generation Sequence data

    Science.gov (United States)

    Complete mitochondrial genomes for two corn rootworm species, Diabrotica v. virgifera (16,747 bp) and D. barberi (16,632; Insecta: Coleoptera: Chrysomelidae), were assembled from Illumina HiSeq2000 read data. Annotation indicated that the order and orientation of 13 protein coding genes (PCGs), and...

  11. Genome-Wide Functional Annotation of Human Protein-Coding Splice Variants Using Multiple Instance Learning.

    Science.gov (United States)

    Panwar, Bharat; Menon, Rajasree; Eksi, Ridvan; Li, Hong-Dong; Omenn, Gilbert S; Guan, Yuanfang

    2016-06-01

    The vast majority of human multiexon genes undergo alternative splicing and produce a variety of splice variant transcripts and proteins, which can perform different functions. These protein-coding splice variants (PCSVs) greatly increase the functional diversity of proteins. Most functional annotation algorithms have been developed at the gene level; the lack of isoform-level gold standards is an important intellectual limitation for currently available machine learning algorithms. The accumulation of a large amount of RNA-seq data in the public domain greatly increases our ability to examine the functional annotation of genes at isoform level. In the present study, we used a multiple instance learning (MIL)-based approach for predicting the function of PCSVs. We used transcript-level expression values and gene-level functional associations from the Gene Ontology database. A support vector machine (SVM)-based 5-fold cross-validation technique was applied. Comparatively, genes with multiple PCSVs performed better than single PCSV genes, and performance also improved when more examples were available to train the models. We demonstrated our predictions using literature evidence of ADAM15, LMNA/C, and DMXL2 genes. All predictions have been implemented in a web resource called "IsoFunc", which is freely available for the global scientific community through http://guanlab.ccmb.med.umich.edu/isofunc . PMID:27142340

  12. Hal: an Automated Pipeline for Phylogenetic Analyses of Genomic Data

    OpenAIRE

    Robbertse, Barbara; Yoder, Ryan J.; Boyd, Alex; Reeves, John; Spatafora, Joseph W.

    2011-01-01

    The rapid increase in genomic and genome-scale data is resulting in unprecedented levels of discrete sequence data available for phylogenetic analyses. Major analytical impasses exist, however, prior to analyzing these data with existing phylogenetic software. Obstacles include the management of large data sets without standardized naming conventions, identification and filtering of orthologous clusters of proteins or genes, and the assembly of alignments of orthologous sequence data into ind...

  13. Xenopus tropicalis Genome Re-Scaffolding and Re-Annotation Reach the Resolution Required for In Vivo ChIA-PET Analysis.

    Directory of Open Access Journals (Sweden)

    Nicolas Buisine

    Full Text Available Genome-wide functional analyses require high-resolution genome assembly and annotation. We applied ChIA-PET to analyze gene regulatory networks, including 3D chromosome interactions, underlying thyroid hormone (TH signaling in the frog Xenopus tropicalis. As the available versions of Xenopus tropicalis assembly and annotation lacked the resolution required for ChIA-PET we improve the genome assembly version 4.1 and annotations using data derived from the paired end tag (PET sequencing technologies and approaches (e.g., DNA-PET [gPET], RNA-PET etc.. The large insert (~10 Kb, ~17 Kb paired end DNA-PET with high throughput NGS sequencing not only significantly improved genome assembly quality, but also strongly reduced genome "fragmentation", reducing total scaffold numbers by ~60%. Next, RNA-PET technology, designed and developed for the detection of full-length transcripts and fusion mRNA in whole transcriptome studies (ENCODE consortia, was applied to capture the 5' and 3' ends of transcripts. These amendments in assembly and annotation were essential prerequisites for the ChIA-PET analysis of TH transcription regulation. Their application revealed complex regulatory configurations of target genes and the structures of the regulatory networks underlying physiological responses. Our work allowed us to improve the quality of Xenopus tropicalis genomic resources, reaching the standard required for ChIA-PET analysis of transcriptional networks. We consider that the workflow proposed offers useful conceptual and methodological guidance and can readily be applied to other non-conventional models that have low-resolution genome data.

  14. Genome sequencing and annotation of Acinetobacter gerneri strain MTCC 9824T

    Directory of Open Access Journals (Sweden)

    Nitin Kumar Singh

    2014-12-01

    Full Text Available The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 4.4 Mb genome of Acinetobacter gerneri strain MTCC 9824T. The genome has a G + C content of 38.0% and includes 3 rRNA genes (5S, 23S16S and 64 aminoacyl-tRNA synthetase genes.

  15. Genome sequencing and annotation of Acinetobacter gyllenbergii strain MTCC 11365T

    Directory of Open Access Journals (Sweden)

    Nitin Kumar Singh

    2014-12-01

    Full Text Available The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report 4.3 Mb genome of the Acinetobacter gyllenbergii strain MTCC 11365T. The draft genome of A. gyllenbergii has a G + C content of 41.0% and includes 3 rRNA genes (5S, 23S, 16S and 67 aminoacyl-tRNA synthetase genes.

  16. Genome sequencing and annotation of Acinetobacter haemolyticus strain MTCC 9819T

    Directory of Open Access Journals (Sweden)

    Indu Khatri

    2014-12-01

    Full Text Available The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 3.4 Mb genome of Acinetobacter haemolyticus strain MTCC 9819T. The genome has a G + C content of 40.0% and includes 3 rRNA genes (5S, 23S, 16S and 65 aminoacyl-tRNA synthetase genes.

  17. Genome-wide annotation and analysis of zebra finch microRNA repertoire reveal sex-biased expression

    Directory of Open Access Journals (Sweden)

    Luo Guan-Zheng

    2012-12-01

    Full Text Available Abstract Background MicroRNAs (miRNAs are small noncoding RNAs that regulate gene expression post-transcriptionally in a wide range of biological processes. The zebra finch (Taeniopygia guttata, an oscine songbird with characteristic learned vocal behavior, provides biologists a unique model system for studying vocal behavior, sexually dimorphic brain development and functions, and comparative genomics. Results We deep sequenced small RNA libraries made from the brain, heart, liver, and muscle tissues of adult male and female zebra finches. By mapping the sequence reads to the zebra finch genome and to known miRNAs in miRBase, we annotated a total of 193 miRNAs. Among them, 29 (15% are avian specific, including three novel zebra finch specific miRNAs. Many of the miRNAs exhibit sequence heterogeneity including length variations, untemplated terminal nucleotide additions, and internal substitution events occurring at the uridine nucleotide within a GGU motif. We also identified seven Z chromosome-encoded miRNAs. Among them, miR-2954, an avian specific miRNA, is expressed at significantly higher levels in males than in females in all tissues examined. Target prediction analysis reveals that miR-2954, but not other Z-linked miRNAs, preferentially targets Z chromosome-encoded genes, including several genes known to be expressed in a sexually dimorphic manner in the zebra finch brain. Conclusions Our genome-wide systematic analysis of mature sequences, genomic locations, evolutionary sequence conservation, and tissue expression profiles of the zebra finch miRNA repertoire provides a valuable resource to the research community. Our analysis also reveals a miRNA-mediated mechanism that potentially regulates sex-biased gene expression in avian species.

  18. Genome sequencing and annotation of Acinetobacter guillouiae strain MSP 4-18

    Directory of Open Access Journals (Sweden)

    Nitin Kumar Singh

    2014-12-01

    Full Text Available The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 4.8 Mb genome of Acinetobacter guillouiae MSP 4-18, isolated from a mangrove soil sample from Parangipettai (11°30′N, 79°47′E, Tamil Nadu, India. The draft genome of A. guillouiae MSP 4-18 has a G + C content of 38.0% and includes 3 rRNA genes (5S, 23S, 16S and 69 aminoacyl-tRNA synthetase genes.

  19. Whole genome sequences and annotation of Micrococcus luteus SUBG006, a novel phytopathogen of mango

    Directory of Open Access Journals (Sweden)

    Purvi M. Rakhashiya

    2015-12-01

    Full Text Available Actinobaceria, Micrococcus luteus SUBG006 was isolated from infected leaves of Mangifera indica L. vr. Nylon in Rajkot, (22.30°N, 70.78°E, Gujarat, India. The genome size is 3.86 Mb with G + C content of 69.80% and contains 112 rRNA sequences (5S, 16S and 23S. The whole genome sequencing has been deposited in DDBJ/EMBL/GenBank under the accession number JOKP00000000.

  20. Whole genome sequences and annotation of Micrococcus luteus SUBG006, a novel phytopathogen of mango.

    Science.gov (United States)

    Rakhashiya, Purvi M; Patel, Pooja P; Thaker, Vrinda S

    2015-12-01

    Actinobaceria, Micrococcus luteus SUBG006 was isolated from infected leaves of Mangifera indica L. vr. Nylon in Rajkot, (22.30°N, 70.78°E), Gujarat, India. The genome size is 3.86 Mb with G + C content of 69.80% and contains 112 rRNA sequences (5S, 16S and 23S). The whole genome sequencing has been deposited in DDBJ/EMBL/GenBank under the accession number JOKP00000000. PMID:26697318

  1. Whole genome sequences and annotation of Micrococcus luteus SUBG006, a novel phytopathogen of mango

    OpenAIRE

    Rakhashiya, Purvi M.; Patel, Pooja P.; Thaker, Vrinda S.

    2015-01-01

    Actinobaceria, Micrococcus luteus SUBG006 was isolated from infected leaves of Mangifera indica L. vr. Nylon in Rajkot, (22.30°N, 70.78°E), Gujarat, India. The genome size is 3.86 Mb with G + C content of 69.80% and contains 112 rRNA sequences (5S, 16S and 23S). The whole genome sequencing has been deposited in DDBJ/EMBL/GenBank under the accession number JOKP00000000.

  2. Integrative Annotation of Variants from 1092 Humans: Application to Cancer Genomics

    DEFF Research Database (Denmark)

    Khurana, Ekta; Fu, Yao; Colonna, Vincenza; Mu, Xinmeng Jasmine; Kang, Hyun Min; Lappalainen, Tuuli; Sboner, Andrea; Lochovsky, Lucas; Chen, Jieming; Harmanci, Arif; Das, Jishnu; Abyzov, Alexej; Balasubramanian, Suganthi; Beal, Kathryn; Chakravarty, Dimple; Challis, Daniel; Chen, Yuan; Clarke, Declan; Clarke, Laura; Cunningham, Fiona; Evani, Uday S.; Flicek, Paul; Fragoza, Robert; Garrison, Erik; Gibbs, Richard; Gümüş, Zeynep H.; Herrero, Javier; Kitabayashi, Naoki; Kong, Yong; Lage, Kasper; Liluashvili, Vaja; Lipkin, Steven M.; MacArthur, Daniel G.; Marth, Gabor; Muzny, Donna; Pers, Tune Hannes; Ritchie, Graham R. S.; Rosenfeld, Jeffrey A.; Sisu, Cristina; Wei, Xiaomu; Wilson, Michael; Xue, Yali; Yu, Fuli; Dermitzakis, Emmanouil T.; Yu, Haiyuan; Rubin, Mark A.; Tyler-Smith, Chris; Gerstein, Mark

    2013-01-01

    Identifying Important Identifiers Each of us has millions of sequence variations in our genomes. Signatures of purifying or negative selection should help identify which of those variations is functionally important. Khurana et al. (1235587) used sequence polymorphisms from 1092 humans across 14 ...

  3. Mapping and annotating obesity-related genes in pig and human genomes.

    Science.gov (United States)

    Martelli, Pier Luigi; Fontanesi, Luca; Piovesan, Damiano; Fariselli, Piero; Casadio, Rita

    2014-01-01

    Background. Obesity is a major health problem in both developed and emerging countries. Obesity is a complex disease whose etiology involves genetic factors in strong interplay with environmental determinants and lifestyle. The discovery of genetic factors and biological pathways underlying human obesity is hampered by the difficulty in controlling the genetic background of human cohorts. Animal models are then necessary to further dissect the genetics of obesity. Pig has emerged as one of the most attractive models, because of the similarity with humans in the mechanisms regulating the fat deposition. Results. We collected the genes related to obesity in humans and to fat deposition traits in pig. We localized them on both human and pig genomes, building a map useful to interpret comparative studies on obesity. We characterized the collected genes structurally and functionally with BAR+ and mapped them on KEGG pathways and on STRING protein interaction network. Conclusions. The collected set consists of 361 obesity related genes in human and pig genomes. All genes were mapped on the human genome, and 54 could not be localized on the pig genome (release 2012). Only for 3 human genes there is no counterpart in pig, confirming that this animal is a good model for human obesity studies. Obesity related genes are mostly involved in regulation and signaling processes/pathways and relevant connection emerges between obesity-related genes and diseases such as cancer and infectious diseases. PMID:23855670

  4. Functional annotation of rare gene aberration drivers of pancreatic cancer | Office of Cancer Genomics

    Science.gov (United States)

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC).

  5. Genome Sequence and Annotation of Trichoderma parareesei, the Ancestor of the Cellulase Producer Trichoderma reesei

    OpenAIRE

    Yang, Dongqing; Pomraning, Kyle; Kopchinskiy, Alexey; Karimi Aghcheh, Razieh; Atanasova, Lea; Chenthamara, Komal; Baker, Scott E.; Zhang, Ruifu; Shen, Qirong; Freitag, Michael; Kubicek, Christian P.; Druzhinina, Irina S.

    2015-01-01

    The filamentous fungus Trichoderma parareesei is the asexually reproducing ancestor of Trichoderma reesei, the holomorphic industrial producer of cellulase and hemicellulase. Here, we present the genome sequence of the T. parareesei type strain CBS 125925, which contains genes for 9,318 proteins.

  6. Genome Sequence and Annotation of Trichoderma parareesei, the Ancestor of the Cellulase Producer Trichoderma reesei.

    Science.gov (United States)

    Yang, Dongqing; Pomraning, Kyle; Kopchinskiy, Alexey; Karimi Aghcheh, Razieh; Atanasova, Lea; Chenthamara, Komal; Baker, Scott E; Zhang, Ruifu; Shen, Qirong; Freitag, Michael; Kubicek, Christian P; Druzhinina, Irina S

    2015-01-01

    The filamentous fungus Trichoderma parareesei is the asexually reproducing ancestor of Trichoderma reesei, the holomorphic industrial producer of cellulase and hemicellulase. Here, we present the genome sequence of the T. parareesei type strain CBS 125925, which contains genes for 9,318 proteins. PMID:26272569

  7. neXtA5: accelerating annotation of articles via automated approaches in neXtProt.

    Science.gov (United States)

    Mottin, Luc; Gobeill, Julien; Pasche, Emilie; Michel, Pierre-André; Cusin, Isabelle; Gaudet, Pascale; Ruch, Patrick

    2016-01-01

    The rapid increase in the number of published articles poses a challenge for curated databases to remain up-to-date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. Our system, neXtA5, is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE over some predefined axes. This report focuses on three axes: Diseases, the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named-entities as stored in the Biomed database. The two ranked lists are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvement is the following: +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Process when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular protein-protein interactions, which require specific relationship extraction capabilities. In parallel, user-friendly interfaces powered with a set of JSON web services are currently being

  8. Automated quality control for genome wide association studies

    Science.gov (United States)

    Ellingson, Sally R.; Fardo, David W.

    2016-01-01

    This paper provides details on the necessary steps to assess and control data in genome wide association studies (GWAS) using genotype information on a large number of genetic markers for large number of individuals. Due to varied study designs and genotyping platforms between multiple sites/projects as well as potential genotyping errors, it is important to ensure high quality data. Scripts and directions are provided to facilitate others in this process.

  9. Annotation of loci from genome-wide association studies using tissue-specific quantitative interaction proteomics

    DEFF Research Database (Denmark)

    Lundby, Alicia; Rossin, Elizabeth J.; Steffensen, Annette B.;

    2014-01-01

    Genome-wide association studies (GWAS) have identified thousands of loci associated with complex traits, but it is challenging to pinpoint causal genes in these loci and to exploit subtle association signals. We used tissue-specific quantitative interaction proteomics to map a network of five genes...... propose candidates in GWAS loci for functional studies and to systematically filter subtle association signals using tissue-specific quantitative interaction proteomics....

  10. Identification of transcriptional signals in Encephalitozoon cuniculi widespread among Microsporidia phylum: support for accurate structural genome annotation

    Directory of Open Access Journals (Sweden)

    Wincker Patrick

    2009-12-01

    , 5'UTRs being strongly reduced, these signals can be used to ensure the accurate prediction of translation initiation codons for microsporidian genes and to improve microsporidian genome annotation.

  11. “Controlled, cross-species dataset for exploring biases in genome annotation and modification profiles”

    Directory of Open Access Journals (Sweden)

    Alison McAfee

    2015-12-01

    Full Text Available Since the sequencing of the honey bee genome, proteomics by mass spectrometry has become increasingly popular for biological analyses of this insect; but we have observed that the number of honey bee protein identifications is consistently low compared to other organisms [1]. In this dataset, we use nanoelectrospray ionization-coupled liquid chromatography–tandem mass spectrometry (nLC–MS/MS to systematically investigate the root cause of low honey bee proteome coverage. To this end, we present here data from three key experiments: a controlled, cross-species analyses of samples from Apis mellifera, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Mus musculus and Homo sapiens; a proteomic analysis of an individual honey bee whose genome was also sequenced; and a cross-tissue honey bee proteome comparison. The cross-species dataset was interrogated to determine relative proteome coverages between species, and the other two datasets were used to search for polymorphic sequences and to compare protein cleavage profiles, respectively.

  12. Beyond genomic variation - comparison and functional annotation in three Brassica rapa genotypes: a turnip, a rapid cycling and a Chinese cabbage

    OpenAIRE

    Lin, K.; Zhang, N.; Severing, E.I.; Nijveen, H.; Cheng, F; Visser, R.G.F.; Wang, X.; De, Ridder, G.; Bonnema, A.B.

    2014-01-01

    Background - Brassica rapa is an economically important crop species. During its long breeding history, a large number of morphotypes have been generated, including leafy vegetables such as Chinese cabbage and pakchoi, turnip tuber crops and oil crops. Results - To investigate the genetic variation underlying this morphological variation, we re-sequenced, assembled and annotated the genomes of two B. rapa subspecies, turnip crops (turnip) and a rapid cycling. We then analysed the two resultin...

  13. Evidence-based annotation of gene function in Shewanella oneidensis MR-1 using genome-wide fitness profiling across 121 conditions.

    OpenAIRE

    Adam Deutschbauer; Price, Morgan N.; Kelly M. Wetmore; Wenjun Shao; Baumohl, Jason K; Zhuchen Xu; Michelle Nguyen; Raquel Tamse; Davis, Ronald W; Arkin, Adam P.

    2011-01-01

    Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron ...

  14. Methods and software tools for mitochondrial genome assembly and annotation%线粒体基因组数据的分析方法和软件

    Institute of Scientific and Technical Information of China (English)

    李雪娟; 杨婧; 王俊红; 任倩俐; 李霞; 黄原

    2013-01-01

    With the increasing popularity of mitochondrial genome studies, the correct assembly and annotation of genomes are the basis of all subsequent research into a species. Here we describe the protocols using Staden Package software to assemble and annotate the mitochondrial genome, along with other commonly used software, such as ContigExpress, DNAMAN, DNASTAR, BioEdit and Sequencher. In addition, methods for the use of different software packages (including DOGMA.MOSAS.MITOS.GOBASE.OGRe.MitoZoa.tRNAscan-SE.ARWEN.BLAST and MiTFi) to annotate mitochondrial genomic protein-coding genes, rRNA, tRNA and the A +T region are briefly introduced. Finally, application of MEGAS software to analyze the composition of mitochondrial genomes, Sequin software to submit sequences to GenBank, and mitochondrial genome data visualization tools ( CG view. MTviz and OGDRAW) are also briefly introduced.%线粒体基因组的研究已经普及,其正确的拼接和注释是所有后续研究的基础.本文以Staden Package软件为主介绍了拼接和注释的线粒体基因组的方法,同时介绍了其他常用的拼接软件ContigExpress、DNAMAN、DNASTAR、BioEdit和Sequencher,以及利用不同软件(包括DOGMA、MOSAS、MITOS、GOBASE、OGRe、MitoZoa、tRNAscan-SE、ARWEN、BLAST和MiTFi等)对线粒体基因组中的蛋白质编码基因、rRNA、tRNA和A+T富集区进行注释的方法,最后介绍了利用MEGA5软件分析线粒体基因组的组成、Sequin软件提交序列和线粒体基因组数据绘图工具(CG view、MTviz和OGDRAW).

  15. Rapid high resolution genotyping of Francisella tularensis by whole genome sequence comparison of annotated genes ("MLST+".

    Directory of Open Access Journals (Sweden)

    Markus H Antwerpen

    Full Text Available The zoonotic disease tularemia is caused by the bacterium Francisella tularensis. This pathogen is considered as a category A select agent with potential to be misused in bioterrorism. Molecular typing based on DNA-sequence like canSNP-typing or MLVA has become the accepted standard for this organism. Due to the organism's highly clonal nature, the current typing methods have reached their limit of discrimination for classifying closely related subpopulations within the subspecies F. tularensis ssp. holarctica. We introduce a new gene-by-gene approach, MLST+, based on whole genome data of 15 sequenced F. tularensis ssp. holarctica strains and apply this approach to investigate an epidemic of lethal tularemia among non-human primates in two animal facilities in Germany. Due to the high resolution of MLST+ we are able to demonstrate that three independent clones of this highly infectious pathogen were responsible for these spatially and temporally restricted outbreaks.

  16. Partitioning SNPs Identified By GBS into Genome Annotation Classes and Calculating SNP-Explained Variances for Heading Date and Disease Resistance from the Resulting Genomic Relationship Matrices - Lolium perenne

    DEFF Research Database (Denmark)

    Byrne, Stephen; Cericola, Fabio; Janss, Luc;

    2015-01-01

    A draft sequence assembly of the perennial ryegrass genome was annotated with the aid of RNA-seq data from various genotypes, plant components, and treatments. We predicted 39,795 high quality proteins originating from 28,182 genetic loci. There was an average of 5.9 exons per protein, and an...... average protein Annotation Edit Distance (AED) of 0.14. Genotyping-By-Sequencing (GBS) data was generated after genome complexity reduction with ApeKI for 995 breeding families. Data was aligned against the annotated sequence assembly, and we identified variants at over 1.8 million positions, which were...... partitioned according to genomic feature. Variants were found in 20,594 (73%) of the gene loci, and these had on average 22.3 SNPs per gene. SNPs were partitioned according to the following annotation features; exon (277,404 SNPs), CDS (221,364 SNPs), 5’UTR (32,551 SNPs), 3’UTR (44,781 SNPs), gene (458...

  17. Genetic fine mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci.

    Science.gov (United States)

    Gaulton, Kyle J; Ferreira, Teresa; Lee, Yeji; Raimondo, Anne; Mägi, Reedik; Reschen, Michael E; Mahajan, Anubha; Locke, Adam; Rayner, N William; Robertson, Neil; Scott, Robert A; Prokopenko, Inga; Scott, Laura J; Green, Todd; Sparso, Thomas; Thuillier, Dorothee; Yengo, Loic; Grallert, Harald; Wahl, Simone; Frånberg, Mattias; Strawbridge, Rona J; Kestler, Hans; Chheda, Himanshu; Eisele, Lewin; Gustafsson, Stefan; Steinthorsdottir, Valgerdur; Thorleifsson, Gudmar; Qi, Lu; Karssen, Lennart C; van Leeuwen, Elisabeth M; Willems, Sara M; Li, Man; Chen, Han; Fuchsberger, Christian; Kwan, Phoenix; Ma, Clement; Linderman, Michael; Lu, Yingchang; Thomsen, Soren K; Rundle, Jana K; Beer, Nicola L; van de Bunt, Martijn; Chalisey, Anil; Kang, Hyun Min; Voight, Benjamin F; Abecasis, Gonçalo R; Almgren, Peter; Baldassarre, Damiano; Balkau, Beverley; Benediktsson, Rafn; Blüher, Matthias; Boeing, Heiner; Bonnycastle, Lori L; Bottinger, Erwin P; Burtt, Noël P; Carey, Jason; Charpentier, Guillaume; Chines, Peter S; Cornelis, Marilyn C; Couper, David J; Crenshaw, Andrew T; van Dam, Rob M; Doney, Alex S F; Dorkhan, Mozhgan; Edkins, Sarah; Eriksson, Johan G; Esko, Tonu; Eury, Elodie; Fadista, João; Flannick, Jason; Fontanillas, Pierre; Fox, Caroline; Franks, Paul W; Gertow, Karl; Gieger, Christian; Gigante, Bruna; Gottesman, Omri; Grant, George B; Grarup, Niels; Groves, Christopher J; Hassinen, Maija; Have, Christian T; Herder, Christian; Holmen, Oddgeir L; Hreidarsson, Astradur B; Humphries, Steve E; Hunter, David J; Jackson, Anne U; Jonsson, Anna; Jørgensen, Marit E; Jørgensen, Torben; Kao, Wen-Hong L; Kerrison, Nicola D; Kinnunen, Leena; Klopp, Norman; Kong, Augustine; Kovacs, Peter; Kraft, Peter; Kravic, Jasmina; Langford, Cordelia; Leander, Karin; Liang, Liming; Lichtner, Peter; Lindgren, Cecilia M; Lindholm, Eero; Linneberg, Allan; Liu, Ching-Ti; Lobbens, Stéphane; Luan, Jian'an; Lyssenko, Valeriya; Männistö, Satu; McLeod, Olga; Meyer, Julia; Mihailov, Evelin; Mirza, Ghazala; Mühleisen, Thomas W; Müller-Nurasyid, Martina; Navarro, Carmen; Nöthen, Markus M; Oskolkov, Nikolay N; Owen, Katharine R; Palli, Domenico; Pechlivanis, Sonali; Peltonen, Leena; Perry, John R B; Platou, Carl G P; Roden, Michael; Ruderfer, Douglas; Rybin, Denis; van der Schouw, Yvonne T; Sennblad, Bengt; Sigurðsson, Gunnar; Stančáková, Alena; Steinbach, Gerald; Storm, Petter; Strauch, Konstantin; Stringham, Heather M; Sun, Qi; Thorand, Barbara; Tikkanen, Emmi; Tonjes, Anke; Trakalo, Joseph; Tremoli, Elena; Tuomi, Tiinamaija; Wennauer, Roman; Wiltshire, Steven; Wood, Andrew R; Zeggini, Eleftheria; Dunham, Ian; Birney, Ewan; Pasquali, Lorenzo; Ferrer, Jorge; Loos, Ruth J F; Dupuis, Josée; Florez, Jose C; Boerwinkle, Eric; Pankow, James S; van Duijn, Cornelia; Sijbrands, Eric; Meigs, James B; Hu, Frank B; Thorsteinsdottir, Unnur; Stefansson, Kari; Lakka, Timo A; Rauramaa, Rainer; Stumvoll, Michael; Pedersen, Nancy L; Lind, Lars; Keinanen-Kiukaanniemi, Sirkka M; Korpi-Hyövälti, Eeva; Saaristo, Timo E; Saltevo, Juha; Kuusisto, Johanna; Laakso, Markku; Metspalu, Andres; Erbel, Raimund; Jöcke, Karl-Heinz; Moebus, Susanne; Ripatti, Samuli; Salomaa, Veikko; Ingelsson, Erik; Boehm, Bernhard O; Bergman, Richard N; Collins, Francis S; Mohlke, Karen L; Koistinen, Heikki; Tuomilehto, Jaakko; Hveem, Kristian; Njølstad, Inger; Deloukas, Panagiotis; Donnelly, Peter J; Frayling, Timothy M; Hattersley, Andrew T; de Faire, Ulf; Hamsten, Anders; Illig, Thomas; Peters, Annette; Cauchi, Stephane; Sladek, Rob; Froguel, Philippe; Hansen, Torben; Pedersen, Oluf; Morris, Andrew D; Palmer, Collin N A; Kathiresan, Sekar; Melander, Olle; Nilsson, Peter M; Groop, Leif C; Barroso, Inês; Langenberg, Claudia; Wareham, Nicholas J; O'Callaghan, Christopher A; Gloyn, Anna L; Altshuler, David; Boehnke, Michael; Teslovich, Tanya M; McCarthy, Mark I; Morris, Andrew P

    2015-12-01

    We performed fine mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in or near KCNQ1. 'Credible sets' of the variants most likely to drive each distinct signal mapped predominantly to noncoding sequence, implying that association with T2D is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine mapping implicated rs10830963 as driving T2D association. We confirmed that the T2D risk allele for this SNP increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease. PMID:26551672

  18. CrAgDb--a database of annotated chaperone repertoire in archaeal genomes.

    Science.gov (United States)

    Rani, Shikha; Srivastava, Abhishikha; Kumar, Manish; Goel, Manisha

    2016-03-01

    Chaperones are a diverse class of ubiquitous proteins that assist other cellular proteins in folding correctly and maintaining their native structure. Many different chaperones cooperate to constitute the 'proteostasis' machinery in the cells. It has been proposed earlier that archaeal organisms could be ideal model systems for deciphering the basic functioning of the 'protein folding machinery' in higher eukaryotes. Several chaperone families have been characterized in archaea over the years but mostly one protein at a time, making it difficult to decipher the composition and mechanistics of the protein folding system as a whole. In order to deal with these lacunae, we have developed a database of all archaeal chaperone proteins, CrAgDb (Chaperone repertoire in Archaeal genomes). The data have been presented in a systematic way with intuitive browse and search facilities for easy retrieval of information. Access to these curated datasets should expedite large-scale analysis of archaeal chaperone networks and significantly advance our understanding of operation and regulation of the protein folding machinery in archaea. Researchers could then translate this knowledge to comprehend the more complex protein folding pathways in eukaryotic systems. The database is freely available at http://14.139.227.92/mkumar/cragdb/. PMID:26862144

  19. Genetic fine-mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci

    Science.gov (United States)

    Mahajan, Anubha; Locke, Adam; Rayner, N William; Robertson, Neil; Scott, Robert A; Prokopenko, Inga; Scott, Laura J; Green, Todd; Sparso, Thomas; Thuillier, Dorothee; Yengo, Loic; Grallert, Harald; Wahl, Simone; Frånberg, Mattias; Strawbridge, Rona J; Kestler, Hans; Chheda, Himanshu; Eisele, Lewin; Gustafsson, Stefan; Steinthorsdottir, Valgerdur; Thorleifsson, Gudmar; Qi, Lu; Karssen, Lennart C; van Leeuwen, Elisabeth M; Willems, Sara M; Li, Man; Chen, Han; Fuchsberger, Christian; Kwan, Phoenix; Ma, Clement; Linderman, Michael; Lu, Yingchang; Thomsen, Soren K; Rundle, Jana K; Beer, Nicola L; van de Bunt, Martijn; Chalisey, Anil; Kang, Hyun Min; Voight, Benjamin F; Abecasis, Goncalo R; Almgren, Peter; Baldassarre, Damiano; Balkau, Beverley; Benediktsson, Rafn; Blüher, Matthias; Boeing, Heiner; Bonnycastle, Lori L; Borringer, Erwin P; Burtt, Noël P; Carey, Jason; Charpentier, Guillaume; Chines, Peter S; Cornelis, Marilyn C; Couper, David J; Crenshaw, Andrew T; van Dam, Rob M; Doney, Alex SF; Dorkhan, Mozhgan; Edkins, Sarah; Eriksson, Johan G; Esko, Tonu; Eury, Elodie; Fadista, João; Flannick, Jason; Fontanillas, Pierre; Fox, Caroline; Franks, Paul W; Gertow, Karl; Gieger, Christian; Gigante, Bruna; Gottesman, Omri; Grant, George B; Grarup, Niels; Groves, Christopher J; Hassinen, Maija; Have, Christian T; Herder, Christian; Holmen, Oddgeir L; Hreidarsson, Astradur B; Humphries, Steve E; Hunter, David J; Jackson, Anne U; Jonsson, Anna; Jørgensen, Marit E; Jørgensen, Torben; Kerrison, Nicola D; Kinnunen, Leena; Klopp, Norman; Kong, Augustine; Kovacs, Peter; Kraft, Peter; Kravic, Jasmina; Langford, Cordelia; Leander, Karin; Liang, Liming; Lichtner, Peter; Lindgren, Cecilia M; Lindholm, Eero; Linneberg, Allan; Liu, Ching-Ti; Lobbens, Stéphane; Luan, Jian’an; Lyssenko, Valeriya; Männistö, Satu; McLeod, Olga; Meyer, Julia; Mihailov, Evelin; Mirza, Ghazala; Mühleisen, Thomas W; Müller-Nurasyid, Martina; Navarro, Carmen; Nöthen, Markus M; Oskolkov, Nikolay N; Owen, Katharine R; Palli, Domenico; Pechlivanis, Sonali; Perry, John RB; Platou, Carl GP; Roden, Michael; Ruderfer, Douglas; Rybin, Denis; van der Schouw, Yvonne T; Sennblad, Bengt; Sigurðsson, Gunnar; Stančáková, Alena; Steinbach, Gerald; Storm, Petter; Strauch, Konstantin; Stringham, Heather M; Sun, Qi; Thorand, Barbara; Tikkanen, Emmi; Tonjes, Anke; Trakalo, Joseph; Tremoli, Elena; Tuomi, Tiinamaija; Wennauer, Roman; Wood, Andrew R; Zeggini, Eleftheria; Dunham, Ian; Birney, Ewan; Pasquali, Lorenzo; Ferrer, Jorge; Loos, Ruth JF; Dupuis, Josée; Florez, Jose C; Boerwinkle, Eric; Pankow, James S; van Duijn, Cornelia; Sijbrands, Eric; Meigs, James B; Hu, Frank B; Thorsteinsdottir, Unnur; Stefansson, Kari; Lakka, Timo A; Rauramaa, Rainer; Stumvoll, Michael; Pedersen, Nancy L; Lind, Lars; Keinanen-Kiukaanniemi, Sirkka M; Korpi-Hyövälti, Eeva; Saaristo, Timo E; Saltevo, Juha; Kuusisto, Johanna; Laakso, Markku; Metspalu, Andres; Erbel, Raimund; Jöckel, Karl-Heinz; Moebus, Susanne; Ripatti, Samuli; Salomaa, Veikko; Ingelsson, Erik; Boehm, Bernhard O; Bergman, Richard N; Collins, Francis S; Mohlke, Karen L; Koistinen, Heikki; Tuomilehto, Jaakko; Hveem, Kristian; Njølstad, Inger; Deloukas, Panagiotis; Donnelly, Peter J; Frayling, Timothy M; Hattersley, Andrew T; de Faire, Ulf; Hamsten, Anders; Illig, Thomas; Peters, Annette; Cauchi, Stephane; Sladek, Rob; Froguel, Philippe; Hansen, Torben; Pedersen, Oluf; Morris, Andrew D; Palmer, Collin NA; Kathiresan, Sekar; Melander, Olle; Nilsson, Peter M; Groop, Leif C; Barroso, Inês; Langenberg, Claudia; Wareham, Nicholas J; O’Callaghan, Christopher A; Gloyn, Anna L; Altshuler, David; Boehnke, Michael; Teslovich, Tanya M; McCarthy, Mark I; Morris, Andrew P

    2015-01-01

    We performed fine-mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in/near KCNQ1. “Credible sets” of variants most likely to drive each distinct signal mapped predominantly to non-coding sequence, implying that T2D association is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine-mapping implicated rs10830963 as driving T2D association. We confirmed that this T2D-risk allele increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D-risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease. PMID:26551672

  20. VariOtator, a Software Tool for Variation Annotation with the Variation Ontology.

    Science.gov (United States)

    Schaafsma, Gerard C P; Vihinen, Mauno

    2016-04-01

    The Variation Ontology (VariO) is used for describing and annotating types, effects, consequences, and mechanisms of variations. To facilitate easy and consistent annotations, the online application VariOtator was developed. For variation type annotations, VariOtator is fully automated, accepting variant descriptions in Human Genome Variation Society (HGVS) format, and generating VariO terms, either with or without full lineage, that is, all parent terms. When a coding DNA variant description with a reference sequence is provided, VariOtator checks the description first with Mutalyzer and then generates the predicted RNA and protein descriptions with their respective VariO annotations. For the other sublevels, function, structure, and property, annotations cannot be automated, and VariOtator generates annotation based on provided details. For VariO terms relating to structure and property, one can use attribute terms as modifiers and evidence code terms for annotating experimental evidence. There is an online batch version, and stand-alone batch versions to be used with a Leiden Open Variation Database (LOVD) download file. A SOAP Web service allows client programs to access VariOtator programmatically. Thus, systematic variation effect and type annotations can be efficiently generated to allow easy use and integration of variations and their consequences. PMID:26773573

  1. Algal functional annotation tool

    Energy Technology Data Exchange (ETDEWEB)

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes

  2. Inconsistencies of genome annotations in apicomplexan parasites revealed by 5'-end-one-pass and full-length sequences of oligo-capped cDNAs

    Directory of Open Access Journals (Sweden)

    Sugano Sumio

    2009-07-01

    Full Text Available Abstract Background Apicomplexan parasites are causative agents of various diseases including malaria and have been targets of extensive genomic sequencing. We generated 5'-EST collections for six apicomplexa parasites using our full-length oligo-capping cDNA library method. To improve upon the current genome annotations, as well as to validate the importance for physical cDNA clone resources, we generated a large-scale collection of full-length cDNAs for several apicomplexa parasites. Results In this study, we used a total of 61,056 5'-end-single-pass cDNA sequences from Plasmodium falciparum, P. vivax, P. yoelii, P. berghei, Cryptosporidium parvum, and Toxoplasma gondii. We compared these partially sequenced cDNA sequences with the currently annotated gene models and observed significant inconsistencies between the two datasets. In particular, we found that on average 14% of the exons in the current gene models were not supported by any cDNA evidence, and that 16% of the current gene models may contain at least one mis-annotation and should be re-evaluated. We also identified a large number of transcripts that had been previously unidentified. For 732 cDNAs in T. gondii, the entire sequences were determined in order to evaluate the annotated gene models at the complete full-length transcript level. We found that 41% of the T. gondii gene models contained at least one inconsistency. We also identified and confirmed by RT-PCR 140 previously unidentified transcripts found in the intergenic regions of the current gene annotations. We show that the majority of these discrepancies are due to questionable predictions of one or two extra exons in the upstream or downstream regions of the genes. Conclusion Our data indicates that the current gene models are likely to still be incomplete and have much room for improvement. Our unique full-length cDNA information is especially useful for further refinement of the annotations for the genomes of

  3. The Ensembl gene annotation system.

    Science.gov (United States)

    Aken, Bronwen L; Ayling, Sarah; Barrell, Daniel; Clarke, Laura; Curwen, Valery; Fairley, Susan; Fernandez Banet, Julio; Billis, Konstantinos; García Girón, Carlos; Hourlier, Thibaut; Howe, Kevin; Kähäri, Andreas; Kokocinski, Felix; Martin, Fergal J; Murphy, Daniel N; Nag, Rishi; Ruffier, Magali; Schuster, Michael; Tang, Y Amy; Vogel, Jan-Hinnerk; White, Simon; Zadissa, Amonida; Flicek, Paul; Searle, Stephen M J

    2016-01-01

    The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail.Database URL: http://www.ensembl.org/index.html. PMID:27337980

  4. Annotated English

    CERN Document Server

    Hernandez-Orallo, Jose

    2010-01-01

    This document presents Annotated English, a system of diacritical symbols which turns English pronunciation into a precise and unambiguous process. The annotations are defined and located in such a way that the original English text is not altered (not even a letter), thus allowing for a consistent reading and learning of the English language with and without annotations. The annotations are based on a set of general rules that make the frequency of annotations not dramatically high. This makes the reader easily associate annotations with exceptions, and makes it possible to shape, internalise and consolidate some rules for the English language which otherwise are weakened by the enormous amount of exceptions in English pronunciation. The advantages of this annotation system are manifold. Any existing text can be annotated without a significant increase in size. This means that we can get an annotated version of any document or book with the same number of pages and fontsize. Since no letter is affected, the ...

  5. All SNPs are not created equal: genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs

    DEFF Research Database (Denmark)

    Schork, Andrew J; Thompson, Wesley K; Pham, Phillip;

    2013-01-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False...... Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate...... (TDR = 1-FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in...

  6. Evidence-based annotation of gene function in Shewanella oneidensis MR-1 using genome-wide fitness profiling across 121 conditions.

    Directory of Open Access Journals (Sweden)

    Adam Deutschbauer

    2011-11-01

    Full Text Available Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes.

  7. Evidence-based annotation of gene function in Shewanella oneidensis MR-1 using genome-wide fitness profiling across 121 conditions.

    Science.gov (United States)

    Deutschbauer, Adam; Price, Morgan N; Wetmore, Kelly M; Shao, Wenjun; Baumohl, Jason K; Xu, Zhuchen; Nguyen, Michelle; Tamse, Raquel; Davis, Ronald W; Arkin, Adam P

    2011-11-01

    Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes. PMID:22125499

  8. Automated whole-genome multiple alignment of rat, mouse, and human

    Energy Technology Data Exchange (ETDEWEB)

    Brudno, Michael; Poliakov, Alexander; Salamov, Asaf; Cooper, Gregory M.; Sidow, Arend; Rubin, Edward M.; Solovyev, Victor; Batzoglou, Serafim; Dubchak, Inna

    2004-07-04

    We have built a whole genome multiple alignment of the three currently available mammalian genomes using a fully automated pipeline which combines the local/global approach of the Berkeley Genome Pipeline and the LAGAN program. The strategy is based on progressive alignment, and consists of two main steps: (1) alignment of the mouse and rat genomes; and (2) alignment of human to either the mouse-rat alignments from step 1, or the remaining unaligned mouse and rat sequences. The resulting alignments demonstrate high sensitivity, with 87% of all human gene-coding areas aligned in both mouse and rat. The specificity is also high: <7% of the rat contigs are aligned to multiple places in human and 97% of all alignments with human sequence > 100kb agree with a three-way synteny map built independently using predicted exons in the three genomes. At the nucleotide level <1% of the rat nucleotides are mapped to multiple places in the human sequence in the alignment; and 96.5% of human nucleotides within all alignments agree with the synteny map. The alignments are publicly available online, with visualization through the novel Multi-VISTA browser that we also present.

  9. Vcfanno: fast, flexible annotation of genetic variants.

    Science.gov (United States)

    Pedersen, Brent S; Layer, Ryan M; Quinlan, Aaron R

    2016-01-01

    The integration of genome annotations is critical to the identification of genetic variants that are relevant to studies of disease or other traits. However, comprehensive variant annotation with diverse file formats is difficult with existing methods. Here we describe vcfanno, which flexibly extracts and summarizes attributes from multiple annotation files and integrates the annotations within the INFO column of the original VCF file. By leveraging a parallel "chromosome sweeping" algorithm, we demonstrate substantial performance gains by annotating ~85,000 variants per second with 50 attributes from 17 commonly used genome annotation resources. Vcfanno is available at https://github.com/brentp/vcfanno under the MIT license. PMID:27250555

  10. GroopM: an automated tool for the recovery of population genomes from related metagenomes

    Directory of Open Access Journals (Sweden)

    Michael Imelfort

    2014-09-01

    Full Text Available Metagenomic binning methods that leverage differential population abundances in microbial communities (differential coverage are emerging as a complementary approach to conventional composition-based binning. Here we introduce GroopM, an automated binning tool that primarily uses differential coverage to obtain high fidelity population genomes from related metagenomes. We demonstrate the effectiveness of GroopM using synthetic and real-world metagenomes, and show that GroopM produces results comparable with more time consuming, labor-intensive methods.

  11. Concept annotation in the CRAFT corpus

    OpenAIRE

    Bada Michael; Eckert Miriam; Evans Donald; Garcia Kristin; Shipley Krista; Sitnikov Dmitry; Baumgartner William A; Cohen K; Verspoor Karin; Blake Judith A; Hunter Lawrence E

    2012-01-01

    Abstract Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRA...

  12. Re-annotation of the physical map of Glycine max for polyploid-like regions by BAC end sequence driven whole genome shotgun read assembly

    Directory of Open Access Journals (Sweden)

    Shultz Jeffry

    2008-07-01

    Full Text Available Abstract Background Many of the world's most important food crops have either polyploid genomes or homeologous regions derived from segmental shuffling following polyploid formation. The soybean (Glycine max genome has been shown to be composed of approximately four thousand short interspersed homeologous regions with 1, 2 or 4 copies per haploid genome by RFLP analysis, microsatellite anchors to BACs and by contigs formed from BAC fingerprints. Despite these similar regions,, the genome has been sequenced by whole genome shotgun sequence (WGS. Here the aim was to use BAC end sequences (BES derived from three minimum tile paths (MTP to examine the extent and homogeneity of polyploid-like regions within contigs and the extent of correlation between the polyploid-like regions inferred from fingerprinting and the polyploid-like sequences inferred from WGS matches. Results Results show that when sequence divergence was 1–10%, the copy number of homeologous regions could be identified from sequence variation in WGS reads overlapping BES. Homeolog sequence variants (HSVs were single nucleotide polymorphisms (SNPs; 89% and single nucleotide indels (SNIs 10%. Larger indels were rare but present (1%. Simulations that had predicted fingerprints of homeologous regions could be separated when divergence exceeded 2% were shown to be false. We show that a 5–10% sequence divergence is necessary to separate homeologs by fingerprinting. BES compared to WGS traces showed polyploid-like regions with less than 1% sequence divergence exist at 2.3% of the locations assayed. Conclusion The use of HSVs like SNPs and SNIs to characterize BACs wil improve contig building methods. The implications for bioinformatic and functional annotation of polyploid and paleopolyploid genomes show that a combined approach of BAC fingerprint based physical maps, WGS sequence and HSV-based partitioning of BAC clones from homeologous regions to separate contigs will allow reliable de

  13. Instructions for Temporal Annotation of Scheduling Dialogs

    OpenAIRE

    O'Hara, Tom; Wiebe, Janyce; Payne, Karen

    1997-01-01

    Human annotation of natural language facilitates standardized evaluation of natural language processing systems and supports automated feature extraction. This document consists of instructions for annotating the temporal information in scheduling dialogs, dialogs in which the participants schedule a meeting with one another. Task-oriented dialogs, such as these are, would arise in many useful applications, for instance, automated information providers and automated phone operators. Explicit ...

  14. A collection of bioconductor methods to visualize gene-list annotations

    Directory of Open Access Journals (Sweden)

    Kibbe Warren A

    2010-01-01

    Full Text Available Abstract Background Gene-list annotations are critical for researchers to explore the complex relationships between genes and functionalities. Currently, the annotations of a gene list are usually summarized by a table or a barplot. As such, potentially biologically important complexities such as one gene belonging to multiple annotation categories are difficult to extract. We have devised explicit and efficient visualization methods that provide intuitive methods for interrogating the intrinsic connections between biological categories and genes. Findings We have constructed a data model and now present two novel methods in a Bioconductor package, "GeneAnswers", to simultaneously visualize genes, concepts (a.k.a. annotation categories, and concept-gene connections (a.k.a. annotations: the "Concept-and-Gene Network" and the "Concept-and-Gene Cross Tabulation". These methods have been tested and validated with microarray-derived gene lists. Conclusions These new visualization methods can effectively present annotations using Gene Ontology, Disease Ontology, or any other user-defined gene annotations that have been pre-associated with an organism's genome by human curation, automated pipelines, or a combination of the two. The gene-annotation data model and associated methods are available in the Bioconductor package called "GeneAnswers " described in this publication.

  15. Gene Ontology annotations and resources.

    Science.gov (United States)

    Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

  16. Assembly and annotation of full mitochondrial genomes for the corn rootworm species, Diabrotica virgifera virgifera and Diabrotica barberi (Insecta: Coleoptera: Chrysomelidae), using Next Generation Sequence data.

    Science.gov (United States)

    Coates, Brad S

    2014-06-01

    Complete mitochondrial genomes for two corn rootworm species, Diabrotica virgifera virgifera (16,747 bp) and Diabrotica barberi (16,632; Insecta: Coleoptera: Chrysomelidae), were assembled from Illumina HiSeq2000 read data. Annotation indicated that the order and orientation of 13 protein coding genes (PCGs), and 22 tRNA and 2 rRNA sequences were in typical of insect mitochondrial genomes. Non-standard nad4 and cox3 stop codons were composed of single T nucleotides and likely completed by adenylation, and atypical TTT start codons was predicted for both D. v. virgifera and D. barberinad1 genes. The D. v. virgifera and D. barberi haplotypes showed 819 variable nucleotide positions within PCG regions (7.36% divergence), which suggest that speciation may have occurred ~3.68 million years ago assuming a linear rate of short-term substitution. Phylogenetic analyses of Coleopteran MtD genome show clustering based on family level, and may have the capacity to resolve the evolutionary history within this Order of insects. PMID:24657060

  17. OrthoVenn: a web server for genome wide comparison and annotation of orthologous clusters across multiple species

    Science.gov (United States)

    Genome wide analysis of orthologous clusters is an important component of comparative genomics studies. Identifying the overlap among orthologous clusters can enable us to elucidate the function and evolution of proteins across multiple species. Here, we report a web platform named OrthoVenn that i...

  18. Algal functional annotation tool

    Energy Technology Data Exchange (ETDEWEB)

    Lopez, D. [UCLA; Casero, D. [UCLA; Cokus, S. J. [UCLA; Merchant, S. S. [UCLA; Pellegrini, M. [UCLA

    2012-07-01

    The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion.

  19. Pep2Path: automated mass spectrometry-guided genome mining of peptidic natural products.

    Directory of Open Access Journals (Sweden)

    Marnix H Medema

    2014-09-01

    Full Text Available Nonribosomally and ribosomally synthesized bioactive peptides constitute a source of molecules of great biomedical importance, including antibiotics such as penicillin, immunosuppressants such as cyclosporine, and cytostatics such as bleomycin. Recently, an innovative mass-spectrometry-based strategy, peptidogenomics, has been pioneered to effectively mine microbial strains for novel peptidic metabolites. Even though mass-spectrometric peptide detection can be performed quite fast, true high-throughput natural product discovery approaches have still been limited by the inability to rapidly match the identified tandem mass spectra to the gene clusters responsible for the biosynthesis of the corresponding compounds. With Pep2Path, we introduce a software package to fully automate the peptidogenomics approach through the rapid Bayesian probabilistic matching of mass spectra to their corresponding biosynthetic gene clusters. Detailed benchmarking of the method shows that the approach is powerful enough to correctly identify gene clusters even in data sets that consist of hundreds of genomes, which also makes it possible to match compounds from unsequenced organisms to closely related biosynthetic gene clusters in other genomes. Applying Pep2Path to a data set of compounds without known biosynthesis routes, we were able to identify candidate gene clusters for the biosynthesis of five important compounds. Notably, one of these clusters was detected in a genome from a different subphylum of Proteobacteria than that in which the molecule had first been identified. All in all, our approach paves the way towards high-throughput discovery of novel peptidic natural products. Pep2Path is freely available from http://pep2path.sourceforge.net/, implemented in Python, licensed under the GNU General Public License v3 and supported on MS Windows, Linux and Mac OS X.

  20. Pep2Path: automated mass spectrometry-guided genome mining of peptidic natural products.

    Science.gov (United States)

    Medema, Marnix H; Paalvast, Yared; Nguyen, Don D; Melnik, Alexey; Dorrestein, Pieter C; Takano, Eriko; Breitling, Rainer

    2014-09-01

    Nonribosomally and ribosomally synthesized bioactive peptides constitute a source of molecules of great biomedical importance, including antibiotics such as penicillin, immunosuppressants such as cyclosporine, and cytostatics such as bleomycin. Recently, an innovative mass-spectrometry-based strategy, peptidogenomics, has been pioneered to effectively mine microbial strains for novel peptidic metabolites. Even though mass-spectrometric peptide detection can be performed quite fast, true high-throughput natural product discovery approaches have still been limited by the inability to rapidly match the identified tandem mass spectra to the gene clusters responsible for the biosynthesis of the corresponding compounds. With Pep2Path, we introduce a software package to fully automate the peptidogenomics approach through the rapid Bayesian probabilistic matching of mass spectra to their corresponding biosynthetic gene clusters. Detailed benchmarking of the method shows that the approach is powerful enough to correctly identify gene clusters even in data sets that consist of hundreds of genomes, which also makes it possible to match compounds from unsequenced organisms to closely related biosynthetic gene clusters in other genomes. Applying Pep2Path to a data set of compounds without known biosynthesis routes, we were able to identify candidate gene clusters for the biosynthesis of five important compounds. Notably, one of these clusters was detected in a genome from a different subphylum of Proteobacteria than that in which the molecule had first been identified. All in all, our approach paves the way towards high-throughput discovery of novel peptidic natural products. Pep2Path is freely available from http://pep2path.sourceforge.net/, implemented in Python, licensed under the GNU General Public License v3 and supported on MS Windows, Linux and Mac OS X. PMID:25188327

  1. Functional annotations of diabetes nephropathy susceptibility loci through analysis of genome-wide renal gene expression in rat models of diabetes mellitus

    Directory of Open Access Journals (Sweden)

    Farrall Martin

    2009-07-01

    Full Text Available Abstract Background Hyperglycaemia in diabetes mellitus (DM alters gene expression regulation in various organs and contributes to long term vascular and renal complications. We aimed to generate novel renal genome-wide gene transcription data in rat models of diabetes in order to test the responsiveness to hyperglycaemia and renal structural changes of positional candidate genes at selected diabetic nephropathy (DN susceptibility loci. Methods Both Affymetrix and Illumina technologies were used to identify significant quantitative changes in the abundance of over 15,000 transcripts in kidney of models of spontaneous (genetically determined mild hyperglycaemia and insulin resistance (Goto-Kakizaki-GK and experimentally induced severe hyperglycaemia (Wistar-Kyoto-WKY rats injected with streptozotocin [STZ]. Results Different patterns of transcription regulation in the two rat models of diabetes likely underlie the roles of genetic variants and hyperglycaemia severity. The impact of prolonged hyperglycaemia on gene expression changes was more profound in STZ-WKY rats than in GK rats and involved largely different sets of genes. These included genes already tested in genetic studies of DN and a large number of protein coding sequences of unknown function which can be considered as functional and, when they map to DN loci, positional candidates for DN. Further expression analysis of rat orthologs of human DN positional candidate genes provided functional annotations of known and novel genes that are responsive to hyperglycaemia and may contribute to renal functional and/or structural alterations. Conclusion Combining transcriptomics in animal models and comparative genomics provides important information to improve functional annotations of disease susceptibility loci in humans and experimental support for testing candidate genes in human genetics.

  2. Guitar: An R/Bioconductor Package for Gene Annotation Guided Transcriptomic Analysis of RNA-Related Genomic Features

    Directory of Open Access Journals (Sweden)

    Xiaodong Cui

    2016-01-01

    Full Text Available Biological features, such as genes and transcription factor binding sites, are often denoted with genome-based coordinates as the genomic features. While genome-based representation is usually very effective in correlating various biological features, it can be tedious to examine the relationship between RNA-related genomic features and the landmarks of RNA transcripts with existing tools due to the difficulty in the conversion between genome-based coordinates and RNA-based coordinates. We developed here an open source Guitar R/Bioconductor package for sketching the transcriptomic view of RNA-related biological features represented by genome based coordinates. Internally, Guitar package extracts the standardized RNA coordinates with respect to the landmarks of RNA transcripts, with which hundreds of millions of RNA-related genomic features can then be efficiently analyzed within minutes. We demonstrated the usage of Guitar package in analyzing posttranscriptional RNA modifications (5-methylcytosine and N6-methyladenosine derived from high-throughput sequencing approaches (MeRIP-Seq and RNA BS-Seq and show that RNA 5-methylcytosine (m5C is enriched in 5′UTR. The newly developed Guitar R/Bioconductor package achieves stable performance on the data tested and revealed novel biological insights. It will effectively facilitate the analysis of RNA methylation data and other RNA-related biological features in the future.

  3. Guitar: An R/Bioconductor Package for Gene Annotation Guided Transcriptomic Analysis of RNA-Related Genomic Features.

    Science.gov (United States)

    Cui, Xiaodong; Wei, Zhen; Zhang, Lin; Liu, Hui; Sun, Lei; Zhang, Shao-Wu; Huang, Yufei; Meng, Jia

    2016-01-01

    Biological features, such as genes and transcription factor binding sites, are often denoted with genome-based coordinates as the genomic features. While genome-based representation is usually very effective in correlating various biological features, it can be tedious to examine the relationship between RNA-related genomic features and the landmarks of RNA transcripts with existing tools due to the difficulty in the conversion between genome-based coordinates and RNA-based coordinates. We developed here an open source Guitar R/Bioconductor package for sketching the transcriptomic view of RNA-related biological features represented by genome based coordinates. Internally, Guitar package extracts the standardized RNA coordinates with respect to the landmarks of RNA transcripts, with which hundreds of millions of RNA-related genomic features can then be efficiently analyzed within minutes. We demonstrated the usage of Guitar package in analyzing posttranscriptional RNA modifications (5-methylcytosine and N6-methyladenosine) derived from high-throughput sequencing approaches (MeRIP-Seq and RNA BS-Seq) and show that RNA 5-methylcytosine (m(5)C) is enriched in 5'UTR. The newly developed Guitar R/Bioconductor package achieves stable performance on the data tested and revealed novel biological insights. It will effectively facilitate the analysis of RNA methylation data and other RNA-related biological features in the future. PMID:27239475

  4. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm

    Science.gov (United States)

    2014-01-01

    Background Recovering individual genomes from metagenomic datasets allows access to uncultivated microbial populations that may have important roles in natural and engineered ecosystems. Understanding the roles of these uncultivated populations has broad application in ecology, evolution, biotechnology and medicine. Accurate binning of assembled metagenomic sequences is an essential step in recovering the genomes and understanding microbial functions. Results We have developed a binning algorithm, MaxBin, which automates the binning of assembled metagenomic scaffolds using an expectation-maximization algorithm after the assembly of metagenomic sequencing reads. Binning of simulated metagenomic datasets demonstrated that MaxBin had high levels of accuracy in binning microbial genomes. MaxBin was used to recover genomes from metagenomic data obtained through the Human Microbiome Project, which demonstrated its ability to recover genomes from real metagenomic datasets with variable sequencing coverages. Application of MaxBin to metagenomes obtained from microbial consortia adapted to grow on cellulose allowed genomic analysis of new, uncultivated, cellulolytic bacterial populations, including an abundant myxobacterial population distantly related to Sorangium cellulosum that possessed a much smaller genome (5 MB versus 13 to 14 MB) but has a more extensive set of genes for biomass deconstruction. For the cellulolytic consortia, the MaxBin results were compared to binning using emergent self-organizing maps (ESOMs) and differential coverage binning, demonstrating that it performed comparably to these methods but had distinct advantages in automation, resolution of related genomes and sensitivity. Conclusions The automatic binning software that we developed successfully classifies assembled sequences in metagenomic datasets into recovered individual genomes. The isolation of dozens of species in cellulolytic microbial consortia, including a novel species of

  5. Evidence-based gene predictions in plant genomes

    Science.gov (United States)

    Automated evidence-based gene building is a rapid and cost-effective way to provide reliable gene annotations on newly sequenced genomes. One of the limitations of evidence-based gene builders, however, is their requirement for gene expression evidence—known proteins, full-length cDNAs, or expressed...

  6. Functional annotation and ENU

    OpenAIRE

    Gunn, Teresa M.

    2012-01-01

    Functional annotation of every gene in the mouse genome is a herculean task that requires a multifaceted approach. Many large-scale initiatives are contributing to this undertaking. The International Knockout Mouse Consortium (IKMC) plans to mutate every protein-coding gene, using a combination of gene trapping and gene targeting in embryonic stem cells. Many other groups are performing using the chemical mutagen ethylnitrosourea (ENU) or transpon-based systems to induce mutations, screening ...

  7. Genome-wide and functional annotation of human E3 ubiquitin ligases identifies MULAN, a mitochondrial E3 that regulates the organelle's dynamics and signaling.

    Directory of Open Access Journals (Sweden)

    Wei Li

    Full Text Available Specificity of protein ubiquitylation is conferred by E3 ubiquitin (Ub ligases. We have annotated approximately 617 putative E3s and substrate-recognition subunits of E3 complexes encoded in the human genome. The limited knowledge of the function of members of the large E3 superfamily prompted us to generate genome-wide E3 cDNA and RNAi expression libraries designed for functional screening. An imaging-based screen using these libraries to identify E3s that regulate mitochondrial dynamics uncovered MULAN/FLJ12875, a RING finger protein whose ectopic expression and knockdown both interfered with mitochondrial trafficking and morphology. We found that MULAN is a mitochondrial protein - two transmembrane domains mediate its localization to the organelle's outer membrane. MULAN is oriented such that its E3-active, C-terminal RING finger is exposed to the cytosol, where it has access to other components of the Ub system. Both an intact RING finger and the correct subcellular localization were required for regulation of mitochondrial dynamics, suggesting that MULAN's downstream effectors are proteins that are either integral to, or associated with, mitochondria and that become modified with Ub. Interestingly, MULAN had previously been identified as an activator of NF-kappaB, thus providing a link between mitochondrial dynamics and mitochondria-to-nucleus signaling. These findings suggest the existence of a new, Ub-mediated mechanism responsible for integration of mitochondria into the cellular environment.

  8. Genome sequencing and annotation of Laceyella sacchari strain GS 1-1, isolated from hot spring, Chumathang, Leh, India

    Directory of Open Access Journals (Sweden)

    Navjot Kaur

    2014-12-01

    Full Text Available We report the 3.3-Mb draft genome of Laceyella sacchari strain GS 1-1, isolated from hot spring water sample, Chumathang, Leh, India. Draft genome of strain GS 1-1 consists of 3, 324, 316 bp with a G + C content of 48.8% and 3429 predicted protein coding genes and 75 RNAs. Geobacillus thermodenitrificans strain NG80-2, Geobacillus kaustophilus strain HTA426 and Geobacillus sp. Strain G11MC16 are the closest neighbors of the strain GS 1-1.

  9. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences

    NARCIS (Netherlands)

    Medema, M.H.; Blin, K.; Cimermancic, P.; Jager, de V.C.L.; Zakrzewski, P.; Fischbach, M.A.; Weber, T.; Takano, E.; Breitling, R.

    2011-01-01

    Bacterial and fungal secondary metabolism is a rich source of novel bioactive compounds with potential pharmaceutical applications as antibiotics, anti-tumor drugs or cholesterol-lowering drugs. To find new drug candidates, microbiologists are increasingly relying on sequencing genomes of a wide var

  10. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences.

    NARCIS (Netherlands)

    Medema, M.H.; Blin, K.; Cimermancic, P.; Jager, V.C.L. de; Zakrzewski, P.; Fischbach, M.A.; Weber, T.; Takano, E.; Breitling, R.

    2011-01-01

    Bacterial and fungal secondary metabolism is a rich source of novel bioactive compounds with potential pharmaceutical applications as antibiotics, anti-tumor drugs or cholesterol-lowering drugs. To find new drug candidates, microbiologists are increasingly relying on sequencing genomes of a wide var

  11. Interspecific Comparison and annotation of two complete mitochondrial genome sequences from the plant pathogenic fungus Mycosphaerella graminicola

    Energy Technology Data Exchange (ETDEWEB)

    Millenbaugh, Bonnie A; Pangilinan, Jasmyn L.; Torriani, Stefano F.F.; Goodwin, Stephen B.; Kema, Gert H.J.; McDonald, Bruce A.

    2007-12-07

    The mitochondrial genomes of two isolates of the wheat pathogen Mycosphaerella graminicola were sequenced completely and compared to identify polymorphic regions. This organism is of interest because it is phylogenetically distant from other fungi with sequenced mitochondrial genomes and it has shown discordant patterns of nuclear and mitochondrial diversity. The mitochondrial genome of M. graminicola is a circular molecule of approximately 43,960 bp containing the typical genes coding for 14 proteins related to oxidative phosphorylation, one RNA polymerase, two rRNA genes and a set of 27 tRNAs. The mitochondrial DNA of M. graminicola lacks the gene encoding the putative ribosomal protein (rps5-like), commonly found in fungal mitochondrial genomes. Most of the tRNA genes were clustered with a gene order conserved with many other ascomycetes. A sample of thirty-five additional strains representing the known global mt diversity was partially sequenced to measure overall mitochondrial variability within the species. Little variation was found, confirming previous RFLP-based findings of low mitochondrial diversity. The mitochondrial sequence of M. graminicola is the first reported from the family Mycosphaerellaceae or the order Capnodiales. The sequence also provides a tool to better understand the development of fungicide resistance and the conflicting pattern of high nuclear and low mitochondrial diversity in global populations of this fungus.

  12. Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity

    Science.gov (United States)

    Philippe, Nicolas; Boureux, Anthony; Bréhélin, Laurent; Tarhio, Jorma; Commes, Thérèse; Rivals, Éric

    2009-01-01

    Ultra high-throughput sequencing is used to analyse the transcriptome or interactome at unprecedented depth on a genome-wide scale. These techniques yield short sequence reads that are then mapped on a genome sequence to predict putatively transcribed or protein-interacting regions. We argue that factors such as background distribution, sequence errors, and read length impact on the prediction capacity of sequence census experiments. Here we suggest a computational approach to measure these factors and analyse their influence on both transcriptomic and epigenomic assays. This investigation provides new clues on both methodological and biological issues. For instance, by analysing chromatin immunoprecipitation read sets, we estimate that 4.6% of reads are affected by SNPs. We show that, although the nucleotide error probability is low, it significantly increases with the position in the sequence. Choosing a read length above 19 bp practically eliminates the risk of finding irrelevant positions, while above 20 bp the number of uniquely mapped reads decreases. With our procedure, we obtain 0.6% false positives among genomic locations. Hence, even rare signatures should identify biologically relevant regions, if they are mapped on the genome. This indicates that digital transcriptomics may help to characterize the wealth of yet undiscovered, low-abundance transcripts. PMID:19531739

  13. antiSMASH : rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences

    NARCIS (Netherlands)

    Medema, Marnix H.; Blin, Kai; Cimermancic, Peter; de Jager, Victor; Zakrzewski, Piotr; Fischbach, Michael A.; Weber, Tilmann; Takano, Eriko; Breitling, Rainer

    2011-01-01

    Bacterial and fungal secondary metabolism is a rich source of novel bioactive compounds with potential pharmaceutical applications as antibiotics, anti-tumor drugs or cholesterol-lowering drugs. To find new drug candidates, microbiologists are increasingly relying on sequencing genomes of a wide var

  14. KSHV 2.0: A Comprehensive Annotation of the Kaposi's Sarcoma-Associated Herpesvirus Genome Using Next-Generation Sequencing Reveals Novel Genomic and Functional Features

    OpenAIRE

    Carolina Arias; Ben Weisburd; Noam Stern-Ginossar; Alexandre Mercier; Alexis S Madrid; Priya Bellare; Meghan Holdorf; Jonathan S Weissman; Don Ganem

    2014-01-01

    Productive herpesvirus infection requires a profound, time-controlled remodeling of the viral transcriptome and proteome. To gain insights into the genomic architecture and gene expression control in Kaposi's sarcoma-associated herpesvirus (KSHV), we performed a systematic genome-wide survey of viral transcriptional and translational activity throughout the lytic cycle. Using mRNA-sequencing and ribosome profiling, we found that transcripts encoding lytic genes are promptly bound by ribosomes...

  15. Annotated Answer Set Programming

    OpenAIRE

    Straccia, Umberto

    2005-01-01

    We present Annotated Answer Set Programming, that extends the ex pressive power of disjunctive logic programming with annotation terms, taken from the generalized annotated logic programming framework.

  16. Annotated Videography.

    Science.gov (United States)

    United States Holocaust Memorial Museum, Washington, DC.

    This annotated list of 43 videotapes recommended for classroom use addresses various themes for teaching about the Holocaust, including: (1) overviews of the Holocaust; (2) life before the Holocaust; (3) propaganda; (4) racism, anti-Semitism; (5) "enemies of the state"; (6) ghettos; (7) camps; (8) genocide; (9) rescue; (10) resistance; (11)…

  17. Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes.

    Science.gov (United States)

    Lin, Hsin-Hung; Liao, Yu-Chieh

    2016-01-01

    Metagenomics, the application of shotgun sequencing, facilitates the reconstruction of the genomes of individual species from natural environments. A major challenge in the genome recovery domain is to agglomerate or 'bin' sequences assembled from metagenomic reads into individual groups. Metagenomic binning without consideration of reference sequences enables the comprehensive discovery of new microbial organisms and aids in the microbial genome reconstruction process. Here we present MyCC, an automated binning tool that combines genomic signatures, marker genes and optional contig coverages within one or multiple samples, in order to visualize the metagenomes and to identify the reconstructed genomic fragments. We demonstrate the superior performance of MyCC compared to other binning tools including CONCOCT, GroopM, MaxBin and MetaBAT on both synthetic and real human gut communities with a small sample size (one to 11 samples), as well as on a large metagenome dataset (over 250 samples). Moreover, we demonstrate the visualization of metagenomes in MyCC to aid in the reconstruction of genomes from distinct bins. MyCC is freely available at http://sourceforge.net/projects/sb2nhri/files/MyCC/. PMID:27067514

  18. Cross genome phylogenetic analysis of human and Drosophila G protein-coupled receptors: application to functional annotation of orphan receptors

    Directory of Open Access Journals (Sweden)

    Sowdhamini Ramanathan

    2005-08-01

    Full Text Available Abstract Background The cell-membrane G-protein coupled receptors (GPCRs are one of the largest known superfamilies and are the main focus of intense pharmaceutical research due to their key role in cell physiology and disease. A large number of putative GPCRs are 'orphans' with no identified natural ligands. The first step in understanding the function of orphan GPCRs is to identify their ligands. Phylogenetic clustering methods were used to elucidate the chemical nature of receptor ligands, which led to the identification of natural ligands for many orphan receptors. We have clustered human and Drosophila receptors with known ligands and orphans through cross genome phylogenetic analysis and hypothesized higher relationship of co-clustered members that would ease ligand identification, as related receptors share ligands with similar structure or class. Results Cross-genome phylogenetic analyses were performed to identify eight major groups of GPCRs dividing them into 32 clusters of 371 human and 113 Drosophila proteins (excluding olfactory, taste and gustatory receptors and reveal unexpected levels of evolutionary conservation across human and Drosophila GPCRs. We also observe that members of human chemokine receptors, involved in immune response, and most of nucleotide-lipid receptors (except opsins do not have counterparts in Drosophila. Similarly, a group of Drosophila GPCRs (methuselah receptors, associated in aging, is not present in humans. Conclusion Our analysis suggests ligand class association to 52 unknown Drosophila receptors and 95 unknown human GPCRs. A higher level of phylogenetic organization was revealed in which clusters with common domain architecture or cellular localization or ligand structure or chemistry or a shared function are evident across human and Drosophila genomes. Such analyses will prove valuable for identifying the natural ligands of Drosophila and human orphan receptors that can lead to a better understanding

  19. Versatile annotation and publication quality visualization of protein complexes using POLYVIEW-3D

    Directory of Open Access Journals (Sweden)

    Meller Jaroslaw

    2007-08-01

    Full Text Available Abstract Background Macromolecular visualization as well as automated structural and functional annotation tools play an increasingly important role in the post-genomic era, contributing significantly towards the understanding of molecular systems and processes. For example, three dimensional (3D models help in exploring protein active sites and functional hot spots that can be targeted in drug design. Automated annotation and visualization pipelines can also reveal other functionally important attributes of macromolecules. These goals are dependent on the availability of advanced tools that integrate better the existing databases, annotation servers and other resources with state-of-the-art rendering programs. Results We present a new tool for protein structure analysis, with the focus on annotation and visualization of protein complexes, which is an extension of our previously developed POLYVIEW web server. By integrating the web technology with state-of-the-art software for macromolecular visualization, such as the PyMol program, POLYVIEW-3D enables combining versatile structural and functional annotations with a simple web-based interface for creating publication quality structure rendering, as well as animated images for Powerpoint™, web sites and other electronic resources. The service is platform independent and no plug-ins are required. Several examples of how POLYVIEW-3D can be used for structural and functional analysis in the context of protein-protein interactions are presented to illustrate the available annotation options. Conclusion POLYVIEW-3D server features the PyMol image rendering that provides detailed and high quality presentation of macromolecular structures, with an easy to use web-based interface. POLYVIEW-3D also provides a wide array of options for automated structural and functional analysis of proteins and their complexes. Thus, the POLYVIEW-3D server may become an important resource for researches and educators in

  20. Annotation of Differentially Expressed Genes in the Somatic Embryogenesis of Musa and Their Location in the Banana Genome

    OpenAIRE

    Josefina Ines Maldonado-Borges; José Roberto Ku-Cauich; Rosa Maria Escobedo-GraciaMedrano

    2013-01-01

    Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differential transcribed fragments (TDFs) and the sequenced genome of the double haploid- (DH-) Pahang of the malaccensis subspecies that is available in the network. A total of 253 transcript-derived fragments (TDFs) were detected with apparent size of 100–4000 bp using 5 pairs of AFLP primers, of which 21 were di...

  1. Automated integration of genomic physical mapping data via parallel simulated annealing

    Energy Technology Data Exchange (ETDEWEB)

    Slezak, T.

    1994-06-01

    The Human Genome Center at the Lawrence Livermore National Laboratory (LLNL) is nearing closure on a high-resolution physical map of human chromosome 19. We have build automated tools to assemble 15,000 fingerprinted cosmid clones into 800 contigs with minimal spanning paths identified. These islands are being ordered, oriented, and spanned by a variety of other techniques including: Fluorescence Insitu Hybridization (FISH) at 3 levels of resolution, ECO restriction fragment mapping across all contigs, and a multitude of different hybridization and PCR techniques to link cosmid, YAC, AC, PAC, and Pl clones. The FISH data provide us with partial order and distance data as well as orientation. We made the observation that map builders need a much rougher presentation of data than do map readers; the former wish to see raw data since these can expose errors or interesting biology. We further noted that by ignoring our length and distance data we could simplify our problem into one that could be readily attacked with optimization techniques. The data integration problem could then be seen as an M x N ordering of our N cosmid clones which ``intersect`` M larger objects by defining ``intersection`` to mean either contig/map membership or hybridization results. Clearly, the goal of making an integrated map is now to rearrange the N cosmid clone ``columns`` such that the number of gaps on the object ``rows`` are minimized. Our FISH partially-ordered cosmid clones provide us with a set of constraints that cannot be violated by the rearrangement process. We solved the optimization problem via simulated annealing performed on a network of 40+ Unix machines in parallel, using a server/client model built on explicit socket calls. For current maps we can create a map in about 4 hours on the parallel net versus 4+ days on a single workstation. Our biologists are now using this software on a daily basis to guide their efforts toward final closure.

  2. Automated protein-DNA interaction screening of Drosophila regulatory elements

    OpenAIRE

    Hens, Korneel; Feuz, Jean-Daniel; Isakova, Alina; Iagovitina, Antonina; Massouras, Andreas; Bryois, Julien; Callaerts, Patrick; Celniker, Susan E.; Deplancke, Bart

    2011-01-01

    Drosophila melanogaster has one of the best characterized metazoan genomes in terms of functionally annotated regulatory elements. To explore how these elements contribute to gene regulation in the context of gene regulatory networks, we need convenient tools to identify the proteins that bind to them. Here, we present the development and validation of a highly automated protein-DNA interaction detection method, enabling the high-throughput yeast one-hybrid-based screening of DNA elements ver...

  3. Annotation of Differentially Expressed Genes in the Somatic Embryogenesis of Musa and Their Location in the Banana Genome

    Directory of Open Access Journals (Sweden)

    Josefina Ines Maldonado-Borges

    2013-01-01

    Full Text Available Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differential transcribed fragments (TDFs and the sequenced genome of the double haploid- (DH- Pahang of the malaccensis subspecies that is available in the network. A total of 253 transcript-derived fragments (TDFs were detected with apparent size of 100–4000 bp using 5 pairs of AFLP primers, of which 21 were differentially expressed during the different stages of banana embryogenesis; 15 of the sequences have matched DH-Pahang chromosomes, with 7 of them being homologous to gene sequences encoding either known or putative protein domains of higher plants. Four TDF sequences were located in all Musa chromosomes, while the rest were located in one or two chromosomes. Their putative individual function is briefly reviewed based on published information, and the potential roles of these genes in embryo development are discussed. Thus the availability of the genome of Musa and the information of TDFs sequences presented here opens new possibilities for an in-depth study of the molecular and biochemical research of zygotic and somatic embryogenesis of Musa.

  4. Annotation of differentially expressed genes in the somatic embryogenesis of musa and their location in the banana genome.

    Science.gov (United States)

    Maldonado-Borges, Josefina Ines; Ku-Cauich, José Roberto; Escobedo-Graciamedrano, Rosa Maria

    2013-01-01

    Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differential transcribed fragments (TDFs) and the sequenced genome of the double haploid- (DH-) Pahang of the malaccensis subspecies that is available in the network. A total of 253 transcript-derived fragments (TDFs) were detected with apparent size of 100-4000 bp using 5 pairs of AFLP primers, of which 21 were differentially expressed during the different stages of banana embryogenesis; 15 of the sequences have matched DH-Pahang chromosomes, with 7 of them being homologous to gene sequences encoding either known or putative protein domains of higher plants. Four TDF sequences were located in all Musa chromosomes, while the rest were located in one or two chromosomes. Their putative individual function is briefly reviewed based on published information, and the potential roles of these genes in embryo development are discussed. Thus the availability of the genome of Musa and the information of TDFs sequences presented here opens new possibilities for an in-depth study of the molecular and biochemical research of zygotic and somatic embryogenesis of Musa. PMID:24027442

  5. Modeling Loosely Annotated Images with Imagined Annotations

    CERN Document Server

    Tang, Hong; Chen, Yunhao

    2008-01-01

    In this paper, we present an approach to learning latent semantic analysis models from loosely annotated images for automatic image annotation and indexing. The given annotation in training images is loose due to: (1) ambiguous correspondences between visual features and annotated keywords; (2) incomplete lists of annotated keywords. The second reason motivates us to enrich the incomplete annotation in a simple way before learning topic models. In particular, some imagined keywords are poured into the incomplete annotation through measuring similarity between keywords. Then, both given and imagined annotations are used to learning probabilistic topic models for automatically annotating new images. We conduct experiments on a typical Corel dataset of images and loose annotations, and compare the proposed method with state-of-the-art discrete annotation methods (using a set of discrete blobs to represent an image). The proposed method improves word-driven probability Latent Semantic Analysis (PLSA-words) up to ...

  6. Unique features of odorant-binding proteins of the parasitoid wasp Nasonia vitripennis revealed by genome annotation and comparative analyses.

    Directory of Open Access Journals (Sweden)

    Filipe G Vieira

    Full Text Available Insects are the most diverse group of animals on the planet, comprising over 90% of all metazoan life forms, and have adapted to a wide diversity of ecosystems in nearly all environments. They have evolved highly sensitive chemical senses that are central to their interaction with their environment and to communication between individuals. Understanding the molecular bases of insect olfaction is therefore of great importance from both a basic and applied perspective. Odorant binding proteins (OBPs are some of most abundant proteins found in insect olfactory organs, where they are the first component of the olfactory transduction cascade, carrying odorant molecules to the olfactory receptors. We carried out a search for OBPs in the genome of the parasitoid wasp Nasonia vitripennis and identified 90 sequences encoding putative OBPs. This is the largest OBP family so far reported in insects. We report unique features of the N. vitripennis OBPs, including the presence and evolutionary origin of a new subfamily of double-domain OBPs (consisting of two concatenated OBP domains, the loss of conserved cysteine residues and the expression of pseudogenes. This study also demonstrates the extremely dynamic evolution of the insect OBP family: (i the number of different OBPs can vary greatly between species; (ii the sequences are highly diverse, sometimes as a result of positive selection pressure with even the canonical cysteines being lost; (iii new lineage specific domain arrangements can arise, such as the double domain OBP subfamily of wasps and mosquitoes.

  7. Functional annotation of rheumatoid arthritis and osteoarthritis associated genes by integrative genome-wide gene expression profiling analysis.

    Directory of Open Access Journals (Sweden)

    Zhan-Chun Li

    Full Text Available BACKGROUND: Rheumatoid arthritis (RA and osteoarthritis (OA are two major types of joint diseases that share multiple common symptoms. However, their pathological mechanism remains largely unknown. The aim of our study is to identify RA and OA related-genes and gain an insight into the underlying genetic basis of these diseases. METHODS: We collected 11 whole genome-wide expression profiling datasets from RA and OA cohorts and performed a meta-analysis to comprehensively investigate their expression signatures. This method can avoid some pitfalls of single dataset analyses. RESULTS AND CONCLUSION: We found that several biological pathways (i.e., the immunity, inflammation and apoptosis related pathways are commonly involved in the development of both RA and OA. Whereas several other pathways (i.e., vasopressin-related pathway, regulation of autophagy, endocytosis, calcium transport and endoplasmic reticulum stress related pathways present significant difference between RA and OA. This study provides novel insights into the molecular mechanisms underlying this disease, thereby aiding the diagnosis and treatment of the disease.

  8. Facilitating functional annotation of chicken microarray data

    OpenAIRE

    Buza, Teresia J; Kumar, Ranjit; Gresham, Cathy R; Burgess, Shane C.; McCarthy, Fiona M

    2009-01-01

    Modeling results from chicken microarray studies is challenging for researchers due to little functional annotation associated with these arrays. The Affymetrix GenChip chicken genome array, one of the biggest arrays that serve as a key research tool for the study of chicken functional genomics, is among the few arrays that link gene products to Gene Ontology (GO). However the GO annotation data presented by Affymetrix is incomplete, for example, they do not show references linked to manually...

  9. Genome-Wide Annotation and Comparative Analysis of Cytochrome P450 Monooxygenases in Basidiomycete Biotrophic Plant Pathogens.

    Science.gov (United States)

    Qhanya, Lehlohonolo Benedict; Matowane, Godfrey; Chen, Wanping; Sun, Yuxin; Letsimo, Elizabeth Mpholoseng; Parvez, Mohammad; Yu, Jae-Hyuk; Mashele, Samson Sitheni; Syed, Khajamohiddin

    2015-01-01

    Fungi are an exceptional source of diverse and novel cytochrome P450 monooxygenases (P450s), heme-thiolate proteins, with catalytic versatility. Agaricomycotina saprophytes have yielded most of the available information on basidiomycete P450s. This resulted in observing similar P450 family types in basidiomycetes with few differences in P450 families among Agaricomycotina saprophytes. The present study demonstrated the presence of unique P450 family patterns in basidiomycete biotrophic plant pathogens that could possibly have originated from the adaptation of these species to different ecological niches (host influence). Systematic analysis of P450s in basidiomycete biotrophic plant pathogens belonging to three different orders, Agaricomycotina (Armillaria mellea), Pucciniomycotina (Melampsora laricis-populina, M. lini, Mixia osmundae and Puccinia graminis) and Ustilaginomycotina (Ustilago maydis, Sporisorium reilianum and Tilletiaria anomala), revealed the presence of numerous putative P450s ranging from 267 (A. mellea) to 14 (M. osmundae). Analysis of P450 families revealed the presence of 41 new P450 families and 27 new P450 subfamilies in these biotrophic plant pathogens. Order-level comparison of P450 families between biotrophic plant pathogens revealed the presence of unique P450 family patterns in these organisms, possibly reflecting the characteristics of their order. Further comparison of P450 families with basidiomycete non-pathogens confirmed that biotrophic plant pathogens harbour the unique P450 families in their genomes. The CYP63, CYP5037, CYP5136, CYP5137 and CYP5341 P450 families were expanded in A. mellea when compared to other Agaricomycotina saprophytes and the CYP5221 and CYP5233 P450 families in P. graminis and M. laricis-populina. The present study revealed that expansion of these P450 families is due to paralogous evolution of member P450s. The presence of unique P450 families in these organisms serves as evidence of how a host

  10. Genome-Wide Annotation and Comparative Analysis of Cytochrome P450 Monooxygenases in Basidiomycete Biotrophic Plant Pathogens.

    Directory of Open Access Journals (Sweden)

    Lehlohonolo Benedict Qhanya

    Full Text Available Fungi are an exceptional source of diverse and novel cytochrome P450 monooxygenases (P450s, heme-thiolate proteins, with catalytic versatility. Agaricomycotina saprophytes have yielded most of the available information on basidiomycete P450s. This resulted in observing similar P450 family types in basidiomycetes with few differences in P450 families among Agaricomycotina saprophytes. The present study demonstrated the presence of unique P450 family patterns in basidiomycete biotrophic plant pathogens that could possibly have originated from the adaptation of these species to different ecological niches (host influence. Systematic analysis of P450s in basidiomycete biotrophic plant pathogens belonging to three different orders, Agaricomycotina (Armillaria mellea, Pucciniomycotina (Melampsora laricis-populina, M. lini, Mixia osmundae and Puccinia graminis and Ustilaginomycotina (Ustilago maydis, Sporisorium reilianum and Tilletiaria anomala, revealed the presence of numerous putative P450s ranging from 267 (A. mellea to 14 (M. osmundae. Analysis of P450 families revealed the presence of 41 new P450 families and 27 new P450 subfamilies in these biotrophic plant pathogens. Order-level comparison of P450 families between biotrophic plant pathogens revealed the presence of unique P450 family patterns in these organisms, possibly reflecting the characteristics of their order. Further comparison of P450 families with basidiomycete non-pathogens confirmed that biotrophic plant pathogens harbour the unique P450 families in their genomes. The CYP63, CYP5037, CYP5136, CYP5137 and CYP5341 P450 families were expanded in A. mellea when compared to other Agaricomycotina saprophytes and the CYP5221 and CYP5233 P450 families in P. graminis and M. laricis-populina. The present study revealed that expansion of these P450 families is due to paralogous evolution of member P450s. The presence of unique P450 families in these organisms serves as evidence of how a host

  11. Genome-Wide Annotation and Comparative Analysis of Cytochrome P450 Monooxygenases in Basidiomycete Biotrophic Plant Pathogens

    Science.gov (United States)

    Sun, Yuxin; Letsimo, Elizabeth Mpholoseng; Parvez, Mohammad; Yu, Jae-Hyuk; Mashele, Samson Sitheni; Syed, Khajamohiddin

    2015-01-01

    Fungi are an exceptional source of diverse and novel cytochrome P450 monooxygenases (P450s), heme-thiolate proteins, with catalytic versatility. Agaricomycotina saprophytes have yielded most of the available information on basidiomycete P450s. This resulted in observing similar P450 family types in basidiomycetes with few differences in P450 families among Agaricomycotina saprophytes. The present study demonstrated the presence of unique P450 family patterns in basidiomycete biotrophic plant pathogens that could possibly have originated from the adaptation of these species to different ecological niches (host influence). Systematic analysis of P450s in basidiomycete biotrophic plant pathogens belonging to three different orders, Agaricomycotina (Armillaria mellea), Pucciniomycotina (Melampsora laricis-populina, M. lini, Mixia osmundae and Puccinia graminis) and Ustilaginomycotina (Ustilago maydis, Sporisorium reilianum and Tilletiaria anomala), revealed the presence of numerous putative P450s ranging from 267 (A. mellea) to 14 (M. osmundae). Analysis of P450 families revealed the presence of 41 new P450 families and 27 new P450 subfamilies in these biotrophic plant pathogens. Order-level comparison of P450 families between biotrophic plant pathogens revealed the presence of unique P450 family patterns in these organisms, possibly reflecting the characteristics of their order. Further comparison of P450 families with basidiomycete non-pathogens confirmed that biotrophic plant pathogens harbour the unique P450 families in their genomes. The CYP63, CYP5037, CYP5136, CYP5137 and CYP5341 P450 families were expanded in A. mellea when compared to other Agaricomycotina saprophytes and the CYP5221 and CYP5233 P450 families in P. graminis and M. laricis-populina. The present study revealed that expansion of these P450 families is due to paralogous evolution of member P450s. The presence of unique P450 families in these organisms serves as evidence of how a host

  12. An Approach to Function Annotation for Proteins of Unknown Function (PUFs in the Transcriptome of Indian Mulberry.

    Directory of Open Access Journals (Sweden)

    K H Dhanyalakshmi

    Full Text Available The modern sequencing technologies are generating large volumes of information at the transcriptome and genome level. Translation of this information into a biological meaning is far behind the race due to which a significant portion of proteins discovered remain as proteins of unknown function (PUFs. Attempts to uncover the functional significance of PUFs are limited due to lack of easy and high throughput functional annotation tools. Here, we report an approach to assign putative functions to PUFs, identified in the transcriptome of mulberry, a perennial tree commonly cultivated as host of silkworm. We utilized the mulberry PUFs generated from leaf tissues exposed to drought stress at whole plant level. A sequence and structure based computational analysis predicted the probable function of the PUFs. For rapid and easy annotation of PUFs, we developed an automated pipeline by integrating diverse bioinformatics tools, designated as PUFs Annotation Server (PUFAS, which also provides a web service API (Application Programming Interface for a large-scale analysis up to a genome. The expression analysis of three selected PUFs annotated by the pipeline revealed abiotic stress responsiveness of the genes, and hence their potential role in stress acclimation pathways. The automated pipeline developed here could be extended to assign functions to PUFs from any organism in general. PUFAS web server is available at http://caps.ncbs.res.in/pufas/ and the web service is accessible at http://capservices.ncbs.res.in/help/pufas.

  13. An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

    OpenAIRE

    Bell, Michael J; Colin S Gillespie; Swan, Daniel; Lord, Phillip

    2012-01-01

    Motivation: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually, however manual curation is costly, time consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases ...

  14. Annotated bibliography

    International Nuclear Information System (INIS)

    Under a cooperative agreement with the U.S. Department of Energy's Office of Science and Technology, Waste Policy Institute (WPI) is conducting a five-year research project to develop a research-based approach for integrating communication products in stakeholder involvement related to innovative technology. As part of the research, WPI developed this annotated bibliography which contains almost 100 citations of articles/books/resources involving topics related to communication and public involvement aspects of deploying innovative cleanup technology. To compile the bibliography, WPI performed on-line literature searches (e.g., Dialog, International Association of Business Communicators Public Relations Society of America, Chemical Manufacturers Association, etc.), consulted past years proceedings of major environmental waste cleanup conferences (e.g., Waste Management), networked with professional colleagues and DOE sites to gather reports or case studies, and received input during the August 1996 Research Design Team meeting held to discuss the project's research methodology. Articles were selected for annotation based upon their perceived usefulness to the broad range of public involvement and communication practitioners

  15. Facilitating functional annotation of chicken microarray data

    Directory of Open Access Journals (Sweden)

    Gresham Cathy R

    2009-10-01

    Full Text Available Abstract Background Modeling results from chicken microarray studies is challenging for researchers due to little functional annotation associated with these arrays. The Affymetrix GenChip chicken genome array, one of the biggest arrays that serve as a key research tool for the study of chicken functional genomics, is among the few arrays that link gene products to Gene Ontology (GO. However the GO annotation data presented by Affymetrix is incomplete, for example, they do not show references linked to manually annotated functions. In addition, there is no tool that facilitates microarray researchers to directly retrieve functional annotations for their datasets from the annotated arrays. This costs researchers amount of time in searching multiple GO databases for functional information. Results We have improved the breadth of functional annotations of the gene products associated with probesets on the Affymetrix chicken genome array by 45% and the quality of annotation by 14%. We have also identified the most significant diseases and disorders, different types of genes, and known drug targets represented on Affymetrix chicken genome array. To facilitate functional annotation of other arrays and microarray experimental datasets we developed an Array GO Mapper (AGOM tool to help researchers to quickly retrieve corresponding functional information for their dataset. Conclusion Results from this study will directly facilitate annotation of other chicken arrays and microarray experimental datasets. Researchers will be able to quickly model their microarray dataset into more reliable biological functional information by using AGOM tool. The disease, disorders, gene types and drug targets revealed in the study will allow researchers to learn more about how genes function in complex biological systems and may lead to new drug discovery and development of therapies. The GO annotation data generated will be available for public use via AgBase website and

  16. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.)

    OpenAIRE

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2014-01-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integra...

  17. Mitochondrial Disease Sequence Data Resource (MSeqDR): a global grass-roots consortium to facilitate deposition, curation, annotation, and integrated analysis of genomic data for the mitochondrial disease clinical and research communities.

    Science.gov (United States)

    Falk, Marni J; Shen, Lishuang; Gonzalez, Michael; Leipzig, Jeremy; Lott, Marie T; Stassen, Alphons P M; Diroma, Maria Angela; Navarro-Gomez, Daniel; Yeske, Philip; Bai, Renkui; Boles, Richard G; Brilhante, Virginia; Ralph, David; DaRe, Jeana T; Shelton, Robert; Terry, Sharon F; Zhang, Zhe; Copeland, William C; van Oven, Mannis; Prokisch, Holger; Wallace, Douglas C; Attimonelli, Marcella; Krotoski, Danuta; Zuchner, Stephan; Gai, Xiaowu

    2015-03-01

    Success rates for genomic analyses of highly heterogeneous disorders can be greatly improved if a large cohort of patient data is assembled to enhance collective capabilities for accurate sequence variant annotation, analysis, and interpretation. Indeed, molecular diagnostics requires the establishment of robust data resources to enable data sharing that informs accurate understanding of genes, variants, and phenotypes. The "Mitochondrial Disease Sequence Data Resource (MSeqDR) Consortium" is a grass-roots effort facilitated by the United Mitochondrial Disease Foundation to identify and prioritize specific genomic data analysis needs of the global mitochondrial disease clinical and research community. A central Web portal (https://mseqdr.org) facilitates the coherent compilation, organization, annotation, and analysis of sequence data from both nuclear and mitochondrial genomes of individuals and families with suspected mitochondrial disease. This Web portal provides users with a flexible and expandable suite of resources to enable variant-, gene-, and exome-level sequence analysis in a secure, Web-based, and user-friendly fashion. Users can also elect to share data with other MSeqDR Consortium members, or even the general public, either by custom annotation tracks or through the use of a convenient distributed annotation system (DAS) mechanism. A range of data visualization and analysis tools are provided to facilitate user interrogation and understanding of genomic, and ultimately phenotypic, data of relevance to mitochondrial biology and disease. Currently available tools for nuclear and mitochondrial gene analyses include an MSeqDR GBrowse instance that hosts optimized mitochondrial disease and mitochondrial DNA (mtDNA) specific annotation tracks, as well as an MSeqDR locus-specific database (LSDB) that curates variant data on more than 1300 genes that have been implicated in mitochondrial disease and/or encode mitochondria-localized proteins. MSeqDR is

  18. Self-organizing Approach for Automated Gene Identification in Whole Genomes

    CERN Document Server

    Gorban, A N; Popova, T G; Gorban, Alexander N.; Zinovyev, Andrey Yu.; Popova, Tatyana G.

    2001-01-01

    An approach based on using the idea of distinguished coding phase in explicit form for identification of protein-coding regions (exons) in whole genome has been proposed. For several genomes an optimal window length for averaging GC-content function and calculating codon frequencies has been found. Self-training procedure based on clustering in multidimensional space of triplet frequencies is proposed. For visualization of data in the space of triplet requiencies method of elastic maps was applied.

  19. Self-organizing Approach for Automated Gene Identification in Whole Genomes

    OpenAIRE

    Gorban, Alexander N; Zinovyev, Andrey Yu.; Popova, Tatyana G.

    2001-01-01

    An approach based on using the idea of distinguished coding phase in explicit form for identification of protein-coding regions (exons) in whole genome has been proposed. For several genomes an optimal window length for averaging GC-content function and calculating codon frequencies has been found. Self-training procedure based on clustering in multidimensional space of triplet frequencies is proposed. For visualization of data in the space of triplet requiencies method of elastic maps was ap...

  20. Comparative genomic mapping of the bovine Fragile Histidine Triad (FHIT tumour suppressor gene: characterization of a 2 Mb BAC contig covering the locus, complete annotation of the gene, analysis of cDNA and of physiological expression profiles

    Directory of Open Access Journals (Sweden)

    Boussaha Mekki

    2006-05-01

    Full Text Available Abstract Background The Fragile Histidine Triad gene (FHIT is an oncosuppressor implicated in many human cancers, including vesical tumors. FHIT is frequently hit by deletions caused by fragility at FRA3B, the most active of human common fragile sites, where FHIT lays. Vesical tumors affect also cattle, including animals grazing in the wild on bracken fern; compounds released by the fern are known to induce chromosome fragility and may trigger cancer with the interplay of latent Papilloma virus. Results The bovine FHIT was characterized by assembling a contig of 78 BACs. Sequence tags were designed on human exons and introns and used directly to select bovine BACs, or compared with sequence data in the bovine genome database or in the trace archive of the bovine genome sequencing project, and adapted before use. FHIT is split in ten exons like in man, with exons 5 to 9 coding for a 149 amino acids protein. VISTA global alignments between bovine genomic contigs retrieved from the bovine genome database and the human FHIT region were performed. Conservation was extremely high over a 2 Mb region spanning the whole FHIT locus, including the size of introns. Thus, the bovine FHIT covers about 1.6 Mb compared to 1.5 Mb in man. Expression was analyzed by RT-PCR and Northern blot, and was found to be ubiquitous. Four cDNA isoforms were isolated and sequenced, that originate from an alternative usage of three variants of exon 4, revealing a size very close to the major human FHIT cDNAs. Conclusion A comparative genomic approach allowed to assemble a contig of 78 BACs and to completely annotate a 1.6 Mb region spanning the bovine FHIT gene. The findings confirmed the very high level of conservation between human and bovine genomes and the importance of comparative mapping to speed the annotation process of the recently sequenced bovine genome. The detailed knowledge of the genomic FHIT region will allow to study the role of FHIT in bovine cancerogenesis

  1. TOPSAN: use of a collaborative environment for annotating, analyzing and disseminating data on JCSG and PSI structures

    International Nuclear Information System (INIS)

    Specific use cases of TOPSAN, an innovative collaborative platform for creating, sharing and distributing annotations and insights about protein structures, such as those determined by high-throughput structural genomics in the Protein Structure Initiative (PSI), are described. TOPSAN is the main annotation platform for JCSG structures and serves as a conduit for initiating collaborations with the biological community, as illustrated in this special issue of Acta Crystallographica Section F. Developed at the JCSG with the goal of opening a dialogue on the novel protein structures with the broader biological community, TOPSAN is a unique tool for fostering distributed collaborations and provides an efficient pathway to peer-reviewed publications. The NIH Protein Structure Initiative centers, such as the Joint Center for Structural Genomics (JCSG), have developed highly efficient technological platforms that are capable of experimentally determining the three-dimensional structures of hundreds of proteins per year. However, the overwhelming majority of the almost 5000 protein structures determined by these centers have yet to be described in the peer-reviewed literature. In a high-throughput structural genomics environment, the process of structure determination occurs independently of any associated experimental characterization of function, which creates a challenge for the annotation and analysis of structures and the publication of these results. This challenge has been addressed by developing TOPSAN (‘The Open Protein Structure Annotation Network’), which enables the generation of knowledge via collaborations among globally distributed contributors supported by automated amalgamation of available information. TOPSAN currently provides annotations for all protein structures determined by the JCSG in addition to preliminary annotations on a large number of structures from the other PSI production centers. TOPSAN-enabled collaborations have resulted in

  2. GELATO and SAGE: An Integrated Framework for MS Annotation

    OpenAIRE

    AlJadda, Khalifeh; Ranzinger, Rene; Porterfield, Melody; Weatherly, Brent; Korayem, Mohammed; Miller, John A.; Rasheed, Khaled; Kochut, Krys J; York, William S.

    2015-01-01

    Several algorithms and tools have been developed to (semi) automate the process of glycan identification by interpreting Mass Spectrometric data. However, each has limitations when annotating MSn data with thousands of MS spectra using uncurated public databases. Moreover, the existing tools are not designed to manage MSn data where n > 2. We propose a novel software package to automate the annotation of tandem MS data. This software consists of two major components. The first, is a free, sem...

  3. MannDB: A microbial annotation database for protein characterization

    Energy Technology Data Exchange (ETDEWEB)

    Zhou, C; Lam, M; Smith, J; Zemla, A; Dyer, M; Kuczmarski, T; Vitalis, E; Slezak, T

    2006-05-19

    MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high

  4. REFGEN and TREENAMER: Automated Sequence Data Handling for Phylogenetic Analysis in the Genomic Era

    Directory of Open Access Journals (Sweden)

    Guy Leonard

    2009-01-01

    Full Text Available The phylogenetic analysis of nucleotide sequences and increasingly that of amino acid sequences is used to address a number of biological questions. Access to extensive datasets, including numerous genome projects, means that standard phylogenetic analyses can include many hundreds of sequences. Unfortunately, most phylogenetic analysis programs do not tolerate the sequence naming conventions of genome databases. Managing large numbers of sequences and standardizing sequence labels for use in phylogenetic analysis programs can be a time consuming and laborious task. Here we report the availability of an online resource for the management of gene sequences recovered from public access genome databases such as GenBank. These web utilities include the facility for renaming every sequence in a FASTA alignment fi le, with each sequence label derived from a user-defined combination of the species name and/or database accession number. This facility enables the user to keep track of the branching order of the sequences/taxa during multiple tree calculations and re-optimisations. Post phylogenetic analysis, these webpages can then be used to rename every label in the subsequent tree fi les (with a user-defined combination of species name and/or database accession number. Together these programs drastically reduce the time required for managing sequence alignments and labelling phylogenetic figures. Additional features of our platform include the automatic removal of identical accession numbers (recorded in the report file and generation of species and accession number lists for use in supplementary materials or figure legends.

  5. Concept annotation in the CRAFT corpus

    Directory of Open Access Journals (Sweden)

    Bada Michael

    2012-07-01

    Full Text Available Abstract Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released. Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens, our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection, the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are

  6. MEETING: Chlamydomonas Annotation Jamboree - October 2003

    Energy Technology Data Exchange (ETDEWEB)

    Grossman, Arthur R

    2007-04-13

    Shotgun sequencing of the nuclear genome of Chlamydomonas reinhardtii (Chlamydomonas throughout) was performed at an approximate 10X coverage by JGI. Roughly half of the genome is now contained on 26 scaffolds, all of which are at least 1.6 Mb, and the coverage of the genome is ~95%. There are now over 200,000 cDNA sequence reads that we have generated as part of the Chlamydomonas genome project (Grossman, 2003; Shrager et al., 2003; Grossman et al. 2007; Merchant et al., 2007); other sequences have also been generated by the Kasuza sequence group (Asamizu et al., 1999; Asamizu et al., 2000) or individual laboratories that have focused on specific genes. Shrager et al. (2003) placed the reads into distinct contigs (an assemblage of reads with overlapping nucleotide sequences), and contigs that group together as part of the same genes have been designated ACEs (assembly of contigs generated from EST information). All of the reads have also been mapped to the Chlamydomonas nuclear genome and the cDNAs and their corresponding genomic sequences have been reassembled, and the resulting assemblage is called an ACEG (an Assembly of contiguous EST sequences supported by genomic sequence) (Jain et al., 2007). Most of the unique genes or ACEGs are also represented by gene models that have been generated by the Joint Genome Institute (JGI, Walnut Creek, CA). These gene models have been placed onto the DNA scaffolds and are presented as a track on the Chlamydomonas genome browser associated with the genome portal (http://genome.jgi-psf.org/Chlre3/Chlre3.home.html). Ultimately, the meeting grant awarded by DOE has helped enormously in the development of an annotation pipeline (a set of guidelines used in the annotation of genes) and resulted in high quality annotation of over 4,000 genes; the annotators were from both Europe and the USA. Some of the people who led the annotation initiative were Arthur Grossman, Olivier Vallon, and Sabeeha Merchant (with many individual

  7. A framework for automated enrichment of functionally significant inverted repeats in whole genomes

    Directory of Open Access Journals (Sweden)

    Frank Ronald L

    2010-10-01

    Full Text Available Abstract Background RNA transcripts from genomic sequences showing dyad symmetry typically adopt hairpin-like, cloverleaf, or similar structures that act as recognition sites for proteins. Such structures often are the precursors of non-coding RNA (ncRNA sequences like microRNA (miRNA and small-interfering RNA (siRNA that have recently garnered more functional significance than in the past. Genomic DNA contains hundreds of thousands of such inverted repeats (IRs with varying degrees of symmetry. But by collecting statistically significant information from a known set of ncRNA, we can sort these IRs into those that are likely to be functional. Results A novel method was developed to scan genomic DNA for partially symmetric inverted repeats and the resulting set was further refined to match miRNA precursors (pre-miRNA with respect to their density of symmetry, statistical probability of the symmetry, length of stems in the predicted hairpin secondary structure, and the GC content of the stems. This method was applied on the Arabidopsis thaliana genome and validated against the set of 190 known Arabidopsis pre-miRNA in the miRBase database. A preliminary scan for IRs identified 186 of the known pre-miRNA but with 714700 pre-miRNA candidates. This large number of IRs was further refined to 483908 candidates with 183 pre-miRNA identified and further still to 165371 candidates with 171 pre-miRNA identified (i.e. with 90% of the known pre-miRNA retained. Conclusions 165371 candidates for potentially functional miRNA is still too large a set to warrant wet lab analyses, such as northern blotting, on all of them. Hence additional filters are needed to further refine the number of candidates while still retaining most of the known miRNA. These include detection of promoters and terminators, homology analyses, location of candidate relative to coding regions, and better secondary structure prediction algorithms. The software developed is designed to easily

  8. Interlinking Multimedia Annotations

    OpenAIRE

    Li, Yunjia; Wald, Mike; Wills, Gary

    2011-01-01

    With the fast growth of multimedia sharing and annotating applications on the Web, there is an increasing research interests in semantic annotations of multimedia. However, applying linked data principles in multimedia annotations is a relatively new topic, especially when annotations are related to media fragments. This paper, therefore, discusses this problem and further breaks it down into three fundamental sub-questions: 1) choosing media fragment URIs 2) Dereferencing media fragment URIs...

  9. Plant protein annotation in the UniProt Knowledgebase.

    Science.gov (United States)

    Schneider, Michel; Bairoch, Amos; Wu, Cathy H; Apweiler, Rolf

    2005-05-01

    The Swiss-Prot, TrEMBL, Protein Information Resource (PIR), and DNA Data Bank of Japan (DDBJ) protein database activities have united to form the Universal Protein Resource (UniProt) Consortium. UniProt presents three database layers: the UniProt Archive, the UniProt Knowledgebase (UniProtKB), and the UniProt Reference Clusters. The UniProtKB consists of two sections: UniProtKB/Swiss-Prot (fully manually curated entries) and UniProtKB/TrEMBL (automated annotation, classification and extensive cross-references). New releases are published fortnightly. A specific Plant Proteome Annotation Program (http://www.expasy.org/sprot/ppap/) was initiated to cope with the increasing amount of data produced by the complete sequencing of plant genomes. Through UniProt, our aim is to provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information that will allow the plant community to fully explore and utilize the wealth of information available for both plant and non-plant model organisms. PMID:15888679

  10. Plant Protein Annotation in the UniProt Knowledgebase1

    Science.gov (United States)

    Schneider, Michel; Bairoch, Amos; Wu, Cathy H.; Apweiler, Rolf

    2005-01-01

    The Swiss-Prot, TrEMBL, Protein Information Resource (PIR), and DNA Data Bank of Japan (DDBJ) protein database activities have united to form the Universal Protein Resource (UniProt) Consortium. UniProt presents three database layers: the UniProt Archive, the UniProt Knowledgebase (UniProtKB), and the UniProt Reference Clusters. The UniProtKB consists of two sections: UniProtKB/Swiss-Prot (fully manually curated entries) and UniProtKB/TrEMBL (automated annotation, classification and extensive cross-references). New releases are published fortnightly. A specific Plant Proteome Annotation Program (http://www.expasy.org/sprot/ppap/) was initiated to cope with the increasing amount of data produced by the complete sequencing of plant genomes. Through UniProt, our aim is to provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information that will allow the plant community to fully explore and utilize the wealth of information available for both plant and nonplant model organisms. PMID:15888679

  11. Ontological Annotation with WordNet

    Energy Technology Data Exchange (ETDEWEB)

    Sanfilippo, Antonio P.; Tratz, Stephen C.; Gregory, Michelle L.; Chappell, Alan R.; Whitney, Paul D.; Posse, Christian; Paulson, Patrick R.; Baddeley, Bob; Hohimer, Ryan E.; White, Amanda M.

    2006-06-06

    Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation. WordNet provides a potentially ideal solution to this problem as it offers a highly structured lexical conceptual representation that has been extensively used to develop word sense disambiguation algorithms. However, WordNet has not been designed as an ontology, and while it can be easily turned into one, the result of doing this would present users with serious practical limitations due to the great number of concepts (synonym sets) it contains. Moreover, mapping WordNet to an existing ontology may be difficult and requires substantial labor. We propose to overcome these limitations by developing an analytical platform that (1) provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes, (2) leverages the lexical richness of WordNet to give an extensive characterization of concept class in terms of lexical instances, and (3) integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text. The ensuing framework makes available an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.

  12. ECR Browser: A Tool For Visualizing And Accessing Data From Comparisons Of Multiple Vertebrate Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Loots, G G; Ovcharenko, I; Stubbs, L; Nobrega, M A

    2004-01-06

    The increasing number of vertebrate genomes being sequenced in draft or finished form provide a unique opportunity to study and decode the language of DNA sequence through comparative genome alignments. However, novel tools and strategies are required to accommodate this increasing volume of genomic information and to facilitate experimental annotation of genome function. Here we present the ECR Browser, a tool that provides an easy and dynamic access to whole genome alignments of human, mouse, rat and fish sequences. This web-based tool (http://ecrbrowser.dcode.org) provides the starting point for discovery of novel genes, identification of distant gene regulatory elements and prediction of transcription factor binding sites. The genome alignment portal of the ECR Browser also permits fast and automated alignment of any user-submitted sequence to the genome of choice. The interconnection of the ECR browser with other DNA sequence analysis tools creates a unique portal for studying and exploring vertebrate genomes.

  13. Comparative validation of the D. melanogaster modENCODE transcriptome annotation

    OpenAIRE

    Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin; Jiang, Huaiyang; Park, Soo; Boley, Nathan; Suzuki, Ana Maria; Anthony R. Fletcher; David C Plachetzki; FitzGerald, Peter C.; Artieri, Carlo G.; Atallah, Joel; Barmina, Olga; Brown, James B.; Blankenburg, Kerstin P

    2014-01-01

    Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the ...

  14. High-Quality Genome Assembly and Annotation for Plasmodium coatneyi, Generated Using Single-Molecule Real-Time PacBio Technology.

    Science.gov (United States)

    Chien, Jung-Ting; Pakala, Suman B; Geraldo, Juliana A; Lapp, Stacey A; Humphrey, Jay C; Barnwell, John W; Kissinger, Jessica C; Galinski, Mary R

    2016-01-01

    Plasmodium coatneyi is a protozoan parasite species that causes simian malaria and is an excellent model for studying disease caused by the human malaria parasite, P. falciparum Here we report the complete (nontelomeric) genome sequence of P. coatneyi Hackeri generated by the application of only Pacific Biosciences RS II (PacBio RS II) single-molecule real-time (SMRT) high-resolution sequence technology and assembly using the Hierarchical Genome Assembly Process (HGAP). This is the first Plasmodium genome sequence reported to use only PacBio technology. This approach has proven to be superior to short-read only approaches for this species. PMID:27587810

  15. Gene Ontology annotations at SGD: new data sources and annotation methods.

    Science.gov (United States)

    Hong, Eurie L; Balakrishnan, Rama; Dong, Qing; Christie, Karen R; Park, Julie; Binkley, Gail; Costanzo, Maria C; Dwight, Selina S; Engel, Stacia R; Fisk, Dianna G; Hirschman, Jodi E; Hitz, Benjamin C; Krieger, Cynthia J; Livstone, Michael S; Miyasato, Stuart R; Nash, Robert S; Oughtred, Rose; Skrzypek, Marek S; Weng, Shuai; Wong, Edith D; Zhu, Kathy K; Dolinski, Kara; Botstein, David; Cherry, J Michael

    2008-01-01

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current. PMID:17982175

  16. Transcriptome Annotation using Tandem SAGE Tags

    OpenAIRE

    Rivals, Eric; Boureux, Anthony; Lejeune, Mireille; Ottones, Florence; Pecharromàn Pérez, Oscar; Tarhio, Jorma; Pierrat, Fabien; Ruffle, Florence; Commes, Thérèse; Marti, Jacques

    2007-01-01

    Analysis of several million expressed gene signatures (tags) revealed an increasing number of different sequences, largely exceeding that of annotated genes in mammalian genomes. Serial Analysis of Gene Expression (SAGE) can reveal new polyadenylated RNAs transcribed from previously unrecognized chromosomal regions. However, conventional SAGE tags are too short to identify unambiguously unique sites in large genomes. Here, we design a novel strategy with tags anchored on two different restric...

  17. Annotations for Intersection Typechecking

    Directory of Open Access Journals (Sweden)

    Joshua Dunfield

    2013-07-01

    Full Text Available In functional programming languages, the classic form of annotation is a single type constraint on a term. Intersection types add complications: a single term may have to be checked several times against different types, in different contexts, requiring annotation with several types. Moreover, it is useful (in some systems, necessary to indicate the context in which each such type is to be used. This paper explores the technical design space of annotations in systems with intersection types. Earlier work (Dunfield and Pfenning 2004 introduced contextual typing annotations, which we now tease apart into more elementary mechanisms: a "right hand" annotation (the standard form, a "left hand" annotation (the context in which a right-hand annotation is to be used, a merge that allows for multiple annotations, and an existential binder for index variables. The most novel element is the left-hand annotation, which guards terms (and right-hand annotations with a judgment that must follow from the current context.

  18. Annotated Stack Trees

    OpenAIRE

    Hague, Matthew; Penelle, Vincent

    2015-01-01

    Annotated pushdown automata provide an automaton model of higher-order recursion schemes, which may in turn be used to model higher-order programs for the purposes of verification. We study Ground Annotated Stack Tree Rewrite Systems -- a tree rewrite system where each node is labelled by the configuration of an annotated pushdown automaton. This allows the modelling of fork and join constructs in higher-order programs and is a generalisation of higher-order stack trees recently introduced by...

  19. High-coverage sequencing and annotated assembly of the genome of the Australian dragon lizard Pogona vitticeps

    OpenAIRE

    Georges, Arthur; Li, Qiye; Lian, Jinmin; O'Meally, Denis; Deakin, Janine; Wang, Zongji; Zhang, Pei; Fujita, Matthew; Patel, Hardip R.; Clare E Holleley; Yang ZHOU; Zhang, Xiuwen; Matsubara, Kazumi; Waters, Paul; Graves, Jennifer A. Marshall

    2015-01-01

    Background The lizards of the family Agamidae are one of the most prominent elements of the Australian reptile fauna. Here, we present a genomic resource built on the basis of a wild-caught male ZZ central bearded dragon Pogona vitticeps. Findings The genomic sequence for P. vitticeps, generated on the Illumina HiSeq 2000 platform, comprised 317 Gbp (179X raw read depth) from 13 insert libraries ranging from 250 bp to 40 kbp. After filtering for low-quality and duplicated reads, 146 Gbp of da...

  20. Computing human image annotation.

    Science.gov (United States)

    Channin, David S; Mongkolwat, Pattanasak; Kleper, Vladimir; Rubin, Daniel L

    2009-01-01

    An image annotation is the explanatory or descriptive information about the pixel data of an image that is generated by a human (or machine) observer. An image markup is the graphical symbols placed over the image to depict an annotation. In the majority of current, clinical and research imaging practice, markup is captured in proprietary formats and annotations are referenced only in free text radiology reports. This makes these annotations difficult to query, retrieve and compute upon, hampering their integration into other data mining and analysis efforts. This paper describes the National Cancer Institute's Cancer Biomedical Informatics Grid's (caBIG) Annotation and Image Markup (AIM) project, focusing on how to use AIM to query for annotations. The AIM project delivers an information model for image annotation and markup. The model uses controlled terminologies for important concepts. All of the classes and attributes of the model have been harmonized with the other models and common data elements in use at the National Cancer Institute. The project also delivers XML schemata necessary to instantiate AIMs in XML as well as a software application for translating AIM XML into DICOM S/R and HL7 CDA. Large collections of AIM annotations can be built and then queried as Grid or Web services. Using the tools of the AIM project, image annotations and their markup can be captured and stored in human and machine readable formats. This enables the inclusion of human image observation and inference as part of larger data mining and analysis activities. PMID:19964202

  1. The GATO gene annotation tool for research laboratories

    Directory of Open Access Journals (Sweden)

    A. Fujita

    2005-11-01

    Full Text Available Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO is a Bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from everywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. Minimum computer free space required is 2 MB.

  2. The GATO gene annotation tool for research laboratories.

    Science.gov (United States)

    Fujita, A; Massirer, K B; Durham, A M; Ferreira, C E; Sogayar, M C

    2005-11-01

    Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a Bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from everywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. Minimum computer free space required is 2 MB. PMID:16258624

  3. Biosynthesis of Akaeolide and Lorneic Acids and Annotation of Type I Polyketide Synthase Gene Clusters in the Genome of Streptomyces sp. NPS554

    Directory of Open Access Journals (Sweden)

    Tao Zhou

    2015-01-01

    Full Text Available The incorporation pattern of biosynthetic precursors into two structurally unique polyketides, akaeolide and lorneic acid A, was elucidated by feeding experiments with 13C-labeled precursors. In addition, the draft genome sequence of the producer, Streptomyces sp. NPS554, was performed and the biosynthetic gene clusters for these polyketides were identified. The putative gene clusters contain all the polyketide synthase (PKS domains necessary for assembly of the carbon skeletons. Combined with the 13C-labeling results, gene function prediction enabled us to propose biosynthetic pathways involving unusual carbon-carbon bond formation reactions. Genome analysis also indicated the presence of at least ten orphan type I PKS gene clusters that might be responsible for the production of new polyketides.

  4. A new approach for annotation of transposable elements using small RNA mapping.

    Science.gov (United States)

    El Baidouri, Moaine; Kim, Kyung Do; Abernathy, Brian; Arikit, Siwaret; Maumus, Florian; Panaud, Olivier; Meyers, Blake C; Jackson, Scott A

    2015-07-27

    Transposable elements (TEs) are mobile genomic DNA sequences found in most organisms. They so densely populate the genomes of many eukaryotic species that they are often the major constituents. With the rapid generation of many plant genome sequencing projects over the past few decades, there is an urgent need for improved TE annotation as a prerequisite for genome-wide studies. Analogous to the use of RNA-seq for gene annotation, we propose a new method for de novo TE annotation that uses as a guide 24 nt-siRNAs that are a part of TE silencing pathways. We use this new approach, called TASR (for Transposon Annotation using Small RNAs), for de novo annotation of TEs in Arabidopsis, rice and soybean and demonstrate that this strategy can be successfully applied for de novo TE annotation in plants.Executable PERL is available for download from: http://tasr-pipeline.sourceforge.net/. PMID:25813049

  5. Functional annotations of diabetes nephropathy susceptibility loci through analysis of genome-wide renal gene expression in rat models of diabetes mellitus

    DEFF Research Database (Denmark)

    Hu, Yaomin; Kaisaki, Pamela J; Argoud, Karène; Wilder, Steven P; Wallace, Karin J; Woon, Peng Y; Blancher, Christine; Tarnow, Lise; Groop, Per-Henrik; Hadjadj, Samy; Marre, Michel; Parving, Hans-Henrik; Farrall, Martin; Cox, Roger D; Lathrop, Mark; Vionnet, Nathalie; Bihoreau, Marie-Thérèse; Gauguier, Dominique

    2009-01-01

    BACKGROUND: Hyperglycaemia in diabetes mellitus (DM) alters gene expression regulation in various organs and contributes to long term vascular and renal complications. We aimed to generate novel renal genome-wide gene transcription data in rat models of diabetes in order to test the responsiveness...... models of spontaneous (genetically determined) mild hyperglycaemia and insulin resistance (Goto-Kakizaki-GK) and experimentally induced severe hyperglycaemia (Wistar-Kyoto-WKY rats injected with streptozotocin [STZ]). RESULTS: Different patterns of transcription regulation in the two rat models of...

  6. Personnalisation de Syst\\`emes OLAP Annot\\'es

    CERN Document Server

    Jerbi, Houssem; Ravat, Franck; Teste, Olivier

    2010-01-01

    This paper deals with personalization of annotated OLAP systems. Data constellation is extended to support annotations and user preferences. Annotations reflect the decision-maker experience whereas user preferences enable users to focus on the most interesting data. User preferences allow annotated contextual recommendations helping the decision-maker during his/her multidimensional navigations.

  7. Genome Sequencing

    DEFF Research Database (Denmark)

    Sato, Shusei; Andersen, Stig Uggerhøj

    2014-01-01

    The current Lotus japonicus reference genome sequence is based on a hybrid assembly of Sanger TAC/BAC, Sanger shotgun and Illumina shotgun sequencing data generated from the Miyakojima-MG20 accession. It covers nearly all expressed L. japonicus genes and has been annotated mainly based on transcr......The current Lotus japonicus reference genome sequence is based on a hybrid assembly of Sanger TAC/BAC, Sanger shotgun and Illumina shotgun sequencing data generated from the Miyakojima-MG20 accession. It covers nearly all expressed L. japonicus genes and has been annotated mainly based...

  8. Genome-Wide Computational Analysis of Musa Microsatellites: Classification, Cross-Taxon Transferability, Functional Annotation, Association with Transposons & miRNAs, and Genetic Marker Potential.

    Science.gov (United States)

    Biswas, Manosh Kumar; Liu, Yuxuan; Li, Chunyu; Sheng, Ou; Mayer, Christoph; Yi, Ganjun

    2015-01-01

    The development of organized, informative, robust, user-friendly, and freely accessible molecular markers is imperative to the Musa marker assisted breeding program. Although several hundred SSR markers have already been developed, the number of informative, robust, and freely accessible Musa markers remains inadequate for some breeding applications. In view of this issue, we surveyed SSRs in four different data sets, developed large-scale non-redundant highly informative therapeutic SSR markers, and classified them according to their attributes, as well as analyzed their cross-taxon transferability and utility for the genetic study of Musa and its relatives. A high SSR frequency (177 per Mbp) was found in the Musa genome. AT-rich dinucleotide repeats are predominant, and trinucleotide repeats are the most abundant in transcribed regions. A significant number of Musa SSRs are associated with pre-miRNAs, and 83% of these SSRs are promising candidates for the development of therapeutic SSR markers. Overall, 74% of the SSR markers were polymorphic, and 94% were transferable to at least one Musa spp. Two hundred forty-three markers generated a total of 1047 alleles, with 2-8 alleles each and an average of 4.38 alleles per locus. The PIC values ranged from 0.31 to 0.89 and averaged 0.71. We report the largest set of non-redundant, polymorphic, new SSR markers to be developed in Musa. These additional markers could be a valuable resource for marker-assisted breeding, genetic diversity and genomic studies of Musa and related species. PMID:26121637

  9. GeneViTo: Visualizing gene-product functional and structural features in genomic datasets

    Directory of Open Access Journals (Sweden)

    Promponas Vasilis J

    2003-10-01

    Full Text Available Abstract Background The availability of increasing amounts of sequence data from completely sequenced genomes boosts the development of new computational methods for automated genome annotation and comparative genomics. Therefore, there is a need for tools that facilitate the visualization of raw data and results produced by bioinformatics analysis, providing new means for interactive genome exploration. Visual inspection can be used as a basis to assess the quality of various analysis algorithms and to aid in-depth genomic studies. Results GeneViTo is a JAVA-based computer application that serves as a workbench for genome-wide analysis through visual interaction. The application deals with various experimental information concerning both DNA and protein sequences (derived from public sequence databases or proprietary data sources and meta-data obtained by various prediction algorithms, classification schemes or user-defined features. Interaction with a Graphical User Interface (GUI allows easy extraction of genomic and proteomic data referring to the sequence itself, sequence features, or general structural and functional features. Emphasis is laid on the potential comparison between annotation and prediction data in order to offer a supplement to the provided information, especially in cases of "poor" annotation, or an evaluation of available predictions. Moreover, desired information can be output in high quality JPEG image files for further elaboration and scientific use. A compilation of properly formatted GeneViTo input data for demonstration is available to interested readers for two completely sequenced prokaryotes, Chlamydia trachomatis and Methanococcus jannaschii. Conclusions GeneViTo offers an inspectional view of genomic functional elements, concerning data stemming both from database annotation and analysis tools for an overall analysis of existing genomes. The application is compatible with Linux or Windows ME-2000-XP operating

  10. Creating Annotation Tools with the Annotation Graph Toolkit

    OpenAIRE

    Maeda, Kazuaki; Bird, Steven; Ma, Xiaoyi; Lee, Haejoong

    2002-01-01

    The Annotation Graph Toolkit is a collection of software supporting the development of annotation tools based on the annotation graph model. The toolkit includes application programming interfaces for manipulating annotation graph data and for importing data from other formats. There are interfaces for the scripting languages Tcl and Python, a database interface, specialized graphical user interfaces for a variety of annotation tasks, and several sample applications. This paper describes all ...

  11. Model and Interoperability using Meta Data Annotations

    Science.gov (United States)

    David, O.

    2011-12-01

    Software frameworks and architectures are in need for meta data to efficiently support model integration. Modelers have to know the context of a model, often stepping into modeling semantics and auxiliary information usually not provided in a concise structure and universal format, consumable by a range of (modeling) tools. XML often seems the obvious solution for capturing meta data, but its wide adoption to facilitate model interoperability is limited by XML schema fragmentation, complexity, and verbosity outside of a data-automation process. Ontologies seem to overcome those shortcomings, however the practical significance of their use remains to be demonstrated. OMS version 3 took a different approach for meta data representation. The fundamental building block of a modular model in OMS is a software component representing a single physical process, calibration method, or data access approach. Here, programing language features known as Annotations or Attributes were adopted. Within other (non-modeling) frameworks it has been observed that annotations lead to cleaner and leaner application code. Framework-supported model integration, traditionally accomplished using Application Programming Interfaces (API) calls is now achieved using descriptive code annotations. Fully annotated components for various hydrological and Ag-system models now provide information directly for (i) model assembly and building, (ii) data flow analysis for implicit multi-threading or visualization, (iii) automated and comprehensive model documentation of component dependencies, physical data properties, (iv) automated model and component testing, calibration, and optimization, and (v) automated audit-traceability to account for all model resources leading to a particular simulation result. Such a non-invasive methodology leads to models and modeling components with only minimal dependencies on the modeling framework but a strong reference to its originating code. Since models and

  12. A study into annotation ranking metrics in geo-tagged image corpora

    OpenAIRE

    Hughes, Mark; Jones, Gareth J.F.; O'Connor, Noel E.

    2012-01-01

    Community contributed datasets are becoming increasingly common in automated image annotation systems. One important issue with community image data is that there is no guarantee that the associated metadata is relevant. A method is required that can accurately rank the semantic relevance of community annotations. This should enable the extracting of relevant subsets from potentially noisy collections of these annotations. Having relevant, non heterogeneous tags assigned to images should impr...

  13. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads

    Energy Technology Data Exchange (ETDEWEB)

    Martin, Jeffrey; Bruno, Vincent M.; Fang, Zhide; Meng, Xiandong; Blow, Matthew; Zhang, Tao; Sherlock, Gavin; Snyder, Michael; Wang, Zhong

    2010-11-19

    Background: Comprehensive annotation and quantification of transcriptomes are outstanding problems in functional genomics. While high throughput mRNA sequencing (RNA-Seq) has emerged as a powerful tool for addressing these problems, its success is dependent upon the availability and quality of reference genome sequences, thus limiting the organisms to which it can be applied. Results: Here, we describe Rnnotator, an automated software pipeline that generates transcript models by de novo assembly of RNA-Seq data without the need for a reference genome. We have applied the Rnnotator assembly pipeline to two yeast transcriptomes and compared the results to the reference gene catalogs of these organisms. The contigs produced by Rnnotator are highly accurate (95percent) and reconstruct full-length genes for the majority of the existing gene models (54.3percent). Furthermore, our analyses revealed many novel transcribed regions that are absent from well annotated genomes, suggesting Rnnotator serves as a complementary approach to analysis based on a reference genome for comprehensive transcriptomics. Conclusions: These results demonstrate that the Rnnotator pipeline is able to reconstruct full-length transcripts in the absence of a complete reference genome.

  14. On Anomalies in Annotation Systems

    CERN Document Server

    Brust, Matthias R

    2007-01-01

    Today's computer-based annotation systems implement a wide range of functionalities that often go beyond those available in traditional paper-and-pencil annotations. Conceptually, annotation systems are based on thoroughly investigated psycho-sociological and pedagogical learning theories. They offer a huge diversity of annotation types that can be placed in textual as well as in multimedia format. Additionally, annotations can be published or shared with a group of interested parties via well-organized repositories. Although highly sophisticated annotation systems exist both conceptually as well as technologically, we still observe that their acceptance is somewhat limited. In this paper, we argue that nowadays annotation systems suffer from several fundamental problems that are inherent in the traditional paper-and-pencil annotation paradigm. As a solution, we propose to shift the annotation paradigm for the implementation of annotation system.

  15. Semantic annotation of mutable data.

    Directory of Open Access Journals (Sweden)

    Robert A Morris

    Full Text Available Electronic annotation of scientific data is very similar to annotation of documents. Both types of annotation amplify the original object, add related knowledge to it, and dispute or support assertions in it. In each case, annotation is a framework for discourse about the original object, and, in each case, an annotation needs to clearly identify its scope and its own terminology. However, electronic annotation of data differs from annotation of documents: the content of the annotations, including expectations and supporting evidence, is more often shared among members of networks. Any consequent actions taken by the holders of the annotated data could be shared as well. But even those current annotation systems that admit data as their subject often make it difficult or impossible to annotate at fine-enough granularity to use the results in this way for data quality control. We address these kinds of issues by offering simple extensions to an existing annotation ontology and describe how the results support an interest-based distribution of annotations. We are using the result to design and deploy a platform that supports annotation services overlaid on networks of distributed data, with particular application to data quality control. Our initial instance supports a set of natural science collection metadata services. An important application is the support for data quality control and provision of missing data. A previous proof of concept demonstrated such use based on data annotations modeled with XML-Schema.

  16. The UCSC Genome Browser database: 2015 update

    OpenAIRE

    Rosenbloom, Kate R.; Armstrong, Joel; Barber, Galt P.; Casper, Jonathan; Clawson, Hiram; Diekhans, Mark; Dreszer, Timothy R.; Fujita, Pauline A.; Guruvadoo, Luvina; Haeussler, Maximilian; Harte, Rachel A.; Heitner, Steve; Hickey, Glenn; Hinrichs, Angie S.; Hubley, Robert

    2014-01-01

    Launched in 2001 to showcase the draft human genome assembly, the UCSC Genome Browser database (http://genome.ucsc.edu) and associated tools continue to grow, providing a comprehensive resource of genome assemblies and annotations to scientists and students worldwide. Highlights of the past year include the release of a browser for the first new human genome reference assembly in 4 years in December 2013 (GRCh38, UCSC hg38), a watershed comparative genomics annotation (100-species multiple al...

  17. Building a genome database using an object-oriented approach.

    Science.gov (United States)

    Barbasiewicz, Anna; Liu, Lin; Lang, B Franz; Burger, Gertraud

    2002-01-01

    GOBASE is a relational database that integrates data associated with mitochondria and chloroplasts. The most important data in GOBASE, i. e., molecular sequences and taxonomic information, are obtained from the public sequence data repository at the National Center for Biotechnology Information (NCBI), and are validated by our experts. Maintaining a curated genomic database comes with a towering labor cost, due to the shear volume of available genomic sequences and the plethora of annotation errors and omissions in records retrieved from public repositories. Here we describe our approach to increase automation of the database population process, thereby reducing manual intervention. As a first step, we used Unified Modeling Language (UML) to construct a list of potential errors. Each case was evaluated independently, and an expert solution was devised, and represented as a diagram. Subsequently, the UML diagrams were used as templates for writing object-oriented automation programs in the Java programming language. PMID:12542407

  18. Bioinformatics and genomic medicine.

    Science.gov (United States)

    Kim, Ju Han

    2002-01-01

    Bioinformatics is a rapidly emerging field of biomedical research. A flood of large-scale genomic and postgenomic data means that many of the challenges in biomedical research are now challenges in computational science. Clinical informatics has long developed methodologies to improve biomedical research and clinical care by integrating experimental and clinical information systems. The informatics revolution in both bioinformatics and clinical informatics will eventually change the current practice of medicine, including diagnostics, therapeutics, and prognostics. Postgenome informatics, powered by high-throughput technologies and genomic-scale databases, is likely to transform our biomedical understanding forever, in much the same way that biochemistry did a generation ago. This paper describes how these technologies will impact biomedical research and clinical care, emphasizing recent advances in biochip-based functional genomics and proteomics. Basic data preprocessing with normalization and filtering, primary pattern analysis, and machine-learning algorithms are discussed. Use of integrative biochip informatics technologies, including multivariate data projection, gene-metabolic pathway mapping, automated biomolecular annotation, text mining of factual and literature databases, and the integrated management of biomolecular databases, are also discussed. PMID:12544491

  19. Comparative validation of the D. melanogaster modENCODE transcriptome annotation.

    Science.gov (United States)

    Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin; Jiang, Huaiyang; Park, Soo; Boley, Nathan; Suzuki, Ana Maria; Fletcher, Anthony R; Plachetzki, David C; FitzGerald, Peter C; Artieri, Carlo G; Atallah, Joel; Barmina, Olga; Brown, James B; Blankenburg, Kerstin P; Clough, Emily; Dasgupta, Abhijit; Gubbala, Sai; Han, Yi; Jayaseelan, Joy C; Kalra, Divya; Kim, Yoo-Ah; Kovar, Christie L; Lee, Sandra L; Li, Mingmei; Malley, James D; Malone, John H; Mathew, Tittu; Mattiuzzo, Nicolas R; Munidasa, Mala; Muzny, Donna M; Ongeri, Fiona; Perales, Lora; Przytycka, Teresa M; Pu, Ling-Ling; Robinson, Garrett; Thornton, Rebecca L; Saada, Nehad; Scherer, Steven E; Smith, Harold E; Vinson, Charles; Warner, Crystal B; Worley, Kim C; Wu, Yuan-Qing; Zou, Xiaoyan; Cherbas, Peter; Kellis, Manolis; Eisen, Michael B; Piano, Fabio; Kionte, Karin; Fitch, David H; Sternberg, Paul W; Cutter, Asher D; Duff, Michael O; Hoskins, Roger A; Graveley, Brenton R; Gibbs, Richard A; Bickel, Peter J; Kopp, Artyom; Carninci, Piero; Celniker, Susan E; Oliver, Brian; Richards, Stephen

    2014-07-01

    Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community. PMID:24985915

  20. DDBJ in collaboration with mass-sequencing teams on annotation

    OpenAIRE

    Tateno, Y; Saitou, N; Okubo, K; Sugawara, H.; Gojobori, T

    2004-01-01

    In the past year, we at DDBJ (DNA Data Bank of Japan; http://www.ddbj.nig.ac.jp) collected and released 1 066 084 entries or 718 072 425 bases including the whole chromosome 22 of chimpanzee, the whole-genome shotgun sequences of silkworm and various others. On the other hand, we hosted workshops for human full-length cDNA annotation and participated in jamborees of mouse full-length cDNA annotation. The annotated data are made public at DDBJ. We are also in collaboration with a RIKEN team to...

  1. A Machine Learning Based Analytical Framework for Semantic Annotation Requirements

    CERN Document Server

    Hassanzadeh, Hamed; 10.5121/ijwest.2011.2203

    2011-01-01

    The Semantic Web is an extension of the current web in which information is given well-defined meaning. The perspective of Semantic Web is to promote the quality and intelligence of the current web by changing its contents into machine understandable form. Therefore, semantic level information is one of the cornerstones of the Semantic Web. The process of adding semantic metadata to web resources is called Semantic Annotation. There are many obstacles against the Semantic Annotation, such as multilinguality, scalability, and issues which are related to diversity and inconsistency in content of different web pages. Due to the wide range of domains and the dynamic environments that the Semantic Annotation systems must be performed on, the problem of automating annotation process is one of the significant challenges in this domain. To overcome this problem, different machine learning approaches such as supervised learning, unsupervised learning and more recent ones like, semi-supervised learning and active learn...

  2. GIFtS: annotation landscape analysis with GeneCards

    Directory of Open Access Journals (Sweden)

    Dalah Irina

    2009-10-01

    Full Text Available Abstract Background Gene annotation is a pivotal component in computational genomics, encompassing prediction of gene function, expression analysis, and sequence scrutiny. Hence, quantitative measures of the annotation landscape constitute a pertinent bioinformatics tool. GeneCards® is a gene-centric compendium of rich annotative information for over 50,000 human gene entries, building upon 68 data sources, including Gene Ontology (GO, pathways, interactions, phenotypes, publications and many more. Results We present the GeneCards Inferred Functionality Score (GIFtS which allows a quantitative assessment of a gene's annotation status, by exploiting the unique wealth and diversity of GeneCards information. The GIFtS tool, linked from the GeneCards home page, facilitates browsing the human genome by searching for the annotation level of a specified gene, retrieving a list of genes within a specified range of GIFtS value, obtaining random genes with a specific GIFtS value, and experimenting with the GIFtS weighting algorithm for a variety of annotation categories. The bimodal shape of the GIFtS distribution suggests a division of the human gene repertoire into two main groups: the high-GIFtS peak consists almost entirely of protein-coding genes; the low-GIFtS peak consists of genes from all of the categories. Cluster analysis of GIFtS annotation vectors provides the classification of gene groups by detailed positioning in the annotation arena. GIFtS also provide measures which enable the evaluation of the databases that serve as GeneCards sources. An inverse correlation is found (for GIFtS>25 between the number of genes annotated by each source, and the average GIFtS value of genes associated with that source. Three typical source prototypes are revealed by their GIFtS distribution: genome-wide sources, sources comprising mainly highly annotated genes, and sources comprising mainly poorly annotated genes. The degree of accumulated knowledge for a

  3. Collaborative Movie Annotation

    Science.gov (United States)

    Zad, Damon Daylamani; Agius, Harry

    In this paper, we focus on metadata for self-created movies like those found on YouTube and Google Video, the duration of which are increasing in line with falling upload restrictions. While simple tags may have been sufficient for most purposes for traditionally very short video footage that contains a relatively small amount of semantic content, this is not the case for movies of longer duration which embody more intricate semantics. Creating metadata is a time-consuming process that takes a great deal of individual effort; however, this effort can be greatly reduced by harnessing the power of Web 2.0 communities to create, update and maintain it. Consequently, we consider the annotation of movies within Web 2.0 environments, such that users create and share that metadata collaboratively and propose an architecture for collaborative movie annotation. This architecture arises from the results of an empirical experiment where metadata creation tools, YouTube and an MPEG-7 modelling tool, were used by users to create movie metadata. The next section discusses related work in the areas of collaborative retrieval and tagging. Then, we describe the experiments that were undertaken on a sample of 50 users. Next, the results are presented which provide some insight into how users interact with existing tools and systems for annotating movies. Based on these results, the paper then develops an architecture for collaborative movie annotation.

  4. Annotated bibliography traceability

    NARCIS (Netherlands)

    Narain, G.

    2006-01-01

    This annotated bibliography contains summaries of articles and chapters of books, which are relevant to traceability. After each summary there is a part about the relevancy of the paper for the LEI project. The aim of the LEI-project is to gain insight in several aspects of traceability in order to

  5. Annotation: The Savant Syndrome

    Science.gov (United States)

    Heaton, Pamela; Wallace, Gregory L.

    2004-01-01

    Background: Whilst interest has focused on the origin and nature of the savant syndrome for over a century, it is only within the past two decades that empirical group studies have been carried out. Methods: The following annotation briefly reviews relevant research and also attempts to address outstanding issues in this research area.…

  6. Transcriptome annotation using tandem SAGE tags

    Science.gov (United States)

    Rivals, Eric; Boureux, Anthony; Lejeune, Mireille; Ottones, Florence; Pecharromàn Pérez, Oscar; Tarhio, Jorma; Pierrat, Fabien; Ruffle, Florence; Commes, Thérèse; Marti, Jacques

    2007-01-01

    Analysis of several million expressed gene signatures (tags) revealed an increasing number of different sequences, largely exceeding that of annotated genes in mammalian genomes. Serial analysis of gene expression (SAGE) can reveal new Poly(A) RNAs transcribed from previously unrecognized chromosomal regions. However, conventional SAGE tags are too short to identify unambiguously unique sites in large genomes. Here, we design a novel strategy with tags anchored on two different restrictions sites of cDNAs. New transcripts are then tentatively defined by the two SAGE tags in tandem and by the spanning sequence read on the genome between these tagged sites. Having developed a new algorithm to locate these tag-delimited genomic sequences (TDGS), we first validated its capacity to recognize known genes and its ability to reveal new transcripts with two SAGE libraries built in parallel from a single RNA sample. Our algorithm proves fast enough to experiment this strategy at a large scale. We then collected and processed the complete sets of human SAGE tags to predict yet unknown transcripts. A cross-validation with tiling arrays data shows that 47% of these TDGS overlap transcriptional active regions. Our method provides a new and complementary approach for complex transcriptome annotation. PMID:17709346

  7. ALLocator: an interactive web platform for the analysis of metabolomic LC-ESI-MS datasets, enabling semi-automated, user-revised compound annotation and mass isotopomer ratio analysis.

    Directory of Open Access Journals (Sweden)

    Nikolas Kessler

    Full Text Available Adduct formation, fragmentation events and matrix effects impose special challenges to the identification and quantitation of metabolites in LC-ESI-MS datasets. An important step in compound identification is the deconvolution of mass signals. During this processing step, peaks representing adducts, fragments, and isotopologues of the same analyte are allocated to a distinct group, in order to separate peaks from coeluting compounds. From these peak groups, neutral masses and pseudo spectra are derived and used for metabolite identification via mass decomposition and database matching. Quantitation of metabolites is hampered by matrix effects and nonlinear responses in LC-ESI-MS measurements. A common approach to correct for these effects is the addition of a U-13C-labeled internal standard and the calculation of mass isotopomer ratios for each metabolite. Here we present a new web-platform for the analysis of LC-ESI-MS experiments. ALLocator covers the workflow from raw data processing to metabolite identification and mass isotopomer ratio analysis. The integrated processing pipeline for spectra deconvolution "ALLocatorSD" generates pseudo spectra and automatically identifies peaks emerging from the U-13C-labeled internal standard. Information from the latter improves mass decomposition and annotation of neutral losses. ALLocator provides an interactive and dynamic interface to explore and enhance the results in depth. Pseudo spectra of identified metabolites can be stored in user- and method-specific reference lists that can be applied on succeeding datasets. The potential of the software is exemplified in an experiment, in which abundance fold-changes of metabolites of the l-arginine biosynthesis in C. glutamicum type strain ATCC 13032 and l-arginine producing strain ATCC 21831 are compared. Furthermore, the capability for detection and annotation of uncommon large neutral losses is shown by the identification of (γ-glutamyl dipeptides in

  8. Semi-automated curation of metabolic models via flux balance analysis: a case study with Mycoplasma gallisepticum.

    Directory of Open Access Journals (Sweden)

    Eddy J Bautista

    Full Text Available Primarily used for metabolic engineering and synthetic biology, genome-scale metabolic modeling shows tremendous potential as a tool for fundamental research and curation of metabolism. Through a novel integration of flux balance analysis and genetic algorithms, a strategy to curate metabolic networks and facilitate identification of metabolic pathways that may not be directly inferable solely from genome annotation was developed. Specifically, metabolites involved in unknown reactions can be determined, and potentially erroneous pathways can be identified. The procedure developed allows for new fundamental insight into metabolism, as well as acting as a semi-automated curation methodology for genome-scale metabolic modeling. To validate the methodology, a genome-scale metabolic model for the bacterium Mycoplasma gallisepticum was created. Several reactions not predicted by the genome annotation were postulated and validated via the literature. The model predicted an average growth rate of 0.358±0.12[Formula: see text], closely matching the experimentally determined growth rate of M. gallisepticum of 0.244±0.03[Formula: see text]. This work presents a powerful algorithm for facilitating the identification and curation of previously known and new metabolic pathways, as well as presenting the first genome-scale reconstruction of M. gallisepticum.

  9. Using computational predictions to improve literature-based Gene Ontology annotations: a feasibility study

    OpenAIRE

    Costanzo, Maria C.; Park, Julie; Balakrishnan, Rama; Cherry, J. Michael; Hong, Eurie L.

    2011-01-01

    Annotation using Gene Ontology (GO) terms is one of the most important ways in which biological information about specific gene products can be expressed in a searchable, computable form that may be compared across genomes and organisms. Because literature-based GO annotations are often used to propagate functional predictions between related proteins, their accuracy is critically important. We present a strategy that employs a comparison of literature-based annotations with computational pre...

  10. Gene Ontology annotations at SGD: new data sources and annotation methods

    OpenAIRE

    Hong, Eurie L.; Balakrishnan, Rama; Dong, Qing; Christie, Karen R.; Park, Julie; Binkley, Gail; Costanzo, Maria C.; Dwight, Selina S.; Engel, Stacia R.; Fisk, Dianna G.; Hirschman, Jodi E.; Hitz, Benjamin C.; Krieger, Cynthia J.; Livstone, Michael S.; Miyasato, Stuart R.

    2007-01-01

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive bo...

  11. Joint Analysis of Functional Genomic Data and Genome-wide Association Studies of 18 Human Traits

    OpenAIRE

    Pickrell, Joseph K.

    2014-01-01

    Annotations of gene structures and regulatory elements can inform genome-wide association studies (GWASs). However, choosing the relevant annotations for interpreting an association study of a given trait remains challenging. I describe a statistical model that uses association statistics computed across the genome to identify classes of genomic elements that are enriched with or depleted of loci influencing a trait. The model naturally incorporates multiple types of annotations. I applied th...

  12. EnzDP: improved enzyme annotation for metabolic network reconstruction based on domain composition profiles.

    Science.gov (United States)

    Nguyen, Nam-Ninh; Srihari, Sriganesh; Leong, Hon Wai; Chong, Ket-Fah

    2015-10-01

    Determining the entire complement of enzymes and their enzymatic functions is a fundamental step for reconstructing the metabolic network of cells. High quality enzyme annotation helps in enhancing metabolic networks reconstructed from the genome, especially by reducing gaps and increasing the enzyme coverage. Currently, structure-based and network-based approaches can only cover a limited number of enzyme families, and the accuracy of homology-based approaches can be further improved. Bottom-up homology-based approach improves the coverage by rebuilding Hidden Markov Model (HMM) profiles for all known enzymes. However, its clustering procedure relies firmly on BLAST similarity score, ignoring protein domains/patterns, and is sensitive to changes in cut-off thresholds. Here, we use functional domain architecture to score the association between domain families and enzyme families (Domain-Enzyme Association Scoring, DEAS). The DEAS score is used to calculate the similarity between proteins, which is then used in clustering procedure, instead of using sequence similarity score. We improve the enzyme annotation protocol using a stringent classification procedure, and by choosing optimal threshold settings and checking for active sites. Our analysis shows that our stringent protocol EnzDP can cover up to 90% of enzyme families available in Swiss-Prot. It achieves a high accuracy of 94.5% based on five-fold cross-validation. EnzDP outperforms existing methods across several testing scenarios. Thus, EnzDP serves as a reliable automated tool for enzyme annotation and metabolic network reconstruction. Available at: www.comp.nus.edu.sg/~nguyennn/EnzDP . PMID:26542446

  13. A Machine Learning Based Analytical Framework for Semantic Annotation Requirements

    Directory of Open Access Journals (Sweden)

    Hamed Hassanzadeh

    2011-04-01

    Full Text Available The Semantic Web is an extension of the current web in which information is given well-defined meaning.The perspective of Semantic Web is to promote the quality and intelligence of the current web by changingits contents into machine understandable form. Therefore, semantic level information is one of thecornerstones of the Semantic Web. The process of adding semantic metadata to web resources is calledSemantic Annotation. There are many obstacles against the Semantic Annotation, such as multilinguality,scalability, and issues which are related to diversity and inconsistency in content of different web pages.Due to the wide range of domains and the dynamic environments that the Semantic Annotation systemsmust be performed on, the problem of automating annotation process is one of the significant challenges inthis domain. To overcome this problem, different machine learning approaches such as supervisedlearning, unsupervised learning and more recent ones like, semi-supervised learning and active learninghave been utilized. In this paper we present an inclusive layered classification of Semantic Annotationchallenges and discuss the most important issues in this field. Also, we review and analyze machinelearning applications for solving semantic annotation problems. For this goal, the article tries to closelystudy and categorize related researches for better understanding and to reach a framework that can mapmachine learning techniques into the Semantic Annotation challenges and requirements.

  14. In depth annotation of the Anopheles gambiae mosquito midgut transcriptome

    OpenAIRE

    Padrón, Alejandro; Molina-Cruz, Alvaro; Quinones, Mariam; Ribeiro, José MC; Ramphul, Urvashi; Rodrigues, Janneth; Shen, Kui; Haile, Ashley; Ramirez, José Luis; Barillas-Mury, Carolina

    2014-01-01

    Background Genome sequencing of Anopheles gambiae was completed more than ten years ago and has accelerated research on malaria transmission. However, annotation needs to be refined and verified experimentally, as most predicted transcripts have been identified by comparative analysis with genomes from other species. The mosquito midgut—the first organ to interact with Plasmodium parasites—mounts effective antiplasmodial responses that limit parasite survival and disease transmission. High-th...

  15. Complete genome sequence of an attenuated Sparfloxacin-resistant Streptococcus agalactiae strain 138spar

    Science.gov (United States)

    The complete genome of a sparfloxacin-resistant Streptococcus agalactiae vaccine strain 138spar is 1,838,126 bp in size. The genome has 1892 coding sequences and 82 RNAs. The annotation of the genome is added by the NCBI Prokaryotic Genome Annotation Pipeline. The publishing of this genome will allo...

  16. Using computational predictions to improve literature-based Gene Ontology annotations: a feasibility study.

    Science.gov (United States)

    Costanzo, Maria C; Park, Julie; Balakrishnan, Rama; Cherry, J Michael; Hong, Eurie L

    2011-01-01

    Annotation using Gene Ontology (GO) terms is one of the most important ways in which biological information about specific gene products can be expressed in a searchable, computable form that may be compared across genomes and organisms. Because literature-based GO annotations are often used to propagate functional predictions between related proteins, their accuracy is critically important. We present a strategy that employs a comparison of literature-based annotations with computational predictions to identify and prioritize genes whose annotations need review. Using this method, we show that comparison of manually assigned 'unknown' annotations in the Saccharomyces Genome Database (SGD) with InterPro-based predictions can identify annotations that need to be updated. A survey of literature-based annotations and computational predictions made by the Gene Ontology Annotation (GOA) project at the European Bioinformatics Institute (EBI) across several other databases shows that this comparison strategy could be used to maintain and improve the quality of GO annotations for other organisms besides yeast. The survey also shows that although GOA-assigned predictions are the most comprehensive source of functional information for many genomes, a large proportion of genes in a variety of different organisms entirely lack these predictions but do have manual annotations. This underscores the critical need for manually performed, literature-based curation to provide functional information about genes that are outside the scope of widely used computational methods. Thus, the combination of manual and computational methods is essential to provide the most accurate and complete functional annotation of a genome. Database URL: http://www.yeastgenome.org. PMID:21411447

  17. A molecular scanner to automate proteomic research and to display proteome images.

    Science.gov (United States)

    Binz, P A; Müller, M; Walther, D; Bienvenut, W V; Gras, R; Hoogland, C; Bouchet, G; Gasteiger, E; Fabbretti, R; Gay, S; Palagi, P; Wilkins, M R; Rouge, V; Tonella, L; Paesano, S; Rossellat, G; Karmime, A; Bairoch, A; Sanchez, J C; Appel, R D; Hochstrasser, D F

    1999-11-01

    Identification and characterization of all proteins expressed by a genome in biological samples represent major challenges in proteomics. Today's commonly used high-throughput approaches combine two-dimensional electrophoresis (2-DE) with peptide mass fingerprinting (PMF) analysis. Although automation is often possible, a number of limitations still adversely affect the rate of protein identification and annotation in 2-DE databases: the sequential excision process of pieces of gel containing protein; the enzymatic digestion step; the interpretation of mass spectra (reliability of identifications); and the manual updating of 2-DE databases. We present a highly automated method that generates a fully annoated 2-DE map. Using a parallel process, all proteins of a 2-DE are first simultaneously digested proteolytically and electro-transferred onto a poly(vinylidene difluoride) membrane. The membrane is then directly scanned by MALDI-TOF MS. After automated protein identification from the obtained peptide mass fingerprints using PeptIdent software (http://www.expasy.ch/tools/peptident.html + ++), a fully annotated 2-D map is created on-line. It is a multidimensional representation of a proteome that contains interpreted PMF data in addition to protein identification results. This "MS-imaging" method represents a major step toward the development of a clinical molecular scanner. PMID:10565287

  18. Semantic annotation of biological concepts interplaying microbial cellular responses

    Directory of Open Access Journals (Sweden)

    Carreira Rafael

    2011-11-01

    Full Text Available Abstract Background Automated extraction systems have become a time saving necessity in Systems Biology. Considerable human effort is needed to model, analyse and simulate biological networks. Thus, one of the challenges posed to Biomedical Text Mining tools is that of learning to recognise a wide variety of biological concepts with different functional roles to assist in these processes. Results Here, we present a novel corpus concerning the integrated cellular responses to nutrient starvation in the model-organism Escherichia coli. Our corpus is a unique resource in that it annotates biomedical concepts that play a functional role in expression, regulation and metabolism. Namely, it includes annotations for genetic information carriers (genes and DNA, RNA molecules, proteins (transcription factors, enzymes and transporters, small metabolites, physiological states and laboratory techniques. The corpus consists of 130 full-text papers with a total of 59043 annotations for 3649 different biomedical concepts; the two dominant classes are genes (highest number of unique concepts and compounds (most frequently annotated concepts, whereas other important cellular concepts such as proteins account for no more than 10% of the annotated concepts. Conclusions To the best of our knowledge, a corpus that details such a wide range of biological concepts has never been presented to the text mining community. The inter-annotator agreement statistics provide evidence of the importance of a consolidated background when dealing with such complex descriptions, the ambiguities naturally arising from the terminology and their impact for modelling purposes. Availability is granted for the full-text corpora of 130 freely accessible documents, the annotation scheme and the annotation guidelines. Also, we include a corpus of 340 abstracts.

  19. Snap: An Integrated SNP Annotation Platform

    DEFF Research Database (Denmark)

    Li, S.; Ma, L.; Li, H.;

    2007-01-01

    Snap (Single Nucleotide Polymorphism Annotation Platform) is a server designed to comprehensively analyze single genes and relationships between genes basing on SNPs in the human genome. The aim of the platform is to facilitate the study of SNP finding and analysis within the framework of medical...... research. Using a user-friendly web interface, genes can be searched by name, description, position, SNP ID or clone name. Several public databases are integrated, including gene information from Ensembl, protein features from Uniprot/SWISS-PROT, Pfam and DAS-CBS. Gene relationships are fetched from BIND...

  20. The UCSC Genome Browser Database

    DEFF Research Database (Denmark)

    Karolchik, D; Kuhn, R M; Baertsch, R;

    2008-01-01

    The University of California, Santa Cruz, Genome Browser Database (GBD) provides integrated sequence and annotation data for a large collection of vertebrate and model organism genomes. Seventeen new assemblies have been added to the database in the past year, for a total coverage of 19 vertebrate...... and 21 invertebrate species as of September 2007. For each assembly, the GBD contains a collection of annotation data aligned to the genomic sequence. Highlights of this year's additions include a 28-species human-based vertebrate conservation annotation, an enhanced UCSC Genes set, and more human...... variation, MGC, and ENCODE data. The database is optimized for fast interactive performance with a set of web-based tools that may be used to view, manipulate, filter and download the annotation data. New toolset features include the Genome Graphs tool for displaying genome-wide data sets, session saving...

  1. EST-PAC a web package for EST annotation and protein sequence prediction

    Directory of Open Access Journals (Sweden)

    Strahm Yvan

    2006-10-01

    Full Text Available Abstract With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequences tags (EST from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC a web oriented multi-platform software package for expressed sequences tag (EST annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: 1 searching local or remote biological databases for sequence similarities using Blast services, 2 predicting protein coding sequence from EST data and, 3 annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results and, management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics.

  2. High-recovery visual identification and single-cell retrieval of circulating tumor cells for genomic analysis using a dual-technology platform integrated with automated immunofluorescence staining

    International Nuclear Information System (INIS)

    Circulating tumor cells (CTCs) are malignant cells that have migrated from solid cancers into the blood, where they are typically present in rare numbers. There is great interest in using CTCs to monitor response to therapies, to identify clinically actionable biomarkers, and to provide a non-invasive window on the molecular state of a tumor. Here we characterize the performance of the AccuCyte® – CyteFinder® system, a comprehensive, reproducible and highly sensitive platform for collecting, identifying and retrieving individual CTCs from microscopic slides for molecular analysis after automated immunofluorescence staining for epithelial markers. All experiments employed a density-based cell separation apparatus (AccuCyte) to separate nucleated cells from the blood and transfer them to microscopic slides. After staining, the slides were imaged using a digital scanning microscope (CyteFinder). Precisely counted model CTCs (mCTCs) from four cancer cell lines were spiked into whole blood to determine recovery rates. Individual mCTCs were removed from slides using a single-cell retrieval device (CytePicker™) for whole genome amplification and subsequent analysis by PCR and Sanger sequencing, whole exome sequencing, or array-based comparative genomic hybridization. Clinical CTCs were evaluated in blood samples from patients with different cancers in comparison with the CellSearch® system. AccuCyte – CyteFinder presented high-resolution images that allowed identification of mCTCs by morphologic and phenotypic features. Spike-in mCTC recoveries were between 90 and 91%. More than 80% of single-digit spike-in mCTCs were identified and even a single cell in 7.5 mL could be found. Analysis of single SKBR3 mCTCs identified presence of a known TP53 mutation by both PCR and whole exome sequencing, and confirmed the reported karyotype of this cell line. Patient sample CTC counts matched or exceeded CellSearch CTC counts in a small feasibility cohort. The AccuCyte

  3. Genephony: a knowledge management tool for genome-wide research

    Directory of Open Access Journals (Sweden)

    Riva Alberto

    2009-09-01

    Full Text Available Abstract Background One of the consequences of the rapid and widespread adoption of high-throughput experimental technologies is an exponential increase of the amount of data produced by genome-wide experiments. Researchers increasingly need to handle very large volumes of heterogeneous data, including both the data generated by their own experiments and the data retrieved from publicly available repositories of genomic knowledge. Integration, exploration, manipulation and interpretation of data and information therefore need to become as automated as possible, since their scale and breadth are, in general, beyond the limits of what individual researchers and the basic data management tools in normal use can handle. This paper describes Genephony, a tool we are developing to address these challenges. Results We describe how Genephony can be used to manage large datesets of genomic information, integrating them with existing knowledge repositories. We illustrate its functionalities with an example of a complex annotation task, in which a set of SNPs coming from a genotyping experiment is annotated with genes known to be associated to a phenotype of interest. We show how, thanks to the modular architecture of Genephony and its user-friendly interface, this task can be performed in a few simple steps. Conclusion Genephony is an online tool for the manipulation of large datasets of genomic information. It can be used as a browser for genomic data, as a high-throughput annotation tool, and as a knowledge discovery tool. It is designed to be easy to use, flexible and extensible. Its knowledge management engine provides fine-grained control over individual data elements, as well as efficient operations on large datasets.

  4. The Human Genome Browser at UCSC

    OpenAIRE

    Kent, W James; Sugnet, Charles W.; Furey, Terrence S.; Roskin, Krishna M; Pringle, Tom H.; Zahler, Alan M.; Haussler, and David

    2002-01-01

    As vertebrate genome sequences near completion and research refocuses to their analysis, the issue of effective genome annotation display becomes critical. A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu. This browser displays assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies,...

  5. Predicting word sense annotation agreement

    DEFF Research Database (Denmark)

    Martinez Alonso, Hector; Johannsen, Anders Trærup; Lopez de Lacalle, Oier;

    2015-01-01

    High agreement is a common objective when annotating data for word senses. However, a number of factors make perfect agreement impossible, e.g. the limitations of the sense inventories, the difficulty of the examples or the interpretation preferences of the annotations. Estimating potential...... agreement is thus a relevant task to supplement the evaluation of sense annotations. In this article we propose two methods to predict agreement on word-annotation instances. We experiment with a continuous representation and a three-way discretization of observed agreement. In spite of the difficulty of...

  6. Single-cell PCR of genomic DNA enabled by automated single-cell printing for cell isolation.

    Science.gov (United States)

    Stumpf, F; Schoendube, J; Gross, A; Rath, C; Niekrawietz, S; Koltay, P; Roth, G

    2015-07-15

    Single-cell analysis has developed into a key topic in cell biology with future applications in personalized medicine, tumor identification as well as tumor discovery (Editorial, 2013). Here we employ inkjet-like printing to isolate individual living single human B cells (Raji cell line) and load them directly into standard PCR tubes. Single cells are optically detected in the nozzle of the microfluidic piezoelectric dispenser chip to ensure printing of droplets with single cells only. The printing process has been characterized by using microbeads (10µm diameter) resulting in a single bead delivery in 27 out of 28 cases and relative positional precision of ±350µm at a printing distance of 6mm between nozzle and tube lid. Process-integrated optical imaging enabled to identify the printing failure as void droplet and to exclude it from downstream processing. PCR of truly single-cell DNA was performed without pre-amplification directly from single Raji cells with 33% success rate (N=197) and Cq values of 36.3±2.5. Additionally single cell whole genome amplification (WGA) was employed to pre-amplify the single-cell DNA by a factor of >1000. This facilitated subsequent PCR for the same gene yielding a success rate of 64% (N=33) which will allow more sophisticated downstream analysis like sequencing, electrophoresis or multiplexing. PMID:25771302

  7. RNAspace.org: An integrated environment for the prediction, annotation, and analysis of ncRNA

    OpenAIRE

    Cros, Marie-Josee; de Monte, Antoine; Mariette, Jérôme; Bardou, Philippe; Grenier-Boley, Benjamin; Gautheret, Daniel

    2011-01-01

    The annotation of noncoding RNA genes remains a major bottleneck in genome sequencing projects. Most genome sequences released today still come with sets of tRNAs and rRNAs as the only annotated RNA elements, ignoring hundreds of other RNA families. We have developed a web environment that is dedicated to noncoding RNA (ncRNA) prediction, annotation, and analysis and allows users to run a variety of tools in an integrated and flexible manner. This environment offers complementary ncRNA gene f...

  8. HBVRegDB: Annotation, comparison, detection and visualization of regulatory elements in hepatitis B virus sequences

    Directory of Open Access Journals (Sweden)

    Firth Andrew E

    2007-12-01

    Full Text Available Abstract Background The many Hepadnaviridae sequences available have widely varied functional annotation. The genomes are very compact (~3.2 kb but contain multiple layers of functional regulatory elements in addition to coding regions. Key regions are subject to purifying selection, as mutations in these regions will produce non-functional viruses. Results These genomic sequences have been organized into a structured database to facilitate research at the molecular level. HBVRegDB is a comparative genomic analysis tool with an integrated underlying sequence database. The database contains genomic sequence data from representative viruses. In addition to INSDC and RefSeq annotation, HBVRegDB also contains expert and systematically calculated annotations (e.g. promoters and comparative genome analysis results (e.g. blastn, tblastx. It also contains analyses based on curated HBV alignments. Information about conserved regions – including primary conservation (e.g. CDS-Plotcon and RNA secondary structure predictions (e.g. Alidot – is integrated into the database. A large amount of data is graphically presented using the GBrowse (Generic Genome Browser adapted for analysis of viral genomes. Flexible query access is provided based on any annotated genomic feature. Novel regulatory motifs can be found by analysing the annotated sequences. Conclusion HBVRegDB serves as a knowledge database and as a comparative genomic analysis tool for molecular biologists investigating HBV. It is publicly available and complementary to other viral and HBV focused datasets and tools http://hbvregdb.otago.ac.nz. The availability of multiple and highly annotated sequences of viral genomes in one database combined with comparative analysis tools facilitates detection of novel genomic elements.

  9. Stochastic Coordinate Coding and Its Application for Drosophila Gene Expression Pattern Annotation

    OpenAIRE

    Lin, Binbin; Li, Qingyang; Sun, Qian; Lai, Ming-Jun; Davidson, Ian; Fan, Wei (David); Ye, Jieping

    2014-01-01

    \\textit{Drosophila melanogaster} has been established as a model organism for investigating the fundamental principles of developmental gene interactions. The gene expression patterns of \\textit{Drosophila melanogaster} can be documented as digital images, which are annotated with anatomical ontology terms to facilitate pattern discovery and comparison. The automated annotation of gene expression pattern images has received increasing attention due to the recent expansion of the image databas...

  10. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA

    OpenAIRE

    Camon Evelyn B; Barrell Daniel G; Dimmer Emily C; Lee Vivian; Magrane Michele; Maslen John; Binns David; Apweiler Rolf

    2005-01-01

    Abstract Background The Gene Ontology Annotation (GOA) database http://www.ebi.ac.uk/GOA aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and of proteins requiring characterization increases, the manual processing capability can become overloaded. Consequently, semi-automated aids are o...

  11. Proteomic Detection of Non-Annotated Protein-Coding Genes in Pseudomonas fluorescens Pf0-1

    Energy Technology Data Exchange (ETDEWEB)

    Kim, Wook; Silby, Mark W.; Purvine, Samuel O.; Nicoll, Julie S.; Hixson, Kim K.; Monroe, Matthew E.; Nicora, Carrie D.; Lipton, Mary S.; Levy, Stuart B.

    2009-12-24

    Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of (possible) functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations which predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes which were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologues in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.

  12. Collective dynamics of social annotation

    CERN Document Server

    Cattuto, Ciro; Baldassarri, Andrea; Schehr, G; Loreto, Vittorio

    2009-01-01

    The enormous increase of popularity and use of the WWW has led in the recent years to important changes in the ways people communicate. An interesting example of this fact is provided by the now very popular social annotation systems, through which users annotate resources (such as web pages or digital photographs) with text keywords dubbed tags. Understanding the rich emerging structures resulting from the uncoordinated actions of users calls for an interdisciplinary effort. In particular concepts borrowed from statistical physics, such as random walks, and the complex networks framework, can effectively contribute to the mathematical modeling of social annotation systems. Here we show that the process of social annotation can be seen as a collective but uncoordinated exploration of an underlying semantic space, pictured as a graph, through a series of random walks. This modeling framework reproduces several aspects, so far unexplained, of social annotation, among which the peculiar growth of the size of the...

  13. Annotation of Regular Polysemy

    DEFF Research Database (Denmark)

    Martinez Alonso, Hector

    Regular polysemy has received a lot of attention from the theory of lexical semantics and from computational linguistics. However, there is no consensus on how to represent the sense of underspecified examples at the token level, namely when annotating or disambiguating senses of metonymic words...... like “London” (Location/Organization) or “cup” (Container/Content). The goal of this dissertation is to assess whether metonymic sense underspecification justifies incorporating a third sense into our sense inventories, thereby treating the underspecified sense as independent from the literal and...... metonymic. We have conducted an analysis in English, Danish and Spanish. Later on, we have tried to replicate the human judgments by means of unsupervised and semi-supervised sense prediction. The automatic sense-prediction systems have been unable to find empiric evidence for the underspecified sense, even...

  14. Algal Functional Annotation Tool: a web-based analysis suite to functionally interpret large gene lists using integrated annotation and expression data

    Directory of Open Access Journals (Sweden)

    Merchant Sabeeha S

    2011-07-01

    Full Text Available Abstract Background Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. Description The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of

  15. The UCSC genome browser database

    DEFF Research Database (Denmark)

    Kuhn, R M; Karolchik, D; Zweig, A S;

    2007-01-01

    The University of California, Santa Cruz Genome Browser Database contains, as of September 2006, sequence and annotation data for the genomes of 13 vertebrate and 19 invertebrate species. The Genome Browser displays a wide variety of annotations at all scales from the single nucleotide level up...... to a full chromosome and includes assembly data, genes and gene predictions, mRNA and EST alignments, and comparative genomics, regulation, expression and variation data. The database is optimized for fast interactive performance with web tools that provide powerful visualization and querying capabilities......; an expanded SNP annotation track; and many new display options. The Genome Browser, other tools, downloadable data files and links to documentation and other information can be found at http://genome.ucsc.edu/....

  16. Assembly and annotation of a non-model gastropod (Nerita melanotragus) transcriptome: a comparison of De novo assemblers

    OpenAIRE

    Amin, Shorash; Prentis, Peter J.; Gilding, Edward K.; Pavasovic, Ana

    2014-01-01

    Background The sequencing, de novo assembly and annotation of transcriptome datasets generated with next generation sequencing (NGS) has enabled biologists to answer genomic questions in non-model species with unprecedented ease. Reliable and accurate de novo assembly and annotation of transcriptomes, however, is a critically important step for transcriptome assemblies generated from short read sequences. Typical benchmarks for assembly and annotation reliability have been performed with mode...

  17. Complete genome sequence of a virulent Streptococcus agalactiae strain 138P isolated from disease Nile tilapia

    Science.gov (United States)

    The complete genome of a virulent Streptococcus agalactiae strain 138P is 1838701 bp in size, containing 1831 genes. The genome has 1593 coding sequences, 152 pseudo genes, 16 rRNAs, 69 tRNAs, and 1 non-coding RNA. The annotation of the genome is added by the NCBI Prokaryotic Genome Annotation Pipel...

  18. Draft Genome Sequence of Lactobacillus rhamnosus 2166.

    OpenAIRE

    Karlyshev, Andrey V.; Melnikov, Vyacheslav G.; Kosarev, Igor V.; Abramov, Vyacheslav M.

    2014-01-01

    In this report, we present a draft sequence of the genome of Lactobacillus rhamnosus strain 2166, a potential novel probiotic. Genome annotation and read mapping onto a reference genome of L. rhamnosus strain GG allowed for the identification of the differences and similarities in the genomic contents and gene arrangements of these strains.

  19. ProteinSplit: splitting of multi-domain proteins using prediction of ordered and disordered regions in protein sequences for virtual structural genomics

    International Nuclear Information System (INIS)

    The annotation of protein folds within newly sequenced genomes is the main target for semi-automated protein structure prediction (virtual structural genomics). A large number of automated methods have been developed recently with very good results in the case of single-domain proteins. Unfortunately, most of these automated methods often fail to properly predict the distant homology between a given multi-domain protein query and structural templates. Therefore a multi-domain protein should be split into domains in order to overcome this limitation. ProteinSplit is designed to identify protein domain boundaries using a novel algorithm that predicts disordered regions in protein sequences. The software utilizes various sequence characteristics to assess the local propensity of a protein to be disordered or ordered in terms of local structure stability. These disordered parts of a protein are likely to create interdomain spacers. Because of its speed and portability, the method was successfully applied to several genome-wide fold annotation experiments. The user can run an automated analysis of sets of proteins or perform semi-automated multiple user projects (saving the results on the server). Additionally the sequences of predicted domains can be sent to the Bioinfo.PL Protein Structure Prediction Meta-Server for further protein three-dimensional structure and function prediction. The program is freely accessible as a web service at http://lucjan.bioinfo.pl/proteinsplit together with detailed benchmark results on the critical assessment of a fully automated structure prediction (CAFASP) set of sequences. The source code of the local version of protein domain boundary prediction is available upon request from the authors

  20. Pairagon+N-SCAN_EST: a model-based gene annotation pipeline

    DEFF Research Database (Denmark)

    Arumugam, Manimozhiyan; Wei, Chaochun; Brown, Randall H;

    2006-01-01

    This paper describes Pairagon+N-SCAN_EST, a gene annotation pipeline that uses only native alignments. For each expressed sequence it chooses the best genomic alignment. Systems like ENSEMBL and ExoGean rely on trans alignments, in which expressed sequences are aligned to the genomic loci of puta...

  1. APPRIS: annotation of principal and alternative splice isoforms.

    Science.gov (United States)

    Rodriguez, Jose Manuel; Maietta, Paolo; Ezkurdia, Iakes; Pietrelli, Alessandro; Wesselink, Jan-Jaap; Lopez, Gonzalo; Valencia, Alfonso; Tress, Michael L

    2013-01-01

    Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform. PMID:23161672

  2. The Saccharomyces Genome Database Variant Viewer

    OpenAIRE

    Sheppard, Travis K.; Gabdank, Idan; Engel, Stacia R.; Song, Giltae; Balakrishnan, Rama; Binkley, Gail; Costanzo, Maria C.; Dalusag, Kyla S.; Demeter, Janos; Hellerstedt, Sage T.; Karra, Kalpana; Nash, Robert S.; Paskov, Kelley M.; Skrzypek, Marek S.; Weng, Shuai

    2015-01-01

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source...

  3. BGD: A Database of Bat Genomes

    OpenAIRE

    Jianfei Fang; Xuan Wang; Shuo Mu; Shuyi Zhang; Dong Dong

    2015-01-01

    Bats account for ~20% of mammalian species, and are the only mammals with true powered flight. For the sake of their specialized phenotypic traits, many researches have been devoted to examine the evolution of bats. Until now, some whole genome sequences of bats have been assembled and annotated, however, a uniform resource for the annotated bat genomes is still unavailable. To make the extensive data associated with the bat genomes accessible to the general biological communities, we establi...

  4. INDIGO - INtegrated data warehouse of microbial genomes with examples from the red sea extremophiles.

    KAUST Repository

    Alam, Intikhab

    2013-12-06

    The next generation sequencing technologies substantially increased the throughput of microbial genome sequencing. To functionally annotate newly sequenced microbial genomes, a variety of experimental and computational methods are used. Integration of information from different sources is a powerful approach to enhance such annotation. Functional analysis of microbial genomes, necessary for downstream experiments, crucially depends on this annotation but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes.

  5. Phylogenetic molecular function annotation

    OpenAIRE

    Barbara E Engelhardt; Jordan, Michael I.; Repo, Susanna T; Brenner, Steven E

    2009-01-01

    It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic ...

  6. Multilingual Twitter Sentiment Classification: The Role of Human Annotators.

    Directory of Open Access Journals (Sweden)

    Igor Mozetič

    Full Text Available What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive as ordered.

  7. Multilingual Twitter Sentiment Classification: The Role of Human Annotators

    Science.gov (United States)

    Mozetič, Igor; Grčar, Miha; Smailović, Jasmina

    2016-01-01

    What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered. PMID:27149621

  8. Multilingual Twitter Sentiment Classification: The Role of Human Annotators.

    Science.gov (United States)

    Mozetič, Igor; Grčar, Miha; Smailović, Jasmina

    2016-01-01

    What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered. PMID:27149621

  9. Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium

    Energy Technology Data Exchange (ETDEWEB)

    Ansong, Charles; Tolic, Nikola; Purvine, Samuel O.; Porwollik, Steffen; Jones, Marcus B.; Yoon, Hyunjin; Payne, Samuel H.; Martin, Jessica L.; Burnet, Meagan C.; Monroe, Matthew E.; Venepally, Pratap; Smith, Richard D.; Peterson, Scott; Heffron, Fred; Mcclelland, Michael; Adkins, Joshua N.

    2011-08-25

    Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. For example systems biology-oriented genome scale modeling efforts greatly benefit from accurate annotation of protein-coding genes to develop proper functioning models. However, determining protein-coding genes for most new genomes is almost completely performed by inference, using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. With the ability to directly measure peptides arising from expressed proteins, mass spectrometry-based proteomics approaches can be used to augment and verify coding regions of a genomic sequence and importantly detect post-translational processing events. In this study we utilized “shotgun” proteomics to guide accurate primary genome annotation of the bacterial pathogen Salmonella Typhimurium 14028 to facilitate a systems-level understanding of Salmonella biology. The data provides protein-level experimental confirmation for 44% of predicted protein-coding genes, suggests revisions to 48 genes assigned incorrect translational start sites, and uncovers 13 non-annotated genes missed by gene prediction programs. We also present a comprehensive analysis of post-translational processing events in Salmonella, revealing a wide range of complex chemical modifications (70 distinct modifications) and confirming more than 130 signal peptide and N-terminal methionine cleavage events in Salmonella. This study highlights several ways in which proteomics data applied during the primary stages of annotation can improve the quality of genome annotations, especially with regards to the annotation of mature protein products.

  10. Statistical Challenges in Functional Genomics

    OpenAIRE

    Sebastiani, Paola; Gussoni, Emanuela; Kohane, Isaac S.; Ramoni, Marco F

    2003-01-01

    On February 12, 2001 the Human Genome Project announced the completion of a draft physical map of the human genome---the genetic blueprint for a human being. Now the challenge is to annotate this map by understanding the functions of genes and their interplay with proteins and the environment to create complex, dynamic living systems. This is the goal of functional genomics. Recent technological advances enable biomedical investigators to observe the genome of entire orga...

  11. Home Automation

    OpenAIRE

    Ahmed, Zeeshan

    2010-01-01

    In this paper I briefly discuss the importance of home automation system. Going in to the details I briefly present a real time designed and implemented software and hardware oriented house automation research project, capable of automating house's electricity and providing a security system to detect the presence of unexpected behavior.

  12. Learning a Hybrid Architecture for Sequence Regression and Annotation

    OpenAIRE

    Zhang, Yizhe; Henao, Ricardo; Carin, Lawrence; Zhong, Jianling; Hartemink, Alexander J.

    2015-01-01

    When learning a hidden Markov model (HMM), sequen- tial observations can often be complemented by real-valued summary response variables generated from the path of hid- den states. Such settings arise in numerous domains, includ- ing many applications in biology, like motif discovery and genome annotation. In this paper, we present a flexible frame- work for jointly modeling both latent sequence features and the functional mapping that relates the summary response variables to the hidden stat...

  13. Cephalopod genomics

    DEFF Research Database (Denmark)

    Albertin, Caroline B.; Bonnaud, Laure; Brown, C. Titus;

    2012-01-01

    The Cephalopod Sequencing Consortium (CephSeq Consortium) was established at a NESCent Catalysis Group Meeting, ``Paths to Cephalopod Genomics-Strategies, Choices, Organization,'' held in Durham, North Carolina, USA on May 24-27, 2012. Twenty-eight participants representing nine countries (Austria......, Australia, China, Denmark, France, Italy, Japan, Spain and the USA) met to address the pressing need for genome sequencing of cephalopod mollusks. This group, drawn from cephalopod biologists, neuroscientists, developmental and evolutionary biologists, materials scientists, bioinformaticians and researchers...... active in sequencing, assembling and annotating genomes, agreed on a set of cephalopod species of particular importance for initial sequencing and developed strategies and an organization (CephSeq Consortium) to promote this sequencing. The conclusions and recommendations of this meeting are described in...

  14. Bovine Genome Database: new tools for gleaning function from the Bos taurus genome.

    Science.gov (United States)

    Elsik, Christine G; Unni, Deepak R; Diesh, Colin M; Tayal, Aditi; Emery, Marianne L; Nguyen, Hung N; Hagen, Darren E

    2016-01-01

    We report an update of the Bovine Genome Database (BGD) (http://BovineGenome.org). The goal of BGD is to support bovine genomics research by providing genome annotation and data mining tools. We have developed new genome and annotation browsers using JBrowse and WebApollo for two Bos taurus genome assemblies, the reference genome assembly (UMD3.1.1) and the alternate genome assembly (Btau_4.6.1). Annotation tools have been customized to highlight priority genes for annotation, and to aid annotators in selecting gene evidence tracks from 91 tissue specific RNAseq datasets. We have also developed BovineMine, based on the InterMine data warehousing system, to integrate the bovine genome, annotation, QTL, SNP and expression data with external sources of orthology, gene ontology, gene interaction and pathway information. BovineMine provides powerful query building tools, as well as customized query templates, and allows users to analyze and download genome-wide datasets. With BovineMine, bovine researchers can use orthology to leverage the curated gene pathways of model organisms, such as human, mouse and rat. BovineMine will be especially useful for gene ontology and pathway analyses in conjunction with GWAS and QTL studies. PMID:26481361

  15. ASAP: Amplification, sequencing & annotation of plastomes

    Directory of Open Access Journals (Sweden)

    Folta Kevin M

    2005-12-01

    Full Text Available Abstract Background Availability of DNA sequence information is vital for pursuing structural, functional and comparative genomics studies in plastids. Traditionally, the first step in mining the valuable information within a chloroplast genome requires sequencing a chloroplast plasmid library or BAC clones. These activities involve complicated preparatory procedures like chloroplast DNA isolation or identification of the appropriate BAC clones to be sequenced. Rolling circle amplification (RCA is being used currently to amplify the chloroplast genome from purified chloroplast DNA and the resulting products are sheared and cloned prior to sequencing. Herein we present a universal high-throughput, rapid PCR-based technique to amplify, sequence and assemble plastid genome sequence from diverse species in a short time and at reasonable cost from total plant DNA, using the large inverted repeat region from strawberry and peach as proof of concept. The method exploits the highly conserved coding regions or intergenic regions of plastid genes. Using an informatics approach, chloroplast DNA sequence information from 5 available eudicot plastomes was aligned to identify the most conserved regions. Cognate primer pairs were then designed to generate ~1 – 1.2 kb overlapping amplicons from the inverted repeat region in 14 diverse genera. Results 100% coverage of the inverted repeat region was obtained from Arabidopsis, tobacco, orange, strawberry, peach, lettuce, tomato and Amaranthus. Over 80% coverage was obtained from distant species, including Ginkgo, loblolly pine and Equisetum. Sequence from the inverted repeat region of strawberry and peach plastome was obtained, annotated and analyzed. Additionally, a polymorphic region identified from gel electrophoresis was sequenced from tomato and Amaranthus. Sequence analysis revealed large deletions in these species relative to tobacco plastome thus exhibiting the utility of this method for structural and

  16. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets.

    OpenAIRE

    Burgun Anita; Galibert Marie-Dominique; de Tayrac Marie; Chicault Celine; Monnier Annabelle; Aubry Marc; Mosser Jean

    2006-01-01

    Abstract Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by tr...

  17. A clustering method for robust and reliable large scale functional and structural protein sequence annotation

    OpenAIRE

    Piovesan, Damiano

    2013-01-01

    Bioinformatics, in the last few decades, has played a fundamental role to give sense to the huge amount of data produced. Obtained the complete sequence of a genome, the major problem of knowing as much as possible of its coding regions, is crucial. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As it has been recently pointed out by the Critical Assessment of Function Annotations (CAFA), most accu...

  18. Using Semantic Annotation for Mining Privacy and Security Requirements from European Union Directives

    OpenAIRE

    Guarda, Paolo; Kiyavitskaya, Nadzeya; Zannone, Nicola

    2008-01-01

    The increasing complexity of software systems and growing demand for regulations compliance require effective methods and tools to support requirements analysts activities. In order to facilitate alignment of software system requirements and regulations, systematic methods and tools automating regulations analysis must be developed. This work explores applicability of the semantic annotation tool Cerno to mining of rights and obligations from European privacy directives.

  19. Genome organization of the SARS-CoV

    DEFF Research Database (Denmark)

    Xu, Jing; Hu, Jianfei; Wang, Jing;

    2003-01-01

    Annotation of the genome sequence of the SARS-CoV (severe acute respiratory syndrome-associated coronavirus) is indispensable to understand its evolution and pathogenesis. We have performed a full annotation of the SARS-CoV genome sequences by using annotation programs publicly available...... of known genes, revealing that the SARS-CoV genome is a small and compact one with overlapped coding regions. The most striking discovery is that an ORF locates on the minus strand. We have also annotated non-coding regions and identified the transcription regulating sequences (TRS) in the intergenic...

  20. Linguistic concepts described with Media Query Language for automated annotation

    OpenAIRE

    Lenkiewicz, A.; Lis, M; Lenkiewicz, P.

    2012-01-01

    Introduction Human spoken communication is multimodal, i.e. it encompasses both speech and gesture. Acoustic properties of voice, body movements, facial expression, etc. are an inherent and meaningful part of spoken interaction; they can provide attitudinal, grammatical and semantic information. In the recent years interest in audio-visual corpora has been rising rapidly as they enable investigation of different communicative modalities and provide more holistic view on communication (Kipp et...

  1. Collective dynamics of social annotation.

    Science.gov (United States)

    Cattuto, Ciro; Barrat, Alain; Baldassarri, Andrea; Schehr, Gregory; Loreto, Vittorio

    2009-06-30

    The enormous increase of popularity and use of the worldwide web has led in the recent years to important changes in the ways people communicate. An interesting example of this fact is provided by the now very popular social annotation systems, through which users annotate resources (such as web pages or digital photographs) with keywords known as "tags." Understanding the rich emergent structures resulting from the uncoordinated actions of users calls for an interdisciplinary effort. In particular concepts borrowed from statistical physics, such as random walks (RWs), and complex networks theory, can effectively contribute to the mathematical modeling of social annotation systems. Here, we show that the process of social annotation can be seen as a collective but uncoordinated exploration of an underlying semantic space, pictured as a graph, through a series of RWs. This modeling framework reproduces several aspects, thus far unexplained, of social annotation, among which are the peculiar growth of the size of the vocabulary used by the community and its complex network structure that represents an externalization of semantic structures grounded in cognition and that are typically hard to access. PMID:19506244

  2. SENTIMENT ANALYSIS OF DOCUMENT BASED ON ANNOTATION

    Directory of Open Access Journals (Sweden)

    Archana Shukla

    2011-11-01

    Full Text Available I present a tool which tells the quality of document or its usefulness based on annotations. Annotation mayinclude comments, notes, observation, highlights, underline, explanation, question or help etc. commentsare used for evaluative purpose while others are used for summarization or for expansion also. Furtherthese comments may be on another annotation. Such annotations are referred as meta-annotation. Allannotation may not get equal weightage. My tool considered highlights, underline as well as comments toinfer the collective sentiment of annotators. Collective sentiments of annotators are classified as positive,negative, objectivity. My tool computes collective sentiment of annotations in two manners. It counts all theannotation present on the documents as well as it also computes sentiment scores of all annotation whichincludes comments to obtain the collective sentiments about the document or to judge the quality ofdocument. I demonstrate the use of tool on research paper.

  3. Structured RNAs and synteny regions in the pig genome

    DEFF Research Database (Denmark)

    Anthon, Christian; Tafer, Hakim; Havgaard, Jakob Hull; Thomsen, Bo; Hedegaard, Jakob; Seemann, Ernst Stefan; Pundhir, Sachin; Kehr, Stephanie; Bartschat, Sebastian; Nielsen, Mathilde; Nielsen, Rasmus O.; Fredholm, Merete; Stadler, Peter F.; Gorodkin, Jan

    2014-01-01

    BACKGROUND: Annotating mammalian genomes for noncoding RNAs (ncRNAs) is nontrivial since far from all ncRNAs are known and the computational models are resource demanding. Currently, the human genome holds the best mammalian ncRNA annotation, a result of numerous efforts by several groups. Howeve...

  4. New in protein structure and function annotation: hotspots, single nucleotide polymorphisms and the 'Deep Web'.

    Science.gov (United States)

    Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard

    2009-05-01

    The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation. PMID:19396742

  5. An Annotation Scheme for Reichenbach's Verbal Tense Structure

    CERN Document Server

    Derczynski, Leon

    2012-01-01

    In this paper we present RTMML, a markup language for the tenses of verbs and temporal relations between verbs. There is a richness to tense in language that is not fully captured by existing temporal annotation schemata. Following Reichenbach we present an analysis of tense in terms of abstract time points, with the aim of supporting automated processing of tense and temporal relations in language. This allows for precise reasoning about tense in documents, and the deduction of temporal relations between the times and verbal events in a discourse. We define the syntax of RTMML, and demonstrate the markup in a range of situations.

  6. Generating Protocol Software from CPN Models Annotated with Pragmatics

    DEFF Research Database (Denmark)

    Simonsen, Kent Inge; Kristensen, Lars M.; Kindler, Ekkart

    2013-01-01

    verify protocol software, but limited work exists on using CPN models of protocols as a basis for automated code generation. The contribution of this paper is a method for generating protocol software from a class of CPN models annotated with code generation pragmatics. Our code generation method...... consists of three main steps: automatically adding so-called derived pragmatics to the CPN model, computing an abstract template tree, which associates pragmatics with code templates, and applying the templates to generate code which can then be compiled. We illustrate our method using a unidirectional...

  7. Generating Protocol Software from CPN Models Annotated with Pragmatics

    DEFF Research Database (Denmark)

    Simonsen, Kent Inge; Kristensen, Lars M.; Kindler, Ekkart

    Model-driven software engineering (MDSE) provides a foundation for automatically generating software based on models that focus on the problem domain while abstracting from the details of underlying implementation platforms. Coloured Petri Nets (CPNs) have been widely used to formally model and...... verify protocol software, but limited work exists on using CPN models of protocols as a basis for automated code generation. The contribution of this paper is a method for generating protocol software from a class of CPN models annotated with code generation pragmatics. Our code generation method...

  8. Systematic Functional Annotation and Visualization of Biological Networks.

    Science.gov (United States)

    Baryshnikova, Anastasia

    2016-06-22

    Large-scale biological networks represent relationships between genes, but our understanding of how networks are functionally organized is limited. Here, I describe spatial analysis of functional enrichment (SAFE), a systematic method for annotating biological networks and examining their functional organization. SAFE visualizes the network in 2D space and measures the continuous distribution of functional enrichment across local neighborhoods, producing a list of the associated functions and a map of their relative positioning. I applied SAFE to annotate the Saccharomyces cerevisiae genetic interaction similarity network and protein-protein interaction network with gene ontology terms. SAFE annotations of the genetic network matched manually derived annotations, while taking less than 1% of the time, and proved robust to noise and sensitive to biological signal. Integration of genetic interaction and chemical genomics data using SAFE revealed a link between vesicle-mediate transport and resistance to the anti-cancer drug bortezomib. These results demonstrate the utility of SAFE for examining biological networks and understanding their functional organization. PMID:27237738

  9. The Latest Buzz in Comparative Genomics

    OpenAIRE

    Kulathinal, Rob J.; Hartl, Daniel L.

    2005-01-01

    A second species of fruit fly has just been added to the growing list of organisms with complete and annotated genome sequences. The publication of the Drosophila pseudoobscura sequence provides a snapshot of how genomes have changed over tens of millions of years and sets the stage for the analysis of more fly genomes.

  10. The UCSC Genome Browser database: 2016 update

    OpenAIRE

    Speir, Matthew L; Zweig, Ann S.; Rosenbloom, Kate R.; Raney, Brian J.; Paten, Benedict; Nejad, Parisa; Rowe, Laurence D.; Learned, Katrina; Karolchik, Donna; Hinrichs, Angie S.; Heitner, Steve; Harte, Rachel A.; Haeussler, Maximilian; Guruvadoo, Luvina; Fujita, Pauline A.

    2015-01-01

    For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for vari...

  11. REEF: searching REgionally Enriched Features in genomes

    OpenAIRE

    Danieli Gian Antonio; Coppe Alessandro; Bortoluzzi Stefania

    2006-01-01

    Abstract Background In Eukaryotic genomes, different features including genes are not uniformly distributed. The integration of annotation information and genomic position of functional DNA elements in the Eukaryotic genomes opened the way to test novel hypotheses of higher order genome organization and regulation of expression. Results REEF is a new tool, aimed at identifying genomic regions enriched in specific features, such as a class or group of genes homogeneous for expression and/or fu...

  12. Systems Theory and Communication. Annotated Bibliography.

    Science.gov (United States)

    Covington, William G., Jr.

    This annotated bibliography presents annotations of 31 books and journal articles dealing with systems theory and its relation to organizational communication, marketing, information theory, and cybernetics. Materials were published between 1963 and 1992 and are listed alphabetically by author. (RS)

  13. International Standard for a Linguistic Annotation Framework

    CERN Document Server

    Romary, Laurent

    2004-01-01

    This paper describes the Linguistic Annotation Framework under development within ISO TC37 SC4 WG1. The Linguistic Annotation Framework is intended to serve as a basis for harmonizing existing language resources as well as developing new ones.

  14. Herbarium genomics

    DEFF Research Database (Denmark)

    Bakker, Freek T.; Lei, Di; Yu, Jiaying;

    2016-01-01

    Herbarium genomics is proving promising as next-generation sequencing approaches are well suited to deal with the usually fragmented nature of archival DNA. We show that routine assembly of partial plastome sequences from herbarium specimens is feasible, from total DNA extracts and with specimens...... up to 146 years old. We use genome skimming and an automated assembly pipeline, Iterative Organelle Genome Assembly, that assembles paired-end reads into a series of candidate assemblies, the best one of which is selected based on likelihood estimation. We used 93 specimens from 12 different...... correlation between plastome coverage and nuclear genome size (C value) in our samples, but the range of C values included is limited. Finally, we conclude that routine plastome sequencing from herbarium specimens is feasible and cost-effective (compared with Sanger sequencing or plastome...

  15. Annotated Bibliography, Grades K-6.

    Science.gov (United States)

    Massachusetts Dept. of Education, Boston. Bureau of Nutrition Education and School Food Services.

    This annotated bibliography on nutrition is for the use of teachers at the elementary grade level. It contains a list of books suitable for reading about nutrition and foods for pupils from kindergarten through the sixth grade. Films and audiovisual presentations for classroom use are also listed. The names and addresses from which these materials…

  16. Functional annotation of human cytomegalovirus gene products: an update

    Directory of Open Access Journals (Sweden)

    Ellen eVan Damme

    2014-05-01

    Full Text Available Human Cytomegalovirus is an opportunistic double-stranded DNA virus with one of the largest viral genomes known. The 235kB genome is divided in a unique long (UL and a unique short (US region which are flanked by terminal and internal repeats. The expression of HCMV genes is highly complex and involves the production of protein coding transcripts, polyadenylated long non-coding RNAs, polyadenylated anti-sense transcripts and a variety of non-polyadenylated RNAs such as microRNAs. Although the function of many of these transcripts is unknown, they are suggested play a direct or regulatory role in the delicately orchestrated processes that ensure HCMV replication and life-long persistence. This review focuses on annotating the complete viral genome based on three sources of information. First, previous reviews were used as a template for the functional keywords to ensure continuity; second, the Uniprot database was used to further enrich the functional database; and finally, the literature was manually curated for novel functions of HCMV gene products. Novel discoveries were discussed in light of the viral life cycle. This functional annotation highlights still poorly understood regions of the genome but most importantly it can give insight in functional clusters and/or may be helpful in the analysis of transcriptomics and proteomics studies.

  17. Annotating images by mining image search results

    NARCIS (Netherlands)

    X.J. Wang; L. Zhang; X. Li; W.Y. Ma

    2008-01-01

    Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results

  18. Eight questions about semantic web annotations

    OpenAIRE

    Euzenat, Jérôme

    2002-01-01

    Improving information retrieval is annotation¹s central goal. However, without sufficient planning, annotation - especially when running a robot and attaching automatically extracted content - risks producing incoherent information. The author recommends answering eight questions before you annotate. He provides a practical application of this approach, and discusses applying the questions to other systems.

  19. Annotation and Classification of Argumentative Writing Revisions

    Science.gov (United States)

    Zhang, Fan; Litman, Diane

    2015-01-01

    This paper explores the annotation and classification of students' revision behaviors in argumentative writing. A sentence-level revision schema is proposed to capture why and how students make revisions. Based on the proposed schema, a small corpus of student essays and revisions was annotated. Studies show that manual annotation is reliable with…

  20. Library Automation

    OpenAIRE

    Dhakne, B. N.; Giri, V. V; Waghmode, S. S.

    2010-01-01

    New technologies library provides several new materials, media and mode of storing and communicating the information. Library Automation reduces the drudgery of repeated manual efforts in library routine. By use of library automation collection, Storage, Administration, Processing, Preservation and communication etc.

  1. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    Directory of Open Access Journals (Sweden)

    Burgun Anita

    2006-05-01

    Full Text Available Abstract Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence or text-mining of the published scientific literature (literature profiling. Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2 and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells. Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions.

  2. VISPA: a computational pipeline for the identification and analysis of genomic vector integration sites.

    Science.gov (United States)

    Calabria, Andrea; Leo, Simone; Benedicenti, Fabrizio; Cesana, Daniela; Spinozzi, Giulio; Orsini, Massimilano; Merella, Stefania; Stupka, Elia; Zanetti, Gianluigi; Montini, Eugenio

    2014-01-01

    The analysis of the genomic distribution of viral vector genomic integration sites is a key step in hematopoietic stem cell-based gene therapy applications, allowing to assess both the safety and the efficacy of the treatment and to study the basic aspects of hematopoiesis and stem cell biology. Identifying vector integration sites requires ad-hoc bioinformatics tools with stringent requirements in terms of computational efficiency, flexibility, and usability. We developed VISPA (Vector Integration Site Parallel Analysis), a pipeline for automated integration site identification and annotation based on a distributed environment with a simple Galaxy web interface. VISPA was successfully used for the bioinformatics analysis of the follow-up of two lentiviral vector-based hematopoietic stem-cell gene therapy clinical trials. Our pipeline provides a reliable and efficient tool to assess the safety and efficacy of integrating vectors in clinical settings. PMID:25342980

  3. The UCSC Genome Browser Database: update 2006

    DEFF Research Database (Denmark)

    Hinrichs, A S; Karolchik, D; Baertsch, R;

    2006-01-01

    The University of California Santa Cruz Genome Browser Database (GBD) contains sequence and annotation data for the genomes of about a dozen vertebrate species and several major model organisms. Genome annotations typically include assembly data, sequence composition, genes and gene predictions, m......RNA and expressed sequence tag evidence, comparative genomics, regulation, expression and variation data. The database is optimized to support fast interactive performance with web tools that provide powerful visualization and querying capabilities for mining the data. The Genome Browser displays a wide variety...... of annotations at all scales from single nucleotide level up to a full chromosome. The Table Browser provides direct access to the database tables and sequence data, enabling complex queries on genome-wide datasets. The Proteome Browser graphically displays protein properties. The Gene Sorter allows filtering...

  4. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    Energy Technology Data Exchange (ETDEWEB)

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2008-10-27

    Hypothetical and conserved hypothetical genes account for>30percent of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved hypothetical (9.5percent) along with 887 hypothetical genes (24.4percent). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 hypothetical and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC-MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. 1212 of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.

  5. Integration of new alternative reference strain genome sequences into the Saccharomyces genome database.

    Science.gov (United States)

    Song, Giltae; Balakrishnan, Rama; Binkley, Gail; Costanzo, Maria C; Dalusag, Kyla; Demeter, Janos; Engel, Stacia; Hellerstedt, Sage T; Karra, Kalpana; Hitz, Benjamin C; Nash, Robert S; Paskov, Kelley; Sheppard, Travis; Skrzypek, Marek; Weng, Shuai; Wong, Edith; Michael Cherry, J

    2016-01-01

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. To provide a wider scope of genetic and phenotypic variation in yeast, the genome sequences and their corresponding annotations from 11 alternative S. cerevisiae reference strains have been integrated into SGD. Genomic and protein sequence information for genes from these strains are now available on the Sequence and Protein tab of the corresponding Locus Summary pages. We illustrate how these genome sequences can be utilized to aid our understanding of strain-specific functional and phenotypic differences.Database URL: www.yeastgenome.org. PMID:27252399

  6. Leveraging Comparative Genomics to Identify and Functionally Characterize Genes Associated with Sperm Phenotypes in Python bivittatus (Burmese Python)

    OpenAIRE

    Irizarry, Kristopher J. L.; Josep Rutllant

    2016-01-01

    Comparative genomics approaches provide a means of leveraging functional genomics information from a highly annotated model organism's genome (such as the mouse genome) in order to make physiological inferences about the role of genes and proteins in a less characterized organism's genome (such as the Burmese python). We employed a comparative genomics approach to produce the functional annotation of Python bivittatus genes encoding proteins associated with sperm phenotypes. We identify 129 g...

  7. High accuracy mass spectrometry analysis as a tool to verify and improve gene annotation using Mycobacterium tuberculosis as an example

    Directory of Open Access Journals (Sweden)

    Prasad Swati

    2008-07-01

    Full Text Available Abstract Background While the genomic annotations of diverse lineages of the Mycobacterium tuberculosis complex are available, divergences between gene prediction methods are still a challenge for unbiased protein dataset generation. M. tuberculosis gene annotation is an example, where the most used datasets from two independent institutions (Sanger Institute and Institute of Genomic Research-TIGR differ up to 12% in the number of annotated open reading frames, and 46% of the genes contained in both annotations have different start codons. Such differences emphasize the importance of the identification of the sequence of protein products to validate each gene annotation including its sequence coding area. Results With this objective, we submitted a culture filtrate sample from M. tuberculosis to a high-accuracy LTQ-Orbitrap mass spectrometer analysis and applied refined N-terminal prediction to perform comparison of two gene annotations. From a total of 449 proteins identified from the MS data, we validated 35 tryptic peptides that were specific to one of the two datasets, representing 24 different proteins. From those, 5 proteins were only annotated in the Sanger database. In the remaining proteins, the observed differences were due to differences in annotation of transcriptional start sites. Conclusion Our results indicate that, even in a less complex sample likely to represent only 10% of the bacterial proteome, we were still able to detect major differences between different gene annotation approaches. This gives hope that high-throughput proteomics techniques can be used to improve and validate gene annotations, and in particular for verification of high-throughput, automatic gene annotations.

  8. AIGO: Towards a unified framework for the Analysis and the Inter-comparison of GO functional annotations

    Directory of Open Access Journals (Sweden)

    Defoin-Platel Michael

    2011-11-01

    Full Text Available Abstract Background In response to the rapid growth of available genome sequences, efforts have been made to develop automatic inference methods to functionally characterize them. Pipelines that infer functional annotation are now routinely used to produce new annotations at a genome scale and for a broad variety of species. These pipelines differ widely in their inference algorithms, confidence thresholds and data sources for reasoning. This heterogeneity makes a comparison of the relative merits of each approach extremely complex. The evaluation of the quality of the resultant annotations is also challenging given there is often no existing gold-standard against which to evaluate precision and recall. Results In this paper, we present a pragmatic approach to the study of functional annotations. An ensemble of 12 metrics, describing various aspects of functional annotations, is defined and implemented in a unified framework, which facilitates their systematic analysis and inter-comparison. The use of this framework is demonstrated on three illustrative examples: analysing the outputs of state-of-the-art inference pipelines, comparing electronic versus manual annotation methods, and monitoring the evolution of publicly available functional annotations. The framework is part of the AIGO library (http://code.google.com/p/aigo for the Analysis and the Inter-comparison of the products of Gene Ontology (GO annotation pipelines. The AIGO library also provides functionalities to easily load, analyse, manipulate and compare functional annotations and also to plot and export the results of the analysis in various formats. Conclusions This work is a step toward developing a unified framework for the systematic study of GO functional annotations. This framework has been designed so that new metrics on GO functional annotations can be added in a very straightforward way.

  9. The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4).

    Science.gov (United States)

    Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Tennessen, Kristin; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

    2016-01-01

    The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provided via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation is followed by functional annotation including assignment of protein product names and connection to various protein family databases. PMID:26918089

  10. Process automation

    International Nuclear Information System (INIS)

    Process automation technology has been pursued in the chemical processing industries and to a very limited extent in nuclear fuel reprocessing. Its effective use has been restricted in the past by the lack of diverse and reliable process instrumentation and the unavailability of sophisticated software designed for process control. The Integrated Equipment Test (IET) facility was developed by the Consolidated Fuel Reprocessing Program (CFRP) in part to demonstrate new concepts for control of advanced nuclear fuel reprocessing plants. A demonstration of fuel reprocessing equipment automation using advanced instrumentation and a modern, microprocessor-based control system is nearing completion in the facility. This facility provides for the synergistic testing of all chemical process features of a prototypical fuel reprocessing plant that can be attained with unirradiated uranium-bearing feed materials. The unique equipment and mission of the IET facility make it an ideal test bed for automation studies. This effort will provide for the demonstration of the plant automation concept and for the development of techniques for similar applications in a full-scale plant. A set of preliminary recommendations for implementing process automation has been compiled. Some of these concepts are not generally recognized or accepted. The automation work now under way in the IET facility should be useful to others in helping avoid costly mistakes because of the underutilization or misapplication of process automation. 6 figs

  11. Computational annotation of genes differentially expressed along olive fruit development

    Directory of Open Access Journals (Sweden)

    Martinelli Federico

    2009-10-01

    used to query all known KEGG (Kyoto Encyclopaedia of Genes and Genomes metabolic pathways for characterizing and positioning retrieved EST records. The integration of the olive sequence datasets within the MapMan platform for microarray analysis allowed the identification of specific biosynthetic pathways useful for the definition of key functional categories in time course analyses for gene groups. Conclusion The bioinformatic annotation of all gene sequences was useful to shed light on metabolic pathways and transcriptional aspects related to carbohydrates, fatty acids, secondary metabolites, transcription factors and hormones as well as response to biotic and abiotic stresses throughout olive drupe development. These results represent a first step toward both functional genomics and systems biology research for understanding the gene functions and regulatory networks in olive fruit growth and ripening.

  12. Figure 2 from Integrative Genomics Viewer: Visualizing Big Data | Office of Cancer Genomics

    Science.gov (United States)

    Grouping and sorting genomic data in IGV. The IGV user interface displaying 202 glioblastoma samples from TCGA. Samples are grouped by tumor subtype (second annotation column) and data type (first annotation column) and sorted by copy number of the EGFR locus (middle column). Adapted from Figure 1; Robinson et al. 2011

  13. Werkzeuge zur Annotation diachroner Korpora

    OpenAIRE

    Burghardt, Manuel; Wolff, Christian

    2009-01-01

    Wir diskutieren zunächst die Problematik der (syntaktischen) Annotation diachroner Korpora und stellen anschließend eine Evaluationsstudie vor, bei der mehr als 50 Annotationswerkzeuge und -frameworks vor dem Hintergrund eines funktionalen und software-ergonomischen Anforderungsprofils nach dem Qualitätsmodell von ISO/IEC 9126-1:2001 (Software engineering – Product quality – Part 1: Quality model) und ISO/IEC 25000:2005 (Software Engineering – Software product Quality Requirements and Evaluat...

  14. Explaining microbial phenotypes on a genomic scale: GWAS for microbes

    OpenAIRE

    Dutilh, Bas E; Backus, Lennart; Edwards, Robert A.; Wels, Michiel; Bayjanov, Jumamurat R.; van Hijum, Sacha A. F. T.

    2013-01-01

    There is an increasing availability of complete or draft genome sequences for microbial organisms. These data form a potentially valuable resource for genotype–phenotype association and gene function prediction, provided that phenotypes are consistently annotated for all the sequenced strains. In this review, we address the requirements for successful gene-trait matching. We outline a basic protocol for microbial functional genomics, including genome assembly, annotation of genotypes (includi...

  15. MSeqDR: A Centralized Knowledge Repository and Bioinformatics Web Resource to Facilitate Genomic Investigations in Mitochondrial Disease.

    Science.gov (United States)

    Shen, Lishuang; Diroma, Maria Angela; Gonzalez, Michael; Navarro-Gomez, Daniel; Leipzig, Jeremy; Lott, Marie T; van Oven, Mannis; Wallace, Douglas C; Muraresku, Colleen Clarke; Zolkipli-Cunningham, Zarazuela; Chinnery, Patrick F; Attimonelli, Marcella; Zuchner, Stephan; Falk, Marni J; Gai, Xiaowu

    2016-06-01

    MSeqDR is the Mitochondrial Disease Sequence Data Resource, a centralized and comprehensive genome and phenome bioinformatics resource built by the mitochondrial disease community to facilitate clinical diagnosis and research investigations of individual patient phenotypes, genomes, genes, and variants. A central Web portal (https://mseqdr.org) integrates community knowledge from expert-curated databases with genomic and phenotype data shared by clinicians and researchers. MSeqDR also functions as a centralized application server for Web-based tools to analyze data across both mitochondrial and nuclear DNA, including investigator-driven whole exome or genome dataset analyses through MSeqDR-Genesis. MSeqDR-GBrowse genome browser supports interactive genomic data exploration and visualization with custom tracks relevant to mtDNA variation and mitochondrial disease. MSeqDR-LSDB is a locus-specific database that currently manages 178 mitochondrial diseases, 1,363 genes associated with mitochondrial biology or disease, and 3,711 pathogenic variants in those genes. MSeqDR Disease Portal allows hierarchical tree-style disease exploration to evaluate their unique descriptions, phenotypes, and causative variants. Automated genomic data submission tools are provided that capture ClinVar compliant variant annotations. PhenoTips will be used for phenotypic data submission on deidentified patients using human phenotype ontology terminology. The development of a dynamic informed patient consent process to guide data access is underway to realize the full potential of these resources. PMID:26919060

  16. ASGARD: an open-access database of annotated transcriptomes for emerging model arthropod species

    OpenAIRE

    Extavour, Cassandra G.; Zeng, Victor

    2012-01-01

    The increased throughput and decreased cost of next-generation sequencing (NGS) have shifted the bottleneck genomic research from sequencing to annotation, analysis and accessibility. This is particularly challenging for research communities working on organisms that lack the basic infrastructure of a sequenced genome, or an efficient way to utilize whatever sequence data may be available. Here we present a new database, the Assembled Searchable Giant Arthropod Read Database (ASGARD). This da...

  17. Lightweight genome viewer: portable software for browsing genomics data in its chromosomal context

    OpenAIRE

    Gardner Timothy S; Olson Andrew J; Faith Jeremiah J; Sachidanandam Ravi

    2007-01-01

    Abstract Background Lightweight genome viewer (lwgv) is a web-based tool for visualization of sequence annotations in their chromosomal context. It performs most of the functions of larger genome browsers, while relying on standard flat-file formats and bypassing the database needs of most visualization tools. Visualization as an aide to discovery requires display of novel data in conjunction with static annotations in their chromosomal context. With database-based systems, displaying dynamic...

  18. GDR (Genome Database for Rosaceae): integrated web-database for Rosaceae genomics and genetics data

    OpenAIRE

    Jung, Sook; Staton, Margaret; Lee, Taein; Blenda, Anna; Svancara, Randall; Abbott, Albert; Main, Dorrie

    2007-01-01

    The Genome Database for Rosaceae (GDR) is a central repository of curated and integrated genetics and genomics data of Rosaceae, an economically important family which includes apple, cherry, peach, pear, raspberry, rose and strawberry. GDR contains annotated databases of all publicly available Rosaceae ESTs, the genetically anchored peach physical map, Rosaceae genetic maps and comprehensively annotated markers and traits. The ESTs are assembled to produce unigene sets of each genus and the ...

  19. A computational approach for the annotation of hydrogen-bonded base interactions in crystallographic structures of the ribozymes

    Energy Technology Data Exchange (ETDEWEB)

    Hamdani, Hazrina Yusof, E-mail: hazrina@mfrlab.org [School of Biosciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi (Malaysia); Advanced Medical and Dental Institute, Universiti Sains Malaysia, Bertam, Kepala Batas (Malaysia); Artymiuk, Peter J., E-mail: p.artymiuk@sheffield.ac.uk [Dept. of Molecular Biology and Biotechnology, Firth Court, University of Sheffield, S10 T2N Sheffield (United Kingdom); Firdaus-Raih, Mohd, E-mail: firdaus@mfrlab.org [School of Biosciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi (Malaysia)

    2015-09-25

    A fundamental understanding of the atomic level interactions in ribonucleic acid (RNA) and how they contribute towards RNA architecture is an important knowledge platform to develop through the discovery of motifs from simple arrangements base pairs, to more complex arrangements such as triples and larger patterns involving non-standard interactions. The network of hydrogen bond interactions is important in connecting bases to form potential tertiary motifs. Therefore, there is an urgent need for the development of automated methods for annotating RNA 3D structures based on hydrogen bond interactions. COnnection tables Graphs for Nucleic ACids (COGNAC) is automated annotation system using graph theoretical approaches that has been developed for the identification of RNA 3D motifs. This program searches for patterns in the unbroken networks of hydrogen bonds for RNA structures and capable of annotating base pairs and higher-order base interactions, which ranges from triples to sextuples. COGNAC was able to discover 22 out of 32 quadruples occurrences of the Haloarcula marismortui large ribosomal subunit (PDB ID: 1FFK) and two out of three occurrences of quintuple interaction reported by the non-canonical interactions in RNA (NCIR) database. These and several other interactions of interest will be discussed in this paper. These examples demonstrate that the COGNAC program can serve as an automated annotation system that can be used to annotate conserved base-base interactions and could be added as additional information to established RNA secondary structure prediction methods.

  20. A computational approach for the annotation of hydrogen-bonded base interactions in crystallographic structures of the ribozymes

    Science.gov (United States)

    Hamdani, Hazrina Yusof; Artymiuk, Peter J.; Firdaus-Raih, Mohd

    2015-09-01

    A fundamental understanding of the atomic level interactions in ribonucleic acid (RNA) and how they contribute towards RNA architecture is an important knowledge platform to develop through the discovery of motifs from simple arrangements base pairs, to more complex arrangements such as triples and larger patterns involving non-standard interactions. The network of hydrogen bond interactions is important in connecting bases to form potential tertiary motifs. Therefore, there is an urgent need for the development of automated methods for annotating RNA 3D structures based on hydrogen bond interactions. COnnection tables Graphs for Nucleic ACids (COGNAC) is automated annotation system using graph theoretical approaches that has been developed for the identification of RNA 3D motifs. This program searches for patterns in the unbroken networks of hydrogen bonds for RNA structures and capable of annotating base pairs and higher-order base interactions, which ranges from triples to sextuples. COGNAC was able to discover 22 out of 32 quadruples occurrences of the Haloarcula marismortui large ribosomal subunit (PDB ID: 1FFK) and two out of three occurrences of quintuple interaction reported by the non-canonical interactions in RNA (NCIR) database. These and several other interactions of interest will be discussed in this paper. These examples demonstrate that the COGNAC program can serve as an automated annotation system that can be used to annotate conserved base-base interactions and could be added as additional information to established RNA secondary structure prediction methods.

  1. A computational approach for the annotation of hydrogen-bonded base interactions in crystallographic structures of the ribozymes

    International Nuclear Information System (INIS)

    A fundamental understanding of the atomic level interactions in ribonucleic acid (RNA) and how they contribute towards RNA architecture is an important knowledge platform to develop through the discovery of motifs from simple arrangements base pairs, to more complex arrangements such as triples and larger patterns involving non-standard interactions. The network of hydrogen bond interactions is important in connecting bases to form potential tertiary motifs. Therefore, there is an urgent need for the development of automated methods for annotating RNA 3D structures based on hydrogen bond interactions. COnnection tables Graphs for Nucleic ACids (COGNAC) is automated annotation system using graph theoretical approaches that has been developed for the identification of RNA 3D motifs. This program searches for patterns in the unbroken networks of hydrogen bonds for RNA structures and capable of annotating base pairs and higher-order base interactions, which ranges from triples to sextuples. COGNAC was able to discover 22 out of 32 quadruples occurrences of the Haloarcula marismortui large ribosomal subunit (PDB ID: 1FFK) and two out of three occurrences of quintuple interaction reported by the non-canonical interactions in RNA (NCIR) database. These and several other interactions of interest will be discussed in this paper. These examples demonstrate that the COGNAC program can serve as an automated annotation system that can be used to annotate conserved base-base interactions and could be added as additional information to established RNA secondary structure prediction methods

  2. PLAN: a web platform for automating high-throughput BLAST searches and for managing and mining results

    Directory of Open Access Journals (Sweden)

    Zhao Xuechun

    2007-02-01

    Full Text Available Abstract Background BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Results Personal BLAST Navigator (PLAN is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1 query and target sequence database management, (2 automated high-throughput BLAST searching, (3 indexing and searching of results, (4 filtering results online, (5 managing results of personal interest in favorite categories, (6 automated sequence annotation (such as NCBI NR and ontology-based annotation. PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc. with a greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. Conclusion PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results

  3. Studying Oogenesis in a Non-model Organism Using Transcriptomics: Assembling, Annotating, and Analyzing Your Data.

    Science.gov (United States)

    Carter, Jean-Michel; Gibbs, Melanie; Breuker, Casper J

    2016-01-01

    This chapter provides a guide to processing and analyzing RNA-Seq data in a non-model organism. This approach was implemented for studying oogenesis in the Speckled Wood Butterfly Pararge aegeria. We focus in particular on how to perform a more informative primary annotation of your non-model organism by implementing our multi-BLAST annotation strategy. We also provide a general guide to other essential steps in the next-generation sequencing analysis workflow. Before undertaking these methods, we recommend you familiarize yourself with command line usage and fundamental concepts of database handling. Most of the operations in the primary annotation pipeline can be performed in Galaxy (or equivalent standalone versions of the tools) and through the use of common database operations (e.g. to remove duplicates) but other equivalent programs and/or custom scripts can be implemented for further automation. PMID:27557578

  4. A community-curated consensual annotation that is continuously updated: the Bacillus subtilis centred wiki SubtiWiki.

    OpenAIRE

    Flórez, Lope A.; Roppel, Sebastian F.; Schmeisky, Arne G.; Lammers, Christoph R.; Stülke, Jörg

    2009-01-01

    Bacillus subtilis is the model organism for Gram-positive bacteria, with a large amount of publications on all aspects of its biology. To facilitate genome annotation and the collection of comprehensive information on B. subtilis, we created SubtiWiki as a community-oriented annotation tool for information retrieval and continuous maintenance. The wiki is focused on the needs and requirements of scientists doing experimental work. This has implications for the design of the interface and for ...

  5. Aquatic models, genomics and chemical risk management.

    Science.gov (United States)

    Cheng, Keith C; Hinton, David E; Mattingly, Carolyn J; Planchart, Antonio

    2012-01-01

    The 5th Aquatic Animal Models for Human Disease meeting follows four previous meetings (Nairn et al., 2001; Schmale, 2004; Schmale et al., 2007; Hinton et al., 2009) in which advances in aquatic animal models for human disease research were reported, and community discussion of future direction was pursued. At this meeting, discussion at a workshop entitled Bioinformatics and Computational Biology with Web-based Resources (20 September 2010) led to an important conclusion: Aquatic model research using feral and experimental fish, in combination with web-based access to annotated anatomical atlases and toxicological databases, yields data that advance our understanding of human gene function, and can be used to facilitate environmental management and drug development. We propose here that the effects of genes and environment are best appreciated within an anatomical context - the specifically affected cells and organs in the whole animal. We envision the use of automated, whole-animal imaging at cellular resolution and computational morphometry facilitated by high-performance computing and automated entry into toxicological databases, as anchors for genetic and toxicological data, and as connectors between human and model system data. These principles should be applied to both laboratory and feral fish populations, which have been virtually irreplaceable sentinals for environmental contamination that results in human morbidity and mortality. We conclude that automation, database generation, and web-based accessibility, facilitated by genomic/transcriptomic data and high-performance and cloud computing, will potentiate the unique and potentially key roles that aquatic models play in advancing systems biology, drug development, and environmental risk management. PMID:21763781

  6. Omics data management and annotation.

    Science.gov (United States)

    Harel, Arye; Dalah, Irina; Pietrokovski, Shmuel; Safran, Marilyn; Lancet, Doron

    2011-01-01

    Technological Omics breakthroughs, including next generation sequencing, bring avalanches of data which need to undergo effective data management to ensure integrity, security, and maximal knowledge-gleaning. Data management system requirements include flexible input formats, diverse data entry mechanisms and views, user friendliness, attention to standards, hardware and software platform definition, as well as robustness. Relevant solutions elaborated by the scientific community include Laboratory Information Management Systems (LIMS) and standardization protocols facilitating data sharing and managing. In project planning, special consideration has to be made when choosing relevant Omics annotation sources, since many of them overlap and require sophisticated integration heuristics. The data modeling step defines and categorizes the data into objects (e.g., genes, articles, disorders) and creates an application flow. A data storage/warehouse mechanism must be selected, such as file-based systems and relational databases, the latter typically used for larger projects. Omics project life cycle considerations must include the definition and deployment of new versions, incorporating either full or partial updates. Finally, quality assurance (QA) procedures must validate data and feature integrity, as well as system performance expectations. We illustrate these data management principles with examples from the life cycle of the GeneCards Omics project (http://www.genecards.org), a comprehensive, widely used compendium of annotative information about human genes. For example, the GeneCards infrastructure has recently been changed from text files to a relational database, enabling better organization and views of the growing data. Omics data handling benefits from the wealth of Web-based information, the vast amount of public domain software, increasingly affordable hardware, and effective use of data management and annotation principles as outlined in this chapter

  7. Vespucci: a system for building annotated databases of nascent transcripts.

    Science.gov (United States)

    Allison, Karmel A; Kaikkonen, Minna U; Gaasterland, Terry; Glass, Christopher K

    2014-02-01

    Global run-on sequencing (GRO-seq) is a recent addition to the series of high-throughput sequencing methods that enables new insights into transcriptional dynamics within a cell. However, GRO-sequencing presents new algorithmic challenges, as existing analysis platforms for ChIP-seq and RNA-seq do not address the unique problem of identifying transcriptional units de novo from short reads located all across the genome. Here, we present a novel algorithm for de novo transcript identification from GRO-sequencing data, along with a system that determines transcript regions, stores them in a relational database and associates them with known reference annotations. We use this method to analyze GRO-sequencing data from primary mouse macrophages and derive novel quantitative insights into the extent and characteristics of non-coding transcription in mammalian cells. In doing so, we demonstrate that Vespucci expands existing annotations for mRNAs and lincRNAs by defining the primary transcript beyond the polyadenylation site. In addition, Vespucci generates assemblies for un-annotated non-coding RNAs such as those transcribed from enhancer-like elements. Vespucci thereby provides a robust system for defining, storing and analyzing diverse classes of primary RNA transcripts that are of increasing biological interest. PMID:24304890

  8. Project Aloha:indexing, highlighting and annotation

    OpenAIRE

    Fallahkhair, Sanaz; Kennedy, Ian

    2010-01-01

    Lifelong learning requires many skills that are often not taught or are poorly taught. Such skills include speed reading, critical analysis, creative thinking, active reading and even a “little” skill like annotation. There are many ways that readers annotate. A short classification of some ways that reader may annotate includes underlining, using coloured highlighters, interlinear notes, marginal notes, and disassociated notes. This paper presents an investigation into the use of a tool for ...

  9. Automatic Multilevel Medical Image Annotation and Retrieval

    OpenAIRE

    Mueen, A.; Zainuddin, R.; Baba, M. Sapiyan

    2007-01-01

    Image retrieval at the semantic level mostly depends on image annotation or image classification. Image annotation performance largely depends on three issues: (1) automatic image feature extraction; (2) a semantic image concept modeling; (3) algorithm for semantic image annotation. To address first issue, multilevel features are extracted to construct the feature vector, which represents the contents of the image. To address second issue, domain-dependent concept hierarchy is constructed for...

  10. Knowledge Annotation maknig implicit knowledge explicit

    CERN Document Server

    Dingli, Alexiei

    2011-01-01

    Did you ever read something on a book, felt the need to comment, took up a pencil and scribbled something on the books' text'? If you did, you just annotated a book. But that process has now become something fundamental and revolutionary in these days of computing. Annotation is all about adding further information to text, pictures, movies and even to physical objects. In practice, anything which can be identified either virtually or physically can be annotated. In this book, we will delve into what makes annotations, and analyse their significance for the future evolutions of the web. We wil

  11. On the total number of genes and their length distribution in complete microbial genomes

    DEFF Research Database (Denmark)

    Skovgaard, Marie; Jensen, L.J.; Brunak, Søren; Ussery, David; Krogh, Anders Stærmose

    2001-01-01

    In sequenced microbial genomes, some of the annotated genes are actually not protein-coding genes, but rather open reading frames that occur by chance. Therefore, the number of annotated genes is higher than the actual number of genes for most of these microbes. Comparison of the length distribut......In sequenced microbial genomes, some of the annotated genes are actually not protein-coding genes, but rather open reading frames that occur by chance. Therefore, the number of annotated genes is higher than the actual number of genes for most of these microbes. Comparison of the length...... distribution of the annotated genes with the length distribution of those matching a known protein reveals that too many short genes are annotated in many genomes. Here we estimate the true number of protein-coding genes for sequenced genomes. Although it is often claimed that Escherichia coli has about 4300...

  12. Estimating the annotation error rate of curated GO database sequence annotations

    Directory of Open Access Journals (Sweden)

    Brown Alfred L

    2007-05-01

    Full Text Available Abstract Background Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO sequence database (GOSeqLite. This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences. Results We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006 at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS had an estimated error rate of 49%. Conclusion While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information.

  13. Deep annotation of Populus trichocarpa microRNAs from diverse tissue sets.

    Directory of Open Access Journals (Sweden)

    Joshua R Puzey

    Full Text Available Populus trichocarpa is an important woody model organism whose entire genome has been sequenced. This resource has facilitated the annotation of microRNAs (miRNAs, which are short non-coding RNAs with critical regulatory functions. However, despite their developmental importance, P. trichocarpa miRNAs have yet to be annotated from numerous important tissues. Here we significantly expand the breadth of tissue sampling and sequencing depth for miRNA annotation in P. trichocarpa using high-throughput smallRNA (sRNA sequencing. miRNA annotation was performed using three individual next-generation sRNA sequencing runs from separate leaves, xylem, and mechanically treated xylem, as well as a fourth run using a pooled sample containing vegetative apices, male flowers, female flowers, female apical buds, and male apical and lateral buds. A total of 276 miRNAs were identified from these datasets, including 155 previously unannotated miRNAs, most of which are P. trichocarpa specific. Importantly, we identified several xylem-enriched miRNAs predicted to target genes known to be important in secondary growth, including the critical reaction wood enzyme xyloglucan endo-transglycosylase/hydrolase and vascular-related transcription factors. This study provides a thorough genome-wide annotation of miRNAs in P. trichocarpa through deep sRNA sequencing from diverse tissue sets. Our data significantly expands the P. trichocarpa miRNA repertoire, which will facilitate a broad range of research in this major model system.

  14. Contig sequences and their annotation (amino acid sequence and results of homology search), and expression profile - Dicty_cDB | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available Dicty_cDB Contig sequences and their annotation (amino acid sequence and results of homology search), and ex...pression profile Data detail Data name Contig sequences and their annotation (amino acid sequence and result... sequences of cDNA sequences of Dictyostelium discoideum and their annotation (amino acid sequence and resul...ence and full-length cDNA sequence by the assembly program Phrap ( http://www.phrap.org/index.html ). Link to the... list of clones constituting the contig, the information on its mapping to the genome mapped to genome sequence and the

  15. BioSAVE: Display of scored annotation within a sequence context

    Directory of Open Access Journals (Sweden)

    Adryan Boris

    2008-03-01

    Full Text Available Abstract Background Visualization of sequence annotation is a common feature in many bioinformatics tools. For many applications it is desirable to restrict the display of such annotation according to a score cutoff, as biological interpretation can be difficult in the presence of the entire data. Unfortunately, many visualisation solutions are somewhat static in the way they handle such score cutoffs. Results We present BioSAVE, a sequence annotation viewer with on-the-fly selection of visualisation thresholds for each feature. BioSAVE is a versatile OS X program for visual display of scored features (annotation within a sequence context. The program reads sequence and additional supplementary annotation data (e.g., position weight matrix matches, conservation scores, structural domains from a variety of commonly used file formats and displays them graphically. Onscreen controls then allow for live customisation of these graphics, including on-the-fly selection of visualisation thresholds for each feature. Conclusion Possible applications of the program include display of transcription factor binding sites in a genomic context or the visualisation of structural domain assignments in protein sequences and many more. The dynamic visualisation of these annotations is useful, e.g., for the determination of cutoff values of predicted features to match experimental data. Program, source code and exemplary files are freely available at the BioSAVE homepage.

  16. PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data

    Directory of Open Access Journals (Sweden)

    Steven N. Hart

    2015-05-01

    Full Text Available Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advances far beyond expectations and now we are able to generate data faster than it can be interpreted. Materials and Methods. We have developed PANDA (Pathway AND Annotation Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations. Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators. Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user’s own data. Another use-case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma. Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with context of biological pathways. The application can be downloaded or used directly from our website: http://bioinformaticstools.mayo.edu/research/panda-viewer/.

  17. SpikeGUI: Software for Rapid Interictal Discharge Annotation via Template Matching and Online Machine Learning

    OpenAIRE

    Jin, Jing; Dauwels, Justin; Cash, Sydney; Westover, M. Brandon

    2014-01-01

    Detection of interictal discharges is a key element of interpreting EEGs during the diagnosis and management of epilepsy. Because interpretation of clinical EEG data is time-intensive and reliant on experts who are in short supply, there is a great need for automated spike detectors. However, attempts to develop general-purpose spike detectors have so far been severely limited by a lack of expert-annotated data. Huge databases of interictal discharges are therefore in great demand for the dev...

  18. GPA: A Statistical Approach to Prioritizing GWAS Results by Integrating Pleiotropy and Annotation

    Science.gov (United States)

    Li, Cong; Gelernter, Joel; Zhao, Hongyu

    2014-01-01

    Results from Genome-Wide Association Studies (GWAS) have shown that complex diseases are often affected by many genetic variants with small or moderate effects. Identifications of these risk variants remain a very challenging problem. There is a need to develop more powerful statistical methods to leverage available information to improve upon traditional approaches that focus on a single GWAS dataset without incorporating additional data. In this paper, we propose a novel statistical approach, GPA (Genetic analysis incorporating Pleiotropy and Annotation), to increase statistical power to identify risk variants through joint analysis of multiple GWAS data sets and annotation information because: (1) accumulating evidence suggests that different complex diseases share common risk bases, i.e., pleiotropy; and (2) functionally annotated variants have been consistently demonstrated to be enriched among GWAS hits. GPA can integrate multiple GWAS datasets and functional annotations to seek association signals, and it can also perform hypothesis testing to test the presence of pleiotropy and enrichment of functional annotation. Statistical inference of the model parameters and SNP ranking is achieved through an EM algorithm that can handle genome-wide markers efficiently. When we applied GPA to jointly analyze five psychiatric disorders with annotation information, not only did GPA identify many weak signals missed by the traditional single phenotype analysis, but it also revealed relationships in the genetic architecture of these disorders. Using our hypothesis testing framework, statistically significant pleiotropic effects were detected among these psychiatric disorders, and the markers annotated in the central nervous system genes and eQTLs from the Genotype-Tissue Expression (GTEx) database were significantly enriched. We also applied GPA to a bladder cancer GWAS data set with the ENCODE DNase-seq data from 125 cell lines. GPA was able to detect cell lines that are

  19. Using Semantic Web Technologies to Annotate and Align Microarray Designs

    Science.gov (United States)

    Szpakowski, Sebastian; McCusker, James; Krauthammer, Michael

    2009-01-01

    In this paper, we annotate and align two different gene expression microarray designs using the Genomic ELement Ontology (GELO). GELO is a new ontology that leverages an existing community resource, Sequence Ontology (SO), to create views of genomically-aligned data in a semantic web environment. We start the process by mapping array probes to genomic coordinates. The coordinates represent an implicit link between the probes and multiple genomic elements, such as genes, transcripts, miRNA, and repetitive elements, which are represented using concepts in SO. We then use the RDF Query Language (SPARQL) to create explicit links between the probes and the elements. We show how the approach allows us to easily determine the element coverage and genomic overlap of the two array designs. We believe that the method will ultimately be useful for integration of cancer data across multiple omic studies. The ontology and other materials described in this paper are available at http://krauthammerlab.med.yale.edu/wiki/Gelo. PMID:24904201

  20. Using Semantic Web Technologies to Annotate and Align Microarray Designs

    Directory of Open Access Journals (Sweden)

    Sebastian Szpakowski

    2009-05-01

    Full Text Available In this paper, we annotate and align two different gene expression microarray designs using the Genomic ELement Ontology (GELO. GELO is a new ontology that leverages an existing community resource, Sequence Ontology (SO, to create views of genomically-aligned data in a semantic web environment. We start the process by mapping array probes to genomic coordinates. The coordinates represent an implicit link between the probes and multiple genomic elements, such as genes, transcripts, miRNA, and repetitive elements, which are represented using concepts in SO. We then use the RDF Query Language (SPARQL to create explicit links between the probes and the elements. We show how the approach allows us to easily determine the element coverage and genomic overlap of the two array designs. We believe that the method will ultimately be useful for integration of cancer data across multiple omic studies. The ontology and other materials described in this paper are available at http://krauthammerlab.med.yale.edu/wiki/Gelo.

  1. The UCSC Genome Browser database: 2016 update.

    Science.gov (United States)

    Speir, Matthew L; Zweig, Ann S; Rosenbloom, Kate R; Raney, Brian J; Paten, Benedict; Nejad, Parisa; Lee, Brian T; Learned, Katrina; Karolchik, Donna; Hinrichs, Angie S; Heitner, Steve; Harte, Rachel A; Haeussler, Maximilian; Guruvadoo, Luvina; Fujita, Pauline A; Eisenhart, Christopher; Diekhans, Mark; Clawson, Hiram; Casper, Jonathan; Barber, Galt P; Haussler, David; Kuhn, Robert M; Kent, W James

    2016-01-01

    For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for various assemblies, including bonobo and zebrafish; new gene annotation sets; improvements to track and assembly hub support; and a new interactive tool, the "Data Integrator", for intersecting data from multiple tracks. We have greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment. PMID:26590259

  2. MannDB – A microbial database of automated protein sequence analyses and evidence integration for protein characterization

    Directory of Open Access Journals (Sweden)

    Kuczmarski Thomas A

    2006-10-01

    Full Text Available Abstract Background MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. Description MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. Conclusion MannDB comprises a large number of genomes and comprehensive protein

  3. Genome-wide comparative analysis reveals possible common ancestors of nucleotide-binding sites domain containing genes in hybrid Citrus sinensis genome and original Citrus clementina genome

    Science.gov (United States)

    We identified and re-annotated candidate disease resistance (R) genes with nucleotide-binding sites (NBS) domain from a Citrus clementina genome and two complete Citrus sinensis genome sequences (one from the USA and one from China). We found similar numbers of NBS genes from three citrus genomes, r...

  4. Comparative Genome Viewer

    International Nuclear Information System (INIS)

    The amount of information about genomes, both in the form of complete sequences and annotations, has been exponentially increasing in the last few years. As a result there is the need for tools providing a graphical representation of such information that should be comprehensive and intuitive. Visual representation is especially important in the comparative genomics field since it should provide a combined view of data belonging to different genomes. We believe that existing tools are limited in this respect as they focus on a single genome at a time (conservation histograms) or compress alignment representation to a single dimension. We have therefore developed a web-based tool called Comparative Genome Viewer (Cgv): it integrates a bidimensional representation of alignments between two regions, both at small and big scales, with the richness of annotations present in other genome browsers. We give access to our system through a web-based interface that provides the user with an interactive representation that can be updated in real time using the mouse to move from region to region and to zoom in on interesting details.

  5. ePIANNO: ePIgenomics ANNOtation tool.

    Science.gov (United States)

    Liu, Chia-Hsin; Ho, Bing-Ching; Chen, Chun-Ling; Chang, Ya-Hsuan; Hsu, Yi-Chiung; Li, Yu-Cheng; Yuan, Shin-Sheng; Huang, Yi-Huan; Chang, Chi-Sheng; Li, Ker-Chau; Chen, Hsuan-Yu

    2016-01-01

    Recently, with the development of next generation sequencing (NGS), the combination of chromatin immunoprecipitation (ChIP) and NGS, namely ChIP-seq, has become a powerful technique to capture potential genomic binding sites of regulatory factors, histone modifications and chromatin accessible regions. For most researchers, additional information including genomic variations on the TF binding site, allele frequency of variation between different populations, variation associated disease, and other neighbour TF binding sites are essential to generate a proper hypothesis or a meaningful conclusion. Many ChIP-seq datasets had been deposited on the public domain to help researchers make new discoveries. However, researches are often intimidated by the complexity of data structure and largeness of data volume. Such information would be more useful if they could be combined or downloaded with ChIP-seq data. To meet such demands, we built a webtool: ePIgenomic ANNOtation tool (ePIANNO, http://epianno.stat.sinica.edu.tw/index.html). ePIANNO is a web server that combines SNP information of populations (1000 Genomes Project) and gene-disease association information of GWAS (NHGRI) with ChIP-seq (hmChIP, ENCODE, and ROADMAP epigenomics) data. ePIANNO has a user-friendly website interface allowing researchers to explore, navigate, and extract data quickly. We use two examples to demonstrate how users could use functions of ePIANNO webserver to explore useful information about TF related genomic variants. Users could use our query functions to search target regions, transcription factors, or annotations. ePIANNO may help users to generate hypothesis or explore potential biological functions for their studies. PMID:26859295

  6. ePIANNO: ePIgenomics ANNOtation tool.

    Directory of Open Access Journals (Sweden)

    Chia-Hsin Liu

    Full Text Available Recently, with the development of next generation sequencing (NGS, the combination of chromatin immunoprecipitation (ChIP and NGS, namely ChIP-seq, has become a powerful technique to capture potential genomic binding sites of regulatory factors, histone modifications and chromatin accessible regions. For most researchers, additional information including genomic variations on the TF binding site, allele frequency of variation between different populations, variation associated disease, and other neighbour TF binding sites are essential to generate a proper hypothesis or a meaningful conclusion. Many ChIP-seq datasets had been deposited on the public domain to help researchers make new discoveries. However, researches are often intimidated by the complexity of data structure and largeness of data volume. Such information would be more useful if they could be combined or downloaded with ChIP-seq data. To meet such demands, we built a webtool: ePIgenomic ANNOtation tool (ePIANNO, http://epianno.stat.sinica.edu.tw/index.html. ePIANNO is a web server that combines SNP information of populations (1000 Genomes Project and gene-disease association information of GWAS (NHGRI with ChIP-seq (hmChIP, ENCODE, and ROADMAP epigenomics data. ePIANNO has a user-friendly website interface allowing researchers to explore, navigate, and extract data quickly. We use two examples to demonstrate how users could use functions of ePIANNO webserver to explore useful information about TF related genomic variants. Users could use our query functions to search target regions, transcription factors, or annotations. ePIANNO may help users to generate hypothesis or explore potential biological functions for their studies.

  7. Un service Web pour l'annotation sémantique de données biomédicales avec des ontologies

    OpenAIRE

    Jonquet, Clement; Shah, Nigam; Musen, Mark,

    2009-01-01

    The range of publicly available biomedical data is enormous and is expanding fast. This expansion means that researchers now face a hurdle to extracting the data they need from the large numbers of data that are available. Biomedical researchers have turned to ontologies and terminologies to structure and annotate their data with ontology concepts for better search and retrieval. However, this annotation process cannot be easily automated and often requires expert curators. Plus, there is a l...

  8. Gene coexpression network analysis as a source of functional annotation for rice genes.

    Directory of Open Access Journals (Sweden)

    Kevin L Childs

    Full Text Available With the existence of large publicly available plant gene expression data sets, many groups have undertaken data analyses to construct gene coexpression networks and functionally annotate genes. Often, a large compendium of unrelated or condition-independent expression data is used to construct gene networks. Condition-dependent expression experiments consisting of well-defined conditions/treatments have also been used to create coexpression networks to help examine particular biological processes. Gene networks derived from either condition-dependent or condition-independent data can be difficult to interpret if a large number of genes and connections are present. However, algorithms exist to identify modules of highly connected and biologically relevant genes within coexpression networks. In this study, we have used publicly available rice (Oryza sativa gene expression data to create gene coexpression networks using both condition-dependent and condition-independent data and have identified gene modules within these networks using the Weighted Gene Coexpression Network Analysis method. We compared the number of genes assigned to modules and the biological interpretability of gene coexpression modules to assess the utility of condition-dependent and condition-independent gene coexpression networks. For the purpose of providing functional annotation to rice genes, we found that gene modules identified by coexpression analysis of condition-dependent gene expression experiments to be more useful than gene modules identified by analysis of a condition-independent data set. We have incorporated our results into the MSU Rice Genome Annotation Project database as additional expression-based annotation for 13,537 genes, 2,980 of which lack a functional annotation description. These results provide two new types of functional annotation for our database. Genes in modules are now associated with groups of genes that constitute a collective functional

  9. SpikeGUI: software for rapid interictal discharge annotation via template matching and online machine learning.

    Science.gov (United States)

    Jing Jin; Dauwels, Justin; Cash, Sydney; Westover, M Brandon

    2014-01-01

    Detection of interictal discharges is a key element of interpreting EEGs during the diagnosis and management of epilepsy. Because interpretation of clinical EEG data is time-intensive and reliant on experts who are in short supply, there is a great need for automated spike detectors. However, attempts to develop general-purpose spike detectors have so far been severely limited by a lack of expert-annotated data. Huge databases of interictal discharges are therefore in great demand for the development of general-purpose detectors. Detailed manual annotation of interictal discharges is time consuming, which severely limits the willingness of experts to participate. To address such problems, a graphical user interface "SpikeGUI" was developed in our work for the purposes of EEG viewing and rapid interictal discharge annotation. "SpikeGUI" substantially speeds up the task of annotating interictal discharges using a custom-built algorithm based on a combination of template matching and online machine learning techniques. While the algorithm is currently tailored to annotation of interictal epileptiform discharges, it can easily be generalized to other waveforms and signal types. PMID:25570976

  10. A model-free method for annotating on vascular structure in volume rendered images

    Science.gov (United States)

    He, Wei; Li, Yanfang; Shi, Weili; Miao, Yu; He, Fei; Yan, Fei; Yang, Huamin; Zhang, Huimao; Mori, Kensaku; Jiang, Zhengang

    2015-03-01

    The precise annotation of vessel is desired in computer-assisted systems to help surgeons identify each vessel branch. A method has been reported that annotates vessels on volume rendered images by rendering their names on them using a two-pass rendering process. In the reported method, however, cylinder surface models of the vessels should be generated for writing vessels names. In fact, vessels are not actual cylinders, so the surfaces of the vessels cannot be simulated by such models accurately. This paper presents a model-free method for annotating vessels on volume rendered images by rendering their names on them using the two-pass rendering process: surface rendering and volume rendering. In the surface rendering process, docking points of vessel names are estimated by using such properties as centerlines, running directions, and vessel regions which are obtained in preprocess. Then the vessel names are pasted on the vessel surfaces at the docking points. In the volume rendering process, volume image is rendered using a fast volume rendering algorithm with depth buffer of image rendered in the surface rendering process. Finally, those rendered images are blended into an image as a result. In order to confirm the proposed method, a visualizing system for the automated annotation of abdominal arteries is performed. The experimental results show that vessel names can be drawn on the corresponding vessel in the volume rendered images correctly. The proposed method has enormous potential to be adopted to annotate other organs which cannot be modeled using regular geometrical surface.

  11. The surplus value of semantic annotations

    NARCIS (Netherlands)

    M. Marx

    2010-01-01

    We compare the costs of semantic annotation of textual documents to its benefits for information processing tasks. Semantic annotation can improve the performance of retrieval tasks and facilitates an improved search experience through faceted search, focused retrieval, better document summaries, an

  12. Towards an event annotated corpus of Polish

    Directory of Open Access Journals (Sweden)

    Michał Marcińczuk

    2015-12-01

    Full Text Available Towards an event annotated corpus of PolishThe paper presents a typology of events built on the basis of TimeML specification adapted to Polish language. Some changes were introduced to the definition of the event categories and a motivation for event categorization was formulated. The event annotation task is presented on two levels – ontology level (language independent and text mentions (language dependant. The various types of event mentions in Polish text are discussed. A procedure for annotation of event mentions in Polish texts is presented and evaluated. In the evaluation a randomly selected set of documents from the Corpus of Wrocław University of Technology (called KPWr was annotated by two linguists and the annotator agreement was calculated. The evaluation was done in two iterations. After the first evaluation we revised and improved the annotation procedure. The second evaluation showed a significant improvement of the agreement between annotators. The current work was focused on annotation and categorisation of event mentions in text. The future work will be focused on description of event with a set of attributes, arguments and relations.

  13. DIMA – Annotation guidelines for German intonation

    DEFF Research Database (Denmark)

    Kügler, Frank; Smolibocki, Bernadett; Arnold, Denis;

    2015-01-01

    easier since German intonation is currently annotated according to different models. To this end, we aim to provide guidelines that are easy to learn. The guidelines were evaluated running an inter-annotator reliability study on three different speech styles (read speech, monologue and dialogue...

  14. Creating Gaze Annotations in Head Mounted Displays

    DEFF Research Database (Denmark)

    Mardanbeigi, Diako; Qvarfordt, Pernilla

    To facilitate distributed communication in mobile settings, we developed GazeNote for creating and sharing gaze annotations in head mounted displays (HMDs). With gaze annotations it possible to point out objects of interest within an image and add a verbal description. To create an annota- tion, ...

  15. Crowdsourcing and annotating NER for Twitter #drift

    DEFF Research Database (Denmark)

    Fromreide, Hege; Hovy, Dirk; Søgaard, Anders

    2014-01-01

    We present two new NER datasets for Twitter; a manually annotated set of 1,467 tweets (kappa=0.942) and a set of 2,975 expert-corrected, crowdsourced NER annotated tweets from the dataset described in Finin et al. (2010). In our experiments with these datasets, we observe two important points: (a...

  16. Annotation of regular polysemy and underspecification

    DEFF Research Database (Denmark)

    Martínez Alonso, Héctor; Pedersen, Bolette Sandford; Bel, Núria

    2013-01-01

    We present the result of an annotation task on regular polysemy for a series of seman- tic classes or dot types in English, Dan- ish and Spanish. This article describes the annotation process, the results in terms of inter-encoder agreement, and the sense distributions obtained with two methods...

  17. Harnessing Collaborative Annotations on Online Formative Assessments

    Science.gov (United States)

    Lin, Jian-Wei; Lai, Yuan-Cheng

    2013-01-01

    This paper harnesses collaborative annotations by students as learning feedback on online formative assessments to improve the learning achievements of students. Through the developed Web platform, students can conduct formative assessments, collaboratively annotate, and review historical records in a convenient way, while teachers can generate…

  18. GRASP: analysis of genotype–phenotype results from 1390 genome-wide association studies and corresponding open access database

    OpenAIRE

    Leslie, Richard; O’Donnell, Christopher J.; Johnson, Andrew D.

    2014-01-01

    Summary: We created a deeply extracted and annotated database of genome-wide association studies (GWAS) results. GRASP v1.0 contains >6.2 million SNP-phenotype association from among 1390 GWAS studies. We re-annotated GWAS results with 16 annotation sources including some rarely compared to GWAS results (e.g. RNAediting sites, lincRNAs, PTMs).

  19. Manual Annotation of Translational Equivalence The Blinker Project

    CERN Document Server

    Melamed, I D

    1998-01-01

    Bilingual annotators were paid to link roughly sixteen thousand corresponding words between on-line versions of the Bible in modern French and modern English. These annotations are freely available to the research community from http://www.cis.upenn.edu/~melamed . The annotations can be used for several purposes. First, they can be used as a standard data set for developing and testing translation lexicons and statistical translation models. Second, researchers in lexical semantics will be able to mine the annotations for insights about cross-linguistic lexicalization patterns. Third, the annotations can be used in research into certain recently proposed methods for monolingual word-sense disambiguation. This paper describes the annotated texts, the specially-designed annotation tool, and the strategies employed to increase the consistency of the annotations. The annotation process was repeated five times by different annotators. Inter-annotator agreement rates indicate that the annotations are reasonably rel...

  20. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics

    OpenAIRE

    Blake, J. A.; Bult, C. J.; J.A. Kadin; J.E. Richardson; Eppig, J T

    2010-01-01

    The Mouse Genome Database (MGD) is the community model organism database for the laboratory mouse and the authoritative source for phenotype and functional annotations of mouse genes. MGD includes a complete catalog of mouse genes and genome features with integrated access to genetic, genomic and phenotypic information, all serving to further the use of the mouse as a model system for studying human biology and disease. MGD is a major component of the Mouse Genome Informatics (MGI, http://www...

  1. Structured RNAs and synteny regions in the pig genome

    DEFF Research Database (Denmark)

    Anthon, Christian; Tafer, Hakim; Havgaard, Jakob H;

    2014-01-01

    Laurasiatheria (pig, cow, dolphin, horse, cat, dog, hedgehog). CONCLUSIONS: We have obtained one of the most comprehensive annotations for structured ncRNAs of a mammalian genome, which is likely to play central roles in both health modelling and production. The core annotation is available in Ensembl 70 and the...

  2. The majority of total nuclear-encoded non-ribosomal RNA in a human cell is 'dark matter' un-annotated RNA

    OpenAIRE

    Milos Patrice; Reaman Gregory; Sorensen Poul HB; Reynolds C Patrick; Ozsolak Fatih; Raz Tal; St Laurent Georges; Kapranov Philipp; Arceci Robert J; Thompson John F; Triche Timothy J

    2010-01-01

    Abstract Background Discovery that the transcriptional output of the human genome is far more complex than predicted by the current set of protein-coding annotations and that most RNAs produced do not appear to encode proteins has transformed our understanding of genome complexity and suggests new paradigms of genome regulation. However, the fraction of all cellular RNA whose function we do not understand and the fraction of the genome that is utilized to produce that RNA remain controversial...

  3. The GOA database: Gene Ontology annotation updates for 2015

    OpenAIRE

    Huntley, Rachael P; Sawford, Tony; Mutowo-Meullenet, Prudence; Shypitsyna, Aleksandra; Bonilla, Carlos; Martin, Maria J.; O'Donovan, Claire

    2014-01-01

    The Gene Ontology Annotation (GOA) resource (http://www.ebi.ac.uk/GOA) provides evidence-based Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Manual annotations provided by UniProt curators are supplemented by manual and automatic annotations from model organism databases and specialist annotation groups. GOA currently supplies 368 million GO annotations to almost 54 million proteins in more than 480 000 taxonomic groups. The resource now provides annotat...

  4. AgBase: a functional genomics resource for agriculture

    Directory of Open Access Journals (Sweden)

    Hill David P

    2006-09-01

    Full Text Available Abstract Background Many agricultural species and their pathogens have sequenced genomes and more are in progress. Agricultural species provide food, fiber, xenotransplant tissues, biopharmaceuticals and biomedical models. Moreover, many agricultural microorganisms are human zoonoses. However, systems biology from functional genomics data is hindered in agricultural species because agricultural genome sequences have relatively poor structural and functional annotation and agricultural research communities are smaller with limited funding compared to many model organism communities. Description To facilitate systems biology in these traditionally agricultural species we have established "AgBase", a curated, web-accessible, public resource http://www.agbase.msstate.edu for structural and functional annotation of agricultural genomes. The AgBase database includes a suite of computational tools to use GO annotations. We use standardized nomenclature following the Human Genome Organization Gene Nomenclature guidelines and are currently functionally annotating chicken, cow and sheep gene products using the Gene Ontology (GO. The computational tools we have developed accept and batch process data derived from different public databases (with different accession codes, return all existing GO annotations, provide a list of products without GO annotation, identify potential orthologs, model functional genomics data using GO and assist proteomics analysis of ESTs and EST assemblies. Our journal database helps prevent redundant manual GO curation. We encourage and publicly acknowledge GO annotations from researchers and provide a service for researchers interested in GO and analysis of functional genomics data. Conclusion The AgBase database is the first database dedicated to functional genomics and systems biology analysis for agriculturally important species and their pathogens. We use experimental data to improve structural annotation of genomes and to

  5. MOCAT2: a metagenomic assembly, annotation and profiling framework

    Science.gov (United States)

    Kultima, Jens Roat; Coelho, Luis Pedro; Forslund, Kristoffer; Huerta-Cepas, Jaime; Li, Simone S.; Driessen, Marja; Voigt, Anita Yvonne; Zeller, Georg; Sunagawa, Shinichi; Bork, Peer

    2016-01-01

    Summary: MOCAT2 is a software pipeline for metagenomic sequence assembly and gene prediction with novel features for taxonomic and functional abundance profiling. The automated generation and efficient annotation of non-redundant reference catalogs by propagating pre-computed assignments from 18 databases covering various functional categories allows for fast and comprehensive functional characterization of metagenomes. Availability and Implementation: MOCAT2 is implemented in Perl 5 and Python 2.7, designed for 64-bit UNIX systems and offers support for high-performance computer usage via LSF, PBS or SGE queuing systems; source code is freely available under the GPL3 license at http://mocat.embl.de. Contact: bork@embl.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27153620

  6. RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs

    Directory of Open Access Journals (Sweden)

    Eddy Sean R

    2002-05-01

    Full Text Available Abstract Background When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication. The utility of phylogenetic information in high-throughput genome annotation ("phylogenomics" is widely recognized, but existing approaches are either manual or not explicitly based on phylogenetic trees. Results Here we present RIO (Resampled Inference of Orthologs, a procedure for automated phylogenomics using explicit phylogenetic inference. RIO analyses are performed over bootstrap resampled phylogenetic trees to estimate the reliability of orthology assignments. We also introduce supplementary concepts that are helpful for functional inference. RIO has been implemented as Perl pipeline connecting several C and Java programs. It is available at http://www.genetics.wustl.edu/eddy/forester/. A web server is at http://www.rio.wustl.edu/. RIO was tested on the Arabidopsis thaliana and Caenorhabditis elegans proteomes. Conclusion The RIO procedure is particularly useful for the automated detection of first representatives of novel protein subfamilies. We also describe how some orthologies can be misleading for functional inference.

  7. The 1000 bull genome project

    Science.gov (United States)

    To meet growing global demands for high value protein from milk and meat, rates of genetic gain in domestic cattle must be accelerated. At the same time, animal health and welfare must be considered. The 1000 bull genomes project supports these goals by providing annotated sequence variants and ge...

  8. Efficient assembly and annotation of the transcriptome of catfish by RNA-Seq analysis of a doubled haploid homozygote

    Directory of Open Access Journals (Sweden)

    Liu Shikai

    2012-11-01

    Full Text Available Abstract Background Upon the completion of whole genome sequencing, thorough genome annotation that associates genome sequences with biological meanings is essential. Genome annotation depends on the availability of transcript information as well as orthology information. In teleost fish, genome annotation is seriously hindered by genome duplication. Because of gene duplications, one cannot establish orthologies simply by homology comparisons. Rather intense phylogenetic analysis or structural analysis of orthologies is required for the identification of genes. To conduct phylogenetic analysis and orthology analysis, full-length transcripts are essential. Generation of large numbers of full-length transcripts using traditional transcript sequencing is very difficult and extremely costly. Results In this work, we took advantage of a doubled haploid catfish, which has two sets of identical chromosomes and in theory there should be no allelic variations. As such, transcript sequences generated from next-generation sequencing can be favorably assembled into full-length transcripts. Deep sequencing of the doubled haploid channel catfish transcriptome was performed using Illumina HiSeq 2000 platform, yielding over 300 million high-quality trimmed reads totaling 27 Gbp. Assembly of these reads generated 370,798 non-redundant transcript-derived contigs. Functional annotation of the assembly allowed identification of 25,144 unique protein-encoding genes. A total of 2,659 unique genes were identified as putative duplicated genes in the catfish genome because the assembly of the corresponding transcripts harbored PSVs or MSVs (in the form of pseudo-SNPs in the assembly. Of the 25,144 contigs with unique protein hits, around 20,000 contigs matched 50% length of reference proteins, and over 14,000 transcripts were identified as full-length with complete open reading frames. The characterization of consensus sequences surrounding start codon and the stop

  9. The genome of the mesopolyploid crop species Brassica rapa

    NARCIS (Netherlands)

    Wang, Xiaowu; Wang, Hanzhong; Sun, Rifei; Bonnema, A.B.

    2011-01-01

    We report the annotation and analysis of the draft genome sequence of Brassica rapa accession Chiifu-401-42, a Chinese cabbage. We modeled 41,174 protein coding genes in the B. rapa genome, which has undergone genome triplication. We used Arabidopsis thaliana as an outgroup for investigating the con

  10. Draft Genome Sequence of Fungus Clonostachys rosea Strain YKD0085.

    Science.gov (United States)

    Liu, Shuai; Chang, Yaowen; Hu, Xujia; Gong, Xuanyun; Di, Yingtong; Dong, Jinyan; Hao, Xiaojiang

    2016-01-01

    Here, we report the draft genome sequence of Clonostachys rosea (strain YKD0085). The functional annotation of C. rosea provides important information related to its ability to produce secondary metabolites. The genome sequence presented here builds the basis for further genome mining. PMID:27340057

  11. Dana-Farber Cancer Institute | Office of Cancer Genomics

    Science.gov (United States)

    Functional Annotation of Cancer Genomes Principal Investigator: William C. Hahn, M.D., Ph.D. The comprehensive characterization of cancer genomes has and will continue to provide an increasingly complete catalog of genetic alterations in specific cancers. However, most epithelial cancers harbor hundreds of genetic alterations as a consequence of genomic instability. Therefore, the functional consequences of the majority of mutations remain unclear.

  12. Functional innovations of three chronological mesohexaploid Brassica rapa genomes

    OpenAIRE

    Kim, Jungeun; Lee, Jeongyeo; Choi, Jae-Pil; Park, Inkyu; Yang, Kyungbong; Kim, Min Keun; Lee, Young Han; Nou, Ill-Sup; Kim, Dae-Soo; Min, Sung Ran; Park, Sang Un; Kim, HyeRan

    2014-01-01

    Background The Brassicaceae family is an exemplary model for studying plant polyploidy. The Brassicaceae knowledge-base includes the well-annotated Arabidopsis thaliana reference sequence; well-established evidence for three rounds of whole genome duplication (WGD); and the conservation of genomic structure, with 24 conserved genomic blocks (GBs). The recently released Brassica rapa draft genome provides an ideal opportunity to update our knowledge of the conserved genomic structures in Brass...

  13. A Common XML-based Framework for Syntactic Annotations

    CERN Document Server

    Ide, Nancy; Erjavec, Tomaz

    2009-01-01

    It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and that standards for linguistic annotation are becoming increasingly mandatory. To answer this need, we have developed a framework comprised of an abstract model for a variety of different annotation types (e.g., morpho-syntactic tagging, syntactic annotation, co-reference annotation, etc.), which can be instantiated in different ways depending on the annotator's approach and goals. In this paper we provide an overview of the framework, demonstrate its applicability to syntactic annotation, and show how it can contribute to comparative evaluation of parser output and diverse syntactic annotation schemes.

  14. Deep Question Answering for protein annotation.

    Science.gov (United States)

    Gobeill, Julien; Gaudinat, Arnaud; Pasche, Emilie; Vishnyakova, Dina; Gaudet, Pascale; Bairoch, Amos; Ruch, Patrick

    2015-01-01

    Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this perspective, question-answering (QA) engines are designed to display answers, which were automatically extracted from the retrieved documents. Standard QA engines in literature process a user question, then retrieve relevant documents and finally extract some possible answers out of these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step, and exploits curated biological data to infer answers, which are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/. PMID:26384372

  15. Accessing the SEED Genome Databases via Web Services API: Tools for Programmers

    Directory of Open Access Journals (Sweden)

    Vonstein Veronika

    2010-06-01

    Full Text Available Abstract Background The SEED integrates many publicly available genome sequences into a single resource. The database contains accurate and up-to-date annotations based on the subsystems concept that leverages clustering between genomes and other clues to accurately and efficiently annotate microbial genomes. The backend is used as the foundation for many genome annotation tools, such as the Rapid Annotation using Subsystems Technology (RAST server for whole genome annotation, the metagenomics RAST server for random community genome annotations, and the annotation clearinghouse for exchanging annotations from different resources. In addition to a web user interface, the SEED also provides Web services based API for programmatic access to the data in the SEED, allowing the development of third-party tools and mash-ups. Results The currently exposed Web services encompass over forty different methods for accessing data related to microbial genome annotations. The Web services provide comprehensive access to the database back end, allowing any programmer access to the most consistent and accurate genome annotations available. The Web services are deployed using a platform independent service-oriented approach that allows the user to choose the most suitable programming platform for their application. Example code demonstrate that Web services can be used to access the SEED using common bioinformatics programming languages such as Perl, Python, and Java. Conclusions We present a novel approach to access the SEED database. Using Web services, a robust API for access to genomics data is provided, without requiring large volume downloads all at once. The API ensures timely access to the most current datasets available, including the new genomes as soon as they come online.

  16. The Comprehensive Phytopathogen Genomics Resource: a web-based resource for data-mining plant pathogen genomes

    OpenAIRE

    Hamilton, John P; Neeno-Eckwall, Eric C; Adhikari, Bishwo N.; Perna, Nicole T; Tisserat, Ned; Leach, Jan E.; Lévesque, C. André; Buell, C. Robin

    2011-01-01

    The Comprehensive Phytopathogen Genomics Resource (CPGR) provides a web-based portal for plant pathologists and diagnosticians to view the genome and trancriptome sequence status of 806 bacterial, fungal, oomycete, nematode, viral and viroid plant pathogens. Tools are available to search and analyze annotated genome sequences of 74 bacterial, fungal and oomycete pathogens. Oomycete and fungal genomes are obtained directly from GenBank, whereas bacterial genome sequences are downloaded from th...

  17. Comparative Reannotation of 21 Aspergillus Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Salamov, Asaf; Riley, Robert; Kuo, Alan; Grigoriev, Igor

    2013-03-08

    We used comparative gene modeling to reannotate 21 Aspergillus genomes. Initial automatic annotation of individual genomes may contain some errors of different nature, e.g. missing genes, incorrect exon-intron structures, 'chimeras', which fuse 2 or more real genes or alternatively splitting some real genes into 2 or more models. The main premise behind the comparative modeling approach is that for closely related genomes most orthologous families have the same conserved gene structure. The algorithm maps all gene models predicted in each individual Aspergillus genome to the other genomes and, for each locus, selects from potentially many competing models, the one which most closely resembles the orthologous genes from other genomes. This procedure is iterated until no further change in gene models is observed. For Aspergillus genomes we predicted in total 4503 new gene models ( ~;;2percent per genome), supported by comparative analysis, additionally correcting ~;;18percent of old gene models. This resulted in a total of 4065 more genes with annotated PFAM domains (~;;3percent increase per genome). Analysis of a few genomes with EST/transcriptomics data shows that the new annotation sets also have a higher number of EST-supported splice sites at exon-intron boundaries.

  18. Automation Security

    OpenAIRE

    Mirzoev, Dr. Timur

    2014-01-01

    Web-based Automated Process Control systems are a new type of applications that use the Internet to control industrial processes with the access to the real-time data. Supervisory control and data acquisition (SCADA) networks contain computers and applications that perform key functions in providing essential services and commodities (e.g., electricity, natural gas, gasoline, water, waste treatment, transportation) to all Americans. As such, they are part of the nation s critical infrastructu...

  19. Software tool for researching annotations of proteins: open-source protein annotation software with data visualization.

    Science.gov (United States)

    Bhatia, Vivek N; Perlman, David H; Costello, Catherine E; McComb, Mark E

    2009-12-01

    In order that biological meaning may be derived and testable hypotheses may be built from proteomics experiments, assignments of proteins identified by mass spectrometry or other techniques must be supplemented with additional notation, such as information on known protein functions, protein-protein interactions, or biological pathway associations. Collecting, organizing, and interpreting this data often requires the input of experts in the biological field of study, in addition to the time-consuming search for and compilation of information from online protein databases. Furthermore, visualizing this bulk of information can be challenging due to the limited availability of easy-to-use and freely available tools for this process. In response to these constraints, we have undertaken the design of software to automate annotation and visualization of proteomics data in order to accelerate the pace of research. Here we present the Software Tool for Researching Annotations of Proteins (STRAP), a user-friendly, open-source C# application. STRAP automatically obtains gene ontology (GO) terms associated with proteins in a proteomics results ID list using the freely accessible UniProtKB and EBI GOA databases. Summarized in an easy-to-navigate tabular format, STRAP results include meta-information on the protein in addition to complementary GO terminology. Additionally, this information can be edited by the user so that in-house expertise on particular proteins may be integrated into the larger data set. STRAP provides a sortable tabular view for all terms, as well as graphical representations of GO-term association data in pie charts (biological process, cellular component, and molecular function) and bar charts (cross comparison of sample sets) to aid in the interpretation of large data sets and differential analyses experiments. Furthermore, proteins of interest may be exported as a unique FASTA-formatted file to allow for customizable re-searching of mass spectrometry

  20. Fungal Genomics Program

    Energy Technology Data Exchange (ETDEWEB)

    Grigoriev, Igor

    2012-03-12

    The JGI Fungal Genomics Program aims to scale up sequencing and analysis of fungal genomes to explore the diversity of fungi important for energy and the environment, and to promote functional studies on a system level. Combining new sequencing technologies and comparative genomics tools, JGI is now leading the world in fungal genome sequencing and analysis. Over 120 sequenced fungal genomes with analytical tools are available via MycoCosm (www.jgi.doe.gov/fungi), a web-portal for fungal biologists. Our model of interacting with user communities, unique among other sequencing centers, helps organize these communities, improves genome annotation and analysis work, and facilitates new larger-scale genomic projects. This resulted in 20 high-profile papers published in 2011 alone and contributing to the Genomics Encyclopedia of Fungi, which targets fungi related to plant health (symbionts, pathogens, and biocontrol agents) and biorefinery processes (cellulose degradation, sugar fermentation, industrial hosts). Our next grand challenges include larger scale exploration of fungal diversity (1000 fungal genomes), developing molecular tools for DOE-relevant model organisms, and analysis of complex systems and metagenomes.