I-Min A Chen
Full Text Available Effective comparative analysis of microbial genomes requires a consistent and complete view of biological data. Consistency concerns the biological coherence of annotations, while completeness concerns the extent and coverage of functional characterization for genomes. We have developed tools that allow scientists to assess and improve the consistency and completeness of microbial genome annotations in the context of the Integrated Microbial Genomes (IMG) family of systems. All publicly available microbial genomes are characterized in IMG using different functional annotation and pathway resources, thus providing a comprehensive framework for identifying and resolving annotation discrepancies. A rule-based system for predicting phenotypes in IMG provides a powerful mechanism for validating functional annotations, whereby the phenotypic traits of an organism are inferred from the presence of certain metabolic reactions and pathways and compared to experimentally observed phenotypes. The IMG family of systems is available at http://img.jgi.doe.gov/.
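The rule-based phenotype prediction described above can be sketched as follows. This is a minimal illustration, not the IMG implementation: the rule table, pathway identifiers, and phenotype names are all invented for the example; the real system works over IMG's curated reaction and pathway data.

```python
# Hypothetical rule table: a phenotype is asserted for a genome when all
# pathways required by its rule are present. IDs are illustrative only.
PHENOTYPE_RULES = {
    "H2S producer": {"sulfate_reduction", "sulfite_reductase"},
    "cellulose degrader": {"cellulase", "beta_glucosidase"},
}

def predict_phenotypes(genome_pathways):
    """Return phenotypes whose required pathways are all present."""
    present = set(genome_pathways)
    return sorted(p for p, required in PHENOTYPE_RULES.items()
                  if required <= present)

def check_against_observed(predicted, observed):
    """Compare predicted phenotypes with experimentally observed ones,
    the validation step the abstract describes."""
    predicted, observed = set(predicted), set(observed)
    return {"agree": predicted & observed,
            "predicted_only": predicted - observed,
            "observed_only": observed - predicted}
```

A disagreement in `predicted_only` or `observed_only` flags a possible annotation error for review.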
Berger Dave K
Abstract Background Microarray technology has matured over the past fifteen years into a cost-effective solution with established data analysis protocols for global gene expression profiling. The Agilent-016047 maize 44 K microarray was custom-designed from EST sequences, but only reporter sequences with EST accession numbers are publicly available. The following information is lacking: (a) reporter-to-gene model match, (b) number of reporters per gene model, (c) potential for cross-hybridization, (d) sense/antisense orientation of reporters, (e) position of the reporter on the B73 genome sequence (for eQTL studies), and (f) functional annotations of genes represented by reporters. To address this, we developed a strategy to annotate the Agilent-016047 maize microarray, and built a publicly accessible annotation database. Description Genomic annotation of the 42,034 reporters on the Agilent-016047 maize microarray was based on BLASTN results of the 60-mer reporter sequences and their corresponding ESTs against the maize B73 RefGen v2 "Working Gene Set" (WGS) predicted transcripts and the genome sequence. The agreement among the EST, WGS transcript, and gDNA BLASTN results was used to assign the reporters to six genomic annotation groups: (i) "annotation by sense gene model" (23,668 reporters); (ii) "annotation by antisense gene model" (4,330); (iii) "annotation by gDNA", without a WGS transcript hit (1,549); (iv) "annotation by EST", in which case the EST from which the reporter was designed, but not the reporter itself, has a WGS transcript hit (3,390); (v) "ambiguous annotation" (2,608); and (vi) "inconclusive annotation" (6,489). Functional annotations of reporters were obtained by BLASTX and Blast2GO analysis of corresponding WGS transcripts against GenBank. The annotations are available in the Maize Microarray Annotation Database http://MaizeArrayAnnot.bi.up.ac.za/, as well as through a GBrowse annotation file that can be uploaded to …
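The assignment of reporters to the six annotation groups can be sketched as a decision cascade over the BLASTN evidence. The boolean flags below are an assumed simplification of the published agreement criteria; the real pipeline derives them from alignment scores and orientations.

```python
def classify_reporter(sense_wgs_hit, antisense_wgs_hit,
                      gdna_hit, est_wgs_hit, ambiguous):
    """Assign a reporter to one of the six genomic annotation groups.

    Flags (assumed, simplified): whether the reporter hits a WGS transcript
    in sense or antisense orientation, whether it hits gDNA without a WGS
    transcript, whether only its source EST hits a WGS transcript, and
    whether the evidence is conflicting.
    """
    if ambiguous:
        return "ambiguous annotation"
    if sense_wgs_hit:
        return "annotation by sense gene model"
    if antisense_wgs_hit:
        return "annotation by antisense gene model"
    if gdna_hit:
        return "annotation by gDNA"
    if est_wgs_hit:
        # the EST the reporter was designed from, but not the reporter
        # itself, matches a WGS transcript
        return "annotation by EST"
    return "inconclusive annotation"
```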
Cannataro, Mario; Hiram Guzzi, Pietro; Veltri, Pierangelo
Biological databases have been developed with a special focus on the efficient retrieval of single records or the efficient computation of specialized bioinformatics algorithms against the overall database, such as in sequence alignment. The continuous production of biological knowledge, spread across several biological databases and ontologies such as Gene Ontology, and the availability of efficient techniques to handle such knowledge, such as annotation and semantic similarity measures, enable the development of novel bioinformatics applications that explicitly use and integrate such knowledge. After introducing the annotation process and the main semantic similarity measures, this paper shows how annotations and semantic similarity can be exploited to improve the extraction and analysis of biologically relevant data from protein interaction databases. As case studies, the paper presents two novel software tools, OntoPIN and CytoSeVis, both based on the use of Gene Ontology annotations, for the advanced querying of protein interaction databases and for the enhanced visualization of protein interaction networks.
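As a concrete example of an annotation-based semantic similarity measure, one of the simplest options is set overlap (Jaccard similarity) over the GO terms annotating two proteins. This is an illustrative sketch only; the paper's measures may differ (many GO similarity measures use the ontology graph and term information content rather than flat sets).

```python
def go_jaccard(annotations_a, annotations_b):
    """Jaccard similarity between two sets of GO term identifiers:
    |A ∩ B| / |A ∪ B|, in [0, 1]."""
    a, b = set(annotations_a), set(annotations_b)
    if not a and not b:
        return 0.0  # no annotations on either side: treat as dissimilar
    return len(a & b) / len(a | b)
```

A tool like OntoPIN can rank or filter interaction partners by such a score instead of matching identifiers exactly.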
Full Text Available Abstract Background Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. Results In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns" and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. Conclusion GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns constructed from the existing annotations achieves high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme than "exact matching", with the advantage of locating approximate
Rother, Kristian; Michalsky, Elke; Leser, Ulf
We investigated to what extent Protein Data Bank (PDB) entries are annotated with second-party information, based on existing cross-references between the PDB and 15 other databases. We report two interesting findings. First, there is a clear "annotation gap": structures less than seven years old are sparsely covered by manually curated secondary databases. Second, the examined databases overlap with each other quite well, dividing the PDB into two well-annotated thirds and one poorly annotated third. Both observations should be taken into account in any study that depends on selecting protein structures by their annotation.
Product annotations - KOME | LSDB Archive. File name: kome_product_annotation.zip; file URL: ftp://ftp.biosciencedbc.jp/archiv…
Gatterbauer, Wolfgang; Khoussainova, Nodira; Suciu, Dan
We propose a database model that allows users to annotate data with belief statements. Our motivation comes from scientific database applications where a community of users is working together to assemble, revise, and curate a shared data repository. As the community accumulates knowledge and the database content evolves over time, it may contain conflicting information and members can disagree on the information it should store. For example, Alice may believe that a tuple should be in the database, whereas Bob disagrees. He may also insert the reason why he thinks Alice believes the tuple should be in the database, and explain what he thinks the correct tuple should be instead. We propose a formal model for Belief Databases that interprets users' annotations as belief statements. These annotations can refer both to the base data and to other annotations. We give a formal semantics based on a fragment of multi-agent epistemic logic and define a query language over belief databases. We then prove a key technic...
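The belief-statement idea, including Bob's belief about Alice's belief, can be sketched as a small data model. This schema is an illustration only and is not the paper's formal epistemic-logic model: the class and field names are invented, and the example tuples are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Belief:
    """A belief statement by one agent about a base tuple, or about
    another agent's belief (via `about`)."""
    agent: str                        # who holds the belief
    positive: bool                    # believes the tuple holds / does not
    target_tuple: tuple               # the base data the belief concerns
    about: Optional["Belief"] = None  # set when annotating another belief

# Alice believes the tuple belongs in the database...
alice = Belief("Alice", True, ("gene42", "kinase"))
# ...while Bob disagrees, recording his annotation as a belief that
# refers back to Alice's belief.
bob = Belief("Bob", False, ("gene42", "kinase"), about=alice)
```

Because annotations can reference other annotations, the structure naturally forms chains of nested beliefs, matching the multi-agent reading in the abstract.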
Guzzi, Pietro Hiram; Veltri, Pierangelo; Cannataro, Mario
Protein-protein interaction (PPI) data stored in publicly available databases are queried through simple query interfaces that allow only key-based queries. A typical query on such databases uses protein identifiers and retrieves one or more proteins. Nevertheless, a lot of biological information is available, spread across different sources and encoded in different ontologies such as Gene Ontology. The integration of existing PPI databases and biological information may result in richer querying interfaces and could subsequently enable the development of novel algorithms that use biological information. The OntoPIN project showed the effectiveness of introducing a framework for the ontology-based management and querying of protein-protein interaction data. The OntoPIN framework first merges PPI data with annotations extracted from existing ontologies (e.g. Gene Ontology) and stores the annotated data in a database. Then, a semantic-based query interface enables users to query these data using biological concepts. OntoPIN allows users: (a) to extend existing PPI databases with ontologies, (b) to perform key-based querying of annotated data, and (c) to use a novel query interface based on semantic similarity among annotations.
Genome annotations - KOME | LSDB Archive. File URL: ftp://ftp.biosciencedbc.jp/archive/kome/LATEST/kome_genome_annotat…
Zhou, Yingyao; Zhou, Bin; Chen, Kaisheng; Yan, S Frank; King, Frederick J; Jiang, Shumei; Winzeler, Elizabeth A
While many large publicly accessible databases provide excellent annotation for biological macromolecules, the same is not true for small chemical compounds. Commercial data sources also fail to provide an annotation interface for large numbers of compounds and tend to be too cost-prohibitive to be widely available to biomedical researchers. As a result, using annotation information to select lead compounds from a modern high-throughput screening (HTS) campaign presently occurs only on a very limited scale. The recent rapid expansion of the NIH PubChem database provides an opportunity to link existing biological databases with compound catalogs and provides relevant information that could potentially improve the information garnered from large-scale screening efforts. Using the 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) as a model, we determined that approximately 4% of the library contained compounds with potential annotation in databases such as PubChem and the World Drug Index (WDI), as well as related databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and ChemIDplus. Furthermore, exact structure match analysis showed that 32% of GNF compounds can be linked to third-party databases via PubChem. We also showed that annotations such as MeSH (Medical Subject Headings) terms can be applied to in-house HTS databases to identify signature biological inhibition profiles of interest and to expedite the assay validation process. Automated annotation of thousands of screening hits in batch is becoming feasible and has the potential to play an essential role in the hit-to-lead decision-making process.
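The exact-structure-match linking step amounts to keying both the in-house library and the external catalog on a canonical structure representation and joining on that key. The sketch below assumes both sides already provide canonical structure strings (a real pipeline would canonicalize with a cheminformatics toolkit first); all compound identifiers and structures are invented.

```python
def link_by_exact_structure(library, external_catalog):
    """Link in-house compounds to external database entries by exact
    structure match.

    library:          {compound_id: canonical_structure_string}
    external_catalog: {canonical_structure_string: external_id}
    Returns:          {compound_id: external_id} for exact matches.
    """
    return {cid: external_catalog[structure]
            for cid, structure in library.items()
            if structure in external_catalog}
```

Compounds that match can then inherit the external record's annotations (e.g. MeSH terms) for downstream profiling of screening hits.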
Pieper, Ursula; Eswar, Narayanan; Davis, Fred P.; Braberg, Hannes; Madhusudhan, M. S.; Rossi, Andrea; Marti-Renom, Marc; Karchin, Rachel; Webb, Ben M.; Eramian, David; Shen, Min-Yi; Kelly, Libusha; Melo, Francisco; Sali, Andrej
MODBASE is a database of annotated comparative protein structure models for all available protein sequences that can be matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on MODELLER for fold assignment, sequence–structure alignment, model building and model assessment. MODBASE is updated regularly to reflect the growth in protein sequence and structure databases, and improvements in the software for calculat...
Full Text Available Genome annotation errors are a persistent problem that impedes research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly cross-checked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt) to the benefit of the scientific community. The enhanced annotations are also publicly available via HaloLex.
Salzberg Steven L
Abstract Background Rapid annotation and comparison of genomes from multiple isolates (pan-genomes) is becoming commonplace due to advances in sequencing technology. Genome annotations can contain inconsistencies and errors that hinder comparative analysis, even within a single species. Tools are needed to compare and improve annotation quality across sets of closely related genomes. Results We introduce a new tool, Mugsy-Annotator, that identifies orthologs and evaluates annotation quality in prokaryotic genomes using whole-genome multiple alignment. Mugsy-Annotator identifies anomalies in annotated gene structures, including inconsistently located translation initiation sites and genes disrupted by draft genome sequencing or pseudogenes. An evaluation of species pan-genomes using the tool indicates that such anomalies are common, especially at translation initiation sites. Mugsy-Annotator reports alternate annotations that improve consistency and are candidates for further review. Conclusions Whole-genome multiple alignment can be used to efficiently identify orthologs and annotation problem areas in a bacterial pan-genome. Comparisons of annotated gene structures within a species may show more variation than is actually present in the genome, indicating errors in genome annotation. Our new tool, Mugsy-Annotator, assists re-annotation efforts by highlighting edits that improve annotation consistency.
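The core consistency check on translation initiation sites can be sketched as follows: given start coordinates for one ortholog group, projected onto a common frame by the whole-genome alignment, flag genomes whose annotated start deviates from the majority position. This is an assumed simplification of Mugsy-Annotator's logic, with invented genome IDs and coordinates.

```python
from collections import Counter

def start_site_anomalies(ortholog_starts, tolerance=0):
    """Flag inconsistently located translation initiation sites.

    ortholog_starts: {genome_id: alignment-projected start coordinate}
    Returns genome IDs whose start differs from the majority position
    by more than `tolerance` alignment columns.
    """
    counts = Counter(ortholog_starts.values())
    majority_start, _ = counts.most_common(1)[0]
    return sorted(genome for genome, start in ortholog_starts.items()
                  if abs(start - majority_start) > tolerance)
```

A flagged genome is a candidate for re-annotation: the majority start position is the "alternate annotation that improves consistency" the tool would report.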
Zhou, C; Lam, M; Smith, J; Zemla, A; Dyer, M; Kuczmarski, T; Vitalis, E; Slezak, T
MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from GenBank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the GenBank entry or when identical sequences are identified. Currently, 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high
Hwang, Kyung Hoon; Lee, Haejun; Koh, Geon; Willrett, Debra; Rubin, Daniel L
As the use of positron emission tomography-computed tomography (PET-CT) has increased rapidly, there is a need to retrieve relevant medical images that can assist image interpretation. However, the images themselves lack the explicit information needed for querying. We constructed a semantically structured database of nuclear medicine images using the Annotation and Image Markup (AIM) format and evaluated the ability of the AIM annotations to improve image search. We created AIM annotation templates specific to the nuclear medicine domain and used them to annotate 100 nuclear medicine PET-CT studies in AIM format using a controlled vocabulary. We evaluated image retrieval for 20 specific clinical queries. As the gold standard, two nuclear medicine physicians manually retrieved the relevant images from the image database using free-text search of radiology reports for the same queries. We compared query results with the manually retrieved results obtained by the physicians. The query performance indicated a 98% recall for simple queries and an 89% recall for complex queries. In total, the queries provided 95% (75 of 79 images) recall, 100% precision, and an F1 score of 0.97 for the 20 clinical queries. Three of the four images missed by the queries required reasoning for successful retrieval. Nuclear medicine images augmented with semantic annotations in AIM enabled high recall and precision for simple queries, helping physicians to retrieve the relevant images. Further study using a larger data set and the implementation of an inference engine may improve query results for more complex queries.
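The reported figures follow directly from the stated counts: 75 of 79 relevant images retrieved (4 false negatives) with no false positives. A short worked computation:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard retrieval metrics from true positives, false positives,
    and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 75 of 79 relevant images retrieved, no irrelevant images returned:
p, r, f1 = precision_recall_f1(tp=75, fp=0, fn=4)
# p = 1.0 (100% precision), r = 75/79 ≈ 0.95, f1 ≈ 0.97
```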
Childs Kevin L
Abstract Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence.
Vellozo, Augusto F; Véron, Amélie S; Baa-Puyoulet, Patrice; Huerta-Cepas, Jaime; Cottret, Ludovic; Febvay, Gérard; Calevro, Federica; Rahbé, Yvan; Douglas, Angela E; Gabaldón, Toni; Sagot, Marie-France; Charles, Hubert; Colella, Stefano
In recent years, genomes from an increasing number of organisms have been sequenced, but their annotation remains a time-consuming process. The BioCyc databases offer a framework for the integrated analysis of metabolic networks. The Pathway Tools software suite allows the automated construction of a database starting from an annotated genome, but it requires prior integration of all annotations into a specific summary file or into a GenBank file. To allow the easy creation and update of a BioCyc database starting from the multiple genome annotation resources available over time, we have developed an ad hoc data management system that we call the Cyc Annotation Database System (CycADS). CycADS is centred on a specific database model and on a set of Java programs to import, filter and export relevant information. Data from GenBank and other annotation sources (including, for example, KAAS, PRIAM, Blast2GO and PhylomeDB) are collected into a database to be subsequently filtered and extracted to generate a complete annotation file. This file is then used to build an enriched BioCyc database using the PathoLogic program of Pathway Tools. The CycADS pipeline for annotation management was used to build the AcypiCyc database for the pea aphid (Acyrthosiphon pisum), whose genome was recently sequenced. The AcypiCyc database webpage also includes, for comparative analyses, two other metabolic reconstruction BioCyc databases generated using CycADS: TricaCyc for Tribolium castaneum and DromeCyc for Drosophila melanogaster. Owing to its flexible design, CycADS offers a powerful software tool for the generation and regular updating of enriched BioCyc databases. The CycADS system is particularly suited for metabolic gene annotation and network reconstruction in newly sequenced genomes. Because of the uniform annotation used for metabolic network reconstruction, CycADS is particularly useful for comparative analysis of the metabolism of different organisms.
Database URL: http://www.cycadsys.org.
O'Leary, Nuala A.; Wright, Mathew W.; Brister, J. Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M.; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S.; Kodali, Vamsi K.; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M.; Murphy, Michael R.; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H.; Rausch, Daniel; Riddick, Lillian D.; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S.; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E.; Vatsan, Anjana R.; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J.; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D.; Pruitt, Kim D.
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. PMID:26553804
Kristoffersen, Jette Hedegaard; Troelsgård, Thomas; Langer, Gabriele
In a combined corpus-dictionary project, you would need one lexical database that could serve as a shared “backbone” for both corpus annotation and dictionary editing, but it is not that easy to define a database structure that applies satisfactorily to both these purposes. In this paper, we will...
Zhu, Xingquan; Liu, Wenyin; Zhang, HongJiang; Wu, Lide
Image annotation is used in traditional image database systems. However, without human help, it is very difficult to extract the semantic content of an image automatically. On the other hand, manually annotating images in large databases one by one is tedious work. In this paper, we present a web-based semi-automatic annotation and image retrieval scheme, which integrates image search and image annotation seamlessly and effectively. In this scheme, we use both low-level features and high-level semantics to measure similarity between images in an image database. A relevance feedback process at both levels is used to refine similarity assessment. The annotation process is activated when the user provides feedback on the retrieved images. With the help of the proposed similarity metrics and the relevance feedback approach at these two levels, the system can find the images relevant to the user's keyword or image query more efficiently. Experimental results have shown that our scheme is effective and efficient and can be used in large image databases for image annotation and retrieval.
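The two-level similarity with relevance feedback can be sketched as a weighted combination of a low-level feature score and a high-level semantic score, with weights renormalized after each round of user feedback. The weighting and update scheme below is an assumption for illustration, not the paper's exact metric.

```python
def combined_similarity(low_sim, high_sim, w_low=0.5, w_high=0.5):
    """Combine low-level (visual feature) and high-level (semantic/keyword)
    similarity scores, each in [0, 1], into one ranking score."""
    return w_low * low_sim + w_high * high_sim

def update_weights(w_low, w_high, semantics_helped, lr=0.1):
    """Relevance-feedback step (assumed scheme): shift weight toward the
    level that agreed with the user's judgment of the retrieved images,
    then renormalize so the weights sum to 1."""
    if semantics_helped:
        w_high += lr
    else:
        w_low += lr
    total = w_low + w_high
    return w_low / total, w_high / total
```

Over several feedback rounds the ranking drifts toward whichever level (visual or semantic) better predicts relevance for the current query.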
HE Fei; SHI Qing-yun; CHEN Ming; WU Ping
RiceDB, a web-based integrated database for annotating rice microarrays in various biological contexts, was developed. It is composed of eight modules. The RiceMap module archives the process of mapping Affymetrix probe sets to different rice databases and annotates the genes represented by a microarray set by retrieving annotation information via the identifier or accession number of each database; the RiceGO module indicates the association between a microarray set and gene ontology (GO) categories; the RiceKO module annotates a microarray set based on the KEGG biochemical pathways; the RiceDO module indicates domain information associated with a microarray set; the RiceUP module obtains promoter sequences for all genes represented by a microarray set; the RiceMR module lists potential microRNAs that regulate the genes represented by a microarray set; and the RiceCD and RiceGF modules annotate the genes represented by a microarray set in the context of chromosome distribution and rice paralogous family distribution. The results of automatic annotation are mostly consistent with manual annotation. Biological interpretation of microarray data is accelerated with the help of RiceDB.
Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard
Osteosarcoma (OS) is the most common primary bone cancer and exhibits high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Extensive research is ongoing to identify genes, including their gene products, and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors have been the key determinants guiding prognosis and therapeutic treatment. New studies about OS are published every day, complicating the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view of current OS knowledge that is quickly and easily accessible to researchers in the field. Therefore, we developed a publicly available database and Web interface that serve as a resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts in the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA-target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS involves pathways commonly deregulated in cancer but also features OS-specific alterations such as deregulated osteoclast differentiation. To our knowledge, this is the first OS database containing manually reviewed and annotated up-to-date OS knowledge. It might be a useful resource, especially for the bone tumor research community, as specific
Baseler Michael W
Background Due to the complex and distributed nature of biological research, our current biological knowledge is spread over many redundant annotation databases maintained by many independent groups. Analysts usually need to visit many of these bioinformatics databases to integrate comprehensive annotation information for their genes, which becomes a bottleneck, particularly for analytic tasks involving large gene lists. Thus, a highly centralized and ready-to-use gene-annotation knowledgebase is in demand for high-throughput gene functional analysis. Description The DAVID Knowledgebase is built around the DAVID Gene Concept, a single-linkage method to agglomerate tens of millions of gene/protein identifiers from a variety of public genomic resources into DAVID gene clusters. The grouping of such identifiers improves cross-reference capability, particularly across the NCBI and UniProt systems, enabling more than 40 publicly available functional annotation sources to be comprehensively integrated and centralized by the DAVID gene clusters. The simple, pair-wise, text-format files that make up the DAVID Knowledgebase are freely downloadable for various data analysis uses. In addition, a well-organized web interface allows users to query different types of heterogeneous annotations in a high-throughput manner. Conclusion The DAVID Knowledgebase is designed to facilitate high-throughput gene functional analysis. For a given gene list, it not only provides quick access to a wide range of heterogeneous annotation data in a centralized location, but also enriches the level of biological information for an individual gene. Moreover, the entire DAVID Knowledgebase is freely downloadable or searchable at http://david.abcc.ncifcrf.gov/knowledgebase/.
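Single-linkage agglomeration of identifiers, as described for the DAVID Gene Concept, can be sketched with a union-find structure: any two identifiers connected by a cross-reference fall into the same cluster. This is an illustrative sketch only; the identifiers and cross-reference pairs below are invented, and DAVID's actual implementation is not described in the abstract.

```python
# Sketch: single-linkage agglomeration of gene/protein identifiers
# into clusters, in the spirit of the DAVID Gene Concept.

def cluster_identifiers(xref_pairs):
    """Union-find: identifiers linked (directly or transitively) by a
    cross-reference end up in the same cluster (single linkage)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in xref_pairs:
        union(a, b)

    clusters = {}
    for ident in parent:
        clusters.setdefault(find(ident), set()).add(ident)
    return list(clusters.values())

# Hypothetical cross-references bridging NCBI and UniProt namespaces.
pairs = [
    ("EntrezGene:7157", "RefSeq:NP_000537"),
    ("RefSeq:NP_000537", "UniProt:P04637"),
    ("EntrezGene:672", "UniProt:P38398"),
]
clusters = cluster_identifiers(pairs)
# Two clusters: three linked TP53 identifiers, two linked BRCA1 identifiers.
```

Transitivity is the key property here: the first and third TP53 identifiers never co-occur in a pair, yet share a cluster through the middle one.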
Lea, Isabel A.; Gong, Hui; Paleja, Anand; Rashid, Asif; Fostel, Jennifer
The Chemical Effects in Biological Systems database (CEBS) is a comprehensive and unique toxicology resource that compiles individual and summary animal data from the National Toxicology Program (NTP) testing program and other depositors into a single electronic repository. CEBS has undergone significant updates in recent years and currently contains over 11 000 test articles (exposure agents) and over 8000 studies, including all available NTP carcinogenicity, short-term toxicity and genetic toxicity studies. Study data provided to CEBS are manually curated, accessioned and subject to quality assurance review prior to release to ensure high quality. The CEBS database has two main components: data collection and data delivery. To accommodate the breadth of data produced by NTP, the CEBS data collection component is an integrated relational design flexible enough to capture any type of electronic data received to date. The data delivery component of the database comprises a series of dedicated user interface tables containing pre-processed data that support each component of the user interface. The user interface has been updated to include a series of nine Guided Search tools that allow access to NTP summary and conclusion data and larger non-NTP datasets. The CEBS database can be accessed online at http://www.niehs.nih.gov/research/resources/databases/cebs/. PMID:27899660
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assess and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; we thus hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the WormBase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth in only 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.
Park, Julie; Costanzo, Maria C; Balakrishnan, Rama; Cherry, J Michael; Hong, Eurie L
The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method identifies genes that require annotation updates, an important step towards prioritizing literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets, in particular genes that are missing literature-supported annotations, but our survey found no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. DATABASE URL: http://www.yeastgenome.org.
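The core CvManGO check, pairing a manual and a predicted GO term and asking how they relate within the ontology, can be sketched with a toy is-a hierarchy. The terms and hierarchy below are invented placeholders, not real GO identifiers, and the three-way classification is a simplification of the published method.

```python
# Sketch of a CvManGO-style comparison: classify a (manual, predicted)
# annotation pair as identical, related (ancestor/descendant), or
# discrepant, using a toy is-a DAG. Terms here are illustrative.

IS_A = {  # child -> set of parents
    "GO:kinase": {"GO:catalytic"},
    "GO:catalytic": {"GO:molecular_function"},
    "GO:binding": {"GO:molecular_function"},
}

def ancestors(term):
    """All terms reachable from `term` by following is-a links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def compare(manual, predicted):
    if manual == predicted:
        return "identical"
    if predicted in ancestors(manual) or manual in ancestors(predicted):
        return "related"
    return "discrepant"  # candidate for curator review
```

A "discrepant" result is the interesting case: neither term subsumes the other, so the gene is flagged for review.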
Li, Weizhong; McWilliam, Hamish; de la Torre, Ana Richart; Grodowski, Adam; Benediktovich, Irina; Goujon, Mickael; Nauche, Stephane; Lopez, Rodrigo
The European Bioinformatics Institute (EMBL-EBI) provides public access to patent data, including abstracts, chemical compounds and sequences. Sequences can appear multiple times due to the filing of the same invention with multiple patent offices, or the use of the same sequence by different inventors in different contexts. Information relating to the source invention may be incomplete, and biological information available in patent documents elsewhere may not be reflected in the annotation of the sequence. Search and analysis of these data have become increasingly challenging for both the scientific and intellectual-property communities. Here, we report a collection of non-redundant patent sequence databases, which cover the EMBL-Bank nucleotide patent class and the patent protein databases and contain value-added annotations from patent documents. The databases were created at two levels using sequence MD5 checksums. Sequences within a level-1 cluster are 100% identical over their whole length. Level-2 clusters were defined by sub-grouping level-1 clusters based on patent family information. Value-added annotations, such as publication number corrections, earliest publication dates and feature collations, significantly enhance the quality of the data, allowing for better tracking and cross-referencing. The databases are available at http://www.ebi.ac.uk/patentdata/nr/.
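The level-1 clustering described above, collapsing sequences that are 100% identical over their whole length via MD5 checksums, is straightforward to sketch. The accessions and sequences below are invented examples.

```python
import hashlib

# Sketch of level-1 redundancy clustering: sequences with identical
# MD5 checksums (hence identical over their whole length) collapse
# into one cluster. Accessions and sequences are invented.

def md5_clusters(records):
    """records: iterable of (accession, sequence) pairs.
    Returns {md5_hexdigest: [accessions...]}."""
    clusters = {}
    for acc, seq in records:
        digest = hashlib.md5(seq.upper().encode("ascii")).hexdigest()
        clusters.setdefault(digest, []).append(acc)
    return clusters

records = [
    ("PAT0001", "MKTAYIAKQR"),
    ("PAT0002", "MKTAYIAKQR"),  # same sequence filed at another office
    ("PAT0003", "MSTNPKPQRK"),
]
clusters = md5_clusters(records)
# -> 2 clusters; PAT0001 and PAT0002 share one checksum.
```

Hashing makes the grouping a single linear pass: no pairwise sequence comparison is needed for the exact-identity level.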
Foxx, T. S.; Pierce, L.; Tierney, G. D.; Hansen, L. A.
Studies done in the last 40 years have provided information to construct a checklist of the plants of the Jemez Mountains. The present database and checklist build on the basic list compiled by Teralene Foxx and Gail Tierney in the early 1980s. The checklist is annotated with taxonomic information, geographic and biological information, economic uses, wildlife cover, revegetation potential, and ethnographic uses. Nearly 1000 species have been noted for the Jemez Mountains. This list is cross-referenced with the US Department of Agriculture Natural Resources Conservation Service PLANTS database species names and acronyms. All information will soon be available on a Web page.
Alexandra M Schnoes
Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level and types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%-63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with "overprediction" of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.
Yu, Ke; Zhang, Tong
We developed a fast method to construct local sub-databases from the NCBI-nr database for quick similarity search and annotation of huge metagenomic datasets based on the BLAST-MEGAN approach. A three-step sub-database annotation pipeline (SAP) was further proposed to conduct the annotation in a much more time-efficient way, requiring far less computational capacity than the direct NCBI-nr database BLAST-MEGAN approach. The first BLAST of SAP searches the original metagenomic dataset against the constructed sub-database for a quick screening of candidate target sequences. The candidate target sequences identified in the first BLAST are then subjected to a second BLAST against the whole NCBI-nr database. The BLAST results are finally annotated using MEGAN to filter out sequences mistakenly selected in the first BLAST, guaranteeing the accuracy of the results. Based on the tests conducted in this study, SAP achieved a speedup of ~150-385 times at a BLAST e-value of 1e-5, compared to the direct BLAST against the NCBI-nr database. The annotation results of SAP agree exactly with those of the direct NCBI-nr database BLAST-MEGAN approach, which is very time-consuming and computationally intensive. Selecting more rigorous thresholds (e.g. an e-value of 1e-10) would further accelerate the SAP process. The SAP pipeline may also be coupled with novel similarity search tools (e.g. RAPsearch) other than BLAST to achieve even faster annotation of huge metagenomic datasets. Above all, this sub-database construction method and the SAP pipeline provide a new time-efficient and convenient similarity search and annotation strategy for laboratories without access to high-performance computing facilities. SAP also offers high-performance computing facilities a way to process more similarity search tasks.
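The screen-then-confirm control flow of SAP can be sketched in miniature. The toy `hits` function below is a stand-in for BLAST (it checks for a shared exact k-mer rather than performing real alignment), and all sequences are invented; only the two-pass structure is the point.

```python
# Toy sketch of the SAP idea: a cheap first pass against a small
# sub-database screens candidates; only those candidates are
# re-searched against the full database. Scoring is a stand-in
# for BLAST, purely for illustration.

def hits(query, database, k=5):
    """Stand-in similarity test: query shares an exact k-mer
    with some subject sequence in the database."""
    kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return any(kmer in subject for subject in database for kmer in kmers)

def sap_annotate(reads, sub_db, full_db):
    candidates = [r for r in reads if hits(r, sub_db)]   # 1st pass: screen
    return [r for r in candidates if hits(r, full_db)]   # 2nd pass: confirm

sub_db = ["ATGGCGTACGTT"]               # small, taxon-focused sub-database
full_db = sub_db + ["TTTTCCCCAAAA"]     # stands in for the whole NCBI-nr
reads = ["ATGGCGTACGTTAC", "GGGGGGGGGGGG"]
kept = sap_annotate(reads, sub_db, full_db)
# Only the first read survives both passes.
```

The speedup comes from the first pass: most reads are rejected against the small sub-database and never reach the expensive full-database search.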
Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui
Bamboo, one of the most important non-timber forest products and among the fastest-growing plants in the world, represents the only major lineage of grasses native to forests. The recent first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights into bamboo genetics and evolution. To further extend our understanding of the bamboo genome and facilitate future studies building on previous achievements, we have developed BambooGDB, a bamboo genome database with a functional annotation and analysis platform. The de novo sequencing data, together with full-length complementary DNA and RNA-seq data of moso bamboo, compose the main contents of this database. Based on these sequence data, a comprehensive functional annotation of the bamboo genome was made. In addition, an analytical platform comprising comparative genomic analysis, protein-protein interaction networks, pathway analysis and visualization of genomic data was constructed. As discovery tools for understanding and identifying biological mechanisms of bamboo, the platform can serve as a systematic framework for helping to design experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate navigation of these data. As far as we know, this is the first genome database for bamboo. By integrating high-throughput sequencing data, full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for the bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org.
Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu
With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have frequently been distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control, distributed from a single platform at a single place. The DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, including uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. The genotypic data in DNApod are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and as a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction using orthologous gene information. PMID:28234924
De Hertogh, Benoît; Lahlimi, Leïla; Lambert, Christophe; Letesson, Jean-Jacques; Depiereux, Eric
The genome sequences of three Brucella biovars and of some species close to Brucella sp. have become available, enabling new relationship analyses. Moreover, the automatic genome annotation of the pathogenic bacterium Brucella melitensis has been manually corrected by a consortium of experts, leading to 899 modifications of start-site predictions among the 3198 open reading frames (ORFs) examined. This new annotation, coupled with the results of automatic annotation tools applied to the complete genome sequence of B. melitensis (including BLASTs against 9 genomes close to Brucella), provides numerous data sets related to predicted functions, biochemical properties and phylogenetic comparisons. To make these results available, alphaPAGe, a functional auto-updatable database of the corrected genome sequence of B. melitensis, has been built using the entity-relationship (ER) approach and a multi-purpose database structure. A friendly graphical user interface has been designed, and users can retrieve different kinds of information through three levels of queries: (1) the basic search uses classical keywords or sequence identifiers; (2) the advanced search engine allows users to combine (using logical operators) numerous criteria: (a) keywords (textual comparison) related to the pCDS's function, family domains and cellular localization; (b) physico-chemical characteristics (numerical comparison) such as isoelectric point or molecular weight, and structural criteria such as nucleic acid length or the number of transmembrane helices (TMHs); (c) similarity scores with Escherichia coli and 10 species phylogenetically close to B. melitensis; (3) complex queries can be performed using an SQL field, which allows any query respecting the database's structure. The database is publicly available through a Web server at the following URL: http://www.fundp.ac.be/urbm/bioinfo/aPAGe.
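The kind of multi-criteria query such an advanced search supports, keywords combined with numerical thresholds via logical operators, can be illustrated with an in-memory SQLite table. The table layout, column names, locus tags and values below are invented for the example; they are not aPAGe's actual schema.

```python
import sqlite3

# Illustrative version of an "advanced search" combining textual and
# numerical criteria with logical operators. Schema and data invented.

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE pcds (
    locus TEXT, function TEXT, pi REAL, mw REAL, tmh INTEGER)""")
con.executemany("INSERT INTO pcds VALUES (?,?,?,?,?)", [
    ("BMEI0001", "ABC transporter permease", 9.1, 32000.0, 6),
    ("BMEI0002", "ribosomal protein L2",     10.9, 30000.0, 0),
    ("BMEI0003", "ABC transporter ATPase",    6.2, 28000.0, 0),
])

# Keyword AND numeric criteria: membrane-bound, basic ABC transporter parts.
rows = con.execute("""
    SELECT locus FROM pcds
    WHERE function LIKE '%ABC transporter%' AND tmh >= 1 AND pi > 7
""").fetchall()
```

The same pattern extends naturally to the free-form SQL field mentioned in the abstract: any WHERE clause respecting the schema is a valid level-3 query.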
Background Second-generation sequencing technologies are precipitating major shifts with regards to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies. Results We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training data are limited, of low quality or even non-existent. MAKER2 also provides an easy means of using mRNA-seq data to improve annotation quality, and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review. Conclusions MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.
Background The sequencing and analysis of ESTs is for now the only practical approach for large-scale gene discovery and annotation in conifers, because their very large genomes are unlikely to be sequenced in the near future. Our objective was to produce extensive collections of ESTs and cDNA clones to support the manufacture of cDNA microarrays and gene discovery in white spruce (Picea glauca [Moench] Voss). Results We produced 16 cDNA libraries from different tissues and a variety of treatments, and partially sequenced 50,000 cDNA clones. High-quality 3' and 5' reads were assembled into 16,578 consensus sequences, 45% of which represented full-length inserts. Consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define 14,471 transcripts. A large proportion (84%) of the spruce sequences matched a pine sequence, but only 68% of the spruce transcripts had homologs in Arabidopsis or rice. Nearly all the sequences that matched the Populus trichocarpa genome (the only sequenced tree genome) also matched the rice or Arabidopsis genomes. We used several sequence similarity search approaches for assignment of putative functions, including BLAST searches against general and specialized databases (transcription factors, cell wall related proteins), Gene Ontology term assignment and Hidden Markov Model searches against PFAM protein families and domains. In total, 70% of the spruce transcripts displayed matches to proteins of known or unknown function in the Uniref100 database (blastx e-value Arabidopsis or rice genomes. Detailed analysis of the translationally controlled tumour protein and S-adenosylmethionine synthetase families confirmed a twofold size difference. Sequences and annotations were organized in a dedicated database, SpruceDB. Several search tools were developed to mine the data either based on their occurrence in the cDNA libraries or on functional annotations. Conclusion This report illustrates specific
Paladin, Lisanna; Hirsh, Layla; Piovesan, Damiano; Andrade-Navarro, Miguel A.; Kajava, Andrey V.; Tosatto, Silvio C.E.
RepeatsDB 2.0 (URL: http://repeatsdb.bio.unipd.it/) is an update of the database of annotated tandem repeat protein structures. Repeat proteins are a widespread class of non-globular proteins carrying heterogeneous functions involved in several diseases. Here we provide a new version of RepeatsDB with an improved classification schema including high-quality annotations for ∼5400 protein structures. RepeatsDB 2.0 features information on start and end positions for the repeat regions and units for all entries. The extensive growth of repeat unit characterization was made possible by applying the novel ReUPred annotation method over the entire Protein Data Bank, with data quality guaranteed by extensive manual validation for >60% of the entries. The updated web interface includes a new search engine for complex queries and a fully re-designed entry page for a better overview of structural data. It is now possible to compare unit positions, together with secondary structure, fold information and Pfam domains. Moreover, a new classification level has been introduced on top of the existing scheme as an independent layer for sequence similarity relationships at 40%, 60% and 90% identity. PMID:27899671
Copyright © 2014 ACM. Predicting protein structures and simulating protein folding are two of the most important problems in computational biology today. Simulation methods rely on a scoring function to distinguish the native structure (the most energetically stable) from non-native structures. Decoy databases are collections of non-native structures used to test and verify these functions. We present a method to evaluate and improve the quality of decoy databases by adding novel structures and removing redundant structures. We test our approach on 17 different decoy databases of varying size and type and show significant improvement across a variety of metrics. We also test our improved databases on a popular modern scoring function and show that they contain a greater number of native-like structures than the original databases, thereby producing a more rigorous database for testing scoring functions.
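Removing redundant structures from a decoy set can be sketched as greedy filtering under a distance cutoff: keep a structure only if it is farther than the cutoff from everything already kept. For illustration the "structures" below are reduced to small coordinate vectors with Euclidean distance; the paper's actual metrics (e.g. pairwise RMSD over atoms) and cutoffs are not reproduced here.

```python
# Sketch: greedy redundancy removal for a decoy database.
# Structures are stand-in 2-D vectors; real work would compare
# full structures, e.g. by pairwise RMSD.

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def prune_redundant(decoys, cutoff):
    """decoys: list of (name, vector). Keep a decoy only if it is
    farther than `cutoff` from every decoy kept so far."""
    kept = []
    for name, vec in decoys:
        if all(dist(vec, kept_vec) > cutoff for _, kept_vec in kept):
            kept.append((name, vec))
    return [name for name, _ in kept]

decoys = [
    ("d1", (0.0, 0.0)),
    ("d2", (0.1, 0.0)),  # near-duplicate of d1
    ("d3", (5.0, 5.0)),
]
kept = prune_redundant(decoys, cutoff=1.0)
# d2 is dropped as redundant with d1.
```

Note the greedy order matters: the first member of each near-duplicate group is the one retained, so input ordering (e.g. by energy) is itself a design choice.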
Background Plant microRNAs (miRNAs) have been revealed to play important roles in developmental control, hormone secretion, cell differentiation and proliferation, and response to environmental stresses. However, our knowledge about the regulatory mechanisms and functions of miRNAs remains very limited. The main difficulties lie in two aspects. On one hand, the number of experimentally validated miRNA targets is very limited and the predicted targets often include many false positives, which constrains us from revealing the functions of miRNAs. On the other hand, the regulation of miRNAs is known to be spatio-temporally specific, which increases the difficulty of understanding the regulatory mechanisms of miRNAs. Description In this paper we present miRFANs, an online database for Arabidopsis thaliana miRNA function annotations. We integrated various types of datasets, including miRNA-target interactions, transcription factors (TFs) and their targets, expression profiles, genomic annotations and pathways, into a comprehensive database, and developed various statistical and mining tools, together with a user-friendly web interface. For each miRNA target predicted by psRNATarget, TargetAlign and UEA target-finder, or recorded in TarBase and miRTarBase, the effect of its up-regulated or down-regulated miRNA on the expression level of the target gene is evaluated by carrying out differential expression analysis of both miRNA and target expression profiles acquired under the same (or similar) experimental condition and in the same tissue. Moreover, each miRNA target is associated with gene ontology and pathway terms, together with the target site information and regulating miRNAs predicted by different computational methods. These associated terms may provide valuable insight into the functions of each miRNA. Conclusion First, a comprehensive collection of miRNA targets for Arabidopsis thaliana provides valuable information about the functions of
Data integration is a well-known grand challenge in information systems. It is highly relevant in medicine because of the multitude of patient data sources. Semantic annotations of data items regarding concept and value domain, based on comprehensive terminologies, can facilitate data integration and migration. They should therefore be implemented in databases from the very beginning.
Many genes have been implicated in the pathogenesis of common respiratory and related diseases (RRDs), yet the underlying mechanisms are largely unknown. Differential gene expression patterns in diseased and healthy individuals suggest that RRDs affect or are affected by modified transcription regulation programs. It is thus crucial to characterize implicated genes in terms of transcriptional regulation. For this purpose, we conducted a promoter analysis of genes associated with 11 common RRDs including allergic rhinitis, asthma, bronchiectasis, bronchiolitis, bronchitis, chronic obstructive pulmonary disease, cystic fibrosis, emphysema, eczema, psoriasis, and urticaria, many of which are thought to be genetically related. The objective of the present study was to obtain deeper insight into the transcriptional regulation of these disease-associated genes by annotating their promoter regions with transcription factors (TFs) and TF binding sites (TFBSs). We discovered many TFs that are significantly enriched in the target disease groups, including associations that have been documented in the literature. We also identified a number of putative TFs/TFBSs that appear to be novel. The results of our analysis are provided in an online database that is freely accessible to researchers at http://www.respiratorygenomics.com. Promoter-associated TFBS information and related genomic features, such as histone modification sites, microsatellites, CpG islands, and SNPs, are graphically summarized in the database. Users can compare and contrast underlying mechanisms of specific RRDs relative to candidate genes, TFs, gene ontology terms, micro-RNAs, and biological pathways for the conduct of meta-analyses. This database represents a novel, useful resource for RRD researchers. Copyright © 2012 by the American Thoracic Society.
Sequence analysis of the ribosomal RNA operon, particularly the internal transcribed spacer (ITS) region, provides a powerful tool for identification of mycorrhizal fungi. The sequence data deposited in the International Nucleotide Sequence Databases (INSD) are, however, unfiltered for quality and are often poorly annotated with metadata. To detect chimeric and low-quality sequences and assign the ectomycorrhizal fungi to phylogenetic lineages, fungal ITS sequences were downloaded from INSD, aligned within family-level groups, and examined through phylogenetic analyses and BLAST searches. By combining the fungal sequence database UNITE and the annotation and search tool PlutoF, we also added metadata from the literature to these accessions. Altogether 35,632 sequences belonged to mycorrhizal fungi or originated from ericoid and orchid mycorrhizal roots. Of these sequences, 677 were considered chimeric and 2,174 of low read quality. Information detailing country of collection, geographical coordinates, interacting taxon and isolation source was supplemented to cover 78.0%, 33.0%, 41.7% and 96.4% of the sequences, respectively. These annotated sequences are publicly available via UNITE (http://unite.ut.ee/) for downstream biogeographic, ecological and taxonomic analyses. In the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/), the annotated sequences have a special link-out to UNITE. We intend to expand the data annotation to additional genes and to all taxonomic groups and functional guilds of fungi.
Wennerberg, Pinar; Schulz, Klaus; Buitelaar, Paul
Searching for medical images and patient reports is a significant challenge in a clinical setting. The contents of such documents are often not described in sufficient detail, making it difficult to utilize the wealth of information contained within them. Semantic image annotation addresses this problem by describing the contents of images and reports using medical ontologies. Medical images and patient reports are then linked to each other through common annotations. Subsequently, search algorithms can more effectively find related sets of documents on the basis of these semantic descriptions. A prerequisite to realizing such a semantic search engine is that the data contained within should have been previously annotated with concepts from medical ontologies. One major challenge in this regard is the size and complexity of medical ontologies as annotation sources. Manual annotation is particularly time-consuming and labor-intensive in a clinical environment. In this article we propose an approach to reducing the size of clinical ontologies for more efficient manual image and text annotation. More precisely, our goal is to identify smaller fragments of a large anatomy ontology that are relevant for annotating medical images from patients suffering from lymphoma. Our work is in the area of ontology modularization, which is a recent and active field of research. We describe our approach, methods and data set in detail, and we discuss our results.
Bernt, Matthias; Donath, Alexander; Jühling, Frank; Externbrink, Fabian; Florentz, Catherine; Fritzsch, Guido; Pütz, Joern; Middendorf, Martin; Stadler, Peter F
About 2000 completely sequenced mitochondrial genomes are available from the NCBI RefSeq database, together with manually curated annotations of their protein-coding genes, rRNAs, and tRNAs. This annotation information, which has accumulated over two decades, has been obtained with a diverse set of computational tools and annotation strategies. Despite all efforts at manual curation, it is still plagued by misassignments of reading directions, erroneous gene names, and missing as well as false-positive annotations, in particular for the RNA genes. Taken together, this causes substantial problems for fully automatic pipelines that aim to use these data comprehensively for studies of animal phylogenetics and the molecular evolution of mitogenomes. The MITOS pipeline is designed to compute a consistent de novo annotation of mitogenomic sequences. We show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality. At the same time we avoid biases, inconsistencies of nomenclature, and typos originating from manual curation strategies. The MITOS pipeline is accessible online at http://mitos.bioinf.uni-leipzig.de.
Lambert, J.M.; Siezen, R.J.; Vos, W.M. de; Kleerebezem, M.
Most Gram-positive bacteria inhabiting the gastrointestinal tract are capable of hydrolysing bile salts. Bile salt hydrolysis is thought to play an important role in various biological processes in the host. Therefore, correct annotation of bacterial bile salt hydrolases (Bsh) in public databases (E
Full Text Available Database record describing amino acid sequence entries; each entry comprises the sequence ID (Field 1), the length of the amino acid sequence (Field 2), and the annotation of the amino acid sequence (Field 3).
Vongsangnak, Wanwipa; Olsen, Peter; Hansen, Kim;
Background: Since ancient times the filamentous fungus Aspergillus oryzae has been used in the fermentation industry for the production of fermented sauces and the production of industrial enzymes. Recently, the genome sequence of A. oryzae with 12,074 annotated genes was released, but hypothetical proteins accounted for more than 50% of the annotated genes. Considering the industrial importance of this fungus, it is therefore valuable to improve the annotation and further integrate genomic information with biochemical and physiological information available for this microorganism and other… to a genome scale metabolic model of A. oryzae. Results: From our assembled EST sequences we identified 1,046 newly predicted genes in the A. oryzae genome. Furthermore, it was possible to assign putative protein functions to 398 of the newly predicted genes. Notably, our annotation strategy resulted…
Wang, Jia; Chen, Dijun; Lei, Yang; Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-Ling
Citrus is one of the most important and most widely grown fruit crops, with global production ranking first among all fruit crops in the world. Sweet orange accounts for more than half of Citrus production, both as fresh fruit and as processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to researchers of sweet orange and other woody plants, and is freely available at http://citrus.hzau.edu.cn/.
Full Text Available Gclust Server: a database of amino acid sequences of predicted proteins and their annotation for 95 organism species.
Oliver Stephen G
Full Text Available Abstract Background Pfam is a general-purpose database of protein domain alignments and profile Hidden Markov Models (HMMs), which is very popular for the annotation of sequence data produced by genome sequencing projects. Pfam provides models that are often very general in terms of the taxa that they cover, and it has previously been suggested that such general models may lack some of the specificity or selectivity that would be provided by kingdom-specific models. Results Here we present a general approach to create domain libraries of HMMs for sub-taxa of a kingdom. Taking fungal species as an example, we construct a domain library of HMMs (called Fungal Pfam or FPfam) using sequences from 30 genomes, consisting of 24 species from the ascomycetes group and two basidiomycetes, Ustilago maydis, a fungal pathogen of maize, and the white rot fungus Phanerochaete chrysosporium. In addition, we include the microsporidian Encephalitozoon cuniculi, an obligate intracellular parasite, and two non-fungal species, the oomycetes Phytophthora sojae and Phytophthora ramorum, both plant pathogens. We evaluate the performance in terms of coverage against the original 30 genomes used in training FPfam and against five more recently sequenced fungal genomes that can be considered as an independent test set. We show that kingdom-specific models such as FPfam can find instances of both novel and well-characterized domains, increase overall coverage and detect more domains per sequence, with typically higher bitscores than Pfam for the same domain families. An evaluation of the effect of changing E-values on the coverage shows that the performance of FPfam is consistent over the range of E-values applied. Conclusion Kingdom-specific models are shown to provide improved coverage. However, as the models become more specific, some sequences found by Pfam may be missed by the models in FPfam, and some of the families represented in the test set are not present in FPfam
Full Text Available Abstract Background The population of HIV replicating within a host consists of independently evolving and interacting sub-populations that can be genetically distinct within anatomical compartments. HIV replicating within the brain causes neurocognitive disorders in up to 20-30% of infected individuals and is a viral sanctuary site for the development of drug resistance. The primary determinant of HIV neurotropism is macrophage tropism, which is primarily determined by the viral envelope (env) gene. However, studies of genetic aspects of HIV replicating in the brain are hindered because existing repositories of HIV sequences are not focused on neurotropic virus nor annotated with neurocognitive and neuropathological status. To address this need, we constructed the HIV Brain Sequence Database. Results The HIV Brain Sequence Database is a public database of HIV envelope sequences, directly sequenced from brain and other tissues from the same patients. Sequences are annotated with clinical data including viral load, CD4 count, antiretroviral status, neurocognitive impairment, and neuropathological diagnosis, all curated from the original publication. Tissue source is coded using an anatomical ontology, the Foundational Model of Anatomy, to capture the maximum level of detail available, while maintaining ontological relationships between tissues and their subparts. 44 tissue types are represented within the database, grouped into 4 categories: (i) brain, brainstem, and spinal cord; (ii) meninges, choroid plexus, and CSF; (iii) blood and lymphoid; and (iv) other (bone marrow, colon, lung, liver, etc.). Patient coding is correlated across studies, allowing sequences from the same patient to be grouped to increase statistical power. Using Cytoscape, we visualized relationships between studies, patients and sequences, illustrating interconnections between studies and the varying depth of sequencing, patient number, and tissue representation across studies
Full Text Available BACKGROUND: The desert locust (Schistocerca gregaria) displays a fascinating type of phenotypic plasticity, designated as 'phase polyphenism'. Depending on environmental conditions, one genome can be translated into two highly divergent phenotypes, termed the solitarious and gregarious (swarming) phase. Although many of the underlying molecular events remain elusive, the central nervous system (CNS) is expected to play a crucial role in the phase transition process. Locusts have also proven to be interesting model organisms in a physiological and neurobiological research context. However, molecular studies in locusts are hampered by the fact that genome/transcriptome sequence information available for this branch of insects is still limited. METHODOLOGY: We have generated 34,672 raw expressed sequence tags (ESTs) from the CNS of desert locusts in both phases. These ESTs were assembled into 12,709 unique transcript sequences and nearly 4,000 sequences were functionally annotated. Moreover, the obtained S. gregaria EST information is highly complementary to the existing orthopteran transcriptomic data. Since many novel transcripts encode neuronal signaling and signal transduction components, this paper includes an overview of these sequences. Furthermore, several transcripts being differentially represented in solitarious and gregarious locusts were retrieved from this EST database. The findings highlight the involvement of the CNS in the phase transition process and indicate that this novel annotated database may also add to the emerging knowledge of concomitant neuronal signaling and neuroplasticity events. CONCLUSIONS: In summary, we met the need for novel sequence data from desert locust CNS. To our knowledge, we hereby also present the first insect EST database that is derived from the complete CNS. The obtained S. gregaria EST data constitute an important new source of information that will be instrumental in further unraveling the molecular
Marais Gabriel AB
Full Text Available Abstract Background The genus Silene is widely used as a model system for addressing ecological and evolutionary questions in plants, but advances in using the genus as a model system are impeded by the lack of available resources for studying its genome. Massively parallel sequencing of cDNA has recently developed into an efficient method for characterizing the transcriptomes of non-model organisms, generating massive amounts of data that enable the study of multiple species in a comparative framework. The sequences generated provide an excellent resource for identifying expressed genes, characterizing functional variation and developing molecular markers, thereby laying the foundations for future studies on gene sequence and gene expression divergence. Here, we report the results of a comparative transcriptome sequencing study of eight individuals representing four Silene and one Dianthus species as outgroup. All sequences and annotations have been deposited in a newly developed and publicly available database called SiESTa, the Silene EST annotation database. Results A total of 1,041,122 EST reads were generated in two runs on a Roche GS-FLX 454 pyrosequencing platform. EST reads were analyzed separately for all eight individuals sequenced and were assembled into contigs using TGICL. These were annotated with results from BLASTX searches and Gene Ontology (GO) terms, and thousands of single-nucleotide polymorphisms (SNPs) were characterized. Unassembled reads were kept as singletons and together with the contigs contributed to the unigenes characterized in each individual. The high quality of unigenes is evidenced by the proportion (49%) that have significant hits in similarity searches with the A. thaliana proteome. The SiESTa database is accessible at http://www.siesta.ethz.ch. Conclusion The sequence collections established in the present study provide an important genomic resource for four Silene and one Dianthus species and will help to
Brister, James Rodney; Bao, Yiming; Kuiken, Carla; Lefkowitz, Elliot J; Le Mercier, Philippe; Leplae, Raphael; Madupu, Ramana; Scheuermann, Richard H; Schobel, Seth; Seto, Donald; Shrivastava, Susmita; Sterk, Peter; Zeng, Qiandong; Klimke, William; Tatusova, Tatiana
Improvements in DNA sequencing technologies portend a new era in virology and could possibly lead to a giant leap in our understanding of viral evolution and ecology. Yet, as viral genome sequences begin to fill the world's biological databases, it is critically important to recognize that the scientific promise of this era is dependent on consistent and comprehensive genome annotation. With this in mind, the NCBI Genome Annotation Workshop recently hosted a study group tasked with developing sequence, function, and metadata annotation standards for viral genomes. This report describes the issues involved in viral genome annotation and reviews policy recommendations presented at the NCBI Annotation Workshop.
This PhD work has been carried out under the organizational umbrella of the Kaj Munk Research Centre at Aalborg University, Denmark. Kaj Munk (1898–1944) was an influential and prolific playwright, journalist, pastor, and poet, whose influence was widely felt, both inside and outside of Denmark, during the period between World War I and World War II. He was murdered by the Gestapo in early January 1944 for his resistance stance. The two main tasks of the Kaj Munk Research Centre in which I have been involved during my PhD work are: a) digitizing the nachlass of Kaj Munk, and b) making the texts of Kaj Munk… linguistic annotations of text. As I show in Chapter 10, it is certainly the case that all of the annotations with which we in the Kaj Munk Research Centre have desired to enrich the Kaj Munk Corpus can be expressed in the EMdF model. The MQL query language is a "full access language", supporting the four…
Meister, M.J.; Kettenbrink, G.K.; Collins, A.G.
This annotated bibliography contains abstracts, prepared by the authors, of articles published between 1968 and early 1976 on tests of improved aqueous fluid injection methods (i.e., polymer and surfactant floods). The abstracts have been written and organized to facilitate studies of the oil recovery potential of polymer and surfactant floods under known reservoir conditions. 328 citations.
Beijbom, Oscar; Treibitz, Tali; Kline, David I.; Eyal, Gal; Khen, Adi; Neal, Benjamin; Loya, Yossi; Mitchell, B. Greg; Kriegman, David
Large-scale imaging techniques are used increasingly for ecological surveys. However, manual analysis can be prohibitively expensive, creating a bottleneck between collected images and desired data-products. This bottleneck is particularly severe for benthic surveys, where millions of images are obtained each year. Recent automated annotation methods may provide a solution, but reflectance images do not always contain sufficient information for adequate classification accuracy. In this work, the FluorIS, a low-cost modified consumer camera, was used to capture wide-band wide-field-of-view fluorescence images during a field deployment in Eilat, Israel. The fluorescence images were registered with standard reflectance images, and an automated annotation method based on convolutional neural networks was developed. Our results demonstrate a 22% reduction of classification error-rate when using both image types compared to only using reflectance images. The improvements were particularly large for the coral genera Platygyra, Acropora and Millepora, where classification recall improved by 38%, 33%, and 41%, respectively. We conclude that convolutional neural networks can be used to combine reflectance and fluorescence imagery in order to significantly improve automated annotation accuracy and reduce the manual annotation bottleneck.
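The fusion step described above, stacking registered reflectance and fluorescence images channel-wise before feeding them to a convolutional network, can be sketched as follows. This is a minimal illustration with invented image shapes and channel counts, not code from the study:

```python
# Sketch of input fusion: after registration, reflectance (RGB) and
# fluorescence channels are concatenated per pixel, so a CNN can learn
# from both modalities at once. Images are plain nested lists here;
# a real pipeline would use arrays/tensors.

def stack_channels(reflectance, fluorescence):
    """Concatenate per-pixel channels of two registered images.

    Each image is a list of rows; each row a list of pixels; each pixel
    a list of channel values. Images must share height and width.
    """
    if len(reflectance) != len(fluorescence):
        raise ValueError("images must be registered to the same height")
    fused = []
    for ref_row, flu_row in zip(reflectance, fluorescence):
        if len(ref_row) != len(flu_row):
            raise ValueError("images must be registered to the same width")
        fused.append([r + f for r, f in zip(ref_row, flu_row)])
    return fused

# A 1x2 toy image: 3 reflectance channels + 2 fluorescence channels -> 5.
rgb = [[[10, 20, 30], [40, 50, 60]]]
fluo = [[[1, 2], [3, 4]]]
fused = stack_channels(rgb, fluo)
```

The network itself then treats the extra fluorescence channels exactly like additional color channels of the input.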
Sara C Novais
Full Text Available Enchytraeus albidus (Oligochaeta) is an ecologically relevant species used as a standard test organism for risk assessment. Effects of stressors in this species are commonly determined at the population level, using reproduction and survival as endpoints. The assessment of transcriptomic responses can be very useful, e.g. to understand underlying mechanisms of toxicity with gene expression fingerprinting. In the present paper the following is addressed: (1) development of suppressive subtractive hybridization (SSH) libraries enriched for differentially expressed genes after metal and pesticide exposures; (2) sequencing and characterization of all generated cDNA inserts; (3) development of a publicly available genomic database on E. albidus. A total of 2,100 Expressed Sequence Tags (ESTs) were isolated, sequenced and assembled into 1,124 clusters (947 singletons and 177 contigs). From these sequences, 41% matched known proteins in GenBank (BLASTX, e-value ≤ 10^-5) and 37% had at least one Gene Ontology (GO) term assigned. In total, 5.5% of the sequences were assigned to a metabolic pathway, based on KEGG. With this new sequencing information, an Agilent custom oligonucleotide microarray was designed, representing a potential tool for transcriptomic studies. EnchyBASE (http://bioinformatics.ua.pt/enchybase/) was developed as a freely available web database containing genomic information on E. albidus and will be extended in the near future to other enchytraeid species. The database so far includes all ESTs generated for E. albidus from three cDNA libraries. This information can be downloaded and applied in functional genomics and transcription studies.
Siew Woh Choo
Full Text Available To facilitate the ongoing research of Vibrio spp., a dedicated platform for the Vibrio research community is needed to host the fast-growing amount of genomic data and to facilitate the analysis of these data. We present VibrioBase, a useful resource platform providing all basic features of a sequence database together with unique analysis tools that could be valuable for the Vibrio research community. VibrioBase currently houses a total of 252 Vibrio genomes, presented in a user-friendly manner to enable the analysis of these genomic data, particularly in the field of comparative genomics. Besides general data browsing features, VibrioBase offers analysis tools such as BLAST interfaces and the JBrowse genome browser. Other important features of this platform include our newly developed in-house tools: the pairwise genome comparison (PGC) tool and the pathogenomics profiling tool (PathoProT). The PGC tool is useful for the identification and comparative analysis of two genomes, whereas PathoProT is designed for comparative pathogenomics analysis of Vibrio strains. Both of these tools will enable researchers with little experience in bioinformatics to get meaningful information from Vibrio genomes with ease. We have tested the validity and suitability of these tools and features for use in next-generation database development.
Full Text Available Database record of cDNA clones derived from five developmental stages, with clone IDs linked to the Atlas database (http://dictycdb.biol.tsukuba.ac.jp/~tools/bin/ISH/index.html) and NBRP, together with their annotations (amino acid sequence and homology search results against dicty EST, DNA and protein databases).
Jin, Cong; Jin, Shu-Wei
Multi-label image annotation (MIA) has been widely studied in recent years and many MIA schemes have been proposed. However, most existing schemes are not satisfactory. In this paper, an improved multiple kernel learning (IMKL) method for support vector machines (SVM) is proposed to improve the classification accuracy of the SVM, and a novel MIA scheme based on IMKL is presented, which uses the discriminant loss to control the number of top semantic labels; a feature selection approach is also used to improve the performance of MIA. The experimental results show that the proposed MIA scheme achieves higher performance than existing MIA schemes, and its performance is satisfactory for large image datasets.
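The core idea of multiple kernel learning can be illustrated with a small sketch: the SVM uses a convex combination of base kernels rather than a single kernel. The kernel choices, weights, and the `gamma` parameter below are illustrative assumptions; the IMKL method in the paper additionally learns the weights from data:

```python
import math

# Minimal sketch of kernel combination in multiple kernel learning (MKL):
# the effective kernel is a convex combination of base kernels
# (weights are non-negative and sum to 1).

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian (RBF) kernel; gamma is an assumed hyperparameter.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def combined_kernel(x, y, weights=(0.5, 0.5)):
    """Convex combination of the base kernels above."""
    kernels = (linear_kernel, rbf_kernel)
    return sum(w * k(x, y) for w, k in zip(weights, kernels))

# For identical unit vectors both base kernels evaluate to 1,
# so any convex combination also evaluates to 1.
k = combined_kernel((1.0, 0.0), (1.0, 0.0))
```

An SVM trained with such a combined kernel can exploit complementary image features (e.g. color and texture) through different base kernels.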
Nikolaichik, Yevgeny; Damienikan, Aliaksandr U
The majority of bacterial genome annotations are currently automated and based on a 'gene by gene' approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn't fit with regulatory information allowed us to correct product and gene names for over 300 loci.
Montano, Maria F; Mehdi, Harshal; Nash, David B
The ambulatory care setting is an increasingly important component of the patient safety conversation. Inpatient safety is the primary focus of the vast majority of safety research and interventions, but the ambulatory setting is actually where most medical care is administered. Recent attention has shifted toward examining ambulatory care in order to implement better health care quality and safety practices. This annotated bibliography was created to analyze and augment the current literature on ambulatory care practices with regard to patient safety and quality improvement. By providing a thorough examination of current practices, potential improvement strategies in ambulatory care health care settings can be suggested. A better understanding of the myriad factors that influence delivery of patient care will catalyze future health care system development and implementation in the ambulatory setting.
Ankita Bhatewara, Kalyani Waghmare
Full Text Available The traditional database is designed for structured data and complex queries. In the cloud environment, the scale of data is very large, the data are non-structured, and requests for data are dynamic; these characteristics raise new challenges for data storage and administration. In this context, the NoSQL database comes into the picture. This paper discusses some non-structured databases. It also discusses the advantages and disadvantages of Cassandra and how Cassandra is used to improve the scalability of the network compared to an RDBMS.
Full Text Available Abstract Background Systems chemical biology and chemogenomics are considered critical, integrative disciplines in modern biomedical research, but require data mining of large, integrated, heterogeneous datasets from chemistry and biology. We previously developed an RDF-based resource called Chem2Bio2RDF that enabled querying of such data using the SPARQL query language. Whilst this work has proved useful in its own right as one of the first major resources in these disciplines, its utility could be greatly improved by the application of an ontology for annotation of the nodes and edges in the RDF graph, enabling a much richer range of semantic queries to be issued. Results We developed a generalized chemogenomics and systems chemical biology OWL ontology called Chem2Bio2OWL that describes the semantics of chemical compounds, drugs, protein targets, pathways, genes, diseases and side-effects, and the relationships between them. The ontology also includes data provenance. We used it to annotate our Chem2Bio2RDF dataset, making it a rich semantic resource. Through a series of scientific case studies we demonstrate how this (i) simplifies the process of building SPARQL queries, (ii) enables useful new kinds of queries on the data and (iii) makes possible intelligent reasoning and semantic graph mining in chemogenomics and systems chemical biology. Availability Chem2Bio2OWL is available at http://chem2bio2rdf.org/owl. The document is available at http://chem2bio2owl.wikispaces.com.
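The kind of query that an annotated RDF graph supports can be illustrated with a toy triple-pattern matcher (real Chem2Bio2RDF queries use SPARQL over a full triple store; the entity and predicate names below are invented for illustration):

```python
# A toy triple store: each fact is a (subject, predicate, object) triple,
# analogous to edges in an RDF graph annotated with ontology terms.
# All names below are illustrative, not taken from Chem2Bio2RDF.

TRIPLES = [
    ("aspirin", "type", "Drug"),
    ("ptgs2", "type", "ProteinTarget"),
    ("aspirin", "targets", "ptgs2"),
    ("aspirin", "hasSideEffect", "gi_bleeding"),
]

def query(pattern, triples=TRIPLES):
    """Return triples matching an (s, p, o) pattern; None is a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "What does aspirin target?" -- the analogue of a SPARQL basic graph pattern.
hits = query(("aspirin", "targets", None))
```

Ontology annotation is what makes the `type` and `targets` predicates meaningful across datasets, so that queries like this compose into the richer semantic queries the paper describes.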
De Loof Arnold
Full Text Available Abstract Background For holometabolous insects there has been an explosion of proteomic and peptidomic information thanks to large genome sequencing projects. Heterometabolous insects, although comprising many important species, have been far less studied. The migratory locust Locusta migratoria, a heterometabolous insect, is one of the most infamous agricultural pests. They undergo a well-known and profound phase transition from the relatively harmless solitary form to a ferocious gregarious form. The underlying regulatory mechanisms of this phase transition are not fully understood, but neuropeptides are undoubtedly involved. However, neuropeptide research in locusts is hampered by the absence of genomic information. Results Recently, EST (Expressed Sequence Tag) databases from Locusta migratoria were constructed. Using bioinformatical tools, we searched these EST databases specifically for neuropeptide precursors. Based on known locust neuropeptide sequences, we confirmed the sequence of several previously identified neuropeptide precursors (i.e. pacifastin-related peptides), which consolidated our method. In addition, we found two novel neuroparsin precursors and annotated the hitherto unknown tachykinin precursor. Besides one of the known tachykinin peptides, this EST contained an additional tachykinin-like sequence. Using neuropeptide precursors from Drosophila melanogaster as a query, we succeeded in annotating the Locusta neuropeptide F, allatostatin-C and ecdysis-triggering hormone precursor, which until now had not been identified in locusts or in any other heterometabolous insect. For the tachykinin precursor, the ecdysis-triggering hormone precursor and the allatostatin-C precursor, translation of the predicted neuropeptides in neural tissues was confirmed with mass spectrometric techniques. Conclusion In this study we describe the annotation of 6 novel neuropeptide precursors and the neuropeptides they encode from the
H Klei; K Kish; M Russo; S Michalczyk; M Cahn; J Tredup; C Chang; J Khan; E Baldwin
An effective process for screening, imaging, and optimizing crystallization trials using a combination of external and internal hardware and software has been deployed. The combination of this infrastructure with a vast annotated crystallization database enables the creation of custom crystallization screening strategies. Because of the strong chemotype-dependent crystallization observed with HCV NS3 protease (HCVPr), this strategy was applied to a chemotype resistant to all prior crystallization efforts. The crystallization database was mined for ingredients used to generate earlier HCVPr/inhibitor co-crystals. A random screen was created from the most prolific ingredients. A previously untested combination of proven ingredients was identified that led to a successful crystallization condition for the resistant chemotype.
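The database-mining step described above, ranking ingredients by how often they appear in previously successful co-crystal conditions and then sampling a random screen from the most prolific ones, can be sketched as follows. The ingredient names and screen sizes are invented for illustration; the real annotated database and ranking are far richer:

```python
import random
from collections import Counter

# Hypothetical past successful co-crystal conditions (invented ingredients).
past_hits = [
    ["PEG 3350", "ammonium sulfate", "HEPES pH 7.5"],
    ["PEG 3350", "sodium chloride", "Tris pH 8.5"],
    ["PEG 3350", "ammonium sulfate", "Tris pH 8.5"],
]

def prolific_ingredients(conditions, top_n=2):
    """Rank ingredients by how often they appear in successful conditions."""
    counts = Counter(ing for cond in conditions for ing in cond)
    return [ing for ing, _ in counts.most_common(top_n)]

def random_screen(ingredients, n_conditions=4, per_condition=2, seed=0):
    """Build a random screen from proven ingredients (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [rng.sample(ingredients, per_condition) for _ in range(n_conditions)]

top = prolific_ingredients(past_hits)          # most frequent ingredients first
screen = random_screen(top + ["Tris pH 8.5"])  # untested combinations of proven parts
```

The point, as in the abstract, is that previously untested combinations are drawn only from ingredients with a proven track record for the target.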
Full Text Available Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons, since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of third codon position GC content peaks to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%.
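The third-codon-position GC statistic ("GC3") that this analysis relies on is simple to compute. A minimal sketch, with an illustrative toy sequence (the ~60% GC threshold comes from the abstract; the function itself is a generic computation, not the authors' code):

```python
# In genomes with overall GC content above ~60%, true protein-coding
# regions show a strong G/C bias at the third codon position, so a GC3
# peak in the correct reading frame helps verify an annotated gene.

def gc3(coding_sequence):
    """Fraction of third codon positions that are G or C."""
    seq = coding_sequence.upper()
    # Indices 2, 5, 8, ... are the third position of each complete codon.
    thirds = [seq[i] for i in range(2, len(seq) - len(seq) % 3, 3)]
    if not thirds:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in thirds) / len(thirds)

# Third positions of ATG GCC GAA GGG are G, C, A, G -> 3 of 4 are G/C.
score = gc3("ATGGCCGAAGGG")
```

Scanning a window of candidate codons in each of the three reading frames and looking for the frame with the highest GC3 is the essence of the frame-plot check described above.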
Ely, Bert; Scott, LaTia Etheredge
Banfield Jillian F
Full Text Available Abstract Background Mass spectrometry-based metabolomics analyses have the potential to complement sequence-based methods of genome annotation, but only if raw mass spectral data can be linked to specific metabolic pathways. In untargeted metabolomics, the measured mass of a detected compound is used to define the location of the compound in chemical space, but uncertainties in mass measurements lead to "degeneracies" in chemical space, since multiple chemical formulae correspond to the same measured mass. We compare two methods to eliminate these degeneracies. One method relies on natural isotopic abundances, and the other relies on the use of stable-isotope labeling (SIL) to directly determine C and N atom counts. Both depend on combinatorial explorations of the "chemical space" of all possible chemical formulae built from biologically relevant chemical elements. Results Of 1532 metabolic pathways curated in the MetaCyc database, 412 contain a metabolite having a chemical formula unique to that metabolic pathway. Thus, chemical formulae alone can suffice to infer the presence of some metabolic pathways. Of 248,928 unique chemical formulae selected from the PubChem database, more than 95% had at least one degeneracy on the basis of accurate mass information alone. Consideration of natural isotopic abundance reduced degeneracy to 64%, but mainly for formulae less than 500 Da in molecular weight, and only if the error in the relative isotopic peak intensity was less than 10%. Knowledge of exact C and N atom counts as determined by SIL further reduced degeneracy, allowing determination of a unique chemical formula for 55% of the PubChem formulae. Conclusions To facilitate the assignment of chemical formulae to unknown mass-spectral features, profiling can be performed on cultures uniformly labeled with stable isotopes of nitrogen (15N) or carbon (13C). This makes it possible to accurately count the number of carbon and nitrogen atoms in
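The mass degeneracy at the heart of this abstract can be demonstrated with a brute-force enumeration of chemical space. This is a toy sketch under stated assumptions: the element set is restricted to CHNO, atom ranges and the mass tolerance are arbitrary, and it is not the combinatorial engine used in the study:

```python
# Sketch: enumerate all CHNO formulae whose monoisotopic mass falls within a
# tolerance of a measured mass, illustrating why a single measured mass is
# usually "degenerate" (maps to multiple candidate formulae).
from itertools import product

# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def candidate_formulae(mass: float, tol: float = 0.005, max_atoms: int = 20):
    """Return (nC, nH, nN, nO) tuples whose mass lies within tol Da of `mass`."""
    hits = []
    for c, h, n, o in product(range(max_atoms), repeat=4):
        m = (c * MONO["C"] + h * MONO["H"]
             + n * MONO["N"] + o * MONO["O"])
        if abs(m - mass) <= tol:
            hits.append((c, h, n, o))
    return hits
```

For example, a measured mass of 75.032 Da recovers glycine's formula C2H5NO2; widening the tolerance or the element set rapidly inflates the candidate list, which is the degeneracy the isotope-based filters are designed to break.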
US Fish and Wildlife Service, Department of the Interior — Annotated, coded bibliography based on holdings of Chugach National Forest office files, Alaska Dept. of Fish & Game Cordova office files, University of Alaska...
Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high-quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI, in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high-quality complete prokaryotic genomes are available as gold-standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is a historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819
Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.
Hypothetical and conserved hypothetical genes account for >30 percent of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved hypothetical (9.5 percent) along with 887 hypothetical genes (24.4 percent). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 hypothetical and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC-MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. Of these genes, 1212 were transcribed, with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner, with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.
Kong, Anthony Pak-Hin; Law, Sam-Po; Kwan, Connie Ching-Yin; Lai, Christy; Lam, Vivian
Gestures are commonly used together with spoken language in human communication. One major limitation of gesture investigations in the existing literature lies in the fact that the coding of forms and functions of gestures has not been clearly differentiated. This paper first described a recently developed Database of Speech and GEsture (DoSaGE) based on independent annotation of gesture forms and functions among 119 neurologically unimpaired right-handed native speakers of Cantonese (divided into three age and two education levels), and presented findings of an investigation examining how gesture use was related to age and linguistic performance. Consideration of these two factors, for which normative data are currently very limited or lacking in the literature, is relevant and necessary when one evaluates gesture employment among individuals with and without language impairment. Three speech tasks, including monologue of a personally important event, sequential description, and story-telling, were used for elicitation. The EUDICO Linguistic ANnotator (ELAN) software was used to independently annotate each participant's linguistic information of the transcript, forms of gestures used, and the function for each gesture. About one-third of the subjects did not use any co-verbal gestures. While the majority of gestures were non-content-carrying, which functioned mainly for reinforcing speech intonation or controlling speech flow, the content-carrying ones were used to enhance speech content. Furthermore, individuals who are younger or linguistically more proficient tended to use fewer gestures, suggesting that normal speakers gesture differently as a function of age and linguistic performance.
Sangrador-Vegas, Amaia; Mitchell, Alex L; Chang, Hsin-Yu; Yong, Siew-Yit; Finn, Robert D
The removal of annotation from biological databases is often perceived as an indicator of erroneous annotation. As a corollary, annotation stability is considered to be a measure of reliability. However, diverse data-driven events can affect the stability of annotations in both primary protein sequence databases and the protein family databases that are built upon the sequence databases and used to help annotate them. Here, we describe some of these events and their consequences for the InterPro database, and demonstrate that annotation removal or reassignment is not always linked to incorrect annotation by the curator. Database URL: http://www.ebi.ac.uk/interpro.
Gresham Cathy R
Full Text Available Abstract Background Modeling results from chicken microarray studies is challenging for researchers due to the limited functional annotation associated with these arrays. The Affymetrix GeneChip chicken genome array, one of the largest arrays serving as a key research tool for the study of chicken functional genomics, is among the few arrays that link gene products to Gene Ontology (GO). However, the GO annotation data presented by Affymetrix are incomplete; for example, they do not show references linked to manually annotated functions. In addition, there is no tool that allows microarray researchers to directly retrieve functional annotations for their datasets from the annotated arrays. This costs researchers considerable time searching multiple GO databases for functional information. Results We have improved the breadth of functional annotations of the gene products associated with probesets on the Affymetrix chicken genome array by 45% and the quality of annotation by 14%. We have also identified the most significant diseases and disorders, different types of genes, and known drug targets represented on the Affymetrix chicken genome array. To facilitate functional annotation of other arrays and microarray experimental datasets, we developed an Array GO Mapper (AGOM) tool to help researchers quickly retrieve corresponding functional information for their datasets. Conclusion Results from this study will directly facilitate annotation of other chicken arrays and microarray experimental datasets. Researchers will be able to quickly model their microarray datasets into more reliable biological functional information by using the AGOM tool. The diseases, disorders, gene types and drug targets revealed in the study will allow researchers to learn more about how genes function in complex biological systems and may lead to new drug discovery and development of therapies. The GO annotation data generated will be available for public use via the AgBase website and
Li, Hong; Yang, Shiping; Wang, Chuan; Zhou, Yuan; Zhang, Ziding
Knowledge about protein interaction sites provides detailed information on protein-protein interactions (PPIs). To date, nearly 20,000 PPIs from Arabidopsis thaliana have been identified. Nevertheless, interaction site information has been largely missing from previously published PPI databases. Here, we establish AraPPISite, a database that presents fine-grained interaction details for A. thaliana PPIs. First, the experimentally determined 3D structures of 27 A. thaliana PPIs are collected from the Protein Data Bank, and the predicted 3D structures of 3023 A. thaliana PPIs are modeled using two well-established template-based docking methods. For each experimental/predicted complex structure, AraPPISite not only provides an interactive user interface for browsing interaction sites, but also lists detailed evolutionary and physicochemical properties of these sites. Second, AraPPISite assigns domain-domain interactions or domain-motif interactions to 4286 PPIs whose 3D structures cannot be modeled. In this case, users can easily query protein interaction regions at the sequence level. AraPPISite is a free and user-friendly database that does not require user registration or any configuration on local machines. We anticipate that AraPPISite can serve as a helpful resource for users with less experience in structural biology or protein bioinformatics to probe the details of PPIs, and thus accelerate studies of plant genetics and functional genomics. AraPPISite is available at http://systbio.cau.edu.cn/arappisite/index.html .
Panigrahi, Priyabrata; Sule, Manas; Sharma, Ranu; Ramasamy, Sureshkumar; Suresh, C G
Bile salt hydrolases (BSHs) are gut microbial enzymes that play a significant role in the bile acid modification pathway. Penicillin V acylases (PVAs) are enzymes produced by environmental microbes, having a possible role in pathogenesis or scavenging of phenolic compounds in their microbial habitats. The correct annotation of such physiologically and industrially important enzymes is thus vital. The current methods relying solely on sequence homology do not always provide accurate annotations for these two members of the cholylglycine hydrolase (CGH) family as BSH/PVA enzymes. Here, we present an improved method [binding site similarity (BSS)-based scoring system] for the correct annotation of the CGH family members as BSH/PVA enzymes, which along with the phylogenetic information incorporates the substrate specificity as well as the binding site information. The BSS scoring system was developed through the analysis of the binding sites and binding modes of the available BSH/PVA structures with substrates glycocholic acid and penicillin V. The 198 sequences in the dataset were then annotated accurately using BSS scores as BSH/PVA enzymes. The dataset presented contained sequences from Gram-positive bacteria, Gram-negative bacteria and archaea. The clustering obtained for the dataset using the method described above showed a clear distinction in annotation of Gram-positive bacteria and Gram-negative bacteria. Based on this clustering and a detailed analysis of the sequences of the CGH family in the dataset, we could infer that the CGH genes might have evolved in accordance with the hypothesis stating the evolution of diderms and archaea from the monoderms.
Hamilton John P
Full Text Available Abstract Background Despite improvements in tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate (1) the submission of gene annotations to an annotation project, (2) the review of the submitted models by project annotators, and (3) the incorporation of the submitted models in the ongoing annotation effort. Results We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CAs) has been the annotation of gene families. Annotations can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules, and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website http://rice.tigr.org, as well as in the Community Annotation track of the Genome Browser. Conclusion We have applied EuCAP to rice. As of July 2007, the
Full Text Available The comprehensive collection of available light curves, prediction possibilities, and the online model fitting procedure available via the Exoplanet Transit Database have become very popular in the community. In this paper we summarize the changes made to the ETD during the last year, including the addition of the Kepler candidates to the prediction section, modeling of an unknown planet in the model-fit section, and some other small improvements. None of these new tools are described in the main ETD paper.
The Plant Ontology Consortium (POC, http://www.plantontology.org) is a collaborative effort among model plant genome databases and plant researchers that aims to create, maintain and facilitate the use of a controlled vocabulary (ontology) for plants. The ontology allows users to ascribe attributes o...
King Nichole L
Full Text Available Abstract Background Crucial foundations of any quantitative systems biology experiment are correct genome and proteome annotations. Protein databases compiled from high-quality empirical protein identifications that are in turn based on correct gene models increase the correctness, sensitivity, and quantitative accuracy of systems biology genome-scale experiments. Results In this manuscript, we present the Drosophila melanogaster PeptideAtlas, a fly proteomics and genomics resource of unsurpassed depth. Based on peptide mass spectrometry data collected in our laboratory, the portal http://www.drosophila-peptideatlas.org allows querying fly protein data observed with respect to gene model confirmation and splice site verification, as well as for the identification of proteotypic peptides suited for targeted proteomics studies. Additionally, the database provides consensus mass spectra for observed peptides, along with qualitative and quantitative information about the number of observations of a particular peptide and the sample(s) in which it was observed. Conclusion PeptideAtlas is an open-access database for the Drosophila community that has several features and applications that support (1) reduction of the complexity inherently associated with performing targeted proteomic studies, (2) design and acceleration of shotgun proteomics experiments, (3) confirmation or questioning of gene models, and (4) adjustment of gene models such that they are in line with observed Drosophila peptides. Although the database consists of proteomic data, the user is not required to be a proteomics expert.
Torres, José Salavert; Damian Segrelles Quilis, J; Espert, Ignacio Blanquer; García, Vicente Hernandez
Considerable effort has been invested in improving the image diagnosis process in different medical areas using information technologies. The field of medical imaging involves two main data types: medical images and reports. Developments based on the DICOM standard have proven to be a convenient and widespread solution within the medical community. The main objective of this work is to design a Web application prototype able to improve the diagnosis and follow-up of breast cancer patients. It is based on the TRENCADIS middleware, which provides a knowledge-oriented storage model composed of federated repositories of DICOM image studies and DICOM-SR medical reports. The full structure and contents of the diagnosis reports are used as metadata for indexing images. The TRENCADIS infrastructure takes full advantage of Grid technologies by deploying multi-resource grid services that enable multiple views (report schemes) of the knowledge database. The paper presents a real deployment of this Web application prototype in the Dr. Peset Hospital, providing radiologists with a tool to create, store and search diagnostic reports based on breast cancer explorations (mammography, magnetic resonance, ultrasound, pre-surgery biopsy and post-surgery biopsy), improving support for diagnostic decisions. Technical details of the use cases (outlining enhanced multi-resource grid service communication and processing steps) and interactions between actors and the deployed prototype are described. As a result, information is more structured, the logic is clearer, network messages have been reduced and, in general, the system is more resistant to failures.
Blohm, Philipp; Frishman, Goar; Smialowski, Pawel; Goebels, Florian; Wachinger, Benedikt; Ruepp, Andreas; Frishman, Dmitrij
Knowledge about non-interacting proteins (NIPs) is important for training the algorithms to predict protein-protein interactions (PPIs) and for assessing the false positive rates of PPI detection efforts. We present the second version of Negatome, a database of proteins and protein domains that are unlikely to engage in physical interactions (available online at http://mips.helmholtz-muenchen.de/proj/ppi/negatome). Negatome is derived by manual curation of literature and by analyzing three-dimensional structures of protein complexes. The main methodological innovation in Negatome 2.0 is the utilization of an advanced text mining procedure to guide the manual annotation process. Potential non-interactions were identified by a modified version of Excerbt, a text mining tool based on semantic sentence analysis. Manual verification shows that nearly a half of the text mining results with the highest confidence values correspond to NIP pairs. Compared to the first version the contents of the database have grown by over 300%.
National Aeronautics and Space Administration — The databases of computational and experimental data from the first Aeroelastic Prediction Workshop are located here. The databases file names tell their contents by...
This document presents Annotated English, a system of diacritical symbols which turns English pronunciation into a precise and unambiguous process. The annotations are defined and located in such a way that the original English text is not altered (not even a letter), thus allowing for a consistent reading and learning of the English language with and without annotations. The annotations are based on a set of general rules that make the frequency of annotations not dramatically high. This makes the reader easily associate annotations with exceptions, and makes it possible to shape, internalise and consolidate some rules for the English language which otherwise are weakened by the enormous amount of exceptions in English pronunciation. The advantages of this annotation system are manifold. Any existing text can be annotated without a significant increase in size. This means that we can get an annotated version of any document or book with the same number of pages and fontsize. Since no letter is affected, the ...
Full Text Available A database is an organized collection of data, and many users want to store personal and confidential data in such databases. Unauthorized persons may try to obtain data from a database and misuse it without the owner's knowledge. To overcome this problem, an advanced control mechanism known as database security was introduced. Encryption is one way to protect a database from the various threats and attackers that target confidential information. This paper discusses a proposed encryption algorithm for securing such databases.
Pedro Gabriel Melo de Barros e Silva
Full Text Available ABSTRACT OBJECTIVE: To report the initial changes after quality-improvement programs based on the STS database in a Brazilian hospital. METHODS: Since 2011 a Brazilian hospital has participated in the STS Database, and in 2012 multifaceted actions based on STS reports were implemented aiming at reductions in the time on mechanical ventilation and in intensive care stay, as well as improvements in evidence-based perioperative therapies among patients who underwent coronary artery bypass graft surgery. RESULTS: All 947 patients submitted to coronary artery bypass graft surgery from July 2011 to June 2014 were analyzed, and there was an improvement in all three target endpoints after the implementation of the quality-improvement program, although the reduction in time on mechanical ventilation was not statistically significant after adjusting for prognostic characteristics. CONCLUSION: The initial experience with the STS registry in a Brazilian hospital was associated with improvement in most of the targeted quality indicators.
Full Text Available Databases are deeply embedded in archaeology, underpinning and supporting many aspects of the subject. However, as well as providing a means for storing, retrieving and modifying data, databases themselves must be a result of a detailed analysis and design process. This article looks at this process, and shows how the characteristics of data models affect the process of database design and implementation. The impact of the Internet on the development of databases is examined, and the article concludes with a discussion of a range of issues associated with the recording and management of archaeological data.
Full Text Available Abstract Background Metabolomic studies are targeted at identifying and quantifying all metabolites in a given biological context. Among the tools used for metabolomic research, mass spectrometry is one of the most powerful. However, metabolomics by mass spectrometry always reveals a high number of unknown compounds, which complicates in-depth mechanistic or biochemical understanding. In principle, mass spectrometry can be utilized within strategies of de novo structure elucidation of small molecules, starting with the computation of the elemental composition of an unknown metabolite using accurate masses with small mass errors. Results High mass accuracy alone is not sufficient to determine elemental compositions; combining it with isotopic abundance pattern information can exclude more than 95% of false candidates. This orthogonal filter can condense several thousand candidates down to only a small number of molecular formulae. Example calculations for 10, 5, 3, 1 and 0.1 ppm mass accuracy are given. Corresponding software scripts can be downloaded from http://fiehnlab.ucdavis.edu. A comparison of eight chemical databases revealed that PubChem and the Dictionary of Natural Products can be recommended for automatic queries using molecular formulae. Conclusion More than 1.6 million molecular formulae in the range 0–500 Da were generated in an exhaustive manner under strict observation of mathematical and chemical rules. Assuming that ion species are fully resolved (either by chromatography or by high-resolution mass spectrometry), we conclude that a mass spectrometer capable of 3 ppm mass accuracy and 2% error for isotopic abundance patterns outperforms mass spectrometers with less than 1 ppm mass accuracy or even hypothetical mass spectrometers with 0.1 ppm mass accuracy that do not include isotope information in the calculation of molecular formulae.
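The ppm mass-accuracy figures quoted above translate into absolute search windows, and the isotope criterion acts as a second, independent filter. The sketch below shows both ideas; the helper names are assumptions, and only the 3 ppm / 2% thresholds come from the abstract:

```python
# Sketch: convert a ppm mass-accuracy specification into an absolute search
# window, and apply a simple isotopic-abundance check as an orthogonal filter.

def ppm_window(mass: float, ppm: float) -> tuple:
    """Absolute (low, high) mass window for a given ppm tolerance."""
    delta = mass * ppm * 1e-6
    return (mass - delta, mass + delta)

def passes_isotope_filter(measured_ratio: float,
                          theoretical_ratio: float,
                          max_err: float = 0.02) -> bool:
    """Accept a candidate formula only if its measured M+1/M isotopic
    intensity ratio agrees with theory within max_err (2% by default)."""
    return abs(measured_ratio - theoretical_ratio) <= max_err
```

At 3 ppm, a 500 Da ion has a search window only 3 mDa wide, yet that window can still contain many formulae; the isotope filter then removes most of the false candidates, which is the abstract's central point.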
McCarthy Fiona M
Full Text Available Abstract Background Systems biology modeling from microarray data requires the most contemporary structural and functional array annotation. However, microarray annotations, especially for non-commercial, non-traditional biomedical model organisms, are often dated. In addition, most microarray analysis tools do not readily accept EST clone names, which are abundantly represented on arrays. Manual re-annotation of microarrays is impracticable, and so we developed a computational re-annotation tool (ArrayIDer) to retrieve the most recent accession mapping files from public databases based on EST clone names or accessions and rapidly generate database accessions for entire microarrays. Results We utilized the Fred Hutchinson Cancer Research Centre 13K chicken cDNA array – a widely-used non-commercial chicken microarray – to demonstrate the principle that ArrayIDer could markedly improve annotation. We structurally re-annotated 55% of the entire array. Moreover, we decreased non-chicken functional annotations 2-fold. One beneficial consequence of our re-annotation was to identify 290 pseudogenes, of which 66 were previously incorrectly annotated. Conclusion ArrayIDer allows rapid automated structural re-annotation of entire arrays and provides multiple accession types for use in subsequent functional analysis. This information is especially valuable for systems biology modeling in the non-traditional biomedical model organisms.
王晓艳; 梁晋春; 姚颖颖; 马艳
With the development of broadcasting, the amount of radio audio data has increased dramatically, and the demands for automation and efficient retrieval in radio station program content management are growing. Based on an analysis of radio digital audio database construction and the problems in content management, this paper focuses on several key technologies involved in automatic metadata annotation and retrieval of audio data, with the expectation that in the future they can be applied directly to the content management of broadcast programs and improve work efficiency.
Elbæk, Mikael Karstensen; Heller, Alfred; Pedersen, Gert Schmeltz
A re-implementation of the research database of the Technical University of Denmark, DTU, is based on Fedora. The backbone consists of content models for primary and secondary entities and their relationships, giving flexible and powerful extraction capabilities for interoperability and reporting...
Ohri-Vachaspati, Punam; Martinez, Diane; Yedidia, Michael J; Petlick, Nirvana
PURPOSE. Assessing food environments often requires using commercially available data. Disparate methods used for classifying food outlets in these databases call for creating a classification approach using common definitions. A systematic strategy for reclassifying food stores and restaurants, as they appear in commercial databases, into categories that differentiate the availability of healthy options is described here. DESIGN AND SETTING. Commercially available data on food outlets including names, addresses, North American Industry Classification System codes, and associated characteristics was obtained for five New Jersey communities. ANALYSIS. A reclassification methodology was developed using criteria and definitions from the literature to categorize food outlets based on availability of healthy options. Information in the database was supplemented by systematic Internet and key word searches, and from phone calls to food outlets. RESULTS. The methodology resulted in 622 supermarket/grocery stores, 183 convenience stores, and 148 specialty stores in the original data to be reclassified into 58 supermarkets, 30 grocery stores, 692 convenience stores, and 115 specialty stores. Outlets from the original list of 1485 full-service restaurants and 506 limited-service restaurants were reclassified as 563 full-service restaurants and 1247 limited-service restaurants. Reclassification resulted in less than one-seventh the number of supermarkets and grocery stores, more than three times the number of convenience stores, and twice as many limited-service restaurants-a much less healthy profile than the one generated by using exclusively the commercial databases. CONCLUSION. An explicit and replicable strategy is proposed for reclassifying food outlets in commercial databases into categories that differentiate on the basis of healthy food availability. The intent is to contribute towards building a consensus among researchers on definitions used in public health
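The reclassification idea above can be sketched as a mapping from a commercial record's NAICS code plus a few cues to a health-differentiating category. The chain names, the square-footage cutoff, and the decision rules below are illustrative assumptions, not the study's full protocol; only the NAICS codes themselves (445110 grocery, 445120 convenience, 722511/722513 restaurants) are standard:

```python
# Sketch: reclassify a commercial-database food outlet into categories that
# differentiate availability of healthy options. Criteria are illustrative.

SUPERMARKET_CHAINS = {"shoprite", "stop & shop", "acme"}  # hypothetical list

def reclassify(name, naics, square_feet=None):
    n = name.lower()
    if naics.startswith("445110"):  # NAICS: supermarkets and grocery stores
        # Split the combined NAICS class using chain membership or store size.
        if any(chain in n for chain in SUPERMARKET_CHAINS) or (square_feet or 0) >= 15000:
            return "supermarket"
        return "grocery store"
    if naics.startswith("445120"):  # NAICS: convenience stores
        return "convenience store"
    if naics.startswith("7225"):    # NAICS: restaurants
        return ("full-service restaurant" if naics.startswith("722511")
                else "limited-service restaurant")
    return "specialty store"
```

The study supplemented such rules with Internet searches and phone calls, which a pure code mapping cannot replace; the sketch only shows why the combined NAICS grocery class must be split to separate supermarkets from small grocers.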
Full Text Available This research details the development and validation of an updated constituent tidal database for the Western North Atlantic, Caribbean and Gulf of Mexico (WNAT) region, referred to as the EC2015 database. Regional databases such as EC2015 provide much higher resolution than global databases, allowing users to more accurately define the tidal forcing on smaller sub-region domains. The database last underwent major updates in 2001 and was developed using the two-dimensional, depth-integrated form of the coastal hydrodynamic model ADvanced CIRCulation (ADCIRC), which solves the shallow-water equations in the generalized wave continuity equation form. Six main areas of improvement are examined: (1) placement of the open ocean boundary; (2) higher coastal resolution using Vertical Datum (VDatum) models; (3) updated bathymetry from global databases; (4) updated boundary forcing, compared using two global tidal databases; (5) updated bottom friction formulations; and (6) improved model physics by incorporating the advective terms in ADCIRC. The skill of the improved database is compared to that of its predecessor and is calculated using harmonic data from National Oceanic and Atmospheric Administration Center for Operational Oceanographic Products and Services (NOAA CO-OPS) stations and historic International Hydrographic Organization (IHO) data. Overall, the EC2015 database significantly reduces errors realized in the EC2001 database and improves the quality of coastal tidal constituents available for smaller sub-regional models in the WNAT region.
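For context, the depth-integrated continuity equation underlying ADCIRC can be written in its standard textbook form (not reproduced from the abstract; here ζ is the free-surface elevation, h the bathymetric depth, and U and V the depth-averaged velocities):

```latex
\frac{\partial \zeta}{\partial t}
  + \frac{\partial (UH)}{\partial x}
  + \frac{\partial (VH)}{\partial y} = 0,
\qquad H = \zeta + h .
```

The generalized wave continuity equation is obtained by adding the time derivative of this equation to \(\tau_0\) times the equation itself, with the momentum equations substituted for the time derivatives of the fluxes; \(\tau_0\) is a numerical weighting parameter, and the resulting second-order form damps the spurious spatial oscillations that affect primitive-equation finite-element solutions.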
Lopez, D. [UCLA; Casero, D. [UCLA; Cokus, S. J. [UCLA; Merchant, S. S. [UCLA; Pellegrini, M. [UCLA
The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii and will include additional genomes in the future. The site allows users to interpret large gene lists by identifying associated functional terms and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion.
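Term enrichment over a submitted gene list, the core operation such a tool performs, is conventionally a hypergeometric test. The abstract does not specify the tool's statistics, so the upper-tail hypergeometric below is an assumption; the sketch is self-contained using only the standard library:

```python
from math import comb

def enrichment_p(hits, draws, successes, population):
    """Hypergeometric upper-tail P(X >= hits): the chance of seeing at
    least `hits` genes carrying a term in a list of `draws` genes, when
    `successes` of the `population` background genes carry that term."""
    total = comb(population, draws)
    return sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(hits, min(draws, successes) + 1)
    ) / total

# Hypothetical numbers: 5 of 20 submitted genes carry a term annotated
# to 50 genes in a 15,000-gene background -- a strong enrichment.
p = enrichment_p(hits=5, draws=20, successes=50, population=15_000)
```

A tool would repeat this for every term associated with the list and correct for multiple testing.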
Goodswen, Stephen J; Barratt, Joel L N; Kennedy, Paul J; Ellis, John T
Neospora caninum is an apicomplexan parasite which can cause abortion in cattle, creating a major economic burden. Vaccination has been proposed as the most cost-effective control measure to alleviate this burden. Consequently, the overriding aspiration for N. caninum research is the identification and subsequent evaluation of vaccine candidates in animal models. To save time, cost and effort, it is now feasible to use an in silico approach for vaccine candidate prediction. Precise protein sequences, derived from the correct open reading frame, are paramount and arguably the most important factor determining the success or failure of this approach. The challenge is that publicly available N. caninum sequences are mostly derived from gene predictions. Annotation inaccuracies can lead to erroneously predicted vaccine candidates by bioinformatics programs. This study evaluates the current N. caninum annotation for potential inaccuracies. Comparisons with annotation from a closely related pathogen, Toxoplasma gondii, are also made to distinguish patterns of inconsistency. More importantly, an mRNA sequencing (RNA-Seq) experiment is used to validate the annotation. Potential discrepancies originating from a questionable start codon context and exon boundaries were identified in 1943 protein coding sequences. We conclude, where experimental data were available, that the majority of N. caninum gene sequences were reliably predicted. Nevertheless, almost 28% of genes were identified as questionable. Given the limitations of RNA-Seq, the intention of this study was not to replace the existing annotation but to support or oppose particular aspects of it. Ideally, many studies aimed at improving the annotation are required to build a consensus. We believe this study, in providing a new resource on gene structure and annotation, is a worthy contributor to this endeavour.
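One simple check of the kind RNA-Seq evidence enables is comparing an annotated start codon against the longest open reading frame in the assembled transcript. The sketch below is illustrative (the study's actual validation pipeline is not detailed in the abstract), with a toy transcript and a hypothetical annotated start position:

```python
STOPS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG-initiated open reading
    frame on the forward strand (end exclusive), or (-1, -1) if none."""
    seq = seq.upper()
    best = (-1, -1)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                  # first ATG opens the ORF
            elif codon in STOPS and start is not None:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)  # keep the longest ORF
                start = None
    return best

# Toy transcript; a real check would use an assembled RNA-Seq transcript
# and the CDS start coordinate from the existing genome annotation.
transcript = "CCATGGCTGATAAGTTTGGCTGA"
start, end = longest_orf(transcript)
agrees = (start == 2)  # hypothetical annotated start at position 2
```

Disagreement between the two coordinates flags a gene as having a questionable start codon context, one of the discrepancy classes the study reports.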
Tie Hua Zhou
Full Text Available The ever-increasing quantities of digital photo resources are annotated with enriching vocabularies to form semantic annotations. Photo-sharing social networks have boosted the need for efficient and intuitive querying to respond to user requirements in large-scale image collections. To help users formulate efficient and effective image retrieval, we present a novel integration of a probabilistic model into a keyword-query architecture that models the probability distribution of image annotations, allowing users to obtain satisfactory results from image retrieval via the integration of multiple annotations. We focus on the annotation integration step in order to specify the meaning of each image annotation, thus leading to the most representative annotations of the intent of a keyword search. For this demonstration, we show how a probabilistic model has been integrated with semantic annotations to allow users to intuitively define explicit and precise keyword queries in order to retrieve satisfactory image results distributed across heterogeneous large data sources. Our experiments on the SBU database (collected by Stony Brook University) show that (i) our integrated annotation contains higher-quality representatives and semantic matches; and (ii) annotation integration can indeed improve image search result quality.
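One way to read "integrating multiple annotations into a probability distribution" is normalized vote counting across annotation sources, with keyword queries scored under an independence assumption. The following is an illustrative sketch, not the paper's model; the smoothing floor and image names are invented:

```python
from collections import Counter

def tag_distribution(annotation_sources):
    """Merge annotations from several sources into one distribution
    P(tag | image) by normalized vote counting across sources."""
    votes = Counter(tag for source in annotation_sources for tag in source)
    total = sum(votes.values())
    return {tag: n / total for tag, n in votes.items()}

def score(query_terms, dist):
    """Score an image for a keyword query as the product of each query
    term's probability (a small floor stands in for smoothing)."""
    p = 1.0
    for term in query_terms:
        p *= dist.get(term, 1e-6)
    return p

# Two hypothetical images, each annotated by three independent sources.
images = {
    "img1": [["dog", "grass"], ["dog", "park"], ["dog"]],
    "img2": [["cat", "sofa"], ["cat"], ["dog", "cat"]],
}
dists = {name: srcs for name, srcs in
         ((n, tag_distribution(s)) for n, s in images.items())}
ranked = sorted(images, key=lambda n: score(["dog"], dists[n]), reverse=True)
```

Tags supported by several sources dominate the distribution, so the query "dog" ranks the consistently dog-annotated image first.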
Pesch, Robert; Lysenko, Artem; Hindle, Matthew; Hassani-Pak, Keywan; Thiele, Ralf; Rawlings, Christopher; Köhler, Jacob; Taubert, Jan
The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and to predict the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database Ara-Cyc), which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system, which is freely available from http://ondex.sf.net/.
Full Text Available Abstract Background Campylobacter jejuni is the leading bacterial cause of human gastroenteritis in the developed world. To improve our understanding of this important human pathogen, the C. jejuni NCTC11168 genome was sequenced and published in 2000. The original annotation was a milestone in Campylobacter research, but is outdated. We now describe the complete re-annotation and re-analysis of the C. jejuni NCTC11168 genome using current database information, novel tools and annotation techniques not used during the original annotation. Results Re-annotation was carried out using sequence database searches such as FASTA, along with programs such as TMHMM for additional support. The re-annotation also utilises sequence data from additional Campylobacter strains and species not available during the original annotation. Re-annotation was accompanied by a full literature search that was incorporated into the updated EMBL file [EMBL: AL111168]. The C. jejuni NCTC11168 re-annotation reduced the total number of coding sequences from 1654 to 1643, of which 90.0% have additional information regarding the identification of new motifs and/or relevant literature. Re-annotation has led to 18.2% of coding sequence product functions being revised. Conclusions Major updates were made to genes involved in the biosynthesis of important surface structures such as lipooligosaccharide, capsule and both O- and N-linked glycosylation. This re-annotation will be a key resource for Campylobacter research and will also provide a prototype for the re-annotation and re-interpretation of other bacterial genomes.
Gibson, Molly K; Forsberg, Kevin J; Dantas, Gautam
Antibiotic resistance is a dire clinical problem with important ecological dimensions. While antibiotic resistance in human pathogens continues to rise at alarming rates, the impact of environmental resistance on human health is still unclear. To investigate the relationship between human-associated and environmental resistomes, we analyzed functional metagenomic selections for resistance against 18 clinically relevant antibiotics from soil and human gut microbiota as well as a set of multidrug-resistant cultured soil isolates. These analyses were enabled by Resfams, a new curated database of protein families and associated highly precise and accurate profile hidden Markov models, confirmed for antibiotic resistance function and organized by ontology. We demonstrate that the antibiotic resistance functions that give rise to the resistance profiles observed in environmental and human-associated microbial communities significantly differ between ecologies. Antibiotic resistance functions that most discriminate between ecologies provide resistance to β-lactams and tetracyclines, two of the most widely used classes of antibiotics in the clinic and agriculture. We also analyzed the antibiotic resistance gene composition of over 6000 sequenced microbial genomes, revealing significant enrichment of resistance functions by both ecology and phylogeny. Together, our results indicate that environmental and human-associated microbial communities harbor distinct resistance genes, suggesting that antibiotic resistance functions are largely constrained by ecology.
Full Text Available AcEST: EST sequences of Adiantum capillus-veneris and their annotation. Data name: AcEST. Description of data contents: EST sequences of Adiantum capillus-veneris and their annotation (...db/view/archive_acest#en). Data acquisition method: capillary sequencer. Data analysis: ...atabases). Number of data entries: Adiantum capillus-veneris ESTs: 30,540. Data items: Clone id (clone ID of the EST sequence of Adiantum capillus-veneris); Library (library name); Length (Nu...
Full Text Available Dicty_cDB: Contig sequences and their annotation (amino acid sequence and results of homology search), and expression profile. Description of data contents: Contig... (TogoDB edition only). Annotation: translated amino acid sequence; representative ami... ( ...u-tokyo.ac.jp/ ), the program used to predict the localization of proteins in the cell. Expression profile: ...o acid sequence and results of homology search), and expression profile. - Dicty_cDB | LSDB Archive
Harboe, Kirstine M; Bardram, Linda
To evaluate whether quality improvements in the performance of cholecystectomy have been achieved in Denmark since 2006, after revision of the Danish National Guidelines for the treatment of gallstones...
Full Text Available ...together with predicted molecular formulae and putative structures, were provided as metabolite annotations. Comparison with public databases was performed. A grading system was introduced to describe the evidence supporting the annotations.
Camon, Evelyn B
Full Text Available Abstract Background The Gene Ontology Annotation (GOA) database http://www.ebi.ac.uk/GOA aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and of proteins requiring characterization increases, the manual processing capability can become overloaded. Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA depend largely on exploiting the knowledge in existing resources such as InterPro. However, in recent years, text mining has been hailed as a potentially useful tool to aid the curation process. To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge. BioCreAtIvE task 2 was an experiment to test if automatically derived classification using information retrieval and extraction could assist expert biologists in the annotation of the GO vocabulary to the proteins in the UniProt Knowledgebase. GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an
Kanehisa, Minoru; Sato, Yoko; Kawashima, Masayuki; Furumichi, Miho; Tanabe, Mao
KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an integrated database resource for biological interpretation of genome sequences and other high-throughput data. Molecular functions of genes and proteins are associated with ortholog groups and stored in the KEGG Orthology (KO) database. The KEGG pathway maps, BRITE hierarchies and KEGG modules are developed as networks of KO nodes, representing high-level functions of the cell and the organism. Currently, more than 4000 complete genomes are annotated with KOs in the KEGG GENES database, which can be used as a reference data set for KO assignment and subsequent reconstruction of KEGG pathways and other molecular networks. As an annotation resource, the following improvements have been made. First, each KO record is re-examined and associated with protein sequence data used in experiments of functional characterization. Second, the GENES database now includes viruses, plasmids, and the addendum category for functionally characterized proteins that are not represented in complete genomes. Third, new automatic annotation servers, BlastKOALA and GhostKOALA, are made available utilizing the non-redundant pangenome data set generated from the GENES database. As a resource for translational bioinformatics, various data sets are created for antimicrobial resistance and drug interaction networks.
Daniel J. Dougherty
Full Text Available Alloy is a lightweight modeling formalism based on relational algebra. In prior work with Fisler, Giannakopoulos, Krishnamurthi, and Yoo, we have presented a tool, Alchemy, that compiles Alloy specifications into implementations that execute against persistent databases. The foundation of Alchemy is an algorithm for rewriting relational algebra formulas into code for database transactions. In this paper we report on recent progress in improving the robustness and efficiency of this transformation.
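The core of such a transformation is a recursive rewrite from relational-algebra operators to SQL. The toy analogue below handles only base relations, selection, and projection; Alchemy itself handles full Alloy specifications and transactional updates, so this is an illustrative miniature, not its algorithm:

```python
def to_sql(expr):
    """Translate a tiny relational-algebra AST into SQL.
    Nodes: ("rel", name), ("select", condition, child),
           ("project", columns, child)."""
    kind = expr[0]
    if kind == "rel":
        return f"SELECT * FROM {expr[1]}"
    if kind == "select":
        _, cond, child = expr
        return f"SELECT * FROM ({to_sql(child)}) AS t WHERE {cond}"
    if kind == "project":
        _, cols, child = expr
        return f"SELECT {', '.join(cols)} FROM ({to_sql(child)}) AS t"
    raise ValueError(f"unknown node: {kind}")

# pi_name(sigma_{age > 30}(Person)), using hypothetical relation names
query = to_sql(("project", ["name"],
                ("select", "age > 30",
                 ("rel", "Person"))))
```

Each operator becomes a nested subquery, which a real compiler would then flatten and wrap in a transaction.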
Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer;
eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making ...
Brimhall, Bradley B; Hall, Timothy E; Walczak, Steven
A hospital laboratory relational database, developed over eight years, has demonstrated significant cost savings and a substantial financial return on investment (ROI). In addition, the database has been used to measurably improve laboratory operations and the quality of patient care.
Wattam, Alice R.; Davis, James J.; Assaf, Rida; Boisvert, Sébastien; Brettin, Thomas; Bun, Christopher; Conrad, Neal; Dietrich, Emily M.; Disz, Terry; Gabbard, Joseph L.; Gerdes, Svetlana; Henry, Christopher S.; Kenyon, Ronald W.; Machi, Dustin; Mao, Chunhong; Nordberg, Eric K.; Olsen, Gary J.; Murphy-Olson, Daniel E.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Vonstein, Veronika; Warren, Andrew; Xia, Fangfang; Yoo, Hyunseung; Stevens, Rick L.
The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by ‘virtual integration’ to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics. PMID:27899627
Kyrpides, Nikos; Angiuoli, Samuel V.; Cochrane, Guy; Field, Dawn; Garrity, George; Gussman, Aaron; Kodira, Chinnappa D.; Klimke, William; Kyrpides, Nikos; Madupu, Ramana; Markowitz, Victor; Tatusova, Tatiana; Thomson, Nick; White, Owen
Genome annotations describe the features of genomes and accompany sequences in genome databases. The methodologies used to generate genome annotation are diverse and typically vary amongst groups. Descriptions of the annotation procedure are helpful in interpreting genome annotation data. Standard Operating Procedures (SOPs) for genome annotation describe the processes that generate genome annotations. Some groups are currently documenting procedures but standards are lacking for structure and content of annotation SOPs. In addition, there is no central repository to store and disseminate procedures and protocols for genome annotation. We highlight the importance of SOPs for genome annotation and endorse a central online repository of SOPs.
Full Text Available Background: Biocides for prolonging the shelf life of a large variety of materials have been used extensively over the last decades. Worldwide biocide consumption was estimated at about 12.4 billion dollars in 2011 and was expected to increase in 2012. As biocides are substances we come into contact with in our everyday lives, access to this type of information is of paramount importance in order to ensure an appropriate living environment. Consequently, a database where information may be quickly processed, sorted, and easily accessed, according to different search criteria, is the most desirable solution. The main aim of this work was to design and implement a relational database with complete information about biocides used in public health management to improve the quality of life. Methods: Design and implementation of a relational database for biocides, using the software "phpMyAdmin". Results: A database which allows for the efficient collection, storage, and management of information, including chemical properties and applications of a large quantity of biocides, as well as its adequate dissemination into the public health environment. Conclusion: The information contained in the database presented here promotes the adequate use of biocides by means of information technologies, which in consequence may help achieve important improvements in our quality of life.
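A relational schema of the kind described can be sketched with SQLite standing in for the MySQL/phpMyAdmin stack the authors used; all table and column names here are illustrative assumptions, not taken from the paper:

```python
import sqlite3

# In-memory stand-in for the biocide database; tables and columns are
# hypothetical examples of properties and applications.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE biocide (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        cas_number TEXT UNIQUE,     -- chemical registry identifier
        formula TEXT
    );
    CREATE TABLE application (
        id INTEGER PRIMARY KEY,
        biocide_id INTEGER NOT NULL REFERENCES biocide(id),
        use_category TEXT NOT NULL  -- e.g. disinfectant, preservative
    );
""")
conn.execute("INSERT INTO biocide VALUES (1, 'Triclosan', '3380-34-5', 'C12H7Cl3O2')")
conn.execute("INSERT INTO application VALUES (1, 1, 'disinfectant')")

# One of the search criteria the abstract mentions: query by use category.
rows = conn.execute("""
    SELECT b.name FROM biocide b
    JOIN application a ON a.biocide_id = b.id
    WHERE a.use_category = 'disinfectant'
""").fetchall()
```

Normalizing applications into their own table lets one biocide carry many uses, which is what makes per-criterion searching efficient.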
Cosmin Catalin Olteanu, Ph.D.
Full Text Available The general idea is to improve the NSER database into a strong, unified information system where data are collected from all universities. Employers and all academic institutions can check someone's background easily through a national portal just by logging in. As a result of the paper, the author found that this system has its flaws but can be improved.
Full Text Available A robust automatic micro-expression recognition system would have broad applications in national safety, police interrogation, and clinical diagnosis. Developing such a system requires high-quality databases with sufficient training samples, which are currently not available. We reviewed the previously developed micro-expression databases and built an improved one (CASME II), with higher temporal resolution (200 fps) and spatial resolution (about 280×340 pixels on the facial area). We elicited participants' facial expressions in a well-controlled laboratory environment with proper illumination (such as removing light flickering). Among nearly 3000 facial movements, 247 micro-expressions were selected for the database, with action units (AUs) and emotions labeled. For baseline evaluation, LBP-TOP and SVM were employed for feature extraction and classification respectively, with the leave-one-subject-out cross-validation method. The best performance is 63.41% for 5-class classification.
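The leave-one-subject-out protocol used for the baseline can be sketched as a split generator; subject IDs, features, and labels below are placeholders, and a real run would plug LBP-TOP features and an SVM into each split:

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train_indices, test_indices) where each
    test set contains all samples from exactly one subject, so a
    classifier is never evaluated on a person it trained on.
    `samples` is a list of (subject_id, feature_vector, label)."""
    subjects = sorted({s for s, _, _ in samples})
    for held_out in subjects:
        train = [i for i, (s, _, _) in enumerate(samples) if s != held_out]
        test = [i for i, (s, _, _) in enumerate(samples) if s == held_out]
        yield held_out, train, test

# Toy dataset: four clips from three subjects.
data = [("s1", [0.1], "happiness"), ("s1", [0.3], "disgust"),
        ("s2", [0.2], "surprise"), ("s3", [0.4], "repression")]
splits = list(leave_one_subject_out(data))
```

Subject-level splitting matters for expression data because clips from the same person are highly correlated; random per-clip splits would inflate accuracy.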
Full Text Available PowerGet annotation (A1) records for samples SE1_AM1 through SE17_AM1, SE20_AM1, SE25_AM1 through SE36_AM1, SE40_AM1 and SE41_AM1, each describing the same process. In the annotation process, KEGG, KNApSAcK and LipidMAPS are used for the primary database search. Peaks with no hit to these databases are then selected for a secondary search using the exactMassDB and Pep1000 databases (for SE40_AM1 and SE41_AM1, the EX-HR2 databases at http://webs2.kazusa.or.jp/mfsearcher/). After the database search processes, each database hit is m...
Renjit, J Arokia
Association rule mining (ARM) is an active data mining research area, and most ARM algorithms are designed for a centralized environment. Centralized mining of distributed databases is not always feasible, because merging data sets from different sites incurs large network communication costs. In this paper, an improved algorithm for distributed data mining is proposed. Each local site runs the improved LMatrix algorithm to calculate local support counts, and a designated centre site manages every message exchanged to obtain all globally frequent itemsets. Using LMatrix also reduces the time spent scanning the partitioned database, which improves the algorithm's performance. The goal of the research is therefore a distributed algorithm for geographically distributed data sets that offers lower communication costs, better running efficiency, and stronger scalability than the direct application of a sequential algorithm in d...
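The two-tier scheme described above (local support counting at each site, global aggregation at a centre site) can be sketched as follows. This is a hedged illustration, not the actual LMatrix implementation: the function and variable names are invented, and the local scan is a naive itemset enumeration rather than the matrix-based scan the paper proposes.

```python
from collections import Counter
from itertools import combinations

def local_support_counts(transactions, max_len=2):
    """Local site: count itemset supports (a stand-in for the LMatrix scan)."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts

def centre_site_merge(site_counts, total_txns, min_support=0.5):
    """Centre site: sum the local counts and keep globally frequent itemsets."""
    merged = Counter()
    for counts in site_counts:
        merged.update(counts)
    threshold = min_support * total_txns
    return {s: c for s, c in merged.items() if c >= threshold}

# Two local sites, four transactions in total.
site_a = [["bread", "milk"], ["bread", "beer"]]
site_b = [["bread", "milk"], ["milk"]]
counts = [local_support_counts(site_a), local_support_counts(site_b)]
frequent = centre_site_merge(counts, total_txns=4, min_support=0.5)
```

Only the compact count tables travel between sites, which is where the communication saving over shipping raw transactions comes from.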
Amit Kumar Yadav
Full Text Available The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since complete separation between true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. Modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2%, respectively. The approach is applicable to different search methodologies: separate as well as concatenated database search, high mass accuracy, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results, where it again improved on the primary search results. We have shown that an appropriate threshold learnt from decoys can be very effective in improving database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.
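The baseline that FlexiFDR improves upon, target-decoy FDR estimation at a fixed score threshold, can be sketched in a few lines: the FDR at a cutoff is estimated as the number of decoy hits passing the cutoff divided by the number of target hits passing it. This is a generic illustration of the standard decoy-counting formula, not FlexiFDR's charge-state-specific regression; the function names are invented.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Classic target-decoy FDR estimate: decoys passing / targets passing."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0

def threshold_at_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR stays within max_fdr."""
    for t in sorted(set(target_scores)):
        if decoy_fdr(target_scores, decoy_scores, t) <= max_fdr:
            return t
    return None
```

A dynamic method like FlexiFDR replaces the single global cutoff with one learnt per charge state from the decoy distribution.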
Type B trichothecenes, which pose a serious hazard to consumer health, occur worldwide in grains. These mycotoxins are produced mainly by three different trichothecene genotypes/chemotypes: 3ADON (3-acetyldeoxynivalenol), 15ADON (15-acetyldeoxynivalenol) and NIV (nivalenol), named after their three major mycotoxin compounds. Correct identification of these genotypes is essential for all studies relating to population surveys, fungal ecology and mycotoxicology. Trichothecene producers exhibit enormous strain-dependent chemical diversity, which may result in variation in the levels of the genotype's determining toxin and in the production of low to high amounts of atypical compounds. New high-throughput DNA-sequencing technologies promise to boost the diagnostics of mycotoxin genotypes. However, this requires a reference database containing a satisfactory taxonomic sampling of sequences showing high correlation to the actually produced chemotypes. We believe that one of the most pressing current challenges for such a database is linking molecular identification with the chemical diversity of the strains, as well as with other metadata. In this study, we use the Tri12 gene, involved in mycotoxin biosynthesis, for identification of Tri genotypes through sequence comparison. Tri12 sequences from a range of geographically diverse fungal strains comprising 22 Fusarium species were stored in the ToxGen database, which provides descriptive and up-to-date annotations such as the Tri genotype and chemotype of the strains, chemical diversity, information on the trichothecene-inducing host, substrate or media, geographical locality, and the most recent taxonomic affiliations. The present initiative bridges the gap between the demands of comprehensive studies on trichothecene producers and the existing nucleotide sequence databases, which lack toxicological and other auxiliary data. We invite researchers working in the fields of fungal taxonomy, epidemiology and mycotoxicology to join the
Seifert, Sascha; Kelm, Michael; Moeller, Manuel; Mukherjee, Saikat; Cavallaro, Alexander; Huber, Martin; Comaniciu, Dorin
Diagnosis and treatment planning for patients can be significantly improved by comparison with clinical images of other patients with similar anatomical and pathological characteristics. This requires the images to be annotated using a common vocabulary from clinical ontologies. Current approaches to such annotation are typically manual, consume extensive clinician time, and cannot be scaled to the large amounts of imaging data in hospitals. On the other hand, automated image analysis, while very scalable, does not leverage standardized semantics and thus cannot be reused across specific applications. In our work, we describe an automated and context-sensitive workflow based on an image parsing system complemented by an ontology-based, context-sensitive annotation tool. A unique characteristic of our framework is that it brings together the diverse paradigms of machine-learning-based image analysis and ontology-based modeling for accurate and scalable semantic image annotation.
Jewison, Timothy; Su, Yilu; Disfany, Fatemeh Miri; Liang, Yongjie; Knox, Craig; Maciejewski, Adam; Poelzer, Jenna; Huynh, Jessica; Zhou, You; Arndt, David; Djoumbou, Yannick; Liu, Yifeng; Deng, Lu; Guo, An Chi; Han, Beomsoo; Pon, Allison; Wilson, Michael; Rafatnia, Shahrzad; Liu, Philip; Wishart, David S
The Small Molecule Pathway Database (SMPDB, http://www.smpdb.ca) is a comprehensive, colorful, fully searchable and highly interactive database for visualizing human metabolic, drug action, drug metabolism, physiological activity and metabolic disease pathways. SMPDB contains >600 pathways with nearly 75% of its pathways not found in any other database. All SMPDB pathway diagrams are extensively hyperlinked and include detailed information on the relevant tissues, organs, organelles, subcellular compartments, protein cofactors, protein locations, metabolite locations, chemical structures and protein quaternary structures. Since its last release in 2010, SMPDB has undergone substantial upgrades and significant expansion. In particular, the total number of pathways in SMPDB has grown by >70%. Additionally, every previously entered pathway has been completely redrawn, standardized, corrected, updated and enhanced with additional molecular or cellular information. Many SMPDB pathways now include transporter proteins as well as much more physiological, tissue, target organ and reaction compartment data. Thanks to the development of a standardized pathway drawing tool (called PathWhiz) all SMPDB pathways are now much more easily drawn and far more rapidly updated. PathWhiz has also allowed all SMPDB pathways to be saved in a BioPAX format. Significant improvements to SMPDB's visualization interface now make the browsing, selection, recoloring and zooming of pathways far easier and far more intuitive. Because of its utility and breadth of coverage, SMPDB is now integrated into several other databases including HMDB and DrugBank.
We compare the costs of semantic annotation of textual documents to its benefits for information processing tasks. Semantic annotation can improve the performance of retrieval tasks and facilitates an improved search experience through faceted search, focused retrieval, better document summaries, an
Schwartz, David Charles; Severin, Jessica
There are provided computer systems for visualizing and annotating single molecule images. Annotation systems in accordance with this disclosure allow a user to mark and annotate single molecules of interest and their restriction enzyme cut sites, thereby determining the restriction fragments of single nucleic acid molecules. In certain embodiments, the markings and annotations may be automatically generated by the system and overlaid translucently onto the single molecule images. An image caching system may be implemented in the computer annotation systems to reduce image processing time. The annotation systems include one or more connectors to one or more databases capable of storing single molecule data as well as other biomedical data. Such a diverse array of data can be retrieved and used to validate the markings and annotations. The annotation systems may be implemented and deployed over a computer network. They may be ergonomically optimized to facilitate user interactions.
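The image caching idea mentioned in the disclosure can be illustrated with a minimal LRU cache: a rendered tile is produced only on a cache miss, and the least recently used tile is evicted when capacity is exceeded. This is a generic sketch, not the patented system; the class and method names are invented, and a rendered "tile" is simulated with a string.

```python
from collections import OrderedDict

class ImageTileCache:
    """Minimal LRU cache for rendered image tiles (illustrative only)."""
    def __init__(self, capacity=128):
        self.capacity = capacity
        self._tiles = OrderedDict()

    def get(self, key, render):
        if key in self._tiles:
            self._tiles.move_to_end(key)        # cache hit: mark most recently used
            return self._tiles[key]
        tile = render(key)                      # cache miss: expensive render
        self._tiles[key] = tile
        if len(self._tiles) > self.capacity:
            self._tiles.popitem(last=False)     # evict least recently used tile
        return tile
```

Repeated panning over the same region of a single-molecule image then costs one render instead of many.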
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from its sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach, centered around and exploiting the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position; the cytoplasmic, transmembrane or extracellular behavior of the query; and the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were
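The weighted, position-specific scoring idea can be illustrated with a toy sketch in which each pattern's weight is the inverse of its database occurrence count, so over-represented patterns contribute less to every position they cover. This is an invented simplification for illustration, not the Bio-Dictionary's actual weighting scheme; the function name and the weight formula are assumptions.

```python
def position_scores(query, pattern_counts):
    """Score each query position by the patterns covering it, down-weighting
    patterns that are over-represented in the database (weight = 1/count).

    pattern_counts: {pattern_string: occurrences_in_database} (illustrative).
    """
    scores = [0.0] * len(query)
    for pattern, count in pattern_counts.items():
        weight = 1.0 / count
        start = query.find(pattern)
        while start != -1:
            for i in range(start, start + len(pattern)):
                scores[i] += weight     # every covered position gains the weight
            start = query.find(pattern, start + 1)
    return scores
```

A per-position score profile like this is what lets the method report similarity and structural signals "as a function of amino acid position within the query".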
Full Text Available Abstract Background Large-scale and reliable functional annotation of proteins is a major challenge in modern biology. Phylogenetic analyses have been shown to be important for such tasks. However, until now, phylogenetic annotation did not take expression data (e.g. ESTs, microarrays, SAGE) into account. Integrating such data, like ESTs, into phylogenetic annotation could therefore be a major advance in post-genomic analyses. We developed an approach enabling the combination of expression data and phylogenetic analysis. To illustrate our method, we used an example protein family, the peptidyl arginine deiminases (PADs), probably implicated in rheumatoid arthritis. Results The analysis was performed as follows: we built a phylogeny of PAD proteins from the NCBI's NR protein database. We completed the phylogenetic reconstruction of PADs using an enlarged sequence database containing translations of EST contigs. We then extracted all corresponding expression data contained in the EST database. This analysis allowed us (1) to extend the spectrum of homolog-containing species and to improve the reconstruction of the genes' evolutionary history; (2) to deduce an accurate gene expression pattern for each member of this protein family; and (3) to show a correlation between paralogous sequences' evolution rate and their pattern of tissular expression. Conclusion Coupling phylogenetic reconstruction and expression data is a promising way of analysis that could be applied to all multigenic families to investigate the relationship between molecular and transcriptional evolution and to improve functional annotation.
Nakao, Mitsuteru; Okamoto, Shinobu; Kohara, Mitsuyo; Fujishiro, Tsunakazu; Fujisawa, Takatomo; Sato, Shusei; Tabata, Satoshi; Kaneko, Takakazu; Nakamura, Yasukazu
CyanoBase (http://genome.kazusa.or.jp/cyanobase) is the genome database for cyanobacteria, which are model organisms for photosynthesis. The database houses cyanobacteria species information, complete genome sequences, genome-scale experiment data, gene information, gene annotations and mutant information. In this version, we updated these datasets and improved the navigation and the visual display of the data views. In addition, a web service API now enables users to retrieve the data in various formats with other tools, seamlessly.
Full Text Available Abstract Background Improvements in genome sequence annotation revealed discrepancies in the original probeset/gene assignments in Affymetrix microarrays, and differences between the annotations and the effective alignments of probes and transcription products. In the current generation of Affymetrix human GeneChips, most probesets include probes matching transcripts from more than one gene, and probes which do not match any transcribed sequence. Results We developed a novel set of custom Chip Definition Files (CDF) and the corresponding Bioconductor libraries for Affymetrix human GeneChips, based on the information contained in the GeneAnnot database. GeneAnnot-based CDFs are composed of unique custom probesets, including only probes matching a single gene. Conclusion GeneAnnot-based custom CDFs solve the problem of reliably reconstructing expression levels and eliminate the existence of more than one probeset per gene, which often leads to discordant expression signals for the same transcript when gene differential expression is the focus of the analysis. GeneAnnot CDFs are freely distributed and fully compliant with Affymetrix standards and all available software for gene expression analysis. The CDF libraries are available from http://www.xlab.unimo.it/GA_CDF, along with supplementary information (CDF libraries, installation guidelines and R code, CDF statistics, and analysis results).
Cress, U.; Barquero, B.; Schwan, S.; Hesse, F. W.
Shared databases are used for knowledge exchange in groups. Whether a person is willing to contribute knowledge to a shared database presents a social dilemma: Each group member saves time and energy by not contributing any information to the database and by using the database only to retrieve information which was contributed by others. But if…
Spraggins Thomas A
Full Text Available Abstract Background Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace) recovered using methylation filtration technology, and the provision of annotation and analysis of the sequence data. Description CGKB, the Cowpea Genespace/Genomics Knowledge Base, is an annotation knowledge base developed under the CGI. The database is based on information derived from 298,848 cowpea genespace sequences (GSS) isolated by methylation filtering of genomic DNA. The CGKB consists of three knowledge bases: a GSS annotation and comparative genomics knowledge base, a GSS enzyme and metabolic pathway knowledge base, and a GSS simple sequence repeats (SSRs) knowledge base for molecular marker discovery. A homology-based approach was applied for annotation of the GSS, mainly using BLASTX against four public FASTA-formatted protein databases (NCBI GenBank Proteins, UniProtKB-Swiss-Prot, UniProtKB-PIR (Protein Information Resource), and UniProtKB-TrEMBL). Comparative genome analysis was done by BLASTX searches of the cowpea GSS against four plant proteomes, from Arabidopsis thaliana, Oryza sativa, Medicago truncatula, and Populus trichocarpa. The possible exons and introns on each cowpea GSS were predicted using the HMM-based Genscan gene prediction program and the
Lin, Jian-Wei; Lai, Yuan-Cheng
This paper harnesses collaborative annotations by students as learning feedback on online formative assessments to improve the learning achievements of students. Through the developed Web platform, students can conduct formative assessments, collaboratively annotate, and review historical records in a convenient way, while teachers can generate…
Wang, Dayong; Hoi, Steven C H; He, Ying; Zhu, Jianke; Mei, Tao; Luo, Jiebo
Retrieval-based face annotation is a promising paradigm of mining massive web facial images for automated face annotation. This paper addresses a critical problem of this paradigm, i.e., how to effectively perform annotation by exploiting similar facial images and their weak labels, which are often noisy and incomplete. In particular, we propose an effective Weak Label Regularized Local Coordinate Coding (WLRLCC) technique, which exploits the principle of local coordinate coding to learn sparse features, and employs graph-based weak label regularization to enhance the weak labels of similar facial images. We present an efficient optimization algorithm to solve the WLRLCC task. We conduct extensive empirical studies on two large-scale web facial image databases: (i) a Western celebrity database with a total of 6,025 persons and 714,454 web facial images, and (ii) an Asian celebrity database with 1,200 persons and 126,070 web facial images. The encouraging results validate the efficacy of the proposed WLRLCC algorithm. To further improve efficiency and scalability, we also propose a PCA-based approximation scheme and an offline approximation scheme (AWLRLCC), which generally maintain comparable results while saving substantial time. Finally, we show that WLRLCC can also tackle two existing face annotation tasks with promising performance.
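The local coordinate coding step, representing a query feature by its k nearest anchor points, can be sketched with a ridge-regularized least-squares fit over those anchors. This is a simplified stand-in for WLRLCC's sparse coding: the function name, the ridge regularization, and the hard k-nearest selection are assumptions, and the weak-label regularization term is omitted entirely.

```python
import numpy as np

def local_coordinate_code(x, bases, k=2, lam=1e-3):
    """Encode x over its k nearest anchor points (simplified LCC sketch)."""
    d = np.linalg.norm(bases - x, axis=1)          # distance to each anchor
    idx = np.argsort(d)[:k]                        # keep the k nearest anchors
    B = bases[idx]                                 # (k, dim) local basis
    # ridge-regularized least squares: min ||x - c @ B||^2 + lam * ||c||^2
    G = B @ B.T + lam * np.eye(k)
    c = np.linalg.solve(G, B @ x)
    code = np.zeros(len(bases))                    # sparse code over all anchors
    code[idx] = c
    return code

bases = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
code = local_coordinate_code(np.array([1.0, 0.0]), bases, k=2)
```

The resulting code is sparse by construction: only the nearby anchors get nonzero weights, which is the locality principle the paper builds on.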
Full Text Available Abstract Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are
Koptev, V. Yu
This work presents the results of a study of the basic interrelated criteria for individual equipment units of a transport-network machine fleet, as a function of production and mining factors, with the aim of improving transport system management. Justifying the selection of a control system requires new methodologies and models, augmented with stability and transport-flow criteria and accounting for the dynamics of mining work on mining sites. A necessary condition is accounting for the technical and operating parameters of vehicle operation. Modern open-pit mining dispatching systems must include this kind of information database. An algorithm for forming a machine fleet is presented, based on solving a multi-variant task of defining reasonable operating features for a machine working as part of a complex. The proposals in this work may apply to mining machines (drilling equipment, excavators), construction equipment (bulldozers, cranes, pile-drivers), city transport and other types of production activity that use a machine fleet.
Gene Ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck of database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on narrative sentences in the biomedical literature. This article presents our work in this task as well as the experimental results obtained after the competition. For th...
Hunter, J; Henderson, M; Khan, I
This paper describes the AnnoCryst system, a tool designed to enable authenticated collaborators to share online discussions about 3D crystallographic structures through the asynchronous attachment, storage, and retrieval of annotations. Annotations are personal comments, interpretations, questions, assessments, or references that can be attached to files, data, digital objects, or Web pages. The AnnoCryst system enables annotations to be attached to 3D crystallographic models retrieved from either private local repositories (e.g., Fedora) or public online databases (e.g., the Protein Data Bank or the Inorganic Crystal Structure Database) via a Web browser. The system uses the Jmol plugin for viewing and manipulating the 3D crystal structures, but extends Jmol by providing an additional interface through which annotations can be created, attached, stored, searched, browsed, and retrieved. The annotations are stored on a standardized Web annotation server (Annotea), which has been extended to support 3D macromolecular structures. Finally, the system is embedded within a security framework that is capable of authenticating users and restricting access only to trusted colleagues.
Sabariah, F J; Ramesh, N; Mahathar, A W
The first Malaysian National Trauma Database was launched in May 2006 with five tertiary referral centres to determine fundamental data on major trauma, to evaluate major trauma management, and to develop guidelines for improved trauma care. A prospective study, using standardized and validated questionnaires, was carried out from May 2006 until April 2007 for all cases admitted and referred to the participating hospitals. During the one-year period, 123,916 trauma patients were registered, of which 933 (0.75%) were classified as major trauma. Patients with blunt injury made up 83.9% of cases, and road traffic accidents accounted for 72.6% of injuries, with 64.9% involving motorcyclists and pillion riders. 42.8% had severe head injury with an admission Glasgow Coma Scale (GCS) of 3-8, and a Revised Trauma Score (RTS) of 5-6 was recorded in 28.8% of patients. The distribution of the Injury Severity Score (ISS) showed that 42.9% of cases were in the range 16-24. Only 1.9% and 6.3% of the patients were reviewed by an Emergency Physician and a Surgeon, respectively. Patients with an admission systolic blood pressure of less than 90 mmHg had a death rate of 54.6%. Patients with severe head injury (GCS ... The report has successfully demonstrated its significance in providing essential data on major trauma in Malaysia; further expansion of the study may yield a more comprehensive trauma database for this country.
Full Text Available Abstract Background Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However, there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting them). The question therefore arises as to whether current genome annotations are systematically missing small genes. Results We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Conclusions Prokaryotic gene finders and prokaryotic genome annotations require improvement for the accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
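The unit of comparison in the methodology above is the ORF of at least 33 aa. A minimal six-frame ORF scanner illustrating that definition might look like the sketch below; the function name and the simple ATG-to-stop rule are assumptions for illustration, and no cross-genome similarity comparison is performed.

```python
def find_orfs(seq, min_aa=33):
    """Scan all six reading frames for ORFs of at least min_aa codons
    between a start codon (ATG) and a stop codon (simplified sketch)."""
    stops = {"TAA", "TAG", "TGA"}
    comp = str.maketrans("ACGT", "TGCA")
    orfs = []
    for strand in (seq, seq.translate(comp)[::-1]):   # forward + reverse complement
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                          # open an ORF at the first ATG
                elif codon in stops and start is not None:
                    if (i - start) // 3 >= min_aa:     # long enough to report
                        orfs.append(strand[start:i + 3])
                    start = None
    return orfs

orf_seq = "ATG" + "GCT" * 33 + "TAA"   # one ORF just over the 33-aa cutoff
orfs = find_orfs(orf_seq)
```

Small genes sit right at this length cutoff, which is why gene finders tend to over- or under-predict them.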
Wang, Qian; He, Beixin Julie; Zhao, Hongyu
Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain- and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions were highly enriched for GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline. PMID:27058395
Full Text Available This is a pilot study that aims to reorganize a course to better serve learners' learning styles. In essence, it is a case study to improve the performance of the Database Management Systems course in the department of Computer Education and Instructional Technologies (CEIT) at Uludag University. Learning styles of students were analyzed through Felder-Soloman's Index of Learning Styles (ILS). Part of the data was collected during Spring 2009. The participants were the students of the respective course. Findings showed that participants were mostly visual, active and sensory type learners. They were balanced on the sequential-global dimension. No significant relationship was found between learning styles and achievement scores. This result forms appropriate pre-study conditions for the upcoming study. It was decided that, for the upcoming study, different learning materials suited to the characteristics of the participants would be developed, and blended learning was proposed as the delivery method.
Seiler, Roland, Ed.; Hartmann, Wolfgang, Ed.
Annotated bibliography of 220 books, monographs, and journal articles on orienteering published 1984-94, from SPOLIT database of the Federal Institute of Sport Science (Cologne, Germany). Annotations in English or German. Ten sections including psychological, physiological, health, sociological, and environmental aspects; training and coaching;…
Kadarmideen, Haja; Do, Duy Ngoc; Janss, Luc;
in their contribution to estimated genomic variances and in prediction of genomic breeding values by applying SNP annotation approaches to feed efficiency. Ensembl Variant Predictor (EVP) and Pig QTL database were used as the source of genomic annotation for 60K chip. Genomic prediction was performed using the Bayes...... prove useful for less heritable traits such as diseases and fertility...
Ohkubo, Hiroo [Mitsubishi Research Institute, Inc., Tokyo (Japan)
This year, we improved and upgraded the waste disposal FEP information database system that had been developed at JNC through last year. In this year's improvements, a standard graphical user interface builder was used instead of commercial tools limited to particular models, making the system available for both server and client regardless of platform and simplifying the operating environment. As a result of these improvements, the system is flexible with respect to future expandability (use of standard tools) and maintainability (standardized operations). Additionally, by efficiently adding information about the biosphere to the system, the biosphere scenario can now be handled by the same operations as the conventional groundwater scenario. Going forward, the user interface and operability must be upgraded toward practical application, and the information must be published through the internet (with public views on the FEP scenario analyzed by collecting opinions from the general public in an interactive manner). Assuming these tasks are carried out, the remaining issues are to incorporate the planning of the FEP scenario and the influence diagram, along with their explanations, and to make the influence diagram clearer and more comprehensible, so that waste disposal FEP scenario development research can be distributed to general users in a comprehensive form. (author)
Full Text Available Towards an event annotated corpus of Polish The paper presents a typology of events built on the basis of the TimeML specification adapted to the Polish language. Some changes were introduced to the definitions of the event categories, and a motivation for the event categorization was formulated. The event annotation task is presented on two levels – the ontology level (language independent) and text mentions (language dependent). The various types of event mentions in Polish text are discussed. A procedure for annotation of event mentions in Polish texts is presented and evaluated. In the evaluation, a randomly selected set of documents from the Corpus of Wrocław University of Technology (called KPWr) was annotated by two linguists and the inter-annotator agreement was calculated. The evaluation was done in two iterations. After the first evaluation we revised and improved the annotation procedure. The second evaluation showed a significant improvement in the agreement between annotators. The current work focused on annotation and categorisation of event mentions in text. Future work will focus on the description of events with a set of attributes, arguments and relations.
Full Text Available Abstract Background Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature, which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature. Results We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner database, literature tracking, and annotation of new minimotifs. MimoSA enables the visualization, organization, selection and editing functions of minimotifs and their attributes in the MnM database. For the literature components, MimoSA provides paper status tracking and scoring of papers for annotation through a freely available machine learning approach, which is based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database. Conclusions MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to
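Word-correlation paper scoring of the kind described for TextMine could, in spirit, look like the following naive sketch. This is not the published TextMine algorithm; the profile-and-score scheme and all example texts are illustrative.

```python
from collections import Counter

def word_profile(texts):
    """Aggregate relative word frequencies over a set of training
    abstracts known to describe minimotifs."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(abstract, positive_profile):
    """Score an abstract by the average profile weight of its words:
    abstracts sharing vocabulary with known minimotif papers rank higher."""
    words = abstract.lower().split()
    return sum(positive_profile.get(w, 0.0) for w in words) / max(len(words), 1)

positive = ["minimotif binding domain peptide", "short peptide motif recognition"]
profile = word_profile(positive)
hit = score("a novel peptide motif mediates binding", profile)
miss = score("galaxy cluster redshift survey", profile)
```

An abstract sharing vocabulary with the positive set scores above one that shares none, which is the ranking behavior an annotation-triage tool needs.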
Han, Na; Yu, Weiwen; Qiang, Yujun
Type IV secretion system (T4SS) can mediate the passage of macromolecules across cellular membranes and is essential for virulence and genetic material exchange among bacterial species. The Type IV Secretion Project 2.0 (T4SP 2.0) database is an improved and extended version of the platform released in 2013, aimed at assisting with the detection of Type IV secretion systems (T4SS) in bacterial genomes. This advanced version provides users with web server tools for detecting the existence and variations of T4SS genes online. The new interface for the genome browser provides user-friendly access to the most complete and accurate resource of T4SS gene information (e.g., gene number, name, type, position, sequence, related articles, and quick links to other websites). Currently, this online database includes T4SS information for 5239 bacterial strains. Conclusions. T4SS is one of the most versatile secretion systems necessary for the virulence and survival of bacteria and the secretion of protein and/or DNA substrates from a donor to a recipient cell. This database on virB/D genes of the T4SS system will help scientists worldwide to improve their knowledge of secretion systems and also identify potential pathogenic mechanisms of various microbial species.
Full Text Available TMBETA-GENOME Database Description General information of database Database name TMBETA-GENOME Alternative n...oinfo/Gromiha/ Database classification Protein sequence databases - Protein prope...: Eukaryota Taxonomy ID: 2759 Database description TMBETA-GENOME is a database for transmembrane β-barrel pr...lgorithms and statistical methods have been performed and the annotation results are accumulated in the database.... Features and manner of utilization of database Users can download lists of sequences predicted as β-bar
Kuhn, R M; Karolchik, D; Zweig, A S
The University of California, Santa Cruz Genome Browser Database contains, as of September 2006, sequence and annotation data for the genomes of 13 vertebrate and 19 invertebrate species. The Genome Browser displays a wide variety of annotations at all scales from the single nucleotide level up t...
Hinrichs, A S; Karolchik, D; Baertsch, R
The University of California Santa Cruz Genome Browser Database (GBD) contains sequence and annotation data for the genomes of about a dozen vertebrate species and several major model organisms. Genome annotations typically include assembly data, sequence composition, genes and gene predictions, ...
Full Text Available BACKGROUND: In metabolomics research using mass spectrometry (MS), systematic searching of high-resolution mass data against compound databases is often the first step of metabolite annotation, to determine elemental compositions possessing similar theoretical mass numbers. However, incorrect hits derived from errors in mass analyses will be included in the results of elemental composition searches. To assess the quality of peak annotation information, a novel methodology for false discovery rate (FDR) evaluation is presented in this study. Based on the FDR analyses, several aspects of an elemental composition search, including setting a threshold, estimating the FDR, and the types of elemental composition databases most reliable for searching, are discussed. METHODOLOGY/PRINCIPAL FINDINGS: The FDR can be determined from one measured value (i.e., the hit rate for search queries) and four parameters determined by Monte Carlo simulation. The results indicate that relatively high FDR values (30-50%) were obtained when searching time-of-flight (TOF) MS data using the KNApSAcK and KEGG databases. In addition, searches against large all-in-one databases (e.g., PubChem) always produced unacceptable results (FDR >70%). The estimated FDRs suggest that the quality of search results can be improved not only by performing more accurate mass analysis but also by modifying the properties of the compound database. A theoretical analysis indicates that the FDR could be improved by using a compound database with fewer but more complete entries. CONCLUSIONS/SIGNIFICANCE: High-accuracy mass analysis, such as Fourier transform (FT) MS, is needed for reliable annotation (FDR <10%). In addition, a small, customized compound database is preferable for high-quality annotation of metabolome data.
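The decoy-style logic behind such an FDR estimate can be sketched as follows. The Monte Carlo procedure here (random masses drawn over the query range, chance-hit rate taken as the false-hit rate) is a simplified stand-in for the paper's four-parameter method; the tolerance and database values are invented.

```python
import random

def hit_rate(masses, db, tol_ppm=5.0):
    """Fraction of query masses that match at least one database entry
    within the given tolerance (parts per million)."""
    def matches(m):
        return any(abs(m - d) / d * 1e6 <= tol_ppm for d in db)
    return sum(matches(m) for m in masses) / len(masses)

def estimate_fdr(query_masses, db, n_sim=2000, seed=0):
    """Monte Carlo FDR estimate: the chance-hit rate of random masses
    drawn over the same range approximates the expected false-hit rate,
    which is then divided by the observed hit rate."""
    rng = random.Random(seed)
    lo, hi = min(query_masses), max(query_masses)
    decoys = [rng.uniform(lo, hi) for _ in range(n_sim)]
    false_rate = hit_rate(decoys, db)
    observed = hit_rate(query_masses, db)
    return min(1.0, false_rate / observed) if observed else 0.0

fdr = estimate_fdr([100.0, 200.0, 300.0], [100.0, 200.0, 300.0])
```

With a tight tolerance and a small database, random decoys almost never hit, so the estimated FDR is near zero; widening the tolerance or enlarging the database raises it, mirroring the paper's observation about all-in-one databases.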
Hansen, Frank Allan
Ubiquitous annotation systems allow users to annotate physical places, objects, and persons with digital information. Especially in the field of location based information systems much work has been done to implement adaptive and context-aware systems, but few efforts have focused on the general...... requirements for linking information to objects in both physical and digital space. This paper surveys annotation techniques from open hypermedia systems, Web based annotation systems, and mobile and augmented reality systems to illustrate different approaches to four central challenges ubiquitous annotation...... systems have to deal with: anchoring, structuring, presentation, and authoring. Through a number of examples each challenge is discussed and HyCon, a context-aware hypermedia framework developed at the University of Aarhus, Denmark, is used to illustrate an integrated approach to ubiquitous annotations...
Névéol, Aurélie; Islamaj Doğan, Rezarta; Lu, Zhiyong
Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool to help with the semantic annotation of a large set of biomedical queries. Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, quality and number of the resulting annotations. The analysis of annotation results showed that the number of required hand annotations is 28.9% less when using pre-annotated results from automatic tools. As a result, the overall annotation time was substantially lower when pre-annotations were used, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community. This study shows that automatic pre-annotations are found helpful by most annotators. Our experience suggests using an automatic tool to assist large-scale manual annotation projects. This helps speed-up the annotation time and improve annotation consistency while maintaining high quality of the final annotations.
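Inter-annotator agreement of the kind reported above is commonly measured with a chance-corrected statistic such as Cohen's kappa; the study's exact agreement metric is not specified here, so the following is a generic implementation for two annotators labelling the same queries.

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance given each annotator's label distribution."""
    assert len(ann1) == len(ann2) and ann1
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum(c1[k] * c2.get(k, 0) for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels over four queries, using category names in the
# spirit of the biomedical/bibliographic scheme described above.
k_perfect = cohens_kappa(["gene", "disease", "gene", "drug"],
                         ["gene", "disease", "gene", "drug"])
k_partial = cohens_kappa(["gene", "disease", "gene", "drug"],
                         ["gene", "gene", "gene", "drug"])
```

Perfect agreement yields kappa of 1, and partial agreement yields a value between 0 and 1, which is why kappa (unlike raw percent agreement) can show a significant improvement even when annotators use categories at different rates.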
Full Text Available Adverse drug reactions (ADRs) are a widely recognized public health issue and one of the most common reasons for withdrawing drugs from the market. Prescription event monitoring (PEM) is an important approach to detecting adverse drug reactions. The main challenge with this method is how to automatically extract the medical events or side effects from the high-throughput medical events collected in day-to-day clinical practice. In this study we propose a novel concept of a feature matrix to detect ADRs. The feature matrix, which is extracted from big medical data in The Health Improvement Network (THIN) database, is created to characterize the medical events for patients who take drugs, and builds the foundation for handling the irregular and big medical data. Feature selection methods are then performed on the feature matrix to detect the significant features, and finally the ADRs can be located based on those significant features. Experiments were carried out on three drugs: atorvastatin, alendronate, and metoclopramide. Major side effects for each drug were detected, and better performance was achieved compared to other computerized methods. Because the detected ADRs are based on computerized methods, further investigation is needed.
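A toy version of the feature-matrix idea might look like this. The event names, baseline rates, and excess-frequency selection rule are invented for illustration and are much simpler than the feature selection methods used in the study.

```python
def build_feature_matrix(patients, events):
    """Rows: patients exposed to the drug; columns: 1 if the medical
    event was recorded for that patient after exposure, else 0."""
    return [[1 if e in p["events"] else 0 for e in events] for p in patients]

def select_features(matrix, events, baseline_rates, min_excess=0.2):
    """Flag events whose post-exposure frequency exceeds the population
    baseline by a margin -- candidate adverse drug reactions."""
    n = len(matrix)
    flagged = []
    for j, e in enumerate(events):
        rate = sum(row[j] for row in matrix) / n
        if rate - baseline_rates.get(e, 0.0) >= min_excess:
            flagged.append(e)
    return flagged

patients = [
    {"events": {"myalgia", "headache"}},
    {"events": {"myalgia"}},
    {"events": {"myalgia"}},
    {"events": {"nausea"}},
]
events = ["myalgia", "headache", "nausea"]
matrix = build_feature_matrix(patients, events)
baseline = {"myalgia": 0.10, "headache": 0.20, "nausea": 0.15}
candidates = select_features(matrix, events, baseline)
```

Only the event markedly over-represented after exposure is flagged; events occurring near their background rate are not, which is the essence of separating signal from routine clinical noise.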
Harel, Arye; Dalah, Irina; Pietrokovski, Shmuel; Safran, Marilyn; Lancet, Doron
Technological Omics breakthroughs, including next generation sequencing, bring avalanches of data which need to undergo effective data management to ensure integrity, security, and maximal knowledge-gleaning. Data management system requirements include flexible input formats, diverse data entry mechanisms and views, user friendliness, attention to standards, hardware and software platform definition, as well as robustness. Relevant solutions elaborated by the scientific community include Laboratory Information Management Systems (LIMS) and standardization protocols facilitating data sharing and managing. In project planning, special consideration has to be made when choosing relevant Omics annotation sources, since many of them overlap and require sophisticated integration heuristics. The data modeling step defines and categorizes the data into objects (e.g., genes, articles, disorders) and creates an application flow. A data storage/warehouse mechanism must be selected, such as file-based systems and relational databases, the latter typically used for larger projects. Omics project life cycle considerations must include the definition and deployment of new versions, incorporating either full or partial updates. Finally, quality assurance (QA) procedures must validate data and feature integrity, as well as system performance expectations. We illustrate these data management principles with examples from the life cycle of the GeneCards Omics project (http://www.genecards.org), a comprehensive, widely used compendium of annotative information about human genes. For example, the GeneCards infrastructure has recently been changed from text files to a relational database, enabling better organization and views of the growing data. Omics data handling benefits from the wealth of Web-based information, the vast amount of public domain software, increasingly affordable hardware, and effective use of data management and annotation principles as outlined in this chapter.
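The move from text files to a relational database can be illustrated with a minimal schema for some of the object types mentioned (genes, disorders). SQLite stands in for the production database engine here, and the table and column names are hypothetical, not GeneCards' actual schema.

```python
import sqlite3

# In-memory relational store modeling two object types and their
# many-to-many association, in the spirit of the data modeling step.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT UNIQUE NOT NULL);
CREATE TABLE disorder (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE gene_disorder (
    gene_id INTEGER REFERENCES gene(id),
    disorder_id INTEGER REFERENCES disorder(id),
    PRIMARY KEY (gene_id, disorder_id)
);
""")
conn.execute("INSERT INTO gene (symbol) VALUES ('BRCA1')")
conn.execute("INSERT INTO disorder (name) VALUES ('breast cancer')")
conn.execute("INSERT INTO gene_disorder VALUES (1, 1)")
rows = conn.execute(
    "SELECT g.symbol, d.name FROM gene g "
    "JOIN gene_disorder gd ON gd.gene_id = g.id "
    "JOIN disorder d ON d.id = gd.disorder_id").fetchall()
```

The join query retrieves a gene together with its linked disorder, the kind of organized view that a flat text file cannot provide without ad hoc parsing.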
Full Text Available Abstract Background Since DNA sequencing has become easier and cheaper, an increasing number of closely related viral genomes have been sequenced. However, many of these have been deposited in GenBank without annotations, severely limiting their value to researchers. While maintaining comprehensive genomic databases for a set of virus families at the Viral Bioinformatics Resource Center http://www.biovirus.org and Viral Bioinformatics – Canada http://www.virology.ca, we found that researchers were unnecessarily spending time annotating viral genomes that were close relatives of already annotated viruses. We have therefore designed and implemented a novel tool, Genome Annotation Transfer Utility (GATU), to transfer annotations from a previously annotated reference genome to a new target genome, thereby greatly reducing this laborious task. Results GATU transfers annotations from a reference genome to a closely related target genome, while still giving the user final control over which annotations should be included. GATU also detects open reading frames present in the target but not the reference genome and provides the user with a variety of bioinformatics tools to quickly determine if these ORFs should also be included in the annotation. After this process is complete, GATU saves the newly annotated genome as a GenBank, EMBL or XML-format file. The software is coded in Java and runs on a variety of computer platforms. Its user-friendly Graphical User Interface is specifically designed for users trained in the biological sciences. Conclusion GATU greatly simplifies the initial stages of genome annotation by using a closely related genome as a reference. It is not intended to be a gene prediction tool or a "complete" annotation system, but we have found that it significantly reduces the time required for annotation of genes and mature peptides as well as helping to standardize gene names between related organisms by transferring reference genome
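The core transfer step (carry an annotation over when the corresponding region of the target is nearly identical) can be caricatured as below. GATU itself works on alignments of complete genomes, so this position-preserving identity check is only a sketch; the sequences, feature names, and threshold are invented.

```python
def transfer_annotations(ref_features, ref_seq, target_seq, min_identity=0.95):
    """Carry a reference feature over to the target genome if the
    corresponding subsequence is nearly identical (a stand-in for the
    alignment step a real tool would perform)."""
    def identity(a, b):
        if len(a) != len(b) or not a:
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / len(a)
    transferred = []
    for feat in ref_features:  # feat: (name, start, end), 0-based half-open
        name, start, end = feat
        if identity(ref_seq[start:end], target_seq[start:end]) >= min_identity:
            transferred.append(feat)
    return transferred

ref_seq    = "ATGAAATTTGGGTAA"
target_seq = "ATGAAATTTGCGTAA"   # one mismatch, at position 10
features = [("orf1", 0, 9), ("tail", 9, 15)]
transferred = transfer_annotations(features, ref_seq, target_seq)
```

The feature over a conserved region is transferred; the one spanning the diverged region falls below the identity threshold and is left for the user to review, mirroring GATU's "user has final control" design.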
Dipper, Stefanie; Götze, Michael; Stede, Manfred; Wegst, Tillmann
In this paper, we discuss the design and implementation of our first version of the database "ANNIS" (ANNotation of Information Structure). For research based on empirical data, ANNIS provides a uniform environment for storing this data together with its linguistic annotations. A central database promotes standardized annotation, which facilitates interpretation and comparison of the data. ANNIS is used through a standard web browser and offers tier-based visualization of data and annotations...
Lindstrøm, Bo; Wells, Lisa Marie
Coloured Petri nets (CP-nets) can be used for several fundamentally different purposes like functional analysis, performance analysis, and visualisation. To be able to use the corresponding tool extensions and libraries it is sometimes necessary to include extra auxiliary information in the CP-ne...... a certain use of the CP-net. We define the semantics of annotations by describing a translation from a CP-net and the corresponding annotation layers to another CP-net where the annotations are an integrated part of the CP-net....... a method which makes it possible to associate auxiliary information, called annotations, with tokens without modifying the colour sets of the CP-net. Annotations are pieces of information that are not essential for determining the behaviour of the system being modelled, but are rather added to support...
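The idea of keeping annotations outside the colour sets can be mimicked in ordinary code by attaching auxiliary data to tokens in a separate layer keyed by object identity, leaving the tokens themselves (and hence model behaviour) untouched. This is an analogy to the annotation-layer translation described, not CPN Tools' actual mechanism.

```python
class Token:
    """A CP-net token; its colour (value) is all the model's behaviour
    depends on."""
    def __init__(self, colour):
        self.colour = colour

# Annotation layer kept outside the token colour sets: auxiliary data is
# associated by object identity, so the colour sets need no modification.
annotations = {}

def annotate(token, key, value):
    annotations.setdefault(id(token), {})[key] = value

def annotation_of(token, key, default=None):
    return annotations.get(id(token), {}).get(key, default)

t = Token("request")
annotate(t, "timestamp", 42)  # performance-analysis info, not behaviour
```

The token's colour is unchanged by annotation, so functional analysis of the model is unaffected while a performance-analysis library can still read the timestamp from the side layer.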
Full Text Available Despite the routine nature of comparing sequence variations identified during clinical testing to database records, few databases meet quality requirements for clinical diagnostics. To address this issue, the Royal College of Pathologists of Australasia (RCPA), in collaboration with the Human Genetics Society of Australasia (HGSA) and the Human Variome Project (HVP), is developing standards for DNA sequence variation databases intended for use in the Australian clinical environment. The outputs of this project will be promoted to other health systems and accreditation bodies by the Human Variome Project to support the development of similar frameworks in other jurisdictions.
Kim, Sanghee; Ahmed, Saeema; Wallace, Ken
, the annotations identify which parts of a text are more important for understanding its contents. The extraction of salient sentences is a major issue in text summarisation. Commonly used methods are based on statistical analysis, but for subject-matter type texts, linguistically motivated natural language...... processing techniques, like semantic annotations, are preferred. An experiment to test the method using 140 documents collected from industry demonstrated that classification accuracy can be improved by up to 16%....
Haznedaroglu Berat Z
Full Text Available Abstract Background The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, the success of assembly strategies does not consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly. Results Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented. Conclusions This study demonstrated that different k-mer choices result in various quantities
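The union/unique-KOI bookkeeping behind the clustered-assembly comparison reduces to simple set operations; the k-mer values and KOI identifiers below are made up for illustration.

```python
def koi_coverage(assemblies):
    """assemblies: mapping of k-mer value -> set of KEGG ortholog IDs
    (KOIs) recovered from that single-k assembly.  Returns the pooled
    annotation and, per k-mer, the KOIs found by no other k-mer."""
    union = set().union(*assemblies.values())
    unique = {k: kois - set().union(*(v for j, v in assemblies.items() if j != k))
              for k, kois in assemblies.items()}
    return union, unique

assemblies = {
    19: {"K00001", "K00002"},
    31: {"K00002", "K00003"},
    63: {"K00003", "K00004"},
}
union, unique = koi_coverage(assemblies)
```

The pooled set is strictly larger than any single assembly's, while the per-k unique sets identify the annotations that would be lost by discarding any one k-mer, which motivates the recovery workflow the study presents.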
Buza, Teresia M; Jack, Sherman W; Kirunda, Halid; Khaitsa, Margaret L; Lawrence, Mark L; Pruett, Stephen; Peterson, Daniel G
There is an urgent need for a unified resource that integrates trans-disciplinary annotations of emerging and reemerging animal infectious and zoonotic diseases. Such data integration will provide a valuable opportunity for epidemiologists, researchers and health policy makers to make data-driven decisions designed to improve animal health. Integrating emerging and reemerging animal infectious and zoonotic disease data from a large variety of sources into a unified open-access resource provides a stronger basis for achieving a better understanding of infectious and zoonotic diseases. We have developed a model for interlinking annotations of these diseases, which are of particular interest because of the threats they pose to animal health, human health and global health security. We demonstrated the application of this model using brucellosis, an infectious and zoonotic disease. Preliminary annotations were deposited into the VetBioBase database (http://vetbiobase.igbb.msstate.edu). This database is associated with user-friendly tools to facilitate searching, retrieving and downloading of disease-related information. Database URL: http://vetbiobase.igbb.msstate.edu.
method using Saccharomyces cerevisiae data from KEGG and MIPS databases and several other computationally derived and curated datasets. The code and additional supplemental files can be obtained from http://code.google.com/p/functional-annotation-of-hierarchical-modularity/ (Accessed 2012 March 13).
Jerbi, Houssem; Ravat, Franck; Teste, Olivier
This paper deals with personalization of annotated OLAP systems. Data constellation is extended to support annotations and user preferences. Annotations reflect the decision-maker experience whereas user preferences enable users to focus on the most interesting data. User preferences allow annotated contextual recommendations helping the decision-maker during his/her multidimensional navigations.
Kuzniar, A.; Lin, K.; He, Y.; Nijveen, H.; Pongor, S.; Leunissen, J.A.M.
Current protein sequence databases employ different classification schemes that often provide conflicting annotations, especially for poorly characterized proteins. ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap) is a web-tool designed to help researchers and database annotato
Hall, Christoffer; Bonnet, Philippe
The Linux 2.6 kernel supports asynchronous I/O as a result of propositions from the database industry. This is a positive evolution but is it a panacea? In the context of the Badger project, a collaboration between MySQL AB and University of Copenhagen, ...
Zare, Mohammad Reza; Mueen, Ahmed; Seng, Woo Chaw
The demand for automatic classification of medical X-ray images is rising faster than ever. In this paper, an approach is presented to achieve a high accuracy rate for those classes of a medical database with a high ratio of intraclass variability and interclass similarity. The classification framework was constructed via annotation using the following three techniques: annotation by binary classification, annotation by probabilistic latent semantic analysis, and annotation using top similar images. Next, the final annotation was constructed by applying ranking similarity to the annotated keywords produced by each technique. The final annotation keywords were then divided into three levels according to the body region, the specific bone structure within the body region, and the imaging direction. Different weights were given to each level of keywords; these were then used to calculate the weightage for each category of medical images based on its ground-truth annotation. The weightage computed from the generated annotation of a query image was compared with the weightage of each category of medical images, and the query image was assigned to the category with the weightage closest to that of the query image. The average accuracy rate reported is 87.5%.
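A simplified version of the weighted, level-based category assignment might look like this. The weights, level names, and category labels are illustrative, and match-based scoring is used here instead of the paper's closest-weightage comparison.

```python
# Hypothetical per-level weights: body region matters most, then the
# specific bone structure, then the imaging direction.
LEVEL_WEIGHTS = {"region": 3.0, "bone": 2.0, "direction": 1.0}

def weightage(annotation, reference):
    """Score how well a generated annotation matches a category's
    ground-truth annotation: each level contributes its weight when the
    keywords at that level agree."""
    return sum(w for lvl, w in LEVEL_WEIGHTS.items()
               if annotation.get(lvl) == reference.get(lvl))

def classify(query_annotation, categories):
    """Assign the query image to the category whose ground-truth
    annotation scores highest against the query's annotation."""
    return max(categories, key=lambda c: weightage(query_annotation, categories[c]))

categories = {
    "hand-AP": {"region": "upper limb", "bone": "hand", "direction": "AP"},
    "chest-PA": {"region": "chest", "bone": "rib", "direction": "PA"},
}
query = {"region": "upper limb", "bone": "hand", "direction": "lateral"}
label = classify(query, categories)
```

Even with the lowest-weight level (imaging direction) disagreeing, the heavily weighted region and bone levels dominate and the query is assigned to the hand category.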
Campbell, Michael S; Yandell, Mark
Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.
Hall, Christoffer; Bonnet, Philippe
policy used by InnoDB (as well as Oracle 9.2) leads to an under-utilization of the available I/O bandwidth. We then show that introducing prioritized asynchronous I/O in Linux will allow MySQL/InnoDB and the other Linux databases to fully utilize the available I/O bandwidth using a more aggressive I...
Speir, Matthew L; Zweig, Ann S; Rosenbloom, Kate R; Raney, Brian J; Paten, Benedict; Nejad, Parisa; Lee, Brian T; Learned, Katrina; Karolchik, Donna; Hinrichs, Angie S; Heitner, Steve; Harte, Rachel A; Haeussler, Maximilian; Guruvadoo, Luvina; Fujita, Pauline A; Eisenhart, Christopher; Diekhans, Mark; Clawson, Hiram; Casper, Jonathan; Barber, Galt P; Haussler, David; Kuhn, Robert M; Kent, W James
For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for various assemblies, including bonobo and zebrafish; new gene annotation sets; improvements to track and assembly hub support; and a new interactive tool, the "Data Integrator", for intersecting data from multiple tracks. We have greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment.
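Track intersection of the kind the Data Integrator performs reduces, at its core, to interval overlap. The toy in-memory version below is not the UCSC implementation (which operates on indexed database tables); coordinates and track contents are invented.

```python
def intersect_tracks(track_a, track_b):
    """Intervals are (chrom, start, end), 0-based half-open.  Returns the
    overlapping portions of items in track_a against track_b -- a toy
    version of intersecting two genome browser tracks."""
    out = []
    for ca, sa, ea in track_a:
        for cb, sb, eb in track_b:
            if ca == cb and sa < eb and sb < ea:
                out.append((ca, max(sa, sb), min(ea, eb)))
    return out

genes = [("chr1", 100, 200), ("chr2", 50, 80)]
variants = [("chr1", 150, 151), ("chr1", 500, 501)]
hits = intersect_tracks(variants, genes)
```

Only the variant falling inside a gene interval survives the intersection, which is the basic operation behind questions like "which ClinVar variants land in regulatory regions".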
Gaulton, Anna; Hersey, Anne; Nowotka, Michał; Bento, A. Patrícia; Chambers, Jon; Mendez, David; Mutowo, Prudence; Atkinson, Francis; Bellis, Louisa J.; Cibrián-Uhalte, Elena; Davies, Mark; Dedman, Nathan; Karlsson, Anneli; Magariños, María Paula; Overington, John P.; Papadatos, George; Smit, Ines; Leach, Andrew R.
ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 and 2014 Nucleic Acids Research Database Issues. Since then, alongside the continued extraction of data from the medicinal chemistry literature, new sources of bioactivity data have also been added to the database. These include: deposited data sets from neglected disease screening; crop protection data; drug metabolism and disposition data and bioactivity data from patents. A number of improvements and new features have also been incorporated. These include the annotation of assays and targets using ontologies, the inclusion of targets and indications for clinical candidates, addition of metabolic pathways for drugs and calculation of structural alerts. The ChEMBL data can be accessed via a web-interface, RDF distribution, data downloads and RESTful web-services. PMID:27899562
Karolchik, D; Kuhn, R M; Baertsch, R
The University of California, Santa Cruz, Genome Browser Database (GBD) provides integrated sequence and annotation data for a large collection of vertebrate and model organism genomes. Seventeen new assemblies have been added to the database in the past year, for a total coverage of 19 vertebrat...
Full Text Available Our group developed two biological applications, Biblio-MetReS and Homol-MetReS, accessing the same database of organisms with annotated genes. Biblio-MetReS is a data-mining application that facilitates the reconstruction of molecular networks based on automated text-mining analysis of the published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the process(es) of interest and their function. It also enables the sets of proteins involved in the process(es) in different organisms to be compared directly. The efficiency of these biological applications is directly related to the design of the shared database. We classified and analyzed the different kinds of access to the database. Based on this study, we tried to adjust and tune the configurable parameters of the database server to reach the best performance of the communication data link to/from the database system. Different database technologies were analyzed. We started the study with a public relational SQL database, MySQL. Then, the same database was implemented in a MapReduce-based database named HBase. The results indicated that the standard configuration of MySQL gives acceptable performance for small or medium-sized databases. Nevertheless, tuning database parameters can greatly improve the performance and lead to very competitive runtimes.
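The kind of gain available from tuning can be demonstrated in miniature with an index. SQLite stands in for MySQL here (the study's actual tuning involved server parameters, not just indexing), and the schema is invented; the query plan shows a full table scan turning into an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (id INTEGER, organism TEXT, symbol TEXT)")
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(i, f"org{i % 50}", f"g{i}") for i in range(5000)])

def plan(sql):
    """Return the query-plan detail strings for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan("SELECT * FROM gene WHERE organism = 'org7'")
conn.execute("CREATE INDEX idx_org ON gene(organism)")
after = plan("SELECT * FROM gene WHERE organism = 'org7'")
```

Before the index, every lookup by organism scans all rows; after, it becomes an index search, the same qualitative effect as tuning a production server for a dominant access pattern.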
Morris, Robert A; Dou, Lei; Hanken, James; Kelly, Maureen; Lowery, David B; Ludäscher, Bertram; Macklin, James A; Morris, Paul J
Electronic annotation of scientific data is very similar to annotation of documents. Both types of annotation amplify the original object, add related knowledge to it, and dispute or support assertions in it. In each case, annotation is a framework for discourse about the original object, and, in each case, an annotation needs to clearly identify its scope and its own terminology. However, electronic annotation of data differs from annotation of documents: the content of the annotations, including expectations and supporting evidence, is more often shared among members of networks. Any consequent actions taken by the holders of the annotated data could be shared as well. But even those current annotation systems that admit data as their subject often make it difficult or impossible to annotate at fine-enough granularity to use the results in this way for data quality control. We address these kinds of issues by offering simple extensions to an existing annotation ontology and describe how the results support an interest-based distribution of annotations. We are using the result to design and deploy a platform that supports annotation services overlaid on networks of distributed data, with particular application to data quality control. Our initial instance supports a set of natural science collection metadata services. An important application is the support for data quality control and provision of missing data. A previous proof of concept demonstrated such use based on data annotations modeled with XML-Schema.
Roe, Mark A.; Finglas, Paul M
The EuroFIR (European Food Information Resource; www.eurofir.org) project has developed the first comprehensive pan-European Food Data Platform, using state-of-the-art database linking. Data from 25 countries are available online (www.eurofir.org) and include vitamin contents of an extensive range of over 27,000 foods. The datasets have been compiled based on a technical standard developed within the project, and this standardisation of data allows effective management, updating, extending, a...
Ali A. Alwan; Hamidah Ibrahim; Nur Izura Udzir
Most previous studies of integrity constraint checking in distributed databases derive simplified forms of the initial integrity constraints with the sufficiency property, since a sufficient test is known to be cheaper than the complete test and the initial integrity constraint: it involves less data being transferred across the network and can always be evaluated at the target site (a single site). These studies are limited because they depend strictly on the assumption that an update operation will be executed at the site where the relation specified in the update operation is located, which is not always true. Hence, the sufficient test, which previous work has proven to be a local test, is no longer appropriate. This paper proposes an approach to checking integrity constraints in a distributed database by utilizing, as much as possible, the local information stored at the target site. The proposed approach derives support tests as an alternative to the existing complete and sufficient tests proposed by previous researchers, with the intention of increasing the number of local checks regardless of the location of the submitted update operation. Several analyses have been performed to evaluate the proposed approach, and the results show that support tests can benefit the distributed database, where local constraint checking can be achieved.
Li, Yanpeng; Yu, Hong
Gene Ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck of database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on narrative sentences in the biomedical literature. This article presents our work in this task as well as the experimental results obtained after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using a reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 by incorporating bigram features in RDE learning. On both the development and test sets, the RDE-based method achieved over 20% relative improvement in F1 and AUC over classical supervised learning methods such as support vector machines and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence, using a ranking function that combined cosine similarity with the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that incorporating frequency information and hierarchy filtering substantially improved performance. In the post-submission evaluation, we obtained 10.6% F1 using a simpler setting. Overall, the experimental analysis showed that our approaches were robust in both tasks.
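The ranking function for the GO term prediction subtask can be sketched as follows. The weighting scheme (a linear mix of cosine similarity with log-scaled document frequency) and the example GO terms are illustrative assumptions, not the authors' exact formula.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def rank_go_terms(sentence, go_terms, doc_freq, alpha=0.7):
    """Rank GO terms for an evidence sentence by a weighted mix of
    cosine similarity and log-scaled corpus frequency of the term."""
    q = Counter(sentence.lower().split())
    scored = []
    for term_id, name in go_terms.items():
        sim = cosine(q, Counter(name.lower().split()))
        freq = math.log1p(doc_freq.get(term_id, 0))
        scored.append((alpha * sim + (1 - alpha) * freq, term_id))
    return [t for _, t in sorted(scored, reverse=True)]

# Hypothetical GO vocabulary and document frequencies for illustration.
go_terms = {"GO:0006355": "regulation of transcription",
            "GO:0005634": "nucleus",
            "GO:0008152": "metabolic process"}
doc_freq = {"GO:0006355": 120, "GO:0005634": 300, "GO:0008152": 80}
ranking = rank_go_terms("This gene regulates transcription of target genes",
                        go_terms, doc_freq)
print(ranking[0])  # → GO:0006355
```

Note how the frequency term lets a common GO class outrank a rare one when similarity scores are close, which matches the abstract's finding that frequency information helped.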
Loughlin, Maria M.; Hughes, John F.
Annotation is a key activity of data analysis. However, current systems for data analysis focus almost exclusively on visualization. We propose a system which integrates annotations into a visualization system. Annotations are embedded in 3D data space, using the Post-it metaphor. This embedding allows contextual-based information storage and retrieval, and facilitates information sharing in collaborative environments. We provide a traditional database filter and a Magic Lens filter to create specialized views of the data. The system has been customized for fluid flow applications, with features which allow users to store parameters of visualization tools and sketch 3D volumes.
Bult, Carol J; Eppig, Janan T; Blake, Judith A; Kadin, James A; Richardson, Joel E
The Mouse Genome Database (MGD; http://www.informatics.jax.org) is the primary community model organism database for the laboratory mouse and serves as the source for key biological reference data related to mouse genes, gene functions, phenotypes and disease models with a strong emphasis on the relationship of these data to human biology and disease. As the cost of genome-scale sequencing continues to decrease and new technologies for genome editing become widely adopted, the laboratory mouse is more important than ever as a model system for understanding the biological significance of human genetic variation and for advancing the basic research needed to support the emergence of genome-guided precision medicine. Recent enhancements to MGD include new graphical summaries of biological annotations for mouse genes, support for mobile access to the database, tools to support the annotation and analysis of sets of genes, and expanded support for comparative biology through the expansion of homology data.
A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the genome that does not code for canonical genes, concentrating on intergenic features such as structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs) based on processing next-generation sequencing experiments. I will further explain how we cluster together groups of sites to create larger annotations. Next, I will discuss a comprehensive pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome and analyze their distribution with respect to age, protein family, and chromosomal location. Throughout, I will try to introduce some of the computational algorithms and approaches that are required for genome annotation. Much of this work has been carried out in the framework of the ENCODE, modENCODE, and 1000 genomes projects.
Mewes, H W; Dietmann, S; Frishman, D; Gregory, R; Mannhaupt, G; Mayer, K F X; Münsterkötter, M; Ruepp, A; Spannagl, M; Stümpflen, V; Rattei, T
The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases, and the criteria for the transfer of information from known to orthologous sequences are diverse. Coping with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB); (ii) comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat); and (iii) the compilation of manually curated databases of protein interactions based on scrutinized information from the literature, to serve as an accepted set of reliably annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described, as well as detailed descriptions of our projects, can be accessed through the MIPS web server (http://mips.gsf.de).
Full Text Available The prognostic evaluation of geriatric patients is critical in helping clinicians to weigh the risks of available therapeutic options against their benefits. Frailty contributes significantly to the risk of mortality in older patients and is already known to have implications for the outcome of treatment in a clinical context. The multi-dimensional prognostic index (MPI) is a prognostic tool based on a comprehensive geriatric assessment that includes detailed information on patient cognition, functionality, disease and drug burden. The value of the MPI in predicting mortality has already been shown in hospital and community settings but never in a population-based healthcare database setting. One of the aims of the ongoing EU-funded MPI_Age project is to improve our understanding of how geriatric frailty data can be identified in healthcare databases and whether this can be used to predict serious adverse events associated with pharmacotherapy. Our findings suggest that data on functionality in elderly patients are poorly registered in The Health Improvement Network (THIN), a UK nationwide general practice database, and only a few of the functionality domains could be used in a population-based analysis. The most commonly registered functionality information was related to mobility, dressing, accommodation and cognition. Our results suggest that some of these functionality domains are predictive of short- and long-term mortality in community-dwelling patients. This may have implications in observational research, where frailty is an unmeasured confounder.
Full Text Available Large-scale genome projects have generated a rapidly increasing number of DNA sequences, so the development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources, and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from anywhere through the internet, or may be run locally if a large number of sequences are to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Installation and application of annotation systems usually require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to use it without special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. The minimum free disk space required is 2 MB.
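GATO's record-keeping idea, querying a web resource once and serving later requests from a local database, can be sketched as below. The schema and the stubbed remote lookup are hypothetical illustrations, not GATO's actual implementation (which is in PHP and Perl).

```python
import sqlite3

def fetch_remote_annotation(gene_id):
    """Placeholder for a query to a Web-accessible annotation resource."""
    return "hypothetical function for %s" % gene_id

class AnnotationCache:
    """Local database that keeps a record of all previous annotation results."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS annotations "
                          "(gene_id TEXT PRIMARY KEY, annotation TEXT)")
        self.remote_calls = 0

    def annotate(self, gene_id):
        row = self.conn.execute(
            "SELECT annotation FROM annotations WHERE gene_id = ?",
            (gene_id,)).fetchone()
        if row:
            return row[0]                       # served from the local record
        ann = fetch_remote_annotation(gene_id)  # only on a cache miss
        self.remote_calls += 1
        self.conn.execute("INSERT INTO annotations VALUES (?, ?)",
                          (gene_id, ann))
        return ann

cache = AnnotationCache()
cache.annotate("geneA")
cache.annotate("geneA")    # second lookup hits the local database
print(cache.remote_calls)  # → 1
```

The cache is what makes repeated annotation of a common set of genes cheap, which is the usage pattern the abstract highlights.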
Mayer Klaus FX
Full Text Available Abstract Background Apollo, a genome annotation viewer and editor, has become a widely used genome annotation and visualization tool for distributed genome annotation projects. When using Apollo for annotation, database updates are carried out by uploading intermediate annotation files into the respective database. This non-direct database upload is laborious and evokes problems of data synchronicity. Results To overcome these limitations we extended the Apollo data adapter with a generic, configurable web service client that is able to retrieve annotation data in a GAME-XML-formatted string and pass it on to Apollo's internal input routine. Conclusion This Apollo web service adapter, Apollo2Go, simplifies the data exchange in distributed projects and aims to render the annotation process more comfortable. The Apollo2Go software is freely available from ftp://ftpmips.gsf.de/plants/apollo_webservice.
Siti Zuraidah Mohamad Zobir
Full Text Available Traditional Chinese medicine (TCM) still needs more scientific rationale to be proven for it to be accepted further in the West. We are now in the position to propose computational hypotheses for the modes of action (MOAs) of 45 TCM therapeutic action (sub)classes from in silico target prediction algorithms, whose targets were later annotated with Kyoto Encyclopedia of Genes and Genomes pathways, and to discover the relationships between them by generating a hierarchical clustering. The results for 10,749 TCM compounds showed 183 enriched targets and 99 enriched pathways from an Estimation Score ≤ 0 and ≥ 5% of compounds/targets in a (sub)class. The MOA of a (sub)class was established from supporting literature. Overall, the most frequent top three enriched targets/pathways were immune-related targets, such as tyrosine-protein phosphatase nonreceptor type 2 (PTPN2), and the digestive system, such as mineral absorption. We found that two major protein families, the G-protein coupled receptor (GPCR) family and the protein kinase family, contributed to the diversity of the bioactivity space, while the digestive system was a consistently annotated pathway motif, which agrees with the important TCM treatment principle, "the foundation of acquired constitution", that includes the spleen and stomach. In short, the TCM (sub)classes in many cases share similar targets/pathways despite having different indications.
Dolejsi, Martin; Kybic, Jan; Polovincak, Michal; Tuma, Stanislav
The Lung Test Images from Motol Environment (Lung TIME) is a new publicly available dataset of thoracic CT scans with manually annotated pulmonary nodules. It is larger than other publicly available datasets. Pulmonary nodules are lesions in the lungs that may indicate lung cancer, and their early detection significantly improves the survival rate of patients. Automatic nodule detection systems using CT scans are being developed to reduce physicians' load and to improve detection quality. Besides presenting our own nodule detection system, in this article we mainly address the problem of testing and comparing automatic nodule detection methods. Our publicly available dataset of 157 CT scans with 394 annotated nodules contains almost every nodule type (pleura-attached, vessel-attached, solitary, regular, irregular) with diameters of 2-10 mm, except ground glass opacities (GGO). Annotation was done consensually by two experienced radiologists. The data are in DICOM format; annotations are provided in an XML format compatible with the Lung Imaging Database Consortium (LIDC). Our computer-aided diagnosis (CAD) system is based on mathematical morphology and filtration with a subsequent classification step, using an asymmetric AdaBoost classifier. The system was tested using the TIME, LIDC and ANODE09 databases. Performance was evaluated by cross-validation for Lung TIME and LIDC, and using the supplied evaluation procedure for ANODE09. The sensitivity at the chosen working point was 94.27% with 7.57 false positives/slice for the TIME and LIDC datasets combined, 94.03% with 5.46 FPs/slice for the Lung TIME, 89.62% with 12.03 FPs/slice for LIDC, and 78.68% with 4.61 FPs/slice when applied to ANODE09.
Sanfilippo, Antonio P.; Tratz, Stephen C.; Gregory, Michelle L.; Chappell, Alan R.; Whitney, Paul D.; Posse, Christian; Paulson, Patrick R.; Baddeley, Bob L.; Hohimer, Ryan E.; White, Amanda M.
Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with the word sense disambiguation algorithms that are needed to automate ontological annotation. WordNet provides a potentially ideal solution to this problem as it offers a highly structured lexical conceptual representation that has been extensively used to develop word sense disambiguation algorithms. However, WordNet has not been designed as an ontology, and while it can be easily turned into one, the result of doing so would present users with serious practical limitations due to the great number of concepts (synonym sets) it contains. Moreover, mapping WordNet to an existing ontology may be difficult and requires substantial labor. We propose to overcome these limitations by developing an analytical platform that (1) provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes, (2) leverages the lexical richness of WordNet to give an extensive characterization of each concept class in terms of lexical instances, and (3) integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text. The ensuing framework makes available an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.
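The class-recognition step (3) can be caricatured with a toy example: walk a word's hypernym chain until it reaches one of a small, manageable set of concept classes. The hand-coded chain below stands in for WordNet's synset hierarchy and is purely illustrative.

```python
# Hypothetical hypernym links, standing in for WordNet's synset hierarchy.
HYPERNYM = {
    "spaniel": "dog", "dog": "animal", "animal": "organism",
    "oak": "tree", "tree": "plant", "plant": "organism",
}

# The "manageable set of concept classes" into which synsets are collapsed.
CONCEPT_CLASSES = {"animal", "plant"}

def concept_class(word):
    """Climb the hypernym chain; return the first concept class reached."""
    seen = set()
    while word in HYPERNYM and word not in seen:
        seen.add(word)
        word = HYPERNYM[word]
        if word in CONCEPT_CLASSES:
            return word
    return None  # word is unknown or reaches no class

print(concept_class("spaniel"))  # → animal
```

A real implementation would traverse WordNet synsets and combine this with word sense disambiguation, but the collapsing of a deep hierarchy into a few classes is the core of the approach described.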
Gorbalenya Alexander E
Full Text Available Abstract Background A growing diversity of biological data is tagged with unique identifiers (UIDs) associated with polynucleotides and proteins to ensure efficient computer-mediated data storage, maintenance, and processing. These identifiers, which are not informative for most people, are often substituted by biologically meaningful names in various presentations to facilitate the utilization and dissemination of sequence-based knowledge. This substitution is commonly done manually, which may be a tedious exercise prone to mistakes and omissions. Results Here we introduce SNAD (Sequence Name Annotation-based Designer), which mediates the automatic conversion of sequence UIDs (associated with a multiple alignment or phylogenetic tree, or supplied as a plain-text list) into biologically meaningful names and acronyms. This conversion is directed by precompiled or user-defined templates that exploit the wealth of annotation available in cognate entries of external databases. Using examples, we demonstrate how this tool can be used to generate names for practical purposes, particularly in virology. Conclusion A tool for the controllable annotation-based conversion of sequence UIDs into biologically meaningful names and acronyms has been developed and placed into service, fostering links between the quality of sequence annotation and the efficiency of communication and knowledge dissemination among researchers.
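The template-driven substitution SNAD performs can be sketched as follows. The UID pattern, the annotation fields and the template syntax here are illustrative assumptions, not SNAD's actual templates or database lookups.

```python
import re

# Hypothetical annotation records, standing in for entries fetched from an
# external database by their sequence UIDs.
ANNOTATIONS = {
    "NC_045512": {"organism": "SARS-CoV-2", "acronym": "SARS2", "year": "2020"},
    "NC_004718": {"organism": "SARS-CoV", "acronym": "SARS1", "year": "2003"},
}

def rename(text, template="{acronym}_{year}"):
    """Replace every known UID in `text` with a name built from the template;
    unknown UIDs are left untouched."""
    def substitute(match):
        fields = ANNOTATIONS.get(match.group(0))
        return template.format(**fields) if fields else match.group(0)
    return re.sub(r"NC_\d+", substitute, text)

# A Newick-style tree whose leaf labels are sequence UIDs.
tree = "((NC_045512,NC_004718),NC_000001);"
print(rename(tree))  # → ((SARS2_2020,SARS1_2003),NC_000001);
```

Changing the template string changes the naming convention for every leaf at once, which is the "controllable conversion" the abstract refers to.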
Hasson, Uri; Skipper, Jeremy I; Wilde, Michael J; Nusbaum, Howard C; Small, Steven L
The increasingly complex research questions addressed by neuroimaging research impose substantial demands on computational infrastructures. These infrastructures need to support management of massive amounts of data in a way that affords rapid and precise data analysis, to allow collaborative research, and to achieve these aims securely and with minimum management overhead. Here we present an approach that overcomes many current limitations in data analysis and data sharing. This approach is based on open source database management systems that support complex data queries as an integral part of data analysis, flexible data sharing, and parallel and distributed data processing using cluster computing and Grid computing resources. We assess the strengths of these approaches as compared to current frameworks based on storage of binary or text files. We then describe in detail the implementation of such a system and provide a concrete description of how it was used to enable a complex analysis of fMRI time series data.
Pualam Dipa Nusantara
Full Text Available The difficulty of managing data about a library's book collection is a problem often faced by librarians, and it affects the quality of service. Books were arranged and recorded in separate Word and Excel files, and there were no integrated records for handling the borrowing and returning of books. A library system can manage the book collection and reduce the problems library staff often experience when serving students who borrow books, such as the frequent difficulty of tracking books that are still on loan. The system also records the late or lost-book fees charged to students (borrowers). The conclusion of this study is that library performance can be improved with a library system built on a web database.
Wildemuth, Barbara M., Comp.
This 89-item, annotated bibliography was compiled to provide access to research and discussions of cheating and, specifically, cheating on tests. It is not limited to any educational level, nor is it confined to any specific curriculum area. Two data bases were searched by computer, and a library search was conducted. A computer search of the…
This annotated bibliography contains summaries of articles and book chapters relevant to traceability. Each summary is followed by a note on the paper's relevance to the LEI project. The aim of the LEI project is to gain insight into several aspects of traceability in order to
Martinez Alonso, Hector
Regular polysemy has received a lot of attention from the theory of lexical semantics and from computational linguistics. However, there is no consensus on how to represent the sense of underspecified examples at the token level, namely when annotating or disambiguating senses of metonymic words...
Zad, Damon Daylamani; Agius, Harry
In this paper, we focus on metadata for self-created movies like those found on YouTube and Google Video, the durations of which are increasing in line with falling upload restrictions. While simple tags may have been sufficient for most purposes for traditionally very short video footage that contains a relatively small amount of semantic content, this is not the case for movies of longer duration, which embody more intricate semantics. Creating metadata is a time-consuming process that takes a great deal of individual effort; however, this effort can be greatly reduced by harnessing the power of Web 2.0 communities to create, update and maintain it. Consequently, we consider the annotation of movies within Web 2.0 environments, such that users create and share that metadata collaboratively, and we propose an architecture for collaborative movie annotation. This architecture arises from the results of an empirical experiment in which metadata creation tools, YouTube and an MPEG-7 modelling tool, were used by users to create movie metadata. The next section discusses related work in the areas of collaborative retrieval and tagging. Then, we describe the experiments that were undertaken on a sample of 50 users. Next, the results are presented, which provide some insight into how users interact with existing tools and systems for annotating movies. Based on these results, the paper then develops an architecture for collaborative movie annotation.
Haring, Norris G.
An annotated bibliography which presents approximately 300 references from 1951 to 1973 on the education of severely/profoundly handicapped persons. Citations are grouped alphabetically by author's name within the following categories: characteristics and treatment, gross motor development, sensory and motor development, physical therapy for the…
Heaton, Pamela; Wallace, Gregory L.
Background: Whilst interest has focused on the origin and nature of the savant syndrome for over a century, it is only within the past two decades that empirical group studies have been carried out. Methods: The following annotation briefly reviews relevant research and also attempts to address outstanding issues in this research area.…
Kuo, Alan; Grigoriev, Igor
22 percent of ESTs do not align with scaffolds. The EST pipeline assembles 17,126 consensi from the non-aligned ESTs. The annotation pipeline predicts 8,564 ORFs on the consensi. Domain analysis of the ORFs reveals missing genes. Cluster analysis reveals missing genes. Expression analysis reveals potential strain-specific genes.
Gomes-Pereira, Jose Nuno; Auger, Vincent; Beisiegel, Kolja; Benjamin, Robert; Bergmann, Melanie; Bowden, David; Buhl-Mortensen, Pal; De Leo, Fabio C.; Dionísio, Gisela; Durden, Jennifer M.; Edwards, Luke; Friedman, Ariell; Greinert, Jens; Jacobsen-Stout, Nancy; Lerner, Steve; Leslie, Murray; Nattkemper, Tim W.; Sameoto, Jessica A.; Schoening, Timm; Schouten, Ronald; Seager, James; Singh, Hanumant; Soubigou, Olivier; Tojeira, Inês; van den Beld, Inge; Dias, Frederico; Tempera, Fernando; Santos, Ricardo S.
Given the need to describe, analyze and index large quantities of marine imagery data for exploration and monitoring activities, a range of specialized image annotation tools have been developed worldwide. Image annotation, the process of transposing objects or events represented in a video or still image to the semantic level, may involve human interaction and computer-assisted solutions. Marine image annotation software (MIAS) has enabled over 500 publications to date. We review functioning, application trends and developments by comparing general and advanced features of 23 different tools used in underwater image analysis. MIAS requiring human input are basically a graphical user interface with a video player or image browser that recognizes a specific time code or image code, allowing events to be logged in a time-stamped (and/or geo-referenced) manner. MIAS differ from similar software in their capability to integrate data associated with video collection, the simplest being the position coordinates of the video recording platform. MIAS have three main modes of operation: annotating events in real time, annotating after acquisition, and interacting with a database. The tools range from simple annotation interfaces to full onboard data management systems with a variety of toolboxes. Advanced packages allow data from multiple sensors or multiple annotators to be input and displayed via intranet or internet. Post-hoc human-mediated annotation often includes tools for data display and image analysis, e.g. length, area, image segmentation and point counts, and in a few cases the possibility of browsing and editing previous dive logs or analyzing the annotations. Interaction with a database allows the automatic integration of annotations from different surveys, repeated annotation and collaborative annotation of shared datasets, and the browsing and querying of data. Progress in the field of automated annotation is mostly in post-processing, for stable platforms or still images
Dogget, N.; Myers, G. [Los Alamos National Lab., NM (United States); Wills, C.J. [Univ. of California, San Diego, CA (United States)
This is the final report of a three-year, Laboratory Directed Research and Development (LDRD) project at the Los Alamos National Laboratory (LANL). The authors have used computer simulations and examination of a variety of databases to address a wide range of evolutionary questions. The authors have found that there is a clear distinction in the evolution of HIV-1 and HIV-2, with the former and more virulent virus evolving more rapidly at a functional level. The authors have discovered highly non-random patterns in the evolution of HIV-1 that can be attributed to a variety of selective pressures. In the course of examining microsatellite DNA (short repeat regions) in microorganisms, the authors have found clear differences between prokaryotes and eukaryotes in their distribution, differences that can be tied to different selective pressures. They have developed a new method (topiary pruning) for enhancing the phylogenetic information contained in DNA sequences. Most recently, the authors have discovered effects in complex rainforest ecosystems that indicate strong frequency-dependent interactions between host species and their parasites, leading to the maintenance of ecosystem variability.
Lang B Franz
Full Text Available Abstract Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, Gene Ontology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at http://megasun.bch.umontreal.ca/Software/AutoFACT.htm.
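Key feature (2), choosing the most informative description from BLAST reports against several user-selected databases, can be sketched as follows. The database names, hit descriptions and the "uninformative term" list are illustrative assumptions, not AutoFACT's actual rules:

```python
# Hypothetical sketch of the core idea: among top hits from several
# user-ranked databases, keep the first description that is "informative"
# (i.e., not a generic phrase such as "hypothetical protein").
UNINFORMATIVE = ("hypothetical protein", "unknown", "unnamed protein product")

def is_informative(description):
    d = description.lower()
    return not any(term in d for term in UNINFORMATIVE)

def assign_function(blast_reports, db_priority):
    """blast_reports: {db_name: [(description, bit_score), ...]} sorted by score."""
    for db in db_priority:                      # databases in user-selected order
        for description, score in blast_reports.get(db, []):
            if is_informative(description):
                return description, db
    return "unclassified", None

reports = {
    "uniref90": [("hypothetical protein ABC123", 250.0),
                 ("ATP-dependent RNA helicase", 240.0)],
    "kegg": [("RNA helicase, DEAD-box family", 210.0)],
}
print(assign_function(reports, ["uniref90", "kegg"]))
# → ('ATP-dependent RNA helicase', 'uniref90')
```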
Full Text Available Students at risk of dropping out of Science, Technology, Engineering, and Mathematics (STEM) programs often display signs that indicate they are at risk. A need exists to identify at-risk STEM students early and to develop and implement effective intervention strategies that utilize the Total Quality Management (TQM) approach. Most of all, a database system is needed to track this early intervention process if retention rates are to be improved. To address this need at a small community college in North Carolina, a system was developed and underwent a pilot study in Fall 2009 and Spring 2010. The two pilot groups were compared to the two control groups to identify differences in retention, course credit completion rates, and grade point averages (GPA). The first pilot group displayed no significant differences, while the second pilot group displayed significant differences in most of the areas analyzed in the study, indicating a database system can be used to improve STEM student retention. While the second of the two pilot groups displayed promising results, managerial and logistical issues, such as less than optimal instructor involvement, that impeded success were identified. This paper describes the design, implementation, and preliminary results of this study and outlines the need for further research to confirm these preliminary findings.
Saur, Drew D.; Tan, Yap-Peng; Kulkarni, Sanjeev R.; Ramadge, Peter J.
Automated analysis and annotation of video sequences are important for digital video libraries, content-based video browsing and data mining projects. A successful video annotation system should provide users with a useful video content summary in a reasonable processing time. Given the wide variety of video genres available today, automatically extracting meaningful video content for annotation remains difficult with currently available techniques. However, a wide range of videos have inherent structure, so some prior knowledge about the video content can be exploited to improve our understanding of the high-level video semantic content. In this paper, we develop tools and techniques for analyzing structured video by using the low-level information available directly from MPEG compressed video. Being able to work directly in the compressed domain can greatly reduce the processing time and enhance storage efficiency. As a testbed, we have developed a basketball annotation system which combines the low-level information extracted from the MPEG stream with prior knowledge of basketball video structure to provide high-level content analysis, annotation and browsing for events such as wide-angle and close-up views, fast breaks, steals, potential shots, number of possessions and possession times. We expect our approach can also be extended to structured video in other domains.
National Agricultural Library (USDA), Washington, DC.
This annotated bibliography on nutrition and adolescent pregnancy is intended to be a source of technical assistance for nurses, nutritionists, physicians, educators, social workers, and other personnel concerned with improving the health of teenage mothers and their babies. It is divided into two major sections. The first section lists selected…
In view of the defects of the current multi-level secure database BLP model, improved methods for its security classification, scope and access rules are proposed in order to improve the security of the database.
Huang, Liang-Tsung; Wu, Chao-Chin; Lai, Lien-Fu; Li, Yun-Ju
Sequence alignment lies at the heart of bioinformatics. The Smith-Waterman algorithm is one of the key sequence search algorithms and has gained popularity due to improved implementations and rapidly increasing compute power. Recently, the Smith-Waterman algorithm has been successfully mapped onto emerging general-purpose graphics processing units (GPUs). In this paper, we focus on how to improve the mapping, especially for short query sequences, through better usage of shared memory. We implemented and evaluated the proposed method on two different platforms (Tesla C1060 and Tesla K20) and compared it with two classic methods in CUDASW++. Further, the performance on different numbers of threads and blocks has been analyzed. The results show that the proposed method significantly improves the Smith-Waterman algorithm on CUDA-enabled GPUs through proper allocation of block and thread numbers.
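For reference, the dynamic-programming recurrence that GPU implementations parallelize (typically along anti-diagonals of the scoring matrix) looks like this in plain Python; the scoring parameters are illustrative:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Reference (CPU) Smith-Waterman local alignment score; GPU versions
    parallelize the anti-diagonals of this same dynamic-programming matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match/mismatch
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))  # → 8
```

The zero floor in the `max` is what makes the alignment local; short query sequences mean small matrices, which is why shared-memory placement matters on the GPU.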
Karp Peter D
probability threshold of 0.9 during cross-validation using known reactions in computationally-predicted pathway databases. After applying our method to 513 pathway holes in 333 pathways from three Pathway/Genome databases, we increased the number of complete pathways by 42%. We made putative assignments to 46% of the holes, including annotation of 17 sequences of previously unknown function. Conclusions Our pathway hole filler can be used not only to increase the utility of Pathway/Genome databases to both experimental and computational researchers, but also to improve predictions of protein function.
Montoya Obeso, Abraham; Oropesa Morales, Lester Arturo; Fernando Vázquez, Luis; Cocolán Almeda, Sara Ivonne; Stoian, Andrei; García Vázquez, Mireya Saraí; Zamudio Fuentes, Luis Miguel; Montiel Perez, Jesús Yalja; de la O Torres, Saul; Ramírez Acosta, Alejandro Alvaro
Video annotation is important for web indexing and browsing systems. Indeed, in order to evaluate the performance of video query and mining techniques, databases with concept annotations are required. It is therefore necessary to generate a database with a semantic indexing that represents the digital content of the Mexican bullfighting atmosphere. This paper proposes a scheme for making complex annotations in a video in the frame of a multimedia search engine project. Each video is partitioned using our segmentation algorithm, which creates shots of different lengths and different numbers of frames. To make complex annotations about the video, we use the ELAN software. The annotations are done in two steps: first, we take notes about the whole content of each shot; second, we describe the actions in terms of camera parameters such as direction, position and depth. As a consequence, we obtain a more complete descriptor of every action. In both cases we use the concepts of the TRECVid 2014 dataset, and we also propose new concepts. This methodology allows us to generate a database with the necessary information to create descriptors and algorithms capable of detecting actions in order to automatically index and classify new bullfighting multimedia content.
Cao, Jianfang; Chen, Lichao
With the advances in electronic and imaging techniques, the production of digital images has rapidly increased, and the extraction and automated annotation of emotional semantics implied by images have become issues that must be urgently addressed. To better simulate human subjectivity and ambiguity for understanding scene images, the current study proposes an emotional semantic annotation method for scene images based on fuzzy set theory. A fuzzy membership degree was calculated to describe the emotional degree of a scene image and was implemented using the Adaboost algorithm and a back-propagation (BP) neural network. The automated annotation method was trained and tested using scene images from the SUN Database. The annotation results were then compared with those based on artificial annotation. Our method showed an annotation accuracy rate of 91.2% for basic emotional values and 82.4% after extended emotional values were added, which correspond to increases of 5.5% and 8.9%, respectively, compared with the results from using a single BP neural network algorithm. Furthermore, the retrieval accuracy rate based on our method reached approximately 89%. This study attempts to lay a solid foundation for the automated emotional semantic annotation of more types of images and therefore is of practical significance.
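The fuzzy-set idea at the heart of the method, each image carrying a membership degree in [0, 1] for every emotion class rather than a single hard label, can be sketched as below. The label set, the sigmoid mapping and the threshold are illustrative assumptions; the paper's actual degrees come from an Adaboost-combined BP neural network:

```python
import math

EMOTIONS = ["joy", "calm", "sadness", "fear"]   # illustrative label set

def memberships(raw_scores):
    """Map raw classifier outputs to fuzzy membership degrees in [0, 1].
    A sigmoid per class lets one image belong to several emotions at once,
    which is the point of the fuzzy-set formulation."""
    return {e: 1.0 / (1.0 + math.exp(-s)) for e, s in zip(EMOTIONS, raw_scores)}

def annotate(raw_scores, threshold=0.5):
    """Keep every emotion whose membership degree exceeds the threshold."""
    mu = memberships(raw_scores)
    return sorted(e for e, m in mu.items() if m > threshold)

print(annotate([2.0, 0.4, -1.5, -3.0]))  # → ['calm', 'joy']
```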
Mewes, H W; Amid, C; Arnold, R; Frishman, D; Güldener, U; Mannhaupt, G; Münsterkötter, M; Pagel, P; Strack, N; Stümpflen, V; Warfsmann, J; Ruepp, A
The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).
Bakke, Peter; Carney, Nick; DeLoache, Will
in databases such as NCBI and used to validate subsequent annotation errors. We submitted the genome sequence of halophilic archaeon Halorhabdus utahensis to be analyzed by three genome annotation services. We have examined the output from each service in a variety of ways in order to compare the methodology...
Su, Addison Y. S.; Huang, Chester S. J.; Yang, Stephen J. H.; Ding, T. J.; Hsieh, Y. Z.
In Taiwan elementary schools, Scratch programming has been taught for more than four years. Previous studies have shown that personal annotation is a useful learning method that improves learning performance. An annotation-based Scratch programming (ASP) system provides for the creation, sharing, and review of annotations and homework solutions in…
Mottin, Luc; Gobeill, Julien; Pasche, Emilie; Michel, Pierre-André; Cusin, Isabelle; Gaudet, Pascale; Ruch, Patrick
The rapid increase in the number of published articles poses a challenge for curated databases to remain up-to-date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. Our system, neXtA5, is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE along some predefined axes. This report focuses on three axes: Diseases, and the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named-entities as stored in the BioMed database. The two ranked lists are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvement is the following: +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Process when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular protein-protein interactions, which require specific relationship extraction capabilities. In parallel, user-friendly interfaces powered with a set of JSON web services are currently being
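The merging step, a linear combination of the search-engine ranking and the entity-density ranking with an axis-specific coefficient, might be sketched as follows. The scores and the coefficient value are illustrative, not the tuned values reported in the paper:

```python
def merge_rankings(search_scores, density_scores, alpha=0.7):
    """Linearly combine two rankings of the same PMIDs.
    alpha weights the search-engine score; (1 - alpha) weights the
    named-entity density score. alpha is the coefficient tuned per axis."""
    pmids = set(search_scores) | set(density_scores)
    combined = {p: alpha * search_scores.get(p, 0.0)
                   + (1 - alpha) * density_scores.get(p, 0.0)
                for p in pmids}
    return sorted(combined, key=combined.get, reverse=True)

search = {"PMID:1": 0.9, "PMID:2": 0.8, "PMID:3": 0.4}
density = {"PMID:1": 0.1, "PMID:2": 0.9, "PMID:3": 0.95}
print(merge_rankings(search, density, alpha=0.5))
# → ['PMID:2', 'PMID:3', 'PMID:1']
```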
Augediva Maria Jucá Pordeus
Full Text Available Objective: To assess the use of the database linkage technique for improving information on infant deaths by recovering unregistered and/or ignored variables from the deaths of children under one year old in the city of Fortaleza in 2006. Methods: The linkage of the SIM (Mortality Information System) and SINASC (Live Births Information System) databases was done by selecting variables common to the two systems. Using the Reclink III software, perfect “Death certificate/Birth certificate” (DO/DN) pairs were identified by means of the DO variables: sex, race/color, birth weight, mother's age, gestational age, type of pregnancy, type of birth, mother's occupation and mother's schooling. Results: There were 40,391 registered live births and 706 deaths of children under one year old. A total of 516 (73.1%) DO were identified with their respective DN. The completeness of the occupation and mother's schooling variables increased from 31.4% and 35.8% to 64.6% and 72.8%, respectively. Regarding the mother's age, the increase in information was 45.2%. Conclusion: The use of the Reclink III software in the routine of health services enabled a large recovery of information that had not been filled in on the death certificate (DO) and therefore may promote a better understanding of infant mortality in the studied populations.
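The linkage principle, pairing a death record with the birth records that agree on variables common to both systems, can be illustrated with a deterministic sketch. The study itself used probabilistic matching via the Reclink III software; the field names and records below are hypothetical:

```python
# Hypothetical deterministic record linkage between a death record (DO)
# and birth records (DN) using variables common to both systems.
LINK_KEYS = ("sex", "birth_weight", "mothers_age", "type_of_birth")

def link(death_record, birth_records):
    """Return the DN records agreeing with the DO on every common variable."""
    return [dn for dn in birth_records
            if all(dn.get(k) == death_record.get(k) for k in LINK_KEYS)]

do = {"sex": "F", "birth_weight": 2450, "mothers_age": 19, "type_of_birth": "vaginal"}
dns = [
    {"id": "DN-1", "sex": "F", "birth_weight": 2450, "mothers_age": 19, "type_of_birth": "vaginal"},
    {"id": "DN-2", "sex": "M", "birth_weight": 3100, "mothers_age": 27, "type_of_birth": "cesarean"},
]
matches = link(do, dns)
print([m["id"] for m in matches])  # → ['DN-1']
```

Once a DO/DN pair is linked, variables missing on the death certificate (such as the mother's occupation or schooling) can be copied over from the birth record, which is exactly the recovery the study reports.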
Wang, Xin-Jing; Zhang, Lei; Li, Xirong; Ma, Wei-Ying
Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results. Some 2.4 million images with their surrounding text are collected from a few photo forums to support this approach. The entire process is formulated in a divide-and-conquer framework where a query keyword is provided along with the uncaptioned image to improve both the effectiveness and efficiency. This is helpful when the collected data set is not dense everywhere. In this sense, our approach contains three steps: 1) the search process to discover visually and semantically similar search results, 2) the mining process to identify salient terms from textual descriptions of the search results, and 3) the annotation rejection process to filter out noisy terms yielded by Step 2. To ensure real-time annotation, two key techniques are leveraged: one is to map the high-dimensional image visual features into hash codes, the other is to implement it as a distributed system, of which the search and mining processes are provided as Web services. As a typical result, the entire process finishes in less than 1 second. Since no training data set is required, our approach enables annotating with unlimited vocabulary and is highly scalable and robust to outliers. Experimental results on both real Web images and a benchmark image data set show the effectiveness and efficiency of the proposed algorithm. It is also worth noting that, although the entire approach is illustrated within the divide-and-conquer framework, a query keyword is not crucial to our current implementation. We provide experimental results to prove this.
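Step 2, mining salient terms from the textual descriptions of search results, might be approximated with simple frequency counting plus a rejection list, as a stand-in for the paper's mining and annotation-rejection processes. The stopword list and sample texts are illustrative:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "and", "on", "at", "is", "over"}

def salient_terms(result_texts, top_k=3):
    """Mine salient terms from textual descriptions of search results:
    count terms across all results, then reject stopwords (a crude
    stand-in for the paper's annotation-rejection step)."""
    counts = Counter()
    for text in result_texts:
        counts.update(w for w in text.lower().split() if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_k)]

texts = ["sunset over the beach", "beach sunset in summer", "a sunny beach day"]
print(salient_terms(texts))
```

Terms appearing across many independent search results are likely to describe the image itself rather than incidental context, which is the intuition the mining step relies on.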
Chemical and petrochemical processes require continuous monitoring to detect abnormal events and to sustain normal operations. Furthermore, process monitoring enhances productivity, efficiency, and safety in process industries. Here, we propose an innovative statistical approach that exploits the advantages of multiscale partial least squares (MSPLS) models and generalized likelihood ratio (GLR) tests for fault detection in processes. Specifically, we combine an MSPLS algorithm with wavelet analysis to create our modeling framework. Then, we use GLR hypothesis testing based on the uncorrelated residuals obtained from the MSPLS model to improve fault detection. We use simulated distillation column data to evaluate the MSPLS-based GLR chart. Results show that our MSPLS-based GLR method is more powerful than the PLS-based Q and GLR method and MSPLS-based Q method, especially in early detection of small faults with abrupt or incipient behavior.
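The GLR test applied to model residuals can be sketched for the simplest case: detecting a mean shift in Gaussian residuals with known variance. In the proposed approach the residuals would come from the MSPLS model; the numbers below are illustrative:

```python
def glr_mean_shift(residuals, sigma):
    """GLR statistic for a mean shift in Gaussian residuals with known
    variance: T = n * (mean residual)^2 / sigma^2, compared against a
    chi-square(1) threshold under the no-fault hypothesis."""
    n = len(residuals)
    mean = sum(residuals) / n
    return n * mean * mean / (sigma * sigma)

THRESHOLD = 3.84   # chi-square(1) critical value at the 5% level

normal = [0.1, -0.2, 0.05, -0.1, 0.15]   # residuals under normal operation
faulty = [1.1, 0.9, 1.3, 1.0, 0.8]       # residuals with a sustained shift
print(glr_mean_shift(normal, sigma=0.5) > THRESHOLD)  # → False
print(glr_mean_shift(faulty, sigma=0.5) > THRESHOLD)  # → True
```

Because the statistic maximizes the likelihood ratio over the unknown fault magnitude, small but sustained shifts accumulate evidence quickly, which is why GLR charts help with early detection of incipient faults.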
Wright, Thomas L.; Takahashi, Taeko Jane
The Hawaii bibliographic database has been created to contain all of the literature, from 1779 to the present, pertinent to the volcanological history of the Hawaiian-Emperor volcanic chain. References are entered in a PC- and Macintosh-compatible EndNote Plus bibliographic database with keywords and abstracts or (if no abstract) with annotations as to content. Keywords emphasize location, discipline, process, identification of new chemical data or age determinations, and type of publication. The database is updated approximately three times a year and is available to upload from an ftp site. The bibliography contained 8460 references at the time this paper was submitted for publication. Use of the database greatly enhances the power and completeness of library searches for anyone interested in Hawaiian volcanism.
Ravat, D.; Sabaka, T.; Elshayat, A.; Aref, A.; Elawadi, E.; Kucks, R.; Hill, P.; Phillips, J.; Finn, C.; Bouligand, C.; Blakely, R. J.
Under an initiative started by Thomas G. Hildenbrand of the U. S. Geological Survey, we have improved the long-wavelength (50-2500 km) content of the regional magnetic anomaly compilation for the conterminous United States by utilizing a nearly homogeneous set of National Uranium Resource Evaluation (NURE) magnetic surveys flown from 1975 to 1981. The surveys were flown in quadrangles of 2° of longitude by 1° of latitude with E-W flight-lines spaced 4.8 to 9.6 km, N-S tie-lines variably spaced, and a nominal terrain clearance of 122 m. Many of the surveys used base-station magnetometers to remove external field variations. NURE surveys were originally processed with IGRF core-field models, which left behind non-uniform residual trends in the data and discontinuities at survey boundaries. In this study, in place of the IGRF/DGRF, we used a spatially and temporally continuous model of the magnetic field known as the Comprehensive Model (CM), which allowed us to avoid discontinuities at survey boundaries. The CM simultaneously models the core magnetic field and long-wavelength ionospheric and magnetospheric fields, along with their induced components in the earth. Because of the availability of base-stations for removing external fields, we removed only the core-derived geomagnetic field based on CM4 (spherical harmonic degree 13) for our compilation. The NURE data have short-wavelength (less than 30 km) noise due to cultural sources, base-station offsets, and residual external field effects. It is possible to reduce and even remove these defects by identifying and editing them and by applying leveling and micro-leveling. There are also many high resolution individual surveys over the U.S. which could be incorporated into the improved NURE database; however, this could take a few years. Therefore, we have created a preliminary full spectrum magnetic anomaly database by combining short-wavelength magnetic anomalies from the North American Magnetic Anomaly Map (NAMAM
Full Text Available On Semantic Annotation in Clarin-PL Parallel Corpora In the article, the authors present a proposal for semantic annotation in Clarin-PL parallel corpora: Polish-Bulgarian-Russian and Polish-Lithuanian ones. Semantic annotation of quantification is a novum in developing sentence-level semantics in multilingual parallel corpora; this is why our semantic annotation is manual. The authors hope it will be interesting to IT specialists working on automatic processing of the given natural languages. Semantic annotation as defined here will make contrastive studies of natural languages more efficient, which in turn will help verify the results of those studies, and will certainly improve human and machine translations.
Full Text Available Abstract Background Gene annotation is a pivotal component in computational genomics, encompassing prediction of gene function, expression analysis, and sequence scrutiny. Hence, quantitative measures of the annotation landscape constitute a pertinent bioinformatics tool. GeneCards® is a gene-centric compendium of rich annotative information for over 50,000 human gene entries, building upon 68 data sources, including Gene Ontology (GO), pathways, interactions, phenotypes, publications and many more. Results We present the GeneCards Inferred Functionality Score (GIFtS), which allows a quantitative assessment of a gene's annotation status by exploiting the unique wealth and diversity of GeneCards information. The GIFtS tool, linked from the GeneCards home page, facilitates browsing the human genome by searching for the annotation level of a specified gene, retrieving a list of genes within a specified range of GIFtS values, obtaining random genes with a specific GIFtS value, and experimenting with the GIFtS weighting algorithm for a variety of annotation categories. The bimodal shape of the GIFtS distribution suggests a division of the human gene repertoire into two main groups: the high-GIFtS peak consists almost entirely of protein-coding genes; the low-GIFtS peak consists of genes from all of the categories. Cluster analysis of GIFtS annotation vectors provides the classification of gene groups by detailed positioning in the annotation arena. GIFtS also provides measures which enable the evaluation of the databases that serve as GeneCards sources. An inverse correlation is found (for GIFtS > 25) between the number of genes annotated by each source and the average GIFtS value of genes associated with that source. Three typical source prototypes are revealed by their GIFtS distribution: genome-wide sources, sources comprising mainly highly annotated genes, and sources comprising mainly poorly annotated genes. The degree of accumulated knowledge for a
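The idea of a GIFtS-like score, a weighted combination over annotation categories normalized to a 0-100 scale, can be sketched as follows. The category names and weights here are hypothetical; GeneCards' actual weighting algorithm is richer and spans 68 data sources:

```python
# Illustrative weights over a handful of annotation categories.
WEIGHTS = {"pathways": 3.0, "interactions": 2.0, "phenotypes": 2.0, "publications": 1.0}

def gifts_score(annotations):
    """annotations: {category: 1 if the gene has any annotation of that kind}.
    Returns a weighted coverage score normalized to 0-100."""
    total = sum(WEIGHTS.values())
    got = sum(w for cat, w in WEIGHTS.items() if annotations.get(cat))
    return round(100.0 * got / total, 1)

well_annotated = {"pathways": 1, "interactions": 1, "phenotypes": 1, "publications": 1}
sparse = {"publications": 1}
print(gifts_score(well_annotated), gifts_score(sparse))  # → 100.0 12.5
```

A score built this way naturally produces the bimodal split described above: heavily studied protein-coding genes cluster near the top, while poorly characterized genes pile up near the bottom.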
Full Text Available BACKGROUND: The annotation of genomes from next-generation sequencing platforms needs to be rapid, high-throughput, and fully integrated and automated. Although a few Web-based annotation services have recently become available, they may not be the best solution for researchers that need to annotate a large number of genomes, possibly including proprietary data, and store them locally for further analysis. To address this need, we developed a standalone software application, the Annotation of microbial Genome Sequences (AGeS) system, which incorporates publicly available and in-house-developed bioinformatics tools and databases, many of which are parallelized for high-throughput performance. METHODOLOGY: The AGeS system supports three main capabilities. The first is the storage of input contig sequences and the resulting annotation data in a central, customized database. The second is the annotation of microbial genomes using an integrated software pipeline, which first analyzes contigs from high-throughput sequencing by locating genomic regions that code for proteins, RNA, and other genomic elements through the Do-It-Yourself Annotation (DIYA) framework. The identified protein-coding regions are then functionally annotated using the in-house-developed Pipeline for Protein Annotation (PIPA). The third capability is the visualization of annotated sequences using GBrowse. To date, we have implemented these capabilities for bacterial genomes. AGeS was evaluated by comparing its genome annotations with those provided by three other methods. Our results indicate that the software tools integrated into AGeS provide annotations that are in general agreement with those provided by the compared methods. This is demonstrated by a >94% overlap in the number of identified genes, a significant number of identical annotated features, and a >90% agreement in enzyme function predictions.
Jiu, Mingyuan; Sahbi, Hichem
Multiple kernel learning (MKL) is a widely used technique for kernel design. Its principle consists in learning, for a given support vector classifier, the most suitable convex (or sparse) linear combination of standard elementary kernels. However, these combinations are shallow and often powerless to capture the actual similarity between highly semantic data, especially for challenging classification tasks such as image annotation. In this paper, we redefine multiple kernels using deep multi-layer networks. In this new contribution, a deep multiple kernel is recursively defined as a multi-layered combination of nonlinear activation functions, each of which involves a combination of several elementary or intermediate kernels, and results in a positive semi-definite deep kernel. We propose four different frameworks in order to learn the weights of these networks: supervised, unsupervised, kernel-based semi-supervised and Laplacian-based semi-supervised. When plugged into support vector machines (SVMs), the resulting deep kernel networks show clear gains compared to several shallow kernels for the task of image annotation. Extensive experiments and analysis on the challenging ImageCLEF photo annotation benchmark, the COREL5k database and the Banana dataset validate the effectiveness of the proposed method.
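A two-layer instance of such a deep multiple kernel, an MKL-style convex combination of elementary kernels fed through a nonlinear activation, might look like the following sketch. The choice of elementary kernels, the exponential activation and the fixed weights are illustrative; in the paper the weights are learned by the four proposed frameworks:

```python
import math

def rbf(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def deep_kernel(x, y, w1=(0.6, 0.4), w2=1.0):
    """Two-layer sketch of a deep multiple kernel: layer 1 is a convex
    combination of elementary kernels; layer 2 applies an exponential
    activation, which preserves positive semi-definiteness."""
    k1 = w1[0] * rbf(x, y) + w1[1] * linear(x, y)   # layer 1: MKL combination
    return math.exp(w2 * k1)                        # layer 2: nonlinear activation

x, y = (1.0, 0.0), (0.8, 0.2)
print(round(deep_kernel(x, y), 4))
```

Stacking layers this way lets the kernel capture nonlinear interactions between elementary similarities that a single flat MKL combination cannot express.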
Dobb, Linda S.
Discussion of technostress in libraries focuses on a database crash at California Polytechnic State University, San Luis Obispo. Steps taken to restore the data are explained, strategies for handling technological accidents are suggested, the impact on library staff is discussed, and a 10-item annotated bibliography on technostress is provided.…
Zhu, Xingquan; Zhang, HongJiang; Liu, Wenyin; Hu, Chunhui; Wu, Lide
Relevance feedback is a powerful and widely used technique in content-based image retrieval systems. However, most relevance feedback approaches use only a weighted feature sum of the feedback images to optimize the query for refining image similarity assessment. Such approaches do not work very well in most cases, especially when the user wants to express an 'OR' relationship among the queries. In this paper, we propose three methods, weighted distance sum, minimal distance (MD), and minimal distance rank (MDR), to measure the similarity between images in the database and the feedback images in query refinement. After experimental comparisons, we propose a relevance feedback scheme using the MDR method and the MD method to describe the user's multiple intentions. Experiments show that the proposed methods can result in substantial improvement in retrieval accuracy and can be especially useful for retrieving from or annotating large image databases.
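The MD measure, and a ranking built on it, can be sketched as follows; note how the minimum over feedback images expresses the 'OR' relationship, since an image need only be close to one of the user's intentions. The 2-D vectors are illustrative stand-ins for image feature vectors, and the ranking shown is a simplification of the paper's MDR:

```python
def dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def minimal_distance(image, feedback_images):
    """MD: an image is as similar to the query as it is to its nearest
    feedback image, which naturally expresses an 'OR' over examples."""
    return min(dist(image, f) for f in feedback_images)

def rank_by_md(database, feedback_images):
    """Order the database by minimal distance (a simplified take on MDR,
    which ranks per feedback image and then merges the ranks)."""
    return sorted(database, key=lambda img: minimal_distance(img, feedback_images))

feedback = [(0.0, 0.0), (10.0, 10.0)]            # two different user intentions
database = [(9.5, 9.5), (5.0, 5.0), (0.5, 0.0)]
print(rank_by_md(database, feedback))
# → [(0.5, 0.0), (9.5, 9.5), (5.0, 5.0)]
```

A weighted feature sum would place the combined query near (5, 5) and rank the midpoint image first, which is exactly the failure mode the MD formulation avoids.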
Reliability Of A Surgeon-Reported Morbidity And Mortality Database: A Comparison Of Short-Term Morbidity Between The Scoliosis Research Society And National Surgical Quality Improvement Program Databases
Martin, Christopher T.; Pugely, Andrew J.; Gao, Yubo; Skovrlj, Branko; Lee, Nathan J.; Cho, Samuel K.; Mendoza-Lattes, Sergio
Background There exists a lack of comparison between large national healthcare databases reporting surgical morbidity and mortality. Prior authors have expressed concern that the Scoliosis Research Society (SRS) membership may have underreported complications in spinal surgery. Thus, the purpose of the present study was to compare the incidence of morbidity between the SRS and National Surgical Quality Improvement Program (NSQIP) databases. Methods We reviewed patients enrolled between 2012 and 2013, with a total of 96,875 patients identified in the SRS dataset and 15,909 in the combined adult and pediatric NSQIP dataset. Patients were matched based on diagnostic category, and a univariate analysis was used to compare reported complication rates in the categories of perioperative infection, neurologic injury, and mortality. The SRS database only requires detailed demographic data reporting on patients that have had a complication event. We compared the demographics and comorbidities of this subgroup, and used this as a surrogate to assess the potential magnitude of confounders. Results Small differences existed between the SRS and NSQIP databases in terms of mortality (0.1% v. 0.2%), infection (1.2% v. 2%), and neurologic injury (0.8% v. 0.1%) (p<0.001 for each comparison). Infection rates were consistently lower across multiple diagnostic sub-categories in the SRS database, whereas neurologic injury rates were consistently lower in the NSQIP database. These differences reached statistical significance across several diagnostic subcategories, but the clinical magnitude of the differences was small. Amongst the patients with a complication, modest differences in comorbidities existed between the two cohorts. Conclusion Overall, the incidence of short-term morbidity and mortality was similar between the two databases. There were modest differences in comorbidities, which may explain the small differences observed in morbidity. Concerns regarding possible under
Full Text Available Abstract With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequence tags (ESTs) from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC, a web-oriented multi-platform software package for expressed sequence tag (EST) annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: (1) searching local or remote biological databases for sequence similarities using BLAST services, (2) predicting protein coding sequences from EST data, and (3) annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results, and management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics.
Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C
The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. It is critical for tasks ranging from identifying the nearest well-annotated homologue of a protein of interest, to predicting where misannotation has occurred, to knowing how confident you can be in the annotations assigned to those proteins. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene product, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.
Barupal Dinesh K
Full Text Available Abstract Background Volatile compounds comprise diverse chemical groups with wide-ranging sources and functions. These compounds originate from major pathways of secondary metabolism in many organisms and play essential roles in chemical ecology in both plant and animal kingdoms. In past decades, sampling methods and instrumentation for the analysis of complex volatile mixtures have improved; however, design and implementation of database tools to process and store the complex datasets have lagged behind. Description The volatile compound BinBase (vocBinBase) is an automated peak annotation and database system developed for the analysis of GC-TOF-MS data derived from complex volatile mixtures. The vocBinBase DB is an extension of the previously reported metabolite BinBase software developed to track and identify derivatized metabolites. The BinBase algorithm uses deconvoluted spectra and peak metadata (retention index, unique ion, spectral similarity, peak signal-to-noise ratio, and peak purity) from the Leco ChromaTOF software, and annotates peaks using a multi-tiered filtering system with stringent thresholds. The vocBinBase algorithm assigns the identity of compounds existing in the database. Volatile compound assignments are supported by the Adams mass spectral-retention index library, which contains over 2,000 plant-derived volatile compounds. Novel molecules that are not found within vocBinBase are automatically added using strict mass spectral and experimental criteria. Users obtain fully annotated data sheets with quantitative information for all volatile compounds for studies that may consist of thousands of chromatograms. The vocBinBase database may also be queried across different studies, comprising currently 1,537 unique mass spectra generated from 1.7 million deconvoluted mass spectra of 3,435 samples (18 species). Mass spectra with retention indices and volatile profiles are available as free download under the CC-BY agreement (http
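The core of such a multi-tiered peak annotation step can be illustrated with a heavily simplified sketch (hypothetical function and field names; the real vocBinBase system applies several additional filters such as signal-to-noise and peak purity):

```python
from math import sqrt

def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(spec_a) | set(spec_b)
    dot = sum(spec_a.get(mz, 0) * spec_b.get(mz, 0) for mz in shared)
    norm_a = sqrt(sum(v * v for v in spec_a.values()))
    norm_b = sqrt(sum(v * v for v in spec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def annotate_peak(peak, bins, ri_window=5.0, min_similarity=0.8):
    """Assign a deconvoluted peak to the best-matching database bin, or None.

    A bin matches only if its retention index lies within `ri_window`
    of the peak's and the spectral similarity clears `min_similarity`
    (placeholder thresholds, not the published ones).
    """
    best, best_score = None, min_similarity
    for b in bins:
        if abs(b["ri"] - peak["ri"]) > ri_window:
            continue  # retention-index filter: skip bins eluting elsewhere
        score = cosine_similarity(peak["spectrum"], b["spectrum"])
        if score >= best_score:
            best, best_score = b, score
    return best
```

Peaks that fail every bin would, in the real system, be candidates for creating a new bin under the stricter "novel compound" criteria.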
Arango-Argoty, Gustavo; Singh, Gargi; Heath, Lenwood S.; Pruden, Amy; Xiao, Weidong; Zhang, Liqing
Metagenomics is a trending research area, creating a need to analyze large quantities of data generated from next-generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information are especially challenging for metagenomic data sets derived from environmental samples, because current annotation systems only offer broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new online user-friendly metagenomic analysis server called MetaStorm (http://bench.cs.vt.edu/MetaStorm/), which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomy and functional annotation at various levels of resolution. PMID:27632579
Department of Veterans Affairs — The Surgery Risk Assessment (SRA) database is part of the VA Surgical Quality Improvement Program (VASQIP). This database contains assessments of selected surgical...
Blake, Judith A; Bult, Carol J; Eppig, Janan T; Kadin, James A; Richardson, Joel E
The Mouse Genome Database (MGD) (http://www.informatics.jax.org) is the community model organism database resource for the laboratory mouse, a premier animal model for the study of genetic and genomic systems relevant to human biology and disease. MGD maintains a comprehensive catalog of genes, functional RNAs and other genome features as well as heritable phenotypes and quantitative trait loci. The genome feature catalog is generated by the integration of computational and manual genome annotations generated by NCBI, Ensembl and Vega/HAVANA. MGD curates and maintains the comprehensive listing of functional annotations for mouse genes using the Gene Ontology, and MGD curates and integrates comprehensive phenotype annotations including associations of mouse models with human diseases. Recent improvements include integration of the latest mouse genome build (GRCm38), improved access to comparative and functional annotations for mouse genes with expanded representation of comparative vertebrate genomes and new loads of phenotype data from high-throughput phenotyping projects. All MGD resources are freely available to the research community.
Dhanapal, Arun Prabhu; Govindaraj, Mahalingam
The number of sequenced crop genomes and associated genomic resources is growing rapidly with the advent of inexpensive next-generation sequencing methods. Databases have become an integral part of all aspects of science research, including basic and applied plant and animal sciences. The importance of databases keeps increasing as the volume of datasets from direct and indirect genomics, as well as other omics approaches, keeps expanding. The databases and associated web portals provide at a minimum a uniform set of tools and automated analysis across a wide range of crop plant genomes. This paper reviews some basic terms and considerations in dealing with crop plant database utilization in the advancing genomics era. The utilization of databases for variation analysis with other comparative genomics tools, and of data interpretation platforms, is also described. The major focus of this review is to provide knowledge on platforms and databases for genome-based investigations of agriculturally important crop plants. The utilization of these databases in applied crop improvement programs is not yet widespread; without such utilization, the full benefit of sequencing will not be realized.
The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-...
Hornung, Thomas; May, Wolfgang
A large part of the Web, actually holding a significant portion of the useful information throughout the Web, consists of views on hidden databases, provided by numerous heterogeneous interfaces that are partly human-oriented via Web forms ("Deep Web"), and partly based on Web Services (only machine accessible). In this paper we present an approach for annotating these sources in a way that makes them citizens of the Semantic Web. We illustrate how queries can be stated in terms of the ontology, and how the annotations are used to select and access appropriate sources and to answer the queries.
Boot, P.; Boot, P.; Stronks, E.
From the introduction: Annotation is an important item on the wish list for digital scholarly tools. It is one of John Unsworth’s primitives of scholarship (Unsworth 2000). Especially in linguistics, a number of tools have been developed that facilitate the creation of annotations on source material
Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M. Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J.
Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. PMID:24501397
Ye, Zhipeng; Liu, Peng; Zhao, Wei; Tang, Xianglong
We present a simple yet effective scene annotation framework based on a combination of bag-of-visual-words (BoVW), three-dimensional scene structure estimation, scene context, and cognitive theory. From a macro perspective, the proposed cognition-based hybrid motivation framework divides the annotation problem into empirical inference and real-time classification. Inspired by the inference ability of human beings, common objects of indoor scenes are defined for experience-based inference, while in the real-time classification stage, an improved BoVW-based multilayer abstract semantics labeling method is proposed by introducing abstract semantic hierarchies to narrow the semantic gap and improve the performance of object categorization. The proposed framework was evaluated on a variety of common data sets, and experimental results demonstrated its effectiveness.
Cattuto, Ciro; Baldassarri, Andrea; Schehr, G; Loreto, Vittorio
The enormous increase of popularity and use of the WWW has led in the recent years to important changes in the ways people communicate. An interesting example of this fact is provided by the now very popular social annotation systems, through which users annotate resources (such as web pages or digital photographs) with text keywords dubbed tags. Understanding the rich emerging structures resulting from the uncoordinated actions of users calls for an interdisciplinary effort. In particular concepts borrowed from statistical physics, such as random walks, and the complex networks framework, can effectively contribute to the mathematical modeling of social annotation systems. Here we show that the process of social annotation can be seen as a collective but uncoordinated exploration of an underlying semantic space, pictured as a graph, through a series of random walks. This modeling framework reproduces several aspects, so far unexplained, of social annotation, among which the peculiar growth of the size of the...
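The random-walk picture described above can be sketched in a few lines (a toy illustration of the modeling idea, not the authors' actual model; the graph, start node, and step count are invented):

```python
import random

def social_annotation_walk(graph, start, steps, rng=random.Random(42)):
    """Simulate one user's tagging as a random walk on a semantic graph.

    `graph` maps each concept to a list of semantically related
    neighbours; the sequence of visited nodes is the stream of tags
    the user emits while exploring the semantic space.
    """
    tags = [start]
    node = start
    for _ in range(steps):
        node = rng.choice(graph[node])  # uncoordinated, memoryless step
        tags.append(node)
    return tags

def vocabulary_growth(tags):
    """Distinct-tag count after each step, the quantity whose peculiar
    (sublinear) growth such models aim to reproduce."""
    seen, growth = set(), []
    for t in tags:
        seen.add(t)
        growth.append(len(seen))
    return growth
```

In the actual modeling framework, many such walks from many users are aggregated, and the statistics of the combined tag stream are compared with empirical tagging data.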
Many hospitals have built dedicated database systems to enable internal data sharing and networked management. Through a hospital's internal database system, data can be queried and processed. The author discusses strategies for improving and optimizing the performance of hospital database systems, with the aim of increasing their stability and processing speed and enhancing the services that the database system provides to the hospital.
Full Text Available Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but it still has major drawbacks, such as sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gaps in knowledge, especially in the area of metabolic engineering, to improve the H2-producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs) and 46 proteins previously annotated as putative, and assigned more accurate functions for the known protein sequences. Homology-based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods, such as genomic context approaches for protein function prediction, have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs, generated from the MicroScope Platform, to the original genome, with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, comparative genome analyses were performed among the members of the genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach, we strongly
Williamson, Kirsty; Bannister, Marion
This paper reports a research project funded by the State Library of NSW and the National Library of Australia which focuses on the training needs of public librarians for using electronic databases, and the subsequent development of training modules to address them. The paper discusses literature relevant to training issues in database use for…
Kim, Dokyoon; Lucas, Anastasia; Glessner, Joseph; Verma, Shefali S; Bradford, Yuki; Li, Ruowang; Frase, Alex T; Hakonarson, Hakon; Peissig, Peggy; Brilliant, Murray; Ritchie, Marylyn D
Recent studies on copy number variation (CNV) have suggested that an increasing burden of CNVs is associated with susceptibility or resistance to disease. A large number of genes or genomic loci contribute to complex diseases such as autism. Thus, total genomic copy number burden, as an accumulation of copy number change, is a meaningful measure of genomic instability to identify the association between global genetic effects and phenotypes of interest. However, no systematic annotation pipeline has been developed to interpret biological meaning based on the accumulation of copy number change across the genome associated with a phenotype of interest. In this study, we develop a comprehensive and systematic pipeline for annotating copy number variants into genes/genomic regions and subsequently pathways and other gene groups using Biofilter - a bioinformatics tool that aggregates over a dozen publicly available databases of prior biological knowledge. Next we conduct enrichment tests of biologically defined groupings of CNVs including genes, pathways, Gene Ontology, or protein families. We applied the proposed pipeline to a CNV dataset from the Marshfield Clinic Personalized Medicine Research Project (PMRP) in a quantitative trait phenotype derived from the electronic health record - total cholesterol. We identified several significant pathways such as toll-like receptor signaling pathway and hepatitis C pathway, gene ontologies (GOs) of nucleoside triphosphatase activity (NTPase) and response to virus, and protein families such as cell morphogenesis that are associated with the total cholesterol phenotype based on CNV profiles (permutation p-value). This pipeline could improve the interpretability of copy number burden analysis where hundreds of loci or genes contribute toward disease susceptibility via biological knowledge groups such as pathways. This CNV annotation pipeline with Biofilter can be used for CNV data from any genotyping or sequencing platform and to
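The enrichment step of such a pipeline is typically a one-sided hypergeometric test over each gene group; a minimal stdlib-only sketch (my own illustration, not Biofilter's implementation) might look like:

```python
from math import comb

def hypergeometric_enrichment(hits, group_size, total_hits, universe):
    """One-sided hypergeometric p-value: the probability of observing at
    least `hits` CNV-affected genes in a group of `group_size` genes,
    given `total_hits` affected genes among `universe` genes overall.

    A small p-value suggests the group (e.g. a pathway or GO term) is
    enriched for copy number change beyond chance expectation.
    """
    p = 0.0
    upper = min(group_size, total_hits)
    for k in range(hits, upper + 1):
        p += (comb(total_hits, k)
              * comb(universe - total_hits, group_size - k)
              / comb(universe, group_size))
    return p
```

In practice (as in the study above) a permutation procedure is often layered on top to obtain empirical p-values robust to correlation structure among loci.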
Full Text Available A spoken language system, whether for speech synthesis or speech recognition, starts with building a speech corpus. We give a detailed survey of issues and a methodology for selecting the appropriate speech unit when building a speech corpus for Indian-language text-to-speech systems. The paper ultimately aims to improve the intelligibility of the synthesized speech in text-to-speech synthesis systems. To begin with, an appropriate text file should be selected for building the speech corpus. Then a corresponding speech file is generated and stored. This speech file is the phonetic representation of the selected text file. The speech file is processed at different levels, viz., paragraphs, sentences, phrases, words, syllables and phones. These are called the speech units of the file. Research has been done taking these units as the basic unit for processing. This paper analyses the research done using phones, diphones, triphones, syllables and polysyllables as the basic unit for speech synthesis. The paper also provides a recommended set of combinations for polysyllables. Concatenative speech synthesis involves the concatenation of these basic units to synthesize intelligible, natural-sounding speech. The speech units are annotated with relevant prosodic information about each unit, manually or automatically, based on an algorithm. The database consisting of the units along with their annotated information is called the annotated speech corpus. A clustering technique is used in the annotated speech corpus that provides a way to select the appropriate unit for concatenation, based on the lowest total join cost of the speech units.
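Unit selection by lowest total join cost, as described above, is commonly implemented as a Viterbi-style dynamic-programming search over candidate units; a minimal sketch follows (hypothetical data layout; target costs and prosodic features are omitted for brevity):

```python
def select_units(candidates, join_cost):
    """Pick one candidate unit per target position so that the total
    join cost between consecutive units is minimal (Viterbi search).

    `candidates[i]` is the list of database units for position i;
    `join_cost(a, b)` scores the spectral/prosodic mismatch at the
    concatenation point between units a and b.
    """
    # best[u] = (cost of the cheapest path ending at unit u, that path)
    best = {u: (0.0, [u]) for u in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for u in position:
            # extend the cheapest predecessor path by unit u
            cost, path = min(
                (c + join_cost(p[-1], u), p) for c, p in best.values()
            )
            new_best[u] = (cost, path + [u])
        best = new_best
    return min(best.values())[1]
```

A full system would add a per-unit target cost (how well the unit matches the desired phone and prosody) to each candidate before the join-cost comparison.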
Full Text Available Abstract Background Curated databases of completely sequenced genomes have been designed independently at the NCBI (RefSeq) and EBI (Genome Reviews) to cope with non-standard annotation found in the version of the sequenced genome that has been published by the databanks GenBank/EMBL/DDBJ. These curation attempts were expected to review the annotations and to improve their pertinence when using them to annotate newly released genome sequences by homology to previously annotated genomes. However, we observed that such an uncoordinated effort has two unwanted consequences. First, it is not trivial to map the protein identifiers of the same sequence in both databases. Secondly, the two reannotated versions of the same genome differ at the level of their structural annotation. Results Here, we propose CorBank, a program devised to cross-reference protein identifiers no matter what level of identity is found between their matching sequences. Approximately 98% of the 1,983,258 amino acid sequences are matching, allowing instantaneous retrieval of their respective cross-references. CorBank further allows detecting any differences between the independently curated versions of the same genome. We found that the RefSeq and Genome Reviews versions are perfectly matching for only 50 of the 641 complete genomes we have analyzed. In all other cases there are differences occurring at the level of the coding sequence (CDS), and/or in the total number of CDSs in the respective versions of the same genome. CorBank is freely accessible at http://www.corbank.u-psud.fr. The CorBank site also contains updated publication of the exhaustive results obtained by comparing the RefSeq and Genome Reviews versions of each genome. Accordingly, this web site allows easy search of cross-references between RefSeq, Genome Reviews, and UniProt, for either a single CDS or a whole replicon. Conclusion CorBank is very efficient in rapid detection of the numerous differences existing
Full Text Available Background: There are limited studies on the automatic detection of T waves in arrhythmic electrocardiogram (ECG signals. This is perhaps because there is no available arrhythmia dataset with annotated T waves. There is a growing need to develop numerically-efficient algorithms that can accommodate the new trend of battery-driven ECG devices. Moreover, there is also a need to analyze long-term recorded signals in a reliable and time-efficient manner, therefore improving the diagnostic ability of mobile devices and point-of-care technologies. Methods: Here, the T wave annotation of the well-known MIT-BIH arrhythmia database is discussed and provided. Moreover, a simple fast method for detecting T waves is introduced. A typical T wave detection method has been reduced to a basic approach consisting of two moving averages and dynamic thresholds. The dynamic thresholds were calibrated using four clinically known types of sinus node response to atrial premature depolarization (compensation, reset, interpolation, and reentry. Results: The determination of T wave peaks is performed and the proposed algorithm is evaluated on two well-known databases, the QT and MIT-BIH Arrhythmia databases. The detector obtained a sensitivity of 97.14% and a positive predictivity of 99.29% over the first lead of the validation databases (total of 221,186 beats. Conclusions: We present a simple yet very reliable T wave detection algorithm that can be potentially implemented on mobile battery-driven devices. In contrast to complex methods, it can be easily implemented in a digital filter design.
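The two-moving-average scheme summarized above can be sketched as follows (an illustrative simplification of the approach, not the authors' implementation; window lengths and the threshold offset are placeholder values, and in practice they would be tuned to the sampling rate and calibrated dynamically):

```python
def moving_average(x, w):
    """Centered moving average with window w samples; windows shrink at
    the edges of the signal."""
    half = w // 2
    return [sum(x[max(0, i - half): i + half + 1]) /
            len(x[max(0, i - half): i + half + 1]) for i in range(len(x))]

def detect_wave_peaks(signal, short_w=5, long_w=25, offset=0.0):
    """Two-moving-average detector: regions where the short-window
    average exceeds the long-window average (plus a threshold offset)
    form blocks of interest; the sample of maximum amplitude in each
    block is taken as a wave peak."""
    ma_event = moving_average(signal, short_w)   # follows the wave itself
    ma_cycle = moving_average(signal, long_w)    # follows the local baseline
    peaks, block = [], []
    for i, (e, c) in enumerate(zip(ma_event, ma_cycle)):
        if e > c + offset:
            block.append(i)
        elif block:
            peaks.append(max(block, key=lambda j: signal[j]))
            block = []
    if block:
        peaks.append(max(block, key=lambda j: signal[j]))
    return peaks
```

Because only additions and comparisons are involved, this style of detector maps naturally onto the battery-driven devices and digital filter designs the paper targets.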
Bertini, Marco; Del Bimbo, Alberto; Nunziati, W.
Broadcasters usually envision two basic applications for video databases: Live Logging and Posterity Logging. The former aims at providing effective annotation of video in quasi-real time and supports extraction of meaningful clips from the live stream; it is usually performed by assistant producers working at the same location as the event. The latter provides annotation for later reuse of video material and is the prerequisite for retrieval by content from video digital libraries; it is performed by trained librarians. Both require that annotation is performed, to a great extent, automatically. Video information structure must encompass both low-intermediate level video organization and event relationships that define specific highlights and situations. Analysis of the visual data of the video stream makes it possible to extract hints, identify events and detect highlights. All of this must be supported by a-priori knowledge of the video domain and effective reasoning engines capable of capturing the inherent semantics of the visual events.
Phillip H Pham
Full Text Available Interpretation of human genomes is a major challenge. We present the Scripps Genome ADVISER (SG-ADVISER) suite, which aims to fill the gap between data generation and genome interpretation by performing holistic, in-depth annotations and functional predictions on all variant types and effects. The SG-ADVISER suite includes a de-identification tool, a variant annotation web server, and a user interface for inheritance- and annotation-based filtration. SG-ADVISER allows users with no bioinformatics expertise to manipulate large volumes of variant data with ease, without the need to download large reference databases, install software, or use a command line interface. SG-ADVISER is freely available at genomics.scripps.edu/ADVISER.
Merchant Sabeeha S
Full Text Available Abstract Background Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. Description The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of
I present a tool that assesses the quality or usefulness of a document based on its annotations. Annotations may include comments, notes, observations, highlights, underlining, explanations, questions, requests for help, etc. Comments are used for evaluative purposes, while the other annotation types are used for summarization or for expansion. Further, comments may be made on other annotations; such annotations are referred to as meta-annotations. Not all annotations receive equal weight. My tool considers highlights and underlining, as well as comments, to infer the collective sentiment of annotators, which is classified as positive, negative, or objective. The tool computes the collective sentiment of annotations in two ways: it counts all the annotations present on the document, and it also computes sentiment scores for all annotations, including comments, in order to obtain the collective sentiment about the document and judge its quality. I demonstrate the use of the tool on a research paper.
Hamanaka, Masatoshi; Hirata, Keiji; Tojo, Satoshi
In this paper, we present the publication of our analysis data and analysis tool based on the generative theory of tonal music (GTTM). Musical databases such as score databases, instrument sound databases, and musical pieces with standard MIDI files and annotated data are key to advancements in the field of music information technology. We started implementing the GTTM on a computer in 2004 and ever since have collected and publicized test data by musicologists in a step-by-step manner. In our ...
Savidor, Alon [ORNL; Donahoo, Ryan S [ORNL; Hurtado-Gonzales, Oscar [University of Tennessee, Knoxville (UTK); Verberkmoes, Nathan C [ORNL; Shah, Manesh B [ORNL; Lamour, Kurt H [ORNL; McDonald, W Hayes [ORNL
While genome sequencing is becoming ever more routine, genome annotation remains a challenging process. Identification of the coding sequences within the genomic milieu presents a tremendous challenge, especially for eukaryotes with their complex gene architectures. Here we present a method to assist the annotation process through the use of proteomic data and bioinformatics. Mass spectra of digested protein preparations of the organism of interest were acquired and searched against a protein database created by a six frame translation of the genome. The identified peptides were mapped back to the genome, compared to the current annotation, and then categorized as supporting or extending the current genome annotation. We named the classified peptides Expressed Peptide Tags (EPTs). The well annotated bacterium Rhodopseudomonas palustris was used as a control for the method and showed high degree of correlation between EPT mapping and the current annotation, with 86% of the EPTs confirming existing gene calls and less than 1% of the EPTs expanding on the current annotation. The eukaryotic plant pathogens Phytophthora ramorum and Phytophthora sojae, whose genomes have been recently sequenced and are much less well annotated, were also subjected to this method. A series of algorithmic steps were taken to increase the confidence of EPT identification for these organisms, including generation of smaller sub-databases to be searched against, and definition of EPT criteria that accommodates the more complex eukaryotic gene architecture. As expected, the analysis of the Phytophthora species showed less correlation between EPT mapping and their current annotation. While ~77% of Phytophthora EPTs supported the current annotation, a portion of them (7.2% and 12.6% for P. ramorum and P. sojae, respectively) suggested modification to current gene calls or identified novel genes that were missed by the current genome annotation of these organisms.
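The six-frame translation that underlies the EPT search database can be sketched in a few lines of Python (standard genetic code; the function names and frame labels are my own, and a real pipeline would also split the translations at stop codons into ORF-like entries):

```python
# Standard genetic code, built from the canonical 64-character table
# ordered by first/second/third base over the alphabet TCAG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def revcomp(dna):
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(dna))

def translate(dna):
    """Translate complete codons only; '*' marks stop codons."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """All six reading frames: three forward and three on the reverse
    complement, as searched against MS/MS spectra to derive EPTs."""
    frames, rc = {}, revcomp(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Peptides identified against this database but absent from the current gene models are the candidates that extend the annotation.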
Romagné, Frédéric; Santesmasses, Didac; White, Louise; Sarangi, Gaurab K; Mariotti, Marco; Hübler, Ron; Weihmann, Antje; Parra, Genís; Gladyshev, Vadim N; Guigó, Roderic; Castellano, Sergi
SelenoDB (http://www.selenodb.org) aims to provide high-quality annotations of selenoprotein genes, proteins and SECIS elements. Selenoproteins are proteins that contain the amino acid selenocysteine (Sec) and the first release of the database included annotations for eight species. Since the release of SelenoDB 1.0 many new animal genomes have been sequenced. The annotations of selenoproteins in new genomes usually contain many errors in major databases. For this reason, we have now fully annotated selenoprotein genes in 58 animal genomes. We provide manually curated annotations for human selenoproteins, whereas we use an automatic annotation pipeline to annotate selenoprotein genes in other animal genomes. In addition, we annotate the homologous genes containing cysteine (Cys) instead of Sec. Finally, we have surveyed genetic variation in the annotated genes in humans. We use exon capture and resequencing approaches to identify single-nucleotide polymorphisms in more than 50 human populations around the world. We thus present a detailed view of the genetic divergence of Sec- and Cys-containing genes in animals and their diversity in humans. The addition of these datasets into the second release of the database provides a valuable resource for addressing medical and evolutionary questions in selenium biology.
Davis, Allan Peter; Wiegers, Thomas C; Johnson, Robin J; Lay, Jean M; Lennon-Hopkins, Kelley; Saraceni-Richards, Cynthia; Sciaky, Daniela; Murphy, Cynthia Grondin; Mattingly, Carolyn J
The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,094 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.
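The abstract does not detail how the DRS is computed; as a rough, hypothetical illustration of the ranking step only, a score can be derived from weighted keyword occurrences. The keywords, weights, and scoring rule below are invented for the sketch and are not CTD's actual model.

```python
# Hypothetical document-relevancy scoring: rank triaged articles by weighted
# occurrences of curation-relevant keywords. Keywords and weights are invented;
# CTD's actual DRS is a more sophisticated text-mining model.
import re

KEYWORD_WEIGHTS = {"cadmium": 2.0, "toxicity": 1.5, "gene": 1.0, "disease": 1.0}

def relevancy_score(text: str) -> float:
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(KEYWORD_WEIGHTS.get(tok, 0.0) for tok in tokens)

def rank_articles(articles: dict) -> list:
    """Return (article_id, score) pairs, highest score first."""
    scored = [(aid, relevancy_score(body)) for aid, body in articles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Ranking `{"pmid1": "Cadmium toxicity alters gene expression", "pmid2": "A survey of reef fish"}` puts pmid1 first; in the curation pipeline, biocurators would then review articles in descending score order.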
Damien E. ZOMAHOUN
Full Text Available In the quest for models that could help to represent the meaning of images, some approaches have used contextual knowledge by building semantic hierarchies. Others have resorted to the integration of image analysis improvement knowledge and image interpretation using ontologies. Images are often annotated with a set of keywords (or ontologies), whose relevance remains highly subjective and related to only one interpretation (one annotator). However, an image can have many associated semantics because annotators can interpret it differently. The purpose of this paper is to propose a collaborative annotation system that brings out the meaning of images from the different interpretations of annotators. The work carried out in this paper leads to a semantic model of an image, i.e. the different meanings that a picture may have. This method relies on the different tools of the Semantic Web, especially ontologies.
information to be specified as part of the search criteria. This can assist in a number of important tasks, e.g., finding new experimental knowledge to facilitate database curation, enabling textual inference to detect entailments and contradictions, etc. To our knowledge, our scheme is unique within the field with regards to the diversity of meta-knowledge aspects annotated for each event.
Firman, Anthony H.
Offers brief annotations of 52 articles and papers on document production (from the Society for Technical Communication's journal and proceedings) on 9 topics: information processing, document design, using color, typography, tables, illustrations, photography, printing and binding, and production management. (SR)
Kronk, Gary W
Meteor showers are among the most spectacular celestial events that may be observed by the naked eye, and have been the object of fascination throughout human history. In “Meteor Showers: An Annotated Catalog,” the interested observer can access detailed research on over 100 annual and periodic meteor streams in order to capitalize on these majestic spectacles. Each meteor shower entry includes details of their discovery, important observations and orbits, and gives a full picture of duration, location in the sky, and expected hourly rates. Armed with a fuller understanding, the amateur observer can better view and appreciate the shower of their choice. The original book, published in 1988, has been updated with over 25 years of research in this new and improved edition. Almost every meteor shower study is expanded, with some original minor showers being dropped while new ones are added. The book also includes breakthroughs in the study of meteor showers, such as accurate predictions of outbursts as well ...
Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James
Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of the structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, NCBI's new Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, and more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP, see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.
This paper studies the ability structure and characteristics of database technology professionals, proposes a knowledge structure for database systems, and analyzes the current state of teaching in university courses related to database systems. It then discusses ways to improve the teaching quality of database systems from several aspects: the knowledge structure of database systems, the differentiation of teaching levels, and innovation in practical teaching.
Full Text Available The explosive growth of image data has led to the research and development of image content searching and indexing systems. Image annotation systems aim at automatically annotating an image with controlled keywords that can be used for indexing and retrieval of images. This paper presents a comparative evaluation of an image content annotation system using multilayer neural networks and the nearest neighbour classifier. Region growing segmentation is used to separate objects, and Hu moments, Legendre moments and Zernike moments are used as feature descriptors for image content characterization and annotation. The ETH-80 image database is used in the experiments. The best annotation rate is achieved by using Legendre moments for feature extraction and the multilayer neural network as the classifier.
Aerts, J.; Wetzels, Y.; Cohen, N.; Aerssens, J.
Different strategies to search public single nucleotide polymorphism (SNP) databases for intragenic SNPs were evaluated. First, we assembled a strategy to annotate SNPs onto candidate genes based on a BLAST search of public SNP databases (Intragenic SNP Annotation by BLAST, ISAB). Only BLAST hits th
Mao, Qi; Tsang, Ivor Wai-Hung; Gao, Shenghua
Automatic image annotation, which is usually formulated as a multi-label classification problem, is one of the major tools used to enhance the semantic understanding of web images. Many multimedia applications (e.g., tag-based image retrieval) can greatly benefit from image annotation. However, the insufficient performance of image annotation methods prevents these applications from being practical. On the other hand, specific measures are usually designed to evaluate how well one annotation method performs for a specific objective or application, but most image annotation methods do not consider optimization of these measures, so they are inevitably trapped in suboptimal performance on these objective-specific measures. To address this issue, we first summarize a variety of objective-guided performance measures under a unified representation. Our analysis reveals that macro-averaging measures are very sensitive to infrequent keywords, and the Hamming measure is easily affected by skewed distributions. We then propose a unified multi-label learning framework, which directly optimizes a variety of objective-specific measures of multi-label learning tasks. Specifically, we first present a multilayer hierarchical structure of learning hypotheses for multi-label problems, based on which a variety of loss functions with respect to objective-guided measures are defined. We then formulate these loss functions as relaxed surrogate functions and optimize them by structural SVMs. Given the analysis of the various measures and the high time complexity of optimizing micro-averaging measures, in this paper we focus on example-based measures that are tailor-made for image annotation tasks but are seldom explored in the literature. Experiments show consistency with the formal analysis on two widely used multi-label datasets, and demonstrate the superior performance of our proposed method over state-of-the-art baseline methods in terms of example-based measures on four
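The sensitivity of macro-averaging to infrequent keywords noted in the analysis above is easy to see numerically. A minimal sketch with invented per-label counts:

```python
# Macro- vs micro-averaged F1 over per-label (tp, fp, fn) counts. The rare
# third label drags the macro average down while barely moving the micro
# average -- the sensitivity to infrequent keywords noted in the abstract.
def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_f1(counts):
    # Average of per-label F1 scores: every label weighs equally.
    return sum(f1(*c) for c in counts) / len(counts)

def micro_f1(counts):
    # F1 over pooled counts: labels weigh by how often they occur.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)

# (tp, fp, fn) per label: two frequent labels predicted well, one rare label missed.
counts = [(90, 10, 10), (80, 20, 20), (0, 0, 2)]
```

With these counts the macro F1 is about 0.57 while the micro F1 is about 0.85: the rare third label contributes a full zero term to the macro average but only two false negatives to the pooled micro counts.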
Fidelis, K; Adzhubej, A; Kryshtafovych, A; Daniluk, P
The phenomenal success of the genome sequencing projects reveals the power of completeness in revolutionizing biological science. Currently it is possible to sequence entire organisms at a time, allowing for a systemic rather than fractional view of their organization and the various genome-encoded functions. There is an international plan to move towards a similar goal in the area of protein structure. This will not be achieved by experiment alone, but rather by a combination of efforts in crystallography, NMR spectroscopy, and computational modeling. Only a small fraction of structures are expected to be determined experimentally, with the remainder to be modeled. Presently there is no organized infrastructure to critically evaluate and present these data to the biological community. The goal of the Protein Model Database project is to create such an infrastructure, including (1) a public database of theoretically derived protein structures; (2) reliable annotation of protein model quality; (3) novel structure analysis tools; and (4) access to the highest quality modeling techniques available.
Full Text Available Recent sequencing and assembly of the genome for the fungal pathogen Candida albicans used simple automated procedures for the identification of putative genes. We have reviewed the entire assembly, both by hand and with additional bioinformatic resources, to accurately map and describe 6,354 genes and to identify 246 genes whose original database entries contained sequencing errors (or possibly mutations) that affect their reading frame. Comparison with other fungal genomes permitted the identification of numerous fungus-specific genes that might be targeted for antifungal therapy. We also observed that, compared to other fungi, the protein-coding sequences in the C. albicans genome are especially rich in short sequence repeats. Finally, our improved annotation permitted a detailed analysis of several multigene families, and comparative genomic studies showed that C. albicans has a far greater catabolic range, encoding respiratory Complex I, several novel oxidoreductases and ketone body degrading enzymes, malonyl-CoA and enoyl-CoA carriers, several novel amino acid degrading enzymes, a variety of secreted catabolic lipases and proteases, and numerous transporters to assimilate the resulting nutrients. The results of these efforts will ensure that the Candida research community has uniform and comprehensive genomic information for medical research as well as for future diagnostic and therapeutic applications.
Burns, Jonathan; Kukushkin, Denys; Lindblad, Kelsi; Chen, Xiao; Jonoska, Nataša; Landweber, Laura F
Ciliated protists exhibit nuclear dimorphism through the presence of somatic macronuclei (MAC) and germline micronuclei (MIC). In some ciliates, DNA from precursor segments in the MIC genome rearranges to form transcriptionally active genes in the mature MAC genome, making these ciliates model organisms to study the process of somatic genome rearrangement. Similar broad scale, somatic rearrangement events occur in many eukaryotic cells and tumors. The mds_ies_db (http://oxytricha.princeton.edu/mds_ies_db) is a database of genome recombination and rearrangement annotations, and it provides tools for visualization and comparative analysis of precursor and product genomes. The database currently contains annotations for two completely sequenced ciliate genomes: Oxytricha trifallax and Tetrahymena thermophila.
Hulstaert, Niels; Reisinger, Florian; Rameseder, Jonathan; Barsnes, Harald; Vizcaíno, Juan Antonio; Vizcaíno, Juan Antonio; Martens, Lennart
We present an open source software application and library, written in Java, that provides a uniform annotation of identified spectra stored in the PRIDE database. Pride-asap can be run in a command line mode for automated processing of multiple PRIDE experiments, but also has a graphical user interface that allows end users to annotate the spectra in PRIDE experiments and to inspect the results in detail. Pride-asap binaries, source code and additional information can be downloaded from http:/...
Ritari, Jarmo; Salojärvi, Jarkko; Lahti, Leo; Vos, de Willem M.
Background: Current sequencing technology enables taxonomic profiling of microbial ecosystems at high resolution and depth by using the 16S rRNA gene as a phylogenetic marker. Taxonomic assignation of newly acquired data is based on sequence comparisons with comprehensive reference databases to f
G.P. Patrinos (George); B. Giardine (Belinda); C. Riemer (Cathy); W. Miller (Webb); D.H. Chui (David); N.P. Anagnou (Nicholas); H. Wajcman (Henri); R.C. Hardison (Ross)
HbVar (http://globin.cse.psu.edu/globin/hbvar/) is a relational database developed by a multi-center academic effort to provide up-to-date and high quality information on the genomic sequence changes leading to hemoglobin variants and all types of thalassemia and hemogl
Kehlet, H.; Bay-Nielsen, M.
BACKGROUND: Increased focus and research on surgical technique and anaesthesia in groin hernia repair have improved outcomes from centres of interest in hernia surgery, but little information is available from nationwide data to document the incorporation of scientific evidence into general clinical practice. AIM: To review outcomes after groin hernia repair in Denmark from the Danish Hernia Database 1998-2005 in 87,840 patients. RESULTS: The nationwide Danish hernia collaboration, with two annual meetings discussing its own results and those of others, has led to a >50% reduction in reoperation rates, increased use of the Lichtenstein hernia technique, a higher rate of outpatient surgery, near elimination of regional anaesthesia, and documentation of and focus on the incidence and mechanisms of chronic pain. CONCLUSION: Establishment of nationwide groin hernia databases leads to general improvement ...
Full Text Available Chickpea is an important grain legume used as a rich source of protein in the human diet. The narrow genetic diversity and limited availability of genomic resources are the major constraints in implementing breeding strategies and biotechnological interventions for genetic enhancement of chickpea. We developed an integrated Chickpea Transcriptome Database (CTDB), which provides a comprehensive web interface for visualization and easy retrieval of transcriptome data in chickpea. The database features many tools for similarity search, functional annotation (putative function, PFAM domain and gene ontology search) and comparative gene expression analysis. The current release of CTDB (v2.0) hosts transcriptome datasets with high quality functional annotation from cultivated (desi and kabuli types) and wild chickpea. A catalog of transcription factor families and their expression profiles in chickpea are available in the database. The gene expression data have been integrated to study the expression profiles of chickpea transcripts in major tissues/organs and various stages of flower development. Utilities such as similarity search, ortholog identification and comparative gene expression have also been implemented in the database to facilitate comparative genomic studies among different legumes and Arabidopsis. Furthermore, the CTDB represents a resource for the discovery of functional molecular markers (microsatellites and single nucleotide polymorphisms) between different chickpea types. We anticipate that the integrated information content of this database will accelerate functional and applied genomic research for the improvement of chickpea. The CTDB web service is freely available at http://nipgr.res.in/ctdb.html.
Cattuto, Ciro; Barrat, Alain; Baldassarri, Andrea; Schehr, Gregory; Loreto, Vittorio
The enormous increase in the popularity and use of the worldwide web has led in recent years to important changes in the ways people communicate. An interesting example of this fact is provided by the now very popular social annotation systems, through which users annotate resources (such as web pages or digital photographs) with keywords known as "tags." Understanding the rich emergent structures resulting from the uncoordinated actions of users calls for an interdisciplinary effort. In particular, concepts borrowed from statistical physics, such as random walks (RWs), and from complex network theory can effectively contribute to the mathematical modeling of social annotation systems. Here, we show that the process of social annotation can be seen as a collective but uncoordinated exploration of an underlying semantic space, pictured as a graph, through a series of RWs. This modeling framework reproduces several aspects, thus far unexplained, of social annotation, among which are the peculiar growth of the size of the vocabulary used by the community and its complex network structure that represents an externalization of semantic structures grounded in cognition and that are typically hard to access.
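The random-walk picture can be made concrete with a toy simulation: each annotation act is a short walk on a shared semantic graph, and the community vocabulary is the set of nodes visited so far. The graph, walk lengths and parameters below are invented for illustration and make no claim to reproduce the paper's quantitative results.

```python
# Toy model of social annotation as uncoordinated random walks on a shared
# semantic graph: each "user" walks the graph and emits the words it visits;
# the community vocabulary is the union of all visited nodes.
import random

def random_walk(graph, start, steps, rng):
    node, visited = start, [start]
    for _ in range(steps):
        node = rng.choice(graph[node])  # hop to a uniformly chosen neighbour
        visited.append(node)
    return visited

def vocabulary_growth(graph, n_walks, steps, seed=0):
    """Size of the shared vocabulary after each successive walk."""
    rng = random.Random(seed)
    vocab, sizes = set(), []
    nodes = sorted(graph)
    for _ in range(n_walks):
        walk = random_walk(graph, rng.choice(nodes), steps, rng)
        vocab.update(walk)
        sizes.append(len(vocab))
    return sizes
```

On richer graphs, plotting the returned sizes against the number of walks is the kind of vocabulary-growth curve the modeling framework seeks to explain.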
Jung, Sook; Ficklin, Stephen P.; Lee, Taein; Cheng, Chun-Huai; Blenda, Anna; Zheng, Ping; Yu, Jing; Bombarely, Aureliano; Cho, Ilhyung; Ru, Sushan; Evans, Kate; Peace, Cameron; Abbott, Albert G.; Mueller, Lukas A.; Olmstead, Mercy A.; Main, Dorrie
The Genome Database for Rosaceae (GDR, http://www.rosaceae.org), the long-standing central repository and data mining resource for Rosaceae research, has been enhanced with new genomic, genetic and breeding data, and improved functionality. Whole genome sequences of apple, peach and strawberry are available to browse or download with a range of annotations, including gene model predictions, aligned transcripts, repetitive elements, polymorphisms, mapped genetic markers, mapped NCBI Rosaceae genes, gene homologs and association of InterPro protein domains, GO terms and Kyoto Encyclopedia of Genes and Genomes pathway terms. Annotated sequences can be queried using search interfaces and visualized using GBrowse. New expressed sequence tag unigene sets are available for major genera, and Pathway data are available through FragariaCyc, AppleCyc and PeachCyc databases. Synteny among the three sequenced genomes can be viewed using GBrowse_Syn. New markers, genetic maps and extensively curated qualitative/Mendelian and quantitative trait loci are available. Phenotype and genotype data from breeding projects and genetic diversity projects are also included. Improved search pages are available for marker, trait locus, genetic diversity and publication data. New search tools for breeders enable selection comparison and assistance with breeding decision making. PMID:24225320
Bolser, Dan M.; Chibon, Pierre-Yves; Palopoli, Nicolas; Gong, Sungsam; Jacob, Daniel; Angel, Victoria Dominguez Del; Swan, Dan; Bassi, Sebastian; González, Virginia; Suravajhala, Prashanth; Hwang, Seungwoo; Romano, Paolo; Edwards, Rob; Bishop, Bryan; Eargle, John; Shtatland, Timur; Provart, Nicholas J.; Clements, Dave; Renfro, Daniel P.; Bhak, Daeui; Bhak, Jong
Biology is generating more data than ever. As a result, there is an ever increasing number of publicly available databases that analyse, integrate and summarize the available data, providing an invaluable resource for the biological community. As this trend continues, there is a pressing need to organize, catalogue and rate these resources, so that the information they contain can be most effectively exploited. MetaBase (MB) (http://MetaDatabase.Org) is a community-curated database containing more than 2000 commonly used biological databases. Each entry is structured using templates and can carry various user comments and annotations. Entries can be searched, listed, browsed or queried. The database was created using the same MediaWiki technology that powers Wikipedia, allowing users to contribute on many different levels. The initial release of MB was derived from the content of the 2007 Nucleic Acids Research (NAR) Database Issue. Since then, approximately 100 databases have been manually collected from the literature, and users have added information for over 240 databases. MB is synchronized annually with the static Molecular Biology Database Collection provided by NAR. To date, there have been 19 significant contributors to the project; each one is listed as an author here to highlight the community aspect of the project. PMID:22139927
Bernstein, P.A.; DeWitt, D.; Heuer, A.
There has been a growing interest in improving the publication processes for database research papers. This panel reports on recent changes in those processes and presents an initial cut at historical data for the VLDB Journal and ACM Transactions on Database Systems.
Palma, Guillermo; Vidal, Maria-Esther; Haag, Eric; Raschid, Louiqa; Thor, Andreas
Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug-drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called 'AnnSim' that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1-1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/
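The 1-1 maximum-weight bipartite match at the heart of AnnSim can be illustrated with a brute-force sketch for small annotation sets. The authors use an efficient solver rather than enumeration; the normalization by average set size here is an assumption of the sketch, and `sim` stands for any term-similarity function you supply.

```python
# Sketch of an AnnSim-style score: relate two entities by a 1-1 maximum-weight
# matching between their annotation sets. Brute force over permutations is
# adequate for tiny sets; real implementations use a bipartite-matching solver.
from itertools import permutations

def annsim(annots_a, annots_b, sim):
    """sim(a, b) -> similarity in [0, 1] between two annotation terms."""
    small, large = sorted((annots_a, annots_b), key=len)
    # Best injective assignment of the smaller set into the larger one.
    best = max(
        sum(sim(s, t) for s, t in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    # Normalize by the average set size, so unmatched annotations dilute the score.
    return 2 * best / (len(annots_a) + len(annots_b))
```

With exact-match similarity, two entities annotated with {go1, go2} and {go1, go2, go3} score 0.8: the two shared terms match perfectly, while the extra unmatched term lowers the normalized score.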
Schaafsma, Gerard C P; Vihinen, Mauno
The Variation Ontology (VariO) is used for describing and annotating types, effects, consequences, and mechanisms of variations. To facilitate easy and consistent annotations, the online application VariOtator was developed. For variation type annotations, VariOtator is fully automated, accepting variant descriptions in Human Genome Variation Society (HGVS) format, and generating VariO terms, either with or without full lineage, that is, all parent terms. When a coding DNA variant description with a reference sequence is provided, VariOtator checks the description first with Mutalyzer and then generates the predicted RNA and protein descriptions with their respective VariO annotations. For the other sublevels, function, structure, and property, annotations cannot be automated, and VariOtator generates annotation based on provided details. For VariO terms relating to structure and property, one can use attribute terms as modifiers and evidence code terms for annotating experimental evidence. There is an online batch version, and stand-alone batch versions to be used with a Leiden Open Variation Database (LOVD) download file. A SOAP Web service allows client programs to access VariOtator programmatically. Thus, systematic variation effect and type annotations can be efficiently generated to allow easy use and integration of variations and their consequences.
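As a small illustration of the kind of input VariOtator consumes, a coding-DNA substitution in HGVS format (e.g. `c.76A>T`) can be parsed with a short regular expression. This toy handles only simple substitutions; real HGVS (and Mutalyzer's checking) covers far more variant types, and the field names below are invented for the sketch.

```python
# Minimal parser for one HGVS variant class: coding-sequence substitutions
# such as "c.76A>T". Anything else is rejected rather than guessed at.
import re

SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")

def classify(description: str) -> dict:
    match = SUBSTITUTION.match(description)
    if not match:
        raise ValueError(f"unsupported or malformed description: {description}")
    position, ref, alt = match.groups()
    return {
        "position": int(position),
        "reference": ref,
        "alternate": alt,
        "type": "substitution",  # a full tool would map this to a VariO term
    }
```

A tool like VariOtator layers ontology-term lookup and lineage expansion on top of this kind of syntactic classification.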
Finn, Robert D.; Attwood, Teresa K.; Babbitt, Patricia C.; Bateman, Alex; Bork, Peer; Bridge, Alan J.; Chang, Hsin-Yu; Dosztányi, Zsuzsanna; El-Gebali, Sara; Fraser, Matthew; Gough, Julian; Haft, David; Holliday, Gemma L.; Huang, Hongzhan; Huang, Xiaosong; Letunic, Ivica; Lopez, Rodrigo; Lu, Shennan; Marchler-Bauer, Aron; Mi, Huaiyu; Mistry, Jaina; Natale, Darren A.; Necci, Marco; Nuka, Gift; Orengo, Christine A.; Park, Youngmi; Pesseat, Sebastien; Piovesan, Damiano; Potter, Simon C.; Rawlings, Neil D.; Redaschi, Nicole; Richardson, Lorna; Rivoire, Catherine; Sangrador-Vegas, Amaia; Sigrist, Christian; Sillitoe, Ian; Smithers, Ben; Squizzato, Silvano; Sutton, Granger; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C. E.; Wu, Cathy H.; Xenarios, Ioannis; Yeh, Lai-Su; Young, Siew-Yit; Mitchell, Alex L.
InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences. PMID:27899635
Full Text Available Loops represent an important part of protein structures. The study of loops is critical for two main reasons: first, loops are often involved in protein function, stability and folding; second, despite improvements in experimental and computational structure prediction methods, modeling the conformation of loops remains problematic. Here, we present a structural classification of loops, ArchDB, a mine of information with application in both of the aforementioned fields: loop structure prediction and function prediction. ArchDB (http://sbi.imim.es/archdb) is a database of classified protein loop motifs. The current database provides four different classification sets tailored for different purposes. ArchDB-40, a loop classification derived from SCOP40, is well suited for modeling common loop motifs. Since features relevant to loop structure or function can be more easily determined on well-populated clusters, we have developed ArchDB-95, a loop classification derived from SCOP95. This new classification set shows a ∼40% increase in the number of subclasses, and a large 7-fold increase in the number of putative structure/function-related subclasses. We also present ArchDB-EC, a classification of loop motifs from enzymes, and ArchDB-KI, a manually annotated classification of loop motifs from kinases. Information about ligand contacts and PDB sites has been included in all classification sets. Improvements in our classification scheme are described, as well as several new database features, such as the ability to query by conserved annotations, by sequence similarity, or by uploading the 3D coordinates of a protein. The lengths of classified loops range between 0 and 36 residues. ArchDB offers an exhaustive sampling of loop structures. Functional information about loops and links with related biological databases are also provided. All this information and the possibility to browse/query the database through a web server outline a useful tool with application in the
Direct marketing is now a well-known discipline and widely used in almost every industry all around the world. The mid to late 2000s saw huge growth in direct marketing due to the development of technology and the increasing number of well-educated marketers (Tapp, 2008). According to the UK's Institute of Direct Marketing (as cited in Sargeant & West, 2001, p. 7), direct marketing is "the planned recording, analysis and tracking of customers' direct response behaviour over time ... in order to develop future marketing strategies for long term customer loyalty and to ensure continued business growth". As Tapp (2008) points out, the database is the core of direct marketing. So what is a database in the field of direct marketing? A definition is given by Tapp (2008, p. 32): "A marketing database is a list of customers' and prospects' records that enables strategic analysis, and individual selections for communication and customer service support. The data is organized around the customer."
Firth Andrew E
Full Text Available Abstract Background The many Hepadnaviridae sequences available have widely varied functional annotation. The genomes are very compact (~3.2 kb) but contain multiple layers of functional regulatory elements in addition to coding regions. Key regions are subject to purifying selection, as mutations in these regions will produce non-functional viruses. Results These genomic sequences have been organized into a structured database to facilitate research at the molecular level. HBVRegDB is a comparative genomic analysis tool with an integrated underlying sequence database. The database contains genomic sequence data from representative viruses. In addition to INSDC and RefSeq annotation, HBVRegDB also contains expert and systematically calculated annotations (e.g. promoters) and comparative genome analysis results (e.g. blastn, tblastx). It also contains analyses based on curated HBV alignments. Information about conserved regions – including primary conservation (e.g. CDS-Plotcon) and RNA secondary structure predictions (e.g. Alidot) – is integrated into the database. A large amount of data is graphically presented using the GBrowse (Generic Genome Browser) adapted for analysis of viral genomes. Flexible query access is provided based on any annotated genomic feature. Novel regulatory motifs can be found by analysing the annotated sequences. Conclusion HBVRegDB serves as a knowledge database and as a comparative genomic analysis tool for molecular biologists investigating HBV. It is publicly available and complementary to other viral and HBV focused datasets and tools http://hbvregdb.otago.ac.nz. The availability of multiple and highly annotated sequences of viral genomes in one database combined with comparative analysis tools facilitates detection of novel genomic elements.
DAI Tie; SHI Guangyu; ZHANG Xingying
A large number of experimental and theoretical investigations of carbon dioxide (CO2) spectra have been conducted since the most recent update of the High-Resolution Transmission Molecular Absorption (HITRAN) database. To maintain optimal parameters, the HITRAN 2004 CO2 line list has been completely replaced by HITRAN 2008 data in the near-infrared region from 4300 cm-1 to 7000 cm-1. To examine the effect of this change on the retrieval of CO2 vertical column data from reflected sunlight spectra in the 1.61-μm spectral window, synthetic measurements for a given atmospheric state and instrument setup were generated and compared using a radiative transfer model with the line-transition parameters from the HITRAN 2004 and 2008 databases. Simulated retrievals were then performed based on optimal estimation retrieval theory. The results show that large systematic errors in atmospheric CO2 column retrievals were induced by the differences in the HITRAN laboratory line parameters in the 1.61-μm region. The retrieved CO2 columns were underestimated by >10 ppm using the HITRAN 2004 data, and improvements resulting from the use of the improved HITRAN database were more pronounced at higher spectral resolution.
Wang, Chao; Zhang, Jun; Cai, Mingdeng; Zhu, Zhenggang; Gu, Wenjie; Yu, Yingyan; Zhang, Xiaoyan
The Database of Human Gastric Cancer (DBGC) is a comprehensive database that integrates various human gastric cancer-related data resources. Human gastric cancer-related transcriptomics projects, proteomics projects, mutations, biomarkers and drug-sensitive genes from different sources were collected and unified in this database. Moreover, epidemiological statistics of gastric cancer patients in China and clinicopathological information annotated with gastric cancer cases were also integrated into the DBGC. We believe that this database will greatly facilitate research regarding human gastric cancer in many fields. DBGC is freely available at http://bminfor.tongji.edu.cn/dbgc/index.do.
Bell, D A
Relational Databases explores the major advances in relational databases and provides a balanced analysis of the state of the art in relational databases. Topics covered include capture and analysis of data placement requirements; distributed relational database systems; data dependency manipulation in database schemata; and relational database support for computer graphics and computer aided design. This book is divided into three sections and begins with an overview of the theory and practice of distributed systems, using the example of INGRES from Relational Technology as illustration. The
Carneiro, Gustavo; Chan, Antoni B; Moreno, Pedro J; Vasconcelos, Nuno
A probabilistic formulation for semantic image annotation and retrieval is proposed. Annotation and retrieval are posed as classification problems where each class is defined as the group of database images labeled with a common semantic label. It is shown that, by establishing this one-to-one correspondence between semantic labels and semantic classes, a minimum probability of error annotation and retrieval are feasible with algorithms that are 1) conceptually simple, 2) computationally efficient, and 3) do not require prior semantic segmentation of training images. In particular, images are represented as bags of localized feature vectors, a mixture density estimated for each image, and the mixtures associated with all images annotated with a common semantic label pooled into a density estimate for the corresponding semantic class. This pooling is justified by a multiple instance learning argument and performed efficiently with a hierarchical extension of expectation-maximization. The benefits of the supervised formulation over the more complex, and currently popular, joint modeling of semantic label and visual feature distributions are illustrated through theoretical arguments and extensive experiments. The supervised formulation is shown to achieve higher accuracy than various previously published methods at a fraction of their computational cost. Finally, the proposed method is shown to be fairly robust to parameter tuning.
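The class-density idea above, pooling the feature vectors of all images sharing a label into one density per semantic class, can be illustrated with a deliberately simplified sketch (a single diagonal Gaussian per class instead of the paper's hierarchical-EM mixture pooling; labels and data are synthetic):

```python
import math
import random

# Toy sketch of supervised class-density annotation: each image is a bag
# of localized feature vectors; bags sharing a label are pooled to estimate
# that class's density; a new image gets the label under which its vectors
# are jointly most likely. One diagonal Gaussian per class stands in for
# the pooled mixtures of the actual method.
def fit(bags):
    X = [v for bag in bags for v in bag]          # pool all vectors of the class
    d, n = len(X[0]), len(X)
    mu = [sum(x[j] for x in X) / n for j in range(d)]
    var = [sum((x[j] - mu[j]) ** 2 for x in X) / n + 1e-6 for j in range(d)]
    return mu, var

def loglik(bag, params):
    mu, var = params
    return sum(
        -0.5 * (math.log(2 * math.pi * var[j]) + (v[j] - mu[j]) ** 2 / var[j])
        for v in bag for j in range(len(mu))
    )

rng = random.Random(0)
def bag(center, n=20, d=3):
    return [[rng.gauss(center, 1.0) for _ in range(d)] for _ in range(n)]

# two synthetic classes with well-separated feature distributions
train = {"sky": [bag(0.0) for _ in range(5)], "grass": [bag(5.0) for _ in range(5)]}
densities = {label: fit(bags) for label, bags in train.items()}

def annotate(b):
    return max(densities, key=lambda lbl: loglik(b, densities[lbl]))

label = annotate(bag(5.0))
```

No per-pixel segmentation is needed: the bag-of-vectors likelihood handles heterogeneous image content implicitly, which is the point the paper's multiple instance learning argument formalizes.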
Lugli, Gabriele Andrea; Milani, Christian; Mancabelli, Leonardo; van Sinderen, Douwe; Ventura, Marco
Genome annotation is one of the key actions that must be undertaken in order to decipher the genetic blueprint of organisms. Thus, a correct and reliable annotation is essential in rendering genomic data valuable. Here, we describe a bioinformatics pipeline based on freely available software programs coordinated by a multithreaded script named MEGAnnotator (Multithreaded Enhanced prokaryotic Genome Annotator). This pipeline allows the generation of multiple annotated formats fulfilling the NCBI guidelines for assembled microbial genome submission, based on DNA shotgun sequencing reads, and minimizes manual intervention, while also reducing waiting times between software program executions and improving final quality of both assembly and annotation outputs. MEGAnnotator provides an efficient way to pre-arrange the assembly and annotation work required to process NGS genome sequence data. The script improves the final quality of microbial genome annotation by reducing ambiguous annotations. Moreover, the MEGAnnotator platform allows the user to perform a partial annotation of pre-assembled genomes and includes an option to accomplish metagenomic data set assemblies. MEGAnnotator platform will be useful for microbiologists interested in genome analyses of bacteria as well as those investigating the complexity of microbial communities that do not possess the necessary skills to prepare their own bioinformatics pipeline.
Pottegård, A; Tjäderborn, M; Schjerning, O
Aim In Europe, pregabalin is approved for treatment of neuropathic pain, general anxiety disorder (GAD) and as adjunctive therapy for epilepsy. The purpose of this study was to assess utilisation of pregabalin in the UK, including patients with a recorded history of substance abuse, from a large … general practice database. Methods This observational drug utilisation study (DUS) analysed pregabalin prescription data from the UK Health Improvement Network primary care database between September 2004 and July 2009. Patient demographics, diagnoses (by READ codes) and pregabalin dosing data were … approved dose of 600 mg/day. Of these, 18.4% (25/136) of patients had a history of substance abuse compared with 14.0% (1884/13,480) in the full population. Conclusion Data from this DUS indicated that the majority of pregabalin prescribing in the UK was consistent with product labelling. The proportion …
Rubisco is a very large, complex protein, one of the most abundant in the world, comprising up to 50% of all soluble protein in plants. The activity of Rubisco, the enzyme that catalyzes CO2 assimilation in photosynthesis, is regulated by Rubisco activase (Rca). In the present study, we searched for hypothetical proteins of Vitis vinifera with putative Rubisco activase function. The Arabidopsis and tobacco Rubisco activase protein sequences were used as seed sequences to search against Vitis vinifera in the UniProtKB database. The selected hypothetical proteins of Vitis vinifera were subjected to sequence, structural and functional annotation. Subcellular localization predictions suggested them to be cytoplasmic proteins. Homology modelling was used to define the three-dimensional (3D) structures of the selected hypothetical proteins of Vitis vinifera. Template search revealed that all the hypothetical proteins share more than 80% sequence identity with the structure of green-type Rubisco activase from tobacco, indicating that the proteins are evolutionarily conserved. The homology models were generated using SWISS-MODEL. Several quality assessment and validation parameters computed indicated that the homology models are reliable. Further, functional annotation through PFAM, CATH, SUPERFAMILY and CDART suggested that the selected hypothetical proteins of Vitis vinifera contain an ATPase domain associated with various cellular activities (AAA) and belong to the AAA+ superfamily of ring-shaped P-loop containing nucleoside triphosphate hydrolases. This study will lead to research in the optimization of the functionality of Rubisco, which has large implications for the improvement of plant productivity and resource use efficiency.
Silva, Ricardo R.; Jourdan, Fabien; Salvanha, Diego M.; Letisse, Fabien; Jamin, Emilien L.; Guidetti-Gonzalez, Simone; Labate, Carlos A.; Vêncio, Ricardo Z. N.
Summary: We present ProbMetab, an R package that promotes substantial improvement in automatic probabilistic liquid chromatography–mass spectrometry-based metabolome annotation. The inference engine core is based on a Bayesian model implemented to (i) allow diverse source of experimental data and metadata to be systematically incorporated into the model with alternative ways to calculate the likelihood function and (ii) allow sensitive selection of biologically meaningful biochemical reaction databases as Dirichlet-categorical prior distribution. Additionally, to ensure result interpretation by system biologists, we display the annotation in a network where observed mass peaks are connected if their candidate metabolites are substrate/product of known biochemical reactions. This graph can be overlaid with other graph-based analysis, such as partial correlation networks, in a visualization scheme exported to Cytoscape, with web and stand-alone versions. Availability and implementation: ProbMetab was implemented in a modular manner to fit together with established upstream (xcms, CAMERA, AStream, mzMatch.R, etc) and downstream R package tools (GeneNet, RCytoscape, DiffCorr, etc). ProbMetab, along with extensive documentation and case studies, is freely available under GNU license at: http://labpib.fmrp.usp.br/methods/probmetab/. Contact: email@example.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24443383
It is very important in the field of bioinformatics to apply computational methods to the functional annotation of newly sequenced biological sequences. Based on the GO database and the BLAST program, a novel method for the functional annotation of new biological sequences is presented using variable-precision rough set theory. The proposed method is applied to real data in the GO database to examine its effectiveness. Numerical results show that the proposed method achieves better precision, recall rate and harmonic mean value compared with existing methods.
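The evaluation measures named above (precision, recall rate, and their harmonic mean, commonly called F1) are straightforward to compute for set-valued annotations. A minimal sketch, with the GO terms below serving purely as illustrative data:

```python
def precision_recall_f1(predicted, reference):
    """Score a predicted set of annotation terms against a reference set.

    precision = fraction of predicted terms that are correct;
    recall    = fraction of reference terms that were predicted;
    f1        = harmonic mean of precision and recall.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

# one predicted term correct, one wrong; one reference term missed
p, r, f = precision_recall_f1({"GO:0003677", "GO:0005634"},
                              {"GO:0003677", "GO:0006355"})
```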
Part I of this annotated bibliography deals with books and articles on such topics as achievement motivation, process education, transactional analysis, discipline without punishment, role-playing, interpersonal skills, self-acceptance, moral education, self-awareness, values clarification, and non-verbal communication. Part II focuses on…
Thrond, Mary A.
The annotated bibliography contains citations of nine references on alternative student assessment methods in second language programs, particularly at the secondary school level. The references include a critique of conventional reading comprehension assessment, a discussion of performance assessment, a proposal for a multi-trait, multi-method…
McKenna, Bernard H.; And Others
In his introduction to the 86-item annotated bibliography by Mueller and Poliakoff, McKenna discusses his views on teacher evaluation and his impressions of the documents cited. He observes, in part, that the current concern is with the process of evaluation and that most researchers continue to believe that student achievement is the most…
Large amounts of genome sequence data are available and much more will become available in the near future. A DNA sequence alone has, however, limited use. Genome annotation is required to assign biological interpretation to the DNA sequence. This thesis describ
Dickerson, LaVerne Thornton, Comp.
This annotated bibliography focuses on recent publications dealing with factors that influence child growth and development, rather than the developmental processes themselves. Topics include: general sources on child development; physical and perceptual-motor development; cognitive development; social and personality development; and play.…
This research paper consists of an annotated bibliography about Nikos Kazantzakis, one of the major modern Greek writers and author of "The Last Temptation of Christ,""Zorba the Greek," and many other works. Because of Kazantzakis' position in world literature there are many critical works about him; however, bibliographical control of these works…
Biofuel Database (Web, free access) This database brings together structural, biological, and thermodynamic data for enzymes that are either in current use or are being considered for use in the production of biofuels.
National Oceanic and Atmospheric Administration, Department of Commerce — This excel spreadsheet is the result of merging at the port level of several of the in-house fisheries databases in combination with other demographic databases such...
Elias, Micheline; Bezerianos, Anastasia
Annotations have been identified as an important aid in analysis record-keeping and, recently, data discovery. In this paper we discuss the use of annotations on visualization dashboards, with a special focus on business intelligence (BI) analysis. In-depth interviews with experts led to new annotation needs for multi-chart visualization systems, on which we based the design of a dashboard prototype that supports data- and context-aware annotations. We focus particularly ...
Muneesawang, Paisarn; Guan, Ling
This book explores multimedia applications that emerged from computer vision and machine learning technologies. These state-of-the-art applications include MPEG-7, interactive multimedia retrieval, multimodal fusion, annotation, and database re-ranking. The application-oriented approach maximizes reader understanding of this complex field. Established researchers explain the latest developments in multimedia database technology and offer a glimpse of future technologies. The authors emphasize the crucial role of innovation, inspiring users to develop new applications in multimedia technologies
David, O.; Lloyd, W.; Carlson, J.; Leavesley, G. H.; Geter, F.
The popular programming languages Java and C# provide annotations, a form of meta-data construct. Software frameworks for web integration, web services, database access, and unit testing now take advantage of annotations to reduce the complexity of APIs and the quantity of integration code between the application and framework infrastructure. Adopting annotation features in frameworks has been observed to lead to cleaner and leaner application code. The USDA Object Modeling System (OMS) version 3.0 fully embraces the annotation approach and additionally defines a meta-data standard for components and models. In version 3.0 framework/model integration previously accomplished using API calls is now achieved using descriptive annotations. This enables the framework to provide additional functionality non-invasively such as implicit multithreading, and auto-documenting capabilities while achieving a significant reduction in the size of the model source code. Using a non-invasive methodology leads to models and modeling components with only minimal dependencies on the modeling framework. Since models and modeling components are not directly bound to framework by the use of specific APIs and/or data types they can more easily be reused both within the framework as well as outside of it. To study the effectiveness of an annotation based framework approach with other modeling frameworks, a framework-invasiveness study was conducted to evaluate the effects of framework design on model code quality. A monthly water balance model was implemented across several modeling frameworks and several software metrics were collected. The metrics selected were measures of non-invasive design methods for modeling frameworks from a software engineering perspective. It appears that the use of annotations positively impacts several software quality measures. In a next step, the PRMS model was implemented in OMS 3.0 and is currently being implemented for water supply forecasting in the
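The OMS 3.0 approach described above replaces framework API calls with declarative metadata attached to model components. Python has no Java-style annotations, but decorators can mimic the pattern; the sketch below is an analogy only, and the decorator names and the component are hypothetical, not the actual OMS API:

```python
# Analogy to annotation-based framework integration: metadata is attached
# declaratively to a model component, and the framework discovers it by
# reflection instead of requiring API calls inside the model code.
# All names here are illustrative, not part of OMS.
def description(text):
    def wrap(obj):
        obj._meta = getattr(obj, "_meta", {})
        obj._meta["description"] = text
        return obj
    return wrap

def units(**fields):
    def wrap(obj):
        obj._meta = getattr(obj, "_meta", {})
        obj._meta["units"] = fields
        return obj
    return wrap

@description("Monthly water balance component")
@units(precip="mm", runoff="mm")
class WaterBalance:
    def execute(self, precip):
        return 0.6 * precip  # toy runoff coefficient, for illustration

# The framework reads metadata without the model calling into it:
meta = WaterBalance._meta
```

Because the component carries no framework imports or API calls, it stays reusable outside the framework, which is the non-invasiveness property the study measures.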
Cam Margaret C
Full Text Available Abstract Background Extracting biological information from high-density Affymetrix arrays is a multi-step process that begins with the accurate annotation of microarray probes. Shortfalls in the original Affymetrix probe annotation have been described; however, few studies have provided rigorous solutions for routine data analysis. Results Using AceView, a comprehensive human transcript database, we have reannotated the probes by matching them to RNA transcripts instead of genes. Based on this transcript-level annotation, a new probe set definition was created in which every probe in a probe set maps to a common set of AceView gene transcripts. In addition, using artificial data sets we identified that a minimal probe set size of 4 is necessary for reliable statistical summarization. We further demonstrate that applying the new probe set definition can detect specific transcript variants contributing to differential expression and it also improves cross-platform concordance. Conclusion We conclude that our transcript-level reannotation and redefinition of probe sets complement the original Affymetrix design. Redefinitions introduce probe sets whose sizes may not support reliable statistical summarization; therefore, we advocate using our transcript-level mapping redefinition in a secondary analysis step rather than as a replacement. Knowing which specific transcripts are differentially expressed is important to properly design probe/primer pairs for validation purposes. For convenience, we have created custom chip-description-files (CDFs) and annotation files for our new probe set definitions that are compatible with Bioconductor, Affymetrix Expression Console or third party software.
Mahdaviani, Kaveh; Tellambura, Chintha
In this paper, an extension of raptor codes is introduced which keeps all the desirable properties of raptor codes, including the linear complexity of encoding and decoding per information bit, unchanged. The new design, however, improves the performance in terms of the reception rate. Our simulations show a 10% reduction in the needed overhead at the benchmark block length of 64,520 bits and with the same complexity per information bit.
Full Text Available Abstract Background The ability to accurately predict enzymatic functions is an essential prerequisite for the interpretation of cellular functions, and the reconstruction and analysis of metabolic models. Several biological databases exist that provide such information. However, in many cases these databases provide partly different and inconsistent genome annotations. Description We analysed nine prokaryotic genomes and found about 70% inconsistencies in the enzyme predictions of the main annotation resources. Therefore, we implemented the annotation pipeline EnzymeDetector. This tool automatically compares and evaluates the assigned enzyme functions from the main annotation databases and supplements them with its own function prediction. This is based on a sequence similarity analysis, on manually created organism-specific enzyme information from BRENDA (Braunschweig Enzyme Database), and on sequence pattern searches. Conclusions EnzymeDetector provides a fast and comprehensive overview of the available enzyme function annotations for a genome of interest. The web interface allows the user to work with customisable weighting schemes and cut-offs for the different prediction methods. These customised quality criteria can easily be applied, and the resulting annotation can be downloaded. The summarised view of all used annotation sources provides up-to-date information. Annotation errors that occur in only one of the databases can be recognised (because of their low relevance score). The results are stored in a database and can be accessed at http://enzymedetector.tu-bs.de.
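The customisable weighting scheme idea, combining per-source enzyme predictions with user-defined weights and a cut-off, can be sketched generically. The source names, weights, cut-off, and EC numbers below are illustrative assumptions, not EnzymeDetector's actual configuration:

```python
# Generic weighted-vote consensus over per-source enzyme predictions,
# sketching the kind of customisable weighting scheme described above.
# Weights, source names, and the cut-off are hypothetical examples.
WEIGHTS = {"brenda_manual": 3.0, "refseq": 1.0, "kegg": 1.0, "pattern_search": 0.5}
CUTOFF = 1.5

def consensus(predictions, weights=WEIGHTS, cutoff=CUTOFF):
    """predictions: mapping of source name -> EC number predicted by that source.

    Returns the highest-scoring EC number, or None if no candidate reaches
    the cut-off (e.g. a prediction supported only by a low-weight source).
    """
    scores = {}
    for source, ec in predictions.items():
        scores[ec] = scores.get(ec, 0.0) + weights.get(source, 0.0)
    best_ec, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_ec if best_score >= cutoff else None

ec = consensus({"brenda_manual": "1.1.1.1", "refseq": "1.1.1.1", "kegg": "2.7.1.1"})
```

An annotation supported by only one low-weight source falls below the cut-off, which is how single-database errors get flagged by their low relevance score.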
The Internet and electronic commerce (e-commerce) generate lots of data. Data must be stored, organized, and managed. Database administrators, or DBAs, work with database software to find ways to do this. They identify user needs, set up computer databases, and test systems. They ensure that systems perform as they should and add people to the…
Wiley, Laura K; Sivley, R Michael; Bush, William S
Efficient storage and retrieval of genomic annotations based on range intervals is necessary, given the amount of data produced by next-generation sequencing studies. The indexing strategies of relational database systems (such as MySQL) greatly inhibit their use in genomic annotation tasks. This has led to the development of stand-alone applications that are dependent on flat-file libraries. In this work, we introduce MyNCList, an implementation of the NCList data structure within a MySQL database. MyNCList enables the storage, update and rapid retrieval of genomic annotations from the convenience of a relational database system. Range-based annotations of 1 million variants are retrieved in under a minute, making this approach feasible for whole-genome annotation tasks. Database URL: https://github.com/bushlab/mynclist.
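Independent of the NCList internals, the operation MyNCList accelerates is the relational range-overlap query. A minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL (the schema, table names, and data are hypothetical):

```python
import sqlite3

# A hypothetical annotation table queried for range overlap - the operation
# MyNCList speeds up. Plain B-tree indexes handle this two-sided interval
# predicate poorly at genome scale, which motivates NCList-style indexing.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE annotation (id TEXT, chrom TEXT, start INT, stop INT)")
db.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?)",
    [("geneA", "chr1", 100, 500),
     ("geneB", "chr1", 450, 900),
     ("geneC", "chr2", 100, 500)],
)

def overlapping(chrom, qstart, qstop):
    # two intervals overlap iff each one starts before the other ends
    rows = db.execute(
        "SELECT id FROM annotation WHERE chrom = ? AND start < ? AND stop > ?",
        (chrom, qstop, qstart),
    )
    return sorted(row[0] for row in rows)

hits = overlapping("chr1", 480, 600)
```

The query itself is trivial; the hard part at whole-genome scale is making the `start < ? AND stop > ?` predicate index-friendly, which is exactly what nesting-aware structures like NCList provide.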
Tsikrika, T.; Diou, C.; Vries, A.P. de; Delopoulos, A.
We examine the reliability of clickthrough data as concept-based image annotations, by comparing them against manual annotations, for different concept categories. Our analysis shows that, for many concepts, the image annotations generated by using clickthrough data are reliable, with up to 90% of t
Wang, X.J.; Zhang, L.; Li, X.; Ma, W.Y.
Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results
On the basis of the experience acquired in the course of the ANNODIS project, the following questions are discussed: what is the annotation campaign for? building an annotated "reference corpus" vs. annotation as an experiment; defining annotation tasks: naïve vs. expert annotation; the annotation manual: from linguistic model to annotation protocol; automatic pre-processing vs. manual annotation: segmentation, tagging and mark-ups as steps in corpus preparation; ...
Tan-de Ramos, Jennifer
The study examines the effect of teaching strategies to improved writing of students in the tertiary level. Specifically, three teaching approaches--the use of modelling, grammar-based, and information element-focused--were tested on their effect on the writing of annotated bibliography in three research classes at a university in Manila.…
Ang, Darwin N; Behrns, Kevin E
The emphasis on high-quality care has spawned the development of quality programs, most of which focus on broad outcome measures across a diverse group of providers. Our aim was to investigate the clinical outcomes for a department of surgery with multiple service lines of patient care using a relational database. Mortality, length of stay (LOS), patient safety indicators (PSIs), and hospital-acquired conditions were examined for each service line. Expected values for mortality and LOS were derived from University HealthSystem Consortium regression models, whereas expected values for PSIs were derived from Agency for Healthcare Research and Quality regression models. Overall, 5200 patients were evaluated from the months of January through May of both 2011 (n = 2550) and 2012 (n = 2650). The overall observed-to-expected (O/E) ratio of mortality improved from 1.03 to 0.92. The overall O/E ratio for LOS improved from 0.92 to 0.89. PSIs that predicted mortality included postoperative sepsis (O/E:1.89), postoperative respiratory failure (O/E:1.83), postoperative metabolic derangement (O/E:1.81), and postoperative deep vein thrombosis or pulmonary embolus (O/E:1.8). Mortality and LOS can be improved by using a relational database with outcomes reported to specific service lines. Service line quality can be influenced by distribution of frequent reports, group meetings, and service line-directed interventions.
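The observed-to-expected (O/E) ratio used throughout the abstract above is a simple quotient of observed events against the count a risk-adjustment model predicts; values below 1.0 indicate better-than-expected performance. A minimal sketch, with illustrative numbers rather than the study's raw counts:

```python
def observed_to_expected(observed_events, expected_events):
    """O/E ratio: < 1.0 is better than expected, > 1.0 is worse."""
    if expected_events <= 0:
        raise ValueError("expected events must be positive")
    return observed_events / expected_events

# Illustrative counts only: a service line with 23 observed deaths against
# 25 expected by the regression model yields O/E = 0.92, matching the kind
# of overall mortality figure reported above.
oe = observed_to_expected(23, 25.0)
```

In practice the expected counts come from regression models (University HealthSystem Consortium for mortality and LOS, AHRQ for PSIs), so the ratio is only as good as the underlying risk adjustment.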
Standage Daniel S
Full Text Available Abstract Background Accurate gene structure annotation is a fundamental but somewhat elusive goal of genome projects, as witnessed by the fact that (model) genomes typically undergo several cycles of re-annotation. In many cases, it is not only different versions of annotations that need to be compared but also different sources of annotation of the same genome, derived from distinct gene prediction workflows. Such comparisons are of interest to annotation providers, prediction software developers, and end-users, who all need to assess what is common and what is different among distinct annotation sources. We developed ParsEval, a software application for pairwise comparison of sets of gene structure annotations. ParsEval calculates several statistics that highlight the similarities and differences between the two sets of annotations provided. These statistics are presented in an aggregate summary report, with additional details provided as individual reports specific to non-overlapping, gene-model-centric genomic loci. Genome browser styled graphics embedded in these reports help visualize the genomic context of the annotations. Output from ParsEval is both easily read and parsed, enabling systematic identification of problematic gene models for subsequent focused analysis. Results ParsEval is capable of analyzing annotations for large eukaryotic genomes on typical desktop or laptop hardware. In comparison to existing methods, ParsEval exhibits a considerable performance improvement, both in terms of runtime and memory consumption. Reports from ParsEval can provide relevant biological insights into the gene structure annotations being compared. Conclusions Implemented in C, ParsEval provides the quickest and most feature-rich solution for genome annotation comparison to date. The source code is freely available (under an ISC license) at http://parseval.sourceforge.net/.
Full Text Available The international FANTOM consortium aims to produce a comprehensive picture of the mammalian transcriptome, based upon an extensive cDNA collection and functional annotation of full-length enriched cDNAs. The previous dataset, FANTOM2, comprised 60,770 full-length enriched cDNAs. Functional annotation revealed that this cDNA dataset contained only about half of the estimated number of mouse protein-coding genes, indicating that a number of cDNAs still remained to be collected and identified. To pursue the complete gene catalog that covers all predicted mouse genes, cloning and sequencing of full-length enriched cDNAs has been continued since FANTOM2. In FANTOM3, 42,031 newly isolated cDNAs were subjected to functional annotation, and the annotation of 4,347 FANTOM2 cDNAs was updated. To accomplish accurate functional annotation, we improved our automated annotation pipeline by introducing new coding sequence prediction programs and developed a Web-based annotation interface for simplifying the annotation procedures to reduce manual annotation errors. Automated coding sequence and function prediction was followed with manual curation and review by expert curators. A total of 102,801 full-length enriched mouse cDNAs were annotated. Out of 102,801 transcripts, 56,722 were functionally annotated as protein coding (including partial or truncated transcripts), providing to our knowledge the greatest current coverage of the mouse proteome by full-length cDNAs. The total number of distinct non-protein-coding transcripts increased to 34,030. The FANTOM3 annotation system, consisting of automated computational prediction, manual curation, and final expert curation, facilitated the comprehensive characterization of the mouse transcriptome, and could be applied to the transcriptomes of other species.
Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.
A rapidly increasing number of microbial genomes are being sequenced by organizations worldwide and eventually included in various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included in IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow users to address annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used to improve the annotations of about 150 microbial genomes.
Pick, Sebastian; Weyers, Benjamin; Hentschel, Bernd; Kuhlen, Torsten W
Data annotation finds increasing use in Virtual Reality applications with the goal of supporting the data analysis process, for example in architectural reviews. In this context, a variety of annotation systems for immersive virtual environments have been presented. While many interesting interaction designs for the data annotation workflow have emerged from them, important details and evaluations are often omitted. In particular, we observe that the process of handling metadata to interactively create and manage complex annotations is often not covered in detail. In this paper, we strive to improve this situation by focusing on the design of data annotation workflows and their evaluation. We propose a workflow design that facilitates the most important annotation operations, i.e., annotation creation, review, and modification. Our workflow design is easily extensible in terms of supported annotation and metadata types as well as interaction techniques, which makes it suitable for a variety of application scenarios. To evaluate it, we conducted a user study in a CAVE-like virtual environment in which we compared our design to two alternatives on a realistic annotation creation task. Our design obtained good results in terms of task performance and user experience.
Wyman, Stacia; Jansen, Robert K.; Boore, Jeffrey L.
Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that allows the use of comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of specified length can be extracted, as well as the nucleotide sequences and amino acid sequences of the genes.
Satoshi Tamaki; Kazuharu Arakawa; Nobuaki Kono; Masaru Tomita
Annotations of complete genome sequences submitted directly from sequencing projects are diverse in terms of annotation strategies and update frequencies. These inconsistencies make comparative studies difficult. To allow rapid data preparation of a large number of complete genomes, automation and speed are important for genome re-annotation. Here we introduce an open-source rapid genome re-annotation software system, Restauro-G, specialized for bacterial genomes. Restauro-G re-annotates a genome by similarity searches utilizing the BLAST-Like Alignment Tool, referring to protein databases such as UniProt KB, NCBI nr, NCBI COGs, Pfam, and PSORTb. Re-annotation by Restauro-G achieved over 98% accuracy for most bacterial chromosomes in comparison with the original manually curated annotation of EMBL releases. Restauro-G was developed in the generic bioinformatics workbench G-language Genome Analysis Environment and is distributed at http://restauro-g.iab.keio.ac.jp/ under the GNU General Public License.
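The core idea behind similarity-based re-annotation as described for Restauro-G can be sketched as best-hit annotation transfer. The sketch below assumes precomputed alignment hits (the dictionaries and product names are invented; the real tool runs BLAT-like alignments against UniProtKB, NCBI nr, COGs, Pfam and PSORTb):

```python
# Hedged sketch of best-hit annotation transfer, the core idea behind
# Restauro-G's similarity-search re-annotation (data structures invented).

def transfer_annotations(queries, hits, min_identity=0.9):
    """For each query gene, copy the product name of its best-scoring
    reference hit, provided identity passes the threshold."""
    annotated = {}
    for gene in queries:
        candidates = [h for h in hits.get(gene, []) if h["identity"] >= min_identity]
        if candidates:
            best = max(candidates, key=lambda h: h["bitscore"])
            annotated[gene] = best["product"]
        else:
            # no trustworthy hit: fall back to a non-committal product name
            annotated[gene] = "hypothetical protein"
    return annotated

hits = {"g1": [{"identity": 0.98, "bitscore": 410.0, "product": "DNA gyrase subunit A"},
               {"identity": 0.92, "bitscore": 250.0, "product": "topoisomerase"}],
        "g2": [{"identity": 0.40, "bitscore": 60.0, "product": "unknown"}]}
print(transfer_annotations(["g1", "g2"], hits))
```

In practice such pipelines consult several reference databases in priority order; this shows only the single-database decision step.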
Hart, Steven N; Duffy, Patrick; Quest, Daniel J; Hossain, Asif; Meiners, Mike A; Kocher, Jean-Pierre
Next-generation sequencing platforms are widely used to discover variants associated with disease. The processing of sequencing data involves read alignment, variant calling, variant annotation and variant filtering. The standard file format to hold variant calls is the variant call format (VCF) file. According to the format specifications, any arbitrary annotation can be added to the VCF file for downstream processing. However, most downstream analysis programs disregard annotations already present in the VCF and re-annotate variants using the annotation provided by that particular program. This precludes investigators who have collected information on variants from literature or other sources from including these annotations in the filtering and mining of variants. We have developed VCF-Miner, a graphical user interface-based stand-alone tool, to mine variants and annotation stored in the VCF. Powered by a MongoDB database engine, VCF-Miner enables the stepwise trimming of non-relevant variants. The grouping feature implemented in VCF-Miner can be used to identify somatic variants by contrasting variants in tumor and in normal samples or to identify recessive/dominant variants in family studies. It is not limited to human data, but can also be extended to include non-diploid organisms. It also supports copy number or any other variant type supported by the VCF specification. VCF-Miner can be used on a personal computer or large institutional servers and is freely available for download from http://bioinformaticstools.mayo.edu/research/vcf-miner/.
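The "stepwise trimming of non-relevant variants" that VCF-Miner performs through its GUI can be illustrated with a minimal sketch over the VCF INFO column (the records, INFO keys and thresholds below are invented examples, not VCF-Miner's internals):

```python
# Hedged sketch of stepwise variant filtering on annotations already present
# in a VCF, in the spirit of VCF-Miner's trimming workflow.

def parse_info(info_field):
    """Turn 'DP=42;SOMATIC;AF=0.31' into a dict (bare flags map to True)."""
    out = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        out[key] = value if value else True
    return out

def filter_variants(vcf_lines, predicate):
    """Keep data lines whose parsed INFO dict satisfies `predicate`."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if predicate(parse_info(fields[7])):   # column 8 is INFO
            yield fields

vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\tPASS\tDP=42;SOMATIC;AF=0.31",
    "1\t2000\t.\tC\tT\t20\tPASS\tDP=8;AF=0.05",
]
# One trimming step: keep deep, somatic calls only.
kept = list(filter_variants(vcf, lambda i: "SOMATIC" in i and int(i["DP"]) >= 20))
print(len(kept))  # 1
```

Successive predicates of this form, applied one after another, give the stepwise narrowing the abstract describes.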
Full Text Available Abstract Background Olea europaea L. is a traditional tree crop of the Mediterranean basin with high economic impact worldwide. Unlike other fruit tree species, little is known about the physiological and molecular basis of olive fruit development, and few sequences of genes and gene products are available for olive in public databases. This study deals with the identification of large sets of differentially expressed genes in developing olive fruits and their subsequent computational annotation by means of different software. Results mRNA from fruits of the cv. Leccino sampled at three different stages [i.e., initial fruit set (stage 1), completed pit hardening (stage 2) and veraison (stage 3)] was used for the identification of differentially expressed genes putatively involved in the main processes of fruit development. Four subtractive hybridization libraries were constructed: forward and reverse between stages 1 and 2 (libraries A and B), and stages 2 and 3 (libraries C and D). All sequenced clones (1,132 in total) were analyzed through BlastX against non-redundant NCBI databases, and about 60% of them showed similarity to known proteins. A total of 89 out of 642 differentially expressed unique sequences was further investigated by Real-Time PCR, validating the SSH results at a rate as high as 69%. Library-specific cDNA repertories were annotated according to the three main vocabularies of the Gene Ontology (GO): cellular component, biological process and molecular function. BlastX analysis, GO term mapping and annotation analysis were performed using the Blast2GO software, a research tool designed with the main purpose of enabling GO-based data mining on sequence sets for which no GO annotation is yet available. Bioinformatic analysis pointed out a significantly different distribution of the annotated sequences for each GO category when comparing the three fruit developmental stages. The olive fruit-specific transcriptome dataset was
Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Tennessen, Kristin; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C
The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provided via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation is followed by functional annotation including assignment of protein product names and connection to various protein family databases.
Jia Chen; Yi-He Zhu; Hao-Fen Wang; Wei Jin; Yong Yu
The vast amount of images available on the Web calls for an effective and efficient search service to help users find relevant images. The prevalent approach is to provide a keyword interface for users to submit queries. However, the number of images without any tags or annotations is beyond the reach of manual efforts. To overcome this, automatic image annotation techniques have emerged, which are generally a process of selecting a suitable set of tags for a given image without user intervention. There are three main challenges with respect to Web-scale image annotation: scalability, noise-resistance and diversity. Scalability has a twofold meaning: first, an automatic image annotation system should be scalable with respect to billions of images on the Web; second, it should be able to automatically identify several relevant tags among a huge tag set for a given image within seconds or even faster. Noise-resistance means that the system should be robust against typos and ambiguous terms used in tags. Diversity reflects that image content may include both scenes and objects, which are further described by multiple different image features constituting different facets in annotation. In this paper, we propose a unified framework to tackle the above three challenges for automatic Web image annotation. It mainly involves two components: tag candidate retrieval and multi-facet annotation. In the former, content-based indexing and a concept-based codebook are leveraged to solve the scalability and noise-resistance issues. In the latter, a joint feature map has been designed to describe different facets of tags in annotations and the relations between these facets. A tag graph is adopted to represent the tags in the entire annotation, and a structured learning technique is employed to construct a learning model on top of the tag graph based on the generated joint feature map. Millions of images from Flickr are used in our evaluation. Experimental results show that we have achieved 33% performance
Amadio, Roberto M; Régis-Gianas, Yann; Saillard, Ronan
We discuss the problem of building a compiler which can lift, in a provably correct way, pieces of information on the execution cost of the object code to cost annotations on the source code. To this end, we need a clear and flexible picture of: (i) the meaning of cost annotations, (ii) the method to prove them sound and precise, and (iii) the way such proofs can be composed. We propose a so-called labelling approach to these three questions. As a first step, we examine its application to a toy compiler. This formal study suggests that the labelling approach has good compositionality and scalability properties. In order to provide further evidence for this claim, we report our successful experience in implementing and testing the labelling approach on top of a prototype compiler written in OCaml for (a large fragment of) the C language.
CARROLL, JOHN; Minnen, Guido; Briscoe, Ted
We describe a recently developed corpus annotation scheme for evaluating parsers that avoids shortcomings of current methods. The scheme encodes grammatical relations between heads and dependents, and has been used to mark up a new public-domain corpus of naturally occurring English text. We show how the corpus can be used to evaluate the accuracy of a robust parser, and relate the corpus to extant resources.
Full Text Available Abstract Background Subcellular localization information is one of the key features in protein function research. Locating to a specific subcellular compartment is essential for a protein to function efficiently. Proteins which have multiple localizations provide further clues. Such proteins may account for a high proportion, possibly more than 35%. Description We have developed a database of proteins with multiple subcellular localizations, designated DBMLoc. The initial release contains 10470 multiple subcellular localization-annotated entries. Annotations are collected from primary protein databases, specific subcellular localization databases and literature texts. All the protein entries are cross-referenced to GO annotations and SwissProt. Protein-protein interactions are also annotated. They are classified into 12 large subcellular localization categories based on the GO hierarchical architecture and original annotations. Download, search and sequence BLAST tools are also available on the website. Conclusion DBMLoc is a protein database which collects proteins with more than one subcellular localization annotation. It is freely accessible at http://www.bioinfo.tsinghua.edu.cn/DBMLoc/index.htm.
Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J; Foster, Ian T; Gilliam, T Conrad; Maltsev, Natalia
Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform.
Full Text Available With the increasing number of web databases, a major part of the deep web consists of database content. In many search engines, encoded data in the result pages returned from the web often comes from structured databases, which are referred to as Web databases (WDB). A result page returned from a WDB contains multiple search result records (SRR). Data units obtained from these databases are encoded into the dynamic result pages for manual processing. In order to make these units machine-processable, relevant information is extracted and data labels are assigned meaningfully. In this paper, feature ranking is proposed to extract the relevant information from features extracted from a WDB. Feature ranking is practical for enhancing understanding of the data and identifying relevant features. This research explores the performance of the feature ranking process by using linear support vector machines with various features of the WDB database for annotation of relevant results. Experimental results show that the proposed system provides better results when compared with earlier methods.
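Feature ranking with a linear SVM typically means ordering features by the magnitude of the learned weights. As a hedged, self-contained stand-in for a library SVM, the sketch below trains a tiny linear classifier with plain SGD on the hinge loss and ranks features by |weight|; the data is an invented toy example, not the paper's WDB features.

```python
# Hedged sketch of SVM-weight feature ranking: train a minimal linear
# classifier with hinge loss (a stand-in for the paper's linear SVM) and
# rank features by the magnitude of their learned weights.

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Plain SGD on the regularized hinge loss; returns the weight vector."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            for j in range(len(w)):
                # subgradient: L2 penalty always, hinge term only if margin < 1
                grad = lam * w[j] - (yi * xi[j] if margin < 1 else 0.0)
                w[j] -= lr * grad
    return w

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[2.0, 0.1], [1.5, -0.2], [-2.0, 0.0], [-1.8, 0.3]]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
ranking = sorted(range(len(w)), key=lambda j: -abs(w[j]))
print(ranking)  # most informative feature first
```

With a library such as scikit-learn the same ranking would come from the fitted model's coefficient vector; the hand-rolled loop just keeps the sketch dependency-free.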
Panitz, Frank; Stengaard, Henrik; Hornshoj, Henrik
Non-synonymous SNPs were analyzed for their potential effect on the protein structure/function using the PolyPhen and SIFT prediction programs. Predicted SNPs and annotations are stored in a web-based database. Using MAVIANT, SNPs can be visually verified based on the DNA sequencing traces. A subset of candidate SNPs...
Grötzinger, Stefan W; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B; Stingl, Ulrich; Eppinger, Jörg
Reliable functional annotation of genomic data is the key-step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile's genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxons of bacteria and archaea were selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable, if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available
Grötzinger, Stefan W.
Reliable functional annotation of genomic data is the key-step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile's genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxons of bacteria and archaea were selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable, if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available
Stefan Wolfgang Grötzinger
Full Text Available Reliable functional annotation of genomic data is the key-step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile's genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the INDIGO data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile & Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2,577 E.C. numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxons of bacteria and archaea were selected from 6 different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable, if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website.
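The PPM reliability rule described above, that a hit counts as reliable only when at least two relevant descriptors (GO-term profile hits and/or PROSITE pattern hits) support it, can be sketched directly. The GO and PROSITE identifiers below are invented placeholders, not the paper's actual descriptor lists.

```python
# Hedged sketch of the PPM two-descriptor decision rule: an annotation is
# reliable only if at least two relevant descriptors support it.

RELEVANT_GO = {"GO:0016787", "GO:0004252"}         # profile filter (invented IDs)
RELEVANT_PROSITE = {"PS00137", "PS00138"}          # pattern filter (invented IDs)

def is_reliable(go_hits, prosite_hits, min_descriptors=2):
    """Count relevant GO-term and PROSITE descriptors supporting a gene."""
    support = (set(go_hits) & RELEVANT_GO) | (set(prosite_hits) & RELEVANT_PROSITE)
    return len(support) >= min_descriptors

print(is_reliable({"GO:0016787"}, {"PS00137"}))    # True: two descriptors agree
print(is_reliable({"GO:0016787"}, {"PS99999"}))    # False: only one descriptor
```

Requiring agreement between two independent evidence types (profile and pattern) is what filters out the single-source false positives the abstract warns about.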
Christiansen, Christian Fynbo; Møller, Morten Hylander; Nielsen, Henrik
AIM OF DATABASE: The aim of this database is to improve the quality of care in Danish intensive care units (ICUs) by monitoring key domains of intensive care and comparing these with predefined standards. STUDY POPULATION: The Danish Intensive Care Database (DID) was established in 2007...
Gardner, Paul P; Daub, Jennifer; Tate, John G
Rfam is a collection of RNA sequence families, represented by multiple sequence alignments and covariance models (CMs). The primary aim of Rfam is to annotate new members of known RNA families on nucleotide sequences, particularly complete genomes, using sensitive BLAST filters in combination...... with CMs. A minority of families with a very broad taxonomic range (e.g. tRNA and rRNA) provide the majority of the sequence annotations, whilst the majority of Rfam families (e.g. snoRNAs and miRNAs) have a limited taxonomic range and provide a limited number of annotations. Recent improvements...
It is normal practice today for organizations to store large quantities of records of related information as computer-based files or databases. Purposeful information is retrieved by performing queries on the data sets. The purpose of DATABASE MANAGER is to communicate to students the method by which the computer performs these queries. This…
Ko, Byoung Chul; Lee, JiHyeon; Nam, Jae-Yeal
This paper presents a novel multiple-keyword annotation method for medical images, keyword-based medical image retrieval, and a relevance feedback method for enhancing image retrieval performance. For semantic keyword annotation, this study proposes a novel medical image classification method combining local wavelet-based center symmetric-local binary patterns with random forests. For keyword-based image retrieval, our retrieval system uses a confidence score that is assigned to each annotated keyword by combining the probabilities of random forests with a predefined body relation graph. To overcome the limitations of keyword-based image retrieval, we combine our image retrieval system with a relevance feedback mechanism based on visual features and a pattern classifier. Compared with other annotation and relevance feedback algorithms, the proposed method shows both improved annotation performance and accurate retrieval results.
Galperin, Michael Y; Fernández-Suárez, Xosé M
The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the http://biosharing.org/biodbcore web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).
Craig Douglas B
Full Text Available Abstract Background The rabbit is an important model organism used in a wide range of biomedical research. However, the rabbit genome is still sparsely annotated, thus prohibiting extensive functional analysis of gene sets derived from whole-genome experiments. We developed a web-based application that provides augmented annotation and orthologue analysis for rabbit genes. Importantly, the application allows comprehensive functional analysis through the use of orthologous relationships. Results Using data extracted from several public bioinformatics repositories we created Better Bunny, a database and query tool that extensively augments the available functional annotation for rabbit genes. Using the complete set of target genes from a commercial rabbit gene expression microarray as our benchmark, we are able to obtain functional information for 88 % of the genes on the microarray. Previously, functional information was available for fewer than 10 % of the rabbit genes. Conclusions We have developed a freely available, web-accessible bioinformatics tool that enables investigators to quickly and easily perform extensive functional analysis of rabbit genes (http://cptweb.cpt.wayne.edu. The software application fills a critical void for a wide range of biomedical research that relies on the rabbit model and requires characterization of biological function for large sets of genes.
Hignette, Gaëlle; Buche, Patrice; Couvert, Olivier; Dibie-Barthélemy, Juliette; Doussot, David; Haemmerlé, Ollivier; Mettler, Eric; Soler, Lydie
A preliminary step to risk in food assessment is the gathering of experimental data. In the framework of the Sym'Previus project (http://www.symprevius.org), a complete data integration system has been designed, grouping data provided by industrial partners and data extracted from papers published in the main scientific journals of the domain. Those data have been classified by means of a predefined vocabulary, called ontology. Our aim is to complement the database with data extracted from the Web. In the framework of the WebContent project (www.webcontent.fr), we have designed a semi-automatic acquisition tool, called @WEB, which retrieves scientific documents from the Web. During the @WEB process, data tables are extracted from the documents and then annotated with the ontology. We focus on the data tables as they contain, in general, a synthesis of data published in the documents. In this paper, we explain how the columns of the data tables are automatically annotated with data types of the ontology and how the relations represented by the table are recognised. We also give the results of our experimentation to assess the quality of such an annotation.
Did you ever read something in a book, felt the need to comment, took up a pencil and scribbled something on the book's text? If you did, you just annotated a book. But that process has now become something fundamental and revolutionary in these days of computing. Annotation is all about adding further information to text, pictures, movies and even to physical objects. In practice, anything which can be identified either virtually or physically can be annotated. In this book, we will delve into what makes annotations, and analyse their significance for the future evolution of the web. We wil
Howard L. Kaufman
Full Text Available Registries evaluating un-randomized patients have provided valuable information with respect to a therapy’s utility, treatment practices, and evolution over time. While immunotherapy for cancer has been around for more than three decades, data collection in the form of a registry has not been undertaken. The authors believe that establishing a registry to study HD IL-2 immunotherapy, which has been the only systemic therapy producing long term unmaintained remissions for advanced kidney cancer and melanoma for over 20 years, will be an important resource in understanding the impact of immunotherapy with HD IL-2 in a rapidly changing therapeutic environment. Optimizing administration and improving selection of appropriate patients likely to benefit from HD IL-2 immunotherapy are two of many benefits to be derived from this endeavor.
Suciu, Dan; Koch, Christop
Probabilistic databases are databases where the value of some attributes or the presence of some records are uncertain and known only with some probability. Applications in many areas such as information extraction, RFID and scientific data management, data cleaning, data integration, and financial risk assessment produce large volumes of uncertain data, which are best modeled and processed by a probabilistic database. This book presents the state of the art in representation formalisms and query processing techniques for probabilistic data. It starts by discussing the basic principles for rep
Database replication is widely used for fault-tolerance, scalability and performance. The failure of one database replica does not stop the system from working, as available replicas can take over the tasks of the failed replica. Scalability can be achieved by distributing the load across all replicas and adding new replicas should the load increase. Finally, database replication can provide fast local access, even for geographically distributed clients, if data copies are located close to them. Despite its advantages, replication is not a straightforward technique to apply, and
Mewes, H W; Frishman, D; Güldener, U; Mannhaupt, G; Mayer, K; Mokrejs, M; Morgenstern, B; Münsterkötter, M; Rudd, S; Weil, B
The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project-specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz-Netzwerk Bioinformatik) networks. The Arabidopsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91-93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155-158; Barker et al. (2001) Nucleic Acids Res., 29, 29-32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de).
Zhang Jun D
Full Text Available Abstract Background A significant portion of the biomedical and chemical literature refers to small molecules. The accurate identification and annotation of compound names that are relevant to the topic of a given paper can establish links between scientific publications and various chemical and life science databases. Manual annotation is the preferred method for this task because well-trained indexers can understand the paper's topic as well as recognize key terms. However, considering the hundreds of thousands of new papers published annually, an automatic annotation system with high precision and relevance can be a useful complement to manual annotation. Results An automated chemical name annotation system, MeSH Automated Annotations (MAA), was developed to annotate small molecule names in scientific abstracts with tunable accuracy. This system aims to reproduce the MeSH term annotations that indexers would create for biomedical and chemical literature. When automated free-text matching was compared to manual indexing of 26 thousand MEDLINE abstracts, more than 40% of the annotations were false-positive (FP) cases. To reduce the FP rate, MAA incorporated several filters to remove "incorrect" annotations caused by nonspecific, partial, and low-relevance chemical names. In part, relevance was measured by the position of the chemical name in the text. Tunable accuracy was obtained by adding or restricting the sections of the text scanned for chemical names. The best precision obtained was 96% with a 28% recall rate. The best performance of MAA, as measured by the F statistic, was 66%, which compares favorably to other chemical name annotation systems. Conclusions Accurate chemical name annotation can help researchers not only identify important chemical names in abstracts, but also match unindexed and unstructured abstracts to chemical records. The current work is tested against MEDLINE, but the algorithm is not specific to this
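The precision, recall and F-statistic figures quoted above follow from the standard set-overlap definitions. A minimal sketch of that computation (the chemical names in the example are hypothetical, not from the paper's evaluation set):

```python
def annotation_scores(predicted, gold):
    """Compute precision, recall and F1 for a set of predicted
    annotations against manually indexed (gold) annotations."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                     # correct annotations
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical abstract: 4 predicted chemical names, 5 manually indexed
p, r, f = annotation_scores(
    {"aspirin", "caffeine", "ethanol", "urea"},
    {"aspirin", "caffeine", "ethanol", "glucose", "heparin"})
# p == 0.75, r == 0.6
```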
Lohse, Marc; Nagel, Axel; Herter, Thomas; May, Patrick; Schroda, Michael; Zrenner, Rita; Tohge, Takayuki; Fernie, Alisdair R; Stitt, Mark; Usadel, Björn
Next-generation technologies generate an overwhelming amount of gene sequence data. Efficient annotation tools are required to make these data amenable to functional genomics analyses. The Mercator pipeline automatically assigns functional terms to protein or nucleotide sequences. It uses the MapMan 'BIN' ontology, which is tailored for functional annotation of plant 'omics' data. The classification procedure performs parallel sequence searches against reference databases, compiles the results and computes the most likely MapMan BINs for each query. In the current version, the pipeline relies on manually curated reference classifications originating from the three reference organisms (Arabidopsis, Chlamydomonas, rice), various other plant species that have a reviewed SwissProt annotation, and more than 2000 protein domain and family profiles at InterPro, CDD and KOG. Functional annotations predicted by Mercator achieve accuracies above 90% when benchmarked against manual annotation. In addition to mapping files for direct use in the visualization software MapMan, Mercator provides graphical overview charts, detailed annotation information in a convenient web browser interface and a MapMan-to-GO translation table to export results as GO terms. Mercator is available free of charge via http://mapman.gabipd.org/web/guest/app/Mercator.
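Mercator's final step, computing the most likely MapMan BIN from compiled parallel-search results, can be caricatured as score aggregation over candidate BINs. The voting scheme, field layout and BIN identifiers below are illustrative only, not Mercator's actual scoring:

```python
from collections import defaultdict

def most_likely_bin(hits):
    """Given (bin_id, bitscore) pairs collected from parallel searches
    against several reference sets, return the BIN with the highest
    cumulative score -- a simplified stand-in for Mercator's procedure."""
    totals = defaultdict(float)
    for bin_id, score in hits:
        totals[bin_id] += score
    return max(totals, key=totals.get) if totals else None

bin_id = most_likely_bin([("27.3.22", 210.0),   # hit in a SwissProt reference
                          ("27.3.22", 180.0),   # hit in an Arabidopsis reference
                          ("35.1", 95.0)])      # weaker domain-profile hit
# bin_id == "27.3.22"
```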
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low-complexity classification algorithm to assign proteins to existing clusters of orthologous groups of proteins. Based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
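The classification step can be pictured as a best-significant-hit assignment: a protein inherits the orthologous group of its strongest search hit, provided the hit is significant. The function below is a hypothetical, much-simplified stand-in for the workflow's PSI-BLAST-based classifier; the field names and cutoff are invented:

```python
def assign_cog(query_hits, evalue_cutoff=1e-5):
    """Assign a protein to the cluster of orthologous groups (COG) of
    its best-scoring significant hit; return None if no hit passes the
    e-value cutoff, leaving the protein unclassified."""
    significant = [h for h in query_hits if h["evalue"] <= evalue_cutoff]
    if not significant:
        return None
    best = min(significant, key=lambda h: h["evalue"])
    return best["cog"]

cog = assign_cog([{"cog": "COG0515", "evalue": 3e-40},
                  {"cog": "COG2815", "evalue": 2e-8},
                  {"cog": "COG1028", "evalue": 0.7}])  # fails the cutoff
# cog == "COG0515"
```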
National Oceanic and Atmospheric Administration, Department of Commerce — This database was established to oversee documents issued in support of fishery research activities including experimental fishing permits (EFP), letters of...
National Oceanic and Atmospheric Administration, Department of Commerce — The dealer reporting databases contain the primary data reported by federally permitted seafood dealers in the northeast. Electronic reporting was implemented May 1,...
Kristensen, Helen Grundtvig; Stjernø, Henrik
Article about a national database for nursing research established at the Danish Institute for Health and Nursing Research. The aim of the database is to gather knowledge about research and development activities within nursing.
The DIGIT (Database of ImmunoGlobulins with Integrated Tools) database (http://biocomputing.it/digit) is an integrated resource storing sequences of annotated immunoglobulin variable domains and enriched with tools for searching and analyzing them. The annotations in the database include information on the type of antigen, the respective germline sequences and on pairing information between light and heavy chains. Other annotations, such as the identification of the complementarity determining regions, assignment of their structural class and identification of mutations with respect to the germline, are computed on the fly and can also be obtained for user-submitted sequences. The system allows customized BLAST searches and automatic building of 3D models of the domains to be performed.
U.S. Department of Health & Human Services — CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.
G Charles Babu
Full Text Available The web has greatly improved access to scientific literature, and a wide spectrum of research data has been created and collected by researchers. However, textual information on the web is largely disorganized, with research articles spread across archive sites, institution sites, journal sites and researcher homepages. Data are widely available over the internet, and the sheer variety of data poses the current challenge in storage and retrieval. Datasets can be made more accessible and user-friendly through annotation, aggregation and cross-linking to other datasets. Biomedical datasets are growing exponentially, and new curated information appears regularly in resources such as MEDLINE, PubMed and ScienceDirect. Therefore, a context-based text mining tool was developed in Python to search a large database such as PubMed for a given keyword and retrieve records between specified years.
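The keyword-and-year retrieval described above maps naturally onto NCBI's public E-utilities interface. The paper's own script is not available, so the function name and parameter choices below are illustrative; the `db`, `term`, `datetype`, `mindate` and `maxdate` parameters are the standard ESearch ones:

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_query_url(keyword, year_from, year_to, retmax=100):
    """Build an NCBI ESearch URL restricting PubMed results to a
    keyword within a publication-date range."""
    params = {
        "db": "pubmed",
        "term": keyword,
        "datetype": "pdat",          # filter on publication date
        "mindate": str(year_from),
        "maxdate": str(year_to),
        "retmax": str(retmax),
    }
    return ESEARCH + "?" + urlencode(params)

url = pubmed_query_url("text mining", 2005, 2012)
```

Fetching the URL returns an XML list of PubMed IDs that can then be passed to EFetch for the full records.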
Full Text Available Background: Algorithm evaluation provides a means to characterize variability across image analysis algorithms, validate algorithms by comparison with human annotations, combine results from multiple algorithms for performance improvement, and facilitate algorithm sensitivity studies. The sizes of images and image analysis results in pathology image analysis pose significant challenges in algorithm evaluation. We present an efficient parallel spatial database approach to model, normalize, manage, and query large volumes of analytical image result data. This provides an efficient platform for algorithm evaluation. Our experiments with a set of brain tumor images demonstrate the application, scalability, and effectiveness of the platform. Context: The paper describes an approach and platform for evaluation of pathology image analysis algorithms. The platform facilitates algorithm evaluation through a high-performance database built on the Pathology Analytic Imaging Standards (PAIS) data model. Aims: (1) Develop a framework to support algorithm evaluation by modeling and managing analytical results and human annotations from pathology images; (2) create a robust data normalization tool for converting, validating, and fixing spatial data from algorithm or human annotations; (3) develop a set of queries to support data sampling and result comparisons; (4) achieve high-performance computation capacity via a parallel data management infrastructure, with parallel data loading and spatial indexing optimizations in this infrastructure. Materials and Methods: We have considered two scenarios for algorithm evaluation: (1) algorithm comparison, where multiple result sets from different methods are compared and consolidated; and (2) algorithm validation, where algorithm results are compared with human annotations. We have developed a spatial normalization toolkit to validate and normalize spatial boundaries produced by image analysis algorithms or human annotations. The
Full Text Available Abstract Background The enormous throughput and low cost of second-generation sequencing platforms now allow research and clinical geneticists to routinely perform single experiments that identify tens of thousands to millions of variant sites. Existing methods to annotate variant sites using information from publicly available databases via web browsers are too slow to be useful for the large sequencing datasets being routinely generated by geneticists. Because sequence annotation of variant sites is required before functional characterization can proceed, the lack of a high-throughput pipeline to efficiently annotate variant sites can act as a significant bottleneck in genetics research. Results SeqAnt (Sequence Annotator) is an open source web service and software package that rapidly annotates DNA sequence variants and identifies recessive or compound heterozygous loci in human, mouse, fly, and worm genome sequencing experiments. Variants are characterized with respect to their functional type, frequency, and evolutionary conservation. Annotated variants can be viewed on a web browser, downloaded in a tab-delimited text file, or directly uploaded in a BED format to the UCSC genome browser. To demonstrate the speed of SeqAnt, we annotated a series of publicly available datasets that ranged in size from 37 to 3,439,107 variant sites. The total time to completely annotate these data ranged from 0.17 seconds to 28 minutes 49.8 seconds. Conclusion SeqAnt is an open source web service and software package that overcomes a critical bottleneck facing research and clinical geneticists using second-generation sequencing platforms. SeqAnt will prove especially useful for those investigators who lack dedicated bioinformatics personnel or infrastructure in their laboratories.
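The BED export step mentioned above is easy to get wrong because BED coordinates are 0-based and half-open, whereas variant calls are conventionally reported 1-based. A small illustrative helper (not SeqAnt's actual code; the example coordinates are hypothetical):

```python
def variants_to_bed(variants):
    """Convert 1-based (chrom, pos, ref, alt) variant calls into
    0-based, half-open BED lines suitable for upload to the UCSC
    genome browser."""
    lines = []
    for chrom, pos, ref, alt in variants:
        start = pos - 1                  # BED is 0-based
        end = start + len(ref)           # half-open interval spans the ref allele
        lines.append(f"{chrom}\t{start}\t{end}\t{ref}>{alt}")
    return "\n".join(lines)

bed = variants_to_bed([("chr7", 117559593, "G", "A"),
                       ("chrX", 154250998, "AT", "A")])
```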
Beisson, Frédéric; Koo, Abraham J K; Ruuska, Sari; Schwender, Jörg; Pollard, Mike; Thelen, Jay J; Paddock, Troy; Salas, Joaquín J; Savage, Linda; Milcamps, Anne; Mhaske, Vandana B; Cho, Younghee; Ohlrogge, John B
The genome of Arabidopsis has been searched for sequences of genes involved in acyl lipid metabolism. Over 600 encoded proteins have been identified, cataloged, and classified according to predicted function, subcellular location, and alternative splicing. At least one-third of these proteins were previously annotated as "unknown function" or with functions unrelated to acyl lipid metabolism; therefore, this study has improved the annotation of over 200 genes. In particular, annotation of the lipolytic enzyme group (at least 110 members total) has been improved by the critical examination of the biochemical literature and the sequences of the numerous proteins annotated as "lipases." In addition, expressed sequence tag (EST) data have been surveyed, and more than 3,700 ESTs associated with the genes were cataloged. Statistical analysis of the number of ESTs associated with specific cDNA libraries has allowed calculation of probabilities of differential expression between different organs. More than 130 genes have been identified with a statistical probability > 0.95 of preferential expression in seed, leaf, root, or flower. All the data are available as a Web-based database, the Arabidopsis Lipid Gene database (http://www.plantbiology.msu.edu/lipids/genesurvey/index.htm). The combination of the data of the Lipid Gene Catalog and the EST analysis can be used to gain insights into differential expression of gene family members and sets of pathway-specific genes, which in turn will guide studies to understand specific functions of individual genes.
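The "probability of preferential expression" comes from asking how surprising the observed organ-specific EST counts would be if expression were uniform across libraries. A minimal sketch of such a calculation using a cumulative binomial tail; the paper's exact statistic is not reproduced here, and the counts in the example are invented:

```python
from math import comb

def enrichment_probability(k, n, library_fraction):
    """P(X >= k) for X ~ Binomial(n, library_fraction): the chance of
    seeing at least k of a gene's n ESTs in one organ's libraries if
    expression were uniform.  One minus this tail probability plays the
    role of the 'probability of preferential expression'."""
    p = library_fraction
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Hypothetical gene: 9 of its 10 ESTs come from seed libraries that
# contribute 20% of all sequenced ESTs
prob = enrichment_probability(9, 10, 0.20)
# prob is tiny, so preferential seed expression would pass a > 0.95 bar
```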
This video shows how to annotate the ground truth tracks in the thermal videos. The ground truth tracks are produced to be able to compare them to tracks obtained from a Computer Vision tracking approach. The program used for annotation is T-Analyst, which is developed by Aliaksei Laureshyn, Ph...
Mardanbeigi, Diako; Qvarfordt, Pernilla
To facilitate distributed communication in mobile settings, we developed GazeNote for creating and sharing gaze annotations in head-mounted displays (HMDs). With gaze annotations it is possible to point out objects of interest within an image and add a verbal description. To create an annotation, ...
Martínez Alonso, Héctor; Pedersen, Bolette Sandford; Bel, Núria
We present the results of an annotation task on regular polysemy for a series of semantic classes or dot types in English, Danish and Spanish. This article describes the annotation process, the results in terms of inter-encoder agreement, and the sense distributions obtained with two methods...
Leung, Yuk Yee; Kuksa, Pavel P; Amlie-Wolf, Alexandre; Valladares, Otto; Ungar, Lyle H; Kannan, Sampath; Gregory, Brian D; Wang, Li-San
Small non-coding RNAs (sncRNAs) are highly abundant RNAs, yet no single database provides searchable, unified annotation and expression information for full sncRNA transcripts and the mature RNA products derived from these larger RNAs. Here, we present the Database of small human noncoding RNAs (DASHR). DASHR contains the most comprehensive information to date on human sncRNA genes and mature sncRNA products. DASHR provides a simple user interface for researchers to view sequence and secondary structure, compare expression levels, and examine evidence of specific processing across all sncRNA genes and mature sncRNA products in various human tissues. DASHR annotation and expression data cover all major classes of sncRNAs, including microRNAs (miRNAs), Piwi-interacting RNAs (piRNAs), small nuclear, nucleolar, and cytoplasmic RNAs (sn-, sno-, and scRNAs, respectively), transfer RNAs (tRNAs), and ribosomal RNAs (rRNAs). Currently, DASHR (v1.0) integrates 187 small RNA high-throughput sequencing (smRNA-seq) datasets with over 2.5 billion reads and annotation data from multiple public sources. DASHR contains annotations for ∼48,000 human sncRNA genes and mature sncRNA products, 82% of which are expressed in one or more of the curated tissues. DASHR is available at http://lisanwanglab.org/DASHR.
Frigeri, A.; Cardellini, C.; Chiodini, G.; Frondini, F.; Bagnato, E.; Aiuppa, A.; Fischer, T. P.; Lehnert, K. A.
The study of the main pathways of carbon flux from the deep Earth requires the analysis of a large quantity and variety of data on volcanic and non-volcanic gas emissions. Hence, there is a need for common frameworks to aggregate available data and insert new observations. Since 2010 we have been developing the Mapping Gas emissions (MaGa) web-based database to collect data on carbon degassing from volcanic and non-volcanic environments. MaGa uses an object-relational model, translating the experience of field surveyors into the database schema. The current web interface of MaGa allows users to browse the data in tabular format or through an interactive web map. Enabled users can insert information such as measurement methods and instrument details, as well as the actual values collected in the field. Measurements found in the literature can be inserted, as can direct field observations made by human-operated instruments. The database currently includes fluxes and gas compositions from actively degassing craters, diffuse soil degassing and fumaroles, from both dormant and open-vent volcanoes, drawn from a literature survey and from data on non-volcanic emissions of the Italian territory. At present, MaGa holds more than 1000 volcanic plume degassing fluxes, data from 30 sites of diffuse soil degassing at Italian volcanoes, and about 60 measurements from fumarolic and non-volcanic emission sites. For each gas emission site, MaGa holds data, pictures, and descriptions of gas sampling, analysis and measurement methods, together with bibliographic references and contacts for researchers with experience of each site. From 2012, MaGa development has been focused on the framework of the Deep Earth Carbon Degassing (DECADE) research initiative of the Deep Carbon Observatory. Within the DECADE initiative there are other data systems, such as EarthChem and the Smithsonian Institution's Global Volcanism Program. An interoperable interaction between the DECADE data systems is being
In 2013, the National Agricultural Biotechnology Information Center (NABIC) reconstructed a molecular marker database for useful genetic resources. The web-based marker database consists of three major functional categories: map viewer, RSN marker and gene annotation. It provides 7,250 marker locations, 3,301 RSN marker properties and 3,280 molecular marker annotation records for agricultural plants. Each molecular marker provides information such as marker name, expressed sequence tag number...
Melamed, I D
Bilingual annotators were paid to link roughly sixteen thousand corresponding words between on-line versions of the Bible in modern French and modern English. These annotations are freely available to the research community from http://www.cis.upenn.edu/~melamed . The annotations can be used for several purposes. First, they can be used as a standard data set for developing and testing translation lexicons and statistical translation models. Second, researchers in lexical semantics will be able to mine the annotations for insights about cross-linguistic lexicalization patterns. Third, the annotations can be used in research into certain recently proposed methods for monolingual word-sense disambiguation. This paper describes the annotated texts, the specially-designed annotation tool, and the strategies employed to increase the consistency of the annotations. The annotation process was repeated five times by different annotators. Inter-annotator agreement rates indicate that the annotations are reasonably rel...
Full Text Available Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratisation of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment - whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, and assembly of mixed-organism DNA is difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programmes. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species' genomes, thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimised assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from http://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer.
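A toy version of the TAGC partitioning signal can be sketched as follows. The GC and coverage thresholds and the bin-label format are invented for illustration; the real pipeline plots contigs and lets the user select clusters rather than applying fixed cutoffs:

```python
def gc_proportion(seq):
    """Fraction of G/C bases in a contig sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def bin_contig(seq, coverage, taxon_hit, gc_split=0.45, cov_split=50.0):
    """Combine GC content, read coverage and the best-hit taxon into a
    bin label -- a fixed-threshold caricature of a TAGC-plot screen."""
    gc_class = "highGC" if gc_proportion(seq) >= gc_split else "lowGC"
    cov_class = "highcov" if coverage >= cov_split else "lowcov"
    return f"{taxon_hit}/{gc_class}/{cov_class}"

# Hypothetical contig: 50% GC, 120x coverage, best hit to Nematoda
label = bin_contig("ATGCGCGCATTA", coverage=120.0, taxon_hit="Nematoda")
# label == "Nematoda/highGC/highcov"
```

Contigs sharing a bin label (and the reads mapping to them) would then be reassembled separately.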
Guldberg, Rikke; Brostrøm, Søren; Hansen, Jesper Kjær
INTRODUCTION AND HYPOTHESIS: The Danish Urogynaecological Database (DugaBase) is a nationwide clinical database established in 2006 to monitor, ensure and improve the quality of urogynaecological surgery. We aimed to describe its establishment and completeness and to validate selected variables. This is the first study based on data from the DugaBase. METHODS: The database completeness was calculated as a comparison between urogynaecological procedures reported to the Danish National Patient Registry and to the DugaBase. Validity was assessed for selected variables from a random sample of 200 women in the DugaBase from 1 January 2009 to 31 October 2010, using medical records as a reference. RESULTS: A total of 16,509 urogynaecological procedures were registered in the DugaBase by 31 December 2010. The database completeness has increased by calendar time, from 38.2 % in 2007 to 93.2 % in 2010 for public...
Mewes, H W; Frishman, D; Mayer, K F X; Münsterkötter, M; Noubibou, O; Pagel, P; Rattei, T; Oesterheld, M; Ruepp, A; Stümpflen, V
The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system is maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene-centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines: (i) the development of suitable comprehensive data structures and database technology, with communication and query tools to include a wide range of different types of information, enabling the representation of complex information such as functional modules or networks (the Genome Research Environment System); (ii) the development of databases covering computable information, such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix, and the CABiNet network analysis framework; and (iii) the compilation and manual annotation of information related to interactions, such as protein-protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.gsf.de).
Full Text Available Biology has entered a new era in which information is distributed through databases, and these collections of databases have become a primary means of publishing information. This data publishing is done through the Internet Gopher, which offers easy and affordable access to information resources through powerful research tools. The more important issue now is the development of high-quality, professionally operated electronic data publishing sites. To enhance this service, appropriate editorial policies for electronic data publishing have been established, and article editors shoulder the responsibility.
Kalra, Hina; Simpson, Richard J.; Ji, Hong; Aikawa, Elena; Altevogt, Peter; Askenase, Philip; Bond, Vincent C.; Borràs, Francesc E.; Breakefield, Xandra; Budnik, Vivian; Buzas, Edit; Camussi, Giovanni; Clayton, Aled; Cocucci, Emanuele; Falcon-Perez, Juan M.; Gabrielsson, Susanne; Gho, Yong Song; Gupta, Dwijendra; Harsha, H. C.; Hendrix, An; Hill, Andrew F.; Inal, Jameel M.; Jenster, Guido; Krämer-Albers, Eva-Maria; Lim, Sai Kiang; Llorente, Alicia; Lötvall, Jan; Marcilla, Antonio; Mincheva-Nilsson, Lucia; Nazarenko, Irina; Nieuwland, Rienk; Nolte-'t Hoen, Esther N. M.; Pandey, Akhilesh; Patel, Tushar; Piper, Melissa G.; Pluchino, Stefano; Prasad, T. S. Keshava; Rajendran, Lawrence; Raposo, Graca; Record, Michel; Reid, Gavin E.; Sánchez-Madrid, Francisco; Schiffelers, Raymond M.; Siljander, Pia; Stensballe, Allan; Stoorvogel, Willem; Taylor, Douglas; Thery, Clotilde; Valadi, Hadi; van Balkom, Bas W. M.; Vázquez, Jesús; Vidal, Michel; Wauben, Marca H. M.; Yáñez-Mó, María; Zoeller, Margot; Mathivanan, Suresh
Extracellular vesicles (EVs) are membraneous vesicles released by a variety of cells into their microenvironment. Recent studies have elucidated the role of EVs in intercellular communication, pathogenesis, drug, vaccine and gene-vector delivery, and as possible reservoirs of biomarkers. These findings have generated immense interest, along with an exponential increase in molecular data pertaining to EVs. Here, we describe Vesiclepedia, a manually curated compendium of molecular data (lipid, RNA, and protein) identified in different classes of EVs from more than 300 independent studies published over the past several years. Even though databases are indispensable resources for the scientific community, recent studies have shown that more than 50% of the databases are not regularly updated. In addition, more than 20% of the database links are inactive. To prevent such database and link decay, we have initiated a continuous community annotation project with the active involvement of EV researchers. The EV research community can set a gold standard in data sharing with Vesiclepedia, which could evolve as a primary resource for the field. PMID:23271954
Ide, Nancy; Erjavec, Tomaz
It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and that standards for linguistic annotation are becoming increasingly mandatory. To answer this need, we have developed a framework comprised of an abstract model for a variety of different annotation types (e.g., morpho-syntactic tagging, syntactic annotation, co-reference annotation, etc.), which can be instantiated in different ways depending on the annotator's approach and goals. In this paper we provide an overview of the framework, demonstrate its applicability to syntactic annotation, and show how it can contribute to comparative evaluation of parser output and diverse syntactic annotation schemes.
McFadden, Justin; Ellis, Joshua; Anwar, Tasneem; Roehrig, Gillian
The development of teachers as reflective practitioners is a central concept in national guidelines for teacher preparation and induction (National Council for Accreditation of Teacher Education 2008). The Teacher Induction Network (TIN) supports the development of reflective practice for beginning secondary science teachers through the creation of online "communities of practice" (Barab et al. in Inf Soc, 237-256, 2003), which have been shown to have positive impacts on teacher collaboration, communication, and reflection. Specifically, TIN integrated the use of asynchronous, video annotation as an affordance to directly facilitate teachers' reflection on their classroom practices (Tripp and Rich in Teach Teach Educ 28(5):728-739, 2013). This study examines the use of video annotation as a tool for developing reflective practices for beginning secondary science teachers. Teachers were enrolled in an online teacher induction course designed to promote reflective practice and inquiry-based instruction. A modified version of the Learning to Notice Framework (Sherin and van Es in J Teach Educ 60(1):20-37, 2009) was used to classify teachers' annotations on video of their teaching. Findings from the study include the tendency of teachers to focus on themselves in their annotations, as well as a preponderance of annotations focused on lower-level reflective practices of description and explanation. Suggestions for utilizing video annotation tools are discussed, as well as design features, which could be improved to further the development of richer annotations and deeper reflective practices.
Dreos, René; Ambrosini, Giovanna; Groux, Romain; Cavin Périer, Rouaïda; Bucher, Philipp
We present an update of the Eukaryotic Promoter Database EPD (http://epd.vital-it.ch), more specifically on the EPDnew division, which contains comprehensive organism-specific transcription start site (TSS) collections automatically derived from next generation sequencing (NGS) data. Thanks to the abundant release of new high-throughput transcript mapping data (CAGE, TSS-seq, GRO-cap), the database could be extended to plant and fungal species. We further report on the expansion of the mass genome annotation (MGA) repository containing promoter-relevant chromatin profiling data and on improvements for the EPD entry viewers. Finally, we present a new data access tool, ChIP-Extract, which enables computational biologists to extract diverse types of promoter-associated data in numerical table formats that are readily imported into statistical analysis platforms such as R. PMID:27899657
Eppig, Janan T; Bult, Carol J; Kadin, James A; Richardson, Joel E; Blake, Judith A; Anagnostopoulos, A; Baldarelli, R M; Baya, M; Beal, J S; Bello, S M; Boddy, W J; Bradt, D W; Burkart, D L; Butler, N E; Campbell, J; Cassell, M A; Corbani, L E; Cousins, S L; Dahmen, D J; Dene, H; Diehl, A D; Drabkin, H J; Frazer, K S; Frost, P; Glass, L H; Goldsmith, C W; Grant, P L; Lennon-Pierce, M; Lewis, J; Lu, I; Maltais, L J; McAndrews-Hill, M; McClellan, L; Miers, D B; Miller, L A; Ni, L; Ormsby, J E; Qi, D; Reddy, T B K; Reed, D J; Richards-Smith, B; Shaw, D R; Sinclair, R; Smith, C L; Szauter, P; Walker, M B; Walton, D O; Washburn, L L; Witham, I T; Zhu, Y
The Mouse Genome Database (MGD) forms the core of the Mouse Genome Informatics (MGI) system (http://www.informatics.jax.org), a model organism database resource for the laboratory mouse. MGD provides essential integration of experimental knowledge for the mouse system with information annotated from both literature and online sources. MGD curates and presents consensus and experimental data representations of genotype (sequence) through phenotype information, including highly detailed reports about genes and gene products. Primary foci of integration are through representations of relationships among genes, sequences and phenotypes. MGD collaborates with other bioinformatics groups to curate a definitive set of information about the laboratory mouse and to build and implement the data and semantic standards that are essential for comparative genome analysis. Recent improvements in MGD discussed here include the enhancement of phenotype resources, the re-development of the International Mouse Strain Resource, IMSR, the update of mammalian orthology datasets and the electronic publication of classic books in mouse genetics.
Sanderson, Robert [Los Alamos National Laboratory; Van De Sompel, Herbert [Los Alamos National Laboratory
As Digital Libraries (DL) become more aligned with the web architecture, their functional components need to be fundamentally rethought in terms of URIs and HTTP. Annotation, a core scholarly activity enabled by many DL solutions, exhibits a clearly unacceptable characteristic when existing models are applied to the web: due to the representations of web resources changing over time, an annotation made about a web resource today may no longer be relevant to the representation that is served from that same resource tomorrow. We assume the existence of archived versions of resources, and combine the temporal features of the emerging Open Annotation data model with the capability offered by the Memento framework that allows seamless navigation from the URI of a resource to archived versions of that resource, and arrive at a solution that provides guarantees regarding the persistence of web annotations over time. More specifically, we provide theoretical solutions and proof-of-concept experimental evaluations for two problems: reconstructing an existing annotation so that the correct archived version is displayed for all resources involved in the annotation, and retrieving all annotations that involve a given archived version of a web resource.
Pruitt, Kim D; Tatusova, Tatiana; Brown, Garth R; Maglott, Donna R
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16,000 organisms, 2.4 × 10^6 genomic records, 13 × 10^6 proteins and 2 × 10^6 RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).
Full Text Available Abstract Background Metabolomics experiments using mass spectrometry (MS) technology measure the mass-to-charge ratio (m/z) and intensity of ionised molecules in crude extracts of complex biological samples to generate high dimensional metabolite 'fingerprint' or metabolite 'profile' data. High resolution MS instruments perform routinely with high mass accuracy. Results Metabolite 'structures' harvested from publicly accessible databases were converted into a common format to generate a comprehensive archive in MZedDB. 'Rules' were derived from chemical information that allowed MZedDB to generate a list of adducts and neutral loss fragments putatively able to form for each structure and to calculate, on the fly, the exact molecular weight of every potential ionisation product, providing targets for annotation searches based on accurate mass. We demonstrate that data matrices representing populations of ionisation products generated from different biological matrices contain a large proportion (sometimes > 50%) of molecular isotopes, salt adducts and neutral loss fragments. Correlation analysis of ESI-MS data features confirmed the predicted relationships of m/z signals. An integrated isotope enumerator in MZedDB allowed verification of exact isotopic pattern distributions to corroborate experimental data. Conclusion We conclude that although ultra-high accurate mass instruments provide major insight into the chemical diversity of biological extracts, the facile annotation of a large proportion of signals is not possible by simple, automated query of current databases using computed molecular formulae. Parameterising MZedDB to take into account predicted ionisation behaviour and the biological source of any sample greatly improves both the frequency and accuracy of potential annotation 'hits' in ESI-MS data.
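As a sketch of the kind of on-the-fly calculation described above, the snippet below predicts adduct m/z values from a neutral monoisotopic mass and matches an observed signal within a ppm tolerance. The small adduct table and the 5 ppm tolerance are illustrative assumptions, not MZedDB's actual rule set.

```python
# Sketch: candidate adduct m/z values from a neutral monoisotopic mass.
# The adduct table below is a small illustrative subset, not MZedDB's.

PROTON = 1.007276  # mass of a proton (Da)

# (name, mass shift in Da, charge) for a few common ESI adducts
ADDUCTS = [
    ("[M+H]+",  PROTON,    1),
    ("[M+Na]+", 22.989218, 1),
    ("[M+K]+",  38.963158, 1),
    ("[M-H]-",  -PROTON,   1),
]

def adduct_mz(neutral_mass, shift, charge):
    """Return the m/z expected for an ion with the given mass shift."""
    return (neutral_mass + shift) / charge

def annotate(observed_mz, neutral_mass, tol_ppm=5.0):
    """List adducts whose predicted m/z matches observed_mz within tol_ppm."""
    hits = []
    for name, shift, charge in ADDUCTS:
        mz = adduct_mz(neutral_mass, shift, charge)
        if abs(mz - observed_mz) / mz * 1e6 <= tol_ppm:
            hits.append(name)
    return hits

# Glucose (C6H12O6), monoisotopic mass 180.06339 Da
print(annotate(181.07067, 180.06339))  # [M+H]+ matches within 5 ppm
```

In a real workflow the adduct list would be generated per structure from the database's chemical rules rather than hard-coded.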
Full Text Available Bats account for ~20% of mammalian species, and are the only mammals with true powered flight. Owing to their specialized phenotypic traits, much research has been devoted to examining the evolution of bats. Until now, some whole genome sequences of bats have been assembled and annotated; however, a uniform resource for the annotated bat genomes was still unavailable. To make the extensive data associated with the bat genomes accessible to the general biological communities, we established a Bat Genome Database (BGD). BGD is an open-access, web-available portal that integrates available data of bat genomes and genes. It hosts data from six bat species, including two megabats and four microbats. Users can query the gene annotations using an efficient search engine, and it offers browsable tracks of bat genomes. Furthermore, an easy-to-use phylogenetic analysis tool is also provided to facilitate online phylogeny study of genes. To the best of our knowledge, BGD is the first database of bat genomes. It will extend our understanding of bat evolution and facilitate analysis of bat sequences. BGD is freely available at: http://donglab.ecnu.edu.cn/databases/BatGenome/.
Full Text Available Abstract Background Concurrent with the efforts currently underway in mapping microbial genomes using high-throughput sequencing methods, systems biologists are building metabolic models to characterize and predict cell metabolisms. One of the key steps in building a metabolic model is using multiple databases to collect and assemble essential information about genome annotations and the architecture of the metabolic network for a specific organism. To speed up metabolic model development for a large number of microorganisms, we need a user-friendly platform to construct metabolic networks and to perform constraint-based flux balance analysis based on genome databases and experimental results. Results We have developed a semi-automatic, web-based platform (MicrobesFlux) for generating and reconstructing metabolic models for annotated microorganisms. MicrobesFlux is able to automatically download the metabolic network (including enzymatic reactions and metabolites) of ~1,200 species from the KEGG database (Kyoto Encyclopedia of Genes and Genomes) and then convert it to a metabolic model draft. The platform also provides diverse customized tools, such as gene knockouts and the introduction of heterologous pathways, for users to reconstruct the model network. The reconstructed metabolic network can be formulated as a constraint-based flux model to predict and analyze the carbon fluxes in microbial metabolisms. The simulation results can be exported in SBML format (the Systems Biology Markup Language). Furthermore, we also demonstrated the platform functionalities by developing an FBA model (including 229 reactions) for a recently annotated bioethanol producer, Thermoanaerobacter sp. strain X514, to predict its biomass growth and ethanol production. Conclusion MicrobesFlux is an installation-free and open-source platform that enables biologists without prior programming knowledge to develop metabolic models for annotated microorganisms in the KEGG database.
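The constraint-based step that such platforms automate can be illustrated with a toy flux balance problem: maximize a "biomass" flux subject to steady-state mass balance S·v = 0 and flux bounds. The three-reaction network and bounds below are invented for illustration and are unrelated to the actual X514 model.

```python
# Toy flux balance analysis: maximize biomass flux subject to S·v = 0.
from scipy.optimize import linprog

# One metabolite x: produced by uptake (v0), consumed by biomass (v1)
# and by-product secretion (v2).  Rows = metabolites, cols = reactions.
S = [[1.0, -1.0, -1.0]]

bounds = [(0, 10),    # v0: substrate uptake capped at 10 (arbitrary units)
          (0, None),  # v1: biomass flux (the objective)
          (0, None)]  # v2: by-product secretion

# linprog minimizes, so negate the biomass coefficient to maximize v1
res = linprog(c=[0, -1, 0], A_eq=S, b_eq=[0], bounds=bounds, method="highs")
print(round(-res.fun, 6))  # optimal biomass flux
```

Real models simply scale this up: hundreds of metabolites (rows) and reactions (columns), with bounds encoding uptake limits, irreversibility, and gene knockouts.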
Arnaud, Martha B; Chibucos, Marcus C; Costanzo, Maria C; Crabtree, Jonathan; Inglis, Diane O; Lotia, Adil; Orvis, Joshua; Shah, Prachi; Skrzypek, Marek S; Binkley, Gail; Miyasato, Stuart R; Wortman, Jennifer R; Sherlock, Gavin
The Aspergillus Genome Database (AspGD) is an online genomics resource for researchers studying the genetics and molecular biology of the Aspergilli. AspGD combines high-quality manual curation of the experimental scientific literature examining the genetics and molecular biology of Aspergilli, cutting-edge comparative genomics approaches to iteratively refine and improve structural gene annotations across multiple Aspergillus species, and web-based research tools for accessing and exploring the data. All of these data are freely available at http://www.aspgd.org. We welcome feedback from users and the research community at firstname.lastname@example.org.
Daugaard, Gedske; Kier, Maria Gry Gundgaard; Bandak, Mikkel
AIM: The nationwide Danish Testicular Cancer database consists of a retrospective research database (DaTeCa database) and a prospective clinical database (Danish Multidisciplinary Cancer Group [DMCG] DaTeCa database). The aim is to improve the quality of care for patients with testicular cancer (TC) in Denmark, that is, by identifying risk factors for relapse, toxicity related to treatment, and focusing on late effects. STUDY POPULATION: All Danish male patients with a histologically verified germ cell cancer diagnosis in the Danish Pathology Registry are included in the DaTeCa databases. Data ... Survivors alive in October 2014 were invited to fill in a questionnaire including 160 validated questions. Collection of questionnaires is still ongoing. A biobank including blood/sputum samples for future genetic analyses has been established. Both samples related to DaTeCa and DMCG DaTeCa database ...
Melamed, I D
This annotation style guide was created by and for the Blinker project at the University of Pennsylvania. The Blinker project was so named after the ``bilingual linker'' GUI, which was created to enable bilingual annotators to ``link'' word tokens that are mutual translations in parallel texts. The parallel text chosen for this project was the Bible, because it is probably the easiest text to obtain in electronic form in multiple languages. The languages involved were English and French, because, of the languages with which the project co-ordinator was familiar, these were the two for which a sufficient number of annotators was likely to be found.
Kügler, Frank; Smolibocki, Bernadett; Arnold, Denis
This paper presents newly developed guidelines for prosodic annotation of German as a consensus system agreed upon by German intonologists. The DIMA system is rooted in the framework of autosegmental-metrical phonology. One important goal of the consensus is to make exchanging data between groups easier, since German intonation is currently annotated according to different models. To this end, we aim to provide guidelines that are easy to learn. The guidelines were evaluated by running an inter-annotator reliability study on three different speech styles (read speech, monologue and dialogue).
Fromreide, Hege; Hovy, Dirk; Søgaard, Anders
We present two new NER datasets for Twitter: a manually annotated set of 1,467 tweets (kappa=0.942) and a set of 2,975 expert-corrected, crowdsourced NER-annotated tweets from the dataset described in Finin et al. (2010). In our experiments with these datasets, we observe two important points: (a) language drift on Twitter is significant, and while off-the-shelf systems have been reported to perform well on in-sample data, they often perform poorly on new samples of tweets; (b) state-of-the-art performance across various datasets can be obtained from crowdsourced annotations, making it more feasible ...
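Inter-annotator agreement figures like the kappa reported above can be computed in a few lines. The sketch below implements Cohen's kappa for two annotators; the label sequences are toy data, not the actual tweet annotations.

```python
# Cohen's kappa: chance-corrected agreement between two annotators.
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # chance agreement from each annotator's marginal label distribution
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["PER", "O", "LOC", "O", "O",   "ORG", "O", "O"]
ann2 = ["PER", "O", "LOC", "O", "ORG", "ORG", "O", "O"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.8
```

A kappa near 0.94, as reported for the manually annotated set, indicates near-perfect agreement under common interpretation scales.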
This thesis combines programmer knowledge and feedback to improve modeling and optimization of software. The research is motivated by two observations. First, there is a great need for automatic analysis of software for embedded systems, to expose and model parallelism inherent in programs. Second, some program properties are beyond reach of such analysis for theoretical and practical reasons, but can be described by programmers. Three aspects are explored. The first is annotation of the source code. Two annotations are introduced. These allow more accurate modeling of parallelism ... are not effective unless programmers are told how and when they are beneficial. A prototype compilation feedback system was developed in collaboration with IBM Haifa Research Labs. It reports to the programmer issues that prevent further analysis. Performance evaluation shows that three programs perform significantly ...
Full Text Available Poul Videbech,1 Anette Deleuran2; 1Mental Health Centre Glostrup, Department of Clinical Medicine, University of Copenhagen, Glostrup; 2Psychiatric Centre Amager, Copenhagen S, Denmark. Aim of database: The purpose of the Danish Depression Database (DDD) is to monitor and facilitate the improvement of the quality of the treatment of depression in Denmark. Furthermore, the DDD has been designed to facilitate research. Study population: Inpatients as well as outpatients with depression, aged above 18 years, and treated in the public psychiatric hospital system were enrolled. Main variables: Variables include whether the patient has been thoroughly somatically examined and has been interviewed about the psychopathology by a specialist in psychiatry. The Hamilton score as well as an evaluation of the risk of suicide are measured before and after treatment. Whether psychiatric aftercare has been scheduled for inpatients and the rate of rehospitalization are also registered. Descriptive data: The database was launched in 2011. Since then, ~5,500 inpatients and 7,500 outpatients have been registered annually in the database. In total, 24,083 inpatients and 29,918 outpatients have been registered. The DDD produces an annual report published on the Internet. Conclusion: The DDD can become an important tool for quality improvement and research, when the reporting is more complete. Keywords: quality assurance, suicide, somatic diseases, national database
Sakurai, Nozomu; Ara, Takeshi; Enomoto, Mitsuo; Motegi, Takeshi; Morishita, Yoshihiko; Kurabayashi, Atsushi; Iijima, Yoko; Ogata, Yoshiyuki; Nakajima, Daisuke; Suzuki, Hideyuki; Shibata, Daisuke
A metabolome--the collection of comprehensive quantitative data on metabolites in an organism--has been increasingly utilized for applications such as data-intensive systems biology, disease diagnostics, biomarker discovery, and assessment of food quality. A considerable number of tools and databases have been developed to date for the analysis of data generated by various combinations of chromatography and mass spectrometry. We report here a web portal named KOMICS (The Kazusa Metabolomics Portal), where the tools and databases that we developed are available for free to academic users. KOMICS includes the tools and databases for preprocessing, mining, visualization, and publication of metabolomics data. Improvements in the annotation of unknown metabolites and dissemination of comprehensive metabolomic data are the primary aims behind the development of this portal. For this purpose, PowerGet and FragmentAlign include a manual curation function for the results of metabolite feature alignments. A metadata-specific wiki-based database, Metabolonote, functions as a hub of web resources related to the submitters' work. This feature is expected to increase citation of the submitters' work, thereby promoting data publication. As an example of the practical use of KOMICS, a workflow for a study on Jatropha curcas is presented. The tools and databases available at KOMICS should contribute to enhanced production, interpretation, and utilization of metabolomic Big Data.
Full Text Available Abstract Background A recent analysis of protein sequences deposited in the NCBI RefSeq database indicates that ~8.5 million protein sequences are encoded in prokaryotic and eukaryotic genomes, where ~30% are explicitly annotated as "hypothetical" or "uncharacterized" protein. Our Comparison of Protein Active-Site Structures (CPASS) v.2 database and software compares the sequence and structural characteristics of experimentally determined ligand binding sites to infer a functional relationship in the absence of global sequence or structure similarity. CPASS is an important component of our Functional Annotation Screening Technology by NMR (FAST-NMR) protocol and has been successfully applied to aid the annotation of a number of proteins of unknown function. Findings We report a major upgrade to our CPASS software and database that significantly improves its broad utility. CPASS v.2 is designed with a layered architecture to increase flexibility and portability that also enables job distribution over the Open Science Grid (OSG) to increase speed. Similarly, the CPASS interface was enhanced to provide more user flexibility in submitting a CPASS query. CPASS v.2 now allows for both automatic and manual definition of ligand-binding sites and permits pair-wise, one versus all, one versus list, or list versus list comparisons. Solvent accessible surface area, ligand root-mean-square difference, and Cβ distances have been incorporated into the CPASS similarity function to improve the quality of the results. The CPASS database has also been updated. Conclusions CPASS v.2 is more than an order of magnitude faster than the original implementation, and allows for multiple simultaneous job submissions. Similarly, the CPASS database of ligand-defined binding sites has increased in size by ~38%, dramatically increasing the likelihood of a positive search result. The modification to the CPASS similarity function is effective in reducing CPASS similarity scores ...
Jayakar, P B; Brusse, E; Patrick, J P; Shwedyk, E; Seshia, S S
The paper describes an ambulatory EEG database. The database contains segments of AEEGs done on 45 subjects. Each epoch (1/8th second or more) of AEEG data has been annotated into 1 of 40 classes. The classes represent background activity, paroxysmal patterns and artifacts. The majority of classes have over 200 discrete epochs. The structure is flexible enough to allow additional epochs to be readily added. The database is stored on transportable media such as digital magnetic tape or hard disk and is thus available to other researchers in the field. The database can be used to design, evaluate and compare EEG signal processing algorithms and pattern recognition systems. It can also serve as an educational medium in EEG laboratories.
Kronmiller, Brent A; Wise, Roger P
Grass genomes harbor a diverse and complex content of repeated sequences. Most of these repeats occur as abundant transposable elements (TEs), which present unique challenges to sequencing, assembling, and annotating genomes. Multiple copies of Long Terminal Repeat (LTR) retrotransposons can hinder sequence assembly and also cause problems with gene annotation. TEs can also contain protein-encoding genes, the ancient remnants of which can mislead gene identification software if not correctly masked. Hence, accurate TE annotation is crucial for gene annotation. We present TEnest v2.0, which computationally annotates and chronologically displays nested transposable elements. Utilizing organism-specific TE databases as a reference for reconstructing degraded TEs to their ancestral state, annotation of repeats is accomplished by iterative sequence alignment. The output is a graphical display of the chronological nesting structure together with the coordinate positions of all TE insertions. Both Linux command-line and Web versions of the TEnest software are available at www.wiselab.org and www.plantgdb.org/tool/, respectively.
Full Text Available Abstract Background Protein function is often dependent on subsets of solvent-exposed residues that may exist in a similar three-dimensional configuration in non-homologous proteins, thus having different order and/or spacing in the sequence. Hence, functional annotation by means of sequence or fold similarity is not adequate for such cases. Results We describe a method for the function-related annotation of protein structures by means of the detection of local structural similarity with a library of annotated functional sites. An automatic procedure was used to annotate the function of local surface regions. Next, we employed a sequence-independent algorithm to compare exhaustively these functional patches with a larger collection of protein surface cavities. After tuning and validating the algorithm on a dataset of well annotated structures, we applied it to a list of protein structures that are classified as being of unknown function in the Protein Data Bank. By this strategy, we were able to provide functional clues to proteins that do not show any significant sequence or global structural similarity with proteins in the current databases. Conclusion This method is able to spot structural similarities associated with function-related similarities, independently of sequence or fold resemblance, and is therefore a valuable tool for the functional analysis of uncharacterized proteins. Results are available at http://cbm.bio.uniroma2.it/surface/structuralGenomics.html
Full Text Available In this work, we address the use of object recognition techniques to annotate what is shown where in online video collections. These annotations are suitable to retrieve specific video scenes for object related text queries which is not possible with the manually generated metadata that is used by current portals. We are not the first to present object annotations that are generated with content-based analysis methods. However, the proposed framework possesses some outstanding features that offer good prospects for its application in real video portals. Firstly, it can be easily used as background module in any video environment. Secondly, it is not based on a fixed analysis chain but on an extensive recognition infrastructure that can be used with all kinds of visual features, matching and machine learning techniques. New recognition approaches can be integrated into this infrastructure with low development costs and a configuration of the used recognition approaches can be performed even on a running system. Thus, this framework might also benefit from future advances in computer vision. Thirdly, we present an automatic selection approach to support the use of different recognition strategies for different objects. Last but not least, visual analysis can be performed efficiently on distributed, multi-processor environments and a database schema is presented to store the resulting video annotations as well as the off-line generated low-level features in a compact form. We achieve promising results in an annotation case study and the instance search task of the TRECVID 2011 challenge.
Full Text Available Abstract Background Advances in structural biology, including structural genomics, have resulted in a rapid increase in the number of experimentally determined protein structures. However, about half of the structures deposited by the structural genomics consortia have little or no information about their biological function. Therefore, there is a need for tools for automatically and comprehensively annotating the function of protein structures. We aim to provide such tools by applying comparative protein structure annotation that relies on detectable relationships between protein structures to transfer functional annotations. Here we introduce two programs, AnnoLite and AnnoLyze, which use the structural alignments deposited in the DBAli database. Description AnnoLite predicts the SCOP, CATH, EC, InterPro, PfamA, and GO terms with an average sensitivity of ~90% and average precision of ~80%. AnnoLyze predicts ligand binding site and domain interaction patches with an average sensitivity of ~70% and average precision of ~30%, correctly localizing binding sites for small molecules in ~95% of its predictions. Conclusion The AnnoLite and AnnoLyze programs for comparative annotation of protein structures can reliably and automatically annotate new protein structures. The programs are fully accessible via the Internet as part of the DBAli suite of tools at http://salilab.org/DBAli/.
Caicedo, Juan C; González, Fabio A; Romero, Eduardo
Large amounts of histology images are captured and archived in pathology departments due to the ever expanding use of digital microscopy. The ability to manage and access these collections of digital images is regarded as a key component of next generation medical imaging systems. This paper addresses the problem of retrieving histopathology images from a large collection using an example image as query. The proposed approach automatically annotates the images in the collection, as well as the query images, with high-level semantic concepts. This semantic representation delivers an improved retrieval performance providing more meaningful results. We model the problem of automatic image annotation using kernel methods, resulting in a unified framework that includes: (1) multiple features for image representation, (2) a feature integration and selection mechanism, and (3) an automatic semantic image annotation strategy. An extensive experimental evaluation demonstrated the effectiveness of the proposed framework in building meaningful image representations for learning and useful semantic annotations for image retrieval.
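The feature-integration idea, combining several per-feature-type kernels into a single kernel, can be sketched as a convex weighted sum of base kernels, which is again a valid kernel usable by any kernel classifier. The feature vectors, weights, and kernel choices below are stand-ins, not those used in the paper.

```python
# Sketch: convex combination of base kernels for feature integration.
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def rbf_kernel(X, Y, gamma=0.5):
    # squared Euclidean distances between all row pairs
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def combined_kernel(X, Y, weights=(0.7, 0.3)):
    """Weighted sum of base kernels; weights are non-negative and sum to 1."""
    return weights[0] * linear_kernel(X, Y) + weights[1] * rbf_kernel(X, Y)

# Two "images" described by 3-dimensional feature vectors (toy data)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
K = combined_kernel(X, X)
print(K.shape)  # (2, 2)
```

A selection mechanism of the kind the paper describes would then tune the weights (e.g. by cross-validation), effectively switching feature types on or off per concept.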
Yuan, Pingpeng; Wang, Guoyin; Zhang, Qin; Jin, Hai
Due to ambiguity, search engines for scientific literature may not return the right search results. One efficient solution to these problems is to automatically annotate literature and attach the semantic information to it. Generally, semantic annotation requires identifying entities before attaching semantic information to them. However, due to abbreviation and other reasons, it is very difficult to identify entities correctly. The paper presents a Semantic Annotation System for Literature (SASL), which utilizes Wikipedia as a knowledge base to annotate literature. SASL mainly attaches semantics to terminology, academic institutions, conferences, journals, etc. Many of these are usually abbreviations, which induces ambiguity. Here, SASL uses regular expressions to extract the mapping between the full names of entities and their abbreviations. Since the full names of several entities may map to a single abbreviation, SASL introduces a Hidden Markov Model to implement name disambiguation. Finally, the paper presents the experimental results, which confirm that SASL achieves good performance.
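The abbreviation-mapping step can be sketched with a regular expression that captures "Full Name (ABBR)" patterns from text. The pattern and the initials check below are illustrative guesses, not SASL's actual rules.

```python
# Sketch: extracting full-name -> abbreviation mappings with a regex.
import re

# A run of capitalized words followed by an uppercase acronym in parentheses
PATTERN = re.compile(
    r"((?:[A-Z][a-zA-Z]+\s+){1,6}[A-Z][a-zA-Z]+)\s+\(([A-Z]{2,})\)"
)

def extract_abbreviations(text):
    """Return {abbreviation: full name} pairs found in the text."""
    mapping = {}
    for full, abbr in PATTERN.findall(text):
        # keep only candidates whose initials are consistent with the acronym
        initials = "".join(w[0].upper() for w in full.split())
        if initials.endswith(abbr) or abbr in initials:
            mapping[abbr] = full
    return mapping

text = "We apply a Hidden Markov Model (HMM) for name disambiguation."
print(extract_abbreviations(text))  # {'HMM': 'Hidden Markov Model'}
```

When several full names share one abbreviation, a sequence model such as the HMM mentioned above can pick the intended expansion from the surrounding context.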
Collaborative tagging systems, such as del.icio.us, CiteULike, and others, allow users to annotate objects, e.g., Web pages or scientific papers, with descriptive labels called tags. The social annotations, contributed by thousands of users, can potentially be used to infer categorical knowledge, classify documents or recommend new relevant information. Traditional text inference methods do not make the best use of socially generated data, since they do not take into account variations in individual users' perspectives and vocabulary. In a previous work, we introduced a simple probabilistic model that takes the interests of individual annotators into account in order to find hidden topics of annotated objects. Unfortunately, our proposed approach had a number of shortcomings, including overfitting, local maxima and the requirement to specify values for some parameters. In this paper we address these shortcomings in two ways. First, we extend the model to a fully Bayesian framework. Second, we describe an infinite version of the model.
Ibekwe-Sanjuan, Fidelia; Sanjuan, Eric; Charton, Eric
We present a methodology combining surface NLP and Machine Learning techniques for ranking abstracts and generating summaries based on annotated corpora. The corpora were annotated with meta-semantic tags indicating the category of information a sentence is bearing (objective, findings, newthing, hypothesis, conclusion, future work, related work). The annotated corpus is fed into an automatic summarizer for query-oriented abstract ranking and multi-abstract summarization. To adapt the summarizer to these two tasks, two novel weighting functions were devised in order to take into account the distribution of the tags in the corpus. Results, although still preliminary, encourage us to pursue this line of work and find better ways of building IR systems that can take into account semantic annotations in a corpus.
Zellweger, Polle Trescott; Bouvin, Niels Olof; Jehøj, Henning
Fluid Documents use animated typographical changes to provide a novel and appealing user experience for hypertext browsing and for viewing document annotations in context. This paper describes an effort to broaden the utility of Fluid Documents by using the open hypermedia Arakne Environment to layer fluid annotations and links on top of arbitrary HTML pages on the World Wide Web. Changes to both Fluid Documents and Arakne are required.
Li, Guoqing; Ma, Lijia; Song, Chao;
The YH database is a server that allows the user to easily browse and download data from the first Asian diploid genome. The aim of this platform is to facilitate the study of this Asian genome and to enable improved organization and presentation of large-scale personal genome data. Powered by GBrowse, ... genome consensus. The YH database is currently one of the three personal genome databases, organizing the original data and analysis results in a user-friendly interface, an endeavor toward the fundamental goal of establishing personalized medicine. The database is available at http://yh.genomics.org.cn.
Xie, Chen; Mao, Xizeng; Huang, Jiaju; Ding, Yang; Wu, Jianmin; Dong, Shan; Kong, Lei; Gao, Ge; Li, Chuan-Yun; Wei, Liping
High-throughput experimental technologies often identify dozens to hundreds of genes related to, or changed in, a biological or pathological process. From these genes one wants to identify biological pathways that may be involved and diseases that may be implicated. Here, we report a web server, KOBAS 2.0, which annotates an input set of genes with putative pathways and disease relationships based on mapping to genes with known annotations. It allows for both ID mapping and cross-species sequence similarity mapping. It then performs statistical tests to identify statistically significantly enriched pathways and diseases. KOBAS 2.0 incorporates knowledge across 1327 species from 5 pathway databases (KEGG PATHWAY, PID, BioCyc, Reactome and Panther) and 5 human disease databases (OMIM, KEGG DISEASE, FunDO, GAD and NHGRI GWAS Catalog). KOBAS 2.0 can be accessed at http://kobas.cbi.pku.edu.cn.
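The enrichment step behind servers like KOBAS is commonly a hypergeometric (one-sided Fisher) test; a self-contained sketch of that statistic follows (our illustration, not KOBAS source code):

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """P(X >= k) under the hypergeometric null: of N background genes,
    K belong to the pathway, and k of the n input genes hit it."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)
```

In practice the p-values over all pathways and diseases would then be corrected for multiple testing (e.g. Benjamini-Hochberg).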
Spitzer, Russell; Cleves, Ann E; Varela, Rocco; Jain, Ajay N
Hundreds of protein crystal structures exist for proteins whose function cannot be confidently determined from sequence similarity. Surflex-PSIM, a previously reported surface-based protein similarity algorithm, provides an alternative method for hypothesizing function for such proteins. The method now supports fully automatic binding site detection and is fast enough to screen comprehensive databases of protein binding sites. The binding site detection methodology was validated on apo/holo cognate protein pairs, correctly identifying 91% of ligand binding sites in holo structures and 88% in apo structures where corresponding sites existed. For correctly detected apo binding sites, the cognate holo site was the most similar binding site 87% of the time. PSIM was used to screen a set of proteins that had poorly characterized functions at the time of crystallization, but were later biochemically annotated. Using a fully automated protocol, this set of 8 proteins was screened against ∼60,000 ligand binding sites from the PDB. PSIM correctly identified functional matches that predated query protein biochemical annotation for five out of the eight query proteins. A panel of 12 currently unannotated proteins was also screened, resulting in a large number of statistically significant binding site matches, some of which suggest likely functions for the poorly characterized proteins.
Zwisler, Ann-Dorthe; Rossau, Henriette Knold; Nakano, Anne
AIM OF DATABASE: The Danish Cardiac Rehabilitation Database (DHRD) aims to improve the quality of cardiac rehabilitation (CR) to the benefit of patients with coronary heart disease (CHD). STUDY POPULATION: Hospitalized patients with CHD with stenosis on coronary angiography treated with percutane...
Pasterczyk, Catherine E.
Discusses ways in which using mathematical symbols to search online bibliographic databases in scientific and technical areas can improve search results. The representations used for Greek letters, relations, binary operators, arrows, and miscellaneous special symbols in the MathSci, Inspec, Compendex, and Chemical Abstracts databases are…
The aim of this paper is to propose an approach based on the concept of annotation for supporting design communication. We describe a co-operative design case study in which we analyse annotation practices, mainly focused on design minutes recorded during project reviews, and point out specific requirements concerning annotation needs. Based on these requirements, we propose an annotation model, inspired by Speech Act Theory (SAT), to support communication in a 3D digital environment. We define two types of annotations in the engineering design context: locutionary and illocutionary annotations. The annotations we describe are materialised by a set of digital artefacts that have a semantic dimension, allowing designers to express and record elements of technical justification, traces of contradictory debates, etc. We first clarify the concept of semantic annotation, define general properties of annotations in the engineering design context, and discuss the role of annotations in...
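A minimal sketch of how the two annotation types might be modelled as digital artefacts; all names here are our own illustrative choices, not the authors' model:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    author: str
    anchor: str   # id of the 3D artefact the note is attached to
    body: str

@dataclass
class LocutionaryAnnotation(Annotation):
    """Records what was said, e.g. a statement from the design minutes."""

@dataclass
class IllocutionaryAnnotation(Annotation):
    """Records the intent behind the statement."""
    act: str = "request"  # e.g. request | approve | object | justify
```

The `act` field is what carries the SAT-style illocutionary force; a locutionary annotation is just the recorded utterance.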
Jing Ginger Han
Full Text Available Premise of the study: Digital microscopic pollen images are being generated with increasing speed and volume, producing opportunities to develop new computational methods that increase the consistency and efficiency of pollen analysis and provide the palynological community with a computational framework for information sharing and knowledge transfer. Methods: Mathematical methods were used to assign trait semantics (abstract morphological representations of the images of neotropical Miocene pollen and spores. Advanced database-indexing structures were built to compare and retrieve similar images based on their visual content. A Web-based system was developed to provide novel tools for automatic trait semantic annotation and image retrieval by trait semantics and visual content. Results: Mathematical models that map visual features to trait semantics can be used to annotate images with morphology semantics and to search image databases with improved reliability and productivity. Images can also be searched by visual content, providing users with customized emphases on traits such as color, shape, and texture. Discussion: Content- and semantic-based image searches provide a powerful computational platform for pollen and spore identification. The infrastructure outlined provides a framework for building a community-wide palynological resource, streamlining the process of manual identification, analysis, and species discovery.
Song, Yan; Hua, Xian-Sheng; Dai, Li-Rong; Wang, Ren-Hua
In this paper, we propose a novel semi-automatic annotation scheme for home videos based on active learning. It is well known that there is a large gap between semantics and low-level features. To narrow this gap, relevance feedback has been introduced in a number of studies. Furthermore, to accelerate convergence to the optimal result, several active learning schemes have been proposed in which the most informative samples are chosen for annotation rather than selected at random. In this paper, a representative active learning method is proposed in which the local consistency of video content is effectively taken into consideration. The main idea is to exploit the global and local statistical characteristics of videos and the temporal relationship between shots. The global model is trained on a smaller pre-labeled video dataset, while the local information is obtained online in the process of active learning and is used to adjust the initial global model adaptively. The experimental results show that the proposed active learning scheme significantly improves annotation performance compared with random selection and a common active learning method.
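The two ingredients of such a scheme, uncertainty-based sample selection and temporal smoothing of classifier scores across neighbouring shots, can be sketched as follows (a simplified illustration of the general idea, not the paper's method):

```python
def select_informative(scored_shots, batch=2):
    """Pick the shots whose signed classifier scores are closest to the
    decision boundary (score 0), i.e. the most uncertain ones."""
    return [sid for sid, s in sorted(scored_shots, key=lambda x: abs(x[1]))[:batch]]

def smooth_scores(scores, alpha=0.5):
    """Blend each shot's score with its temporal neighbours, reflecting
    the assumption that adjacent shots share semantics."""
    out = []
    for i, s in enumerate(scores):
        nbrs = scores[max(0, i - 1):i + 2]
        out.append((1 - alpha) * s + alpha * sum(nbrs) / len(nbrs))
    return out
```

Each active-learning round would smooth the scores, ask the user to label the selected shots, and retrain the adapted model.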
Hölmich, Lisbet Rosenkrantz; Klausen, Siri; Spaun, Eva
AIM OF DATABASE: The aim of the database is to monitor and improve the treatment and survival of melanoma patients. STUDY POPULATION: All Danish patients with cutaneous melanoma and in situ melanomas must be registered in the Danish Melanoma Database (DMD). In 2014, 2,525 patients with invasive...... melanoma and 780 with in situ tumors were registered. The coverage is currently 93% compared with the Danish Pathology Register. MAIN VARIABLES: The main variables include demographic, clinical, and pathological characteristics, including Breslow's tumor thickness, ± ulceration, mitoses, and tumor...... quality register. The coverage is high, and the performance in the five Danish regions is quite similar due to strong adherence to guidelines provided by the Danish Melanoma Group. The list of monitored indicators is constantly expanding, and annual quality reports are issued. Several important scientific...
Full Text Available Data warehouse technology includes a set of concepts and methods that offer users useful information for decision making. The necessity to build a data warehouse arises from the necessity to improve the quality of information in the organization. The data, coming from different sources and having a variety of forms - both structured and unstructured - are filtered according to business rules and are integrated into a single large data collection. Using informatics solutions, managers have understood that data stored in operational systems - including databases - are an informational gold mine that must be exploited. Data warehouses have been developed to answer the increasing demands for complex analysis, which could not be properly achieved with operational databases. The present paper emphasizes some of the criteria that application developers can use in order to choose between a database solution and a data warehouse one.
Rappaport, Noa; Nativ, Noam; Stelzer, Gil; Twik, Michal; Guan-Golan, Yaron; Iny Stein, Tsippi; Bahir, Iris; Belinky, Frida; Morrey, C. Paul; Safran, Marilyn; Lancet, Doron
Comprehensive disease classification, integration and annotation are crucial for biomedical discovery. At present, disease compilation is incomplete, heterogeneous and often lacking systematic inquiry mechanisms. We introduce MalaCards, an integrated database of human maladies and their annotations, modeled on the architecture and strategy of the GeneCards database of human genes. MalaCards mines and merges 44 data sources to generate a computerized card for each of 16 919 human diseases. Each MalaCard contains disease-specific prioritized annotations, as well as inter-disease connections, empowered by the GeneCards relational database, its searches and GeneDecks set analyses. First, we generate a disease list from 15 ranked sources, using disease-name unification heuristics. Next, we use four schemes to populate MalaCards sections: (i) directly interrogating disease resources, to establish integrated disease names, synonyms, summaries, drugs/therapeutics, clinical features, genetic tests and anatomical context; (ii) searching GeneCards for related publications, and for associated genes with corresponding relevance scores; (iii) analyzing disease-associated gene sets in GeneDecks to yield affiliated pathways, phenotypes, compounds and GO terms, sorted by a composite relevance score and presented with GeneCards links; and (iv) searching within MalaCards itself, e.g. for additional related diseases and anatomical context. The latter forms the basis for the construction of a disease network, based on shared MalaCards annotations, embodying associations based on etiology, clinical features and clinical conditions. This broadly disposed network has a power-law degree distribution, suggesting that this might be an inherent property of such networks. Work in progress includes hierarchical malady classification, ontological mapping and disease set analyses, striving to make MalaCards an even more effective tool for biomedical research. Database URL: http
Lapi, Francesco; Nicotra, Federica; Scotti, Lorenza; Vannacci, Alfredo; Thompson, Mary; Pieri, Francesco; Mugelli, Niccolò; Zambon, Antonella; Corrao, Giovanni; Mugelli, Alessandro; Rubino, Annalisa
AIMS To quantify the risk of cardiac valvulopathy (CV) associated with the use of antidepressant serotoninergic medications (SMs). METHODS We conducted a case–control study nested in a cohort of users of antidepressant SMs selected from The Health Improvement Network database. Patients who experienced a CV event during follow-up were cases. Cases were ascertained in a random sample of them. Up to 10 controls were matched to each case by sex, age, month and year of the study entry. Use of antidepressant SMs during follow-up was defined as current (the last prescription for antidepressant SMs occurred in the 2 months before the CV event), recent (in the 2–12 months before the CV event) and past (>12 months before the CV event). We fitted a conditional regression model to estimate the association between use of antidepressant SMs and the risk of CV by means of odds ratios (ORs) and corresponding 95% confidence intervals (CIs). Sensitivity analyses were conducted to test the robustness of our results. RESULTS The study cohort included 752 945 subjects aged 18–89 years. Throughout follow-up, 1663 cases (incidence rate: 3.4 per 10 000 person-years) of CV were detected and were matched to 16 566 controls. The adjusted ORs (95% CIs) for current and recent users compared with past users of antidepressant SMs were 1.16 (0.96–1.40) and 1.06 (0.93–1.22), respectively. Consistent effect estimates were obtained when considering cumulative exposure to antidepressant SMs during follow-up. CONCLUSIONS These results would suggest that exposure to antidepressant SMs is not associated with an increased risk of CV. PMID:22356433
Elmi, Maryam; Mahar, Alyson; Kagedan, Daniel; Law, Calvin H.L.; Karanicolas, Paul J.; Lin, Yulia; Callum, Jeannie; Coburn, Natalie G.; Hallet, Julie
Background Red blood cell transfusions (RBCT) carry risk of transfusion-related immunomodulation that may impact postoperative recovery. This study examined the association between perioperative RBCT and short-term postoperative outcomes following gastrectomy for gastric cancer. Methods Using the American College of Surgeons National Surgical Quality Improvement Program database, we compared outcomes of patients (transfused v. nontransfused) undergoing elective gastrectomy for gastric cancer (2007–2012). Outcomes were 30-day major morbidity, mortality and length of stay. The association between perioperative RBCT and outcomes was estimated using modified Poisson, logistic, or negative binomial regression. Results Of the 3243 patients in the entire cohort, we included 2884 patients with nonmissing data, of whom 535 (18.6%) received RBCT. Overall 30-day major morbidity and mortality were 20% and 3.5%, respectively. After adjustment for baseline and clinical characteristics, RBCT was independently associated with increased 30-day mortality (relative risk [RR] 3.1, 95% confidence interval [CI] 1.9–5.0), major morbidity (RR 1.4, 95% CI 1.2–1.8), length of stay (RR 1.2, 95% CI 1.1–1.2), infections (RR 1.4, 95% CI 1.1–1.6), cardiac complications (RR 1.8, 95% CI 1.0–3.2) and respiratory failure (RR 2.3, 95% CI 1.6–3.3). Conclusion Red blood cell transfusions are associated with worse postoperative short-term outcomes in patients with gastric cancer. Blood management strategies are needed to reduce the use of RBCT after gastrectomy for gastric cancer. PMID:27668330
Alberto J M Martin
Full Text Available Increasingly large numbers of proteins require methods for functional annotation. Annotation is typically based on pairwise inference from the homology of either protein sequence or structure. Recently, similarity networks have been presented to leverage both the ability to visualize relationships between proteins and to assess the transferability of functional inference. Here we present PANADA, a novel toolkit for the visualization and analysis of protein similarity networks in Cytoscape. Networks can be constructed based on pairwise sequence or structural alignments, either on a set of proteins or, alternatively, by database search from a single sequence. The PANADA web server, an executable for download, examples and extensive help files are available at http://protein.bio.unipd.it/panada/.
Full Text Available Peter Holmberg Jørgensen,1 Gunnar Schwarz Lausten,2 Alma B Pedersen3 1Tumor Section, Department of Orthopedic Surgery, Aarhus University Hospital, Aarhus, 2Tumor Section, Department of Orthopedic Surgery, Rigshospitalet, Copenhagen, 3Department of Clinical Epidemiology, Aarhus University Hospital, Aarhus, Denmark Aim: The aim of the database is to gather information about sarcomas treated in Denmark in order to continuously monitor and improve the quality of sarcoma treatment from a local, a national, and an international perspective. Study population: Patients in Denmark diagnosed with a sarcoma, both skeletal and extraskeletal, have been registered since 2009. Main variables: The database contains information about appearance of symptoms; date of receiving referral to a sarcoma center; date of first visit; whether surgery has been performed elsewhere before referral, diagnosis, and treatment; tumor characteristics such as location, size, malignancy grade, and growth pattern; details on treatment (kind of surgery, amount of radiation therapy, type and duration of chemotherapy); complications of treatment; local recurrence and metastases; and comorbidity. In addition, several quality indicators are registered in order to measure the quality of care provided by the hospitals and make comparisons between hospitals and with international standards. Descriptive data: Demographic patient-specific data such as age, sex, region of living, comorbidity, World Health Organization's International Classification of Diseases – tenth edition codes and TNM Classification of Malignant Tumours, and date of death (after yearly coupling to the Danish Civil Registration System). Data quality and completeness are currently secured. Conclusion: The Danish Sarcoma Database is population based and includes sarcomas occurring in Denmark since 2009. It is a valuable tool for monitoring sarcoma incidence and quality of treatment and its improvement, postoperative
Full Text Available Introduction: HIV-positive (HIV+ temporary residents living in Australia legally are unable to access government subsidized antiretroviral treatment (ART which is provided via Medicare to Australian citizens and permanent residents. Currently, there is no information systematically being collected on non-Medicare eligible HIV+ patients in Australia. The objectives of this study are to describe the population recruited to the Australian HIV Observational Database (AHOD Temporary Residents Access Study (ATRAS and to determine the short- and long-term outcomes of receiving subsidized optimal ART and the impact on onwards HIV transmission. Methods: ATRAS was established in 2011. Eligible patients were recruited via the AHOD network. Key HIV-related characteristics were recorded at baseline and prospectively. Additional visa-related information was also recorded at baseline, and updated annually. Descriptive statistics were used to describe the ATRAS cohort in terms of visa status by key demographic characteristics, including sex, region of birth, and HIV disease status. CD4 cell count (mean and SD and the proportion with undetectable (<50 copies/ml HIV viral load are reported at baseline, 6 and 12 months of follow-up. We also estimate the proportion reduction of onward HIV transmission based on the reduction in proportion of people with detectable HIV viral load. Results: A total of 180 patients were recruited to ATRAS by June 2012, and by July 2013, 39 patients no longer required ART via ATRAS, 35 of whom became eligible for Medicare-funded medication. At enrolment, 63% of ATRAS patients were receiving ART from alternative sources, 47% had an undetectable HIV viral load (<50 copies/ml and the median CD4 cell count was 343 cells/µl (IQR: 222–479. At 12 months of follow-up, 85% had an undetectable viral load. We estimated a 75% reduction in the risk of onward HIV transmission with the improved rate of undetectable viral load. Conclusions: The
Gorohovski, Alessandro; Tagore, Somnath; Palande, Vikrant; Malka, Assaf; Raviv-Shay, Dorith; Frenkel-Morgenstern, Milana
Discovery of chimeric RNAs, which are produced by chromosomal translocations as well as the joining of exons from different genes by trans-splicing, has added a new level of complexity to our study and understanding of the transcriptome. The enhanced ChiTaRS-3.1 database (http://chitars.md.biu.ac.il) is designed to make widely accessible a wealth of mined data on chimeric RNAs, with easy-to-use analytical tools built in. The database comprises 34 922 chimeric transcripts along with 11 714 cancer breakpoints. In this latest version, we have included multiple cross-references to GeneCards, iHop, PubMed, NCBI, Ensembl, OMIM, RefSeq and the Mitelman collection for every entry in the ‘Full Collection’. In addition, for every chimera, we have added a predicted chimeric protein–protein interaction (ChiPPI) network, which allows for easy visualization of protein partners of both parental and fusion proteins for all human chimeras. The database contains a comprehensive annotation for 34 922 chimeric transcripts from eight organisms, and includes the manual annotation of 200 sense-antiSense (SaS) chimeras. The current improvements to the content and functionality of the ChiTaRS database make it a central resource for the study of chimeric transcripts and fusion proteins. PMID:27899596
Full Text Available Abstract Background The SEED integrates many publicly available genome sequences into a single resource. The database contains accurate and up-to-date annotations based on the subsystems concept, which leverages clustering between genomes and other clues to accurately and efficiently annotate microbial genomes. The backend is used as the foundation for many genome annotation tools, such as the Rapid Annotation using Subsystems Technology (RAST) server for whole genome annotation, the metagenomics RAST server for random community genome annotations, and the annotation clearinghouse for exchanging annotations from different resources. In addition to a web user interface, the SEED also provides a Web services-based API for programmatic access to the data in the SEED, allowing the development of third-party tools and mash-ups. Results The currently exposed Web services encompass over forty different methods for accessing data related to microbial genome annotations. The Web services provide comprehensive access to the database back end, allowing any programmer access to the most consistent and accurate genome annotations available. The Web services are deployed using a platform-independent, service-oriented approach that allows the user to choose the most suitable programming platform for their application. Example code demonstrates that the Web services can be used to access the SEED using common bioinformatics programming languages such as Perl, Python, and Java. Conclusions We present a novel approach to accessing the SEED database. Using Web services, a robust API for access to genomics data is provided without requiring large-volume downloads all at once. The API ensures timely access to the most current datasets available, including new genomes as soon as they come online.
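A client for such a Web-services API might look like the sketch below; the base URL, method name and JSON shape are placeholders of our own, not the actual SEED endpoints:

```python
import json
from urllib.parse import urlencode

BASE = "https://example.org/seed/api"  # placeholder, not the real SEED URL

def build_request(method, **params):
    """Compose a GET URL for a named Web-service method."""
    return f"{BASE}/{method}?{urlencode(sorted(params.items()))}"

def parse_annotations(payload):
    """Extract (feature_id, function) pairs from a JSON response body."""
    return [(f["id"], f["function"]) for f in json.loads(payload)["features"]]
```

The same request/parse split works from any language, which is the point of the platform-independent, service-oriented deployment described above.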
Full Text Available This work elaborates the semi-semantic part-of-speech annotation guidelines for the URDU.KON-TB treebank, an annotated corpus. A hierarchical annotation scheme was designed to label the parts of speech and then applied to the corpus. The raw corpus was collected from the Urdu Wikipedia and the Jang newspaper and then annotated with the proposed semi-semantic part-of-speech labels. The corpus contains text of local and international news, social stories, sports, culture, finance, religion, traveling, etc. This exercise finally contributed a part-of-speech annotation to the URDU.KON-TB treebank. Twenty-two main part-of-speech categories are divided into subcategories, which capture the morphological and semantic information encoded in them. This article mainly reports the annotation guidelines; however, it also briefly describes the development of the URDU.KON-TB treebank, including the raw corpus collection, the design and application of the annotation scheme and, finally, its statistical evaluation and results. The guidelines presented here will be useful to the linguistic community for annotating sentences not only in the national language Urdu but also in other indigenous languages such as Punjabi, Sindhi, Pashto, etc.
Chen, Shu-Chuan; Ogata, Aaron
The MixtureTree Annotator, written in JAVA, allows the user to automatically color any phylogenetic tree in Newick format generated from any phylogeny reconstruction program and output the Nexus file. By providing the ability to automatically color the tree by sequence name, the MixtureTree Annotator provides a unique advantage over any other programs which perform a similar function. In addition, the MixtureTree Annotator is the only package that can efficiently annotate the output produced by MixtureTree with mutation information and coalescent time information. In order to visualize the resulting output file, a modified version of FigTree is used. Certain popular methods, which lack good built-in visualization tools, for example, MEGA, Mesquite, PHY-FI, TreeView, treeGraph and Geneious, may give results with human errors due to either manually adding colors to each node or with other limitations, for example only using color based on a number, such as branch length, or by taxonomy. In addition to allowing the user to automatically color any given Newick tree by sequence name, the MixtureTree Annotator is the only method that allows the user to automatically annotate the resulting tree created by the MixtureTree program. The MixtureTree Annotator is fast and easy-to-use, while still allowing the user full control over the coloring and annotating process.
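The core idea, mapping leaf names extracted from a Newick string onto a colour palette, can be sketched as follows (a simplified illustration of automatic coloring by sequence name, not MixtureTree Annotator code):

```python
import re
from itertools import cycle

PALETTE = ["#e41a1c", "#377eb8", "#4daf4a"]  # cycled if there are more names

def color_by_name(newick):
    """Return a leaf-name -> colour mapping for a Newick tree string.

    Leaf names are the tokens that follow '(' or ',' and run until a
    delimiter (branch length, subtree, or terminator)."""
    names = re.findall(r"[(,]([^(),:;]+)", newick)
    return dict(zip(sorted(set(names)), cycle(PALETTE)))
```

A tool would then emit these colours as node attributes in a Nexus file for display in a viewer such as FigTree.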
Background The rationale for gathering information from plants that procure nitrogen through symbiotic interactions controlled by a common genetic program for sustainable biofuel production is the high energy demand of applying synthetic nitrogen fertilizers. We curated sequence information publicly available for the biofuel plant sugarcane, performed an analysis of the common SYM pathway known to control symbiosis in other plants, and provide results, sequences and literature links as an online database. Methods Sugarcane sequences and information were downloaded from the nucEST database, cleaned and trimmed with seqclean, assembled with TGICL plus the translating mapping method, and annotated. The annotation is based on BLAST searches against a local formatted plant Uniprot90 generated with CD-HIT for functional assignment, rpsBLAST against the CDD database for conserved domain analysis, and BLAST searches against sorghum sequences for Gene Ontology (GO) assignment. Gene expression was normalized according to the Unigene standard, presented as ESTs/100 kb. Protein sequences known in the SYM pathway were used as queries to search the SymGRASS sequence database. Additionally, antimicrobial peptides described in the PhytAMP database served as queries to retrieve and generate expression profiles of these defense genes in the libraries, compared to the libraries obtained under symbiotic interactions. Results We describe SymGRASS, a database of sugarcane orthologous genes involved in arbuscular mycorrhiza (AM) and root nodule (RN) symbiosis. The database aggregates knowledge about sequences, tissues, organs, developmental stages and experimental conditions, and provides annotation and levels of gene expression for sugarcane transcripts and SYM orthologous genes through a web interface. Several candidate genes were found for all nodes in the pathway, and, interestingly, a set of symbiosis-specific genes was found. Conclusions The knowledge integrated in SymGRASS may guide studies on
Genomic regions represent features such as gene annotations, transcription factor binding sites and epigenetic modifications. Performing genomic operations such as identifying overlapping or non-overlapping regions, or finding the nearest gene annotations, is a common research need. The data can be stored in a database system for easy management; however, no comprehensive built-in database algorithm currently exists to identify overlapping regions. I have therefore developed a novel region-mapping (RegMap) SQL-based algorithm to perform genomic operations and have benchmarked the performance of different databases. Benchmarking showed that PostgreSQL extracts overlapping regions much faster than MySQL. Insertion and data uploads were also faster in PostgreSQL, although the general searching capabilities of the two databases were almost equivalent. In addition, applying the algorithm pair-wise, overlaps among >1000 datasets of transcription factor binding sites and histone marks, collected from previous publications, were reported, and it was found that HNF4G significantly co-locates with the cohesin subunit STAG1 (SA1).
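The overlap predicate at the heart of such an algorithm is a standard interval join; a minimal sketch using SQLite for portability (the reported benchmarks used PostgreSQL and MySQL, and the table and column names here are our own):

```python
import sqlite3

# Two half-open intervals [start, end) on the same chromosome overlap
# iff a.start < b.end AND b.start < a.end; a.id < b.id avoids duplicates.
OVERLAP_SQL = """
SELECT a.id, b.id
FROM regions a JOIN regions b
  ON a.chrom = b.chrom AND a.id < b.id
 AND a.start < b."end" AND b.start < a."end";
"""

def find_overlaps(rows):
    """Load (id, chrom, start, end) rows and report overlapping pairs."""
    con = sqlite3.connect(":memory:")
    con.execute('CREATE TABLE regions (id TEXT, chrom TEXT, start INT, "end" INT)')
    con.executemany("INSERT INTO regions VALUES (?,?,?,?)", rows)
    return con.execute(OVERLAP_SQL).fetchall()
```

At the scale of real ChIP-seq datasets, an index on `(chrom, start)` (or a specialised range type such as PostgreSQL's `int4range` with a GiST index) makes this join tractable.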
Good, Benjamin M; Nanis, Max; Wu, Chunlei; Su, Andrew I
Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses. Many biological natural language processing (BioNLP) projects attempt to address this challenge, but the state of the art still leaves much room for improvement. Progress in BioNLP research depends on large, annotated corpora for evaluating information extraction systems and training machine learning models. Traditionally, such corpora are created by small numbers of expert annotators often working over extended periods of time. Recent studies have shown that workers on microtask crowdsourcing platforms such as Amazon's Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text. Here, we investigated the use of the AMT in capturing disease mentions in PubMed abstracts. We used the NCBI Disease corpus as a gold standard for refining and benchmarking our crowdsourcing protocol. After several iterations, we arrived at a protocol that reproduced the annotations of the 593 documents in the 'training set' of this gold standard with an overall F measure of 0.872 (precision 0.862, recall 0.883). The output can also be tuned to optimize for precision (max = 0.984 when recall = 0.269) or recall (max = 0.980 when precision = 0.436). Each document was completed by 15 workers, and their annotations were merged based on a simple voting method. In total, 145 workers combined to complete all 593 documents in the span of 9 days at a cost of $0.066 per abstract per worker. The quality of the annotations, as judged with the F measure, increases with the number of workers assigned to each task; however, minimal performance gains were observed beyond 8 workers per task. These results add further evidence that microtask crowdsourcing can be a valuable tool for generating well-annotated corpora in BioNLP. Data produced for this analysis are available at http://figshare.com/articles/Disease_Mention_Annotation_with_Mechanical_Turk/1126402.
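The "simple voting method" for merging worker annotations can be sketched as follows; the span representation and the majority threshold are our illustrative assumptions:

```python
from collections import Counter

def merge_by_vote(worker_spans, n_workers, threshold=0.5):
    """Keep a (start, end) mention span if more than `threshold` of the
    workers marked it; each worker's duplicate marks count once."""
    votes = Counter(span for spans in worker_spans for span in set(spans))
    return sorted(s for s, v in votes.items() if v / n_workers > threshold)
```

Raising `threshold` trades recall for precision, which mirrors the tunable precision/recall operating points reported above.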
Friis-Andersen, Hans; Bisgaard, Thue
AIM OF DATABASE: To monitor and improve nationwide surgical outcome after groin hernia repair, based on scientific evidence-based surgical strategies, for the national and international surgical community. STUDY POPULATION: Patients ≥18 years operated for groin hernia. MAIN VARIABLES: Type and size...... the medical management of the database. RESULTS: The Danish Inguinal Hernia Database comprises intraoperative data from >130,000 repairs (May 2015). A total of 49 peer-reviewed national and international publications have been published from the database (June 2015). CONCLUSION: The Danish Inguinal Hernia...... Database is fully active, monitoring surgical quality, and contributes to the national and international surgical community's efforts to improve outcome after groin hernia repair....
Tang, Lilian H.; Hanka, Rudolf; Ip, Horace H. S.; Cheung, Kent K. T.; Lam, Ringo
In this paper we present a semantic content representation scheme and the associated techniques for supporting (1) query by image examples or by natural language in a histological image database and (2) automatic annotation generation for images through image semantic analysis. In this research, various types of query are analyzed by either a semantic analyzer or a natural language analyzer to extract high level concepts and histological information, which are subsequently converted into an internal semantic content representation structure code-named 'Papillon.' Papillon serves not only as an intermediate representation scheme but also stores the semantic content of the image that will be used to match against the semantic index structure within the image database during query processing. During the image database population phase, all images that are going to be put into the database will go through the same processing so that every image would have its semantic content represented by a Papillon structure. Since the Papillon structure for an image contains high level semantic information of the image, it forms the basis of the technique that automatically generates textual annotation for the input images. Papillon bridges the gap between different media in the database, allows complicated intelligent browsing to be carried out efficiently, and also provides a well-defined semantic content representation scheme for different content processing engines developed for content-based retrieval.
BACKGROUND: The Liaoning cashmere goat is a famous breed for cashmere wool. To expand the transcriptome data and accelerate genetic improvement of this breed, we performed de novo transcriptome sequencing to generate the first expressed sequence tag dataset for the Liaoning cashmere goat, using next-generation sequencing technology. RESULTS: Transcriptome sequencing of the Liaoning cashmere goat on a Roche 454 platform yielded 804,601 high-quality reads. Clustering and assembly of these reads produced a non-redundant set of 117,854 unigenes, comprising 13,194 isotigs and 104,660 singletons. Based on similarity searches with known proteins, 17,356 unigenes were assigned to 6,700 GO categories, and the terms were summarized into three main GO categories and 59 sub-categories. 3,548 and 46,778 unigenes had significant similarity to existing sequences in the KEGG and COG databases, respectively. Comparative analysis revealed that 42,254 unigenes were aligned to 17,532 different sequences in NCBI non-redundant nucleotide databases. 97,236 (82.51%) unigenes were mapped to the 30 goat chromosomes, and 35,551 (30.17%) unigenes were matched to 11,438 reported goat protein-coding genes. The remaining non-matched unigenes were further compared with cattle and human reference genes, and 67 putative new goat genes were discovered. Additionally, 2,781 potential simple sequence repeats were initially identified from all unigenes. CONCLUSION: The transcriptome of the Liaoning cashmere goat was deeply sequenced, de novo assembled, and annotated, providing abundant data to better understand the Liaoning cashmere goat transcriptome. The potential simple sequence repeats provide a material basis for future genetic linkage and quantitative trait loci analyses.
The availability of a representative gene ontology (GO) database is a prerequisite for a successful functional genomics study. Using online Blast2GO resources, we constructed a GO database of Aspergillus flavus. Of the 13,485 predicted A. flavus genes, 8,987 were annotated with GO terms. The mea...
Santos Delgado, Alberto; Wernersson, Rasmus; Jensen, Lars Juhl
3.0, we have updated the content of the database to reflect changes to genome annotation, added new mRNAand protein expression data, and integrated cell-cycle phenotype information from high-content screens and model-organism databases. The new version of Cyclebase also features a new web interface...
Yang, In Seok; Ryu, Chunsun; Cho, Ki Joon; Kim, Jin Kwang; Ong, Swee Hoe; Mitchell, Wayne P; Kim, Bong Su; Oh, Hee-Bok; Kim, Kyung Hyun
Biomarkers enable early diagnosis, guide molecularly targeted therapy and monitor the activity and therapeutic responses across a variety of diseases. Despite intensified interest and research, however, the overall rate of development of novel biomarkers has been falling. Moreover, no solution is yet available that efficiently retrieves and processes biomarker information pertaining to infectious diseases. Infectious Disease Biomarker Database (IDBD) is one of the first efforts to build an easily accessible and comprehensive literature-derived database covering known infectious disease biomarkers. IDBD is a community annotation database, utilizing collaborative Web 2.0 features, providing a convenient user interface to input and revise data online. It allows users to link infectious diseases or pathogens to protein, gene or carbohydrate biomarkers through the use of search tools. It supports various types of data searches and application tools to analyze sequence and structure features of potential and validated biomarkers. Currently, IDBD integrates 611 biomarkers for 66 infectious diseases and 70 pathogens. It is publicly accessible at http://biomarker.cdc.go.kr and http://biomarker.korea.ac.kr.
Background: Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach, which may provide an efficient means for the quality control of databases. The current version of MisPred uses five distinct routines for identifying abnormal, incomplete or mispredicted entries, based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; and (v) chimeras encoded by two or more genes located on different chromosomes. Results: Analyses of predicted EnsEMBL protein sequences of nine deuterostome species (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of misprediction are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON datasets.
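The MisPred routines are essentially consistency rules over predicted sequence features. A toy version of the first three rules might look like the sketch below; the field names are hypothetical, and the real pipeline derives these features from signal-peptide, transmembrane and domain predictors rather than from a precomputed record.

```python
# MisPred-style consistency checks over an (assumed) annotation record.
def mispred_flags(entry):
    flags = []
    # Rule (i): an extracellular/secreted protein should carry a signal peptide.
    if entry["localization"] == "extracellular" and not entry["has_signal_peptide"]:
        flags.append("extracellular without signal peptide")
    # Rule (ii): mixing extracellular and cytoplasmic domains requires
    # at least one transmembrane segment.
    if ({"extracellular", "cytoplasmic"} <= set(entry["domain_types"])
            and entry["tm_segments"] == 0):
        flags.append("extracellular + cytoplasmic domains, no TM segment")
    # Rule (iii): extracellular and nuclear domains should not co-occur.
    if {"extracellular", "nuclear"} <= set(entry["domain_types"]):
        flags.append("extracellular and nuclear domains co-occur")
    return flags

suspect = {"localization": "extracellular", "has_signal_peptide": False,
           "domain_types": ["extracellular", "cytoplasmic"], "tm_segments": 0}
print(mispred_flags(suspect))
```

An entry raising no flags is not guaranteed correct; the rules only detect conflicts with current knowledge about protein-coding genes.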
Finn, Robert D; Coggill, Penelope; Eberhardt, Ruth Y; Eddy, Sean R; Mistry, Jaina; Mitchell, Alex L; Potter, Simon C; Punta, Marco; Qureshi, Matloob; Sangrador-Vegas, Amaia; Salazar, Gustavo A; Tate, John; Bateman, Alex
In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteome sequences brings greater stability, which decreases the amount of manual curation required to maintain them. It also reduces the number of sequences displayed on the website, whilst still providing access to many important model organisms. Matches to the full UniProtKB database are, however, still available, and Pfam annotations for individual UniProtKB sequences can still be retrieved. Some Pfam entries (1.6%) that have no matches to reference proteomes remain; we are working with UniProt to see if sequences from them can be incorporated into reference proteomes. Pfam-B, the automatically generated supplement to Pfam, has been removed. The current release (Pfam 29.0) includes 16 295 entries and 559 clans. The facility to view the relationship between families within a clan has been improved by the introduction of a new tool.
I introduce an open-source R package 'dcGOR' to provide the bioinformatics community with the ease to analyse ontologies and protein domain annotations, particularly those in the dcGO database. The dcGO is a comprehensive resource for protein domain annotations using a panel of ontologies including Gene Ontology. Although increasing in popularity, this database needs statistical and graphical support to meet its full potential. Moreover, there are no bioinformatics tools specifically designed for domain ontology analysis. As an add-on package built in the R software environment, dcGOR offers a basic infrastructure with great flexibility and functionality. It implements new data structure to represent domains, ontologies, annotations, and all analytical outputs as well. For each ontology, it provides various mining facilities, including: (i) domain-based enrichment analysis and visualisation; (ii) construction of a domain (semantic similarity) network according to ontology annotations; and (iii) significance analysis for estimating a contact (statistical significance) network. To reduce runtime, most analyses support high-performance parallel computing. Taking as inputs a list of protein domains of interest, the package is able to easily carry out in-depth analyses in terms of functional, phenotypic and diseased relevance, and network-level understanding. More importantly, dcGOR is designed to allow users to import and analyse their own ontologies and annotations on domains (taken from SCOP, Pfam and InterPro) and RNAs (from Rfam) as well. The package is freely available at CRAN for easy installation, and also at GitHub for version control. The dedicated website with reproducible demos can be found at http://supfam.org/dcGOR.
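dcGOR itself is an R package; as a language-neutral illustration, the domain-based enrichment analysis it provides boils down to a hypergeometric over-representation test of the kind sketched below. The universe size, term counts and domain list are invented numbers, not dcGO data.

```python
# Hypergeometric over-representation test, the core of enrichment analysis.
from math import comb

def hypergeom_pvalue(k, K, n, N):
    """P(X >= k) when drawing n domains from a universe of N,
    of which K are annotated with the ontology term, and k of the
    drawn set carry that annotation."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Universe of 1000 domains, 50 annotated with the term; a list of 20
# domains of interest contains 8 of them -> strongly enriched.
p = hypergeom_pvalue(8, 50, 20, 1000)
```

In practice an enrichment tool repeats this test per ontology term and corrects the resulting p-values for multiple testing.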
Hecksel, Corey W; Darrow, Michele C; Dai, Wei; Galaz-Montoya, Jesús G; Chin, Jessica A; Mitchell, Patrick G; Chen, Shurui; Jakana, Jemba; Schmid, Michael F; Chiu, Wah
Although acknowledged to be variable and subjective, manual annotation of cryo-electron tomography data is commonly used to answer structural questions and to create a "ground truth" for evaluation of automated segmentation algorithms. Validation of such annotation is lacking, but is critical for understanding the reproducibility of manual annotations. Here, we used voxel-based similarity scores for a variety of specimens, ranging in complexity and segmented by several annotators, to quantify the variation among their annotations. In addition, we have identified procedures for merging annotations to reduce variability, thereby increasing the reliability of manual annotation. Based on our analyses, we find that it is necessary to combine multiple manual annotations to increase the confidence level for answering structural questions. We also make recommendations to guide algorithm development for automated annotation of features of interest.
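A common voxel-based similarity score for comparing two manual segmentations is the Dice coefficient, and majority voting gives one simple merging procedure. The sketch below is a generic illustration of both ideas, not the authors' exact scoring code; annotations are reduced to sets of voxel coordinates.

```python
from collections import Counter

def dice(a, b):
    """Dice similarity between two annotations given as sets of voxel coords."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def consensus(annotations, min_votes):
    """Merge several annotations: keep voxels marked by >= min_votes annotators."""
    votes = Counter(v for ann in annotations for v in ann)
    return {v for v, c in votes.items() if c >= min_votes}

ann1 = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
ann2 = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
ann3 = {(0, 0, 0), (0, 0, 1), (1, 1, 1)}
merged = consensus([ann1, ann2, ann3], min_votes=2)
```

Comparing each annotator against the consensus, rather than pairwise, is one way to quantify how much an individual deviates from the merged "ground truth".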
Ghouila, Amel; Terrapon, Nicolas; Gascuel, Olivier; Guerfali, Fatma Z; Laouini, Dhafer; Maréchal, Eric; Bréhélin, Laurent
Eukaryotic pathogens (e.g. Plasmodium, Leishmania, Trypanosomes, etc.) are a major source of morbidity and mortality worldwide. In Africa, one of the most impacted continents, they cause millions of deaths and constitute an immense economic burden. While the genome sequences of several of these organisms are now available, the biological functions of more than half of their proteins are still unknown. This is a serious obstacle to identifying new therapeutic targets. In this context, the identification of protein domains is a key step to improve the functional annotation of the proteins. However, several domains are missed in eukaryotic pathogens because of the high phylogenetic distance of these organisms from the classical eukaryote models. We recently proposed a method, co-occurrence domain detection (CODD), that improves the sensitivity of Pfam domain detection by exploiting the tendency of domains to appear preferentially with a few other favorite domains in a protein. In this paper, we present EuPathDomains (http://www.atgc-montpellier.fr/EuPathDomains/), an extended database of protein domains belonging to ten major eukaryotic human pathogens. EuPathDomains gathers known and new domains detected by CODD, along with the associated confidence measurements and the GO annotations that can be deduced from the new domains. This database significantly extends the Pfam domain coverage of all selected genomes, by proposing new occurrences of domains as well as new domain families that have never been reported before. For example, with a false discovery rate lower than 20%, EuPathDomains increases the number of detected domains by 13% in the Toxoplasma gondii genome and up to 28% in Cryptosporidium parvum, and the total number of domain families by 10% in Plasmodium falciparum and up to 16% in the C. parvum genome. The database can be queried by protein names, domain identifiers, Pfam or InterPro identifiers, or organisms, and should become a valuable resource.
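The co-occurrence idea behind CODD can be caricatured in a few lines: a below-threshold Pfam hit is kept ("certified") when the same protein also carries a confident hit for one of the domains it is known to favor. The co-occurrence table below is invented for illustration; the real method learns these partner sets from Pfam and controls the false discovery rate of the certified hits.

```python
# Hypothetical partner-domain table (Pfam accessions chosen arbitrarily).
CO_OCCURS = {"PF00069": {"PF07714", "PF00433"},
             "PF00412": {"PF00261"}}

def certify_weak_hits(confident, weak):
    """Keep a weak hit if one of its known partner domains is a confident hit."""
    certified = set()
    for dom in weak:
        if CO_OCCURS.get(dom, set()) & confident:
            certified.add(dom)
    return certified

confident_hits = {"PF07714"}
weak_hits = {"PF00069", "PF00412"}
print(certify_weak_hits(confident_hits, weak_hits))   # {'PF00069'}
```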
Background: Macromolecular visualization, as well as automated structural and functional annotation tools, plays an increasingly important role in the post-genomic era, contributing significantly to the understanding of molecular systems and processes. For example, three-dimensional (3D) models help in exploring protein active sites and functional hot spots that can be targeted in drug design. Automated annotation and visualization pipelines can also reveal other functionally important attributes of macromolecules. These goals depend on the availability of advanced tools that better integrate existing databases, annotation servers and other resources with state-of-the-art rendering programs. Results: We present a new tool for protein structure analysis, with a focus on annotation and visualization of protein complexes, which is an extension of our previously developed POLYVIEW web server. By integrating web technology with state-of-the-art software for macromolecular visualization, such as the PyMol program, POLYVIEW-3D enables combining versatile structural and functional annotations with a simple web-based interface for creating publication-quality structure renderings, as well as animated images for PowerPoint™, web sites and other electronic resources. The service is platform independent and no plug-ins are required. Several examples of how POLYVIEW-3D can be used for structural and functional analysis in the context of protein-protein interactions are presented to illustrate the available annotation options. Conclusion: The POLYVIEW-3D server features PyMol-based image rendering that provides detailed, high-quality presentation of macromolecular structures, with an easy-to-use web-based interface. POLYVIEW-3D also provides a wide array of options for automated structural and functional analysis of proteins and their complexes. Thus, the POLYVIEW-3D server may become an important resource for researchers and educators.
Despite their unparalleled biodiversity, the genomic resources available for beetles (Coleoptera) remain relatively scarce. We present an integrative, high-quality annotated transcriptome of the beetle Callosobruchus maculatus, an important and cosmopolitan agricultural pest as well as an emerging model species in ecology and evolutionary biology. Using Illumina sequencing technology, we sequenced 492 million read pairs generated from 51 samples of different developmental stages (larvae, pupae and adults) of C. maculatus. Reads were de novo assembled using the Trinity software, into a single combined assembly as well as into three separate assemblies based on data from the different developmental stages. The combined assembly generated 218,192 transcripts and 145,883 putative genes. Putative genes were annotated with the Blast2GO software and the Trinotate pipeline. In total, 33,216 putative genes were successfully annotated using Blastx against the Nr (non-redundant) database, and 13,382 were assigned to 34,100 Gene Ontology (GO) terms. We classified 5,475 putative genes into Clusters of Orthologous Groups (COG), and 116 metabolic pathway maps were predicted based on the annotation. Our analyses suggested that transcriptional specificity increases with ontogeny. For example, out of 33,216 annotated putative genes, 51 were expressed only in larvae, 63 only in pupae and 171 only in adults. Our study illustrates the importance of including samples from several developmental stages when the aim is to provide an integrative, high-quality annotated transcriptome. Our results will represent an invaluable resource for those working on the ecology, evolution and pest control of C. maculatus, as well as for comparative studies of the transcriptomics and genomics of beetles more generally.
Background: Advanced text mining (TM), such as semantic enrichment of papers, event or relation extraction, and intelligent question answering, has increasingly attracted attention in the biomedical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (parts of speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets the specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to the successful completion of a large-scale annotation. Conclusion: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (natural language processing)-based TM in the biomedical domain.
Alexandra M Schnoes
The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from the scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent the "few articles - many proteins" phenomenon is. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass spectrometry, microscopy and RNAi experiments dominate high-throughput experiments. Consequently, the functional information derived from these experiments mostly concerns the subcellular location of proteins and the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlaps substantially. We also show that the information provided by high-throughput experiments is less specific than that provided by low-throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments.
The Critical Assessment of Function Annotation meeting was held July 14-15, 2011 at the Austria Conference Center in Vienna, Austria. There were 73 registered delegates at the meeting. We thank the DOE for this award. It helped us organize and support a scientific meeting AFP 2011 as a special interest group (SIG) meeting associated with the ISMB 2011 conference. The conference was held in Vienna, Austria, in July 2011. The AFP SIG was held on July 15-16, 2011 (immediately preceding the conference). The meeting consisted of two components, the first being a series of talks (invited and contributed) and discussion sections dedicated to protein function research, with an emphasis on the theory and practice of computational methods utilized in functional annotation. The second component provided a large-scale assessment of computational methods through participation in the Critical Assessment of Functional Annotation (CAFA).
Chronaki, C E; Zabulis, X; Orphanoudakis, S C
I2Cnet (Image Indexing by Content network) aims to provide services related to the content-based management of images in healthcare over the World-Wide Web. Each I2Cnet server maintains an autonomous repository of medical images and related information. The annotation service of I2Cnet allows specialists to interact with the contents of the repository, adding comments or illustrations to medical images of interest. I2Cnet annotations may be communicated to other users via e-mail or posted to I2Cnet for inclusion in its local repositories. This paper discusses the annotation service of I2Cnet and argues that such services pave the way towards the evolution of active digital medical image libraries.
Background: The Stanley Medical Research Institute online genomics database (SMRIDB) is a comprehensive web-based system for understanding the genetic effects of human brain diseases (i.e. bipolar disorder, schizophrenia, and depression). This database contains fully annotated clinical metadata and gene expression patterns generated within 12 controlled studies across 6 different microarray platforms. Description: A thorough collection of gene expression summaries is provided, inclusive of patient demographics, disease subclasses, regulated biological pathways, and functional classifications. Conclusion: The combination of database content, structure, and query speed offers researchers an efficient tool for data mining of brain disease, complete with information such as cross-platform comparisons, biomarker elucidation for target discovery, and lifestyle/demographic associations with brain diseases.
Background: High-throughput proteomics experiments are useful for analyzing the protein expression of an organism, identifying the correct gene structure of a genome, or locating possible post-translational modifications within proteins. High-throughput methods necessitate publicly accessible and easily queried databases for efficiently and logically storing, displaying, and analyzing the large volume of data. Description: EPICDB is a publicly accessible, queryable, relational database that organizes and displays experimental, high-throughput proteomics data for Toxoplasma gondii and Cryptosporidium parvum. Along with detailed information on mass spectrometry experiments, the database also provides antibody experimental results and analysis of functional annotations, comparative genomics, and aligned expressed sequence tag (EST) and genomic open reading frame (ORF) sequences. The database contains all available alternative gene datasets for each organism, which comprise a complete theoretical proteome for the respective organism, and all data are referenced to these sequences. The database is structured around clusters of protein sequences, which allows for the evaluation of redundancy, protein prediction discrepancies, and possible splice variants. The database can be expanded to include genomes of other organisms for which proteome-wide experimental data are available. Conclusion: EPICDB is a comprehensive database of genome-wide T. gondii and C. parvum proteomics data and incorporates many features that allow for the analysis of entire proteomes and/or annotation of specific protein sequences. EPICDB is complementary to other genomics databases of these organisms by offering complete mass spectrometry analysis on a comprehensive set of all available protein sequences.
We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.
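Overlap detection is the core range operation mentioned above. A minimal pure-Python sweep over 1-based closed intervals illustrates the semantics; GenomicRanges itself is an R package and uses far more scalable indexing, so this is only a didactic stand-in.

```python
# Naive all-pairs overlap detection between two sets of genomic ranges.
def find_overlaps(queries, subjects):
    """Return (query_index, subject_index) pairs whose [start, end]
    ranges overlap. Both lists hold (start, end) tuples, 1-based closed."""
    hits = []
    for qi, (qs, qe) in enumerate(queries):
        for si, (ss, se) in enumerate(subjects):
            if qs <= se and ss <= qe:       # standard interval-overlap test
                hits.append((qi, si))
    return hits

reads = [(100, 150), (300, 350)]          # e.g. aligned reads
exons = [(120, 200), (500, 600)]          # e.g. transcript exons
print(find_overlaps(reads, exons))        # [(0, 0)]
```

Counting hits per subject range is then enough to derive simple coverage-style summaries, which hints at why overlap is the primitive so many downstream packages build on.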
Jacobs, Jeffrey Phillip; Jacobs, Marshall Lewis; Mavroudis, Constantine; Backer, Carl Lewis; Lacour-Gayet, Francois G.; Tchervenkov, Christo I.; Franklin, Rodney C. G.; Beland, Marie J.; Jenkins, Kathy J.; Walters, Hal; Bacha, Emile A.; Maruszewski, Bohdan; Kurosawa, Hiromi; Clarke, David Robinson; Gaynor, J. William; Spray, Thomas L.; Stellin, Giovanni; Ebels, Tjark; Krogmann, Otto N.; Aiello, Vera D.; Colan, Steven D.; Weinberg, Paul; Giroud, Jorge M.; Everett, Allen; Wernovsky, Gil; Elliott, Martin J.; Edwards, Fred H.
This review discusses the historical aspects, current state of the art, and potential future advances in the areas of nomenclature and databases for the analysis of outcomes of treatments for patients with congenitally malformed hearts. We will consider the current state of analysis of outcomes, lay
Rappaport, Noa; Twik, Michal; Plaschkes, Inbar; Nudel, Ron; Iny Stein, Tsippi; Levitt, Jacob; Gershoni, Moran; Morrey, C. Paul; Safran, Marilyn; Lancet, Doron
The MalaCards human disease database (http://www.malacards.org/) is an integrated compendium of annotated diseases mined from 68 data sources. MalaCards has a web card for each of ∼20 000 disease entries, in six global categories. It portrays a broad array of annotation topics in 15 sections, including Summaries, Symptoms, Anatomical Context, Drugs, Genetic Tests, Variations and Publications. The Aliases and Classifications section reflects an algorithm for disease name integration across often-conflicting sources, providing effective annotation consolidation. A central feature is a balanced Genes section, with scores reflecting the strength of disease-gene associations. This is accompanied by other gene-related disease information such as pathways, mouse phenotypes and GO-terms, stemming from MalaCards’ affiliation with the GeneCards Suite of databases. MalaCards’ capacity to inter-link information from complementary sources, along with its elaborate search function, relational database infrastructure and convenient data dumps, allows it to tackle its rich disease annotation landscape, and facilitates systems analyses and genome sequence interpretation. MalaCards adopts a ‘flat’ disease-card approach, but each card is mapped to popular hierarchical ontologies (e.g. International Classification of Diseases, Human Phenotype Ontology and Unified Medical Language System) and also contains information about multi-level relations among diseases, thereby providing an optimal tool for disease representation and scrutiny. PMID:27899610
Rahmati, Sara; Abovsky, Mark; Pastrello, Chiara; Jurisica, Igor
Molecular pathway data are essential in current computational and systems biology research. While there are many primary and integrated pathway databases, several challenges remain, including low proteome coverage (57%), low overlap across different databases, unavailability of direct information about underlying physical connectivity of pathway members, and high fraction of protein-coding genes without any pathway annotations, i.e. ‘pathway orphans’. In order to address all these challenges, we developed pathDIP, which integrates data from 20 source pathway databases, ‘core pathways’, with physical protein–protein interactions to predict biologically relevant protein–pathway associations, referred to as ‘extended pathways’. Cross-validation determined 71% recovery rate of our predictions. Data integration and predictions increase coverage of pathway annotations for protein-coding genes to 86%, and provide novel annotations for 5732 pathway orphans. PathDIP (http://ophid.utoronto.ca/pathdip) annotates 17 070 protein-coding genes with 4678 pathways, and provides multiple query, analysis and output options. PMID:27899558
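The "extended pathways" idea, at its simplest, lets a pathway orphan inherit annotations from its physical interaction partners. The sketch below illustrates only that step; pathDIP additionally scores and cross-validates the predicted associations, and the gene and pathway names here are made up.

```python
from collections import defaultdict

def extend_pathways(core, ppi):
    """core: gene -> set of pathway names; ppi: list of (geneA, geneB) edges.
    Orphans (genes absent from core) inherit their neighbours' pathways."""
    neighbours = defaultdict(set)
    for a, b in ppi:
        neighbours[a].add(b)
        neighbours[b].add(a)
    extended = {g: set(p) for g, p in core.items()}
    for gene, nbrs in neighbours.items():
        if gene not in core:                      # pathway orphan
            inherited = set()
            for n in nbrs:
                inherited |= core.get(n, set())
            if inherited:
                extended[gene] = inherited
    return extended

core = {"TP53": {"apoptosis"}, "MDM2": {"apoptosis", "p53 regulation"}}
ppi = [("TP53", "MDM2"), ("MDM2", "ORPHAN1")]
ext = extend_pathways(core, ppi)
```

Without a scoring step this naive propagation would over-annotate; the reported 71% recovery rate in cross-validation is what justifies the predicted associations.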
Full Text Available Abstract Background The development of high-throughput experimentation has led to astronomical growth in biologically relevant lipids and lipid derivatives identified, screened, and deposited in numerous online databases. Unfortunately, efforts to annotate, classify, and analyze these chemical entities have largely remained in the hands of human curators using manual or semi-automated protocols, leaving many novel entities unclassified. Since chemical function is often closely linked to structure, accurate structure-based classification and annotation of chemical entities is imperative to understanding their functionality. Results As part of an exploratory study, we have investigated the utility of semantic web technologies in automated chemical classification and annotation of lipids. Our prototype framework consists of two components: an ontology and a set of federated web services that operate upon it. The formal lipid ontology we use here extends a part of the LiPrO ontology and draws on the lipid hierarchy in the LIPID MAPS database, as well as literature-derived knowledge. The federated semantic web services that operate upon this ontology are deployed within the Semantic Annotation, Discovery, and Integration (SADI) framework. Structure-based lipid classification is enacted by two core services. Firstly, a structural annotation service detects and enumerates relevant functional groups for a specified chemical structure. A second service reasons over lipid ontology class descriptions using the attributes obtained from the annotation service and identifies the appropriate lipid classification. We extend the utility of these core services by combining them with additional SADI services that retrieve associations between lipids and proteins and identify publications related to specified lipid types. We analyze the performance of SADI-enabled eicosanoid classification relative to the LIPID MAPS classification and reflect on the contribution of
Showalter, C.; Rex, R.; Hurlburt, N. E.; Zita, E. J.
We have written a software suite designed to facilitate solar data analysis by scientists, students, and the public, anticipating enormous datasets from future instruments. Our “STAR” suite includes an interactive learning section explaining 15 classes of solar events. Users learn software tools that exploit humans’ superior ability (over computers) to identify many events. Annotation tools include time slice generation to quantify loop oscillations, the interpolation of event shapes using natural cubic splines (for loops, sigmoids, and filaments) and closed cubic splines (for coronal holes). Learning these tools in an environment where examples are provided prepares new users to comfortably utilize annotation software with new data. Upon completion of our tutorial, users are presented with media of various solar events and asked to identify and annotate the images, to test their mastery of the system. Goals of the project include public input into the data analysis of very large datasets from future solar satellites, and increased public interest in and knowledge about the Sun. In 2010, the Solar Dynamics Observatory (SDO) will be launched into orbit. SDO’s advancements in solar telescope technology will generate a terabyte per day of high-quality data, requiring innovation in data management. While major projects develop automated feature recognition software, so that computers can complete much of the initial event tagging and analysis, that software still cannot annotate features such as sigmoids, coronal magnetic loops, and coronal dimming, due to large amounts of data concentrated in relatively small areas. Previously, solar physicists manually annotated these features, but with the imminent influx of data it is unrealistic to expect specialized researchers to examine every image that computers cannot fully process. A new approach is needed to efficiently process these data. Providing analysis tools and data access to students and the public has proven
Peptides with diverse amino acid sequences, structures, and functions are essential players in biological systems. The construction of well-annotated databases not only facilitates effective information management, search, and mining but also lays the foundation for developing and testing new peptide algorithms and machines. The antimicrobial peptide database (APD) is an original construction in terms of both database design and peptide entries. The host defense antimicrobial peptides (AMPs) registered in the APD cover the five kingdoms (bacteria, protists, fungi, plants, and animals) or three domains of life (bacteria, archaea, and eukaryota). This comprehensive database ( http://aps.unmc.edu/AP ) provides useful information on peptide discovery timeline, nomenclature, classification, glossary, calculation tools, and statistics. The APD enables effective search, prediction, and design of peptides with antibacterial, antiviral, antifungal, antiparasitic, insecticidal, spermicidal, anticancer, chemotactic, immune-modulating, or antioxidative properties. A universal classification scheme is proposed herein to unify innate immunity peptides from a variety of biological sources. As an improvement, the upgraded APD makes predictions based on the database-defined parameter space and provides a list of the sequences most similar to natural AMPs. In addition, the powerful pipeline design of the database search engine lays a solid basis for designing novel antimicrobials to combat resistant superbugs, viruses, fungi, or parasites. This comprehensive AMP database is a useful tool for both research and education.
Full Text Available Biomedical annotation is a common and effective artifact for researchers to discuss, express opinions, and share discoveries. It has become increasingly popular in many online research communities and carries much useful information. Ranking biomedical annotations is a critical problem for data users seeking information efficiently. As the annotator’s knowledge about the annotated entity normally determines the quality of the annotations, we evaluate that knowledge, that is, the semantic relationship between annotator and entity, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes, and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to users’ votes and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when the data set is large.
Mi, Huaiyu; Huang, Xiaosong; Muruganujan, Anushya; Tang, Haiming; Mills, Caitlin; Kang, Diane; Thomas, Paul D.
Yu, Feiyang; Ip, Horace H S
This paper presents a novel two-dimensional (2-D) stochastic method for semantic analysis of the content of histological images. Specifically, we propose a 2-D generalization of the traditional hidden Markov model (HMM). The generalization, called the spatial hidden Markov model (SHMM), captures the contextual characteristics of complex biological features in histological images. The model employs a second-order neighborhood system and assumes the conditional independence of vertical and horizontal transitions between hidden states. The notion of 'past' in an SHMM is defined as what has been observed in a row-wise raster scan. This paper focuses on two fundamental problems: the best-states decoding problem and the estimation of the generation probability of an image under an SHMM. Based on our independence assumption for horizontal and vertical transitions, we derive computationally tractable solutions to these problems. The solutions are direct extensions of their counterparts for the 1-D HMM, i.e., the Viterbi algorithm and the forward-backward algorithm. Our experiments were carried out on a medical image database with 200 images and compared with a state-of-the-art approach run on the same database. The annotation results demonstrate that the SHMM consistently outperforms the previous approach and ameliorates many of its drawbacks. In addition, performance comparison with the HMM has also validated the superiority of the SHMM.
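The decoding idea described above — a raster scan in which the probability of entering a state factorizes into a horizontal term (given the left neighbor) and a vertical term (given the upper neighbor) — can be sketched as a greedy approximation. This is an illustrative simplification under stated assumptions, not the paper's exact extension of the Viterbi algorithm; the function and parameter names are hypothetical.

```python
import numpy as np

def shmm_decode(emission, trans_h, trans_v, prior):
    """Greedy raster-scan decoding sketch for a spatial HMM in which
    horizontal and vertical transitions are conditionally independent.
    emission: (rows, cols, S) per-pixel state likelihoods.
    trans_h, trans_v: (S, S) transition matrices (row = previous state).
    prior: (S,) initial state distribution, used at boundaries."""
    rows, cols, S = emission.shape
    states = np.zeros((rows, cols), dtype=int)
    for i in range(rows):
        for j in range(cols):
            score = emission[i, j].copy()
            # horizontal term from the already-decoded left neighbor
            score *= trans_h[states[i, j - 1]] if j > 0 else prior
            if i > 0:
                # vertical term from the already-decoded upper neighbor
                score *= trans_v[states[i - 1, j]]
            states[i, j] = int(np.argmax(score))
    return states
```

A full best-states decoder would keep per-state scores and backpointers per row rather than committing greedily at each pixel; the sketch only shows how the two independent transition terms combine multiplicatively with the emission likelihood.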
Full Text Available Abstract Background Medicago truncatula has been chosen as a model species for genomic studies. It is closely related to an important legume, alfalfa. Transporters are a large group of membrane-spanning proteins. They deliver essential nutrients, eject waste products, and assist the cell in sensing environmental conditions by forming a complex system of pumps and channels. Although studies have effectively characterized individual M. truncatula transporters in several databases, until now there has been no available systematic database that includes all transporters in M. truncatula. Description The M. truncatula transporter database (MTDB) contains comprehensive information on the transporters in M. truncatula. Based on the TransportTP method, we have presented a novel prediction pipeline. A total of 3,665 putative transporters have been annotated based on the International Medicago Genome Annotation Group (IMGAG) V3.5 and the M. truncatula Gene Index (MTGI) V10.0 releases and assigned to 162 families according to the transporter classification system. These families were further classified into seven types according to their transport mode and energy coupling mechanism. Extensive annotations referring to each protein were generated, including basic protein function, expressed sequence tag (EST) mapping, genome locus, three-dimensional template prediction, transmembrane segment, and domain annotation. A chromosome distribution map and text-based Basic Local Alignment Search Tools were also created. In addition, we have provided a way to explore the expression of putative M. truncatula transporter genes under stress treatments. Conclusions In summary, the MTDB enables the exploration and comparative analysis of putative transporters in M. truncatula. A user-friendly web interface and regular updates make MTDB valuable to researchers in related fields. The MTDB is now freely available to all users at http://bioinformatics.cau.edu.cn/MtTransporter/.
Ashish V Tendulkar
Full Text Available BACKGROUND: FragKB (Fragment Knowledgebase) is a repository of clusters of structurally similar fragments from proteins. Fragments are annotated with information at the level of sequence, structure and function, integrating biological descriptions derived from multiple existing resources and text mining. METHODOLOGY: FragKB contains approximately 400,000 conserved fragments from 4,800 representative proteins from PDB. Literature annotations are extracted from more than 1,700 articles and are available for over 12,000 fragments. The underlying systematic annotation workflow of FragKB ensures efficient update and maintenance of this database. The information in FragKB can be accessed through a web interface that facilitates sequence and structural visualization of fragments together with known literature information on the consequences of specific residue mutations and functional annotations of proteins and fragment clusters. FragKB is accessible online at http://ubio.bioinfo.cnio.es/biotools/fragkb/. SIGNIFICANCE: The information presented in FragKB can be used for modeling protein structures, for designing novel proteins and for functional characterization of related fragments. The current release is focused on functional characterization of proteins through inspection of conservation of the fragments.
Rice (Oryza sativa) feeds over half of the global population. A web-based integrated platform for rice microarray annotation and data analysis in various biological contexts is presented, which provides a convenient query for comprehensive annotation compared with similar databases. Coupled with existing rice microarray data, it provides online analysis methods from the perspective of bioinformatics. This comprehensive bioinformatics analysis platform is composed of five modules: data retrieval, microarray annotation, sequence analysis, results visualization and data analysis. The BioChip module facilitates the retrieval of microarray data information via the identifiers "Probe Set ID", "Locus ID" and "Analysis Name". The BioAnno module is used to annotate a gene or probe set based on gene function, domain information, KEGG biochemical and regulatory pathways, and the potential microRNAs that regulate the gene. The BioSeq module lists all related sequence information for a microarray probe set. The BioView module provides various visual results for the microarray data. The BioAnaly module is used to analyze rice microarray data sets.
Introduction to Database Systems: Functions of a Database; Database Management System; Database Components; Database Development Process. Conceptual Design and Data Modeling: Introduction to the Database Design Process; Understanding Business Process; Entity-Relationship Data Model; Representing Business Process with the Entity-Relationship Model. Table Structure and Normalization: Introduction to Tables; Table Normalization; Transforming Data Models to Relational Databases. DBMS Selection: Transforming Data Models to Relational Databases; Enforcing Constraints; Creating Database for Business Process. Physical Design and Database
Full Text Available Gedske Daugaard,1 Maria Gry Gundgaard Kier,1 Mikkel Bandak,1 Mette Saksø Mortensen,1 Heidi Larsson,2 Mette Søgaard,2 Birgitte Groenkaer Toft,3 Birte Engvad,4 Mads Agerbæk,5 Niels Vilstrup Holm,6 Jakob Lauritsen1 1Department of Oncology 5073, Copenhagen University Hospital, Rigshospitalet, Copenhagen, 2Department of Clinical Epidemiology, Aarhus University Hospital, Aarhus, 3Department of Pathology, Copenhagen University Hospital, Rigshospitalet, Copenhagen, 4Department of Pathology, Odense University Hospital, Odense, 5Department of Oncology, Aarhus University Hospital, Aarhus, 6Department of Oncology, Odense University Hospital, Odense, Denmark Aim: The nationwide Danish Testicular Cancer database consists of a retrospective research database (DaTeCa) and a prospective clinical database (Danish Multidisciplinary Cancer Group [DMCG] DaTeCa). The aim is to improve the quality of care for patients with testicular cancer (TC) in Denmark, that is, by identifying risk factors for relapse and toxicity related to treatment, and by focusing on late effects. Study population: All Danish male patients with a histologically verified germ cell cancer diagnosis in the Danish Pathology Registry are included in the DaTeCa databases. Data collection has been performed from 1984 to 2007 and from 2013 onward, respectively. Main variables and descriptive data: The retrospective DaTeCa database contains detailed information with more than 300 variables related to histology, stage, treatment, relapses, pathology, tumor markers, kidney function, lung function, etc. A questionnaire related to late effects has been conducted, which includes questions regarding social relationships, life situation, general health status, family background, diseases, symptoms, use of medication, marital status, psychosocial issues, fertility, and sexuality. TC survivors alive in October 2014 were invited to fill in this questionnaire, which includes 160 validated questions
Full Text Available Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we built a silver standard annotation set from the individual systems' output and assess the quality as well as the contribution of individual systems to the quality of the silver standard. Our results demonstrate that mainly the NCBO Annotator and cTAKES contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and to their quality (best F1-measure of 33%), independent of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when they are combined with the NCBO Annotator and cTAKES due to a low recall. In conclusion, the performances of individual systems need to be improved independently of the text types, and the leveraging strategies to best take advantage of individual systems' annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and assigned annotations of the four concept recognition systems as well as the generated silver standard annotation sets are available from http://purl.org/phenotype/resources.
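The silver-standard construction described above can be sketched as simple voting over the systems' outputs. This is a minimal illustration, assuming annotations are reduced to hashable (span, concept) pairs; real harmonization across cTAKES, the NCBO Annotator, BeCAS and MetaMap involves span-overlap and concept-normalization rules omitted here, and the example spans and concept identifiers are hypothetical.

```python
from collections import Counter

def silver_standard(system_annotations, min_votes=2):
    """Toy voting rule: keep an annotation (a (span, concept) pair)
    if at least `min_votes` systems emitted it.
    system_annotations: one iterable of annotations per system."""
    votes = Counter()
    for annotations in system_annotations:
        # de-duplicate within a system so each system votes at most once
        votes.update(set(annotations))
    return {ann for ann, n in votes.items() if n >= min_votes}

# Hypothetical outputs from three systems on the same document.
sys_a = {(("diabetes", 10, 18), "C0011849")}
sys_b = {(("diabetes", 10, 18), "C0011849"), (("pain", 30, 34), "C0030193")}
sys_c = {(("pain", 30, 34), "C0030193")}
print(silver_standard([sys_a, sys_b, sys_c]))
```

Raising `min_votes` trades recall for precision, which mirrors the abstract's observation that adding low-recall systems to the vote can drop the combined F1-measure.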
Wang, Donghua; Liu, Jianjun; Zhao, Renliang; Shang, Yaoling; Du, Xiao; Yang, Mei
The 1:10 000 database is a main component of China's provincial fundamental geographic information resources and an important part of building Digital China. This paper introduces the overall technical design of the project for integrating and upgrading the national 1:10 000 database. Based on the actual situation of constructing and updating the national 1:10 000 database, existing problems are analyzed, with emphasis on the contents of integration processing, the overall approach, the technical methods, and the technical scheme for optimizing and upgrading the database.
J.D. Strachan and G. Corrigan
This annotated bibliography is intended to help EDGE2D users, particularly new users, find existing published literature that has used EDGE2D. The idea is that readers can find existing studies that may relate to their intended use, as well as gain ideas about other possible applications, by scanning the attached tables.
Barrios, Ernie, Ed.
More than 300 books and articles published from 1920 to 1971 are reviewed in this annotated bibliography of literature on the Chicano. The citations and reviews are categorized by subject area and deal with contemporary Chicano history, education, health, history of Mexico, literature, native Americans, philosophy, political science, pre-Columbian…