WorldWideScience

Sample records for protein sequence database

  1. MIPS: a database for protein sequences and complete genomes.

    Science.gov (United States)

    Mewes, H W; Hani, J; Pfeiffer, F; Frishman, D

    1998-01-01

    The MIPS group [Munich Information Center for Protein Sequences of the German National Center for Environment and Health (GSF)] at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, is involved in a number of data collection activities, including a comprehensive database of the yeast genome, a database reflecting the progress in sequencing the Arabidopsis thaliana genome, the systematic analysis of other small genomes and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). Through its WWW server (http://www.mips.biochem.mpg.de ) MIPS provides access to a variety of generic databases, including a database of protein families as well as automatically generated data by the systematic application of sequence analysis algorithms. The yeast genome sequence and its related information was also compiled on CD-ROM to provide dynamic interactive access to the 16 chromosomes of the first eukaryotic genome unraveled. PMID:9399795

  2. MIPS: a database for genomes and protein sequences.

    Science.gov (United States)

    Mewes, H W; Frishman, D; Güldener, U; Mannhaupt, G; Mayer, K; Mokrejs, M; Morgenstern, B; Münsterkötter, M; Rudd, S; Weil, B

    2002-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz-Netzwerk Bioinformatik) networks. The Arabidospsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91-93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155-158; Barker et al. (2001) Nucleic Acids Res., 29, 29-32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de).

  3. MIPS: a database for protein sequences, homology data and yeast genome information.

    Science.gov (United States)

    Mewes, H W; Albermann, K; Heumann, K; Liebl, S; Pfeiffer, F

    1997-01-01

    The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database (,). MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/ ) MIPS permits internet access to sequence databases, homology data and to yeast genome information. (i) Sequence similarity results from the FASTA program () are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure () developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome (), the functional classification of yeast genes (FunCat) and its graphical display, the 'Genome Browser' (). A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request. PMID:9016498

  4. Domain fusion analysis by applying relational algebra to protein sequence and domain databases.

    Science.gov (United States)

    Truong, Kevin; Ikura, Mitsuhiko

    2003-05-06

    Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain databases like InterPro continue to grow in size and quality, a computational method to perform domain fusion analysis that leverages on these efforts will become increasingly powerful. This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypothesis. Results can be viewed at http://calcium.uhnres.utoronto.ca/pi. As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time.

  5. Protein backbone angle restraints from searching a database for chemical shift and sequence homology

    Energy Technology Data Exchange (ETDEWEB)

    Cornilescu, Gabriel; Delaglio, Frank; Bax, Ad [National Institutes of Health, Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases (United States)

    1999-03-15

    Chemical shifts of backbone atoms in proteins are exquisitely sensitive to local conformation, and homologous proteins show quite similar patterns of secondary chemical shifts. The inverse of this relation is used to search a database for triplets of adjacent residues with secondary chemical shifts and sequence similarity which provide the best match to the query triplet of interest. The database contains 13C{alpha}, 13C{beta}, 13C', 1H{alpha} and 15N chemical shifts for 20 proteins for which a high resolution X-ray structure is available. The computer program TALOS was developed to search this database for strings of residues with chemical shift and residue type homology. The relative importance of the weighting factors attached to the secondary chemical shifts of the five types of resonances relative to that of sequence similarity was optimized empirically. TALOS yields the 10 triplets which have the closest similarity in secondary chemical shift and amino acid sequence to those of the query sequence. If the central residues in these 10 triplets exhibit similar {phi} and {psi} backbone angles, their averages can reliably be used as angular restraints for the protein whose structure is being studied. Tests carried out for proteins of known structure indicate that the root-mean-square difference (rmsd) between the output of TALOS and the X-ray derived backbone angles is about 15 deg. Approximately 3% of the predictions made by TALOS are found to be in error.

  6. muBLASTP: database-indexed protein sequence search on multicore CPUs.

    Science.gov (United States)

    Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun

    2016-11-04

    The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Due to different challenges and characteristics between query indexing and database indexing, the existing techniques for query-indexed search cannot be used into database indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers identical hits returned to NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedups for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for protein database and associated optimizations in BLASTP algorithm, we re-factored BLASTP algorithm for modern multicore processors that achieves much higher throughput with acceptable memory footprint for the database index.

  7. UET: a database of evolutionarily-predicted functional determinants of protein sequences that cluster as functional sites in protein structures.

    Science.gov (United States)

    Lua, Rhonald C; Wilson, Stephen J; Konecki, Daniel M; Wilkins, Angela D; Venner, Eric; Morgan, Daniel H; Lichtarge, Olivier

    2016-01-04

    The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  8. High Performance Protein Sequence Database Scanning on the Cell Broadband Engine

    Directory of Open Access Journals (Sweden)

    Adrianto Wirawan

    2009-01-01

    Full Text Available The enormous growth of biological sequence databases has caused bioinformatics to be rapidly moving towards a data-intensive, computational science. As a result, the computational power needed by bioinformatics applications is growing rapidly as well. The recent emergence of low cost parallel multicore accelerator technologies has made it possible to reduce execution times of many bioinformatics applications. In this paper, we demonstrate how the Cell Broadband Engine can be used as a computational platform to accelerate two approaches for protein sequence database scanning: exhaustive and heuristic. We present efficient parallelization techniques for two representative algorithms: the dynamic programming based Smith–Waterman algorithm and the popular BLASTP heuristic. Their implementation on a Playstation®3 leads to significant runtime savings compared to corresponding sequential implementations.

  9. Protein backbone chemical shifts predicted from searching a database for torsion angle and sequence homology

    International Nuclear Information System (INIS)

    Shen Yang; Bax, Ad

    2007-01-01

    Chemical shifts of nuclei in or attached to a protein backbone are exquisitely sensitive to their local environment. A computer program, SPARTA, is described that uses this correlation with local structure to predict protein backbone chemical shifts, given an input three-dimensional structure, by searching a newly generated database for triplets of adjacent residues that provide the best match in φ/ψ/χ 1 torsion angles and sequence similarity to the query triplet of interest. The database contains 15 N, 1 H N , 1 H α , 13 C α , 13 C β and 13 C' chemical shifts for 200 proteins for which a high resolution X-ray (≤2.4 A) structure is available. The relative importance of the weighting factors for the φ/ψ/χ 1 angles and sequence similarity was optimized empirically. The weighted, average secondary shifts of the central residues in the 20 best-matching triplets, after inclusion of nearest neighbor, ring current, and hydrogen bonding effects, are used to predict chemical shifts for the protein of known structure. Validation shows good agreement between the SPARTA-predicted and experimental shifts, with standard deviations of 2.52, 0.51, 0.27, 0.98, 1.07 and 1.08 ppm for 15 N, 1 H N , 1 H α , 13 C α , 13 C β and 13 C', respectively, including outliers

  10. MannDB – A microbial database of automated protein sequence analyses and evidence integration for protein characterization

    Directory of Open Access Journals (Sweden)

    Kuczmarski Thomas A

    2006-10-01

    Full Text Available Abstract Background MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. Description MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. Conclusion MannDB comprises a large number of genomes and comprehensive protein

  11. Sequence protein identification by randomized sequence database and transcriptome mass spectrometry (SPIDER-TMS): from manual to automatic application of a 'de novo sequencing' approach.

    Science.gov (United States)

    Pascale, Raffaella; Grossi, Gerarda; Cruciani, Gabriele; Mecca, Giansalvatore; Santoro, Donatello; Sarli Calace, Renzo; Falabella, Patrizia; Bianco, Giuliana

    Sequence protein identification by a randomized sequence database and transcriptome mass spectrometry software package has been developed at the University of Basilicata in Potenza (Italy) and designed to facilitate the determination of the amino acid sequence of a peptide as well as an unequivocal identification of proteins in a high-throughput manner with enormous advantages of time, economical resource and expertise. The software package is a valid tool for the automation of a de novo sequencing approach, overcoming the main limits and a versatile platform useful in the proteomic field for an unequivocal identification of proteins, starting from tandem mass spectrometry data. The strength of this software is that it is a user-friendly and non-statistical approach, so protein identification can be considered unambiguous.

  12. The master two-dimensional gel database of human AMA cell proteins: towards linking protein and genome sequence and mapping information (update 1991)

    DEFF Research Database (Denmark)

    Celis, J E; Leffers, H; Rasmussen, H H

    1991-01-01

    autoantigens" and "cDNAs". For convenience we have included an alphabetical list of all known proteins recorded in this database. In the long run, the main goal of this database is to link protein and DNA sequencing and mapping information (Human Genome Program) and to provide an integrated picture......The master two-dimensional gel database of human AMA cells currently lists 3801 cellular and secreted proteins, of which 371 cellular polypeptides (306 IEF; 65 NEPHGE) were added to the master images during the last 10 months. These include: (i) very basic and acidic proteins that do not focus...

  13. PROCARB: A Database of Known and Modelled Carbohydrate-Binding Protein Structures with Sequence-Based Prediction Tools

    Directory of Open Access Journals (Sweden)

    Adeel Malik

    2010-01-01

    Full Text Available Understanding of the three-dimensional structures of proteins that interact with carbohydrates covalently (glycoproteins as well as noncovalently (protein-carbohydrate complexes is essential to many biological processes and plays a significant role in normal and disease-associated functions. It is important to have a central repository of knowledge available about these protein-carbohydrate complexes as well as preprocessed data of predicted structures. This can be significantly enhanced by tools de novo which can predict carbohydrate-binding sites for proteins in the absence of structure of experimentally known binding site. PROCARB is an open-access database comprising three independently working components, namely, (i Core PROCARB module, consisting of three-dimensional structures of protein-carbohydrate complexes taken from Protein Data Bank (PDB, (ii Homology Models module, consisting of manually developed three-dimensional models of N-linked and O-linked glycoproteins of unknown three-dimensional structure, and (iii CBS-Pred prediction module, consisting of web servers to predict carbohydrate-binding sites using single sequence or server-generated PSSM. Several precomputed structural and functional properties of complexes are also included in the database for quick analysis. In particular, information about function, secondary structure, solvent accessibility, hydrogen bonds and literature reference, and so forth, is included. In addition, each protein in the database is mapped to Uniprot, Pfam, PDB, and so forth.

  14. Cluster based on sequence comparison of homologous proteins of 95 organism species - Gclust Server | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us Gclust Server Cluster based on sequence comparison of homologous proteins of 95 organism spe...cies Data detail Data name Cluster based on sequence comparison of homologous proteins of 95 organism specie...istory of This Database Site Policy | Contact Us Cluster based on sequence compariso

  15. Amino acid sequences of predicted proteins and their annotation for 95 organism species. - Gclust Server | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us Gclust Server Amino acid sequences of predicted proteins and their annotation for 95 organis...m species. Data detail Data name Amino acid sequences of predicted proteins and their annotation for 95 orga...nism species. DOI 10.18908/lsdba.nbdc00464-001 Description of data contents Amino acid sequences of predicted proteins...Database Description Download License Update History of This Database Site Policy | Contact Us Amino acid sequences of predicted prot...eins and their annotation for 95 organism species. - Gclust Server | LSDB Archive ...

  16. ORFer--retrieval of protein sequences and open reading frames from GenBank and storage into relational databases or text files.

    Science.gov (United States)

    Büssow, Konrad; Hoffmann, Steve; Sievert, Volker

    2002-12-19

    Functional genomics involves the parallel experimentation with large sets of proteins. This requires management of large sets of open reading frames as a prerequisite of the cloning and recombinant expression of these proteins. A Java program was developed for retrieval of protein and nucleic acid sequences and annotations from NCBI GenBank, using the XML sequence format. Annotations retrieved by ORFer include sequence name, organism and also the completeness of the sequence. The program has a graphical user interface, although it can be used in a non-interactive mode. For protein sequences, the program also extracts the open reading frame sequence, if available, and checks its correct translation. ORFer accepts user input in the form of single or lists of GenBank GI identifiers or accession numbers. It can be used to extract complete sets of open reading frames and protein sequences from any kind of GenBank sequence entry, including complete genomes or chromosomes. Sequences are either stored with their features in a relational database or can be exported as text files in Fasta or tabulator delimited format. The ORFer program is freely available at http://www.proteinstrukturfabrik.de/orfer. The ORFer program allows for fast retrieval of DNA sequences, protein sequences and their open reading frames and sequence annotations from GenBank. Furthermore, storage of sequences and features in a relational database is supported. Such a database can supplement a laboratory information system (LIMS) with appropriate sequence information.

  17. Shotgun protein sequencing.

    Energy Technology Data Exchange (ETDEWEB)

    Faulon, Jean-Loup Michel; Heffelfinger, Grant S.

    2009-06-01

    A novel experimental and computational technique based on multiple enzymatic digestion of a protein or protein mixture that reconstructs protein sequences from sequences of overlapping peptides is described in this SAND report. This approach, analogous to shotgun sequencing of DNA, is to be used to sequence alternative spliced proteins, to identify post-translational modifications, and to sequence genetically engineered proteins.

  18. Protein-Protein Interaction Databases

    DEFF Research Database (Denmark)

    Szklarczyk, Damian; Jensen, Lars Juhl

    2015-01-01

    Years of meticulous curation of scientific literature and increasingly reliable computational predictions have resulted in creation of vast databases of protein interaction data. Over the years, these repositories have become a basic framework in which experiments are analyzed and new directions...

  19. A curated gluten protein sequence database to support development of proteomics methods for determination of gluten in gluten-free foods.

    Science.gov (United States)

    Bromilow, Sophie; Gethings, Lee A; Buckley, Mike; Bromley, Mike; Shewry, Peter R; Langridge, James I; Clare Mills, E N

    2017-06-23

    The unique physiochemical properties of wheat gluten enable a diverse range of food products to be manufactured. However, gluten triggers coeliac disease, a condition which is treated using a gluten-free diet. Analytical methods are required to confirm if foods are gluten-free, but current immunoassay-based methods can unreliable and proteomic methods offer an alternative but require comprehensive and well annotated sequence databases which are lacking for gluten. A manually a curated database (GluPro V1.0) of gluten proteins, comprising 630 discrete unique full length protein sequences has been compiled. It is representative of the different types of gliadin and glutenin components found in gluten. An in silico comparison of their coeliac toxicity was undertaken by analysing the distribution of coeliac toxic motifs. This demonstrated that whilst the α-gliadin proteins contained more toxic motifs, these were distributed across all gluten protein sub-types. Comparison of annotations observed using a discovery proteomics dataset acquired using ion mobility MS/MS showed that more reliable identifications were obtained using the GluPro V1.0 database compared to the complete reviewed Viridiplantae database. This highlights the value of a curated sequence database specifically designed to support the proteomic workflows and the development of methods to detect and quantify gluten. We have constructed the first manually curated open-source wheat gluten protein sequence database (GluPro V1.0) in a FASTA format to support the application of proteomic methods for gluten protein detection and quantification. We have also analysed the manually verified sequences to give the first comprehensive overview of the distribution of sequences able to elicit a reaction in coeliac disease, the prevalent form of gluten intolerance. Provision of this database will improve the reliability of gluten protein identification by proteomic analysis, and aid the development of targeted mass

  20. PSI/TM-Coffee: a web server for fast and accurate multiple sequence alignments of regular and transmembrane proteins using homology extension on reduced databases.

    Science.gov (United States)

    Floden, Evan W; Tommaso, Paolo D; Chatzou, Maria; Magis, Cedrik; Notredame, Cedric; Chang, Jia-Ming

    2016-07-08

    The PSI/TM-Coffee web server performs multiple sequence alignment (MSA) of proteins by combining homology extension with a consistency based alignment approach. Homology extension is performed with Position Specific Iterative (PSI) BLAST searches against a choice of redundant and non-redundant databases. The main novelty of this server is to allow databases of reduced complexity to rapidly perform homology extension. This server also gives the possibility to use transmembrane proteins (TMPs) reference databases to allow even faster homology extension on this important category of proteins. Aside from an MSA, the server also outputs topological prediction of TMPs using the HMMTOP algorithm. Previous benchmarking of the method has shown this approach outperforms the most accurate alignment methods such as MSAProbs, Kalign, PROMALS, MAFFT, ProbCons and PRALINE™. The web server is available at http://tcoffee.crg.cat/tmcoffee. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  1. Winnowing sequences from a database search.

    Science.gov (United States)

    Berman, P; Zhang, Z; Wolf, Y I; Koonin, E V; Miller, W

    2000-01-01

    In database searches for sequence similarity, matches to a distinct sequence region (e.g., protein domain) are frequently obscured by numerous matches to another region of the same sequence. In order to cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of "tolerable redundancy." An algorithm is developed to solve the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.

  2. Genome Sequence Databases (Overview): Sequencing and Assembly

    Energy Technology Data Exchange (ETDEWEB)

    Lapidus, Alla L.

    2009-01-01

    From the date its role in heredity was discovered, DNA has been generating interest among scientists from different fields of knowledge: physicists have studied the three dimensional structure of the DNA molecule, biologists tried to decode the secrets of life hidden within these long molecules, and technologists invent and improve methods of DNA analysis. The analysis of the nucleotide sequence of DNA occupies a special place among the methods developed. Thanks to the variety of sequencing technologies available, the process of decoding the sequence of genomic DNA (or whole genome sequencing) has become robust and inexpensive. Meanwhile the assembly of whole genome sequences remains a challenging task. In addition to the need to assemble millions of DNA fragments of different length (from 35 bp (Solexa) to 800 bp (Sanger)), great interest in analysis of microbial communities (metagenomes) of different complexities raises new problems and pushes some new requirements for sequence assembly tools to the forefront. The genome assembly process can be divided into two steps: draft assembly and assembly improvement (finishing). Despite the fact that automatically performed assembly (or draft assembly) is capable of covering up to 98% of the genome, in most cases, it still contains incorrectly assembled reads. The error rate of the consensus sequence produced at this stage is about 1/2000 bp. A finished genome represents the genome assembly of much higher accuracy (with no gaps or incorrectly assembled areas) and quality ({approx}1 error/10,000 bp), validated through a number of computer and laboratory experiments.

  3. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation.

    Science.gov (United States)

    Pujar, Shashikant; O'Leary, Nuala A; Farrell, Catherine M; Loveland, Jane E; Mudge, Jonathan M; Wallin, Craig; Girón, Carlos G; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; Martin, Fergal J; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Suner, Marie-Marthe; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bruford, Elspeth A; Bult, Carol J; Frankish, Adam; Murphy, Terence; Pruitt, Kim D

    2018-01-04

    The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.

  4. Using SQL Databases for Sequence Similarity Searching and Analysis.

    Science.gov (United States)

    Pearson, William R; Mackey, Aaron J

    2017-09-13

    Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc. Copyright © 2017 John Wiley & Sons, Inc.

  5. Compressing DNA sequence databases with coil

    Directory of Open Access Journals (Sweden)

    Hendy Michael D

    2008-05-01

    Full Text Available Abstract Background Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.

  6. Comprehensive two-dimensional gel protein databases offer a global approach to the analysis of human cells: the transformed amnion cells (AMA) master database and its link to genome DNA sequence data

    DEFF Research Database (Denmark)

    Celis, J E; Gesser, B; Rasmussen, H H

    1990-01-01

    , mitochondria, Golgi, ribosomes, intermediate filaments, microfilaments and microtubules), levels in fetal human tissues, partial protein sequences (containing information on 48 human proteins microsequenced so far), cell cycle-regulated proteins, proteins sensitive to interferons alpha, beta, and gamma, heat...

  7. The PMDB Protein Model Database

    Science.gov (United States)

    Castrignanò, Tiziana; De Meo, Paolo D'Onorio; Cozzetto, Domenico; Talamo, Ivano Giuseppe; Tramontano, Anna

    2006-01-01

    The Protein Model Database (PMDB) is a public resource aimed at storing manually built 3D models of proteins. The database is designed to provide access to models published in the scientific literature, together with validating experimental data. It is a relational database and it currently contains >74 000 models for ∼240 proteins. The system is accessible at and allows predictors to submit models along with related supporting evidence and users to download them through a simple and intuitive interface. Users can navigate in the database and retrieve models referring to the same target protein or to different regions of the same protein. Each model is assigned a unique identifier that allows interested users to directly access the data. PMID:16381873

  8. Database of Interacting Proteins (DIP)

    Data.gov (United States)

    U.S. Department of Health & Human Services — The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent...

  9. The International Nucleotide Sequence Database Collaboration.

    Science.gov (United States)

    Cochrane, Guy; Karsch-Mizrachi, Ilene; Nakamura, Yasukazu

    2011-01-01

    Under the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org), globally comprehensive public domain nucleotide sequence is captured, preserved and presented. The partners of this long-standing collaboration work closely together to provide data formats and conventions that enable consistent data submission to their databases and support regular data exchange around the globe. Clearly defined policy and governance in relation to free access to data and relationships with journal publishers have positioned INSDC databases as a key provider of the scientific record and a core foundation for the global bioinformatics data infrastructure. While growth in sequence data volumes comes no longer as a surprise to INSDC partners, the uptake of next-generation sequencing technology by mainstream science that we have witnessed in recent years brings a step-change to growth, necessarily making a clear mark on INSDC strategy. In this article, we introduce the INSDC, outline data growth patterns and comment on the challenges of increased growth.

  10. BIOPEP database and other programs for processing bioactive peptide sequences.

    Science.gov (United States)

    Minkiewicz, Piotr; Dziuba, Jerzy; Iwaniak, Anna; Dziuba, Marta; Darewicz, Małgorzata

    2008-01-01

    This review presents the potential for application of computational tools in peptide science based on a sample BIOPEP database and program as well as other programs and databases available via the World Wide Web. The BIOPEP application contains a database of biologically active peptide sequences and a program enabling construction of profiles of the potential biological activity of protein fragments, calculation of quantitative descriptors as measures of the value of proteins as potential precursors of bioactive peptides, and prediction of bonds susceptible to hydrolysis by endopeptidases in a protein chain. Other bioactive and allergenic peptide sequence databases are also presented. Programs enabling the construction of binary and multiple alignments between peptide sequences, the construction of sequence motifs attributed to a given type of bioactivity, searching for potential precursors of bioactive peptides, and the prediction of sites susceptible to proteolytic cleavage in protein chains are available via the Internet as are other approaches concerning secondary structure prediction and calculation of physicochemical features based on amino acid sequence. Programs for prediction of allergenic and toxic properties have also been developed. This review explores the possibilities of cooperation between various programs.

  11. Update History of This Database - Yeast Interacting Proteins Database | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us Yeast Interacting Proteins Database Update History of This Database Date Update contents 201...0/03/29 Yeast Interacting Proteins Database English archive site is opened. 2000/12/4 Yeast Interacting Proteins Database...( http://itolab.cb.k.u-tokyo.ac.jp/Y2H/ ) is released. About This Database Database Description... Download License Update History of This Database Site Policy | Contact Us Update History of This Database... - Yeast Interacting Proteins Database | LSDB Archive ...

  12. The Sequenced Angiosperm Genomes and Genome Databases.

    Science.gov (United States)

    Chen, Fei; Dong, Wei; Zhang, Jiawei; Guo, Xinyue; Chen, Junhao; Wang, Zhengjia; Lin, Zhenguo; Tang, Haibao; Zhang, Liangsheng

    2018-01-01

    Angiosperms, the flowering plants, provide the essential resources for human life, such as food, energy, oxygen, and materials. They also promoted the evolution of human, animals, and the planet earth. Despite the numerous advances in genome reports or sequencing technologies, no review covers all the released angiosperm genomes and the genome databases for data sharing. Based on the rapid advances and innovations in the database reconstruction in the last few years, here we provide a comprehensive review for three major types of angiosperm genome databases, including databases for a single species, for a specific angiosperm clade, and for multiple angiosperm species. The scope, tools, and data of each type of databases and their features are concisely discussed. The genome databases for a single species or a clade of species are especially popular for specific group of researchers, while a timely-updated comprehensive database is more powerful for address of major scientific mysteries at the genome scale. Considering the low coverage of flowering plants in any available database, we propose construction of a comprehensive database to facilitate large-scale comparative studies of angiosperm genomes and to promote the collaborative studies of important questions in plant biology.

  13. Specialized microbial databases for inductive exploration of microbial genome sequences

    Directory of Open Access Journals (Sweden)

    Cabau Cédric

    2005-02-01

    Full Text Available Abstract Background The enormous amount of genome sequence data asks for user-oriented databases to manage sequences and annotations. Queries must include search tools permitting function identification through exploration of related objects. Methods The GenoList package for collecting and mining microbial genome databases has been rewritten using MySQL as the database management system. Functions that were not available in MySQL, such as nested subquery, have been implemented. Results Inductive reasoning in the study of genomes starts from "islands of knowledge", centered around genes with some known background. With this concept of "neighborhood" in mind, a modified version of the GenoList structure has been used for organizing sequence data from prokaryotic genomes of particular interest in China. GenoChore http://bioinfo.hku.hk/genochore.html, a set of 17 specialized end-user-oriented microbial databases (including one instance of Microsporidia, Encephalitozoon cuniculi, a member of Eukarya has been made publicly available. These databases allow the user to browse genome sequence and annotation data using standard queries. In addition they provide a weekly update of searches against the world-wide protein sequences data libraries, allowing one to monitor annotation updates on genes of interest. Finally, they allow users to search for patterns in DNA or protein sequences, taking into account a clustering of genes into formal operons, as well as providing extra facilities to query sequences using predefined sequence patterns. Conclusion This growing set of specialized microbial databases organize data created by the first Chinese bacterial genome programs (ThermaList, Thermoanaerobacter tencongensis, LeptoList, with two different genomes of Leptospira interrogans and SepiList, Staphylococcus epidermidis associated to related organisms for comparison.

  14. gEVE: a genome-based endogenous viral element database provides comprehensive viral protein-coding sequences in mammalian genomes.

    Science.gov (United States)

    Nakagawa, So; Takahashi, Mahoko Ueda

    2016-01-01

    In mammals, approximately 10% of genome sequences correspond to endogenous viral elements (EVEs), which are derived from ancient viral infections of germ cells. Although most EVEs have been inactivated, some open reading frames (ORFs) of EVEs obtained functions in the hosts. However, EVE ORFs usually remain unannotated in the genomes, and no databases are available for EVE ORFs. To investigate the function and evolution of EVEs in mammalian genomes, we developed EVE ORF databases for 20 genomes of 19 mammalian species. A total of 736,771 non-overlapping EVE ORFs were identified and archived in a database named gEVE (http://geve.med.u-tokai.ac.jp). The gEVE database provides nucleotide and amino acid sequences, genomic loci and functional annotations of EVE ORFs for all 20 genomes. In analyzing RNA-seq data with the gEVE database, we successfully identified the expressed EVE genes, suggesting that the gEVE database facilitates studies of the genomic analyses of various mammalian species.Database URL: http://geve.med.u-tokai.ac.jp. © The Author(s) 2016. Published by Oxford University Press.

  15. Construction of an integrated database to support genomic sequence analysis

    Energy Technology Data Exchange (ETDEWEB)

    Gilbert, W.; Overbeek, R.

    1994-11-01

    The central goal of this project is to develop an integrated database to support comparative analysis of genomes including DNA sequence data, protein sequence data, gene expression data and metabolism data. In developing the logic-based system GenoBase, a broader integration of available data was achieved due to assistance from collaborators. Current goals are to easily include new forms of data as they become available and to easily navigate through the ensemble of objects described within the database. This report comments on progress made in these areas.

  16. HIV protein sequence hotspots for crosstalk with host hub proteins.

    Directory of Open Access Journals (Sweden)

    Mahdi Sarmady

    Full Text Available HIV proteins target host hub proteins for transient binding interactions. The presence of viral proteins in the infected cell results in out-competition of host proteins in their interaction with hub proteins, drastically affecting cell physiology. Functional genomics and interactome datasets can be used to quantify the sequence hotspots on the HIV proteome mediating interactions with host hub proteins. In this study, we used the HIV and human interactome databases to identify HIV targeted host hub proteins and their host binding partners (H2. We developed a high throughput computational procedure utilizing motif discovery algorithms on sets of protein sequences, including sequences of HIV and H2 proteins. We identified as HIV sequence hotspots those linear motifs that are highly conserved on HIV sequences and at the same time have a statistically enriched presence on the sequences of H2 proteins. The HIV protein motifs discovered in this study are expressed by subsets of H2 host proteins potentially outcompeted by HIV proteins. A large subset of these motifs is involved in cleavage, nuclear localization, phosphorylation, and transcription factor binding events. Many such motifs are clustered on an HIV sequence in the form of hotspots. The sequential positions of these hotspots are consistent with the curated literature on phenotype altering residue mutations, as well as with existing binding site data. The hotspot map produced in this study is the first global portrayal of HIV motifs involved in altering the host protein network at highly connected hub nodes.

  17. Protein structure database search and evolutionary classification.

    Science.gov (United States)

    Yang, Jinn-Moon; Tung, Chi-Hua

    2006-01-01

    As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].

  18. Systematization of the protein sequence diversity in enzymes related to secondary metabolic pathways in plants, in the context of big data biology inspired by the KNApSAcK motorcycle database.

    Science.gov (United States)

    Ikeda, Shun; Abe, Takashi; Nakamura, Yukiko; Kibinge, Nelson; Hirai Morita, Aki; Nakatani, Atsushi; Ono, Naoaki; Ikemura, Toshimichi; Nakamura, Kensuke; Altaf-Ul-Amin, Md; Kanaya, Shigehiko

    2013-05-01

    Biology is increasingly becoming a data-intensive science with the recent progress of the omics fields, e.g. genomics, transcriptomics, proteomics and metabolomics. The species-metabolite relationship database, KNApSAcK Core, has been widely utilized and cited in metabolomics research, and chronological analysis of that research work has helped to reveal recent trends in metabolomics research. To meet the needs of these trends, the KNApSAcK database has been extended by incorporating a secondary metabolic pathway database called Motorcycle DB. We examined the enzyme sequence diversity related to secondary metabolism by means of batch-learning self-organizing maps (BL-SOMs). Initially, we constructed a map by using a big data matrix consisting of the frequencies of all possible dipeptides in the protein sequence segments of plants and bacteria. The enzyme sequence diversity of the secondary metabolic pathways was examined by identifying clusters of segments associated with certain enzyme groups in the resulting map. The extent of diversity of 15 secondary metabolic enzyme groups is discussed. Data-intensive approaches such as BL-SOM applied to big data matrices are needed for systematizing protein sequences. Handling big data has become an inevitable part of biology.

  19. Thermodynamic database for proteins: features and applications.

    Science.gov (United States)

    Gromiha, M Michael; Sarai, Akinori

    2010-01-01

    We have developed a thermodynamic database for proteins and mutants, ProTherm, which is a collection of a large number of thermodynamic data on protein stability along with the sequence and structure information, experimental methods and conditions, and literature information. This is a valuable resource for understanding/predicting the stability of proteins, and it can be accessible at http://www.gibk26.bse.kyutech.ac.jp/jouhou/Protherm/protherm.html . ProTherm has several features including various search, display, and sorting options and visualization tools. We have analyzed the data in ProTherm to examine the relationship among thermodynamics, structure, and function of proteins. We describe the progress on the development of methods for understanding/predicting protein stability, such as (i) relationship between the stability of protein mutants and amino acid properties, (ii) average assignment method, (iii) empirical energy functions, (iv) torsion, distance, and contact potentials, and (v) machine learning techniques. The list of online resources for predicting protein stability has also been provided.

  20. The SWISS-PROT protein sequence data bank: current status.

    OpenAIRE

    Bairoch, A; Boeckmann, B

    1994-01-01

    SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1988, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library. The SWISS-PROT protein sequence data bank consist of sequence entries. Sequence entries are composed of different lines types, each with their own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Databa...

  1. Database Description - Yeast Interacting Proteins Database | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us Yeast Interacting Proteins Database Database Description General information of database Database... name Yeast Interacting Proteins Database Alternative name - DOI 10.18908/lsdba.nbdc00742-000 Creator C...-ken 277-8561 Tel: +81-4-7136-3989 FAX: +81-4-7136-3979 E-mail : Database classif...s cerevisiae Taxonomy ID: 4932 Database description Information on interactions and related information obta...l Acad Sci U S A. 2001 Apr 10;98(8):4569-74. Epub 2001 Mar 13. External Links: Original website information Database

  2. cuticleDB: a relational database of Arthropod cuticular proteins

    Directory of Open Access Journals (Sweden)

    Willis Judith H

    2004-09-01

    Full Text Available Abstract Background The insect exoskeleton or cuticle is a bi-partite composite of proteins and chitin that provides protective, skeletal and structural functions. Little information is available about the molecular structure of this important complex that exhibits a helicoidal architecture. Scores of sequences of cuticular proteins have been obtained from direct protein sequencing, from cDNAs, and from genomic analyses. Most of these cuticular protein sequences contain motifs found only in arthropod proteins. Description cuticleDB is a relational database containing all structural proteins of Arthropod cuticle identified to date. Many come from direct sequencing of proteins isolated from cuticle and from sequences from cDNAs that share common features with these authentic cuticular proteins. It also includes proteins from the Drosophila melanogaster and the Anopheles gambiae genomes, that have been predicted to be cuticular proteins, based on a Pfam motif (PF00379 responsible for chitin binding in Arthropod cuticle. The total number of the database entries is 445: 370 derive from insects, 60 from Crustacea and 15 from Chelicerata. The database can be accessed from our web server at http://bioinformatics.biol.uoa.gr/cuticleDB. Conclusions CuticleDB was primarily designed to contain correct and full annotation of cuticular protein data. The database will be of help to future genome annotators. Users will be able to test hypotheses for the existence of known and also of yet unknown motifs in cuticular proteins. An analysis of motifs may contribute to understanding how proteins contribute to the physical properties of cuticle as well as to the precise nature of their interaction with chitin.

  3. AMYPdb: A database dedicated to amyloid precursor proteins

    Directory of Open Access Journals (Sweden)

    Delamarche Christian

    2008-06-01

    Full Text Available Abstract Background Misfolding and aggregation of proteins into ordered fibrillar structures is associated with a number of severe pathologies, including Alzheimer's disease, prion diseases, and type II diabetes. The rapid accumulation of knowledge about the sequences and structures of these proteins allows using of in silico methods to investigate the molecular mechanisms of their abnormal conformational changes and assembly. However, such an approach requires the collection of accurate data, which are inconveniently dispersed among several generalist databases. Results We therefore created a free online knowledge database (AMYPdb dedicated to amyloid precursor proteins and we have performed large scale sequence analysis of the included data. Currently, AMYPdb integrates data on 31 families, including 1,705 proteins from nearly 600 organisms. It displays links to more than 2,300 bibliographic references and 1,200 3D-structures. A Wiki system is available to insert data into the database, providing a sharing and collaboration environment. We generated and analyzed 3,621 amino acid sequence patterns, reporting highly specific patterns for each amyloid family, along with patterns likely to be involved in protein misfolding and aggregation. Conclusion AMYPdb is a comprehensive online database aiming at the centralization of bioinformatic data regarding all amyloid proteins and their precursors. Our sequence pattern discovery and analysis approach unveiled protein regions of significant interest. AMYPdb is freely accessible 1.

  4. Protein sequence comparison and protein evolution

    Energy Technology Data Exchange (ETDEWEB)

    Pearson, W.R. [Univ. of Virginia, Charlottesville, VA (United States). Dept. of Biochemistry

    1995-12-31

    This tutorial was one of eight tutorials selected to be presented at the Third International Conference on Intelligent Systems for Molecular Biology which was held in the United Kingdom from July 16 to 19, 1995. This tutorial examines how the information conserved during the evolution of a protein molecule can be used to infer reliably homology, and thus a shared proteinfold and possibly a shared active site or function. The authors start by reviewing a geological/evolutionary time scale. Next they look at the evolution of several protein families. During the tutorial, these families will be used to demonstrate that homologous protein ancestry can be inferred with confidence. They also examine different modes of protein evolution and consider some hypotheses that have been presented to explain the very earliest events in protein evolution. The next part of the tutorial will examine the technical aspects of protein sequence comparison. Both optimal and heuristic algorithms and their associated parameters that are used to characterize protein sequence similarities are discussed. Perhaps more importantly, they survey the statistics of local similarity scores, and how these statistics can both be used to improve the selectivity of a search and to evaluate the significance of a match. They them examine distantly related members of three protein families, the serine proteases, the glutathione transferases, and the G-protein-coupled receptors (GCRs). Finally, the discuss how sequence similarity can be used to examine internal repeated or mosaic structures in proteins.

  5. Using relational databases for improved sequence similarity searching and large-scale genomic analyses.

    Science.gov (United States)

    Mackey, Aaron J; Pearson, William R

    2004-10-01

    Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. Relational databases are essential for management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large-scale genomic analyses of homology-related data. This unit describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, how to extend and use seqdb_demo for the storage of sequence similarity search results and making use of various kinds of stored search results to address aspects of comparative genomic analysis.

  6. Full Data of Yeast Interacting Proteins Database (Original Version) - Yeast Interacting Proteins Database | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us Yeast Interacting Proteins Database Full Data of Yeast Interacting Proteins Database (Origin...al Version) Data detail Data name Full Data of Yeast Interacting Proteins Database (Original Version) DOI 10....18908/lsdba.nbdc00742-004 Description of data contents The entire data in the Yeast Interacting Proteins Database...eir interactions are required. Several sources including YPD (Yeast Proteome Database, Costanzo, M. C., Hoga...ematic name in the SGD (Saccharomyces Genome Database; http://www.yeastgenome.org /). Bait gene name The gen

  7. The SWISS-PROT protein sequence data bank

    OpenAIRE

    Bairoch, Amos; Boeckmann, Brigitte

    1992-01-01

    SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1988, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library

  8. Study of event sequence database for a nuclear power domain

    International Nuclear Information System (INIS)

    Kusumi, Yoshiaki

    1998-01-01

    A retrieval engine developed to extract event sequences from an accident information database using a time series retrieval formula expressed with ordered retrieval terms is explored. This engine outputs not only a sequence which completely matches with a time series retrieval formula, but also sequence which approximately matches the formula (fuzzy retrieval). An event sequence database in which records consist of three ordered parameters, namely the causal event, the process and result. Then the database is used to assess the feasibility of this engine and favorable results were obtained. (author)

  9. Improving decoy databases for protein folding algorithms

    KAUST Repository

    Lindsey, Aaron

    2014-01-01

    Copyright © 2014 ACM. Predicting protein structures and simulating protein folding are two of the most important problems in computational biology today. Simulation methods rely on a scoring function to distinguish the native structure (the most energetically stable) from non-native structures. Decoy databases are collections of non-native structures used to test and verify these functions. We present a method to evaluate and improve the quality of decoy databases by adding novel structures and removing redundant structures. We test our approach on 17 different decoy databases of varying size and type and show significant improvement across a variety of metrics. We also test our improved databases on a popular modern scoring function and show that they contain a greater number of native-like structures than the original databases, thereby producing a more rigorous database for testing scoring functions.

  10. The Protein Identifier Cross-Referencing (PICR service: reconciling protein identifiers across multiple source databases

    Directory of Open Access Journals (Sweden)

    Leinonen Rasko

    2007-10-01

    Full Text Available Abstract Background Each major protein database uses its own conventions when assigning protein identifiers. Resolving the various, potentially unstable, identifiers that refer to identical proteins is a major challenge. This is a common problem when attempting to unify datasets that have been annotated with proteins from multiple data sources or querying data providers with one flavour of protein identifiers when the source database uses another. Partial solutions for protein identifier mapping exist but they are limited to specific species or techniques and to a very small number of databases. As a result, we have not found a solution that is generic enough and broad enough in mapping scope to suit our needs. Results We have created the Protein Identifier Cross-Reference (PICR service, a web application that provides interactive and programmatic (SOAP and REST access to a mapping algorithm that uses the UniProt Archive (UniParc as a data warehouse to offer protein cross-references based on 100% sequence identity to proteins from over 70 distinct source databases loaded into UniParc. Mappings can be limited by source database, taxonomic ID and activity status in the source database. Users can copy/paste or upload files containing protein identifiers or sequences in FASTA format to obtain mappings using the interactive interface. Search results can be viewed in simple or detailed HTML tables or downloaded as comma-separated values (CSV or Microsoft Excel (XLS files suitable for use in a local database or a spreadsheet. Alternatively, a SOAP interface is available to integrate PICR functionality in other applications, as is a lightweight REST interface. Conclusion We offer a publicly available service that can interactively map protein identifiers and protein sequences to the majority of commonly used protein databases. Programmatic access is available through a standards-compliant SOAP interface or a lightweight REST interface. The PICR

  11. HCVpro: Hepatitis C virus protein interaction database

    KAUST Repository

    Kwofie, Samuel K.

    2011-12-01

    It is essential to catalog characterized hepatitis C virus (HCV) protein-protein interaction (PPI) data and the associated plethora of vital functional information to augment the search for therapies, vaccines and diagnostic biomarkers. In furtherance of these goals, we have developed the hepatitis C virus protein interaction database (HCVpro) by integrating manually verified hepatitis C virus-virus and virus-human protein interactions curated from literature and databases. HCVpro is a comprehensive and integrated HCV-specific knowledgebase housing consolidated information on PPIs, functional genomics and molecular data obtained from a variety of virus databases (VirHostNet, VirusMint, HCVdb and euHCVdb), and from BIND and other relevant biology repositories. HCVpro is further populated with information on hepatocellular carcinoma (HCC) related genes that are mapped onto their encoded cellular proteins. Incorporated proteins have been mapped onto Gene Ontologies, canonical pathways, Online Mendelian Inheritance in Man (OMIM) and extensively cross-referenced to other essential annotations. The database is enriched with exhaustive reviews on structure and functions of HCV proteins, current state of drug and vaccine development and links to recommended journal articles. Users can query the database using specific protein identifiers (IDs), chromosomal locations of a gene, interaction detection methods, indexed PubMed sources as well as HCVpro, BIND and VirusMint IDs. The use of HCVpro is free and the resource can be accessed via http://apps.sanbi.ac.za/hcvpro/ or http://cbrc.kaust.edu.sa/hcvpro/. © 2011 Elsevier B.V.

  12. Proteomics: Protein Identification Using Online Databases

    Science.gov (United States)

    Eurich, Chris; Fields, Peter A.; Rice, Elizabeth

    2012-01-01

    Proteomics is an emerging area of systems biology that allows simultaneous study of thousands of proteins expressed in cells, tissues, or whole organisms. We have developed this activity to enable high school or college students to explore proteomic databases using mass spectrometry data files generated from yeast proteins in a college laboratory…

  13. Polymorphism Sequence - JSNP | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us JSNP Polymorphism Sequence Data detail Data name Polymorphism Sequence DOI 10.18908/lsdba.nb...dc00114-001 Description of data contents Information on polymorphisms (SNPs and insertions/deletions) and th...se Name database name JSNP_SNP: single nucleotide polymorphism JSNP_InsDel_IND: insertion/deletion JSNP_InsD...ved allele observed 3' Flanking Sequence 3' flanking sequence Offset in Flanking Sequence position of the polymorphism...uence Accession No. accession No. of the sequence for polymorphism screening Offset in Record position of the polymorphism

  14. The VirusBanker database uses a Java program to allow flexible searching through Bunyaviridae sequences

    Directory of Open Access Journals (Sweden)

    Gibbs Mark J

    2008-02-01

    Full Text Available Abstract Background Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. Results The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. Conclusion VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically.

  15. The VirusBanker database uses a Java program to allow flexible searching through Bunyaviridae sequences.

    Science.gov (United States)

    Fourment, Mathieu; Gibbs, Mark J

    2008-02-05

    Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically.

  16. Yeast Interacting Proteins Database: YCL046W, YGL115W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available YCL046W - Dubious open reading frame unlikely to encode a protein, based on availab...ading frame unlikely to encode a protein, based on available experimental and comparative sequence data; par

  17. Core Data of Yeast Interacting Proteins Database (Original Version) - Yeast Interacting Proteins Database | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available y are in the reverse direction. *1 A comprehensive two-hybrid analysis to explore the yeast protein interact...s. 2000 Jan 1;28(1):73-6. *2 The yeast proteome database (YPD) and Caenorhabditis elegans proteome database (WormPD): comprehensive...000 Jan 1;28(1):73-6. *3 A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisia

  18. Dynamics of domain coverage of the protein sequence universe

    Science.gov (United States)

    2012-01-01

    Background The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”. Results Here we suggest that true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain. Conclusions Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of “dark matter”; however, its absolute size increases substantially with the growth of sequence data. PMID:23157439

  19. Dynamics of domain coverage of the protein sequence universe

    Directory of Open Access Journals (Sweden)

    Rekapalli Bhanu

    2012-11-01

    Full Text Available Abstract Background The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”. Results Here we suggest that true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain. Conclusions Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of “dark matter”; however, its absolute size increases substantially with the growth of sequence data.

  20. Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.

    Directory of Open Access Journals (Sweden)

    Qingyu Chen

    Full Text Available First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases.We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All Data are available as described in the supplementary material.

  1. Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.

    Science.gov (United States)

    Chen, Qingyu; Zobel, Justin; Zhang, Xiuzhen; Verspoor, Karin

    2016-01-01

    First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All Data are available as described in the supplementary material.

  2. Applications of Protein Thermodynamic Database for Understanding Protein Mutant Stability and Designing Stable Mutants.

    Science.gov (United States)

    Gromiha, M Michael; Anoosha, P; Huang, Liang-Tsung

    2016-01-01

    Protein stability is the free energy difference between unfolded and folded states of a protein, which lies in the range of 5-25 kcal/mol. Experimentally, protein stability is measured with circular dichroism, differential scanning calorimetry, and fluorescence spectroscopy using thermal and denaturant denaturation methods. These experimental data have been accumulated in the form of a database, ProTherm, thermodynamic database for proteins and mutants. It also contains sequence and structure information of a protein, experimental methods and conditions, and literature information. Different features such as search, display, and sorting options and visualization tools have been incorporated in the database. ProTherm is a valuable resource for understanding/predicting the stability of proteins and it can be accessed at http://www.abren.net/protherm/ . ProTherm has been effectively used to examine the relationship among thermodynamics, structure, and function of proteins. We describe the recent progress on the development of methods for understanding/predicting protein stability, such as (1) general trends on mutational effects on stability, (2) relationship between the stability of protein mutants and amino acid properties, (3) applications of protein three-dimensional structures for predicting their stability upon point mutations, (4) prediction of protein stability upon single mutations from amino acid sequence, and (5) prediction methods for addressing double mutants. A list of online resources for predicting has also been provided.

  3. Sequence modelling and an extensible data model for genomic database

    Energy Technology Data Exchange (ETDEWEB)

    Li, Peter Wei-Der [California Univ., San Francisco, CA (United States); Univ. of California, Berkeley, CA (United States)

    1992-01-01

    The Human Genome Project (HGP) plans to sequence the human genome by the beginning of the next century. It will generate DNA sequences of more than 10 billion bases and complex marker sequences (maps) of more than 100 million markers. All of these information will be stored in database management systems (DBMSs). However, existing data models do not have the abstraction mechanism for modelling sequences and existing DBMS`s do not have operations for complex sequences. This work addresses the problem of sequence modelling in the context of the HGP and the more general problem of an extensible object data model that can incorporate the sequence model as well as existing and future data constructs and operators. First, we proposed a general sequence model that is application and implementation independent. This model is used to capture the sequence information found in the HGP at the conceptual level. In addition, abstract and biological sequence operators are defined for manipulating the modelled sequences. Second, we combined many features of semantic and object oriented data models into an extensible framework, which we called the ``Extensible Object Model``, to address the need of a modelling framework for incorporating the sequence data model with other types of data constructs and operators. This framework is based on the conceptual separation between constructors and constraints. We then used this modelling framework to integrate the constructs for the conceptual sequence model. The Extensible Object Model is also defined with a graphical representation, which is useful as a tool for database designers. Finally, we defined a query language to support this model and implement the query processor to demonstrate the feasibility of the extensible framework and the usefulness of the conceptual sequence model.

  4. Sequence modelling and an extensible data model for genomic database

    Energy Technology Data Exchange (ETDEWEB)

    Li, Peter Wei-Der (California Univ., San Francisco, CA (United States) Lawrence Berkeley Lab., CA (United States))

    1992-01-01

    The Human Genome Project (HGP) plans to sequence the human genome by the beginning of the next century. It will generate DNA sequences of more than 10 billion bases and complex marker sequences (maps) of more than 100 million markers. All of these information will be stored in database management systems (DBMSs). However, existing data models do not have the abstraction mechanism for modelling sequences and existing DBMS's do not have operations for complex sequences. This work addresses the problem of sequence modelling in the context of the HGP and the more general problem of an extensible object data model that can incorporate the sequence model as well as existing and future data constructs and operators. First, we proposed a general sequence model that is application and implementation independent. This model is used to capture the sequence information found in the HGP at the conceptual level. In addition, abstract and biological sequence operators are defined for manipulating the modelled sequences. Second, we combined many features of semantic and object oriented data models into an extensible framework, which we called the Extensible Object Model'', to address the need of a modelling framework for incorporating the sequence data model with other types of data constructs and operators. This framework is based on the conceptual separation between constructors and constraints. We then used this modelling framework to integrate the constructs for the conceptual sequence model. The Extensible Object Model is also defined with a graphical representation, which is useful as a tool for database designers. Finally, we defined a query language to support this model and implement the query processor to demonstrate the feasibility of the extensible framework and the usefulness of the conceptual sequence model.

  5. O-GLYCOBASE version 4.0: a revised database of O-glycosylated proteins

    DEFF Research Database (Denmark)

    Gupta, Ramneek; Birch, Hanne; Rapacki, Krzysztof

    1999-01-01

    O-GLYCBASE is a database of glycoproteins with O-linked glycosylation sites. Entries with at least one experimentally verified O-glycosylation site have been complied from protein sequence databases and literature. Each entry contains information about the glycan involved, the species, sequence, ...

  6. Use of designed sequences in protein structure recognition.

    Science.gov (United States)

    Kumar, Gayatri; Mudgal, Richa; Srinivasan, Narayanaswamy; Sandhya, Sankaran

    2018-05-09

    Knowledge of the protein structure is a pre-requisite for improved understanding of molecular function. The gap in the sequence-structure space has increased in the post-genomic era. Grouping related protein sequences into families can aid in narrowing the gap. In the Pfam database, structure description is provided for part or full-length proteins of 7726 families. For the remaining 52% of the families, information on 3-D structure is not yet available. We use the computationally designed sequences that are intermediately related to two protein domain families, which are already known to share the same fold. These strategically designed sequences enable detection of distant relationships and here, we have employed them for the purpose of structure recognition of protein families of yet unknown structure. We first measured the success rate of our approach using a dataset of protein families of known fold and achieved a success rate of 88%. Next, for 1392 families of yet unknown structure, we made structural assignments for part/full length of the proteins. Fold association for 423 domains of unknown function (DUFs) are provided as a step towards functional annotation. The results indicate that knowledge-based filling of gaps in protein sequence space is a lucrative approach for structure recognition. Such sequences assist in traversal through protein sequence space and effectively function as 'linkers', where natural linkers between distant proteins are unavailable. This article was reviewed by Oliviero Carugo, Christine Orengo and Srikrishna Subramanian.

  7. An update of the DEF database of protein fold class predictions

    DEFF Research Database (Denmark)

    Reczko, Martin; Karras, Dimitris; Bohr, Henrik

    1997-01-01

    An update is given on the Database of Expected Fold classes (DEF) that contains a collection of fold-class predictions made from protein sequences and a mail server that provides new predictions for new sequences. To any given sequence one of 49 fold-classes is chosen to classify the structure re...... related to the sequence with high accuracy. The updated predictions system is developed using data from the new version of the 3D-ALI database of aligned protein structures and thus is giving more reliable and more detailed predictions than the previous DEF system.......An update is given on the Database of Expected Fold classes (DEF) that contains a collection of fold-class predictions made from protein sequences and a mail server that provides new predictions for new sequences. To any given sequence one of 49 fold-classes is chosen to classify the structure...

  8. Novel algorithms for protein sequence analysis

    NARCIS (Netherlands)

    Ye, Kai

    2008-01-01

    Each protein is characterized by its unique sequential order of amino acids, the so-called protein sequence. Biology”s paradigm is that this order of amino acids determines the protein”s architecture and function. In this thesis, we introduce novel algorithms to analyze protein sequences. Chapter 1

  9. Organizing, exploring, and analyzing antibody sequence data: the case for relational-database managers.

    Science.gov (United States)

    Owens, John

    2009-01-01

    Technological advances in the acquisition of DNA and protein sequence information and the resulting onrush of data can quickly overwhelm the scientist unprepared for the volume of information that must be evaluated and carefully dissected to discover its significance. Few laboratories have the luxury of dedicated personnel to organize, analyze, or consistently record a mix of arriving sequence data. A methodology based on a modern relational-database manager is presented that is both a natural storage vessel for antibody sequence information and a conduit for organizing and exploring sequence data and accompanying annotation text. The expertise necessary to implement such a plan is equal to that required by electronic word processors or spreadsheet applications. Antibody sequence projects maintained as independent databases are selectively unified by the relational-database manager into larger database families that contribute to local analyses, reports, interactive HTML pages, or exported to facilities dedicated to sophisticated sequence analysis techniques. Database files are transposable among current versions of Microsoft, Macintosh, and UNIX operating systems.

  10. Tandem Mass Spectrum Sequencing: An Alternative to Database Search Engines in Shotgun Proteomics.

    Science.gov (United States)

    Muth, Thilo; Rapp, Erdmann; Berven, Frode S; Barsnes, Harald; Vaudel, Marc

    2016-01-01

    Protein identification via database searches has become the gold standard in mass spectrometry based shotgun proteomics. However, as the quality of tandem mass spectra improves, direct mass spectrum sequencing gains interest as a database-independent alternative. In this chapter, the general principle of this so-called de novo sequencing is introduced along with pitfalls and challenges of the technique. The main tools available are presented with a focus on user friendly open source software which can be directly applied in everyday proteomic workflows.

  11. THPdb: Database of FDA-approved peptide and protein therapeutics.

    Directory of Open Access Journals (Sweden)

    Salman Sadullah Usmani

    Full Text Available THPdb (http://crdd.osdd.net/raghava/thpdb/ is a manually curated repository of Food and Drug Administration (FDA approved therapeutic peptides and proteins. The information in THPdb has been compiled from 985 research publications, 70 patents and other resources like DrugBank. The current version of the database holds a total of 852 entries, providing comprehensive information on 239 US-FDA approved therapeutic peptides and proteins and their 380 drug variants. The information on each peptide and protein includes their sequences, chemical properties, composition, disease area, mode of activity, physical appearance, category or pharmacological class, pharmacodynamics, route of administration, toxicity, target of activity, etc. In addition, we have annotated the structure of most of the protein and peptides. A number of user-friendly tools have been integrated to facilitate easy browsing and data analysis. To assist scientific community, a web interface and mobile App have also been developed.

  12. The development and application of a Mycoplasma gallisepticum sequence database.

    Science.gov (United States)

    Armour, Natalie K; Laibinis, Victoria A; Collett, Stephen R; Ferguson-Noel, Naola

    2013-01-01

    Molecular analysis was conducted on 36 Mycoplasma gallisepticum DNA extracts from tracheal swab samples of commercial poultry in seven South African provinces between 2009 and 2012. Twelve unique M. gallisepticum genotypes were identified by polymerase chain reaction and sequence analysis of the 16S-23S rRNA intergenic spacer region (IGSR), M. gallisepticum cytadhesin 2 (mgc2), MGA_0319 and gapA genetic regions. The DNA sequences of these genotypes were distinct from those of M. gallisepticum isolates in a database composed of sequences from other countries, vaccine and reference strains. The most prevalent genotype (SA-WT#7) was detected in samples from commercial broilers, broiler breeders and layers in five provinces. South African M. gallisepticum sequences were more similar to those of the live vaccines commercially available in South Africa, but were distinct from that of F strain vaccine, which is not registered for use in South Africa. The IGSR, mgc2 or MGA_0319 sequences of three South African genotypes were identical to those of the ts-11 vaccine strain, necessitating a combination of mgc2 and IGSR targeted sequencing to differentiate South African wild-type genotypes from ts-11 vaccine. To identify and differentiate all 12 wild-types, mgc2, IGSR and MGA_0319 sequencing was required. Sequencing of gapA was least effective at strain differentiation. This research serves as a model for the development of an M. gallisepticum sequence database, and illustrates its application to characterize M. gallisepticum genotypes, select diagnostic tests and better understand the epidemiology of M. gallisepticum.

  13. Integrating protein structures and precomputed genealogies in the Magnum database: Examples with cellular retinoid binding proteins

    Directory of Open Access Journals (Sweden)

    Bradley Michael E

    2006-02-01

    Full Text Available Abstract Background When accurate models for the divergent evolution of protein sequences are integrated with complementary biological information, such as folded protein structures, analyses of the combined data often lead to new hypotheses about molecular physiology. This represents an excellent example of how bioinformatics can be used to guide experimental research. However, progress in this direction has been slowed by the lack of a publicly available resource suitable for general use. Results The precomputed Magnum database offers a solution to this problem for ca. 1,800 full-length protein families with at least one crystal structure. The Magnum deliverables include 1 multiple sequence alignments, 2 mapping of alignment sites to crystal structure sites, 3 phylogenetic trees, 4 inferred ancestral sequences at internal tree nodes, and 5 amino acid replacements along tree branches. Comprehensive evaluations revealed that the automated procedures used to construct Magnum produced accurate models of how proteins divergently evolve, or genealogies, and correctly integrated these with the structural data. To demonstrate Magnum's capabilities, we asked for amino acid replacements requiring three nucleotide substitutions, located at internal protein structure sites, and occurring on short phylogenetic tree branches. In the cellular retinoid binding protein family a site that potentially modulates ligand binding affinity was discovered. Recruitment of cellular retinol binding protein to function as a lens crystallin in the diurnal gecko afforded another opportunity to showcase the predictive value of a browsable database containing branch replacement patterns integrated with protein structures. Conclusion We integrated two areas of protein science, evolution and structure, on a large scale and created a precomputed database, known as Magnum, which is the first freely available resource of its kind. Magnum provides evolutionary and structural

  14. Taxonomic colouring of phylogenetic trees of protein sequences

    Directory of Open Access Journals (Sweden)

    Andrade-Navarro Miguel A

    2006-02-01

    Full Text Available Abstract Background Phylogenetic analyses of protein families are used to define the evolutionary relationships between homologous proteins. The interpretation of protein-sequence phylogenetic trees requires the examination of the taxonomic properties of the species associated to those sequences. However, there is no online tool to facilitate this interpretation, for example, by automatically attaching taxonomic information to the nodes of a tree, or by interactively colouring the branches of a tree according to any combination of taxonomic divisions. This is especially problematic if the tree contains on the order of hundreds of sequences, which, given the accelerated increase in the size of the protein sequence databases, is a situation that is becoming common. Results We have developed PhyloView, a web based tool for colouring phylogenetic trees upon arbitrary taxonomic properties of the species represented in a protein sequence phylogenetic tree. Provided that the tree contains SwissProt, SpTrembl, or GenBank protein identifiers, the tool retrieves the taxonomic information from the corresponding database. A colour picker displays a summary of the findings and allows the user to associate colours to the leaves of the tree according to any number of taxonomic partitions. Then, the colours are propagated to the branches of the tree. Conclusion PhyloView can be used at http://www.ogic.ca/projects/phyloview/. A tutorial, the software with documentation, and GPL licensed source code, can be accessed at the same web address.

  15. PATACSDB—the database of polyA translational attenuators in coding sequences

    Directory of Open Access Journals (Sweden)

    Malgorzata Habich

    2016-02-01

    Full Text Available Recent additions to the repertoire of gene expression regulatory mechanisms are polyadenylate (polyA tracks encoding for poly-lysine runs in protein sequences. Such tracks stall the translation apparatus and induce frameshifting independently of the effects of charged nascent poly-lysine sequence on the ribosome exit channel. As such, they substantially influence the stability of mRNA and the amount of protein produced from a given transcript. Single base changes in these regions are enough to exert a measurable response on both protein and mRNA abundance; this makes each of these sequences a potentially interesting case study for the effects of synonymous mutation, gene dosage balance and natural frameshifting. Here we present PATACSDB, a resource that contain a comprehensive list of polyA tracks from over 250 eukaryotic genomes. Our data is based on the Ensembl genomic database of coding sequences and filtered with algorithm of 12A-1 which selects sequences of polyA tracks with a minimal length of 12 A’s allowing for one mismatched base. The PATACSDB database is accessible at: http://sysbio.ibb.waw.pl/patacsdb. The source code is available at http://github.com/habich/PATACSDB, and it includes the scripts with which the database can be recreated.

  16. SinEx DB: a database for single exon coding sequences in mammalian genomes.

    Science.gov (United States)

    Jorquera, Roddy; Ortiz, Rodrigo; Ossandon, F; Cárdenas, Juan Pablo; Sepúlveda, Rene; González, Carolina; Holmes, David S

    2016-01-01

    Eukaryotic genes are typically interrupted by intragenic, noncoding sequences termed introns. However, some genes lack introns in their coding sequence (CDS) and are generally known as 'single exon genes' (SEGs). In this work, a SEG is defined as a nuclear, protein-coding gene that lacks introns in its CDS. Whereas, many public databases of Eukaryotic multi-exon genes are available, there are only two specialized databases for SEGs. The present work addresses the need for a more extensive and diverse database by creating SinEx DB, a publicly available, searchable database of predicted SEGs from 10 completely sequenced mammalian genomes including human. SinEx DB houses the DNA and protein sequence information of these SEGs and includes their functional predictions (KOG) and the relative distribution of these functions within species. The information is stored in a relational database built with My SQL Server 5.1.33 and the complete dataset of SEG sequences and their functional predictions are available for downloading. SinEx DB can be interrogated by: (i) a browsable phylogenetic schema, (ii) carrying out BLAST searches to the in-house SinEx DB of SEGs and (iii) via an advanced search mode in which the database can be searched by key words and any combination of searches by species and predicted functions. SinEx DB provides a rich source of information for advancing our understanding of the evolution and function of SEGs.Database URL: www.sinex.cl. © The Author(s) 2016. Published by Oxford University Press.

  17. Intelligent Access to Sequence and Structure Databases (IASSD) - an interface for accessing information from major web databases.

    Science.gov (United States)

    Ganguli, Sayak; Gupta, Manoj Kumar; Basu, Protip; Banik, Rahul; Singh, Pankaj Kumar; Vishal, Vineet; Bera, Abhisek Ranjan; Chakraborty, Hirak Jyoti; Das, Sasti Gopal

    2014-01-01

    With the advent of age of big data and advances in high throughput technology accessing data has become one of the most important step in the entire knowledge discovery process. Most users are not able to decipher the query result that is obtained when non specific keywords or a combination of keywords are used. Intelligent access to sequence and structure databases (IASSD) is a desktop application for windows operating system. It is written in Java and utilizes the web service description language (wsdl) files and Jar files of E-utilities of various databases such as National Centre for Biotechnology Information (NCBI) and Protein Data Bank (PDB). Apart from that IASSD allows the user to view protein structure using a JMOL application which supports conditional editing. The Jar file is freely available through e-mail from the corresponding author.

  18. Gene Unprediction with Spurio: A tool to identify spurious protein sequences.

    Science.gov (United States)

    Höps, Wolfram; Jeffryes, Matt; Bateman, Alex

    2018-01-01

    We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation.  Despite the increasing accuracy of gene prediction tools there likely exists a large number of spurious protein predictions in the sequence databases.  We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes.  Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code is available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.

  19. PrionHome: a database of prions and other sequences relevant to prion phenomena.

    Directory of Open Access Journals (Sweden)

    Djamel Harbi

    Full Text Available Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through cooption of other protein copies. Prions contain no necessary nucleic acids, and are important both as both pathogenic agents, and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions, prionoids (i.e., proteins that propagate like prions between individual cells, and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant, and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments. We provide database usage examples for both vertebrate and fungal prion contexts. Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic

  20. PrionHome: a database of prions and other sequences relevant to prion phenomena.

    Science.gov (United States)

    Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M A; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M

    2012-01-01

    Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through cooption of other protein copies. Prions contain no necessary nucleic acids, and are important both as both pathogenic agents, and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts. Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing

  1. SAAS: Short Amino Acid Sequence - A Promising Protein Secondary Structure Prediction Method of Single Sequence

    Directory of Open Access Journals (Sweden)

    Zhou Yuan Wu

    2013-07-01

    Full Text Available In statistical methods of predicting protein secondary structure, many researchers focus on single amino acid frequencies in α-helices, β-sheets, and so on, or the impact near amino acids on an amino acid forming a secondary structure. But the paper considers a short sequence of amino acids (3, 4, 5 or 6 amino acids as integer, and statistics short sequence's probability forming secondary structure. Also, many researchers select low homologous sequences as statistical database. But this paper select whole PDB database. In this paper we propose a strategy to predict protein secondary structure using simple statistical method. Numerical computation shows that, short amino acids sequence as integer to statistics, which can easy see trend of short sequence forming secondary structure, and it will work well to select large statistical database (whole PDB database without considering homologous, and Q3 accuracy is ca. 74% using this paper proposed simple statistical method, but accuracy of others statistical methods is less than 70%.

  2. Tidying up international nucleotide sequence databases: ecological, geographical and sequence quality annotation of its sequences of mycorrhizal fungi.

    Science.gov (United States)

    Tedersoo, Leho; Abarenkov, Kessy; Nilsson, R Henrik; Schüssler, Arthur; Grelet, Gwen-Aëlle; Kohout, Petr; Oja, Jane; Bonito, Gregory M; Veldre, Vilmar; Jairus, Teele; Ryberg, Martin; Larsson, Karl-Henrik; Kõljalg, Urmas

    2011-01-01

    Sequence analysis of the ribosomal RNA operon, particularly the internal transcribed spacer (ITS) region, provides a powerful tool for identification of mycorrhizal fungi. The sequence data deposited in the International Nucleotide Sequence Databases (INSD) are, however, unfiltered for quality and are often poorly annotated with metadata. To detect chimeric and low-quality sequences and assign the ectomycorrhizal fungi to phylogenetic lineages, fungal ITS sequences were downloaded from INSD, aligned within family-level groups, and examined through phylogenetic analyses and BLAST searches. By combining the fungal sequence database UNITE and the annotation and search tool PlutoF, we also added metadata from the literature to these accessions. Altogether 35,632 sequences belonged to mycorrhizal fungi or originated from ericoid and orchid mycorrhizal roots. Of these sequences, 677 were considered chimeric and 2,174 of low read quality. Information detailing country of collection, geographical coordinates, interacting taxon and isolation source were supplemented to cover 78.0%, 33.0%, 41.7% and 96.4% of the sequences, respectively. These annotated sequences are publicly available via UNITE (http://unite.ut.ee/) for downstream biogeographic, ecological and taxonomic analyses. In European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/), the annotated sequences have a special link-out to UNITE. We intend to expand the data annotation to additional genes and all taxonomic groups and functional guilds of fungi.

  3. Protein Function Prediction Based on Sequence and Structure Information

    KAUST Repository

    Smaili, Fatima Z.

    2016-05-25

    The number of available protein sequences in public databases is increasing exponentially. However, a significant fraction of these sequences lack functional annotation which is essential to our understanding of how biological systems and processes operate. In this master thesis project, we worked on inferring protein functions based on the primary protein sequence. In the approach we follow, 3D models are first constructed using I-TASSER. Functions are then deduced by structurally matching these predicted models, using global and local similarities, through three independent enzyme commission (EC) and gene ontology (GO) function libraries. The method was tested on 250 “hard” proteins, which lack homologous templates in both structure and function libraries. The results show that this method outperforms the conventional prediction methods based on sequence similarity or threading. Additionally, our method could be improved even further by incorporating protein-protein interaction information. Overall, the method we use provides an efficient approach for automated functional annotation of non-homologous proteins, starting from their sequence.

  4. Comprehensive Genetic Database of Expressed Sequence Tags for Coccolithophorids

    Science.gov (United States)

    Ranji, Mohammad; Hadaegh, Ahmad R.

    Coccolithophorids are unicellular, marine, golden-brown, single-celled algae (Haptophyta) commonly found in near-surface waters in patchy distributions. They belong to the Phytoplankton family that is known to be responsible for much of the earth reproduction. Phytoplankton, just like plants live based on the energy obtained by Photosynthesis which produces oxygen. Substantial amount of oxygen in the earth's atmosphere is produced by Phytoplankton through Photosynthesis. The single-celled Emiliana Huxleyi is the most commonly known specie of Coccolithophorids and is known for extracting bicarbonate (HCO3) from its environment and producing calcium carbonate to form Coccoliths. Coccolithophorids are one of the world's primary producers, contributing about 15% of the average oceanic phytoplankton biomass to the oceans. They produce elaborate, minute calcite platelets (Coccoliths), covering the cell to form a Coccosphere and supplying up to 60% of the bulk pelagic calcite deposited on the sea floors. In order to understand the genetics of Coccolithophorid and the complexities of their biochemical reactions, we decided to build a database to store a complete profile of these organisms' genomes. Although a variety of such databases currently exist, (http://www.geneservice.co.uk/home/) none have yet been developed to comprehensively address the sequencing efforts underway by the Coccolithophorid research community. This database is called CocooExpress and is available to public (http://bioinfo.csusm.edu) for both data queries and sequence contribution.

  5. PROXiMATE: a database of mutant protein-protein complex thermodynamics and kinetics.

    Science.gov (United States)

    Jemimah, Sherlyn; Yugandhar, K; Michael Gromiha, M

    2017-09-01

    We have developed PROXiMATE, a database of thermodynamic data for more than 6000 missense mutations in 174 heterodimeric protein-protein complexes, supplemented with interaction network data from STRING database, solvent accessibility, sequence, structural and functional information, experimental conditions and literature information. Additional features include complex structure visualization, search and display options, download options and a provision for users to upload their data. The database is freely available at http://www.iitm.ac.in/bioinfo/PROXiMATE/ . The website is implemented in Python, and supports recent versions of major browsers such as IE10, Firefox, Chrome and Opera. gromiha@iitm.ac.in. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  6. Protein - AT Atlas | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available switchLanguage; BLAST Search Image Search Home About Archive Update History Data ..._protein.zip File URL: ftp://ftp.biosciencedbc.jp/archive/at_atlas/LATEST/at_atla...About This Database Database Description Download License Update History of This Database Site Policy | Contact Us Protein - AT Atlas | LSDB Archive ...

  7. EST2Prot: Mapping EST sequences to proteins

    Directory of Open Access Journals (Sweden)

    Lin David M

    2006-03-01

    Full Text Available Abstract Background EST libraries are used in various biological studies, from microarray experiments to proteomic and genetic screens. These libraries usually contain many uncharacterized ESTs that are typically ignored since they cannot be mapped to known genes. Consequently, new discoveries are possibly overlooked. Results We describe a system (EST2Prot that uses multiple elements to map EST sequences to their corresponding protein products. EST2Prot uses UniGene clusters, substring analysis, information about protein coding regions in existing DNA sequences and protein database searches to detect protein products related to a query EST sequence. Gene Ontology terms, Swiss-Prot keywords, and protein similarity data are used to map the ESTs to functional descriptors. Conclusion EST2Prot extends and significantly enriches the popular UniGene mapping by utilizing multiple relations between known biological entities. It produces a mapping between ESTs and proteins in real-time through a simple web-interface. The system is part of the Biozon database and is accessible at http://biozon.org/tools/est/.

  8. Protein structure determination by exhaustive search of Protein Data Bank derived databases.

    Science.gov (United States)

    Stokes-Rees, Ian; Sliz, Piotr

    2010-12-14

    Parallel sequence and structure alignment tools have become ubiquitous and invaluable at all levels in the study of biological systems. We demonstrate the application and utility of this same parallel search paradigm to the process of protein structure determination, benefitting from the large and growing corpus of known structures. Such searches were previously computationally intractable. Through the method of Wide Search Molecular Replacement, developed here, they can be completed in a few hours with the aide of national-scale federated cyberinfrastructure. By dramatically expanding the range of models considered for structure determination, we show that small (less than 12% structural coverage) and low sequence identity (less than 20% identity) template structures can be identified through multidimensional template scoring metrics and used for structure determination. Many new macromolecular complexes can benefit significantly from such a technique due to the lack of known homologous protein folds or sequences. We demonstrate the effectiveness of the method by determining the structure of a full-length p97 homologue from Trichoplusia ni. Example cases with the MHC/T-cell receptor complex and the EmoB protein provide systematic estimates of minimum sequence identity, structure coverage, and structural similarity required for this method to succeed. We describe how this structure-search approach and other novel computationally intensive workflows are made tractable through integration with the US national computational cyberinfrastructure, allowing, for example, rapid processing of the entire Structural Classification of Proteins protein fragment database.

  9. License - Yeast Interacting Proteins Database | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us Yeast Interacting Proteins Database License to Use This Database Last updated : 2010/02/15 You may use this database...nal License described below. The Standard License specifies the license terms regarding the use of this database... and the requirements you must follow in using this database. The Additional ...the Standard License. Standard License The Standard License for this database is the license specified in th...e Creative Commons Attribution-Share Alike 2.1 Japan . If you use data from this database

  10. The HMMER Web Server for Protein Sequence Similarity Search.

    Science.gov (United States)

    Prakash, Ananth; Jeffryes, Matt; Bateman, Alex; Finn, Robert D

    2017-12-08

    Protein sequence similarity search is one of the most commonly used bioinformatics methods for identifying evolutionarily related proteins. In general, sequences that are evolutionarily related share some degree of similarity, and sequence-search algorithms use this principle to identify homologs. The requirement for a fast and sensitive sequence search method led to the development of the HMMER software, which in the latest version (v3.1) uses a combination of sophisticated acceleration heuristics and mathematical and computational optimizations to enable the use of profile hidden Markov models (HMMs) for sequence analysis. The HMMER Web server provides a common platform by linking the HMMER algorithms to databases, thereby enabling the search for homologs, as well as providing sequence and functional annotation by linking external databases. This unit describes three basic protocols and two alternate protocols that explain how to use the HMMER Web server using various input formats and user defined parameters. © 2017 by John Wiley & Sons, Inc. Copyright © 2017 John Wiley & Sons, Inc.

  11. Repeat Sequence Proteins as Matrices for Nanocomposites

    Energy Technology Data Exchange (ETDEWEB)

    Drummy, L.; Koerner, H; Phillips, D; McAuliffe, J; Kumar, M; Farmer, B; Vaia, R; Naik, R

    2009-01-01

    Recombinant protein-inorganic nanocomposites comprised of exfoliated Na+ montmorillonite (MMT) in a recombinant protein matrix based on silk-like and elastin-like amino acid motifs (silk elastin-like protein (SELP)) were formed via a solution blending process. Charged residues along the protein backbone are shown to dominate long-range interactions, whereas the SELP repeat sequence leads to local protein/MMT compatibility. Up to a 50% increase in room temperature modulus and a comparable decrease in high temperature coefficient of thermal expansion occur for cast films containing 2-10 wt.% MMT.

  12. ARCPHdb: A comprehensive protein database for SF1 and SF2 helicase from archaea.

    Science.gov (United States)

    Moukhtar, Mirna; Chaar, Wafi; Abdel-Razzak, Ziad; Khalil, Mohamad; Taha, Samir; Chamieh, Hala

    2017-01-01

    Superfamily 1 and Superfamily 2 helicases, two of the largest helicase protein families, play vital roles in many biological processes including replication, transcription and translation. Study of helicase proteins in the model microorganisms of archaea have largely contributed to the understanding of their function, architecture and assembly. Based on a large phylogenomics approach, we have identified and classified all SF1 and SF2 protein families in ninety five sequenced archaea genomes. Here we developed an online webserver linked to a specialized protein database named ARCPHdb to provide access for SF1 and SF2 helicase families from archaea. ARCPHdb was implemented using MySQL relational database. Web interfaces were developed using Netbeans. Data were stored according to UniProt accession numbers, NCBI Ref Seq ID, PDB IDs and Entrez Databases. A user-friendly interactive web interface has been developed to browse, search and download archaeal helicase protein sequences, their available 3D structure models, and related documentation available in the literature provided by ARCPHdb. The database provides direct links to matching external databases. The ARCPHdb is the first online database to compile all protein information on SF1 and SF2 helicase from archaea in one platform. This database provides essential resource information for all researchers interested in the field. Copyright © 2016 Elsevier Ltd. All rights reserved.

  13. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units

    Directory of Open Access Journals (Sweden)

    Maskell Douglas L

    2009-05-01

    Full Text Available Abstract Background The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware. Findings Our CUDASW++ implementation (benchmarked on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card provides a significant performance improvement compared to other publicly available implementations, such as SWPS3, CBESW, SW-CUDA, and NCBI-BLAST. CUDASW++ supports query sequences of length up to 59K and for query sequences ranging in length from 144 to 5,478 in Swiss-Prot release 56.6, the single-GPU version achieves an average performance of 9.509 GCUPS with a lowest performance of 9.039 GCUPS and a highest performance of 9.660 GCUPS, and the dual-GPU version achieves an average performance of 14.484 GCUPS with a lowest performance of 10.660 GCUPS and a highest performance of 16.087 GCUPS. Conclusion CUDASW++ is publicly available open-source software. It provides a significant performance improvement for Smith-Waterman-based protein sequence database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs.

  14. Protein - Trypanosomes Database | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available switchLanguage; BLAST Search Image Search Home About Archive Update History Data List Contact us Trypanoso...nhibitor of the protein. Data file File name: trypanosome.zip File URL: ftp://ftp....biosciencedbc.jp/archive/trypanosome/LATEST/trypanosome.zip File size: 1.4 KB Simple search URL http://togo...db.biosciencedbc.jp/togodb/view/trypanosome#en Data acquisition method - Data analysis method - Number of da...ndelian inheritance in Man ) map Location of the gene on a chromosome or its chromosome number pdb PDB ID (P

  15. SHEETSPAIR: A Database of Amino Acid Pairs in Protein Sheet Structures

    Directory of Open Access Journals (Sweden)

    Ning Zhang

    2007-10-01

    Full Text Available Within folded strands of a protein, amino acids (AAs on every adjacent two strands form a pair of AAs. To explore the interactions between strands in a protein sheet structure, we have established an Internet-accessible relational database named SheetsPairs based on SQL Server 2000. The database has collected AAs pairs in proteins with detailed information. Furthermore, it utilizes a non-freetext database structure to store protein sequences and a specific database table with a unique number to store strands, which provides more searching options and rapid and accurate access to data queries. An IIS web server has been set up for data retrieval through a custom web interface, which enables complex data queries. Also searchable are parallel or anti-parallel folded strands and the list of strands in a specified protein.

  16. Improving decoy databases for protein folding algorithms

    KAUST Repository

    Lindsey, Aaron; Yeh, Hsin-Yi (Cindy); Wu, Chih-Peng; Thomas, Shawna; Amato, Nancy M.

    2014-01-01

    energetically stable) from non-native structures. Decoy databases are collections of non-native structures used to test and verify these functions. We present a method to evaluate and improve the quality of decoy databases by adding novel structures and removing

  17. MSDB: A Comprehensive Database of Simple Sequence Repeats.

    Science.gov (United States)

    Avvaru, Akshay Kumar; Saxena, Saketh; Sowpati, Divya Tej; Mishra, Rakesh Kumar

    2017-06-01

    Microsatellites, also known as Simple Sequence Repeats (SSRs), are short tandem repeats of 1-6 nt motifs present in all genomes, particularly eukaryotes. Besides their usefulness as genome markers, SSRs have been shown to perform important regulatory functions, and variations in their length at coding regions are linked to several disorders in humans. Microsatellites show a taxon-specific enrichment in eukaryotic genomes, and some may be functional. MSDB (Microsatellite Database) is a collection of >650 million SSRs from 6,893 species including Bacteria, Archaea, Fungi, Plants, and Animals. This database is by far the most exhaustive resource to access and analyze SSR data of multiple species. In addition to exploring data in a customizable tabular format, users can view and compare the data of multiple species simultaneously using our interactive plotting system. MSDB is developed using the Django framework and MySQL. It is freely available at http://tdb.ccmb.res.in/msdb. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  18. HCVpro: Hepatitis C virus protein interaction database

    KAUST Repository

    Kwofie, Samuel K.; Schaefer, Ulf; Sundararajan, Vijayaraghava Seshadri; Bajic, Vladimir B.; Christoffels, Alan G.

    2011-01-01

    It is essential to catalog characterized hepatitis C virus (HCV) protein-protein interaction (PPI) data and the associated plethora of vital functional information to augment the search for therapies, vaccines and diagnostic biomarkers

  19. PARPs database: A LIMS systems for protein-protein interaction data mining or laboratory information management system

    Directory of Open Access Journals (Sweden)

    Picard-Cloutier Aude

    2007-12-01

    Full Text Available Abstract Background In the "post-genome" era, mass spectrometry (MS has become an important method for the analysis of proteins and the rapid advancement of this technique, in combination with other proteomics methods, results in an increasing amount of proteome data. This data must be archived and analysed using specialized bioinformatics tools. Description We herein describe "PARPs database," a data analysis and management pipeline for liquid chromatography tandem mass spectrometry (LC-MS/MS proteomics. PARPs database is a web-based tool whose features include experiment annotation, protein database searching, protein sequence management, as well as data-mining of the peptides and proteins identified. Conclusion Using this pipeline, we have successfully identified several interactions of biological significance between PARP-1 and other proteins, namely RFC-1, 2, 3, 4 and 5.

  20. Genome cluster database. A sequence family analysis platform for Arabidopsis and rice.

    Science.gov (United States)

    Horan, Kevin; Lauricha, Josh; Bailey-Serres, Julia; Raikhel, Natasha; Girke, Thomas

    2005-05-01

    The genome-wide protein sequences from Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) spp. japonica were clustered into families using sequence similarity and domain-based clustering. The two fundamentally different methods resulted in separate cluster sets with complementary properties to compensate the limitations for accurate family analysis. Functional names for the identified families were assigned with an efficient computational approach that uses the description of the most common molecular function gene ontology node within each cluster. Subsequently, multiple alignments and phylogenetic trees were calculated for the assembled families. All clustering results and their underlying sequences were organized in the Web-accessible Genome Cluster Database (http://bioinfo.ucr.edu/projects/GCD) with rich interactive and user-friendly sequence family mining tools to facilitate the analysis of any given family of interest for the plant science community. An automated clustering pipeline ensures current information for future updates in the annotations of the two genomes and clustering improvements. The analysis allowed the first systematic identification of family and singlet proteins present in both organisms as well as those restricted to one of them. In addition, the established Web resources for mining these data provide a road map for future studies of the composition and structure of protein families between the two species.

  1. GuiTope: an application for mapping random-sequence peptides to protein sequences.

    Science.gov (United States)

    Halperin, Rebecca F; Stafford, Phillip; Emery, Jack S; Navalkar, Krupa Arun; Johnston, Stephen Albert

    2012-01-03

    Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task. GuiTope provides a graphical user interface for aligning peptide sequences to protein sequences. All alignment parameters are accessible to the user including the ability to specify the amino acid frequency in the peptide library; these frequencies often differ significantly from those assumed by popular alignment programs. It also includes a novel feature to align di-peptide inversions, which we have found improves the accuracy of antibody epitope prediction from peptide microarray data and shows utility in analyzing phage display datasets. Finally, GuiTope can randomly select peptides from a given library to estimate a null distribution of scores and calculate statistical significance. GuiTope provides a convenient method for comparing selected peptide sequences to protein sequences, including flexible alignment parameters, novel alignment features, ability to search a database, and statistical significance of results. The software is available as an executable (for PC) at http://www.immunosignature.com/software and ongoing updates and source code will be available at sourceforge.net.

  2. GuiTope: an application for mapping random-sequence peptides to protein sequences

    Directory of Open Access Journals (Sweden)

    Halperin Rebecca F

    2012-01-01

    Full Text Available Abstract Background Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task. Results GuiTope provides a graphical user interface for aligning peptide sequences to protein sequences. All alignment parameters are accessible to the user including the ability to specify the amino acid frequency in the peptide library; these frequencies often differ significantly from those assumed by popular alignment programs. It also includes a novel feature to align di-peptide inversions, which we have found improves the accuracy of antibody epitope prediction from peptide microarray data and shows utility in analyzing phage display datasets. Finally, GuiTope can randomly select peptides from a given library to estimate a null distribution of scores and calculate statistical significance. Conclusions GuiTope provides a convenient method for comparing selected peptide sequences to protein sequences, including flexible alignment parameters, novel alignment features, ability to search a database, and statistical significance of results. The software is available as an executable (for PC at http://www.immunosignature.com/software and ongoing updates and source code will be available at sourceforge.net.

  3. Viral Genome DataBase: storing and analyzing genes and proteins from complete viral genomes.

    Science.gov (United States)

    Hiscock, D; Upton, C

    2000-05-01

    The Viral Genome DataBase (VGDB) contains detailed information of the genes and predicted protein sequences from 15 completely sequenced genomes of large (&100 kb) viruses (2847 genes). The data that is stored includes DNA sequence, protein sequence, GenBank and user-entered notes, molecular weight (MW), isoelectric point (pI), amino acid content, A + T%, nucleotide frequency, dinucleotide frequency and codon use. The VGDB is a mySQL database with a user-friendly JAVA GUI. Results of queries can be easily sorted by any of the individual parameters. The software and additional figures and information are available at http://athena.bioc.uvic.ca/genomes/index.html .

  4. Nonlinear deterministic structures and the randomness of protein sequences

    CERN Document Server

    Huang Yan Zhao

    2003-01-01

    To clarify the randomness of protein sequences, we make a detailed analysis of a set of typical protein sequences representing each structural classes by using nonlinear prediction method. No deterministic structures are found in these protein sequences and this implies that they behave as random sequences. We also give an explanation to the controversial results obtained in previous investigations.

  5. SwissPalm: Protein Palmitoylation database.

    Science.gov (United States)

    Blanc, Mathieu; David, Fabrice; Abrami, Laurence; Migliozzi, Daniel; Armand, Florence; Bürgi, Jérôme; van der Goot, Françoise Gisou

    2015-01-01

    Protein S-palmitoylation is a reversible post-translational modification that regulates many key biological processes, although the full extent and functions of protein S-palmitoylation remain largely unexplored. Recent developments of new chemical methods have allowed the establishment of palmitoyl-proteomes of a variety of cell lines and tissues from different species.  As the amount of information generated by these high-throughput studies is increasing, the field requires centralization and comparison of this information. Here we present SwissPalm ( http://swisspalm.epfl.ch), our open, comprehensive, manually curated resource to study protein S-palmitoylation. It currently encompasses more than 5000 S-palmitoylated protein hits from seven species, and contains more than 500 specific sites of S-palmitoylation. SwissPalm also provides curated information and filters that increase the confidence in true positive hits, and integrates predictions of S-palmitoylated cysteine scores, orthologs and isoform multiple alignments. Systems analysis of the palmitoyl-proteome screens indicate that 10% or more of the human proteome is susceptible to S-palmitoylation. Moreover, ontology and pathway analyses of the human palmitoyl-proteome reveal that key biological functions involve this reversible lipid modification. Comparative analysis finally shows a strong crosstalk between S-palmitoylation and other post-translational modifications. Through the compilation of data and continuous updates, SwissPalm will provide a powerful tool to unravel the global importance of protein S-palmitoylation.

  6. Protein sequence annotation in the genome era: the annotation concept of SWISS-PROT+TREMBL.

    Science.gov (United States)

    Apweiler, R; Gateau, A; Contrino, S; Martin, M J; Junker, V; O'Donovan, C; Lang, F; Mitaritonna, N; Kappus, S; Bairoch, A

    1997-01-01

    SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy and high level of integration with other databases. Ongoing genome sequencing projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT. TREMBL consists of computer-annotated entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT. While TREMBL is already of immense value, its computer-generated annotation does not match the quality of SWISS-PROTs. The main difference is in the protein functional information attached to sequences. With this in mind, we are dedicating substantial effort to develop and apply computer methods to enhance the functional information attached to TREMBL entries.

  7. cDNA sequence quality data - Budding yeast cDNA sequencing project | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us Budding yeast cDNA sequencing project cDNA sequence quality data Data detail Data name cDNA sequence quality... data DOI 10.18908/lsdba.nbdc00838-003 Description of data contents Phred's quality score. P...tion Download License Update History of This Database Site Policy | Contact Us cDNA sequence quality

  8. TOPDOM: database of conservatively located domains and motifs in proteins.

    Science.gov (United States)

    Varga, Julia; Dobson, László; Tusnády, Gábor E

    2016-09-01

    The TOPDOM database-originally created as a collection of domains and motifs located consistently on the same side of the membranes in α-helical transmembrane proteins-has been updated and extended by taking into consideration consistently localized domains and motifs in globular proteins, too. By taking advantage of the recently developed CCTOP algorithm to determine the type of a protein and predict topology in case of transmembrane proteins, and by applying a thorough search for domains and motifs as well as utilizing the most up-to-date version of all source databases, we managed to reach a 6-fold increase in the size of the whole database and a 2-fold increase in the number of transmembrane proteins. TOPDOM database is available at http://topdom.enzim.hu The webpage utilizes the common Apache, PHP5 and MySQL software to provide the user interface for accessing and searching the database. The database itself is generated on a high performance computer. tusnady.gabor@ttk.mta.hu Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  9. Protein - TP Atlas | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available switchLanguage; BLAST Search Image Search Home About Archive Update History Data ...p_atlas_protein.zip File URL: ftp://ftp.biosciencedbc.jp/archive/tp_atlas/LATEST/...story of This Database Site Policy | Contact Us Protein - TP Atlas | LSDB Archive ...

  10. How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis.

    Science.gov (United States)

    Tian, Pengfei; Best, Robert B

    2017-10-17

    Quantifying the relationship between protein sequence and structure is key to understanding the protein universe. A fundamental measure of this relationship is the total number of amino acid sequences that can fold to a target protein structure, known as the "sequence capacity," which has been suggested as a proxy for how designable a given protein fold is. Although sequence capacity has been extensively studied using lattice models and theory, numerical estimates for real protein structures are currently lacking. In this work, we have quantitatively estimated the sequence capacity of 10 proteins with a variety of different structures using a statistical model based on residue-residue co-evolution to capture the variation of sequences from the same protein family. Remarkably, we find that even for the smallest protein folds, such as the WW domain, the number of foldable sequences is extremely large, exceeding the Avogadro constant. In agreement with earlier theoretical work, the calculated sequence capacity is positively correlated with the size of the protein, or better, the density of contacts. This allows the absolute sequence capacity of a given protein to be approximately predicted from its structure. On the other hand, the relative sequence capacity, i.e., normalized by the total number of possible sequences, is an extremely tiny number and is strongly anti-correlated with the protein length. Thus, although there may be more foldable sequences for larger proteins, it will be much harder to find them. Lastly, we have correlated the evolutionary age of proteins in the CATH database with their sequence capacity as predicted by our model. The results suggest a trade-off between the opposing requirements of high designability and the likelihood of a novel fold emerging by chance. Published by Elsevier Inc.

  11. The DExH/D protein family database.

    Science.gov (United States)

    Jankowsky, E; Jankowsky, A

    2000-01-01

    DExH/D proteins are essential for all aspects of cellular RNA metabolism and processing, in the replication of many viruses and in DNA replication. DExH/D proteins are subject to current biological, biochemical and biophysical research which provides a continuous wealth of data. The DExH/D protein family database compiles this information and makes it available over the WWW (http://www.columbia.edu/ ej67/dbhome.htm ). The database can be fully searched by text based queries, facilitating fast access to specific information about this important class of enzymes.

  12. The SBASE protein domain library, release 8.0: a collection of annotated protein sequence segments.

    Science.gov (United States)

    Murvai, J; Vlahovicek, K; Barta, E; Pongor, S

    2001-01-01

    SBASE 8.0 is the eighth release of the SBASE library of protein domain sequences that contains 294 898 annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to most major sequence databases and sequence pattern collections. The entries are clustered into over 2005 statistically validated domain groups (SBASE-A) and 595 non-validated groups (SBASE-B), provided with several WWW-based search and browsing facilities for online use. A domain-search facility was developed, based on non-parametric pattern recognition methods, including artificial neural networks. SBASE 8.0 is freely available by anonymous 'ftp' file transfer from ftp.icgeb.trieste.it. Automated searching of SBASE can be carried out with the WWW servers http://www.icgeb.trieste.it/sbase/ and http://sbase.abc. hu/sbase/.

  13. VaProS: a database-integration approach for protein/genome information retrieval

    KAUST Repository

    Gojobori, Takashi; Ikeo, Kazuho; Katayama, Yukie; Kawabata, Takeshi; Kinjo, Akira R.; Kinoshita, Kengo; Kwon, Yeondae; Migita, Ohsuke; Mizutani, Hisashi; Muraoka, Masafumi; Nagata, Koji; Omori, Satoshi; Sugawara, Hideaki; Yamada, Daichi; Yura, Kei

    2016-01-01

    Life science research now heavily relies on all sorts of databases for genome sequences, transcription, protein three-dimensional (3D) structures, protein–protein interactions, phenotypes and so forth. The knowledge accumulated by all the omics research is so vast that a computer-aided search of data is now a prerequisite for starting a new study. In addition, a combinatory search throughout these databases has a chance to extract new ideas and new hypotheses that can be examined by wet-lab experiments. By virtually integrating the related databases on the Internet, we have built a new web application that facilitates life science researchers for retrieving experts’ knowledge stored in the databases and for building a new hypothesis of the research target. This web application, named VaProS, puts stress on the interconnection between the functional information of genome sequences and protein 3D structures, such as structural effect of the gene mutation. In this manuscript, we present the notion of VaProS, the databases and tools that can be accessed without any knowledge of database locations and data formats, and the power of search exemplified in quest of the molecular mechanisms of lysosomal storage disease. VaProS can be freely accessed at http://p4d-info.nig.ac.jp/vapros/.

  14. VaProS: a database-integration approach for protein/genome information retrieval

    KAUST Repository

    Gojobori, Takashi

    2016-12-24

    Life science research now heavily relies on all sorts of databases for genome sequences, transcription, protein three-dimensional (3D) structures, protein–protein interactions, phenotypes and so forth. The knowledge accumulated by all the omics research is so vast that a computer-aided search of data is now a prerequisite for starting a new study. In addition, a combinatory search throughout these databases has a chance to extract new ideas and new hypotheses that can be examined by wet-lab experiments. By virtually integrating the related databases on the Internet, we have built a new web application that facilitates life science researchers for retrieving experts’ knowledge stored in the databases and for building a new hypothesis of the research target. This web application, named VaProS, puts stress on the interconnection between the functional information of genome sequences and protein 3D structures, such as structural effect of the gene mutation. In this manuscript, we present the notion of VaProS, the databases and tools that can be accessed without any knowledge of database locations and data formats, and the power of search exemplified in quest of the molecular mechanisms of lysosomal storage disease. VaProS can be freely accessed at http://p4d-info.nig.ac.jp/vapros/.

  15. UbiProt: a database of ubiquitylated proteins

    Directory of Open Access Journals (Sweden)

    Kondratieva Ekaterina V

    2007-04-01

    Full Text Available Abstract Background Post-translational protein modification with ubiquitin, or ubiquitylation, is one of the hottest topics in a modern biology due to a dramatic impact on diverse metabolic pathways and involvement in pathogenesis of severe human diseases. A great number of eukaryotic proteins was found to be ubiquitylated. However, data about particular ubiquitylated proteins are rather disembodied. Description To fill a general need for collecting and systematizing experimental data concerning ubiquitylation we have developed a new resource, UbiProt Database, a knowledgebase of ubiquitylated proteins. The database contains retrievable information about overall characteristics of a particular protein, ubiquitylation features, related ubiquitylation and de-ubiquitylation machinery and literature references reflecting experimental evidence of ubiquitylation. UbiProt is available at http://ubiprot.org.ru for free. Conclusion UbiProt Database is a public resource offering comprehensive information on ubiquitylated proteins. The resource can serve as a general reference source both for researchers in ubiquitin field and those who deal with particular ubiquitylated proteins which are of their interest. Further development of the UbiProt Database is expected to be of common interest for research groups involved in studies of the ubiquitin system.

  16. Rapid identification of sequences for orphan enzymes to power accurate protein annotation.

    Directory of Open Access Journals (Sweden)

    Kevin R Ramkissoon

    Full Text Available The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.

  17. Rapid Identification of Sequences for Orphan Enzymes to Power Accurate Protein Annotation

    Science.gov (United States)

    Ojha, Sunil; Watson, Douglas S.; Bomar, Martha G.; Galande, Amit K.; Shearer, Alexander G.

    2013-01-01

    The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the “back catalog” of enzymology – “orphan enzymes,” those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme “back catalog” is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology’s “back catalog” another powerful tool to drive accurate genome annotation. PMID:24386392

  18. ZifBASE: a database of zinc finger proteins and associated resources

    Directory of Open Access Journals (Sweden)

    Punetha Ankita

    2009-09-01

    databases like UniprotKB, PDB, ModBase and Protein Model Portal and PubMed for making it more informative. Conclusion A database is established to maintain the information of the sequence features, including the class, framework, number of fingers, residues, position, recognition site and physio-chemical properties (molecular weight, isoelectric point of both natural and engineered zinc finger proteins and dissociation constant of few. ZifBASE can provide more effective and efficient way of accessing the zinc finger protein sequences and their target binding sites with the links to their three-dimensional structures. All the data and functions are available at the advanced web-based search interface http://web.iitd.ac.in/~sundar/zifbase.

  19. Minimotif Miner 3.0: database expansion and significantly improved reduction of false-positive predictions from consensus sequences.

    Science.gov (United States)

    Mi, Tian; Merlin, Jerlin Camilus; Deverasetty, Sandeep; Gryk, Michael R; Bill, Travis J; Brooks, Andrew W; Lee, Logan Y; Rathnayake, Viraj; Ross, Christian A; Sargeant, David P; Strong, Christy L; Watts, Paula; Rajasekaran, Sanguthevar; Schiller, Martin R

    2012-01-01

    Minimotif Miner (MnM available at http://minimotifminer.org or http://mnm.engr.uconn.edu) is an online database for identifying new minimotifs in protein queries. Minimotifs are short contiguous peptide sequences that have a known function in at least one protein. Here we report the third release of the MnM database which has now grown 60-fold to approximately 300,000 minimotifs. Since short minimotifs are by their nature not very complex we also summarize a new set of false-positive filters and linear regression scoring that vastly enhance minimotif prediction accuracy on a test data set. This online database can be used to predict new functions in proteins and causes of disease.

  20. Database-driven primary analysis of raw sequencing data

    DEFF Research Database (Denmark)

    2014-01-01

    The present invention relates to methods for identifying the source of a biological sequence containing sample from raw sequencing reads. The method may be used to identify the source of unknown DNA and can be used for diagnostic, biodefense, food safety and quality, and hygiene applications...

  1. Effect of the sequence data deluge on the performance of methods for detecting protein functional residues.

    Science.gov (United States)

    Garrido-Martín, Diego; Pazos, Florencio

    2018-02-27

    The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this was never assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. In this work we evaluate how the growth and change with time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.

  2. Yeast Interacting Proteins Database: YNR006W, YHL002W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available ling Golgi proteins, forming lumenal membranes and sorting ubiquitinated proteins destined for degradation; ..., as well as for recycling of Golgi proteins and formation of lumenal membranes Rows with this prey as prey ...1p; required for recycling Golgi proteins, forming lumenal membranes and sorting ubiquitinated proteins dest...degradation, as well as for recycling of Golgi proteins and formation of lumenal membranes

  3. A machine learning approach for the identification of odorant binding proteins from sequence-derived properties

    Directory of Open Access Journals (Sweden)

    Suganthan PN

    2007-09-01

    Full Text Available Abstract Background Odorant binding proteins (OBPs are believed to shuttle odorants from the environment to the underlying odorant receptors, for which they could potentially serve as odorant presenters. Although several sequence based search methods have been exploited for protein family prediction, less effort has been devoted to the prediction of OBPs from sequence data and this area is more challenging due to poor sequence identity between these proteins. Results In this paper, we propose a new algorithm that uses Regularized Least Squares Classifier (RLSC in conjunction with multiple physicochemical properties of amino acids to predict odorant-binding proteins. The algorithm was applied to the dataset derived from Pfam and GenDiS database and we obtained overall prediction accuracy of 97.7% (94.5% and 98.4% for positive and negative classes respectively. Conclusion Our study suggests that RLSC is potentially useful for predicting the odorant binding proteins from sequence-derived properties irrespective of sequence similarity. Our method predicts 92.8% of 56 odorant binding proteins non-homologous to any protein in the swissprot database and 97.1% of the 414 independent dataset proteins, suggesting the usefulness of RLSC method for facilitating the prediction of odorant binding proteins from sequence information.

  4. Yeast Interacting Proteins Database: YLR447C, YOR047C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available xpression; interacts with protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein Sp...; interacts with protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein Spt15p; act

  5. Yeast Interacting Proteins Database: YGR013W, YKL012W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available tion U1 snRNP protein involved in splicing, interacts with the branchpoint-binding protein during the formation of the second commitm... PRP40 U1 snRNP protein involved in splicing, interacts with the branchpoint-binding protein during the form...ation of the second commitment complex Rows with this prey as prey (1) Rows with

  6. The PANTHER database of protein families, subfamilies, functions and pathways

    OpenAIRE

    Mi, Huaiyu; Lazareva-Ulitsky, Betty; Loo, Rozina; Kejariwal, Anish; Vandergriff, Jody; Rabkin, Steven; Guo, Nan; Muruganujan, Anushya; Doremieux, Olivier; Campbell, Michael J.; Kitano, Hiroaki; Thomas, Paul D.

    2004-01-01

    PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function (ontology terms and pathways), as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences. The l...

  7. Yeast Interacting Proteins Database: YOR047C, YKL038W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available racts with protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein Spt15p; acts as a...Bait description Protein involved in control of glucose-regulated gene expression; interacts with protein kinase Snf1p, glucose senso...rs Snf3p and Rgt2p, and TATA-binding protein Spt15p; acts as a regulator of the tra

  8. Yeast Interacting Proteins Database: YFR049W, YOR047C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein Spt15p; acts as a regulator... (0) YOR047C STD1 Protein involved in control of glucose-regulated gene expression; interacts with protein kinase Snf1p, glucose sens...ors Snf3p and Rgt2p, and TATA-binding protein Spt15p; ac

  9. Yeast Interacting Proteins Database: YHL002W, YNR006W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available ycling of Golgi proteins and formation of lumenal membranes Rows with this bait as bait (1) Rows with this b...required for recycling Golgi proteins, forming lumenal membranes and sorting ubiquitinated proteins destined...on, as well as for recycling of Golgi proteins and formation of lumenal membranes...ith Hse1p; required for recycling Golgi proteins, forming lumenal membranes and sorting ubiquitinated protei

  10. MultitaskProtDB: a database of multitasking proteins.

    Science.gov (United States)

    Hernández, Sergio; Ferragut, Gabriela; Amela, Isaac; Perez-Pons, JosepAntoni; Piñol, Jaume; Mozo-Villarias, Angel; Cedano, Juan; Querol, Enrique

    2014-01-01

    We have compiled MultitaskProtDB, available online at http://wallace.uab.es/multitask, to provide a repository where the many multitasking proteins found in the literature can be stored. Multitasking or moonlighting is the capability of some proteins to execute two or more biological functions. Usually, multitasking proteins are experimentally revealed by serendipity. This ability of proteins to perform multitasking functions helps us to understand one of the ways used by cells to perform many complex functions with a limited number of genes. Even so, the study of this phenomenon is complex because, among other things, there is no database of moonlighting proteins. The existence of such a tool facilitates the collection and dissemination of these important data. This work reports the database, MultitaskProtDB, which is designed as a friendly user web page containing >288 multitasking proteins with their NCBI and UniProt accession numbers, canonical and additional biological functions, monomeric/oligomeric states, PDB codes when available and bibliographic references. This database also serves to gain insight into some characteristics of multitasking proteins such as frequencies of the different pairs of functions, phylogenetic conservation and so forth.

  11. CAZymes Analysis Toolkit (CAT): web service for searching and analyzing carbohydrate-active enzymes in a newly sequenced organism using CAZy database.

    Science.gov (United States)

    Park, Byung H; Karpinets, Tatiana V; Syed, Mustafa H; Leuze, Michael R; Uberbacher, Edward C

    2010-12-01

    The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire nonredundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit, and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.

  12. DB-PABP: a database of polyanion-binding proteins.

    Science.gov (United States)

    Fang, Jianwen; Dong, Yinghua; Salamat-Miller, Nazila; Middaugh, C Russell

    2008-01-01

    The interactions between polyanions (PAs) and polyanion-binding proteins (PABPs) have been found to play significant roles in many essential biological processes including intracellular organization, transport and protein folding. Furthermore, many neurodegenerative disease-related proteins are PABPs. Thus, a better understanding of PA/PABP interactions may not only enhance our understandings of biological systems but also provide new clues to these deadly diseases. The literature in this field is widely scattered, suggesting the need for a comprehensive and searchable database of PABPs. The DB-PABP is a comprehensive, manually curated and searchable database of experimentally characterized PABPs. It is freely available and can be accessed online at http://pabp.bcf.ku.edu/DB_PABP/. The DB-PABP was implemented as a MySQL relational database. An interactive web interface was created using Java Server Pages (JSP). The search page of the database is organized into a main search form and a section for utilities. The main search form enables custom searches via four menus: protein names, polyanion names, the source species of the proteins and the methods used to discover the interactions. Available utilities include a commonality matrix, a function of listing PABPs by the number of interacting polyanions and a string search for author surnames. The DB-PABP is maintained at the University of Kansas. We encourage users to provide feedback and submit new data and references.

  13. AgdbNet – antigen sequence database software for bacterial typing

    Directory of Open Access Journals (Sweden)

    Maiden Martin CJ

    2006-06-01

    Full Text Available Abstract Background Bacterial typing schemes based on the sequences of genes encoding surface antigens require databases that provide a uniform, curated, and widely accepted nomenclature of the variants identified. Due to the differences in typing schemes, imposed by the diversity of genes targeted, creating these databases has typically required the writing of one-off code to link the database to a web interface. Here we describe agdbNet, widely applicable web database software that facilitates simultaneous BLAST querying of multiple loci using either nucleotide or peptide sequences. Results Databases are described by XML files that are parsed by a Perl CGI script. Each database can have any number of loci, which may be defined by nucleotide and/or peptide sequences. The software is currently in use on at least five public databases for the typing of Neisseria meningitidis, Campylobacter jejuni and Streptococcus equi and can be set up to query internal isolate tables or suitably-configured external isolate databases, such as those used for multilocus sequence typing. The style of the resulting website can be fully configured by modifying stylesheets and through the use of customised header and footer files that surround the output of the script. Conclusion The software provides a rapid means of setting up customised Internet antigen sequence databases. The flexible configuration options enable typing schemes with differing requirements to be accommodated.

  14. Yeast Interacting Proteins Database: YPR103W, YOR047C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available tein involved in control of glucose-regulated gene expression; interacts with protein kinase Snf1p, glucose sensors...gulated gene expression; interacts with protein kinase Snf1p, glucose sensors Snf

  15. Yeast Interacting Proteins Database: YNL152W, YMR032W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available YNL152W INN1 Essential protein that associates with the contractile actomyosin ring... Bait description Essential protein that associates with the contractile actomyosin ring, required for ingre

  16. Yeast Interacting Proteins Database: YGL145W, YNL258C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available ripheral membrane protein required for Golgi-to-ER retrograde traffic; component ... membrane protein required for Golgi-to-ER retrograde traffic; component of the ER target site that interact

  17. Yeast Interacting Proteins Database: YNL258C, YGL145W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available YNL258C DSL1 Peripheral membrane protein required for Golgi-to-ER retrograde traffi...t description Peripheral membrane protein required for Golgi-to-ER retrograde traffic; component of the ER t

  18. Yeast Interacting Proteins Database: YNL216W, YLR453C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available YNL216W RAP1 DNA-binding protein involved in either activation or repression of transcription, depending...NA-binding protein involved in either activation or repression of transcription, depending on binding site c

  19. Yeast Interacting Proteins Database: YOL006C, YMR233W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available fusion protein localizes to the cytoplasm, nucleus and nucleolus Rows with this prey as prey (1) Rows with t...on protein localizes to the cytoplasm, nucleus and nucleolus Rows with this prey

  20. Yeast Interacting Proteins Database: YKL002W, YFL034C-B [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available integral membrane proteins into lumenal vesicles of multivesicular bodies, and for delivery of newly synthes...ntegral membrane proteins into lumenal vesicles of multivesicular bodies, and for delivery of newly synthesi

  1. Yeast Interacting Proteins Database: YJR091C, YKL002W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available g of integral membrane proteins into lumenal vesicles of multivesicular bodies, and for delivery of newly sy... integral membrane proteins into lumenal vesicles of multivesicular bodies, and for delivery of newly synthe

  2. Yeast Interacting Proteins Database: YGL237C, YOR047C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available ene expression; interacts with protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding prote... expression; interacts with protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein

  3. Yeast Interacting Proteins Database: YOR358W, YOR047C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available ; interacts with protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein Spt15p; act...rotein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein Spt15p; acts as a regulator o

  4. Yeast Interacting Proteins Database: YKL002W, YOR047C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available ene expression; interacts with protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding prote...xpression; interacts with protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein Sp

  5. Yeast Interacting Proteins Database: YGL127C, YOR047C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available ith protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein Spt15p; acts as a regula...rotein involved in control of glucose-regulated gene expression; interacts with protein kinase Snf1p, glucose sensors

  6. Taxonomic evaluation of selected Ganoderma species and database sequence validation

    Directory of Open Access Journals (Sweden)

    Suldbold Jargalmaa

    2017-07-01

    Full Text Available Species in the genus Ganoderma include several ecologically important and pathogenic fungal species whose medicinal and economic value is substantial. Due to the highly similar morphological features within the Ganoderma, identification of species has relied heavily on DNA sequencing using BLAST searches, which are only reliable if the GenBank submissions are accurately labeled. In this study, we examined 113 specimens collected from 1969 to 2016 from various regions in Korea using morphological features and multigene analysis (internal transcribed spacer, translation elongation factor 1-α, and the second largest subunit of RNA polymerase II. These specimens were identified as four Ganoderma species: G. sichuanense, G. cf. adspersum, G. cf. applanatum, and G. cf. gibbosum. With the exception of G. sichuanense, these species were difficult to distinguish based solely on morphological features. However, phylogenetic analysis at three different loci yielded concordant phylogenetic information, and supported the four species distinctions with high bootstrap support. A survey of over 600 Ganoderma sequences available on GenBank revealed that 65% of sequences were either misidentified or ambiguously labeled. Here, we suggest corrected annotations for GenBank sequences based on our phylogenetic validation and provide updated global distribution patterns for these Ganoderma species.

  7. Taxonomic evaluation of selected Ganoderma species and database sequence validation

    Science.gov (United States)

    Jargalmaa, Suldbold; Eimes, John A.; Park, Myung Soo; Park, Jae Young; Oh, Seung-Yoon

    2017-01-01

    Species in the genus Ganoderma include several ecologically important and pathogenic fungal species whose medicinal and economic value is substantial. Due to the highly similar morphological features within the Ganoderma, identification of species has relied heavily on DNA sequencing using BLAST searches, which are only reliable if the GenBank submissions are accurately labeled. In this study, we examined 113 specimens collected from 1969 to 2016 from various regions in Korea using morphological features and multigene analysis (internal transcribed spacer, translation elongation factor 1-α, and the second largest subunit of RNA polymerase II). These specimens were identified as four Ganoderma species: G. sichuanense, G. cf. adspersum, G. cf. applanatum, and G. cf. gibbosum. With the exception of G. sichuanense, these species were difficult to distinguish based solely on morphological features. However, phylogenetic analysis at three different loci yielded concordant phylogenetic information, and supported the four species distinctions with high bootstrap support. A survey of over 600 Ganoderma sequences available on GenBank revealed that 65% of sequences were either misidentified or ambiguously labeled. Here, we suggest corrected annotations for GenBank sequences based on our phylogenetic validation and provide updated global distribution patterns for these Ganoderma species. PMID:28761785

  8. Filling and mining the reactive metabolite target protein database.

    Science.gov (United States)

    Hanzlik, Robert P; Fang, Jianwen; Koen, Yakov M

    2009-04-15

    The post-translational modification of proteins is a well-known endogenous mechanism for regulating protein function and activity. Cellular proteins are also susceptible to post-translational modification by xenobiotic agents that possess, or whose metabolites possess, significant electrophilic character. Such non-physiological modifications to endogenous proteins are sometimes benign, but in other cases they are strongly associated with, and are presumed to cause, lethal cytotoxic consequences via necrosis and/or apoptosis. The Reactive Metabolite Target Protein Database (TPDB) is a searchable, freely web-accessible (http://tpdb.medchem.ku.edu:8080/protein_database/) resource that attempts to provide a comprehensive, up-to-date listing of known reactive metabolite target proteins. In this report we characterize the TPDB by reviewing briefly how the information it contains came to be known. We also compare its information to that provided by other types of "-omics" studies relevant to toxicology, and we illustrate how bioinformatic analysis of target proteins may help to elucidate mechanisms of cytotoxic responses to reactive metabolites.

  9. Medicago PhosphoProtein Database: a repository for Medicago truncatula phosphoprotein data

    Directory of Open Access Journals (Sweden)

    Christopher M. Rose

    2012-06-01

    Full Text Available The ability of legume crops to fix atmospheric nitrogen via a symbiotic association with soil rhizobia makes them an essential component of many agricultural systems. Initiation of this symbiosis requires protein phosphorylation-mediated signaling in response to rhizobial signals named Nod factors. Medicago truncatula (Medicago is the model system for studying legume biology, making the study of its phosphoproteome essential. Here, we describe the Medicago Phosphoprotein Database (http://phospho.medicago.wisc.edu, a repository built to house phosphoprotein, phosphopeptide, and phosphosite data specific to Medicago. Currently, the Medicago Phosphoprotein Database holds 3,457 unique phosphopeptides that contain 3,404 non-redundant sites of phosphorylation on 829 proteins. Through the web-based interface, users are allowed to browse identified proteins or search for proteins of interest. Furthermore, we allow users to conduct BLAST searches of the database using both peptide sequences and phosphorylation motifs as queries. The data contained within the database are available for download to be investigated at the user’s discretion. The Medicago Phosphoprotein Database will be updated continually with novel phosphoprotein and phosphopeptide identifications, with the intent of constructing an unparalleled compendium of large-scale Medicago phosphorylation data.

  10. Gene composer: database software for protein construct design, codon engineering, and gene synthesis.

    Science.gov (United States)

    Lorimer, Don; Raymond, Amy; Walchli, John; Mixon, Mark; Barrow, Adrienne; Wallace, Ellen; Grice, Rena; Burgin, Alex; Stewart, Lance

    2009-04-21

    To improve efficiency in high throughput protein structure determination, we have developed a database software package, Gene Composer, which facilitates the information-rich design of protein constructs and their codon engineered synthetic gene sequences. With its modular workflow design and numerous graphical user interfaces, Gene Composer enables researchers to perform all common bio-informatics steps used in modern structure guided protein engineering and synthetic gene engineering. An interactive Alignment Viewer allows the researcher to simultaneously visualize sequence conservation in the context of known protein secondary structure, ligand contacts, water contacts, crystal contacts, B-factors, solvent accessible area, residue property type and several other useful property views. The Construct Design Module enables the facile design of novel protein constructs with altered N- and C-termini, internal insertions or deletions, point mutations, and desired affinity tags. The modifications can be combined and permuted into multiple protein constructs, and then virtually cloned in silico into defined expression vectors. The Gene Design Module uses a protein-to-gene algorithm that automates the back-translation of a protein amino acid sequence into a codon engineered nucleic acid gene sequence according to a selected codon usage table with minimal codon usage threshold, defined G:C% content, and desired sequence features achieved through synonymous codon selection that is optimized for the intended expression system. The gene-to-oligo algorithm of the Gene Design Module plans out all of the required overlapping oligonucleotides and mutagenic primers needed to synthesize the desired gene constructs by PCR, and for physically cloning them into selected vectors by the most popular subcloning strategies. We present a complete description of Gene Composer functionality, and an efficient PCR-based synthetic gene assembly procedure with mis-match specific endonuclease

  11. Gene Composer: database software for protein construct design, codon engineering, and gene synthesis

    Directory of Open Access Journals (Sweden)

    Mixon Mark

    2009-04-01

    Full Text Available Abstract Background To improve efficiency in high throughput protein structure determination, we have developed a database software package, Gene Composer, which facilitates the information-rich design of protein constructs and their codon engineered synthetic gene sequences. With its modular workflow design and numerous graphical user interfaces, Gene Composer enables researchers to perform all common bio-informatics steps used in modern structure guided protein engineering and synthetic gene engineering. Results An interactive Alignment Viewer allows the researcher to simultaneously visualize sequence conservation in the context of known protein secondary structure, ligand contacts, water contacts, crystal contacts, B-factors, solvent accessible area, residue property type and several other useful property views. The Construct Design Module enables the facile design of novel protein constructs with altered N- and C-termini, internal insertions or deletions, point mutations, and desired affinity tags. The modifications can be combined and permuted into multiple protein constructs, and then virtually cloned in silico into defined expression vectors. The Gene Design Module uses a protein-to-gene algorithm that automates the back-translation of a protein amino acid sequence into a codon engineered nucleic acid gene sequence according to a selected codon usage table with minimal codon usage threshold, defined G:C% content, and desired sequence features achieved through synonymous codon selection that is optimized for the intended expression system. The gene-to-oligo algorithm of the Gene Design Module plans out all of the required overlapping oligonucleotides and mutagenic primers needed to synthesize the desired gene constructs by PCR, and for physically cloning them into selected vectors by the most popular subcloning strategies. Conclusion We present a complete description of Gene Composer functionality, and an efficient PCR-based synthetic gene

  12. Next-Generation Sequencing for Binary Protein-Protein Interactions

    Directory of Open Access Journals (Sweden)

    Bernhard eSuter

    2015-12-01

    Full Text Available The yeast two-hybrid (Y2H system exploits host cell genetics in order to display binary protein-protein interactions (PPIs via defined and selectable phenotypes. Numerous improvements have been made to this method, adapting the screening principle for diverse applications, including drug discovery and the scale-up for proteome wide interaction screens in human and other organisms. Here we discuss a systematic workflow and analysis scheme for screening data generated by Y2H and related assays that includes high-throughput selection procedures, readout of comprehensive results via next-generation sequencing (NGS, and the interpretation of interaction data via quantitative statistics. The novel assays and tools will serve the broader scientific community to harness the power of NGS technology to address PPI networks in health and disease. We discuss examples of how this next-generation platform can be applied to address specific questions in diverse fields of biology and medicine.

  13. License - Budding yeast cDNA sequencing project | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us Budding yeast cDNA sequencing project License to Use This Database Last updated : 2010/02/15 You may use this databas...ional License described below. The Standard License specifies the license terms regarding the use of this database... and the requirements you must follow in using this database. The Additiona...n the Standard License. Standard License The Standard License for this database is the license specified in ...the Creative Commons Attribution-Share Alike 2.1 Japan . If you use data from this database

  14. Yeast Interacting Proteins Database: YOR302W, YOR047C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available rol of glucose-regulated gene expression; interacts with protein kinase Snf1p, glucose sensors Snf3p and Rgt...tein kinase Snf1p, glucose sensors Snf3p and Rgt2p, and TATA-binding protein Spt1

  15. Yeast Interacting Proteins Database: YMR280C, YOR047C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available olved in control of glucose-regulated gene expression; interacts with protein kinase Snf1p, glucose sensor... glucose-regulated gene expression; interacts with protein kinase Snf1p, glucose sensors Snf3p and Rgt2p, an

  16. Yeast Interacting Proteins Database: YNL258C, YKR022C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available YNL258C DSL1 Peripheral membrane protein required for Golgi-to-ER retrograde traffi...equired for Golgi-to-ER retrograde traffic; component of the ER target site that interacts with coatomer, th...it ORF YNL258C Bait gene name DSL1 Bait description Peripheral membrane protein r

  17. Yeast Interacting Proteins Database: YDL239C, YDR273W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available of a Don1p-containing structure at the leading edge of the prospore membrane via interaction with spindle p...it as prey (1) YDR273W DON1 Meiosis-specific component of the spindle pole body, part of the leading... edge protein (LEP) coat, forms a ring-like structure at the leading edge of the prospore...ption Protein required for spore wall formation, thought to mediate assembly of a Don1p-containing structure at the leading...description Meiosis-specific component of the spindle pole body, part of the leading edge protein (LEP) coat

  18. PseudoMLSA: a database for multigenic sequence analysis of Pseudomonas species

    Directory of Open Access Journals (Sweden)

    Lalucat Jorge

    2010-04-01

    Full Text Available Abstract Background The genus Pseudomonas comprises more than 100 species of environmental, clinical, agricultural, and biotechnological interest. Although, the recommended method for discriminating bacterial species is DNA-DNA hybridisation, alternative techniques based on multigenic sequence analysis are becoming a common practice in bacterial species discrimination studies. Since there is not a general criterion for determining which genes are more useful for species resolution; the number of strains and genes analysed is increasing continuously. As a result, sequences of different genes are dispersed throughout several databases. This sequence information needs to be collected in a common database, in order to be useful for future identification-based projects. Description The PseudoMLSA Database is a comprehensive database of multiple gene sequences from strains of Pseudomonas species. The core of the database is composed of selected gene sequences from all Pseudomonas type strains validly assigned to the genus through 2008. The database is aimed to be useful for MultiLocus Sequence Analysis (MLSA procedures, for the identification and characterisation of any Pseudomonas bacterial isolate. The sequences are available for download via a direct connection to the National Center for Biotechnology Information (NCBI. Additionally, the database includes an online BLAST interface for flexible nucleotide queries and similarity searches with the user's datasets, and provides a user-friendly output for easily parsing, navigating, and analysing BLAST results. Conclusions The PseudoMLSA database amasses strains and sequence information of validly described Pseudomonas species, and allows free querying of the database via a user-friendly, web-based interface available at http://www.uib.es/microbiologiaBD/Welcome.html. The web-based platform enables easy retrieval at strain or gene sequence information level; including references to published peer

  19. Yeast Interacting Proteins Database: YOR117W, YJL184W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available c stress response, telomere uncapping and elongation, transcription; component of the EKC/KEOPS protein comp...n proposed to be involved in the modification of N-linked oligosaccharides, osmotic stress response, telomere uncap

  20. Yeast Interacting Proteins Database: YER081W, YDR105C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available YDR105C TMS1 Vacuolar membrane protein of unknown function that is conserved in mammals; predicted to contai...tion that is conserved in mammals; predicted to contain eleven transmembrane heli

  1. Yeast Interacting Proteins Database: YKL002W, YLR423C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available integral membrane proteins into lumenal vesicles of multivesicular bodies, and for delivery of newly synthes... into lumenal vesicles of multivesicular bodies, and for delivery of newly synthesized vacuolar enzymes to t

  2. Yeast Interacting Proteins Database: YKL002W, YDL165W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available integral membrane proteins into lumenal vesicles of multivesicular bodies, and for delivery of newly synthes...ins into lumenal vesicles of multivesicular bodies, and for delivery of newly synthesized vacuolar enzymes t

  3. EST-PAC a web package for EST annotation and protein sequence prediction

    Directory of Open Access Journals (Sweden)

    Strahm Yvan

    2006-10-01

    Full Text Available Abstract With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequences tags (EST from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC a web oriented multi-platform software package for expressed sequences tag (EST annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: 1 searching local or remote biological databases for sequence similarities using Blast services, 2 predicting protein coding sequence from EST data and, 3 annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results and, management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics.

  4. Simple sequence proteins in prokaryotic proteomes

    Directory of Open Access Journals (Sweden)

    Ramachandran Srinivasan

    2006-06-01

    Full Text Available Abstract Background The structural and functional features associated with Simple Sequence Proteins (SSPs are non-globularity, disease states, signaling and post-translational modification. SSPs are also an important source of genetic and possibly phenotypic variation. Analysis of 249 prokaryotic proteomes offers a new opportunity to examine the genomic properties of SSPs. Results SSPs are a minority but they grow with proteome size. This relationship is exhibited across species varying in genomic GC, mutational bias, life style, and pathogenicity. Their proportion in each proteome is strongly influenced by genomic base compositional bias. In most species simple duplications is favoured, but in a few cases such as Mycobacteria, large families of duplications occur. Amino acid preference in SSPs exhibits a trend towards low cost of biosynthesis. In SSPs and in non-SSPs, Alanine, Glycine, Leucine, and Valine are abundant in species widely varying in genomic GC whereas Isoleucine and Lysine are rich only in organisms with low genomic GC. Arginine is abundant in SSPs of two species and in the non-SSPs of Xanthomonas oryzae. Asparagine is abundant only in SSPs of low GC species. Aspartic acid is abundant only in the non-SSPs of Halobacterium sp NRC1. The abundance of Serine in SSPs of 62 species extends over a broader range compared to that of non-SSPs. Threonine(T is abundant only in SSPs of a couple of species. SSPs exhibit preferential association with Cell surface, Cell membrane and Transport functions and a negative association with Metabolism. Mesophiles and Thermophiles display similar ranges in the content of SSPs. Conclusion Although SSPs are a minority, the genomic forces of base compositional bias and duplications influence their growth and pattern in each species. The preferences and abundance of amino acids are governed by low biosynthetic cost, evolutionary age and base composition of codons. Abundance of charged amino acids Arginine

  5. A protein domain interaction interface database: InterPare

    Directory of Open Access Journals (Sweden)

    Lee Jungsul

    2005-08-01

    Full Text Available Abstract Background Most proteins function by interacting with other molecules. Their interaction interfaces are highly conserved throughout evolution to avoid undesirable interactions that lead to fatal disorders in cells. Rational drug discovery includes computational methods to identify the interaction sites of lead compounds to the target molecules. Identifying and classifying protein interaction interfaces on a large scale can help researchers discover drug targets more efficiently. Description We introduce a large-scale protein domain interaction interface database called InterPare http://interpare.net. It contains both inter-chain (between chains interfaces and intra-chain (within chain interfaces. InterPare uses three methods to detect interfaces: 1 the geometric distance method for checking the distance between atoms that belong to different domains, 2 Accessible Surface Area (ASA, a method for detecting the buried region of a protein that is detached from a solvent when forming multimers or complexes, and 3 the Voronoi diagram, a computational geometry method that uses a mathematical definition of interface regions. InterPare includes visualization tools to display protein interior, surface, and interaction interfaces. It also provides statistics such as the amino acid propensities of queried protein according to its interior, surface, and interface region. The atom coordinates that belong to interface, surface, and interior regions can be downloaded from the website. Conclusion InterPare is an open and public database server for protein interaction interface information. It contains the large-scale interface data for proteins whose 3D-structures are known. As of November 2004, there were 10,583 (Geometric distance, 10,431 (ASA, and 11,010 (Voronoi diagram entries in the Protein Data Bank (PDB containing interfaces, according to the above three methods. In the case of the geometric distance method, there are 31,620 inter-chain domain

  6. Quality standards for DNA sequence variation databases to improve clinical management under development in Australia

    Directory of Open Access Journals (Sweden)

    B. Bennetts

    2014-09-01

    Full Text Available Despite the routine nature of comparing sequence variations identified during clinical testing to database records, few databases meet quality requirements for clinical diagnostics. To address this issue, The Royal College of Pathologists of Australasia (RCPA in collaboration with the Human Genetics Society of Australasia (HGSA, and the Human Variome Project (HVP is developing standards for DNA sequence variation databases intended for use in the Australian clinical environment. The outputs of this project will be promoted to other health systems and accreditation bodies by the Human Variome Project to support the development of similar frameworks in other jurisdictions.

  7. The Porcelain Crab Transcriptome and PCAD, the Porcelain Crab Microarray and Sequence Database

    Energy Technology Data Exchange (ETDEWEB)

    Tagmount, Abderrahmane; Wang, Mei; Lindquist, Erika; Tanaka, Yoshihiro; Teranishi, Kristen S.; Sunagawa, Shinichi; Wong, Mike; Stillman, Jonathon H.

    2010-01-27

    Background: With the emergence of a completed genome sequence of the freshwater crustacean Daphnia pulex, construction of genomic-scale sequence databases for additional crustacean sequences are important for comparative genomics and annotation. Porcelain crabs, genus Petrolisthes, have been powerful crustacean models for environmental and evolutionary physiology with respect to thermal adaptation and understanding responses of marine organisms to climate change. Here, we present a large-scale EST sequencing and cDNA microarray database project for the porcelain crab Petrolisthes cinctipes. Methodology/Principal Findings: A set of ~;;30K unique sequences (UniSeqs) representing ~;;19K clusters were generated from ~;;98K high quality ESTs from a set of tissue specific non-normalized and mixed-tissue normalized cDNA libraries from the porcelain crab Petrolisthes cinctipes. Homology for each UniSeq was assessed using BLAST, InterProScan, GO and KEGG database searches. Approximately 66percent of the UniSeqs had homology in at least one of the databases. All EST and UniSeq sequences along with annotation results and coordinated cDNA microarray datasets have been made publicly accessible at the Porcelain Crab Array Database (PCAD), a feature-enriched version of the Stanford and Longhorn Array Databases.Conclusions/Significance: The EST project presented here represents the third largest sequencing effort for any crustacean, and the largest effort for any crab species. Our assembly and clustering results suggest that our porcelain crab EST data set is equally diverse to the much larger EST set generated in the Daphnia pulex genome sequencing project, and thus will be an important resource to the Daphnia research community. Our homology results support the pancrustacea hypothesis and suggest that Malacostraca may be ancestral to Branchiopoda and Hexapoda. Our results also suggest that our cDNA microarrays cover as much of the transcriptome as can reasonably be captured in

  8. Exploring Protein Function Using the Saccharomyces Genome Database.

    Science.gov (United States)

    Wong, Edith D

    2017-01-01

    Elucidating the function of individual proteins will help to create a comprehensive picture of cell biology, as well as shed light on human disease mechanisms, possible treatments, and cures. Due to its compact genome, and extensive history of experimentation and annotation, the budding yeast Saccharomyces cerevisiae is an ideal model organism in which to determine protein function. This information can then be leveraged to infer functions of human homologs. Despite the large amount of research and biological data about S. cerevisiae, many proteins' functions remain unknown. Here, we explore ways to use the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org ) to predict the function of proteins and gain insight into their roles in various cellular processes.

  9. Protein Function Prediction Based on Sequence and Structure Information

    KAUST Repository

    Smaili, Fatima Z.

    2016-01-01

    operate. In this master thesis project, we worked on inferring protein functions based on the primary protein sequence. In the approach we follow, 3D models are first constructed using I-TASSER. Functions are then deduced by structurally matching

  10. Yeast Interacting Proteins Database: YDL239C, YLR423C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available of a Don1p-containing structure at the leading edge of the prospore membrane via interaction with spindle p...cription Protein required for spore wall formation, thought to mediate assembly of a Don1p-containing structure at the leading

  11. Yeast Interacting Proteins Database: YDL239C, YPL070W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available of a Don1p-containing structure at the leading edge of the prospore membrane via interaction with spindle p...cription Protein required for spore wall formation, thought to mediate assembly of a Don1p-containing structure at the leading

  12. Yeast Interacting Proteins Database: YDL239C, YML042W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available of a Don1p-containing structure at the leading edge of the prospore membrane via interaction with spindle p...iption Protein required for spore wall formation, thought to mediate assembly of a Don1p-containing structure at the leading

  13. Yeast Interacting Proteins Database: YDR176W, YDL239C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available a Don1p-containing structure at the leading edge of the prospore membrane via interaction with spindle pole...ining structure at the leading edge of the prospore membrane via interaction with spindle pole body componen...DY3 Prey description Protein required for spore wall formation, thought to mediate assembly of a Don1p-conta

  14. Yeast Interacting Proteins Database: YDL239C, YKL103C [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available of a Don1p-containing structure at the leading edge of the prospore membrane via interaction with spindle p...ait description Protein required for spore wall formation, thought to mediate assembly of a Don1p-containing structure at the leading

  15. Identification of Anhydrobiosis-related Genes from an Expressed Sequence Tag Database in the Cryptobiotic Midge Polypedilum vanderplanki (Diptera; Chironomidae)*

    Science.gov (United States)

    Cornette, Richard; Kanamori, Yasushi; Watanabe, Masahiko; Nakahara, Yuichi; Gusev, Oleg; Mitsumasu, Kanako; Kadono-Okuda, Keiko; Shimomura, Michihiko; Mita, Kazuei; Kikawada, Takahiro; Okuda, Takashi

    2010-01-01

    Some organisms are able to survive the loss of almost all their body water content, entering a latent state known as anhydrobiosis. The sleeping chironomid (Polypedilum vanderplanki) lives in the semi-arid regions of Africa, and its larvae can survive desiccation in an anhydrobiotic form during the dry season. To unveil the molecular mechanisms of this resistance to desiccation, an anhydrobiosis-related Expressed Sequence Tag (EST) database was obtained from the sequences of three cDNA libraries constructed from P. vanderplanki larvae after 0, 12, and 36 h of desiccation. The database contained 15,056 ESTs distributed into 4,807 UniGene clusters. ESTs were classified according to gene ontology categories, and putative expression patterns were deduced for all clusters on the basis of the number of clones in each library; expression patterns were confirmed by real-time PCR for selected genes. Among up-regulated genes, antioxidants, late embryogenesis abundant (LEA) proteins, and heat shock proteins (Hsps) were identified as important groups for anhydrobiosis. Genes related to trehalose metabolism and various transporters were also strongly induced by desiccation. Those results suggest that the oxidative stress response plays a central role in successful anhydrobiosis. Similarly, protein denaturation and aggregation may be prevented by marked up-regulation of Hsps and the anhydrobiosis-specific LEA proteins. A third major feature is the predicted increase in trehalose synthesis and in the expression of various transporter proteins allowing the distribution of trehalose and other solutes to all tissues. PMID:20833722

  16. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases

    Directory of Open Access Journals (Sweden)

    Bányai László

    2008-08-01

    Full Text Available Abstract Background Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii co-occurrence of extracellular and nuclear domains; (iv violation of domain integrity; (v chimeras encoded by two or more genes located on different chromosomes. Results Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis and two protostome species (Caenorhabditis elegans and Drosophila melanogaster have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON

  17. A protein relational database and protein family knowledge bases to facilitate structure-based design analyses.

    Science.gov (United States)

    Mobilio, Dominick; Walker, Gary; Brooijmans, Natasja; Nilakantan, Ramaswamy; Denny, R Aldrin; Dejoannis, Jason; Feyfant, Eric; Kowticwar, Rupesh K; Mankala, Jyoti; Palli, Satish; Punyamantula, Sairam; Tatipally, Maneesh; John, Reji K; Humblet, Christine

    2010-08-01

    The Protein Data Bank is the most comprehensive source of experimental macromolecular structures. It can, however, be difficult at times to locate relevant structures with the Protein Data Bank search interface. This is particularly true when searching for complexes containing specific interactions between protein and ligand atoms. Moreover, searching within a family of proteins can be tedious. For example, one cannot search for some conserved residue as residue numbers vary across structures. We describe herein three databases, Protein Relational Database, Kinase Knowledge Base, and Matrix Metalloproteinase Knowledge Base, containing protein structures from the Protein Data Bank. In Protein Relational Database, atom-atom distances between protein and ligand have been precalculated allowing for millisecond retrieval based on atom identity and distance constraints. Ring centroids, centroid-centroid and centroid-atom distances and angles have also been included permitting queries for pi-stacking interactions and other structural motifs involving rings. Other geometric features can be searched through the inclusion of residue pair and triplet distances. In Kinase Knowledge Base and Matrix Metalloproteinase Knowledge Base, the catalytic domains have been aligned into common residue numbering schemes. Thus, by searching across Protein Relational Database and Kinase Knowledge Base, one can easily retrieve structures wherein, for example, a ligand of interest is making contact with the gatekeeper residue.

  18. CPAD, Curated Protein Aggregation Database: A Repository of Manually Curated Experimental Data on Protein and Peptide Aggregation.

    Science.gov (United States)

    Thangakani, A Mary; Nagarajan, R; Kumar, Sandeep; Sakthivel, R; Velmurugan, D; Gromiha, M Michael

    2016-01-01

    Accurate distinction between peptide sequences that can form amyloid-fibrils or amorphous β-aggregates, identification of potential aggregation prone regions in proteins, and prediction of change in aggregation rate of a protein upon mutation(s) are critical to research on protein misfolding diseases, such as Alzheimer's and Parkinson's, as well as biotechnological production of protein based therapeutics. We have developed a Curated Protein Aggregation Database (CPAD), which has collected results from experimental studies performed by scientific community aimed at understanding protein/peptide aggregation. CPAD contains more than 2300 experimentally observed aggregation rates upon mutations in known amyloidogenic proteins. Each entry includes numerical values for the following parameters: change in rate of aggregation as measured by fluorescence intensity or turbidity, name and source of the protein, Uniprot and Protein Data Bank codes, single point as well as multiple mutations, and literature citation. The data in CPAD has been supplemented with five different types of additional information: (i) Amyloid fibril forming hexa-peptides, (ii) Amorphous β-aggregating hexa-peptides, (iii) Amyloid fibril forming peptides of different lengths, (iv) Amyloid fibril forming hexa-peptides whose crystal structures are available in the Protein Data Bank (PDB) and (v) Experimentally validated aggregation prone regions found in amyloidogenic proteins. Furthermore, CPAD is linked to other related databases and resources, such as Uniprot, Protein Data Bank, PUBMED, GAP, TANGO, WALTZ etc. We have set up a web interface with different search and display options so that users have the ability to get the data in multiple ways. CPAD is freely available at http://www.iitm.ac.in/bioinfo/CPAD/. The potential applications of CPAD have also been discussed.

  19. ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval.

    Science.gov (United States)

    Wang, Jingyan; Gao, Xin; Wang, Quanquan; Li, Yongping

    2012-05-08

    The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity/similarity measure for comparing a pair of proteins. This kind of pairwise measures suffer from the limitation of neglecting the distribution of other proteins and thus cannot satisfy the need for high accuracy of the retrieval systems. Recent work in the machine learning community has shown that exploiting the global structure of the database and learning the contextual dissimilarity/similarity measures can improve the retrieval performance significantly. However, most existing contextual dissimilarity/similarity learning algorithms work in an unsupervised manner, which does not utilize the information of the known class labels of proteins in the database. In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC regularizes an existing dissimilarity measure dij by considering the contextual information of the proteins. The context of a protein is defined by its neighboring proteins. The basic idea is, for a pair of proteins (i, j), if their context N(i) and N(j) is similar to each other, the two proteins should also have a high similarity. We implement this idea by regularizing dij by a factor learned from the context N(i) and N(j).Moreover, we divide the context to hierarchial sub-context and get the contextual dissimilarity vector for each protein pair. Using the class label information of the proteins, we select the relevant (a pair of proteins that has the same class labels) and irrelevant (with different labels) protein pairs, and train an SVM model to distinguish between their contextual dissimilarity vectors. The SVM model is further used to learn a supervised regularizing factor. Finally, with the new Supervised learned Dissimilarity measure, we update the Protein Hierarchial

  20. ProDis-ContSHC: Learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval

    KAUST Repository

    Wang, Jim Jing-Yan

    2012-05-08

    Background: The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity/similarity measure for comparing a pair of proteins. This kind of pairwise measures suffer from the limitation of neglecting the distribution of other proteins and thus cannot satisfy the need for high accuracy of the retrieval systems. Recent work in the machine learning community has shown that exploiting the global structure of the database and learning the contextual dissimilarity/similarity measures can improve the retrieval performance significantly. However, most existing contextual dissimilarity/similarity learning algorithms work in an unsupervised manner, which does not utilize the information of the known class labels of proteins in the database.Results: In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC regularizes an existing dissimilarity measure dij by considering the contextual information of the proteins. The context of a protein is defined by its neighboring proteins. The basic idea is, for a pair of proteins (i, j), if their context N (i) and N (j) is similar to each other, the two proteins should also have a high similarity. We implement this idea by regularizing dij by a factor learned from the context N (i) and N (j). Moreover, we divide the context to hierarchial sub-context and get the contextual dissimilarity vector for each protein pair. Using the class label information of the proteins, we select the relevant (a pair of proteins that has the same class labels) and irrelevant (with different labels) protein pairs, and train an SVM model to distinguish between their contextual dissimilarity vectors. The SVM model is further used to learn a supervised regularizing factor. Finally, with the new Supervised learned Dissimilarity measure, we update

  1. ASAView: Database and tool for solvent accessibility representation in proteins

    Directory of Open Access Journals (Sweden)

    Fawareh Hamed

    2004-05-01

    Full Text Available Abstract Background Accessible surface area (ASA or solvent accessibility of amino acids in a protein has important implications. Knowledge of surface residues helps in locating potential candidates of active sites. Therefore, a method to quickly see the surface residues in a two dimensional model would help to immediately understand the population of amino acid residues on the surface and in the inner core of the proteins. Results ASAView is an algorithm, an application and a database of schematic representations of solvent accessibility of amino acid residues within proteins. A characteristic two-dimensional spiral plot of solvent accessibility provides a convenient graphical view of residues in terms of their exposed surface areas. In addition, sequential plots in the form of bar charts are also provided. Online plots of the proteins included in the entire Protein Data Bank (PDB, are provided for the entire protein as well as their chains separately. Conclusions These graphical plots of solvent accessibility are likely to provide a quick view of the overall topological distribution of residues in proteins. Chain-wise computation of solvent accessibility is also provided.

  2. Completion of autobuilt protein models using a database of protein fragments

    International Nuclear Information System (INIS)

    Cowtan, Kevin

    2012-01-01

    Two developments in the process of automated protein model building in the Buccaneer software are described: the use of a database of protein fragments in improving the model completeness and the assembly of disconnected chain fragments into complete molecules. Two developments in the process of automated protein model building in the Buccaneer software are presented. A general-purpose library for protein fragments of arbitrary size is described, with a highly optimized search method allowing the use of a larger database than in previous work. The problem of assembling an autobuilt model into complete chains is discussed. This involves the assembly of disconnected chain fragments into complete molecules and the use of the database of protein fragments in improving the model completeness. Assembly of fragments into molecules is a standard step in existing model-building software, but the methods have not received detailed discussion in the literature

  3. Database Description - TMFunction | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available sidue (or mutant) in a protein. The experimental data are collected from the literature both by searching th...the sequence database, UniProt, structural database, PDB, and literature database

  4. Searching the protein structure database for ligand-binding site similarities using CPASS v.2

    Directory of Open Access Journals (Sweden)

    Caprez Adam

    2011-01-01

    Full Text Available Abstract Background A recent analysis of protein sequences deposited in the NCBI RefSeq database indicates that ~8.5 million protein sequences are encoded in prokaryotic and eukaryotic genomes, where ~30% are explicitly annotated as "hypothetical" or "uncharacterized" protein. Our Comparison of Protein Active-Site Structures (CPASS v.2 database and software compares the sequence and structural characteristics of experimentally determined ligand binding sites to infer a functional relationship in the absence of global sequence or structure similarity. CPASS is an important component of our Functional Annotation Screening Technology by NMR (FAST-NMR protocol and has been successfully applied to aid the annotation of a number of proteins of unknown function. Findings We report a major upgrade to our CPASS software and database that significantly improves its broad utility. CPASS v.2 is designed with a layered architecture to increase flexibility and portability that also enables job distribution over the Open Science Grid (OSG to increase speed. Similarly, the CPASS interface was enhanced to provide more user flexibility in submitting a CPASS query. CPASS v.2 now allows for both automatic and manual definition of ligand-binding sites and permits pair-wise, one versus all, one versus list, or list versus list comparisons. Solvent accessible surface area, ligand root-mean square difference, and Cβ distances have been incorporated into the CPASS similarity function to improve the quality of the results. The CPASS database has also been updated. Conclusions CPASS v.2 is more than an order of magnitude faster than the original implementation, and allows for multiple simultaneous job submissions. Similarly, the CPASS database of ligand-defined binding sites has increased in size by ~ 38%, dramatically increasing the likelihood of a positive search result. The modification to the CPASS similarity function is effective in reducing CPASS similarity scores

  5. Partial sequence determination of metabolically labeled radioactive proteins and peptides

    International Nuclear Information System (INIS)

    Anderson, C.W.

    1982-01-01

    The author has used the sequence analysis of radioactive proteins and peptides to approach several problems during the past few years. They, in collaboration with others, have mapped precisely several adenovirus proteins with respect to the nucleotide sequence of the adenovirus genome; identified hitherto missed proteins encoded by bacteriophage MS2 and by simian virus 40; analyzed the aminoterminal maturation of several virus proteins; determined the cleavage sites for processing of the poliovirus polyprotein; and analyzed the mechanism of frameshifting by excess normal tRNAs during cell-free protein synthesis. This chapter is designed to aid those without prior experience at protein sequence determinations. It is based primarily on the experience gained in the studies cited above, which made use of the Beckman 890 series automated protein sequencers

  6. Combining next-generation sequencing and online databases for microsatellite development in non-model organisms.

    Science.gov (United States)

    Rico, Ciro; Normandeau, Eric; Dion-Côté, Anne-Marie; Rico, María Inés; Côté, Guillaume; Bernatchez, Louis

    2013-12-03

    Next-generation sequencing (NGS) is revolutionising marker development and the rapidly increasing amount of transcriptomes published across a wide variety of taxa is providing valuable sequence databases for the identification of genetic markers without the need to generate new sequences. Microsatellites are still the most important source of polymorphic markers in ecology and evolution. Motivated by our long-term interest in the adaptive radiation of a non-model species complex of whitefishes (Coregonus spp.), in this study, we focus on microsatellite characterisation and multiplex optimisation using transcriptome sequences generated by Illumina® and Roche-454, as well as online databases of Expressed Sequence Tags (EST) for the study of whitefish evolution and demographic history. We identified and optimised 40 polymorphic loci in multiplex PCR reactions and validated the robustness of our analyses by testing several population genetics and phylogeographic predictions using 494 fish from five lakes and 2 distinct ecotypes.

  7. Yeast Interacting Proteins Database: YOL069W, YIL144W [Yeast Interacting Proteins Database

    Lifescience Database Archive (English)

    Full Text Available complex (Ndc80p-Nuf2p-Spc24p-Spc25p); involved in chromosome segregation, spindle checkpoint activity and kinetochore clustering...vity, kinetochore assembly and clustering Rows with this prey as prey (2) Rows with this prey as bait (0) 12...-Nuf2p-Spc24p-Spc25p); involved in chromosome segregation, spindle checkpoint activity and kinetochore clustering...d coiled-coil protein involved in chromosome segregation, spindle checkpoint activity, kinetochore assembly and clustering

  8. Complete cDNA sequence coding for human docking protein

    Energy Technology Data Exchange (ETDEWEB)

    Hortsch, M; Labeit, S; Meyer, D I

    1988-01-11

    Docking protein (DP, or SRP receptor) is a rough endoplasmic reticulum (ER)-associated protein essential for the targeting and translocation of nascent polypeptides across this membrane. It specifically interacts with a cytoplasmic ribonucleoprotein complex, the signal recognition particle (SRP). The nucleotide sequence of cDNA encoding the entire human DP and its deduced amino acid sequence are given.

  9. AlignMe—a membrane protein sequence alignment web server

    Science.gov (United States)

    Stamm, Marcus; Staritzbichler, René; Khafizov, Kamil; Forrest, Lucy R.

    2014-01-01

    We present a web server for pair-wise alignment of membrane protein sequences, using the program AlignMe. The server makes available two operational modes of AlignMe: (i) sequence to sequence alignment, taking two sequences in fasta format as input, combining information about each sequence from multiple sources and producing a pair-wise alignment (PW mode); and (ii) alignment of two multiple sequence alignments to create family-averaged hydropathy profile alignments (HP mode). For the PW sequence alignment mode, four different optimized parameter sets are provided, each suited to pairs of sequences with a specific similarity level. These settings utilize different types of inputs: (position-specific) substitution matrices, secondary structure predictions and transmembrane propensities from transmembrane predictions or hydrophobicity scales. In the second (HP) mode, each input multiple sequence alignment is converted into a hydrophobicity profile averaged over the provided set of sequence homologs; the two profiles are then aligned. The HP mode enables qualitative comparison of transmembrane topologies (and therefore potentially of 3D folds) of two membrane proteins, which can be useful if the proteins have low sequence similarity. In summary, the AlignMe web server provides user-friendly access to a set of tools for analysis and comparison of membrane protein sequences. Access is available at http://www.bioinfo.mpg.de/AlignMe PMID:24753425

  10. Ebolavirus Database: Gene and Protein Information Resource for Ebolaviruses

    Directory of Open Access Journals (Sweden)

    Rayapadi G. Swetha

    2016-01-01

    Full Text Available Ebola Virus Disease (EVD is a life-threatening haemorrhagic fever in humans. Even though there are many reports on EVD, the protein precursor functions and virulent factors of ebolaviruses remain poorly understood. Comparative analyses of Ebolavirus genomes will help in the identification of these important features. This prompted us to develop the Ebolavirus Database (EDB and we have provided links to various tools that will aid researchers to locate important regions in both the genomes and proteomes of Ebolavirus. The genomic analyses of ebolaviruses will provide important clues for locating the essential and core functional genes. The aim of EDB is to act as an integrated resource for ebolaviruses and we strongly believe that the database will be a useful tool for clinicians, microbiologists, health care workers, and bioscience researchers.

  11. Improving accuracy of protein-protein interaction prediction by considering the converse problem for sequence representation

    Directory of Open Access Journals (Sweden)

    Wang Yong

    2011-10-01

    Full Text Available Abstract Background With the development of genome-sequencing technologies, protein sequences are readily obtained by translating the measured mRNAs. Therefore predicting protein-protein interactions from the sequences is of great demand. The reason lies in the fact that identifying protein-protein interactions is becoming a bottleneck for eventually understanding the functions of proteins, especially for those organisms barely characterized. Although a few methods have been proposed, the converse problem, if the features used extract sufficient and unbiased information from protein sequences, is almost untouched. Results In this study, we interrogate this problem theoretically by an optimization scheme. Motivated by the theoretical investigation, we find novel encoding methods for both protein sequences and protein pairs. Our new methods exploit sufficiently the information of protein sequences and reduce artificial bias and computational cost. Thus, it significantly outperforms the available methods regarding sensitivity, specificity, precision, and recall with cross-validation evaluation and reaches ~80% and ~90% accuracy in Escherichia coli and Saccharomyces cerevisiae respectively. Our findings here hold important implication for other sequence-based prediction tasks because representation of biological sequence is always the first step in computational biology. Conclusions By considering the converse problem, we propose new representation methods for both protein sequences and protein pairs. The results show that our method significantly improves the accuracy of protein-protein interaction predictions.

  12. Nonlinear analysis of sequence repeats of multi-domain proteins

    Energy Technology Data Exchange (ETDEWEB)

    Huang Yanzhao [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Li Mingfeng [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Xiao Yi [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China)]. E-mail: lmf_bill@sina.com

    2007-11-15

    Many multi-domain proteins have repetitive three-dimensional structures but nearly-random amino acid sequences. In the present paper, by using a modified recurrence plot proposed by us previously, we show that these amino acid sequences have hidden repetitions in fact. These results indicate that the repetitive domain structures are encoded by the repetitive sequences. This also gives a method to detect the repetitive domain structures directly from amino acid sequences.

  13. PSSRdb: a relational database of polymorphic simple sequence repeats extracted from prokaryotic genomes.

    Science.gov (United States)

    Kumar, Pankaj; Chaitanya, Pasumarthy S; Nagarajaram, Hampapathalu A

    2011-01-01

    PSSRdb (Polymorphic Simple Sequence Repeats database) (http://www.cdfd.org.in/PSSRdb/) is a relational database of polymorphic simple sequence repeats (PSSRs) extracted from 85 different species of prokaryotes. Simple sequence repeats (SSRs) are the tandem repeats of nucleotide motifs of the sizes 1-6 bp and are highly polymorphic. SSR mutations in and around coding regions affect transcription and translation of genes. Such changes underpin phase variations and antigenic variations seen in some bacteria. Although SSR-mediated phase variation and antigenic variations have been well-studied in some bacteria there seems a lot of other species of prokaryotes yet to be investigated for SSR mediated adaptive and other evolutionary advantages. As a part of our on-going studies on SSR polymorphism in prokaryotes we compared the genome sequences of various strains and isolates available for 85 different species of prokaryotes and extracted a number of SSRs showing length variations and created a relational database called PSSRdb. This database gives useful information such as location of PSSRs in genomes, length variation across genomes, the regions harboring PSSRs, etc. The information provided in this database is very useful for further research and analysis of SSRs in prokaryotes.

  14. EuMicroSatdb: A database for microsatellites in the sequenced genomes of eukaryotes

    Directory of Open Access Journals (Sweden)

    Grover Atul

    2007-07-01

    Full Text Available Abstract Background Microsatellites have immense utility as molecular markers in different fields like genome characterization and mapping, phylogeny and evolutionary biology. Existing microsatellite databases are of limited utility for experimental and computational biologists with regard to their content and information output. EuMicroSatdb (Eukaryotic MicroSatellite database http://ipu.ac.in/usbt/EuMicroSatdb.htm is a web based relational database for easy and efficient positional mining of microsatellites from sequenced eukaryotic genomes. Description A user friendly web interface has been developed for microsatellite data retrieval using Active Server Pages (ASP. The backend database codes for data extraction and assembly have been written using Perl based scripts and C++. Precise need based microsatellites data retrieval is possible using different input parameters like microsatellite type (simple perfect or compound perfect, repeat unit length (mono- to hexa-nucleotide, repeat number, microsatellite length and chromosomal location in the genome. Furthermore, information about clustering of different microsatellites in the genome can also be retrieved. Finally, to facilitate primer designing for PCR amplification of any desired microsatellite locus, 200 bp upstream and downstream sequences are provided. Conclusion The database allows easy systematic retrieval of comprehensive information about simple and compound microsatellites, microsatellite clusters and their locus coordinates in 31 sequenced eukaryotic genomes. The information content of the database is useful in different areas of research like gene tagging, genome mapping, population genetics, germplasm characterization and in understanding microsatellite dynamics in eukaryotic genomes.

  15. Sequence Classification - TMBETA-GENOME | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available ansmembrane helical proteins by applying statistical and machine learning methods to each amino acid sequenc.... Amino Acid Result of predicting β-barrel membrane protein with a statistical method using amino acid compo...sition. ( TMBETADISC-COMP ) Dipeptide Result of predicting β-barrel membrane protein with a statistic...ting β-barrel membrane protein with a statistical method using motifs. ( TMBETADISC-MOTIF ) SVM Result of pr

  16. Gene Discovery in the Apicomplexa as Revealed by EST Sequencing and Assembly of a Comparative Gene Database

    Science.gov (United States)

    Li, Li; Brunk, Brian P.; Kissinger, Jessica C.; Pape, Deana; Tang, Keliang; Cole, Robert H.; Martin, John; Wylie, Todd; Dante, Mike; Fogarty, Steven J.; Howe, Daniel K.; Liberator, Paul; Diaz, Carmen; Anderson, Jennifer; White, Michael; Jerome, Maria E.; Johnson, Emily A.; Radke, Jay A.; Stoeckert, Christian J.; Waterston, Robert H.; Clifton, Sandra W.; Roos, David S.; Sibley, L. David

    2003-01-01

    Large-scale EST sequencing projects for several important parasites within the phylum Apicomplexa were undertaken for the purpose of gene discovery. Included were several parasites of medical importance (Plasmodium falciparum, Toxoplasma gondii) and others of veterinary importance (Eimeria tenella, Sarcocystis neurona, and Neospora caninum). A total of 55,192 ESTs, deposited into dbEST/GenBank, were included in the analyses. The resulting sequences have been clustered into nonredundant gene assemblies and deposited into a relational database that supports a variety of sequence and text searches. This database has been used to compare the gene assemblies using BLAST similarity comparisons to the public protein databases to identify putative genes. Of these new entries, ∼15%–20% represent putative homologs with a conservative cutoff of p neurona: , , , , , , , , , , , , , –, –, –, –, –. Eimeria tenella: –, –, –, –, –, –, –, –, – , –, –, –, –, –, –, –, –, –, –, –. Neospora caninum: –, –, , – , –, –.] PMID:12618375

  17. Estimating the annotation error rate of curated GO database sequence annotations

    Directory of Open Access Journals (Sweden)

    Brown Alfred L

    2007-05-01

    Full Text Available Abstract Background Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO sequence database (GOSeqLite. This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences. Results We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006 at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS had an estimated error rate of 49%. Conclusion While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information.

  18. Integration of gel-based and gel-free proteomic data for functional analysis of proteins through Soybean Proteome Database

    KAUST Repository

    Komatsu, Setsuko

    2017-05-10

    predicted proteins from genome sequences, though there are over lapped proteins. Based on the demonstrated application of data stored in the database for functional analyses, it is suggested that these data will be useful for analyses of biological mechanisms in soybean. Furthermore, coupled with recent advances in information and communication technology, the usefulness of this database would increase in the analyses of biological mechanisms.

  19. Integration of gel-based and gel-free proteomic data for functional analysis of proteins through Soybean Proteome Database.

    Science.gov (United States)

    Komatsu, Setsuko; Wang, Xin; Yin, Xiaojian; Nanjo, Yohei; Ohyanagi, Hajime; Sakata, Katsumi

    2017-06-23

    genome sequences, though there are over lapped proteins. Based on the demonstrated application of data stored in the database for functional analyses, it is suggested that these data will be useful for analyses of biological mechanisms in soybean. Furthermore, coupled with recent advances in information and communication technology, the usefulness of this database would increase in the analyses of biological mechanisms. Copyright © 2017 Elsevier B.V. All rights reserved.

  20. Integration of gel-based and gel-free proteomic data for functional analysis of proteins through Soybean Proteome Database

    KAUST Repository

    Komatsu, Setsuko; Wang, Xin; Yin, Xiaojian; Nanjo, Yohei; Ohyanagi, Hajime; Sakata, Katsumi

    2017-01-01

    predicted proteins from genome sequences, though there are over lapped proteins. Based on the demonstrated application of data stored in the database for functional analyses, it is suggested that these data will be useful for analyses of biological mechanisms in soybean. Furthermore, coupled with recent advances in information and communication technology, the usefulness of this database would increase in the analyses of biological mechanisms.

  1. Translational database selection and multiplexed sequence capture for up front filtering of reliable breast cancer biomarker candidates.

    Directory of Open Access Journals (Sweden)

    Patrik L Ståhl

    Full Text Available Biomarker identification is of utmost importance for the development of novel diagnostics and therapeutics. Here we make use of a translational database selection strategy, utilizing data from the Human Protein Atlas (HPA on differentially expressed protein patterns in healthy and breast cancer tissues as a means to filter out potential biomarkers for underlying genetic causatives of the disease. DNA was isolated from ten breast cancer biopsies, and the protein coding and flanking non-coding genomic regions corresponding to the selected proteins were extracted in a multiplexed format from the samples using a single DNA sequence capture array. Deep sequencing revealed an even enrichment of the multiplexed samples and a great variation of genetic alterations in the tumors of the sampled individuals. Benefiting from the upstream filtering method, the final set of biomarker candidates could be completely verified through bidirectional Sanger sequencing, revealing a 40 percent false positive rate despite high read coverage. Of the variants encountered in translated regions, nine novel non-synonymous variations were identified and verified, two of which were present in more than one of the ten tumor samples.

  2. An algorithm to find all palindromic sequences in proteins

    Indian Academy of Sciences (India)

    2013-01-20

    Jan 20, 2013 ... 1976; Karrer and Gall 1976; Vogt and Braun 1976) and (iii) in the formation of hairpin loops in the newly transcribed RNA. Palindromic sequences are observed in various classes of proteins like histones (Cheng et al. 1989), prion proteins (Sulkowski 1992; Kazim 1993),. DNA-binding proteins (Suzuki 1992; ...

  3. Discriminating Microbial Species Using Protein Sequence Properties and Machine Learning

    NARCIS (Netherlands)

    Shahib, Ali Al-; Gilbert, David; Breitling, Rainer

    2007-01-01

    Much work has been done to identify species-specific proteins in sequenced genomes and hence to determine their function. We assumed that such proteins have specific physico-chemical properties that will discriminate them from proteins in other species. In this paper, we examine the validity of this

  4. Protein (Cyanobacteria) - PGDBj - Ortholog DB | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available ut This Database Database Description Download License Update History of This Database Site Policy | Contact Us Protein (Cyanobacteria) - PGDBj - Ortholog DB | LSDB Archive ... ...List Contact us PGDBj - Ortholog DB Protein (Cyanobacteria) Data detail Data name Protein (Cyanobacteria) DO...switchLanguage; BLAST Search Image Search Home About Archive Update History Data

  5. Efficient use of unlabeled data for protein sequence classification: a comparative study.

    Science.gov (United States)

    Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir

    2009-04-29

    Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags-the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably.

  6. dbPAF: an integrative database of protein phosphorylation in animals and fungi.

    Science.gov (United States)

    Ullah, Shahid; Lin, Shaofeng; Xu, Yang; Deng, Wankun; Ma, Lili; Zhang, Ying; Liu, Zexian; Xue, Yu

    2016-03-24

    Protein phosphorylation is one of the most important post-translational modifications (PTMs) and regulates a broad spectrum of biological processes. Recent progresses in phosphoproteomic identifications have generated a flood of phosphorylation sites, while the integration of these sites is an urgent need. In this work, we developed a curated database of dbPAF, containing known phosphorylation sites in H. sapiens, M. musculus, R. norvegicus, D. melanogaster, C. elegans, S. pombe and S. cerevisiae. From the scientific literature and public databases, we totally collected and integrated 54,148 phosphoproteins with 483,001 phosphorylation sites. Multiple options were provided for accessing the data, while original references and other annotations were also present for each phosphoprotein. Based on the new data set, we computationally detected significantly over-represented sequence motifs around phosphorylation sites, predicted potential kinases that are responsible for the modification of collected phospho-sites, and evolutionarily analyzed phosphorylation conservation states across different species. Besides to be largely consistent with previous reports, our results also proposed new features of phospho-regulation. Taken together, our database can be useful for further analyses of protein phosphorylation in human and other model organisms. The dbPAF database was implemented in PHP + MySQL and freely available at http://dbpaf.biocuckoo.org.

  7. Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource.

    Science.gov (United States)

    Sharpton, Thomas J; Jospin, Guillaume; Wu, Dongying; Langille, Morgan G I; Pollard, Katherine S; Eisen, Jonathan A

    2012-10-13

    New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences. We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as "Sifting Families," or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology-based analyses. We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).

  8. The relationship of protein conservation and sequence length

    Directory of Open Access Journals (Sweden)

    Panchenko Anna R

    2002-11-01

    Full Text Available Abstract Background In general, the length of a protein sequence is determined by its function and the wide variance in the lengths of an organism's proteins reflects the diversity of specific functional roles for these proteins. However, additional evolutionary forces that affect the length of a protein may be revealed by studying the length distributions of proteins evolving under weaker functional constraints. Results We performed sequence comparisons to distinguish highly conserved and poorly conserved proteins from the bacterium Escherichia coli, the archaeon Archaeoglobus fulgidus, and the eukaryotes Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. For all organisms studied, the conserved and nonconserved proteins have strikingly different length distributions. The conserved proteins are, on average, longer than the poorly conserved ones, and the length distributions for the poorly conserved proteins have a relatively narrow peak, in contrast to the conserved proteins whose lengths spread over a wider range of values. For the two prokaryotes studied, the poorly conserved proteins approximate the minimal length distribution expected for a diverse range of structural folds. Conclusions There is a relationship between protein conservation and sequence length. For all the organisms studied, there seems to be a significant evolutionary trend favoring shorter proteins in the absence of other, more specific functional constraints.

  9. mESAdb: microRNA expression and sequence analysis database.

    Science.gov (United States)

    Kaya, Koray D; Karakülah, Gökhan; Yakicier, Cengiz M; Acar, Aybar C; Konu, Ozlen

    2011-01-01

    microRNA expression and sequence analysis database (http://konulab.fen.bilkent.edu.tr/mirna/) (mESAdb) is a regularly updated database for the multivariate analysis of sequences and expression of microRNAs from multiple taxa. mESAdb is modular and has a user interface implemented in PHP and JavaScript and coupled with statistical analysis and visualization packages written for the R language. The database primarily comprises mature microRNA sequences and their target data, along with selected human, mouse and zebrafish expression data sets. mESAdb analysis modules allow (i) mining of microRNA expression data sets for subsets of microRNAs selected manually or by motif; (ii) pair-wise multivariate analysis of expression data sets within and between taxa; and (iii) association of microRNA subsets with annotation databases, HUGE Navigator, KEGG and GO. The use of existing and customized R packages facilitates future addition of data sets and analysis tools. Furthermore, the ability to upload and analyze user-specified data sets makes mESAdb an interactive and expandable analysis tool for microRNA sequence and expression data.

  10. An Internet-Accessible DNA Sequence Database for Identifying Fusaria from Human and Animal Infections

    Science.gov (United States)

    Because less than one-third of clinically relevant fusaria can be accurately identified to species level using phenotypic data (i.e., morphological species recognition), we constructed a three-locus DNA sequence database to facilitate molecular identification of the 69 Fusarium species associated wi...

  11. Inverse statistical physics of protein sequences: a key issues review.

    Science.gov (United States)

    Cocco, Simona; Feinauer, Christoph; Figliuzzi, Matteo; Monasson, Rémi; Weigt, Martin

    2018-03-01

    In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview over some biologically important questions, and how statistical-mechanics inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.

  12. Aligning protein sequence and analysing substitution pattern using ...

    Indian Academy of Sciences (India)

    Prakash

    Aligning protein sequences using a score matrix has became a routine but valuable method in modern biological ..... the amino acids according to their substitution behaviour ...... which may cause great change (e.g. prolonging the helix) in.

  13. Protein secondary structure prediction for a single-sequence using hidden semi-Markov models

    Directory of Open Access Journals (Sweden)

    Borodovsky Mark

    2006-03-01

    Full Text Available Abstract Background The accuracy of protein secondary structure prediction has been improving steadily towards the 88% estimated theoretical limit. There are two types of prediction algorithms: Single-sequence prediction algorithms imply that information about other (homologous proteins is not available, while algorithms of the second type imply that information about homologous proteins is available, and use it intensively. The single-sequence algorithms could make an important contribution to studies of proteins with no detected homologs, however the accuracy of protein secondary structure prediction from a single-sequence is not as high as when the additional evolutionary information is present. Results In this paper, we further refine and extend the hidden semi-Markov model (HSMM initially considered in the BSPSS algorithm. We introduce an improved residue dependency model by considering the patterns of statistically significant amino acid correlation at structural segment borders. We also derive models that specialize on different sections of the dependency structure and incorporate them into HSMM. In addition, we implement an iterative training method to refine estimates of HSMM parameters. The three-state-per-residue accuracy and other accuracy measures of the new method, IPSSP, are shown to be comparable or better than ones for BSPSS as well as for PSIPRED, tested under the single-sequence condition. Conclusions We have shown that new dependency models and training methods bring further improvements to single-sequence protein secondary structure prediction. The results are obtained under cross-validation conditions using a dataset with no pair of sequences having significant sequence similarity. As new sequences are added to the database it is possible to augment the dependency structure and obtain even higher accuracy. Current and future advances should contribute to the improvement of function prediction for orphan proteins inscrutable

  14. Development of human protein reference database as an initial platform for approaching systems biology in humans

    DEFF Research Database (Denmark)

    Peri, Suraj; Navarro, J Daniel; Amanchy, Ramars

    2003-01-01

    Human Protein Reference Database (HPRD) is an object database that integrates a wealth of information relevant to the function of human proteins in health and disease. Data pertaining to thousands of protein-protein interactions, posttranslational modifications, enzyme/substrate relationships...

  15. Formatt: Correcting protein multiple structural alignments by incorporating sequence alignment

    Directory of Open Access Journals (Sweden)

    Daniels Noah M

    2012-10-01

    Full Text Available Abstract Background The quality of multiple protein structure alignments are usually computed and assessed based on geometric functions of the coordinates of the backbone atoms from the protein chains. These purely geometric methods do not utilize directly protein sequence similarity, and in fact, determining the proper way to incorporate sequence similarity measures into the construction and assessment of protein multiple structure alignments has proved surprisingly difficult. Results We present Formatt, a multiple structure alignment based on the Matt purely geometric multiple structure alignment program, that also takes into account sequence similarity when constructing alignments. We show that Formatt outperforms Matt and other popular structure alignment programs on the popular HOMSTRAD benchmark. For the SABMark twilight zone benchmark set that captures more remote homology, Formatt and Matt outperform other programs; depending on choice of embedded sequence aligner, Formatt produces either better sequence and structural alignments with a smaller core size than Matt, or similarly sized alignments with better sequence similarity, for a small cost in average RMSD. Conclusions Considering sequence information as well as purely geometric information seems to improve quality of multiple structure alignments, though defining what constitutes the best alignment when sequence and structural measures would suggest different alignments remains a difficult open question.

  16. Protein sequences clustering of herpes virus by using Tribe Markov clustering (Tribe-MCL)

    Science.gov (United States)

    Bustamam, A.; Siswantining, T.; Febriyani, N. L.; Novitasari, I. D.; Cahyaningrum, R. D.

    2017-07-01

    The herpes virus can be found anywhere and one of the important characteristics is its ability to cause acute and chronic infection at certain times so as a result of the infection allows severe complications occurred. The herpes virus is composed of DNA containing protein and wrapped by glycoproteins. In this work, the Herpes viruses family is classified and analyzed by clustering their protein-sequence using Tribe Markov Clustering (Tribe-MCL) algorithm. Tribe-MCL is an efficient clustering method based on the theory of Markov chains, to classify protein families from protein sequences using pre-computed sequence similarity information. We implement the Tribe-MCL algorithm using an open source program of R. We select 24 protein sequences of Herpes virus obtained from NCBI database. The dataset consists of three types of glycoprotein B, F, and H. Each type has eight herpes virus that infected humans. Based on our simulation using different inflation factor r=1.5, 2, 3 we find a various number of the clusters results. The greater the inflation factor the greater the number of their clusters. Each protein will grouped together in the same type of protein.

  17. Using random forests for assistance in the curation of G-protein coupled receptor databases.

    Science.gov (United States)

    Shkurin, Aleksei; Vellido, Alfredo

    2017-08-18

    Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structure, such discrimination must rely on their primary amino acid sequences. We are interested not as much in achieving maximum sub-family discrimination accuracy using quantitative methods, but in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers. Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each subfamily and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task. The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has

  18. Toxicological relationships between proteins obtained from protein target predictions of large toxicity databases

    International Nuclear Information System (INIS)

    Nigsch, Florian; Mitchell, John B.O.

    2008-01-01

    The combination of models for protein target prediction with large databases containing toxicological information for individual molecules allows the derivation of 'toxiclogical' profiles, i.e., to what extent are molecules of known toxicity predicted to interact with a set of protein targets. To predict protein targets of drug-like and toxic molecules, we built a computational multiclass model using the Winnow algorithm based on a dataset of protein targets derived from the MDL Drug Data Report. A 15-fold Monte Carlo cross-validation using 50% of each class for training, and the remaining 50% for testing, provided an assessment of the accuracy of that model. We retained the 3 top-ranking predictions and found that in 82% of all cases the correct target was predicted within these three predictions. The first prediction was the correct one in almost 70% of cases. A model built on the whole protein target dataset was then used to predict the protein targets for 150 000 molecules from the MDL Toxicity Database. We analysed the frequency of the predictions across the panel of protein targets for experimentally determined toxicity classes of all molecules. This allowed us to identify clusters of proteins related by their toxicological profiles, as well as toxicities that are related. Literature-based evidence is provided for some specific clusters to show the relevance of the relationships identified

  19. Protein Structural Change Data - PSCDB | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available List Contact us PSCDB Protein Structural Change Data Data detail Data name Protein Structural Change Data DO...History of This Database Site Policy | Contact Us Protein Structural Change Data - PSCDB | LSDB Archive ...

  20. Protein 3D structure computed from evolutionary sequence variation.

    Directory of Open Access Journals (Sweden)

    Debora S Marks

    Full Text Available The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing.In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy.We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7-4.8 Å C(α-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org. This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of

  1. GarlicESTdb: an online database and mining tool for garlic EST sequences

    Directory of Open Access Journals (Sweden)

    Choi Sang-Haeng

    2009-05-01

    Full Text Available Abstract Background Allium sativum., commonly known as garlic, is a species in the onion genus (Allium, which is a large and diverse one containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary, medicinal use and health benefits. Currently, the interest in garlic is highly increasing due to nutritional and pharmaceutical value including high blood pressure and cholesterol, atherosclerosis and cancer. For all that, there are no comprehensive databases available for Expressed Sequence Tags(EST of garlic for gene discovery and future efforts of genome annotation. That is why we developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression. Description GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in JAVA and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information into MySQL databases, and graphic display of all processed data. A web application was implemented with the latest J2EE (Java 2 Platform Enterprise Edition software technology (JSP/EJB/JavaServlet for browsing and querying the database, for creation of dynamic web pages on the client side, and for mapping annotated enzymes to KEGG pathways, the AJAX framework was also used partially. The online resources, such as putative annotation, single nucleotide polymorphisms (SNP and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To archive more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation

  2. GarlicESTdb: an online database and mining tool for garlic EST sequences.

    Science.gov (United States)

    Kim, Dae-Won; Jung, Tae-Sung; Nam, Seong-Hyeuk; Kwon, Hyuk-Ryul; Kim, Aeri; Chae, Sung-Hwa; Choi, Sang-Haeng; Kim, Dong-Wook; Kim, Ryong Nam; Park, Hong-Seog

    2009-05-18

    Allium sativum., commonly known as garlic, is a species in the onion genus (Allium), which is a large and diverse one containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary, medicinal use and health benefits. Currently, the interest in garlic is highly increasing due to nutritional and pharmaceutical value including high blood pressure and cholesterol, atherosclerosis and cancer. For all that, there are no comprehensive databases available for Expressed Sequence Tags(EST) of garlic for gene discovery and future efforts of genome annotation. That is why we developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression. GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum) EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in JAVA and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information into MySQL databases, and graphic display of all processed data. A web application was implemented with the latest J2EE (Java 2 Platform Enterprise Edition) software technology (JSP/EJB/JavaServlet) for browsing and querying the database, for creation of dynamic web pages on the client side, and for mapping annotated enzymes to KEGG pathways, the AJAX framework was also used partially. The online resources, such as putative annotation, single nucleotide polymorphisms (SNP) and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To archive more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation information for others to view. The Garlic

  3. CLIPZ: a database and analysis environment for experimentally determined binding sites of RNA-binding proteins.

    Science.gov (United States)

    Khorshid, Mohsen; Rodak, Christoph; Zavolan, Mihaela

    2011-01-01

    The stability, localization and translation rate of mRNAs are regulated by a multitude of RNA-binding proteins (RBPs) that find their targets directly or with the help of guide RNAs. Among the experimental methods for mapping RBP binding sites, cross-linking and immunoprecipitation (CLIP) coupled with deep sequencing provides transcriptome-wide coverage as well as high resolution. However, partly due to their vast volume, the data that were so far generated in CLIP experiments have not been put in a form that enables fast and interactive exploration of binding sites. To address this need, we have developed the CLIPZ database and analysis environment. Binding site data for RBPs such as Argonaute 1-4, Insulin-like growth factor II mRNA-binding protein 1-3, TNRC6 proteins A-C, Pumilio 2, Quaking and Polypyrimidine tract binding protein can be visualized at the level of the genome and of individual transcripts. Individual users can upload their own sequence data sets while being able to limit the access to these data to specific users, and analyses of the public and private data sets can be performed interactively. CLIPZ, available at http://www.clipz.unibas.ch, aims to provide an open access repository of information for post-transcriptional regulatory elements.

  4. Quantiprot - a Python package for quantitative analysis of protein sequences.

    Science.gov (United States)

    Konopka, Bogumił M; Marciniak, Marta; Dyrka, Witold

    2017-07-17

    The field of protein sequence analysis is dominated by tools rooted in substitution matrices and alignments. A complementary approach is provided by methods of quantitative characterization. A major advantage of the approach is that quantitative properties defines a multidimensional solution space, where sequences can be related to each other and differences can be meaningfully interpreted. Quantiprot is a software package in Python, which provides a simple and consistent interface to multiple methods for quantitative characterization of protein sequences. The package can be used to calculate dozens of characteristics directly from sequences or using physico-chemical properties of amino acids. Besides basic measures, Quantiprot performs quantitative analysis of recurrence and determinism in the sequence, calculates distribution of n-grams and computes the Zipf's law coefficient. We propose three main fields of application of the Quantiprot package. First, quantitative characteristics can be used in alignment-free similarity searches, and in clustering of large and/or divergent sequence sets. Second, a feature space defined by quantitative properties can be used in comparative studies of protein families and organisms. Third, the feature space can be used for evaluating generative models, where large number of sequences generated by the model can be compared to actually observed sequences.

  5. SeqHound: biological sequence and structure database as a platform for bioinformatics research

    Directory of Open Access Journals (Sweden)

    Dumontier Michel

    2002-10-01

    Full Text Available Abstract Background SeqHound has been developed as an integrated biological sequence, taxonomy, annotation and 3-D structure database system. It provides a high-performance server platform for bioinformatics research in a locally-hosted environment. Results SeqHound is based on the National Center for Biotechnology Information data model and programming tools. It offers daily updated contents of all Entrez sequence databases in addition to 3-D structural data and information about sequence redundancies, sequence neighbours, taxonomy, complete genomes, functional annotation including Gene Ontology terms and literature links to PubMed. SeqHound is accessible via a web server through a Perl, C or C++ remote API or an optimized local API. It provides functionality necessary to retrieve specialized subsets of sequences, structures and structural domains. Sequences may be retrieved in FASTA, GenBank, ASN.1 and XML formats. Structures are available in ASN.1, XML and PDB formats. Emphasis has been placed on complete genomes, taxonomy, domain and functional annotation as well as 3-D structural functionality in the API, while fielded text indexing functionality remains under development. SeqHound also offers a streamlined WWW interface for simple web-user queries. Conclusions The system has proven useful in several published bioinformatics projects such as the BIND database and offers a cost-effective infrastructure for research. SeqHound will continue to develop and be provided as a service of the Blueprint Initiative at the Samuel Lunenfeld Research Institute. The source code and examples are available under the terms of the GNU public license at the Sourceforge site http://sourceforge.net/projects/slritools/ in the SLRI Toolkit.

  6. Correlation between protein sequence similarity and x-ray diffraction quality in the protein data bank.

    Science.gov (United States)

    Lu, Hui-Meng; Yin, Da-Chuan; Ye, Ya-Jing; Luo, Hui-Min; Geng, Li-Qiang; Li, Hai-Sheng; Guo, Wei-Hong; Shang, Peng

    2009-01-01

    As the most widely utilized technique to determine the 3-dimensional structure of protein molecules, X-ray crystallography can provide structure of the highest resolution among the developed techniques. The resolution obtained via X-ray crystallography is known to be influenced by many factors, such as the crystal quality, diffraction techniques, and X-ray sources, etc. In this paper, the authors found that the protein sequence could also be one of the factors. We extracted information of the resolution and the sequence of proteins from the Protein Data Bank (PDB), classified the proteins into different clusters according to the sequence similarity, and statistically analyzed the relationship between the sequence similarity and the best resolution obtained. The results showed that there was a pronounced correlation between the sequence similarity and the obtained resolution. These results indicate that protein structure itself is one variable that may affect resolution when X-ray crystallography is used.

  7. Sequence charge decoration dictates coil-globule transition in intrinsically disordered proteins.

    Science.gov (United States)

    Firman, Taylor; Ghosh, Kingshuk

    2018-03-28

    We present an analytical theory to compute conformations of heteropolymers-applicable to describe disordered proteins-as a function of temperature and charge sequence. The theory describes coil-globule transition for a given protein sequence when temperature is varied and has been benchmarked against the all-atom Monte Carlo simulation (using CAMPARI) of intrinsically disordered proteins (IDPs). In addition, the model quantitatively shows how subtle alterations of charge placement in the primary sequence-while maintaining the same charge composition-can lead to significant changes in conformation, even as drastic as a coil (swelled above a purely random coil) to globule (collapsed below a random coil) and vice versa. The theory provides insights on how to control (enhance or suppress) these changes by tuning the temperature (or solution condition) and charge decoration. As an application, we predict the distribution of conformations (at room temperature) of all naturally occurring IDPs in the DisProt database and notice significant size variation even among IDPs with a similar composition of positive and negative charges. Based on this, we provide a new diagram-of-states delineating the sequence-conformation relation for proteins in the DisProt database. Next, we study the effect of post-translational modification, e.g., phosphorylation, on IDP conformations. Modifications as little as two-site phosphorylation can significantly alter the size of an IDP with everything else being constant (temperature, salt concentration, etc.). However, not all possible modification sites have the same effect on protein conformations; there are certain "hot spots" that can cause maximal change in conformation. The location of these "hot spots" in the parent sequence can readily be identified by using a sequence charge decoration metric originally introduced by Sawle and Ghosh. The ability of our model to predict conformations (both expanded and collapsed states) of IDPs at a high

  8. Kin-Driver: a database of driver mutations in protein kinases.

    Science.gov (United States)

    Simonetti, Franco L; Tornador, Cristian; Nabau-Moretó, Nuria; Molina-Vila, Miguel A; Marino-Buslje, Cristina

    2014-01-01

    Somatic mutations in protein kinases (PKs) are frequent driver events in many human tumors, while germ-line mutations are associated with hereditary diseases. Here we present Kin-driver, the first database that compiles driver mutations in PKs with experimental evidence demonstrating their functional role. Kin-driver is a manual expert-curated database that pays special attention to activating mutations (AMs) and can serve as a validation set to develop new generation tools focused on the prediction of gain-of-function driver mutations. It also offers an easy and intuitive environment to facilitate the visualization and analysis of mutations in PKs. Because all mutations are mapped onto a multiple sequence alignment, analogue positions between kinases can be identified and tentative new mutations can be proposed for studying by transferring annotation. Finally, our database can also be of use to clinical and translational laboratories, helping them to identify uncommon AMs that can correlate with response to new antitumor drugs. The website was developed using PHP and JavaScript, which are supported by all major browsers; the database was built using MySQL server. Kin-driver is available at: http://kin-driver.leloir.org.ar/ © The Author(s) 2014. Published by Oxford University Press.

  9. Identification of Alternative Splice Variants Using Unique Tryptic Peptide Sequences for Database Searches.

    Science.gov (United States)

    Tran, Trung T; Bollineni, Ravi C; Strozynski, Margarita; Koehler, Christian J; Thiede, Bernd

    2017-07-07

    Alternative splicing is a mechanism in eukaryotes by which different forms of mRNAs are generated from the same gene. Identification of alternative splice variants requires the identification of peptides specific for alternative splice forms. For this purpose, we generated a human database that contains only unique tryptic peptides specific for alternative splice forms from Swiss-Prot entries. Using this database allows an easy access to splice variant-specific peptide sequences that match to MS data. Furthermore, we combined this database without alternative splice variant-1-specific peptides with human Swiss-Prot. This combined database can be used as a general database for searching of LC-MS data. LC-MS data derived from in-solution digests of two different cell lines (LNCaP, HeLa) and phosphoproteomics studies were analyzed using these two databases. Several nonalternative splice variant-1-specific peptides were found in both cell lines, and some of them seemed to be cell-line-specific. Control and apoptotic phosphoproteomes from Jurkat T cells revealed several nonalternative splice variant-1-specific peptides, and some of them showed clear quantitative differences between the two states.

  10. A Public Database of Memory and Naive B-Cell Receptor Sequences.

    Directory of Open Access Journals (Sweden)

    William S DeWitt

    Full Text Available The vast diversity of B-cell receptors (BCR and secreted antibodies enables the recognition of, and response to, a wide range of epitopes, but this diversity has also limited our understanding of humoral immunity. We present a public database of more than 37 million unique BCR sequences from three healthy adult donors that is many fold deeper than any existing resource, together with a set of online tools designed to facilitate the visualization and analysis of the annotated data. We estimate the clonal diversity of the naive and memory B-cell repertoires of healthy individuals, and provide a set of examples that illustrate the utility of the database, including several views of the basic properties of immunoglobulin heavy chain sequences, such as rearrangement length, subunit usage, and somatic hypermutation positions and dynamics.

  11. A Public Database of Memory and Naive B-Cell Receptor Sequences.

    Science.gov (United States)

    DeWitt, William S; Lindau, Paul; Snyder, Thomas M; Sherwood, Anna M; Vignali, Marissa; Carlson, Christopher S; Greenberg, Philip D; Duerkopp, Natalie; Emerson, Ryan O; Robins, Harlan S

    2016-01-01

    The vast diversity of B-cell receptors (BCR) and secreted antibodies enables the recognition of, and response to, a wide range of epitopes, but this diversity has also limited our understanding of humoral immunity. We present a public database of more than 37 million unique BCR sequences from three healthy adult donors that is many fold deeper than any existing resource, together with a set of online tools designed to facilitate the visualization and analysis of the annotated data. We estimate the clonal diversity of the naive and memory B-cell repertoires of healthy individuals, and provide a set of examples that illustrate the utility of the database, including several views of the basic properties of immunoglobulin heavy chain sequences, such as rearrangement length, subunit usage, and somatic hypermutation positions and dynamics.

  12. Can Natural Proteins Designed with ‘Inverted’ Peptide Sequences Adopt Native-Like Protein Folds?

    Science.gov (United States)

    Sridhar, Settu; Guruprasad, Kunchur

    2014-01-01

    We have carried out a systematic computational analysis on a representative dataset of proteins of known three-dimensional structure, in order to evaluate whether it would possible to ‘swap’ certain short peptide sequences in naturally occurring proteins with their corresponding ‘inverted’ peptides and generate ‘artificial’ proteins that are predicted to retain native-like protein fold. The analysis of 3,967 representative proteins from the Protein Data Bank revealed 102,677 unique identical inverted peptide sequence pairs that vary in sequence length between 5–12 and 18 amino acid residues. Our analysis illustrates with examples that such ‘artificial’ proteins may be generated by identifying peptides with ‘similar structural environment’ and by using comparative protein modeling and validation studies. Our analysis suggests that natural proteins may be tolerant to accommodating such peptides. PMID:25210740

  13. AllergenOnline: A peer-reviewed, curated allergen database to assess novel food proteins for potential cross-reactivity.

    Science.gov (United States)

    Goodman, Richard E; Ebisawa, Motohiro; Ferreira, Fatima; Sampson, Hugh A; van Ree, Ronald; Vieths, Stefan; Baumert, Joseph L; Bohle, Barbara; Lalithambika, Sreedevi; Wise, John; Taylor, Steve L

    2016-05-01

    Increasingly regulators are demanding evaluation of potential allergenicity of foods prior to marketing. Primary risks are the transfer of allergens or potentially cross-reactive proteins into new foods. AllergenOnline was developed in 2005 as a peer-reviewed bioinformatics platform to evaluate risks of new dietary proteins in genetically modified organisms (GMO) and novel foods. The process used to identify suspected allergens and evaluate the evidence of allergenicity was refined between 2010 and 2015. Candidate proteins are identified from the NCBI database using keyword searches, the WHO/IUIS nomenclature database and peer reviewed publications. Criteria to classify proteins as allergens are described. Characteristics of the protein, the source and human subjects, test methods and results are evaluated by our expert panel and archived. Food, inhalant, salivary, venom, and contact allergens are included. Users access allergen sequences through links to the NCBI database and relevant references are listed online. Version 16 includes 1956 sequences from 778 taxonomic-protein groups that are accepted with evidence of allergic serum IgE-binding and/or biological activity. AllergenOnline provides a useful peer-reviewed tool for identifying the primary potential risks of allergy for GMOs and novel foods based on criteria described by the Codex Alimentarius Commission (2003). © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  14. O-GLYCBASE version 3.0: a revised database of O-glycosylated proteins

    DEFF Research Database (Denmark)

    Hansen, Jan; Lund, Ole; Nilsson, Jette

    1998-01-01

    O-GLYCBASE is a revised database of information on glycoproteins and their O-linked glycosylation sites. Entries are compiled and revised from the literature, and from the sequence databases. Entries include informations about species, sequence, glycosylation sites and glycan type and is fully cr...

  15. AT_CHLORO, a comprehensive chloroplast proteome database with subplastidial localization and curated information on envelope proteins.

    Science.gov (United States)

    Ferro, Myriam; Brugière, Sabine; Salvi, Daniel; Seigneurin-Berny, Daphné; Court, Magali; Moyet, Lucas; Ramus, Claire; Miras, Stéphane; Mellal, Mourad; Le Gall, Sophie; Kieffer-Jaquinod, Sylvie; Bruley, Christophe; Garin, Jérôme; Joyard, Jacques; Masselon, Christophe; Rolland, Norbert

    2010-06-01

    Recent advances in the proteomics field have allowed a series of high throughput experiments to be conducted on chloroplast samples, and the data are available in several public databases. However, the accurate localization of many chloroplast proteins often remains hypothetical. This is especially true for envelope proteins. We went a step further into the knowledge of the chloroplast proteome by focusing, in the same set of experiments, on the localization of proteins in the stroma, the thylakoids, and envelope membranes. LC-MS/MS-based analyses first allowed building the AT_CHLORO database (http://www.grenoble.prabi.fr/protehome/grenoble-plant-proteomics/), a comprehensive repertoire of the 1323 proteins, identified by 10,654 unique peptide sequences, present in highly purified chloroplasts and their subfractions prepared from Arabidopsis thaliana leaves. This database also provides extensive proteomics information (peptide sequences and molecular weight, chromatographic retention times, MS/MS spectra, and spectral count) for a unique chloroplast protein accurate mass and time tag database gathering identified peptides with their respective and precise analytical coordinates, molecular weight, and retention time. We assessed the partitioning of each protein in the three chloroplast compartments by using a semiquantitative proteomics approach (spectral count). These data together with an in-depth investigation of the literature were compiled to provide accurate subplastidial localization of previously known and newly identified proteins. A unique knowledge base containing extensive information on the proteins identified in envelope fractions was thus obtained, allowing new insights into this membrane system to be revealed. Altogether, the data we obtained provide unexpected information about plastidial or subplastidial localization of some proteins that were not suspected to be associated to this membrane system. The spectral counting-based strategy was further

  16. Correlated mutations in protein sequences: Phylogenetic and structural effects

    Energy Technology Data Exchange (ETDEWEB)

    Lapedes, A.S. [Los Alamos National Lab., NM (United States). Theoretical Div.]|[Santa Fe Inst., NM (United States); Giraud, B.G. [C.E.N. Saclay, Gif/Yvette (France). Service Physique Theorique; Liu, L.C. [Los Alamos National Lab., NM (United States). Theoretical Div.; Stormo, G.D. [Univ. of Colorado, Boulder, CO (United States). Dept. of Molecular, Cellular and Developmental Biology

    1998-12-01

    Covariation analysis of sets of aligned sequences for RNA molecules is relatively successful in elucidating RNA secondary structure, as well as some aspects of tertiary structure. Covariation analysis of sets of aligned sequences for protein molecules is successful in certain instances in elucidating certain structural and functional links, but in general, pairs of sites displaying highly covarying mutations in protein sequences do not necessarily correspond to sites that are spatially close in the protein structure. In this paper the authors identify two reasons why naive use of covariation analysis for protein sequences fails to reliably indicate sequence positions that are spatially proximate. The first reason involves the bias introduced in calculation of covariation measures due to the fact that biological sequences are generally related by a non-trivial phylogenetic tree. The authors present a null-model approach to solve this problem. The second reason involves linked chains of covariation which can result in pairs of sites displaying significant covariation even though they are not spatially proximate. They present a maximum entropy solution to this classic problem of causation versus correlation. The methodologies are validated in simulation.

  17. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database.

    Science.gov (United States)

    Carver, Tim; Berriman, Matthew; Tivey, Adrian; Patel, Chinmay; Böhme, Ulrike; Barrell, Barclay G; Parkhill, Julian; Rajandream, Marie-Adèle

    2008-12-01

    Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences. Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text. Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/

  18. Alignment of high-throughput sequencing data inside in-memory databases.

    Science.gov (United States)

    Firnkorn, Daniel; Knaup-Gregori, Petra; Lorenzo Bermejo, Justo; Ganzinger, Matthias

    2014-01-01

    In times of high-throughput DNA sequencing techniques, performance-capable analysis of DNA sequences is of high importance. Computer supported DNA analysis is still an intensive time-consuming task. In this paper we explore the potential of a new In-Memory database technology by using SAP's High Performance Analytic Appliance (HANA). We focus on read alignment as one of the first steps in DNA sequence analysis. In particular, we examined the widely used Burrows-Wheeler Aligner (BWA) and implemented stored procedures in both, HANA and the free database system MySQL, to compare execution time and memory management. To ensure that the results are comparable, MySQL has been running in memory as well, utilizing its integrated memory engine for database table creation. We implemented stored procedures, containing exact and inexact searching of DNA reads within the reference genome GRCh37. Due to technical restrictions in SAP HANA concerning recursion, the inexact matching problem could not be implemented on this platform. Hence, performance analysis between HANA and MySQL was made by comparing the execution time of the exact search procedures. Here, HANA was approximately 27 times faster than MySQL which means, that there is a high potential within the new In-Memory concepts, leading to further developments of DNA analysis procedures in the future.

  19. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database

    Science.gov (United States)

    Carver, Tim; Berriman, Matthew; Tivey, Adrian; Patel, Chinmay; Böhme, Ulrike; Barrell, Barclay G.; Parkhill, Julian; Rajandream, Marie-Adèle

    2008-01-01

    Motivation: Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences. Results: Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text. Availability: Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/ Contact: artemis@sanger.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:18845581

  20. Large scale identification and categorization of protein sequences using structured logistic regression

    DEFF Research Database (Denmark)

    Pedersen, Bjørn Panella; Ifrim, Georgiana; Liboriussen, Poul

    2014-01-01

    Abstract Background Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well...... problem. Results Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known...... for further biochemical characterization and structural analysis....

  1. Semi-Supervised Learning for Classification of Protein Sequence Data

    Directory of Open Access Journals (Sweden)

    Brian R. King

    2008-01-01

    Full Text Available Protein sequence data continue to become available at an exponential rate. Annotation of functional and structural attributes of these data lags far behind, with only a small fraction of the data understood and labeled by experimental methods. Classification methods that are based on semi-supervised learning can increase the overall accuracy of classifying partly labeled data in many domains, but very few methods exist that have shown their effect on protein sequence classification. We show how proven methods from text classification can be applied to protein sequence data, as we consider both existing and novel extensions to the basic methods, and demonstrate restrictions and differences that must be considered. We demonstrate comparative results against the transductive support vector machine, and show superior results on the most difficult classification problems. Our results show that large repositories of unlabeled protein sequence data can indeed be used to improve predictive performance, particularly in situations where there are fewer labeled protein sequences available, and/or the data are highly unbalanced in nature.

  2. Revised Mimivirus major capsid protein sequence reveals intron-containing gene structure and extra domain

    Directory of Open Access Journals (Sweden)

    Suzan-Monti Marie

    2009-05-01

    Full Text Available Abstract Background Acanthamoebae polyphaga Mimivirus (APM is the largest known dsDNA virus. The viral particle has a nearly icosahedral structure with an internal capsid shell surrounded with a dense layer of fibrils. A Capsid protein sequence, D13L, was deduced from the APM L425 coding gene and was shown to be the most abundant protein found within the viral particle. However this protein remained poorly characterised until now. A revised protein sequence deposited in a database suggested an additional N-terminal stretch of 142 amino acids missing from the original deduced sequence. This result led us to investigate the L425 gene structure and the biochemical properties of the complete APM major Capsid protein. Results This study describes the full length 3430 bp Capsid coding gene and characterises the 593 amino acids long corresponding Capsid protein 1. The recombinant full length protein allowed the production of a specific monoclonal antibody able to detect the Capsid protein 1 within the viral particle. This protein appeared to be post-translationnally modified by glycosylation and phosphorylation. We proposed a secondary structure prediction of APM Capsid protein 1 compared to the Capsid protein structure of Paramecium Bursaria Chlorella Virus 1, another member of the Nucleo-Cytoplasmic Large DNA virus family. Conclusion The characterisation of the full length L425 Capsid coding gene of Acanthamoebae polyphaga Mimivirus provides new insights into the structure of the main Capsid protein. The production of a full length recombinant protein will be useful for further structural studies.

  3. The need for high-quality whole-genome sequence databases in microbial forensics.

    Science.gov (United States)

    Sjödin, Andreas; Broman, Tina; Melefors, Öjar; Andersson, Gunnar; Rasmusson, Birgitta; Knutsson, Rickard; Forsman, Mats

    2013-09-01

    Microbial forensics is an important part of a strengthened capability to respond to biocrime and bioterrorism incidents to aid in the complex task of distinguishing between natural outbreaks and deliberate acts. The goal of a microbial forensic investigation is to identify and criminally prosecute those responsible for a biological attack, and it involves a detailed analysis of the weapon--that is, the pathogen. The recent development of next-generation sequencing (NGS) technologies has greatly increased the resolution that can be achieved in microbial forensic analyses. It is now possible to identify, quickly and in an unbiased manner, previously undetectable genome differences between closely related isolates. This development is particularly relevant for the most deadly bacterial diseases that are caused by bacterial lineages with extremely low levels of genetic diversity. Whole-genome analysis of pathogens is envisaged to be increasingly essential for this purpose. In a microbial forensic context, whole-genome sequence analysis is the ultimate method for strain comparisons as it is informative during identification, characterization, and attribution--all 3 major stages of the investigation--and at all levels of microbial strain identity resolution (ie, it resolves the full spectrum from family to isolate). Given these capabilities, one bottleneck in microbial forensics investigations is the availability of high-quality reference databases of bacterial whole-genome sequences. To be of high quality, databases need to be curated and accurate in terms of sequences, metadata, and genetic diversity coverage. The development of whole-genome sequence databases will be instrumental in successfully tracing pathogens in the future.

  4. Prediction of glutathionylation sites in proteins using minimal sequence information and their experimental validation.

    Science.gov (United States)

    Pal, Debojyoti; Sharma, Deepak; Kumar, Mukesh; Sandur, Santosh K

    2016-09-01

    S-glutathionylation of proteins plays an important role in various biological processes and is known to be protective modification during oxidative stress. Since, experimental detection of S-glutathionylation is labor intensive and time consuming, bioinformatics based approach is a viable alternative. Available methods require relatively longer sequence information, which may prevent prediction if sequence information is incomplete. Here, we present a model to predict glutathionylation sites from pentapeptide sequences. It is based upon differential association of amino acids with glutathionylated and non-glutathionylated cysteines from a database of experimentally verified sequences. This data was used to calculate position dependent F-scores, which measure how a particular amino acid at a particular position may affect the likelihood of glutathionylation event. Glutathionylation-score (G-score), indicating propensity of a sequence to undergo glutathionylation, was calculated using position-dependent F-scores for each amino-acid. Cut-off values were used for prediction. Our model returned an accuracy of 58% with Matthew's correlation-coefficient (MCC) value of 0.165. On an independent dataset, our model outperformed the currently available model, in spite of needing much less sequence information. Pentapeptide motifs having high abundance among glutathionylated proteins were identified. A list of potential glutathionylation hotspot sequences were obtained by assigning G-scores and subsequent Protein-BLAST analysis revealed a total of 254 putative glutathionable proteins, a number of which were already known to be glutathionylated. Our model predicted glutathionylation sites in 93.93% of experimentally verified glutathionylated proteins. Outcome of this study may assist in discovering novel glutathionylation sites and finding candidate proteins for glutathionylation.

  5. Single-molecule protein sequencing through fingerprinting: computational assessment

    Science.gov (United States)

    Yao, Yao; Docter, Margreet; van Ginkel, Jetty; de Ridder, Dick; Joo, Chirlmin

    2015-10-01

    Proteins are vital in all biological systems as they constitute the main structural and functional components of cells. Recent advances in mass spectrometry have brought the promise of complete proteomics by helping draft the human proteome. Yet, this commonly used protein sequencing technique has fundamental limitations in sensitivity. Here we propose a method for single-molecule (SM) protein sequencing. A major challenge lies in the fact that proteins are composed of 20 different amino acids, which demands 20 molecular reporters. We computationally demonstrate that it suffices to measure only two types of amino acids to identify proteins and suggest an experimental scheme using SM fluorescence. When achieved, this highly sensitive approach will result in a paradigm shift in proteomics, with major impact in the biological and medical sciences.

  6. Single-molecule protein sequencing through fingerprinting: computational assessment

    International Nuclear Information System (INIS)

    Yao, Yao; Docter, Margreet; Van Ginkel, Jetty; Joo, Chirlmin; De Ridder, Dick

    2015-01-01

    Proteins are vital in all biological systems as they constitute the main structural and functional components of cells. Recent advances in mass spectrometry have brought the promise of complete proteomics by helping draft the human proteome. Yet, this commonly used protein sequencing technique has fundamental limitations in sensitivity. Here we propose a method for single-molecule (SM) protein sequencing. A major challenge lies in the fact that proteins are composed of 20 different amino acids, which demands 20 molecular reporters. We computationally demonstrate that it suffices to measure only two types of amino acids to identify proteins and suggest an experimental scheme using SM fluorescence. When achieved, this highly sensitive approach will result in a paradigm shift in proteomics, with major impact in the biological and medical sciences. (paper)

  7. Deep sequencing methods for protein engineering and design.

    Science.gov (United States)

    Wrenbeck, Emily E; Faber, Matthew S; Whitehead, Timothy A

    2017-08-01

    The advent of next-generation sequencing (NGS) has revolutionized protein science, and the development of complementary methods enabling NGS-driven protein engineering have followed. In general, these experiments address the functional consequences of thousands of protein variants in a massively parallel manner using genotype-phenotype linked high-throughput functional screens followed by DNA counting via deep sequencing. We highlight the use of information rich datasets to engineer protein molecular recognition. Examples include the creation of multiple dual-affinity Fabs targeting structurally dissimilar epitopes and engineering of a broad germline-targeted anti-HIV-1 immunogen. Additionally, we highlight the generation of enzyme fitness landscapes for conducting fundamental studies of protein behavior and evolution. We conclude with discussion of technological advances. Copyright © 2016 Elsevier Ltd. All rights reserved.

  8. Sequence analysis reveals how G protein-coupled receptors transduce the signal to the G protein.

    NARCIS (Netherlands)

    Oliveira, L.; Paiva, P.B.; Paiva, A.C.; Vriend, G.

    2003-01-01

    Sequence entropy-variability plots based on alignments of very large numbers of sequences-can indicate the location in proteins of the main active site and modulator sites. In the previous article in this issue, we applied this observation to a series of well-studied proteins and concluded that it

  9. WildSpan: mining structured motifs from protein sequences

    Directory of Open Access Journals (Sweden)

    Chen Chien-Yu

    2011-03-01

    Full Text Available Abstract Background Automatic extraction of motifs from biological sequences is an important research problem in study of molecular biology. For proteins, it is desired to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually largely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions that incorporates several pruning strategies to largely reduce the mining cost. Results WildSpan is shown to efficiently find W-patterns containing conserved residues that are far separated in sequences. We conducted experiments with two mining strategies, protein-based and family-based mining, to evaluate the usefulness of W-patterns and performance of WildSpan. The protein-based mining mode

  10. FASTERp: A Feature Array Search Tool for Estimating Resemblance of Protein Sequences

    Energy Technology Data Exchange (ETDEWEB)

    Macklin, Derek; Egan, Rob; Wang, Zhong

    2014-03-14

    Metagenome sequencing efforts have provided a large pool of billions of genes for identifying enzymes with desirable biochemical traits. However, homology search with billions of genes in a rapidly growing database has become increasingly computationally impractical. Here we present our pilot efforts to develop a novel alignment-free algorithm for homology search. Specifically, we represent individual proteins as feature vectors that denote the presence or absence of short kmers in the protein sequence. Similarity between feature vectors is then computed using the Tanimoto score, a distance metric that can be rapidly computed on bit string representations of feature vectors. Preliminary results indicate good correlation with optimal alignment algorithms (Spearman r of 0.87, ~;;1,000,000 proteins from Pfam), as well as with heuristic algorithms such as BLAST (Spearman r of 0.86, ~;;1,000,000 proteins). Furthermore, a prototype of FASTERp implemented in Python runs approximately four times faster than BLAST on a small scale dataset (~;;1000 proteins). We are optimizing and scaling to improve FASTERp to enable rapid homology searches against billion-protein databases, thereby enabling more comprehensive gene annotation efforts.

  11. Efficiency of Database Search for Identification of Mutated and Modified Proteins via Mass Spectrometry

    OpenAIRE

    Pevzner, Pavel A.; Mulyukov, Zufar; Dancik, Vlado; Tang, Chris L

    2001-01-01

    Although protein identification by matching tandem mass spectra (MS/MS) against protein databases is a widespread tool in mass spectrometry, the question about reliability of such searches remains open. Absence of rigorous significance scores in MS/MS database search makes it difficult to discard random database hits and may lead to erroneous protein identification, particularly in the case of mutated or post-translationally modified peptides. This problem is especially important for high-thr...

  12. Databases

    Digital Repository Service at National Institute of Oceanography (India)

    Kunte, P.D.

    Information on bibliographic as well as numeric/textual databases relevant to coastal geomorphology has been included in a tabular form. Databases cover a broad spectrum of related subjects like coastal environment and population aspects, coastline...

  13. Prediction of Carbohydrate-Binding Proteins from Sequences Using Support Vector Machines

    Directory of Open Access Journals (Sweden)

    Seizi Someya

    2010-01-01

    Full Text Available Carbohydrate-binding proteins are proteins that can interact with sugar chains but do not modify them. They are involved in many physiological functions, and we have developed a method for predicting them from their amino acid sequences. Our method is based on support vector machines (SVMs. We first clarified the definition of carbohydrate-binding proteins and then constructed positive and negative datasets with which the SVMs were trained. By applying the leave-one-out test to these datasets, our method delivered 0.92 of the area under the receiver operating characteristic (ROC curve. We also examined two amino acid grouping methods that enable effective learning of sequence patterns and evaluated the performance of these methods. When we applied our method in combination with the homology-based prediction method to the annotated human genome database, H-invDB, we found that the true positive rate of prediction was improved.

  14. Polymorphisms and resistance mutations of hepatitis C virus on sequences in the European hepatitis C virus database

    Science.gov (United States)

    Kliemann, Dimas Alexandre; Tovo, Cristiane Valle; da Veiga, Ana Beatriz Gorini; de Mattos, Angelo Alves; Wood, Charles

    2016-01-01

    AIM To evaluate the occurrence of resistant mutations in treatment-naïve hepatitis C virus (HCV) sequences deposited in the European hepatitis C virus database (euHCVdb). METHODS The sequences were downloaded from the euHCVdb (https://euhcvdb.ibcp.fr/euHCVdb/). The search was performed for full-length NS3 protease, NS5A and NS5B polymerase sequences of HCV, separated by genotypes 1a, 1b, 2a, 2b and 3a, and resulted in 798 NS3, 708 NS5A and 535 NS5B sequences from HCV genotypes 1a, 1b, 2a, 2b and 3a, after the exclusion of sequences containing errors and/or gaps or incomplete sequences, and sequences from patients previously treated with direct antiviral agents (DAA). The sequence alignment was performed with MEGA 6.06 MAC and the resulting protein sequences were then analyzed using the BioEdit 7.2.5. for mutations associated with resistance. Only positions that have been described as being associated with failure in treatment in in vivo studies, and/or as conferring a more than 2-fold change in replication in comparison to the wildtype reference strain in in vitro phenotypic assays were included in the analysis. RESULTS The Q80K variant in the NS3 gene was the most prevalent mutation, being found in 44.66% of subtype 1a and 0.25% of subtype 1b. Other frequent mutations observed in more than 2% of the NS3 sequences were: I170V (3.21%) in genotype 1a, and Y56F (15.93%), V132I (23.28%) and I170V (65.20%) in genotype 1b. For the NS5A, 2.21% of the genotype 1a sequences have the P58S mutation, 5.95% of genotype 1b sequences have the R30Q mutation, 15.79% of subtypes 2a sequences have the Q30R mutation, 23.08% of subtype 2b sequences have a L31M mutation, and in subtype 3a sequences, 23.08% have the M31L resistant variants. For the NS5B, the V321L RAV was identified in 0.60% of genotype 1a and in 0.32% of genotype 1b sequences, and the N142T variant was observed in 0.32% of subtype 1b sequences. The C316Y, S556G, D559N RAV were identified in 0.33%, 7.82% and 0.32% of

  15. Polymorphisms and resistance mutations of hepatitis C virus on sequences in the European hepatitis C virus database.

    Science.gov (United States)

    Kliemann, Dimas Alexandre; Tovo, Cristiane Valle; da Veiga, Ana Beatriz Gorini; de Mattos, Angelo Alves; Wood, Charles

    2016-10-28

    To evaluate the occurrence of resistant mutations in treatment-naïve hepatitis C virus (HCV) sequences deposited in the European hepatitis C virus database (euHCVdb). The sequences were downloaded from the euHCVdb (https://euhcvdb.ibcp.fr/euHCVdb/). The search was performed for full-length NS3 protease, NS5A and NS5B polymerase sequences of HCV, separated by genotypes 1a, 1b, 2a, 2b and 3a, and resulted in 798 NS3, 708 NS5A and 535 NS5B sequences from HCV genotypes 1a, 1b, 2a, 2b and 3a, after the exclusion of sequences containing errors and/or gaps or incomplete sequences, and sequences from patients previously treated with direct antiviral agents (DAA). The sequence alignment was performed with MEGA 6.06 MAC and the resulting protein sequences were then analyzed using the BioEdit 7.2.5. for mutations associated with resistance. Only positions that have been described as being associated with failure in treatment in in vivo studies, and/or as conferring a more than 2-fold change in replication in comparison to the wildtype reference strain in in vitro phenotypic assays were included in the analysis. The Q80K variant in the NS3 gene was the most prevalent mutation, being found in 44.66% of subtype 1a and 0.25% of subtype 1b. Other frequent mutations observed in more than 2% of the NS3 sequences were: I170V (3.21%) in genotype 1a, and Y56F (15.93%), V132I (23.28%) and I170V (65.20%) in genotype 1b. For the NS5A, 2.21% of the genotype 1a sequences have the P58S mutation, 5.95% of genotype 1b sequences have the R30Q mutation, 15.79% of subtypes 2a sequences have the Q30R mutation, 23.08% of subtype 2b sequences have a L31M mutation, and in subtype 3a sequences, 23.08% have the M31L resistant variants. For the NS5B, the V321L RAV was identified in 0.60% of genotype 1a and in 0.32% of genotype 1b sequences, and the N142T variant was observed in 0.32% of subtype 1b sequences. The C316Y, S556G, D559N RAV were identified in 0.33%, 7.82% and 0.32% of genotype 1b sequences

  16. UFO: a web server for ultra-fast functional profiling of whole genome protein sequences.

    Science.gov (United States)

    Meinicke, Peter

    2009-09-02

    Functional profiling is a key technique to characterize and compare the functional potential of entire genomes. The estimation of profiles according to an assignment of sequences to functional categories is a computationally expensive task because it requires the comparison of all protein sequences from a genome with a usually large database of annotated sequences or sequence families. Based on machine learning techniques for Pfam domain detection, the UFO web server for ultra-fast functional profiling allows researchers to process large protein sequence collections instantaneously. Besides the frequencies of Pfam and GO categories, the user also obtains the sequence specific assignments to Pfam domain families. In addition, a comparison with existing genomes provides dissimilarity scores with respect to 821 reference proteomes. Considering the underlying UFO domain detection, the results on 206 test genomes indicate a high sensitivity of the approach. In comparison with current state-of-the-art HMMs, the runtime measurements show a considerable speed up in the range of four orders of magnitude. For an average size prokaryotic genome, the computation of a functional profile together with its comparison typically requires about 10 seconds of processing time. For the first time the UFO web server makes it possible to get a quick overview on the functional inventory of newly sequenced organisms. The genome scale comparison with a large number of precomputed profiles allows a first guess about functionally related organisms. The service is freely available and does not require user registration or specification of a valid email address.

  17. Structure and Sequence Search on Aptamer-Protein Docking

    Science.gov (United States)

    Xiao, Jiajie; Bonin, Keith; Guthold, Martin; Salsbury, Freddie

    2015-03-01

    Interactions between proteins and deoxyribonucleic acid (DNA) play a significant role in the living systems, especially through gene regulation. However, short nucleic acids sequences (aptamers) with specific binding affinity to specific proteins exhibit clinical potential as therapeutics. Our capillary and gel electrophoresis selection experiments show that specific sequences of aptamers can be selected that bind specific proteins. Computationally, given the experimentally-determined structure and sequence of a thrombin-binding aptamer, we can successfully dock the aptamer onto thrombin in agreement with experimental structures of the complex. In order to further study the conformational flexibility of this thrombin-binding aptamer and to potentially develop a predictive computational model of aptamer-binding, we use GPU-enabled molecular dynamics simulations to both examine the conformational flexibility of the aptamer in the absence of binding to thrombin, and to determine our ability to fold an aptamer. This study should help further de-novo predictions of aptamer sequences by enabling the study of structural and sequence-dependent effects on aptamer-protein docking specificity.

  18. ProDis-ContSHC: Learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval

    KAUST Repository

    Wang, Jim Jing-Yan; Gao, Xin; Wang, Quanquan; Li, Yongping

    2012-01-01

    Background: The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity

  19. EKPD: a hierarchical database of eukaryotic protein kinases and protein phosphatases.

    Science.gov (United States)

    Wang, Yongbo; Liu, Zexian; Cheng, Han; Gao, Tianshun; Pan, Zhicheng; Yang, Qing; Guo, Anyuan; Xue, Yu

    2014-01-01

    We present here EKPD (http://ekpd.biocuckoo.org), a hierarchical database of eukaryotic protein kinases (PKs) and protein phosphatases (PPs), the key molecules responsible for the reversible phosphorylation of proteins that are involved in almost all aspects of biological processes. As extensive experimental and computational efforts have been carried out to identify PKs and PPs, an integrative resource with detailed classification and annotation information would be of great value for both experimentalists and computational biologists. In this work, we first collected 1855 PKs and 347 PPs from the scientific literature and various public databases. Based on previously established rationales, we classified all of the known PKs and PPs into a hierarchical structure with three levels, i.e. group, family and individual PK/PP. There are 10 groups with 149 families for the PKs and 10 groups with 33 families for the PPs. We constructed 139 and 27 Hidden Markov Model profiles for PK and PP families, respectively. Then we systematically characterized ∼50,000 PKs and >10,000 PPs in eukaryotes. In addition, >500 PKs and >400 PPs were computationally identified by ortholog search. Finally, the online service of the EKPD database was implemented in PHP + MySQL + JavaScript.

  20. SynechoNET: integrated protein-protein interaction database of a model cyanobacterium Synechocystis sp. PCC 6803

    OpenAIRE

    Kim, Woo-Yeon; Kang, Sungsoo; Kim, Byoung-Chul; Oh, Jeehyun; Cho, Seongwoong; Bhak, Jong; Choi, Jong-Soon

    2008-01-01

    Background Cyanobacteria are model organisms for studying photosynthesis, carbon and nitrogen assimilation, evolution of plant plastids, and adaptability to environmental stresses. Despite many studies on cyanobacteria, there is no web-based database of their regulatory and signaling protein-protein interaction networks to date. Description We report a database and website SynechoNET that provides predicted protein-protein interactions. SynechoNET shows cyanobacterial domain-domain interactio...

  1. Prediction of Protein Hotspots from Whole Protein Sequences by a Random Projection Ensemble System

    Directory of Open Access Journals (Sweden)

    Jinjian Jiang

    2017-07-01

    Full Text Available Hotspot residues are important in the determination of protein-protein interactions, and they always perform specific functions in biological processes. The determination of hotspot residues is by the commonly-used method of alanine scanning mutagenesis experiments, which is always costly and time consuming. To address this issue, computational methods have been developed. Most of them are structure based, i.e., using the information of solved protein structures. However, the number of solved protein structures is extremely less than that of sequences. Moreover, almost all of the predictors identified hotspots from the interfaces of protein complexes, seldom from the whole protein sequences. Therefore, determining hotspots from whole protein sequences by sequence information alone is urgent. To address the issue of hotspot predictions from the whole sequences of proteins, we proposed an ensemble system with random projections using statistical physicochemical properties of amino acids. First, an encoding scheme involving sequence profiles of residues and physicochemical properties from the AAindex1 dataset is developed. Then, the random projection technique was adopted to project the encoding instances into a reduced space. Then, several better random projections were obtained by training an IBk classifier based on the training dataset, which were thus applied to the test dataset. The ensemble of random projection classifiers is therefore obtained. Experimental results showed that although the performance of our method is not good enough for real applications of hotspots, it is very promising in the determination of hotspot residues from whole sequences.

  2. A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection.

    Science.gov (United States)

    Goodacre, Norman; Aljanahi, Aisha; Nandakumar, Subhiksha; Mikailov, Mike; Khan, Arifa S

    2018-01-01

    Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, and with overall reduced cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined, semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, sequence diversity, and a reduced cellular content; it includes full-length and partial sequences and endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2, with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publically available for facilitating HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank. IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have

  3. Osteocalcin protein sequences of Neanderthals and modern primates.

    Science.gov (United States)

    Nielsen-Marsh, Christina M; Richards, Michael P; Hauschka, Peter V; Thomas-Oates, Jane E; Trinkaus, Erik; Pettitt, Paul B; Karavanic, Ivor; Poinar, Hendrik; Collins, Matthew J

    2005-03-22

    We report here protein sequences of fossil hominids, from two Neanderthals dating to approximately 75,000 years old from Shanidar Cave in Iraq. These sequences, the oldest reported fossil primate protein sequences, are of bone osteocalcin, which was extracted and sequenced by using MALDI-TOF/TOF mass spectrometry. Through a combination of direct sequencing and peptide mass mapping, we determined that Neanderthals have an osteocalcin amino acid sequence that is identical to that of modern humans. We also report complete osteocalcin sequences for chimpanzee (Pan troglodytes) and gorilla (Gorilla gorilla gorilla) and a partial sequence for orangutan (Pongo pygmaeus), all of which are previously unreported. We found that the osteocalcin sequences of Neanderthals, modern human, chimpanzee, and orangutan are unusual among mammals in that the ninth amino acid is proline (Pro-9), whereas most species have hydroxyproline (Hyp-9). Posttranslational hydroxylation of Pro-9 in osteocalcin by prolyl-4-hydroxylase requires adequate concentrations of vitamin C (l-ascorbic acid), molecular O(2), Fe(2+), and 2-oxoglutarate, and also depends on enzyme recognition of the target proline substrate consensus sequence Leu-Gly-Ala-Pro-9-Ala-Pro-Tyr occurring in most mammals. In five species with Pro-9-Val-10, hydroxylation is blocked, whereas in gorilla there is a mixture of Pro-9 and Hyp-9. We suggest that the absence of hydroxylation of Pro-9 in Pan, Pongo, and Homo may reflect response to a selective pressure related to a decline in vitamin C in the diet during omnivorous dietary adaptation, either independently or through the common ancestor of these species.

  4. Large-scale analysis of intrinsic disorder flavors and associated functions in the protein sequence universe.

    Science.gov (United States)

    Necci, Marco; Piovesan, Damiano; Tosatto, Silvio C E

    2016-12-01

    Intrinsic disorder (ID) in proteins has been extensively described for the last decade; a large-scale classification of ID in proteins is mostly missing. Here, we provide an extensive analysis of ID in the protein universe on the UniProt database derived from sequence-based predictions in MobiDB. Almost half the sequences contain an ID region of at least five residues. About 9% of proteins have a long ID region of over 20 residues which are more abundant in Eukaryotic organisms and most frequently cover less than 20% of the sequence. A small subset of about 67,000 (out of over 80 million) proteins is fully disordered and mostly found in Viruses. Most proteins have only one ID, with short ID evenly distributed along the sequence and long ID overrepresented in the center. The charged residue composition of Das and Pappu was used to classify ID proteins by structural propensities and corresponding functional enrichment. Swollen Coils seem to be used mainly as structural components and in biosynthesis in both Prokaryotes and Eukaryotes. In Bacteria, they are confined in the nucleoid and in Viruses provide DNA binding function. Coils & Hairpins seem to be specialized in ribosome binding and methylation activities. Globules & Tadpoles bind antigens in Eukaryotes but are involved in killing other organisms and cytolysis in Bacteria. The Undefined class is used by Bacteria to bind toxic substances and mediate transport and movement between and within organisms in Viruses. Fully disordered proteins behave similarly, but are enriched for glycine residues and extracellular structures. © 2016 The Protein Society.

  5. An expressed sequence tag (EST) data mining strategy succeeding in the discovery of new G-protein coupled receptors.

    Science.gov (United States)

    Wittenberger, T; Schaller, H C; Hellebrand, S

    2001-03-30

    We have developed a comprehensive expressed sequence tag database search method and used it for the identification of new members of the G-protein coupled receptor superfamily. Our approach proved to be especially useful for the detection of expressed sequence tag sequences that do not encode conserved parts of a protein, making it an ideal tool for the identification of members of divergent protein families or of protein parts without conserved domain structures in the expressed sequence tag database. At least 14 of the expressed sequence tags found with this strategy are promising candidates for new putative G-protein coupled receptors. Here, we describe the sequence and expression analysis of five new members of this receptor superfamily, namely GPR84, GPR86, GPR87, GPR90 and GPR91. We also studied the genomic structure and chromosomal localization of the respective genes applying in silico methods. A cluster of six closely related G-protein coupled receptors was found on the human chromosome 3q24-3q25. It consists of four orphan receptors (GPR86, GPR87, GPR91, and H963), the purinergic receptor P2Y1, and the uridine 5'-diphosphoglucose receptor KIAA0001. It seems likely that these receptors evolved from a common ancestor and therefore might have related ligands. In conclusion, we describe a data mining procedure that proved to be useful for the identification and first characterization of new genes and is well applicable for other gene families. Copyright 2001 Academic Press.

  6. Experimental Rugged Fitness Landscape in Protein Sequence Space

    OpenAIRE

    HAYASHI, Yuuki; 相田, 拓洋; TOYOTA, Hitoshi; 伏見, 譲; URABE, Itaru; YOMO, Tetsuya

    2006-01-01

    The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12-130 of the initial random polypeptide and selection for infectivity, the selected phag...

  7. Sequence alignment reveals possible MAPK docking motifs on HIV proteins.

    Directory of Open Access Journals (Sweden)

    Perry Evans

    Full Text Available Over the course of HIV infection, virus replication is facilitated by the phosphorylation of HIV proteins by human ERK1 and ERK2 mitogen-activated protein kinases (MAPKs. MAPKs are known to phosphorylate their substrates by first binding with them at a docking site. Docking site interactions could be viable drug targets because the sequences guiding them are more specific than phosphorylation consensus sites. In this study we use multiple bioinformatics tools to discover candidate MAPK docking site motifs on HIV proteins known to be phosphorylated by MAPKs, and we discuss the possibility of targeting docking sites with drugs. Using sequence alignments of HIV proteins of different subtypes, we show that MAPK docking patterns previously described for human proteins appear on the HIV matrix, Tat, and Vif proteins in a strain dependent manner, but are absent from HIV Rev and appear on all HIV Nef strains. We revise the regular expressions of previously annotated MAPK docking patterns in order to provide a subtype independent motif that annotates all HIV proteins. One revision is based on a documented human variant of one of the substrate docking motifs, and the other reduces the number of required basic amino acids in the standard docking motifs from two to one. The proposed patterns are shown to be consistent with in silico docking between ERK1 and the HIV matrix protein. The motif usage on HIV proteins is sufficiently different from human proteins in amino acid sequence similarity to allow for HIV specific targeting using small-molecule drugs.

  8. Prediction of host - pathogen protein interactions between Mycobacterium tuberculosis and Homo sapiens using sequence motifs.

    Science.gov (United States)

    Huo, Tong; Liu, Wei; Guo, Yu; Yang, Cheng; Lin, Jianping; Rao, Zihe

    2015-03-26

    Emergence of multiple drug resistant strains of M. tuberculosis (MDR-TB) threatens to derail global efforts aimed at reigning in the pathogen. Co-infections of M. tuberculosis with HIV are difficult to treat. To counter these new challenges, it is essential to study the interactions between M. tuberculosis and the host to learn how these bacteria cause disease. We report a systematic flow to predict the host pathogen interactions (HPIs) between M. tuberculosis and Homo sapiens based on sequence motifs. First, protein sequences were used as initial input for identifying the HPIs by 'interolog' method. HPIs were further filtered by prediction of domain-domain interactions (DDIs). Functional annotations of protein and publicly available experimental results were applied to filter the remaining HPIs. Using such a strategy, 118 pairs of HPIs were identified, which involve 43 proteins from M. tuberculosis and 48 proteins from Homo sapiens. A biological interaction network between M. tuberculosis and Homo sapiens was then constructed using the predicted inter- and intra-species interactions based on the 118 pairs of HPIs. Finally, a web accessible database named PATH (Protein interactions of M. tuberculosis and Human) was constructed to store these predicted interactions and proteins. This interaction network will facilitate the research on host-pathogen protein-protein interactions, and may throw light on how M. tuberculosis interacts with its host.

  9. Functional role of bacteriophage transfer RNAs: codon usage analysis of genomic sequences stored in the GENBANK/EMBL/DDBJ databases

    Directory of Open Access Journals (Sweden)

    T Kunisawa

    2006-01-01

    Full Text Available Complete genomic sequence data are stored in the public GenBank/EMBL/DDBJ databases so that any investigator can make use of the data. This report describes a comparative analysis of codon usage that is impossible without such a public and open data system. A limited number of bacteriophages harbor their own transfer RNAs. Based on a comparison between T4 phage-encoded tRNA species and the relative cellular amounts of host Escherichia coli tRNAs, it is hypothesized that T4 tRNAs could serve to supplement host isoacceptor tRNA species that are present in minor amounts and thus enhance the translational efficiency of phage proteins. When compared to their respective host bacteria, the codon usage data of bacteriophages D3, φC31, HP1, D29 and 933W all show an increased frequency of synonymous codons or amino acids that correspond to phage tRNA species, suggesting their supplemental role in the efficient production of phage proteins. The data-analysis presents an example in which the availability of an open and fully accessible database system would allow one to obtain comprehensive insights into a fundamental problem in molecular biology.

  10. TranslatomeDB: a comprehensive database and cloud-based analysis platform for translatome sequencing data.

    Science.gov (United States)

    Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie; Zhang, Gong

    2018-01-04

    Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  11. HomPPI: a class of sequence homology based protein-protein interface prediction methods

    Directory of Open Access Journals (Sweden)

    Dobbs Drena

    2011-06-01

    Full Text Available Abstract Background Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. Results We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i NPS-HomPPI (Non partner-specific HomPPI, which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii PS-HomPPI (Partner-specific HomPPI, which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of

  12. Sequence charge decoration dictates coil-globule transition in intrinsically disordered proteins

    Science.gov (United States)

    Firman, Taylor; Ghosh, Kingshuk

    2018-03-01

    We present an analytical theory to compute conformations of heteropolymers—applicable to describe disordered proteins—as a function of temperature and charge sequence. The theory describes coil-globule transition for a given protein sequence when temperature is varied and has been benchmarked against the all-atom Monte Carlo simulation (using CAMPARI) of intrinsically disordered proteins (IDPs). In addition, the model quantitatively shows how subtle alterations of charge placement in the primary sequence—while maintaining the same charge composition—can lead to significant changes in conformation, even as drastic as a coil (swelled above a purely random coil) to globule (collapsed below a random coil) and vice versa. The theory provides insights on how to control (enhance or suppress) these changes by tuning the temperature (or solution condition) and charge decoration. As an application, we predict the distribution of conformations (at room temperature) of all naturally occurring IDPs in the DisProt database and notice significant size variation even among IDPs with a similar composition of positive and negative charges. Based on this, we provide a new diagram-of-states delineating the sequence-conformation relation for proteins in the DisProt database. Next, we study the effect of post-translational modification, e.g., phosphorylation, on IDP conformations. Modifications as little as two-site phosphorylation can significantly alter the size of an IDP with everything else being constant (temperature, salt concentration, etc.). However, not all possible modification sites have the same effect on protein conformations; there are certain "hot spots" that can cause maximal change in conformation. The location of these "hot spots" in the parent sequence can readily be identified by using a sequence charge decoration metric originally introduced by Sawle and Ghosh. The ability of our model to predict conformations (both expanded and collapsed states) of IDPs at

  13. Fast and simple protein-alignment-guided assembly of orthologous gene families from microbiome sequencing reads.

    Science.gov (United States)

    Huson, Daniel H; Tappu, Rewati; Bazinet, Adam L; Xie, Chao; Cummings, Michael P; Nieselt, Kay; Williams, Rohan

    2017-01-25

    Microbiome sequencing projects typically collect tens of millions of short reads per sample. Depending on the goals of the project, the short reads can either be subjected to direct sequence analysis or be assembled into longer contigs. The assembly of whole genomes from metagenomic sequencing reads is a very difficult problem. However, for some questions, only specific genes of interest need to be assembled. This is then a gene-centric assembly where the goal is to assemble reads into contigs for a family of orthologous genes. We present a new method for performing gene-centric assembly, called protein-alignment-guided assembly, and provide an implementation in our metagenome analysis tool MEGAN. Genes are assembled on the fly, based on the alignment of all reads against a protein reference database such as NCBI-nr. Specifically, the user selects a gene family based on a classification such as KEGG and all reads binned to that gene family are assembled. Using published synthetic community metagenome sequencing reads and a set of 41 gene families, we show that the performance of this approach compares favorably with that of full-featured assemblers and that of a recently published HMM-based gene-centric assembler, both in terms of the number of reference genes detected and of the percentage of reference sequence covered. Protein-alignment-guided assembly of orthologous gene families complements whole-metagenome assembly in a new and very useful way.

  14. Sequence analysis corresponding to the PPE and PE proteins in ...

    Indian Academy of Sciences (India)

    Unknown

    AB repeats; Mycobacterium tuberculosis genome; PE-PPE domain; PPE, PE proteins; sequence analysis; surface antigens. J. Biosci. | Vol. ... bacterium tuberculosis genomes resulted in the identification of a previously uncharacterized 225 amino acid- ...... Vega Lopez F, Brooks L A, Dockrell H M, De Smet K A,. Thompson ...

  15. Representation of protein-sequence information by amino acid subalphabets

    DEFF Research Database (Denmark)

    Andersen, C.A.F.; Brunak, Søren

    2004-01-01

    -sequence information, using machine learning strategies, where the primary goal is the discovery of novel powerful representations for use in AI techniques. In the case of proteins and the 20 different amino acids they typically contain, it is also a secondary goal to discover how the current selection of amino acids...

  16. Databases

    Directory of Open Access Journals (Sweden)

    Nick Ryan

    2004-01-01

    Full Text Available Databases are deeply embedded in archaeology, underpinning and supporting many aspects of the subject. However, as well as providing a means for storing, retrieving and modifying data, databases themselves must be a result of a detailed analysis and design process. This article looks at this process, and shows how the characteristics of data models affect the process of database design and implementation. The impact of the Internet on the development of databases is examined, and the article concludes with a discussion of a range of issues associated with the recording and management of archaeological data.

  17. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation

    Directory of Open Access Journals (Sweden)

    Rognes Torbjørn

    2011-06-01

    Full Text Available Abstract Background The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. Results A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Conclusions Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.

  18. Secure and robust cloud computing for high-throughput forensic microsatellite sequence analysis and databasing.

    Science.gov (United States)

    Bailey, Sarah F; Scheible, Melissa K; Williams, Christopher; Silva, Deborah S B S; Hoggan, Marina; Eichman, Christopher; Faith, Seth A

    2017-11-01

    Next-generation Sequencing (NGS) is a rapidly evolving technology with demonstrated benefits for forensic genetic applications, and the strategies to analyze and manage the massive NGS datasets are currently in development. Here, the computing, data storage, connectivity, and security resources of the Cloud were evaluated as a model for forensic laboratory systems that produce NGS data. A complete front-to-end Cloud system was developed to upload, process, and interpret raw NGS data using a web browser dashboard. The system was extensible, demonstrating analysis capabilities of autosomal and Y-STRs from a variety of NGS instrumentation (Illumina MiniSeq and MiSeq, and Oxford Nanopore MinION). NGS data for STRs were concordant with standard reference materials previously characterized with capillary electrophoresis and Sanger sequencing. The computing power of the Cloud was implemented with on-demand auto-scaling to allow multiple file analysis in tandem. The system was designed to store resulting data in a relational database, amenable to downstream sample interpretations and databasing applications following the most recent guidelines in nomenclature for sequenced alleles. Lastly, a multi-layered Cloud security architecture was tested and showed that industry standards for securing data and computing resources were readily applied to the NGS system without disadvantageous effects for bioinformatic analysis, connectivity or data storage/retrieval. The results of this study demonstrate the feasibility of using Cloud-based systems for secured NGS data analysis, storage, databasing, and multi-user distributed connectivity. Copyright © 2017 Elsevier B.V. All rights reserved.

  19. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation.

    Science.gov (United States)

    Rognes, Torbjørn

    2011-06-01

    The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.

  20. RStrucFam: a web server to associate structure and cognate RNA for RNA-binding proteins from sequence information.

    Science.gov (United States)

    Ghosh, Pritha; Mathew, Oommen K; Sowdhamini, Ramanathan

    2016-10-07

    RNA-binding proteins (RBPs) interact with their cognate RNA(s) to form large biomolecular assemblies. They are versatile in their functionality and are involved in a myriad of processes inside the cell. RBPs with similar structural features and common biological functions are grouped together into families and superfamilies. It will be useful to obtain an early understanding and association of RNA-binding property of sequences of gene products. Here, we report a web server, RStrucFam, to predict the structure, type of cognate RNA(s) and function(s) of proteins, where possible, from mere sequence information. The web server employs Hidden Markov Model scan (hmmscan) to enable association to a back-end database of structural and sequence families. The database (HMMRBP) comprises of 437 HMMs of RBP families of known structure that have been generated using structure-based sequence alignments and 746 sequence-centric RBP family HMMs. The input protein sequence is associated with structural or sequence domain families, if structure or sequence signatures exist. In case of association of the protein with a family of known structures, output features like, multiple structure-based sequence alignment (MSSA) of the query with all others members of that family is provided. Further, cognate RNA partner(s) for that protein, Gene Ontology (GO) annotations, if any and a homology model of the protein can be obtained. The users can also browse through the database for details pertaining to each family, protein or RNA and their related information based on keyword search or RNA motif search. RStrucFam is a web server that exploits structurally conserved features of RBPs, derived from known family members and imprinted in mathematical profiles, to predict putative RBPs from sequence information. Proteins that fail to associate with such structure-centric families are further queried against the sequence-centric RBP family HMMs in the HMMRBP database. Further, all other essential

  1. Biophysical and structural considerations for protein sequence evolution

    Directory of Open Access Journals (Sweden)

    Grahnen Johan A

    2011-12-01

    Full Text Available Abstract Background Protein sequence evolution is constrained by the biophysics of folding and function, causing interdependence between interacting sites in the sequence. However, current site-independent models of sequence evolutions do not take this into account. Recent attempts to integrate the influence of structure and biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected improvements in model performance. This suggests that further innovations are needed for progress in this field. Results Here we develop a coarse-grained physics-based model of protein folding and binding function, and compare it to a popular informational model. We find that both models violate the assumption of the native sequence being close to a thermodynamic optimum, causing directional selection away from the native state. Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that vary less among residue type. The informational model diffuses further in sequence space with fewer barriers and tends to provide less support for an invariant sites model, although amino acid substitutions are generally conservative. Both approaches produce sequences with natural features like dN/dS Conclusions Simple coarse-grained models of protein folding can describe some natural features of evolving proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding energy, and binding function, as regards both native and non-native conformations, and describe a large number of possible applications for such a model.

  2. Protein sequencing via nanopore based devices: a nanofluidics perspective

    Science.gov (United States)

    Chinappi, Mauro; Cecconi, Fabio

    2018-05-01

    Proteins perform a huge number of central functions in living organisms, thus all the new techniques allowing their precise, fast and accurate characterization at single-molecule level certainly represent a burst in proteomics with important biomedical impact. In this review, we describe the recent progresses in the developing of nanopore based devices for protein sequencing. We start with a critical analysis of the main technical requirements for nanopore protein sequencing, summarizing some ideas and methodologies that have recently appeared in the literature. In the last sections, we focus on the physical modelling of the transport phenomena occurring in nanopore based devices. The multiscale nature of the problem is discussed and, in this respect, some of the main possible computational approaches are illustrated.

  3. Sequence heterogeneity accelerates protein search for targets on DNA

    International Nuclear Information System (INIS)

    Shvets, Alexey A.; Kolomeisky, Anatoly B.

    2015-01-01

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome

  4. Sequence heterogeneity accelerates protein search for targets on DNA

    Energy Technology Data Exchange (ETDEWEB)

    Shvets, Alexey A.; Kolomeisky, Anatoly B., E-mail: tolya@rice.edu [Department of Chemistry and Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005 (United States)

    2015-12-28

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome.

  5. Determining and comparing protein function in Bacterial genome sequences

    DEFF Research Database (Denmark)

    Vesth, Tammi Camilla

    of this class have very little homology to other known genomes making functional annotation based on sequence similarity very difficult. Inspired in part by this analysis, an approach for comparative functional annotation was created based public sequenced genomes, CMGfunc. Functionally related groups......In November 2013, there was around 21.000 different prokaryotic genomes sequenced and publicly available, and the number is growing daily with another 20.000 or more genomes expected to be sequenced and deposited by the end of 2014. An important part of the analysis of this data is the functional...... annotation of genes – the descriptions assigned to genes that describe the likely function of the encoded proteins. This process is limited by several factors, including the definition of a function which can be more or less specific as well as how many genes can actually be assigned a function based...

  6. Relationships between residue Voronoi volume and sequence conservation in proteins.

    Science.gov (United States)

    Liu, Jen-Wei; Cheng, Chih-Wen; Lin, Yu-Feng; Chen, Shao-Yu; Hwang, Jenn-Kang; Yen, Shih-Chung

    2018-02-01

    Functional and biophysical constraints can cause different levels of sequence conservation in proteins. Previously, structural properties, e.g., relative solvent accessibility (RSA) and packing density of the weighted contact number (WCN), have been found to be related to protein sequence conservation (CS). The Voronoi volume has recently been recognized as a new structural property of the local protein structural environment reflecting CS. However, for surface residues, it is sensitive to water molecules surrounding the protein structure. Herein, we present a simple structural determinant termed the relative space of Voronoi volume (RSV); it uses the Voronoi volume and the van der Waals volume of particular residues to quantify the local structural environment. RSV (range, 0-1) is defined as (Voronoi volume-van der Waals volume)/Voronoi volume of the target residue. The concept of RSV describes the extent of available space for every protein residue. RSV and Voronoi profiles with and without water molecules (RSVw, RSV, VOw, and VO) were compared for 554 non-homologous proteins. RSV (without water) showed better Pearson's correlations with CS than did RSVw, VO, or VOw values. The mean correlation coefficient between RSV and CS was 0.51, which is comparable to the correlation between RSA and CS (0.49) and that between WCN and CS (0.56). RSV is a robust structural descriptor with and without water molecules and can quantitatively reflect evolutionary information in a single protein structure. Therefore, it may represent a practical structural determinant to study protein sequence, structure, and function relationships. Copyright © 2017 Elsevier B.V. All rights reserved.

  7. Sequence motifs in MADS transcription factors responsible for specificity and diversification of protein-protein interaction.

    Directory of Open Access Journals (Sweden)

    Aalt D J van Dijk

    Full Text Available Protein sequences encompass tertiary structures and contain information about specific molecular interactions, which in turn determine biological functions of proteins. Knowledge about how protein sequences define interaction specificity is largely missing, in particular for paralogous protein families with high sequence similarity, such as the plant MADS domain transcription factor family. In comparison to the situation in mammalian species, this important family of transcription regulators has expanded enormously in plant species and contains over 100 members in the model plant species Arabidopsis thaliana. Here, we provide insight into the mechanisms that determine protein-protein interaction specificity for the Arabidopsis MADS domain transcription factor family, using an integrated computational and experimental approach. Plant MADS proteins have highly similar amino acid sequences, but their dimerization patterns vary substantially. Our computational analysis uncovered small sequence regions that explain observed differences in dimerization patterns with reasonable accuracy. Furthermore, we show the usefulness of the method for prediction of MADS domain transcription factor interaction networks in other plant species. Introduction of mutations in the predicted interaction motifs demonstrated that single amino acid mutations can have a large effect and lead to loss or gain of specific interactions. In addition, various performed bioinformatics analyses shed light on the way evolution has shaped MADS domain transcription factor interaction specificity. Identified protein-protein interaction motifs appeared to be strongly conserved among orthologs, indicating their evolutionary importance. We also provide evidence that mutations in these motifs can be a source for sub- or neo-functionalization. The analyses presented here take us a step forward in understanding protein-protein interactions and the interplay between protein sequences and

  8. Ultra-fast evaluation of protein energies directly from sequence.

    Directory of Open Access Journals (Sweden)

    Gevorg Grigoryan

    2006-06-01

    Full Text Available The structure, function, stability, and many other properties of a protein in a fixed environment are fully specified by its sequence, but in a manner that is difficult to discern. We present a general approach for rapidly mapping sequences directly to their energies on a pre-specified rigid backbone, an important sub-problem in computational protein design and in some methods for protein structure prediction. The cluster expansion (CE method that we employ can, in principle, be extended to model any computable or measurable protein property directly as a function of sequence. Here we show how CE can be applied to the problem of computational protein design, and use it to derive excellent approximations of physical potentials. The approach provides several attractive advantages. First, following a one-time derivation of a CE expansion, the amount of time necessary to evaluate the energy of a sequence adopting a specified backbone conformation is reduced by a factor of 10(7 compared to standard full-atom methods for the same task. Second, the agreement between two full-atom methods that we tested and their CE sequence-based expressions is very high (root mean square deviation 1.1-4.7 kcal/mol, R2 = 0.7-1.0. Third, the functional form of the CE energy expression is such that individual terms of the expansion have clear physical interpretations. We derived expressions for the energies of three classic protein design targets-a coiled coil, a zinc finger, and a WW domain-as functions of sequence, and examined the most significant terms. Single-residue and residue-pair interactions are sufficient to accurately capture the energetics of the dimeric coiled coil, whereas higher-order contributions are important for the two more globular folds. For the task of designing novel zinc-finger sequences, a CE-derived energy function provides significantly better solutions than a standard design protocol, in comparable computation time. Given these advantages

  9. HippDB: a database of readily targeted helical protein-protein interactions.

    Science.gov (United States)

    Bergey, Christina M; Watkins, Andrew M; Arora, Paramjit S

    2013-11-01

    HippDB catalogs every protein-protein interaction whose structure is available in the Protein Data Bank and which exhibits one or more helices at the interface. The Web site accepts queries on variables such as helix length and sequence, and it provides computational alanine scanning and change in solvent-accessible surface area values for every interfacial residue. HippDB is intended to serve as a starting point for structure-based small molecule and peptidomimetic drug development. HippDB is freely available on the web at http://www.nyu.edu/projects/arora/hippdb. The Web site is implemented in PHP, MySQL and Apache. Source code freely available for download at http://code.google.com/p/helidb, implemented in Perl and supported on Linux. arora@nyu.edu.

  10. Genomic Enzymology: Web Tools for Leveraging Protein Family Sequence-Function Space and Genome Context to Discover Novel Functions.

    Science.gov (United States)

    Gerlt, John A

    2017-08-22

    The exponentially increasing number of protein and nucleic acid sequences provides opportunities to discover novel enzymes, metabolic pathways, and metabolites/natural products, thereby adding to our knowledge of biochemistry and biology. The challenge has evolved from generating sequence information to mining the databases to integrating and leveraging the available information, i.e., the availability of "genomic enzymology" web tools. Web tools that allow identification of biosynthetic gene clusters are widely used by the natural products/synthetic biology community, thereby facilitating the discovery of novel natural products and the enzymes responsible for their biosynthesis. However, many novel enzymes with interesting mechanisms participate in uncharacterized small-molecule metabolic pathways; their discovery and functional characterization also can be accomplished by leveraging information in protein and nucleic acid databases. This Perspective focuses on two genomic enzymology web tools that assist the discovery novel metabolic pathways: (1) Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) for generating sequence similarity networks to visualize and analyze sequence-function space in protein families and (2) Enzyme Function Initiative-Genome Neighborhood Tool (EFI-GNT) for generating genome neighborhood networks to visualize and analyze the genome context in microbial and fungal genomes. Both tools have been adapted to other applications to facilitate target selection for enzyme discovery and functional characterization. As the natural products community has demonstrated, the enzymology community needs to embrace the essential role of web tools that allow the protein and genome sequence databases to be leveraged for novel insights into enzymological problems.

  11. The Princeton Protein Orthology Database (P-POD): a comparative genomics analysis tool for biologists.

    OpenAIRE

    Sven Heinicke; Michael S Livstone; Charles Lu; Rose Oughtred; Fan Kang; Samuel V Angiuoli; Owen White; David Botstein; Kara Dolinski

    2007-01-01

    Many biological databases that provide comparative genomics information and tools are now available on the internet. While certainly quite useful, to our knowledge none of the existing databases combine results from multiple comparative genomics methods with manually curated information from the literature. Here we describe the Princeton Protein Orthology Database (P-POD, http://ortholog.princeton.edu), a user-friendly database system that allows users to find and visualize the phylogenetic r...

  12. Conformational dynamics data bank: a database for conformational dynamics of proteins and supramolecular protein assemblies.

    Science.gov (United States)

    Kim, Do-Nyun; Altschuler, Josiah; Strong, Campbell; McGill, Gaël; Bathe, Mark

    2011-01-01

    The conformational dynamics data bank (CDDB, http://www.cdyn.org) is a database that aims to provide comprehensive results on the conformational dynamics of high molecular weight proteins and protein assemblies. Analysis is performed using a recently introduced coarse-grained computational approach that is applied to the majority of structures present in the electron microscopy data bank (EMDB). Results include equilibrium thermal fluctuations and elastic strain energy distributions that identify rigid versus flexible protein domains generally, as well as those associated with specific functional transitions, and correlations in molecular motions that identify molecular regions that are highly coupled dynamically, with implications for allosteric mechanisms. A practical web-based search interface enables users to easily collect conformational dynamics data in various formats. The data bank is maintained and updated automatically to include conformational dynamics results for new structural entries as they become available in the EMDB. The CDDB complements static structural information to facilitate the investigation and interpretation of the biological function of proteins and protein assemblies essential to cell function.

  13. Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource

    Directory of Open Access Journals (Sweden)

    Sharpton Thomas J

    2012-10-01

    Full Text Available Abstract Background New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences. Results We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as “Sifting Families,” or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology–based analyses. Conclusions We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/.

  14. Access to DNA and protein databases on the Internet.

    Science.gov (United States)

    Harper, R

    1994-02-01

    During the past year, the number of biological databases that can be queried via Internet has dramatically increased. This increase has resulted from the introduction of networking tools, such as Gopher and WAIS, that make it easy for research workers to index databases and make them available for on-line browsing. Biocomputing in the nineties will see the advent of more client/server options for the solution of problems in bioinformatics.

  15. Predicting the tolerated sequences for proteins and protein interfaces using RosettaBackrub flexible backbone design.

    Directory of Open Access Journals (Sweden)

    Colin A Smith

    Full Text Available Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface, interactions between and within parts of the structure (e.g. domains can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.

  16. UNcleProt (Universal Nuclear Protein database of barley): The first nuclear protein database that distinguishes proteins from different phases of the cell cycle

    Czech Academy of Sciences Publication Activity Database

    Blavet, Nicolas; Uřinovská, J.; Jeřábková, Hana; Chamrád, I.; Vrána, Jan; Lenobel, R.; Beinhauer, D.; Šebela, M.; Doležel, Jaroslav; Petrovská, Beáta

    2017-01-01

    Roč. 8, č. 1 (2017), s. 70-80 ISSN 1949-1034 R&D Projects: GA ČR(CZ) GA14-28443S; GA MŠk(CZ) LO1204 Institutional support: RVO:61389030 Keywords : cicer-arietinum l. * rice oryza-sativa * chromatin-associated protein s * proteomic analysis * mitotic chromosomes * dehydration * localization * chickpea * network * phosphoproteome * barley * cell cycle * database * flow-cytometry * localization * mass spectrometry * nuclear proteome * nucleus Subject RIV: CE - Biochemistry OBOR OECD: Cell biology Impact factor: 2.387, year: 2016

  17. PACSY, a relational database management system for protein structure and chemical shift analysis.

    Science.gov (United States)

    Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo; Lee, Weontae; Markley, John L

    2012-10-01

    PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.edu.

  18. PACSY, a relational database management system for protein structure and chemical shift analysis

    Energy Technology Data Exchange (ETDEWEB)

    Lee, Woonghee, E-mail: whlee@nmrfam.wisc.edu [University of Wisconsin-Madison, National Magnetic Resonance Facility at Madison, and Biochemistry Department (United States); Yu, Wookyung [Center for Proteome Biophysics, Pusan National University, Department of Physics (Korea, Republic of); Kim, Suhkmann [Pusan National University, Department of Chemistry and Chemistry Institute for Functional Materials (Korea, Republic of); Chang, Iksoo [Center for Proteome Biophysics, Pusan National University, Department of Physics (Korea, Republic of); Lee, Weontae, E-mail: wlee@spin.yonsei.ac.kr [Yonsei University, Structural Biochemistry and Molecular Biophysics Laboratory, Department of Biochemistry (Korea, Republic of); Markley, John L., E-mail: markley@nmrfam.wisc.edu [University of Wisconsin-Madison, National Magnetic Resonance Facility at Madison, and Biochemistry Department (United States)

    2012-10-15

    PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.eduhttp://pacsy.nmrfam.wisc.edu.

  19. PACSY, a relational database management system for protein structure and chemical shift analysis

    Science.gov (United States)

    Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo

    2012-01-01

    PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.edu. PMID:22903636

  20. PACSY, a relational database management system for protein structure and chemical shift analysis

    International Nuclear Information System (INIS)

    Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo; Lee, Weontae; Markley, John L.

    2012-01-01

    PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.eduhttp://pacsy.nmrfam.wisc.edu.

  1. Expanded microbial genome coverage and improved protein family annotation in the COG database.

    Science.gov (United States)

    Galperin, Michael Y; Makarova, Kira S; Wolf, Yuri I; Koonin, Eugene V

    2015-01-01

    Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign to them tentative functions. The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/), first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and describe the range of potential functions when there were more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, and a comprehensive revision of the COG annotations and expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, functions of many previously uncharacterized COGs have been elucidated and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. The new version of the

  2. Identification of similar regions of protein structures using integrated sequence and structure analysis tools

    Directory of Open Access Journals (Sweden)

    Heiland Randy

    2006-03-01

    Full Text Available Abstract Background Understanding protein function from its structure is a challenging problem. Sequence based approaches for finding homology have broad use for annotation of both structure and function. 3D structural information of protein domains and their interactions provide a complementary view to structure function relationships to sequence information. We have developed a web site http://www.sblest.org/ and an API of web services that enables users to submit protein structures and identify statistically significant neighbors and the underlying structural environments that make that match using a suite of sequence and structure analysis tools. To do this, we have integrated S-BLEST, PSI-BLAST and HMMer based superfamily predictions to give a unique integrated view to prediction of SCOP superfamilies, EC number, and GO term, as well as identification of the protein structural environments that are associated with that prediction. Additionally, we have extended UCSF Chimera and PyMOL to support our web services, so that users can characterize their own proteins of interest. Results Users are able to submit their own queries or use a structure already in the PDB. Currently the databases that a user can query include the popular structural datasets ASTRAL 40 v1.69, ASTRAL 95 v1.69, CLUSTER50, CLUSTER70 and CLUSTER90 and PDBSELECT25. The results can be downloaded directly from the site and include function prediction, analysis of the most conserved environments and automated annotation of query proteins. These results reflect both the hits found with PSI-BLAST, HMMer and with S-BLEST. We have evaluated how well annotation transfer can be performed on SCOP ID's, Gene Ontology (GO ID's and EC Numbers. The method is very efficient and totally automated, generally taking around fifteen minutes for a 400 residue protein. Conclusion With structural genomics initiatives determining structures with little, if any, functional characterization

  3. O-GLYCBASE version 2.0: a revised database of O-glycosylated proteins

    DEFF Research Database (Denmark)

    Hansen, Jan; Lund, Ole; Rapacki, Kristoffer

    1997-01-01

    O-GLYCBASE is an updated database of information on glycoproteins and their O-linked glycosylation sites. Entries are compiled and revised from the literature, and from the SWISS-PROT database. Entries include information about species, sequence, glycosylation sites and glycan type. O-GLYCBASE is...... patterns for the GalNAc, mannose and GlcNAc transferases are shown. The O-GLYCBASE database is available through WWW or by anonymous FTP....

  4. Protein sequences bound to mineral surfaces persist into deep time

    DEFF Research Database (Denmark)

    Demarchi, Beatrice; Hall, Shaun; Roncal-Herrero, Teresa

    2016-01-01

    of Laetoli (3.8 Ma) and Olduvai Gorge (1.3 Ma) in Tanzania. By tracking protein diagenesis back in time we find consistent patterns of preservation, demonstrating authenticity of the surviving sequences. Molecular dynamics simulations of struthiocalcin-1 and -2, the dominant proteins within the eggshell......, reveal that distinct domains bind to the mineral surface. It is the domain with the strongest calculated binding energy to the calcite surface that is selectively preserved. Thermal age calculations demonstrate that the Laetoli and Olduvai peptides are 50 times older than any previously authenticated...

  5. Improved taxonomic assignment of human intestinal 16S rRNA sequences by a dedicated reference database

    NARCIS (Netherlands)

    Ritari, Jarmo; Salojärvi, Jarkko; Lahti, Leo; Vos, de Willem M.

    2015-01-01

    Background: Current sequencing technology enables taxonomic profiling of microbial ecosystems at high resolution and depth by using the 16S rRNA gene as a phylogenetic marker. Taxonomic assignation of newly acquired data is based on sequence comparisons with comprehensive reference databases to

  6. Protein identification from two-dimensional gel electrophoresis analysis of Klebsiella pneumoniae by combined use of mass spectrometry data and raw genome sequences

    Directory of Open Access Journals (Sweden)

    Zeng An-Ping

    2003-12-01

    Full Text Available Abstract Separation of proteins by two-dimensional gel electrophoresis (2-DE coupled with identification of proteins through peptide mass fingerprinting (PMF by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS is the widely used technique for proteomic analysis. This approach relies, however, on the presence of the proteins studied in public-accessible protein databases or the availability of annotated genome sequences of an organism. In this work, we investigated the reliability of using raw genome sequences for identifying proteins by PMF without the need of additional information such as amino acid sequences. The method is demonstrated for proteomic analysis of Klebsiella pneumoniae grown anaerobically on glycerol. For 197 spots excised from 2-DE gels and submitted for mass spectrometric analysis 164 spots were clearly identified as 122 individual proteins. 95% of the 164 spots can be successfully identified merely by using peptide mass fingerprints and a strain-specific protein database (ProtKpn constructed from the raw genome sequences of K. pneumoniae. Cross-species protein searching in the public databases mainly resulted in the identification of 57% of the 66 high expressed protein spots in comparison to 97% by using the ProtKpn database. 10 dha regulon related proteins that are essential for the initial enzymatic steps of anaerobic glycerol metabolism were successfully identified using the ProtKpn database, whereas none of them could be identified by cross-species searching. In conclusion, the use of strain-specific protein database constructed from raw genome sequences makes it possible to reliably identify most of the proteins from 2-DE analysis simply through peptide mass fingerprinting.

  7. Improving pairwise comparison of protein sequences with domain co-occurrence

    Science.gov (United States)

    Gascuel, Olivier

    2018-01-01

    Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is then to integrate additional contextual information to the procedure. Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed an increase of 14% of the number of significant BLAST hits and an increase of 25% of the proteome area that can be covered with a domain. Our method identified 2240 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties show that they are close to that of standard Pfam domains. Source code of the proposed approach and supplementary data are available at: https://gite.lirmm.fr/menichelli/pairwise-comparison-with-cooccurrence PMID:29293498

  8. Computational identification of MoRFs in protein sequences.

    Science.gov (United States)

    Malhis, Nawar; Gsponer, Jörg

    2015-06-01

    Intrinsically disordered regions of proteins play an essential role in the regulation of various biological processes. Key to their regulatory function is the binding of molecular recognition features (MoRFs) to globular protein domains in a process known as a disorder-to-order transition. Predicting the location of MoRFs in protein sequences with high accuracy remains an important computational challenge. In this study, we introduce MoRFCHiBi, a new computational approach for fast and accurate prediction of MoRFs in protein sequences. MoRFCHiBi combines the outcomes of two support vector machine (SVM) models that take advantage of two different kernels with high noise tolerance. The first, SVMS, is designed to extract maximal information from the general contrast in amino acid compositions between MoRFs, their surrounding regions (Flanks), and the remainders of the sequences. The second, SVMT, is used to identify similarities between regions in a query sequence and MoRFs of the training set. We evaluated the performance of our predictor by comparing its results with those of two currently available MoRF predictors, MoRFpred and ANCHOR. Using three test sets that have previously been collected and used to evaluate MoRFpred and ANCHOR, we demonstrate that MoRFCHiBi outperforms the other predictors with respect to different evaluation metrics. In addition, MoRFCHiBi is downloadable and fast, which makes it useful as a component in other computational prediction tools. http://www.chibi.ubc.ca/morf/. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  9. Functional analysis of bipartite begomovirus coat protein promoter sequences

    International Nuclear Information System (INIS)

    Lacatus, Gabriela; Sunter, Garry

    2008-01-01

    We demonstrate that the AL2 gene of Cabbage leaf curl virus (CaLCuV) activates the CP promoter in mesophyll and acts to derepress the promoter in vascular tissue, similar to that observed for Tomato golden mosaic virus (TGMV). Binding studies indicate that sequences mediating repression and activation of the TGMV and CaLCuV CP promoter specifically bind different nuclear factors common to Nicotiana benthamiana, spinach and tomato. However, chromatin immunoprecipitation demonstrates that TGMV AL2 can interact with both sequences independently. Binding of nuclear protein(s) from different crop species to viral sequences conserved in both bipartite and monopartite begomoviruses, including TGMV, CaLCuV, Pepper golden mosaic virus and Tomato yellow leaf curl virus suggests that bipartite begomoviruses bind common host factors to regulate the CP promoter. This is consistent with a model in which AL2 interacts with different components of the cellular transcription machinery that bind viral sequences important for repression and activation of begomovirus CP promoters

  10. A two-locus DNA sequence database for typing plant and human pathogens within the Fusarium oxysporum species complex

    DEFF Research Database (Denmark)

    O'Donnell, Kerry; Gueidan, C; Sink, S

    2009-01-01

    We constructed a two-locus database, comprising partial translation elongation factor (EF-1alpha) gene sequences and nearly full-length sequences of the nuclear ribosomal intergenic spacer region (IGS rDNA) for 850 isolates spanning the phylogenetic breadth of the Fusarium oxysporum species compl...... of the IGS rDNA sequences may be non-orthologous. We also evaluated enniatin, fumonisin and moniliformin mycotoxin production in vitro within a phylogenetic framework....

  11. The Ising model for prediction of disordered residues from protein sequence alone

    International Nuclear Information System (INIS)

    Lobanov, Michail Yu; Galzitskaya, Oxana V

    2011-01-01

    Intrinsically disordered regions serve as molecular recognition elements, which play an important role in the control of many cellular processes and signaling pathways. It is useful to be able to predict positions of disordered residues and disordered regions in protein chains using protein sequence alone. A new method (IsUnstruct) based on the Ising model for prediction of disordered residues from protein sequence alone has been developed. According to this model, each residue can be in one of two states: ordered or disordered. The model is an approximation of the Ising model in which the interaction term between neighbors has been replaced by a penalty for changing between states (the energy of border). The IsUnstruct has been compared with other available methods and found to perform well. The method correctly finds 77% of disordered residues as well as 87% of ordered residues in the CASP8 database, and 72% of disordered residues as well as 85% of ordered residues in the DisProt database

  12. Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition.

    Science.gov (United States)

    Hayat, Maqsood; Khan, Asifullah

    2011-02-21

    Membrane proteins are vital type of proteins that serve as channels, receptors, and energy transducers in a cell. Prediction of membrane protein types is an important research area in bioinformatics. Knowledge of membrane protein types provides some valuable information for predicting novel example of the membrane protein types. However, classification of membrane protein types can be both time consuming and susceptible to errors due to the inherent similarity of membrane protein types. In this paper, neural networks based membrane protein type prediction system is proposed. Composite protein sequence representation (CPSR) is used to extract the features of a protein sequence, which includes seven feature sets; amino acid composition, sequence length, 2 gram exchange group frequency, hydrophobic group, electronic group, sum of hydrophobicity, and R-group. Principal component analysis is then employed to reduce the dimensionality of the feature vector. The probabilistic neural network (PNN), generalized regression neural network, and support vector machine (SVM) are used as classifiers. A high success rate of 86.01% is obtained using SVM for the jackknife test. In case of independent dataset test, PNN yields the highest accuracy of 95.73%. These classifiers exhibit improved performance using other performance measures such as sensitivity, specificity, Mathew's correlation coefficient, and F-measure. The experimental results show that the prediction performance of the proposed scheme for classifying membrane protein types is the best reported, so far. This performance improvement may largely be credited to the learning capabilities of neural networks and the composite feature extraction strategy, which exploits seven different properties of protein sequences. The proposed Mem-Predictor can be accessed at http://111.68.99.218/Mem-Predictor. Copyright © 2010 Elsevier Ltd. All rights reserved.

  13. Protein (Viridiplantae) - PGDBj - Ortholog DB | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available ase Description Download License Update History of This Database Site Policy | Contact Us Protein (Viridiplantae) - PGDBj - Ortholog DB | LSDB Archive ... ...List Contact us PGDBj - Ortholog DB Protein (Viridiplantae) Data detail Data name Protein (Viridiplantae) DO...switchLanguage; BLAST Search Image Search Home About Archive Update History Data

  14. Properties of Sequence Conservation in Upstream Regulatory and Protein Coding Sequences among Paralogs in Arabidopsis thaliana

    Science.gov (United States)

    Richardson, Dale N.; Wiehe, Thomas

    Whole genome duplication (WGD) has catalyzed the formation of new species, genes with novel functions, altered expression patterns, complexified signaling pathways and has provided organisms a level of genetic robustness. We studied the long-term evolution and interrelationships of 5’ upstream regulatory sequences (URSs), protein coding sequences (CDSs) and expression correlations (EC) of duplicated gene pairs in Arabidopsis. Three distinct methods revealed significant evolutionary conservation between paralogous URSs and were highly correlated with microarray-based expression correlation of the respective gene pairs. Positional information on exact matches between sequences unveiled the contribution of micro-chromosomal rearrangements on expression divergence. A three-way rank analysis of URS similarity, CDS divergence and EC uncovered specific gene functional biases. Transcription factor activity was associated with gene pairs exhibiting conserved URSs and divergent CDSs, whereas a broad array of metabolic enzymes was found to be associated with gene pairs showing diverged URSs but conserved CDSs.

  15. CISAPS: Complex Informational Spectrum for the Analysis of Protein Sequences

    Directory of Open Access Journals (Sweden)

    Charalambos Chrysostomou

    2015-01-01

    Full Text Available Complex informational spectrum analysis for protein sequences (CISAPS and its web-based server are developed and presented. As recent studies show, only the use of the absolute spectrum in the analysis of protein sequences using the informational spectrum analysis is proven to be insufficient. Therefore, CISAPS is developed to consider and provide results in three forms including absolute, real, and imaginary spectrum. Biologically related features to the analysis of influenza A subtypes as presented as a case study in this study can also appear individually either in the real or imaginary spectrum. As the results presented, protein classes can present similarities or differences according to the features extracted from CISAPS web server. These associations are probable to be related with the protein feature that the specific amino acid index represents. In addition, various technical issues such as zero-padding and windowing that may affect the analysis are also addressed. CISAPS uses an expanded list of 611 unique amino acid indices where each one represents a different property to perform the analysis. This web-based server enables researchers with little knowledge of signal processing methods to apply and include complex informational spectrum analysis to their work.

  16. Next-generation sequencing can reveal in vitro-generated PCR crossover products: some artifactual sequences correspond to HLA alleles in the IMGT/HLA database.

    Science.gov (United States)

    Holcomb, C L; Rastrou, M; Williams, T C; Goodridge, D; Lazaro, A M; Tilanus, M; Erlich, H A

    2014-01-01

    The high-resolution human leukocyte antigen (HLA) genotyping assay that we developed using 454 sequencing and Conexio software uses generic polymerase chain reaction (PCR) primers for DRB exon 2. Occasionally, we observed low abundance DRB amplicon sequences that resulted from in vitro PCR 'crossing over' between DRB1 and DRB3/4/5. These hybrid sequences, revealed by the clonal sequencing property of the 454 system, were generally observed at a read depth of 5%-10% of the true alleles. They usually contained at least one mismatch with the IMGT/HLA database, and consequently, were easily recognizable and did not cause a problem for HLA genotyping. Sometimes, however, these artifactual sequences matched a rare allele and the automatic genotype assignment was incorrect. These observations raised two issues: (1) could PCR conditions be modified to reduce such artifacts? and (2) could some of the rare alleles listed in the IMGT/HLA database be artifacts rather than true alleles? Because PCR crossing over occurs during late cycles of PCR, we compared DRB genotypes resulting from 28 and (our standard) 35 cycles of PCR. For all 21 cell line DNAs amplified for 35 cycles, crossover products were detected. In 33% of the cases, these hybrid sequences corresponded to named alleles. With amplification for only 28 cycles, these artifactual sequences were not detectable. To investigate whether some rare alleles in the IMGT/HLA database might be due to PCR artifacts, we analyzed four samples obtained from the investigators who submitted the sequences. In three cases, the sequences were generated from true alleles. In one case, our 454 sequencing revealed an error in the previously submitted sequence. © 2013 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  17. SigniSite: Identification of residue-level genotype-phenotype correlations in protein multiple sequence alignments

    DEFF Research Database (Denmark)

    Jessen, Leon Ivar; Hoof, Ilka; Lund, Ole

    2013-01-01

    Site does not require any pre-definition of subgroups or binary classification. Input is a set of protein sequences where each sequence has an associated real number, quantifying a given phenotype. SigniSite will then identify which amino acid residues are significantly associated with the data set......) using a set of human immunodeficiency virus protease-inhibitor genotype–phenotype data and corresponding resistance mutation scores from the Stanford University HIV Drug Resistance Database, and a data set of protein families with experimentally annotated SDPs. For both data sets, SigniSite was found...

  18. SCOWLP: a web-based database for detailed characterization and visualization of protein interfaces

    Directory of Open Access Journals (Sweden)

    Schroeder Michael

    2006-03-01

    Full Text Available Abstract Background Currently there is a strong need for methods that help to obtain an accurate description of protein interfaces in order to be able to understand the principles that govern molecular recognition and protein function. Many of the recent efforts to computationally identify and characterize protein networks extract protein interaction information at atomic resolution from the PDB. However, they pay none or little attention to small protein ligands and solvent. They are key components and mediators of protein interactions and fundamental for a complete description of protein interfaces. Interactome profiling requires the development of computational tools to extract and analyze protein-protein, protein-ligand and detailed solvent interaction information from the PDB in an automatic and comparative fashion. Adding this information to the existing one on protein-protein interactions will allow us to better understand protein interaction networks and protein function. Description SCOWLP (Structural Characterization Of Water, Ligands and Proteins is a user-friendly and publicly accessible web-based relational database for detailed characterization and visualization of the PDB protein interfaces. The SCOWLP database includes proteins, peptidic-ligands and interface water molecules as descriptors of protein interfaces. It contains currently 74,907 protein interfaces and 2,093,976 residue-residue interactions formed by 60,664 structural units (protein domains and peptidic-ligands and their interacting solvent. The SCOWLP web-server allows detailed structural analysis and comparisons of protein interfaces at atomic level by text query of PDB codes and/or by navigating a SCOP-based tree. It includes a visualization tool to interactively display the interfaces and label interacting residues and interface solvent by atomic physicochemical properties. SCOWLP is automatically updated with every SCOP release. Conclusion SCOWLP enriches

  19. O-GLYCBASE: a revised database of O-glycosylated proteins

    DEFF Research Database (Denmark)

    Hansen, Jan; Lund, Ole; Nielsen, Jens O.

    1996-01-01

    O-GLYCBASE is a comprehensive database of information on glycoproteins and their O-linked glycosylation sites. Entries are compiled and revised from the SWISS-PROT and PIR databases as well as directly from recently published reports. Nineteen percent of the entries extracted from the databases n...... of mucin type O-glycosylation sites in mammalian glycoproteins exclusively from the primary sequence is made available by E-mail or WWW. The O-GLYCBASE database is also available electronically through our WWW server or by anonymous FTP....

  20. Prediction of protein-protein interaction sites in sequences and 3D structures by random forests.

    Directory of Open Access Journals (Sweden)

    Mile Sikić

    2009-01-01

    Full Text Available Identifying interaction sites in proteins provides important clues to the function of a protein and is becoming increasingly relevant in topics such as systems biology and drug discovery. Although there are numerous papers on the prediction of interaction sites using information derived from structure, there are only a few case reports on the prediction of interaction residues based solely on protein sequence. Here, a sliding window approach is combined with the Random Forests method to predict protein interaction sites using (i a combination of sequence- and structure-derived parameters and (ii sequence information alone. For sequence-based prediction we achieved a precision of 84% with a 26% recall and an F-measure of 40%. When combined with structural information, the prediction performance increases to a precision of 76% and a recall of 38% with an F-measure of 51%. We also present an attempt to rationalize the sliding window size and demonstrate that a nine-residue window is the most suitable for predictor construction. Finally, we demonstrate the applicability of our prediction methods by modeling the Ras-Raf complex using predicted interaction sites as target binding interfaces. Our results suggest that it is possible to predict protein interaction sites with quite a high accuracy using only sequence information.

  1. UFO: a web server for ultra-fast functional profiling of whole genome protein sequences

    Directory of Open Access Journals (Sweden)

    Meinicke Peter

    2009-09-01

    Full Text Available Abstract Background Functional profiling is a key technique to characterize and compare the functional potential of entire genomes. The estimation of profiles according to an assignment of sequences to functional categories is a computationally expensive task because it requires the comparison of all protein sequences from a genome with a usually large database of annotated sequences or sequence families. Description Based on machine learning techniques for Pfam domain detection, the UFO web server for ultra-fast functional profiling allows researchers to process large protein sequence collections instantaneously. Besides the frequencies of Pfam and GO categories, the user also obtains the sequence specific assignments to Pfam domain families. In addition, a comparison with existing genomes provides dissimilarity scores with respect to 821 reference proteomes. Considering the underlying UFO domain detection, the results on 206 test genomes indicate a high sensitivity of the approach. In comparison with current state-of-the-art HMMs, the runtime measurements show a considerable speed up in the range of four orders of magnitude. For an average size prokaryotic genome, the computation of a functional profile together with its comparison typically requires about 10 seconds of processing time. Conclusion For the first time the UFO web server makes it possible to get a quick overview on the functional inventory of newly sequenced organisms. The genome scale comparison with a large number of precomputed profiles allows a first guess about functionally related organisms. The service is freely available and does not require user registration or specification of a valid email address.

  2. Identification of a novel Plasmopara halstedii elicitor protein combining de novo peptide sequencing algorithms and RACE-PCR

    Directory of Open Access Journals (Sweden)

    Madlung Johannes

    2010-05-01

    Full Text Available Abstract Background Often high-quality MS/MS spectra of tryptic peptides do not match to any database entry because of only partially sequenced genomes and therefore, protein identification requires de novo peptide sequencing. To achieve protein identification of the economically important but still unsequenced plant pathogenic oomycete Plasmopara halstedii, we first evaluated the performance of three different de novo peptide sequencing algorithms applied to a protein digests of standard proteins using a quadrupole TOF (QStar Pulsar i. Results The performance order of the algorithms was PEAKS online > PepNovo > CompNovo. In summary, PEAKS online correctly predicted 45% of measured peptides for a protein test data set. All three de novo peptide sequencing algorithms were used to identify MS/MS spectra of tryptic peptides of an unknown 57 kDa protein of P. halstedii. We found ten de novo sequenced peptides that showed homology to a Phytophthora infestans protein, a closely related organism of P. halstedii. Employing a second complementary approach, verification of peptide prediction and protein identification was performed by creation of degenerate primers for RACE-PCR and led to an ORF of 1,589 bp for a hypothetical phosphoenolpyruvate carboxykinase. Conclusions Our study demonstrated that identification of proteins within minute amounts of sample material improved significantly by combining sensitive LC-MS methods with different de novo peptide sequencing algorithms. In addition, this is the first study that verified protein prediction from MS data by also employing a second complementary approach, in which RACE-PCR led to identification of a novel elicitor protein in P. halstedii.

  3. Reticulamoeba Is a Long-Branched Granofilosean (Cercozoa) That Is Missing from Sequence Databases

    Science.gov (United States)

    Bass, David; Yabuki, Akinori; Santini, Sébastien; Romac, Sarah; Berney, Cédric

    2012-01-01

    We sequenced the 18S ribosomal RNA gene of seven isolates of the enigmatic marine amoeboflagellate Reticulamoeba Grell, which resolved into four genetically distinct Reticulamoeba lineages, two of which correspond to R. gemmipara Grell and R. minor Grell, another with a relatively large cell body forming lacunae, and another that has similarities to both R. minor and R. gemmipara but with a greater propensity to form cell clusters. These lineages together form a long-branched clade that branches within the cercozoan class Granofilosea (phylum Cercozoa), showing phylogenetic affinities with the genus Mesofila. The basic morphology of Reticulamoeba is a roundish or ovoid cell with a more or less irregular outline. Long and branched reticulopodia radiate from the cell. The reticulopodia bear granules that are bidirectionally motile. There is also a biflagellate dispersal stage. Reticulamoeba is frequently observed in coastal marine environmental samples. PCR primers specific to the Reticulamoeba clade confirm that it is a frequent member of benthic marine microbial communities, and is also found in brackish water sediments and freshwater biofilm. However, so far it has not been found in large molecular datasets such as the nucleotide database in NCBI GenBank, metagenomic datasets in Camera, and the marine microbial eukaryote sampling and sequencing consortium BioMarKs, although closely related lineages can be found in some of these datasets using a highly targeted approach. Therefore, although such datasets are very powerful tools in microbial ecology, they may, for several methodological reasons, fail to detect ecologically and evolutionary key lineages. PMID:23226495

  4. Palingol: a declarative programming language to describe nucleic acids' secondary structures and to scan sequence database.

    Science.gov (United States)

    Billoud, B; Kontic, M; Viari, A

    1996-01-01

    At the DNA/RNA level, biological signals are defined by a combination of spatial structures and sequence motifs. Until now, few attempts had been made in writing general purpose search programs that take into account both sequence and structure criteria. Indeed, the most successful structure scanning programs are usually dedicated to particular structures and are written using general purpose programming languages through a complex and time consuming process where the biological problem of defining the structure and the computer engineering problem of looking for it are intimately intertwined. In this paper, we describe a general representation of structures, suitable for database scanning, together with a programming language, Palingol, designed to manipulate it. Palingol has specific data types, corresponding to structural elements-basically helices-that can be arranged in any way to form a complex structure. As a consequence of the declarative approach used in Palingol, the user should only focus on 'what to search for' while the language engine takes care of 'how to look for it'. Therefore, it becomes simpler to write a scanning program and the structural constraints that define the required structure are more clearly identified. PMID:8628670

  5. Adaptive compressive learning for prediction of protein-protein interactions from primary sequence.

    Science.gov (United States)

    Zhang, Ya-Nan; Pan, Xiao-Yong; Huang, Yan; Shen, Hong-Bin

    2011-08-21

    Protein-protein interactions (PPIs) play an important role in biological processes. Although much effort has been devoted to the identification of novel PPIs by integrating experimental biological knowledge, there are still many difficulties because of lacking enough protein structural and functional information. It is highly desired to develop methods based only on amino acid sequences for predicting PPIs. However, sequence-based predictors are often struggling with the high-dimensionality causing over-fitting and high computational complexity problems, as well as the redundancy of sequential feature vectors. In this paper, a novel computational approach based on compressed sensing theory is proposed to predict yeast Saccharomyces cerevisiae PPIs from primary sequence and has achieved promising results. The key advantage of the proposed compressed sensing algorithm is that it can compress the original high-dimensional protein sequential feature vector into a much lower but more condensed space taking the sparsity property of the original signal into account. What makes compressed sensing much more attractive in protein sequence analysis is its compressed signal can be reconstructed from far fewer measurements than what is usually considered necessary in traditional Nyquist sampling theory. Experimental results demonstrate that proposed compressed sensing method is powerful for analyzing noisy biological data and reducing redundancy in feature vectors. The proposed method represents a new strategy of dealing with high-dimensional protein discrete model and has great potentiality to be extended to deal with many other complicated biological systems. Copyright © 2011 Elsevier Ltd. All rights reserved.

  6. Analysis of correlations between sites in models of protein sequences

    International Nuclear Information System (INIS)

    Giraud, B.G.; Lapedes, A.; Liu, L.C.

    1998-01-01

    A criterion based on conditional probabilities, related to the concept of algorithmic distance, is used to detect correlated mutations at noncontiguous sites on sequences. We apply this criterion to the problem of analyzing correlations between sites in protein sequences; however, the analysis applies generally to networks of interacting sites with discrete states at each site. Elementary models, where explicit results can be derived easily, are introduced. The number of states per site considered ranges from 2, illustrating the relation to familiar classical spin systems, to 20 states, suitable for representing amino acids. Numerical simulations show that the criterion remains valid even when the genetic history of the data samples (e.g., protein sequences), as represented by a phylogenetic tree, introduces nonindependence between samples. Statistical fluctuations due to finite sampling are also investigated and do not invalidate the criterion. A subsidiary result is found: The more homogeneous a population, the more easily its average properties can drift from the properties of its ancestor. copyright 1998 The American Physical Society

  7. Determining Clostridium difficile intra-taxa diversity by mining multilocus sequence typing databases.

    Science.gov (United States)

    Muñoz, Marina; Ríos-Chaparro, Dora Inés; Patarroyo, Manuel Alfonso; Ramírez, Juan David

    2017-03-14

    Multilocus sequence typing (MLST) is a highly discriminatory typing strategy; it is reproducible and scalable. There is a MLST scheme for Clostridium difficile (CD), a gram positive bacillus causing different pathologies of the gastrointestinal tract. This work was aimed at describing the frequency of sequence types (STs) and Clades (C) reported and evalute the intra-taxa diversity in the CD MLST database (CD-MLST-db) using an MLSA approach. Analysis of 1778 available isolates showed that clade 1 (C1) was the most frequent worldwide (57.7%), followed by C2 (29.1%). Regarding sequence types (STs), it was found that ST-1, belonging to C2, was the most frequent. The isolates analysed came from 17 countries, mostly from the United Kingdom (UK) (1541 STs, 87.0%). The diversity of the seven housekeeping genes in the MLST scheme was evaluated, and alleles from the profiles (STs), for identifying CD population structure. It was found that adk and atpA are conserved genes allowing a limited amount of clusters to be discriminated; however, different genes such as drx, glyA and particularly sodA showed high diversity indexes and grouped CD populations in many clusters, suggesting that these genes' contribution to CD typing should be revised. It was identified that CD STs reported to date have a mostly clonal population structure with foreseen events of recombination; however, one group of STs was not assigned to a clade being highly different containing at least nine well-supported clusters, suggesting a greater amount of clades for CD. This study shows the usefulness of CD-MLST-db as a tool for studying CD distribution and population structure, identifying the need for reviewing the usefulness of sodA as housekeeping gene within the MLST scheme and suggesting the existence of a greater amount of CD clades. The study also shows the plausible exchange of genetic material between STs, contributing towards intra-taxa genetic diversity.

  8. Integrated Controlling System and Unified Database for High Throughput Protein Crystallography Experiments

    International Nuclear Information System (INIS)

    Gaponov, Yu.A.; Igarashi, N.; Hiraki, M.; Sasajima, K.; Matsugaki, N.; Suzuki, M.; Kosuge, T.; Wakatsuki, S.

    2004-01-01

    An integrated controlling system and a unified database for high throughput protein crystallography experiments have been developed. Main features of protein crystallography experiments (purification, crystallization, crystal harvesting, data collection, data processing) were integrated into the software under development. All information necessary to perform protein crystallography experiments is stored (except raw X-ray data that are stored in a central data server) in a MySQL relational database. The database contains four mutually linked hierarchical trees describing protein crystals, data collection of protein crystal and experimental data processing. A database editor was designed and developed. The editor supports basic database functions to view, create, modify and delete user records in the database. Two search engines were realized: direct search of necessary information in the database and object oriented search. The system is based on TCP/IP secure UNIX sockets with four predefined sending and receiving behaviors, which support communications between all connected servers and clients with remote control functions (creating and modifying data for experimental conditions, data acquisition, viewing experimental data, and performing data processing). Two secure login schemes were designed and developed: a direct method (using the developed Linux clients with secure connection) and an indirect method (using the secure SSL connection using secure X11 support from any operating system with X-terminal and SSH support). A part of the system has been implemented on a new MAD beam line, NW12, at the Photon Factory Advanced Ring for general user experiments

  9. Simplifying complex sequence information: a PCP-consensus protein binds antibodies against all four Dengue serotypes.

    Science.gov (United States)

    Bowen, David M; Lewis, Jessica A; Lu, Wenzhe; Schein, Catherine H

    2012-09-14

    Designing proteins that reflect the natural variability of a pathogen is essential for developing novel vaccines and drugs. Flaviviruses, including Dengue (DENV) and West Nile (WNV), evolve rapidly and can "escape" neutralizing monoclonal antibodies by mutation. Designing antigens that represent many distinct strains is important for DENV, where infection with a strain from one of the four serotypes may lead to severe hemorrhagic disease on subsequent infection with a strain from another serotype. Here, a DENV physicochemical property (PCP)-consensus sequence was derived from 671 unique sequences from the Flavitrack database. PCP-consensus proteins for domain 3 of the envelope protein (EdomIII) were expressed from synthetic genes in Escherichia coli. The ability of the purified consensus proteins to bind polyclonal antibodies generated in response to infection with strains from each of the four DENV serotypes was determined. The initial consensus protein bound antibodies from DENV-1-3 in ELISA and Western blot assays. This sequence was altered in 3 steps to incorporate regions of maximum variability, identified as significant changes in the PCPs, characteristic of DENV-4 strains. The final protein was recognized by antibodies against all four serotypes. Two amino acids essential for efficient binding to all DENV antibodies are part of a discontinuous epitope previously defined for a neutralizing monoclonal antibody. The PCP-consensus method can significantly reduce the number of experiments required to define a multivalent antigen, which is particularly important when dealing with pathogens that must be tested at higher biosafety levels. Copyright © 2012 Elsevier Ltd. All rights reserved.

  10. Protein model discrimination using mutational sensitivity derived from deep sequencing.

    Science.gov (United States)

    Adkar, Bharat V; Tripathi, Arti; Sahoo, Anusmita; Bajaj, Kanika; Goswami, Devrishi; Chakrabarti, Purbani; Swarnkar, Mohit K; Gokhale, Rajesh S; Varadarajan, Raghavan

    2012-02-08

    A major bottleneck in protein structure prediction is the selection of correct models from a pool of decoys. Relative activities of ∼1,200 individual single-site mutants in a saturation library of the bacterial toxin CcdB were estimated by determining their relative populations using deep sequencing. This phenotypic information was used to define an empirical score for each residue (RankScore), which correlated with the residue depth, and identify active-site residues. Using these correlations, ∼98% of correct models of CcdB (RMSD ≤ 4Å) were identified from a large set of decoys. The model-discrimination methodology was further validated on eleven different monomeric proteins using simulated RankScore values. The methodology is also a rapid, accurate way to obtain relative activities of each mutant in a large pool and derive sequence-structure-function relationships without protein isolation or characterization. It can be applied to any system in which mutational effects can be monitored by a phenotypic readout. Copyright © 2012 Elsevier Ltd. All rights reserved.

  11. ARAMEMNON, a novel database for Arabidopsis integral membrane proteins

    DEFF Research Database (Denmark)

    Schwacke, Rainer; Schneider, Anja; van der Graaff, Eric

    2003-01-01

    spans and are possibly linked to transport functions. The ARAMEMNON DB enables direct comparison of the predictions of seven different TM span computation programs and the predictions of subcellular localization by eight signal peptide recognition programs. A special function displays the proteins...

  12. Nucleation phenomena in protein folding: the modulating role of protein sequence

    International Nuclear Information System (INIS)

    Travasso, Rui D M; FaIsca, Patricia F N; Gama, Margarida M Telo da

    2007-01-01

    For the vast majority of naturally occurring, small, single-domain proteins, folding is often described as a two-state process that lacks detectable intermediates. This observation has often been rationalized on the basis of a nucleation mechanism for protein folding whose basic premise is the idea that, after completion of a specific set of contacts forming the so-called folding nucleus, the native state is achieved promptly. Here we propose a methodology to identify folding nuclei in small lattice polymers and apply it to the study of protein molecules with a chain length of N = 48. To investigate the extent to which protein topology is a robust determinant of the nucleation mechanism, we compare the nucleation scenario of a native-centric model with that of a sequence-specific model sharing the same native fold. To evaluate the impact of the sequence's finer details in the nucleation mechanism, we consider the folding of two non-homologous sequences. We conclude that, in a sequence-specific model, the folding nucleus is, to some extent, formed by the most stable contacts in the protein and that the less stable linkages in the folding nucleus are solely determined by the fold's topology. We have also found that, independently of the protein sequence, the folding nucleus performs the same 'topological' function. This unifying feature of the nucleation mechanism results from the residues forming the folding nucleus being distributed along the protein chain in a similar and well-defined manner that is determined by the fold's topological features

  13. HIVBrainSeqDB: a database of annotated HIV envelope sequences from brain and other anatomical sites

    Directory of Open Access Journals (Sweden)

    O'Connor Niall

    2010-12-01

    Full Text Available Abstract Background The population of HIV replicating within a host consists of independently evolving and interacting sub-populations that can be genetically distinct within anatomical compartments. HIV replicating within the brain causes neurocognitive disorders in up to 20-30% of infected individuals and is a viral sanctuary site for the development of drug resistance. The primary determinant of HIV neurotropism is macrophage tropism, which is primarily determined by the viral envelope (env gene. However, studies of genetic aspects of HIV replicating in the brain are hindered because existing repositories of HIV sequences are not focused on neurotropic virus nor annotated with neurocognitive and neuropathological status. To address this need, we constructed the HIV Brain Sequence Database. Results The HIV Brain Sequence Database is a public database of HIV envelope sequences, directly sequenced from brain and other tissues from the same patients. Sequences are annotated with clinical data including viral load, CD4 count, antiretroviral status, neurocognitive impairment, and neuropathological diagnosis, all curated from the original publication. Tissue source is coded using an anatomical ontology, the Foundational Model of Anatomy, to capture the maximum level of detail available, while maintaining ontological relationships between tissues and their subparts. 44 tissue types are represented within the database, grouped into 4 categories: (i brain, brainstem, and spinal cord; (ii meninges, choroid plexus, and CSF; (iii blood and lymphoid; and (iv other (bone marrow, colon, lung, liver, etc. Patient coding is correlated across studies, allowing sequences from the same patient to be grouped to increase statistical power. Using Cytoscape, we visualized relationships between studies, patients and sequences, illustrating interconnections between studies and the varying depth of sequencing, patient number, and tissue representation across studies

  14. Cα and Cβ Carbon-13 Chemical Shifts in Proteins From an Empirical Database

    International Nuclear Information System (INIS)

    Iwadate, Mitsuo; Asakura, Tetsuo; Williamson, Michael P.

    1999-01-01

    We have constructed an extensive database of 13C Cα and Cβ chemical shifts in proteins of solution, for proteins of which a high-resolution crystal structure exists, and for which the crystal structure has been shown to be essentially identical to the solution structure. There is no systematic effect of temperature, reference compound, or pH on reported shifts, but there appear to be differences in reported shifts arising from referencing differences of up to 4.2 ppm. The major factor affecting chemical shifts is the backbone geometry, which causes differences of ca. 4 ppm between typical α- helix and β-sheet geometries for Cα, and of ca. 2 ppm for Cβ. The side-chain dihedral angle χ1 has an effect of up to 0.5 ppm on the Cα shift, particularly for amino acids with branched side-chains at Cβ. Hydrogen bonding to main-chain atoms has an effect of up to 0.9 ppm, which depends on the main- chain conformation. The sequence of the protein and ring-current shifts from aromatic rings have an insignificant effect (except for residues following proline). There are significant differences between different amino acid types in the backbone geometry dependence; the amino acids can be grouped together into five different groups with different φ,ψ shielding surfaces. The overall fit of individual residues to a single non-residue-specific surface, incorporating the effects of hydrogen bonding and χ1 angle, is 0.96 ppm for both Cα and Cβ. The results from this study are broadly similar to those from ab initio studies, but there are some differences which could merit further attention

  15. A novel approach to sequence validating protein expression clones with automated decision making

    Directory of Open Access Journals (Sweden)

    Mohr Stephanie E

    2007-06-01

    Full Text Available Abstract Background Whereas the molecular assembly of protein expression clones is readily automated and routinely accomplished in high throughput, sequence verification of these clones is still largely performed manually, an arduous and time consuming process. The ultimate goal of validation is to determine if a given plasmid clone matches its reference sequence sufficiently to be "acceptable" for use in protein expression experiments. Given the accelerating increase in availability of tens of thousands of unverified clones, there is a strong demand for rapid, efficient and accurate software that automates clone validation. Results We have developed an Automated Clone Evaluation (ACE system – the first comprehensive, multi-platform, web-based plasmid sequence verification software package. ACE automates the clone verification process by defining each clone sequence as a list of multidimensional discrepancy objects, each describing a difference between the clone and its expected sequence including the resulting polypeptide consequences. To evaluate clones automatically, this list can be compared against user acceptance criteria that specify the allowable number of discrepancies of each type. This strategy allows users to re-evaluate the same set of clones against different acceptance criteria as needed for use in other experiments. ACE manages the entire sequence validation process including contig management, identifying and annotating discrepancies, determining if discrepancies correspond to polymorphisms and clone finishing. Designed to manage thousands of clones simultaneously, ACE maintains a relational database to store information about clones at various completion stages, project processing parameters and acceptance criteria. In a direct comparison, the automated analysis by ACE took less time and was more accurate than a manual analysis of a 93 gene clone set. Conclusion ACE was designed to facilitate high throughput clone sequence

  16. Sequence walkers: a graphical method to display how binding proteins interact with DNA or RNA sequences | Center for Cancer Research

    Science.gov (United States)

    A graphical method is presented for displaying how binding proteins and other macromolecules interact with individual bases of nucleotide sequences. Characters representing the sequence are either oriented normally and placed above a line indicating favorable contact, or upside-down and placed below the line indicating unfavorable contact. The positive or negative height of each letter shows the contribution of that base to the average sequence conservation of the binding site, as represented by a sequence logo.

  17. The reactive metabolite target protein database (TPDB)--a web-accessible resource.

    Science.gov (United States)

    Hanzlik, Robert P; Koen, Yakov M; Theertham, Bhargav; Dong, Yinghua; Fang, Jianwen

    2007-03-16

    The toxic effects of many simple organic compounds stem from their biotransformation to chemically reactive metabolites which bind covalently to cellular proteins. To understand the mechanisms of cytotoxic responses it may be important to know which proteins become adducted and whether some may be common targets of multiple toxins. The literature of this field is widely scattered but expanding rapidly, suggesting the need for a comprehensive, searchable database of reactive metabolite target proteins. The Reactive Metabolite Target Protein Database (TPDB) is a comprehensive, curated, searchable, documented compilation of publicly available information on the protein targets of reactive metabolites of 18 well-studied chemicals and drugs of known toxicity. TPDB software enables i) string searches for author names and proteins names/synonyms, ii) more complex searches by selecting chemical compound, animal species, target tissue and protein names/synonyms from pull-down menus, and iii) commonality searches over multiple chemicals. Tabulated search results provide information, references and links to other databases. The TPDB is a unique on-line compilation of information on the covalent modification of cellular proteins by reactive metabolites of chemicals and drugs. Its comprehensiveness and searchability should facilitate the elucidation of mechanisms of reactive metabolite toxicity. The database is freely available at http://tpdb.medchem.ku.edu/tpdb.html.

  18. The reactive metabolite target protein database (TPDB – a web-accessible resource

    Directory of Open Access Journals (Sweden)

    Dong Yinghua

    2007-03-01

    Full Text Available Abstract Background The toxic effects of many simple organic compounds stem from their biotransformation to chemically reactive metabolites which bind covalently to cellular proteins. To understand the mechanisms of cytotoxic responses it may be important to know which proteins become adducted and whether some may be common targets of multiple toxins. The literature of this field is widely scattered but expanding rapidly, suggesting the need for a comprehensive, searchable database of reactive metabolite target proteins. Description The Reactive Metabolite Target Protein Database (TPDB is a comprehensive, curated, searchable, documented compilation of publicly available information on the protein targets of reactive metabolites of 18 well-studied chemicals and drugs of known toxicity. TPDB software enables i string searches for author names and proteins names/synonyms, ii more complex searches by selecting chemical compound, animal species, target tissue and protein names/synonyms from pull-down menus, and iii commonality searches over multiple chemicals. Tabulated search results provide information, references and links to other databases. Conclusion The TPDB is a unique on-line compilation of information on the covalent modification of cellular proteins by reactive metabolites of chemicals and drugs. Its comprehensiveness and searchability should facilitate the elucidation of mechanisms of reactive metabolite toxicity. The database is freely available at http://tpdb.medchem.ku.edu/tpdb.html

  19. Experimental Rugged Fitness Landscape in Protein Sequence Space

    Science.gov (United States)

    Hayashi, Yuuki; Aita, Takuyo; Toyota, Hitoshi; Husimi, Yuzuru; Urabe, Itaru; Yomo, Tetsuya

    2006-01-01

    The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12–130 of the initial random polypeptide and selection for infectivity, the selected phage showed a 1.7×104-fold increase in infectivity, defined as the number of infected cells per ml of phage suspension. Fitness was defined as the logarithm of infectivity, and we analyzed (1) the dependence of stationary fitness on library size, which increased gradually, and (2) the time course of changes in fitness in transitional phases, based on an original theory regarding the evolutionary dynamics in Kauffman's n-k fitness landscape model. In the landscape model, single mutations at single sites among n sites affect the contribution of k other sites to fitness. Based on the results of these analyses, k was estimated to be 18–24. According to the estimated parameters, the landscape was plotted as a smooth surface up to a relative fitness of 0.4 of the global peak, whereas the landscape had a highly rugged surface with many local peaks above this relative fitness value. Based on the landscapes of these two different surfaces, it appears possible for adaptive walks with only random substitutions to climb with relative ease up to the middle region of the fitness landscape from any primordial or random sequence, whereas an enormous range of sequence diversity is required to climb further up the rugged surface above the middle region. PMID:17183728

  20. Experimental rugged fitness landscape in protein sequence space.

    Science.gov (United States)

    Hayashi, Yuuki; Aita, Takuyo; Toyota, Hitoshi; Husimi, Yuzuru; Urabe, Itaru; Yomo, Tetsuya

    2006-12-20

    The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12-130 of the initial random polypeptide and selection for infectivity, the selected phage showed a 1.7x10(4)-fold increase in infectivity, defined as the number of infected cells per ml of phage suspension. Fitness was defined as the logarithm of infectivity, and we analyzed (1) the dependence of stationary fitness on library size, which increased gradually, and (2) the time course of changes in fitness in transitional phases, based on an original theory regarding the evolutionary dynamics in Kauffman's n-k fitness landscape model. In the landscape model, single mutations at single sites among n sites affect the contribution of k other sites to fitness. Based on the results of these analyses, k was estimated to be 18-24. According to the estimated parameters, the landscape was plotted as a smooth surface up to a relative fitness of 0.4 of the global peak, whereas the landscape had a highly rugged surface with many local peaks above this relative fitness value. Based on the landscapes of these two different surfaces, it appears possible for adaptive walks with only random substitutions to climb with relative ease up to the middle region of the fitness landscape from any primordial or random sequence, whereas an enormous range of sequence diversity is required to climb further up the rugged surface above the middle region.

  1. Experimental rugged fitness landscape in protein sequence space.

    Directory of Open Access Journals (Sweden)

    Yuuki Hayashi

    Full Text Available The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12-130 of the initial random polypeptide and selection for infectivity, the selected phage showed a 1.7x10(4-fold increase in infectivity, defined as the number of infected cells per ml of phage suspension. Fitness was defined as the logarithm of infectivity, and we analyzed (1 the dependence of stationary fitness on library size, which increased gradually, and (2 the time course of changes in fitness in transitional phases, based on an original theory regarding the evolutionary dynamics in Kauffman's n-k fitness landscape model. In the landscape model, single mutations at single sites among n sites affect the contribution of k other sites to fitness. Based on the results of these analyses, k was estimated to be 18-24. According to the estimated parameters, the landscape was plotted as a smooth surface up to a relative fitness of 0.4 of the global peak, whereas the landscape had a highly rugged surface with many local peaks above this relative fitness value. Based on the landscapes of these two different surfaces, it appears possible for adaptive walks with only random substitutions to climb with relative ease up to the middle region of the fitness landscape from any primordial or random sequence, whereas an enormous range of sequence diversity is required to climb further up the rugged surface above the middle region.

  2. iPfam: a database of protein family and domain interactions found in the Protein Data Bank.

    Science.gov (United States)

    Finn, Robert D; Miller, Benjamin L; Clements, Jody; Bateman, Alex

    2014-01-01

    The database iPfam, available at http://ipfam.org, catalogues Pfam domain interactions based on known 3D structures that are found in the Protein Data Bank, providing interaction data at the molecular level. Previously, the iPfam domain-domain interaction data was integrated within the Pfam database and website, but it has now been migrated to a separate database. This allows for independent development, improving data access and giving clearer separation between the protein family and interactions datasets. In addition to domain-domain interactions, iPfam has been expanded to include interaction data for domain bound small molecule ligands. Functional annotations are provided from source databases, supplemented by the incorporation of Wikipedia articles where available. iPfam (version 1.0) contains >9500 domain-domain and 15 500 domain-ligand interactions. The new website provides access to this data in a variety of ways, including interactive visualizations of the interaction data.

  3. MEGADOCK-Web: an integrated database of high-throughput structure-based protein-protein interaction predictions.

    Science.gov (United States)

    Hayashi, Takanori; Matsuzaki, Yuri; Yanagisawa, Keisuke; Ohue, Masahito; Akiyama, Yutaka

    2018-05-08

    Protein-protein interactions (PPIs) play several roles in living cells, and computational PPI prediction is a major focus of many researchers. The three-dimensional (3D) structure and binding surface are important for the design of PPI inhibitors. Therefore, rigid body protein-protein docking calculations for two protein structures are expected to allow elucidation of PPIs different from known complexes in terms of 3D structures because known PPI information is not explicitly required. We have developed rapid PPI prediction software based on protein-protein docking, called MEGADOCK. In order to fully utilize the benefits of computational PPI predictions, it is necessary to construct a comprehensive database to gather prediction results and their predicted 3D complex structures and to make them easily accessible. Although several databases exist that provide predicted PPIs, the previous databases do not contain a sufficient number of entries for the purpose of discovering novel PPIs. In this study, we constructed an integrated database of MEGADOCK PPI predictions, named MEGADOCK-Web. MEGADOCK-Web provides more than 10 times the number of PPI predictions than previous databases and enables users to conduct PPI predictions that cannot be found in conventional PPI prediction databases. In MEGADOCK-Web, there are 7528 protein chains and 28,331,628 predicted PPIs from all possible combinations of those proteins. Each protein structure is annotated with PDB ID, chain ID, UniProt AC, related KEGG pathway IDs, and known PPI pairs. Additionally, MEGADOCK-Web provides four powerful functions: 1) searching precalculated PPI predictions, 2) providing annotations for each predicted protein pair with an experimentally known PPI, 3) visualizing candidates that may interact with the query protein on biochemical pathways, and 4) visualizing predicted complex structures through a 3D molecular viewer. MEGADOCK-Web provides a huge amount of comprehensive PPI predictions based on

  4. Sequence-specific capture of protein-DNA complexes for mass spectrometric protein identification.

    Directory of Open Access Journals (Sweden)

    Cheng-Hsien Wu

    Full Text Available The regulation of gene transcription is fundamental to the existence of complex multicellular organisms such as humans. Although it is widely recognized that much of gene regulation is controlled by gene-specific protein-DNA interactions, there presently exists little in the way of tools to identify proteins that interact with the genome at locations of interest. We have developed a novel strategy to address this problem, which we refer to as GENECAPP, for Global ExoNuclease-based Enrichment of Chromatin-Associated Proteins for Proteomics. In this approach, formaldehyde cross-linking is employed to covalently link DNA to its associated proteins; subsequent fragmentation of the DNA, followed by exonuclease digestion, produces a single-stranded region of the DNA that enables sequence-specific hybridization capture of the protein-DNA complex on a solid support. Mass spectrometric (MS analysis of the captured proteins is then used for their identification and/or quantification. We show here the development and optimization of GENECAPP for an in vitro model system, comprised of the murine insulin-like growth factor-binding protein 1 (IGFBP1 promoter region and FoxO1, a member of the forkhead rhabdomyosarcoma (FoxO subfamily of transcription factors, which binds specifically to the IGFBP1 promoter. This novel strategy provides a powerful tool for studies of protein-DNA and protein-protein interactions.

  5. A structural study for the optimisation of functional motifs encoded in protein sequences

    Directory of Open Access Journals (Sweden)

    Helmer-Citterich Manuela

    2004-04-01

    Full Text Available Abstract Background A large number of PROSITE patterns select false positives and/or miss known true positives. It is possible that – at least in some cases – the weak specificity and/or sensitivity of a pattern is due to the fact that one, or maybe more, functional and/or structural key residues are not represented in the pattern. Multiple sequence alignments are commonly used to build functional sequence patterns. If residues structurally conserved in proteins sharing a function cannot be aligned in a multiple sequence alignment, they are likely to be missed in a standard pattern construction procedure. Results Here we present a new procedure aimed at improving the sensitivity and/ or specificity of poorly-performing patterns. The procedure can be summarised as follows: 1. residues structurally conserved in different proteins, that are true positives for a pattern, are identified by means of a computational technique and by visual inspection. 2. the sequence positions of the structurally conserved residues falling outside the pattern are used to build extended sequence patterns. 3. the extended patterns are optimised on the SWISS-PROT database for their sensitivity and specificity. The method was applied to eight PROSITE patterns. Whenever structurally conserved residues are found in the surface region close to the pattern (seven out of eight cases, the addition of information inferred from structural analysis is shown to improve pattern selectivity and in some cases selectivity and sensitivity as well. In some of the cases considered the procedure allowed the identification of functionally interesting residues, whose biological role is also discussed. Conclusion Our method can be applied to any type of functional motif or pattern (not only PROSITE ones which is not able to select all and only the true positive hits and for which at least two true positive structures are available. The computational technique for the identification of

  6. Improvements in the HbVar database of human hemoglobin variants and thalassemia mutations for population and sequence variation studies.

    NARCIS (Netherlands)

    G.P. Patrinos (George); B. Giardine (Belinda); C. Riemer (Cathy); W. Miller (Webb); D.H. Chui (David); N.P. Anagnou (Nicholas); H. Wajcman (Henri); R.C. Hardison (Ross)

    2004-01-01

    textabstractHbVar (http://globin.cse.psu.edu/globin/hbvar/) is a relational database developed by a multi-center academic effort to provide up-to-date and high quality information on the genomic sequence changes leading to hemoglobin variants and all types of thalassemia and

  7. Sequence Variation in Rhoptry Neck Protein 10 Gene among Toxoplasma gondii Isolates from Different Hosts and Geographical Locations.

    Science.gov (United States)

    Zhao, Yu; Zhou, Donghui; Chen, Jia; Sun, Xiaolin

    2017-01-01

    Toxoplasma gondii, as a eukaryotic parasite of the phylum Apicomplexa, can infect almost all the warm-blooded animals and humans, causing toxoplasmosis. Rhoptry neck proteins (RONs) play a key role in the invasion process of T. gondii and are potential vaccine candidate molecules against toxoplasmosis. The present study examined sequence variation in the rhoptry neck protein 10 (TgRON10) gene among 10 T. gondii isolates from different hosts and geographical locations from Lanzhou province during 2014, and compared with the corresponding sequences of strains ME49 and VEG obtained from the ToxoDB database, using polymerase chain reaction (PCR) amplification, sequence analysis, and phylogenetic reconstruction by Bayesian inference (BI) and maximum parsimony (MP). Analysis of all the 12 TgRON10 genomic and cDNA sequences revealed 7 exons and 6 introns in the TgRON10 gDNA. The complete genomic sequence of the TgRON10 gene ranged from 4759 bp to 4763 bp, and sequence variation was 0-0.6% among the 12 T. gondii isolates, indicating a low sequence variation in TgRON10 gene. Phylogenetic analysis of TgRON10 sequences showed that the cluster of the 12 T. gondii isolates was not completely consistent with their respective genotypes. TgRON10 gene is not a suitable genetic marker for the differentiation of T. gondii isolates from different hosts and geographical locations, but may represent a potential vaccine candidate against toxoplasmosis, worth further studies.

  8. Design of Protein Multi-specificity Using an Independent Sequence Search Reduces the Barrier to Low Energy Sequences.

    Directory of Open Access Journals (Sweden)

    Alexander M Sevy

    2015-07-01

    Full Text Available Computational protein design has found great success in engineering proteins for thermodynamic stability, binding specificity, or enzymatic activity in a 'single state' design (SSD paradigm. Multi-specificity design (MSD, on the other hand, involves considering the stability of multiple protein states simultaneously. We have developed a novel MSD algorithm, which we refer to as REstrained CONvergence in multi-specificity design (RECON. The algorithm allows each state to adopt its own sequence throughout the design process rather than enforcing a single sequence on all states. Convergence to a single sequence is encouraged through an incrementally increasing convergence restraint for corresponding positions. Compared to MSD algorithms that enforce (constrain an identical sequence on all states the energy landscape is simplified, which accelerates the search drastically. As a result, RECON can readily be used in simulations with a flexible protein backbone. We have benchmarked RECON on two design tasks. First, we designed antibodies derived from a common germline gene against their diverse targets to assess recovery of the germline, polyspecific sequence. Second, we design "promiscuous", polyspecific proteins against all binding partners and measure recovery of the native sequence. We show that RECON is able to efficiently recover native-like, biologically relevant sequences in this diverse set of protein complexes.

  9. Large scale identification and categorization of protein sequences using structured logistic regression.

    Directory of Open Access Journals (Sweden)

    Bjørn P Pedersen

    Full Text Available BACKGROUND: Structured Logistic Regression (SLR is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem. RESULTS: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps on organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases. CONCLUSIONS: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.

  10. An Efficient Approach to Mining Maximal Contiguous Frequent Patterns from Large DNA Sequence Databases

    Directory of Open Access Journals (Sweden)

    Md. Rezaul Karim

    2012-03-01

    Full Text Available Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are preferable for expressing the function and structure of DNA sequences and hence can capture the common data characteristics among related sequences. Biologists are interested in finding frequent orderly arrangements of motifs that are responsible for similar expression of a group of genes. In order to reduce mining time and complexity, however, most existing sequence mining algorithms either focus on finding short DNA sequences or require explicit specification of sequence lengths in advance. The challenge is to find longer sequences without specifying sequence lengths in advance. In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets. The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

  11. Feature Selection and the Class Imbalance Problem in Predicting Protein Function from Sequence

    NARCIS (Netherlands)

    Al-Shahib, A.; Breitling, R.; Gilbert, D.

    2005-01-01

    Abstract: When the standard approach to predict protein function by sequence homology fails, other alternative methods can be used that require only the amino acid sequence for predicting function. One such approach uses machine learning to predict protein function directly from amino acid sequence

  12. Discovering approximate-associated sequence patterns for protein-DNA interactions

    KAUST Repository

    Chan, Tak Ming

    2010-12-30

    Motivation: The bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) are fundamental protein-DNA interactions in transcriptional regulation. Extensive efforts have been made to better understand the protein-DNA interactions. Recent mining on exact TF-TFBS-associated sequence patterns (rules) has shown great potentials and achieved very promising results. However, exact rules cannot handle variations in real data, resulting in limited informative rules. In this article, we generalize the exact rules to approximate ones for both TFs and TFBSs, which are essential for biological variations. Results: A progressive approach is proposed to address the approximation to alleviate the computational requirements. Firstly, similar TFBSs are grouped from the available TF-TFBS data (TRANSFAC database). Secondly, approximate and highly conserved binding cores are discovered from TF sequences corresponding to each TFBS group. A customized algorithm is developed for the specific objective. We discover the approximate TF-TFBS rules by associating the grouped TFBS consensuses and TF cores. The rules discovered are evaluated by matching (verifying with) the actual protein-DNA binding pairs from Protein Data Bank (PDB) 3D structures. The approximate results exhibit many more verified rules and up to 300% better verification ratios than the exact ones. The customized algorithm achieves over 73% better verification ratios than traditional methods. Approximate rules (64-79%) are shown statistically significant. Detailed variation analysis and conservation verification on NCBI records demonstrate that the approximate rules reveal both the flexible and specific protein-DNA interactions accurately. The approximate TF-TFBS rules discovered show great generalized capability of exploring more informative binding rules. © The Author 2010. Published by Oxford University Press. All rights reserved.

  13. A MEMORY EFFICIENT HARDWARE BASED PATTERN MATCHING AND PROTEIN ALIGNMENT SCHEMES FOR HIGHLY COMPLEX DATABASES

    OpenAIRE

    Bennet, M.Anto; Sankaranarayanan, S.; Deepika, M.; Nanthini, N.; Bhuvaneshwari, S.; Priyanka, M.

    2017-01-01

    Protein sequence alignment to find correlation between different species, or genetic mutations etc. is the most computational intensive task when performing protein comparison. To speed-up the alignment, Systolic Arrays (SAs) have been used. In order to avoid the internal-loop problem which reduces the performance, pipeline interleaving strategy has been presented. This strategy is applied to an SA for Smith Waterman (SW) algorithm which is an alignment algorithm to locally align two proteins...

  14. Designing sequence to control protein function in an EF-hand protein.

    Science.gov (United States)

    Bunick, Christopher G; Nelson, Melanie R; Mangahas, Sheryll; Hunter, Michael J; Sheehan, Jonathan H; Mizoue, Laura S; Bunick, Gerard J; Chazin, Walter J

    2004-05-19

    The extent of conformational change that calcium binding induces in EF-hand proteins is a key biochemical property specifying Ca(2+) sensor versus signal modulator function. To understand how differences in amino acid sequence lead to differences in the response to Ca(2+) binding, comparative analyses of sequence and structures, combined with model building, were used to develop hypotheses about which amino acid residues control Ca(2+)-induced conformational changes. These results were used to generate a first design of calbindomodulin (CBM-1), a calbindin D(9k) re-engineered with 15 mutations to respond to Ca(2+) binding with a conformational change similar to that of calmodulin. The gene for CBM-1 was synthesized, and the protein was expressed and purified. Remarkably, this protein did not exhibit any non-native-like molten globule properties despite the large number of mutations and the nonconservative nature of some of them. Ca(2+)-induced changes in CD intensity and in the binding of the hydrophobic probe, ANS, implied that CBM-1 does undergo Ca(2+) sensorlike conformational changes. The X-ray crystal structure of Ca(2+)-CBM-1 determined at 1.44 A resolution reveals the anticipated increase in hydrophobic surface area relative to the wild-type protein. A nascent calmodulin-like hydrophobic docking surface was also found, though it is occluded by the inter-EF-hand loop. The results from this first calbindomodulin design are discussed in terms of progress toward understanding the relationships between amino acid sequence, protein structure, and protein function for EF-hand CaBPs, as well as the additional mutations for the next CBM design.

  15. CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction

    KAUST Repository

    Cui, Xuefeng; Lu, Zhiwu; Wang, Sheng; Jing-Yan Wang, Jim; Gao, Xin

    2016-01-01

    Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment

  16. HIP2: An online database of human plasma proteins from healthy individuals

    Directory of Open Access Journals (Sweden)

    Shen Changyu

    2008-04-01

    Full Text Available Abstract Background With the introduction of increasingly powerful mass spectrometry (MS techniques for clinical research, several recent large-scale MS proteomics studies have sought to characterize the entire human plasma proteome with a general objective for identifying thousands of proteins leaked from tissues in the circulating blood. Understanding the basic constituents, diversity, and variability of the human plasma proteome is essential to the development of sensitive molecular diagnosis and treatment monitoring solutions for future biomedical applications. Biomedical researchers today, however, do not have an integrated online resource in which they can search for plasma proteins collected from different mass spectrometry platforms, experimental protocols, and search software for healthy individuals. The lack of such a resource for comparisons has made it difficult to interpret proteomics profile changes in patients' plasma and to design protein biomarker discovery experiments. Description To aid future protein biomarker studies of disease and health from human plasma, we developed an online database, HIP2 (Healthy Human Individual's Integrated Plasma Proteome. The current version contains 12,787 protein entries linked to 86,831 peptide entries identified using different MS platforms. Conclusion This web-based database will be useful to biomedical researchers involved in biomarker discovery research. This database has been developed to be the comprehensive collection of healthy human plasma proteins, and has protein data captured in a relational database schema built to contain mappings of supporting peptide evidence from several high-quality and high-throughput mass-spectrometry (MS experimental data sets. Users can search for plasma protein/peptide annotations, peptide/protein alignments, and experimental/sample conditions with options for filter-based retrieval to achieve greater analytical power for discovery and validation.

  17. FishPathogens.eu/vhsv: a user-friendly viral haemorrhagic septicaemia virus isolate and sequence database

    DEFF Research Database (Denmark)

    Jonstrup, Søren Peter; Gray, Tanya; Kahns, Søren

    2009-01-01

    A database has been created, http://www.Fish Pathogens.eu, with the aim of providing a single repository for collating important information on significant pathogens of aquaculture, relevant to their control and management. This database will be developed, maintained and managed as part of the Eu......A database has been created, http://www.Fish Pathogens.eu, with the aim of providing a single repository for collating important information on significant pathogens of aquaculture, relevant to their control and management. This database will be developed, maintained and managed as part...... of the European Community Reference Laboratory for Fish Diseases function. This concept has been initially developed for viral haemorrhagic septicaemia virus and will be extended in future to include information on other significant aquaculture pathogens. Information included for each isolate comprises sequence...... to obtain data from any selected part of the genome of interest. The output of the sequence search can be readily retrieved as a FASTA file ready to be imported into a sequence alignment tool of choice, facilitating further molecular epidemiological study....

  18. The SDH mutation database: an online resource for succinate dehydrogenase sequence variants involved in pheochromocytoma, paraganglioma and mitochondrial complex II deficiency

    Directory of Open Access Journals (Sweden)

    Devilee Peter

    2005-11-01

    Full Text Available Abstract Background The SDHA, SDHB, SDHC and SDHD genes encode the subunits of succinate dehydrogenase (succinate: ubiquinone oxidoreductase, a component of both the Krebs cycle and the mitochondrial respiratory chain. SDHA, a flavoprotein and SDHB, an iron-sulfur protein together constitute the catalytic domain, while SDHC and SDHD encode membrane anchors that allow the complex to participate in the respiratory chain as complex II. Germline mutations of SDHD and SDHB are a major cause of the hereditary forms of the tumors paraganglioma and pheochromocytoma. The largest subunit, SDHA, is mutated in patients with Leigh syndrome and late-onset optic atrophy, but has not as yet been identified as a factor in hereditary cancer. Description The SDH mutation database is based on the recently described Leiden Open (source Variation Database (LOVD system. The variants currently described in the database were extracted from the published literature and in some cases annotated to conform to current mutation nomenclature. Researchers can also directly submit new sequence variants online. Since the identification of SDHD, SDHC, and SDHB as classic tumor suppressor genes in 2000 and 2001, studies from research groups around the world have identified a total of 120 variants. Here we introduce all reported paraganglioma and pheochromocytoma related sequence variations in these genes, in addition to all reported mutations of SDHA. The database is now accessible online. Conclusion The SDH mutation database offers a valuable tool and resource for clinicians involved in the treatment of patients with paraganglioma-pheochromocytoma, clinical geneticists needing an overview of current knowledge, and geneticists and other researchers needing a solid foundation for further exploration of both these tumor syndromes and SDHA-related phenotypes.

  19. Nucleotide sequence of the coat protein gene of Lettuce big-vein virus.

    Science.gov (United States)

    Sasaya, T; Ishikawa, K; Koganezawa, H

    2001-06-01

    A sequence of 1425 nt was established that included the complete coat protein (CP) gene of Lettuce big-vein virus (LBVV). The LBVV CP gene encodes a 397 amino acid protein with a predicted M(r) of 44486. Antisera raised against synthetic peptides corresponding to N-terminal or C-terminal parts of the LBVV CP reacted in Western blot analysis with a protein with an M(r) of about 48000. RNA extracted from purified particles of LBVV by using proteinase K, SDS and phenol migrated in gels as two single-stranded RNA species of approximately 7.3 kb (ss-1) and 6.6 kb (ss-2). After denaturation by heat and annealing at room temperature, the RNA migrated as four species, ss-1, ss-2 and two additional double-stranded RNAs (ds-1 and ds-2). The Northern blot hybridization analysis using riboprobes from a full-length clone of the LBVV CP gene indicated that ss-2 has a negative-sense nature and contains the LBVV CP gene. Moreover, ds-2 is a double-stranded form of ss-2. Database searches showed that the LBVV CP most resembled the nucleocapsid proteins of rhabdoviruses. These results indicate that it would be appropriate to classify LBVV as a negative-sense single-stranded RNA virus rather than as a double-stranded RNA virus.

  20. A scalable double-barcode sequencing platform for characterization of dynamic protein-protein interactions.

    Science.gov (United States)

    Schlecht, Ulrich; Liu, Zhimin; Blundell, Jamie R; St Onge, Robert P; Levy, Sasha F

    2017-05-25

    Several large-scale efforts have systematically catalogued protein-protein interactions (PPIs) of a cell in a single environment. However, little is known about how the protein interactome changes across environmental perturbations. Current technologies, which assay one PPI at a time, are too low throughput to make it practical to study protein interactome dynamics. Here, we develop a highly parallel protein-protein interaction sequencing (PPiSeq) platform that uses a novel double barcoding system in conjunction with the dihydrofolate reductase protein-fragment complementation assay in Saccharomyces cerevisiae. PPiSeq detects PPIs at a rate that is on par with current assays and, in contrast with current methods, quantitatively scores PPIs with enough accuracy and sensitivity to detect changes across environments. Both PPI scoring and the bulk of strain construction can be performed with cell pools, making the assay scalable and easily reproduced across environments. PPiSeq is therefore a powerful new tool for large-scale investigations of dynamic PPIs.

  1. Functional and Structural Overview of G-Protein-Coupled Receptors Comprehensively Obtained from Genome Sequences

    Directory of Open Access Journals (Sweden)

    Makiko Suwa

    2011-04-01

    Full Text Available An understanding of the functional mechanisms of G-protein-coupled receptors (GPCRs is very important for GPCR-related drug design. We have developed an integrated GPCR database (SEVENS http://sevens.cbrc.jp/ that includes 64,090 reliable GPCR genes comprehensively identified from 56 eukaryote genome sequences, and overviewed the sequences and structure spaces of the GPCRs. In vertebrates, the number of receptors for biological amines, peptides, etc. is conserved in most species, whereas the number of chemosensory receptors for odorant, pheromone, etc. significantly differs among species. The latter receptors tend to be single exon type or a few exon type and show a high ratio in the numbers of GPCRs, whereas some families, such as Class B and Class C receptors, have long lengths due to the presence of many exons. Statistical analyses of amino acid residues reveal that most of the conserved residues in Class A GPCRs are found in the cytoplasmic half regions of transmembrane (TM helices, while residues characteristic to each subfamily found on the extracellular half regions. The 69 of Protein Data Bank (PDB entries of complete or fragmentary structures could be mapped on the TM/loop regions of Class A GPCRs covering 14 subfamilies.

  2. Top-Down-Assisted Bottom-Up Method for Homologous Protein Sequencing: Hemoglobin from 33 Bird Species

    Science.gov (United States)

    Song, Yang; Laskay, Ünige A.; Vilcins, Inger-Marie E.; Barbour, Alan G.; Wysocki, Vicki H.

    2015-11-01

    Ticks are vectors for disease transmission because they are indiscriminant in their feeding on multiple vertebrate hosts, transmitting pathogens between their hosts. Identifying the hosts on which ticks have fed is important for disease prevention and intervention. We have previously shown that hemoglobin (Hb) remnants from a host on which a tick fed can be used to reveal the host's identity. For the present research, blood was collected from 33 bird species that are common in the U.S. as hosts for ticks but that have unknown Hb sequences. A top-down-assisted bottom-up mass spectrometry approach with a customized searching database, based on variability in known bird hemoglobin sequences, has been devised to facilitate fast and complete sequencing of hemoglobin from birds with unknown sequences. These hemoglobin sequences will be added to a hemoglobin database and used for tick host identification. The general approach has the potential to sequence any set of homologous proteins completely in a rapid manner.

  3. SPiCE : A web-based tool for sequence-based protein classification and exploration

    NARCIS (Netherlands)

    Van den Berg, B.A.; Reinders, M.J.; Roubos, J.A.; De Ridder, D.

    2014-01-01

    Background Amino acid sequences and features extracted from such sequences have been used to predict many protein properties, such as subcellular localization or solubility, using classifier algorithms. Although software tools are available for both feature extraction and classifier construction,

  4. BtoxDB: a comprehensive database of protein structural data on toxin-antitoxin systems.

    Science.gov (United States)

    Barbosa, Luiz Carlos Bertucci; Garrido, Saulo Santesso; Marchetto, Reinaldo

    2015-03-01

    Toxin-antitoxin (TA) systems are diverse and abundant genetic modules in prokaryotic cells that are typically formed by two genes encoding a stable toxin and a labile antitoxin. Because TA systems are able to repress growth or kill cells and are considered to be important actors in cell persistence (multidrug resistance without genetic change), these modules are considered potential targets for alternative drug design. In this scenario, structural information for the proteins in these systems is highly valuable. In this report, we describe the development of a web-based system, named BtoxDB, that stores all protein structural data on TA systems. The BtoxDB database was implemented as a MySQL relational database using PHP scripting language. Web interfaces were developed using HTML, CSS and JavaScript. The data were collected from the PDB, UniProt and Entrez databases. These data were appropriately filtered using specialized literature and our previous knowledge about toxin-antitoxin systems. The database provides three modules ("Search", "Browse" and "Statistics") that enable searches, acquisition of contents and access to statistical data. Direct links to matching external databases are also available. The compilation of all protein structural data on TA systems in one platform is highly useful for researchers interested in this content. BtoxDB is publicly available at http://www.gurupi.uft.edu.br/btoxdb. Copyright © 2015 Elsevier Ltd. All rights reserved.

  5. Formation of a Multiple Protein Complex on the Adenovirus Packaging Sequence by the IVa2 Protein▿

    OpenAIRE

    Tyler, Ryan E.; Ewing, Sean G.; Imperiale, Michael J.

    2007-01-01

    During adenovirus virion assembly, the packaging sequence mediates the encapsidation of the viral genome. This sequence is composed of seven functional units, termed A repeats. Recent evidence suggests that the adenovirus IVa2 protein binds the packaging sequence and is involved in packaging of the genome. Study of the IVa2-packaging sequence interaction has been hindered by difficulty in purifying the protein produced in virus-infected cells or by recombinant techniques. We report the first ...

  6. LoopX: A Graphical User Interface-Based Database for Comprehensive Analysis and Comparative Evaluation of Loops from Protein Structures.

    Science.gov (United States)

    Kadumuri, Rajashekar Varma; Vadrevu, Ramakrishna

    2017-10-01

    Due to their crucial role in function, folding, and stability, protein loops are being targeted for grafting/designing to create novel or alter existing functionality and improve stability and foldability. With a view to facilitate a thorough analysis and effectual search options for extracting and comparing loops for sequence and structural compatibility, we developed, LoopX a comprehensively compiled library of sequence and conformational features of ∼700,000 loops from protein structures. The database equipped with a graphical user interface is empowered with diverse query tools and search algorithms, with various rendering options to visualize the sequence- and structural-level information along with hydrogen bonding patterns, backbone φ, ψ dihedral angles of both the target and candidate loops. Two new features (i) conservation of the polar/nonpolar environment and (ii) conservation of sequence and conformation of specific residues within the loops have also been incorporated in the search and retrieval of compatible loops for a chosen target loop. Thus, the LoopX server not only serves as a database and visualization tool for sequence and structural analysis of protein loops but also aids in extracting and comparing candidate loops for a given target loop based on user-defined search options.

  7. CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data

    DEFF Research Database (Denmark)

    Hallin, Peter Fischer; Ussery, David

    2004-01-01

    , these results counts to more than 220 pieces of information. The backbone of this solution consists of a program package written in Perl, which enables administrators to synchronize and update the database content. The MySQL database has been connected to the CBS web-server via PHP4, to present a dynamic web...... and frequent addition of new models are factors that require a dynamic database layout. Using basic tools like the GNU Make system, csh, Perl and MySQL, we have created a flexible database environment for storing and maintaining such results for a collection of complete microbial genomes. Currently...... content for users outside the center. This solution is tightly fitted to existing server infrastructure and the solutions proposed here can perhaps serve as a template for other research groups to solve database issues....

  8. PDTD: a web-accessible protein database for drug target identification

    Directory of Open Access Journals (Sweden)

    Gao Zhenting

    2008-02-01

    Full Text Available Abstract Background Target identification is important for modern drug discovery. With the advances in the development of molecular docking, potential binding proteins may be discovered by docking a small molecule to a repository of proteins with three-dimensional (3D structures. To complete this task, a reverse docking program and a drug target database with 3D structures are necessary. To this end, we have developed a web server tool, TarFisDock (Target Fishing Docking http://www.dddc.ac.cn/tarfisdock, which has been used widely by others. Recently, we have constructed a protein target database, Potential Drug Target Database (PDTD, and have integrated PDTD with TarFisDock. This combination aims to assist target identification and validation. Description PDTD is a web-accessible protein database for in silico target identification. It currently contains >1100 protein entries with 3D structures presented in the Protein Data Bank. The data are extracted from the literatures and several online databases such as TTD, DrugBank and Thomson Pharma. The database covers diverse information of >830 known or potential drug targets, including protein and active sites structures in both PDB and mol2 formats, related diseases, biological functions as well as associated regulating (signaling pathways. Each target is categorized by both nosology and biochemical function. PDTD supports keyword search function, such as PDB ID, target name, and disease name. Data set generated by PDTD can be viewed with the plug-in of molecular visualization tools and also can be downloaded freely. Remarkably, PDTD is specially designed for target identification. In conjunction with TarFisDock, PDTD can be used to identify binding proteins for small molecules. The results can be downloaded in the form of mol2 file with the binding pose of the probe compound and a list of potential binding targets according to their ranking scores. Conclusion PDTD serves as a comprehensive and

  9. STITCH 2: an interaction network database for small molecules and proteins

    DEFF Research Database (Denmark)

    Kuhn, Michael; Szklarczyk, Damian; Franceschini, Andrea

    2010-01-01

    Over the last years, the publicly available knowledge on interactions between small molecules and proteins has been steadily increasing. To create a network of interactions, STITCH aims to integrate the data dispersed over the literature and various databases of biological pathways, drug......-target relationships and binding affinities. In STITCH 2, the number of relevant interactions is increased by incorporation of BindingDB, PharmGKB and the Comparative Toxicogenomics Database. The resulting network can be explored interactively or used as the basis for large-scale analyses. To facilitate links to other...... chemical databases, we adopt InChIKeys that allow identification of chemicals with a short, checksum-like string. STITCH 2.0 connects proteins from 630 organisms to over 74,000 different chemicals, including 2200 drugs. STITCH can be accessed at http://stitch.embl.de/....

  10. Interleukin-1beta induced changes in the protein expression of rat islets: a computerized database

    DEFF Research Database (Denmark)

    Andersen, H U; Fey, S J; Larsen, Peter Mose

    1997-01-01

    as well as the intracellular mechanisms of action of interleukin 1-mediated beta-cell cytotoxicity are unknown. However, previous studies have found an association of beta-cell destruction with alterations in protein synthesis. Thus, two-dimensional (2-D) gel electrophoresis of pancreatic islet proteins...... may be an important tool facilitating studies of the molecular pathogenesis of insulin-dependent diabetes mellitus. 2-D gel electrophoresis of islet proteins may lead to (i) the determination of qualitative and quantitative changes in specific islet proteins induced by cytokines, (ii......) the determination of the effects of agents modulating cytokine action, and (iii) the identification of primary islet protein antigen(s) initiating the immune destruction of the beta-cells. Therefore, the aim of this study was to create databases (DB) of all reproducibly detectable protein spots on 10% and 15...

  11. Characterization of new Schistosoma mansoni microsatellite loci in sequences obtained from public DNA databases and microsatellite enriched genomic libraries

    Directory of Open Access Journals (Sweden)

    Rodrigues NB

    2002-01-01

    Full Text Available In the last decade microsatellites have become one of the most useful genetic markers used in a large number of organisms due to their abundance and high level of polymorphism. Microsatellites have been used for individual identification, paternity tests, forensic studies and population genetics. Data on microsatellite abundance comes preferentially from microsatellite enriched libraries and DNA sequence databases. We have conducted a search in GenBank of more than 16,000 Schistosoma mansoni ESTs and 42,000 BAC sequences. In addition, we obtained 300 sequences from CA and AT microsatellite enriched genomic libraries. The sequences were searched for simple repeats using the RepeatMasker software. Of 16,022 ESTs, we detected 481 (3% sequences that contained 622 microsatellites (434 perfect, 164 imperfect and 24 compounds. Of the 481 ESTs, 194 were grouped in 63 clusters containing 2 to 15 ESTs per cluster. Polymorphisms were observed in 16 clusters. The 287 remaining ESTs were orphan sequences. Of the 42,017 BAC end sequences, 1,598 (3.8% contained microsatellites (2,335 perfect, 287 imperfect and 79 compounds. The 1,598 BAC end sequences 80 were grouped into 17 clusters containing 3 to 17 BAC end sequences per cluster. Microsatellites were present in 67 out of 300 sequences from microsatellite enriched libraries (55 perfect, 38 imperfect and 15 compounds. From all of the observed loci 55 were selected for having the longest perfect repeats and flanking regions that allowed the design of primers for PCR amplification. Additionally we describe two new polymorphic microsatellite loci.

  12. Accelerating Smith-Waterman Alignment for Protein Database Search Using Frequency Distance Filtration Scheme Based on CPU-GPU Collaborative System.

    Science.gov (United States)

    Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun

    2015-01-01

    The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.

  13. Strategies in protein sequencing and characterization: Multi-enzyme digestion coupled with alternate CID/ETD tandem mass spectrometry

    Energy Technology Data Exchange (ETDEWEB)

    Nardiello, Donatella; Palermo, Carmen, E-mail: carmen.palermo@unifg.it; Natale, Anna; Quinto, Maurizio; Centonze, Diego

    2015-01-07

    Highlights: • Multi-enzyme digestion for protein sequencing and characterization by CID/ETD. • Simultaneous use of trypsin/chymotrypsin for the maximization of sequence. • Identification of PTMs, sequence variants and species-specific residues. • Increase of accuracy in sequence assignments by orthogonal fragmentation techniques. - Abstract: A strategy based on a simultaneous multi-enzyme digestion coupled with electron transfer dissociation (ETD) and collision-induced dissociation (CID) was developed for protein sequencing and characterization, as a valid alternative platform in ion-trap based proteomics. The effect of different proteolytic procedures using chymotrypsin, trypsin, a combination of both, and Lys-C, was carefully evaluated in terms of number of identified peptides, protein coverage, and score distribution. A systematic comparison between CID and ETD is shown for the analysis of peptides originating from the in-solution digestion of standard caseins. The best results were achieved with a trypsin/chymotrypsin mix combined with CID and ETD operating in alternating mode. A post-database search validation of MS/MS dataset was performed, then, the matched peptides were cross checked by the evaluation of ion scores, rank, number of experimental product ions, and their relative abundances in the MS/MS spectrum. By integrated CID/ETD experiments, high quality-spectra have been obtained, thus allowing a confirmation of spectral information and an increase of accuracy in peptide sequence assignments. Overlapping peptides, produced throughout the proteins, reduce the ambiguity in mapping modifications between natural variants and animal species, and allow the characterization of post translational modifications. The advantages of using the enzymatic mix trypsin/chymotrypsin were confirmed by the nanoLC and CID/ETD tandem mass spectrometry of goat milk proteins, previously separated by two-dimensional gel electrophoresis.

  14. Elman RNN based classification of proteins sequences on account of their mutual information.

    Science.gov (United States)

    Mishra, Pooja; Nath Pandey, Paras

    2012-10-21

    In the present work we have employed the method of estimating residue correlation within the protein sequences, by using the mutual information (MI) of adjacent residues, based on structural and solvent accessibility properties of amino acids. The long range correlation between nonadjacent residues is improved by constructing a mutual information vector (MIV) for a single protein sequence, like this each protein sequence is associated with its corresponding MIVs. These MIVs are given to Elman RNN to obtain the classification of protein sequences. The modeling power of MIV was shown to be significantly better, giving a new approach towards alignment free classification of protein sequences. We also conclude that sequence structural and solvent accessible property based MIVs are better predictor. Copyright © 2012 Elsevier Ltd. All rights reserved.

  15. Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases

    Directory of Open Access Journals (Sweden)

    Ma'ayan Avi

    2007-10-01

    Full Text Available Abstract Background In recent years, mammalian protein-protein interaction network databases have been developed. The interactions in these databases are either extracted manually from low-throughput experimental biomedical research literature, extracted automatically from literature using techniques such as natural language processing (NLP, generated experimentally using high-throughput methods such as yeast-2-hybrid screens, or interactions are predicted using an assortment of computational approaches. Genes or proteins identified as significantly changing in proteomic experiments, or identified as susceptibility disease genes in genomic studies, can be placed in the context of protein interaction networks in order to assign these genes and proteins to pathways and protein complexes. Results Genes2Networks is a software system that integrates the content of ten mammalian interaction network datasets. Filtering techniques to prune low-confidence interactions were implemented. Genes2Networks is delivered as a web-based service using AJAX. The system can be used to extract relevant subnetworks created from "seed" lists of human Entrez gene symbols. The output includes a dynamic linkable three color web-based network map, with a statistical analysis report that identifies significant intermediate nodes used to connect the seed list. Conclusion Genes2Networks is powerful web-based software that can help experimental biologists to interpret lists of genes and proteins such as those commonly produced through genomic and proteomic experiments, as well as lists of genes and proteins associated with disease processes. This system can be used to find relationships between genes and proteins from seed lists, and predict additional genes or proteins that may play key roles in common pathways or protein complexes.

  16. JACOP: A simple and robust method for the automated classification of protein sequences with modular architecture

    Directory of Open Access Journals (Sweden)

    Pagni Marco

    2005-08-01

    Full Text Available Abstract Background Whole-genome sequencing projects are rapidly producing an enormous number of new sequences. Consequently almost every family of proteins now contains hundreds of members. It has thus become necessary to develop tools, which classify protein sequences automatically and also quickly and reliably. The difficulty of this task is intimately linked to the mechanism by which protein sequences diverge, i.e. by simultaneous residue substitutions, insertions and/or deletions and whole domain reorganisations (duplications/swapping/fusion. Results Here we present a novel approach, which is based on random sampling of sub-sequences (probes out of a set of input sequences. The probes are compared to the input sequences, after a normalisation step; the results are used to partition the input sequences into homogeneous groups of proteins. In addition, this method provides information on diagnostic parts of the proteins. The performance of this method is challenged by two data sets. The first one contains the sequences of prokaryotic lyases that could be arranged as a multiple sequence alignment. The second one contains all proteins from Swiss-Prot Release 36 with at least one Src homology 2 (SH2 domain – a classical example for proteins with modular architecture. Conclusion The outcome of our method is robust, highly reproducible as shown using bootstrap and resampling validation procedures. The results are essentially coherent with the biology. This method depends solely on well-established publicly available software and algorithms.

  17. Analysis of long-range correlation in sequences data of proteins

    OpenAIRE

    ADRIANA ISVORAN; LAURA UNIPAN; DANA CRACIUN; VASILE MORARIU

    2007-01-01

    The results presented here suggest the existence of correlations in the sequence data of proteins. 32 proteins, both globular and fibrous, both monomeric and polymeric, were analyzed. The primary structures of these proteins were treated as time series. Three spatial series of data for each sequence of a protein were generated from numerical correspondences between each amino acid and a physical property associated with it, i.e., its electric charge, its polar character and its dipole moment....

  18. Sequence-based prediction of protein protein interaction using a deep-learning algorithm.

    Science.gov (United States)

    Sun, Tanlin; Zhou, Bo; Lai, Luhua; Pei, Jianfeng

    2017-05-25

    Protein-protein interactions (PPIs) are critical for many biological processes. It is therefore important to develop accurate high-throughput methods for identifying PPI to better understand protein function, disease occurrence, and therapy design. Though various computational methods for predicting PPI have been developed, their robustness for prediction with external datasets is unknown. Deep-learning algorithms have achieved successful results in diverse areas, but their effectiveness for PPI prediction has not been tested. We used a stacked autoencoder, a type of deep-learning algorithm, to study the sequence-based PPI prediction. The best model achieved an average accuracy of 97.19% with 10-fold cross-validation. The prediction accuracies for various external datasets ranged from 87.99% to 99.21%, which are superior to those achieved with previous methods. To our knowledge, this research is the first to apply a deep-learning algorithm to sequence-based PPI prediction, and the results demonstrate its potential in this field.

  19. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM

    Directory of Open Access Journals (Sweden)

    Yunyun Liang

    2015-01-01

    Full Text Available Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM. Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS, segmented PsePSSM, and segmented autocovariance transformation (ACT based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640 are adopted in this paper. Then a 700-dimensional (700D feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA. To verify the performance of our method, rigorous jackknife cross-validation tests are performed on 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves the favorable and competitive performance. This will offer an important complementary to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences.

  20. Camps 2.0: exploring the sequence and structure space of prokaryotic, eukaryotic, and viral membrane proteins.

    Science.gov (United States)

    Neumann, Sindy; Hartmann, Holger; Martin-Galiano, Antonio J; Fuchs, Angelika; Frishman, Dmitrij

    2012-03-01

    Structural bioinformatics of membrane proteins is still in its infancy, and the picture of their fold space is only beginning to emerge. Because only a handful of three-dimensional structures are available, sequence comparison and structure prediction remain the main tools for investigating sequence-structure relationships in membrane protein families. Here we present a comprehensive analysis of the structural families corresponding to α-helical membrane proteins with at least three transmembrane helices. The new version of our CAMPS database (CAMPS 2.0) covers nearly 1300 eukaryotic, prokaryotic, and viral genomes. Using an advanced classification procedure, which is based on high-order hidden Markov models and considers both sequence similarity as well as the number of transmembrane helices and loop lengths, we identified 1353 structurally homogeneous clusters roughly corresponding to membrane protein folds. Only 53 clusters are associated with experimentally determined three-dimensional structures, and for these clusters CAMPS is in reasonable agreement with structure-based classification approaches such as SCOP and CATH. We therefore estimate that ∼1300 structures would need to be determined to provide a sufficient structural coverage of polytopic membrane proteins. CAMPS 2.0 is available at http://webclu.bio.wzw.tum.de/CAMPS2.0/. Copyright © 2011 Wiley Periodicals, Inc.

  1. Construction and analysis of a plant non-specific lipid transfer protein database (nsLTPDB

    Directory of Open Access Journals (Sweden)

    Wang Nai-Jyuan

    2012-01-01

    Full Text Available Abstract Background Plant non-specific lipid transfer proteins (nsLTPs are small and basic proteins. Recently, nsLTPs have been reported involved in many physiological functions such as mediating phospholipid transfer, participating in plant defence activity against bacterial and fungal pathogens, and enhancing cell wall extension in tobacco. However, the lipid transfer mechanism of nsLTPs is still unclear, and comprehensive information of nsLTPs is difficult to obtain. Methods In this study, we identified 595 nsLTPs from 121 different species and constructed an nsLTPs database -- nsLTPDB -- which comprises the sequence information, structures, relevant literatures, and biological data of all plant nsLTPs http://nsltpdb.life.nthu.edu.tw/. Results Meanwhile, bioinformatics and statistics methods were implemented to develop a classification method for nsLTPs based on the patterns of the eight highly-conserved cysteine residues, and to suggest strict Prosite-styled patterns for Type I and Type II nsLTPs. The pattern of Type I is C X2 V X5-7 C [V, L, I] × Y [L, A, V] X8-13 CC × G X12 D × [Q, K, R] X2 CXC X16-21 P X2 C X13-15C, and that of Type II is C X4 L X2 C X9-11 P [S, T] X2 CC X5 Q X2-4 C[L, F]C X2 [A, L, I] × [D, N] P X10-12 [K, R] X4-5 C X3-4 P X0-2 C. Moreover, we referred the Prosite-styled patterns to the experimental mutagenesis data that previously established by our group, and found that the residues with higher conservation played an important role in the structural stability or lipid binding ability of nsLTPs. Conclusions Taken together, this research has suggested potential residues that might be essential to modulate the structural and functional properties of plant nsLTPs. Finally, we proposed some biologically important sites of the nsLTPs, which are described by using a new Prosite-styled pattern that we defined.

  2. Construction and analysis of a plant non-specific lipid transfer protein database (nsLTPDB).

    Science.gov (United States)

    Wang, Nai-Jyuan; Lee, Chi-Ching; Cheng, Chao-Sheng; Lo, Wei-Cheng; Yang, Ya-Fen; Chen, Ming-Nan; Lyu, Ping-Chiang

    2012-01-01

    Plant non-specific lipid transfer proteins (nsLTPs) are small and basic proteins. Recently, nsLTPs have been reported involved in many physiological functions such as mediating phospholipid transfer, participating in plant defence activity against bacterial and fungal pathogens, and enhancing cell wall extension in tobacco. However, the lipid transfer mechanism of nsLTPs is still unclear, and comprehensive information of nsLTPs is difficult to obtain. In this study, we identified 595 nsLTPs from 121 different species and constructed an nsLTPs database--nsLTPDB--which comprises the sequence information, structures, relevant literatures, and biological data of all plant nsLTPs http://nsltpdb.life.nthu.edu.tw/. Meanwhile, bioinformatics and statistics methods were implemented to develop a classification method for nsLTPs based on the patterns of the eight highly-conserved cysteine residues, and to suggest strict Prosite-styled patterns for Type I and Type II nsLTPs. The pattern of Type I is C X2 V X5-7 C [V, L, I] × Y [L, A, V] X8-13 CC × G X12 D × [Q, K, R] X2 CXC X16-21 P X2 C X13-15C, and that of Type II is C X4 L X2 C X9-11 P [S, T] X2 CC X5 Q X2-4 C[L, F]C X2 [A, L, I] × [D, N] P X10-12 [K, R] X4-5 C X3-4 P X0-2 C. Moreover, we referred the Prosite-styled patterns to the experimental mutagenesis data that previously established by our group, and found that the residues with higher conservation played an important role in the structural stability or lipid binding ability of nsLTPs. Taken together, this research has suggested potential residues that might be essential to modulate the structural and functional properties of plant nsLTPs. Finally, we proposed some biologically important sites of the nsLTPs, which are described by using a new Prosite-styled pattern that we defined.

  3. GPCR-SSFE: A comprehensive database of G-protein-coupled receptor template predictions and homology models

    Directory of Open Access Journals (Sweden)

    Kreuchwig Annika

    2011-05-01

    Full Text Available Abstract Background G protein-coupled receptors (GPCRs transduce a wide variety of extracellular signals to within the cell and therefore have a key role in regulating cell activity and physiological function. GPCR malfunction is responsible for a wide range of diseases including cancer, diabetes and hyperthyroidism and a large proportion of drugs on the market target these receptors. The three dimensional structure of GPCRs is important for elucidating the molecular mechanisms underlying these diseases and for performing structure-based drug design. Although structural data are restricted to only a handful of GPCRs, homology models can be used as a proxy for those receptors not having crystal structures. However, many researchers working on GPCRs are not experienced homology modellers and are therefore unable to benefit from the information that can be gleaned from such three-dimensional models. Here, we present a comprehensive database called the GPCR-SSFE, which provides initial homology models of the transmembrane helices for a large variety of family A GPCRs. Description Extending on our previous theoretical work, we have developed an automated pipeline for GPCR homology modelling and applied it to a large set of family A GPCR sequences. Our pipeline is a fragment-based approach that exploits available family A crystal structures. The GPCR-SSFE database stores the template predictions, sequence alignments, identified sequence and structure motifs and homology models for 5025 family A GPCRs. Users are able to browse the GPCR dataset according to their pharmacological classification or search for results using a UniProt entry name. It is also possible for a user to submit a GPCR sequence that is not contained in the database for analysis and homology model building. The models can be viewed using a Jmol applet and are also available for download along with the alignments. Conclusions The data provided by GPCR-SSFE are useful for investigating

  4. GPCR-SSFE: a comprehensive database of G-protein-coupled receptor template predictions and homology models.

    Science.gov (United States)

    Worth, Catherine L; Kreuchwig, Annika; Kleinau, Gunnar; Krause, Gerd

    2011-05-23

    G protein-coupled receptors (GPCRs) transduce a wide variety of extracellular signals to within the cell and therefore have a key role in regulating cell activity and physiological function. GPCR malfunction is responsible for a wide range of diseases including cancer, diabetes and hyperthyroidism and a large proportion of drugs on the market target these receptors. The three dimensional structure of GPCRs is important for elucidating the molecular mechanisms underlying these diseases and for performing structure-based drug design. Although structural data are restricted to only a handful of GPCRs, homology models can be used as a proxy for those receptors not having crystal structures. However, many researchers working on GPCRs are not experienced homology modellers and are therefore unable to benefit from the information that can be gleaned from such three-dimensional models. Here, we present a comprehensive database called the GPCR-SSFE, which provides initial homology models of the transmembrane helices for a large variety of family A GPCRs. Extending on our previous theoretical work, we have developed an automated pipeline for GPCR homology modelling and applied it to a large set of family A GPCR sequences. Our pipeline is a fragment-based approach that exploits available family A crystal structures. The GPCR-SSFE database stores the template predictions, sequence alignments, identified sequence and structure motifs and homology models for 5025 family A GPCRs. Users are able to browse the GPCR dataset according to their pharmacological classification or search for results using a UniProt entry name. It is also possible for a user to submit a GPCR sequence that is not contained in the database for analysis and homology model building. The models can be viewed using a Jmol applet and are also available for download along with the alignments. The data provided by GPCR-SSFE are useful for investigating general and detailed sequence-structure-function relationships

  5. Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database

    Directory of Open Access Journals (Sweden)

    Marais Gabriel AB

    2011-07-01

    Full Text Available Abstract Background The genus Silene is widely used as a model system for addressing ecological and evolutionary questions in plants, but advances in using the genus as a model system are impeded by the lack of available resources for studying its genome. Massively parallel sequencing cDNA has recently developed into an efficient method for characterizing the transcriptomes of non-model organisms, generating massive amounts of data that enable the study of multiple species in a comparative framework. The sequences generated provide an excellent resource for identifying expressed genes, characterizing functional variation and developing molecular markers, thereby laying the foundations for future studies on gene sequence and gene expression divergence. Here, we report the results of a comparative transcriptome sequencing study of eight individuals representing four Silene and one Dianthus species as outgroup. All sequences and annotations have been deposited in a newly developed and publicly available database called SiESTa, the Silene EST annotation database. Results A total of 1,041,122 EST reads were generated in two runs on a Roche GS-FLX 454 pyrosequencing platform. EST reads were analyzed separately for all eight individuals sequenced and were assembled into contigs using TGICL. These were annotated with results from BLASTX searches and Gene Ontology (GO terms, and thousands of single-nucleotide polymorphisms (SNPs were characterized. Unassembled reads were kept as singletons and together with the contigs contributed to the unigenes characterized in each individual. The high quality of unigenes is evidenced by the proportion (49% that have significant hits in similarity searches with the A. thaliana proteome. The SiESTa database is accessible at http://www.siesta.ethz.ch. Conclusion The sequence collections established in the present study provide an important genomic resource for four Silene and one Dianthus species and will help to

  6. Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database

    Science.gov (United States)

    2011-01-01

    Background The genus Silene is widely used as a model system for addressing ecological and evolutionary questions in plants, but advances in using the genus as a model system are impeded by the lack of available resources for studying its genome. Massively parallel sequencing cDNA has recently developed into an efficient method for characterizing the transcriptomes of non-model organisms, generating massive amounts of data that enable the study of multiple species in a comparative framework. The sequences generated provide an excellent resource for identifying expressed genes, characterizing functional variation and developing molecular markers, thereby laying the foundations for future studies on gene sequence and gene expression divergence. Here, we report the results of a comparative transcriptome sequencing study of eight individuals representing four Silene and one Dianthus species as outgroup. All sequences and annotations have been deposited in a newly developed and publicly available database called SiESTa, the Silene EST annotation database. Results A total of 1,041,122 EST reads were generated in two runs on a Roche GS-FLX 454 pyrosequencing platform. EST reads were analyzed separately for all eight individuals sequenced and were assembled into contigs using TGICL. These were annotated with results from BLASTX searches and Gene Ontology (GO) terms, and thousands of single-nucleotide polymorphisms (SNPs) were characterized. Unassembled reads were kept as singletons and together with the contigs contributed to the unigenes characterized in each individual. The high quality of unigenes is evidenced by the proportion (49%) that have significant hits in similarity searches with the A. thaliana proteome. The SiESTa database is accessible at http://www.siesta.ethz.ch. Conclusion The sequence collections established in the present study provide an important genomic resource for four Silene and one Dianthus species and will help to further develop Silene as a

  7. PATtyFams: Protein families for the microbial genomes in the PATRIC database

    Directory of Open Access Journals (Sweden)

    James J Davis

    2016-02-01

    Full Text Available The ability to build accurate protein families is a fundamental operation in bioinformatics that influences comparative analyses, genome annotation and metabolic modeling. For several years we have been maintaining protein families for all microbial genomes in the PATRIC database (Pathosystems Resource Integration Center, patricbrc.org in order to drive many of the comparative analysis tools that are available through the PATRIC website. However, due to the burgeoning number of genomes, traditional approaches for generating protein families are becoming prohibitive. In this report, we describe a new approach for generating protein families, which we call PATtyFams. This method uses the k-mer-based function assignments available through RAST (Rapid Annotation using Subsystem Technology to rapidly guide family formation, and then differentiates the function-based groups into families using a Markov Cluster algorithm (MCL. This new approach for generating protein families is rapid, scalable and has properties that are consistent with alignment-based methods.

  8. Identification of ace inhibitory cryptides in Tilapia protein hydrolysate by UPLC-MS/MS coupled to database analysis.

    Science.gov (United States)

    Yesmine, Ben Henda; Antoine, Bonnet; da Silva Ortência Leocádia, Nunes Gonzalez; Rogério, Boscolo Wilson; Ingrid, Arnaudin; Nicolas, Bridiau; Thierry, Maugard; Jean-Marie, Piot; Frédéric, Sannier; Stéphanie, Bordenave-Juchereau

    2017-05-01

    An ultra-performance liquid chromatography-quadrupole-time of flight mass spectrometry method was developed and applied to identify short angiotensin-I-converting enzyme (ACE) inhibitory cryptides in Tilapia (Oreochromis Niloticus) protein hydrolyzate. A database was created with previously identified ACE-inhibitory di- and tripeptides and the lowest molecular weight fraction of Tilapia hydrolysate was analysed for coincidences. Only VW and VY were identified. Further analysis of collected fractions conducted to the identification of 51 different peptides in major fractions. 19 peptides selected were synthesised and tested for their ACE inhibitory potential. TL, TI, IK, LR, LD, IQ, DI, AILE, ALLE, ALIE and AIIE were identified as new ACE inhibitors. The findings from this study point UPLC-MS/MS combined with the creation of a database as an efficient technique to identify specific short peptides within a complex hydrolysate, in addition with de novo sequencing. This efficient characterisation of bioactive factors like cryptides in protein hydrolysates will extend their use as functional foods. Copyright © 2017 Elsevier B.V. All rights reserved.

  9. Exploring Sequence Characteristics Related to High- Level Production of Secreted Proteins in Aspergillus niger

    NARCIS (Netherlands)

    Van den Berg, B.A.; Reinders, M.J.T.; Hulsman, M.; Wu, L.; Pel, H.J.; Roubos, J.A.; De Ridder, D.

    2012-01-01

    Protein sequence features are explored in relation to the production of over-expressed extracellular proteins by fungi. Knowledge on features influencing protein production and secretion could be employed to improve enzyme production levels in industrial bioprocesses via protein engineering. A large

  10. Fly-DPI: database of protein interactomes for D. melanogaster in the approach of systems biology

    Directory of Open Access Journals (Sweden)

    Lin Chieh-Hua

    2006-12-01

    Full Text Available Abstract Background Proteins control and mediate many biological activities of cells by interacting with other protein partners. This work presents a statistical model to predict protein interaction networks of Drosophila melanogaster based on insight into domain interactions. Results Three high-throughput yeast two-hybrid experiments and the collection in FlyBase were used as our starting datasets. The co-occurrences of domains in these interactive events are converted into a probability score of domain-domain interaction. These scores are used to infer putative interaction among all available open reading frames (ORFs of fruit fly. Additionally, the likelihood function is used to estimate all potential protein-protein interactions. All parameters are successfully iterated and MLE is obtained for each pair of domains. Additionally, the maximized likelihood reaches its converged criteria and maintains the probability stable. The hybrid model achieves a high specificity with a loss of sensitivity, suggesting that the model may possess major features of protein-protein interactions. Several putative interactions predicted by the proposed hybrid model are supported by literatures, while experimental data with a low probability score indicate an uncertain reliability and require further proof of interaction. Fly-DPI is the online database used to present this work. It is an integrated proteomics tool with comprehensive protein annotation information from major databases as well as an effective means of predicting protein-protein interactions. As a novel search strategy, the ping-pong search is a naïve path map between two chosen proteins based on pre-computed shortest paths. Adopting effective filtering strategies will facilitate researchers in depicting the bird's eye view of the network of interest. Fly-DPI can be accessed at http://flydpi.nhri.org.tw. Conclusion This work provides two reference systems, statistical and biological, to evaluate

  11. The human interactome knowledge base (hint-kb): An integrative human protein interaction database enriched with predicted protein–protein interaction scores using a novel hybrid technique

    KAUST Repository

    Theofilatos, Konstantinos A.; Dimitrakopoulos, Christos M.; Likothanassis, Spiridon D.; Kleftogiannis, Dimitrios A.; Moschopoulos, Charalampos N.; Alexakos, Christos; Papadimitriou, Stergios; Mavroudi, Seferina P.

    2013-01-01

    Proteins are the functional components of many cellular processes and the identification of their physical protein–protein interactions (PPIs) is an area of mature academic research. Various databases have been developed containing information about

  12. Towards understanding the first genome sequence of a crenarchaeon by genome annotation using clusters of orthologous groups of proteins (COGs).

    Science.gov (United States)

    Natale, D A; Shankavaram, U T; Galperin, M Y; Wolf, Y I; Aravind, L; Koonin, E V

    2000-01-01

    Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and

  13. Sequence- and interactome-based prediction of viral protein hotspots targeting host proteins: a case study for HIV Nef.

    Directory of Open Access Journals (Sweden)

    Mahdi Sarmady

    Full Text Available Virus proteins alter protein pathways of the host toward the synthesis of viral particles by breaking and making edges via binding to host proteins. In this study, we developed a computational approach to predict viral sequence hotspots for binding to host proteins based on sequences of viral and host proteins and literature-curated virus-host protein interactome data. We use a motif discovery algorithm repeatedly on collections of sequences of viral proteins and immediate binding partners of their host targets and choose only those motifs that are conserved on viral sequences and highly statistically enriched among binding partners of virus protein targeted host proteins. Our results match experimental data on binding sites of Nef to host proteins such as MAPK1, VAV1, LCK, HCK, HLA-A, CD4, FYN, and GNB2L1 with high statistical significance but is a poor predictor of Nef binding sites on highly flexible, hoop-like regions. Predicted hotspots recapture CD8 cell epitopes of HIV Nef highlighting their importance in modulating virus-host interactions. Host proteins potentially targeted or outcompeted by Nef appear crowding the T cell receptor, natural killer cell mediated cytotoxicity, and neurotrophin signaling pathways. Scanning of HIV Nef motifs on multiple alignments of hepatitis C protein NS5A produces results consistent with literature, indicating the potential value of the hotspot discovery in advancing our understanding of virus-host crosstalk.

  14. Protein Science by DNA Sequencing: How Advances in Molecular Biology Are Accelerating Biochemistry.

    Science.gov (United States)

    Higgins, Sean A; Savage, David F

    2018-01-09

    A fundamental goal of protein biochemistry is to determine the sequence-function relationship, but the vastness of sequence space makes comprehensive evaluation of this landscape difficult. However, advances in DNA synthesis and sequencing now allow researchers to assess the functional impact of every single mutation in many proteins, but challenges remain in library construction and the development of general assays applicable to a diverse range of protein functions. This Perspective briefly outlines the technical innovations in DNA manipulation that allow massively parallel protein biochemistry and then summarizes the methods currently available for library construction and the functional assays of protein variants. Areas in need of future innovation are highlighted with a particular focus on assay development and the use of computational analysis with machine learning to effectively traverse the sequence-function landscape. Finally, applications in the fundamentals of protein biochemistry, disease prediction, and protein engineering are presented.

  15. Microwave-assisted acid and base hydrolysis of intact proteins containing disulfide bonds for protein sequence analysis by mass spectrometry.

    Science.gov (United States)

    Reiz, Bela; Li, Liang

    2010-09-01

    Controlled hydrolysis of proteins to generate peptide ladders combined with mass spectrometric analysis of the resultant peptides can be used for protein sequencing. In this paper, two methods of improving the microwave-assisted protein hydrolysis process are described to enable rapid sequencing of proteins containing disulfide bonds and increase sequence coverage, respectively. It was demonstrated that proteins containing disulfide bonds could be sequenced by MS analysis by first performing hydrolysis for less than 2 min, followed by 1 h of reduction to release the peptides originally linked by disulfide bonds. It was shown that a strong base could be used as a catalyst for microwave-assisted protein hydrolysis, producing complementary sequence information to that generated by microwave-assisted acid hydrolysis. However, using either acid or base hydrolysis, amide bond breakages in small regions of the polypeptide chains of the model proteins (e.g., cytochrome c and lysozyme) were not detected. Dynamic light scattering measurement of the proteins solubilized in an acid or base indicated that protein-protein interaction or aggregation was not the cause of the failure to hydrolyze certain amide bonds. It was speculated that there were some unknown local structures that might play a role in preventing an acid or base from reacting with the peptide bonds therein. 2010 American Society for Mass Spectrometry. Published by Elsevier Inc. All rights reserved.

  16. CPLA 1.0: an integrated database of protein lysine acetylation.

    Science.gov (United States)

    Liu, Zexian; Cao, Jun; Gao, Xinjiao; Zhou, Yanhong; Wen, Longping; Yang, Xiangjiao; Yao, Xuebiao; Ren, Jian; Xue, Yu

    2011-01-01

    As a reversible post-translational modification (PTM) discovered decades ago, protein lysine acetylation was known for its regulation of transcription through the modification of histones. Recent studies discovered that lysine acetylation targets broad substrates and especially plays an essential role in cellular metabolic regulation. Although acetylation is comparable with other major PTMs such as phosphorylation, an integrated resource still remains to be developed. In this work, we presented the compendium of protein lysine acetylation (CPLA) database for lysine acetylated substrates with their sites. From the scientific literature, we manually collected 7151 experimentally identified acetylation sites in 3311 targets. We statistically studied the regulatory roles of lysine acetylation by analyzing the Gene Ontology (GO) and InterPro annotations. Combined with protein-protein interaction information, we systematically discovered a potential human lysine acetylation network (HLAN) among histone acetyltransferases (HATs), substrates and histone deacetylases (HDACs). In particular, there are 1862 triplet relationships of HAT-substrate-HDAC retrieved from the HLAN, at least 13 of which were previously experimentally verified. The online services of CPLA database was implemented in PHP + MySQL + JavaScript, while the local packages were developed in JAVA 1.5 (J2SE 5.0). The CPLA database is freely available for all users at: http://cpla.biocuckoo.org.

  17. Protein and DNA sequence determinants of thermophilic adaptation.

    Directory of Open Access Journals (Sweden)

    Konstantin B Zeldovich

    2007-01-01

    Full Text Available There have been considerable attempts in the past to relate phenotypic trait--habitat temperature of organisms--to their genotypes, most importantly compositions of their genomes and proteomes. However, despite accumulation of anecdotal evidence, an exact and conclusive relationship between the former and the latter has been elusive. We present an exhaustive study of the relationship between amino acid composition of proteomes, nucleotide composition of DNA, and optimal growth temperature (OGT of prokaryotes. Based on 204 complete proteomes of archaea and bacteria spanning the temperature range from -10 degrees C to 110 degrees C, we performed an exhaustive enumeration of all possible sets of amino acids and found a set of amino acids whose total fraction in a proteome is correlated, to a remarkable extent, with the OGT. The universal set is Ile, Val, Tyr, Trp, Arg, Glu, Leu (IVYWREL, and the correlation coefficient is as high as 0.93. We also found that the G + C content in 204 complete genomes does not exhibit a significant correlation with OGT (R = -0.10. On the other hand, the fraction of A + G in coding DNA is correlated with temperature, to a considerable extent, due to codon patterns of IVYWREL amino acids. Further, we found strong and independent correlation between OGT and the frequency with which pairs of A and G nucleotides appear as nearest neighbors in genome sequences. This adaptation is achieved via codon bias. These findings present a direct link between principles of proteins structure and stability and evolutionary mechanisms of thermophylic adaptation. On the nucleotide level, the analysis provides an example of how nature utilizes codon bias for evolutionary adaptation to extreme conditions. Together these results provide a complete picture of how compositions of proteomes and genomes in prokaryotes adjust to the extreme conditions of the environment.

  18. Scoring protein relationships in functional interaction networks predicted from sequence data.

    Directory of Open Access Journals (Sweden)

    Gaston K Mazandu

    Full Text Available UNLABELLED: The abundance of diverse biological data from various sources constitutes a rich source of knowledge, which has the power to advance our understanding of organisms. This requires computational methods in order to integrate and exploit these data effectively and elucidate local and genome wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Thus, protein sequences and domains can be used to predict protein pair-wise functional relationships, and thus contribute to the function prediction process of uncharacterized proteins in order to ensure that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic based approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with a high confidence and coverage. We use the network for predicting functions of uncharacterised proteins. AVAILABILITY: Protein pair-wise functional relationship scores for Mycobacterium tuberculosis strain CDC1551 sequence data and python scripts to compute these scores are available at http://web.cbio.uct.ac.za/~gmazandu/scoringschemes.

  19. Efficient Disk-Based Techniques for Manipulating Very Large String Databases

    KAUST Repository

    Allam, Amin

    2017-01-01

    Indexing and processing strings are very important topics in database management. Strings can be database records, DNA sequences, protein sequences, or plain text. Various string operations are required for several application categories

  20. Two-dimensional gel human protein databases offer a systematic approach to the study of cell proliferation and differentiation

    DEFF Research Database (Denmark)

    Celis, julio E.; Gesser, Borbala; Dejgaard, Kurt

    1989-01-01

    Human cellular protein databases have been established using computer-analyzed 2D gel electrophoresis. These databases, which include information on various properties of proteins, offer a global approach to the study of regulation of cell proliferation and differentiation. Furthermore, thanks...

  1. Two dimensional gel human protein databases offer a systematic approach to the study of cell proliferation and differentiation

    DEFF Research Database (Denmark)

    Celis, J E; Gesser, B; Dejgaard, K

    1989-01-01

    Human cellular protein databases have been established using computer-analyzed 2D gel electrophoresis. These databases, which include information on various properties of proteins, offer a global approach to the study of regulation of cell proliferation and differentiation. Furthermore, thanks to...

  2. Nonlinear analysis of sequence symmetry of beta-trefoil family proteins

    Energy Technology Data Exchange (ETDEWEB)

    Li Mingfeng [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Huang Yanzhao [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Xu Ruizhen [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Xiao Yi [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China)]. E-mail: yxiao@mail.hust.edu.cn

    2005-07-01

    The tertiary structures of proteins of beta-trefoil family have three-fold quasi-symmetry while their amino acid sequences appear almost at random. In the present paper we show that these amino acid sequences have hidden symmetries in fact and furthermore the degrees of these hidden symmetries are the same as those of their tertiary structures. We shall present a modified recurrence plot to reveal hidden symmetries in protein sequences. Our results can explain the contradiction in sequence-structure relations of proteins of beta-trefoil family.

  3. CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction

    KAUST Repository

    Cui, Xuefeng

    2016-06-15

    Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information. Method: We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence–structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration. Results: We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM–HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods.

  4. Protein sequences from mastodon and Tyrannosaurus rex revealed by mass spectrometry.

    Science.gov (United States)

    Asara, John M; Schweitzer, Mary H; Freimark, Lisa M; Phillips, Matthew; Cantley, Lewis C

    2007-04-13

    Fossilized bones from extinct taxa harbor the potential for obtaining protein or DNA sequences that could reveal evolutionary links to extant species. We used mass spectrometry to obtain protein sequences from bones of a 160,000- to 600,000-year-old extinct mastodon (Mammut americanum) and a 68-million-year-old dinosaur (Tyrannosaurus rex). The presence of T. rex sequences indicates that their peptide bonds were remarkably stable. Mass spectrometry can thus be used to determine unique sequences from ancient organisms from peptide fragmentation patterns, a valuable tool to study the evolution and adaptation of ancient taxa from which genomic sequences are unlikely to be obtained.

  5. Protein-Protein Interactions Prediction Using a Novel Local Conjoint Triad Descriptor of Amino Acid Sequences

    Directory of Open Access Journals (Sweden)

    Jun Wang

    2017-11-01

    Full Text Available Protein-protein interactions (PPIs play crucial roles in almost all cellular processes. Although a large amount of PPIs have been verified by high-throughput techniques in the past decades, currently known PPIs pairs are still far from complete. Furthermore, the wet-lab experiments based techniques for detecting PPIs are time-consuming and expensive. Hence, it is urgent and essential to develop automatic computational methods to efficiently and accurately predict PPIs. In this paper, a sequence-based approach called DNN-LCTD is developed by combining deep neural networks (DNNs and a novel local conjoint triad description (LCTD feature representation. LCTD incorporates the advantage of local description and conjoint triad, thus, it is capable to account for the interactions between residues in both continuous and discontinuous regions of amino acid sequences. DNNs can not only learn suitable features from the data by themselves, but also learn and discover hierarchical representations of data. When performing on the PPIs data of Saccharomyces cerevisiae, DNN-LCTD achieves superior performance with accuracy as 93.12%, precision as 93.75%, sensitivity as 93.83%, area under the receiver operating characteristic curve (AUC as 97.92%, and it only needs 718 s. These results indicate DNN-LCTD is very promising for predicting PPIs. DNN-LCTD can be a useful supplementary tool for future proteomics study.

  6. TMC-SNPdb: an Indian germline variant database derived from whole exome sequences.

    Science.gov (United States)

    Upadhyay, Pawan; Gardi, Nilesh; Desai, Sanket; Sahoo, Bikram; Singh, Ankita; Togar, Trupti; Iyer, Prajish; Prasad, Ratnam; Chandrani, Pratik; Gupta, Sudeep; Dutt, Amit

    2016-01-01

    Cancer is predominantly a somatic disease. A mutant allele present in a cancer cell genome is considered somatic when it's absent in the paired normal genome along with public SNP databases. The current build of dbSNP, the most comprehensive public SNP database, however inadequately represents several non-European Caucasian populations, posing a limitation in cancer genomic analyses of data from these populations. We present the T: ata M: emorial C: entre-SNP D: ata B: ase (TMC-SNPdb), as the first open source, flexible, upgradable, and freely available SNP database (accessible through dbSNP build 149 and ANNOVAR)-representing 114 309 unique germline variants-generated from whole exome data of 62 normal samples derived from cancer patients of Indian origin. The TMC-SNPdb is presented with a companion subtraction tool that can be executed with command line option or using an easy-to-use graphical user interface with the ability to deplete additional Indian population specific SNPs over and above dbSNP and 1000 Genomes databases. Using an institutional generated whole exome data set of 132 samples of Indian origin, we demonstrate that TMC-SNPdb could deplete 42, 33 and 28% false positive somatic events post dbSNP depletion in Indian origin tongue, gallbladder, and cervical cancer samples, respectively. Beyond cancer somatic analyses, we anticipate utility of the TMC-SNPdb in several Mendelian germline diseases. In addition to dbSNP build 149 and ANNOVAR, the TMC-SNPdb along with the subtraction tool is available for download in the public domain at the following:Database URL: http://www.actrec.gov.in/pi-webpages/AmitDutt/TMCSNP/TMCSNPdp.html. © The Author(s) 2016. Published by Oxford University Press.

  7. ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases.

    Science.gov (United States)

    Shen, Li; Shao, Ningyi; Liu, Xiaochuan; Nestler, Eric

    2014-04-15

    Understanding the relationship between the millions of functional DNA elements and their protein regulators, and how they work in conjunction to manifest diverse phenotypes, is key to advancing our understanding of the mammalian genome. Next-generation sequencing technology is now used widely to probe these protein-DNA interactions and to profile gene expression at a genome-wide scale. As the cost of DNA sequencing continues to fall, the interpretation of the ever increasing amount of data generated represents a considerable challenge. We have developed ngs.plot - a standalone program to visualize enrichment patterns of DNA-interacting proteins at functionally important regions based on next-generation sequencing data. We demonstrate that ngs.plot is not only efficient but also scalable. We use a few examples to demonstrate that ngs.plot is easy to use and yet very powerful to generate figures that are publication ready. We conclude that ngs.plot is a useful tool to help fill the gap between massive datasets and genomic information in this era of big sequencing data.

  8. AMS 4.0: consensus prediction of post-translational modifications in protein sequences.

    Science.gov (United States)

    Plewczynski, Dariusz; Basu, Subhadip; Saha, Indrajit

    2012-08-01

    We present here the 2011 update of the AutoMotif Service (AMS 4.0) that predicts the wide selection of 88 different types of the single amino acid post-translational modifications (PTM) in protein sequences. The selection of experimentally confirmed modifications is acquired from the latest UniProt and Phospho.ELM databases for training. The sequence vicinity of each modified residue is represented using amino acids physico-chemical features encoded using high quality indices (HQI) obtaining by automatic clustering of known indices extracted from AAindex database. For each type of the numerical representation, the method builds the ensemble of Multi-Layer Perceptron (MLP) pattern classifiers, each optimising different objectives during the training (for example the recall, precision or area under the ROC curve (AUC)). The consensus is built using brainstorming technology, which combines multi-objective instances of machine learning algorithm, and the data fusion of different training objects representations, in order to boost the overall prediction accuracy of conserved short sequence motifs. The performance of AMS 4.0 is compared with the accuracy of previous versions, which were constructed using single machine learning methods (artificial neural networks, support vector machine). Our software improves the average AUC score of the earlier version by close to 7 % as calculated on the test datasets of all 88 PTM types. Moreover, for the selected most-difficult sequence motifs types it is able to improve the prediction performance by almost 32 %, when compared with previously used single machine learning methods. Summarising, the brainstorming consensus meta-learning methodology on the average boosts the AUC score up to around 89 %, averaged over all 88 PTM types. Detailed results for single machine learning methods and the consensus methodology are also provided, together with the comparison to previously published methods and state-of-the-art software tools. The

  9. Data-driven modelling of protein synthesis : A sequence perspective

    NARCIS (Netherlands)

    Gritsenko, A.

    2017-01-01

    Recent advances in DNA sequencing, synthesis and genetic engineering have enabled the introduction of choice DNA sequences into living cells. This is an exciting prospect for the field of industrial biotechnology, which aims at using microorganisms to produce foods, beverages, pharmaceuticals and

  10. Biological sequence analysis: probabilistic models of proteins and nucleic acids

    National Research Council Canada - National Science Library

    Durbin, Richard

    1998-01-01

    ... analysis methods are now based on principles of probabilistic modelling. Examples of such methods include the use of probabilistically derived score matrices to determine the significance of sequence alignments, the use of hidden Markov models as the basis for profile searches to identify distant members of sequence families, and the inference...

  11. HMMerThread: detecting remote, functional conserved domains in entire genomes by combining relaxed sequence-database searches with fold recognition.

    Directory of Open Access Journals (Sweden)

    Charles Richard Bradshaw

    Full Text Available Conserved domains in proteins are one of the major sources of functional information for experimental design and genome-level annotation. Though search tools for conserved domain databases such as Hidden Markov Models (HMMs are sensitive in detecting conserved domains in proteins when they share sufficient sequence similarity, they tend to miss more divergent family members, as they lack a reliable statistical framework for the detection of low sequence similarity. We have developed a greatly improved HMMerThread algorithm that can detect remotely conserved domains in highly divergent sequences. HMMerThread combines relaxed conserved domain searches with fold recognition to eliminate false positive, sequence-based identifications. With an accuracy of 90%, our software is able to automatically predict highly divergent members of conserved domain families with an associated 3-dimensional structure. We give additional confidence to our predictions by validation across species. We have run HMMerThread searches on eight proteomes including human and present a rich resource of remotely conserved domains, which adds significantly to the functional annotation of entire proteomes. We find ∼4500 cross-species validated, remotely conserved domain predictions in the human proteome alone. As an example, we find a DNA-binding domain in the C-terminal part of the A-kinase anchor protein 10 (AKAP10, a PKA adaptor that has been implicated in cardiac arrhythmias and premature cardiac death, which upon stress likely translocates from mitochondria to the nucleus/nucleolus. Based on our prediction, we propose that with this HLH-domain, AKAP10 is involved in the transcriptional control of stress response. Further remotely conserved domains we discuss are examples from areas such as sporulation, chromosome segregation and signalling during immune response. The HMMerThread algorithm is able to automatically detect the presence of remotely conserved domains in

  12. Comparative analysis of the prion protein gene sequences in African lion.

    Science.gov (United States)

    Wu, Chang-De; Pang, Wan-Yong; Zhao, De-Ming

    2006-10-01

    The prion protein gene of African lion (Panthera Leo) was first cloned and polymorphisms screened. The results suggest that the prion protein gene of eight African lions is highly homogenous. The amino acid sequences of the prion protein (PrP) of all samples tested were identical. Four single nucleotide polymorphisms (C42T, C81A, C420T, T600C) in the prion protein gene (Prnp) of African lion were found, but no amino acid substitutions. Sequence analysis showed that the higher homology is observed to felis catus AF003087 (96.7%) and to sheep number M31313.1 (96.2%) Genbank accessed. With respect to all the mammalian prion protein sequences compared, the African lion prion protein sequence has three amino acid substitutions. The homology might in turn affect the potential intermolecular interactions critical for cross species transmission of prion disease.

  13. Sequence Variation in Rhoptry Neck Protein 10 Gene among Toxoplasma gondii Isolates from Different Hosts and Geographical Locations

    Directory of Open Access Journals (Sweden)

    Yu ZHAO

    2017-09-01

    Full Text Available Background: Toxoplasma gondii, as a eukaryotic parasite of the phylum Apicomplexa, can infect almost all the warm-blooded animals and humans, causing toxoplasmosis. Rhoptry neck proteins (RONs play a key role in the invasion process of T. gondii and are potential vaccine candidate molecules against toxoplasmosis.Methods: The present study examined sequence variation in the rhoptry neck protein 10 (TgRON10 gene among 10 T. gondii isolates from different hosts and geographical locations from Lanzhou province during 2014, and compared with the corresponding sequences of strains ME49 and VEG obtained from the ToxoDB database, using polymerase chain reaction (PCR amplification, sequence analysis, and phylogenetic reconstruction by Bayesian inference (BI and maximum parsimony (MP. Results: Analysis of all the 12 TgRON10 genomic and cDNA sequences revealed 7 exons and 6 introns in the TgRON10 gDNA. The complete genomic sequence of the TgRON10 gene ranged from 4759 bp to 4763 bp, and sequence variation was 0-0.6% among the 12 T. gondii isolates, indicating a low sequence variation in TgRON10 gene. Phylogenetic analysis of TgRON10 sequences showed that the cluster of the 12 T. gondii isolates was not completely consistent with their respective genotypes.Conclusion: TgRON10 gene is not a suitable genetic marker for the differentiation of T. gondii isolates from different hosts and geographical locations, but may represent a potential vaccine candidate against toxoplasmosis, worth further studies.

  14. ChickVD: a sequence variation database for the chicken genome

    DEFF Research Database (Denmark)

    Wang, Jing; He, Ximiao; Ruan, Jue

    2005-01-01

    Working in parallel with the efforts to sequence the chicken (Gallus gallus) genome, the Beijing Genomics Institute led an international team of scientists from China, USA, UK, Sweden, The Netherlands and Germany to map extensive DNA sequence variation throughout the chicken genome by sampling DN...... on quantitative trait loci using data from collaborating institutions and public resources. Our data can be queried by search engine and homology-based BLAST searches. ChickVD is publicly accessible at http://chicken.genomics.org.cn. Udgivelsesdato: 2005-Jan-1...

  15. Merging in-silico and in vitro salivary protein complex partners using the STRING database: A tutorial.

    Science.gov (United States)

    Crosara, Karla Tonelli Bicalho; Moffa, Eduardo Buozi; Xiao, Yizhi; Siqueira, Walter Luiz

    2018-01-16

    Protein-protein interaction is a common physiological mechanism for protection and actions of proteins in an organism. The identification and characterization of protein-protein interactions in different organisms is necessary to better understand their physiology and to determine their efficacy. In a previous in vitro study using mass spectrometry, we identified 43 proteins that interact with histatin 1. Six previously documented interactors were confirmed and 37 novel partners were identified. In this tutorial, we aimed to demonstrate the usefulness of the STRING database for studying protein-protein interactions. We used an in-silico approach along with the STRING database (http://string-db.org/) and successfully performed a fast simulation of a novel constructed histatin 1 protein-protein network, including both the previously known and the predicted interactors, along with our newly identified interactors. Our study highlights the advantages and importance of applying bioinformatics tools to merge in-silico tactics with experimental in vitro findings for rapid advancement of our knowledge about protein-protein interactions. Our findings also indicate that bioinformatics tools such as the STRING protein network database can help predict potential interactions between proteins and thus serve as a guide for future steps in our exploration of the Human Interactome. Our study highlights the usefulness of the STRING protein database for studying protein-protein interactions. The STRING database can collect and integrate data about known and predicted protein-protein associations from many organisms, including both direct (physical) and indirect (functional) interactions, in an easy-to-use interface. Copyright © 2017 Elsevier B.V. All rights reserved.

  16. Final Technical Report on the Genome Sequence DataBase (GSDB): DE-FG03 95 ER 62062 September 1997-September 1999

    Energy Technology Data Exchange (ETDEWEB)

    Harger, Carol A.

    1999-10-28

    Since September 1997 NCGR has produced two web-based tools for researchers to use to access and analyze data in the Genome Sequence DataBase (GSDB). These tools are: Sequence Viewer, a nucleotide sequence and annotation visualization tool, and MAR-Finder, a tool that predicts, base upon statistical inferences, the location of matrix attachment regions (MARS) within a nucleotide sequence. [The annual report for June 1996 to August 1997 is included as an attachment to this final report.

  17. Final Technical Report on the Genome Sequence DataBase (GSDB): DE-FG03 95 ER 62062 September 1997-September 1999; FINAL

    International Nuclear Information System (INIS)

    Harger, Carol A.

    1999-01-01

    Since September 1997 NCGR has produced two web-based tools for researchers to use to access and analyze data in the Genome Sequence DataBase (GSDB). These tools are: Sequence Viewer, a nucleotide sequence and annotation visualization tool, and MAR-Finder, a tool that predicts, base upon statistical inferences, the location of matrix attachment regions (MARS) within a nucleotide sequence.[The annual report for June 1996 to August 1997 is included as an attachment to this final report.

  18. Performance of Correspondence Algorithms in Vision-Based Driver Assistance Using an Online Image Sequence Database

    DEFF Research Database (Denmark)

    Klette, Reinhard; Krüger, Norbert; Vaudrey, Tobi

    2011-01-01

    the classification of recorded video data into situations defined by a cooccurrence of some events in recorded traffic scenes. About 100-400 stereo frames (or 4-16 s of recording) are considered a basic sequence, which will be identified with one particular situation. Future testing is expected to be on data...

  19. DSFL database: A hub of target proteins of Leishmania sp. to combat leishmaniasis

    Directory of Open Access Journals (Sweden)

    Ameer Khusro

    2017-07-01

    Full Text Available Leishmaniasis is a vector-borne chronic infectious tropical dermal disease caused by the protozoa parasite of the genus Leishmania that causes high mortality globally. Among three different clinical forms of leishmaniasis, visceral leishmaniasis (VL or kala-azar is a systemic public health disease with high morbidity and mortality in developing countries, caused by Leishmania donovani, Leishmania infantum or Leishmania chagasi. Unfortunately, there is no vaccine available till date for the treatment of leishmaniasis. On the other hand, the therapeutics approved to treat this fatal disease is expensive, toxic, and associated with serious side effects. Furthermore, the emergence of drug-resistant Leishmania parasites in most endemic countries due to the incessant utilization of existing drugs is a major concern at present. Drug Search for Leishmaniasis (DSFL is a unique database that involves 50 crystallized target proteins of varied Leishmania sp. in order to develop new drugs in future by interacting several antiparasitic compounds or molecules with specific protein through computational tools. The structure of target protein from different Leishmania sp. is available in this database. In this review, we spotlighted not only the current global status of leishmaniasis in brief but also detailed information about target proteins of various Leishmania sp. available in DSFL. DSFL has created a new expectation for mankind in order to combat leishmaniasis by targeting parasitic proteins and commence a new era to get rid of drug resistance parasites. The database will substantiate to be a worthwhile project for further development of new, non-toxic, and cost-effective antileishmanial drugs as targeted therapies using in vitro/in vivo assays.

  20. mPUMA: a computational approach to microbiota analysis by de novo assembly of operational taxonomic units based on protein-coding barcode sequences.

    Science.gov (United States)

    Links, Matthew G; Chaban, Bonnie; Hemmingsen, Sean M; Muirhead, Kevin; Hill, Janet E

    2013-08-15

    Formation of operational taxonomic units (OTU) is a common approach to data aggregation in microbial ecology studies based on amplification and sequencing of individual gene targets. The de novo assembly of OTU sequences has been recently demonstrated as an alternative to widely used clustering methods, providing robust information from experimental data alone, without any reliance on an external reference database. Here we introduce mPUMA (microbial Profiling Using Metagenomic Assembly, http://mpuma.sourceforge.net), a software package for identification and analysis of protein-coding barcode sequence data. It was developed originally for Cpn60 universal target sequences (also known as GroEL or Hsp60). Using an unattended process that is independent of external reference sequences, mPUMA forms OTUs by DNA sequence assembly and is capable of tracking OTU abundance. mPUMA processes microbial profiles both in terms of the direct DNA sequence as well as in the translated amino acid sequence for protein coding barcodes. By forming OTUs and calculating abundance through an assembly approach, mPUMA is capable of generating inputs for several popular microbiota analysis tools. Using SFF data from sequencing of a synthetic community of Cpn60 sequences derived from the human vaginal microbiome, we demonstrate that mPUMA can faithfully reconstruct all expected OTU sequences and produce compositional profiles consistent with actual community structure. mPUMA enables analysis of microbial communities while empowering the discovery of novel organisms through OTU assembly.

  1. Chaos game representation of functional protein sequences, and simulation and multifractal analysis of induced measures

    International Nuclear Information System (INIS)

    Zu-Guo, Yu; Qian-Jun, Xiao; Long, Shi; Jun-Wu, Yu; Anh, Vo

    2010-01-01

    Investigating the biological function of proteins is a key aspect of protein studies. Bioinformatic methods become important for studying the biological function of proteins. In this paper, we first give the chaos game representation (CGR) of randomly-linked functional protein sequences, then propose the use of the recurrent iterated function systems (RIFS) in fractal theory to simulate the measure based on their chaos game representations. This method helps to extract some features of functional protein sequences, and furthermore the biological functions of these proteins. Then multifractal analysis of the measures based on the CGRs of randomly-linked functional protein sequences are performed. We find that the CGRs have clear fractal patterns. The numerical results show that the RIFS can simulate the measure based on the CGR very well. The relative standard error and the estimated probability matrix in the RIFS do not depend on the order to link the functional protein sequences. The estimated probability matrices in the RIFS with different biological functions are evidently different. Hence the estimated probability matrices in the RIFS can be used to characterise the difference among linked functional protein sequences with different biological functions. From the values of the D q curves, one sees that these functional protein sequences are not completely random. The D q of all linked functional proteins studied are multifractal-like and sufficiently smooth for the C q (analogous to specific heat) curves to be meaningful. Furthermore, the D q curves of the measure μ based on their CGRs for different orders to link the functional protein sequences are almost identical if q ≥ 0. Finally, the C q curves of all linked functional proteins resemble a classical phase transition at a critical point. (cross-disciplinary physics and related areas of science and technology)

  2. In Silico Characterization of Pectate Lyase Protein Sequences from Different Source Organisms

    Directory of Open Access Journals (Sweden)

    Amit Kumar Dubey

    2010-01-01

    Full Text Available A total of 121 protein sequences of pectate lyases were subjected to homology search, multiple sequence alignment, phylogenetic tree construction, and motif analysis. The phylogenetic tree constructed revealed different clusters based on different source organisms representing bacterial, fungal, plant, and nematode pectate lyases. The multiple accessions of bacterial, fungal, nematode, and plant pectate lyase protein sequences were placed closely revealing a sequence level similarity. The multiple sequence alignment of these pectate lyase protein sequences from different source organisms showed conserved regions at different stretches with maximum homology from amino acid residues 439–467, 715–816, and 829–910 which could be used for designing degenerate primers or probes specific for pectate lyases. The motif analysis revealed a conserved Pec_Lyase_C domain uniformly observed in all pectate lyases irrespective of variable sources suggesting its possible role in structural and enzymatic functions.

  3. Peptide Pattern Recognition for high-throughput protein sequence analysis and clustering

    DEFF Research Database (Denmark)

    Busk, Peter Kamp

    2017-01-01

    Large collections of protein sequences with divergent sequences are tedious to analyze for understanding their phylogenetic or structure-function relation. Peptide Pattern Recognition is an algorithm that was developed to facilitate this task but the previous version does only allow a limited...... number of sequences as input. I implemented Peptide Pattern Recognition as a multithread software designed to handle large numbers of sequences and perform analysis in a reasonable time frame. Benchmarking showed that the new implementation of Peptide Pattern Recognition is twenty times faster than...... the previous implementation on a small protein collection with 673 MAP kinase sequences. In addition, the new implementation could analyze a large protein collection with 48,570 Glycosyl Transferase family 20 sequences without reaching its upper limit on a desktop computer. Peptide Pattern Recognition...

  4. On the relationship between residue structural environment and sequence conservation in proteins.

    Science.gov (United States)

    Liu, Jen-Wei; Lin, Jau-Ji; Cheng, Chih-Wen; Lin, Yu-Feng; Hwang, Jenn-Kang; Huang, Tsun-Tsao

    2017-09-01

    Residues that are crucial to protein function or structure are usually evolutionarily conserved. To identify the important residues in protein, sequence conservation is estimated, and current methods rely upon the unbiased collection of homologous sequences. Surprisingly, our previous studies have shown that the sequence conservation is closely correlated with the weighted contact number (WCN), a measure of packing density for residue's structural environment, calculated only based on the C α positions of a protein structure. Moreover, studies have shown that sequence conservation is correlated with environment-related structural properties calculated based on different protein substructures, such as a protein's all atoms, backbone atoms, side-chain atoms, or side-chain centroid. To know whether the C α atomic positions are adequate to show the relationship between residue environment and sequence conservation or not, here we compared C α atoms with other substructures in their contributions to the sequence conservation. Our results show that C α positions are substantially equivalent to the other substructures in calculations of various measures of residue environment. As a result, the overlapping contributions between C α atoms and the other substructures are high, yielding similar structure-conservation relationship. Take the WCN as an example, the average overlapping contribution to sequence conservation is 87% between C α and all-atom substructures. These results indicate that only C α atoms of a protein structure could reflect sequence conservation at the residue level. © 2017 Wiley Periodicals, Inc.

  5. ProFITS of maize: a database of protein families involved in the transduction of signalling in the maize genome

    Directory of Open Access Journals (Sweden)

    Zhang Zhenhai

    2010-10-01

    Full Text Available Abstract Background Maize (Zea mays ssp. mays L. is an important model for plant basic and applied research. In 2009, the B73 maize genome sequencing made a great step forward, using clone by clone strategy; however, functional annotation and gene classification of the maize genome are still limited. Thus, a well-annotated datasets and informative database will be important for further research discoveries. Signal transduction is a fundamental biological process in living cells, and many protein families participate in this process in sensing, amplifying and responding to various extracellular or internal stimuli. Therefore, it is a good starting point to integrate information on the maize functional genes involved in signal transduction. Results Here we introduce a comprehensive database 'ProFITS' (Protein Families Involved in the Transduction of Signalling, which endeavours to identify and classify protein kinases/phosphatases, transcription factors and ubiquitin-proteasome-system related genes in the B73 maize genome. Users can explore gene models, corresponding transcripts and FLcDNAs using the three abovementioned protein hierarchical categories, and visualize them using an AJAX-based genome browser (JBrowse or Generic Genome Browser (GBrowse. Functional annotations such as GO annotation, protein signatures, protein best-hits in the Arabidopsis and rice genome are provided. In addition, pre-calculated transcription factor binding sites of each gene are generated and mutant information is incorporated into ProFITS. In short, ProFITS provides a user-friendly web interface for studies in signal transduction process in maize. Conclusion ProFITS, which utilizes both the B73 maize genome and full length cDNA (FLcDNA datasets, provides users a comprehensive platform of maize annotation with specific focus on the categorization of families involved in the signal transduction process. ProFITS is designed as a user-friendly web interface and it is

  6. Fast computational methods for predicting protein structure from primary amino acid sequence

    Science.gov (United States)

    Agarwal, Pratul Kumar [Knoxville, TN

    2011-07-19

    The present invention provides a method utilizing primary amino acid sequence of a protein, energy minimization, molecular dynamics and protein vibrational modes to predict three-dimensional structure of a protein. The present invention also determines possible intermediates in the protein folding pathway. The present invention has important applications to the design of novel drugs as well as protein engineering. The present invention predicts the three-dimensional structure of a protein independent of size of the protein, overcoming a significant limitation in the prior art.

  7. PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences

    OpenAIRE

    Lescot, Magali; Déhais, Patrice; Thijs, Gert; Marchal, Kathleen; Moreau, Yves; Van de Peer, Yves; Rouzé, Pierre; Rombauts, Stephane

    2002-01-01

    PlantCARE is a database of plant cis-acting regulatory elements, enhancers and repressors. Regulatory elements are represented by positional matrices, consensus sequences and individual sites on particular promoter sequences. Links to the EMBL, TRANSFAC and MEDLINE databases are provided when available. Data about the transcription sites are extracted mainly from the literature, supplemented with an increasing number of in silico predicted data. Apart from a general description for specific t...

  8. STING Millennium: a web-based suite of programs for comprehensive and simultaneous analysis of protein structure and sequence

    Science.gov (United States)

    Neshich, Goran; Togawa, Roberto C.; Mancini, Adauto L.; Kuser, Paula R.; Yamagishi, Michel E. B.; Pappas, Georgios; Torres, Wellington V.; Campos, Tharsis Fonseca e; Ferreira, Leonardo L.; Luna, Fabio M.; Oliveira, Adilton G.; Miura, Ronald T.; Inoue, Marcus K.; Horita, Luiz G.; de Souza, Dimas F.; Dominiquini, Fabiana; Álvaro, Alexandre; Lima, Cleber S.; Ogawa, Fabio O.; Gomes, Gabriel B.; Palandrani, Juliana F.; dos Santos, Gabriela F.; de Freitas, Esther M.; Mattiuz, Amanda R.; Costa, Ivan C.; de Almeida, Celso L.; Souza, Savio; Baudet, Christian; Higa, Roberto H.

    2003-01-01

    STING Millennium Suite (SMS) is a new web-based suite of programs and databases providing visualization and a complex analysis of molecular sequence and structure for the data deposited at the Protein Data Bank (PDB). SMS operates with a collection of both publicly available data (PDB, HSSP, Prosite) and its own data (contacts, interface contacts, surface accessibility). Biologists find SMS useful because it provides a variety of algorithms and validated data, wrapped-up in a user friendly web interface. Using SMS it is now possible to analyze sequence to structure relationships, the quality of the structure, nature and volume of atomic contacts of intra and inter chain type, relative conservation of amino acids at the specific sequence position based on multiple sequence alignment, indications of folding essential residue (FER) based on the relationship of the residue conservation to the intra-chain contacts and Cα–Cα and Cβ–Cβ distance geometry. Specific emphasis in SMS is given to interface forming residues (IFR)—amino acids that define the interactive portion of the protein surfaces. SMS may simultaneously display and analyze previously superimposed structures. PDB updates trigger SMS updates in a synchronized fashion. SMS is freely accessible for public data at http://www.cbi.cnptia.embrapa.br, http://mirrors.rcsb.org/SMS and http://trantor.bioc.columbia.edu/SMS. PMID:12824333

  9. The Mitochondrial Protein Atlas: A Database of Experimentally Verified Information on the Human Mitochondrial Proteome.

    Science.gov (United States)

    Godin, Noa; Eichler, Jerry

    2017-09-01

    Given its central role in various biological systems, as well as its involvement in numerous pathologies, the mitochondrion is one of the best-studied organelles. However, although the mitochondrial genome has been extensively investigated, protein-level information remains partial, and in many cases, hypothetical. The Mitochondrial Protein Atlas (MPA; URL: lifeserv.bgu.ac.il/wb/jeichler/MPA ) is a database that provides a complete, manually curated inventory of only experimentally validated human mitochondrial proteins. The MPA presently contains 911 unique protein entries, each of which is associated with at least one experimentally validated and referenced mitochondrial localization. The MPA also contains experimentally validated and referenced information defining function, structure, involvement in pathologies, interactions with other MPA proteins, as well as the method(s) of analysis used in each instance. Connections to relevant external data sources are offered for each entry, including links to NCBI Gene, PubMed, and Protein Data Bank. The MPA offers a prototype for other information sources that allow for a distinction between what has been confirmed and what remains to be verified experimentally.

  10. ATGC database and ATGC-COGs: an updated resource for micro- and macro-evolutionary studies of prokaryotic genomes and protein family annotation.

    Science.gov (United States)

    Kristensen, David M; Wolf, Yuri I; Koonin, Eugene V

    2017-01-04

    The Alignable Tight Genomic Clusters (ATGCs) database is a collection of closely related bacterial and archaeal genomes that provides several tools to aid research into evolutionary processes in the microbial world. Each ATGC is a taxonomy-independent cluster of 2 or more completely sequenced genomes that meet the objective criteria of a high degree of local gene order (synteny) and a small number of synonymous substitutions in the protein-coding genes. As such, each ATGC is suited for analysis of microevolutionary variations within a cohesive group of organisms (e.g. species), whereas the entire collection of ATGCs is useful for macroevolutionary studies. The ATGC database includes many forms of pre-computed data, in particular ATGC-COGs (Clusters of Orthologous Genes), multiple sequence alignments, a set of 'index' orthologs representing the most well-conserved members of each ATGC-COG, the phylogenetic tree of the organisms within each ATGC, etc. Although the ATGC database contains several million proteins from thousands of genomes organized into hundreds of clusters (roughly a 4-fold increase since the last version of the ATGC database), it is now built with completely automated methods and will be regularly updated following new releases of the NCBI RefSeq database. The ATGC database is hosted jointly at the University of Iowa at dmk-brain.ecn.uiowa.edu/ATGC/ and the NCBI at ftp.ncbi.nlm.nih.gov/pub/kristensen/ATGC/atgc_home.html. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  11. Generation of a predicted protein database from EST data and application to iTRAQ analyses in grape (Vitis vinifera cv. Cabernet Sauvignon) berries at ripening initiation

    Science.gov (United States)

    Lücker, Joost; Laszczak, Mario; Smith, Derek; Lund, Steven T

    2009-01-01

    Background iTRAQ is a proteomics technique that uses isobaric tags for relative and absolute quantitation of tryptic peptides. In proteomics experiments, the detection and high confidence annotation of proteins and the significance of corresponding expression differences can depend on the quality and the species specificity of the tryptic peptide map database used for analysis of the data. For species for which finished genome sequence data are not available, identification of proteins relies on similarity to proteins from other species using comprehensive peptide map databases such as the MSDB. Results We were interested in characterizing ripening initiation ('veraison') in grape berries at the protein level in order to better define the molecular control of this important process for grape growers and wine makers. We developed a bioinformatic pipeline for processing EST data in order to produce a predicted tryptic peptide database specifically targeted to the wine grape cultivar, Vitis vinifera cv. Cabernet Sauvignon, and lacking truncated N- and C-terminal fragments. By searching iTRAQ MS/MS data generated from berry exocarp and mesocarp samples at ripening initiation, we determined that implementation of the custom database afforded a large improvement in high confidence peptide annotation in comparison to the MSDB. We used iTRAQ MS/MS in conjunction with custom peptide db searches to quantitatively characterize several important pathway components for berry ripening previously described at the transcriptional level and confirmed expression patterns for these at the protein level. Conclusion We determined that a predicted peptide database for MS/MS applications can be derived from EST data using advanced clustering and trimming approaches and successfully implemented for quantitative proteome profiling. Quantitative shotgun proteome profiling holds great promise for characterizing biological processes such as fruit ripening initiation and may be further

  12. Generation of a predicted protein database from EST data and application to iTRAQ analyses in grape (Vitis vinifera cv. Cabernet Sauvignon berries at ripening initiation

    Directory of Open Access Journals (Sweden)

    Smith Derek

    2009-01-01

    Full Text Available Abstract Background iTRAQ is a proteomics technique that uses isobaric tags for relative and absolute quantitation of tryptic peptides. In proteomics experiments, the detection and high confidence annotation of proteins and the significance of corresponding expression differences can depend on the quality and the species specificity of the tryptic peptide map database used for analysis of the data. For species for which finished genome sequence data are not available, identification of proteins relies on similarity to proteins from other species using comprehensive peptide map databases such as the MSDB. Results We were interested in characterizing ripening initiation ('veraison' in grape berries at the protein level in order to better define the molecular control of this important process for grape growers and wine makers. We developed a bioinformatic pipeline for processing EST data in order to produce a predicted tryptic peptide database specifically targeted to the wine grape cultivar, Vitis vinifera cv. Cabernet Sauvignon, and lacking truncated N- and C-terminal fragments. By searching iTRAQ MS/MS data generated from berry exocarp and mesocarp samples at ripening initiation, we determined that implementation of the custom database afforded a large improvement in high confidence peptide annotation in comparison to the MSDB. We used iTRAQ MS/MS in conjunction with custom peptide db searches to quantitatively characterize several important pathway components for berry ripening previously described at the transcriptional level and confirmed expression patterns for these at the protein level. Conclusion We determined that a predicted peptide database for MS/MS applications can be derived from EST data using advanced clustering and trimming approaches and successfully implemented for quantitative proteome profiling. Quantitative shotgun proteome profiling holds great promise for characterizing biological processes such as fruit ripening

  13. The establishment of a database of Italian feeds for the Cornell Net Carbohydrate and Protein System

    Directory of Open Access Journals (Sweden)

    Enzo Tartari

    2010-01-01

    Full Text Available A field application of the Cornell Net Carbohydrate and Protein System (CNCPS in Italy has been limited because thefeed bank is based on North American feedstuffs and still few laboratories are able to analyze feeds as requested by theCNCPS. Moreover, the standardization of analytical procedures is still not homogeneous among laboratories. This workwas carried out to establish a first database for feeds commonly used in Italy, providing nutritionists and producers anaccurate and current feed composition, also indicating methods and apparatus for analytical procedures potentially availablefor routine analysis. A total of 909 samples of hays, silages and raw materials (protein feeds, cereals and by-productswere analyzed through 1999 and 2002; analysis included protein solubility and degradability, protein fractions,structural carbohydrate fractions and the calculation of neutral detergent structural carbohydrates. When possible, averagedata were compared with those included in the feed bank of CNCPS ver. 3 and with those obtained by another Italianlaboratory. The main differences were observed in chemical composition of forages and silages, whose composition largelydepends on environmental conditions and physiological stage; protein feeds, cereals and by-products showed somedifferences in crude protein, soluble protein and protein fractions even in feeds of national origin.The intent to modify the feed bank values of CNCPS for establishing an Italian data base of feeds will require a collaborativestudy of many laboratories not only for forages, hays and silages samples - whose composition is greatly dependenton environmental factors and agronomic techniques - but also for protein fractions, whose values are largely influencedby even small changes in analytical techniques.

  14. PDBj Mine: design and implementation of relational database interface for Protein Data Bank Japan.

    Science.gov (United States)

    Kinjo, Akira R; Yamashita, Reiko; Nakamura, Haruki

    2010-08-25

    This article is a tutorial for PDBj Mine, a new database and its interface for Protein Data Bank Japan (PDBj). In PDBj Mine, data are loaded from files in the PDBMLplus format (an extension of PDBML, PDB's canonical XML format, enriched with annotations), which are then served for the user of PDBj via the worldwide web (WWW). We describe the basic design of the relational database (RDB) and web interfaces of PDBj Mine. The contents of PDBMLplus files are first broken into XPath entities, and these paths and data are indexed in the way that reflects the hierarchical structure of the XML files. The data for each XPath type are saved into the corresponding relational table that is named as the XPath itself. The generation of table definitions from the PDBMLplus XML schema is fully automated. For efficient search, frequently queried terms are compiled into a brief summary table. Casual users can perform simple keyword search, and 'Advanced Search' which can specify various conditions on the entries. More experienced users can query the database using SQL statements which can be constructed in a uniform manner. Thus, PDBj Mine achieves a combination of the flexibility of XML documents and the robustness of the RDB. Database URL: http://www.pdbj.org/

  15. Seeing the trees through the forest : sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest

    NARCIS (Netherlands)

    Hou, Qingzhen; De Geest, Paul F.G.; Vranken, Wim F.; Heringa, Jaap; Feenstra, K. Anton

    2017-01-01

    Motivation: Genome sequencing is producing an ever-increasing amount of associated protein sequences. Few of these sequences have experimentally validated annotations, however, and computational predictions are becoming increasingly successful in producing such annotations. One key challenge remains

  16. Solving Classification Problems for Large Sets of Protein Sequences with the Example of Hox and ParaHox Proteins

    Directory of Open Access Journals (Sweden)

    Stefanie D. Hueber

    2016-02-01

    Full Text Available Phylogenetic methods are key to providing models for how a given protein family evolved. However, these methods run into difficulties when sequence divergence is either too low or too high. Here, we provide a case study of Hox and ParaHox proteins so that additional insights can be gained using a new computational approach to help solve old classification problems. For two (Gsx and Cdx out of three ParaHox proteins the assignments differ between the currently most established view and four alternative scenarios. We use a non-phylogenetic, pairwise-sequence-similarity-based method to assess which of the previous predictions, if any, are best supported by the sequence-similarity relationships between Hox and ParaHox proteins. The overall sequence-similarities show Gsx to be most similar to Hox2–3, and Cdx to be most similar to Hox4–8. The results indicate that a purely pairwise-sequence-similarity-based approach can provide additional information not only when phylogenetic inference methods have insufficient information to provide reliable classifications (as was shown previously for central Hox proteins, but also when the sequence variation is so high that the resulting phylogenetic reconstructions are likely plagued by long-branch-attraction artifacts.

  17. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier

    KAUST Repository

    Kulmanov, Maxat; Khan, Mohammed Asif; Hoehndorf, Robert

    2017-01-01

    A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often

  18. Using the Relevance Vector Machine Model Combined with Local Phase Quantization to Predict Protein-Protein Interactions from Protein Sequences

    Directory of Open Access Journals (Sweden)

    Ji-Yong An

    2016-01-01

    Full Text Available We propose a novel computational method known as RVM-LPQ that combines the Relevance Vector Machine (RVM model and Local Phase Quantization (LPQ to predict PPIs from protein sequences. The main improvements are the results of representing protein sequences using the LPQ feature representation on a Position Specific Scoring Matrix (PSSM, reducing the influence of noise using a Principal Component Analysis (PCA, and using a Relevance Vector Machine (RVM based classifier. We perform 5-fold cross-validation experiments on Yeast and Human datasets, and we achieve very high accuracies of 92.65% and 97.62%, respectively, which is significantly better than previous works. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM classifier on the Yeast dataset. The experimental results demonstrate that our RVM-LPQ method is obviously better than the SVM-based method. The promising experimental results show the efficiency and simplicity of the proposed method, which can be an automatic decision support tool for future proteomics research.

  19. The Danish STR sequence database: duplicate typing of 363 Danes with the ForenSeq™ DNA Signature Prep Kit.

    Science.gov (United States)

    Hussing, C; Bytyci, R; Huber, C; Morling, N; Børsting, C

    2018-05-24

    Some STR loci have internal sequence variations, which are not revealed by the standard STR typing methods used in forensic genetics (PCR and fragment length analysis by capillary electrophoresis (CE)). Typing of STRs with next-generation sequencing (NGS) uncovers the sequence variation in the repeat region and in the flanking regions. In this study, 363 Danish individuals were typed for 56 STRs (26 autosomal STRs, 24 Y-STRs, and 6 X-STRs) using the ForenSeq™ DNA Signature Prep Kit to establish a Danish STR sequence database. Increased allelic diversity was observed in 34 STRs by the PCR-NGS assay. The largest increases were found in DYS389II and D12S391, where the numbers of sequenced alleles were around four times larger than the numbers of alleles determined by repeat length alone. Thirteen SNPs and one InDel were identified in the flanking regions of 12 STRs. Furthermore, 36 single positions and five longer stretches in the STR flanking regions were found to have dubious genotyping quality. The combined match probability of the 26 autosomal STRs was 10,000 times larger using the PCR-NGS assay than by using PCR-CE. The typical paternity indices for trios and duos were 500 and 100 times larger, respectively, than those obtained with PCR-CE. The assay also amplified 94 SNPs selected for human identification. Eleven of these loci were not in Hardy-Weinberg equilibrium in the Danish population, most likely because the minimum threshold for allele calling (30 reads) in the ForenSeq™ Universal Analysis Software was too low and frequent allele dropouts were not detected.

  20. Neutron cross-sections database for amino acids and proteins analysis

    Energy Technology Data Exchange (ETDEWEB)

    Voi, Dante L.; Ferreira, Francisco de O.; Nunes, Rogerio Chaffin, E-mail: dante@ien.gov.br, E-mail: fferreira@ien.gov.br, E-mail: Chaffin@ien.gov.br [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil); Rocha, Helio F. da, E-mail: hrocha@gbl.com.br [Universidade Federal do Rio de Janeiro (IPPMG/UFRJ), Rio de Janeiro, RJ (Brazil). Instituto de Pediatria

    2015-07-01

    Biological materials may be studied using neutrons as an unconventional tool of analysis. Dynamics and structures data can be obtained for amino acids, protein and others cellular components by neutron cross sections determinations especially for applications in nuclear purity and conformation analysis. The instrument used for this is the crystal spectrometer of the Instituto de Engenharia Nuclear (IEN-CNEN-RJ), the only one in Latin America that uses neutrons for this type of analyzes and it is installed in one of the reactor Argonauta irradiation channels. The experimentally values obtained are compared with calculated values using literature data with a rigorous analysis of the chemical composition, conformation and molecular structure analysis of the materials. A neutron cross-section database was constructed to assist in determining molecular dynamic, structure and formulae of biological materials. The database contains neutron cross-sections values of all amino acids, chemical elements, molecular groups, auxiliary radicals, as well as values of constants and parameters necessary for the analysis. An unprecedented analytical procedure was developed using the neutron cross section parceling and grouping method for data manipulation. This database is a result of measurements obtained from twenty amino acids that were provided by different manufactories and are used in oral administration in hospital individuals for nutritional applications. It was also constructed a small data file of compounds with different molecular groups including carbon, nitrogen, sulfur and oxygen, all linked to hydrogen atoms. A review of global and national scene in the acquisition of neutron cross sections data, the formation of libraries and the application of neutrons for analyzing biological materials is presented. This database has further application in protein analysis and the neutron cross-section from the insulin was estimated. (author)

  1. Neutron cross-sections database for amino acids and proteins analysis

    International Nuclear Information System (INIS)

    Voi, Dante L.; Ferreira, Francisco de O.; Nunes, Rogerio Chaffin; Rocha, Helio F. da

    2015-01-01

    Biological materials may be studied using neutrons as an unconventional tool of analysis. Dynamics and structures data can be obtained for amino acids, protein and others cellular components by neutron cross sections determinations especially for applications in nuclear purity and conformation analysis. The instrument used for this is the crystal spectrometer of the Instituto de Engenharia Nuclear (IEN-CNEN-RJ), the only one in Latin America that uses neutrons for this type of analyzes and it is installed in one of the reactor Argonauta irradiation channels. The experimentally values obtained are compared with calculated values using literature data with a rigorous analysis of the chemical composition, conformation and molecular structure analysis of the materials. A neutron cross-section database was constructed to assist in determining molecular dynamic, structure and formulae of biological materials. The database contains neutron cross-sections values of all amino acids, chemical elements, molecular groups, auxiliary radicals, as well as values of constants and parameters necessary for the analysis. An unprecedented analytical procedure was developed using the neutron cross section parceling and grouping method for data manipulation. This database is a result of measurements obtained from twenty amino acids that were provided by different manufactories and are used in oral administration in hospital individuals for nutritional applications. It was also constructed a small data file of compounds with different molecular groups including carbon, nitrogen, sulfur and oxygen, all linked to hydrogen atoms. A review of global and national scene in the acquisition of neutron cross sections data, the formation of libraries and the application of neutrons for analyzing biological materials is presented. This database has further application in protein analysis and the neutron cross-section from the insulin was estimated. (author)

  2. Analysis of long-range correlation in sequences data of proteins

    Directory of Open Access Journals (Sweden)

    ADRIANA ISVORAN

    2007-04-01

    Full Text Available The results presented here suggest the existence of correlations in the sequence data of proteins. 32 proteins, both globular and fibrous, both monomeric and polymeric, were analyzed. The primary structures of these proteins were treated as time series. Three spatial series of data for each sequence of a protein were generated from numerical correspondences between each amino acid and a physical property associated with it, i.e., its electric charge, its polar character and its dipole moment. For each series, the spectral coefficient, the scaling exponent and the Hurst coefficient were determined. The values obtained for these coefficients revealed non-randomness in the series of data.

  3. A method for partitioning the information contained in a protein sequence between its structure and function.

    Science.gov (United States)

    Possenti, Andrea; Vendruscolo, Michele; Camilloni, Carlo; Tiana, Guido

    2018-05-23

    Proteins employ the information stored in the genetic code and translated into their sequences to carry out well-defined functions in the cellular environment. The possibility to encode for such functions is controlled by the balance between the amount of information supplied by the sequence and that left after that the protein has folded into its structure. We study the amount of information necessary to specify the protein structure, providing an estimate that keeps into account the thermodynamic properties of protein folding. We thus show that the information remaining in the protein sequence after encoding for its structure (the 'information gap') is very close to what needed to encode for its function and interactions. Then, by predicting the information gap directly from the protein sequence, we show that it may be possible to use these insights from information theory to discriminate between ordered and disordered proteins, to identify unknown functions, and to optimize artificially-designed protein sequences. This article is protected by copyright. All rights reserved. © 2018 Wiley Periodicals, Inc.

  4. Practical Value of Food Pathogen Traceability through Building a Whole-Genome Sequencing Network and Database.

    Science.gov (United States)

    Allard, Marc W; Strain, Errol; Melka, David; Bunning, Kelly; Musser, Steven M; Brown, Eric W; Timme, Ruth

    2016-08-01

    The FDA has created a United States-based open-source whole-genome sequencing network of state, federal, international, and commercial partners. The GenomeTrakr network represents a first-of-its-kind distributed genomic food shield for characterizing and tracing foodborne outbreak pathogens back to their sources. The GenomeTrakr network is leading investigations of outbreaks of foodborne illnesses and compliance actions with more accurate and rapid recalls of contaminated foods as well as more effective monitoring of preventive controls for food manufacturing environments. An expanded network would serve to provide an international rapid surveillance system for pathogen traceback, which is critical to support an effective public health response to bacterial outbreaks. Copyright © 2016, American Society for Microbiology. All Rights Reserved.

  5. Exploring the Ligand-Protein Networks in Traditional Chinese Medicine: Current Databases, Methods, and Applications

    Directory of Open Access Journals (Sweden)

    Mingzhu Zhao

    2013-01-01

    Full Text Available The traditional Chinese medicine (TCM, which has thousands of years of clinical application among China and other Asian countries, is the pioneer of the “multicomponent-multitarget” and network pharmacology. Although there is no doubt of the efficacy, it is difficult to elucidate convincing underlying mechanism of TCM due to its complex composition and unclear pharmacology. The use of ligand-protein networks has been gaining significant value in the history of drug discovery while its application in TCM is still in its early stage. This paper firstly surveys TCM databases for virtual screening that have been greatly expanded in size and data diversity in recent years. On that basis, different screening methods and strategies for identifying active ingredients and targets of TCM are outlined based on the amount of network information available, both on sides of ligand bioactivity and the protein structures. Furthermore, applications of successful in silico target identification attempts are discussed in detail along with experiments in exploring the ligand-protein networks of TCM. Finally, it will be concluded that the prospective application of ligand-protein networks can be used not only to predict protein targets of a small molecule, but also to explore the mode of action of TCM.

  6. PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways

    OpenAIRE

    Mi, Huaiyu; Guo, Nan; Kejariwal, Anish; Thomas, Paul D.

    2006-01-01

    PANTHER is a freely available, comprehensive software system for relating protein sequence evolution to the evolution of specific protein functions and biological roles. Since 2005, there have been three main improvements to PANTHER. First, the sequences used to create evolutionary trees are carefully selected to provide coverage of phylogenetic as well as functional information. Second, PANTHER is now a member of the InterPro Consortium, and the PANTHER hidden markov Models (HMMs) are distri...

  7. OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy

    Directory of Open Access Journals (Sweden)

    Searle Stephen MJ

    2003-10-01

    Full Text Available Abstract Background The alignment of two or more protein sequences provides a powerful guide in the prediction of the protein structure and in identifying key functional residues, however, the utility of any prediction is completely dependent on the accuracy of the alignment. In this paper we describe a suite of reference alignments derived from the comparison of protein three-dimensional structures together with evaluation measures and software that allow automatically generated alignments to be benchmarked. We test the OXBench benchmark suite on alignments generated by the AMPS multiple alignment method, then apply the suite to compare eight different multiple alignment algorithms. The benchmark shows the current state-of-the art for alignment accuracy and provides a baseline against which new alignment algorithms may be judged. Results The simple hierarchical multiple alignment algorithm, AMPS, performed as well as or better than more modern methods such as CLUSTALW once the PAM250 pair-score matrix was replaced by a BLOSUM series matrix. AMPS gave an accuracy in Structurally Conserved Regions (SCRs of 89.9% over a set of 672 alignments. The T-COFFEE method on a data set of families with http://www.compbio.dundee.ac.uk. Conclusions The OXBench suite of reference alignments, evaluation software and results database provide a convenient method to assess progress in sequence alignment techniques. Evaluation measures that were dependent on comparison to a reference alignment were found to give good discrimination between methods. The STAMP Sc Score which is independent of a reference alignment also gave good discrimination. Application of OXBench in this paper shows that with the exception of T-COFFEE, the majority of the improvement in alignment accuracy seen since 1985 stems from improved pair-score matrices rather than algorithmic refinements. The maximum theoretical alignment accuracy obtained by pooling results over all methods was 94

  8. A sequence-based dynamic ensemble learning system for protein ligand-binding site prediction

    KAUST Repository

    Chen, Peng

    2015-12-03

    Background: Proteins have the fundamental ability to selectively bind to other molecules and perform specific functions through such interactions, such as protein-ligand binding. Accurate prediction of protein residues that physically bind to ligands is important for drug design and protein docking studies. Most of the successful protein-ligand binding predictions were based on known structures. However, structural information is not largely available in practice due to the huge gap between the number of known protein sequences and that of experimentally solved structures

  9. A sequence-based dynamic ensemble learning system for protein ligand-binding site prediction

    KAUST Repository

    Chen, Peng; Hu, ShanShan; Zhang, Jun; Gao, Xin; Li, Jinyan; Xia, Junfeng; Wang, Bing

    2015-01-01

    Background: Proteins have the fundamental ability to selectively bind to other molecules and perform specific functions through such interactions, such as protein-ligand binding. Accurate prediction of protein residues that physically bind to ligands is important for drug design and protein docking studies. Most of the successful protein-ligand binding predictions were based on known structures. However, structural information is not largely available in practice due to the huge gap between the number of known protein sequences and that of experimentally solved structures

  10. The drug-minded protein interaction database (DrumPID) for efficient target analysis and drug development.

    Science.gov (United States)

    Kunz, Meik; Liang, Chunguang; Nilla, Santosh; Cecil, Alexander; Dandekar, Thomas

    2016-01-01

    The drug-minded protein interaction database (DrumPID) has been designed to provide fast, tailored information on drugs and their protein networks including indications, protein targets and side-targets. Starting queries include compound, target and protein interactions and organism-specific protein families. Furthermore, drug name, chemical structures and their SMILES notation, affected proteins (potential drug targets), organisms as well as diseases can be queried including various combinations and refinement of searches. Drugs and protein interactions are analyzed in detail with reference to protein structures and catalytic domains, related compound structures as well as potential targets in other organisms. DrumPID considers drug functionality, compound similarity, target structure, interactome analysis and organismic range for a compound, useful for drug development, predicting drug side-effects and structure-activity relationships.Database URL:http://drumpid.bioapps.biozentrum.uni-wuerzburg.de. © The Author(s) 2016. Published by Oxford University Press.

  11. Generation and analysis of a large-scale expressed sequence Tag database from a full-length enriched cDNA library of developing leaves of Gossypium hirsutum L.

    Directory of Open Access Journals (Sweden)

    Min Lin

    Full Text Available BACKGROUND: Cotton (Gossypium hirsutum L. is one of the world's most economically-important crops. However, its entire genome has not been sequenced, and limited resources are available in GenBank for understanding the molecular mechanisms underlying leaf development and senescence. METHODOLOGY/PRINCIPAL FINDINGS: In this study, 9,874 high-quality ESTs were generated from a normalized, full-length cDNA library derived from pooled RNA isolated from throughout leaf development during the plant blooming stage. After clustering and assembly of these ESTs, 5,191 unique sequences, representative 1,652 contigs and 3,539 singletons, were obtained. The average unique sequence length was 682 bp. Annotation of these unique sequences revealed that 84.4% showed significant homology to sequences in the NCBI non-redundant protein database, and 57.3% had significant hits to known proteins in the Swiss-Prot database. Comparative analysis indicated that our library added 2,400 ESTs and 991 unique sequences to those known for cotton. The unigenes were functionally characterized by gene ontology annotation. We identified 1,339 and 200 unigenes as potential leaf senescence-related genes and transcription factors, respectively. Moreover, nine genes related to leaf senescence and eleven MYB transcription factors were randomly selected for quantitative real-time PCR (qRT-PCR, which revealed that these genes were regulated differentially during senescence. The qRT-PCR for three GhYLSs revealed that these genes express express preferentially in senescent leaves. CONCLUSIONS/SIGNIFICANCE: These EST resources will provide valuable sequence information for gene expression profiling analyses and functional genomics studies to elucidate their roles, as well as for studying the mechanisms of leaf development and senescence in cotton and discovering candidate genes related to important agronomic traits of cotton. These data will also facilitate future whole-genome sequence

  12. 3D representations of amino acids—applications to protein sequence comparison and classification

    Directory of Open Access Journals (Sweden)

    Jie Li

    2014-08-01

    Full Text Available The amino acid sequence of a protein is the key to understanding its structure and ultimately its function in the cell. This paper addresses the fundamental issue of encoding amino acids in ways that the representation of such a protein sequence facilitates the decoding of its information content. We show that a feature-based representation in a three-dimensional (3D space derived from amino acid substitution matrices provides an adequate representation that can be used for direct comparison of protein sequences based on geometry. We measure the performance of such a representation in the context of the protein structural fold prediction problem. We compare the results of classifying different sets of proteins belonging to distinct structural folds against classifications of the same proteins obtained from sequence alone or directly from structural information. We find that sequence alone performs poorly as a structure classifier. We show in contrast that the use of the three dimensional representation of the sequences significantly improves the classification accuracy. We conclude with a discussion of the current limitations of such a representation and with a description of potential improvements.

  13. The evolution of a Web resource: The Galactosemia Proteins Database 2.0.

    Science.gov (United States)

    d'Acierno, Antonio; Scafuri, Bernardina; Facchiano, Angelo; Marabotti, Anna

    2018-01-01

    Galactosemia Proteins Database 2.0 is a Web-accessible resource collecting information about the structural and functional effects of the known variations associated to the three different enzymes of the Leloir pathway encoded by the genes GALT, GALE, and GALK1 and involved in the different forms of the genetic disease globally called "galactosemia." It represents an evolution of two available online resources we previously developed, with new data deriving from new structures, new analysis tools, and new interfaces and filters in order to improve the quality and quantity of information available for different categories of users. We propose this new resource both as a landmark for the entire world community of galactosemia and as a model for the development of similar tools for other proteins object of variations and involved in human diseases. © 2017 Wiley Periodicals, Inc.

  14. Database Description - eSOL | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available base Description General information of database Database name eSOL Alternative nam...eator Affiliation: The Research and Development of Biological Databases Project, National Institute of Genet...nology 4259 Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa 226-8501 Japan Email: Tel.: +81-45-924-5785 Database... classification Protein sequence databases - Protein properties Organism Taxonomy Name: Escherichia coli Taxonomy ID: 562 Database...i U S A. 2009 Mar 17;106(11):4201-6. External Links: Original website information Database maintenance site

  15. Structural insights and ab initio sequencing within the DING proteins family

    International Nuclear Information System (INIS)

    Elias, Mikael; Liebschner, Dorothee; Gotthard, Guillaume; Chabriere, Eric

    2011-01-01

    DING proteins constitute a recently discovered protein family that is ubiquitous in eukaryotes. The structural insights and the physiological involvements of these intriguing proteins are hereby deciphered. DING proteins constitute an intriguing family of phosphate-binding proteins that was identified in a wide range of organisms, from prokaryotes and archae to eukaryotes. Despite their seemingly ubiquitous occurrence in eukaryotes, their encoding genes are missing from sequenced genomes. Such a lack has considerably hampered functional studies. In humans, these proteins have been related to several diseases, like atherosclerosis, kidney stones, inflammation processes and HIV inhibition. The human phosphate binding protein is a human representative of the DING family that was serendipitously discovered from human plasma. An original approach was developed to determine ab initio the complete and exact sequence of this 38 kDa protein by utilizing mass spectrometry and X-ray data in tandem. Taking advantage of this first complete eukaryotic DING sequence, a immunohistochemistry study was undertaken to check the presence of DING proteins in various mice tissues, revealing that these proteins are widely expressed. Finally, the structure of a bacterial representative from Pseudomonas fluorescens was solved at sub-angstrom resolution, allowing the molecular mechanism of the phosphate binding in these high-affinity proteins to be elucidated

  16. Structural insights and ab initio sequencing within the DING proteins family

    Energy Technology Data Exchange (ETDEWEB)

    Elias, Mikael, E-mail: mikael.elias@weizmann.ac.il [Weizmann Institute of Science, Rehovot (Israel); Liebschner, Dorothee [CRM2, Nancy Université (France); Gotthard, Guillaume; Chabriere, Eric [AFMB, Université Aix-Marseille II (France)

    2011-01-01

    DING proteins constitute a recently discovered protein family that is ubiquitous in eukaryotes. The structural insights and the physiological involvements of these intriguing proteins are hereby deciphered. DING proteins constitute an intriguing family of phosphate-binding proteins that was identified in a wide range of organisms, from prokaryotes and archae to eukaryotes. Despite their seemingly ubiquitous occurrence in eukaryotes, their encoding genes are missing from sequenced genomes. Such a lack has considerably hampered functional studies. In humans, these proteins have been related to several diseases, like atherosclerosis, kidney stones, inflammation processes and HIV inhibition. The human phosphate binding protein is a human representative of the DING family that was serendipitously discovered from human plasma. An original approach was developed to determine ab initio the complete and exact sequence of this 38 kDa protein by utilizing mass spectrometry and X-ray data in tandem. Taking advantage of this first complete eukaryotic DING sequence, a immunohistochemistry study was undertaken to check the presence of DING proteins in various mice tissues, revealing that these proteins are widely expressed. Finally, the structure of a bacterial representative from Pseudomonas fluorescens was solved at sub-angstrom resolution, allowing the molecular mechanism of the phosphate binding in these high-affinity proteins to be elucidated.

  17. Characterization and compilation of polymorphic simple sequence repeat (SSR markers of peanut from public database

    Directory of Open Access Journals (Sweden)

    Zhao Yongli

    2012-07-01

    Full Text Available Abstract Background There are several reports describing thousands of SSR markers in the peanut (Arachis hypogaea L. genome. There is a need to integrate various research reports of peanut DNA polymorphism into a single platform. Further, because of lack of uniformity in the labeling of these markers across the publications, there is some confusion on the identities of many markers. We describe below an effort to develop a central comprehensive database of polymorphic SSR markers in peanut. Findings We compiled 1,343 SSR markers as detecting polymorphism (14.5% within a total of 9,274 markers. Amongst all polymorphic SSRs examined, we found that AG motif (36.5% was the most abundant followed by AAG (12.1%, AAT (10.9%, and AT (10.3%.The mean length of SSR repeats in dinucleotide SSRs was significantly longer than that in trinucleotide SSRs. Dinucleotide SSRs showed higher polymorphism frequency for genomic SSRs when compared to trinucleotide SSRs, while for EST-SSRs, the frequency of polymorphic SSRs was higher in trinucleotide SSRs than in dinucleotide SSRs. The correlation of the length of SSR and the frequency of polymorphism revealed that the frequency of polymorphism was decreased as motif repeat number increased. Conclusions The assembled polymorphic SSRs would enhance the density of the existing genetic maps of peanut, which could also be a useful source of DNA markers suitable for high-throughput QTL mapping and marker-assisted selection in peanut improvement and thus would be of value to breeders.

  18. Feasibilty of zein proteins, simple sequence repeats and phenotypic ...

    African Journals Online (AJOL)

    Widespread adoption of quality protein maize (QPM), especially among tropical farming systems has been slow mainly due to the slow process of generating varieties with acceptable kernel quality and adaptability to different agroecological contexts. A molecular based foreground selection system for opaque 2 (o2), the ...

  19. Variation in the prion protein sequence in Dutch goat breeds

    NARCIS (Netherlands)

    Windig, J.J.; Hoving, R.A.H.; Priem, J.; Bossers, A.; Keulen, van L.J.M.; Langeveld, J.P.M.

    2016-01-01

    Scrapie is a neurodegenerative disease occurring in goats and sheep. Several haplotypes of the prion protein increase resistance to scrapie infection and may be used in selective breeding to help eradicate scrapie. In this study, frequencies of the allelic variants of the PrP gene are determined

  20. Sequence-based feature prediction and annotation of proteins

    DEFF Research Database (Denmark)

    Juncker, Agnieszka; Jensen, Lars J.; Pierleoni, Andrea

    2009-01-01

    A recent trend in computational methods for annotation of protein function is that many prediction tools are combined in complex workflows and pipelines to facilitate the analysis of feature combinations, for example, the entire repertoire of kinase-binding motifs in the human proteome....

  1. Identification of human microRNA-like sequences embedded within the protein-encoding genes of the human immunodeficiency virus.

    Directory of Open Access Journals (Sweden)

    Bryan Holland

    Full Text Available BACKGROUND: MicroRNAs (miRNAs are highly conserved, short (18-22 nts, non-coding RNA molecules that regulate gene expression by binding to the 3' untranslated regions (3'UTRs of mRNAs. While numerous cellular microRNAs have been associated with the progression of various diseases including cancer, miRNAs associated with retroviruses have not been well characterized. Herein we report identification of microRNA-like sequences in coding regions of several HIV-1 genomes. RESULTS: Based on our earlier proteomics and bioinformatics studies, we have identified 8 cellular miRNAs that are predicted to bind to the mRNAs of multiple proteins that are dysregulated during HIV-infection of CD4+ T-cells in vitro. In silico analysis of the full length and mature sequences of these 8 miRNAs and comparisons with all the genomic and subgenomic sequences of HIV-1 strains in global databases revealed that the first 18/18 sequences of the mature hsa-miR-195 sequence (including the short seed sequence, matched perfectly (100%, or with one nucleotide mismatch, within the envelope (env genes of five HIV-1 genomes from Africa. In addition, we have identified 4 other miRNA-like sequences (hsa-miR-30d, hsa-miR-30e, hsa-miR-374a and hsa-miR-424 within the env and the gag-pol encoding regions of several HIV-1 strains, albeit with reduced homology. Mapping of the miRNA-homologues of env within HIV-1 genomes localized these sequence to the functionally significant variable regions of the env glycoprotein gp120 designated V1, V2, V4 and V5. CONCLUSIONS: We conclude that microRNA-like sequences are embedded within the protein-encoding regions of several HIV-1 genomes. Given that the V1 to V5 regions of HIV-1 envelopes contain specific, well-characterized domains that are critical for immune responses, virus neutralization and disease progression, we propose that the newly discovered miRNA-like sequences within the HIV-1 genomes may have evolved to self-regulate survival of the

  2. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies.

    Directory of Open Access Journals (Sweden)

    Holly J Atkinson

    Full Text Available The dramatic increase in heterogeneous types of biological data--in particular, the abundance of new protein sequences--requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity--GPCRs and kinases from humans, and the crotonase superfamily of enzymes--we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.

  3. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies.

    Science.gov (United States)

    Atkinson, Holly J; Morris, John H; Ferrin, Thomas E; Babbitt, Patricia C

    2009-01-01

    The dramatic increase in heterogeneous types of biological data--in particular, the abundance of new protein sequences--requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity--GPCRs and kinases from humans, and the crotonase superfamily of enzymes--we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.

  4. Rapid detection, classification and accurate alignment of up to a million or more related protein sequences.

    Science.gov (United States)

    Neuwald, Andrew F

    2009-08-01

    The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical. This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences. A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu. Supplementary data are available at Bioinformatics online.

  5. Adhesive proteins of stalked and acorn barnacles display homology with low sequence similarities.

    Directory of Open Access Journals (Sweden)

    Jaimie-Leigh Jonker

    Full Text Available Barnacle adhesion underwater is an important phenomenon to understand for the prevention of biofouling and potential biotechnological innovations, yet so far, identifying what makes barnacle glue proteins 'sticky' has proved elusive. Examination of a broad range of species within the barnacles may be instructive to identify conserved adhesive domains. We add to extensive information from the acorn barnacles (order Sessilia by providing the first protein analysis of a stalked barnacle adhesive, Lepas anatifera (order Lepadiformes. It was possible to separate the L. anatifera adhesive into at least 10 protein bands using SDS-PAGE. Intense bands were present at approximately 30, 70, 90 and 110 kilodaltons (kDa. Mass spectrometry for protein identification was followed by de novo sequencing which detected 52 peptides of 7-16 amino acids in length. None of the peptides matched published or unpublished transcriptome sequences, but some amino acid sequence similarity was apparent between L. anatifera and closely-related Dosima fascicularis. Antibodies against two acorn barnacle proteins (ab-cp-52k and ab-cp-68k showed cross-reactivity in the adhesive glands of L. anatifera. We also analysed the similarity of adhesive proteins across several barnacle taxa, including Pollicipes pollicipes (a stalked barnacle in the order Scalpelliformes. Sequence alignment of published expressed sequence tags clearly indicated that P. pollicipes possesses homologues for the 19 kDa and 100 kDa proteins in acorn barnacles. Homology aside, sequence similarity in amino acid and gene sequences tended to decline as taxonomic distance increased, with minimum similarities of 18-26%, depending on the gene. The results indicate that some adhesive proteins (e.g. 100 kDa are more conserved within barnacles than others (20 kDa.

  6. The PAZAR database of gene regulatory information coupled to the ORCA toolkit for the study of regulatory sequences

    Science.gov (United States)

    Portales-Casamar, Elodie; Arenillas, David; Lim, Jonathan; Swanson, Magdalena I.; Jiang, Steven; McCallum, Anthony; Kirov, Stefan; Wasserman, Wyeth W.

    2009-01-01

    The PAZAR database unites independently created and maintained data collections of transcription factor and regulatory sequence annotation. The flexible PAZAR schema permits the representation of diverse information derived from experiments ranging from biochemical protein–DNA binding to cellular reporter gene assays. Data collections can be made available to the public, or restricted to specific system users. The data ‘boutiques’ within the shopping-mall-inspired system facilitate the analysis of genomics data and the creation of predictive models of gene regulation. Since its initial release, PAZAR has grown in terms of data, features and through the addition of an associated package of software tools called the ORCA toolkit (ORCAtk). ORCAtk allows users to rapidly develop analyses based on the information stored in the PAZAR system. PAZAR is available at http://www.pazar.info. ORCAtk can be accessed through convenient buttons located in the PAZAR pages or via our website at http://www.cisreg.ca/ORCAtk. PMID:18971253

  7. ProteinSplit: splitting of multi-domain proteins using prediction of ordered and disordered regions in protein sequences for virtual structural genomics

    International Nuclear Information System (INIS)

    Wyrwicz, Lucjan S; Koczyk, Grzegorz; Rychlewski, Leszek; Plewczynski, Dariusz

    2007-01-01

    The annotation of protein folds within newly sequenced genomes is the main target for semi-automated protein structure prediction (virtual structural genomics). A large number of automated methods have been developed recently with very good results in the case of single-domain proteins. Unfortunately, most of these automated methods often fail to properly predict the distant homology between a given multi-domain protein query and structural templates. Therefore a multi-domain protein should be split into domains in order to overcome this limitation. ProteinSplit is designed to identify protein domain boundaries using a novel algorithm that predicts disordered regions in protein sequences. The software utilizes various sequence characteristics to assess the local propensity of a protein to be disordered or ordered in terms of local structure stability. These disordered parts of a protein are likely to create interdomain spacers. Because of its speed and portability, the method was successfully applied to several genome-wide fold annotation experiments. The user can run an automated analysis of sets of proteins or perform semi-automated multiple user projects (saving the results on the server). Additionally the sequences of predicted domains can be sent to the Bioinfo.PL Protein Structure Prediction Meta-Server for further protein three-dimensional structure and function prediction. The program is freely accessible as a web service at http://lucjan.bioinfo.pl/proteinsplit together with detailed benchmark results on the critical assessment of a fully automated structure prediction (CAFASP) set of sequences. The source code of the local version of protein domain boundary prediction is available upon request from the authors

  8. Cloning and sequence analysis of cDNA coding for rat nucleolar protein C23

    International Nuclear Information System (INIS)

    Ghaffari, S.H.; Olson, M.O.J.

    1986-01-01

    Using synthetic oligonucleotides as primers and probes, the authors have isolated and sequenced cDNA clones encoding protein C23, a putative nucleolus organizer protein. Poly(A + ) RNA was isolated from rat Novikoff hepatoma cells and enriched in C23 mRNA by sucrose density gradient ultracentrifugation. Two deoxyoligonuleotides, a 48- and a 27-mer, were synthesized on the basis of amino acid sequence from the C-terminal half of protein C23 and cDNA sequence data from CHO cell protein. The 48-mer was used a primer for synthesis of cDNA which was then inserted into plasmid pUC9. Transformed bacterial colonies were screened by hybridization with 32 P labeled 27-mer. Two clones among 5000 gave a strong positive signal. Plasmid DNAs from these clones were purified and characterized by blotting and nucleotide sequence analysis. The length of C23 mRNA was estimated to be 3200 bases in a northern blot analysis. The sequence of a 267 b.p. insert shows high homology with the CHO cDNA with only 9 nucleotide differences and an identical amino acid sequence. These studies indicate that this region of the protein is highly conserved

  9. Application of native signal sequences for recombinant proteins secretion in Pichia pastoris

    DEFF Research Database (Denmark)

    Borodina, Irina; Do, Duy Duc; Eriksen, Jens C.

    Background Methylotrophic yeast Pichia pastoris is widely used for recombinant protein production, largely due to its ability to secrete correctly folded heterologous proteins to the fermentation medium. Secretion is usually achieved by cloning the recombinant gene after a leader sequence, where...... alpha‐mating factor (MF) prepropeptide from Saccharomyces cerevisiae is most commonly used. Our aim was to test whether signal peptides from P. pastoris native secreted proteins could be used to direct secretion of recombinant proteins. Results Eleven native signal peptides from P. pastoris were tested...... by optimization of expression of three different proteins in P. pastoris. Conclusions Native signal peptides from P. pastoris can be used to direct secretion of recombinant proteins. A novel USER‐based P. pastoris system allows easy cloning of protein‐coding gene with the promoter and leader sequence of choice....

  10. Interactions of rat repetitive sequence MspI8 with nuclear matrix proteins during spermatogenesis

    International Nuclear Information System (INIS)

    Rogolinski, J.; Widlak, P.; Rzeszowska-Wolny, J.

    1996-01-01

    Using the Southwestern blot analysis we have studied the interactions between rat repetitive sequence MspI8 and the nuclear matrix proteins of rats testis cells. Starting from 2 weeks the young to adult animal showed differences in type of testis nuclear matrix proteins recognizing the MspI8 sequence. The same sets of nuclear matrix proteins were detected in some enriched in spermatocytes and spermatids and obtained after fractionation of cells of adult animal by the velocity sedimentation technique. (author). 21 refs, 5 figs

  11. GenProBiS: web server for mapping of sequence variants to protein binding sites.

    Science.gov (United States)

    Konc, Janez; Skrlj, Blaz; Erzen, Nika; Kunej, Tanja; Janezic, Dusanka

    2017-07-03

    Discovery of potentially deleterious sequence variants is important and has wide implications for research and generation of new hypotheses in human and veterinary medicine, and drug discovery. The GenProBiS web server maps sequence variants to protein structures from the Protein Data Bank (PDB), and further to protein-protein, protein-nucleic acid, protein-compound, and protein-metal ion binding sites. The concept of a protein-compound binding site is understood in the broadest sense, which includes glycosylation and other post-translational modification sites. Binding sites were defined by local structural comparisons of whole protein structures using the Protein Binding Sites (ProBiS) algorithm and transposition of ligands from the similar binding sites found to the query protein using the ProBiS-ligands approach with new improvements introduced in GenProBiS. Binding site surfaces were generated as three-dimensional grids encompassing the space occupied by predicted ligands. The server allows intuitive visual exploration of comprehensively mapped variants, such as human somatic mis-sense mutations related to cancer and non-synonymous single nucleotide polymorphisms from 21 species, within the predicted binding sites regions for about 80 000 PDB protein structures using fast WebGL graphics. The GenProBiS web server is open and free to all users at http://genprobis.insilab.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  12. Insights into SCP/TAPS proteins of liver flukes based on large-scale bioinformatic analyses of sequence datasets.

    Directory of Open Access Journals (Sweden)

    Cinzia Cantacessi

    Full Text Available BACKGROUND: SCP/TAPS proteins of parasitic helminths have been proposed to play key roles in fundamental biological processes linked to the invasion of and establishment in their mammalian host animals, such as the transition from free-living to parasitic stages and the modulation of host immune responses. Despite the evidence that SCP/TAPS proteins of parasitic nematodes are involved in host-parasite interactions, there is a paucity of information on this protein family for parasitic trematodes of socio-economic importance. METHODOLOGY/PRINCIPAL FINDINGS: We conducted the first large-scale study of SCP/TAPS proteins of a range of parasitic trematodes of both human and veterinary importance (including the liver flukes Clonorchis sinensis, Opisthorchis viverrini, Fasciola hepatica and F. gigantica as well as the blood flukes Schistosoma mansoni, S. japonicum and S. haematobium. We mined all current transcriptomic and/or genomic sequence datasets from public databases, predicted secondary structures of full-length protein sequences, undertook systematic phylogenetic analyses and investigated the differential transcription of SCP/TAPS genes in O. viverrini and F. hepatica, with an emphasis on those that are up-regulated in the developmental stages infecting the mammalian host. CONCLUSIONS: This work, which sheds new light on SCP/TAPS proteins, guides future structural and functional explorations of key SCP/TAPS molecules associated with diseases caused by flatworms. Future fundamental investigations of these molecules in parasites and the integration of structural and functional data could lead to new approaches for the control of parasitic diseases.

  13. Prediction of protein hydration sites from sequence by modular neural networks

    DEFF Research Database (Denmark)

    Ehrlich, L.; Reczko, M.; Bohr, Henrik

    1998-01-01

    The hydration properties of a protein are important determinants of its structure and function. Here, modular neural networks are employed to predict ordered hydration sites using protein sequence information. First, secondary structure and solvent accessibility are predicted from sequence with two...... separate neural networks. These predictions are used as input together with protein sequences for networks predicting hydration of residues, backbone atoms and sidechains. These networks are teined with protein crystal structures. The prediction of hydration is improved by adding information on secondary...... structure and solvent accessibility and, using actual values of these properties, redidue hydration can be predicted to 77% accuracy with a Metthews coefficient of 0.43. However, predicted property data with an accuracy of 60-70% result in less than half the improvement in predictive performance observed...

  14. Identification and Removal of Contaminant Sequences From Ribosomal Gene Databases: Lessons From the Census of Deep Life.

    Science.gov (United States)

    Sheik, Cody S; Reese, Brandi Kiel; Twing, Katrina I; Sylvan, Jason B; Grim, Sharon L; Schrenk, Matthew O; Sogin, Mitchell L; Colwell, Frederick S

    2018-01-01

    Earth's subsurface environment is one of the largest, yet least studied, biomes on Earth, and many questions remain regarding what microorganisms are indigenous to the subsurface. Through the activity of the Census of Deep Life (CoDL) and the Deep Carbon Observatory, an open access 16S ribosomal RNA gene sequence database from diverse subsurface environments has been compiled. However, due to low quantities of biomass in the deep subsurface, the potential for incorporation of contaminants from reagents used during sample collection, processing, and/or sequencing is high. Thus, to understand the ecology of subsurface microorganisms (i.e., the distribution, richness, or survival), it is necessary to minimize, identify, and remove contaminant sequences that will skew the relative abundances of all taxa in the sample. In this meta-analysis, we identify putative contaminants associated with the CoDL dataset, recommend best practices for removing contaminants from samples, and propose a series of best practices for subsurface microbiology sampling. The most abundant putative contaminant genera observed, independent of evenness across samples, were Propionibacterium , Aquabacterium , Ralstonia , and Acinetobacter . While the top five most frequently observed genera were Pseudomonas , Propionibacterium , Acinetobacter , Ralstonia , and Sphingomonas . The majority of the most frequently observed genera (high evenness) were associated with reagent or potential human contamination. Additionally, in DNA extraction blanks, we observed potential archaeal contaminants, including methanogens, which have not been discussed in previous contamination studies. Such contaminants would directly affect the interpretation of subsurface molecular studies, as methanogenesis is an important subsurface biogeochemical process. Utilizing previously identified contaminant genera, we found that ∼27% of the total dataset were identified as contaminant sequences that likely originate from DNA

  15. RegTransBase - A Database Of Regulatory Sequences and Interactionsin a Wide Range of Prokaryotic Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Kazakov, Alexei E.; Cipriano, Michael J.; Novichkov, Pavel S.; Minovitsky, Simon; Vinogradov, Dmitry V.; Arkin, Adam; Mironov, AndreyA.; Gelfand, Mikhail S.; Dubchak, Inna

    2006-07-01

    RegTransBase, a manually curated database of regulatoryinteractions in prokaryotes, captures the knowledge in publishedscientific literature using a controlled vocabulary. Although a number ofdatabases describing interactions between regulatory proteins and theirbinding sites are currently being maintained, they focus mostly on themodel organisms Escherichia coli and Bacillus subtilis, or are entirelycomputationally derived. RegTransBase describes a large number ofregulatory interactions reported in many organisms and contains varioustypes of experimental data, in particular: the activation or repressionof transcription by an identified direct regulator; determining thetranscriptional regulatory function of a protein (or RNA) directlybinding to DNA (RNA); mapping or prediction of binding site for aregulatory protein; characterization of regulatory mutations. Currently,the RegTransBase content is derived from about 3000 relevant articlesdescribing over 7000 experiments in relation to 128 microbes. It containsdata on the regulation of about 7500 genes and evidence for 6500interactions with 650 regulators. RegTransBase also contains manuallycreated position weight matrices (PWM) that can be used to identifycandidate regulatory sites in over 60 species. RegTransBase is availableat http://regtransbase.lbl.gov.

  16. PredPPCrys: accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection.

    Directory of Open Access Journals (Sweden)

    Huilin Wang

    Full Text Available X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed 'PredPPCrys' using the support vector machine (SVM. Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I. Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II, which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. In addition, the predicted crystallization

  17. Unlimited Thirst for Genome Sequencing, Data Interpretation, and Database Usage in Genomic Era: The Road towards Fast-Track Crop Plant Improvement

    Directory of Open Access Journals (Sweden)

    Arun Prabhu Dhanapal

    2015-01-01

    Full Text Available The number of sequenced crop genomes and associated genomic resources is growing rapidly with the advent of inexpensive next generation sequencing methods. Databases have become an integral part of all aspects of science research, including basic and applied plant and animal sciences. The importance of databases keeps increasing as the volume of datasets from direct and indirect genomics, as well as other omics approaches, keeps expanding in recent years. The databases and associated web portals provide at a minimum a uniform set of tools and automated analysis across a wide range of crop plant genomes. This paper reviews some basic terms and considerations in dealing with crop plant databases utilization in advancing genomic era. The utilization of databases for variation analysis with other comparative genomics tools, and data interpretation platforms are well described. The major focus of this review is to provide knowledge on platforms and databases for genome-based investigations of agriculturally important crop plants. The utilization of these databases in applied crop improvement program is still being achieved widely; otherwise, the end for sequencing is not far away.

  18. Exploring sequence characteristics related to high-level production of secreted proteins in Aspergillus niger.

    Directory of Open Access Journals (Sweden)

    Bastiaan A van den Berg

    Full Text Available Protein sequence features are explored in relation to the production of over-expressed extracellular proteins by fungi. Knowledge on features influencing protein production and secretion could be employed to improve enzyme production levels in industrial bioprocesses via protein engineering. A large set, over 600 homologous and nearly 2,000 heterologous fungal genes, were overexpressed in Aspergillus niger using a standardized expression cassette and scored for high versus no production. Subsequently, sequence-based machine learning techniques were applied for identifying relevant DNA and protein sequence features. The amino-acid composition of the protein sequence was found to be most predictive and interpretation revealed that, for both homologous and heterologous gene expression, the same features are important: tyrosine and asparagine composition was found to have a positive correlation with high-level production, whereas for unsuccessful production, contributions were found for methionine and lysine composition. The predictor is available online at http://bioinformatics.tudelft.nl/hipsec. Subsequent work aims at validating these findings by protein engineering as a method for increasing expression levels per gene copy.

  19. The N-terminal sequence of ribosomal protein L10 from the archaebacterium Halobacterium marismortui and its relationship to eubacterial protein L6 and other ribosomal proteins.

    Science.gov (United States)

    Dijk, J; van den Broek, R; Nasiulas, G; Beck, A; Reinhardt, R; Wittmann-Liebold, B

    1987-08-01

    The amino-terminal sequence of ribosomal protein L10 from Halobacterium marismortui has been determined up to residue 54, using both a liquid- and a gas-phase sequenator. The two sequences are in good agreement. The protein is clearly homologous to protein HcuL10 from the related strain Halobacterium cutirubrum. Furthermore, a weaker but distinct homology to ribosomal protein L6 from Escherichia coli and Bacillus stearothermophilus can be detected. In addition to 7 identical amino acids in the first 36 residues in all four sequences a number of conservative replacements occurs, of mainly hydrophobic amino acids. In this common region the pattern of conserved amino acids suggests the presence of a beta-alpha fold as it occurs in ribosomal proteins L12 and L30. Furthermore, several potential cases of homology to other ribosomal components of the three ur-kingdoms have been found.

  20. Dynameomics: a multi-dimensional analysis-optimized database for dynamic protein data.

    Science.gov (United States)

    Kehl, Catherine; Simms, Andrew M; Toofanny, Rudesh D; Daggett, Valerie

    2008-06-01

    The Dynameomics project is our effort to characterize the native-state dynamics and folding/unfolding pathways of representatives of all known protein folds by way of molecular dynamics simulations, as described by Beck et al. (in Protein Eng. Des. Select., the first paper in this series). The data produced by these simulations are highly multidimensional in structure and multi-terabytes in size. Both of these features present significant challenges for storage, retrieval and analysis. For optimal data modeling and flexibility, we needed a platform that supported both multidimensional indices and hierarchical relationships between related types of data and that could be integrated within our data warehouse, as described in the accompanying paper directly precedi