WorldWideScience

Sample records for protein sequence evolution

  1. Protein sequence comparison and protein evolution

    Energy Technology Data Exchange (ETDEWEB)

    Pearson, W.R. [Univ. of Virginia, Charlottesville, VA (United States). Dept. of Biochemistry

    1995-12-31

    This tutorial was one of eight tutorials selected to be presented at the Third International Conference on Intelligent Systems for Molecular Biology which was held in the United Kingdom from July 16 to 19, 1995. This tutorial examines how the information conserved during the evolution of a protein molecule can be used to infer reliably homology, and thus a shared proteinfold and possibly a shared active site or function. The authors start by reviewing a geological/evolutionary time scale. Next they look at the evolution of several protein families. During the tutorial, these families will be used to demonstrate that homologous protein ancestry can be inferred with confidence. They also examine different modes of protein evolution and consider some hypotheses that have been presented to explain the very earliest events in protein evolution. The next part of the tutorial will examine the technical aspects of protein sequence comparison. Both optimal and heuristic algorithms and their associated parameters that are used to characterize protein sequence similarities are discussed. Perhaps more importantly, they survey the statistics of local similarity scores, and how these statistics can both be used to improve the selectivity of a search and to evaluate the significance of a match. They them examine distantly related members of three protein families, the serine proteases, the glutathione transferases, and the G-protein-coupled receptors (GCRs). Finally, the discuss how sequence similarity can be used to examine internal repeated or mosaic structures in proteins.

  2. Biophysical and structural considerations for protein sequence evolution

    Directory of Open Access Journals (Sweden)

    Grahnen Johan A

    2011-12-01

    Full Text Available Abstract Background Protein sequence evolution is constrained by the biophysics of folding and function, causing interdependence between interacting sites in the sequence. However, current site-independent models of sequence evolutions do not take this into account. Recent attempts to integrate the influence of structure and biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected improvements in model performance. This suggests that further innovations are needed for progress in this field. Results Here we develop a coarse-grained physics-based model of protein folding and binding function, and compare it to a popular informational model. We find that both models violate the assumption of the native sequence being close to a thermodynamic optimum, causing directional selection away from the native state. Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that vary less among residue type. The informational model diffuses further in sequence space with fewer barriers and tends to provide less support for an invariant sites model, although amino acid substitutions are generally conservative. Both approaches produce sequences with natural features like dN/dS Conclusions Simple coarse-grained models of protein folding can describe some natural features of evolving proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding energy, and binding function, as regards both native and non-native conformations, and describe a large number of possible applications for such a model.

  3. PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways

    OpenAIRE

    Mi, Huaiyu; Guo, Nan; Kejariwal, Anish; Thomas, Paul D.

    2006-01-01

    PANTHER is a freely available, comprehensive software system for relating protein sequence evolution to the evolution of specific protein functions and biological roles. Since 2005, there have been three main improvements to PANTHER. First, the sequences used to create evolutionary trees are carefully selected to provide coverage of phylogenetic as well as functional information. Second, PANTHER is now a member of the InterPro Consortium, and the PANTHER hidden markov Models (HMMs) are distri...

  4. Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently

    Science.gov (United States)

    Currin, Andrew; Swainston, Neil; Day, Philip J.

    2015-01-01

    The amino acid sequence of a protein affects both its structure and its function. Thus, the ability to modify the sequence, and hence the structure and activity, of individual proteins in a systematic way, opens up many opportunities, both scientifically and (as we focus on here) for exploitation in biocatalysis. Modern methods of synthetic biology, whereby increasingly large sequences of DNA can be synthesised de novo, allow an unprecedented ability to engineer proteins with novel functions. However, the number of possible proteins is far too large to test individually, so we need means for navigating the ‘search space’ of possible protein sequences efficiently and reliably in order to find desirable activities and other properties. Enzymologists distinguish binding (K d) and catalytic (k cat) steps. In a similar way, judicious strategies have blended design (for binding, specificity and active site modelling) with the more empirical methods of classical directed evolution (DE) for improving k cat (where natural evolution rarely seeks the highest values), especially with regard to residues distant from the active site and where the functional linkages underpinning enzyme dynamics are both unknown and hard to predict. Epistasis (where the ‘best’ amino acid at one site depends on that or those at others) is a notable feature of directed evolution. The aim of this review is to highlight some of the approaches that are being developed to allow us to use directed evolution to improve enzyme properties, often dramatically. We note that directed evolution differs in a number of ways from natural evolution, including in particular the available mechanisms and the likely selection pressures. Thus, we stress the opportunities afforded by techniques that enable one to map sequence to (structure and) activity in silico, as an effective means of modelling and exploring protein landscapes. Because known landscapes may be assessed and reasoned about as a whole

  5. Neutral evolution of proteins: The superfunnel in sequence space and its relation to mutational robustness

    Science.gov (United States)

    Noirel, Josselin; Simonson, Thomas

    2008-11-01

    Following Kimura's neutral theory of molecular evolution [M. Kimura, The Neutral Theory of Molecular Evolution (Cambridge University Press, Cambridge, 1983) (reprinted in 1986)], it has become common to assume that the vast majority of viable mutations of a gene confer little or no functional advantage. Yet, in silico models of protein evolution have shown that mutational robustness of sequences could be selected for, even in the context of neutral evolution. The evolution of a biological population can be seen as a diffusion on the network of viable sequences. This network is called a "neutral network." Depending on the mutation rate μ and the population size N, the biological population can evolve purely randomly (μN ≪1) or it can evolve in such a way as to select for sequences of higher mutational robustness (μN ≫1). The stringency of the selection depends not only on the product μN but also on the exact topology of the neutral network, the special arrangement of which was named "superfunnel." Even though the relation between mutation rate, population size, and selection was thoroughly investigated, a study of the salient topological features of the superfunnel that could affect the strength of the selection was wanting. This question is addressed in this study. We use two different models of proteins: on lattice and off lattice. We compare neutral networks computed using these models to random networks. From this, we identify two important factors of the topology that determine the stringency of the selection for mutationally robust sequences. First, the presence of highly connected nodes ("hubs") in the network increases the selection for mutationally robust sequences. Second, the stringency of the selection increases when the correlation between a sequence's mutational robustness and its neighbors' increases. The latter finding relates a global characteristic of the neutral network to a local one, which is attainable through experiments or molecular

  6. A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences.

    Science.gov (United States)

    Groussin, M; Boussau, B; Gouy, M

    2013-07-01

    Most models of nucleotide or amino acid substitution used in phylogenetic studies assume that the evolutionary process has been homogeneous across lineages and that composition of nucleotides or amino acids has remained the same throughout the tree. These oversimplified assumptions are refuted by the observation that compositional variability characterizes extant biological sequences. Branch-heterogeneous models of protein evolution that account for compositional variability have been developed, but are not yet in common use because of the large number of parameters required, leading to high computational costs and potential overparameterization. Here, we present a new branch-nonhomogeneous and nonstationary model of protein evolution that captures more accurately the high complexity of sequence evolution. This model, henceforth called Correspondence and likelihood analysis (COaLA), makes use of a correspondence analysis to reduce the number of parameters to be optimized through maximum likelihood, focusing on most of the compositional variation observed in the data. The model was thoroughly tested on both simulated and biological data sets to show its high performance in terms of data fitting and CPU time. COaLA efficiently estimates ancestral amino acid frequencies and sequences, making it relevant for studies aiming at reconstructing and resurrecting ancestral amino acid sequences. Finally, we applied COaLA on a concatenate of universal amino acid sequences to confirm previous results obtained with a nonhomogeneous Bayesian model regarding the early pattern of adaptation to optimal growth temperature, supporting the mesophilic nature of the Last Universal Common Ancestor.

  7. Rapid evolution of the sequences and gene repertoires of secreted proteins in bacteria.

    Directory of Open Access Journals (Sweden)

    Teresa Nogueira

    Full Text Available Proteins secreted to the extracellular environment or to the periphery of the cell envelope, the secretome, play essential roles in foraging, antagonistic and mutualistic interactions. We hypothesize that arms races, genetic conflicts and varying selective pressures should lead to the rapid change of sequences and gene repertoires of the secretome. The analysis of 42 bacterial pan-genomes shows that secreted, and especially extracellular proteins, are predominantly encoded in the accessory genome, i.e. among genes not ubiquitous within the clade. Genes encoding outer membrane proteins might engage more frequently in intra-chromosomal gene conversion because they are more often in multi-genic families. The gene sequences encoding the secretome evolve faster than the rest of the genome and in particular at non-synonymous positions. Cell wall proteins in Firmicutes evolve particularly fast when compared with outer membrane proteins of Proteobacteria. Virulence factors are over-represented in the secretome, notably in outer membrane proteins, but cell localization explains more of the variance in substitution rates and gene repertoires than sequence homology to known virulence factors. Accordingly, the repertoires and sequences of the genes encoding the secretome change fast in the clades of obligatory and facultative pathogens and also in the clades of mutualists and free-living bacteria. Our study shows that cell localization shapes genome evolution. In agreement with our hypothesis, the repertoires and the sequences of genes encoding secreted proteins evolve fast. The particularly rapid change of extracellular proteins suggests that these public goods are key players in bacterial adaptation.

  8. Combining protein sequence, structure, and dynamics: A novel approach for functional evolution analysis of PAS domain superfamily.

    Science.gov (United States)

    Dong, Zheng; Zhou, Hongyu; Tao, Peng

    2018-02-01

    PAS domains are widespread in archaea, bacteria, and eukaryota, and play important roles in various functions. In this study, we aim to explore functional evolutionary relationship among proteins in the PAS domain superfamily in view of the sequence-structure-dynamics-function relationship. We collected protein sequences and crystal structure data from RCSB Protein Data Bank of the PAS domain superfamily belonging to three biological functions (nucleotide binding, photoreceptor activity, and transferase activity). Protein sequences were aligned and then used to select sequence-conserved residues and build phylogenetic tree. Three-dimensional structure alignment was also applied to obtain structure-conserved residues. The protein dynamics were analyzed using elastic network model (ENM) and validated by molecular dynamics (MD) simulation. The result showed that the proteins with same function could be grouped by sequence similarity, and proteins in different functional groups displayed statistically significant difference in their vibrational patterns. Interestingly, in all three functional groups, conserved amino acid residues identified by sequence and structure conservation analysis generally have a lower fluctuation than other residues. In addition, the fluctuation of conserved residues in each biological function group was strongly correlated with the corresponding biological function. This research suggested a direct connection in which the protein sequences were related to various functions through structural dynamics. This is a new attempt to delineate functional evolution of proteins using the integrated information of sequence, structure, and dynamics. © 2017 The Protein Society.

  9. Pleiotropy constrains the evolution of protein but not regulatory sequences in a transcription regulatory network influencing complex social behaviours

    Directory of Open Access Journals (Sweden)

    Daria eMolodtsova

    2014-12-01

    Full Text Available It is increasingly apparent that genes and networks that influence complex behaviour are evolutionary conserved, which is paradoxical considering that behaviour is labile over evolutionary timescales. How does adaptive change in behaviour arise if behaviour is controlled by conserved, pleiotropic, and likely evolutionary constrained genes? Pleiotropy and connectedness are known to constrain the general rate of protein evolution, prompting some to suggest that the evolution of complex traits, including behaviour, is fuelled by regulatory sequence evolution. However, we seldom have data on the strength of selection on mutations in coding and regulatory sequences, and this hinders our ability to study how pleiotropy influences coding and regulatory sequence evolution. Here we use population genomics to estimate the strength of selection on coding and regulatory mutations for a transcriptional regulatory network that influences complex behaviour of honey bees. We found that replacement mutations in highly connected transcription factors and target genes experience significantly stronger negative selection relative to weakly connected transcription factors and targets. Adaptively evolving proteins were significantly more likely to reside at the periphery of the regulatory network, while proteins with signs of negative selection were near the core of the network. Interestingly, connectedness and network structure had minimal influence on the strength of selection on putative regulatory sequences for both transcription factors and their targets. Our study indicates that adaptive evolution of complex behaviour can arise because of positive selection on protein-coding mutations in peripheral genes, and on regulatory sequence mutations in both transcription factors and their targets throughout the network.

  10. Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins

    Directory of Open Access Journals (Sweden)

    Yan Koon-Kiu

    2007-11-01

    Full Text Available Abstract Background The evolution of the full repertoire of proteins encoded in a given genome is mostly driven by gene duplications, deletions, and sequence modifications of existing proteins. Indirect information about relative rates and other intrinsic parameters of these three basic processes is contained in the proteome-wide distribution of sequence identities of pairs of paralogous proteins. Results We introduce a simple mathematical framework based on a stochastic birth-and-death model that allows one to extract some of this information and apply it to the set of all pairs of paralogous proteins in H. pylori, E. coli, S. cerevisiae, C. elegans, D. melanogaster, and H. sapiens. It was found that the histogram of sequence identities p generated by an all-to-all alignment of all protein sequences encoded in a genome is well fitted with a power-law form ~ p-γ with the value of the exponent γ around 4 for the majority of organisms used in this study. This implies that the intra-protein variability of substitution rates is best described by the Gamma-distribution with the exponent α ≈ 0.33. Different features of the shape of such histograms allow us to quantify the ratio between the genome-wide average deletion/duplication rates and the amino-acid substitution rate. Conclusion We separately measure the short-term ("raw" duplication and deletion rates rdup∗ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOCai3aa0baaSqaaiabbsgaKjabbwha1jabbchaWbqaaiabgEHiQaaaaaa@3283@, rdel∗ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOCai3aa0baaSqaaiabbsga

  11. Shotgun protein sequencing.

    Energy Technology Data Exchange (ETDEWEB)

    Faulon, Jean-Loup Michel; Heffelfinger, Grant S.

    2009-06-01

    A novel experimental and computational technique based on multiple enzymatic digestion of a protein or protein mixture that reconstructs protein sequences from sequences of overlapping peptides is described in this SAND report. This approach, analogous to shotgun sequencing of DNA, is to be used to sequence alternative spliced proteins, to identify post-translational modifications, and to sequence genetically engineered proteins.

  12. Roles of Solvent Accessibility and Gene Expression in Modeling Protein Sequence Evolution

    OpenAIRE

    Kuangyu Wang; Shuhui Yu; Xiang Ji; Clemens Lakner; Alexander Griffing; Jeffrey L. Thorne

    2015-01-01

    Models of protein evolution tend to ignore functional constraints, although structural constraints are sometimes incorporated. Here we propose a probabilistic framework for codon substitution that evaluates joint effects of relative solvent accessibility (RSA), a structural constraint; and gene expression, a functional constraint. First, we explore the relationship between RSA and codon usage at the genomic scale as well as at the individual gene scale. Motivated by these results, we construc...

  13. From Sequence and Forces to Structure, Function and Evolution of Intrinsically Disordered Proteins

    Science.gov (United States)

    Forman-Kay, Julie D.; Mittag, Tanja

    2015-01-01

    Intrinsically disordered proteins (IDPs), which lack persistent structure, are a challenge to structural biology due to the inapplicability of standard methods for characterization of folded proteins as well as their deviation from the dominant structure/function paradigm. Their widespread presence and involvement in biological function, however, has spurred the growing acceptance of the importance of IDPs and the development of new tools for studying their structure, dynamics and function. The interplay of folded and disordered domains or regions for function and the existence of a continuum of protein states with respect to conformational energetics, motional timescales and compactness is shaping a unified understanding of structure-dynamics-disorder/function relationships. On the 20th anniversary of this journal, Structure, we provide a historical perspective on the investigation of IDPs and summarize the sequence features and physical forces that underlie their unique structural, functional and evolutionary properties. PMID:24010708

  14. Structural and Sequence Similarities of Hydra Xeroderma Pigmentosum A Protein to Human Homolog Suggest Early Evolution and Conservation

    Directory of Open Access Journals (Sweden)

    Apurva Barve

    2013-01-01

    Full Text Available Xeroderma pigmentosum group A (XPA is a protein that binds to damaged DNA, verifies presence of a lesion, and recruits other proteins of the nucleotide excision repair (NER pathway to the site. Though its homologs from yeast, Drosophila, humans, and so forth are well studied, XPA has not so far been reported from protozoa and lower animal phyla. Hydra is a fresh-water cnidarian with a remarkable capacity for regeneration and apparent lack of organismal ageing. Cnidarians are among the first metazoa with a defined body axis, tissue grade organisation, and nervous system. We report here for the first time presence of XPA gene in hydra. Putative protein sequence of hydra XPA contains nuclear localization signal and bears the zinc-finger motif. It contains two conserved Pfam domains and various characterized features of XPA proteins like regions for binding to excision repair cross-complementing protein-1 (ERCC1 and replication protein A 70 kDa subunit (RPA70 proteins. Hydra XPA shows a high degree of similarity with vertebrate homologs and clusters with deuterostomes in phylogenetic analysis. Homology modelling corroborates the very close similarity between hydra and human XPA. The protein thus most likely functions in hydra in the same manner as in other animals, indicating that it arose early in evolution and has been conserved across animal phyla.

  15. Genetic diversity and molecular evolution of Ornithogalum mosaic virus based on the coat protein gene sequence

    Directory of Open Access Journals (Sweden)

    Fangluan Gao

    2018-03-01

    Full Text Available Ornithogalum mosaic virus (OrMV has a wide host range and affects the production of a variety of ornamentals. In this study, the coat protein (CP gene of OrMVwas used to investigate the molecular mechanisms underlying the evolution of this virus. The 36 OrMV isolates fell into two groups which have significant subpopulation differentiation with an FST value of 0.470. One isolate was identified as a recombinant and the other 35 recombination-free isolates could be divided into two major clades under different evolutionary constraints with dN/dS values of 0.055 and 0.028, respectively, indicating a role of purifying selection in the differentiation of OrMV. In addition, the results from analysis of molecular variance (AMOVA indicated that the effect of host species on the genetic divergence of OrMV is greater than that of geography. Furthermore, OrMV isolates from the genera Ornithogalum, Lachenalia and Diuri tended to group together, indicating that OrMV diversification was maintained, in part, by host-driven adaptation.

  16. Evolution of early life inferred from protein and ribonucleic acid sequences

    Science.gov (United States)

    Dayhoff, M. O.; Schwartz, R. M.

    1978-01-01

    The chemical structures of ferredoxin, 5S ribosomal RNA, and c-type cytochrome sequences have been employed to construct a phylogenetic tree which connects all major photosynthesizing organisms: the three types of bacteria, blue-green algae, and chloroplasts. Anaerobic and aerobic bacteria, eukaryotic cytoplasmic components and mitochondria are also included in the phylogenetic tree. Anaerobic nonphotosynthesizing bacteria similar to Clostridium were the earliest organisms, arising more than 3.2 billion years ago. Bacterial photosynthesis evolved nearly 3.0 billion years ago, while oxygen-evolving photosynthesis, originating in the blue-green algal line, came into being about 2.0 billion years ago. The phylogenetic tree supports the symbiotic theory of the origin of eukaryotes.

  17. Full-Length Venom Protein cDNA Sequences from Venom-Derived mRNA: Exploring Compositional Variation and Adaptive Multigene Evolution.

    Science.gov (United States)

    Modahl, Cassandra M; Mackessy, Stephen P

    2016-06-01

    access to cDNA sequences in the absence of living specimens, even from commercial venom sources, to evaluate important regional differences in venom composition and to study snake venom protein evolution.

  18. Sequence, structure and function relationships in flaviviruses as assessed by evolutive aspects of its conserved non-structural protein domains.

    Science.gov (United States)

    da Fonseca, Néli José; Lima Afonso, Marcelo Querino; Pedersolli, Natan Gonçalves; de Oliveira, Lucas Carrijo; Andrade, Dhiego Souto; Bleicher, Lucas

    2017-10-28

    Flaviviruses are responsible for serious diseases such as dengue, yellow fever, and zika fever. Their genomes encode a polyprotein which, after cleavage, results in three structural and seven non-structural proteins. Homologous proteins can be studied by conservation and coevolution analysis as detected in multiple sequence alignments, usually reporting positions which are strictly necessary for the structure and/or function of all members in a protein family or which are involved in a specific sub-class feature requiring the coevolution of residue sets. This study provides a complete conservation and coevolution analysis on all flaviviruses non-structural proteins, with results mapped on all well-annotated available sequences. A literature review on the residues found in the analysis enabled us to compile available information on their roles and distribution among different flaviviruses. Also, we provide the mapping of conserved and coevolved residues for all sequences currently in SwissProt as a supplementary material, so that particularities in different viruses can be easily analyzed. Copyright © 2017 Elsevier Inc. All rights reserved.

  19. Biophysics of protein evolution and evolutionary protein biophysics

    Science.gov (United States)

    Sikosek, Tobias; Chan, Hue Sun

    2014-01-01

    The study of molecular evolution at the level of protein-coding genes often entails comparing large datasets of sequences to infer their evolutionary relationships. Despite the importance of a protein's structure and conformational dynamics to its function and thus its fitness, common phylogenetic methods embody minimal biophysical knowledge of proteins. To underscore the biophysical constraints on natural selection, we survey effects of protein mutations, highlighting the physical basis for marginal stability of natural globular proteins and how requirement for kinetic stability and avoidance of misfolding and misinteractions might have affected protein evolution. The biophysical underpinnings of these effects have been addressed by models with an explicit coarse-grained spatial representation of the polypeptide chain. Sequence–structure mappings based on such models are powerful conceptual tools that rationalize mutational robustness, evolvability, epistasis, promiscuous function performed by ‘hidden’ conformational states, resolution of adaptive conflicts and conformational switches in the evolution from one protein fold to another. Recently, protein biophysics has been applied to derive more accurate evolutionary accounts of sequence data. Methods have also been developed to exploit sequence-based evolutionary information to predict biophysical behaviours of proteins. The success of these approaches demonstrates a deep synergy between the fields of protein biophysics and protein evolution. PMID:25165599

  20. Protein splicing and its evolution in eukaryotes

    Directory of Open Access Journals (Sweden)

    Starokadomskyy P. L.

    2010-02-01

    Full Text Available Inteins, or protein introns, are parts of protein sequences that are post-translationally excised, their flanking regions (exteins being spliced together. This process was called protein splicing. Originally inteins were found in prokaryotic or unicellular eukaryotic organisms. But the general principles of post-translation protein rearrangement are evolving yielding different post-translation modification of proteins in multicellular organisms. For clarity, these non-intein mediated events call either protein rearrangements or protein editing. The most intriguing example of protein editing is proteasome-mediated splicing of antigens in vertebrates that may play important role in antigen presentation. Other examples of protein rearrangements are maturation of Hg-proteins (critical receptors in embryogenesis as well as maturation of several metabolic enzymes. Despite a lack of experimental data we try to analyze some intriguing examples of protein splicing evolution.

  1. Evolution of protein-protein interactions

    Indian Academy of Sciences (India)

    Evolution of protein-protein interactions · Our interests in protein-protein interactions · Slide 3 · Slide 4 · Slide 5 · Slide 6 · Slide 7 · Slide 8 · Slide 9 · Slide 10 · Slide 11 · Slide 12 · Slide 13 · Slide 14 · Slide 15 · Slide 16 · Slide 17 · Slide 18 · Slide 19 · Slide 20.

  2. The interface of protein structure, protein biophysics, and molecular evolution

    Science.gov (United States)

    Liberles, David A; Teichmann, Sarah A; Bahar, Ivet; Bastolla, Ugo; Bloom, Jesse; Bornberg-Bauer, Erich; Colwell, Lucy J; de Koning, A P Jason; Dokholyan, Nikolay V; Echave, Julian; Elofsson, Arne; Gerloff, Dietlind L; Goldstein, Richard A; Grahnen, Johan A; Holder, Mark T; Lakner, Clemens; Lartillot, Nicholas; Lovell, Simon C; Naylor, Gavin; Perica, Tina; Pollock, David D; Pupko, Tal; Regan, Lynne; Roger, Andrew; Rubinstein, Nimrod; Shakhnovich, Eugene; Sjölander, Kimmen; Sunyaev, Shamil; Teufel, Ashley I; Thorne, Jeffrey L; Thornton, Joseph W; Weinreich, Daniel M; Whelan, Simon

    2012-01-01

    Abstract The interface of protein structural biology, protein biophysics, molecular evolution, and molecular population genetics forms the foundations for a mechanistic understanding of many aspects of protein biochemistry. Current efforts in interdisciplinary protein modeling are in their infancy and the state-of-the art of such models is described. Beyond the relationship between amino acid substitution and static protein structure, protein function, and corresponding organismal fitness, other considerations are also discussed. More complex mutational processes such as insertion and deletion and domain rearrangements and even circular permutations should be evaluated. The role of intrinsically disordered proteins is still controversial, but may be increasingly important to consider. Protein geometry and protein dynamics as a deviation from static considerations of protein structure are also important. Protein expression level is known to be a major determinant of evolutionary rate and several considerations including selection at the mRNA level and the role of interaction specificity are discussed. Lastly, the relationship between modeling and needed high-throughput experimental data as well as experimental examination of protein evolution using ancestral sequence resurrection and in vitro biochemistry are presented, towards an aim of ultimately generating better models for biological inference and prediction. PMID:22528593

  3. Modeling Protein Evolution

    Science.gov (United States)

    Goldstein, Richard; Pollock, David

    The study of biology is fundamentally different from many other scientific pursuits, such as geology or astrophysics. This difference stems from the ubiquitous questions that arise about function and purpose. These are questions concerning why biological objects operate the way they do: what is the function of a polymerase? What is the role of the immune system? No one, aside from the most dedicated anthropist or interventionist theist, would attempt to determine the purpose of the earth's mantle or the function of a binary star. Among the sciences, it is only biology in which the details of what an object does can be said to be part of the reason for its existence. This is because the process of evolution is capable of improving an object to better carry out a function; that is, it adapts an object within the constraints of mechanics and history (i.e., what has come before). Thus, the ultimate basis of these biological questions is the process of evolution; generally, the function of an enzyme, cell type, organ, system, or trait is the thing that it does that contributes to the fitness (i.e., reproductive success) of the organism of which it is a part or characteristic. Our investigations cannot escape the simple fact that all things in biology (including ourselves) are, ultimately, the result of an evolutionary process.

  4. Simulating efficiently the evolution of DNA sequences.

    Science.gov (United States)

    Schöniger, M; von Haeseler, A

    1995-02-01

    Two menu-driven FORTRAN programs are described that simulate the evolution of DNA sequences in accordance with a user-specified model. This general stochastic model allows for an arbitrary stationary nucleotide composition and any transition-transversion bias during the process of base substitution. In addition, the user may define any hypothetical model tree according to which a family of sequences evolves. The programs suggest the computationally most inexpensive approach to generate nucleotide substitutions. Either reproducible or non-repeatable simulations, depending on the method of initializing the pseudo-random number generator, can be performed. The corresponding options are offered by the interface menu.

  5. Experimental Rugged Fitness Landscape in Protein Sequence Space

    OpenAIRE

    HAYASHI, Yuuki; 相田, 拓洋; TOYOTA, Hitoshi; 伏見, 譲; URABE, Itaru; YOMO, Tetsuya

    2006-01-01

    The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12-130 of the initial random polypeptide and selection for infectivity, the selected phag...

  6. Mutation of miRNA target sequences during human evolution

    DEFF Research Database (Denmark)

    Gardner, Paul P; Vinther, Jeppe

    2008-01-01

    It has long-been hypothesized that changes in non-protein-coding genes and the regulatory sequences controlling expression could undergo positive selection. Here we identify 402 putative microRNA (miRNA) target sequences that have been mutated specifically in the human lineage and show that genes...... containing such deletions are more highly expressed than their mouse orthologs. Our findings indicate that some miRNA target mutations are fixed by positive selection and might have been involved in the evolution of human-specific traits....

  7. Insights into hominid evolution from the gorilla genome sequence

    Science.gov (United States)

    Scally, Aylwyn; Dutheil, Julien Y.; Hillier, LaDeana W.; Jordan, Greg E.; Goodhead, Ian; Herrero, Javier; Hobolth, Asger; Lappalainen, Tuuli; Mailund, Thomas; Marques-Bonet, Tomas; McCarthy, Shane; Montgomery, Stephen H.; Schwalie, Petra C.; Tang, Y. Amy; Ward, Michelle C.; Xue, Yali; Yngvadottir, Bryndis; Alkan, Can; Andersen, Lars N.; Ayub, Qasim; Ball, Edward V.; Beal, Kathryn; Bradley, Brenda J.; Chen, Yuan; Clee, Chris M.; Fitzgerald, Stephen; Graves, Tina A.; Gu, Yong; Heath, Paul; Heger, Andreas; Karakoc, Emre; Kolb-Kokocinski, Anja; Laird, Gavin K.; Lunter, Gerton; Meader, Stephen; Mort, Matthew; Mullikin, James C.; Munch, Kasper; O’Connor, Timothy D.; Phillips, Andrew D.; Prado-Martinez, Javier; Rogers, Anthony S.; Sajjadian, Saba; Schmidt, Dominic; Shaw, Katy; Simpson, Jared T.; Stenson, Peter D.; Turner, Daniel J.; Vigilant, Linda; Vilella, Albert J.; Whitener, Weldon; Zhu, Baoli; Cooper, David N.; de Jong, Pieter; Dermitzakis, Emmanouil T.; Eichler, Evan E.; Flicek, Paul; Goldman, Nick; Mundy, Nicholas I.; Ning, Zemin; Odom, Duncan T.; Ponting, Chris P.; Quail, Michael A.; Ryder, Oliver A.; Searle, Stephen M.; Warren, Wesley C.; Wilson, Richard K.; Schierup, Mikkel H.; Rogers, Jane; Tyler-Smith, Chris; Durbin, Richard

    2012-01-01

    Summary Gorillas are humans’ closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago (Mya). In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution. PMID:22398555

  8. Novel algorithms for protein sequence analysis

    NARCIS (Netherlands)

    Ye, Kai

    2008-01-01

    Each protein is characterized by its unique sequential order of amino acids, the so-called protein sequence. Biology”s paradigm is that this order of amino acids determines the protein”s architecture and function. In this thesis, we introduce novel algorithms to analyze protein sequences. Chapter 1

  9. Learning about evolution from sequence data

    Science.gov (United States)

    Dayarian, Adel; Shraiman, Boris

    2012-02-01

    Recent advances in sequencing and in laboratory evolution experiments have made possible to obtain quantitative data on genetic diversity of populations and on the dynamics of evolution. This dynamics is shaped by the interplay between selection acting on beneficial and deleterious mutations and recombination which reshuffles genotypes. Mounting evidence suggests that natural populations harbor extensive fitness diversity, yet most of the currently available tools for analyzing polymorphism data are based on the neutral theory. Our aim is to develop methods to analyze genomic data for populations in the presence of the above-mentioned factors. We consider different evolutionary regimes - Muller's ratchet, mutation-recombination-selection balance and positive adaption rate - and revisit a number of observables considered in the nearly-neutral theory of evolution. In particular, we examine the coalescent structure in the presence of recombination and calculate quantities such as the distribution of the coalescent times along the genome, the distribution of haplotype block sizes and the correlation between ancestors of different loci along the genome. In addition, we characterize the probability and time of fixation of mutations as a function of their fitness effect.

  10. Inverse statistical physics of protein sequences: a key issues review.

    Science.gov (United States)

    Cocco, Simona; Feinauer, Christoph; Figliuzzi, Matteo; Monasson, Rémi; Weigt, Martin

    2018-03-01

    In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview over some biologically important questions, and how statistical-mechanics inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.

  11. HIV protein sequence hotspots for crosstalk with host hub proteins.

    Directory of Open Access Journals (Sweden)

    Mahdi Sarmady

    Full Text Available HIV proteins target host hub proteins for transient binding interactions. The presence of viral proteins in the infected cell results in out-competition of host proteins in their interaction with hub proteins, drastically affecting cell physiology. Functional genomics and interactome datasets can be used to quantify the sequence hotspots on the HIV proteome mediating interactions with host hub proteins. In this study, we used the HIV and human interactome databases to identify HIV targeted host hub proteins and their host binding partners (H2. We developed a high throughput computational procedure utilizing motif discovery algorithms on sets of protein sequences, including sequences of HIV and H2 proteins. We identified as HIV sequence hotspots those linear motifs that are highly conserved on HIV sequences and at the same time have a statistically enriched presence on the sequences of H2 proteins. The HIV protein motifs discovered in this study are expressed by subsets of H2 host proteins potentially outcompeted by HIV proteins. A large subset of these motifs is involved in cleavage, nuclear localization, phosphorylation, and transcription factor binding events. Many such motifs are clustered on an HIV sequence in the form of hotspots. The sequential positions of these hotspots are consistent with the curated literature on phenotype altering residue mutations, as well as with existing binding site data. The hotspot map produced in this study is the first global portrayal of HIV motifs involved in altering the host protein network at highly connected hub nodes.

  12. Protein 3D structure computed from evolutionary sequence variation.

    Directory of Open Access Journals (Sweden)

    Debora S Marks

    Full Text Available The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing.In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy.We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7-4.8 Å C(α-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org. This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of

  13. Deep sequencing methods for protein engineering and design.

    Science.gov (United States)

    Wrenbeck, Emily E; Faber, Matthew S; Whitehead, Timothy A

    2017-08-01

    The advent of next-generation sequencing (NGS) has revolutionized protein science, and the development of complementary methods enabling NGS-driven protein engineering have followed. In general, these experiments address the functional consequences of thousands of protein variants in a massively parallel manner using genotype-phenotype linked high-throughput functional screens followed by DNA counting via deep sequencing. We highlight the use of information rich datasets to engineer protein molecular recognition. Examples include the creation of multiple dual-affinity Fabs targeting structurally dissimilar epitopes and engineering of a broad germline-targeted anti-HIV-1 immunogen. Additionally, we highlight the generation of enzyme fitness landscapes for conducting fundamental studies of protein behavior and evolution. We conclude with discussion of technological advances. Copyright © 2016 Elsevier Ltd. All rights reserved.

  14. Interplay between chaperones and protein disorder promotes the evolution of protein networks.

    Directory of Open Access Journals (Sweden)

    Sebastian Pechmann

    2014-06-01

    Full Text Available Evolution is driven by mutations, which lead to new protein functions but come at a cost to protein stability. Non-conservative substitutions are of interest in this regard because they may most profoundly affect both function and stability. Accordingly, organisms must balance the benefit of accepting advantageous substitutions with the possible cost of deleterious effects on protein folding and stability. We here examine factors that systematically promote non-conservative mutations at the proteome level. Intrinsically disordered regions in proteins play pivotal roles in protein interactions, but many questions regarding their evolution remain unanswered. Similarly, whether and how molecular chaperones, which have been shown to buffer destabilizing mutations in individual proteins, generally provide robustness during proteome evolution remains unclear. To this end, we introduce an evolutionary parameter λ that directly estimates the rate of non-conservative substitutions. Our analysis of λ in Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens sequences reveals how co- and post-translationally acting chaperones differentially promote non-conservative substitutions in their substrates, likely through buffering of their destabilizing effects. We further find that λ serves well to quantify the evolution of intrinsically disordered proteins even though the unstructured, thus generally variable regions in proteins are often flanked by very conserved sequences. Crucially, we show that both intrinsically disordered proteins and highly re-wired proteins in protein interaction networks, which have evolved new interactions and functions, exhibit a higher λ at the expense of enhanced chaperone assistance. Our findings thus highlight an intricate interplay of molecular chaperones and protein disorder in the evolvability of protein networks. Our results illuminate the role of chaperones in enabling protein evolution, and underline the

  15. Sequence motifs in MADS transcription factors responsible for specificity and diversification of protein-protein interaction.

    Directory of Open Access Journals (Sweden)

    Aalt D J van Dijk

    Full Text Available Protein sequences encompass tertiary structures and contain information about specific molecular interactions, which in turn determine biological functions of proteins. Knowledge about how protein sequences define interaction specificity is largely missing, in particular for paralogous protein families with high sequence similarity, such as the plant MADS domain transcription factor family. In comparison to the situation in mammalian species, this important family of transcription regulators has expanded enormously in plant species and contains over 100 members in the model plant species Arabidopsis thaliana. Here, we provide insight into the mechanisms that determine protein-protein interaction specificity for the Arabidopsis MADS domain transcription factor family, using an integrated computational and experimental approach. Plant MADS proteins have highly similar amino acid sequences, but their dimerization patterns vary substantially. Our computational analysis uncovered small sequence regions that explain observed differences in dimerization patterns with reasonable accuracy. Furthermore, we show the usefulness of the method for prediction of MADS domain transcription factor interaction networks in other plant species. Introduction of mutations in the predicted interaction motifs demonstrated that single amino acid mutations can have a large effect and lead to loss or gain of specific interactions. In addition, various performed bioinformatics analyses shed light on the way evolution has shaped MADS domain transcription factor interaction specificity. Identified protein-protein interaction motifs appeared to be strongly conserved among orthologs, indicating their evolutionary importance. We also provide evidence that mutations in these motifs can be a source for sub- or neo-functionalization. The analyses presented here take us a step forward in understanding protein-protein interactions and the interplay between protein sequences and

  16. Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution

    DEFF Research Database (Denmark)

    Richards, Stephen; Liu, Yue; Bettencourt, Brian R.

    2005-01-01

    years (Myr) since the pseudoobscura/melanogaster divergence. Genes expressed in the testes had higher amino acid sequence divergence than the genome-wide average, consistent with the rapid evolution of sex-specific proteins. Cis-regulatory sequences are more conserved than random and nearby sequences......We have sequenced the genome of a second Drosophila species, Drosophila pseudoobscura, and compared this to the genome sequence of Drosophila melanogaster, a primary model organism. Throughout evolution the vast majority of Drosophila genes have remained on the same chromosome arm, but within each...... between the species-but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible. Overall, a pattern of repeat-mediated chromosomal rearrangement, and high coadaptation of both male genes and cis-regulatory sequences emerges as important themes of genome divergence...

  17. Evolution of vertebrate interferon inducible transmembrane proteins

    Directory of Open Access Journals (Sweden)

    Hickford Danielle

    2012-04-01

    Full Text Available Abstract Background Interferon inducible transmembrane proteins (IFITMs have diverse roles, including the control of cell proliferation, promotion of homotypic cell adhesion, protection against viral infection, promotion of bone matrix maturation and mineralisation, and mediating germ cell development. Most IFITMs have been well characterised in human and mouse but little published data exists for other animals. This study characterised IFITMs in two distantly related marsupial species, the Australian tammar wallaby and the South American grey short-tailed opossum, and analysed the phylogeny of the IFITM family in vertebrates. Results Five IFITM paralogues were identified in both the tammar and opossum. As in eutherians, most marsupial IFITM genes exist within a cluster, contain two exons and encode proteins with two transmembrane domains. Only two IFITM genes, IFITM5 and IFITM10, have orthologues in both marsupials and eutherians. IFITM5 arose in bony fish and IFITM10 in tetrapods. The bone-specific expression of IFITM5 appears to be restricted to therian mammals, suggesting that its specialised role in bone production is a recent adaptation specific to mammals. IFITM10 is the most highly conserved IFITM, sharing at least 85% amino acid identity between birds, reptiles and mammals and suggesting an important role for this presently uncharacterised protein. Conclusions Like eutherians, marsupials also have multiple IFITM genes that exist in a gene cluster. The differing expression patterns for many of the paralogues, together with poor sequence conservation between species, suggests that IFITM genes have acquired many different roles during vertebrate evolution.

  18. Repeat Sequence Proteins as Matrices for Nanocomposites

    Energy Technology Data Exchange (ETDEWEB)

    Drummy, L.; Koerner, H; Phillips, D; McAuliffe, J; Kumar, M; Farmer, B; Vaia, R; Naik, R

    2009-01-01

    Recombinant protein-inorganic nanocomposites comprised of exfoliated Na+ montmorillonite (MMT) in a recombinant protein matrix based on silk-like and elastin-like amino acid motifs (silk elastin-like protein (SELP)) were formed via a solution blending process. Charged residues along the protein backbone are shown to dominate long-range interactions, whereas the SELP repeat sequence leads to local protein/MMT compatibility. Up to a 50% increase in room temperature modulus and a comparable decrease in high temperature coefficient of thermal expansion occur for cast films containing 2-10 wt.% MMT.

  19. Molecular evolution, intracellular organization, and the quinary structure of proteins.

    OpenAIRE

    McConkey, E H

    1982-01-01

    High-resolution two-dimensional polyacrylamide gel electrophoresis shows that at least half of 370 denatured polypeptides from hamster cells and human cells are indistinguishable in terms of isoelectric points and molecular weights. Molecular evolution may have been more conservative for this set of proteins than sequence studies on soluble proteins have implied. This may be a consequence of complexities of intracellular organization and the numerous macromolecular interactions in which most ...

  20. Diversity and evolution of ABC proteins in basidiomycetes.

    Science.gov (United States)

    Kovalchuk, Andriy; Lee, Yong-Hwan; Asiegbu, Fred O

    2013-01-01

    ABC proteins constitute one of the largest families of proteins. They are implicated in wide variety of cellular processes ranging from ribosome biogenesis to multidrug resistance. With the advance of fungal genomics, the number of known fungal ABC proteins increases rapidly but the information on their biological functions remains scarce. In this work we extended the previous analysis of fungal ABC proteins to include recently sequenced species of basidiomycetes. We performed an identification and initial cataloging of ABC proteins from 23 fungal species representing 10 orders within class Agaricomycotina. We identified more than 1000 genes coding for ABC proteins. Comparison of sets of ABC proteins present in basidiomycetes and ascomycetes revealed the existence of two groups of ABC proteins specific for basidiomycetes. Results of survey should contribute to the better understanding of evolution of ABC proteins in fungi and support further experimental work on their characterization.

  1. Divergence, recombination and retention of functionality during protein evolution

    Directory of Open Access Journals (Sweden)

    Xu Yanlong O

    2005-09-01

    Full Text Available Abstract We have only a vague idea of precisely how protein sequences evolve in the context of protein structure and function. This is primarily because structural and functional contexts are not easily predictable from the primary sequence, and evaluating patterns of evolution at individual residue positions is also difficult. As a result of increasing biodiversity in genomics studies, progress is being made in detecting context-dependent variation in substitution processes, but it remains unclear exactly what context-dependent patterns we should be looking for. To address this, we have been simulating protein evolution in the context of structure and function using lattice models of proteins and ligands (or substrates. These simulations include thermodynamic features of protein stability and population dynamics. We refer to this approach as 'ab initio evolution' to emphasise the fact that the equilibrium details of fitness distributions arise from the physical principles of the system and not from any preconceived notions or arbitrary mathematical distributions. Here, we present results on the retention of functionality in homologous recombinants following population divergence. A central result is that protein structure characteristics can strongly influence recombinant functionality. Exceptional structures with many sequence options evolve quickly and tend to retain functionality -- even in highly diverged recombinants. By contrast, the more common structures with fewer sequence options evolve more slowly, but the fitness of recombinants drops off rapidly as homologous proteins diverge. These results have implications for understanding viral evolution, speciation and directed evolutionary experiments. Our analysis of the divergence process can also guide improved methods for accurately approximating folding probabilities in more complex but realistic systems.

  2. Exploring the correlations between sequence evolution rate and ...

    Indian Academy of Sciences (India)

    2012-10-15

    Oct 15, 2012 ... The vast functional divergence within mammalian lineages that ... Keywords. Phylogenetics; molecular clock; sequence evolutionary rate; phenotypic evolution; morphology; genomics .... entire lineages during periods with ecosystem-level commu- ... increases from fish to amphibians to birds to mammals.

  3. The evolution of coronal activity in main sequence cool stars

    International Nuclear Information System (INIS)

    Stern, R.A.

    1984-01-01

    Stars spend most of their lifetime and show the least amount of nuclear evolution on the main sequence. However, the x-ray luminosities of cool star coronas change by orders of magnitude as a function of main sequence age. Such coronal evolution is discussed in relation to our knowledge of the solar corona, solar and stellar flares, stellar rotation and binarity. The relevance of X-ray observations to current speculations on stellar dynamos is also considered

  4. The Origin and Early Evolution of Membrane Proteins

    Science.gov (United States)

    Pohorille, Andrew; Schweighofter, Karl; Wilson, Michael A.

    2006-01-01

    The origin and early evolution of membrane proteins, and in particular ion channels, are considered from the point of view that the transmembrane segments of membrane proteins are structurally quite simple and do not require specific sequences to fold. We argue that the transport of solute species, especially ions, required an early evolution of efficient transport mechanisms, and that the emergence of simple ion channels was protobiologically plausible. We also argue that, despite their simple structure, such channels could possess properties that, at the first sight, appear to require markedly larger complexity. These properties can be subtly modulated by local modifications to the sequence rather than global changes in molecular architecture. In order to address the evolution and development of ion channels, we focus on identifying those protein domains that are commonly associated with ion channel proteins and are conserved throughout the three main domains of life (Eukarya, Prokarya, and Archaea). We discuss the potassium-sodium-calcium superfamily of voltage-gated ion channels, mechanosensitive channels, porins, and ABC-transporters and argue that these families of membrane channels have sufficiently universal architectures that they can readily adapt to the diverse functional demands arising during evolution.

  5. Mitochondrial DNA sequence evolution in shorebird populations

    NARCIS (Netherlands)

    Wenink, P.W.

    1994-01-01

    This thesis describes the global molecular population structure of two shorebird species, in particular of the dunlin, Calidris alpina, by means of comparative sequence analysis of the most variable part of the mitochondrial DNA (mtDNA) genome. There are several reasons

  6. Post-main-sequence planetary system evolution

    Science.gov (United States)

    Veras, Dimitri

    2016-01-01

    The fates of planetary systems provide unassailable insights into their formation and represent rich cross-disciplinary dynamical laboratories. Mounting observations of post-main-sequence planetary systems necessitate a complementary level of theoretical scrutiny. Here, I review the diverse dynamical processes which affect planets, asteroids, comets and pebbles as their parent stars evolve into giant branch, white dwarf and neutron stars. This reference provides a foundation for the interpretation and modelling of currently known systems and upcoming discoveries. PMID:26998326

  7. Evolution of developmental sequences in lepidosaurs

    Directory of Open Access Journals (Sweden)

    Tomasz Skawiński

    2017-04-01

    Full Text Available Background Lepidosaurs, a group including rhynchocephalians and squamates, are one of the major clades of extant vertebrates. Although there has been extensive phylogenetic work on this clade, its interrelationships are a matter of debate. Morphological and molecular data suggest very different relationships within squamates. Despite this, relatively few studies have assessed the utility of other types of data for inferring squamate phylogeny. Methods We used developmental sequences of 20 events in 29 species of lepidosaurs. These sequences were analysed using event-pairing and continuous analysis. They were transformed into cladistic characters and analysed in TNT. Ancestral state reconstructions were performed on two main phylogenetic hypotheses of squamates (morphological and molecular. Results Cladistic analyses conducted using characters generated by these methods do not resemble any previously published phylogeny. Ancestral state reconstructions are equally consistent with both morphological and molecular hypotheses of squamate phylogeny. Only several inferred heterochronic events are common to all methods and phylogenies. Discussion Results of the cladistic analyses, and the fact that reconstructions of heterochronic events show more similarities between certain methods rather than phylogenetic hypotheses, suggest that phylogenetic signal is at best weak in the studied developmental events. Possibly the developmental sequences analysed here evolve too quickly to recover deep divergences within Squamata.

  8. Differential evolution-simulated annealing for multiple sequence alignment

    Science.gov (United States)

    Addawe, R. C.; Addawe, J. M.; Sueño, M. R. K.; Magadia, J. C.

    2017-10-01

    Multiple sequence alignments (MSA) are used in the analysis of molecular evolution and sequence structure relationships. In this paper, a hybrid algorithm, Differential Evolution - Simulated Annealing (DESA) is applied in optimizing multiple sequence alignments (MSAs) based on structural information, non-gaps percentage and totally conserved columns. DESA is a robust algorithm characterized by self-organization, mutation, crossover, and SA-like selection scheme of the strategy parameters. Here, the MSA problem is treated as a multi-objective optimization problem of the hybrid evolutionary algorithm, DESA. Thus, we name the algorithm as DESA-MSA. Simulated sequences and alignments were generated to evaluate the accuracy and efficiency of DESA-MSA using different indel sizes, sequence lengths, deletion rates and insertion rates. The proposed hybrid algorithm obtained acceptable solutions particularly for the MSA problem evaluated based on the three objectives.

  9. Nonlinear deterministic structures and the randomness of protein sequences

    CERN Document Server

    Huang Yan Zhao

    2003-01-01

    To clarify the randomness of protein sequences, we make a detailed analysis of a set of typical protein sequences representing each structural classes by using nonlinear prediction method. No deterministic structures are found in these protein sequences and this implies that they behave as random sequences. We also give an explanation to the controversial results obtained in previous investigations.

  10. Properties of Sequence Conservation in Upstream Regulatory and Protein Coding Sequences among Paralogs in Arabidopsis thaliana

    Science.gov (United States)

    Richardson, Dale N.; Wiehe, Thomas

    Whole genome duplication (WGD) has catalyzed the formation of new species, genes with novel functions, altered expression patterns, complexified signaling pathways and has provided organisms a level of genetic robustness. We studied the long-term evolution and interrelationships of 5’ upstream regulatory sequences (URSs), protein coding sequences (CDSs) and expression correlations (EC) of duplicated gene pairs in Arabidopsis. Three distinct methods revealed significant evolutionary conservation between paralogous URSs and were highly correlated with microarray-based expression correlation of the respective gene pairs. Positional information on exact matches between sequences unveiled the contribution of micro-chromosomal rearrangements on expression divergence. A three-way rank analysis of URS similarity, CDS divergence and EC uncovered specific gene functional biases. Transcription factor activity was associated with gene pairs exhibiting conserved URSs and divergent CDSs, whereas a broad array of metabolic enzymes was found to be associated with gene pairs showing diverged URSs but conserved CDSs.

  11. Mitochondrial DNA sequence evolution in the Arctoidea.

    OpenAIRE

    Zhang, Y P; Ryder, O A

    1993-01-01

    Some taxa in the superfamily Arctoidea, such as the giant panda and the lesser panda, have presented puzzles to taxonomists. In the present study, approximately 397 bases of the cytochrome b gene, 364 bases of the 12S rRNA gene, and 74 bases of the tRNA(Thr) and tRNA(Pro) genes from the giant panda, lesser panda, kinkajou, raccoon, coatimundi, and all species of the Ursidae were sequenced. The high transition/transversion ratios in cytochrome b and RNA genes prior to saturation suggest that t...

  12. Conservation and diversification of Msx protein in metazoan evolution.

    Science.gov (United States)

    Takahashi, Hirokazu; Kamiya, Akiko; Ishiguro, Akira; Suzuki, Atsushi C; Saitou, Naruya; Toyoda, Atsushi; Aruga, Jun

    2008-01-01

    Msx (/msh) family genes encode homeodomain (HD) proteins that control ontogeny in many animal species. We compared the structures of Msx genes from a wide range of Metazoa (Porifera, Cnidaria, Nematoda, Arthropoda, Tardigrada, Platyhelminthes, Mollusca, Brachiopoda, Annelida, Echiura, Echinodermata, Hemichordata, and Chordata) to gain an understanding of the role of these genes in phylogeny. Exon-intron boundary analysis suggested that the position of the intron located N-terminally to the HDs was widely conserved in all the genes examined, including those of cnidarians. Amino acid (aa) sequence comparison revealed 3 new evolutionarily conserved domains, as well as very strong conservation of the HDs. Two of the three domains were associated with Groucho-like protein binding in both a vertebrate and a cnidarian Msx homolog, suggesting that the interaction between Groucho-like proteins and Msx proteins was established in eumetazoan ancestors. Pairwise comparison among the collected HDs and their C-flanking aa sequences revealed that the degree of sequence conservation varied depending on the animal taxa from which the sequences were derived. Highly conserved Msx genes were identified in the Vertebrata, Cephalochordata, Hemichordata, Echinodermata, Mollusca, Brachiopoda, and Anthozoa. The wide distribution of the conserved sequences in the animal phylogenetic tree suggested that metazoan ancestors had already acquired a set of conserved domains of the current Msx family genes. Interestingly, although strongly conserved sequences were recovered from the Vertebrata, Cephalochordata, and Anthozoa, the sequences from the Urochordata and Hydrozoa showed weak conservation. Because the Vertebrata-Cephalochordata-Urochordata and Anthozoa-Hydrozoa represent sister groups in the Chordata and Cnidaria, respectively, Msx sequence diversification may have occurred differentially in the course of evolution. We speculate that selective loss of the conserved domains in Msx family

  13. Intermediate filament protein evolution and protists.

    Science.gov (United States)

    Preisner, Harald; Habicht, Jörn; Garg, Sriram G; Gould, Sven B

    2018-03-23

    Metazoans evolved from a single protist lineage. While all eukaryotes share a conserved actin and tubulin-based cytoskeleton, it is commonly perceived that intermediate filaments (IFs), including lamin, vimentin or keratin among many others, are restricted to metazoans. Actin and tubulin proteins are conserved enough to be detectable across all eukaryotic genomes using standard phylogenetic methods, but IF proteins, in contrast, are notoriously difficult to identify by such means. Since the 1950s, dozens of cytoskeletal proteins in protists have been identified that seemingly do not belong to any of the IF families described for metazoans, yet, from a structural and functional perspective fit criteria that define metazoan IF proteins. Here, we briefly review IF protein discovery in metazoans and the implications this had for the definition of this protein family. We argue that the many cytoskeletal and filament-forming proteins of protists should be incorporated into a more comprehensive picture of IF evolution by aligning it with the recent identification of lamins across the phylogenetic diversity of eukaryotic supergroups. This then brings forth the question of how the diversity of IF proteins has unfolded. The evolution of IF proteins likely represents an example of convergent evolution, which, in combination with the speed with which these cytoskeletal proteins are evolving, generated their current diversity. IF proteins did not first emerge in metazoa, but in protists. Only the emergence of cytosolic IF proteins that appear to stem from a nuclear lamin is unique to animals and coincided with the emergence of true animal multicellularity. © 2018 Wiley Periodicals, Inc.

  14. MODELING THE RED SEQUENCE: HIERARCHICAL GROWTH YET SLOW LUMINOSITY EVOLUTION

    International Nuclear Information System (INIS)

    Skelton, Rosalind E.; Bell, Eric F.; Somerville, Rachel S.

    2012-01-01

    We explore the effects of mergers on the evolution of massive early-type galaxies by modeling the evolution of their stellar populations in a hierarchical context. We investigate how a realistic red sequence population set up by z ∼ 1 evolves under different assumptions for the merger and star formation histories, comparing changes in color, luminosity, and mass. The purely passive fading of existing red sequence galaxies, with no further mergers or star formation, results in dramatic changes at the bright end of the luminosity function and color-magnitude relation. Without mergers there is too much evolution in luminosity at a fixed space density compared to observations. The change in color and magnitude at a fixed mass resembles that of a passively evolving population that formed relatively recently, at z ∼ 2. Mergers among the red sequence population ('dry mergers') occurring after z = 1 build up mass, counteracting the fading of the existing stellar populations to give smaller changes in both color and luminosity for massive galaxies. By allowing some galaxies to migrate from the blue cloud onto the red sequence after z = 1 through gas-rich mergers, younger stellar populations are added to the red sequence. This manifestation of the progenitor bias increases the scatter in age and results in even smaller changes in color and luminosity between z = 1 and z = 0 at a fixed mass. The resultant evolution appears much slower, resembling the passive evolution of a population that formed at high redshift (z ∼ 3-5), and is in closer agreement with observations. We conclude that measurements of the luminosity and color evolution alone are not sufficient to distinguish between the purely passive evolution of an old population and cosmologically motivated hierarchical growth, although these scenarios have very different implications for the mass growth of early-type galaxies over the last half of cosmic history.

  15. Protein sequences from mastodon and Tyrannosaurus rex revealed by mass spectrometry.

    Science.gov (United States)

    Asara, John M; Schweitzer, Mary H; Freimark, Lisa M; Phillips, Matthew; Cantley, Lewis C

    2007-04-13

    Fossilized bones from extinct taxa harbor the potential for obtaining protein or DNA sequences that could reveal evolutionary links to extant species. We used mass spectrometry to obtain protein sequences from bones of a 160,000- to 600,000-year-old extinct mastodon (Mammut americanum) and a 68-million-year-old dinosaur (Tyrannosaurus rex). The presence of T. rex sequences indicates that their peptide bonds were remarkably stable. Mass spectrometry can thus be used to determine unique sequences from ancient organisms from peptide fragmentation patterns, a valuable tool to study the evolution and adaptation of ancient taxa from which genomic sequences are unlikely to be obtained.

  16. Distribution and Evolution of Yersinia Leucine-Rich Repeat Proteins

    Science.gov (United States)

    Hu, Yueming; Huang, He; Hui, Xinjie; Cheng, Xi; White, Aaron P.

    2016-01-01

    Leucine-rich repeat (LRR) proteins are widely distributed in bacteria, playing important roles in various protein-protein interaction processes. In Yersinia, the well-characterized type III secreted effector YopM also belongs to the LRR protein family and is encoded by virulence plasmids. However, little has been known about other LRR members encoded by Yersinia genomes or their evolution. In this study, the Yersinia LRR proteins were comprehensively screened, categorized, and compared. The LRR proteins encoded by chromosomes (LRR1 proteins) appeared to be more similar to each other and different from those encoded by plasmids (LRR2 proteins) with regard to repeat-unit length, amino acid composition profile, and gene expression regulation circuits. LRR1 proteins were also different from LRR2 proteins in that the LRR1 proteins contained an E3 ligase domain (NEL domain) in the C-terminal region or an NEL domain-encoding nucleotide relic in flanking genomic sequences. The LRR1 protein-encoding genes (LRR1 genes) varied dramatically and were categorized into 4 subgroups (a to d), with the LRR1a to -c genes evolving from the same ancestor and LRR1d genes evolving from another ancestor. The consensus and ancestor repeat-unit sequences were inferred for different LRR1 protein subgroups by use of a maximum parsimony modeling strategy. Structural modeling disclosed very similar repeat-unit structures between LRR1 and LRR2 proteins despite the different unit lengths and amino acid compositions. Structural constraints may serve as the driving force to explain the observed mutations in the LRR regions. This study suggests that there may be functional variation and lays the foundation for future experiments investigating the functions of the chromosomally encoded LRR proteins of Yersinia. PMID:27217422

  17. How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis.

    Science.gov (United States)

    Tian, Pengfei; Best, Robert B

    2017-10-17

    Quantifying the relationship between protein sequence and structure is key to understanding the protein universe. A fundamental measure of this relationship is the total number of amino acid sequences that can fold to a target protein structure, known as the "sequence capacity," which has been suggested as a proxy for how designable a given protein fold is. Although sequence capacity has been extensively studied using lattice models and theory, numerical estimates for real protein structures are currently lacking. In this work, we have quantitatively estimated the sequence capacity of 10 proteins with a variety of different structures using a statistical model based on residue-residue co-evolution to capture the variation of sequences from the same protein family. Remarkably, we find that even for the smallest protein folds, such as the WW domain, the number of foldable sequences is extremely large, exceeding the Avogadro constant. In agreement with earlier theoretical work, the calculated sequence capacity is positively correlated with the size of the protein, or better, the density of contacts. This allows the absolute sequence capacity of a given protein to be approximately predicted from its structure. On the other hand, the relative sequence capacity, i.e., normalized by the total number of possible sequences, is an extremely tiny number and is strongly anti-correlated with the protein length. Thus, although there may be more foldable sequences for larger proteins, it will be much harder to find them. Lastly, we have correlated the evolutionary age of proteins in the CATH database with their sequence capacity as predicted by our model. The results suggest a trade-off between the opposing requirements of high designability and the likelihood of a novel fold emerging by chance. Published by Elsevier Inc.

  18. Evolution of sequence-defined highly functionalized nucleic acid polymers

    Science.gov (United States)

    Chen, Zhen; Lichtor, Phillip A.; Berliner, Adrian P.; Chen, Jonathan C.; Liu, David R.

    2018-03-01

    The evolution of sequence-defined synthetic polymers made of building blocks beyond those compatible with polymerase enzymes or the ribosome has the potential to generate new classes of receptors, catalysts and materials. Here we describe a ligase-mediated DNA-templated polymerization and in vitro selection system to evolve highly functionalized nucleic acid polymers (HFNAPs) made from 32 building blocks that contain eight chemically diverse side chains on a DNA backbone. Through iterated cycles of polymer translation, selection and reverse translation, we discovered HFNAPs that bind proprotein convertase subtilisin/kexin type 9 (PCSK9) and interleukin-6, two protein targets implicated in human diseases. Mutation and reselection of an active PCSK9-binding polymer yielded evolved polymers with high affinity (KD = 3 nM). This evolved polymer potently inhibited the binding between PCSK9 and the low-density lipoprotein receptor. Structure-activity relationship studies revealed that specific side chains at defined positions in the polymers are required for binding to their respective targets. Our findings expand the chemical space of evolvable polymers to include densely functionalized nucleic acids with diverse, researcher-defined chemical repertoires.

  19. Molecular evolution of cyclin proteins in animals and fungi

    Directory of Open Access Journals (Sweden)

    Afonnikov Dmitry A

    2011-07-01

    Full Text Available Abstract Background The passage through the cell cycle is controlled by complexes of cyclins, the regulatory units, with cyclin-dependent kinases, the catalytic units. It is also known that cyclins form several families, which differ considerably in primary structure from one eukaryotic organism to another. Despite these lines of evidence, the relationship between the evolution of cyclins and their function is an open issue. Here we present the results of our study on the molecular evolution of A-, B-, D-, E-type cyclin proteins in animals and fungi. Results We constructed phylogenetic trees for these proteins, their ancestral sequences and analyzed patterns of amino acid replacements. The analysis of infrequently fixed atypical amino acid replacements in cyclins evidenced that accelerated evolution proceeded predominantly during paralog duplication or after it in animals and fungi and that it was related to aromorphic changes in animals. It was shown also that evolutionary flexibility of cyclin function may be provided by consequential reorganization of regions on protein surface remote from CDK binding sites in animal and fungal cyclins and by functional differentiation of paralogous cyclins formed in animal evolution. Conclusions The results suggested that changes in the number and/or nature of cyclin-binding proteins may underlie the evolutionary role of the alterations in the molecular structure of cyclins and their involvement in diverse molecular-genetic events.

  20. HIV-specific probabilistic models of protein evolution.

    Directory of Open Access Journals (Sweden)

    David C Nickle

    2007-06-01

    Full Text Available Comparative sequence analyses, including such fundamental bioinformatics techniques as similarity searching, sequence alignment and phylogenetic inference, have become a mainstay for researchers studying type 1 Human Immunodeficiency Virus (HIV-1 genome structure and evolution. Implicit in comparative analyses is an underlying model of evolution, and the chosen model can significantly affect the results. In general, evolutionary models describe the probabilities of replacing one amino acid character with another over a period of time. Most widely used evolutionary models for protein sequences have been derived from curated alignments of hundreds of proteins, usually based on mammalian genomes. It is unclear to what extent these empirical models are generalizable to a very different organism, such as HIV-1-the most extensively sequenced organism in existence. We developed a maximum likelihood model fitting procedure to a collection of HIV-1 alignments sampled from different viral genes, and inferred two empirical substitution models, suitable for describing between-and within-host evolution. Our procedure pools the information from multiple sequence alignments, and provided software implementation can be run efficiently in parallel on a computer cluster. We describe how the inferred substitution models can be used to generate scoring matrices suitable for alignment and similarity searches. Our models had a consistently superior fit relative to the best existing models and to parameter-rich data-driven models when benchmarked on independent HIV-1 alignments, demonstrating evolutionary biases in amino-acid substitution that are unique to HIV, and that are not captured by the existing models. The scoring matrices derived from the models showed a marked difference from common amino-acid scoring matrices. The use of an appropriate evolutionary model recovered a known viral transmission history, whereas a poorly chosen model introduced phylogenetic

  1. Origin and evolution of chromosomal sperm proteins.

    Science.gov (United States)

    Eirín-López, José M; Ausió, Juan

    2009-10-01

    In the eukaryotic cell, DNA compaction is achieved through its interaction with histones, constituting a nucleoprotein complex called chromatin. During metazoan evolution, the different structural and functional constraints imposed on the somatic and germinal cell lines led to a unique process of specialization of the sperm nuclear basic proteins (SNBPs) associated with chromatin in male germ cells. SNBPs encompass a heterogeneous group of proteins which, since their discovery in the nineteenth century, have been studied extensively in different organisms. However, the origin and controversial mechanisms driving the evolution of this group of proteins has only recently started to be understood. Here, we analyze in detail the histone hypothesis for the vertical parallel evolution of SNBPs, involving a "vertical" transition from a histone to a protamine-like and finally protamine types (H --> PL --> P), the last one of which is present in the sperm of organisms at the uppermost tips of the phylogenetic tree. In particular, the common ancestry shared by the protamine-like (PL)- and protamine (P)-types with histone H1 is discussed within the context of the diverse structural and functional constraints acting upon these proteins during bilaterian evolution.

  2. Three aspects of stellar evolution near the main sequence

    International Nuclear Information System (INIS)

    Morgan, J.C.

    1979-05-01

    Three problems of stellar evolution are considered: the gap in the HR diagram of M67, the evolutionary status of RS CVn binaries and the solar neutrino problem. The physical basis of the Eggleton stellar evolution computer program is described. The program was used to calculate a grid of evolutionary tracks for models with masses between 0.7 and 1.29 solar masses. The more massive stars considered here have expanding convective cores during their main sequence evolution. The isochrone of the old galactic cluster M67 has a gap at the top of its main sequence because of the rapid evolution of stars at hydrogen exhaustion. RS CVn binaries present a complex collection of observational phenomena although they appear to be detached binaries. Their evolutionary status has remained controversial because of their high space density. Here it is shown that a post main sequence interpretation is satisfactory. Models of the Sun with metal poor interiors have been proposed in an attempt to resolve the solar neutrino problem. Here the evolution of two such models is calculated in detail, including a gradual contamination of the surface convection zone to produce the observed metal abundance, giving fully consistent models of the Sun as it is observed. (author)

  3. Domain organizations of modular extracellular matrix proteins and their evolution.

    Science.gov (United States)

    Engel, J

    1996-11-01

    Multidomain proteins which are composed of modular units are a rather recent invention of evolution. Domains are defined as autonomously folding regions of a protein, and many of them are similar in sequence and structure, indicating common ancestry. Their modular nature is emphasized by frequent repetitions in identical or in different proteins and by a large number of different combinations with other domains. The extracellular matrix is perhaps the largest biological system composed of modular mosaic proteins, and its astonishing complexity and diversity are based on them. A cluster of minireviews on modular proteins is being published in Matrix Biology. These deal with the evolution of modular proteins, the three-dimensional structure of domains and the ways in which these interact in a multidomain protein. They discuss structure-function relationships in calcium binding domains, collagen helices, alpha-helical coiled-coil domains and C-lectins. The present minireview is focused on some general aspects and serves as an introduction to the cluster.

  4. Anomalous diffusion in neutral evolution of model proteins

    Science.gov (United States)

    Nelson, Erik D.; Grishin, Nick V.

    2015-06-01

    Protein evolution is frequently explored using minimalist polymer models, however, little attention has been given to the problem of structural drift, or diffusion. Here, we study neutral evolution of small protein motifs using an off-lattice heteropolymer model in which individual monomers interact as low-resolution amino acids. In contrast to most earlier models, both the length and folded structure of the polymers are permitted to change. To describe structural change, we compute the mean-square distance (MSD) between monomers in homologous folds separated by n neutral mutations. We find that structural change is episodic, and, averaged over lineages (for example, those extending from a single sequence), exhibits a power-law dependence on n . We show that this exponent depends on the alignment method used, and we analyze the distribution of waiting times between neutral mutations. The latter are more disperse than for models required to maintain a specific fold, but exhibit a similar power-law tail.

  5. Practical aspects of protein co-evolution.

    Science.gov (United States)

    Ochoa, David; Pazos, Florencio

    2014-01-01

    Co-evolution is a fundamental aspect of Evolutionary Theory. At the molecular level, co-evolutionary linkages between protein families have been used as indicators of protein interactions and functional relationships from long ago. Due to the complexity of the problem and the amount of genomic data required for these approaches to achieve good performances, it took a relatively long time from the appearance of the first ideas and concepts to the quotidian application of these approaches and their incorporation to the standard toolboxes of bioinformaticians and molecular biologists. Today, these methodologies are mature (both in terms of performance and usability/implementation), and the genomic information that feeds them large enough to allow their general application. This review tries to summarize the current landscape of co-evolution-based methodologies, with a strong emphasis on describing interesting cases where their application to important biological systems, alone or in combination with other computational and experimental approaches, allowed getting new insight into these.

  6. The SWISS-PROT protein sequence data bank: current status.

    OpenAIRE

    Bairoch, A; Boeckmann, B

    1994-01-01

    SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1988, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library. The SWISS-PROT protein sequence data bank consist of sequence entries. Sequence entries are composed of different lines types, each with their own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Databa...

  7. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.

    Science.gov (United States)

    2004-12-09

    We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome--composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes--provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.

  8. Molecular clock in neutral protein evolution

    Directory of Open Access Journals (Sweden)

    Wilke Claus O

    2004-08-01

    Full Text Available Abstract Background A frequent observation in molecular evolution is that amino-acid substitution rates show an index of dispersion (that is, ratio of variance to mean substantially larger than one. This observation has been termed the overdispersed molecular clock. On the basis of in silico protein-evolution experiments, Bastolla and coworkers recently proposed an explanation for this observation: Proteins drift in neutral space, and can temporarily get trapped in regions of substantially reduced neutrality. In these regions, substitution rates are suppressed, which results in an overall substitution process that is not Poissonian. However, the simulation method of Bastolla et al. is representative only for cases in which the product of mutation rate μ and population size Ne is small. How the substitution process behaves when μNe is large is not known. Results Here, I study the behavior of the molecular clock in in silico protein evolution as a function of mutation rate and population size. I find that the index of dispersion decays with increasing μNe, and approaches 1 for large μNe . This observation can be explained with the selective pressure for mutational robustness, which is effective when μNe is large. This pressure keeps the population out of low-neutrality traps, and thus steadies the ticking of the molecular clock. Conclusions The molecular clock in neutral protein evolution can fall into two distinct regimes, a strongly overdispersed one for small μNe, and a mostly Poissonian one for large μNe. The former is relevant for the majority of organisms in the plant and animal kingdom, and the latter may be relevant for RNA viruses.

  9. Experimental Rugged Fitness Landscape in Protein Sequence Space

    Science.gov (United States)

    Hayashi, Yuuki; Aita, Takuyo; Toyota, Hitoshi; Husimi, Yuzuru; Urabe, Itaru; Yomo, Tetsuya

    2006-01-01

    The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12–130 of the initial random polypeptide and selection for infectivity, the selected phage showed a 1.7×104-fold increase in infectivity, defined as the number of infected cells per ml of phage suspension. Fitness was defined as the logarithm of infectivity, and we analyzed (1) the dependence of stationary fitness on library size, which increased gradually, and (2) the time course of changes in fitness in transitional phases, based on an original theory regarding the evolutionary dynamics in Kauffman's n-k fitness landscape model. In the landscape model, single mutations at single sites among n sites affect the contribution of k other sites to fitness. Based on the results of these analyses, k was estimated to be 18–24. According to the estimated parameters, the landscape was plotted as a smooth surface up to a relative fitness of 0.4 of the global peak, whereas the landscape had a highly rugged surface with many local peaks above this relative fitness value. Based on the landscapes of these two different surfaces, it appears possible for adaptive walks with only random substitutions to climb with relative ease up to the middle region of the fitness landscape from any primordial or random sequence, whereas an enormous range of sequence diversity is required to climb further up the rugged surface above the middle region. PMID:17183728

  10. Experimental rugged fitness landscape in protein sequence space.

    Science.gov (United States)

    Hayashi, Yuuki; Aita, Takuyo; Toyota, Hitoshi; Husimi, Yuzuru; Urabe, Itaru; Yomo, Tetsuya

    2006-12-20

    The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12-130 of the initial random polypeptide and selection for infectivity, the selected phage showed a 1.7x10(4)-fold increase in infectivity, defined as the number of infected cells per ml of phage suspension. Fitness was defined as the logarithm of infectivity, and we analyzed (1) the dependence of stationary fitness on library size, which increased gradually, and (2) the time course of changes in fitness in transitional phases, based on an original theory regarding the evolutionary dynamics in Kauffman's n-k fitness landscape model. In the landscape model, single mutations at single sites among n sites affect the contribution of k other sites to fitness. Based on the results of these analyses, k was estimated to be 18-24. According to the estimated parameters, the landscape was plotted as a smooth surface up to a relative fitness of 0.4 of the global peak, whereas the landscape had a highly rugged surface with many local peaks above this relative fitness value. Based on the landscapes of these two different surfaces, it appears possible for adaptive walks with only random substitutions to climb with relative ease up to the middle region of the fitness landscape from any primordial or random sequence, whereas an enormous range of sequence diversity is required to climb further up the rugged surface above the middle region.

  11. Experimental rugged fitness landscape in protein sequence space.

    Directory of Open Access Journals (Sweden)

    Yuuki Hayashi

    Full Text Available The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12-130 of the initial random polypeptide and selection for infectivity, the selected phage showed a 1.7x10(4-fold increase in infectivity, defined as the number of infected cells per ml of phage suspension. Fitness was defined as the logarithm of infectivity, and we analyzed (1 the dependence of stationary fitness on library size, which increased gradually, and (2 the time course of changes in fitness in transitional phases, based on an original theory regarding the evolutionary dynamics in Kauffman's n-k fitness landscape model. In the landscape model, single mutations at single sites among n sites affect the contribution of k other sites to fitness. Based on the results of these analyses, k was estimated to be 18-24. According to the estimated parameters, the landscape was plotted as a smooth surface up to a relative fitness of 0.4 of the global peak, whereas the landscape had a highly rugged surface with many local peaks above this relative fitness value. Based on the landscapes of these two different surfaces, it appears possible for adaptive walks with only random substitutions to climb with relative ease up to the middle region of the fitness landscape from any primordial or random sequence, whereas an enormous range of sequence diversity is required to climb further up the rugged surface above the middle region.

  12. Understanding Cancer Genome and Its Evolution by Next Generation Sequencing

    DEFF Research Database (Denmark)

    Hou, Yong

    Cancer will cause 13 million deaths by the year of 2030, ranking the second leading cause of death worldwide. Previous studies indicate that most of the cancers originate from cells that acquired somatic mutations and evolved as Darwin Theory. Ten biological insights of cancer have been summarized...... recently. Cutting-age technologies like next generation sequencing (NGS) enable exploring cancer genome and evolution much more efficiently. However, integrated cancer genome sequencing studies showed great inter-/intra-tumoral heterogeneity (ITH) and complex evolution patterns beyond the cancer biological...... knowledge we previously know. There is very limited knowledge of East Asia lung cancer genome except enrichment of EGFR mutations and lack of KRAS mutations. We carried out integrated genomic, transcriptomic and methylomic analysis of 335 primary Chinese lung adenocarcinomas (LUAD) and 35 corresponding...

  13. Next-Generation Sequencing for Binary Protein-Protein Interactions

    Directory of Open Access Journals (Sweden)

    Bernhard eSuter

    2015-12-01

    Full Text Available The yeast two-hybrid (Y2H system exploits host cell genetics in order to display binary protein-protein interactions (PPIs via defined and selectable phenotypes. Numerous improvements have been made to this method, adapting the screening principle for diverse applications, including drug discovery and the scale-up for proteome wide interaction screens in human and other organisms. Here we discuss a systematic workflow and analysis scheme for screening data generated by Y2H and related assays that includes high-throughput selection procedures, readout of comprehensive results via next-generation sequencing (NGS, and the interpretation of interaction data via quantitative statistics. The novel assays and tools will serve the broader scientific community to harness the power of NGS technology to address PPI networks in health and disease. We discuss examples of how this next-generation platform can be applied to address specific questions in diverse fields of biology and medicine.

  14. Convergent evolution and mimicry of protein linear motifs in host-pathogen interactions.

    Science.gov (United States)

    Chemes, Lucía Beatriz; de Prat-Gay, Gonzalo; Sánchez, Ignacio Enrique

    2015-06-01

    Pathogen linear motif mimics are highly evolvable elements that facilitate rewiring of host protein interaction networks. Host linear motifs and pathogen mimics differ in sequence, leading to thermodynamic and structural differences in the resulting protein-protein interactions. Moreover, the functional output of a mimic depends on the motif and domain repertoire of the pathogen protein. Regulatory evolution mediated by linear motifs can be understood by measuring evolutionary rates, quantifying positive and negative selection and performing phylogenetic reconstructions of linear motif natural history. Convergent evolution of linear motif mimics is widespread among unrelated proteins from viral, prokaryotic and eukaryotic pathogens and can also take place within individual protein phylogenies. Statistics, biochemistry and laboratory models of infection link pathogen linear motifs to phenotypic traits such as tropism, virulence and oncogenicity. In vitro evolution experiments and analysis of natural sequences suggest that changes in linear motif composition underlie pathogen adaptation to a changing environment. Copyright © 2015 Elsevier Ltd. All rights reserved.

  15. Structural Approaches to Sequence Evolution Molecules, Networks, Populations

    CERN Document Server

    Bastolla, Ugo; Roman, H. Eduardo; Vendruscolo, Michele

    2007-01-01

    Structural requirements constrain the evolution of biological entities at all levels, from macromolecules to their networks, right up to populations of biological organisms. Classical models of molecular evolution, however, are focused at the level of the symbols - the biological sequence - rather than that of their resulting structure. Now recent advances in understanding the thermodynamics of macromolecules, the topological properties of gene networks, the organization and mutation capabilities of genomes, and the structure of populations make it possible to incorporate these key elements into a broader and deeply interdisciplinary view of molecular evolution. This book gives an account of such a new approach, through clear tutorial contributions by leading scientists specializing in the different fields involved.

  16. Comparative genome sequencing of drosophila pseudoobscura: Chromosomal, gene and cis-element evolution

    Energy Technology Data Exchange (ETDEWEB)

    Richards, Stephen; Liu, Yue; Bettencourt, Brian R.; Hradecky, Pavel; Letovsky, Stan; Nielsen, Rasmus; Thornton, Kevin; Todd, Melissa J.; Chen, Rui; Meisel, Richard P.; Couronne, Olivier; Hua, Sujun; Smith, Mark A.; Bussemaker, Harmen J.; van Batenburg, Marinus F.; Howells, Sally L.; Scherer, Steven E.; Sodergren, Erica; Matthews, Beverly B.; Crosby, Madeline A.; Schroeder, Andrew J.; Ortiz-Barrientos, Daniel; Rives, Catherine M.; Metzker, Michael L.; Muzny, Donna M.; Scott, Graham; Steffen, David; Wheeler, David A.; Worley, Kim C.; Havlak, Paul; Durbin, K. James; Egan, Amy; Gill, Rachel; Hume, Jennifer; Morgan, Margaret B.; Miner, George; Hamilton, Cerissa; Huang, Yanmei; Waldron, Lenee; Verduzco, Daniel; Blankenburg, Kerstin P.; Dubchak, Inna; Noor, Mohamed A.F.; Anderson, Wyatt; White, Kevin P.; Clark, Andrew G.; Schaeffer, Stephen W.; Gelbart, William; Weinstock, George M.; Gibbs, Richard A.

    2004-04-01

    The genome sequence of a second fruit fly, D. pseudoobscura, presents an opportunity for comparative analysis of a primary model organism D. melanogaster. The vast majority of Drosophila genes have remained on the same arm, but within each arm gene order has been extensively reshuffled leading to the identification of approximately 1300 syntenic blocks. A repetitive sequence is found in the D. pseudoobscura genome at many junctions between adjacent syntenic blocks. Analysis of this novel repetitive element family suggests that recombination between offset elements may have given rise to many paracentric inversions, thereby contributing to the shuffling of gene order in the D. pseudoobscura lineage. Based on sequence similarity and synteny, 10,516 putative orthologs have been identified as a core gene set conserved over 35 My since divergence. Genes expressed in the testes had higher amino acid sequence divergence than the genome wide average consistent with the rapid evolution of sex-specific proteins. Cis-regulatory sequences are more conserved than control sequences between the species but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible. Overall, a picture of repeat mediated chromosomal rearrangement, and high co-adaptation of both male genes and cis-regulatory sequences emerges as important themes of genome divergence between these species of Drosophila.

  17. Genome sequence diversity and clues to the evolution of variola (smallpox) virus.

    Science.gov (United States)

    Esposito, Joseph J; Sammons, Scott A; Frace, A Michael; Osborne, John D; Olsen-Rasmussen, Melissa; Zhang, Ming; Govil, Dhwani; Damon, Inger K; Kline, Richard; Laker, Miriam; Li, Yu; Smith, Geoffrey L; Meyer, Hermann; Leduc, James W; Wohlhueter, Robert M

    2006-08-11

    Comparative genomics of 45 epidemiologically varied variola virus isolates from the past 30 years of the smallpox era indicate low sequence diversity, suggesting that there is probably little difference in the isolates' functional gene content. Phylogenetic clustering inferred three clades coincident with their geographical origin and case-fatality rate; the latter implicated putative proteins that mediate viral virulence differences. Analysis of the viral linear DNA genome suggests that its evolution involved direct descent and DNA end-region recombination events. Knowing the sequences will help understand the viral proteome and improve diagnostic test precision, therapeutics, and systems for their assessment.

  18. The evolution of function in strictosidine synthase-like proteins.

    Science.gov (United States)

    Hicks, Michael A; Barber, Alan E; Giddings, Lesley-Ann; Caldwell, Jenna; O'Connor, Sarah E; Babbitt, Patricia C

    2011-11-01

    The exponential growth of sequence data provides abundant information for the discovery of new enzyme reactions. Correctly annotating the functions of highly diverse proteins can be difficult, however, hindering use of this information. Global analysis of large superfamilies of related proteins is a powerful strategy for understanding the evolution of reactions by identifying catalytic commonalities and differences in reaction and substrate specificity, even when only a few members have been biochemically or structurally characterized. A comparison of >2500 sequences sharing the six-bladed β-propeller fold establishes sequence, structural, and functional links among the three subgroups of the functionally diverse N6P superfamily: the arylesterase-like and senescence marker protein-30/gluconolactonase/luciferin-regenerating enzyme-like (SGL) subgroups, representing enzymes that catalyze lactonase and related hydrolytic reactions, and the so-called strictosidine synthase-like (SSL) subgroup. Metal-coordinating residues were identified as broadly conserved in the active sites of all three subgroups except for a few proteins from the SSL subgroup, which have been experimentally determined to catalyze the quite different strictosidine synthase (SS) reaction, a metal-independent condensation reaction. Despite these differences, comparison of conserved catalytic features of the arylesterase-like and SGL enzymes with the SSs identified similar structural and mechanistic attributes between the hydrolytic reactions catalyzed by the former and the condensation reaction catalyzed by SS. The results also suggest that despite their annotations, the great majority of these >500 SSL sequences do not catalyze the SS reaction; rather, they likely catalyze hydrolytic reactions typical of the other two subgroups instead. This prediction was confirmed experimentally for one of these proteins. Copyright © 2011 Wiley-Liss, Inc.

  19. Effects of mass loss on the evolution of massive stars. I. Main-sequence evolution

    International Nuclear Information System (INIS)

    Dearborn, D.S.P.; Blake, J.B.; Hainebach, K.L.; Schramm, D.N.

    1978-01-01

    The effect of mass loss on the evolution and surface composition of massive stars during main-sequence evolution are examined. While some details of the evolutionary track depend on the formula used for the mass loss, the results appear most sensitive to the total mass removed during the main-sequence lifetime. It was found that low mass-loss rates have very little effect on the evolution of a star; the track is slightly subluminous, but the lifetime is almost unaffected. High rates of mass loss lead to a hot, high-luminosity stellar model with a helium core surrounded by a hydrogen-deficient (Xapprox.0.1) envelope. The main-sequence lifetime is extended by a factor of 2--3. These models may be identified with Wolf-Rayet stars. Between these mass-loss extremes are intermediate models which appear as OBN stars on the main sequence. The mass-loss rates required for significant observable effects range from 8 x 10 -7 to 10 -5 M/sub sun/ yr -1 , depending on the initial stellar mass. It is found that observationally consistent mass-loss rates for stars with M> or =30 M/sub sun/ may be sufficiently high that these stars lose mass on a time scale more rapidly than their main-sequence core evolution time. This result implies that the helium cores resulting from the main-sequence evolution of these massive stars may all be very similar to that of a star of Mapprox.30 M/sub sun/ regardless of the zero-age mass

  20. Sequence analysis of serum albumins reveals the molecular evolution of ligand recognition properties.

    Science.gov (United States)

    Fanali, Gabriella; Ascenzi, Paolo; Bernardi, Giorgio; Fasano, Mauro

    2012-01-01

    Serum albumin (SA) is a circulating protein providing a depot and carrier for many endogenous and exogenous compounds. At least seven major binding sites have been identified by structural and functional investigations mainly in human SA. SA is conserved in vertebrates, with at least 49 entries in protein sequence databases. The multiple sequence analysis of this set of entries leads to the definition of a cladistic tree for the molecular evolution of SA orthologs in vertebrates, thus showing the clustering of the considered species, with lamprey SAs (Lethenteron japonicum and Petromyzon marinus) in a separate outgroup. Sequence analysis aimed at searching conserved domains revealed that most SA sequences are made up by three repeated domains (about 600 residues), as extensively characterized for human SA. On the contrary, lamprey SAs are giant proteins (about 1400 residues) comprising seven repeated domains. The phylogenetic analysis of the SA family reveals a stringent correlation with the taxonomic classification of the species available in sequence databases. A focused inspection of the sequences of ligand binding sites in SA revealed that in all sites most residues involved in ligand binding are conserved, although the versatility towards different ligands could be peculiar of higher organisms. Moreover, the analysis of molecular links between the different sites suggests that allosteric modulation mechanisms could be restricted to higher vertebrates.

  1. Simple sequence proteins in prokaryotic proteomes

    Directory of Open Access Journals (Sweden)

    Ramachandran Srinivasan

    2006-06-01

    Full Text Available Abstract Background The structural and functional features associated with Simple Sequence Proteins (SSPs are non-globularity, disease states, signaling and post-translational modification. SSPs are also an important source of genetic and possibly phenotypic variation. Analysis of 249 prokaryotic proteomes offers a new opportunity to examine the genomic properties of SSPs. Results SSPs are a minority but they grow with proteome size. This relationship is exhibited across species varying in genomic GC, mutational bias, life style, and pathogenicity. Their proportion in each proteome is strongly influenced by genomic base compositional bias. In most species simple duplications is favoured, but in a few cases such as Mycobacteria, large families of duplications occur. Amino acid preference in SSPs exhibits a trend towards low cost of biosynthesis. In SSPs and in non-SSPs, Alanine, Glycine, Leucine, and Valine are abundant in species widely varying in genomic GC whereas Isoleucine and Lysine are rich only in organisms with low genomic GC. Arginine is abundant in SSPs of two species and in the non-SSPs of Xanthomonas oryzae. Asparagine is abundant only in SSPs of low GC species. Aspartic acid is abundant only in the non-SSPs of Halobacterium sp NRC1. The abundance of Serine in SSPs of 62 species extends over a broader range compared to that of non-SSPs. Threonine(T is abundant only in SSPs of a couple of species. SSPs exhibit preferential association with Cell surface, Cell membrane and Transport functions and a negative association with Metabolism. Mesophiles and Thermophiles display similar ranges in the content of SSPs. Conclusion Although SSPs are a minority, the genomic forces of base compositional bias and duplications influence their growth and pattern in each species. The preferences and abundance of amino acids are governed by low biosynthetic cost, evolutionary age and base composition of codons. Abundance of charged amino acids Arginine

  2. Evolution of a protein domain interaction network

    International Nuclear Information System (INIS)

    Li-Feng, Gao; Jian-Jun, Shi; Shan, Guan

    2010-01-01

    In this paper, we attempt to understand complex network evolution from the underlying evolutionary relationship between biological organisms. Firstly, we construct a Pfam domain interaction network for each of the 470 completely sequenced organisms, and therefore each organism is correlated with a specific Pfam domain interaction network; secondly, we infer the evolutionary relationship of these organisms with the nearest neighbour joining method; thirdly, we use the evolutionary relationship between organisms constructed in the second step as the evolutionary course of the Pfam domain interaction network constructed in the first step. This analysis of the evolutionary course shows: (i) there is a conserved sub-network structure in network evolution; in this sub-network, nodes with lower degree prefer to maintain their connectivity invariant, and hubs tend to maintain their role as a hub is attached preferentially to new added nodes; (ii) few nodes are conserved as hubs; most of the other nodes are conserved as one with very low degree; (iii) in the course of network evolution, new nodes are added to the network either individually in most cases or as clusters with relative high clustering coefficients in a very few cases. (general)

  3. Pervasive Adaptive Evolution in Primate Seminal Proteins.

    Directory of Open Access Journals (Sweden)

    2005-09-01

    Full Text Available Seminal fluid proteins show striking effects on reproduction, involving manipulation of female behavior and physiology, mechanisms of sperm competition, and pathogen defense. Strong adaptive pressures are expected for such manifestations of sexual selection and host defense, but the extent of positive selection in seminal fluid proteins from divergent taxa is unknown. We identified adaptive evolution in primate seminal proteins using genomic resources in a tissue-specific study. We found extensive signatures of positive selection when comparing 161 human seminal fluid proteins and 2,858 prostate-expressed genes to those in chimpanzee. Seven of eight outstanding genes yielded statistically significant evidence of positive selection when analyzed in divergent primates. Functional clues were gained through divergent analysis, including several cases of species-specific loss of function in copulatory plug genes, and statistically significant spatial clustering of positively selected sites near the active site of kallikrein 2. This study reveals previously unidentified positive selection in seven primate seminal proteins, and when considered with findings in Drosophila, indicates that extensive positive selection is found in seminal fluid across divergent taxonomic groups.

  4. Evolution of a protein folding nucleus.

    Science.gov (United States)

    Xia, Xue; Longo, Liam M; Sutherland, Mason A; Blaber, Michael

    2016-07-01

    The folding nucleus (FN) is a cryptic element within protein primary structure that enables an efficient folding pathway and is the postulated heritable element in the evolution of protein architecture; however, almost nothing is known regarding how the FN structurally changes as complex protein architecture evolves from simpler peptide motifs. We report characterization of the FN of a designed purely symmetric β-trefoil protein by ϕ-value analysis. We compare the structure and folding properties of key foldable intermediates along the evolutionary trajectory of the β-trefoil. The results show structural acquisition of the FN during gene fusion events, incorporating novel turn structure created by gene fusion. Furthermore, the FN is adjusted by circular permutation in response to destabilizing functional mutation. FN plasticity by way of circular permutation is made possible by the intrinsic C3 cyclic symmetry of the β-trefoil architecture, identifying a possible selective advantage that helps explain the prevalence of cyclic structural symmetry in the proteome. © 2015 The Protein Society.

  5. Rampant adaptive evolution in regions of proteins with unknown function in Drosophila simulans.

    Directory of Open Access Journals (Sweden)

    Alisha K Holloway

    2007-10-01

    Full Text Available Adaptive protein evolution is pervasive in Drosophila. Genomic studies, thus far, have analyzed each protein as a single entity. However, the targets of adaptive events may be localized to particular parts of proteins, such as protein domains or regions involved in protein folding. We compared the population genetic mechanisms driving sequence polymorphism and divergence in defined protein domains and non-domain regions. Interestingly, we find that non-domain regions of proteins are more frequent targets of directional selection. Protein domains are also evolving under directional selection, but appear to be under stronger purifying selection than non-domain regions. Non-domain regions of proteins clearly play a major role in adaptive protein evolution on a genomic scale and merit future investigations of their functional properties.

  6. Protein Function Prediction Based on Sequence and Structure Information

    KAUST Repository

    Smaili, Fatima Z.

    2016-01-01

    operate. In this master thesis project, we worked on inferring protein functions based on the primary protein sequence. In the approach we follow, 3D models are first constructed using I-TASSER. Functions are then deduced by structurally matching

  7. Sequence diversity and evolution of antimicrobial peptides in invertebrates.

    Science.gov (United States)

    Tassanakajon, Anchalee; Somboonwiwat, Kunlaya; Amparyup, Piti

    2015-02-01

    Antimicrobial peptides (AMPs) are evolutionarily ancient molecules that act as the key components in the invertebrate innate immunity against invading pathogens. Several AMPs have been identified and characterized in invertebrates, and found to display considerable diversity in their amino acid sequence, structure and biological activity. AMP genes appear to have rapidly evolved, which might have arisen from the co-evolutionary arms race between host and pathogens, and enabled organisms to survive in different microbial environments. Here, the sequence diversity of invertebrate AMPs (defensins, cecropins, crustins and anti-lipopolysaccharide factors) are presented to provide a better understanding of the evolution pattern of these peptides that play a major role in host defense mechanisms. Copyright © 2014 Elsevier Ltd. All rights reserved.

  8. The genome sequence of the protostome Daphnia pulex encodes respective orthologues of a neurotrophin, a Trk and a p75NTR: Evolution of neurotrophin signaling components and related proteins in the bilateria

    Directory of Open Access Journals (Sweden)

    Wilson Karen HS

    2009-10-01

    Full Text Available Abstract Background Neurotrophins and their Trk and p75NTR receptors play an important role in the nervous system. To date, neurotrophins, Trk and p75NTR have only been found concomitantly in deuterostomes. In protostomes, homologues to either neurotrophin, Trk or p75NTR are reported but their phylogenetic relationship to deuterostome neurotrophin signaling components is unclear. Drosophila has neurotrophin homologues called Spätzles (Spz, some of which were recently renamed neurotrophins, but direct proof that these are deuterostome neurotrophin orthologues is lacking. Trks belong to the receptor tyrosine kinase (RTK family and among RTKs, Trks and RORs are closest related. Flies lack Trks but have ROR and ROR-related proteins called NRKs playing a neurotrophic role. Mollusks have so far the most similar proteins to Trks (Lymnaea Trk and Aplysia Trkl but the exact phylogenetic relationship of mollusk Trks to each other and to vertebrate Trks is unknown. p75NTR belongs to the tumor necrosis factor receptor (TNFR superfamily. The divergence of the TNFR families in vertebrates has been suggested to parallel the emergence of the adaptive immune system. Only one TNFR representative, the Drosophila Wengen, has been found in protostomes. To clarify the evolution of neurotrophin signaling components in bilateria, this work analyzes the genome of the crustacean Daphnia pulex as well as new genetic data from protostomes. Results The Daphnia genome encodes a neurotrophin, p75NTR and Trk orthologue together with Trkl, ROR, and NRK-RTKs. Drosophila Spz1, 2, 3, 5, 6 orthologues as well as two new groups of Spz proteins (Spz7 and 8 are also found in the Daphnia genome. Searching genbank and the genomes of Capitella, Helobdella and Lottia reveals neurotrophin signaling components in other protostomes. Conclusion It appears that a neurotrophin, Trk and p75NTR existed at the protostome/deuterostome split. In protostomes, a "neurotrophin superfamily" includes

  9. Use of designed sequences in protein structure recognition.

    Science.gov (United States)

    Kumar, Gayatri; Mudgal, Richa; Srinivasan, Narayanaswamy; Sandhya, Sankaran

    2018-05-09

    Knowledge of the protein structure is a pre-requisite for improved understanding of molecular function. The gap in the sequence-structure space has increased in the post-genomic era. Grouping related protein sequences into families can aid in narrowing the gap. In the Pfam database, structure description is provided for part or full-length proteins of 7726 families. For the remaining 52% of the families, information on 3-D structure is not yet available. We use the computationally designed sequences that are intermediately related to two protein domain families, which are already known to share the same fold. These strategically designed sequences enable detection of distant relationships and here, we have employed them for the purpose of structure recognition of protein families of yet unknown structure. We first measured the success rate of our approach using a dataset of protein families of known fold and achieved a success rate of 88%. Next, for 1392 families of yet unknown structure, we made structural assignments for part/full length of the proteins. Fold association for 423 domains of unknown function (DUFs) are provided as a step towards functional annotation. The results indicate that knowledge-based filling of gaps in protein sequence space is a lucrative approach for structure recognition. Such sequences assist in traversal through protein sequence space and effectively function as 'linkers', where natural linkers between distant proteins are unavailable. This article was reviewed by Oliviero Carugo, Christine Orengo and Srikrishna Subramanian.

  10. Partial sequence determination of metabolically labeled radioactive proteins and peptides

    International Nuclear Information System (INIS)

    Anderson, C.W.

    1982-01-01

    The author has used the sequence analysis of radioactive proteins and peptides to approach several problems during the past few years. They, in collaboration with others, have mapped precisely several adenovirus proteins with respect to the nucleotide sequence of the adenovirus genome; identified hitherto missed proteins encoded by bacteriophage MS2 and by simian virus 40; analyzed the aminoterminal maturation of several virus proteins; determined the cleavage sites for processing of the poliovirus polyprotein; and analyzed the mechanism of frameshifting by excess normal tRNAs during cell-free protein synthesis. This chapter is designed to aid those without prior experience at protein sequence determinations. It is based primarily on the experience gained in the studies cited above, which made use of the Beckman 890 series automated protein sequencers

  11. [Structure and evolution of the eukaryotic FANCJ-like proteins].

    Science.gov (United States)

    Wuhe, Jike; Zefeng, Wu; Sanhong, Fan; Xuguang, Xi

    2015-02-01

    The FANCJ-like protein family is a class of ATP-dependent helicases that can catalytically unwind duplex DNA along the 5'-3' direction. It is involved in the processes of DNA damage repair, homologous recombination and G-quadruplex DNA unwinding, and plays a critical role in maintaining genome integrity. In this study, we systemically analyzed FNACJ-like proteins from 47 eukaryotic species and discussed their sequences diversity, origin and evolution, motif organization patterns and spatial structure differences. Four members of FNACJ-like proteins, including XPD, CHL1, RTEL1 and FANCJ, were found in eukaryotes, but some of them were seriously deficient in most fungi and some insects. For example, the Zygomycota fungi lost RTEL1, Basidiomycota and Ascomycota fungi lost RTEL1 and FANCJ, and Diptera insect lost FANCJ. FANCJ-like proteins contain canonical motor domains HD1 and HD2, and the HD1 domain further integrates with three unique domains Fe-S, Arch and Extra-D. Fe-S and Arch domains are relatively conservative in all members of the family, but the Extra-D domain is lost in XPD and differs from one another in rest members. There are 7, 10 and 2 specific motifs found from the three unique domains respectively, while 5 and 12 specific motifs are found from HD1 and HD2 domains except the conserved motifs reported previously. By analyzing the arrangement pattern of these specific motifs, we found that RTEL1 and FANCJ are more closer and share two specific motifs Vb2 and Vc in HD2 domain, which are likely related with their G-quadruplex DNA unwinding activity. The evidence of evolution showed that FACNJ-like proteins were originated from a helicase, which has a HD1 domain inserted by extra Fe-S domain and Arch domain. By three continuous gene duplication events and followed specialization, eukaryotes finally possessed the current four members of FANCJ-like proteins.

  12. Complete cDNA sequence coding for human docking protein

    Energy Technology Data Exchange (ETDEWEB)

    Hortsch, M; Labeit, S; Meyer, D I

    1988-01-11

    Docking protein (DP, or SRP receptor) is a rough endoplasmic reticulum (ER)-associated protein essential for the targeting and translocation of nascent polypeptides across this membrane. It specifically interacts with a cytoplasmic ribonucleoprotein complex, the signal recognition particle (SRP). The nucleotide sequence of cDNA encoding the entire human DP and its deduced amino acid sequence are given.

  13. Incorporating information on predicted solvent accessibility to the co-evolution-based study of protein interactions.

    Science.gov (United States)

    Ochoa, David; García-Gutiérrez, Ponciano; Juan, David; Valencia, Alfonso; Pazos, Florencio

    2013-01-27

    A widespread family of methods for studying and predicting protein interactions using sequence information is based on co-evolution, quantified as similarity of phylogenetic trees. Part of the co-evolution observed between interacting proteins could be due to co-adaptation caused by inter-protein contacts. In this case, the co-evolution is expected to be more evident when evaluated on the surface of the proteins or the internal layers close to it. In this work we study the effect of incorporating information on predicted solvent accessibility to three methods for predicting protein interactions based on similarity of phylogenetic trees. We evaluate the performance of these methods in predicting different types of protein associations when trees based on positions with different characteristics of predicted accessibility are used as input. We found that predicted accessibility improves the results of two recent versions of the mirrortree methodology in predicting direct binary physical interactions, while it neither improves these methods, nor the original mirrortree method, in predicting other types of interactions. That improvement comes at no cost in terms of applicability since accessibility can be predicted for any sequence. We also found that predictions of protein-protein interactions are improved when multiple sequence alignments with a richer representation of sequences (including paralogs) are incorporated in the accessibility prediction.

  14. AlignMe—a membrane protein sequence alignment web server

    Science.gov (United States)

    Stamm, Marcus; Staritzbichler, René; Khafizov, Kamil; Forrest, Lucy R.

    2014-01-01

    We present a web server for pair-wise alignment of membrane protein sequences, using the program AlignMe. The server makes available two operational modes of AlignMe: (i) sequence to sequence alignment, taking two sequences in fasta format as input, combining information about each sequence from multiple sources and producing a pair-wise alignment (PW mode); and (ii) alignment of two multiple sequence alignments to create family-averaged hydropathy profile alignments (HP mode). For the PW sequence alignment mode, four different optimized parameter sets are provided, each suited to pairs of sequences with a specific similarity level. These settings utilize different types of inputs: (position-specific) substitution matrices, secondary structure predictions and transmembrane propensities from transmembrane predictions or hydrophobicity scales. In the second (HP) mode, each input multiple sequence alignment is converted into a hydrophobicity profile averaged over the provided set of sequence homologs; the two profiles are then aligned. The HP mode enables qualitative comparison of transmembrane topologies (and therefore potentially of 3D folds) of two membrane proteins, which can be useful if the proteins have low sequence similarity. In summary, the AlignMe web server provides user-friendly access to a set of tools for analysis and comparison of membrane protein sequences. Access is available at http://www.bioinfo.mpg.de/AlignMe PMID:24753425

  15. MIPS: a database for protein sequences and complete genomes.

    Science.gov (United States)

    Mewes, H W; Hani, J; Pfeiffer, F; Frishman, D

    1998-01-01

    The MIPS group [Munich Information Center for Protein Sequences of the German National Center for Environment and Health (GSF)] at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, is involved in a number of data collection activities, including a comprehensive database of the yeast genome, a database reflecting the progress in sequencing the Arabidopsis thaliana genome, the systematic analysis of other small genomes and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). Through its WWW server (http://www.mips.biochem.mpg.de ) MIPS provides access to a variety of generic databases, including a database of protein families as well as automatically generated data by the systematic application of sequence analysis algorithms. The yeast genome sequence and its related information was also compiled on CD-ROM to provide dynamic interactive access to the 16 chromosomes of the first eukaryotic genome unraveled. PMID:9399795

  16. Experimental Evolution of a Green Fluorescent Protein Composed of 19 Unique Amino Acids without Tryptophan

    Science.gov (United States)

    Kawahara-Kobayashi, Akio; Hitotsuyanagi, Mitsuhiro; Amikura, Kazuaki; Kiga, Daisuke

    2014-04-01

    At some stage of evolution, genes of organisms may have encoded proteins that were synthesized using fewer than 20 unique amino acids. Similar to evolution of the natural 19-amino-acid proteins GroEL/ES, proteins composed of 19 unique amino acids would have been able to evolve by accumulating beneficial mutations within the 19-amino-acid repertoire encoded in an ancestral genetic code. Because Trp is thought to be the last amino acid included in the canonical 20-amino-acid repertoire, this late stage of protein evolution could be mimicked by experimental evolution of 19-amino-acid proteins without tryptophan (Trp). To further understand the evolution of proteins, we tried to mimic the evolution of a 19-amino-acid protein involving the accumulation of beneficial mutations using directed evolution by random mutagenesis on the whole targeted gene sequence. We created active 19-amino-acid green fluorescent proteins (GFPs) without Trp from a poorly fluorescent 19-amino-acid mutant, S1-W57F, by using directed evolution with two rounds of mutagenesis and selection. The N105I and S205T mutations showed beneficial effects on the S1-W57F mutant. When these two mutations were combined on S1-W57F, we observed an additive effect on the fluorescence intensity. In contrast, these mutations showed no clear improvement individually or in combination on GFPS1, which is the parental GFP mutant composed of 20 amino acids. Our results provide an additional example for the experimental evolution of 19-amino-acid proteins without Trp, and would help understand the mechanisms underlying the evolution of 19-amino-acid proteins. (236 words)

  17. Dynamics of domain coverage of the protein sequence universe

    Science.gov (United States)

    2012-01-01

    Background The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”. Results Here we suggest that true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain. Conclusions Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of “dark matter”; however, its absolute size increases substantially with the growth of sequence data. PMID:23157439

  18. Dynamics of domain coverage of the protein sequence universe

    Directory of Open Access Journals (Sweden)

    Rekapalli Bhanu

    2012-11-01

    Full Text Available Abstract Background The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”. Results Here we suggest that true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain. Conclusions Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of “dark matter”; however, its absolute size increases substantially with the growth of sequence data.

  19. Pre-main-sequence evolution of the sun

    International Nuclear Information System (INIS)

    Gough, D.

    1980-01-01

    The phase of solar evolution after the dynamical collapse is considered. The physics of the Kelvin-Helmholtz phase of gravitational collapse is described, attention being given to the early stages of the star when it was completely convective. It is noted that subsequently, a radiative core developed and evolution was controlled by the rate at which heat can diffuse through it by radiative transfer. Since the study of the Kelvin-Helmholtz contraction alone does not give enough information regarding the state of the sun when it first settled down to approximate hydrostatic equilibrium, other stars are studied, and information on the sun is obtained by analogy. Many young solar-type stars, such as the T Tauri stars, are not in the completely convective Hayashi (1961) phase hence it is proposed that the sun was completely mixed soon after its formation, which has some bearing on the sun's chemical structure. It is suggested that the surface of the sun was very nonuniform compared with the photosphere of today. The simple solar evolution model presented gives a good guide to the general way in which the sun contracted to the main sequence

  20. The Evolution of the Secreted Regulatory Protein Progranulin.

    Directory of Open Access Journals (Sweden)

    Roger G E Palfree

    Full Text Available Progranulin is a secreted growth factor that is active in tumorigenesis, wound repair, and inflammation. Haploinsufficiency of the human progranulin gene, GRN, causes frontotemporal dementia. Progranulins are composed of chains of cysteine-rich granulin modules. Modules may be released from progranulin by proteolysis as 6kDa granulin polypeptides. Both intact progranulin and some of the granulin polypeptides are biologically active. The granulin module occurs in certain plant proteases and progranulins are present in early diverging metazoan clades such as the sponges, indicating their ancient evolutionary origin. There is only one Grn gene in mammalian genomes. More gene-rich Grn families occur in teleost fish with between 3 and 6 members per species including short-form Grns that have no tetrapod counterparts. Our goals are to elucidate progranulin and granulin module evolution by investigating (i: the origins of metazoan progranulins (ii: the evolutionary relationships between the single Grn of tetrapods and the multiple Grn genes of fish (iii: the evolution of granulin module architectures of vertebrate progranulins (iv: the conservation of mammalian granulin polypeptide sequences and how the conserved granulin amino acid sequences map to the known three dimensional structures of granulin modules. We report that progranulin-like proteins are present in unicellular eukaryotes that are closely related to metazoa suggesting that progranulin is among the earliest extracellular regulatory proteins still employed by multicellular animals. From the genomes of the elephant shark and coelacanth we identified contemporary representatives of a precursor for short-from Grn genes of ray-finned fish that is lost in tetrapods. In vertebrate Grns pathways of exon duplication resulted in a conserved module architecture at the amino-terminus that is frequently accompanied by an unusual pattern of tandem nearly identical module repeats near the carboxyl

  1. The Evolution of the Secreted Regulatory Protein Progranulin.

    Science.gov (United States)

    Palfree, Roger G E; Bennett, Hugh P J; Bateman, Andrew

    2015-01-01

    Progranulin is a secreted growth factor that is active in tumorigenesis, wound repair, and inflammation. Haploinsufficiency of the human progranulin gene, GRN, causes frontotemporal dementia. Progranulins are composed of chains of cysteine-rich granulin modules. Modules may be released from progranulin by proteolysis as 6kDa granulin polypeptides. Both intact progranulin and some of the granulin polypeptides are biologically active. The granulin module occurs in certain plant proteases and progranulins are present in early diverging metazoan clades such as the sponges, indicating their ancient evolutionary origin. There is only one Grn gene in mammalian genomes. More gene-rich Grn families occur in teleost fish with between 3 and 6 members per species including short-form Grns that have no tetrapod counterparts. Our goals are to elucidate progranulin and granulin module evolution by investigating (i): the origins of metazoan progranulins (ii): the evolutionary relationships between the single Grn of tetrapods and the multiple Grn genes of fish (iii): the evolution of granulin module architectures of vertebrate progranulins (iv): the conservation of mammalian granulin polypeptide sequences and how the conserved granulin amino acid sequences map to the known three dimensional structures of granulin modules. We report that progranulin-like proteins are present in unicellular eukaryotes that are closely related to metazoa suggesting that progranulin is among the earliest extracellular regulatory proteins still employed by multicellular animals. From the genomes of the elephant shark and coelacanth we identified contemporary representatives of a precursor for short-from Grn genes of ray-finned fish that is lost in tetrapods. In vertebrate Grns pathways of exon duplication resulted in a conserved module architecture at the amino-terminus that is frequently accompanied by an unusual pattern of tandem nearly identical module repeats near the carboxyl-terminus. Polypeptide

  2. Improving accuracy of protein-protein interaction prediction by considering the converse problem for sequence representation

    Directory of Open Access Journals (Sweden)

    Wang Yong

    2011-10-01

    Full Text Available Abstract Background With the development of genome-sequencing technologies, protein sequences are readily obtained by translating the measured mRNAs. Therefore predicting protein-protein interactions from the sequences is of great demand. The reason lies in the fact that identifying protein-protein interactions is becoming a bottleneck for eventually understanding the functions of proteins, especially for those organisms barely characterized. Although a few methods have been proposed, the converse problem, if the features used extract sufficient and unbiased information from protein sequences, is almost untouched. Results In this study, we interrogate this problem theoretically by an optimization scheme. Motivated by the theoretical investigation, we find novel encoding methods for both protein sequences and protein pairs. Our new methods exploit sufficiently the information of protein sequences and reduce artificial bias and computational cost. Thus, it significantly outperforms the available methods regarding sensitivity, specificity, precision, and recall with cross-validation evaluation and reaches ~80% and ~90% accuracy in Escherichia coli and Saccharomyces cerevisiae respectively. Our findings here hold important implication for other sequence-based prediction tasks because representation of biological sequence is always the first step in computational biology. Conclusions By considering the converse problem, we propose new representation methods for both protein sequences and protein pairs. The results show that our method significantly improves the accuracy of protein-protein interaction predictions.

  3. The Population Genomics of Sunflowers and Genomic Determinants of Protein Evolution Revealed by RNAseq

    Directory of Open Access Journals (Sweden)

    Loren H. Rieseberg

    2012-10-01

    Full Text Available Few studies have investigated the causes of evolutionary rate variation among plant nuclear genes, especially in recently diverged species still capable of hybridizing in the wild. The recent advent of Next Generation Sequencing (NGS permits investigation of genome wide rates of protein evolution and the role of selection in generating and maintaining divergence. Here, we use individual whole-transcriptome sequencing (RNAseq to refine our understanding of the population genomics of wild species of sunflowers (Helianthus spp. and the factors that affect rates of protein evolution. We aligned 35 GB of transcriptome sequencing data and identified 433,257 polymorphic sites (SNPs in a reference transcriptome comprising 16,312 genes. Using SNP markers, we identified strong population clustering largely corresponding to the three species analyzed here (Helianthus annuus, H. petiolaris, H. debilis, with one distinct early generation hybrid. Then, we calculated the proportions of adaptive substitution fixed by selection (alpha and identified gene ontology categories with elevated values of alpha. The “response to biotic stimulus” category had the highest mean alpha across the three interspecific comparisons, implying that natural selection imposed by other organisms plays an important role in driving protein evolution in wild sunflowers. Finally, we examined the relationship between protein evolution (dN/dS ratio and several genomic factors predicted to co-vary with protein evolution (gene expression level, divergence and specificity, genetic divergence [FST], and nucleotide diversity pi. We find that variation in rates of protein divergence was correlated with gene expression level and specificity, consistent with results from a broad range of taxa and timescales. This would in turn imply that these factors govern protein evolution both at a microevolutionary and macroevolutionary timescale. Our results contribute to a general understanding of the

  4. Nonlinear analysis of sequence repeats of multi-domain proteins

    Energy Technology Data Exchange (ETDEWEB)

    Huang Yanzhao [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Li Mingfeng [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Xiao Yi [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China)]. E-mail: lmf_bill@sina.com

    2007-11-15

    Many multi-domain proteins have repetitive three-dimensional structures but nearly-random amino acid sequences. In the present paper, by using a modified recurrence plot proposed by us previously, we show that these amino acid sequences have hidden repetitions in fact. These results indicate that the repetitive domain structures are encoded by the repetitive sequences. This also gives a method to detect the repetitive domain structures directly from amino acid sequences.

  5. Arabidopsis thaliana mTERF proteins: evolution and functional classification

    Directory of Open Access Journals (Sweden)

    Tatjana eKleine

    2012-10-01

    Full Text Available Organellar gene expression (OGE is crucial for plant development, photosynthesis and respiration, but our understanding of the mechanisms that control it is still relatively poor. Thus, OGE requires various nucleus-encoded proteins that promote transcription, splicing, trimming and editing of organellar RNAs, and regulate translation. In metazoans, proteins of the mitochondrial Transcription tERmination Factor (mTERF family interact with the mitochondrial chromosome and regulate transcriptional initiation and termination. Sequencing of the Arabidopsis thaliana genome led to the identification of a diversified MTERF gene family but, in contrast to mammalian mTERFs, knowledge about the function of these proteins in photosynthetic organisms is scarce. In this hypothesis article, I show that tandem duplications and one block duplication contributed to the large number of MTERF genes in A. thaliana, and propose that the expansion of the family is related to the evolution of land plants. The MTERF genes - especially the duplicated genes - display a number of distinct mRNA accumulation patterns, suggesting functional diversification of mTERF proteins to increase adaptability to environmental changes. Indeed, hypothetical functions for the different mTERF proteins can be predicted using co-expression analysis and gene ontology annotations. On this basis, mTERF proteins can be sorted into five groups. Members of the chloroplast and chloroplast-associated clusters are principally involved in chloroplast gene expression, embryogenesis and protein catabolism, while representatives of the mitochondrial cluster seem to participate in DNA and RNA metabolism in that organelle. Moreover, members of the mitochondrion-associated cluster and the low expression group may act in the nucleus and/or the cytosol. As proteins involved in OGE and presumably nuclear gene expression, mTERFs are ideal candidates for the coordination of the expression of organelle and nuclear

  6. Diversity and evolution of coral fluorescent proteins.

    Directory of Open Access Journals (Sweden)

    Naila O Alieva

    2008-07-01

    Full Text Available GFP-like fluorescent proteins (FPs are the key color determinants in reef-building corals (class Anthozoa, order Scleractinia and are of considerable interest as potential genetically encoded fluorescent labels. Here we report 40 additional members of the GFP family from corals. There are three major paralogous lineages of coral FPs. One of them is retained in all sampled coral families and is responsible for the non-fluorescent purple-blue color, while each of the other two evolved a full complement of typical coral fluorescent colors (cyan, green, and red and underwent sorting between coral groups. Among the newly cloned proteins are a "chromo-red" color type from Echinopora forskaliana (family Faviidae and pink chromoprotein from Stylophora pistillata (Pocilloporidae, both evolving independently from the rest of coral chromoproteins. There are several cyan FPs that possess a novel kind of excitation spectrum indicating a neutral chromophore ground state, for which the residue E167 is responsible (numeration according to GFP from A. victoria. The chromoprotein from Acropora millepora is an unusual blue instead of purple, which is due to two mutations: S64C and S183T. We applied a novel probabilistic sampling approach to recreate the common ancestor of all coral FPs as well as the more derived common ancestor of three main fluorescent colors of the Faviina suborder. Both proteins were green such as found elsewhere outside class Anthozoa. Interestingly, a substantial fraction of the all-coral ancestral protein had a chromohore apparently locked in a non-fluorescent neutral state, which may reflect the transitional stage that enabled rapid color diversification early in the history of coral FPs. Our results highlight the extent of convergent or parallel evolution of the color diversity in corals, provide the foundation for experimental studies of evolutionary processes that led to color diversification, and enable a comparative analysis of

  7. Phylogeny and evolution of Rab7 and Rab9 proteins

    Directory of Open Access Journals (Sweden)

    Wyroba Elżbieta

    2009-05-01

    Full Text Available Abstract Background An important role in the evolution of intracellular trafficking machinery in eukaryotes played small GTPases belonging to the Rab family known as pivotal regulators of vesicle docking, fusion and transport. The Rab family is very diversified and divided into several specialized subfamilies. We focused on the VII functional group comprising Rab7 and Rab9, two related subfamilies, and analysed 210 sequences of these proteins. Rab7 regulates traffic from early to late endosomes and from late endosome to vacuole/lysosome, whereas Rab9 participates in transport from late endosomes to the trans-Golgi network. Results Although Rab7 and Rab9 proteins are quite small and show heterogeneous rates of substitution in different lineages, we found a phylogenetic signal and inferred evolutionary relationships between them. Rab7 proteins evolved before radiation of main eukaryotic supergroups while Rab9 GTPases diverged from Rab7 before split of choanoflagellates and metazoans. Additional duplication of Rab9 and Rab7 proteins resulting in several isoforms occurred in the early evolution of vertebrates and next in teleost fishes and tetrapods. Three Rab7 lineages emerged before divergence of monocots and eudicots and subsequent duplications of Rab7 genes occurred in particular angiosperm clades. Interestingly, several Rab7 copies were identified in some representatives of excavates, ciliates and amoebozoans. The presence of many Rab copies is correlated with significant differences in their expression level. The diversification of analysed Rab subfamilies is also manifested by non-conserved sequences and structural features, many of which are involved in the interaction with regulators and effectors. Individual sites discriminating different subgroups of Rab7 and Rab9 GTPases have been identified. Conclusion Phylogenetic reconstructions of Rab7 and Rab9 proteins were performed by a variety of methods. These Rab GTPases show diversification

  8. MBEToolbox: a Matlab toolbox for sequence data analysis in molecular biology and evolution

    Directory of Open Access Journals (Sweden)

    Xia Xuhua

    2005-03-01

    Full Text Available Abstract Background MATLAB is a high-performance language for technical computing, integrating computation, visualization, and programming in an easy-to-use environment. It has been widely used in many areas, such as mathematics and computation, algorithm development, data acquisition, modeling, simulation, and scientific and engineering graphics. However, few functions are freely available in MATLAB to perform the sequence data analyses specifically required for molecular biology and evolution. Results We have developed a MATLAB toolbox, called MBEToolbox, aimed at filling this gap by offering efficient implementations of the most needed functions in molecular biology and evolution. It can be used to manipulate aligned sequences, calculate evolutionary distances, estimate synonymous and nonsynonymous substitution rates, and infer phylogenetic trees. Moreover, it provides an extensible, functional framework for users with more specialized requirements to explore and analyze aligned nucleotide or protein sequences from an evolutionary perspective. The full functions in the toolbox are accessible through the command-line for seasoned MATLAB users. A graphical user interface, that may be especially useful for non-specialist end users, is also provided. Conclusion MBEToolbox is a useful tool that can aid in the exploration, interpretation and visualization of data in molecular biology and evolution. The software is publicly available at http://web.hku.hk/~jamescai/mbetoolbox/ and http://bioinformatics.org/project/?group_id=454.

  9. An algorithm to find all palindromic sequences in proteins

    Indian Academy of Sciences (India)

    2013-01-20

    Jan 20, 2013 ... 1976; Karrer and Gall 1976; Vogt and Braun 1976) and (iii) in the formation of hairpin loops in the newly transcribed RNA. Palindromic sequences are observed in various classes of proteins like histones (Cheng et al. 1989), prion proteins (Sulkowski 1992; Kazim 1993),. DNA-binding proteins (Suzuki 1992; ...

  10. Discriminating Microbial Species Using Protein Sequence Properties and Machine Learning

    NARCIS (Netherlands)

    Shahib, Ali Al-; Gilbert, David; Breitling, Rainer

    2007-01-01

    Much work has been done to identify species-specific proteins in sequenced genomes and hence to determine their function. We assumed that such proteins have specific physico-chemical properties that will discriminate them from proteins in other species. In this paper, we examine the validity of this

  11. Evolution of DNA replication protein complexes in eukaryotes and Archaea.

    Directory of Open Access Journals (Sweden)

    Nicholas Chia

    Full Text Available BACKGROUND: The replication of DNA in Archaea and eukaryotes requires several ancillary complexes, including proliferating cell nuclear antigen (PCNA, replication factor C (RFC, and the minichromosome maintenance (MCM complex. Bacterial DNA replication utilizes comparable proteins, but these are distantly related phylogenetically to their archaeal and eukaryotic counterparts at best. METHODOLOGY/PRINCIPAL FINDINGS: While the structures of each of the complexes do not differ significantly between the archaeal and eukaryotic versions thereof, the evolutionary dynamic in the two cases does. The number of subunits in each complex is constant across all taxa. However, they vary subtly with regard to composition. In some taxa the subunits are all identical in sequence, while in others some are homologous rather than identical. In the case of eukaryotes, there is no phylogenetic variation in the makeup of each complex-all appear to derive from a common eukaryotic ancestor. This is not the case in Archaea, where the relationship between the subunits within each complex varies taxon-to-taxon. We have performed a detailed phylogenetic analysis of these relationships in order to better understand the gene duplications and divergences that gave rise to the homologous subunits in Archaea. CONCLUSION/SIGNIFICANCE: This domain level difference in evolution suggests that different forces have driven the evolution of DNA replication proteins in each of these two domains. In addition, the phylogenies of all three gene families support the distinctiveness of the proposed archaeal phylum Thaumarchaeota.

  12. Experimental Evolution of Escherichia coli Harboring an Ancient Translation Protein.

    Science.gov (United States)

    Kacar, Betül; Ge, Xueliang; Sanyal, Suparna; Gaucher, Eric A

    2017-03-01

    The ability to design synthetic genes and engineer biological systems at the genome scale opens new means by which to characterize phenotypic states and the responses of biological systems to perturbations. One emerging method involves inserting artificial genes into bacterial genomes and examining how the genome and its new genes adapt to each other. Here we report the development and implementation of a modified approach to this method, in which phylogenetically inferred genes are inserted into a microbial genome, and laboratory evolution is then used to examine the adaptive potential of the resulting hybrid genome. Specifically, we engineered an approximately 700-million-year-old inferred ancestral variant of tufB, an essential gene encoding elongation factor Tu, and inserted it in a modern Escherichia coli genome in place of the native tufB gene. While the ancient homolog was not lethal to the cell, it did cause a twofold decrease in organismal fitness, mainly due to reduced protein dosage. We subsequently evolved replicate hybrid bacterial populations for 2000 generations in the laboratory and examined the adaptive response via fitness assays, whole genome sequencing, proteomics, and biochemical assays. Hybrid lineages exhibit a general adaptive strategy in which the fitness cost of the ancient gene was ameliorated in part by upregulation of protein production. Our results suggest that an ancient-modern recombinant method may pave the way for the synthesis of organisms that exhibit ancient phenotypes, and that laboratory evolution of these organisms may prove useful in elucidating insights into historical adaptive processes.

  13. Protein secondary structure appears to be robust under in silico evolution while protein disorder appears not to be.

    KAUST Repository

    Schaefer, Christian

    2010-01-16

    MOTIVATION: The mutation of amino acids often impacts protein function and structure. Mutations without negative effect sustain evolutionary pressure. We study a particular aspect of structural robustness with respect to mutations: regular protein secondary structure and natively unstructured (intrinsically disordered) regions. Is the formation of regular secondary structure an intrinsic feature of amino acid sequences, or is it a feature that is lost upon mutation and is maintained by evolution against the odds? Similarly, is disorder an intrinsic sequence feature or is it difficult to maintain? To tackle these questions, we in silico mutated native protein sequences into random sequence-like ensembles and monitored the change in predicted secondary structure and disorder. RESULTS: We established that by our coarse-grained measures for change, predictions and observations were similar, suggesting that our results were not biased by prediction mistakes. Changes in secondary structure and disorder predictions were linearly proportional to the change in sequence. Surprisingly, neither the content nor the length distribution for the predicted secondary structure changed substantially. Regions with long disorder behaved differently in that significantly fewer such regions were predicted after a few mutation steps. Our findings suggest that the formation of regular secondary structure is an intrinsic feature of random amino acid sequences, while the formation of long-disordered regions is not an intrinsic feature of proteins with disordered regions. Put differently, helices and strands appear to be maintained easily by evolution, whereas maintaining disordered regions appears difficult. Neutral mutations with respect to disorder are therefore very unlikely.

  14. Protein secondary structure appears to be robust under in silico evolution while protein disorder appears not to be.

    KAUST Repository

    Schaefer, Christian; Schlessinger, Avner; Rost, Burkhard

    2010-01-01

    MOTIVATION: The mutation of amino acids often impacts protein function and structure. Mutations without negative effect sustain evolutionary pressure. We study a particular aspect of structural robustness with respect to mutations: regular protein secondary structure and natively unstructured (intrinsically disordered) regions. Is the formation of regular secondary structure an intrinsic feature of amino acid sequences, or is it a feature that is lost upon mutation and is maintained by evolution against the odds? Similarly, is disorder an intrinsic sequence feature or is it difficult to maintain? To tackle these questions, we in silico mutated native protein sequences into random sequence-like ensembles and monitored the change in predicted secondary structure and disorder. RESULTS: We established that by our coarse-grained measures for change, predictions and observations were similar, suggesting that our results were not biased by prediction mistakes. Changes in secondary structure and disorder predictions were linearly proportional to the change in sequence. Surprisingly, neither the content nor the length distribution for the predicted secondary structure changed substantially. Regions with long disorder behaved differently in that significantly fewer such regions were predicted after a few mutation steps. Our findings suggest that the formation of regular secondary structure is an intrinsic feature of random amino acid sequences, while the formation of long-disordered regions is not an intrinsic feature of proteins with disordered regions. Put differently, helices and strands appear to be maintained easily by evolution, whereas maintaining disordered regions appears difficult. Neutral mutations with respect to disorder are therefore very unlikely.

  15. The relationship of protein conservation and sequence length

    Directory of Open Access Journals (Sweden)

    Panchenko Anna R

    2002-11-01

    Full Text Available Abstract Background In general, the length of a protein sequence is determined by its function and the wide variance in the lengths of an organism's proteins reflects the diversity of specific functional roles for these proteins. However, additional evolutionary forces that affect the length of a protein may be revealed by studying the length distributions of proteins evolving under weaker functional constraints. Results We performed sequence comparisons to distinguish highly conserved and poorly conserved proteins from the bacterium Escherichia coli, the archaeon Archaeoglobus fulgidus, and the eukaryotes Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. For all organisms studied, the conserved and nonconserved proteins have strikingly different length distributions. The conserved proteins are, on average, longer than the poorly conserved ones, and the length distributions for the poorly conserved proteins have a relatively narrow peak, in contrast to the conserved proteins whose lengths spread over a wider range of values. For the two prokaryotes studied, the poorly conserved proteins approximate the minimal length distribution expected for a diverse range of structural folds. Conclusions There is a relationship between protein conservation and sequence length. For all the organisms studied, there seems to be a significant evolutionary trend favoring shorter proteins in the absence of other, more specific functional constraints.

  16. Modeling coding-sequence evolution within the context of residue solvent accessibility.

    Science.gov (United States)

    Scherrer, Michael P; Meyer, Austin G; Wilke, Claus O

    2012-09-12

    Protein structure mediates site-specific patterns of sequence divergence. In particular, residues in the core of a protein (solvent-inaccessible residues) tend to be more evolutionarily conserved than residues on the surface (solvent-accessible residues). Here, we present a model of sequence evolution that explicitly accounts for the relative solvent accessibility of each residue in a protein. Our model is a variant of the Goldman-Yang 1994 (GY94) model in which all model parameters can be functions of the relative solvent accessibility (RSA) of a residue. We apply this model to a data set comprised of nearly 600 yeast genes, and find that an evolutionary-rate ratio ω that varies linearly with RSA provides a better model fit than an RSA-independent ω or an ω that is estimated separately in individual RSA bins. We further show that the branch length t and the transition-transverion ratio κ also vary with RSA. The RSA-dependent GY94 model performs better than an RSA-dependent Muse-Gaut 1994 (MG94) model in which the synonymous and non-synonymous rates individually are linear functions of RSA. Finally, protein core size affects the slope of the linear relationship between ω and RSA, and gene expression level affects both the intercept and the slope. Structure-aware models of sequence evolution provide a significantly better fit than traditional models that neglect structure. The linear relationship between ω and RSA implies that genes are better characterized by their ω slope and intercept than by just their mean ω.

  17. Modeling coding-sequence evolution within the context of residue solvent accessibility

    Directory of Open Access Journals (Sweden)

    Scherrer Michael P

    2012-09-01

    Full Text Available Abstract Background Protein structure mediates site-specific patterns of sequence divergence. In particular, residues in the core of a protein (solvent-inaccessible residues tend to be more evolutionarily conserved than residues on the surface (solvent-accessible residues. Results Here, we present a model of sequence evolution that explicitly accounts for the relative solvent accessibility of each residue in a protein. Our model is a variant of the Goldman-Yang 1994 (GY94 model in which all model parameters can be functions of the relative solvent accessibility (RSA of a residue. We apply this model to a data set comprised of nearly 600 yeast genes, and find that an evolutionary-rate ratio ω that varies linearly with RSA provides a better model fit than an RSA-independent ω or an ω that is estimated separately in individual RSA bins. We further show that the branch length t and the transition-transverion ratio κ also vary with RSA. The RSA-dependent GY94 model performs better than an RSA-dependent Muse-Gaut 1994 (MG94 model in which the synonymous and non-synonymous rates individually are linear functions of RSA. Finally, protein core size affects the slope of the linear relationship between ω and RSA, and gene expression level affects both the intercept and the slope. Conclusions Structure-aware models of sequence evolution provide a significantly better fit than traditional models that neglect structure. The linear relationship between ω and RSA implies that genes are better characterized by their ω slope and intercept than by just their mean ω.

  18. The SWISS-PROT protein sequence data bank

    OpenAIRE

    Bairoch, Amos; Boeckmann, Brigitte

    1992-01-01

    SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1988, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library

  19. Aligning protein sequence and analysing substitution pattern using ...

    Indian Academy of Sciences (India)

    Prakash

    Aligning protein sequences using a score matrix has became a routine but valuable method in modern biological ..... the amino acids according to their substitution behaviour ...... which may cause great change (e.g. prolonging the helix) in.

  20. Cyclotide Evolution: Insights from the Analyses of Their Precursor Sequences, Structures and Distribution in Violets (Viola

    Directory of Open Access Journals (Sweden)

    Sungkyu Park

    2017-12-01

    Full Text Available Cyclotides are a family of plant proteins that are characterized by a cyclic backbone and a knotted disulfide topology. Their cyclic cystine knot (CCK motif makes them exceptionally resistant to thermal, chemical, and enzymatic degradation. By disrupting cell membranes, the cyclotides function as host defense peptides by exhibiting insecticidal, anthelmintic, antifouling, and molluscicidal activities. In this work, we provide the first insight into the evolution of this family of plant proteins by studying the Violaceae, in particular species of the genus Viola. We discovered 157 novel precursor sequences by the transcriptomic analysis of six Viola species: V. albida var. takahashii, V. mandshurica, V. orientalis, V. verecunda, V. acuminata, and V. canadensis. By combining these precursor sequences with the phylogenetic classification of Viola, we infer the distribution of cyclotides across 63% of the species in the genus (i.e., ~380 species. Using full precursor sequences from transcriptomes, we show an evolutionary link to the structural diversity of the cyclotides, and further classify the cyclotides by sequence signatures from the non-cyclotide domain. Also, transcriptomes were compared to cyclotide expression on a peptide level determined using liquid chromatography-mass spectrometry. Furthermore, the novel cyclotides discovered were associated with the emergence of new biological functions.

  1. Formatt: Correcting protein multiple structural alignments by incorporating sequence alignment

    Directory of Open Access Journals (Sweden)

    Daniels Noah M

    2012-10-01

    Full Text Available Abstract Background The quality of multiple protein structure alignments are usually computed and assessed based on geometric functions of the coordinates of the backbone atoms from the protein chains. These purely geometric methods do not utilize directly protein sequence similarity, and in fact, determining the proper way to incorporate sequence similarity measures into the construction and assessment of protein multiple structure alignments has proved surprisingly difficult. Results We present Formatt, a multiple structure alignment based on the Matt purely geometric multiple structure alignment program, that also takes into account sequence similarity when constructing alignments. We show that Formatt outperforms Matt and other popular structure alignment programs on the popular HOMSTRAD benchmark. For the SABMark twilight zone benchmark set that captures more remote homology, Formatt and Matt outperform other programs; depending on choice of embedded sequence aligner, Formatt produces either better sequence and structural alignments with a smaller core size than Matt, or similarly sized alignments with better sequence similarity, for a small cost in average RMSD. Conclusions Considering sequence information as well as purely geometric information seems to improve quality of multiple structure alignments, though defining what constitutes the best alignment when sequence and structural measures would suggest different alignments remains a difficult open question.

  2. Complete Chloroplast Genome Sequence of Aquilaria sinensis (Lour.) Gilg and Evolution Analysis within the Malvales Order.

    Science.gov (United States)

    Wang, Ying; Zhan, Di-Feng; Jia, Xian; Mei, Wen-Li; Dai, Hao-Fu; Chen, Xiong-Ting; Peng, Shi-Qing

    2016-01-01

    Aquilaria sinensis (Lour.) Gilg is an important medicinal woody plant producing agarwood, which is widely used in traditional Chinese medicine. High-throughput sequencing of chloroplast (cp) genomes enhanced the understanding about evolutionary relationships within plant families. In this study, we determined the complete cp genome sequences for A. sinensis. The size of the A. sinensis cp genome was 159,565 bp. This genome included a large single-copy region of 87,482 bp, a small single-copy region of 19,857 bp, and a pair of inverted repeats (IRa and IRb) of 26,113 bp each. The GC content of the genome was 37.11%. The A. sinensis cp genome encoded 113 functional genes, including 82 protein-coding genes, 27 tRNA genes, and 4 rRNA genes. Seven genes were duplicated in the protein-coding genes, whereas 11 genes were duplicated in the RNA genes. A total of 45 polymorphic simple-sequence repeat loci and 60 pairs of large repeats were identified. Most simple-sequence repeats were located in the noncoding sections of the large single-copy/small single-copy region and exhibited high A/T content. Moreover, 33 pairs of large repeat sequences were located in the protein-coding genes, whereas 27 pairs were located in the intergenic regions. Aquilaria sinensis cp genome bias ended with A/T on the basis of codon usage. The distribution of codon usage in A. sinensis cp genome was most similar to that in the Gonystylus bancanus cp genome. Comparative results of 82 protein-coding genes from 29 species of cp genomes demonstrated that A. sinensis was a sister species to G. bancanus within the Malvales order. Aquilaria sinensis cp genome presented the highest sequence similarity of >90% with the G. bancanus cp genome by using CGView Comparison Tool. This finding strongly supports the placement of A. sinensis as a sister to G. bancanus within the Malvales order. The complete A. sinensis cp genome information will be highly beneficial for further studies on this traditional medicinal

  3. Reassessing Domain Architecture Evolution of Metazoan Proteins: The Contribution of Different Evolutionary Mechanisms

    Directory of Open Access Journals (Sweden)

    Laszlo Patthy

    2011-08-01

    Full Text Available In the accompanying papers we have shown that sequence errors of public databases and confusion of paralogs and epaktologs (proteins that are related only through the independent acquisition of the same domain types significantly distort the picture that emerges from comparison of the domain architecture (DA of multidomain Metazoan proteins since they introduce a strong bias in favor of terminal over internal DA change. The issue of whether terminal or internal DA changes occur with greater probability has very important implications for the DA evolution of multidomain proteins since gene fusion can add domains only at terminal positions, whereas domain-shuffling is capable of inserting domains both at internal and terminal positions. As a corollary, overestimation of terminal DA changes may be misinterpreted as evidence for a dominant role of gene fusion in DA evolution. In this manuscript we show that in several recent studies of DA evolution of Metazoa the authors used databases that are significantly contaminated with incomplete, abnormal and mispredicted sequences (e.g., UniProtKB/TrEMBL, EnsEMBL and/or the authors failed to separate paralogs and epaktologs, explaining why these studies concluded that the major mechanism for gains of new domains in metazoan proteins is gene fusion. In contrast with the latter conclusion, our studies on high quality orthologous and paralogous Swiss-Prot sequences confirm that shuffling of mobile domains had a major role in the evolution of multidomain proteins of Metazoa and especially those formed in early vertebrates.

  4. Protein Function Prediction Based on Sequence and Structure Information

    KAUST Repository

    Smaili, Fatima Z.

    2016-05-25

    The number of available protein sequences in public databases is increasing exponentially. However, a significant fraction of these sequences lack functional annotation which is essential to our understanding of how biological systems and processes operate. In this master thesis project, we worked on inferring protein functions based on the primary protein sequence. In the approach we follow, 3D models are first constructed using I-TASSER. Functions are then deduced by structurally matching these predicted models, using global and local similarities, through three independent enzyme commission (EC) and gene ontology (GO) function libraries. The method was tested on 250 “hard” proteins, which lack homologous templates in both structure and function libraries. The results show that this method outperforms the conventional prediction methods based on sequence similarity or threading. Additionally, our method could be improved even further by incorporating protein-protein interaction information. Overall, the method we use provides an efficient approach for automated functional annotation of non-homologous proteins, starting from their sequence.

  5. Evolution of the MAGUK protein gene family in premetazoan lineages

    Directory of Open Access Journals (Sweden)

    Ruiz-Trillo Iñaki

    2010-04-01

    Full Text Available Abstract Background Cell-to-cell communication is a key process in multicellular organisms. In multicellular animals, scaffolding proteins belonging to the family of membrane-associated guanylate kinases (MAGUK are involved in the regulation and formation of cell junctions. These MAGUK proteins were believed to be exclusive to Metazoa. However, a MAGUK gene was recently identified in an EST survey of Capsaspora owczarzaki, an unicellular organism that branches off near the metazoan clade. To further investigate the evolutionary history of MAGUK, we have undertook a broader search for this gene family using available genomic sequences of different opisthokont taxa. Results Our survey and phylogenetic analyses show that MAGUK proteins are present not only in Metazoa, but also in the choanoflagellate Monosiga brevicollis and in the protist Capsaspora owczarzaki. However, MAGUKs are absent from fungi, amoebozoans or any other eukaryote. The repertoire of MAGUKs in Placozoa and eumetazoan taxa (Cnidaria + Bilateria is quite similar, except for one class that is missing in Trichoplax, while Porifera have a simpler MAGUK repertoire. However, Vertebrata have undergone several independent duplications and exhibit two exclusive MAGUK classes. Three different MAGUK types are found in both M. brevicollis and C. owczarzaki: DLG, MPP and MAGI. Furthermore, M. brevicollis has suffered a lineage-specific diversification. Conclusions The diversification of the MAGUK protein gene family occurred, most probably, prior to the divergence between Metazoa+choanoflagellates and the Capsaspora+Ministeria clade. A MAGI-like, a DLG-like, and a MPP-like ancestral genes were already present in the unicellular ancestor of Metazoa, and new gene members have been incorporated through metazoan evolution within two major periods, one before the sponge-eumetazoan split and another within the vertebrate lineage. Moreover, choanoflagellates have suffered an independent MAGUK

  6. Quantiprot - a Python package for quantitative analysis of protein sequences.

    Science.gov (United States)

    Konopka, Bogumił M; Marciniak, Marta; Dyrka, Witold

    2017-07-17

    The field of protein sequence analysis is dominated by tools rooted in substitution matrices and alignments. A complementary approach is provided by methods of quantitative characterization. A major advantage of the approach is that quantitative properties defines a multidimensional solution space, where sequences can be related to each other and differences can be meaningfully interpreted. Quantiprot is a software package in Python, which provides a simple and consistent interface to multiple methods for quantitative characterization of protein sequences. The package can be used to calculate dozens of characteristics directly from sequences or using physico-chemical properties of amino acids. Besides basic measures, Quantiprot performs quantitative analysis of recurrence and determinism in the sequence, calculates distribution of n-grams and computes the Zipf's law coefficient. We propose three main fields of application of the Quantiprot package. First, quantitative characteristics can be used in alignment-free similarity searches, and in clustering of large and/or divergent sequence sets. Second, a feature space defined by quantitative properties can be used in comparative studies of protein families and organisms. Third, the feature space can be used for evaluating generative models, where large number of sequences generated by the model can be compared to actually observed sequences.

  7. Taxonomic colouring of phylogenetic trees of protein sequences

    Directory of Open Access Journals (Sweden)

    Andrade-Navarro Miguel A

    2006-02-01

    Full Text Available Abstract Background Phylogenetic analyses of protein families are used to define the evolutionary relationships between homologous proteins. The interpretation of protein-sequence phylogenetic trees requires the examination of the taxonomic properties of the species associated to those sequences. However, there is no online tool to facilitate this interpretation, for example, by automatically attaching taxonomic information to the nodes of a tree, or by interactively colouring the branches of a tree according to any combination of taxonomic divisions. This is especially problematic if the tree contains on the order of hundreds of sequences, which, given the accelerated increase in the size of the protein sequence databases, is a situation that is becoming common. Results We have developed PhyloView, a web based tool for colouring phylogenetic trees upon arbitrary taxonomic properties of the species represented in a protein sequence phylogenetic tree. Provided that the tree contains SwissProt, SpTrembl, or GenBank protein identifiers, the tool retrieves the taxonomic information from the corresponding database. A colour picker displays a summary of the findings and allows the user to associate colours to the leaves of the tree according to any number of taxonomic partitions. Then, the colours are propagated to the branches of the tree. Conclusion PhyloView can be used at http://www.ogic.ca/projects/phyloview/. A tutorial, the software with documentation, and GPL licensed source code, can be accessed at the same web address.

  8. MIPS: a database for genomes and protein sequences.

    Science.gov (United States)

    Mewes, H W; Frishman, D; Güldener, U; Mannhaupt, G; Mayer, K; Mokrejs, M; Morgenstern, B; Münsterkötter, M; Rudd, S; Weil, B

    2002-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz-Netzwerk Bioinformatik) networks. The Arabidospsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91-93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155-158; Barker et al. (2001) Nucleic Acids Res., 29, 29-32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de).

  9. Correlation between protein sequence similarity and x-ray diffraction quality in the protein data bank.

    Science.gov (United States)

    Lu, Hui-Meng; Yin, Da-Chuan; Ye, Ya-Jing; Luo, Hui-Min; Geng, Li-Qiang; Li, Hai-Sheng; Guo, Wei-Hong; Shang, Peng

    2009-01-01

    As the most widely utilized technique to determine the 3-dimensional structure of protein molecules, X-ray crystallography can provide structure of the highest resolution among the developed techniques. The resolution obtained via X-ray crystallography is known to be influenced by many factors, such as the crystal quality, diffraction techniques, and X-ray sources, etc. In this paper, the authors found that the protein sequence could also be one of the factors. We extracted information of the resolution and the sequence of proteins from the Protein Data Bank (PDB), classified the proteins into different clusters according to the sequence similarity, and statistically analyzed the relationship between the sequence similarity and the best resolution obtained. The results showed that there was a pronounced correlation between the sequence similarity and the obtained resolution. These results indicate that protein structure itself is one variable that may affect resolution when X-ray crystallography is used.

  10. Statistical theory of neutral protein evolution by random site mutations

    Indian Academy of Sciences (India)

    Administrator

    Understanding the features of the protein conformational space represents a key component to characterize ... Neutral evolution; protein design; mutations; foldability criteria. 1. Introduction ... analysis of the vast evolutionary landscape is re- ... intra-molecular interactions in the protein which may not be ... This is the main in-.

  11. Evolutionary rates at codon sites may be used to align sequences and infer protein domain function

    Directory of Open Access Journals (Sweden)

    Hazelhurst Scott

    2010-03-01

    Full Text Available Abstract Background Sequence alignments form part of many investigations in molecular biology, including the determination of phylogenetic relationships, the prediction of protein structure and function, and the measurement of evolutionary rates. However, to obtain meaningful results, a significant degree of sequence similarity is required to ensure that the alignments are accurate and the inferences correct. Limitations arise when sequence similarity is low, which is particularly problematic when working with fast-evolving genes, evolutionary distant taxa, genomes with nucleotide biases, and cases of convergent evolution. Results A novel approach was conceptualized to address the "low sequence similarity" alignment problem. We developed an alignment algorithm termed FIRE (Functional Inference using the Rates of Evolution, which aligns sequences using the evolutionary rate at codon sites, as measured by the dN/dS ratio, rather than nucleotide or amino acid residues. FIRE was used to test the hypotheses that evolutionary rates can be used to align sequences and that the alignments may be used to infer protein domain function. Using a range of test data, we found that aligning domains based on evolutionary rates was possible even when sequence similarity was very low (for example, antibody variable regions. Furthermore, the alignment has the potential to infer protein domain function, indicating that domains with similar functions are subject to similar evolutionary constraints. These data suggest that an evolutionary rate-based approach to sequence analysis (particularly when combined with structural data may be used to study cases of convergent evolution or when sequences have very low similarity. However, when aligning homologous gene sets with sequence similarity, FIRE did not perform as well as the best traditional alignment algorithms indicating that the conventional approach of aligning residues as opposed to evolutionary rates remains the

  12. Genepleio software for effective estimation of gene pleiotropy from protein sequences.

    Science.gov (United States)

    Chen, Wenhai; Chen, Dandan; Zhao, Ming; Zou, Yangyun; Zeng, Yanwu; Gu, Xun

    2015-01-01

    Though pleiotropy, which refers to the phenomenon of a gene affecting multiple traits, has long played a central role in genetics, development, and evolution, estimation of the number of pleiotropy components remains a hard mission to accomplish. In this paper, we report a newly developed software package, Genepleio, to estimate the effective gene pleiotropy from phylogenetic analysis of protein sequences. Since this estimate can be interpreted as the minimum pleiotropy of a gene, it is used to play a role of reference for many empirical pleiotropy measures. This work would facilitate our understanding of how gene pleiotropy affects the pattern of genotype-phenotype map and the consequence of organismal evolution.

  13. Can Natural Proteins Designed with ‘Inverted’ Peptide Sequences Adopt Native-Like Protein Folds?

    Science.gov (United States)

    Sridhar, Settu; Guruprasad, Kunchur

    2014-01-01

    We have carried out a systematic computational analysis on a representative dataset of proteins of known three-dimensional structure, in order to evaluate whether it would possible to ‘swap’ certain short peptide sequences in naturally occurring proteins with their corresponding ‘inverted’ peptides and generate ‘artificial’ proteins that are predicted to retain native-like protein fold. The analysis of 3,967 representative proteins from the Protein Data Bank revealed 102,677 unique identical inverted peptide sequence pairs that vary in sequence length between 5–12 and 18 amino acid residues. Our analysis illustrates with examples that such ‘artificial’ proteins may be generated by identifying peptides with ‘similar structural environment’ and by using comparative protein modeling and validation studies. Our analysis suggests that natural proteins may be tolerant to accommodating such peptides. PMID:25210740

  14. Correlated mutations in protein sequences: Phylogenetic and structural effects

    Energy Technology Data Exchange (ETDEWEB)

    Lapedes, A.S. [Los Alamos National Lab., NM (United States). Theoretical Div.]|[Santa Fe Inst., NM (United States); Giraud, B.G. [C.E.N. Saclay, Gif/Yvette (France). Service Physique Theorique; Liu, L.C. [Los Alamos National Lab., NM (United States). Theoretical Div.; Stormo, G.D. [Univ. of Colorado, Boulder, CO (United States). Dept. of Molecular, Cellular and Developmental Biology

    1998-12-01

    Covariation analysis of sets of aligned sequences for RNA molecules is relatively successful in elucidating RNA secondary structure, as well as some aspects of tertiary structure. Covariation analysis of sets of aligned sequences for protein molecules is successful in certain instances in elucidating certain structural and functional links, but in general, pairs of sites displaying highly covarying mutations in protein sequences do not necessarily correspond to sites that are spatially close in the protein structure. In this paper the authors identify two reasons why naive use of covariation analysis for protein sequences fails to reliably indicate sequence positions that are spatially proximate. The first reason involves the bias introduced in calculation of covariation measures due to the fact that biological sequences are generally related by a non-trivial phylogenetic tree. The authors present a null-model approach to solve this problem. The second reason involves linked chains of covariation which can result in pairs of sites displaying significant covariation even though they are not spatially proximate. They present a maximum entropy solution to this classic problem of causation versus correlation. The methodologies are validated in simulation.

  15. Co-evolution of SNF spliceosomal proteins with their RNA targets in trans-splicing nematodes.

    Science.gov (United States)

    Strange, Rex Meade; Russelburg, L Peyton; Delaney, Kimberly J

    2016-08-01

    Although the mechanism of pre-mRNA splicing has been well characterized, the evolution of spliceosomal proteins is poorly understood. The U1A/U2B″/SNF family (hereafter referred to as the SNF family) of RNA binding spliceosomal proteins participates in both the U1 and U2 small interacting nuclear ribonucleoproteins (snRNPs). The highly constrained nature of this system has inhibited an analysis of co-evolutionary trends between the proteins and their RNA binding targets. Here we report accelerated sequence evolution in the SNF protein family in Phylum Nematoda, which has allowed an analysis of protein:RNA co-evolution. In a comparison of SNF genes from ecdysozoan species, we found a correlation between trans-splicing species (nematodes) and increased phylogenetic branch lengths of the SNF protein family, with respect to their sister clade Arthropoda. In particular, we found that nematodes (~70-80 % of pre-mRNAs are trans-spliced) have experienced higher rates of SNF sequence evolution than arthropods (predominantly cis-spliced) at both the nucleotide and amino acid levels. Interestingly, this increased evolutionary rate correlates with the reliance on trans-splicing by nematodes, which would alter the role of the SNF family of spliceosomal proteins. We mapped amino acid substitutions to functionally important regions of the SNF protein, specifically to sites that are predicted to disrupt protein:RNA and protein:protein interactions. Finally, we investigated SNF's RNA targets: the U1 and U2 snRNAs. Both are more divergent in nematodes than arthropods, suggesting the RNAs have co-evolved with SNF in order to maintain the necessarily high affinity interaction that has been characterized in other species.

  16. Semi-Supervised Learning for Classification of Protein Sequence Data

    Directory of Open Access Journals (Sweden)

    Brian R. King

    2008-01-01

    Full Text Available Protein sequence data continue to become available at an exponential rate. Annotation of functional and structural attributes of these data lags far behind, with only a small fraction of the data understood and labeled by experimental methods. Classification methods that are based on semi-supervised learning can increase the overall accuracy of classifying partly labeled data in many domains, but very few methods exist that have shown their effect on protein sequence classification. We show how proven methods from text classification can be applied to protein sequence data, as we consider both existing and novel extensions to the basic methods, and demonstrate restrictions and differences that must be considered. We demonstrate comparative results against the transductive support vector machine, and show superior results on the most difficult classification problems. Our results show that large repositories of unlabeled protein sequence data can indeed be used to improve predictive performance, particularly in situations where there are fewer labeled protein sequences available, and/or the data are highly unbalanced in nature.

  17. Single-molecule protein sequencing through fingerprinting: computational assessment

    Science.gov (United States)

    Yao, Yao; Docter, Margreet; van Ginkel, Jetty; de Ridder, Dick; Joo, Chirlmin

    2015-10-01

    Proteins are vital in all biological systems as they constitute the main structural and functional components of cells. Recent advances in mass spectrometry have brought the promise of complete proteomics by helping draft the human proteome. Yet, this commonly used protein sequencing technique has fundamental limitations in sensitivity. Here we propose a method for single-molecule (SM) protein sequencing. A major challenge lies in the fact that proteins are composed of 20 different amino acids, which demands 20 molecular reporters. We computationally demonstrate that it suffices to measure only two types of amino acids to identify proteins and suggest an experimental scheme using SM fluorescence. When achieved, this highly sensitive approach will result in a paradigm shift in proteomics, with major impact in the biological and medical sciences.

  18. Single-molecule protein sequencing through fingerprinting: computational assessment

    International Nuclear Information System (INIS)

    Yao, Yao; Docter, Margreet; Van Ginkel, Jetty; Joo, Chirlmin; De Ridder, Dick

    2015-01-01

    Proteins are vital in all biological systems as they constitute the main structural and functional components of cells. Recent advances in mass spectrometry have brought the promise of complete proteomics by helping draft the human proteome. Yet, this commonly used protein sequencing technique has fundamental limitations in sensitivity. Here we propose a method for single-molecule (SM) protein sequencing. A major challenge lies in the fact that proteins are composed of 20 different amino acids, which demands 20 molecular reporters. We computationally demonstrate that it suffices to measure only two types of amino acids to identify proteins and suggest an experimental scheme using SM fluorescence. When achieved, this highly sensitive approach will result in a paradigm shift in proteomics, with major impact in the biological and medical sciences. (paper)

  19. Sequence analysis reveals how G protein-coupled receptors transduce the signal to the G protein.

    NARCIS (Netherlands)

    Oliveira, L.; Paiva, P.B.; Paiva, A.C.; Vriend, G.

    2003-01-01

    Sequence entropy-variability plots based on alignments of very large numbers of sequences-can indicate the location in proteins of the main active site and modulator sites. In the previous article in this issue, we applied this observation to a series of well-studied proteins and concluded that it

  20. TARSyn: Tunable Antibiotic Resistance Devices Enabling Bacterial Synthetic Evolution and Protein Production

    DEFF Research Database (Denmark)

    Rennig, Maja; Martinez, Virginia; Mirzadeh, Kiavash

    2018-01-01

    Evolution can be harnessed to optimize synthetic biology designs. A prominent example is recombinant protein production-a dominating theme in biotechnology for more than three decades. Typically, a protein coding sequence (cds) is recombined with genetic elements, such as promoters, ribosome...... and allows expression levels in large clone libraries to be probed using a simple cell survival assay on the respective antibiotic. The power of the approach is demonstrated by substantially increasing production of two commercially interesting proteins, a Nanobody and an Affibody. The method is a simple......-level expression-an example of synthetic evolution. However, manual screening limits the ability to assay expression levels of all putative sequences in the libraries. Here we have solved this bottleneck by designing a collection of translational coupling devices based on a RNA secondary structure. Exchange...

  1. WildSpan: mining structured motifs from protein sequences

    Directory of Open Access Journals (Sweden)

    Chen Chien-Yu

    2011-03-01

    Full Text Available Abstract Background Automatic extraction of motifs from biological sequences is an important research problem in study of molecular biology. For proteins, it is desired to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually largely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions that incorporates several pruning strategies to largely reduce the mining cost. Results WildSpan is shown to efficiently find W-patterns containing conserved residues that are far separated in sequences. We conducted experiments with two mining strategies, protein-based and family-based mining, to evaluate the usefulness of W-patterns and performance of WildSpan. The protein-based mining mode

  2. Strong Selection Significantly Increases Epistatic Interactions in the Long-Term Evolution of a Protein.

    Directory of Open Access Journals (Sweden)

    Aditi Gupta

    2016-03-01

    Full Text Available Epistatic interactions between residues determine a protein's adaptability and shape its evolutionary trajectory. When a protein experiences a changed environment, it is under strong selection to find a peak in the new fitness landscape. It has been shown that strong selection increases epistatic interactions as well as the ruggedness of the fitness landscape, but little is known about how the epistatic interactions change under selection in the long-term evolution of a protein. Here we analyze the evolution of epistasis in the protease of the human immunodeficiency virus type 1 (HIV-1 using protease sequences collected for almost a decade from both treated and untreated patients, to understand how epistasis changes and how those changes impact the long-term evolvability of a protein. We use an information-theoretic proxy for epistasis that quantifies the co-variation between sites, and show that positive information is a necessary (but not sufficient condition that detects epistasis in most cases. We analyze the "fossils" of the evolutionary trajectories of the protein contained in the sequence data, and show that epistasis continues to enrich under strong selection, but not for proteins whose environment is unchanged. The increase in epistasis compensates for the information loss due to sequence variability brought about by treatment, and facilitates adaptation in the increasingly rugged fitness landscape of treatment. While epistasis is thought to enhance evolvability via valley-crossing early-on in adaptation, it can hinder adaptation later when the landscape has turned rugged. However, we find no evidence that the HIV-1 protease has reached its potential for evolution after 9 years of adapting to a drug environment that itself is constantly changing. We suggest that the mechanism of encoding new information into pairwise interactions is central to protein evolution not just in HIV-1 protease, but for any protein adapting to a changing

  3. Structure and Sequence Search on Aptamer-Protein Docking

    Science.gov (United States)

    Xiao, Jiajie; Bonin, Keith; Guthold, Martin; Salsbury, Freddie

    2015-03-01

    Interactions between proteins and deoxyribonucleic acid (DNA) play a significant role in the living systems, especially through gene regulation. However, short nucleic acids sequences (aptamers) with specific binding affinity to specific proteins exhibit clinical potential as therapeutics. Our capillary and gel electrophoresis selection experiments show that specific sequences of aptamers can be selected that bind specific proteins. Computationally, given the experimentally-determined structure and sequence of a thrombin-binding aptamer, we can successfully dock the aptamer onto thrombin in agreement with experimental structures of the complex. In order to further study the conformational flexibility of this thrombin-binding aptamer and to potentially develop a predictive computational model of aptamer-binding, we use GPU-enabled molecular dynamics simulations to both examine the conformational flexibility of the aptamer in the absence of binding to thrombin, and to determine our ability to fold an aptamer. This study should help further de-novo predictions of aptamer sequences by enabling the study of structural and sequence-dependent effects on aptamer-protein docking specificity.

  4. Whole-genome sequencing reveals the mechanisms for evolution of streptomycin resistance in Lactobacillus plantarum.

    Science.gov (United States)

    Zhang, Fuxin; Gao, Jiayuan; Wang, Bini; Huo, Dongxue; Wang, Zhaoxia; Zhang, Jiachao; Shao, Yuyu

    2018-04-01

    In this research, we investigated the evolution of streptomycin resistance in Lactobacillus plantarum ATCC14917, which was passaged in medium containing a gradually increasing concentration of streptomycin. After 25 d, the minimum inhibitory concentration (MIC) of L. plantarum ATCC14917 had reached 131,072 µg/mL, which was 8,192-fold higher than the MIC of the original parent isolate. The highly resistant L. plantarum ATCC14917 isolate was then passaged in antibiotic-free medium to determine the stability of resistance. The MIC value of the L. plantarum ATCC14917 isolate decreased to 2,048 µg/mL after 35 d but remained constant thereafter, indicating that resistance was irreversible even in the absence of selection pressure. Whole-genome sequencing of parent isolates, control isolates, and isolates following passage was used to study the resistance mechanism of L. plantarum ATCC14917 to streptomycin and adaptation in the presence and absence of selection pressure. Five mutated genes (single nucleotide polymorphisms and structural variants) were verified in highly resistant L. plantarum ATCC14917 isolates, which were related to ribosomal protein S12, LPXTG-motif cell wall anchor domain protein, LrgA family protein, Ser/Thr phosphatase family protein, and a hypothetical protein that may correlate with resistance to streptomycin. After passage in streptomycin-free medium, only the mutant gene encoding ribosomal protein S12 remained; the other 4 mutant genes had reverted to the wild type as found in the parent isolate. Although the MIC value of L. plantarum ATCC14917 was reduced in the absence of selection pressure, it remained 128-fold higher than the MIC value of the parent isolate, indicating that ribosomal protein S12 may play an important role in streptomycin resistance. Using the mobile elements database, we demonstrated that streptomycin resistance-related genes in L. plantarum ATCC14917 were not located on mobile elements. This research offers a way of

  5. Universal sequence replication, reversible polymerization and early functional biopolymers: a model for the initiation of prebiotic sequence evolution.

    Directory of Open Access Journals (Sweden)

    Sara Imari Walker

    Full Text Available Many models for the origin of life have focused on understanding how evolution can drive the refinement of a preexisting enzyme, such as the evolution of efficient replicase activity. Here we present a model for what was, arguably, an even earlier stage of chemical evolution, when polymer sequence diversity was generated and sustained before, and during, the onset of functional selection. The model includes regular environmental cycles (e.g. hydration-dehydration cycles that drive polymers between times of replication and functional activity, which coincide with times of different monomer and polymer diffusivity. Template-directed replication of informational polymers, which takes place during the dehydration stage of each cycle, is considered to be sequence-independent. New sequences are generated by spontaneous polymer formation, and all sequences compete for a finite monomer resource that is recycled via reversible polymerization. Kinetic Monte Carlo simulations demonstrate that this proposed prebiotic scenario provides a robust mechanism for the exploration of sequence space. Introduction of a polymer sequence with monomer synthetase activity illustrates that functional sequences can become established in a preexisting pool of otherwise non-functional sequences. Functional selection does not dominate system dynamics and sequence diversity remains high, permitting the emergence and spread of more than one functional sequence. It is also observed that polymers spontaneously form clusters in simulations where polymers diffuse more slowly than monomers, a feature that is reminiscent of a previous proposal that the earliest stages of life could have been defined by the collective evolution of a system-wide cooperation of polymer aggregates. Overall, the results presented demonstrate the merits of considering plausible prebiotic polymer chemistries and environments that would have allowed for the rapid turnover of monomer resources and for

  6. Prediction of Protein Hotspots from Whole Protein Sequences by a Random Projection Ensemble System

    Directory of Open Access Journals (Sweden)

    Jinjian Jiang

    2017-07-01

    Full Text Available Hotspot residues are important in the determination of protein-protein interactions, and they always perform specific functions in biological processes. The determination of hotspot residues is by the commonly-used method of alanine scanning mutagenesis experiments, which is always costly and time consuming. To address this issue, computational methods have been developed. Most of them are structure based, i.e., using the information of solved protein structures. However, the number of solved protein structures is extremely less than that of sequences. Moreover, almost all of the predictors identified hotspots from the interfaces of protein complexes, seldom from the whole protein sequences. Therefore, determining hotspots from whole protein sequences by sequence information alone is urgent. To address the issue of hotspot predictions from the whole sequences of proteins, we proposed an ensemble system with random projections using statistical physicochemical properties of amino acids. First, an encoding scheme involving sequence profiles of residues and physicochemical properties from the AAindex1 dataset is developed. Then, the random projection technique was adopted to project the encoding instances into a reduced space. Then, several better random projections were obtained by training an IBk classifier based on the training dataset, which were thus applied to the test dataset. The ensemble of random projection classifiers is therefore obtained. Experimental results showed that although the performance of our method is not good enough for real applications of hotspots, it is very promising in the determination of hotspot residues from whole sequences.

  7. Osteocalcin protein sequences of Neanderthals and modern primates.

    Science.gov (United States)

    Nielsen-Marsh, Christina M; Richards, Michael P; Hauschka, Peter V; Thomas-Oates, Jane E; Trinkaus, Erik; Pettitt, Paul B; Karavanic, Ivor; Poinar, Hendrik; Collins, Matthew J

    2005-03-22

    We report here protein sequences of fossil hominids, from two Neanderthals dating to approximately 75,000 years old from Shanidar Cave in Iraq. These sequences, the oldest reported fossil primate protein sequences, are of bone osteocalcin, which was extracted and sequenced by using MALDI-TOF/TOF mass spectrometry. Through a combination of direct sequencing and peptide mass mapping, we determined that Neanderthals have an osteocalcin amino acid sequence that is identical to that of modern humans. We also report complete osteocalcin sequences for chimpanzee (Pan troglodytes) and gorilla (Gorilla gorilla gorilla) and a partial sequence for orangutan (Pongo pygmaeus), all of which are previously unreported. We found that the osteocalcin sequences of Neanderthals, modern human, chimpanzee, and orangutan are unusual among mammals in that the ninth amino acid is proline (Pro-9), whereas most species have hydroxyproline (Hyp-9). Posttranslational hydroxylation of Pro-9 in osteocalcin by prolyl-4-hydroxylase requires adequate concentrations of vitamin C (l-ascorbic acid), molecular O(2), Fe(2+), and 2-oxoglutarate, and also depends on enzyme recognition of the target proline substrate consensus sequence Leu-Gly-Ala-Pro-9-Ala-Pro-Tyr occurring in most mammals. In five species with Pro-9-Val-10, hydroxylation is blocked, whereas in gorilla there is a mixture of Pro-9 and Hyp-9. We suggest that the absence of hydroxylation of Pro-9 in Pan, Pongo, and Homo may reflect response to a selective pressure related to a decline in vitamin C in the diet during omnivorous dietary adaptation, either independently or through the common ancestor of these species.

  8. Sequence alignment reveals possible MAPK docking motifs on HIV proteins.

    Directory of Open Access Journals (Sweden)

    Perry Evans

    Full Text Available Over the course of HIV infection, virus replication is facilitated by the phosphorylation of HIV proteins by human ERK1 and ERK2 mitogen-activated protein kinases (MAPKs. MAPKs are known to phosphorylate their substrates by first binding with them at a docking site. Docking site interactions could be viable drug targets because the sequences guiding them are more specific than phosphorylation consensus sites. In this study we use multiple bioinformatics tools to discover candidate MAPK docking site motifs on HIV proteins known to be phosphorylated by MAPKs, and we discuss the possibility of targeting docking sites with drugs. Using sequence alignments of HIV proteins of different subtypes, we show that MAPK docking patterns previously described for human proteins appear on the HIV matrix, Tat, and Vif proteins in a strain dependent manner, but are absent from HIV Rev and appear on all HIV Nef strains. We revise the regular expressions of previously annotated MAPK docking patterns in order to provide a subtype independent motif that annotates all HIV proteins. One revision is based on a documented human variant of one of the substrate docking motifs, and the other reduces the number of required basic amino acids in the standard docking motifs from two to one. The proposed patterns are shown to be consistent with in silico docking between ERK1 and the HIV matrix protein. The motif usage on HIV proteins is sufficiently different from human proteins in amino acid sequence similarity to allow for HIV specific targeting using small-molecule drugs.

  9. Intricate knots in proteins: Function and evolution.

    Directory of Open Access Journals (Sweden)

    Peter Virnau

    2006-09-01

    Full Text Available Our investigation of knotted structures in the Protein Data Bank reveals the most complicated knot discovered to date. We suggest that the occurrence of this knot in a human ubiquitin hydrolase might be related to the role of the enzyme in protein degradation. While knots are usually preserved among homologues, we also identify an exception in a transcarbamylase. This allows us to exemplify the function of knots in proteins and to suggest how they may have been created.

  10. EST2Prot: Mapping EST sequences to proteins

    Directory of Open Access Journals (Sweden)

    Lin David M

    2006-03-01

    Full Text Available Abstract Background EST libraries are used in various biological studies, from microarray experiments to proteomic and genetic screens. These libraries usually contain many uncharacterized ESTs that are typically ignored since they cannot be mapped to known genes. Consequently, new discoveries are possibly overlooked. Results We describe a system (EST2Prot that uses multiple elements to map EST sequences to their corresponding protein products. EST2Prot uses UniGene clusters, substring analysis, information about protein coding regions in existing DNA sequences and protein database searches to detect protein products related to a query EST sequence. Gene Ontology terms, Swiss-Prot keywords, and protein similarity data are used to map the ESTs to functional descriptors. Conclusion EST2Prot extends and significantly enriches the popular UniGene mapping by utilizing multiple relations between known biological entities. It produces a mapping between ESTs and proteins in real-time through a simple web-interface. The system is part of the Biozon database and is accessible at http://biozon.org/tools/est/.

  11. Molecular evolution of the Paramyxoviridae and Rhabdoviridae multiple-protein-encoding P gene.

    Science.gov (United States)

    Jordan, I K; Sutter, B A; McClure, M A

    2000-01-01

    Presented here is an analysis of the molecular evolutionary dynamics of the P gene among 76 representative sequences of the Paramyxoviridae and Rhabdoviridae RNA virus families. In a number of Paramyxoviridae taxa, as well as in vesicular stomatitis viruses of the Rhabdoviridae, the P gene encodes multiple proteins from a single genomic RNA sequence. These products include the phosphoprotein (P), as well as the C and V proteins. The complexity of the P gene makes it an intriguing locus to study from an evolutionary perspective. Amino acid sequence alignments of the proteins encoded at the P and N loci were used in independent phylogenetic reconstructions of the Paramyxoviridae and Rhabdoviridae families. P-gene-coding capacities were mapped onto the Paramyxoviridae phylogeny, and the most parsimonious path of multiple-coding-capacity evolution was determined. Levels of amino acid variation for Paramyxoviridae and Rhabdoviridae P-gene-encoded products were also analyzed. Proteins encoded in overlapping reading frames from the same nucleotides have different levels of amino acid variation. The nucleotide architecture that underlies the amino acid variation was determined in order to evaluate the role of selection in the evolution of the P gene overlapping reading frames. In every case, the evolution of one of the proteins encoded in the overlapping reading frames has been constrained by negative selection while the other has evolved more rapidly. The integrity of the overlapping reading frame that represents a derived state is generally maintained at the expense of the ancestral reading frame encoded by the same nucleotides. The evolution of such multicoding sequences is likely a response by RNA viruses to selective pressure to maximize genomic information content while maintaining small genome size. The ability to evolve such a complex genomic strategy is intimately related to the dynamics of the viral quasispecies, which allow enhanced exploration of the adaptive

  12. HomPPI: a class of sequence homology based protein-protein interface prediction methods

    Directory of Open Access Journals (Sweden)

    Dobbs Drena

    2011-06-01

    Full Text Available Abstract Background Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. Results We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i NPS-HomPPI (Non partner-specific HomPPI, which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii PS-HomPPI (Partner-specific HomPPI, which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of

  13. Sequence analysis corresponding to the PPE and PE proteins in ...

    Indian Academy of Sciences (India)

    Unknown

    AB repeats; Mycobacterium tuberculosis genome; PE-PPE domain; PPE, PE proteins; sequence analysis; surface antigens. J. Biosci. | Vol. ... bacterium tuberculosis genomes resulted in the identification of a previously uncharacterized 225 amino acid- ...... Vega Lopez F, Brooks L A, Dockrell H M, De Smet K A,. Thompson ...

  14. Representation of protein-sequence information by amino acid subalphabets

    DEFF Research Database (Denmark)

    Andersen, C.A.F.; Brunak, Søren

    2004-01-01

    -sequence information, using machine learning strategies, where the primary goal is the discovery of novel powerful representations for use in AI techniques. In the case of proteins and the 20 different amino acids they typically contain, it is also a secondary goal to discover how the current selection of amino acids...

  15. Tracing early stellar evolution with asteroseismology: pre-main sequence stars in NGC 2264

    Directory of Open Access Journals (Sweden)

    Zwintz Konstanze

    2015-01-01

    Full Text Available Asteroseismology has been proven to be a successful tool to unravel details of the internal structure for different types of stars in various stages of their main sequence and post-main sequence evolution. Recently, we found a relation between the detected pulsation properties in a sample of 34 pre-main sequence (pre-MS δ Scuti stars and the relative phase in their pre-MS evolution. With this we are able to demonstrate that asteroseismology is similarly powerful if applied to stars in the earliest stages of evolution before the onset of hydrogen core burning.

  16. GuiTope: an application for mapping random-sequence peptides to protein sequences.

    Science.gov (United States)

    Halperin, Rebecca F; Stafford, Phillip; Emery, Jack S; Navalkar, Krupa Arun; Johnston, Stephen Albert

    2012-01-03

    Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task. GuiTope provides a graphical user interface for aligning peptide sequences to protein sequences. All alignment parameters are accessible to the user including the ability to specify the amino acid frequency in the peptide library; these frequencies often differ significantly from those assumed by popular alignment programs. It also includes a novel feature to align di-peptide inversions, which we have found improves the accuracy of antibody epitope prediction from peptide microarray data and shows utility in analyzing phage display datasets. Finally, GuiTope can randomly select peptides from a given library to estimate a null distribution of scores and calculate statistical significance. GuiTope provides a convenient method for comparing selected peptide sequences to protein sequences, including flexible alignment parameters, novel alignment features, ability to search a database, and statistical significance of results. The software is available as an executable (for PC) at http://www.immunosignature.com/software and ongoing updates and source code will be available at sourceforge.net.

  17. GuiTope: an application for mapping random-sequence peptides to protein sequences

    Directory of Open Access Journals (Sweden)

    Halperin Rebecca F

    2012-01-01

    Full Text Available Abstract Background Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task. Results GuiTope provides a graphical user interface for aligning peptide sequences to protein sequences. All alignment parameters are accessible to the user including the ability to specify the amino acid frequency in the peptide library; these frequencies often differ significantly from those assumed by popular alignment programs. It also includes a novel feature to align di-peptide inversions, which we have found improves the accuracy of antibody epitope prediction from peptide microarray data and shows utility in analyzing phage display datasets. Finally, GuiTope can randomly select peptides from a given library to estimate a null distribution of scores and calculate statistical significance. Conclusions GuiTope provides a convenient method for comparing selected peptide sequences to protein sequences, including flexible alignment parameters, novel alignment features, ability to search a database, and statistical significance of results. The software is available as an executable (for PC at http://www.immunosignature.com/software and ongoing updates and source code will be available at sourceforge.net.

  18. The HMMER Web Server for Protein Sequence Similarity Search.

    Science.gov (United States)

    Prakash, Ananth; Jeffryes, Matt; Bateman, Alex; Finn, Robert D

    2017-12-08

    Protein sequence similarity search is one of the most commonly used bioinformatics methods for identifying evolutionarily related proteins. In general, sequences that are evolutionarily related share some degree of similarity, and sequence-search algorithms use this principle to identify homologs. The requirement for a fast and sensitive sequence search method led to the development of the HMMER software, which in the latest version (v3.1) uses a combination of sophisticated acceleration heuristics and mathematical and computational optimizations to enable the use of profile hidden Markov models (HMMs) for sequence analysis. The HMMER Web server provides a common platform by linking the HMMER algorithms to databases, thereby enabling the search for homologs, as well as providing sequence and functional annotation by linking external databases. This unit describes three basic protocols and two alternate protocols that explain how to use the HMMER Web server using various input formats and user defined parameters. © 2017 by John Wiley & Sons, Inc. Copyright © 2017 John Wiley & Sons, Inc.

  19. alpha-Crystallin A sequences of Alligator mississippiensis and the lizard Tupinambis teguixin: molecular evolution and reptilian phylogeny.

    Science.gov (United States)

    de Jong, W W; Zweers, A; Versteeg, M; Dessauer, H C; Goodman, M

    1985-11-01

    The amino acid sequences of the eye lens protein alpha-crystallin A from many mammalian and avian species, two frog species, and a dogfish have provided detailed information about the molecular evolution of this protein and allowed some useful inferences about phylogenetic relationships among these species. We now have isolated and sequenced the alpha-crystallins of the American alligator and the common tegu lizard. The reptilian alpha A chains appear to have evolved as slowly as those of other vertebrates, i.e., at two to three amino acid replacements per 100 residues in 100 Myr. The lack of charged replacements and the general types and distribution of replacements also are similar to those in other vertebrate alpha A chains. Maximum-parsimony analyses of the total data set of 67 vertebrate alpha A sequences support the monophyletic origin of alligator, tegu, and birds and favor the grouping of crocodilians and birds as surviving sister groups in the subclass Archosauria.

  20. The consequences of sequence erosion in the evolution of recombination hotspots.

    Science.gov (United States)

    Tiemann-Boege, Irene; Schwarz, Theresa; Striedner, Yasmin; Heissl, Angelika

    2017-12-19

    Meiosis is initiated by a double-strand break (DSB) introduced in the DNA by a highly controlled process that is repaired by recombination. In many organisms, recombination occurs at specific and narrow regions of the genome, known as recombination hotspots, which overlap with regions enriched for DSBs. In recent years, it has been demonstrated that conversions and mutations resulting from the repair of DSBs lead to a rapid sequence evolution at recombination hotspots eroding target sites for DSBs. We still do not fully understand the effect of this erosion in the recombination activity, but evidence has shown that the binding of trans -acting factors like PRDM9 is affected. PRDM9 is a meiosis-specific, multi-domain protein that recognizes DNA target motifs by its zinc finger domain and directs DSBs to these target sites. Here we discuss the changes in affinity of PRDM9 to eroded recognition sequences, and explain how these changes in affinity of PRDM9 can affect recombination, leading sometimes to sterility in the context of hybrid crosses. We also present experimental data showing that DNA methylation reduces PRDM9 binding in vitro Finally, we discuss PRDM9-independent hotspots, posing the question how these hotspots evolve and change with sequence erosion.This article is part of the themed issue 'Evolutionary causes and consequences of recombination rate variation in sexual organisms'. © 2017 The Authors.

  1. Pix proteins and the evolution of centrioles.

    Directory of Open Access Journals (Sweden)

    Hugh R Woodland

    Full Text Available We have made a wide phylogenetic survey of Pix proteins, which are constituents of vertebrate centrioles in most eukaryotes. We have also surveyed the presence and structure of flagella or cilia and centrioles in these organisms, as far as is possible from published information. We find that Pix proteins are present in a vast range of eukaryotes, but not all. Where centrioles are absent so are Pix proteins. If one considers the maintenance of Pix proteins over evolutionary time scales, our analysis would suggest that their key function is to make cilia and flagella, and the same is true of centrioles. Moreover, this survey raises the possibility that Pix proteins are only maintained to make cilia and flagella that undulate, and even then only when they are constructed by transporting ciliary constituents up the cilium using the intraflagellar transport (IFT system. We also find that Pix proteins have become generally divergent within Ecdysozoa and between this group and other taxa. This correlates with a simplification of centrioles within Ecdysozoa and a loss or divergence of cilia/flagella. Thus Pix proteins act as a weathervane to indicate changes in centriole function, whose core activity is to make cilia and flagella.

  2. Pix proteins and the evolution of centrioles.

    Science.gov (United States)

    Woodland, Hugh R; Fry, Andrew M

    2008-01-01

    We have made a wide phylogenetic survey of Pix proteins, which are constituents of vertebrate centrioles in most eukaryotes. We have also surveyed the presence and structure of flagella or cilia and centrioles in these organisms, as far as is possible from published information. We find that Pix proteins are present in a vast range of eukaryotes, but not all. Where centrioles are absent so are Pix proteins. If one considers the maintenance of Pix proteins over evolutionary time scales, our analysis would suggest that their key function is to make cilia and flagella, and the same is true of centrioles. Moreover, this survey raises the possibility that Pix proteins are only maintained to make cilia and flagella that undulate, and even then only when they are constructed by transporting ciliary constituents up the cilium using the intraflagellar transport (IFT) system. We also find that Pix proteins have become generally divergent within Ecdysozoa and between this group and other taxa. This correlates with a simplification of centrioles within Ecdysozoa and a loss or divergence of cilia/flagella. Thus Pix proteins act as a weathervane to indicate changes in centriole function, whose core activity is to make cilia and flagella.

  3. Evolution of EF-hand calcium-modulated proteins. IV. Exon shuffling did not determine the domain compositions of EF-hand proteins

    Science.gov (United States)

    Kretsinger, R. H.; Nakayama, S.

    1993-01-01

    In the previous three reports in this series we demonstrated that the EF-hand family of proteins evolved by a complex pattern of gene duplication, transposition, and splicing. The dendrograms based on exon sequences are nearly identical to those based on protein sequences for troponin C, the essential light chain myosin, the regulatory light chain, and calpain. This validates both the computational methods and the dendrograms for these subfamilies. The proposal of congruence for calmodulin, troponin C, essential light chain, and regulatory light chain was confirmed. There are, however, significant differences in the calmodulin dendrograms computed from DNA and from protein sequences. In this study we find that introns are distributed throughout the EF-hand domain and the interdomain regions. Further, dendrograms based on intron type and distribution bear little resemblance to those based on protein or on DNA sequences. We conclude that introns are inserted, and probably deleted, with relatively high frequency. Further, in the EF-hand family exons do not correspond to structural domains and exon shuffling played little if any role in the evolution of this widely distributed homolog family. Calmodulin has had a turbulent evolution. Its dendrograms based on protein sequence, exon sequence, 3'-tail sequence, intron sequences, and intron positions all show significant differences.

  4. Modeling compositional dynamics based on GC and purine contents of protein-coding sequences

    KAUST Repository

    Zhang, Zhang; Yu, Jun

    2010-01-01

    Background: Understanding the compositional dynamics of genomes and their coding sequences is of great significance in gaining clues into molecular evolution and a large number of publically-available genome sequences have allowed us to quantitatively predict deviations of empirical data from their theoretical counterparts. However, the quantification of theoretical compositional variations for a wide diversity of genomes remains a major challenge.Results: To model the compositional dynamics of protein-coding sequences, we propose two simple models that take into account both mutation and selection effects, which act differently at the three codon positions, and use both GC and purine contents as compositional parameters. The two models concern the theoretical composition of nucleotides, codons, and amino acids, with no prerequisite of homologous sequences or their alignments. We evaluated the two models by quantifying theoretical compositions of a large collection of protein-coding sequences (including 46 of Archaea, 686 of Bacteria, and 826 of Eukarya), yielding consistent theoretical compositions across all the collected sequences.Conclusions: We show that the compositions of nucleotides, codons, and amino acids are largely determined by both GC and purine contents and suggest that deviations of the observed from the expected compositions may reflect compositional signatures that arise from a complex interplay between mutation and selection via DNA replication and repair mechanisms.Reviewers: This article was reviewed by Zhaolei Zhang (nominated by Mark Gerstein), Guruprasad Ananda (nominated by Kateryna Makova), and Daniel Haft. 2010 Zhang and Yu; licensee BioMed Central Ltd.

  5. Modeling compositional dynamics based on GC and purine contents of protein-coding sequences

    KAUST Repository

    Zhang, Zhang

    2010-11-08

    Background: Understanding the compositional dynamics of genomes and their coding sequences is of great significance in gaining clues into molecular evolution and a large number of publically-available genome sequences have allowed us to quantitatively predict deviations of empirical data from their theoretical counterparts. However, the quantification of theoretical compositional variations for a wide diversity of genomes remains a major challenge.Results: To model the compositional dynamics of protein-coding sequences, we propose two simple models that take into account both mutation and selection effects, which act differently at the three codon positions, and use both GC and purine contents as compositional parameters. The two models concern the theoretical composition of nucleotides, codons, and amino acids, with no prerequisite of homologous sequences or their alignments. We evaluated the two models by quantifying theoretical compositions of a large collection of protein-coding sequences (including 46 of Archaea, 686 of Bacteria, and 826 of Eukarya), yielding consistent theoretical compositions across all the collected sequences.Conclusions: We show that the compositions of nucleotides, codons, and amino acids are largely determined by both GC and purine contents and suggest that deviations of the observed from the expected compositions may reflect compositional signatures that arise from a complex interplay between mutation and selection via DNA replication and repair mechanisms.Reviewers: This article was reviewed by Zhaolei Zhang (nominated by Mark Gerstein), Guruprasad Ananda (nominated by Kateryna Makova), and Daniel Haft. 2010 Zhang and Yu; licensee BioMed Central Ltd.

  6. Sequence of a cDNA encoding turtle high mobility group 1 protein.

    Science.gov (United States)

    Zheng, Jifang; Hu, Bi; Wu, Duansheng

    2005-07-01

    In order to understand sequence information about turtle HMG1 gene, a cDNA encoding HMG1 protein of the Chinese soft-shell turtle (Pelodiscus sinensis) was amplified by RT-PCR from kidney total RNA, and was cloned, sequenced and analyzed. The results revealed that the open reading frame (ORF) of turtle HMG1 cDNA is 606 bp long. The ORF codifies 202 amino acid residues, from which two DNA-binding domains and one polyacidic region are derived. The DNA-binding domains share higher amino acid identity with homologues sequences of chicken (96.5%) and mammalian (74%) than homologues sequence of rainbow trout (67%). The polyacidic region shows 84.6% amino acid homology with the equivalent region of chicken HMG1 cDNA. Turtle HMG1 protein contains 3 Cys residues located at completely conserved positions. Conservation in sequence and structure suggests that the functions of turtle HMG1 cDNA may be highly conserved during evolution. To our knowledge, this is the first report of HMG1 cDNA sequence in any reptilian.

  7. Protein sequencing via nanopore based devices: a nanofluidics perspective

    Science.gov (United States)

    Chinappi, Mauro; Cecconi, Fabio

    2018-05-01

    Proteins perform a huge number of central functions in living organisms, thus all the new techniques allowing their precise, fast and accurate characterization at single-molecule level certainly represent a burst in proteomics with important biomedical impact. In this review, we describe the recent progresses in the developing of nanopore based devices for protein sequencing. We start with a critical analysis of the main technical requirements for nanopore protein sequencing, summarizing some ideas and methodologies that have recently appeared in the literature. In the last sections, we focus on the physical modelling of the transport phenomena occurring in nanopore based devices. The multiscale nature of the problem is discussed and, in this respect, some of the main possible computational approaches are illustrated.

  8. Sequence heterogeneity accelerates protein search for targets on DNA

    International Nuclear Information System (INIS)

    Shvets, Alexey A.; Kolomeisky, Anatoly B.

    2015-01-01

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome

  9. Sequence heterogeneity accelerates protein search for targets on DNA

    Energy Technology Data Exchange (ETDEWEB)

    Shvets, Alexey A.; Kolomeisky, Anatoly B., E-mail: tolya@rice.edu [Department of Chemistry and Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005 (United States)

    2015-12-28

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome.

  10. Determining and comparing protein function in Bacterial genome sequences

    DEFF Research Database (Denmark)

    Vesth, Tammi Camilla

    of this class have very little homology to other known genomes making functional annotation based on sequence similarity very difficult. Inspired in part by this analysis, an approach for comparative functional annotation was created based public sequenced genomes, CMGfunc. Functionally related groups......In November 2013, there was around 21.000 different prokaryotic genomes sequenced and publicly available, and the number is growing daily with another 20.000 or more genomes expected to be sequenced and deposited by the end of 2014. An important part of the analysis of this data is the functional...... annotation of genes – the descriptions assigned to genes that describe the likely function of the encoded proteins. This process is limited by several factors, including the definition of a function which can be more or less specific as well as how many genes can actually be assigned a function based...

  11. Relationships between residue Voronoi volume and sequence conservation in proteins.

    Science.gov (United States)

    Liu, Jen-Wei; Cheng, Chih-Wen; Lin, Yu-Feng; Chen, Shao-Yu; Hwang, Jenn-Kang; Yen, Shih-Chung

    2018-02-01

    Functional and biophysical constraints can cause different levels of sequence conservation in proteins. Previously, structural properties, e.g., relative solvent accessibility (RSA) and packing density of the weighted contact number (WCN), have been found to be related to protein sequence conservation (CS). The Voronoi volume has recently been recognized as a new structural property of the local protein structural environment reflecting CS. However, for surface residues, it is sensitive to water molecules surrounding the protein structure. Herein, we present a simple structural determinant termed the relative space of Voronoi volume (RSV); it uses the Voronoi volume and the van der Waals volume of particular residues to quantify the local structural environment. RSV (range, 0-1) is defined as (Voronoi volume-van der Waals volume)/Voronoi volume of the target residue. The concept of RSV describes the extent of available space for every protein residue. RSV and Voronoi profiles with and without water molecules (RSVw, RSV, VOw, and VO) were compared for 554 non-homologous proteins. RSV (without water) showed better Pearson's correlations with CS than did RSVw, VO, or VOw values. The mean correlation coefficient between RSV and CS was 0.51, which is comparable to the correlation between RSA and CS (0.49) and that between WCN and CS (0.56). RSV is a robust structural descriptor with and without water molecules and can quantitatively reflect evolutionary information in a single protein structure. Therefore, it may represent a practical structural determinant to study protein sequence, structure, and function relationships. Copyright © 2017 Elsevier B.V. All rights reserved.

  12. Evolution favors protein mutational robustness in sufficiently large populations

    Directory of Open Access Journals (Sweden)

    Venturelli Ophelia S

    2007-07-01

    Full Text Available Abstract Background An important question is whether evolution favors properties such as mutational robustness or evolvability that do not directly benefit any individual, but can influence the course of future evolution. Functionally similar proteins can differ substantially in their robustness to mutations and capacity to evolve new functions, but it has remained unclear whether any of these differences might be due to evolutionary selection for these properties. Results Here we use laboratory experiments to demonstrate that evolution favors protein mutational robustness if the evolving population is sufficiently large. We neutrally evolve cytochrome P450 proteins under identical selection pressures and mutation rates in populations of different sizes, and show that proteins from the larger and thus more polymorphic population tend towards higher mutational robustness. Proteins from the larger population also evolve greater stability, a biophysical property that is known to enhance both mutational robustness and evolvability. The excess mutational robustness and stability is well described by mathematical theory, and can be quantitatively related to the way that the proteins occupy their neutral network. Conclusion Our work is the first experimental demonstration of the general tendency of evolution to favor mutational robustness and protein stability in highly polymorphic populations. We suggest that this phenomenon could contribute to the mutational robustness and evolvability of viruses and bacteria that exist in large populations.

  13. Biased Gene Conversion and GC-Content Evolution in the Coding Sequences of Reptiles and Vertebrates

    Science.gov (United States)

    Figuet, Emeric; Ballenghien, Marion; Romiguier, Jonathan; Galtier, Nicolas

    2015-01-01

    Mammalian and avian genomes are characterized by a substantial spatial heterogeneity of GC-content, which is often interpreted as reflecting the effect of local GC-biased gene conversion (gBGC), a meiotic repair bias that favors G and C over A and T alleles in high-recombining genomic regions. Surprisingly, the first fully sequenced nonavian sauropsid (i.e., reptile), the green anole Anolis carolinensis, revealed a highly homogeneous genomic GC-content landscape, suggesting the possibility that gBGC might not be at work in this lineage. Here, we analyze GC-content evolution at third-codon positions (GC3) in 44 vertebrates species, including eight newly sequenced transcriptomes, with a specific focus on nonavian sauropsids. We report that reptiles, including the green anole, have a genome-wide distribution of GC3 similar to that of mammals and birds, and we infer a strong GC3-heterogeneity to be already present in the tetrapod ancestor. We further show that the dynamic of coding sequence GC-content is largely governed by karyotypic features in vertebrates, notably in the green anole, in agreement with the gBGC hypothesis. The discrepancy between third-codon positions and noncoding DNA regarding GC-content dynamics in the green anole could not be explained by the activity of transposable elements or selection on codon usage. This analysis highlights the unique value of third-codon positions as an insertion/deletion-free marker of nucleotide substitution biases that ultimately affect the evolution of proteins. PMID:25527834

  14. New Measurement for Correlation of Co-evolution Relationship of Subsequences in Protein.

    Science.gov (United States)

    Gao, Hongyun; Yu, Xiaoqing; Dou, Yongchao; Wang, Jun

    2015-12-01

    Many computational tools have been developed to measure the protein residues co-evolution. Most of them only focus on co-evolution for pairwise residues in a protein sequence. However, number of residues participate in co-evolution might be multiple. And some co-evolved residues are clustered in several distinct regions in primary structure. Therefore, the co-evolution among the adjacent residues and the correlation between the distinct regions offer insights into function and evolution of the protein and residues. Subsequence is used to represent the adjacent multiple residues in one distinct region. In the paper, co-evolution relationship in each subsequence is represented by mutual information matrix (MIM). Then, Pearson's correlation coefficient: R value is developed to measure the similarity correlation of two MIMs. MSAs from Catalytic Data Base (Catalytic Site Atlas, CSA) are used for testing. R value characterizes a specific class of residues. In contrast to individual pairwise co-evolved residues, adjacent residues without high individual MI values are found since the co-evolved relationship among them is similar to that among another set of adjacent residues. These subsequences possess some flexibility in the composition of side chains, such as the catalyzed environment.

  15. Ultra-fast evaluation of protein energies directly from sequence.

    Directory of Open Access Journals (Sweden)

    Gevorg Grigoryan

    2006-06-01

    Full Text Available The structure, function, stability, and many other properties of a protein in a fixed environment are fully specified by its sequence, but in a manner that is difficult to discern. We present a general approach for rapidly mapping sequences directly to their energies on a pre-specified rigid backbone, an important sub-problem in computational protein design and in some methods for protein structure prediction. The cluster expansion (CE method that we employ can, in principle, be extended to model any computable or measurable protein property directly as a function of sequence. Here we show how CE can be applied to the problem of computational protein design, and use it to derive excellent approximations of physical potentials. The approach provides several attractive advantages. First, following a one-time derivation of a CE expansion, the amount of time necessary to evaluate the energy of a sequence adopting a specified backbone conformation is reduced by a factor of 10(7 compared to standard full-atom methods for the same task. Second, the agreement between two full-atom methods that we tested and their CE sequence-based expressions is very high (root mean square deviation 1.1-4.7 kcal/mol, R2 = 0.7-1.0. Third, the functional form of the CE energy expression is such that individual terms of the expansion have clear physical interpretations. We derived expressions for the energies of three classic protein design targets-a coiled coil, a zinc finger, and a WW domain-as functions of sequence, and examined the most significant terms. Single-residue and residue-pair interactions are sufficient to accurately capture the energetics of the dimeric coiled coil, whereas higher-order contributions are important for the two more globular folds. For the task of designing novel zinc-finger sequences, a CE-derived energy function provides significantly better solutions than a standard design protocol, in comparable computation time. Given these advantages

  16. Studying the co-evolution of protein families with the Mirrortree web server.

    Science.gov (United States)

    Ochoa, David; Pazos, Florencio

    2010-05-15

    The Mirrortree server allows to graphically and interactively study the co-evolution of two protein families, and investigate their possible interactions and functional relationships in a taxonomic context. The server includes the possibility of starting from single sequences and hence it can be used by non-expert users. The web server is freely available at http://csbg.cnb.csic.es/mtserver. It was tested in the main web browsers. Adobe Flash Player is required at the client side to perform the interactive assessment of co-evolution. pazos@cnb.csic.es Supplementary data are available at Bioinformatics online.

  17. Refining the Results of a Classical SELEX Experiment by Expanding the Sequence Data Set of an Aptamer Pool Selected for Protein A

    OpenAIRE

    Regina Stoltenburg; Beate Strehlitz

    2018-01-01

    New, as yet undiscovered aptamers for Protein A were identified by applying next generation sequencing (NGS) to a previously selected aptamer pool. This pool was obtained in a classical SELEX (Systematic Evolution of Ligands by EXponential enrichment) experiment using the FluMag-SELEX procedure followed by cloning and Sanger sequencing. PA#2/8 was identified as the only Protein A-binding aptamer from the Sanger sequence pool, and was shown to be able to bind intact cells of Staphylococcus aur...

  18. Predicting the tolerated sequences for proteins and protein interfaces using RosettaBackrub flexible backbone design.

    Directory of Open Access Journals (Sweden)

    Colin A Smith

    Full Text Available Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface, interactions between and within parts of the structure (e.g. domains can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.

  19. Automatic discovery of cross-family sequence features associated with protein function

    Directory of Open Access Journals (Sweden)

    Krings Andrea

    2006-01-01

    Full Text Available Abstract Background Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterised protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or otherwise, to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed. Results We show that a self-supervised data mining approach is able to find relationships between sequence features and functional annotations. No preconceived ideas about functional categories are required, and the training data is simply a set of protein sequences and their UniProt/Swiss-Prot annotations. The main technical aspect of the approach is the co-evolution of amino acid-based regular expressions and keyword-based logical expressions with genetic programming. Our experiments on a strictly non-redundant set of eukaryotic proteins reveal that the strongest and most easily detected sequence-to-function relationships are concerned with targeting to various cellular compartments, which is an area already well studied both experimentally and computationally. Of more interest are a number of broad functional roles which can also be correlated with sequence features. These include inhibition, biosynthesis, transcription and defence against bacteria. Despite substantial overlaps between these functions and their corresponding cellular compartments, we find clear differences in the sequence motifs used to predict some of these functions. For example, the presence of polyglutamine repeats appears to be linked more strongly to the "transcription" function than to the general "nuclear" function/location. Conclusion We have developed a novel and useful approach for

  20. SAAS: Short Amino Acid Sequence - A Promising Protein Secondary Structure Prediction Method of Single Sequence

    Directory of Open Access Journals (Sweden)

    Zhou Yuan Wu

    2013-07-01

    Full Text Available In statistical methods of predicting protein secondary structure, many researchers focus on single amino acid frequencies in α-helices, β-sheets, and so on, or the impact near amino acids on an amino acid forming a secondary structure. But the paper considers a short sequence of amino acids (3, 4, 5 or 6 amino acids as integer, and statistics short sequence's probability forming secondary structure. Also, many researchers select low homologous sequences as statistical database. But this paper select whole PDB database. In this paper we propose a strategy to predict protein secondary structure using simple statistical method. Numerical computation shows that, short amino acids sequence as integer to statistics, which can easy see trend of short sequence forming secondary structure, and it will work well to select large statistical database (whole PDB database without considering homologous, and Q3 accuracy is ca. 74% using this paper proposed simple statistical method, but accuracy of others statistical methods is less than 70%.

  1. Protein sequences bound to mineral surfaces persist into deep time

    DEFF Research Database (Denmark)

    Demarchi, Beatrice; Hall, Shaun; Roncal-Herrero, Teresa

    2016-01-01

    of Laetoli (3.8 Ma) and Olduvai Gorge (1.3 Ma) in Tanzania. By tracking protein diagenesis back in time we find consistent patterns of preservation, demonstrating authenticity of the surviving sequences. Molecular dynamics simulations of struthiocalcin-1 and -2, the dominant proteins within the eggshell......, reveal that distinct domains bind to the mineral surface. It is the domain with the strongest calculated binding energy to the calcite surface that is selectively preserved. Thermal age calculations demonstrate that the Laetoli and Olduvai peptides are 50 times older than any previously authenticated...

  2. Mis-translation of a Computationally Designed Protein Yields an Exceptionally Stable Homodimer: Implications for Protein Engineering and Evolution.

    Energy Technology Data Exchange (ETDEWEB)

    Dantas, Gautam; Watters, Alexander L.; Lunde, Bradley; Eletr, Ziad; Isern, Nancy G.; Roseman, Toby; Lipfert, Jan; Doniach, Sebastian; Tompa, Martin; Kuhlman, Brian; Stoddard, Barry L.; Varani, Gabriele; Baker, David

    2006-10-06

    We recently used computational protein design to create an extremely stable, globular protein, Top7, with a sequence and fold not observed previously in nature. Since Top7 was created in the absence of genetic selection, it provides a rare opportunity to investigate aspects of the cellular protein production and surveillance machinery that are subject to natural selection. Here we show that a portion of the Top7 protein corresponding to the final 49 C-terminal residues is efficiently mistranslated and accumulates at high levels in E. coli. We used circular dichroism spectroscopy, size-exclusion chromatography, small-angle x-ray scattering, analytical ultra-centrifugation, and NMR spectroscopy to show that the resulting CFr protein adopts a compact, extremely-stable, obligate, symmetric, homo-dimeric structure. Based on the solution structure, we engineered an even more stable variant of CFr by disulfide-induced covalent circularisation that should be an excellent platform for design of novel functions. The accumulation of high levels of CFr exposes the high error rate of the protein translation machinery, and the rarity of correspondingly stable fragments in natural proteins implies a stringent evolutionary pressure against protein sub-fragments that can independently fold into stable structures. The symmetric self-association between two identical mistranslated CFr sub-units to generate an extremely stable structure parallels a mechanism for natural protein-fold evolution by modular recombination of stable protein sub-structures.

  3. Computational identification of MoRFs in protein sequences.

    Science.gov (United States)

    Malhis, Nawar; Gsponer, Jörg

    2015-06-01

    Intrinsically disordered regions of proteins play an essential role in the regulation of various biological processes. Key to their regulatory function is the binding of molecular recognition features (MoRFs) to globular protein domains in a process known as a disorder-to-order transition. Predicting the location of MoRFs in protein sequences with high accuracy remains an important computational challenge. In this study, we introduce MoRFCHiBi, a new computational approach for fast and accurate prediction of MoRFs in protein sequences. MoRFCHiBi combines the outcomes of two support vector machine (SVM) models that take advantage of two different kernels with high noise tolerance. The first, SVMS, is designed to extract maximal information from the general contrast in amino acid compositions between MoRFs, their surrounding regions (Flanks), and the remainders of the sequences. The second, SVMT, is used to identify similarities between regions in a query sequence and MoRFs of the training set. We evaluated the performance of our predictor by comparing its results with those of two currently available MoRF predictors, MoRFpred and ANCHOR. Using three test sets that have previously been collected and used to evaluate MoRFpred and ANCHOR, we demonstrate that MoRFCHiBi outperforms the other predictors with respect to different evaluation metrics. In addition, MoRFCHiBi is downloadable and fast, which makes it useful as a component in other computational prediction tools. http://www.chibi.ubc.ca/morf/. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  4. Evolutionary conservation of nuclear and nucleolar targeting sequences in yeast ribosomal protein S6A

    International Nuclear Information System (INIS)

    Lipsius, Edgar; Walter, Korden; Leicher, Torsten; Phlippen, Wolfgang; Bisotti, Marc-Angelo; Kruppa, Joachim

    2005-01-01

    Over 1 billion years ago, the animal kingdom diverged from the fungi. Nevertheless, a high sequence homology of 62% exists between human ribosomal protein S6 and S6A of Saccharomyces cerevisiae. To investigate whether this similarity in primary structure is mirrored in corresponding functional protein domains, the nuclear and nucleolar targeting signals were delineated in yeast S6A and compared to the known human S6 signals. The complete sequence of S6A and cDNA fragments was fused to the 5'-end of the LacZ gene, the constructs were transiently expressed in COS cells, and the subcellular localization of the fusion proteins was detected by indirect immunofluorescence. One bipartite and two monopartite nuclear localization signals as well as two nucleolar binding domains were identified in yeast S6A, which are located at homologous regions in human S6 protein. Remarkably, the number, nature, and position of these targeting signals have been conserved, albeit their amino acid sequences have presumably undergone a process of co-evolution with their corresponding rRNAs

  5. Functional analysis of bipartite begomovirus coat protein promoter sequences

    International Nuclear Information System (INIS)

    Lacatus, Gabriela; Sunter, Garry

    2008-01-01

    We demonstrate that the AL2 gene of Cabbage leaf curl virus (CaLCuV) activates the CP promoter in mesophyll and acts to derepress the promoter in vascular tissue, similar to that observed for Tomato golden mosaic virus (TGMV). Binding studies indicate that sequences mediating repression and activation of the TGMV and CaLCuV CP promoter specifically bind different nuclear factors common to Nicotiana benthamiana, spinach and tomato. However, chromatin immunoprecipitation demonstrates that TGMV AL2 can interact with both sequences independently. Binding of nuclear protein(s) from different crop species to viral sequences conserved in both bipartite and monopartite begomoviruses, including TGMV, CaLCuV, Pepper golden mosaic virus and Tomato yellow leaf curl virus suggests that bipartite begomoviruses bind common host factors to regulate the CP promoter. This is consistent with a model in which AL2 interacts with different components of the cellular transcription machinery that bind viral sequences important for repression and activation of begomovirus CP promoters

  6. Directed evolution of an extremely stable fluorescent protein.

    Science.gov (United States)

    Kiss, Csaba; Temirov, Jamshid; Chasteen, Leslie; Waldo, Geoffrey S; Bradbury, Andrew R M

    2009-05-01

    In this paper we describe the evolution of eCGP123, an extremely stable green fluorescent protein based on a previously described fluorescent protein created by consensus engineering (CGP: consensus green protein). eCGP123 could not be denatured by a standard thermal melt, preserved almost full fluorescence after overnight incubation at 80 degrees C and possessed a free energy of denaturation of 12.4 kcal/mol. It was created from CGP by a recursive process involving the sequential introduction of three destabilizing heterologous inserts, evolution to overcome the destabilization and finally 'removal' of the destabilizing insert by gene synthesis. We believe that this approach may be generally applicable to the stabilization of other proteins.

  7. Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition.

    Science.gov (United States)

    Hayat, Maqsood; Khan, Asifullah

    2011-02-21

    Membrane proteins are vital type of proteins that serve as channels, receptors, and energy transducers in a cell. Prediction of membrane protein types is an important research area in bioinformatics. Knowledge of membrane protein types provides some valuable information for predicting novel example of the membrane protein types. However, classification of membrane protein types can be both time consuming and susceptible to errors due to the inherent similarity of membrane protein types. In this paper, neural networks based membrane protein type prediction system is proposed. Composite protein sequence representation (CPSR) is used to extract the features of a protein sequence, which includes seven feature sets; amino acid composition, sequence length, 2 gram exchange group frequency, hydrophobic group, electronic group, sum of hydrophobicity, and R-group. Principal component analysis is then employed to reduce the dimensionality of the feature vector. The probabilistic neural network (PNN), generalized regression neural network, and support vector machine (SVM) are used as classifiers. A high success rate of 86.01% is obtained using SVM for the jackknife test. In case of independent dataset test, PNN yields the highest accuracy of 95.73%. These classifiers exhibit improved performance using other performance measures such as sensitivity, specificity, Mathew's correlation coefficient, and F-measure. The experimental results show that the prediction performance of the proposed scheme for classifying membrane protein types is the best reported, so far. This performance improvement may largely be credited to the learning capabilities of neural networks and the composite feature extraction strategy, which exploits seven different properties of protein sequences. The proposed Mem-Predictor can be accessed at http://111.68.99.218/Mem-Predictor. Copyright © 2010 Elsevier Ltd. All rights reserved.

  8. Lithium evolution in metal-poor stars: from Pre-Main Sequence to the Spite plateau

    OpenAIRE

    Fu, Xiaoting; Bressan, Alessandro; Molaro, Paolo; Marigo, Paola

    2015-01-01

    Lithium abundance derived in metal-poor main sequence stars is about three times lower than the value of primordial Li predicted by the standard Big Bang nucleosynthesis when the baryon density is taken from the CMB or the deuterium measurements. This disagreement is generally referred as the lithium problem. We here reconsider the stellar Li evolution from the pre-main sequence to the end of the main sequence phase by introducing the effects of convective overshooting and residual mass accre...

  9. ACCELERATED EVOLUTION OF LAND SNAILS MANDARINA IN THE OCEANIC BONIN ISLANDS: EVIDENCE FROM MITOCHONDRIAL DNA SEQUENCES.

    Science.gov (United States)

    Chiba, Satoshi

    1999-04-01

    An endemic land snail genus Mandarina of the oceanic Bonin (Ogasawara) Islands shows exceptionally rapid evolution not only of morphological and ecological traits, but of DNA sequence. A phylogenetic relationship based on mitochondrial DNA (mtDNA) sequences suggests that morphological differences equivalent to the differences between families were produced between Mandarina and its ancestor during the Pleistocene. The inferred phylogeny shows that species with similar morphologies and life habitats appeared repeatedly and independently in different lineages and islands at different times. Sequential adaptive radiations occurred in different islands of the Bonin Islands and species occupying arboreal, semiarboreal, and terrestrial habitat arose independently in each island. Because of a close relationship between shell morphology and life habitat, independent evolution of the same life habitat in different islands created species possesing the same shell morphology in different islands and lineages. This rapid evolution produced some incongruences between phylogenetic relationship and species taxonomy. Levels of sequence divergence of mtDNA among the species of Mandarina is extremely high. The maximum level of sequence divergence at 16S and 12S ribosomal RNA sequence within Mandarina are 18.7% and 17.7%, respectively, and this suggests that evolution of mtDNA of Mandarina is extremely rapid, more than 20 times faster than the standard rate in other animals. The present examination reveals that evolution of morphological and ecological traits occurs at extremely high rates in the time of adaptive radiation, especially in fragmented environments. © 1999 The Society for the Study of Evolution.

  10. Differential evolution optimization combined with chaotic sequences for image contrast enhancement

    Energy Technology Data Exchange (ETDEWEB)

    Santos Coelho, Leandro dos [Industrial and Systems Engineering Graduate Program, LAS/PPGEPS, Pontifical Catholic University of Parana, PUCPR Imaculada Conceicao, 1155, 80215-901 Curitiba, Parana (Brazil)], E-mail: leandro.coelho@pucpr.br; Sauer, Joao Guilherme [Industrial and Systems Engineering Graduate Program, LAS/PPGEPS, Pontifical Catholic University of Parana, PUCPR Imaculada Conceicao, 1155, 80215-901 Curitiba, Parana (Brazil)], E-mail: joao.sauer@gmail.com; Rudek, Marcelo [Industrial and Systems Engineering Graduate Program, LAS/PPGEPS, Pontifical Catholic University of Parana, PUCPR Imaculada Conceicao, 1155, 80215-901 Curitiba, Parana (Brazil)], E-mail: marcelo.rudek@pucpr.br

    2009-10-15

    Evolutionary Algorithms (EAs) are stochastic and robust meta-heuristics of evolutionary computation field useful to solve optimization problems in image processing applications. Recently, as special mechanism to avoid being trapped in local minimum, the ergodicity property of chaotic sequences has been used in various designs of EAs. Three differential evolution approaches based on chaotic sequences using logistic equation for image enhancement process are proposed in this paper. Differential evolution is a simple yet powerful evolutionary optimization algorithm that has been successfully used in solving continuous problems. The proposed chaotic differential evolution schemes have fast convergence rate but also maintain the diversity of the population so as to escape from local optima. In this paper, the image contrast enhancement is approached as a constrained nonlinear optimization problem. The objective of the proposed chaotic differential evolution schemes is to maximize the fitness criterion in order to enhance the contrast and detail in the image by adapting the parameters using a contrast enhancement technique. The proposed chaotic differential evolution schemes are compared with classical differential evolution to two testing images. Simulation results on three images show that the application of chaotic sequences instead of random sequences is a possible strategy to improve the performance of classical differential evolution optimization algorithm.

  11. Co-evolution of transcriptional silencing proteins and the DNA elements specifying their assembly.

    Directory of Open Access Journals (Sweden)

    Oliver A Zill

    Full Text Available Co-evolution of transcriptional regulatory proteins and their sites of action has been often hypothesized but rarely demonstrated. Here we provide experimental evidence of such co-evolution in yeast silent chromatin, a finding that emerged from studies of hybrids formed between two closely related Saccharomyces species. A unidirectional silencing incompatibility between S. cerevisiae and S. bayanus led to a key discovery: asymmetrical complementation of divergent orthologs of the silent chromatin component Sir4. In S. cerevisiae/S. bayanus interspecies hybrids, ChIP-Seq analysis revealed a restriction against S. cerevisiae Sir4 associating with most S. bayanus silenced regions; in contrast, S. bayanus Sir4 associated with S. cerevisiae silenced loci to an even greater degree than did S. cerevisiae's own Sir4. Functional changes in silencer sequences paralleled changes in Sir4 sequence and a reduction in Sir1 family members in S. cerevisiae. Critically, species-specific silencing of the S. bayanus HMR locus could be reconstituted in S. cerevisiae by co-transfer of the S. bayanus Sir4 and Kos3 (the ancestral relative of Sir1 proteins. As Sir1/Kos3 and Sir4 bind conserved silencer-binding proteins, but not specific DNA sequences, these rapidly evolving proteins served to interpret differences in the two species' silencers presumably involving emergent features created by the regulatory proteins that bind sequences within silencers. The results presented here, and in particular the high resolution ChIP-Seq localization of the Sir4 protein, provided unanticipated insights into the mechanism of silent chromatin assembly in yeast.

  12. The evolution and utility of ribosomal ITS sequences in Bambusinae ...

    Indian Academy of Sciences (India)

    2012-08-09

    Aug 9, 2012 ... both direct sequencing and cloning of PCR products were used for ITS analysis, with only one clone ... woody bamboos belonging to this subtribe. The recent availability of molecular data enabled tax- ..... Gigantochloa species occur mainly in Malaysia, and they can be distinguished upon flowering by the ...

  13. CISAPS: Complex Informational Spectrum for the Analysis of Protein Sequences

    Directory of Open Access Journals (Sweden)

    Charalambos Chrysostomou

    2015-01-01

    Full Text Available Complex informational spectrum analysis for protein sequences (CISAPS and its web-based server are developed and presented. As recent studies show, only the use of the absolute spectrum in the analysis of protein sequences using the informational spectrum analysis is proven to be insufficient. Therefore, CISAPS is developed to consider and provide results in three forms including absolute, real, and imaginary spectrum. Biologically related features to the analysis of influenza A subtypes as presented as a case study in this study can also appear individually either in the real or imaginary spectrum. As the results presented, protein classes can present similarities or differences according to the features extracted from CISAPS web server. These associations are probable to be related with the protein feature that the specific amino acid index represents. In addition, various technical issues such as zero-padding and windowing that may affect the analysis are also addressed. CISAPS uses an expanded list of 611 unique amino acid indices where each one represents a different property to perform the analysis. This web-based server enables researchers with little knowledge of signal processing methods to apply and include complex informational spectrum analysis to their work.

  14. Dinoflagellate phylogeny as inferred from heat shock protein 90 and ribosomal gene sequences.

    Directory of Open Access Journals (Sweden)

    Mona Hoppenrath

    2010-10-01

    Full Text Available Interrelationships among dinoflagellates in molecular phylogenies are largely unresolved, especially in the deepest branches. Ribosomal DNA (rDNA sequences provide phylogenetic signals only at the tips of the dinoflagellate tree. Two reasons for the poor resolution of deep dinoflagellate relationships using rDNA sequences are (1 most sites are relatively conserved and (2 there are different evolutionary rates among sites in different lineages. Therefore, alternative molecular markers are required to address the deeper phylogenetic relationships among dinoflagellates. Preliminary evidence indicates that the heat shock protein 90 gene (Hsp90 will provide an informative marker, mainly because this gene is relatively long and appears to have relatively uniform rates of evolution in different lineages.We more than doubled the previous dataset of Hsp90 sequences from dinoflagellates by generating additional sequences from 17 different species, representing seven different orders. In order to concatenate the Hsp90 data with rDNA sequences, we supplemented the Hsp90 sequences with three new SSU rDNA sequences and five new LSU rDNA sequences. The new Hsp90 sequences were generated, in part, from four additional heterotrophic dinoflagellates and the type species for six different genera. Molecular phylogenetic analyses resulted in a paraphyletic assemblage near the base of the dinoflagellate tree consisting of only athecate species. However, Noctiluca was never part of this assemblage and branched in a position that was nested within other lineages of dinokaryotes. The phylogenetic trees inferred from Hsp90 sequences were consistent with trees inferred from rDNA sequences in that the backbone of the dinoflagellate clade was largely unresolved.The sequence conservation in both Hsp90 and rDNA sequences and the poor resolution of the deepest nodes suggests that dinoflagellates reflect an explosive radiation in morphological diversity in their recent

  15. Automated design evolution of stereochemically randomized protein foldamers

    Science.gov (United States)

    Ranbhor, Ranjit; Kumar, Anil; Patel, Kirti; Ramakrishnan, Vibin; Durani, Susheel

    2018-05-01

    Diversification of chain stereochemistry opens up the possibilities of an ‘in principle’ increase in the design space of proteins. This huge increase in the sequence and consequent structural variation is aimed at the generation of smart materials. To diversify protein structure stereochemically, we introduced L- and D-α-amino acids as the design alphabet. With a sequence design algorithm, we explored the usage of specific variables such as chirality and the sequence of this alphabet in independent steps. With molecular dynamics, we folded stereochemically diverse homopolypeptides and evaluated their ‘fitness’ for possible design as protein-like foldamers. We propose a fitness function to prune the most optimal fold among 1000 structures simulated with an automated repetitive simulated annealing molecular dynamics (AR-SAMD) approach. The highly scored poly-leucine fold with sequence lengths of 24 and 30 amino acids were later sequence-optimized using a Dead End Elimination cum Monte Carlo based optimization tool. This paper demonstrates a novel approach for the de novo design of protein-like foldamers.

  16. Prediction of protein-protein interaction sites in sequences and 3D structures by random forests.

    Directory of Open Access Journals (Sweden)

    Mile Sikić

    2009-01-01

    Full Text Available Identifying interaction sites in proteins provides important clues to the function of a protein and is becoming increasingly relevant in topics such as systems biology and drug discovery. Although there are numerous papers on the prediction of interaction sites using information derived from structure, there are only a few case reports on the prediction of interaction residues based solely on protein sequence. Here, a sliding window approach is combined with the Random Forests method to predict protein interaction sites using (i a combination of sequence- and structure-derived parameters and (ii sequence information alone. For sequence-based prediction we achieved a precision of 84% with a 26% recall and an F-measure of 40%. When combined with structural information, the prediction performance increases to a precision of 76% and a recall of 38% with an F-measure of 51%. We also present an attempt to rationalize the sliding window size and demonstrate that a nine-residue window is the most suitable for predictor construction. Finally, we demonstrate the applicability of our prediction methods by modeling the Ras-Raf complex using predicted interaction sites as target binding interfaces. Our results suggest that it is possible to predict protein interaction sites with quite a high accuracy using only sequence information.

  17. Molecular evolution of the keratin associated protein gene family in mammals, role in the evolution of mammalian hair.

    Science.gov (United States)

    Wu, Dong-Dong; Irwin, David M; Zhang, Ya-Ping

    2008-08-23

    Hair is unique to mammals. Keratin associated proteins (KRTAPs), which contain two major groups: high/ultrahigh cysteine and high glycine-tyrosine, are one of the major components of hair and play essential roles in the formation of rigid and resistant hair shafts. The KRTAP family was identified as being unique to mammals, and near-complete KRTAP gene repertoires for eight mammalian genomes were characterized in this study. An expanded KRTAP gene repertoire was found in rodents. Surprisingly, humans have a similar number of genes as other primates despite the relative hairlessness of humans. We identified several new subfamilies not previously reported in the high/ultrahigh cysteine KRTAP genes. Genes in many subfamilies of the high/ultrahigh cysteine KRTAP genes have evolved by concerted evolution with frequent gene conversion events, yielding a higher GC base content for these gene sequences. In contrast, the high glycine-tyrosine KRTAP genes have evolved more dynamically, with fewer gene conversion events and thus have a lower GC base content, possibly due to positive selection. Most of the subfamilies emerged early in the evolution of mammals, thus we propose that the mammalian ancestor should have a diverse KRTAP gene repertoire. We propose that hair content characteristics have evolved and diverged rapidly among mammals because of rapid divergent evolution of KRTAPs between species. In contrast, subfamilies of KRTAP genes have been homogenized within each species due to concerted evolution.

  18. Adaptive compressive learning for prediction of protein-protein interactions from primary sequence.

    Science.gov (United States)

    Zhang, Ya-Nan; Pan, Xiao-Yong; Huang, Yan; Shen, Hong-Bin

    2011-08-21

    Protein-protein interactions (PPIs) play an important role in biological processes. Although much effort has been devoted to the identification of novel PPIs by integrating experimental biological knowledge, there are still many difficulties because of lacking enough protein structural and functional information. It is highly desired to develop methods based only on amino acid sequences for predicting PPIs. However, sequence-based predictors are often struggling with the high-dimensionality causing over-fitting and high computational complexity problems, as well as the redundancy of sequential feature vectors. In this paper, a novel computational approach based on compressed sensing theory is proposed to predict yeast Saccharomyces cerevisiae PPIs from primary sequence and has achieved promising results. The key advantage of the proposed compressed sensing algorithm is that it can compress the original high-dimensional protein sequential feature vector into a much lower but more condensed space taking the sparsity property of the original signal into account. What makes compressed sensing much more attractive in protein sequence analysis is its compressed signal can be reconstructed from far fewer measurements than what is usually considered necessary in traditional Nyquist sampling theory. Experimental results demonstrate that proposed compressed sensing method is powerful for analyzing noisy biological data and reducing redundancy in feature vectors. The proposed method represents a new strategy of dealing with high-dimensional protein discrete model and has great potentiality to be extended to deal with many other complicated biological systems. Copyright © 2011 Elsevier Ltd. All rights reserved.

  19. Analysis of correlations between sites in models of protein sequences

    International Nuclear Information System (INIS)

    Giraud, B.G.; Lapedes, A.; Liu, L.C.

    1998-01-01

    A criterion based on conditional probabilities, related to the concept of algorithmic distance, is used to detect correlated mutations at noncontiguous sites on sequences. We apply this criterion to the problem of analyzing correlations between sites in protein sequences; however, the analysis applies generally to networks of interacting sites with discrete states at each site. Elementary models, where explicit results can be derived easily, are introduced. The number of states per site considered ranges from 2, illustrating the relation to familiar classical spin systems, to 20 states, suitable for representing amino acids. Numerical simulations show that the criterion remains valid even when the genetic history of the data samples (e.g., protein sequences), as represented by a phylogenetic tree, introduces nonindependence between samples. Statistical fluctuations due to finite sampling are also investigated and do not invalidate the criterion. A subsidiary result is found: The more homogeneous a population, the more easily its average properties can drift from the properties of its ancestor. copyright 1998 The American Physical Society

  20. Rapid evolution of coral proteins responsible for interaction with the environment.

    Science.gov (United States)

    Voolstra, Christian R; Sunagawa, Shinichi; Matz, Mikhail V; Bayer, Till; Aranda, Manuel; Buschiazzo, Emmanuel; Desalvo, Michael K; Lindquist, Erika; Szmant, Alina M; Coffroth, Mary Alice; Medina, Mónica

    2011-01-01

    Corals worldwide are in decline due to climate change effects (e.g., rising seawater temperatures), pollution, and exploitation. The ability of corals to cope with these stressors in the long run depends on the evolvability of the underlying genetic networks and proteins, which remain largely unknown. A genome-wide scan for positively selected genes between related coral species can help to narrow down the search space considerably. We screened a set of 2,604 putative orthologs from EST-based sequence datasets of the coral species Acropora millepora and Acropora palmata to determine the fraction and identity of proteins that may experience adaptive evolution. 7% of the orthologs show elevated rates of evolution. Taxonomically-restricted (i.e. lineage-specific) genes show a positive selection signature more frequently than genes that are found across many animal phyla. The class of proteins that displayed elevated evolutionary rates was significantly enriched for proteins involved in immunity and defense, reproduction, and sensory perception. We also found elevated rates of evolution in several other functional groups such as management of membrane vesicles, transmembrane transport of ions and organic molecules, cell adhesion, and oxidative stress response. Proteins in these processes might be related to the endosymbiotic relationship corals maintain with dinoflagellates in the genus Symbiodinium. This study provides a birds-eye view of the processes potentially underlying coral adaptation, which will serve as a foundation for future work to elucidate the rates, patterns, and mechanisms of corals' evolutionary response to global climate change.

  1. Isolation of Hox cluster genes from insects reveals an accelerated sequence evolution rate.

    Directory of Open Access Journals (Sweden)

    Heike Hadrys

    Full Text Available Among gene families it is the Hox genes and among metazoan animals it is the insects (Hexapoda that have attracted particular attention for studying the evolution of development. Surprisingly though, no Hox genes have been isolated from 26 out of 35 insect orders yet, and the existing sequences derive mainly from only two orders (61% from Hymenoptera and 22% from Diptera. We have designed insect specific primers and isolated 37 new partial homeobox sequences of Hox cluster genes (lab, pb, Hox3, ftz, Antp, Scr, abd-a, Abd-B, Dfd, and Ubx from six insect orders, which are crucial to insect phylogenetics. These new gene sequences provide a first step towards comparative Hox gene studies in insects. Furthermore, comparative distance analyses of homeobox sequences reveal a correlation between gene divergence rate and species radiation success with insects showing the highest rate of homeobox sequence evolution.

  2. Computational design of chimeric protein libraries for directed evolution.

    Science.gov (United States)

    Silberg, Jonathan J; Nguyen, Peter Q; Stevenson, Taylor

    2010-01-01

    The best approach for creating libraries of functional proteins with large numbers of nondisruptive amino acid substitutions is protein recombination, in which structurally related polypeptides are swapped among homologous proteins. Unfortunately, as more distantly related proteins are recombined, the fraction of variants having a disrupted structure increases. One way to enrich the fraction of folded and potentially interesting chimeras in these libraries is to use computational algorithms to anticipate which structural elements can be swapped without disturbing the integrity of a protein's structure. Herein, we describe how the algorithm Schema uses the sequences and structures of the parent proteins recombined to predict the structural disruption of chimeras, and we outline how dynamic programming can be used to find libraries with a range of amino acid substitution levels that are enriched in variants with low Schema disruption.

  3. Genepleio Software for Effective Estimation of Gene Pleiotropy from Protein Sequences

    Directory of Open Access Journals (Sweden)

    Wenhai Chen

    2015-01-01

    Full Text Available Though pleiotropy, which refers to the phenomenon of a gene affecting multiple traits, has long played a central role in genetics, development, and evolution, estimation of the number of pleiotropy components remains a hard mission to accomplish. In this paper, we report a newly developed software package, Genepleio, to estimate the effective gene pleiotropy from phylogenetic analysis of protein sequences. Since this estimate can be interpreted as the minimum pleiotropy of a gene, it is used to play a role of reference for many empirical pleiotropy measures. This work would facilitate our understanding of how gene pleiotropy affects the pattern of genotype-phenotype map and the consequence of organismal evolution.

  4. Protein model discrimination using mutational sensitivity derived from deep sequencing.

    Science.gov (United States)

    Adkar, Bharat V; Tripathi, Arti; Sahoo, Anusmita; Bajaj, Kanika; Goswami, Devrishi; Chakrabarti, Purbani; Swarnkar, Mohit K; Gokhale, Rajesh S; Varadarajan, Raghavan

    2012-02-08

    A major bottleneck in protein structure prediction is the selection of correct models from a pool of decoys. Relative activities of ∼1,200 individual single-site mutants in a saturation library of the bacterial toxin CcdB were estimated by determining their relative populations using deep sequencing. This phenotypic information was used to define an empirical score for each residue (RankScore), which correlated with the residue depth, and identify active-site residues. Using these correlations, ∼98% of correct models of CcdB (RMSD ≤ 4Å) were identified from a large set of decoys. The model-discrimination methodology was further validated on eleven different monomeric proteins using simulated RankScore values. The methodology is also a rapid, accurate way to obtain relative activities of each mutant in a large pool and derive sequence-structure-function relationships without protein isolation or characterization. It can be applied to any system in which mutational effects can be monitored by a phenotypic readout. Copyright © 2012 Elsevier Ltd. All rights reserved.

  5. The Evolution of Bony Vertebrate Enhancers at Odds with Their Coding Sequence Landscape.

    Science.gov (United States)

    Yousaf, Aisha; Sohail Raza, Muhammad; Ali Abbasi, Amir

    2015-08-06

    Enhancers lie at the heart of transcriptional and developmental gene regulation. Therefore, changes in enhancer sequences usually disrupt the target gene expression and result in disease phenotypes. Despite the well-established role of enhancers in development and disease, evolutionary sequence studies are lacking. The current study attempts to unravel the puzzle of bony vertebrates' conserved noncoding elements (CNE) enhancer evolution. Bayesian phylogenetics of enhancer sequences spotlights promising interordinal relationships among placental mammals, proposing a closer relationship between humans and laurasiatherians while placing rodents at the basal position. Clock-based estimates of enhancer evolution provided a dynamic picture of interspecific rate changes across the bony vertebrate lineage. Moreover, coelacanth in the study augmented our appreciation of the vertebrate cis-regulatory evolution during water-land transition. Intriguingly, we observed a pronounced upsurge in enhancer evolution in land-dwelling vertebrates. These novel findings triggered us to further investigate the evolutionary trend of coding as well as CNE nonenhancer repertoires, to highlight the relative evolutionary dynamics of diverse genomic landscapes. Surprisingly, the evolutionary rates of enhancer sequences were clearly at odds with those of the coding and the CNE nonenhancer sequences during vertebrate adaptation to land, with land vertebrates exhibiting significantly reduced rates of coding sequence evolution in comparison to their fast evolving regulatory landscape. The observed variation in tetrapod cis-regulatory elements caused the fine-tuning of associated gene regulatory networks. Therefore, the increased evolutionary rate of tetrapods' enhancer sequences might be responsible for the variation in developmental regulatory circuits during the process of vertebrate adaptation to land. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for

  6. PhyloSim - Monte Carlo simulation of sequence evolution in the R statistical computing environment

    Directory of Open Access Journals (Sweden)

    Massingham Tim

    2011-04-01

    Full Text Available Abstract Background The Monte Carlo simulation of sequence evolution is routinely used to assess the performance of phylogenetic inference methods and sequence alignment algorithms. Progress in the field of molecular evolution fuels the need for more realistic and hence more complex simulations, adapted to particular situations, yet current software makes unreasonable assumptions such as homogeneous substitution dynamics or a uniform distribution of indels across the simulated sequences. This calls for an extensible simulation framework written in a high-level functional language, offering new functionality and making it easy to incorporate further complexity. Results PhyloSim is an extensible framework for the Monte Carlo simulation of sequence evolution, written in R, using the Gillespie algorithm to integrate the actions of many concurrent processes such as substitutions, insertions and deletions. Uniquely among sequence simulation tools, PhyloSim can simulate arbitrarily complex patterns of rate variation and multiple indel processes, and allows for the incorporation of selective constraints on indel events. User-defined complex patterns of mutation and selection can be easily integrated into simulations, allowing PhyloSim to be adapted to specific needs. Conclusions Close integration with R and the wide range of features implemented offer unmatched flexibility, making it possible to simulate sequence evolution under a wide range of realistic settings. We believe that PhyloSim will be useful to future studies involving simulated alignments.

  7. Evolution of light-harvesting complex proteins from Chl c-containing algae

    Directory of Open Access Journals (Sweden)

    Puerta M Virginia

    2011-04-01

    Full Text Available Abstract Background Light harvesting complex (LHC proteins function in photosynthesis by binding chlorophyll (Chl and carotenoid molecules that absorb light and transfer the energy to the reaction center Chl of the photosystem. Most research has focused on LHCs of plants and chlorophytes that bind Chl a and b and extensive work on these proteins has uncovered a diversity of biochemical functions, expression patterns and amino acid sequences. We focus here on a less-studied family of LHCs that typically bind Chl a and c, and that are widely distributed in Chl c-containing and other algae. Previous phylogenetic analyses of these proteins suggested that individual algal lineages possess proteins from one or two subfamilies, and that most subfamilies are characteristic of a particular algal lineage, but genome-scale datasets had revealed that some species have multiple different forms of the gene. Such observations also suggested that there might have been an important influence of endosymbiosis in the evolution of LHCs. Results We reconstruct a phylogeny of LHCs from Chl c-containing algae and related lineages using data from recent sequencing projects to give ~10-fold larger taxon sampling than previous studies. The phylogeny indicates that individual taxa possess proteins from multiple LHC subfamilies and that several LHC subfamilies are found in distantly related algal lineages. This phylogenetic pattern implies functional differentiation of the gene families, a hypothesis that is consistent with data on gene expression, carotenoid binding and physical associations with other LHCs. In all probability LHCs have undergone a complex history of evolution of function, gene transfer, and lineage-specific diversification. Conclusion The analysis provides a strikingly different picture of LHC diversity than previous analyses of LHC evolution. Individual algal lineages possess proteins from multiple LHC subfamilies. Evolutionary relationships showed

  8. Protein consensus-based surface engineering (ProCoS): a computer-assisted method for directed protein evolution.

    Science.gov (United States)

    Shivange, Amol V; Hoeffken, Hans Wolfgang; Haefner, Stefan; Schwaneberg, Ulrich

    2016-12-01

    Protein consensus-based surface engineering (ProCoS) is a simple and efficient method for directed protein evolution combining computational analysis and molecular biology tools to engineer protein surfaces. ProCoS is based on the hypothesis that conserved residues originated from a common ancestor and that these residues are crucial for the function of a protein, whereas highly variable regions (situated on the surface of a protein) can be targeted for surface engineering to maximize performance. ProCoS comprises four main steps: ( i ) identification of conserved and highly variable regions; ( ii ) protein sequence design by substituting residues in the highly variable regions, and gene synthesis; ( iii ) in vitro DNA recombination of synthetic genes; and ( iv ) screening for active variants. ProCoS is a simple method for surface mutagenesis in which multiple sequence alignment is used for selection of surface residues based on a structural model. To demonstrate the technique's utility for directed evolution, the surface of a phytase enzyme from Yersinia mollaretii (Ymphytase) was subjected to ProCoS. Screening just 1050 clones from ProCoS engineering-guided mutant libraries yielded an enzyme with 34 amino acid substitutions. The surface-engineered Ymphytase exhibited 3.8-fold higher pH stability (at pH 2.8 for 3 h) and retained 40% of the enzyme's specific activity (400 U/mg) compared with the wild-type Ymphytase. The pH stability might be attributed to a significantly increased (20 percentage points; from 9% to 29%) number of negatively charged amino acids on the surface of the engineered phytase.

  9. Nucleation phenomena in protein folding: the modulating role of protein sequence

    International Nuclear Information System (INIS)

    Travasso, Rui D M; FaIsca, Patricia F N; Gama, Margarida M Telo da

    2007-01-01

    For the vast majority of naturally occurring, small, single-domain proteins, folding is often described as a two-state process that lacks detectable intermediates. This observation has often been rationalized on the basis of a nucleation mechanism for protein folding whose basic premise is the idea that, after completion of a specific set of contacts forming the so-called folding nucleus, the native state is achieved promptly. Here we propose a methodology to identify folding nuclei in small lattice polymers and apply it to the study of protein molecules with a chain length of N = 48. To investigate the extent to which protein topology is a robust determinant of the nucleation mechanism, we compare the nucleation scenario of a native-centric model with that of a sequence-specific model sharing the same native fold. To evaluate the impact of the sequence's finer details in the nucleation mechanism, we consider the folding of two non-homologous sequences. We conclude that, in a sequence-specific model, the folding nucleus is, to some extent, formed by the most stable contacts in the protein and that the less stable linkages in the folding nucleus are solely determined by the fold's topology. We have also found that, independently of the protein sequence, the folding nucleus performs the same 'topological' function. This unifying feature of the nucleation mechanism results from the residues forming the folding nucleus being distributed along the protein chain in a similar and well-defined manner that is determined by the fold's topological features

  10. Functional divergence outlines the evolution of novel protein ...

    Indian Academy of Sciences (India)

    2013-10-01

    Oct 1, 2013 ... identified a number of vital amino acid sites which contribute to predicted functional diversity. We have ... Taking this into account, in this study we looked into the possibility of ... to the structure of NifH protein and solubility accessibility of ..... ment through sequence weighting, position-specific gap penalties.

  11. Bidirectional gene sequences with similar homology to functional proteins of alkane degrading bacterium pseudomonas fredriksbergensis DNA

    International Nuclear Information System (INIS)

    Megeed, A.A.

    2011-01-01

    The potential for two overlapping fragments of DNA from a clone of newly isolated alkanes degrading bacterium Pseudomonas frederiksbergensis encoding sequences with similar homology to two parts of functional proteins is described. One strand contains a sequence with high homology to alkanes monooxygenase (alkB), a member of the alkanes hydroxylase family, and the other strand contains a sequence with some homology to alcohol dehydrogenase gene (alkJ). Overlapping of the genes on opposite strands has been reported in eukaryotic species, and is now reported in a bacterial species. The sequence comparisons and ORFS results revealed that the regulation and the genes organization involved in alkane oxidation represented in Pseudomonas frederiksberghensis varies among the different known alkane degrading bacteria. The alk gene cluster containing homologues to the known alkane monooxygenase (alkB), and rubredoxin (alkG) are oriented in the same direction, whereas alcohol dehydrogenase (alkJ) is oriented in the opposite direction. Such genomes encode messages on both strands of the DNA, or in an overlapping but different reading frames, of the same strand of DNA. The possibility of creating novel genes from pre-existing sequences, known as overprinting, which is a widespread phenomenon in small viruses. Here, the origin and evolution of the gene overlap to bacteriophages belonging to the family Microviridae have been investigated. Such a phenomenon is most widely described in extremely small genomes such as those of viruses or small plasmids, yet here is a unique phenomenon. (author)

  12. Characterization of bud emergence 46 (BEM46) protein: Sequence, structural, phylogenetic and subcellular localization analyses

    International Nuclear Information System (INIS)

    Kumar, Abhishek; Kollath-Leiß, Krisztina; Kempken, Frank

    2013-01-01

    Highlights: •All eukaryotes have at least a single copy of a bem46 ortholog. •The catalytic triad of BEM46 is illustrated using sequence and structural analysis. •We identified indels in the conserved domain of BEM46 protein. •Localization studies of BEM46 protein were carried out using GFP-fusion tagging. -- Abstract: The bud emergence 46 (BEM46) protein from Neurospora crassa belongs to the α/β-hydrolase superfamily. Recently, we have reported that the BEM46 protein is localized in the perinuclear ER and also forms spots close by the plasma membrane. The protein appears to be required for cell type-specific polarity formation in N. crassa. Furthermore, initial studies suggested that the BEM46 amino acid sequence is conserved in eukaryotes and is considered to be one of the widespread conserved “known unknown” eukaryotic genes. This warrants for a comprehensive phylogenetic analysis of this superfamily to unravel origin and molecular evolution of these genes in different eukaryotes. Herein, we observe that all eukaryotes have at least a single copy of a bem46 ortholog. Upon scanning of these proteins in various genomes, we find that there are expansions leading into several paralogs in vertebrates. Usingcomparative genomic analyses, we identified insertion/deletions (indels) in the conserved domain of BEM46 protein, which allow to differentiate fungal classes such as ascomycetes from basidiomycetes. We also find that exonic indels are able to differentiate BEM46 homologs of different eukaryotic lineage. Furthermore, we unravel that BEM46 protein from N. crassa possess a novel endoplasmic-retention signal (PEKK) using GFP-fusion tagging experiments. We propose that three residues namely a serine 188S, a histidine 292H and an aspartic acid 262D are most critical residues, forming a catalytic triad in BEM46 protein from N. crassa. We carried out a comprehensive study on bem46 genes from a molecular evolution perspective with combination of functional

  13. Characterization of bud emergence 46 (BEM46) protein: Sequence, structural, phylogenetic and subcellular localization analyses

    Energy Technology Data Exchange (ETDEWEB)

    Kumar, Abhishek; Kollath-Leiß, Krisztina; Kempken, Frank, E-mail: fkempken@bot.uni-kiel.de

    2013-08-30

    Highlights: •All eukaryotes have at least a single copy of a bem46 ortholog. •The catalytic triad of BEM46 is illustrated using sequence and structural analysis. •We identified indels in the conserved domain of BEM46 protein. •Localization studies of BEM46 protein were carried out using GFP-fusion tagging. -- Abstract: The bud emergence 46 (BEM46) protein from Neurospora crassa belongs to the α/β-hydrolase superfamily. Recently, we have reported that the BEM46 protein is localized in the perinuclear ER and also forms spots close by the plasma membrane. The protein appears to be required for cell type-specific polarity formation in N. crassa. Furthermore, initial studies suggested that the BEM46 amino acid sequence is conserved in eukaryotes and is considered to be one of the widespread conserved “known unknown” eukaryotic genes. This warrants for a comprehensive phylogenetic analysis of this superfamily to unravel origin and molecular evolution of these genes in different eukaryotes. Herein, we observe that all eukaryotes have at least a single copy of a bem46 ortholog. Upon scanning of these proteins in various genomes, we find that there are expansions leading into several paralogs in vertebrates. Usingcomparative genomic analyses, we identified insertion/deletions (indels) in the conserved domain of BEM46 protein, which allow to differentiate fungal classes such as ascomycetes from basidiomycetes. We also find that exonic indels are able to differentiate BEM46 homologs of different eukaryotic lineage. Furthermore, we unravel that BEM46 protein from N. crassa possess a novel endoplasmic-retention signal (PEKK) using GFP-fusion tagging experiments. We propose that three residues namely a serine 188S, a histidine 292H and an aspartic acid 262D are most critical residues, forming a catalytic triad in BEM46 protein from N. crassa. We carried out a comprehensive study on bem46 genes from a molecular evolution perspective with combination of functional

  14. Organization and evolution of primate centromeric DNA from whole-genome shotgun sequence data.

    Directory of Open Access Journals (Sweden)

    Can Alkan

    2007-09-01

    Full Text Available The major DNA constituent of primate centromeres is alpha satellite DNA. As much as 2%-5% of sequence generated as part of primate genome sequencing projects consists of this material, which is fragmented or not assembled as part of published genome sequences due to its highly repetitive nature. Here, we develop computational methods to rapidly recover and categorize alpha-satellite sequences from previously uncharacterized whole-genome shotgun sequence data. We present an algorithm to computationally predict potential higher-order array structure based on paired-end sequence data and then experimentally validate its organization and distribution by experimental analyses. Using whole-genome shotgun data from the human, chimpanzee, and macaque genomes, we examine the phylogenetic relationship of these sequences and provide further support for a model for their evolution and mutation over the last 25 million years. Our results confirm fundamental differences in the dispersal and evolution of centromeric satellites in the Old World monkey and ape lineages of evolution.

  15. Organization and evolution of primate centromeric DNA from whole-genome shotgun sequence data.

    Science.gov (United States)

    Alkan, Can; Ventura, Mario; Archidiacono, Nicoletta; Rocchi, Mariano; Sahinalp, S Cenk; Eichler, Evan E

    2007-09-01

    The major DNA constituent of primate centromeres is alpha satellite DNA. As much as 2%-5% of sequence generated as part of primate genome sequencing projects consists of this material, which is fragmented or not assembled as part of published genome sequences due to its highly repetitive nature. Here, we develop computational methods to rapidly recover and categorize alpha-satellite sequences from previously uncharacterized whole-genome shotgun sequence data. We present an algorithm to computationally predict potential higher-order array structure based on paired-end sequence data and then experimentally validate its organization and distribution by experimental analyses. Using whole-genome shotgun data from the human, chimpanzee, and macaque genomes, we examine the phylogenetic relationship of these sequences and provide further support for a model for their evolution and mutation over the last 25 million years. Our results confirm fundamental differences in the dispersal and evolution of centromeric satellites in the Old World monkey and ape lineages of evolution.

  16. Comparative Genomics Identifies Epidermal Proteins Associated with the Evolution of the Turtle Shell.

    Science.gov (United States)

    Holthaus, Karin Brigit; Strasser, Bettina; Sipos, Wolfgang; Schmidt, Heiko A; Mlitz, Veronika; Sukseree, Supawadee; Weissenbacher, Anton; Tschachler, Erwin; Alibardi, Lorenzo; Eckhart, Leopold

    2016-03-01

    The evolution of reptiles, birds, and mammals was associated with the origin of unique integumentary structures. Studies on lizards, chicken, and humans have suggested that the evolution of major structural proteins of the outermost, cornified layers of the epidermis was driven by the diversification of a gene cluster called Epidermal Differentiation Complex (EDC). Turtles have evolved unique defense mechanisms that depend on mechanically resilient modifications of the epidermis. To investigate whether the evolution of the integument in these reptiles was associated with specific adaptations of the sequences and expression patterns of EDC-related genes, we utilized newly available genome sequences to determine the epidermal differentiation gene complement of turtles. The EDC of the western painted turtle (Chrysemys picta bellii) comprises more than 100 genes, including at least 48 genes that encode proteins referred to as beta-keratins or corneous beta-proteins. Several EDC proteins have evolved cysteine/proline contents beyond 50% of total amino acid residues. Comparative genomics suggests that distinct subfamilies of EDC genes have been expanded and partly translocated to loci outside of the EDC in turtles. Gene expression analysis in the European pond turtle (Emys orbicularis) showed that EDC genes are differentially expressed in the skin of the various body sites and that a subset of beta-keratin genes within the EDC as well as those located outside of the EDC are expressed predominantly in the shell. Our findings give strong support to the hypothesis that the evolutionary innovation of the turtle shell involved specific molecular adaptations of epidermal differentiation. © The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  17. Mitochondrial genome evolution in Alismatales: Size reduction and extensive loss of ribosomal protein genes

    DEFF Research Database (Denmark)

    Petersen, Gitte; Cuenca, Argelia; Zervas, Athanasios

    2017-01-01

    The order Alismatales is a hotspot for evolution of plant mitochondrial genomes characterized by remarkable differences in genome size, substitution rates, RNA editing, retrotranscription, gene loss and intron loss. Here we have sequenced the complete mitogenomes of Zostera marina and Stratiotes...... aloides, which together with previously sequenced mitogenomes from Butomus and Spirodela, provide new evolutionary evidence of genome size reduction, gene loss and transfer to the nucleus. The Zostera mitogenome includes a large portion of DNA transferred from the plastome, yet it is the smallest known...... mitogenome from a non-parasitic plant. Using a broad sample of the Alismatales, the evolutionary history of ribosomal protein gene loss is analyzed. In Zostera almost all ribosomal protein genes are lost from the mitogenome, but only some can be found in the nucleus....

  18. Reproductive protein evolution in two cryptic species of marine chordate

    Science.gov (United States)

    2011-01-01

    Background Reproductive character displacement (RCD) is a common and taxonomically widespread pattern. In marine broadcast spawning organisms, behavioral and mechanical isolation are absent and prezygotic barriers between species often operate only during the fertilization process. Such barriers are usually a consequence of differences in the way in which sperm and egg proteins interact, so RCD can be manifest as faster evolution of these proteins between species in sympatry than allopatry. Rapid evolution of these proteins often appears to be a consequence of positive (directional) selection. Here, we identify a set of candidate gamete recognition proteins (GRPs) in the ascidian Ciona intestinalis and showed that these GRPs evolve more rapidly than control proteins (those not involved in gamete recognition). Choosing a subset of these gamete recognition proteins that show evidence of positive selection (CIPRO37.40.1, CIPRO60.5.1, CIPRO100.7.1), we then directly test the RCD hypothesis by comparing divergence (omega) and polymorphism (McDonald-Kreitman, Tajima's D, Fu and Li's D and F, Fay and Wu's H) statistics in sympatric and allopatric populations of two distinct forms of C. intestinalis (Types A and B) between which there are strong post-zygotic barriers. Results Candidate gamete recognition proteins from two lineages of C. intestinalis (Type A and B) are evolving more rapidly than control proteins, consistent with patterns seen in insects and mammals. However, ω (dN/dS) is not significantly different between the sympatric and allopatric populations, and none of the polymorphism statistics show significant differences between sympatric and allopatric populations. Conclusions Enhanced prezygotic isolation in sympatry has become a well-known feature of gamete recognition proteins in marine broadcast spawners. But in most cases the evolutionary process or processes responsible for this pattern have not been identified. Although gamete recognition proteins in C

  19. Reproductive protein evolution in two cryptic species of marine chordate

    Directory of Open Access Journals (Sweden)

    Harrison Richard G

    2011-01-01

    Full Text Available Abstract Background Reproductive character displacement (RCD is a common and taxonomically widespread pattern. In marine broadcast spawning organisms, behavioral and mechanical isolation are absent and prezygotic barriers between species often operate only during the fertilization process. Such barriers are usually a consequence of differences in the way in which sperm and egg proteins interact, so RCD can be manifest as faster evolution of these proteins between species in sympatry than allopatry. Rapid evolution of these proteins often appears to be a consequence of positive (directional selection. Here, we identify a set of candidate gamete recognition proteins (GRPs in the ascidian Ciona intestinalis and showed that these GRPs evolve more rapidly than control proteins (those not involved in gamete recognition. Choosing a subset of these gamete recognition proteins that show evidence of positive selection (CIPRO37.40.1, CIPRO60.5.1, CIPRO100.7.1, we then directly test the RCD hypothesis by comparing divergence (omega and polymorphism (McDonald-Kreitman, Tajima's D, Fu and Li's D and F, Fay and Wu's H statistics in sympatric and allopatric populations of two distinct forms of C. intestinalis (Types A and B between which there are strong post-zygotic barriers. Results Candidate gamete recognition proteins from two lineages of C. intestinalis (Type A and B are evolving more rapidly than control proteins, consistent with patterns seen in insects and mammals. However, ω (dN/dS is not significantly different between the sympatric and allopatric populations, and none of the polymorphism statistics show significant differences between sympatric and allopatric populations. Conclusions Enhanced prezygotic isolation in sympatry has become a well-known feature of gamete recognition proteins in marine broadcast spawners. But in most cases the evolutionary process or processes responsible for this pattern have not been identified. Although gamete

  20. Sequence walkers: a graphical method to display how binding proteins interact with DNA or RNA sequences | Center for Cancer Research

    Science.gov (United States)

    A graphical method is presented for displaying how binding proteins and other macromolecules interact with individual bases of nucleotide sequences. Characters representing the sequence are either oriented normally and placed above a line indicating favorable contact, or upside-down and placed below the line indicating unfavorable contact. The positive or negative height of each letter shows the contribution of that base to the average sequence conservation of the binding site, as represented by a sequence logo.

  1. The evolution of the tape measure protein: units, duplications and losses

    Directory of Open Access Journals (Sweden)

    Poisson Guylaine

    2011-10-01

    Full Text Available Abstract Background A large family of viruses that infect bacteria, called phages, is characterized by long tails used to inject DNA into their victims' cells. The tape measure protein got its name because the length of the corresponding gene is proportional to the length of the phage's tail: a fact shown by actually copying or splicing out parts of DNA in exemplar species. A natural question is whether there exist units for these tape measures, and if different tape measures have different units and lengths. Such units would allow us to retrace the evolution of tape measure proteins using their duplication/loss history. The vast number of sequenced phages genomes allows us to attack this problem with a comparative genomics approach. Results Here we describe a subset of phages whose tape measure proteins contain variable numbers of an 11 amino acids sequence repeat, aligned with sequence similarity, structural properties, and simple arithmetics. This subset provides a unique opportunity for the combinatorial study of phage evolution, without the added uncertainties of multiple alignments, which are trivial in this case, or of protein functions, that are well established. We give a heuristic that reconstructs the duplication history of these sequences, using divergent strains to discriminate between mutations that occurred before and after speciation, or lineage divergence. The heuristic is based on an efficient algorithm that gives an exhaustive enumeration of all possible parsimonious reconstructions of the duplication/speciation history of a single nucleotide. Finally, we present a method that allows, when possible, to discriminate between duplication and loss events. Conclusions Establishing the evolutionary history of viruses is difficult, in part due to extensive recombinations and gene transfers, and high mutation rates that often erase detectable similarity between homologous genes. In this paper, we introduce new tools to address this

  2. Genome sequence analysis of the model grass Brachypodium distachyon: insights into grass genome evolution

    Energy Technology Data Exchange (ETDEWEB)

    Schulman, Al

    2009-08-09

    Three subfamilies of grasses, the Erhardtoideae (rice), the Panicoideae (maize, sorghum, sugar cane and millet), and the Pooideae (wheat, barley and cool season forage grasses) provide the basis of human nutrition and are poised to become major sources of renewable energy. Here we describe the complete genome sequence of the wild grass Brachypodium distachyon (Brachypodium), the first member of the Pooideae subfamily to be completely sequenced. Comparison of the Brachypodium, rice and sorghum genomes reveals a precise sequence- based history of genome evolution across a broad diversity of the grass family and identifies nested insertions of whole chromosomes into centromeric regions as a predominant mechanism driving chromosome evolution in the grasses. The relatively compact genome of Brachypodium is maintained by a balance of retroelement replication and loss. The complete genome sequence of Brachypodium, coupled to its exceptional promise as a model system for grass research, will support the development of new energy and food crops

  3. Directed Evolution of Proteins through In Vitro Protein Synthesis in Liposomes

    Directory of Open Access Journals (Sweden)

    Takehiro Nishikawa

    2012-01-01

    Full Text Available Directed evolution of proteins is a technique used to modify protein functions through “Darwinian selection.” In vitro compartmentalization (IVC is an in vitro gene screening system for directed evolution of proteins. IVC establishes the link between genetic information (genotype and the protein translated from the information (phenotype, which is essential for all directed evolution methods, by encapsulating both in a nonliving microcompartment. Herein, we introduce a new liposome-based IVC system consisting of a liposome, the protein synthesis using recombinant elements (PURE system and a fluorescence-activated cell sorter (FACS used as a microcompartment, in vitro protein synthesis system, and high-throughput screen, respectively. Liposome-based IVC is characterized by in vitro protein synthesis from a single copy of a gene in a cell-sized unilamellar liposome and quantitative functional evaluation of the synthesized proteins. Examples of liposome-based IVC for screening proteins such as GFP and β-glucuronidase are described. We discuss the future directions for this method and its applications.

  4. Protein evolution via amino acid and codon elimination

    DEFF Research Database (Denmark)

    Goltermann, Lise; Larsen, Marie Sofie Yoo; Banerjee, Rajat

    2010-01-01

    BACKGROUND: Global residue-specific amino acid mutagenesis can provide important biological insight and generate proteins with altered properties, but at the risk of protein misfolding. Further, targeted libraries are usually restricted to a handful of amino acids because there is an exponential...... correlation between the number of residues randomized and the size of the resulting ensemble. Using GFP as the model protein, we present a strategy, termed protein evolution via amino acid and codon elimination, through which simplified, native-like polypeptides encoded by a reduced genetic code were obtained...... simultaneously), while retaining varying levels of activity. Combination of these substitutions to generate a Phe-free variant of GFP abolished fluorescence. Combinatorial re-introduction of five Phe residues, based on the activities of the respective single amino acid replacements, was sufficient to restore GFP...

  5. Adaptive Evolution of Eel Fluorescent Proteins from Fatty Acid Binding Proteins Produces Bright Fluorescence in the Marine Environment.

    Directory of Open Access Journals (Sweden)

    David F Gruber

    Full Text Available We report the identification and characterization of two new members of a family of bilirubin-inducible fluorescent proteins (FPs from marine chlopsid eels and demonstrate a key region of the sequence that serves as an evolutionary switch from non-fluorescent to fluorescent fatty acid-binding proteins (FABPs. Using transcriptomic analysis of two species of brightly fluorescent Kaupichthys eels (Kaupichthys hyoproroides and Kaupichthys n. sp., two new FPs were identified, cloned and characterized (Chlopsid FP I and Chlopsid FP II. We then performed phylogenetic analysis on 210 FABPs, spanning 16 vertebrate orders, and including 163 vertebrate taxa. We show that the fluorescent FPs diverged as a protein family and are the sister group to brain FABPs. Our results indicate that the evolution of this family involved at least three gene duplication events. We show that fluorescent FABPs possess a unique, conserved tripeptide Gly-Pro-Pro sequence motif, which is not found in non-fluorescent fatty acid binding proteins. This motif arose from a duplication event of the FABP brain isoforms and was under strong purifying selection, leading to the classification of this new FP family. Residues adjacent to the motif are under strong positive selection, suggesting a further refinement of the eel protein's fluorescent properties. We present a phylogenetic reconstruction of this emerging FP family and describe additional fluorescent FABP members from groups of distantly related eels. The elucidation of this class of fish FPs with diverse properties provides new templates for the development of protein-based fluorescent tools. The evolutionary adaptation from fatty acid-binding proteins to fluorescent fatty acid-binding proteins raises intrigue as to the functional role of bright green fluorescence in this cryptic genus of reclusive eels that inhabit a blue, nearly monochromatic, marine environment.

  6. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution.

    Directory of Open Access Journals (Sweden)

    Jean-François Gout

    2010-05-01

    Full Text Available The understanding of selective constraints affecting genes is a major issue in biology. It is well established that gene expression level is a major determinant of the rate of protein evolution, but the reasons for this relationship remain highly debated. Here we demonstrate that gene expression is also a major determinant of the evolution of gene dosage: the rate of gene losses after whole genome duplications in the Paramecium lineage is negatively correlated to the level of gene expression, and this relationship is not a byproduct of other factors known to affect the fate of gene duplicates. This indicates that changes in gene dosage are generally more deleterious for highly expressed genes. This rule also holds for other taxa: in yeast, we find a clear relationship between gene expression level and the fitness impact of reduction in gene dosage. To explain these observations, we propose a model based on the fact that the optimal expression level of a gene corresponds to a trade-off between the benefit and cost of its expression. This COSTEX model predicts that selective pressure against mutations changing gene expression level or affecting the encoded protein should on average be stronger in highly expressed genes and hence that both the frequency of gene loss and the rate of protein evolution should correlate negatively with gene expression. Thus, the COSTEX model provides a simple and common explanation for the general relationship observed between the level of gene expression and the different facets of gene evolution.

  7. The SBASE protein domain library, release 8.0: a collection of annotated protein sequence segments.

    Science.gov (United States)

    Murvai, J; Vlahovicek, K; Barta, E; Pongor, S

    2001-01-01

    SBASE 8.0 is the eighth release of the SBASE library of protein domain sequences that contains 294 898 annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to most major sequence databases and sequence pattern collections. The entries are clustered into over 2005 statistically validated domain groups (SBASE-A) and 595 non-validated groups (SBASE-B), provided with several WWW-based search and browsing facilities for online use. A domain-search facility was developed, based on non-parametric pattern recognition methods, including artificial neural networks. SBASE 8.0 is freely available by anonymous 'ftp' file transfer from ftp.icgeb.trieste.it. Automated searching of SBASE can be carried out with the WWW servers http://www.icgeb.trieste.it/sbase/ and http://sbase.abc. hu/sbase/.

  8. Perspective on sequence evolution of microsatellite locus (CCGn in Rv0050 gene from Mycobacterium tuberculosis

    Directory of Open Access Journals (Sweden)

    Jin Ruiliang

    2011-08-01

    Full Text Available Abstract Background The mycobacterial genome is inclined to polymerase slippage and a high mutation rate in microsatellite regions due to high GC content and absence of a mismatch repair system. However, the exact molecular mechanisms underlying microsatellite variation have not been fully elucidated. Here, we investigated mutation events in the hyper-variable trinucleotide microsatellite locus MML0050 located in the Rv0050 gene of W-Beijing and non-W-Beijing Mycobacterium tuberculosis strains in order to gain insight into the genomic structure and activity of repeated regions. Results Size analysis indicated the presence of five alleles that differed in length by three base pairs. Moreover, nucleotide gains occurred more frequently than loses in this trinucleotide microsatellite. Mutation frequency was not completely related with the total length, though the relative frequency in the longest allele was remarkably higher than that in the shortest. Sequence analysis was able to detect seven alleles and revealed that point mutations enhanced the level of locus variation. Introduction of an interruptive motif correlated with the total allele length and genetic lineage, rather than the length of the longest stretch of perfect repeats. Finally, the level of locus variation was drastically different between the two genetic lineages. Conclusion The Rv0050 locus encodes the bifunctional penicillin-binding protein ponA1 and is essential to mycobacterial survival. Our investigations of this particularly dynamic genomic region provide insights into the overall mode of microsatellite evolution. Specifically, replication slippage was implicated in the mutational process of this microsatellite and a sequence-based genetic analysis was necessary to determine that point mutation events acted to maintain microsatellite size integrity while providing genomic diversity.

  9. Sequence-specific capture of protein-DNA complexes for mass spectrometric protein identification.

    Directory of Open Access Journals (Sweden)

    Cheng-Hsien Wu

    Full Text Available The regulation of gene transcription is fundamental to the existence of complex multicellular organisms such as humans. Although it is widely recognized that much of gene regulation is controlled by gene-specific protein-DNA interactions, there presently exists little in the way of tools to identify proteins that interact with the genome at locations of interest. We have developed a novel strategy to address this problem, which we refer to as GENECAPP, for Global ExoNuclease-based Enrichment of Chromatin-Associated Proteins for Proteomics. In this approach, formaldehyde cross-linking is employed to covalently link DNA to its associated proteins; subsequent fragmentation of the DNA, followed by exonuclease digestion, produces a single-stranded region of the DNA that enables sequence-specific hybridization capture of the protein-DNA complex on a solid support. Mass spectrometric (MS analysis of the captured proteins is then used for their identification and/or quantification. We show here the development and optimization of GENECAPP for an in vitro model system, comprised of the murine insulin-like growth factor-binding protein 1 (IGFBP1 promoter region and FoxO1, a member of the forkhead rhabdomyosarcoma (FoxO subfamily of transcription factors, which binds specifically to the IGFBP1 promoter. This novel strategy provides a powerful tool for studies of protein-DNA and protein-protein interactions.

  10. The genome sequence of taurine cattle: A window to ruminant biology and evolution

    Science.gov (United States)

    To understand the biology and evolution of ruminants, the cattle genome was sequenced to about sevenfold coverage. The cattle genome contains a minimum of 22,000 genes, with a core set of 14,345 orthologs shared among seven mammalian species of which 1217 are absent or undetected in noneutherian (ma...

  11. ELAV proteins along evolution: back to the nucleus?

    Science.gov (United States)

    Colombrita, Claudia; Silani, Vincenzo; Ratti, Antonia

    2013-09-01

    The complex interplay of post-transcriptional regulatory mechanisms mediated by RNA-binding proteins (RBP) at different steps of RNA metabolism is pivotal for the development of the nervous system and the maintenance of adult brain activities. In this review, we will focus on the highly conserved ELAV gene family encoding for neuronal-specific RBPs which are necessary for proper neuronal differentiation and important for synaptic plasticity process. In the evolution from Drosophila to man, ELAV proteins seem to have changed their biological functions in relation to their different subcellular localization. While in Drosophila, they are localized in the nuclear compartment of neuronal cells and regulate splicing and polyadenylation, in mammals, the neuronal ELAV proteins are mainly present in the cytoplasm where they participate in regulating mRNA target stability, translation and transport into neurites. However, recent data indicate that the mammalian ELAV RBPs also have nuclear activities, similarly to their fly counterpart, being them able to continuously shuttle between the cytoplasm and the nucleus. Here, we will review and comment on all the biological functions associated with neuronal ELAV proteins along evolution and will show that the post-transcriptional regulatory network mediated by these RBPs in the brain is highly complex and only at an initial stage of being fully understood. This article is part of a Special Issue entitled 'RNA and splicing regulation in neurodegeneration'. Copyright © 2013 Elsevier Inc. All rights reserved.

  12. Design of Protein Multi-specificity Using an Independent Sequence Search Reduces the Barrier to Low Energy Sequences.

    Directory of Open Access Journals (Sweden)

    Alexander M Sevy

    2015-07-01

    Full Text Available Computational protein design has found great success in engineering proteins for thermodynamic stability, binding specificity, or enzymatic activity in a 'single state' design (SSD paradigm. Multi-specificity design (MSD, on the other hand, involves considering the stability of multiple protein states simultaneously. We have developed a novel MSD algorithm, which we refer to as REstrained CONvergence in multi-specificity design (RECON. The algorithm allows each state to adopt its own sequence throughout the design process rather than enforcing a single sequence on all states. Convergence to a single sequence is encouraged through an incrementally increasing convergence restraint for corresponding positions. Compared to MSD algorithms that enforce (constrain an identical sequence on all states the energy landscape is simplified, which accelerates the search drastically. As a result, RECON can readily be used in simulations with a flexible protein backbone. We have benchmarked RECON on two design tasks. First, we designed antibodies derived from a common germline gene against their diverse targets to assess recovery of the germline, polyspecific sequence. Second, we design "promiscuous", polyspecific proteins against all binding partners and measure recovery of the native sequence. We show that RECON is able to efficiently recover native-like, biologically relevant sequences in this diverse set of protein complexes.

  13. Gene Unprediction with Spurio: A tool to identify spurious protein sequences.

    Science.gov (United States)

    Höps, Wolfram; Jeffryes, Matt; Bateman, Alex

    2018-01-01

    We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation.  Despite the increasing accuracy of gene prediction tools there likely exists a large number of spurious protein predictions in the sequence databases.  We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes.  Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code is available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.

  14. Analysis of ribosomal protein gene structures: implications for intron evolution.

    Directory of Open Access Journals (Sweden)

    2006-03-01

    Full Text Available Many spliceosomal introns exist in the eukaryotic nuclear genome. Despite much research, the evolution of spliceosomal introns remains poorly understood. In this paper, we tried to gain insights into intron evolution from a novel perspective by comparing the gene structures of cytoplasmic ribosomal proteins (CRPs and mitochondrial ribosomal proteins (MRPs, which are held to be of archaeal and bacterial origin, respectively. We analyzed 25 homologous pairs of CRP and MRP genes that together had a total of 527 intron positions. We found that all 12 of the intron positions shared by CRP and MRP genes resulted from parallel intron gains and none could be considered to be "conserved," i.e., descendants of the same ancestor. This was supported further by the high frequency of proto-splice sites at these shared positions; proto-splice sites are proposed to be sites for intron insertion. Although we could not definitively disprove that spliceosomal introns were already present in the last universal common ancestor, our results lend more support to the idea that introns were gained late. At least, our results show that MRP genes were intronless at the time of endosymbiosis. The parallel intron gains between CRP and MRP genes accounted for 2.3% of total intron positions, which should provide a reliable estimate for future inferences of intron evolution.

  15. Evolution of endogenous sequences of banana streak virus: what can we learn from banana (Musa sp.) evolution?

    Science.gov (United States)

    Gayral, Philippe; Blondin, Laurence; Guidolin, Olivier; Carreel, Françoise; Hippolyte, Isabelle; Perrier, Xavier; Iskra-Caruana, Marie-Line

    2010-07-01

    Endogenous plant pararetroviruses (EPRVs) are viral sequences of the family Caulimoviridae integrated into the nuclear genome of numerous plant species. The ability of some endogenous sequences of Banana streak viruses (eBSVs) in the genome of banana (Musa sp.) to induce infections just like the virus itself was recently demonstrated (P. Gayral et al., J. Virol. 83:6697-6710, 2008). Although eBSVs probably arose from accidental events, infectious eBSVs constitute an extreme case of parasitism, as well as a newly described strategy for vertical virus transmission in plants. We investigated the early evolutionary stages of infectious eBSV for two distinct BSV species-GF (BSGFV) and Imové (BSImV)-through the study of their distribution, insertion polymorphism, and structure evolution among selected banana genotypes representative of the diversity of 60 wild Musa species and genotypes. To do so, the historical frame of host evolution was analyzed by inferring banana phylogeny from two chloroplast regions-matK and trnL-trnF-as well as from the nuclear genome, using 19 microsatellite loci. We demonstrated that both BSV species integrated recently in banana evolution, circa 640,000 years ago. The two infectious eBSVs were subjected to different selective pressures and showed distinct levels of rearrangement within their final structure. In addition, the molecular phylogenies of integrated and nonintegrated BSVs enabled us to establish the phylogenetic origins of eBSGFV and eBSImV.

  16. Feature Selection and the Class Imbalance Problem in Predicting Protein Function from Sequence

    NARCIS (Netherlands)

    Al-Shahib, A.; Breitling, R.; Gilbert, D.

    2005-01-01

    Abstract: When the standard approach to predict protein function by sequence homology fails, other alternative methods can be used that require only the amino acid sequence for predicting function. One such approach uses machine learning to predict protein function directly from amino acid sequence

  17. Evolution of biological sequences implies an extreme value distribution of type I for both global and local pairwise alignment scores.

    Science.gov (United States)

    Bastien, Olivier; Maréchal, Eric

    2008-08-07

    Confidence in pairwise alignments of biological sequences, obtained by various methods such as Blast or Smith-Waterman, is critical for automatic analyses of genomic data. Two statistical models have been proposed. In the asymptotic limit of long sequences, the Karlin-Altschul model is based on the computation of a P-value, assuming that the number of high scoring matching regions above a threshold is Poisson distributed. Alternatively, the Lipman-Pearson model is based on the computation of a Z-value from a random score distribution obtained by a Monte-Carlo simulation. Z-values allow the deduction of an upper bound of the P-value (1/Z-value2) following the TULIP theorem. Simulations of Z-value distribution is known to fit with a Gumbel law. This remarkable property was not demonstrated and had no obvious biological support. We built a model of evolution of sequences based on aging, as meant in Reliability Theory, using the fact that the amount of information shared between an initial sequence and the sequences in its lineage (i.e., mutual information in Information Theory) is a decreasing function of time. This quantity is simply measured by a sequence alignment score. In systems aging, the failure rate is related to the systems longevity. The system can be a machine with structured components, or a living entity or population. "Reliability" refers to the ability to operate properly according to a standard. Here, the "reliability" of a sequence refers to the ability to conserve a sufficient functional level at the folded and maturated protein level (positive selection pressure). Homologous sequences were considered as systems 1) having a high redundancy of information reflected by the magnitude of their alignment scores, 2) which components are the amino acids that can independently be damaged by random DNA mutations. From these assumptions, we deduced that information shared at each amino acid position evolved with a constant rate, corresponding to the

  18. Evolution of biological sequences implies an extreme value distribution of type I for both global and local pairwise alignment scores

    Directory of Open Access Journals (Sweden)

    Maréchal Eric

    2008-08-01

    Full Text Available Abstract Background Confidence in pairwise alignments of biological sequences, obtained by various methods such as Blast or Smith-Waterman, is critical for automatic analyses of genomic data. Two statistical models have been proposed. In the asymptotic limit of long sequences, the Karlin-Altschul model is based on the computation of a P-value, assuming that the number of high scoring matching regions above a threshold is Poisson distributed. Alternatively, the Lipman-Pearson model is based on the computation of a Z-value from a random score distribution obtained by a Monte-Carlo simulation. Z-values allow the deduction of an upper bound of the P-value (1/Z-value2 following the TULIP theorem. Simulations of Z-value distribution is known to fit with a Gumbel law. This remarkable property was not demonstrated and had no obvious biological support. Results We built a model of evolution of sequences based on aging, as meant in Reliability Theory, using the fact that the amount of information shared between an initial sequence and the sequences in its lineage (i.e., mutual information in Information Theory is a decreasing function of time. This quantity is simply measured by a sequence alignment score. In systems aging, the failure rate is related to the systems longevity. The system can be a machine with structured components, or a living entity or population. "Reliability" refers to the ability to operate properly according to a standard. Here, the "reliability" of a sequence refers to the ability to conserve a sufficient functional level at the folded and maturated protein level (positive selection pressure. Homologous sequences were considered as systems 1 having a high redundancy of information reflected by the magnitude of their alignment scores, 2 which components are the amino acids that can independently be damaged by random DNA mutations. From these assumptions, we deduced that information shared at each amino acid position evolved with a

  19. Designing sequence to control protein function in an EF-hand protein.

    Science.gov (United States)

    Bunick, Christopher G; Nelson, Melanie R; Mangahas, Sheryll; Hunter, Michael J; Sheehan, Jonathan H; Mizoue, Laura S; Bunick, Gerard J; Chazin, Walter J

    2004-05-19

    The extent of conformational change that calcium binding induces in EF-hand proteins is a key biochemical property specifying Ca(2+) sensor versus signal modulator function. To understand how differences in amino acid sequence lead to differences in the response to Ca(2+) binding, comparative analyses of sequence and structures, combined with model building, were used to develop hypotheses about which amino acid residues control Ca(2+)-induced conformational changes. These results were used to generate a first design of calbindomodulin (CBM-1), a calbindin D(9k) re-engineered with 15 mutations to respond to Ca(2+) binding with a conformational change similar to that of calmodulin. The gene for CBM-1 was synthesized, and the protein was expressed and purified. Remarkably, this protein did not exhibit any non-native-like molten globule properties despite the large number of mutations and the nonconservative nature of some of them. Ca(2+)-induced changes in CD intensity and in the binding of the hydrophobic probe, ANS, implied that CBM-1 does undergo Ca(2+) sensorlike conformational changes. The X-ray crystal structure of Ca(2+)-CBM-1 determined at 1.44 A resolution reveals the anticipated increase in hydrophobic surface area relative to the wild-type protein. A nascent calmodulin-like hydrophobic docking surface was also found, though it is occluded by the inter-EF-hand loop. The results from this first calbindomodulin design are discussed in terms of progress toward understanding the relationships between amino acid sequence, protein structure, and protein function for EF-hand CaBPs, as well as the additional mutations for the next CBM design.

  20. CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction

    KAUST Repository

    Cui, Xuefeng; Lu, Zhiwu; Wang, Sheng; Jing-Yan Wang, Jim; Gao, Xin

    2016-01-01

    Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment

  1. GFP-like proteins as ubiquitous metazoan superfamily: evolution of functional features and structural complexity.

    Science.gov (United States)

    Shagin, Dmitry A; Barsova, Ekaterina V; Yanushevich, Yurii G; Fradkov, Arkady F; Lukyanov, Konstantin A; Labas, Yulii A; Semenova, Tatiana N; Ugalde, Juan A; Meyers, Ann; Nunez, Jose M; Widder, Edith A; Lukyanov, Sergey A; Matz, Mikhail V

    2004-05-01

    Homologs of the green fluorescent protein (GFP), including the recently described GFP-like domains of certain extracellular matrix proteins in Bilaterian organisms, are remarkably similar at the protein structure level, yet they often perform totally unrelated functions, thereby warranting recognition as a superfamily. Here we describe diverse GFP-like proteins from previously undersampled and completely new sources, including hydromedusae and planktonic Copepoda. In hydromedusae, yellow and nonfluorescent purple proteins were found in addition to greens. Notably, the new yellow protein seems to follow exactly the same structural solution to achieving the yellow color of fluorescence as YFP, an engineered yellow-emitting mutant variant of GFP. The addition of these new sequences made it possible to resolve deep-level phylogenetic relationships within the superfamily. Fluorescence (most likely green) must have already existed in the common ancestor of Cnidaria and Bilateria, and therefore GFP-like proteins may be responsible for fluorescence and/or coloration in virtually any animal. At least 15 color diversification events can be inferred following the maximum parsimony principle in Cnidaria. Origination of red fluorescence and nonfluorescent purple-blue colors on several independent occasions provides a remarkable example of convergent evolution of complex features at the molecular level.

  2. The TALE face of Hox proteins in animal evolution.

    Science.gov (United States)

    Merabet, Samir; Galliot, Brigitte

    2015-01-01

    Hox genes are major regulators of embryonic development. One of their most conserved functions is to coordinate the formation of specific body structures along the anterior-posterior (AP) axis in Bilateria. This architectural role was at the basis of several morphological innovations across bilaterian evolution. In this review, we traced the origin of the Hox patterning system by considering the partnership with PBC and Meis proteins. PBC and Meis belong to the TALE-class of homeodomain-containing transcription factors and act as generic cofactors of Hox proteins for AP axis patterning in Bilateria. Recent data indicate that Hox proteins acquired the ability to interact with their TALE partners in the last common ancestor of Bilateria and Cnidaria. These interactions relied initially on a short peptide motif called hexapeptide (HX), which is present in Hox and non-Hox protein families. Remarkably, Hox proteins can also recruit the TALE cofactors by using specific PBC Interaction Motifs (SPIMs). We describe how a functional Hox/TALE patterning system emerged in eumetazoans through the acquisition of SPIMs. We anticipate that interaction flexibility could be found in other patterning systems, being at the heart of the astonishing morphological diversity observed in the animal kingdom.

  3. A scalable double-barcode sequencing platform for characterization of dynamic protein-protein interactions.

    Science.gov (United States)

    Schlecht, Ulrich; Liu, Zhimin; Blundell, Jamie R; St Onge, Robert P; Levy, Sasha F

    2017-05-25

    Several large-scale efforts have systematically catalogued protein-protein interactions (PPIs) of a cell in a single environment. However, little is known about how the protein interactome changes across environmental perturbations. Current technologies, which assay one PPI at a time, are too low throughput to make it practical to study protein interactome dynamics. Here, we develop a highly parallel protein-protein interaction sequencing (PPiSeq) platform that uses a novel double barcoding system in conjunction with the dihydrofolate reductase protein-fragment complementation assay in Saccharomyces cerevisiae. PPiSeq detects PPIs at a rate that is on par with current assays and, in contrast with current methods, quantitatively scores PPIs with enough accuracy and sensitivity to detect changes across environments. Both PPI scoring and the bulk of strain construction can be performed with cell pools, making the assay scalable and easily reproduced across environments. PPiSeq is therefore a powerful new tool for large-scale investigations of dynamic PPIs.

  4. Evolution of the acyl-CoA binding protein (ACBP)

    DEFF Research Database (Denmark)

    Burton, Mark; Rose, Timothy M; Faergeman, Nils J

    2005-01-01

    -CoA pool size, donation of acyl-CoA esters for beta-oxidation, vesicular trafficking, complex lipid synthesis and gene regulation. In the present study, we delineate the evolutionary history of ACBP to get a complete picture of its evolution and distribution among species. ACBP homologues were identified...... duplication and/or retrotransposition events. The ACBP protein is highly conserved across phylums, and the majority of ACBP genes are subjected to strong purifying selection. Experimental evidence indicates that the function of ACBP has been conserved from yeast to humans and that the multiple lineage...

  5. SPiCE : A web-based tool for sequence-based protein classification and exploration

    NARCIS (Netherlands)

    Van den Berg, B.A.; Reinders, M.J.; Roubos, J.A.; De Ridder, D.

    2014-01-01

    Background Amino acid sequences and features extracted from such sequences have been used to predict many protein properties, such as subcellular localization or solubility, using classifier algorithms. Although software tools are available for both feature extraction and classifier construction,

  6. Molecular evolution of a chordate specific family of G protein-coupled receptors

    Directory of Open Access Journals (Sweden)

    Leese Florian

    2011-08-01

    Full Text Available Abstract Background Chordate evolution is a history of innovations that is marked by physical and behavioral specializations, which led to the development of a variety of forms from a single ancestral group. Among other important characteristics, vertebrates obtained a well developed brain, anterior sensory structures, a closed circulatory system and gills or lungs as blood oxygenation systems. The duplication of pre-existing genes had profound evolutionary implications for the developmental complexity in vertebrates, since mutations modifying the function of a duplicated protein can lead to novel functions, improving the evolutionary success. Results We analyzed here the evolution of the GPRC5 family of G protein-coupled receptors by comprehensive similarity searches and found that the receptors are only present in chordates and that the size of the receptor family expanded, likely due to genome duplication events in the early history of vertebrate evolution. We propose that a single GPRC5 receptor coding gene originated in a stem chordate ancestor and gave rise by duplication events to a gene family comprising three receptor types (GPRC5A-C in vertebrates, and a fourth homologue present only in mammals (GPRC5D. Additional duplications of GPRC5B and GPRC5C sequences occurred in teleost fishes. The finding that the expression patterns of the receptors are evolutionarily conserved indicates an important biological function of these receptors. Moreover, we found that expression of GPRC5B is regulated by vitamin A in vivo, confirming previous findings that linked receptor expression to retinoic acid levels in tumor cell lines and strengthening the link between the receptor expression and the development of a complex nervous system in chordates, known to be dependent on retinoic acid signaling. Conclusions GPRC5 receptors, a class of G protein-coupled receptors with unique sequence characteristics, may represent a molecular novelty that helped non

  7. Rapid evolution of coral proteins responsible for interaction with the environment.

    Directory of Open Access Journals (Sweden)

    Christian R Voolstra

    Full Text Available Corals worldwide are in decline due to climate change effects (e.g., rising seawater temperatures, pollution, and exploitation. The ability of corals to cope with these stressors in the long run depends on the evolvability of the underlying genetic networks and proteins, which remain largely unknown. A genome-wide scan for positively selected genes between related coral species can help to narrow down the search space considerably.We screened a set of 2,604 putative orthologs from EST-based sequence datasets of the coral species Acropora millepora and Acropora palmata to determine the fraction and identity of proteins that may experience adaptive evolution. 7% of the orthologs show elevated rates of evolution. Taxonomically-restricted (i.e. lineage-specific genes show a positive selection signature more frequently than genes that are found across many animal phyla. The class of proteins that displayed elevated evolutionary rates was significantly enriched for proteins involved in immunity and defense, reproduction, and sensory perception. We also found elevated rates of evolution in several other functional groups such as management of membrane vesicles, transmembrane transport of ions and organic molecules, cell adhesion, and oxidative stress response. Proteins in these processes might be related to the endosymbiotic relationship corals maintain with dinoflagellates in the genus Symbiodinium.This study provides a birds-eye view of the processes potentially underlying coral adaptation, which will serve as a foundation for future work to elucidate the rates, patterns, and mechanisms of corals' evolutionary response to global climate change.

  8. Rapid Evolution of Coral Proteins Responsible for Interaction with the Environment

    Energy Technology Data Exchange (ETDEWEB)

    Voolstra, Christian R.; Sunagawa, Shinichi; Matz, Mikhail V.; Bayer, Till; Aranda, Manuel; Buschiazzo, Emmanuel; DeSalvo, Michael K.; Lindquist, Erika; Szmant, Alina M.; Coffroth, Mary Alice; Medina, Monica

    2011-01-31

    Background: Corals worldwide are in decline due to climate change effects (e.g., rising seawater temperatures), pollution, and exploitation. The ability of corals to cope with these stressors in the long run depends on the evolvability of the underlying genetic networks and proteins, which remain largely unknown. A genome-wide scan for positively selected genes between related coral species can help to narrow down the search space considerably. Methodology/Principal Findings: We screened a set of 2,604 putative orthologs from EST-based sequence datasets of the coral species Acropora millepora and Acropora palmata to determine the fraction and identity of proteins that may experience adaptive evolution. 7percent of the orthologs show elevated rates of evolution. Taxonomically-restricted (i.e. lineagespecific) genes show a positive selection signature more frequently than genes that are found across many animal phyla. The class of proteins that displayed elevated evolutionary rates was significantly enriched for proteins involved in immunity and defense, reproduction, and sensory perception. We also found elevated rates of evolution in several other functional groups such as management of membrane vesicles, transmembrane transport of ions and organic molecules, cell adhesion, and oxidative stress response. Proteins in these processes might be related to the endosymbiotic relationship corals maintain with dinoflagellates in the genus Symbiodinium. Conclusion/Relevance: This study provides a birds-eye view of the processes potentially underlying coral adaptation, which will serve as a foundation for future work to elucidate the rates, patterns, and mechanisms of corals? evolutionary response to global climate change.

  9. New Insights about Enzyme Evolution from Large Scale Studies of Sequence and Structure Relationships*

    Science.gov (United States)

    Brown, Shoshana D.; Babbitt, Patricia C.

    2014-01-01

    Understanding how enzymes have evolved offers clues about their structure-function relationships and mechanisms. Here, we describe evolution of functionally diverse enzyme superfamilies, each representing a large set of sequences that evolved from a common ancestor and that retain conserved features of their structures and active sites. Using several examples, we describe the different structural strategies nature has used to evolve new reaction and substrate specificities in each unique superfamily. The results provide insight about enzyme evolution that is not easily obtained from studies of one or only a few enzymes. PMID:25210038

  10. New insights about enzyme evolution from large scale studies of sequence and structure relationships.

    Science.gov (United States)

    Brown, Shoshana D; Babbitt, Patricia C

    2014-10-31

    Understanding how enzymes have evolved offers clues about their structure-function relationships and mechanisms. Here, we describe evolution of functionally diverse enzyme superfamilies, each representing a large set of sequences that evolved from a common ancestor and that retain conserved features of their structures and active sites. Using several examples, we describe the different structural strategies nature has used to evolve new reaction and substrate specificities in each unique superfamily. The results provide insight about enzyme evolution that is not easily obtained from studies of one or only a few enzymes. © 2014 by The American Society for Biochemistry and Molecular Biology, Inc.

  11. Evolution and Virulence of Influenza A Virus Protein PB1-F2

    Directory of Open Access Journals (Sweden)

    Ram P. Kamal

    2017-12-01

    Full Text Available PB1-F2 is an accessory protein of most human, avian, swine, equine, and canine influenza A viruses (IAVs. Although it is dispensable for virus replication and growth, it plays significant roles in pathogenesis by interfering with the host innate immune response, inducing death in immune and epithelial cells, altering inflammatory responses, and promoting secondary bacterial pneumonia. The effects of PB1-F2 differ between virus strains and host species. This can at least partially be explained by the presence of multiple PB1-F2 sequence variants, including premature stop codons that lead to the expression of truncated PB1-F2 proteins of different lengths and specific virulence-associated residues that enhance susceptibility to bacterial superinfection. Although there has been a tendency for human seasonal IAV to gradually reduce the number of virulence-associated residues, zoonotic IAVs contain a reservoir of PB1-F2 proteins with full length, virulence-associated sequences. Here, we review the molecular mechanisms by which PB1-F2 may affect influenza virulence, and factors associated with the evolution and selection of this protein.

  12. Sequence comparisons of odorant receptors among tortricid moths reveal different rates of molecular evolution among family members.

    Directory of Open Access Journals (Sweden)

    Colm Carraher

    Full Text Available In insects, odorant receptors detect volatile cues involved in behaviours such as mate recognition, food location and oviposition. We have investigated the evolution of three odorant receptors from five species within the moth genera Ctenopseustis and Planotrotrix, family Tortricidae, which fall into distinct clades within the odorant receptor multigene family. One receptor is the orthologue of the co-receptor Or83b, now known as Orco (OR2, and encodes the obligate ion channel subunit of the receptor complex. In comparison, the other two receptors, OR1 and OR3, are ligand-binding receptor subunits, activated by volatile compounds produced by plants--methyl salicylate and citral, respectively. Rates of sequence evolution at non-synonymous sites were significantly higher in OR1 compared with OR2 and OR3. Within the dataset OR1 contains 109 variable amino acid positions that are distributed evenly across the entire protein including transmembrane helices, loop regions and termini, while OR2 and OR3 contain 18 and 16 variable sites, respectively. OR2 shows a high level of amino acid conservation as expected due to its essential role in odour detection; however we found unexpected differences in the rate of evolution between two ligand-binding odorant receptors, OR1 and OR3. OR3 shows high sequence conservation suggestive of a conserved role in odour reception, whereas the higher rate of evolution observed in OR1, particularly at non-synonymous sites, may be suggestive of relaxed constraint, perhaps associated with the loss of an ancestral role in sex pheromone reception.

  13. Sequence protein identification by randomized sequence database and transcriptome mass spectrometry (SPIDER-TMS): from manual to automatic application of a 'de novo sequencing' approach.

    Science.gov (United States)

    Pascale, Raffaella; Grossi, Gerarda; Cruciani, Gabriele; Mecca, Giansalvatore; Santoro, Donatello; Sarli Calace, Renzo; Falabella, Patrizia; Bianco, Giuliana

    Sequence protein identification by a randomized sequence database and transcriptome mass spectrometry software package has been developed at the University of Basilicata in Potenza (Italy) and designed to facilitate the determination of the amino acid sequence of a peptide as well as an unequivocal identification of proteins in a high-throughput manner with enormous advantages of time, economical resource and expertise. The software package is a valid tool for the automation of a de novo sequencing approach, overcoming the main limits and a versatile platform useful in the proteomic field for an unequivocal identification of proteins, starting from tandem mass spectrometry data. The strength of this software is that it is a user-friendly and non-statistical approach, so protein identification can be considered unambiguous.

  14. Formation of a Multiple Protein Complex on the Adenovirus Packaging Sequence by the IVa2 Protein▿

    OpenAIRE

    Tyler, Ryan E.; Ewing, Sean G.; Imperiale, Michael J.

    2007-01-01

    During adenovirus virion assembly, the packaging sequence mediates the encapsidation of the viral genome. This sequence is composed of seven functional units, termed A repeats. Recent evidence suggests that the adenovirus IVa2 protein binds the packaging sequence and is involved in packaging of the genome. Study of the IVa2-packaging sequence interaction has been hindered by difficulty in purifying the protein produced in virus-infected cells or by recombinant techniques. We report the first ...

  15. Evolution and structural organization of the C proteins of paramyxovirinae.

    Directory of Open Access Journals (Sweden)

    Michael K Lo

    Full Text Available The phosphoprotein (P gene of most Paramyxovirinae encodes several proteins in overlapping frames: P and V, which share a common N-terminus (PNT, and C, which overlaps PNT. Overlapping genes are of particular interest because they encode proteins originated de novo, some of which have unknown structural folds, challenging the notion that nature utilizes only a limited, well-mapped area of fold space. The C proteins cluster in three groups, comprising measles, Nipah, and Sendai virus. We predicted that all C proteins have a similar organization: a variable, disordered N-terminus and a conserved, α-helical C-terminus. We confirmed this predicted organization by biophysically characterizing recombinant C proteins from Tupaia paramyxovirus (measles group and human parainfluenza virus 1 (Sendai group. We also found that the C of the measles and Nipah groups have statistically significant sequence similarity, indicating a common origin. Although the C of the Sendai group lack sequence similarity with them, we speculate that they also have a common origin, given their similar genomic location and structural organization. Since C is dispensable for viral replication, unlike PNT, we hypothesize that C may have originated de novo by overprinting PNT in the ancestor of Paramyxovirinae. Intriguingly, in measles virus and Nipah virus, PNT encodes STAT1-binding sites that overlap different regions of the C-terminus of C, indicating they have probably originated independently. This arrangement, in which the same genetic region encodes simultaneously a crucial functional motif (a STAT1-binding site and a highly constrained region (the C-terminus of C, seems paradoxical, since it should severely reduce the ability of the virus to adapt. The fact that it originated twice suggests that it must be balanced by an evolutionary advantage, perhaps from reducing the size of the genetic region vulnerable to mutations.

  16. Effects of main-sequence mass loss on stellar and galactic chemical evolution

    International Nuclear Information System (INIS)

    Guzik, J.A.

    1988-01-01

    L.A. Willson, G.H. Bowen and C. Struck-Marcell have proposed that 1 to 3 solar mass stars may experience evolutionarily significant mass loss during the early part of their main-sequence phase. The suggested mass-loss mechanism is pulsation, facilitated by rapid rotation. Initial mass-loss rates may be as large as several times 10 -9 M mass of sun/yr, diminishing over several times 10 8 years. The author attempts to test this hypothesis by comparing some theoretical implications with observations. Three areas are addressed: Solar models, cluster HR diagrams, and galactic chemical evolution. Mass-losing solar models were evolved that match the Sun's luminosity and radius at its present age. The most extreme viable models have initial mass 2.0 M 0 , and mass-loss rates decreasing exponentially over 2-3 x 10 8 years. Evolution calculations incorporating main-sequence mass loss were completed for a grid of models with initial masses 1.25 to 2.0 M mass of sun and mass loss timescales 0.2 to 2.0 Gry. Cluster HR diagrams synthesized with these models confirm the potential for the hypothesis to explain observed spreads or bifurcations in the upper main sequence, blue stragglers, anomalous giants, and poor fits of main-sequence turnoffs by standard isochrones. Simple closed galactic chemical evolution models were used to test the effects of main-sequence mass loss on the F and G dwarf distribution. Stars between 3.0 M mass of sun and a metallicity-dependent lower mass are assumed to lose mass. The models produce a 30 to 60% increase in the stars to stars-plus-remnants ratio, with fewer early-F dwarfs and many more late-F dwarfs remaining on the main sequence to the present

  17. MIPS: a database for protein sequences, homology data and yeast genome information.

    Science.gov (United States)

    Mewes, H W; Albermann, K; Heumann, K; Liebl, S; Pfeiffer, F

    1997-01-01

    The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database (,). MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/ ) MIPS permits internet access to sequence databases, homology data and to yeast genome information. (i) Sequence similarity results from the FASTA program () are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure () developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome (), the functional classification of yeast genes (FunCat) and its graphical display, the 'Genome Browser' (). A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request. PMID:9016498

  18. Protein domain evolution is associated with reproductive diversification and adaptive radiation in the genus Eucalyptus.

    Science.gov (United States)

    Kersting, Anna R; Mizrachi, Eshchar; Bornberg-Bauer, Erich; Myburg, Alexander A

    2015-06-01

    Eucalyptus is a pivotal genus within the rosid order Myrtales with distinct geographic history and adaptations. Comparative analysis of protein domain evolution in the newly sequenced Eucalyptus grandis genome and other rosid lineages sheds light on the adaptive mechanisms integral to the success of this genus of woody perennials. We reconstructed the ancestral domain content to elucidate the gain, loss and expansion of protein domains and domain arrangements in Eucalyptus in the context of rosid phylogeny. We used functional gene ontology (GO) annotation of genes to investigate the possible biological and evolutionary consequences of protein domain expansion. We found that protein modulation within the angiosperms occurred primarily on the level of expansion of certain domains and arrangements. Using RNA-Seq data from E. grandis, we showed that domain expansions have contributed to tissue-specific expression of tandemly duplicated genes. Our results indicate that tandem duplication of genes, a key feature of the Eucalyptus genome, has played an important role in the expansion of domains, particularly in proteins related to the specialization of reproduction and biotic and abiotic interactions affecting root and floral biology, and that tissue-specific expression of proteins with expanded domains has facilitated subfunctionalization in domain families. © 2014 University of Pretoria New Phytologist © 2014 New Phytologist Trust.

  19. Rapid identification of sequences for orphan enzymes to power accurate protein annotation.

    Directory of Open Access Journals (Sweden)

    Kevin R Ramkissoon

    Full Text Available The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.

  20. Rapid Identification of Sequences for Orphan Enzymes to Power Accurate Protein Annotation

    Science.gov (United States)

    Ojha, Sunil; Watson, Douglas S.; Bomar, Martha G.; Galande, Amit K.; Shearer, Alexander G.

    2013-01-01

    The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the “back catalog” of enzymology – “orphan enzymes,” those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme “back catalog” is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology’s “back catalog” another powerful tool to drive accurate genome annotation. PMID:24386392

  1. The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution

    OpenAIRE

    Elsik, Christine G.; Tellam, Ross L.; Worley, Kim C.; Gibbs, Richard A.; Abatepaulo, Antonio R. R.; Abbey, Colette A.; Adelson, David L.; Aerts, Jan; Ahola, Virpi; Alexander, Lee; Alioto, Tyler; Almeida, Iassudara G.; Amadio, Ariel F.; Anatriello, Elen; Antonarakis, Stylianos E.

    2009-01-01

    To understand the biology and evolution of ruminants, the cattle genome was sequenced to about sevenfold coverage. The cattle genome contains a minimum of 22,000 genes, with a core set of 14,345 orthologs shared among seven mammalian species of which 1217 are absent or undetected in noneutherian (marsupial or monotreme) genomes. Cattle-specific evolutionary breakpoint regions in chromosomes have a higher density of segmental duplications, enrichment of repetitive elements, and species-specifi...

  2. Interactions of Chromatin Context, Binding Site Sequence Content, and Sequence Evolution in Stress-Induced p53 Occupancy and Transactivation

    OpenAIRE

    Su, Dan; Wang, Xuting; Campbell, Michelle R.; Song, Lingyun; Safi, Alexias; Crawford, Gregory E.; Bell, Douglas A.

    2015-01-01

    Cellular stresses activate the tumor suppressor p53 protein leading to selective binding to DNA response elements (REs) and gene transactivation from a large pool of potential p53 REs (p53REs). To elucidate how p53RE sequences and local chromatin context interact to affect p53 binding and gene transactivation, we mapped genome-wide binding localizations of p53 and H3K4me3 in untreated and doxorubicin (DXR)-treated human lymphoblastoid cells. We examined the relationships among p53 occupancy, ...

  3. Elman RNN based classification of proteins sequences on account of their mutual information.

    Science.gov (United States)

    Mishra, Pooja; Nath Pandey, Paras

    2012-10-21

    In the present work we have employed the method of estimating residue correlation within the protein sequences, by using the mutual information (MI) of adjacent residues, based on structural and solvent accessibility properties of amino acids. The long range correlation between nonadjacent residues is improved by constructing a mutual information vector (MIV) for a single protein sequence, like this each protein sequence is associated with its corresponding MIVs. These MIVs are given to Elman RNN to obtain the classification of protein sequences. The modeling power of MIV was shown to be significantly better, giving a new approach towards alignment free classification of protein sequences. We also conclude that sequence structural and solvent accessible property based MIVs are better predictor. Copyright © 2012 Elsevier Ltd. All rights reserved.

  4. A first-principles model of early evolution: emergence of gene families, species, and preferred protein folds.

    Directory of Open Access Journals (Sweden)

    Konstantin B Zeldovich

    2007-07-01

    Full Text Available In this work we develop a microscopic physical model of early evolution where phenotype--organism life expectancy--is directly related to genotype--the stability of its proteins in their native conformations-which can be determined exactly in the model. Simulating the model on a computer, we consistently observe the "Big Bang" scenario whereby exponential population growth ensues as soon as favorable sequence-structure combinations (precursors of stable proteins are discovered. Upon that, random diversity of the structural space abruptly collapses into a small set of preferred proteins. We observe that protein folds remain stable and abundant in the population at timescales much greater than mutation or organism lifetime, and the distribution of the lifetimes of dominant folds in a population approximately follows a power law. The separation of evolutionary timescales between discovery of new folds and generation of new sequences gives rise to emergence of protein families and superfamilies whose sizes are power-law distributed, closely matching the same distributions for real proteins. On the population level we observe emergence of species--subpopulations that carry similar genomes. Further, we present a simple theory that relates stability of evolving proteins to the sizes of emerging genomes. Together, these results provide a microscopic first-principles picture of how first-gene families developed in the course of early evolution.

  5. Molecular evolution and diversification of snake toxin genes, revealed by analysis of intron sequences.

    Science.gov (United States)

    Fujimi, T J; Nakajyo, T; Nishimura, E; Ogura, E; Tsuchiya, T; Tamiya, T

    2003-08-14

    The genes encoding erabutoxin (short chain neurotoxin) isoforms (Ea, Eb, and Ec), LsIII (long chain neurotoxin) and a novel long chain neurotoxin pseudogene were cloned from a Laticauda semifasciata genomic library. Short and long chain neurotoxin genes were also cloned from the genome of Laticauda laticaudata, a closely related species of L. semifasciata, by PCR. A putative matrix attached region (MAR) sequence was found in the intron I of the LsIII gene. Comparative analysis of 11 structurally relevant snake toxin genes (three-finger-structure toxins) revealed the molecular evolution of these toxins. Three-finger-structure toxin genes diverged from a common ancestor through two types of evolutionary pathways (long and short types), early in the course of evolution. At a later stage of evolution in each gene, the accumulation of mutations in the exons, especially exon II, by accelerated evolution may have caused the increased diversification in their functions. It was also revealed that the putative MAR sequence found in the LsIII gene was integrated into the gene after the species-level divergence.

  6. JACOP: A simple and robust method for the automated classification of protein sequences with modular architecture

    Directory of Open Access Journals (Sweden)

    Pagni Marco

    2005-08-01

    Full Text Available Abstract Background Whole-genome sequencing projects are rapidly producing an enormous number of new sequences. Consequently almost every family of proteins now contains hundreds of members. It has thus become necessary to develop tools, which classify protein sequences automatically and also quickly and reliably. The difficulty of this task is intimately linked to the mechanism by which protein sequences diverge, i.e. by simultaneous residue substitutions, insertions and/or deletions and whole domain reorganisations (duplications/swapping/fusion. Results Here we present a novel approach, which is based on random sampling of sub-sequences (probes out of a set of input sequences. The probes are compared to the input sequences, after a normalisation step; the results are used to partition the input sequences into homogeneous groups of proteins. In addition, this method provides information on diagnostic parts of the proteins. The performance of this method is challenged by two data sets. The first one contains the sequences of prokaryotic lyases that could be arranged as a multiple sequence alignment. The second one contains all proteins from Swiss-Prot Release 36 with at least one Src homology 2 (SH2 domain – a classical example for proteins with modular architecture. Conclusion The outcome of our method is robust, highly reproducible as shown using bootstrap and resampling validation procedures. The results are essentially coherent with the biology. This method depends solely on well-established publicly available software and algorithms.

  7. Genetic diversity and evolution of human metapneumovirus fusion protein over twenty years

    Directory of Open Access Journals (Sweden)

    Liem Alexis

    2009-09-01

    Full Text Available Abstract Background Human metapneumovirus (HMPV is an important cause of acute respiratory illness in children. We examined the diversity and molecular evolution of HMPV using 85 full-length F (fusion gene sequences collected over a 20-year period. Results The F gene sequences fell into two major groups, each with two subgroups, which exhibited a mean of 96% identity by predicted amino acid sequences. Amino acid identity within and between subgroups was higher than nucleotide identity, suggesting structural or functional constraints on F protein diversity. There was minimal progressive drift over time, and the genetic lineages were stable over the 20-year period. Several canonical amino acid differences discriminated between major subgroups, and polymorphic variations tended to cluster in discrete regions. The estimated rate of mutation was 7.12 × 10-4 substitutions/site/year and the estimated time to most recent common HMPV ancestor was 97 years (95% likelihood range 66-194 years. Analysis suggested that HMPV diverged from avian metapneumovirus type C (AMPV-C 269 years ago (95% likelihood range 106-382 years. Conclusion HMPV F protein remains conserved over decades. HMPV appears to have diverged from AMPV-C fairly recently.

  8. Genetic diversity and evolution of human metapneumovirus fusion protein over twenty years

    Science.gov (United States)

    Yang, Chin-Fen; Wang, Chiaoyin K; Tollefson, Sharon J; Piyaratna, Rohith; Lintao, Linda D; Chu, Marla; Liem, Alexis; Mark, Mary; Spaete, Richard R; Crowe, James E; Williams, John V

    2009-01-01

    Background Human metapneumovirus (HMPV) is an important cause of acute respiratory illness in children. We examined the diversity and molecular evolution of HMPV using 85 full-length F (fusion) gene sequences collected over a 20-year period. Results The F gene sequences fell into two major groups, each with two subgroups, which exhibited a mean of 96% identity by predicted amino acid sequences. Amino acid identity within and between subgroups was higher than nucleotide identity, suggesting structural or functional constraints on F protein diversity. There was minimal progressive drift over time, and the genetic lineages were stable over the 20-year period. Several canonical amino acid differences discriminated between major subgroups, and polymorphic variations tended to cluster in discrete regions. The estimated rate of mutation was 7.12 × 10-4 substitutions/site/year and the estimated time to most recent common HMPV ancestor was 97 years (95% likelihood range 66-194 years). Analysis suggested that HMPV diverged from avian metapneumovirus type C (AMPV-C) 269 years ago (95% likelihood range 106-382 years). Conclusion HMPV F protein remains conserved over decades. HMPV appears to have diverged from AMPV-C fairly recently. PMID:19740442

  9. Analysis of long-range correlation in sequences data of proteins

    OpenAIRE

    ADRIANA ISVORAN; LAURA UNIPAN; DANA CRACIUN; VASILE MORARIU

    2007-01-01

    The results presented here suggest the existence of correlations in the sequence data of proteins. 32 proteins, both globular and fibrous, both monomeric and polymeric, were analyzed. The primary structures of these proteins were treated as time series. Three spatial series of data for each sequence of a protein were generated from numerical correspondences between each amino acid and a physical property associated with it, i.e., its electric charge, its polar character and its dipole moment....

  10. Large-scale identification of odorant-binding proteins and chemosensory proteins from expressed sequence tags in insects

    Science.gov (United States)

    2009-01-01

    Background Insect odorant binding proteins (OBPs) and chemosensory proteins (CSPs) play an important role in chemical communication of insects. Gene discovery of these proteins is a time-consuming task. In recent years, expressed sequence tags (ESTs) of many insect species have accumulated, thus providing a useful resource for gene discovery. Results We have developed a computational pipeline to identify OBP and CSP genes from insect ESTs. In total, 752,841 insect ESTs were examined from 54 species covering eight Orders of Insecta. From these ESTs, 142 OBPs and 177 CSPs were identified, of which 117 OBPs and 129 CSPs are new. The complete open reading frames (ORFs) of 88 OBPs and 123 CSPs were obtained by electronic elongation. We randomly chose 26 OBPs from eight species of insects, and 21 CSPs from four species for RT-PCR validation. Twenty two OBPs and 16 CSPs were confirmed by RT-PCR, proving the efficiency and reliability of the algorithm. Together with all family members obtained from the NCBI (OBPs) or the UniProtKB (CSPs), 850 OBPs and 237 CSPs were analyzed for their structural characteristics and evolutionary relationship. Conclusions A large number of new OBPs and CSPs were found, providing the basis for deeper understanding of these proteins. In addition, the conserved motif and evolutionary analysis provide some new insights into the evolution of insect OBPs and CSPs. Motif pattern fine-tune the functions of OBPs and CSPs, leading to the minor difference in binding sex pheromone or plant volatiles in different insect Orders. PMID:20034407

  11. Evolution and Origin of HRS, a Protein Interacting with Merlin, the Neurofibromatosis 2 Gene Product

    Directory of Open Access Journals (Sweden)

    Leonid V. Omelyanchuk

    2009-10-01

    Full Text Available Hepatocyte growth factor receptor tyrosine kinase substrate (HRS is an endosomal protein required for trafficking receptor tyrosine kinases from the early endosome to the lysosome. HRS interacts with Merlin, the Neurofibromatosis 2 (NF2 gene product, and this interaction may be important for Merlin’s tumor suppressor activity. Understanding the evolution, origin, and structure of HRS may provide new insight into Merlin function. We show that HRS homologs are present across a wide range of Metazoa with the yeast Vps27 protein as their most distant ancestor. The phylogenetic tree of the HRS family coincides with species evolution and divergence, suggesting a unique function for HRS. Sequence alignment shows that various protein domains of HRS, including the VHS domain, the FYVE domain, the UIM domain, and the clathrin-binding domain, are conserved from yeast to multicellular organisms. The evolutionary transition from unicellular to multicellular organisms was accompanied by the appearance of a binding site for Merlin, which emerges in the early Metazoa after its separation from flatworms. In addition to the region responsible for growth suppression, the Merlin-binding and STAM-binding domains of HRS are conserved among multicellular organisms. The residue equivalent to tyrosine-377, which is phosphorylated in the human HRS protein, is highly conserved throughout the HRS family. Three additional conserved boxes lacking assigned functions are found in the HRS proteins of Metazoa. While boxes 1 and 3 may constitute the Eps-15- and Snx1-binding sites, respectively, box 2, containing the residue equivalent to tyrosine-377, is likely to be important for HRS phosphorylation. While several functional domains are conserved throughout the HRS family, the STAM-binding, Merlin-binding, and growth suppression domains evolved in the early Metazoa around the time the Merlin protein emerged. As these domains appear during the transition to multicellularity

  12. Whole-genome sequence of the Tibetan frog Nanorana parkeri and the comparative evolution of tetrapod genomes.

    Science.gov (United States)

    Sun, Yan-Bo; Xiong, Zi-Jun; Xiang, Xue-Yan; Liu, Shi-Ping; Zhou, Wei-Wei; Tu, Xiao-Long; Zhong, Li; Wang, Lu; Wu, Dong-Dong; Zhang, Bao-Lin; Zhu, Chun-Ling; Yang, Min-Min; Chen, Hong-Man; Li, Fang; Zhou, Long; Feng, Shao-Hong; Huang, Chao; Zhang, Guo-Jie; Irwin, David; Hillis, David M; Murphy, Robert W; Yang, Huan-Ming; Che, Jing; Wang, Jun; Zhang, Ya-Ping

    2015-03-17

    The development of efficient sequencing techniques has resulted in large numbers of genomes being available for evolutionary studies. However, only one genome is available for all amphibians, that of Xenopus tropicalis, which is distantly related from the majority of frogs. More than 96% of frogs belong to the Neobatrachia, and no genome exists for this group. This dearth of amphibian genomes greatly restricts genomic studies of amphibians and, more generally, our understanding of tetrapod genome evolution. To fill this gap, we provide the de novo genome of a Tibetan Plateau frog, Nanorana parkeri, and compare it to that of X. tropicalis and other vertebrates. This genome encodes more than 20,000 protein-coding genes, a number similar to that of Xenopus. Although the genome size of Nanorana is considerably larger than that of Xenopus (2.3 vs. 1.5 Gb), most of the difference is due to the respective number of transposable elements in the two genomes. The two frogs exhibit considerable conserved whole-genome synteny despite having diverged approximately 266 Ma, indicating a slow rate of DNA structural evolution in anurans. Multigenome synteny blocks further show that amphibians have fewer interchromosomal rearrangements than mammals but have a comparable rate of intrachromosomal rearrangements. Our analysis also identifies 11 Mb of anuran-specific highly conserved elements that will be useful for comparative genomic analyses of frogs. The Nanorana genome offers an improved understanding of evolution of tetrapod genomes and also provides a genomic reference for other evolutionary studies.

  13. Sequence-based prediction of protein protein interaction using a deep-learning algorithm.

    Science.gov (United States)

    Sun, Tanlin; Zhou, Bo; Lai, Luhua; Pei, Jianfeng

    2017-05-25

    Protein-protein interactions (PPIs) are critical for many biological processes. It is therefore important to develop accurate high-throughput methods for identifying PPI to better understand protein function, disease occurrence, and therapy design. Though various computational methods for predicting PPI have been developed, their robustness for prediction with external datasets is unknown. Deep-learning algorithms have achieved successful results in diverse areas, but their effectiveness for PPI prediction has not been tested. We used a stacked autoencoder, a type of deep-learning algorithm, to study the sequence-based PPI prediction. The best model achieved an average accuracy of 97.19% with 10-fold cross-validation. The prediction accuracies for various external datasets ranged from 87.99% to 99.21%, which are superior to those achieved with previous methods. To our knowledge, this research is the first to apply a deep-learning algorithm to sequence-based PPI prediction, and the results demonstrate its potential in this field.

  14. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM

    Directory of Open Access Journals (Sweden)

    Yunyun Liang

    2015-01-01

    Full Text Available Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM. Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS, segmented PsePSSM, and segmented autocovariance transformation (ACT based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640 are adopted in this paper. Then a 700-dimensional (700D feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA. To verify the performance of our method, rigorous jackknife cross-validation tests are performed on 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves the favorable and competitive performance. This will offer an important complementary to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences.

  15. A correspondence between solution-state dynamics of an individual protein and the sequence and conformational diversity of its family.

    Directory of Open Access Journals (Sweden)

    Gregory D Friedland

    2009-05-01

    Full Text Available Conformational ensembles are increasingly recognized as a useful representation to describe fundamental relationships between protein structure, dynamics and function. Here we present an ensemble of ubiquitin in solution that is created by sampling conformational space without experimental information using "Backrub" motions inspired by alternative conformations observed in sub-Angstrom resolution crystal structures. Backrub-generated structures are then selected to produce an ensemble that optimizes agreement with nuclear magnetic resonance (NMR Residual Dipolar Couplings (RDCs. Using this ensemble, we probe two proposed relationships between properties of protein ensembles: (i a link between native-state dynamics and the conformational heterogeneity observed in crystal structures, and (ii a relation between dynamics of an individual protein and the conformational variability explored by its natural family. We show that the Backrub motional mechanism can simultaneously explore protein native-state dynamics measured by RDCs, encompass the conformational variability present in ubiquitin complex structures and facilitate sampling of conformational and sequence variability matching those occurring in the ubiquitin protein family. Our results thus support an overall relation between protein dynamics and conformational changes enabling sequence changes in evolution. More practically, the presented method can be applied to improve protein design predictions by accounting for intrinsic native-state dynamics.

  16. Exploring Sequence Characteristics Related to High- Level Production of Secreted Proteins in Aspergillus niger

    NARCIS (Netherlands)

    Van den Berg, B.A.; Reinders, M.J.T.; Hulsman, M.; Wu, L.; Pel, H.J.; Roubos, J.A.; De Ridder, D.

    2012-01-01

    Protein sequence features are explored in relation to the production of over-expressed extracellular proteins by fungi. Knowledge on features influencing protein production and secretion could be employed to improve enzyme production levels in industrial bioprocesses via protein engineering. A large

  17. Evolution of the Yellow/Major Royal Jelly Protein family and the emergence of social behavior in honey bees.

    Science.gov (United States)

    Drapeau, Mark David; Albert, Stefan; Kucharski, Robert; Prusko, Carsten; Maleszka, Ryszard

    2006-11-01

    The genomic architecture underlying the evolution of insect social behavior is largely a mystery. Eusociality, defined by overlapping generations, parental brood care, and reproductive division of labor, has most commonly evolved in the Hymenopteran insects, including the honey bee Apis mellifera. In this species, the Major Royal Jelly Protein (MRJP) family is required for all major aspects of eusocial behavior. Here, using data obtained from the A. mellifera genome sequencing project, we demonstrate that the MRJP family is encoded by nine genes arranged in an approximately 60-kb tandem array. Furthermore, the MRJP protein family appears to have evolved from a single progenitor gene that encodes a member of the ancient Yellow protein family. Five genes encoding Yellow-family proteins flank the genomic region containing the genes encoding MRJPs. We describe the molecular evolution of these protein families. We then characterize developmental-stage-specific, sex-specific, and caste-specific expression patterns of the mrjp and yellow genes in the honey bee. We review empirical evidence concerning the functions of Yellow proteins in fruit flies and social ants, in order to shed light on the roles of both Yellow and MRJP proteins in A. mellifera. In total, the available evidence suggests that Yellows and MRJPs are multifunctional proteins with diverse, context-dependent physiological and developmental roles. However, many members of the Yellow/MRJP family act as facilitators of reproductive maturation. Finally, it appears that MRJP protein subfamily evolution from the Yellow protein family may have coincided with the evolution of honey bee eusociality.

  18. Accelerated Evolution of Conserved Noncoding Sequences in theHuman Genome

    Energy Technology Data Exchange (ETDEWEB)

    Prambhakar, Shyam; Noonan, James P.; Paabo, Svante; Rubin, EdwardM.

    2006-07-06

    Genomic comparisons between human and distant, non-primatemammals are commonly used to identify cis-regulatory elements based onconstrained sequence evolution. However, these methods fail to detect"cryptic" functional elements, which are too weakly conserved amongmammals to distinguish from nonfunctional DNA. To address this problem,we explored the potential of deep intra-primate sequence comparisons. Wesequenced the orthologs of 558 kb of human genomic sequence, coveringmultiple loci involved in cholesterol homeostasis, in 6 nonhumanprimates. Our analysis identified 6 noncoding DNA elements displayingsignificant conservation among primates, but undetectable in more distantcomparisons. In vitro and in vivo tests revealed that at least three ofthese 6 elements have regulatory function. Notably, the mouse orthologsof these three functional human sequences had regulatory activity despitetheir lack of significant sequence conservation, indicating that they arecryptic ancestral cis-regulatory elements. These regulatory elementscould still be detected in a smaller set of three primate speciesincluding human, rhesus and marmoset. Since the human and rhesus genomesequences are already available, and the marmoset genome is activelybeing sequenced, the primate-specific conservation analysis describedhere can be applied in the near future on a whole-genome scale, tocomplement the annotation provided by more distant speciescomparisons.

  19. On Statistical Modeling of Sequencing Noise in High Depth Data to Assess Tumor Evolution

    Science.gov (United States)

    Rabadan, Raul; Bhanot, Gyan; Marsilio, Sonia; Chiorazzi, Nicholas; Pasqualucci, Laura; Khiabanian, Hossein

    2017-12-01

    One cause of cancer mortality is tumor evolution to therapy-resistant disease. First line therapy often targets the dominant clone, and drug resistance can emerge from preexisting clones that gain fitness through therapy-induced natural selection. Such mutations may be identified using targeted sequencing assays by analysis of noise in high-depth data. Here, we develop a comprehensive, unbiased model for sequencing error background. We find that noise in sufficiently deep DNA sequencing data can be approximated by aggregating negative binomial distributions. Mutations with frequencies above noise may have prognostic value. We evaluate our model with simulated exponentially expanded populations as well as data from cell line and patient sample dilution experiments, demonstrating its utility in prognosticating tumor progression. Our results may have the potential to identify significant mutations that can cause recurrence. These results are relevant in the pretreatment clinical setting to determine appropriate therapy and prepare for potential recurrence pretreatment.

  20. Characterization and Evolution of the Cell Cycle-Associated Mob Domain-Containing Proteins in Eukaryotes

    Directory of Open Access Journals (Sweden)

    Nicola Vitulo

    2007-01-01

    Full Text Available The MOB family includes a group of cell cycle-associated proteins highly conserved throughout eukaryotes, whose founding members are implicated in mitotic exit and co-ordination of cell cycle progression with cell polarity and morphogenesis. Here we report the characterization and evolution of the MOB domain-containing proteins as inferred from the 43 eukaryotic genomes so far sequenced. We show that genes for Mob-like proteins are present in at least 41 of these genomes, confi rming the universal distribution of this protein family and suggesting its prominent biological function. The phylogenetic analysis reveals fi ve distinct MOB domain classes, showing a progressive expansion of this family from unicellular to multicellular organisms, reaching the highest number in mammals. Plant Mob genes appear to have evolved from a single ancestor, most likely after the loss of one or more genes during the early stage of Viridiplantae evolutionary history. Three of the Mob classes are widespread among most of the analyzed organisms. The possible biological and molecular function of Mob proteins and their role in conserved signaling pathways related to cell proliferation, cell death and cell polarity are also presented and critically discussed.

  1. Mean protein evolutionary distance: a method for comparative protein evolution and its application.

    Directory of Open Access Journals (Sweden)

    Michael J Wise

    Full Text Available Proteins are under tight evolutionary constraints, so if a protein changes it can only do so in ways that do not compromise its function. In addition, the proteins in an organism evolve at different rates. Leveraging the history of patristic distance methods, a new method for analysing comparative protein evolution, called Mean Protein Evolutionary Distance (MeaPED, measures differential resistance to evolutionary pressure across viral proteomes and is thereby able to point to the proteins' roles. Different species' proteomes can also be compared because the results, consistent across virus subtypes, concisely reflect the very different lifestyles of the viruses. The MeaPED method is here applied to influenza A virus, hepatitis C virus, human immunodeficiency virus (HIV, dengue virus, rotavirus A, polyomavirus BK and measles, which span the positive and negative single-stranded, doubled-stranded and reverse transcribing RNA viruses, and double-stranded DNA viruses. From this analysis, host interaction proteins including hemagglutinin (influenza, and viroporins agnoprotein (polyomavirus, p7 (hepatitis C and VPU (HIV emerge as evolutionary hot-spots. By contrast, RNA-directed RNA polymerase proteins including L (measles, PB1/PB2 (influenza and VP1 (rotavirus, and internal serine proteases such as NS3 (dengue and hepatitis C virus emerge as evolutionary cold-spots. The hot spot influenza hemagglutinin protein is contrasted with the related cold spot H protein from measles. It is proposed that evolutionary cold-spot proteins can become significant targets for second-line anti-viral therapeutics, in cases where front-line vaccines are not available or have become ineffective due to mutations in the hot-spot, generally more antigenically exposed proteins. The MeaPED package is available from www.pam1.bcs.uwa.edu.au/~michaelw/ftp/src/meaped.tar.gz.

  2. Mean protein evolutionary distance: a method for comparative protein evolution and its application.

    Science.gov (United States)

    Wise, Michael J

    2013-01-01

    Proteins are under tight evolutionary constraints, so if a protein changes it can only do so in ways that do not compromise its function. In addition, the proteins in an organism evolve at different rates. Leveraging the history of patristic distance methods, a new method for analysing comparative protein evolution, called Mean Protein Evolutionary Distance (MeaPED), measures differential resistance to evolutionary pressure across viral proteomes and is thereby able to point to the proteins' roles. Different species' proteomes can also be compared because the results, consistent across virus subtypes, concisely reflect the very different lifestyles of the viruses. The MeaPED method is here applied to influenza A virus, hepatitis C virus, human immunodeficiency virus (HIV), dengue virus, rotavirus A, polyomavirus BK and measles, which span the positive and negative single-stranded, doubled-stranded and reverse transcribing RNA viruses, and double-stranded DNA viruses. From this analysis, host interaction proteins including hemagglutinin (influenza), and viroporins agnoprotein (polyomavirus), p7 (hepatitis C) and VPU (HIV) emerge as evolutionary hot-spots. By contrast, RNA-directed RNA polymerase proteins including L (measles), PB1/PB2 (influenza) and VP1 (rotavirus), and internal serine proteases such as NS3 (dengue and hepatitis C virus) emerge as evolutionary cold-spots. The hot spot influenza hemagglutinin protein is contrasted with the related cold spot H protein from measles. It is proposed that evolutionary cold-spot proteins can become significant targets for second-line anti-viral therapeutics, in cases where front-line vaccines are not available or have become ineffective due to mutations in the hot-spot, generally more antigenically exposed proteins. The MeaPED package is available from www.pam1.bcs.uwa.edu.au/~michaelw/ftp/src/meaped.tar.gz.

  3. Sequence- and interactome-based prediction of viral protein hotspots targeting host proteins: a case study for HIV Nef.

    Directory of Open Access Journals (Sweden)

    Mahdi Sarmady

    Full Text Available Virus proteins alter protein pathways of the host toward the synthesis of viral particles by breaking and making edges via binding to host proteins. In this study, we developed a computational approach to predict viral sequence hotspots for binding to host proteins based on sequences of viral and host proteins and literature-curated virus-host protein interactome data. We use a motif discovery algorithm repeatedly on collections of sequences of viral proteins and immediate binding partners of their host targets and choose only those motifs that are conserved on viral sequences and highly statistically enriched among binding partners of virus protein targeted host proteins. Our results match experimental data on binding sites of Nef to host proteins such as MAPK1, VAV1, LCK, HCK, HLA-A, CD4, FYN, and GNB2L1 with high statistical significance but is a poor predictor of Nef binding sites on highly flexible, hoop-like regions. Predicted hotspots recapture CD8 cell epitopes of HIV Nef highlighting their importance in modulating virus-host interactions. Host proteins potentially targeted or outcompeted by Nef appear crowding the T cell receptor, natural killer cell mediated cytotoxicity, and neurotrophin signaling pathways. Scanning of HIV Nef motifs on multiple alignments of hepatitis C protein NS5A produces results consistent with literature, indicating the potential value of the hotspot discovery in advancing our understanding of virus-host crosstalk.

  4. Protein Science by DNA Sequencing: How Advances in Molecular Biology Are Accelerating Biochemistry.

    Science.gov (United States)

    Higgins, Sean A; Savage, David F

    2018-01-09

    A fundamental goal of protein biochemistry is to determine the sequence-function relationship, but the vastness of sequence space makes comprehensive evaluation of this landscape difficult. However, advances in DNA synthesis and sequencing now allow researchers to assess the functional impact of every single mutation in many proteins, but challenges remain in library construction and the development of general assays applicable to a diverse range of protein functions. This Perspective briefly outlines the technical innovations in DNA manipulation that allow massively parallel protein biochemistry and then summarizes the methods currently available for library construction and the functional assays of protein variants. Areas in need of future innovation are highlighted with a particular focus on assay development and the use of computational analysis with machine learning to effectively traverse the sequence-function landscape. Finally, applications in the fundamentals of protein biochemistry, disease prediction, and protein engineering are presented.

  5. Microwave-assisted acid and base hydrolysis of intact proteins containing disulfide bonds for protein sequence analysis by mass spectrometry.

    Science.gov (United States)

    Reiz, Bela; Li, Liang

    2010-09-01

    Controlled hydrolysis of proteins to generate peptide ladders combined with mass spectrometric analysis of the resultant peptides can be used for protein sequencing. In this paper, two methods of improving the microwave-assisted protein hydrolysis process are described to enable rapid sequencing of proteins containing disulfide bonds and increase sequence coverage, respectively. It was demonstrated that proteins containing disulfide bonds could be sequenced by MS analysis by first performing hydrolysis for less than 2 min, followed by 1 h of reduction to release the peptides originally linked by disulfide bonds. It was shown that a strong base could be used as a catalyst for microwave-assisted protein hydrolysis, producing complementary sequence information to that generated by microwave-assisted acid hydrolysis. However, using either acid or base hydrolysis, amide bond breakages in small regions of the polypeptide chains of the model proteins (e.g., cytochrome c and lysozyme) were not detected. Dynamic light scattering measurement of the proteins solubilized in an acid or base indicated that protein-protein interaction or aggregation was not the cause of the failure to hydrolyze certain amide bonds. It was speculated that there were some unknown local structures that might play a role in preventing an acid or base from reacting with the peptide bonds therein. 2010 American Society for Mass Spectrometry. Published by Elsevier Inc. All rights reserved.

  6. The evolution of the protein synthesis system. I - A model of a primitive protein synthesis system

    Science.gov (United States)

    Mizutani, H.; Ponnamperuma, C.

    1977-01-01

    A model is developed to describe the evolution of the protein synthesis system. The model is comprised of two independent autocatalytic systems, one including one gene (A-gene) and two activated amino acid polymerases (O and A-polymerases), and the other including the addition of another gene (N-gene) and a nucleotide polymerase. Simulation results have suggested that even a small enzymic activity and polymerase specificity could lead the system to the most accurate protein synthesis, as far as permitted by transitions to systems with higher accuracy.

  7. Overview of the creative genome: effects of genome structure and sequence on the generation of variation and evolution.

    Science.gov (United States)

    Caporale, Lynn Helena

    2012-09-01

    This overview of a special issue of Annals of the New York Academy of Sciences discusses uneven distribution of distinct types of variation across the genome, the dependence of specific types of variation upon distinct classes of DNA sequences and/or the induction of specific proteins, the circumstances in which distinct variation-generating systems are activated, and the implications of this work for our understanding of evolution and of cancer. Also discussed is the value of non text-based computational methods for analyzing information carried by DNA, early insights into organizational frameworks that affect genome behavior, and implications of this work for comparative genomics. © 2012 New York Academy of Sciences.

  8. Protein and DNA sequence determinants of thermophilic adaptation.

    Directory of Open Access Journals (Sweden)

    Konstantin B Zeldovich

    2007-01-01

    Full Text Available There have been considerable attempts in the past to relate phenotypic trait--habitat temperature of organisms--to their genotypes, most importantly compositions of their genomes and proteomes. However, despite accumulation of anecdotal evidence, an exact and conclusive relationship between the former and the latter has been elusive. We present an exhaustive study of the relationship between amino acid composition of proteomes, nucleotide composition of DNA, and optimal growth temperature (OGT of prokaryotes. Based on 204 complete proteomes of archaea and bacteria spanning the temperature range from -10 degrees C to 110 degrees C, we performed an exhaustive enumeration of all possible sets of amino acids and found a set of amino acids whose total fraction in a proteome is correlated, to a remarkable extent, with the OGT. The universal set is Ile, Val, Tyr, Trp, Arg, Glu, Leu (IVYWREL, and the correlation coefficient is as high as 0.93. We also found that the G + C content in 204 complete genomes does not exhibit a significant correlation with OGT (R = -0.10. On the other hand, the fraction of A + G in coding DNA is correlated with temperature, to a considerable extent, due to codon patterns of IVYWREL amino acids. Further, we found strong and independent correlation between OGT and the frequency with which pairs of A and G nucleotides appear as nearest neighbors in genome sequences. This adaptation is achieved via codon bias. These findings present a direct link between principles of proteins structure and stability and evolutionary mechanisms of thermophylic adaptation. On the nucleotide level, the analysis provides an example of how nature utilizes codon bias for evolutionary adaptation to extreme conditions. Together these results provide a complete picture of how compositions of proteomes and genomes in prokaryotes adjust to the extreme conditions of the environment.

  9. Scoring protein relationships in functional interaction networks predicted from sequence data.

    Directory of Open Access Journals (Sweden)

    Gaston K Mazandu

    Full Text Available UNLABELLED: The abundance of diverse biological data from various sources constitutes a rich source of knowledge, which has the power to advance our understanding of organisms. This requires computational methods in order to integrate and exploit these data effectively and elucidate local and genome wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Thus, protein sequences and domains can be used to predict protein pair-wise functional relationships, and thus contribute to the function prediction process of uncharacterized proteins in order to ensure that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic based approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with a high confidence and coverage. We use the network for predicting functions of uncharacterised proteins. AVAILABILITY: Protein pair-wise functional relationship scores for Mycobacterium tuberculosis strain CDC1551 sequence data and python scripts to compute these scores are available at http://web.cbio.uct.ac.za/~gmazandu/scoringschemes.

  10. Nonlinear analysis of sequence symmetry of beta-trefoil family proteins

    Energy Technology Data Exchange (ETDEWEB)

    Li Mingfeng [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Huang Yanzhao [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Xu Ruizhen [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China); Xiao Yi [Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei (China)]. E-mail: yxiao@mail.hust.edu.cn

    2005-07-01

    The tertiary structures of proteins of beta-trefoil family have three-fold quasi-symmetry while their amino acid sequences appear almost at random. In the present paper we show that these amino acid sequences have hidden symmetries in fact and furthermore the degrees of these hidden symmetries are the same as those of their tertiary structures. We shall present a modified recurrence plot to reveal hidden symmetries in protein sequences. Our results can explain the contradiction in sequence-structure relations of proteins of beta-trefoil family.

  11. Targeted Sequencing of Venom Genes from Cone Snail Genomes Improves Understanding of Conotoxin Molecular Evolution.

    Science.gov (United States)

    Phuong, Mark A; Mahardika, Gusti N

    2018-05-01

    To expand our capacity to discover venom sequences from the genomes of venomous organisms, we applied targeted sequencing techniques to selectively recover venom gene superfamilies and nontoxin loci from the genomes of 32 cone snail species (family, Conidae), a diverse group of marine gastropods that capture their prey using a cocktail of neurotoxic peptides (conotoxins). We were able to successfully recover conotoxin gene superfamilies across all species with high confidence (> 100× coverage) and used these data to provide new insights into conotoxin evolution. First, we found that conotoxin gene superfamilies are composed of one to six exons and are typically short in length (mean = ∼85 bp). Second, we expanded our understanding of the following genetic features of conotoxin evolution: 1) positive selection, where exons coding the mature toxin region were often three times more divergent than their adjacent noncoding regions, 2) expression regulation, with comparisons to transcriptome data showing that cone snails only express a fraction of the genes available in their genome (24-63%), and 3) extensive gene turnover, where Conidae species varied from 120 to 859 conotoxin gene copies. Finally, using comparative phylogenetic methods, we found that while diet specificity did not predict patterns of conotoxin evolution, dietary breadth was positively correlated with total conotoxin gene diversity. Overall, the targeted sequencing technique demonstrated here has the potential to radically increase the pace at which venom gene families are sequenced and studied, reshaping our ability to understand the impact of genetic changes on ecologically relevant phenotypes and subsequent diversification.

  12. CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction

    KAUST Repository

    Cui, Xuefeng

    2016-06-15

    Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information. Method: We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence–structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration. Results: We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM–HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods.

  13. Tracing the Evolutionary History of the CAP Superfamily of Proteins Using Amino Acid Sequence Homology and Conservation of Splice Sites.

    Science.gov (United States)

    Abraham, Anup; Chandler, Douglas E

    2017-10-01

    Proteins of the CAP superfamily play numerous roles in reproduction, innate immune responses, cancer biology, and venom toxicology. Here we document the breadth of the CAP (Cysteine-RIch Secretory Protein (CRISP), Antigen 5, and Pathogenesis-Related) protein superfamily and trace the major events in its evolution using amino acid sequence homology and the positions of exon/intron borders within their genes. Seldom acknowledged in the literature, we find that many of the CAP subfamilies present in mammals, where they were originally characterized, have distinct homologues in the invertebrate phyla. Early eukaryotic CAP genes contained only one exon inherited from prokaryotic predecessors and as evolution progressed an increasing number of introns were inserted, reaching 2-5 in the invertebrate world and 5-15 in the vertebrate world. Focusing on the CRISP subfamily, we propose that these proteins evolved in three major steps: (1) origination of the CAP/PR/SCP domain in bacteria, (2) addition of a small Hinge domain to produce the two-domain SCP-like proteins found in roundworms and anthropoids, and (3) addition of an Ion Channel Regulatory domain, borrowed from invertebrate peptide toxins, to produce full length, three-domain CRISP proteins, first seen in insects and later to diversify into multiple subtypes in the vertebrate world.

  14. Comparative analysis of protein coding sequences from human, mouse and the domesticated pig

    DEFF Research Database (Denmark)

    Jørgensen, Frank Grønlund; Hobolth, Asger; Hornshøj, Henrik

    2005-01-01

    Background: The availability of abundant sequence data from key model organisms has made large scale studies of mulecular evolution an exciting possibility. Here we use full length cDNA alignments comprising more than 700,000 nucleotides from human, mouse, pig and the Japanese pufferfish Fugu rub...... rubrices in order to investigate 1) the relationships between three major lineages of mammals: rodents, artiodactys and primates, and 2) the rate of evolution and the occurrence of positive Darwinian selection using codon based models of sequence evolution. Results: We provide evidence...

  15. Progress in strategies for sequence diversity library creation for ...

    African Journals Online (AJOL)

    As the simplest technique of protein engineering, directed evolution has been ... An experiment of directed evolution comprises mutant libraries creation and ... evolution, sequence diversity creation, novel strategy, computational design, ...

  16. Protein-Protein Interactions Prediction Using a Novel Local Conjoint Triad Descriptor of Amino Acid Sequences

    Directory of Open Access Journals (Sweden)

    Jun Wang

    2017-11-01

    Full Text Available Protein-protein interactions (PPIs play crucial roles in almost all cellular processes. Although a large amount of PPIs have been verified by high-throughput techniques in the past decades, currently known PPIs pairs are still far from complete. Furthermore, the wet-lab experiments based techniques for detecting PPIs are time-consuming and expensive. Hence, it is urgent and essential to develop automatic computational methods to efficiently and accurately predict PPIs. In this paper, a sequence-based approach called DNN-LCTD is developed by combining deep neural networks (DNNs and a novel local conjoint triad description (LCTD feature representation. LCTD incorporates the advantage of local description and conjoint triad, thus, it is capable to account for the interactions between residues in both continuous and discontinuous regions of amino acid sequences. DNNs can not only learn suitable features from the data by themselves, but also learn and discover hierarchical representations of data. When performing on the PPIs data of Saccharomyces cerevisiae, DNN-LCTD achieves superior performance with accuracy as 93.12%, precision as 93.75%, sensitivity as 93.83%, area under the receiver operating characteristic curve (AUC as 97.92%, and it only needs 718 s. These results indicate DNN-LCTD is very promising for predicting PPIs. DNN-LCTD can be a useful supplementary tool for future proteomics study.

  17. Insights into the evolution of Darwin’s finches from comparative analysis of the Geospiza magnirostris genome sequence

    Directory of Open Access Journals (Sweden)

    Rands Chris M

    2013-02-01

    Full Text Available Abstract Background A classical example of repeated speciation coupled with ecological diversification is the evolution of 14 closely related species of Darwin’s (Galápagos finches (Thraupidae, Passeriformes. Their adaptive radiation in the Galápagos archipelago took place in the last 2–3 million years and some of the molecular mechanisms that led to their diversification are now being elucidated. Here we report evolutionary analyses of genome of the large ground finch, Geospiza magnirostris. Results 13,291 protein-coding genes were predicted from a 991.0 Mb G. magnirostris genome assembly. We then defined gene orthology relationships and constructed whole genome alignments between the G. magnirostris and other vertebrate genomes. We estimate that 15% of genomic sequence is functionally constrained between G. magnirostris and zebra finch. Genic evolutionary rate comparisons indicate that similar selective pressures acted along the G. magnirostris and zebra finch lineages suggesting that historical effective population size values have been similar in both lineages. 21 otherwise highly conserved genes were identified that each show evidence for positive selection on amino acid changes in the Darwin's finch lineage. Two of these genes (Igf2r and Pou1f1 have been implicated in beak morphology changes in Darwin’s finches. Five of 47 genes showing evidence of positive selection in early passerine evolution have cilia related functions, and may be examples of adaptively evolving reproductive proteins. Conclusions These results provide insights into past evolutionary processes that have shaped G. magnirostris genes and its genome, and provide the necessary foundation upon which to build population genomics resources that will shed light on more contemporaneous adaptive and non-adaptive processes that have contributed to the evolution of the Darwin’s finches.

  18. Role of accelerated segment switch in exons to alter targeting (ASSET in the molecular evolution of snake venom proteins

    Directory of Open Access Journals (Sweden)

    Kini R Manjunatha

    2009-06-01

    Full Text Available Abstract Background Snake venom toxins evolve more rapidly than other proteins through accelerated changes in the protein coding regions. Previously we have shown that accelerated segment switch in exons to alter targeting (ASSET might play an important role in its functional evolution of viperid three-finger toxins. In this phenomenon, short sequences in exons are radically changed to unrelated sequences and hence affect the folding and functional properties of the toxins. Results Here we analyzed other snake venom protein families to elucidate the role of ASSET in their functional evolution. ASSET appears to be involved in the functional evolution of three-finger toxins to a greater extent than in several other venom protein families. ASSET leads to replacement of some of the critical amino acid residues that affect the biological function in three-finger toxins as well as change the conformation of the loop that is involved in binding to specific target sites. Conclusion ASSET could lead to novel functions in snake venom proteins. Among snake venom serine proteases, ASSET contributes to changes in three surface segments. One of these segments near the substrate binding region is known to affect substrate specificity, and its exchange may have significant implications for differences in isoform catalytic activity on specific target protein substrates. ASSET therefore plays an important role in functional diversification of snake venom proteins, in addition to accelerated point mutations in the protein coding regions. Accelerated point mutations lead to fine-tuning of target specificity, whereas ASSET leads to large-scale replacement of multiple functionally important residues, resulting in change or gain of functions.

  19. Increasing genomic diversity and evidence of constrained lifestyle evolution due to insertion sequences in Aeromonas salmonicida.

    Science.gov (United States)

    Vincent, Antony T; Trudel, Mélanie V; Freschi, Luca; Nagar, Vandan; Gagné-Thivierge, Cynthia; Levesque, Roger C; Charette, Steve J

    2016-01-12

    Aeromonads make up a group of Gram-negative bacteria that includes human and fish pathogens. The Aeromonas salmonicida species has the peculiarity of including five known subspecies. However, few studies of the genomes of A. salmonicida subspecies have been reported to date. We sequenced the genomes of additional A. salmonicida isolates, including three from India, using next-generation sequencing in order to gain a better understanding of the genomic and phylogenetic links between A. salmonicida subspecies. Their relative phylogenetic positions were confirmed by a core genome phylogeny based on 1645 gene sequences. The Indian isolates, which formed a sub-group together with A. salmonicida subsp. pectinolytica, were able to grow at either at 18 °C and 37 °C, unlike the A. salmonicida psychrophilic isolates that did not grow at 37 °C. Amino acid frequencies, GC content, tRNA composition, loss and gain of genes during evolution, pseudogenes as well as genes under positive selection and the mobilome were studied to explain this intraspecies dichotomy. Insertion sequences appeared to be an important driving force that locked the psychrophilic strains into their particular lifestyle in order to conserve their genomic integrity. This observation, based on comparative genomics, is in agreement with previous results showing that insertion sequence mobility induced by heat in A. salmonicida subspecies causes genomic plasticity, resulting in a deleterious effect on the virulence of the bacterium. We provide a proof-of-concept that selfish DNAs play a major role in the evolution of bacterial species by modeling genomes.

  20. Data-driven modelling of protein synthesis : A sequence perspective

    NARCIS (Netherlands)

    Gritsenko, A.

    2017-01-01

    Recent advances in DNA sequencing, synthesis and genetic engineering have enabled the introduction of choice DNA sequences into living cells. This is an exciting prospect for the field of industrial biotechnology, which aims at using microorganisms to produce foods, beverages, pharmaceuticals and

  1. Biological sequence analysis: probabilistic models of proteins and nucleic acids

    National Research Council Canada - National Science Library

    Durbin, Richard

    1998-01-01

    ... analysis methods are now based on principles of probabilistic modelling. Examples of such methods include the use of probabilistically derived score matrices to determine the significance of sequence alignments, the use of hidden Markov models as the basis for profile searches to identify distant members of sequence families, and the inference...

  2. Comparative analysis of the prion protein gene sequences in African lion.

    Science.gov (United States)

    Wu, Chang-De; Pang, Wan-Yong; Zhao, De-Ming

    2006-10-01

    The prion protein gene of African lion (Panthera Leo) was first cloned and polymorphisms screened. The results suggest that the prion protein gene of eight African lions is highly homogenous. The amino acid sequences of the prion protein (PrP) of all samples tested were identical. Four single nucleotide polymorphisms (C42T, C81A, C420T, T600C) in the prion protein gene (Prnp) of African lion were found, but no amino acid substitutions. Sequence analysis showed that the higher homology is observed to felis catus AF003087 (96.7%) and to sheep number M31313.1 (96.2%) Genbank accessed. With respect to all the mammalian prion protein sequences compared, the African lion prion protein sequence has three amino acid substitutions. The homology might in turn affect the potential intermolecular interactions critical for cross species transmission of prion disease.

  3. Focal plane AIT sequence: evolution from HRG-Spot 5 to Pleiades HR

    Science.gov (United States)

    Le Goff, Roland; Pranyies, Pascal; Toubhans, Isabelle

    2017-11-01

    Optical and geometrical image qualities of Focal Planes, for "push-broom" high resolution remote sensing satellites, require the implementation of specific means and methods for the AIT sequence. Indeed the geometric performances of the focal plane mainly axial focusing and transverse registration, are duly obtained on the basis of adjustment, setting and measurement of optical and CCD components with an accuracy of a few microns. Since the end of the 1970s, EADS-SODERN has developed a series of detection units for earth observation instruments like SPOT and Helios. And EADS-SODERN is now responsible for the development of the Pleiades High Resolution Focal Plane assembly. This paper presents the AIT sequences. We introduce all the efforts, innovative solutions and improvements made on the assembly facilities to match the technical evolutions and breakthrough of the Pleiades HR FP concept in comparison with the previous High Resolution Geometric SPOT 5 Focal Plane. The main evolution drivers are the implementation of strip filters and the realization of 400 mm continuous retinas. For Pleiades HR AIT sequence, three specific integration and measuring benches, corresponding with the different assembly stages, are used: a 3-D non-contact measurement machine for the assembly of detection module, a 3-D measurement machine for mirror integration on the main Focal Plane SiC structure, and a 3-D geometric coordinates control bench to focus detection module lines and to ensure they are well registered together.

  4. Large Scale Sequencing of Dothideomycetes Provides Insights into Genome Evolution and Adaptation

    Energy Technology Data Exchange (ETDEWEB)

    Haridas, Sajeet; Crous, Pedro; Binder, Manfred; Spatafora, Joseph; Grigoriev, Igor

    2015-03-16

    Dothideomycetes is the largest and most diverse class of ascomycete fungi with 23 orders 110 families, 1300 genera and over 19,000 known species. We present comparative analysis of 70 Dothideomycete genomes including over 50 that we sequenced and are as yet unpublished. This extensive sampling has almost quadrupled the previous study of 18 species and uncovered a 10 fold range of genome sizes. We were able to clarify the phylogenetic positions of several species whose origins were unclear in previous morphological and sequence comparison studies. We analyzed selected gene families including proteases, transporters and small secreted proteins and show that major differences in gene content is influenced by speciation.

  5. Effects of Main-Sequence Mass Loss on Stellar and Galactic Chemical Evolution.

    Science.gov (United States)

    Guzik, Joyce Ann

    1988-06-01

    L. A. Willson, G. H. Bowen and C. Struck -Marcell have proposed that 1 to 3 solar mass stars may experience evolutionarily significant mass loss during the early part of their main-sequence phase. The suggested mass-loss mechanism is pulsation, facilitated by rapid rotation. Initial mass-loss rates may be as large as several times 10^{-9}M o/yr, diminishing over several times 10^8 years. We attempted to test this hypothesis by comparing some theoretical implications with observations. Three areas are addressed: Solar models, cluster HR diagrams, and galactic chemical evolution. Mass-losing solar models were evolved that match the Sun's luminosity and radius at its present age. The most extreme viable models have initial mass 2.0 M o, and mass-loss rates decreasing exponentially over 2-3 times 10^8 years. Compared to a constant -mass model, these models require a reduced initial ^4He abundance, have deeper envelope convection zones and higher ^8B neutrino fluxes. Early processing of present surface layers at higher interior temperatures increases the surface ^3He abundance, destroys Li, Be and B, and decreases the surface C/N ratio following first dredge-up. Evolution calculations incorporating main-sequence mass loss were completed for a grid of models with initial masses 1.25 to 2.0 Mo and mass loss timescales 0.2 to 2.0 Gyr. Cluster HR diagrams synthesized with these models confirm the potential for the hypothesis to explain observed spreads or bifurcations in the upper main sequence, blue stragglers, anomalous giants, and poor fits of main-sequence turnoffs by standard isochrones. Simple closed galactic chemical evolution models were used to test the effects of main-sequence mass loss on the F and G dwarf distribution. Stars between 3.0 M o and a metallicity -dependent lower mass are assumed to lose mass. The models produce a 30 to 60% increase in the stars to stars-plus -remnants ratio, with fewer early-F dwarfs and many more late-F dwarfs remaining on the main

  6. Species specificity in major urinary proteins by parallel evolution.

    Directory of Open Access Journals (Sweden)

    Darren W Logan

    Full Text Available Species-specific chemosignals, pheromones, regulate social behaviors such as aggression, mating, pup-suckling, territory establishment, and dominance. The identity of these cues remains mostly undetermined and few mammalian pheromones have been identified. Genetically-encoded pheromones are expected to exhibit several different mechanisms for coding 1 diversity, to enable the signaling of multiple behaviors, 2 dynamic regulation, to indicate age and dominance, and 3 species-specificity. Recently, the major urinary proteins (Mups have been shown to function themselves as genetically-encoded pheromones to regulate species-specific behavior. Mups are multiple highly related proteins expressed in combinatorial patterns that differ between individuals, gender, and age; which are sufficient to fulfill the first two criteria. We have now characterized and fully annotated the mouse Mup gene content in detail. This has enabled us to further analyze the extent of Mup coding diversity and determine their potential to encode species-specific cues.Our results show that the mouse Mup gene cluster is composed of two subgroups: an older, more divergent class of genes and pseudogenes, and a second class with high sequence identity formed by recent sequential duplications of a single gene/pseudogene pair. Previous work suggests that truncated Mup pseudogenes may encode a family of functional hexapeptides with the potential for pheromone activity. Sequence comparison, however, reveals that they have limited coding potential. Similar analyses of nine other completed genomes find Mup gene expansions in divergent lineages, including those of rat, horse and grey mouse lemur, occurring independently from a single ancestral Mup present in other placental mammals. Our findings illustrate that increasing genomic complexity of the Mup gene family is not evolutionarily isolated, but is instead a recurring mechanism of generating coding diversity consistent with a species

  7. Adaptive evolution of relish, a Drosophila NF-kappaB/IkappaB protein.

    OpenAIRE

    Begun, D J; Whitley, P

    2000-01-01

    NF-kappaB and IkappaB proteins have central roles in regulation of inflammation and innate immunity in mammals. Homologues of these proteins also play an important role in regulation of the Drosophila immune response. Here we present a molecular population genetic analysis of Relish, a Drosophila NF-kappaB/IkappaB protein, in Drosophila simulans and D. melanogaster. We find strong evidence for adaptive protein evolution in D. simulans, but not in D. melanogaster. The adaptive evolution appear...

  8. Convergent evolution of plant and animal embryo defences by hyperstable non-digestible storage proteins.

    Science.gov (United States)

    Pasquevich, María Yanina; Dreon, Marcos Sebastián; Qiu, Jian-Wen; Mu, Huawei; Heras, Horacio

    2017-11-20

    Plants have evolved sophisticated embryo defences by kinetically-stable non-digestible storage proteins that lower the nutritional value of seeds, a strategy that have not been reported in animals. To further understand antinutritive defences in animals, we analysed PmPV1, massively accumulated in the eggs of the gastropod Pomacea maculata, focusing on how its structure and structural stability features affected its capacity to withstand passage through predator guts. The native protein withstands >50 min boiling and resists the denaturing detergent sodium dodecyl sulphate (SDS), indicating an unusually high structural stability (i.e., kinetic stability). PmPV1 is highly resistant to in vitro proteinase digestion and displays structural stability between pH 2.0-12.0 and 25-85 °C. Furthermore, PmPV1 withstands in vitro and mice digestion and is recovered unchanged in faeces, supporting an antinutritive defensive function. Subunit sequence similarities suggest a common origin and tolerance to mutations. This is the first known animal genus that, like plant seeds, lowers the nutritional value of eggs by kinetically-stable non-digestible storage proteins that survive the gut of predators unaffected. The selective pressure of the harsh gastrointestinal environment would have favoured their appearance, extending by convergent evolution the presence of plant-like hyperstable antinutritive proteins to unattended reproductive stages in animals.

  9. A machine learning approach for the identification of odorant binding proteins from sequence-derived properties

    Directory of Open Access Journals (Sweden)

    Suganthan PN

    2007-09-01

    Full Text Available Abstract Background Odorant binding proteins (OBPs are believed to shuttle odorants from the environment to the underlying odorant receptors, for which they could potentially serve as odorant presenters. Although several sequence based search methods have been exploited for protein family prediction, less effort has been devoted to the prediction of OBPs from sequence data and this area is more challenging due to poor sequence identity between these proteins. Results In this paper, we propose a new algorithm that uses Regularized Least Squares Classifier (RLSC in conjunction with multiple physicochemical properties of amino acids to predict odorant-binding proteins. The algorithm was applied to the dataset derived from Pfam and GenDiS database and we obtained overall prediction accuracy of 97.7% (94.5% and 98.4% for positive and negative classes respectively. Conclusion Our study suggests that RLSC is potentially useful for predicting the odorant binding proteins from sequence-derived properties irrespective of sequence similarity. Our method predicts 92.8% of 56 odorant binding proteins non-homologous to any protein in the swissprot database and 97.1% of the 414 independent dataset proteins, suggesting the usefulness of RLSC method for facilitating the prediction of odorant binding proteins from sequence information.

  10. Numerical Simulation of Stress evolution and earthquake sequence of the Tibetan Plateau

    Science.gov (United States)

    Dong, Peiyu; Hu, Caibo; Shi, Yaolin

    2015-04-01

    The India-Eurasia's collision produces N-S compression and results in large thrust fault in the southern edge of the Tibetan Plateau. Differential eastern flow of the lower crust of the plateau leads to large strike-slip faults and normal faults within the plateau. From 1904 to 2014, more than 30 earthquakes of Mw > 6.5 occurred sequentially in this distinctive tectonic environment. How did the stresses evolve during the last 110 years, how did the earthquakes interact with each other? Can this knowledge help us to forecast the future seismic hazards? In this essay, we tried to simulate the evolution of the stress field and the earthquake sequence in the Tibetan plateau within the last 110 years with a 2-D finite element model. Given an initial state of stress, the boundary condition was constrained by the present-day GPS observation, which was assumed as a constant rate during the 110 years. We calculated stress evolution year by year, and earthquake would occur if stress exceed the crustal strength. Stress changes due to each large earthquake in the sequence was calculated and contributed to the stress evolution. A key issue is the choice of initial stress state of the modeling, which is actually unknown. Usually, in the study of earthquake triggering, people assume the initial stress is zero, and only calculate the stress changes by large earthquakes - the Coulomb failure stress changes (Δ CFS). To some extent, this simplified method is a powerful tool because it can reveal which fault or which part of a fault becomes more risky or safer relatively. Nonetheless, it has not utilized all information available to us. The earthquake sequence reveals, though far from complete, some information about the stress state in the region. If the entire region is close to a self-organized critical or subcritical state, earthquake stress drop provides an estimate of lower limit of initial state. For locations no earthquakes occurred during the period, initial stress has to be

  11. Chaos game representation of functional protein sequences, and simulation and multifractal analysis of induced measures

    International Nuclear Information System (INIS)

    Zu-Guo, Yu; Qian-Jun, Xiao; Long, Shi; Jun-Wu, Yu; Anh, Vo

    2010-01-01

    Investigating the biological function of proteins is a key aspect of protein studies. Bioinformatic methods become important for studying the biological function of proteins. In this paper, we first give the chaos game representation (CGR) of randomly-linked functional protein sequences, then propose the use of the recurrent iterated function systems (RIFS) in fractal theory to simulate the measure based on their chaos game representations. This method helps to extract some features of functional protein sequences, and furthermore the biological functions of these proteins. Then multifractal analysis of the measures based on the CGRs of randomly-linked functional protein sequences are performed. We find that the CGRs have clear fractal patterns. The numerical results show that the RIFS can simulate the measure based on the CGR very well. The relative standard error and the estimated probability matrix in the RIFS do not depend on the order to link the functional protein sequences. The estimated probability matrices in the RIFS with different biological functions are evidently different. Hence the estimated probability matrices in the RIFS can be used to characterise the difference among linked functional protein sequences with different biological functions. From the values of the D q curves, one sees that these functional protein sequences are not completely random. The D q of all linked functional proteins studied are multifractal-like and sufficiently smooth for the C q (analogous to specific heat) curves to be meaningful. Furthermore, the D q curves of the measure μ based on their CGRs for different orders to link the functional protein sequences are almost identical if q ≥ 0. Finally, the C q curves of all linked functional proteins resemble a classical phase transition at a critical point. (cross-disciplinary physics and related areas of science and technology)

  12. Evolution of plant virus movement proteins from the 30K superfamily and of their homologs integrated in plant genomes

    Energy Technology Data Exchange (ETDEWEB)

    Mushegian, Arcady R., E-mail: mushegian2@gmail.com [Division of Molecular and Cellular Biosciences, National Science Foundation, 4201 Wilson Boulevard, Arlington, VA 22230 (United States); Elena, Santiago F., E-mail: sfelena@ibmcp.upv.es [Instituto de Biología Molecular y Celular de Plantas, CSIC-UPV, 46022 València (Spain); The Santa Fe Institute, Santa Fe, NM 87501 (United States)

    2015-02-15

    Homologs of Tobacco mosaic virus 30K cell-to-cell movement protein are encoded by diverse plant viruses. Mechanisms of action and evolutionary origins of these proteins remain obscure. We expand the picture of conservation and evolution of the 30K proteins, producing sequence alignment of the 30K superfamily with the broadest phylogenetic coverage thus far and illuminating structural features of the core all-beta fold of these proteins. Integrated copies of pararetrovirus 30K movement genes are prevalent in euphyllophytes, with at least one copy intact in nearly every examined species, and mRNAs detected for most of them. Sequence analysis suggests repeated integrations, pseudogenizations, and positive selection in those provirus genes. An unannotated 30K-superfamily gene in Arabidopsis thaliana genome is likely expressed as a fusion with the At1g37113 transcript. This molecular background of endopararetrovirus gene products in plants may change our view of virus infection and pathogenesis, and perhaps of cellular homeostasis in the hosts. - Highlights: • Sequence region shared by plant virus “30K” movement proteins has an all-beta fold. • Most euphyllophyte genomes contain integrated copies of pararetroviruses. • These integrated virus genomes often include intact movement protein genes. • Molecular evidence suggests that these “30K” genes may be selected for function.

  13. Conserved hypothetical protein Rv1977 in Mycobacterium tuberculosis strains contains sequence polymorphisms and might be involved in ongoing immune evasion.

    Science.gov (United States)

    Jiang, Yi; Liu, Haican; Wang, Xuezhi; Li, Guilian; Qiu, Yan; Dou, Xiangfeng; Wan, Kanglin

    2015-01-01

    Host immune pressure and associated parasite immune evasion are key features of host-pathogen co-evolution. A previous study showed that human T cell epitopes of Mycobacterium tuberculosis are evolutionarily hyperconserved and thus it was deduced that M. tuberculosis lacks antigenic variation and immune evasion. Here, we selected 151 clinical Mycobacterium tuberculosis isolates from China, amplified gene encoding Rv1977 and compared the sequences. The results showed that Rv1977, a conserved hypothetical protein, is not conserved in M. tuberculosis strains and there are polymorphisms existed in the protein. Some mutations, especially one frameshift mutation, occurred in the antigen Rv1977, which is uncommon in M.tb strains and may lead to the protein function altering. Mutations and deletion in the gene all affect one of three T cell epitopes and the changed T cell epitope contained more than one variable position, which may suggest ongoing immune evasion.

  14. In Silico Characterization of Pectate Lyase Protein Sequences from Different Source Organisms

    Directory of Open Access Journals (Sweden)

    Amit Kumar Dubey

    2010-01-01

    Full Text Available A total of 121 protein sequences of pectate lyases were subjected to homology search, multiple sequence alignment, phylogenetic tree construction, and motif analysis. The phylogenetic tree constructed revealed different clusters based on different source organisms representing bacterial, fungal, plant, and nematode pectate lyases. The multiple accessions of bacterial, fungal, nematode, and plant pectate lyase protein sequences were placed closely revealing a sequence level similarity. The multiple sequence alignment of these pectate lyase protein sequences from different source organisms showed conserved regions at different stretches with maximum homology from amino acid residues 439–467, 715–816, and 829–910 which could be used for designing degenerate primers or probes specific for pectate lyases. The motif analysis revealed a conserved Pec_Lyase_C domain uniformly observed in all pectate lyases irrespective of variable sources suggesting its possible role in structural and enzymatic functions.

  15. Peptide Pattern Recognition for high-throughput protein sequence analysis and clustering

    DEFF Research Database (Denmark)

    Busk, Peter Kamp

    2017-01-01

    Large collections of protein sequences with divergent sequences are tedious to analyze for understanding their phylogenetic or structure-function relation. Peptide Pattern Recognition is an algorithm that was developed to facilitate this task but the previous version does only allow a limited...... number of sequences as input. I implemented Peptide Pattern Recognition as a multithread software designed to handle large numbers of sequences and perform analysis in a reasonable time frame. Benchmarking showed that the new implementation of Peptide Pattern Recognition is twenty times faster than...... the previous implementation on a small protein collection with 673 MAP kinase sequences. In addition, the new implementation could analyze a large protein collection with 48,570 Glycosyl Transferase family 20 sequences without reaching its upper limit on a desktop computer. Peptide Pattern Recognition...

  16. On the relationship between residue structural environment and sequence conservation in proteins.

    Science.gov (United States)

    Liu, Jen-Wei; Lin, Jau-Ji; Cheng, Chih-Wen; Lin, Yu-Feng; Hwang, Jenn-Kang; Huang, Tsun-Tsao

    2017-09-01

    Residues that are crucial to protein function or structure are usually evolutionarily conserved. To identify the important residues in protein, sequence conservation is estimated, and current methods rely upon the unbiased collection of homologous sequences. Surprisingly, our previous studies have shown that the sequence conservation is closely correlated with the weighted contact number (WCN), a measure of packing density for residue's structural environment, calculated only based on the C α positions of a protein structure. Moreover, studies have shown that sequence conservation is correlated with environment-related structural properties calculated based on different protein substructures, such as a protein's all atoms, backbone atoms, side-chain atoms, or side-chain centroid. To know whether the C α atomic positions are adequate to show the relationship between residue environment and sequence conservation or not, here we compared C α atoms with other substructures in their contributions to the sequence conservation. Our results show that C α positions are substantially equivalent to the other substructures in calculations of various measures of residue environment. As a result, the overlapping contributions between C α atoms and the other substructures are high, yielding similar structure-conservation relationship. Take the WCN as an example, the average overlapping contribution to sequence conservation is 87% between C α and all-atom substructures. These results indicate that only C α atoms of a protein structure could reflect sequence conservation at the residue level. © 2017 Wiley Periodicals, Inc.

  17. UET: a database of evolutionarily-predicted functional determinants of protein sequences that cluster as functional sites in protein structures.

    Science.gov (United States)

    Lua, Rhonald C; Wilson, Stephen J; Konecki, Daniel M; Wilkins, Angela D; Venner, Eric; Morgan, Daniel H; Lichtarge, Olivier

    2016-01-04

    The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  18. Rapid evolution of the env gene leader sequence in cats naturally infected with feline immunodeficiency virus

    Science.gov (United States)

    Hughes, Joseph; Biek, Roman; Litster, Annette; Willett, Brian J.; Hosie, Margaret J.

    2015-01-01

    Analysing the evolution of feline immunodeficiency virus (FIV) at the intra-host level is important in order to address whether the diversity and composition of viral quasispecies affect disease progression. We examined the intra-host diversity and the evolutionary rates of the entire env and structural fragments of the env sequences obtained from sequential blood samples in 43 naturally infected domestic cats that displayed different clinical outcomes. We observed in the majority of cats that FIV env showed very low levels of intra-host diversity. We estimated that env evolved at a rate of 1.16×10−3 substitutions per site per year and demonstrated that recombinant sequences evolved faster than non-recombinant sequences. It was evident that the V3–V5 fragment of FIV env displayed higher evolutionary rates in healthy cats than in those with terminal illness. Our study provided the first evidence that the leader sequence of env, rather than the V3–V5 sequence, had the highest intra-host diversity and the highest evolutionary rate of all env fragments, consistent with this region being under a strong selective pressure for genetic variation. Overall, FIV env displayed relatively low intra-host diversity and evolved slowly in naturally infected cats. The maximum evolutionary rate was observed in the leader sequence of env. Although genetic stability is not necessarily a prerequisite for clinical stability, the higher genetic stability of FIV compared with human immunodeficiency virus might explain why many naturally infected cats do not progress rapidly to AIDS. PMID:25535323

  19. Relation of chromospheric activity to convection, rotation, and pre-main-sequence evolution

    International Nuclear Information System (INIS)

    Gilliland, R.L.

    1986-01-01

    Pre-main-sequence, or T Tauri, stars are characterized by much larger fluxes of nonradiative origin than their main-sequence counterparts. As a class, the T Tauri stars have only moderate rotation rates, making an explanation of their chromospheric properties based on rapid rotation problematic. The recent success of correlating nonradiative fluxes to the Rossby number, Ro = P/sub rot//tau/sub conv/, a central parameter of simple dynamo theories of magnetic field generation, has led to the suggestion that the same relation might be of use in explaining the pre-main-sequence (PMS) stars if tau/sub conv/ is very large. We show that tau/sub conv/ does depend strongly on evolutionary effects above the main sequence (MS), but that this dependence alone cannot account for the high observed nonradiative fluxes. The acoustic flux is also strongly dependent on PMS evolutionary state, and when coupled to the parameterization of magnetic activity based on Ro, these two mechanisms seem capable of explaining the high observed level of chromospheric activity in T Tauri stars. The moment of inertia decreases by two to three order of magnitude during PMS evolution. Since young MS stars do not rotate two to three orders of magnitude faster than PMS stars, rapid loss or redistribution of angular momentum must occur

  20. Simulating evolution of protein complexes through gene duplication and co-option.

    Science.gov (United States)

    Haarsma, Loren; Nelesen, Serita; VanAndel, Ethan; Lamine, James; VandeHaar, Peter

    2016-06-21

    We present a model of the evolution of protein complexes with novel functions through gene duplication, mutation, and co-option. Under a wide variety of input parameters, digital organisms evolve complexes of 2-5 bound proteins which have novel functions but whose component proteins are not independently functional. Evolution of complexes with novel functions happens more quickly as gene duplication rates increase, point mutation rates increase, protein complex functional probability increases, protein complex functional strength increases, and protein family size decreases. Evolution of complexity is inhibited when the metabolic costs of making proteins exceeds the fitness gain of having functional proteins, or when point mutation rates get so large the functional proteins undergo deleterious mutations faster than new functional complexes can evolve. Copyright © 2016 Elsevier Ltd. All rights reserved.

  1. Fast computational methods for predicting protein structure from primary amino acid sequence

    Science.gov (United States)

    Agarwal, Pratul Kumar [Knoxville, TN

    2011-07-19

    The present invention provides a method utilizing primary amino acid sequence of a protein, energy minimization, molecular dynamics and protein vibrational modes to predict three-dimensional structure of a protein. The present invention also determines possible intermediates in the protein folding pathway. The present invention has important applications to the design of novel drugs as well as protein engineering. The present invention predicts the three-dimensional structure of a protein independent of size of the protein, overcoming a significant limitation in the prior art.

  2. Molecular evolution of a-kinase anchoring protein (AKAP-7: implications in comparative PKA compartmentalization

    Directory of Open Access Journals (Sweden)

    Johnson Keven R

    2012-07-01

    Full Text Available Abstract Background A-Kinase Anchoring Proteins (AKAPs are molecular scaffolding proteins mediating the assembly of multi-protein complexes containing cAMP-dependent protein kinase A (PKA, directing the kinase in discrete subcellular locations. Splice variants from the AKAP7 gene (AKAP15/18 are vital components of neuronal and cardiac phosphatase complexes, ion channels, cardiac Ca2+ handling and renal water transport. Results Shown in evolutionary analyses, the formation of the AKAP7-RI/RII binding domain (required for AKAP/PKA-R interaction corresponds to vertebrate-specific gene duplication events in the PKA-RI/RII subunits. Species analyses of AKAP7 splice variants shows the ancestral AKAP7 splice variant is AKAP7α, while the ancestral long form AKAP7 splice variant is AKAP7γ. Multi-species AKAP7 gene alignments, show the recent formation of AKAP7δ occurs with the loss of native AKAP7γ in rats and basal primates. AKAP7 gene alignments and two dimensional Western analyses indicate that AKAP7γ is produced from an internal translation-start site that is present in the AKAP7δ cDNA of mice and humans but absent in rats. Immunofluorescence analysis of AKAP7 protein localization in both rat and mouse heart suggests AKAP7γ replaces AKAP7δ at the cardiac sarcoplasmic reticulum in species other than rat. DNA sequencing identified Human AKAP7δ insertion-deletions (indels that promote the production of AKAP7γ instead of AKAP7δ. Conclusions This AKAP7 molecular evolution study shows that these vital scaffolding proteins developed in ancestral vertebrates and that independent mutations in the AKAP7 genes of rodents and early primates has resulted in the recent formation of AKAP7δ, a splice variant of likely lesser importance in humans than currently described.

  3. Nucleotide sequences of immunoglobulin eta genes of chimpanzee and orangutan: DNA molecular clock and hominoid evolution

    Energy Technology Data Exchange (ETDEWEB)

    Sakoyama, Y.; Hong, K.J.; Byun, S.M.; Hisajima, H.; Ueda, S.; Yaoita, Y.; Hayashida, H.; Miyata, T.; Honjo, T.

    1987-02-01

    To determine the phylogenetic relationships among hominoids and the dates of their divergence, the complete nucleotide sequences of the constant region of the immunoglobulin eta-chain (C/sub eta1/) genes from chimpanzee and orangutan have been determined. These sequences were compared with the human eta-chain constant-region sequence. A molecular clock (silent molecular clock), measured by the degree of sequence divergence at the synonymous (silent) positions of protein-encoding regions, was introduced for the present study. From the comparison of nucleotide sequences of ..cap alpha../sub 1/-antitrypsin and ..beta..- and delta-globulin genes between humans and Old World monkeys, the silent molecular clock was calibrated: the mean evolutionary rate of silent substitution was determined to be 1.56 x 10/sup -9/ substitutions per site per year. Using the silent molecular clock, the mean divergence dates of chimpanzee and orangutan from the human lineage were estimated as 6.4 +/- 2.6 million years and 17.3 +/- 4.5 million years, respectively. It was also shown that the evolutionary rate of primate genes is considerably slower than those of other mammalian genes.

  4. Accelerated evolution of innate immunity proteins in social insects: adaptive evolution or relaxed constraint?

    Science.gov (United States)

    Harpur, Brock A; Zayed, Amro

    2013-07-01

    The genomes of eusocial insects have a reduced complement of immune genes-an unusual finding considering that sociality provides ideal conditions for disease transmission. The following three hypotheses have been invoked to explain this finding: 1) social insects are attacked by fewer pathogens, 2) social insects have effective behavioral or 3) novel molecular mechanisms for combating pathogens. At the molecular level, these hypotheses predict that canonical innate immune pathways experience a relaxation of selective constraint. A recent study of several innate immune genes in ants and bees showed a pattern of accelerated amino acid evolution, which is consistent with either positive selection or a relaxation of constraint. We studied the population genetics of innate immune genes in the honey bee Apis mellifera by partially sequencing 13 genes from the bee's Toll pathway (∼10.5 kb) and 20 randomly chosen genes (∼16.5 kb) sequenced in 43 diploid workers. Relative to the random gene set, Toll pathway genes had significantly higher levels of amino acid replacement mutations segregating within A. mellifera and fixed between A. mellifera and A. cerana. However, levels of diversity and divergence at synonymous sites did not differ between the two gene sets. Although we detect strong signs of balancing selection on the pathogen recognition gene pgrp-sa, many of the genes in the Toll pathway show signatures of relaxed selective constraint. These results are consistent with the reduced complement of innate immune genes found in social insects and support the hypothesis that some aspect of eusociality renders canonical innate immunity superfluous.

  5. Nearly Complete 28S rRNA Gene Sequences Confirm New Hypotheses of Sponge Evolution

    Science.gov (United States)

    Thacker, Robert W.; Hill, April L.; Hill, Malcolm S.; Redmond, Niamh E.; Collins, Allen G.; Morrow, Christine C.; Spicer, Lori; Carmack, Cheryl A.; Zappe, Megan E.; Pohlmann, Deborah; Hall, Chelsea; Diaz, Maria C.; Bangalore, Purushotham V.

    2013-01-01

    The highly collaborative research sponsored by the NSF-funded Assembling the Porifera Tree of Life (PorToL) project is providing insights into some of the most difficult questions in metazoan systematics. Our understanding of phylogenetic relationships within the phylum Porifera has changed considerably with increased taxon sampling and data from additional molecular markers. PorToL researchers have falsified earlier phylogenetic hypotheses, discovered novel phylogenetic alliances, found phylogenetic homes for enigmatic taxa, and provided a more precise understanding of the evolution of skeletal features, secondary metabolites, body organization, and symbioses. Some of these exciting new discoveries are shared in the papers that form this issue of Integrative and Comparative Biology. Our analyses of over 300 nearly complete 28S ribosomal subunit gene sequences provide specific case studies that illustrate how our dataset confirms new hypotheses of sponge evolution. We recovered monophyletic clades for all 4 classes of sponges, as well as the 4 major clades of Demospongiae (Keratosa, Myxospongiae, Haploscleromorpha, and Heteroscleromorpha), but our phylogeny differs in several aspects from traditional classifications. In most major clades of sponges, families within orders appear to be paraphyletic. Although additional sampling of genes and taxa are needed to establish whether this pattern results from a lack of phylogenetic resolution or from a paraphyletic classification system, many of our results are congruent with those obtained from 18S ribosomal subunit gene sequences and complete mitochondrial genomes. These data provide further support for a revision of the traditional classification of sponges. PMID:23748742

  6. Phylogeny, character evolution, and biogeography of Cuscuta (dodders; Convolvulaceae) inferred from coding plastid and nuclear sequences.

    Science.gov (United States)

    García, Miguel A; Costea, Mihai; Kuzmina, Maria; Stefanović, Saša

    2014-04-01

    The parasitic genus Cuscuta, containing some 200 species circumscribed traditionally in three subgenera, is nearly cosmopolitan, occurring in a wide range of habitats and hosts. Previous molecular studies, on subgenera Grammica and Cuscuta, delimited major clades within these groups. However, the sequences used were unalignable among subgenera, preventing the phylogenetic comparison across the genus. We conducted a broad phylogenetic study using rbcL and nrLSU sequences covering the morphological, physiological, and geographical diversity of Cuscuta. We used parsimony methods to reconstruct ancestral states for taxonomically important characters. Biogeographical inferences were obtained using statistical and Bayesian approaches. Four well-supported major clades are resolved. Two of them correspond to subgenera Monogynella and Grammica. Subgenus Cuscuta is paraphyletic, with section Pachystigma sister to subgenus Grammica. Previously described cases of strongly supported discordance between plastid and nuclear phylogenies, interpreted as reticulation events, are confirmed here and three new cases are detected. Dehiscent fruits and globose stigmas are inferred as ancestral character states, whereas the ancestral style number is ambiguous. Biogeographical reconstructions suggest an Old World origin for the genus and subsequent spread to the Americas as a consequence of one long-distance dispersal. Hybridization may play an important yet underestimated role in the evolution of Cuscuta. Our results disagree with scenarios of evolution (polarity) previously proposed for several taxonomically important morphological characters, and with their usage and significance. While several cases of long-distance dispersal are inferred, vicariance or dispersal to adjacent areas emerges as the dominant biogeographical pattern.

  7. Impacts of WIMP dark matter upon stellar evolution: main-sequence stars

    CERN Document Server

    Scott, Pat; Edsjo, Joakim

    2008-01-01

    The presence of large amounts of WIMP dark matter in stellar cores has been shown to have significant effects upon models of stellar evolution. We present a series of detailed grids of WIMP-influenced stellar models for main sequence stars, computed using the DarkStars code. We describe the changes in stellar structure and main sequence evolution which occur for masses ranging from 0.3 to 2.0 solar masses and metallicities from Z = 0.0003-0.02, as a function of the rate of energy injection by WIMPs. We then go on to show what rates of energy injection can be obtained using realistic orbital parameters for stars near supermassive black holes, including detailed considerations of dark matter halo velocity and density profiles. Capture and annihilation rates are strongly boosted when stars follow elliptical rather than circular orbits, causing WIMP annihilation to provide up to 100 times the energy of hydrogen fusion in stars at the Galactic centre.

  8. High-resolution sequence stratigraphy and continental environmental evolution: An example from east-central Argentina

    Science.gov (United States)

    Beilinson, Elisa; Veiga, Gonzalo D.; Spalletti, Luis A.

    2013-10-01

    The aims of this contribution is to establish a high-resolution sequence stratigraphic scheme for the continental deposits that constitute the Punta San Andrés Alloformation (Plio-Pleistocene) in east-central Argentina, to analyze the basin fill evolution and to identify and assess the role that extrinsic factors such as climate and sea-level oscillations played during evolution of the unit. For the high-resolution sequence stratigraphical study of the Punta San Andrés Alloformation, high- and low-accommodation system tracts were defined mainly on the basis of the architectural elements present in the succession, also taking into account the relative degree of channel and floodplain deposits. Discontinuities and the nature of depositional systems generated during variations in accommodation helped identify two fourth-order high-accommodation system tracts and two fourth-order low-accommodation system tracts. At a third-order scale, the Punta San Andrés Alloformation may be interpreted as the progradation of continental depositional systems, characterized by a braided system in the proximal areas, and a low-sinuosity, single-channel system in the distal areas, defined by a high rate of sediment supply and discharge peaks which periodically flooded the plains and generated high aggradation rates during the late Pliocene and lower Pleistocene.

  9. Genome Sequences of Marine Shrimp Exopalaemon carinicauda Holthuis Provide Insights into Genome Size Evolution of Caridea.

    Science.gov (United States)

    Yuan, Jianbo; Gao, Yi; Zhang, Xiaojun; Wei, Jiankai; Liu, Chengzhang; Li, Fuhua; Xiang, Jianhai

    2017-07-05

    Crustacea, particularly Decapoda, contains many economically important species, such as shrimps and crabs. Crustaceans exhibit enormous (nearly 500-fold) variability in genome size. However, limited genome resources are available for investigating these species. Exopalaemon carinicauda Holthuis, an economical caridean shrimp, is a potential ideal experimental animal for research on crustaceans. In this study, we performed low-coverage sequencing and de novo assembly of the E. carinicauda genome. The assembly covers more than 95% of coding regions. E. carinicauda possesses a large complex genome (5.73 Gb), with size twice higher than those of many decapod shrimps. As such, comparative genomic analyses were implied to investigate factors affecting genome size evolution of decapods. However, clues associated with genome duplication were not identified, and few horizontally transferred sequences were detected. Ultimately, the burst of transposable elements, especially retrotransposons, was determined as the major factor influencing genome expansion. A total of 2 Gb repeats were identified, and RTE-BovB, Jockey, Gypsy, and DIRS were the four major retrotransposons that significantly expanded. Both recent (Jockey and Gypsy) and ancestral (DIRS) originated retrotransposons responsible for the genome evolution. The E. carinicauda genome also exhibited potential for the genomic and experimental research of shrimps.

  10. Seeing the trees through the forest : sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest

    NARCIS (Netherlands)

    Hou, Qingzhen; De Geest, Paul F.G.; Vranken, Wim F.; Heringa, Jaap; Feenstra, K. Anton

    2017-01-01

    Motivation: Genome sequencing is producing an ever-increasing amount of associated protein sequences. Few of these sequences have experimentally validated annotations, however, and computational predictions are becoming increasingly successful in producing such annotations. One key challenge remains

  11. Solving Classification Problems for Large Sets of Protein Sequences with the Example of Hox and ParaHox Proteins

    Directory of Open Access Journals (Sweden)

    Stefanie D. Hueber

    2016-02-01

    Full Text Available Phylogenetic methods are key to providing models for how a given protein family evolved. However, these methods run into difficulties when sequence divergence is either too low or too high. Here, we provide a case study of Hox and ParaHox proteins so that additional insights can be gained using a new computational approach to help solve old classification problems. For two (Gsx and Cdx out of three ParaHox proteins the assignments differ between the currently most established view and four alternative scenarios. We use a non-phylogenetic, pairwise-sequence-similarity-based method to assess which of the previous predictions, if any, are best supported by the sequence-similarity relationships between Hox and ParaHox proteins. The overall sequence-similarities show Gsx to be most similar to Hox2–3, and Cdx to be most similar to Hox4–8. The results indicate that a purely pairwise-sequence-similarity-based approach can provide additional information not only when phylogenetic inference methods have insufficient information to provide reliable classifications (as was shown previously for central Hox proteins, but also when the sequence variation is so high that the resulting phylogenetic reconstructions are likely plagued by long-branch-attraction artifacts.

  12. Homology analyses of the protein sequences of fatty acid synthases from chicken liver, rat mammary gland, and yeast

    International Nuclear Information System (INIS)

    Chang, Soo-Ik; Hammes, G.G.

    1989-01-01

    Homology analyses of the protein sequences of chicken liver and rat mammary gland fatty acid synthases were carried out. The amino acid sequences of the chicken and rat enzymes are 67% identical. If conservative substitutions are allowed, 78% of the amino acids are matched. A region of low homologies exists between the functional domains, in particular around amino acid residues 1059-1264 of the chicken enzyme. Homologies between the active sites of chicken and rat and of chicken and yeast enzymes have been analyzed by an alignment method. A high degree of homology exists between the active sites of the chicken and rat enzymes. However, the chicken and yeast enzymes show a lower degree of homology. The DADPH-binding dinucleotide folds of the β-ketoacyl reductase and the enoyl reductase sites were identified by comparison with a known consensus sequence for the DADP- and FAD-binding dinucleotide folds. The active sites of all of the enzymes are primarily in hydrophobic regions of the protein. This study suggests that the genes for the functional domains of fatty acid synthase were originally separated, and these genes were connected to each other by using different connecting nucleotide sequences in different species. An alternative explanation for the differences in rat and chicken is a common ancestry and mutations in the joining regions during evolution

  13. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier

    KAUST Repository

    Kulmanov, Maxat; Khan, Mohammed Asif; Hoehndorf, Robert

    2017-01-01

    A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often

  14. Using the Relevance Vector Machine Model Combined with Local Phase Quantization to Predict Protein-Protein Interactions from Protein Sequences

    Directory of Open Access Journals (Sweden)

    Ji-Yong An

    2016-01-01

    Full Text Available We propose a novel computational method known as RVM-LPQ that combines the Relevance Vector Machine (RVM model and Local Phase Quantization (LPQ to predict PPIs from protein sequences. The main improvements are the results of representing protein sequences using the LPQ feature representation on a Position Specific Scoring Matrix (PSSM, reducing the influence of noise using a Principal Component Analysis (PCA, and using a Relevance Vector Machine (RVM based classifier. We perform 5-fold cross-validation experiments on Yeast and Human datasets, and we achieve very high accuracies of 92.65% and 97.62%, respectively, which is significantly better than previous works. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM classifier on the Yeast dataset. The experimental results demonstrate that our RVM-LPQ method is obviously better than the SVM-based method. The promising experimental results show the efficiency and simplicity of the proposed method, which can be an automatic decision support tool for future proteomics research.

  15. Protein sequence annotation in the genome era: the annotation concept of SWISS-PROT+TREMBL.

    Science.gov (United States)

    Apweiler, R; Gateau, A; Contrino, S; Martin, M J; Junker, V; O'Donovan, C; Lang, F; Mitaritonna, N; Kappus, S; Bairoch, A

    1997-01-01

    SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy and high level of integration with other databases. Ongoing genome sequencing projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT. TREMBL consists of computer-annotated entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT. While TREMBL is already of immense value, its computer-generated annotation does not match the quality of SWISS-PROTs. The main difference is in the protein functional information attached to sequences. With this in mind, we are dedicating substantial effort to develop and apply computer methods to enhance the functional information attached to TREMBL entries.

  16. Analysis of long-range correlation in sequences data of proteins

    Directory of Open Access Journals (Sweden)

    ADRIANA ISVORAN

    2007-04-01

    Full Text Available The results presented here suggest the existence of correlations in the sequence data of proteins. 32 proteins, both globular and fibrous, both monomeric and polymeric, were analyzed. The primary structures of these proteins were treated as time series. Three spatial series of data for each sequence of a protein were generated from numerical correspondences between each amino acid and a physical property associated with it, i.e., its electric charge, its polar character and its dipole moment. For each series, the spectral coefficient, the scaling exponent and the Hurst coefficient were determined. The values obtained for these coefficients revealed non-randomness in the series of data.

  17. A method for partitioning the information contained in a protein sequence between its structure and function.

    Science.gov (United States)

    Possenti, Andrea; Vendruscolo, Michele; Camilloni, Carlo; Tiana, Guido

    2018-05-23

    Proteins employ the information stored in the genetic code and translated into their sequences to carry out well-defined functions in the cellular environment. The possibility to encode for such functions is controlled by the balance between the amount of information supplied by the sequence and that left after that the protein has folded into its structure. We study the amount of information necessary to specify the protein structure, providing an estimate that keeps into account the thermodynamic properties of protein folding. We thus show that the information remaining in the protein sequence after encoding for its structure (the 'information gap') is very close to what needed to encode for its function and interactions. Then, by predicting the information gap directly from the protein sequence, we show that it may be possible to use these insights from information theory to discriminate between ordered and disordered proteins, to identify unknown functions, and to optimize artificially-designed protein sequences. This article is protected by copyright. All rights reserved. © 2018 Wiley Periodicals, Inc.

  18. Signatures of pleiotropy, economy and convergent evolution in a domain-resolved map of human-virus protein-protein interaction networks.

    Directory of Open Access Journals (Sweden)

    Sara Garamszegi

    Full Text Available A central challenge in host-pathogen systems biology is the elucidation of general, systems-level principles that distinguish host-pathogen interactions from within-host interactions. Current analyses of host-pathogen and within-host protein-protein interaction networks are largely limited by their resolution, treating proteins as nodes and interactions as edges. Here, we construct a domain-resolved map of human-virus and within-human protein-protein interaction networks by annotating protein interactions with high-coverage, high-accuracy, domain-centric interaction mechanisms: (1 domain-domain interactions, in which a domain in one protein binds to a domain in a second protein, and (2 domain-motif interactions, in which a domain in one protein binds to a short, linear peptide motif in a second protein. Analysis of these domain-resolved networks reveals, for the first time, significant mechanistic differences between virus-human and within-human interactions at the resolution of single domains. While human proteins tend to compete with each other for domain binding sites by means of sequence similarity, viral proteins tend to compete with human proteins for domain binding sites in the absence of sequence similarity. Independent of their previously established preference for targeting human protein hubs, viral proteins also preferentially target human proteins containing linear motif-binding domains. Compared to human proteins, viral proteins participate in more domain-motif interactions, target more unique linear motif-binding domains per residue, and contain more unique linear motifs per residue. Together, these results suggest that viruses surmount genome size constraints by convergently evolving multiple short linear motifs in order to effectively mimic, hijack, and manipulate complex host processes for their survival. Our domain-resolved analyses reveal unique signatures of pleiotropy, economy, and convergent evolution in viral

  19. Signatures of pleiotropy, economy and convergent evolution in a domain-resolved map of human-virus protein-protein interaction networks.

    Science.gov (United States)

    Garamszegi, Sara; Franzosa, Eric A; Xia, Yu

    2013-01-01

    A central challenge in host-pathogen systems biology is the elucidation of general, systems-level principles that distinguish host-pathogen interactions from within-host interactions. Current analyses of host-pathogen and within-host protein-protein interaction networks are largely limited by their resolution, treating proteins as nodes and interactions as edges. Here, we construct a domain-resolved map of human-virus and within-human protein-protein interaction networks by annotating protein interactions with high-coverage, high-accuracy, domain-centric interaction mechanisms: (1) domain-domain interactions, in which a domain in one protein binds to a domain in a second protein, and (2) domain-motif interactions, in which a domain in one protein binds to a short, linear peptide motif in a second protein. Analysis of these domain-resolved networks reveals, for the first time, significant mechanistic differences between virus-human and within-human interactions at the resolution of single domains. While human proteins tend to compete with each other for domain binding sites by means of sequence similarity, viral proteins tend to compete with human proteins for domain binding sites in the absence of sequence similarity. Independent of their previously established preference for targeting human protein hubs, viral proteins also preferentially target human proteins containing linear motif-binding domains. Compared to human proteins, viral proteins participate in more domain-motif interactions, target more unique linear motif-binding domains per residue, and contain more unique linear motifs per residue. Together, these results suggest that viruses surmount genome size constraints by convergently evolving multiple short linear motifs in order to effectively mimic, hijack, and manipulate complex host processes for their survival. Our domain-resolved analyses reveal unique signatures of pleiotropy, economy, and convergent evolution in viral-host interactions that are

  20. A sequence-based dynamic ensemble learning system for protein ligand-binding site prediction

    KAUST Repository

    Chen, Peng

    2015-12-03

    Background: Proteins have the fundamental ability to selectively bind to other molecules and perform specific functions through such interactions, such as protein-ligand binding. Accurate prediction of protein residues that physically bind to ligands is important for drug design and protein docking studies. Most of the successful protein-ligand binding predictions were based on known structures. However, structural information is not largely available in practice due to the huge gap between the number of known protein sequences and that of experimentally solved structures

  1. A sequence-based dynamic ensemble learning system for protein ligand-binding site prediction

    KAUST Repository

    Chen, Peng; Hu, ShanShan; Zhang, Jun; Gao, Xin; Li, Jinyan; Xia, Junfeng; Wang, Bing

    2015-01-01

    Background: Proteins have the fundamental ability to selectively bind to other molecules and perform specific functions through such interactions, such as protein-ligand binding. Accurate prediction of protein residues that physically bind to ligands is important for drug design and protein docking studies. Most of the successful protein-ligand binding predictions were based on known structures. However, structural information is not largely available in practice due to the huge gap between the number of known protein sequences and that of experimentally solved structures

  2. 3D representations of amino acids—applications to protein sequence comparison and classification

    Directory of Open Access Journals (Sweden)

    Jie Li

    2014-08-01

    Full Text Available The amino acid sequence of a protein is the key to understanding its structure and ultimately its function in the cell. This paper addresses the fundamental issue of encoding amino acids in ways that the representation of such a protein sequence facilitates the decoding of its information content. We show that a feature-based representation in a three-dimensional (3D space derived from amino acid substitution matrices provides an adequate representation that can be used for direct comparison of protein sequences based on geometry. We measure the performance of such a representation in the context of the protein structural fold prediction problem. We compare the results of classifying different sets of proteins belonging to distinct structural folds against classifications of the same proteins obtained from sequence alone or directly from structural information. We find that sequence alone performs poorly as a structure classifier. We show in contrast that the use of the three dimensional representation of the sequences significantly improves the classification accuracy. We conclude with a discussion of the current limitations of such a representation and with a description of potential improvements.

  3. Structural insights and ab initio sequencing within the DING proteins family

    International Nuclear Information System (INIS)

    Elias, Mikael; Liebschner, Dorothee; Gotthard, Guillaume; Chabriere, Eric

    2011-01-01

    DING proteins constitute a recently discovered protein family that is ubiquitous in eukaryotes. The structural insights and the physiological involvements of these intriguing proteins are hereby deciphered. DING proteins constitute an intriguing family of phosphate-binding proteins that was identified in a wide range of organisms, from prokaryotes and archae to eukaryotes. Despite their seemingly ubiquitous occurrence in eukaryotes, their encoding genes are missing from sequenced genomes. Such a lack has considerably hampered functional studies. In humans, these proteins have been related to several diseases, like atherosclerosis, kidney stones, inflammation processes and HIV inhibition. The human phosphate binding protein is a human representative of the DING family that was serendipitously discovered from human plasma. An original approach was developed to determine ab initio the complete and exact sequence of this 38 kDa protein by utilizing mass spectrometry and X-ray data in tandem. Taking advantage of this first complete eukaryotic DING sequence, a immunohistochemistry study was undertaken to check the presence of DING proteins in various mice tissues, revealing that these proteins are widely expressed. Finally, the structure of a bacterial representative from Pseudomonas fluorescens was solved at sub-angstrom resolution, allowing the molecular mechanism of the phosphate binding in these high-affinity proteins to be elucidated

  4. Structural insights and ab initio sequencing within the DING proteins family

    Energy Technology Data Exchange (ETDEWEB)

    Elias, Mikael, E-mail: mikael.elias@weizmann.ac.il [Weizmann Institute of Science, Rehovot (Israel); Liebschner, Dorothee [CRM2, Nancy Université (France); Gotthard, Guillaume; Chabriere, Eric [AFMB, Université Aix-Marseille II (France)

    2011-01-01

    DING proteins constitute a recently discovered protein family that is ubiquitous in eukaryotes. The structural insights and the physiological involvements of these intriguing proteins are hereby deciphered. DING proteins constitute an intriguing family of phosphate-binding proteins that was identified in a wide range of organisms, from prokaryotes and archae to eukaryotes. Despite their seemingly ubiquitous occurrence in eukaryotes, their encoding genes are missing from sequenced genomes. Such a lack has considerably hampered functional studies. In humans, these proteins have been related to several diseases, like atherosclerosis, kidney stones, inflammation processes and HIV inhibition. The human phosphate binding protein is a human representative of the DING family that was serendipitously discovered from human plasma. An original approach was developed to determine ab initio the complete and exact sequence of this 38 kDa protein by utilizing mass spectrometry and X-ray data in tandem. Taking advantage of this first complete eukaryotic DING sequence, a immunohistochemistry study was undertaken to check the presence of DING proteins in various mice tissues, revealing that these proteins are widely expressed. Finally, the structure of a bacterial representative from Pseudomonas fluorescens was solved at sub-angstrom resolution, allowing the molecular mechanism of the phosphate binding in these high-affinity proteins to be elucidated.

  5. Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution

    Science.gov (United States)

    Smith, Jeramiah J; Kuraku, Shigehiro; Holt, Carson; Sauka-Spengler, Tatjana; Jiang, Ning; Campbell, Michael S; Yandell, Mark D; Manousaki, Tereza; Meyer, Axel; Bloom, Ona E; Morgan, Jennifer R; Buxbaum, Joseph D; Sachidanandam, Ravi; Sims, Carrie; Garruss, Alexander S; Cook, Malcolm; Krumlauf, Robb; Wiedemann, Leanne M; Sower, Stacia A; Decatur, Wayne A; Hall, Jeffrey A; Amemiya, Chris T; Saha, Nil R; Buckley, Katherine M; Rast, Jonathan P; Das, Sabyasachi; Hirano, Masayuki; McCurley, Nathanael; Guo, Peng; Rohner, Nicolas; Tabin, Clifford J; Piccinelli, Paul; Elgar, Greg; Ruffier, Magali; Aken, Bronwen L; Searle, Stephen MJ; Muffato, Matthieu; Pignatelli, Miguel; Herrero, Javier; Jones, Matthew; Brown, C Titus; Chung-Davidson, Yu-Wen; Nanlohy, Kaben G; Libants, Scot V; Yeh, Chu-Yin; McCauley, David W; Langeland, James A; Pancer, Zeev; Fritzsch, Bernd; de Jong, Pieter J; Zhu, Baoli; Fulton, Lucinda L; Theising, Brenda; Flicek, Paul; Bronner, Marianne E; Warren, Wesley C; Clifton, Sandra W; Wilson, Richard K; Li, Weiming

    2013-01-01

    Lampreys are representatives of an ancient vertebrate lineage that diverged from our own ~500 million years ago. By virtue of this deeply shared ancestry, the sea lamprey (P. marinus) genome is uniquely poised to provide insight into the ancestry of vertebrate genomes and the underlying principles of vertebrate biology. Here, we present the first lamprey whole-genome sequence and assembly. We note challenges faced owing to its high content of repetitive elements and GC bases, as well as the absence of broad-scale sequence information from closely related species. Analyses of the assembly indicate that two whole-genome duplications likely occurred before the divergence of ancestral lamprey and gnathostome lineages. Moreover, the results help define key evolutionary events within vertebrate lineages, including the origin of myelin-associated proteins and the development of appendages. The lamprey genome provides an important resource for reconstructing vertebrate origins and the evolutionary events that have shaped the genomes of extant organisms. PMID:23435085

  6. Positive Selection Drives the Evolution of rhino, a Member of the Heterochromatin Protein 1 Family in Drosophila.

    Directory of Open Access Journals (Sweden)

    2005-07-01

    Full Text Available Heterochromatin comprises a significant component of many eukaryotic genomes. In comparison to euchromatin, heterochromatin is gene poor, transposon rich, and late replicating. It serves many important biological roles, from gene silencing to accurate chromosome segregation, yet little is known about the evolutionary constraints that shape heterochromatin. A complementary approach to the traditional one of directly studying heterochromatic DNA sequence is to study the evolution of proteins that bind and define heterochromatin. One of the best markers for heterochromatin is the heterochromatin protein 1 (HP1, which is an essential, nonhistone chromosomal protein. Here we investigate the molecular evolution of five HP1 paralogs present in Drosophila melanogaster. Three of these paralogs have ubiquitous expression patterns in adult Drosophila tissues, whereas HP1D/rhino and HP1E are expressed predominantly in ovaries and testes respectively. The HP1 paralogs also have distinct localization preferences in Drosophila cells. Thus, Rhino localizes to the heterochromatic compartment in Drosophila tissue culture cells, but in a pattern distinct from HP1A and lysine-9 dimethylated H3. Using molecular evolution and population genetic analyses, we find that rhino has been subject to positive selection in all three domains of the protein: the N-terminal chromo domain, the C-terminal chromo-shadow domain, and the hinge region that connects these two modules. Maximum likelihood analysis of rhino sequences from 20 species of Drosophila reveals that a small number of residues of the chromo and shadow domains have been subject to repeated positive selection. The rapid and positive selection of rhino is highly unusual for a gene encoding a chromosomal protein and suggests that rhino is involved in a genetic conflict that affects the germline, belying the notion that heterochromatin is simply a passive recipient of "junk DNA" in eukaryotic genomes.

  7. Positive selection drives the evolution of rhino, a member of the heterochromatin protein 1 family in Drosophila.

    Directory of Open Access Journals (Sweden)

    Danielle Vermaak

    2005-07-01

    Full Text Available Heterochromatin comprises a significant component of many eukaryotic genomes. In comparison to euchromatin, heterochromatin is gene poor, transposon rich, and late replicating. It serves many important biological roles, from gene silencing to accurate chromosome segregation, yet little is known about the evolutionary constraints that shape heterochromatin. A complementary approach to the traditional one of directly studying heterochromatic DNA sequence is to study the evolution of proteins that bind and define heterochromatin. One of the best markers for heterochromatin is the heterochromatin protein 1 (HP1, which is an essential, nonhistone chromosomal protein. Here we investigate the molecular evolution of five HP1 paralogs present in Drosophila melanogaster. Three of these paralogs have ubiquitous expression patterns in adult Drosophila tissues, whereas HP1D/rhino and HP1E are expressed predominantly in ovaries and testes respectively. The HP1 paralogs also have distinct localization preferences in Drosophila cells. Thus, Rhino localizes to the heterochromatic compartment in Drosophila tissue culture cells, but in a pattern distinct from HP1A and lysine-9 dimethylated H3. Using molecular evolution and population genetic analyses, we find that rhino has been subject to positive selection in all three domains of the protein: the N-terminal chromo domain, the C-terminal chromo-shadow domain, and the hinge region that connects these two modules. Maximum likelihood analysis of rhino sequences from 20 species of Drosophila reveals that a small number of residues of the chromo and shadow domains have been subject to repeated positive selection. The rapid and positive selection of rhino is highly unusual for a gene encoding a chromosomal protein and suggests that rhino is involved in a genetic conflict that affects the germline, belying the notion that heterochromatin is simply a passive recipient of "junk DNA" in eukaryotic genomes.

  8. Feasibilty of zein proteins, simple sequence repeats and phenotypic ...

    African Journals Online (AJOL)

    Widespread adoption of quality protein maize (QPM), especially among tropical farming systems has been slow mainly due to the slow process of generating varieties with acceptable kernel quality and adaptability to different agroecological contexts. A molecular based foreground selection system for opaque 2 (o2), the ...

  9. Variation in the prion protein sequence in Dutch goat breeds

    NARCIS (Netherlands)

    Windig, J.J.; Hoving, R.A.H.; Priem, J.; Bossers, A.; Keulen, van L.J.M.; Langeveld, J.P.M.

    2016-01-01

    Scrapie is a neurodegenerative disease occurring in goats and sheep. Several haplotypes of the prion protein increase resistance to scrapie infection and may be used in selective breeding to help eradicate scrapie. In this study, frequencies of the allelic variants of the PrP gene are determined

  10. Sequence-based feature prediction and annotation of proteins

    DEFF Research Database (Denmark)

    Juncker, Agnieszka; Jensen, Lars J.; Pierleoni, Andrea

    2009-01-01

    A recent trend in computational methods for annotation of protein function is that many prediction tools are combined in complex workflows and pipelines to facilitate the analysis of feature combinations, for example, the entire repertoire of kinase-binding motifs in the human proteome....

  11. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies.

    Directory of Open Access Journals (Sweden)

    Holly J Atkinson

    Full Text Available The dramatic increase in heterogeneous types of biological data--in particular, the abundance of new protein sequences--requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity--GPCRs and kinases from humans, and the crotonase superfamily of enzymes--we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.

  12. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies.

    Science.gov (United States)

    Atkinson, Holly J; Morris, John H; Ferrin, Thomas E; Babbitt, Patricia C

    2009-01-01

    The dramatic increase in heterogeneous types of biological data--in particular, the abundance of new protein sequences--requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity--GPCRs and kinases from humans, and the crotonase superfamily of enzymes--we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.

  13. Protein secondary structure prediction for a single-sequence using hidden semi-Markov models

    Directory of Open Access Journals (Sweden)

    Borodovsky Mark

    2006-03-01

    Full Text Available Abstract Background The accuracy of protein secondary structure prediction has been improving steadily towards the 88% estimated theoretical limit. There are two types of prediction algorithms: Single-sequence prediction algorithms imply that information about other (homologous proteins is not available, while algorithms of the second type imply that information about homologous proteins is available, and use it intensively. The single-sequence algorithms could make an important contribution to studies of proteins with no detected homologs, however the accuracy of protein secondary structure prediction from a single-sequence is not as high as when the additional evolutionary information is present. Results In this paper, we further refine and extend the hidden semi-Markov model (HSMM initially considered in the BSPSS algorithm. We introduce an improved residue dependency model by considering the patterns of statistically significant amino acid correlation at structural segment borders. We also derive models that specialize on different sections of the dependency structure and incorporate them into HSMM. In addition, we implement an iterative training method to refine estimates of HSMM parameters. The three-state-per-residue accuracy and other accuracy measures of the new method, IPSSP, are shown to be comparable or better than ones for BSPSS as well as for PSIPRED, tested under the single-sequence condition. Conclusions We have shown that new dependency models and training methods bring further improvements to single-sequence protein secondary structure prediction. The results are obtained under cross-validation conditions using a dataset with no pair of sequences having significant sequence similarity. As new sequences are added to the database it is possible to augment the dependency structure and obtain even higher accuracy. Current and future advances should contribute to the improvement of function prediction for orphan proteins inscrutable

  14. Directed evolution and in silico analysis of reaction centre proteins reveal molecular signatures of photosynthesis adaptation to radiation pressure.

    Directory of Open Access Journals (Sweden)

    Giuseppina Rea

    2011-01-01

    Full Text Available Evolutionary mechanisms adopted by the photosynthetic apparatus to modifications in the Earth's atmosphere on a geological time-scale remain a focus of intense research. The photosynthetic machinery has had to cope with continuously changing environmental conditions and particularly with the complex ionizing radiation emitted by solar flares. The photosynthetic D1 protein, being the site of electron tunneling-mediated charge separation and solar energy transduction, is a hot spot for the generation of radiation-induced radical injuries. We explored the possibility to produce D1 variants tolerant to ionizing radiation in Chlamydomonas reinhardtii and clarified the effect of radiation-induced oxidative damage on the photosynthetic proteins evolution. In vitro directed evolution strategies targeted at the D1 protein were adopted to create libraries of chlamydomonas random mutants, subsequently selected by exposures to radical-generating proton or neutron sources. The common trend observed in the D1 aminoacidic substitutions was the replacement of less polar by more polar amino acids. The applied selection pressure forced replacement of residues more sensitive to oxidative damage with less sensitive ones, suggesting that ionizing radiation may have been one of the driving forces in the evolution of the eukaryotic photosynthetic apparatus. A set of the identified aminoacidic substitutions, close to the secondary plastoquinone binding niche and oxygen evolving complex, were introduced by site-directed mutagenesis in un-transformed strains, and their sensitivity to free radicals attack analyzed. Mutants displayed reduced electron transport efficiency in physiological conditions, and increased photosynthetic performance stability and oxygen evolution capacity in stressful high-light conditions. Finally, comparative in silico analyses of D1 aminoacidic sequences of organisms differently located in the evolution chain, revealed a higher ratio of residues

  15. Rapid detection, classification and accurate alignment of up to a million or more related protein sequences.

    Science.gov (United States)

    Neuwald, Andrew F

    2009-08-01

    The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical. This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences. A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu. Supplementary data are available at Bioinformatics online.

  16. Adhesive proteins of stalked and acorn barnacles display homology with low sequence similarities.

    Directory of Open Access Journals (Sweden)

    Jaimie-Leigh Jonker

    Full Text Available Barnacle adhesion underwater is an important phenomenon to understand for the prevention of biofouling and potential biotechnological innovations, yet so far, identifying what makes barnacle glue proteins 'sticky' has proved elusive. Examination of a broad range of species within the barnacles may be instructive to identify conserved adhesive domains. We add to extensive information from the acorn barnacles (order Sessilia by providing the first protein analysis of a stalked barnacle adhesive, Lepas anatifera (order Lepadiformes. It was possible to separate the L. anatifera adhesive into at least 10 protein bands using SDS-PAGE. Intense bands were present at approximately 30, 70, 90 and 110 kilodaltons (kDa. Mass spectrometry for protein identification was followed by de novo sequencing which detected 52 peptides of 7-16 amino acids in length. None of the peptides matched published or unpublished transcriptome sequences, but some amino acid sequence similarity was apparent between L. anatifera and closely-related Dosima fascicularis. Antibodies against two acorn barnacle proteins (ab-cp-52k and ab-cp-68k showed cross-reactivity in the adhesive glands of L. anatifera. We also analysed the similarity of adhesive proteins across several barnacle taxa, including Pollicipes pollicipes (a stalked barnacle in the order Scalpelliformes. Sequence alignment of published expressed sequence tags clearly indicated that P. pollicipes possesses homologues for the 19 kDa and 100 kDa proteins in acorn barnacles. Homology aside, sequence similarity in amino acid and gene sequences tended to decline as taxonomic distance increased, with minimum similarities of 18-26%, depending on the gene. The results indicate that some adhesive proteins (e.g. 100 kDa are more conserved within barnacles than others (20 kDa.

  17. ProteinSplit: splitting of multi-domain proteins using prediction of ordered and disordered regions in protein sequences for virtual structural genomics

    International Nuclear Information System (INIS)

    Wyrwicz, Lucjan S; Koczyk, Grzegorz; Rychlewski, Leszek; Plewczynski, Dariusz

    2007-01-01

    The annotation of protein folds within newly sequenced genomes is the main target for semi-automated protein structure prediction (virtual structural genomics). A large number of automated methods have been developed recently with very good results in the case of single-domain proteins. Unfortunately, most of these automated methods often fail to properly predict the distant homology between a given multi-domain protein query and structural templates. Therefore a multi-domain protein should be split into domains in order to overcome this limitation. ProteinSplit is designed to identify protein domain boundaries using a novel algorithm that predicts disordered regions in protein sequences. The software utilizes various sequence characteristics to assess the local propensity of a protein to be disordered or ordered in terms of local structure stability. These disordered parts of a protein are likely to create interdomain spacers. Because of its speed and portability, the method was successfully applied to several genome-wide fold annotation experiments. The user can run an automated analysis of sets of proteins or perform semi-automated multiple user projects (saving the results on the server). Additionally the sequences of predicted domains can be sent to the Bioinfo.PL Protein Structure Prediction Meta-Server for further protein three-dimensional structure and function prediction. The program is freely accessible as a web service at http://lucjan.bioinfo.pl/proteinsplit together with detailed benchmark results on the critical assessment of a fully automated structure prediction (CAFASP) set of sequences. The source code of the local version of protein domain boundary prediction is available upon request from the authors

  18. The Evolution of the Observed Hubble Sequence over the past 6Gyr

    Science.gov (United States)

    Delgado-Serrano, R.; Hammer, F.; Yang, Y. B.; Puech, M.; Flores, H.; Rodrigues, M.

    2011-10-01

    During the past years we have confronted serious problems of methodology concerning the morphological and kinematic classification of distant galaxies. This has forced us to create a new simple and effective morphological classification methodology, in order to guarantee a morpho-kinematic correlation, make the reproducibility easier and restrict the classification subjectivity. Giving the characteristic of our morphological classification, we have thus been able to apply the same methodology, using equivalent observations, to representative samples of local and distant galaxies. It has allowed us to derive, for the first time, the distant Hubble sequence (~6 Gyr ago), and determine a morphological evolution of galaxies over the past 6 Gyr. Our results strongly suggest that more than half of the present-day spirals had peculiar morphologies, 6 Gyr ago.

  19. MannDB – A microbial database of automated protein sequence analyses and evidence integration for protein characterization

    Directory of Open Access Journals (Sweden)

    Kuczmarski Thomas A

    2006-10-01

    Full Text Available Abstract Background MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. Description MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. Conclusion MannDB comprises a large number of genomes and comprehensive protein

  20. Cloning and sequence analysis of cDNA coding for rat nucleolar protein C23

    International Nuclear Information System (INIS)

    Ghaffari, S.H.; Olson, M.O.J.

    1986-01-01

    Using synthetic oligonucleotides as primers and probes, the authors have isolated and sequenced cDNA clones encoding protein C23, a putative nucleolus organizer protein. Poly(A + ) RNA was isolated from rat Novikoff hepatoma cells and enriched in C23 mRNA by sucrose density gradient ultracentrifugation. Two deoxyoligonuleotides, a 48- and a 27-mer, were synthesized on the basis of amino acid sequence from the C-terminal half of protein C23 and cDNA sequence data from CHO cell protein. The 48-mer was used a primer for synthesis of cDNA which was then inserted into plasmid pUC9. Transformed bacterial colonies were screened by hybridization with 32 P labeled 27-mer. Two clones among 5000 gave a strong positive signal. Plasmid DNAs from these clones were purified and characterized by blotting and nucleotide sequence analysis. The length of C23 mRNA was estimated to be 3200 bases in a northern blot analysis. The sequence of a 267 b.p. insert shows high homology with the CHO cDNA with only 9 nucleotide differences and an identical amino acid sequence. These studies indicate that this region of the protein is highly conserved

  1. Application of native signal sequences for recombinant proteins secretion in Pichia pastoris

    DEFF Research Database (Denmark)

    Borodina, Irina; Do, Duy Duc; Eriksen, Jens C.

    Background Methylotrophic yeast Pichia pastoris is widely used for recombinant protein production, largely due to its ability to secrete correctly folded heterologous proteins to the fermentation medium. Secretion is usually achieved by cloning the recombinant gene after a leader sequence, where...... alpha‐mating factor (MF) prepropeptide from Saccharomyces cerevisiae is most commonly used. Our aim was to test whether signal peptides from P. pastoris native secreted proteins could be used to direct secretion of recombinant proteins. Results Eleven native signal peptides from P. pastoris were tested...... by optimization of expression of three different proteins in P. pastoris. Conclusions Native signal peptides from P. pastoris can be used to direct secretion of recombinant proteins. A novel USER‐based P. pastoris system allows easy cloning of protein‐coding gene with the promoter and leader sequence of choice....

  2. Evolutionary mechanisms driving the evolution of a large polydnavirus gene family coding for protein tyrosine phosphatases

    Directory of Open Access Journals (Sweden)

    Serbielle Céline

    2012-12-01

    Full Text Available Abstract Background Gene duplications have been proposed to be the main mechanism involved in genome evolution and in acquisition of new functions. Polydnaviruses (PDVs, symbiotic viruses associated with parasitoid wasps, are ideal model systems to study mechanisms of gene duplications given that PDV genomes consist of virulence genes organized into multigene families. In these systems the viral genome is integrated in a wasp chromosome as a provirus and virus particles containing circular double-stranded DNA are injected into the parasitoids’ hosts and are essential for parasitism success. The viral virulence factors, organized in gene families, are required collectively to induce host immune suppression and developmental arrest. The gene family which encodes protein tyrosine phosphatases (PTPs has undergone spectacular expansion in several PDV genomes with up to 42 genes. Results Here, we present strong indications that PTP gene family expansion occurred via classical mechanisms: by duplication of large segments of the chromosomally integrated form of the virus sequences (segmental duplication, by tandem duplications within this form and by dispersed duplications. We also propose a novel duplication mechanism specific to PDVs that involves viral circle reintegration into the wasp genome. The PTP copies produced were shown to undergo conservative evolution along with episodes of adaptive evolution. In particular recently produced copies have undergone positive selection in sites most likely involved in defining substrate selectivity. Conclusion The results provide evidence about the dynamic nature of polydnavirus proviral genomes. Classical and PDV-specific duplication mechanisms have been involved in the production of new gene copies. Selection pressures associated with antagonistic interactions with parasitized hosts have shaped these genes used to manipulate lepidopteran physiology with evidence for positive selection involved in

  3. Effect of the sequence data deluge on the performance of methods for detecting protein functional residues.

    Science.gov (United States)

    Garrido-Martín, Diego; Pazos, Florencio

    2018-02-27

    The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this was never assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. In this work we evaluate how the growth and change with time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.

  4. Evolution of green plants as deduced from 5S rRNA sequences.

    Science.gov (United States)

    Hori, H; Lim, B L; Osawa, S

    1985-02-01

    We have constructed a phylogenic tree for green plants by comparing 5S rRNA sequences. The tree suggests that the emergence of most of the uni- and multicellular green algae such as Chlamydomonas, Spirogyra, Ulva, and Chlorella occurred in the early stage of green plant evolution. The branching point of Nitella is a little earlier than that of land plants and much later than that of the above green algae, supporting the view that Nitella-like green algae may be the direct precursor to land plants. The Bryophyta and the Pteridophyta separated from each other after emergence of the Spermatophyta. The result is consistent with the view that the Bryophyta evolved from ferns by degeneration. In the Pteridophyta, Psilotum (whisk fern) separated first, and a little later Lycopodium (club moss) separated from the ancestor common to Equisetum (horsetail) and Dryopteris (fern). This order is in accordance with the classical view. During the Spermatophyta evolution, the gymnosperms (Cycas, Ginkgo, and Metasequoia have been studied here) and the angiosperms (flowering plants) separated, and this was followed by the separation of Metasequoia and Cycas (cycad)/Ginkgo (maidenhair tree) on one branch and various flowering plants on the other.

  5. Functional mapping of protein-protein interactions in an enzyme complex by directed evolution.

    Directory of Open Access Journals (Sweden)

    Kathrin Roderer

    Full Text Available The shikimate pathway enzyme chorismate mutase converts chorismate into prephenate, a precursor of Tyr and Phe. The intracellular chorismate mutase (MtCM of Mycobacterium tuberculosis is poorly active on its own, but becomes >100-fold more efficient upon formation of a complex with the first enzyme of the shikimate pathway, 3-deoxy-d-arabino-heptulosonate-7-phosphate synthase (MtDS. The crystal structure of the enzyme complex revealed involvement of C-terminal MtCM residues with the MtDS interface. Here we employed evolutionary strategies to probe the tolerance to substitution of the C-terminal MtCM residues from positions 84-90. Variants with randomized positions were subjected to stringent selection in vivo requiring productive interactions with MtDS for survival. Sequence patterns identified in active library members coincide with residue conservation in natural chorismate mutases of the AroQδ subclass to which MtCM belongs. An Arg-Gly dyad at positions 85 and 86, invariant in AroQδ sequences, was intolerant to mutation, whereas Leu88 and Gly89 exhibited a preference for small and hydrophobic residues in functional MtCM-MtDS complexes. In the absence of MtDS, selection under relaxed conditions identifies positions 84-86 as MtCM integrity determinants, suggesting that the more C-terminal residues function in the activation by MtDS. Several MtCM variants, purified using a novel plasmid-based T7 RNA polymerase gene expression system, showed that a diminished ability to physically interact with MtDS correlates with reduced activatability and feedback regulatory control by Tyr and Phe. Mapping critical protein-protein interaction sites by evolutionary strategies may pinpoint promising targets for drugs that interfere with the activity of protein complexes.

  6. Functional mapping of protein-protein interactions in an enzyme complex by directed evolution.

    Science.gov (United States)

    Roderer, Kathrin; Neuenschwander, Martin; Codoni, Giosiana; Sasso, Severin; Gamper, Marianne; Kast, Peter

    2014-01-01

    The shikimate pathway enzyme chorismate mutase converts chorismate into prephenate, a precursor of Tyr and Phe. The intracellular chorismate mutase (MtCM) of Mycobacterium tuberculosis is poorly active on its own, but becomes >100-fold more efficient upon formation of a complex with the first enzyme of the shikimate pathway, 3-deoxy-d-arabino-heptulosonate-7-phosphate synthase (MtDS). The crystal structure of the enzyme complex revealed involvement of C-terminal MtCM residues with the MtDS interface. Here we employed evolutionary strategies to probe the tolerance to substitution of the C-terminal MtCM residues from positions 84-90. Variants with randomized positions were subjected to stringent selection in vivo requiring productive interactions with MtDS for survival. Sequence patterns identified in active library members coincide with residue conservation in natural chorismate mutases of the AroQδ subclass to which MtCM belongs. An Arg-Gly dyad at positions 85 and 86, invariant in AroQδ sequences, was intolerant to mutation, whereas Leu88 and Gly89 exhibited a preference for small and hydrophobic residues in functional MtCM-MtDS complexes. In the absence of MtDS, selection under relaxed conditions identifies positions 84-86 as MtCM integrity determinants, suggesting that the more C-terminal residues function in the activation by MtDS. Several MtCM variants, purified using a novel plasmid-based T7 RNA polymerase gene expression system, showed that a diminished ability to physically interact with MtDS correlates with reduced activatability and feedback regulatory control by Tyr and Phe. Mapping critical protein-protein interaction sites by evolutionary strategies may pinpoint promising targets for drugs that interfere with the activity of protein complexes.

  7. Interactions of rat repetitive sequence MspI8 with nuclear matrix proteins during spermatogenesis

    International Nuclear Information System (INIS)

    Rogolinski, J.; Widlak, P.; Rzeszowska-Wolny, J.

    1996-01-01

    Using the Southwestern blot analysis we have studied the interactions between rat repetitive sequence MspI8 and the nuclear matrix proteins of rats testis cells. Starting from 2 weeks the young to adult animal showed differences in type of testis nuclear matrix proteins recognizing the MspI8 sequence. The same sets of nuclear matrix proteins were detected in some enriched in spermatocytes and spermatids and obtained after fractionation of cells of adult animal by the velocity sedimentation technique. (author). 21 refs, 5 figs

  8. GenProBiS: web server for mapping of sequence variants to protein binding sites.

    Science.gov (United States)

    Konc, Janez; Skrlj, Blaz; Erzen, Nika; Kunej, Tanja; Janezic, Dusanka

    2017-07-03

    Discovery of potentially deleterious sequence variants is important and has wide implications for research and generation of new hypotheses in human and veterinary medicine, and drug discovery. The GenProBiS web server maps sequence variants to protein structures from the Protein Data Bank (PDB), and further to protein-protein, protein-nucleic acid, protein-compound, and protein-metal ion binding sites. The concept of a protein-compound binding site is understood in the broadest sense, which includes glycosylation and other post-translational modification sites. Binding sites were defined by local structural comparisons of whole protein structures using the Protein Binding Sites (ProBiS) algorithm and transposition of ligands from the similar binding sites found to the query protein using the ProBiS-ligands approach with new improvements introduced in GenProBiS. Binding site surfaces were generated as three-dimensional grids encompassing the space occupied by predicted ligands. The server allows intuitive visual exploration of comprehensively mapped variants, such as human somatic mis-sense mutations related to cancer and non-synonymous single nucleotide polymorphisms from 21 species, within the predicted binding sites regions for about 80 000 PDB protein structures using fast WebGL graphics. The GenProBiS web server is open and free to all users at http://genprobis.insilab.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  9. Sequence analysis of two alleles reveals that intra-and intergenic recombination played a role in the evolution of the radish fertility restorer (Rfo

    Directory of Open Access Journals (Sweden)

    Budar Françoise

    2010-02-01

    Full Text Available Abstract Background Land plant genomes contain multiple members of a eukaryote-specific gene family encoding proteins with pentatricopeptide repeat (PPR motifs. Some PPR proteins were shown to participate in post-transcriptional events involved in organellar gene expression, and this type of function is now thought to be their main biological role. Among PPR genes, restorers of fertility (Rf of cytoplasmic male sterility systems constitute a peculiar subgroup that is thought to evolve in response to the presence of mitochondrial sterility-inducing genes. Rf genes encoding PPR proteins are associated with very close relatives on complex loci. Results We sequenced a non-restoring allele (L7rfo of the Rfo radish locus whose restoring allele (D81Rfo was previously described, and compared the two alleles and their PPR genes. We identified a ca 13 kb long fragment, likely originating from another part of the radish genome, inserted into the L7rfo sequence. The L7rfo allele carries two genes (PPR-1 and PPR-2 closely related to the three previously described PPR genes of the restorer D81Rfo allele (PPR-A, PPR-B, and PPR-C. Our results indicate that alleles of the Rfo locus have experienced complex evolutionary events, including recombination and insertion of extra-locus sequences, since they diverged. Our analyses strongly suggest that present coding sequences of Rfo PPR genes result from intragenic recombination. We found that the 10 C-terminal PPR repeats in Rfo PPR gene encoded proteins result from the tandem duplication of a 5 PPR repeat block. Conclusions The Rfo locus appears to experience more complex evolution than its flanking sequences. The Rfo locus and PPR genes therein are likely to evolve as a result of intergenic and intragenic recombination. It is therefore not possible to determine which genes on the two alleles are direct orthologs. Our observations recall some previously reported data on pathogen resistance complex loci.

  10. Prediction of protein hydration sites from sequence by modular neural networks

    DEFF Research Database (Denmark)

    Ehrlich, L.; Reczko, M.; Bohr, Henrik

    1998-01-01

    The hydration properties of a protein are important determinants of its structure and function. Here, modular neural networks are employed to predict ordered hydration sites using protein sequence information. First, secondary structure and solvent accessibility are predicted from sequence with two...... separate neural networks. These predictions are used as input together with protein sequences for networks predicting hydration of residues, backbone atoms and sidechains. These networks are teined with protein crystal structures. The prediction of hydration is improved by adding information on secondary...... structure and solvent accessibility and, using actual values of these properties, redidue hydration can be predicted to 77% accuracy with a Metthews coefficient of 0.43. However, predicted property data with an accuracy of 60-70% result in less than half the improvement in predictive performance observed...

  11. Population genetic and evolution analysis of controversial genus Edwardsiella by multilocus sequence typing.

    Science.gov (United States)

    Buján, Noemí; Balboa, Sabela; L Romalde, Jesús; E Toranzo, Alicia; Magariños, Beatriz

    2018-05-08

    At present, the genus Edwardsiella compiles five species: E. tarda, E. hoshinae, E. ictaluri, E. piscicida and E. anguillarum. Some species of this genus such us E. ictaluri and E. piscicida are important pathogens of numerous fish species. With the description of the two latter species, the phylogeny of Edwardsiella became more complicated. With the aim to clarify the relationships among all species in the genus, a multilocus sequence typing (MLST) approach was developed and applied to characterize 56 isolates and 6 reference strains belonging to the five Edwardsiella species. Moreover, several analyses based on the MLST scheme were performed to investigate the evolution within the genus, as well as the influence of recombination and mutation in the speciation. Edwardsiella isolates presented a high genetic variability reflected in the fourteen sequence types (ST) represented by a single isolates out of eighteen total ST. Mutation events were considerably more frequent than recombination, although both approximately equal influenced the genetic diversification. However, the speciation among species occurred mostly by recombination. Edwardsiella genus displays a non-clonal population structure with some degree of geographical isolation followed by a population expansion of E. piscicida. A database from this study was created and hosted on pubmlst.org (http://pubmlst.org/edwardsiella/). Copyright © 2018 Elsevier Inc. All rights reserved.

  12. Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features

    Directory of Open Access Journals (Sweden)

    Zhang Ya-Nan

    2012-05-01

    Full Text Available Abstract Background Adenosine-5′-triphosphate (ATP is one of multifunctional nucleotides and plays an important role in cell biology as a coenzyme interacting with proteins. Revealing the binding sites between protein and ATP is significantly important to understand the functionality of the proteins and the mechanisms of protein-ATP complex. Results In this paper, we propose a novel framework for predicting the proteins’ functional residues, through which they can bind with ATP molecules. The new prediction protocol is achieved by combination of sequence evolutional information and bi-profile sampling of multi-view sequential features and the sequence derived structural features. The hypothesis for this strategy is single-view feature can only represent partial target’s knowledge and multiple sources of descriptors can be complementary. Conclusions Prediction performances evaluated by both 5-fold and leave-one-out jackknife cross-validation tests on two benchmark datasets consisting of 168 and 227 non-homologous ATP binding proteins respectively demonstrate the efficacy of the proposed protocol. Our experimental results also reveal that the residue structural characteristics of real protein-ATP binding sites are significant different from those normal ones, for example the binding residues do not show high solvent accessibility propensities, and the bindings prefer to occur at the conjoint points between different secondary structure segments. Furthermore, results also show that performance is affected by the imbalanced training datasets by testing multiple ratios between positive and negative samples in the experiments. Increasing the dataset scale is also demonstrated useful for improving the prediction performances.

  13. Phylogenetics and evolution of Trx SET genes in fully sequenced land plants.

    Science.gov (United States)

    Zhu, Xinyu; Chen, Caoyi; Wang, Baohua

    2012-04-01

    Plant Trx SET proteins are involved in H3K4 methylation and play a key role in plant floral development. Genes encoding Trx SET proteins constitute a multigene family in which the copy number varies among plant species and functional divergence appears to have occurred repeatedly. To investigate the evolutionary history of the Trx SET gene family, we made a comprehensive evolutionary analysis on this gene family from 13 major representatives of green plants. A novel clustering (here named as cpTrx clade), which included the III-1, III-2, and III-4 orthologous groups, previously resolved was identified. Our analysis showed that plant Trx proteins possessed a variety of domain organizations and gene structures among paralogs. Additional domains such as PHD, PWWP, and FYR were early integrated into primordial SET-PostSET domain organization of cpTrx clade. We suggested that the PostSET domain was lost in some members of III-4 orthologous group during the evolution of land plants. At least four classes of gene structures had been formed at the early evolutionary stage of land plants. Three intronless orphan Trx SET genes from the Physcomitrella patens (moss) were identified, and supposedly, their parental genes have been eliminated from the genome. The structural differences among evolutionary groups of plant Trx SET genes with different functions were described, contributing to the design of further experimental studies.

  14. Galactic chemical evolution with main-sequence mass loss and the distribution of F and G dwarfs

    International Nuclear Information System (INIS)

    Guzik, J.A.; Struck-Marcell, C.

    1988-01-01

    Simple closed galactic chemical-evolution models incorporating early main-sequence stellar mass loss have been developed for disk ages of 5, 10, and 15 Gyr. Relative to models without stellar mass loss, the models are shown to produce a 30-60 percent increase in the present mass ratio of dwarfs to dwarfs plus remnants, and a 200-250 percent increase in the total mass of late F dwarfs remaining on the main sequence at the current disk age. For present disk ages 5 and 10 Gyr, the total mass of mid-F dwarfs remaining on the main sequence is also shown to increase by 90-120 percent. It is concluded that models with main-sequence mass loss have a slightly reduced gas metallicity and slightly increased gas fraction midway through the evolution. 30 references

  15. Repetitive sequences and epigenetic modification: inseparable partners play important roles in the evolution of plant sex chromosomes.

    Science.gov (United States)

    Li, Shu-Fen; Zhang, Guo-Jun; Yuan, Jin-Hong; Deng, Chuan-Liang; Gao, Wu-Jun

    2016-05-01

    The present review discusses the roles of repetitive sequences played in plant sex chromosome evolution, and highlights epigenetic modification as potential mechanism of repetitive sequences involved in sex chromosome evolution. Sex determination in plants is mostly based on sex chromosomes. Classic theory proposes that sex chromosomes evolve from a specific pair of autosomes with emergence of a sex-determining gene(s). Subsequently, the newly formed sex chromosomes stop recombination in a small region around the sex-determining locus, and over time, the non-recombining region expands to almost all parts of the sex chromosomes. Accumulation of repetitive sequences, mostly transposable elements and tandem repeats, is a conspicuous feature of the non-recombining region of the Y chromosome, even in primitive one. Repetitive sequences may play multiple roles in sex chromosome evolution, such as triggering heterochromatization and causing recombination suppression, leading to structural and morphological differentiation of sex chromosomes, and promoting Y chromosome degeneration and X chromosome dosage compensation. In this article, we review the current status of this field, and based on preliminary evidence, we posit that repetitive sequences are involved in sex chromosome evolution probably via epigenetic modification, such as DNA and histone methylation, with small interfering RNAs as the mediator.

  16. PredPPCrys: accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection.

    Directory of Open Access Journals (Sweden)

    Huilin Wang

    Full Text Available X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed 'PredPPCrys' using the support vector machine (SVM. Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I. Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II, which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. In addition, the predicted crystallization

  17. Exploring sequence characteristics related to high-level production of secreted proteins in Aspergillus niger.

    Directory of Open Access Journals (Sweden)

    Bastiaan A van den Berg

    Full Text Available Protein sequence features are explored in relation to the production of over-expressed extracellular proteins by fungi. Knowledge on features influencing protein production and secretion could be employed to improve enzyme production levels in industrial bioprocesses via protein engineering. A large set, over 600 homologous and nearly 2,000 heterologous fungal genes, were overexpressed in Aspergillus niger using a standardized expression cassette and scored for high versus no production. Subsequently, sequence-based machine learning techniques were applied for identifying relevant DNA and protein sequence features. The amino-acid composition of the protein sequence was found to be most predictive and interpretation revealed that, for both homologous and heterologous gene expression, the same features are important: tyrosine and asparagine composition was found to have a positive correlation with high-level production, whereas for unsuccessful production, contributions were found for methionine and lysine composition. The predictor is available online at http://bioinformatics.tudelft.nl/hipsec. Subsequent work aims at validating these findings by protein engineering as a method for increasing expression levels per gene copy.

  18. Protein sequences clustering of herpes virus by using Tribe Markov clustering (Tribe-MCL)

    Science.gov (United States)

    Bustamam, A.; Siswantining, T.; Febriyani, N. L.; Novitasari, I. D.; Cahyaningrum, R. D.

    2017-07-01

    The herpes virus can be found anywhere and one of the important characteristics is its ability to cause acute and chronic infection at certain times so as a result of the infection allows severe complications occurred. The herpes virus is composed of DNA containing protein and wrapped by glycoproteins. In this work, the Herpes viruses family is classified and analyzed by clustering their protein-sequence using Tribe Markov Clustering (Tribe-MCL) algorithm. Tribe-MCL is an efficient clustering method based on the theory of Markov chains, to classify protein families from protein sequences using pre-computed sequence similarity information. We implement the Tribe-MCL algorithm using an open source program of R. We select 24 protein sequences of Herpes virus obtained from NCBI database. The dataset consists of three types of glycoprotein B, F, and H. Each type has eight herpes virus that infected humans. Based on our simulation using different inflation factor r=1.5, 2, 3 we find a various number of the clusters results. The greater the inflation factor the greater the number of their clusters. Each protein will grouped together in the same type of protein.

  19. Strategy for complete NMR assignment of disordered proteins with highly repetitive sequences based on resolution-enhanced 5D experiments

    Energy Technology Data Exchange (ETDEWEB)

    Motackova, Veronika; Novacek, Jiri [Masaryk University, Faculty of Science, National Centre for Biomolecular Research (Czech Republic); Zawadzka-Kazimierczuk, Anna; Kazimierczuk, Krzysztof [University of Warsaw, Faculty of Chemistry (Poland); Zidek, Lukas, E-mail: lzidek@chemi.muni.c [Masaryk University, Faculty of Science, National Centre for Biomolecular Research (Czech Republic); Sanderova, Hana; Krasny, Libor [Academy of Sciences of the Czech Republic, Laboratory of Molecular Genetics of Bacteria and Department of Bacteriology, Institute of Microbiology (Czech Republic); Kozminski, Wiktor [University of Warsaw, Faculty of Chemistry (Poland); Sklenar, Vladimir [Masaryk University, Faculty of Science, National Centre for Biomolecular Research (Czech Republic)

    2010-11-15

    A strategy for complete backbone and side-chain resonance assignment of disordered proteins with highly repetitive sequence is presented. The protocol is based on three resolution-enhanced NMR experiments: 5D HN(CA)CONH provides sequential connectivity, 5D HabCabCONH is utilized to identify amino acid types, and 5D HC(CC-TOCSY)CONH is used to assign the side-chain resonances. The improved resolution was achieved by a combination of high dimensionality and long evolution times, allowed by non-uniform sampling in the indirect dimensions. Random distribution of the data points and Sparse Multidimensional Fourier Transform processing were used. Successful application of the assignment procedure to a particularly difficult protein, {delta} subunit of RNA polymerase from Bacillus subtilis, is shown to prove the efficiency of the strategy. The studied protein contains a disordered C-terminal region of 81 amino acids with a highly repetitive sequence. While the conventional assignment methods completely failed due to a very small differences in chemical shifts, the presented strategy provided a complete backbone and side-chain assignment.

  20. The N-terminal sequence of ribosomal protein L10 from the archaebacterium Halobacterium marismortui and its relationship to eubacterial protein L6 and other ribosomal proteins.

    Science.gov (United States)

    Dijk, J; van den Broek, R; Nasiulas, G; Beck, A; Reinhardt, R; Wittmann-Liebold, B

    1987-08-01

    The amino-terminal sequence of ribosomal protein L10 from Halobacterium marismortui has been determined up to residue 54, using both a liquid- and a gas-phase sequenator. The two sequences are in good agreement. The protein is clearly homologous to protein HcuL10 from the related strain Halobacterium cutirubrum. Furthermore, a weaker but distinct homology to ribosomal protein L6 from Escherichia coli and Bacillus stearothermophilus can be detected. In addition to 7 identical amino acids in the first 36 residues in all four sequences a number of conservative replacements occurs, of mainly hydrophobic amino acids. In this common region the pattern of conserved amino acids suggests the presence of a beta-alpha fold as it occurs in ribosomal proteins L12 and L30. Furthermore, several potential cases of homology to other ribosomal components of the three ur-kingdoms have been found.

  1. Nucleotide sequence of the coat protein gene of the Skierniewice isolate of plum pox virus (PPV)

    International Nuclear Information System (INIS)

    Wypijewski, K.; Musial, W.; Augustyniak, J.; Malinowski, T.

    1994-01-01

    The coat protein (CP) gene of the Skierniewice isolate of plum pox virus (PPV-S) has been amplified using the reverse transcription - polymerase chain reaction (RT-PCR), cloned and sequenced. The nucleotide sequence of the gene and the deduced amino-acid sequences of PPV-S CP were compared with those of other PPV strains. The nucleotide sequence showed very high homology to most of the published sequences. The motif: Asp-Ala-Gly (DAG), important for the aphid transmissibility, was present in the amino-acid sequence. Our isolate did not react in ELISA with monoclonal antibodies MAb06 supposed to be specific for PPV-D. (author). 32 refs, 1 fig., 2 tabs

  2. Origin and Evolution of Protein Fold Designs Inferred from Phylogenomic Analysis of CATH Domain Structures in Proteomes

    Science.gov (United States)

    Bukhari, Syed Abbas; Caetano-Anollés, Gustavo

    2013-01-01

    The spatial arrangements of secondary structures in proteins, irrespective of their connectivity, depict the overall shape and organization of protein domains. These features have been used in the CATH and SCOP classifications to hierarchically partition fold space and define the architectural make up of proteins. Here we use phylogenomic methods and a census of CATH structures in hundreds of genomes to study the origin and diversification of protein architectures (A) and their associated topologies (T) and superfamilies (H). Phylogenies that describe the evolution of domain structures and proteomes were reconstructed from the structural census and used to generate timelines of domain discovery. Phylogenies of CATH domains at T and H levels of structural abstraction and associated chronologies revealed patterns of reductive evolution, the early rise of Archaea, three epochs in the evolution of the protein world, and patterns of structural sharing between superkingdoms. Phylogenies of proteomes confirmed the early appearance of Archaea. While these findings are in agreement with previous phylogenomic studies based on the SCOP classification, phylogenies unveiled sharing patterns between Archaea and Eukarya that are recent and can explain the canonical bacterial rooting typically recovered from sequence analysis. Phylogenies of CATH domains at A level uncovered general patterns of architectural origin and diversification. The tree of A structures showed that ancient structural designs such as the 3-layer (αβα) sandwich (3.40) or the orthogonal bundle (1.10) are comparatively simpler in their makeup and are involved in basic cellular functions. In contrast, modern structural designs such as prisms, propellers, 2-solenoid, super-roll, clam, trefoil and box are not widely distributed and were probably adopted to perform specialized functions. Our timelines therefore uncover a universal tendency towards protein structural complexity that is remarkable. PMID:23555236

  3. Predicting protein amidation sites by orchestrating amino acid sequence features

    Science.gov (United States)

    Zhao, Shuqiu; Yu, Hua; Gong, Xiujun

    2017-08-01

    Amidation is the fourth major category of post-translational modifications, which plays an important role in physiological and pathological processes. Identifying amidation sites can help us understanding the amidation and recognizing the original reason of many kinds of diseases. But the traditional experimental methods for predicting amidation sites are often time-consuming and expensive. In this study, we propose a computational method for predicting amidation sites by orchestrating amino acid sequence features. Three kinds of feature extraction methods are used to build a feature vector enabling to capture not only the physicochemical properties but also position related information of the amino acids. An extremely randomized trees algorithm is applied to choose the optimal features to remove redundancy and dependence among components of the feature vector by a supervised fashion. Finally the support vector machine classifier is used to label the amidation sites. When tested on an independent data set, it shows that the proposed method performs better than all the previous ones with the prediction accuracy of 0.962 at the Matthew's correlation coefficient of 0.89 and area under curve of 0.964.

  4. Revised Mimivirus major capsid protein sequence reveals intron-containing gene structure and extra domain

    Directory of Open Access Journals (Sweden)

    Suzan-Monti Marie

    2009-05-01

    Full Text Available Abstract Background Acanthamoebae polyphaga Mimivirus (APM is the largest known dsDNA virus. The viral particle has a nearly icosahedral structure with an internal capsid shell surrounded with a dense layer of fibrils. A Capsid protein sequence, D13L, was deduced from the APM L425 coding gene and was shown to be the most abundant protein found within the viral particle. However this protein remained poorly characterised until now. A revised protein sequence deposited in a database suggested an additional N-terminal stretch of 142 amino acids missing from the original deduced sequence. This result led us to investigate the L425 gene structure and the biochemical properties of the complete APM major Capsid protein. Results This study describes the full length 3430 bp Capsid coding gene and characterises the 593 amino acids long corresponding Capsid protein 1. The recombinant full length protein allowed the production of a specific monoclonal antibody able to detect the Capsid protein 1 within the viral particle. This protein appeared to be post-translationnally modified by glycosylation and phosphorylation. We proposed a secondary structure prediction of APM Capsid protein 1 compared to the Capsid protein structure of Paramecium Bursaria Chlorella Virus 1, another member of the Nucleo-Cytoplasmic Large DNA virus family. Conclusion The characterisation of the full length L425 Capsid coding gene of Acanthamoebae polyphaga Mimivirus provides new insights into the structure of the main Capsid protein. The production of a full length recombinant protein will be useful for further structural studies.

  5. EST-PAC a web package for EST annotation and protein sequence prediction

    Directory of Open Access Journals (Sweden)

    Strahm Yvan

    2006-10-01

    Full Text Available Abstract With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequences tags (EST from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC a web oriented multi-platform software package for expressed sequences tag (EST annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: 1 searching local or remote biological databases for sequence similarities using Blast services, 2 predicting protein coding sequence from EST data and, 3 annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results and, management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics.

  6. Prediction of glutathionylation sites in proteins using minimal sequence information and their experimental validation.

    Science.gov (United States)

    Pal, Debojyoti; Sharma, Deepak; Kumar, Mukesh; Sandur, Santosh K

    2016-09-01

    S-glutathionylation of proteins plays an important role in various biological processes and is known to be protective modification during oxidative stress. Since, experimental detection of S-glutathionylation is labor intensive and time consuming, bioinformatics based approach is a viable alternative. Available methods require relatively longer sequence information, which may prevent prediction if sequence information is incomplete. Here, we present a model to predict glutathionylation sites from pentapeptide sequences. It is based upon differential association of amino acids with glutathionylated and non-glutathionylated cysteines from a database of experimentally verified sequences. This data was used to calculate position dependent F-scores, which measure how a particular amino acid at a particular position may affect the likelihood of glutathionylation event. Glutathionylation-score (G-score), indicating propensity of a sequence to undergo glutathionylation, was calculated using position-dependent F-scores for each amino-acid. Cut-off values were used for prediction. Our model returned an accuracy of 58% with Matthew's correlation-coefficient (MCC) value of 0.165. On an independent dataset, our model outperformed the currently available model, in spite of needing much less sequence information. Pentapeptide motifs having high abundance among glutathionylated proteins were identified. A list of potential glutathionylation hotspot sequences were obtained by assigning G-scores and subsequent Protein-BLAST analysis revealed a total of 254 putative glutathionable proteins, a number of which were already known to be glutathionylated. Our model predicted glutathionylation sites in 93.93% of experimentally verified glutathionylated proteins. Outcome of this study may assist in discovering novel glutathionylation sites and finding candidate proteins for glutathionylation.

  7. Extreme sequence divergence but conserved ligand-binding specificity in Streptococcus pyogenes M protein.

    Directory of Open Access Journals (Sweden)

    2006-05-01

    Full Text Available Many pathogenic microorganisms evade host immunity through extensive sequence variability in a protein region targeted by protective antibodies. In spite of the sequence variability, a variable region commonly retains an important ligand-binding function, reflected in the presence of a highly conserved sequence motif. Here, we analyze the limits of sequence divergence in a ligand-binding region by characterizing the hypervariable region (HVR of Streptococcus pyogenes M protein. Our studies were focused on HVRs that bind the human complement regulator C4b-binding protein (C4BP, a ligand that confers phagocytosis resistance. A previous comparison of C4BP-binding HVRs identified residue identities that could be part of a binding motif, but the extended analysis reported here shows that no residue identities remain when additional C4BP-binding HVRs are included. Characterization of the HVR in the M22 protein indicated that two relatively conserved Leu residues are essential for C4BP binding, but these residues are probably core residues in a coiled-coil, implying that they do not directly contribute to binding. In contrast, substitution of either of two relatively conserved Glu residues, predicted to be solvent-exposed, had no effect on C4BP binding, although each of these changes had a major effect on the antigenic properties of the HVR. Together, these findings show that HVRs of M proteins have an extraordinary capacity for sequence divergence and antigenic variability while retaining a specific ligand-binding function.

  8. Rapid detection and purification of sequence specific DNA binding proteins using magnetic separation

    Directory of Open Access Journals (Sweden)

    TIJANA SAVIC

    2006-02-01

    Full Text Available In this paper, a method for the rapid identification and purification of sequence specific DNA binding proteins based on magnetic separation is presented. This method was applied to confirm the binding of the human recombinant USF1 protein to its putative binding site (E-box within the human SOX3 protomer. It has been shown that biotinylated DNA attached to streptavidin magnetic particles specifically binds the USF1 protein in the presence of competitor DNA. It has also been demonstrated that the protein could be successfully eluted from the beads, in high yield and with restored DNA binding activity. The advantage of these procedures is that they could be applied for the identification and purification of any high-affinity sequence-specific DNA binding protein with only minor modifications.

  9. Isolation and N-terminal sequencing of a novel cadmium-binding protein from Boletus edulis

    Science.gov (United States)

    Collin-Hansen, C.; Andersen, R. A.; Steinnes, E.

    2003-05-01

    A Cd-binding protein was isolated from the popular edible mushroom Boletus edulis, which is a hyperaccumulator of both Cd and Hg. Wild-growing samples of B. edulis were collected from soils rich in Cd. Cd radiotracer was added to the crude protein preparation obtained from ethanol precipitation of heat-treated cytosol. Proteins were then further separated in two consecutive steps; gel filtration and anion exchange chromatography. In both steps the Cd radiotracer profile showed only one distinct peak, which corresponded well with the profiles of endogenous Cd obtained by atomic absorption spectrophotometry (AAS). Concentrations of the essential elements Cu and Zn were low in the protein fractions high in Cd. N-terminal sequencing performed on the Cd-binding protein fractions revealed a protein with a novel amino acid sequence, which contained aromatic amino acids as well as proline. Both the N-terminal sequencing and spectrofluorimetric analysis with EDTA and ABD-F (4-aminosulfonyl-7-fluoro-2, 1, 3-benzoxadiazole) failed to detect cysteine in the Cd-binding fractions. These findings conclude that the novel protein does not belong to the metallothionein family. The results suggest a role for the protein in Cd transport and storage, and they are of importance in view of toxicology and food chemistry, but also for environmental protection.

  10. Mathematical modeling and comparison of protein size distribution in different plant, animal, fungal and microbial species reveals a negative correlation between protein size and protein number, thus providing insight into the evolution of proteomes

    Directory of Open Access Journals (Sweden)

    Tiessen Axel

    2012-02-01

    Full Text Available Abstract Background The sizes of proteins are relevant to their biochemical structure and for their biological function. The statistical distribution of protein lengths across a diverse set of taxa can provide hints about the evolution of proteomes. Results Using the full genomic sequences of over 1,302 prokaryotic and 140 eukaryotic species two datasets containing 1.2 and 6.1 million proteins were generated and analyzed statistically. The lengthwise distribution of proteins can be roughly described with a gamma type or log-normal model, depending on the species. However the shape parameter of the gamma model has not a fixed value of 2, as previously suggested, but varies between 1.5 and 3 in different species. A gamma model with unrestricted shape parameter described best the distributions in ~48% of the species, whereas the log-normal distribution described better the observed protein sizes in 42% of the species. The gamma restricted function and the sum of exponentials distribution had a better fitting in only ~5% of the species. Eukaryotic proteins have an average size of 472 aa, whereas bacterial (320 aa and archaeal (283 aa proteins are significantly smaller (33-40% on average. Average protein sizes in different phylogenetic groups were: Alveolata (628 aa, Amoebozoa (533 aa, Fornicata (543 aa, Placozoa (453 aa, Eumetazoa (486 aa, Fungi (487 aa, Stramenopila (486 aa, Viridiplantae (392 aa. Amino acid composition is biased according to protein size. Protein length correlated negatively with %C, %M, %K, %F, %R, %W, %Y and positively with %D, %E, %Q, %S and %T. Prokaryotic proteins had a different protein size bias for %E, %G, %K and %M as compared to eukaryotes. Conclusions Mathematical modeling of protein length empirical distributions can be used to asses the quality of small ORFs annotation in genomic releases (detection of too many false positive small ORFs. There is a negative correlation between average protein size and total number of

  11. Evolution of pH buffers and water homeostasis in eukaryotes: homology between humans and Acanthamoeba proteins.

    Science.gov (United States)

    Baig, Abdul M; Zohaib, R; Tariq, S; Ahmad, H R

    2018-02-01

    This study intended to trace the evolution of acid-base buffers and water homeostasis in eukaryotes. Acanthamoeba castellanii  was selected as a model unicellular eukaryote for this purpose. Homologies of proteins involved in pH and water regulatory mechanisms at cellular levels were compared between humans and A. castellanii. Amino acid sequence homology, structural homology, 3D modeling and docking prediction were done to show the extent of similarities between carbonic anhydrase 1 (CA1), aquaporin (AQP), band-3 protein and H + pump. Experimental assays were done with acetazolamide (AZM), brinzolamide and mannitol to observe their effects on the trophozoites of  A. castellanii.  The human CA1, AQP, band-3 protein and H + -transport proteins revealed similar proteins in Acanthamoeba. Docking showed the binding of AZM on amoebal AQP-like proteins.  Acanthamoeba showed transient shape changes and encystation at differential doses of brinzolamide, mannitol and AZM.  Conclusion: Water and pH regulating adapter proteins in Acanthamoeba and humans show significant homology, these mechanisms evolved early in the primitive unicellular eukaryotes and have remained conserved in multicellular eukaryotes.

  12. Refining the Results of a Classical SELEX Experiment by Expanding the Sequence Data Set of an Aptamer Pool Selected for Protein A

    Directory of Open Access Journals (Sweden)

    Regina Stoltenburg

    2018-02-01

    Full Text Available New, as yet undiscovered aptamers for Protein A were identified by applying next generation sequencing (NGS to a previously selected aptamer pool. This pool was obtained in a classical SELEX (Systematic Evolution of Ligands by EXponential enrichment experiment using the FluMag-SELEX procedure followed by cloning and Sanger sequencing. PA#2/8 was identified as the only Protein A-binding aptamer from the Sanger sequence pool, and was shown to be able to bind intact cells of Staphylococcus aureus. In this study, we show the extension of the SELEX results by re-sequencing of the same aptamer pool using a medium throughput NGS approach and data analysis. Both data pools were compared. They confirm the selection of a highly complex and heterogeneous oligonucleotide pool and show consistently a high content of orphans as well as a similar relative frequency of certain sequence groups. But in contrast to the Sanger data pool, the NGS pool was clearly dominated by one sequence group containing the known Protein A-binding aptamer PA#2/8 as the most frequent sequence in this group. In addition, we found two new sequence groups in the NGS pool represented by PA-C10 and PA-C8, respectively, which also have high specificity for Protein A. Comparative affinity studies reveal differences between the aptamers and confirm that PA#2/8 remains the most potent sequence within the selected aptamer pool reaching affinities in the low nanomolar range of KD = 20 ± 1 nM.

  13. Refining the Results of a Classical SELEX Experiment by Expanding the Sequence Data Set of an Aptamer Pool Selected for Protein A.

    Science.gov (United States)

    Stoltenburg, Regina; Strehlitz, Beate

    2018-02-24

    New, as yet undiscovered aptamers for Protein A were identified by applying next generation sequencing (NGS) to a previously selected aptamer pool. This pool was obtained in a classical SELEX (Systematic Evolution of Ligands by EXponential enrichment) experiment using the FluMag-SELEX procedure followed by cloning and Sanger sequencing. PA#2/8 was identified as the only Protein A-binding aptamer from the Sanger sequence pool, and was shown to be able to bind intact cells of Staphylococcus aureus . In this study, we show the extension of the SELEX results by re-sequencing of the same aptamer pool using a medium throughput NGS approach and data analysis. Both data pools were compared. They confirm the selection of a highly complex and heterogeneous oligonucleotide pool and show consistently a high content of orphans as well as a similar relative frequency of certain sequence groups. But in contrast to the Sanger data pool, the NGS pool was clearly dominated by one sequence group containing the known Protein A-binding aptamer PA#2/8 as the most frequent sequence in this group. In addition, we found two new sequence groups in the NGS pool represented by PA-C10 and PA-C8, respectively, which also have high specificity for Protein A. Comparative affinity studies reveal differences between the aptamers and confirm that PA#2/8 remains the most potent sequence within the selected aptamer pool reaching affinities in the low nanomolar range of K D = 20 ± 1 nM.

  14. Investigating Correlation between Protein Sequence Similarity and Semantic Similarity Using Gene Ontology Annotations.

    Science.gov (United States)

    Ikram, Najmul; Qadir, Muhammad Abdul; Afzal, Muhammad Tanvir

    2018-01-01

    Sequence similarity is a commonly used measure to compare proteins. With the increasing use of ontologies, semantic (function) similarity is getting importance. The correlation between these measures has been applied in the evaluation of new semantic similarity methods, and in protein function prediction. In this research, we investigate the relationship between the two similarity methods. The results suggest absence of a strong correlation between sequence and semantic similarities. There is a large number of proteins with low sequence similarity and high semantic similarity. We observe that Pearson's correlation coefficient is not sufficient to explain the nature of this relationship. Interestingly, the term semantic similarity values above 0 and below 1 do not seem to play a role in improving the correlation. That is, the correlation coefficient depends only on the number of common GO terms in proteins under comparison, and the semantic similarity measurement method does not influence it. Semantic similarity and sequence similarity have a distinct behavior. These findings are of significant effect for future works on protein comparison, and will help understand the semantic similarity between proteins in a better way.

  15. The Coding of Biological Information: From Nucleotide Sequence to Protein Recognition

    Science.gov (United States)

    Štambuk, Nikola

    The paper reviews the classic results of Swanson, Dayhoff, Grantham, Blalock and Root-Bernstein, which link genetic code nucleotide patterns to the protein structure, evolution and molecular recognition. Symbolic representation of the binary addresses defining particular nucleotide and amino acid properties is discussed, with consideration of: structure and metric of the code, direct correspondence between amino acid and nucleotide information, and molecular recognition of the interacting protein motifs coded by the complementary DNA and RNA strands.

  16. Identification of physicochemical selective pressure on protein encoding nucleotide sequences

    Directory of Open Access Journals (Sweden)

    Sainudiin Raazesh

    2006-03-01

    Full Text Available Abstract Background Statistical methods for identifying positively selected sites in protein coding regions are one of the most commonly used tools in evolutionary bioinformatics. However, they have been limited by not taking the physiochemical properties of amino acids into account. Results We develop a new codon-based likelihood model for detecting site-specific selection pressures acting on specific physicochemical properties. Nonsynonymous substitutions are divided into substitutions that differ with respect to the physicochemical properties of interest, and those that do not. The substitution rates of these two types of changes, relative to the synonymous substitution rate, are then described by two parameters, γ and ω respectively. The new model allows us to perform likelihood ratio tests for positive selection acting on specific physicochemical properties of interest. The new method is first used to analyze simulated data and is shown to have good power and accuracy in detecting physicochemical selective pressure. We then re-analyze data from the class-I alleles of the human Major Histocompatibility Complex (MHC and from the abalone sperm lysine. Conclusion Our new method allows a more flexible framework to identify selection pressure on particular physicochemical properties.

  17. Evolution, diversification and expression of KNOX proteins in plants

    Directory of Open Access Journals (Sweden)

    Jie eGao

    2015-10-01

    Full Text Available The KNOX (KNOTTED1-like homeobox transcription factors play a pivotal role in leaf and meristem development. The majority of these proteins are characterized by the KNOX1, KNOX2, ELK and homeobox domains whereas the proteins of the KNATM family contain only the KNOX domains. We carried out an extensive inventory of these proteins and here report on a total of 394 KNOX proteins from 48 species. The land plant proteins fall into two classes (I and II as previously shown where the class I family seems to be most closely related to the green algae homologs. The KNATM proteins are restricted to Eudicots and some species have multiple paralogs of this protein. Certain plants are characterized by a significant increase in the number of KNOX paralogs; one example is Glycine max. Through the analysis of public gene expression data we show that the class II proteins of this plant have a relatively broad expression specificity as compared to class I proteins, consistent with previous studies of other plants. In G. max, class I protein are mainly distributed in axis tissues and KNATM paralogs are overall poorly expressed; highest expression is in the early plumular axis. Overall, analysis of gene expression in G. max demonstrates clearly that the expansion in gene number is associated with functional diversification.

  18. Convergent Evolution of Hemoglobin Function in High-Altitude Andean Waterfowl Involves Limited Parallelism at the Molecular Sequence Level.

    Directory of Open Access Journals (Sweden)

    Chandrasekhar Natarajan

    2015-12-01

    Full Text Available A fundamental question in evolutionary genetics concerns the extent to which adaptive phenotypic convergence is attributable to convergent or parallel changes at the molecular sequence level. Here we report a comparative analysis of hemoglobin (Hb function in eight phylogenetically replicated pairs of high- and low-altitude waterfowl taxa to test for convergence in the oxygenation properties of Hb, and to assess the extent to which convergence in biochemical phenotype is attributable to repeated amino acid replacements. Functional experiments on native Hb variants and protein engineering experiments based on site-directed mutagenesis revealed the phenotypic effects of specific amino acid replacements that were responsible for convergent increases in Hb-O2 affinity in multiple high-altitude taxa. In six of the eight taxon pairs, high-altitude taxa evolved derived increases in Hb-O2 affinity that were caused by a combination of unique replacements, parallel replacements (involving identical-by-state variants with independent mutational origins in different lineages, and collateral replacements (involving shared, identical-by-descent variants derived via introgressive hybridization. In genome scans of nucleotide differentiation involving high- and low-altitude populations of three separate species, function-altering amino acid polymorphisms in the globin genes emerged as highly significant outliers, providing independent evidence for adaptive divergence in Hb function. The experimental results demonstrate that convergent changes in protein function can occur through multiple historical paths, and can involve multiple possible mutations. Most cases of convergence in Hb function did not involve parallel substitutions and most parallel substitutions did not affect Hb-O2 affinity, indicating that the repeatability of phenotypic evolution does not require parallelism at the molecular level.

  19. Protein phylogenetic analysis of Ca2+/cation antiporters and insights into their evolution in plants

    Directory of Open Access Journals (Sweden)

    Laura eEmery

    2012-01-01

    Full Text Available Cation transport is a critical process in all organisms and is essential for mineral nutrition, ion stress tolerance, and signal transduction. Transporters that are members of the Ca2+/Cation Antiporter (CaCA superfamily are involved in the transport of Ca2+ and/or other cations using the counter exchange of another ion such as H+ or Na+. The CaCA superfamily has been previously divided into five transporter families: the YRBG, NCX, NCKX, CAX and CCX families, which include the well-characterized Na+/Ca2+ exchanger (NCX and H+/cation exchanger (CAX transporters. To examine the evolution of CaCA transporters within higher plants and the green plant lineage, CaCA genes were identified from the genomes of sequenced flowering plants, a bryophyte, lycophyte, and freshwater and marine algae, and compared with those from non-plant species. We found evidence of the expansion and increased diversity of flowering plant genes within the CAX and CCX families. Genes related to the NCX family are present in land plant though they encode distinct MHX homologs which probably have an altered transport function. In contrast, the NCX and NCKX genes which are absent in land plants have been retained in many species of algae, especially the marine algae, indicating that these organisms may share ‘animal-like’ characteristics of Ca2+ homeostasis and signaling. A group of genes encoding novel CAX-like proteins containing an EF hand domain were identified from plants and selected algae but appeared to be lacking in any other species. Lack of functional data for most of the CaCA proteins make it impossible to reliably predict substrate specificity and function for many of the groups or individual proteins. The abundance and diversity of CaCA genes throughout all branches of life indicates the importance of this class of cation transporter, and that many transporters with novel functions are waiting to be discovered.

  20. Variation in the prion protein sequence in Dutch goat breeds.

    Science.gov (United States)

    Windig, J J; Hoving, R A H; Priem, J; Bossers, A; van Keulen, L J M; Langeveld, J P M

    2016-10-01

    Scrapie is a neurodegenerative disease occurring in goats and sheep. Several haplotypes of the prion protein increase resistance to scrapie infection and may be used in selective breeding to help eradicate scrapie. In this study, frequencies of the allelic variants of the PrP gene are determined for six goat breeds in the Netherlands. Overall frequencies in Dutch goats were determined from 768 brain tissue samples in 2005, 766 in 2008 and 300 in 2012, derived from random sampling for the national scrapie surveillance without knowledge of the breed. Breed specific frequencies were determined in the winter 2013/2014 by sampling 300 breeding animals from the main breeders of the different breeds. Detailed analysis of the scrapie-resistant K222 haplotype was carried out in 2014 for 220 Dutch Toggenburger goats and in 2015 for 942 goats from the Saanen derived White Goat breed. Nine haplotypes were identified in the Dutch breeds. Frequencies for non-wild type haplotypes were generally low. Exception was the K222 haplotype in the Dutch Toggenburger (29%) and the S146 haplotype in the Nubian and Boer breeds (respectively 7 and 31%). The frequency of the K222 haplotype in the Toggenburger was higher than for any other breed reported in literature, while for the White Goat breed it was with 3.1% similar to frequencies of other Saanen or Saanen derived breeds. Further evidence was found for the existence of two M142 haplotypes, M142 /S240 and M142 /P240 . Breeds vary in haplotype frequencies but frequencies of resistant genotypes are generally low and consequently selective breeding for scrapie resistance can only be slow but will benefit from animals identified in this study. The unexpectedly high frequency of the K222 haplotype in the Dutch Toggenburger underlines the need for conservation of rare breeds in order to conserve genetic diversity rare or absent in other breeds. © 2016 Blackwell Verlag GmbH.

  1. Rapid directed evolution of stabilized proteins with cellular high-throughput encapsulation solubilization and screening (CHESS).

    Science.gov (United States)

    Yong, K J; Scott, D J

    2015-03-01

    Directed evolution is a powerful method for engineering proteins towards user-defined goals and has been used to generate novel proteins for industrial processes, biological research and drug discovery. Typical directed evolution techniques include cellular display, phage display, ribosome display and water-in-oil compartmentalization, all of which physically link individual members of diverse gene libraries to their translated proteins. This allows the screening or selection for a desired protein function and subsequent isolation of the encoding gene from diverse populations. For biotechnological and industrial applications there is a need to engineer proteins that are functional under conditions that are not compatible with these techniques, such as high temperatures and harsh detergents. Cellular High-throughput Encapsulation Solubilization and Screening (CHESS), is a directed evolution method originally developed to engineer detergent-stable G proteins-coupled receptors (GPCRs) for structural biology. With CHESS, library-transformed bacterial cells are encapsulated in detergent-resistant polymers to form capsules, which serve to contain mutant genes and their encoded proteins upon detergent mediated solubilization of cell membranes. Populations of capsules can be screened like single cells to enable rapid isolation of genes encoding detergent-stable protein mutants. To demonstrate the general applicability of CHESS to other proteins, we have characterized the stability and permeability of CHESS microcapsules and employed CHESS to generate thermostable, sodium dodecyl sulfate (SDS) resistant green fluorescent protein (GFP) mutants, the first soluble proteins to be engineered using CHESS. © 2014 Wiley Periodicals, Inc.

  2. Unification of Cas protein families and a simple scenario for the origin and evolution of CRISPR-Cas systems

    Directory of Open Access Journals (Sweden)

    Wolf Yuri I

    2011-07-01

    Full Text Available Abstract Background The CRISPR-Cas adaptive immunity systems that are present in most Archaea and many Bacteria function by incorporating fragments of alien genomes into specific genomic loci, transcribing the inserts and using the transcripts as guide RNAs to destroy the genome of the cognate virus or plasmid. This RNA interference-like immune response is mediated by numerous, diverse and rapidly evolving Cas (CRISPR-associated proteins, several of which form the Cascade complex involved in the processing of CRISPR transcripts and cleavage of the target DNA. Comparative analysis of the Cas protein sequences and structures led to the classification of the CRISPR-Cas systems into three Types (I, II and III. Results A detailed comparison of the available sequences and structures of Cas proteins revealed several unnoticed homologous relationships. The Repeat-Associated Mysterious Proteins (RAMPs containing a distinct form of the RNA Recognition Motif (RRM domain, which are major components of the CRISPR-Cas systems, were classified into three large groups, Cas5, Cas6 and Cas7. Each of these groups includes many previously uncharacterized proteins now shown to adopt the RAMP structure. Evidence is presented that large subunits contained in most of the CRISPR-Cas systems could be homologous to Cas10 proteins which contain a polymerase-like Palm domain and are predicted to be enzymatically active in Type III CRISPR-Cas systems but inactivated in Type I systems. These findings, the fact that the CRISPR polymerases, RAMPs and Cas2 all contain core RRM domains, and distinct gene arrangements in the three types of CRISPR-Cas systems together provide for a simple scenario for origin and evolution of the CRISPR-Cas machinery. Under this scenario, the CRISPR-Cas system originated in thermophilic Archaea and subsequently spread horizontally among prokaryotes. Conclusions Because of the extreme diversity of CRISPR-Cas systems, in-depth sequence and structure

  3. Structural insights into the evolution of a sexy protein: novel topology and restricted backbone flexibility in a hypervariable pheromone from the red-legged salamander, Plethodon shermani.

    Science.gov (United States)

    Wilburn, Damien B; Bowen, Kathleen E; Doty, Kari A; Arumugam, Sengodagounder; Lane, Andrew N; Feldhoff, Pamela W; Feldhoff, Richard C

    2014-01-01

    In response to pervasive sexual selection, protein sex pheromones often display rapid mutation and accelerated evolution of corresponding gene sequences. For proteins, the general dogma is that structure is maintained even as sequence or function may rapidly change. This phenomenon is well exemplified by the three-finger protein (TFP) superfamily: a diverse class of vertebrate proteins co-opted for many biological functions - such as components of snake venoms, regulators of the complement system, and coordinators of amphibian limb regeneration. All of the >200 structurally characterized TFPs adopt the namesake "three-finger" topology. In male red-legged salamanders, the TFP pheromone Plethodontid Modulating Factor (PMF) is a hypervariable protein such that, through extensive gene duplication and pervasive sexual selection, individual male salamanders express more than 30 unique isoforms. However, it remained unclear how this accelerated evolution affected the protein structure of PMF. Using LC/MS-MS and multidimensional NMR, we report the 3D structure of the most abundant PMF isoform, PMF-G. The high resolution structural ensemble revealed a highly modified TFP structure, including a unique disulfide bonding pattern and loss of secondary structure, that define a novel protein topology with greater backbone flexibility in the third peptide finger. Sequence comparison, models of molecular evolution, and homology modeling together support that this flexible third finger is the most rapidly evolving segment of PMF. Combined with PMF sequence hypervariability, this structural flexibility may enhance the plasticity of PMF as a chemical signal by permitting potentially thousands of structural conformers. We propose that the flexible third finger plays a critical role in PMF:receptor interactions. As female receptors co-evolve, this flexibility may allow PMF to still bind its receptor(s) without the immediate need for complementary mutations. Consequently, this unique

  4. Structural insights into the evolution of a sexy protein: novel topology and restricted backbone flexibility in a hypervariable pheromone from the red-legged salamander, Plethodon shermani.

    Directory of Open Access Journals (Sweden)

    Damien B Wilburn

    Full Text Available In response to pervasive sexual selection, protein sex pheromones often display rapid mutation and accelerated evolution of corresponding gene sequences. For proteins, the general dogma is that structure is maintained even as sequence or function may rapidly change. This phenomenon is well exemplified by the three-finger protein (TFP superfamily: a diverse class of vertebrate proteins co-opted for many biological functions - such as components of snake venoms, regulators of the complement system, and coordinators of amphibian limb regeneration. All of the >200 structurally characterized TFPs adopt the namesake "three-finger" topology. In male red-legged salamanders, the TFP pheromone Plethodontid Modulating Factor (PMF is a hypervariable protein such that, through extensive gene duplication and pervasive sexual selection, individual male salamanders express more than 30 unique isoforms. However, it remained unclear how this accelerated evolution affected the protein structure of PMF. Using LC/MS-MS and multidimensional NMR, we report the 3D structure of the most abundant PMF isoform, PMF-G. The high resolution structural ensemble revealed a highly modified TFP structure, including a unique disulfide bonding pattern and loss of secondary structure, that define a novel protein topology with greater backbone flexibility in the third peptide finger. Sequence comparison, models of molecular evolution, and homology modeling together support that this flexible third finger is the most rapidly evolving segment of PMF. Combined with PMF sequence hypervariability, this structural flexibility may enhance the plasticity of PMF as a chemical signal by permitting potentially thousands of structural conformers. We propose that the flexible third finger plays a critical role in PMF:receptor interactions. As female receptors co-evolve, this flexibility may allow PMF to still bind its receptor(s without the immediate need for complementary mutations. Consequently

  5. Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs

    Directory of Open Access Journals (Sweden)

    Ruan Jishou

    2007-04-01

    Full Text Available Abstract Background Traditionally, it is believed that the native structure of a protein corresponds to a global minimum of its free energy. However, with the growing number of known tertiary (3D protein structures, researchers have discovered that some proteins can alter their structures in response to a change in their surroundings or with the help of other proteins or ligands. Such structural shifts play a crucial role with respect to the protein function. To this end, we propose a machine learning method for the prediction of the flexible/rigid regions of proteins (referred to as FlexRP; the method is based on a novel sequence representation and feature selection. Knowledge of the flexible/rigid regions may provide insights into the protein folding process and the 3D structure prediction. Results The flexible/rigid regions were defined based on a dataset, which includes protein sequences that have multiple experimental structures, and which was previously used to study the structural conservation of proteins. Sequences drawn from this dataset were represented based on feature sets that were proposed in prior research, such as PSI-BLAST profiles, composition vector and binary sequence encoding, and a newly proposed representation based on frequencies of k-spaced amino acid pairs. These representations were processed by feature selection to reduce the dimensionality. Several machine learning methods for the prediction of flexible/rigid regions and two recently proposed methods for the prediction of conformational changes and unstructured regions were compared with the proposed method. The FlexRP method, which applies Logistic Regression and collocation-based representation with 95 features, obtained 79.5% accuracy. The two runner-up methods, which apply the same sequence representation and Support Vector Machines (SVM and Naïve Bayes classifiers, obtained 79.2% and 78.4% accuracy, respectively. The remaining considered methods are

  6. A two-step recognition of signal sequences determines the translocation efficiency of proteins.

    OpenAIRE

    Belin, D; Bost, S; Vassalli, J D; Strub, K

    1996-01-01

    The cytosolic and secreted, N-glycosylated, forms of plasminogen activator inhibitor-2 (PAI-2) are generated by facultative translocation. To study the molecular events that result in the bi-topological distribution of proteins, we determined in vitro the capacities of several signal sequences to bind the signal recognition particle (SRP) during targeting, and to promote vectorial transport of murine PAI-2 (mPAI-2). Interestingly, the six signal sequences we compared (mPAI-2 and three mutated...

  7. DIALIGN: multiple DNA and protein sequence alignment at BiBiServ.

    OpenAIRE

    Morgenstern, Burkhard

    2004-01-01

    DIALIGN is a widely used software tool for multiple DNA and protein sequence alignment. The program combines local and global alignment features and can therefore be applied to sequence data that cannot be correctly aligned by more traditional approaches. DIALIGN is available online through Bielefeld Bioinformatics Server (BiBiServ). The downloadable version of the program offers several new program features. To compare the output of different alignment programs, we developed the program AltA...

  8. TRDistiller: a rapid filter for enrichment of sequence datasets with proteins containing tandem repeats.

    Science.gov (United States)

    Richard, François D; Kajava, Andrey V

    2014-06-01

    The dramatic growth of sequencing data evokes an urgent need to improve bioinformatics tools for large-scale proteome analysis. Over the last two decades, the foremost efforts of computer scientists were devoted to proteins with aperiodic sequences having globular 3D structures. However, a large portion of proteins contain periodic sequences representing arrays of repeats that are directly adjacent to each other (so called tandem repeats or TRs). These proteins frequently fold into elongated fibrous structures carrying different fundamental functions. Algorithms specific to the analysis of these regions are urgently required since the conventional approaches developed for globular domains have had limited success when applied to the TR regions. The protein TRs are frequently not perfect, containing a number of mutations, and some of them cannot be easily identified. To detect such "hidden" repeats several algorithms have been developed. However, the most sensitive among them are time-consuming and, therefore, inappropriate for large scale proteome analysis. To speed up the TR detection we developed a rapid filter that is based on the comparison of composition and order of short strings in the adjacent sequence motifs. Tests show that our filter discards up to 22.5% of proteins which are known to be without TRs while keeping almost all (99.2%) TR-containing sequences. Thus, we are able to decrease the size of the initial sequence dataset enriching it with TR-containing proteins which allows a faster subsequent TR detection by other methods. The program is available upon request. Copyright © 2014 Elsevier Inc. All rights reserved.

  9. Large-scale analysis of intrinsic disorder flavors and associated functions in the protein sequence universe.

    Science.gov (United States)

    Necci, Marco; Piovesan, Damiano; Tosatto, Silvio C E

    2016-12-01

    Intrinsic disorder (ID) in proteins has been extensively described for the last decade; a large-scale classification of ID in proteins is mostly missing. Here, we provide an extensive analysis of ID in the protein universe on the UniProt database derived from sequence-based predictions in MobiDB. Almost half the sequences contain an ID region of at least five residues. About 9% of proteins have a long ID region of over 20 residues which are more abundant in Eukaryotic organisms and most frequently cover less than 20% of the sequence. A small subset of about 67,000 (out of over 80 million) proteins is fully disordered and mostly found in Viruses. Most proteins have only one ID, with short ID evenly distributed along the sequence and long ID overrepresented in the center. The charged residue composition of Das and Pappu was used to classify ID proteins by structural propensities and corresponding functional enrichment. Swollen Coils seem to be used mainly as structural components and in biosynthesis in both Prokaryotes and Eukaryotes. In Bacteria, they are confined in the nucleoid and in Viruses provide DNA binding function. Coils & Hairpins seem to be specialized in ribosome binding and methylation activities. Globules & Tadpoles bind antigens in Eukaryotes but are involved in killing other organisms and cytolysis in Bacteria. The Undefined class is used by Bacteria to bind toxic substances and mediate transport and movement between and within organisms in Viruses. Fully disordered proteins behave similarly, but are enriched for glycine residues and extracellular structures. © 2016 The Protein Society.

  10. RSARF: Prediction of residue solvent accessibility from protein sequence using random forest method

    KAUST Repository

    Ganesan, Pugalenthi; Kandaswamy, Krishna Kumar Umar; Chou -, Kuochen; Vivekanandan, Saravanan; Kolatkar, Prasanna R.

    2012-01-01

    Prediction of protein structure from its amino acid sequence is still a challenging problem. The complete physicochemical understanding of protein folding is essential for the accurate structure prediction. Knowledge of residue solvent accessibility gives useful insights into protein structure prediction and function prediction. In this work, we propose a random forest method, RSARF, to predict residue accessible surface area from protein sequence information. The training and testing was performed using 120 proteins containing 22006 residues. For each residue, buried and exposed state was computed using five thresholds (0%, 5%, 10%, 25%, and 50%). The prediction accuracy for 0%, 5%, 10%, 25%, and 50% thresholds are 72.9%, 78.25%, 78.12%, 77.57% and 72.07% respectively. Further, comparison of RSARF with other methods using a benchmark dataset containing 20 proteins shows that our approach is useful for prediction of residue solvent accessibility from protein sequence without using structural information. The RSARF program, datasets and supplementary data are available at http://caps.ncbs.res.in/download/pugal/RSARF/. - See more at: http://www.eurekaselect.com/89216/article#sthash.pwVGFUjq.dpuf

  11. Trends in global warming and evolution of matrix protein 2 family from influenza A virus.

    Science.gov (United States)

    Yan, Shao-Min; Wu, Guang

    2009-12-01

    The global warming is an important factor affecting the biological evolution, and the influenza is an important disease that threatens humans with possible epidemics or pandemics. In this study, we attempted to analyze the trends in global warming and evolution of matrix protein 2 family from influenza A virus, because this protein is a target of anti-flu drug, and its mutation would have significant effect on the resistance to anti-flu drugs. The evolution of matrix protein 2 of influenza A virus from 1959 to 2008 was defined using the unpredictable portion of amino-acid pair predictability. Then the trend in this evolution was compared with the trend in the global temperature, the temperature in north and south hemispheres, and the temperature in influenza A virus sampling site, and species carrying influenza A virus. The results showed the similar trends in global warming and in evolution of M2 proteins although we could not correlate them at this stage of study. The study suggested the potential impact of global warming on the evolution of proteins from influenza A virus.

  12. Insights into hominin phenotypic and dietary evolution from ancient DNA sequence data.

    Science.gov (United States)

    Perry, George H; Kistler, Logan; Kelaita, Mary A; Sams, Aaron J

    2015-02-01

    Nuclear genome sequence data from Neandertals, Denisovans, and archaic anatomically modern humans can be used to complement our understanding of hominin evolutionary biology and ecology through i) direct inference of archaic hominin phenotypes, ii) indirect inference of those phenotypes by identifying the effects of previously-introgressed alleles still present among modern humans, or iii) determining the evolutionary timing of relevant hominin-specific genetic changes. Here we review and reanalyze published Neandertal and Denisovan genome sequence data to illustrate an example of the third approach. Specifically, we infer the timing of five human gene presence/absence changes that may be related to particular hominin-specific dietary changes and discuss these results in the context of our broader reconstructions of hominin evolutionary ecology. We show that pseudogenizing (gene loss) mutations in the TAS2R62 and TAS2R64 bitter taste receptor genes and the MYH16 masticatory myosin gene occurred after the hominin-chimpanzee divergence but before the divergence of the human and Neandertal/Denisovan lineages. The absence of a functional MYH16 protein may explain our relatively reduced jaw muscles; this gene loss may have followed the adoption of cooking behavior. In contrast, salivary amylase gene (AMY1) duplications were not observed in the Neandertal and Denisovan genomes, suggesting a relatively recent origin for the AMY1 copy number gains that are observed in modern humans. Thus, if earlier hominins were consuming large quantities of starch-rich underground storage organs, as previously hypothesized, then they were likely doing so without the digestive benefits of increased salivary amylase production. Our most surprising result was the observation of a heterozygous mutation in the first codon of the TAS2R38 bitter taste receptor gene in the Neandertal individual, which likely would have resulted in a non-functional protein and inter-individual PTC

  13. Conflict RNA modification, host-parasite co-evolution, and the origins of DNA and DNA-binding proteins1.

    Science.gov (United States)

    McLaughlin, Paul J; Keegan, Liam P

    2014-08-01

    Nearly 150 different enzymatically modified forms of the four canonical residues in RNA have been identified. For instance, enzymes of the ADAR (adenosine deaminase acting on RNA) family convert adenosine residues into inosine in cellular dsRNAs. Recent findings show that DNA endonuclease V enzymes have undergone an evolutionary transition from cleaving 3' to deoxyinosine in DNA and ssDNA to cleaving 3' to inosine in dsRNA and ssRNA in humans. Recent work on dsRNA-binding domains of ADARs and other proteins also shows that a degree of sequence specificity is achieved by direct readout in the minor groove. However, the level of sequence specificity observed is much less than that of DNA major groove-binding helix-turn-helix proteins. We suggest that the evolution of DNA-binding proteins following the RNA to DNA genome transition represents the major advantage that DNA genomes have over RNA genomes. We propose that a hypothetical RNA modification, a RRAR (ribose reductase acting on genomic dsRNA) produced the first stretches of DNA in RNA genomes. We discuss why this is the most satisfactory explanation for the origin of DNA. The evolution of this RNA modification and later steps to DNA genomes are likely to have been driven by cellular genome co-evolution with viruses and intragenomic parasites. RNA modifications continue to be involved in host-virus conflicts; in vertebrates, edited cellular dsRNAs with inosine-uracil base pairs appear to be recognized as self RNA and to suppress activation of innate immune sensors that detect viral dsRNA.

  14. Recalibrating Equus evolution using the genome sequence of an early Middle Pleistocene horse.

    Science.gov (United States)

    Orlando, Ludovic; Ginolhac, Aurélien; Zhang, Guojie; Froese, Duane; Albrechtsen, Anders; Stiller, Mathias; Schubert, Mikkel; Cappellini, Enrico; Petersen, Bent; Moltke, Ida; Johnson, Philip L F; Fumagalli, Matteo; Vilstrup, Julia T; Raghavan, Maanasa; Korneliussen, Thorfinn; Malaspinas, Anna-Sapfo; Vogt, Josef; Szklarczyk, Damian; Kelstrup, Christian D; Vinther, Jakob; Dolocan, Andrei; Stenderup, Jesper; Velazquez, Amhed M V; Cahill, James; Rasmussen, Morten; Wang, Xiaoli; Min, Jiumeng; Zazula, Grant D; Seguin-Orlando, Andaine; Mortensen, Cecilie; Magnussen, Kim; Thompson, John F; Weinstock, Jacobo; Gregersen, Kristian; Røed, Knut H; Eisenmann, Véra; Rubin, Carl J; Miller, Donald C; Antczak, Douglas F; Bertelsen, Mads F; Brunak, Søren; Al-Rasheid, Khaled A S; Ryder, Oliver; Andersson, Leif; Mundy, John; Krogh, Anders; Gilbert, M Thomas P; Kjær, Kurt; Sicheritz-Ponten, Thomas; Jensen, Lars Juhl; Olsen, Jesper V; Hofreiter, Michael; Nielsen, Rasmus; Shapiro, Beth; Wang, Jun; Willerslev, Eske

    2013-07-04

    The rich fossil record of equids has made them a model for evolutionary processes. Here we present a 1.12-times coverage draft genome from a horse bone recovered from permafrost dated to approximately 560-780 thousand years before present (kyr BP). Our data represent the oldest full genome sequence determined so far by almost an order of magnitude. For comparison, we sequenced the genome of a Late Pleistocene horse (43 kyr BP), and modern genomes of five domestic horse breeds (Equus ferus caballus), a Przewalski's horse (E. f. przewalskii) and a donkey (E. asinus). Our analyses suggest that the Equus lineage giving rise to all contemporary horses, zebras and donkeys originated 4.0-4.5 million years before present (Myr BP), twice the conventionally accepted time to the most recent common ancestor of the genus Equus. We also find that horse population size fluctuated multiple times over the past 2 Myr, particularly during periods of severe climatic changes. We estimate that the Przewalski's and domestic horse populations diverged 38-72 kyr BP, and find no evidence of recent admixture between the domestic horse breeds and the Przewalski's horse investigated. This supports the contention that Przewalski's horses represent the last surviving wild horse population. We find similar levels of genetic variation among Przewalski's and domestic populations, indicating that the former are genetically viable and worthy of conservation efforts. We also find evidence for continuous selection on the immune system and olfaction throughout horse evolution. Finally, we identify 29 genomic regions among horse breeds that deviate from neutrality and show low levels of genetic variation compared to the Przewalski's horse. Such regions could correspond to loci selected early during domestication.

  15. Neural network and SVM classifiers accurately predict lipid binding proteins, irrespective of sequence homology.

    Science.gov (United States)

    Bakhtiarizadeh, Mohammad Reza; Moradi-Shahrbabak, Mohammad; Ebrahimi, Mansour; Ebrahimie, Esmaeil

    2014-09-07

    Due to the central roles of lipid binding proteins (LBPs) in many biological processes, sequence based identification of LBPs is of great interest. The major challenge is that LBPs are diverse in sequence, structure, and function which results in low accuracy of sequence homology based methods. Therefore, there is a need for developing alternative functional prediction methods irrespective of sequence similarity. To identify LBPs from non-LBPs, the performances of support vector machine (SVM) and neural network were compared in this study. Comprehensive protein features and various techniques were employed to create datasets. Five-fold cross-validation (CV) and independent evaluation (IE) tests were used to assess the validity of the two methods. The results indicated that SVM outperforms neural network. SVM achieved 89.28% (CV) and 89.55% (IE) overall accuracy in identification of LBPs from non-LBPs and 92.06% (CV) and 92.90% (IE) (in average) for classification of different LBPs classes. Increasing the number and the range of extracted protein features as well as optimization of the SVM parameters significantly increased the efficiency of LBPs class prediction in comparison to the only previous report in this field. Altogether, the results showed that the SVM algorithm can be run on broad, computationally calculated protein features and offers a promising tool in detection of LBPs classes. The proposed approach has the potential to integrate and improve the common sequence alignment based methods. Copyright © 2014 Elsevier Ltd. All rights reserved.

  16. Sequence preservation of osteocalcin protein and mitochondrial DNA in bison bones older than 55 ka

    Science.gov (United States)

    Nielsen-Marsh, Christina M.; Ostrom, Peggy H.; Gandhi, Hasand; Shapiro, Beth; Cooper, Alan; Hauschka, Peter V.; Collins, Matthew J.

    2002-12-01

    We report the first complete sequences of the protein osteocalcin from small amounts (20 mg) of two bison bone (Bison priscus) dated to older than 55.6 ka and older than 58.9 ka. Osteocalcin was purified using new gravity columns (never exposed to protein) followed by microbore reversed-phase high-performance liquid chromatography. Sequencing of osteocalcin employed two methods of matrix-assisted laser desorption ionization mass spectrometry (MALDI-MS): peptide mass mapping (PMM) and post-source decay (PSD). The PMM shows that ancient and modern bison osteocalcin have the same mass to charge (m/z) distribution, indicating an identical protein sequence and absence of diagenetic products. This was confirmed by PSD of the m/z 2066 tryptic peptide (residues 1 19); the mass spectra from ancient and modern peptides were identical. The 129 mass unit difference in the molecular ion between cow (Bos taurus) and bison is caused by a single amino-acid substitution between the taxa (Trp in cow is replaced by Gly in bison at residue 5). Bison mitochondrial control region DNA sequences were obtained from the older than 55.6 ka fossil. These results suggest that DNA and protein sequences can be used to directly investigate molecular phylogenies over a considerable time period, the absolute limit of which is yet to be determined.

  17. Efficient use of unlabeled data for protein sequence classification: a comparative study.

    Science.gov (United States)

    Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir

    2009-04-29

    Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags-the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably.

  18. Dynamic Evolution of Pathogenicity Revealed by Sequencing and Comparative Genomics of 19 Pseudomonas syringae Isolates

    Science.gov (United States)

    Romanchuk, Artur; Chang, Jeff H.; Mukhtar, M. Shahid; Cherkis, Karen; Roach, Jeff; Grant, Sarah R.; Jones, Corbin D.; Dangl, Jeffery L.

    2011-01-01

    Closely related pathogens may differ dramatically in host range, but the molecular, genetic, and evolutionary basis for these differences remains unclear. In many Gram- negative bacteria, including the phytopathogen Pseudomonas syringae, type III effectors (TTEs) are essential for pathogenicity, instrumental in structuring host range, and exhibit wide diversity between strains. To capture the dynamic nature of virulence gene repertoires across P. syringae, we screened 11 diverse strains for novel TTE families and coupled this nearly saturating screen with the sequencing and assembly of 14 phylogenetically diverse isolates from a broad collection of diseased host plants. TTE repertoires vary dramatically in size and content across all P. syringae clades; surprisingly few TTEs are conserved and present in all strains. Those that are likely provide basal requirements for pathogenicity. We demonstrate that functional divergence within one conserved locus, hopM1, leads to dramatic differences in pathogenicity, and we demonstrate that phylogenetics-informed mutagenesis can be used to identify functionally critical residues of TTEs. The dynamism of the TTE repertoire is mirrored by diversity in pathways affecting the synthesis of secreted phytotoxins, highlighting the likely role of both types of virulence factors in determination of host range. We used these 14 draft genome sequences, plus five additional genome sequences previously reported, to identify the core genome for P. syringae and we compared this core to that of two closely related non-pathogenic pseudomonad species. These data revealed the recent acquisition of a 1 Mb megaplasmid by a sub-clade of cucumber pathogens. This megaplasmid encodes a type IV secretion system and a diverse set of unknown proteins, which dramatically increases both the genomic content of these strains and the pan-genome of the species. PMID:21799664

  19. The Complete Chloroplast and Mitochondrial Genome Sequences of Boea hygrometrica: Insights into the Evolution of Plant Organellar Genomes

    Science.gov (United States)

    Wang, Xumin; Deng, Xin; Zhang, Xiaowei; Hu, Songnian; Yu, Jun

    2012-01-01

    The complete nucleotide sequences of the chloroplast (cp) and mitochondrial (mt) genomes of resurrection plant Boea hygrometrica (Bh, Gesneriaceae) have been determined with the lengths of 153,493 bp and 510,519 bp, respectively. The smaller chloroplast genome contains more genes (147) with a 72% coding sequence, and the larger mitochondrial genome have less genes (65) with a coding faction of 12%. Similar to other seed plants, the Bh cp genome has a typical quadripartite organization with a conserved gene in each region. The Bh mt genome has three recombinant sequence repeats of 222 bp, 843 bp, and 1474 bp in length, which divide the genome into a single master circle (MC) and four isomeric molecules. Compared to other angiosperms, one remarkable feature of the Bh mt genome is the frequent transfer of genetic material from the cp genome during recent Bh evolution. We also analyzed organellar genome evolution in general regarding genome features as well as compositional dynamics of sequence and gene structure/organization, providing clues for the understanding of the evolution of organellar genomes in plants. The cp-derived sequences including tRNAs found in angiosperm mt genomes support the conclusion that frequent gene transfer events may have begun early in the land plant lineage. PMID:22291979

  20. The nucleotide sequence of human transition protein 1 cDNA

    Energy Technology Data Exchange (ETDEWEB)

    Luerssen, H; Hoyer-Fender, S; Engel, W [Universitaet Goettingen (West Germany)

    1988-08-11

    The authors have screened a human testis cDNA library with an oligonucleotide of 81 mer prepared according to a part of the published nucleotide sequence of the rat transition protein TP 1. They have isolated a cDNA clone with the length of 441 bp containing the coding region of 162 bp for human transition protein 1. There is about 84% homology in the coding region of the sequence compared to rat. The human cDNA-clone encodes a polypeptide of 54 amino acids of which 7 are different to that of rat.

  1. Insights into the molecular evolution of the PDZ/LIM family and identification of a novel conserved protein motif.

    Directory of Open Access Journals (Sweden)

    Aartjan J W Te Velthuis

    Full Text Available The PDZ and LIM domain-containing protein family is encoded by a diverse group of genes whose phylogeny has currently not been analyzed. In mammals, ten genes are found that encode both a PDZ- and one or several LIM-domains. These genes are: ALP, RIL, Elfin (CLP36, Mystique, Enigma (LMP-1, Enigma homologue (ENH, ZASP (Cypher, Oracle, LMO7 and the two LIM domain kinases (LIMK1 and LIMK2. As conventional alignment and phylogenetic procedures of full-length sequences fell short of elucidating the evolutionary history of these genes, we started to analyze the PDZ and LIM domain sequences themselves. Using information from most sequenced eukaryotic lineages, our phylogenetic analysis is based on full-length cDNA-, EST-derived- and genomic- PDZ and LIM domain sequences of over 25 species, ranging from yeast to humans. Plant and protozoan homologs were not found. Our phylogenetic analysis identifies a number of domain duplication and rearrangement events, and shows a single convergent event during evolution of the PDZ/LIM family. Further, we describe the separation of the ALP and Enigma subfamilies in lower vertebrates and identify a novel consensus motif, which we call 'ALP-like motif' (AM. This motif is highly-conserved between ALP subfamily proteins of diverse organisms. We used here a combinatorial approach to define the relation of the PDZ and LIM domain encoding genes and to reconstruct their phylogeny. This analysis allowed us to classify the PDZ/LIM family and to suggest a meaningful model for the molecular evolution of the diverse gene architectures found in this multi-domain family.

  2. Unraveling the sequence and structure of the protein osteocalcin from a 42 ka fossil horse

    Science.gov (United States)

    Ostrom, Peggy H.; Gandhi, Hasand; Strahler, John R.; Walker, Angela K.; Andrews, Philip C.; Leykam, Joseph; Stafford, Thomas W.; Kelly, Robert L.; Walker, Danny N.; Buckley, Mike; Humpula, James

    2006-04-01

    We report the first complete amino acid sequence and evidence of secondary structure for osteocalcin from a temperate fossil. The osteocalcin derives from a 42 ka equid bone excavated from Juniper Cave, Wyoming. Results were determined by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-MS) and Edman sequencing with independent confirmation of the sequence in two laboratories. The ancient sequence was compared to that of three modern taxa: horse ( Equus caballus), zebra ( Equus grevyi), and donkey ( Equus asinus). Although there was no difference in sequence among modern taxa, MALDI-MS and Edman sequencing show that residues 48 and 49 of our modern horse are Thr, Ala rather than Pro, Val as previously reported (Carstanjen B., Wattiez, R., Armory, H., Lepage, O.M., Remy, B., 2002. Isolation and characterization of equine osteocalcin. Ann. Med. Vet.146(1), 31-38). MALDI-MS and Edman sequencing data indicate that the osteocalcin sequence of the 42 ka fossil is similar to that of modern horse. Previously inaccessible structural attributes for ancient osteocalcin were observed. Glu 39 rather than Gln 39 is consistent with deamidation, a process known to occur during fossilization and aging. Two post-translational modifications were documented: Hyp 9 and a disulfide bridge. The latter suggests at least partial retention of secondary structure. As has been done for ancient DNA research, we recommend standards for preparation and criteria for authenticating results of ancient protein sequencing.

  3. High dimensional and high resolution pulse sequences for backbone resonance assignment of intrinsically disordered proteins

    Energy Technology Data Exchange (ETDEWEB)

    Zawadzka-Kazimierczuk, Anna; Kozminski, Wiktor, E-mail: kozmin@chem.uw.edu.pl [University of Warsaw, Faculty of Chemistry (Poland); Sanderova, Hana; Krasny, Libor [Institute of Microbiology, Academy of Sciences of the Czech Republic, Laboratory of Molecular Genetics of Bacteria, Department of Bacteriology (Czech Republic)

    2012-04-15

    Four novel 5D (HACA(N)CONH, HNCOCACB, (HACA)CON(CA)CONH, (H)NCO(NCA)CONH), and one 6D ((H)NCO(N)CACONH) NMR pulse sequences are proposed. The new experiments employ non-uniform sampling that enables achieving high resolution in indirectly detected dimensions. The experiments facilitate resonance assignment of intrinsically disordered proteins. The novel pulse sequences were successfully tested using {delta} subunit (20 kDa) of Bacillus subtilis RNA polymerase that has an 81-amino acid disordered part containing various repetitive sequences.

  4. Sequence variability is correlated with weak immunogenicity in Streptococcus pyogenes M protein.

    Science.gov (United States)

    Lannergård, Jonas; Kristensen, Bodil M; Gustafsson, Mattias C U; Persson, Jenny J; Norrby-Teglund, Anna; Stålhammar-Carlemalm, Margaretha; Lindahl, Gunnar

    2015-10-01

    The M protein of Streptococcus pyogenes, a major bacterial virulence factor, has an amino-terminal hypervariable region (HVR) that is a target for type-specific protective antibodies. Intriguingly, the HVR elicits a weak antibody response, indicating that it escapes host immunity by two mechanisms, sequence variability and weak immunogenicity. However, the properties influencing the immunogenicity of regions in an M protein remain poorly understood. Here, we studied the antibody response to different regions of the classical M1 and M5 proteins, in which not only the HVR but also the adjacent fibrinogen-binding B repeat region exhibits extensive sequence divergence. Analysis of antisera from S. pyogenes-infected patients, infected mice, and immunized mice showed that both the HVR and the B repeat region elicited weak antibody responses, while the conserved carboxy-terminal part was immunodominant. Thus, we identified a correlation between sequence variability and weak immunogenicity for M protein regions. A potential explanation for the weak immunogenicity was provided by the demonstration that protease digestion selectively eliminated the HVR-B part from whole M protein-expressing bacteria. These data support a coherent model, in which the entire variable HVR-B part evades antibody attack, not only by sequence variability but also by weak immunogenicity resulting from protease attack. © 2015 The Authors. MicrobiologyOpen published by John Wiley & Sons Ltd.

  5. Sequence variability is correlated with weak immunogenicity in Streptococcus pyogenes M protein

    Science.gov (United States)

    Lannergård, Jonas; Kristensen, Bodil M; Gustafsson, Mattias C U; Persson, Jenny J; Norrby-Teglund, Anna; Stålhammar-Carlemalm, Margaretha; Lindahl, Gunnar

    2015-01-01

    The M protein of Streptococcus pyogenes, a major bacterial virulence factor, has an amino-terminal hypervariable region (HVR) that is a target for type-specific protective antibodies. Intriguingly, the HVR elicits a weak antibody response, indicating that it escapes host immunity by two mechanisms, sequence variability and weak immunogenicity. However, the properties influencing the immunogenicity of regions in an M protein remain poorly understood. Here, we studied the antibody response to different regions of the classical M1 and M5 proteins, in which not only the HVR but also the adjacent fibrinogen-binding B repeat region exhibits extensive sequence divergence. Analysis of antisera from S. pyogenes-infected patients, infected mice, and immunized mice showed that both the HVR and the B repeat region elicited weak antibody responses, while the conserved carboxy-terminal part was immunodominant. Thus, we identified a correlation between sequence variability and weak immunogenicity for M protein regions. A potential explanation for the weak immunogenicity was provided by the demonstration that protease digestion selectively eliminated the HVR-B part from whole M protein-expressing bacteria. These data support a coherent model, in which the entire variable HVR-B part evades antibody attack, not only by sequence variability but also by weak immunogenicity resulting from protease attack. PMID:26175306

  6. Protein backbone angle restraints from searching a database for chemical shift and sequence homology

    Energy Technology Data Exchange (ETDEWEB)

    Cornilescu, Gabriel; Delaglio, Frank; Bax, Ad [National Institutes of Health, Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases (United States)

    1999-03-15

    Chemical shifts of backbone atoms in proteins are exquisitely sensitive to local conformation, and homologous proteins show quite similar patterns of secondary chemical shifts. The inverse of this relation is used to search a database for triplets of adjacent residues with secondary chemical shifts and sequence similarity which provide the best match to the query triplet of interest. The database contains 13C{alpha}, 13C{beta}, 13C', 1H{alpha} and 15N chemical shifts for 20 proteins for which a high resolution X-ray structure is available. The computer program TALOS was developed to search this database for strings of residues with chemical shift and residue type homology. The relative importance of the weighting factors attached to the secondary chemical shifts of the five types of resonances relative to that of sequence similarity was optimized empirically. TALOS yields the 10 triplets which have the closest similarity in secondary chemical shift and amino acid sequence to those of the query sequence. If the central residues in these 10 triplets exhibit similar {phi} and {psi} backbone angles, their averages can reliably be used as angular restraints for the protein whose structure is being studied. Tests carried out for proteins of known structure indicate that the root-mean-square difference (rmsd) between the output of TALOS and the X-ray derived backbone angles is about 15 deg. Approximately 3% of the predictions made by TALOS are found to be in error.

  7. UFO: a web server for ultra-fast functional profiling of whole genome protein sequences.

    Science.gov (United States)

    Meinicke, Peter

    2009-09-02

    Functional profiling is a key technique to characterize and compare the functional potential of entire genomes. The estimation of profiles according to an assignment of sequences to functional categories is a computationally expensive task because it requires the comparison of all protein sequences from a genome with a usually large database of annotated sequences or sequence families. Based on machine learning techniques for Pfam domain detection, the UFO web server for ultra-fast functional profiling allows researchers to process large protein sequence collections instantaneously. Besides the frequencies of Pfam and GO categories, the user also obtains the sequence specific assignments to Pfam domain families. In addition, a comparison with existing genomes provides dissimilarity scores with respect to 821 reference proteomes. Considering the underlying UFO domain detection, the results on 206 test genomes indicate a high sensitivity of the approach. In comparison with current state-of-the-art HMMs, the runtime measurements show a considerable speed up in the range of four orders of magnitude. For an average size prokaryotic genome, the computation of a functional profile together with its comparison typically requires about 10 seconds of processing time. For the first time the UFO web server makes it possible to get a quick overview on the functional inventory of newly sequenced organisms. The genome scale comparison with a large number of precomputed profiles allows a first guess about functionally related organisms. The service is freely available and does not require user registration or specification of a valid email address.

  8. Evolution Models of Helium White Dwarf–Main-sequence Star Merger Remnants

    Energy Technology Data Exchange (ETDEWEB)

    Zhang, Xianfei; Bi, Shaolan [Department of Astronomy, Beijing Normal University, Beijing, 100875 (China); Hall, Philip D.; Jeffery, C. Simon, E-mail: zxf@bnu.edu.cn [Armagh Observatory, College Hill, Armagh BT61 9DG (United Kingdom)

    2017-02-01

    It is predicted that orbital decay by gravitational-wave radiation and tidal interaction will cause some close binary stars to merge within a Hubble time. The merger of a helium-core white dwarf with a main-sequence (MS) star can produce a red giant branch star that has a low-mass hydrogen envelope when helium is ignited and thus become a hot subdwarf. Because detailed calculations have not been made, we compute post-merger models with a stellar evolution code. We find the evolutionary paths available to merger remnants and find the pre-merger conditions that lead to the formation of hot subdwarfs. We find that some such mergers result in the formation of stars with intermediate helium-rich surfaces. These stars later develop helium-poor surfaces owing to diffusion. Combining our results with a model population and comparing to observed stars, we find that some observed intermediate helium-rich hot subdwarfs can be explained as the remnants of the mergers of helium-core white dwarfs with low-mass MS stars.

  9. Evolution Models of Helium White Dwarf–Main-sequence Star Merger Remnants

    International Nuclear Information System (INIS)

    Zhang, Xianfei; Bi, Shaolan; Hall, Philip D.; Jeffery, C. Simon

    2017-01-01

    It is predicted that orbital decay by gravitational-wave radiation and tidal interaction will cause some close binary stars to merge within a Hubble time. The merger of a helium-core white dwarf with a main-sequence (MS) star can produce a red giant branch star that has a low-mass hydrogen envelope when helium is ignited and thus become a hot subdwarf. Because detailed calculations have not been made, we compute post-merger models with a stellar evolution code. We find the evolutionary paths available to merger remnants and find the pre-merger conditions that lead to the formation of hot subdwarfs. We find that some such mergers result in the formation of stars with intermediate helium-rich surfaces. These stars later develop helium-poor surfaces owing to diffusion. Combining our results with a model population and comparing to observed stars, we find that some observed intermediate helium-rich hot subdwarfs can be explained as the remnants of the mergers of helium-core white dwarfs with low-mass MS stars.

  10. System risk evolution analysis and risk critical event identification based on event sequence diagram

    International Nuclear Information System (INIS)

    Luo, Pengcheng; Hu, Yang

    2013-01-01

    During system operation, the environmental, operational and usage conditions are time-varying, which causes the fluctuations of the system state variables (SSVs). These fluctuations change the accidents’ probabilities and then result in the system risk evolution (SRE). This inherent relation makes it feasible to realize risk control by monitoring the SSVs in real time, herein, the quantitative analysis of SRE is essential. Besides, some events in the process of SRE are critical to system risk, because they act like the “demarcative points” of safety and accident, and this characteristic makes each of them a key point of risk control. Therefore, analysis of SRE and identification of risk critical events (RCEs) are remarkably meaningful to ensure the system to operate safely. In this context, an event sequence diagram (ESD) based method of SRE analysis and the related Monte Carlo solution are presented; RCE and risk sensitive variable (RSV) are defined, and the corresponding identification methods are also proposed. Finally, the proposed approaches are exemplified with an accident scenario of an aircraft getting into the icing region

  11. Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

    Directory of Open Access Journals (Sweden)

    Li Weizhong

    2008-04-01

    Full Text Available Abstract Background The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets are characterized by the presence of organisms with varying GC composition, codon usage biases etc., and consequently gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools. Results We present a computational improvement to a sequence clustering approach that we developed previously to identify and classify protein coding genes in large microbial metagenomic datasets. The clustering approach can be used to identify protein coding genes in prokaryotes, viruses, and intron-less eukaryotes. The computational improvement is based on an incremental clustering method that does not require the expensive all-against-all compute that was required by the original approach, while still preserving the remote homology detection capabilities. We present evaluations of the clustering approach in protein-coding gene identification and classification, and also present the results of updating the protein clusters from our previous work with recent genomic and metagenomic sequences. The clustering results are available via CAMERA, (http://camera.calit2.net. Conclusion The clustering paradigm is shown to be a very useful tool in the analysis of microbial metagenomic data. The incremental clustering method is shown to be much faster than the original approach in identifying genes, grouping sequences into existing protein families, and also identifying novel families that have multiple members in a metagenomic dataset. These clusters provide a basis for further studies of protein families.

  12. A measure of the denseness of a phylogenetic network. [by sequenced proteins from extant species

    Science.gov (United States)

    Holmquist, R.

    1978-01-01

    An objective measure of phylogenetic denseness is developed to examine various phylogenetic criteria: alpha- and beta-hemoglobin, myoglobin, cytochrome c, and the parvalbumin family. Attention is given to the number of nucleotide replacements separating homologous sequences, and to the topology of the network (in other words, to the qualitative nature of the network as defined by how closely the studied species are related). Applications include quantitative comparisons of species origin, relation, and rates of evolution.

  13. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data.

    Science.gov (United States)

    Hawkins, Troy; Chitale, Meghana; Luban, Stanislav; Kihara, Daisuke

    2009-02-15

    Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (>or=80%). We also applied our predictions to the protein-protein interaction network of the Malaria plasmodium (Plasmodium falciparum). High confidence GO biological process predictions (>or=90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments. PFP is available as a web service at http

  14. Sequence charge decoration dictates coil-globule transition in intrinsically disordered proteins.

    Science.gov (United States)

    Firman, Taylor; Ghosh, Kingshuk

    2018-03-28

    We present an analytical theory to compute conformations of heteropolymers-applicable to describe disordered proteins-as a function of temperature and charge sequence. The theory describes coil-globule transition for a given protein sequence when temperature is varied and has been benchmarked against the all-atom Monte Carlo simulation (using CAMPARI) of intrinsically disordered proteins (IDPs). In addition, the model quantitatively shows how subtle alterations of charge placement in the primary sequence-while maintaining the same charge composition-can lead to significant changes in conformation, even as drastic as a coil (swelled above a purely random coil) to globule (collapsed below a random coil) and vice versa. The theory provides insights on how to control (enhance or suppress) these changes by tuning the temperature (or solution condition) and charge decoration. As an application, we predict the distribution of conformations (at room temperature) of all naturally occurring IDPs in the DisProt database and notice significant size variation even among IDPs with a similar composition of positive and negative charges. Based on this, we provide a new diagram-of-states delineating the sequence-conformation relation for proteins in the DisProt database. Next, we study the effect of post-translational modification, e.g., phosphorylation, on IDP conformations. Modifications as little as two-site phosphorylation can significantly alter the size of an IDP with everything else being constant (temperature, salt concentration, etc.). However, not all possible modification sites have the same effect on protein conformations; there are certain "hot spots" that can cause maximal change in conformation. The location of these "hot spots" in the parent sequence can readily be identified by using a sequence charge decoration metric originally introduced by Sawle and Ghosh. The ability of our model to predict conformations (both expanded and collapsed states) of IDPs at a high

  15. Origins and Evolution of WUSCHEL-Related Homeobox Protein Family in Plant Kingdom

    Directory of Open Access Journals (Sweden)

    Gaibin Lian

    2014-01-01

    Full Text Available WUSCHEL-related homeobox (WOX is a large group of transcription factors specifically found in plants. WOX members contain the conserved homeodomain essential for plant development by regulating cell division and differentiation. However, the evolutionary relationship of WOX members in plant kingdom remains to be elucidated. In this study, we searched 350 WOX members from 50 species in plant kingdom. Linkage analysis of WOX protein sequences demonstrated that amino acid residues 141–145 and 153–160 located in the homeodomain are possibly associated with the function of WOXs during the evolution. These 350 members were grouped into 3 clades: the first clade represents the conservative WOXs from the lower plant algae to higher plants; the second clade has the members from vascular plant species; the third clade has the members only from spermatophyte species. Furthermore, among the members of Arabidopsis thaliana and Oryza sativa, we observed ubiquitous expression of genes in the first clade and the diversified expression pattern of WOX genes in distinct organs in the second clade and the third clade. This work provides insight into the origin and evolutionary process of WOXs, facilitating their functional investigations in the future.

  16. Prediction of host - pathogen protein interactions between Mycobacterium tuberculosis and Homo sapiens using sequence motifs.

    Science.gov (United States)

    Huo, Tong; Liu, Wei; Guo, Yu; Yang, Cheng; Lin, Jianping; Rao, Zihe

    2015-03-26

    Emergence of multiple drug resistant strains of M. tuberculosis (MDR-TB) threatens to derail global efforts aimed at reigning in the pathogen. Co-infections of M. tuberculosis with HIV are difficult to treat. To counter these new challenges, it is essential to study the interactions between M. tuberculosis and the host to learn how these bacteria cause disease. We report a systematic flow to predict the host pathogen interactions (HPIs) between M. tuberculosis and Homo sapiens based on sequence motifs. First, protein sequences were used as initial input for identifying the HPIs by 'interolog' method. HPIs were further filtered by prediction of domain-domain interactions (DDIs). Functional annotations of protein and publicly available experimental results were applied to filter the remaining HPIs. Using such a strategy, 118 pairs of HPIs were identified, which involve 43 proteins from M. tuberculosis and 48 proteins from Homo sapiens. A biological interaction network between M. tuberculosis and Homo sapiens was then constructed using the predicted inter- and intra-species interactions based on the 118 pairs of HPIs. Finally, a web accessible database named PATH (Protein interactions of M. tuberculosis and Human) was constructed to store these predicted interactions and proteins. This interaction network will facilitate the research on host-pathogen protein-protein interactions, and may throw light on how M. tuberculosis interacts with its host.

  17. Lamins of the sea lamprey (Petromyzon marinus) and the evolution of the vertebrate lamin protein family.

    Science.gov (United States)

    Schilf, Paul; Peter, Annette; Hurek, Thomas; Stick, Reimer

    2014-07-01

    Lamin proteins are found in all metazoans. Most non-vertebrate genomes including those of the closest relatives of vertebrates, the cephalochordates and tunicates, encode only a single lamin. In teleosts and tetrapods the number of lamin genes has quadrupled. They can be divided into four sub-types, lmnb1, lmnb2, LIII, and lmna, each characterized by particular features and functional differentiations. Little is known when during vertebrate evolution these features have emerged. Lampreys belong to the Agnatha, the sister group of the Gnathostomata. They split off first within the vertebrate lineage. Analysis of the sea lamprey (Petromyzon marinus) lamin complement presented here, identified three functional lamin genes, one encoding a lamin LIII, indicating that the characteristic gene structure of this subtype had been established prior to the agnathan/gnathostome split. Two other genes encode lamins for which orthology to gnathostome lamins cannot be designated. Search for lamin gene sequences in all vertebrate taxa for which sufficient sequence data are available reveals the evolutionary time frame in which specific features of the vertebrate lamins were established. Structural features characteristic for A-type lamins are not found in the lamprey genome. In contrast, lmna genes are present in all gnathostome lineages suggesting that this gene evolved with the emergence of the gnathostomes. The analysis of lamin gene neighborhoods reveals noticeable similarities between the different vertebrate lamin genes supporting the hypothesis that they emerged due to two rounds of whole genome duplication and makes clear that an orthologous relationship between a particular vertebrate paralog and lamins outside the vertebrate lineage cannot be established. Copyright © 2014 Elsevier GmbH. All rights reserved.

  18. A protein-tyrosine phosphatase with sequence similarity to the SH2 domain of the protein-tyrosine kinases.

    Science.gov (United States)

    Shen, S H; Bastien, L; Posner, B I; Chrétien, P

    1991-08-22

    The phosphorylation of proteins at tyrosine residues is critical in cellular signal transduction, neoplastic transformation and control of the mitotic cycle. These mechanisms are regulated by the activities of both protein-tyrosine kinases (PTKs) and protein-tyrosine phosphatases (PTPases). As in the PTKs, there are two classes of PTPases: membrane associated, receptor-like enzymes and soluble proteins. Here we report the isolation of a complementary DNA clone encoding a new form of soluble PTPase, PTP1C. The enzyme possesses a large noncatalytic region at the N terminus which unexpectedly contains two adjacent copies of the Src homology region 2 (the SH2 domain) found in various nonreceptor PTKs and other cytoplasmic signalling proteins. As with other SH2 sequences, the SH2 domains of PTP1C formed high-affinity complexes with the activated epidermal growth factor receptor and other phosphotyrosine-containing proteins. These results suggest that the SH2 regions in PTP1C may interact with other cellular components to modulate its own phosphatase activity against interacting substrates. PTPase activity may thus directly link growth factor receptors and other signalling proteins through protein-tyrosine phosphorylation.

  19. Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity.

    Science.gov (United States)

    Camproux, A C; Tufféry, P

    2005-08-05

    Understanding and predicting protein structures depend on the complexity and the accuracy of the models used to represent them. We have recently set up a Hidden Markov Model to optimally compress protein three-dimensional conformations into a one-dimensional series of letters of a structural alphabet. Such a model learns simultaneously the shape of representative structural letters describing the local conformation and the logic of their connections, i.e. the transition matrix between the letters. Here, we move one step further and report some evidence that such a model of protein local architecture also captures some accurate amino acid features. All the letters have specific and distinct amino acid distributions. Moreover, we show that words of amino acids can have significant propensities for some letters. Perspectives point towards the prediction of the series of letters describing the structure of a protein from its amino acid sequence.

  20. Comparison of Enzymes / Non-Enzymes Proteins Classification Models Based on 3D, Composition, Sequences and Topological Indices

    OpenAIRE

    Munteanu, Cristian Robert

    2014-01-01

    Comparison of Enzymes / Non-Enzymes Proteins Classification Models Based on 3D, Composition, Sequences and Topological Indices, German Conference on Bioinformatics (GCB), Potsdam, Germany (September, 2007)

  1. Fast and simple protein-alignment-guided assembly of orthologous gene families from microbiome sequencing reads.

    Science.gov (United States)

    Huson, Daniel H; Tappu, Rewati; Bazinet, Adam L; Xie, Chao; Cummings, Michael P; Nieselt, Kay; Williams, Rohan

    2017-01-25

    Microbiome sequencing projects typically collect tens of millions of short reads per sample. Depending on the goals of the project, the short reads can either be subjected to direct sequence analysis or be assembled into longer contigs. The assembly of whole genomes from metagenomic sequencing reads is a very difficult problem. However, for some questions, only specific genes of interest need to be assembled. This is then a gene-centric assembly where the goal is to assemble reads into contigs for a family of orthologous genes. We present a new method for performing gene-centric assembly, called protein-alignment-guided assembly, and provide an implementation in our metagenome analysis tool MEGAN. Genes are assembled on the fly, based on the alignment of all reads against a protein reference database such as NCBI-nr. Specifically, the user selects a gene family based on a classification such as KEGG and all reads binned to that gene family are assembled. Using published synthetic community metagenome sequencing reads and a set of 41 gene families, we show that the performance of this approach compares favorably with that of full-featured assemblers and that of a recently published HMM-based gene-centric assembler, both in terms of the number of reference genes detected and of the percentage of reference sequence covered. Protein-alignment-guided assembly of orthologous gene families complements whole-metagenome assembly in a new and very useful way.

  2. MODexplorer: an integrated tool for exploring protein sequence, structure and function relationships.

    KAUST Repository

    Kosinski, Jan

    2013-02-08

    SUMMARY: MODexplorer is an integrated tool aimed at exploring the sequence, structural and functional diversity in protein families useful in homology modeling and in analyzing protein families in general. It takes as input either the sequence or the structure of a protein and provides alignments with its homologs along with a variety of structural and functional annotations through an interactive interface. The annotations include sequence conservation, similarity scores, ligand-, DNA- and RNA-binding sites, secondary structure, disorder, crystallographic structure resolution and quality scores of models implied by the alignments to the homologs of known structure. MODexplorer can be used to analyze sequence and structural conservation among the structures of similar proteins, to find structures of homologs solved in different conformational state or with different ligands and to transfer functional annotations. Furthermore, if the structure of the query is not known, MODexplorer can be used to select the modeling templates taking all this information into account and to build a comparative model. AVAILABILITY AND IMPLEMENTATION: Freely available on the web at http://modorama.biocomputing.it/modexplorer. Website implemented in HTML and JavaScript with all major browsers supported. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

  3. Cloning, sequencing, and expression of dnaK-operon proteins from the thermophilic bacterium Thermus thermophilus.

    Science.gov (United States)

    Osipiuk, J; Joachimiak, A

    1997-09-12

    We propose that the dnaK operon of Thermus thermophilus HB8 is composed of three functionally linked genes: dnaK, grpE, and dnaJ. The dnaK and dnaJ gene products are most closely related to their cyanobacterial homologs. The DnaK protein sequence places T. thermophilus in the plastid Hsp70 subfamily. In contrast, the grpE translated sequence is most similar to GrpE from Clostridium acetobutylicum, a Gram-positive anaerobic bacterium. A single promoter region, with homology to the Escherichia coli consensus promoter sequences recognized by the sigma70 and sigma32 transcription factors, precedes the postulated operon. This promoter is heat-shock inducible. The dnaK mRNA level increased more than 30 times upon 10 min of heat shock (from 70 degrees C to 85 degrees C). A strong transcription terminating sequence was found between the dnaK and grpE genes. The individual genes were cloned into pET expression vectors and the thermophilic proteins were overproduced at high levels in E. coli and purified to homogeneity. The recombinant T. thermophilus DnaK protein was shown to have a weak ATP-hydrolytic activity, with an optimum at 90 degrees C. The ATPase was stimulated by the presence of GrpE and DnaJ. Another open reading frame, coding for ClpB heat-shock protein, was found downstream of the dnaK operon.

  4. Classification of Dutch and German avian reoviruses by sequencing the sigma-C protein.

    NARCIS (Netherlands)

    Kant, A.; Balk, F.R.M.; Born, L.; Roozelaar, van D.; Heijmans, J.; Gielkens, A.; Huurne, ter A.A.H.M.

    2003-01-01

    We have amplified, cloned and sequenced (part of) the open reading frame of the S1 segment encoding the ¿ C protein of avian reoviruses isolated from chickens with different disease conditions in Germany and The Netherlands during 1980 up to 2000. These avian reoviruses were analysed

  5. MODexplorer: an integrated tool for exploring protein sequence, structure and function relationships.

    KAUST Repository

    Kosinski, Jan; Barbato, Alessandro; Tramontano, Anna

    2013-01-01

    SUMMARY: MODexplorer is an integrated tool aimed at exploring the sequence, structural and functional diversity in protein families useful in homology modeling and in analyzing protein families in general. It takes as input either the sequence or the structure of a protein and provides alignments with its homologs along with a variety of structural and functional annotations through an interactive interface. The annotations include sequence conservation, similarity scores, ligand-, DNA- and RNA-binding sites, secondary structure, disorder, crystallographic structure resolution and quality scores of models implied by the alignments to the homologs of known structure. MODexplorer can be used to analyze sequence and structural conservation among the structures of similar proteins, to find structures of homologs solved in different conformational state or with different ligands and to transfer functional annotations. Furthermore, if the structure of the query is not known, MODexplorer can be used to select the modeling templates taking all this information into account and to build a comparative model. AVAILABILITY AND IMPLEMENTATION: Freely available on the web at http://modorama.biocomputing.it/modexplorer. Website implemented in HTML and JavaScript with all major browsers supported. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

  6. Comment on "Protein sequences from mastodon and Tyrannosaurus rex revealed by mass spectrometry".

    Science.gov (United States)

    Pevzner, Pavel A; Kim, Sangtae; Ng, Julio

    2008-08-22

    Asara et al. (Reports, 13 April 2007, p. 280) reported sequencing of Tyrannosaurus rex proteins and used them to establish the evolutionary relationships between birds and dinosaurs. We argue that the reported T. rex peptides may represent statistical artifacts and call for complete data release to enable experimental and computational verification of their findings.

  7. Monte Carlo simulation of a statistical mechanical model of multiple protein sequence alignment.

    Science.gov (United States)

    Kinjo, Akira R

    2017-01-01

    A grand canonical Monte Carlo (MC) algorithm is presented for studying the lattice gas model (LGM) of multiple protein sequence alignment, which coherently combines long-range interactions and variable-length insertions. MC simulations are used for both parameter optimization of the model and production runs to explore the sequence subspace around a given protein family. In this Note, I describe the details of the MC algorithm as well as some preliminary results of MC simulations with various temperatures and chemical potentials, and compare them with the mean-field approximation. The existence of a two-state transition in the sequence space is suggested for the SH3 domain family, and inappropriateness of the mean-field approximation for the LGM is demonstrated.

  8. On the Power and Limits of Sequence Similarity Based Clustering of Proteins Into Families

    DEFF Research Database (Denmark)

    Wiwie, Christian; Röttger, Richard

    2017-01-01

    Over the last decades, we have observed an ongoing tremendous growth of available sequencing data fueled by the advancements in wet-lab technology. The sequencing information is only the beginning of the actual understanding of how organisms survive and prosper. It is, for instance, equally...... important to also unravel the proteomic repertoire of an organism. A classical computational approach for detecting protein families is a sequence-based similarity calculation coupled with a subsequent cluster analysis. In this work we have intensively analyzed various clustering tools on a large scale. We...... used the data to investigate the behavior of the tools' parameters underlining the diversity of the protein families. Furthermore, we trained regression models for predicting the expected performance of a clustering tool for an unknown data set and aimed to also suggest optimal parameters...

  9. Origin and spread of photosynthesis based upon conserved sequence features in key bacteriochlorophyll biosynthesis proteins.

    Science.gov (United States)

    Gupta, Radhey S

    2012-11-01

    The origin of photosynthesis and how this capability has spread to other bacterial phyla remain important unresolved questions. I describe here a number of conserved signature indels (CSIs) in key proteins involved in bacteriochlorophyll (Bchl) biosynthesis that provide important insights in these regards. The proteins BchL and BchX, which are essential for Bchl biosynthesis, are derived by gene duplication in a common ancestor of all phototrophs. More ancient gene duplication gave rise to the BchX-BchL proteins and the NifH protein of the nitrogenase complex. The sequence alignment of NifH-BchX-BchL proteins contain two CSIs that are uniquely shared by all NifH and BchX homologs, but not by any BchL homologs. These CSIs and phylogenetic analysis of NifH-BchX-BchL protein sequences strongly suggest that the BchX homologs are ancestral to BchL and that the Bchl-based anoxygenic photosynthesis originated prior to the chlorophyll (Chl)-based photosynth