WorldWideScience

Sample records for non-referenced genome assembly

  1. Integrating genome assemblies with MAIA

    NARCIS (Netherlands)

    Nijkamp, J.F.; Winterbach, W.; Van den Broek, M.; Daran, J.M.; Reinders, M.J.T.; De Ridder, D.

    2010-01-01

    De novo assembly of a eukaryotic genome with next-generation sequencing data is still a challenging task. Over the past few years several assemblers have been developed, often suitable for one specific type of sequencing data. The number of known genomes is expanding rapidly, therefore it becomes po

  2. Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2

    OpenAIRE

    Jaffe, David B.; Butler, Jonathan; Gnerre, Sante; Mauceli, Evan; Lindblad-Toh, Kerstin; Mesirov, Jill P.; Zody, Michael C.; Lander, Eric S.

    2003-01-01

    We previously described the whole-genome assembly program Arachne, presenting assemblies of simulated data for small to mid-sized genomes. Here we describe algorithmic adaptations to the program, allowing for assembly of mammalian-size genomes, and also improving the assembly of smaller genomes. Three principal changes were simultaneously made and applied to the assembly of the mouse genome, during a six-month period of development: (1) Supercontigs (scaffolds) were iteratively broken and rej...

  3. Assembly: a resource for assembled genomes at NCBI.

    Science.gov (United States)

    Kitts, Paul A; Church, Deanna M; Thibaud-Nissen, Françoise; Choi, Jinna; Hem, Vichet; Sapojnikov, Victor; Smith, Robert G; Tatusova, Tatiana; Xiang, Charlie; Zherikov, Andrey; DiCuccio, Michael; Murphy, Terence D; Pruitt, Kim D; Kimchi, Avi

    2016-01-04

    The NCBI Assembly database (www.ncbi.nlm.nih.gov/assembly/) provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Consortium (INSDC) and the assembly represented in the NCBI RefSeq project. Users can find assemblies of interest by querying the Assembly Resource directly or by browsing available assemblies for a particular organism. Links in the Assembly Resource allow users to easily download sequence and annotations for current versions of genome assemblies from the NCBI genomes FTP site.
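
    The summary statistics named in this record (sequence count, total sequence length, total gap length, contig N50) can be illustrated with a minimal Python sketch. The function names are hypothetical, and counting every 'N' character as gap is an assumption made for illustration; it is not taken from the NCBI documentation.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembled length."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def assembly_stats(sequences):
    """Summarise a list of sequence strings (contigs or scaffolds).

    Assumption: every 'N' (or 'n') character is treated as gap.
    """
    lengths = [len(s) for s in sequences]
    return {
        "count": len(sequences),
        "total_length": sum(lengths),
        "total_gap_length": sum(s.upper().count("N") for s in sequences),
        "n50": n50(lengths),
    }
```

    For lengths 80, 70, 50, 40, 30 and 20 the total is 290, so the running sum first reaches half (145) at the second-longest sequence and the N50 is 70.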

  4. Quality Assessment of Domesticated Animal Genome Assemblies.

    Science.gov (United States)

    Seemann, Stefan E; Anthon, Christian; Palasca, Oana; Gorodkin, Jan

    2015-01-01

    The era of high-throughput sequencing has made it relatively simple to sequence genomes and transcriptomes of individuals from many species. In order to analyze the resulting sequencing data, high-quality reference genome assemblies are required. However, this is still a major challenge, and many domesticated animal genomes still need to be sequenced deeper in order to produce high-quality assemblies. In the meanwhile, ironically, the extent to which RNAseq and other next-generation data is produced frequently far exceeds that of the genomic sequence. Furthermore, basic comparative analysis is often affected by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNAseq data and genome assembly quality. We rank the genomes by their assembly quality and discuss the implications for genotype analyses.

  5. Genome Sequence Databases (Overview): Sequencing and Assembly

    Energy Technology Data Exchange (ETDEWEB)

    Lapidus, Alla L.

    2009-01-01

    From the date its role in heredity was discovered, DNA has been generating interest among scientists from different fields of knowledge: physicists have studied the three dimensional structure of the DNA molecule, biologists tried to decode the secrets of life hidden within these long molecules, and technologists invent and improve methods of DNA analysis. The analysis of the nucleotide sequence of DNA occupies a special place among the methods developed. Thanks to the variety of sequencing technologies available, the process of decoding the sequence of genomic DNA (or whole genome sequencing) has become robust and inexpensive. Meanwhile the assembly of whole genome sequences remains a challenging task. In addition to the need to assemble millions of DNA fragments of different length (from 35 bp (Solexa) to 800 bp (Sanger)), great interest in analysis of microbial communities (metagenomes) of different complexities raises new problems and pushes some new requirements for sequence assembly tools to the forefront. The genome assembly process can be divided into two steps: draft assembly and assembly improvement (finishing). Despite the fact that automatically performed assembly (or draft assembly) is capable of covering up to 98% of the genome, in most cases, it still contains incorrectly assembled reads. The error rate of the consensus sequence produced at this stage is about 1/2000 bp. A finished genome represents the genome assembly of much higher accuracy (with no gaps or incorrectly assembled areas) and quality (approximately 1 error/10,000 bp), validated through a number of computer and laboratory experiments.

  6. V-GAP: Viral genome assembly pipeline

    KAUST Repository

    Nakamura, Yoji

    2015-10-22

    Next-generation sequencing technologies have allowed the rapid determination of the complete genomes of many organisms. Although it is still difficult to reconstruct shotgun sequences from large-genome organisms into perfect contigs, each of which represents a full chromosome, those from small genomes have been assembled successfully into a very small number of contigs. In this study, we show that shotgun reads from phage genomes can be reconstructed into a single contig by controlling the number of read sequences used in de novo assembly. We have developed a pipeline to assemble small viral genomes with good reliability using a resampling method from shotgun data. This pipeline, named V-GAP (Viral Genome Assembly Pipeline), will contribute to the rapid genome typing of viruses, which are highly divergent, and thus will meet the increasing need for viral genome comparisons in metagenomic studies.
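
    The idea of controlling the number of reads passed to the assembler through resampling can be sketched as follows. This is only an illustration of drawing fixed-size random subsets of reads; it is not the V-GAP implementation, the function name and parameters are hypothetical, and the assembly step itself is left out.

```python
import random


def resample_reads(reads, subset_size, n_subsets, seed=0):
    """Draw n_subsets random subsets of subset_size reads each.

    Reads are sampled without replacement within a subset; different
    subsets are drawn independently. A fixed seed keeps runs repeatable.
    """
    if subset_size > len(reads):
        raise ValueError("subset_size exceeds the number of reads")
    rng = random.Random(seed)
    return [rng.sample(reads, subset_size) for _ in range(n_subsets)]
```

    Each subset would then be assembled separately, and the resulting contigs compared across subsets to judge which read count gives a reliable single-contig result.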

  7. Minimus: a fast, lightweight genome assembler

    Directory of Open Access Journals (Sweden)

    Salzberg Steven L

    2007-02-01

    Background: Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run. Results: We have developed the Minimus assembler to address these issues, and tested it on a range of assembly problems. We show that Minimus performs well on several small assembly tasks, including the assembly of viral genomes, individual genes, and BAC clones. In addition, we evaluate Minimus' performance in assembling bacterial genomes in order to assess its suitability as a component of a larger assembly pipeline. We show that, unlike other software currently used for these tasks, Minimus produces significantly fewer assembly errors, at the cost of generating a more fragmented assembly. Conclusion: We find that for small genomes and other small assembly tasks, Minimus is faster and far more flexible than existing tools. Due to its small size and modular design Minimus is perfectly suited to be a component of complex assembly pipelines. Minimus is released as an open-source software project and the code is available as part of the AMOS project at Sourceforge.

  8. Quality Assessment of Domesticated Animal Genome Assemblies

    DEFF Research Database (Denmark)

    Seemann, Stefan E; Anthon, Christian; Palasca, Oana

    2015-01-01

    ...domesticated animal genomes still need to be sequenced deeper in order to produce high-quality assemblies. In the meanwhile, ironically, the extent to which RNAseq and other next-generation data is produced frequently far exceeds that of the genomic sequence. Furthermore, basic comparative analysis is often affected by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNAseq data...

  9. Specific genomic cues regulate Cajal body assembly.

    Science.gov (United States)

    Sawyer, Iain A; Hager, Gordon L; Dundr, Miroslav

    2016-10-07

    The assembly of specialized sub-nuclear microenvironments known as nuclear bodies (NBs) is important for promoting efficient nuclear function. In particular, the Cajal body (CB), a prominent NB that facilitates spliceosomal snRNP biogenesis, assembles in response to genomic cues. Here, we detail the factors that regulate CB assembly and structural maintenance. These include the importance of transcription at nucleating gene loci, the grouping of these genes on human chromosomes 1, 6 and 17, as well as cell cycle and biochemical regulation of CB protein function. We also speculate on the correlation between CB formation and RNA splicing levels in neurons and cancer. The timing and location of these specific molecular events is critical to CB assembly and its contribution to genome function. However, further work is required to explore the emerging biophysical characteristics of CB assembly and the impact upon subsequent genome reorganization.

  10. Plantagora: modeling whole genome sequencing and assembly of plant genomes.

    Directory of Open Access Journals (Sweden)

    Roger Barthelson

    BACKGROUND: Genomics studies are being revolutionized by the next generation sequencing technologies, which have made whole genome sequencing much more accessible to the average researcher. Whole genome sequencing with the new technologies is a developing art that, despite the large volumes of data that can be produced, may still fail to provide a clear and thorough map of a genome. The Plantagora project was conceived to address specifically the gap between having the technical tools for genome sequencing and knowing precisely the best way to use them. METHODOLOGY/PRINCIPAL FINDINGS: For Plantagora, a platform was created for generating simulated reads from several different plant genomes of different sizes. The resulting read files mimicked either 454 or Illumina reads, with varying paired end spacing. Thousands of datasets of reads were created, most derived from our primary model genome, rice chromosome one. All reads were assembled with different software assemblers, including Newbler, Abyss, and SOAPdenovo, and the resulting assemblies were evaluated by an extensive battery of metrics chosen for these studies. The metrics included both statistics of the assembly sequences and fidelity-related measures derived by alignment of the assemblies to the original genome source for the reads. The results were presented in a website, which includes a data graphing tool, all created to help the user compare rapidly the feasibility and effectiveness of different sequencing and assembly strategies prior to testing an approach in the lab. Some of our own conclusions regarding the different strategies were also recorded on the website. CONCLUSIONS/SIGNIFICANCE: Plantagora provides a substantial body of information for comparing different approaches to sequencing a plant genome, and some conclusions regarding some of the specific approaches. Plantagora also provides a platform of metrics and tools for studying the process of sequencing and assembly.

  11. Assembly of viral genomes from metagenomes

    Directory of Open Access Journals (Sweden)

    Saskia L Smits

    2014-12-01

    Viral infections remain a serious global health issue. Metagenomic approaches are increasingly used in the detection of novel viral pathogens but also to generate complete genomes of uncultivated viruses. In silico identification of complete viral genomes from sequence data would allow rapid phylogenetic characterization of these new viruses. Often, however, complete viral genomes are not recovered, but rather several distinct contigs derived from a single entity, some of which have no sequence homology to any known proteins. De novo assembly of single viruses from a metagenome is challenging, not only because of the lack of a reference genome, but also because of intrapopulation variation and uneven or insufficient coverage. Here we explored different assembly algorithms, remote homology searches, genome-specific sequence motifs, k-mer frequency ranking, and coverage profile binning to detect and obtain viral target genomes from metagenomes. All methods were tested on 454-generated sequencing datasets containing three recently described RNA viruses with a relatively large genome which were divergent to previously known viruses from the viral families Rhabdoviridae and Coronaviridae. Depending on specific characteristics of the target virus and the metagenomic community, different assembly and in silico gap closure strategies were successful in obtaining near complete viral genomes.

  12. Enabling Graph Appliance for Genome Assembly

    Energy Technology Data Exchange (ETDEWEB)

    Singh, Rina [ORNL; Graves, Jeffrey A [ORNL; Lee, Sangkeun (Matt) [ORNL; Sukumar, Sreenivas R [ORNL; Shankar, Mallikarjun [ORNL

    2015-01-01

    In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotides, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers developed have used de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multi-threaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray's Urika-GD system; Urika-GD is a high performance graph appliance with a large shared memory and massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing de Bruijn graphs as RDF graphs and propose an iterative querying approach for finding Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real-world Ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.
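
    The two ingredients named in this record, a de Bruijn graph built from reads and an Eulerian path through it, can be sketched in plain Python. This is an in-memory illustration using Hierholzer's algorithm, not the RDF/SPARQL approach the authors describe; the function names are hypothetical, and the sketch assumes error-free reads whose graph actually contains an Eulerian path.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: each distinct k-mer is an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for node in targets:
            in_deg[node] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = min(nodes)
    for node in sorted(nodes):
        if len(graph.get(node, [])) - in_deg.get(node, 0) == 1:
            start = node
            break
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

    With the reads "ACGTG" and "GTGCA" and k = 3, the graph is a single chain of five edges and the spelled path is "ACGTGCA". Real data adds sequencing errors and repeats, which is where the memory and scalability problems discussed in the abstract arise.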

  13. De novo assembly of a haplotype-resolved human genome.

    Science.gov (United States)

    Cao, Hongzhi; Wu, Honglong; Luo, Ruibang; Huang, Shujia; Sun, Yuhui; Tong, Xin; Xie, Yinlong; Liu, Binghang; Yang, Hailong; Zheng, Hancheng; Li, Jian; Li, Bo; Wang, Yu; Yang, Fang; Sun, Peng; Liu, Siyang; Gao, Peng; Huang, Haodong; Sun, Jing; Chen, Dan; He, Guangzhu; Huang, Weihua; Huang, Zheng; Li, Yue; Tellier, Laurent C A M; Liu, Xiao; Feng, Qiang; Xu, Xun; Zhang, Xiuqing; Bolund, Lars; Krogh, Anders; Kristiansen, Karsten; Drmanac, Radoje; Drmanac, Snezana; Nielsen, Rasmus; Li, Songgang; Wang, Jian; Yang, Huanming; Li, Yingrui; Wong, Gane Ka-Shu; Wang, Jun

    2015-06-01

    The human genome is diploid, and knowledge of the variants on each chromosome is important for the interpretation of genomic information. Here we report the assembly of a haplotype-resolved diploid genome without using a reference genome. Our pipeline relies on fosmid pooling together with whole-genome shotgun strategies, based solely on next-generation sequencing and hierarchical assembly methods. We applied our sequencing method to the genome of an Asian individual and generated a 5.15-Gb assembled genome with a haplotype N50 of 484 kb. Our analysis identified previously undetected indels and 7.49 Mb of novel coding sequences that could not be aligned to the human reference genome, which include at least six predicted genes. This haplotype-resolved genome represents the most complete de novo human genome assembly to date. Application of our approach to identify individual haplotype differences should aid in translating genotypes to phenotypes for the development of personalized medicine.

  14. De novo assembly of a haplotype-resolved human genome

    DEFF Research Database (Denmark)

    Cao, Hongzhi; Wu, Honglong; Luo, Ruibang

    2015-01-01

    The human genome is diploid, and knowledge of the variants on each chromosome is important for the interpretation of genomic information. Here we report the assembly of a haplotype-resolved diploid genome without using a reference genome. Our pipeline relies on fosmid pooling together with whole-genome shotgun strategies, based solely on next-generation sequencing and hierarchical assembly methods. We applied our sequencing method to the genome of an Asian individual and generated a 5.15-Gb assembled genome with a haplotype N50 of 484 kb. Our analysis identified previously undetected indels and 7.49 Mb of novel coding sequences that could not be aligned to the human reference genome, which include at least six predicted genes. This haplotype-resolved genome represents the most complete de novo human genome assembly to date. Application of our approach to identify individual haplotype differences should...

  15. Genetic variation and the de novo assembly of human genomes.

    Science.gov (United States)

    Chaisson, Mark J P; Wilson, Richard K; Eichler, Evan E

    2015-11-01

    The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.

  16. Effective de novo assembly of fish genome using haploid larvae.

    Science.gov (United States)

    Iwasaki, Yuki; Nishiki, Issei; Nakamura, Yoji; Yasuike, Motoshige; Kai, Wataru; Nomura, Kazuharu; Yoshida, Kazunori; Nomura, Yousuke; Fujiwara, Atushi; Kobayashi, Takanori; Ototake, Mitsuru

    2016-02-01

    Recent improvements in next-generation sequencing technology have made it possible to do whole genome sequencing, even on non-model eukaryote species with no available reference genomes. However, de novo assembly of diploid genomes is still a big challenge because of allelic variation. The aim of this study was to determine the feasibility of utilizing the genome of haploid fish larvae for de novo assembly of whole-genome sequences. We compared the efficiency of assembly using the haploid genome of yellowtail (Seriola quinqueradiata) with that using the diploid genome obtained from the dam. De novo assembly from the haploid and the diploid sequence reads (100 million reads per dataset) generated by the Ion Proton sequencer (200 bp) was done under two different assembly algorithms, namely overlap-layout-consensus (OLC) and de Bruijn graph (DBG). This revealed that the assembly of the haploid genome significantly reduced (approximately 22% for OLC, 9% for DBG) the total number of contigs (with longer average and N50 contig lengths) when compared to the diploid genome assembly. The haploid assembly also improved the quality of the scaffolds by reducing the number of regions with unassigned nucleotides (Ns) (total length of Ns; 45,331,916 bp for haploids and 67,724,360 bp for diploids) in OLC-based assemblies. It appears clear that the haploid genome assembly is better because the allelic variation in the diploid genome disrupts the extension of contigs during the assembly process. Our results indicate that utilizing the genome of haploid larvae leads to a significant improvement in the de novo assembly process, thus providing a novel strategy for the construction of reference genomes from non-model diploid organisms such as fish.

  17. Long-read sequence assembly of the gorilla genome

    Science.gov (United States)

    Gordon, David; Huddleston, John; Chaisson, Mark J. P.; Hill, Christopher M.; Kronenberg, Zev N.; Munson, Katherine M.; Malig, Maika; Raja, Archana; Fiddes, Ian; Hillier, LaDeana W.; Dunn, Christopher; Baker, Carl; Armstrong, Joel; Diekhans, Mark; Paten, Benedict; Shendure, Jay; Wilson, Richard K.; Haussler, David; Chin, Chen-Shan; Eichler, Evan E.

    2016-01-01

    Accurate sequence and assembly of genomes is a critical first step for studies of genetic variation. We generated a high-quality assembly of the gorilla genome using single-molecule, real-time sequence technology and a string graph de novo assembly algorithm. The new assembly improves contiguity by two to three orders of magnitude with respect to previously released assemblies, recovering 87% of missing reference exons and incomplete gene models. Although regions of large, high-identity segmental duplications remain largely unresolved, this comprehensive assembly provides new biological insight into genetic diversity, structural variation, gene loss, and representation of repeat structures within the gorilla genome. The approach provides a path forward for the routine assembly of mammalian genomes at a level approaching that of the current quality of the human genome. PMID:27034376

  18. Whole-genome sequencing for comparative genomics and de novo genome assembly.

    Science.gov (United States)

    Benjak, Andrej; Sala, Claudia; Hartkoorn, Ruben C

    2015-01-01

    Next-generation sequencing technologies for whole-genome sequencing of mycobacteria are rapidly becoming an attractive alternative to more traditional sequencing methods. In particular this technology is proving useful for genome-wide identification of mutations in mycobacteria (comparative genomics) as well as for de novo assembly of whole genomes. Next-generation sequencing however generates a vast quantity of data that can only be transformed into a usable and comprehensible form using bioinformatics. Here we describe the methodology one would use to prepare libraries for whole-genome sequencing, and the basic bioinformatics to identify mutations in a genome following Illumina HiSeq or MiSeq sequencing, as well as de novo genome assembly following sequencing using Pacific Biosciences (PacBio).

  19. Next-generation sequencing and large genome assemblies

    OpenAIRE

    Henson, Joseph; Tischler, German; Ning, Zemin

    2012-01-01

    The next-generation sequencing (NGS) revolution has drastically reduced time and cost requirements for sequencing of large genomes, and also qualitatively changed the problem of assembly. This article reviews the state of the art in de novo genome assembly, paying particular attention to mammalian-sized genomes. The strengths and weaknesses of the main sequencing platforms are highlighted, leading to a discussion of assembly and the new challenges associated with NGS data. Current approaches ...

  20. The A, C, G, and T of Genome Assembly

    Directory of Open Access Journals (Sweden)

    Bilal Wajid

    2016-01-01

    Genome assembly in its two decades of history has produced significant research, in terms of both biotechnology and computational biology. This contribution delineates sequencing platforms and their characteristics, examines key steps involved in filtering and processing raw data, explains assembly frameworks, and discusses quality statistics for the assessment of the assembled sequence. Furthermore, the paper explores recent Ubuntu-based software environments oriented towards genome assembly as well as some avenues for future research.

  1. GRAbB: Selective Assembly of Genomic Regions, a New Niche for Genomic Research

    NARCIS (Netherlands)

    Brankovics, Balázs; Zhang, Hao; van Diepeningen, Anne D; van der Lee, Theo A J; Waalwijk, Cees; de Hoog, G Sybren

    2016-01-01

    GRAbB (Genomic Region Assembly by Baiting) is a new program that is dedicated to assemble specific genomic regions from NGS data. This approach is especially useful when dealing with multi copy regions, such as mitochondrial genome and the rDNA repeat region, parts of the genome that are often negle

  2. Minimal absent words in four human genome assemblies.

    Directory of Open Access Journals (Sweden)

    Sara P Garcia

    Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we aim to contribute to the catalogue of human genomic variation by investigating the variation in number and content of minimal absent words within a species, using four human genome assemblies. We compare the reference human genome GRCh37 assembly, the HuRef assembly of the genome of Craig Venter, the NA12878 assembly from cell line GM12878, and the YH assembly of the genome of a Han Chinese individual. We find the variation in number and content of minimal absent words between assemblies more significant for large and very large minimal absent words, where the biases of sequencing and assembly methodologies become more pronounced. Moreover, we find generally greater similarity between the human genome assemblies sequenced with capillary-based technologies (GRCh37 and HuRef) than between the human genome assemblies sequenced with massively parallel technologies (NA12878 and YH). Finally, as expected, we find the overall variation in number and content of minimal absent words within a species to be generally smaller than the variation between species.

  3. Minimal absent words in four human genome assemblies.

    Science.gov (United States)

    Garcia, Sara P; Pinho, Armando J

    2011-01-01

    Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we aim to contribute to the catalogue of human genomic variation by investigating the variation in number and content of minimal absent words within a species, using four human genome assemblies. We compare the reference human genome GRCh37 assembly, the HuRef assembly of the genome of Craig Venter, the NA12878 assembly from cell line GM12878, and the YH assembly of the genome of a Han Chinese individual. We find the variation in number and content of minimal absent words between assemblies more significant for large and very large minimal absent words, where the biases of sequencing and assembly methodologies become more pronounced. Moreover, we find generally greater similarity between the human genome assemblies sequenced with capillary-based technologies (GRCh37 and HuRef) than between the human genome assemblies sequenced with massively parallel technologies (NA12878 and YH). Finally, as expected, we find the overall variation in number and content of minimal absent words within a species to be generally smaller than the variation between species.
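
    A minimal absent word is a word that does not occur in a sequence although all of its proper factors do. The brute-force Python sketch below illustrates the definition on short strings only; it is not the method used in the study, the function name and the max_len cut-off are hypothetical, and genome-scale work needs suffix-based data structures instead.

```python
def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Return the sorted minimal absent words of seq, lengths 2..max_len.

    A word w is reported when w is absent from seq while both w[:-1]
    and w[1:] occur in seq; every proper factor of w then occurs too.
    """
    n = len(seq)
    # All substrings of seq up to max_len characters long.
    factors = {
        seq[i:j]
        for i in range(n)
        for j in range(i + 1, min(n, i + max_len) + 1)
    }
    found = set()
    for left in factors:
        if len(left) >= max_len:
            continue
        for letter in alphabet:
            word = left + letter
            if word not in factors and word[1:] in factors:
                found.add(word)
    return sorted(found)
```

    For the toy sequence "AAC" over the alphabet "AC" with max_len = 3, the result is "AAA", "CA" and "CC". "ACA" is absent as well but is not minimal, because its factor "CA" is already absent.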

  4. Assembly complexity of prokaryotic genomes using short reads

    Directory of Open Access Journals (Sweden)

    Pop Mihai

    2010-01-01

    Background: De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. Results: We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper-bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot. The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). Conclusions: Our results improve upon previous studies on the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed.

  5. Programming biological operating systems: genome design, assembly and activation.

    Science.gov (United States)

    Gibson, Daniel G

    2014-05-01

    The DNA technologies developed over the past 20 years for reading and writing the genetic code converged when the first synthetic cell was created 4 years ago. An outcome of this work has been an extraordinary set of tools for synthesizing, assembling, engineering and transplanting whole bacterial genomes. Technical progress, options and applications for bacterial genome design, assembly and activation are discussed.

  6. Next-generation sequencing and large genome assemblies.

    Science.gov (United States)

    Henson, Joseph; Tischler, German; Ning, Zemin

    2012-06-01

    The next-generation sequencing (NGS) revolution has drastically reduced time and cost requirements for sequencing of large genomes, and also qualitatively changed the problem of assembly. This article reviews the state of the art in de novo genome assembly, paying particular attention to mammalian-sized genomes. The strengths and weaknesses of the main sequencing platforms are highlighted, leading to a discussion of assembly and the new challenges associated with NGS data. Current approaches to assembly are outlined and the various software packages available are introduced and compared. The question of whether quality assemblies can be produced using short-read NGS data alone, or whether it must be combined with more expensive sequencing techniques, is considered. Prospects for future assemblers and tests of assembly performance are also discussed.

  7. Mind the gap; seven reasons to close fragmented genome assemblies.

    Science.gov (United States)

    Thomma, Bart P H J; Seidl, Michael F; Shi-Kunne, Xiaoqian; Cook, David E; Bolton, Melvin D; van Kan, Jan A L; Faino, Luigi

    2016-05-01

    Like other domains of life, research into the biology of filamentous microbes has greatly benefited from the advent of whole-genome sequencing. Next-generation sequencing (NGS) technologies have revolutionized sequencing, making genomic sciences accessible to many academic laboratories including those that study non-model organisms. Thus, hundreds of fungal genomes have been sequenced and are publically available today, although these initiatives have typically yielded considerably fragmented genome assemblies that often lack large contiguous genomic regions. Many important genomic features are contained in intergenic DNA that is often missing in current genome assemblies, and recent studies underscore the significance of non-coding regions and repetitive elements for the life style, adaptability and evolution of many organisms. The study of particular types of genetic elements, such as telomeres, centromeres, repetitive elements, effectors, and clusters of co-regulated genes, but also of phenomena such as structural rearrangements, genome compartmentalization and epigenetics, greatly benefits from having a contiguous and high-quality, preferably even complete and gapless, genome assembly. Here we discuss a number of important reasons to produce gapless, finished, genome assemblies to help answer important biological questions.

  8. AutoAssemblyD: a graphical user interface system for several genome assemblers.

    Science.gov (United States)

    Veras, Adonney Allan de Oliveira; de Sá, Pablo Henrique Caracciolo Gomes; Azevedo, Vasco; Silva, Artur; Ramos, Rommel Thiago Jucá

    2013-01-01

    Next-generation sequencing technologies have increased the amount of biological data generated. Thus, bioinformatics has become important because new methods and algorithms are necessary to manipulate and process such data. However, certain challenges have emerged, such as genome assembly using short reads and high-throughput platforms. In this context, several algorithms have been developed, such as Velvet, Abyss, Euler-SR, Mira, Edna, Maq, SHRiMP, Newbler, ALLPATHS, Bowtie and BWA. However, most such assemblers do not have a graphical interface, which makes their use difficult for users without computing experience given the complexity of the assembler syntax. Thus, to make the operation of such assemblers accessible to users without a computing background, we developed AutoAssemblyD, which is a graphical tool for genome assembly submission and remote management by multiple assemblers through XML templates. AssemblyD is freely available at https://sourceforge.net/projects/autoassemblyd. It requires Sun jdk 6 or higher.

  9. Why Assembling Plant Genome Sequences Is So Challenging

    Directory of Open Access Journals (Sweden)

    Pedro Seoane

    2012-09-01

    Full Text Available In spite of the biological and economic importance of plants, relatively few plant species have been sequenced. Only the genome sequence of plants with relatively small genomes, most of them angiosperms, in particular eudicots, has been determined. The arrival of next-generation sequencing technologies has allowed the rapid and efficient development of new genomic resources for non-model or orphan plant species. But the sequencing pace of plants is far from that of animals and microorganisms. This review focuses on the typical challenges of plant genomes that can explain why plant genomics is less developed than animal genomics. Explanations about the impact of some confounding factors emerging from the nature of plant genomes are given. As a result of these challenges and confounding factors, the correct assembly and annotation of plant genomes is hindered, genome drafts are produced, and advances in plant genomics are delayed.

  10. Why Assembling Plant Genome Sequences Is So Challenging

    Science.gov (United States)

    Claros, Manuel Gonzalo; Bautista, Rocío; Guerrero-Fernández, Darío; Benzerki, Hicham; Seoane, Pedro; Fernández-Pozo, Noé

    2012-01-01

    In spite of the biological and economic importance of plants, relatively few plant species have been sequenced. Only the genome sequence of plants with relatively small genomes, most of them angiosperms, in particular eudicots, has been determined. The arrival of next-generation sequencing technologies has allowed the rapid and efficient development of new genomic resources for non-model or orphan plant species. But the sequencing pace of plants is far from that of animals and microorganisms. This review focuses on the typical challenges of plant genomes that can explain why plant genomics is less developed than animal genomics. Explanations about the impact of some confounding factors emerging from the nature of plant genomes are given. As a result of these challenges and confounding factors, the correct assembly and annotation of plant genomes is hindered, genome drafts are produced, and advances in plant genomics are delayed. PMID:24832233

  11. An Improved Genome Assembly of Azadirachta indica A. Juss.

    Directory of Open Access Journals (Sweden)

    Neeraja M. Krishnan

    2016-07-01

    Full Text Available Neem (Azadirachta indica A. Juss.), an evergreen tree of the Meliaceae family, is known for its medicinal, cosmetic, pesticidal and insecticidal properties. We had previously sequenced and published the draft genome of a neem plant, using mainly short read sequencing data. In this report, we present an improved genome assembly generated using additional short reads from Illumina and long reads from Pacific Biosciences SMRT sequencer. We assembled short reads and error-corrected long reads using Platanus, an assembler designed to perform well for heterozygous genomes. The updated genome assembly (v2.0) yielded 3- and 3.5-fold increase in N50 and N75, respectively; 2.6-fold decrease in the total number of scaffolds; 1.25-fold increase in the number of valid transcriptome alignments; 13.4-fold less misassembly and 1.85-fold increase in the percentage repeat, over the earlier assembly (v1.0). The current assembly also maps better to the genes known to be involved in the terpenoid biosynthesis pathway. Together, the data represent an improved assembly of the A. indica genome.
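    For readers unfamiliar with the N50 and N75 contiguity metrics quoted above, a minimal sketch of the usual definition follows. The function name and the scaffold lengths are hypothetical and are not taken from the neem assembly.

```python
def nx(lengths, fraction=0.5):
    """Return the length L such that sequences of length >= L together
    cover at least `fraction` of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    target = sum(lengths) * fraction
    running = 0
    # Walk from the longest sequence down until the target share is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
scaffolds = [100, 80, 60, 40, 20]
n50 = nx(scaffolds, 0.5)
n75 = nx(scaffolds, 0.75)
```

    Because N75 requires covering a larger share of the assembly, it can never exceed N50 for the same set of sequences.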

  12. SparseAssembler2: Sparse k-mer Graph for Memory Efficient Genome Assembly

    CERN Document Server

    Ye, Chengxi; Ma, Zhanshan Sam; Yu, Douglas W; Pop, Mihai

    2011-01-01

    Motivation: To tackle the problem of huge memory usage associated with de Bruijn graph-based algorithms, upon which some of the most widely used de novo genome assemblers have been built, we released SparseAssembler1. SparseAssembler1 can save as much as 90% memory consumption in comparison with the state-of-art assemblers, but it requires rounds of denoising to accurately assemble genomes. In this paper, we introduce a new general model for genome assembly that uses only sparse k-mers. The new model replaces the idea of the de Bruijn graph from the beginning, and achieves similar memory efficiency and much better robustness compared with our previous SparseAssembler1. Results: Based on the sparse k-mers graph model, we develop SparseAssembler2. We demonstrate that the decomposition of reads of all overlapping k-mers, which is used in existing de Bruijn graph genome assemblers, is overly cautious. We introduce a sparse k-mer graph structure for saving sparse k-mers, which greatly reduces memory space requirem...

  13. IVA: accurate de novo assembly of RNA virus genomes.

    Science.gov (United States)

    Hunt, Martin; Gall, Astrid; Ong, Swee Hoe; Brener, Jacqui; Ferns, Bridget; Goulder, Philip; Nastouli, Eleni; Keane, Jacqueline A; Kellam, Paul; Otto, Thomas D

    2015-07-15

    An accurate genome assembly from short read sequencing data is critical for downstream analysis, for example allowing investigation of variants within a sequenced population. However, assembling sequencing data from virus samples, especially RNA viruses, into a genome sequence is challenging due to the combination of viral population diversity and extremely uneven read depth caused by amplification bias in the inevitable reverse transcription and polymerase chain reaction amplification process of current methods. We developed a new de novo assembler called IVA (Iterative Virus Assembler) designed specifically for read pairs sequenced at highly variable depth from RNA virus samples. We tested IVA on datasets from 140 sequenced samples from human immunodeficiency virus-1 or influenza-virus-infected people and demonstrated that IVA outperforms all other virus de novo assemblers. The software runs under Linux, has the GPLv3 licence and is freely available from http://sanger-pathogens.github.io/iva © The Author 2015. Published by Oxford University Press.

  14. Assembly of viral genomes from metagenomes

    NARCIS (Netherlands)

    S.L. Smits (Saskia); R. Bodewes (Rogier); A. Ruiz-Gonzalez (Aritz); V. Baumgärtner (Volkmar); M.P.G. Koopmans D.V.M. (Marion); A.D.M.E. Osterhaus (Albert); A. Schürch (Anita)

    2014-01-01

    Viral infections remain a serious global health issue. Metagenomic approaches are increasingly used in the detection of novel viral pathogens but also to generate complete genomes of uncultivated viruses. In silico identification of complete viral genomes from sequence data would allow r

  15. Oxford Nanopore MinION Sequencing and Genome Assembly

    Institute of Scientific and Technical Information of China (English)

    Hengyun Lu; Francesca Giordano; Zemin Ning

    2016-01-01

    The revolution of genome sequencing is continuing after the successful second-generation sequencing (SGS) technology. The third-generation sequencing (TGS) technology, led by Pacific Biosciences (PacBio), is progressing rapidly, moving from a technology once only capable of providing data for small genome analysis, or for performing targeted screening, to one that promises high quality de novo assembly and structural variation detection for human-sized genomes. In 2014, the MinION, the first commercial sequencer using nanopore technology, was released by Oxford Nanopore Technologies (ONT). MinION identifies DNA bases by measuring the changes in electrical conductivity generated as DNA strands pass through a biological pore. Its portability, affordability, and speed in data production make it suitable for real-time applications; the release of the long-read sequencer MinION has thus generated much excitement and interest in the genomics community. While de novo genome assemblies can be cheaply produced from SGS data, assembly continuity is often relatively poor, due to the limited ability of short reads to handle long repeats. Assembly quality can be greatly improved by using TGS long reads, since longer reads can more easily span repetitive regions, despite having higher error rates at the base level. The potential of nanopore sequencing has been demonstrated by various studies in genome surveillance at locations where rapid and reliable sequencing is needed, but where resources are limited.

  16. PERTRAN: Genome-guided RNA-seq Read Assembler

    Energy Technology Data Exchange (ETDEWEB)

    Shu, Shengqiang; Goodstein, David; Rokhsar, Dan

    2013-10-28

    As short RNA-seq reads become a standard, affordable input to any genome annotation project, a sensitive and accurate transcript assembler is an essential part of any gene prediction system. PERTRAN is a pipeline for assembling transcripts from RNA-seq reads which demonstrates higher sensitivity, with fewer fused exons (in most cases), and faster run times compared with TOPHAT/CUFFLINKS and genome-guided Trinity. PERTRAN shows slightly lower specificity with increased gene fusions in some cases, discussed below. SAM files generated from PERTRAN can be used to compute expression levels with cuffdiff, and the result is comparable to that from TOPHAT.

  17. De novo assembly and phasing of a Korean human genome.

    Science.gov (United States)

    Seo, Jeong-Sun; Rhie, Arang; Kim, Junsoo; Lee, Sangjin; Sohn, Min-Hwan; Kim, Chang-Uk; Hastie, Alex; Cao, Han; Yun, Ji-Young; Kim, Jihye; Kuk, Junho; Park, Gun Hwa; Kim, Juhyeok; Ryu, Hanna; Kim, Jongbum; Roh, Mira; Baek, Jeonghun; Hunkapiller, Michael W; Korlach, Jonas; Shin, Jong-Yeon; Kim, Changhoon

    2016-10-13

    Advances in genome assembly and phasing provide an opportunity to investigate the diploid architecture of the human genome and reveal the full range of structural variation across population groups. Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 (ref. 1) using single-molecule real-time sequencing, next-generation mapping, microfluidics-based linked reads, and bacterial artificial chromosome (BAC) sequencing approaches. Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly. We identify 18,210 structural variants by direct comparison of the assembly with the human reference, identifying thousands of breakpoints that, to our knowledge, have not been reported before. Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatibility complex region as well as demonstrating allele configuration in clinically relevant genes such as CYP2D6. This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of

  18. A new chicken genome assembly provides insight into avian genome structure.

    Science.gov (United States)

    The importance of the Gallus gallus (chicken) as a model organism and agricultural animal merits a continuation of sequence assembly improvement efforts. We present a new version of the chicken genome assembly (Gallus_gallus-5.0; GCA_000002315.3) built from combined long single molecule sequencing t...

  19. Genome Assembly and Computational Analysis Pipelines for Bacterial Pathogens

    KAUST Repository

    Rangkuti, Farania Gama Ardhina

    2011-06-01

    Pathogens lie behind the deadliest pandemics in history. To date, the AIDS pandemic has resulted in more than 25 million fatal cases, while tuberculosis and malaria annually claim more than 2 million lives. Comparative genomic analyses are needed to gain insights into the molecular mechanisms of pathogens, but the abundance of biological data dictates that such studies cannot be performed without the assistance of computational approaches. This explains the significant need for computational pipelines for genome assembly and analyses. The aim of this research is to develop such pipelines. This work utilizes various bioinformatics approaches to analyze the high-throughput genomic sequence data that has been obtained from several strains of bacterial pathogens. A pipeline has been compiled for quality control for sequencing and assembly, and several protocols have been developed to detect contaminations. Visualization has been generated of genomic data in various formats, in addition to alignment, homology detection and sequence variant detection. We have also implemented a metaheuristic algorithm that significantly improves bacterial genome assemblies compared to other known methods. Experiments on Mycobacterium tuberculosis H37Rv data showed that our method resulted in improvement of N50 value of up to 9697% while consistently maintaining high accuracy, covering around 98% of the published reference genome. Other improvement efforts were also implemented, consisting of iterative local assemblies and iterative correction of contiguated bases. Our result expedites the genomic analysis of virulent genes up to single base pair resolution. It is also applicable to virtually every pathogenic microorganism, propelling further research in the control of and protection from pathogen-associated diseases.

  20. Genomes correction and assembling: present methods and tools

    Science.gov (United States)

    Wojcieszek, Michał; Pawełkowicz, Magdalena; Nowak, Robert; Przybecki, Zbigniew

    2014-11-01

    The recent rapid development of next generation sequencing (NGS) technologies has had a significant impact on genomics, enabling many de novo sequencing projects for new species that were previously constrained by technological costs. The advancement of NGS created a need to adjust assembly programs. New algorithms must cope with massive amounts of data within reasonable time limits, and processing power and hardware are also important factors. In this paper, we address the assembly pipeline for de novo genome assembly provided by programs presently available to scientists as both commercial and open-source software. The implementation of four different approaches (Greedy, Overlap-Layout-Consensus (OLC), De Bruijn and Integrated), which results in variation in performance, is the main focus of our discussion, with additional insight into the correction of short and long reads.

  1. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads.

    Science.gov (United States)

    Kajitani, Rei; Toshimoto, Kouta; Noguchi, Hideki; Toyoda, Atsushi; Ogura, Yoshitoshi; Okuno, Miki; Yabana, Mitsuru; Harada, Masayuki; Nagayasu, Eiji; Maruyama, Haruhiko; Kohara, Yuji; Fujiyama, Asao; Hayashi, Tetsuya; Itoh, Takehiko

    2014-08-01

    Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based hierarchical sequencing methods are used to overcome such problems. However, these methods are costly and time consuming, forfeiting the advantages of massive parallel sequencing. Here, we describe a novel de novo assembler, Platanus, that can effectively manage high-throughput data from heterozygous samples. Platanus assembles DNA fragments (reads) into contigs by constructing de Bruijn graphs with automatically optimized k-mer sizes followed by the scaffolding of contigs based on paired-end information. The complicated graph structures that result from the heterozygosity are simplified during not only the contig assembly step but also the scaffolding step. We evaluated the assembly results on eukaryotic samples with various levels of heterozygosity. Compared with other assemblers, Platanus yields assembly results that have a larger scaffold NG50 length without any accompanying loss of accuracy in both simulated and real data. In addition, Platanus recorded the largest scaffold NG50 values for two of the three low-heterozygosity species used in the de novo assembly contest, Assemblathon 2. Platanus therefore provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity.

  2. SWAP-Assembler 2: Optimization of De Novo Genome Assembler at Large Scale

    Energy Technology Data Exchange (ETDEWEB)

    Meng, Jintao; Seo, Sangmin; Balaji, Pavan; Wei, Yanjie; Wang, Bingqiang; Feng, Shengzhong

    2016-08-16

    In this paper, we analyze and optimize the most time-consuming steps of the SWAP-Assembler, a parallel genome assembler, so that it can scale to a large number of cores for huge genomes with the size of sequencing data ranging from terabytes to petabytes. According to the performance analysis results, the most time-consuming steps are input parallelization, k-mer graph construction, and graph simplification (edge merging). For the input parallelization, the input data is divided into virtual fragments with nearly equal size, and the start position and end position of each fragment are automatically separated at the beginning of the reads. In k-mer graph construction, in order to improve the communication efficiency, the message size is kept constant between any two processes by proportionally increasing the number of nucleotides to the number of processes in the input parallelization step for each round. The memory usage is also decreased because only a small part of the input data is processed in each round. With graph simplification, the communication protocol reduces the number of communication loops from four to two loops and decreases the idle communication time. The optimized assembler is denoted as SWAP-Assembler 2 (SWAP2). In our experiments using a 1000 Genomes project dataset of 4 terabytes (the largest dataset ever used for assembling) on the supercomputer Mira, the results show that SWAP2 scales to 131,072 cores with an efficiency of 40%. We also compared our work with both the HipMer assembler and the SWAP-Assembler. On the Yanhuang dataset of 300 gigabytes, SWAP2 shows a 3X speedup and 4X better scalability compared with the HipMer assembler and is 45 times faster than the SWAP-Assembler. The SWAP2 software is available at https://sourceforge.net/projects/swapassembler.
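    The input-parallelization step described above (near-equal fragments whose boundaries fall at the beginning of reads) can be sketched as follows. This is a hypothetical, single-process illustration on FASTA-like bytes, not SWAP2's implementation; the function name and the choice of a newline followed by ">" as the record marker are assumptions.

```python
def split_at_read_starts(data: bytes, n_parts: int, marker: bytes = b"\n>"):
    """Split FASTA-like bytes into at most n_parts chunks of roughly equal size,
    moving each tentative boundary forward to the next record start."""
    size = len(data)
    cuts = [0]
    for i in range(1, n_parts):
        guess = i * size // n_parts
        # Search from guess - 1 so a guess that already sits on '>' is kept.
        hit = data.find(marker, max(guess - 1, 0))
        cut = size if hit == -1 else hit + 1  # +1 skips the newline before '>'
        if cut > cuts[-1]:
            cuts.append(cut)
    if cuts[-1] != size:
        cuts.append(size)
    return [data[a:b] for a, b in zip(cuts, cuts[1:])]


# Hypothetical four-record input, for illustration only.
data = b">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n>r4\nCCGG\n"
chunks = split_at_read_starts(data, n_parts=2)
```

    Each chunk can then be handed to a separate worker without any read being cut in half. FASTQ input would need a stricter boundary test, since "@" may also appear in quality strings.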

  3. Assembly-driven community genomics of a hypersaline microbial ecosystem.

    Directory of Open Access Journals (Sweden)

    Sheila Podell

    Full Text Available Microbial populations inhabiting a natural hypersaline lake ecosystem in Lake Tyrrell, Victoria, Australia, have been characterized using deep metagenomic sampling, iterative de novo assembly, and multidimensional phylogenetic binning. Composite genomes representing habitat-specific microbial populations were reconstructed for eleven different archaea and one bacterium, comprising between 0.6 and 14.1% of the planktonic community. Eight of the eleven archaeal genomes were from microbial species without previously cultured representatives. These new genomes provide habitat-specific reference sequences enabling detailed, lineage-specific compartmentalization of predicted functional capabilities and cellular properties associated with both dominant and less abundant community members, including organisms previously known only by their 16S rRNA sequences. Together, these data provide a comprehensive, culture-independent genomic blueprint for ecosystem-wide analysis of protein functions, population structure, and lifestyles of co-existing, co-evolving microbial groups within the same natural habitat. The "assembly-driven" community genomic approach demonstrated in this study advances our ability to push beyond single gene investigations, and promotes genome-scale reconstructions as a tangible goal in the quest to define the metabolic, ecological, and evolutionary dynamics that underpin environmental microbial diversity.

  4. Assembly, Annotation, and Analysis of Multiple Mycorrhizal Fungal Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Initiative Consortium, Mycorrhizal Genomics; Kuo, Alan; Grigoriev, Igor; Kohler, Annegret; Martin, Francis

    2013-03-08

    Mycorrhizal fungi play critical roles in host plant health, soil community structure and chemistry, and carbon and nutrient cycling, all areas of intense interest to the US Dept. of Energy (DOE) Joint Genome Institute (JGI). To this end we are building on our earlier sequencing of the Laccaria bicolor genome by partnering with INRA-Nancy and the mycorrhizal research community in the MGI to sequence and analyze dozens of mycorrhizal genomes of all Basidiomycota and Ascomycota orders and multiple ecological types (ericoid, orchid, and ectomycorrhizal). JGI has developed and deployed high-throughput sequencing techniques, and Assembly, RNASeq, and Annotation Pipelines. In 2012 alone we sequenced, assembled, and annotated 12 draft or improved genomes of mycorrhizae, and predicted ~232831 genes and ~15011 multigene families. All of this data is publicly available on JGI MycoCosm (http://jgi.doe.gov/fungi/), which provides access to both the genome data and tools with which to analyze the data. Preliminary comparisons of the current total of 14 public mycorrhizal genomes suggest that 1) short secreted proteins potentially involved in symbiosis are more enriched in some orders than in others amongst the mycorrhizal Agaricomycetes, 2) there are wide ranges of numbers of genes involved in certain functional categories, such as signal transduction and post-translational modification, and 3) novel gene families are specific to some ecological types.

  5. Assessing pooled BAC and whole genome shotgun strategies for assembly of complex genomes

    Directory of Open Access Journals (Sweden)

    Feltus F

    2011-04-01

    Full Text Available Background: We investigate if pooling BAC clones and sequencing the pools can provide for more accurate assembly of genome sequences than the "whole genome shotgun" (WGS) approach. Furthermore, we quantify this accuracy increase. We compare the pooled BAC and WGS approaches using in silico simulations. Standard measures of assembly quality focus on assembly size and fragmentation, which are desirable for large whole genome assemblies. We propose additional measures enabling easy and visual comparison of assembly quality, such as rearrangements and redundant sequence content, relative to the known target sequence. Results: The best assembly quality scores were obtained using 454 coverage of 15× linear and 5× paired (3kb insert size) reads (15L-5P) on Arabidopsis. This regime gave similarly good results on four additional plant genomes of very different GC and repeat contents. BAC pooling improved assembly scores over WGS assembly, coverage and redundancy scores improving the most. Conclusions: BAC pooling works better than WGS; however, both require a physical map to order the scaffolds. Pool sizes up to 12Mbp work well, suggesting this pooling density to be effective in medium-scale re-sequencing applications such as targeted sequencing of QTL intervals for candidate gene discovery. Assuming the current Roche/454 Titanium sequencing limitations, a 12 Mbp region could be re-sequenced with a full plate of linear reads and a half plate of paired-end reads, yielding 15L-5P coverage after read pre-processing. Our simulation suggests that massively over-sequencing may not improve accuracy. Our scoring measures can be used generally to evaluate and compare results of simulated genome assemblies.

  6. Family Competition Pheromone Genetic Algorithm for Comparative Genome Assembly

    Institute of Scientific and Technical Information of China (English)

    Chien-Hao Su; Chien-Shun Chiou; Jung-Che Kuo; Pei-Jen Wang; Cheng-Yan Kao; Hsueh-Ting Chu

    2014-01-01

    Genome assembly is a prerequisite step for analyzing next generation sequencing data and is still far from being solved. Many assembly tools have been proposed and used extensively. The majority of them aim to assemble sequencing reads into contigs; however, we focus on the assembly of contigs into scaffolds in this paper. This is called scaffolding, which estimates the relative order of the contigs as well as the size of the gaps between these contigs. A pheromone trail-based genetic algorithm (PGA) was previously proposed and, according to its authors, performed well. From our previous study, we found that the family competition mechanism in genetic algorithms is able to further improve the results. Therefore, we propose the family competition pheromone genetic algorithm (FCPGA) and demonstrate the improvement over PGA.

  7. Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data.

    Science.gov (United States)

    Birol, Inanc; Raymond, Anthony; Jackman, Shaun D; Pleasance, Stephen; Coope, Robin; Taylor, Greg A; Yuen, Macaire Man Saint; Keeling, Christopher I; Brand, Dana; Vandervalk, Benjamin P; Kirk, Heather; Pandoh, Pawan; Moore, Richard A; Zhao, Yongjun; Mungall, Andrew J; Jaquish, Barry; Yanchuk, Alvin; Ritland, Carol; Boyle, Brian; Bousquet, Jean; Ritland, Kermit; Mackay, John; Bohlmann, Jörg; Jones, Steven J M

    2013-06-15

    White spruce (Picea glauca) is a dominant conifer of the boreal forests of North America, and providing genomics resources for this commercially valuable tree will help improve forest management and conservation efforts. Sequencing and assembling the large and highly repetitive spruce genome though pushes the boundaries of the current technology. Here, we describe a whole-genome shotgun sequencing strategy using two Illumina sequencing platforms and an assembly approach using the ABySS software. We report a 20.8 giga base pairs draft genome in 4.9 million scaffolds, with a scaffold N50 of 20,356 bp. We demonstrate how recent improvements in the sequencing technology, especially increasing read lengths and paired end reads from longer fragments have a major impact on the assembly contiguity. We also note that scalable bioinformatics tools are instrumental in providing rapid draft assemblies. The Picea glauca genome sequencing and assembly data are available through NCBI (Accession#: ALWZ0100000000 PID: PRJNA83435). http://www.ncbi.nlm.nih.gov/bioproject/83435.

  8. A New Reference Genome Assembly for the Microcrustacean Daphnia pulex

    Directory of Open Access Journals (Sweden)

    Zhiqiang Ye

    2017-05-01

    Full Text Available Comparing genomes of closely related genotypes from populations with distinct demographic histories can help reveal the impact of effective population size on genome evolution. For this purpose, we present a high quality genome assembly of Daphnia pulex (PA42), and compare this with the first sequenced genome of this species (TCO), which was derived from an isolate from a population with >90% reduction in nucleotide diversity. PA42 has numerous similarities to TCO at the gene level, with an average amino acid sequence identity of 98.8% and >60% of orthologous proteins identical. Nonetheless, there is a highly elevated number of genes in the TCO genome annotation, with ∼7000 excess genes appearing to be false positives. This view is supported by the high GC content, lack of introns, and short length of these suspicious gene annotations. Consistent with the view that reduced effective population size can facilitate the accumulation of slightly deleterious genomic features, we observe more proliferation of transposable elements (TEs) and a higher frequency of gained introns in the TCO genome.

  9. Resequencing of the common marmoset genome improves genome assemblies and gene-coding sequence analysis.

    Science.gov (United States)

    Sato, Kengo; Kuroki, Yoko; Kumita, Wakako; Fujiyama, Asao; Toyoda, Atsushi; Kawai, Jun; Iriki, Atsushi; Sasaki, Erika; Okano, Hideyuki; Sakakibara, Yasubumi

    2015-11-20

    The first draft of the common marmoset (Callithrix jacchus) genome was published by the Marmoset Genome Sequencing and Analysis Consortium. The draft was based on whole-genome shotgun sequencing, and the current assembly version is Callithrix_jacches-3.2.1, but there still exist 187,214 undetermined gap regions and supercontigs and relatively short contigs that are unmapped to chromosomes in the draft genome. We performed resequencing and assembly of the genome of common marmoset by deep sequencing with high-throughput sequencing technology. Several different sequence runs using Illumina sequencing platforms were executed, and 181 Gbp of high-quality bases including mate-pairs with long insert lengths of 3, 8, 20, and 40 Kbp were obtained, that is, approximately 60× coverage. The resequencing significantly improved the MGSAC draft genome sequence. The N50 of the contigs, which is a statistical measure used to evaluate assembly quality, doubled. As a result, 51% of the contigs (total length: 299 Mbp) that were unmapped to chromosomes in the MGSAC draft were merged with chromosomal contigs, and the improved genome sequence helped to detect 5,288 new genes that are homologous to human cDNAs and the gaps in 5,187 transcripts of the Ensembl gene annotations were completely filled.
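    The coverage figure quoted above follows from simple arithmetic: mean coverage is the number of sequenced bases divided by the genome size. The sketch below is illustrative only; the function names are hypothetical, and the genome size is back-calculated from the two numbers in the abstract rather than taken from the paper.

```python
def mean_coverage(total_bases: float, genome_size: float) -> float:
    """Mean sequencing depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


def implied_genome_size(total_bases: float, coverage: float) -> float:
    """Genome size consistent with a given yield and mean coverage."""
    return total_bases / coverage


# 181 Gbp of high-quality bases at roughly 60x coverage (values from the abstract).
size_gbp = implied_genome_size(181, 60)
```

    Under these numbers the implied genome size is about 3 Gbp.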

  10. Self-assembly of virus particles: The role of genome

    Science.gov (United States)

    Erdemci-Tandogan, Gonca; Wagner, Jef; Podgornik, Rudolf; Zandi, Roya

    2013-03-01

    A virus is an infectious agent that inserts its genetic material into the cell and hijacks the cell's machinery to reproduce. The simplest viruses are made of a protein shell (capsid) that protects its genome (DNA or RNA). Many plant and animal viruses can be assembled spontaneously from a solution of proteins and genetic material in different capsid shapes and sizes. This work focuses on the role of genome in the assembly of spherical RNA viruses. The RNA, a highly flexible polymer, is modeled by mean field approximations. Two RNA models are discussed: (i) A linear polymer model including a pairing affinity between RNA base pairs, and (ii) a branched polymer model. Polymer density and electrostatic potential profiles are obtained, and the relevant free energies are calculated from these profiles. The optimal length of the encapsidated chain is examined as a function of the model parameters. The osmotic pressure of the system is also discussed.

  11. FLASH assembly of TALENs for high-throughput genome editing.

    Science.gov (United States)

    Reyon, Deepak; Tsai, Shengdar Q; Khayter, Cyd; Foden, Jennifer A; Sander, Jeffry D; Joung, J Keith

    2012-05-01

    Engineered transcription activator–like effector nucleases (TALENs) have shown promise as facile and broadly applicable genome editing tools. However, no publicly available high-throughput method for constructing TALENs has been published, and large-scale assessments of the success rate and targeting range of the technology remain lacking. Here we describe the fast ligation-based automatable solid-phase high-throughput (FLASH) system, a rapid and cost-effective method for large-scale assembly of TALENs. We tested 48 FLASH-assembled TALEN pairs in a human cell–based EGFP reporter system and found that all 48 possessed efficient gene-modification activities. We also used FLASH to assemble TALENs for 96 endogenous human genes implicated in cancer and/or epigenetic regulation and found that 84 pairs were able to efficiently introduce targeted alterations. Our results establish the robustness of TALEN technology and demonstrate that FLASH facilitates high-throughput genome editing at a scale not currently possible with other genome modification technologies.

  12. ABySS-Explorer: visualizing genome sequence assemblies.

    Science.gov (United States)

    Nielsen, Cydney B; Jackman, Shaun D; Birol, Inanç; Jones, Steven J M

    2009-01-01

    One bottleneck in large-scale genome sequencing projects is reconstructing the full genome sequence from the short subsequences produced by current technologies. The final stages of the genome assembly process inevitably require manual inspection of data inconsistencies and could be greatly aided by visualization. This paper presents our design decisions in translating key data features identified through discussions with analysts into a concise visual encoding. Current visualization tools in this domain focus on local sequence errors making high-level inspection of the assembly difficult if not impossible. We present a novel interactive graph display, ABySS-Explorer, that emphasizes the global assembly structure while also integrating salient data features such as sequence length. Our tool replaces manual and in some cases pen-and-paper based analysis tasks, and we discuss how user feedback was incorporated into iterative design refinements. Finally, we touch on applications of this representation not initially considered in our design phase, suggesting the generality of this encoding for DNA sequence data.

  13. A draft genome assembly of the army worm, Spodoptera frugiperda.

    Science.gov (United States)

    Kakumani, Pavan Kumar; Malhotra, Pawan; Mukherjee, Sunil K; Bhatnagar, Raj K

    2014-08-01

    Spodoptera is an agriculturally important pest insect and studies in understanding its biology have been limited by the unavailability of its genome. In the present study, the genomic DNA was sequenced and assembled into 37,243 scaffolds with a total size of 358 Mb and an N50 of 53.7 kb. Based on degree of identity, we could anchor 305 Mb of the genome onto all 28 chromosomes of Bombyx mori. Repeat elements were identified, which account for 20.28% of the total genome. Further, we predicted 11,595 genes, with an average intron length of 726 bp. The genes were annotated and domain analysis revealed that Sf genes share a significant homology and expression pattern with B. mori, despite differences in KOG gene categories and representation of certain protein families. The present study on Sf genome would help in the characterization of cellular pathways to understand its biology and comparative evolutionary studies among lepidopteran family members to help annotate their genomes.

  14. GRAbB: Selective Assembly of Genomic Regions, a New Niche for Genomic Research.

    Science.gov (United States)

    Brankovics, Balázs; Zhang, Hao; van Diepeningen, Anne D; van der Lee, Theo A J; Waalwijk, Cees; de Hoog, G Sybren

    2016-06-01

    GRAbB (Genomic Region Assembly by Baiting) is a new program that is dedicated to assemble specific genomic regions from NGS data. This approach is especially useful when dealing with multi copy regions, such as mitochondrial genome and the rDNA repeat region, parts of the genome that are often neglected or poorly assembled, although they contain interesting information from phylogenetic or epidemiologic perspectives, but also single copy regions can be assembled. The program is capable of targeting multiple regions within a single run. Furthermore, GRAbB can be used to extract specific loci from NGS data, based on homology, like sequences that are used for barcoding. To make the assembly specific, a known part of the region, such as the sequence of a PCR amplicon or a homologous sequence from a related species must be specified. By assembling only the region of interest, the assembly process is computationally much less demanding and may lead to assemblies of better quality. In this study the different applications and functionalities of the program are demonstrated such as: exhaustive assembly (rDNA region and mitochondrial genome), extracting homologous regions or genes (IGS, RPB1, RPB2 and TEF1a), as well as extracting multiple regions within a single run. The program is also compared with MITObim, which is meant for the exhaustive assembly of a single target based on a similar query sequence. GRAbB is shown to be more efficient than MITObim in terms of speed, memory and disk usage. The other functionalities (handling multiple targets simultaneously and extracting homologous regions) of the new program are not matched by other programs. The program is available with explanatory documentation at https://github.com/b-brankovics/grabb. GRAbB has been tested on Ubuntu (12.04 and 14.04), Fedora (23), CentOS (7.1.1503) and Mac OS X (10.7). Furthermore, GRAbB is available as a docker repository: brankovics/grabb (https://hub.docker.com/r/brankovics/grabb/).

  15. GRAbB: Selective Assembly of Genomic Regions, a New Niche for Genomic Research.

    Directory of Open Access Journals (Sweden)

    Balázs Brankovics

    2016-06-01

    Full Text Available GRAbB (Genomic Region Assembly by Baiting) is a new program that is dedicated to assemble specific genomic regions from NGS data. This approach is especially useful when dealing with multi copy regions, such as mitochondrial genome and the rDNA repeat region, parts of the genome that are often neglected or poorly assembled, although they contain interesting information from phylogenetic or epidemiologic perspectives, but also single copy regions can be assembled. The program is capable of targeting multiple regions within a single run. Furthermore, GRAbB can be used to extract specific loci from NGS data, based on homology, like sequences that are used for barcoding. To make the assembly specific, a known part of the region, such as the sequence of a PCR amplicon or a homologous sequence from a related species must be specified. By assembling only the region of interest, the assembly process is computationally much less demanding and may lead to assemblies of better quality. In this study the different applications and functionalities of the program are demonstrated such as: exhaustive assembly (rDNA region and mitochondrial genome), extracting homologous regions or genes (IGS, RPB1, RPB2 and TEF1a), as well as extracting multiple regions within a single run. The program is also compared with MITObim, which is meant for the exhaustive assembly of a single target based on a similar query sequence. GRAbB is shown to be more efficient than MITObim in terms of speed, memory and disk usage. The other functionalities (handling multiple targets simultaneously and extracting homologous regions) of the new program are not matched by other programs. The program is available with explanatory documentation at https://github.com/b-brankovics/grabb. GRAbB has been tested on Ubuntu (12.04 and 14.04), Fedora (23), CentOS (7.1.1503) and Mac OS X (10.7). Furthermore, GRAbB is available as a docker repository: brankovics/grabb (https://hub.docker.com/r/brankovics/grabb/).

  16. Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data.

    Science.gov (United States)

    Desai, Aarti; Marwah, Veer Singh; Yadav, Akshay; Jha, Vineet; Dhaygude, Kishor; Bangar, Ujwala; Kulkarni, Vivek; Jere, Abhay

    2013-01-01

    Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing has encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage which are known to impact genome assembly are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 Mb), S. kudriavzevii (11.18 Mb) and C. elegans (100 Mb) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6-40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing which will enable optimum utilization of sequencing as well as analysis resources.
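
    The depth figures above translate into sequencing requirements through simple arithmetic: average depth is total sequenced bases divided by genome size. The sketch below shows that relationship for the 4.6 Mb E. coli genome at 50X. The 100 bp read length is an assumption made only for illustration (the abstract does not state read lengths), and the function names are hypothetical.

```python
def required_bases(genome_size_bp, depth):
    """Total bases needed to reach a given average depth."""
    return genome_size_bp * depth


def required_reads(genome_size_bp, depth, read_length_bp):
    """Number of reads needed, rounded up to a whole read."""
    bases = required_bases(genome_size_bp, depth)
    # Integer ceiling division avoids floating-point rounding.
    return -(-bases // read_length_bp)


def average_depth(total_bases, genome_size_bp):
    """Average depth implied by a given amount of sequence."""
    return total_bases / genome_size_bp


ecoli_bp = 4_600_000   # genome size quoted in the abstract
read_len = 100         # assumed read length, not from the abstract

print(required_bases(ecoli_bp, 50))            # 230000000
print(required_reads(ecoli_bp, 50, read_len))  # 2300000
```

    This is an average only; actual per-base depth varies along the genome, which is one reason empirical studies like the one above are needed.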

  17. Two low coverage bird genomes and a comparison of reference-guided versus de novo genome assemblies.

    Directory of Open Access Journals (Sweden)

    Daren C Card

    Full Text Available As a greater number and diversity of high-quality vertebrate reference genomes become available, it is increasingly feasible to use these references to guide new draft assemblies for related species. Reference-guided assembly approaches may substantially increase the contiguity and completeness of a new genome using only low levels of genome coverage that might otherwise be insufficient for de novo genome assembly. We used low-coverage (∼3.5-5.5x) Illumina paired-end sequencing to assemble draft genomes of two bird species (the Gunnison Sage-Grouse, Centrocercus minimus, and the Clark's Nutcracker, Nucifraga columbiana). We used these data to estimate de novo genome assemblies and reference-guided assemblies, and compared the information content and completeness of these assemblies by comparing CEGMA gene set representation, repeat element content, simple sequence repeat content, and GC isochore structure among assemblies. Our results demonstrate that even lower-coverage genome sequencing projects are capable of producing informative and useful genomic resources, particularly through the use of reference-guided assemblies.

  18. Two low coverage bird genomes and a comparison of reference-guided versus de novo genome assemblies

    Science.gov (United States)

    Card, Daren C.; Schield, Drew R.; Reyes-Velasco, Jacobo; Fujita, Matthew K.; Andrew, Audra L.; Oyler-McCance, Sara J.; Fike, Jennifer A.; Tomback, Diana F.; Ruggiero, Robert P.; Castoe, Todd A.

    2014-01-01

    As a greater number and diversity of high-quality vertebrate reference genomes become available, it is increasingly feasible to use these references to guide new draft assemblies for related species. Reference-guided assembly approaches may substantially increase the contiguity and completeness of a new genome using only low levels of genome coverage that might otherwise be insufficient for de novo genome assembly. We used low-coverage (~3.5–5.5x) Illumina paired-end sequencing to assemble draft genomes of two bird species (the Gunnison Sage-Grouse, Centrocercus minimus, and the Clark's Nutcracker, Nucifraga columbiana). We used these data to estimate de novo genome assemblies and reference-guided assemblies, and compared the information content and completeness of these assemblies by comparing CEGMA gene set representation, repeat element content, simple sequence repeat content, and GC isochore structure among assemblies. Our results demonstrate that even lower-coverage genome sequencing projects are capable of producing informative and useful genomic resources, particularly through the use of reference-guided assemblies.

  19. Augmenting Chinese hamster genome assembly by identifying regions of high confidence.

    Science.gov (United States)

    Vishwanathan, Nandita; Bandyopadhyay, Arpan A; Fu, Hsu-Yuan; Sharma, Mohit; Johnson, Kathryn C; Mudge, Joann; Ramaraj, Thiruvarangan; Onsongo, Getiria; Silverstein, Kevin A T; Jacob, Nitya M; Le, Huong; Karypis, George; Hu, Wei-Shou

    2016-09-01

    Chinese hamster ovary (CHO) cell lines are the dominant industrial workhorses for therapeutic recombinant protein production. The availability of genome sequence of Chinese hamster and CHO cells will spur further genome and RNA sequencing of producing cell lines. However, the mammalian genomes assembled using shot-gun sequencing data still contain regions of uncertain quality due to assembly errors. Identifying high confidence regions in the assembled genome will facilitate its use for cell engineering and genome engineering. We assembled two independent drafts of Chinese hamster genome by de novo assembly from shotgun sequencing reads and by re-scaffolding and gap-filling the draft genome from NCBI for improved scaffold lengths and gap fractions. We then used the two independent assemblies to identify high confidence regions using two different approaches. First, the two independent assemblies were compared at the sequence level to identify their consensus regions as "high confidence regions", which account for at least 78% of the assembled genome. Further, a genome wide comparison of the Chinese hamster scaffolds with mouse chromosomes revealed scaffolds with large blocks of collinearity, which were also compiled as high-quality scaffolds. Genome scale collinearity was complemented with EST based synteny which also revealed conserved gene order compared to mouse. As cell line sequencing becomes more commonly practiced, the approaches reported here are useful for assessing the quality of assembly and potentially facilitate the engineering of cell lines.

  20. An efficient procedure for plant organellar genome assembly, based on whole genome data from the 454 GS FLX sequencing platform

    Directory of Open Access Journals (Sweden)

    Zhang Tongwu

    2011-11-01

    Full Text Available Motivation: Complete organellar genome sequences (chloroplasts and mitochondria) provide valuable resources and information for studying plant molecular ecology and evolution. As high-throughput sequencing technology advances, it becomes the norm that a shotgun approach is used to obtain complete genome sequences. Therefore, to assemble organellar sequences from the whole genome, shotgun reads are inevitable. However, associated techniques are often cumbersome, time-consuming, and difficult, because true organellar DNA is difficult to separate efficiently from nuclear copies, which have been transferred to the nucleus through the course of evolution. Results: We report a new, rapid procedure for plant chloroplast and mitochondrial genome sequencing and assembly using the Roche/454 GS FLX platform. Plant cells can contain multiple copies of the organellar genomes, and there is a significant correlation between the depth of sequence reads in contigs and the number of copies of the genome. Without isolating organellar DNA from the mixture of nuclear and organellar DNA for sequencing, we retrospectively extracted assembled contigs of either chloroplast or mitochondrial sequences from the whole genome shotgun data. Moreover, the contig connection graph property of Newbler (a platform-specific sequence assembler) ensures an efficient final assembly. Using this procedure, we assembled both chloroplast and mitochondrial genomes of a resurrection plant, Boea hygrometrica, with high fidelity. We also present information and a minimal sequence dataset as a reference for the assembly of other plant organellar genomes.

  1. GAViT: Genome Assembly Visualization Tool for Short Read Data

    Energy Technology Data Exchange (ETDEWEB)

    Syed, Aijazuddin; Shapiro, Harris; Tu, Hank; Pangilinan, Jasmyn; Trong, Stephan

    2008-03-14

    It is a challenging job for genome analysts to accurately debug, troubleshoot, and validate genome assembly results. Genome analysts rely on visualization tools to help validate and troubleshoot assembly results, including such problems as mis-assemblies, low-quality regions, and repeats. Short read data adds further complexity and makes it extremely challenging for the visualization tools to scale and to view all needed assembly information. As a result, there is a need for a visualization tool that can scale to display assembly data from the new sequencing technologies. We present Genome Assembly Visualization Tool (GAViT), a highly scalable and interactive assembly visualization tool developed at the DOE Joint Genome Institute (JGI).

  2. A biologist's guide to de novo genome assembly using next-generation sequence data: A test with fungal genomes.

    Science.gov (United States)

    Haridas, Sajeet; Breuil, Colette; Bohlmann, Joerg; Hsiang, Tom

    2011-09-01

    We offer a guide to de novo genome assembly using sequence data generated by the Illumina platform for biologists working with fungi or other organisms whose genomes are less than 100Mb in size. The guide requires no familiarity with sequencing assembly technology or associated computer programs. It defines commonly used terms in genome sequencing and assembly; provides examples of assembling short-read genome sequence data for four strains of the fungus Grosmannia clavigera using four assembly programs; gives examples of protocols and software; and presents a commented flowchart that extends from DNA preparation for submission to a sequencing center, through to processing and assembly of the raw sequence reads using freely available operating systems and software.

  3. Challenges, Solutions, and Quality Metrics of Personal Genome Assembly in Advancing Precision Medicine.

    Science.gov (United States)

    Xiao, Wenming; Wu, Leihong; Yavas, Gokhan; Simonyan, Vahan; Ning, Baitang; Hong, Huixiao

    2016-04-22

    Even though each of us shares more than 99% of the DNA sequences in our genome, there are millions of sequence codes or structure in small regions that differ between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived with the DNA from multiple individuals. As a result of this, the reference genome is incomplete and may misrepresent the sequence variants of the general population. The more reliable solution is to compare sequences of diseased tissue with its own genome sequence derived from tissue in a normal state. As the price to sequence the human genome has dropped dramatically to around $1000, it shows a promising future of documenting the personal genome for every individual. However, de novo assembly of individual genomes at an affordable cost is still challenging. Thus, till now, only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging "third generation sequencing" technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructures. We recommend that a combined hybrid assembly with long and short reads would be a promising way to generate good quality human genome assemblies and specify parameters for the quality assessment of assembly outcomes. We provide a perspective view of the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome. Finally, we discuss the usage of the personal genome in aiding vaccine design and development, monitoring host immune-response, tailoring

  4. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

    Science.gov (United States)

    Berlin, Konstantin; Koren, Sergey; Chin, Chen-Shan; Drake, James P; Landolin, Jane M; Phillippy, Adam M

    2015-06-01

    Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
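
    The abstract above builds on MinHash, a locality-sensitive hashing scheme for estimating set similarity. The following is a simplified sketch of the general MinHash idea applied to k-mer sets, not MHAP's actual implementation: the salted MD5 hash family, the k-mer size, the sketch size, and all function names are assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, kmer):
    """One member of a hash family, selected by an integer seed."""
    digest = hashlib.md5(f"{seed}:{kmer}".encode()).hexdigest()
    return int(digest, 16)


def sketch(seq, k=8, num_hashes=64):
    """MinHash sketch: the minimum hash value under each hash function."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(seed, km) for km in kmer_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching minima estimates the Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


read = "ACGTTGCATGCCGATAGCTAGGCTAACGT"
print(estimate_jaccard(sketch(read), sketch(read)))  # 1.0
```

    Comparing fixed-size sketches instead of full k-mer sets is what makes all-against-all overlap detection affordable for large read sets; candidate pairs with a high estimate would still need to be confirmed by alignment.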

  5. Comparative study of de novo assembly and genome-guided assembly strategies for transcriptome reconstruction based on RNA-Seq.

    Science.gov (United States)

    Lu, Bingxin; Zeng, Zhenbing; Shi, Tieliu

    2013-02-01

    Transcriptome reconstruction is an important application of RNA-Seq, providing critical information for further analysis of transcriptome. Although RNA-Seq offers the potential to identify the whole picture of transcriptome, it still presents special challenges. To handle these difficulties and reconstruct transcriptome as completely as possible, current computational approaches mainly employ two strategies: de novo assembly and genome-guided assembly. In order to find the similarities and differences between them, we first chose five representative assemblers belonging to the two classes, and then investigated and compared their algorithm features in theory and real performances in practice. We found that all the methods can be reduced to graph reduction problems, yet they have different conceptual and practical implementations; thus each assembly method has its specific advantages and disadvantages, performing worse than others in certain aspects while outperforming others in other aspects at the same time. Finally we merged assemblies of the five assemblers and obtained a much better assembly. Additionally we evaluated an assembler using genome-guided de novo assembly approach, and achieved good performance. Based on these results, we suggest that to obtain a comprehensive set of recovered transcripts, it is better to use a combination of de novo assembly and genome-guided assembly.

  6. Genome assembly quality: Assessment and improvement using the neutral indel model

    Science.gov (United States)

    Meader, Stephen; Hillier, LaDeana W.; Locke, Devin; Ponting, Chris P.; Lunter, Gerton

    2010-01-01

    We describe a statistical and comparative-genomic approach for quantifying error rates of genome sequence assemblies. The method exploits not substitutions but the pattern of insertions and deletions (indels) in genome-scale alignments for closely related species. Using two- or three-way alignments, the approach estimates the amount of aligned sequence containing clusters of nucleotides that were wrongly inserted or deleted during sequencing or assembly. Thus, the method is well-suited to assessing fine-scale sequence quality within single assemblies, between different assemblies of a single set of reads, and between genome assemblies for different species. When applying this approach to four primate genome assemblies, we found that average gap error rates per base varied considerably, by up to sixfold. As expected, bacterial artificial chromosome (BAC) sequences contained lower, but still substantial, predicted numbers of errors, arguing for caution in regarding BACs as the epitome of genome fidelity. We then mapped short reads, at approximately 10-fold statistical coverage, from a Bornean orangutan onto the Sumatran orangutan genome assembly originally constructed from capillary reads. This resulted in a reduced gap error rate and a separation of error-prone from high-fidelity sequence. Over 5000 predicted indel errors in protein-coding sequence were corrected in a hybrid assembly. Our approach contributes a new fine-scale quality metric for assemblies that should facilitate development of improved genome sequencing and assembly strategies. PMID:20305016

  7. A hybrid approach for de novo human genome sequence assembly and phasing.

    Science.gov (United States)

    Mostovoy, Yulia; Levy-Sakin, Michal; Lam, Jessica; Lam, Ernest T; Hastie, Alex R; Marks, Patrick; Lee, Joyce; Chu, Catherine; Lin, Chin; Džakula, Željko; Cao, Han; Schlebusch, Stephen A; Giorda, Kristina; Schnall-Levin, Michael; Wall, Jeffrey D; Kwok, Pui-Yan

    2016-07-01

    Despite tremendous progress in genome sequencing, the basic goal of producing a phased (haplotype-resolved) genome sequence with end-to-end contiguity for each chromosome at reasonable cost and effort is still unrealized. In this study, we describe an approach to performing de novo genome assembly and experimental phasing by integrating the data from Illumina short-read sequencing, 10X Genomics linked-read sequencing, and BioNano Genomics genome mapping to yield a high-quality, phased, de novo assembled human genome.

  8. Effects of GC bias in next-generation-sequencing data on de novo genome assembly.

    Science.gov (United States)

    Chen, Yen-Chun; Liu, Tsunglin; Yu, Chun-Hui; Chiang, Tzen-Yuh; Hwang, Chi-Chuan

    2013-01-01

    Next-generation-sequencing (NGS) has revolutionized the field of genome assembly because of its much higher data throughput and much lower cost compared with traditional Sanger sequencing. However, NGS poses new computational challenges to de novo genome assembly. Among the challenges, GC bias in NGS data is known to aggravate genome assembly. However, it is not clear to what extent GC bias affects genome assembly in general. In this work, we conduct a systematic analysis on the effects of GC bias on genome assembly. Our analyses reveal that GC bias only lowers assembly completeness when the degree of GC bias is above a threshold. At a strong GC bias, the assembly fragmentation due to GC bias can be explained by the low coverage of reads in the GC-poor or GC-rich regions of a genome. This effect is observed for all the assemblers under study. Increasing the total amount of NGS data thus rescues the assembly fragmentation because of GC bias. However, the amount of data needed for a full rescue depends on the distribution of GC contents. Both low and high coverage depths due to GC bias lower the accuracy of assembly. These pieces of information provide guidance toward a better de novo genome assembly in the presence of GC bias.

  9. Large-scale parallel genome assembler over cloud computing environment.

    Science.gov (United States)

    Das, Arghya Kusum; Koppa, Praveen Kumar; Goswami, Sayan; Platania, Richard; Park, Seung-Jong

    2017-06-01

    The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of traditional HPC cluster.

  10. GFinisher: a new strategy to refine and finish bacterial genome assemblies

    Science.gov (United States)

    Guizelini, Dieval; Raittz, Roberto T.; Cruz, Leonardo M.; Souza, Emanuel M.; Steffens, Maria B. R.; Pedrosa, Fabio O.

    2016-10-01

    Despite developments in DNA sequencing technology that have improved the number and the length of reads, the process of reconstructing complete genome sequences, the so-called genome assembly, is still complex. Only 13% of the prokaryotic genome sequencing projects have been completed. Draft genome sequences deposited in public databases are fragmented in contigs and may lack the full gene complement. The aim of the present work is to identify assembly errors and improve the assembly process of bacterial genomes. The biological patterns observed in genomic sequences and the application of a priori information can allow the identification of misassembled regions, and the reorganization and improvement of the overall de novo genome assembly. GFinisher starts by generating fuzzy GC skew graphs for each contig in an assembly and then breaks the contigs at critical points in order to reassemble and close them using jFGap. This has been successfully applied to datasets from 96 genome assemblies, decreasing the number of contigs by up to 86%. GFinisher can easily optimize assemblies of prokaryotic draft genomes and can be used to improve the assembly programs based on nucleotide sequence patterns in the genome. The software and source code are available at http://gfinisher.sourceforge.net/.
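
    The GC skew signal mentioned above can be illustrated with the plain, non-fuzzy statistic, (G - C) / (G + C), computed in windows along a sequence. This is a minimal sketch of that standard quantity only; it is not GFinisher's fuzzy variant or its contig-breaking logic, and the function names and window size are assumptions.

```python
def gc_skew(seq, window):
    """GC skew, (G - C) / (G + C), in consecutive non-overlapping windows.

    A trailing partial window is ignored. Windows without any G or C
    are reported as 0.0.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative(values):
    """Running sum, as used for cumulative GC skew plots."""
    total = 0.0
    out = []
    for value in values:
        total += value
        out.append(total)
    return out


print(gc_skew("GGGGCCCC", 4))         # [1.0, -1.0]
print(cumulative(gc_skew("GGGGCCCC", 4)))  # [1.0, 0.0]
```

    In many bacterial chromosomes the sign of the skew tends to switch near the replication origin and terminus, which is why an unexpected switch inside a contig can hint at a misassembly.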

  11. A New Chicken Genome Assembly Provides Insight into Avian Genome Structure

    Directory of Open Access Journals (Sweden)

    Wesley C. Warren

    2017-01-01

    The importance of Gallus gallus (chicken) as a model organism and agricultural animal merits a continuation of sequence assembly improvement efforts. We present a new version of the chicken genome assembly (Gallus_gallus-5.0; GCA_000002315.3), built from combined long single molecule sequencing technology, finished BACs, and improved physical maps. In overall assembled bases, we see a gain of 183 Mb, including 16.4 Mb in placed chromosomes with a corresponding gain in the percentage of intact repeat elements characterized. Of the 1.21 Gb genome, we include three previously missing autosomes, GGA30, 31, and 33, and improve sequence contig length 10-fold over the previous Gallus_gallus-4.0. Despite the significant base representation improvements made, 138 Mb of sequence is not yet located to chromosomes. When annotated for gene content, Gallus_gallus-5.0 shows an increase of 4679 annotated genes (2768 noncoding and 1911 protein-coding) over those in Gallus_gallus-4.0. We also revisited the question of what genes are missing in the avian lineage, as assessed by the highest quality avian genome assembly to date, and found that a large fraction of the original set of missing genes are still absent in sequenced bird species. Finally, our new data support a detailed map of MHC-B, encompassing two segments: one with a highly stable gene copy number and another in which the gene copy number is highly variable. The chicken model has been a critical resource for many other fields of study, and this new reference assembly will substantially further these efforts.

  12. Projector 2 : contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies

    NARCIS (Netherlands)

    van Hijum, SAFT; Zomer, AL; Kuipers, OP; Kok, J

    2005-01-01

    With genome sequencing efforts increasing exponentially, valuable information accumulates on genomic content of the various organisms sequenced. Projector 2 uses (un) finished genomic sequences of an organism as a template to infer linkage information for a genome sequence assembly of a related orga

  13. Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly

    DEFF Research Database (Denmark)

    Li, Yingrui; Zheng, Hancheng; Luo, Ruibang

    2011-01-01

    Here we use whole-genome de novo assembly of second-generation sequencing reads to map structural variation (SV) in an Asian genome and an African genome. Our approach identifies small- and intermediate-size homozygous variants (1-50 kb) including insertions, deletions, inversions and their precise...

  14. Genomic characterization of large heterochromatic gaps in the human genome assembly.

    Directory of Open Access Journals (Sweden)

    Nicolas Altemose

    2014-05-01

    The largest gaps in the human genome assembly correspond to multi-megabase heterochromatic regions composed primarily of two related families of tandem repeats, Human Satellites 2 and 3 (HSat2,3). The abundance of repetitive DNA in these regions challenges standard mapping and assembly algorithms, and as a result, the sequence composition and potential biological functions of these regions remain largely unexplored. Furthermore, existing genomic tools designed to predict consensus-based descriptions of repeat families cannot be readily applied to complex satellite repeats such as HSat2,3, which lack a consistent repeat unit reference sequence. Here we present an alignment-free method to characterize complex satellites using whole-genome shotgun read datasets. Utilizing this approach, we classify HSat2,3 sequences into fourteen subfamilies and predict their chromosomal distributions, resulting in a comprehensive satellite reference database to further enable genomic studies of heterochromatic regions. We also identify 1.3 Mb of non-repetitive sequence interspersed with HSat2,3 across 17 unmapped assembly scaffolds, including eight annotated gene predictions. Finally, we apply our satellite reference database to high-throughput sequence data from 396 males to estimate array size variation of the predominant HSat3 array on the Y chromosome, confirming that satellite array sizes can vary between individuals over an order of magnitude (7 to 98 Mb) and further demonstrating that array sizes are distributed differently within distinct Y haplogroups. In summary, we present a novel framework for generating initial reference databases for unassembled genomic regions enriched with complex satellite DNA, and we further demonstrate the utility of these reference databases for studying patterns of sequence variation within human populations.

  15. GABenchToB: a genome assembly benchmark tuned on bacteria and benchtop sequencers.

    Directory of Open Access Journals (Sweden)

    Sebastian Jünemann

    De novo genome assembly is the process of reconstructing a complete genomic sequence from countless small sequencing reads. Due to the complexity of this task, numerous genome assemblers have been developed to cope with different requirements and the different kinds of data provided by sequencers within the fast evolving field of next-generation sequencing technologies. In particular, the recently introduced generation of benchtop sequencers, like Illumina's MiSeq and Ion Torrent's Personal Genome Machine (PGM), popularized the easy, fast, and cheap sequencing of bacterial organisms to a broad range of academic and clinical institutions. With a strong pragmatic focus, here, we give a novel insight into the line of assembly evaluation surveys as we benchmark popular de novo genome assemblers based on bacterial data generated by benchtop sequencers. Therefore, single-library assemblies were generated, assembled, and compared to each other by metrics describing assembly contiguity and accuracy, and also by practice-oriented criteria such as computing time. In addition, we extensively analyzed the effect of the depth of coverage on the genome assemblies within reasonable ranges and the k-mer optimization problem of de Bruijn graph assemblers. Our results show that, although both MiSeq and PGM allow for good genome assemblies, they require different approaches. They not only pair with different assembler types, but also affect assemblies differently regarding the depth of coverage, where oversampling can become problematic. Assemblies vary greatly with respect to contiguity and accuracy but also in their computing power requirements. Consequently, no assembler can be rated best for all preconditions. Instead, the given kind of data, the demands on assembly quality, and the available computing infrastructure determine which assembler suits best. The data sets, scripts and all additional information needed to replicate our results are freely

  16. Extensive error in the number of genes inferred from draft genome assemblies.

    Directory of Open Access Journals (Sweden)

    James F Denton

    2014-12-01

    Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.

  17. Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce.

    Science.gov (United States)

    Reyes-Chin-Wo, Sebastian; Wang, Zhiwen; Yang, Xinhua; Kozik, Alexander; Arikit, Siwaret; Song, Chi; Xia, Liangfeng; Froenicke, Lutz; Lavelle, Dean O; Truco, María-José; Xia, Rui; Zhu, Shilin; Xu, Chunyan; Xu, Huaqin; Xu, Xun; Cox, Kyle; Korf, Ian; Meyers, Blake C; Michelmore, Richard W

    2017-04-12

    Lettuce (Lactuca sativa) is a major crop and a member of the large, highly successful Compositae family of flowering plants. Here we present a reference assembly for the species and family. This was generated using whole-genome shotgun Illumina reads plus in vitro proximity ligation data to create large superscaffolds; it was validated genetically and superscaffolds were oriented in genetic bins ordered along nine chromosomal pseudomolecules. We identify several genomic features that may have contributed to the success of the family, including genes encoding Cycloidea-like transcription factors, kinases, enzymes involved in rubber biosynthesis and disease resistance proteins that are expanded in the genome. We characterize 21 novel microRNAs, one of which may trigger phasiRNAs from numerous kinase transcripts. We provide evidence for a whole-genome triplication event specific but basal to the Compositae. We detect 26% of the genome in triplicated regions containing 30% of all genes that are enriched for regulatory sequences and depleted for genes involved in defence.

  18. Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce

    Science.gov (United States)

    Reyes-Chin-Wo, Sebastian; Wang, Zhiwen; Yang, Xinhua; Kozik, Alexander; Arikit, Siwaret; Song, Chi; Xia, Liangfeng; Froenicke, Lutz; Lavelle, Dean O.; Truco, María-José; Xia, Rui; Zhu, Shilin; Xu, Chunyan; Xu, Huaqin; Xu, Xun; Cox, Kyle; Korf, Ian; Meyers, Blake C.; Michelmore, Richard W.

    2017-01-01

    Lettuce (Lactuca sativa) is a major crop and a member of the large, highly successful Compositae family of flowering plants. Here we present a reference assembly for the species and family. This was generated using whole-genome shotgun Illumina reads plus in vitro proximity ligation data to create large superscaffolds; it was validated genetically and superscaffolds were oriented in genetic bins ordered along nine chromosomal pseudomolecules. We identify several genomic features that may have contributed to the success of the family, including genes encoding Cycloidea-like transcription factors, kinases, enzymes involved in rubber biosynthesis and disease resistance proteins that are expanded in the genome. We characterize 21 novel microRNAs, one of which may trigger phasiRNAs from numerous kinase transcripts. We provide evidence for a whole-genome triplication event specific but basal to the Compositae. We detect 26% of the genome in triplicated regions containing 30% of all genes that are enriched for regulatory sequences and depleted for genes involved in defence. PMID:28401891

  19. Comparing Memory-Efficient Genome Assemblers on Stand-Alone and Cloud Infrastructures

    KAUST Repository

    Kleftogiannis, Dimitrios A.

    2013-09-27

    A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, which require large amounts of memory to assemble the complete genome efficiently. With recent improvements in DNA sequencing technologies, it is expected that the memory footprint required for the assembly process will increase dramatically and will emerge as a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption and execution time. Our experiments prove that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. Our study reveals the minimum memory requirements for different assembly programs even when data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly approaches and result in reduction of the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and we comment on some findings regarding suitable computational resources for assembly.

  20. Omega: an Overlap-graph de novo Assembler for Meta-genomics

    Energy Technology Data Exchange (ETDEWEB)

    Haider, Bahlul [ORNL; Ahn, Tae-Hyuk [ORNL; Bushnell, Brian [U.S. Department of Energy, Joint Genome Institute; Chai, JJ [ORNL; Copeland, Alex [U.S. Department of Energy, Joint Genome Institute; Pan, Chongle [ORNL

    2014-01-01

    Motivation: Metagenomic sequencing allows reconstruction of microbial genomes directly from environmental samples. Omega (overlap-graph metagenome assembler) was developed here for assembling and scaffolding Illumina sequencing data of microbial communities. Results: Omega found overlaps between reads using a prefix/suffix hash table. The overlap graph of reads was simplified by removing transitive edges and trimming small branches. Unitigs were generated based on minimum cost flow analysis of the overlap graph. Obtained unitigs were merged to contigs and scaffolds using mate-pair information. Omega was compared with two de Bruijn graph assemblers, SOAPdenovo and IDBA-UD, using a publicly available Illumina sequencing dataset of a 64-genome mock community. The assembly results were verified by their alignment with reference genomes. The overall performances of the three assemblers were comparable and each assembler provided best results for a subset of genomes.
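
    The prefix/suffix hash table idea described in this record can be sketched in a few lines of Python. This is an illustrative toy, not Omega's code: the function name `find_overlaps`, the `min_overlap` parameter and the example reads are all assumptions, and exact matching is used with no handling of sequencing errors or reverse complements.

```python
from collections import defaultdict


def find_overlaps(reads, min_overlap):
    """Return (i, j, length) where a suffix of reads[i] equals a prefix of reads[j].

    Hypothetical helper: only proper overlaps of at least min_overlap bases
    between two different reads are reported.
    """
    # Index every sufficiently long proper prefix by the reads that start with it.
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        for length in range(min_overlap, len(read)):
            prefix_index[read[:length]].append(j)

    overlaps = []
    for i, read in enumerate(reads):
        # Look up each suffix of this read in the prefix index.
        for length in range(min_overlap, len(read)):
            for j in prefix_index.get(read[-length:], []):
                if i != j:
                    overlaps.append((i, j, length))
    return overlaps


# Toy reads (made up) that chain together end to end.
print(find_overlaps(["ACGTAC", "TACGGA", "GGATTT"], min_overlap=3))
```

    The resulting triples are the edges of an overlap graph; transitive edge removal and branch trimming, as described in the abstract, would be separate later steps that this sketch does not attempt.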

  1. A nine-scaffold genome assembly of the nine chromosome sugar beet

    Science.gov (United States)

    Over the course of 20 months, we assembled a sugar beet genome (700 - 800 Mb) into a close representation of the nine haploid chromosomes of beet. This result was obtained by sequentially assembling sequences >40 kb in length, orienting these assemblies via optical mapping, and scaffolding with in v...

  2. De Novo Genome and Transcriptome Assembly of the Canadian Beaver (Castor canadensis)

    Science.gov (United States)

    Lok, Si; Paton, Tara A.; Wang, Zhuozhi; Kaur, Gaganjot; Walker, Susan; Yuen, Ryan K. C.; Sung, Wilson W. L.; Whitney, Joseph; Buchanan, Janet A.; Trost, Brett; Singh, Naina; Apresto, Beverly; Chen, Nan; Coole, Matthew; Dawson, Travis J.; Ho, Karen; Hu, Zhizhou; Pullenayegum, Sanjeev; Samler, Kozue; Shipstone, Arun; Tsoi, Fiona; Wang, Ting; Pereira, Sergio L.; Rostami, Pirooz; Ryan, Carol Ann; Tong, Amy Hin Yan; Ng, Karen; Sundaravadanam, Yogi; Simpson, Jared T.; Lim, Burton K.; Engstrom, Mark D.; Dutton, Christopher J.; Kerr, Kevin C. R.; Franke, Maria; Rapley, William; Wintle, Richard F.; Scherer, Stephen W.

    2017-01-01

    The Canadian beaver (Castor canadensis) is the largest indigenous rodent in North America. We report a draft annotated assembly of the beaver genome, the first for a large rodent and the first mammalian genome assembled directly from uncorrected and moderate coverage (< 30 ×) long reads generated by single-molecule sequencing. The genome size is 2.7 Gb estimated by k-mer analysis. We assembled the beaver genome using the new Canu assembler optimized for noisy reads. The resulting assembly was refined using Pilon supported by short reads (80 ×) and checked for accuracy by congruency against an independent short read assembly. We scaffolded the assembly using the exon–gene models derived from 9805 full-length open reading frames (FL-ORFs) constructed from the beaver leukocyte and muscle transcriptomes. The final assembly comprised 22,515 contigs with an N50 of 278,680 bp and an N50-scaffold of 317,558 bp. Maximum contig and scaffold lengths were 3.3 and 4.2 Mb, respectively, with a combined scaffold length representing 92% of the estimated genome size. The completeness and accuracy of the scaffold assembly was demonstrated by the precise exon placement for 91.1% of the 9805 assembled FL-ORFs and 83.1% of the BUSCO (Benchmarking Universal Single-Copy Orthologs) gene set used to assess the quality of genome assemblies. Well-represented were genes involved in dentition and enamel deposition, defining characteristics of rodents with which the beaver is well-endowed. The study provides insights for genome assembly and an important genomics resource for Castoridae and rodent evolutionary biology. PMID:28087693
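
    Several records in this list report contig or scaffold N50 values. As a reminder of what the metric means, here is a minimal Python sketch; the function name `n50` and the toy lengths are assumptions for illustration, and the input is simply a list of contig (or scaffold) lengths in base pairs.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of longest
    sequences that together cover at least half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


# Toy contig lengths (made up), total 54, so half is 27.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

    Because N50 depends only on the length distribution, it measures contiguity and says nothing about whether the contigs are correctly assembled.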

  3. De Novo Genome and Transcriptome Assembly of the Canadian Beaver (Castor canadensis)

    Directory of Open Access Journals (Sweden)

    Si Lok

    2017-02-01

    The Canadian beaver (Castor canadensis) is the largest indigenous rodent in North America. We report a draft annotated assembly of the beaver genome, the first for a large rodent and the first mammalian genome assembled directly from uncorrected and moderate coverage (< 30 ×) long reads generated by single-molecule sequencing. The genome size is 2.7 Gb estimated by k-mer analysis. We assembled the beaver genome using the new Canu assembler optimized for noisy reads. The resulting assembly was refined using Pilon supported by short reads (80 ×) and checked for accuracy by congruency against an independent short read assembly. We scaffolded the assembly using the exon–gene models derived from 9805 full-length open reading frames (FL-ORFs) constructed from the beaver leukocyte and muscle transcriptomes. The final assembly comprised 22,515 contigs with an N50 of 278,680 bp and an N50-scaffold of 317,558 bp. Maximum contig and scaffold lengths were 3.3 and 4.2 Mb, respectively, with a combined scaffold length representing 92% of the estimated genome size. The completeness and accuracy of the scaffold assembly was demonstrated by the precise exon placement for 91.1% of the 9805 assembled FL-ORFs and 83.1% of the BUSCO (Benchmarking Universal Single-Copy Orthologs) gene set used to assess the quality of genome assemblies. Well-represented were genes involved in dentition and enamel deposition, defining characteristics of rodents with which the beaver is well-endowed. The study provides insights for genome assembly and an important genomics resource for Castoridae and rodent evolutionary biology.

  4. De novo assembly of human genomes with massively parallel short read sequencing

    DEFF Research Database (Denmark)

    Li, Ruiqiang; Zhu, Hongmei; Ruan, Jue

    2010-01-01

    genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities...... for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way....

  5. De novo assembly of the carrot mitochondrial genome using next generation sequencing of whole genomic DNA provides first evidence of DNA transfer into an angiosperm plastid genome

    Directory of Open Access Journals (Sweden)

    Iorizzo Massimo

    2012-05-01

    Background: Sequence analysis of organelle genomes has revealed important aspects of plant cell evolution. The scope of this study was to develop an approach for de novo assembly of the carrot mitochondrial genome using next generation sequence data from total genomic DNA. Results: Sequencing data from a carrot 454 whole genome library were used to develop a de novo assembly of the mitochondrial genome. Development of a new bioinformatic tool allowed visualizing contig connections and elucidation of the de novo assembly. Southern hybridization demonstrated recombination across two large repeats. Genome annotation allowed identification of 44 protein coding genes, three rRNA and 17 tRNA. Identification of the plastid genome sequence allowed organelle genome comparison. Mitochondrial intergenic sequence analysis allowed detection of a fragment of DNA specific to the carrot plastid genome. PCR amplification and sequence analysis across different Apiaceae species revealed consistent conservation of this fragment in the mitochondrial genomes and an insertion in Daucus plastid genomes, giving evidence of a mitochondrial to plastid transfer of DNA. Sequence similarity with a retrotransposon element suggests a possibility that a transposon-like event transferred this sequence into the plastid genome. Conclusions: This study confirmed that whole genome sequencing is a practical approach for de novo assembly of higher plant mitochondrial genomes. In addition, a new aspect of intercompartmental genome interaction was reported providing the first evidence for DNA transfer into an angiosperm plastid genome. The approach used here could be used more broadly to sequence and assemble mitochondrial genomes of diverse species. This information will allow us to better understand intercompartmental interactions and cell evolution.

  6. Genome-wide microsatellite identification in the fungus Anisogramma anomala using Illumina sequencing and genome assembly.

    Directory of Open Access Journals (Sweden)

    Guohong Cai

    High-throughput sequencing has been dramatically accelerating the discovery of microsatellite markers (also known as Simple Sequence Repeats). Both 454 and Illumina reads have been used directly in microsatellite discovery and primer design (the "Seq-to-SSR" approach). However, constraints of this approach include: 1) many microsatellite-containing reads do not have sufficient flanking sequences to allow primer design, and 2) difficulties in removing microsatellite loci residing in longer, repetitive regions. In the current study, we applied the novel "Seq-Assembly-SSR" approach to overcome these constraints in Anisogramma anomala. In our approach, Illumina reads were first assembled into a draft genome, and the latter was then used in microsatellite discovery. A. anomala is an obligate biotrophic ascomycete that causes eastern filbert blight disease of commercial European hazelnut. Little is known about its population structure or diversity. Approximately 26 M 146 bp Illumina reads were generated from a paired-end library of a fungal strain from Oregon. The reads were assembled into a draft genome of 333 Mb (excluding gaps), with contig N50 of 10,384 bp and scaffold N50 of 32,987 bp. A bioinformatics pipeline identified 46,677 microsatellite motifs at 44,247 loci, including 2,430 compound loci. Primers were successfully designed for 42,923 loci (97%). After removing 2,886 loci close to assembly gaps and 676 loci in repetitive regions, a genome-wide microsatellite database of 39,361 loci was generated for the fungus. In experimental screening of 236 loci using four geographically representative strains, 228 (96.6%) were successfully amplified and 214 (90.7%) produced single PCR products. Twenty-three (9.7%) were found to be perfect polymorphic loci. A small-scale population study using 11 polymorphic loci revealed considerable gene diversity. Clustering analysis grouped isolates of this fungus into two clades in accordance with their geographic origins.

  7. Genome-wide microsatellite identification in the fungus Anisogramma anomala using Illumina sequencing and genome assembly.

    Science.gov (United States)

    Cai, Guohong; Leadbetter, Clayton W; Muehlbauer, Megan F; Molnar, Thomas J; Hillman, Bradley I

    2013-01-01

    High-throughput sequencing has been dramatically accelerating the discovery of microsatellite markers (also known as Simple Sequence Repeats). Both 454 and Illumina reads have been used directly in microsatellite discovery and primer design (the "Seq-to-SSR" approach). However, constraints of this approach include: 1) many microsatellite-containing reads do not have sufficient flanking sequences to allow primer design, and 2) difficulties in removing microsatellite loci residing in longer, repetitive regions. In the current study, we applied the novel "Seq-Assembly-SSR" approach to overcome these constraints in Anisogramma anomala. In our approach, Illumina reads were first assembled into a draft genome, and the latter was then used in microsatellite discovery. A. anomala is an obligate biotrophic ascomycete that causes eastern filbert blight disease of commercial European hazelnut. Little is known about its population structure or diversity. Approximately 26 M 146 bp Illumina reads were generated from a paired-end library of a fungal strain from Oregon. The reads were assembled into a draft genome of 333 Mb (excluding gaps), with contig N50 of 10,384 bp and scaffold N50 of 32,987 bp. A bioinformatics pipeline identified 46,677 microsatellite motifs at 44,247 loci, including 2,430 compound loci. Primers were successfully designed for 42,923 loci (97%). After removing 2,886 loci close to assembly gaps and 676 loci in repetitive regions, a genome-wide microsatellite database of 39,361 loci was generated for the fungus. In experimental screening of 236 loci using four geographically representative strains, 228 (96.6%) were successfully amplified and 214 (90.7%) produced single PCR products. Twenty-three (9.7%) were found to be perfect polymorphic loci. A small-scale population study using 11 polymorphic loci revealed considerable gene diversity. Clustering analysis grouped isolates of this fungus into two clades in accordance with their geographic origins. 
Thus, the
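
    To make the notion of scanning an assembled sequence for microsatellites concrete, the following Python sketch finds perfect tandem repeats with a regular expression. It is not the bioinformatics pipeline used in the study; the function name `find_microsatellites` and the thresholds `min_repeats` and `max_motif` are illustrative assumptions.

```python
import re


def find_microsatellites(seq, min_repeats=4, max_motif=6):
    """Return (start, motif, repeat_count) for perfect tandem repeats in seq.

    Hypothetical helper: the lazy quantifier makes the regex prefer the
    shortest motif that repeats at least min_repeats times in a row.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Toy sequence (made up) containing four copies of the dinucleotide "AC".
print(find_microsatellites("TTACACACACGG"))
```

    A real workflow would also need flanking sequence for primer design and filters for loci near assembly gaps or in repetitive regions, as the abstract describes; none of that is attempted here.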

  8. Optimizing information in Next-Generation-Sequencing (NGS) reads for improving de novo genome assembly.

    Science.gov (United States)

    Liu, Tsunglin; Tsai, Cheng-Hung; Lee, Wen-Bin; Chiang, Jung-Hsien

    2013-01-01

    Next-Generation-Sequencing is advantageous because of its much higher data throughput and much lower cost compared with the traditional Sanger method. However, NGS reads are shorter than Sanger reads, making de novo genome assembly very challenging. Because genome assembly is essential for all downstream biological studies, great efforts have been made to enhance the completeness of genome assembly, which requires the presence of long reads or long distance information. To improve de novo genome assembly, we developed a computational program, ARF-PE, to increase the length of Illumina reads. ARF-PE takes as input Illumina paired-end (PE) reads and recovers the original DNA fragments from whose two ends the paired reads were obtained. On the PE data of four bacteria, ARF-PE recovered >87% of the DNA fragments and achieved >98% perfect DNA fragment recovery. Using Velvet, SOAPdenovo, Newbler, and CABOG, we evaluated the benefits of recovered DNA fragments to genome assembly. For all four bacteria, the recovered DNA fragments increased the assembly contiguity. For example, the N50 lengths of the P. brasiliensis contigs assembled by SOAPdenovo and Newbler increased from 80,524 bp to 166,573 bp and from 80,655 bp to 193,388 bp, respectively. ARF-PE also increased assembly accuracy in many cases. On the PE data of two fungi and a human chromosome, ARF-PE doubled and tripled the N50 length. The assembly accuracies dropped but still remained >91%. In general, ARF-PE can increase both assembly contiguity and accuracy for bacterial genomes. For complex eukaryotic genomes, ARF-PE is promising because it raises assembly contiguity, but future error correction is needed for ARF-PE to also increase the assembly accuracy. ARF-PE is freely available at http://140.116.235.124/~tliu/arf-pe/.

  9. A high-quality carrot genome assembly provides new insights into carotenoid accumulation and asterid genome evolution

    Science.gov (United States)

    We report a chromosome-scale assembly and analysis of the Daucus carota genome, an important source of provitamin A in the human diet and the first sequenced genome among members of the Euasterid II clade. We characterized two new polyploidization events, both occurring after the divergence of carro...

  10. Tips and tricks for the assembly of a Corynebacterium pseudotuberculosis genome using a semiconductor sequencer

    DEFF Research Database (Denmark)

    Ramos, Rommel Thiago Jucá; Carneiro, Adriana Ribeiro; Soares, Siomar de Castro;

    2013-01-01

    New sequencing platforms have enabled rapid decoding of complete prokaryotic genomes at relatively low cost. The Ion Torrent platform is an example of these technologies, characterized by lower coverage, generating challenges for the genome assembly. One particular problem is the lack of genomes...... data obtained compared with traditional quality filter approaches. Data preprocessing prior to the de novo assembly enabled the use of known methodologies in the next-generation sequencing data assembly. Moreover, manual curation was proved to be essential for ensuring a quality assembly, which...... that enable reference-based assembly, such as the one used in the present study, Corynebacterium pseudotuberculosis biovar equi, which causes high economic losses in the US equine industry. The quality treatment strategy incorporated into the assembly pipeline enabled a 16-fold greater use of the sequencing...

  11. Computational Comparison of Human Genomic Sequence Assemblies for a Region of Chromosome 4

    OpenAIRE

    Semple, Colin; Stewart W. Morris; Porteous, David J.; Evans, Kathryn L.

    2002-01-01

    Much of the available human genomic sequence data exist in a fragmentary draft state following the completion of the initial high-volume sequencing performed by the International Human Genome Sequencing Consortium (IHGSC) and Celera Genomics (CG). We compared six draft genome assemblies over a region of chromosome 4p (D4S394–D4S403), two consecutive releases by the IHGSC at University of California, Santa Cruz (UCSC), two consecutive releases from the National Centre for Biotechnology Informa...

  12. Molecular Assemblies, Genes and Genomics Integrated Efficiently (MAGGIE)

    Energy Technology Data Exchange (ETDEWEB)

    Baliga, Nitin S

    2011-05-26

    Final report on MAGGIE. We set ambitious goals to model the functions of individual organisms and their community from molecular to systems scale. These scientific goals are driving the development of sophisticated algorithms to analyze large amounts of experimental measurements made using high throughput technologies to explain and predict how the environment influences biological function at multiple scales and how the microbial systems in turn modify the environment. By experimentally evaluating predictions made using these models we will test the degree to which our quantitative multiscale understanding will help to rationally steer individual microbes and their communities towards specific tasks. Towards this end we have made substantial progress towards understanding evolution of gene families, transcriptional structures, detailed structures of keystone molecular assemblies (proteins and complexes), protein interactions, biological networks, microbial interactions, and community structure. Using comparative analysis we have tracked the evolutionary history of gene functions to understand how novel functions evolve. One level up, we have used proteomics data, high-resolution genome tiling microarrays, and 5' RNA sequencing to revise genome annotations, discover new genes including ncRNAs, and map dynamically changing operon structures of five model organisms: Desulfovibrio vulgaris Hildenborough, Pyrococcus furiosus, Sulfolobus solfataricus, Methanococcus maripaludis and Halobacterium salinarum NRC-1. We have developed machine learning algorithms to accurately identify protein interactions at a near-zero false positive rate from noisy data generated using tagless complex purification, TAP purification, and analysis of membrane complexes. Combining other genome-scale datasets produced by ENIGMA (in particular, microarray data) and available from the literature we have been able to achieve a true positive rate as high as 65% at almost zero false positives

  13. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd

    Energy Technology Data Exchange (ETDEWEB)

    Fleischmann, R.D.; Adams, M.D.; White, O. [Institute for Genomic Research, Gaithersburg, MD (United States)] [and others]

    1995-07-28

    An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Haemophilus influenzae Rd. This approach eliminates the need for initial mapping efforts and is therefore applicable to the vast array of microbial species for which genome maps are unavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase accession number L42023) represents the only complete genome sequence from a free-living organism. 46 refs., 4 figs., 4 tabs.

  14. SEQUENCING AND DE NOVO DRAFT ASSEMBLIES OF A FATHEAD MINNOW (Pimephales promelas) reference genome

    Data.gov (United States)

    U.S. Environmental Protection Agency — The dataset provides the URLs for accessing the genome sequence data and two draft assemblies as well as fathead minnow genotyping data associated with estimating...

  15. The sequence and de novo assembly of the giant panda genome

    Science.gov (United States)

    Li, Ruiqiang; Fan, Wei; Tian, Geng; Zhu, Hongmei; He, Lin; Cai, Jing; Huang, Quanfei; Cai, Qingle; Li, Bo; Bai, Yinqi; Zhang, Zhihe; Zhang, Yaping; Wang, Wen; Li, Jun; Wei, Fuwen; Li, Heng; Jian, Min; Li, Jianwen; Zhang, Zhaolei; Nielsen, Rasmus; Li, Dawei; Gu, Wanjun; Yang, Zhentao; Xuan, Zhaoling; Ryder, Oliver A.; Leung, Frederick Chi-Ching; Zhou, Yan; Cao, Jianjun; Sun, Xiao; Fu, Yonggui; Fang, Xiaodong; Guo, Xiaosen; Wang, Bo; Hou, Rong; Shen, Fujun; Mu, Bo; Ni, Peixiang; Lin, Runmao; Qian, Wubin; Wang, Guodong; Yu, Chang; Nie, Wenhui; Wang, Jinhuan; Wu, Zhigang; Liang, Huiqing; Min, Jiumeng; Wu, Qi; Cheng, Shifeng; Ruan, Jue; Wang, Mingwei; Shi, Zhongbin; Wen, Ming; Liu, Binghang; Ren, Xiaoli; Zheng, Huisong; Dong, Dong; Cook, Kathleen; Shan, Gao; Zhang, Hao; Kosiol, Carolin; Xie, Xueying; Lu, Zuhong; Zheng, Hancheng; Li, Yingrui; Steiner, Cynthia C.; Lam, Tommy Tsan-Yuk; Lin, Siyuan; Zhang, Qinghui; Li, Guoqing; Tian, Jing; Gong, Timing; Liu, Hongde; Zhang, Dejin; Fang, Lin; Ye, Chen; Zhang, Juanbin; Hu, Wenbo; Xu, Anlong; Ren, Yuanyuan; Zhang, Guojie; Bruford, Michael W.; Li, Qibin; Ma, Lijia; Guo, Yiran; An, Na; Hu, Yujie; Zheng, Yang; Shi, Yongyong; Li, Zhiqiang; Liu, Qing; Chen, Yanling; Zhao, Jing; Qu, Ning; Zhao, Shancen; Tian, Feng; Wang, Xiaoling; Wang, Haiyin; Xu, Lizhi; Liu, Xiao; Vinar, Tomas; Wang, Yajun; Lam, Tak-Wah; Yiu, Siu-Ming; Liu, Shiping; Zhang, Hemin; Li, Desheng; Huang, Yan; Wang, Xia; Yang, Guohua; Jiang, Zhi; Wang, Junyi; Qin, Nan; Li, Li; Li, Jingxiang; Bolund, Lars; Kristiansen, Karsten; Wong, Gane Ka-Shu; Olson, Maynard; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian; Wang, Jun

    2013-01-01

    Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes. PMID:20010809

  16. Applications of the double-barreled data in whole-genome shotgun sequence assembly and analysis

    Institute of Scientific and Technical Information of China (English)

    HAN Yujun; WANG Jing; GU Xiaocheng; YU Jun; LI Songgang; NI Peixiang; LÜ Hong; YE Jia; HU Jianfei; CHEN Chen; HUANG Xiangang; CONG Lijuan; LI Guangyuan

    2005-01-01

    Double-barreled (DB) data have been widely used for the assembly of large genomes. Based on the experience of building the whole-genome working draft of Oryza sativa L. ssp. indica, we present here the prevailing and improved uses of DB data in the assembly procedure and report on novel applications during the subsequent data-mining processes, such as acquiring precise insert fragment information for each clone across the genome, and a new kind of low-cost whole-genome microarray. With the increasing number of organisms being sequenced, we believe that DB data will play an important role both in other assembly procedures and in future genomic studies.

  17. Parallelized short read assembly of large genomes using de Bruijn graphs

    Directory of Open Access Journals (Sweden)

    Liu Yongchao

    2011-08-01

    Background: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to assemble large genomes from quantities of short reads. Results: We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources. Conclusions: Developing parallel assemblers for large genomes has been garnering significant research efforts due to the explosive size growth of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.
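
    For readers unfamiliar with the data structure named in this record, the following is a minimal single-threaded sketch of a de Bruijn graph built from short reads, followed by a walk along an unbranched path. It is not PASHA's implementation: the function names `build_de_bruijn` and `walk_unambiguous`, the toy reads, and the choice k = 4 are all assumptions made for illustration, and error correction, reverse complements and parallelism are left out.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one distinct successor."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in seen:
            break  # cycle: stop rather than loop forever
        seen.add(node)
        contig += node[-1]
    return contig


# Toy reads (hypothetical), overlapping fragments of one short sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

    In this toy case every (k-1)-mer occurs once, so the walk recovers the full sequence; real genomes contain repeats that create branches, which is where most of an assembler's work lies.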

  18. The Fast Changing Landscape of Sequencing Technologies and Their Impact on Microbial Genome Assemblies and Annotation

    Energy Technology Data Exchange (ETDEWEB)

    Mavromatis, K [U.S. Department of Energy, Joint Genome Institute; Land, Miriam L [ORNL; Brettin, Thomas S [ORNL; Quest, Daniel J [ORNL; Copeland, A [U.S. Department of Energy, Joint Genome Institute; Clum, Alicia [U.S. Department of Energy, Joint Genome Institute; Goodwin, Lynne A. [Los Alamos National Laboratory (LANL); Woyke, Tanja [U.S. Department of Energy, Joint Genome Institute; Lapidus, Alla L. [U.S. Department of Energy, Joint Genome Institute; Klenk, Hans-Peter [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Cottingham, Robert W [ORNL; Kyrpides, Nikos C [U.S. Department of Energy, Joint Genome Institute

    2012-01-01

    Background: The emergence of next generation sequencing (NGS) has provided the means for rapid and high throughput sequencing and data generation at low cost, while concomitantly creating a new set of challenges. The number of available assembled microbial genomes continues to grow rapidly and their quality reflects the quality of the sequencing technology used, but also of the analysis software employed for assembly and annotation. Methodology/Principal Findings: In this work, we have explored the quality of the microbial draft genomes across various sequencing technologies. We have compared the draft and finished assemblies of 133 microbial genomes sequenced at the Department of Energy-Joint Genome Institute and finished at the Los Alamos National Laboratory using a variety of combinations of sequencing technologies, reflecting the transition of the institute from Sanger-based sequencing platforms to NGS platforms. The quality of the public assemblies and of the associated gene annotations was evaluated using various metrics. Results obtained with the different sequencing technologies, as well as their effects on downstream processes, were analyzed. Our results demonstrate that the Illumina HiSeq 2000 sequencing system, the primary sequencing technology currently used for de novo genome sequencing and assembly at JGI, has various advantages in terms of total sequence throughput and cost, but it also introduces challenges for the downstream analyses. In all cases, assembly results, although of high quality on average, need to be viewed critically, and sources of error in them should be considered prior to analysis. Conclusion: These data follow the evolution of microbial sequencing and downstream processing at the JGI from draft genome sequences with large gaps corresponding to missing genes of significant biological role to assemblies with multiple small gaps (Illumina) and finally to assemblies that generate almost complete genomes (Illumina+PacBio).

  19. Optimizing k-mer size using a variant grid search to enhance de novo genome assembly

    Science.gov (United States)

    Cha, Soyeon; Bird, David McK

    2016-01-01

    Largely driven by huge reductions in per-base costs, sequencing nucleic acids has become a near-ubiquitous technique in laboratories performing biological and biomedical research. Most of the effort goes to re-sequencing, but assembly of de novo generated, raw sequence reads into contigs that span as much of the genome as possible is central to many projects. Although truly complete coverage is not realistically attainable, maximizing the amount of sequence that can be correctly assembled into contigs contributes to coverage. Here we compare three commonly used assembly algorithms (ABySS, Velvet and SOAPdenovo2), and show that empirical optimization of k-mer values has a disproportionate influence on de novo assembly of a eukaryotic genome, the nematode parasite Meloidogyne chitwoodi. Each assembler was challenged with about 40 million Illumina II paired-end reads, and assemblies were performed under a range of k-mer sizes. In each instance, the optimal k-mer was 127, although based on N50 values, ABySS was more efficient than the others. That the assembly was not spurious was established using the “Core Eukaryotic Gene Mapping Approach”, which indicated that 98.79% of the M. chitwoodi genome was accounted for by the assembly. Subsequent gene finding and annotation are consistent with this and suggest that k-mer optimization contributes to the robustness of assembly. PMID:28104957
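
    The N50 statistic used above to compare assemblies, and the idea of picking a k-mer size by scanning a grid of candidates, can be sketched as follows. This is an illustrative sketch only: the contig lengths and k values are invented toy numbers, not data from the study, and the helper name `n50` and the dictionary `assemblies` are assumptions. In practice each entry would come from running an assembler at that k.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths obtained at three candidate k-mer sizes (toy values).
assemblies = {
    21: [50, 40, 30, 20, 10],
    31: [100, 30, 20],
    41: [80, 40, 20, 10],
}

# Simple grid search: keep the k whose assembly has the highest N50.
best_k = max(assemblies, key=lambda k: n50(assemblies[k]))
print(best_k, n50(assemblies[best_k]))
```

    N50 rewards contiguity but says nothing about correctness, which is why the abstract pairs it with an independent completeness check.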

  20. Assembling and exploring the Cochliobolus miyabeanus genome of a strain pathogenic on wildrice (Zizania palustris)

    Science.gov (United States)

    The genome of a strain of C. miyabeanus was shotgun sequenced by paired-end reads with Illumina HiSeq 2000 technology. The genome was assembled with ABySS software yielding a total size of 34.96 Mb (114X), with N50 = 99.43 kb contained in the largest 105 scaffolds and a maximum scaffold length of 40...

  1. Analysis Of Transcriptomes In A Porcine Tissue Collection Using RNA-Seq And Genome Assembly 10

    DEFF Research Database (Denmark)

    Hornshøj, Henrik; Thomsen, Bo; Hedegaard, Jakob

    2011-01-01

    The release of Sus scrofa genome assembly 10 supports improvement of the pig genome annotation and in depth transcriptome analyses using next-generation sequencing technologies. In this study we analyze RNA-seq reads from a tissue collection, including 10 separate tissues from Duroc boars and 10...

  2. Tick Genome Assembled: New Opportunities for Research on Tick-Host-Pathogen Interactions

    Science.gov (United States)

    de la Fuente, José; Waterhouse, Robert M.; Sonenshine, Daniel E.; Roe, R. Michael; Ribeiro, Jose M.; Sattelle, David B.; Hill, Catherine A.

    2016-01-01

    As tick-borne diseases are on the rise, an international effort resulted in the sequence and assembly of the first genome of a tick vector. This result promotes research on comparative, functional and evolutionary genomics and the study of tick-host-pathogen interactions to improve human, animal and ecosystem health on a global scale. PMID:27695689

  3. Assembly and sorting of homologous BAC contigs in allotetraploid cotton genomes

    Science.gov (United States)

    Upland cotton (G. hirsutum) is a diploidized allopolyploid species containing At and Dt sub-genomes that have partial homology. Assembly and sorting of homologous BAC contigs into their subgenomes and further to individual chromosomes are of both great interest and great challenge for genome-wide i...

  4. The power of single molecule real-time sequencing technology in the de novo assembly of a eukaryotic genome.

    Science.gov (United States)

    Sakai, Hiroaki; Naito, Ken; Ogiso-Tanaka, Eri; Takahashi, Yu; Iseki, Kohtaro; Muto, Chiaki; Satou, Kazuhito; Teruya, Kuniko; Shiroma, Akino; Shimoji, Makiko; Hirano, Takashi; Itoh, Takeshi; Kaga, Akito; Tomooka, Norihiko

    2015-11-30

    Second-generation sequencers (SGS) have been game-changing, achieving cost-effective whole genome sequencing in many non-model organisms. However, a large portion of the genomes still remains unassembled. We reconstructed the azuki bean (Vigna angularis) genome using single molecule real-time (SMRT) sequencing technology and achieved the best contiguity and coverage among currently assembled legume crops. The SMRT-based assembly produced contigs 100 times longer, with a 100 times smaller amount of gaps, compared to the SGS-based assemblies. A detailed comparison between the assemblies revealed that the SMRT-based assembly enabled a more comprehensive gene annotation than the SGS-based assemblies, where thousands of genes were missing or fragmented. A chromosome-scale assembly was generated based on the high-density genetic map, covering 86% of the azuki bean genome. We demonstrated that SMRT technology, though it still needed the support of SGS data, achieved a near-complete assembly of a eukaryotic genome.

  5. Total Chemical Synthesis, Assembly of Human Torque Teno Virus Genome

    Institute of Scientific and Technical Information of China (English)

    Zheng Hou; Gengfu Xiao

    2011-01-01

    Torque teno virus (TTV) is a nonenveloped virus containing a single-stranded, circular DNA genome of approximately 3.8 kb. We completely synthesized the 3808 nucleotides of the TTV (SANBAN isolate) genome, which contains a hairpin structure and a GC-rich region. More than 100 overlapping oligonucleotides were chemically synthesized and assembled by polymerase chain assembly reaction (PCA), and the synthesis was completed with splicing by overlap extension (SOEing). This study establishes the methodological basis of the chemical synthesis of a viral genome for use as a live attenuated vaccine or gene therapy vector.

  6. Nature-inspired novel Cuckoo Search Algorithm for genome sequence assembly

    Indian Academy of Sciences (India)

    R Indumathy; S Uma Maheswari; G Subashini

    2015-02-01

    This study aims to produce a novel optimization algorithm, called the Cuckoo Search Algorithm (CS), for solving the genome sequence assembly problem. Assembly of a genome sequence is a technique that attempts to rebuild the target sequence from a collection of fragments. This study is the first application of the CS to the DNA sequence assembly problem in the literature. The algorithm is based on Lévy flight behaviour and brood parasitic behaviour. The CS algorithm is employed to maximize the overlap score by reconstructing the original DNA sequence. Experimental results show the ability of the CS to find better genome assemblies. To check the efficiency of the proposed technique, the results of the CS are compared with those of a well-known evolutionary algorithm, particle swarm optimization (PSO), and its variants.
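
    The objective mentioned above, an overlap score maximized over fragment orderings, can be illustrated with a small sketch. The scoring used here (exact suffix-prefix overlap summed over consecutive fragments) is an assumption chosen for simplicity and may differ from the scoring in the paper; the names `overlap` and `overlap_score` and the toy fragments are hypothetical. The search procedure itself (Cuckoo Search or PSO) is not implemented, only the function such a search would try to maximize.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b` (exact match)."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def overlap_score(fragments, order):
    """Sum of overlaps between consecutive fragments when laid out in `order`."""
    return sum(
        overlap(fragments[i], fragments[j])
        for i, j in zip(order, order[1:])
    )


# Toy fragments (hypothetical) of the sequence ATGGCGTGCA, given in shuffled order.
fragments = ["GTGCA", "ATGGC", "GGCGTG"]

good = overlap_score(fragments, [1, 2, 0])  # layout that follows the true sequence
bad = overlap_score(fragments, [0, 1, 2])   # layout in the shuffled input order
print(good, bad)
```

    A metaheuristic treats each permutation of fragment indices as a candidate solution and uses a score of this kind as its fitness.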

  7. Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies

    Directory of Open Access Journals (Sweden)

    Tom O. Delmont

    2016-03-01

    High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination of non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and most of these contaminants have differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.

  8. Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies

    Science.gov (United States)

    Delmont, Tom O.

    2016-01-01

    High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination of non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and most of these contaminants have differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes. PMID:27069789

  9. Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies.

    Science.gov (United States)

    Delmont, Tom O; Eren, A Murat

    2016-01-01

    High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination of non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and most of these contaminants have differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today's microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.

  10. Sequencing and de novo draft assemblies of a fathead minnow (Pimephales promelas) reference genome.

    Science.gov (United States)

    Burns, Frank R; Cogburn, Amarin L; Ankley, Gerald T; Villeneuve, Daniel L; Waits, Eric; Chang, Yun-Juan; Llaca, Victor; Deschamps, Stephane D; Jackson, Raymond E; Hoke, Robert Alan

    2016-01-01

    The present study was undertaken to provide the foundation for development of genome-scale resources for the fathead minnow (Pimephales promelas), an important model organism widely used in both aquatic toxicology research and regulatory testing. The authors report on the first sequencing and 2 draft assemblies for the reference genome of this species. Approximately 120× sequence coverage was achieved via Illumina sequencing of a combination of paired-end, mate-pair, and fosmid libraries. Evaluation and comparison of these assemblies demonstrate that they are of sufficient quality to be useful for genome-enabled studies, with 418 of 458 (91%) conserved eukaryotic genes mapping to at least 1 of the assemblies. In addition to its immediate utility, the present work provides a strong foundation on which to build further refinements of a reference genome for the fathead minnow.

  11. The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads

    DEFF Research Database (Denmark)

    Wang, Zhiwen; Hobson, Neil; Galindo, Leonardo

    2012-01-01

    to 10 kb were sequenced using an Illumina genome analyzer. A de novo assembly, comprised exclusively of deep-coverage (approximately 94× raw, approximately 69× filtered) short-sequence reads (44-100 bp), produced a set of scaffolds with N(50) =694 kb, including contigs with N(50)=20.1 kb. The contig....... A total of 43384 protein-coding genes were predicted in the whole-genome shotgun assembly, and up to 93% of published flax ESTs, and 86% of A. thaliana genes aligned to these predicted genes, indicating excellent coverage and accuracy at the gene level. Analysis of the synonymous substitution rates (K...... these results show that de novo assembly, based solely on whole-genome shotgun short-sequence reads, is an efficient means of obtaining nearly complete genome sequence information for some plant species....

  12. Augmenting transcriptome assembly by combining de novo and genome-guided tools.

    Science.gov (United States)

    Jain, Prachi; Krishnan, Neeraja M; Panda, Binay

    2013-01-01

    Researchers interested in studying and constructing transcriptomes, especially for non-model species, face the conundrum of choosing from a number of available de novo and genome-guided assemblers. None of the popular assembly tools in use today achieve requisite sensitivity, specificity or recovery of full-length transcripts on their own. Here, we present a comprehensive comparative study of the performance of various assemblers. Additionally, we present an approach to combinatorially augment transcriptome assembly by using both de novo and genome-guided tools. In our study, we obtained the best recovery and most full-length transcripts with Trinity and TopHat1-Cufflinks, respectively. The sensitivity of the assembly and isoform recovery was superior, without compromising much on the specificity, when transcripts from Trinity were augmented with those from TopHat1-Cufflinks.

  13. Mind the gap; seven reasons to close fragmented genome assemblies

    Science.gov (United States)

    Like other domains of life, research into the biology of filamentous microbes has greatly benefited from the advent of whole-genome sequencing. Next-generation sequencing (NGS) technologies have revolutionized sequencing, making genomic sciences accessible to many academic laboratories including tho...

  14. Exploring an Annotated Sequence Assembly of the Perennial Ryegrass Genome for Genomic Regions Enriched for Trait Associated Variants

    DEFF Research Database (Denmark)

    Byrne, Stephen; Cericola, Fabio; Janss, Luc

    2015-01-01

    Perennial ryegrass (Lolium perenne L.) is an outbreeding diploid species and one of the most important forage crops used in temperate agriculture. We have developed a draft sequence assembly of the perennial ryegrass genome and annotated it with the aid of RNA-seq data from various genotypes, plant...

  15. Whole genome assembly of a natto production strain Bacillus subtilis natto from very short read data

    Directory of Open Access Journals (Sweden)

    Fujiyama Asao

    2010-04-01

    Background: Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and functions as a starter for the production of the traditional Japanese food "natto" made from soybeans. Although re-sequencing whole genomes of several laboratory domesticated B. subtilis 168 derivatives has already been attempted using short read sequencing data, the assembly of the whole genome sequence of a closely related strain, B. subtilis natto, from very short read data is more challenging, particularly with our aim to assemble one fully connected scaffold from short reads around 35 bp in length. Results: We applied a comparative genome assembly method, which combines de novo assembly and reference guided assembly, to one of the B. subtilis natto strains. We successfully assembled 28 scaffolds and managed to avoid substantial fragmentation. Completion of the assembly through long PCR experiments resulted in one connected scaffold for B. subtilis natto. Based on the assembled genome sequence, our orthologous gene analysis between natto BEST195 and Marburg 168 revealed that 82.4% of 4375 predicted genes in BEST195 are one-to-one orthologous to genes in 168, with two genes in-paralog, 3.2% are deleted in 168, 14.3% are inserted in BEST195, and 5.9% of genes present in 168 are deleted in BEST195. The natto genome contains the same alleles in the promoter region of degQ and the coding region of swrAA as the wild strain, RO-FF-1. These are specific for γ-PGA production ability, which is related to natto production. Further, the B. subtilis natto strain completely lacked a polyketide synthesis operon, disrupted the plipastatin production operon, and possesses previously unidentified transposases. Conclusions: The determination of the whole genome sequence of Bacillus subtilis natto provided detailed analyses of a set of genes related to natto production, demonstrating the number and locations of insertion sequences that B

  16. Whole genome assembly of a natto production strain Bacillus subtilis natto from very short read data.

    Science.gov (United States)

    Nishito, Yukari; Osana, Yasunori; Hachiya, Tsuyoshi; Popendorf, Kris; Toyoda, Atsushi; Fujiyama, Asao; Itaya, Mitsuhiro; Sakakibara, Yasubumi

    2010-04-16

    Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and functions as a starter for the production of the traditional Japanese food "natto" made from soybeans. Although re-sequencing whole genomes of several laboratory domesticated B. subtilis 168 derivatives has already been attempted using short read sequencing data, the assembly of the whole genome sequence of a closely related strain, B. subtilis natto, from very short read data is more challenging, particularly with our aim to assemble one fully connected scaffold from short reads around 35 bp in length. We applied a comparative genome assembly method, which combines de novo assembly and reference guided assembly, to one of the B. subtilis natto strains. We successfully assembled 28 scaffolds and managed to avoid substantial fragmentation. Completion of the assembly through long PCR experiments resulted in one connected scaffold for B. subtilis natto. Based on the assembled genome sequence, our orthologous gene analysis between natto BEST195 and Marburg 168 revealed that 82.4% of 4375 predicted genes in BEST195 are one-to-one orthologous to genes in 168, with two genes in-paralog, 3.2% are deleted in 168, 14.3% are inserted in BEST195, and 5.9% of genes present in 168 are deleted in BEST195. The natto genome contains the same alleles in the promoter region of degQ and the coding region of swrAA as the wild strain, RO-FF-1. These are specific for gamma-PGA production ability, which is related to natto production. Further, the B. subtilis natto strain completely lacked a polyketide synthesis operon, disrupted the plipastatin production operon, and possesses previously unidentified transposases. The determination of the whole genome sequence of Bacillus subtilis natto provided detailed analyses of a set of genes related to natto production, demonstrating the number and locations of insertion sequences that B. subtilis natto harbors but B. subtilis 168 lacks

  17. Enhanced de novo assembly of high throughput pyrosequencing data using whole genome mapping.

    Science.gov (United States)

    Onmus-Leone, Fatma; Hang, Jun; Clifford, Robert J; Yang, Yu; Riley, Matthew C; Kuschner, Robert A; Waterman, Paige E; Lesho, Emil P

    2013-01-01

    Despite major advances in next-generation sequencing, assembly of sequencing data, especially data from novel microorganisms or re-emerging pathogens, remains constrained by the lack of suitable reference sequences. De novo assembly is the best approach to achieve an accurate finished sequence, but multiple sequencing platforms or paired-end libraries are often required to achieve full genome coverage. In this study, we demonstrated a method to assemble complete bacterial genome sequences by integrating shotgun Roche 454 pyrosequencing with optical whole genome mapping (WGM). The whole genome restriction map (WGRM) was used as the reference to scaffold de novo assembled sequence contigs through a stepwise process. Large de novo contigs were placed in the correct order and orientation through alignment to the WGRM. De novo contigs that were not aligned to the WGRM were merged into scaffolds using contig branching structure information. These extended scaffolds were then aligned to the WGRM to identify the overlaps to be eliminated and the gaps and mismatches to be resolved with unused contigs. The process was repeated until a sequence with full coverage and alignment with the whole genome map was achieved. Using this method we were able to achieve 100% WGRM coverage without a paired-end library. We assembled complete sequences for three distinct genetic components of a clinical isolate of Providencia stuartii: a bacterial chromosome, a novel bla NDM-1 plasmid, and a novel bacteriophage, without separately purifying them to homogeneity.

  18. Lineage-specific biology revealed by a finished genome assembly of the mouse.

    Science.gov (United States)

    Church, Deanna M; Goodstadt, Leo; Hillier, Ladeana W; Zody, Michael C; Goldstein, Steve; She, Xinwei; Bult, Carol J; Agarwala, Richa; Cherry, Joshua L; DiCuccio, Michael; Hlavina, Wratko; Kapustin, Yuri; Meric, Peter; Maglott, Donna; Birtle, Zoë; Marques, Ana C; Graves, Tina; Zhou, Shiguo; Teague, Brian; Potamousis, Konstantinos; Churas, Christopher; Place, Michael; Herschleb, Jill; Runnheim, Ron; Forrest, Daniel; Amos-Landgraf, James; Schwartz, David C; Cheng, Ze; Lindblad-Toh, Kerstin; Eichler, Evan E; Ponting, Chris P

    2009-05-05

    The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non-protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not.

  19. Lineage-specific biology revealed by a finished genome assembly of the mouse.

    Directory of Open Access Journals (Sweden)

    Deanna M Church

    2009-05-01

    The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non-protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not.

  20. Bonus Organisms in High-Throughput Eukaryotic Whole-Genome Shotgun Assembly

    Energy Technology Data Exchange (ETDEWEB)

    Pangilinan, Jasmyn; Shapiro, Harris; Tu, Hank; Platt, Darren

    2006-02-06

    The DOE Joint Genome Institute has sequenced over 50 eukaryotic genomes, ranging in size from 15 MB to 1.6 GB, over a wide range of organism types. In the course of doing so, it has become clear that a substantial fraction of these data sets contains bonus organisms, usually prokaryotes, in addition to the desired genome. While some of these additional organisms are extraneous contamination, they are sometimes symbionts, and so can be of biological interest. Therefore, it is desirable to assemble the bonus organisms along with the main genome. This transforms the problem into one of metagenomic assembly, which is considerably more challenging than traditional whole-genome shotgun (WGS) assembly. The different organisms will usually be present at different sequence depths, which is difficult to handle in most WGS assemblers. In addition, with multiple distinct genomes present, chimerism can produce cross-organism combinations. Finally, there is no guarantee that only a single bonus organism will be present. For example, one JGI project contained at least two different prokaryotic contaminants, plus a 145 KB plasmid of unknown origin. We have developed techniques to routinely identify and handle such bonus organisms in a high-throughput sequencing environment. Approaches include screening and partitioning the unassembled data, and iterative subassemblies. These methods are applicable not only to bonus organisms, but also to desired components such as organelles. These procedures have the additional benefit of identifying, and allowing for the removal of, cloning artifacts such as E. coli and spurious vector inclusions.

  1. Whole genome amplification and de novo assembly of single bacterial cells.

    Directory of Open Access Journals (Sweden)

    Sébastien Rodrigue

    BACKGROUND: Single-cell genome sequencing has the potential to allow the in-depth exploration of the vast genetic diversity found in uncultured microbes. We used the marine cyanobacterium Prochlorococcus as a model system for addressing important challenges facing high-throughput whole genome amplification (WGA) and complete genome sequencing of individual cells. METHODOLOGY/PRINCIPAL FINDINGS: We describe a pipeline that enables single-cell WGA on hundreds of cells at a time while virtually eliminating non-target DNA from the reactions. We further developed a post-amplification normalization procedure that mitigates extreme variations in sequencing coverage associated with multiple displacement amplification (MDA), and demonstrated that the procedure increased sequencing efficiency and facilitated genome assembly. We report genome recovery as high as 99.6% with reference-guided assembly, and 95% with de novo assembly starting from a single cell. We also analyzed the impact of chimera formation during MDA on de novo assembly, and discuss strategies to minimize the presence of incorrectly joined regions in contigs. CONCLUSIONS/SIGNIFICANCE: The methods described in this paper will be useful for sequencing genomes of individual cells from a variety of samples.

  2. High-quality genome (re)assembly using chromosomal contact data.

    Science.gov (United States)

    Marie-Nelly, Hervé; Marbouty, Martial; Cournac, Axel; Flot, Jean-François; Liti, Gianni; Parodi, Dante Poggi; Syan, Sylvie; Guillén, Nancy; Margeot, Antoine; Zimmer, Christophe; Koszul, Romain

    2014-12-17

    Closing gaps in draft genome assemblies can be costly and time-consuming, and published genomes are therefore often left 'unfinished.' Here we show that genome-wide chromosome conformation capture (3C) data can be used to overcome these limitations, and present a computational approach rooted in polymer physics that determines the most likely genome structure using chromosomal contact data. This algorithm--named GRAAL--generates high-quality assemblies of genomes in which repeated and duplicated regions are accurately represented and offers a direct probabilistic interpretation of the computed structures. We first validated GRAAL on the reference genome of Saccharomyces cerevisiae, as well as other yeast isolates, where GRAAL recovered both known and unknown complex chromosomal structural variations. We then applied GRAAL to the finishing of the assembly of Trichoderma reesei and obtained a number of contigs congruent with the known karyotype of this species. Finally, we showed that GRAAL can accurately reconstruct human chromosomes from either fragments generated in silico or contigs obtained from de novo assembly. In all these applications, GRAAL compared favourably to recently published programmes implementing related approaches.

  3. Efficient assembly of de novo human artificial chromosomes from large genomic loci

    Directory of Open Access Journals (Sweden)

    Stromberg Gregory

    2005-07-01

    Background: Human Artificial Chromosomes (HACs) are potentially useful vectors for gene transfer studies and for functional annotation of the genome because of their suitability for cloning, manipulating and transferring large segments of the genome. However, development of HACs for the transfer of large genomic loci into mammalian cells has been limited by difficulties in manipulating high-molecular weight DNA, as well as by the low overall frequencies of de novo HAC formation. Indeed, to date, only a small number of large (>100 kb) genomic loci have been reported to be successfully packaged into de novo HACs. Results: We have developed novel methodologies to enable efficient assembly of HAC vectors containing any genomic locus of interest. We report here the creation of a novel, bimolecular system based on bacterial artificial chromosomes (BACs) for the construction of HACs incorporating any defined genomic region. We have utilized this vector system to rapidly design, construct and validate multiple de novo HACs containing large (100–200 kb) genomic loci including therapeutically significant genes for human growth hormone (HGH), polycystic kidney disease (PKD1) and ß-globin. We report significant differences in the ability of different genomic loci to support de novo HAC formation, suggesting possible effects of cis-acting genomic elements. Finally, as a proof of principle, we have observed sustained ß-globin gene expression from HACs incorporating the entire 200 kb ß-globin genomic locus for over 90 days in the absence of selection. Conclusion: Taken together, these results are significant for the development of HAC vector technology, as they enable high-throughput assembly and functional validation of HACs containing any large genomic locus. We have evaluated the impact of different genomic loci on the frequency of HAC formation and identified segments of genomic DNA that appear to facilitate de novo HAC formation. These genomic loci

  4. SMRT sequencing only de novo assembly of the sugar beet (Beta vulgaris) chloroplast genome.

    Science.gov (United States)

    Stadermann, Kai Bernd; Weisshaar, Bernd; Holtgräwe, Daniela

    2015-09-16

    Third generation sequencing methods, like SMRT (Single Molecule, Real-Time) sequencing developed by Pacific Biosciences, offer much longer read length in comparison to Next Generation Sequencing (NGS) methods. Hence, they are well suited for de novo- or re-sequencing projects. Sequences generated for these purposes will not only contain reads originating from the nuclear genome, but also a significant amount of reads originating from the organelles of the target organism. These reads are usually discarded but they can also be used for an assembly of organellar replicons. The long read length supports resolution of repetitive regions and repeats within the organelles genome which might be problematic when just using short read data. Additionally, SMRT sequencing is less influenced by GC rich areas and by long stretches of the same base. We describe a workflow for a de novo assembly of the sugar beet (Beta vulgaris ssp. vulgaris) chloroplast genome sequence only based on data originating from a SMRT sequencing dataset targeted on its nuclear genome. We show that the data obtained from such an experiment are sufficient to create a high quality assembly with a higher reliability than assemblies derived from e.g. Illumina reads only. The chloroplast genome is especially challenging for de novo assembling as it contains two large inverted repeat (IR) regions. We also describe some limitations that still apply even though long reads are used for the assembly. SMRT sequencing reads extracted from a dataset created for nuclear genome (re)sequencing can be used to obtain a high quality de novo assembly of the chloroplast of the sequenced organism. Even with a relatively small overall coverage for the nuclear genome it is possible to collect more than enough reads to generate a high quality assembly that outperforms short read based assemblies. 
    However, even with long reads it is not always possible to clarify the order of elements of a chloroplast genome sequence reliably.

  5. Comparison of different sequencing and assembly strategies for a repeat-rich fungal genome, Ophiocordyceps sinensis.

    Science.gov (United States)

    Li, Yi; Hsiang, Tom; Yang, Rui-Heng; Hu, Xiao-Di; Wang, Ke; Wang, Wen-Jing; Wang, Xiao-Liang; Jiao, Lei; Yao, Yi-Jian

    2016-09-01

    Ophiocordyceps sinensis is one of the most expensive medicinal fungi world-wide, and has been used as a traditional Chinese medicine for centuries. In a recent report, the genome of this fungus was found to be expanded by extensive repetitive elements after assembly of Roche 454 (223Mb) and Illumina HiSeq (10.6Gb) sequencing data, producing a genome of 87.7Mb with an N50 scaffold length of 12kb and 6972 predicted genes. To test whether the assembly could be improved by deeper sequencing and to assess the amount of data needed for optimal assembly, genomic sequencing was run several times on genomic DNA extractions of a single ascospore isolate (strain 1229) on an Illumina HiSeq platform (25Gb total data). Assemblies were produced using different data types (raw vs. trimmed) and data amounts, and using three freely available assembly programs (ABySS, SOAP and Velvet). In nearly all cases, trimming the data for low quality base calls did not provide assemblies with higher N50 values compared to the non-trimmed data, and increasing the amount of input data (i.e. sequence reads) did not always lead to higher N50 values. Depending on the assembly program and data type, the maximal N50 was reached with between 50% and 90% of the total read data, equivalent to 100× to 200× coverage. The draft genome assembly was improved over the previously published version resulting in a 114Mb assembly, scaffold N50 of 70kb and 9610 predicted genes. Among the predicted genes, 9213 were validated by RNA-Seq analysis in this study, of which 8896 were found to be singletons. Evidence from genome and transcriptome analyses indicated that species assemblies could be improved with defined input material (e.g. haploid mono-ascospore isolate) without the requirement of multiple sequencing technologies, multiple library sizes or data trimming for low quality base calls, and with genome coverages between 100× and 200×.
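
    Several of these records compare assemblies by their N50. As a reminder of what that number measures, the sketch below computes it under the usual definition: the length L such that contigs or scaffolds of length at least L together contain at least half of the assembled bases. The function name and the toy lengths are illustrative and are not taken from any of the studies listed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # assembly is the N50.
        if running >= half_total:
            return length


# Toy scaffold lengths in kb (made-up values): total 200, half 100.
scaffold_lengths = [70, 50, 40, 30, 10]
print(n50(scaffold_lengths))  # 50
```

    Because N50 depends only on lengths, two assemblies with the same N50 can differ widely in correctness, which is why it is usually reported alongside other measures.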

  6. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions

    Science.gov (United States)

    Burton, Joshua N.; Adey, Andrew; Patwardhan, Rupali P.; Qiu, Ruolan; Kitzman, Jacob O.; Shendure, Jay

    2014-01-01

    Genomes assembled de novo from short reads are highly fragmented relative to the finished chromosomes of H. sapiens and key model organisms generated by the Human Genome Project. To address this, we need scalable, cost-effective methods enabling chromosome-scale contiguity. Here we show that genome-wide chromatin interaction datasets, such as those generated by Hi-C, are a rich source of long-range information for assigning, ordering and orienting genomic sequences to chromosomes, including across centromeres. To exploit this, we developed an algorithm that uses Hi-C data for ultra-long-range scaffolding of de novo genome assemblies. We demonstrate the approach by combining shotgun fragment and short jump mate-pair sequences with Hi-C data to generate chromosome-scale de novo assemblies of the human, mouse and Drosophila genomes, achieving – for human – 98% accuracy in assigning scaffolds to chromosome groups and 99% accuracy in ordering and orienting scaffolds within chromosome groups. Hi-C data can also be used to validate chromosomal translocations in cancer genomes. PMID:24185095

  7. Semantic Assembly and Annotation of Draft RNAseq Transcripts without a Reference Genome.

    Science.gov (United States)

    Ptitsyn, Andrey; Temanni, Ramzi; Bouchard, Christelle; Anderson, Peter A V

    2015-01-01

    Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the nucleotide context overlapping between short fragments-reads. However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license.

  8. Genome Assembly of Bell Pepper Endornavirus from Small RNA

    Science.gov (United States)

    Luria, Neta; Dombrovsky, Aviv

    2012-01-01

    The family Endornaviridae infects diverse hosts, including plants, fungi, and oomycetes. Here we report for the first time the assembly of bell pepper endornavirus by next-generation sequencing of viral small RNA. Such a population of small RNA indicates the activation of the viral immunity silencing machinery by this cryptic virus, which probably encodes a novel silencing suppressor. PMID:22733884

  9. Genome assembly and annotation of Arabidopsis halleri, a model for heavy metal hyperaccumulation and evolutionary ecology.

    Science.gov (United States)

    Briskine, Roman V; Paape, Timothy; Shimizu-Inatsugi, Rie; Nishiyama, Tomoaki; Akama, Satoru; Sese, Jun; Shimizu, Kentaro K

    2016-09-27

    The self-incompatible species Arabidopsis halleri is a close relative of the self-compatible model plant Arabidopsis thaliana. The broad European and Asian distribution and heavy metal hyperaccumulation ability make A. halleri a useful model for ecological genomics studies. We used long-insert mate-pair libraries to improve the genome assembly of the A. halleri ssp. gemmifera Tada mine genotype (W302) collected from a site with high contamination by heavy metals in Japan. After five rounds of forced selfing, heterozygosity was reduced to 0.04%, which facilitated subsequent genome assembly. Our assembly now covers 196 Mb or 78% of the estimated genome size and achieved scaffold N50 length of 712 kb. To validate assembly and annotation, we used synteny of A. halleri Tada mine with a previously published high-quality reference assembly of a closely related species, Arabidopsis lyrata. Further validation of the assembly quality comes from synteny and phylogenetic analysis of the HEAVY METAL ATPASE4 (HMA4) and METAL TOLERANCE PROTEIN1 (MTP1) regions using published sequences from European A. halleri for comparison. Three tandemly duplicated copies of HMA4, key gene involved in cadmium and zinc hyperaccumulation, were assembled on a single scaffold. The assembly will enhance the genomewide studies of A. halleri as well as the allopolyploid Arabidopsis kamchatica derived from A. lyrata and A. halleri. © 2016 The Authors. Molecular Ecology Resources Published by John Wiley & Sons Ltd.

  10. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds.

    Science.gov (United States)

    Dudchenko, Olga; Batra, Sanjit S; Omer, Arina D; Nyquist, Sarah K; Hoeger, Marie; Durand, Neva C; Shamim, Muhammad S; Machol, Ido; Lander, Eric S; Aiden, Aviva Presser; Aiden, Erez Lieberman

    2017-04-07

    The Zika outbreak, spread by the Aedes aegypti mosquito, highlights the need to create high-quality assemblies of large genomes in a rapid and cost-effective way. Here we combine Hi-C data with existing draft assemblies to generate chromosome-length scaffolds. We validate this method by assembling a human genome, de novo, from short reads alone (67× coverage). We then combine our method with draft sequences to create genome assemblies of the mosquito disease vectors Aedes aegypti and Culex quinquefasciatus, each consisting of three scaffolds corresponding to the three chromosomes in each species. These assemblies indicate that almost all genomic rearrangements among these species occur within, rather than between, chromosome arms. The genome assembly procedure we describe is fast, inexpensive, and accurate, and can be applied to many species.

  11. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity.

    Science.gov (United States)

    Adey, Andrew; Kitzman, Jacob O; Burton, Joshua N; Daza, Riza; Kumar, Akash; Christiansen, Lena; Ronaghi, Mostafa; Amini, Sasan; Gunderson, Kevin L; Steemers, Frank J; Shendure, Jay

    2014-12-01

    We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries comprised of 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to > 1 megabase. These pools are "subhaploid," in that the lengths of fragments contained in each pool sums to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data is mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate "joins" are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences.
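
    The core idea in this abstract, scoring pairs of contig ends by how often they co-occur in the same indexed pools and then resolving the resulting graph with a spanning tree, can be sketched in a few lines. This is a toy illustration and not the fragScaff implementation: the data layout, the raw shared-pool count used as a score, the `min_shared` cutoff and all names are assumptions, and the published method will differ in how it normalizes scores and handles orientation.

```python
from itertools import combinations


def candidate_joins(pools_by_end, min_shared=2):
    """Score pairs of contig ends by the number of indexed pools they share.

    `pools_by_end` maps (contig, side) to the set of pool indices in which
    reads from that contig end were observed (hypothetical layout).
    """
    joins = []
    for a, b in combinations(sorted(pools_by_end), 2):
        if a[0] == b[0]:  # the two ends of one contig are not a join
            continue
        shared = len(pools_by_end[a] & pools_by_end[b])
        if shared >= min_shared:
            joins.append((shared, a, b))
    return joins


def spanning_tree(joins):
    """Kruskal's algorithm over contigs, best-supported joins first.

    Taking the highest scores first is the same as building a minimum
    spanning tree with edge weight equal to minus the shared-pool count.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for shared, a, b in sorted(joins, key=lambda j: (-j[0], j[1], j[2])):
        root_a, root_b = find(a[0]), find(b[0])
        if root_a != root_b:  # skip joins that would close a cycle
            parent[root_a] = root_b
            kept.append((a, b, shared))
    return kept


# Made-up pool memberships for three contigs, each with a left and right end.
pools_by_end = {
    ("c1", "L"): {1, 2},
    ("c1", "R"): {3, 4, 5, 6},
    ("c2", "L"): {3, 4, 5, 9},
    ("c2", "R"): {10, 11, 12},
    ("c3", "L"): {10, 11, 12, 13},
    ("c3", "R"): {20, 21, 5, 6},
}
kept = spanning_tree(candidate_joins(pools_by_end))
for a, b, shared in kept:
    print(a, b, shared)
```

    In this toy input the weaker c1/c3 link (two shared pools) is discarded because c1 and c3 are already connected through c2 by better-supported joins, leaving the chain c1, c2, c3.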

  12. Meraculous: De Novo Genome Assembly with Short Paired-End Reads

    Energy Technology Data Exchange (ETDEWEB)

    Chapman, Jarrod A.; Ho, Isaac; Sunkara, Sirisha; Luo, Shujun; Schroth, Gary P.; Rokhsar, Daniel S.; Salzberg, Steven L.

    2011-08-18

    We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ~280 bp or ~3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
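
    The phrase "k-mers with unique high quality extensions" can be made concrete with a small sketch: count k-mers, treat an extension as supported only if it is seen often enough, and extend a contig only while exactly one supported extension exists. This is a simplified illustration and not the meraculous code. It ignores base quality scores, reverse complements and leftward extension, and `min_count`, the function names and the toy reads are assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def unique_extension(kmer, counts, min_count):
    """Return the single well-supported next base, or None.

    None means either a dead end (no supported extension) or a fork
    (more than one), and in both cases the traversal stops.
    """
    options = [b for b in "ACGT" if counts[kmer[1:] + b] >= min_count]
    return options[0] if len(options) == 1 else None


def extend_right(seed, counts, min_count=2):
    """Walk rightwards from `seed` while the extension stays unique."""
    contig, kmer, seen = seed, seed, {seed}
    while True:
        base = unique_extension(kmer, counts, min_count)
        if base is None:
            return contig
        kmer = kmer[1:] + base
        if kmer in seen:  # stop rather than loop around a cycle
            return contig
        seen.add(kmer)
        contig += base


# Toy reads from a made-up 15-base sequence; the last read has one error.
reads = [
    "ATGGCGTACC",
    "GCGTACCTTAGC",
    "ATGGCGTACCTT",
    "GTACCTTAGC",
    "ATGGCGAACC",  # T->A substitution relative to the other reads
]
counts = kmer_counts(reads, k=4)
print(extend_right("ATGG", counts, min_count=2))  # ATGGCGTACCTTAGC
print(extend_right("ATGG", counts, min_count=1))  # ATGGCG
```

    With `min_count=2` the k-mers created by the erroneous read are seen only once and are ignored, so the walk recovers the full toy sequence. With `min_count=1` the error creates a fork after ATGGCG and the conservative traversal stops there, which is the behaviour that lets this style of assembler avoid a separate error-correction step.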

  13. Meraculous: de novo genome assembly with short paired-end reads.

    Directory of Open Access Journals (Sweden)

    Jarrod A Chapman

    We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.

  14. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome.

    Science.gov (United States)

    Goodwin, Sara; Gurtowski, James; Ethe-Sayers, Scott; Deshpande, Panchajanya; Schatz, Michael C; McCombie, W Richard

    2015-11-01

    Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm Nanocorr specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5-50 kbp) at such high error rates (between ∼5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: The contig N50 length is more than ten times greater than an Illumina-only assembly (678 kb versus 59.9 kbp) and has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.

  15. Comparing de novo genome assembly: the long and short of it.

    Science.gov (United States)

    Narzisi, Giuseppe; Mishra, Bud

    2011-04-29

    Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby, inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing the contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) as well as a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality against their sizes. For this purpose, most of the publicly available major sequence assemblers--both for low-coverage long (Sanger) and high-coverage short (Illumina) reads technologies--are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing "next-generation" assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.
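
    The trade-off that a Feature-Response Curve is meant to expose can be illustrated with a small sketch. The assumption here is that each contig carries a count of suspicious "features" (potential misassemblies), and that for a given feature budget one walks the contigs from longest to shortest, stopping as soon as the budget would be exceeded, and reports the fraction of the genome covered by the contigs accepted so far. The function name, the stop-at-first-overflow rule and the toy assemblies are illustrative assumptions, not the AMOS tool's implementation.

```python
def coverage_at(contigs, genome_size, max_features):
    """Genome fraction covered by the longest contigs within a feature budget.

    `contigs` is a list of (length, feature_count) pairs (hypothetical layout).
    """
    covered = 0
    features = 0
    for length, n_features in sorted(contigs, key=lambda c: c[0], reverse=True):
        if features + n_features > max_features:
            break  # assumption: stop at the first contig that overflows
        features += n_features
        covered += length
    return covered / genome_size


# Two made-up assemblies of a 1000-base genome.
# A has the longer contigs, but its longest contig carries many features.
assembly_a = [(500, 6), (300, 1), (200, 0)]
assembly_b = [(250, 0), (250, 1), (200, 0), (200, 1)]

for budget in (0, 2, 7):
    print(budget,
          coverage_at(assembly_a, 1000, budget),
          coverage_at(assembly_b, 1000, budget))
```

    Assembly A would win on a size-only metric, yet at a budget of two features it covers none of the genome while B covers 90%. A only overtakes B once seven features are tolerated, and that crossover is the kind of information a size-only metric hides.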

  16. Comparing de novo genome assembly: the long and short of it.

    Directory of Open Access Journals (Sweden)

    Giuseppe Narzisi

    Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby, inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing the contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) as well as a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality against their sizes. For this purpose, most of the publicly available major sequence assemblers--both for low-coverage long (Sanger) and high-coverage short (Illumina) reads technologies--are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing "next-generation" assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.

  17. The de novo genome assembly and annotation of a female domestic dromedary of North African origin.

    Science.gov (United States)

    Fitak, Robert R; Mohandesan, Elmira; Corander, Jukka; Burger, Pamela A

    2016-01-01

    The single-humped dromedary (Camelus dromedarius) is the most numerous and widespread of domestic camel species and is a significant source of meat, milk, wool, transportation and sport for millions of people. Dromedaries are particularly well adapted to hot, desert conditions and harbour a variety of biological and physiological characteristics with evolutionary, economic and medical importance. To understand the genetic basis of these traits, an extensive resource of genomic variation is required. In this study, we assembled, at 65× coverage, a 2.06 Gb draft genome of a female dromedary whose ancestry can be traced to an isolated population from the Canary Islands. We annotated 21,167 protein-coding genes and estimated ~33.7% of the genome to be repetitive. A comparison with the recently published draft genome of an Arabian dromedary resulted in 1.91 Gb of aligned sequence with a divergence of 0.095%. An evaluation of our genome with the reference revealed that our assembly contains more error-free bases (91.2%) and fewer scaffolding errors. We identified ~1.4 million single-nucleotide polymorphisms with a mean density of 0.71 × 10^-3 per base. An analysis of demographic history indicated that changes in effective population size corresponded with recent glacial epochs. Our de novo assembly provides a useful resource of genomic variation for future studies of the camel's adaptations to arid environments and economically important traits. Furthermore, these results suggest that draft genome assemblies constructed with only two differently sized sequencing libraries can be comparable to those sequenced using additional library sizes, highlighting that additional resources might be better placed in technologies alternative to short-read sequencing to physically anchor scaffolds to genome maps.

  18. Assembling networks of microbial genomes using linear programming

    Directory of Open Access Journals (Sweden)

    Holloway Catherine

    2010-11-01

    Background: Microbial genomes exhibit complex sets of genetic affinities due to lateral genetic transfer. Assessing the relative contributions of parent-to-offspring inheritance and gene sharing is a vital step in understanding the evolutionary origins and modern-day function of an organism, but recovering and showing these relationships is a challenging problem. Results: We have developed a new approach that uses linear programming to find between-genome relationships, by treating tables of genetic affinities (here, represented by transformed BLAST e-values) as an optimization problem. Validation trials on simulated data demonstrate the effectiveness of the approach in recovering and representing vertical and lateral relationships among genomes. Application of the technique to a set comprising Aquifex aeolicus and 75 other thermophiles showed an important role for large genomes as 'hubs' in the gene sharing network, and suggested that genes are preferentially shared between organisms with similar optimal growth temperatures. We were also able to discover distinct and common genetic contributors to each sequenced representative of genus Pseudomonas. Conclusions: The linear programming approach we have developed can serve as an effective inference tool in its own right, and can be an efficient first step in a more-intensive phylogenomic analysis.

  19. Rapid genome mapping in nanochannel arrays for highly complete and accurate de novo sequence assembly of the complex Aegilops tauschii genome.

    Science.gov (United States)

    Hastie, Alex R; Dong, Lingli; Smith, Alexis; Finklestein, Jeff; Lam, Ernest T; Huo, Naxin; Cao, Han; Kwok, Pui-Yan; Deal, Karin R; Dvorak, Jan; Luo, Ming-Cheng; Gu, Yong; Xiao, Ming

    2013-01-01

    Next-generation sequencing (NGS) technologies have enabled high-throughput and low-cost generation of sequence data; however, de novo genome assembly remains a great challenge, particularly for large genomes. NGS short reads are often insufficient to create large contigs that span repeat sequences and to facilitate unambiguous assembly. Plant genomes are notorious for containing high quantities of repetitive elements, which, combined with huge genome sizes, has made accurate assembly of these large and complex genomes intractable thus far. Using two-color genome mapping of tiling bacterial artificial chromosome (BAC) clones on nanochannel arrays, we completed high-confidence assembly of a 2.1-Mb, highly repetitive region in the large and complex genome of Aegilops tauschii, the D-genome donor of hexaploid wheat (Triticum aestivum). Genome mapping is based on direct visualization of sequence motifs on single DNA molecules hundreds of kilobases in length. With the genome map as a scaffold, we anchored unplaced sequence contigs, validated the initial draft assembly, and resolved instances of misassembly, some involving contigs <2 kb long, to dramatically improve the assembly from 75% to 95% complete.

  20. Rapid genome mapping in nanochannel arrays for highly complete and accurate de novo sequence assembly of the complex Aegilops tauschii genome.

    Directory of Open Access Journals (Sweden)

    Alex R Hastie

    Next-generation sequencing (NGS) technologies have enabled high-throughput and low-cost generation of sequence data; however, de novo genome assembly remains a great challenge, particularly for large genomes. NGS short reads are often insufficient to create large contigs that span repeat sequences and to facilitate unambiguous assembly. Plant genomes are notorious for containing high quantities of repetitive elements, which, combined with huge genome sizes, has made accurate assembly of these large and complex genomes intractable thus far. Using two-color genome mapping of tiling bacterial artificial chromosome (BAC) clones on nanochannel arrays, we completed high-confidence assembly of a 2.1-Mb, highly repetitive region in the large and complex genome of Aegilops tauschii, the D-genome donor of hexaploid wheat (Triticum aestivum). Genome mapping is based on direct visualization of sequence motifs on single DNA molecules hundreds of kilobases in length. With the genome map as a scaffold, we anchored unplaced sequence contigs, validated the initial draft assembly, and resolved instances of misassembly, some involving contigs <2 kb long, to dramatically improve the assembly from 75% to 95% complete.

  1. Impact of genome assembly status on ChIP-Seq and ChIP-PET data mapping

    Directory of Open Access Journals (Sweden)

    Sachs Laurent

    2009-12-01

    Background: ChIP-Seq and ChIP-PET can potentially be used with any genome for genome-wide profiling of protein-DNA interaction sites. Unfortunately, it is probable that most genome assemblies will never reach the quality of the human genome assembly. Therefore, it remains to be determined whether ChIP-Seq and ChIP-PET are practicable with genome sequences other than a few (e.g. human and mouse). Findings: Here, we used in silico simulations to assess the impact of completeness or fragmentation of genome assemblies on ChIP-Seq and ChIP-PET data mapping. Conclusions: Most currently published genome assemblies are suitable for mapping the short sequence tags produced by ChIP-Seq or ChIP-PET.

  2. The complex task of choosing a de novo assembly: lessons from fungal genomes.

    Science.gov (United States)

    Gallo, Juan Esteban; Muñoz, José Fernando; Misas, Elizabeth; McEwen, Juan Guillermo; Clay, Oliver Keatinge

    2014-12-01

    Selecting the values of parameters used by de novo genomic assembly programs, or choosing an optimal de novo assembly from several runs obtained with different parameters or programs, are tasks that can require complex decision-making. A key parameter that must be supplied to typical next generation sequencing (NGS) assemblers is the k-mer length, i.e., the word size that determines which de Bruijn graph the program should map out and use. The topic of assembly selection criteria was recently revisited in the Assemblathon 2 study (Bradnam et al., 2013). Although no clear message was delivered with regard to optimal k-mer lengths, it was shown with examples that it is sometimes important to decide if one is most interested in optimizing the sequences of protein-coding genes (the gene space) or in optimizing the whole genome sequence including the intergenic DNA, as what is best for one criterion may not be best for the other. In the present study, our aim was to better understand how the assembly of unicellular fungi (which are typically intermediate in size and complexity between prokaryotes and metazoan eukaryotes) can change as one varies the k-mer values over a wide range. We used two different de novo assembly programs (SOAPdenovo2 and ABySS), and simple assembly metrics that also focused on success in assembling the gene space and repetitive elements. A recent increase in Illumina read length to around 150 bp allowed us to attempt de novo assemblies with a larger range of k-mers, up to 127 bp. We applied these methods to Illumina paired-end sequencing read sets of fungal strains of Paracoccidioides brasiliensis and other species. By visualizing the results in simple plots, we were able to track the effect of changing k-mer size and assembly program, and to demonstrate how such plots can readily reveal discontinuities or other unexpected characteristics that assembly programs can present in practice, especially when they are used in a traditional molecular
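
    To make concrete how the k-mer length "determines which de Bruijn graph the program should map out", the following is a minimal Python sketch that builds a toy de Bruijn graph from reads for a chosen k. It is an illustration only: the function name and the reads are hypothetical, and it does not reproduce the internals of SOAPdenovo2 or ABySS (no reverse complements, error correction or graph simplification).

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some k-mer.

    Every k-mer found in a read contributes one edge, from its first
    k-1 bases to its last k-1 bases. Reads shorter than k add nothing.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical reads, invented for illustration only.
reads = ["ACGTAC", "GTACGG"]
for k in (3, 5):
    graph = de_bruijn_graph(reads, k)
    n_edges = sum(len(successors) for successors in graph.values())
    print(f"k={k}: {len(graph)} nodes with outgoing edges, {n_edges} edges")
```

    Changing k changes both the node set and the edge set, so the same reads can yield quite different graphs, which is one reason sweeping k over a wide range, as the study does, can change the resulting assembly.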

  3. The possibility of de novo assembly of the genome and population genomics of the mangrove rivulus, Kryptolebias marmoratus.

    Science.gov (United States)

    Kelley, Joanna L; Yee, Muh-Ching; Lee, Clarence; Levandowsky, Elizabeth; Shah, Minita; Harkins, Timothy; Earley, Ryan L; Bustamante, Carlos D

    2012-12-01

    How organisms adapt to the range of environments they encounter is a fundamental question in biology. Elucidating the genetic basis of adaptation is a difficult task, especially when the targets of selection are not known. Emerging sequencing technologies and assembly algorithms facilitate the genomic dissection of adaptation and population differentiation in a vast array of organisms. Here we describe the attributes of Kryptolebias marmoratus, one of two known self-fertilizing hermaphroditic vertebrates that make this fish an attractive genetic system and a model for understanding the genomics of adaptation. Long periods of selfing have resulted in populations composed of many distinct naturally homozygous strains with a variety of identifiable, and apparently heritable, phenotypes. There also is strong population genetic structure across a diverse range of mangrove habitats, making this a tractable system in which to study differentiation both within and among populations. The ability to rear K. marmoratus in the laboratory contributes further to its value as a model for understanding the genetic drivers for adaptation. To date, microsatellite markers distinguish wild isogenic strains but the naturally high homozygosity improves the quality of de novo assembly of the genome and facilitates the identification of genetic variants associated with phenotypes. Gene annotation can be accomplished with RNA-sequencing data in combination with de novo genome assembly. By combining genomic information with extensive laboratory-based phenotyping, it becomes possible to map genetic variants underlying differences in behavioral, life-history, and other potentially adaptive traits. Emerging genomic technologies provide the required resources for establishing K. marmoratus as a new model organism for behavioral genetics and evolutionary genetics research.

  4. Thermodynamic Interrogation of the Assembly of a Viral Genome Packaging Motor Complex.

    Science.gov (United States)

    Yang, Teng-Chieh; Ortiz, David; Nosaka, Lyn'Al; Lander, Gabriel C; Catalano, Carlos Enrique

    2015-10-20

    Viral terminase enzymes serve as genome packaging motors in many complex double-stranded DNA viruses. The functional motors are multiprotein complexes that translocate viral DNA into a capsid shell, powered by a packaging ATPase, and are among the most powerful molecular motors in nature. Given their essential role in virus development, the structure and function of these biological motors is of considerable interest. Bacteriophage λ-terminase, which serves as a prototypical genome packaging motor, is composed of one large catalytic subunit tightly associated with two DNA recognition subunits. This protomer assembles into a functional higher-order complex that excises a unit length genome from a concatemeric DNA precursor (genome maturation) and concomitantly translocates the duplex into a preformed procapsid shell (genome packaging). While the enzymology of λ-terminase has been well described, the nature of the catalytically competent nucleoprotein intermediates, and the mechanism describing their assembly and activation, is less clear. Here we utilize analytical ultracentrifugation to determine the thermodynamic parameters describing motor assembly and define a minimal thermodynamic linkage model that describes the effects of salt on protomer assembly into a tetrameric complex. Negative stain electron microscopy images reveal a symmetric ring-like complex with a compact stem and four extended arms that exhibit a range of conformational states. Finally, kinetic studies demonstrate that assembly of the ring tetramer is directly linked to activation of the packaging ATPase activity of the motor, thus providing a direct link between structure and function. The implications of these results with respect to the assembly and activation of the functional packaging motor during a productive viral infection are discussed.

  5. FLASH Assembly of TALENs Enables High-Throughput Genome Editing

    OpenAIRE

    Reyon, Deepak; Tsai, Shengdar Q.; Khayter, Cyd; Foden, Jennifer A.; Sander, Jeffry D.; Joung, J. Keith

    2012-01-01

    Engineered transcription activator-like effector nucleases (TALENs) have shown promise as facile and broadly applicable genome editing tools. However, no publicly available high-throughput method for constructing TALENs has been published and large-scale assessments of the success rate and targeting range of the technology remain lacking. Here we describe the Fast Ligation-based Automatable Solid-phase High-throughput (FLASH) platform, a rapid and cost-effective method we developed to enable ...

  6. The human genome: Some assembly required. Final report

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1994-12-31

    The Human Genome Project promises to be one of the most rewarding endeavors in modern biology. The cost and the ethical and social implications, however, have made this project the source of considerable debate both in the scientific community and in the public at large. The 1994 Graduate Student Symposium addresses the scientific merits of the project, the technical issues involved in accomplishing the task, as well as the medical and social issues which stem from the wealth of knowledge which the Human Genome Project will help create. To this end, speakers were brought together who represent the diverse areas of expertise characteristic of this multidisciplinary project. The keynote speaker addresses the project's motivations and goals in the larger context of biological and medical sciences. The first two sessions address relevant technical issues, data collection with a focus on high-throughput sequencing methods and data analysis with an emphasis on identification of coding sequences. The third session explores recent advances in the understanding of genetic diseases and possible routes to treatment. Finally, the last session addresses some of the ethical, social and legal issues which will undoubtedly arise from having a detailed knowledge of the human genome.

  7. Organelle_PBA, a pipeline for assembling chloroplast and mitochondrial genomes from PacBio DNA sequencing data.

    Science.gov (United States)

    Soorni, Aboozar; Haak, David; Zaitlin, David; Bombarely, Aureliano

    2017-01-07

    The development of long-read sequencing technologies, such as single-molecule real-time (SMRT) sequencing by PacBio, has produced a revolution in the sequencing of small genomes. Sequencing organelle genomes using PacBio long-read data is a cost effective, straightforward approach. Nevertheless, the availability of simple-to-use software to perform the assembly from raw reads is limited at present. We present Organelle-PBA, a Perl program designed specifically for the assembly of chloroplast and mitochondrial genomes. For chloroplast genomes, the program selects the chloroplast reads from a whole genome sequencing pool, maps the reads to a reference sequence from a closely related species, and then performs read correction and de novo assembly using Sprai. Organelle-PBA completes the assembly process with the additional step of scaffolding by SSPACE-LongRead. The program then detects the chloroplast inverted repeats and reassembles and re-orients the assembly based on the organelle origin of the reference. We have evaluated the performance of the software using PacBio reads from different species, read coverage, and reference genomes. Finally, we present the assembly of two novel chloroplast genomes from the species Picea glauca (Pinaceae) and Sinningia speciosa (Gesneriaceae). Organelle-PBA is an easy-to-use Perl-based software pipeline that was written specifically to assemble mitochondrial and chloroplast genomes from whole genome PacBio reads. The program is available at https://github.com/aubombarely/Organelle_PBA .

  8. Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimephales promelas) Reference Genome

    Science.gov (United States)

    This study was undertaken to develop genome-scale resources for the fathead minnow (Pimephales promelas), an important model organism widely used in both aquatic ecotoxicology research and in regulatory toxicity testing. We report on the first sequencing and two draft assemblies fo...

  9. De novo genome assembly of Cercospora beticola for microsatellite marker development and validation

    Science.gov (United States)

    Cercospora leaf spot caused by Cercospora beticola is a significant threat to the production of sugar and table beet worldwide. A de novo genome assembly of C. beticola was used to develop eight polymorphic and reproducible microsatellite markers for population genetic analyses. These markers were u...

  10. Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimephales promelas) Reference Genome

    Science.gov (United States)

    This study was undertaken to develop genome-scale resources for the fathead minnow (Pimephales promelas), an important model organism widely used in both aquatic ecotoxicology research and in regulatory toxicity testing. We report on the first sequencing and two draft assemblies fo...

  11. Draft Genome Assembly of a Filamentous Euendolithic (True Boring) Cyanobacterium, Mastigocoleus testarum Strain BC008.

    Science.gov (United States)

    Guida, Brandon S; Garcia-Pichel, Ferran

    2016-01-28

    Mastigocoleus testarum strain BC008 is a model organism used to study marine photoautotrophic carbonate dissolution. It is a multicellular, filamentous, diazotrophic, euendolithic cyanobacterium ubiquitously found in marine benthic environments. We present an accurate draft genome assembly of 172 contigs spanning 12,700,239 bp with 9,131 annotated genes with an average G+C% of 37.3.

  12. De Novo DNA Assembly with a Genetic Algorithm Finds Accurate Genomes Even with Suboptimal Fitness

    NARCIS (Netherlands)

    Bucur, Doina; Squillero, Giovanni; Sim, Kevin

    We design an evolutionary heuristic for the combinatorial problem of de-novo DNA assembly with short, overlapping, accurately sequenced single DNA reads of uniform length, from both strands of a genome without long repeated sequences. The representation of a candidate solution is a novel segmented

  13. Radiation hybrid maps of D-genome of Aegilops tauschii and their application in sequence assembly of large and complex plant genomes

    Science.gov (United States)

    The large and complex genome of bread wheat (Triticum aestivum L., ~17 Gb) requires high-resolution genome maps saturated with ordered markers to assist in anchoring and orienting BAC contigs/ sequence scaffolds for whole genome sequence assembly. Radiation hybrid (RH) mapping has proven to be an e...

  14. De novo assembly of a field isolate genome reveals novel Plasmodium vivax erythrocyte invasion genes.

    Science.gov (United States)

    Hester, James; Chan, Ernest R; Menard, Didier; Mercereau-Puijalon, Odile; Barnwell, John; Zimmerman, Peter A; Serre, David

    2013-01-01

    Recent sequencing of Plasmodium vivax field isolates and monkey-adapted strains enabled characterization of SNPs throughout the genome. These analyses relied on mapping short reads onto the P. vivax reference genome that was generated using DNA from the monkey-adapted strain Salvador I. Any genomic locus deleted in this strain would be lacking in the reference genome sequence and missed in previous analyses. Here, we report de novo assembly of a P. vivax field isolate genome. Out of 2,857 assembled contigs, we identify 362 contigs, each containing more than 5 kb of contiguous DNA sequences absent from the reference genome sequence. These novel P. vivax DNA sequences account for 3.8 million nucleotides and contain 792 predicted genes. Most of these contigs contain members of multigene families and likely originate from telomeric regions. Interestingly, we identify two contigs containing predicted protein coding genes similar to known Plasmodium red blood cell invasion proteins. One gene encodes the reticulocyte-binding protein gene orthologous to P. cynomolgi RBP2e and P. knowlesi NBPXb. The second gene harbors all the hallmarks of a Plasmodium erythrocyte-binding protein, including conserved Duffy-binding like and C-terminus cysteine-rich domains. Phylogenetic analysis shows that this novel gene clusters separately from all known Plasmodium Duffy-binding protein genes. Additional analyses showing that this gene is present in most P. vivax genomes and transcribed in blood-stage parasites suggest that P. vivax red blood cell invasion mechanisms may be more complex than currently understood. The strategy employed here complements previous genomic analyses and takes full advantage of next-generation sequencing data to provide a comprehensive characterization of genetic variations in this important malaria parasite. Further analyses of the novel protein coding genes discovered through de novo assembly have the potential to identify genes that influence key aspects of P

  15. Long-read sequencing and de novo assembly of a Chinese genome.

    Science.gov (United States)

    Shi, Lingling; Guo, Yunfei; Dong, Chengliang; Huddleston, John; Yang, Hui; Han, Xiaolu; Fu, Aisi; Li, Quan; Li, Na; Gong, Siyi; Lintner, Katherine E; Ding, Qiong; Wang, Zou; Hu, Jiang; Wang, Depeng; Wang, Feng; Wang, Lin; Lyon, Gholson J; Guan, Yongtao; Shen, Yufeng; Evgrafov, Oleg V; Knowles, James A; Thibaud-Nissen, Francoise; Schneider, Valerie; Yu, Chack-Yung; Zhou, Libing; Eichler, Evan E; So, Kwok-Fai; Wang, Kai

    2016-06-30

    Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays and generate a de novo assembly of 2.93 Gb (contig N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific sequences, including 4.1 Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.

  16. Assembly and diploid architecture of an individual human genome via single-molecule technologies.

    Science.gov (United States)

    Pendleton, Matthew; Sebra, Robert; Pang, Andy Wing Chun; Ummat, Ajay; Franzen, Oscar; Rausch, Tobias; Stütz, Adrian M; Stedman, William; Anantharaman, Thomas; Hastie, Alex; Dai, Heng; Fritz, Markus Hsi-Yang; Cao, Han; Cohain, Ariella; Deikus, Gintaras; Durrett, Russell E; Blanchard, Scott C; Altman, Roger; Chin, Chen-Shan; Guo, Yan; Paxinos, Ellen E; Korbel, Jan O; Darnell, Robert B; McCombie, W Richard; Kwok, Pui-Yan; Mason, Christopher E; Schadt, Eric E; Bashir, Ali

    2015-08-01

    We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.

  17. A second generation radiation hybrid map to aid the assembly of the bovine genome sequence

    Directory of Open Access Journals (Sweden)

    Janitz Michal

    2006-11-01

    Background: Several approaches can be used to determine the order of loci on chromosomes and hence develop maps of the genome. However, all mapping approaches are prone to errors either arising from technical deficiencies or lack of statistical support to distinguish between alternative orders of loci. The accuracy of the genome maps could be improved, in principle, if information from different sources was combined to produce integrated maps. The publicly available bovine genomic sequence assembly with 6× coverage (Btau_2.0) is based on whole genome shotgun sequence data and limited mapping data; however, it is recognised that this assembly is a draft that contains errors. Correcting the sequence assembly requires extensive additional mapping information to improve the reliability of the ordering of sequence scaffolds on chromosomes. The radiation hybrid (RH) map described here has been contributed to the international sequencing project to aid this process. Results: An RH map for the 30 bovine chromosomes is presented. The map was built using the Roslin 3000-rad RH panel (BovGen RH map) and contains 3966 markers including 2473 new loci in addition to 262 amplified fragment-length polymorphisms (AFLP) and 1231 markers previously published with the first generation RH map. Sequences of the mapped loci were aligned with published bovine genome maps to identify inconsistencies. In addition to differences in the order of loci, several cases were observed where the chromosomal assignment of loci differed between maps. All the chromosome maps were aligned with the current 6× bovine assembly (Btau_2.0) and 2898 loci were unambiguously located in the bovine sequence. The order of loci on the RH map for BTA 5, 7, 16, 22, 25 and 29 differed substantially from the assembled bovine sequence. From the 2898 loci unambiguously identified in the bovine sequence assembly, 131 mapped to different chromosomes in the BovGen RH map. Conclusion: Alignment of the

  18. Illuminating the Black Box of Genome Sequence Assembly: A Free Online Tool to Introduce Students to Bioinformatics

    Science.gov (United States)

    Taylor, D. Leland; Campbell, A. Malcolm; Heyer, Laurie J.

    2013-01-01

    Next-generation sequencing technologies have greatly reduced the cost of sequencing genomes. With the current sequencing technology, a genome is broken into fragments and sequenced, producing millions of "reads." A computer algorithm pieces these reads together in the genome assembly process. PHAST is a set of online modules…
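
    The idea that "a computer algorithm pieces these reads together" can be illustrated with a toy greedy overlap merger in Python. This is a sketch under stated assumptions: the abstract does not say which algorithm the online modules use, the function names and reads below are hypothetical, and real assemblers must also handle sequencing errors, repeats and both DNA strands.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; remaining pieces stay separate
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)
    return seqs


# Hypothetical reads drawn from the made-up sequence ATGGCGTGCAATCC.
reads = ["ATGGCGT", "GCGTGCA", "TGCAATCC"]
print(greedy_assemble(reads))  # ['ATGGCGTGCAATCC']
```

    The greedy choice is simple to follow but can go wrong when a repeat makes two different merges look equally good, which is part of why assembly is harder in practice than this sketch suggests.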

  19. An advanced draft genome assembly of a desi type chickpea (Cicer arietinum L.).

    Science.gov (United States)

    Parween, Sabiha; Nawaz, Kashif; Roy, Riti; Pole, Anil K; Venkata Suresh, B; Misra, Gopal; Jain, Mukesh; Yadav, Gitanjali; Parida, Swarup K; Tyagi, Akhilesh K; Bhatia, Sabhyata; Chattopadhyay, Debasis

    2015-08-11

    Chickpea (Cicer arietinum L.) is an important pulse legume crop. We previously reported a draft genome assembly of the desi chickpea cultivar ICC 4958. Here we report an advanced version of the ICC 4958 genome assembly (version 2.0) generated using additional sequence data and an improved genetic map. This resulted in 2.7-fold increase in the length of the pseudomolecules and substantial reduction of sequence gaps. The genome assembly covered more than 94% of the estimated gene space and predicted the presence of 30,257 protein-coding genes including 2230 and 133 genes encoding potential transcription factors (TF) and resistance gene homologs, respectively. Gene expression analysis identified several TF and chickpea-specific genes with tissue-specific expression and displayed functional diversification of the paralogous genes. Pairwise comparison of pseudomolecules in the desi (ICC 4958) and the earlier reported kabuli (CDC Frontier) chickpea assemblies showed an extensive local collinearity with incongruity in the placement of large sequence blocks along the linkage groups, apparently due to use of different genetic maps. Single nucleotide polymorphism (SNP)-based mining of intra-specific polymorphism identified more than four thousand SNPs differentiating a desi group and a kabuli group of chickpea genotypes.

  20. A field guide to whole-genome sequencing, assembly and annotation.

    Science.gov (United States)

    Ekblom, Robert; Wolf, Jochen B W

    2014-11-01

    Genome sequencing projects were long confined to biomedical model organisms and required the concerted effort of large consortia. Rapid progress in high-throughput sequencing technology and the simultaneous development of bioinformatic tools have democratized the field. It is now within reach for individual research groups in the eco-evolutionary and conservation community to generate de novo draft genome sequences for any organism of choice. Because of the cost and considerable effort involved in such an endeavour, the important first step is to thoroughly consider whether a genome sequence is necessary for addressing the biological question at hand. Once this decision is taken, a genome project requires careful planning with respect to the organism involved and the intended quality of the genome draft. Here, we briefly review the state of the art within this field and provide a step-by-step introduction to the workflow involved in genome sequencing, assembly and annotation with particular reference to large and complex genomes. This tutorial is targeted at scientists with a background in conservation genetics, but more generally, provides useful practical guidance for researchers engaging in whole-genome sequencing projects.

  1. De novo assembly and characterization of the complete chloroplast genome of radish (Raphanus sativus L.).

    Science.gov (United States)

    Jeong, Young-Min; Chung, Won-Hyung; Mun, Jeong-Hwan; Kim, Namshin; Yu, Hee-Ju

    2014-11-01

    Radish (Raphanus sativus L.) is an edible root vegetable crop that is cultivated worldwide and whose genome has been sequenced. Here we report the complete nucleotide sequence of the radish cultivar WK10039 chloroplast (cp) genome, along with a de novo assembly strategy using whole genome shotgun sequence reads obtained by next generation sequencing. The radish cp genome is 153,368 bp in length and has a typical quadripartite structure, composed of a pair of inverted repeat regions (26,217 bp each), a large single copy region (83,170 bp), and a small single copy region (17,764 bp). The radish cp genome contains 87 predicted protein-coding genes, 37 tRNA genes, and 8 rRNA genes. Sequence analysis revealed the presence of 91 simple sequence repeats (SSRs) in the radish cp genome. Phylogenetic analysis of 62 protein-coding gene sequences from the 17 cp genomes of the Brassicaceae family suggested that the radish cp genome is most closely related to the cp genomes of Brassica rapa and Brassica napus. Comparisons with the B. rapa and B. napus cp genomes revealed highly divergent intergenic sequences and introns that can potentially be developed as diagnostic cp markers. Synonymous and nonsynonymous substitutions of cp genes suggested that nucleotide substitutions have occurred at similar rates in most genes. The complete sequence of the radish cp genome would serve as a valuable resource for the development of new molecular markers and the study of the phylogenetic relationships of Raphanus species in the Brassicaceae family.
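    The quadripartite structure described above can be checked with simple arithmetic: the large and small single copy regions plus two copies of the inverted repeat should add up to the reported genome length. The sketch below uses only the figures quoted in the abstract; the function name is illustrative and not taken from the paper.

```python
def quadripartite_length(lsc_bp, ssc_bp, ir_bp):
    """Total length of a quadripartite chloroplast genome.

    The genome consists of one large single copy (LSC) region, one small
    single copy (SSC) region, and two copies of the inverted repeat (IR).
    """
    return lsc_bp + ssc_bp + 2 * ir_bp


# Region lengths (bp) as reported in the radish cp genome abstract.
LSC_BP = 83_170
SSC_BP = 17_764
IR_BP = 26_217

total_bp = quadripartite_length(LSC_BP, SSC_BP, IR_BP)
print(total_bp)  # 153368, matching the reported 153,368 bp
```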

  2. De novo genome assembly of the economically important weed horseweed using integrated data from multiple sequencing platforms.

    Science.gov (United States)

    Peng, Yanhui; Lai, Zhao; Lane, Thomas; Nageswara-Rao, Madhugiri; Okada, Miki; Jasieniuk, Marie; O'Geen, Henriette; Kim, Ryan W; Sammons, R Douglas; Rieseberg, Loren H; Stewart, C Neal

    2014-11-01

    Horseweed (Conyza canadensis), a member of the Compositae (Asteraceae) family, was the first broadleaf weed to evolve resistance to glyphosate. Horseweed, one of the most problematic weeds in the world, is a true diploid (2n = 2x = 18), with the smallest genome of any known agricultural weed (335 Mb). Thus, it is an appropriate candidate to help us understand the genetic and genomic bases of weediness. We undertook a draft de novo genome assembly of horseweed by combining data from multiple sequencing platforms (454 GS-FLX, Illumina HiSeq 2000, and PacBio RS) using various libraries with different insertion sizes (approximately 350 bp, 600 bp, 3 kb, and 10 kb) of a Tennessee-accessed, glyphosate-resistant horseweed biotype. From 116.3 Gb (approximately 350× coverage) of data, the genome was assembled into 13,966 scaffolds with 50% of the assembly = 33,561 bp. The assembly covered 92.3% of the genome, including the complete chloroplast genome (approximately 153 kb) and a nearly complete mitochondrial genome (approximately 450 kb in 120 scaffolds). The nuclear genome is composed of 44,592 protein-coding genes. Genome resequencing of seven additional horseweed biotypes was performed. These sequence data were assembled and used to analyze genome variation. Simple sequence repeat and single-nucleotide polymorphisms were surveyed. Genomic patterns were detected that associated with glyphosate-resistant or -susceptible biotypes. The draft genome will be useful to better understand weediness and the evolution of herbicide resistance and to devise new management strategies. The genome will also be useful as another reference genome in the Compositae. To our knowledge, this article represents the first published draft genome of an agricultural weed.

  3. Genomic libraries: II. Subcloning, sequencing, and assembling large-insert genomic DNA clones.

    Science.gov (United States)

    Quail, Mike A; Matthews, Lucy; Sims, Sarah; Lloyd, Christine; Beasley, Helen; Baxter, Simon W

    2011-01-01

    Sequencing large insert clones to completion is useful for characterizing specific genomic regions, identifying haplotypes, and closing gaps in whole genome sequencing projects. Despite being a standard technique in molecular laboratories, DNA sequencing using the Sanger method can be highly problematic when complex secondary structures or sequence repeats are encountered in genomic clones. Here, we describe methods to isolate DNA from a large insert clone (fosmid or BAC), subclone the sample, and sequence the region to the highest industry standard. Troubleshooting solutions for sequencing difficult templates are discussed.

  4. Assembly and characterization of megaTALs for hyperspecific genome engineering applications.

    Science.gov (United States)

    Boissel, Sandrine; Scharenberg, Andrew M

    2015-01-01

    Rare-cleaving nucleases have emerged as valuable tools for creating targeted genomic modification for both therapeutic and research applications. MegaTALs are novel monomeric nucleases composed of a site-specific meganuclease cleavage head with additional affinity and specificity provided by a TAL effector DNA binding domain. This fusion product facilitates the transformation of meganucleases into hyperspecific and highly active genome engineering tools that are amenable to multiplexing and compatible with multiple cellular delivery methods. In this chapter, we describe the process of assembling a megaTAL from a meganuclease, as well as a method for characterization of nuclease cleavage activity in vivo using a fluorescence reporter assay.

  5. Genome-wide assembly and analysis of alternative transcripts in mouse

    OpenAIRE

    Sharov, Alexei A; Dudekula, Dawood B.; Minoru S.H. Ko

    2005-01-01

    To build a mouse gene index with the most comprehensive coverage of alternative transcription/splicing (ATS), we developed an algorithm and a fully automated computational pipeline for transcript assembly from expressed sequences aligned to the genome. We identified 191,946 genomic loci, which included 27,497 protein-coding genes and 11,906 additional gene candidates (e.g., nonprotein-coding, but multiexon). Comparison of the resulting gene index with TIGR, UniGene, DoTS, and ESTGenes databas...

  6. Long-read sequencing improves assembly of Trichinella genomes 10-fold, revealing substantial synteny between lineages diverged over seven million years

    Science.gov (United States)

    Genome evolution influences a parasite’s pathogenicity, host-pathogen interactions, environmental constraints, and invasion biology, while genome assemblies form the basis of comparative sequence analyses. Given that closely related organisms typically maintain appreciable synteny, the genome asse...

  7. Regulation of DnaA Assembly and Activity: Taking Directions From the Genome

    OpenAIRE

    2011-01-01

    To ensure proper timing of chromosome duplication during the cell cycle, bacteria must carefully regulate the activity of initiator protein, DnaA, and its interactions with the unique replication origin, oriC. Although several protein regulators of DnaA are known, recent evidence suggests that DnaA recognition sites, in multiple genomic locations, also play an important role in controlling assembly of pre-replication complexes. In oriC, closely spaced high and low affinity recognition sites d...

  8. The Whole Genome Assembly and Comparative Genomic Research of Thellungiella parvula (Extremophile Crucifer) Mitochondrion

    Directory of Open Access Journals (Sweden)

    Xuelin Wang

    2016-01-01

    The complete nucleotide sequence of the mitochondrial (mt) genome of the extremophile species Thellungiella parvula (T. parvula) has been determined, with a length of 255,773 bp. The T. parvula mt genome is a circular sequence and contains 32 protein-coding genes, 19 tRNA genes, and three ribosomal RNA genes, with an 11.5% coding sequence. The base composition of 27.5% A, 27.5% T, 22.7% C, and 22.3% G in descending order shows a slight bias of 55% AT. Fifty-three repeats were identified in the mitochondrial genome of T. parvula, including 24 direct repeats, 28 tandem repeats (TRs), and one palindromic repeat. Furthermore, a total of 199 perfect microsatellites with a high A/T content (83.1%) were mined through simple sequence repeat (SSR) analysis, and they were distributed unevenly within this mitochondrial genome. We also analyzed the evolution of other plant mitochondrial genomes in general, providing clues for understanding the evolution of organelle genomes in plants. Compared with other Brassicaceae species, T. parvula is related to Arabidopsis thaliana, whose low-temperature resistance has been well documented. This study will provide important genetic tools for research on other Brassicaceae species and help improve yields of economically important plants.

  9. Optimizing de novo transcriptome assembly and extending genomic resources for striped catfish (Pangasianodon hypophthalmus).

    Science.gov (United States)

    Thanh, Nguyen Minh; Jung, Hyungtaek; Lyons, Russell E; Njaci, Isaac; Yoon, Byoung-Ha; Chand, Vincent; Tuan, Nguyen Viet; Thu, Vo Thi Minh; Mather, Peter

    2015-10-01

    Striped catfish (Pangasianodon hypophthalmus) is a commercially important freshwater fish used in inland aquaculture in the Mekong Delta, Vietnam. The culture industry is facing a significant challenge however from saltwater intrusion into many low topographical coastal provinces across the Mekong Delta as a result of predicted climate change impacts. Developing genomic resources for this species can facilitate the production of improved culture lines that can withstand raised salinity conditions, and so we have applied high-throughput Ion Torrent sequencing of transcriptome libraries from six target osmoregulatory organs from striped catfish as a genomic resource for use in future selection strategies. We obtained 12,177,770 reads after trimming and processing with an average length of 97bp. De novo assemblies were generated using CLC Genomic Workbench, Trinity and Velvet/Oases with the best overall contig performance resulting from the CLC assembly. De novo assembly using CLC yielded 66,451 contigs with an average length of 478bp and N50 length of 506bp. A total of 37,969 contigs (57%) possessed significant similarity with proteins in the non-redundant database. Comparative analyses revealed that a significant number of contigs matched sequences reported in other teleost fishes, ranging in similarity from 45.2% with Atlantic cod to 52% with zebrafish. In addition, 28,879 simple sequence repeats (SSRs) and 55,721 single nucleotide polymorphisms (SNPs) were detected in the striped catfish transcriptome. The sequence collection generated in the current study represents the most comprehensive genomic resource for P. hypophthalmus available to date. Our results illustrate the utility of next-generation sequencing as an efficient tool for constructing a large genomic database for marker development in non-model species.
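    The abstract above reports both an average contig length and an N50 length. As a reminder of how N50 differs from a simple mean, here is a minimal sketch of the standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The contig lengths in the example are toy values, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the cumulative sum first reaches half the total length.
    """
    if not lengths:
        raise ValueError("n50() needs at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total = 1500, half = 750; 500 + 400 = 900 reaches it.
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))  # 400
```

    Because N50 weights long contigs more heavily than the mean does, it is usually read alongside the average length rather than in place of it.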

  10. Genomic donor cassette sharing during VLRA and VLRC assembly in jawless vertebrates.

    Science.gov (United States)

    Das, Sabyasachi; Li, Jianxu; Holland, Stephen J; Iyer, Lakshminarayan M; Hirano, Masayuki; Schorpp, Michael; Aravind, L; Cooper, Max D; Boehm, Thomas

    2014-10-14

    Lampreys possess two T-like lymphocyte lineages that express either variable lymphocyte receptor (VLR) A or VLRC antigen receptors. VLRA(+) and VLRC(+) lymphocytes share many similarities with the two principal T-cell lineages of jawed vertebrates expressing the αβ and γδ T-cell receptors (TCRs). During the assembly of VLR genes, several types of genomic cassettes are inserted, in step-wise fashion, into incomplete germ-line genes to generate the mature forms of antigen receptor genes. Unexpectedly, the structurally variable components of VLRA and VLRC receptors often possess partially identical sequences; this phenomenon of module sharing between these two VLR isotypes occurs in both lampreys and hagfishes. By contrast, VLRA and VLRC molecules typically do not share their building blocks with the structurally analogous VLRB receptors that are expressed by B-like lymphocytes. Our studies reveal that VLRA and VLRC germ-line genes are situated in close proximity to each other in the lamprey genome and indicate the interspersed arrangement of isotype-specific and shared genomic donor cassettes; these features may facilitate the shared cassette use. The genomic structure of the VLRA/VLRC locus in lampreys is reminiscent of the interspersed nature of the TCRA/TCRD locus in jawed vertebrates that also allows the sharing of some variable gene segments during the recombinatorial assembly of TCR genes.

  11. Endogenous avian leukosis viral loci in the Red Jungle Fowl genome assembly.

    Science.gov (United States)

    Benkel, Bernhard; Rutherford, Katherine

    2014-12-01

    The current build (galGal4) of the genome of the ancestor of the modern chicken, the Red Jungle Fowl, contains a single endogenous avian leukosis viral element (ALVE) on chromosome 1 (designated RSV-LTR; family ERVK). The assembly shows the ALVE provirus juxtaposed with a member of a second family of avian endogenous retroviruses (designated GGERV20; family ERVL); however, the status of the 3' end of the ALVE element as well as its flanking region remain unclear due to a gap in the reference genome sequence. In this study, we filled the gap in the assembly using a combination of long-range PCR (LR-PCR) and a short contig present in the unassembled portion of the reference genome database. Our results demonstrate that the ALVE element (ALVE-JFevB) is inserted into the putative envelope region of a GGERV20 element, roughly 1 kbp from its 3' end, and that ALVE-JFevB is complete, and depending on its expression status, potentially capable of directing the production of virus. Moreover, the unassembled portion of the genome database contains junction fragments for a second, previously characterized endogenous proviral element, ALVE-6.

  12. Improvement of genome assembly completeness and identification of novel full-length protein-coding genes by RNA-seq in the giant panda genome.

    Science.gov (United States)

    Chen, Meili; Hu, Yibo; Liu, Jingxing; Wu, Qi; Zhang, Chenglin; Yu, Jun; Xiao, Jingfa; Wei, Fuwen; Wu, Jiayan

    2015-12-11

    High-quality and complete gene models are the basis of whole genome analyses. The giant panda (Ailuropoda melanoleuca) genome was the first genome sequenced on the basis of solely short reads, but the genome annotation had lacked the support of transcriptomic evidence. In this study, we applied RNA-seq to globally improve the genome assembly completeness and to detect novel expressed transcripts in 12 tissues from giant pandas, by using a transcriptome reconstruction strategy that combined reference-based and de novo methods. Several aspects of genome assembly completeness in the transcribed regions were effectively improved by the de novo assembled transcripts, including genome scaffolding, the detection of small-size assembly errors, the extension of scaffold/contig boundaries, and gap closure. Through expression and homology validation, we detected three groups of novel full-length protein-coding genes. A total of 12.62% of the novel protein-coding genes were validated by proteomic data. GO annotation analysis showed that some of the novel protein-coding genes were involved in pigmentation, anatomical structure formation and reproduction, which might be related to the development and evolution of the black-white pelage, pseudo-thumb and delayed embryonic implantation of giant pandas. The updated genome annotation will help further giant panda studies from both structural and functional perspectives.

  13. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes

    NARCIS (Netherlands)

    Nielsen, H.B.; Almeida, M.; Sierakowska Juncker, A.; Rasmussen, S.; Li, J.; Sunagawa, S.; Plichta, D.R.; Gautier, L.; Pedersen, A.G.; Chatelier, Le E.; Pelletier, E.; Bonde, I.; Nielsen, T.; Manichanh, C.; Arumugam, M.; Batto, J.M.; Quintanilha dos Santos, M.B.; Blom, N.; Borruel, N.; Burgdorf, K.S.; Boumezbeur, F.; Casellas, F.; Doré, J.; Dworzynski, P.; Guarner, F.; Hansen, T.; Hildebrand, F.; Kaas, R.S.; Kennedy, S.; Kristiansen, K.; Kultima, J.R.; Leonard, P.; Levenez, F.; Lund, O.; Moumen, B.; Paslier, Le D.; Pons, N.; Pedersen, O.; Prifti, E.; Qin, J.; Raes, J.; Sørensen, S.; Tap, J.; Tims, S.; Ussery, D.W.; Yamada, T.; Jamet, A.; Mérieux, A.; Cultrone, A.; Torrejon, A.; Quinquis, B.; Brechot, C.; Delorme, C.; M'Rini, C.; Vos, de W.M.; Maguin, E.; Varela, E.; Guedon, E.; Gwen, F.; Haimet, F.; Artiguenave, F.; Vandemeulebrouck, G.; Denariaz, G.; Khaci, G.; Blottière, H.; Knol, J.; Weissenbach, J.; Hylckama Vlieg, van J.E.; Torben, J.; Parkhil, J.; Turner, K.; Guchte, van de M.; Antolin, M.; Rescigno, M.; Kleerebezem, M.; Derrien, M.; Galleron, N.; Sanchez, N.; Grarup, N.; Veiga, P.; Oozeer, R.; Dervyn, R.; Layec, S.; Bruls, T.; Winogradski, Y.; Zoetendal, E.G.; Renault, D.; Sicheritz-Ponten,; Bork, P.; Wang, J.; Brunak, S.; Ehrlich, S.D.

    2014-01-01

    Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as part

  14. High-coverage sequencing and annotated assembly of the genome of the Australian dragon lizard Pogona vitticeps.

    Science.gov (United States)

    Georges, Arthur; Li, Qiye; Lian, Jinmin; O'Meally, Denis; Deakin, Janine; Wang, Zongji; Zhang, Pei; Fujita, Matthew; Patel, Hardip R; Holleley, Clare E; Zhou, Yang; Zhang, Xiuwen; Matsubara, Kazumi; Waters, Paul; Graves, Jennifer A Marshall; Sarre, Stephen D; Zhang, Guojie

    2015-01-01

    The lizards of the family Agamidae are one of the most prominent elements of the Australian reptile fauna. Here, we present a genomic resource built on the basis of a wild-caught male ZZ central bearded dragon Pogona vitticeps. The genomic sequence for P. vitticeps, generated on the Illumina HiSeq 2000 platform, comprised 317 Gbp (179X raw read depth) from 13 insert libraries ranging from 250 bp to 40 kbp. After filtering for low-quality and duplicated reads, 146 Gbp of data (83X) was available for assembly. Exceptionally high levels of heterozygosity (0.85 % of single nucleotide polymorphisms plus sequence insertions or deletions) complicated assembly; nevertheless, 96.4 % of reads mapped back to the assembled scaffolds, indicating that the assembly included most of the sequenced genome. Length of the assembly was 1.8 Gbp in 545,310 scaffolds (69,852 longer than 300 bp), the longest being 14.68 Mbp. N50 was 2.29 Mbp. Genes were annotated on the basis of de novo prediction, similarity to the green anole Anolis carolinensis, Gallus gallus and Homo sapiens proteins, and P. vitticeps transcriptome sequence assemblies, to yield 19,406 protein-coding genes in the assembly, 63 % of which had intact open reading frames. Our assembly captured 99 % (246 of 248) of core CEGMA genes, with 93 % (231) being complete. The quality of the P. vitticeps assembly is comparable or superior to that of other published squamate genomes, and the annotated P. vitticeps genome can be accessed through a genome browser available at https://genomics.canberra.edu.au.
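    The read-depth figures quoted above imply a genome size, under the usual assumption that depth is total sequenced bases divided by genome size. The sketch below back-calculates that implied size from the raw and filtered data sets; the function name is illustrative, and the result is an inference from the abstract's numbers rather than a value the authors report.

```python
def implied_genome_size_gbp(total_gbp, depth):
    """Genome size (Gbp) implied by depth = sequenced bases / genome size."""
    return total_gbp / depth


# Figures from the abstract: 317 Gbp at 179X (raw), 146 Gbp at 83X (filtered).
raw_estimate = implied_genome_size_gbp(317, 179)
filtered_estimate = implied_genome_size_gbp(146, 83)

# Both estimates come out near 1.77 and 1.76 Gbp, close to the
# reported assembly length of 1.8 Gbp.
print(f"{raw_estimate:.2f} {filtered_estimate:.2f}")
```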

  15. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes

    DEFF Research Database (Denmark)

    Nielsen, Henrik Bjørn; Almeida, Mathieu; Juncker, Agnieszka

    2014-01-01

    , such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly...... affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples....

  16. High-Quality de Novo Genome Assembly of the Dekkera bruxellensis Yeast Isolate Using Nanopore MinION Sequencing.

    Science.gov (United States)

    Fournier, Téo; Gounot, Jean-Sébastien; Freel, Kelle; Cruaud, Corinne; Lemainque, Arnaud; Aury, Jean-Marc; Wincker, Patrick; Schacherer, Joseph; Friedrich, Anne

    2017-08-09

    Genetic variation in natural populations represents the raw material for phenotypic diversity. Species-wide characterization of genetic variants is crucial to have a deeper insight into the genotype-phenotype relationship. With the advent of new sequencing strategies and more recently the release of long-read sequencing platforms, it is now possible to explore the genetic diversity of any non-model organisms, representing a fundamental resource for biological research. In the frame of population genomic surveys, a first step is to obtain the complete sequence and high quality assembly of a reference genome. Here, we sequenced and assembled a reference genome of the non-conventional Dekkera bruxellensis yeast. While this species is a major cause of wine spoilage, it paradoxically contributes to the specific flavor profile of some Belgian beers. In addition, an extreme karyotype variability is observed across natural isolates, highlighting that the D. bruxellensis genome is very dynamic. The whole genome of the D. bruxellensis UMY321 isolate was sequenced using a combination of Nanopore long-read and Illumina short-read sequencing data. We generated the most complete and contiguous de novo assembly of D. bruxellensis to date and obtained a first glimpse into the genomic variability within this species by comparing the sequences of several isolates. This genome sequence is therefore of high value for population genomic surveys and represents a reference to study genome dynamics in this yeast species. Copyright © 2017, G3: Genes, Genomes, Genetics.

  17. A multi-platform draft de novo genome assembly and comparative analysis for the Scarlet Macaw (Ara macao).

    Science.gov (United States)

    Seabury, Christopher M; Dowd, Scot E; Seabury, Paul M; Raudsepp, Terje; Brightsmith, Donald J; Liboriussen, Poul; Halley, Yvette; Fisher, Colleen A; Owens, Elaine; Viswanathan, Ganesh; Tizard, Ian R

    2013-01-01

    Data deposition to NCBI Genomes: This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession AMXX00000000 (SMACv1.0, unscaffolded genome assembly). The version described in this paper is the first version (AMXX01000000). The scaffolded assembly (SMACv1.1) has been deposited at DDBJ/EMBL/GenBank under the accession AOUJ00000000, and is also the first version (AOUJ01000000). Strong biological interest in traits such as the acquisition and utilization of speech, cognitive abilities, and longevity catalyzed the utilization of two next-generation sequencing platforms to provide the first-draft de novo genome assembly for the large, new world parrot Ara macao (Scarlet Macaw). Despite the challenges associated with genome assembly for an outbred avian species, including 951,507 high-quality putative single nucleotide polymorphisms, the final genome assembly (>1.035 Gb) includes more than 997 Mb of unambiguous sequence data (excluding N's). Cytogenetic analyses including ZooFISH revealed complex rearrangements associated with two scarlet macaw macrochromosomes (AMA6, AMA7), which supports the hypothesis that translocations, fusions, and intragenomic rearrangements are key factors associated with karyotype evolution among parrots. In silico annotation of the scarlet macaw genome provided robust evidence for 14,405 nuclear gene annotation models, their predicted transcripts and proteins, and a complete mitochondrial genome. Comparative analyses involving the scarlet macaw, chicken, and zebra finch genomes revealed high levels of nucleotide-based conservation as well as evidence for overall genome stability among the three highly divergent species. Application of a new whole-genome analysis of divergence involving all three species yielded prioritized candidate genes and noncoding regions for parrot traits of interest (i.e., speech, intelligence, longevity) which were independently supported by the results of previous human GWAS studies. 

  18. Assembly and comparative analysis of complete mitochondrial genome sequence of an economic plant Salix suchowensis

    Directory of Open Access Journals (Sweden)

    Ning Ye

    2017-03-01

    Willow is a widely used dioecious woody plant of the Salicaceae family in China. Due to their high biomass yields, willows are promising sources for bioenergy crops. In this study, we assembled the complete mitochondrial (mt) genome sequence of S. suchowensis, with a length of 644,437 bp, using Roche-454 GS FLX Titanium sequencing technologies. The base composition of the S. suchowensis mt genome is A (27.43%), T (27.59%), C (22.34%), and G (22.64%), which shows a GC content in line with that of other angiosperms. This long circular mt genome encodes 58 unique genes (32 protein-coding genes, 23 tRNA genes and 3 rRNA genes), and 9 of the 32 protein-coding genes contain 17 introns. Phylogenetic analysis of 35 species based on 23 protein-coding genes supports Salix as sister to Populus. With the detailed phylogenetic information and the identification of phylogenetic position, some ribosomal protein genes and succinate dehydrogenase genes are found to be frequently lost during evolution. As a study of a native shrub willow species, this research on the S. suchowensis mt genome will provide information for better understanding genomic breeding and the missing pieces of sex determination evolution in the future.

  19. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo) genome assembly and analysis

    Science.gov (United States)

    Next-generation sequencing technologies were used to rapidly and efficiently sequence the genome of the domestic turkey (Meleagris gallopavo). The current genome assembly (~1.1 Gb) includes 917 Mb of sequence assigned to chromosomes. Innate heterozygosity of the sequenced bird allowed discovery of...

  20. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis

    NARCIS (Netherlands)

    Dalloul, R.A.; Long, J.A.; Zimin, A.V.; Aslam, M.L.; Crooijmans, R.P.M.A.; Megens, H.J.W.C.; Groenen, M.

    2010-01-01

    A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more th

  1. De novo Transcriptome Assemblies of Rana (Lithobates) catesbeiana and Xenopus laevis Tadpole Livers for Comparative Genomics without Reference Genomes.

    Directory of Open Access Journals (Sweden)

    Inanc Birol

    In this work we studied the liver transcriptomes of two frog species, the American bullfrog (Rana (Lithobates) catesbeiana) and the African clawed frog (Xenopus laevis). We used high throughput RNA sequencing (RNA-seq) data to assemble and annotate these transcriptomes, and compared how their baseline expression profiles change when tadpoles of the two species are exposed to thyroid hormone. We generated more than 1.5 billion RNA-seq reads in total for the two species under two conditions as treatment/control pairs. We de novo assembled these reads using Trans-ABySS to reconstruct reference transcriptomes, obtaining over 350,000 and 130,000 putative transcripts for R. catesbeiana and X. laevis, respectively. Using available genomics resources for X. laevis, we annotated over 97% of our X. laevis transcriptome contigs, demonstrating the utility and efficacy of our methodology. Leveraging this validated analysis pipeline, we also annotated the assembled R. catesbeiana transcriptome. We used the expression profiles of the annotated genes of the two species to examine the similarities and differences between the tadpole liver transcriptomes. We also compared the gene ontology terms of expressed genes to measure how the animals react to a challenge by thyroid hormone. Our study reports three main conclusions. First, de novo assembly of RNA-seq data is a powerful method for annotating and establishing transcriptomes of non-model organisms. Second, the liver transcriptomes of the two frog species, R. catesbeiana and X. laevis, show many common features, and the distributions of their gene ontology profiles are statistically indistinguishable. Third, although they broadly respond the same way to the presence of thyroid hormone in their environment, their receptor/signal transduction pathways display marked differences.

  2. HaVec: An Efficient de Bruijn Graph Construction Algorithm for Genome Assembly

    Directory of Open Access Journals (Sweden)

    Md Mahfuzer Rahman

    2017-01-01

    Background. The rapid advancement of sequencing technologies has made it possible to regularly produce millions of high-quality reads from the DNA samples in the sequencing laboratories. To this end, the de Bruijn graph is a popular data structure in the genome assembly literature for efficient representation and processing of data. Due to the number of nodes in a de Bruijn graph, the main barrier here is the memory and runtime. Therefore, this area has received significant attention in contemporary literature. Results. In this paper, we present an approach called HaVec that attempts to achieve a balance between the memory consumption and the running time. HaVec uses a hash table along with an auxiliary vector data structure to store the de Bruijn graph thereby improving the total memory usage and the running time. A critical and noteworthy feature of HaVec is that it exhibits no false positive error. Conclusions. In general, the graph construction procedure takes the major share of the time involved in an assembly process. HaVec can be seen as a significant advancement in this aspect. We anticipate that HaVec will be extremely useful in the de Bruijn graph-based genome assembly.
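    For readers unfamiliar with the data structure discussed above, the following is a minimal sketch of de Bruijn graph construction from reads. It is not HaVec: it uses a plain Python dictionary of sets as a stand-in for the hash table and auxiliary vector described in the abstract, and it ignores reverse complements and sequencing errors. The function name and the example read are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer found in a read contributes one
    directed edge from its (k-1)-prefix to its (k-1)-suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy example: one read, k = 3, giving the 3-mers ACG, CGT, GTA, TAC.
graph = build_de_bruijn(["ACGTAC"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

    A dictionary of sets keeps the example short, but its per-node overhead is exactly the kind of memory cost that motivates more compact representations for real genomes.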

  3. An Efficient Genome Fragment Assembling Using GA with Neighborhood Aware Fitness Function

    Directory of Open Access Journals (Sweden)

    Satoko Kikuchi

    2012-01-01

    To decode a long genome sequence, shotgun sequencing is the state-of-the-art technique. It needs to properly sequence a very large number, sometimes as large as millions, of short partially readable strings (fragments). Arranging those fragments in correct sequence is known as fragment assembling, which is an NP-problem. Presently used methods require enormous computational cost. In this work, we have shown how our modified genetic algorithm (GA) could solve this problem efficiently. In the proposed GA, the length of the chromosome, which represents the volume of the search space, is reduced with advancing generations, and thereby improves search efficiency. We also introduced a greedy mutation, by swapping nearby fragments using some heuristics, to improve the fitness of chromosomes. We compared results with Parsons’ algorithm, which is also based on GA. We used fragments with partial reads on both sides, mimicking fragments in the real genome assembling process. In Parsons’ work, the base-pair array of the whole fragment is known. Even then, we could obtain much better results, and we succeeded in restructuring contigs covering 100% of the genome sequences.

  4. Moleculo Long-Read Sequencing Facilitates Assembly and Genomic Binning from Complex Soil Metagenomes

    Energy Technology Data Exchange (ETDEWEB)

    White, Richard Allen; Bottos, Eric M.; Roy Chowdhury, Taniya; Zucker, Jeremy D.; Brislawn, Colin J.; Nicora, Carrie D.; Fansler, Sarah J.; Glaesemann, Kurt R.; Glass, Kevin; Jansson, Janet K.; Langille, Morgan

    2016-06-28

    ABSTRACT

    Soil metagenomics has been touted as the “grand challenge” for metagenomics, as the high microbial diversity and spatial heterogeneity of soils make them unamenable to current assembly platforms. Here, we aimed to improve soil metagenomic sequence assembly by applying the Moleculo synthetic long-read sequencing technology. In total, we obtained 267 Gbp of raw sequence data from a native prairie soil; these data included 109.7 Gbp of short-read data (~100 bp) from the Joint Genome Institute (JGI), an additional 87.7 Gbp of rapid-mode read data (~250 bp), plus 69.6 Gbp (>1.5 kbp) from Moleculo sequencing. The Moleculo data alone yielded over 5,600 reads of >10 kbp in length, and over 95% of the unassembled reads mapped to contigs of >1.5 kbp. Hybrid assembly of all data resulted in more than 10,000 contigs over 10 kbp in length. We mapped three replicate metatranscriptomes derived from the same parent soil to the Moleculo subassembly and found that 95% of the predicted genes, based on their assignments to Enzyme Commission (EC) numbers, were expressed. The Moleculo subassembly also enabled binning of >100 microbial genome bins. We obtained via direct binning the first complete genome, that of “Candidatus Pseudomonas sp. strain JKJ-1” from a native soil metagenome. By mapping metatranscriptome sequence reads back to the bins, we found that several bins corresponding to low-relative-abundance Acidobacteria were highly transcriptionally active, whereas bins corresponding to high-relative-abundance Verrucomicrobia were not. These results demonstrate that Moleculo sequencing provides a significant advance for resolving complex soil microbial communities.

    IMPORTANCE Soil microorganisms carry out key processes for life on our planet, including cycling of carbon and other nutrients and supporting growth of plants. However, there is poor molecular-level understanding of their

  5. Assembly of the Complete Sitka Spruce Chloroplast Genome Using 10X Genomics’ GemCode Sequencing Data

    Science.gov (United States)

    Coombe, Lauren; Jackman, Shaun D.; Yang, Chen; Vandervalk, Benjamin P.; Moore, Richard A.; Pleasance, Stephen; Coope, Robin J.; Bohlmann, Joerg; Holt, Robert A.; Jones, Steven J. M.; Birol, Inanc

    2016-01-01

    The linked read sequencing library preparation platform by 10X Genomics produces barcoded sequencing libraries, which are subsequently sequenced using the Illumina short read sequencing technology. In this new approach, long fragments of DNA are partitioned into separate micro-reactions, where the same index sequence is incorporated into each of the sequencing fragment inserts derived from a given long fragment. In this study, we exploited this property by using reads from index sequences associated with a large number of reads, to assemble the chloroplast genome of the Sitka spruce tree (Picea sitchensis). Here we report on the first Sitka spruce chloroplast genome assembled exclusively from P. sitchensis genomic libraries prepared using the 10X Genomics protocol. We show that the resulting 124,049 base pair long genome shares high sequence similarity with the related white spruce and Norway spruce chloroplast genomes, but diverges substantially from a previously published P. sitchensis- P. thunbergii chimeric genome. The use of reads from high-frequency indices enabled separation of the nuclear genome reads from that of the chloroplast, which resulted in the simplification of the de Bruijn graphs used at the various stages of assembly. PMID:27632164

  6. Insights into specific DNA recognition during the assembly of a viral genome packaging machine.

    Science.gov (United States)

    de Beer, Tonny; Fang, Jenny; Ortega, Marcos; Yang, Qin; Maes, Levi; Duffy, Carol; Berton, Nancy; Sippy, Jean; Overduin, Michael; Feiss, Michael; Catalano, Carlos Enrique

    2002-05-01

    Terminase enzymes mediate genome "packaging" during the reproduction of DNA viruses. In lambda, the gpNu1 subunit guides site-specific assembly of terminase onto DNA. The structure of the dimeric DNA binding domain of gpNu1 was solved using nuclear magnetic resonance spectroscopy. Its fold contains a unique winged helix-turn-helix (wHTH) motif within a novel scaffold. Surprisingly, a predicted P loop ATP binding motif is in fact the wing of the DNA binding motif. Structural and genetic analysis has identified determinants of DNA recognition specificity within the wHTH motif and the DNA recognition sequence. The structure reveals an unexpected DNA binding mode and provides a mechanistic basis for the concerted action of gpNu1 and Escherichia coli integration host factor during assembly of the packaging machinery.

  7. Chromosomal-Level Assembly of the Asian Seabass Genome Using Long Sequence Reads and Multi-layered Scaffolding.

    Directory of Open Access Journals (Sweden)

    Shubha Vij

    2016-04-01

    Full Text Available We report here the ~670 Mb genome assembly of the Asian seabass (Lates calcarifer), a tropical marine teleost. We used long-read sequencing augmented by transcriptomics, optical and genetic mapping along with shared synteny from closely related fish species to derive a chromosome-level assembly with a contig N50 size over 1 Mb and scaffold N50 size over 25 Mb that span ~90% of the genome. The population structure of L. calcarifer species complex was analyzed by re-sequencing 61 individuals representing various regions across the species' native range. SNP analyses identified high levels of genetic diversity and confirmed earlier indications of a population stratification comprising three clades with signs of admixture apparent in the South-East Asian population. The quality of the Asian seabass genome assembly far exceeds that of any other fish species, and will serve as a new standard for fish genomics.
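    The contig and scaffold N50 values quoted above are contiguity metrics: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function name `n50` and the example lengths are illustrative and not taken from the seabass assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are taken from longest to shortest until the running
    total reaches at least half of the summed length; the length of
    the last sequence added is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")
```

    For lengths 2, 3, 4, 5 and 6 (total 20), the two longest sequences sum to 11, which passes the halfway point of 10, so the N50 is 5.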

  8. The assembly and annotation of the complete Rufous-bellied thrush mitochondrial genome.

    Science.gov (United States)

    Gomes de Sá, Pablo; Veras, Adonney; Fontana, Carla Suertegaray; Aleixo, Alexandre; Burlamaqui, Tibério; Mello, Claudio Vianna; de Vasconcelos, Ana Tereza Ribeiro; Prosdocimi, Francisco; Ramos, Rommel; Schneider, Maria; Silva, Artur

    2017-03-01

    Among known bird species, oscines are one of the few groups that produce complex vocalizations due to vocal learning. One of the most conspicuous oscine passerines in southeastern South America is the Rufous-bellied Thrush, Turdus rufiventris. The complete mitochondrial genome of this species was sequenced with the Illumina HiSeq platform (Illumina Inc., San Diego, CA), assembled using MITObim software and annotated by MITOS web server and Artemis software. This mitogenome contained 16 669 bases, organized as 13 protein-coding genes, 22 transfer RNAs, two ribosomal RNAs, and a control region (d-loop). The sequencing of the Rufous-bellied Thrush mitochondrial genome is of particular interest for better understanding of population genetics and phylogeography of the Turdidae family.

  9. TALEN construction via "Unit Assembly" method and targeted genome modifications in zebrafish.

    Science.gov (United States)

    Huang, Peng; Xiao, An; Tong, Xiangjun; Zu, Yao; Wang, Zhanxiang; Zhang, Bo

    2014-08-15

    Transcription activator-like effector nucleases (TALENs) are engineered endonucleases composed of a customized transcription activator-like effector (TALE) DNA-binding domain and a FokI DNA cleavage domain. TALENs induce DNA double-strand breaks (DSBs) at their target sites on the chromosome and have been successfully used for genome engineering in many species and cultured cells. Zebrafish is a very popular model organism in both basic and clinical research. Here, we describe the details of construction of customized TALENs using the "Unit Assembly" (UA) method, as well as three applications of zebrafish genome manipulations using TALENs: gene knock-out, large chromosome deletion, and gene knock-in by homologous recombination.

  10. Chromosome-level genome assembly and transcriptome of the green alga Chromochloris zofingiensis illuminates astaxanthin production

    Science.gov (United States)

    Roth, Melissa S.; Cokus, Shawn J.; Gallaher, Sean D.; Walter, Andreas; Lopez, David; Erickson, Erika; Endelman, Benjamin; Westcott, Daniel; Larabell, Carolyn A.; Merchant, Sabeeha S.; Pellegrini, Matteo

    2017-01-01

    Microalgae have potential to help meet energy and food demands without exacerbating environmental problems. There is interest in the unicellular green alga Chromochloris zofingiensis, because it produces lipids for biofuels and a highly valuable carotenoid nutraceutical, astaxanthin. To advance understanding of its biology and facilitate commercial development, we present a C. zofingiensis chromosome-level nuclear genome, organelle genomes, and transcriptome from diverse growth conditions. The assembly, derived from a combination of short- and long-read sequencing in conjunction with optical mapping, revealed a compact genome of ∼58 Mbp distributed over 19 chromosomes containing 15,274 predicted protein-coding genes. The genome has uniform gene density over chromosomes, low repetitive sequence content (∼6%), and a high fraction of protein-coding sequence (∼39%) with relatively long coding exons and few coding introns. Functional annotation of gene models identified orthologous families for the majority (∼73%) of genes. Synteny analysis uncovered localized but scrambled blocks of genes in putative orthologous relationships with other green algae. Two genes encoding beta-ketolase (BKT), the key enzyme synthesizing astaxanthin, were found in the genome, and both were up-regulated by high light. Isolation and molecular analysis of astaxanthin-deficient mutants showed that BKT1 is required for the production of astaxanthin. Moreover, the transcriptome under high light exposure revealed candidate genes that could be involved in critical yet missing steps of astaxanthin biosynthesis, including ABC transporters, cytochrome P450 enzymes, and an acyltransferase. The high-quality genome and transcriptome provide insight into the green algal lineage and carotenoid production. PMID:28484037

  11. De-novo assembly and analysis of the heterozygous triploid genome of the wine spoilage yeast Dekkera bruxellensis AWRI1499.

    Science.gov (United States)

    Curtin, Chris D; Borneman, Anthony R; Chambers, Paul J; Pretorius, Isak S

    2012-01-01

    Despite its industrial importance, the yeast species Dekkera (Brettanomyces) bruxellensis has remained poorly understood at the genetic level. In this study we describe whole genome sequencing and analysis for a prevalent wine spoilage strain, AWRI1499. The 12.7 Mb assembly, consisting of 324 contigs in 99 scaffolds (super-contigs) at 26-fold coverage, exhibits a relatively high density of single nucleotide polymorphisms (SNPs). Haplotype sampling for 1.2% of open reading frames suggested that the D. bruxellensis AWRI1499 genome is comprised of a moderately heterozygous diploid genome, in combination with a divergent haploid genome. Gene content analysis revealed enrichment in membrane proteins, particularly transporters, along with oxidoreductase enzymes. Availability of this assembly and annotation provides a resource for further investigation of genomic organization in this species, and functional characterization of genes that may confer important phenotypic traits.

  12. De-novo assembly and analysis of the heterozygous triploid genome of the wine spoilage yeast Dekkera bruxellensis AWRI1499.

    Directory of Open Access Journals (Sweden)

    Chris D Curtin

    Full Text Available Despite its industrial importance, the yeast species Dekkera (Brettanomyces) bruxellensis has remained poorly understood at the genetic level. In this study we describe whole genome sequencing and analysis for a prevalent wine spoilage strain, AWRI1499. The 12.7 Mb assembly, consisting of 324 contigs in 99 scaffolds (super-contigs) at 26-fold coverage, exhibits a relatively high density of single nucleotide polymorphisms (SNPs). Haplotype sampling for 1.2% of open reading frames suggested that the D. bruxellensis AWRI1499 genome is comprised of a moderately heterozygous diploid genome, in combination with a divergent haploid genome. Gene content analysis revealed enrichment in membrane proteins, particularly transporters, along with oxidoreductase enzymes. Availability of this assembly and annotation provides a resource for further investigation of genomic organization in this species, and functional characterization of genes that may confer important phenotypic traits.

  13. ATLAS (Automatic Tool for Local Assembly Structures) - A Comprehensive Infrastructure for Assembly, Annotation, and Genomic Binning of Metagenomic and Metatranscriptomic Data

    Energy Technology Data Exchange (ETDEWEB)

    White, Richard A.; Brown, Joseph M.; Colby, Sean M.; Overall, Christopher C.; Lee, Joon-Yong; Zucker, Jeremy D.; Glaesemann, Kurt R.; Jansson, Georg C.; Jansson, Janet K.

    2017-03-02

    ATLAS (Automatic Tool for Local Assembly Structures) is a comprehensive multiomics data analysis pipeline that is massively parallel and scalable. ATLAS contains a modular analysis pipeline for assembly, annotation, quantification and genome binning of metagenomics and metatranscriptomics data and a framework for reference metaproteomic database construction. ATLAS transforms raw sequence data into functional and taxonomic data at the microbial population level and provides genome-centric resolution through genome binning. ATLAS provides robust taxonomy based on majority voting of protein coding open reading frames rolled-up at the contig level using modified lowest common ancestor (LCA) analysis. ATLAS is user-friendly, easy to install through bioconda, maintained as open source on GitHub, and implemented in Snakemake for modular customizable workflows.

  14. Chromosomal instability in Afrotheria: fragile sites, evolutionary breakpoints and phylogenetic inference from genome sequence assemblies

    Directory of Open Access Journals (Sweden)

    Ruiz-Herrera Aurora

    2007-10-01

    Full Text Available Background. Extant placental mammals are divided into four major clades (Laurasiatheria, Supraprimates, Xenarthra and Afrotheria). Given that Afrotheria is generally thought to root the eutherian tree in phylogenetic analysis of large nuclear gene data sets, the study of the organization of the genomes of afrotherian species provides new insights into the dynamics of mammalian chromosomal evolution. Here we test if there are chromosomal bands with a high tendency to break and reorganize in Afrotheria, and by analyzing the expression of aphidicolin-induced common fragile sites in three afrotherian species, whether these are coincidental with recognized evolutionary breakpoints. Results. We described 29 fragile sites in the aardvark (OAF) genome, 27 in the golden mole (CAS), and 35 in the elephant-shrew (EED) genome. We show that fragile sites are conserved among afrotherian species and these are correlated with evolutionary breakpoints when compared to the human (HSA) genome. In addition, by computationally scanning the newly released opossum (Monodelphis domestica) and chicken sequence assemblies for use as outgroups to Placentalia, we validate the HSA 3/21/5 chromosomal synteny as a rare genomic change that defines the monophyly of this ancient African clade of mammals. On the other hand, support for HSA 1/19p, which is also thought to underpin Afrotheria, is currently ambiguous. Conclusion. We provide evidence that (i) the evolutionary breakpoints that characterise human syntenies detected in the basal Afrotheria correspond at the chromosomal band level with fragile sites, (ii) that HSA 3p/21 was in the amniote ancestor (i.e., common to turtles, lepidosaurs, crocodilians, birds and mammals) and was subsequently disrupted in the lineage leading to marsupials. Its expansion to include HSA 5 in Afrotheria is unique and (iii) that its fragmentation to HSA 3p/21 + HSA 5/21 in elephant and manatee was due to a fission within HSA 21 that is probably shared

  15. Highly precise and developmentally programmed genome assembly in Paramecium requires ligase IV-dependent end joining.

    Directory of Open Access Journals (Sweden)

    Aurélie Kapusta

    2011-04-01

    Full Text Available During the sexual cycle of the ciliate Paramecium, assembly of the somatic genome includes the precise excision of tens of thousands of short, non-coding germline sequences (Internal Eliminated Sequences or IESs), each one flanked by two TA dinucleotides. It has been reported previously that these genome rearrangements are initiated by the introduction of developmentally programmed DNA double-strand breaks (DSBs), which depend on the domesticated transposase PiggyMac. These DSBs all exhibit a characteristic geometry, with 4-base 5' overhangs centered on the conserved TA, and may readily align and undergo ligation with minimal processing. However, the molecular steps and actors involved in the final and precise assembly of somatic genes have remained unknown. We demonstrate here that Ligase IV and Xrcc4p, core components of the non-homologous end-joining pathway (NHEJ), are required both for the repair of IES excision sites and for the circularization of excised IESs. The transcription of LIG4 and XRCC4 is induced early during the sexual cycle and a Lig4p-GFP fusion protein accumulates in the developing somatic nucleus by the time IES excision takes place. RNAi-mediated silencing of either gene results in the persistence of free broken DNA ends, apparently protected against extensive resection. At the nucleotide level, controlled removal of the 5'-terminal nucleotide occurs normally in LIG4-silenced cells, while nucleotide addition to the 3' ends of the breaks is blocked, together with the final joining step, indicative of a coupling between NHEJ polymerase and ligase activities. Taken together, our data indicate that IES excision is a "cut-and-close" mechanism, which involves the introduction of initiating double-strand cleavages at both ends of each IES, followed by DSB repair via highly precise end joining. This work broadens our current view on how the cellular NHEJ pathway has cooperated with domesticated transposases for the emergence of new

  16. On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes.

    Science.gov (United States)

    Bonizzoni, Paola; Dondi, Riccardo; Klau, Gunnar W; Pirola, Yuri; Pisanti, Nadia; Zaccaria, Simone

    2016-09-01

    In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.
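    The MEC objective described above can be stated compactly in code. The sketch below scores a candidate pair of haplotypes by assigning each fragment to the closer haplotype and counting the base corrections needed, then finds an optimum by exhaustive search. It assumes binary alleles, `-` for positions a fragment does not cover, and all sites heterozygous (so the second haplotype is the complement of the first); the exponential search is for illustration on tiny inputs only and is not one of the algorithms from the paper. All function names are hypothetical.

```python
from itertools import product


def mismatches(fragment, haplotype):
    """Count covered positions where the fragment disagrees with the haplotype."""
    return sum(
        1 for f, h in zip(fragment, haplotype) if f != "-" and f != h
    )


def mec_cost(fragments, h1, h2):
    """Minimum number of corrections when each fragment picks its closer haplotype."""
    return sum(
        min(mismatches(f, h1), mismatches(f, h2)) for f in fragments
    )


def brute_force_mec(fragments):
    """Exhaustively search haplotype pairs; returns (cost, h1, h2).

    Assumption: every site is heterozygous, so h2 is the bitwise
    complement of h1. Runs in time exponential in the number of sites.
    """
    n_sites = len(fragments[0])
    best = None
    for bits in product("01", repeat=n_sites):
        h1 = "".join(bits)
        h2 = "".join("1" if b == "0" else "0" for b in h1)
        cost = mec_cost(fragments, h1, h2)
        if best is None or cost < best[0]:
            best = (cost, h1, h2)
    return best
```

    With fragments `010-`, `0101`, `1010` and `-011`, the pair `0101`/`1010` explains the first three fragments without correction and the fourth with a single correction, giving an MEC cost of 1.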

  17. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis.

    Science.gov (United States)

    Dalloul, Rami A; Long, Julie A; Zimin, Aleksey V; Aslam, Luqman; Beal, Kathryn; Blomberg, Le Ann; Bouffard, Pascal; Burt, David W; Crasta, Oswald; Crooijmans, Richard P M A; Cooper, Kristal; Coulombe, Roger A; De, Supriyo; Delany, Mary E; Dodgson, Jerry B; Dong, Jennifer J; Evans, Clive; Frederickson, Karin M; Flicek, Paul; Florea, Liliana; Folkerts, Otto; Groenen, Martien A M; Harkins, Tim T; Herrero, Javier; Hoffmann, Steve; Megens, Hendrik-Jan; Jiang, Andrew; de Jong, Pieter; Kaiser, Pete; Kim, Heebal; Kim, Kyu-Won; Kim, Sungwon; Langenberger, David; Lee, Mi-Kyung; Lee, Taeheon; Mane, Shrinivasrao; Marcais, Guillaume; Marz, Manja; McElroy, Audrey P; Modise, Thero; Nefedov, Mikhail; Notredame, Cédric; Paton, Ian R; Payne, William S; Pertea, Geo; Prickett, Dennis; Puiu, Daniela; Qioa, Dan; Raineri, Emanuele; Ruffier, Magali; Salzberg, Steven L; Schatz, Michael C; Scheuring, Chantel; Schmidt, Carl J; Schroeder, Steven; Searle, Stephen M J; Smith, Edward J; Smith, Jacqueline; Sonstegard, Tad S; Stadler, Peter F; Tafer, Hakim; Tu, Zhijian Jake; Van Tassell, Curtis P; Vilella, Albert J; Williams, Kelly P; Yorke, James A; Zhang, Liqing; Zhang, Hong-Bin; Zhang, Xiaojun; Zhang, Yang; Reed, Kent M

    2010-09-07

    A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.

  18. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis.

    Directory of Open Access Journals (Sweden)

    Rami A Dalloul

    Full Text Available A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.

  19. Multi-Platform Next-Generation Sequencing of the Domestic Turkey (Meleagris gallopavo): Genome Assembly and Analysis

    Science.gov (United States)

    Aslam, Luqman; Beal, Kathryn; Ann Blomberg, Le; Bouffard, Pascal; Burt, David W.; Crasta, Oswald; Crooijmans, Richard P. M. A.; Cooper, Kristal; Coulombe, Roger A.; De, Supriyo; Delany, Mary E.; Dodgson, Jerry B.; Dong, Jennifer J.; Evans, Clive; Frederickson, Karin M.; Flicek, Paul; Florea, Liliana; Folkerts, Otto; Groenen, Martien A. M.; Harkins, Tim T.; Herrero, Javier; Hoffmann, Steve; Megens, Hendrik-Jan; Jiang, Andrew; de Jong, Pieter; Kaiser, Pete; Kim, Heebal; Kim, Kyu-Won; Kim, Sungwon; Langenberger, David; Lee, Mi-Kyung; Lee, Taeheon; Mane, Shrinivasrao; Marcais, Guillaume; Marz, Manja; McElroy, Audrey P.; Modise, Thero; Nefedov, Mikhail; Notredame, Cédric; Paton, Ian R.; Payne, William S.; Pertea, Geo; Prickett, Dennis; Puiu, Daniela; Qioa, Dan; Raineri, Emanuele; Ruffier, Magali; Salzberg, Steven L.; Schatz, Michael C.; Scheuring, Chantel; Schmidt, Carl J.; Schroeder, Steven; Searle, Stephen M. J.; Smith, Edward J.; Smith, Jacqueline; Sonstegard, Tad S.; Stadler, Peter F.; Tafer, Hakim; Tu, Zhijian (Jake); Van Tassell, Curtis P.; Vilella, Albert J.; Williams, Kelly P.; Yorke, James A.; Zhang, Liqing; Zhang, Hong-Bin; Zhang, Xiaojun; Zhang, Yang; Reed, Kent M.

    2010-01-01

    A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest. PMID:20838655

  20. Long-range genomic enrichment, sequencing, and assembly to determine unknown sequences flanking a known microRNA.

    Directory of Open Access Journals (Sweden)

    Zhaorong Ma

    Full Text Available Conserved plant microRNAs (miRNAs) modulate important biological processes but little is known about conserved cis-regulatory elements (CREs) surrounding MIRNA genes. We developed a solution-based targeted genomic enrichment methodology to capture, enrich, and sequence flanking genomic regions surrounding conserved MIRNA genes with a locked-nucleic acid (LNA)-modified, biotinylated probe complementary to the mature miRNA sequence. Genomic DNA bound by the probe is captured by streptavidin-coated magnetic beads, amplified, sequenced and assembled de novo to obtain genomic DNA sequences flanking the MIRNA locus of interest. We demonstrate the sensitivity and specificity of this enrichment methodology in Arabidopsis thaliana to enrich targeted regions spanning 10-20 kb surrounding known MIR166 and MIR165 loci. Assembly of the sequencing reads successfully recovered all targeted loci. While further optimization for larger, more complex genomes is needed, this method may enable determination of flanking genomic DNA sequence surrounding a known core (like a conserved mature miRNA) from multiple species that currently don't have a full genome assembly available.

  1. De novo genome assembly of Geosmithia morbida, the causal agent of thousand cankers disease.

    Science.gov (United States)

    Schuelke, Taruna A; Westbrook, Anthony; Broders, Kirk; Woeste, Keith; MacManes, Matthew D

    2016-01-01

    Geosmithia morbida is a filamentous ascomycete that causes thousand cankers disease in the eastern black walnut tree. This pathogen is commonly found in the western U.S.; however, recently the disease was also detected in several eastern states where the black walnut lumber industry is concentrated. G. morbida is one of two known phytopathogens within the genus Geosmithia, and it is vectored into the host tree via the walnut twig beetle. We present the first de novo draft genome of G. morbida. It is 26.5 Mbp in length and contains less than 1% repetitive elements. The genome possesses an estimated 6,273 genes, 277 of which are predicted to encode proteins with unknown functions. Approximately 31.5% of the proteins in G. morbida are homologous to proteins involved in pathogenicity, and 5.6% of the proteins contain signal peptides that indicate these proteins are secreted. Several studies have investigated the evolution of pathogenicity in pathogens of agricultural crops; forest fungal pathogens are often neglected because research efforts are focused on food crops. G. morbida is one of the few tree phytopathogens to be sequenced, assembled and annotated. The first draft genome of G. morbida serves as a valuable tool for comprehending the underlying molecular and evolutionary mechanisms behind pathogenesis within the Geosmithia genus.

  2. The Spindle Assembly Checkpoint Safeguards Genomic Integrity of Skeletal Muscle Satellite Cells

    Directory of Open Access Journals (Sweden)

    Swapna Kollu

    2015-06-01

    Full Text Available To ensure accurate genomic segregation, cells evolved the spindle assembly checkpoint (SAC), whose role in adult stem cells remains unknown. Inducible perturbation of a SAC kinase, Mps1, and its downstream effector, Mad2, in skeletal muscle stem cells shows the SAC to be critical for normal muscle growth, repair, and self-renewal of the stem cell pool. SAC-deficient muscle stem cells arrest in G1 phase of the cell cycle with elevated aneuploidy, resisting differentiation even under inductive conditions. p21CIP1 is responsible for these SAC-deficient phenotypes. Despite aneuploidy’s correlation with aging, we find that aged proliferating muscle stem cells display robust SAC activity without elevated aneuploidy. Thus, muscle stem cells have a two-step mechanism to safeguard their genomic integrity. The SAC prevents chromosome missegregation and, if it fails, p21CIP1-dependent G1 arrest limits cellular propagation and tissue integration. These mechanisms ensure that muscle stem cells with compromised genomes do not contribute to tissue homeostasis.

  3. FusX: A Rapid One-Step Transcription Activator-Like Effector Assembly System for Genome Science.

    Science.gov (United States)

    Ma, Alvin C; McNulty, Melissa S; Poshusta, Tanya L; Campbell, Jarryd M; Martínez-Gálvez, Gabriel; Argue, David P; Lee, Han B; Urban, Mark D; Bullard, Cassandra E; Blackburn, Patrick R; Man, Toni K; Clark, Karl J; Ekker, Stephen C

    2016-06-01

    Transcription activator-like effectors (TALEs) are extremely effective, single-molecule DNA-targeting molecular cursors used for locus-specific genome science applications, including high-precision molecular medicine and other genome engineering applications. TALEs are used in genome engineering for locus-specific DNA editing and imaging, as artificial transcriptional activators and repressors, and for targeted epigenetic modification. TALEs as nucleases (TALENs) are effective editing tools and offer high binding specificity and fewer sequence constraints toward the targeted genome than other custom nuclease systems. One bottleneck of broader TALE use is reagent accessibility. For example, one commonly deployed method uses a multitube, 5-day assembly protocol. Here we describe FusX, a streamlined Golden Gate TALE assembly system that (1) is backward compatible with popular TALE backbones, (2) is functionalized as a single-tube 3-day TALE assembly process, (3) requires only commonly used basic molecular biology reagents, and (4) is cost-effective. More than 100 TALEN pairs have been successfully assembled using FusX, and 27 pairs were quantitatively tested in zebrafish, with each showing high somatic and germline activity. Furthermore, this assembly system is flexible and is compatible with standard molecular biology laboratory tools, but can be scaled with automated laboratory support. To demonstrate, we use a highly accessible and commercially available liquid-handling robot to rapidly and accurately assemble TALEs using the FusX TALE toolkit. Together, the FusX system accelerates TALE-based genomic science applications from basic science screening work for functional genomics testing and molecular medicine applications.

  4. Packaging signals in two single-stranded RNA viruses imply a conserved assembly mechanism and geometry of the packaged genome.

    Science.gov (United States)

    Dykeman, Eric C; Stockley, Peter G; Twarock, Reidun

    2013-09-09

    The current paradigm for assembly of single-stranded RNA viruses is based on a mechanism involving non-sequence-specific packaging of genomic RNA driven by electrostatic interactions. Recent experiments, however, provide compelling evidence for sequence specificity in this process both in vitro and in vivo. The existence of multiple RNA packaging signals (PSs) within viral genomes has been proposed, which facilitates assembly by binding coat proteins in such a way that they promote the protein-protein contacts needed to build the capsid. The binding energy from these interactions enables the confinement or compaction of the genomic RNAs. Identifying the nature of such PSs is crucial for a full understanding of assembly, which is an as yet untapped potential drug target for this important class of pathogens. Here, for two related bacterial viruses, we determine the sequences and locations of their PSs using Hamiltonian paths, a concept from graph theory, in combination with bioinformatics and structural studies. Their PSs have a common secondary structure motif but distinct consensus sequences and positions within the respective genomes. Despite these differences, the distributions of PSs in both viruses imply defined conformations for the packaged RNA genomes in contact with the protein shell in the capsid, consistent with a recent asymmetric structure determination of the MS2 virion. The PS distributions identified moreover imply a preferred, evolutionarily conserved assembly pathway with respect to the RNA sequence with potentially profound implications for other single-stranded RNA viruses known to have RNA PSs, including many animal and human pathogens.

  5. A genome assembly-integrated dog 1 Mb BAC microarray: a cytogenetic resource for canine cancer studies and comparative genomic analysis.

    Science.gov (United States)

    Thomas, R; Duke, S E; Karlsson, E K; Evans, A; Ellis, P; Lindblad-Toh, K; Langford, C F; Breen, M

    2008-01-01

    Molecular cytogenetic studies have been instrumental in defining the nature of numerical and structural chromosome changes in human cancers, but their significance remains to be fully understood. The emergence of high quality genome assemblies for several model organisms provides exciting opportunities to develop novel genome-integrated molecular cytogenetic resources that now permit a comparative approach to evaluating the relevance of tumor-associated chromosome aberrations, both within and between species. We have used the dog genome sequence assembly to identify a framework panel of 2,097 bacterial artificial chromosome (BAC) clones, selected at intervals of approximately one megabase. Each clone has been evaluated by multicolor fluorescence in situ hybridization (FISH) to confirm its unique cytogenetic location in concordance with its reported position in the genome assembly, providing new information on the organization of the dog genome. This panel of BAC clones also represents a powerful cytogenetic resource with numerous potential applications. We have used the clone set to develop a genome-wide microarray for comparative genomic hybridization (aCGH) analysis, and demonstrate its application in detection of tumor-associated DNA copy number aberrations (CNAs) including single copy deletions and amplifications, regional aneuploidy and whole chromosome aneuploidy. We also show how individual clones selected from the BAC panel can be used as FISH probes in direct evaluation of tumor karyotypes, to verify and explore CNAs detected using aCGH analysis. This cytogenetically validated, genome integrated BAC clone panel has enormous potential for aiding gene discovery through a comparative approach to molecular oncology.

  6. A multi-platform draft de novo genome assembly and comparative analysis for the Scarlet Macaw (Ara macao).

    Directory of Open Access Journals (Sweden)

    Christopher M Seabury

    Data deposition to NCBI Genomes: This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession AMXX00000000 (SMACv1.0, unscaffolded genome assembly). The version described in this paper is the first version (AMXX01000000). The scaffolded assembly (SMACv1.1) has been deposited at DDBJ/EMBL/GenBank under the accession AOUJ00000000, and is also the first version (AOUJ01000000). Strong biological interest in traits such as the acquisition and utilization of speech, cognitive abilities, and longevity catalyzed the utilization of two next-generation sequencing platforms to provide the first-draft de novo genome assembly for the large, new world parrot Ara macao (Scarlet Macaw). Despite the challenges associated with genome assembly for an outbred avian species, including 951,507 high-quality putative single nucleotide polymorphisms, the final genome assembly (>1.035 Gb) includes more than 997 Mb of unambiguous sequence data (excluding N's). Cytogenetic analyses including ZooFISH revealed complex rearrangements associated with two scarlet macaw macrochromosomes (AMA6, AMA7), which supports the hypothesis that translocations, fusions, and intragenomic rearrangements are key factors associated with karyotype evolution among parrots. In silico annotation of the scarlet macaw genome provided robust evidence for 14,405 nuclear gene annotation models, their predicted transcripts and proteins, and a complete mitochondrial genome. Comparative analyses involving the scarlet macaw, chicken, and zebra finch genomes revealed high levels of nucleotide-based conservation as well as evidence for overall genome stability among the three highly divergent species. Application of a new whole-genome analysis of divergence involving all three species yielded prioritized candidate genes and noncoding regions for parrot traits of interest (i.e., speech, intelligence, longevity), which were independently supported by the results of previous human GWAS

  7. Genome assembly and geospatial phylogenomics of the bed bug Cimex lectularius.

    Science.gov (United States)

    Rosenfeld, Jeffrey A; Reeves, Darryl; Brugler, Mercer R; Narechania, Apurva; Simon, Sabrina; Durrett, Russell; Foox, Jonathan; Shianna, Kevin; Schatz, Michael C; Gandara, Jorge; Afshinnekoo, Ebrahim; Lam, Ernest T; Hastie, Alex R; Chan, Saki; Cao, Han; Saghbini, Michael; Kentsis, Alex; Planet, Paul J; Kholodovych, Vladyslav; Tessler, Michael; Baker, Richard; DeSalle, Rob; Sorkin, Louis N; Kolokotronis, Sergios-Orestis; Siddall, Mark E; Amato, George; Mason, Christopher E

    2016-02-02

    The common bed bug (Cimex lectularius) has been a persistent pest of humans for thousands of years, yet the genetic basis of the bed bug's basic biology and adaptation to dense human environments is largely unknown. Here we report the assembly, annotation and phylogenetic mapping of the 697.9-Mb Cimex lectularius genome, with an N50 of 971 kb, using both long and short read technologies. An RNA-seq time course across all five developmental stages and male and female adults generated 36,985 coding and noncoding gene models. The most pronounced change in gene expression during the life cycle occurs after feeding on human blood and includes genes from the Wolbachia endosymbiont, which shows a simultaneous and coordinated host/commensal response to haematophagous activity. These data provide a rich genetic resource for mapping activity and density of C. lectularius across human hosts and cities, which can help track, manage and control bed bug infestations.
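
    The N50 quoted above is a contiguity metric: the contig (or scaffold) length at which sequences of that length or longer hold at least half of all assembled bases. A minimal sketch of the calculation follows; the function name and the contig lengths are illustrative assumptions, not data from the bed bug assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of all assembled bases
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half is the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))
```

    Because the sort puts the longest sequences first, a few very long scaffolds can dominate the statistic, so N50 is usually read alongside total assembly length and contig count.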

  8. High resolution radiation hybrid maps of bovine chromosomes 19 and 29: comparison with the bovine genome sequence assembly

    Directory of Open Access Journals (Sweden)

    Womack James E

    2007-09-01

    Background: High resolution radiation hybrid (RH) maps can facilitate genome sequence assembly by correctly ordering genes and genetic markers along chromosomes. The objective of the present study was to generate high resolution RH maps of bovine chromosomes 19 (BTA19) and 29 (BTA29), and compare them with the current 7.1X bovine genome sequence assembly (bovine build 3.1). We have chosen BTA19 and 29 as candidate chromosomes for mapping, since many Quantitative Trait Loci (QTL) for the traits of carcass merit and residual feed intake have been identified on these chromosomes. Results: We have constructed high resolution maps of BTA19 and BTA29 consisting of 555 and 253 Single Nucleotide Polymorphism (SNP) markers respectively using a 12,000 rad whole genome RH panel. With these markers, the RH map of BTA19 and BTA29 extended to 4591.4 cR and 2884.1 cR in length respectively. When aligned with the current bovine build 3.1, the order of markers on the RH map for BTA19 and 29 showed inconsistencies with respect to the genome assembly. Maps of both the chromosomes show that there is a significant internal rearrangement of the markers involving displacement, inversion and flips within the scaffolds with some scaffolds being misplaced in the genome assembly. We also constructed cattle-human comparative maps of these chromosomes which showed an overall agreement with the comparative maps published previously. However, minor discrepancies in the orientation of a few homologous synteny blocks were observed. Conclusion: The high resolution maps of BTA19 (average 1 locus/139 kb) and BTA29 (average 1 locus/208 kb) presented in this study suggest that by the incorporation of RH mapping information, the current bovine genome sequence assembly can be significantly improved. Furthermore, these maps can serve as a potential resource for fine mapping QTL and identification of causative mutations underlying QTL for economically important traits.

  9. Comparison of bacterial genome assembly software for MinION data and their applicability to medical microbiology.

    Science.gov (United States)

    Judge, Kim; Hunt, Martin; Reuter, Sandra; Tracey, Alan; Quail, Michael A; Parkhill, Julian; Peacock, Sharon J

    2016-09-01

    Translating the Oxford Nanopore MinION sequencing technology into medical microbiology requires on-going analysis that keeps pace with technological improvements to the instrument and release of associated analysis software. Here, we use a multidrug-resistant Enterobacter kobei isolate as a model organism to compare open source software for the assembly of genome data, and relate this to the time taken to generate actionable information. Three software tools (PBcR, Canu and miniasm) were used to assemble MinION data and a fourth (SPAdes) was used to combine MinION and Illumina data to produce a hybrid assembly. All four had a similar number of contigs and were more contiguous than the assembly using Illumina data alone, with SPAdes producing a single chromosomal contig. Evaluation of the four assemblies to represent the genome structure revealed a single large inversion in the SPAdes assembly, which also incorrectly integrated a plasmid into the chromosomal contig. Almost 50 %, 80 % and 90 % of MinION pass reads were generated in the first 6, 9 and 12 h, respectively. Using data from the first 6 h alone led to a less accurate, fragmented assembly, but data from the first 9 or 12 h generated similar assemblies to that from 48 h sequencing. Assemblies were generated in 2 h using Canu, indicating that going from isolate to assembled data is possible in less than 48 h. MinION data identified that genes responsible for resistance were carried by two plasmids encoding resistance to carbapenem and to sulphonamides, rifampicin and aminoglycosides, respectively.

  10. Computational modelling of genome-wide [corrected] transcription assembly networks using a fluidics analogy.

    Directory of Open Access Journals (Sweden)

    Yousry Y Azmy

    Understanding how a myriad of transcription regulators work to modulate mRNA output at thousands of genes remains a fundamental challenge in molecular biology. Here we develop a computational tool to aid in assessing the plausibility of gene regulatory models derived from genome-wide expression profiling of cells mutant for transcription regulators. mRNA output is modelled as fluid flow in a pipe lattice, with assembly of the transcription machinery represented by the effect of valves. Transcriptional regulators are represented as external pressure heads that determine flow rate. Modelling mutations in regulatory proteins is achieved by adjusting valves' on/off settings. The topology of the lattice is designed by the experimentalist to resemble the expected interconnection between the modelled agents and their influence on mRNA expression. Users can compare multiple lattice configurations so as to find the one that minimizes the error with experimental data. This computational model provides a means to test the plausibility of transcription regulation models derived from large genomic data sets.

  11. Chromosome Scale Genome Assembly and Transcriptome Profiling of Nannochloropsis gaditana in Nitrogen Depletion

    Institute of Scientific and Technical Information of China (English)

    2014-01-01

    Nannochloropsis is rapidly emerging as a model organism for the study of biofuel production in microalgae. Here, we report a high-quality genomic assembly of Nannochloropsis gaditana, consisting of large contigs, up to 500 kbp long, and scaffolds that in most cases span the entire length of the chromosomes. We identified 10646 complete genes and characterized possible alternative transcripts. The annotation of the predicted genes and the analysis of cellular processes revealed traits relevant for the genetic improvement of this organism such as genes involved in DNA recombination, RNA silencing, and cell wall synthesis. We also analyzed the modification of the transcriptional profile in nitrogen deficiency, a condition known to stimulate lipid accumulation. While the content of lipids increased, we did not detect major changes in expression of the genes involved in their biosynthesis. At the same time, we observed a very significant down-regulation of mitochondrial gene expression, suggesting that part of the Acetyl-CoA and NAD(P)H, normally oxidized through the mitochondrial respiration, would be made available for fatty acids synthesis, increasing the flux through the lipid biosynthetic pathway. Finally, we released an information resource of the genomic data of N. gaditana, available online at www.nannochloropsis.org.

  12. Genome-wide assembly and analysis of alternative transcripts in mouse

    Science.gov (United States)

    Sharov, Alexei A.; Dudekula, Dawood B.; Ko, Minoru S.H.

    2005-01-01

    To build a mouse gene index with the most comprehensive coverage of alternative transcription/splicing (ATS), we developed an algorithm and a fully automated computational pipeline for transcript assembly from expressed sequences aligned to the genome. We identified 191,946 genomic loci, which included 27,497 protein-coding genes and 11,906 additional gene candidates (e.g., nonprotein-coding, but multiexon). Comparison of the resulting gene index with TIGR, UniGene, DoTS, and ESTGenes databases revealed that it had a greater number of transcripts, a greater average number of exons and introns with proper splicing sites per gene, and longer ORFs. The 27,497 protein-coding genes had 77,138 transcripts, i.e., 2.8 transcripts per gene on average. Close examination of transcripts led to a combinatorial table of 23 types of ATS units, only nine of which were previously described, i.e., 14 types of alternative splicing, seven types of alternative starts, and two types of alternative termination. Of the 20,323 multiexon protein-coding genes with proper splice sites, 47%, 18%, and 14% had alternative splicing, alternative starts, and alternative terminations, respectively. The gene index with the comprehensive ATS will provide a useful platform for analyzing the nature and mechanism of ATS, as well as for designing accurate exon-based DNA microarrays. PMID:15867436

  13. Complete Taiwanese Macaque (Macaca cyclopis) Mitochondrial Genome: Reference-Assisted de novo Assembly with Multiple k-mer Strategy.

    Science.gov (United States)

    Huang, Yu-Feng; Midha, Mohit; Chen, Tzu-Han; Wang, Yu-Tai; Smith, David Glenn; Pei, Kurtis Jai-Chyi; Chiu, Kuo Ping

    2015-01-01

    The Taiwanese (Formosan) macaque (Macaca cyclopis) is the only nonhuman primate endemic to Taiwan. This primate species is valuable for evolutionary studies and as a subject in medical research. However, only partial fragments of the mitochondrial genome (mitogenome) of this primate species have been sequenced, not to mention its nuclear genome. We employed next-generation sequencing to generate 2 x 90 bp paired-end reads, followed by reference-assisted de novo assembly with a multiple k-mer strategy to characterize the M. cyclopis mitogenome. We compared the assembled mitogenome with that of other macaque species for phylogenetic analysis. Our results show that the M. cyclopis mitogenome consists of 16,563 nucleotides encoding 13 protein-coding genes, 2 ribosomal RNAs and 22 transfer RNAs. Phylogenetic analysis indicates that M. cyclopis is most closely related to M. mulatta lasiota (Chinese rhesus macaque), supporting the notion of an Asia-continental origin of M. cyclopis proposed in previous studies based on partial mitochondrial sequences. Our work presents a novel approach for assembling a mitogenome that utilizes the capabilities of de novo genome assembly with the assistance of a reference genome. The availability of the complete Taiwanese macaque mitogenome will facilitate the study of primate evolution and the characterization of genetic variations for the potential usage of this species as a non-human primate model for medical research.

  14. Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results.

    Science.gov (United States)

    Haiminen, Niina; Kuhn, David N; Parida, Laxmi; Rigoutsos, Isidore

    2011-01-01

    Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly larger numbers of usable sequences per instrument-run continue to make whole-genome assembly an appealing target application. In this paper we evaluate the feasibility of de novo genome assembly from short reads (≤100 nucleotides) through a detailed study involving genomic sequences of various lengths and origin, in conjunction with several of the currently popular assembly programs. Our extensive analysis demonstrates that, in addition to sequencing coverage, attributes such as the architecture of the target genome, the identity of the used assembly program, the average read length and the observed sequencing error rates are powerful variables that affect the best achievable assembly of the target sequence in terms of size and correctness.

  15. Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveal Novel Gene Content.

    Science.gov (United States)

    Faber-Hammond, Joshua J; Brown, Kim H

    2016-04-01

    Zebrafish represents the third vertebrate with an officially completed genome, yet it remains incomplete with additions and corrections continuing with the current release, GRCz10, having 13% of zebrafish cDNA sequences unmapped. This disparity may result from population differences, given that the genome reference was generated from clonal individuals with limited genetic diversity. This is supported by the recent analysis of a single wild zebrafish, which identified over 5.2 million SNPs and 1.6 million in/dels in the previous genome build, zv9. Re-examination of this sequence data set indicated that 13.8% of quality sequence reads failed to align to GRCz10. Using a novel bioinformatics de novo assembly pipeline on these unmappable reads, we identified 1,514,491 novel contigs covering ∼224 Mb of genomic sequence. Among these, 1083 contigs were found to contain a potential gene coding sequence. RNA-seq data comparison confirmed that 362 contigs contained a transcribed DNA sequence, suggesting that a large amount of functional genomic sequence remains unannotated in the zebrafish reference genome. By utilizing the bioinformatics pipeline developed in this study, the zebrafish genome will be bolstered as a model for human disease research. Adaptation of the pipeline described here also offers a cost-efficient and effective method to identify and map novel genetic content across any genome and will ultimately aid in the completion of additional genomes for a broad range of species.

  16. Anchored pseudo-de novo assembly of human genomes identifies extensive sequence variation from unmapped sequence reads.

    Science.gov (United States)

    Faber-Hammond, Joshua J; Brown, Kim H

    2016-07-01

    The human genome reference (HGR) completion marked the genomics era beginning, yet despite its utility universal application is limited by the small number of individuals used in its development. This is highlighted by the presence of high-quality sequence reads failing to map within the HGR. Sequences failing to map generally represent 2-5 % of total reads, which may harbor regions that would enhance our understanding of population variation, evolution, and disease. Alternatively, complete de novo assemblies can be created, but these effectively ignore the groundwork of the HGR. In an effort to find a middle ground, we developed a bioinformatic pipeline that maps paired-end reads to the HGR as separate single reads, exports unmappable reads, de novo assembles these reads per individual and then combines assemblies into a secondary reference assembly used for comparative analysis. Using 45 diverse 1000 Genomes Project individuals, we identified 351,361 contigs covering 195.5 Mb of sequence unincorporated in GRCh38. 30,879 contigs are represented in multiple individuals with ~40 % showing high sequence complexity. Genomic coordinates were generated for 99.9 %, with 52.5 % exhibiting high-quality mapping scores. Comparative genomic analyses with archaic humans and primates revealed significant sequence alignments and comparisons with model organism RefSeq gene datasets identified novel human genes. If incorporated, these sequences will expand the HGR, but more importantly our data highlight that with this method low coverage (~10-20×) next-generation sequencing can still be used to identify novel unmapped sequences to explore biological functions contributing to human phenotypic variation, disease and functionality for personal genomic medicine.
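
    The "export unmappable reads" step of such a pipeline can be sketched as follows. This is an illustrative filter over SAM-format text, not the authors' code: it assumes reads were aligned as single reads and relies only on the SAM FLAG bit 0x4 (segment unmapped). The function names and the three example records are hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_reads(sam_lines):
    """Yield (name, sequence, quality) for every unmapped SAM record."""
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # skip malformed records
            continue
        if int(fields[1]) & UNMAPPED:     # column 2 is the bitwise FLAG
            yield fields[0], fields[9], fields[10]


def to_fastq(records):
    """Format (name, sequence, quality) tuples as FASTQ text for an assembler."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in records)


# Hypothetical SAM content: one header line, one mapped and one unmapped read.
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCAA\tIIIIIIII",
]
print(to_fastq(unmapped_reads(example_sam)), end="")
```

    The resulting FASTQ text would then be handed to a de novo assembler of choice; that step depends on the tool and is left out of the sketch.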

  17. Genomic Sequencing of Orientia tsutsugamushi Strain Karp, an Assembly Comparable to the Genome Size of the Strain Ikeda.

    Science.gov (United States)

    Liao, Hsiao-Mei; Chao, Chien-Chung; Lei, Haiyan; Li, Bingjie; Tsai, Shien; Hung, Guo-Chiuan; Ching, Wei-Mei; Lo, Shyh-Ching

    2016-08-18

    Orientia tsutsugamushi, an intracellular bacterium, belongs to the family Rickettsiaceae. This study presents the draft genome sequence of strain Karp, with 2.0 Mb as the size of the completed genome. This nearly finished draft genome sequence was annotated with the RAST server and the contents compared to those of the other strains.

  18. Genomic Sequencing of Orientia tsutsugamushi Strain Karp, an Assembly Comparable to the Genome Size of the Strain Ikeda

    Science.gov (United States)

    Liao, Hsiao-Mei; Chao, Chien-Chung; Lei, Haiyan; Li, Bingjie; Tsai, Shien; Hung, Guo-Chiuan

    2016-01-01

    Orientia tsutsugamushi, an intracellular bacterium, belongs to the family Rickettsiaceae. This study presents the draft genome sequence of strain Karp, with 2.0 Mb as the size of the completed genome. This nearly finished draft genome sequence was annotated with the RAST server and the contents compared to those of the other strains. PMID:27540052

  19. Rapid hybrid de novo assembly of a microbial genome using only short reads: Corynebacterium pseudotuberculosis I19 as a case study.

    Science.gov (United States)

    Cerdeira, Louise Teixeira; Carneiro, Adriana Ribeiro; Ramos, Rommel Thiago Jucá; de Almeida, Sintia Silva; D'Afonseca, Vivian; Schneider, Maria Paula Cruz; Baumbach, Jan; Tauch, Andreas; McCulloch, John Anthony; Azevedo, Vasco Ariston Carvalho; Silva, Artur

    2011-08-01

    Due to the advent of the so-called Next-Generation Sequencing (NGS) technologies, the amount of monetary and temporal resources for whole-genome sequencing has been reduced by several orders of magnitude. Sequence reads can be assembled either by anchoring them directly onto an available reference genome (classical reference assembly), or can be concatenated by overlap (de novo assembly). The latter strategy is preferable because it tends to maintain the architecture of the genome sequence; however, depending on the NGS platform used, the shortness of read lengths causes tremendous problems in the subsequent genome assembly phase, impeding closing of the entire genome sequence. To address the problem, we developed a multi-pronged hybrid de novo strategy combining De Bruijn graph and Overlap-Layout-Consensus methods, which was used to assemble from short reads the entire genome of Corynebacterium pseudotuberculosis strain I19, a bacterium with immense importance in veterinary medicine that causes Caseous Lymphadenitis in ruminants, principally ovines and caprines. Briefly, contigs were assembled de novo from the short reads and were only oriented using a reference genome by anchoring. Remaining gaps were closed using iterative anchoring of short reads by craning to gap flanks. Finally, we compare the genome sequence assembled using our hybrid strategy to a classical reference assembly using the same data as input and show that with the availability of a reference genome, it pays off to use the hybrid de novo strategy, rather than a classical reference assembly, because more genome sequences are preserved using the former.
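
    As background on one of the two methods named above, a De Bruijn graph treats every (k-1)-mer as a node and every k-mer seen in the reads as an edge from its prefix to its suffix; contigs are spelled by walking unbranched paths. The toy sketch below illustrates only that idea and is not the authors' hybrid pipeline; the function names, reads, and k value are assumptions chosen for the example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a De Bruijn graph: (k-1)-mer prefix -> sorted list of (k-1)-mer suffixes."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):            # distinct k-mers, deterministic order
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def spell_unbranched_path(graph):
    """Spell the contig from the single source node along out-degree-1 edges."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one node without incoming edges")
    node = starts[0]
    contig, seen = node, {node}
    while node in graph and len(graph[node]) == 1:
        node = graph[node][0]
        if node in seen:                  # stop rather than loop around a cycle
            break
        seen.add(node)
        contig += node[-1]                # each step adds one new base
    return contig


# Two hypothetical overlapping short reads.
reads = ["ACGTC", "GTCA"]
graph = de_bruijn(reads, k=3)
print(spell_unbranched_path(graph))
```

    Real assemblers additionally handle sequencing errors, reverse complements, and repeats that create branches, which is where short read lengths cause the difficulties the abstract mentions.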

  20. Detection of phytochrome-like genes from Rhazya stricta (Apocynaceae) using de novo genome assembly.

    Science.gov (United States)

    Sabir, Jamal S M; Baeshen, Nabih A; Shokry, Ahmed M; Gadalla, Nour O; Edris, Sherif; Mutwakil, Mohammed H; Ramadan, Ahmed M; Atef, Ahmed; Al-Kordy, Magdy A; Abuzinadah, Osama A; El-Domyati, Fotouh M; Jansen, Robert K; Bahieldin, Ahmed

    2013-01-01

    Phytochrome-like genes in the wild plant species Rhazya stricta Decne were characterized using a de novo genome assembly of next generation sequence data. Rhazya stricta contains more than 100 alkaloids with multiple pharmacological properties, and leaf extracts have been used to cure chronic rheumatism, to treat tumors, and in the treatment of several other diseases. Phytochromes are known to be involved in the light-regulated biosynthesis of some alkaloids. Phytochromes are soluble chromoproteins that function in the absorption of red and far-red light and the transduction of intracellular signals during light-regulated plant development. De novo assembly of the nuclear genome of R. stricta recovered 45,641 contigs greater than 1000 bp long, which were used in constructing a local database. Five sequences belonging to the Arabidopsis thaliana phytochrome gene family (i.e., AtphyABCDE) were used to identify R. stricta contigs with phytochrome-like sequences using BLAST. This led to the identification of three contigs with phytochrome-like sequences covering AtphyA-, AtphyC- and AtphyE-like full-length genes. Annotation of the three sequences showed that each contig consists of one phytochrome-like gene with three exons and two introns. BLASTn and BLASTp results indicated that RsphyA mRNA and protein sequences had homologues in Wrightia coccinea and Solanum tuberosum, respectively. RsphyC-like mRNA and protein sequences were homologous to Vitis vinifera and Vitis riparia. RsphyE-like mRNA coding and protein sequences were homologous to Ipomoea nil. Multiple-sequence alignment of phytochrome proteins indicated a homology with 30 sequences from 23 different species of flowering plants. Phylogenetic analysis confirmed that each R. stricta phytochrome gene is related to the same phytochrome gene of other flowering plants. It is proposed that the absence of the phyB gene in R. stricta is due to the RsphyA gene taking over the role of phyB.

  1. Deep Sequencing of Mixed Total DNA without Barcodes Allows Efficient Assembly of Highly Plastic Ascidian Mitochondrial Genomes

    Science.gov (United States)

    Rubinstein, Nimrod D.; Feldstein, Tamar; Shenkar, Noa; Botero-Castro, Fidel; Griggio, Francesca; Mastrototaro, Francesco; Delsuc, Frédéric; Douzery, Emmanuel J.P.; Gissi, Carmela; Huchon, Dorothée

    2013-01-01

    Ascidians or sea squirts form a diverse group within chordates, which includes a few thousand members of marine sessile filter-feeding animals. Their mitochondrial genomes are characterized by particularly high evolutionary rates and rampant gene rearrangements. This extreme variability complicates standard polymerase chain reaction (PCR) based techniques for molecular characterization studies, and consequently only a few complete Ascidian mitochondrial genome sequences are available. Using the standard PCR and Sanger sequencing approach, we produced the mitochondrial genome of Ascidiella aspersa only after a great effort. In contrast, we produced five additional mitogenomes (Botrylloides aff. leachii, Halocynthia spinosa, Polycarpa mytiligera, Pyura gangelion, and Rhodosoma turcicum) with a novel strategy, consisting in sequencing the pooled total DNA samples of these five species using one Illumina HiSeq 2000 flow cell lane. Each mitogenome was efficiently assembled in a single contig using de novo transcriptome assembly, as de novo genome assembly generally performed poorly for this task. Each of the new six mitogenomes presents a different and novel gene order, showing that no syntenic block has been conserved at the ordinal level (in Stolidobranchia and in Phlebobranchia). Phylogenetic analyses support the paraphyly of both Ascidiacea and Phlebobranchia, with Thaliacea nested inside Phlebobranchia, although the deepest nodes of the Phlebobranchia–Thaliacea clade are not well resolved. The strategy described here thus provides a cost-effective approach to obtain complete mitogenomes characterized by a highly plastic gene order and a fast nucleotide/amino acid substitution rate. PMID:23709623

  2. Deep sequencing of mixed total DNA without barcodes allows efficient assembly of highly plastic ascidian mitochondrial genomes.

    Science.gov (United States)

    Rubinstein, Nimrod D; Feldstein, Tamar; Shenkar, Noa; Botero-Castro, Fidel; Griggio, Francesca; Mastrototaro, Francesco; Delsuc, Frédéric; Douzery, Emmanuel J P; Gissi, Carmela; Huchon, Dorothée

    2013-01-01

    Ascidians or sea squirts form a diverse group within chordates, which includes a few thousand members of marine sessile filter-feeding animals. Their mitochondrial genomes are characterized by particularly high evolutionary rates and rampant gene rearrangements. This extreme variability complicates standard polymerase chain reaction (PCR) based techniques for molecular characterization studies, and consequently only a few complete Ascidian mitochondrial genome sequences are available. Using the standard PCR and Sanger sequencing approach, we produced the mitochondrial genome of Ascidiella aspersa only after a great effort. In contrast, we produced five additional mitogenomes (Botrylloides aff. leachii, Halocynthia spinosa, Polycarpa mytiligera, Pyura gangelion, and Rhodosoma turcicum) with a novel strategy, consisting in sequencing the pooled total DNA samples of these five species using one Illumina HiSeq 2000 flow cell lane. Each mitogenome was efficiently assembled in a single contig using de novo transcriptome assembly, as de novo genome assembly generally performed poorly for this task. Each of the new six mitogenomes presents a different and novel gene order, showing that no syntenic block has been conserved at the ordinal level (in Stolidobranchia and in Phlebobranchia). Phylogenetic analyses support the paraphyly of both Ascidiacea and Phlebobranchia, with Thaliacea nested inside Phlebobranchia, although the deepest nodes of the Phlebobranchia-Thaliacea clade are not well resolved. The strategy described here thus provides a cost-effective approach to obtain complete mitogenomes characterized by a highly plastic gene order and a fast nucleotide/amino acid substitution rate.

  3. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3.

    Science.gov (United States)

    Han, Mira V; Thomas, Gregg W C; Lugo-Martinez, Jose; Hahn, Matthew W

    2013-08-01

    Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.

  4. Mitogenome assembly from genomic multiplex libraries: comparison of strategies and novel mitogenomes for five species of frogs.

    Science.gov (United States)

    Machado, D J; Lyra, M L; Grant, T

    2016-05-01

    Next-generation sequencing continues to revolutionize biodiversity studies by generating unprecedented amounts of DNA sequence data for comparative genomic analysis. However, these data are produced as millions or billions of short reads of variable quality that cannot be directly applied in comparative analyses, creating a demand for methods to facilitate assembly. We optimized an in silico strategy to efficiently reconstruct high-quality mitochondrial genomes directly from genomic reads. We tested this strategy using sequences from five species of frogs: Hylodes meridionalis (Hylodidae), Hyloxalus yasuni (Dendrobatidae), Pristimantis fenestratus (Craugastoridae), and Melanophryniscus simplex and Rhinella sp. (Bufonidae). These are the first mitogenomes published for these species, the genera Hylodes, Hyloxalus, Pristimantis, Melanophryniscus and Rhinella, and the families Craugastoridae and Hylodidae. Sequences were generated using only half of one lane of a standard Illumina HiSeq 2000 flow cell, resulting in fewer than eight million reads. We analysed the reads of Hylodes meridionalis using three different assembly strategies: (1) reference-based (using bowtie2); (2) de novo (using abyss, soapdenovo2 and velvet); and (3) baiting and iterative mapping (using mira and mitobim). Mitogenomes were assembled exclusively with strategy 3, which we employed to assemble the remaining mitogenomes. Annotations were performed with mitos and confirmed by comparison with published amphibian mitochondria. In most cases, we recovered all 13 coding genes, 22 tRNAs, and two ribosomal subunit genes, with minor gene rearrangements. Our results show that few raw reads can be sufficient to generate high-quality scaffolds, making any Illumina machine run using genomic multiplex libraries a potential source of data for organelle assemblies as by-catch. © 2015 John Wiley & Sons Ltd.

  5. DNA damage response and spindle assembly checkpoint function throughout the cell cycle to ensure genomic integrity.

    Directory of Open Access Journals (Sweden)

    Katherine S Lawrence

    2015-04-01

    Errors in replication or segregation lead to DNA damage, mutations, and aneuploidies. Consequently, cells monitor these events and delay progression through the cell cycle so repair precedes division. The DNA damage response (DDR), which monitors DNA integrity, and the spindle assembly checkpoint (SAC), which responds to defects in spindle attachment/tension during metaphase of mitosis and meiosis, are critical for preventing genome instability. Here we show that the DDR and SAC function together throughout the cell cycle to ensure genome integrity in C. elegans germ cells. Metaphase defects result in enrichment of SAC and DDR components to chromatin, and both SAC and DDR are required for metaphase delays. During persistent metaphase arrest following establishment of bi-oriented chromosomes, stability of the metaphase plate is compromised in the absence of DDR kinases ATR or CHK1 or SAC components, MAD1/MAD2, suggesting SAC functions in metaphase beyond its interactions with APC activator CDC20. In response to DNA damage, MAD2 and the histone variant CENPA become enriched at the nuclear periphery in a DDR-dependent manner. Further, depletion of either MAD1 or CENPA results in loss of peripherally associated damaged DNA. In contrast to a SAC-insensitive CDC20 mutant, germ cells deficient for SAC or CENPA cannot efficiently repair DNA damage, suggesting that SAC mediates DNA repair through CENPA interactions with the nuclear periphery. We also show that replication perturbations result in relocalization of MAD1/MAD2 in human cells, suggesting that the role of SAC in DNA repair is conserved.

  6. Selecting Superior De Novo Transcriptome Assemblies: Lessons Learned by Leveraging the Best Plant Genome.

    Directory of Open Access Journals (Sweden)

    Loren A Honaas

    Whereas de novo assemblies of RNA-Seq data are being published for a growing number of species across the tree of life, there are currently no broadly accepted methods for evaluating such assemblies. Here we present a detailed comparison of 99 transcriptome assemblies, generated with 6 de novo assemblers including CLC, Trinity, SOAP, Oases, ABySS and NextGENe. Controlled analyses of de novo assemblies for Arabidopsis thaliana and Oryza sativa transcriptomes provide new insights into the strengths and limitations of transcriptome assembly strategies. We find that the leading assemblers generate reassuringly accurate assemblies for the majority of transcripts. At the same time, we find a propensity for assemblers to fail to fully assemble highly expressed genes. Surprisingly, the instance of true chimeric assemblies is very low for all assemblers. Normalized libraries are reduced in highly abundant transcripts, but they also lack 1000s of low abundance transcripts. We conclude that the quality of de novo transcriptome assemblies is best assessed through consideration of a combination of metrics: (1) proportion of reads mapping to an assembly, (2) recovery of conserved, widely expressed genes, (3) N50 length statistics, and (4) the total number of unigenes. We provide benchmark Illumina transcriptome data and introduce SCERNA, a broadly applicable modular protocol for de novo assembly improvement. Finally, our de novo assembly of the Arabidopsis leaf transcriptome revealed ~20 putative Arabidopsis genes lacking in the current annotation.
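
    The N50 length statistic named among the metrics in the record above can be illustrated with a minimal sketch. It uses the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length); the function name and the contig lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or transcript) lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in bp, for illustration only.
contigs = [100, 200, 300, 400, 500]
print(n50(contigs))
```

    Because N50 depends only on the length distribution, it says nothing about correctness, which is consistent with the record's recommendation to combine it with the other metrics.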

  7. Cytogenetic maps of homoeologous chromosomes Ah01 and Dh01 and their integration with the genome assembly in Gossypium hirsutum

    Directory of Open Access Journals (Sweden)

    Yuling Liu

    2017-06-01

    Cytogenetic maps of Gossypium hirsutum (Linnaeus, 1753) homoeologous chromosomes Ah01 and Dh01 were constructed by fluorescence in situ hybridization (FISH), using eleven bacterial artificial chromosome (BAC) clones shared by the homoeologous chromosomes and one chromosome-specific BAC clone, respectively. We compared the cytogenetic maps with the genetic linkage and draft genome assembly maps based on a standardized map unit, relative map position (RMP), which allowed a global view of the relationship of genetic and physical distances along each chromosome, and of the assembly quality of the draft genome assembly map. By integrating the cytogenetic maps with sequence maps of the two chromosomes (Ah01 and Dh01), we inferred the locations of two scaffolds and speculated that some homologous sequences belonging to homoeologous chromosomes were removed as repeats during sequence assembly. The result offers molecular tools for cotton genomics research and also provides valuable information for the improvement of the draft genome assembly.

  8. Chiropteran types I and II interferon genes inferred from genome sequencing traces by a statistical gene-family assembler

    Directory of Open Access Journals (Sweden)

    Haines Albert

    2010-07-01

    Background: The rate of emergence of human pathogens is steadily increasing; most of these novel agents originate in wildlife. Bats, remarkably, are the natural reservoirs of many of the most pathogenic viruses in humans. There are two bat genome projects currently underway, a circumstance that promises to speed the discovery of host factors important in the coevolution of bats with their viruses. These genomes, however, are not yet assembled and one of them will provide only low coverage, making the inference of most genes of immunological interest error-prone. Many more wildlife genome projects are underway and intend to provide only shallow coverage. Results: We have developed a statistical method for the assembly of gene families from partial genomes. The method takes full advantage of the quality scores generated by base-calling software, incorporating them into a complete probabilistic error model, to overcome the limitation inherent in the inference of gene family members from partial sequence information. We validated the method by inferring the human IFNA genes from the genome trace archives, and used it to infer 61 type-I interferon genes, and single type-II interferon genes in the bats Pteropus vampyrus and Myotis lucifugus. We confirmed our inferences by direct cloning and sequencing of IFNA, IFNB, IFND, and IFNK in P. vampyrus, and by demonstrating transcription of some of the inferred genes by known interferon-inducing stimuli. Conclusion: The statistical trace assembler described here provides a reliable method for extracting information from the many available and forthcoming partial or shallow genome sequencing projects, thereby facilitating the study of a wider variety of organisms with ecological and biomedical significance to humans than would otherwise be possible.

  9. Genome assembly and annotation of Arabidopsis halleri, a model for heavy metal hyperaccumulation and evolutionary ecology

    OpenAIRE

    Briskine, Roman V; Paape, Timothy; Shimizu-Inatsugi, Rie; Nishiyama, Tomoaki; Akama, Satoru; Sese, Jun; Kentaro K. Shimizu

    2016-01-01

    The self-incompatible species Arabidopsis halleri is a close relative of the self-compatible model plant Arabidopsis thaliana. The broad European and Asian distribution and heavy metal hyperaccumulation ability make A. halleri a useful model for ecological genomics studies. We used long-insert mate-pair libraries to improve the genome assembly of the A. halleri ssp. gemmifera Tada mine genotype (W302) collected from a site with high contamination by heavy metals in Japan. After five rounds of ...

  10. Whole-genome shotgun optical mapping of Rhodobacter sphaeroides strain 2.4.1 and its use for whole-genome shotgun sequence assembly

    Energy Technology Data Exchange (ETDEWEB)

    Shou, S. [Univ. Wisc.-Madison; Kvikstad, E. [Univ. Wisc.-Madison; Kile, A. [Univ. Wisc.-Madison; Severin, J. [Whole-genome shotgun optical mapping of Rhodobacter sphaeroides strain 2.4. 1 and its use for whole-genome shotgun sequence assembly; Forrest, D. [Univ. Wisc.-Madison; Runnheim, R. [Univ. Wisc.-Madison; Churas, C. [Univ. Wisc.-Madison; Hickman, J. W. [Univ. Wisc.-Madison; Mackenzie, C. [University of Texas–Houston Medical School; Choudhary, M. [University of Texas–Houston Medical School; Donohue, T. [Univ. Wisc.-Madison; Kaplan, S. [University of Texas–Houston Medical School; Schwartz, D. C. [Univ. Wisc.-Madison

    2003-09-01

    Rhodobacter sphaeroides 2.4.1 is a facultative photoheterotrophic bacterium with tremendous metabolic diversity, which has significantly contributed to our understanding of the molecular genetics of photosynthesis, photoheterotrophy, nitrogen fixation, hydrogen metabolism, carbon dioxide fixation, taxis, and tetrapyrrole biosynthesis. To further understand this remarkable bacterium, and to accelerate an ongoing sequencing project, two whole-genome restriction maps (EcoRI and HindIII) of R. sphaeroides strain 2.4.1 were constructed using shotgun optical mapping. The approach directly mapped genomic DNA by the random mapping of single molecules. The two maps were used to facilitate sequence assembly by providing an optical scaffold for high-resolution alignment and verification of sequence contigs. Our results show that such maps facilitated the closure of sequence gaps by the early detection of nascent sequence contigs during the course of the whole-genome shotgun sequencing process.

  11. Characterization of genome-wide microsatellites of Saccharina japonica based on a preliminary assembly of Illumina sequencing reads

    Science.gov (United States)

    Zhang, Linan; Peng, Jie; Li, Xiaojie; Cui, Cuiju; Sun, Juan; Yang, Guanpin

    2016-06-01

    Microsatellites, or simple sequence repeats (SSRs), perform a wide range of functions that depend on their location in the genome. However, their characteristics are often ignored because genomic sequences are lacking for most species. Kelp (Saccharina japonica), a brown macroalga, is extensively cultured in China. In this study, the genome of S. japonica was surveyed using an Illumina sequencing platform, and its microsatellites were characterized. The preliminarily assembled genome was 469.4 Mb in size, with a scaffold N50 of 20529 bp. Among the 128370 identified microsatellites, 90671, 25726 and 11973 were found in intergenic regions, introns and exons, averaging 339.3, 178.8 and 205.4 microsatellites per Mb, respectively. These microsatellites were distributed unevenly in the S. japonica genome. Mononucleotide motifs were the most abundant in the genome, while trinucleotide ones were the most prevalent in exons. The microsatellite abundance decreased significantly as the number of motif repeats increased, and the microsatellites with a small number of repeats accounted for a higher proportion in exons than in intergenic regions and introns. C/G-rich motifs were more common in exons than in intergenic regions and introns. These characteristics of microsatellites in the S. japonica genome may be associated with their functions, and ultimately with their adaptation and evolution. Among the 120140 pairs of designed microsatellite primers, approximately 75% were predicted to be able to amplify S. japonica DNA. These microsatellite markers will be extremely useful for genetic breeding and population evolution studies of kelp.
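
    The counts and densities quoted in the record above can be cross-checked with a few lines of arithmetic. The sketch below divides each reported count by its reported density to get the region size each pair implies; those implied sizes are derived here for illustration and are not figures reported in the abstract.

```python
# Values quoted in the abstract: microsatellite counts and densities per region.
counts = {"intergenic": 90671, "intron": 25726, "exon": 11973}
density_per_mb = {"intergenic": 339.3, "intron": 178.8, "exon": 205.4}

# Implied region size in Mb = count / (microsatellites per Mb).
implied_mb = {region: counts[region] / density_per_mb[region] for region in counts}

total_count = sum(counts.values())
total_mb = sum(implied_mb.values())

for region, size in implied_mb.items():
    print(f"{region}: about {size:.1f} Mb")
print(f"total microsatellites: {total_count}")
print(f"total implied size: about {total_mb:.1f} Mb")
```

    The three counts sum to the stated 128370 microsatellites, and the implied region sizes add up to roughly the stated 469.4 Mb assembly, so the reported figures are internally consistent.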

  12. Insight into structure and assembly of the nuclear pore complex by utilizing the genome of a eukaryotic thermophile

    DEFF Research Database (Denmark)

    Amlacher, Stefan; Sarges, Phillip; Flemming, Dirk;

    2011-01-01

    Despite decades of research, the structure and assembly of the nuclear pore complex (NPC), which is composed of ~30 nucleoporins (Nups), remain elusive. Here, we report the genome of the thermophilic fungus Chaetomium thermophilum (ct) and identify the complete repertoire of Nups therein. The thermophilic proteins show improved properties for structural and biochemical studies compared to their mesophilic counterparts, and purified ctNups enabled the reconstitution of the inner pore ring module that spans the width of the NPC from the anchoring membrane to the central transport channel. This module ... of a thermophilic eukaryote for studying complex molecular machines.

  13. De novo assembly of a genome-wide transcriptome map of Vicia faba (L.) for transfer cell research.

    Science.gov (United States)

    Arun-Chinnappa, Kiruba S; McCurdy, David W

    2015-01-01

    Vicia faba (L.) is an important cool-season grain legume species used widely in agriculture but also in plant physiology research, particularly as an experimental model to study transfer cell (TC) development. TCs are specialized nutrient transport cells in plants, characterized by invaginated wall ingrowths with amplified plasma membrane surface area enriched with transporter proteins that facilitate nutrient transfer. Many TCs are formed by trans-differentiation from differentiated cells at apoplasmic/symplasmic boundaries in nutrient transport. Adaxial epidermal cells of isolated cotyledons can be induced to form functional TCs, thus providing a valuable experimental system to investigate genetic regulation of TC trans-differentiation. The genome of V. faba is exceedingly large (ca. 13 Gb), however, and limited genomic information is available for this species. To provide a resource for future transcript profiling of epidermal TC differentiation, we have undertaken de novo assembly of a genome-wide transcriptome map for V. faba. Illumina paired-end sequencing of total RNA pooled from different tissues and different stages, including isolated cotyledons induced to form epidermal TCs, generated 69.5 M reads, of which 65.8 M were used for assembly following trimming and quality control. Assembly using a De-Bruijn graph-based approach generated 21,297 contigs, of which 80.6% were successfully annotated against GO terms. The assembly was validated against known V. faba cDNAs held in GenBank, including transcripts previously identified as being specifically expressed in epidermal cells across TC trans-differentiation. This genome-wide transcriptome map therefore provides a valuable tool for future transcript profiling of epidermal TC trans-differentiation, and also enriches the genetic resources available for this important legume crop species.

  14. The Carcinogenic Liver Fluke, Clonorchis sinensis: New Assembly, Reannotation and Analysis of the Genome and Characterization of Tissue Transcriptomes

    OpenAIRE

    Yan Huang; Wenjun Chen; Xiaoyun Wang; Hailiang Liu; Yangyi Chen; Lei Guo; Fang Luo; Jiufeng Sun; Qiang Mao; Pei Liang; Zhizhi Xie; Chenhui Zhou; Yanli Tian; Xiaoli Lv; Lisi Huang

    2013-01-01

    Clonorchis sinensis (C. sinensis), an important food-borne parasite that inhabits the intrahepatic bile duct and causes clonorchiasis, is of interest to both the public health field and the scientific research community. To learn more about the migration, parasitism and pathogenesis of C. sinensis at the molecular level, the present study developed an upgraded genomic assembly and annotation by sequencing paired-end and mate-paired libraries. We also performed transcriptome sequence analyses ...

  15. De novo assembly of a 40 Mb eukaryotic genome from short sequence reads: Sordaria macrospora, a model organism for fungal morphogenesis.

    Science.gov (United States)

    Nowrousian, Minou; Stajich, Jason E; Chu, Meiling; Engh, Ines; Espagne, Eric; Halliday, Karen; Kamerewerd, Jens; Kempken, Frank; Knab, Birgit; Kuo, Hsiao-Che; Osiewacz, Heinz D; Pöggeler, Stefanie; Read, Nick D; Seiler, Stephan; Smith, Kristina M; Zickler, Denise; Kück, Ulrich; Freitag, Michael

    2010-04-08

    Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30-90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in approximately 4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for

  16. Modulation of Root Microbiome Community Assembly by the Plant Immune Response (2013 DOE JGI Genomics of Energy and Environment 8th Annual User Meeting)

    Energy Technology Data Exchange (ETDEWEB)

    Lebeis, Sarah [University of North Carolina

    2013-03-01

    Sarah Lebeis of University of North Carolina on "Modulation of root microbiome community assembly by the plant immune response" at the 8th Annual Genomics of Energy & Environment Meeting on March 28, 2013 in Walnut Creek, Calif.

  17. Assembly-driven metagenomics of a hypersaline microbial ecosystem (2013 DOE JGI Genomics of Energy and Environment 8th Annual User Meeting)

    Energy Technology Data Exchange (ETDEWEB)

    Allen, Eric [Scripps and UCSD

    2013-03-01

    Eric Allen of Scripps and UC San Diego on "Assembly-driven metagenomics of a hypersaline microbial ecosystem" at the 8th Annual Genomics of Energy & Environment Meeting on March 27, 2013 in Walnut Creek, Calif.

  18. De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits.

    Science.gov (United States)

    Li, Ying-hui; Zhou, Guangyu; Ma, Jianxin; Jiang, Wenkai; Jin, Long-guo; Zhang, Zhouhao; Guo, Yong; Zhang, Jinbo; Sui, Yi; Zheng, Liangtao; Zhang, Shan-shan; Zuo, Qiyang; Shi, Xue-hui; Li, Yan-fei; Zhang, Wan-ke; Hu, Yiyao; Kong, Guanyi; Hong, Hui-long; Tan, Bing; Song, Jian; Liu, Zhang-xiong; Wang, Yaoshen; Ruan, Hang; Yeung, Carol K L; Liu, Jian; Wang, Hailong; Zhang, Li-juan; Guan, Rong-xia; Wang, Ke-jing; Li, Wen-bin; Chen, Shou-yi; Chang, Ru-zhen; Jiang, Zhi; Jackson, Scott A; Li, Ruiqiang; Qiu, Li-juan

    2014-10-01

    Wild relatives of crops are an important source of genetic diversity for agriculture, but their gene repertoire remains largely unexplored. We report the establishment and analysis of a pan-genome of Glycine soja, the wild relative of cultivated soybean Glycine max, by sequencing and de novo assembly of seven phylogenetically and geographically representative accessions. Intergenomic comparisons identified lineage-specific genes and genes with copy number variation or large-effect mutations, some of which show evidence of positive selection and may contribute to variation of agronomic traits such as biotic resistance, seed composition, flowering and maturity time, organ size and final biomass. Approximately 80% of the pan-genome was present in all seven accessions (core), whereas the rest was dispensable and exhibited greater variation than the core genome, perhaps reflecting a role in adaptation to diverse environments. This work will facilitate the harnessing of untapped genetic diversity from wild soybean for enhancement of elite cultivars.
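
    The core versus dispensable split described in the record above can be sketched with plain set operations. This is only an illustration of the definitions (core = present in every accession, dispensable = the rest of the pan-genome); the function name, accession labels, and gene family names are hypothetical and do not come from the study, which worked from whole-genome assemblies rather than precomputed family lists.

```python
def split_pan_genome(gene_sets):
    """Split gene families into pan, core, and dispensable sets.

    gene_sets maps an accession name to the set of gene families found in it.
    """
    if not gene_sets:
        raise ValueError("need at least one accession")
    families = list(gene_sets.values())
    pan = set().union(*families)          # found in at least one accession
    core = set.intersection(*families)    # found in every accession
    return pan, core, pan - core


# Hypothetical toy data, for illustration only.
toy = {
    "acc1": {"famA", "famB", "famC", "famD"},
    "acc2": {"famA", "famB", "famC", "famE"},
    "acc3": {"famA", "famB", "famC", "famD"},
}
pan, core, dispensable = split_pan_genome(toy)
core_fraction = len(core) / len(pan)
print(sorted(core), sorted(dispensable), core_fraction)
```

    In this toy input three of five families are shared by all accessions, so the core fraction is 0.6; the study reports a core of approximately 80% across its seven accessions.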

  19. Whole-genome optical mapping reveals a mis-assembly between two rRNA operons of Corynebacterium pseudotuberculosis strain 1002.

    Science.gov (United States)

    Mariano, Diego César Batista; Sousa, Thiago de Jesus; Pereira, Felipe Luiz; Aburjaile, Flávia; Barh, Debmalya; Rocha, Flávia; Pinto, Anne Cybelle; Hassan, Syed Shah; Saraiva, Tessália Diniz Luerce; Dorella, Fernanda Alves; de Carvalho, Alex Fiorini; Leal, Carlos Augusto Gomes; Figueiredo, Henrique César Pereira; Silva, Artur; Ramos, Rommel Thiago Jucá; Azevedo, Vasco Ariston Carvalho

    2016-04-30

    Studies have detected mis-assemblies in genomes of the species Corynebacterium pseudotuberculosis. These new discoveries have been possible due to the evolution of Next-Generation Sequencing platforms, which have provided accurate sequencing at reduced cost. In addition, improvements in techniques for constructing high-accuracy genomic maps, for example Whole-genome mapping (WGM) (OpGen Inc), have allowed high-resolution assembly that can detect large rearrangements. In this work, we present the resequencing of Corynebacterium pseudotuberculosis strain 1002 (Cp1002). Cp1002 was the first strain of this species sequenced in Brazil, and its genome has been used as a model for several in silico studies of caseous lymphadenitis. The sequencing was performed using the Ion PGM platform and a fragment library (200 bp kit). A restriction map was constructed using WGM with the enzyme KpnI. After the new assembly process, using WGM as a scaffolder, we detected a large inversion spanning more than one-half of the genome. A specific analysis using BLAST and the NR database shows that the inversion occurs between two homologous ribosomal RNA regions. In conclusion, the results show that WGM can be used to detect mismatches in assemblies, providing high-resolution genomic maps and allowing more accurate and complete assemblies. The new assembly of C. pseudotuberculosis was deposited in GenBank under the accession no. CP012837.

  20. Using Genome-Referenced Expressed Sequence Tag Assembly to Analyze the Origin and Expression Patterns of Gossypium hirsutum Transcripts

    Institute of Scientific and Technical Information of China (English)

    Xiang Jin; Qin Li; Guanghui Xiao; Yu-Xian Zhu

    2013-01-01

    We assembled a total of 297,239 Gossypium hirsutum (Gh, a tetraploid cotton, AADD) expressed sequence tag (EST) sequences that were available in the National Center for Biotechnology Information database, with reference to the recently published G. raimondii (Gr, a diploid cotton, DD) genome, and obtained 49,125 UniGenes. The average length of the UniGenes increased from 804 and 791 bp in two previous EST assemblies to 1,019 bp in the current analysis. The number of putative cotton UniGenes with lengths of 3 kb or more increased from 25 or 34 to 1,223. As a result, thousands of originally independent G. hirsutum ESTs were aligned to produce large contigs encoding transcripts with very long open reading frames, indicating that the G. raimondii genome sequence provided remarkable advantages for assembling the tetraploid cotton transcriptome. Significantly different distribution patterns within several GO terms, including transcription factor activity, were observed between D- and A-derived assemblies. Transcriptome analysis showed that, in a tetraploid cotton cell, 29,547 UniGenes were possibly derived from the D subgenome while another 19,578 may come from the A subgenome. Finally, some of the in silico data were confirmed by reverse transcription polymerase chain reaction experiments to show the changes in transcript levels for several gene families known to play key roles in cotton fiber development. We believe that our work provides a useful platform for functional and evolutionary genomic studies in cotton.

  1. Transcriptome analysis of root response to citrus blight based on the newly assembled Swingle citrumelo draft genome.

    Science.gov (United States)

    Zhang, Yunzeng; Barthe, Gary; Grosser, Jude W; Wang, Nian

    2016-07-08

    Citrus blight is a disease that causes the overall decline of citrus trees and serious losses in the citrus industry worldwide. Although it was described more than one hundred years ago, its causal agent remains unknown and its pathophysiology is not well determined, which hampers our understanding of the disease and the design of suitable disease management. In this study, we sequenced and assembled the draft genome for Swingle citrumelo, an important citrus rootstock. The draft genome is approximately 280 Mb, which covers 74% of the estimated Swingle citrumelo genome, and the average coverage is around 15X. The draft genome of Swingle citrumelo enabled us to conduct transcriptome analysis of roots of blight and healthy Swingle citrumelo using RNA-seq. The RNA-seq was reliable, as evidenced by the high consistency between the RNA-seq analysis and quantitative reverse transcription PCR results (R² = 0.966). Comparison of the gene expression profiles between blight and healthy root samples revealed the molecular mechanism underlying the characteristic blight phenotypes, including decline, starch accumulation, and drought stress. The JA and ET biosynthesis and signaling pathways showed decreased transcript abundance, whereas SA-mediated defense-related genes showed increased transcript abundance in blight trees, suggesting that an unclassified biotrophic pathogen is involved in this disease. Overall, the Swingle citrumelo draft genome generated in this study will advance our understanding of plant biology and contribute to citrus breeding. Transcriptome analysis of blight and healthy trees deepened our understanding of the pathophysiology of citrus blight.

  2. Genome-wide identification of novel genetic markers from RNA sequencing assembly of diverse Aegilops tauschii accessions.

    Science.gov (United States)

    Nishijima, Ryo; Yoshida, Kentaro; Motoi, Yuka; Sato, Kazuhiro; Takumi, Shigeo

    2016-08-01

    The wild species in the Triticeae tribe are tremendous resources for crop breeding due to their abundant natural variation. However, their huge and highly repetitive genomes have hindered the establishment of physical maps and the completeness of their genome sequences. To develop molecular markers for the efficient utilization of their valuable traits while avoiding their genome complexity, we assembled RNA sequences of ten representative accessions of Aegilops tauschii, the progenitor of the wheat D genome, and estimated single nucleotide polymorphisms (SNPs) and insertions/deletions (indels). The deduced unigenes were anchored to the chromosomes of Ae. tauschii and barley. The SNPs and indels in the anchored unigenes, covering entire chromosomes, were sufficient for linkage map construction, even in combinations between the genetically closest accessions. Interestingly, the resolution of SNP and indel distribution on barley chromosomes was slightly higher than on Ae. tauschii chromosomes. Since barley chromosomes are regarded as virtual chromosomes of Triticeae species, our strategy allows capture of genetic markers arranged on the chromosomes in order based on the conserved synteny. The resolution of these genetic markers will be comparable to that of the Ae. tauschii whose draft genome sequence is available. Our procedure should be applicable to marker development for Triticeae species, which have no draft sequences available.

  3. Draft Genome Sequences of Tersicoccus phoenicis DSM 30849T, Isolated from a Cleanroom for Spacecraft Assembly, and Tersicoccus sp. Strain Bi-70, Isolated from a Freshwater Lake

    Science.gov (United States)

    Yoshizawa, Susumu; Nakamura, Keiji; Ogura, Yoshitoshi; Hayashi, Tetsuya; Kogure, Kazuhiro

    2017-01-01

    Here, we report the draft genome sequences of Tersicoccus phoenicis DSM 30849T, isolated from a spacecraft assembly cleanroom at the National Aeronautics and Space Administration (NASA), and Tersicoccus sp. strain Bi-70, isolated from Lake Biwa, the largest lake in Japan. These genome sequences facilitate our understanding of the adaptation of these closely related strains to different habitats. PMID:28360156

  4. De novo sequencing, assembly and analysis of the genome of the laboratory strain Saccharomyces cerevisiae CEN.PK113-7D, a model for modern industrial biotechnology

    NARCIS (Netherlands)

    Nijkamp, J.F.; Van den Broek, M.A.; Datema, E.; De Kok, S.; Bosman, L.; Luttik, M.A.H.; Daran-Lapujade, P.A.S.; Vongsangnak, W.; Nielsen, J.; Heijne, W.H.M.; Klaassen, P.; Paddon, C.J.; Platt, D.; Kötter, P.; Van Ham, R.C.; Reinders, M.J.T.; Pronk, J.T.; De Ridder, D.; Daran, J.M.

    2012-01-01

    Saccharomyces cerevisiae CEN.PK 113-7D is widely used for metabolic engineering and systems biology research in industry and academia. We sequenced, assembled, annotated and analyzed its genome. Single-nucleotide variations (SNV), insertions/deletions (indels) and differences in genome organization

  5. Sequence assembly

    DEFF Research Database (Denmark)

    Scheibye-Alsing, Karsten; Hoffmann, S.; Frankel, Annett Maria

    2009-01-01

    Despite the rapidly increasing number of sequenced and re-sequenced genomes, many issues regarding the computational assembly of large-scale sequencing data remain unresolved. Computational assembly is crucial in large genome projects as well as for the evolving high-throughput technologies...

  6. Individual Genome of the Russian Male: SNP Calling and a de novo Assembly of Unmapped Reads.

    Science.gov (United States)

    Chekanov, N N; Boulygina, E S; Beletskiy, A V; Prokhortchouk, E B; Skryabin, K G

    2010-07-01

    A somatic cell genome was recently resequenced for a patient with renal cancer. The data were submitted to the NCBI Sequence Read Archive under the accession number SRA012240. Here, we have performed SNP calling for the genome and compared it with several published genomes. We have found 2,921,724 SNPs, including 1,472,679 newly described ones. Among them, 63,462 SNPs have been mapped to the Y chromosome and, based on 18 markers, the genome has been ascribed to the R1a1a haplogroup predominant in Russian males. The mitochondrial haplogroup has been determined as U5a, which is also common in the European part of Russia. Short reads unmapped to the human genome were used for the de novo assembly of DNA sequences. This resulted in genome-specific contigs (more than 100 bp in length) with an overall length of 154 kbp (for GAII) and 4.7 kbp (for SOLiD).

  7. Comparison of carnivore, omnivore, and herbivore mammalian genomes with a new leopard assembly

    OpenAIRE

    Kim, Soonok; Cho, Yun Sung; Kim, Hak-Min; Chung, Oksung; Kim, Hyunho; Jho, Sungwoong; Seomun, Hong; Kim, Jeongho; Bang, Woo Young; Kim, Changmu; An, Junghwa; Bae, Chang Hwan; Bhak, Youngjune; Jeon, Sungwon; Yoon, Hyejun

    2016-01-01

    Background: There are three main dietary groups in mammals: carnivores, omnivores, and herbivores. Currently, there is limited comparative genomics insight into the evolution of dietary specializations in mammals. Due to recent advances in sequencing technologies, we were able to perform in-depth whole genome analyses of representatives of these three dietary groups. Results: We investigated the evolution of carnivory by comparing 18 representative genomes from across Mammalia with carnivorou...

  8. Comparison of carnivore, omnivore, and herbivore mammalian genomes with a new leopard assembly

    OpenAIRE

    Kim, Soonok; Cho, Yun Sung; Kim, Hak-Min; Chung, Oksung; Kim, Hyunho; Jho, Sungwoong; Seomun, Hong; Kim, Jeongho; Bang, Woo Young; Kim, Changmu; An, Junghwa; Bae, Chang Hwan; Bhak, Youngjune; Jeon, Sungwon; Yoon, Hyejun

    2016-01-01

    Background There are three main dietary groups in mammals: carnivores, omnivores, and herbivores. Currently, there is limited comparative genomics insight into the evolution of dietary specializations in mammals. Due to recent advances in sequencing technologies, we were able to perform in-depth whole genome analyses of representatives of these three dietary groups. Results We investigated the evolution of carnivory by comparing 18 representative genomes from across Mammalia with carnivorous,...

  9. INDIVIDUAL GENOME OF THE RUSSIAN MALE: SNP CALLING AND A DE NOVO ASSEMBLY OF UNMAPPED READS

    OpenAIRE

    Chekanov, N.; Boulygina, E.; Beletskiy, A.; Prokhortchouk, E.; Skryabin, K.

    2010-01-01

    A somatic cell genome was recently resequenced for a patient with renal cancer. The data were submitted to the NCBI Sequence Read Archive under the accession number SRA012240. Here, we have performed SNP calling for the genome and compared it with several published genomes. We have found 2,921,724 SNPs, including 1,472,679 newly described ones. Among them, 63,462 SNPs have been mapped to the Y chromosome and, based on 18 markers, the genome has been ascribed to the R1a1a haplogroup predo...

  10. A De Novo Genome Sequence Assembly of the Arabidopsis thaliana Accession Niederzenz-1 Displays Presence/Absence Variation and Strong Synteny

    Science.gov (United States)

    Pucker, Boas; Holtgräwe, Daniela; Rosleff Sörensen, Thomas; Stracke, Ralf; Viehöver, Prisca

    2016-01-01

    Arabidopsis thaliana is the most important model organism for fundamental plant biology. The genome diversity of different accessions of this species has been intensively studied, for example in the 1001 genome project, which led to the identification of many single nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels). In addition, presence/absence variation (PAV), copy number variation (CNV) and mobile genetic elements contribute to genomic differences between A. thaliana accessions. To address larger genome rearrangements between the A. thaliana reference accession Columbia-0 (Col-0) and another accession of about average distance to Col-0, we created a de novo next generation sequencing (NGS)-based assembly from the accession Niederzenz-1 (Nd-1). The result was evaluated with respect to assembly strategy and synteny to Col-0. We provide a high quality genome sequence of the A. thaliana accession (Nd-1, LXSY01000000). The assembly displays an N50 of 0.590 Mbp and covers 99% of the Col-0 reference sequence. Scaffolds from the de novo assembly were positioned on the basis of sequence similarity to the reference. Errors in this automatic scaffold anchoring were manually corrected based on analyzing reciprocal best BLAST hits (RBHs) of genes. Comparison of the final Nd-1 assembly to the reference revealed duplications and deletions (PAV). We identified 826 insertions and 746 deletions in Nd-1. Randomly selected candidates of PAV were experimentally validated. Our Nd-1 de novo assembly allowed reliable identification of larger genic and intergenic variants, which was difficult or error-prone by short read mapping approaches alone. While overall sequence similarity as well as synteny is very high, we detected short and larger (affecting more than 100 bp) differences between Col-0 and Nd-1 based on bi-directional comparisons. The de novo assembly provided here and additional assemblies that will certainly be published in the future will allow to
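
    The reciprocal best hit (RBH) criterion mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes each search result has already been reduced to (query, subject, bit score) tuples; the gene names and scores below are hypothetical, and this is not the authors' pipeline.

```python
def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit.

    a_vs_b and b_vs_a are iterables of (query, subject, bitscore) tuples,
    e.g. parsed from tabular BLAST output (the parsing is not shown here).
    """
    def best(hits):
        top = {}
        for query, subject, score in hits:
            # Keep only the highest-scoring subject for every query.
            if query not in top or score > top[query][1]:
                top[query] = (subject, score)
        return {query: subject for query, (subject, _) in top.items()}

    best_ab = best(a_vs_b)
    best_ba = best(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hits between two gene sets.
col_vs_nd = [("colA", "ndX", 200), ("colA", "ndY", 150), ("colB", "ndY", 90)]
nd_vs_col = [("ndX", "colA", 195), ("ndY", "colA", 140), ("ndY", "colB", 80)]
pairs = reciprocal_best_hits(col_vs_nd, nd_vs_col)
print(pairs)
```

    Here colB's best hit is ndY, but ndY's best hit is colA, so only the colA/ndX pair satisfies the reciprocal condition.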

  11. A High-Resolution SNP Array-Based Linkage Map Anchors a New Domestic Cat Draft Genome Assembly and Provides Detailed Patterns of Recombination.

    Science.gov (United States)

    Li, Gang; Hillier, LaDeana W; Grahn, Robert A; Zimin, Aleksey V; David, Victor A; Menotti-Raymond, Marilyn; Middleton, Rondo; Hannah, Steven; Hendrickson, Sher; Makunin, Alex; O'Brien, Stephen J; Minx, Pat; Wilson, Richard K; Lyons, Leslie A; Warren, Wesley C; Murphy, William J

    2016-06-01

    High-resolution genetic and physical maps are invaluable tools for building accurate genome assemblies, and interpreting results of genome-wide association studies (GWAS). Previous genetic and physical maps anchored good quality draft assemblies of the domestic cat genome, enabling the discovery of numerous genes underlying hereditary disease and phenotypes of interest to the biomedical science and breeding communities. However, these maps lacked sufficient marker density to order thousands of shorter scaffolds in earlier assemblies, which instead relied heavily on comparative mapping with related species. A high-resolution map would aid in validating and ordering chromosome scaffolds from existing and new genome assemblies. Here, we describe a high-resolution genetic linkage map of the domestic cat genome based on genotyping 453 domestic cats from several multi-generational pedigrees on the Illumina 63K SNP array. The final maps include 58,055 SNP markers placed relative to 6637 markers with unique positions, distributed across all autosomes and the X chromosome. Our final sex-averaged maps span a total autosomal length of 4464 cM, the longest described linkage map for any mammal, confirming length estimates from a previous microsatellite-based map. The linkage map was used to order and orient the scaffolds from a substantially more contiguous domestic cat genome assembly (Felis catus v8.0), which incorporated ∼20 × coverage of Illumina fragment reads. The new genome assembly shows substantial improvements in contiguity, with a nearly fourfold increase in N50 scaffold size to 18 Mb. We use this map to report probable structural errors in previous maps and assemblies, and to describe features of the recombination landscape, including a massive (∼50 Mb) recombination desert (of virtually zero recombination) on the X chromosome that parallels a similar desert on the porcine X chromosome in both size and physical location.
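
    Several records in this listing report contiguity as N50. As a reminder of what that statistic measures, here is a minimal Python sketch using the common definition (the length L such that sequences of length L or longer contain at least half of the total assembled bases). The example lengths are made up and are not taken from any assembly above.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total.
        if running >= half_total:
            return length


# Hypothetical scaffold lengths: total 300, so half is 150.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

    Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract pairs it with a linkage map for ordering and validating scaffolds.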

  12. LTC: a novel algorithm to improve the efficiency of contig assembly for physical mapping in complex genomes

    Directory of Open Access Journals (Sweden)

    Feuillet Catherine

    2010-11-01

    Background: Physical maps are the substrate of genome sequencing and map-based cloning, and their construction relies on the accurate assembly of BAC clones into large contigs that are then anchored to genetic maps with molecular markers. High Information Content Fingerprinting has become the method of choice for large and repetitive genomes such as those of maize, barley, and wheat. However, the high level of repeated DNA present in these genomes requires the application of very stringent criteria to ensure a reliable assembly with the FingerPrinted Contig (FPC) software, which often results in short contig lengths (of 3-5 clones before merging) as well as an unreliable assembly in some difficult regions. Difficulties can originate from a non-linear topological structure of clone overlaps, low power of clone ordering algorithms, and the absence of tools to identify sources of gaps in Minimal Tiling Paths (MTPs). Results: To address these problems, we propose a novel approach that: (i) reduces the rate of false connections and Q-clones by using a new cutoff calculation method; (ii) obtains reliable clusters robust to the exclusion of a single clone or clone overlap; (iii) explores the topological contig structure by considering contigs as networks of clones connected by significant overlaps; (iv) performs iterative clone clustering combined with ordering and order verification using re-sampling methods; and (v) uses global optimization methods for clone ordering and Band Map construction. The elements of this new analytical framework, called Linear Topological Contig (LTC), were applied to datasets used previously for the construction of the physical map of wheat chromosome 3B with FPC. The performance of LTC vs. FPC was also compared on simulated BAC libraries based on the known genome sequences for chromosome 1 of rice and chromosome 1 of maize. Conclusions: The results show that compared to other methods, LTC enables the construction of highly
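
    The idea of treating contigs as networks of clones connected by significant overlaps can be sketched in Python as connected components of an overlap graph. This is only an illustration of the general idea: the overlap score (higher meaning stronger), the cutoff, and the clone names are hypothetical, and LTC's own cutoff calculation, linearity checks and re-sampling steps are not reproduced.

```python
def cluster_clones(clones, overlaps, cutoff):
    """Group clones into clusters linked by overlaps scoring >= cutoff.

    overlaps is an iterable of (clone_a, clone_b, score) tuples.
    Uses a small union-find structure to collect connected components.
    """
    parent = {clone: clone for clone in clones}

    def find(clone):
        # Follow parent links to the cluster representative (path halving).
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    for a, b, score in overlaps:
        if score >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for clone in clones:
        groups.setdefault(find(clone), []).append(clone)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical clones and pairwise overlap scores.
clones = ["c1", "c2", "c3", "c4"]
overlaps = [("c1", "c2", 0.9), ("c2", "c3", 0.8), ("c3", "c4", 0.2)]
clusters = cluster_clones(clones, overlaps, cutoff=0.5)
print(clusters)
```

    Raising the cutoff splits clusters and lowering it merges them, which is the trade-off between short contigs and false connections that the abstract describes.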

  13. Hybrid De Novo Genome Assembly Using MiSeq and SOLiD Short Read Data.

    Science.gov (United States)

    Ikegami, Tsutomu; Inatsugi, Toyohiro; Kojima, Isao; Umemura, Myco; Hagiwara, Hiroko; Machida, Masayuki; Asai, Kiyoshi

    2015-01-01

    A hybrid de novo assembly pipeline was constructed to utilize both MiSeq and SOLiD short read data in combination in the assembly. The short read data were converted to a standard format of the pipeline, and were supplied to the pipeline components such as ABySS and SOAPdenovo. The assembly pipeline proceeded through several stages, and either MiSeq paired-end data, SOLiD mate-paired data, or both of them could be specified as input data at each stage separately. The pipeline was examined on the filamentous fungus Aspergillus oryzae RIB40, by aligning the assembly results against the reference sequences. Using both the MiSeq and the SOLiD data in the hybrid assembly, the alignment length was improved by a factor of 3 to 8, compared with the assemblies using either one of the data types. The number of the reproduced gene cluster regions encoding secondary metabolite biosyntheses (SMB) was also improved by the hybrid assemblies. These results imply that the MiSeq data with long read length are essential to construct accurate nucleotide sequences, while the SOLiD mate-paired reads with long insertion length enhance long-range arrangements of the sequences. The pipeline was also tested on the actinomycete Streptomyces avermitilis MA-4680, whose genome is known to have high GC content. Although the quality of the SOLiD reads was too low to perform any meaningful assemblies by themselves, the alignment length to the reference was improved by a factor of 2, compared with the assembly using only the MiSeq data.

  14. Hybrid De Novo Genome Assembly Using MiSeq and SOLiD Short Read Data.

    Directory of Open Access Journals (Sweden)

    Tsutomu Ikegami

    A hybrid de novo assembly pipeline was constructed to utilize both MiSeq and SOLiD short read data in combination in the assembly. The short read data were converted to a standard format of the pipeline, and were supplied to the pipeline components such as ABySS and SOAPdenovo. The assembly pipeline proceeded through several stages, and either MiSeq paired-end data, SOLiD mate-paired data, or both of them could be specified as input data at each stage separately. The pipeline was examined on the filamentous fungus Aspergillus oryzae RIB40, by aligning the assembly results against the reference sequences. Using both the MiSeq and the SOLiD data in the hybrid assembly, the alignment length was improved by a factor of 3 to 8, compared with the assemblies using either one of the data types. The number of the reproduced gene cluster regions encoding secondary metabolite biosyntheses (SMB) was also improved by the hybrid assemblies. These results imply that the MiSeq data with long read length are essential to construct accurate nucleotide sequences, while the SOLiD mate-paired reads with long insertion length enhance long-range arrangements of the sequences. The pipeline was also tested on the actinomycete Streptomyces avermitilis MA-4680, whose genome is known to have high GC content. Although the quality of the SOLiD reads was too low to perform any meaningful assemblies by themselves, the alignment length to the reference was improved by a factor of 2, compared with the assembly using only the MiSeq data.

  15. Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals

    DEFF Research Database (Denmark)

    Hellmann, Ines; Mang, Yuan; Gu, Zhiping

    2008-01-01

    for individual reads. Applying this method to data from the Celera human genome sequencing and SNP discovery project, we obtain estimates of nucleotide diversity in windows spanning the human genome and show that the diversity to divergence ratio is reduced in regions of low recombination. Furthermore, we show...

  16. Sequencing and de novo assembly of 150 genomes from Denmark as a population reference

    DEFF Research Database (Denmark)

    Maretty, Lasse; Jensen, Jacob Malte; Petersen, Bent

    2017-01-01

    Hundreds of thousands of human genomes are now being sequenced to characterize genetic variation and use this information to augment association mapping studies of complex disorders and other phenotypic traits. Genetic variation is identified mainly by mapping short reads to the reference genome ...

  17. Survey of endosymbionts in the Diaphorina citri metagenome and assembly of a Wolbachia wDi draft genome.

    Directory of Open Access Journals (Sweden)

    Surya Saha

    Diaphorina citri (Hemiptera: Psyllidae), the Asian citrus psyllid, is the insect vector of Ca. Liberibacter asiaticus, the causal agent of citrus greening disease. Sequencing of the D. citri metagenome has been initiated to gain better understanding of the biology of this organism and the potential roles of its bacterial endosymbionts. To corroborate candidate endosymbionts previously identified by rDNA amplification, raw reads from the D. citri metagenome sequence were mapped to reference genome sequences. Results of the read mapping provided the most support for Wolbachia and an enteric bacterium most similar to Salmonella. Wolbachia-derived reads were extracted using the complete genome sequences for four Wolbachia strains. Reads were assembled into a draft genome sequence, and the annotation assessed for the presence of features potentially involved in host interaction. Genome alignment with the complete sequences reveals membership of Wolbachia wDi in supergroup B, further supported by phylogenetic analysis of FtsZ. FtsZ and Wsp phylogenies additionally indicate that the Wolbachia strain in the Florida D. citri isolate falls into a sub-clade of supergroup B, distinct from Wolbachia present in Chinese D. citri isolates, supporting the hypothesis that the D. citri introduced into Florida did not originate from China.

  18. Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing.

    Science.gov (United States)

    Vembar, Shruthi Sridhar; Seetin, Matthew; Lambert, Christine; Nattestad, Maria; Schatz, Michael C; Baybayan, Primo; Scherf, Artur; Smith, Melissa Laird

    2016-08-01

    The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90-99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission.
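
    The (A + T) percentages quoted in this abstract are simple base-composition fractions. A minimal Python sketch of that calculation follows; it assumes the sequence is a plain string and ignores ambiguous bases such as N, which is one possible convention rather than the authors' stated one. The example sequences are made up.

```python
def at_fraction(sequence):
    """Return the fraction of unambiguous bases (A, C, G, T) that are A or T."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    at_count = sum(1 for base in bases if base in "AT")
    return at_count / len(bases)


# Hypothetical short sequence: four of its five bases are A or T.
print(at_fraction("AATTG"))
```

    Applied in sliding windows along a chromosome, the same function would highlight extremely AT-rich stretches such as the centromeres described above.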

  19. Updated genome assembly and annotation of Paenibacillus larvae, the agent of American foulbrood disease of honey bees

    Directory of Open Access Journals (Sweden)

    de Graaf Dirk C

    2011-09-01

    Background: As scientists continue to pursue various 'omics-based research, there is a need for high quality data for the most fundamental 'omics of all: genomics. The bacterium Paenibacillus larvae is the causative agent of the honey bee disease American foulbrood. If untreated, it can lead to the demise of an entire hive; the highly social nature of bees also leads to easy disease spread, between both individuals and colonies. Biologists have studied this organism since the early 1900s, and a century later, the molecular mechanism of infection remains elusive. Transcriptomics and proteomics, because of their ability to analyze multiple genes and proteins in a high-throughput manner, may be very helpful to its study. However, the power of these methodologies is severely limited without a complete genome; we undertake to address that deficiency here. Results: We used the Illumina GAIIx platform and conventional Sanger sequencing to generate a 182-fold sequence coverage of the P. larvae genome, and assembled the data using ABySS into a total of 388 contigs spanning 4.5 Mbp. Comparative genomics analysis against fully-sequenced soil bacteria P. JDR2 and P. vortex showed that regions of poor conservation may contain putative virulence factors. We used GLIMMER to predict 3568 gene models, and named them based on homology revealed by BLAST searches; proteases, hemolytic factors, toxins, and antibiotic resistance enzymes were identified in this way. Finally, mass spectrometry was used to provide experimental evidence that at least 35% of the genes are expressed at the protein level. Conclusions: This update on the genome of P. larvae and annotation represents an immense advancement from what we had previously known about this species. We provide here a reliable resource that can be used to elucidate the mechanism of infection, and by extension, more effective methods to control and cure this widespread honey bee disease.
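
    Fold coverage, as quoted in this abstract, is conventionally the total number of sequenced bases divided by the genome size. The Python sketch below shows that arithmetic. Using the 4.5 Mbp assembly span as the genome size is an assumption made only for illustration, so the implied read volume is a rough figure, not a number reported by the authors.

```python
def fold_coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by genome size (both in bp)."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size


def bases_for_coverage(target_fold, genome_size):
    """Sequenced bases needed to reach a target average depth."""
    return target_fold * genome_size


# Assumption for illustration: genome size taken as the 4.5 Mbp contig span.
assumed_genome_size = 4_500_000
implied_bases = bases_for_coverage(182, assumed_genome_size)
print(implied_bases)
print(fold_coverage(implied_bases, assumed_genome_size))
```

    Under that assumption, 182-fold coverage corresponds to roughly 0.8 Gbp of sequence data.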

  20. Comparison of assembled Clostridium botulinum A1 genomes revealed their evolutionary relationship.

    Science.gov (United States)

    Ng, Virginia; Lin, Wei-Jen

    2014-01-01

    Clostridium botulinum encompasses bacteria that produce at least one of the seven serotypes of botulinum neurotoxin (BoNT/A-G). The availability of genome sequences of four closely related Type A1 or A1(B) strains, as well as the A1-specific microarray, allowed the analysis of their genomic organizations and evolutionary relationship. The four genomes share >90% core genes and >96% functional groups. Phylogenetic analysis based on COG shows closer relations of the A1(B) strain, NCTC 2916, to B1 and F1 than A1 strains. Alignment of the genomes of the three A1 strains revealed a highly similar chromosomal structure with three small gaps in the genome of ATCC 19397 and one additional gap in the genome of Hall A, suggesting ATCC 19397 as an evolutionary intermediate between Hall A and ATCC 3502. Analyses of the four gap regions indicated potential horizontal gene transfer and recombination events important for the evolution of A1 strains.

  1. Human artificial chromosome assembly by transposon-based retrofitting of genomic BACs with synthetic alpha-satellite arrays.

    Science.gov (United States)

    Basu, Joydeep; Willard, Huntington F; Stromberg, Gregory

    2007-01-01

    The development of methodologies for the rapid assembly of synthetic alpha-satellite arrays recapitulating the higher-order periodic organization of native human centromeres permits the systematic investigation of the significance of primary sequence and sequence organization in centromere function. Synthetic arrays with defined mutations affecting sequence and/or organization may be evaluated in a de novo human artificial chromosome assay. This unit describes strategies for the assembly of custom built alpha-satellite arrays containing any desired mutation as well as strategies for the construction and manipulation of alpha satellite-based transposons. Transposons permit the rapid and reliable retrofitting of any genomic bacterial artificial chromosome (BAC) with synthetic alpha-satellite arrays and other functional components, thereby facilitating conversion into BAC-based human artificial chromosome vectors. These techniques permit identification and optimization of the critical parameters underlying the unique ability of alpha-satellite DNA to facilitate de novo centromere assembly, and they will establish the foundation for the next generation of human artificial chromosome vectors.

  2. A New Approach to Predict Microbial Community Assembly and Function Using a Stochastic, Genome-Enabled Modeling Framework

    Science.gov (United States)

    King, E.; Brodie, E.; Anantharaman, K.; Karaoz, U.; Bouskill, N.; Banfield, J. F.; Steefel, C. I.; Molins, S.

    2016-12-01

    Characterizing and predicting the microbial and chemical compositions of subsurface aquatic systems necessitates an understanding of the metabolism and physiology of organisms that are often uncultured or studied under conditions not relevant for one's environment of interest. Cultivation-independent approaches are therefore important and have greatly enhanced our ability to characterize functional microbial diversity. The capability to reconstruct genomes representing thousands of populations from microbial communities using metagenomic techniques provides a foundation for development of predictive models for community structure and function. Here, we discuss a genome-informed stochastic trait-based model incorporated into a reactive transport framework to represent the activities of coupled guilds of hypothetical microorganisms. Metabolic pathways for each microbe within a functional guild are parameterized from metagenomic data with a unique combination of traits governing organism fitness under dynamic environmental conditions. We simulate the thermodynamics of coupled electron donor and acceptor reactions to predict the energy available for cellular maintenance, respiration, biomass development, and enzyme production. While `omics analyses can now characterize the metabolic potential of microbial communities, it is functionally redundant as well as computationally prohibitive to explicitly include the thousands of recovered organisms into biogeochemical models. However, one can derive potential metabolic pathways from genomes along with trait-linkages to build probability distributions of traits. These distributions are used to assemble groups of microbes that couple one or more of these pathways. From the initial ensemble of microbes, only a subset will persist based on the interaction of their physiological and metabolic traits with environmental conditions, competing organisms, etc. Here, we analyze the predicted niches of these hypothetical microbes and

  3. A draft de novo genome assembly for the northern bobwhite (Colinus virginianus) reveals evidence for a rapid decline in effective population size beginning in the Late Pleistocene.

    Directory of Open Access Journals (Sweden)

    Yvette A Halley

    Wild populations of northern bobwhites (Colinus virginianus; hereafter bobwhite) have declined across nearly all of their U.S. range, and despite their importance as an experimental wildlife model for ecotoxicology studies, no bobwhite draft genome assembly currently exists. Herein, we present a bobwhite draft de novo genome assembly with annotation, comparative analyses including genome-wide analyses of divergence with the chicken (Gallus gallus) and zebra finch (Taeniopygia guttata) genomes, and coalescent modeling to reconstruct the demographic history of the bobwhite for comparison to other birds currently in decline (i.e., scarlet macaw; Ara macao). More than 90% of the assembled bobwhite genome was captured within 14,000 unique genes and proteins. Bobwhite analyses of divergence with the chicken and zebra finch genomes revealed many extremely conserved gene sequences, and evidence for lineage-specific divergence of noncoding regions. Coalescent models for reconstructing the demographic history of the bobwhite and the scarlet macaw provided evidence for population bottlenecks which were temporally coincident with human colonization of the New World, the late Pleistocene collapse of the megafauna, and the last glacial maximum. Demographic trends predicted for the bobwhite and the scarlet macaw also were concordant with how opposing natural selection strategies (i.e., skewness in the r-/K-selection continuum) would be expected to shape genome diversity and the effective population sizes in these species, which is directly relevant to future conservation efforts.

  4. A draft de novo genome assembly for the northern bobwhite (Colinus virginianus) reveals evidence for a rapid decline in effective population size beginning in the Late Pleistocene.

    Science.gov (United States)

    Halley, Yvette A; Dowd, Scot E; Decker, Jared E; Seabury, Paul M; Bhattarai, Eric; Johnson, Charles D; Rollins, Dale; Tizard, Ian R; Brightsmith, Donald J; Peterson, Markus J; Taylor, Jeremy F; Seabury, Christopher M

    2014-01-01

    Wild populations of northern bobwhites (Colinus virginianus; hereafter bobwhite) have declined across nearly all of their U.S. range, and despite their importance as an experimental wildlife model for ecotoxicology studies, no bobwhite draft genome assembly currently exists. Herein, we present a bobwhite draft de novo genome assembly with annotation, comparative analyses including genome-wide analyses of divergence with the chicken (Gallus gallus) and zebra finch (Taeniopygia guttata) genomes, and coalescent modeling to reconstruct the demographic history of the bobwhite for comparison to other birds currently in decline (i.e., scarlet macaw; Ara macao). More than 90% of the assembled bobwhite genome was captured within 14,000 unique genes and proteins. Bobwhite analyses of divergence with the chicken and zebra finch genomes revealed many extremely conserved gene sequences, and evidence for lineage-specific divergence of noncoding regions. Coalescent models for reconstructing the demographic history of the bobwhite and the scarlet macaw provided evidence for population bottlenecks which were temporally coincident with human colonization of the New World, the late Pleistocene collapse of the megafauna, and the last glacial maximum. Demographic trends predicted for the bobwhite and the scarlet macaw also were concordant with how opposing natural selection strategies (i.e., skewness in the r-/K-selection continuum) would be expected to shape genome diversity and the effective population sizes in these species, which is directly relevant to future conservation efforts.

  5. The Salmonella In Silico Typing Resource (SISTR): An Open Web-Accessible Tool for Rapidly Typing and Subtyping Draft Salmonella Genome Assemblies.

    Science.gov (United States)

    Yoshida, Catherine E; Kruczkiewicz, Peter; Laing, Chad R; Lingohr, Erika J; Gannon, Victor P J; Nash, John H E; Taboada, Eduardo N

    2016-01-01

    For nearly 100 years serotyping has been the gold standard for the identification of Salmonella serovars. Despite the increasing adoption of DNA-based subtyping approaches, serotype information remains a cornerstone in food safety and public health activities aimed at reducing the burden of salmonellosis. At the same time, recent advances in whole-genome sequencing (WGS) promise to revolutionize our ability to perform advanced pathogen characterization in support of improved source attribution and outbreak analysis. We present the Salmonella In Silico Typing Resource (SISTR), a bioinformatics platform for rapidly performing simultaneous in silico analyses for several leading subtyping methods on draft Salmonella genome assemblies. In addition to performing serovar prediction by genoserotyping, this resource integrates sequence-based typing analyses for: Multi-Locus Sequence Typing (MLST), ribosomal MLST (rMLST), and core genome MLST (cgMLST). We show how phylogenetic context from cgMLST analysis can supplement the genoserotyping analysis and increase the accuracy of in silico serovar prediction to over 94.6% on a dataset comprised of 4,188 finished genomes and WGS draft assemblies. In addition to allowing analysis of user-uploaded whole-genome assemblies, the SISTR platform incorporates a database comprising over 4,000 publicly available genomes, allowing users to place their isolates in a broader phylogenetic and epidemiological context. The resource incorporates several metadata driven visualizations to examine the phylogenetic, geospatial and temporal distribution of genome-sequenced isolates. As sequencing of Salmonella isolates at public health laboratories around the world becomes increasingly common, rapid in silico analysis of minimally processed draft genome assemblies provides a powerful approach for molecular epidemiology in support of public health investigations. Moreover, this type of integrated analysis using multiple sequence-based methods of sub

  6. Accurate DNA assembly and genome engineering with optimized uracil excision cloning

    DEFF Research Database (Denmark)

    Cavaleiro, Mafalda; Kim, Se Hyeuk; Seppala, Susanna

    2015-01-01

    Simple and reliable DNA editing by uracil excision (a.k.a. USER cloning) has been described by several research groups, but the optimal design of cohesive DNA ends for multigene assembly remains elusive. Here, we use two model constructs based on expression of gfp and a four-gene pathway that pro...

  7. Assembled Plastid and Mitochondrial Genomes, as well as Nuclear Genes, Place the Parasite Family Cynomoriaceae in the Saxifragales.

    Science.gov (United States)

    Bellot, Sidonie; Cusimano, Natalie; Luo, Shixiao; Sun, Guiling; Zarre, Shahin; Gröger, Andreas; Temsch, Eva; Renner, Susanne S

    2016-08-03

    Cynomoriaceae, one of the last unplaced families of flowering plants, comprise one or two species or subspecies of root parasites that occur from the Mediterranean to the Gobi Desert. Using Illumina sequencing, we assembled the mitochondrial and plastid genomes as well as some nuclear genes of a Cynomorium specimen from Italy. Selected genes were also obtained by Sanger sequencing from individuals collected in China and Iran, resulting in matrices of 33 mitochondrial, 6 nuclear, and 14 plastid genes and rDNAs enlarged to include a representative angiosperm taxon sampling based on data available in GenBank. We also compiled a new geographic map to discern possible discontinuities in the parasites' occurrence. Cynomorium has large genomes of 13.70-13.61 (Italy) to 13.95-13.76 pg (China). Its mitochondrial genome consists of up to 49 circular subgenomes and has an overall gene content similar to that of photosynthetic angiosperms, while its plastome retains only 27 of the normally 116 genes. Nuclear, plastid and mitochondrial phylogenies place Cynomoriaceae in Saxifragales, and we found evidence for several horizontal gene transfers from different hosts, as well as intracellular gene transfers. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  8. High-resolution linkage map and chromosome-scale genome assembly for cassava (Manihot esculenta Crantz) from 10 populations.

    Science.gov (United States)

    2014-12-11

    Cassava (Manihot esculenta Crantz) is a major staple crop in Africa, Asia, and South America, and its starchy roots provide nourishment for 800 million people worldwide. Although native to South America, cassava was brought to Africa 400-500 years ago and is now widely cultivated across sub-Saharan Africa, but it is subject to biotic and abiotic stresses. To assist in the rapid identification of markers for pathogen resistance and crop traits, and to accelerate breeding programs, we generated a framework map for M. esculenta Crantz from reduced representation sequencing [genotyping-by-sequencing (GBS)]. The composite 2412-cM map integrates 10 biparental maps (comprising 3480 meioses) and organizes 22,403 genetic markers on 18 chromosomes, in agreement with the observed karyotype. We used the map to anchor 71.9% of the draft genome assembly and 90.7% of the predicted protein-coding genes. The chromosome-anchored genome sequence will be useful for breeding improvement by assisting in the rapid identification of markers linked to important traits, and in providing a framework for genomic selection-enhanced breeding of this important crop.

  9. Discovery of genes related to insecticide resistance in Bactrocera dorsalis by functional genomic analysis of a de novo assembled transcriptome.

    Directory of Open Access Journals (Sweden)

    Ju-Chun Hsu

    Insecticide resistance has recently become a critical concern for control of many insect pest species. Genome sequencing and global quantization of gene expression through analysis of the transcriptome can provide useful information relevant to this challenging problem. The oriental fruit fly, Bactrocera dorsalis, is one of the world's most destructive agricultural pests, and recently it has been used as a target for studies of genetic mechanisms related to insecticide resistance. However, prior to this study, the molecular data available for this species was largely limited to genes identified through homology. To provide a broader pool of gene sequences of potential interest with regard to insecticide resistance, this study uses whole transcriptome analysis developed through de novo assembly of short reads generated by next-generation sequencing (NGS). The transcriptome of B. dorsalis was initially constructed using Illumina's Solexa sequencing technology. Qualified reads were assembled into contigs and potential splicing variants (isotigs). A total of 29,067 isotigs have putative homologues in the non-redundant (nr) protein database from NCBI, and 11,073 of these correspond to distinct D. melanogaster proteins in the RefSeq database. Approximately 5,546 isotigs contain coding sequences that are at least 80% complete and appear to represent B. dorsalis genes. We observed a strong correlation between the completeness of the assembled sequences and the expression intensity of the transcripts. The assembled sequences were also used to identify large numbers of genes potentially belonging to families related to insecticide resistance. A total of 90 P450-, 42 GST- and 37 COE-related genes, representing three major enzyme families involved in insecticide metabolism and resistance, were identified. In addition, 36 isotigs were discovered to contain target site sequences related to four classes of resistance genes. Identified sequence motifs were also

  10. RNA-Seq analysis of Cocos nucifera: transcriptome sequencing and de novo assembly for subsequent functional genomics approaches.

    Directory of Open Access Journals (Sweden)

    Haikuo Fan

    BACKGROUND: Cocos nucifera (coconut), a member of the Arecaceae family, is an economically important woody palm grown in tropical regions. Despite its agronomic importance, previous germplasm assessment studies have relied solely on morphological and agronomical traits. Molecular biology techniques have been scarcely used in assessment of genetic resources and for improvement of important agronomic and quality traits in Cocos nucifera, mostly due to the absence of available sequence information. METHODOLOGY/PRINCIPAL FINDINGS: To provide basic information for molecular breeding and further molecular biological analysis in Cocos nucifera, we applied RNA-seq technology and de novo assembly to gain a global overview of the Cocos nucifera transcriptome from mixed tissue samples. Using Illumina sequencing, we obtained 54.9 million short reads and conducted de novo assembly to obtain 57,304 unigenes with an average length of 752 base pairs. Sequence comparison between assembled unigenes and released cDNA sequences of Cocos nucifera and Elaeis guineensis indicated that the assembled sequences were of high quality. Approximately 99.9% of unigenes were novel compared to the released coconut EST sequences. Using BLASTX, 68.2% of unigenes were successfully annotated based on the Genbank non-redundant (Nr) protein database. The annotated unigenes were then further classified using the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. CONCLUSIONS/SIGNIFICANCE: Our study provides a large quantity of novel genetic information for Cocos nucifera. This information will act as a valuable resource for further molecular genetic studies and breeding in coconut, as well as for isolation and characterization of functional genes involved in different biochemical pathways in this important tropical crop species.

  11. Discovery of Genes Related to Insecticide Resistance in Bactrocera dorsalis by Functional Genomic Analysis of a De Novo Assembled Transcriptome

    Science.gov (United States)

    Hsu, Ju-Chun; Wu, Wen-Jer; Feng, Hai-Tung; Haymer, David S.; Chen, Chien-Yu

    2012-01-01

    Insecticide resistance has recently become a critical concern for control of many insect pest species. Genome sequencing and global quantization of gene expression through analysis of the transcriptome can provide useful information relevant to this challenging problem. The oriental fruit fly, Bactrocera dorsalis, is one of the world's most destructive agricultural pests, and recently it has been used as a target for studies of genetic mechanisms related to insecticide resistance. However, prior to this study, the molecular data available for this species was largely limited to genes identified through homology. To provide a broader pool of gene sequences of potential interest with regard to insecticide resistance, this study uses whole transcriptome analysis developed through de novo assembly of short reads generated by next-generation sequencing (NGS). The transcriptome of B. dorsalis was initially constructed using Illumina's Solexa sequencing technology. Qualified reads were assembled into contigs and potential splicing variants (isotigs). A total of 29,067 isotigs have putative homologues in the non-redundant (nr) protein database from NCBI, and 11,073 of these correspond to distinct D. melanogaster proteins in the RefSeq database. Approximately 5,546 isotigs contain coding sequences that are at least 80% complete and appear to represent B. dorsalis genes. We observed a strong correlation between the completeness of the assembled sequences and the expression intensity of the transcripts. The assembled sequences were also used to identify large numbers of genes potentially belonging to families related to insecticide resistance. A total of 90 P450-, 42 GST-and 37 COE-related genes, representing three major enzyme families involved in insecticide metabolism and resistance, were identified. In addition, 36 isotigs were discovered to contain target site sequences related to four classes of resistance genes. Identified sequence motifs were also analyzed to

  12. Genome assembly and geospatial phylogenomics of the bed bug Cimex lectularius

    NARCIS (Netherlands)

    Rosenfeld, Jeffrey A.; Reeves, Darryl; Brugler, Mercer R.; Narechania, Apurva; Simon, Sabrina; Durrett, Russell; Foox, Jonathan; Shianna, Kevin; Schatz, Michael C.; Gandara, Jorge; Afshinnekoo, Ebrahim; Lam, Ernest T.; Hastie, Alex R.; Chan, Saki; Cao, Han; Saghbini, Michael; Kentsis, Alex; Planet, Paul J.; Kholodovych, Vladyslav; Tessler, Michael; Baker, Richard; DeSalle, Rob; Sorkin, Louis N.; Kolokotronis, Sergios Orestis; Siddall, Mark E.; Amato, George; Mason, Christopher E.

    2016-01-01

    The common bed bug (Cimex lectularius) has been a persistent pest of humans for thousands of years, yet the genetic basis of the bed bug's basic biology and adaptation to dense human environments is largely unknown. Here we report the assembly, annotation and phylogenetic mapping of the 697.9-Mb

  13. Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic cyanobacterium in a freshwater lake co-culture.

    Science.gov (United States)

    Driscoll, Connor B; Otten, Timothy G; Brown, Nathan M; Dreher, Theo W

    2017-01-01

    Here we report three complete bacterial genome assemblies from a PacBio shotgun metagenome of a co-culture from Upper Klamath Lake, OR. Genome annotations and culture conditions indicate these bacteria are dependent on carbon and nitrogen fixation from the cyanobacterium Aphanizomenon flos-aquae, whose genome was assembled to draft-quality. Due to their taxonomic novelty relative to previously sequenced bacteria, we have temporarily designated these bacteria as incertae sedis Hyphomonadaceae strain UKL13-1 (3,501,508 bp and 56.12% GC), incertae sedis Betaproteobacterium strain UKL13-2 (3,387,087 bp and 54.98% GC), and incertae sedis Bacteroidetes strain UKL13-3 (3,236,529 bp and 37.33% GC). Each genome consists of a single circular chromosome with no identified plasmids. When compared with binned Illumina assemblies of the same three genomes, there was ~7% discrepancy in total genome length. Gaps where Illumina assemblies broke were often due to repetitive elements. Within these missing sequences were essential genes and genes associated with a variety of functional categories. Annotated gene content reveals that both Proteobacteria are aerobic anoxygenic phototrophs, with Betaproteobacterium UKL13-2 potentially capable of phototrophic oxidation of sulfur compounds. Both proteobacterial genomes contain transporters suggesting they are scavenging fixed nitrogen from A. flos-aquae in the form of ammonium. Bacteroidetes UKL13-3 has few completely annotated biosynthetic pathways, and has a comparatively higher proportion of unannotated genes. The genomes were detected in only a few other freshwater metagenomes, suggesting that these bacteria are not ubiquitous in freshwater systems. Our results indicate that long-read sequencing is a viable method for sequencing dominant members from low-diversity microbial communities, and should be considered for environmental metagenomics when conditions meet these requirements.

  14. Complete Sequencing and Chromosome-Scale Genome Assembly of the Industrial Progenitor Strain P2niaD18 from the Penicillin Producer Penicillium chrysogenum

    OpenAIRE

    Specht, Thomas; Dahlmann, Tim A.; Zadra, Ivo; Kürnsteiner, Hubert; Kück, Ulrich

    2014-01-01

    Penicillium chrysogenum is the major industrial producer of the β-lactam antibiotic penicillin. Here, we report the complete genome sequence of the industrial progenitor strain P. chrysogenum P2niaD18 in a chromosome-scale genome assembly. P2niaD18 is distinguished from the recently sequenced P. chrysogenum Wisconsin 54-1255 strain by major chromosomal rearrangements leading to a modified chromosomal architecture.

  15. A common genomic framework for a diverse assembly of plasmids in the symbiotic nitrogen fixing bacteria.

    Directory of Open Access Journals (Sweden)

    Lisa C Crossman

    This work centres on the genomic comparisons of two closely-related nitrogen-fixing symbiotic bacteria, Rhizobium leguminosarum biovar viciae 3841 and Rhizobium etli CFN42. These strains maintain a stable genomic core that is also common to other rhizobia species plus a very variable and significant accessory component. The chromosomes are highly syntenic, whereas plasmids are related by fewer syntenic blocks and have mosaic structures. The pairs of plasmids p42f-pRL12, p42e-pRL11 and p42b-pRL9 as well as large parts of p42c with pRL10 are shown to be similar, whereas the symbiotic plasmids (p42d and pRL10) are structurally unrelated and seem to follow distinct evolutionary paths. Even though purifying selection is acting on the whole genome, the accessory component is evolving more rapidly. This component is composed largely of proteins for transport of diverse metabolites and elements of external origin. The present analysis allows us to conclude that a heterogeneous and quickly diversifying group of plasmids co-exists in a common genomic framework.

  16. Deciphering heterogeneity in pig genome assembly Sscrofa9 by isochore and isochore-like region analyses.

    Directory of Open Access Journals (Sweden)

    Wenqian Zhang

    BACKGROUND: The isochore, a large DNA sequence with relatively small GC variance, is one of the most important structures in eukaryotic genomes. Although the isochore has been widely studied in humans and other species, little is known about its distribution in pigs. PRINCIPAL FINDINGS: In this paper, we construct a map of long homogeneous genome regions (LHGRs), i.e., isochores and isochore-like regions, in pigs to provide an intuitive version of GC heterogeneity in each chromosome. The LHGR pattern study not only quantifies heterogeneities, but also reveals some primary characteristics of the chromatin organization, including the following: (1) the majority of LHGRs belong to GC-poor families and are long; (2) a high gene density tends to occur with the appearance of GC-rich LHGRs; and (3) the density of LINE repeats decreases with an increase in the GC content of LHGRs. Furthermore, a portion of LHGRs with particular GC ranges (50%-51% and 54%-55%) tend to have abnormally high gene densities, suggesting that biased gene conversion (BGC), as well as time- and energy-saving principles, could be of importance to the formation of genome organization. CONCLUSION: This study significantly improves our knowledge of chromatin organization in the pig genome. Correlations between the different biological features (e.g., gene density and repeat density) and GC content of LHGRs provide a unique glimpse of in silico gene and repeats prediction.
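    The GC-content profiling that underlies an isochore map can be illustrated with a minimal Python sketch. It is not the authors' method: the function names, the non-overlapping window scheme, and the toy window size are all assumptions made for illustration, and real analyses would use far larger windows plus a segmentation step that is not shown here.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Return the fraction of G/C among unambiguous bases (A, C, G, T).

    Ambiguous bases such as N are ignored; an empty or all-N sequence gives 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int) -> List[Tuple[int, float]]:
    """GC fraction of consecutive non-overlapping windows.

    Returns (start_offset, gc_fraction) pairs; a trailing partial window is dropped.
    The window size is a hypothetical parameter chosen by the caller.
    """
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, window)
    ]


if __name__ == "__main__":
    # Toy sequence: an AT-only stretch followed by a GC-only stretch.
    toy = "ATATATATAT" + "GCGCGCGCGC"
    for start, gc in windowed_gc(toy, window=10):
        print(start, gc)
```

    Runs of adjacent windows with similar GC fractions would then be candidates for merging into a single homogeneous region; the merging criterion is left out because the abstract does not describe it.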

  17. Comparative transcriptome analyses and genome assembly of Fusarium oxysporum f. sp. cubense

    NARCIS (Netherlands)

    Dita, M.A.; Herai, R.; Waalwijk, C.; Yamagishi, M.; Giachetto, P.; Ferreira, G.; Souza, de M.; Kema, G.H.J.

    2013-01-01

    Fusarium oxysporum f. sp. cubense (Foc), the causal agent of Fusarium wilt of banana, is a highly destructive and genetically diverse pathogen. Despite its economic importance, genomic information about Foc is limited and no transcriptomic analyses have been reported so far. By using 454 sequencing

  18. Complete genome sequence of novel carbon monoxide oxidizing bacteria Citrobacter amalonaticus Y19, assembled de novo.

    Science.gov (United States)

    Ainala, Satish Kumar; Seol, Eunhee; Park, Sunghoon

    2015-10-10

    We report here the complete genome sequence of Citrobacter amalonaticus Y19 isolated from an anaerobic digester. PacBio single-molecule real-time (SMRT) sequencing was employed, resulting in a single scaffold of 5.58 Mb. The sequence of a 291 kb megaplasmid is also presented.

  19. Genome Assembly of Citrus Leprosis Virus Nuclear Type Reveals a Close Association with Orchid Fleck Virus

    OpenAIRE

    Roy, Avijit; Stone, Andrew; Otero-Colina, Gabriel; Wei, Gang; Choudhary, Nandlal; Achor, Diann; Shao, Jonathan; Levy, Laurene; Nakhla, Mark K.; Hollingsworth, Charla R.; Hartung, John S.; Schneider, William L.; Brlansky, Ronald H.

    2013-01-01

    The complete genome of citrus leprosis virus nuclear type (CiLV-N) was identified by small RNA sequencing utilizing leprosis-affected citrus samples collected from the state of Querétaro, Mexico. The nucleotide identity and phylogenetic analysis indicate that CiLV-N is very closely related to orchid fleck virus, which typically infects Cymbidium species.

  20. Functional Annotation, Genome Organization and Phylogeny of the Grapevine (Vitis vinifera) Terpene Synthase Gene Family Based on Genome Assembly, FLcDNA Cloning, and Enzyme Assays

    Science.gov (United States)

    2010-01-01

    Background: Terpenoids are among the most important constituents of grape flavour and wine bouquet, and serve as useful metabolite markers in viticulture and enology. Based on the initial 8-fold sequencing of a nearly homozygous Pinot noir inbred line, 89 putative terpenoid synthase genes (VvTPS) were predicted by in silico analysis of the grapevine (Vitis vinifera) genome assembly [1]. The finding of this very large VvTPS family, combined with the importance of terpenoid metabolism for the organoleptic properties of grapevine berries and finished wines, prompted a detailed examination of this gene family at the genomic level as well as an investigation into VvTPS biochemical functions. Results: We present findings from the analysis of the updated 12-fold sequencing and assembly of the grapevine genome that place the number of predicted VvTPS genes at 69 putatively functional VvTPS, 20 partial VvTPS, and 63 VvTPS probable pseudogenes. Gene discovery and annotation included information about gene architecture and chromosomal location. A dense cluster of 45 VvTPS is localized on chromosome 18. Extensive FLcDNA cloning, gene synthesis, and protein expression enabled functional characterization of 39 VvTPS; this is the largest number of functionally characterized TPS for any species reported to date. Of these enzymes, 23 have unique functions and/or phylogenetic locations within the plant TPS gene family. Phylogenetic analyses of the TPS gene family showed that while most VvTPS form species-specific gene clusters, there are several examples of gene orthology with TPS of other plant species, representing perhaps more ancient VvTPS, which have maintained functions independent of speciation. Conclusions: The highly expanded VvTPS gene family underpins the prominence of terpenoid metabolism in grapevine. We provide a detailed experimental functional annotation of 39 members of this important gene family in grapevine and comprehensive information about gene structure and

  1. Assembly of the Genome of the Disease Vector Aedes aegypti onto a Genetic Linkage Map Allows Mapping of Genes Affecting Disease Transmission

    KAUST Repository

    Juneja, Punita

    2014-01-30

    The mosquito Aedes aegypti transmits some of the most important human arboviruses, including dengue, yellow fever and chikungunya viruses. It has a large genome containing many repetitive sequences, which has resulted in the genome being poorly assembled - there are 4,758 scaffolds, few of which have been assigned to a chromosome. To allow the mapping of genes affecting disease transmission, we have improved the genome assembly by scoring a large number of SNPs in recombinant progeny from a cross between two strains of Ae. aegypti, and used these to generate a genetic map. This revealed a high rate of misassemblies in the current genome, where, for example, sequences from different chromosomes were found on the same scaffold. Once these were corrected, we were able to assign 60% of the genome sequence to chromosomes and approximately order the scaffolds along the chromosome. We found that there are very large regions of suppressed recombination around the centromeres, which can extend to as much as 47% of the chromosome. To illustrate the utility of this new genome assembly, we mapped a gene that makes Ae. aegypti resistant to the human parasite Brugia malayi, and generated a list of candidate genes that could be affecting the trait. © 2014 Juneja et al.

  2. De Novo assembly of the complete genome of an enhanced electricity-producing variant of Geobacter sulfurreducens using only short reads.

    Directory of Open Access Journals (Sweden)

    Harish Nagarajan

    State-of-the-art DNA sequencing technologies are transforming the life sciences due to their ability to generate nucleotide sequence information with a speed and quantity that is unapproachable with traditional Sanger sequencing. Genome sequencing is a principal application of this technology, where the ultimate goal is the full and complete sequence of the organism of interest. Due to the nature of the raw data produced by these technologies, a full genomic sequence attained without the aid of Sanger sequencing has yet to be demonstrated. We have successfully developed a four-phase strategy for using only next-generation sequencing technologies (Illumina and 454) to assemble a complete microbial genome de novo. We applied this approach to completely assemble the 3.7 Mb genome of a rare Geobacter variant (KN400) that is capable of unprecedented current production at an electrode. Two key components of our strategy enabled us to achieve this result. First, we integrated the two data types early in the process to maximally leverage their complementary characteristics. Second, we used the output of different short read assembly programs so as to leverage the complementary nature of their different underlying algorithms or of their different implementations of the same underlying algorithm. The significance of our result is that it demonstrates a general approach for maximizing the efficiency and success of genome assembly projects as new sequencing technologies and new assembly algorithms are introduced. The general approach is a meta strategy, wherein sequencing data are integrated as early as possible and in particular ways and wherein multiple assembly algorithms are judiciously applied such that the deficiencies in one are complemented by another.

  3. Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a scaffold and the utility of evolutionary divergent genomic reference species

    Directory of Open Access Journals (Sweden)

    Hornett Emily A

    2012-08-01

    Background: How well does RNA-Seq data perform for quantitative whole gene expression analysis in the absence of a genome? This is one unanswered question facing the rapidly growing number of researchers studying non-model species. Using Homo sapiens data and resources, we compared the direct mapping of sequencing reads to predicted genes from the genome with mapping to de novo transcriptomes assembled from RNA-Seq data. Gene coverage and expression analysis was further investigated in the non-model context by using increasingly divergent genomic reference species to group assembled contigs by unique genes. Results: Eight transcriptome sets, composed of varying amounts of Illumina and 454 data, were assembled and assessed. Hybrid 454/Illumina assemblies had the highest transcriptome and individual gene coverage. Quantitative whole gene expression levels were highly similar between using a de novo hybrid assembly and the predicted genes as a scaffold, although mapping to the de novo transcriptome assembly provided data on fewer genes. Using non-target species as reference scaffolds does result in some loss of sequence and expression data, and bias and error increase with evolutionary distance. However, within a 100 million year window these effect sizes are relatively small. Conclusions: Predicted gene sets from sequenced genomes of related species can provide a powerful method for grouping RNA-Seq reads and annotating contigs. Gene expression results can be produced that are similar to results obtained using gene models derived from a high quality genome, though biased towards conserved genes. Our results demonstrate the power and limitations of conducting RNA-Seq in non-model species.

  4. Cas9-assisted recombineering in C. elegans: genome editing using in vivo assembly of linear DNAs

    Science.gov (United States)

    Paix, Alexandre; Schmidt, Helen; Seydoux, Geraldine

    2016-01-01

    Recombineering, the use of endogenous homologous recombination systems to recombine DNA in vivo, is a commonly used technique for genome editing in microbes. Recombineering has not yet been developed for animals, where non-homology-based mechanisms have been thought to dominate DNA repair. Here, we demonstrate, using Caenorhabditis elegans, that linear DNAs with short homologies (∼35 bases) engage in a highly efficient gene conversion mechanism. Linear DNA repair templates with homology to only one side of a double-strand break (DSB) initiate repair efficiently, and short overlaps between templates support template switching. We demonstrate the use of single-stranded, bridging oligonucleotides (ssODNs) to target PCR fragments for repair of DSBs induced by CRISPR/Cas9 on chromosomes. Based on these findings, we develop recombineering strategies for precise genome editing that expand the utility of ssODNs and eliminate in vitro cloning steps for template construction. We apply these methods to the generation of GFP knock-in alleles and gene replacements without co-integrated markers. We conclude that, like microbes, metazoans possess robust homology-dependent repair mechanisms that can be harnessed for recombineering and genome editing. PMID:27257074

  5. Assembling a puzzle of dispersed retrotransposable sequences in the genome of chickpea (Cicer arietinum L.).

    Science.gov (United States)

    Staginnus, C; Desel, C; Schmidt, T; Kahl, G

    2010-12-01

    Several repetitive elements are known to be present in the genome of chickpea (Cicer arietinum L.) including satellite DNA and En/Spm transposons as well as two dispersed, highly repetitive elements, CaRep1 and CaRep2. PCR was used to prove that CaRep1, CaRep2, and previously isolated CaRep3 of C. arietinum represent different segments of a highly repetitive Ty3-gypsy-like retrotransposon (Metaviridae) designated CaRep that makes up large parts of the intercalary heterochromatin. The full sequence of this element including the LTRs and untranslated internal regions was isolated by selective amplification. The restriction pattern of CaRep was different within the annual species of the genus Cicer, suggesting its rearrangement during the evolution of the genus during the last 100 000 years. In addition to CaRep, another LTR and a non-LTR retrotransposon family were isolated, and their restriction patterns and physical localization in the chickpea genome were characterized. The LINE-like element CaLin is only of comparatively low abundance and reveals a considerable heterogeneity. The Ty1-copia-like element (Pseudoviridae) CaTy is located in the distal parts of the intercalary heterochromatin and adjacent euchromatic regions, but it is absent from the centromeric regions. These results, together with earlier findings, allow us to depict the distribution of retroelements on chickpea chromosomes, which extensively resembles the retroelement landscape of the genome of the model legume Medicago truncatula Gaertn.

  6. Comparative transcriptome assembly and genome-guided profiling for Brettanomyces bruxellensis LAMAP2480 during p-coumaric acid stress

    Science.gov (United States)

    Godoy, Liliana; Vera-Wolf, Patricia; Martinez, Claudio; Ugalde, Juan A.; Ganga, María Angélica

    2016-01-01

    Brettanomyces bruxellensis has been described as the main contaminant yeast in wine production, due to its ability to convert the hydroxycinnamic acids naturally present in the grape phenolic derivatives, into volatile phenols. Currently, there are no studies in B. bruxellensis that explain the resistance mechanisms to hydroxycinnamic acids, and in particular to p-coumaric acid, which is directly involved in alterations to wine. In this work, we performed a transcriptome analysis of B. bruxellensis LAMAP2480 grown in the presence and absence of p-coumaric acid during lag phase. Because of reported genetic variability among B. bruxellensis strains, to complement de novo assembly of the transcripts, we used the high-quality genome of B. bruxellensis AWRI1499, as well as the draft genomes of strains CBS2499 and LAMAP2480. The results from the transcriptome analysis allowed us to propose a model in which the entrance of p-coumaric acid to the cell generates a generalized stress condition, in which the expression of proton pump and efflux of toxic compounds are induced. In addition, these mechanisms could be involved in the outflux of nitrogen compounds, such as amino acids, decreasing the overall concentration and triggering the expression of nitrogen metabolism genes. PMID:27678167

  7. Microsatellite loci and the complete mitochondrial DNA sequence characterized through next generation sequencing and de novo genome assembly for the critically endangered orange-bellied parrot, Neophema chrysogaster.

    Science.gov (United States)

    Miller, Adam D; Good, Robert T; Coleman, Rhys A; Lancaster, Melanie L; Weeks, Andrew R

    2013-01-01

    A suite of polymorphic microsatellite markers and the complete mitochondrial genome sequence were developed by next generation sequencing (NGS) for the critically endangered orange-bellied parrot, Neophema chrysogaster. A total of 14 polymorphic loci were identified and characterized using DNA extractions representing 40 individuals from Melaleuca, Tasmania, sampled in 2002. We observed moderate genetic variation across most loci (mean number of alleles per locus = 2.79; mean expected heterozygosity = 0.53) with no evidence of individual loci deviating significantly from Hardy-Weinberg equilibrium. Marker independence was confirmed with tests for linkage disequilibrium, and analyses indicated no evidence of null alleles across loci. De novo and reference-based genome assemblies performed using MIRA were used to assemble the N. chrysogaster mitochondrial genome sequence with mean coverage of 116-fold (range 89 to 142-fold). The mitochondrial genome consists of 18,034 base pairs, and a typical metazoan mitochondrial gene content consisting of 13 protein-coding genes, 2 ribosomal subunit genes, 22 transfer RNAs, and a single large non-coding region (control region). The arrangement of mitochondrial genes is also typical of avian taxa. The annotation of the mitochondrial genome and the characterization of 14 microsatellite markers provide a valuable resource for future genetic monitoring of wild and captive N. chrysogaster populations. As found previously, NGS provides a rapid, low cost and reliable method for polymorphic nuclear genetic marker development and determining complete mitochondrial genome sequences when only a fraction of a genome is sequenced.
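
    The "mean expected heterozygosity" quoted in this record is a per-locus statistic averaged over loci. A minimal sketch of the standard per-locus calculation, H_e = 1 - sum(p_i^2), is given below. The abstract does not say which estimator the authors used, so both the plain form and the small-sample corrected form are shown; the function name and the allele counts are hypothetical and are not data from the study.

```python
def expected_heterozygosity(allele_counts, unbiased=False):
    """Expected heterozygosity at one locus from observed allele counts.

    allele_counts: number of sampled gene copies carrying each allele.
    unbiased: apply the n / (n - 1) small-sample correction.
    """
    n = sum(allele_counts)
    if n < 2:
        raise ValueError("need at least two sampled gene copies")
    # 1 minus the sum of squared allele frequencies
    he = 1.0 - sum((count / n) ** 2 for count in allele_counts)
    if unbiased:
        he *= n / (n - 1)
    return he


# Hypothetical locus: 40 diploid individuals give 80 gene copies, three alleles.
locus_counts = [40, 30, 10]
he = expected_heterozygosity(locus_counts)
he_unbiased = expected_heterozygosity(locus_counts, unbiased=True)
print(f"H_e = {he:.4f}, corrected H_e = {he_unbiased:.4f}")
```

    Averaging this value over all genotyped loci would give a summary comparable in kind to the mean reported in the abstract.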

  8. Assembly and annotation of full mitochondrial genomes for the corn rootworm species, Diabrotica virgifera virgifera and D. barberi (Insecta: Coleoptera: Chrysomelidae), using Next Generation Sequence data

    Science.gov (United States)

    Complete mitochondrial genomes for two corn rootworm species, Diabrotica v. virgifera (16,747 bp) and D. barberi (16,632; Insecta: Coleoptera: Chrysomelidae), were assembled from Illumina HiSeq2000 read data. Annotation indicated that the order and orientation of 13 protein coding genes (PCGs), and...

  9. It's a dirty job--A robust method for the purification and de novo genome assembly of Cryptosporidium from clinical material.

    Science.gov (United States)

    Andersson, Sofia; Sikora, Per; Karlberg, Maria L; Winiecka-Krusnell, Jadwiga; Alm, Erik; Beser, Jessica; Arrighi, Romanico B G

    2015-06-01

    We have developed a novel strategy for the purification of Cryptosporidium oocysts from clinical samples using IMS and PCR amplification of target DNA to facilitate uniform coverage genome sequencing and de novo assembly. Our procedure could also be used for other microbial pathogens from clinical specimens.

  10. ATRX in chromatin assembly and genome architecture during development and disease.

    Science.gov (United States)

    Bérubé, Nathalie G

    2011-10-01

    The regulation of genome architecture is essential for a variety of fundamental cellular phenomena that underlie the complex orchestration of mammalian development. The ATP-dependent chromatin remodeling protein ATRX is emerging as a key regulatory component of nucleosomal dynamics and higher order chromatin conformation. Here we provide an overview of the role of ATRX at chromatin and during development, and discuss recent studies exposing a repertoire of ATRX functions at heterochromatin, in gene regulation, and during mitosis and meiosis. Exciting new progress on several fronts suggests that ATRX operates in histone variant deposition and in the modulation of higher order chromatin structure. Not surprisingly, dysfunction or absence of ATRX protein has devastating consequences on embryonic development and leads to human disease.

  11. Deconstruction of archaeal genome depict strategic consensus in core pathways coding sequence assembly.

    Directory of Open Access Journals (Sweden)

    Ayon Pal

    A comprehensive in silico analysis of 71 species representing the different taxonomic classes and physiological genre of the domain Archaea was performed. These organisms differed in their physiological attributes, particularly oxygen tolerance and energy metabolism. We explored the diversity and similarity in the codon usage pattern in the genes and genomes of these organisms, emphasizing their core cellular pathways. Our thrust was to figure out whether there is any underlying similarity in the design of core pathways within these organisms. Analyses of codon utilization pattern, construction of hierarchical linear models of codon usage, expression pattern and codon pair preference pointed to the fact that, in the archaea, there is a trend towards biased use of synonymous codons in the core cellular pathways and the Nc-plots appeared to display the physiological variations present within the different species. Our analyses revealed that aerobic species of archaea possessed a larger degree of freedom in regulating expression levels than could be accounted for by codon usage bias alone. This feature might be a consequence of their enhanced metabolic activities as a result of their adaptation to the relatively O2-rich environment. Species of archaea, which are related from the taxonomical viewpoint, were found to have striking similarities in their ORF structuring pattern. In the anaerobic species of archaea, codon bias was found to be a major determinant of gene expression. We have also detected a significant difference in the codon pair usage pattern between the whole genome and the genes related to vital cellular pathways, and it was not only species-specific but pathway-specific too. This hints towards the structuring of ORFs with better decoding accuracy during translation. Finally, a codon-pathway interaction in shaping the codon design of pathways was observed where the transcription pathway exhibited a significantly different coding

  12. Updated sesame genome assembly and fine mapping of plant height and seed coat color QTLs using a new high-density genetic map.

    Science.gov (United States)

    Wang, Linhai; Xia, Qiuju; Zhang, Yanxin; Zhu, Xiaodong; Zhu, Xiaofeng; Li, Donghua; Ni, Xuemei; Gao, Yuan; Xiang, Haitao; Wei, Xin; Yu, Jingyin; Quan, Zhiwu; Zhang, Xiurong

    2016-01-05

    Sesame is an important high-quality oil seed crop. The sesame genome was de novo sequenced and assembled in 2014 (version 1.0); however, the number of anchored pseudomolecules was higher than the chromosome number (2n = 2x = 26) due to the lack of a high-density genetic map with 13 linkage groups. We resequenced a permanent population consisting of 430 recombinant inbred lines and constructed a genetic map to improve the sesame genome assembly. We successfully anchored 327 scaffolds onto 13 pseudomolecules. The new genome assembly (version 2.0) included 97.5 % of the scaffolds greater than 150 kb in size present in assembly version 1.0 and increased the total pseudomolecule length from 233.7 to 258.4 Mb with 94.3 % of the genome assembled and 97.2 % of the predicted gene models anchored. Based on the new genome assembly, a bin map including 1,522 bins spanning 1090.99 cM was generated and used to identify 41 quantitative trait loci (QTLs) for sesame plant height and 9 for seed coat color. The plant height-related QTLs explained 3-24 % of the phenotypic variation (mean value, 8 %), and 29 of them were detected in at least two field trials. Two major loci (qPH-8.2 and qPH-3.3) that contributed 23 and 18 % of the plant height were located in 350 and 928-kb spaces on Chr8 and Chr3, respectively. qPH-3.3 is predicted to be responsible for the semi-dwarf sesame plant phenotype and contains 102 candidate genes. This is the first report of a sesame semi-dwarf locus and provides an interesting opportunity for a plant architecture study of the sesame. For the sesame seed coat color, the QTLs of the color spaces L*, a*, and b* were detected with contribution rates of 3-46 %. qSCb-4.1 contributed approximately 39 % of the b* value and was located on Chr4 in a 199.9-kb space. A list of 32 candidate genes for the locus, including a predicted black seed coat-related gene, was determined by screening the newly anchored genome. This study offers a high

  13. Mutations of Conserved Residues in the Major Homology Region Arrest Assembling HIV-1 Gag as a Membrane-Targeted Intermediate Containing Genomic RNA and Cellular Proteins.

    Science.gov (United States)

    Tanaka, Motoko; Robinson, Bridget A; Chutiraka, Kasana; Geary, Clair D; Reed, Jonathan C; Lingappa, Jaisri R

    2015-12-09

    The major homology region (MHR) is a highly conserved motif that is found within the Gag protein of all orthoretroviruses and some retrotransposons. While it is widely accepted that the MHR is critical for assembly of HIV-1 and other retroviruses, how the MHR functions and why it is so highly conserved are not understood. Moreover, consensus is lacking on when HIV-1 MHR residues function during assembly. Here, we first addressed previous conflicting reports by confirming that MHR deletion, like conserved MHR residue substitution, leads to a dramatic reduction in particle production in human and nonhuman primate cells expressing HIV-1 proviruses. Next, we used biochemical analyses and immunoelectron microscopy to demonstrate that conserved residues in the MHR are required after assembling Gag has associated with genomic RNA, recruited critical host factors involved in assembly, and targeted to the plasma membrane. The exact point of inhibition at the plasma membrane differed depending on the specific mutation, with one MHR mutant arrested as a membrane-associated intermediate that is stable upon high-salt treatment and other MHR mutants arrested as labile, membrane-associated intermediates. Finally, we observed the same assembly-defective phenotypes when the MHR deletion or conserved MHR residue substitutions were engineered into Gag from a subtype B, lab-adapted provirus or Gag from a subtype C primary isolate that was codon optimized. Together, our data support a model in which MHR residues act just after membrane targeting, with some MHR residues promoting stability and another promoting multimerization of the membrane-targeted assembling Gag oligomer. The retroviral Gag protein exhibits extensive amino acid sequence variation overall; however, one region of Gag, termed the major homology region, is conserved among all retroviruses and even some yeast retrotransposons, although the reason for this conservation remains poorly understood. 
Highly conserved residues

  14. The Salmonella In Silico Typing Resource (SISTR): An Open Web-Accessible Tool for Rapidly Typing and Subtyping Draft Salmonella Genome Assemblies.

    Directory of Open Access Journals (Sweden)

    Catherine E Yoshida

    For nearly 100 years serotyping has been the gold standard for the identification of Salmonella serovars. Despite the increasing adoption of DNA-based subtyping approaches, serotype information remains a cornerstone in food safety and public health activities aimed at reducing the burden of salmonellosis. At the same time, recent advances in whole-genome sequencing (WGS) promise to revolutionize our ability to perform advanced pathogen characterization in support of improved source attribution and outbreak analysis. We present the Salmonella In Silico Typing Resource (SISTR), a bioinformatics platform for rapidly performing simultaneous in silico analyses for several leading subtyping methods on draft Salmonella genome assemblies. In addition to performing serovar prediction by genoserotyping, this resource integrates sequence-based typing analyses for: Multi-Locus Sequence Typing (MLST), ribosomal MLST (rMLST), and core genome MLST (cgMLST). We show how phylogenetic context from cgMLST analysis can supplement the genoserotyping analysis and increase the accuracy of in silico serovar prediction to over 94.6% on a dataset comprised of 4,188 finished genomes and WGS draft assemblies. In addition to allowing analysis of user-uploaded whole-genome assemblies, the SISTR platform incorporates a database comprising over 4,000 publicly available genomes, allowing users to place their isolates in a broader phylogenetic and epidemiological context. The resource incorporates several metadata-driven visualizations to examine the phylogenetic, geospatial and temporal distribution of genome-sequenced isolates. As sequencing of Salmonella isolates at public health laboratories around the world becomes increasingly common, rapid in silico analysis of minimally processed draft genome assemblies provides a powerful approach for molecular epidemiology in support of public health investigations. Moreover, this type of integrated analysis using multiple sequence

  15. De Novo Assembly of Human Herpes Virus Type 1 (HHV-1) Genome, Mining of Non-Canonical Structures and Detection of Novel Drug-Resistance Mutations Using Short- and Long-Read Next Generation Sequencing Technologies.

    Science.gov (United States)

    Karamitros, Timokratis; Harrison, Ian; Piorkowska, Renata; Katzourakis, Aris; Magiorkinis, Gkikas; Mbisa, Jean Lutamyo

    2016-01-01

    Human herpesvirus type 1 (HHV-1) has a large double-stranded DNA genome of approximately 152 kbp that is structurally complex and GC-rich. This makes the assembly of HHV-1 whole genomes from short-read sequencing data technically challenging. To improve the assembly of HHV-1 genomes we have employed a hybrid genome assembly protocol using data from two sequencing technologies: the short-read Roche 454 and the long-read Oxford Nanopore MinION sequencers. We sequenced 18 HHV-1 cell culture-isolated clinical specimens collected from immunocompromised patients undergoing antiviral therapy. The susceptibility of the samples to several antivirals was determined by plaque reduction assay. Hybrid genome assembly resulted in a decrease in the number of contigs in 6 out of 7 samples and an increase in N(G)50 and N(G)75 of all 7 samples sequenced by both technologies. The approach also enhanced the detection of non-canonical contigs including a rearrangement between the unique (UL) and repeat (T/IRL) sequence regions of one sample that was not detectable by assembly of 454 reads alone. We detected several known and novel resistance-associated mutations in UL23 and UL30 genes. Genome-wide genetic variability ranged from [...]. Assembly of accurate, full-length HHV-1 genomes will be useful in determining genetic determinants of drug resistance, virulence, pathogenesis and viral evolution. The numerous, complex repeat regions of the HHV-1 genome currently remain a barrier towards this goal.
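
    The contiguity metrics named in this record, N(G)50 and N(G)75, follow a simple rule: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches the chosen fraction of either the assembly length (Nx) or an assumed genome length (NGx). The sketch below illustrates that usual definition. The function name `nx` and the contig lengths are hypothetical; only the 152 kbp genome size comes from the abstract, and the authors' own tooling is not described there.

```python
def nx(lengths, fraction=0.5, genome_size=None):
    """Return Nx for a list of contig lengths, or NGx if genome_size is given.

    fraction: 0.5 for N50/NG50, 0.75 for N75/NG75, and so on.
    genome_size: assumed genome length; when None, the assembly total is used.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = genome_size if genome_size is not None else sum(lengths)
    threshold = total * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    # The contigs cover less than the requested fraction of genome_size.
    return 0


# Hypothetical contig lengths (bp) for one sample.
contigs = [60_000, 40_000, 25_000, 15_000, 10_000]
n50 = nx(contigs)
ng50 = nx(contigs, genome_size=152_000)
ng75 = nx(contigs, fraction=0.75, genome_size=152_000)
print(n50, ng50, ng75)
```

    Because NGx is tied to a fixed genome size rather than to each assembly's own total, it is the fairer choice when comparing a 454-only assembly with a hybrid assembly of the same sample.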

  16. De novo sequencing, assembly and analysis of the genome of the laboratory strain Saccharomyces cerevisiae CEN.PK113-7D, a model for modern industrial biotechnology.

    Science.gov (United States)

    Nijkamp, Jurgen F; van den Broek, Marcel; Datema, Erwin; de Kok, Stefan; Bosman, Lizanne; Luttik, Marijke A; Daran-Lapujade, Pascale; Vongsangnak, Wanwipa; Nielsen, Jens; Heijne, Wilbert H M; Klaassen, Paul; Paddon, Chris J; Platt, Darren; Kötter, Peter; van Ham, Roeland C; Reinders, Marcel J T; Pronk, Jack T; de Ridder, Dick; Daran, Jean-Marc

    2012-03-26

    Saccharomyces cerevisiae CEN.PK 113-7D is widely used for metabolic engineering and systems biology research in industry and academia. We sequenced, assembled, annotated and analyzed its genome. Single-nucleotide variations (SNV), insertions/deletions (indels) and differences in genome organization compared to the reference strain S. cerevisiae S288C were analyzed. In addition to a few large deletions and duplications, nearly 3000 indels were identified in the CEN.PK113-7D genome relative to S288C. These differences were overrepresented in genes whose functions are related to transcriptional regulation and chromatin remodelling. Some of these variations were caused by unstable tandem repeats, suggesting an innate evolvability of the corresponding genes. Besides a previously characterized mutation in adenylate cyclase, the CEN.PK113-7D genome sequence revealed a significant enrichment of non-synonymous mutations in genes encoding for components of the cAMP signalling pathway. Some phenotypic characteristics of the CEN.PK113-7D strains were explained by the presence of additional specific metabolic genes relative to S288C. In particular, the presence of the BIO1 and BIO6 genes correlated with a biotin prototrophy of CEN.PK113-7D. Furthermore, the copy number, chromosomal location and sequences of the MAL loci were resolved. The assembled sequence reveals that CEN.PK113-7D has a mosaic genome that combines characteristics of laboratory strains and wild-industrial strains.

  17. Assembly and initial characterization of a panel of 85 genomically validated cell lines from diverse head and neck tumor sites.

    Science.gov (United States)

    Zhao, Mei; Sano, Daisuke; Pickering, Curtis R; Jasser, Samar A; Henderson, Ying C; Clayman, Gary L; Sturgis, Erich M; Ow, Thomas J; Lotan, Reuben; Carey, Thomas E; Sacks, Peter G; Grandis, Jennifer R; Sidransky, David; Heldin, Nils Erik; Myers, Jeffrey N

    2011-12-01

    Human cell lines are useful for studying cancer biology and preclinically modeling cancer therapy, but can be misidentified and cross-contamination is unfortunately common. The purpose of this study was to develop a panel of validated head and neck cell lines representing the spectrum of tissue sites and histologies that could be used for studying the molecular, genetic, and phenotypic diversity of head and neck cancer. A panel of 122 clinically and phenotypically diverse head and neck cell lines from head and neck squamous cell carcinoma, thyroid cancer, cutaneous squamous cell carcinoma, adenoid cystic carcinoma, oral leukoplakia, immortalized primary keratinocytes, and normal epithelium was assembled from the collections of several individuals and institutions. Authenticity was verified by carrying out short tandem repeat analysis. Human papillomavirus (HPV) status and cell morphology were also determined. Eighty-five of the 122 cell lines had unique genetic profiles. HPV-16 DNA was detected in 2 cell lines. These 85 cell lines included cell lines from the major head and neck primary tumor sites, and close examination shows a wide range of in vitro phenotypes. This panel of 85 genomically validated head and neck cell lines represents a valuable resource for the head and neck cancer research community that can help advance understanding of the disease by providing a standard reference for cell lines that can be used for biological as well as preclinical studies. ©2011 AACR.

  18. Single-molecule sequencing and Hi-C-based proximity-guided assembly of amaranth (Amaranthus hypochondriacus) chromosomes provide insights into genome evolution

    KAUST Repository

    Lightfoot, D. J.

    2017-08-29

    Background: Amaranth (Amaranthus hypochondriacus) was a food staple among the ancient civilizations of Central and South America that has recently received increased attention due to the high nutritional value of the seeds, with the potential to help alleviate malnutrition and food security concerns, particularly in arid and semiarid regions of the developing world. Here, we present a reference-quality assembly of the amaranth genome which will assist the agronomic development of the species.

  19. Optimizing Hybrid de Novo Transcriptome Assembly and Extending Genomic Resources for Giant Freshwater Prawns (Macrobrachium rosenbergii): The Identification of Genes and Markers Associated with Reproduction.

    Science.gov (United States)

    Jung, Hyungtaek; Yoon, Byung-Ha; Kim, Woo-Jin; Kim, Dong-Wook; Hurwood, David A; Lyons, Russell E; Salin, Krishna R; Kim, Heui-Soo; Baek, Ilseon; Chand, Vincent; Mather, Peter B

    2016-05-07

    The giant freshwater prawn, Macrobrachium rosenbergii, a sexually dimorphic decapod crustacean is currently the world's most economically important cultured freshwater crustacean species. Despite its economic importance, there is currently a lack of genomic resources available for this species, and this has limited exploration of the molecular mechanisms that control the M. rosenbergii sex-differentiation system more widely in freshwater prawns. Here, we present the first hybrid transcriptome from M. rosenbergii applying RNA-Seq technologies directed at identifying genes that have potential functional roles in reproductive-related traits. A total of 13,733,210 combined raw reads (1720 Mbp) were obtained from Ion-Torrent PGM and 454 FLX. Bioinformatic analyses based on three state-of-the-art assemblers, the CLC Genomic Workbench, Trans-ABySS, and Trinity, that use single and multiple k-mer methods respectively, were used to analyse the data. The influence of multiple k-mers on assembly performance was assessed to gain insight into transcriptome assembly from short reads. After optimisation, de novo assembly resulted in 44,407 contigs with a mean length of 437 bp, and the assembled transcripts were further functionally annotated to detect single nucleotide polymorphisms and simple sequence repeat motifs. Gene expression analysis was also used to compare expression patterns from ovary and testis tissue libraries to identify genes with potential roles in reproduction and sex differentiation. The large transcript set assembled here represents the most comprehensive set of transcriptomic resources ever developed for reproduction traits in M. rosenbergii, and the large number of genetic markers predicted should constitute an invaluable resource for future genetic research studies on M. rosenbergii and can be applied more widely on other freshwater prawn species in the genus Macrobrachium.

  20. Optimizing Hybrid de Novo Transcriptome Assembly and Extending Genomic Resources for Giant Freshwater Prawns (Macrobrachium rosenbergii): The Identification of Genes and Markers Associated with Reproduction

    Directory of Open Access Journals (Sweden)

    Hyungtaek Jung

    2016-05-01

    The giant freshwater prawn, Macrobrachium rosenbergii, a sexually dimorphic decapod crustacean is currently the world’s most economically important cultured freshwater crustacean species. Despite its economic importance, there is currently a lack of genomic resources available for this species, and this has limited exploration of the molecular mechanisms that control the M. rosenbergii sex-differentiation system more widely in freshwater prawns. Here, we present the first hybrid transcriptome from M. rosenbergii applying RNA-Seq technologies directed at identifying genes that have potential functional roles in reproductive-related traits. A total of 13,733,210 combined raw reads (1720 Mbp) were obtained from Ion-Torrent PGM and 454 FLX. Bioinformatic analyses based on three state-of-the-art assemblers, the CLC Genomic Workbench, Trans-ABySS, and Trinity, that use single and multiple k-mer methods respectively, were used to analyse the data. The influence of multiple k-mers on assembly performance was assessed to gain insight into transcriptome assembly from short reads. After optimisation, de novo assembly resulted in 44,407 contigs with a mean length of 437 bp, and the assembled transcripts were further functionally annotated to detect single nucleotide polymorphisms and simple sequence repeat motifs. Gene expression analysis was also used to compare expression patterns from ovary and testis tissue libraries to identify genes with potential roles in reproduction and sex differentiation. The large transcript set assembled here represents the most comprehensive set of transcriptomic resources ever developed for reproduction traits in M. rosenbergii, and the large number of genetic markers predicted should constitute an invaluable resource for future genetic research studies on M. rosenbergii and can be applied more widely on other freshwater prawn species in the genus Macrobrachium.

  1. Estimating the population mutation rate from a de novo assembled Bactrian camel genome and cross-species comparison with dromedary ESTs.

    Science.gov (United States)

    Burger, Pamela A; Palmieri, Nicola

    2014-01-01

    The Bactrian camel (Camelus bactrianus) and the dromedary (Camelus dromedarius) are among the last species to have been domesticated, around 3000-6000 years ago. During domestication, strong artificial (anthropogenic) selection has shaped the livestock, creating a huge number of phenotypes and breeds. Hence, domestic animals represent a unique resource to understand the genetic basis of phenotypic variation and adaptation. Similar to its late domestication history, the Bactrian camel is also among the last livestock animals to have its genome sequenced and deciphered. As no genomic data have been available until recently, we generated a de novo assembly by shotgun sequencing of a single male Bactrian camel. We obtained 1.6 Gb genomic sequences, which correspond to more than half of the Bactrian camel's genome. The aim of this study was to identify heterozygous single-nucleotide polymorphisms (SNPs) and to estimate population parameters and nucleotide diversity based on an individual camel. With an average 6.6-fold coverage, we detected over 116 000 heterozygous SNPs and recorded a genome-wide nucleotide diversity similar to that of other domesticated ungulates. More than 20 000 (85%) dromedary expressed sequence tags successfully aligned to our genomic draft. Our results provide a template for future association studies targeting economically relevant traits and to identify changes underlying the process of camel domestication and environmental adaptation.

  2. Combined Antiviral Therapy Using Designed Molecular Scaffolds Targeting Two Distinct Viral Functions, HIV-1 Genome Integration and Capsid Assembly.

    Science.gov (United States)

    Khamaikawin, Wannisa; Saoin, Somphot; Nangola, Sawitree; Chupradit, Koollawat; Sakkhachornphop, Supachai; Hadpech, Sudarat; Onlamoon, Nattawat; Ansari, Aftab A; Byrareddy, Siddappa N; Boulanger, Pierre; Hong, Saw-See; Torbett, Bruce E; Tayapiwatana, Chatchai

    2015-08-25

    Designed molecular scaffolds have been proposed as alternative therapeutic agents against HIV-1. The ankyrin repeat protein (Ank(GAG)1D4) and the zinc finger protein (2LTRZFP) have recently been characterized as intracellular antivirals, but these molecules, used individually, do not completely block HIV-1 replication and propagation. The capsid-binder Ank(GAG)1D4, which inhibits HIV-1 assembly, does not prevent the genome integration of newly incoming viruses. 2LTRZFP, designed to target the 2-LTR-circle junction of HIV-1 cDNA and block HIV-1 integration, would have no antiviral effect on HIV-1-infected cells. However, simultaneous expression of these two molecules should combine the advantage of preventive and curative treatments. To test this hypothesis, the genes encoding the N-myristoylated Myr(+)Ank(GAG)1D4 protein and the 2LTRZFP were introduced into human T-cells, using a third-generation lentiviral vector. SupT1 cells stably expressing 2LTRZFP alone or with Myr(+)Ank(GAG)1D4 showed a complete resistance to HIV-1 in viral challenge. Administration of the Myr(+)Ank(GAG)1D4 vector to HIV-1-preinfected SupT1 cells resulted in a significant antiviral effect. Resistance to viral infection was also observed in primary human CD4+ T-cells stably expressing Myr(+)Ank(GAG)1D4, and challenged with HIV-1, SIVmac, or SHIV. Our data suggest that our two anti-HIV-1 molecular scaffold prototypes are promising antiviral agents for anti-HIV-1 gene therapy.

  3. Comparative analysis of the mosaic genomes of tailed archaeal viruses and proviruses suggests common themes for virion architecture and assembly with tailed viruses of bacteria.

    Science.gov (United States)

    Krupovic, Mart; Forterre, Patrick; Bamford, Dennis H

    2010-03-19

    Tailed double-stranded DNA viruses (order Caudovirales) represent the dominant morphotype among viruses infecting bacteria. Analysis and comparison of complete genome sequences of tailed bacterial viruses provided insights into their origin and evolution. Structural and genomic studies have unexpectedly revealed that tailed bacterial viruses are evolutionarily related to eukaryotic herpesviruses. Organisms from the third domain of life, Archaea, are also infected by viruses that, in their overall morphology, resemble tailed viruses of bacteria. However, high-resolution structural information is currently unavailable for any of these viruses, and only a few complete genomes have been sequenced so far. Here we identified nine proviruses that are clearly related to tailed bacterial viruses and integrated into chromosomes of species belonging to four different taxonomic orders of the Archaea. This more than doubled the number of genome sequences available for comparative studies. Our analyses indicate that highly mosaic tailed archaeal virus genomes evolve by homologous and illegitimate recombination with genomes of other viruses, by diversification, and by acquisition of cellular genes. Comparative genomics of these viruses and related proviruses revealed a set of conserved genes encoding putative proteins similar to virion assembly and maturation, as well as genome packaging proteins of tailed bacterial viruses and herpesviruses. Furthermore, fold prediction and structural modeling experiments suggest that the major capsid proteins of tailed archaeal viruses adopt the same topology as the corresponding proteins of tailed bacterial viruses and eukaryotic herpesviruses. Data presented in this study strongly support the hypothesis that tailed viruses infecting archaea share a common ancestry with tailed bacterial viruses and herpesviruses.

  4. Comparing genome guided assembly and phased variants based assembly approach to separate the homoeolog transcripts in tetraploid peanut (Arachis hypogaea L.)

    Science.gov (United States)

    Homoeologous copies of transcripts are abundant in many self-pollinating species including tetraploid peanut, and can impose a challenge to build a transcriptome reference without the merging of homoeologs. De novo transcriptome assembly of tetraploid OLin with single kmer and multiple kmer approach...

  5. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly

    NARCIS (Netherlands)

    Scheinin, I.; Sie, D.; Bengtsson, H.; Wiel, M.A. van de; Olshen, A.B.; Thuijl, H.F. van; Essen, H.F. van; Eijk, P.P.; Rustenburg, F.; Meijer, G.A.; Reijneveld, J.C.; Wesseling, P.; Pinkel, D.; Albertson, D.G.; Ylstra, B.

    2014-01-01

    Detection of DNA copy number aberrations by shallow whole-genome sequencing (WGS) faces many challenges, including lack of completion and errors in the human reference genome, repetitive sequences, polymorphisms, variable sample quality, and biases in the sequencing procedures. Formalin-fixed paraff

  6. De novo sequencing, assembly and analysis of the genome of the laboratory strain Saccharomyces cerevisiae CEN.PK113-7D, a model for modern industrial biotechnology

    Directory of Open Access Journals (Sweden)

    Nijkamp Jurgen F

    2012-03-01

    Saccharomyces cerevisiae CEN.PK113-7D is widely used for metabolic engineering and systems biology research in industry and academia. We sequenced, assembled, annotated and analyzed its genome. Single-nucleotide variations (SNV), insertions/deletions (indels) and differences in genome organization compared to the reference strain S. cerevisiae S288C were analyzed. In addition to a few large deletions and duplications, nearly 3000 indels were identified in the CEN.PK113-7D genome relative to S288C. These differences were overrepresented in genes whose functions are related to transcriptional regulation and chromatin remodelling. Some of these variations were caused by unstable tandem repeats, suggesting an innate evolvability of the corresponding genes. Besides a previously characterized mutation in adenylate cyclase, the CEN.PK113-7D genome sequence revealed a significant enrichment of non-synonymous mutations in genes encoding components of the cAMP signalling pathway. Some phenotypic characteristics of the CEN.PK113-7D strains were explained by the presence of additional specific metabolic genes relative to S288C. In particular, the presence of the BIO1 and BIO6 genes correlated with a biotin prototrophy of CEN.PK113-7D. Furthermore, the copy number, chromosomal location and sequences of the MAL loci were resolved. The assembled sequence reveals that CEN.PK113-7D has a mosaic genome that combines characteristics of laboratory strains and wild-industrial strains.

  7. Extending reference assembly models

    DEFF Research Database (Denmark)

    Church, Deanna M.; Schneider, Valerie A.; Steinberg, Karyn Meltz

    2015-01-01

    The human genome reference assembly is crucial for aligning and analyzing sequence data, and for genome annotation, among other roles. However, the models and analysis assumptions that underlie the current assembly need revising to fully represent human sequence diversity. Improved analysis tools...

  9. Assembly of 500,000 inter-specific catfish expressed sequence tags and large scale gene-associated marker development for whole genome association studies

    Energy Technology Data Exchange (ETDEWEB)

    Catfish Genome Consortium; Wang, Shaolin; Peatman, Eric; Abernathy, Jason; Waldbieser, Geoff; Lindquist, Erika; Richardson, Paul; Lucas, Susan; Wang, Mei; Li, Ping; Thimmapuram, Jyothi; Liu, Lei; Vullaganti, Deepika; Kucuktas, Huseyin; Murdock, Christopher; Small, Brian C; Wilson, Melanie; Liu, Hong; Jiang, Yanliang; Lee, Yoona; Chen, Fei; Lu, Jianguo; Wang, Wenqi; Xu, Peng; Somridhivej, Benjaporn; Baoprasertkul, Puttharat; Quilang, Jonas; Sha, Zhenxia; Bao, Baolong; Wang, Yaping; Wang, Qun; Takano, Tomokazu; Nandi, Samiran; Liu, Shikai; Wong, Lilian; Kaltenboeck, Ludmilla; Quiniou, Sylvie; Bengten, Eva; Miller, Norman; Trant, John; Rokhsar, Daniel; Liu, Zhanjiang

    2010-03-23

    Background: Through the Community Sequencing Program, a catfish EST sequencing project was carried out through a collaboration between the catfish research community and the Department of Energy's Joint Genome Institute. Prior to this project, only a limited EST resource from catfish was available for the purpose of SNP identification. Results: A total of 438,321 quality ESTs were generated from 8 channel catfish (Ictalurus punctatus) and 4 blue catfish (Ictalurus furcatus) libraries, bringing the number of catfish ESTs to nearly 500,000. Assembly of all catfish ESTs resulted in 45,306 contigs and 66,272 singletons. Over 35% of the unique sequences had significant similarities to known genes, allowing the identification of 14,776 unique genes in catfish. Over 300,000 putative SNPs have been identified, of which approximately 48,000 are high-quality SNPs identified from contigs with at least four sequences and with the minor allele present in at least two sequences in the contig. The EST resource should be valuable for identification of microsatellites, genome annotation, large-scale expression analysis, and comparative genome analysis. Conclusions: This project generated a large EST resource for catfish that captured the majority of the catfish transcriptome. The parallel analysis of ESTs from two closely related Ictalurid catfishes should also provide a powerful means for the evaluation of ancient and recent gene duplications, and for the development of high-density microarrays in catfish. The inter- and intra-specific SNPs identified from the assembly of the full catfish EST dataset will greatly benefit the catfish introgression breeding program and whole genome association studies.

  10. Herbarium genomics

    DEFF Research Database (Denmark)

    Bakker, Freek T.; Lei, Di; Yu, Jiaying

    2016-01-01

    Herbarium genomics is proving promising as next-generation sequencing approaches are well suited to deal with the usually fragmented nature of archival DNA. We show that routine assembly of partial plastome sequences from herbarium specimens is feasible, from total DNA extracts and with specimens...... up to 146 years old. We use genome skimming and an automated assembly pipeline, Iterative Organelle Genome Assembly, that assembles paired-end reads into a series of candidate assemblies, the best one of which is selected based on likelihood estimation. We used 93 specimens from 12 different...... correlation between plastome coverage and nuclear genome size (C value) in our samples, but the range of C values included is limited. Finally, we conclude that routine plastome sequencing from herbarium specimens is feasible and cost-effective (compared with Sanger sequencing or plastome...

  11. Self-assembled random arrays: high-performance imaging and genomics applications on a high-density microarray platform

    Science.gov (United States)

    Barker, David L.; Theriault, Greg; Che, Diping; Dickinson, Todd; Shen, Richard; Kain, Robert C.

    2003-07-01

    Illumina is developing a BeadArray™ technology that supports SNP genotyping, mRNA expression analysis and protein expression analysis on the same platform. We use fiber-optic bundles with a density of approximately 40,000 fibers/mm². At the end of each fiber, a derivatized silica bead forms an array element for reading out a genotyping or expression assay data point. Each bead contains oligonucleotide probes that hybridize with high specificity to complementary sequences in a complex nucleic acid mixture. We derivatize the beads in bulk, pool them to form a quality-controlled source of microarray elements, and allow them to assemble spontaneously into pits etched into the end of each optical fiber bundle. We load our fiber bundles, containing 49,777 fibers, with up to 1520 different bead types. The presence of many beads of each type greatly improves the accuracy of each assay. As the final step in our manufacturing process, we decode the identity of each bead by a series of rapid hybridizations with fluorescent oligos. Decoding accuracy and the number of beads of each type are recorded for each array. Decoding also serves as a quality control procedure for the performance of each element in the array. To facilitate high-throughput analysis of many samples, the fiber bundles are arranged in an array matrix (Sentrix™ arrays). Using a 96-bundle array matrix, up to 1520 assays can be performed on each of 96 samples simultaneously for a total of 145,920 assays. Using a 384-bundle array matrix, up to 583,680 assays can be performed simultaneously. The BeadArray platform is the highest density microarray in commercial use, requiring development of a high-performance array scanner. To meet this need, we developed the Sherlock™ system, a laser-scanning confocal imaging system that automatically scans all 96 bundles of an array matrix at variable resolution down to 0.8 micron. The system scans with both 532 and 635 nm lasers simultaneously, collecting two fluorescence
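
    The throughput totals quoted in this record follow from multiplying bundles by bead types. A minimal Python sketch of that arithmetic is given here; the function and constant names are illustrative and not from the record, and the replicate estimate assumes every fiber holds exactly one bead, which is an idealisation.

```python
# Figures quoted in the abstract; the names are illustrative only.
BEAD_TYPES_PER_BUNDLE = 1520
FIBERS_PER_BUNDLE = 49_777


def total_assays(bundles: int, bead_types: int = BEAD_TYPES_PER_BUNDLE) -> int:
    """Assays per array matrix: one assay per bead type in each bundle."""
    return bundles * bead_types


def mean_replicates(fibers: int = FIBERS_PER_BUNDLE,
                    bead_types: int = BEAD_TYPES_PER_BUNDLE) -> float:
    """Average beads per type, assuming (hypothetically) one bead per fiber."""
    return fibers / bead_types


print(total_assays(96))             # 145920
print(total_assays(384))            # 583680
print(round(mean_replicates(), 1))  # 32.7
```

    Under that one-bead-per-fiber assumption, each bead type would be represented roughly 30 times per bundle, which is in line with the abstract's remark that many beads of each type improve assay accuracy.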

  12. De Novo Assembly of Coding Sequences of the Mangrove Palm (Nypa fruticans) Using RNA-Seq and Discovery of Whole-Genome Duplications in the Ancestor of Palms.

    Directory of Open Access Journals (Sweden)

    Ziwen He

    Nypa fruticans (Arecaceae) is the only monocot species of true mangroves. This species represents the earliest mangrove fossil recorded. How N. fruticans adapts to the harsh and unstable intertidal zone is an interesting question. However, the 60 gene segments deposited in NCBI are insufficient for solving this question. In this study, we sequenced, assembled and annotated the transcriptome of N. fruticans using next-generation sequencing technology. A total of 19,918,800 clean paired-end reads were de novo assembled into 45,368 unigenes with an N50 length of 1,096 bp. A total of 41.35% of unigenes were functionally annotated using Blast2GO. Many genes annotated to "response to stress" and 15 putative positively selected genes were identified. Simple sequence repeats were identified and compared with other palms. The divergence time between N. fruticans and other palms was estimated at 75 million years ago using the genomic data, which is consistent with the fossil record. After calculating the synonymous substitution rate between paralogs, we found that two whole-genome duplication events were shared by N. fruticans and other palms. These duplication events provided a large amount of raw material for the more than 2,000 later speciation events in Arecaceae. This study provides a high quality resource for further functional and evolutionary studies of N. fruticans and palms in general.

  13. Multiplexed next-generation sequencing and de novo assembly to obtain near full-length HIV-1 genome from plasma virus.

    Science.gov (United States)

    Aralaguppe, Shambhu G; Siddik, Abu Bakar; Manickam, Ashokkumar; Ambikan, Anoop T; Kumar, Milner M; Fernandes, Sunjay Jude; Amogne, Wondwossen; Bangaruswamy, Dhinoth K; Hanna, Luke Elizabeth; Sonnerborg, Anders; Neogi, Ujjwal

    2016-10-01

    Analysing the HIV-1 near full-length genome (HIV-NFLG) provides new insight into the diversity of virus population dynamics at individual or population level. In this study we developed a simple but high-throughput next generation sequencing (NGS) protocol for HIV-NFLG using clinical specimens and validated the method against an external quality control (EQC) panel. Clinical specimens (n=105) were obtained from three cohorts from two highly conserved HIV-1C epidemics (India and Ethiopia) and one diverse epidemic (Sweden). Additionally an EQC panel (n=10) was used to validate the protocol. HIV-NFLG was performed by amplifying the HIV genome (gag-to-nef) in two fragments. NGS was performed using the Illumina HiSeq2500 after multiplexing 24 samples, followed by de novo assembly in Iterative Virus Assembler or VICUNA. Subtyping was carried out using several bioinformatics tools. Amplification of HIV-NFLG had a 90% (95/105) success rate in clinical specimens. NGS was successful in all clinical specimens (n=45) and EQA samples (n=10) attempted. The mean error for mutations for the EQC panel viruses was <1%. Subtyping identified two as A1C recombinant. Our results demonstrate the feasibility of a simple NGS-based HIV-NFLG that can potentially be used in the molecular surveillance for effective identification of subtypes and transmission clusters for operational public health intervention.

  14. De Novo Assembly of Coding Sequences of the Mangrove Palm (Nypa fruticans) Using RNA-Seq and Discovery of Whole-Genome Duplications in the Ancestor of Palms.

    Science.gov (United States)

    He, Ziwen; Zhang, Zhang; Guo, Wuxia; Zhang, Ying; Zhou, Renchao; Shi, Suhua

    2015-01-01

    Nypa fruticans (Arecaceae) is the only monocot species of true mangroves. This species represents the earliest mangrove fossil recorded. How N. fruticans adapts to the harsh and unstable intertidal zone is an interesting question. However, the 60 gene segments deposited in NCBI are insufficient for solving this question. In this study, we sequenced, assembled and annotated the transcriptome of N. fruticans using next-generation sequencing technology. A total of 19,918,800 clean paired-end reads were de novo assembled into 45,368 unigenes with an N50 length of 1,096 bp. A total of 41.35% of unigenes were functionally annotated using Blast2GO. Many genes annotated to "response to stress" and 15 putative positively selected genes were identified. Simple sequence repeats were identified and compared with other palms. The divergence time between N. fruticans and other palms was estimated at 75 million years ago using the genomic data, which is consistent with the fossil record. After calculating the synonymous substitution rate between paralogs, we found that two whole-genome duplication events were shared by N. fruticans and other palms. These duplication events provided a large amount of raw material for the more than 2,000 later speciation events in Arecaceae. This study provides a high quality resource for further functional and evolutionary studies of N. fruticans and palms in general.

  15. De novo genome assembly and annotation of Australia's largest freshwater fish, the Murray cod (Maccullochella peelii), from Illumina and Nanopore sequencing read.

    Science.gov (United States)

    Austin, Christopher M; Tan, Mun Hua; Harrisson, Katherine A; Lee, Yin Peng; Croft, Laurence J; Sunnucks, Paul; Pavlova, Alexandra; Gan, Han Ming

    2017-08-01

    One of the most iconic Australian fish is the Murray cod, Maccullochella peelii (Mitchell 1838), a freshwater species that can grow to ∼1.8 metres in length and live to age ≥48 years. The Murray cod is of conservation concern as a result of strong population contractions, but it is also popular for recreational fishing and is of growing aquaculture interest. In this study, we report the whole genome sequence of the Murray cod to support ongoing population genetics, conservation, and management research, as well as to better understand the evolutionary ecology and history of the species. A draft Murray cod genome of 633 Mbp (N50 = 109 974 bp; BUSCO and CEGMA completeness of 94.2% and 91.9%, respectively) with an estimated 148 Mbp of putative repetitive sequences was assembled from the combined sequencing data of 2 fish individuals with an identical maternal lineage; 47.2 Gb of Illumina HiSeq data and 804 Mb of Nanopore data were generated from the first individual while 23.2 Gb of Illumina MiSeq data were generated from the second individual. The inclusion of Nanopore reads for scaffolding followed by subsequent gap-closing using Illumina data led to a 29% reduction in the number of scaffolds and a 55% and 54% increase in the scaffold and contig N50, respectively. We also report the first transcriptome of Murray cod that was subsequently used to annotate the Murray cod genome, leading to the identification of 26 539 protein-coding genes. We present the whole genome of the Murray cod and anticipate this will be a catalyst for a range of genetic, genomic, and phylogenetic studies of the Murray cod and more generally other fish species of the Percichthyidae family. © The Authors 2017. Published by Oxford University Press.
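
    Several records in this listing report N50 as a contiguity metric. As a reminder of what the number means, here is a minimal Python sketch using the common definition (the length L such that sequences of length L or longer cover at least half of the assembly). The toy length lists are invented for illustration and are not data from any record.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    Sort lengths from longest to shortest and accumulate them; the N50 is
    the length at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Invented toy assemblies with the same total length (160).
before_scaffolding = [50, 40, 30, 20, 10, 10]
after_scaffolding = [90, 30, 30, 10]   # some pieces joined into longer ones

print(n50(before_scaffolding))  # 40
print(n50(after_scaffolding))   # 90
```

    Joining sequences leaves the total length unchanged but reduces the count and raises the N50, which is the pattern the abstract describes after adding Nanopore reads for scaffolding.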

  16. Whole genome sequences of the USMARC beef cattle diversity panel v2.9 aligned to the bovine reference genome assembly

    Science.gov (United States)

    A searchable and publicly viewable set of mapped genomes from 96 beef sires from 19 popular breeds of U.S. cattle was created. These sires, with minimal pedigree relationships, represent >99% of the germplasm used in the US beef industry circa 2000. The group is estimated to contain more than 187 u...

  17. The evolution of the natural killer complex; a comparison between mammals using new high-quality genome assemblies and targeted annotation.

    Science.gov (United States)

    Schwartz, John C; Gibson, Mark S; Heimeier, Dorothea; Koren, Sergey; Phillippy, Adam M; Bickhart, Derek M; Smith, Timothy P L; Medrano, Juan F; Hammond, John A

    2017-04-01

    Natural killer (NK) cells are a diverse population of lymphocytes with a range of biological roles including essential immune functions. NK cell diversity is in part created by the differential expression of cell surface receptors which modulate activation and function, including multiple subfamilies of C-type lectin receptors encoded within the NK complex (NKC). Little is known about the gene content of the NKC beyond rodent and primate lineages, other than it appears to be extremely variable between mammalian groups. We compared the NKC structure between mammalian species using new high-quality draft genome assemblies for cattle and goat; re-annotated sheep, pig, and horse genome assemblies; and the published human, rat, and mouse lemur NKC. The major NKC genes are largely in the equivalent positions in all eight species, with significant independent expansions and deletions between species, allowing us to propose a model for NKC evolution during mammalian radiation. The ruminant species, cattle and goats, have independently evolved a second KLRC locus flanked by KLRA and KLRJ, and a novel KLRH-like gene has acquired an activating tail. This novel gene has duplicated several times within cattle, while other activating receptor genes have been selectively disrupted. Targeted genome enrichment in cattle identified varying levels of allelic polymorphism between the NKC genes concentrated in the predicted extracellular ligand-binding domains. This novel recombination and allelic polymorphism is consistent with NKC evolution under balancing selection, suggesting that this diversity influences individual immune responses and may impact on differential outcomes of pathogen infection and vaccination.

  18. The Chloroplast Genome of Passiflora edulis (Passifloraceae) Assembled from Long Sequence Reads: Structural Organization and Phylogenomic Studies in Malpighiales

    Science.gov (United States)

    Cauz-Santos, Luiz A.; Munhoz, Carla F.; Rodde, Nathalie; Cauet, Stephane; Santos, Anselmo A.; Penha, Helen A.; Dornelas, Marcelo C.; Varani, Alessandro M.; Oliveira, Giancarlo C. X.; Bergès, Hélène; Vieira, Maria Lucia C.

    2017-01-01

    The family Passifloraceae consists of some 700 species classified in around 16 genera. Almost all its members belong to the genus Passiflora. In Brazil, the yellow passion fruit (Passiflora edulis) is of considerable economic importance, both for juice production and consumption as fresh fruit. The availability of chloroplast genomes (cp genomes) and their sequence comparisons has led to a better understanding of the evolutionary relationships within plant taxa. In this study, we obtained the complete nucleotide sequence of the P. edulis chloroplast genome, the first entirely sequenced in the Passifloraceae family. We determined its structure and organization, and also performed phylogenomic studies on the order Malpighiales and the Fabids clade. The P. edulis chloroplast genome is characterized by the presence of two copies of an inverted repeat sequence (IRA and IRB) of 26,154 bp, each separating a small single copy region of 13,378 bp and a large single copy (LSC) region of 85,720 bp. The annotation resulted in the identification of 105 unique genes, including 30 tRNAs, 4 rRNAs, and 71 protein coding genes. Also, 36 repetitive elements and 85 SSRs (microsatellites) were identified. The structure of the complete cp genome of P. edulis differs from that of other species because of rearrangement events detected by means of a comparison based on 22 members of the Malpighiales. The rearrangements were three inversions of 46,151, 3,765 and 1,631 bp, located in the LSC region. Phylogenomic analysis resulted in strongly supported trees, but this could also be a consequence of the limited taxonomic sampling used. Our results have provided a better understanding of the evolutionary relationships in the Malpighiales and the Fabids, confirming the potential of complete chloroplast genome sequences in inferring evolutionary relationships and the utility of long sequence reads for generating very accurate biological information. PMID:28344587
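
    The region sizes quoted in this record can be combined into a total plastome length, assuming the usual quadripartite layout (one large single copy region, one small single copy region and two inverted repeat copies). A minimal Python sketch follows; the function and variable names are illustrative, and the total is derived here rather than quoted from the record.

```python
def plastome_length(lsc: int, ssc: int, ir: int) -> int:
    """Total length of a quadripartite plastome: LSC + SSC + two IR copies."""
    return lsc + ssc + 2 * ir


# Region sizes in bp, as quoted in the abstract.
LSC_BP = 85_720
SSC_BP = 13_378
IR_BP = 26_154

total_bp = plastome_length(LSC_BP, SSC_BP, IR_BP)
ir_share = 100 * 2 * IR_BP / total_bp

print(total_bp)            # 151406
print(round(ir_share, 1))  # 34.5
```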

  19. Targeted isolation, sequence assembly and characterization of two white spruce (Picea glauca) BAC clones for terpenoid synthase and cytochrome P450 genes involved in conifer defence reveal insights into a conifer genome

    Directory of Open Access Journals (Sweden)

    Ritland Carol

    2009-08-01

    Background: Conifers are a large group of gymnosperm trees which are separated from the angiosperms by more than 300 million years of independent evolution. Conifer genomes are extremely large and contain considerable amounts of repetitive DNA. Currently, conifer sequence resources exist predominantly as expressed sequence tags (ESTs) and full-length (FL)cDNAs. There is no genome sequence available for a conifer or any other gymnosperm. Conifer defence-related genes often group into large families with closely related members. The goals of this study are to assess the feasibility of targeted isolation and sequence assembly of conifer BAC clones containing specific genes from two large gene families, and to characterize large segments of genomic DNA sequence for the first time from a conifer. Results: We used a PCR-based approach to identify BAC clones for two target genes, a terpene synthase (3-carene synthase; 3CAR) and a cytochrome P450 (CYP720B4), from a non-arrayed genomic BAC library of white spruce (Picea glauca). Shotgun genomic fragments isolated from the BAC clones were sequenced to a depth of 15.6- and 16.0-fold coverage, respectively. Assembly and manual curation yielded sequence scaffolds of 172 kbp (3CAR) and 94 kbp (CYP720B4) long. Inspection of the genomic sequences revealed the intron-exon structures, the putative promoter regions and putative cis-regulatory elements of these genes. Sequences related to transposable elements (TEs), high complexity repeats and simple repeats were prevalent and comprised approximately 40% of the sequenced genomic DNA. An in silico simulation of the effect of sequencing depth on the quality of the sequence assembly provides direction for future efforts of conifer genome sequencing. Conclusion: We report the first targeted cloning, sequencing, assembly, and annotation of large segments of genomic DNA from a conifer. We demonstrate that genomic BAC clones for individual members of multi-member gene

  20. Draft assembly of elite inbred line PH207 provides insights into genomic and transcriptome diversity in maize

    Science.gov (United States)

    Intense artificial selection over the last 100 years has produced elite maize (Zea mays) inbred lines that combine to produce high-yielding hybrids. To further our understanding of how genome and transcriptome variation contribute to the production of high-yielding hybrids, we generated a draft geno...

  1. Role of RNA structures in genome terminal sequences of the hepatitis C virus for replication and assembly.

    Science.gov (United States)

    Friebe, Peter; Bartenschlager, Ralf

    2009-11-01

    Hepatitis C virus (HCV) is a positive-strand RNA virus replicating its genome via a negative-strand [(-)] intermediate. Little is known about replication signals residing in the 3' end of HCV (-) RNA. Recent studies identified seven stem-loop structures (SL-I', -IIz', -IIy', -IIIa', -IIIb', -IIIcdef', and -IV') in this region. In the present study, we mapped the minimal region required for RNA replication to SL-I' and -IIz', functionally confirmed the SL-IIz' structure, and identified SL-IIIa' to -IV' as auxiliary replication elements. In addition, we show that the 5' nontranslated region of the genome most likely does not contain cis-acting RNA structures required for RNA packaging into infectious virions.

  2. Integration host factor assembly at the cohesive end site of the bacteriophage lambda genome: implications for viral DNA packaging and bacterial gene regulation.

    Science.gov (United States)

    Sanyal, Saurarshi J; Yang, Teng-Chieh; Catalano, Carlos Enrique

    2014-12-09

    Integration host factor (IHF) is an Escherichia coli protein involved in (i) condensation of the bacterial nucleoid and (ii) regulation of a variety of cellular functions. In its regulatory role, IHF binds to a specific sequence to introduce a strong bend into the DNA; this provides a duplex architecture conducive to the assembly of site-specific nucleoprotein complexes. Alternatively, the protein can bind in a sequence-independent manner that weakly bends and wraps the duplex to promote nucleoid formation. IHF is also required for the development of several viruses, including bacteriophage lambda, where it promotes site-specific assembly of a genome packaging motor required for lytic development. Multiple IHF consensus sequences have been identified within the packaging initiation site (cos), and we here interrogate IHF-cos binding interactions using complementary electrophoretic mobility shift (EMS) and analytical ultracentrifugation (AUC) approaches. IHF recognizes a single consensus sequence within cos (I1) to afford a strongly bent nucleoprotein complex. In contrast, IHF binds weakly but with positive cooperativity to nonspecific DNA to afford an ensemble of complexes with increasing masses and levels of condensation. Global analysis of the EMS and AUC data provides constrained thermodynamic binding constants and nearest neighbor cooperativity factors for binding of IHF to I1 and to nonspecific DNA substrates. At elevated IHF concentrations, the nucleoprotein complexes undergo a transition from a condensed to an extended rodlike conformation; specific binding of IHF to I1 imparts a significant energy barrier to the transition. The results provide insight into how IHF can assemble specific regulatory complexes in the background of extensive nonspecific DNA condensation.

  3. MutMap-Gap: whole-genome resequencing of mutant F2 progeny bulk combined with de novo assembly of gap regions identifies the rice blast resistance gene Pii.

    Science.gov (United States)

    Takagi, Hiroki; Uemura, Aiko; Yaegashi, Hiroki; Tamiru, Muluneh; Abe, Akira; Mitsuoka, Chikako; Utsushi, Hiroe; Natsume, Satoshi; Kanzaki, Hiroyuki; Matsumura, Hideo; Saitoh, Hiromasa; Yoshida, Kentaro; Cano, Liliana M; Kamoun, Sophien; Terauchi, Ryohei

    2013-10-01

    Next-generation sequencing allows the identification of mutations responsible for mutant phenotypes by whole-genome resequencing and alignment to a reference genome. However, when the resequenced cultivar/line displays significant structural variation from the reference genome, mutations in the genome regions missing from the reference (gaps) cannot be identified by simple alignment. Here we report on a method called 'MutMap-Gap', which involves delineating a candidate region harboring a mutation of interest using the recently reported MutMap method, followed by de novo assembly, alignment, and identification of the mutation within genome gaps. We applied MutMap-Gap to isolate the blast resistance gene Pii from the rice cv Hitomebore using mutant lines that have lost Pii function. MutMap-Gap should prove useful for cloning genes that exhibit significant structural variations such as disease resistance genes of the nucleotide-binding site-leucine rich repeat (NBS-LRR) class.

  4. Genome Assembly of the Fungus Cochliobolus miyabeanus, and Transcriptome Analysis during Early Stages of Infection on American Wildrice (Zizania palustris L.).

    Directory of Open Access Journals (Sweden)

    Claudia V Castell-Miller

    The fungus Cochliobolus miyabeanus causes severe leaf spot disease on rice (Oryza sativa) and two North American specialty crops, American wildrice (Zizania palustris) and switchgrass (Panicum virgatum). Despite the importance of C. miyabeanus as a disease-causing agent in wildrice, little is known about either the mechanisms of pathogenicity or host defense responses. To start bridging these gaps, the genome of C. miyabeanus strain TG12bL2 was shotgun sequenced using Illumina technology. The genome assembly consists of 31.79 Mbp in 2,378 scaffolds with an N50 = 74,921. It contains 11,000 predicted genes of which 94.5% were annotated. Approximately 10% of the predicted gene products are expected to be secreted. The C. miyabeanus genome is rich in carbohydrate active enzymes, and harbors 187 small secreted peptides (SSPs) and some fungal effector homologs. Detoxification systems were represented by a variety of enzymes that could offer protection against plant defense compounds. The non-ribosomal peptide synthetases and polyketide synthases (PKS) present were common to other Cochliobolus species. Additionally, the fungal transcriptome was analyzed at 48 hours after inoculation in planta. A total of 10,674 genes were found to be expressed, some of which are known to be involved in pathogenicity or response to host defenses including hydrophobins, cutinase, cell wall degrading enzymes, enzymes related to reactive oxygen species scavenging, PKS, detoxification systems, SSPs, and a known fungal effector. This work will facilitate future research on C. miyabeanus pathogen-associated molecular patterns and effectors, and in the identification of their corresponding wildrice defense mechanisms.

  5. Random walk in genome space: A key ingredient of intermittent dynamics of community assembly on evolutionary time scales

    KAUST Repository

    Murase, Yohsuke

    2010-06-01

    Community assembly is studied using individual-based multispecies models. The models have stochastic population dynamics with mutation, migration, and extinction of species. Mutants appear as a result of mutation of the resident species, while migrants have no correlation with the resident species. It is found that the dynamics of community assembly with mutations are quite different from the case with migrations. In contrast to mutation models, which show intermittent dynamics of quasi-steady states interrupted by sudden reorganizations of the community, migration models show smooth and gradual renewal of the community. As a consequence, instead of the 1/f diversity fluctuations found for the mutation models, 1/f², random-walk-like fluctuations are observed for the migration models. In addition, a characteristic species-lifetime distribution is found: a power law that is cut off by a "skewed" distribution in the long-lifetime regime. The latter has a longer tail than a simple exponential function, which indicates an age-dependent species-mortality function. Since this characteristic profile has been observed, both in fossil data and in several other mathematical models, we conclude that it is a universal feature of macroevolution. © 2010 Elsevier Ltd.

  6. The α-gliadin genes from Brachypodium distachyon L. provide evidence for a significant gap in the current genome assembly.

    Science.gov (United States)

    Chen, G X; Lv, D W; Li, W D; Subburaj, S; Yu, Z T; Wang, Y J; Li, X H; Wang, K; Ye, X G; Ma, Wujun; Yan, Y M

    2014-03-01

    Brachypodium distachyon is a new model plant for most cereal crops, while gliadin is a class of wheat storage proteins related to wheat quality attributes. In the published B. distachyon genome sequence databases, no gliadin gene is found. In the current study, a number of gliadin genes in B. distachyon were isolated, which is contradictory to the results of genome sequencing projects. In our study, the B. distachyon seeds were found to have no gliadin protein expression by gel electrophoresis, reversed-phase high-performance liquid chromatography and Western blotting analysis. However, Southern blotting revealed the presence of more than ten copies of α-gliadin coding genes in B. distachyon. By means of AS-PCR amplification, four novel full-ORF α-gliadin genes, and 26 pseudogenes with at least one stop codon as well as their promoter regions were cloned and sequenced from different Brachypodium accessions. Sequence analysis revealed a few single-nucleotide polymorphisms among these genes. Most pseudogenes resulted from a C to T change, leading to the generation of a TAG or TAA in-frame stop codon. To compare both the full-ORFs and the pseudogenes among Triticum and Triticum-related species, their structural characteristics were analyzed. Based on the four T cell stimulatory toxic epitopes and two polyglutamine domains, Aegilops, Triticum, and Brachypodium species were found to be more closely related. The phylogenetic analysis further revealed that B. distachyon was more closely related to Aegilops tauschii, Aegilops umbellulata, and the A or D genome of Triticum aestivum. The α-gliadin genes were able to express successfully in E. coli using the functional T7 promoter. The relative and absolute quantification of the transcripts of α-gliadin genes in wheat was much higher than that in B. distachyon. The abundant pseudogenes may affect the transcriptional and/or posttranscriptional level of the α-gliadin in B. distachyon.
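
The record above attributes most pseudogenes to a C to T change that creates an in-frame TAG or TAA stop codon. In the standard genetic code the glutamine codons CAG and CAA become TAG and TAA when their first base changes from C to T. The following is a minimal sketch of how such sites could be enumerated; the function name `c_to_t_stop_sites` and the toy sequence are hypothetical and are not taken from the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def c_to_t_stop_sites(orf):
    """List (codon_index, codon, mutant) for single C->T changes that create a stop."""
    hits = []
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        codon = orf[start:start + 3]
        if codon in STOP_CODONS:
            continue  # already a stop codon
        for offset, base in enumerate(codon):
            if base != "C":
                continue
            mutant = codon[:offset] + "T" + codon[offset + 1:]
            if mutant in STOP_CODONS:
                hits.append((start // 3, codon, mutant))
    return hits


# Hypothetical toy reading frame, not a real gliadin sequence.
toy_orf = "ATGCAGCAACCGTTC"
for index, codon, mutant in c_to_t_stop_sites(toy_orf):
    print(index, codon, "->", mutant)
```

On the toy frame only the CAG and CAA codons are reported. Glutamine-rich coding sequences contain many such codons, which is consistent with C to T changes being a frequent route to in-frame stops.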

  7. Complete Genome Sequence of Pelosinus sp. Strain UFO1 Assembled Using Single-Molecule Real-Time DNA Sequencing Technology

    Energy Technology Data Exchange (ETDEWEB)

    Steven D. Brown; Sagar M. Utturkar; Timothy S. Magnuson; Allison E. Ray; Farris L. Poole; W. Andrew Lancaster; Michael P. Thorgersen; Michael W. W. Adams; Dwayne A. Elias

    2014-09-01

    Pelosinus fermentans strain R7 was isolated from Russian kaolin clays as the type strain and it can reduce Fe(III) during fermentative growth (1). Draft genome sequences for P. fermentans R7 and four strains from Hanford, Washington, USA, have been published (2–4). The P. fermentans 16S rRNA sequence dominated the lactate-based enrichment cultures from three geochemically contrasting soils from the Melton Branch Watershed, Oak Ridge, Tennessee, USA (5) and also at another stimulated, uranium-contaminated field site near Oak Ridge (6). For the current work, strain UFO1 was isolated from pristine sediments at a background field site in Oak Ridge and characterized as facilitating U(VI) reduction and precipitation with phosphate (7).

  8. Self-assembled DNA nanoclews for the efficient delivery of CRISPR-Cas9 for genome editing.

    Science.gov (United States)

    Sun, Wujin; Ji, Wenyan; Hall, Jordan M; Hu, Quanyin; Wang, Chao; Beisel, Chase L; Gu, Zhen

    2015-10-05

    CRISPR-Cas9 represents a promising platform for genome editing, yet means for its safe and efficient delivery remain to be fully realized. A novel vehicle that simultaneously delivers the Cas9 protein and single guide RNA (sgRNA) is based on DNA nanoclews, yarn-like DNA nanoparticles that are synthesized by rolling circle amplification. The biologically inspired vehicles were efficiently loaded with Cas9/sgRNA complexes and delivered the complexes to the nuclei of human cells, thus enabling targeted gene disruption while maintaining cell viability. Editing was most efficient when the DNA nanoclew sequence and the sgRNA guide sequence were partially complementary, offering a design rule for enhancing delivery. Overall, this strategy provides a versatile method that could be adapted for delivering other DNA-binding proteins or functional nucleic acids.

  9. Detection of a Usp-like gene in Calotropis procera plant from the de novo assembled genome contigs of the high-throughput sequencing dataset.

    Science.gov (United States)

    Shokry, Ahmed M; Al-Karim, Saleh; Ramadan, Ahmed; Gadallah, Nour; Al Attas, Sanaa G; Sabir, Jamal S M; Hassan, Sabah M; Madkour, Magdy A; Bressan, Ray; Mahfouz, Magdy; Bahieldin, Ahmed

    2014-02-01

    The wild plant species Calotropis procera (C. procera) has many potential applications and beneficial uses in medicine, industry and ornamental field. It also represents an excellent source of genes for drought and salt tolerance. Genes encoding proteins that contain the conserved universal stress protein (USP) domain are known to provide organisms like bacteria, archaea, fungi, protozoa and plants with the ability to respond to a plethora of environmental stresses. However, information on the possible occurrence of Usp in C. procera is not available. In this study, we uncovered and characterized a one-class A Usp-like (UspA-like, NCBI accession No. KC954274) gene in this medicinal plant from the de novo assembled genome contigs of the high-throughput sequencing dataset. A number of GenBank accessions for Usp sequences were blasted with the recovered de novo assembled contigs. Homology modelling of the deduced amino acids (NCBI accession No. AGT02387) was further carried out using Swiss-Model, accessible via the EXPASY. Superimposition of C. procera USPA-like full sequence model on Thermus thermophilus USP UniProt protein (PDB accession No. Q5SJV7) was constructed using RasMol and Deep-View programs. The functional domains of the novel USPA-like amino acids sequence were identified from the NCBI conserved domain database (CDD) that provide insights into sequence structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

  10. Detection of a Usp-like gene in Calotropis procera plant from the de novo assembled genome contigs of the high-throughput sequencing dataset

    KAUST Repository

    Shokry, Ahmed M.

    2014-02-01

    The wild plant species Calotropis procera (C. procera) has many potential applications and beneficial uses in medicine, industry and ornamental field. It also represents an excellent source of genes for drought and salt tolerance. Genes encoding proteins that contain the conserved universal stress protein (USP) domain are known to provide organisms like bacteria, archaea, fungi, protozoa and plants with the ability to respond to a plethora of environmental stresses. However, information on the possible occurrence of Usp in C. procera is not available. In this study, we uncovered and characterized a one-class A Usp-like (UspA-like, NCBI accession No. KC954274) gene in this medicinal plant from the de novo assembled genome contigs of the high-throughput sequencing dataset. A number of GenBank accessions for Usp sequences were blasted with the recovered de novo assembled contigs. Homology modelling of the deduced amino acids (NCBI accession No. AGT02387) was further carried out using Swiss-Model, accessible via the EXPASY. Superimposition of C. procera USPA-like full sequence model on Thermus thermophilus USP UniProt protein (PDB accession No. Q5SJV7) was constructed using RasMol and Deep-View programs. The functional domains of the novel USPA-like amino acids sequence were identified from the NCBI conserved domain database (CDD) that provide insights into sequence structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM). © 2014 Académie des sciences.

  11. Comprehensive profiling of retroviral integration sites using target enrichment methods from historical koala samples without an assembled reference genome

    Directory of Open Access Journals (Sweden)

    Pin Cui

    2016-03-01

    Background. Retroviral integration into the host germline results in permanent viral colonization of vertebrate genomes. The koala retrovirus (KoRV) is currently invading the germline of the koala (Phascolarctos cinereus) and provides a unique opportunity for studying retroviral endogenization. Previous analysis of KoRV integration patterns in modern koalas demonstrates that they share integration sites primarily if they are related, indicating that the process is currently driven by vertical transmission rather than infection. However, due to methodological challenges, KoRV integrations have not been comprehensively characterized. Results. To overcome these challenges, we applied and compared three target enrichment techniques coupled with next generation sequencing (NGS) and a newly customized sequence-clustering based computational pipeline to determine the integration sites for 10 museum Queensland and New South Wales (NSW) koala samples collected between the 1870s and late 1980s. A secondary aim of this study sought to identify common integration sites across modern and historical specimens by comparing our dataset to previously published studies. Several million sequences were processed, and the KoRV integration sites in each koala were characterized. Conclusions. Although the three enrichment methods each exhibited bias in integration site retrieval, a combination of two methods, Primer Extension Capture and hybridization capture, is recommended for future studies on historical samples. Moreover, identification of integration sites shows that the proportion of integration sites shared between any two koalas is quite small.

  12. "Triple negative breast cancer": Translational research and the (re)assembling of diseases in post-genomic medicine.

    Science.gov (United States)

    Keating, Peter; Cambrosio, Alberto; Nelson, Nicole C

    2016-10-01

    The paper examines the debate about the nature and status of "Triple-negative breast cancer", a controversial biomedical entity whose existence illustrates a number of features of post-genomic translational research. The emergence of TNBC is intimately linked to the rise of molecular oncology, and, more generally, to the changing configuration of the life sciences at the turn of the new century. An unprecedented degree of integration of biological and clinical practices has led to the proliferation of bio-clinical entities emerging from translational research. These translations take place between platforms rather than between clinical and laboratory settings. The complexity and heterogeneity of TNBC, its epistemic and technical, biological and clinical dualities, result from its multiple instantiations via different platforms, and from the uneven distribution of biological materials, techniques, and objects across clinical research settings. The fact that TNBC comes in multiple forms, some of which seem to be incompatible or, at least, only partially overlapping, appears to be less a threat to the whole endeavor, than an aspect of an ongoing translational research project. Discussions of translational research that rest on a distinction between basic research and its applications fail to capture the dynamics of this new domain of activity, insofar as application is built-in from the very beginning in the bio-clinical entities that emerge from the translational research domain.

  13. De Novo Assembly and Genome Analyses of the Marine-Derived Scopulariopsis brevicaulis Strain LF580 Unravels Life-Style Traits and Anticancerous Scopularide Biosynthetic Gene Cluster

    Science.gov (United States)

    Kumar, Abhishek; Henrissat, Bernard; Arvas, Mikko; Syed, Muhammad Fahad; Thieme, Nils; Benz, J. Philipp; Sørensen, Jens Laurids; Record, Eric; Pöggeler, Stefanie; Kempken, Frank

    2015-01-01

    The marine-derived Scopulariopsis brevicaulis strain LF580 produces scopularides A and B, which have anticancerous properties. We carried out genome sequencing using three next-generation DNA sequencing methods. De novo hybrid assembly yielded 621 scaffolds with a total size of 32.2 Mb and 16298 putative gene models. We identified a large non-ribosomal peptide synthetase gene (nrps1) and supporting pks2 gene in the same biosynthetic gene cluster. This cluster and the genes within the cluster are functionally active as confirmed by RNA-Seq. Characterization of carbohydrate-active enzymes and major facilitator superfamily (MFS)-type transporters lead to postulate S. brevicaulis originated from a soil fungus, which came into contact with the marine sponge Tethya aurantium. This marine sponge seems to provide shelter to this fungus and micro-environment suitable for its survival in the ocean. This study also builds the platform for further investigations of the role of life-style and secondary metabolites from S. brevicaulis. PMID:26505484

  14. De Novo Assembly and Genome Analyses of the Marine-Derived Scopulariopsis brevicaulis Strain LF580 Unravels Life-Style Traits and Anticancerous Scopularide Biosynthetic Gene Cluster.

    Science.gov (United States)

    Kumar, Abhishek; Henrissat, Bernard; Arvas, Mikko; Syed, Muhammad Fahad; Thieme, Nils; Benz, J Philipp; Sørensen, Jens Laurids; Record, Eric; Pöggeler, Stefanie; Kempken, Frank

    2015-01-01

    The marine-derived Scopulariopsis brevicaulis strain LF580 produces scopularides A and B, which have anticancerous properties. We carried out genome sequencing using three next-generation DNA sequencing methods. De novo hybrid assembly yielded 621 scaffolds with a total size of 32.2 Mb and 16298 putative gene models. We identified a large non-ribosomal peptide synthetase gene (nrps1) and supporting pks2 gene in the same biosynthetic gene cluster. This cluster and the genes within the cluster are functionally active as confirmed by RNA-Seq. Characterization of carbohydrate-active enzymes and major facilitator superfamily (MFS)-type transporters lead to postulate S. brevicaulis originated from a soil fungus, which came into contact with the marine sponge Tethya aurantium. This marine sponge seems to provide shelter to this fungus and micro-environment suitable for its survival in the ocean. This study also builds the platform for further investigations of the role of life-style and secondary metabolites from S. brevicaulis.

  15. Genome Sequencing

    DEFF Research Database (Denmark)

    Sato, Shusei; Andersen, Stig Uggerhøj

    2014-01-01

    The current Lotus japonicus reference genome sequence is based on a hybrid assembly of Sanger TAC/BAC, Sanger shotgun and Illumina shotgun sequencing data generated from the Miyakojima-MG20 accession. It covers nearly all expressed L. japonicus genes and has been annotated mainly based on transcr...

  16. Analysis of Illumina Microbial Assemblies

    Energy Technology Data Exchange (ETDEWEB)

    Clum, Alicia; Foster, Brian; Froula, Jeff; LaButti, Kurt; Sczyrba, Alex; Lapidus, Alla; Woyke, Tanja

    2010-05-28

    Since the emergence of second generation sequencing technologies, the evaluation of different sequencing approaches and their assembly strategies for different types of genomes has become an important undertaking. Next generation sequencing technologies dramatically increase sequence throughput while decreasing cost, making them an attractive tool for whole genome shotgun sequencing. To compare different approaches for de novo whole genome assembly, appropriate tools and a solid understanding of both quantity and quality of the underlying sequence data are crucial. Here, we performed an in-depth analysis of short-read Illumina sequence assembly strategies for bacterial and archaeal genomes. Different types of Illumina libraries as well as different trim parameters and assemblers were evaluated. Results of the comparative analysis and sequencing platforms will be presented. The goal of this analysis is to develop a cost-effective approach for the increased throughput of the generation of high quality microbial genomes.

  17. Subgenome-specific assembly of vitamin E biosynthesis genes and expression patterns during seed development provide insight into the evolution of oat genome.

    Science.gov (United States)

    Gutierrez-Gonzalez, Juan J; Garvin, David F

    2016-11-01

    Vitamin E is essential for humans and thus must be a component of a healthy diet. Among the cereal grains, hexaploid oats (Avena sativa L.) have high vitamin E content. To date, no gene sequences in the vitamin E biosynthesis pathway have been reported for oats. Using deep sequencing and orthology-guided assembly, coding sequences of genes for each step in vitamin E synthesis in oats were reconstructed, including resolution of the sequences of homeologs. Three homeologs, presumably representing each of the three oat subgenomes, were identified for the main steps of the pathway. Partial sequences, likely representing pseudogenes, were recovered in some instances as well. Pairwise comparisons among homeologs revealed that two of the three putative subgenome-specific homeologs are almost identical for each gene. Synonymous substitution rates indicate the time of divergence of the two more similar subgenomes from the distinct one at 7.9-8.7 MYA, and a divergence between the similar subgenomes from a common ancestor 1.1 MYA. A new proposed evolutionary model for hexaploid oat formation is discussed. Homeolog-specific gene expression was quantified during oat seed development and compared with vitamin E accumulation. Homeolog expression largely appears to be similar for most of genes; however, for some genes, homoeolog-specific transcriptional bias was observed. The expression of HPPD, as well as certain homoeologs of VTE2 and VTE4, is highly correlated with seed vitamin E accumulation. Our findings expand our understanding of oat genome evolution and will assist efforts to modify vitamin E content and composition in oats. Published 2016. This article is a U.S. Government work and is in the public domain in the USA. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.

  18. AGORA: Assembly Guided by Optical Restriction Alignment

    Directory of Open Access Journals (Sweden)

    Lin Henry C

    2012-08-01

    Background. Genome assembly is difficult due to repeated sequences within the genome, which create ambiguities and cause the final assembly to be broken up into many separate sequences (contigs). Long range linking information, such as mate-pairs or mapping data, is necessary to help assembly software resolve repeats, thereby leading to a more complete reconstruction of genomes. Prior work has used optical maps for validating assemblies and scaffolding contigs, after an initial assembly has been produced. However, optical maps have not previously been used within the genome assembly process. Here, we use optical map information within the popular de Bruijn graph assembly paradigm to eliminate paths in the de Bruijn graph which are not consistent with the optical map and help determine the correct reconstruction of the genome. Results. We developed a new algorithm called AGORA: Assembly Guided by Optical Restriction Alignment. AGORA is the first algorithm to use optical map information directly within the de Bruijn graph framework to help produce an accurate assembly of a genome that is consistent with the optical map information provided. Our simulations on bacterial genomes show that AGORA is effective at producing assemblies closely matching the reference sequences. Additionally, we show that noise in the optical map can have a strong impact on the final assembly quality for some complex genomes, and we also measure how various characteristics of the starting de Bruijn graph may impact the quality of the final assembly. Lastly, we show that a proper choice of restriction enzyme for the optical map may substantially improve the quality of the final assembly. Conclusions. Our work shows that optical maps can be used effectively to assemble genomes within the de Bruijn graph assembly framework. Our experiments also provide insights into the characteristics of the mapping data that most affect the performance of our algorithm, indicating the

  19. CasEMBLR: Cas9-Facilitated Multiloci Genomic Integration of in Vivo Assembled DNA Parts in Saccharomyces cerevisiae

    DEFF Research Database (Denmark)

    Jakociunas, Tadas; Rajkumar, Arun Stephen; Zhang, Jie

    2015-01-01

    , we present a method for marker-free multiloci integration of in vivo assembled DNA parts. By the use of CRISPR/Cas9-mediated one-step double-strand breaks at single, double and triple integration sites we report the successful in vivo assembly and chromosomal integration of DNA parts. We call our...

  20. Draft genome assembly of two Pseudoclavibacter helvolus strains, G8 and W3, isolated from slaughterhouse environments

    DEFF Research Database (Denmark)

    Raghupathi, Prem Krishnan; Herschend, Jakob; Røder, Henriette Lyng;

    2016-01-01

    We report the draft genome sequences of two Pseudoclavibacter helvolus strains. Strain G8 was isolated from a meat chopper and strain W3 isolated from the wall of a small slaughterhouse in Denmark. The two annotated genomes are 3.91 Mb and 4.00 Mb in size, respectively....

  1. Rapid, High-Throughput Identification of Anthrax-Causing and Emetic Bacillus cereus Group Genome Assemblies via BTyper, a Computational Tool for Virulence-Based Classification of Bacillus cereus Group Isolates by Using Nucleotide Sequencing Data

    Science.gov (United States)

    Carroll, Laura M.; Miller, Rachel A.; Wiedmann, Martin

    2017-01-01

    ABSTRACT The Bacillus cereus group comprises nine species, several of which are pathogenic. Differentiating between isolates that may cause disease and those that do not is a matter of public health and economic importance, but it can be particularly challenging due to the high genomic similarity within the group. To this end, we have developed BTyper, a computational tool that employs a combination of (i) virulence gene-based typing, (ii) multilocus sequence typing (MLST), (iii) panC clade typing, and (iv) rpoB allelic typing to rapidly classify B. cereus group isolates using nucleotide sequencing data. BTyper was applied to a set of 662 B. cereus group genome assemblies to (i) identify anthrax-associated genes in non-B. anthracis members of the B. cereus group, and (ii) identify assemblies from B. cereus group strains with emetic potential. With BTyper, the anthrax toxin genes cya, lef, and pagA were detected in 8 genomes classified by the NCBI as B. cereus that clustered into two distinct groups using k-medoids clustering, while either the B. anthracis poly-γ-d-glutamate capsule biosynthesis genes capABCDE or the hyaluronic acid capsule hasA gene was detected in an additional 16 assemblies classified as either B. cereus or Bacillus thuringiensis isolated from clinical, environmental, and food sources. The emetic toxin genes cesABCD were detected in 24 assemblies belonging to panC clades III and VI that had been isolated from food, clinical, and environmental settings. The command line version of BTyper is available at https://github.com/lmc297/BTyper. In addition, BMiner, a companion application for analyzing multiple BTyper output files in aggregate, can be found at https://github.com/lmc297/BMiner. IMPORTANCE Bacillus cereus is a foodborne pathogen that is estimated to cause tens of thousands of illnesses each year in the United States alone. Even with molecular methods, it can be difficult to distinguish nonpathogenic B. cereus group isolates from their

  2. Rapid, high-throughput identification of anthrax-causing and emetic Bacillus cereus group genome assemblies using BTyper, a computational tool for virulence-based classification of Bacillus cereus group isolates using nucleotide sequencing data.

    Science.gov (United States)

    Carroll, Laura M; Kovac, Jasna; Miller, Rachel A; Wiedmann, Martin

    2017-06-16

    The Bacillus cereus group comprises nine species, several of which are pathogenic. Differentiating between isolates that may cause disease and those that do not is a matter of public health and economic importance, but can be particularly challenging due to the high genomic similarity of the group. To this end, we have developed BTyper, a computational tool that employs a combination of (i) virulence gene-based typing, (ii) multi-locus sequence typing (MLST), (iii) panC clade typing, and (iv) rpoB allelic typing to rapidly classify B. cereus group isolates using nucleotide sequencing data. BTyper was applied to a set of 662 B. cereus group genome assemblies to (i) identify anthrax-associated genes in non-B. anthracis members of the B. cereus group, and (ii) identify assemblies from B. cereus group strains with emetic potential. With BTyper, anthrax toxin genes cya, lef and pagA were detected in 8 genomes classified in NCBI as B. cereus that clustered into two distinct groups using k-medoids clustering, while B. anthracis poly-γ-D-glutamate capsule biosynthesis genes capABCDE or hyaluronic acid capsule gene hasA were detected in an additional 16 assemblies classified as either B. cereus or B. thuringiensis isolated from clinical, environmental, and food sources. Emetic toxin genes cesABCD were detected in 24 assemblies belonging to panC clades III and VI that had been isolated from food, clinical, and environmental settings. The command line version of BTyper is available at https://github.com/lmc297/BTyper. In addition, BMiner, a companion application for analyzing multiple BTyper output files in aggregate, can be found at https://github.com/lmc297/BMiner. Importance: Bacillus cereus is a foodborne pathogen that is estimated to cause tens of thousands of illnesses each year in the United States alone. Even with molecular methods, it can be difficult to distinguish non-pathogenic B. cereus group isolates from their pathogenic counterparts, including the human pathogen B

  3. Discovery, genotyping and characterization of structural variation and novel sequence at single nucleotide resolution from de novo genome assemblies on a population scale

    DEFF Research Database (Denmark)

    Huang, Shujia; Rao, Junhua; Ye, Weijian

    2015-01-01

    Comprehensive recognition of genomic variation in one individual is important for understanding disease and developing personalized medication and treatment. Many tools based on DNA re-sequencing exist for identification of single nucleotide polymorphisms, small insertions and deletions (indels...

  4. Discovery, genotyping and characterization of structural variation and novel sequence at single nucleotide resolution from de novo genome assemblies on a population scale

    DEFF Research Database (Denmark)

    Huang, Shujia; Rao, Junhua; Ye, Weijian

    2015-01-01

    Comprehensive recognition of genomic variation in one individual is important for understanding disease and developing personalized medication and treatment. Many tools based on DNA re-sequencing exist for identification of single nucleotide polymorphisms, small insertions and deletions (indels) ...

  5. Discovery, genotyping and characterization of structural variation and novel sequence at single nucleotide resolution from de novo genome assemblies on a population scale

    DEFF Research Database (Denmark)

    Huang, Shujia; Rao, Junhua; Ye, Weijian;

    2015-01-01

    Comprehensive recognition of genomic variation in one individual is important for understanding disease and developing personalized medication and treatment. Many tools based on DNA re-sequencing exist for identification of single nucleotide polymorphisms, small insertions and deletions (indels) ...

  6. Accurate DNA Assembly And Direct Genome Integration With Optimized Uracil Excision Cloning To Facilitate Engineering Of Escherichia coli As A Cell Factory

    DEFF Research Database (Denmark)

    Cavaleiro, Mafalda; Kim, Se Hyeuk; Nørholm, Morten

    2015-01-01

    Plants produce a vast diversity of valuable compounds with medical properties, but these are often difficult to purify from the natural source or produce by organic synthesis. An alternative is to transfer the biosynthetic pathways to an efficient production host like the bacterium Escherichia co......-excision-based cloning and combining it with a genome-engineering approach to allow direct integration of whole metabolic pathways into the genome of E. coli, to facilitate the advanced engineering of cell factories....

  7. Snowball: Strain aware gene assembly of Metagenomes

    NARCIS (Netherlands)

    Gregor, I.; Schönhuth, A.; McHardy, A.C.

    2015-01-01

    Gene assembly is an important step in functional analysis of shotgun metagenomic data. Nonetheless, strain aware assembly remains a challenging task, as current assembly tools often fail to distinguish among strain variants or require closely related reference genomes of the studied species to be av

  8. A gene-based high-resolution comparative radiation hybrid map as a framework for genome sequence assembly of a bovine chromosome 6 region associated with QTL for growth, body composition, and milk performance traits

    Directory of Open Access Journals (Sweden)

    Laurent Pascal

    2006-03-01

    Background. A number of different quantitative trait loci (QTL) for various phenotypic traits, including milk production, functional, and conformation traits in dairy cattle as well as growth and body composition traits in meat cattle, have been mapped consistently in the middle region of bovine chromosome 6 (BTA6). Dense genetic and physical maps and, ultimately, a fully annotated genome sequence as well as their mutual connections are required to efficiently identify genes and gene variants responsible for genetic variation of phenotypic traits. A comprehensive high-resolution gene-rich map linking densely spaced bovine markers and genes to the annotated human genome sequence is required as a framework to facilitate this approach for the region on BTA6 carrying the QTL. Results. Therefore, we constructed a high-resolution radiation hybrid (RH) map for the QTL-containing chromosomal region of BTA6. This new RH map with a total of 234 loci including 115 genes and ESTs displays a substantial increase in loci density compared to existing physical BTA6 maps. Screening the available bovine genome sequence resources, a total of 73 loci could be assigned to sequence contigs, which were already identified as specific for BTA6. For 43 loci, corresponding sequence contigs, which were not yet placed on the bovine genome assembly, were identified. In addition, the improved potential of this high-resolution RH map for BTA6 with respect to comparative mapping was demonstrated. Mapping a large number of genes on BTA6 and cross-referencing them with map locations in corresponding syntenic multi-species chromosome segments (human, mouse, rat, dog, chicken) achieved a refined accurate alignment of conserved segments and evolutionary breakpoints across the species included. Conclusion. The gene-anchored high-resolution RH map (1 locus/300 kb) for the targeted region of BTA6 presented here will provide a valuable platform to guide high-quality assembling and

  9. Next-generation transcriptome assembly

    Energy Technology Data Exchange (ETDEWEB)

    Martin, Jeffrey A.; Wang, Zhong

    2011-09-01

    Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalog of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches (reference-based, de novo and combined strategies), along with some perspectives on transcriptome assembly in the near future.

  10. Cephalopod genomics

    DEFF Research Database (Denmark)

    Albertin, Caroline B.; Bonnaud, Laure; Brown, C. Titus

    2012-01-01

    The Cephalopod Sequencing Consortium (CephSeq Consortium) was established at a NESCent Catalysis Group Meeting, ``Paths to Cephalopod Genomics-Strategies, Choices, Organization,'' held in Durham, North Carolina, USA on May 24-27, 2012. Twenty-eight participants representing nine countries (Austria......, Australia, China, Denmark, France, Italy, Japan, Spain and the USA) met to address the pressing need for genome sequencing of cephalopod mollusks. This group, drawn from cephalopod biologists, neuroscientists, developmental and evolutionary biologists, materials scientists, bioinformaticians and researchers...... active in sequencing, assembling and annotating genomes, agreed on a set of cephalopod species of particular importance for initial sequencing and developed strategies and an organization (CephSeq Consortium) to promote this sequencing. The conclusions and recommendations of this meeting are described...

  11. Re-annotation of the physical map of Glycine max for polyploid-like regions by BAC end sequence driven whole genome shotgun read assembly

    Directory of Open Access Journals (Sweden)

    Shultz Jeffry

    2008-07-01

    Full Text Available. Abstract. Background: Many of the world's most important food crops have either polyploid genomes or homeologous regions derived from segmental shuffling following polyploid formation. The soybean (Glycine max) genome has been shown to be composed of approximately four thousand short interspersed homeologous regions with 1, 2 or 4 copies per haploid genome by RFLP analysis, microsatellite anchors to BACs and by contigs formed from BAC fingerprints. Despite these similar regions, the genome has been sequenced by whole genome shotgun sequence (WGS). Here the aim was to use BAC end sequences (BES) derived from three minimum tile paths (MTP) to examine the extent and homogeneity of polyploid-like regions within contigs and the extent of correlation between the polyploid-like regions inferred from fingerprinting and the polyploid-like sequences inferred from WGS matches. Results: The results show that when sequence divergence was 1–10%, the copy number of homeologous regions could be identified from sequence variation in WGS reads overlapping BES. Homeolog sequence variants (HSVs) were single nucleotide polymorphisms (SNPs; 89%) and single nucleotide indels (SNIs; 10%). Larger indels were rare but present (1%). Simulations that had predicted fingerprints of homeologous regions could be separated when divergence exceeded 2% were shown to be false. We show that a 5–10% sequence divergence is necessary to separate homeologs by fingerprinting. BES compared to WGS traces showed polyploid-like regions with less than 1% sequence divergence exist at 2.3% of the locations assayed. Conclusion: The use of HSVs like SNPs and SNIs to characterize BACs will improve contig building methods. The implications for bioinformatic and functional annotation of polyploid and paleopolyploid genomes show that a combined approach of BAC fingerprint based physical maps, WGS sequence and HSV-based partitioning of BAC clones from homeologous regions to separate contigs will allow reliable de

  12. Next-generation genome sequencing and assembly provides tools for phylogenetics and identification of closely related species of Spathius, parasitoids of Agrilus planipennis (emerald ash borer)

    Science.gov (United States)

    A crucial step in biological control programs is identification of candidates for introduction. This is often difficult when cryptic species are involved. However, recent advances in next-generation sequencing allow whole genome sequencing in non-model species for the discovery and genotyping of ...

  13. Improved hybrid genome assemblies of 2 strains of Bacteroides xylanisolvens SD-CC-1b and SD-CC-2a using Illumina and 454 sequencing technologies

    Science.gov (United States)

    Bacteroides xylanisolvens strains (SD_CC_1b, SD_CC_2a) isolated from human feces were able to grow on crystalline cellulose. Cellulolytic properties are not common in Bacteroides species. Here, we report improved genome sequences of both B. xylanisolvens strains....

  14. Bioinformatics decoding the genome

    CERN Document Server

    CERN. Geneva; Deutsch, Sam; Michielin, Olivier; Thomas, Arthur; Descombes, Patrick

    2006-01-01

    Extracting the fundamental genomic sequence from the DNA. From Genome to Sequence: Biology in the early 21st century has been radically transformed by the availability of the full genome sequences of an ever increasing number of life forms, from bacteria to major crop plants and to humans. The lecture will concentrate on the computational challenges associated with the production, storage and analysis of genome sequence data, with an emphasis on mammalian genomes. The quality and usability of genome sequences is increasingly conditioned by the careful integration of strategies for data collection and computational analysis, from the construction of maps and libraries to the assembly of raw data into sequence contigs and chromosome-sized scaffolds. Once the sequence is assembled, a major challenge is the mapping of biologically relevant information onto this sequence: promoters, introns and exons of protein-encoding genes, regulatory elements, functional RNAs, pseudogenes, transposons, etc. The methodological ...

  15. Uracil Excision for Assembly of Complex Pathways

    DEFF Research Database (Denmark)

    Cavaleiro, Mafalda; Nielsen, Morten Thrane; Kim, Se Hyeuk

    2015-01-01

    inexpensive technologies available. Here, we describe four different protocols for uracil excision-based DNA editing: one for simple manipulations such as site-directed mutagenesis, one for plasmid-based multigene assembly in Escherichia coli, one for one-step assembly and integration of single or multiple...... genes into the genome, and a standardized assembly pipeline using benchmarked oligonucleotides for pathway assembly and multigene expression optimization....

  16. Snowball: Strain aware gene assembly of Metagenomes

    OpenAIRE

    Gregor, I.; Schönhuth, A.; McHardy, A. C.

    2015-01-01

    Gene assembly is an important step in functional analysis of shotgun metagenomic data. Nonetheless, strain aware assembly remains a challenging task, as current assembly tools often fail to distinguish among strain variants or require closely related reference genomes of the studied species to be available. We have developed Snowball, a novel strain aware and reference-free gene assembler for shotgun metagenomic data. It uses profile hidden Markov models (HMMs) of gene domains of interest to ...

  17. Genomic libraries: I. Construction and screening of fosmid genomic libraries.

    Science.gov (United States)

    Quail, Mike A; Matthews, Lucy; Sims, Sarah; Lloyd, Christine; Beasley, Helen; Baxter, Simon W

    2011-01-01

    Large insert genome libraries have been a core resource required to sequence genomes, analyze haplotypes, and aid gene discovery. While next generation sequencing technologies are revolutionizing the field of genomics, traditional genome libraries will still be required for accurate genome assembly. Their utility is also being extended to functional studies for understanding DNA regulatory elements. Here, we present a detailed method for constructing genomic fosmid libraries, testing for common contaminants, gridding the library to nylon membranes, then hybridizing the library membranes with a radiolabeled probe to identify corresponding genomic clones. While this chapter focuses on fosmid libraries, many of these steps can also be applied to bacterial artificial chromosome libraries.

  18. The UCSC Genome Browser Database

    DEFF Research Database (Denmark)

    Hinrichs, A S; Karolchik, D; Baertsch, R

    2006-01-01

    The University of California Santa Cruz Genome Browser Database (GBD) contains sequence and annotation data for the genomes of about a dozen vertebrate species and several major model organisms. Genome annotations typically include assembly data, sequence composition, genes and gene predictions, ...

  19. Genome-based Characterization of Two Prenylation Steps in the Assembly of the Stephacidin and Notoamide Anticancer Agents in a Marine-derived Aspergillus sp

    OpenAIRE

    Ding, Yousong; de Wet, Jeffrey R.; Cavalcoli, James; Li, Shengying; Greshock, Thomas J.; Miller, Kenneth A.; Finefield, Jennifer M.; Sunderhaus, James D.; McAfoos, Timothy; Tsukamoto, Sachiko; Williams, Robert M.; Sherman, David H.

    2010-01-01

    Stephacidin and notoamide natural products belong to a group of prenylated indole alkaloids containing a core bicyclo[2.2.2]diazaoctane ring system. These bioactive fungal secondary metabolites have a range of unusual structural and stereochemical features but their biosynthesis has remained uncharacterized. Herein, we report the first biosynthetic gene cluster for this class of fungal alkaloids based on whole genome sequencing of a marine-derived Aspergillus sp. Two central pathway enzymes c...

  20. Draft genome assemblies and predicted microRNA complements of the intertidal lophotrochozoans Patella vulgata (Mollusca, Patellogastropoda) and Spirobranchus (Pomatoceros) lamarcki (Annelida, Serpulida).

    Science.gov (United States)

    Kenny, Nathan J; Namigai, Erica K O; Marlétaz, Ferdinand; Hui, Jerome H L; Shimeld, Sebastian M

    2015-12-01

    MicroRNAs (miRNA) are small non-coding RNAs that act post-transcriptionally to regulate gene expression levels. Some studies have indicated that microRNAs may have low homoplasy, and as a consequence the phylogenetic distribution of microRNA families has been used to study animal evolutionary relationships. Limited levels of lineage sampling, however, may distort such analyses. Lophotrochozoa is an under-sampled taxon that includes molluscs, annelids and nemerteans, among other phyla. Here, we present two novel draft genomes, those of the limpet Patella vulgata and polychaete Spirobranchus (Pomatoceros) lamarcki. Surveying these genomes for known microRNAs identifies numerous potential orthologues, including a number that have been considered to be confined to other lineages. RT-PCR demonstrates that some of these (miR-1285, miR-1287, miR-1957, miR-1983 and miR-3533), previously thought to be found only in vertebrates, are expressed. This study provides genomic resources for two lophotrochozoans and reveals patterns of microRNA evolution that could be hidden by more restricted sampling.

  1. iAssembler: a package for de novo assembly of Roche-454/Sanger transcriptome sequences

    Directory of Open Access Journals (Sweden)

    Zheng Yi

    2011-11-01

    Full Text Available. Abstract. Background: Expressed Sequence Tags (ESTs) have played significant roles in gene discovery and gene functional analysis, especially for non-model organisms. For organisms with no full genome sequences available, ESTs are normally assembled into longer consensus sequences for further downstream analysis. However, current de novo EST assembly programs often generate a large number of assembly errors that will negatively affect the downstream analysis. In order to generate more accurate consensus sequences from ESTs, tools are needed to reduce or eliminate errors from de novo assemblies. Results: We present iAssembler, a pipeline that can assemble large-scale ESTs into consensus sequences with significantly higher accuracy than existing assemblers. iAssembler employs the MIRA and CAP3 assemblers to generate initial assemblies, followed by identifying and correcting two common types of transcriptome assembly errors: (1) ESTs from different transcripts (mainly alternatively spliced transcripts or paralogs) are incorrectly assembled into the same contigs; and (2) ESTs from the same transcripts fail to be assembled together. iAssembler can be used to assemble ESTs generated using the traditional Sanger method and/or the Roche-454 massive parallel pyrosequencing technology. Conclusion: We compared the performance of iAssembler and several other de novo EST assembly programs using both Roche-454 and Sanger EST datasets. The comparison demonstrated that iAssembler generated significantly more accurate consensus sequences than other assembly programs.

  2. Strategies and tools for whole genome alignments

    Energy Technology Data Exchange (ETDEWEB)

    Couronne, Olivier; Poliakov, Alexander; Bray, Nicolas; Ishkhanov, Tigran; Ryaboy, Dmitriy; Rubin, Edward; Pachter, Lior; Dubchak, Inna

    2002-11-25

    The availability of the assembled mouse genome makes possible, for the first time, an alignment and comparison of two large vertebrate genomes. We have investigated different strategies of alignment for the subsequent analysis of conservation of genomes that are effective for different quality assemblies. These strategies were applied to the comparison of the working draft of the human genome with the Mouse Genome Sequencing Consortium assembly, as well as other intermediate mouse assemblies. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90 percent of known coding exons in the human genome. We have obtained such coverage while preserving specificity. With a view towards the end user, we have developed a suite of tools and websites for automatically aligning, and subsequently browsing and working with whole genome comparisons. We describe the use of these tools to identify conserved non-coding regions between the human and mouse genomes, some of which have not been identified by other methods.

  3. When the genome plays dice: circumvention of the spindle assembly checkpoint and near-random chromosome segregation in multipolar cancer cell mitoses.

    Directory of Open Access Journals (Sweden)

    David Gisselsson

    Full Text Available BACKGROUND: Normal cell division is coordinated by a bipolar mitotic spindle, ensuring symmetrical segregation of chromosomes. Cancer cells, however, occasionally divide into three or more directions. Such multipolar mitoses have been proposed to generate genetic diversity and thereby contribute to clonal evolution. However, this notion has been little validated experimentally. PRINCIPAL FINDINGS: Chromosome segregation and DNA content in daughter cells from multipolar mitoses were assessed by multiphoton cross sectioning and fluorescence in situ hybridization in cancer cells and non-neoplastic transformed cells. The DNA distribution resulting from multipolar cell division was found to be highly variable, with frequent nullisomies in the daughter cells. Time-lapse imaging of H2B/GFP-labelled multipolar mitoses revealed that the time from the initiation of metaphase to the beginning of anaphase was prolonged and that the metaphase plates often switched polarity several times before metaphase-anaphase transition. The multipolar metaphase-anaphase transition was accompanied by a normal reduction of cellular cyclin B levels, but typically occurred before completion of the normal separase activity cycle. Centromeric AURKB and MAD2 foci were observed frequently to remain on the centromeres of multipolar ana-telophase chromosomes, indicating that multipolar mitoses were able to circumvent the spindle assembly checkpoint with some sister chromatids remaining unseparated after anaphase. Accordingly, scoring the distribution of individual chromosomes in multipolar daughter nuclei revealed a high frequency of nondisjunction events, resulting in a near-binomial allotment of sister chromatids to the daughter cells. CONCLUSION: The capability of multipolar mitoses to circumvent the spindle assembly checkpoint system typically results in a near-random distribution of chromosomes to daughter cells. Spindle multipolarity could thus be a highly efficient

  4. Slipping past the spindle assembly checkpoint.

    Science.gov (United States)

    Subramanian, Radhika; Kapoor, Tarun M

    2013-11-01

    Error-free genome segregation depends on the spindle assembly checkpoint (SAC), a signalling network that delays anaphase onset until chromosomes have established proper spindle attachments. Three reports now quantitatively examine the sensitivity and robustness of the SAC response.

  5. Genome-based characterization of two prenylation steps in the assembly of the stephacidin and notoamide anticancer agents in a marine-derived Aspergillus sp.

    Science.gov (United States)

    Ding, Yousong; de Wet, Jeffrey R; Cavalcoli, James; Li, Shengying; Greshock, Thomas J; Miller, Kenneth A; Finefield, Jennifer M; Sunderhaus, James D; McAfoos, Timothy J; Tsukamoto, Sachiko; Williams, Robert M; Sherman, David H

    2010-09-15

    Stephacidin and notoamide natural products belong to a group of prenylated indole alkaloids containing a core bicyclo[2.2.2]diazaoctane ring system. These bioactive fungal secondary metabolites have a range of unusual structural and stereochemical features but their biosynthesis has remained uncharacterized. Herein, we report the first biosynthetic gene cluster for this class of fungal alkaloids based on whole genome sequencing of a marine-derived Aspergillus sp. Two central pathway enzymes catalyzing both normal and reverse prenyltransfer reactions were characterized in detail. Our results establish the early steps for creation of the prenylated indole alkaloid structure and suggest a scheme for the biosynthesis of stephacidin and notoamide metabolites. The work provides the first genetic and biochemical insights for understanding the structural diversity of this important family of fungal alkaloids.

  6. Integration of high-resolution physical and genetic map reveals differential recombination frequency between chromosomes and the genome assembling quality in cucumber.

    Science.gov (United States)

    Lou, Qunfeng; He, Yuhua; Cheng, Chunyan; Zhang, Zhonghua; Li, Ji; Huang, Sanwen; Chen, Jinfeng

    2013-01-01

    Cucumber is an important model crop and the first species sequenced in the Cucurbitaceae family. Compared with the rapidly growing genetic and genomic resources, molecular cytogenetic research in cucumber is still very limited, which directly results in a lack of connection between the abundant physical sequence or genetic data and chromosome structure. We mapped twenty-three fosmids anchored by SSR markers from LG-3, the longest linkage group, and LG-4, the shortest linkage group, on pachytene chromosomes 3 and 4, using fluorescence in situ hybridization (FISH). Integrated molecular cytogenetic maps of chromosomes 3 and 4 were constructed. Except for three SSR markers located in heterochromatin regions, the cytological order of markers was concordant with that on the linkage maps. Distinct structural differences between chromosomes 3 and 4 were revealed by the high-resolution pachytene chromosomes. The extreme difference in genetic length between LG-3 and LG-4 was mainly attributed to the difference in overall recombination frequency. The significant difference in heterochromatin content between chromosomes 3 and 4 might be directly correlated with recombination frequency. Meanwhile, an uneven distribution of recombination frequency along chromosome 4 was observed, and the recombination frequency of the long arm was nearly 3.5 times higher than that of the short arm. Severe suppression of recombination was exhibited in the centromeric and heterochromatin domains of chromosome 4. Whereas a close correlation between gene density and recombination frequency was observed in chromosome 4, no significant correlation was observed between them along chromosome 3. The comparison between cytogenetic and sequence maps revealed a large gap in the pericentromeric heterochromatin region of the sequence map of chromosome 4. These results showed that integrated molecular cytogenetic maps can provide important information for the study of genetics and genomics in cucumber.

  7. Integration of high-resolution physical and genetic map reveals differential recombination frequency between chromosomes and the genome assembling quality in cucumber.

    Directory of Open Access Journals (Sweden)

    Qunfeng Lou

    Full Text Available. Cucumber is an important model crop and the first species sequenced in the Cucurbitaceae family. Compared with the rapidly growing genetic and genomic resources, molecular cytogenetic research in cucumber is still very limited, which directly results in a lack of connection between the abundant physical sequence or genetic data and chromosome structure. We mapped twenty-three fosmids anchored by SSR markers from LG-3, the longest linkage group, and LG-4, the shortest linkage group, on pachytene chromosomes 3 and 4, using fluorescence in situ hybridization (FISH). Integrated molecular cytogenetic maps of chromosomes 3 and 4 were constructed. Except for three SSR markers located in heterochromatin regions, the cytological order of markers was concordant with that on the linkage maps. Distinct structural differences between chromosomes 3 and 4 were revealed by the high-resolution pachytene chromosomes. The extreme difference in genetic length between LG-3 and LG-4 was mainly attributed to the difference in overall recombination frequency. The significant difference in heterochromatin content between chromosomes 3 and 4 might be directly correlated with recombination frequency. Meanwhile, an uneven distribution of recombination frequency along chromosome 4 was observed, and the recombination frequency of the long arm was nearly 3.5 times higher than that of the short arm. Severe suppression of recombination was exhibited in the centromeric and heterochromatin domains of chromosome 4. Whereas a close correlation between gene density and recombination frequency was observed in chromosome 4, no significant correlation was observed between them along chromosome 3. The comparison between cytogenetic and sequence maps revealed a large gap in the pericentromeric heterochromatin region of the sequence map of chromosome 4. These results showed that integrated molecular cytogenetic maps can provide important information for the study of genetic and genomics

  8. The UCSC Genome Browser database: 2016 update.

    Science.gov (United States)

    Speir, Matthew L; Zweig, Ann S; Rosenbloom, Kate R; Raney, Brian J; Paten, Benedict; Nejad, Parisa; Lee, Brian T; Learned, Katrina; Karolchik, Donna; Hinrichs, Angie S; Heitner, Steve; Harte, Rachel A; Haeussler, Maximilian; Guruvadoo, Luvina; Fujita, Pauline A; Eisenhart, Christopher; Diekhans, Mark; Clawson, Hiram; Casper, Jonathan; Barber, Galt P; Haussler, David; Kuhn, Robert M; Kent, W James

    2016-01-01

    For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for various assemblies, including bonobo and zebrafish; new gene annotation sets; improvements to track and assembly hub support; and a new interactive tool, the "Data Integrator", for intersecting data from multiple tracks. We have greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment.

  9. A forward-backward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using high-density single nucleotide polymorphism (SNP) array

    Directory of Open Access Journals (Sweden)

    Bailey Dione K

    2007-05-01

    Full Text Available. Abstract. Background: DNA copy number aberration (CNA) is one of the key characteristics of cancer cells. Recent studies demonstrated the feasibility of utilizing high density single nucleotide polymorphism (SNP) genotyping arrays to detect CNA. Compared with the two-color array-based comparative genomic hybridization (array-CGH), the SNP arrays offer much higher probe density and lower signal-to-noise ratio at the single SNP level. To accurately identify small segments of CNA from SNP array data, segmentation methods that are sensitive to CNA while resistant to noise are required. Results: We have developed a highly sensitive algorithm for the edge detection of copy number data which is especially suitable for the SNP array-based copy number data. The method consists of an over-sensitive edge-detection step and a test-based forward-backward edge selection step. Conclusion: Using simulations constructed from real experimental data, the method shows high sensitivity and specificity in detecting small copy number changes in focused regions. The method is implemented in an R package FASeg, which includes data processing and visualization utilities, as well as libraries for processing Affymetrix SNP array data.
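
    The two-stage idea in this abstract (flag edges generously, then keep only those that pass a test) can be illustrated with a minimal Python sketch. This is not the FASeg implementation: the function names, the window size, the thresholds, the assumed known noise level `sigma`, and the use of plain backward elimination in place of the paper's forward-backward selection are all assumptions made for illustration.

```python
import math
from statistics import mean


def candidate_edges(signal, window=2, min_jump=0.2):
    """Over-sensitive step: flag index i when the means of the windows
    just before and just after i differ by more than a low threshold."""
    edges = []
    for i in range(window, len(signal) - window + 1):
        left = mean(signal[i - window:i])
        right = mean(signal[i:i + window])
        if abs(right - left) > min_jump:
            edges.append(i)
    return edges


def edge_score(signal, start, edge, end, sigma):
    """z-like statistic for the mean difference between the segments
    signal[start:edge] and signal[edge:end], assuming noise sd sigma."""
    left, right = signal[start:edge], signal[edge:end]
    se = sigma * math.sqrt(1 / len(left) + 1 / len(right))
    return abs(mean(left) - mean(right)) / se


def prune_edges(signal, edges, sigma, z_cut=3.0):
    """Test-based selection (hypothetical, backward elimination only):
    repeatedly drop the weakest edge until every remaining edge
    separates its two neighbouring segments with a score >= z_cut."""
    edges = sorted(edges)
    while edges:
        bounds = [0] + edges + [len(signal)]
        scores = [
            edge_score(signal, bounds[k], bounds[k + 1], bounds[k + 2], sigma)
            for k in range(len(edges))
        ]
        weakest = min(range(len(edges)), key=scores.__getitem__)
        if scores[weakest] >= z_cut:
            break
        del edges[weakest]
    return edges


# Toy log-ratio profile with one true copy-number change at index 6.
profile = [0.0, 0.1, -0.1, 0.0, 0.05, -0.05, 1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
candidates = candidate_edges(profile)
kept = prune_edges(profile, candidates, sigma=0.1)
```

    On this toy profile the first step flags several neighbouring positions around the true change, and the pruning step reduces them to a single edge. Real SNP array data would need a noise estimate from the data rather than a fixed `sigma`.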

  10. Subtype-independent near full-length HIV-1 genome sequencing and assembly to be used in large molecular epidemiological studies and clinical management

    Directory of Open Access Journals (Sweden)

    Sebastian Grossmann

    2015-06-01

    Full Text Available. Introduction: HIV-1 near full-length genome (HIV-NFLG) sequencing from plasma is an attractive multidimensional tool to apply in large-scale population-based molecular epidemiological studies. It also enables genotypic resistance testing (GRT) for all drug target sites, allowing effective intervention strategies for control and prevention in high-risk population groups. Thus, the main objective of this study was to develop a simplified subtype-independent, cost- and labour-efficient HIV-NFLG protocol that can be used in clinical management as well as in molecular epidemiological studies. Methods: Plasma samples (n=30) were obtained from HIV-1B (n=10), HIV-1C (n=10), CRF01_AE (n=5) and CRF01_AG (n=5) infected individuals with minimum viral load >1120 copies/ml. The amplification was performed with two large amplicons of 5.5 kb and 3.7 kb, sequenced with 17 primers to obtain HIV-NFLG. GRT was validated against the ViroSeq™ HIV-1 Genotyping System. Results: After excluding four plasma samples with low-quality RNA, a total of 26 samples were attempted. Among them, NFLG was obtained from 24 (92%) samples, with the lowest viral load being 3000 copies/ml. High (>99%) concordance was observed between HIV-NFLG and ViroSeq™ when determining the drug resistance mutations (DRMs). The N384I connection mutation was additionally detected by NFLG in two samples. Conclusions: Our high efficiency subtype-independent HIV-NFLG is a simple and promising approach to be used in large-scale molecular epidemiological studies. It will facilitate the understanding of the HIV-1 pandemic population dynamics and outline effective intervention strategies. Furthermore, it can potentially be applicable in clinical management of drug resistance by evaluating DRMs against all available antiretrovirals in a single assay.

  11. Dominant role of the 5' TAR bulge in dimerization of HIV-1 genomic RNA, but no evidence of TAR-TAR kissing during in vivo virus assembly.

    Science.gov (United States)

    Jalalirad, Mohammad; Saadatmand, Jenan; Laughrea, Michael

    2012-05-08

    The 5' untranslated region of HIV-1 genomic RNA (gRNA) contains two stem-loop structures that appear to be equally important for gRNA dimerization: the 57-nucleotide 5' TAR, at the very 5' end, and the 35-nucleotide SL1 (nucleotides 243-277). SL1 is well-known for containing the dimerization initiation site (DIS) in its apical loop. The DIS is a six-nucleotide palindrome. Here, we investigated the mechanism of TAR-directed gRNA dimerization. We found that the trinucleotide bulge (UCU24) of the 5' TAR has dominant impacts on both formation of HIV-1 RNA dimers and maturation of the formed dimers. The ΔUCU trinucleotide deletion strongly inhibited the first process and blocked the other, thus impairing gRNA dimerization as severely as deletion of the entire 5' TAR, and more severely than deletion of the DIS, inactivation of the viral protease, or most severe mutations in the nucleocapsid protein. The apical loop of TAR contains a 10-nucleotide palindrome that has been postulated to stimulate gRNA dimerization by a TAR-TAR kissing mechanism analogous to the one used by SL1 to stimulate dimerization. Using mutations that strongly destabilize formation of the TAR palindrome duplex, as well as compensatory mutations that restore duplex formation to a wild-type-like level, we found no evidence of TAR-TAR kissing, even though mutations nullifying the kissing potential of the TAR palindrome could impair dimerization by a mechanism other than hindering of SL1. However, nullifying the kissing potential of TAR had much less severe effects than ΔUCU. By not uncovering a dimerization mechanism intrinsic to TAR, our data suggest that TAR mutations exert their effect 3' of TAR, yet not on SL1, because TAR and SL1 mutations have synergistic effects on gRNA dimerization.
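
    The abstract describes the DIS as a six-nucleotide palindrome and the TAR apical loop as containing a 10-nucleotide palindrome. In nucleic-acid terms a palindrome is a sequence equal to its own reverse complement, which is what allows two identical loops to base-pair ("kiss"). A minimal Python sketch of that check follows; the function names are hypothetical, and the example hexamers are chosen only because they are self-complementary, since the abstract itself does not give the sequences.

```python
# Watson-Crick pairing for RNA: A-U and G-C.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    return rna.upper().translate(RNA_COMPLEMENT)[::-1]


def is_palindrome(rna):
    """True when the sequence equals its own reverse complement, so two
    copies of it can pair with each other in antiparallel orientation."""
    return rna.upper() == reverse_complement(rna)


# Illustrative hexamers (assumed for the example, not taken from the abstract).
examples = {seq: is_palindrome(seq) for seq in ("GCGCGC", "GUGCAC", "GCGCGA")}
```

    The first two hexamers are self-complementary and the third is not, so a single point change is enough to remove the pairing potential, which is the kind of mutation the study uses to nullify kissing.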

  12. Genomics of oral bacteria.

    Science.gov (United States)

    Duncan, Margaret J

    2003-01-01

    Advances in bacterial genetics came with the discovery of the genetic code, followed by the development of recombinant DNA technologies. Now the field is undergoing a new revolution because of investigators' ability to sequence and assemble complete bacterial genomes. Over 200 genome projects have been completed or are in progress, and the oral microbiology research community has benefited through projects for oral bacteria and their non-oral-pathogen relatives. This review describes features of several oral bacterial genomes, and emphasizes the themes of species relationships, comparative genomics, and lateral gene transfer. Genomics is having a broad impact on basic research in microbial pathogenesis, and will lead to new approaches in clinical research and therapeutics. The oral microbiota is a unique community especially suited for new challenges to sequence the metagenomes of microbial consortia, and the genomes of uncultivable bacteria.

  13. Sabot assembly

    Energy Technology Data Exchange (ETDEWEB)

    Bzorgi, Fariborz

    2016-11-08

    A sabot assembly includes a projectile and a housing dimensioned and configured for receiving the projectile. An air pressure cavity having a cavity diameter is disposed between a front end and a rear end of the housing. Air intake nozzles are in fluid communication with the air pressure cavity and each has a nozzle diameter less than the cavity diameter. In operation, air flows through the plurality of air intake nozzles and into the air pressure cavity upon firing of the projectile from a gun barrel to pressurize the air pressure cavity for assisting in separation of the housing from the projectile upon the sabot assembly exiting the gun barrel.

  14. The UCSC Genome Browser Database

    DEFF Research Database (Denmark)

    Karolchik, D; Kuhn, R M; Baertsch, R

    2008-01-01

    The University of California, Santa Cruz, Genome Browser Database (GBD) provides integrated sequence and annotation data for a large collection of vertebrate and model organism genomes. Seventeen new assemblies have been added to the database in the past year, for a total coverage of 19 vertebrat...

  15. Cocoa/Cotton Comparative Genomics

    Science.gov (United States)

    With genome sequence from two members of the Malvaceae family recently made available, we are exploring syntenic relationships, gene content, and evolutionary trajectories between the cacao and cotton genomes. An assembly of cacao (Theobroma cacao) using Illumina and 454 sequence technology yielded ...

  16. Microbial Genomics Research in China

    Institute of Scientific and Technical Information of China (English)

    ZHAO Guo-ping

    2004-01-01

    Microorganisms, including phage/virus, were initial targets and tools for developing DNA sequencing technology. Microbial genomic study was started as a model system for the Human Genome Project (HGP), and it successfully supported the HGP, particularly with respect to BAC contig construction and large-scale shotgun sequencing and assembly. Microbial genomics has become the fastest-developing genomics discipline alongside the HGP, taking advantage of the organisms' highly diversified physiology, extremely long evolutionary history, close relationship with humans and the environment, as well as relatively small genome sizes and simple systems for functional analysis.

  17. Microbial Genomics Research in China

    Institute of Scientific and Technical Information of China (English)

    ZHAO Guo-ping

    2004-01-01

    Microorganisms, including phage/virus, were initial targets and tools for developing DNA sequencing technology. Microbial genomic study was started as a model system for the Human Genome Project (HGP), and it successfully supported the HGP, particularly with respect to BAC contig construction and large-scale shotgun sequencing and assembly. Microbial genomics has become the fastest-developing genomics discipline alongside the HGP, taking advantage of the organisms' highly diversified physiology, extremely long evolutionary history, close relationship with humans and the environment, as well as relatively small genome sizes and simple systems for functional analysis.

  18. Exploration of Metagenome Assemblies with an Interactive Visualization Tool

    Energy Technology Data Exchange (ETDEWEB)

    Cantor, Michael; Nordberg, Henrik; Smirnova, Tatyana; Andersen, Evan; Tringe, Susannah; Hess, Matthias; Dubchak, Inna

    2014-07-09

    Metagenomics, one of the fastest growing areas of modern genomic science, is the genetic profiling of the entire community of microbial organisms present in an environmental sample. Elviz is a web-based tool for the interactive exploration of metagenome assemblies. Elviz can be used with publicly available data sets from the Joint Genome Institute or with custom user-loaded assemblies. Elviz is available at genome.jgi.doe.gov/viz

  19. Metagenomic Assembly: Overview, Challenges and Applications

    Science.gov (United States)

    Ghurye, Jay S.; Cepeda-Espinoza, Victoria; Pop, Mihai

    2016-01-01

    Advances in sequencing technologies have led to the increased use of high throughput sequencing in characterizing the microbial communities associated with our bodies and our environment. Critical to the analysis of the resulting data are sequence assembly algorithms able to reconstruct genes and organisms from complex mixtures. Metagenomic assembly involves new computational challenges due to the specific characteristics of the metagenomic data. In this survey, we focus on major algorithmic approaches for genome and metagenome assembly, and discuss the new challenges and opportunities afforded by this new field. We also review several applications of metagenome assembly in addressing interesting biological problems. PMID:27698619

  20. Assembling consumption

    DEFF Research Database (Denmark)

    Assembling Consumption marks a definitive step in the institutionalisation of qualitative business research. By gathering leading scholars and educators who study markets, marketing and consumption through the lenses of philosophy, sociology and anthropology, this book clarifies and applies...... societies. This is an essential reading for both seasoned scholars and advanced students of markets, economies and social forms of consumption....

  1. Reconstruction of Metabolic Pathways, Protein Expression, and Homeostasis Machineries across Maize Bundle Sheath and Mesophyll Chloroplasts: Large-Scale Quantitative Proteomics Using the First Maize Genome Assembly

    Science.gov (United States)

    Friso, Giulia; Majeran, Wojciech; Huang, Mingshu; Sun, Qi; van Wijk, Klaas J.

    2010-01-01

    Chloroplasts in differentiated bundle sheath (BS) and mesophyll (M) cells of maize (Zea mays) leaves are specialized to accommodate C4 photosynthesis. This study provides a reconstruction of how metabolic pathways, protein expression, and homeostasis functions are quantitatively distributed across BS and M chloroplasts. This yielded new insights into cellular specialization. The experimental analysis was based on high-accuracy mass spectrometry, protein quantification by spectral counting, and the first maize genome assembly. A bioinformatics workflow was developed to deal with gene models, protein families, and gene duplications related to the polyploidy of maize; this avoided overidentification of proteins and resulted in more accurate protein quantification. A total of 1,105 proteins were assigned as potential chloroplast proteins, annotated for function, and quantified. Nearly complete coverage of primary carbon, starch, and tetrapyrrole metabolism, as well as excellent coverage for fatty acid synthesis, isoprenoid, sulfur, nitrogen, and amino acid metabolism, was obtained. This showed, for example, quantitative and qualitative cell type-specific specialization in starch biosynthesis, arginine synthesis, nitrogen assimilation, and initial steps in sulfur assimilation. An extensive overview of BS and M chloroplast protein expression and homeostasis machineries (more than 200 proteins) demonstrated qualitative and quantitative differences between M and BS chloroplasts and BS-enhanced levels of the specialized chaperones ClpB3 and HSP90 that suggest active remodeling of the BS proteome. The reconstructed pathways are presented as detailed flow diagrams including annotation, relative protein abundance, and cell-specific expression pattern. Protein annotation and identification data, and projection of matched peptides on the protein models, are available online through the Plant Proteome Database. PMID:20089766

  2. Pig genome sequence - analysis and publication strategy

    NARCIS (Netherlands)

    Archibald, A.L.; Bolund, L.; Churcher, C.; Fredholm, M.; Groenen, M.A.M.; Harlizius, B.

    2010-01-01

    Background - The pig genome is being sequenced and characterised under the auspices of the Swine Genome Sequencing Consortium. The sequencing strategy followed a hybrid approach combining hierarchical shotgun sequencing of BAC clones and whole genome shotgun sequencing. Results - Assemblies of the B

  3. Combining de novo and reference-guided assembly with scaffold_builder

    NARCIS (Netherlands)

    Silva, G.G.; Dutilh, B.E.; Matthews, T.D.; Elkins, K.; Schmieder, R.; Dinsdale, E.A.; Edwards, R.A.

    2013-01-01

    Genome sequencing has become routine, however genome assembly still remains a challenge despite the computational advances in the last decade. In particular, the abundance of repeat elements in genomes makes it difficult to assemble them into a single complete sequence. Identical repeats shorter tha

  4. Identifying wrong assemblies in de novo short read primary sequence assembly contigs

    Indian Academy of Sciences (India)

    VANDNA CHAWLA; RAJNISH KUMAR; RAVI SHANKAR

    2016-09-01

    With the advent of short-read-based genome sequencing approaches, a large number of organisms are being sequenced all over the world. Most of these assemblies are done using de novo short read assemblers and other related approaches. However, the contigs produced this way are prone to wrong assembly. So far, there is a conspicuous dearth of reliable tools to identify mis-assembled contigs. Mis-assemblies could result from incorrectly deleted or wrongly arranged genomic sequences. In the present work various factors related to sequence, sequencing and assembling have been assessed for their role in causing mis-assembly by using different genome sequencing data. Finally, some mis-assembly detecting tools have been evaluated for their ability to detect the wrongly assembled primary contigs, suggesting a lot of scope for improvement in this area. The present work also proposes a simple unsupervised learning-based novel approach to identify mis-assemblies in the contigs, which was found to perform reasonably well when compared to the already existing tools to report mis-assembled contigs. It was observed that the proposed methodology may work as a complementary system to the existing tools to enhance their accuracy.

  5. Components of Adenovirus Genome Packaging

    Science.gov (United States)

    Ahi, Yadvinder S.; Mittal, Suresh K.

    2016-01-01

    Adenoviruses (AdVs) are icosahedral viruses with double-stranded DNA (dsDNA) genomes. Genome packaging in AdV is thought to be similar to that seen in dsDNA containing icosahedral bacteriophages and herpesviruses. Specific recognition of the AdV genome is mediated by a packaging domain located close to the left end of the viral genome and is mediated by the viral packaging machinery. Our understanding of the role of various components of the viral packaging machinery in AdV genome packaging has greatly advanced in recent years. Characterization of empty capsids assembled in the absence of one or more components involved in packaging, identification of the unique vertex, and demonstration of the role of IVa2, the putative packaging ATPase, in genome packaging have provided compelling evidence that AdVs follow a sequential assembly pathway. This review provides a detailed discussion on the functions of the various viral and cellular factors involved in AdV genome packaging. We conclude by briefly discussing the roles of the empty capsids, assembly intermediates, scaffolding proteins, portal vertex and DNA encapsidating enzymes in AdV assembly and packaging. PMID:27721809

  6. Assembly of Repeat Content Using Next Generation Sequencing Data

    Energy Technology Data Exchange (ETDEWEB)

    LaButti, Kurt; Kuo, Alan; Grigoriev, Igor; Copeland, Alex

    2014-03-17

    Repetitive organisms pose a challenge for short read assembly, and typically only unique regions and repeat regions shorter than the read length, can be accurately assembled. Recently, we have been investigating the use of Pacific Biosciences reads for de novo fungal assembly. We will present an assessment of the quality and degree of repeat reconstruction possible in a fungal genome using long read technology. We will also compare differences in assembly of repeat content using short read and long read technology.

  7. Dump assembly

    Science.gov (United States)

    Goldmann, Louis H.

    1986-01-01

    A dump assembly having a fixed conduit and a rotatable conduit provided with overlapping plates, respectively, at their adjacent ends. The plates are formed with openings, respectively, normally offset from each other to block flow. The other end of the rotatable conduit is provided with means for securing the open end of a filled container thereto. Rotation of the rotatable conduit raises and inverts the container to empty the contents while concurrently aligning the conduit openings to permit flow of material therethrough.

  8. General Assembly

    CERN Multimedia

    Staff Association

    2016-01-01

    5th April, 2016 – Ordinary General Assembly of the Staff Association! In the first semester of each year, the Staff Association (SA) invites its members to attend and participate in the Ordinary General Assembly (OGA). This year the OGA will be held on Tuesday, April 5th 2016 from 11:00 to 12:00 in BE Auditorium, Meyrin (6-2-024). During the Ordinary General Assembly, the activity and financial reports of the SA are presented and submitted for approval to the members. This is the occasion to get a global view on the activities of the SA, its financial management, and an opportunity to express one’s opinion, including taking part in the votes. Other points are listed on the agenda, as proposed by the Staff Council. Who can vote? Only “ordinary” members (MPE) of the SA can vote. Associated members (MPA) of the SA and/or affiliated pensioners have a right to vote on those topics that are of direct interest to them. Who can give his/her opinion? The Ordinary General Asse...

  9. Genome Improvement at JGI-HAGSC

    Energy Technology Data Exchange (ETDEWEB)

    Grimwood, Jane; Schmutz, Jeremy J.; Myers, Richard M.

    2012-03-03

    Since the completion of the sequencing of the human genome, the Joint Genome Institute (JGI) has rapidly expanded its scientific goals in several DOE mission-relevant areas. At the JGI-HAGSC, we have kept pace with this rapid expansion of projects with our focus on assessing, assembling, improving and finishing eukaryotic whole genome shotgun (WGS) projects for which the shotgun sequence is generated at the Production Genomic Facility (JGI-PGF). We follow this by combining the draft WGS with genomic resources generated at JGI-HAGSC or in collaborator laboratories (including BAC end sequences, genetic maps and FLcDNA sequences) to produce an improved draft sequence. For eukaryotic genomes important to the DOE mission, we then add further information from directed experiments to produce reference genomic sequences that are publicly available for any scientific researcher. Also, we have continued our program for producing BAC-based finished sequence, both for adding information to JGI genome projects and for small BAC-based sequencing projects proposed through any of the JGI sequencing programs. We have now built our computational expertise in WGS assembly and analysis and have moved eukaryotic genome assembly from the JGI-PGF to JGI-HAGSC. We have concentrated our assembly development work on large plant genomes and complex fungal and algal genomes.

  10. Preliminary High-Throughput Metagenome Assembly

    Energy Technology Data Exchange (ETDEWEB)

    Dusheyko, Serge; Furman, Craig; Pangilinan, Jasmyn; Shapiro, Harris; Tu, Hank

    2007-03-26

    Metagenome data sets present a qualitatively different assembly problem than traditional single-organism whole-genome shotgun (WGS) assembly. The unique aspects of such projects include the presence of a potentially large number of distinct organisms and their representation in the data set at widely different fractions. In addition, multiple closely related strains could be present, which would be difficult to assemble separately. Failure to take these issues into account can result in poor assemblies that either jumble together different strains or which fail to yield useful results. The DOE Joint Genome Institute has sequenced a number of metagenomic projects and plans to considerably increase this number in the coming year. As a result, the JGI has a need for high-throughput tools and techniques for handling metagenome projects. We present the techniques developed to handle metagenome assemblies in a high-throughput environment. This includes a streamlined assembly wrapper, based on the JGI's in-house WGS assembler, Jazz. It also includes the selection of sensible defaults targeted for metagenome data sets, as well as quality control automation for cleaning up the raw results. While analysis is ongoing, we will discuss preliminary assessments of the quality of the assembly results (http://fames.jgi-psf.org).

  11. Genome sequence of Psychrobacter cibarius strain W1

    DEFF Research Database (Denmark)

    Raghupathi, Prem Krishnan; Herschend, Jakob; Røder, Henriette Lyng

    2016-01-01

    Here, we report the draft genome sequence of Psychrobacter cibarius strain W1, which was isolated at a slaughterhouse in Denmark. The 3.63-Mb genome sequence was assembled into 241 contigs.

  12. Genome sequence and analysis of the tuber crop potato

    DEFF Research Database (Denmark)

    Xu, X.; Pan, S.; Cheng, S.

    2011-01-01

    and assemble 86% of the 844-megabase genome. We predict 39,031 protein-coding genes and present evidence for at least two genome duplication events indicative of a palaeopolyploid origin. As the first genome sequence of an asterid, the potato genome reveals 2,642 genes specific to this large angiosperm clade...

  13. General Assembly

    CERN Multimedia

    Staff Association

    2015-01-01

    Tuesday 5 May at 11:00, Room 13-2-005. In accordance with the Statutes of the Staff Association, an Ordinary General Assembly is held once a year (Article IV.2.1). Draft agenda: 1- Adoption of the agenda. 2- Approval of the minutes of the Ordinary General Assembly of 22 May 2014. 3- Presentation and approval of the 2014 activity report. 4- Presentation and approval of the 2014 financial report. 5- Presentation and approval of the auditors' report for 2014. 6- Programme for 2015. 7- Presentation and approval of the draft budget for 2015 and the membership fee rate for 2015. 8- No amendments to the Statutes of the Staff Association are proposed. 9- Election of the members of the Commission...

  14. General Assembly

    CERN Multimedia

    Staff Association

    2016-01-01

    Tuesday 5 April at 11:00, BE Auditorium, Meyrin (6-2-024). In accordance with the Statutes of the Staff Association, an Ordinary General Assembly is held once a year (Article IV.2.1). Draft agenda: Adoption of the agenda. Approval of the minutes of the Ordinary General Assembly of 5 May 2015. Presentation and approval of the 2015 activity report. Presentation and approval of the 2015 financial report. Presentation and approval of the auditors' report for 2015. Work programme for 2016. Presentation and approval of the draft budget for 2016. Approval of the membership fee rate for 2017. Proposed amendments to the Statutes of the Staff Association. Election of the members of the Commission...

  15. General assembly

    CERN Multimedia

    Staff Association

    2015-01-01

    Tuesday 5 May at 11:00, Room 13-2-005. In accordance with the Statutes of the Staff Association, an Ordinary General Assembly is held once a year (Article IV.2.1). Draft agenda: Adoption of the agenda. Approval of the minutes of the Ordinary General Assembly of 22 May 2014. Presentation and approval of the 2014 activity report. Presentation and approval of the 2014 financial report. Presentation and approval of the auditors' report for 2014. Programme for 2015. Presentation and approval of the draft budget for 2015 and the membership fee rate for 2015. No amendments to the Statutes of the Staff Association are proposed. Election of the members of the Electoral Commission. ...

  16. General Assembly

    CERN Multimedia

    Staff Association

    2017-01-01

    In accordance with the Statutes of the Staff Association, an Ordinary General Assembly is held once a year (Article IV.2.1). Draft agenda: Adoption of the agenda. Approval of the minutes of the Ordinary General Assembly of 5 April 2016. Presentation and approval of the 2016 activity report. Presentation and approval of the 2016 financial report. Presentation and approval of the auditors' report for 2016. Work programme for 2017. Presentation and approval of the draft budget for 2017. Approval of the membership fee rate for 2018. Proposed amendments to the Statutes of the Staff Association. Election of the members of the Electoral Commission. Election of the ...

  17. Automatic Tool for Local Assembly Structures

    Energy Technology Data Exchange (ETDEWEB)

    2016-10-11

    Whole community shotgun sequencing of total DNA (i.e. metagenomics) and total RNA (i.e. metatranscriptomics) has provided a wealth of information on microbial community structure, predicted functions, and metabolic networks, and is even able to reconstruct complete genomes directly. Here we present ATLAS (Automatic Tool for Local Assembly Structures), a comprehensive pipeline for assembly, annotation, and genomic binning of metagenomic and metatranscriptomic data with an integrated framework for Multi-Omics. This will provide an open source tool for the Multi-Omic community at large.

  18. NCBI viral genomes resource.

    Science.gov (United States)

    Brister, J Rodney; Ako-Adjei, Danso; Bao, Yiming; Blinkova, Olga

    2015-01-01

    Recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. Yet, any potential benefits from the billowing cloud of next generation sequence data hinge upon well implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. The NCBI Viral Genomes Resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. The resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/ and catalogs all publicly available virus genome sequences and curates reference genome sequences. As the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences. The rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets.

  19. The perennial ryegrass GenomeZipper: targeted use of genome resources for comparative grass genomics.

    Science.gov (United States)

    Pfeifer, Matthias; Martis, Mihaela; Asp, Torben; Mayer, Klaus F X; Lübberstedt, Thomas; Byrne, Stephen; Frei, Ursula; Studer, Bruno

    2013-02-01

    Whole-genome sequences established for model and major crop species constitute a key resource for advanced genomic research. For outbreeding forage and turf grass species like ryegrasses (Lolium spp.), such resources have yet to be developed. Here, we present a model of the perennial ryegrass (Lolium perenne) genome on the basis of conserved synteny to barley (Hordeum vulgare) and the model grass genome Brachypodium (Brachypodium distachyon) as well as rice (Oryza sativa) and sorghum (Sorghum bicolor). A transcriptome-based genetic linkage map of perennial ryegrass served as a scaffold to establish the chromosomal arrangement of syntenic genes from model grass species. This scaffold revealed a high degree of synteny and macrocollinearity and was then utilized to anchor a collection of perennial ryegrass genes in silico to their predicted genome positions. This resulted in the unambiguous assignment of 3,315 out of 8,876 previously unmapped genes to the respective chromosomes. In total, the GenomeZipper incorporates 4,035 conserved grass gene loci, which were used for the first genome-wide sequence divergence analysis between perennial ryegrass, barley, Brachypodium, rice, and sorghum. The perennial ryegrass GenomeZipper is an ordered, information-rich genome scaffold, facilitating map-based cloning and genome assembly in perennial ryegrass and closely related Poaceae species. It also represents a milestone in describing synteny between perennial ryegrass and fully sequenced model grass genomes, thereby increasing our understanding of genome organization and evolution in the most important temperate forage and turf grass species.

  20. The tile assembly model is intrinsically universal

    CERN Document Server

    Doty, David; Patitz, Matthew J; Schweller, Robert T; Summers, Scott M; Woods, Damien

    2011-01-01

    We prove that the abstract Tile Assembly Model (aTAM) of nanoscale self-assembly is intrinsically universal. This means that there is a single tile assembly system U that, with proper initialization, simulates any tile assembly system T. The simulation is "intrinsic" in the sense that the self-assembly process carried out by U is exactly that carried out by T, with each tile of T represented by an m x m "supertile" of U. Our construction works for the full aTAM at any temperature, and it faithfully simulates the deterministic or nondeterministic behavior of each T. Our construction succeeds by solving an analog of the cell differentiation problem in developmental biology: Each supertile of U, starting with those in the seed assembly, carries the "genome" of the simulated system T. At each location of a potential supertile in the self-assembly of U, a decision is made whether and how to express this genome, i.e., whether to generate a supertile and, if so, which tile of T it will represent. This decision must ...

  1. Whole-Genome Sequences of Three Symbiotic Endozoicomonas Bacteria

    KAUST Repository

    Neave, Matthew J.

    2014-08-14

    Members of the genus Endozoicomonas associate with a wide range of marine organisms. Here, we report on the whole-genome sequencing, assembly, and annotation of three Endozoicomonas type strains. These data will assist in exploring interactions between Endozoicomonas organisms and their hosts, and it will aid in the assembly of genomes from uncultivated Endozoicomonas spp.

  2. Rewriting the blueprint of life by synthetic genomics and genome engineering

    OpenAIRE

    Annaluru, Narayana; Ramalingam, Sivaprakash; Chandrasegaran, Srinivasan

    2015-01-01

    Advances in DNA synthesis and assembly methods over the past decade have made it possible to construct genome-size fragments from oligonucleotides. Early work focused on synthesis of small viral genomes, followed by hierarchical synthesis of wild-type bacterial genomes and subsequently on transplantation of synthesized bacterial genomes into closely related recipient strains. More recently, a synthetic designer version of yeast Saccharomyces cerevisiae chromosome III has been generated, with ...

  3. Building the sequence map of the human pan-genome

    DEFF Research Database (Denmark)

    Li, Ruiqiang; Li, Yingrui; Zheng, Hancheng

    2010-01-01

    Here we integrate the de novo assembly of an Asian and an African genome with the NCBI reference human genome, as a step toward constructing the human pan-genome. We identified approximately 5 Mb of novel sequences not present in the reference genome in each of these assemblies. Most novel...... analysis of predicted genes indicated that the novel sequences contain potentially functional coding regions. We estimate that a complete human pan-genome would contain approximately 19-40 Mb of novel sequence not present in the extant reference genome. The extensive amount of novel sequence contributing...... to the genetic variation of the pan-genome indicates the importance of using complete genome sequencing and de novo assembly....

  4. Inventory control: cytochrome c oxidase assembly regulates mitochondrial translation.

    Science.gov (United States)

    Mick, David U; Fox, Thomas D; Rehling, Peter

    2011-01-01

    Mitochondria maintain genome and translation machinery to synthesize a small subset of subunits of the oxidative phosphorylation system. To build up functional enzymes, these organellar gene products must assemble with imported subunits that are encoded in the nucleus. New findings on the early steps of cytochrome c oxidase assembly reveal how the mitochondrial translation of its core component, cytochrome c oxidase subunit 1 (Cox1), is directly coupled to the assembly of this respiratory complex.

  5. Inventory control: cytochrome oxidase assembly regulates mitochondrial translation

    Science.gov (United States)

    Mick, David U.; Fox, Thomas D.; Rehling, Peter

    2012-01-01

    Mitochondria maintain a genome and translation-machinery to synthesize a small subset of subunits of the oxidative phosphorylation system. These organellar gene products must assemble with imported subunits that are encoded in the nucleus to build up functional enzymes. New findings on the early steps in cytochrome oxidase assembly reveal how the mitochondrial translation of its core component Cox1 is directly coupled to the assembly of this respiratory complex. PMID:21179059

  6. Comparing de novo assemblers for 454 transcriptome data

    Directory of Open Access Journals (Sweden)

    Blaxter Mark L

    2010-10-01

    Background: Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, it is important to correctly assemble these to estimate the sequence of all the transcripts. Most transcriptome assembly projects use only one program for assembling 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis. Results: Although no single assembler performed best on all our criteria, Newbler 2.5 gave longer contigs, better alignments to some reference sequences, and was fast and easy to use. SeqMan assemblies performed best on the criterion of recapitulating known transcripts, and had more novel sequence than the other assemblers, but generated an excess of small, redundant contigs. The remaining assemblers all performed almost as well, with the exception of Newbler 2.3 (the version currently used by most assembly projects), which generated assemblies that had significantly lower total length. As different assemblers use different underlying algorithms to generate contigs, we also explored merging of assemblies and found that the merged datasets not only aligned better to reference sequences than individual assemblies, but were also more consistent in the number and size of contigs. Conclusions: Transcriptome assemblies are smaller than genome assemblies and thus should be more computationally tractable, but are often harder because individual contigs can have highly variable read coverage. Comparing single assemblers, Newbler 2.5 performed best on our trial data set, but other assemblers were closely comparable. Combining differently optimal assemblies
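
    The record above compares assemblers by contig length, total assembly length, and the number and size of contigs. As a rough illustration of how such summary figures can be computed, here is a minimal Python sketch. It is not the authors' evaluation code: the function name `contig_stats`, the N50 metric, and the two toy lists of contig lengths are assumptions made for this example only, and the abstract does not say that N50 was one of the criteria used.

```python
def contig_stats(lengths):
    """Summarize a list of contig lengths (in bases).

    Returns the number of contigs, total length, longest contig, and N50
    (the length of the contig at which the running sum of lengths, taken
    from longest to shortest, first reaches half of the total length).
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if 2 * running >= total:  # reached half of the total assembly length
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "longest_bp": ordered[0],
        "n50_bp": n50,
    }


# Hypothetical contig lengths for two assemblies of the same read set.
assembly_a = [900, 700, 400, 300, 200]
assembly_b = [500, 500, 400, 400, 300, 200, 200]

for name, lengths in (("A", assembly_a), ("B", assembly_b)):
    print(name, contig_stats(lengths))
```

    Both toy assemblies have the same total length, but assembly A reaches half of that total with fewer, longer contigs, so its N50 is higher. This is one reason contiguity metrics are usually reported alongside total length rather than instead of it.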

  7. Draft Genome Sequence of "Terrisporobacter othiniensis" Isolated from a Blood Culture from a Human Patient

    DEFF Research Database (Denmark)

    Lund, Lars Christian; Sydenham, Thomas Vognbjerg; Høgh, Silje Vermedal

    2015-01-01

    "Terrisporobacter othiniensis" (proposed species) was isolated from a blood culture. Genomic DNA was sequenced using a MiSeq benchtop sequencer (Illumina) and assembled using the SPAdes genome assembler. This resulted in a draft genome sequence comprising 3,980,019 bp in 167 contigs containing 3...

  8. A comparison of rice chloroplast genomes

    DEFF Research Database (Denmark)

    Tang, Jiabin; Xia, Hong'ai; Cao, Mengliang

    2004-01-01

    Using high quality sequence reads extracted from our whole genome shotgun repository, we assembled two chloroplast genome sequences from two rice (Oryza sativa) varieties, one from 93-11 (a typical indica variety) and the other from PA64S (an indica-like variety with maternal origin of japonica),...

  9. Prospects for Genomic Research in Forestry

    Directory of Open Access Journals (Sweden)

    K. V. Krutovsky

    2014-08-01

    Full Text Available Conifers are keystone species of boreal forests. Their whole genome sequencing, assembly and annotation will allow us to understand the evolution of the complex ancient giant conifer genomes that are 4 times larger in larch and 7–9 times larger in pines than the human genome. Genomic studies will also make it possible to obtain important whole genome sequence data and develop highly polymorphic and informative genetic markers, such as microsatellites and single nucleotide polymorphisms (SNPs), that can be efficiently used in timber origin identification, for genetic variation monitoring, to study local and climate change adaptation and in tree improvement and conservation programs.

  10. The ecoresponsive genome of Daphnia pulex

    Energy Technology Data Exchange (ETDEWEB)

    Colbourne, John K.; Pfrender, Michael E.; Gilbert, Donald; Thomas, W. Kelley; Tucker, Abraham; Oakley, Todd H.; Tokishita, Shinichi; Aerts, Andrea; Arnold, Georg J.; Basu, Malay Kumar; Bauer, Darren J.; Caceres, Carla E.; Carmel, Liran; Casola, Claudio; Choi, Jeong-Hyeon; Detter, John C.; Dong, Qunfeng; Dusheyko, Serge; Eads, Brian D.; Frohlich, Thomas; Geiler-Samerotte, Kerry A.; Gerlach, Daniel; Hatcher, Phil; Jogdeo, Sanjuro; Krijgsveld, Jeroen; Kriventseva, Evgenia V; Kültz, Dietmar; Laforsch, Christian; Lindquist, Erika; Lopez, Jacqueline; Manak, Robert; Muller, Jean; Pangilinan, Jasmyn; Patwardhan, Rupali P.; Pitluck, Samuel; Pritham, Ellen J.; Rechtsteiner, Andreas; Rho, Mina; Rogozin, Igor B.; Sakarya, Onur; Salamov, Asaf; Schaack, Sarah; Shapiro, Harris; Shiga, Yasuhiro; Skalitzky, Courtney; Smith, Zachary; Souvorov, Alexander; Sung, Way; Tang, Zuojian; Tsuchiya, Dai; Tu, Hank; Vos, Harmjan; Wang, Mei; Wolf, Yuri I.; Yamagata, Hideo; Yamada, Takuji; Ye, Yuzhen; Shaw, Joseph R.; Andrews, Justen; Crease, Teresa J.; Tang, Haixu; Lucas, Susan M.; Robertson, Hugh M.; Bork, Peer; Koonin, Eugene V.; Zdobnov, Evgeny M.; Grigoriev, Igor V.; Lynch, Michael; Boore, Jeffrey L.

    2011-02-04

    This document provides supporting material related to the sequencing of the ecoresponsive genome of Daphnia pulex. This material includes information on materials and methods and supporting text, as well as supplemental figures, tables, and references. The coverage of materials and methods addresses genome sequence, assembly, and mapping to chromosomes, gene inventory, attributes of a compact genome, the origin and preservation of Daphnia pulex genes, implications of Daphnia's genome structure, evolutionary diversification of duplicated genes, functional significance of expanded gene families, and ecoresponsive genes. Supporting text covers chromosome studies, gene homology among Daphnia genomes, micro-RNA and transposable elements and the 46 Daphnia pulex opsins. 36 figures, 50 tables, 183 references.

  11. Pig genome sequence - analysis and publication strategy

    DEFF Research Database (Denmark)

    Archibald, Alan L.; Bolund, Lars; Churcher, Carol;

    2010-01-01

    BACKGROUND: The pig genome is being sequenced and characterised under the auspices of the Swine Genome Sequencing Consortium. The sequencing strategy followed a hybrid approach combining hierarchical shotgun sequencing of BAC clones and whole genome shotgun sequencing. RESULTS: Assemblies......) is under construction and will incorporate whole genome shotgun sequence (WGS) data providing > 30x genome coverage. The WGS sequence, most of which comprise short Illumina/Solexa reads, were generated from DNA from the same single Duroc sow as the source of the BAC library from which clones were...

  12. A Taste of Algal Genomes from the Joint Genome Institute

    Energy Technology Data Exchange (ETDEWEB)

    Kuo, Alan; Grigoriev, Igor

    2012-06-17

    Algae play profound roles in aquatic food chains and the carbon cycle, can impose health and economic costs through toxic blooms, provide models for the study of symbiosis, photosynthesis, and eukaryotic evolution, and are candidate sources for bio-fuels; all of these research areas are part of the mission of DOE's Joint Genome Institute (JGI). To date JGI has sequenced, assembled, annotated, and released to the public the genomes of 18 species and strains of algae, sampling almost all of the major clades of photosynthetic eukaryotes. With more algal genomes currently undergoing analysis, JGI continues its commitment to driving forward basic and applied algal science. Among these ongoing projects are the pan-genome of the dominant coccolithophore Emiliania huxleyi, the interrelationships between the 4 genomes in the nucleomorph-containing Bigelowiella natans and Guillardia theta, and the search for symbiosis genes of lichens.

  13. Challenges in Whole-Genome Annotation of Pyrosequenced Eukaryotic Genomes

    Energy Technology Data Exchange (ETDEWEB)

    Kuo, Alan; Grigoriev, Igor

    2009-04-17

    Pyrosequencing technologies such as 454/Roche and Solexa/Illumina vastly lower the cost of nucleotide sequencing compared to the traditional Sanger method, and thus promise to greatly expand the number of sequenced eukaryotic genomes. However, the new technologies also bring new challenges such as shorter reads and new kinds and higher rates of sequencing errors, which complicate genome assembly and gene prediction. At JGI we are deploying 454 technology for the sequencing and assembly of ever-larger eukaryotic genomes. Here we describe our first whole-genome annotation of a purely 454-sequenced fungal genome that is larger than a yeast (>30 Mbp). The pezizomycotine (filamentous ascomycote) Aspergillus carbonarius belongs to the Aspergillus section Nigri species complex, members of which are significant as platforms for bioenergy and bioindustrial technology, as members of soil microbial communities and players in the global carbon cycle, and as agricultural toxigens. Application of a modified version of the standard JGI Annotation Pipeline has so far predicted ~10k genes. ~12 percent of these preliminary annotations suffer a potential frameshift error, which is somewhat higher than the ~9 percent rate in the Sanger-sequenced and conventionally assembled and annotated genome of fellow Aspergillus section Nigri member A. niger. Also, >90 percent of A. niger genes have potential homologs in the A. carbonarius preliminary annotation. We conclude, and with further annotation and comparative analysis expect to confirm, that 454 sequencing strategies provide a promising substrate for annotation of modestly sized eukaryotic genomes. We will also present results of annotation of a number of other pyrosequenced fungal genomes of bioenergy interest.

  14. Probe tip heating assembly

    Science.gov (United States)

    Schmitz, Roger William; Oh, Yunje

    2016-10-25

    A heating assembly configured for use in mechanical testing at a scale of microns or less. The heating assembly includes a probe tip assembly configured for coupling with a transducer of the mechanical testing system. The probe tip assembly includes a probe tip heater system having a heating element, a probe tip coupled with the probe tip heater system, and a heater socket assembly. The heater socket assembly, in one example, includes a yoke and a heater interface that form a socket within the heater socket assembly. The probe tip heater system, coupled with the probe tip, is slidably received and clamped within the socket.

  15. misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads.

    Science.gov (United States)

    Zhu, Xiao; Leung, Henry C M; Wang, Rongjie; Chin, Francis Y L; Yiu, Siu Ming; Quan, Guangri; Li, Yajie; Zhang, Rui; Jiang, Qinghua; Liu, Bo; Dong, Yucui; Zhou, Guohui; Wang, Yadong

    2015-11-16

    Because of the short read length of high-throughput sequencing data, errors are introduced during genome assembly, which may have an adverse impact on downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with a similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and identifying inconsistent features along mis-assembled sequences. However, the former approach cannot distinguish real structural variations between the target genome and the reference genome, while the latter approach can produce many false positive detections (correctly assembled sequences being considered mis-assembled). We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct these errors at their mis-assembled positions, improving assembly accuracy for downstream analysis. It combines information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Assembly errors and correct assemblies corresponding to structural variations can be detected by comparing the reference genome and the assembled sequence. Different types of assembly errors can then be distinguished within the mis-assembled sequence by analyzing the aligned paired-end reads using multiple features derived from coverage and consistency of insert distance, to obtain high-confidence error calls. We tested the performance of misFinder on both simulated and real paired-end read data, and misFinder gave accurate error calls with very few miscalls. We further compared misFinder with QUAST and REAPR. misFinder outperformed QUAST and REAPR by 1) identifying more true positive mis-assemblies with very few false positives and false negatives, and 2) distinguishing the correct assemblies corresponding to structural variations from mis-assembled sequence.
misFinder can be freely downloaded
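
    The two read-based signals named in the abstract above, coverage and consistency of insert distance, can be illustrated with a minimal Python sketch. This is not misFinder's implementation: the function names, the half-open (left, right) representation of an aligned read pair, and the fixed tolerance threshold are all assumptions made for illustration.

```python
def physical_coverage(pairs, contig_len):
    """Count, for each contig position, how many read pairs span it.

    `pairs` holds hypothetical (left, right) half-open intervals, one per
    aligned read pair. A drop to zero hints at a possible mis-join.
    """
    diff = [0] * (contig_len + 1)
    for left, right in pairs:
        diff[left] += 1
        diff[right] -= 1
    coverage, running = [], 0
    for delta in diff[:contig_len]:
        running += delta
        coverage.append(running)
    return coverage


def discordant_pairs(pairs, expected_insert, tolerance):
    """Return pairs whose insert distance deviates from the library's
    expected insert size by more than `tolerance` bases (assumed threshold)."""
    return [
        (left, right)
        for left, right in pairs
        if abs((right - left) - expected_insert) > tolerance
    ]


if __name__ == "__main__":
    pairs = [(0, 300), (10, 900), (50, 340)]
    print(discordant_pairs(pairs, expected_insert=300, tolerance=50))
    print(physical_coverage([(0, 4), (2, 6)], contig_len=8))
```

    A real tool would estimate the expected insert size and its spread from the alignments themselves and would combine several such features before calling an error; the fixed threshold here only keeps the example short.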

  16. DNA assembly for plant biology: techniques and tools.

    Science.gov (United States)

    Patron, Nicola J

    2014-06-01

    As the speed and accuracy of genome sequencing improves, there are ever-increasing resources available for the design and construction of synthetic DNA parts. These can be used to engineer plant genomes to produce new functions or to elucidate the function of endogenous sequences. Until recently the assembly of amplified or cloned sequences into large and complex designs was a limiting step in plant synthetic biology and biotechnology. A number of new methods for assembling DNA molecules have been developed in the last few years, several of which have been applied to the production of molecules used to modify plant genomes.

  17. Spaced Seed Data Structures for De Novo Assembly

    Directory of Open Access Journals (Sweden)

    Inanç Birol

    2015-01-01

    Full Text Available De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.

  18. Spaced Seed Data Structures for De Novo Assembly.

    Science.gov (United States)

    Birol, Inanç; Chu, Justin; Mohamadi, Hamid; Jackman, Shaun D; Raghavan, Karthika; Vandervalk, Benjamin P; Raymond, Anthony; Warren, René L

    2015-01-01

    De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
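
    The idea of a spaced seed as a pair of subsequences separated by a fixed distance, together with a table of seed frequencies, can be sketched in a few lines of Python. This is an illustration only, not the data structures described in the paper: the function names are hypothetical, and an exact in-memory `Counter` stands in for the Bloom-filter-based designs the abstract mentions.

```python
from collections import Counter


def spaced_seeds(read, k, gap):
    """Yield (left, right) k-mer pairs separated by exactly `gap` bases.

    Each seed covers 2 * k + gap bases of the read but only stores 2 * k of
    them, so a longer span is sampled without a longer key.
    """
    span = 2 * k + gap
    for start in range(len(read) - span + 1):
        left = read[start:start + k]
        right = read[start + k + gap:start + span]
        yield left, right


def count_spaced_seeds(reads, k, gap):
    """Tally how often each spaced seed is observed across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(spaced_seeds(read, k, gap))
    return counts


if __name__ == "__main__":
    counts = count_spaced_seeds(["ACGTACGT", "ACGTACGT"], k=2, gap=3)
    for seed, n in counts.items():
        print(seed, n)
```

    Seeds seen only once are often candidates for sequencing errors, which is one reason a frequency table is useful; a probabilistic structure would trade exact counts for a much smaller memory footprint.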

  19. Newnes electronics assembly handbook

    CERN Document Server

    Brindley, Keith

    2013-01-01

    Newnes Electronics Assembly Handbook: Techniques, Standards and Quality Assurance focuses on the aspects of electronic assembling. The handbook first looks at the printed circuit board (PCB). Base materials, basic mechanical properties, cleaning of assemblies, design, and PCB manufacturing processes are then explained. The text also discusses surface mounted assemblies and packaging of electromechanical assemblies, as well as the soldering process. Requirements for the soldering process; solderability and protective coatings; cleaning of PCBs; and mass solder/component reflow soldering are des

  20. The Chlamydomonas genome project: a decade on

    Science.gov (United States)

    Blaby, Ian K.; Blaby-Haas, Crysten; Tourasse, Nicolas; Hom, Erik F. Y.; Lopez, David; Aksoy, Munevver; Grossman, Arthur; Umen, James; Dutcher, Susan; Porter, Mary; King, Stephen; Witman, George; Stanke, Mario; Harris, Elizabeth H.; Goodstein, David; Grimwood, Jane; Schmutz, Jeremy; Vallon, Olivier; Merchant, Sabeeha S.; Prochnik, Simon

    2014-01-01

    The green alga Chlamydomonas reinhardtii is a popular unicellular organism for studying photosynthesis, cilia biogenesis and micronutrient homeostasis. Ten years since its genome project was initiated, an iterative process of improvements to the genome and gene predictions has propelled this organism to the forefront of the “omics” era. Housed at Phytozome, the Joint Genome Institute’s (JGI) plant genomics portal, the most up-to-date genomic data include a genome arranged on chromosomes and high-quality gene models with alternative splice forms supported by an abundance of RNA-Seq data. Here, we present the past, present and future of Chlamydomonas genomics. Specifically, we detail progress on genome assembly and gene model refinement, discuss resources for gene annotations, functional predictions and locus ID mapping between versions and, importantly, outline a standardized framework for naming genes. PMID:24950814

  1. PGSB/MIPS Plant Genome Information Resources and Concepts for the Analysis of Complex Grass Genomes.

    Science.gov (United States)

    Spannagl, Manuel; Bader, Kai; Pfeifer, Matthias; Nussbaumer, Thomas; Mayer, Klaus F X

    2016-01-01

    PGSB (Plant Genome and Systems Biology; formerly MIPS-Munich Institute for Protein Sequences) has been involved in developing, implementing and maintaining plant genome databases for more than a decade. Genome databases and analysis resources have focused on individual genomes and aim to provide flexible and maintainable datasets for model plant genomes as a backbone against which experimental data, e.g., from high-throughput functional genomics, can be organized and analyzed. In addition, genomes from both model and crop plants form a scaffold for comparative genomics, assisted by specialized tools such as the CrowsNest viewer to explore conserved gene order (synteny) between related species on macro- and micro-levels. The genomes of many economically important Triticeae plants such as wheat, barley, and rye present a great challenge for sequence assembly and bioinformatic analysis due to their enormous complexity and large genome size. Novel concepts and strategies have been developed to deal with these difficulties and have been applied to the genomes of wheat, barley, rye, and other cereals. This includes the GenomeZipper concept, reference-guided exome assembly, and "chromosome genomics" based on flow cytometry sorted chromosomes.

  2. Genome Sequence of Actinobacillus suis Type Strain ATCC 33415T.

    Science.gov (United States)

    Calcutt, Michael J; Foecking, Mark F; Mhlanga-Mutangadura, Tendai; Reilly, Thomas J

    2014-09-18

    The assembled and annotated genome of Actinobacillus suis ATCC 33415(T) is reported here. The 2,501,598-bp genome encodes 2,246 open reading frames (ORFs) with strain variable incursion of an integrative conjugative element into a tRNA locus. Comparative analysis of the deduced gene set should inform our understanding of pathogenesis, genomic plasticity, and serotype variation.

  3. The evolution of the Anopheles 16 genomes project

    NARCIS (Netherlands)

    Neafsey, Daniel E.; Christophides, George K.; Collins, Frank H.; Emrich, Scott J.; Fontaine, Michael C.; Gelbart, William; Hahn, Matthew W.; Howell, Paul I.; Kafatos, Fotis C.; Lawson, Daniel; Muskavitch, Marc A. T.; Waterhouse, Robert M.; Williams, Louise J.; Besansky, Nora J.

    2013-01-01

    We report the imminent completion of a set of reference genome assemblies for 16 species of Anopheles mosquitoes. In addition to providing a generally useful resource for comparative genomic analyses, these genome sequences will greatly facilitate exploration of the capacity exhibited by some Anophe

  4. High molecular weight DNA assembly in vivo for synthetic biology applications.

    Science.gov (United States)

    Juhas, Mario; Ajioka, James W

    2017-05-01

    DNA assembly is the key technology of the emerging interdisciplinary field of synthetic biology. While the assembly of smaller DNA fragments is usually performed in vitro, high molecular weight DNA molecules are assembled in vivo via homologous recombination in the host cell. Escherichia coli, Bacillus subtilis and Saccharomyces cerevisiae are the main hosts used for DNA assembly in vivo. Progress in DNA assembly over the last few years has paved the way for the construction of whole genomes. This review provides an update on recent synthetic biology advances with particular emphasis on high molecular weight DNA assembly in vivo in E. coli, B. subtilis and S. cerevisiae. Special attention is paid to the assembly of whole genomes, such as those of the first synthetic cell, synthetic yeast and minimal genomes.

  5. Strategies for complete plastid genome sequencing.

    Science.gov (United States)

    Twyford, Alex D; Ness, Rob W

    2016-10-28

    Plastid sequencing is an essential tool in the study of plant evolution. This high-copy organelle is one of the most technically accessible regions of the genome, and its sequence conservation makes it a valuable region for comparative genome evolution, phylogenetic analysis and population studies. Here, we discuss recent innovations and approaches for de novo plastid assembly that harness genomic tools. We focus on technical developments including low-cost sequence library preparation approaches for genome skimming, enrichment via hybrid baits and methylation-sensitive capture, sequence platforms with higher read outputs and longer read lengths, and automated tools for assembly. These developments allow for a much more streamlined assembly than via conventional short-range PCR. Although newer methods make complete plastid sequencing possible for any land plant or green alga, there are still challenges for producing finished plastomes particularly from herbarium material or from structurally divergent plastids such as those of parasitic plants.

  6. The genome of Theobroma cacao.

    Science.gov (United States)

    Argout, Xavier; Salse, Jerome; Aury, Jean-Marc; Guiltinan, Mark J; Droc, Gaetan; Gouzy, Jerome; Allegre, Mathilde; Chaparro, Cristian; Legavre, Thierry; Maximova, Siela N; Abrouk, Michael; Murat, Florent; Fouet, Olivier; Poulain, Julie; Ruiz, Manuel; Roguet, Yolande; Rodier-Goud, Maguy; Barbosa-Neto, Jose Fernandes; Sabot, Francois; Kudrna, Dave; Ammiraju, Jetty Siva S; Schuster, Stephan C; Carlson, John E; Sallet, Erika; Schiex, Thomas; Dievart, Anne; Kramer, Melissa; Gelley, Laura; Shi, Zi; Bérard, Aurélie; Viot, Christopher; Boccara, Michel; Risterucci, Ange Marie; Guignon, Valentin; Sabau, Xavier; Axtell, Michael J; Ma, Zhaorong; Zhang, Yufan; Brown, Spencer; Bourge, Mickael; Golser, Wolfgang; Song, Xiang; Clement, Didier; Rivallan, Ronan; Tahi, Mathias; Akaza, Joseph Moroh; Pitollat, Bertrand; Gramacho, Karina; D'Hont, Angélique; Brunel, Dominique; Infante, Diogenes; Kebe, Ismael; Costet, Pierre; Wing, Rod; McCombie, W Richard; Guiderdoni, Emmanuel; Quetier, Francis; Panaud, Olivier; Wincker, Patrick; Bocs, Stephanie; Lanaud, Claire

    2011-02-01

    We sequenced and assembled the draft genome of Theobroma cacao, an economically important tropical-fruit tree crop that is the source of chocolate. This assembly corresponds to 76% of the estimated genome size and contains almost all previously described genes, with 82% of these genes anchored on the 10 T. cacao chromosomes. Analysis of this sequence information highlighted specific expansion of some gene families during evolution, for example, flavonoid-related genes. It also provides a major source of candidate genes for T. cacao improvement. Based on the inferred paleohistory of the T. cacao genome, we propose an evolutionary scenario whereby the ten T. cacao chromosomes were shaped from an ancestor through eleven chromosome fusions.

  7. The Analysis of De Novo Genome Assembly Software Based on De Bruijn Graph

    Institute of Scientific and Technical Information of China (English)

    孟金涛; 苑建蕊; 魏彦杰; 冯圣中

    2013-01-01

    With the development of next-generation sequencing technology, new whole-genome assembly algorithms have emerged, and assembly software designed for the massive volumes of short reads produced by third-generation high-throughput sequencers continues to be developed and is gradually reaching the market. However, because these assemblers differ in applicability and performance, choosing a well-performing assembler, or developing a parallel, high-throughput one, has become a major challenge. Here we test four de novo assemblers based on the De Bruijn graph algorithm (Velvet, SOAPdenovo, IDBA and ABySS) on simulated data from the genomes of four species, and analyze them in terms of algorithm, assembly performance and assembly quality, comparing computational time, assembly accuracy and integrity. From the characteristics of their algorithms we infer the key factors that affect the performance of these programs, give recommendations for their use, and point out issues to consider when developing parallel sequence assembly tools for very large genomic data sets. The comparison is intended to help researchers select a well-suited assembler and to inform the further development of existing assemblers.

  8. Cancer genomics

    DEFF Research Database (Denmark)

    Norrild, Bodil; Guldberg, Per; Ralfkiær, Elisabeth Methner

    2007-01-01

    Almost all cells in the human body contain a complete copy of the genome with an estimated number of 25,000 genes. The sequences of these genes make up about three percent of the genome and comprise the inherited set of genetic information. The genome also contains information that determines whe...

  9. Bovine Genome Database: new tools for gleaning function from the Bos taurus genome.

    Science.gov (United States)

    Elsik, Christine G; Unni, Deepak R; Diesh, Colin M; Tayal, Aditi; Emery, Marianne L; Nguyen, Hung N; Hagen, Darren E

    2016-01-01

    We report an update of the Bovine Genome Database (BGD) (http://BovineGenome.org). The goal of BGD is to support bovine genomics research by providing genome annotation and data mining tools. We have developed new genome and annotation browsers using JBrowse and WebApollo for two Bos taurus genome assemblies, the reference genome assembly (UMD3.1.1) and the alternate genome assembly (Btau_4.6.1). Annotation tools have been customized to highlight priority genes for annotation, and to aid annotators in selecting gene evidence tracks from 91 tissue specific RNAseq datasets. We have also developed BovineMine, based on the InterMine data warehousing system, to integrate the bovine genome, annotation, QTL, SNP and expression data with external sources of orthology, gene ontology, gene interaction and pathway information. BovineMine provides powerful query building tools, as well as customized query templates, and allows users to analyze and download genome-wide datasets. With BovineMine, bovine researchers can use orthology to leverage the curated gene pathways of model organisms, such as human, mouse and rat. BovineMine will be especially useful for gene ontology and pathway analyses in conjunction with GWAS and QTL studies.

  10. Autonomous electrochromic assembly

    Energy Technology Data Exchange (ETDEWEB)

    Berland, Brian Spencer; Lanning, Bruce Roy; Stowell, Jr., Michael Wayne

    2015-03-10

    This disclosure describes systems and methods for creating an autonomous electrochromic assembly, and systems and methods for use of the autonomous electrochromic assembly in combination with a window. Embodiments described herein include an electrochromic assembly that has an electrochromic device, an energy storage device, an energy collection device, and an electrochromic controller device. These devices may be combined into a unitary electrochromic insert assembly. The electrochromic assembly may have the capability of generating power sufficient to operate and control an electrochromic device. This control may occur through the application of a voltage to an electrochromic device to change its opacity state. The electrochromic assembly may be used in combination with a window.

  11. HIV-1 assembly in macrophages

    Directory of Open Access Journals (Sweden)

    Benaroch Philippe

    2010-04-01

    Full Text Available Abstract The molecular mechanisms involved in the assembly of newly synthesized Human Immunodeficiency Virus (HIV) particles are poorly understood. Most of the work on HIV-1 assembly has been performed in T cells in which viral particle budding and assembly take place at the plasma membrane. In contrast, few studies have been performed on macrophages, the other major target of HIV-1. Infected macrophages represent a viral reservoir and probably play a key role in HIV-1 physiopathology. Indeed macrophages retain infectious particles for long periods of time, keeping them protected from anti-viral immune response or drug treatments. Here, we present an overview of what is known about HIV-1 assembly in macrophages as compared to T lymphocytes or cell lines. Early electron microscopy studies suggested that viral assembly takes place at the limiting membrane of an intracellular compartment in macrophages and not at the plasma membrane as in T cells. This was first considered as a late endosomal compartment in which viral budding seems to be similar to the process of vesicle release into multi-vesicular bodies. This view was notably supported by a large body of evidence involving the ESCRT (Endosomal Sorting Complex Required for Transport) machinery in HIV-1 budding, the observation of viral budding profiles in such compartments by immuno-electron microscopy, and the presence of late endosomal markers associated with macrophage-derived virions. However, this model needs to be revisited as recent data indicate that the viral compartment has a neutral pH and can be connected to the plasma membrane via very thin micro-channels. To date, the exact nature and biogenesis of the HIV assembly compartment in macrophages remains elusive. Many cellular proteins potentially involved in the late phases of HIV-1 cycle have been identified; and, recently, the list has grown rapidly with the publication of four independent genome-wide screens.
However, their respective

  12. A scalable and accurate targeted gene assembly tool (SAT-Assembler) for next-generation sequencing data.

    Science.gov (United States)

    Zhang, Yuan; Sun, Yanni; Cole, James R

    2014-08-01

    Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Lacking quality reference genomes, de novo assembly is commonly used for RNA-Seq data of non-model organisms and metagenomic data. However, heterogeneous sequence coverage caused by heterogeneous expression or species abundance, similarity between isoforms or homologous genes, and large data size all pose challenges to de novo assembly. As a result, existing assembly tools tend to output fragmented contigs or chimeric contigs, or have high memory footprint. In this work, we introduce a targeted gene assembly program SAT-Assembler, which aims to recover gene families of particular interest to biologists. It addresses the above challenges by conducting family-specific homology search, homology-guided overlap graph construction, and careful graph traversal. It can be applied to both RNA-Seq and metagenomic data. Our experimental results on an Arabidopsis RNA-Seq data set and two metagenomic data sets show that SAT-Assembler has smaller memory usage, comparable or better gene coverage, and lower chimera rate for assembling a set of genes from one or multiple pathways compared with other assembly tools. Moreover, the family-specific design and rapid homology search allow SAT-Assembler to be naturally compatible with parallel computing platforms. The source code of SAT-Assembler is available at https://sourceforge.net/projects/sat-assembler/. The data sets and experimental settings can be found in supplementary material.
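
    As background for the overlap graph mentioned in the abstract above, here is a minimal Python sketch of a plain suffix-prefix overlap graph built from short reads. It does not reproduce SAT-Assembler's homology-guided construction or its graph traversal: the function names, the exact-match overlap criterion, and the quadratic all-against-all comparison are assumptions chosen to keep the example small.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    `min_len` is assumed to be at least 1.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len):
    """Map each read to the reads it overlaps, with the overlap length.

    Edges are directed: an edge a -> b means the end of `a` matches the
    start of `b`, so `b` could extend `a` during traversal.
    """
    graph = {read: [] for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n:
                    graph[a].append((b, n))
    return graph


if __name__ == "__main__":
    for read, edges in overlap_graph(["ACGTT", "GTTAC"], min_len=3).items():
        print(read, edges)
```

    Raising `min_len` removes short, spurious overlaps at the cost of fragmenting low-coverage regions, the same trade-off that uneven RNA-Seq or metagenomic coverage makes difficult in practice.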

  13. A scalable and accurate targeted gene assembly tool (SAT-Assembler) for next-generation sequencing data.

    Directory of Open Access Journals (Sweden)

    Yuan Zhang

    2014-08-01

    Full Text Available Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Lacking quality reference genomes, de novo assembly is commonly used for RNA-Seq data of non-model organisms and metagenomic data. However, heterogeneous sequence coverage caused by heterogeneous expression or species abundance, similarity between isoforms or homologous genes, and large data size all pose challenges to de novo assembly. As a result, existing assembly tools tend to output fragmented contigs or chimeric contigs, or have high memory footprint. In this work, we introduce a targeted gene assembly program SAT-Assembler, which aims to recover gene families of particular interest to biologists. It addresses the above challenges by conducting family-specific homology search, homology-guided overlap graph construction, and careful graph traversal. It can be applied to both RNA-Seq and metagenomic data. Our experimental results on an Arabidopsis RNA-Seq data set and two metagenomic data sets show that SAT-Assembler has smaller memory usage, comparable or better gene coverage, and lower chimera rate for assembling a set of genes from one or multiple pathways compared with other assembly tools. Moreover, the family-specific design and rapid homology search allow SAT-Assembler to be naturally compatible with parallel computing platforms. The source code of SAT-Assembler is available at https://sourceforge.net/projects/sat-assembler/. The data sets and experimental settings can be found in supplementary material.

  14. Draft Genome Sequence of Microdochium bolleyi, a Dark Septate Fungal Endophyte of Beach Grass

    OpenAIRE

    David, Aaron S; Haridas, Sajeet; LaButti, Kurt; Lim, Joanne; Lipzen, Anna; Wang, Mei; Barry, Kerrie; Grigoriev, Igor V.; Spatafora, Joseph W.; May, Georgiana

    2016-01-01

    Here, we present the genome sequence of the dark septate fungal endophyte Microdochium bolleyi (Ascomycota, Sordariomycetes, Xylariales). The assembled genome size was 38.84 Mbp and consisted of 173 scaffolds and 13,177 predicted genes.

  15. Draft Genome Sequence of Microdochium bolleyi, a Dark Septate Fungal Endophyte of Beach Grass.

    Science.gov (United States)

    David, Aaron S; Haridas, Sajeet; LaButti, Kurt; Lim, Joanne; Lipzen, Anna; Wang, Mei; Barry, Kerrie; Grigoriev, Igor V; Spatafora, Joseph W; May, Georgiana

    2016-04-28

    Here, we present the genome sequence of the dark septate fungal endophyte Microdochium bolleyi (Ascomycota, Sordariomycetes, Xylariales). The assembled genome size was 38.84 Mbp and consisted of 173 scaffolds and 13,177 predicted genes.

  16. The wolf reference genome sequence (Canis lupus lupus) and its implications for Canis spp. population genomics

    DEFF Research Database (Denmark)

    Gopalakrishnan, Shyam; Samaniego Castruita, Jose Alfredo; Sinding, Mikkel Holger Strander

    2017-01-01

    that regardless of the reference genome choice, most evolutionary genomic analyses yield qualitatively similar results, including those exploring the structure between the wolves and dogs using admixture and principal component analysis. However, we do observe differences in the genomic coverage of re......, then using the boxer reference genome is appropriate, but if the aim of the study is to look at the variation within wolves and their relationships to dogs, then there are clear benefits to using the de novo assembled wolf reference genome....

  17. Rapid, economical single-nucleotide polymorphism and microsatellite discovery based on de novo assembly of a reduced representation genome in a non-model organism: a case study of Atlantic cod Gadus morhua.

    Science.gov (United States)

    Carlsson, J; Gauthier, D T; Carlsson, J E L; Coughlan, J P; Dillane, E; Fitzgerald, R D; Keating, U; McGinnity, P; Mirimin, L; Cross, T F

    2013-03-01

    By combining next-generation sequencing technology (454) and reduced representation library (RRL) construction, the rapid and economical isolation of over 25 000 potential single-nucleotide polymorphisms (SNP) and >6000 putative microsatellite loci from c. 2% of the genome of the non-model teleost, Atlantic cod Gadus morhua from the Celtic Sea, south of Ireland, was demonstrated. A small-scale validation of markers indicated that 80% (11 of 14) of SNP loci and 40% (6 of 15) of the microsatellite loci could be amplified and showed variability. The results clearly show that small-scale next-generation sequencing of RRL genomes is an economical and rapid approach for simultaneous SNP and microsatellite discovery that is applicable to any species. The low cost and relatively small investment in time allows for positive exploitation of ascertainment bias to design markers applicable to specific populations and study questions.

  18. The bonobo genome compared with the chimpanzee and human genomes

    Science.gov (United States)

    Prüfer, Kay; Munch, Kasper; Hellmann, Ines; Akagi, Keiko; Miller, Jason R.; Walenz, Brian; Koren, Sergey; Sutton, Granger; Kodira, Chinnappa; Winer, Roger; Knight, James R.; Mullikin, James C.; Meader, Stephen J.; Ponting, Chris P.; Lunter, Gerton; Higashino, Saneyuki; Hobolth, Asger; Dutheil, Julien; Karakoç, Emre; Alkan, Can; Sajjadian, Saba; Catacchio, Claudia Rita; Ventura, Mario; Marques-Bonet, Tomas; Eichler, Evan E.; André, Claudine; Atencia, Rebeca; Mugisha, Lawrence; Junhold, Jörg; Patterson, Nick; Siebauer, Michael; Good, Jeffrey M.; Fischer, Anne; Ptak, Susan E.; Lachmann, Michael; Symer, David E.; Mailund, Thomas; Schierup, Mikkel H.; Andrés, Aida M.; Kelso, Janet; Pääbo, Svante

    2012-01-01

    Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other. PMID:22722832

  19. The bonobo genome compared with the chimpanzee and human genomes.

    Science.gov (United States)

    Prüfer, Kay; Munch, Kasper; Hellmann, Ines; Akagi, Keiko; Miller, Jason R; Walenz, Brian; Koren, Sergey; Sutton, Granger; Kodira, Chinnappa; Winer, Roger; Knight, James R; Mullikin, James C; Meader, Stephen J; Ponting, Chris P; Lunter, Gerton; Higashino, Saneyuki; Hobolth, Asger; Dutheil, Julien; Karakoç, Emre; Alkan, Can; Sajjadian, Saba; Catacchio, Claudia Rita; Ventura, Mario; Marques-Bonet, Tomas; Eichler, Evan E; André, Claudine; Atencia, Rebeca; Mugisha, Lawrence; Junhold, Jörg; Patterson, Nick; Siebauer, Michael; Good, Jeffrey M; Fischer, Anne; Ptak, Susan E; Lachmann, Michael; Symer, David E; Mailund, Thomas; Schierup, Mikkel H; Andrés, Aida M; Kelso, Janet; Pääbo, Svante

    2012-06-28

    Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.

  20. Genomic treasure troves: complete genome sequencing of herbarium and insect museum specimens.

    Directory of Open Access Journals (Sweden)

    Martijn Staats

    Unlocking the vast genomic diversity stored in natural history collections would create unprecedented opportunities for genome-scale evolutionary, phylogenetic, domestication and population genomic studies. Many researchers have been discouraged from using historical specimens in molecular studies because of both generally limited success of DNA extraction and the challenges associated with PCR-amplifying highly degraded DNA. In today's next-generation sequencing (NGS) world, opportunities and prospects for historical DNA have changed dramatically, as most NGS methods are actually designed for taking short fragmented DNA molecules as templates. Here we show that using a standard multiplex and paired-end Illumina sequencing approach, genome-scale sequence data can be generated reliably from dry-preserved plant, fungal and insect specimens collected up to 115 years ago, and with minimal destructive sampling. Using a reference-based assembly approach, we were able to produce the entire nuclear genome of a 43-year-old Arabidopsis thaliana (Brassicaceae) herbarium specimen with high and uniform sequence coverage. Nuclear genome sequences of three fungal specimens of 22-82 years of age (Agaricus bisporus, Laccaria bicolor, Pleurotus ostreatus) were generated with 81.4-97.9% exome coverage. Complete organellar genome sequences were assembled for all specimens. Using de novo assembly we retrieved between 16.2-71.0% of coding sequence regions, and hence remain somewhat cautious about prospects for de novo genome assembly from historical specimens. Non-target sequence contaminations were observed in 2 of our insect museum specimens. We anticipate that future museum genomics projects will perhaps not generate entire genome sequences in all cases (our specimens contained relatively small and low-complexity genomes), but at least generating vital comparative genomic data for testing (phylo)genetic, demographic and genetic hypotheses, that become increasingly more

  1. Genomic treasure troves: complete genome sequencing of herbarium and insect museum specimens.

    Science.gov (United States)

    Staats, Martijn; Erkens, Roy H J; van de Vossenberg, Bart; Wieringa, Jan J; Kraaijeveld, Ken; Stielow, Benjamin; Geml, József; Richardson, James E; Bakker, Freek T

    2013-01-01

    Unlocking the vast genomic diversity stored in natural history collections would create unprecedented opportunities for genome-scale evolutionary, phylogenetic, domestication and population genomic studies. Many researchers have been discouraged from using historical specimens in molecular studies because of both generally limited success of DNA extraction and the challenges associated with PCR-amplifying highly degraded DNA. In today's next-generation sequencing (NGS) world, opportunities and prospects for historical DNA have changed dramatically, as most NGS methods are actually designed for taking short fragmented DNA molecules as templates. Here we show that using a standard multiplex and paired-end Illumina sequencing approach, genome-scale sequence data can be generated reliably from dry-preserved plant, fungal and insect specimens collected up to 115 years ago, and with minimal destructive sampling. Using a reference-based assembly approach, we were able to produce the entire nuclear genome of a 43-year-old Arabidopsis thaliana (Brassicaceae) herbarium specimen with high and uniform sequence coverage. Nuclear genome sequences of three fungal specimens of 22-82 years of age (Agaricus bisporus, Laccaria bicolor, Pleurotus ostreatus) were generated with 81.4-97.9% exome coverage. Complete organellar genome sequences were assembled for all specimens. Using de novo assembly we retrieved between 16.2-71.0% of coding sequence regions, and hence remain somewhat cautious about prospects for de novo genome assembly from historical specimens. Non-target sequence contaminations were observed in 2 of our insect museum specimens. We anticipate that future museum genomics projects will perhaps not generate entire genome sequences in all cases (our specimens contained relatively small and low-complexity genomes), but at least generating vital comparative genomic data for testing (phylo)genetic, demographic and genetic hypotheses, that become increasingly more horizontal

  2. Polymer Directed Protein Assemblies

    NARCIS (Netherlands)

    van Rijn, Patrick

    2013-01-01

    Protein aggregation and protein self-assembly are important occurrences in natural systems, and are in some form or other dictated by biopolymers. Very obvious influences of biopolymers on protein assemblies are, e.g., virus particles. Viruses are a multi-protein assembly of which the morphology is

  3. Polymer Directed Protein Assemblies

    NARCIS (Netherlands)

    van Rijn, Patrick

    Protein aggregation and protein self-assembly are important occurrences in natural systems, and are in some form or other dictated by biopolymers. Very obvious influences of biopolymers on protein assemblies are, e.g., virus particles. Viruses are a multi-protein assembly of which the morphology is

  4. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences

    OpenAIRE

    Li, Heng

    2015-01-01

    Motivation: Single Molecule Real-Time (SMRT) sequencing technology and Oxford Nanopore technologies (ONT) produce reads over 10 kbp in length, which have enabled high-quality genome assembly at an affordable cost. However, at present, long reads have an error rate as high as 10-15%. Complex and computationally intensive pipelines are required to assemble such reads. Results: We present a new mapper, minimap, and a de novo assembler, miniasm, for efficiently mapping and assembling SMRT and ONT ...

  5. Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology

    DEFF Research Database (Denmark)

    Cao, Hongzhi; Hastie, Alex R.; Cao, Dandan;

    2014-01-01

    than 1 kb. Excluding the 59 SVs (54 insertions/deletions, 5 inversions) that overlap with N-base gaps in the reference assembly hg19, 666 non-gap SVs remained, and 396 of them (60%) were verified by paired-end data from whole-genome sequencing-based re-sequencing or de novo assembly sequence from...... fosmid data. Of the remaining 270 SVs, 260 are insertions and 213 overlap known SVs in the Database of Genomic Variants. Overall, 609 out of 666 (90%) variants were supported by experimental orthogonal methods or historical evidence in public databases. At the same time, genome mapping also provides...
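    The counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 609 supported variants are the sum of the 396 sequencing-verified SVs and the 213 that overlap known SVs. That reading is an inference from the numbers, not something the abstract states, and the variable names are illustrative only.

```python
# Counts taken from the abstract.
non_gap_svs = 666
verified_by_sequencing = 396
overlap_known_svs = 213

# Derived values; the composition of `supported` is an assumed reading.
remaining = non_gap_svs - verified_by_sequencing
supported = verified_by_sequencing + overlap_known_svs


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


print(remaining)                                         # SVs left after sequencing-based verification
print(supported)                                         # SVs with any supporting evidence
print(round(pct(verified_by_sequencing, non_gap_svs), 1))
print(round(pct(supported, non_gap_svs), 1))
```

    Under that assumption the totals match the abstract (270 remaining, 609 supported), and the computed fractions, roughly 59.5% and 91.4%, agree with the reported 60% and 90% only approximately.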

  6. Sensor mount assemblies and sensor assemblies

    Science.gov (United States)

    Miller, David H [Redondo Beach, CA]

    2012-04-10

    Sensor mount assemblies and sensor assemblies are provided. In an embodiment, by way of example only, a sensor mount assembly includes a busbar, a main body, a backing surface, and a first finger. The busbar has a first end and a second end. The main body is overmolded onto the busbar. The backing surface extends radially outwardly relative to the main body. The first finger extends axially from the backing surface, and the first finger has a first end, a second end, and a tooth. The first end of the first finger is disposed on the backing surface, and the tooth is formed on the second end of the first finger.

  7. StellaBase: The Nematostella vectensis Genomics Database

    OpenAIRE

    James C Sullivan; Ryan, Joseph F; Watson, James A.; Webb, Jeramy; Mullikin, James C; Rokhsar, Daniel; Finnerty, John R

    2005-01-01

    StellaBase, the Nematostella vectensis Genomics Database, is a web-based resource that will facilitate desktop and bench-top studies of the starlet sea anemone. Nematostella is an emerging model organism that has already proven useful for addressing fundamental questions in developmental evolution and evolutionary genomics. StellaBase allows users to query the assembled Nematostella genome, a confirmed gene library, and a predicted genome using both keyword and homology based search functions...

  8. The genome portal of the Department of Energy Joint Genome Institute: 2014 updates

    Energy Technology Data Exchange (ETDEWEB)

    Nordberg, Henrik [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Cantor, Michael [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Dusheyko, Serge [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Hua, Susan [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Poliakov, Alexander [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Shabalov, Igor [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Smirnova, Tatyana [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Grigoriev, Igor V. [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Dubchak, Inna [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)

    2013-11-12

    The U.S. Department of Energy (DOE) Joint Genome Institute (JGI), a national user facility, serves the diverse scientific community by providing integrated high-throughput sequencing and computational analysis to enable system-based scientific approaches in support of DOE missions related to clean energy generation and environmental characterization. The JGI Genome Portal (http://genome.jgi.doe.gov) provides unified access to all JGI genomic databases and analytical tools. The JGI maintains extensive data management systems and specialized analytical capabilities to manage and interpret complex genomic data. A user can search, download and explore multiple data sets available for all DOE JGI sequencing projects including their status, assemblies and annotations of sequenced genomes. In this paper, we describe major updates of the Genome Portal in the past 2 years with a specific emphasis on efficient handling of the rapidly growing amount of diverse genomic data accumulated in JGI.

  9. De novo assembly of highly diverse viral populations

    Directory of Open Access Journals (Sweden)

    Yang Xiao

    2012-09-01

    Background: Extensive genetic diversity in viral populations within infected hosts and the divergence of variants from existing reference genomes impede the analysis of deep viral sequencing data. A de novo population consensus assembly is valuable both as a single linear representation of the population and as a backbone on which intra-host variants can be accurately mapped. The availability of consensus assemblies and robustly mapped variants are crucial to the genetic study of viral disease progression, transmission dynamics, and viral evolution. Existing de novo assembly techniques fail to robustly assemble ultra-deep sequence data from genetically heterogeneous populations such as viruses into full-length genomes due to the presence of extensive genetic variability, contaminants, and variable sequence coverage. Results: We present VICUNA, a de novo assembly algorithm suitable for generating consensus assemblies from genetically heterogeneous populations. We demonstrate its effectiveness on Dengue, Human Immunodeficiency and West Nile viral populations, representing a range of intra-host diversity. Compared to state-of-the-art assemblers designed for haploid or diploid systems, VICUNA recovers full-length consensus and captures insertion/deletion polymorphisms in diverse samples. Final assemblies maintain a high base calling accuracy. VICUNA program is publicly available at: http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/viral-genomics-analysis-software. Conclusions: We developed VICUNA, a publicly available software tool, that enables consensus assembly of ultra-deep sequence derived from diverse viral populations. While VICUNA was developed for the analysis of viral populations, its application to other heterogeneous sequence data sets such as metagenomic or tumor cell population samples may prove beneficial in these fields of research.
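    To make the idea of a "population consensus" concrete, the sketch below takes reads that are assumed to be already aligned to a common coordinate system (equal length, `-` marking gaps) and calls the majority symbol in each column, dropping columns where the gap wins. This is only a minimal illustration of the concept and is not how VICUNA works; the function name and the toy reads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    if len({len(r) for r in aligned_reads}) != 1:
        raise ValueError("aligned reads must all have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        winner, _count = Counter(column).most_common(1)[0]
        if winner != gap:  # a column where the gap wins is a minority insertion
            consensus.append(winner)
    return "".join(consensus)


# Toy alignment (hypothetical): one read carries a substitution (T->A at
# column 3) and another carries a single-base insertion at column 4.
aligned = [
    "ACGT-AC",
    "ACGA-AC",
    "ACGTTAC",
]
result = majority_consensus(aligned)
print(result)
```

    The minority substitution and insertion are exactly the kind of intra-host variants that would later be mapped back onto such a consensus backbone.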

  10. The Rice Genome Knowledgebase (RGKbase): an annotation database for rice comparative genomics and evolutionary biology.

    Science.gov (United States)

    Wang, Dapeng; Xia, Yan; Li, Xinna; Hou, Lixia; Yu, Jun

    2013-01-01

    Over the past 10 years, genomes of cultivated rice cultivars and their wild counterparts have been sequenced, although most efforts are focused on genome assembly and annotation of two major cultivated rice (Oryza sativa L.) subspecies, 93-11 (indica) and Nipponbare (japonica). To integrate information from genome assemblies and annotations for better analysis and application, we now introduce a comparative rice genome database, the Rice Genome Knowledgebase (RGKbase, http://rgkbase.big.ac.cn/RGKbase/). RGKbase is built to have three major components: (i) integrated data curation for rice genomics and molecular biology, which includes genome sequence assemblies, transcriptomic and epigenomic data, genetic variations, quantitative trait loci (QTLs) and the relevant literature; (ii) user-friendly viewers, such as Gbrowse, GeneBrowse and Circos, for genome annotations and evolutionary dynamics; and (iii) bioinformatic tools for compositional and synteny analyses, gene family classifications, gene ontology terms and pathways and gene co-expression networks. RGKbase currently includes data from five rice cultivars and species: Nipponbare (japonica), 93-11 (indica), PA64s (indica), the African rice (Oryza glaberrima) and a wild rice species (Oryza brachyantha). We are also constantly introducing new datasets from a variety of public efforts, such as two recent releases: sequence data from ∼1000 rice varieties, which are mapped into the reference genome, yielding ample high-quality single-nucleotide polymorphisms and insertions-deletions.

  11. Soldering in electronics assembly

    CERN Document Server

    Judd, Mike

    2013-01-01

    Soldering in Electronics Assembly discusses several concerns in soldering of electronic assemblies. The book is comprised of nine chapters that tackle different areas in electronic assembly soldering. Chapter 1 discusses the soldering process itself, while Chapter 2 covers the electronic assemblies. Chapter 3 talks about solders and Chapter 4 deals with flux. The text also tackles the CS and SC soldering process. The cleaning of soldered assemblies, solder quality, and standards and specifications are also discussed. The book will be of great use to professionals who deal with electronic assem

  12. Assembly of eukaryotic algal chromosomes in yeast

    OpenAIRE

    Karas, Bogumil J.; Molparia, Bhuvan; Jablanovic, Jelena; Hermann, Wolfgang J; Lin, Ying-Chi; Dupont, Christopher L.; Tagwerker, Christian; Yonemoto, Isaac T.; Noskov, Vladimir N.; Chuang, Ray-Yuan; Allen, Andrew E; Glass, John I.; Hutchison, Clyde A; Smith, Hamilton O; Venter, J Craig

    2013-01-01

    Background: Synthetic genomic approaches offer unique opportunities to use powerful yeast and Escherichia coli genetic systems to assemble and modify chromosome-sized molecules before returning the modified DNA to the target host. For example, the entire 1 Mb Mycoplasma mycoides chromosome can be stably maintained and manipulated in yeast before being transplanted back into recipient cells. We have previously demonstrated that cloning in yeast of large (> ~ 150 kb), high G + C (55%) prokaryoti...

  13. Evolution of bird genomes-a transposon's-eye view.

    Science.gov (United States)

    Kapusta, Aurélie; Suh, Alexander

    2017-02-01

    Birds, the most species-rich monophyletic group of land vertebrates, have been subject to some of the most intense sequencing efforts to date, making them an ideal case study for recent developments in genomics research. Here, we review how our understanding of bird genomes has changed with the recent sequencing of more than 75 species from all major avian taxa. We illuminate avian genome evolution from a previously neglected perspective: their repetitive genomic parasites, transposable elements (TEs) and endogenous viral elements (EVEs). We show that (1) birds are unique among vertebrates in terms of their genome organization; (2) information about the diversity of avian TEs and EVEs is changing rapidly; (3) flying birds have smaller genomes yet more TEs than flightless birds; (4) current second-generation genome assemblies fail to capture the variation in avian chromosome number and genome size determined with cytogenetics; (5) the genomic microcosm of bird-TE "arms races" has yet to be explored; and (6) upcoming third-generation genome assemblies suggest that birds exhibit stability in gene-rich regions and instability in TE-rich regions. We emphasize that integration of cytogenetics and single-molecule technologies with repeat-resolved genome assemblies is essential for understanding the evolution of (bird) genomes. © 2016 New York Academy of Sciences.

  14. The Oxytricha trifallax macronuclear genome: a complex eukaryotic genome with 16,000 tiny chromosomes.

    Directory of Open Access Journals (Sweden)

    Estienne C Swart

    The macronuclear genome of the ciliate Oxytricha trifallax displays an extreme and unique eukaryotic genome architecture with extensive genomic variation. During sexual genome development, the expressed, somatic macronuclear genome is whittled down to the genic portion of a small fraction (∼5%) of its precursor "silent" germline micronuclear genome by a process of "unscrambling" and fragmentation. The tiny macronuclear "nanochromosomes" typically encode single, protein-coding genes (a small portion, 10%, encode 2-8 genes), have minimal noncoding regions, and are differentially amplified to an average of ∼2,000 copies. We report the high-quality genome assembly of ∼16,000 complete nanochromosomes (∼50 Mb haploid genome size) that vary from 469 bp to 66 kb long (mean ∼3.2 kb) and encode ∼18,500 genes. Alternative DNA fragmentation processes ∼10% of the nanochromosomes into multiple isoforms that usually encode complete genes. Nucleotide diversity in the macronucleus is very high (SNP heterozygosity is ∼4.0%), suggesting that Oxytricha trifallax may have one of the largest known effective population sizes of eukaryotes. Comparison to other ciliates with nonscrambled genomes and long macronuclear chromosomes (on the order of 100 kb) suggests several candidate proteins that could be involved in genome rearrangement, including domesticated MULE and IS1595-like DDE transposases. The assembly of the highly fragmented Oxytricha macronuclear genome is the first completed genome with such an unusual architecture. This genome sequence provides tantalizing glimpses into novel molecular biology and evolution. For example, Oxytricha maintains tens of millions of telomeres per cell and has also evolved an intriguing expansion of telomere end-binding proteins. In conjunction with the micronuclear genome in progress, the O. trifallax macronuclear genome will provide an invaluable resource for investigating programmed genome rearrangements, complementing
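    The figures in this abstract hang together under simple arithmetic, sketched below as a back-of-the-envelope check. It assumes that every nanochromosome is linear with two telomeres and that the average copy number applies uniformly. Both are simplifications introduced here, and the variable names are illustrative.

```python
# Approximate values quoted in the abstract.
nanochromosomes = 16_000
mean_length_kb = 3.2
mean_copies = 2_000

# Assumption for this sketch: each linear nanochromosome has two telomeres.
telomeres_per_molecule = 2

haploid_size_mb = nanochromosomes * mean_length_kb / 1_000
telomeres_per_cell = nanochromosomes * mean_copies * telomeres_per_molecule

print(f"haploid macronuclear genome ~ {haploid_size_mb:.1f} Mb")
print(f"telomeres per cell ~ {telomeres_per_cell:,}")
```

    The product of count and mean length comes to about 51 Mb, consistent with the stated ∼50 Mb, and the telomere estimate of 64 million fits the abstract's "tens of millions of telomeres per cell".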

  15. Composite fan stator assembly

    Energy Technology Data Exchange (ETDEWEB)

    Donges, G.L.

    1993-07-13

    A composite fan stator assembly is described for a gas turbine engine having at least two fan rotor stages, the composite stator assembly comprising: an annular composite fan case assembly including an access port, the fan case assembly circumferentially disposed around first and second fan rotor stage locations, a composite fan stator stage supported by and extending radially inward of the fan case assembly and axially disposed between the two fan rotor stage locations, the fan stator stage includes at least one removable vane segment accessible for removal through the access port for assembly and reassembly, the composite fan case assembly including a separable composite forward fan case assembly and a separable composite aft fan case assembly spaced axially aft of the forward fan case assembly, the forward fan case assembly being bolted to the aft fan case assembly, wherein the composite fan stator stage is axially and radially trapped and supported by the forward and aft fan case assemblies. A composite stator vane assembly comprising: a composite inner shroud, a composite outer shroud disposed radially outward of the inner shroud, a plurality of vanes disposed between the shrouds, the vanes including a suction side and a pressure side and radially inner and outer roots, the roots extending through platforms of corresponding ones of the inner and outer shrouds, four box-type attachment elements corresponding to curved suction and pressure sides of the inner and outer roots, the box-type attachment elements having two connected legs angled with respect to each other, a first one of the legs extending along, conforming to the curve of, and bonded to a corresponding one of the airfoil root sides, and a second one of the legs extending along and bonded to a composite shroud surface.

  16. A new DNA sequence assembly program.

    Science.gov (United States)

    Bonfield, J K; Smith, K F; Staden, R

    1995-01-01

    We describe the Genome Assembly Program (GAP), a new program for DNA sequence assembly. The program is suitable for large and small projects, a variety of strategies and can handle data from a range of sequencing instruments. It retains the useful components of our previous work, but includes many novel ideas and methods. Many of these methods have been made possible by the program's completely new, and highly interactive, graphical user interface. The program provides many visual clues to the current state of a sequencing project and allows users to interact in intuitive and graphical ways with their data. The program has tools to display and manipulate the various types of data that help to solve and check difficult assemblies, particularly those in repetitive genomes. We have introduced the following new displays: the Contig Selector, the Contig Comparator, the Template Display, the Restriction Enzyme Map and the Stop Codon Map. We have also made it possible to have any number of Contig Editors and Contig Joining Editors running simultaneously even on the same contig. The program also includes a new 'Directed Assembly' algorithm and routines for automatically detecting unfinished segments of sequence, to which it suggests experimental solutions. PMID:8559656

  17. Transcriptator: An Automated Computational Pipeline to Annotate Assembled Reads and Identify Non Coding RNA.

    Directory of Open Access Journals (Sweden)

    Kumar Parijat Tripathi

    RNA-seq is a new tool to measure RNA transcript counts, using high-throughput sequencing with extraordinary accuracy. It provides quantitative means to explore the transcriptome of an organism of interest. However, interpreting this extremely large data into biological knowledge is a problem, and biologist-friendly tools are lacking. In our lab, we developed Transcriptator, a web application based on a computational Python pipeline with a user-friendly Java interface. This pipeline uses the web services available for BLAST (Basic Local Alignment Search Tool), QuickGO and DAVID (Database for Annotation, Visualization and Integrated Discovery) tools. It offers a report on statistical analysis of functional and Gene Ontology (GO) annotation enrichment. It helps users to identify enriched biological themes, particularly GO terms, pathways, domains, gene/protein features and protein-protein interaction information. It clusters the transcripts based on functional annotations and generates a tabular report for functional and gene ontology annotations for each transcript submitted to the web server. The implementation of QuickGO web services in our pipeline enables users to carry out GO-Slim analysis, whereas the integration of PORTRAIT (Prediction of transcriptomic non coding RNA (ncRNA) by ab initio methods) helps to identify the non coding RNAs and their regulatory role in the transcriptome. In summary, Transcriptator is a useful software tool for both NGS and array data. It helps users to characterize the de novo assembled reads obtained from NGS experiments for non-referenced organisms, while it also performs functional enrichment analysis of differentially expressed transcripts/genes for both RNA-seq and micro-array experiments. It generates easy-to-read tables and interactive charts for better understanding of the data. The pipeline is modular in nature, and provides an opportunity to add new plugins in the future. Web application is

  18. Transcriptator: An Automated Computational Pipeline to Annotate Assembled Reads and Identify Non Coding RNA

    Science.gov (United States)

    Zuccaro, Antonio; Guarracino, Mario Rosario

    2015-01-01

    RNA-seq is a new tool to measure RNA transcript counts, using high-throughput sequencing with extraordinary accuracy. It provides quantitative means to explore the transcriptome of an organism of interest. However, interpreting this extremely large data into biological knowledge is a problem, and biologist-friendly tools are lacking. In our lab, we developed Transcriptator, a web application based on a computational Python pipeline with a user-friendly Java interface. This pipeline uses the web services available for BLAST (Basic Local Alignment Search Tool), QuickGO and DAVID (Database for Annotation, Visualization and Integrated Discovery) tools. It offers a report on statistical analysis of functional and Gene Ontology (GO) annotation enrichment. It helps users to identify enriched biological themes, particularly GO terms, pathways, domains, gene/protein features and protein-protein interaction information. It clusters the transcripts based on functional annotations and generates a tabular report for functional and gene ontology annotations for each transcript submitted to the web server. The implementation of QuickGO web services in our pipeline enables users to carry out GO-Slim analysis, whereas the integration of PORTRAIT (Prediction of transcriptomic non coding RNA (ncRNA) by ab initio methods) helps to identify the non coding RNAs and their regulatory role in the transcriptome. In summary, Transcriptator is a useful software tool for both NGS and array data. It helps users to characterize the de novo assembled reads obtained from NGS experiments for non-referenced organisms, while it also performs functional enrichment analysis of differentially expressed transcripts/genes for both RNA-seq and micro-array experiments. It generates easy-to-read tables and interactive charts for better understanding of the data. The pipeline is modular in nature, and provides an opportunity to add new plugins in the future. Web application is freely

  19. The genome of Eucalyptus grandis

    Energy Technology Data Exchange (ETDEWEB)

    Myburg, Alexander A.; Grattapaglia, Dario; Tuskan, Gerald A.; Hellsten, Uffe; Hayes, Richard D.; Grimwood, Jane; Jenkins, Jerry; Lindquist, Erika; Tice, Hope; Bauer, Diane; Goodstein, David M.; Dubchak, Inna; Poliakov, Alexandre; Mizrachi, Eshchar; Kullan, Anand R. K.; Hussey, Steven G.; Pinard, Desre; van der Merwe, Karen; Singh, Pooja; van Jaarsveld, Ida; Silva-Junior, Orzenil B.; Togawa, Roberto C.; Pappas, Marilia R.; Faria, Danielle A.; Sansaloni, Carolina P.; Petroli, Cesar D.; Yang, Xiaohan; Ranjan, Priya; Tschaplinski, Timothy J.; Ye, Chu-Yu; Li, Ting; Sterck, Lieven; Vanneste, Kevin; Murat, Florent; Soler, Marçal; Clemente, Hélène San; Saidi, Naijib; Cassan-Wang, Hua; Dunand, Christophe; Hefer, Charles A.; Bornberg-Bauer, Erich; Kersting, Anna R.; Vining, Kelly; Amarasinghe, Vindhya; Ranik, Martin; Naithani, Sushma; Elser, Justin; Boyd, Alexander E.; Liston, Aaron; Spatafora, Joseph W.; Dharmwardhana, Palitha; Raja, Rajani; Sullivan, Christopher; Romanel, Elisson; Alves-Ferreira, Marcio; Külheim, Carsten; Foley, William; Carocha, Victor; Paiva, Jorge; Kudrna, David; Brommonschenkel, Sergio H.; Pasquali, Giancarlo; Byrne, Margaret; Rigault, Philippe; Tibbits, Josquin; Spokevicius, Antanas; Jones, Rebecca C.; Steane, Dorothy A.; Vaillancourt, René E.; Potts, Brad M.; Joubert, Fourie; Barry, Kerrie; Pappas, Georgios J.; Strauss, Steven H.; Jaiswal, Pankaj; Grima-Pettenati, Jacqueline; Salse, Jérôme; Van de Peer, Yves; Rokhsar, Daniel S.; Schmutz, Jeremy

    2014-06-11

    Eucalypts are the world's most widely planted hardwood trees. Their broad adaptability, rich species diversity, fast growth and superior multipurpose wood have made them a global renewable resource of fiber and energy that mitigates human pressures on natural forests. We sequenced and assembled >94% of the 640 Mbp genome of Eucalyptus grandis into its 11 chromosomes. A set of 36,376 protein-coding genes was predicted, revealing that 34% occur in tandem duplications, the largest proportion found thus far in any plant genome. Eucalypts also show the highest diversity of genes for plant specialized metabolism that act as chemical defence against biotic agents and provide unique pharmaceutical oils. Resequencing of a set of inbred tree genomes revealed regions of strongly conserved heterozygosity, likely hotspots of inbreeding depression. The resequenced genome of the sister species E. globulus underscored the high inter-specific genome colinearity despite substantial genome size variation in the genus. The genome of E. grandis is the first reference for the early diverging Rosid order Myrtales and is placed here basal to the Eurosids. This resource expands knowledge on the unique biology of large woody perennials and provides a powerful tool to accelerate comparative biology, breeding and biotechnology.

  20. A taste of pineapple evolution through genome sequencing.

    Science.gov (United States)

    Xu, Qing; Liu, Zhong-Jian

    2015-12-01

    The genome sequence assembly of the highly heterozygous Ananas comosus and its varieties is an impressive technical achievement. The sequence opens the door to a greater understanding of pineapple morphology and evolution.

  1. GenomePeek—an online tool for prokaryotic genome and metagenome analysis

    Directory of Open Access Journals (Sweden)

    Katelyn McNair

    2015-06-01

    As more and more prokaryotic sequencing takes place, a method to quickly and accurately analyze this data is needed. Previous tools are mainly designed for metagenomic analysis and have limitations, such as long runtimes and significant false positive error rates. The online tool GenomePeek (edwards.sdsu.edu/GenomePeek) was developed to analyze both single genome and metagenome sequencing files, quickly and with low error rates. GenomePeek uses a sequence assembly approach where reads to a set of conserved genes are extracted, assembled and then aligned against the highly specific reference database. GenomePeek was found to be faster than traditional approaches while still keeping error rates low, as well as offering unique data visualization options.

  2. A genome draft of the legless anguid lizard, Ophisaurus gracilis.

    Science.gov (United States)

    Song, Bo; Cheng, Shifeng; Sun, Yanbo; Zhong, Xiao; Jin, Jieqiong; Guan, Rui; Murphy, Robert W; Che, Jing; Zhang, Yaping; Liu, Xin

    2015-01-01

    Transition from a lizard-like to a snake-like body form is one of the most important transformations in reptilian evolution. The increasing number of sequenced reptilian genomes is enabling a deeper understanding of vertebrate evolution, although the genetic basis of the loss of limbs in reptiles remains enigmatic. Here we report genome sequencing, assembly, and annotation for the Asian glass lizard Ophisaurus gracilis, a limbless lizard species with an elongated snake-like body form. Addition of this species to the genome repository will provide an excellent resource for studying the genetic basis of limb loss and trunk elongation. O. gracilis genome sequencing using the Illumina HiSeq2000 platform resulted in 274.20 Gbp of raw data that was filtered and assembled to a final size of 1.78 Gbp, comprising 6,717 scaffolds with N50 = 1.27 Mbp. Based on the k-mer estimated genome size of 1.71 Gbp, the assembly appears to be nearly 100% complete. A total of 19,513 protein-coding genes were predicted, and 884.06 Mbp of repeat sequences (approximately half of the genome) were annotated. The draft genome of O. gracilis has similar characteristics to both lizard and snake genomes. We report the first genome of a lizard from the family Anguidae, O. gracilis. This supplements currently available genetic and genomic resources for amniote vertebrates, representing a major increase in comparative genome data available for squamate reptiles in particular.
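
    The scaffold N50 and completeness figures quoted above can be illustrated with a minimal sketch. It assumes the common definition of N50 (the length of the scaffold at which the cumulative length, taken from the longest scaffold down, first reaches half of the total assembly) and treats completeness as a plain ratio of assembled size to k-mer estimated size. The function names and the toy scaffold lengths are hypothetical, and the record does not state how the authors computed either figure.

```python
def n50(lengths):
    """N50 under the usual definition: walk the sequences from longest to
    shortest and return the length at which the running total first
    reaches half of the summed length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def size_ratio(assembly_size, estimated_genome_size):
    """Assumed completeness proxy: assembled bases / estimated genome size."""
    return assembly_size / estimated_genome_size


# Hypothetical toy scaffold lengths, not data from the record.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))

# Sizes quoted in the abstract: 1.78 Gbp assembled, 1.71 Gbp estimated.
print(round(size_ratio(1.78e9, 1.71e9), 3))
```

    With the sizes from the abstract the ratio comes out slightly above 1, which is consistent with the "nearly 100% complete" reading; a ratio above 1 can arise, for example, from gap padding in scaffolds or from uncertainty in the k-mer estimate.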

  3. ex vivo DNA assembly

    Directory of Open Access Journals (Sweden)

    Adam B Fisher

    2013-10-01

    Even with decreasing DNA synthesis costs there remains a need for inexpensive, rapid and reliable methods for assembling synthetic DNA into larger constructs or combinatorial libraries. Advances in cloning techniques have resulted in powerful in vitro and in vivo assembly of DNA. However, monetary and time costs have limited these approaches. Here, we report an ex vivo DNA assembly method that uses cellular lysates derived from a commonly used laboratory strain of Escherichia coli for joining double-stranded DNA with short end homologies embedded within inexpensive primers. This method concurrently shortens the time and decreases costs associated with current DNA assembly methods.

  4. Composite turbine bucket assembly

    Energy Technology Data Exchange (ETDEWEB)

    Liotta, Gary Charles; Garcia-Crespo, Andres

    2014-05-20

    A composite turbine blade assembly includes a ceramic blade including an airfoil portion, a shank portion and an attachment portion; and a transition assembly adapted to attach the ceramic blade to a turbine disk or rotor, the transition assembly including first and second transition components clamped together, trapping said ceramic airfoil therebetween. Interior surfaces of the first and second transition portions are formed to mate with the shank portion and the attachment portion of the ceramic blade, and exterior surfaces of said first and second transition components are formed to include an attachment feature enabling the transition assembly to be attached to the turbine rotor or disk.

  5. Target Assembly Facility

    Data.gov (United States)

    Federal Laboratory Consortium — The Target Assembly Facility integrates new armor concepts into actual armored vehicles. Featuring the capability of machining and cutting radioactive materials, it...

  6. Applying Small-Scale DNA Signatures as an Aid in Assembling Soybean Chromosome Sequences

    Directory of Open Access Journals (Sweden)

    Myron Peto

    2010-01-01

    Previous work has established a genomic signature based on relative counts of the 16 possible dinucleotides. Until now, it has been generally accepted that the dinucleotide signature is characteristic of a genome and is relatively homogeneous across a genome. However, we found some local regions of the soybean genome with a signature differing widely from that of the rest of the genome. Those regions were mostly centromeric and pericentromeric, and enriched for repetitive sequences. We found that DNA binding energy also presented large-scale patterns across soybean chromosomes. These two patterns were helpful during assembly and quality control of soybean whole genome shotgun scaffold sequences into chromosome pseudomolecules.
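
    As an illustration of a dinucleotide signature, the sketch below computes one widely used formulation, the odds ratio of each observed dinucleotide frequency to the product of its two mononucleotide frequencies. The abstract only says "relative counts", so the exact formula here is an assumption, as are the function names and the mean absolute difference used to compare two signatures. Strand symmetrization is omitted for brevity.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_signature(seq):
    """Map each of the 16 dinucleotides XY to f(XY) / (f(X) * f(Y)).

    Assumed formulation (odds ratio); positions containing characters
    other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    pairs = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_pairs = sum(pairs.values())
    signature = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = pairs[x + y] / n_pairs if n_pairs else 0.0
        signature[x + y] = observed / expected if expected > 0 else 0.0
    return signature


def signature_distance(sig_a, sig_b):
    """Mean absolute difference over the 16 dinucleotides (assumed metric)."""
    return sum(abs(sig_a[k] - sig_b[k]) for k in sig_a) / len(sig_a)


# Hypothetical toy sequences standing in for a genome and a local window.
genome_like = "ACGTTGCAAGCTTAGCCGATATCGGATCCA" * 4
window = "ATATATATATATATATATATATAT"
print(signature_distance(dinucleotide_signature(genome_like),
                         dinucleotide_signature(window)))
```

    Sliding such a window along a scaffold and flagging windows whose distance to the genome-wide signature is unusually large is one plausible way to look for the locally deviating regions the abstract describes.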

  7. The Past, Present, and Future of Human Centromere Genomics

    Directory of Open Access Journals (Sweden)

    Megan E. Aldrup-MacDonald

    2014-01-01

    The centromere is the chromosomal locus essential for chromosome inheritance and genome stability. Human centromeres are located at repetitive alpha satellite DNA arrays that compose approximately 5% of the genome. Contiguous alpha satellite DNA sequence is absent from the assembled reference genome, limiting current understanding of centromere organization and function. Here, we review the progress in centromere genomics spanning the discovery of the sequence to its molecular characterization and the work done during the Human Genome Project era to elucidate alpha satellite structure and sequence variation. We discuss exciting recent advances in alpha satellite sequence assembly that have provided important insight into the abundance and complex organization of this sequence on human chromosomes. In light of these new findings, we offer perspectives for future studies of human centromere assembly and function.

  8. Next-generation sequencing strategies for characterizing the turkey genome.

    Science.gov (United States)

    Dalloul, Rami A; Zimin, Aleksey V; Settlage, Robert E; Kim, Sungwon; Reed, Kent M

    2014-02-01

    The turkey genome sequencing project was initiated in 2008 and has relied primarily on next-generation sequencing (NGS) technologies. Our first efforts used a synergistic combination of 2 NGS platforms (Roche/454 and Illumina GAII), detailed bacterial artificial chromosome (BAC) maps, and unique assembly tools to sequence and assemble the genome of the domesticated turkey, Meleagris gallopavo. Since the first release in 2010, efforts to improve the genome assembly, gene annotation, and genomic analyses continue. The initial assembly build (2.01) represented about 89% of the genome sequence with 17X coverage depth (931 Mb). Sequence contigs were assigned to 30 of the 40 chromosomes with approximately 10% of the assembled sequence corresponding to unassigned chromosomes (ChrUn). The sequence has been refined through both genome-wide and area-focused sequencing, including shotgun and paired-end sequencing, and targeted sequencing of chromosomal regions with low or incomplete coverage. These additional efforts have improved the sequence assembly resulting in 2 subsequent genome builds of higher genome coverage (25X/Build3.0 and 30X/Build4.0) with a current sequence totaling 1,010 Mb. Further, BAC with end sequences assigned to the Z/W and MG18 (MHC) chromosomes, ChrUn, or not placed in the previous build were isolated, deeply sequenced (Hi-Seq), and incorporated into the latest build (5.0). To aid in the annotation and to generate a gene expression atlas of major tissues, a comprehensive set of RNA samples was collected at various developmental stages of female and male turkeys. Transcriptome sequencing data (using Illumina Hi-Seq) will provide information to enhance the final assembly and ultimately improve sequence annotation. The most current sequence covers more than 95% of the turkey genome and should yield a much improved gene level of annotation, making it a valuable resource for studying genetic variations underlying economically important traits in poultry.

  9. Insight into the evolution of the Solanaceae from the parental genomes of Petunia hybrida

    NARCIS (Netherlands)

    Bombarely, Aureliano; Moser, Michel; Amrad, Avichai; Bao, Manzhu; Bapaume, Laure; Barry, Cornelius S.; Bliek, Mattijs; Boersma, Maaike R.; Borghi, Lorenzo; Bruggmann, Rémy; Bucher, Marcel; D'Agostino, Nunzio; Davies, Kevin; Druege, Uwe; Dudareva, Natalia; Egea-Cortines, Marcos; Delledonne, Massimo; Fernandez-Pozo, Noe; Franken, Philipp; Grandont, Laurie; Heslop-Harrison, J.S.; Hintzsche, Jennifer; Johns, Mitrick; Koes, Ronald; Lv, Xiaodan; Lyons, Eric; Malla, Diwa; Martinoia, Enrico; Mattson, Neil S.; Morel, Patrice; Mueller, Lukas A.; Muhlemann, Joëlle; Nouri, Eva; Passeri, Valentina; Pezzotti, Mario; Qi, Qinzhou; Reinhardt, Didier; Rich, Melanie; Richert-Pöggeler, Katja R.; Robbins, Tim P.; Schatz, Michael C.; Schranz, Eric; Schuurink, Robert C.; Schwarzacher, Trude; Spelt, Kees; Tang, Haibao; Urbanus, Susan L.; Vandenbussche, Michiel; Vijverberg, Kitty; Villarino, Gonzalo H.; Warner, Ryan M.; Weiss, Julia; Yue, Zhen; Zethof, Jan; Quattrocchio, Francesca; Sims, Thomas L.; Kuhlemeier, Cris

    2016-01-01

    Petunia hybrida is a popular bedding plant that has a long history as a genetic model system. We report the whole-genome sequencing and assembly of inbred derivatives of its two wild parents, P. axillaris N and P. inflata S6. The assemblies include 91.3% and 90.2% coverage of their diploid genome

  10. Two genome sequences of the same bacterial strain, Gluconacetobacter diazotrophicus PAl 5, suggest a new standard in genome sequence submission.

    Science.gov (United States)

    Giongo, Adriana; Tyler, Heather L; Zipperer, Ursula N; Triplett, Eric W

    2010-06-15

    Gluconacetobacter diazotrophicus PAl 5 is of agricultural significance due to its ability to provide fixed nitrogen to plants. Consequently, its genome sequence has been eagerly anticipated to enhance understanding of endophytic nitrogen fixation. Two groups have sequenced the PAl 5 genome from the same source (ATCC 49037), though the resulting sequences contain a surprisingly high number of differences. Therefore, an optical map of PAl 5 was constructed in order to determine which genome assembly more closely resembles the chromosomal DNA by aligning each sequence against a physical map of the genome. While one sequence aligned very well, over 98% of the second sequence contained numerous rearrangements. The many differences observed between these two genome sequences could be owing to either assembly errors or rapid evolutionary divergence. The extent of the differences derived from sequence assembly errors could be assessed if the raw sequencing reads were provided by both genome centers at the time of genome sequence submission. Hence, a new genome sequence standard is proposed whereby the investigator supplies the raw reads along with the closed sequence so that the community can make more accurate judgments on whether differences observed in a single strain may be of biological origin or are simply caused by differences in genome assembly procedures.

  11. Antarctic Genomics

    Directory of Open Access Journals (Sweden)

    Alex D. Rogers

    2006-03-01

    With the development of genomic science and its battery of technologies, polar biology stands on the threshold of a revolution, one that will enable the investigation of important questions of unprecedented scope and with extraordinary depth and precision. The exotic organisms of polar ecosystems are ideal candidates for genomic analysis. Through such analyses, it will be possible to learn not only the novel features that enable polar organisms to survive, and indeed thrive, in their extreme environments, but also fundamental biological principles that are common to most, if not all, organisms. This article aims to review recent developments in Antarctic genomics and to demonstrate the global context of such studies.

  12. Genome sequence of Arthrobacter antarcticus strain W2, isolated from a slaughterhouse

    DEFF Research Database (Denmark)

    Herschend, Jakob; Raghupathi, Prem Krishnan; Røder, Henriette Lyng;

    2016-01-01

    We report the draft genome sequence of Arthrobacter antarcticus strain W2, which was isolated from a wall of a small slaughterhouse in Denmark. The 4.43-Mb genome sequence was assembled into 170 contigs....

  13. Genomic Resources Notes accepted 1 February 2015 - 31 March 2015

    NARCIS (Netherlands)

    Arthofer, Wolfgang; Bertini, Laura; Caruso, Carla; Cicconardi, Francesco; Delph, Lynda F; Fields, Peter D; Ikeda, Minoru; Minegishi, Yuki; Proietti, Silvia; Ritthammer, Heike; Schlick-Steiner, Birgit C; Steiner, Florian M; Wachter, Gregor A; Wagner, Herbert C; Weingartner, Laura A

    2015-01-01

    This article documents the public availability of (i) raw transcriptome sequence data, assembled contigs and BLAST hits of the Antarctic plant Colobanthus quitensis grown in two different climatic conditions, (ii) the draft genome sequence data (raw reads, assembled contigs and unassembled reads) an

  14. Assembly of primary cilia

    DEFF Research Database (Denmark)

    Pedersen, Lotte B; Veland, Iben R; Schrøder, Jacob M

    2008-01-01

    in primary cilia assembly or function have been associated with a panoply of disorders and diseases, including polycystic kidney disease, left-right asymmetry defects, hydrocephalus, and Bardet Biedl Syndrome. Here we provide an up-to-date review focused on the molecular mechanisms involved in the assembly...

  15. Perspective: Geometrically frustrated assemblies

    Science.gov (United States)

    Grason, Gregory M.

    2016-09-01

    This perspective will overview an emerging paradigm for self-organized soft materials, geometrically frustrated assemblies, where interactions between self-assembling elements (e.g., particles, macromolecules, proteins) favor local packing motifs that are incompatible with uniform global order in the assembly. This classification applies to a broad range of material assemblies including self-twisting protein filament bundles, amyloid fibers, chiral smectics and membranes, particle-coated droplets, curved protein shells, and phase-separated lipid vesicles. In assemblies, geometric frustration leads to a host of anomalous structural and thermodynamic properties, including heterogeneous and internally stressed equilibrium structures, self-limiting assembly, and topological defects in the equilibrium assembly structures. The purpose of this perspective is to (1) highlight the unifying principles and consequences of geometric frustration in soft matter assemblies; (2) classify the known distinct modes of frustration and review corresponding experimental examples; and (3) describe outstanding questions not yet addressed about the unique properties and behaviors of this broad class of systems.

  16. Laser bottom hole assembly

    Science.gov (United States)

    Underwood, Lance D; Norton, Ryan J; McKay, Ryan P; Mesnard, David R; Fraze, Jason D; Zediker, Mark S; Faircloth, Brian O

    2014-01-14

    There is provided a laser bottom hole assembly for providing a high power laser beam having greater than 5 kW of power for a laser mechanical drilling process to advance a borehole. This assembly utilizes a reverse Moineau motor type power section and provides a self-regulating system that addresses fluid flows relating to motive force, cooling and removal of cuttings.

  17. Self-assembled nanostructures

    CERN Document Server

    Zhang, Jin Z; Liu, Jun; Chen, Shaowei; Liu, Gang-yu

    2003-01-01

    Nanostructures refer to materials that have relevant dimensions on the nanometer length scales and reside in the mesoscopic regime between isolated atoms and molecules and bulk matter. These materials have unique physical properties that are distinctly different from bulk materials. Self-Assembled Nanostructures provides systematic coverage of basic nanomaterials science including materials assembly and synthesis, characterization, and application. Suitable for both beginners and experts, it balances the chemistry aspects of nanomaterials with physical principles. It also highlights nanomaterial-based architectures including assembled or self-assembled systems. Filled with in-depth discussion of important applications of nano-architectures as well as potential applications ranging from physical to chemical and biological systems, Self-Assembled Nanostructures is the essential reference or text for scientists involved with nanostructures.

  18. Constrained space camera assembly

    Science.gov (United States)

    Heckendorn, Frank M.; Anderson, Erin K.; Robinson, Casandra W.; Haynes, Harriet B.

    1999-01-01

    A constrained space camera assembly which is intended to be lowered through a hole into a tank, a borehole or another cavity. The assembly includes a generally cylindrical chamber comprising a head and a body and a wiring-carrying conduit extending from the chamber. Means are included in the chamber for rotating the body about the head without breaking an airtight seal formed therebetween. The assembly may be pressurized and accompanied with a pressure sensing means for sensing if a breach has occurred in the assembly. In one embodiment, two cameras, separated from their respective lenses, are installed on a mounting apparatus disposed in the chamber. The mounting apparatus includes means allowing both longitudinal and lateral movement of the cameras. Moving the cameras longitudinally focuses the cameras, and moving the cameras laterally away from one another effectively converges the cameras so that close objects can be viewed. The assembly further includes means for moving lenses of different magnification forward of the cameras.

  19. Evaluating de Bruijn graph assemblers on 454 transcriptomic data.

    Directory of Open Access Journals (Sweden)

    Xianwen Ren

    Next generation sequencing (NGS) technologies have greatly changed the landscape of transcriptomic studies of non-model organisms. Since there is no reference genome available, de novo assembly methods play key roles in the analysis of these data sets. Because of the huge amount of data generated by NGS technologies for each run, many assemblers, e.g., ABySS, Velvet and Trinity, are developed based on a de Bruijn graph due to its time- and space-efficiency. However, most of these assemblers were developed initially for the Illumina/Solexa platform. The performance of these assemblers on 454 transcriptomic data is unknown. In this study, we evaluated and compared the relative performance of these de Bruijn graph based assemblers on both simulated and real 454 transcriptomic data. The results suggest that Trinity, the Illumina/Solexa-specialized transcriptomic assembler, performs the best among the multiple de Bruijn graph assemblers, comparable to or even outperforming the standard 454 assembler Newbler, which is based on the overlap-layout-consensus algorithm. Our evaluation is expected to provide helpful guidance for researchers to choose assemblers when analyzing 454 transcriptomic data.
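
    To make the de Bruijn graph idea concrete, here is a minimal sketch, not the implementation used by ABySS, Velvet or Trinity. It assumes error-free reads from a single strand, uses (k-1)-mers as nodes and k-mers as edges, and extends a contig greedily while the current node has exactly one successor. The function names and toy reads are hypothetical; real assemblers also handle sequencing errors, reverse complements, coverage and branching.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer in a read adds an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Greedy walk from `start`, appending one base per step while the
    current node has exactly one successor; stops at a branch, a dead end
    or a node already visited (cycle guard). In-degree is not checked,
    so this is a simplification of proper unitig construction."""
    path = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:
            break
        path += successor[-1]
        seen.add(successor)
        node = successor
    return path


# Hypothetical overlapping reads drawn from the toy sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
print(extend_unambiguous(graph, "ATG"))  # prints ATGGCGTGCA
```

    Because every read is reduced to a set of k-mers, memory and time scale with the number of distinct k-mers rather than with the number of read pairs to compare, which is the efficiency argument the abstract makes in contrast to overlap-layout-consensus assemblers such as Newbler.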

  20. Identification of Sesame