WorldWideScience

Sample records for accurate genome alignment

  1. Pairagon: a highly accurate, HMM-based cDNA-to-genome aligner

    DEFF Research Database (Denmark)

    Lu, David V; Brown, Randall H; Arumugam, Manimozhiyan

    2009-01-01

    MOTIVATION: The most accurate way to determine the intron-exon structures in a genome is to align spliced cDNA sequences to the genome. Thus, cDNA-to-genome alignment programs are a key component of most annotation pipelines. The scoring system used to choose the best alignment is a primary determinant of alignment accuracy, while heuristics that prevent consideration of certain alignments are a primary determinant of runtime and memory usage. Both accuracy and speed are important considerations in choosing an alignment algorithm, but scoring systems have received much less attention than...

  2. Pairagon: a highly accurate, HMM-based cDNA-to-genome aligner.

    Science.gov (United States)

    Lu, David V; Brown, Randall H; Arumugam, Manimozhiyan; Brent, Michael R

    2009-07-01

    The most accurate way to determine the intron-exon structures in a genome is to align spliced cDNA sequences to the genome. Thus, cDNA-to-genome alignment programs are a key component of most annotation pipelines. The scoring system used to choose the best alignment is a primary determinant of alignment accuracy, while heuristics that prevent consideration of certain alignments are a primary determinant of runtime and memory usage. Both accuracy and speed are important considerations in choosing an alignment algorithm, but scoring systems have received much less attention than heuristics. We present Pairagon, a pair hidden Markov model based cDNA-to-genome alignment program, as the most accurate aligner for sequences with high- and low-identity levels. We conducted a series of experiments testing alignment accuracy with varying sequence identity. We first created 'perfect' simulated cDNA sequences by splicing the sequences of exons in the reference genome sequences of fly and human. The complete reference genome sequences were then mutated to various degrees using a realistic mutation simulator and the perfect cDNAs were aligned to them using Pairagon and 12 other aligners. To validate these results with natural sequences, we performed cross-species alignment using orthologous transcripts from human, mouse and rat. We found that aligner accuracy is heavily dependent on sequence identity. For sequences with 100% identity, Pairagon achieved accuracy levels of >99.6%, with one quarter of the errors of any other aligner. Furthermore, for human/mouse alignments, which are only 85% identical, Pairagon achieved 87% accuracy, higher than any other aligner. Pairagon source and executables are freely available at http://mblab.wustl.edu/software/pairagon/

  3. How genome complexity can explain the difficulty of aligning reads to genomes.

    Science.gov (United States)

    Phan, Vinhthuy; Gao, Shanshan; Tran, Quang; Vo, Nam S

    2015-01-01

    Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for the understanding and quantification of this relationship. We investigated several measures of complexity and found that length-sensitive measures of complexity had the highest correlation to accuracy of alignment. In particular, the rate of distinct substrings of length k, where k is similar to the read length, correlated very highly to alignment performance in terms of precision and recall. We showed how to compute this measure efficiently in linear time, making it useful in practice to estimate quickly the difficulty of alignment for new genomes without having to align reads to them first. We showed how the length-sensitive measures could provide additional information for choosing aligners that would align consistently accurately on new genomes. We formally established a connection between genome complexity and the accuracy of short-read aligners. The relationship between genome complexity and alignment accuracy provides additional useful information for selecting suitable aligners for new genomes. Further, this work suggests that the complexity of genomes sometimes should be thought of in terms of specific computational problems, such as the alignment of short reads to genomes.
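
    The length-sensitive measure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it collects k-mers in a hash set rather than using the linear-time method the abstract mentions, and normalizing by the number of length-k windows is an assumption. The function name and the toy sequences are hypothetical.

```python
def distinct_kmer_rate(genome: str, k: int) -> float:
    """Fraction of length-k windows in `genome` that are distinct substrings.

    Assumption: the rate is normalized by the number of windows (n - k + 1).
    """
    n_windows = len(genome) - k + 1
    if k <= 0 or n_windows <= 0:
        raise ValueError("k must be between 1 and len(genome)")
    # Naive approach: store every window in a set and count the unique ones.
    distinct = {genome[i:i + k] for i in range(n_windows)}
    return len(distinct) / n_windows


if __name__ == "__main__":
    repetitive = "ACGT" * 8                        # toy period-4 repeat
    varied = "ACGTTGCAAGCTTAGGCATCGATCCGTAAGTC"    # toy sequence, fewer repeats
    for name, seq in (("repetitive", repetitive), ("varied", varied)):
        print(name, distinct_kmer_rate(seq, k=8))
```

    A lower rate means more windows of read length k are repeated somewhere else in the sequence, which is the situation in which a read cannot be placed unambiguously.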

  4. Aligning the unalignable: bacteriophage whole genome alignments.

    Science.gov (United States)

    Bérard, Sèverine; Chateau, Annie; Pompidor, Nicolas; Guertin, Paul; Bergeron, Anne; Swenson, Krister M

    2016-01-13

    In recent years, many studies have focused on the description and comparison of large sets of related bacteriophage genomes. Due to the peculiar mosaic structure of these genomes, few informative approaches for comparing whole genomes exist: dot plot diagrams give a mostly qualitative assessment of the similarity/dissimilarity between two or more genomes, and clustering techniques are used to classify genomes. Multiple alignments are conspicuously absent from this scene. Indeed, whole genome aligners interpret lack of similarity between sequences as an indication of rearrangements, insertions, or losses. This behavior makes them ill-prepared to align bacteriophage genomes, where even closely related strains can accomplish the same biological function with highly dissimilar sequences. In this paper, we propose a multiple alignment strategy that exploits functional collinearity shared by related strains of bacteriophages, and uses partial orders to capture mosaicism of sets of genomes. As classical alignments do, the computed alignments can be used to predict that genes have the same biological function, even in the absence of detectable similarity. The Alpha aligner implements these ideas in visual interactive displays, and is used to compute several examples of alignments of Staphylococcus aureus and Mycobacterium bacteriophages, involving up to 29 genomes. Using these datasets, we prove that Alpha alignments are at least as good as those computed by standard aligners. Comparison with the progressive Mauve aligner - which implements a partial order strategy, but whose alignments are linearized - shows a greatly improved interactive graphic display, while avoiding misalignments. Multiple alignments of whole bacteriophage genomes work, and will become an important conceptual and visual tool in comparative genomics of sets of related strains. A Python implementation of Alpha, along with installation instructions for Ubuntu and OSX, is available on Bitbucket (https://bitbucket.org/thekswenson/alpha).

  5. MUMmer4: A fast and versatile genome alignment system.

    Directory of Open Access Journals (Sweden)

    Guillaume Marçais

    2018-01-01

    The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMmer4 one of the most versatile genome alignment packages available.
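
    To put the index widths in perspective, the short calculation below counts how many positions an index of a given bit width can address. It is only arithmetic: the abstract does not say how the 141 Tbp limit is derived, so the observation that the figure matches 2^47 positions (half of 2^48) is an inference, not a statement about MUMmer4 internals. The helper name is hypothetical.

```python
TERA = 10 ** 12
GIGA = 10 ** 9


def addressable_positions(bits: int) -> int:
    """Number of distinct positions an unsigned index of `bits` bits can address."""
    return 2 ** bits


if __name__ == "__main__":
    # A 32-bit index tops out at a few billion positions.
    print("32-bit:", addressable_positions(32) / GIGA, "G positions")
    # 2**47 is about 140.7e12, which rounds to the quoted 141 Tbp.
    print("47-bit:", addressable_positions(47) / TERA, "T positions")
    print("48-bit:", addressable_positions(48) / TERA, "T positions")
```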

  6. Multiple Whole Genome Alignments Without a Reference Organism

    Energy Technology Data Exchange (ETDEWEB)

    Dubchak, Inna; Poliakov, Alexander; Kislyuk, Andrey; Brudno, Michael

    2009-01-16

    Multiple sequence alignments have become one of the most commonly used resources in genomics research. Most algorithms for multiple alignment of whole genomes rely either on a reference genome, against which all of the other sequences are laid out, or require a one-to-one mapping between the nucleotides of the genomes, preventing the alignment of recently duplicated regions. Both approaches have drawbacks for whole-genome comparisons. In this paper we present a novel symmetric alignment algorithm. The resulting alignments not only represent all of the genomes equally well, but also include all relevant duplications that occurred since the divergence from the last common ancestor. Our algorithm, implemented as a part of the VISTA Genome Pipeline (VGP), was used to align seven vertebrate and six Drosophila genomes. The resulting whole-genome alignments demonstrate a higher sensitivity and specificity than the pairwise alignments previously available through the VGP and have higher exon alignment accuracy than comparable public whole-genome alignments. Of the multiple alignment methods tested, ours performed the best at aligning genes from multigene families, perhaps the most challenging test for whole-genome alignments. Our whole-genome multiple alignments are available through the VISTA Browser at http://genome.lbl.gov/vista/index.shtml.

  7. BFAST: an alignment tool for large scale genome resequencing.

    Directory of Open Access Journals (Sweden)

    Nils Homer

    2009-11-01

    The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation. We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels. We compare BFAST to a selection of large-scale alignment tools -- BLAT, MAQ, SHRiMP, and SOAP -- in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
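
    The final gapped local alignment step mentioned above can be illustrated with a textbook Smith-Waterman score computation. This is a generic sketch, not BFAST code: the linear gap penalty and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for the example, and only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between `a` and `b` (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an alignment
            # may start anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    read = "ACGTACGT"
    window = "ACGTTACGT"   # toy reference window with one inserted base
    print(smith_waterman_score(read, window))
```

    With these toy inputs the best-scoring alignment spans the inserted base with a single gap, which is the kind of small indel a gapped local alignment can recover and an ungapped comparison cannot.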

  8. Considerations in the identification of functional RNA structural elements in genomic alignments

    Directory of Open Access Journals (Sweden)

    Blencowe Benjamin J

    2007-01-01

    Background: Accurate identification of novel, functional noncoding (nc) RNA features in genome sequence has proven more difficult than for exons. Current algorithms identify and score potential RNA secondary structures on the basis of thermodynamic stability, conservation, and/or covariance in sequence alignments. Neither the algorithms nor the information gained from the individual inputs have been independently assessed. Furthermore, due to issues in modelling background signal, it has been difficult to gauge the precision of these algorithms on a genomic scale, in which even a seemingly small false-positive rate can result in a vast excess of false discoveries. Results: We developed a shuffling algorithm, shuffle-pair.pl, that simultaneously preserves dinucleotide frequency, gaps, and local conservation in pairwise sequence alignments. We used shuffle-pair.pl to assess precision and recall of six ncRNA search tools (MSARI, QRNA, ddbRNA, RNAz, EvoFold, and several variants of simple thermodynamic stability) on a test set of 3046 alignments of known ncRNAs. Relative to mononucleotide shuffling, preservation of dinucleotide content in shuffling the alignments resulted in a drastic increase in estimated false-positive detection rates for ncRNA elements, precluding evaluation of higher order alignments, which cannot be adequately shuffled maintaining both dinucleotides and alignment structure. On pairwise alignments, none of the covariance-based tools performed markedly better than thermodynamic scoring alone. Although the high false-positive rates call into question the veracity of any individual predicted secondary structural element in our analysis, we nevertheless identified intriguing global trends in human genome alignments. The distribution of ncRNA prediction scores in 75-base windows overlapping UTRs, introns, and intergenic regions analyzed using both thermodynamic stability and EvoFold (which has no thermodynamic component) was

  9. Alignment of whole genomes.

    Science.gov (United States)

    Delcher, A L; Kasif, S; Fleischmann, R D; Peterson, J; White, O; Salzberg, S L

    1999-01-01

    A new system for aligning whole genome sequences is described. Using an efficient data structure called a suffix tree, the system is able to rapidly align sequences containing millions of nucleotides. Its use is demonstrated on two strains of Mycobacterium tuberculosis, on two less similar species of Mycoplasma bacteria and on two syntenic sequences from human chromosome 12 and mouse chromosome 6. In each case it found an alignment of the input sequences, using between 30 s and 2 min of computation time. From the system output, information on single nucleotide changes, translocations and homologous genes can easily be extracted. Use of the algorithm should facilitate analysis of syntenic chromosomal regions, strain-to-strain comparisons, evolutionary comparisons and genomic duplications. PMID:10325427
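
    The idea of indexing one sequence so that exact matches can be located quickly is sketched below. Note the substitution: for brevity this uses a naively built suffix array with binary search, not the suffix tree the abstract describes, and it does not reproduce the system's alignment logic. Function names and the toy sequence are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in lexicographic order.

    Naive construction (sorting full suffixes); fine for toy inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of `pattern` in `text`, via binary search on `sa`."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    reference = "GATTACAGATTACA"
    sa = build_suffix_array(reference)
    print(find_occurrences(reference, sa, "ATTA"))
    print(find_occurrences(reference, sa, "ACAG"))
```

    A substring that occurs exactly once in the indexed sequence, such as "ACAG" in the toy input, is the kind of unambiguous exact match that can serve as an alignment anchor.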

  10. Genomic multiple sequence alignments: refinement using a genetic algorithm

    Directory of Open Access Journals (Sweden)

    Lefkowitz Elliot J

    2005-08-01

    Background: Genomic sequence data cannot be fully appreciated in isolation. Comparative genomics, the practice of comparing genomic sequences from different species, plays an increasingly important role in understanding the genotypic differences between species that result in phenotypic differences as well as in revealing patterns of evolutionary relationships. One of the major challenges in comparative genomics is producing a high-quality alignment between two or more related genomic sequences. In recent years, a number of tools have been developed for aligning large genomic sequences. Most utilize heuristic strategies to identify a series of strong sequence similarities, which are then used as anchors to align the regions between the anchor points. The resulting alignment is globally correct, but in many cases is suboptimal locally. We describe a new program, GenAlignRefine, which improves the overall quality of global multiple alignments by using a genetic algorithm to improve local regions of alignment. Regions of low quality are identified, realigned using the program T-Coffee, and then refined using a genetic algorithm. Because a better COFFEE (Consistency based Objective Function For alignmEnt Evaluation) score generally reflects greater alignment quality, the algorithm searches for an alignment that yields a better COFFEE score. To improve the intrinsic slowness of the genetic algorithm, GenAlignRefine was implemented as a parallel, cluster-based program. Results: We tested the GenAlignRefine algorithm by running it on a Linux cluster to refine sequences from a simulation, as well as refine a multiple alignment of 15 Orthopoxvirus genomic sequences approximately 260,000 nucleotides in length that initially had been aligned by Multi-LAGAN. It took approximately 150 minutes for a 40-processor Linux cluster to optimize some 200 fuzzy (poorly aligned) regions of the orthopoxvirus alignment. Overall sequence identity increased only

  11. Genome Update: alignment of bacterial chromosomes

    DEFF Research Database (Denmark)

    Ussery, David; Jensen, Mette; Poulsen, Tine Rugh

    2004-01-01

    There are four new microbial genomes listed in this month's Genome Update, three belonging to Gram-positive bacteria and one belonging to an archaeon that lives at pH 0; all of these genomes are listed in Table 1. The method of genome comparison this month is that of genome alignment and, as an ...

  12. Automated whole-genome multiple alignment of rat, mouse, and human

    Energy Technology Data Exchange (ETDEWEB)

    Brudno, Michael; Poliakov, Alexander; Salamov, Asaf; Cooper, Gregory M.; Sidow, Arend; Rubin, Edward M.; Solovyev, Victor; Batzoglou, Serafim; Dubchak, Inna

    2004-07-04

    We have built a whole genome multiple alignment of the three currently available mammalian genomes using a fully automated pipeline which combines the local/global approach of the Berkeley Genome Pipeline and the LAGAN program. The strategy is based on progressive alignment, and consists of two main steps: (1) alignment of the mouse and rat genomes; and (2) alignment of human to either the mouse-rat alignments from step 1, or the remaining unaligned mouse and rat sequences. The resulting alignments demonstrate high sensitivity, with 87% of all human gene-coding areas aligned in both mouse and rat. The specificity is also high: <7% of the rat contigs are aligned to multiple places in human and 97% of all alignments with human sequence > 100kb agree with a three-way synteny map built independently using predicted exons in the three genomes. At the nucleotide level <1% of the rat nucleotides are mapped to multiple places in the human sequence in the alignment; and 96.5% of human nucleotides within all alignments agree with the synteny map. The alignments are publicly available online, with visualization through the novel Multi-VISTA browser that we also present.

  13. Accurate Alignment of Plasma Channels Based on Laser Centroid Oscillations

    International Nuclear Information System (INIS)

    Gonsalves, Anthony; Nakamura, Kei; Lin, Chen; Osterhoff, Jens; Shiraishi, Satomi; Schroeder, Carl; Geddes, Cameron; Toth, Csaba; Esarey, Eric; Leemans, Wim

    2011-01-01

    A technique has been developed to accurately align a laser beam through a plasma channel by minimizing the shift in laser centroid and angle at the channel output. If only the shift in centroid or angle is measured, then accurate alignment is provided by minimizing laser centroid motion at the channel exit as the channel properties are scanned. The improvement in alignment accuracy provided by this technique is important for minimizing electron beam pointing errors in laser plasma accelerators.

  14. VCFtoTree: a user-friendly tool to construct locus-specific alignments and phylogenies from thousands of anthropologically relevant genome sequences.

    Science.gov (United States)

    Xu, Duo; Jaber, Yousef; Pavlidis, Pavlos; Gokcumen, Omer

    2017-09-26

    Constructing alignments and phylogenies for a given locus from large genome sequencing studies with relevant outgroups allows novel evolutionary and anthropological insights. However, no user-friendly tool has been developed to integrate thousands of recently available and anthropologically relevant genome sequences to construct complete sequence alignments and phylogenies. Here, we provide VCFtoTree, a user-friendly tool with a graphical user interface that directly accesses online databases to download, parse and analyze genome variation data for regions of interest. Our pipeline combines popular sequence datasets and tree building algorithms with custom data parsing to generate accurate alignments and phylogenies using all the individuals from the 1000 Genomes Project, Neanderthal and Denisovan genomes, as well as reference genomes of Chimpanzee and Rhesus Macaque. It can also be applied to other phased human genomes, as well as genomes from other species. The output of our pipeline includes an alignment in FASTA format and a tree file in newick format. VCFtoTree fulfills the increasing demand for constructing alignments and phylogenies for a given locus from thousands of available genomes. Our software provides a user-friendly interface for a wider audience without prerequisite knowledge in programming. VCFtoTree can be accessed from https://github.com/duoduoo/VCFtoTree_3.0.0 .
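
    The step from phased variant calls to a FASTA alignment can be sketched as below. This is an illustration only and not VCFtoTree code: it handles single-base substitutions on a toy reference, uses 0-based positions, and takes variants as plain dictionaries instead of parsing VCF files. How the tool treats indels is not described in the abstract and is not modelled here. All names are hypothetical.

```python
def apply_snps(reference: str, snps: dict[int, str]) -> str:
    """Return `reference` with single-base substitutions applied.

    `snps` maps a 0-based position to the alternate base for one haplotype.
    """
    seq = list(reference)
    for pos, alt in snps.items():
        if len(alt) != 1 or not 0 <= pos < len(seq):
            raise ValueError(f"unsupported variant at position {pos}: {alt!r}")
        seq[pos] = alt
    return "".join(seq)


def to_fasta(reference: str, haplotypes: dict[str, dict[int, str]]) -> str:
    """Build a FASTA alignment with one record per named haplotype."""
    records = [f">{name}\n{apply_snps(reference, snps)}"
               for name, snps in haplotypes.items()]
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    ref = "ACGTAC"
    haps = {"sample1_hapA": {2: "T"}, "sample1_hapB": {}}
    print(to_fasta(ref, haps), end="")
```

    Because substitutions do not change sequence length, every record has the same length as the reference, so the output is already a column-wise alignment that a tree-building program could read.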

  15. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes.

    Science.gov (United States)

    Treangen, Todd J; Ondov, Brian D; Koren, Sergey; Phillippy, Adam M

    2014-01-01

    Whole-genome sequences are now available for many microbial species and clades; however, existing whole-genome alignment methods are limited in their ability to perform sequence comparisons of multiple sequences simultaneously. Here we present the Harvest suite of core-genome alignment and visualization tools for the rapid and simultaneous analysis of thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Together they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.

  16. Accurate estimation of short read mapping quality for next-generation genome sequencing

    Science.gov (United States)

    Ruffalo, Matthew; Koyutürk, Mehmet; Ray, Soumya; LaFramboise, Thomas

    2012-01-01

    Motivation: Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this quality score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy and the qualities of many mappings are underestimated, encouraging the researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important in accurately identifying genomic variants. Approach: We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and the alignment (number of matches, mismatches and deletions, mapping quality score returned by the alignment tool, if available, and number of mappings) as features for classification and uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality. Results: We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can ‘resurrect’ many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that the recalibration of mapping quality scores greatly enhances the precision of called single nucleotide polymorphisms. Availability: LoQuM is available as open source at http://compbio.case.edu/loqum/. Contact: matthew.ruffalo@case.edu. PMID:22962451
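
    The general shape of such a recalibration can be sketched as follows: a logistic model turns alignment features into a probability that the mapping is correct, and that probability is converted to a Phred-scaled mapping quality. This is not LoQuM's code. The feature vector, weights and bias are made-up placeholders (a real model would be fitted on simulated reads), and reporting the result as -10 * log10(P(wrong)) capped at 60 is an assumption based on the usual mapping-quality convention.

```python
import math


def logistic(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predicted_correct_probability(features: list[float],
                                  weights: list[float],
                                  bias: float) -> float:
    """P(mapping is correct) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


def phred_mapq(p_correct: float, cap: int = 60) -> int:
    """Phred-scaled mapping quality: -10 * log10(P(wrong)), capped at `cap`."""
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(p_wrong)))


if __name__ == "__main__":
    # Hypothetical features: [matches, mismatches, number of candidate mappings]
    features = [98.0, 2.0, 1.0]
    weights = [0.08, -0.9, -1.5]   # placeholder values, not fitted
    p = predicted_correct_probability(features, weights, bias=-2.0)
    print(p, phred_mapq(p))
```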

  17. Alignment-free genome tree inference by learning group-specific distance metrics.

    Science.gov (United States)

    Patil, Kaustubh R; McHardy, Alice C

    2013-01-01

    Understanding the evolutionary relationships between organisms is vital for their in-depth study. Gene-based methods are often used to infer such relationships, which are not without drawbacks. One can now attempt to use genome-scale information, because of the ever increasing number of genomes available. This opportunity also presents a challenge in terms of computational efficiency. Two fundamentally different methods are often employed for sequence comparisons, namely alignment-based and alignment-free methods. Alignment-free methods rely on the genome signature concept and provide a computationally efficient way that is also applicable to nonhomologous sequences. The genome signature contains evolutionary signal as it is more similar for closely related organisms than for distantly related ones. We used genome-scale sequence information to infer taxonomic distances between organisms without additional information such as gene annotations. We propose a method to improve genome tree inference by learning specific distance metrics over the genome signature for groups of organisms with similar phylogenetic, genomic, or ecological properties. Specifically, our method learns a Mahalanobis metric for a set of genomes and a reference taxonomy to guide the learning process. By applying this method to more than a thousand prokaryotic genomes, we showed that, indeed, better distance metrics could be learned for most of the 18 groups of organisms tested here. Once a group-specific metric is available, it can be used to estimate the taxonomic distances for other sequenced organisms from the group. This study also presents a large scale comparison between 10 methods--9 alignment-free and 1 alignment-based.
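
    A minimal sketch of the two ingredients named above: a genome signature (here, normalized k-mer frequencies) and a Mahalanobis-style distance between two signatures. The metric-learning step, which is the paper's contribution, is not shown: the matrix M is simply passed in, and the identity matrix used in the example reduces the distance to plain Euclidean distance. The choice of k, the function names and the toy sequences are assumptions.

```python
import itertools
import math


def kmer_signature(seq: str, k: int = 2) -> list[float]:
    """Normalized frequencies of all 4**k DNA k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:        # skip windows containing N or other symbols
            counts[window] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[km] / total for km in kmers]


def mahalanobis(x: list[float], y: list[float], M: list[list[float]]) -> float:
    """sqrt((x - y)^T M (x - y)) for a positive semi-definite matrix M."""
    d = [a - b for a, b in zip(x, y)]
    q = sum(d[i] * M[i][j] * d[j]
            for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(max(q, 0.0))


def identity(n: int) -> list[list[float]]:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    sig_a = kmer_signature("ACGTACGTTTGACCA")
    sig_b = kmer_signature("GGGCGCGCCGGATCC")
    print(mahalanobis(sig_a, sig_b, identity(len(sig_a))))
```

    Replacing the identity matrix with a learned M reweights the k-mer dimensions, which is how a group-specific metric would change the resulting distances.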

  18. SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees.

    Science.gov (United States)

    Liu, Kevin; Warnow, Tandy J; Holder, Mark T; Nelesen, Serita M; Yu, Jiaye; Stamatakis, Alexandros P; Linder, C Randal

    2012-01-01

    Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of

  19. Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver

    Science.gov (United States)

    Blanquart, François; Golubchik, Tanya; Gall, Astrid; Bakker, Margreet; Bezemer, Daniela; Croucher, Nicholas J; Hall, Matthew; Hillebregt, Mariska; Ratmann, Oliver; Albert, Jan; Bannert, Norbert; Fellay, Jacques; Fransen, Katrien; Gourlay, Annabelle; Grabowski, M Kate; Gunsenheimer-Bartmeyer, Barbara; Günthard, Huldrych F; Kivelä, Pia; Kouyos, Roger; Laeyendecker, Oliver; Liitsola, Kirsi; Meyer, Laurence; Porter, Kholoud; Ristola, Matti; van Sighem, Ard; Cornelissen, Marion; Kellam, Paul; Reiss, Peter

    2018-01-01

    Abstract Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver’s constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered. We also

  20. Alignment of Escherichia coli K12 DNA sequences to a genomic restriction map.

    Science.gov (United States)

    Rudd, K E; Miller, W; Ostell, J; Benson, D A

    1990-01-25

    We use the extensive published information describing the genome of Escherichia coli and new restriction map alignment software to align DNA sequence, genetic, and physical maps. Restriction map alignment software is used which considers restriction maps as strings analogous to DNA or protein sequences except that two values, enzyme name and DNA base address, are associated with each position on the string. The resulting alignments reveal a nearly linear relationship between the physical and genetic maps of the E. coli chromosome. Physical map comparisons with the 1976, 1980, and 1983 genetic maps demonstrate a better fit with the more recent maps. The results of these alignments are genomic kilobase coordinates, orientation and rank of the alignment that best fits the genetic data. A statistical measure based on extreme value distribution is applied to the alignments. Additional computer analyses allow us to estimate the accuracy of the published E. coli genomic restriction map, simulate rearrangements of the bacterial chromosome, and search for repetitive DNA. The procedures we used are general enough to be applicable to other genome mapping projects.
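    The record above treats a restriction map as a string whose positions each carry two values, an enzyme name and a base address. As a rough illustration of that data model only, the Python sketch below slides a short sequence-derived map along a genomic map and counts sites that agree in enzyme and position. The function name `best_offset`, the tolerance `tol`, and the site-counting score are assumptions made for this example; they are not the scoring scheme or software described by Rudd et al.

```python
from bisect import bisect_left
from collections import defaultdict


def best_offset(query, genome, tol=50):
    """Return (matched_sites, offset) for the best placement of `query` on `genome`.

    Both maps are lists of (enzyme_name, base_position) pairs.
    Hypothetical scoring: a query site counts as matched when the genome has a
    site for the same enzyme within `tol` bases of the shifted position.
    """
    by_enzyme = defaultdict(list)
    for enzyme, pos in genome:
        by_enzyme[enzyme].append(pos)
    for positions in by_enzyme.values():
        positions.sort()

    def has_site(enzyme, target):
        positions = by_enzyme.get(enzyme, [])
        i = bisect_left(positions, target - tol)
        return i < len(positions) and positions[i] <= target + tol

    best = (0, None)
    first_enzyme, first_pos = query[0]
    # Anchor the first query site on every genome site cut by the same enzyme.
    for gpos in by_enzyme.get(first_enzyme, []):
        offset = gpos - first_pos
        score = sum(has_site(e, p + offset) for e, p in query)
        if score > best[0]:
            best = (score, offset)
    return best


genome_map = [("EcoRI", 1000), ("BamHI", 1500), ("EcoRI", 4000),
              ("HindIII", 4700), ("BamHI", 5200)]
sequence_map = [("EcoRI", 100), ("HindIII", 810), ("BamHI", 1290)]
print(best_offset(sequence_map, genome_map))
```

    With these made-up coordinates, all three query sites line up when the sequence is placed 3,900 bases into the genomic map. A real aligner would also need to handle missing or extra sites and reverse orientation.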

  1. HAL: a hierarchical format for storing and analyzing multiple genome alignments.

    Science.gov (United States)

    Hickey, Glenn; Paten, Benedict; Earl, Dent; Zerbino, Daniel; Haussler, David

    2013-05-15

    Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance. We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover). All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal. Contact: hickey@soe.ucsc.edu or haussler@soe.ucsc.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

  2. Accurate and robust brain image alignment using boundary-based registration.

    Science.gov (United States)

    Greve, Douglas N; Fischl, Bruce

    2009-10-15

    The fine spatial scales of the structures in the human brain represent an enormous challenge to the successful integration of information from different images for both within- and between-subject analysis. While many algorithms to register image pairs from the same subject exist, visual inspection shows their accuracy and robustness to be suspect, particularly when there are strong intensity gradients and/or only part of the brain is imaged. This paper introduces a new algorithm called Boundary-Based Registration, or BBR. The novelty of BBR is that it treats the two images very differently. The reference image must be of sufficient resolution and quality to extract surfaces that separate tissue types. The input image is then aligned to the reference by maximizing the intensity gradient across tissue boundaries. Several lower quality images can be aligned through their alignment with the reference. Visual inspection and fMRI results show that BBR is more accurate than correlation ratio or normalized mutual information and is considerably more robust to even strong intensity inhomogeneities. BBR also excels at aligning partial-brain images to whole-brain images, a domain in which existing registration algorithms frequently fail. Even in the limit of registering a single slice, we show the BBR results to be robust and accurate.

  3. W-curve alignments for HIV-1 genomic comparisons.

    Directory of Open Access Journals (Sweden)

    Douglas J Cork

    2010-06-01

    The W-curve was originally developed as a graphical visualization technique for viewing DNA and RNA sequences. Its ability to render features of DNA also makes it suitable for computational studies. Its main advantage in this area is utilizing a single-pass algorithm for comparing the sequences. Avoiding recursion during sequence alignments offers advantages for speed and in-process resources. The graphical technique also allows for multiple models of comparison to be used depending on the nucleotide patterns embedded in similar whole genomic sequences. The W-curve approach allows us to compare large numbers of samples quickly. We are currently tuning the algorithm to accommodate quirks specific to HIV-1 genomic sequences so that it can be used to aid in diagnostic and vaccine efforts. Tracking the molecular evolution of the virus has been greatly hampered by gap associated problems predominantly embedded within the envelope gene of the virus. Gaps and hypermutation of the virus slow conventional string based alignments of the whole genome. This paper describes the W-curve algorithm itself, and how we have adapted it for comparison of similar HIV-1 genomes. A treebuilding method is developed with the W-curve that utilizes a novel Cylindrical Coordinate distance method and gap analysis method. HIV-1 C2-V5 env sequence regions from a Mother/Infant cohort study are used in the comparison. The output distance matrix and neighbor results produced by the W-curve are functionally equivalent to those from Clustal for C2-V5 sequences in the mother/infant pairs infected with CRF01_AE. Significant potential exists for utilizing this method in place of conventional string based alignment of HIV-1 genomes, such as Clustal X. With W-curve heuristic alignment, it may be possible to obtain clinically useful results in a short time, short enough to affect clinical choices for acute treatment. A description of the W-curve generation process, including a comparison

  4. W-curve alignments for HIV-1 genomic comparisons.

    Science.gov (United States)

    Cork, Douglas J; Lembark, Steven; Tovanabutra, Sodsai; Robb, Merlin L; Kim, Jerome H

    2010-06-01

    The W-curve was originally developed as a graphical visualization technique for viewing DNA and RNA sequences. Its ability to render features of DNA also makes it suitable for computational studies. Its main advantage in this area is utilizing a single-pass algorithm for comparing the sequences. Avoiding recursion during sequence alignments offers advantages for speed and in-process resources. The graphical technique also allows for multiple models of comparison to be used depending on the nucleotide patterns embedded in similar whole genomic sequences. The W-curve approach allows us to compare large numbers of samples quickly. We are currently tuning the algorithm to accommodate quirks specific to HIV-1 genomic sequences so that it can be used to aid in diagnostic and vaccine efforts. Tracking the molecular evolution of the virus has been greatly hampered by gap associated problems predominantly embedded within the envelope gene of the virus. Gaps and hypermutation of the virus slow conventional string based alignments of the whole genome. This paper describes the W-curve algorithm itself, and how we have adapted it for comparison of similar HIV-1 genomes. A treebuilding method is developed with the W-curve that utilizes a novel Cylindrical Coordinate distance method and gap analysis method. HIV-1 C2-V5 env sequence regions from a Mother/Infant cohort study are used in the comparison. The output distance matrix and neighbor results produced by the W-curve are functionally equivalent to those from Clustal for C2-V5 sequences in the mother/infant pairs infected with CRF01_AE. Significant potential exists for utilizing this method in place of conventional string based alignment of HIV-1 genomes, such as Clustal X. With W-curve heuristic alignment, it may be possible to obtain clinically useful results in a short time, short enough to affect clinical choices for acute treatment. A description of the W-curve generation process, including a comparison technique of

  5. Base-By-Base: single nucleotide-level analysis of whole viral genome alignments.

    Science.gov (United States)

    Brodie, Ryan; Smith, Alex J; Roper, Rachel L; Tcherepanov, Vasily; Upton, Chris

    2004-07-14

    With ever increasing numbers of closely related virus genomes being sequenced, it has become desirable to be able to compare two genomes at a level more detailed than gene content because two strains of an organism may share the same set of predicted genes but still differ in their pathogenicity profiles. For example, detailed comparison of multiple isolates of the smallpox virus genome (each approximately 200 kb, with 200 genes) is not feasible without new bioinformatics tools. A software package, Base-By-Base, has been developed that provides visualization tools to enable researchers to 1) rapidly identify and correct alignment errors in large, multiple genome alignments; and 2) generate tabular and graphical output of differences between the genomes at the nucleotide level. Base-By-Base uses detailed annotation information about the aligned genomes and can list each predicted gene with nucleotide differences, display whether variations occur within promoter regions or coding regions and whether these changes result in amino acid substitutions. Base-By-Base can connect to our mySQL database (Virus Orthologous Clusters; VOCs) to retrieve detailed annotation information about the aligned genomes or use information from text files. Base-By-Base enables users to quickly and easily compare large viral genomes; it highlights small differences that may be responsible for important phenotypic differences such as virulence. It is available via the Internet using Java Web Start and runs on Macintosh, PC and Linux operating systems with the Java 1.4 virtual machine.

  6. Multiple Whole Genome Alignments and Novel Biomedical Applications at the VISTA Portal

    Energy Technology Data Exchange (ETDEWEB)

    Brudno, Michael; Poliakov, Alexander; Minovitsky, Simon; Ratnere, Igor; Dubchak, Inna

    2007-02-01

    The VISTA portal for comparative genomics is designed to give biomedical scientists a unified set of tools to lead them from the raw DNA sequences through the alignment and annotation to the visualization of the results. The VISTA portal also hosts alignments of a number of genomes computed by our group, allowing users to study regions of their interest without having to manually download the individual sequences. Here we describe various algorithmic and functional improvements implemented in the VISTA portal over the last two years. The VISTA Portal is accessible at http://genome.lbl.gov/vista.

  7. Base-By-Base: Single nucleotide-level analysis of whole viral genome alignments

    Directory of Open Access Journals (Sweden)

    Tcherepanov Vasily

    2004-07-01

    Abstract Background With ever increasing numbers of closely related virus genomes being sequenced, it has become desirable to be able to compare two genomes at a level more detailed than gene content because two strains of an organism may share the same set of predicted genes but still differ in their pathogenicity profiles. For example, detailed comparison of multiple isolates of the smallpox virus genome (each approximately 200 kb, with 200 genes) is not feasible without new bioinformatics tools. Results A software package, Base-By-Base, has been developed that provides visualization tools to enable researchers to 1) rapidly identify and correct alignment errors in large, multiple genome alignments; and 2) generate tabular and graphical output of differences between the genomes at the nucleotide level. Base-By-Base uses detailed annotation information about the aligned genomes and can list each predicted gene with nucleotide differences, display whether variations occur within promoter regions or coding regions and whether these changes result in amino acid substitutions. Base-By-Base can connect to our mySQL database (Virus Orthologous Clusters; VOCs) to retrieve detailed annotation information about the aligned genomes or use information from text files. Conclusion Base-By-Base enables users to quickly and easily compare large viral genomes; it highlights small differences that may be responsible for important phenotypic differences such as virulence. It is available via the Internet using Java Web Start and runs on Macintosh, PC and Linux operating systems with the Java 1.4 virtual machine.

  8. Fast and accurate phylogeny reconstruction using filtered spaced-word matches

    Science.gov (United States)

    Sohrabi-Jahromi, Salma; Morgenstern, Burkhard

    2017-01-01

    Abstract Motivation: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. Availability and Implementation: The program source code for FSWM, including documentation, as well as the software that we used to generate artificial genome sequences, are freely available at http://fswm.gobics.de/ Contact: chris.leimeister@stud.uni-goettingen.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28073754
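    The three steps in the record above (find spaced-word matches, filter out weak ones, estimate substitutions from the don't-care positions) can be sketched in a few lines of Python. This is a toy illustration, not the FSWM program: the function name, the default pattern, the `min_identity` filter (a simple identity fraction rather than a scored threshold), and the use of the Jukes-Cantor correction to turn a mismatch fraction into substitutions per site are all assumptions made for the example.

```python
import math
from collections import defaultdict


def spaced_word_distance(s1, s2, pattern="1100101", min_identity=0.5):
    """Estimate substitutions per site from spaced-word matches (toy version).

    pattern: '1' marks a match position, '0' a don't-care position.
    min_identity: hypothetical filter; matches whose don't-care positions
    agree less than this fraction are discarded as likely random hits.
    Returns None when no match survives the filter.
    """
    match_idx = [i for i, c in enumerate(pattern) if c == "1"]
    care_idx = [i for i, c in enumerate(pattern) if c == "0"]
    if not match_idx or not care_idx:
        raise ValueError("pattern needs both match and don't-care positions")
    width = len(pattern)

    # Index every window of s1 by its characters at the match positions.
    index = defaultdict(list)
    for i in range(len(s1) - width + 1):
        index["".join(s1[i + k] for k in match_idx)].append(i)

    mismatches = total = 0
    for j in range(len(s2) - width + 1):
        key = "".join(s2[j + k] for k in match_idx)
        for i in index.get(key, []):
            diff = sum(s1[i + k] != s2[j + k] for k in care_idx)
            if 1 - diff / len(care_idx) < min_identity:
                continue  # filtered out as a spurious match
            mismatches += diff
            total += len(care_idx)

    if total == 0:
        return None
    p = mismatches / total
    if p >= 0.75:
        return math.inf  # Jukes-Cantor correction is undefined here
    return -0.75 * math.log(1 - 4 * p / 3)
```

    The filter matters because a short pattern produces many chance matches between unrelated regions; without it, those matches would inflate the mismatch fraction.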

  9. NINJA-OPS: Fast Accurate Marker Gene Alignment Using Concatenated Ribosomes.

    Directory of Open Access Journals (Sweden)

    Gabriel A Al-Ghalith

    2016-01-01

    The explosion of bioinformatics technologies in the form of next generation sequencing (NGS) has facilitated a massive influx of genomics data in the form of short reads. Short read mapping is therefore a fundamental component of next generation sequencing pipelines which routinely match these short reads against reference genomes for contig assembly. However, such techniques have seldom been applied to microbial marker gene sequencing studies, which have mostly relied on novel heuristic approaches. We propose NINJA Is Not Just Another OTU-Picking Solution (NINJA-OPS, or NINJA for short), a fast and highly accurate novel method enabling reference-based marker gene matching (picking Operational Taxonomic Units, or OTUs). NINJA takes advantage of the Burrows-Wheeler (BW) alignment using an artificial reference chromosome composed of concatenated reference sequences, the "concatesome," as the BW input. Other features include automatic support for paired-end reads with arbitrary insert sizes. NINJA is also free and open source and implements several pre-filtering methods that elicit substantial speedup when coupled with existing tools. We applied NINJA to several published microbiome studies, obtaining accuracy similar to or better than previous reference-based OTU-picking methods while achieving an order of magnitude or more speedup and using a fraction of the memory footprint. NINJA is a complete pipeline that takes a FASTA-formatted input file and outputs a QIIME-formatted taxonomy-annotated BIOM file for an entire MiSeq run of human gut microbiome 16S genes in under 10 minutes on a dual-core laptop.
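    The "concatesome" idea in the record above, joining all reference sequences into one artificial chromosome and translating hits back to the originating reference, can be illustrated with a small Python sketch. The helper names are hypothetical, and exact substring search with `str.find` stands in for the Burrows-Wheeler aligner purely for illustration; none of this is NINJA-OPS code.

```python
from bisect import bisect_right


def build_concatesome(refs):
    """Concatenate reference sequences; remember where each one starts.

    refs: dict mapping a reference name to its sequence.
    Returns (concatenated_sequence, start_offsets, names).
    """
    names, starts, parts = [], [], []
    pos = 0
    for name, seq in refs.items():
        names.append(name)
        starts.append(pos)
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), starts, names


def locate(read, concat, starts, names):
    """Return (reference_name, local_position) of the first valid hit, or None.

    Exact matching is an assumption of this sketch, standing in for an aligner.
    Hits that straddle the junction between two references are rejected,
    since they do not correspond to any real reference sequence.
    """
    hit = concat.find(read)
    while hit != -1:
        k = bisect_right(starts, hit) - 1          # reference containing the hit start
        end = starts[k + 1] if k + 1 < len(starts) else len(concat)
        if hit + len(read) <= end:
            return names[k], hit - starts[k]
        hit = concat.find(read, hit + 1)
    return None


concat, starts, names = build_concatesome({"otu1": "ACGTACGT", "otu2": "GGCCTTAA"})
print(locate("CCTT", concat, starts, names))   # hit inside otu2
print(locate("GTGG", concat, starts, names))   # only matches across the junction
```

    Keeping the start offsets sorted lets a binary search recover the source reference for any hit, so a single index over the concatenation can serve every reference at once.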

  10. How accurate is anatomic limb alignment in predicting mechanical limb alignment after total knee arthroplasty?

    Science.gov (United States)

    Lee, Seung Ah; Choi, Sang-Hee; Chang, Moon Jong

    2015-10-27

    Anatomic limb alignment often differs from mechanical limb alignment after total knee arthroplasty (TKA). We sought to assess the accuracy, specificity, and sensitivity for each of three commonly used ranges for anatomic limb alignment (3-9°, 5-10° and 2-10°) in predicting an acceptable range (neutral ± 3°) for mechanical limb alignment after TKA. We also assessed whether the accuracy of anatomic limb alignment was affected by anatomic variation. This retrospective study included 314 primary TKAs. The alignment of the limb was measured with both anatomic and mechanical methods of measurement. We also measured anatomic variation, including the femoral bowing angle, tibial bowing angle, and neck-shaft angle of the femur. All angles were measured on the same full-length standing anteroposterior radiographs. The accuracy, specificity, and sensitivity for each range of anatomic limb alignment were calculated and compared using mechanical limb alignment as the reference standard. The associations between the accuracy of anatomic limb alignment and anatomic variation were also determined. The range of 2-10° for anatomic limb alignment showed the highest accuracy, but it was only 73 % (3-9°, 65 %; 5-10°, 67 %). The specificity of the 2-10° range was 81 %, which was higher than that of the other ranges (3-9°, 69 %; 5-10°, 67 %). However, the sensitivity of the 2-10° range to predict varus malalignment was only 16 % (3-9°, 35 %; 5-10°, 68 %). In addition, the sensitivity of the 2-10° range to predict valgus malalignment was only 43 % (3-9°, 71 %; 5-10°, 43 %). The accuracy of anatomical limb alignment was lower for knees with greater femoral (odds ratio = 1.2) and tibial (odds ratio = 1.2) bowing. Anatomic limb alignment did not accurately predict mechanical limb alignment after TKA, and its accuracy was affected by anatomic variation. Thus, alignment after TKA should be assessed by measuring mechanical alignment rather than anatomic

  11. Evaluation of microRNA alignment techniques

    Science.gov (United States)

    Kaspi, Antony; El-Osta, Assam

    2016-01-01

    Genomic alignment of small RNA (smRNA) sequences such as microRNAs poses considerable challenges due to their short length (∼21 nucleotides [nt]) as well as the large size and complexity of plant and animal genomes. While several tools have been developed for high-throughput mapping of longer mRNA-seq reads (>30 nt), there are few that are specifically designed for mapping of smRNA reads including microRNAs. The accuracy of these mappers has not been systematically determined in the case of smRNA-seq. In addition, it is unknown whether these aligners accurately map smRNA reads containing sequence errors and polymorphisms. By using simulated read sets, we determine the alignment sensitivity and accuracy of 16 short-read mappers and quantify their robustness to mismatches, indels, and nontemplated nucleotide additions. These were explored in the context of a plant genome (Oryza sativa, ∼500 Mbp) and a mammalian genome (Homo sapiens, ∼3.1 Gbp). Analysis of simulated and real smRNA-seq data demonstrates that mapper selection impacts differential expression results and interpretation. These results will inform on best practice for smRNA mapping and enable more accurate smRNA detection and quantification of expression and RNA editing. PMID:27284164

  12. Dense and accurate whole-chromosome haplotyping of individual genomes

    NARCIS (Netherlands)

    Porubsky, David; Garg, Shilpa; Sanders, Ashley D.; Korbel, Jan O.; Guryev, Victor; Lansdorp, Peter M.; Marschall, Tobias

    2017-01-01

    The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. This lack of haplotype-level analyses can be explained by a lack of methods that can produce dense and accurate

  13. Genomic divergences among cattle, dog and human estimated from large-scale alignments of genomic sequences

    Directory of Open Access Journals (Sweden)

    Shade Larry L

    2006-06-01

    Abstract Background Approximately 11 Mb of finished high quality genomic sequences were sampled from cattle, dog and human to estimate genomic divergences and their regional variation among these lineages. Results Optimal three-way multi-species global sequence alignments for 84 cattle clones or loci (each >50 kb of genomic sequence) were constructed using the human and dog genome assemblies as references. Genomic divergences and substitution rates were examined for each clone and for various sequence classes under different functional constraints. Analysis of these alignments revealed that the overall genomic divergences are relatively constant (0.32–0.37 change/site) for pairwise comparisons among cattle, dog and human; however substitution rates vary across genomic regions and among different sequence classes. A neutral mutation rate (2.0–2.2 × 10^-9 change/site/year) was derived from ancestral repetitive sequences, whereas the substitution rate in coding sequences (1.1 × 10^-9 change/site/year) was approximately half of the overall rate (1.9–2.0 × 10^-9 change/site/year). Relative rate tests also indicated that cattle have a significantly faster rate of substitution as compared to dog and that this difference is about 6%. Conclusion This analysis provides a large-scale and unbiased assessment of genomic divergences and regional variation of substitution rates among cattle, dog and human. It is expected that these data will serve as a baseline for future mammalian molecular evolution studies.

  14. SPA: a probabilistic algorithm for spliced alignment.

    Directory of Open Access Journals (Sweden)

    2006-04-01

    Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it is essential that cDNAs are very accurately mapped to their respective genomes. Currently available algorithms for cDNA-to-genome alignment do not reach the necessary level of accuracy because they use ad hoc scoring models that cannot correctly trade off the likelihoods of various sequencing errors against the probabilities of different gene structures. Here we develop a Bayesian probabilistic approach to cDNA-to-genome alignment. Gene structures are assigned prior probabilities based on the lengths of their introns and exons, and based on the sequences at their splice boundaries. A likelihood model for sequencing errors takes into account the rates at which misincorporation, as well as insertions and deletions of different lengths, occurs during sequencing. The parameters of both the prior and likelihood model can be automatically estimated from a set of cDNAs, thus enabling our method to adapt itself to different organisms and experimental procedures. We implemented our method in a fast cDNA-to-genome alignment program, SPA, and applied it to the FANTOM3 dataset of over 100,000 full-length mouse cDNAs and a dataset of over 20,000 full-length human cDNAs. Comparison with the results of four other mapping programs shows that SPA produces alignments of significantly higher quality. In particular, the quality of the SPA alignments near splice boundaries and SPA's mapping of the 5' and 3' ends of the cDNAs are highly improved, allowing for more accurate identification of transcript starts and ends, and accurate identification of subtle splice variations. Finally, our splice boundary analysis on the human dataset suggests the existence of a novel non

  15. Alignment-free phylogeny of whole genomes using underlying subwords

    Directory of Open Access Journals (Sweden)

    Comin Matteo

    2012-12-01

    Abstract Background With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they are not alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to their lengths when the organisms belong to different species. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free methods. Methods In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to capture the statistics of common words between two sequences, can be derived from a small set of “independent” subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, such that each region of genomes contributes only once, thus avoiding counting shared subwords multiple times. In a nutshell, this filter discards subwords occurring in regions covered by other more significant subwords. Results The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state of the art methods. Moreover, we show that the accuracy of UA is achieved with a very small number of subwords, which in some cases carry meaningful biological information. Availability http://www.dei.unipd.it/∼ciompin/main/underlying.html
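    The record above builds on matching statistics: for each position i of one sequence, the length of the longest substring starting at i that also occurs somewhere in the other sequence. The short Python sketch below computes them naively to make the definition concrete. It does not implement the irredundant-common-subword filter or the UA distance; the function name is an assumption, and practical implementations would use a suffix tree or similar index rather than repeated substring search.

```python
def matching_statistics(s, t):
    """ms[i] = length of the longest prefix of s[i:] that occurs in t.

    Naive version for illustration. It relies on the standard property
    ms[i + 1] >= ms[i] - 1, so each position resumes from the previous
    match length minus one instead of restarting at zero.
    """
    ms = []
    length = 0
    for i in range(len(s)):
        length = max(length - 1, 0)
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms


print(matching_statistics("ABCAB", "XABCY"))
```

    Overlapping positions report overlapping matches (for example "ABC", then "BC", then "C"), which is the kind of redundancy the record's filter is designed to remove so that each shared region contributes once.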

  16. Rapid and accurate pyrosequencing of angiosperm plastid genomes

    Science.gov (United States)

    Moore, Michael J; Dhingra, Amit; Soltis, Pamela S; Shaw, Regina; Farmerie, William G; Folta, Kevin M; Soltis, Douglas E

    2006-01-01

    Background Plastid genome sequence information is vital to several disciplines in plant biology, including phylogenetics and molecular biology. The past five years have witnessed a dramatic increase in the number of completely sequenced plastid genomes, fuelled largely by advances in conventional Sanger sequencing technology. Here we report a further significant reduction in time and cost for plastid genome sequencing through the successful use of a newly available pyrosequencing platform, the Genome Sequencer 20 (GS 20) System (454 Life Sciences Corporation), to rapidly and accurately sequence the whole plastid genomes of the basal eudicot angiosperms Nandina domestica (Berberidaceae) and Platanus occidentalis (Platanaceae). Results More than 99.75% of each plastid genome was simultaneously obtained during two GS 20 sequence runs, to an average depth of coverage of 24.6× in Nandina and 17.3× in Platanus. The Nandina and Platanus plastid genomes shared essentially identical gene complements and possessed the typical angiosperm plastid structure and gene arrangement. To assess the accuracy of the GS 20 sequence, over 45 kilobases of sequence were generated for each genome using conventional sequencing. Overall error rates of 0.043% and 0.031% were observed in GS 20 sequence for Nandina and Platanus, respectively. More than 97% of all observed errors were associated with homopolymer runs, with ~60% of all errors associated with homopolymer runs of 5 or more nucleotides and ~50% of all errors associated with regions of extensive homopolymer runs. No substitution errors were present in either genome. Error rates were generally higher in the single-copy and noncoding regions of both plastid genomes relative to the inverted repeat and coding regions. Conclusion Highly accurate and essentially complete sequence information was obtained for the Nandina and Platanus plastid genomes using the GS 20 System. More importantly, the high accuracy observed in the GS 20 plastid

  17. Rapid and accurate pyrosequencing of angiosperm plastid genomes

    Directory of Open Access Journals (Sweden)

    Farmerie William G

    2006-08-01

    Abstract Background Plastid genome sequence information is vital to several disciplines in plant biology, including phylogenetics and molecular biology. The past five years have witnessed a dramatic increase in the number of completely sequenced plastid genomes, fuelled largely by advances in conventional Sanger sequencing technology. Here we report a further significant reduction in time and cost for plastid genome sequencing through the successful use of a newly available pyrosequencing platform, the Genome Sequencer 20 (GS 20) System (454 Life Sciences Corporation), to rapidly and accurately sequence the whole plastid genomes of the basal eudicot angiosperms Nandina domestica (Berberidaceae) and Platanus occidentalis (Platanaceae). Results More than 99.75% of each plastid genome was simultaneously obtained during two GS 20 sequence runs, to an average depth of coverage of 24.6× in Nandina and 17.3× in Platanus. The Nandina and Platanus plastid genomes shared essentially identical gene complements and possessed the typical angiosperm plastid structure and gene arrangement. To assess the accuracy of the GS 20 sequence, over 45 kilobases of sequence were generated for each genome using conventional sequencing. Overall error rates of 0.043% and 0.031% were observed in GS 20 sequence for Nandina and Platanus, respectively. More than 97% of all observed errors were associated with homopolymer runs, with ~60% of all errors associated with homopolymer runs of 5 or more nucleotides and ~50% of all errors associated with regions of extensive homopolymer runs. No substitution errors were present in either genome. Error rates were generally higher in the single-copy and noncoding regions of both plastid genomes relative to the inverted repeat and coding regions. Conclusion Highly accurate and essentially complete sequence information was obtained for the Nandina and Platanus plastid genomes using the GS 20 System. More importantly, the high accuracy

  18. Strawberry: Fast and accurate genome-guided transcript reconstruction and quantification from RNA-Seq.

    Science.gov (United States)

    Liu, Ruolin; Dickerson, Julie

    2017-11-01

    We propose a novel method and software tool, Strawberry, for transcript reconstruction and quantification from RNA-Seq data under the guidance of genome alignment and independent of gene annotation. Strawberry consists of two modules: assembly and quantification. The novelty of Strawberry is that the two modules use different optimization frameworks but utilize the same data graph structure, which allows a highly efficient, expandable and accurate algorithm for dealing with large data. The assembly module parses aligned reads into splicing graphs, and uses network flow algorithms to select the most likely transcripts. The quantification module uses a latent class model to assign read counts from the nodes of splicing graphs to transcripts. Strawberry simultaneously estimates the transcript abundances and corrects for sequencing bias through an EM algorithm. Based on simulations, Strawberry outperforms Cufflinks and StringTie in terms of both assembly and quantification accuracies. Under the evaluation of a real data set, the estimated transcript expression by Strawberry has the highest correlation with Nanostring probe counts, an independent experimental measure for transcript expression. Strawberry is written in C++14, and is available as open source software at https://github.com/ruolin/strawberry under the MIT license.
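    The idea of estimating transcript abundances with an EM algorithm can be illustrated with a minimal sketch. This is not Strawberry's latent class model and it omits bias correction and transcript length; the function name `em_abundance` and the input layout (one tuple of compatible transcript IDs per read, never empty) are assumptions made only for illustration.

```python
from collections import defaultdict


def em_abundance(read_compat, n_iter=200, tol=1e-9):
    """Estimate the fraction of reads originating from each transcript.

    read_compat: list of non-empty tuples; each tuple lists the transcripts
    that one read is compatible with (hypothetical input layout).
    """
    transcripts = sorted({t for compat in read_compat for t in compat})
    # Start from a uniform abundance estimate.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: share each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: new abundance = expected share of all reads.
        new_theta = {t: counts[t] / len(read_compat) for t in transcripts}
        delta = max(abs(new_theta[t] - theta[t]) for t in transcripts)
        theta = new_theta
        if delta < tol:
            break
    return theta


if __name__ == "__main__":
    reads = [("A",)] * 6 + [("A", "B")] * 2 + [("B",)] * 2
    print(em_abundance(reads))
```

    In this toy input the two ambiguous reads are shared between A and B according to the evidence from the uniquely assigned reads, which is the behaviour an EM-based quantifier relies on.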

  19. Improved fingercode alignment for accurate and compact fingerprint recognition

    CSIR Research Space (South Africa)

    Brown, Dane

    2016-05-01

    Full Text Available Alignment for Accurate and Compact Fingerprint Recognition Dane Brown∗† and Karen Bradshaw∗ ∗Department of Computer Science Rhodes University Grahamstown, South Africa †Council for Scientific and Industrial Research Modelling and Digital Sciences Pretoria.... The experimental analysis and results are discussed in Section IV. Section V concludes the paper. II. RELATED STUDIES FingerCode [1] uses circular tessellation of filtered fingerprint images centered at the reference point, which results in a circular ROI...

  20. Towards accurate de novo assembly for genomes with repeats

    NARCIS (Netherlands)

    Bucur, Doina

    2017-01-01

    De novo genome assemblers designed for short k-mer length or using short raw reads are unlikely to recover complex features of the underlying genome, such as repeats hundreds of bases long. We implement a stochastic machine-learning method which obtains accurate assemblies with repeats and

  1. Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS.

    Directory of Open Access Journals (Sweden)

    Sven Warris

    Full Text Available To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation.

  2. Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS.

    Science.gov (United States)

    Warris, Sven; Yalcin, Feyruz; Jackson, Katherine J L; Nap, Jan Peter

    2015-01-01

    To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation.
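    As an illustration of the alignment details mentioned above (score, number of gaps and mismatches), the following is a minimal single-threaded Smith-Waterman sketch. It is not PaSWAS and has no GPU component; the function name `smith_waterman`, the linear gap penalty and the scoring values are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b with a linear gap penalty.

    Returns the best score plus the number of matches, mismatches and
    gap positions found on one optimal traceback path.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the dynamic-programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    matches = mismatches = gaps = 0
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                matches += 1
            else:
                mismatches += 1
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            gaps += 1
            i -= 1
        else:
            gaps += 1
            j -= 1
    return {"score": best, "matches": matches,
            "mismatches": mismatches, "gaps": gaps}


if __name__ == "__main__":
    print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

    The matrix fill is quadratic in the sequence lengths, which is why applications of this kind turn to parallel hardware for large inputs.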

  3. Long Read Alignment with Parallel MapReduce Cloud Platform

    Science.gov (United States)

    Al-Absi, Ahmed Abdulhakim; Kang, Dae-Ki

    2015-01-01

    Genomic sequence alignment is an important technique to decode genome sequences in bioinformatics. Next-Generation Sequencing technologies produce genomic data of longer reads. Cloud platforms are adopted to address the problems arising from storage and analysis of large genomic data. Existing genes sequencing tools for cloud platforms predominantly consider short read gene sequences and adopt the Hadoop MapReduce framework for computation. However, serial execution of map and reduce phases is a problem in such systems. Therefore, in this paper, we introduce Burrows-Wheeler Aligner's Smith-Waterman Alignment on Parallel MapReduce (BWASW-PMR) cloud platform for long sequence alignment. The proposed cloud platform adopts a widely accepted and accurate BWA-SW algorithm for long sequence alignment. A custom MapReduce platform is developed to overcome the drawbacks of the Hadoop framework. A parallel execution strategy of the MapReduce phases and optimization of Smith-Waterman algorithm are considered. Performance evaluation results exhibit an average speed-up of 6.7 considering BWASW-PMR compared with the state-of-the-art Bwasw-Cloud. An average reduction of 30% in the map phase makespan is reported across all experiments comparing BWASW-PMR with Bwasw-Cloud. Optimization of Smith-Waterman results in reducing the execution time by 91.8%. The experimental study proves the efficiency of BWASW-PMR for aligning long genomic sequences on cloud platforms. PMID:26839887

  4. Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS

    NARCIS (Netherlands)

    Warris, S.; Yalcin, F.; Jackson, K.J.; Nap, J.P.H.

    2015-01-01

    Motivation To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate

  5. Simultaneous gene finding in multiple genomes.

    Science.gov (United States)

    König, Stefanie; Romoth, Lars W; Gerischer, Lizzy; Stanke, Mario

    2016-11-15

    As the tree of life is populated with sequenced genomes ever more densely, the new challenge is the accurate and consistent annotation of entire clades of genomes. We address this problem with a new approach to comparative gene finding that takes a multiple genome alignment of closely related species and simultaneously predicts the location and structure of protein-coding genes in all input genomes, thereby exploiting negative selection and sequence conservation. The model prefers potential gene structures in the different genomes that are in agreement with each other, or, if not, where the exon gains and losses are plausible given the species tree. We formulate the multi-species gene finding problem as a binary labeling problem on a graph. The resulting optimization problem is NP hard, but can be efficiently approximated using a subgradient-based dual decomposition approach. The proposed method was tested on whole-genome alignments of 12 vertebrate and 12 Drosophila species. The accuracy was evaluated for human, mouse and Drosophila melanogaster and compared to competing methods. Results suggest that our method is well-suited for annotation of (a large number of) genomes of closely related species within a clade, in particular, when RNA-Seq data are available for many of the genomes. The transfer of existing annotations from one genome to another via the genome alignment is more accurate than previous approaches that are based on protein-spliced alignments, when the genomes are at close to medium distances. The method is implemented in C++ as part of Augustus and available open source at http://bioinf.uni-greifswald.de/augustus/ Contact: stefaniekoenig@ymail.com or mario.stanke@uni-greifswald.de. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  6. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer.

    Science.gov (United States)

    Bernard, Guillaume; Chan, Cheong Xin; Ragan, Mark A

    2016-07-01

    Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.

  7. Long Read Alignment with Parallel MapReduce Cloud Platform

    Directory of Open Access Journals (Sweden)

    Ahmed Abdulhakim Al-Absi

    2015-01-01

    Full Text Available Genomic sequence alignment is an important technique to decode genome sequences in bioinformatics. Next-Generation Sequencing technologies produce genomic data of longer reads. Cloud platforms are adopted to address the problems arising from storage and analysis of large genomic data. Existing genes sequencing tools for cloud platforms predominantly consider short read gene sequences and adopt the Hadoop MapReduce framework for computation. However, serial execution of map and reduce phases is a problem in such systems. Therefore, in this paper, we introduce Burrows-Wheeler Aligner’s Smith-Waterman Alignment on Parallel MapReduce (BWASW-PMR) cloud platform for long sequence alignment. The proposed cloud platform adopts a widely accepted and accurate BWA-SW algorithm for long sequence alignment. A custom MapReduce platform is developed to overcome the drawbacks of the Hadoop framework. A parallel execution strategy of the MapReduce phases and optimization of Smith-Waterman algorithm are considered. Performance evaluation results exhibit an average speed-up of 6.7 considering BWASW-PMR compared with the state-of-the-art Bwasw-Cloud. An average reduction of 30% in the map phase makespan is reported across all experiments comparing BWASW-PMR with Bwasw-Cloud. Optimization of Smith-Waterman results in reducing the execution time by 91.8%. The experimental study proves the efficiency of BWASW-PMR for aligning long genomic sequences on cloud platforms.

  8. SAMMate: a GUI tool for processing short read alignments in SAM/BAM format

    Directory of Open Access Journals (Sweden)

    Flemington Erik

    2011-01-01

    Full Text Available Abstract Background Next Generation Sequencing (NGS) technology generates tens of millions of short reads for each DNA/RNA sample. A key step in NGS data analysis is the short read alignment of the generated sequences to a reference genome. Although storing alignment information in the Sequence Alignment/Map (SAM) or Binary SAM (BAM) format is now standard, biomedical researchers still have difficulty accessing this information. Results We have developed a Graphical User Interface (GUI) software tool named SAMMate. SAMMate allows biomedical researchers to quickly process SAM/BAM files and is compatible with both single-end and paired-end sequencing technologies. SAMMate also automates some standard procedures in DNA-seq and RNA-seq data analysis. Using either standard or customized annotation files, SAMMate allows users to accurately calculate the short read coverage of genomic intervals. In particular, for RNA-seq data SAMMate can accurately calculate the gene expression abundance scores for customized genomic intervals using short reads originating from both exons and exon-exon junctions. Furthermore, SAMMate can quickly calculate a whole-genome signal map at base-wise resolution allowing researchers to solve an array of bioinformatics problems. Finally, SAMMate can export both a wiggle file for alignment visualization in the UCSC genome browser and an alignment statistics report. The biological impact of these features is demonstrated via several case studies that predict miRNA targets using short read alignment information files. Conclusions With just a few mouse clicks, SAMMate will provide biomedical researchers easy access to important alignment information stored in SAM/BAM files. Our software is constantly updated and will greatly facilitate the downstream analysis of NGS data. Both the source code and the GUI executable are freely available under the GNU General Public License at http://sammate.sourceforge.net.
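    To make the notion of short read coverage over a genomic interval concrete, here is a minimal sketch that reads plain-text SAM records and reports the mean per-base depth of an interval. It is not SAMMate's implementation; the function names `aligned_blocks` and `interval_coverage` are hypothetical, and the choice to leave deletions (D) and skipped regions (N, such as introns spanned by junction reads) uncovered is an assumption of this example.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Yield 1-based inclusive (start, end) reference blocks covered by a read."""
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":          # aligned bases: covered, consume reference
            yield ref, ref + length - 1
            ref += length
        elif op in "DN":         # deletion or skipped region: consume only
            ref += length
        # I, S, H and P do not consume reference positions


def interval_coverage(sam_lines, chrom, start, end):
    """Mean per-base read depth over chrom:start-end (1-based, inclusive)."""
    depth = [0] * (end - start + 1)
    for line in sam_lines:
        if line.startswith("@"):             # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 0x4 or rname != chrom or cigar == "*":
            continue                         # unmapped or on another sequence
        for b_start, b_end in aligned_blocks(pos, cigar):
            for p in range(max(b_start, start), min(b_end, end) + 1):
                depth[p - start] += 1
    return sum(depth) / len(depth)


if __name__ == "__main__":
    sam = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t101\t60\t10M\t*\t0\t0\t*\t*",
        "r2\t0\tchr1\t106\t60\t5M100N5M\t*\t0\t0\t*\t*",
    ]
    print(interval_coverage(sam, "chr1", 101, 110))
```

    The second record is a spliced read: its two aligned blocks contribute depth on either side of the skipped region, which is how exon-exon junction reads count towards the exons they join.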

  9. LocARNA-P: Accurate boundary prediction and improved detection of structural RNAs

    DEFF Research Database (Denmark)

    Will, Sebastian; Joshi, Tejal; Hofacker, Ivo L.

    2012-01-01

    Current genomic screens for noncoding RNAs (ncRNAs) predict a large number of genomic regions containing potential structural ncRNAs. The analysis of these data requires highly accurate prediction of ncRNA boundaries and discrimination of promising candidate ncRNAs from weak predictions. Existing...... methods struggle with these goals because they rely on sequence-based multiple sequence alignments, which regularly misalign RNA structure and therefore do not support identification of structural similarities. To overcome this limitation, we compute columnwise and global reliabilities of alignments based...... on sequence and structure similarity; we refer to these structure-based alignment reliabilities as STARs. The columnwise STARs of alignments, or STAR profiles, provide a versatile tool for the manual and automatic analysis of ncRNAs. In particular, we improve the boundary prediction of the widely used nc...

  10. BBMap: A Fast, Accurate, Splice-Aware Aligner

    Energy Technology Data Exchange (ETDEWEB)

    Bushnell, Brian

    2014-03-17

    Alignment of reads is one of the primary computational tasks in bioinformatics. Of paramount importance to resequencing, alignment is also crucial to other areas - quality control, scaffolding, string-graph assembly, homology detection, assembly evaluation, error-correction, expression quantification, and even as a tool to evaluate other tools. An optimal aligner would greatly improve virtually any sequencing process, but optimal alignment is prohibitively expensive for gigabases of data. Here, we will present BBMap [1], a fast splice-aware aligner for short and long reads. We will demonstrate that BBMap has superior speed, sensitivity, and specificity to alternative high-throughput aligners bowtie2 [2], bwa [3], smalt [4], GSNAP [5], and BLASR [6].

  11. A stochastic de novo assembly algorithm for viral-sized genomes obtains correct genomes and builds consensus

    NARCIS (Netherlands)

    Bucur, Doina

    2017-01-01

    A genetic algorithm with stochastic macro mutation operators which merge, split, move, reverse and align DNA contigs on a scaffold is shown to accurately and consistently assemble raw DNA reads from an accurately sequenced single-read library into a contiguous genome. A candidate solution is a

  12. A practical strategy for the accurate measurement of residual dipolar couplings in strongly aligned small molecules

    Science.gov (United States)

    Liu, Yizhou; Cohen, Ryan D.; Martin, Gary E.; Williamson, R. Thomas

    2018-06-01

    Accurate measurement of residual dipolar couplings (RDCs) requires an appropriate degree of alignment in order to optimize data quality. An overly weak alignment yields very small anisotropic data that are susceptible to measurement errors, whereas an overly strong alignment introduces extensive anisotropic effects that severely degrade spectral quality. The ideal alignment amplitude also depends on the specific pulse sequence used for the coupling measurement. In this work, we introduce a practical strategy for the accurate measurement of one-bond 13C-1H RDCs up to a range of ca. -300 to +300 Hz, corresponding to an alignment that is an order of magnitude stronger than typically employed for small molecule structural elucidation. This strong alignment was generated in the mesophase of the commercially available poly-γ-(benzyl-L-glutamate) polymer. The total coupling was measured by the simple and well-studied heteronuclear two-dimensional J-resolved experiment, which performs well in the presence of strong anisotropic effects. In order to unequivocally determine the sign of the total coupling and resolve ambiguities in assigning total couplings in the CH2 group, coupling measurements were conducted at an isotropic condition plus two anisotropic conditions of different alignment amplitudes. Most RDCs could be readily extracted from these measurements whereas more complicated spectral effects resulting from strong homonuclear coupling could be interpreted either theoretically or by simulation. Importantly, measurement of these very large RDCs actually offers significantly improved data quality and utility for the structure determination of small organic molecules.

  13. Alignment of 1000 Genomes Project reads to reference assembly GRCh38.

    Science.gov (United States)

    Zheng-Bradley, Xiangqun; Streeter, Ian; Fairley, Susan; Richardson, David; Clarke, Laura; Flicek, Paul

    2017-07-01

    The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process as coordinate remapping is most appropriate only for non-repetitive regions of the genome and those that did not see significant change between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold-aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38. © The Authors 2017. Published by Oxford University Press.

  14. Coval: improving alignment quality and variant calling accuracy for next-generation sequencing data.

    Directory of Open Access Journals (Sweden)

    Shunichi Kosugi

    Full Text Available Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of a high rate of sequencing error and incorrect mapping of reads to reference genomes. Currently available short read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads, by filtering mismatched reads that remained in alignments after local realignment and error correction of mismatched reads. The error correction is executed based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and experimentally obtained short-read data of rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in 'targeted' alignments, where the whole genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/.

  15. nGASP - the nematode genome annotation assessment project

    Energy Technology Data Exchange (ETDEWEB)

    Coghlan, A; Fiedler, T J; McKay, S J; Flicek, P; Harris, T W; Blasiar, D; Allen, J; Stein, L D

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders.

  16. AlignerBoost: A Generalized Software Toolkit for Boosting Next-Gen Sequencing Mapping Accuracy Using a Bayesian-Based Mapping Quality Framework.

    Directory of Open Access Journals (Sweden)

    Qi Zheng

    2016-10-01

    Full Text Available Accurate mapping of next-generation sequencing (NGS) reads to reference genomes is crucial for almost all NGS applications and downstream analyses. Various repetitive elements in human and other higher eukaryotic genomes contribute in large part to ambiguously (non-uniquely) mapped reads. Most available NGS aligners attempt to address this by either removing all non-uniquely mapping reads, or reporting one random or "best" hit based on simple heuristics. Accurate estimation of the mapping quality of NGS reads is therefore critical albeit completely lacking at present. Here we developed a generalized software toolkit "AlignerBoost", which utilizes a Bayesian-based framework to accurately estimate mapping quality of ambiguously mapped NGS reads. We tested AlignerBoost with both simulated and real DNA-seq and RNA-seq datasets at various thresholds. In most cases, but especially for reads falling within repetitive regions, AlignerBoost dramatically increases the mapping precision of modern NGS aligners without significantly compromising the sensitivity even without mapping quality filters. When using higher mapping quality cutoffs, AlignerBoost achieves a much lower false mapping rate while exhibiting comparable or higher sensitivity compared to the aligner default modes, therefore significantly boosting the detection power of NGS aligners even using extreme thresholds. AlignerBoost is also SNP-aware, and higher quality alignments can be achieved if provided with known SNPs. AlignerBoost's algorithm is computationally efficient, and can process one million alignments within 30 seconds on a typical desktop computer. AlignerBoost is implemented as a uniform Java application and is freely available at https://github.com/Grice-Lab/AlignerBoost.

  17. AlignerBoost: A Generalized Software Toolkit for Boosting Next-Gen Sequencing Mapping Accuracy Using a Bayesian-Based Mapping Quality Framework.

    Science.gov (United States)

    Zheng, Qi; Grice, Elizabeth A

    2016-10-01

    Accurate mapping of next-generation sequencing (NGS) reads to reference genomes is crucial for almost all NGS applications and downstream analyses. Various repetitive elements in human and other higher eukaryotic genomes contribute in large part to ambiguously (non-uniquely) mapped reads. Most available NGS aligners attempt to address this by either removing all non-uniquely mapping reads, or reporting one random or "best" hit based on simple heuristics. Accurate estimation of the mapping quality of NGS reads is therefore critical albeit completely lacking at present. Here we developed a generalized software toolkit "AlignerBoost", which utilizes a Bayesian-based framework to accurately estimate mapping quality of ambiguously mapped NGS reads. We tested AlignerBoost with both simulated and real DNA-seq and RNA-seq datasets at various thresholds. In most cases, but especially for reads falling within repetitive regions, AlignerBoost dramatically increases the mapping precision of modern NGS aligners without significantly compromising the sensitivity even without mapping quality filters. When using higher mapping quality cutoffs, AlignerBoost achieves a much lower false mapping rate while exhibiting comparable or higher sensitivity compared to the aligner default modes, therefore significantly boosting the detection power of NGS aligners even using extreme thresholds. AlignerBoost is also SNP-aware, and higher quality alignments can be achieved if provided with known SNPs. AlignerBoost's algorithm is computationally efficient, and can process one million alignments within 30 seconds on a typical desktop computer. AlignerBoost is implemented as a uniform Java application and is freely available at https://github.com/Grice-Lab/AlignerBoost.
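    The general idea of a Bayesian mapping quality for an ambiguously mapped read can be sketched as follows. The abstract does not give AlignerBoost's formula, so this example is only an assumption-laden illustration: each candidate hit gets a posterior probability proportional to its likelihood under a uniform prior over hits, and the quality is the Phred-scaled probability that the hit is wrong. The function name `mapping_qualities`, the log-likelihood inputs and the cap of 60 are all hypothetical.

```python
import math


def mapping_qualities(log_likelihoods, cap=60):
    """Phred-scaled mapping quality for each candidate hit of one read.

    log_likelihoods: natural-log likelihood of the read given each candidate
    location (hypothetical inputs; a uniform prior over hits is assumed).
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_likelihoods)
    weights = [math.exp(x - m) for x in log_likelihoods]
    total = sum(weights)
    quals = []
    for i in range(len(weights)):
        # Probability that the read truly comes from one of the other hits.
        error = sum(weights[:i] + weights[i + 1:]) / total
        if error <= 0.0:
            quals.append(cap)
        else:
            quals.append(min(cap, round(-10 * math.log10(error))))
    return quals


if __name__ == "__main__":
    # A read with one strong hit and one weak hit in a repeat copy.
    print(mapping_qualities([math.log(0.99), math.log(0.01)]))
```

    Two equally good hits each receive a quality of about 3, which reflects a 50% chance of error rather than discarding the read or picking one hit at random.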

  18. Background Adjusted Alignment-Free Dissimilarity Measures Improve the Detection of Horizontal Gene Transfer

    Directory of Open Access Journals (Sweden)

    Kujin Tang

    2018-04-01

    Full Text Available Horizontal gene transfer (HGT) plays an important role in the evolution of microbial organisms including bacteria. Alignment-free methods based on single genome compositional information have been used to detect HGT. Currently, Manhattan and Euclidean distances based on tetranucleotide frequencies are the most commonly used alignment-free dissimilarity measures to detect HGT. By testing on simulated bacterial sequences and real data sets with known horizontally transferred genomic regions, we found that more advanced alignment-free dissimilarity measures such as CVTree and d2* that take into account the background Markov sequences can solve HGT detection problems with significantly improved performance. We also studied the influence of different factors such as evolutionary distance between host and donor sequences, size of sliding window, and host genome composition on the performances of alignment-free methods to detect HGT. Our study showed that alignment-free methods can predict HGT accurately when host and donor genomes are in different order levels. Among all methods, CVTree with word length of 3, d2* with word length 3, Markov order 1 and d2* with word length 4, Markov order 1 outperform others in terms of their highest F1-score and their robustness under the influence of different factors.
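    The baseline measure named above, Manhattan distance between tetranucleotide frequency vectors computed in a sliding window, can be sketched as below. This covers only that simple baseline, not CVTree or d2*, which additionally model a background Markov sequence. The function names, the window and step sizes, and the toy genome are assumptions for illustration.

```python
from itertools import product

# All 256 tetranucleotides in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freq(seq):
    """Relative frequency of each tetranucleotide in seq (ambiguous k-mers skipped)."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    total = 0
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in KMERS]


def manhattan(p, q):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(x - y) for x, y in zip(p, q))


def window_distances(genome, window, step):
    """Distance of each window's composition from the whole-genome composition."""
    reference = tetra_freq(genome)
    return [(start, manhattan(tetra_freq(genome[start:start + window]), reference))
            for start in range(0, len(genome) - window + 1, step)]


if __name__ == "__main__":
    # Toy host genome with a compositionally different insert at position 200.
    genome = "AT" * 100 + "GC" * 20 + "AT" * 100
    for start, dist in window_distances(genome, window=40, step=40):
        print(start, round(dist, 3))
```

    Windows whose distance from the genome-wide composition is unusually large are the candidate transferred regions; how large counts as unusual is a threshold choice that this sketch leaves open.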

  19. GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters.

    Science.gov (United States)

    Sela, Itamar; Ashkenazy, Haim; Katoh, Kazutaka; Pupko, Tal

    2015-07-01

    Inference of multiple sequence alignments (MSAs) is a critical part of phylogenetic and comparative genomics studies. However, from the same set of sequences different MSAs are often inferred, depending on the methodologies used and the assumed parameters. Much effort has recently been devoted to improving the ability to identify unreliable alignment regions. Detecting such unreliable regions was previously shown to be important for downstream analyses relying on MSAs, such as the detection of positive selection. Here we developed GUIDANCE2, a new integrative methodology that accounts for: (i) uncertainty in the process of indel formation, (ii) uncertainty in the assumed guide tree and (iii) co-optimal solutions in the pairwise alignments, used as building blocks in progressive alignment algorithms. We compared GUIDANCE2 with seven methodologies to detect unreliable MSA regions using extensive simulations and empirical benchmarks. We show that GUIDANCE2 outperforms all previously developed methodologies. Furthermore, GUIDANCE2 also provides a set of alternative MSAs which can be useful for downstream analyses. The novel algorithm is implemented as a web-server, available at: http://guidance.tau.ac.il. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  20. GenomePeek—an online tool for prokaryotic genome and metagenome analysis

    Directory of Open Access Journals (Sweden)

    Katelyn McNair

    2015-06-01

    Full Text Available As more and more prokaryotic sequencing takes place, a method to quickly and accurately analyze this data is needed. Previous tools are mainly designed for metagenomic analysis and have limitations, such as long runtimes and significant false positive error rates. The online tool GenomePeek (edwards.sdsu.edu/GenomePeek) was developed to analyze both single genome and metagenome sequencing files, quickly and with low error rates. GenomePeek uses a sequence assembly approach where reads to a set of conserved genes are extracted, assembled and then aligned against the highly specific reference database. GenomePeek was found to be faster than traditional approaches while still keeping error rates low, as well as offering unique data visualization options.

  1. A Secure Alignment Algorithm for Mapping Short Reads to Human Genome.

    Science.gov (United States)

    Zhao, Yongan; Wang, Xiaofeng; Tang, Haixu

    2018-05-09

    Elastic and inexpensive computing resources such as clouds have been recognized as a useful solution for analyzing massive human genomic data (e.g., acquired by using next-generation sequencers) in biomedical research. However, outsourcing human genome computation to public or commercial clouds has been hindered by privacy concerns: even a small number of human genome sequences contain sufficient information for identifying the donor of the genomic data. This issue cannot be directly addressed by existing security and cryptographic techniques (such as homomorphic encryption), because they are too heavyweight to carry out practical genome computation tasks on massive data. In this article, we present a secure algorithm to accomplish read mapping, one of the most basic tasks in human genomic data analysis, based on a hybrid cloud computing model. Compared with the existing approaches, our algorithm delegates most computation to the public cloud, while only performing encryption and decryption on the private cloud, and thus makes maximum use of the computing resources of the public cloud. Furthermore, our algorithm reports results similar to those of nonsecure read mapping algorithms, including the alignment between reads and the reference genome, which can be directly used in downstream analysis such as the inference of genomic variations. We implemented the algorithm in C++ and Python on a hybrid cloud system, in which the public cloud uses an Apache Spark system.

  2. Using ESTs for phylogenomics: Can one accurately infer a phylogenetic tree from a gappy alignment?

    Directory of Open Access Journals (Sweden)

    Hartmann Stefanie

    2008-03-01

    Background: While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets. Results: We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data is statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences. Conclusion: These results demonstrate that partial gene

  3. Using ESTs for phylogenomics: can one accurately infer a phylogenetic tree from a gappy alignment?

    Science.gov (United States)

    Hartmann, Stefanie; Vision, Todd J

    2008-03-26

    While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets. We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data is statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences. These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a
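The masking approach described in this record can be illustrated with a minimal Python sketch. The abstract does not state which criteria mark a column or sequence as problematic, so the gap-fraction threshold, the minimum residue count, and the function name `mask_alignment` are all assumptions made for illustration only.

```python
def mask_alignment(alignment, max_gap_fraction=0.5, min_residues=1):
    """Drop gap-rich columns, then drop sequences left with too few residues.

    alignment: dict mapping sequence name -> aligned string (equal lengths).
    Both thresholds are hypothetical; the study's actual criteria are not given.
    """
    rows = list(alignment.values())
    if not rows:
        return {}
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    # Keep columns whose fraction of gap characters is within the threshold.
    keep = [
        c for c in range(n_cols)
        if sum(row[c] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    masked = {
        name: "".join(row[c] for c in keep) for name, row in alignment.items()
    }
    # Exclude sequences that retain fewer than min_residues non-gap characters.
    return {
        name: row for name, row in masked.items()
        if sum(ch != "-" for ch in row) >= min_residues
    }

alignment = {
    "s1": "ACGT----",
    "s2": "ACGTAC--",
    "s3": "--GTACGT",
    "s4": "------GT",
}
masked = mask_alignment(alignment, max_gap_fraction=0.4)
```

In this toy staggered alignment only the two central columns survive, and `s4` is dropped because nothing but gaps remains. This mirrors the trade-off noted in the abstract, where masking discards some input sequences.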

  4. Rapid detection, classification and accurate alignment of up to a million or more related protein sequences.

    Science.gov (United States)

    Neuwald, Andrew F

    2009-08-01

    The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical. This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences. A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu. Supplementary data are available at Bioinformatics online.

  5. Crowdsourcing RNA structural alignments with an online computer game.

    Science.gov (United States)

    Waldispühl, Jérôme; Kam, Arthur; Gardner, Paul P

    2015-01-01

    The annotation and classification of ncRNAs is essential to decipher molecular mechanisms of gene regulation in normal and disease states. A database such as Rfam maintains alignments, consensus secondary structures, and corresponding annotations for RNA families. Its primary purpose is the automated, accurate annotation of non-coding RNAs in genomic sequences. However, the alignment of RNAs is computationally challenging, and the data stored in this database are often subject to improvements. Here, we design and evaluate Ribo, a human-computing game that aims to improve the accuracy of RNA alignments already stored in Rfam. We demonstrate the potential of our techniques and discuss the feasibility of large scale collaborative annotation and classification of RNA families.

  6. When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes.

    Science.gov (United States)

    Gardner, Shea N; Hall, Barry G

    2013-01-01

    Effective use of rapid and inexpensive whole genome sequencing for microbes requires fast, memory efficient bioinformatics tools for sequence comparison. The kSNP v2 software finds single nucleotide polymorphisms (SNPs) in whole genome data. kSNP v2 has numerous improvements over kSNP v1 including SNP gene annotation; better scaling for draft genomes available as assembled contigs or raw, unassembled reads; a tool to identify the optimal value of k; distribution of packages of executables for Linux and Mac OS X for ease of installation and user-friendly use; and a detailed User Guide. SNP discovery is based on k-mer analysis, and requires no multiple sequence alignment or the selection of a single reference genome. Most target sets with hundreds of genomes complete in minutes to hours. SNP phylogenies are built by maximum likelihood, parsimony, and distance, based on all SNPs, only core SNPs, or SNPs present in some intermediate user-specified fraction of targets. The SNP-based trees that result are consistent with known taxonomy. kSNP v2 can handle many gigabases of sequence in a single run, and if one or more annotated genomes are included in the target set, SNPs are annotated with protein coding and other information (UTRs, etc.) from Genbank file(s). We demonstrate application of kSNP v2 on sets of viral and bacterial genomes, and discuss in detail analysis of a set of 68 finished E. coli and Shigella genomes and a set of the same genomes to which have been added 47 assemblies and four "raw read" genomes of O104:H4 strains from the recent European E. coli outbreak that resulted in both bloody diarrhea and hemolytic uremic syndrome (HUS), and caused at least 50 deaths.
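The general idea of alignment-free, k-mer-based SNP discovery can be sketched in a few lines of Python. This is not the kSNP v2 implementation: the function name `kmer_snps`, the input format, and the simplifications (no reverse complements, no handling of repeated loci beyond discarding ambiguous ones) are assumptions made for illustration. The sketch treats a locus as a pair of identical flanks of an odd-length k-mer and reports it when the central base differs between genomes.

```python
from collections import defaultdict

def kmer_snps(genomes, k=5):
    """Find candidate SNPs as k-mers sharing both flanks but differing in the centre.

    genomes: dict mapping genome name -> sequence string (hypothetical format).
    Returns dict mapping (left_flank, right_flank) -> {genome name: central base}.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that there is a single central base")
    half = k // 2
    # (left flank, right flank) -> genome name -> set of central bases observed
    contexts = defaultdict(lambda: defaultdict(set))
    for name, seq in genomes.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) - set("ACGT"):
                continue  # skip k-mers containing ambiguous characters
            contexts[(kmer[:half], kmer[half + 1:])][name].add(kmer[half])
    snps = {}
    for flanks, per_genome in contexts.items():
        alleles = set().union(*per_genome.values())
        # Report loci with more than one allele overall and exactly one per genome.
        if len(alleles) > 1 and all(len(b) == 1 for b in per_genome.values()):
            snps[flanks] = {name: next(iter(b)) for name, b in per_genome.items()}
    return snps

genomes = {
    "g1": "ACGTACCTGA",
    "g2": "ACGTTCCTGA",
}
result = kmer_snps(genomes, k=5)
```

For the two toy genomes, which differ at a single position, `result` holds one locus with flanks `("GT", "CC")` and central bases `A` and `T`. No alignment or reference genome is needed, which is the property the abstract emphasizes.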

  7. Fast and accurate non-sequential protein structure alignment using a new asymmetric linear sum assignment heuristic.

    Science.gov (United States)

    Brown, Peter; Pullan, Wayne; Yang, Yuedong; Zhou, Yaoqi

    2016-02-01

    The three dimensional tertiary structure of a protein at near atomic level resolution provides insight alluding to its function and evolution. As protein structure decides its functionality, similarity in structure usually implies similarity in function. As such, structure alignment techniques are often useful in the classifications of protein function. Given the rapidly growing rate of new, experimentally determined structures being made available from repositories such as the Protein Data Bank, fast and accurate computational structure comparison tools are required. This paper presents SPalignNS, a non-sequential protein structure alignment tool using a novel asymmetrical greedy search technique. The performance of SPalignNS was evaluated against existing sequential and non-sequential structure alignment methods by performing trials with commonly used datasets. These benchmark datasets used to gauge alignment accuracy include (i) 9538 pairwise alignments implied by the HOMSTRAD database of homologous proteins; (ii) a subset of 64 difficult alignments from set (i) that have low structure similarity; (iii) 199 pairwise alignments of proteins with similar structure but different topology; and (iv) a subset of 20 pairwise alignments from the RIPC set. SPalignNS is shown to achieve greater alignment accuracy (lower or comparable root-mean squared distance with increased structure overlap coverage) for all datasets, and the highest agreement with reference alignments from the challenging dataset (iv) above, when compared with both sequentially constrained alignments and other non-sequential alignments. SPalignNS was implemented in C++. The source code, binary executable, and a web server version are freely available at: http://sparks-lab.org. Contact: yaoqi.zhou@griffith.edu.au. © The Author 2015. Published by Oxford University Press. All rights reserved.

  8. Application of fast Fourier transform cross-correlation and mass spectrometry data for accurate alignment of chromatograms.

    Science.gov (United States)

    Zheng, Yi-Bao; Zhang, Zhi-Min; Liang, Yi-Zeng; Zhan, De-Jian; Huang, Jian-Hua; Yun, Yong-Huan; Xie, Hua-Lin

    2013-04-19

    Chromatography has been established as one of the most important analytical methods in the modern analytical laboratory. However, preprocessing of the chromatograms, especially peak alignment, is usually a time-consuming task prior to extracting useful information from the datasets because of the small unavoidable differences in the experimental conditions caused by minor changes and drift. Most of the alignment algorithms are performed on reduced datasets using only the detected peaks in the chromatograms, which means a loss of data and introduces the problem of extraction of peak data from the chromatographic profiles. These disadvantages can be overcome by using the full chromatographic information that is generated from hyphenated chromatographic instruments. A new alignment algorithm called CAMS (Chromatogram Alignment via Mass Spectra) is presented here to correct the retention time shifts among chromatograms accurately and rapidly. In this report, peaks of each chromatogram were detected based on Continuous Wavelet Transform (CWT) with Haar wavelet and were aligned against the reference chromatogram via the correlation of mass spectra. The aligning procedure was accelerated by Fast Fourier Transform cross correlation (FFT cross correlation). This approach has been compared with several well-known alignment methods on real chromatographic datasets, which demonstrates that CAMS can preserve the shape of peaks and achieve a high quality alignment result. Furthermore, the CAMS method was implemented in the Matlab language and is available as an open source package at http://www.github.com/matchcoder/CAMS. Copyright © 2013. Published by Elsevier B.V.
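FFT cross-correlation, the acceleration step named in this abstract, can be shown on a pair of one-dimensional signals. The Python sketch below is not the CAMS code (which is in Matlab and correlates mass spectra after wavelet peak detection); the function names and the toy chromatograms are assumptions for illustration. It uses a small recursive radix-2 FFT so that it runs with the standard library alone.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def estimate_shift(reference, signal):
    """Return the lag (in samples) by which `signal` is delayed relative to `reference`."""
    size = 1
    while size < len(reference) + len(signal):
        size *= 2  # zero-pad to a power of two to avoid circular wrap-around
    ref = fft([complex(v) for v in reference] + [0j] * (size - len(reference)))
    sig = fft([complex(v) for v in signal] + [0j] * (size - len(signal)))
    # Cross-correlation theorem: corr = IFFT(FFT(signal) * conj(FFT(reference)))
    product = [s * r.conjugate() for s, r in zip(sig, ref)]
    corr = [v.real / size for v in fft(product, inverse=True)]
    best = max(range(size), key=lambda i: corr[i])
    # Indices at or beyond len(signal) encode negative lags.
    return best if best < len(signal) else best - size

reference = [0, 0, 1, 5, 1, 0, 0, 0]
signal = [0, 0, 0, 0, 1, 5, 1, 0]  # same peak, two samples later
shift = estimate_shift(reference, signal)  # 2 for this toy example
```

Computing the correlation through the FFT costs O(n log n) instead of the O(n^2) of a direct sliding dot product, which is the source of the speed-up the abstract refers to.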

  9. Alignment-free comparative genomic screen for structured RNAs using coarse-grained secondary structure dot plots

    DEFF Research Database (Denmark)

    Kato, Yuki; Gorodkin, Jan; Havgaard, Jakob Hull

    2017-01-01

    . Methods: Here we present a fast and efficient method, DotcodeR, for detecting structurally similar RNAs in genomic sequences by comparing their corresponding coarse-grained secondary structure dot plots at string level. This allows us to perform an all-against-all scan of all window pairs from two genomes...... without alignment. Results: Our computational experiments with simulated data and real chromosomes demonstrate that the presented method has good sensitivity. Conclusions: DotcodeR can be useful as a pre-filter in a genomic comparative scan for structured RNAs....

  10. Accurate marker-free alignment with simultaneous geometry determination and reconstruction of tilt series in electron tomography

    International Nuclear Information System (INIS)

    Winkler, Hanspeter; Taylor, Kenneth A.

    2006-01-01

    An image alignment method for electron tomography is presented which is based on cross-correlation techniques and which includes a simultaneous refinement of the tilt geometry. A coarsely aligned tilt series is iteratively refined with a procedure consisting of two steps for each cycle: area matching and subsequent geometry correction. The first step, area matching, brings into register equivalent specimen regions in all images of the tilt series. It determines four parameters of a linear two-dimensional transformation, not just translation and rotation as is done during the preceding coarse alignment with conventional methods. The refinement procedure also differs from earlier methods in that the alignment references are now computed from already aligned images by reprojection of a backprojected volume. The second step, geometry correction, refines the initially inaccurate estimates of the geometrical parameters, including the direction of the tilt axis, a tilt angle offset, and the inclination of the specimen with respect to the support film or specimen holder. The correction values serve as an indicator for the progress of the refinement. For each new iteration, the correction values are used to compute an updated set of geometry parameters by a least squares fit. Model calculations show that it is essential to refine the geometrical parameters as well as the accurate alignment of the images to obtain a faithful map of the original structure

  11. Tools for Accurate and Efficient Analysis of Complex Evolutionary Mechanisms in Microbial Genomes. Final Report

    Energy Technology Data Exchange (ETDEWEB)

    Nakhleh, Luay

    2014-03-12

    I proposed to develop computationally efficient tools for accurate detection and reconstruction of microbes' complex evolutionary mechanisms, thus enabling rapid and accurate annotation, analysis and understanding of their genomes. To achieve this goal, I proposed to address three aspects. (1) Mathematical modeling. A major challenge facing the accurate detection of HGT is that of distinguishing between these two events on the one hand and other events that have similar "effects." I proposed to develop a novel mathematical approach for distinguishing among these events. Further, I proposed to develop a set of novel optimization criteria for the evolutionary analysis of microbial genomes in the presence of these complex evolutionary events. (2) Algorithm design. In this aspect of the project, I proposed to develop an array of efficient and accurate algorithms for analyzing microbial genomes based on the formulated optimization criteria. Further, I proposed to test the viability of the criteria and the accuracy of the algorithms in an experimental setting using both synthetic as well as biological data. (3) Software development. I proposed the final outcome to be a suite of software tools which implements the mathematical models as well as the algorithms developed.

  12. A rank-based sequence aligner with applications in phylogenetic analysis.

    Directory of Open Access Journals (Sweden)

    Liviu P Dinu

    Recent tools for aligning short DNA reads have been designed to optimize the trade-off between correctness and speed. This paper introduces a method for assigning a set of short DNA reads to a reference genome, under Local Rank Distance (LRD). The rank-based aligner proposed in this work aims to improve correctness over speed. However, some indexing strategies to speed up the aligner are also investigated. The LRD aligner is improved in terms of speed by storing k-mer positions in a hash table for each read. Another improvement, which produces an approximate LRD aligner, is to consider only the positions in the reference that are likely to represent a good positional match of the read. The proposed aligner is evaluated and compared to other state of the art alignment tools in several experiments. A set of experiments is conducted to determine the precision and the recall of the proposed aligner, in the presence of contaminated reads. In another set of experiments, the proposed aligner is used to find the order, the family, or the species of a new (or unknown) organism, given only a set of short Next-Generation Sequencing DNA reads. The empirical results show that the aligner proposed in this work is highly accurate from a biological point of view. Compared to the other evaluated tools, the LRD aligner has the important advantage of being very accurate even for a very low base coverage. Thus, the LRD aligner can be considered a good alternative to standard alignment tools, especially when the accuracy of the aligner is of high importance. Source code and UNIX binaries of the aligner are freely available for future development and use at http://lrd.herokuapp.com/aligners. The software is implemented in C++ and Java, being supported on UNIX and MS Windows.
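The two speed-ups mentioned in this abstract, a hash table of k-mer positions per read and a restriction to promising reference positions, can be sketched in Python as follows. This does not compute Local Rank Distance and is not the authors' code; the function names, the voting scheme, and the toy sequences are assumptions chosen to show the indexing idea only.

```python
from collections import Counter, defaultdict

def index_kmers(read, k):
    """Map each k-mer of the read to the list of positions where it occurs."""
    table = defaultdict(list)
    for i in range(len(read) - k + 1):
        table[read[i:i + k]].append(i)
    return table

def candidate_offsets(reference, read, k=4, top=1):
    """Rank reference offsets by how many read k-mers they match in place."""
    table = index_kmers(read, k)
    votes = Counter()
    for j in range(len(reference) - k + 1):
        # Look up the reference k-mer in the read's hash table.
        for i in table.get(reference[j:j + k], ()):
            offset = j - i  # where the read would start in the reference
            if 0 <= offset <= len(reference) - len(read):
                votes[offset] += 1
    return votes.most_common(top)

reference = "TTGACCGTAAGCTTAGGCAT"
read = "CGTAAGCTT"
best = candidate_offsets(reference, read)
```

Here `best` is `[(5, 6)]`: all six 4-mers of the read vote for offset 5. A more expensive distance such as LRD would then only need to be evaluated at the few top-ranked offsets rather than at every reference position.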

  13. Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment.

    Science.gov (United States)

    Baichoo, Shakuntala; Ouzounis, Christos A

    A multitude of algorithms for sequence comparison, short-read assembly and whole-genome alignment have been developed in the general context of molecular biology, to support technology development for high-throughput sequencing, numerous applications in genome biology and fundamental research on comparative genomics. The computational complexity of these algorithms has been previously reported in original research papers, yet this often neglected property has not been reviewed previously in a systematic manner and for a wider audience. We provide a review of space and time complexity of key sequence analysis algorithms and highlight their properties in a comprehensive manner, in order to identify potential opportunities for further research in algorithm or data structure optimization. The complexity aspect is poised to become pivotal as we will be facing challenges related to the continuous increase of genomic data on unprecedented scales and complexity in the foreseeable future, when robust biological simulation at the cell level and above becomes a reality. Copyright © 2017 Elsevier B.V. All rights reserved.

  14. Predicting RNA hyper-editing with a novel tool when unambiguous alignment is impossible.

    Science.gov (United States)

    McKerrow, Wilson H; Savva, Yiannis A; Rezaei, Ali; Reenan, Robert A; Lawrence, Charles E

    2017-07-10

    Repetitive elements are now known to have relevant cellular functions, including self-complementary sequences that form double stranded (ds) RNA. There are numerous pathways that determine the fate of endogenous dsRNA, and misregulation of endogenous dsRNA is a driver of autoimmune disease, particularly in the brain. Unfortunately, the alignment of high-throughput, short-read sequences to repeat elements poses a dilemma: Such sequences may align equally well to multiple genomic locations. In order to differentiate repeat elements, current alignment methods depend on sequence variation in the reference genome. Reads are discarded when no such variations are present. However, RNA hyper-editing, a possible fate for dsRNA, introduces enough variation to distinguish between repeats that are otherwise identical. To take advantage of this variation, we developed a new algorithm, RepProfile, that simultaneously aligns reads and predicts novel variations. RepProfile accurately aligns hyper-edited reads that other methods discard. In particular we predict hyper-editing of Drosophila melanogaster repeat elements in vivo at levels previously described only in vitro, and provide validation by Sanger sequencing sixty-two individual cloned sequences. We find that hyper-editing is concentrated in genes involved in cell-cell communication at the synapse, including some that are associated with neurodegeneration. We also find that hyper-editing tends to occur in short runs. Previous studies of RNA hyper-editing discarded ambiguously aligned reads, ignoring hyper-editing in long, perfect dsRNA - the perfect substrate for hyper-editing. We provide a method that simulation and Sanger validation show accurately predicts such RNA editing, yielding a superior picture of hyper-editing.

  15. HeurAA: accurate and fast detection of genetic variations with a novel heuristic amplicon aligner program for next generation sequencing.

    Directory of Open Access Journals (Sweden)

    Lőrinc S Pongor

    Next generation sequencing (NGS) of PCR amplicons is a standard approach to detect genetic variations in personalized medicine such as cancer diagnostics. Computer programs used in the NGS community often miss insertions and deletions (indels) that constitute a large part of known human mutations. We have developed HeurAA, an open source, heuristic amplicon aligner program. We tested the program on simulated datasets as well as experimental data from multiplex sequencing of 40 amplicons in 12 oncogenes collected on a 454 Genome Sequencer from lung cancer cell lines. We found that HeurAA can accurately detect all indels, and is more than an order of magnitude faster than previous programs. HeurAA can compare reads and reference sequences up to several thousand base pairs in length, and it can evaluate data from complex mixtures containing reads of different gene-segments from different samples. HeurAA is written in C and Perl for Linux operating systems; the code and the documentation are available for research applications at http://sourceforge.net/projects/heuraa/

  16. Analysis of the genetic variation in Mycobacterium tuberculosis strains by multiple genome alignments

    Directory of Open Access Journals (Sweden)

    Morales Juan

    2008-11-01

    Background: The recent determination of the complete nucleotide sequence of several Mycobacterium tuberculosis (MTB) genomes allows the use of comparative genomics as a tool for dissecting the nature and consequence of genetic variability within this species. The multiple alignment of the genomes of clinical strains (CDC1551, F11, Haarlem and C), along with the genomes of laboratory strains (H37Rv and H37Ra), provides new insights on the mechanisms of adaptation of this bacterium to the human host. Findings: The genetic variation found in six M. tuberculosis strains does not involve significant genomic rearrangements. Most of the variation results from deletion and transposition events preferentially associated with insertion sequences and genes of the PE/PPE family but not with genes implicated in virulence. Using a Perl-based software islandsanalyser, which creates a representation of the genetic variation in the genome, we identified differences in the patterns of distribution and frequency of the polymorphisms across the genome. The identification of genes displaying strain-specific polymorphisms and the extrapolation of the number of strain-specific polymorphisms to an unlimited number of genomes indicates that the different strains contain a limited number of unique polymorphisms. Conclusion: The comparison of multiple genomes demonstrates that the M. tuberculosis genome is currently undergoing an active process of gene decay, analogous to the adaptation process of obligate bacterial symbionts. This observation opens new perspectives into the evolution and the understanding of the pathogenesis of this bacterium.

  17. Analysis of Multiple Genomic Sequence Alignments: A Web Resource, Online Tools, and Lessons Learned From Analysis of Mammalian SCL Loci

    Science.gov (United States)

    Chapman, Michael A.; Donaldson, Ian J.; Gilbert, James; Grafham, Darren; Rogers, Jane; Green, Anthony R.; Göttgens, Berthold

    2004-01-01

    Comparative analysis of genomic sequences is becoming a standard technique for studying gene regulation. However, only a limited number of tools are currently available for the analysis of multiple genomic sequences. An extensive data set for the testing and training of such tools is provided by the SCL gene locus. Here we have expanded the data set to eight vertebrate species by sequencing the dog SCL locus and by annotating the dog and rat SCL loci. To provide a resource for the bioinformatics community, all SCL sequences and functional annotations, comprising a collation of the extensive experimental evidence pertaining to SCL regulation, have been made available via a Web server. A Web interface to new tools specifically designed for the display and analysis of multiple sequence alignments was also implemented. The unique SCL data set and new sequence comparison tools allowed us to perform a rigorous examination of the true benefits of multiple sequence comparisons. We demonstrate that multiple sequence alignments are, overall, superior to pairwise alignments for identification of mammalian regulatory regions. In the search for individual transcription factor binding sites, multiple alignments markedly increase the signal-to-noise ratio compared to pairwise alignments. PMID:14718377

  18. Harnessing cross-species alignment to discover SNPs and generate a draft genome sequence of a bighorn sheep (Ovis canadensis).

    Science.gov (United States)

    Miller, Joshua M; Moore, Stephen S; Stothard, Paul; Liao, Xiaoping; Coltman, David W

    2015-05-20

    Whole genome sequences (WGS) have proliferated as sequencing technology continues to improve and costs decline. While many WGS of model or domestic organisms have been produced, a growing number of non-model species are also being sequenced. In the absence of a reference, construction of a genome sequence necessitates de novo assembly which may be beyond the ability of many labs due to the large volumes of raw sequence data and extensive bioinformatics required. In contrast, the presence of a reference WGS allows for alignment which is more tractable than assembly. Recent work has highlighted that the reference need not come from the same species, potentially enabling a wide array of species WGS to be constructed using cross-species alignment. Here we report on the creation of a draft WGS from a single bighorn sheep (Ovis canadensis) using alignment to the closely related domestic sheep (Ovis aries). Two sequencing libraries on SOLiD platforms yielded over 865 million reads, and combined alignment to the domestic sheep reference resulted in a nearly complete sequence (95% coverage of the reference) at an average of 12x read depth (104 SD). From this we discovered over 15 million variants and annotated them relative to the domestic sheep reference. We then conducted an enrichment analysis of those SNPs showing fixed differences between the reference and sequenced individual and found significant differences in a number of gene ontology (GO) terms, including those associated with reproduction, muscle properties, and bone deposition. Our results demonstrate that cross-species alignment enables the creation of novel WGS for non-model organisms. The bighorn sheep WGS will provide a resource for future resequencing studies or comparative genomics.

  19. STELLAR: fast and exact local alignments

    Directory of Open Access Journals (Sweden)

    Weese David

    2011-10-01

    Background: Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. Results: We present here the local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments. Conclusions: STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar. The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de.
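The acceptance criterion described in this abstract (a minimal length and a maximal error rate under the edit distance model) can be written as a small Python check. This is only a sketch of the criterion, not of STELLAR's filtering or verification algorithms. It assumes that the error rate is the number of mismatch and gap columns divided by the total number of alignment columns, and the function name `is_epsilon_match` is hypothetical.

```python
def is_epsilon_match(row_a, row_b, min_length, max_error_rate):
    """Check a gapped pairwise alignment against length and error-rate bounds.

    row_a, row_b: the two alignment rows, with '-' marking gaps.
    A column counts as an error if it is a mismatch or contains a gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    columns = len(row_a)
    if columns == 0:
        return False
    errors = sum(1 for a, b in zip(row_a, row_b) if a != b or a == "-")
    return columns >= min_length and errors / columns <= max_error_rate

row_a = "ACGTTAGC-A"
row_b = "ACGATAGCTA"
accepted = is_epsilon_match(row_a, row_b, min_length=10, max_error_rate=0.25)
```

The toy alignment has ten columns with one mismatch and one gap, so it passes at an error rate bound of 0.25 but would be rejected at 0.1. A fully sensitive aligner in the sense of the abstract must report every local alignment for which such a check succeeds.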

  20. Tree decomposition based fast search of RNA structures including pseudoknots in genomes.

    Science.gov (United States)

    Song, Yinglei; Liu, Chunmei; Malmberg, Russell; Pan, Fangfang; Cai, Liming

    2005-01-01

    Searching genomes for RNA secondary structure with computational methods has become an important approach to the annotation of non-coding RNAs. However, due to the lack of efficient algorithms for accurate RNA structure-sequence alignment, computer programs capable of quickly and effectively searching genomes for RNA secondary structures have not been available. In this paper, a novel RNA structure profiling model is introduced based on the notion of a conformational graph to specify the consensus structure of an RNA family. Tree decomposition yields a small tree width t for such conformation graphs (e.g., t = 2 for stem loops and only a slight increase for pseudo-knots). Within this modelling framework, the optimal alignment of a sequence to the structure model corresponds to finding a maximum valued isomorphic subgraph and consequently can be accomplished through dynamic programming on the tree decomposition of the conformational graph in time O(k^t N^2), where k is a small parameter and N is the size of the profiled RNA structure. Experiments show that the application of the alignment algorithm to search in genomes yields the same search accuracy as methods based on a Covariance model with a significant reduction in computation time. In particular, very accurate searches of tmRNAs in bacteria genomes and of telomerase RNAs in yeast genomes can be accomplished in days, as opposed to months required by other methods. The tree decomposition based searching tool is free upon request and can be downloaded at our site http://w.uga.edu/RNA-informatics/software/index.php.

  1. Genomic signal processing methods for computation of alignment-free distances from DNA sequences.

    Science.gov (United States)

    Borrayo, Ernesto; Mendizabal-Ruiz, E Gerardo; Vélez-Pérez, Hugo; Romo-Vázquez, Rebeca; Mendizabal, Adriana P; Morales, J Alejandro

    2014-01-01

    Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments.

  2. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

    Directory of Open Access Journals (Sweden)

    Dewey Colin N

    2011-08-01

    Full Text Available Abstract Background RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. Results We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. Conclusions RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost

  3. Specificity control for read alignments using an artificial reference genome-guided false discovery rate.

    Science.gov (United States)

    Giese, Sven H; Zickmann, Franziska; Renard, Bernhard Y

    2014-01-01

    Accurate estimation, comparison and evaluation of read mapping error rates is a crucial step in the processing of next-generation sequencing data, as further analysis steps and interpretation assume the correctness of the mapping results. Current approaches are either focused on sensitivity estimation and thereby disregard specificity or are based on read simulations. Although continuously improving, read simulations are still prone to introduce a bias into the mapping error quantitation and cannot capture all characteristics of an individual dataset. We introduce ARDEN (artificial reference driven estimation of false positives in next-generation sequencing data), a novel benchmark method that estimates error rates of read mappers based on real experimental reads, using an additionally generated artificial reference genome. It allows a dataset-specific computation of error rates and the construction of a receiver operating characteristic curve. Thereby, it can be used for optimization of parameters for read mappers, selection of read mappers for a specific problem or for filtering alignments based on quality estimation. The use of ARDEN is demonstrated in a general read mapper comparison, a parameter optimization for one read mapper and an application example in single-nucleotide polymorphism discovery with a significant reduction in the number of false positive identifications. The ARDEN source code is freely available at http://sourceforge.net/projects/arden/.

  4. Ancestral sequence alignment under optimal conditions

    Directory of Open Access Journals (Sweden)

    Brown Daniel G

    2005-11-01

    Full Text Available Abstract Background Multiple genome alignment is an important problem in bioinformatics. An important subproblem used by many multiple alignment approaches is that of aligning two multiple alignments. Many popular alignment algorithms for DNA use the sum-of-pairs heuristic, where the score of a multiple alignment is the sum of its induced pairwise alignment scores. However, the biological meaning of the sum-of-pairs heuristic is not obvious. Additionally, many algorithms based on the sum-of-pairs heuristic are complicated and slow, compared to pairwise alignment algorithms. An alternative approach to aligning alignments is to first infer ancestral sequences for each alignment, and then align the two ancestral sequences. In addition to being fast, this method has a clear biological basis that takes into account the evolution implied by an underlying phylogenetic tree. In this study we explore the accuracy of aligning alignments by ancestral sequence alignment. We examine the use of both maximum likelihood and parsimony to infer ancestral sequences. Additionally, we investigate the effect on accuracy of allowing ambiguity in our ancestral sequences. Results We use synthetic sequence data that we generate by simulating evolution on a phylogenetic tree. We use two different types of phylogenetic trees: trees with a period of rapid growth followed by a period of slow growth, and trees with a period of slow growth followed by a period of rapid growth. We examine the alignment accuracy of four ancestral sequence reconstruction and alignment methods: parsimony, maximum likelihood, ambiguous parsimony, and ambiguous maximum likelihood. Additionally, we compare against the alignment accuracy of two sum-of-pairs algorithms: ClustalW and the heuristic of Ma, Zhang, and Wang. Conclusion We find that allowing ambiguity in ancestral sequences does not lead to better multiple alignments. Regardless of whether we use parsimony or maximum likelihood, the
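The sum-of-pairs score mentioned in the record above (the score of a multiple alignment is the sum of its induced pairwise alignment scores) can be sketched as follows. The scoring values (match +1, mismatch -1, linear gap penalty -2) and the function names are assumptions chosen for illustration; they are not taken from the paper or from ClustalW.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score the pairwise alignment induced by two rows of a multiple alignment."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # column vanishes in the induced pairwise alignment
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(rows, **scoring):
    """Sum the induced pairwise scores over all unordered pairs of rows."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(rows, 2))


alignment = ["AC-T", "ACGT", "A-GT"]
print(sum_of_pairs(alignment))  # pairwise scores 1, -2 and 1
```

Because every pair of rows is scored, the cost grows quadratically with the number of sequences, which is one reason the abstract contrasts this heuristic with aligning two inferred ancestral sequences.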

  5. Comparative genomics beyond sequence-based alignments

    DEFF Research Database (Denmark)

    Þórarinsson, Elfar; Yao, Zizhen; Wiklund, Eric D.

    2008-01-01

    Recent computational scans for non-coding RNAs (ncRNAs) in multiple organisms have relied on existing multiple sequence alignments. However, as sequence similarity drops, a key signal of RNA structure--frequent compensating base changes--is increasingly likely to cause sequence-based alignment me...

  6. CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping

    Directory of Open Access Journals (Sweden)

    Shi Weisong

    2011-06-01

    Full Text Available Abstract Background Research in genetics has developed rapidly recently due to the aid of next generation sequencing (NGS). However, massively-parallel NGS produces enormous amounts of data, which leads to storage, compatibility, scalability, and performance issues. The Cloud Computing and MapReduce framework, which utilizes hundreds or thousands of shared computers to map sequencing reads quickly and efficiently to reference genome sequences, appears to be a very promising solution for these issues. Consequently, it has been adopted by many organizations recently, and the initial results are very promising. However, since these are only initial steps toward this trend, the developed software does not provide adequate primary functions like bisulfite, pair-end mapping, etc., in on-site software such as RMAP or BS Seeker. In addition, existing MapReduce-based applications were not designed to process the long reads produced by the most recent second-generation and third-generation NGS instruments and, therefore, are inefficient. Last, it is difficult for a majority of biologists untrained in programming skills to use these tools because most were developed on Linux with a command line interface. Results To urge the trend of using Cloud technologies in genomics and prepare for advances in second- and third-generation DNA sequencing, we have built a Hadoop MapReduce-based application, CloudAligner, which achieves higher performance, covers most primary features, is more accurate, and has a user-friendly interface. It was also designed to be able to deal with long sequences. The performance gain of CloudAligner over Cloud-based counterparts (35 to 80%) mainly comes from the omission of the reduce phase. In comparison to local-based approaches, the performance gain of CloudAligner is from the partition and parallel processing of the huge reference genome as well as the reads. The source code of CloudAligner is available at http

  7. CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping.

    Science.gov (United States)

    Nguyen, Tung; Shi, Weisong; Ruden, Douglas

    2011-06-06

    Research in genetics has developed rapidly recently due to the aid of next generation sequencing (NGS). However, massively-parallel NGS produces enormous amounts of data, which leads to storage, compatibility, scalability, and performance issues. The Cloud Computing and MapReduce framework, which utilizes hundreds or thousands of shared computers to map sequencing reads quickly and efficiently to reference genome sequences, appears to be a very promising solution for these issues. Consequently, it has been adopted by many organizations recently, and the initial results are very promising. However, since these are only initial steps toward this trend, the developed software does not provide adequate primary functions like bisulfite, pair-end mapping, etc., in on-site software such as RMAP or BS Seeker. In addition, existing MapReduce-based applications were not designed to process the long reads produced by the most recent second-generation and third-generation NGS instruments and, therefore, are inefficient. Last, it is difficult for a majority of biologists untrained in programming skills to use these tools because most were developed on Linux with a command line interface. To urge the trend of using Cloud technologies in genomics and prepare for advances in second- and third-generation DNA sequencing, we have built a Hadoop MapReduce-based application, CloudAligner, which achieves higher performance, covers most primary features, is more accurate, and has a user-friendly interface. It was also designed to be able to deal with long sequences. The performance gain of CloudAligner over Cloud-based counterparts (35 to 80%) mainly comes from the omission of the reduce phase. In comparison to local-based approaches, the performance gain of CloudAligner is from the partition and parallel processing of the huge reference genome as well as the reads. The source code of CloudAligner is available at http://cloudaligner.sourceforge.net/ and its web version is at http

  8. A Thousand Fly Genomes: An Expanded Drosophila Genome Nexus.

    Science.gov (United States)

    Lack, Justin B; Lange, Jeremy D; Tang, Alison D; Corbett-Detig, Russell B; Pool, John E

    2016-12-01

    The Drosophila Genome Nexus is a population genomic resource that provides D. melanogaster genomes from multiple sources. To facilitate comparisons across data sets, genomes are aligned using a common reference alignment pipeline which involves two rounds of mapping. Regions of residual heterozygosity, identity-by-descent, and recent population admixture are annotated to enable data filtering based on the user's needs. Here, we present a significant expansion of the Drosophila Genome Nexus, which brings the current data object to a total of 1,121 wild-derived genomes. New additions include 305 previously unpublished genomes from inbred lines representing six population samples in Egypt, Ethiopia, France, and South Africa, along with another 193 genomes added from recently-published data sets. We also provide an aligned D. simulans genome to facilitate divergence comparisons. This improved resource will broaden the range of population genomic questions that can be addressed from multi-population allele frequencies and haplotypes in this model species. The larger set of genomes will also enhance the discovery of functionally relevant natural variation that exists within and between populations. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  9. Intraoperative panoramic image using alignment grid, is it accurate?

    Science.gov (United States)

    Apivatthakakul, T; Duanghakrung, M; Luevitoonvechkit, S; Patumasutra, S

    2013-07-01

    Minimally invasive orthopedic trauma surgery relies heavily on intraoperative fluoroscopic images to evaluate the quality of fracture reduction and fixation. However, fluoroscopic images have a narrow field of view and often cannot visualize the entire long bone axis. To compare the coronal femoral alignment between conventional X-rays to that achieved with a new method of acquiring a panoramic intraoperative image. Twenty-four cadaveric femurs with simple diaphyseal fractures were fixed with an angulated broad DCP to create coronal plane malalignment. An intraoperative alignment grid was used to help stitch different fluoroscopic images together to produce a panoramic image. A conventional X-ray of the entire femur was then performed. The coronal plane angulation in the panoramic images was then compared to the conventional X-rays using a Wilcoxon signed rank test. The mean angle measured from the panoramic view was 173.9° (range 169.3°-178.0°) with median of 173.2°. The mean angle measured from the conventional X-ray was 173.4° (range 167.7°-178.7°) with a median angle of 173.5°. There was no significant difference between both methods of measurement (P = 0.48). Panoramic images produced by stitching fluoroscopic images together with help of an alignment grid demonstrated the same accuracy at evaluating the coronal plane alignment of femur fractures as conventional X-rays.

  10. Fast and accurate phylogenetic reconstruction from high-resolution whole-genome data and a novel robustness estimator.

    Science.gov (United States)

    Lin, Y; Rajan, V; Moret, B M E

    2011-09-01

    The rapid accumulation of whole-genome data has renewed interest in the study of genomic rearrangements. Comparative genomics, evolutionary biology, and cancer research all require models and algorithms to elucidate the mechanisms, history, and consequences of these rearrangements. However, even simple models lead to NP-hard problems, particularly in the area of phylogenetic analysis. Current approaches are limited to small collections of genomes and low-resolution data (typically a few hundred syntenic blocks). Moreover, whereas phylogenetic analyses from sequence data are deemed incomplete unless bootstrapping scores (a measure of confidence) are given for each tree edge, no equivalent to bootstrapping exists for rearrangement-based phylogenetic analysis. We describe a fast and accurate algorithm for rearrangement analysis that scales up, in both time and accuracy, to modern high-resolution genomic data. We also describe a novel approach to estimate the robustness of results, an equivalent to the bootstrapping analysis used in sequence-based phylogenetic reconstruction. We present the results of extensive testing on both simulated and real data showing that our algorithm returns very accurate results, while scaling linearly with the size of the genomes and cubically with their number. We also present extensive experimental results showing that our approach to robustness testing provides excellent estimates of confidence, which, moreover, can be tuned to trade off thresholds between false positives and false negatives. Together, these two novel approaches enable us to attack heretofore intractable problems, such as phylogenetic inference for high-resolution vertebrate genomes, as we demonstrate on a set of six vertebrate genomes with 8,380 syntenic blocks. A copy of the software is available on demand.

  11. The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population.

    Science.gov (United States)

    Lack, Justin B; Cardeno, Charis M; Crepeau, Marc W; Taylor, William; Corbett-Detig, Russell B; Stevens, Kristian A; Langley, Charles H; Pool, John E

    2015-04-01

    Hundreds of wild-derived Drosophila melanogaster genomes have been published, but rigorous comparisons across data sets are precluded by differences in alignment methodology. The most common approach to reference-based genome assembly is a single round of alignment followed by quality filtering and variant detection. We evaluated variations and extensions of this approach and settled on an assembly strategy that utilizes two alignment programs and incorporates both substitutions and short indels to construct an updated reference for a second round of mapping prior to final variant detection. Utilizing this approach, we reassembled published D. melanogaster population genomic data sets and added unpublished genomes from several sub-Saharan populations. Most notably, we present aligned data from phase 3 of the Drosophila Population Genomics Project (DPGP3), which provides 197 genomes from a single ancestral range population of D. melanogaster (from Zambia). The large sample size, high genetic diversity, and potentially simpler demographic history of the DPGP3 sample will make this a highly valuable resource for fundamental population genetic research. The complete set of assemblies described here, termed the Drosophila Genome Nexus, presently comprises 623 consistently aligned genomes and is publicly available in multiple formats with supporting documentation and bioinformatic tools. This resource will greatly facilitate population genomic analysis in this model species by reducing the methodological differences between data sets. Copyright © 2015 by the Genetics Society of America.

  12. Screening synteny blocks in pairwise genome comparisons through integer programming.

    Science.gov (United States)

    Tang, Haibao; Lyons, Eric; Pedersen, Brent; Schnable, James C; Paterson, Andrew H; Freeling, Michael

    2011-04-18

    It is difficult to accurately interpret chromosomal correspondences such as true orthology and paralogy due to significant divergence of genomes from a common ancestor. Analyses are particularly problematic among lineages that have repeatedly experienced whole genome duplication (WGD) events. To compare multiple "subgenomes" derived from genome duplications, we need to relax the traditional requirements of "one-to-one" syntenic matchings of genomic regions in order to reflect "one-to-many" or more generally "many-to-many" matchings. However, this relaxation may result in the identification of synteny blocks that are derived from ancient shared WGDs that are not of interest. For many downstream analyses, we need to eliminate weak, low scoring alignments from pairwise genome comparisons. Our goal is to objectively select a subset of synteny blocks whose total scores are maximized while respecting the duplication history of the genomes in comparison. We call this "quota-based" screening of synteny blocks in order to appropriately fill a quota of syntenic relationships within one genome or between two genomes having WGD events. We have formulated the synteny block screening as an optimization problem known as "Binary Integer Programming" (BIP), which is solved using existing linear programming solvers. The computer program QUOTA-ALIGN performs this task by creating a clear objective function that maximizes the compatible set of synteny blocks under given constraints on overlaps and depths (corresponding to the duplication history in respective genomes). Such a procedure is useful for any pairwise synteny alignments, but is most useful in lineages affected by multiple WGDs, like plants or fish lineages. For example, there should be a 1:2 ploidy relationship between genome A and B if genome B had an independent WGD subsequent to the divergence of the two genomes. We show through simulations and real examples using plant genomes in the rosid superorder that the quota
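The quota-based screening described in the record above can be pictured as a 0/1 selection problem. The sketch below is not QUOTA-ALIGN's formulation or code: it assumes each block is a dictionary with a score and one half-open interval per genome, it models the "depth" constraints as a maximum number of selected blocks covering any position, and it enumerates all subsets instead of calling a linear programming solver, so it is usable only for toy inputs. All names are hypothetical.

```python
from itertools import product


def max_depth(intervals):
    """Largest number of half-open intervals [start, end) covering one point."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal coordinates, ends (-1) sort before starts (+1)
    depth = best = 0
    for _, delta in events:
        depth += delta
        best = max(best, depth)
    return best


def screen_blocks(blocks, quota_a, quota_b):
    """Pick the 0/1 vector maximising total score under per-genome depth quotas."""
    best_score, best_choice = 0, tuple(0 for _ in blocks)
    for choice in product((0, 1), repeat=len(blocks)):
        chosen = [blk for blk, keep in zip(blocks, choice) if keep]
        if max_depth([blk["a"] for blk in chosen]) > quota_a:
            continue  # too many blocks stacked on genome A
        if max_depth([blk["b"] for blk in chosen]) > quota_b:
            continue  # too many blocks stacked on genome B
        score = sum(blk["score"] for blk in chosen)
        if score > best_score:
            best_score, best_choice = score, choice
    return best_score, best_choice


blocks = [
    {"a": (0, 100), "b": (0, 100), "score": 50},
    {"a": (0, 100), "b": (200, 300), "score": 40},
    {"a": (0, 100), "b": (400, 500), "score": 10},
    {"a": (150, 250), "b": (50, 150), "score": 30},
]
# Genome B is assumed to carry one extra WGD, so a region of A may match
# two regions of B (quota 2 on A), but a region of B only one region of A.
print(screen_blocks(blocks, quota_a=2, quota_b=1))
```

With these toy values the first two blocks are kept, since the third would exceed the quota on genome A and the fourth overlaps the first on genome B.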

  13. Murasaki: a fast, parallelizable algorithm to find anchors from multiple genomes.

    Directory of Open Access Journals (Sweden)

    Kris Popendorf

    Full Text Available BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grows. METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy. CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with
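The idea of anchoring with spaced seeds, as named in the record above, can be sketched in a few lines. This is an illustration and not Murasaki's algorithm: the function names are hypothetical, a Python dictionary stands in for the adaptive hash functions, and an "anchor" here is simply a masked key that occurs in every input sequence. In the pattern string, "1" marks positions that must match and "0" marks positions where a mismatch is tolerated.

```python
def spaced_keys(seq, pattern):
    """Index every window of seq by the characters at the '1' positions."""
    index = {}
    span = len(pattern)
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        key = "".join(ch for ch, bit in zip(window, pattern) if bit == "1")
        index.setdefault(key, []).append(start)
    return index


def find_anchors(seqs, pattern):
    """Return masked keys present in all sequences, with their start positions."""
    indexes = [spaced_keys(seq, pattern) for seq in seqs]
    shared = set(indexes[0]).intersection(*indexes[1:])
    return {key: [index[key] for index in indexes] for key in sorted(shared)}


# The two sequences differ at position 2, which falls on the '0' of the seed.
print(find_anchors(["ACGTAC", "ACTTAC"], "1101"))
```

Because each sequence is indexed once and keys are intersected afterwards, the work grows with total sequence length instead of with the number of sequence pairs, which is the intuition behind the near linear scaling reported in the abstract.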

  14. SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes.

    Science.gov (United States)

    Pruesse, Elmar; Peplies, Jörg; Glöckner, Frank Oliver

    2012-07-15

    In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements. In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands. SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1% accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks. Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license.

  15. Whole genome phylogenies for multiple Drosophila species

    Directory of Open Access Journals (Sweden)

    Seetharam Arun

    2012-12-01

    Full Text Available Abstract Background Reconstructing the evolutionary history of organisms using traditional phylogenetic methods may suffer from inaccurate sequence alignment. An alternative approach, particularly effective when whole genome sequences are available, is to employ methods that don’t use explicit sequence alignments. We extend a novel phylogenetic method based on Singular Value Decomposition (SVD) to reconstruct the phylogeny of 12 sequenced Drosophila species. SVD analysis provides accurate comparisons for a high fraction of sequences within whole genomes without the prior identification of orthologs or homologous sites. With this method all protein sequences are converted to peptide frequency vectors within a matrix that is decomposed to provide simplified vector representations for each protein of the genome in a reduced dimensional space. These vectors are summed together to provide a vector representation for each species, and the angle between these vectors provides distance measures that are used to construct species trees. Results An unfiltered whole genome analysis (193,622 predicted proteins) strongly supports the currently accepted phylogeny for 12 Drosophila species at higher dimensions except for the generally accepted but difficult to discern sister relationship between D. erecta and D. yakuba. Also, in accordance with previous studies, many sequences appear to support alternative phylogenies. In this case, we observed grouping of D. erecta with D. sechellia when approximately 55% to 95% of the proteins were removed using a filter based on projection values or by reducing resolution by using fewer dimensions. Similar results were obtained when just the melanogaster subgroup was analyzed. Conclusions These results indicate that using our novel phylogenetic method, it is possible to consult and interpret all predicted protein sequences within multiple whole genomes to produce accurate phylogenetic estimations of relatedness between

  16. Desktop aligner for fabrication of multilayer microfluidic devices.

    Science.gov (United States)

    Li, Xiang; Yu, Zeta Tak For; Geraldo, Dalton; Weng, Shinuo; Alve, Nitesh; Dun, Wu; Kini, Akshay; Patel, Karan; Shu, Roberto; Zhang, Feng; Li, Gang; Jin, Qinghui; Fu, Jianping

    2015-07-01

    Multilayer assembly is a commonly used technique to construct multilayer polydimethylsiloxane (PDMS)-based microfluidic devices with complex 3D architecture and connectivity for large-scale microfluidic integration. Accurate alignment of structure features on different PDMS layers before their permanent bonding is critical in determining the yield and quality of assembled multilayer microfluidic devices. Herein, we report a custom-built desktop aligner capable of both local and global alignments of PDMS layers covering a broad size range. Two digital microscopes were incorporated into the aligner design to allow accurate global alignment of PDMS structures up to 4 in. in diameter. Both local and global alignment accuracies of the desktop aligner were determined to be about 20 μm cm^-1. To demonstrate its utility for fabrication of integrated multilayer PDMS microfluidic devices, we applied the desktop aligner to achieve accurate alignment of different functional PDMS layers in multilayer microfluidics including an organs-on-chips device as well as a microfluidic device integrated with vertical passages connecting channels located in different PDMS layers. Owing to its convenient operation, high accuracy, low cost, light weight, and portability, the desktop aligner is useful for microfluidic researchers to achieve rapid and accurate alignment for generating multilayer PDMS microfluidic devices.

  17. Efficient alignment of pyrosequencing reads for re-sequencing applications

    Directory of Open Access Journals (Sweden)

    Russo Luis MS

    2011-05-01

    Full Text Available Abstract Background Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the huge volume of data produced, but also because of some of their specific characteristics such as read length and sequencing errors. Among the most critical problems is that of efficiently and accurately mapping reads to a reference genome in the context of re-sequencing projects. Results We present an efficient method for the local alignment of pyrosequencing reads produced by the GS FLX (454) system against a reference sequence. Our approach explores the characteristics of the data in these re-sequencing applications and uses state of the art indexing techniques combined with a flexible seed-based approach, leading to a fast and accurate algorithm which needs very little user parameterization. An evaluation performed using real and simulated data shows that our proposed method outperforms a number of mainstream tools on the quantity and quality of successful alignments, as well as on the execution time. Conclusions The proposed methodology was implemented in a software tool called TAPyR--Tool for the Alignment of Pyrosequencing Reads--which is publicly available from http://www.tapyr.net.

  18. ProteinWorldDB: querying radical pairwise alignments among protein sets from complete genomes.

    Science.gov (United States)

    Otto, Thomas Dan; Catanho, Marcos; Tristão, Cristian; Bezerra, Márcia; Fernandes, Renan Mathias; Elias, Guilherme Steinberger; Scaglia, Alexandre Capeletto; Bovermann, Bill; Berstis, Viktors; Lifschitz, Sergio; de Miranda, Antonio Basílio; Degrave, Wim

    2010-03-01

    Many analyses in modern biological research are based on comparisons between biological sequences, resulting in functional, evolutionary and structural inferences. When large numbers of sequences are compared, heuristics are often used resulting in a certain lack of accuracy. In order to improve and validate results of such comparisons, we have performed radical all-against-all comparisons of 4 million protein sequences belonging to the RefSeq database, using an implementation of the Smith-Waterman algorithm. This extremely intensive computational approach was made possible with the help of World Community Grid, through the Genome Comparison Project. The resulting database, ProteinWorldDB, which contains coordinates of pairwise protein alignments and their respective scores, is now made available. Users can download, compare and analyze the results, filtered by genomes, protein functions or clusters. ProteinWorldDB is integrated with annotations derived from Swiss-Prot, Pfam, KEGG, NCBI Taxonomy database and gene ontology. The database is a unique and valuable asset, representing a major effort to create a reliable and consistent dataset of cross-comparisons of the whole protein content encoded in hundreds of completely sequenced genomes using a rigorous dynamic programming approach. The database can be accessed through http://proteinworlddb.org
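
    The Smith-Waterman algorithm named in this abstract is the standard dynamic programming method for local alignment. As a minimal sketch only, the Python below computes the best local alignment score for two sequences. The function name `smith_waterman`, the match/mismatch values and the linear gap penalty are illustrative assumptions; the abstract does not state which scoring scheme or implementation the project used, and protein comparisons would normally use a substitution matrix instead of a flat match score.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, end_i, end_j) of the best local alignment of a and b.

    Scoring values are hypothetical defaults chosen for illustration.
    end_i and end_j are 1-based end positions in a and b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # extend with a match or mismatch
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best
```

    Because every cell is filled, the cost grows with the product of the two sequence lengths, which is why an exhaustive all-against-all comparison of millions of sequences needs grid-scale computing.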

  19. CAFE: aCcelerated Alignment-FrEe sequence analysis.

    Science.gov (United States)

    Lu, Yang Young; Tang, Kujin; Ren, Jie; Fuhrman, Jed A; Waterman, Michael S; Sun, Fengzhu

    2017-07-03

    Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
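
    To make the idea of a k-mer based alignment-free measure concrete, the sketch below counts k-mers in two sequences and derives a normalized dissimilarity from the inner product of the count vectors. This is a simple frequency-only measure of the kind the abstract contrasts with CVTree, $d_2^*$ and $d_2^S$; it is not CAFE's code, and the function names and the default `k=4` are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_dissimilarity(x, y, k=4):
    """Dissimilarity in [0, 0.5]: 0 for identical k-mer profiles,
    0.5 when the two sequences share no k-mer at all."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Inner product of the two count vectors over shared k-mers.
    dot = sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())
    nx = sqrt(sum(v * v for v in cx.values()))
    ny = sqrt(sum(v * v for v in cy.values()))
    if nx == 0 or ny == 0:
        raise ValueError("each sequence must be at least k characters long")
    return 0.5 * (1.0 - dot / (nx * ny))
```

    Computing this value for every pair of genomes yields a distance matrix, which is the kind of object a PHYLIP-format output would carry into tree building.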

  20. A statistically harmonized alignment-classification in image space enables accurate and robust alignment of noisy images in single particle analysis.

    Science.gov (United States)

    Kawata, Masaaki; Sato, Chikara

    2007-06-01

    In determining the three-dimensional (3D) structure of macromolecular assemblies in single particle analysis, a large representative dataset of two-dimensional (2D) average images from a huge number of raw images is key to high resolution. Because alignments prior to averaging are computationally intensive, currently available multireference alignment (MRA) software does not survey every possible alignment. This leads to misaligned images, creating blurred averages and reducing the quality of the final 3D reconstruction. We present a new method, in which multireference alignment is harmonized with classification (multireference multiple alignment: MRMA). This method enables a statistical comparison of multiple alignment peaks, reflecting the similarities between each raw image and a set of reference images. Among the selected alignment candidates for each raw image, misaligned images are statistically excluded, based on the principle that aligned raw images of similar projections have a dense distribution around the correctly aligned coordinates in image space. This newly developed method was examined for accuracy and speed using model image sets with various signal-to-noise ratios, and with electron microscope images of the Transient Receptor Potential C3 and the sodium channel. In every data set, the newly developed method outperformed conventional methods in robustness against noise and in speed, creating 2D average images of higher quality. This statistically harmonized alignment-classification combination should greatly improve the quality of single particle analysis.

  1. Towards Robust and Accurate Multi-View and Partially-Occluded Face Alignment.

    Science.gov (United States)

    Xing, Junliang; Niu, Zhiheng; Huang, Junshi; Hu, Weiming; Zhou, Xi; Yan, Shuicheng

    2018-04-01

    Face alignment is an important task in computer vision. Regression-based methods currently dominate the approach to solving this problem; they generally employ a series of mapping functions from the face appearance to iteratively update the face shape hypothesis. One key point here is thus how to perform the regression procedure. In this work, we formulate this regression procedure as a sparse coding problem. We learn two relational dictionaries, one for the face appearance and the other one for the face shape, with coupled reconstruction coefficient to capture their underlying relationships. To deploy this model for face alignment, we derive the relational dictionaries in a stage-wise manner to perform closed-loop refinement of themselves, i.e., the face appearance dictionary is first learned from the face shape dictionary and then used to update the face shape hypothesis, and the updated face shape dictionary from the shape hypothesis is in return used to refine the face appearance dictionary. To improve the model accuracy, we extend this model hierarchically from the whole face shape to face part shapes, thus both the global and local view variations of a face are captured. To locate facial landmarks under occlusions, we further introduce an occlusion dictionary into the face appearance dictionary to recover face shape from partially occluded face appearance. The occlusion dictionary is learned in a data driven manner from background images to represent a set of elemental occlusion patterns, a sparse combination of which models various practical partial face occlusions. By integrating all these technical innovations, we obtain a robust and accurate approach to locate facial landmarks under different face views and possibly severe occlusions for face images in the wild. Extensive experimental analyses and evaluations on different benchmark datasets, as well as two new datasets built by ourselves, have demonstrated the robustness and accuracy of our proposed

  2. Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.

    Science.gov (United States)

    Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias

    2011-01-01

    The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E. coli, Shigella and S. pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in half the time and with a quarter of the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.

  3. CSA: An efficient algorithm to improve circular DNA multiple alignment

    Directory of Open Access Journals (Sweden)

    Pereira Luísa

    2009-07-01

    Full Text Available Background: The comparison of homologous sequences from different species is an essential approach to reconstruct the evolutionary history of species and of the genes they harbour in their genomes. Several complete mitochondrial and nuclear genomes are now available, increasing the importance of using multiple sequence alignment algorithms in comparative genomics. MtDNA has long been used in phylogenetic analysis and errors in the alignments can lead to errors in the interpretation of evolutionary information. Although a large number of multiple sequence alignment algorithms have been proposed to date, they all deal with linear DNA and cannot directly handle circular DNA. Researchers interested in aligning circular DNA sequences must first rotate them to the "right" place using an essentially manual process, before they can use multiple sequence alignment tools. Results: In this paper we propose an efficient algorithm that identifies the most interesting region to cut circular genomes in order to improve phylogenetic analysis when using standard multiple sequence alignment algorithms. This algorithm identifies the largest chain of non-repeated longest subsequences common to a set of circular mitochondrial DNA sequences. All the sequences are then rotated and made linear for multiple alignment purposes. To evaluate the effectiveness of this new tool, three different sets of mitochondrial DNA sequences were considered. Other tests considering randomly rotated sequences were also performed. The software package Arlequin was used to evaluate the standard genetic measures of the alignments obtained with and without the use of the CSA algorithm with two well-known multiple alignment algorithms, the CLUSTALW and the MAVID tools, and also the visualization tool SinicView. Conclusion: The results show that a circularization and rotation pre-processing step significantly improves the efficiency of publicly available multiple sequence alignment
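
    The rotation step described here can be illustrated with a much simpler stand-in than CSA itself. The sketch below picks one k-mer that occurs exactly once in every circular sequence and rotates each sequence to start there, so that a linear aligner sees the same starting point everywhere. CSA instead searches for the largest chain of non-repeated longest common subsequences; the single-anchor rule, the function names and the default `k=5` are assumptions made for illustration, and each sequence is assumed to be at least k characters long.

```python
from collections import Counter


def circular_kmers(seq, k):
    """List the k-mer starting at every position, wrapping around the end."""
    ext = seq + seq[:k - 1]
    return [ext[i:i + k] for i in range(len(seq))]


def rotate_to_shared_anchor(seqs, k=5):
    """Rotate each circular sequence to begin at a k-mer that is unique
    in every sequence. Returns (anchor, rotated_sequences)."""
    shared = None
    for s in seqs:
        counts = Counter(circular_kmers(s, k))
        unique = {w for w, c in counts.items() if c == 1}
        shared = unique if shared is None else shared & unique
    if not shared:
        raise ValueError("no k-mer occurs exactly once in every sequence")
    anchor = min(shared)  # deterministic choice among the candidates
    rotated = []
    for s in seqs:
        pos = circular_kmers(s, k).index(anchor)
        rotated.append(s[pos:] + s[:pos])
    return anchor, rotated
```

    When the inputs are rotations of one another, every output is the same linear string, which is the property a pre-processing step of this kind is meant to provide.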

  4. SWPhylo - A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees.

    Science.gov (United States)

    Yu, Xiaoyu; Reva, Oleg N

    2018-01-01

    Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, the unsuitability of standard phylogenetic tools for processing genome-wide data, and the lack of reliable substitution models which correlate with alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment, not to mention the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allowed a fast and reliable inference of phylogenetic trees. These were congruent to the corresponding whole genome super-matrix trees in terms of tree topology when compared with other known approaches including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. Applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Distinguishing between closely related taxonomic units may be enforced by providing the program with alignments of marker protein sequences, e.g., GyrA.

  5. Using structure to explore the sequence alignment space of remote homologs.

    Science.gov (United States)

    Kuziemko, Andrew; Honig, Barry; Petrey, Donald

    2011-10-01

    Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is "optimal" in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are "suboptimal" in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for "modelability", we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.

  6. A meiotic linkage map of the silver fox, aligned and compared to the canine genome.

    Science.gov (United States)

    Kukekova, Anna V; Trut, Lyudmila N; Oskina, Irina N; Johnson, Jennifer L; Temnykh, Svetlana V; Kharlamova, Anastasiya V; Shepeleva, Darya V; Gulievich, Rimma G; Shikhevich, Svetlana G; Graphodatsky, Alexander S; Aguirre, Gustavo D; Acland, Gregory M

    2007-03-01

    A meiotic linkage map is essential for mapping traits of interest and is often the first step toward understanding a cryptic genome. Specific strains of silver fox (a variant of the red fox, Vulpes vulpes), which segregate behavioral and morphological phenotypes, create a need for such a map. One such strain, selected for docility, exhibits friendly dog-like responses to humans, in contrast to another strain selected for aggression. Development of a fox map is facilitated by the known cytogenetic homologies between the dog and fox, and by the availability of high resolution canine genome maps and sequence data. Furthermore, the high genomic sequence identity between dog and fox allows adaptation of canine microsatellites for genotyping and meiotic mapping in foxes. Using 320 such markers, we have constructed the first meiotic linkage map of the fox genome. The resulting sex-averaged map covers 16 fox autosomes and the X chromosome with an average inter-marker distance of 7.5 cM. The total map length corresponds to 1480.2 cM. From comparison of sex-averaged meiotic linkage maps of the fox and dog genomes, suppression of recombination in pericentromeric regions of the metacentric fox chromosomes was apparent, relative to the corresponding segments of acrocentric dog chromosomes. Alignment of the fox meiotic map against the 7.6x canine genome sequence revealed high conservation of marker order between homologous regions of the two species. The fox meiotic map provides a critical tool for genetic studies in foxes and identification of genetic loci and genes implicated in fox domestication.

  7. Genomic Prediction from Whole Genome Sequence in Livestock: The 1000 Bull Genomes Project

    DEFF Research Database (Denmark)

    Hayes, Benjamin J; MacLeod, Iona M; Daetwyler, Hans D

    Advantages of using whole genome sequence data to predict genomic estimated breeding values (GEBV) include better persistence of accuracy of GEBV across generations and more accurate GEBV across breeds. The 1000 Bull Genomes Project provides a database of whole genome sequenced key ancestor bulls....... In a dairy data set, predictions using BayesRC and imputed sequence data from 1000 Bull Genomes were 2% more accurate than with 800k data. We could demonstrate the method identified causal mutations in some cases. Further improvements will come from more accurate imputation of sequence variant genotypes...

  8. SWPhylo – A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees

    Science.gov (United States)

    Yu, Xiaoyu; Reva, Oleg N

    2018-01-01

    Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, the unsuitability of standard phylogenetic tools for processing genome-wide data, and the lack of reliable substitution models which correlate with alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment, not to mention the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allowed a fast and reliable inference of phylogenetic trees. These were congruent to the corresponding whole genome super-matrix trees in terms of tree topology when compared with other known approaches including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. Applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Distinguishing between closely related taxonomic units may be enforced by providing the program with alignments of marker protein sequences, e.g., GyrA. PMID:29511354

  9. Accurate overlaying for mobile augmented reality

    NARCIS (Netherlands)

    Pasman, W; van der Schaaf, A; Lagendijk, RL; Jansen, F.W.

    1999-01-01

    Mobile augmented reality requires accurate alignment of virtual information with objects visible in the real world. We describe a system for mobile communications to be developed to meet these strict alignment criteria using a combination of computer vision, inertial tracking and low-latency

  10. SOAP2: an improved ultrafast tool for short read alignment

    DEFF Research Database (Denmark)

    Li, Ruiqiang; Yu, Chang; Li, Yingrui

    2009-01-01

    SUMMARY: SOAP2 is a significantly improved version of the short oligonucleotide alignment program that both reduces computer memory usage and increases alignment speed at an unprecedented rate. We used a Burrows Wheeler Transformation (BWT) compression index to substitute the seed strategy...... for indexing the reference sequence in the main memory. We tested it on the whole human genome and found that this new algorithm reduced memory usage from 14.7 to 5.4 GB and improved alignment speed by 20-30 times. SOAP2 is compatible with both single- and paired-end reads. Additionally, this tool now supports...... multiple text and compressed file formats. A consensus builder has also been developed for consensus assembly and SNP detection from alignment of short reads on a reference genome. AVAILABILITY: http://soap.genomics.org.cn....
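
    The Burrows Wheeler index mentioned in this summary can be illustrated with a toy version. The sketch below builds the transform by sorting all rotations of the text and then counts exact pattern occurrences by backward search. It is a teaching sketch under simplifying assumptions, not SOAP2's implementation: the function names are hypothetical, the rotation sort is far too slow for a genome, and real indexes store sampled occurrence tables instead of rescanning the string.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end marker.
    Assumes '$' does not occur in text and sorts before every other symbol."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact occurrences of pattern in the original text by
    backward search over the transformed string."""
    # C[ch]: number of symbols in the text that sort strictly before ch.
    C = {}
    for idx, ch in enumerate(sorted(bwt_str)):
        C.setdefault(ch, idx)

    def occ(ch, i):
        # Occurrences of ch in bwt_str[:i]; a real index precomputes this.
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ(ch, lo)
        hi = C[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

    The search touches the pattern one character at a time and never scans the reference itself, which is the property that lets an index of this kind replace a seed table held in main memory.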

  11. Using structure to explore the sequence alignment space of remote homologs.

    Directory of Open Access Journals (Sweden)

    Andrew Kuziemko

    2011-10-01

    Full Text Available Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is "optimal" in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are "suboptimal" in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for "modelability", we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.

  12. CVTree3 Web Server for Whole-genome-based and Alignment-free Prokaryotic Phylogeny and Taxonomy

    Directory of Open Access Journals (Sweden)

    Guanghong Zuo

    2015-10-01

    Full Text Available A faithful phylogeny and an objective taxonomy for prokaryotes should agree with each other and ultimately follow the genome data. With the number of sequenced genomes reaching tens of thousands, both tree inference and detailed comparison with taxonomy are great challenges. We now provide one solution in the latest Release 3.0 of the alignment-free and whole-genome-based web server CVTree3. The server resides in a cluster of 64 cores and is equipped with an interactive, collapsible, and expandable tree display. It is capable of comparing the tree branching order with prokaryotic classification at all taxonomic ranks from domains down to species and strains. CVTree3 allows for inquiry by taxon names and trial on lineage modifications. In addition, it reports a summary of monophyletic and non-monophyletic taxa at all ranks as well as produces print-quality subtree figures. After giving an overview of retrospective verification of the CVTree approach, the power of the new server is described for the mega-classification of prokaryotes and determination of taxonomic placement of some newly-sequenced genomes. A few discrepancies between CVTree and 16S rRNA analyses are also summarized with regard to possible taxonomic revisions. CVTree3 is freely accessible to all users at http://tlife.fudan.edu.cn/cvtree3/ without login requirements.

  13. CVTree3 Web Server for Whole-genome-based and Alignment-free Prokaryotic Phylogeny and Taxonomy.

    Science.gov (United States)

    Zuo, Guanghong; Hao, Bailin

    2015-10-01

    A faithful phylogeny and an objective taxonomy for prokaryotes should agree with each other and ultimately follow the genome data. With the number of sequenced genomes reaching tens of thousands, both tree inference and detailed comparison with taxonomy are great challenges. We now provide one solution in the latest Release 3.0 of the alignment-free and whole-genome-based web server CVTree3. The server resides in a cluster of 64 cores and is equipped with an interactive, collapsible, and expandable tree display. It is capable of comparing the tree branching order with prokaryotic classification at all taxonomic ranks from domains down to species and strains. CVTree3 allows for inquiry by taxon names and trial on lineage modifications. In addition, it reports a summary of monophyletic and non-monophyletic taxa at all ranks as well as produces print-quality subtree figures. After giving an overview of retrospective verification of the CVTree approach, the power of the new server is described for the mega-classification of prokaryotes and determination of taxonomic placement of some newly-sequenced genomes. A few discrepancies between CVTree and 16S rRNA analyses are also summarized with regard to possible taxonomic revisions. CVTree3 is freely accessible to all users at http://tlife.fudan.edu.cn/cvtree3/ without login requirements. Copyright © 2015 The Authors. Production and hosting by Elsevier Ltd.. All rights reserved.

  14. gmos: Rapid Detection of Genome Mosaicism over Short Evolutionary Distances.

    Science.gov (United States)

    Domazet-Lošo, Mirjana; Domazet-Lošo, Tomislav

    2016-01-01

    Prokaryotic and viral genomes are often altered by recombination and horizontal gene transfer. The existing methods for detecting recombination are primarily aimed at viral genomes or sets of loci, since the expensive computation of underlying statistical models often hinders the comparison of complete prokaryotic genomes. As an alternative, alignment-free solutions are more efficient, but cannot map (align) a query to subject genomes. To address this problem, we have developed gmos (Genome MOsaic Structure), a new program that determines the mosaic structure of query genomes when compared to a set of closely related subject genomes. The program first computes local alignments between query and subject genomes and then reconstructs the query mosaic structure by choosing the best local alignment for each query region. To accomplish the analysis quickly, the program mostly relies on pairwise alignments and constructs multiple sequence alignments over short overlapping subject regions only when necessary. This fine-tuned implementation achieves an efficiency comparable to an alignment-free tool. The program performs well for simulated and real data sets of closely related genomes and can be used for fast recombination detection; for instance, when a new prokaryotic pathogen is discovered. As an example, gmos was used to detect genome mosaicism in a pathogenic Enterococcus faecium strain compared to seven closely related genomes. The analysis took less than two minutes on a single 2.1 GHz processor. The output is available in fasta format and can be visualized using an accessory program, gmosDraw (freely available with gmos).
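
    The reconstruction step described here, choosing the best local alignment for each query region, can be sketched as follows. The code takes a list of local alignment hits and, for each query position, keeps the subject with the highest-scoring hit covering it, then merges consecutive positions into segments. This is an illustration only: the hit tuple layout, the per-position scan and the name `mosaic_structure` are assumptions, and the abstract does not describe how gmos actually scores or resolves competing alignments.

```python
def mosaic_structure(query_len, hits):
    """Build a mosaic of the query from local alignment hits.

    hits: iterable of (start, end, subject, score) with 0-based,
    half-open query coordinates (a hypothetical layout).
    Returns a list of (start, end, subject) segments; subject is None
    where no hit covers the query.
    """
    best = [None] * query_len  # (score, subject) chosen for each position
    for start, end, subject, score in hits:
        for pos in range(max(0, start), min(end, query_len)):
            if best[pos] is None or score > best[pos][0]:
                best[pos] = (score, subject)

    segments = []
    for pos, entry in enumerate(best):
        subject = entry[1] if entry else None
        if segments and segments[-1][2] == subject:
            segments[-1][1] = pos + 1  # extend the current segment
        else:
            segments.append([pos, pos + 1, subject])
    return [tuple(s) for s in segments]
```

    For example, two overlapping hits to subjects "A" and "B" produce a breakpoint where the higher-scoring hit begins, and uncovered query regions are reported with `None`.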

  15. gmos: Rapid Detection of Genome Mosaicism over Short Evolutionary Distances.

    Directory of Open Access Journals (Sweden)

    Mirjana Domazet-Lošo

    Full Text Available Prokaryotic and viral genomes are often altered by recombination and horizontal gene transfer. The existing methods for detecting recombination are primarily aimed at viral genomes or sets of loci, since the expensive computation of underlying statistical models often hinders the comparison of complete prokaryotic genomes. As an alternative, alignment-free solutions are more efficient, but cannot map (align) a query to subject genomes. To address this problem, we have developed gmos (Genome MOsaic Structure), a new program that determines the mosaic structure of query genomes when compared to a set of closely related subject genomes. The program first computes local alignments between query and subject genomes and then reconstructs the query mosaic structure by choosing the best local alignment for each query region. To accomplish the analysis quickly, the program mostly relies on pairwise alignments and constructs multiple sequence alignments over short overlapping subject regions only when necessary. This fine-tuned implementation achieves an efficiency comparable to an alignment-free tool. The program performs well for simulated and real data sets of closely related genomes and can be used for fast recombination detection; for instance, when a new prokaryotic pathogen is discovered. As an example, gmos was used to detect genome mosaicism in a pathogenic Enterococcus faecium strain compared to seven closely related genomes. The analysis took less than two minutes on a single 2.1 GHz processor. The output is available in fasta format and can be visualized using an accessory program, gmosDraw (freely available with gmos).

  16. Statistical potential-based amino acid similarity matrices for aligning distantly related protein sequences.

    Science.gov (United States)

    Tan, Yen Hock; Huang, He; Kihara, Daisuke

    2006-08-15

    Aligning distantly related protein sequences is a long-standing problem in bioinformatics, and a key for successful protein structure prediction. Its importance has been increasing recently in the context of structural genomics projects because more and more experimentally solved structures are available as templates for protein structure modeling. Toward this end, recent structure prediction methods employ profile-profile alignments, and various ways of aligning two profiles have been developed. More fundamentally, a better amino acid similarity matrix can improve a profile itself, thereby resulting in more accurate profile-profile alignments. Here we have developed novel amino acid similarity matrices from knowledge-based amino acid contact potentials. Contact potentials are used because the contact propensity to the other amino acids would be one of the most conserved features of each position of a protein structure. The derived amino acid similarity matrices are tested on benchmark alignments at three different levels, namely, the family, the superfamily, and the fold level. Compared to BLOSUM45 and the other existing matrices, the contact potential-based matrices perform comparably in the family level alignments, but clearly outperform them in the fold level alignments. The contact potential-based matrices perform even better when suboptimal alignments are considered. Comparing the matrices themselves with each other revealed that the contact potential-based matrices are very different from BLOSUM45 and the other matrices, indicating that they are located in a different basin in the amino acid similarity matrix space.

  17. Functional annotation by sequence-weighted structure alignments: statistical analysis and case studies from the Protein 3000 structural genomics project in Japan.

    Science.gov (United States)

    Standley, Daron M; Toh, Hiroyuki; Nakamura, Haruki

    2008-09-01

    A method to functionally annotate structural genomics targets, based on a novel structural alignment scoring function, is proposed. In the proposed score, position-specific scoring matrices are used to weight structurally aligned residue pairs to highlight evolutionarily conserved motifs. The functional form of the score is first optimized for discriminating domains belonging to the same Pfam family from domains belonging to different families but the same CATH or SCOP superfamily. In the optimization stage, we consider four standard weighting functions as well as our own, the "maximum substitution probability," and combinations of these functions. The optimized score achieves an area of 0.87 under the receiver-operating characteristic curve with respect to identifying Pfam families within a sequence-unique benchmark set of domain pairs. Confidence measures are then derived from the benchmark distribution of true-positive scores. The alignment method is next applied to the task of functionally annotating 230 query proteins released to the public as part of the Protein 3000 structural genomics project in Japan. Of these queries, 78 were found to align to templates with the same Pfam family as the query or had sequence identities ≥ 30%. Another 49 queries were found to match more distantly related templates. Within this group, the template predicted by our method to be the closest functional relative was often not the most structurally similar. Several nontrivial cases are discussed in detail. Finally, 103 queries matched templates at the fold level, but not the family or superfamily level, and remain functionally uncharacterized. © 2008 Wiley-Liss, Inc.

  18. Sequence- vs. chip-assisted genomic selection: accurate biological information is advised.

    Science.gov (United States)

    Pérez-Enciso, Miguel; Rincón, Juan C; Legarra, Andrés

    2015-05-09

    The development of next-generation sequencing technologies (NGS) has made the use of whole-genome sequence data for routine genetic evaluations possible, which has triggered a considerable interest in animal and plant breeding fields. Here, we investigated whether complete or partial sequence data can improve upon existing SNP (single nucleotide polymorphism) array-based selection strategies by simulation using a mixed coalescence - gene-dropping approach. We simulated 20 or 100 causal mutations (quantitative trait nucleotides, QTN) within 65 predefined 'gene' regions, each 10 kb long, within a genome composed of ten 3-Mb chromosomes. We compared prediction accuracy by cross-validation using a medium-density chip (7.5 k SNPs), a high-density (HD, 17 k) and sequence data (335 k). Genetic evaluation was based on a GBLUP method. The simulations showed: (1) a law of diminishing returns with increasing number of SNPs; (2) a modest effect of SNP ascertainment bias in arrays; (3) a small advantage of using whole-genome sequence data vs. HD arrays i.e. ~4%; (4) a minor effect of NGS errors except when imputation error rates are high (≥20%); and (5) if QTN were known, prediction accuracy approached 1. Since this is obviously unrealistic, we explored milder assumptions. We showed that, if all SNPs within causal genes were included in the prediction model, accuracy could also dramatically increase by ~40%. However, this criterion was highly sensitive to either misspecification (including wrong genes) or to the use of an incomplete gene list; in these cases, accuracy fell rapidly towards that reached when all SNPs from sequence data were blindly included in the model. Our study shows that, unless an accurate prior estimate on the functionality of SNPs can be included in the predictor, there is a law of diminishing returns with increasing SNP density. As a result, use of whole-genome sequence data may not result in a highly increased selection response over high

  19. Refining borders of genome-rearrangements including repetitions

    Directory of Open Access Journals (Sweden)

    JA Arjona-Medina

    2016-10-01

    Background: DNA rearrangement events have been widely studied in comparative genomics for many years. The importance of these events lies not only in studying the relatedness among different species, but also in determining the mechanisms behind evolution. Although there are many methods to identify genome rearrangements (GR), the refinement of their borders has become a huge challenge. Until now, no accepted method exists to achieve accurate fine-tuning: the notion of breakpoint (BP) is still an open issue, and although repeated regions are vital to understanding evolution, they are not taken into account in most GR detection and refinement methods. Methods and results: We propose a method to refine the borders of GR including repeated regions. Instead of removing these repetitions to facilitate computation, we take advantage of them using a consensus alignment sequence of the repeated region in between two blocks. Using the concept of identity vectors for Synteny Blocks (SB) and repetitions, a Finite State Machine is designed to detect transition points in the difference between such vectors. The method does not force the BP to be a region or a point but depends on the alignment transitions within the SBs and repetitions. Conclusion: The accurate definition of the borders of SB and repeated genomic regions, and consequently the detection of BP, might help to understand the evolutionary model of species. In this manuscript we present a new proposal for such a refinement. Features of the SB borders and BPs are different and fit with what is expected: SBs show more diversity in annotations, while BPs are short and richer in DNA replication and stress response, which are strongly linked with rearrangements.
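    The idea of running a finite state machine over identity vectors can be illustrated with a small Python sketch. This is a simplified stand-in under stated assumptions, not the authors' method: it uses a single identity vector rather than the difference between two, and the window width, the two thresholds, the state names and all function names are hypothetical.

```python
def identity_vector(seq_a, seq_b):
    """1 where two aligned columns match (gaps never match), else 0."""
    return [1 if a == b and a != "-" else 0 for a, b in zip(seq_a, seq_b)]


def windowed_identity(vector, window):
    """Mean identity in each sliding window of the given width."""
    return [sum(vector[i:i + window]) / window
            for i in range(len(vector) - window + 1)]


def find_transitions(profile, high=0.7, low=0.3):
    """Two-state machine with hysteresis; returns (index, event) pairs.

    The thresholds are assumed values chosen for the toy example.
    """
    if not profile:
        return []
    state = "IN_BLOCK" if profile[0] >= high else "OUTSIDE"
    events = []
    for i, value in enumerate(profile):
        if state == "IN_BLOCK" and value <= low:
            state = "OUTSIDE"
            events.append((i, "exit"))
        elif state == "OUTSIDE" and value >= high:
            state = "IN_BLOCK"
            events.append((i, "enter"))
    return events


# Toy aligned pair: ten matching columns followed by ten mismatches.
a = "ACGTACGTAC" + "TTTTTTTTTT"
b = "ACGTACGTAC" + "GGGGGGGGGG"
profile = windowed_identity(identity_vector(a, b), window=5)
events = find_transitions(profile)  # a single "exit" near the block border
```

    Using two thresholds instead of one keeps the machine from flipping state on every noisy column, which is one plausible reason to prefer a state machine over a plain cutoff.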

  20. DendroBLAST: approximate phylogenetic trees in the absence of multiple sequence alignments.

    Science.gov (United States)

    Kelly, Steven; Maini, Philip K

    2013-01-01

    The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method differs from other methods by incorporating a simple model of sequence evolution to test the effect of introducing sequence changes on the reliability of the bipartitions in the inferred tree. Using realistic simulated sequence data we demonstrate that this method produces phylogenetic trees that are more accurate than other commonly-used distance based methods though not as accurate as maximum likelihood methods from good quality multiple sequence alignments. In addition to tests on simulated data, we use DendroBLAST to generate input trees for a supertree reconstruction of the phylogeny of the Archaea. This independent analysis produces an approximate phylogeny of the Archaea that has both high precision and recall when compared to previously published analysis of the same dataset using conventional methods. Taken together these results demonstrate that approximate phylogenetic trees can be produced in the absence of multiple sequence alignments, and we propose that these trees will provide a platform for improving and informing downstream bioinformatic analysis. A web implementation of the DendroBLAST method is freely available for use at http://www.dendroblast.com/.

  1. DendroBLAST: approximate phylogenetic trees in the absence of multiple sequence alignments.

    Directory of Open Access Journals (Sweden)

    Steven Kelly

    The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method differs from other methods by incorporating a simple model of sequence evolution to test the effect of introducing sequence changes on the reliability of the bipartitions in the inferred tree. Using realistic simulated sequence data we demonstrate that this method produces phylogenetic trees that are more accurate than other commonly-used distance based methods though not as accurate as maximum likelihood methods from good quality multiple sequence alignments. In addition to tests on simulated data, we use DendroBLAST to generate input trees for a supertree reconstruction of the phylogeny of the Archaea. This independent analysis produces an approximate phylogeny of the Archaea that has both high precision and recall when compared to previously published analysis of the same dataset using conventional methods. Taken together these results demonstrate that approximate phylogenetic trees can be produced in the absence of multiple sequence alignments, and we propose that these trees will provide a platform for improving and informing downstream bioinformatic analysis. A web implementation of the DendroBLAST method is freely available for use at http://www.dendroblast.com/.

  2. An automatic and accurate x-ray tube focal spot/grid alignment system for mobile radiography: System description and alignment accuracy

    International Nuclear Information System (INIS)

    Gauntt, David M.; Barnes, Gary T.

    2010-01-01

    Purpose: A mobile radiography automatic grid alignment system (AGAS) has been developed by modifying a commercially available mobile unit. The objectives of this article are to describe the modifications and operation and to report on the accuracy with which the focal spot is aligned to the grid and the time required to achieve the alignment. Methods: The modifications include an optical target arm attached to the grid tunnel, a video camera attached to the collimator, a motion control system with six degrees of freedom to position the collimator and x-ray tube, and a computer to control the system. The video camera and computer determine the grid position, and then the motion control system drives the x-ray focal spot to the center of the grid focal axis. The accuracy of the alignment of the focal spot with the grid and the time required to achieve alignment were measured both in laboratory tests and in clinical use. Results: For a typical exam, the modified unit automatically aligns the focal spot with the grid in less than 10 s, with an accuracy of better than 4 mm. The results of the speed and accuracy tests in clinical use were similar to the results in laboratory tests. Comparison patient chest images are presented--one obtained with a standard mobile radiographic unit without a grid and the other obtained with the modified unit and a 15:1 grid. The 15:1 grid images demonstrate a marked improvement in image quality compared to the nongrid images with no increase in patient dose. Conclusions: The mobile radiography AGAS produces images of significantly improved quality compared to nongrid images with alignment times of less than 10 s and no increase in patient dose.

  3. Rapid identification of sequences for orphan enzymes to power accurate protein annotation.

    Directory of Open Access Journals (Sweden)

    Kevin R Ramkissoon

    The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.

  4. Rapid Identification of Sequences for Orphan Enzymes to Power Accurate Protein Annotation

    Science.gov (United States)

    Ojha, Sunil; Watson, Douglas S.; Bomar, Martha G.; Galande, Amit K.; Shearer, Alexander G.

    2013-01-01

    The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the “back catalog” of enzymology – “orphan enzymes,” those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme “back catalog” is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology’s “back catalog” another powerful tool to drive accurate genome annotation. PMID:24386392

  5. Constrained-DFT method for accurate energy-level alignment of metal/molecule interfaces

    KAUST Repository

    Souza, A. M.

    2013-10-07

    We present a computational scheme for extracting the energy-level alignment of a metal/molecule interface, based on constrained density functional theory and local exchange and correlation functionals. The method, applied here to benzene on Li(100), allows us to evaluate charge-transfer energies, as well as the spatial distribution of the image charge induced on the metal surface. We systematically study the energies for charge transfer from the molecule to the substrate as function of the molecule-substrate distance, and investigate the effects arising from image-charge confinement and local charge neutrality violation. For benzene on Li(100) we find that the image-charge plane is located at about 1.8 Å above the Li surface, and that our calculated charge-transfer energies compare perfectly with those obtained with a classical electrostatic model having the image plane located at the same position. The methodology outlined here can be applied to study any metal/organic interface in the weak coupling limit at the computational cost of a total energy calculation. Most importantly, as the scheme is based on total energies and not on correcting the Kohn-Sham quasiparticle spectrum, accurate results can be obtained with local/semilocal exchange and correlation functionals. This enables a systematic approach to convergence.

  6. Constrained-DFT method for accurate energy-level alignment of metal/molecule interfaces

    KAUST Repository

    Souza, A. M.; Rungger, I.; Pemmaraju, C. D.; Schwingenschlögl, Udo; Sanvito, S.

    2013-01-01

    We present a computational scheme for extracting the energy-level alignment of a metal/molecule interface, based on constrained density functional theory and local exchange and correlation functionals. The method, applied here to benzene on Li(100), allows us to evaluate charge-transfer energies, as well as the spatial distribution of the image charge induced on the metal surface. We systematically study the energies for charge transfer from the molecule to the substrate as function of the molecule-substrate distance, and investigate the effects arising from image-charge confinement and local charge neutrality violation. For benzene on Li(100) we find that the image-charge plane is located at about 1.8 Å above the Li surface, and that our calculated charge-transfer energies compare perfectly with those obtained with a classical electrostatic model having the image plane located at the same position. The methodology outlined here can be applied to study any metal/organic interface in the weak coupling limit at the computational cost of a total energy calculation. Most importantly, as the scheme is based on total energies and not on correcting the Kohn-Sham quasiparticle spectrum, accurate results can be obtained with local/semilocal exchange and correlation functionals. This enables a systematic approach to convergence.

  7. Enhancing faba bean (Vicia faba L.) genome resources.

    Science.gov (United States)

    Cooper, James W; Wilson, Michael H; Derks, Martijn F L; Smit, Sandra; Kunert, Karl J; Cullis, Christopher; Foyer, Christine H

    2017-04-01

    Grain legume improvement is currently impeded by a lack of genomic resources. The paucity of genome information for faba bean can be attributed to the intrinsic difficulties of assembling/annotating its giant (~13 Gb) genome. In order to address this challenge, RNA-sequencing analysis was performed on faba bean (cv. Wizard) leaves. Read alignment to the faba bean reference transcriptome identified 16 300 high quality unigenes. In addition, Illumina paired-end sequencing was used to establish a baseline for genomic information assembly. Genomic reads were assembled de novo into contigs with a size range of 50-5000 bp. Over 85% of sequences did not align to known genes, of which ~10% could be aligned to known repetitive genetic elements. Over 26 000 of the reference transcriptome unigenes could be aligned to DNA-sequencing (DNA-seq) reads with high confidence. Moreover, this comparison identified 56 668 potential splice points in all identified unigenes. Sequence length data were extended at 461 putative loci through alignment of DNA-seq contigs to full-length, publicly available linkage marker sequences. Reads also yielded coverages of 3466× and 650× for the chloroplast and mitochondrial genomes, respectively. Inter- and intraspecies organelle genome comparisons established core legume organelle gene sets, and revealed polymorphic regions of faba bean organelle genomes. © The Author 2017. Published by Oxford University Press on behalf of the Society for Experimental Biology.

  8. GOSSIP: a method for fast and accurate global alignment of protein structures.

    Science.gov (United States)

    Kifer, I; Nussinov, R; Wolfson, H J

    2011-04-01

    The database of known protein structures (PDB) is increasing rapidly. This results in a growing need for methods that can cope with the vast amount of structural data. To analyze the accumulating data, it is important to have a fast tool for identifying similar structures and clustering them by structural resemblance. Several excellent tools have been developed for the comparison of protein structures. These usually address the task of local structure alignment, an important yet computationally intensive problem due to its complexity. It is difficult to use such tools for comparing a large number of structures to each other in a reasonable time. Here we present GOSSIP, a novel method for a global all-against-all alignment of any set of protein structures. The method detects similarities between structures down to a certain cutoff (a parameter of the program), hence allowing it to detect similar structures at a much higher speed than local structure alignment methods. GOSSIP compares many structures several orders of magnitude faster than well-known available structure alignment servers, and it is also faster than a database scanning method. We evaluate GOSSIP both on a dataset of short structural fragments and on two large sequence-diverse structural benchmarks. Our conclusions are that for a threshold of 0.6 and above, the speed of GOSSIP is obtained with no compromise of the accuracy of the alignments or of the number of detected global similarities. A server, as well as an executable for download, are available at http://bioinfo3d.cs.tau.ac.il/gossip/.

  9. READSCAN: A fast and scalable pathogen discovery program with accurate genome relative abundance estimation

    KAUST Repository

    Naeem, Raeece

    2012-11-28

    Summary: READSCAN is a highly scalable parallel program to identify non-host sequences (of potential pathogen origin) and estimate their genome relative abundance in high-throughput sequence datasets. READSCAN accurately classified human and viral sequences on a 20.1 million reads simulated dataset in <27 min using a small Beowulf compute cluster with 16 nodes (Supplementary Material). Availability: http://cbrc.kaust.edu.sa/readscan Contact: or raeece.naeem@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. 2012 The Author(s).

  10. READSCAN: A fast and scalable pathogen discovery program with accurate genome relative abundance estimation

    KAUST Repository

    Naeem, Raeece; Rashid, Mamoon; Pain, Arnab

    2012-01-01

    Summary: READSCAN is a highly scalable parallel program to identify non-host sequences (of potential pathogen origin) and estimate their genome relative abundance in high-throughput sequence datasets. READSCAN accurately classified human and viral sequences on a 20.1 million reads simulated dataset in <27 min using a small Beowulf compute cluster with 16 nodes (Supplementary Material). Availability: http://cbrc.kaust.edu.sa/readscan Contact: or raeece.naeem@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. 2012 The Author(s).
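    The abstract does not say how READSCAN computes genome relative abundance. One common way to define such a quantity, shown here purely as an assumption, is to divide the number of reads assigned to each genome by that genome's length and rescale the results to sum to 1. The function name, genome names, counts and lengths below are all hypothetical.

```python
def genome_relative_abundance(read_counts, genome_lengths):
    """Length-normalised read counts, rescaled so the values sum to 1.

    Assumed definition for illustration; not taken from READSCAN itself.
    """
    density = {}
    for genome, count in read_counts.items():
        length = genome_lengths[genome]
        if length <= 0:
            raise ValueError(f"genome length must be positive: {genome}")
        density[genome] = count / length
    total = sum(density.values())
    if total == 0:
        return {genome: 0.0 for genome in density}
    return {genome: d / total for genome, d in density.items()}


# Made-up counts and lengths: virus_B has fewer reads but a much shorter genome.
counts = {"virus_A": 300, "virus_B": 100}
lengths = {"virus_A": 30000, "virus_B": 5000}
gra = genome_relative_abundance(counts, lengths)
```

    In this toy case virus_B ends up with twice the relative abundance of virus_A despite having a third of the reads, which is the effect length normalisation is meant to capture.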

  11. An efficient genetic algorithm for structural RNA pairwise alignment and its application to non-coding RNA discovery in yeast

    Directory of Open Access Journals (Sweden)

    Taneda Akito

    2008-12-01

    Background: Aligning RNA sequences with low sequence identity has been a challenging problem, since such a computation essentially needs an algorithm with high complexity to take structural conservation into account. Although many sophisticated algorithms for this purpose have been proposed to date, further improvement in efficiency is necessary to accelerate large-scale applications, including non-coding RNA (ncRNA) discovery. Results: We developed a new genetic algorithm, Cofolga2, for simultaneously computing pairwise RNA sequence alignment and consensus folding, and benchmarked it using BRAliBase 2.1. The benchmark results showed that our new algorithm is accurate and efficient in both time and memory usage. Then, combining it with the originally trained SVM, we applied the new algorithm to novel ncRNA discovery, where we compared the S. cerevisiae genome with six related genomes in a pairwise manner. By focusing our search on the relatively short regions (50 bp to 2,000 bp) sandwiched by conserved sequences, we successfully predicted 714 intergenic and 1,311 sense or antisense ncRNA candidates, which were found in the pairwise alignments with stable consensus secondary structure and low sequence identity (≤ 50%). By comparing with the previous predictions, we found that > 92% of the candidates are novel. The estimated rate of false positives in the predicted candidates is 51%. Twenty-five percent of the intergenic candidates have support for expression in the cell, i.e. their genomic positions overlap those of experimentally determined transcripts in the literature. Moreover, by manual inspection of the results, we obtained four multiple alignments with low sequence identity which reveal consensus structures shared by three species/sequences. Conclusion: The present method provides an efficient tool complementary to sequence-alignment-based ncRNA finders.

  12. Flavivirus and Filovirus EvoPrinters: New alignment tools for the comparative analysis of viral evolution.

    Directory of Open Access Journals (Sweden)

    Thomas Brody

    2017-06-01

    Flavivirus and Filovirus infections are serious epidemic threats to human populations. Multi-genome comparative analysis of these evolving pathogens affords a view of their essential, conserved sequence elements as well as progressive evolutionary changes. While phylogenetic analysis has yielded important insights, the growing number of available genomic sequences makes comparisons between hundreds of viral strains challenging. We report here a new approach for the comparative analysis of these hemorrhagic fever viruses that can superimpose an unlimited number of one-on-one alignments to identify important features within genomes of interest. We have adapted EvoPrinter alignment algorithms for the rapid comparative analysis of Flavivirus or Filovirus sequences including Zika and Ebola strains. The user can input a full genome or partial viral sequence and then view either individual comparisons or generate color-coded readouts that superimpose hundreds of one-on-one alignments to identify unique or shared identity SNPs that reveal ancestral relationships between strains. The user can also opt to select a database genome in order to access a library of pre-aligned genomes of either 1,094 Flaviviruses or 460 Filoviruses for rapid comparative analysis with all database entries or a select subset. Using EvoPrinter search and alignment programs, we show the following: (1) superimposing alignment data from many related strains identifies lineage identity SNPs, which enable the assessment of sublineage complexity within viral outbreaks; (2) whole-genome SNP profile screens uncover novel Dengue2 and Zika recombinant strains and their parental lineages; (3) differential SNP profiling identifies host cell A-to-I hyper-editing within Ebola and Marburg viruses; and (4) hundreds of superimposed one-on-one Ebola genome alignments highlight ultra-conserved regulatory sequences, invariant amino acid codons and evolutionarily variable protein-encoding domains within a

  13. DIDA: Distributed Indexing Dispatched Alignment.

    Directory of Open Access Journals (Sweden)

    Hamid Mohamadi

    One essential application in bioinformatics that is affected by the high-throughput sequencing data deluge is the sequence alignment problem, where nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When queries are too many and/or targets are too large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for easy access while searching for matches. When the target is static, such as in an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed in large-scale alignments to draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through http://www.bcgsc.ca/platform/bioinfo/software/dida. The software is released under the British Columbia Cancer Agency License (BCCA), and is free for academic use.
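    The split of indexing and alignment into smaller subtasks can be pictured with a toy Python sketch: targets are divided into partitions, each partition gets its own index, and each query is routed only to partitions that might contain a match. The abstract does not describe how DIDA routes queries, so the k-mer sets used here are an assumed stand-in for a real index, and every function and sequence name is hypothetical.

```python
from collections import defaultdict


def partition_targets(targets, n_parts):
    """Round-robin split of named target sequences into n_parts dicts."""
    parts = [dict() for _ in range(n_parts)]
    for i, (name, seq) in enumerate(sorted(targets.items())):
        parts[i % n_parts][name] = seq
    return parts


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(part, k):
    """Per-partition index: the union of k-mers over its targets."""
    index = set()
    for seq in part.values():
        index |= kmers(seq, k)
    return index


def dispatch(queries, indexes, k):
    """Map partition number -> queries sharing at least one k-mer with it."""
    routed = defaultdict(list)
    for name, seq in queries.items():
        query_kmers = kmers(seq, k)
        for part_id, index in enumerate(indexes):
            if query_kmers & index:
                routed[part_id].append(name)
    return dict(routed)


# Toy contigs and reads, made up for illustration.
targets = {"contig1": "ACGTACGTGG", "contig2": "TTTTCCCCAA", "contig3": "GGGGAAAATT"}
queries = {"read1": "TACGTG", "read2": "CCCCAA", "read3": "AAAATT"}
parts = partition_targets(targets, 2)
indexes = [build_index(p, 5) for p in parts]
routing = dispatch(queries, indexes, 5)
```

    Each partition and its routed queries could then be handed to a separate compute node for alignment, so no node has to hold an index of the full target set.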

  14. VISTA - computational tools for comparative genomics

    Energy Technology Data Exchange (ETDEWEB)

    Frazer, Kelly A.; Pachter, Lior; Poliakov, Alexander; Rubin, Edward M.; Dubchak, Inna

    2004-01-01

    Comparison of DNA sequences from different species is a fundamental method for identifying functional elements in genomes. Here we describe the VISTA family of tools created to assist biologists in carrying out this task. Our first VISTA server at http://www-gsd.lbl.gov/VISTA/ was launched in the summer of 2000 and was designed to align long genomic sequences and visualize these alignments with associated functional annotations. Currently the VISTA site includes multiple comparative genomics tools and provides users with rich capabilities to browse pre-computed whole-genome alignments of large vertebrate genomes and other groups of organisms with VISTA Browser, submit their own sequences of interest to several VISTA servers for various types of comparative analysis, and obtain detailed comparative analysis results for a set of cardiovascular genes. We illustrate capabilities of the VISTA site by the analysis of a 180 kilobase (kb) interval on human chromosome 5 that encodes the kinesin family member 3A (KIF3A) protein.

  15. PSI/TM-Coffee: a web server for fast and accurate multiple sequence alignments of regular and transmembrane proteins using homology extension on reduced databases.

    Science.gov (United States)

    Floden, Evan W; Tommaso, Paolo D; Chatzou, Maria; Magis, Cedrik; Notredame, Cedric; Chang, Jia-Ming

    2016-07-08

    The PSI/TM-Coffee web server performs multiple sequence alignment (MSA) of proteins by combining homology extension with a consistency based alignment approach. Homology extension is performed with Position Specific Iterative (PSI) BLAST searches against a choice of redundant and non-redundant databases. The main novelty of this server is to allow databases of reduced complexity to rapidly perform homology extension. This server also gives the possibility to use transmembrane proteins (TMPs) reference databases to allow even faster homology extension on this important category of proteins. Aside from an MSA, the server also outputs topological prediction of TMPs using the HMMTOP algorithm. Previous benchmarking of the method has shown this approach outperforms the most accurate alignment methods such as MSAProbs, Kalign, PROMALS, MAFFT, ProbCons and PRALINE™. The web server is available at http://tcoffee.crg.cat/tmcoffee. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  16. THE ATLAS INNER DETECTOR TRACK BASED ALIGNMENT

    CERN Document Server

    Marti i Garcia, Salvador; The ATLAS collaboration

    2018-01-01

    The alignment of the ATLAS Inner Detector is performed with a track-based alignment algorithm. Its goal is to provide an accurate description of the detector geometry such that track parameters are accurately determined and free from biases. Its software implementation is modular and configurable, with a clear separation of the alignment algorithm from the detector system specifics and the database handling. The alignment must cope with the rapid movements of the detector as well as with the slow drift of the different mechanical units. Prompt alignment constants are derived for every run at the calibration stage. These sets of constants are then dynamically split from the beginning of the run into many chunks, allowing the tracker geometry to be described as it evolves with time. The alignment of the Inner Detector is validated and improved by studying resonance decays (Z and J/psi to mu+mu-), as well as using information from the calorimeter system with the E/p method with electrons. A detailed study of these res...

  17. A cross-species alignment tool (CAT)

    DEFF Research Database (Denmark)

    Li, Heng; Guan, Liang; Liu, Tao

    2007-01-01

    BACKGROUND: The main two sorts of automatic gene annotation frameworks are ab initio and alignment-based, the latter splitting into two sub-groups. The first group is used for intra-species alignments, among which are successful ones with high specificity and speed. The other group contains more...... sensitive methods which are usually applied in aligning inter-species sequences. RESULTS: Here we present a new algorithm called CAT (for Cross-species Alignment Tool). It is designed to align mRNA sequences to mammalian-sized genomes. CAT is implemented using C scripts and is freely available on the web...... at http://xat.sourceforge.net/. CONCLUSIONS: Examined from different angles, CAT outperforms other extant alignment tools. Tested against all available mouse-human and zebrafish-human orthologs, we demonstrate that CAT combines the specificity and speed of the best intra-species algorithms, like BLAT...

  18. Comparative Genome Viewer

    International Nuclear Information System (INIS)

    Molineris, I.; Sales, G.

    2009-01-01

    The amount of information about genomes, both in the form of complete sequences and annotations, has been exponentially increasing in the last few years. As a result there is the need for tools providing a graphical representation of such information that should be comprehensive and intuitive. Visual representation is especially important in the comparative genomics field since it should provide a combined view of data belonging to different genomes. We believe that existing tools are limited in this respect as they focus on a single genome at a time (conservation histograms) or compress alignment representation to a single dimension. We have therefore developed a web-based tool called Comparative Genome Viewer (Cgv): it integrates a bidimensional representation of alignments between two regions, both at small and big scales, with the richness of annotations present in other genome browsers. We give access to our system through a web-based interface that provides the user with an interactive representation that can be updated in real time using the mouse to move from region to region and to zoom in on interesting details.

  19. Archived neonatal dried blood spot samples can be used for accurate whole genome and exome-targeted next-generation sequencing

    DEFF Research Database (Denmark)

    Hollegaard, Mads Vilhelm; Grauholm, Jonas; Nielsen, Ronni

    2013-01-01

    Dried blood spot samples (DBSS) have been collected and stored for decades as part of newborn screening programmes worldwide. Representing almost an entire population under a certain age and collected with virtually no bias, the Newborn Screening Biobanks are of immense value in medical studies......, for example, to examine the genetics of various disorders. We have previously demonstrated that DNA extracted from a fraction (2×3.2mm discs) of an archived DBSS can be whole genome amplified (wgaDNA) and used for accurate array genotyping. However, until now, it has been uncertain whether wgaDNA from DBSS...... can be used for accurate whole genome sequencing (WGS) and exome sequencing (WES). This study examined two individuals represented by three different types of samples each: whole-blood (reference samples), 3-year-old DBSS spotted with reference material (refDBSS), and 27- to 29-year-old archived...

  20. Performance analysis for W-band antenna alignment using accurate mechanical beam steering

    DEFF Research Database (Denmark)

    Morales Vicente, Alvaro; Rodríguez Páez, Juan Sebastián; Gallardo, Omar

    2017-01-01

    This article presents a study of antenna alignment impact on bit error rate for a wireless link between two directive W-band horn antennas where one of them is mechanically steered by a Stewart platform. Such a technique is applied to find the optimal alignment between transmitter and receiver...... with an accuracy of 18 both in azimuth and elevation angles. The maximum degree of misalignment which can be tolerated is also reported for different values of optical power in the generation of W-band signals by photonic up-conversion. (C) 2017 Wiley Periodicals, Inc....

  1. Leveraging FPGAs for Accelerating Short Read Alignment.

    Science.gov (United States)

    Arram, James; Kaplan, Thomas; Luk, Wayne; Jiang, Peiyong

    2017-01-01

    One of the key challenges facing genomics today is how to efficiently analyze the massive amounts of data produced by next-generation sequencing platforms. With general-purpose computing systems struggling to address this challenge, specialized processors such as the Field-Programmable Gate Array (FPGA) are receiving growing interest. The means by which to leverage this technology for accelerating genomic data analysis is however largely unexplored. In this paper, we present a runtime reconfigurable architecture for accelerating short read alignment using FPGAs. This architecture exploits the reconfigurability of FPGAs to allow the development of fast yet flexible alignment designs. We apply this architecture to develop an alignment design which supports exact and approximate alignment with up to two mismatches. Our design is based on the FM-index, with optimizations to improve the alignment performance. In particular, the n-step FM-index, index oversampling, a seed-and-compare stage, and bi-directional backtracking are included. Our design is implemented and evaluated on a 1U Maxeler MPC-X2000 dataflow node with eight Altera Stratix-V FPGAs. Measurements show that our design is 28 times faster than Bowtie2 running with 16 threads on dual Intel Xeon E5-2640 CPUs, and nine times faster than Soap3-dp running on an NVIDIA Tesla C2070 GPU.
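
    The FM-index search that this design builds on can be illustrated with a minimal software sketch. The code below is a toy, exact-match-only backward search in Python; it does not model the n-step FM-index, index oversampling, mismatch handling, or any FPGA-specific detail from the paper, and the function names `build_fm_index` and `backward_search` are chosen here for illustration only.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, occurrence table) for `text`."""
    text += "$"  # unique terminator; sorts before A/C/G/T in ASCII
    # Naive suffix array construction: fine for a toy example, not for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[c] = number of characters in text strictly smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return sorted start positions of exact matches of `pattern`."""
    lo, hi = 0, len(sa)  # half-open interval of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))
```

    Each pattern character narrows the suffix-array interval with two table lookups, which is why the search cost depends on the read length rather than on the reference length.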

  2. Microbial species delineation using whole genome sequences.

    Science.gov (United States)

    Varghese, Neha J; Mukherjee, Supratim; Ivanova, Natalia; Konstantinidis, Konstantinos T; Mavrommatis, Kostas; Kyrpides, Nikos C; Pati, Amrita

    2015-08-18

    Increased sequencing of microbial genomes has revealed that prevailing prokaryotic species assignments can be inconsistent with whole genome information for a significant number of species. The long-standing need for a systematic and scalable species assignment technique can be met by the genome-wide Average Nucleotide Identity (gANI) metric, which is widely acknowledged as a robust measure of genomic relatedness. In this work, we demonstrate that the combination of gANI and the alignment fraction (AF) between two genomes accurately reflects their genomic relatedness. We introduce an efficient implementation of AF,gANI and discuss its successful application to 86.5M genome pairs between 13,151 prokaryotic genomes assigned to 3032 species. Subsequently, by comparing the genome clusters obtained from complete linkage clustering of these pairs to existing taxonomy, we observed that nearly 18% of all prokaryotic species suffer from anomalies in species definition. Our results can be used to explore central questions such as whether microorganisms form a continuum of genetic diversity or distinct species represented by distinct genetic signatures. We propose that this precise and objective AF,gANI-based species definition: the MiSI (Microbial Species Identifier) method, be used to address previous inconsistencies in species classification and as the primary guide for new taxonomic species assignment, supplemented by the traditional polyphasic approach, as required. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  3. Laser alignment of rotating equipment at PNL

    International Nuclear Information System (INIS)

    Berndt, R.H.

    1994-05-01

    Lateral vibration in direct-drive equipment is usually caused by misalignment. Over the years, because of the need to improve on techniques and ways of working more efficiently, various types of alignment methods have evolved. In the beginning, craftsmen used a straight-edge scale across the coupling with a feeler gauge measuring the misalignment error. This is still preferred today for aligning small couplings. The industry has since decided that alignment of large direct-drive equipment needed a more accurate type of instrumentation. Rim and face is another of the first alignment methods and is used on all sizes of equipment. A disadvantage of the rim and face method is that in most cases the coupling has to be disassembled. This can cause alignment problems when the coupling is reassembled. Also, the rim and face method is not fast enough to work satisfactorily on alignment of thermally hot equipment. Another concern is that the coupling has to be manufactured accurately for correct rim and face readings. Reverse dial alignment is an improvement over the rim and face method, and depending on the operator's experience, this method can be very accurate. A good training program along with field experience will bring the operator to a proper level of proficiency for a successful program. A hand-held computer with reverse dial calculations in memory is a must for job efficiency. An advantage over the rim and face method is that the coupling is not disassembled and remains locked together. Reverse dial instrumentation measures from both shaft center lines, rather than the coupling surface so the machining of the coupling during manufacture is not a major concern

  4. Intramedullary versus extramedullary alignment of the tibial component in the Triathlon knee

    LENUS (Irish Health Repository)

    Cashman, James P

    2011-08-20

    Abstract Background: Long term survivorship in total knee arthroplasty is significantly dependent on prosthesis alignment. Our aim was to determine which alignment guide was more accurate in positioning of the tibial component in total knee arthroplasty. We also aimed to assess whether there was any difference in short term patient outcome. Method: A comparison of intramedullary versus extramedullary alignment jigs was performed. Radiological alignment of tibial components and patient outcomes of 103 Triathlon total knee arthroplasties were analysed. Results: Use of the intramedullary jig was found to be significantly more accurate in determining coronal alignment (p = 0.02) while use of the extramedullary jig was found to give more accurate results in sagittal alignment (p = 0.04). There was no significant difference in WOMAC or SF-36 at six months. Conclusion: Use of an intramedullary jig is preferable for positioning of the tibial component using this knee system.

  5. An improved Hough transform-based fingerprint alignment approach

    CSIR Research Space (South Africa)

    Mlambo, CS

    2014-11-01

    Full Text Available An improved Hough Transform based fingerprint alignment approach is presented, which improves computing time and memory usage with accurate alignment parameter (rotation and translation) results. This is achieved by studying the strengths...

  6. High-throughput sequence alignment using Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Trapnell Cole

    2007-12-01

    Full Text Available Abstract Background: The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results: This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion: MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.

  7. Stochastic sampling of the RNA structural alignment space.

    Science.gov (United States)

    Harmanci, Arif Ozgun; Sharma, Gaurav; Mathews, David H

    2009-07-01

    A novel method is presented for predicting the common secondary structures and alignment of two homologous RNA sequences by sampling the 'structural alignment' space, i.e. the joint space of their alignments and common secondary structures. The structural alignment space is sampled according to a pseudo-Boltzmann distribution based on a pseudo-free energy change that combines base pairing probabilities from a thermodynamic model and alignment probabilities from a hidden Markov model. By virtue of the implicit comparative analysis between the two sequences, the method offers an improvement over single sequence sampling of the Boltzmann ensemble. A cluster analysis shows that the samples obtained from joint sampling of the structural alignment space cluster more closely than samples generated by the single sequence method. On average, the representative (centroid) structure and alignment of the most populated cluster in the sample of structures and alignments generated by joint sampling are more accurate than single sequence sampling and alignment based on sequence alone, respectively. The 'best' centroid structure that is closest to the known structure among all the centroids is, on average, more accurate than structure predictions of other methods. Additionally, cluster analysis identifies, on average, a few clusters, whose centroids can be presented as alternative candidates. The source code for the proposed method can be downloaded at http://rna.urmc.rochester.edu.

  8. R3D Align web server for global nucleotide to nucleotide alignments of RNA 3D structures.

    Science.gov (United States)

    Rahrig, Ryan R; Petrov, Anton I; Leontis, Neocles B; Zirbel, Craig L

    2013-07-01

    The R3D Align web server provides online access to 'RNA 3D Align' (R3D Align), a method for producing accurate nucleotide-level structural alignments of RNA 3D structures. The web server provides a streamlined and intuitive interface, input data validation and output that is more extensive and easier to read and interpret than related servers. The R3D Align web server offers a unique Gallery of Featured Alignments, providing immediate access to pre-computed alignments of large RNA 3D structures, including all ribosomal RNAs, as well as guidance on effective use of the server and interpretation of the output. By accessing the non-redundant lists of RNA 3D structures provided by the Bowling Green State University RNA group, R3D Align connects users to structure files in the same equivalence class and the best-modeled representative structure from each group. The R3D Align web server is freely accessible at http://rna.bgsu.edu/r3dalign/.

  9. Evaluation of alignment marks using ASML ATHENA alignment system in 90nm BEOL process

    CERN Document Server

    Tan Chin Boon; Koh Hui Peng; Koo Chee, Kiong; Siew Yong Kong; Yeo Swee Hock

    2003-01-01

    As the critical dimension (CD) in integrated circuit (IC) devices shrinks, the total overlay budget needs to be more stringent. Typically, the allowable overlay error is 1/3 of the CD in the IC device. In this case, robustness of the alignment mark is critical, as an accurate signal is required by the scanner's alignment system to precisely align a layer of pattern to the previous layer. The alignment issue is more severe in the back-end process, partly due to the influence of Chemical Mechanical Polishing (CMP), which contributes to the asymmetry or total destruction of the alignment marks. Alignment marks on the wafer can be placed along the scribe-line of the IC pattern. The ASML scanner allows this type of wafer alignment using a phase grating mark, known as the Scribe-line Primary Mark (SPM), which can fit into a standard 80um scribe-line. In this paper, we have studied the feasibility of introducing a Narrow SPM (NSPM) to enable a smaller scribe-line. The width of NSPM has been shrunk down to 70% of the SPM and the length remain...

  10. Node fingerprinting: an efficient heuristic for aligning biological networks.

    Science.gov (United States)

    Radu, Alex; Charleston, Michael

    2014-10-01

    With the continuing increase in availability of biological data and improvements to biological models, biological network analysis has become a promising area of research. An emerging technique for the analysis of biological networks is through network alignment. Network alignment has been used to calculate genetic distance, similarities between regulatory structures, and the effect of external forces on gene expression, and to depict conditional activity of expression modules in cancer. Network alignment is algorithmically complex, and therefore we must rely on heuristics, ideally as efficient and accurate as possible. The majority of current techniques for network alignment rely on precomputed information, such as with protein sequence alignment, or on tunable network alignment parameters, which may introduce an increased computational overhead. Our presented algorithm, which we call Node Fingerprinting (NF), is appropriate for performing global pairwise network alignment without precomputation or tuning, can be fully parallelized, and is able to quickly compute an accurate alignment between two biological networks. It has performed as well as or better than existing algorithms on biological and simulated data, and with fewer computational resources. The algorithmic validation performed demonstrates the low computational resource requirements of NF.

  11. Hal: an automated pipeline for phylogenetic analyses of genomic data.

    Science.gov (United States)

    Robbertse, Barbara; Yoder, Ryan J; Boyd, Alex; Reeves, John; Spatafora, Joseph W

    2011-02-07

    The rapid increase in genomic and genome-scale data is resulting in unprecedented levels of discrete sequence data available for phylogenetic analyses. Major analytical impasses exist, however, prior to analyzing these data with existing phylogenetic software. Obstacles include the management of large data sets without standardized naming conventions, identification and filtering of orthologous clusters of proteins or genes, and the assembly of alignments of orthologous sequence data into individual and concatenated super alignments. Here we report the production of an automated pipeline, Hal, that produces multiple alignments and trees from genomic data. These alignments can be produced by a choice of four alignment programs and analyzed by a variety of phylogenetic programs. In short, the Hal pipeline connects the programs BLASTP, MCL, user-specified alignment programs, GBlocks, ProtTest and user-specified phylogenetic programs to produce species trees. The script is available at SourceForge (http://sourceforge.net/projects/bio-hal/). The results from an example analysis of Kingdom Fungi are briefly discussed.

  12. The Saccharomyces Genome Database Variant Viewer.

    Science.gov (United States)

    Sheppard, Travis K; Hitz, Benjamin C; Engel, Stacia R; Song, Giltae; Balakrishnan, Rama; Binkley, Gail; Costanzo, Maria C; Dalusag, Kyla S; Demeter, Janos; Hellerstedt, Sage T; Karra, Kalpana; Nash, Robert S; Paskov, Kelley M; Skrzypek, Marek S; Weng, Shuai; Wong, Edith D; Cherry, J Michael

    2016-01-04

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source web application for the visualization of genomic and proteomic differences. Multiple sequence alignments have been constructed across high quality genome sequences from 11 different S. cerevisiae strains and stored in the SGD. The alignments and summaries are encoded in JSON and used to create a two-tiered dynamic view of the budding yeast pan-genome, available at http://www.yeastgenome.org/variant-viewer. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. Efficient alignment-free DNA barcode analytics.

    Science.gov (United States)

    Kuksa, Pavel; Pavlovic, Vladimir

    2009-11-10

    In this work we consider barcode DNA analysis problems and address them using alternative, alignment-free methods and representations which model sequences as collections of short sequence fragments (features). The methods use fixed-length representations (spectrum) for barcode sequences to measure similarities or dissimilarities between sequences coming from the same or different species. The spectrum-based representation not only allows for accurate and computationally efficient species classification, but also opens the possibility of accurate clustering analysis of putative species barcodes and identification of critical within-barcode loci distinguishing barcodes of different sample groups. New alignment-free methods provide highly accurate and fast DNA barcode-based identification and classification of species with substantial improvements in accuracy and speed over state-of-the-art barcode analysis methods. We evaluate our methods on problems of species classification and identification using barcodes, important and relevant analytical tasks in many practical applications (adverse species movement monitoring, sampling surveys for unknown or pathogenic species identification, biodiversity assessment, etc.). On several benchmark barcode datasets, including ACG, Astraptes, Hesperiidae, Fish larvae, and Birds of North America, the proposed alignment-free methods considerably improve prediction accuracy compared to prior results. We also observe significant running time improvements over the state-of-the-art methods. Our results show that newly developed alignment-free methods for DNA barcoding can efficiently and with high accuracy identify specimens by examining only a few barcode features, resulting in increased scalability and interpretability of current computational approaches to barcoding.
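
    A spectrum representation of this kind can be sketched in a few lines. The example below counts overlapping k-mers and compares two sequences by the cosine of their count vectors; the choice of k = 3 and of cosine similarity are assumptions made here for illustration, since the abstract does not specify the kernel or feature length the authors used.

```python
from collections import Counter
from math import sqrt


def kmer_spectrum(seq, k=3):
    """Count every overlapping k-mer in `seq` (the sequence's 'spectrum')."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_similarity(a, b, k=3):
    """Cosine similarity between the k-mer spectra of two sequences."""
    sa, sb = kmer_spectrum(a, k), kmer_spectrum(b, k)
    # Only k-mers present in both spectra contribute to the dot product.
    dot = sum(sa[m] * sb[m] for m in sa.keys() & sb.keys())
    norm = sqrt(sum(v * v for v in sa.values())) * sqrt(sum(v * v for v in sb.values()))
    return dot / norm if norm else 0.0
```

    Because the spectrum has a fixed set of possible features regardless of sequence length, sequences never need to be aligned before they are compared.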

  14. MSuPDA: A Memory Efficient Algorithm for Sequence Alignment.

    Science.gov (United States)

    Khan, Mohammad Ibrahim; Kamal, Md Sarwar; Chowdhury, Linkon

    2016-03-01

    Space complexity is a million-dollar question in DNA sequence alignment. In this regard, memory saving under pushdown automata can help to reduce the space occupied in computer memory. In our proposed process, an anchor seed (AS) is selected from a given data set of nucleotide base pairs for local sequence alignment. Quick splitting techniques separate the AS from all the DNA genome segments. The selected AS is placed in the input unit of the pushdown automaton (PDA), and the whole DNA genome segments are placed on the PDA's stack. The AS from the input unit is matched against the DNA genome segments from the stack of the PDA. Matches, mismatches and indels of nucleotides are popped from the stack under the control unit of the pushdown automaton. Each POP operation on the stack frees the memory cell occupied by the nucleotide base pair.

  15. Genome-level homology and phylogeny of Shewanella (Gammaproteobacteria: Alteromonadales: Shewanellaceae)

    Directory of Open Access Journals (Sweden)

    Dikow Rebecca B

    2011-05-01

    Full Text Available Abstract Background: The explosion in availability of whole genome data provides the opportunity to build phylogenetic hypotheses based on these data as well as the ability to learn more about the genomes themselves. The biological history of genes and genomes can be investigated based on the taxonomic history provided by the phylogeny. A phylogenetic hypothesis based on complete genome data is presented for the genus Shewanella (Gammaproteobacteria: Alteromonadales: Shewanellaceae). Nineteen taxa from Shewanella (16 species and 3 additional strains of one species) as well as three outgroup species representing the genera Aeromonas (Gammaproteobacteria: Aeromonadales: Aeromonadaceae), Alteromonas (Gammaproteobacteria: Alteromonadales: Alteromonadaceae) and Colwellia (Gammaproteobacteria: Alteromonadales: Colwelliaceae) are included for a total of 22 taxa. Results: Putatively homologous regions were found across unannotated genomes and tested with a phylogenetic analysis. Two genome-wide data-sets are considered, one including only those genomic regions for which all taxa are represented, which included 3,361,015 aligned nucleotide base-pairs (bp), and a second that additionally includes those regions present in only subsets of taxa, which totaled 12,456,624 aligned bp. Alignment columns in these large data-sets were then randomly sampled to create smaller data-sets. After the phylogenetic hypothesis was generated, genome annotations were projected onto the DNA sequence alignment to compare the historical hypothesis generated by the phylogeny with the functional hypothesis posited by annotation. Conclusions: Individual phylogenetic analyses of the 243 locally co-linear genome regions all failed to recover the genome topology, but the smaller data-sets that were random samplings of the large concatenated alignments all produced the genome topology. It is shown that there is not a single orthologous copy of 16S rRNA across the taxon sampling included in this
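
    The column-sampling step mentioned in the Results can be illustrated with a short sketch. It assumes the alignment is held as a dictionary of equal-length strings and that columns are drawn without replacement; neither detail is stated in the abstract, and the function name `sample_columns` is hypothetical.

```python
import random


def sample_columns(alignment, n_columns, seed=0):
    """Return a smaller alignment built from randomly chosen columns.

    `alignment` maps taxon name -> aligned sequence; all sequences must
    have the same length. Columns are drawn without replacement (an
    assumption) and kept in their original left-to-right order.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    (length,) = lengths
    rng = random.Random(seed)  # seeded so the sub-sample is reproducible
    cols = sorted(rng.sample(range(length), n_columns))
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}
```

    The same column indices are applied to every taxon, so each sampled column remains a valid homology statement across the whole data-set.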

  16. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    Directory of Open Access Journals (Sweden)

    Kaufmann Michael

    2004-09-01

    Full Text Available Abstract Background: Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results: Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) Pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic that splits sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. Conclusions: By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.
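
    Strategy (a) relies on the independence of the pairwise alignments, which can be sketched as a simple map over all sequence pairs. The code below is not DIALIGN: `toy_pair_score` is a hypothetical stand-in that merely counts matching positions, and a thread pool is used only to keep the example self-contained. For genuinely CPU-bound alignments in Python, a process pool would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_pair_score(pair):
    """Hypothetical stand-in for one pairwise alignment: count equal positions."""
    (name_a, seq_a), (name_b, seq_b) = pair
    score = sum(x == y for x, y in zip(seq_a, seq_b))
    return (name_a, name_b), score


def all_pairs_parallel(sequences, workers=4):
    """Score every unordered pair of sequences, one task per pair."""
    # Sorting makes the pair keys deterministic: (smaller name, larger name).
    pairs = list(combinations(sorted(sequences.items()), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Tasks share no state, so the result is the same as a serial loop.
        return dict(pool.map(toy_pair_score, pairs))
```

    With n sequences there are n(n-1)/2 independent tasks, so the pairwise stage scales with the number of workers until the later, sequential stages dominate.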

  17. Tidal alignment of galaxies

    Energy Technology Data Exchange (ETDEWEB)

    Blazek, Jonathan; Vlah, Zvonimir; Seljak, Uroš

    2015-08-01

    We develop an analytic model for galaxy intrinsic alignments (IA) based on the theory of tidal alignment. We calculate all relevant nonlinear corrections at one-loop order, including effects from nonlinear density evolution, galaxy biasing, and source density weighting. Contributions from density weighting are found to be particularly important and lead to bias dependence of the IA amplitude, even on large scales. This effect may be responsible for much of the luminosity dependence in IA observations. The increase in IA amplitude for more highly biased galaxies reflects their locations in regions with large tidal fields. We also consider the impact of smoothing the tidal field on halo scales. We compare the performance of this consistent nonlinear model in describing the observed alignment of luminous red galaxies with the linear model as well as the frequently used "nonlinear alignment model," finding a significant improvement on small and intermediate scales. We also show that the cross-correlation between density and IA (the "GI" term) can be effectively separated into source alignment and source clustering, and we accurately model the observed alignment down to the one-halo regime using the tidal field from the fully nonlinear halo-matter cross correlation. Inside the one-halo regime, the average alignment of galaxies with density tracers no longer follows the tidal alignment prediction, likely reflecting nonlinear processes that must be considered when modeling IA on these scales. Finally, we discuss tidal alignment in the context of cosmic shear measurements.

  18. ENCODE whole-genome data in the UCSC genome browser (2011 update).

    Science.gov (United States)

    Raney, Brian J; Cline, Melissa S; Rosenbloom, Kate R; Dreszer, Timothy R; Learned, Katrina; Barber, Galt P; Meyer, Laurence R; Sloan, Cricket A; Malladi, Venkat S; Roskin, Krishna M; Suh, Bernard B; Hinrichs, Angie S; Clawson, Hiram; Zweig, Ann S; Kirkup, Vanessa; Fujita, Pauline A; Rhead, Brooke; Smith, Kayla E; Pohl, Andy; Kuhn, Robert M; Karolchik, Donna; Haussler, David; Kent, W James

    2011-01-01

    The ENCODE project is an international consortium with a goal of cataloguing all the functional elements in the human genome. The ENCODE Data Coordination Center (DCC) at the University of California, Santa Cruz serves as the central repository for ENCODE data. In this role, the DCC offers a collection of high-throughput, genome-wide data generated with technologies such as ChIP-Seq, RNA-Seq, DNA digestion and others. This data helps illuminate transcription factor-binding sites, histone marks, chromatin accessibility, DNA methylation, RNA expression, RNA binding and other cell-state indicators. It includes sequences with quality scores, alignments, signals calculated from the alignments, and in most cases, element or peak calls calculated from the signal data. Each data set is available for visualization and download via the UCSC Genome Browser (http://genome.ucsc.edu/). ENCODE data can also be retrieved using a metadata system that captures the experimental parameters of each assay. The ENCODE web portal at UCSC (http://encodeproject.org/) provides information about the ENCODE data and links for access.

  19. A fast cross-validation method for alignment of electron tomography images based on Beer-Lambert law

    Science.gov (United States)

    Yan, Rui; Edwards, Thomas J.; Pankratz, Logan M.; Kuhn, Richard J.; Lanman, Jason K.; Liu, Jun; Jiang, Wen

    2015-01-01

    In electron tomography, accurate alignment of tilt series is an essential step in attaining high-resolution 3D reconstructions. Nevertheless, quantitative assessment of alignment quality has remained a challenging issue, even though many alignment methods have been reported. Here, we report a fast and accurate method, tomoAlignEval, based on the Beer-Lambert law, for the evaluation of alignment quality. Our method is able to globally estimate the alignment accuracy by measuring the goodness of log-linear relationship of the beam intensity attenuations at different tilt angles. Extensive tests with experimental data demonstrated its robust performance with stained and cryo samples. Our method is not only significantly faster but also more sensitive than measurements of tomogram resolution using Fourier shell correlation method (FSCe/o). From these tests, we also conclude that while current alignment methods are sufficiently accurate for stained samples, inaccurate alignments remain a major limitation for high resolution cryo-electron tomography. PMID:26455556
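
    The log-linear relationship can be made concrete with a small sketch. It assumes a flat specimen whose beam path length grows as 1/cos(tilt), so that the Beer-Lambert law predicts ln(I) to be linear in 1/cos(tilt), and it uses the R² of an ordinary least-squares line as the goodness measure. This is an illustration of the idea only, not the tomoAlignEval implementation, and the function name `log_linear_r2` is hypothetical.

```python
import math


def log_linear_r2(tilt_deg, intensities):
    """Fit ln(I) against 1/cos(tilt) and return (slope, R^2).

    Assumes a slab-like specimen, so path length scales as 1/cos(tilt).
    """
    xs = [1.0 / math.cos(math.radians(t)) for t in tilt_deg]
    ys = [math.log(i) for i in intensities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, r2
```

    On ideal data the fit is perfect; deviations from the line lower R², which is the sense in which a goodness-of-fit value can serve as a global quality score.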

  20. A fast cross-validation method for alignment of electron tomography images based on Beer-Lambert law.

    Science.gov (United States)

    Yan, Rui; Edwards, Thomas J; Pankratz, Logan M; Kuhn, Richard J; Lanman, Jason K; Liu, Jun; Jiang, Wen

    2015-11-01

    In electron tomography, accurate alignment of tilt series is an essential step in attaining high-resolution 3D reconstructions. Nevertheless, quantitative assessment of alignment quality has remained a challenging issue, even though many alignment methods have been reported. Here, we report a fast and accurate method, tomoAlignEval, based on the Beer-Lambert law, for the evaluation of alignment quality. Our method is able to globally estimate the alignment accuracy by measuring the goodness of log-linear relationship of the beam intensity attenuations at different tilt angles. Extensive tests with experimental data demonstrated its robust performance with stained and cryo samples. Our method is not only significantly faster but also more sensitive than measurements of tomogram resolution using Fourier shell correlation method (FSCe/o). From these tests, we also conclude that while current alignment methods are sufficiently accurate for stained samples, inaccurate alignments remain a major limitation for high resolution cryo-electron tomography. Copyright © 2015 Elsevier Inc. All rights reserved.

  1. SNP-associations and phenotype predictions from hundreds of microbial genomes without genome alignments.

    Science.gov (United States)

    Hall, Barry G

    2014-01-01

    SNP-association studies are a starting point for identifying genes that may be responsible for specific phenotypes, such as disease traits. The vast bulk of tools for SNP-association studies are directed toward SNPs in the human genome, and I am unaware of any tools designed specifically for such studies in bacterial or viral genomes. The PPFS (Predict Phenotypes From SNPs) package described here is an add-on to kSNP, a program that can identify SNPs in a data set of hundreds of microbial genomes. PPFS identifies those SNPs that are non-randomly associated with a phenotype based on the χ² probability, then uses those diagnostic SNPs for two distinct, but related, purposes: (1) to predict the phenotypes of strains whose phenotypes are unknown, and (2) to identify those diagnostic SNPs that are most likely to be causally related to the phenotype. In the example illustrated here, from a set of 68 E. coli genomes, for 67 of which the pathogenicity phenotype was known, there were 418,500 SNPs. Using the phenotypes of 36 of those strains, PPFS identified 207 diagnostic SNPs. The diagnostic SNPs predicted the phenotypes of all of the genomes with 97% accuracy. It then identified 97 SNPs whose probability of being causally related to the pathogenic phenotype was >0.999. In a second example, from a set of 116 E. coli genome sequences, using the phenotypes of 65 strains PPFS identified 101 SNPs that predicted the source host (human or non-human) with 90% accuracy.
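    As a rough illustration of a χ² test of association between one SNP and a binary phenotype, the sketch below computes the Pearson statistic for a 2x2 table of allele by phenotype counts and converts it to a probability for one degree of freedom. The counts are hypothetical and the function names are not part of PPFS or kSNP; how PPFS actually builds its tables is not stated in the abstract.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no association can be tested
    return n * (a * d - b * c) ** 2 / denom

def chi_square_p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts: rows are alleles, columns are pathogenic / non-pathogenic.
stat = chi_square_2x2(18, 2, 3, 13)
p_value = chi_square_p_value_1df(stat)
```

    A SNP with a small p-value would be a candidate diagnostic SNP in this simplified view; with many thousands of SNPs, some correction for multiple testing would normally be needed as well.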

  2. Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment.

    Science.gov (United States)

    Nagar, Anurag; Hahsler, Michael

    2013-01-01

    Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to
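    The idea of comparing segments through p-mer frequency counts, rather than through an alignment, can be sketched as follows. The choice of p = 3 and of the Manhattan distance between count profiles is an assumption made for illustration; the abstract does not say which distance the authors use.

```python
from collections import Counter

def pmer_counts(seq, p=3):
    """Count every overlapping p-mer in a sequence segment."""
    return Counter(seq[i:i + p] for i in range(len(seq) - p + 1))

def pmer_distance(a, b, p=3):
    """Manhattan distance between the p-mer count profiles of two segments."""
    ca, cb = pmer_counts(a, p), pmer_counts(b, p)
    return sum(abs(ca[k] - cb[k]) for k in set(ca) | set(cb))

segment = "ACGTACGTAC"
one_substitution = "ACGTTCGTAC"   # differs from `segment` at a single base
unrelated = "GGGGGGGGGG"

near = pmer_distance(segment, one_substitution)
far = pmer_distance(segment, unrelated)
```

    A single substitution changes at most p overlapping p-mers, so the profile distance stays small for near-identical segments, and computing it takes time linear in the segment length.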

  3. CMASA: an accurate algorithm for detecting local protein structural similarity and its application to enzyme catalytic site annotation

    Directory of Open Access Journals (Sweden)

    Li Gong-Hua

    2010-08-01

    Background: The rapid development of structural genomics has resulted in many "unknown function" proteins being deposited in the Protein Data Bank (PDB); thus, the functional prediction of these proteins has become a challenge for structural bioinformatics. Several sequence-based and structure-based methods have been developed to predict protein function, but these methods need further improvement in accuracy, sensitivity, and computational speed. Here, an accurate algorithm, CMASA (Contact MAtrix based local Structural Alignment algorithm), has been developed to predict unknown functions of proteins based on local protein structural similarity. The algorithm was evaluated on a test set of 164 enzyme families and compared to other methods. Results: The evaluation shows that CMASA is highly accurate (0.96), sensitive (0.86), and fast enough to be used in large-scale functional annotation. Compared to both sequence-based and global structure-based methods, CMASA can find not only remote homologous proteins but also active site convergence. Compared to other local structure comparison-based methods, CMASA performs better than both FFF (a method using geometry to predict protein function) and SPASM (a local structure alignment method); it is more sensitive than PINTS and more accurate than JESS (both are local structure alignment methods). CMASA was applied to annotate the enzyme catalytic sites of the non-redundant PDB, and at least 166 putative catalytic sites have been suggested; these sites cannot be observed by the Catalytic Site Atlas (CSA). Conclusions: CMASA is an accurate algorithm for detecting local protein structural similarity, and it holds several advantages in predicting enzyme active sites. CMASA can be used in large-scale enzyme active site annotation.
The CMASA can be available by the

  4. Inversion variants in human and primate genomes.

    Science.gov (United States)

    Catacchio, Claudia Rita; Maggiolini, Flavia Angela Maria; D'Addabbo, Pietro; Bitonto, Miriana; Capozzi, Oronzo; Signorile, Martina Lepore; Miroballo, Mattia; Archidiacono, Nicoletta; Eichler, Evan E; Ventura, Mario; Antonacci, Francesca

    2018-05-18

    For many years, inversions have been proposed to be a direct driving force in speciation since they suppress recombination when heterozygous. Inversions are the most common large-scale differences among humans and great apes. Nevertheless, they represent large events easily distinguishable by classical cytogenetics, whose resolution, however, is limited. Here, we performed a genome-wide comparison between human, great ape, and macaque genomes using the net alignments for the most recent releases of genome assemblies. We identified a total of 156 putative inversions, between 103 kb and 91 Mb, corresponding to 136 human loci. Combining literature, sequence, and experimental analyses, we analyzed 109 of these loci and found 67 regions inverted in one or multiple primates, including 28 newly identified inversions. These events overlap with 81 human genes at their breakpoints, and seven correspond to sites of recurrent rearrangements associated with human disease. This work doubles the number of validated primate inversions larger than 100 kb, beyond what was previously documented. We identified 74 sites of errors, where the sequence has been assembled in the wrong orientation, in the reference genomes analyzed. Our data serve two purposes: First, we generated a map of evolutionary inversions in these genomes representing a resource for interrogating differences among these species at a functional level; second, we provide a list of misassembled regions in these primate genomes, involving over 300 Mb of DNA and 1978 human genes. Accurately annotating these regions in the genome references has immediate applications for evolutionary and biomedical studies on primates. © 2018 Catacchio et al.; Published by Cold Spring Harbor Laboratory Press.

  5. Research on localization and alignment technology for transfer cask

    Energy Technology Data Exchange (ETDEWEB)

    Wang, Jingchuan, E-mail: jchwang@sjtu.edu.cn [Department of Automation, Shanghai Jiao Tong University, Shanghai (China); Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai (China); Yang, Ming; Chen, Weidong [Department of Automation, Shanghai Jiao Tong University, Shanghai (China); Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai (China)

    2015-10-15

    Highlights: • A method for the alignment between TB and HCB based on localizability is proposed. • A localization method based on localizability estimation is proposed to localize the cask accurately and to ensure the transfer cask's accurate docking in front of the window of the Tokamak Building. • The experimental results show that the proposed algorithm works well in the indoor simulation environment. This system will be tested in EAST in China. Abstract: Because the transfer cask is long relative to the space available between the Tokamak Building (TB) and the Hot Cell Building (HCB), this paper proposes an autonomous localization and alignment method for the transportation and replacement of internal components. A localization method based on localizability estimation is used to localize and navigate the cask accurately. Once the cask arrives in front of the TB window, the position and attitude measurement system is used to detect, in real time, the relative alignment error between the seal door of the pallet and the window of the TB. The seal door and the TB window can then be aligned based on this offset. A simulation experiment based on a real model is designed according to the actual TB situation. The experimental results show that the proposed localization and alignment method can be used for the transfer cask.

  6. Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.

    Directory of Open Access Journals (Sweden)

    Arthur W Pightling

    The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: (i) depth of sequencing coverage, (ii) choice of reference-guided short-read sequence assembler, (iii) choice of reference genome, and (iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers

  7. Prediction of molecular alignment of nucleic acids in aligned media

    International Nuclear Information System (INIS)

    Wu Bin; Petersen, Michael; Girard, Frederic; Tessari, Marco; Wijmenga, Sybren S.

    2006-01-01

    We demonstrate - using the database of all deposited DNA and RNA structures aligned in Pf1-medium and RDC refined - that for nucleic acids in a Pf1-medium the electrostatic alignment tensor can be predicted reliably and accurately via a simple and fast calculation based on the gyration tensor spanned by the phosphodiester atoms. The rhombicity is well predicted over its full range from 0 to 0.66, while the alignment tensor orientation is predicted correctly for rhombicities up to ca. 0.4; for larger rhombicities it appears to deviate somewhat more than expected based on structural noise and measurement error. This simple analytical approach is based on the Debye-Hückel approximation for the electrostatic interaction potential, valid at distances sufficiently far away from a poly-ionic charged surface, a condition naturally enforced when the charges of alignment medium and solute are of equal sign, as for nucleic acids in a Pf1-phage medium. For the usual salt strengths and nucleic acid sizes, the Debye-Hückel screening length is smaller than the nucleic acid size, but large enough for the collective of Debye-Hückel spheres to encompass the whole molecule. The molecular alignment is then purely electrostatic, but its functional form is under these conditions similar to that for steric alignment. The proposed analytical expression allows for very fast calculation of the alignment tensor and hence RDCs from the conformation of the nucleic acid molecule. This information provides opportunities for improved structure determination of nucleic acids, including better assessment of dynamics in (multi-domain) nucleic acids and the possibility to incorporate alignment tensor prediction from shape directly into the structure calculation process. The procedures are incorporated into MATLAB scripts, which are available on request.

  8. Evolutionary rates at codon sites may be used to align sequences and infer protein domain function

    Directory of Open Access Journals (Sweden)

    Hazelhurst Scott

    2010-03-01

    Background: Sequence alignments form part of many investigations in molecular biology, including the determination of phylogenetic relationships, the prediction of protein structure and function, and the measurement of evolutionary rates. However, to obtain meaningful results, a significant degree of sequence similarity is required to ensure that the alignments are accurate and the inferences correct. Limitations arise when sequence similarity is low, which is particularly problematic when working with fast-evolving genes, evolutionarily distant taxa, genomes with nucleotide biases, and cases of convergent evolution. Results: A novel approach was conceptualized to address the "low sequence similarity" alignment problem. We developed an alignment algorithm termed FIRE (Functional Inference using the Rates of Evolution), which aligns sequences using the evolutionary rate at codon sites, as measured by the dN/dS ratio, rather than nucleotide or amino acid residues. FIRE was used to test the hypotheses that evolutionary rates can be used to align sequences and that the alignments may be used to infer protein domain function. Using a range of test data, we found that aligning domains based on evolutionary rates was possible even when sequence similarity was very low (for example, antibody variable regions). Furthermore, the alignment has the potential to infer protein domain function, indicating that domains with similar functions are subject to similar evolutionary constraints. These data suggest that an evolutionary rate-based approach to sequence analysis (particularly when combined with structural data) may be used to study cases of convergent evolution or when sequences have very low similarity. 
However, when aligning homologous gene sets with sequence similarity, FIRE did not perform as well as the best traditional alignment algorithms indicating that the conventional approach of aligning residues as opposed to evolutionary rates remains the

  9. PATtyFams: Protein families for the microbial genomes in the PATRIC database

    Directory of Open Access Journals (Sweden)

    James J Davis

    2016-02-01

    The ability to build accurate protein families is a fundamental operation in bioinformatics that influences comparative analyses, genome annotation and metabolic modeling. For several years we have been maintaining protein families for all microbial genomes in the PATRIC database (Pathosystems Resource Integration Center, patricbrc.org) in order to drive many of the comparative analysis tools that are available through the PATRIC website. However, due to the burgeoning number of genomes, traditional approaches for generating protein families are becoming prohibitive. In this report, we describe a new approach for generating protein families, which we call PATtyFams. This method uses the k-mer-based function assignments available through RAST (Rapid Annotation using Subsystem Technology) to rapidly guide family formation, and then differentiates the function-based groups into families using a Markov Cluster algorithm (MCL). This new approach for generating protein families is rapid, scalable and has properties that are consistent with alignment-based methods.

  10. Phylogeny Reconstruction with Alignment-Free Method That Corrects for Horizontal Gene Transfer.

    Directory of Open Access Journals (Sweden)

    Raquel Bromberg

    2016-06-01

    Advances in sequencing have generated a large number of complete genomes. Traditionally, phylogenetic analysis relies on alignments of orthologs, but defining orthologs and separating them from paralogs is a complex task that may not always be suited to the large datasets of the future. An alternative to traditional, alignment-based approaches are whole-genome, alignment-free methods. These methods are scalable and require minimal manual intervention. We developed SlopeTree, a new alignment-free method that estimates evolutionary distances by measuring the decay of exact substring matches as a function of match length. SlopeTree corrects for horizontal gene transfer, for composition variation and low complexity sequences, and for branch-length nonlinearity caused by multiple mutations at the same site. We tested SlopeTree on 495 bacteria, 73 archaea, and 72 strains of Escherichia coli and Shigella. We compared our trees to the NCBI taxonomy, to trees based on concatenated alignments, and to trees produced by other alignment-free methods. The results were consistent with current knowledge about prokaryotic evolution. We assessed differences in tree topology over different methods and settings and found that the majority of bacteria and archaea have a core set of proteins that evolves by descent. In trees built from complete genomes rather than sets of core genes, we observed some grouping by phenotype rather than phylogeny, for instance with a cluster of sulfur-reducing thermophilic bacteria coming together irrespective of their phyla. The source-code for SlopeTree is available at: http://prodata.swmed.edu/download/pub/slopetree_v1/slopetree.tar.gz.

  11. Phylogeny Reconstruction with Alignment-Free Method That Corrects for Horizontal Gene Transfer

    Science.gov (United States)

    Grishin, Nick V.; Otwinowski, Zbyszek

    2016-01-01

    Advances in sequencing have generated a large number of complete genomes. Traditionally, phylogenetic analysis relies on alignments of orthologs, but defining orthologs and separating them from paralogs is a complex task that may not always be suited to the large datasets of the future. An alternative to traditional, alignment-based approaches are whole-genome, alignment-free methods. These methods are scalable and require minimal manual intervention. We developed SlopeTree, a new alignment-free method that estimates evolutionary distances by measuring the decay of exact substring matches as a function of match length. SlopeTree corrects for horizontal gene transfer, for composition variation and low complexity sequences, and for branch-length nonlinearity caused by multiple mutations at the same site. We tested SlopeTree on 495 bacteria, 73 archaea, and 72 strains of Escherichia coli and Shigella. We compared our trees to the NCBI taxonomy, to trees based on concatenated alignments, and to trees produced by other alignment-free methods. The results were consistent with current knowledge about prokaryotic evolution. We assessed differences in tree topology over different methods and settings and found that the majority of bacteria and archaea have a core set of proteins that evolves by descent. In trees built from complete genomes rather than sets of core genes, we observed some grouping by phenotype rather than phylogeny, for instance with a cluster of sulfur-reducing thermophilic bacteria coming together irrespective of their phyla. The source-code for SlopeTree is available at: http://prodata.swmed.edu/download/pub/slopetree_v1/slopetree.tar.gz. PMID:27336403

  12. Accurate typing of short tandem repeats from genome-wide sequencing data and its applications.

    Science.gov (United States)

    Fungtammasan, Arkarachai; Ananda, Guruprasad; Hile, Suzanne E; Su, Marcia Shu-Wei; Sun, Chen; Harris, Robert; Medvedev, Paul; Eckert, Kristin; Makova, Kateryna D

    2015-05-01

    Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution. © 2015 Fungtammasan et al.; Published by Cold Spring Harbor Laboratory Press.

  13. A universal genomic coordinate translator for comparative genomics.

    Science.gov (United States)

    Zamani, Neda; Sundström, Görel; Meadows, Jennifer R S; Höppner, Marc P; Dainat, Jacques; Lantz, Henrik; Haas, Brian J; Grabherr, Manfred G

    2014-06-30

    Genomic duplications constitute major events in the evolution of species, allowing paralogous copies of genes to take on fine-tuned biological roles. Unambiguously identifying the orthology relationship between copies across multiple genomes can be resolved by synteny, i.e. the conserved order of genomic sequences. However, a comprehensive analysis of duplication events and their contributions to evolution would require all-to-all genome alignments, the number of which grows as N² with the number of available genomes, N. Here, we introduce Kraken, software that omits the all-to-all requirement by recursively traversing a graph of pairwise alignments and dynamically re-computing orthology. Kraken scales linearly with the number of targeted genomes, N, which allows for including large numbers of genomes in analyses. We first evaluated the method on the set of 12 Drosophila genomes, finding that orthologous correspondence computed indirectly through a graph of multiple synteny maps comes at minimal cost in terms of sensitivity, but reduces overall computational runtime by an order of magnitude. We then used the method on three well-annotated mammalian genomes, human, mouse, and rat, and show that up to 93% of protein coding transcripts have unambiguous pairwise orthologous relationships across the genomes. On a nucleotide level, 70 to 83% of exons match exactly at both splice junctions, and up to 97% on at least one junction. We last applied Kraken to an RNA-sequencing dataset from multiple vertebrates and diverse tissues, where we confirmed that brain-specific gene family members, i.e. one-to-many or many-to-many homologs, are more highly correlated across species than single-copy (i.e. one-to-one homologous) genes. 
Not limited to protein coding genes, Kraken also identifies thousands of newly identified transcribed loci, likely non-coding RNAs that are consistently transcribed in human, chimpanzee and gorilla, and maintain significant correlation of expression levels across

  14. Optimization of sequence alignment for simple sequence repeat regions

    Directory of Open Access Journals (Sweden)

    Ogbonnaya Francis C

    2011-07-01

    Background: Microsatellites, or simple sequence repeats (SSRs), are tandemly repeated DNA sequences, including tandem copies of specific sequences no longer than six bases, that are distributed in the genome. SSRs have been used as molecular markers because they are easy to detect, and they are used in a range of applications, including genetic diversity, genome mapping, and marker assisted selection. They are also very mutable because of slippage of the DNA polymerase during DNA replication. This unique mutation increases the insertion/deletion (INDEL) mutation frequency to a high ratio, higher than for other types of molecular markers such as single nucleotide polymorphisms (SNPs). SNPs are more frequent than INDELs. Therefore, all designed algorithms for sequence alignment fit the vast majority of the genomic sequence without considering microsatellite regions as unique sequences that require special consideration. The old algorithm is limited in its application because there are many overlaps between different repeat units, which result in false evolutionary relationships. Findings: To overcome the limitation of the aligning algorithm when dealing with SSR loci, a new algorithm was developed using a PERL script with a Tk graphical interface. This program is based on aligning sequences after determining the repeated units first, and the last SSR nucleotides positions. This results in a shifting process according to the inserted repeated unit type. When studying the phylogenetic relations before and after applying the new algorithm, many differences in the trees were obtained by increasing the SSR length and complexity. However, smaller distances between different lineages were observed after applying the new algorithm. Conclusions: The new algorithm produces better estimates for aligning SSR loci because it reflects more reliable evolutionary relations between different lineages. It reduces overlapping during SSR alignment, which results in a more realistic
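    The first step mentioned in this abstract, locating the repeated unit of an SSR and the positions where the repeat run starts and ends, could be sketched with a regular expression as below. This is only an illustration in Python (the authors' program is a PERL script that is not shown here); the function name and the `min_repeats` threshold are hypothetical, and the unit length limit of six bases follows the definition given in the abstract.

```python
import re

def find_ssrs(seq, min_repeats=4):
    """Return (unit, start, end) for tandem repeats of a 1-6 base unit.

    `start` is inclusive and `end` is exclusive (Python slice convention).
    The lazy quantifier makes the shortest repeating unit win.
    """
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    return [(m.group(1), m.start(), m.end()) for m in pattern.finditer(seq)]

# Made-up sequence containing an (AC)5 run and a (GAT)4 run.
sequence = "GGTT" + "AC" * 5 + "TTGC" + "GAT" * 4 + "C"
ssrs = find_ssrs(sequence)
```

    Once the unit and the run boundaries are known, sequences from different samples could be shifted so that whole repeat units line up, which is the kind of unit-aware shifting the abstract describes.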

  15. Solving the shrinkage-induced PDMS alignment registration issue in multilayer soft lithography

    International Nuclear Information System (INIS)

    Moraes, Christopher; Sun, Yu; Simmons, Craig A

    2009-01-01

    Shrinkage of polydimethylsiloxane (PDMS) complicates alignment registration between layers during multilayer soft lithography fabrication. This often hinders the development of large-scale microfabricated arrayed devices. Here we report a rapid method to construct large-area, multilayered devices with stringent alignment requirements. This technique, which exploits a previously unrecognized aspect of sandwich mold fabrication, improves device yield, enables highly accurate alignment over large areas of multilayered devices and does not require strict regulation of fabrication conditions or extensive calibration processes. To demonstrate this technique, a microfabricated Braille display was developed and characterized. High device yield and accurate alignment within 15 µm were achieved over three layers for an array of 108 Braille units spread over a 6.5 cm² area, demonstrating the fabrication of well-aligned devices with greater ease and efficiency than previously possible.

  16. GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping.

    Science.gov (United States)

    Alser, Mohammed; Hassan, Hasan; Xin, Hongyi; Ergin, Oguz; Mutlu, Onur; Alkan, Can

    2017-11-01

    High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments (called short reads) that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and 'candidate' locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms. We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10. Availability: https://github.com/BilkentCompGen/GateKeeper. Contact: mohammedalser@bilkent.edu.tr or onur.mutlu@inf.ethz.ch or calkan@cs.bilkent.edu.tr. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press
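    The general idea of a pre-alignment filter can be sketched in software as follows. This is a simplified shifted-Hamming-style check written for illustration, not GateKeeper's FPGA design and not the published SHD algorithm: it marks a read base as explainable if it matches the reference at any offset within the edit threshold, and rejects the candidate only when more bases than the threshold are unexplainable. The function name and the example sequences are hypothetical.

```python
def passes_prealignment(read, ref, max_edits):
    """Return False only when read and ref cannot align within max_edits edits.

    A read base that is neither substituted nor inserted must match the
    reference at some offset in [-max_edits, +max_edits], so at most
    max_edits read bases can fail to match at every offset.
    """
    unmatched = 0
    for i, base in enumerate(read):
        matched = False
        for shift in range(-max_edits, max_edits + 1):
            j = i + shift
            # Out-of-range reference positions count as matches so that the
            # filter stays conservative at the borders of the segment.
            if j < 0 or j >= len(ref) or base == ref[j]:
                matched = True
                break
        if not matched:
            unmatched += 1
    return unmatched <= max_edits

reference = "ACGTTGCAAGCT"
read_with_deletion = "ACGTTCAAGCT"   # reference with one base deleted
unrelated_read = "AAAAAAAAAA"
unrelated_ref = "CCCCCCCCCC"
```

    Candidates that pass still need full alignment, because the check can let through some locations that exceed the threshold; its purpose is only to discard clearly dissimilar pairs cheaply before the quadratic-time step.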

  17. DNAAlignEditor: DNA alignment editor tool

    Directory of Open Access Journals (Sweden)

    Guill Katherine E

    2008-03-01

    Background: With advances in DNA re-sequencing methods and Next-Generation parallel sequencing approaches, there has been a large increase in genomic efforts to define and analyze the sequence variability present among individuals within a species. For very polymorphic species such as maize, this has led to a need for intuitive, user-friendly software that aids the biologist, often with naïve programming capability, in tracking, editing, displaying, and exporting multiple individual sequence alignments. To fill this need we have developed a novel DNA alignment editor. Results: We have generated a nucleotide sequence alignment editor (DNAAlignEditor) that provides an intuitive, user-friendly interface for manual editing of multiple sequence alignments with functions for input, editing, and output of sequence alignments. The color-coding of nucleotide identity and the display of associated quality scores aid in the manual alignment editing process. DNAAlignEditor works as a client/server tool having two main components: a relational database that collects the processed alignments and a user interface connected to the database through universal data access connectivity drivers. DNAAlignEditor can be used either as a stand-alone application or as a network application with multiple users concurrently connected. Conclusion: We anticipate that this software will be of general interest to biologists and population geneticists in editing DNA sequence alignments and analyzing natural sequence variation regardless of species, and will be particularly useful for manual alignment editing of sequences in species with high levels of polymorphism.

  18. Simulation of beamline alignment operations

    International Nuclear Information System (INIS)

    Annese, C; Miller, M G.

    1999-01-01

    distributions rather than static values. The only way to accurately understand resource utilization and time requirements for a complex industrial application such as alignment, is to utilize simulation tools such as Simprocess to model the system

  19. SFESA: a web server for pairwise alignment refinement by secondary structure shifts.

    Science.gov (United States)

    Tong, Jing; Pei, Jimin; Grishin, Nick V

    2015-09-03

    Protein sequence alignment is essential for a variety of tasks such as homology modeling and active site prediction. Alignment errors remain the main cause of low-quality structure models. A bioinformatics tool to refine alignments is needed to make protein alignments more accurate. We developed the SFESA web server to refine pairwise protein sequence alignments. Compared to the previous version of SFESA, which required a set of 3D coordinates for a protein, the new server will search a sequence database for the closest homolog with an available 3D structure to be used as a template. For each alignment block defined by secondary structure elements in the template, SFESA evaluates alignment variants generated by local shifts and selects the best-scoring alignment variant. A scoring function that combines the sequence score of profile-profile comparison and the structure score of template-derived contact energy is used for evaluation of alignments. PROMALS pairwise alignments refined by SFESA are more accurate than those produced by current advanced alignment methods such as HHpred and CNFpred. In addition, SFESA also improves alignments generated by other software. SFESA is a web-based tool for alignment refinement, designed for researchers to compute, refine, and evaluate pairwise alignments with a combined sequence and structure scoring of alignment blocks. To our knowledge, the SFESA web server is the only tool that refines alignments by evaluating local shifts of secondary structure elements. The SFESA web server is available at http://prodata.swmed.edu/sfesa.

  20. G2S: A web-service for annotating genomic variants on 3D protein structures.

    Science.gov (United States)

    Wang, Juexin; Sheridan, Robert; Sumer, S Onur; Schultz, Nikolaus; Xu, Dong; Gao, Jianjiong

    2018-01-27

    Accurately mapping and annotating genomic locations on 3D protein structures is a key step in structure-based analysis of genomic variants detected by recent large-scale sequencing efforts. There are several mapping resources currently available, but none of them provides a web API (Application Programming Interface) that supports programmatic access. We present G2S, a real-time web API that provides automated mapping of genomic variants on 3D protein structures. G2S can align genomic locations of variants, protein locations, or protein sequences to protein structures and retrieve the mapped residues from structures. The G2S API uses a REST-inspired design and can be used by various clients such as web browsers, command terminals, programming languages and other bioinformatics tools for bringing 3D structures into genomic variant analysis. The webserver and source code are freely available at https://g2s.genomenexus.org. Contact: g2s@genomenexus.org. Supplementary data are available at Bioinformatics online. © The Author (2018). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  1. Enhanced Dynamic Algorithm of Genome Sequence Alignments

    OpenAIRE

    Arabi E. keshk

    2014-01-01

    The merging of biology and computer science has created a new field, computational biology (bioinformatics), that explores the capacity of computers to gain knowledge from biological data. Computational biology is rooted in life sciences as well as computers, information sciences, and technologies. The main problem in computational biology is sequence alignment, which is a way of arranging the sequences of DNA, RNA or protein to identify the region of similarity and relationship between se...

  2. Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space

    Directory of Open Access Journals (Sweden)

    Richard Wilton

    2015-03-01

    Full Text Available. When computing alignments of DNA sequences to a large genome, a key element in achieving high processing throughput is to prioritize locations in the genome where high-scoring mappings might be expected. We formulated this task as a series of list-processing operations that can be efficiently performed on graphics processing unit (GPU) hardware. We followed this approach in implementing a read aligner called Arioc that uses GPU-based parallel sort and reduction techniques to identify high-priority locations where potential alignments may be found. We then carried out a read-by-read comparison of Arioc’s reported alignments with the alignments found by several leading read aligners. With simulated reads, Arioc has comparable or better accuracy than the other read aligners we tested. With human sequencing reads, Arioc demonstrates significantly greater throughput than the other aligners we evaluated across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.
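
The prioritization step described above, counting seed hits per candidate location and then sorting, can be sketched on the CPU as follows. This is only an illustration of the general seed-and-extend idea under simple assumptions (exact k-mer seeds, a hash index, no GPU); it is not Arioc's implementation, and the names `index_seeds` and `prioritize_locations` are hypothetical.

```python
from collections import Counter, defaultdict


def index_seeds(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def prioritize_locations(read: str, index: dict, k: int) -> list:
    """Rank candidate read start positions by the number of supporting seeds.

    Each seed hit votes for the diagonal (reference position minus read
    offset); the votes are reduced per diagonal and then sorted.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            votes[ref_pos - offset] += 1
    # Highest vote count first; ties broken by leftmost position.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))
```

The top-ranked diagonals are the locations where an extension step (for example, a dynamic programming alignment) would be tried first.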

  3. Concurrent and Accurate Short Read Mapping on Multicore Processors.

    Science.gov (United States)

    Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

    2015-01-01

    We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA on RNA reads of 100-400 nucleotides, where it surpasses state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR in execution time/sensitivity.
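
Since the abstract relies on the Smith-Waterman algorithm for reads that the suffix array cannot place, a minimal score-only sketch of that algorithm may help. It assumes a linear gap penalty and illustrative scoring values (match 2, mismatch -1, gap -2) that are not taken from HPG Aligner SA.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b.

    Uses two rows of the dynamic programming matrix, so memory is
    proportional to len(b) while time stays proportional to len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The quadratic cost is the reason aligners of this kind reserve Smith-Waterman for the small fraction of reads that faster index-based mapping leaves unresolved.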

  4. Validation of rice genome sequence by optical mapping

    Directory of Open Access Journals (Sweden)

    Pape Louise

    2007-08-01

    Full Text Available. Background: Rice feeds much of the world, and possesses the simplest genome analyzed to date within the grass family, making it an economically relevant model system for other cereal crops. Although the rice genome is sequenced, validation and gap closing efforts require purely independent means for accurate finishing of sequence build data. Results: To facilitate ongoing sequencing finishing and validation efforts, we have constructed a whole-genome SwaI optical restriction map of the rice genome. The physical map consists of 14 contigs, covering 12 chromosomes, with a total genome size of 382.17 Mb; this value is about 11% smaller than original estimates. Nine of the 14 optical map contigs are without gaps, covering chromosomes 1, 2, 3, 4, 5, 7, 8, 10, and 12 in their entirety – including centromeres and telomeres. Alignments between optical and in silico restriction maps constructed from IRGSP (International Rice Genome Sequencing Project) and TIGR (The Institute for Genomic Research) genome sequence sources are comprehensive and informative, evidenced by map coverage across virtually all published gaps, discovery of new ones, and characterization of sequence misassemblies; all totalling ~14 Mb. Furthermore, since optical maps are ordered restriction maps, identified discordances are pinpointed on a reliable physical scaffold providing an independent resource for closure of gaps and rectification of misassemblies. Conclusion: Analysis of sequence and optical mapping data effectively validates genome sequence assemblies constructed from large, repeat-rich genomes. Given this conclusion we envision new applications of such single molecule analysis that will merge advantages offered by high-resolution optical maps with inexpensive, but short sequence reads generated by emerging sequencing platforms. Lastly, the map construction techniques presented here point the way to new types of comparative genome analysis that would focus on discernment of

  5. Heap: a highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data

    KAUST Repository

    Kobayashi, Masaaki

    2017-04-20

    Recent availability of large-scale genomic resources enables us to conduct so-called genome-wide association studies (GWAS) and genomic prediction (GP) studies, particularly with next-generation sequencing (NGS) data. The effectiveness of GWAS and GP depends not only on their mathematical models, but also on the quality and quantity of the variants employed in the analysis. In NGS single nucleotide polymorphism (SNP) calling, conventional tools ideally require more reads for higher SNP sensitivity and accuracy. In this study, we aimed to develop a tool, Heap, that enables robustly sensitive and accurate calling of SNPs, particularly with low-coverage NGS data, which must be aligned to the reference genome sequences in advance. To reduce false positive SNPs, Heap determines genotypes and calls SNPs at each site except for sites at both ends of reads or sites containing a minor allele supported by only one read. Performance comparison with existing tools showed that Heap achieved the highest F-scores with low coverage (7X) restriction-site associated DNA sequencing reads of sorghum and rice individuals. This will facilitate cost-effective GWAS and GP studies in this NGS era. Code and documentation of Heap are freely available from https://github.com/meiji-bioinf/heap (29 March 2017, date last accessed) and our web site (http://bioinf.mind.meiji.ac.jp/lab/en/tools.html (29 March 2017, date last accessed)).
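
The two exclusion rules named in the abstract (read ends, and minor alleles seen in a single read) can be illustrated with a small sketch. This is one possible reading of those rules, not Heap's code: the function name `filter_site_observations`, the `end_margin` parameter, and the choice to drop individual observations rather than whole sites are all assumptions, and genotype determination is not modelled.

```python
from collections import Counter


def filter_site_observations(observations, end_margin=1):
    """Return allele counts at one site after two conservative filters.

    observations: iterable of (base, offset_in_read, read_length) tuples,
    one per read covering the site (a hypothetical input layout).
    """
    # Filter 1: ignore bases observed within end_margin of either read end.
    counts = Counter(
        base for base, offset, length in observations
        if end_margin <= offset < length - end_margin
    )
    if not counts:
        return {}
    major = counts.most_common(1)[0][0]
    # Filter 2: drop minor alleles that are supported by only one read.
    return {base: n for base, n in counts.items() if base == major or n > 1}
```

A site would then be a SNP candidate only if more than one allele survives, or if the surviving allele differs from the reference base.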

  6. A Simulation Tool for the Study of Symmetric Inversions in Bacterial Genomes

    Science.gov (United States)

    Dias, Ulisses; Dias, Zanoni; Setubal, João C.

    We present the tool SIB that simulates genomic inversions in bacterial chromosomes. The tool simulates symmetric inversions but allows the appearance of nonsymmetric inversions by simulating small syntenic blocks frequently observed on bacterial genome comparisons. We evaluate SIB by comparing its results to real genome alignments. We develop measures that allow quantitative comparisons between real pairwise alignments (in terms of dotplots) and simulated ones. These measures allow an evaluation of SIB in terms of dendrograms. We evaluate SIB by comparing its results to whole chromosome alignments and maximum likelihood trees for three bacterial groups (the Pseudomonadaceae family and the Xanthomonas and Shewanella genera). We demonstrate an application of SIB by using it to evaluate the ancestral genome reconstruction tool MGR.

  7. Bos taurus strain:dairy beef (cattle): 1000 Bull Genomes Run 2, Bovine Whole Genome Sequence

    NARCIS (Netherlands)

    Bouwman, A.C.; Daetwyler, H.D.; Chamberlain, Amanda J.; Ponce, Carla Hurtado; Sargolzaei, Mehdi; Schenkel, Flavio S.; Sahana, Goutam; Govignon-Gion, Armelle; Boitard, Simon; Dolezal, Marlies; Pausch, Hubert; Brøndum, Rasmus F.; Bowman, Phil J.; Thomsen, Bo; Guldbrandtsen, Bernt; Lund, Mogens S.; Servin, Bertrand; Garrick, Dorian J.; Reecy, James M.; Vilkki, Johanna; Bagnato, Alessandro; Wang, Min; Hoff, Jesse L.; Schnabel, Robert D.; Taylor, Jeremy F.; Vinkhuyzen, Anna A.E.; Panitz, Frank; Bendixen, Christian; Holm, Lars-Erik; Gredler, Birgit; Hozé, Chris; Boussaha, Mekki; Sanchez, Marie Pierre; Rocha, Dominique; Capitan, Aurelien; Tribout, Thierry; Barbat, Anne; Croiseau, Pascal; Drögemüller, Cord; Jagannathan, Vidhya; Vander Jagt, Christy; Crowley, John J.; Bieber, Anna; Purfield, Deirdre C.; Berry, Donagh P.; Emmerling, Reiner; Götz, Kay Uwe; Frischknecht, Mirjam; Russ, Ingolf; Sölkner, Johann; Tassell, van Curtis P.; Fries, Ruedi; Stothard, Paul; Veerkamp, R.F.; Boichard, Didier; Goddard, Mike E.; Hayes, Ben J.

    2014-01-01

    Whole genome sequence data (BAM format) of 234 bovine individuals aligned to UMD3.1. The aim of the study was to identify genetic variants (SNPs and indels) for downstream analysis such as imputation, GWAS, and detection of lethal recessives. Additional sequences for later 1000 bull genomes runs can

  8. Hohlraum Target Alignment from X-ray Detector Images using Starburst Design Patterns

    International Nuclear Information System (INIS)

    Leach, R.R.; Conder, A.; Edwards, O.; Kroll, J.; Kozioziemski, B.; Mapoles, E.; McGuigan, D.; Wilhelmsen, K.

    2010-01-01

    National Ignition Facility (NIF) is a high-energy laser facility comprised of 192 laser beams focused with enough power and precision on a hydrogen-filled spherical, cryogenic target to initiate a fusion reaction. The target container, or hohlraum, must be accurately aligned to an x-ray imaging system to allow careful monitoring of the frozen fuel layer in the target. To achieve alignment, x-ray images are acquired through starburst-shaped windows cut into opposite sides of the hohlraum. When the hohlraum is in alignment, the starburst pattern pairs match nearly exactly and allow a clear view of the ice layer formation on the edge of the target capsule. During the alignment process, x-ray image analysis is applied to determine the direction and magnitude of adjustment required. X-ray detector and source are moved in concert during the alignment process. The automated pointing alignment system described here is both accurate and efficient. In this paper, we describe the control and associated image processing that enables automation of the starburst pointing alignment.

  9. Correction of the Caulobacter crescentus NA1000 genome annotation.

    Directory of Open Access Journals (Sweden)

    Bert Ely

    Full Text Available. Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome using the computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks in third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%.
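
The central quantity in this approach, the GC content at third codon positions, is simple to compute for a candidate coding sequence. The sketch below assumes the input is already in frame (its first base is the first base of a codon); the function name `gc3_content` is hypothetical and the code is not taken from Artemis or MICheck.

```python
def gc3_content(cds: str) -> float:
    """Fraction of G or C among the third codon positions of an in-frame CDS."""
    third_positions = cds.upper()[2::3]  # bases at offsets 2, 5, 8, ...
    if not third_positions:
        return 0.0
    return sum(base in "GC" for base in third_positions) / len(third_positions)
```

In a GC-rich genome, the correct reading frame of a true gene tends to show a markedly higher value than the other two frames, which is what a frame plot visualizes along the sequence.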

  10. Initial Alignment for SINS Based on Pseudo-Earth Frame in Polar Regions.

    Science.gov (United States)

    Gao, Yanbin; Liu, Meng; Li, Guangchun; Guang, Xingxing

    2017-06-16

    An accurate initial alignment is required for an inertial navigation system (INS). The performance of initial alignment directly affects the subsequent navigation accuracy. However, the rapid convergence of meridians and the small horizontal component of the Earth's rotation make the traditional alignment methods ineffective in polar regions. In this paper, from the perspective of global inertial navigation, a novel alignment algorithm based on pseudo-Earth frame and backward process is proposed to implement the initial alignment in polar regions. Considering that an accurate coarse alignment of azimuth is difficult to obtain in polar regions, the dynamic error modeling with large azimuth misalignment angle is designed. At the end of alignment phase, the strapdown attitude matrix relative to local geographic frame is obtained without influence of position errors and cumbersome computation. As a result, it would be more convenient to access the following polar navigation system. Then, it is also expected to unify the polar alignment algorithm as much as possible, thereby further unifying the form of external reference information. Finally, semi-physical static simulation and in-motion tests with large azimuth misalignment angle assisted by unscented Kalman filter (UKF) validate the effectiveness of the proposed method.

  11. TotalReCaller: improved accuracy and performance via integrated alignment and base-calling.

    Science.gov (United States)

    Menges, Fabian; Narzisi, Giuseppe; Mishra, Bud

    2011-09-01

    Currently, re-sequencing approaches use multiple modules serially to interpret raw sequencing data from next-generation sequencing platforms, while remaining oblivious to the genomic information until the final alignment step. Such approaches fail to exploit the full information from both raw sequencing data and the reference genome that can yield better quality sequence reads, SNP-calls, variant detection, as well as an alignment at the best possible location in the reference genome. Thus, there is a need for novel reference-guided bioinformatics algorithms for interpreting analog signals representing sequences of the bases ({A, C, G, T}), while simultaneously aligning possible sequence reads to a source reference genome whenever available. Here, we propose a new base-calling algorithm, TotalReCaller, to achieve improved performance. A linear error model for the raw intensity data and Burrows-Wheeler transform (BWT) based alignment are combined utilizing a Bayesian score function, which is then globally optimized over all possible genomic locations using an efficient branch-and-bound approach. The algorithm has been implemented in soft- and hardware [field-programmable gate array (FPGA)] to achieve real-time performance. Empirical results on real high-throughput Illumina data were used to evaluate TotalReCaller's performance relative to its peers (Bustard, BayesCall, Ibis and Rolexa) based on several criteria, particularly those important in clinical and scientific applications. Namely, it was evaluated for (i) its base-calling speed and throughput, (ii) its read accuracy and (iii) its specificity and sensitivity in variant calling. A software implementation of TotalReCaller, as well as additional information, is available at: http://bioinformatics.nyu.edu/wordpress/projects/totalrecaller/ (contact: fabian.menges@nyu.edu).

  12. Use of Whole-Genus Genome Sequence Data To Develop a Multilocus Sequence Typing Tool That Accurately Identifies Yersinia Isolates to the Species and Subspecies Levels

    Science.gov (United States)

    Hall, Miquette; Chattaway, Marie A.; Reuter, Sandra; Savin, Cyril; Strauch, Eckhard; Carniel, Elisabeth; Connor, Thomas; Van Damme, Inge; Rajakaruna, Lakshani; Rajendram, Dunstan; Jenkins, Claire; Thomson, Nicholas R.

    2014-01-01

    The genus Yersinia is a large and diverse bacterial genus consisting of human-pathogenic species, a fish-pathogenic species, and a large number of environmental species. Recently, the phylogenetic and population structure of the entire genus was elucidated through the genome sequence data of 241 strains encompassing every known species in the genus. Here we report the mining of this enormous data set to create a multilocus sequence typing-based scheme that can identify Yersinia strains to the species level to a level of resolution equal to that for whole-genome sequencing. Our assay is designed to be able to accurately subtype the important human-pathogenic species Yersinia enterocolitica to whole-genome resolution levels. We also report the validation of the scheme on 386 strains from reference laboratory collections across Europe. We propose that the scheme is an important molecular typing system to allow accurate and reproducible identification of Yersinia isolates to the species level, a process often inconsistent in nonspecialist laboratories. Additionally, our assay is the most phylogenetically informative typing scheme available for Y. enterocolitica. PMID:25339391

  13. Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study [version 1; referees: 1 approved, 2 approved with reservations]

    Directory of Open Access Journals (Sweden)

    John A. Lees

    2018-03-01

    Full Text Available Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made. Methods: We simulated data from a defined “true tree” using a realistic evolutionary model. We built phylogenies from this data using a range of methods, and compared reconstructed trees to the true tree using two measures, noting the computational time needed for different phylogenetic reconstructions. We also used real data from Streptococcus pneumoniae alignments to compare individual core gene trees to a core genome tree. Results: We found that, as expected, maximum likelihood trees from good quality alignments were the most accurate, but also the most computationally intensive. Using less accurate phylogenetic reconstruction methods, we were able to obtain results of comparable accuracy; we found that approximate results can rapidly be obtained using genetic distance based methods. In real data we found that highly conserved core genes, such as those involved in translation, gave an inaccurate tree topology, whereas genes involved in recombination events gave inaccurate branch lengths. We also show a tree-of-trees, relating the results of different phylogenetic reconstructions to each other. Conclusions: We recommend three approaches, depending on requirements for accuracy and computational time. Quicker approaches that do not perform full maximum likelihood optimisation may be useful for many analyses requiring a phylogeny, as generating a high quality input alignment is likely to be the major limiting factor of accurate tree topology. We have publicly released our simulated data and code to enable further comparisons.

  14. Dcode.org anthology of comparative genomic tools.

    Science.gov (United States)

    Loots, Gabriela G; Ovcharenko, Ivan

    2005-07-01

    Comparative genomics provides the means to demarcate functional regions in anonymous DNA sequences. The successful application of this method to identifying novel genes is currently shifting to deciphering the non-coding encryption of gene regulation across genomes. To facilitate the practical application of comparative sequence analysis to genetics and genomics, we have developed several analytical and visualization tools for the analysis of arbitrary sequences and whole genomes. These tools include two alignment tools, zPicture and Mulan; a phylogenetic shadowing tool, eShadow for identifying lineage- and species-specific functional elements; two evolutionary conserved transcription factor analysis tools, rVista and multiTF; a tool for extracting cis-regulatory modules governing the expression of co-regulated genes, Creme 2.0; and a dynamic portal to multiple vertebrate and invertebrate genome alignments, the ECR Browser. Here, we briefly describe each one of these tools and provide specific examples on their practical applications. All the tools are publicly available at the http://www.dcode.org/ website.

  15. Fiber optics welder having movable aligning mirror

    Science.gov (United States)

    Higgins, Robert W.; Robichaud, Roger E.

    1981-01-01

    A system for welding fiber optic waveguides together. The ends of the two fibers to be joined together are accurately, collinearly aligned in a vertical orientation and subjected to a controlled, diffuse arc to effect welding and thermal conditioning. A front-surfaced mirror mounted at a 45° angle to the optical axis of a stereomicroscope mounted for viewing the junction of the ends provides two orthogonal views of the interface during the alignment operation.

  16. A comprehensive evaluation of alignment algorithms in the context of RNA-seq.

    Directory of Open Access Journals (Sweden)

    Robert Lindner

    Full Text Available. Transcriptome sequencing (RNA-Seq) overcomes limitations of previously used RNA quantification methods and provides one experimental framework for both high-throughput characterization and quantification of transcripts at the nucleotide level. The first step and a major challenge in the analysis of such experiments is the mapping of sequencing reads to a transcriptomic origin including the identification of splicing events. In recent years, a large number of such mapping algorithms have been developed, all of which have in common that they require algorithms for aligning a vast number of reads to genomic or transcriptomic sequences. Although the FM-index based aligner Bowtie has become a de facto standard within mapping pipelines, a much larger number of possible alignment algorithms have been developed also including other variants of FM-index based aligners. Accordingly, developers and users of RNA-seq mapping pipelines have the choice among a large number of available alignment algorithms. To provide guidance in the choice of alignment algorithms for these purposes, we evaluated the performance of 14 widely used alignment programs from three different algorithmic classes: algorithms using either hashing of the reference transcriptome, hashing of reads, or a compressed FM-index representation of the genome. Here, special emphasis was placed on both precision and recall and the performance for different read lengths and numbers of mismatches and indels in a read. Our results clearly showed the significant reduction in memory footprint and runtime provided by FM-index based aligners at a precision and recall comparable to the best hash table based aligners. Furthermore, the recently developed Bowtie 2 alignment algorithm shows a remarkable tolerance to both sequencing errors and indels, thus, essentially making hash-based aligners obsolete.
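
Precision and recall, the two measures emphasized above, can be computed for simulated reads whose true origin is known. The sketch below assumes one reported position per read and a hypothetical `tolerance` parameter for near-miss positions; the study's own evaluation protocol is not described in the abstract and may differ.

```python
def precision_recall(reported: dict, truth: dict, tolerance: int = 0):
    """Compare reported mapping positions with the known true positions.

    reported: read id -> position reported by an aligner (unmapped reads
              are simply absent).
    truth:    read id -> true position of every simulated read.
    """
    correct = sum(
        1 for read_id, pos in reported.items()
        if read_id in truth and abs(pos - truth[read_id]) <= tolerance
    )
    precision = correct / len(reported) if reported else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall
```

Precision falls when an aligner reports wrong locations, while recall falls when reads are left unmapped or are mapped incorrectly, so the two together separate aggressive from conservative aligners.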

  17. PSAT: A web tool to compare genomic neighborhoods of multiple prokaryotic genomes

    Directory of Open Access Journals (Sweden)

    Wasnick Michael

    2008-03-01

    Full Text Available. Background: The conservation of gene order among prokaryotic genomes can provide valuable insight into gene function, protein interactions, or events by which genomes have evolved. Although some tools are available for visualizing and comparing the order of genes between genomes of study, few support an efficient and organized analysis between large numbers of genomes. The Prokaryotic Sequence homology Analysis Tool (PSAT) is a web tool for comparing gene neighborhoods among multiple prokaryotic genomes. Results: PSAT utilizes a database that is preloaded with gene annotation, BLAST hit results, and gene-clustering scores designed to help identify regions of conserved gene order. Researchers use the PSAT web interface to find a gene of interest in a reference genome and efficiently retrieve the sequence homologs found in other bacterial genomes. The tool generates a graphic of the genomic neighborhood surrounding the selected gene and the corresponding regions for its homologs in each comparison genome. Homologs in each region are color coded to assist users with analyzing gene order among various genomes. In contrast to common comparative analysis methods that filter sequence homolog data based on alignment score cutoffs, PSAT leverages gene context information for homologs, including those with weak alignment scores, enabling a more sensitive analysis. Features for constraining or ordering results are designed to help researchers browse results from large numbers of comparison genomes in an organized manner. PSAT has been demonstrated to be useful for helping to identify gene orthologs and potential functional gene clusters, and detecting genome modifications that may result in loss of function. Conclusion: PSAT allows researchers to investigate the order of genes within local genomic neighborhoods of multiple genomes. A PSAT web server for public use is available for performing analyses on a growing set of reference genomes through any

  18. Alignment of the HERA-B RICH optical system with data

    International Nuclear Information System (INIS)

    Gorisek, A.; Krizan, P.; Korpar, S.; Staric, M.

    1999-01-01

    We present a method for alignment of the mirror segments in the Ring Image Cherenkov Counter of the HERA-B spectrometer. The method will use recorded data, and was tested by using simulated events. The study shows that the mirrors can be aligned accurately enough to make the corresponding error in Cherenkov angle measurement negligible compared to other contributions. The mirrors are aligned relative to one mirror segment which can be chosen arbitrarily

  19. Efficient oligonucleotide probe selection for pan-genomic tiling arrays

    Directory of Open Access Journals (Sweden)

    Zhang Wei

    2009-09-01

    Full Text Available. Background: Array comparative genomic hybridization is a fast and cost-effective method for detecting, genotyping, and comparing the genomic sequence of unknown bacterial isolates. This method, as with all microarray applications, requires adequate coverage of probes targeting the regions of interest. An unbiased tiling of probes across the entire length of the genome is the most flexible design approach. However, such a whole-genome tiling requires that the genome sequence is known in advance. For the accurate analysis of uncharacterized bacteria, an array must query a fully representative set of sequences from the species' pan-genome. Prior microarrays have included only a single strain per array or the conserved sequences of gene families. These arrays omit potentially important genes and sequence variants from the pan-genome. Results: This paper presents a new probe selection algorithm (PanArray) that can tile multiple whole genomes using a minimal number of probes. Unlike arrays built on clustered gene families, PanArray uses an unbiased, probe-centric approach that does not rely on annotations, gene clustering, or multi-alignments. Instead, probes are evenly tiled across all sequences of the pan-genome at a consistent level of coverage. To minimize the required number of probes, probes conserved across multiple strains in the pan-genome are selected first, and additional probes are used only where necessary to span polymorphic regions of the genome. The viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains with overlapping probes at greater than twofold coverage. Conclusion: PanArray is an oligonucleotide probe selection algorithm for tiling multiple genome sequences using a minimal number of probes. It is capable of fully tiling all genomes of a species on

  20. Delineating slowly and rapidly evolving fractions of the Drosophila genome.

    Science.gov (United States)

    Keith, Jonathan M; Adams, Peter; Stephen, Stuart; Mattick, John S

    2008-05-01

    Evolutionary conservation is an important indicator of function and a major component of bioinformatic methods to identify non-protein-coding genes. We present a new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions. We also describe an information criterion similar to the Akaike Information Criterion (AIC) for determining the number of classes. Working with pairwise alignments enables detection of differences in conservation patterns among closely related species. We analyzed three whole-genome and three partial-genome pairwise alignments among eight Drosophila species. Three distinct classes of conservation level were detected. Sequences comprising the most slowly evolving component were consistent across a range of species pairs, and constituted approximately 62-66% of the D. melanogaster genome. Almost all (>90%) of the aligned protein-coding sequence is in this fraction, suggesting much of it (comprising the majority of the Drosophila genome, including approximately 56% of non-protein-coding sequences) is functional. The size and content of the most rapidly evolving component was species dependent, and varied from 1.6% to 4.8%. This fraction is also enriched for protein-coding sequence (while containing significant amounts of non-protein-coding sequence), suggesting it is under positive selection. We also classified segments according to conservation and GC content simultaneously. This analysis identified numerous sub-classes of those identified on the basis of conservation alone, but was nevertheless consistent with that classification. Software, data, and results available at www.maths.qut.edu.au/-keithj/. Genomic segments comprising the conservation classes available in BED format.

  1. Genomic inference accurately predicts the timing and severity of a recent bottleneck in a non-model insect population

    Science.gov (United States)

    McCoy, Rajiv C.; Garud, Nandita R.; Kelley, Joanna L.; Boggs, Carol L.; Petrov, Dmitri A.

    2015-01-01

    The analysis of molecular data from natural populations has allowed researchers to answer diverse ecological questions that were previously intractable. In particular, ecologists are often interested in the demographic history of populations, information that is rarely available from historical records. Methods have been developed to infer demographic parameters from genomic data, but it is not well understood how inferred parameters compare to true population history or depend on aspects of experimental design. Here we present and evaluate a method of SNP discovery using RNA-sequencing and demographic inference using the program δaδi, which uses a diffusion approximation to the allele frequency spectrum to fit demographic models. We test these methods in a population of the checkerspot butterfly Euphydryas gillettii. This population was intentionally introduced to Gothic, Colorado in 1977 and has since experienced extreme fluctuations including bottlenecks of fewer than 25 adults, as documented by nearly annual field surveys. Using RNA-sequencing of eight individuals from Colorado and eight individuals from a native population in Wyoming, we generate the first genomic resources for this system. While demographic inference is commonly used to examine ancient demography, our study demonstrates that our inexpensive, all-in-one approach to marker discovery and genotyping provides sufficient data to accurately infer the timing of a recent bottleneck. This demographic scenario is relevant for many species of conservation concern, few of which have sequenced genomes. Our results are remarkably insensitive to sample size or number of genomic markers, which has important implications for applying this method to other non-model systems. PMID:24237665

  2. Exploring the relationship between sequence similarity and accurate phylogenetic trees.

    Science.gov (United States)

    Cantarel, Brandi L; Morrison, Hilary G; Pearson, William

    2006-11-01

    We have characterized the relationship between accurate phylogenetic reconstruction and sequence similarity, testing whether high levels of sequence similarity can consistently produce accurate evolutionary trees. We generated protein families with known phylogenies using a modified version of the PAML/EVOLVER program that produces insertions and deletions as well as substitutions. Protein families were evolved over a range of 100-400 point accepted mutations; at these distances 63% of the families shared significant sequence similarity. Protein families were evolved using balanced and unbalanced trees, with ancient or recent radiations. In families sharing statistically significant similarity, about 60% of multiple sequence alignments were 95% identical to true alignments. To compare recovered topologies with true topologies, we used a score that reflects the fraction of clades that were correctly clustered. As expected, the accuracy of the phylogenies was greatest in the least divergent families. About 88% of phylogenies clustered over 80% of clades in families that shared significant sequence similarity, using Bayesian, parsimony, distance, and maximum likelihood methods. However, for protein families with short ancient branches (ancient radiation), only 30% of the most divergent (but statistically significant) families produced accurate phylogenies, and only about 70% of the second most highly conserved families, with median expectation values better than 10^-60, produced accurate trees. These values represent upper bounds on expected tree accuracy for sequences with a simple divergence history; proteins from 700 Giardia families, with a similar range of sequence similarities but considerably more gaps, produced much less accurate trees.
For our simulated insertions and deletions, correct multiple sequence alignments did not perform much better than those produced by T-COFFEE, and including sequences with expressed sequence tag-like sequencing errors did not

  3. Alignment of the ATLAS Inner Detector Tracking System

    CERN Document Server

    Lacuesta, V; The ATLAS collaboration

    2010-01-01

    ATLAS is a multipurpose experiment that records the LHC collisions. To reconstruct the trajectories of charged particles produced in these collisions, the ATLAS tracking system is equipped with silicon planar sensors and drift-tube based detectors. They constitute the ATLAS Inner Detector. To achieve its scientific goals, the alignment of the ATLAS tracking system requires accurately determining its almost 36000 degrees of freedom. The demanded precision for the alignment of the silicon sensors is below 10 micrometers. This requires a large sample of high-momentum, isolated charged-particle tracks. The high level trigger selects those tracks online. The raw data with the hit information of the triggered tracks is then stored in a calibration stream. Tracks from the cosmic trigger during empty LHC bunches are also used as input for the alignment. The implementation of the track based alignment within the ATLAS software framework unifies different alignment approaches and allows the alignment of ...

  4. First generation annotations for the fathead minnow (Pimephales promelas) genome

    Science.gov (United States)

    Ab initio gene prediction and evidence alignment were used to produce the first annotations for the fathead minnow SOAPdenovo genome assembly. Additionally, a genome browser hosted at genome.setac.org provides simplified access to the annotation data in context with fathead minno...

  5. Image Alignment for Multiple Camera High Dynamic Range Microscopy.

    Science.gov (United States)

    Eastwood, Brian S; Childs, Elisabeth C

    2012-01-09

    This paper investigates the problem of image alignment for multiple camera high dynamic range (HDR) imaging. HDR imaging combines information from images taken with different exposure settings. Combining information from multiple cameras requires an alignment process that is robust to the intensity differences in the images. HDR applications that use a limited number of component images require an alignment technique that is robust to large exposure differences. We evaluate the suitability for HDR alignment of three exposure-robust techniques. We conclude that image alignment based on matching feature descriptors extracted from radiant power images from calibrated cameras yields the most accurate and robust solution. We demonstrate the use of this alignment technique in a high dynamic range video microscope that enables live specimen imaging with a greater level of detail than can be captured with a single camera.

  6. libgapmis: extending short-read alignments.

    Science.gov (United States)

    Alachiotis, Nikolaos; Berger, Simon; Flouri, Tomáš; Pissis, Solon P; Stamatakis, Alexandros

    2013-01-01

    A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly or not allowing them at all. The seed-and-extend strategy is applied in most short-read alignment programmes. After aligning a substring of the reference sequence against the high-quality prefix of a short read--the seed--an important problem is to find the best possible alignment between a substring of the reference sequence succeeding and the remaining suffix of low quality of the read--extend. The fact that the reads are rather short and that the gap occurrence frequency observed in various studies is rather low suggests that aligning (parts of) those reads with a single gap is in fact desirable. In this article, we present libgapmis, a library for extending pairwise short-read alignments. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on an algorithm computing a modified version of the traditional dynamic-programming matrix for sequence alignment. Extensive experimental results demonstrate that the functions of the CPU version provided in this library accelerate the computations by a factor of 20 compared to other programmes. The analogous SSE- and GPU-based implementations accelerate the computations by a factor of 6 and 11, respectively, compared to the CPU version. The library also provides the user the flexibility to split the read into fragments, based on the observed gap occurrence frequency and the length of the read, thereby allowing for a variable, but bounded, number of gaps in the alignment.
We present libgapmis, a library for extending pairwise short-read alignments. We
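
    The single-gap extension idea described in this record can be illustrated with a brute-force sketch. This is not the libgapmis algorithm (which computes a modified dynamic-programming matrix) and not its API: the function names, the linear gap penalty, and the requirement that the reference be at least len(read) + max_gap long are all assumptions made for illustration.

```python
def count_mismatches(a, b):
    """Number of differing positions over the overlap of a and b."""
    return sum(x != y for x, y in zip(a, b))


def best_single_gap(read, ref, max_gap, gap_penalty=1):
    """Return (cost, gap_in, position, gap_length) for the cheapest alignment
    of `read` against a prefix of `ref` that uses at most one gap.

    cost = mismatches + gap_penalty * gap_length (a hypothetical scoring).
    """
    if len(ref) < len(read) + max_gap:
        raise ValueError("ref must be at least len(read) + max_gap long")
    # Start from the ungapped alignment.
    best = (count_mismatches(read, ref), "none", 0, 0)
    for gap in range(1, max_gap + 1):
        for pos in range(1, len(read)):
            prefix_cost = count_mismatches(read[:pos], ref[:pos])
            # Gap in the read: skip `gap` reference bases after `pos`.
            cost = (prefix_cost
                    + count_mismatches(read[pos:], ref[pos + gap:])
                    + gap_penalty * gap)
            if cost < best[0]:
                best = (cost, "read", pos, gap)
            # Gap in the reference: skip `gap` read bases after `pos`.
            if pos + gap <= len(read):
                cost = (prefix_cost
                        + count_mismatches(read[pos + gap:], ref[pos:])
                        + gap_penalty * gap)
                if cost < best[0]:
                    best = (cost, "reference", pos, gap)
    return best


# The read lacks the two bases "AA" that follow "ACGT" in the reference.
print(best_single_gap("ACGTGACC", "ACGTAAGACCATGG", max_gap=3))
```

    Enumerating every gap position and length costs O(max_gap * len(read)^2) character comparisons, which is acceptable only for a short illustration; a dynamic-programming formulation avoids recomputing the shared prefixes and suffixes.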

  7. FEAST: sensitive local alignment with multiple rates of evolution.

    Science.gov (United States)

    Hudek, Alexander K; Brown, Daniel G

    2011-01-01

    We present a pairwise local aligner, FEAST, which uses two new techniques: a sensitive extension algorithm for identifying homologous subsequences, and a descriptive probabilistic alignment model. We also present a new procedure for training alignment parameters and apply it to the human and mouse genomes, producing a better parameter set for these sequences. Our extension algorithm identifies homologous subsequences by considering all evolutionary histories. It has higher maximum sensitivity than Viterbi extensions, and better balances specificity. We model alignments with several submodels, each with unique statistical properties, describing strongly similar and weakly similar regions of homologous DNA. Training parameters using two submodels produces superior alignments, even when we align with only the parameters from the weaker submodel. Our extension algorithm combined with our new parameter set achieves sensitivity 0.59 on synthetic tests. In contrast, LASTZ with default settings achieves sensitivity 0.35 with the same false positive rate. Using the weak submodel as parameters for LASTZ increases its sensitivity to 0.59 with high error. FEAST is available at http://monod.uwaterloo.ca/feast/.

  8. Implementation of a Parallel Protein Structure Alignment Service on Cloud

    Directory of Open Access Journals (Sweden)

    Che-Lun Hung

    2013-01-01

    Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform.

  9. Survey of local and global biological network alignment: the need to reconcile the two sides of the same coin.

    Science.gov (United States)

    Guzzi, Pietro Hiram; Milenković, Tijana

    2017-01-05

    Analogous to genomic sequence alignment that allows for across-species transfer of biological knowledge between conserved sequence regions, biological network alignment can be used to guide the knowledge transfer between conserved regions of molecular networks of different species. Hence, biological network alignment can be used to redefine the traditional notion of a sequence-based homology to a new notion of network-based homology. Analogous to genomic sequence alignment, there exist local and global biological network alignments. Here, we survey prominent and recent computational approaches of each network alignment type and discuss their (dis)advantages. Then, as it was recently shown that the two approach types are complementary, in the sense that they capture different slices of cellular functioning, we discuss the need to reconcile the two network alignment types and present a recent first step in this direction. We conclude with some open research problems on this topic and comment on the usefulness of network alignment in other domains besides computational biology. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  10. Genome-Wide Association Mapping and Genomic Selection for Alfalfa (Medicago sativa) Forage Quality Traits.

    Science.gov (United States)

    Biazzi, Elisa; Nazzicari, Nelson; Pecetti, Luciano; Brummer, E Charles; Palmonari, Alberto; Tava, Aldo; Annicchiarico, Paolo

    2017-01-01

    Genetic progress for forage quality has been poor in alfalfa (Medicago sativa L.), the most-grown forage legume worldwide. This study aimed at exploring opportunities for marker-assisted selection (MAS) and genomic selection of forage quality traits based on breeding values of parent plants. Some 154 genotypes from a broadly-based reference population were genotyped by genotyping-by-sequencing (GBS), and phenotyped for leaf-to-stem ratio, leaf and stem contents of protein, neutral detergent fiber (NDF) and acid detergent lignin (ADL), and leaf and stem NDF digestibility after 24 hours (NDFD), of their dense-planted half-sib progenies in three growing conditions (summer harvest, full irrigation; summer harvest, suspended irrigation; autumn harvest). Trait-marker analyses were performed on progeny values averaged over conditions, owing to modest germplasm × condition interaction. Genomic selection exploited 11,450 polymorphic SNP markers, whereas a subset of 8,494 M. truncatula-aligned markers were used for a genome-wide association study (GWAS). GWAS confirmed the polygenic control of quality traits and, in agreement with phenotypic correlations, indicated substantially different genetic control of a given trait in stems and leaves. It detected several SNPs in different annotated genes that were highly linked to stem protein content. Also, it identified a small genomic region on chromosome 8 with high concentration of annotated genes associated with leaf ADL, including one gene probably involved in the lignin pathway. Three genomic selection models, i.e., Ridge-regression BLUP, Bayes B and Bayesian Lasso, displayed similar prediction accuracy, whereas SVR-lin was less accurate. Accuracy values were moderate (0.3-0.4) for stem NDFD and leaf protein content, modest for leaf ADL and NDFD, and low to very low for the other traits. Along with previous results for the same germplasm set, this study indicates that GBS data can be exploited to improve both quality traits

  11. Design of practical alignment device in KSTAR Thomson diagnostic

    Energy Technology Data Exchange (ETDEWEB)

    Lee, J. H., E-mail: jhlee@nfri.re.kr [National Fusion Research Institute, Daejeon (Korea, Republic of); University of Science and Technology (UST), Daejeon (Korea, Republic of); Lee, S. H. [National Fusion Research Institute, Daejeon (Korea, Republic of); Yamada, I. [National Institute for Fusion Science, Toki (Japan)

    2016-11-15

    The precise alignment of the laser path and collection optics in Thomson scattering measurements is essential for accurately determining electron temperature and density in tokamak experiments. For the last five years, during the development stage, the KSTAR tokamak’s Thomson diagnostic system has had alignment fibers installed in its optical collection modules, but these lacked a proper alignment detection system. To address this difficulty, a device was developed that verifies the alignment between the lasers and the object field of the collection optics. The alignment detection device uses two types of filters: a narrow-band filter at the laser wavelength, and a broad wavelength filter for the Thomson scattering signal. Four such alignment detection devices have been successfully developed for the KSTAR Thomson scattering system this year, and these will be tested in KSTAR experiments in 2016. In this paper, we present the newly developed alignment detection device for KSTAR’s Thomson scattering diagnostics.

  12. Anatomically Plausible Surface Alignment and Reconstruction

    DEFF Research Database (Denmark)

    Paulsen, Rasmus R.; Larsen, Rasmus

    2010-01-01

    With the increasing clinical use of 3D surface scanners, there is a need for accurate and reliable algorithms that can produce anatomically plausible surfaces. In this paper, a combined method for surface alignment and reconstruction is proposed. It is based on an implicit surface representation...

  13. A privacy-preserving solution for compressed storage and selective retrieval of genomic data.

    Science.gov (United States)

    Huang, Zhicong; Ayday, Erman; Lin, Huang; Aiyar, Raeka S; Molyneaux, Adam; Xu, Zhenyu; Fellay, Jacques; Steinmetz, Lars M; Hubaux, Jean-Pierre

    2016-12-01

    In clinical genomics, the continuous evolution of bioinformatic algorithms and sequencing platforms makes it beneficial to store patients' complete aligned genomic data in addition to variant calls relative to a reference sequence. Due to the large size of human genome sequence data files (varying from 30 GB to 200 GB depending on coverage), two major challenges facing genomics laboratories are the costs of storage and the efficiency of the initial data processing. In addition, privacy of genomic data is becoming an increasingly serious concern, yet no standard data storage solutions exist that enable compression, encryption, and selective retrieval. Here we present a privacy-preserving solution named SECRAM (Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map) for the secure storage of compressed aligned genomic data. Our solution enables selective retrieval of encrypted data and improves the efficiency of downstream analysis (e.g., variant calling). Compared with BAM, the de facto standard for storing aligned genomic data, SECRAM uses 18% less storage. Compared with CRAM, one of the most compressed nonencrypted formats (using 34% less storage than BAM), SECRAM maintains efficient compression and downstream data processing, while allowing for unprecedented levels of security in genomic data storage. Compared with previous work, the distinguishing features of SECRAM are that (1) it is position-based instead of read-based, and (2) it allows random querying of a subregion from a BAM-like file in an encrypted form. Our method thus offers a space-saving, privacy-preserving, and effective solution for the storage of clinical genomic data. © 2016 Huang et al.; Published by Cold Spring Harbor Laboratory Press.

  14. Multiple sequence alignment accuracy and phylogenetic inference.

    Science.gov (United States)

    Ogden, T Heath; Rosenberg, Michael S

    2006-04-01

    Phylogenies are often thought to be more dependent upon the specifics of the sequence alignment rather than on the method of reconstruction. Simulation of sequences containing insertion and deletion events was performed in order to determine the role that alignment accuracy plays during phylogenetic inference. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (ultrametric equal branch length, ultrametric random branch length, nonultrametric random branch length). Comparisons between hypothesized alignments and true alignments enabled determination of two measures of alignment accuracy, that of the total data set and that of individual branches. In general, our results indicate that as alignment error increases, topological accuracy decreases. This trend was much more pronounced for data sets derived from more pectinate topologies. In contrast, for balanced, ultrametric, equal branch length tree shapes, alignment inaccuracy had little average effect on tree reconstruction. These conclusions are based on average trends of many analyses under different conditions, and any one specific analysis, independent of the alignment accuracy, may recover very accurate or inaccurate topologies. Maximum likelihood and Bayesian, in general, outperformed neighbor joining and maximum parsimony in terms of tree reconstruction accuracy. Results also indicated that as the length of the branch and of the neighboring branches increase, alignment accuracy decreases, and the length of the neighboring branches is the major factor in topological accuracy. Thus, multiple-sequence alignment can be an important factor in downstream effects on topological reconstruction.

  15. Method for the mechanical axis alignment of the linear induction accelerator

    International Nuclear Information System (INIS)

    Li Hong; China Academy of Engineering Physics, Mianyang; Yao Jin; Liu Yunlong; Zhang Linwen; Deng Jianjun

    2004-01-01

    Accurate mechanical axis alignment is a basic requirement for assembling a linear induction accelerator (LIA). The total length of an LIA is usually over thirty or fifty meters, and it consists of many induction cells. A new method of mechanical axis alignment for LIAs, based on a laser tracker, is established to achieve high accuracy. This paper introduces the method and gives the implementation steps and the point position measurement errors of the mechanical axis alignment. During the alignment process a 55 m-long alignment control survey net is built, and the theoretical revision of the coordinates of the control survey net is presented. (authors)

  16. Seeking the perfect alignment

    CERN Multimedia

    2002-01-01

    The first full-scale tests of the ATLAS Muon Spectrometer are about to begin in Prévessin. The set-up includes several layers of Monitored Drift Tube chambers (MDTs) and will allow tests of the performance of the detectors and of their highly accurate alignment system. [Photo caption: Monitored Drift Chambers in Building 887 in Prévessin, where they are just about to be tested.] Muon chambers are keeping the ATLAS Muon Spectrometer team quite busy this summer. Now that most people go on holiday, the beam and alignment tests for these chambers are just starting. These chambers will measure with high accuracy the momentum of high-energy muons, and this implies very demanding requirements for their alignment. The MDT chambers consist of drift tubes, which are gas-filled metal tubes, 3 cm in diameter, with wires running down their axes. With high voltage between the wire and the tube wall, the ionisation due to traversing muons is detected as electrical pulses. With careful timing of the pulses, the position of the muon t...

  17. ChromAlign: A two-step algorithmic procedure for time alignment of three-dimensional LC-MS chromatographic surfaces.

    Science.gov (United States)

    Sadygov, Rovshan G; Maroto, Fernando Martin; Hühmer, Andreas F R

    2006-12-15

    We present an algorithmic approach to align three-dimensional chromatographic surfaces of LC-MS data of complex mixture samples. The approach consists of two steps. In the first step, we prealign chromatographic profiles: two-dimensional projections of chromatographic surfaces. This is accomplished by correlation analysis using fast Fourier transforms. In this step, a temporal offset that maximizes the overlap and dot product between two chromatographic profiles is determined. In the second step, the algorithm generates correlation matrix elements between full mass scans of the reference and sample chromatographic surfaces. The temporal offset from the first step indicates a range of the mass scans that are possibly correlated, then the correlation matrix is calculated only for these mass scans. The correlation matrix carries information on highly correlated scans, but it does not itself determine the scan or time alignment. Alignment is determined as a path in the correlation matrix that maximizes the sum of the correlation matrix elements. The computational complexity of the optimal path generation problem is reduced by the use of dynamic programming. The program produces time-aligned surfaces. The use of the temporal offset from the first step in the second step reduces the computation time for generating the correlation matrix and speeds up the process. The algorithm has been implemented in a program, ChromAlign, developed in C++ language for the .NET2 environment in WINDOWS XP. In this work, we demonstrate the applications of ChromAlign to alignment of LC-MS surfaces of several datasets: a mixture of known proteins, samples from digests of surface proteins of T-cells, and samples prepared from digests of cerebrospinal fluid. ChromAlign accurately aligns the LC-MS surfaces we studied. In these examples, we discuss various aspects of the alignment by ChromAlign, such as constant time axis shifts and warping of chromatographic surfaces.
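
    The two steps described in this record can be sketched in a few lines. This is an illustration under stated assumptions, not the ChromAlign implementation: the function names are hypothetical, the cross-correlation is computed by a direct sum rather than by fast Fourier transforms (the maximizing offset is the same), and the path model (steps down, right, or diagonal through a full matrix) is a simplification of whatever constraints the program applies.

```python
def best_offset(reference, sample, max_shift):
    """Step 1: return the shift s (in scans) that maximizes the dot product
    sum_i reference[i] * sample[i + s] of two chromatographic profiles."""
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        score = 0.0
        for i, ref_value in enumerate(reference):
            j = i + shift
            if 0 <= j < len(sample):
                score += ref_value * sample[j]
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


def best_path_score(corr):
    """Step 2: dynamic programming over a scan-by-scan correlation matrix.
    Returns the largest sum of elements along a monotone path from the
    top-left to the bottom-right corner."""
    rows, cols = len(corr), len(corr[0])
    score = [[float("-inf")] * cols for _ in range(rows)]
    score[0][0] = corr[0][0]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(score[i - 1][j])
            if j > 0:
                candidates.append(score[i][j - 1])
            if i > 0 and j > 0:
                candidates.append(score[i - 1][j - 1])
            score[i][j] = corr[i][j] + max(candidates)
    return score[-1][-1]


# The sample peak elutes two scans later than the reference peak.
print(best_offset([0, 0, 1, 5, 1, 0, 0, 0], [0, 0, 0, 0, 1, 5, 1, 0], max_shift=4))
print(best_path_score([[1, 2, 0], [0, 1, 3]]))
```

    In the procedure the abstract describes, the offset from the first step limits which scan pairs need a correlation value at all, so the matrix passed to the second step would be a band around that offset rather than the full matrix used here.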

  18. Accurate Local-Ancestry Inference in Exome-Sequenced Admixed Individuals via Off-Target Sequence Reads

    Science.gov (United States)

    Hu, Youna; Willer, Cristen; Zhan, Xiaowei; Kang, Hyun Min; Abecasis, Gonçalo R.

    2013-01-01

    Estimates of the ancestry of specific chromosomal regions in admixed individuals are useful for studies of human evolutionary history and for genetic association studies. Previously, this ancestry inference relied on high-quality genotypes from genome-wide association study (GWAS) arrays. These high-quality genotypes are not always available when samples are exome sequenced, and exome sequencing is the strategy of choice for many ongoing genetic studies. Here we show that off-target reads generated during exome-sequencing experiments can be combined with on-target reads to accurately estimate the ancestry of each chromosomal segment in an admixed individual. To reconstruct local ancestry, our method SEQMIX models aligned bases directly instead of relying on hard genotype calls. We evaluate the accuracy of our method through simulations and analysis of samples sequenced by the 1000 Genomes Project and the NHLBI Grand Opportunity Exome Sequencing Project. In African Americans, we show that local-ancestry estimates derived by our method are very similar to those derived with Illumina’s Omni 2.5M genotyping array and much improved in relation to estimates that use only exome genotypes and ignore off-target sequencing reads. Software implementing this method, SEQMIX, can be applied to analysis of human population history or used for genetic association studies in admixed individuals. PMID:24210252

  19. Validation of the CLIC alignment strategy on short range

    CERN Document Server

    Mainaud Durand, H; Griffet, S; Kemppinen, J; Rude, V; Sosin, M

    2012-01-01

    The pre-alignment of CLIC consists of aligning the components of linacs and beam delivery systems (BDS) in the most accurate possible way, so that a first pilot beam can circulate and allow the implementation of the beam based alignment. Taking into account the precision and accuracy needed: 10 µm rms over sliding windows of 200m, this pre-alignment must be active and it can be divided into two parts: the determination of a straight reference over 20 km, thanks to a metrological network and the determination of the component positions with respect to this reference, and their adjustment. The second part is the object of the paper, describing the steps of the proposed strategy: firstly the fiducialisation of the different components of CLIC; secondly, the alignment of these components on common supports and thirdly the active alignment of these supports using sensors and actuators. These steps have been validated on a test setup over a length of 4m, and the obtained results are analysed.

  20. M-GCAT: interactively and efficiently constructing large-scale multiple genome comparison frameworks in closely related species

    Directory of Open Access Journals (Sweden)

    Messeguer Xavier

    2006-10-01

    Background: Due to recent advances in whole genome shotgun sequencing and assembly technologies, the financial cost of decoding an organism's DNA has been drastically reduced, resulting in a recent explosion of genomic sequencing projects. This increase in related genomic data will allow for in depth studies of evolution in closely related species through multiple whole genome comparisons. Results: To facilitate such comparisons, we present an interactive multiple genome comparison and alignment tool, M-GCAT, that can efficiently construct multiple genome comparison frameworks in closely related species. M-GCAT is able to compare and identify highly conserved regions in up to 20 closely related bacterial species in minutes on a standard computer, and as many as 90 (containing 75 cloned genomes from a set of 15 published enterobacterial genomes) in an hour. M-GCAT also incorporates a novel comparative genomics data visualization interface allowing the user to globally and locally examine and inspect the conserved regions and gene annotations. Conclusion: M-GCAT is an interactive comparative genomics tool well suited for quickly generating multiple genome comparison frameworks and alignments among closely related species. M-GCAT is freely available for download for academic and non-commercial use at: http://alggen.lsi.upc.es/recerca/align/mgcat/intro-mgcat.html.

  1. Dramatic improvement in genome assembly achieved using doubled-haploid genomes.

    Science.gov (United States)

    Zhang, Hong; Tan, Engkong; Suzuki, Yutaka; Hirose, Yusuke; Kinoshita, Shigeharu; Okano, Hideyuki; Kudoh, Jun; Shimizu, Atsushi; Saito, Kazuyoshi; Watabe, Shugo; Asakawa, Shuichi

    2014-10-27

    Improvement in de novo assembly of large genomes is still to be desired. Here, we improved draft genome sequence quality by employing doubled-haploid individuals. We sequenced wildtype and doubled-haploid Takifugu rubripes genomes, under the same conditions, using the Illumina platform and assembled contigs with SOAPdenovo2. We observed 5.4-fold and 2.6-fold improvement in the sizes of the N50 contig and scaffold of doubled-haploid individuals, respectively, compared to the wildtype, indicating that the use of a doubled-haploid genome aids in accurate genome analysis.
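
    The N50 statistic quoted above can be illustrated with a short sketch. The contig lengths below are invented for illustration and are not data from the study; the function name `n50` is likewise an assumption.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp) for two assemblies of the same genome.
fragmented = [10, 20, 30, 40]
contiguous = [100, 60, 40]

fold_change = n50(contiguous) / n50(fragmented)
```

    A fold improvement such as the 5.4-fold figure in the abstract would be the ratio of two N50 values computed this way.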

  2. An improved model for whole genome phylogenetic analysis by Fourier transform.

    Science.gov (United States)

    Yin, Changchuan; Yau, Stephen S-T

    2015-10-07

    DNA sequence similarity comparison is one of the major steps in computational phylogenetic studies. The sequence comparison of closely related DNA sequences and genomes is usually performed by multiple sequence alignments (MSA). While the MSA method is accurate for some types of sequences, it may produce incorrect results when DNA sequences have undergone rearrangements, as in many bacterial and viral genomes. It is also limited by its computational complexity for comparing large volumes of data. Previously, we proposed an alignment-free method that exploits the full information content of DNA sequences by Discrete Fourier Transform (DFT), but still with some limitations. Here, we present a significantly improved method for the similarity comparison of DNA sequences by DFT. In this method, we map DNA sequences into 2-dimensional (2D) numerical sequences and then apply DFT to transform the 2D numerical sequences into the frequency domain. In the 2D mapping, the nucleotide composition of a DNA sequence is a determinant factor, and the 2D mapping reduces the nucleotide composition bias in the distance measure, thereby improving the similarity measure of DNA sequences. To compare the DFT power spectra of DNA sequences with different lengths, we propose an improved even scaling algorithm to extend shorter DFT power spectra to the longest length of the underlying sequences. After the DFT power spectra are evenly scaled, the spectra have the same dimensionality in the Fourier frequency space, and the Euclidean distances between the full Fourier power spectra of the DNA sequences are used as the dissimilarity metrics. The improved DFT method, with increased computational performance through the 2D numerical representation, is applicable to DNA sequences of any length range. We assess the accuracy of the improved DFT similarity measure in hierarchical clustering of different DNA sequences including simulated and real datasets. The method yields accurate and reliable phylogenetic trees
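
    The pipeline described above (numerical mapping, DFT power spectrum, length scaling, Euclidean distance) can be sketched as follows. This is a minimal illustration, not the authors' method: the nucleotide encoding in `ENCODING` is a hypothetical 2D mapping, and plain linear interpolation in `stretch` stands in for the paper's even scaling algorithm, whose details the abstract does not give.

```python
import cmath
import math

# Hypothetical 2D encoding (an assumption): each base is a point in the
# plane, stored as a complex number.
ENCODING = {"A": 1 + 0j, "T": -1 + 0j, "C": 0 + 1j, "G": 0 - 1j}


def power_spectrum(seq):
    """Naive O(n^2) DFT power spectrum of the encoded sequence."""
    if not seq:
        raise ValueError("empty sequence")
    x = [ENCODING[base] for base in seq.upper()]
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def stretch(spectrum, length):
    """Linearly interpolate a spectrum to `length` points.
    Stand-in for the even scaling step (an assumption)."""
    n = len(spectrum)
    if n == length:
        return list(spectrum)
    if n == 1:
        return [spectrum[0]] * length
    out = []
    for i in range(length):
        pos = i * (n - 1) / (length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(spectrum[lo] * (1 - frac) + spectrum[hi] * frac)
    return out


def dft_distance(seq_a, seq_b):
    """Euclidean distance between length-matched power spectra."""
    length = max(len(seq_a), len(seq_b))
    pa = stretch(power_spectrum(seq_a), length)
    pb = stretch(power_spectrum(seq_b), length)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pa, pb)))
```

    A matrix of `dft_distance` values over a set of sequences could then be passed to any hierarchical clustering routine.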

  3. Software for computing and annotating genomic ranges.

    Science.gov (United States)

    Lawrence, Michael; Huber, Wolfgang; Pagès, Hervé; Aboyoun, Patrick; Carlson, Marc; Gentleman, Robert; Morgan, Martin T; Carey, Vincent J

    2013-01-01

    We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.
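
    The two range operations named above, overlap detection and coverage calculation, can be illustrated in plain Python. This sketch does not use or reproduce the Bioconductor API; the `Range` type and function names are assumptions, and the quadratic overlap search is chosen for clarity rather than efficiency.

```python
from collections import namedtuple

# 1-based, inclusive coordinates (an assumed convention for this sketch).
Range = namedtuple("Range", "chrom start end")


def overlaps(a, b):
    """True if two ranges share at least one position on the same sequence."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def find_overlaps(query, subject):
    """Return (query_index, subject_index) pairs; O(n*m) for clarity."""
    return [(i, j)
            for i, q in enumerate(query)
            for j, s in enumerate(subject)
            if overlaps(q, s)]


def coverage(ranges, chrom, length):
    """Per-base coverage on `chrom`, computed with a difference array."""
    diff = [0] * (length + 2)
    for r in ranges:
        if r.chrom != chrom:
            continue
        start, end = max(r.start, 1), min(r.end, length)
        if start > end:
            continue
        diff[start] += 1
        diff[end + 1] -= 1
    cov, depth = [], 0
    for pos in range(1, length + 1):
        depth += diff[pos]
        cov.append(depth)
    return cov


# Hypothetical reads and one hypothetical gene.
reads = [Range("chr1", 1, 4), Range("chr1", 3, 6), Range("chr2", 1, 5)]
genes = [Range("chr1", 4, 10)]
```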

  4. TCS: a web server for multiple sequence alignment evaluation and phylogenetic reconstruction.

    Science.gov (United States)

    Chang, Jia-Ming; Di Tommaso, Paolo; Lefort, Vincent; Gascuel, Olivier; Notredame, Cedric

    2015-07-01

    This article introduces the Transitive Consistency Score (TCS) web server, a service that makes it possible to estimate the local reliability of protein multiple sequence alignments (MSAs) using the TCS index. The evaluation can be used to identify the aligned positions most likely to contain structurally analogous residues and also most likely to support an accurate phylogenetic reconstruction. The TCS scoring scheme has been shown to be an accurate predictor of structural alignment correctness among commonly used methods. It has also been shown to outperform common filtering schemes like Gblocks or trimAl when doing MSA post-processing prior to phylogenetic tree reconstruction. The web server is available from http://tcoffee.crg.cat/tcs.

  5. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data

    KAUST Repository

    Allam, Amin

    2015-07-14

    Motivation: Next-generation sequencing generates large amounts of data affected by errors in the form of substitutions, insertions or deletions of bases. Error correction based on high-coverage information typically improves de novo assembly. Most existing tools can correct substitution errors only; some support insertions and deletions, but accuracy in many cases is low. Results: We present Karect, a novel error correction technique based on multiple alignment. Our approach supports substitution, insertion and deletion errors. It can handle non-uniform coverage as well as moderately covered areas of the sequenced genome. Experiments with data from Illumina, 454 FLX and Ion Torrent sequencing machines demonstrate that Karect is more accurate than previous methods, both in terms of correcting individual-base errors (up to 10% increase in accuracy gain) and post de novo assembly quality (up to 10% increase in NGA50). We also introduce an improved framework for evaluating the quality of error correction.

  6. Absorber Alignment Measurement Tool for Solar Parabolic Trough Collectors: Preprint

    Energy Technology Data Exchange (ETDEWEB)

    Stynes, J. K.; Ihas, B.

    2012-04-01

    As we pursue efforts to lower the capital and installation costs of parabolic trough solar collectors, it is essential to maintain high optical performance. While there are many optical tools available to measure the reflector slope errors of parabolic trough solar collectors, there are few tools to measure the absorber alignment. A new method is presented here to measure the absorber alignment in two dimensions to within 0.5 cm. The absorber alignment is measured using a digital camera and four photogrammetric targets. Physical contact with the receiver absorber or glass is not necessary. The alignment of the absorber is measured along its full length so that sagging of the absorber can be quantified with this technique. The resulting absorber alignment measurement provides critical information required to accurately determine the intercept factor of a collector.

  7. Omni-PolyA: a method and tool for accurate recognition of Poly(A) signals in human genomic DNA

    KAUST Repository

    Magana-Mora, Arturo

    2017-08-15

    Background: Polyadenylation is a critical stage of RNA processing during the formation of mature mRNA, and is present in most of the known eukaryote protein-coding transcripts and many long non-coding RNAs. The correct identification of poly(A) signals (PAS) not only helps to elucidate the 3′-end genomic boundaries of a transcribed DNA region and gene regulatory mechanisms but also gives insight into the multiple transcript isoforms resulting from alternative PAS. Although progress has been made in the in-silico prediction of genomic signals, the recognition of PAS in DNA genomic sequences remains a challenge. Results: In this study, we analyzed human genomic DNA sequences for the 12 most common PAS variants. Our analysis has identified a set of features that helps in the recognition of true PAS, which may be involved in the regulation of the polyadenylation process. The proposed features, in combination with a recognition model, resulted in a novel method and tool, Omni-PolyA. Omni-PolyA combines several machine learning techniques such as different classifiers in a tree-like decision structure and genetic algorithms for deriving a robust classification model. We performed a comparison between results obtained by state-of-the-art methods, deep neural networks, and Omni-PolyA. Results show that Omni-PolyA significantly reduced the average classification error rate by 35.37% in the prediction of the 12 considered PAS variants relative to the state-of-the-art results. Conclusions: The results of our study demonstrate that Omni-PolyA is currently the most accurate model for the prediction of PAS in human and can serve as a useful complement to other PAS recognition methods. Omni-PolyA is publicly available as an online tool accessible at www.cbrc.kaust.edu.sa/omnipolya/.

  8. Visualization of genome signatures of eukaryote genomes by batch-learning self-organizing map with a special emphasis on Drosophila genomes.

    Science.gov (United States)

    Abe, Takashi; Hamano, Yuta; Ikemura, Toshimichi

    2014-01-01

    A strategy of evolutionary studies that can compare vast numbers of genome sequences is becoming increasingly important with the remarkable progress of high-throughput DNA sequencing methods. We previously established a sequence alignment-free clustering method "BLSOM" for di-, tri-, and tetranucleotide compositions in genome sequences, which can characterize sequence characteristics (genome signatures) of a wide range of species. In the present study, we generated BLSOMs for tetra- and pentanucleotide compositions in approximately one million sequence fragments derived from 101 eukaryotes, for which almost complete genome sequences were available. BLSOM recognized phylotype-specific characteristics (e.g., key combinations of oligonucleotide frequencies) in the genome sequences, permitting phylotype-specific clustering of the sequences without any information regarding the species. In our detailed examination of 12 Drosophila species, the correlation between their phylogenetic classification and the classification on the BLSOMs was observed to visualize oligonucleotides diagnostic for species-specific clustering.
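
    The oligonucleotide composition that serves as input to a method such as BLSOM can be computed as sketched below. Only the feature vector is shown; the self-organizing map training itself is not reproduced, and the function name `kmer_frequencies` is an assumption.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return the frequency of every k-mer over ACGT, in lexicographic order.
    Windows containing other symbols (e.g. N) are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


# A tetranucleotide vector has 4**4 = 256 components.
vector = kmer_frequencies("ACGTACGTACGGTTAACC", k=4)
```

    Each sequence fragment would contribute one such vector, and fragments with similar vectors would be expected to fall close together on the map.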

  9. Face Alignment via Regressing Local Binary Features.

    Science.gov (United States)

    Ren, Shaoqing; Cao, Xudong; Wei, Yichen; Sun, Jian

    2016-03-01

    This paper presents a highly efficient and accurate regression approach for face alignment. Our approach has two novel components: 1) a set of local binary features and 2) a locality principle for learning those features. The locality principle guides us to learn a set of highly discriminative local binary features for each facial landmark independently. The obtained local binary features are used to jointly learn a linear regression for the final output. This approach achieves state-of-the-art results when tested on the most challenging benchmarks to date. Furthermore, because extracting and regressing local binary features is computationally very cheap, our system is much faster than previous methods. It achieves over 3000 frames per second (FPS) on a desktop or 300 FPS on a mobile phone for locating a few dozen landmarks. We also study a key issue that is important but has received little attention in previous research: the face detector used to initialize alignment. We investigate several face detectors and perform a quantitative evaluation of how they affect alignment accuracy. We find that an alignment-friendly detector can further greatly boost the accuracy of our alignment method, reducing the error by up to 16% in relative terms. To facilitate practical usage of face detection/alignment methods, we also propose a convenient metric to measure how good a detector is for alignment initialization.

  10. Identification of transcriptional signals in Encephalitozoon cuniculi widespread among Microsporidia phylum: support for accurate structural genome annotation

    Directory of Open Access Journals (Sweden)

    Wincker Patrick

    2009-12-01

    , 5'UTRs being strongly reduced, these signals can be used to ensure the accurate prediction of translation initiation codons for microsporidian genes and to improve microsporidian genome annotation.

  11. Alignment modification for pencil eye shields

    International Nuclear Information System (INIS)

    Evans, M.D.; Pla, M.; Podgorsak, E.B.

    1989-01-01

    Accurate alignment of pencil beam eye shields to protect the lens of the eye may be made easier by means of a simple modification of existing apparatus. This involves drilling a small hole through the center of the shield to isolate the rayline directed to the lens and fabricating a suitable plug for this hole.

  12. eHive: An Artificial Intelligence workflow system for genomic analysis

    Science.gov (United States)

    2010-01-01

    Background The Ensembl project produces updates to its comparative genomics resources with each of its several releases per year. During each release cycle approximately two weeks are allocated to generate all the genomic alignments and the protein homology predictions. The number of calculations required for this task grows approximately quadratically with the number of species. We currently support 50 species in Ensembl and we expect the number to continue to grow in the future. Results We present eHive, a new fault tolerant distributed processing system initially designed to support comparative genomic analysis, based on blackboard systems, network distributed autonomous agents, dataflow graphs and block-branch diagrams. In the eHive system a MySQL database serves as the central blackboard and the autonomous agent, a Perl script, queries the system and runs jobs as required. The system allows us to define dataflow and branching rules to suit all our production pipelines. We describe the implementation of three pipelines: (1) pairwise whole genome alignments, (2) multiple whole genome alignments and (3) gene trees with protein homology inference. Finally, we show the efficiency of the system in real case scenarios. Conclusions eHive allows us to produce computationally demanding results in a reliable and efficient way with minimal supervision and high throughput. Further documentation is available at: http://www.ensembl.org/info/docs/eHive/. PMID:20459813
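
    The quadratic growth mentioned above can be made concrete with a small sketch. It assumes, for illustration only, that every pair of species is compared exactly once; the actual set of comparisons scheduled in the Ensembl pipelines is not stated in the abstract, and the species names are placeholders.

```python
from itertools import combinations


def pairwise_jobs(species):
    """Enumerate one alignment job per unordered pair of species."""
    return list(combinations(species, 2))


def job_count(n):
    """Number of unordered pairs among n species: n(n-1)/2."""
    return n * (n - 1) // 2


# Placeholder species names, not the Ensembl species list.
jobs = pairwise_jobs(["human", "mouse", "rat", "chicken"])

# Doubling the species count roughly quadruples the number of jobs.
growth = {n: job_count(n) for n in (25, 50, 100)}
```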

  13. eHive: An Artificial Intelligence workflow system for genomic analysis

    Directory of Open Access Journals (Sweden)

    Gordon Leo

    2010-05-01

    Background: The Ensembl project produces updates to its comparative genomics resources with each of its several releases per year. During each release cycle approximately two weeks are allocated to generate all the genomic alignments and the protein homology predictions. The number of calculations required for this task grows approximately quadratically with the number of species. We currently support 50 species in Ensembl and we expect the number to continue to grow in the future. Results: We present eHive, a new fault tolerant distributed processing system initially designed to support comparative genomic analysis, based on blackboard systems, network distributed autonomous agents, dataflow graphs and block-branch diagrams. In the eHive system a MySQL database serves as the central blackboard and the autonomous agent, a Perl script, queries the system and runs jobs as required. The system allows us to define dataflow and branching rules to suit all our production pipelines. We describe the implementation of three pipelines: (1) pairwise whole genome alignments, (2) multiple whole genome alignments and (3) gene trees with protein homology inference. Finally, we show the efficiency of the system in real case scenarios. Conclusions: eHive allows us to produce computationally demanding results in a reliable and efficient way with minimal supervision and high throughput. Further documentation is available at: http://www.ensembl.org/info/docs/eHive/.

  14. eHive: an artificial intelligence workflow system for genomic analysis.

    Science.gov (United States)

    Severin, Jessica; Beal, Kathryn; Vilella, Albert J; Fitzgerald, Stephen; Schuster, Michael; Gordon, Leo; Ureta-Vidal, Abel; Flicek, Paul; Herrero, Javier

    2010-05-11

    The Ensembl project produces updates to its comparative genomics resources with each of its several releases per year. During each release cycle approximately two weeks are allocated to generate all the genomic alignments and the protein homology predictions. The number of calculations required for this task grows approximately quadratically with the number of species. We currently support 50 species in Ensembl and we expect the number to continue to grow in the future. We present eHive, a new fault tolerant distributed processing system initially designed to support comparative genomic analysis, based on blackboard systems, network distributed autonomous agents, dataflow graphs and block-branch diagrams. In the eHive system a MySQL database serves as the central blackboard and the autonomous agent, a Perl script, queries the system and runs jobs as required. The system allows us to define dataflow and branching rules to suit all our production pipelines. We describe the implementation of three pipelines: (1) pairwise whole genome alignments, (2) multiple whole genome alignments and (3) gene trees with protein homology inference. Finally, we show the efficiency of the system in real case scenarios. eHive allows us to produce computationally demanding results in a reliable and efficient way with minimal supervision and high throughput. Further documentation is available at: http://www.ensembl.org/info/docs/eHive/.

  15. Establishing a framework for comparative analysis of genome sequences

    Energy Technology Data Exchange (ETDEWEB)

    Bansal, A.K.

    1995-06-01

    This paper describes a framework and a high-level language toolkit for comparative analysis of genome sequence alignment. The framework integrates the information derived from multiple sequence alignment and phylogenetic tree (hypothetical tree of evolution) to derive new properties about sequences. Multiple sequence alignments are treated as an abstract data type. Abstract operations have been described to manipulate a multiple sequence alignment and to derive mutation related information from a phylogenetic tree by superimposing parsimonious analysis. The framework has been applied on protein alignments to derive constrained columns (in a multiple sequence alignment) that exhibit evolutionary pressure to preserve a common property in a column despite mutation. A Prolog toolkit based on the framework has been implemented and demonstrated on alignments containing 3000 sequences and 3904 columns.

  16. Aligning Metabolic Pathways Exploiting Binary Relation of Reactions.

    Directory of Open Access Journals (Sweden)

    Yiran Huang

    Metabolic pathway alignment has been widely used to find one-to-one and/or one-to-many reaction mappings to identify the alternative pathways that have similar functions through different sets of reactions, which has important applications in reconstructing phylogeny and understanding metabolic functions. The existing alignment methods exhaustively search reaction sets, which may become infeasible for large pathways. To address this problem, we present an effective alignment method for accurately extracting reaction mappings between two metabolic pathways. We show that the connected relation between reactions can be formalized as a binary relation of reactions in metabolic pathways, and the multiplications of zero-one matrices for binary relations of reactions can be accomplished in finite steps. By utilizing the multiplications of zero-one matrices for the binary relation of reactions, we efficiently obtain reaction sets in a small number of steps without exhaustive search, and accurately uncover biologically relevant reaction mappings. Furthermore, we introduce a measure of topological similarity of nodes (reactions) by comparing the structural similarity of the k-neighborhood subgraphs of the nodes in aligning metabolic pathways. We employ this similarity metric to improve the accuracy of the alignments. The experimental results on the KEGG database show that, compared with other state-of-the-art methods, our method in most cases obtains better performance in node correctness and edge correctness, in the number of edges of the largest common connected subgraph for one-to-one reaction mappings, and in the number of correct one-to-many reaction mappings. Our method is scalable in finding more reaction mappings with better biological relevance in large metabolic pathways.
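
    The zero-one matrix multiplication mentioned above is the standard way to compose binary relations. The sketch below shows that generic operation on a hypothetical three-reaction chain; how the authors derive reaction sets from the products is not described in the abstract, so `reachability` is only one plausible use.

```python
def bool_matmul(a, b):
    """Boolean product of zero-one matrices: entry (i, j) is 1 when some k
    has a[i][k] = 1 and b[k][j] = 1 (composition of two relations)."""
    inner, cols = len(b), len(b[0])
    return [[int(any(row[k] and b[k][j] for k in range(inner)))
             for j in range(cols)]
            for row in a]


def reachability(adj):
    """Union of adj, adj^2, ..., adj^n: which reactions can reach which
    others through one or more connected steps."""
    n = len(adj)
    reach = [row[:] for row in adj]
    power = [row[:] for row in adj]
    for _ in range(n - 1):
        power = bool_matmul(power, adj)
        reach = [[reach[i][j] | power[i][j] for j in range(n)]
                 for i in range(n)]
    return reach


# Hypothetical pathway R1 -> R2 -> R3 (row = source, column = target).
adjacency = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
two_step = bool_matmul(adjacency, adjacency)
```

    Because the matrices are finite and entries are only 0 or 1, the sequence of powers needs at most n products, which matches the abstract's remark that the multiplications finish in finitely many steps.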

  17. GapMis: a tool for pairwise sequence alignment with a single gap.

    Science.gov (United States)

    Flouri, Tomás; Frousios, Kimon; Iliopoulos, Costas S; Park, Kunsoo; Pissis, Solon P; Tischler, German

    2013-08-01

    Pairwise sequence alignment has received a new motivation due to the advent of recent patents in next-generation sequencing technologies, particularly so for the application of re-sequencing: the assembly of a genome directed by a reference sequence. After the fast alignment between a factor of the reference sequence and a high-quality fragment of a short read by a short-read alignment programme, an important problem is to find the alignment between a relatively short succeeding factor of the reference sequence and the remaining low-quality part of the read allowing a number of mismatches and the insertion of a single gap in the alignment. We present GapMis, a tool for pairwise sequence alignment with a single gap. It is based on a simple algorithm, which computes a different version of the traditional dynamic programming matrix. The presented experimental results demonstrate that GapMis is more suitable and efficient than most popular tools for this task.
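
    The problem statement above (mismatches plus at most one gap) can be illustrated with an exhaustive search. This is not the GapMis algorithm, which uses a modified dynamic programming matrix; the sketch simply tries every gap length and position, and it assumes the gap is placed in the read only.

```python
def best_single_gap(read, ref, max_gap):
    """Align `read` to a prefix of `ref` with at most one gap in the read.
    Returns (mismatches, gap_length, gap_position), minimising mismatches
    first and gap length second."""
    m = len(read)
    if m > len(ref):
        raise ValueError("reference shorter than read")
    best = None
    for gap in range(max_gap + 1):
        if m + gap > len(ref):
            break
        # With no gap there is a single alignment; otherwise the gap may
        # fall between any two read positions.
        positions = range(1, m) if gap else [0]
        for pos in positions:
            left = sum(a != b for a, b in zip(read[:pos], ref[:pos]))
            right_ref = ref[pos + gap:pos + gap + (m - pos)]
            right = sum(a != b for a, b in zip(read[pos:], right_ref))
            candidate = (left + right, gap, pos)
            if best is None or candidate < best:
                best = candidate
    return best


# Hypothetical example: the read lacks the three bases "TTT" of the reference.
result = best_single_gap("ACGGCA", "ACGTTTGCA", max_gap=3)
```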

  18. MANGO: a new approach to multiple sequence alignment.

    Science.gov (United States)

    Zhang, Zefeng; Lin, Hao; Li, Ming

    2007-01-01

    Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state-of-the-art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, ProbConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0.
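
    The difference between a spaced seed and a consecutive k-mer can be shown with a small sketch. The seed pattern `1101` and both sequences are made up for illustration; MANGO's optimized seeds and its seed-handling algorithms are not reproduced here.

```python
def seed_keys(seq, pattern):
    """Index a sequence by a spaced seed. In `pattern`, '1' marks a position
    that must match and '0' a position that is ignored."""
    span = len(pattern)
    care = [i for i, c in enumerate(pattern) if c == "1"]
    index = {}
    for start in range(len(seq) - span + 1):
        key = "".join(seq[start + i] for i in care)
        index.setdefault(key, []).append(start)
    return index


def seed_hits(a, b, pattern):
    """Return sorted (position_in_a, position_in_b) pairs sharing a seed key."""
    ia, ib = seed_keys(a, pattern), seed_keys(b, pattern)
    return sorted((i, j)
                  for key in ia.keys() & ib.keys()
                  for i in ia[key]
                  for j in ib[key])


# The two sequences differ only at index 2 (G versus C).
seq_a = "ACGTACGT"
seq_b = "ACCTACGT"

spaced = seed_hits(seq_a, seq_b, "1101")   # weight 3, one ignored position
contiguous = seed_hits(seq_a, seq_b, "111")  # consecutive 3-mer
```

    Both seeds require three matching bases, but only the spaced seed reports the hit at (0, 0), because the mismatch falls on its ignored position.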

  19. Automated and Adaptable Quantification of Cellular Alignment from Microscopic Images for Tissue Engineering Applications

    Science.gov (United States)

    Xu, Feng; Beyazoglu, Turker; Hefner, Evan; Gurkan, Umut Atakan

    2011-01-01

    Cellular alignment plays a critical role in functional, physical, and biological characteristics of many tissue types, such as muscle, tendon, nerve, and cornea. Current efforts toward regeneration of these tissues include replicating the cellular microenvironment by developing biomaterials that facilitate cellular alignment. To assess the functional effectiveness of the engineered microenvironments, one essential criterion is quantification of cellular alignment. Therefore, there is a need for rapid, accurate, and adaptable methodologies to quantify cellular alignment for tissue engineering applications. To address this need, we developed an automated method, binarization-based extraction of alignment score (BEAS), to determine cell orientation distribution in a wide variety of microscopic images. This method combines a sequenced application of median and band-pass filters, locally adaptive thresholding approaches and image processing techniques. Cellular alignment score is obtained by applying a robust scoring algorithm to the orientation distribution. We validated the BEAS method by comparing the results with the existing approaches reported in literature (i.e., manual, radial fast Fourier transform-radial sum, and gradient based approaches). Validation results indicated that the BEAS method resulted in statistically comparable alignment scores with the manual method (coefficient of determination R² = 0.92). Therefore, the BEAS method introduced in this study could enable accurate, convenient, and adaptable evaluation of engineered tissue constructs and biomaterials in terms of cellular alignment and organization. PMID:21370940

  20. Software for computing and annotating genomic ranges.

    Directory of Open Access Journals (Sweden)

    Michael Lawrence

    We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.

  1. Vacuum mechatronic laser alignment system on the Nova laser

    International Nuclear Information System (INIS)

    Holliday, M.; Wong, K.; Shelton, R.

    1991-11-01

    The experiments conducted on Nova investigate inertially confined laser fusion reactions. To this end, the ten beams of the laser are aligned to within 30 µm. The target chamber employs a vacuum mechatronic based reticle/target positioning system to accomplish this. It is a five degree-of-freedom chamber resident system, known as the Alignment Aids Positioner or AAP. The AAP aids in beam and diagnostic alignment by accurately positioning a reticle at target chamber center to within 7 µm. The AAP system increases target positioning and alignment flexibility and accuracy through the use of a computer controlled multi degree-of-freedom stage assembly. This device uses microstepping DC stepper motors with encoders to achieve closed loop control in a 10^-6 torr vacuum. The AAP has two positioning regimes to move the alignment reticle and do beam alignment. One is coarse positioning in the Y-Z plane that moves a high resolution stage assembly to target chamber center. The other regime is high resolution movement in the X, Y, Z and θ directions. 5 refs., 9 figs.

  2. Tracing common origins of Genomic Islands in prokaryotes based on genome signature analyses.

    Science.gov (United States)

    van Passel, Mark Wj

    2011-09-01

    Horizontal gene transfer constitutes a powerful and innovative force in evolution, but often little is known about the actual origins of transferred genes. Sequence alignments are generally of limited use in tracking the original donor, since still only a small fraction of the total genetic diversity is thought to be uncovered. Alternatively, approaches based on similarities in the genome specific relative oligonucleotide frequencies do not require alignments. Even though the exact origins of horizontally transferred genes may still not be established using these compositional analyses, it does suggest that compositionally very similar regions are likely to have had a common origin. These analyses have shown that up to a third of large acquired gene clusters that reside in the same genome are compositionally very similar, indicative of a shared origin. This brings us closer to uncovering the original donors of horizontally transferred genes, and could help in elucidating possible regulatory interactions between previously unlinked sequences.
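
    An alignment-free compositional comparison of the kind described above can be sketched with dinucleotide relative abundances, a commonly used genome signature. The choice of this particular statistic (observed dinucleotide frequency divided by the product of its mononucleotide frequencies) and the mean absolute difference as a distance are assumptions; the abstract does not specify which measure was used, and strand symmetrisation is omitted for brevity.

```python
from itertools import product


def relative_abundance(seq):
    """Return rho[xy] = f(xy) / (f(x) * f(y)) for all 16 dinucleotides."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence too short")
    mono = {b: seq.count(b) / len(seq) for b in "ACGT"}
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    rho = {}
    for x, y in product("ACGT", repeat=2):
        observed = pairs.count(x + y) / len(pairs)
        expected = mono[x] * mono[y]
        rho[x + y] = observed / expected if expected else 0.0
    return rho


def signature_distance(seq_a, seq_b):
    """Mean absolute difference between two relative-abundance profiles."""
    ra, rb = relative_abundance(seq_a), relative_abundance(seq_b)
    return sum(abs(ra[k] - rb[k]) for k in ra) / len(ra)
```

    Under this sketch, two genomic islands with a small `signature_distance` would be candidates for a shared origin, although, as the abstract notes, composition alone does not identify the donor.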

  3. New genomic resources for switchgrass: a BAC library and comparative analysis of homoeologous genomic regions harboring bioenergy traits

    Directory of Open Access Journals (Sweden)

    Feltus Frank A

    2011-07-01

    Background: Switchgrass, a C4 species and a warm-season grass native to the prairies of North America, has been targeted for development into an herbaceous biomass fuel crop. Genetic improvement of switchgrass feedstock traits through marker-assisted breeding and biotechnology approaches calls for genomic tools development. Establishment of integrated physical and genetic maps for switchgrass will accelerate mapping of value added traits useful to breeding programs and to isolate important target genes using map based cloning. The reported polyploidy series in switchgrass ranges from diploid (2X = 18) to duodecaploid (12X = 108). As in other large, repeat-rich plant genomes, this genomic complexity will hinder whole genome sequencing efforts. An extensive physical map providing enough information to resolve the homoeologous genomes would provide the necessary framework for accurate assembly of the switchgrass genome. Results: A switchgrass BAC library constructed by partial digestion of nuclear DNA with EcoRI contains 147,456 clones covering the effective genome approximately 10 times based on a genome size of 3.2 Gigabases (~1.6 Gb effective). Restriction digestion and PFGE analysis of 234 randomly chosen BACs indicated that 95% of the clones contained inserts, ranging from 60 to 180 kb with an average of 120 kb. Comparative sequence analysis of two homoeologous genomic regions harboring orthologs of the rice OsBRI1 locus, a low-copy gene encoding a putative protein kinase and associated with biomass, revealed that orthologous clones from homoeologous chromosomes can be unambiguously distinguished from each other and correctly assembled to respective fingerprint contigs. Thus, the data obtained not only provide genomic resources for further analysis of the switchgrass genome, but also improve efforts for an accurate genome sequencing strategy. Conclusions: The construction of the first switchgrass BAC library and comparative analysis of
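
    The library coverage quoted in the Results can be checked with simple arithmetic using the figures given there. Applying the 95% insert rate to the clone count is an assumption of this sketch; the abstract does not say how the roughly tenfold figure was derived, and the function name is illustrative.

```python
def library_coverage(clones, insert_fraction, mean_insert_bp, genome_bp):
    """Genome equivalents = (clones with inserts * mean insert size) / genome size."""
    cloned_bp = clones * insert_fraction * mean_insert_bp
    return cloned_bp / genome_bp


# Figures from the abstract: 147,456 clones, 95% with inserts, 120 kb mean insert.
coverage_effective = library_coverage(147_456, 0.95, 120_000, 1.6e9)  # 1.6 Gb effective genome
coverage_full = library_coverage(147_456, 0.95, 120_000, 3.2e9)       # 3.2 Gb full genome
```

    Under these assumptions the library holds about 16.8 Gb of cloned DNA, or roughly 10.5 equivalents of the 1.6 Gb effective genome, consistent with the stated coverage of approximately 10 times.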

  4. Horizontally Transferred Genetic Elements in the Tsetse Fly Genome: An Alignment-Free Clustering Approach Using Batch Learning Self-Organising Map (BLSOM).

    Science.gov (United States)

    Nakao, Ryo; Abe, Takashi; Funayama, Shunsuke; Sugimoto, Chihiro

    2016-01-01

    Tsetse flies (Glossina spp.) are the primary vectors of trypanosomes, which can cause human and animal African trypanosomiasis in Sub-Saharan African countries. The objective of this study was to explore the genome of Glossina morsitans morsitans for evidence of horizontal gene transfer (HGT) from microorganisms. We employed an alignment-free clustering method, that is, batch learning self-organising map (BLSOM), in which sequence fragments are clustered based on the similarity of oligonucleotide frequencies independently of sequence homology. After an initial scan of HGT events using BLSOM, we identified 3.8% of the tsetse fly genome as HGT candidates. The predicted donors of these HGT candidates included known symbionts, such as Wolbachia, as well as bacteria that have not previously been associated with the tsetse fly. We detected HGT candidates from diverse bacteria such as Bacillus and Flavobacteria, suggesting a past association between these taxa. Functional annotation revealed that the HGT candidates encoded loci in various functional pathways, such as metabolic and antibiotic biosynthesis pathways. These findings provide a basis for understanding the coevolutionary history of the tsetse fly and its microbes and establish the effectiveness of BLSOM for the detection of HGT events.

  5. Intra-Genomic Internal Transcribed Spacer Region Sequence Heterogeneity and Molecular Diagnosis in Clinical Microbiology.

    Science.gov (United States)

    Zhao, Ying; Tsang, Chi-Ching; Xiao, Meng; Cheng, Jingwei; Xu, Yingchun; Lau, Susanna K P; Woo, Patrick C Y

    2015-10-22

    Internal transcribed spacer region (ITS) sequencing is the most extensively used technology for accurate molecular identification of fungal pathogens in clinical microbiology laboratories. Intra-genomic ITS sequence heterogeneity, which makes fungal identification based on direct sequencing of PCR products difficult, has rarely been reported in pathogenic fungi. During the process of performing ITS sequencing on 71 yeast strains isolated from various clinical specimens, direct sequencing of the PCR products showed ambiguous sequences in six of them. After cloning the PCR products into plasmids for sequencing, interpretable sequencing electropherograms could be obtained. For each of the six isolates, 10-49 clones were selected for sequencing and two to seven intra-genomic ITS copies were detected. The identities of these six isolates were confirmed to be Candida glabrata (n=2), Pichia (Candida) norvegensis (n=2), Candida tropicalis (n=1) and Saccharomyces cerevisiae (n=1). Multiple sequence alignment revealed that one to four intra-genomic ITS polymorphic sites were present in the six isolates, and all these polymorphic sites were located in the ITS1 and/or ITS2 regions. We report and describe the first evidence of intra-genomic ITS sequence heterogeneity in four different pathogenic yeasts, which occurred exclusively in the ITS1 and ITS2 spacer regions for the six isolates in this study.

  6. MBGD update 2013: the microbial genome database for exploring the diversity of microbial world.

    Science.gov (United States)

    Uchiyama, Ikuo; Mihara, Motohiro; Nishide, Hiroyo; Chiba, Hirokazu

    2013-01-01

    The microbial genome database for comparative analysis (MBGD, available at http://mbgd.genome.ad.jp/) is a platform for microbial genome comparison based on orthology analysis. As its unique feature, MBGD allows users to conduct orthology analysis among any specified set of organisms; this flexibility allows MBGD to adapt to a variety of microbial genomic studies. Reflecting the huge diversity of the microbial world, the number of microbial genome projects has now reached several thousand. To efficiently explore the diversity of the entire microbial genomic data, MBGD now provides summary pages for pre-calculated ortholog tables among various taxonomic groups. For some closely related taxa, MBGD also provides the conserved synteny information (core genome alignment) pre-calculated using the CoreAligner program. In addition, an efficient incremental updating procedure can create an extended ortholog table by adding additional genomes to the default ortholog table generated from the representative set of genomes. Combined with the dynamic orthology calculation for any specified set of organisms, MBGD is an efficient and flexible tool for exploring microbial genome diversity.

  7. Can a semi-automated surface matching and principal axis-based algorithm accurately quantify femoral shaft fracture alignment in six degrees of freedom?

    Science.gov (United States)

    Crookshank, Meghan C; Beek, Maarten; Singh, Devin; Schemitsch, Emil H; Whyne, Cari M

    2013-07-01

    Accurate alignment of femoral shaft fractures treated with intramedullary nailing remains a challenge for orthopaedic surgeons. The aim of this study is to develop and validate a cone-beam CT-based, semi-automated algorithm to quantify the malalignment in six degrees of freedom (6DOF) using a surface matching and principal axes-based approach. Complex comminuted diaphyseal fractures were created in nine cadaveric femora and cone-beam CT images were acquired (27 cases total). Scans were cropped and segmented using intensity-based thresholding, producing superior, inferior and comminution volumes. Cylinders were fit to estimate the long axes of the superior and inferior fragments. The angle and distance between the two cylindrical axes were calculated to determine flexion/extension and varus/valgus angulation and medial/lateral and anterior/posterior translations, respectively. Both surfaces were unwrapped about the cylindrical axes. Three methods of matching the unwrapped surface for determination of periaxial rotation were compared based on minimizing the distance between features. The calculated corrections were compared to the input malalignment conditions. All 6DOF were calculated to within current clinical tolerances for all but two cases. This algorithm yielded accurate quantification of malalignment of femoral shaft fractures for fracture gaps up to 60 mm, based on a single CBCT image of the fractured limb. Copyright © 2012 IPEM. Published by Elsevier Ltd. All rights reserved.
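
    The abstract above derives angulation and translation from the angle and distance between two fitted cylinder axes. A minimal sketch of that geometric step follows, assuming each axis is given as a point and a direction vector in 3D. The function names are hypothetical, and the sketch returns only the total angle and the shortest distance; splitting them into flexion/extension, varus/valgus and the two translations would need an anatomical reference frame that the abstract does not specify.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _norm(a):
    return math.sqrt(_dot(a, a))


def axis_angle_deg(d1, d2):
    """Angle in degrees between two axis directions.

    The absolute value makes the result independent of which way each
    fitted axis happens to point (an assumption made for this sketch).
    """
    c = abs(_dot(d1, d2)) / (_norm(d1) * _norm(d2))
    return math.degrees(math.acos(min(1.0, c)))


def axis_distance(p1, d1, p2, d2):
    """Shortest distance between two infinite lines (point, direction)."""
    n = _cross(d1, d2)
    w = _sub(p2, p1)
    if _norm(n) < 1e-12:
        # Parallel axes: distance from p2 to the first line.
        return _norm(_cross(w, d1)) / _norm(d1)
    return abs(_dot(w, n)) / _norm(n)


# Proximal axis along z; distal axis tilted 45 degrees in the y-z plane.
angle = axis_angle_deg((0, 0, 1), (0, 1, 1))
# Two skew axes separated by 3 units along x.
gap = axis_distance((0, 0, 0), (0, 0, 1), (3, 0, 0), (0, 1, 0))
```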

  8. Mango: multiple alignment with N gapped oligos.

    Science.gov (United States)

    Zhang, Zefeng; Lin, Hao; Li, Ming

    2008-06-01

    Multiple sequence alignment is a classical and challenging task. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state-of-the-art works suffer from the "once a gap, always a gap" phenomenon. Is there a radically new way to do multiple sequence alignment? In this paper, we introduce a novel and orthogonal multiple sequence alignment method, using both multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole and tries to build the alignment vertically, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds have proved significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks, showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, ProbConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0, and Kalign 2.0. We have further demonstrated the scalability of MANGO on very large datasets of repeat elements. MANGO can be downloaded at http://www.bioinfo.org.cn/mango/ and is free for academic usage.
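
    The abstract above relies on spaced seeds being more sensitive than consecutive k-mers. The sketch below illustrates that idea on a pair of sequences: a seed is a pattern of '1' (position must match) and '0' (position is ignored), and a hit is a pair of positions whose bases agree under every '1'. The seed patterns and function names are hypothetical examples; MANGO's optimized seeds and its multi-sequence algorithms are not reproduced here.

```python
from collections import defaultdict


def spaced_keys(seq, seed):
    """Yield (position, key); key keeps only the bases under '1' in seed."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    for pos in range(len(seq) - len(seed) + 1):
        yield pos, "".join(seq[pos + i] for i in care)


def seed_hits(a, b, seed):
    """Return sorted (pos_in_a, pos_in_b) pairs that match under the seed."""
    index = defaultdict(list)
    for pos, key in spaced_keys(a, seed):
        index[key].append(pos)
    return sorted(
        (pa, pb)
        for pb, key in spaced_keys(b, seed)
        for pa in index.get(key, [])
    )


a = "ACGTAC"
b = "ACTTAC"  # differs from a at index 2 only

contiguous = seed_hits(a, b, "1111")  # every window of 4 spans the mismatch
spaced = seed_hits(a, b, "1101")      # the '0' can absorb the mismatch
```

    With the same window length of four, the contiguous seed finds no hit because each window covers the mismatching base, while the spaced seed still anchors the two sequences at position 0.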

  9. Ligand Binding Site Detection by Local Structure Alignment and Its Performance Complementarity

    Science.gov (United States)

    Lee, Hui Sun; Im, Wonpil

    2013-01-01

    Accurate determination of potential ligand binding sites (BS) is a key step for protein function characterization and structure-based drug design. Despite promising results of template-based BS prediction methods using global structure alignment (GSA), there is room to improve the performance by properly incorporating local structure alignment (LSA) because BS are local structures and often similar for proteins with dissimilar global folds. We present a template-based ligand BS prediction method using G-LoSA, our LSA tool. A large benchmark set validation shows that G-LoSA predicts drug-like ligands’ positions in single-chain protein targets more precisely than TM-align, a GSA-based method, while the overall success rate of TM-align is better. G-LoSA is particularly efficient for accurate detection of local structures conserved across proteins with diverse global topologies. Recognizing the performance complementarity of G-LoSA to TM-align and a non-template geometry-based method, fpocket, a robust consensus scoring method, CMCS-BSP (Complementary Methods and Consensus Scoring for ligand Binding Site Prediction), is developed and shows improvement in prediction accuracy. The G-LoSA source code is freely available at http://im.bioinformatics.ku.edu/GLoSA. PMID:23957286

  10. Phylo: a citizen science approach for improving multiple sequence alignment.

    Directory of Open Access Journals (Sweden)

    Alexander Kawrykow

    Full Text Available BACKGROUND: Comparative genomics, or the study of the relationships of genome structure and function across different species, offers a powerful tool for studying evolution, annotating genomes, and understanding the causes of various genetic disorders. However, aligning multiple sequences of DNA, an essential intermediate step for most types of analyses, is a difficult computational task. In parallel, citizen science, an approach that takes advantage of the fact that the human brain is exquisitely tuned to solving specific types of problems, is becoming increasingly popular. There, instances of hard computational problems are dispatched to a crowd of non-expert human game players and solutions are sent back to a central server. METHODOLOGY/PRINCIPAL FINDINGS: We introduce Phylo, a human-based computing framework applying "crowd sourcing" techniques to solve the Multiple Sequence Alignment (MSA) problem. The key idea of Phylo is to convert the MSA problem into a casual game that can be played by ordinary web users with minimal prior knowledge of the biological context. We applied this strategy to improve the alignment of the promoters of disease-related genes from up to 44 vertebrate species. Since the launch in November 2010, we received more than 350,000 solutions submitted from more than 12,000 registered users. Our results show that solutions submitted contributed to improving the accuracy of up to 70% of the alignment blocks considered. CONCLUSIONS/SIGNIFICANCE: We demonstrate that, combined with classical algorithms, crowd computing techniques can be successfully used to help improve the accuracy of MSA. More importantly, we show that an NP-hard computational problem can be embedded in a casual game that can be easily played by people without significant scientific training. This suggests that citizen science approaches can be used to exploit the billions of "human-brain peta-flops" of computation that are spent every day playing games

  11. Accuracy of genomic selection for alfalfa biomass yield in different reference populations.

    Science.gov (United States)

    Annicchiarico, Paolo; Nazzicari, Nelson; Li, Xuehui; Wei, Yanling; Pecetti, Luciano; Brummer, E Charles

    2015-12-01

    Genomic selection based on genotyping-by-sequencing (GBS) data could accelerate alfalfa yield gains, if it displayed moderate ability to predict parent breeding values. Its interest would be enhanced by predicting ability also for germplasm/reference populations other than those for which it was defined. Predicting accuracy may be influenced by statistical models, SNP calling procedures and missing data imputation strategies. Landrace and variety material from two genetically-contrasting reference populations, i.e., 124 elite genotypes adapted to the Po Valley (sub-continental climate; PV population) and 154 genotypes adapted to Mediterranean-climate environments (Me population), were genotyped by GBS and phenotyped in separate environments for dry matter yield of their dense-planted half-sib progenies. Both populations showed no sub-population genetic structure. Predictive accuracy was higher by joint rather than separate SNP calling for the two data sets, and using random forest imputation of missing data. Highest accuracy was obtained using Support Vector Regression (SVR) for PV, and Ridge Regression BLUP and SVR for Me germplasm. Bayesian methods (Bayes A, Bayes B and Bayesian Lasso) tended to be less accurate. Random Forest Regression was the least accurate model. Accuracy attained about 0.35 for Me in the range of 0.30-0.50 missing data, and 0.32 for PV at 0.50 missing data, using at least 10,000 SNP markers. Cross-population predictions based on a smaller subset of common SNPs implied a relative loss of accuracy of about 25% for Me and 30% for PV. Genome-wide association analyses based on large subsets of M. truncatula-aligned markers revealed many SNPs with modest association with yield, and some genome areas hosting putative QTLs. A comparison of genomic vs. conventional selection for parent breeding value assuming 1-year vs. 5-year selection cycles, respectively, indicated over three-fold greater predicted yield gain per unit time for genomic selection

  12. Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments

    Energy Technology Data Exchange (ETDEWEB)

    Pollard, Daniel A.; Moses, Alan M.; Iyer, Venky N.; Eisen, Michael B.

    2006-08-14

    Background: Molecular evolutionary studies of noncoding sequences rely on multiple alignments. Yet how multiple alignment accuracy varies across sequence types, tree topologies, divergences and tools, and further how this variation impacts specific inferences, remains unclear. Results: Here we develop a molecular evolution simulation platform, CisEvolver, with models of background noncoding and transcription factor binding site evolution, and use simulated alignments to systematically examine multiple alignment accuracy and its impact on two key molecular evolutionary inferences: transcription factor binding site conservation and divergence estimation. We find that the accuracy of multiple alignments is determined almost exclusively by the pairwise divergence distance of the two most diverged species and that additional species have a negligible influence on alignment accuracy. Conserved transcription factor binding sites align better than surrounding noncoding DNA yet are often found to be misaligned at relatively short divergence distances, such that studies of binding site gain and loss could easily be confounded by alignment error. Divergence estimates from multiple alignments tend to be overestimated at short divergence distances but reach a tool specific divergence at which they cease to increase, leading to underestimation at long divergences. Our most striking finding was that overall alignment accuracy, binding site alignment accuracy and divergence estimation accuracy vary greatly across branches in a tree and are most accurate for terminal branches connecting sister taxa and least accurate for internal branches connecting sub-alignments. Conclusions: Our results suggest that variation in alignment accuracy can lead to errors in molecular evolutionary inferences that could be construed as biological variation. These findings have implications for which species to choose for analyses, what kind of errors would be expected for a given set of species and how multiple alignment tools and

  13. Beam based alignment at the KEK accelerator test facility

    International Nuclear Information System (INIS)

    Ross, M.; Nelson, J.; Woodley, M.; Wolski, A.

    2002-01-01

    The KEK Accelerator Test Facility (ATF) damping ring is a prototype low emittance source for the NLC/JLC linear collider. To achieve the goal normalized vertical emittance γεy = 20 nm-rad, magnet placement accuracy better than 30 µm must be achieved. Accurate beam-based alignment (BBA) is required. The ATF arc optics uses a FOBO cell with two horizontally focusing quadrupoles, two sextupoles and a horizontally defocusing gradient dipole, all of which must be aligned with BBA. BBA at ATF uses the quadrupole and sextupole trim windings to find the trajectory through the center of each magnet. The results can be interpreted to assess the accuracy of the mechanical alignment and the beam position monitor offsets

  14. Genome-Scale Metabolic Model for the Green Alga Chlorella vulgaris UTEX 395 Accurately Predicts Phenotypes under Autotrophic, Heterotrophic, and Mixotrophic Growth Conditions.

    Science.gov (United States)

    Zuñiga, Cristal; Li, Chien-Ting; Huelsman, Tyler; Levering, Jennifer; Zielinski, Daniel C; McConnell, Brian O; Long, Christopher P; Knoshaug, Eric P; Guarnieri, Michael T; Antoniewicz, Maciek R; Betenbaugh, Michael J; Zengler, Karsten

    2016-09-01

    The green microalga Chlorella vulgaris has been widely recognized as a promising candidate for biofuel production due to its ability to store high lipid content and its natural metabolic versatility. Compartmentalized genome-scale metabolic models constructed from genome sequences enable quantitative insight into the transport and metabolism of compounds within a target organism. These metabolic models have long been utilized to generate optimized design strategies for an improved production process. Here, we describe the reconstruction, validation, and application of a genome-scale metabolic model for C. vulgaris UTEX 395, iCZ843. The reconstruction represents the most comprehensive model for any eukaryotic photosynthetic organism to date, based on the genome size and number of genes in the reconstruction. The highly curated model accurately predicts phenotypes under photoautotrophic, heterotrophic, and mixotrophic conditions. The model was validated against experimental data and lays the foundation for model-driven strain design and medium alteration to improve yield. Calculated flux distributions under different trophic conditions show that a number of key pathways are affected by nitrogen starvation conditions, including central carbon metabolism and amino acid, nucleotide, and pigment biosynthetic pathways. Furthermore, model prediction of growth rates under various medium compositions and subsequent experimental validation showed an increased growth rate with the addition of tryptophan and methionine. © 2016 American Society of Plant Biologists. All rights reserved.

  15. Genome-Scale Metabolic Model for the Green Alga Chlorella vulgaris UTEX 395 Accurately Predicts Phenotypes under Autotrophic, Heterotrophic, and Mixotrophic Growth Conditions

    Science.gov (United States)

    Zuñiga, Cristal; Li, Chien-Ting; Zielinski, Daniel C.; Guarnieri, Michael T.; Antoniewicz, Maciek R.; Zengler, Karsten

    2016-01-01

    The green microalga Chlorella vulgaris has been widely recognized as a promising candidate for biofuel production due to its ability to store high lipid content and its natural metabolic versatility. Compartmentalized genome-scale metabolic models constructed from genome sequences enable quantitative insight into the transport and metabolism of compounds within a target organism. These metabolic models have long been utilized to generate optimized design strategies for an improved production process. Here, we describe the reconstruction, validation, and application of a genome-scale metabolic model for C. vulgaris UTEX 395, iCZ843. The reconstruction represents the most comprehensive model for any eukaryotic photosynthetic organism to date, based on the genome size and number of genes in the reconstruction. The highly curated model accurately predicts phenotypes under photoautotrophic, heterotrophic, and mixotrophic conditions. The model was validated against experimental data and lays the foundation for model-driven strain design and medium alteration to improve yield. Calculated flux distributions under different trophic conditions show that a number of key pathways are affected by nitrogen starvation conditions, including central carbon metabolism and amino acid, nucleotide, and pigment biosynthetic pathways. Furthermore, model prediction of growth rates under various medium compositions and subsequent experimental validation showed an increased growth rate with the addition of tryptophan and methionine. PMID:27372244

  16. GAAP: Genome-organization-framework-Assisted Assembly Pipeline for prokaryotic genomes.

    Science.gov (United States)

    Yuan, Lina; Yu, Yang; Zhu, Yanmin; Li, Yulai; Li, Changqing; Li, Rujiao; Ma, Qin; Siu, Gilman Kit-Hang; Yu, Jun; Jiang, Taijiao; Xiao, Jingfa; Kang, Yu

    2017-01-25

    Next-generation sequencing (NGS) technologies have greatly promoted the genomic study of prokaryotes. However, highly fragmented assemblies due to short reads from NGS are still a limiting factor in gaining insights into the genome biology. Reference-assisted tools are promising in genome assembly, but tend to result in false assembly when the assigned reference has extensive rearrangements. Herein, we present GAAP, a genome assembly pipeline for scaffolding based on core-gene-defined Genome Organizational Framework (cGOF) described in our previous study. Instead of assigning references, we use the multiple-reference-derived cGOFs as indexes to assist in ordering and orienting the scaffolds and to build a skeleton structure, and then use read pairs to extend scaffolds, called local scaffolding, and distinguish between true and chimeric adjacencies in the scaffolds. In our performance tests using both empirical and simulated data of 15 genomes in six species with diverse genome size, complexity, and all three categories of cGOFs, GAAP outcompetes or achieves comparable results when compared to three other reference-assisted programs, AlignGraph, Ragout and MeDuSa. GAAP uses both cGOF and paired-end reads to create assemblies at genomic scale, and performs better than the currently available reference-assisted assembly tools as it recovers more assemblies and makes fewer false placements, especially for species with extensively rearranged genomes. Our method is a promising solution for reconstruction of genome sequence from short reads of NGS.

  17. Capillary self-alignment of mesoscopic foil components for sensor-systems-in-foil

    International Nuclear Information System (INIS)

    Arutinov, Gari; Smits, Edsger C P; Van Heck, Gert; Van den Brand, Jeroen; Schoo, Herman F M; Mastrangeli, Massimo; Dietzel, Andreas

    2012-01-01

    This paper reports on the effective use of capillary self-alignment for low-cost and time-efficient assembly of heterogeneous foil components into a smart electronic identification label. Particularly, we demonstrate the accurate (better than 50 µm) alignment of cm-sized functional foil dies. We investigated the role played by the assembly liquid, by the size and the weight of assembling dies and by their initial offsets in the self-alignment performance. It was shown that there is a definite range of initial offsets allowing dies to align with high accuracy and within approximately the same time window, irrespective of their initial offset. (paper)

  18. The Pinus taeda genome is characterized by diverse and highly diverged repetitive sequences

    Directory of Open Access Journals (Sweden)

    Yandell Mark

    2010-07-01

    Full Text Available Abstract Background In today's age of genomic discovery, no attempt has been made to comprehensively sequence a gymnosperm genome. The largest genus in the coniferous family Pinaceae is Pinus, whose 110-120 species have extremely large genomes (c. 20-40 Gb, 2N = 24). The size and complexity of these genomes have prompted much speculation as to the feasibility of completing a conifer genome sequence. Conifer genomes are reputed to be highly repetitive, but there is little information available on the nature and identity of repetitive units in gymnosperms. The pines have extensive genetic resources, with approximately 329000 ESTs from eleven species and genetic maps in eight species, including a dense genetic map of the twelve linkage groups in Pinus taeda. Results We present here the Sanger sequence and annotation of ten P. taeda BAC clones and Genome Analyzer II whole genome shotgun (WGS) sequences representing 7.5% of the genome. Computational annotation of ten BACs predicts three putative protein-coding genes and at least fifteen likely pseudogenes in nearly one megabase of sequence. We found three conifer-specific LTR retroelements in the BACs, and tentatively identified at least 15 others based on evidence from the distantly related angiosperms. Alignment of WGS sequences to the BACs indicates that 80% of BAC sequences have similar copies (≥ 75% nucleotide identity) elsewhere in the genome, but only 23% have identical copies (99% identity). The three most common repetitive elements in the genome were identified and, when combined, represent less than 5% of the genome. Conclusions This study indicates that the majority of repeats in the P. taeda genome are 'novel' and will therefore require additional BAC or genomic sequencing for accurate characterization. The pine genome contains a very large number of diverged and probably defunct repetitive elements.
This study also provides new evidence that sequencing a pine genome using a WGS approach is

  19. A Workflow to Improve the Alignment of Prostate Imaging with Whole-mount Histopathology.

    Science.gov (United States)

    Yamamoto, Hidekazu; Nir, Dror; Vyas, Lona; Chang, Richard T; Popert, Rick; Cahill, Declan; Challacombe, Ben; Dasgupta, Prokar; Chandra, Ashish

    2014-08-01

    Evaluation of prostate imaging tests against whole-mount histology specimens requires accurate alignment between radiologic and histologic data sets. Misalignment results in false-positive and -negative zones as assessed by imaging. We describe a workflow for three-dimensional alignment of prostate imaging data against whole-mount prostatectomy reference specimens and assess its performance against a standard workflow. Ethical approval was granted. Patients underwent motorized transrectal ultrasound (Prostate Histoscanning) to generate a three-dimensional image of the prostate before radical prostatectomy. The test workflow incorporated steps for axial alignment between imaging and histology, size adjustments following formalin fixation, and use of custom-made parallel cutters and digital caliper instruments. The control workflow comprised freehand cutting and assumed homogeneous block thicknesses at the same relative angles between pathology and imaging sections. Thirty radical prostatectomy specimens were histologically and radiologically processed, either by an alignment-optimized workflow (n = 20) or a control workflow (n = 10). The optimized workflow generated tissue blocks of heterogeneous thicknesses but with no significant drifting in the cutting plane. The control workflow resulted in significantly nonparallel blocks, accurately matching only one out of four histology blocks to their respective imaging data. The image-to-histology alignment accuracy was 20% greater in the optimized workflow (P alignment was observed in the optimized workflow. Evaluation of prostate imaging biomarkers using whole-mount histology references should include a test-to-reference spatial alignment workflow. Copyright © 2014 AUR. Published by Elsevier Inc. All rights reserved.

  20. Advanced Alignment of the ATLAS Inner Detector

    CERN Document Server

    Stahlman, JM; The ATLAS collaboration

    2012-01-01

    The primary goal of the ATLAS Inner Detector (ID) is to measure the trajectories of charged particles in the high particle density environment of the Large Hadron Collider (LHC) collisions. This is achieved using a combination of different technologies, including silicon pixels, silicon microstrips, and gaseous drift-tubes, all immersed in a 2 Tesla magnetic field. With over one million alignable degrees of freedom, it is crucial that an accurate model of the detector positions be produced using an automated and robust algorithm in order to achieve good tracking performance. This has been accomplished using a variety of alignment techniques resulting in near optimal hit and momentum resolutions.

  1. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes

    DEFF Research Database (Denmark)

    Siepel, Adam; Bejerano, Gill; Pedersen, Jakob Skou

    2005-01-01

    We have conducted a comprehensive search for conserved elements in vertebrate genomes, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes). Parallel searches have been performed with multiple alignments of four insect species (three...... species of Drosophila and Anopheles gambiae), two species of Caenorhabditis, and seven species of Saccharomyces. Conserved elements were identified with a computer program called phastCons, which is based on a two-state phylogenetic hidden Markov model (phylo-HMM). PhastCons works by fitting a phylo......-HMM to the data by maximum likelihood, subject to constraints designed to calibrate the model across species groups, and then predicting conserved elements based on this model. The predicted elements cover roughly 3%-8% of the human genome (depending on the details of the calibration procedure) and substantially...

  2. The mitochondrial genome of Grateloupia taiwanensis (Halymeniaceae, Rhodophyta) and comparative mitochondrial genomics of red algae.

    Science.gov (United States)

    DePriest, Michael S; Bhattacharya, Debashish; López-Bautista, Juan M

    2014-10-01

    Although red algae are economically highly valuable for their gelatinous cell wall compounds as well as being integral parts of marine benthic habitats, very little genome data are currently available. We present mitochondrial genome sequence data from the red alga Grateloupia taiwanensis S.-M. Lin & H.-Y. Liang. Comprising 28,906 nucleotide positions, the mitochondrial genome contig contains 25 protein-coding genes and 24 transfer RNA genes. It is highly similar to other red algal genomes in gene content as well as overall structure. An intron in the cox1 gene was found to be shared by G. taiwanensis and Grateloupia angusta (Okamura) S. Kawaguchi & H. W. Wang. We also used whole-genome alignments to compare G. taiwanensis to different groups of red algae, and these results are consistent with the currently accepted phylogeny of Rhodophyta. © 2014 Marine Biological Laboratory.

  3. Analysis Of Transcriptomes In A Porcine Tissue Collection Using RNA-Seq And Genome Assembly 10

    DEFF Research Database (Denmark)

    Hornshøj, Henrik; Thomsen, Bo; Hedegaard, Jakob

    2011-01-01

    The release of Sus scrofa genome assembly 10 supports improvement of the pig genome annotation and in depth transcriptome analyses using next-generation sequencing technologies. In this study we analyze RNA-seq reads from a tissue collection, including 10 separate tissues from Duroc boars and 10...... short read alignment software we mapped the reads to the genome assembly 10. We extracted contig sequences of gene transcripts using the Cufflinks software. Based on this information we identified expressed genes that are present in the genome assembly. The portion of these genes being previously known...... was roughly estimated by sequence comparison to known genes. Similarly, we searched for genes that are expressed in the tissues but not present in the genome assembly by aligning the non-genome-mapped reads to known gene transcripts. For the genes predicted to have alternative transcript variants by Cufflinks...

  4. Genome cluster database. A sequence family analysis platform for Arabidopsis and rice.

    Science.gov (United States)

    Horan, Kevin; Lauricha, Josh; Bailey-Serres, Julia; Raikhel, Natasha; Girke, Thomas

    2005-05-01

    The genome-wide protein sequences from Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) spp. japonica were clustered into families using sequence similarity and domain-based clustering. The two fundamentally different methods resulted in separate cluster sets with complementary properties that compensate for each method's limitations in accurate family analysis. Functional names for the identified families were assigned with an efficient computational approach that uses the description of the most common molecular function gene ontology node within each cluster. Subsequently, multiple alignments and phylogenetic trees were calculated for the assembled families. All clustering results and their underlying sequences were organized in the Web-accessible Genome Cluster Database (http://bioinfo.ucr.edu/projects/GCD) with rich interactive and user-friendly sequence family mining tools to facilitate the analysis of any given family of interest for the plant science community. An automated clustering pipeline ensures current information for future updates in the annotations of the two genomes and clustering improvements. The analysis allowed the first systematic identification of family and singlet proteins present in both organisms as well as those restricted to one of them. In addition, the established Web resources for mining these data provide a road map for future studies of the composition and structure of protein families between the two species.

  5. A quantitative account of genomic island acquisitions in prokaryotes

    Directory of Open Access Journals (Sweden)

    Roos Tom E

    2011-08-01

    Full Text Available Abstract Background Microbial genomes do not merely evolve through the slow accumulation of mutations, but also, and often more dramatically, by taking up new DNA in a process called horizontal gene transfer. These innovation leaps in the acquisition of new traits can take place via the introgression of single genes, but also through the acquisition of large gene clusters, which are termed Genomic Islands. Since only a small proportion of all the DNA diversity has been sequenced, it can be hard to find the appropriate donors for acquired genes via sequence alignments from databases. In contrast, relative oligonucleotide frequencies represent a remarkably stable genomic signature in prokaryotes, which facilitates compositional comparisons as an alignment-free alternative for phylogenetic relatedness. In this project, we test whether Genomic Islands identified in individual bacterial genomes have a similar genomic signature, in terms of relative dinucleotide frequencies, and can therefore be expected to originate from a common donor species. Results When multiple Genomic Islands are present within a single genome, we find that up to 28% of these are compositionally very similar to each other, indicative of frequent recurring acquisitions from the same donor to the same acceptor. Conclusions This represents the first quantitative assessment of common directional transfer events in prokaryotic evolutionary history. We suggest that many of the resident Genomic Islands per prokaryotic genome originated from the same source, which may have implications with respect to their regulatory interactions, and for the elucidation of the common origins of these acquired gene clusters.
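
    The compositional comparison described in this record can be illustrated with a minimal sketch. It assumes the commonly used definition of the dinucleotide relative abundance, rho(xy) = f(xy) / (f(x) f(y)), and a mean absolute difference as the distance; the function names are hypothetical, and the single-strand counting is a simplification rather than the authors' exact procedure.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def dinucleotide_signature(seq):
    """Return rho[xy] = f(xy) / (f(x) * f(y)) for all 16 dinucleotides.

    Assumption: frequencies are counted on the given strand only, and
    positions containing anything other than A, C, G or T are skipped.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    total_mono = sum(mono.values())
    total_di = sum(di.values())
    signature = {}
    for x, y in product(BASES, repeat=2):
        fx = mono[x] / total_mono if total_mono else 0.0
        fy = mono[y] / total_mono if total_mono else 0.0
        fxy = di[x + y] / total_di if total_di else 0.0
        # A dinucleotide whose bases never occur gets a signature value of 0.
        signature[x + y] = fxy / (fx * fy) if fx and fy else 0.0
    return signature

def signature_distance(seq_a, seq_b):
    """Mean absolute difference between two dinucleotide signatures."""
    sig_a = dinucleotide_signature(seq_a)
    sig_b = dinucleotide_signature(seq_b)
    return sum(abs(sig_a[k] - sig_b[k]) for k in sig_a) / len(sig_a)
```

    Two islands with a small `signature_distance` would be candidates for a shared donor under this simplified measure; any cut-off for "compositionally very similar" is left to the reader, since the record does not state one.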

  6. A new method to cluster genomes based on cumulative Fourier power spectrum.

    Science.gov (United States)

    Dong, Rui; Zhu, Ziyue; Yin, Changchuan; He, Rong L; Yau, Stephen S-T

    2018-06-20

    Analyzing phylogenetic relationships using mathematical methods has always been of importance in bioinformatics. Quantitative research may interpret the raw biological data in a precise way. Multiple Sequence Alignment (MSA) is used frequently to analyze biological evolution, but is very time-consuming. When the scale of data is large, alignment methods cannot finish calculation in reasonable time. Therefore, we present a new method using moments of cumulative Fourier power spectrum in clustering the DNA sequences. Each sequence is translated into a vector in Euclidean space. Distances between the vectors can reflect the relationships between sequences. The mapping between the spectra and moment vector is one-to-one, which means that no information is lost in the power spectra during the calculation. We cluster and classify several datasets including Influenza A, primates, and human rhinovirus (HRV) datasets to build up the phylogenetic trees. Results show that the newly proposed cumulative Fourier power spectrum method is much faster and more accurate than MSA and another alignment-free method known as k-mer. The research provides us new insights in the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes. The computer programs of the cumulative Fourier power spectrum are available at GitHub (https://github.com/YaulabTsinghua/cumulative-Fourier-power-spectrum).
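
    The general idea of mapping each sequence to a fixed-length vector can be sketched as follows. This is an illustration only: the indicator-sequence encoding and the discrete Fourier transform are standard, but the moment normalization used here is an assumption and not the formula from the paper, and all function names are hypothetical.

```python
import cmath
import math

def power_spectrum(seq, base):
    """Power spectrum of the 0/1 indicator sequence of one nucleotide."""
    u = [1.0 if c == base else 0.0 for c in seq.upper()]
    n = len(u)
    spectrum = []
    for k in range(n):
        # Naive O(n^2) discrete Fourier transform, fine for short examples.
        x = sum(u[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x) ** 2)
    return spectrum

def cumulative_moment_vector(seq, orders=(1, 2, 3)):
    """Map a DNA sequence of any length (>= 2) to a 4 * len(orders) vector.

    Assumption: moments are means of powers of the cumulative power
    spectrum after scaling it to [0, 1]; the published method may
    normalize differently.
    """
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    features = []
    for base in "ACGT":
        spectrum = power_spectrum(seq, base)
        cumulative, running = [], 0.0
        for p in spectrum[1:]:  # skip k = 0, which only reflects base count
            running += p
            cumulative.append(running)
        total = cumulative[-1] if cumulative[-1] > 0 else 1.0
        for j in orders:
            features.append(sum((c / total) ** j for c in cumulative) / len(cumulative))
    return features

def spectrum_distance(seq_a, seq_b):
    """Euclidean distance between the moment vectors of two sequences."""
    return math.dist(cumulative_moment_vector(seq_a), cumulative_moment_vector(seq_b))
```

    Because every sequence yields a vector of the same length, sequences of different lengths can be compared directly and the pairwise distances passed to any standard clustering or tree-building routine.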

  7. Pigs in sequence space: A 0.66X coverage pig genome survey based on shotgun sequencing

    Directory of Open Access Journals (Sweden)

    Li Wei

    2005-05-01

    Full Text Available Abstract Background Comparative whole genome analysis of Mammalia can benefit from the addition of more species. The pig is an obvious choice due to its economic and medical importance as well as its evolutionary position in the artiodactyls. Results We have generated ~3.84 million shotgun sequences (0.66X coverage) from the pig genome. The data are hereby released (NCBI Trace repository with center name "SDJVP", and project name "Sino-Danish Pig Genome Project") together with an initial evolutionary analysis. The non-repetitive fraction of the sequences was aligned to the UCSC human-mouse alignment and the resulting three-species alignments were annotated using the human genome annotation. Ultra-conserved elements and miRNAs were identified. The results show that for each of these types of orthologous data, pig is much closer to human than mouse is. Purifying selection has been more efficient in pig compared to human, but not as efficient as in mouse, and pig seems to have an isochore structure most similar to the structure in human. Conclusion The addition of the pig to the set of species sequenced at low coverage adds to the understanding of selective pressures that have acted on the human genome by bisecting the evolutionary branch between human and mouse with the mouse branch being approximately 3 times as long as the human branch. Additionally, the joint alignment of the shotgun sequences to the human-mouse alignment offers the investigator a rapid way of defining specific regions for analysis and resequencing.

  8. Alignment and orientation in ion-atom collisions

    International Nuclear Information System (INIS)

    Kimura, M.; Lane, N.F.

    1987-01-01

    Recent progress in the theoretical study of alignment and orientation in atom-atom and ion-atom collisions at intermediate energies is reviewed. Recent systematic studies of the alignment and orientation of electronic charge cloud distributions of excited states resulting from such collisions clearly have provided more detailed information about the underlying collision dynamics. However, since accurate determination of these parameters is quite difficult, both theoretically and experimentally, a close collaboration between theory and experiment is necessary for a deeper understanding of the collision dynamics. A more complete approach, where the full density matrix is determined, is also discussed

  9. CMS Muon Alignment: System Description and first results

    CERN Document Server

    Sobron, M

    2008-01-01

    The CMS detector has been instrumented with a precise and complex opto-mechanical alignment subsystem that provides a common reference frame between Tracker and Muon detection systems by means of a net of laser beams. The system allows a continuous and accurate monitoring of the muon chambers positions with respect to the Tracker body. Preliminary results of operation during the test of the CMS 4T solenoid magnet, performed in 2006, are presented. These measurements complement the information provided by the use of survey techniques and the results of alignment algorithms based on muon tracks crossing the detector.

  10. New families of human regulatory RNA structures identified by comparative analysis of vertebrate genomes

    DEFF Research Database (Denmark)

    Parker, Brian John; Moltke, Ida; Roth, Adam

    2011-01-01

    a comparative method, EvoFam, for genome-wide identification of families of regulatory RNA structures, based on primary sequence and secondary structure similarity. We apply EvoFam to a 41-way genomic vertebrate alignment. Genome-wide, we identify 220 human, high-confidence families outside protein...

  11. Genomic resources for gene discovery, functional genome annotation, and evolutionary studies of maize and its close relatives.

    Science.gov (United States)

    Wang, Chao; Shi, Xue; Liu, Lin; Li, Haiyan; Ammiraju, Jetty S S; Kudrna, David A; Xiong, Wentao; Wang, Hao; Dai, Zhaozhao; Zheng, Yonglian; Lai, Jinsheng; Jin, Weiwei; Messing, Joachim; Bennetzen, Jeffrey L; Wing, Rod A; Luo, Meizhong

    2013-11-01

    Maize is one of the most important food crops and a key model for genetics and developmental biology. A genetically anchored and high-quality draft genome sequence of maize inbred B73 has been obtained to serve as a reference sequence. To facilitate evolutionary studies in maize and its close relatives, much like the Oryza Map Alignment Project (OMAP) (www.OMAP.org) bacterial artificial chromosome (BAC) resource did for the rice community, we constructed BAC libraries for maize inbred lines Zheng58, Chang7-2, and Mo17 and maize wild relatives Zea mays ssp. parviglumis and Tripsacum dactyloides. Furthermore, to extend functional genomic studies to maize and sorghum, we also constructed binary BAC (BIBAC) libraries for the maize inbred B73 and the sorghum landrace Nengsi-1. The BAC/BIBAC vectors facilitate transfer of large intact DNA inserts from BAC clones to the BIBAC vector and functional complementation of large DNA fragments. These seven Zea Map Alignment Project (ZMAP) BAC/BIBAC libraries have average insert sizes ranging from 92 to 148 kb, organellar DNA from 0.17 to 2.3%, empty vector rates between 0.35 and 5.56%, and genome equivalents of 4.7- to 8.4-fold. The usefulness of the Parviglumis and Tripsacum BAC libraries was demonstrated by mapping clones to the reference genome. Novel genes and alleles present in these ZMAP libraries can now be used for functional complementation studies and positional or homology-based cloning of genes for translational genomics.

  12. Plastid: nucleotide-resolution analysis of next-generation sequencing and genomics data.

    Science.gov (United States)

    Dunn, Joshua G; Weissman, Jonathan S

    2016-11-22

    Next-generation sequencing (NGS) informs many biological questions with unprecedented depth and nucleotide resolution. These assays have created a need for analytical tools that enable users to manipulate data nucleotide-by-nucleotide robustly and easily. Furthermore, because many NGS assays encode information jointly within multiple properties of read alignments - for example, in ribosome profiling, the locations of ribosomes are jointly encoded in alignment coordinates and length - analytical tools are often required to extract the biological meaning from the alignments before analysis. Many assay-specific pipelines exist for this purpose, but there remains a need for user-friendly, generalized, nucleotide-resolution tools that are not limited to specific experimental regimes or analytical workflows. Plastid is a Python library designed specifically for nucleotide-resolution analysis of genomics and NGS data. As such, Plastid is designed to extract assay-specific information from read alignments while retaining generality and extensibility to novel NGS assays. Plastid represents NGS and other biological data as arrays of values associated with genomic or transcriptomic positions, and contains configurable tools to convert data from a variety of sources to such arrays. Plastid also includes numerous tools to manipulate even discontinuous genomic features, such as spliced transcripts, with nucleotide precision. Plastid automatically handles conversion between genomic and feature-centric coordinates, accounting for splicing and strand, freeing users of burdensome accounting. Finally, Plastid's data models use consistent and familiar biological idioms, enabling even beginners to develop sophisticated analytical workflows with minimal effort. Plastid is a versatile toolkit that has been used to analyze data from multiple NGS assays, including RNA-seq, ribosome profiling, and DMS-seq. It forms the genomic engine of our ORF annotation tool, ORF-RATER, and is readily

  13. A method for accurate detection of genomic microdeletions using real-time quantitative PCR

    Directory of Open Access Journals (Sweden)

    Bassett Anne S

    2005-12-01

    Full Text Available Abstract Background Quantitative Polymerase Chain Reaction (qPCR) is a well-established method for quantifying levels of gene expression, but has not been routinely applied to the detection of constitutional copy number alterations of human genomic DNA. Microdeletions or microduplications of the human genome are associated with a variety of genetic disorders. Although clinical laboratories routinely use fluorescence in situ hybridization (FISH) to identify such cryptic genomic alterations, there remains a significant number of individuals in whom constitutional genomic imbalance is suspected, based on clinical parameters, but cannot be readily detected using current cytogenetic techniques. Results In this study, a novel application for real-time qPCR is presented that can be used to reproducibly detect chromosomal microdeletions and microduplications. This approach was applied to DNA from a series of patient samples and controls to validate genomic copy number alteration at cytoband 22q11. The study group comprised 12 patients with clinical symptoms of chromosome 22q11 deletion syndrome (22q11DS), 1 patient trisomic for 22q11 and 4 normal controls. 6 of the patients (group 1) had known hemizygous deletions, as detected by standard diagnostic FISH, whilst the remaining 6 patients (group 2) were classified as 22q11DS negative using the clinical FISH assay. Screening of the patients and controls with a set of 10 real time qPCR primers, spanning the 22q11.2-deleted region and flanking sequence, confirmed the FISH assay results for all patients with 100% concordance. Moreover, this qPCR enabled a refinement of the region of deletion at 22q11. Analysis of DNA from the chromosome 22 trisomic sample demonstrated genomic duplication within 22q11. Conclusion In this paper we present a qPCR approach for the detection of chromosomal microdeletions and microduplications.
The strategic use of in silico modelling for qPCR primer design to avoid regions of repetitive

  14. pyPaSWAS: Python-based multi-core CPU and GPU sequence alignment.

    Science.gov (United States)

    Warris, Sven; Timal, N Roshan N; Kempenaar, Marcel; Poortinga, Arne M; van de Geest, Henri; Varbanescu, Ana L; Nap, Jan-Peter

    2018-01-01

    Our previously published CUDA-only application PaSWAS for Smith-Waterman (SW) sequence alignment of any type of sequence on NVIDIA-based GPUs is platform-specific and therefore adopted less than it could be. The OpenCL language is supported more widely and allows use on a variety of hardware platforms. Moreover, there is a need to promote the adoption of parallel computing in bioinformatics by making its use and extension simpler through more and better application of high-level languages commonly used in bioinformatics, such as Python. The novel application pyPaSWAS presents the parallel SW sequence alignment code fully packed in Python. It is a generic SW implementation running on several hardware platforms with multi-core systems and/or GPUs that provides accurate sequence alignments that also can be inspected for alignment details. Additionally, pyPaSWAS supports the affine gap penalty. Python libraries are used for automated system configuration, I/O and logging. This way, the Python environment will stimulate further extension and use of pyPaSWAS. pyPaSWAS presents an easy Python-based environment for accurate and retrievable parallel SW sequence alignments on GPUs and multi-core systems. The strategy of integrating Python with high-performance parallel compute languages to create a developer- and user-friendly environment should be considered for other computationally intensive bioinformatics algorithms.
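
    For readers unfamiliar with the algorithm named in this record, the following is a minimal single-threaded sketch of Smith-Waterman local alignment scoring with an affine gap penalty (the Gotoh recurrences). It does not use or reproduce the pyPaSWAS API; the function name and the scoring parameters are illustrative assumptions.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Return the best local alignment score of sequences a and b.

    Assumed gap convention: a gap of length k costs
    gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    # H: best score of an alignment ending at (i, j), floored at 0 (local).
    # E: best score ending with a gap in a; F: ending with a gap in b.
    H = [[0] * cols for _ in range(rows)]
    E = [[neg_inf] * cols for _ in range(rows)]
    F = [[neg_inf] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + score, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

    Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed independently, which is the property that makes the algorithm suitable for GPU and multi-core parallelization.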

  15. A highly accurate positioning and orientation system based on the usage of four-cluster fibre optic gyros

    International Nuclear Information System (INIS)

    Zhang, Xiaoyue; Lin, Zhili; Zhang, Chunxi

    2013-01-01

    A highly accurate positioning and orientation technique based on four-cluster fibre optic gyros (FOGs) is presented. The four-cluster FOG inertial measurement unit (IMU) comprises three low-precision FOGs, one static high-precision FOG and three accelerometers. To realize high-precision positioning and orientation, the static alignment (north-seeking) before vehicle manoeuvre was divided into a low-precision self-alignment phase and a high-precision north-seeking (online calibration) phase. The high-precision FOG measurement information was introduced to obtain high-precision azimuth alignment (north-seeking) result and achieve online calibration of the low-precision three-cluster FOG. The results of semi-physical simulation were presented to validate the availability and utility of the highly accurate positioning and orientation technique based on the four-cluster FOGs. (paper)

  16. Genome-wide identification of the regulatory targets of a transcription factor using biochemical characterization and computational genomic analysis

    Directory of Open Access Journals (Sweden)

    Jolly Emmitt R

    2005-11-01

    Full Text Available Abstract Background A major challenge in computational genomics is the development of methodologies that allow accurate genome-wide prediction of the regulatory targets of a transcription factor. We present a method for target identification that combines experimental characterization of binding requirements with computational genomic analysis. Results Our method identified potential target genes of the transcription factor Ndt80, a key transcriptional regulator involved in yeast sporulation, using the combined information of binding affinity, positional distribution, and conservation of the binding sites across multiple species. We have also developed a mathematical approach to compute the false positive rate and the total number of targets in the genome based on the multiple selection criteria. Conclusion We have shown that combining biochemical characterization and computational genomic analysis leads to accurate identification of the genome-wide targets of a transcription factor. The method can be extended to other transcription factors and can complement other genomic approaches to transcriptional regulation.

  17. ATGC: a database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes

    Energy Technology Data Exchange (ETDEWEB)

    Novichkov, Pavel S.; Ratnere, Igor; Wolf, Yuri I.; Koonin, Eugene V.; Dubchak, Inna

    2009-07-23

    The database of Alignable Tight Genomic Clusters (ATGCs) consists of closely related genomes of archaea and bacteria, and is a resource for research into prokaryotic microevolution. Construction of a data set with appropriate characteristics is a major hurdle for this type of study. With the current rate of genome sequencing, it is difficult to follow the progress of the field and to determine which of the available genome sets meet the requirements of a given research project, in particular, with respect to the minimum and maximum levels of similarity between the included genomes. Additionally, extraction of specific content, such as genomic alignments or families of orthologs, from a selected set of genomes is a complicated and time-consuming process. The database addresses these problems by providing an intuitive and efficient web interface to browse precomputed ATGCs, select appropriate ones and access ATGC-derived data such as multiple alignments of orthologous proteins, matrices of pairwise intergenomic distances based on genome-wide analysis of synonymous and nonsynonymous substitution rates and others. The ATGC database will be regularly updated following new releases of the NCBI RefSeq. The database is hosted by the Genomics Division at Lawrence Berkeley National Laboratory and is publicly available at http://atgc.lbl.gov.

  18. YAHA: fast and flexible long-read alignment with optimal breakpoint detection.

    Science.gov (United States)

    Faust, Gregory G; Hall, Ira M

    2012-10-01

    With improved short-read assembly algorithms and the recent development of long-read sequencers, split mapping will soon be the preferred method for structural variant (SV) detection. Yet, current alignment tools are not well suited for this. We present YAHA, a fast and flexible hash-based aligner. YAHA is as fast and accurate as BWA-SW at finding the single best alignment per query and is dramatically faster and more sensitive than both SSAHA2 and MegaBLAST at finding all possible alignments. Unlike other aligners that report all, or one, alignment per query, or that use simple heuristics to select alignments, YAHA uses a directed acyclic graph to find the optimal set of alignments that cover a query using a biologically relevant breakpoint penalty. YAHA can also report multiple mappings per defined segment of the query. We show that YAHA detects more breakpoints in less time than BWA-SW across all SV classes, and especially excels at complex SVs comprising multiple breakpoints. YAHA is currently supported on 64-bit Linux systems. Binaries and sample data are freely available for download from http://faculty.virginia.edu/irahall/YAHA.
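
    The idea of choosing an optimal set of alignments over a directed acyclic graph can be sketched in simplified form. The code below is not YAHA's implementation: it assumes that candidate alignments are given as (query_start, query_end, score) tuples, that chained alignments may not overlap on the query, and that every junction between two chained alignments costs one fixed breakpoint penalty. All names and the penalty value are hypothetical.

```python
def best_alignment_set(alignments, breakpoint_penalty=5):
    """Pick the highest-scoring chain of non-overlapping alignments.

    alignments: list of (query_start, query_end_exclusive, score).
    Returns (total_score, chain), where chain is ordered along the query.
    """
    if not alignments:
        return 0, []
    # Sorting by end coordinate gives a topological order of the DAG in
    # which an edge j -> i exists when alignment j ends before i starts.
    alns = sorted(alignments, key=lambda a: (a[1], a[0]))
    best = [0] * len(alns)    # best[i]: best chain score ending at alns[i]
    prev = [-1] * len(alns)   # back-pointer used to recover the chain
    for i, (q_start, _q_end, score) in enumerate(alns):
        best[i] = score
        for j in range(i):
            if alns[j][1] <= q_start:
                candidate = best[j] + score - breakpoint_penalty
                if candidate > best[i]:
                    best[i], prev[i] = candidate, j
    end = max(range(len(alns)), key=best.__getitem__)
    total, chain = best[end], []
    while end != -1:
        chain.append(alns[end])
        end = prev[end]
    chain.reverse()
    return total, chain
```

    With a small penalty the sketch prefers two well-matching split alignments over one longer, poorer alignment; with a large penalty it falls back to the single alignment, which is the trade-off a breakpoint penalty is meant to control.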

  19. AlignMiner: a Web-based tool for detection of divergent regions in multiple sequence alignments of conserved sequences

    Directory of Open Access Journals (Sweden)

    Claros M Gonzalo

    2010-06-01

    Full Text Available Abstract Background Multiple sequence alignments are used to study gene or protein function, phylogenetic relations, genome evolution hypotheses and even gene polymorphisms. Virtually without exception, all available tools focus on conserved segments or residues. Small divergent regions, however, are biologically important for specific quantitative polymerase chain reaction, genotyping, molecular markers and preparation of specific antibodies, and yet have received little attention. As a consequence, they must be selected empirically by the researcher. AlignMiner has been developed to fill this gap in bioinformatic analyses. Results AlignMiner is a Web-based application for detection of conserved and divergent regions in alignments of conserved sequences, focusing particularly on divergence. It accepts alignments (protein or nucleic acid) obtained using any of a variety of algorithms, the choice of which does not appear to have a significant impact on the final results. AlignMiner uses different scoring methods for assessing conserved/divergent regions, Entropy being the method that provides the highest number of regions with the greatest length, and Weighted being the most restrictive. Conserved/divergent regions can be generated either with respect to the consensus sequence or to one master sequence. The resulting data are presented in a graphical interface developed in AJAX, which provides remarkable user interaction capabilities. Users do not need to wait until execution is complete and can even inspect their results on a different computer. Data can be downloaded onto a user disk, in standard formats. In silico and experimental proof-of-concept cases have shown that AlignMiner can be successfully used to design specific polymerase chain reaction primers as well as potential epitopes for antibodies. Primer design is assisted by a module that deploys several oligonucleotide parameters for designing primers "on the fly".
Conclusions AlignMiner can be used
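
    An entropy-based score of the kind mentioned in this record can be illustrated with a short sketch. It assumes plain Shannon entropy per alignment column, with gap characters counted as ordinary symbols, and a simple threshold to call divergent runs; AlignMiner's actual Entropy and Weighted scoring may differ, and the function names, threshold and minimum length are hypothetical.

```python
import math
from collections import Counter

def column_entropies(alignment):
    """Shannon entropy (in bits) of every column of an alignment.

    alignment: list of equal-length aligned sequences (strings).
    """
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for column in zip(*alignment):
        counts = Counter(column)
        n = len(column)
        entropies.append(-sum((c / n) * math.log2(c / n) for c in counts.values()))
    return entropies

def divergent_regions(alignment, threshold=1.0, min_length=2):
    """Return (start, end_exclusive) runs of columns with entropy >= threshold."""
    regions, start = [], None
    # The trailing -inf sentinel closes a run that reaches the last column.
    for i, h in enumerate(column_entropies(alignment) + [float("-inf")]):
        if h >= threshold and start is None:
            start = i
        elif h < threshold and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    return regions
```

    A fully conserved column scores 0 bits, so conserved regions can be found with the same scan by inverting the comparison.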

  20. High-throughput automated microfluidic sample preparation for accurate microbial genomics.

    Science.gov (United States)

    Kim, Soohong; De Jonghe, Joachim; Kulesa, Anthony B; Feldman, David; Vatanen, Tommi; Bhattacharyya, Roby P; Berdy, Brittany; Gomez, James; Nolan, Jill; Epstein, Slava; Blainey, Paul C

    2017-01-27

    Low-cost shotgun DNA sequencing is transforming the microbial sciences. Sequencing instruments are so effective that sample preparation is now the key limiting factor. Here, we introduce a microfluidic sample preparation platform that integrates the key steps in cells to sequence library sample preparation for up to 96 samples and reduces DNA input requirements 100-fold while maintaining or improving data quality. The general-purpose microarchitecture we demonstrate supports workflows with arbitrary numbers of reaction and clean-up or capture steps. By reducing the sample quantity requirements, we enabled low-input (∼10,000 cells) whole-genome shotgun (WGS) sequencing of Mycobacterium tuberculosis and soil micro-colonies with superior results. We also leveraged the enhanced throughput to sequence ∼400 clinical Pseudomonas aeruginosa libraries and demonstrate excellent single-nucleotide polymorphism detection performance that explained phenotypically observed antibiotic resistance. Fully-integrated lab-on-chip sample preparation overcomes technical barriers to enable broader deployment of genomics across many basic research and translational applications.

  1. Dynamics of genome rearrangement in bacterial populations.

    Directory of Open Access Journals (Sweden)

    Aaron E Darling

    2008-07-01

    Full Text Available Genome structure variation has profound impacts on phenotype in organisms ranging from microbes to humans, yet little is known about how natural selection acts on genome arrangement. Pathogenic bacteria such as Yersinia pestis, which causes bubonic and pneumonic plague, often exhibit a high degree of genomic rearrangement. The recent availability of several Yersinia genomes offers an unprecedented opportunity to study the evolution of genome structure and arrangement. We introduce a set of statistical methods to study patterns of rearrangement in circular chromosomes and apply them to the Yersinia. We constructed a multiple alignment of eight Yersinia genomes using Mauve software to identify 78 conserved segments that are internally free from genome rearrangement. Based on the alignment, we applied Bayesian statistical methods to infer the phylogenetic inversion history of Yersinia. The sampling of genome arrangement reconstructions contains seven parsimonious tree topologies, each having different histories of 79 inversions. Topologies with a greater number of inversions also exist, but were sampled less frequently. The inversion phylogenies agree with results suggested by SNP patterns. We then analyzed reconstructed inversion histories to identify patterns of rearrangement. We confirm an over-representation of "symmetric inversions"-inversions with endpoints that are equally distant from the origin of chromosomal replication. Ancestral genome arrangements demonstrate moderate preference for replichore balance in Yersinia. We found that all inversions are shorter than expected under a neutral model, whereas inversions acting within a single replichore are much shorter than expected. We also found evidence for a canonical configuration of the origin and terminus of replication. Finally, breakpoint reuse analysis reveals that inversions with endpoints proximal to the origin of DNA replication are nearly three times more frequent. Our findings

  2. An efficient approach to BAC based assembly of complex genomes.

    Science.gov (United States)

    Visendi, Paul; Berkman, Paul J; Hayashi, Satomi; Golicz, Agnieszka A; Bayer, Philipp E; Ruperao, Pradeep; Hurgobin, Bhavna; Montenegro, Juan; Chan, Chon-Kit Kenneth; Staňková, Helena; Batley, Jacqueline; Šimková, Hana; Doležel, Jaroslav; Edwards, David

    2016-01-01

    There has been an exponential growth in the number of genome sequencing projects since the introduction of next generation DNA sequencing technologies. Genome projects have increasingly involved assembly of whole genome data which produces inferior assemblies compared to traditional Sanger sequencing of genomic fragments cloned into bacterial artificial chromosomes (BACs). While whole genome shotgun sequencing using next generation sequencing (NGS) is relatively fast and inexpensive, this method is extremely challenging for highly complex genomes, where polyploidy or high repeat content confounds accurate assembly, or where a highly accurate 'gold' reference is required. Several attempts have been made to improve genome sequencing approaches by incorporating NGS methods, to variable success. We present the application of a novel BAC sequencing approach which combines indexed pools of BACs, Illumina paired read sequencing, a sequence assembler specifically designed for complex BAC assembly, and a custom bioinformatics pipeline. We demonstrate this method by sequencing and assembling BAC cloned fragments from bread wheat and sugarcane genomes. We demonstrate that our assembly approach is accurate, robust, cost effective and scalable, with applications for complete genome sequencing in large and complex genomes.

  3. Control rod housing alignment and repair method

    International Nuclear Information System (INIS)

    Dixon, R.C.; Deaver, G.A.; Punches, J.R.; Singleton, G.E.; Erbes, J.G.; Offer, H.P.

    1992-01-01

    This patent describes a method for underwater welding of a control rod drive housing inserted through a stub tube to maintain requisite alignment and elevation of the top of the control rod drive housing to an overlying and corresponding aperture in a core plate as measured by an alignment device which determines the relative elevation and angularity with respect to the aperture. It comprises providing a welding cylinder dependent from the alignment device such that the elevation of the top of the welding cylinder is in a fixed relationship to the alignment device and is gas-proof; pressurizing the welding cylinder with inert welding gas sufficient to maintain the interior of the welding cylinder dry; lowering the welding cylinder through the aperture in the core plate by depending the cylinder with respect to the alignment device, the lowering including lowering through and adjusting the elevation relationship of the welding cylinder to the alignment device such that when the alignment device is in position to measure the elevation and angularity of the new control rod drive housing, the lower distal end of the welding cylinder extends below the upper periphery of the stub where welding is to occur; inserting a new control rod drive housing through the stub tube and positioning the control rod drive housing to a predetermined relationship to the anticipated final position of the control rod drive housing; providing welding implements transversely rotatably mounted interior of the welding cylinder relative to the alignment device such that the welding implements may be accurately positioned for dispensing weldment around the periphery of the top of the stub tube and at the side of the control rod drive housing; measuring the elevation and angularity of the control rod drive housing; and dispensing weldment along the top of the stub tube and at the side of the control rod drive housing

  4. Introduction of hind foot coronal alignment view

    International Nuclear Information System (INIS)

    Moon, Il Bong; Jeon, Ju Seob; Yoon, Kang Cheol; Choi, Nam Kil; Kim, Seung Kook

    2006-01-01

    Accurate clinical evaluation of the alignment of the calcaneus relative to the tibia in the coronal plane is essential in the evaluation and treatment of pathologic conditions of the hind foot. Previously described standard anteroposterior, lateral, and oblique radiographic methods for the foot or ankle do not demonstrate the alignment of the tibia in relation to the calcaneus in the coronal plane. The purpose of this study was to introduce the hind foot coronal alignment view. Both feet were imaged simultaneously on an elevated, radiolucent foot stand. The subject stood on the radiolucent platform with equal weight on both feet, with the longitudinal axis of each foot perpendicular to the platform. A silhouette tracing was made around both feet, and a line was then drawn to bisect the silhouette of the second toe and the outline of the heel. The x-ray beam was angled down approximately 15° to 20°. The resulting image shows the tibial axis and the medial and lateral tuberosities of the calcaneus, with the calcaneus not rotated, and the talotibial joint space is visible. Although computed tomographic and magnetic resonance imaging techniques are capable of demonstrating coronal hind foot alignment, they lack usefulness in most clinical situations because the foot is imaged in a non-weight-bearing position. The hind foot coronal alignment view, by contrast, can be obtained to evaluate positional changes of the hind foot in inversion and eversion, and varus and valgus deformity of the calcaneus

  5. Quantifying Temporal Genomic Erosion in Endangered Species.

    Science.gov (United States)

    Díez-Del-Molino, David; Sánchez-Barreiro, Fatima; Barnes, Ian; Gilbert, M Thomas P; Dalén, Love

    2018-03-01

    Many species have undergone dramatic population size declines over the past centuries. Although stochastic genetic processes during and after such declines are thought to elevate the risk of extinction, comparative analyses of genomic data from several endangered species suggest little concordance between genome-wide diversity and current population sizes. This is likely because species-specific life-history traits and ancient bottlenecks overshadow the genetic effect of recent demographic declines. Therefore, we advocate that temporal sampling of genomic data provides a more accurate approach to quantify genetic threats in endangered species. Specifically, genomic data from predecline museum specimens will provide valuable baseline data that enable accurate estimation of recent decreases in genome-wide diversity, increases in inbreeding levels, and accumulation of deleterious genetic variation. Copyright © 2017 Elsevier Ltd. All rights reserved.

  6. GapBlaster-A Graphical Gap Filler for Prokaryote Genomes.

    Directory of Open Access Journals (Sweden)

    Pablo H C G de Sá

    The advent of NGS (Next Generation Sequencing) technologies has resulted in an exponential increase in the number of complete genomes available in biological databases. This advance has allowed the development of several computational tools enabling analyses of large amounts of data in each of the various steps, from processing and quality filtering to gap filling and manual curation. The tools developed for gap closure are very useful as they result in more complete genomes, which will influence downstream analyses of genomic plasticity and comparative genomics. However, the gap filling step remains a challenge for genome assembly, often requiring manual intervention. Here, we present GapBlaster, a graphical application to evaluate and close gaps. GapBlaster was developed in the Java programming language. The software uses contigs obtained in the assembly of the genome to perform an alignment against a draft of the genome/scaffold, using BLAST or Mummer to close gaps. Then, all identified alignments of contigs that extend through the gaps in the draft sequence are presented to the user for further evaluation via the GapBlaster graphical interface. GapBlaster presents significant results compared to other similar software and has the advantage of offering a graphical interface for manual curation of the gaps. The GapBlaster program, the user guide and the test datasets are freely available at https://sourceforge.net/projects/gapblaster2015/. It requires Sun JDK 8 and Blast or Mummer.

  7. Genomic comparison of closely related Giant Viruses supports an accordion-like model of evolution

    OpenAIRE

    Filée, Jonathan

    2015-01-01

    Genome gigantism has so far been observed in Phycodnaviridae and Mimiviridae (order Megavirales). The origin and evolution of these Giant Viruses (GVs) remain open questions. Interestingly, the availability of a collection of closely related GV genomes enabling genomic comparisons offers the opportunity to better understand the different evolutionary forces acting on these genomes. Whole genome alignments for five groups of viruses belonging to the Mimiviridae and Phycodnaviridae families show that there is no tr...

  8. Beam alignment based on two-dimensional power spectral density of a near-field image.

    Science.gov (United States)

    Wang, Shenzhen; Yuan, Qiang; Zeng, Fa; Zhang, Xin; Zhao, Junpu; Li, Kehong; Zhang, Xiaolu; Xue, Qiao; Yang, Ying; Dai, Wanjun; Zhou, Wei; Wang, Yuanchen; Zheng, Kuixing; Su, Jingqin; Hu, Dongxia; Zhu, Qihua

    2017-10-30

    Beam alignment is crucial to high-power laser facilities and is used to adjust the laser beams quickly and accurately to meet stringent requirements of pointing and centering. In this paper, a novel alignment method is presented, which employs data processing of the two-dimensional power spectral density (2D-PSD) for a near-field image and resolves the beam pointing error relative to the spatial filter pinhole directly. Combining this with a near-field fiducial mark, the operation of beam alignment is achieved. It is experimentally demonstrated that this scheme realizes a far-field alignment precision of approximately 3% of the pinhole size. This scheme adopts only one near-field camera to construct the alignment system, which provides a simple, efficient, and low-cost way to align lasers.

  9. Planar self-aligned imprint lithography for coplanar plasmonic nanostructures fabrication

    KAUST Repository

    Wan, Weiwei

    2014-03-01

    Nanoimprint lithography (NIL) is a cost-efficient nanopatterning technology because of its promising advantages of high throughput and high resolution. However, accurate multilevel overlay capability of NIL required for integrated circuit manufacturing remains a challenge due to the high cost of achieving mechanical alignment precision. Although self-aligned imprint lithography was developed to avoid the need of alignment for the vertical layered structures, it has limited usage in the manufacture of the coplanar structures, such as integrated plasmonic devices. In this paper, we develop a new process of planar self-alignment imprint lithography (P-SAIL) to fabricate the metallic and dielectric structures on the same plane. P-SAIL transfers the multilevel imprint processes to a single-imprint process which offers higher efficiency and less cost than existing manufacturing methods. Such concept is demonstrated in an example of fabricating planar plasmonic structures consisting of different materials. © 2014 Springer-Verlag Berlin Heidelberg.

  10. Rapid and Accurate Sequencing of Enterovirus Genomes Using MinION Nanopore Sequencer.

    Science.gov (United States)

    Wang, Ji; Ke, Yue Hua; Zhang, Yong; Huang, Ke Qiang; Wang, Lei; Shen, Xin Xin; Dong, Xiao Ping; Xu, Wen Bo; Ma, Xue Jun

    2017-10-01

    Knowledge of an enterovirus genome sequence is very important in epidemiological investigation to identify transmission patterns and ascertain the extent of an outbreak. The MinION sequencer is increasingly used to sequence various viral pathogens in many clinical situations because of its long reads, portability, real-time accessibility of sequenced data, and very low initial costs. However, information is lacking on MinION sequencing of enterovirus genomes. In this proof-of-concept study using Enterovirus 71 (EV71) and Coxsackievirus A16 (CA16) strains as examples, we established an amplicon-based whole genome sequencing method using MinION. We explored the accuracy, minimum sequencing time, discrimination and high-throughput sequencing ability of MinION, and compared its performance with Sanger sequencing. Within the first minute (min) of sequencing, the accuracy of MinION was 98.5% for the single EV71 strain and 94.12%-97.33% for 10 genetically-related CA16 strains. In as little as 14 min, 99% identity was reached for the single EV71 strain, and in 17 min (on average), 99% identity was achieved for 10 CA16 strains in a single run. MinION is suitable for whole genome sequencing of enteroviruses with sufficient accuracy and fine discrimination and has the potential as a fast, reliable and convenient method for routine use. Copyright © 2017 The Editorial Board of Biomedical and Environmental Sciences. Published by China CDC. All rights reserved.
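
    The accuracy figures above are identity percentages against a reference sequence. As a rough illustration only, the sketch below computes percent identity for two sequences that are already aligned to the same length, with "-" marking gaps. The function name and the choice to count gap columns in the denominator are assumptions; the abstract does not state how the study computed its values.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences carry the same base.

    Assumes the inputs are pre-aligned and equal in length; gap columns
    count as non-matching (an illustrative convention, not the paper's).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Hypothetical toy sequences, not data from the study.
print(percent_identity("ACGTACGT", "ACGTACGA"))
```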

  11. Parente2: a fast and accurate method for detecting identity by descent

    KAUST Repository

    Rodriguez, Jesse M.; Bercovici, Sivan; Huang, Lin; Frostig, Roy; Batzoglou, Serafim

    2014-01-01

    Identity-by-descent (IBD) inference is the problem of establishing a genetic connection between two individuals through a genomic segment that is inherited by both individuals from a recent common ancestor. IBD inference is an important preceding step in a variety of population genomic studies, ranging from demographic studies to linking genomic variation with phenotype and disease. The problem of accurate IBD detection has become increasingly challenging with the availability of large collections of human genotypes and genomes: Given a cohort's size, a quadratic number of pairwise genome comparisons must be performed. Therefore, computation time and the false discovery rate can also scale quadratically. To enable accurate and efficient large-scale IBD detection, we present Parente2, a novel method for detecting IBD segments. Parente2 is based on an embedded log-likelihood ratio and uses a model that accounts for linkage disequilibrium by explicitly modeling haplotype frequencies. Parente2 operates directly on genotype data without the need to phase data prior to IBD inference. We evaluate Parente2's performance through extensive simulations using real data, and we show that it provides substantially higher accuracy compared to previous state-of-the-art methods while maintaining high computational efficiency.
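
    The quadratic growth mentioned above follows from counting unordered pairs: a cohort of n individuals yields n(n-1)/2 pairwise comparisons. A minimal sketch of that count is given below; the cohort sizes are hypothetical and chosen only to show the growth rate.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n individuals: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical cohort sizes, used only to illustrate quadratic growth.
for n in (10, 1_000, 100_000):
    print(n, pair_count(n))

# Cross-check the closed form against explicit enumeration on a small cohort.
assert pair_count(10) == len(list(combinations(range(10), 2)))
```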

  12. Genometa--a fast and accurate classifier for short metagenomic shotgun reads.

    Science.gov (United States)

    Davenport, Colin F; Neugebauer, Jens; Beckmann, Nils; Friedrich, Benedikt; Kameri, Burim; Kokott, Svea; Paetow, Malte; Siekmann, Björn; Wieding-Drewes, Matthias; Wienhöfer, Markus; Wolf, Stefan; Tümmler, Burkhard; Ahlers, Volker; Sprengel, Frauke

    2012-01-01

    Metagenomic studies use high-throughput sequence data to investigate microbial communities in situ. However, considerable challenges remain in the analysis of these data, particularly with regard to speed and reliable analysis of microbial species as opposed to higher level taxa such as phyla. We here present Genometa, a computationally undemanding graphical user interface program that enables identification of bacterial species and gene content from datasets generated by inexpensive high-throughput short read sequencing technologies. Our approach was first verified on two simulated metagenomic short read datasets, detecting 100% and 94% of the bacterial species included with few false positives or false negatives. Subsequent comparative benchmarking analysis against three popular metagenomic algorithms on an Illumina human gut dataset revealed Genometa to attribute the most reads to bacteria at species level (i.e. including all strains of that species) and demonstrate similar or better accuracy than the other programs. Lastly, speed was demonstrated to be many times that of BLAST due to the use of modern short read aligners. Our method is highly accurate if bacteria in the sample are represented by genomes in the reference sequence but cannot find species absent from the reference. This method is one of the most user-friendly and resource efficient approaches and is thus feasible for rapidly analysing millions of short reads on a personal computer. The Genometa program, a step by step tutorial and Java source code are freely available from http://genomics1.mh-hannover.de/genometa/ and on http://code.google.com/p/genometa/. This program has been tested on Ubuntu Linux and Windows XP/7.

  13. Genometa--a fast and accurate classifier for short metagenomic shotgun reads.

    Directory of Open Access Journals (Sweden)

    Colin F Davenport

    Metagenomic studies use high-throughput sequence data to investigate microbial communities in situ. However, considerable challenges remain in the analysis of these data, particularly with regard to speed and reliable analysis of microbial species as opposed to higher level taxa such as phyla. We here present Genometa, a computationally undemanding graphical user interface program that enables identification of bacterial species and gene content from datasets generated by inexpensive high-throughput short read sequencing technologies. Our approach was first verified on two simulated metagenomic short read datasets, detecting 100% and 94% of the bacterial species included with few false positives or false negatives. Subsequent comparative benchmarking analysis against three popular metagenomic algorithms on an Illumina human gut dataset revealed Genometa to attribute the most reads to bacteria at species level (i.e. including all strains of that species) and demonstrate similar or better accuracy than the other programs. Lastly, speed was demonstrated to be many times that of BLAST due to the use of modern short read aligners. Our method is highly accurate if bacteria in the sample are represented by genomes in the reference sequence but cannot find species absent from the reference. This method is one of the most user-friendly and resource efficient approaches and is thus feasible for rapidly analysing millions of short reads on a personal computer. The Genometa program, a step by step tutorial and Java source code are freely available from http://genomics1.mh-hannover.de/genometa/ and on http://code.google.com/p/genometa/. This program has been tested on Ubuntu Linux and Windows XP/7.

  14. Systematic determination of the mosaic structure of bacterial genomes: species backbone versus strain-specific loops

    Directory of Open Access Journals (Sweden)

    Gendrault-Jacquemard A

    2005-07-01

    Background: Public databases now contain a multitude of complete bacterial genomes, including several genomes of the same species. The available data offers new opportunities to address questions about bacterial genome evolution, a task that requires reliable fine comparison data of closely related genomes. Recent analyses have shown, using pairwise whole genome alignments, that it is possible to segment bacterial genomes into a common conserved backbone and strain-specific sequences called loops. Results: Here, we generalize this approach and propose a strategy that allows systematic and non-biased genome segmentation based on multiple genome alignments. Segmentation analyses, as applied to 13 different bacterial species, confirmed the feasibility of our approach to discern the 'mosaic' organization of bacterial genomes. Segmentation results are available through a Web interface permitting functional analysis, extraction and visualization of the backbone/loops structure of documented genomes. To illustrate the potential of this approach, we performed a precise analysis of the mosaic organization of three E. coli strains and functional characterization of the loops. Conclusion: The segmentation results including the backbone/loops structure of 13 bacterial species genomes are new and available for use by the scientific community at the URL: http://genome.jouy.inra.fr/mosaic.
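
    The backbone/loops idea can be pictured as interval arithmetic: once conserved (backbone) blocks are known on a genome, the loops are the stretches that no block covers. The sketch below is only an illustration under simplifying assumptions (a linear coordinate system, half-open intervals, hypothetical block coordinates); it is not the segmentation strategy of the paper, which works from multiple genome alignments.

```python
def loops_from_backbone(genome_length, backbone):
    """Return half-open intervals (start, end) not covered by any backbone block.

    `backbone` is a list of (start, end) half-open intervals, possibly
    overlapping. Circular genomes are not handled in this sketch.
    """
    loops = []
    cursor = 0  # first position not yet covered by a backbone block
    for start, end in sorted(backbone):
        if start > cursor:
            loops.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_length:
        loops.append((cursor, genome_length))
    return loops


# Hypothetical block coordinates on a 100 bp toy genome.
print(loops_from_backbone(100, [(0, 30), (40, 70), (65, 90)]))
```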

  15. High-speed all-optical DNA local sequence alignment based on a three-dimensional artificial neural network.

    Science.gov (United States)

    Maleki, Ehsan; Babashah, Hossein; Koohi, Somayyeh; Kavehvash, Zahra

    2017-07-01

    This paper presents an optical processing approach for exploring a large number of genome sequences. Specifically, we propose an optical correlator for global alignment and an extended moiré matching technique for local analysis of spatially coded DNA, whose output is fed to a novel three-dimensional artificial neural network for local DNA alignment. All-optical implementation of the proposed 3D artificial neural network is developed and its accuracy is verified in Zemax. Thanks to its parallel processing capability, the proposed structure performs local alignment of 4 million sequences of 150 base pairs in a few seconds, which is much faster than its electrical counterparts, such as the basic local alignment search tool.

  16. Quasiparticle Level Alignment for Photocatalytic Interfaces.

    Science.gov (United States)

    Migani, Annapaoala; Mowbray, Duncan J; Zhao, Jin; Petek, Hrvoje; Rubio, Angel

    2014-05-13

    Electronic level alignment at the interface between an adsorbed molecular layer and a semiconducting substrate determines the activity and efficiency of many photocatalytic materials. Standard density functional theory (DFT)-based methods have proven unable to provide a quantitative description of this level alignment. This requires a proper treatment of the anisotropic screening, necessitating the use of quasiparticle (QP) techniques. However, the computational complexity of QP algorithms has meant a quantitative description of interfacial levels has remained elusive. We provide a systematic study of a prototypical interface, bare and methanol-covered rutile TiO2(110) surfaces, to determine the type of many-body theory required to obtain an accurate description of the level alignment. This is accomplished via a direct comparison with metastable impact electron spectroscopy (MIES), ultraviolet photoelectron spectroscopy (UPS), and two-photon photoemission (2PP) spectroscopy. We consider GGA DFT, hybrid DFT, and G0W0, scQPGW1, scQPGW0, and scQPGW QP calculations. Our results demonstrate that G0W0, or our recently introduced scQPGW1 approach, are required to obtain the correct alignment of both the highest occupied and lowest unoccupied interfacial molecular levels (HOMO/LUMO). These calculations set a new standard in the interpretation of electronic structure probe experiments of complex organic molecule/semiconductor interfaces.

  17. Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine

    Directory of Open Access Journals (Sweden)

    Hao Ye

    2015-11-01

    Precision medicine or personalized medicine has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to make a decision on which alignment algorithm to use in their studies. Selecting the right approach involves choosing not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review currently available algorithms and their major strategies such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long reads alignment and alignment facilitated with known genetic variants.
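
    To make the seed-and-extend strategy named above concrete, here is a minimal sketch: exact k-mer seeds from the read are looked up in a hash index of the reference, and each candidate location is then extended by an ungapped comparison that counts mismatches. The function names, the ungapped extension and the mismatch limit are all illustrative assumptions; production aligners use compressed indexes and gapped extension.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (position, mismatches) of the best ungapped hit, or None."""
    best = None
    seen = set()  # candidate start positions already extended
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for hit in index.get(seed, ()):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in seen:
                continue
            seen.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


# Hypothetical toy reference and read (the read carries one substitution).
reference = "TTGACCGTAAGCTTGGA"
index = build_index(reference, k=4)
print(seed_and_extend("CGTAAGGTT", reference, index, k=4))
```

    Because any exact seed can anchor the read, a read with a few substitutions is still found as long as one k-mer remains intact; smaller k tolerates more errors at the cost of more candidate locations.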

  18. Global assessment of genomic variation in cattle by genome resequencing and high-throughput genotyping

    DEFF Research Database (Denmark)

    Zhan, Bujie; Fadista, João; Thomsen, Bo

    2011-01-01

    Background: Integration of genomic variation with phenotypic information is an effective approach for uncovering genotype-phenotype associations. This requires an accurate identification of the different types of variation in individual genomes. Results: We report the integration of the whole genome...... of split-read and read-pair approaches proved to be complementary in finding different signatures. CNVs were identified on the basis of the depth of sequenced reads, and by using SNP and CGH arrays. Conclusions: Our results provide high resolution mapping of diverse classes of genomic variation...

  19. Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix.

    Directory of Open Access Journals (Sweden)

    Jakob H Havgaard

    2007-10-01

    It has become clear that noncoding RNAs (ncRNAs) play important roles in cells, and emerging studies indicate that there might be a large number of unknown ncRNAs in mammalian genomes. There exist computational methods that can be used to search for ncRNAs by comparing sequences from different genomes. One main problem with these methods is their computational complexity, and heuristics are therefore employed. Two heuristics are currently very popular: pre-folding and pre-aligning. However, these heuristics are not ideal, as pre-aligning is dependent on sequence similarity that may not be present and pre-folding ignores the comparative information. Here, pruning of the dynamical programming matrix is presented as an alternative novel heuristic constraint. All subalignments that do not exceed a length-dependent minimum score are discarded as the matrix is filled out, thus giving the advantage of providing the constraints dynamically. This has been included in a new implementation of the FOLDALIGN algorithm for pairwise local or global structural alignment of RNA sequences. It is shown that time and memory requirements are dramatically lowered while overall performance is maintained. Furthermore, a new divide and conquer method is introduced to limit the memory requirement during global alignment and backtrack of local alignment. All branch points in the computed RNA structure are found and used to divide the structure into smaller unbranched segments. Each segment is then realigned and backtracked in a normal fashion. Finally, the FOLDALIGN algorithm has also been updated with a better memory implementation and an improved energy model. With these improvements in the algorithm, the FOLDALIGN software package provides the molecular biologist with an efficient and user-friendly tool for searching for new ncRNAs. The software package is available for download at http://foldalign.ku.dk.
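
    The pruning heuristic can be illustrated on a much simpler problem than structural RNA alignment. The sketch below is a plain sequence local alignment (Smith-Waterman style, linear gap cost) in which a cell is stored only if its subalignment reaches a length-dependent minimum score. It is not FOLDALIGN: the scoring values, the threshold function `min_score` and the sparse dictionary are all assumptions made for illustration.

```python
def pruned_local_alignment(a, b, match=2, mismatch=-1, gap=-2,
                           min_score=lambda length: 0.5 * length):
    """Return (best score, number of stored cells) for a pruned local alignment.

    Each surviving cell holds (score, length) of the best subalignment ending
    there. Cells below the length-dependent minimum are never stored, so later
    cells cannot extend them.
    """
    cells = {}  # (i, j) -> (score, length), surviving subalignments only
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [(sub, 1)]  # start a new subalignment at this pair
            if (i - 1, j - 1) in cells:
                s, n = cells[(i - 1, j - 1)]
                candidates.append((s + sub, n + 1))
            if (i - 1, j) in cells:
                s, n = cells[(i - 1, j)]
                candidates.append((s + gap, n + 1))
            if (i, j - 1) in cells:
                s, n = cells[(i, j - 1)]
                candidates.append((s + gap, n + 1))
            score, length = max(candidates)
            # Pruning step: discard subalignments under the minimum score.
            if score > 0 and score >= min_score(length):
                cells[(i, j)] = (score, length)
                best = max(best, score)
    return best, len(cells)


# Hypothetical toy sequences.
print(pruned_local_alignment("GGACGTCC", "TTACGTAA"))
```

    Raising the threshold stores fewer cells and saves memory, at the risk of discarding a subalignment that would later have recovered; that trade-off is the essence of a pruning heuristic.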

  20. D-GENIES: dot plot large genomes in an interactive, efficient and simple way.

    Science.gov (United States)

    Cabanettes, Floréal; Klopp, Christophe

    2018-01-01

    Dot plots are widely used to quickly compare sequence sets. They provide a synthetic similarity overview, highlighting repetitions, breaks and inversions. Different tools have been developed to easily generate genomic alignment dot plots, but they are often limited in the input sequence size. D-GENIES is a standalone and web application performing large genome alignments using the minimap2 software package and generating interactive dot plots. It enables users to sort query sequences along the reference, zoom in on the plot and download several image, alignment or sequence files. D-GENIES is an easy-to-install, open-source software package (GPL) developed in Python and JavaScript. The source code is available at https://github.com/genotoul-bioinfo/dgenies and it can be tested at http://dgenies.toulouse.inra.fr/.
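
    For readers unfamiliar with dot plots, the toy sketch below marks every position pair where the target and query share an exact k-mer and renders the result as text. This is an assumption-laden simplification: D-GENIES derives its plots from minimap2 alignments, not from exact k-mer matches, and the function names here are hypothetical.

```python
from collections import defaultdict


def kmer_dots(query: str, target: str, k: int):
    """Return sorted (target_pos, query_pos) pairs where a k-mer matches exactly."""
    positions = defaultdict(list)
    for i in range(len(target) - k + 1):
        positions[target[i:i + k]].append(i)
    dots = []
    for j in range(len(query) - k + 1):
        for i in positions.get(query[j:j + k], ()):
            dots.append((i, j))
    return sorted(dots)


def render(dots, width: int, height: int) -> str:
    """Draw the dots on a text grid: target along x, query along y."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for x, y in dots:
        grid[y][x] = "*"
    return "\n".join("".join(row) for row in grid)


# Hypothetical toy sequences; the repeated ACG shows up as off-diagonal dots.
target = query = "ACGTTACG"
k = 3
print(render(kmer_dots(query, target, k), len(target) - k + 1, len(query) - k + 1))
```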

  1. Mitochondrial Genomes of Kinorhyncha: trnM Duplication and New Gene Orders within Animals.

    Science.gov (United States)

    Popova, Olga V; Mikhailov, Kirill V; Nikitin, Mikhail A; Logacheva, Maria D; Penin, Aleksey A; Muntyan, Maria S; Kedrova, Olga S; Petrov, Nikolai B; Panchin, Yuri V; Aleoshin, Vladimir V

    2016-01-01

    Many features of mitochondrial genomes of animals, such as patterns of gene arrangement, nucleotide content and substitution rate variation are extensively used in evolutionary and phylogenetic studies. Nearly 6,000 mitochondrial genomes of animals have already been sequenced, covering the majority of animal phyla. One of the groups that escaped mitogenome sequencing is phylum Kinorhyncha, an isolated taxon of microscopic worm-like ecdysozoans. The kinorhynchs are thought to be one of the early-branching lineages of Ecdysozoa, and their mitochondrial genomes may be important for resolving evolutionary relations between major animal taxa. Here we present the results of sequencing and analysis of mitochondrial genomes from two members of Kinorhyncha, Echinoderes svetlanae (Cyclorhagida) and Pycnophyes kielensis (Allomalorhagida). Their mitochondrial genomes are circular molecules approximately 15 Kbp in size. The kinorhynch mitochondrial gene sequences are highly divergent, which precludes accurate phylogenetic inference. The mitogenomes of both species encode a typical metazoan complement of 37 genes, which are all positioned on the major strand, but the gene order is distinct and unique among Ecdysozoa or animals as a whole. We predict four types of start codons for protein-coding genes in E. svetlanae and five in P. kielensis with a consensus DTD in single letter code. The mitochondrial genomes of E. svetlanae and P. kielensis encode duplicated methionine tRNA genes that display compensatory nucleotide substitutions. Two distant species of Kinorhyncha demonstrate similar patterns of gene arrangements in their mitogenomes. Both genomes have duplicated methionine tRNA genes; the duplication predates the divergence of the two species. The kinorhynchs share a few features pertaining to gene order that align them with Priapulida. Gene order analysis reveals that the gene arrangement specific to Priapulida may be ancestral for Scalidophora, Ecdysozoa, and even Protostomia.

  2. Mitochondrial Genomes of Kinorhyncha: trnM Duplication and New Gene Orders within Animals.

    Directory of Open Access Journals (Sweden)

    Olga V Popova

    Many features of mitochondrial genomes of animals, such as patterns of gene arrangement, nucleotide content and substitution rate variation are extensively used in evolutionary and phylogenetic studies. Nearly 6,000 mitochondrial genomes of animals have already been sequenced, covering the majority of animal phyla. One of the groups that escaped mitogenome sequencing is phylum Kinorhyncha, an isolated taxon of microscopic worm-like ecdysozoans. The kinorhynchs are thought to be one of the early-branching lineages of Ecdysozoa, and their mitochondrial genomes may be important for resolving evolutionary relations between major animal taxa. Here we present the results of sequencing and analysis of mitochondrial genomes from two members of Kinorhyncha, Echinoderes svetlanae (Cyclorhagida) and Pycnophyes kielensis (Allomalorhagida). Their mitochondrial genomes are circular molecules approximately 15 Kbp in size. The kinorhynch mitochondrial gene sequences are highly divergent, which precludes accurate phylogenetic inference. The mitogenomes of both species encode a typical metazoan complement of 37 genes, which are all positioned on the major strand, but the gene order is distinct and unique among Ecdysozoa or animals as a whole. We predict four types of start codons for protein-coding genes in E. svetlanae and five in P. kielensis with a consensus DTD in single letter code. The mitochondrial genomes of E. svetlanae and P. kielensis encode duplicated methionine tRNA genes that display compensatory nucleotide substitutions. Two distant species of Kinorhyncha demonstrate similar patterns of gene arrangements in their mitogenomes. Both genomes have duplicated methionine tRNA genes; the duplication predates the divergence of the two species. The kinorhynchs share a few features pertaining to gene order that align them with Priapulida. Gene order analysis reveals that the gene arrangement specific to Priapulida may be ancestral for Scalidophora, Ecdysozoa, and even

  3. L-GRAAL: Lagrangian graphlet-based network aligner.

    Science.gov (United States)

    Malod-Dognin, Noël; Pržulj, Nataša

    2015-07-01

    Discovering and understanding patterns in networks of protein-protein interactions (PPIs) is a central problem in systems biology. Alignments between these networks aid functional understanding as they uncover important information, such as evolutionarily conserved pathways, protein complexes and functional orthologs. A few methods have been proposed for global PPI network alignments, but because of the NP-completeness of the underlying sub-graph isomorphism problem, producing topologically and biologically accurate alignments remains a challenge. We introduce a novel global network alignment tool, Lagrangian GRAphlet-based ALigner (L-GRAAL), which directly optimizes both the protein and the interaction functional conservations, using a novel alignment search heuristic based on integer programming and Lagrangian relaxation. We compare L-GRAAL with the state-of-the-art network aligners on the largest available PPI networks from BioGRID and observe that L-GRAAL uncovers the largest common sub-graphs between the networks, as measured by edge-correctness and symmetric sub-structures scores, which allow transferring more functional information across networks. We assess the biological quality of the protein mappings using the semantic similarity of their Gene Ontology annotations and observe that L-GRAAL best uncovers functionally conserved proteins. Furthermore, we introduce for the first time a measure of the semantic similarity of the mapped interactions and show that L-GRAAL also best uncovers functionally conserved interactions. In addition, we illustrate on the PPI networks of baker's yeast and human the ability of L-GRAAL to predict new PPIs. Finally, L-GRAAL's results are the first to show that topological information is more important than sequence information for uncovering functionally conserved interactions. L-GRAAL is coded in C++. Software is available at: http://bio-nets.doc.ic.ac.uk/L-GRAAL/. Contact: n.malod-dognin@imperial.ac.uk. Supplementary data are available at

  4. Genome-wide identification of coding and non-coding conserved sequence tags in human and mouse genomes

    Directory of Open Access Journals (Sweden)

    Maggi Giorgio P

    2008-06-01

    Background: The accurate detection of genes and the identification of functional regions is still an open issue in the annotation of genomic sequences. This problem affects new genomes but also those of very well studied organisms such as human and mouse where, despite the great efforts, the inventory of genes and regulatory regions is far from complete. Comparative genomics is an effective approach to address this problem. Unfortunately it is limited by the computational requirements needed to perform genome-wide comparisons and by the problem of discriminating between conserved coding and non-coding sequences. This discrimination is often based (thus dependent) on the availability of annotated proteins. Results: In this paper we present the results of a comprehensive comparison of human and mouse genomes performed with a new high throughput grid-based system which allows the rapid detection of conserved sequences and accurate assessment of their coding potential. By detecting clusters of coding conserved sequences the system is also suitable to accurately identify potential gene loci. Following this analysis we created a collection of human-mouse conserved sequence tags and carefully compared our results to reliable annotations in order to benchmark the reliability of our classifications. Strikingly we were able to detect several potential gene loci supported by EST sequences but not corresponding to as yet annotated genes. Conclusion: Here we present a new system which allows comprehensive comparison of genomes to detect conserved coding and non-coding sequences and the identification of potential gene loci. Our system does not require the availability of any annotated sequence and is thus suitable for the analysis of new or poorly annotated genomes.

  5. Mitochondrial genome sequences and comparative genomics of Phytophthora ramorum and P. sojae

    Energy Technology Data Exchange (ETDEWEB)

    Martin, Frank N.; Douda, Bensasson; Tyler, Brett M.; Boore, Jeffrey L.

    2007-01-01

    The complete sequences of the mitochondrial genomes of the oomycetes Phytophthora ramorum and P. sojae were determined during the course of their complete nuclear genome sequencing (Tyler, et al. 2006). Both are circular, with sizes of 39,314 bp for P. ramorum and 42,975 bp for P. sojae. Each contains a total of 37 identifiable protein-encoding genes, 25 or 26 tRNAs (P. sojae and P. ramorum, respectively) specifying 19 amino acids, and a variable number of ORFs (7 for P. ramorum and 12 for P. sojae) which are potentially additional functional genes. Non-coding regions comprise approximately 11.5 percent and 18.4 percent of the genomes of P. ramorum and P. sojae, respectively. Relative to P. sojae, there is an inverted repeat of 1,150 bp in P. ramorum that includes an unassigned unique ORF, a tRNA gene, and adjacent non-coding sequences, but otherwise the gene order in both species is identical. Comparisons of these genomes with published sequences of the P. infestans mitochondrial genome reveal a number of similarities, but the gene order in P. infestans differs in two adjacent locations due to inversions. Sequence alignments of the three genomes indicated sequence conservation ranging from 75 to 85 percent and that specific regions were more variable than others.
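
    The stated genome sizes and non-coding percentages can be turned into approximate base-pair figures. The short calculation below uses only the numbers quoted in the abstract; because the percentages are given as approximate, the resulting base-pair values are estimates, and the helper name is hypothetical.

```python
# Genome size in bp and approximate non-coding percentage, as quoted above.
genomes = {
    "P. ramorum": (39_314, 11.5),
    "P. sojae": (42_975, 18.4),
}


def noncoding_bp(size_bp: int, percent: float) -> int:
    """Estimate non-coding length in bp from genome size and a percentage."""
    return round(size_bp * percent / 100)


for name, (size_bp, percent) in genomes.items():
    print(f"{name}: about {noncoding_bp(size_bp, percent)} bp non-coding")

# Difference in total genome size between the two species.
print(genomes["P. sojae"][0] - genomes["P. ramorum"][0], "bp")
```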

  6. Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers.

    Science.gov (United States)

    Galpert, Deborah; Fernández, Alberto; Herrera, Francisco; Antunes, Agostinho; Molina-Ruiz, Reinaldo; Agüero-Chapin, Guillermin

    2018-05-03

    The development of new ortholog detection algorithms and the improvement of existing ones are of major importance in functional genomics. We have previously introduced a successful supervised pairwise ortholog classification approach implemented in a big data platform that considered several pairwise protein features and the low ortholog pair ratios found between two annotated proteomes (Galpert, D et al., BioMed Research International, 2015). The supervised models were built and tested using a Saccharomycete yeast benchmark dataset proposed by Salichos and Rokas (2011). Although several pairwise protein features were combined in a supervised big data approach, they were all, to some extent, alignment-based features, and the proposed algorithms were evaluated on a unique test set. Here, we aim to evaluate the impact of alignment-free features on the performance of supervised models implemented in the Spark big data platform for pairwise ortholog detection in several related yeast proteomes. The Spark Random Forest and Decision Trees with oversampling and undersampling techniques, built with only alignment-based similarity measures or combined with several alignment-free pairwise protein features, showed the highest classification performance for ortholog detection in three yeast proteome pairs. Although such supervised approaches outperformed traditional methods, there were no significant differences between the exclusive use of alignment-based similarity measures and their combination with alignment-free features, even within the twilight zone of the studied proteomes. Only when alignment-based and alignment-free features were combined in Spark Decision Trees with imbalance management could a higher success rate (98.71%) within the twilight zone be achieved for a yeast proteome pair that underwent a whole genome duplication. The feature selection study showed that alignment-based features were top-ranked for the best classifiers while the runners-up were

  7. SMC Progressively Aligns Chromosomal Arms in Caulobacter crescentus but Is Antagonized by Convergent Transcription

    Directory of Open Access Journals (Sweden)

    Ngat T. Tran

    2017-08-01

    Full Text Available The structural maintenance of chromosomes (SMC) complex plays an important role in chromosome organization and segregation in most living organisms. In Caulobacter crescentus, SMC is required to align the left and the right arms of the chromosome that run in parallel down the long axis of the cell. However, the mechanism of SMC-mediated alignment of chromosomal arms remains elusive. Here, using genome-wide methods and microscopy of single cells, we show that Caulobacter SMC is recruited to the centromeric parS site and that SMC-mediated arm alignment depends on the chromosome-partitioning protein ParB. We provide evidence that SMC likely tethers the parS-proximal regions of the chromosomal arms together, promoting arm alignment. Furthermore, we show that highly transcribed genes near parS that are oriented against SMC translocation disrupt arm alignment, suggesting that head-on transcription interferes with SMC translocation. Our results demonstrate a tight interdependence of bacterial chromosome organization and global patterns of transcription.

  8. Alignment of in-vessel components by metrology defined adaptive machining

    International Nuclear Information System (INIS)

    Wilson, David; Bernard, Nathanaël; Mariani, Antony

    2015-01-01

    Highlights: • Advanced metrology techniques developed for large volume high density in-vessel surveys. • Virtual alignment process employed to optimize the alignment of 440 blanket modules. • Auto-geometry construct, from survey data, using CAD proximity detection and orientation logic. • HMI developed to relocate blanket modules if customization limits on interfaces are exceeded. • Data export format derived for Catia parametric models, defining customization requirements. - Abstract: The assembly of ITER will involve the precise and accurate alignment of a large number of components and assemblies in areas where access will often be severely constrained and where process efficiency will be critical. One such area is the inside of the vacuum vessel where several thousand components shall be custom machined to provide the alignment references for in-vessel systems. The paper gives an overview of the process that will be employed; to survey the interfaces for approximately 3500 components then define and execute the customization process.

  9. Alignment of in-vessel components by metrology defined adaptive machining

    Energy Technology Data Exchange (ETDEWEB)

    Wilson, David [ITER Organization, Route de Vinon sur Verdon, CS90 046, St Paul-lez-Durance (France); Bernard, Nathanaël [G2Métric, Launaguet 31140 (France); Mariani, Antony [Spatial Alignment Ltd., Witney (United Kingdom)

    2015-10-15

    Highlights: • Advanced metrology techniques developed for large volume high density in-vessel surveys. • Virtual alignment process employed to optimize the alignment of 440 blanket modules. • Auto-geometry construct, from survey data, using CAD proximity detection and orientation logic. • HMI developed to relocate blanket modules if customization limits on interfaces are exceeded. • Data export format derived for Catia parametric models, defining customization requirements. - Abstract: The assembly of ITER will involve the precise and accurate alignment of a large number of components and assemblies in areas where access will often be severely constrained and where process efficiency will be critical. One such area is the inside of the vacuum vessel where several thousand components shall be custom machined to provide the alignment references for in-vessel systems. The paper gives an overview of the process that will be employed; to survey the interfaces for approximately 3500 components then define and execute the customization process.

  10. An SVD-based comparison of nine whole eukaryotic genomes supports a coelomate rather than ecdysozoan lineage

    Directory of Open Access Journals (Sweden)

    Stuart Gary W

    2004-12-01

    Full Text Available Abstract Background: Eukaryotic whole genome sequences are accumulating at an impressive rate. Effective methods for comparing multiple whole eukaryotic genomes on a large scale are needed. Most attempted solutions involve the production of large scale alignments, and many of these require a high stringency pre-screen for putative orthologs in order to reduce the effective size of the dataset and provide a reasonably high but unknown fraction of correctly aligned homologous sites for comparison. As an alternative, highly efficient methods that do not require the pre-alignment of operationally defined orthologs are also being explored. Results: A non-alignment method based on the Singular Value Decomposition (SVD) was used to compare the predicted protein complement of nine whole eukaryotic genomes ranging from yeast to man. This analysis resulted in the simultaneous identification and definition of a large number of well conserved motifs and gene families, and produced a species tree supporting one of two conflicting hypotheses of metazoan relationships. Conclusions: Our SVD-based analysis of the entire protein complement of nine whole eukaryotic genomes suggests that highly conserved motifs and gene families can be identified and effectively compared in a single coherent definition space for the easy extraction of gene and species trees. While this occurs without the explicit definition of orthologs or homologous sites, the analysis can provide a basis for these definitions.
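
    As an illustration only, the general idea of an alignment-free, SVD-based comparison can be sketched as follows. The abstract does not say how proteins were turned into vectors, so the dipeptide counts, the function names and the choice of rank here are all assumptions, not the authors' pipeline.

```python
import itertools
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_matrix(proteins, k=2):
    """Rows are proteins, columns are counts of every possible k-mer (assumed encoding)."""
    kmers = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    counts = np.zeros((len(proteins), len(kmers)))
    for i, seq in enumerate(proteins):
        for start in range(len(seq) - k + 1):
            j = index.get(seq[start:start + k])
            if j is not None:  # skip k-mers containing non-standard residues
                counts[i, j] += 1
    return counts

def svd_embed(counts, rank):
    """Coordinates of each protein in the space of the top `rank` singular vectors."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :rank] * s[:rank]

def cosine_similarity(vectors):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

# Hypothetical toy sequences: the first two differ by one residue.
proteins = ["MKVLAAGIVG", "MKVLAAGIVA", "WWPPDDEEHH"]
similarity = cosine_similarity(svd_embed(kmer_matrix(proteins), rank=2))
```

    In such a sketch, proteins that share many short peptides end up close together in the reduced space without any pairwise alignment being computed.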

  11. Alignment of the Thomson scattering diagnostic on NSTX

    International Nuclear Information System (INIS)

    LeBlanc, B P; Diallo, A

    2013-01-01

    The Thomson scattering diagnostic can provide profile measurements of the electron temperature, T_e, and density, n_e, in plasmas. Proper laser beam path and optics arrangement permits measurement of the profiles T_e(R) and n_e(R) along the major radius R. Keeping proper alignment between the laser beam path and the collection optics is necessary for an accurate determination of the electron density. As time progresses the relative position of the collection optics field of view with respect to the laser beam path will invariably shift. This can be kept to a minimum by proper attention to the physical arrangement of the collection and laser-beam delivery optics. A system has been in place to monitor the relative position between laser beam and collection optics. Variation of the alignment can be detected before it begins to affect the quality of the profile data. This paper discusses details of the instrumentation and techniques used to maintain alignment during NSTX multi-month experimental campaigns.

  12. The diploid genome sequence of an Asian individual

    DEFF Research Database (Denmark)

    Wang, Jun; Wang, Wei; Li, Ruiqiang

    2008-01-01

    Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J...

  13. A Python Script for Aligning the STIS Echelle Blaze Function

    Science.gov (United States)

    Baer, Malinda; Proffitt, Charles R.; Lockwood, Sean A.

    2018-01-01

    Accurate flux calibration for the STIS echelle modes is heavily dependent on the proper alignment of the blaze function for each spectral order. However, due to changes in the instrument alignment over time and between exposures, the blaze function can shift in wavelength. This may result in flux calibration inconsistencies of up to 10%. We present the stisblazefix Python module as a tool for STIS users to correct their echelle spectra. The stisblazefix module assumes that the error in the blaze alignment is a linear function of spectral order, and finds the set of shifts that minimizes the flux inconsistencies in the overlap between spectral orders. We discuss the uses and limitations of this tool, and show that its use can provide significant improvements to the default pipeline flux calibration for many observations.

  14. Single-molecule optical genome mapping of a human HapMap and a colorectal cancer cell line.

    Science.gov (United States)

    Teo, Audrey S M; Verzotto, Davide; Yao, Fei; Nagarajan, Niranjan; Hillmer, Axel M

    2015-01-01

    Next-generation sequencing (NGS) technologies have changed our understanding of the variability of the human genome. However, the identification of genome structural variations based on NGS approaches with read lengths of 35-300 bases remains a challenge. Single-molecule optical mapping technologies allow the analysis of DNA molecules of up to 2 Mb and as such are suitable for the identification of large-scale genome structural variations, and for de novo genome assemblies when combined with short-read NGS data. Here we present optical mapping data for two human genomes: the HapMap cell line GM12878 and the colorectal cancer cell line HCT116. High molecular weight DNA was obtained by embedding GM12878 and HCT116 cells, respectively, in agarose plugs, followed by DNA extraction under mild conditions. Genomic DNA was digested with KpnI and 310,000 and 296,000 DNA molecules (≥ 150 kb and 10 restriction fragments), respectively, were analyzed per cell line using the Argus optical mapping system. Maps were aligned to the human reference by OPTIMA, a new glocal alignment method. Genome coverage of 6.8× and 5.7× was obtained, respectively; 2.9× and 1.7× more than the coverage obtained with previously available software. Optical mapping allows the resolution of large-scale structural variations of the genome, and the scaffold extension of NGS-based de novo assemblies. OPTIMA is an efficient new alignment method; our optical mapping data provide a resource for genome structure analyses of the human HapMap reference cell line GM12878, and the colorectal cancer cell line HCT116.
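
    For readers unfamiliar with optical maps, the fragment pattern that a KpnI digest would be expected to produce from a known sequence can be computed in silico. The sketch below is a minimal illustration, not part of OPTIMA or the Argus system: the function names are hypothetical, the recognition site GGTACC (cut after the fifth base) is the standard KpnI site, and the filter mirrors the thresholds quoted in the abstract, read here as at least 150 kb and at least 10 fragments.

```python
def digest(sequence, site="GGTACC", cut_offset=5):
    """Return fragment lengths after cutting a linear sequence at every site.

    GGTACC is palindromic, so scanning one strand finds every KpnI site.
    """
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

def passes_filter(fragments, min_length=150_000, min_fragments=10):
    """Keep molecules that are long enough and carry enough fragments."""
    return sum(fragments) >= min_length and len(fragments) >= min_fragments

# Toy 24-base sequence with two KpnI sites.
fragments = digest("AAAAGGTACCTTTTTTGGTACCAA")
```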

  15. De novo assembly of a haplotype-resolved human genome.

    Science.gov (United States)

    Cao, Hongzhi; Wu, Honglong; Luo, Ruibang; Huang, Shujia; Sun, Yuhui; Tong, Xin; Xie, Yinlong; Liu, Binghang; Yang, Hailong; Zheng, Hancheng; Li, Jian; Li, Bo; Wang, Yu; Yang, Fang; Sun, Peng; Liu, Siyang; Gao, Peng; Huang, Haodong; Sun, Jing; Chen, Dan; He, Guangzhu; Huang, Weihua; Huang, Zheng; Li, Yue; Tellier, Laurent C A M; Liu, Xiao; Feng, Qiang; Xu, Xun; Zhang, Xiuqing; Bolund, Lars; Krogh, Anders; Kristiansen, Karsten; Drmanac, Radoje; Drmanac, Snezana; Nielsen, Rasmus; Li, Songgang; Wang, Jian; Yang, Huanming; Li, Yingrui; Wong, Gane Ka-Shu; Wang, Jun

    2015-06-01

    The human genome is diploid, and knowledge of the variants on each chromosome is important for the interpretation of genomic information. Here we report the assembly of a haplotype-resolved diploid genome without using a reference genome. Our pipeline relies on fosmid pooling together with whole-genome shotgun strategies, based solely on next-generation sequencing and hierarchical assembly methods. We applied our sequencing method to the genome of an Asian individual and generated a 5.15-Gb assembled genome with a haplotype N50 of 484 kb. Our analysis identified previously undetected indels and 7.49 Mb of novel coding sequences that could not be aligned to the human reference genome, which include at least six predicted genes. This haplotype-resolved genome represents the most complete de novo human genome assembly to date. Application of our approach to identify individual haplotype differences should aid in translating genotypes to phenotypes for the development of personalized medicine.
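
    The N50 statistic quoted above is a standard contiguity measure: the length L such that pieces of length L or longer contain at least half of the total assembled length. A minimal sketch of the calculation follows; the function name and the toy lengths are illustrative and not taken from the paper.

```python
def n50(lengths):
    """Smallest length L such that pieces of length >= L cover half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")

# Toy haplotype block lengths (arbitrary units).
example = n50([2, 3, 4, 5, 6])
```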

  16. Cryo-EM image alignment based on nonuniform fast Fourier transform

    International Nuclear Information System (INIS)

    Yang Zhengfan; Penczek, Pawel A.

    2008-01-01

    In single particle analysis, two-dimensional (2-D) alignment is a fundamental step intended to put into register various particle projections of biological macromolecules collected at the electron microscope. The efficiency and quality of three-dimensional (3-D) structure reconstruction largely depends on the computational speed and alignment accuracy of this crucial step. In order to improve the performance of alignment, we introduce a new method that takes advantage of the highly accurate interpolation scheme based on the gridding method, a version of the nonuniform fast Fourier transform, and utilizes a multi-dimensional optimization algorithm for the refinement of the orientation parameters. Using simulated data, we demonstrate that by using less than half of the sample points and taking twice the runtime, our new 2-D alignment method achieves dramatically better alignment accuracy than that based on quadratic interpolation. We also apply our method to image to volume registration, the key step in the single particle EM structure refinement protocol. We find that in this case the accuracy of the method not only surpasses the accuracy of the commonly used real-space implementation, but results are achieved in much shorter time, making gridding-based alignment a perfect candidate for efficient structure determination in single particle analysis

  17. Cryo-EM image alignment based on nonuniform fast Fourier transform.

    Science.gov (United States)

    Yang, Zhengfan; Penczek, Pawel A

    2008-08-01

    In single particle analysis, two-dimensional (2-D) alignment is a fundamental step intended to put into register various particle projections of biological macromolecules collected at the electron microscope. The efficiency and quality of three-dimensional (3-D) structure reconstruction largely depends on the computational speed and alignment accuracy of this crucial step. In order to improve the performance of alignment, we introduce a new method that takes advantage of the highly accurate interpolation scheme based on the gridding method, a version of the nonuniform fast Fourier transform, and utilizes a multi-dimensional optimization algorithm for the refinement of the orientation parameters. Using simulated data, we demonstrate that by using less than half of the sample points and taking twice the runtime, our new 2-D alignment method achieves dramatically better alignment accuracy than that based on quadratic interpolation. We also apply our method to image to volume registration, the key step in the single particle EM structure refinement protocol. We find that in this case the accuracy of the method not only surpasses the accuracy of the commonly used real-space implementation, but results are achieved in much shorter time, making gridding-based alignment a perfect candidate for efficient structure determination in single particle analysis.
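
    To make the notion of 2-D alignment concrete, the sketch below recovers a purely translational offset between two images from the peak of their FFT-based cross-correlation. This is a deliberately simpler, swapped-in technique: it works on a uniform grid with integer shifts, and it is not the gridding-based nonuniform FFT method with orientation refinement described in the abstract. The function name is hypothetical.

```python
import numpy as np

def translational_shift(image, reference):
    """Integer (row, col) shift s such that image == np.roll(reference, s)."""
    cross_power = np.fft.fft2(image) * np.conj(np.fft.fft2(reference))
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Indices past the midpoint correspond to negative (wrapped) shifts.
    return tuple(
        int(p) if p <= n // 2 else int(p) - n
        for p, n in zip(peak, correlation.shape)
    )

rng = np.random.default_rng(0)
reference = rng.random((32, 32))
shifted = np.roll(reference, (3, -5), axis=(0, 1))
shift = translational_shift(shifted, reference)
```

    Rotational alignment and sub-pixel accuracy, which are the focus of the paper, require resampling the image off the regular grid; that resampling is the step the gridding scheme addresses.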

  18. Parente2: a fast and accurate method for detecting identity by descent

    KAUST Repository

    Rodriguez, Jesse M.

    2014-10-01

    Identity-by-descent (IBD) inference is the problem of establishing a genetic connection between two individuals through a genomic segment that is inherited by both individuals from a recent common ancestor. IBD inference is an important preceding step in a variety of population genomic studies, ranging from demographic studies to linking genomic variation with phenotype and disease. The problem of accurate IBD detection has become increasingly challenging with the availability of large collections of human genotypes and genomes: Given a cohort's size, a quadratic number of pairwise genome comparisons must be performed. Therefore, computation time and the false discovery rate can also scale quadratically. To enable accurate and efficient large-scale IBD detection, we present Parente2, a novel method for detecting IBD segments. Parente2 is based on an embedded log-likelihood ratio and uses a model that accounts for linkage disequilibrium by explicitly modeling haplotype frequencies. Parente2 operates directly on genotype data without the need to phase data prior to IBD inference. We evaluate Parente2's performance through extensive simulations using real data, and we show that it provides substantially higher accuracy compared to previous state-of-the-art methods while maintaining high computational efficiency.
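
    The quadratic scaling mentioned above follows from the number of unordered pairs in a cohort of n individuals, n(n-1)/2. The short sketch below shows the arithmetic; the function names and the per-pair false-positive rate are hypothetical and are not figures from the paper.

```python
from math import comb

def pairwise_comparisons(cohort_size):
    """Number of unordered pairs of individuals that must be compared."""
    return comb(cohort_size, 2)

def expected_false_positives(cohort_size, per_pair_rate):
    """Expected false IBD calls if every pair has the same false-positive rate."""
    return pairwise_comparisons(cohort_size) * per_pair_rate

pairs_small = pairwise_comparisons(1_000)
pairs_large = pairwise_comparisons(10_000)
```

    A tenfold larger cohort therefore needs roughly a hundredfold more comparisons, and at a fixed per-pair error rate the expected number of false calls grows by the same factor.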

  19. RNA structure alignment by a unit-vector approach.

    Science.gov (United States)

    Capriotti, Emidio; Marti-Renom, Marc A

    2008-08-15

    The recent discovery of tiny RNA molecules such as microRNAs and small interfering RNA is transforming the view of RNA as a simple information transfer molecule. Similar to proteins, the native three-dimensional structure of RNA determines its biological activity. Therefore, classifying the current structural space is paramount for functionally annotating RNA molecules. The increasing number of RNA structures deposited in the PDB requires more accurate, automatic and benchmarked methods for RNA structure comparison. In this article, we introduce a new algorithm for RNA structure alignment based on a unit-vector approach. The algorithm has been implemented in the SARA program, which outputs pairwise RNA structure alignments and their statistical significance. The SARA program has been implemented to be of general applicability even when no secondary structure can be calculated from the RNA structures. A benchmark against the ARTS program using a set of 1275 non-redundant pairwise structure alignments results in approximately 6% extra alignments with at least 50% structurally superposed nucleotides and base pairs. A first attempt to perform RNA automatic functional annotation based on structure alignments indicates that SARA can correctly assign the deepest SCOR classification to >60% of the query structures. The SARA program is freely available through a World Wide Web server http://sgu.bioinfo.cipf.es/services/SARA/. Supplementary data are available at Bioinformatics online.

  20. Cross-family translational genomics of abiotic stress-responsive genes between Arabidopsis and Medicago truncatula.

    Directory of Open Access Journals (Sweden)

    Daejin Hyung

    Full Text Available Cross-species translation of genomic information may play a pivotal role in applying biological knowledge gained from relatively simple model systems to other less studied, but related, genomes. The information on abiotic stress (ABS)-responsive genes in Arabidopsis was identified and translated into the legume model system, Medicago truncatula. Various data resources, such as TAIR/AtGI DB, expression profiles and the literature, were used to build a genome-wide list of ABS genes. tBlastX/BlastP similarity search tools and manual inspection of alignments were used to identify orthologous genes between the two genomes. A total of 1,377 genes were finally collected and classified into 18 functional criteria of gene ontology (GO). The data analysis according to the expression cues showed that there was a substantial level of interaction among the three major types of abiotic stress (i.e., drought, salinity and cold stress). In an attempt to translate the ABS genes between these two species, genomic locations for each gene were mapped using an in-house-developed comparative analysis platform. The comparative analysis revealed that fragmental colinearity, represented by only 37 synteny blocks, existed between Arabidopsis and M. truncatula. Based on the combination of E-value and alignment remarks, the estimated translation rate was 60.2% for this cross-family translation. As a prelude to functional comparative genomic approaches, in-silico gene network/interactome analyses were conducted to predict key components in the ABS responses, and one of the sub-networks was integrated with the corresponding comparative map. The results demonstrated that core members of the sub-network were well aligned with previously reported ABS regulatory networks. Taken together, the results indicate that network-based integrative approaches of comparative and functional genomics are important to interpret and translate genomic information for complex traits such as abiotic stresses.

  1. GenPlay Multi-Genome, a tool to compare and analyze multiple human genomes in a graphical interface.

    Science.gov (United States)

    Lajugie, Julien; Fourel, Nicolas; Bouhassira, Eric E

    2015-01-01

    Parallel visualization of multiple individual human genomes is a complex endeavor that is rapidly gaining importance with the increasing number of personal, phased and cancer genomes that are being generated. It requires the display of variants such as SNPs, indels and structural variants that are unique to specific genomes and the introduction of multiple overlapping gaps in the reference sequence. Here, we describe GenPlay Multi-Genome, an application specifically written to visualize and analyze multiple human genomes in parallel. GenPlay Multi-Genome is ideally suited for the comparison of allele-specific expression and functional genomic data obtained from multiple phased genomes in a graphical interface with access to multiple-track operation. It also allows the analysis of data that have been aligned to custom genomes rather than to a standard reference and can be used as a variant calling format file browser and as a tool to compare different genome assemblies, such as hg19 and hg38. GenPlay is available under the GNU public license (GPL-3) from http://genplay.einstein.yu.edu. The source code is available at https://github.com/JulienLajugie/GenPlay.

  2. AlignMe—a membrane protein sequence alignment web server

    Science.gov (United States)

    Stamm, Marcus; Staritzbichler, René; Khafizov, Kamil; Forrest, Lucy R.

    2014-01-01

    We present a web server for pair-wise alignment of membrane protein sequences, using the program AlignMe. The server makes available two operational modes of AlignMe: (i) sequence to sequence alignment, taking two sequences in fasta format as input, combining information about each sequence from multiple sources and producing a pair-wise alignment (PW mode); and (ii) alignment of two multiple sequence alignments to create family-averaged hydropathy profile alignments (HP mode). For the PW sequence alignment mode, four different optimized parameter sets are provided, each suited to pairs of sequences with a specific similarity level. These settings utilize different types of inputs: (position-specific) substitution matrices, secondary structure predictions and transmembrane propensities from transmembrane predictions or hydrophobicity scales. In the second (HP) mode, each input multiple sequence alignment is converted into a hydrophobicity profile averaged over the provided set of sequence homologs; the two profiles are then aligned. The HP mode enables qualitative comparison of transmembrane topologies (and therefore potentially of 3D folds) of two membrane proteins, which can be useful if the proteins have low sequence similarity. In summary, the AlignMe web server provides user-friendly access to a set of tools for analysis and comparison of membrane protein sequences. Access is available at http://www.bioinfo.mpg.de/AlignMe PMID:24753425
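
    The first step of the HP mode, turning a multiple sequence alignment into a family-averaged hydropathy profile, can be sketched as below. This is an illustration under stated assumptions rather than AlignMe's implementation: the Kyte-Doolittle scale is used as one common hydrophobicity scale, gaps are simply ignored, no window smoothing is applied, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; AlignMe may use others).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(alignment):
    """Average hydropathy per alignment column, ignoring gaps.

    `alignment` is a list of equal-length aligned sequences; columns with
    no scorable residue yield None.
    """
    profile = []
    for column in zip(*alignment):
        values = [KYTE_DOOLITTLE[res] for res in column if res in KYTE_DOOLITTLE]
        profile.append(sum(values) / len(values) if values else None)
    return profile

# Toy two-sequence alignment with an all-gap final column.
profile = hydropathy_profile(["AI-", "AL-"])
```

    Two such profiles, one per protein family, would then be aligned against each other; that second step is not shown here.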

  3. Comparative genome analysis of trypanotolerance QTL | Nganga ...

    African Journals Online (AJOL)

    Homologous sequences were used in the definition of synteny relationships and subsequent identification of the shared disease response genes. The homologous genes within the human genome were then identified and aligned to the bovine radiation hybrid map in order to identify the mouse/bovine homologous regions.

  4. Plantagora: modeling whole genome sequencing and assembly of plant genomes.

    Directory of Open Access Journals (Sweden)

    Roger Barthelson

    Full Text Available BACKGROUND: Genomics studies are being revolutionized by the next generation sequencing technologies, which have made whole genome sequencing much more accessible to the average researcher. Whole genome sequencing with the new technologies is a developing art that, despite the large volumes of data that can be produced, may still fail to provide a clear and thorough map of a genome. The Plantagora project was conceived to address specifically the gap between having the technical tools for genome sequencing and knowing precisely the best way to use them. METHODOLOGY/PRINCIPAL FINDINGS: For Plantagora, a platform was created for generating simulated reads from several different plant genomes of different sizes. The resulting read files mimicked either 454 or Illumina reads, with varying paired end spacing. Thousands of datasets of reads were created, most derived from our primary model genome, rice chromosome one. All reads were assembled with different software assemblers, including Newbler, Abyss, and SOAPdenovo, and the resulting assemblies were evaluated by an extensive battery of metrics chosen for these studies. The metrics included both statistics of the assembly sequences and fidelity-related measures derived by alignment of the assemblies to the original genome source for the reads. The results were presented in a website, which includes a data graphing tool, all created to help the user compare rapidly the feasibility and effectiveness of different sequencing and assembly strategies prior to testing an approach in the lab. Some of our own conclusions regarding the different strategies were also recorded on the website. CONCLUSIONS/SIGNIFICANCE: Plantagora provides a substantial body of information for comparing different approaches to sequencing a plant genome, and some conclusions regarding some of the specific approaches. Plantagora also provides a platform of metrics and tools for studying the process of sequencing and assembly

  5. Accurate DNA assembly and genome engineering with optimized uracil excision cloning

    DEFF Research Database (Denmark)

    Cavaleiro, Mafalda; Kim, Se Hyeuk; Seppala, Susanna

    2015-01-01

    Simple and reliable DNA editing by uracil excision (a.k.a. USER cloning) has been described by several research groups, but the optimal design of cohesive DNA ends for multigene assembly remains elusive. Here, we use two model constructs based on expression of gfp and a four-gene pathway that produces β-carotene to optimize assembly junctions and the uracil excision protocol. By combining uracil excision cloning with a genomic integration technology, we demonstrate that up to six DNA fragments can be assembled in a one-tube reaction for direct genome integration with high accuracy, greatly facilitating the advanced engineering of robust cell factories.

  6. Insights into structural variations and genome rearrangements in prokaryotic genomes.

    Science.gov (United States)

    Periwal, Vinita; Scaria, Vinod

    2015-01-01

    Structural variations (SVs) are genomic rearrangements that affect fairly large fragments of DNA. Most SVs, such as inversions, deletions and translocations, have been largely studied in the context of genetic diseases in eukaryotes. However, recent studies demonstrate that genome rearrangements can also have a profound impact on prokaryotic genomes, leading to altered cell phenotype. In contrast to single-nucleotide variations, SVs provide a much deeper insight into the organization of bacterial genomes at a much better resolution. SVs can confer change in gene copy number, creation of new genes, altered gene expression and many other functional consequences. High-throughput technologies have now made it possible to explore SVs at a much refined resolution in bacterial genomes. Through this review, we aim to highlight the importance of the less explored field of SVs in prokaryotic genomes and their impact. We also discuss its potential applicability in the emerging fields of synthetic biology and genome engineering where targeted SVs could serve to create sophisticated and accurate genome editing.

  7. Development of the laser alignment system with PSD used for shaft calibration

    Science.gov (United States)

    Jiao, Guohua; Li, Yulin; Hu, Baowen

    2006-02-01

    Shaft calibration is an important technique during the installation and maintenance of a rotating machine. It requires unique, high-precision measurement instruments with calculation capability, and on heavy, high-speed, or high-temperature machines it relies on operator experience. A high-precision laser alignment system has been designed using a PSD (Position Sensing Detector) to replace the traditional manual method of shaft calibration and to make the measurement easier and more accurate. The system comprises two small measuring units (laser transmitter and detector) and a hand-operated control unit or a PC. The system has been used in actual shaft alignment with an offset resolution of 1.5 μm and an angular resolution of 0.1°.

  8. Microarray MAPH: accurate array-based detection of relative copy number in genomic DNA

    Directory of Open Access Journals (Sweden)

    Chan Alan

    2006-06-01

    Background: Current methods for measurement of copy number do not combine all the desirable qualities of convenience, throughput, economy, accuracy and resolution. In this study, to improve the throughput associated with Multiplex Amplifiable Probe Hybridisation (MAPH) we aimed to develop a modification based on the 3-Dimensional, Flow-Through Microarray Platform from PamGene International. In this new method, electrophoretic analysis of amplified products is replaced with photometric analysis of a probed oligonucleotide array. Copy number analysis of hybridised probes is based on a dual-label approach by comparing the intensity of Cy3-labelled MAPH probes amplified from test samples co-hybridised with similarly amplified Cy5-labelled reference MAPH probes. The key feature of using a hybridisation-based end point with MAPH is that discrimination of amplified probes is based on sequence and not fragment length. Results: In this study we showed that microarray MAPH measurement of PMP22 gene dosage correlates well with PMP22 gene dosage determined by capillary MAPH and that copy number was accurately reported in analyses of DNA from 38 individuals, 12 of which were known to have Charcot-Marie-Tooth disease type 1A (CMT1A). Conclusion: Measurement of microarray-based endpoints for MAPH appears to be of comparable accuracy to electrophoretic methods, and holds the prospect of fully exploiting the potential multiplicity of MAPH. The technology has the potential to simplify copy number assays for genes with a large number of exons, or of expanded sets of probes from dispersed genomic locations.

  9. Microarray MAPH: accurate array-based detection of relative copy number in genomic DNA.

    Science.gov (United States)

    Gibbons, Brian; Datta, Parikkhit; Wu, Ying; Chan, Alan; Al Armour, John

    2006-06-30

    Current methods for measurement of copy number do not combine all the desirable qualities of convenience, throughput, economy, accuracy and resolution. In this study, to improve the throughput associated with Multiplex Amplifiable Probe Hybridisation (MAPH) we aimed to develop a modification based on the 3-Dimensional, Flow-Through Microarray Platform from PamGene International. In this new method, electrophoretic analysis of amplified products is replaced with photometric analysis of a probed oligonucleotide array. Copy number analysis of hybridised probes is based on a dual-label approach by comparing the intensity of Cy3-labelled MAPH probes amplified from test samples co-hybridised with similarly amplified Cy5-labelled reference MAPH probes. The key feature of using a hybridisation-based end point with MAPH is that discrimination of amplified probes is based on sequence and not fragment length. In this study we showed that microarray MAPH measurement of PMP22 gene dosage correlates well with PMP22 gene dosage determined by capillary MAPH and that copy number was accurately reported in analyses of DNA from 38 individuals, 12 of which were known to have Charcot-Marie-Tooth disease type 1A (CMT1A). Measurement of microarray-based endpoints for MAPH appears to be of comparable accuracy to electrophoretic methods, and holds the prospect of fully exploiting the potential multiplicity of MAPH. The technology has the potential to simplify copy number assays for genes with a large number of exons, or of expanded sets of probes from dispersed genomic locations.

  10. Precise rotational alignment of x-ray transmission diffraction gratings

    International Nuclear Information System (INIS)

    Hill, S.L.

    1988-01-01

    Gold transmission diffraction gratings used for x-ray spectroscopy must sometimes be rotationally aligned to the axis of a diagnostic instrument to within sub-milliradian accuracy. We have fabricated transmission diffraction gratings with high line-densities (grating period of 200 and 300 nm) using uv holographic and x-ray lithography. Since the submicron features of the gratings are not optically visible, precision alignment is time consuming and difficult to verify in situ. We have developed a technique to write an optically visible alignment pattern onto these gratings using a scanning electron microscope (SEM). At high magnification (15000 X) several submicron lines of the grating are observable in the SEM, making it possible to write an alignment pattern parallel to the grating lines in an electron-beam-sensitive coating that overlays the grating. We create an alignment pattern by following a 1-cm-long grating line using the SEM's joystick-controlled translation stage. By following the same grating line we are assured the traveled direction of the SEM electron beam is parallel to the grating to better than 10 μradian. The electron-beam-exposed line-width can be large (5 to 15 μm wide) depending on the SEM magnification, and is therefore optically visible. The exposed pattern is eventually made a permanent feature of the grating by ion beam etching or gold electroplating. The pattern can be used to accurately align the grating to the axis of a diagnostic instrument. More importantly, the alignment of the grating can be quickly verified in situ

  11. In-Flight Self-Alignment Method Aided by Geomagnetism for Moving Basement of Guided Munitions

    Directory of Open Access Journals (Sweden)

    Shuang-biao Zhang

    2015-01-01

    Due to the power-after-launch mode of guided munitions with high rolling speed, the initial attitude of the munitions cannot be determined accurately, which makes it difficult for the navigation and control system to work effectively. This paper proposes an in-flight self-alignment method aided by geomagnetism that includes a fast in-flight coarse alignment method and an in-flight alignment model based on Kalman theory. First, a fast in-flight coarse alignment method is developed using gyros, magnetic sensors, and trajectory angles. Then, an in-flight alignment model is derived by investigating the measurement errors and attitude errors; it treats attitude errors as state variables and geomagnetic components in the navigation frame as observed variables. Finally, flight data of a spinning projectile are used to verify the performance of the in-flight self-alignment method. The results show that (1) the precision of coarse alignment can reach below 5°; (2) the attitude errors of the in-flight alignment model converge to 24′ early in the latter half of the flight; (3) the in-flight alignment model based on Kalman theory has better adaptability and shows satisfactory performance.

  12. GraphAlignment: Bayesian pairwise alignment of biological networks

    Directory of Open Access Journals (Sweden)

    Kolář Michal

    2012-11-01

    Background: With increased experimental availability and accuracy of bio-molecular networks, tools for their comparative and evolutionary analysis are needed. A key component for such studies is the alignment of networks. Results: We introduce the Bioconductor package GraphAlignment for pairwise alignment of bio-molecular networks. The alignment incorporates information both from network vertices and network edges and is based on an explicit evolutionary model, allowing inference of all scoring parameters directly from empirical data. We compare the performance of our algorithm to an alternative algorithm, Græmlin 2.0. On simulated data, GraphAlignment outperforms Græmlin 2.0 in several benchmarks except for computational complexity. When there is little or no noise in the data, GraphAlignment is slower than Græmlin 2.0. It is faster than Græmlin 2.0 when processing noisy data containing spurious vertex associations. Its typical case complexity grows approximately as O(N^2.6). On empirical bacterial protein-protein interaction networks (PIN) and gene co-expression networks, GraphAlignment outperforms Græmlin 2.0 with respect to coverage and specificity, albeit by a small margin. On large eukaryotic PIN, Græmlin 2.0 outperforms GraphAlignment. Conclusions: The GraphAlignment algorithm is robust to spurious vertex associations, correctly resolves paralogs, and shows very good performance in identification of homologous vertices defined by high vertex and/or interaction similarity. The simplicity and generality of GraphAlignment edge scoring makes the algorithm an appropriate choice for global alignment of networks.

  13. Asexual populations of the human malaria parasite, Plasmodium falciparum, use a two-step genomic strategy to acquire accurate, beneficial DNA amplifications.

    Directory of Open Access Journals (Sweden)

    Jennifer L Guler

    Malaria drug resistance contributes to up to a million annual deaths. Judicious deployment of new antimalarials and vaccines could benefit from an understanding of early molecular events that promote the evolution of parasites. Continuous in vitro challenge of Plasmodium falciparum parasites with a novel dihydroorotate dehydrogenase (DHODH) inhibitor reproducibly selected for resistant parasites. Genome-wide analysis of independently-derived resistant clones revealed a two-step strategy to evolutionary success. Some haploid blood-stage parasites first survive antimalarial pressure through fortuitous DNA duplications that always included the DHODH gene. Independently-selected parasites had different sized amplification units but they were always flanked by distant A/T tracks. Higher level amplification and resistance was attained using a second, more efficient and more accurate, mechanism for head-to-tail expansion of the founder unit. This second homology-based process could faithfully tune DNA copy numbers in either direction, always retaining the unique DNA amplification sequence from the original A/T-mediated duplication for that parasite line. Pseudo-polyploidy at relevant genomic loci sets the stage for gaining additional mutations at the locus of interest. Overall, we reveal a population-based genomic strategy for mutagenesis that operates in human stages of P. falciparum to efficiently yield resistance-causing genetic changes at the correct locus in a successful parasite. Importantly, these founding events arise with precision; no other new amplifications are seen in the resistant haploid blood stage parasite. This minimizes the need for meiotic genetic cleansing that can only occur in sexual stage development of the parasite in mosquitoes.

  14. Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method.

    Science.gov (United States)

    Nielsen, Morten; Lundegaard, Claus; Lund, Ole

    2007-07-04

    Antigen presenting cells (APCs) sample the extra cellular space and present peptides from here to T helper cells, which can be activated if the peptides are of foreign origin. The peptides are presented on the surface of the cells in complex with major histocompatibility class II (MHC II) molecules. Identification of peptides that bind MHC II molecules is thus a key step in rational vaccine design and developing methods for accurate prediction of the peptide:MHC interactions play a central role in epitope discovery. The MHC class II binding groove is open at both ends making the correct alignment of a peptide in the binding groove a crucial part of identifying the core of an MHC class II binding motif. Here, we present a novel stabilization matrix alignment method, SMM-align, that allows for direct prediction of peptide:MHC binding affinities. The predictive performance of the method is validated on a large MHC class II benchmark data set covering 14 HLA-DR (human MHC) and three mouse H2-IA alleles. The predictive performance of the SMM-align method was demonstrated to be superior to that of the Gibbs sampler, TEPITOPE, SVRMHC, and MHCpred methods. Cross validation between peptide data set obtained from different sources demonstrated that direct incorporation of peptide length potentially results in over-fitting of the binding prediction method. Focusing on amino terminal peptide flanking residues (PFR), we demonstrate a consistent gain in predictive performance by favoring binding registers with a minimum PFR length of two amino acids. Visualizing the binding motif as obtained by the SMM-align and TEPITOPE methods highlights a series of fundamental discrepancies between the two predicted motifs. For the DRB1*1302 allele for instance, the TEPITOPE method favors basic amino acids at most anchor positions, whereas the SMM-align method identifies a preference for hydrophobic or neutral amino acids at the anchors. The SMM-align method was shown to outperform other

  15. Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method

    Directory of Open Access Journals (Sweden)

    Lund Ole

    2007-07-01

    Background: Antigen presenting cells (APCs) sample the extra cellular space and present peptides from here to T helper cells, which can be activated if the peptides are of foreign origin. The peptides are presented on the surface of the cells in complex with major histocompatibility class II (MHC II) molecules. Identification of peptides that bind MHC II molecules is thus a key step in rational vaccine design and developing methods for accurate prediction of the peptide:MHC interactions play a central role in epitope discovery. The MHC class II binding groove is open at both ends making the correct alignment of a peptide in the binding groove a crucial part of identifying the core of an MHC class II binding motif. Here, we present a novel stabilization matrix alignment method, SMM-align, that allows for direct prediction of peptide:MHC binding affinities. The predictive performance of the method is validated on a large MHC class II benchmark data set covering 14 HLA-DR (human MHC) and three mouse H2-IA alleles. Results: The predictive performance of the SMM-align method was demonstrated to be superior to that of the Gibbs sampler, TEPITOPE, SVRMHC, and MHCpred methods. Cross validation between peptide data set obtained from different sources demonstrated that direct incorporation of peptide length potentially results in over-fitting of the binding prediction method. Focusing on amino terminal peptide flanking residues (PFR), we demonstrate a consistent gain in predictive performance by favoring binding registers with a minimum PFR length of two amino acids. Visualizing the binding motif as obtained by the SMM-align and TEPITOPE methods highlights a series of fundamental discrepancies between the two predicted motifs. For the DRB1*1302 allele for instance, the TEPITOPE method favors basic amino acids at most anchor positions, whereas the SMM-align method identifies a preference for hydrophobic or neutral amino acids at the anchors.
Conclusion

  16. A Guide to the PLAZA 3.0 Plant Comparative Genomic Database.

    Science.gov (United States)

    Vandepoele, Klaas

    2017-01-01

    PLAZA 3.0 is an online resource for comparative genomics and offers a versatile platform to study gene functions and gene families or to analyze genome organization and evolution in the green plant lineage. Starting from genome sequence information for over 35 plant species, precomputed comparative genomic data sets cover homologous gene families, multiple sequence alignments, phylogenetic trees, and genomic colinearity information within and between species. Complementary functional data sets, a Workbench, and interactive visualization tools are available through a user-friendly web interface, making PLAZA an excellent starting point to translate sequence or omics data sets into biological knowledge. PLAZA is available at http://bioinformatics.psb.ugent.be/plaza/ .

  17. Image correlation method for DNA sequence alignment.

    Science.gov (United States)

    Curilem Saldías, Millaray; Villarroel Sassarini, Felipe; Muñoz Poblete, Carlos; Vargas Vásquez, Asticio; Maureira Butler, Iván

    2012-01-01

    The complexity of searches and the volume of genomic data make sequence alignment one of the most active research areas in bioinformatics. New alignment approaches have incorporated digital signal processing techniques. Among these, correlation methods are highly sensitive. This paper proposes a novel sequence alignment method based on 2-dimensional images, where each nucleic acid base is represented as a pixel of fixed gray intensity. Query and known database sequences are coded to their pixel representation, and sequence alignment is handled as an object-recognition-in-a-scene problem. Query and database become object and scene, respectively. An image correlation process is carried out in order to search for the best match between them. Given that this procedure can be implemented in an optical correlator, the correlation could eventually be accomplished at light speed. This paper shows an initial research stage where results were "digitally" obtained by simulating an optical correlation of DNA sequences represented as images. A total of 303 queries (variable lengths from 50 to 4500 base pairs) and 100 scenes represented by 100 x 100 images each (in total, a one-million-base-pair database) were considered for the image correlation analysis. The results showed that correlations reached very high sensitivity (99.01%) and specificity (98.99%), and outperformed BLAST when mutation numbers increased. However, digital correlation processes were a hundred times slower than BLAST. We are currently starting an initiative to evaluate the correlation speed of a real experimental optical correlator. By doing this, we expect to fully exploit the light properties of optical correlation. As the optical correlator works jointly with the computer, digital algorithms should also be optimized. The results presented in this paper are encouraging and support the study of image correlation methods for sequence alignment.
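
    The correlation idea in this abstract can be illustrated with a minimal sketch. It is a one-dimensional simplification, not the authors' 2-dimensional optical method: the gray levels in `GRAY` and the function names are assumptions chosen for illustration, and normalized cross-correlation stands in for the optical correlator.

```python
import math

# Hypothetical gray intensity per base; the abstract does not give the actual values.
GRAY = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.00}


def encode(seq):
    """Map each base to its (assumed) gray intensity."""
    return [GRAY[base] for base in seq.upper()]


def normalized_correlation(x, y):
    """Pearson-style normalized correlation of two equal-length signals."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        return 0.0  # a constant window carries no pattern to correlate
    return sum(a * b for a, b in zip(dx, dy)) / denom


def best_match(query, database):
    """Slide the query along the database; return (offset, score) of the best window."""
    q = encode(query)
    d = encode(database)
    best_offset, best_score = -1, -2.0
    for offset in range(len(d) - len(q) + 1):
        score = normalized_correlation(q, d[offset:offset + len(q)])
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


offset, score = best_match("GATACG", "TTGACCGATACGTTAGC")
print(offset, round(score, 3))
```

    A sliding-window loop like this is quadratic in sequence length, which is consistent with the abstract's remark that the digital simulation was much slower than BLAST.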

  18. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    Science.gov (United States)

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. Copyright © 2015 Elsevier Inc. All rights reserved.
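
    The "divide and conquer" strategy over both databases and queries can be sketched as follows. This is a toy illustration under stated assumptions, not HBlast or Hadoop code: partitions are modelled as index ranges (so nothing is physically split), an exact substring search stands in for BLAST, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def virtual_partitions(n_items, n_parts):
    """Return (start, stop) index ranges that cover range(n_items) as evenly as possible."""
    base, extra = divmod(n_items, n_parts)
    ranges, start = [], 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)
        if stop > start:  # skip empty partitions
            ranges.append((start, stop))
        start = stop
    return ranges


def map_task(queries, database, q_range, d_range):
    """Toy stand-in for one BLAST job: yield exact substring hits for one work unit."""
    for qi in range(*q_range):
        for di in range(*d_range):
            pos = database[di].find(queries[qi])
            if pos >= 0:
                yield qi, (di, pos)


def run(queries, database, q_parts, d_parts):
    """Run every (query partition, database partition) pair and group hits by query."""
    hits = defaultdict(list)
    work_units = product(virtual_partitions(len(queries), q_parts),
                         virtual_partitions(len(database), d_parts))
    for q_range, d_range in work_units:
        for qi, hit in map_task(queries, database, q_range, d_range):
            hits[qi].append(hit)  # "reduce" step: merge hits per query
    return {qi: sorted(found) for qi, found in hits.items()}


queries = ["ACG", "TTT", "GGA"]
database = ["AACGT", "TTTTA", "CCCC", "ACGGA"]
print(run(queries, database, q_parts=2, d_parts=3))
```

    Because each work unit is independent, the merged result does not depend on how many partitions are used, which is the property that lets the work be balanced across nodes.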

  19. Markov random field based automatic image alignment for electron tomography.

    Science.gov (United States)

    Amat, Fernando; Moussavi, Farshid; Comolli, Luis R; Elidan, Gal; Downing, Kenneth H; Horowitz, Mark

    2008-03-01

    We present a method for automatic full-precision alignment of the images in a tomographic tilt series. Full-precision automatic alignment of cryo electron microscopy images has remained a difficult challenge to date, due to the limited electron dose and low image contrast. These facts lead to poor signal to noise ratio (SNR) in the images, which causes automatic feature trackers to generate errors, even with high contrast gold particles as fiducial features. To enable fully automatic alignment for full-precision reconstructions, we frame the problem probabilistically as finding the most likely particle tracks given a set of noisy images, using contextual information to make the solution more robust to the noise in each image. To solve this maximum likelihood problem, we use Markov Random Fields (MRF) to establish the correspondence of features in alignment and robust optimization for projection model estimation. The resulting algorithm, called Robust Alignment and Projection Estimation for Tomographic Reconstruction, or RAPTOR, has not needed any manual intervention for the difficult datasets we have tried, and has provided sub-pixel alignment that is as good as the manual approach by an expert user. We are able to automatically map complete and partial marker trajectories and thus obtain highly accurate image alignment. Our method has been applied to challenging cryo electron tomographic datasets with low SNR from intact bacterial cells, as well as several plastic section and X-ray datasets.

  20. Automatic laser beam alignment using blob detection for an environment monitoring spectroscopy

    Science.gov (United States)

    Khidir, Jarjees; Chen, Youhua; Anderson, Gary

    2013-05-01

    This paper describes a fully automated system to align an infra-red laser beam with a small retro-reflector over a wide range of distances. The component development and test were especially used for an open-path spectrometer gas detection system. Using blob detection under OpenCV library, an automatic alignment algorithm was designed to achieve fast and accurate target detection in a complex background environment. Test results are presented to show that the proposed algorithm has been successfully applied to various target distances and environment conditions.

  1. Next Generation Semiconductor Based Sequencing of the Donkey (Equus asinus) Genome Provided Comparative Sequence Data against the Horse Genome and a Few Millions of Single Nucleotide Polymorphisms

    Science.gov (United States)

    Bertolini, Francesca; Scimone, Concetta; Geraci, Claudia; Schiavo, Giuseppina; Utzeri, Valerio Joe; Chiofalo, Vincenzo; Fontanesi, Luca

    2015-01-01

    Few studies investigated the donkey (Equus asinus) at the whole genome level so far. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared obtained sequence information with the available donkey draft genome (and its Illumina reads from which it was originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than those aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~ 0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionally important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million of single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources. PMID:26151450
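
    A divergence figure such as the ~0.52-0.57% quoted above is a fraction of differing sites. As a minimal sketch, assuming two already-aligned sequences of equal length and counting only columns where both carry an unambiguous base, it could be computed as below; the abstract does not describe the authors' actual procedure, and the function name is hypothetical.

```python
def nucleotide_divergence(aligned_a, aligned_b):
    """Fraction of mismatching columns among columns where both sequences have A, C, G or T."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x in "ACGT" and y in "ACGT":  # skip gaps and ambiguous bases such as N
            compared += 1
            if x != y:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return mismatches / compared


# Made-up toy alignment, not data from the study.
print(nucleotide_divergence("ACGT-ACGTN", "ACGA-ACCTA"))
```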

  2. Next Generation Semiconductor Based Sequencing of the Donkey (Equus asinus) Genome Provided Comparative Sequence Data against the Horse Genome and a Few Millions of Single Nucleotide Polymorphisms.

    Directory of Open Access Journals (Sweden)

    Francesca Bertolini

    Few studies investigated the donkey (Equus asinus) at the whole genome level so far. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared obtained sequence information with the available donkey draft genome (and its Illumina reads from which it was originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than those aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~ 0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionally important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million of single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources.

  3. MUON DETECTORS: ALIGNMENT

    CERN Multimedia

    G.Gomez.

    Since June of 2009, the muon alignment group has focused on providing new alignment constants and on finalizing the hardware alignment reconstruction. Alignment constants for DTs and CSCs were provided for CRAFT09 data reprocessing. For DT chambers, the track-based alignment was repeated using CRAFT09 cosmic ray muons and validated using segment extrapolation and split cosmic tools. One difference with respect to the previous alignment is that only five degrees of freedom were aligned, leaving the rotation around the local x-axis to be better determined by the hardware system. Similarly, DT chambers poorly aligned by tracks (due to limited statistics) were aligned by a combination of photogrammetry and hardware-based alignment. For the CSC chambers, the hardware system provided alignment in global z and rotations about local x. Entire muon endcap rings were further corrected in the transverse plane (global x and y) by the track-based alignment. Single chamber track-based alignment suffers from poor statistic...

  4. Draft Sequencing of the Heterozygous Diploid Genome of Satsuma (Citrus unshiu Marc.) Using a Hybrid Assembly Approach

    Directory of Open Access Journals (Sweden)

    Tokurou Shimizu

    2017-12-01

    Satsuma (Citrus unshiu Marc.) is one of the most abundantly produced mandarin varieties of citrus, known for its seedless fruit production and as a breeding parent of citrus. De novo assembly of the heterozygous diploid genome of Satsuma (“Miyagawa Wase”) was conducted by a hybrid assembly approach using short-read sequences, three mate-pair libraries, and a long-read sequence of PacBio by the PLATANUS assembler. The assembled sequence, with a total size of 359.7 Mb at the N50 length of 386,404 bp, consisted of 20,876 scaffolds. Pseudomolecules of Satsuma constructed by aligning the scaffolds to three genetic maps showed genome-wide synteny to the genomes of Clementine, pummelo, and sweet orange. Gene prediction by modeling with MAKER-P proposed 29,024 genes and 37,970 mRNA; additionally, gene prediction analysis found candidates for novel genes in several biosynthesis pathways for gibberellin and violaxanthin catabolism. BUSCO scores for the assembled scaffold and predicted transcripts, and another analysis by BAC end sequence mapping indicated the assembled genome consistency was close to those of the haploid Clementine, pummelo, and sweet orange genomes. The number of repeat elements and long terminal repeat retrotransposons were comparable to those of the seven citrus genomes; this suggested no significant failure in the assembly at the repeat region. A resequencing application using the assembled sequence confirmed that both kunenbo-A and Satsuma are offspring of Kishu, and Satsuma is a back-crossed offspring of Kishu. These results illustrated the performance of the hybrid assembly approach and its ability to construct an accurate heterozygous diploid genome.
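
    The N50 statistic quoted above can be computed from a list of scaffold lengths. The sketch below uses the common definition (the length L such that scaffolds of length at least L contain half or more of the assembled bases); the function name and the example lengths are made up for illustration and are not the Satsuma assembly data.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half of the total bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the assembly
            return length


# Made-up scaffold lengths: total is 20, and the two longest (6 + 5) pass the halfway mark.
print(n50([2, 3, 4, 5, 6]))
```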

  5. From Word Alignment to Word Senses, via Multilingual Wordnets

    Directory of Open Access Journals (Sweden)

    Dan Tufis

    2006-05-01

    Most of the successful commercial applications in language processing (text and/or speech) dispense with any explicit concern for semantics, the usual motivation being the high computational cost of dealing with semantics in large volumes of data. With recent advances in corpus linguistics and statistical-based methods in NLP, revealing useful semantic features of linguistic data is becoming cheaper and cheaper and the accuracy of this process is steadily improving. Lately, there seems to be a growing acceptance of the idea that multilingual lexical ontologies might be the key towards aligning different views on the semantic atomic units to be used in characterizing the general meaning of various and multilingual documents. Depending on the granularity at which semantic distinctions are necessary, the accuracy of the basic semantic processing (such as word sense disambiguation) can be very high with relatively low complexity computing. The paper substantiates this statement by presenting a statistical-based system for word alignment and word sense disambiguation in parallel corpora. We describe a word alignment platform which ensures text pre-processing (tokenization, POS-tagging, lemmatization, chunking, sentence and word alignment) as required by an accurate word sense disambiguation.

  6. CCD Camera Lens Interface for Real-Time Theodolite Alignment

    Science.gov (United States)

    Wake, Shane; Scott, V. Stanley, III

    2012-01-01

    Theodolites are a common instrument in the testing, alignment, and building of various systems ranging from a single optical component to an entire instrument. They provide a precise way to measure horizontal and vertical angles. They can be used to align multiple objects in a desired way at specific angles. They can also be used to reference a specific location or orientation of an object that has moved. Some systems may require a small margin of error in position of components. A theodolite can assist with accurately measuring and/or minimizing that error. The technology is an adapter for a CCD camera with lens to attach to a Leica Wild T3000 Theodolite eyepiece that enables viewing on a connected monitor, and thus can be utilized with multiple theodolites simultaneously. This technology removes a substantial part of human error by relying on the CCD camera and monitors. It also allows image recording of the alignment, and therefore provides a quantitative means to measure such error.

  7. A BAC clone fingerprinting approach to the detection of human genome rearrangements

    Science.gov (United States)

    Krzywinski, Martin; Bosdet, Ian; Mathewson, Carrie; Wye, Natasja; Brebner, Jay; Chiu, Readman; Corbett, Richard; Field, Matthew; Lee, Darlene; Pugh, Trevor; Volik, Stas; Siddiqui, Asim; Jones, Steven; Schein, Jacquie; Collins, Collin; Marra, Marco

    2007-01-01

    We present a method, called fingerprint profiling (FPP), that uses restriction digest fingerprints of bacterial artificial chromosome clones to detect and classify rearrangements in the human genome. The approach uses alignment of experimental fingerprint patterns to in silico digests of the sequence assembly and is capable of detecting micro-deletions (1-5 kb) and balanced rearrangements. Our method has compelling potential for use as a whole-genome method for the identification and characterization of human genome rearrangements. PMID:17953769
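
The comparison of experimental fingerprints with in silico digests can be illustrated with a minimal Python sketch. It is not the FPP implementation: the enzyme (HindIII, cutting A^AGCTT), the 5% size tolerance, and the function names `in_silico_digest` and `matched_fraction` are all assumptions made for illustration, since the abstract does not specify them.

```python
def in_silico_digest(sequence, site="AAGCTT", cut_offset=1):
    """Return fragment lengths obtained by cutting `sequence` at every `site`.

    `cut_offset` is how many bases into the recognition site the cut falls
    (1 for HindIII, A^AGCTT). The enzyme choice is an illustrative assumption.
    """
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]


def matched_fraction(observed, predicted, tolerance=0.05):
    """Fraction of observed fragment sizes that pair one-to-one with a
    predicted size within a relative `tolerance` (hypothetical value)."""
    remaining = sorted(predicted)
    hits = 0
    for size in sorted(observed):
        for i, candidate in enumerate(remaining):
            if abs(candidate - size) <= tolerance * candidate:
                hits += 1
                del remaining[i]  # each predicted fragment is used at most once
                break
    return hits / len(observed) if observed else 0.0


reference = "GGAAGCTTCCCCAAGCTTTT"
predicted = in_silico_digest(reference)
print(predicted)
print(matched_fraction([3, 10, 20], predicted))
```

In this toy setting, a clone whose observed fragments match the predicted digest poorly would be a candidate for closer inspection; how FPP actually scores and classifies such discrepancies is not described in the abstract.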

  8. MSOAR 2.0: Incorporating tandem duplications into ortholog assignment based on genome rearrangement

    Directory of Open Access Journals (Sweden)

    Zhang Liqing

    2010-01-01

    Background: Ortholog assignment is a critical and fundamental problem in comparative genomics, since orthologs are considered to be functional counterparts in different species and can be used to infer molecular functions of one species from those of other species. MSOAR is a recently developed high-throughput system for assigning one-to-one orthologs between closely related species on a genome scale. It attempts to reconstruct the evolutionary history of input genomes in terms of genome rearrangement and gene duplication events. It assumes that a gene duplication event inserts a duplicated gene into the genome of interest at a random location (i.e., the random duplication model). However, in practice, biologists believe that genes are often duplicated by tandem duplications, where a duplicated gene is located next to the original copy (i.e., the tandem duplication model). Results: In this paper, we develop MSOAR 2.0, an improved system for one-to-one ortholog assignment. For a pair of input genomes, the system first focuses on the tandemly duplicated genes of each genome and tries to identify among them those that were duplicated after the speciation (i.e., the so-called inparalogs), using a simple phylogenetic tree reconciliation method. For each such set of tandemly duplicated inparalogs, all but one gene will be deleted from the concerned genome (because they cannot possibly appear in any one-to-one ortholog pairs), and MSOAR is invoked. Using both simulated and real data experiments, we show that MSOAR 2.0 is able to achieve a better sensitivity and specificity than MSOAR. In comparison with the well-known genome-scale ortholog assignment tool InParanoid, the Ensembl ortholog database, and the orthology information extracted from the well-known whole-genome multiple alignment program MultiZ, MSOAR 2.0 shows the highest sensitivity. Although the specificity of MSOAR 2.0 is slightly worse than that of InParanoid in the real data experiments

  9. Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua; Paola Bissiri, Maria

    2015-01-01

    This paper aims to answer the question whether forced-alignment speech recognition can be used as an alternative to humans in generating reference Voice Activity Detection (VAD) transcriptions. An investigation of the level of agreement between automatic/manual VAD transcriptions and the reference ones produced by a human expert was carried out. Thereafter, statistical analysis was employed on the automatically produced and the collected manual transcriptions. Experimental results confirmed that forced-alignment speech recognition can provide accurate and consistent VAD labels.

  10. Strategy and validation of fiducialisation for the pre-alignment of CLIC components

    CERN Document Server

    Griffet, S; Kemppinen, J; Mainaud Durand, H; Rude, V; Sterbini, G

    2012-01-01

    The feasibility of the high energy e+ e- linear collider CLIC (Compact Linear Collider) is very dependent on the ability to accurately pre-align its components. There are two 20 km long Main Linacs which meet in an interaction point (IP). The Main Linacs are composed of thousands of 2 m long modules. One of the challenges is to meet very tight alignment tolerances at the level of CLIC module: for example, the magnetic centre of a Drive Beam Quad needs to be aligned within 20 µm rms with respect to a straight line. Such accuracies cannot be achieved using usual measurement devices. Thus it is necessary to work in close collaboration with the metrology lab. To test and improve many critical points, including alignment, a CLIC mock-up is being assembled at CERN. This paper describes the application of the strategy of fiducialisation for the pre-alignment of CLIC mock-up components. It also deals with the first results obtained by performing measurements using a CMM (Coordinate Measuring Machine) to ensure the f...

  11. MBGD update 2015: microbial genome database for flexible ortholog analysis utilizing a diverse set of genomic data.

    Science.gov (United States)

    Uchiyama, Ikuo; Mihara, Motohiro; Nishide, Hiroyo; Chiba, Hirokazu

    2015-01-01

    The microbial genome database for comparative analysis (MBGD) (available at http://mbgd.genome.ad.jp/) is a comprehensive ortholog database for flexible comparative analysis of microbial genomes, where the users are allowed to create an ortholog table among any specified set of organisms. Because of the rapid increase in microbial genome data owing to the next-generation sequencing technology, it becomes increasingly challenging to maintain high-quality orthology relationships while allowing the users to incorporate the latest genomic data available into an analysis. Because many of the recently accumulating genomic data are draft genome sequences for which some complete genome sequences of the same or closely related species are available, MBGD now stores draft genome data and allows the users to incorporate them into a user-specific ortholog database using the MyMBGD functionality. In this function, draft genome data are incorporated into an existing ortholog table created only from the complete genome data in an incremental manner to prevent low-quality draft data from affecting clustering results. In addition, to provide high-quality orthology relationships, the standard ortholog table containing all the representative genomes, which is first created by the rapid classification program DomClust, is now refined using DomRefine, a recently developed program for improving domain-level clustering using multiple sequence alignment information. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  12. Formatt: Correcting protein multiple structural alignments by incorporating sequence alignment

    Directory of Open Access Journals (Sweden)

    Daniels Noah M

    2012-10-01

    Background: The quality of multiple protein structure alignments is usually computed and assessed based on geometric functions of the coordinates of the backbone atoms from the protein chains. These purely geometric methods do not directly utilize protein sequence similarity, and in fact, determining the proper way to incorporate sequence similarity measures into the construction and assessment of protein multiple structure alignments has proved surprisingly difficult. Results: We present Formatt, a multiple structure alignment program based on the purely geometric multiple structure alignment program Matt, that also takes into account sequence similarity when constructing alignments. We show that Formatt outperforms Matt and other popular structure alignment programs on the popular HOMSTRAD benchmark. For the SABMark twilight zone benchmark set that captures more remote homology, Formatt and Matt outperform other programs; depending on the choice of embedded sequence aligner, Formatt produces either better sequence and structural alignments with a smaller core size than Matt, or similarly sized alignments with better sequence similarity, for a small cost in average RMSD. Conclusions: Considering sequence information as well as purely geometric information seems to improve the quality of multiple structure alignments, though defining what constitutes the best alignment when sequence and structural measures would suggest different alignments remains a difficult open question.

  13. MACSIMS : multiple alignment of complete sequences information management system

    Directory of Open Access Journals (Sweden)

    Plewniak Frédéric

    2006-06-01

    Background: In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family. Results: MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences within these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIMS results, while a graphical display using the JalView program allows manual analysis. Conclusion: MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at http://bips.u-strasbg.fr/MACSIMS/.

  14. Comparative analysis of rosaceous genomes and the reconstruction of a putative ancestral genome for the family.

    Science.gov (United States)

    Illa, Eudald; Sargent, Daniel J; Lopez Girona, Elena; Bushakra, Jill; Cestaro, Alessandro; Crowhurst, Ross; Pindo, Massimo; Cabrera, Antonio; van der Knaap, Esther; Iezzoni, Amy; Gardiner, Susan; Velasco, Riccardo; Arús, Pere; Chagné, David; Troggio, Michela

    2011-01-12

    Comparative genome mapping studies in Rosaceae have been conducted until now by aligning genetic maps within the same genus or closely related genera, using a limited number of common markers. The growing body of genomics resources and sequence data for both Prunus and Fragaria permits detailed comparisons between these genera and the recently released Malus × domestica genome sequence. We generated a comparative analysis using 806 molecular markers that are anchored genetically to the Prunus and/or Fragaria reference maps, and physically to the Malus genome sequence. There were 784 markers in common between Malus and Prunus, and 148 between Malus and Fragaria. The correspondence between marker positions was high and conserved syntenic blocks were identified among the three genera in the Rosaceae. We reconstructed a proposed ancestral genome for the Rosaceae. A genome containing nine chromosomes is the most likely candidate for the ancestral Rosaceae progenitor. The number of chromosomal translocations observed between the three genera investigated was low. However, the number of inversions identified between Malus and Prunus was much higher than in any reported genome comparison in plants, suggesting that small inversions have played an important role in the evolution of these two genera or of the Rosaceae.

  15. Comparative analysis of rosaceous genomes and the reconstruction of a putative ancestral genome for the family

    Directory of Open Access Journals (Sweden)

    Velasco Riccardo

    2011-01-01

    Background: Comparative genome mapping studies in Rosaceae have been conducted until now by aligning genetic maps within the same genus or closely related genera, using a limited number of common markers. The growing body of genomics resources and sequence data for both Prunus and Fragaria permits detailed comparisons between these genera and the recently released Malus × domestica genome sequence. Results: We generated a comparative analysis using 806 molecular markers that are anchored genetically to the Prunus and/or Fragaria reference maps, and physically to the Malus genome sequence. There were 784 markers in common between Malus and Prunus, and 148 between Malus and Fragaria. The correspondence between marker positions was high and conserved syntenic blocks were identified among the three genera in the Rosaceae. We reconstructed a proposed ancestral genome for the Rosaceae. Conclusions: A genome containing nine chromosomes is the most likely candidate for the ancestral Rosaceae progenitor. The number of chromosomal translocations observed between the three genera investigated was low. However, the number of inversions identified between Malus and Prunus was much higher than in any reported genome comparison in plants, suggesting that small inversions have played an important role in the evolution of these two genera or of the Rosaceae.

  16. The influence of atomic alignment on absorption and emission spectroscopy

    Science.gov (United States)

    Zhang, Heshou; Yan, Huirong; Richter, Philipp

    2018-06-01

    Spectroscopic observations play essential roles in astrophysics. They are crucial for determining physical parameters in the universe, providing information about the chemistry of various astronomical environments. The proper execution of the spectroscopic analysis requires accounting for all the physical effects that are compatible to the signal-to-noise ratio. We find in this paper that atomic (ground-state) alignment, which arises from anisotropic radiation and is modulated by the interstellar magnetic field, has a significant impact on the spectroscopic study of interstellar gas. In different observational scenarios, we comprehensively demonstrate how atomic alignment influences the spectral analysis and provide the expressions for correcting the effect. The variations are even more pronounced for multiplets and line ratios. We show the variation of the deduced physical parameters caused by the atomic alignment effect, including the alpha-to-iron ratio ([X/Fe]) and the ionisation fraction. Synthetic observations are performed to illustrate the visibility of this effect with current facilities. A study of PDRs in the ρ Ophiuchi cloud is presented to demonstrate how to account for atomic alignment in practice. Our work has shown that, owing to its potential impact, atomic alignment has to be included in an accurate spectroscopic analysis of the interstellar gas with current observational capability.

  17. Identification and classification of conserved RNA secondary structures in the human genome

    DEFF Research Database (Denmark)

    Pedersen, Jakob Skou; Bejerano, Gill; Siepel, Adam

    2006-01-01

    The discoveries of microRNAs and riboswitches, among others, have shown functional RNAs to be biologically more important and genomically more prevalent than previously anticipated. We have developed a general comparative genomics method based on phylogenetic stochastic context-free grammars...... for identifying functional RNAs encoded in the human genome and used it to survey an eight-way genome-wide alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebra-fish, and puffer-fish genomes for deeply conserved functional RNAs. At a loose threshold for acceptance, this search resulted in a set......, the results nevertheless provide evidence for many new human functional RNAs and present specific predictions to facilitate their further characterization....

  18. Boiler: lossy compression of RNA-seq alignments using coverage vectors.

    Science.gov (United States)

    Pritt, Jacob; Langmead, Ben

    2016-09-19

    We describe Boiler, a new software tool for compressing and querying large collections of RNA-seq alignments. Boiler discards most per-read data, keeping only a genomic coverage vector plus a few empirical distributions summarizing the alignments. Since most per-read data is discarded, storage footprint is often much smaller than that achieved by other compression tools. Despite this, the most relevant per-read data can be recovered; we show that Boiler compression has only a slight negative impact on results given by downstream tools for isoform assembly and quantification. Boiler also allows the user to pose fast and useful queries without decompressing the entire file. Boiler is free open source software available from github.com/jpritt/boiler. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
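
The idea of replacing per-read records with a genomic coverage vector can be sketched in a few lines of Python. This is only an illustration, not Boiler's file format or code: the half-open `(start, end)` interval representation, the run-length encoding step, and the names `coverage_vector` and `run_length_encode` are assumptions introduced here.

```python
def coverage_vector(alignments, length):
    """Per-base read depth over a reference of `length` bases.

    `alignments` holds 0-based, half-open (start, end) intervals that are
    assumed to lie within the reference. A difference array keeps the work
    linear in the number of reads plus the reference length.
    """
    diff = [0] * (length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    coverage, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


def run_length_encode(values):
    """Collapse consecutive equal depths into (depth, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


reads = [(0, 4), (2, 6), (2, 6)]
cov = coverage_vector(reads, 8)
print(cov)
print(run_length_encode(cov))
```

The sketch shows why such a summary is lossy: the two identical reads at `(2, 6)` and any read-level details such as names or qualities cannot be recovered from the depths alone, which is where the abstract's "few empirical distributions" would come in.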

  19. MUON DETECTORS: ALIGNMENT

    CERN Multimedia

    G.Gomez

    2010-01-01

    The main developments in muon alignment since March 2010 have been the production, approval and deployment of alignment constants for the ICHEP data reprocessing. In the barrel, a new geometry, combining information from both hardware and track-based alignment systems, has been developed for the first time. The hardware alignment provides an initial DT geometry, which is then anchored as a rigid solid, using the link alignment system, to a reference frame common to the tracker. The “GlobalPositionRecords” for both the Tracker and Muon systems are being used for the first time, and the initial tracker-muon relative positioning, based on the link alignment, yields good results within the photogrammetry uncertainties of the Tracker and alignment ring positions. For the first time, the optical and track-based alignments show good agreement between them; the optical alignment being refined by the track-based alignment. The resulting geometry is the most complete to date, aligning all 250 DTs, ...

  20. Advanced Whole-Genome Sequencing and Analysis of Fetal Genomes from Amniotic Fluid.

    Science.gov (United States)

    Mao, Qing; Chin, Robert; Xie, Weiwei; Deng, Yuqing; Zhang, Wenwei; Xu, Huixin; Zhang, Rebecca Yu; Shi, Quan; Peters, Erin E; Gulbahce, Natali; Li, Zhenyu; Chen, Fang; Drmanac, Radoje; Peters, Brock A

    2018-04-01

    Amniocentesis is a common procedure, the primary purpose of which is to collect cells from the fetus to allow testing for abnormal chromosomes, altered chromosomal copy number, or a small number of genes that have small single- to multibase defects. Here we demonstrate the feasibility of generating an accurate whole-genome sequence of a fetus from either the cellular or cell-free DNA (cfDNA) of an amniotic sample. cfDNA and DNA isolated from the cell pellet of 31 amniocenteses were sequenced to approximately 50× genome coverage by use of the Complete Genomics nanoarray platform. In a subset of the samples, long fragment read libraries were generated from DNA isolated from cells and sequenced to approximately 100× genome coverage. Concordance of variant calls between the 2 DNA sources and with parental libraries was >96%. Two fetal genomes were found to harbor potentially detrimental variants in chromodomain helicase DNA binding protein 8 (CHD8) and LDL receptor-related protein 1 (LRP1), variations of which have been associated with autism spectrum disorder and keratosis pilaris atrophicans, respectively. We also discovered drug sensitivities and carrier information of fetuses for a variety of diseases. We were able to elucidate the complete genome sequence of 31 fetuses from amniotic fluid and demonstrate that the cfDNA or DNA from the cell pellet can be analyzed with little difference in quality. We believe that current technologies could analyze this material in a highly accurate and complete manner and that analyses like these should be considered for addition to current amniocentesis procedures. © 2018 American Association for Clinical Chemistry.

  1. VERSE: a novel approach to detect virus integration in host genomes through reference genome customization.

    Science.gov (United States)

    Wang, Qingguo; Jia, Peilin; Zhao, Zhongming

    2015-01-01

    Fueled by widespread applications of high-throughput next generation sequencing (NGS) technologies and urgent need to counter threats of pathogenic viruses, large-scale studies were conducted recently to investigate virus integration in host genomes (for example, human tumor genomes) that may cause carcinogenesis or other diseases. A limiting factor in these studies, however, is rapid virus evolution and resulting polymorphisms, which prevent reads from aligning readily to commonly used virus reference genomes, and, accordingly, make virus integration sites difficult to detect. Another confounding factor is host genomic instability as a result of virus insertions. To tackle these challenges and improve our capability to identify cryptic virus-host fusions, we present a new approach that detects Virus intEgration sites through iterative Reference SEquence customization (VERSE). To the best of our knowledge, VERSE is the first approach to improve detection through customizing reference genomes. Using 19 human tumors and cancer cell lines as test data, we demonstrated that VERSE substantially enhanced the sensitivity of virus integration site detection. VERSE is implemented in the open source package VirusFinder 2 that is available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.

  2. Application of probabilistic modelling for the uncertainty evaluation of alignment measurements of large accelerator magnets assemblies

    Science.gov (United States)

    Doytchinov, I.; Tonnellier, X.; Shore, P.; Nicquevert, B.; Modena, M.; Mainaud Durand, H.

    2018-05-01

    Micrometric assembly and alignment requirements for future particle accelerators, and especially large assemblies, create the need for accurate uncertainty budgeting of alignment measurements. Measurements and uncertainties have to be accurately stated and traceable, to international standards, for metre-long sized assemblies, in the range of tens of µm. Indeed, these hundreds of assemblies will be produced and measured by several suppliers around the world, and will have to be integrated into a single machine. As part of the PACMAN project at CERN, we proposed and studied a practical application of probabilistic modelling of task-specific alignment uncertainty by applying a simulation by constraints calibration method. Using this method, we calibrated our measurement model using available data from ISO standardised tests (10360 series) for the metrology equipment. We combined this model with reference measurements and analysis of the measured data to quantify the actual specific uncertainty of each alignment measurement procedure. Our methodology was successfully validated against a calibrated and traceable 3D artefact as part of an international inter-laboratory study. The validated models were used to study the expected alignment uncertainty and important sensitivity factors in measuring the shortest and longest of the compact linear collider study assemblies, 0.54 m and 2.1 m respectively. In both cases, the laboratory alignment uncertainty was within the targeted uncertainty budget of 12 µm (68% confidence level). It was found that the remaining uncertainty budget for any additional alignment error compensations, such as the thermal drift error due to variation in machine operation heat load conditions, must be within 8.9 µm and 9.8 µm (68% confidence level) respectively.
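
The notion of a "remaining uncertainty budget" can be made concrete with a small Python sketch. The abstract does not say how the remaining budget was derived, so the sketch assumes the common convention that independent standard uncertainties combine in quadrature (root sum of squares). The function name `remaining_budget` and the 8.0 µm example value are illustrative and are not figures from the study.

```python
import math


def remaining_budget(total, used):
    """Largest additional independent uncertainty that keeps the combined
    standard uncertainty within `total`, assuming quadrature combination:
    total**2 = used**2 + remaining**2. All values share one unit (e.g. µm).
    """
    if used > total:
        raise ValueError("used uncertainty already exceeds the total budget")
    return math.sqrt(total ** 2 - used ** 2)


# Hypothetical example: a 12 µm target with 8.0 µm already consumed
# by the laboratory alignment measurement.
print(round(remaining_budget(12.0, 8.0), 2))
```

Under this assumption the remaining allowance shrinks slowly at first and then rapidly as the consumed share approaches the target, which is why compensations such as thermal drift need their own explicit budget.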

  3. Annotation of the protein coding regions of the equine genome

    DEFF Research Database (Denmark)

    Hestand, Matthew S.; Kalbfleisch, Theodore S.; Coleman, Stephen J.

    2015-01-01

    Current gene annotation of the horse genome is largely derived from in silico predictions and cross-species alignments. Only a small number of genes are annotated based on equine EST and mRNA sequences. To expand the number of equine genes annotated from equine experimental evidence, we sequenced m...... and appear to be small errors in the equine reference genome, since they are also identified as homozygous variants by genomic DNA resequencing of the reference horse. Taken together, we provide a resource of equine mRNA structures and protein coding variants that will enhance equine and cross...

  4. Genomic selection in small dairy cattle populations

    DEFF Research Database (Denmark)

    Thomasen, Jørn Rind

    on optimization of genomic selection for a small dairy cattle breed such as Danish Jersey. Implementing genetic superior breeding schemes thus requires more accurate genomic predictions. Besides international collaboration, genotyping of cows is an efficient way to obtain more accurate genomic predictions...

  5. Analysis Of Segmental Duplications In The Pig Genome Based On Next-Generation Sequencing

    DEFF Research Database (Denmark)

    Fadista, João; Bendixen, Christian

    Segmental duplications are >1kb segments of duplicated DNA present in a genome with high sequence identity (>90%). They are associated with genomic rearrangements and provide a significant source of gene and genome evolution within mammalian genomes. Although segmental duplications have been...... extensively studied in other organisms, its analysis in pig has been hampered by the lack of a complete pig genome assembly. By measuring the depth of coverage of Illumina whole-genome shotgun sequencing reads of the Tabasco animal aligned to the latest pig genome assembly (Sus scrofa 10 – based also...... and their associated copy number alterations, focusing on the global organization of these segments and their possible functional significance in porcine phenotypes. This work provides insights into mammalian genome evolution and generates a valuable resource for porcine genomics research...

  6. Gramene 2018: unifying comparative genomics and pathway resources for plant research

    OpenAIRE

    Tello-Ruiz, Marcela K; Naithani, Sushma; Stein, Joshua C; Gupta, Parul; Campbell, Michael; Olson, Andrew; Wei, Sharon; Preece, Justin; Geniza, Matthew J; Jiao, Yinping; Lee, Young Koung; Wang, Bo; Mulvaney, Joseph; Chougule, Kapeel; Elser, Justin

    2017-01-01

    Gramene (http://www.gramene.org) is a knowledgebase for comparative functional analysis in major crops and model plant species. The current release, #54, includes over 1.7 million genes from 44 reference genomes, most of which were organized into 62,367 gene families through orthologous and paralogous gene classification, whole-genome alignments, and synteny. Additional gene annotations include ontology-based protein structure and function; genetic, epigenetic, and phenotypic diversi...

  7. Phylogenetic tree based on complete genomes using fractal and correlation analyses without sequence alignment

    Directory of Open Access Journals (Sweden)

    Zu-Guo Yu

    2006-06-01

    The complete genomes of living organisms have provided much information on their phylogenetic relationships. Similarly, the complete genomes of chloroplasts have helped resolve the evolution of this organelle in photosynthetic eukaryotes. In this review, we describe two algorithms to construct phylogenetic trees based on the theories of fractals and dynamic language using complete genomes. These algorithms were developed by our research group in the past few years. Our distance-based phylogenetic tree of 109 prokaryotes and eukaryotes agrees with the biologists' "tree of life" based on the 16S-like rRNA genes in a majority of basic branchings and most lower taxa. Our phylogenetic analysis also shows that the chloroplast genomes are separated into two major clades corresponding to chlorophytes s.l. and rhodophytes s.l. The interrelationships among the chloroplasts are largely in agreement with the current understanding on chloroplast evolution.
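
Neither of the two algorithms is spelled out in the abstract, so the Python sketch below uses a deliberately simpler stand-in from the same alignment-free family: it compares k-mer frequency profiles with a cosine distance. The choice of k = 3, the cosine measure, and the names `kmer_profile` and `cosine_distance` are assumptions for illustration and should not be read as the authors' fractal or dynamic-language methods.

```python
import math
from collections import Counter


def kmer_profile(seq, k=3):
    """Relative frequency of each overlapping k-mer in `seq`.

    Assumes len(seq) >= k so that at least one k-mer exists.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


a = kmer_profile("ACGTACGTACGTAAAC")
b = kmer_profile("ACGTACGTACGTTTTC")
c = kmer_profile("GGGGGGCCCCCCGGGG")
print(cosine_distance(a, b) < cosine_distance(a, c))
```

A matrix of such pairwise distances could then be passed to any distance-based tree-building method, with no sequence alignment step required.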

  8. An improved image alignment procedure for high-resolution transmission electron microscopy.

    Science.gov (United States)

    Lin, Fang; Liu, Yan; Zhong, Xiaoyan; Chen, Jianghua

    2010-06-01

    Image alignment is essential for image processing methods such as through-focus exit-wavefunction reconstruction and image averaging in high-resolution transmission electron microscopy. Relative image displacements exist in any experimentally recorded image series owing to specimen drift and image shift, so image alignment to correct these displacements has to be done prior to any further image processing. The image displacement between two successive images is determined by the correlation function of the two relatively shifted images. Here it is shown that more accurate image alignment can be achieved by using an appropriate aperture to filter the high-frequency components of the images being aligned, especially for a crystalline specimen with little non-periodic information. For image series of crystalline specimens with little amorphous material, the radius of the filter aperture should be as small as possible, so long as it covers the innermost lattice reflections. Using an experimental through-focus series of Si[110] images, the accuracies of image alignment with different correlation functions are compared with respect to the error functions in through-focus exit-wavefunction reconstruction based on the maximum-likelihood method. In tests of image averaging over noisy experimental images from graphene and carbon-nanotube samples, clear and sharp crystal lattice fringes are recovered after applying optimal image alignment. Copyright 2010 Elsevier Ltd. All rights reserved.

  9. Widespread of horizontal gene transfer in the human genome.

    Science.gov (United States)

    Huang, Wenze; Tsai, Lillian; Li, Yulong; Hua, Nan; Sun, Chen; Wei, Chaochun

    2017-04-04

    A fundamental concept in biology is that heritable material is passed from parents to offspring, a process called vertical gene transfer. An alternative mechanism of gene acquisition is horizontal gene transfer (HGT), which involves movement of genetic material between different species. Horizontal gene transfer has been found to be prevalent in prokaryotes but very rare in eukaryotes. In this paper, we investigate horizontal gene transfer in the human genome. From the pair-wise alignments between the human genome and 53 vertebrate genomes, 1,467 human genome regions (2.6 M bases) from all chromosomes were found to be more conserved with non-mammals than with most mammals. These human genome regions involve 642 known genes, which are enriched with ion binding. There was little overlap with known horizontal gene transfer regions in the human genome, which indicates that horizontal gene transfer is more common in the human genome than we expected. Horizontal gene transfer impacts hundreds of human genes, and this study provides insight into potential mechanisms of HGT in the human genome.

  10. Multiple alignment analysis on phylogenetic tree of the spread of SARS epidemic using distance method

    Science.gov (United States)

    Amiroch, S.; Pradana, M. S.; Irawan, M. I.; Mukhlash, I.

    2017-09-01

    Multiple alignment (MA) is a particularly important tool for studying a viral genome and determining the evolutionary process of a specific virus. Applying MA to the spread of the severe acute respiratory syndrome (SARS) epidemic is of interest because the epidemic spread so quickly a few years ago that it drew medical attention in many countries. Although much software exists for processing multiple sequences, the pairwise alignment used to build the MA is important to consider. Previous research built the MA from pairwise alignments computed with the Super Pairwise Alignment algorithm; this study instead uses the Needleman-Wunsch dynamic programming algorithm, simulated in Matlab. The MA analysis yields stable and unstable regions, which indicate the positions where mutations occur; the network topology produced by the phylogenetic tree of the SARS epidemic using the distance method; and the network of mutation areas.
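
    For readers unfamiliar with the pairwise step, the Needleman-Wunsch dynamic programming algorithm can be sketched as below. The study used Matlab; this Python version is only an illustration, and the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions, not the parameters used by the authors.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Usage: align two short nucleotide strings.
best, top, bottom = needleman_wunsch("ACGT", "ACT")
```

    A distance-based tree would then be built from pairwise scores or distances derived from alignments like this one; that step is not shown.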

  11. The SOAPS project – Spin-orbit alignment of planetary systems

    Directory of Open Access Journals (Sweden)

    Hebb L.

    2013-04-01

    The wealth of information rendered by Kepler planets and planet candidates is indispensable for statistically significant studies of distinct planet populations, in both single and multiple systems. Empirical evidence suggests that Kepler's planet population shows different physical properties compared with the bulk of known exoplanets. The SOAPS project aims to shed light on the formation of Kepler's planets, their migration and their architecture. By accurately measuring v sin i for Kepler hosts with rotation periods measured from their high-precision light curves, we will assess the alignment of the planetary orbit with respect to the stellar spin axis. This degree of alignment traces the formation history and evolution of the planetary systems and thus allows us to distinguish between different proposed migration theories. SOAPS will increase the number of spin-orbit alignment measurements by a factor of 2, pushing the parameter space down to the super-Earth domain. Here we present our preliminary results.
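
    The way v sin i and a rotation period constrain the stellar inclination can be sketched with the standard relation sin i = (v sin i) P_rot / (2 pi R_star). The abstract does not give the project's actual procedure; the function below, its name, its units, and the need for a stellar radius estimate are assumptions for illustration.

```python
import math

R_SUN_KM = 695_700.0        # nominal solar radius in km
SECONDS_PER_DAY = 86_400.0

def sin_inclination(vsini_kms, p_rot_days, r_star_rsun):
    """Return sin(i) of the stellar spin axis.

    vsini_kms    projected rotational velocity v sin i, in km/s
    p_rot_days   rotation period from the light curve, in days
    r_star_rsun  stellar radius in solar radii (assumed to be known)
    """
    # Equatorial rotation speed v = 2 * pi * R / P, in km/s.
    v_eq = 2.0 * math.pi * r_star_rsun * R_SUN_KM / (p_rot_days * SECONDS_PER_DAY)
    return vsini_kms / v_eq


# Usage: a Sun-sized star with a 10-day period, seen at half its equatorial speed.
v_equatorial = 2.0 * math.pi * R_SUN_KM / (10.0 * SECONDS_PER_DAY)
s = sin_inclination(v_equatorial / 2.0, 10.0, 1.0)
```

    For a transiting planet the orbit is seen nearly edge-on, so sin i close to 1 is consistent with spin-orbit alignment, while a clearly smaller value indicates a tilted spin axis. Measurement errors can push the ratio above 1, so a real analysis would propagate uncertainties rather than use the point estimate alone.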

  12. A novel micromachined shadow mask system with self-alignment and gap control capability

    International Nuclear Information System (INIS)

    Hong, Jung Moo; Zou Jun

    2008-01-01

    We present a novel micromachined shadow mask system, which is capable of accurate self-alignment and mask-substrate gap control. The shadow mask system consists of a silicon shadow mask and a silicon carrier wafer with pyramidal cavities fabricated with bulk micromachining. Self-alignment and gap control of the shadow mask and the fabrication substrate can readily be achieved by using matching pairs of pyramidal cavities and steel spheres placed between. The layer-to-layer alignment accuracy of the new shadow mask system has been experimentally characterized and verified using both optical and atomic force microscopic measurements. As an application of this new shadow mask system, an organic thin-film transistor (OTFT) using pentacene as the semiconductor layer has been successfully fabricated and tested

  13. Whole-genome modeling accurately predicts quantitative traits, as revealed in plants.

    OpenAIRE

    Tatarinova, Tatiana; Shin, Min-Gyoung; Marjoram, Paul; Nuzhdin, Sergey; Triska, Martin; Rickauer, Martina; Nikolsky, Yuri; Mazurier, Melanie; Gentzbittel, Laurent; Ben, Cecile

    2016-01-01

    Many adaptive events in natural populations, as well as response to artificial selection, are caused by polygenic action. Under selective pressure, the adaptive traits can quickly respond via small allele frequency shifts spread across numerous loci. We hypothesize that a large proportion of current phenotypic variation between individuals may be best explained by population admixture. We thus consider the complete, genome-wide universe of genetic variability, spread across several ancestral ...

  14. Improving model construction of profile HMMs for remote homology detection through structural alignment

    Directory of Open Access Journals (Sweden)

    Zaverucha Gerson

    2007-11-01

    Background: Remote homology detection is a challenging problem in Bioinformatics. Arguably, profile Hidden Markov Models (pHMMs) are one of the most successful approaches in addressing this important problem. pHMM packages present a relatively small computational cost, and perform particularly well at recognizing remote homologies. This raises the question of whether structural alignments could impact the performance of pHMMs trained from proteins in the Twilight Zone, as structural alignments are often more accurate than sequence alignments at identifying motifs and functional residues. Next, we assess the impact of using structural alignments in pHMM performance. Results: We used the SCOP database to perform our experiments. Structural alignments were obtained using the 3DCOFFEE and MAMMOTH-mult tools; sequence alignments were obtained using CLUSTALW, TCOFFEE, MAFFT and PROBCONS. We performed leave-one-family-out cross-validation over super-families. Performance was evaluated through ROC curves and paired two tailed t-test. Conclusion: We observed that pHMMs derived from structural alignments performed significantly better than pHMMs derived from sequence alignment in low-identity regions, mainly below 20%. We believe this is because structural alignment tools are better at focusing on the important patterns that are more often conserved through evolution, resulting in higher quality pHMMs. On the other hand, sensitivity of these tools is still quite low for these low-identity regions. Our results suggest a number of possible directions for improvements in this area.

  15. Improving model construction of profile HMMs for remote homology detection through structural alignment.

    Science.gov (United States)

    Bernardes, Juliana S; Dávila, Alberto M R; Costa, Vítor S; Zaverucha, Gerson

    2007-11-09

    Remote homology detection is a challenging problem in Bioinformatics. Arguably, profile Hidden Markov Models (pHMMs) are one of the most successful approaches in addressing this important problem. pHMM packages present a relatively small computational cost, and perform particularly well at recognizing remote homologies. This raises the question of whether structural alignments could impact the performance of pHMMs trained from proteins in the Twilight Zone, as structural alignments are often more accurate than sequence alignments at identifying motifs and functional residues. Next, we assess the impact of using structural alignments in pHMM performance. We used the SCOP database to perform our experiments. Structural alignments were obtained using the 3DCOFFEE and MAMMOTH-mult tools; sequence alignments were obtained using CLUSTALW, TCOFFEE, MAFFT and PROBCONS. We performed leave-one-family-out cross-validation over super-families. Performance was evaluated through ROC curves and paired two tailed t-test. We observed that pHMMs derived from structural alignments performed significantly better than pHMMs derived from sequence alignment in low-identity regions, mainly below 20%. We believe this is because structural alignment tools are better at focusing on the important patterns that are more often conserved through evolution, resulting in higher quality pHMMs. On the other hand, sensitivity of these tools is still quite low for these low-identity regions. Our results suggest a number of possible directions for improvements in this area.

  16. Theoretical and practical feasibility demonstration of a micrometric remotely controlled pre-alignment system for the CLIC linear collider

    CERN Document Server

    Mainaud Durand, H; Chritin, N; Griffet, S; Kemppinen, J; Sosin, M; Touze, T

    2011-01-01

    The active pre-alignment of the Compact Linear Collider (CLIC) is one of the key points of the project: the components must be pre-aligned w.r.t. a straight line within a few microns over a sliding window of 200 m, along the two linacs of 20 km each. The proposed solution consists of stretched wires of more than 200 m, overlapping over half of their length, which will be the reference of alignment. Wire Positioning Sensors (WPS), coupled to the supports to be pre-aligned, will perform precise and accurate measurements within a few microns w.r.t. these wires. A micrometric fiducialisation of the components and a micrometric alignment of the components on common supports will make the strategy of pre-alignment complete. In this paper, the global strategy of active pre-alignment is detailed and illustrated by the latest results demonstrating the feasibility of the proposed solution.

  17. MUON DETECTORS: ALIGNMENT

    CERN Multimedia

    G.Gomez

    2010-01-01

    Most of the work in muon alignment since December 2009 has focused on the geometry reconstruction from the optical systems and improvements in the internal alignment of the DT chambers. The barrel optical alignment system has progressively evolved from reconstruction of single active planes to super-planes (December 09) to a new, full barrel reconstruction. Initial validation studies comparing this full barrel alignment at 0T with photogrammetry provide promising results. In addition, the method has been applied to CRAFT09 data, and the resulting alignment at 3.8T yields residuals from tracks (extrapolated from the tracker) which look smooth, suggesting a good internal barrel alignment with a small overall offset with respect to the tracker. This is a significant improvement, which should allow the optical system to provide a start-up alignment for 2010. The end-cap optical alignment has made considerable progress in the analysis of transfer line data. The next set of alignment constants for CSCs will there...

  18. Pigs in sequence space: A 0.66X coverage pig genome survey based on shotgun sequencing

    DEFF Research Database (Denmark)

    Wernersson, Rasmus; Schierup, M.H.; Jorgensen, F.G.

    2005-01-01

    sequences (0.66X coverage) from the pig genome. The data are hereby released (NCBI Trace repository with center name "SDJVP", and project name "Sino-Danish Pig Genome Project") together with an initial evolutionary analysis. The non-repetitive fraction of the sequences was aligned to the UCSC human...

  19. Genome-wide detection of selection and other evolutionary forces

    DEFF Research Database (Denmark)

    Xu, Zhuofei; Zhou, Rui

    2015-01-01

    As is well known, pathogenic microbes evolve rapidly to escape from the host immune system and antibiotics. Genetic variations among microbial populations occur frequently during the long-term pathogen–host evolutionary arms race, and individual mutation beneficial for the fitness can be fixed...... to scan genome-wide alignments for evidence of positive Darwinian selection, recombination, and other evolutionary forces operating on the coding regions. In this chapter, we describe an integrative analysis pipeline and its application to tracking featured evolutionary trajectories on the genome...

  20. Fractal MapReduce decomposition of sequence alignment

    Directory of Open Access Journals (Sweden)

    Almeida Jonas S

    2012-05-01

    Background: The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required. Results: In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique, that has been found to provide an "alignment-free" solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units: with no resort to dynamic programming. Conclusions: The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing. Availability: Public distribution of accompanying software library with open source and version control at http://usm.github.com. Also available as a webApp through Google Chrome's WebStore http://chrome.google.com/webstore: search with "usm".
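
    The iterated map behind the Chaos Game Representation can be sketched in a few lines. This is a generic CGR for DNA, not the USM library referenced above: the corner assignment, the starting point (0.5, 0.5), and the function name are assumptions made for the example.

```python
# Each nucleotide is assigned a corner of the unit square (an assumed layout).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_coordinates(seq, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after each symbol of seq.

    Every step moves the current point halfway towards the corner of the
    symbol just read, so the final point encodes the recent sequence history.
    """
    x, y = start
    coords = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        coords.append((x, y))
    return coords


# Usage: two sequences that end in the same 4 symbols land close together.
end_a = cgr_coordinates("GGGGACGT")[-1]
end_b = cgr_coordinates("TTTTACGT")[-1]
distance = max(abs(end_a[0] - end_b[0]), abs(end_a[1] - end_b[1]))
```

    Because each step halves the distance between two trajectories that read the same symbol, sequences sharing a suffix of length k end up within 2^-k of each other. Reading similar-segment lengths off coordinate distances in this way needs no dynamic programming, and each position can be compared independently, which is what makes a map step natural.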

  1. Rotational Mal-Alignment after Reamed Intramedullary Nailing for tibial shaft fracture.

    Science.gov (United States)

    Khan, Sher Baz; Mohib, Yasir; Rashid, Rizwan Haroon; Rashid, Haroonur

    2016-10-01

    Intra-medullary (IM) nailing is the standard of care for unstable tibial shaft fractures. Malrotation is very common but under-recognised, in part because of variation in normal anatomy and in part because of the difficulty of accurately assessing rotation. This study was planned to evaluate the frequency of rotational mal-alignment after reamed tibial IM nailing. This cross-sectional study was conducted at Aga Khan University Hospital, Karachi, and comprised patients with tibial shaft fractures managed with IM nailing from January to December 2014. All the patients were assessed intra-operatively for rotational alignment using knee and ankle fluoroscopic images. There were 81 patients with a mean age of 38±16.9 years. There were 64 (79%) male patients. Overall, malrotation occurred in 20 (24.7%) cases. Rotational mal-alignment is a preventable complication of IM nailing that can be assessed intra-operatively under fluoroscopy.

  2. Single-molecule approach to bacterial genomic comparisons via optical mapping.

    Energy Technology Data Exchange (ETDEWEB)

    Zhou, Shiguo [Univ. Wisc.-Madison; Kile, A. [Univ. Wisc.-Madison; Bechner, M. [Univ. Wisc.-Madison; Kvikstad, E. [Univ. Wisc.-Madison; Deng, W. [Univ. Wisc.-Madison; Wei, J. [Univ. Wisc.-Madison; Severin, J. [Univ. Wisc.-Madison; Runnheim, R. [Univ. Wisc.-Madison; Churas, C. [Univ. Wisc.-Madison; Forrest, D. [Univ. Wisc.-Madison; Dimalanta, E. [Univ. Wisc.-Madison; Lamers, C. [Univ. Wisc.-Madison; Burland, V. [Univ. Wisc.-Madison; Blattner, F. R. [Univ. Wisc.-Madison; Schwartz, David C. [Univ. Wisc.-Madison

    2004-01-01

    Modern comparative genomics has been established, in part, by the sequencing and annotation of a broad range of microbial species. To gain further insights, new sequencing efforts are now dealing with the variety of strains or isolates that gives a species definition and range; however, this number vastly outstrips our ability to sequence them. Given the availability of a large number of microbial species, new whole genome approaches must be developed to fully leverage this information at the level of strain diversity that maximize discovery. Here, we describe how optical mapping, a single-molecule system, was used to identify and annotate chromosomal alterations between bacterial strains represented by several species. Since whole-genome optical maps are ordered restriction maps, sequenced strains of Shigella flexneri serotype 2a (2457T and 301), Yersinia pestis (CO 92 and KIM), and Escherichia coli were aligned as maps to identify regions of homology and to further characterize them as possible insertions, deletions, inversions, or translocations. Importantly, an unsequenced Shigella flexneri strain (serotype Y strain AMC[328Y]) was optically mapped and aligned with two sequenced ones to reveal one novel locus implicated in serotype conversion and several other loci containing insertion sequence elements or phage-related gene insertions. Our results suggest that genomic rearrangements and chromosomal breakpoints are readily identified and annotated against a prototypic sequenced strain by using the tools of optical mapping.

  3. Genome Context Viewer: visual exploration of multiple annotated genomes using microsynteny.

    Science.gov (United States)

    Cleary, Alan; Farmer, Andrew

    2018-05-01

    The Genome Context Viewer is a visual data-mining tool that allows users to search across multiple providers of genome data for regions with similarly annotated content that may be aligned and visualized at the level of their shared functional elements. By handling ordered sequences of gene family memberships as a unit of search and comparison, the user interface enables quick and intuitive assessment of the degree of gene content divergence and the presence of various types of structural events within syntenic contexts. Insights into functionally significant differences seen at this level of abstraction can then serve to direct the user to more detailed explorations of the underlying data in other interconnected, provider-specific tools. GCV is provided under the GNU General Public License version 3 (GPL-3.0). Source code is available at https://github.com/legumeinfo/lis_context_viewer. Contact: adf@ncgr.org. Supplementary data are available at Bioinformatics online.

  4. Hapsembler: An Assembler for Highly Polymorphic Genomes

    Science.gov (United States)

    Donmez, Nilgun; Brudno, Michael

    As whole genome sequencing has become a routine biological experiment, algorithms for assembly of whole genome shotgun data have become a topic of extensive research, with a plethora of off-the-shelf methods that can reconstruct the genomes of many organisms. At the same time, several recently sequenced genomes exhibit very high polymorphism rates. For these organisms genome assembly remains a challenge, as most assemblers are unable to handle highly divergent haplotypes in a single individual. In this paper we describe Hapsembler, an assembler for highly polymorphic genomes, which makes use of paired reads. Our experiments show that Hapsembler produces accurate and contiguous assemblies of highly polymorphic genomes, while performing on par with the leading tools on haploid genomes. Hapsembler is available for download at http://compbio.cs.toronto.edu/hapsembler.

  5. Alignment of the ATLAS Inner Detector

    CERN Document Server

    Marti-Garcia, Salvador; The ATLAS collaboration

    2016-01-01

    Run-2 of the LHC has presented new challenges to track and vertex reconstruction, with higher energies, denser jets and higher rates. In addition, the Insertable B-layer (IBL), a fourth pixel layer, was deployed at the centre of ATLAS during Long Shutdown 1 of the LHC. The physics performance of the experiment requires a high-resolution and unbiased measurement of all charged particle kinematic parameters. In turn, the performance of the tracking depends, among many other issues, on the accurate determination of the alignment parameters of the tracking sensors. The offline track-based alignment of the ATLAS tracking system has to deal with more than 700,000 degrees of freedom (DoF). This represents a considerable numerical challenge in terms of both CPU time and precision. During Run-2, a mechanical distortion of the IBL staves of up to 20 μm has been observed during data-taking, plus other short time scale movements. The talk will also describe the procedures implemented to detect and remo...

  6. Genomic sequencing of Pleistocene cave bears

    Energy Technology Data Exchange (ETDEWEB)

    Noonan, James P.; Hofreiter, Michael; Smith, Doug; Priest, James R.; Rohland, Nadin; Rabeder, Gernot; Krause, Johannes; Detter, J. Chris; Paabo, Svante; Rubin, Edward M.

    2005-04-01

    Despite the information content of genomic DNA, ancient DNA studies to date have largely been limited to amplification of mitochondrial DNA due to technical hurdles such as contamination and degradation of ancient DNAs. In this study, we describe two metagenomic libraries constructed using unamplified DNA extracted from the bones of two 40,000-year-old extinct cave bears. Analysis of ~1 Mb of sequence from each library showed that, despite significant microbial contamination, 5.8 percent and 1.1 percent of clones in the libraries contain cave bear inserts, yielding 26,861 bp of cave bear genome sequence. Alignment of this sequence to the dog genome, the closest sequenced genome to cave bear in terms of evolutionary distance, revealed roughly the expected ratio of cave bear exons, repeats and conserved noncoding sequences. Only 0.04 percent of all clones sequenced were derived from contamination with modern human DNA. Comparison of cave bear with orthologous sequences from several modern bear species revealed the evolutionary relationship of these lineages. Using the metagenomic approach described here, we have recovered substantial quantities of mammalian genomic sequence more than twice as old as any previously reported, establishing the feasibility of ancient DNA genomic sequencing programs.

  7. Whole-genome shotgun optical mapping of Rhodospirillum rubrum

    Energy Technology Data Exchange (ETDEWEB)

    Reslewic, Susan; Zhou, Shiguo; Place, Mike; Zhang, Yaoping; Briska, Adam; Goldstein, Steve; Churas, Chris; Runnheim, Rod; Forrest, Dan; Lim, Alex; Lapidus, Alla; Han, Cliff S.; Roberts, Gary P.; Schwartz, David C.

    2004-07-01

    Rhodospirillum rubrum is a phototrophic purple non-sulfur bacterium known for its unique and well-studied nitrogen fixation and carbon monoxide oxidation systems, and as a source of hydrogen and biodegradable plastics production. To better understand this organism and to facilitate assembly of its sequence, three whole-genome restriction maps (Xba I, Nhe I, and Hind III) of R. rubrum strain ATCC 11170 were created by optical mapping. Optical mapping is a system for creating whole-genome ordered restriction maps from randomly sheared genomic DNA molecules extracted directly from cells. During the sequence finishing process, all three optical maps confirmed a putative error in sequence assembly, while the Hind III map acted as a scaffold for high resolution alignment with sequence contigs spanning the whole genome. In addition to highlighting optical mapping's role in the assembly and validation of genome sequence, our work underscores the unique niche in resolution occupied by the optical mapping system. With a resolution ranging from 6.5 kb (previously published) to 45 kb (reported here), optical mapping advances a "molecular cytogenetics" approach to solving problems in genomic analysis.

  8. Sensing Characteristics of A Precision Aligner Using Moire Gratings for Precision Alignment System

    Institute of Scientific and Technical Information of China (English)

    ZHOU Lizhong; Hideo Furuhashi; Yoshiyuki Uchida

    2001-01-01

    The sensing characteristics of a precision aligner using moire gratings for a precision alignment system have been investigated. A differential moire alignment system and a modified alignment system were used. The influence of the setting accuracy of the gap length and of the inclination of the gratings on the alignment accuracy has been studied experimentally and theoretically. A gap-length setting accuracy of less than 2.5 μm is required in modified moire alignment. The gap length has no influence on the alignment accuracy in the differential alignment system. The inclination affects the alignment accuracy in both the differential and modified moire alignment systems.

  9. MUON DETECTORS: ALIGNMENT

    CERN Multimedia

    Z. Szillasi and G. Gomez.

    2013-01-01

    When CMS is opened up, major components of the Link and Barrel Alignment systems will be removed. This operation, besides allowing for maintenance of the detector underneath, is needed for making interventions that will reinforce the alignment measurements and make the operation of the alignment system more reliable. For that purpose and also for their general maintenance and recalibration, the alignment components will be transferred to the Alignment Lab situated in the ISR area. For the track-based alignment, attention is focused on the determination of systematic uncertainties, which have become dominant, since now there is a large statistics of muon tracks. This will allow for an improved Monte Carlo misalignment scenario and updated alignment position errors, crucial for high-momentum muon analysis such as Z′ searches.

  10. Triangular Alignment (TAME). A Tensor-based Approach for Higher-order Network Alignment

    Energy Technology Data Exchange (ETDEWEB)

    Mohammadi, Shahin [Purdue Univ., West Lafayette, IN (United States); Gleich, David F. [Purdue Univ., West Lafayette, IN (United States); Kolda, Tamara G. [Sandia National Laboratories (SNL-CA), Livermore, CA (United States); Grama, Ananth [Purdue Univ., West Lafayette, IN (United States)

    2015-11-01

    Network alignment is an important tool with extensive applications in comparative interactomics. Traditional approaches aim to simultaneously maximize the number of conserved edges and the underlying similarity of aligned entities. We propose a novel formulation of the network alignment problem that extends topological similarity to higher-order structures and provide a new objective function that maximizes the number of aligned substructures. This objective function corresponds to an integer programming problem, which is NP-hard. Consequently, we approximate this objective function as a surrogate function whose maximization results in a tensor eigenvalue problem. Based on this formulation, we present an algorithm called Triangular AlignMEnt (TAME), which attempts to maximize the number of aligned triangles across networks. We focus on alignment of triangles because of their enrichment in complex networks; however, our formulation and resulting algorithms can be applied to general motifs. Using a case study on the NAPABench dataset, we show that TAME is capable of producing alignments with up to 99% accuracy in terms of aligned nodes. We further evaluate our method by aligning yeast and human interactomes. Our results indicate that TAME outperforms the state-of-the-art alignment methods in terms of both the biological and the topological quality of the alignments.
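
    The quantity that TAME tries to maximize, the number of aligned triangles, can be made concrete with a small sketch. The code below only evaluates that count for a given node mapping by brute force; it does not implement the tensor eigenvalue relaxation described in the abstract, and the function names and edge-list input format are assumptions for illustration.

```python
from itertools import combinations

def triangles(edges):
    """Return the set of triangles (as frozensets of 3 nodes) in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = set()
    # Brute-force enumeration; fine for small illustrative graphs only.
    for u, v, w in combinations(sorted(adj), 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            found.add(frozenset((u, v, w)))
    return found

def aligned_triangles(edges_g, edges_h, mapping):
    """Count triangles of G whose images under `mapping` are triangles of H."""
    tris_h = triangles(edges_h)
    count = 0
    for tri in triangles(edges_g):
        if all(node in mapping for node in tri):
            image = frozenset(mapping[node] for node in tri)
            if len(image) == 3 and image in tris_h:
                count += 1
    return count


# Usage: G has one triangle (1, 2, 3) plus a pendant node; H is a single triangle.
g = [(1, 2), (2, 3), (1, 3), (3, 4)]
h = [("a", "b"), ("b", "c"), ("a", "c")]
good = aligned_triangles(g, h, {1: "a", 2: "b", 3: "c"})
bad = aligned_triangles(g, h, {1: "a", 2: "b", 4: "c"})
```

    Searching over all mappings for the one that maximizes this count is the NP-hard integer program mentioned above, which is why the authors turn to a surrogate objective.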

  11. Induced alignment and measurement of dipolar couplings of an SH2 domain through direct binding with filamentous phage

    International Nuclear Information System (INIS)

    Dahlke Ojennus, Deanna; Mitton-Fry, Rachel M.; Wuttke, Deborah S.

    1999-01-01

    Large residual ¹⁵N-¹H dipolar couplings have been measured in a Src homology II domain aligned at Pf1 bacteriophage concentrations an order of magnitude lower than used for induction of a similar degree of alignment of nucleic acids and highly acidic proteins. An increase in ¹H and ¹⁵N protein linewidths and a decrease in T₂ and T₁ρ relaxation time constants implicates a binding interaction between the protein and phage as the mechanism of alignment. However, the associated increased linewidth does not preclude the accurate measurement of large dipolar couplings in the aligned protein. A good correlation is observed between measured dipolar couplings and predicted values based on the high resolution NMR structure of the SH2 domain. The observation of binding-induced protein alignment promises to broaden the scope of alignment techniques by extending their applicability to proteins that are able to interact weakly with the alignment medium.

  12. Estimates of statistical significance for comparison of individual positions in multiple sequence alignments

    Directory of Open Access Journals (Sweden)

    Sadreyev Ruslan I

    2004-08-01

    Background: Profile-based analysis of multiple sequence alignments (MSA) allows for accurate comparison of protein families. Here, we address the problems of detecting statistically confident dissimilarities between (1) MSA position and a set of predicted residue frequencies, and (2) between two MSA positions. These problems are important for (i) evaluation and optimization of methods predicting residue occurrence at protein positions; (ii) detection of potentially misaligned regions in automatically produced alignments and their further refinement; and (iii) detection of sites that determine functional or structural specificity in two related families. Results: For problems (1) and (2), we propose analytical estimates of P-value and apply them to the detection of significant positional dissimilarities in various experimental situations. (a) We compare structure-based predictions of residue propensities at a protein position to the actual residue frequencies in the MSA of homologs. (b) We evaluate our method by the ability to detect erroneous position matches produced by an automatic sequence aligner. (c) We compare MSA positions that correspond to residues aligned by automatic structure aligners. (d) We compare MSA positions that are aligned by high-quality manual superposition of structures. Detected dissimilarities reveal shortcomings of the automatic methods for residue frequency prediction and alignment construction. For the high-quality structural alignments, the dissimilarities suggest sites of potential functional or structural importance. Conclusion: The proposed computational method is of significant potential value for the analysis of protein families.

  13. MUON DETECTORS: ALIGNMENT

    CERN Multimedia

    G. Gomez and J. Pivarski

    2011-01-01

    Alignment efforts in the first few months of 2011 have shifted away from providing alignment constants (now a well established procedure) and focussed on some critical remaining issues. The single most important task left was to understand the systematic differences observed between the track-based (TB) and hardware-based (HW) barrel alignments: a systematic difference in r-φ and in z, which grew as a function of z, and which amounted to ~4-5 mm differences going from one end of the barrel to the other. This difference is now understood to be caused by the tracker alignment. The systematic differences disappear when the track-based barrel alignment is performed using the new “twist-free” tracker alignment. This removes the largest remaining source of systematic uncertainty. Since the barrel alignment is based on hardware, it does not suffer from the tracker twist. However, untwisting the tracker causes endcap disks (which are aligned ...

  14. MUON DETECTORS: ALIGNMENT

    CERN Multimedia

    Gervasio Gomez

    The main progress of the muon alignment group since March has been in the refinement of both the track-based alignment for the DTs and the hardware-based alignment for the CSCs. For DT track-based alignment, there has been significant improvement in the internal alignment of the superlayers inside the DTs. In particular, the distance between superlayers is now corrected, eliminating the residual dependence on track impact angles, and good agreement is found between survey and track-based corrections. The new internal geometry has been approved to be included in the forthcoming reprocessing of CRAFT samples. The alignment of DTs with respect to the tracker using global tracks has also improved significantly, since the algorithms use the latest B-field mapping, better run selection criteria, optimized momentum cuts, and an alignment is now obtained for all six degrees of freedom (three spatial coordinates and three rotations) of the aligned DTs. This work is ongoing and at a stage where we are trying to unders...

  15. Alignment of dipole magnet in micro-beam line of HIRFL

    International Nuclear Information System (INIS)

    Wang Shaoming; Chen Wenjun; Yang Shengli; Cai Guozhu; Guo Yizhen; Zhou Guangming; Man Kaidi; Song Mingtao

    2010-01-01

    A microbeam irradiation facility is an experimental platform that can reduce the beam spot on the irradiated sample to the micrometer level and can accurately locate and count the radioactive particles. It is a powerful research tool for irradiation materials science, irradiation biology, irradiation biomedicine and micro-mechanical machining. The facility requires precise installation and alignment. Because the space available for the dipole magnets in the micro-beam line of HIRFL (Heavy Ion Research Facility in Lanzhou) is very small, adjusting the direction and position of the magnets is difficult, which makes their installation and alignment a challenge. The problem was solved by transforming the coordinates of the magnets' benchmarks, which kept the magnet setup error within the tolerance range. (authors)

  16. Genome size variation in the genus Avena.

    Science.gov (United States)

    Yan, Honghai; Martin, Sara L; Bekele, Wubishet A; Latta, Robert G; Diederichsen, Axel; Peng, Yuanying; Tinker, Nicholas A

    2016-03-01

    Genome size is an indicator of evolutionary distance and a metric for genome characterization. Here, we report accurate estimates of genome size in 99 accessions from 26 species of Avena. We demonstrate that the average genome size of C genome diploid species (2C = 10.26 pg) is 15% larger than that of A genome species (2C = 8.95 pg), and that this difference likely accounts for a progression of size among tetraploid species, where AB genome configuration had similar genome sizes (average 2C = 25.74 pg). Genome size was mostly consistent within species and in general agreement with current information about evolutionary distance among species. Results also suggest that most of the polyploid species in Avena have experienced genome downsizing in relation to their diploid progenitors. Genome size measurements could provide additional quality control for species identification in germplasm collections, especially in cases where diploid and polyploid species have similar morphology.

  17. Comprehensive genomic characterization of campylobacter genus reveals some underlying mechanisms for its genomic diversification.

    Directory of Open Access Journals (Sweden)

    Yizhuang Zhou

    Campylobacter species are phenotypically diverse in many aspects including host habitats and pathogenicities, which demands comprehensive characterization of the entire Campylobacter genus to study their underlying genetic diversification. Up to now, 34 Campylobacter strains have been sequenced and published in public databases, providing a good opportunity to systematically analyze their genomic diversities. In this study, we first conducted genomic characterization, which includes genome-wide alignments, pan-genome analysis, and phylogenetic identification, to depict the genetic diversity of the Campylobacter genus. Afterward, we improved the tetranucleotide usage pattern-based naïve Bayesian classifier to identify the abnormal composition fragments (ACFs, fragments with significantly different tetranucleotide frequency profiles from its genomic tetranucleotide frequency profiles), including horizontal gene transfers (HGTs), to explore the mechanisms for the genetic diversity of this organism. Finally, we analyzed the HGTs transferred via bacteriophage transductions. To our knowledge, this study is the first to use single nucleotide polymorphism information to construct a reliable microevolution phylogeny of 21 Campylobacter jejuni strains. Combined with the phylogeny of all the collected Campylobacter species based on genome-wide core gene information, comprehensive phylogenetic inference of all 34 Campylobacter organisms was determined. It was found that C. jejuni harbors a high fraction of ACFs possibly through intraspecies recombination, whereas other Campylobacter members possess numerous ACFs possibly via intragenus recombination. Furthermore, some Campylobacter strains have undergone significant ancient viral integration during their evolution process. The improved method is a powerful tool for bacterial genomic analysis. Moreover, the findings would provide useful information for future research on the Campylobacter genus.

  18. Program for PET image alignment: Effects on calculated differences in cerebral metabolic rates for glucose

    International Nuclear Information System (INIS)

    Phillips, R.L.; London, E.D.; Links, J.M.; Cascella, N.G.

    1990-01-01

    A program was developed to align positron emission tomography images from multiple studies on the same subject. The program allowed alignment of two images with a fineness of one-tenth the width of a pixel. The indications and effects of misalignment were assessed in eight subjects from a placebo-controlled double-blind crossover study on the effects of cocaine on regional cerebral metabolic rates for glucose. Visual examination of a difference image provided a sensitive and accurate tool for assessing image alignment. Image alignment within 2.8 mm was essential to reduce variability of measured cerebral metabolic rates for glucose. Misalignment by this amount introduced errors on the order of 20% in the computed metabolic rate for glucose. These errors propagate to the difference between metabolic rates for a subject measured in basal versus perturbed states.

  19. BLESS 2: accurate, memory-efficient and fast error correction method.

    Science.gov (United States)

    Heo, Yun; Ramachandran, Anand; Hwu, Wen-Mei; Ma, Jian; Chen, Deming

    2016-08-01

    The most important features of error correction tools for sequencing data are accuracy, memory efficiency and fast runtime. The previous version of BLESS was highly memory-efficient and accurate, but it was too slow to handle reads from large genomes. We have developed a new version of BLESS to improve runtime and accuracy while maintaining a small memory usage. The new version, called BLESS 2, has an error correction algorithm that is more accurate than BLESS, and the algorithm has been parallelized using hybrid MPI and OpenMP programming. BLESS 2 was compared with five top-performing tools, and it was found to be the fastest when it was executed on two computing nodes using MPI, with each node containing twelve cores. Also, BLESS 2 showed at least 11% higher gain while retaining the memory efficiency of the previous version for large genomes. Availability: freely available at https://sourceforge.net/projects/bless-ec (contact: dchen@illinois.edu). Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  20. Accurate phylogenetic classification of DNA fragments based on sequence composition

    Energy Technology Data Exchange (ETDEWEB)

    McHardy, Alice C.; Garcia Martin, Hector; Tsirigos, Aristotelis; Hugenholtz, Philip; Rigoutsos, Isidore

    2006-05-01

    Metagenome studies have retrieved vast amounts of sequence out of a variety of environments, leading to novel discoveries and great insights into the uncultured microbial world. Except for very simple communities, diversity makes sequence assembly and analysis a very challenging problem. To understand the structure and function of microbial communities, a taxonomic characterization of the obtained sequence fragments is highly desirable, yet currently limited mostly to those sequences that contain phylogenetic marker genes. We show that for clades at the rank of domain down to genus, sequence composition allows the very accurate phylogenetic characterization of genomic sequence. We developed a composition-based classifier, PhyloPythia, for de novo phylogenetic sequence characterization and have trained it on a data set of 340 genomes. By extensive evaluation experiments we show that the method is accurate across all taxonomic ranks considered, even for sequences that originate from novel organisms and are as short as 1 kb. Application to two metagenome datasets obtained from samples of phosphorus-removing sludge showed that the method allows the accurate classification at genus level of most sequence fragments from the dominant populations, while at the same time correctly characterizing even larger parts of the samples at higher taxonomic levels.

  1. Newly discovered young CORE-SINEs in marsupial genomes.

    Science.gov (United States)

    Munemasa, Maruo; Nikaido, Masato; Nishihara, Hidenori; Donnellan, Stephen; Austin, Christopher C; Okada, Norihiro

    2008-01-15

    Although recent mammalian genome projects have uncovered a large part of genomic component of various groups, several repetitive sequences still remain to be characterized and classified for particular groups. The short interspersed repetitive elements (SINEs) distributed among marsupial genomes are one example. We have identified and characterized two new SINEs from marsupial genomes that belong to the CORE-SINE family, characterized by a highly conserved "CORE" domain. PCR and genomic dot blot analyses revealed that the distribution of each SINE shows distinct patterns among the marsupial genomes, implying different timing of their retroposition during the evolution of marsupials. The members of Mar3 (Marsupialia 3) SINE are distributed throughout the genomes of all marsupials, whereas the Mac1 (Macropodoidea 1) SINE is distributed specifically in the genomes of kangaroos. Sequence alignment of the Mar3 SINEs revealed that they can be further divided into four subgroups, each of which has diagnostic nucleotides. The insertion patterns of each SINE at particular genomic loci, together with the distribution patterns of each SINE, suggest that the Mar3 SINEs have intensively amplified after the radiation of diprotodontians, whereas the Mac1 SINE has amplified only slightly after the divergence of hypsiprimnodons from other macropods. By compiling the information of CORE-SINEs characterized to date, we propose a comprehensive picture of how SINE evolution occurred in the genomes of marsupials.

  2. Accurate measurement of gene copy number for human alpha-defensin DEFA1A3.

    Science.gov (United States)

    Khan, Fayeza F; Carpenter, Danielle; Mitchell, Laura; Mansouri, Omniah; Black, Holly A; Tyson, Jess; Armour, John A L

    2013-10-20

    Multi-allelic copy number variants include examples of extensive variation between individuals in the copy number of important genes, most notably genes involved in immune function. The definition of this variation, and analysis of its impact on function, has been hampered by the technical difficulty of large-scale but accurate typing of genomic copy number. The copy-variable alpha-defensin locus DEFA1A3 on human chromosome 8 commonly varies between 4 and 10 copies per diploid genome, and presents considerable challenges for accurate high-throughput typing. In this study, we developed two paralogue ratio tests and three allelic ratio measurements that, in combination, provide an accurate and scalable method for measurement of DEFA1A3 gene number. We combined information from different measurements in a maximum-likelihood framework which suggests that most samples can be assigned to an integer copy number with high confidence, and applied it to typing 589 unrelated European DNA samples. Typing the members of three-generation pedigrees provided further reassurance that correct integer copy numbers had been assigned. Our results have allowed us to discover that the SNP rs4300027 is strongly associated with DEFA1A3 gene copy number in European samples. We have developed an accurate and robust method for measurement of DEFA1A3 copy number. Interrogation of rs4300027 and associated SNPs in Genome-Wide Association Study SNP data provides no evidence that alpha-defensin copy number is a strong risk factor for phenotypes such as Crohn's disease, type I diabetes, HIV progression and multiple sclerosis.
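    The idea of combining several measurements in a maximum-likelihood framework to assign an integer copy number can be illustrated with a minimal sketch. It assumes, purely for illustration, that each measurement is an independent estimate of the copy number with Gaussian error; the function name, the noise model, the candidate range and the example values are hypothetical and are not taken from the study.

```python
def most_likely_copy_number(measurements, sigmas, candidates=range(2, 13)):
    """Pick the integer copy number that maximizes the joint likelihood.

    Assumption (not from the source): each measurement x_j is an
    independent Gaussian estimate of the true copy number with
    standard deviation sigma_j.
    """
    def log_likelihood(copy_number):
        # Sum of Gaussian log-likelihoods, dropping terms that do not
        # depend on the candidate copy number.
        return sum(
            -((x - copy_number) ** 2) / (2 * s ** 2)
            for x, s in zip(measurements, sigmas)
        )

    return max(candidates, key=log_likelihood)


# Hypothetical readings from three assays for one sample.
print(most_likely_copy_number([6.8, 7.3, 7.1], [0.4, 0.5, 0.3]))
```

    Weighting each assay by its own error lets a precise measurement count for more than a noisy one, which is the main reason to combine them in a likelihood instead of averaging and rounding.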

  3. Scintillation counter: photomultiplier tube alignment

    International Nuclear Information System (INIS)

    Olson, R.E.

    1975-01-01

    A scintillation counter, particularly for counting gamma ray photons, includes a massive lead radiation shield surrounding a sample-receiving zone. The shield is disassembleable into a plurality of segments to allow facile installation and removal of a photomultiplier tube assembly, the segments being so constructed as to prevent straight-line access of external radiation through the shield into the sample receiving zone. Provisions are made for accurately aligning the photomultiplier tube with respect to one or more sample-transmitting bores extending through the shield to the sample receiving zone. A sample elevator, used in transporting samples into the zone, is designed to provide a maximum gamma-receiving aspect to maximize the gamma detecting efficiency. (auth)

  4. Interpreting a sequenced genome: toward a cosmid transgenic library of Caenorhabditis elegans.

    Science.gov (United States)

    Janke, D L; Schein, J E; Ha, T; Franz, N W; O'Neil, N J; Vatcher, G P; Stewart, H I; Kuervers, L M; Baillie, D L; Rose, A M

    1997-10-01

    We have generated a library of transgenic Caenorhabditis elegans strains that carry sequenced cosmids from the genome of the nematode. Each strain carries an extrachromosomal array containing a single cosmid, sequenced by the C. elegans Genome Sequencing Consortium, and a dominant Rol-6 marker. More than 500 transgenic strains representing 250 cosmids have been constructed. Collectively, these strains contain approximately 8 Mb of sequence data, or approximately 8% of the C. elegans genome. The transgenic strains are being used to rescue mutant phenotypes, resulting in a high-resolution map alignment of the genetic, physical, and DNA sequence maps of the nematode. We have chosen the region of chromosome III deleted by sDf127 and not covered by the duplication sDp8(III;I) as a starting point for a systematic correlation of mutant phenotypes with nucleotide sequence. In this defined region, we have identified 10 new essential genes whose mutant phenotypes range from developmental arrest at early larva, to maternal effect lethal. To date, 8 of these 10 essential genes have been rescued. In this region, these rescues represent approximately 10% of the genes predicted by GENEFINDER and considerably enhance the map alignment. Furthermore, this alignment facilitates future efforts to physically position and clone other genes in the region. [Updated information about the Transgenic Library is available via the Internet at http://darwin.mbb.sfu.ca/imbb/dbaillie/cosmid.html.]

  5. MUON DETECTORS: ALIGNMENT

    CERN Multimedia

    G. Gomez

    Since December, the muon alignment community has focused on analyzing the data recorded so far in order to produce new DT and CSC Alignment Records for the second reprocessing of CRAFT data. Two independent algorithms were developed which align the DT chambers using global tracks, thus providing, for the first time, a relative alignment of the barrel with respect to the tracker. These results are an important ingredient for the second CRAFT reprocessing and allow, for example, a more detailed study of any possible mis-modelling of the magnetic field in the muon spectrometer. Both algorithms are constructed in such a way that the resulting alignment constants are not affected, to first order, by any such mis-modelling. The CSC chambers have not yet been included in this global track-based alignment due to a lack of statistics, since only a few cosmics go through the tracker and the CSCs. A strategy exists to align the CSCs using the barrel as a reference until collision tracks become available. Aligning the ...

  6. MUON DETECTORS: ALIGNMENT

    CERN Multimedia

    G.Gomez

    2011-01-01

    The Muon Alignment work now focuses on producing a new track-based alignment with higher track statistics, making systematic studies between the results of the hardware and track-based alignment methods and aligning the barrel using standalone muon tracks. Currently, the muon track reconstruction software uses a hardware-based alignment in the barrel (DT) and a track-based alignment in the endcaps (CSC). An important task is to assess the muon momentum resolution that can be achieved using the current muon alignment, especially for highly energetic muons. For this purpose, cosmic ray muons are used, since the rate of high-energy muons from collisions is very low and the event statistics are still limited. Cosmics have the advantage of higher statistics in the pT region above 100 GeV/c, but they have the disadvantage of having a mostly vertical topology, resulting in very few global endcap muons. Only the barrel alignment has therefore been tested so far. Cosmic muons traversing CMS from top to bottom are s...

  7. TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction.

    Science.gov (United States)

    Chang, Jia-Ming; Di Tommaso, Paolo; Notredame, Cedric

    2014-06-01

    Multiple sequence alignment (MSA) is a key modeling procedure when analyzing biological sequences. Homology and evolutionary modeling are the most common applications of MSAs. Both are known to be sensitive to the underlying MSA accuracy. In this work, we show how this problem can be partly overcome using the transitive consistency score (TCS), an extended version of the T-Coffee scoring scheme. Using this local evaluation function, we show that one can identify the most reliable portions of an MSA, as judged from BAliBASE and PREFAB structure-based reference alignments. We also show how this measure can be used to improve phylogenetic tree reconstruction using both an established simulated data set and a novel empirical yeast data set. For this purpose, we describe a novel lossless alternative to site filtering that involves overweighting the trustworthy columns. Our approach relies on the T-Coffee framework; it uses libraries of pairwise alignments to evaluate any third party MSA. Pairwise projections can be produced using fast or slow methods, thus allowing a trade-off between speed and accuracy. We compared TCS with Heads-or-Tails, GUIDANCE, Gblocks, and trimAl and found it to lead to significantly better estimates of structural accuracy and more accurate phylogenetic trees. The software is available from www.tcoffee.org/Projects/tcs. © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  8. Finding function: evaluation methods for functional genomic data

    Directory of Open Access Journals (Sweden)

    Barrett Daniel R

    2006-07-01

    Background: Accurate evaluation of the quality of genomic or proteomic data and computational methods is vital to our ability to use them for formulating novel biological hypotheses and directing further experiments. There is currently no standard approach to evaluation in functional genomics. Our analysis of existing approaches shows that they are inconsistent and contain substantial functional biases that render the resulting evaluations misleading both quantitatively and qualitatively. These problems make it essentially impossible to compare computational methods or large-scale experimental datasets and also result in conclusions that generalize poorly in most biological applications. Results: We reveal issues with current evaluation methods here and suggest new approaches to evaluation that facilitate accurate and representative characterization of genomic methods and data. Specifically, we describe a functional genomics gold standard based on curation by expert biologists and demonstrate its use as an effective means of evaluation of genomic approaches. Our evaluation framework and gold standard are freely available to the community through our website. Conclusion: Proper methods for evaluating genomic data and computational approaches will determine how much we, as a community, are able to learn from the wealth of available data. We propose one possible solution to this problem here but emphasize that this topic warrants broader community discussion.

  9. DNA motif alignment by evolving a population of Markov chains.

    Science.gov (United States)

    Bi, Chengpeng

    2009-01-30

    Deciphering cis-regulatory elements or de novo motif-finding in genomes still remains elusive although much algorithmic effort has been expended. Markov chain Monte Carlo (MCMC) methods such as Gibbs motif samplers have been widely employed to solve the de novo motif-finding problem through sequence local alignment. Nonetheless, the MCMC-based motif samplers still suffer from local maxima, as EM does. Therefore, as a prerequisite for finding good local alignments, these motif algorithms are often independently run a multitude of times, but without information exchange between different chains. Hence a new algorithm design enabling such information exchange would be worthwhile. This paper presents a novel motif-finding algorithm by evolving a population of Markov chains with information exchange (PMC), each of which is initialized as a random alignment and run by the Metropolis-Hastings sampler (MHS). It is progressively updated through a series of stochastically sampled local alignments. Explicitly, the PMC motif algorithm performs stochastic sampling as specified by a population-based proposal distribution rather than individual ones, and adaptively evolves the population as a whole towards a global maximum. The alignment information exchange is accomplished by taking advantage of the pooled motif site distributions. A distinct method for running multiple independent Markov chains (IMC) without information exchange, dubbed the IMC motif algorithm, is also devised to compare with its PMC counterpart. Experimental studies demonstrate that the performance could be improved if pooled information were used to run a population of motif samplers. The new PMC algorithm was able to improve the convergence and outperformed other popular algorithms tested using simulated and biological motif sequences.
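    As background for the Metropolis-Hastings sampler mentioned above, the generic accept/reject rule can be sketched as follows. This is the textbook rule only, not the PMC algorithm or its population-based proposal; the function and parameter names are hypothetical, and scores are assumed to be log-values of an unnormalized target.

```python
import math
import random


def acceptance_probability(log_target_current, log_target_proposed,
                           log_q_forward=0.0, log_q_reverse=0.0):
    """Standard Metropolis-Hastings acceptance probability.

    log_q_forward is log q(proposed | current) and log_q_reverse is
    log q(current | proposed); both default to 0, which corresponds
    to a symmetric proposal.
    """
    log_ratio = (log_target_proposed + log_q_reverse) - (
        log_target_current + log_q_forward
    )
    # min(1, exp(log_ratio)), written to avoid overflow for large ratios.
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def mh_accept(log_target_current, log_target_proposed, rng=random):
    """Draw the accept/reject decision for a symmetric proposal."""
    probability = acceptance_probability(log_target_current, log_target_proposed)
    return rng.random() < probability
```

    A move to a better-scoring alignment is always accepted, while a move to a worse one is accepted with probability equal to the score ratio, which is what lets a chain escape some local maxima.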

  10. A Perfect Match Genomic Landscape Provides a Unified Framework for the Precise Detection of Variation in Natural and Synthetic Haploid Genomes.

    Science.gov (United States)

    Palacios-Flores, Kim; García-Sotelo, Jair; Castillo, Alejandra; Uribe, Carina; Aguilar, Luis; Morales, Lucía; Gómez-Romero, Laura; Reyes, José; Garciarubio, Alejandro; Boege, Margareta; Dávila, Guillermo

    2018-04-01

    We present a conceptually simple, sensitive, precise, and essentially nonstatistical solution for the analysis of genome variation in haploid organisms. The generation of a Perfect Match Genomic Landscape (PMGL), which computes intergenome identity with single nucleotide resolution, reveals signatures of variation wherever a query genome differs from a reference genome. Such signatures encode the precise location of different types of variants, including single nucleotide variants, deletions, insertions, and amplifications, effectively introducing the concept of a general signature of variation. The precise nature of variants is then resolved through the generation of targeted alignments between specific sets of sequence reads and known regions of the reference genome. Thus, the perfect match logic decouples the identification of the location of variants from the characterization of their nature, providing a unified framework for the detection of genome variation. We assessed the performance of the PMGL strategy via simulation experiments. We determined the variation profiles of natural genomes and of a synthetic chromosome, both in the context of haploid yeast strains. Our approach uncovered variants that have previously escaped detection. Moreover, our strategy is ideally suited for further refining high-quality reference genomes. The source codes for the automated PMGL pipeline have been deposited in a public repository. Copyright © 2018 by the Genetics Society of America.
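    The perfect-match idea can be illustrated with a toy sketch: count, for every reference position, how often the k-mer starting there occurs exactly in the reads, and look for stretches where the count drops. This is only an illustration under simplifying assumptions (one strand, no sequencing errors, a tiny k); the function name and the example sequences are hypothetical and do not reproduce the PMGL pipeline.

```python
from collections import Counter


def perfect_match_landscape(reference, reads, k):
    """Count exact occurrences in the reads of each reference k-mer.

    Returns one count per reference position i, for the k-mer
    reference[i:i + k]. Reverse complements are ignored here.
    """
    read_kmers = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return [
        read_kmers[reference[i:i + k]]
        for i in range(len(reference) - k + 1)
    ]


reference = "ACGTTGCATC"
query_reads = ["ACGTAGCATC"]  # one read carrying a T->A change at index 4
print(perfect_match_landscape(reference, query_reads, k=3))
# prints [1, 1, 0, 0, 0, 1, 1, 1]
```

    The run of zeros spans the k reference k-mers that overlap the changed base, so the location of the variant is visible before its nature has been determined.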

  11. Multiplexed precision genome editing with trackable genomic barcodes in yeast.

    Science.gov (United States)

    Roy, Kevin R; Smith, Justin D; Vonesch, Sibylle C; Lin, Gen; Tu, Chelsea Szu; Lederer, Alex R; Chu, Angela; Suresh, Sundari; Nguyen, Michelle; Horecka, Joe; Tripathi, Ashutosh; Burnett, Wallace T; Morgan, Maddison A; Schulz, Julia; Orsley, Kevin M; Wei, Wu; Aiyar, Raeka S; Davis, Ronald W; Bankaitis, Vytas A; Haber, James E; Salit, Marc L; St Onge, Robert P; Steinmetz, Lars M

    2018-07-01

    Our understanding of how genotype controls phenotype is limited by the scale at which we can precisely alter the genome and assess the phenotypic consequences of each perturbation. Here we describe a CRISPR-Cas9-based method for multiplexed accurate genome editing with short, trackable, integrated cellular barcodes (MAGESTIC) in Saccharomyces cerevisiae. MAGESTIC uses array-synthesized guide-donor oligos for plasmid-based high-throughput editing and features genomic barcode integration to prevent plasmid barcode loss and to enable robust phenotyping. We demonstrate that editing efficiency can be increased more than fivefold by recruiting donor DNA to the site of breaks using the LexA-Fkh1p fusion protein. We performed saturation editing of the essential gene SEC14 and identified amino acids critical for chemical inhibition of lipid signaling. We also constructed thousands of natural genetic variants, characterized guide mismatch tolerance at the genome scale, and ascertained that cryptic Pol III termination elements substantially reduce guide efficacy. MAGESTIC will be broadly useful to uncover the genetic basis of phenotypes in yeast.

  12. Analysis and Comparison of Information Theory-based Distances for Genomic Strings

    Science.gov (United States)

    Balzano, Walter; Cicalese, Ferdinando; Del Sorbo, Maria Rosaria; Vaccaro, Ugo

    2008-07-01

    Genomic string comparison via alignment is widely applied for mining and retrieval of information in biological databases. In some situations, the effectiveness of such alignment-based comparison is still unclear, e.g., for sequences with non-uniform length and with significant shuffling of identical substrings. An alternative approach is the one based on information theory distances. Biological data information content is stored in very long strings of only four characters. In the last ten years, several entropic measures have been proposed for genomic string analysis. Notwithstanding their individual merit and experimental validation, to the best of our knowledge, there is no direct comparison of these different metrics. We present four of the most representative alignment-free distance measures, based on mutual information. Each one has a different origin and expression. Our comparison involves a rearrangement that reduces the different concepts to a single formalism, so that it has been possible to construct a phylogenetic tree for each of them. The trees produced via these metrics are compared to the ones widely accepted as biologically validated. In general the results provided more evidence of the reliability of the alignment-free distance models. Also, we observe that one of the metrics appeared to be more robust than the other three. We believe that this result merits further research and observation. Many of the experimental results, the graphics and the table are available at the following URL: http://people.na.infn.it/~wbalzano/BIO
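    To make the notion of an alignment-free, information-theoretic distance concrete, here is a small sketch that compares two sequences through the Jensen-Shannon divergence of their k-mer frequency distributions. It is offered as one familiar example of this family and is not claimed to be one of the four measures compared in the paper; the function names and the choice of k are assumptions.

```python
from collections import Counter
from math import log2


def kmer_distribution(seq, k):
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits between two distributions.

    The value is 0 for identical distributions and 1 when the two
    distributions share no k-mer at all.
    """
    divergence = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = (pi + qi) / 2  # mixture distribution
        if pi > 0:
            divergence += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * log2(qi / mi)
    return divergence


a = kmer_distribution("ACGTACGTACGT", 2)
b = kmer_distribution("ACGTTCGAACGT", 2)
print(jensen_shannon_divergence(a, b))
```

    Because only k-mer counts are compared, the measure does not depend on the two sequences having equal length or on the order of shared substrings, which are the situations where alignment-based comparison is described above as unclear.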

  13. Alignment-Annotator web server: rendering and annotating sequence alignments.

    Science.gov (United States)

    Gille, Christoph; Fähling, Michael; Weyand, Birgit; Wieland, Thomas; Gille, Andreas

    2014-07-01

    Alignment-Annotator is a novel web service designed to generate interactive views of annotated nucleotide and amino acid sequence alignments (i) de novo and (ii) embedded in other software. All computations are performed at server side. Interactivity is implemented in HTML5, a language native to web browsers. The alignment is initially displayed using default settings and can be modified with the graphical user interfaces. For example, individual sequences can be reordered or deleted using drag and drop, amino acid color code schemes can be applied and annotations can be added. Annotations can be made manually or imported (BioDAS servers, the UniProt, the Catalytic Site Atlas and the PDB). Some edits take immediate effect while others require server interaction and may take a few seconds to execute. The final alignment document can be downloaded as a zip-archive containing the HTML files. Because of the use of HTML the resulting interactive alignment can be viewed on any platform including Windows, Mac OS X, Linux, Android and iOS in any standard web browser. Importantly, neither plugins nor Java are required and therefore Alignment-Annotator represents the first interactive browser-based alignment visualization. Availability: http://www.bioinformatics.org/strap/aa/ and http://strap.charite.de/aa/. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  14. The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads.

    Science.gov (United States)

    Wang, Zhiwen; Hobson, Neil; Galindo, Leonardo; Zhu, Shilin; Shi, Daihu; McDill, Joshua; Yang, Linfeng; Hawkins, Simon; Neutelings, Godfrey; Datla, Raju; Lambert, Georgina; Galbraith, David W; Grassa, Christopher J; Geraldes, Armando; Cronk, Quentin C; Cullis, Christopher; Dash, Prasanta K; Kumar, Polumetla A; Cloutier, Sylvie; Sharpe, Andrew G; Wong, Gane K-S; Wang, Jun; Deyholos, Michael K

    2012-11-01

    Flax (Linum usitatissimum) is an ancient crop that is widely cultivated as a source of fiber, oil and medicinally relevant compounds. To accelerate crop improvement, we performed whole-genome shotgun sequencing of the nuclear genome of flax. Seven paired-end libraries ranging in size from 300 bp to 10 kb were sequenced using an Illumina genome analyzer. A de novo assembly, comprised exclusively of deep-coverage (approximately 94× raw, approximately 69× filtered) short-sequence reads (44-100 bp), produced a set of scaffolds with N50 = 694 kb, including contigs with N50 = 20.1 kb. The contig assembly contained 302 Mb of non-redundant sequence representing an estimated 81% genome coverage. Up to 96% of published flax ESTs aligned to the whole-genome shotgun scaffolds. However, comparisons with independently sequenced BACs and fosmids showed some mis-assembly of regions at the genome scale. A total of 43384 protein-coding genes were predicted in the whole-genome shotgun assembly, and up to 93% of published flax ESTs, and 86% of A. thaliana genes aligned to these predicted genes, indicating excellent coverage and accuracy at the gene level. Analysis of the synonymous substitution rates (Ks) observed within duplicate gene pairs was consistent with a recent (5-9 MYA) whole-genome duplication in flax. Within the predicted proteome, we observed enrichment of many conserved domains (Pfam-A) that may contribute to the unique properties of this crop, including agglutinin proteins. Together these results show that de novo assembly, based solely on whole-genome shotgun short-sequence reads, is an efficient means of obtaining nearly complete genome sequence information for some plant species. © 2012 The Authors. The Plant Journal © 2012 Blackwell Publishing Ltd.
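    The N50 statistic quoted for the scaffolds and contigs is the length L such that sequences of length at least L together contain half or more of the assembled bases. A short sketch of the usual calculation follows; the function name and the example lengths are hypothetical and are not taken from the flax assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the
    length of the sequence at which the running total first reaches
    half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths in kb.
example_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_contigs))  # prints 70
```

    In the example the total is 300 kb, and the two longest contigs already reach 150 kb, so the N50 is the length of the second one.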

  15. Simultaneous alignment and Lorentz angle calibration in the CMS silicon tracker using Millepede II

    CERN Document Server

    Bartosik, Nazar

    2013-01-01

    The CMS silicon tracker consists of 25 684 sensors that provide measurements of trajectories of charged particles that are used by almost every physics analysis at CMS. In order to achieve high measurement precision, the positions and orientations of all sensors have to be determined very accurately. This is achieved by track-based alignment using the global fit approach of the Millepede II program. This approach is capable of determining about 200 000 parameters simultaneously. The alignment precision has reached such a high level that even small calibration inaccuracies are noticeable. Therefore the alignment framework has been extended to treat position-sensitive calibration parameters. Of special interest is the Lorentz angle, which affects the hit positions due to the drift of the signal electrons in the magnetic field. We present the results from measurements of the Lorentz angle and its time dependence during the full 2012 data-taking period as well as a general description of the alignment and calibration procedu...

  16. The European sea bass Dicentrarchus labrax genome puzzle: comparative BAC-mapping and low coverage shotgun sequencing

    Directory of Open Access Journals (Sweden)

    Volckaert Filip AM

    2010-01-01

    Background: Food supply from the ocean is constrained by the shortage of domesticated and selected fish. Development of genomic models of economically important fishes should assist with the removal of this bottleneck. European sea bass Dicentrarchus labrax L. (Moronidae, Perciformes, Teleostei) is one of the most important fishes in European marine aquaculture; growing genomic resources put it on its way to serve as an economic model. Results: End sequencing of a sea bass genomic BAC-library enabled the comparative mapping of the sea bass genome using the three-spined stickleback Gasterosteus aculeatus genome as a reference. BAC-end sequences (102,690) were aligned to the stickleback genome. The number of mappable BACs was improved using a two-fold coverage WGS dataset of sea bass resulting in a comparative BAC-map covering 87% of stickleback chromosomes with 588 BAC-contigs. The minimum size of 83 contigs covering 50% of the reference was 1.2 Mbp; the largest BAC-contig comprised 8.86 Mbp. More than 22,000 BAC-clones aligned with both ends to the reference genome. Intra-chromosomal rearrangements between sea bass and stickleback were identified. Size distributions of mapped BACs were used to calculate that the genome of sea bass may be only 1.3 fold larger than the 460 Mbp stickleback genome. Conclusions: The BAC map is used for sequencing single BACs or BAC-pools covering defined genomic entities by second generation sequencing technologies. Together with the WGS dataset it initiates a sea bass genome sequencing project. This will allow the quantification of polymorphisms through resequencing, which is important for selecting highly performing domesticated fish.

  17. Long- and short-term selective forces on malaria parasite genomes

    KAUST Repository

    Nygaard, Sanne

    2010-09-09

    Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ~23 Mb genomes encoding ~5200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments of seven Plasmodium species, we show that protein-coding, intergenic and intronic regions are all subject to purifying selection and we identify 670 conserved non-genic elements. We then use genome-wide polymorphism data from P. falciparum to describe short-term selective processes in this species and identify some candidate genes for balancing (diversifying) selection. Our analyses suggest that there are many functional elements in the non-genic regions of these genomes and that adaptive evolution has occurred more frequently in the protein-coding regions of the genome. © 2010 Nygaard et al.

  18. FSO tracking and auto-alignment transceiver system

    Science.gov (United States)

    Cap, Gabriel A.; Refai, Hakki H.; Sluss, James J., Jr.

    2008-10-01

    Free-space optics (FSO) technology utilizes a modulated light beam to transmit information through the atmosphere. Due to reduced size and cost, and higher data rates, FSO can be more effective than wireless communication. Although atmospheric conditions can affect FSO communication, a line-of-sight connection between FSO transceivers is a necessary condition to maintain continuous exchange of data, voice, and video information. To date, the primary concentration of mobile FSO research and development has been toward accurate alignment between two transceivers. This study introduces a fully automatic, advanced alignment system that will maintain a line of sight connection for any FSO transceiver system. A complete transceiver system includes a position-sensing detector (PSD) to receive the signal, a laser to transmit the signal, a gimbal to move the transceiver to maintain alignment, and a computer to coordinate the necessary movements during motion. The FSO system was tested for mobility by employing one gimbal as a mobile unit and establishing another as a base station. Tests were performed to establish that alignment between two transceivers could be maintained during a given period of experiments and to determine the maximum speeds tolerated by the system. Implementation of the transceiver system can be realized in many ways, including vehicle-to-base station communication or vehicle-to-vehicle communication. This study is especially promising in that it suggests such a system is able to provide high-speed data in many applications where current wireless technology may not be effective. This phenomenon, coupled with the ability to maintain an autonomously realigned connection, opens the possibility of endless applications for both military and civilian use.

  19. KGCAK: a K-mer based database for genome-wide phylogeny and complexity evaluation.

    Science.gov (United States)

    Wang, Dapeng; Xu, Jiayue; Yu, Jun

    2015-09-16

    The K-mer approach, treating genomic sequences as simple characters and counting the relative abundance of each string upon a fixed K, has been extensively applied to phylogeny inference for genome assembly, annotation, and comparison. To meet increasing demands for comparing large genome sequences and to promote the use of the K-mer approach, we developed a versatile database, KGCAK ( http://kgcak.big.ac.cn/KGCAK/ ), containing ~8,000 genomes that include genome sequences of diverse life forms (viruses, prokaryotes, protists, animals, and plants) and cellular organelles of eukaryotic lineages. It builds phylogeny based on genomic elements in an alignment-free fashion and provides in-depth data processing enabling users to compare the complexity of genome sequences based on K-mer distribution. We hope that KGCAK becomes a powerful tool for exploring relationships within and among groups of species in a tree of life based on genomic data.
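
    The K-mer idea described in this abstract (count the relative abundance of every string of fixed length K, then compare genomes without aligning them) can be illustrated with a small sketch. The abstract does not state which distance KGCAK uses, so the Euclidean distance between relative-frequency profiles below is an assumption chosen for simplicity, and the function names are hypothetical:

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k):
    """Map each K-mer in seq to its relative abundance."""
    seq = seq.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two K-mer frequency profiles.

    Assumption: this metric is illustrative only; it is not claimed
    to be the measure used by any particular database.
    """
    kmers = set(p) | set(q)
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in kmers))


# Toy sequences (hypothetical): no alignment is computed at any point.
a = kmer_profile("ACGTACGTACGT", k=3)
b = kmer_profile("ACGTACGTTTTT", k=3)
print(profile_distance(a, a))  # 0.0 for identical profiles
print(profile_distance(a, b) > 0)  # True
```

    A matrix of such pairwise distances could then be passed to any distance-based tree-building method.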

  20. Automatic alignment device for focal spot measurements in the center of the field for mammography

    International Nuclear Information System (INIS)

    Vieira, Marcelo A.C.; Watanabe, Alex O.; Oliveira Junior, Paulo D.; Schiabel, Homero

    2010-01-01

    Some quality control procedures used for mammography, such as focal spot evaluation, require prior alignment of the measurement equipment with the X-ray central beam. However, alignment is, in general, the most difficult and time-consuming task. Moreover, the operator is sometimes exposed to radiation during this procedure. This work presents an automatic alignment system for mammographic equipment that locates the central ray of the radiation beam and immediately aligns with it by moving itself automatically along the field. The system consists of a bidirectional moving device connected to a CCD sensor for digital radiographic image acquisition. A computational analysis of a radiographic image, acquired at any position on the field, is performed in order to determine its position under the X-ray beam. Finally, a mechanical system with two directions of movement, electronically controlled by a microcontroller over USB communication, makes the system align automatically with the central ray of the radiation beam. The alignment process is fully automatic, fast and accurate, with no operator exposure to radiation, which allows considerable time savings in quality control procedures for mammography. (author)

  1. Annotating individual human genomes.

    Science.gov (United States)

    Torkamani, Ali; Scott-Van Zeeland, Ashley A; Topol, Eric J; Schork, Nicholas J

    2011-10-01

    Advances in DNA sequencing technologies have made it possible to rapidly, accurately and affordably sequence entire individual human genomes. As impressive as this ability seems, however, it will not likely amount to much if one cannot extract meaningful information from individual sequence data. Annotating variations within individual genomes and providing information about their biological or phenotypic impact will thus be crucially important in moving individual sequencing projects forward, especially in the context of the clinical use of sequence information. In this paper we consider the various ways in which one might annotate individual sequence variations and point out limitations in the available methods for doing so. It is arguable that, in the foreseeable future, DNA sequencing of individual genomes will become routine for clinical, research, forensic, and personal purposes. We therefore also consider directions and areas for further research in annotating genomic variants. Copyright © 2011 Elsevier Inc. All rights reserved.

  2. ANNOTATING INDIVIDUAL HUMAN GENOMES*

    Science.gov (United States)

    Torkamani, Ali; Scott-Van Zeeland, Ashley A.; Topol, Eric J.; Schork, Nicholas J.

    2014-01-01

    Advances in DNA sequencing technologies have made it possible to rapidly, accurately and affordably sequence entire individual human genomes. As impressive as this ability seems, however, it will not likely amount to much if one cannot extract meaningful information from individual sequence data. Annotating variations within individual genomes and providing information about their biological or phenotypic impact will thus be crucially important in moving individual sequencing projects forward, especially in the context of the clinical use of sequence information. In this paper we consider the various ways in which one might annotate individual sequence variations and point out limitations in the available methods for doing so. It is arguable that, in the foreseeable future, DNA sequencing of individual genomes will become routine for clinical, research, forensic, and personal purposes. We therefore also consider directions and areas for further research in annotating genomic variants. PMID:21839162

  3. Whole-genome shotgun optical mapping of Rhodospirillum rubrum

    Energy Technology Data Exchange (ETDEWEB)

    Reslewic, S. [Univ. Wisc.-Madison; Zhou, S. [Univ. Wisc.-Madison; Place, M. [Univ. Wisc.-Madison; Zhang, Y. [Univ. Wisc.-Madison; Briska, A. [Univ. Wisc.-Madison; Goldstein, S. [Univ. Wisc.-Madison; Churas, C. [Univ. Wisc.-Madison; Runnheim, R. [Univ. Wisc.-Madison; Forrest, D. [Univ. Wisc.-Madison; Lim, A. [Univ. Wisc.-Madison; Lapidus, A. [Univ. Wisc.-Madison; Han, C. S. [Univ. Wisc.-Madison; Roberts, G. P. [Univ. Wisc.-Madison; Schwartz, D. C. [Univ. Wisc.-Madison

    2005-09-01

    Rhodospirillum rubrum is a phototrophic purple nonsulfur bacterium known for its unique and well-studied nitrogen fixation and carbon monoxide oxidation systems and as a source of hydrogen and biodegradable plastic production. To better understand this organism and to facilitate assembly of its sequence, three whole-genome restriction endonuclease maps (XbaI, NheI, and HindIII) of R. rubrum strain ATCC 11170 were created by optical mapping. Optical mapping is a system for creating whole-genome ordered restriction endonuclease maps from randomly sheared genomic DNA molecules extracted from cells. During the sequence finishing process, all three optical maps confirmed a putative error in sequence assembly, while the HindIII map acted as a scaffold for high-resolution alignment with sequence contigs spanning the whole genome. In addition to highlighting optical mapping's role in the assembly and confirmation of genome sequence, this work underscores the unique niche in resolution occupied by the optical mapping system. With a resolution ranging from 6.5 kb (previously published) to 45 kb (reported here), optical mapping advances a "molecular cytogenetics" approach to solving problems in genomic analysis.

  4. Predictive genomics: A cancer hallmark network framework for predicting tumor clinical phenotypes using genome sequencing data

    OpenAIRE

    Wang, Edwin; Zaman, Naif; Mcgee, Shauna; Milanese, Jean-Sébastien; Masoudi-Nejad, Ali; O'Connor, Maureen

    2014-01-01

    We discuss a cancer hallmark network framework for modelling genome-sequencing data to predict cancer clonal evolution and associated clinical phenotypes. Strategies of using this framework in conjunction with genome sequencing data in an attempt to predict personalized drug targets, drug resistance, and metastasis for a cancer patient, as well as cancer risks for a healthy individual are discussed. Accurate prediction of cancer clonal evolution and clinical phenotypes will have substantial i...

  5. SECOM: A novel hash seed and community detection based-approach for genome-scale protein domain identification

    KAUST Repository

    Fan, Ming

    2012-06-28

    With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatase (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and available at http://sfb.kaust.edu.sa/Pages/Software.aspx. © 2012 Fan et al.

  6. SECOM: A novel hash seed and community detection based-approach for genome-scale protein domain identification

    KAUST Repository

    Fan, Ming; Wong, Ka-Chun; Ryu, Tae Woo; Ravasi, Timothy; Gao, Xin

    2012-01-01

    With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatase (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and available at http://sfb.kaust.edu.sa/Pages/Software.aspx. © 2012 Fan et al.
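
    The indexing step described in this abstract (index every protein by hash seeds, then build a graph whose edge weights are the numbers of seeds shared by two proteins) can be sketched as follows. This is not SECOM's implementation: using plain contiguous K-mers as the "hash seed function" is an assumption, the names `seed_index` and `shared_seed_graph` are hypothetical, and the community-finding and percolation steps are not shown.

```python
from collections import defaultdict
from itertools import combinations


def seed_index(proteins, k):
    """Map each length-k seed to the set of proteins that contain it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def shared_seed_graph(proteins, k=3):
    """Return {(protein_a, protein_b): number of distinct shared seeds}.

    Only proteins that land in the same seed bucket are paired, which
    avoids an all-against-all sequence alignment.
    """
    weights = defaultdict(int)
    for names in seed_index(proteins, k).values():
        for a, b in combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Toy proteome (hypothetical sequences): p1 and p2 share the seeds
# "MKV" and "KVL"; p3 shares nothing with either.
toy = {"p1": "MKVLAAG", "p2": "GGMKVLT", "p3": "QQQQQQQ"}
print(shared_seed_graph(toy, k=3))  # {('p1', 'p2'): 2}
```

    The resulting weighted graph is the kind of structure on which an overlapping community-finding method could then operate.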

  7. Accessing the SEED genome databases via Web services API: tools for programmers.

    Science.gov (United States)

    Disz, Terry; Akhter, Sajia; Cuevas, Daniel; Olson, Robert; Overbeek, Ross; Vonstein, Veronika; Stevens, Rick; Edwards, Robert A

    2010-06-14

    The SEED integrates many publicly available genome sequences into a single resource. The database contains accurate and up-to-date annotations based on the subsystems concept that leverages clustering between genomes and other clues to accurately and efficiently annotate microbial genomes. The backend is used as the foundation for many genome annotation tools, such as the Rapid Annotation using Subsystems Technology (RAST) server for whole genome annotation, the metagenomics RAST server for random community genome annotations, and the annotation clearinghouse for exchanging annotations from different resources. In addition to a web user interface, the SEED also provides a Web services based API for programmatic access to the data in the SEED, allowing the development of third-party tools and mash-ups. The currently exposed Web services encompass over forty different methods for accessing data related to microbial genome annotations. The Web services provide comprehensive access to the database back end, allowing any programmer access to the most consistent and accurate genome annotations available. The Web services are deployed using a platform independent service-oriented approach that allows the user to choose the most suitable programming platform for their application. Example code demonstrates that Web services can be used to access the SEED using common bioinformatics programming languages such as Perl, Python, and Java. We present a novel approach to access the SEED database. Using Web services, a robust API for access to genomics data is provided, without requiring large volume downloads all at once. The API ensures timely access to the most current datasets available, including the new genomes as soon as they come online.

  8. Genome Sequences of Oryza Species

    KAUST Repository

    Kumagai, Masahiko

    2018-02-14

    This chapter summarizes recent data obtained from genome sequencing, annotation projects, and studies on the genome diversity of Oryza sativa and related Oryza species. O. sativa, commonly known as Asian rice, is the first monocot species whose complete genome sequence was deciphered based on physical mapping by an international collaborative effort. This genome, along with its accurate and comprehensive annotation, has become an indispensable foundation for crop genomics and breeding. With the development of innovative sequencing technologies, genomic studies of O. sativa have dramatically increased; in particular, a large number of cultivars and wild accessions have been sequenced and compared with the reference rice genome. Since de novo genome sequencing has become cost-effective, the genome of African cultivated rice, O. glaberrima, has also been determined. Comparative genomic studies have highlighted the independent domestication processes of different rice species, but it also turned out that Asian and African rice share a common gene set that has experienced similar artificial selection. An international project aimed at constructing reference genomes and examining the genome diversity of wild Oryza species is currently underway, and the genomes of some species are publicly available. This project provides a platform for investigations such as the evolution, development, polyploidization, and improvement of crops. Studies on the genomic diversity of Oryza species, including wild species, should provide new insights to solve the problem of growing food demands in the face of rapid climatic changes.

  9. Genome Sequences of Oryza Species

    KAUST Repository

    Kumagai, Masahiko; Tanaka, Tsuyoshi; Ohyanagi, Hajime; Hsing, Yue-Ie C.; Itoh, Takeshi

    2018-01-01

    This chapter summarizes recent data obtained from genome sequencing, annotation projects, and studies on the genome diversity of Oryza sativa and related Oryza species. O. sativa, commonly known as Asian rice, is the first monocot species whose complete genome sequence was deciphered based on physical mapping by an international collaborative effort. This genome, along with its accurate and comprehensive annotation, has become an indispensable foundation for crop genomics and breeding. With the development of innovative sequencing technologies, genomic studies of O. sativa have dramatically increased; in particular, a large number of cultivars and wild accessions have been sequenced and compared with the reference rice genome. Since de novo genome sequencing has become cost-effective, the genome of African cultivated rice, O. glaberrima, has also been determined. Comparative genomic studies have highlighted the independent domestication processes of different rice species, but it also turned out that Asian and African rice share a common gene set that has experienced similar artificial selection. An international project aimed at constructing reference genomes and examining the genome diversity of wild Oryza species is currently underway, and the genomes of some species are publicly available. This project provides a platform for investigations such as the evolution, development, polyploidization, and improvement of crops. Studies on the genomic diversity of Oryza species, including wild species, should provide new insights to solve the problem of growing food demands in the face of rapid climatic changes.

  10. Alignment and qualification of the Gaia telescope using a Shack-Hartmann sensor

    Science.gov (United States)

    Dovillaire, G.; Pierot, D.

    2017-09-01

    For almost 20 years, Imagine Optic has developed, manufactured and offered its worldwide customers reliable and accurate wavefront sensors and adaptive optics solutions. A long-term collaboration between Imagine Optic and Airbus Defence and Space was initiated on the Herschel program. More recently, a similar technology has been used to align and qualify the Gaia telescope.

  11. Whole genome sequencing options for bacterial strain typing and epidemiologic analysis based on single nucleotide polymorphism versus gene-by-gene-based approaches.

    Science.gov (United States)

    Schürch, A C; Arredondo-Alonso, S; Willems, R J L; Goering, R V

    2018-04-01

    Whole genome sequence (WGS)-based strain typing finds increasing use in the epidemiologic analysis of bacterial pathogens in both public health and more localized infection control settings. This minireview describes methodologic approaches that have been explored for WGS-based epidemiologic analysis and considers the challenges and pitfalls of data interpretation. Personal collection of relevant publications. When applying WGS to study the molecular epidemiology of bacterial pathogens, genomic variability between strains is translated into measures of distance by determining single nucleotide polymorphisms in core genome alignments or by indexing allelic variation in hundreds to thousands of core genes, assigning types to unique allelic profiles. Interpreting isolate relatedness from these distances is highly organism specific, and attempts to establish species-specific cutoffs are unlikely to be generally applicable. In cases where single nucleotide polymorphism or core gene typing does not provide the resolution necessary for accurate assessment of the epidemiology of bacterial pathogens, inclusion of accessory gene or plasmid sequences may provide the additional required discrimination. As with all epidemiologic analysis, realizing the full potential of the revolutionary advances in WGS-based approaches requires understanding and dealing with issues related to the fundamental steps of data generation and interpretation. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.
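
    The first approach mentioned in this abstract, turning single nucleotide polymorphisms in a core genome alignment into pairwise distances, can be illustrated with a minimal sketch. The isolate names and sequences are made up, and skipping columns that contain a gap or an ambiguous base is an assumption; real pipelines differ in how they mask such sites.

```python
from itertools import combinations


def snp_distance(a, b, ignore="-N"):
    """Count differing positions between two aligned sequences.

    Columns where either sequence has a gap or ambiguous base
    (characters in `ignore`) are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in ignore and y not in ignore
    )


def distance_matrix(alignment):
    """Return {(name_a, name_b): SNP distance} for every isolate pair."""
    return {
        (m, n): snp_distance(alignment[m], alignment[n])
        for m, n in combinations(sorted(alignment), 2)
    }


# Toy core genome alignment (hypothetical isolates).
core = {
    "isolate_A": "ACGTACGT",
    "isolate_B": "ACGTACGA",
    "isolate_C": "ACCTAC-A",
}
for pair, d in distance_matrix(core).items():
    print(pair, d)
# ('isolate_A', 'isolate_B') 1
# ('isolate_A', 'isolate_C') 2
# ('isolate_B', 'isolate_C') 1
```

    As the abstract cautions, whether a given distance indicates relatedness depends on the organism, so no cutoff is applied here.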

  12. A fungal phylogeny based on 42 complete genomes derived from supertree and combined gene analysis

    Directory of Open Access Journals (Sweden)

    Stajich Jason E

    2006-11-01

    Background: To date, most fungal phylogenies have been derived from single gene comparisons, or from concatenated alignments of a small number of genes. The increase in fungal genome sequencing presents an opportunity to reconstruct evolutionary events using entire genomes. As a tool for future comparative, phylogenomic and phylogenetic studies, we used both supertrees and concatenated alignments to infer relationships between 42 species of fungi for which complete genome sequences are available. Results: A dataset of 345,829 genes was extracted from 42 publicly available fungal genomes. Supertree methods were employed to derive phylogenies from 4,805 single gene families. We found that the average consensus supertree method may suffer from long-branch attraction artifacts, while matrix representation with parsimony (MRP) appears to be immune from these. A genome phylogeny was also reconstructed from a concatenated alignment of 153 universally distributed orthologs. Our MRP supertree and concatenated phylogeny are highly congruent. Within the Ascomycota, the sub-phyla Pezizomycotina and Saccharomycotina were resolved. Both phylogenies infer that the Leotiomycetes are the closest sister group to the Sordariomycetes. There is some ambiguity regarding the placement of Stagonospora nodorum, the sole member of the class Dothideomycetes present in the dataset. Within the Saccharomycotina, a monophyletic clade containing organisms that translate CTG as serine instead of leucine is evident. There is also strong support for two groups within the CTG clade, one containing the fully sexual species Candida lusitaniae, Candida guilliermondii and Debaryomyces hansenii, and the second group containing Candida albicans, Candida dubliniensis, Candida tropicalis, Candida parapsilosis and Lodderomyces elongisporus. The second major clade within the Saccharomycotina contains species whose genomes have undergone a whole genome duplication (WGD), and their close

  13. Alignment of the ATLAS Inner Detector Tracking System

    CERN Document Server

    Heller, C; The ATLAS collaboration

    2011-01-01

    ATLAS is one of the multipurpose experiments that record the products of the LHC proton-proton and heavy ion collisions. In order to reconstruct trajectories of charged particles produced in these collisions, ATLAS is equipped with a tracking system built using two different technologies, silicon planar sensors (pixel and microstrips) and drift-tube based detectors. Together they constitute the ATLAS Inner Detector, which is embedded in a 2 T axial field. Efficiently reconstructing tracks from charged particles traversing the detector, and precisely measuring their momenta, is of crucial importance for physics analyses. In order to achieve its scientific goals, an alignment of the ATLAS Inner Detector is required to accurately determine its more than 700,000 degrees of freedom. The goal of the alignment is set such that the limited knowledge of the sensor locations should not deteriorate the resolution of track parameters by more than 20% with respect to the intrinsic tracker resolution. The implementation of t...

  14. Apparatus for accurately measuring high temperatures

    Science.gov (United States)

    Smith, D.D.

    The present invention is a thermometer used for measuring furnace temperatures in the range of about 1800° to 2700°C. The thermometer comprises a broadband multicolor thermal radiation sensor positioned to be in optical alignment with the end of a blackbody sight tube extending into the furnace. A valve-shutter arrangement is positioned between the radiation sensor and the sight tube, and a chamber for containing a charge of high pressure gas is positioned between the valve-shutter arrangement and the radiation sensor. A momentary opening of the valve-shutter arrangement allows a pulse of the high pressure gas to purge the sight tube of air-borne thermal radiation contaminants, which permits the radiation sensor to accurately measure the thermal radiation emanating from the end of the sight tube.

  15. Comparison of robust H∞ filter and Kalman filter for initial alignment of inertial navigation system

    Institute of Scientific and Technical Information of China (English)

    HAO Yan-ling; CHEN Ming-hui; LI Liang-jun; XU Bo

    2008-01-01

    There are many filtering methods that can be used for the initial alignment of an integrated inertial navigation system. This paper discussed the use of GPS, but focused on two kinds of filters for the initial alignment of an integrated strapdown inertial navigation system (SINS). One method is based on the Kalman filter (KF), and the other is based on the robust filter. Simulation results showed that the robust filter provides a quick transient response and a slightly more accurate estimate than the KF, given substantial process noise or unknown noise statistics. So the robust filter is an effective and useful method for initial alignment of SINS. This research should make the use of SINS more popular, and is also a step for further research.
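
    For readers unfamiliar with the Kalman filter that serves as the baseline in this comparison, a minimal one-dimensional sketch follows. It estimates a single, nearly constant quantity from noisy measurements; it is not the SINS alignment model from the paper and it does not implement the robust H∞ filter. The function name and the noise parameters `q` (process noise variance) and `r` (measurement noise variance) are illustrative assumptions.

```python
def kalman_1d(measurements, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a (nearly) constant state.

    q  : process noise variance (assumed known)
    r  : measurement noise variance (assumed known)
    x0 : initial state estimate
    p0 : initial estimate variance
    Returns the list of state estimates after each measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is modelled as constant, so only the
        # uncertainty grows.
        p = p + q
        # Update: blend prediction and measurement using the gain.
        gain = p / (p + r)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# Toy run (hypothetical data): repeated measurements of the value 2.0.
# With p0 = r = 1 and q = 0, the first gain is 0.5, so the first
# estimate is 1.0; later estimates move toward 2.0.
print(kalman_1d([2.0, 2.0, 2.0, 2.0], q=0.0, r=1.0))
```

    The filter relies on `q` and `r` being known, which is the situation the abstract contrasts with the robust filter under unknown noise statistics.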

  16. Magnet-assisted device-level alignment for the fabrication of membrane-sandwiched polydimethylsiloxane microfluidic devices

    International Nuclear Information System (INIS)

    Lu, J-C; Liao, W-H; Tung, Y-C

    2012-01-01

    Polydimethylsiloxane (PDMS) microfluidic devices are among the most essential tools that have advanced microfluidics research in recent decades. PDMS is broadly exploited to construct microfluidic devices due to its unique and advantageous material properties. To realize more functionalities, PDMS microfluidic devices with multi-layer architectures, especially those with sandwiched membranes, have been developed for various applications. However, existing alignment methods for device fabrication are mainly based on manual observations, which are time consuming, inaccurate and inconsistent. This paper develops a magnet-assisted alignment method to enhance device-level alignment accuracy and precision without complicated fabrication processes. In the developed alignment method, magnets are embedded into PDMS layers at the corners of the device. The paired magnets are arranged in symmetric positions at each PDMS layer, and the magnetic attraction force automatically pulls the PDMS layers into the aligned position during assembly. This paper also applies the method to construct a practical microfluidic device, a tunable chaotic micromixer. The results demonstrate the successful operation of the device without failure, which suggests that the method achieves accurate alignment and reliable bonding. Consequently, the fabrication method developed in this paper is a promising way to construct various membrane-sandwiched PDMS microfluidic devices with more integrated functionalities to advance microfluidics research. (paper)

  17. Quantification of Cardiomyocyte Alignment from Three-Dimensional (3D) Confocal Microscopy of Engineered Tissue.

    Science.gov (United States)

    Kowalski, William J; Yuan, Fangping; Nakane, Takeichiro; Masumoto, Hidetoshi; Dwenger, Marc; Ye, Fei; Tinney, Joseph P; Keller, Bradley B

    2017-08-01

    Biological tissues have complex, three-dimensional (3D) organizations of cells and matrix factors that provide the architecture necessary to meet morphogenic and functional demands. Disordered cell alignment is associated with congenital heart disease, cardiomyopathy, and neurodegenerative diseases and repairing or replacing these tissues using engineered constructs may improve regenerative capacity. However, optimizing cell alignment within engineered tissues requires quantitative 3D data on cell orientations and both efficient and validated processing algorithms. We developed an automated method to measure local 3D orientations based on structure tensor analysis and incorporated an adaptive subregion size to account for multiple scales. Our method calculates the statistical concentration parameter, κ, to quantify alignment, as well as the traditional orientational order parameter. We validated our method using synthetic images and accurately measured principal axis and concentration. We then applied our method to confocal stacks of cleared, whole-mount engineered cardiac tissues generated from human-induced pluripotent stem cells or embryonic chick cardiac cells and quantified cardiomyocyte alignment. We found significant differences in alignment based on cellular composition and tissue geometry. These results from our synthetic images and confocal data demonstrate the efficiency and accuracy of our method to measure alignment in 3D tissues.
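
    As a rough illustration of the "orientational order parameter" mentioned above, the sketch below assumes the common 3D form S = <(3 cos^2(theta) - 1) / 2>, where theta is the angle between each local orientation and a reference axis. The abstract does not define the parameter, so this form is an assumption; the function name `order_parameter` and the supplied `director` argument are hypothetical, and the structure tensor step that would yield the orientations and principal axis is not shown.

```python
import math


def order_parameter(orientations, director):
    """Mean of (3*cos^2(theta) - 1) / 2 over all orientation vectors.

    orientations: iterable of 3D vectors (need not be unit length).
    director: reference axis (assumed known here; hypothetical input).
    Returns 1.0 for perfect alignment with the director, -0.5 when every
    vector is perpendicular to it, and about 0.0 for isotropic data.
    """
    orientations = list(orientations)
    if not orientations:
        raise ValueError("at least one orientation vector is required")
    d_norm = math.sqrt(sum(c * c for c in director))
    total = 0.0
    for v in orientations:
        v_norm = math.sqrt(sum(c * c for c in v))
        cos_t = sum(a * b for a, b in zip(v, director)) / (v_norm * d_norm)
        # cos^2 makes the measure insensitive to the sign of the vector,
        # which suits axial data such as cell orientations.
        total += 0.5 * (3.0 * cos_t * cos_t - 1.0)
    return total / len(orientations)
```

    Because only cos^2(theta) enters, a vector and its negation contribute equally, so no sign convention for the orientations is needed.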

  18. Experimental annotation of the human genome using microarray technology.

    Science.gov (United States)

    Shoemaker, D D; Schadt, E E; Armour, C D; He, Y D; Garrett-Engele, P; McDonagh, P D; Loerch, P M; Leonardson, A; Lum, P Y; Cavet, G; Wu, L F; Altschuler, S J; Edwards, S; King, J; Tsang, J S; Schimmack, G; Schelter, J M; Koch, J; Ziman, M; Marton, M J; Li, B; Cundiff, P; Ward, T; Castle, J; Krolewski, M; Meyer, M R; Mao, M; Burchard, J; Kidd, M J; Dai, H; Phillips, J W; Linsley, P S; Stoughton, R; Scherer, S; Boguski, M S

    2001-02-15

    The most important product of the sequencing of a genome is a complete, accurate catalogue of genes and their products, primarily messenger RNA transcripts and their cognate proteins. Such a catalogue cannot be constructed by computational annotation alone; it requires experimental validation on a genome scale. Using 'exon' and 'tiling' arrays fabricated by ink-jet oligonucleotide synthesis, we devised an experimental approach to validate and refine computational gene predictions and define full-length transcripts on the basis of co-regulated expression of their exons. These methods can provide more accurate gene numbers and allow the detection of mRNA splice variants and identification of the tissue- and disease-specific conditions under which genes are expressed. We apply our technique to chromosome 22q under 69 experimental condition pairs, and to the entire human genome under two experimental conditions. We discuss implications for more comprehensive, consistent and reliable genome annotation, more efficient, full-length complementary DNA cloning strategies and application to complex diseases.

  19. Voxel-based registration of simulated and real patient CBCT data for accurate dental implant pose estimation

    Science.gov (United States)

    Moreira, António H. J.; Queirós, Sandro; Morais, Pedro; Rodrigues, Nuno F.; Correia, André Ricardo; Fernandes, Valter; Pinho, A. C. M.; Fonseca, Jaime C.; Vilaça, João. L.

    2015-03-01

    The success of dental implant-supported prosthesis is directly linked to the accuracy obtained during the implant's pose estimation (position and orientation). Although traditional impression techniques and recent digital acquisition methods are acceptably accurate, a simultaneously fast, accurate and operator-independent methodology is still lacking. To this end, an image-based framework is proposed to estimate the patient-specific implant's pose using cone-beam computed tomography (CBCT) and prior knowledge of the implanted model. The pose estimation is accomplished in a three-step approach: (1) a region-of-interest is extracted from the CBCT data using 2 operator-defined points at the implant's main axis; (2) a simulated CBCT volume of the known implanted model is generated through Feldkamp-Davis-Kress reconstruction and coarsely aligned to the defined axis; and (3) a voxel-based rigid registration is performed to optimally align both patient and simulated CBCT data, extracting the implant's pose from the optimal transformation. Three experiments were performed to evaluate the framework: (1) an in silico study using 48 implants distributed through 12 tridimensional synthetic mandibular models; (2) an in vitro study using an artificial mandible with 2 dental implants acquired with an i-CAT system; and (3) two clinical case studies. The results showed positional errors of 67 ± 34 μm and 108 μm, and angular misfits of 0.15 ± 0.08° and 1.4°, for experiments 1 and 2, respectively. Moreover, in experiment 3, visual assessment of the clinical data results showed a coherent alignment of the reference implant. Overall, a novel image-based framework for implants' pose estimation from CBCT data was proposed, showing accurate results in agreement with dental prosthesis modelling requirements.

  20. Image-based quantification of fiber alignment within electrospun tissue engineering scaffolds is related to mechanical anisotropy.

    Science.gov (United States)

    Fee, Timothy; Downs, Crawford; Eberhardt, Alan; Zhou, Yong; Berry, Joel

    2016-07-01

    It is well documented that electrospun tissue engineering scaffolds can be fabricated with variable degrees of fiber alignment to produce scaffolds with anisotropic mechanical properties. Several attempts have been made to quantify the degree of fiber alignment within an electrospun scaffold using image-based methods. However, these methods are limited by the inability to produce a quantitative measure of alignment that can be used to make comparisons across publications. Therefore, we have developed a new approach to quantifying the alignment present within a scaffold from scanning electron microscopic (SEM) images. The alignment is determined by using the Sobel approximation of the image gradient to determine the distribution of gradient angles within an image. This data was fit to a Von Mises distribution to find the dispersion parameter κ, which was used as a quantitative measure of fiber alignment. We fabricated four groups of electrospun polycaprolactone (PCL) + Gelatin scaffolds with alignments ranging from κ = 1.9 (aligned) to κ = 0.25 (random) and tested our alignment quantification method on these scaffolds. It was found that our alignment quantification method could distinguish between scaffolds of different alignments more accurately than two other published methods. Additionally, the alignment parameter κ was found to be a good predictor of the mechanical anisotropy of our electrospun scaffolds. The ability to quantify fiber alignment within scaffolds and to make direct comparisons of scaffold fiber alignment across publications can reduce ambiguity between published results where cells are cultured on "highly aligned" fibrous scaffolds. This could have important implications for characterizing mechanics and cellular behavior on aligned tissue engineering scaffolds. © 2016 Wiley Periodicals, Inc. J Biomed Mater Res Part A: 104A: 1680-1686, 2016.
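
    The von Mises fitting step described above can be sketched as follows, assuming the gradient angles have already been extracted from the image (the Sobel step is not shown). The function name `estimate_kappa` is hypothetical, and the piecewise formula is a widely used approximation to the maximum-likelihood estimate of κ from the mean resultant length; the abstract does not say which fitting procedure the authors used.

```python
import math
import random


def estimate_kappa(angles):
    """Approximate von Mises concentration kappa from angles in radians.

    Uses the mean resultant length R of the unit vectors
    (cos a, sin a) and a standard piecewise approximation to the
    maximum-likelihood estimate. Larger kappa means tighter alignment.
    """
    n = len(angles)
    if n == 0:
        raise ValueError("at least one angle is required")
    c = sum(math.cos(a) for a in angles) / n
    s = sum(math.sin(a) for a in angles) / n
    r = math.hypot(c, s)  # mean resultant length, between 0 and 1
    if r < 0.53:
        return 2.0 * r + r ** 3 + 5.0 * r ** 5 / 6.0
    if r < 0.85:
        return -0.4 + 1.39 * r + 0.43 / (1.0 - r)
    denominator = r ** 3 - 4.0 * r ** 2 + 3.0 * r
    # Identical angles give R = 1, where the estimate diverges.
    return math.inf if denominator <= 0.0 else 1.0 / denominator


# Synthetic check: draw angles with a known concentration and refit.
random.seed(0)
sample = [random.vonmisesvariate(0.0, 1.9) for _ in range(20000)]
kappa_hat = estimate_kappa(sample)
```

    If the angles represent fiber orientations defined only modulo π, a common convention is to double them before fitting; whether that applies to the published method is not stated in the abstract.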

  1. RevTrans: multiple alignment of coding DNA from aligned amino acid sequences

    DEFF Research Database (Denmark)

    Wernersson, Rasmus; Pedersen, Anders Gorm

    2003-01-01

    The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the 'signal-to-noise ratio' in protein sequence alignments is much better than in alignments of DNA. Besides this information-theoretical advantage, protein alignments also benefit...... proteins. It is therefore preferable to align coding DNA at the amino acid level and it is for this purpose we have constructed the program RevTrans. RevTrans constructs a multiple DNA alignment by: (i) translating the DNA; (ii) aligning the resulting peptide sequences; and (iii) building a multiple DNA...
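
    Step (iii) above, building the DNA alignment from an existing peptide alignment, can be illustrated with a minimal sketch. It is not RevTrans's actual code: the function name `thread_dna` is hypothetical, the peptide alignment is assumed to be given with "-" as the gap character, and each residue is assumed to correspond to exactly one complete codon.

```python
def thread_dna(aligned_protein, dna, gap="-"):
    """Map an ungapped coding DNA sequence onto a gapped peptide alignment.

    Every residue in the aligned peptide consumes the next codon of the
    DNA sequence; every gap column becomes a three-base gap.
    """
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    pieces = []
    next_codon = 0
    for residue in aligned_protein:
        if residue == gap:
            pieces.append(gap * 3)
        else:
            if next_codon >= len(codons):
                raise ValueError("more residues than codons")
            pieces.append(codons[next_codon])
            next_codon += 1
    if next_codon != len(codons):
        raise ValueError("more codons than residues")
    return "".join(pieces)


# Two peptide rows from a hypothetical alignment and their coding DNA.
row_a = thread_dna("M-K", "ATGAAA")
row_b = thread_dna("MGK", "ATGGGTAAG")
```

    Applying the function to every row of the peptide alignment yields DNA rows of equal length whose gaps fall on codon boundaries.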

  2. PriFi - Using a Multiple Alignment of Related Sequences to Find Primers for Amplification of Homologs

    DEFF Research Database (Denmark)

    Fredslund, Jakob; Schauser, Leif; Madsen, Lene Heegaard

    2005-01-01

    Using a comparative approach, the web program PriFi (http://cgi-www.daimi.au.dk/cgi-chili/PriFi/main) designs pairs of primers useful for PCR amplification of genomic DNA in species where prior sequence information is not available. The program works with an alignment of DNA sequences from phylog...

  3. Precise Alignment and Permanent Mounting of Thin and Lightweight X-ray Segments

    Science.gov (United States)

    Biskach, Michael P.; Chan, Kai-Wing; Hong, Melinda N.; Mazzarella, James R.; McClelland, Ryan S.; Norman, Michael J.; Saha, Timo T.; Zhang, William W.

    2012-01-01

    To provide observations to support current research efforts in high energy astrophysics, future X-ray telescope designs must provide matching or better angular resolution while significantly increasing the total collecting area. In such a design the permanent mounting of thin and lightweight segments is critical to the overall performance of the complete X-ray optic assembly. The thin and lightweight segments used in the assembly of the modules are designed to maintain and/or exceed the resolution of existing X-ray telescopes while providing a substantial increase in collecting area. Such thin and delicate X-ray segments are easily distorted and yet must be aligned to the arcsecond level and retain accurate alignment for many years. The Next Generation X-ray Optic (NGXO) group at NASA Goddard Space Flight Center has designed, assembled, and implemented new hardware and procedures with the short term goal of aligning three pairs of X-ray segments in a technology demonstration module while maintaining 10 arcsec alignment through environmental testing as part of the eventual design and construction of a full sized module capable of housing hundreds of X-ray segments. The recent attempts at multiple segment pair alignment and permanent mounting are described along with an overview of the procedure used. A look into what the next year will bring for the alignment and permanent segment mounting effort illustrates some of the challenges left to overcome before an attempt to populate a full sized module can begin.

  4. A periodic pattern of SNPs in the human genome

    DEFF Research Database (Denmark)

    Madsen, Bo Eskerod; Villesen, Palle; Wiuf, Carsten

    2007-01-01

    By surveying a filtered, high-quality set of SNPs in the human genome, we have found that SNPs positioned 1, 2, 4, 6, or 8 bp apart are more frequent than SNPs positioned 3, 5, 7, or 9 bp apart. The observed pattern is not restricted to genomic regions that are known to cause sequencing...... periodic DNA. Our results suggest that not all SNPs in the human genome are created by independent single nucleotide mutations, and that care should be taken in analysis of SNPs from periodic DNA. The latter may have important consequences for SNP and association studies....... or alignment errors, for example, transposable elements (SINE, LINE, and LTR), tandem repeats, and large duplicated regions. However, we found that the pattern is almost entirely confined to what we define as "periodic DNA." Periodic DNA is a genomic region with a high degree of periodicity in nucleotide usage...
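
    The kind of distance tally behind this observation can be sketched as below. The abstract does not say whether distances were measured between adjacent SNPs or between all SNP pairs, so counting adjacent SNPs on a single chromosome is an assumption, and the function name `adjacent_snp_distances` is hypothetical.

```python
from collections import Counter


def adjacent_snp_distances(positions, max_distance=9):
    """Count distances (in bp) between neighbouring SNP positions.

    positions: SNP coordinates on one chromosome (any order, duplicates
    ignored). Only distances up to max_distance are tallied.
    """
    ordered = sorted(set(positions))
    counts = Counter()
    for left, right in zip(ordered, ordered[1:]):
        distance = right - left
        if distance <= max_distance:
            counts[distance] += 1
    return counts


# Hypothetical coordinates, for illustration only.
counts = adjacent_snp_distances([100, 102, 105, 106, 200])
```

    Comparing the tallies for even and odd distances against those expected under independent mutation would be the next step; that comparison is not shown here.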

  5. Model selection in Bayesian segmentation of multiple DNA alignments.

    Science.gov (United States)

    Oldmeadow, Christopher; Keith, Jonathan M

    2011-03-01

    The analysis of multiple sequence alignments is allowing researchers to glean valuable insights into evolution, as well as identify genomic regions that may be functional, or discover novel classes of functional elements. Understanding the distribution of conservation levels that constitutes the evolutionary landscape is crucial to distinguishing functional regions from non-functional. Recent evidence suggests that a binary classification of evolutionary rates is inappropriate for this purpose and finds only highly conserved functional elements. Given that the distribution of evolutionary rates is multi-modal, determining the number of modes is of paramount concern. Through simulation, we evaluate the performance of a number of information criterion approaches derived from MCMC simulations in determining the dimension of a model. We utilize a deviance information criterion (DIC) approximation that is more robust than the approximations from other information criteria, and show our information criteria approximations do not produce superfluous modes when estimating conservation distributions under a variety of circumstances. We analyse the distribution of conservation for a multiple alignment comprising four primate species and mouse, and repeat this on two additional multiple alignments of similar species. We find evidence of six distinct classes of evolutionary rates that appear to be robust to the species used. Source code and data are available at http://dl.dropbox.com/u/477240/changept.zip.
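
    For orientation, the textbook deviance information criterion can be computed from MCMC output as sketched below. This is the standard definition (mean deviance plus effective number of parameters), not the specific approximation the authors developed, and the function name `dic` and its inputs are hypothetical.

```python
def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, p_D) from MCMC deviance draws.

    deviance_samples: deviance D(theta) = -2 log L(theta) evaluated at
        each retained MCMC draw.
    deviance_at_posterior_mean: D evaluated at the posterior mean of theta.
    p_D = mean(D) - D(mean theta) is the effective number of parameters,
    and DIC = mean(D) + p_D. Lower DIC indicates a preferred model.
    """
    if not deviance_samples:
        raise ValueError("at least one deviance sample is required")
    mean_deviance = sum(deviance_samples) / len(deviance_samples)
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d, p_d
```

    Model selection then amounts to evaluating this quantity for each candidate number of rate classes and comparing the values.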

  6. Whole genome SNP discovery and analysis of genetic diversity in Turkey (Meleagris gallopavo)

    Science.gov (United States)

    2012-01-01

    Background: The turkey (Meleagris gallopavo) is an important agricultural species and the second largest contributor to the world’s poultry meat production. Genetic improvement is attributed largely to selective breeding programs that rely on highly heritable phenotypic traits, such as body size and breast muscle development. Commercial breeding with small effective population sizes and epistasis can result in loss of genetic diversity, which in turn can lead to reduced individual fitness and reduced response to selection. The presence of genomic diversity in domestic livestock species is therefore of great importance and a prerequisite for rapid and accurate genetic improvement of selected breeds in various environments, as well as to facilitate rapid adaptation to potential changes in breeding goals. Genomic selection requires a large number of genetic markers such as single nucleotide polymorphisms (SNPs), the most abundant source of genetic variation within the genome. Results: Alignment of next generation sequencing data of 32 individual turkeys from different populations was used for the discovery of 5.49 million SNPs, which subsequently were used for the analysis of genetic diversity among the different populations. All of the commercial lines branched from a single node relative to the heritage varieties and the South Mexican turkey population. Heterozygosity of all individuals from the different turkey populations ranged from 0.17-2.73 SNPs/Kb, while heterozygosity of populations ranged from 0.73-1.64 SNPs/Kb. The average frequency of heterozygous SNPs in individual turkeys was 1.07 SNPs/Kb. Five genomic regions with very low nucleotide variation were identified in domestic turkeys that showed a state of fixation towards alleles different from the wild alleles. Conclusion: The turkey genome is much less diverse, with a relatively low frequency of heterozygous SNPs, as compared to other livestock species like chicken and pig. The whole genome SNP discovery

  7. Whole genome SNP discovery and analysis of genetic diversity in Turkey (Meleagris gallopavo)

    Directory of Open Access Journals (Sweden)

    Aslam Muhammad L

    2012-08-01

    Full Text Available Abstract Background The turkey (Meleagris gallopavo is an important agricultural species and the second largest contrib