WorldWideScience

Sample records for large sequence contigs

  1. AcEST: CL1889Contig1 [AcEST

    Lifescience Database Archive (English)

    Full Text Available CL1889Contig1 491 2 Adiantum capillus-veneris contig: CL1889contig1 sequence. Link ...apillus-veneris contig: CL1889contig1 sequence. Link to clone list Link to clone list Clone ID BP919609 BP91

  2. Contig-Layout-Authenticator (CLA): A Combinatorial Approach to Ordering and Scaffolding of Bacterial Contigs for Comparative Genomics and Molecular Epidemiology.

    Science.gov (United States)

    Shaik, Sabiha; Kumar, Narender; Lankapalli, Aditya K; Tiwari, Sumeet K; Baddam, Ramani; Ahmed, Niyaz

    2016-01-01

    A wide variety of genome sequencing platforms have emerged in the recent past. High-throughput platforms like Illumina and 454 are essentially adaptations of the shotgun approach, generating millions of fragmented single or paired sequencing reads. To reconstruct whole genomes, the reads have to be assembled into contigs, which often require further downstream processing. The contigs can be directly ordered according to a reference, scaffolded based on paired read information, or assembled using a combination of the two approaches. While the reference-based approach appears to mask strain-specific information, scaffolding based on paired-end information suffers when repetitive elements longer than the sequencing reads are present in the genome. Sequencing technologies that produce long reads can solve the problems associated with repetitive elements but are not necessarily easily available to researchers. The most common high-throughput technology currently used is the Illumina short-read platform. To address the shortcomings associated with the construction of draft genomes from Illumina paired-end sequencing, we developed Contig-Layout-Authenticator (CLA). The CLA pipeline can scaffold reference-sorted contigs based on paired reads, resulting in better assembled genomes. Moreover, CLA also hints at probable misassemblies and contaminations, for the users to cross-check before constructing the consensus draft. The CLA pipeline was designed and trained extensively on various bacterial genome datasets for the ordering and scaffolding of large repetitive contigs. The tool has been validated and compares favorably with other widely used scaffolding and ordering tools on both simulated and real sequence datasets. CLA is a user-friendly tool that requires a single command-line input to generate ordered scaffolds.
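
The reference-based ordering step mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not CLA's implementation, which the abstract does not describe in detail and which also uses paired-read information. The `ContigHit` record, the function names, and the example coordinates are all hypothetical, and the sketch assumes each contig already has a single best alignment to the reference.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ContigHit:
    """Best alignment of one contig to the reference (hypothetical record)."""
    contig_id: str
    ref_start: int  # 1-based alignment start on the reference
    ref_end: int    # 1-based alignment end on the reference
    strand: str     # "PLUS" or "MINUS"; MINUS hits may list start > end


def order_by_reference(hits: List[ContigHit]) -> List[ContigHit]:
    """Sort contigs by their leftmost coordinate on the reference."""
    return sorted(hits, key=lambda h: min(h.ref_start, h.ref_end))


def gaps_between(ordered: List[ContigHit]) -> List[Tuple[str, str, int]]:
    """Return (left_id, right_id, gap) for each adjacent pair.

    The gap is the number of reference bases between the two contigs;
    a negative value means the alignments overlap.
    """
    result = []
    for left, right in zip(ordered, ordered[1:]):
        left_end = max(left.ref_start, left.ref_end)
        right_start = min(right.ref_start, right.ref_end)
        result.append((left.contig_id, right.contig_id, right_start - left_end - 1))
    return result


# Hypothetical example data, not taken from any record in this listing.
hits = [
    ContigHit("c2", 900, 600, "MINUS"),
    ContigHit("c1", 100, 500, "PLUS"),
    ContigHit("c3", 1000, 1400, "PLUS"),
]
ordered = order_by_reference(hits)
print([h.contig_id for h in ordered])
print(gaps_between(ordered))
```

A layout ordered this way inherits the arrangement of the reference, which is the limitation the abstract notes: strain-specific rearrangements can be masked. That is why a scaffolder would normally cross-check such an ordering against paired-read evidence.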

  3. Dicty_cDB: Contig-U10406-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U10406-1 no gap 661 4 1621526 1620875 MINUS 1 1 U10406 0 0 1 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U10406-1 Contig ID Contig-U10406-1 Contig update 2002. 9.13 Contig sequence >Contig-U10406-1 (Contig...-U10406-1Q) /CSM_Contig/Contig-U10406-1Q.Seq.d NNNNNNNNNNATAAGTAAAAGAGTTATTGGTCCAAGATTAGATGATGACA...TACAAATAAGTAAAGTTG ATAAAGAACAT Gap no gap Contig length 661 Chromosome number (1....cid sequence XXXISKRVIGPRLDDDNNNNDNDKFNNNNKKAIGPSRIGPTIGPSIGPSRYNTNNNDSNH NSNNDDDDDSSEEDEEDTKSEWERVRNMIENNKN

  4. Dicty_cDB: Contig-U16457-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U16457-1 no gap 1065 3 996438 997502 PLUS 6 5 U16457 0 0 1 0 1 0 2 0 0 0 0 0 1 1 Show Contig...-U16457-1 Contig ID Contig-U16457-1 Contig update 2004. 6.11 Contig sequence >Contig-U16457-1 (Contig...-U16457-1Q) /CSM_Contig/Contig-U16457-1Q.Seq.d ACAATTGGTGTTGCTGCTCTATTCGGTCTTCCAGCTATGGCACGTTCCGC A...TTTAACAAGATTGGAAGAC CAAAAAGAAAAAAAA Gap no gap Contig length 1065 Chromosome numb... Translated Amino Acid sequence TIGVAALFGLPAMARSAAMSLVFLIPFMWIVFSVHYPINSVVADICMSYNNNTGSIEQQL ANYTNPIVSEIFGTC

  5. Dicty_cDB: Contig-U13326-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13326-1 no gap 240 6 1728259 1728019 MINUS 1 1 U13326 0 0 0 0 0 1 0 0 0 0 0 0 0 0 Show Contig...-U13326-1 Contig ID Contig-U13326-1 Contig update 2002.12.18 Contig sequence >Contig-U13326-1 (Contig...-U13326-1Q) /CSM_Contig/Contig-U13326-1Q.Seq.d AATGACTCAACAAATCTTGGAGAGTATGCAAAATACTTTCCAATCTATGG...CCTCGTTAAAGGTGCTGGTGC TGAATTAAGTTCTCGTGCTCATGAGTGTTTCATTAGTGCCTTGGATATTG CCTCTGATTATACCTACGAGAAAATTACCATTGGCTTGGA Gap no gap Contig...FQSMDGPTIKRLATTIQYGSKDVDEQQIHSTLVKGAGAELSSRAHECF ISALDIASDYTYEKITIGL Translated Amino Acid sequence (All Fra

  6. Dicty_cDB: Contig-U01997-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U01997-1 gap included 886 2 1683026 1682230 MINUS 3 4 U01997 1 0 0 0 0 0 2 0 0 0 0 0 0 0 Show Contig...-U01997-1 Contig ID Contig-U01997-1 Contig update 2001. 8.29 Contig sequence >Contig-U01997-1 (Contig-U01997-1Q) /CSM_Contig/Contig-U01997...ATTGAAATAATATTTATTTATTTTTTTAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Gap gap included Contig...nfkvfgieiifiyffkkkkkkkkkkkkkkkkkkk own update 2004. 6. 9 Homology vs CSM-cDNA Query= Contig-U01997-1 (Contig-U01997-1Q) /CSM_Contig.../Contig-U01997-1Q.Seq.d (896 letters) Database: CSM 6905 sequences; 5,674,871 total l

  7. Dicty_cDB: Contig-U09581-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09581-1 gap included 1235 1 2575525 2576764 PLUS 1 2 U09581 0 0 1 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U09581-1 Contig ID Contig-U09581-1 Contig update 2002. 9.13 Contig sequence >Contig-U09581-1 (Contig-U09581-1Q) /CSM_Contig/Contig-U09581...ATCAAAATAAATTTTTGTAACATTAATAATAAATAAN Gap gap included Contig length 1235 Chromosome number (1..6, M) 1 Chro... VFD420Z ,579,1237 Translated Amino Acid sequence KKPGVVTIKGSSFCSQPTITIGDDSCSQPILSVGNDYDSLTCNFQSNAGLSNSTLLVS...ames) Frame A: KKPGVVTIKGSSFCSQPTITIGDDSCSQPILSVGNDYDSLTCNFQSNAGLSNSTLLVSII CDTIQ

  8. Dicty_cDB: Contig-U09822-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09822-1 gap included 1255 3 5930658 5929418 MINUS 5 6 U09822 3 0 2 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U09822-1 Contig ID Contig-U09822-1 Contig update 2002. 9.13 Contig sequence >Contig-U09822-1 (Contig-U09822-1Q) /CSM_Contig/Contig-U0982...AAAAGAAAAAAAAAAAAAAAAGATTTAATTAAATAAAAAAAAA AAAAAAAAAAAAAAA Gap gap included Contig length 1255 Chromosome n...,975 est6= VSA519Z ,780,1257 Translated Amino Acid sequence QPFYLVQSMFEPIQDSSFTSIGEIISYDTIG...rfn*ikkkkkk k Frame C: QPFYLVQSMFEPIQDSSFTSIGEIISYDTIGFDGKINTAVMSSLSPSTMYFYCVGDKS

  9. Dicty_cDB: Contig-U16467-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U16467-1 no gap 1261 2 7818565 7817305 MINUS 17 18 U16467 0 0 5 0 1 2 1 0 6 0 0 1 1 0 Show Contig...-U16467-1 Contig ID Contig-U16467-1 Contig update 2004. 6.11 Contig sequence >Contig-U16467-1 (Contig...-U16467-1Q) /CSM_Contig/Contig-U16467-1Q.Seq.d CAACAATTAACATTACTTAAATATAATATTATTATATTTTTTTTTTT...TTCAAATAAATAATTGTTTAGAAATTTCTAGAAAAAAAA AAAAAAAAAAA Gap no gap Contig length 1261 Chromosome number (1..6, M...LK833Z ,1005,1249 Translated Amino Acid sequence qqltllkyniiiffffyllplhlyhy**LKKKTLTIIKYFFQKMNKIALLFTIFFALFAI SFACDEFNPNTSTIG

  10. Dicty_cDB: Contig-U12086-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12086-1 gap included 1101 3 5710254 5711336 PLUS 1 2 U12086 0 0 0 0 0 0 0 1 0 0 0 0 0 0 Show Contig...-U12086-1 Contig ID Contig-U12086-1 Contig update 2002.12.18 Contig sequence >Contig-U12086-1 (Contig-U12086-1Q) /CSM_Contig/Contig-U12086...ATCGGATTA Gap gap included Contig length 1101 Chromosome number (1..6, M) 3 Chromosome length 6358359 Start ...te 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U12086-1 (Contig-U12086-1Q) /CSM_Contig/Contig...Sequences producing significant alignments: (bits) Value Contig-U12086-1 (Contig-U12086-1Q) /CSM_Contig/Conti... 404 e-113 Contig

  11. Dicty_cDB: Contig-U09694-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09694-1 gap included 1129 1 4027135 4026071 MINUS 3 4 U09694 2 0 1 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U09694-1 Contig ID Contig-U09694-1 Contig update 2002. 9.13 Contig sequence >Contig-U09694-1 (Contig-U09694-1Q) /CSM_Contig/Contig-U0969...TTAAATTAAAACAACAACAATTTCATAATATAAATAAT Gap gap included Contig length 1129 Chromosome number (1..6, M) 1 Chr...iklkqqqfklkqqqfhninn own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U09694-1 (Contig-U09694-1Q) /CSM_Contig/Contig...E Sequences producing significant alignments: (bits) Value Contig-U09694-1 (Contig-U09694-1Q) /CSM_Contig

  12. Dicty_cDB: Contig-U03323-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U03323-1 no gap 533 2 4820223 4820756 PLUS 2 1 U03323 0 0 0 0 0 0 0 0 0 0 0 0 1 1 Show Contig...-U03323-1 Contig ID Contig-U03323-1 Contig update 2001. 8.29 Contig sequence >Contig-U03323-1 (Contig...-U03323-1Q) /CSM_Contig/Contig-U03323-1Q.Seq.d ACATGTGACATTACTATTGGTAAATGTCAATGTTTAAAAAATACATGGTC ...TCAATAATGGTGGTGGTGGTGGTTTAGGT GAAACCCCCAATAGTAATAGTAATAGTGGTGAACTAGTTATCCCACCAAA ATCAAATACTACATTAAATGAAGAAACAGGTGG Gap no gap Contig... Link to clone list U03323 List of clone(s) est1= FC-IC0176F ,1,534 Translated Amino Acid sequence TCDITIGKC

  13. Dicty_cDB: Contig-U15069-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15069-1 no gap 1241 1 2719927 2720886 PLUS 37 43 U15069 16 2 0 0 0 5 8 0 0 1 1 0 4 0 Show Contig...-U15069-1 Contig ID Contig-U15069-1 Contig update 2004. 6.11 Contig sequence >Contig-U15069-1 (Contig...-U15069-1Q) /CSM_Contig/Contig-U15069-1Q.Seq.d TTTCAAACCAAAACATAAAATAATTAAAAATGACAACTGTTAAACCA...AAAAATAAAATAAATAAAAATAGTTTTAAA Gap no gap Contig length 1241 Chromosome number (1..6, M) 1 Chromosome length...07Z ,263,623 est42= VSJ431Z ,390,646 est43= CHB363Z ,460,1187 Translated Amino Acid sequence snqnik*lkmttvkptspenprvffditig

  14. Dicty_cDB: Contig-U12316-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12316-1 gap included 1238 4 1925901 1927143 PLUS 5 6 U12316 0 4 1 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U12316-1 Contig ID Contig-U12316-1 Contig update 2002.12.18 Contig sequence >Contig-U12316-1 (Contig-U12316-1Q) /CSM_Contig/Contig-U12316...GAGTTGAAGATTTAGTTTTATCAGNANGAANAAATAAGAT Gap gap included Contig length 1238 Chromosome number (1..6, M) 4 C...,915,1174 Translated Amino Acid sequence lvqhhyh*liscvivllksmv*isqvhivvhlfmfvn*qyileih*iptlknlskiftig...lip*r*rtrkttn*kiknny*itketkiqs*t*rvmmmi*vedlvls xxxnk Frame B: lvqhhyh*liscvivllksmv*isqvhivvhlfmfvn*qyileih*iptlknlskiftig

  15. Dicty_cDB: Contig-U16108-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U16108-1 gap included 1456 4 1889609 1888449 MINUS 4 6 U16108 0 0 2 0 1 0 1 0 0 0 0 0 0 0 Show Contig...-U16108-1 Contig ID Contig-U16108-1 Contig update 2004. 6.11 Contig sequence >Contig-U16108-1 (Contig-U16108-1Q) /CSM_Contig/Contig-U1610...AAAATCA TAAAATCAAAAATTGTATAATTAAAATAAAAATAAAAAAAAAAACAAAAA TAAAAAAAAAAAACAA Gap gap included Contig length 1...DFLSQFYGELN QPSLNNLTENIITIDQSSFIPIGYTTITAGLNNFAYAYIPTSCKNDKSLCSIHVAFHGCL QTVATIGDNFYTKTGYNEIAETNNIIILYPQALET...---NYVNNDNIKTMFDIQSEHAFITNSFGNNCTYLGPDYINNCNFNAPWDFLSQFYGELN QPSLNNLTENIITIDQSSFIPIGYTTITAGLNNFAYAYIPTSCKNDKSLCSIHVAFHGCL QTVATIG

  16. Dicty_cDB: Contig-U13974-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13974-1 no gap 1782 1 1265322 1267105 PLUS 29 32 U13974 0 0 0 1 2 0 22 0 4 0 0 0 0 0 Show Contig...-U13974-1 Contig ID Contig-U13974-1 Contig update 2002.12.18 Contig sequence >Contig-U13974-1 (Contig...-U13974-1Q) /CSM_Contig/Contig-U13974-1Q.Seq.d AAGAGTTAAAACAAAAATAAAAAAATAAAATAAAAAAAAAAAATTAA...TAAAACAAATAA ACATTAAAATGATATTTAGGTTTTAAATTTAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Gap no gap Contig...TTRKIYVYDNQNFFPIDNQGFD VDPAKRIYLNEKKTYHNYHFCMKMNTVFTYKGYEVFNFRGDDDVWVFINNKLVIDLGGLH SPIGTSVDTMTLGLTIG

  17. Dicty_cDB: Contig-U16008-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U16008-1 gap included 1557 5 1711154 1712676 PLUS 5 8 U16008 0 0 0 0 1 1 1 0 1 0 0 0 1 0 Show Contig...-U16008-1 Contig ID Contig-U16008-1 Contig update 2004. 6.11 Contig sequence >Contig-U16008-1 (Contig-U16008-1Q) /CSM_Contig/Contig-U16008... TAAGGTTTATGATTTTTGATTTTAGATTTTATATTTTATTTATTTTAATA AAAAAAAAAAAAAAAAA Gap gap included Contig length 1557 Ch...F LIFVHGSSTIIVLGIAIINFSISRIFERSKMLPAVTWIFNLIILWTCY--- ---PFGGFGARGPPSTIGYSRHTIGGMYGGHSPGPRLHLTGYLGIEPMNGKFLN...SSTIIVLGIAIINFSISRIFERSKMLPAVTWIFNLIILWTCY--- ---PFGGFGARGPPSTIGYSRHTIGGMYGGHSPGPRLHLTGYLGIEPMNGKFLNIGRTFR L

  18. Dicty_cDB: Contig-U11342-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U11342-1 gap included 2051 2 611517 609465 MINUS 4 7 U11342 0 2 1 1 0 0 0 0 0 0 0 0 0 0 Show Contig...-U11342-1 Contig ID Contig-U11342-1 Contig update 2002.12.18 Contig sequence >Contig-U11342-1 (Contig...-U11342-1Q) /CSM_Contig/Contig-U11342-1Q.Seq.d GTCAACATTAACATCATCATCATCATCATCACCATCTAGTAATAA...GAATTTGGTAATTTTAAAATCACTNATTAATATATTAAACAAAATTA TAAAAATAAAA Gap gap included Contig...EFFFIDRKSLLVNFP RGSICAQILKLIGNLYGSNDIIFKINTNNVSFFDGTIGANNSTNNSNSNQPMTPQQVVIK YLNPTARWKRREISNFEYLMTLNTIAGRTYN

  19. Dicty_cDB: Contig-U11195-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U11195-1 gap included 2858 2 4308456 4311316 PLUS 16 27 U11195 0 2 0 8 1 0 0... 3 0 2 0 0 0 0 Show Contig-U11195-1 Contig ID Contig-U11195-1 Contig update 2002.12.18 Contig sequence >Contig-U11195-1 (Contig...-U11195-1Q) /CSM_Contig/Contig-U11195-1Q.Seq.d AGCATTGGAACAAATCGAATTACGTGAAAAGATACCATTGTT...TATCACCTGCTCTTTATCCTTCAAATTTAAGT AATTCAACATTGGCCCAAAGAGTTACATGGATAAATAAATTATAAATAAT GTATAAAATCATTCTCTC Gap gap included Contig... EYREKIPLLDLPWGASKPWTLVDLRDDYDEDLMVRFYNELMLPNFPVKNELEPLSNFISA LSEERRESFNPHLSEVHVLLALRWPTDSSDLQPTIGAGIIFEYFSN

  20. Dicty_cDB: Contig-U01791-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U01791-1 no gap 527 2 7629792 7630319 PLUS 1 1 U01791 0 0 0 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U01791-1 Contig ID Contig-U01791-1 Contig update 2001. 8.29 Contig sequence >Contig-U01791-1 (Contig...-U01791-1Q) /CSM_Contig/Contig-U01791-1Q.Seq.d GTTTGATTATAATTTATATGAATGTGAAATTAGACAAGCATTATCAAATA ...TCGTTCCCTTATGATTTAAGAACAACTTT GAATAGTTACAGAAATGGTGAATTTAGTATTTATCAATAAATTTTTTTTT AAAGATTTATAATTAAAATAAAAAAAA Gap no gap Contig...SILWSIESIGSLIVSAQINDDRETMELLHRYQIPQKFLIPLF QILALIDQLEKDLSHQIELDKFTINRDYYFLKSFSNLIEPPLNCLGILKTSRPHFRIFKL VGKNMISQVLETIG

  1. Dicty_cDB: Contig-U09412-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09412-1 gap included 873 3 3953072 3953946 PLUS 1 2 U09412 0 0 0 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U09412-1 Contig ID Contig-U09412-1 Contig update 2002. 9.13 Contig sequence >Contig-U09412-1 (Contig...-U09412-1Q) /CSM_Contig/Contig-U09412-1Q.Seq.d ATTATCACAACTATTTTATAATAAACCAATTTTAAAGATTAAAGT...TGGTTCAATAAAAGAAATTAAATATAATTATCAATAAT AATAATAAATTAATTAATAAATTTAAATCAAAA Gap gap included Contig length 873 ...DCQCGFVSVVENNNNNNNNSDNENNENNENNENNE NNEDLEDFIPRKLLKKSSSTLQSRTYLVIYLGRRGILEIWGLKHRSREYFKTIG

  2. Dicty_cDB: Contig-U12357-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12357-1 gap included 1333 1 2827305 2828232 PLUS 5 6 U12357 0 1 1 2 0 0 1 0 0 0 0 0 0 0 Show Contig...-U12357-1 Contig ID Contig-U12357-1 Contig update 2002.12.18 Contig sequence >Contig-U12357-1 (Contig-U12357-1Q) /CSM_Contig/Contig-U12357...ATAAAATAAAATTTATTAATTTTCCAACT Gap gap included Contig length 1333 Chromosome numb...RYXEKKKXXXXDSXNXXXXXPXX XXLXXXXPXX--- ---QYEKMKLSGEKVDPTLDASIILGNRYLEKKKVTIGDSENYTITVPFSQILKNQKPLI IQRKTKGTL...-QYEKMKLSGEKVDPTLDASIILGNRYLEKKKVTIGDSENYTITVPFSQILKNQKPLI IQRKTKGTLYYSINLSYASLNPISKAIFNRGLNIKRTYYPVSNSNDVIY

  3. Dicty_cDB: Contig-U10996-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U10996-1 gap included 3017 2 5488454 5485454 MINUS 41 76 U10996 0 3 0 24 1 0... 0 8 0 5 0 0 0 0 Show Contig-U10996-1 Contig ID Contig-U10996-1 Contig update 2002.12.18 Contig sequence >Contig-U10996-1 (Contig...-U10996-1Q) /CSM_Contig/Contig-U10996-1Q.Seq.d TGGCCTACTGGTAAAAAAAATTCTAATTTTATTAAAACCC...CTATTTATAATGTATTGTTAAG GCAAAAATAAAAAAAAAAGNAAAAAAA Gap gap included Contig length...LTTTA SSSQQQQQELGLAVLTIRQGYEFENIVKELLDEKKKIEIWSMKPNSKQQWELIKKGSPGN TQMFEDVLLNGNCEGSVMMALKVTREKGSIVFGISFGDATFKTIG

  4. Dicty_cDB: Contig-U12049-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12049-1 gap included 2563 4 3071598 3069091 MINUS 9 17 U12049 0 0 0 0 2 0 0... 1 4 1 1 0 0 0 Show Contig-U12049-1 Contig ID Contig-U12049-1 Contig update 2002.12.18 Contig sequence >Contig-U12049-1 (Contig...-U12049-1Q) /CSM_Contig/Contig-U12049-1Q.Seq.d TAATGAAGGTAGTAATAATAATATAGTTGAAGCATCAAAAGA...TATCATTTAAACTGAAAAAAGTC CAAAAGATTTATGCAATGATTGCTGCGAATATGCTGCAACTTGTTCTCAT TAAAAATAAACAAAAAAATAATA Gap gap included Contig...disngqcvyseiidcgsssienss nqesssdidittastlgstiastigstigltstttttttsqttgtpttppqtvseipisl astistspvsdegtiastiatt

  5. Dicty_cDB: Contig-U10291-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U10291-1 no gap 932 4 3203354 3204286 PLUS 2 2 U10291 0 0 1 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U10291-1 Contig ID Contig-U10291-1 Contig update 2002. 9.13 Contig sequence >Contig-U10291-1 (Contig...-U10291-1Q) /CSM_Contig/Contig-U10291-1Q.Seq.d GTAAAGGTTTTATGTGTATATTTTTTAATGACCTTTTCGAATTAGTTTCA ...CAAAATAGATTAAATCTTAGTTACTCTCATGC TAATCAATATGTTGAGAGTTTTCCATCACAAATGTTATCAACAATTGCAA AATTCATTAGTTTCTTATTTGGTT...SLMYSL FNYIFDENGIIKSEFQDPTQRKRLSRGLSRRFMTIGILGLFTTPFIFFFLLINFFFEYAE ELKNRPGSLFSREWSPLARWEFRELNELPHYFQNRLNLSY

  6. Dicty_cDB: Contig-U09640-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09640-1 gap included 1368 2 219988 218635 MINUS 4 5 U09640 0 0 2 0 0 0 0 0 0 0 0 0 1 1 Show Contig...-U09640-1 Contig ID Contig-U09640-1 Contig update 2002. 9.13 Contig sequence >Contig-U09640-1 (Contig...-U09640-1Q) /CSM_Contig/Contig-U09640-1Q.Seq.d ACTGTTGGCCTACTGGNAAAAAATAGTGTAATAATAACCAACAAT...AACAACAACAACAAAAACAAAAACAAATTTTAATT AAATAAAATAATAATATAAAATATAATA Gap gap included Contig...ate 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U09640-1 (Contig-U09640-1Q) /CSM_Contig/Contig

  7. Dicty_cDB: Contig-U09720-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09720-1 gap included 1323 2 5906974 5908260 PLUS 1 2 U09720 0 0 1 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U09720-1 Contig ID Contig-U09720-1 Contig update 2002. 9.13 Contig sequence >Contig-U09720-1 (Contig-U09720-1Q) /CSM_Contig/Contig-U09720...ATNATTATTATAAAAATTT Gap gap included Contig length 1323 Chromosome number (1..6, ...QLEAEDIVKQSQLVRNTLLSILNKLFSNY NNSNETTATTTIGQDQEKLSTLKNQREIIAQSLKIXKKL*linqxll*kf ...AEMFDIDSRNNHAIENDGRLDDA LVCSVGIALAPQSIFQSWKSMSEHKREKYFEQLEAEDIVKQSQLVRNTLLSILNKLFSNY NNSNETTATTTIGQDQEKLSTLK

  8. Dicty_cDB: Contig-U09379-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09379-1 gap included 899 2 1392012 1392912 PLUS 1 2 U09379 0 0 0 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U09379-1 Contig ID Contig-U09379-1 Contig update 2002. 9.13 Contig sequence >Contig-U09379-1 (Contig...-U09379-1Q) /CSM_Contig/Contig-U09379-1Q.Seq.d AAAAATTTTTTAAACTAAAAAATAAAAAAAATAAATAAAAAAAAA...TTTAAAAATAATAATAAAAGTGAATATTATAATATTAT AATCTTTTTGGTATAATTGAAAAAGATCAATAATATATTAAAATTTCCAA AAAAAAAAA Gap gap included Contig...VSVCRAYATETATIENKTQIMGKMSGAQGAGFVLGPGIGFLLNFCNFTIG--- ---INNK******sn*finykl***f*kikqphfknlkiiikvniiil*sfwyn

  9. Dicty_cDB: Contig-U15566-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15566-1 gap included 1830 4 3730704 3729599 MINUS 4 8 U15566 0 0 1 0 1 0 0 0 2 0 0 0 0 0 Show Contig...-U15566-1 Contig ID Contig-U15566-1 Contig update 2004. 6.11 Contig sequence >Contig-U15566-1 (Contig-U15566-1Q) /CSM_Contig/Contig-U1556...CAAGATCCAA TGGAATTTTAATAATAAATAAGAATAATAAAAAAAAAAAA Gap gap included Contig length 1830 Chromosome number (1...ITLTPSEDIEKKLKEI QDENLSNSEIWFAVKSYLEDNNLKEHLYNLVFHYTMPRIDEPVTIGLDHLGNVLVSNR*c tflvvvvvytfgcriephni*qerivlqf*...asilnhirvelsqnqipilkrsfdqillphfekc iieeqqiftnekqrknflsllpisykrqdrkipltpsediekklkeiqdenlsnseiwfa vksylednnlkehlynlvfhytmpridepvtig

  10. Dicty_cDB: Contig-U04334-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U04334-1 no gap 399 4 3746420 3746021 MINUS 3 3 U04334 0 0 0 0 0 0 3 0 0 0 0 0 0 0 Show Contig...-U04334-1 Contig ID Contig-U04334-1 Contig update 2001. 8.29 Contig sequence >Contig-U04334-1 (Contig...-U04334-1Q) /CSM_Contig/Contig-U04334-1Q.Seq.d CAAAAAAAAAAAAGTAAAACAATAAATTATATAAAAAAAATAAAAAAAAT...CTAATTTCA AACAATATCAATAAAATGTTATATAATTACTATTAAAATGAAAAAAAAA Gap no gap Contig len...ce QKKKSKTINYIKKIKKMSIINTISKLSLSNSLKSNITIGNLNGTTVNNYTHNETSSKFTE FFYKII*qnkrwf*kvkelnkkkrkkdyiissfcklysiyfvfs

  11. Dicty_cDB: Contig-U10335-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U10335-1 no gap 1353 2 2769724 2768368 MINUS 3 6 U10335 0 0 2 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U10335-1 Contig ID Contig-U10335-1 Contig update 2002. 9.13 Contig sequence >Contig-U10335-1 (Contig...-U10335-1Q) /CSM_Contig/Contig-U10335-1Q.Seq.d ATTTTTTTTCTAAATATATAAAAAATAATAATAATAATAATAATATAAT...AAACATAATAAAACAAAAGATAAAAATAAAA ACA Gap no gap Contig length 1353 Chromosome numb...SSLATNNNINNNKRITIPDNH SNNPDKLLEIQLINKIFDISKAFDGKSNNLVSSFQNCTNNNNNNNNNTDNNNNNNISNNN NNNNVPTLQPLSFNNRNNLVNGNISSSSSSNSSNNNIGSSNSNNVTIG

  12. Dicty_cDB: Contig-U12399-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12399-1 gap included 1358 3 4712677 4711450 MINUS 1 2 U12399 0 1 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U12399-1 Contig ID Contig-U12399-1 Contig update 2002.12.18 Contig sequence >Contig-U12399-1 (Contig-U12399-1Q) /CSM_Contig/Contig-U1239...GAAGATGATATTAGTCTGAGGAAGATATTCTTAAAGA ATTTAACAAATGTTAACA Gap gap included Contig ...*e iekkklnyl*eqkvkyqknhqkimiq*enxmks*LQIYHXFAXLIGEPIPNNDXXX--- ---XXXRHVIWKLYEEITIGLKRTISITXKRESCKSHYLANCIMH...kkklnyl*eqkvkyqknhqkimiq*enxmks*LQIYHXFAXLIGEPIPNNDXXX--- ---XXXRHVIWKLYEEITIGLKRTISITXKRESCKSHYLANCIMHVYWRL

  13. Dicty_cDB: Contig-U11404-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U11404-1 gap included 1618 6 1729583 1727965 MINUS 11 19 U11404 0 6 1 1 0 2 ...0 1 0 0 0 0 0 0 Show Contig-U11404-1 Contig ID Contig-U11404-1 Contig update 2002.12.18 Contig sequence >Contig-U11404-1 (Contig...-U11404-1Q) /CSM_Contig/Contig-U11404-1Q.Seq.d ATTTTAAGAGTTTTAATTTTAATAACTATACTTTTAATAAA...TTTTTCTTTTGAACCAGAAAAAAAAA Gap gap included Contig length 1618 Chromosome number ...AGARMLASLATDKLSNVIYLDVSENDFGDEGVSVICDGFVGNSTIKKLILNGNFKQ SK--- ---YEKITIGLDSVFKDLILEESQAQNEASGATPIPDSPVPTRSP

  14. Dicty_cDB: Contig-U09569-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09569-1 gap included 1424 5 3658944 3660352 PLUS 8 14 U09569 0 0 8 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U09569-1 Contig ID Contig-U09569-1 Contig update 2002. 9.13 Contig sequence >Contig-U09569-1 (Contig-U09569-1Q) /CSM_Contig/Contig-U0956...TTAAAAA TAAAATAAATATAAAATAAAATAAAAATTAACAA Gap gap included Contig length 1424 Chromosome number (1..6, M) 5...NQTFQQKYYVNDQYYNYKNGGPIILYINGEGPVSSPPYSSDDGVVIYAQA LNCMIVTLEHRFYGESSPFSELTIENLQYLSHQQALEDLATFVVDFQSKLVGAGHIVTIG...YLSHQQALEDLATFVVDFQSKLVGAGHIVTIG GSYSGALSAWFRIKYPHITVGSIASLGVVHSILDFTAFDAYVSYA---

  15. Dicty_cDB: Contig-U15306-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15306-1 no gap 2452 3 3887051 3889342 PLUS 54 91 U15306 0 0 0 49 4 1 0 0 0 0 0 0 0 0 Show Contig...-U15306-1 Contig ID Contig-U15306-1 Contig update 2004. 6.11 Contig sequence >Contig-U15306-1 (Contig...-U15306-1Q) /CSM_Contig/Contig-U15306-1Q.Seq.d AAGCATAAACGGTGAATACCTCGACTCCTAAATCGATGAAGACCGTA...TTTTAGAACTTCAAAAAATAGTAC AAATTTTTTCAAATTAAGATAAAAAAAATAAAACAAAAATTAATTTAAAA CA Gap no gap Contig length 2452...*naagtgkgeegrt*hkslpywlapqvkgsvmprggqghygasrggrkhmgidfssivg qdivapisgkvvnfkgartkypmlqlypskkftefdylqmlyvhppvginmgasyqvsvg dtig

  16. Dicty_cDB: Contig-U14772-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U14772-1 no gap 665 1 1988279 1987624 MINUS 1 1 U14772 0 1 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U14772-1 Contig ID Contig-U14772-1 Contig update 2002.12.18 Contig sequence >Contig-U14772-1 (Contig...-U14772-1Q) /CSM_Contig/Contig-U14772-1Q.Seq.d AAAAACAATAACCATCGTTTTTTATTTTTATTTTCAAAATATGGATTTAA...AAATTAATGAAGAAAAAA AAGTAANNNNNNNNN Gap no gap Contig length 665 Chromosome number...DADTTISFLSSQNLSQLSIIKNLVNGKTIG DKKVIVDFYDFKKVIPTPTPIPTPTPPTKTQEESNKKIKLTNEKPKEKKP

  17. Dicty_cDB: Contig-U11141-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U11141-1 gap included 2122 2 1113359 1111236 MINUS 6 12 U11141 0 1 0 2 0 0 0... 1 0 2 0 0 0 0 Show Contig-U11141-1 Contig ID Contig-U11141-1 Contig update 2002.12.18 Contig sequence >Contig-U11141-1 (Contig...-U11141-1Q) /CSM_Contig/Contig-U11141-1Q.Seq.d AAAAAACAATCTTAAAACACACACACACTCAACACACTATCA...AAATCAAAATCAAAATCAAA ATAATAATAATTATAATAATAGCTATAATAAT Gap gap included Contig length 2122 Chromosome number ...HNYFGKVSRGIVSLSDYKYYGYLRSVHLIGYE QHEEELIKTIKSLPVGVSTLELSGHLNKIIFKEGSL--- ---DDSTIGAILNSFSSSSSRETFPRSVESLHLNI

  18. Dicty_cDB: Contig-U13202-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13202-1 no gap 1083 4 1301578 1302630 PLUS 41 45 U13202 8 0 13 0 0 2 16 0 2 0 0 0 0 0 Show Contig...-U13202-1 Contig ID Contig-U13202-1 Contig update 2002.12.18 Contig sequence >Contig-U13202-1 (Contig...-U13202-1Q) /CSM_Contig/Contig-U13202-1Q.Seq.d ACTGTTGGCCTACTGGGATTTTCTGCAGTAATAATAAAATCAAATA...TTTGTAATTTTAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Gap no gap Contig len...kvgqfirvprgaqpaqtskftlmih*gvkshffsmlqpnwpncttigpvq nqarcgsllgfwvlqnqlltvlcihnnekcsikfygygyl**nlitvvkvvmpslhg

  19. Dicty_cDB: Contig-U15062-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15062-1 no gap 1282 3 4759691 4758480 MINUS 5 6 U15062 0 0 0 0 0 0 1 0 1 1 1 0 1 0 Show Contig...-U15062-1 Contig ID Contig-U15062-1 Contig update 2004. 6.11 Contig sequence >Contig-U15062-1 (Contig...-U15062-1Q) /CSM_Contig/Contig-U15062-1Q.Seq.d CAAATATTTAAATAAATTTAACATTATAAAAACAAAAATTAATAAAGTA...TTTTCAATAGATAATAATAAAAAAAAAAAAAAAAAAA AAAAAAAAATTATTTTAAAAATAAAAAAAAAA Gap no gap Contig length 1282 Chromos...KMSHNHNSNNNKTTTTTTNDSGSAIANGINLEKILADVKECN YNLVNSITATEAIQKEKESLENELSTKGTIGDGKRIKKLQYNISLQTETLMKTLMKLDSL SITG

  20. Dicty_cDB: Contig-U09432-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09432-1 gap included 993 5 741953 740957 MINUS 1 2 U09432 0 0 0 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U09432-1 Contig ID Contig-U09432-1 Contig update 2002. 9.13 Contig sequence >Contig-U09432-1 (Contig...-U09432-1Q) /CSM_Contig/Contig-U09432-1Q.Seq.d AGGAAATATTTTAATATTTTATTTTTTTTATTTTTTTTATTTATTA...TTTTGGTGGTAAATATAGATATGAAAATAAA CAAATCCAAATTTTAGTTGAATTAAATTTCACTGATACCACTCAAAAAAA AAA Gap gap included Contig...iy*sni*SVKFGICYNYAKYHLSICNHTIYPGSDNQSLYFKLSSIFDS PTILSGYAVIYNSLDQIITNGTYNLILDEDVPTIG

  1. Dicty_cDB: Contig-U15058-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15058-1 no gap 1987 4 4423139 4424727 PLUS 2 4 U15058 0 0 0 0 0 0 0 0 1 0 1 0 0 0 Show Contig...-U15058-1 Contig ID Contig-U15058-1 Contig update 2004. 6.11 Contig sequence >Contig-U15058-1 (Contig...-U15058-1Q) /CSM_Contig/Contig-U15058-1Q.Seq.d AAAAAAGGTTACTCACAAAGTTAAAGAAATCAATGAAAGATTTACCACCC...ACTCAAGGGGGTAGGAGAATAAAATCAACCGATTATCCAGGCNTTAAG CGACCTTTTTCCCAAAAAAAAAAGATGTTCAGAAAAT Gap no gap Contig len...srx*atffpkkkdvq k own update 2004. 6.23 Homology vs CSM-cDNA Query= Contig-U15058-1 (Contig-U15058-1Q) /CSM_Contig/Contig

  2. Dicty_cDB: Contig-U14745-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U14745-1 no gap 1780 6 3063854 3065579 PLUS 2 4 U14745 1 0 1 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U14745-1 Contig ID Contig-U14745-1 Contig update 2002.12.18 Contig sequence >Contig-U14745-1 (Contig...-U14745-1Q) /CSM_Contig/Contig-U14745-1Q.Seq.d GCGTCCGGACAATTTCAATAAAACAAATTTAAAAATAAATAATTTTTAAT...AATAAAATA ATTTAAATAAAAAAATATTTATTTTATTTTAAGATTAACAAAATAAAATA ATTTAAATAAAAAAATATTTATTTTAAAGA Gap no gap Contig...k*kniyfk own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U14745-1 (Contig-U14745-1Q) /CSM_Contig/Contig

  3. Dicty_cDB: Contig-U03367-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U03367-1 no gap 323 - - - - 2 1 U03367 0 0 0 0 0 0 0 0 0 0 0 0 1 1 Show Contig-U03367-1 Contig... ID Contig-U03367-1 Contig update 2001. 8.29 Contig sequence >Contig-U03367-1 (Contig-U03367-1Q) /CSM_Contig/Contig...TTGCGGGTTGGCAGGACTGTNGGNAGGCATGGNCATCGGTATNNTTGGAG ATGCTNGTGTGAGGGCGAATGCT Gap no gap Contig length 323 Chro...HLXXGLXCGLAGLXXGMXIGXXGDAXVRANA own update 2004. 6. 9 Homology vs CSM-cDNA Query= Contig-U03367-1 (Contig...-U03367-1Q) /CSM_Contig/Contig-U03367-1Q.Seq.d (323 letters) Database: CSM 6905 sequ

  4. Dicty_cDB: Contig-U16086-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U16086-1 gap included 1018 - - - - 3 4 U16086 0 0 0 0 0 1 1 0 0 0 1 0 0 0 Show Contig-U16086-1 Contig... ID Contig-U16086-1 Contig update 2004. 6.11 Contig sequence >Contig-U16086-1 (Contig-U16086-1Q) /CSM_Contig.../Contig-U16086-1Q.Seq.d AATTTGATGAAGTAGTAGTAGAGGTAAAACATGTATCAAAACATTATAAG ATTGCAGG...ACTTGGATATAAATGAAG GTAGCTCATCAAATTTTTCAAATAATGATAATTTTAAATCGGTAGATCAA ATTACCAATGACCTTAGCCGTATTTTAT Gap gap included Contig...KSVDQI TNDLSRIL own update 2004. 6.23 Homology vs CSM-cDNA Query= Contig-U16086-1 (Contig-U16086-1Q) /CSM_Contig/Contig

  5. Dicty_cDB: Contig-U13737-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13737-1 no gap 672 6 1762420 1761754 MINUS 1 1 U13737 0 1 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U13737-1 Contig ID Contig-U13737-1 Contig update 2002.12.18 Contig sequence >Contig-U13737-1 (Contig...-U13737-1Q) /CSM_Contig/Contig-U13737-1Q.Seq.d NNNNNNNNNNAAAATTAGAAAATGGTACAATTGTTTTTAGAGATATTTCA...AGAATAGAAGGAAAATAT AGATCAATGGGGTGGCACAACA Gap no gap Contig length 672 Chromosome...gwhn own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U13737-1 (Contig-U13737-1Q) /CSM_Contig/Contig

  6. Dicty_cDB: Contig-U06307-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U06307-1 no gap 637 6 29174 29801 PLUS 4 5 U06307 4 0 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U06307-1 Contig ID Contig-U06307-1 Contig update 2002. 9.13 Contig sequence >Contig-U06307-1 (Contig...-U06307-1Q) /CSM_Contig/Contig-U06307-1Q.Seq.d CCCGCGTCCGAATGCCTCGTATTTTACACACTATGCTCCGTGTGGGTAAT TTAG...ATAGTATTTTTATTTTATT CTTTTTCTTTTAAAAATTTTTTATATTGTCAACAATATAATCAAATAAAT GTATTTAATTATCGGGTATTAAAAAAAAAAAAAAAAA Gap no gap Contig...own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U06307-1 (Contig-U06307-1Q) /CSM_Contig/Contig-U063

  7. Dicty_cDB: Contig-U15541-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15541-1 gap included 2750 - - - - 634 1127 U15541 1 129 1 375 19 0 2 32 4 69 1 0 1 0 Show Contig...-U15541-1 Contig ID Contig-U15541-1 Contig update 2004. 6.11 Contig sequence >Contig-U15541-1 (Contig...-U15541-1Q) /CSM_Contig/Contig-U15541-1Q.Seq.d ATAATAAACGGTGAATACCTCGACTCCTAAATCGATGAAGACCGTAG...AAAAAT AAAAATAAAAATAAATAAATAATCATTTCATATTAATATTTTTTTTTATT TTTAAAAAAA Gap gap included Contig...ffyf*k own update 2004. 6.23 Homology vs CSM-cDNA Query= Contig-U15541-1 (Contig-U15541-1Q) /CSM_Contig/Contig

  8. Dicty_cDB: Contig-U15828-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15828-1 gap included 1593 1 4184040 4182448 MINUS 12 19 U15828 0 0 6 0 0 0 ...0 0 2 0 4 0 0 0 Show Contig-U15828-1 Contig ID Contig-U15828-1 Contig update 2004. 6.11 Contig sequence >Contig-U15828-1 (Contig...-U15828-1Q) /CSM_Contig/Contig-U15828-1Q.Seq.d ATAAAAAAAATTAAAAAATTAAAAAAGTTATCCACCCAAGT...ACA AATATTATAACTGGTACTGCTACTGTTTCAATCCCTCAAAAAAATTTAAT TTATATTTTACCAAATTCAAATACAATTAATCAATCAACAATTACAATTA CAA Gap gap included Contig...SFNPANSDFSFSYNINTTITQPTQIYLNQDIYYPNGFTTNIITGTATVSIPQ KNLIYILPNSNTINQSTITIT own update 2004. 6.23 Homology vs CSM-cDNA Query= Contig

  9. Dicty_cDB: Contig-U01750-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U01750-1 no gap 811 3 3337090 3336279 MINUS 2 2 U01750 1 0 0 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U01750-1 Contig ID Contig-U01750-1 Contig update 2001. 8.29 Contig sequence >Contig-U01750-1 (Contig...-U01750-1Q) /CSM_Contig/Contig-U01750-1Q.Seq.d GGAAGTTGTAATAATAAAAAAATAAAAATAAAAATAAAAAAATAAAAAAA...GAATACCAAGGTGAAAGAATTTTTCAAAAACTTCCTCAA ATCAACACAAATTTCGAAAAATTAACAATTTGGGAAAAGAAAATCGTTTC AAATCTTTATT Gap no gap Contig...crncnciwsktl*tywiyskiinpi**i*ipr *knfsktssnqhkfrkinnlgkenrfksl own update 2004. 6. 7 Homology vs CSM-cDNA Query= Contig

  10. Dicty_cDB: Contig-U07021-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U07021-1 no gap 601 2 3862699 3862098 MINUS 1 2 U07021 1 0 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U07021-1 Contig ID Contig-U07021-1 Contig update 2001. 8.30 Contig sequence >Contig-U07021-1 (Contig...-U07021-1Q) /CSM_Contig/Contig-U07021-1Q.Seq.d AAAAAAACAAAATGAATAAATTTAATATTACATCATTATTTATTATTTTA...TTTAATATATTCAGAAGGAAATTC TTATTTACAACAAAATTTCCCATTACTTTCTTANTTAAANTCCGTTAAAA T Gap no gap Contig length 601 C...QACCRTTQLFINYADNSFLDSAGFSPFGKVISGFNNTLNFYGGYGEEPDQSLIYSE GNSYLQQNFPLLSXLXSVK own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig

  11. Dicty_cDB: Contig-U15005-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15005-1 no gap 2023 1 1509217 1507616 MINUS 2 4 U15005 0 0 0 0 1 0 1 0 0 0 0 0 0 0 Show Contig...-U15005-1 Contig ID Contig-U15005-1 Contig update 2004. 6.11 Contig sequence >Contig-U15005-1 (Contig...-U15005-1Q) /CSM_Contig/Contig-U15005-1Q.Seq.d AATTTTCTTTTCTTTTTAAAACTTAAGTACCATATGGCAGAATATACAC...ATAATAACGATATTAA Gap no gap Contig length 2023 Chromosome number (1..6, M) 1 Chro...HMAEYTHYFIQYNLTDIFYEDVNIEKYSCSICYESVYKKEIYQCKEIHWF CKTCWAESLFKKKECMICRCIVKSISELSRNRFIEQDFLNIKVNCPNSFKYIDENKNNNN KIKDLENGCKDIITIG

  12. Dicty_cDB: Contig-U04768-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U04768-1 no gap 762 6 2607190 2606476 MINUS 3 3 U04768 1 0 0 0 0 0 2 0 0 0 0 0 0 0 Show Contig...-U04768-1 Contig ID Contig-U04768-1 Contig update 2001. 8.29 Contig sequence >Contig-U04768-1 (Contig...-U04768-1Q) /CSM_Contig/Contig-U04768-1Q.Seq.d AAAGTCTTATTTGTTTAAAAAAAAAAAAAAAAAATAAAAAACTTTATTCT...AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAA Gap no gap Contig length 762 Chromosome number (1..6, M...lknf*KMVMMHDEYISPTKLQFGFMIAVAFLG TIGVMGFCQNVFDILLGVISILSIYIGMRGVWKRKKRWLFVFMWLMMGMGFLHLVSFAVV VILHHKNPTKNTVF

  13. Dicty_cDB: Contig-U15036-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15036-1 no gap 3102 - - - - 16 24 U15036 0 5 1 2 0 1 1 2 3 1 0 0 0 0 Show Contig-U15036-1 Contig... ID Contig-U15036-1 Contig update 2004. 6.11 Contig sequence >Contig-U15036-1 (Contig-U15036-1Q) /CSM_Contig.../Contig-U15036-1Q.Seq.d ATCTTTTTAAAAAAAAAAAAAATAAAACAAATAAAGAAAGAAATTAAATA AATATTAATAAT...AATTTAAAATTAATTTTTAG AT Gap no gap Contig length 3102 Chromosome number (1..6, M) - Chromosome length - Star...RKKQTDAVAEIPVD NPTSTSTTTTTTTTSNATSILSAIHTSTINSNTSSHNNNQQQQQQQQTILPTQPTIINTP TPVRSSVSRSQSPLPSGNGSSIISQEKTPLSTFVLSTCRPSALVLPPGSTIG

  14. Dicty_cDB: Contig-U13455-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13455-1 no gap 750 2 945431 946181 PLUS 2 2 U13455 0 0 0 0 0 0 1 0 0 1 0 0 0 0 Show Contig...-U13455-1 Contig ID Contig-U13455-1 Contig update 2002.12.18 Contig sequence >Contig-U13455-1 (Contig...-U13455-1Q) /CSM_Contig/Contig-U13455-1Q.Seq.d TAATTCCAACAACATCAACAAATTCAACAACAATTACAAATGCAACAACA TA...CAATAATAATAATAATAACAATAACAATAATAATAA Gap no gap Contig length 750 Chromosome number (1..6, M) 2 Chromosome l...KMLEYIQKNPSATRPSCIQVVQQPSSKVVWKNRRLDTPFKVKVDLKAASAMA GTNLTTASVITIGIVTDHKGKLQIDSVENFTEAFNGQGLAVFQGLKMTKGTWGKE

  15. Dicty_cDB: Contig-U14400-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U14400-1 no gap 1939 4 4053811 4055750 PLUS 5 7 U14400 0 0 2 0 0 1 0 0 1 1 0 0 0 0 Show Contig...-U14400-1 Contig ID Contig-U14400-1 Contig update 2002.12.18 Contig sequence >Contig-U14400-1 (Contig...-U14400-1Q) /CSM_Contig/Contig-U14400-1Q.Seq.d CATTACCAATAAATTTATCTGCTTCAACACCTATACCAATGACATCACCA...AGGTTTATAAAATATATTGAATCAATTTTTGATTAAA Gap no gap Contig length 1939 Chromosome number (1..6, M) 4 Chromosome...HQQQQSKTVTSSTTSTETTTTVESSTTSTTITTSTSTPIPTTITTTPTTPI NSDNSWTFTSFSPKVFKEIRRYYGVDEEFLKSQENSSGIVKFLEVQTIGRSGSFFY

  16. Dicty_cDB: Contig-U10709-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U10709-1 gap included 1228 4 757921 759149 PLUS 2 3 U10709 0 0 0 1 1 0 0 0 0 0 0 0 0 0 Show Contig...-U10709-1 Contig ID Contig-U10709-1 Contig update 2002.12.18 Contig sequence >Contig-U10709-1 (Contig...-U10709-1Q) /CSM_Contig/Contig-U10709-1Q.Seq.d ATTAGTAACACAGACATTGGTAACACGAATTTATTACCACCATCAC...ATGTTTAGGTGATAATACTCATAGTCAA Gap gap included Contig length 1228 Chromosome number (1..6, M) 4 Chromosome le...LDIFLIQIGAAIMGSNQFIQHAINIYNLEDWFEIEPFNG SLNKSTEGTPTTTSSQPPSTPSKQTSLRNSAGTVPTTPSQSSSTIVPTLDTIGETTTTTT TTATTTT

  17. Dicty_cDB: Contig-U10837-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U10837-1 gap included 1996 2 5280203 5282199 PLUS 8 9 U10837 0 3 0 3 1 0 0 1 0 0 0 0 0 0 Show Contig...-U10837-1 Contig ID Contig-U10837-1 Contig update 2002.12.18 Contig sequence >Contig-U10837-1 (Contig-U10837-1Q) /CSM_Contig/Contig-U10837...TCNT Gap gap included Contig length 1996 Chromosome number (1..6, M) 2 Chromosome...YSSKGYFKHLDSFLSEISVP LCESVSKSSTLVFSLLFNMLEYSTADYRYPILKILTALVKCGVNPAETKSSRVPEWFDTV TQFLNDHKTPHYIVSQAIRFIEITSGNSPTSLITIDNASLKPSKNTIG...SSRVPEWFDTV TQFLNDHKTPHYIVSQAIRFIEITSGNSPTSLITIDNASLKPSKNTIGTKKFSNKVDRGT LLAGNYFNKVLVDTVPGVRSSVNSLTKSIYSTTQI

  18. Dicty_cDB: Contig-U12765-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12765-1 no gap 1256 6 1467819 1466563 MINUS 3 3 U12765 0 0 0 2 0 0 0 0 0 0 1 0 0 0 Show Contig...-U12765-1 Contig ID Contig-U12765-1 Contig update 2002.12.18 Contig sequence >Contig-U12765-1 (Contig...-U12765-1Q) /CSM_Contig/Contig-U12765-1Q.Seq.d CAAAAAGGAAACACTAGTCCAGTTAGAACCCCAAATACTACTACTACTA...TATCGATTGTTCAAAGGTTTCAATGGTTGATACTAAT TTCTTA Gap no gap Contig length 1256 Chromosome number (1..6, M) 6 Chr...EYQEDLTPIFEPIFLDLIKIL STTTLTGNVFPYYKVFSRLVQFKAVSDLVGTLQCWNSPNFNGKEMERNTILGSLFSPSSA SDDGSTIKQYFSNASTMNKNTIGDA

  19. Dicty_cDB: Contig-U09480-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09480-1 gap included 705 5 4277527 4276817 MINUS 1 2 U09480 0 0 0 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U09480-1 Contig ID Contig-U09480-1 Contig update 2002. 9.13 Contig sequence >Contig-U09480-1 (Contig-U09480-1Q) /CSM_Contig/Contig-U09480...AAAAAAAAAA Gap gap included Contig length 705 Chromosome number (1..6, M) 5 Chromosome length 5062330 Start ...**********imaeinienpfhvntkidvntfvnqirgipngsrcdftnsvvkhf sslgynvfvchpnhavtgpyaklhcefrntkfstig...srcdftnsvvkhf sslgynvfvchpnhavtgpyaklhcefrntkfstigydvyiiargrkvtatnfgdggydn wasggh

  20. Dicty_cDB: Contig-U09345-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09345-1 gap included 1216 4 3361857 3360637 MINUS 4 5 U09345 1 0 1 0 0 0 2 0 0 0 0 0 0 0 Show Contig...-U09345-1 Contig ID Contig-U09345-1 Contig update 2002. 9.13 Contig sequence >Contig-U09345-1 (Contig-U09345-1Q) /CSM_Contig/Contig-U0934...AATGGTATTTTAAAAATAA Gap gap included Contig length 1216 Chromosome number (1..6, M) 4 Chromosome length 5430...ALFTSSNPKYGCSGCVQLKNQIESFSLSYEPYL NSAGFLEKPIFIVILEVDYNMEVFQTIGLNTIPHLLFIPSGSKPITQKGYAYTGFEQTSS QSISDFIYSHSKI...LLALFTSSNPKYGCSGCVQLKNQIESFSLSYEPYL NSAGFLEKPIFIVILEVDYNMEVFQTIGLNTIPHLLFIPSGSKPI

  1. Dicty_cDB: Contig-U15323-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15323-1 no gap 1230 2 3760829 3759661 MINUS 76 108 U15323 2 0 21 0 9 4 0 0 22 4 13 0 1 0 Show Contig...-U15323-1 Contig ID Contig-U15323-1 Contig update 2004. 6.11 Contig sequence >Contig-U15323-1 (Contig-U15323-1Q) /CSM_Contig/Contig-U1532...TAAAATTTAAGCAATCATTCCAT Gap no gap Contig length 1230 Chromosome number (1..6, M) 2 Chromosome length 846757...VGLLVFFNILYCTPLYYILFFFKMNSKFADELIATAKAIVAPGKGILAADESTNTIGAR FKKINLENNEENRRAYRELLIGTGNGVNEFIGGIILYEETLYQKMADG...MNSKFADELIATAKAIVAPGKGILAADESTNTIGAR FKKINLENNEENRRAYRELLIGTGNGVNEFIGGIILYEETLYQK

  2. Dicty_cDB: Contig-U14236-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U14236-1 no gap 660 2 5626866 5627517 PLUS 1 1 U14236 0 0 0 0 0 0 0 0 0 1 0 0 0 0 Show Contig...-U14236-1 Contig ID Contig-U14236-1 Contig update 2002.12.18 Contig sequence >Contig-U14236-1 (Contig...-U14236-1Q) /CSM_Contig/Contig-U14236-1Q.Seq.d NNNNNNNNNNGAAAATCAAAAATTAAAAAGTAACATTACTCTATTATATG ...CAATCACTCCAATTAAA CCATAGTTTT Gap no gap Contig length 660 Chromosome number (1..6...MGSEKSPFNLKQYPSLVKIDDVS QCPKYKCLKRKSLNEWTIGLNIPAFCRESRYDCSLCYKYIECSFSDEF*tnlsalfv

  3. Dicty_cDB: Contig-U13065-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13065-1 no gap 718 1 3561021 3561729 PLUS 1 1 U13065 0 0 0 0 0 0 0 0 0 1 0 0 0 0 Show Contig...-U13065-1 Contig ID Contig-U13065-1 Contig update 2002.12.18 Contig sequence >Contig-U13065-1 (Contig...-U13065-1Q) /CSM_Contig/Contig-U13065-1Q.Seq.d NNNNNNNNNNCAATCAAAGCAATCAATGGTAAATTAACTTTGTTACCATT ...TGATTCAACTCTCTCTG TTTCAAATTTACAACTTGCTTTAGATGAATCCTTTGAAGTTGATTTTGTA TTATATTAAAAATTATCA Gap no gap Contig...kny own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U13065-1 (Contig-U13065-1Q) /CSM_Contig

  4. Dicty_cDB: Contig-U04432-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U04432-1 no gap 600 1 1520578 1521098 PLUS 1 1 U04432 0 0 0 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U04432-1 Contig ID Contig-U04432-1 Contig update 2001. 8.29 Contig sequence >Contig-U04432-1 (Contig...-U04432-1Q) /CSM_Contig/Contig-U04432-1Q.Seq.d AATTATAATCAAAACAAATTAATAAAAAAAATGATTAATAGTTTTGTCTC ...TCAACAATATGAAATTGCAAGAT TAAATGGTTATGATAATGCCCATAATTTACCAAGAGATATTAGTCAAATA Gap no gap Contig length 600 Chro...ni**fkgrnsnknyfsrymgtiessti*n ckikwl**cp*ftkry*sn own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U04432-1 (Contig

  5. Dicty_cDB: Contig-U07545-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U07545-1 no gap 439 3 4955441 4955098 MINUS 1 1 U07545 0 0 0 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U07545-1 Contig ID Contig-U07545-1 Contig update 2002. 5. 9 Contig sequence >Contig-U07545-1 (Contig...-U07545-1Q) /CSM_Contig/Contig-U07545-1Q.Seq.d ATATGAAATACTTAATACTTTTAATTTTCCTTTTAATAAATTCAACTTTT...ATGTTTCAGAGTCTGGTTG Gap no gap Contig length 439 Chromosome number (1..6, M) 3 Chromosome length 6358359 Sta...e MKYLILLIFLLINSTFGNIQFSKYISNSGNDNNSCGSFTSPCKTIGYSIQQIKSYEYNQY SIEILLDSGNYYSQNPINLYGLNISISAQNSNDLVQFLVPNINGT

  6. Dicty_cDB: Contig-U15359-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15359-1 no gap 1420 6 1334613 1333192 MINUS 3 3 U15359 0 1 0 0 1 0 0 0 0 1 0 0 0 0 Show Contig...-U15359-1 Contig ID Contig-U15359-1 Contig update 2004. 6.11 Contig sequence >Contig-U15359-1 (Contig...-U15359-1Q) /CSM_Contig/Contig-U15359-1Q.Seq.d TATAGCATCATTTGCAAAGTTTAGTTTAAAGAAAAAAGAGAAAGCGGAA...A AAAAAAACTGGAAAAATTAA Gap no gap Contig length 1420 Chromosome number (1..6, M) 6 Chromosome length 3595308...SSGF DEPSLAVMYVDRALKGASAVQTIGRLSRVSKGKNACYIVDFVNTRREISDAFGQYWRETC LKGETRKTVLELKLNRVLGKLSAIEPLANGRLEESVEYILRD

  7. Dicty_cDB: Contig-U04729-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U04729-1 no gap 251 5 1037629 1037880 PLUS 1 1 U04729 0 0 0 0 0 0 1 0 0 0 0 0 0 0 Show Contig...-U04729-1 Contig ID Contig-U04729-1 Contig update 2001. 8.29 Contig sequence >Contig-U04729-1 (Contig...-U04729-1Q) /CSM_Contig/Contig-U04729-1Q.Seq.d TGGATTTATAACAGAGGTTATTGTAGGTGGTAAAACTTTTAGAGGAATCG ...CATTATCTAATGGG T Gap no gap Contig length 251 Chromosome number (1..6, M) 5 Chromosome length 5062330 Start ...ITEVIVGGKTFRGIVFEDLKSSNQTNNHSQNFSPNQSGTNLNNSNSNIPSSKKIKDKN ISPSSFLPTIGSTTSTSNPLSNG Translated Amino Acid seq

  8. Dicty_cDB: Contig-U06929-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U06929-1 no gap 726 5 4252576 4251850 MINUS 1 1 U06929 1 0 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U06929-1 Contig ID Contig-U06929-1 Contig update 2001. 8.30 Contig sequence >Contig-U06929-1 (Contig...-U06929-1Q) /CSM_Contig/Contig-U06929-1Q.Seq.d AGTTCATTCATTTAGTCGTATGATAGTATCACCATTTATAAATCCAAAAT...TAAATTAAATAAATA Gap no gap Contig length 726 Chromosome number (1..6, M) 5 Chromosome length 5062330 Start p...PSAISNNSNNS NNNDDNRPPILGLPFLFDYKNRITRGSRFFETIHYKIVHVTSATEFGIRRISKLYGTKWQ LEIGLKHQITQSGALQCLFTHTIGQTTIFGLSFGF

  9. Dicty_cDB: Contig-U15525-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15525-1 gap included 3361 6 202399 204109 PLUS 34 57 U15525 0 0 7 0 7 6 0 0 4 3 7 0 0 0 Show Contig...-U15525-1 Contig ID Contig-U15525-1 Contig update 2004. 6.11 Contig sequence >Contig-U15525-1 (Contig-U15525-1Q) /CSM_Contig/Contig-U15525...ATTTAATTAAATAATAATA Gap gap included Contig length 3361 Chromosome number (1..6, M) 6 Chromosome length 3595...TEATCLILSVD ETVQNNQAEQAQAGPQINNQTRQALSRVEVFKQ--- ---LDTIGIKKESGGGLGDSQFIAGAAFKRTFFYAGFEQQPKHIKNPKVLCLNIELELK...lslnsiqslpqlkqlv*ssll mkpfkiiklnklklvhklitkhvklyhg*rcss--- ---LDTIGIKKESGGGLGDSQFIAGAAFKRTFFYAGFEQQPKHIKNPKV

  10. Dicty_cDB: Contig-U11883-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U11883-1 gap included 599 2 1457179 1457762 PLUS 1 2 U11883 0 0 0 0 0 0 0 0 0 1 0 0 0 0 Show Contig...-U11883-1 Contig ID Contig-U11883-1 Contig update 2002.12.18 Contig sequence >Contig-U11883-1 (Contig...-U11883-1Q) /CSM_Contig/Contig-U11883-1Q.Seq.d TACAAAATTTATATATATATATAATATTTTTAAATAATTATATTT...ATTTAGATGTATTTGGTATTCAAACATTA ACCGAACAACAAGCCTCTACAAAATTATTAACTTTTGTCATTTCAAAATC AGGTGAAAA Gap gap included Contig...ffkixn*kikkgfhvkxksflwfkxxx--- ---xxxx******************yprkyiniti*rn*kdil*ii*rne*rergtksc* nifs*kestpl*fnsxfktniilfstvfnttnvstig

  11. Dicty_cDB: Contig-U13680-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13680-1 no gap 822 5 2371965 2372786 PLUS 2 2 U13680 0 2 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U13680-1 Contig ID Contig-U13680-1 Contig update 2002.12.18 Contig sequence >Contig-U13680-1 (Contig...-U13680-1Q) /CSM_Contig/Contig-U13680-1Q.Seq.d AAAAAGATTCTCAAGGAATTCACCGTGTTTATACTTCTTATGGTAGAACT ...GGGAATCAATGATTTAAATATCTACCAAATTCAAAAGG AAGGTGATGTCGAGTCACATTCATTACAATCACCATCGAAATTATTATTT CATGGTTCAAGAGCATCGAATT Gap no gap Contig...**sirtinkdig*kslc*snhsidk*ffsynh*twy*ntigclingt s*kw*tcfeknqylfewynqsiisrvgeikfrifhnyst*tw*rfrcclkeyh*kfgsie

  12. Dicty_cDB: Contig-U15718-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15718-1 gap included 3735 6 2645446 2642451 MINUS 153 276 U15718 0 0 0 118 ...1 0 0 20 3 10 1 0 0 0 Show Contig-U15718-1 Contig ID Contig-U15718-1 Contig update 2004. 6.11 Contig sequence >Contig...-U15718-1 (Contig-U15718-1Q) /CSM_Contig/Contig-U15718-1Q.Seq.d AAATTATTAAATTGTTTATTAATTTTTTTTTTTAC...CCTG Gap gap included Contig length 3735 Chromosome number (1..6, M) 6 Chromosome length 3595308 Start point...ptqtppptqtpt nhsigvnecdccpegqycllifghercfiandggdgipeetigcpgvttgtptstdggtg hytesgtgnphlcdrhhcrsgmechvingipecl

  13. Dicty_cDB: Contig-U15573-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15573-1 gap included 2005 4 5020093 5018210 MINUS 13 13 U15573 0 5 0 1 1 0 ...0 0 0 1 1 0 2 2 Show Contig-U15573-1 Contig ID Contig-U15573-1 Contig update 2004. 6.11 Contig sequence >Contig-U15573-1 (Contig...-U15573-1Q) /CSM_Contig/Contig-U15573-1Q.Seq.d AGTCTTGAGCTTTTATTGGGTCAACCATTGGGTGAATATAC... AGCNTTAACNGGNAA Gap gap included Contig length 2005 Chromosome number (1..6, M) ...xxlfrsnxslxxxxxxsxnxx Frame C: s*afigstig*iyiylkrfhlfl*skryyqskw*fkifpilkqttiiyen

  14. Dicty_cDB: Contig-U01204-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U01204-1 gap included 918 2 1928287 1927368 MINUS 2 3 U01204 0 0 0 0 0 2 0 0 0 0 0 0 0 0 Show Contig...-U01204-1 Contig ID Contig-U01204-1 Contig update 2001. 8.29 Contig sequence >Contig-U01204-1 (Contig-U01204-1Q) /CSM_Contig/Contig-U01204...AAAAATAATAA Gap gap included Contig length 918 Chromosome number (1..6, M) 2 Chromosome length 8467578 Start...LAWEVFWVGTPLFVLMASAFNQIHWALAWVLMVIILQSGFMN--- ---QHSHTIGNETIIIVMDSWVVDQIPDQVSWMEQ...fgwvlhyly*whqhsikfighwhgy*w*sfynlvl*--- ---QHSHTIGNETIIIVMDSWVVDQIPDQVSWMEQVLSDNN

  15. Dicty_cDB: Contig-U12043-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12043-1 gap included 1898 6 2694437 2692539 MINUS 7 13 U12043 0 6 0 0 0 0 0... 1 0 0 0 0 0 0 Show Contig-U12043-1 Contig ID Contig-U12043-1 Contig update 2002.12.18 Contig sequence >Contig-U12043-1 (Contig...-U12043-1Q) /CSM_Contig/Contig-U12043-1Q.Seq.d GAAACCATTCGTTTAAAGAAATGAAATATTTATATATATTAA...ATAAA AATAAATT Gap gap included Contig length 1898 Chromosome number (1..6, M) 6 Chromosome length 3595308 S...VPDIVSGILASKYASITLLNSGEM DLTNGITIGLLENSTSDQLFQINPILNTSLTNILVGQRFSIPFEISIKDSTISNQL

  16. Dicty_cDB: Contig-U06822-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U06822-1 no gap 468 3 438742 439211 PLUS 1 1 U06822 1 0 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U06822-1 Contig ID Contig-U06822-1 Contig update 2001. 8.30 Contig sequence >Contig-U06822-1 (Contig...-U06822-1Q) /CSM_Contig/Contig-U06822-1Q.Seq.d ATATTATTCTATTCACTCGTAATAATACATATAAATTGATATCAATCAGA AA...TGCTATTAAGACTTTGGAGCAAAAAAC TAACAAATCAATTCAAAA Gap no gap Contig length 468 Chromosome number (1..6, M) 3 Ch...*mmlklkeikllvllrlwskkltnqfk own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U06822-1 (Contig-U06822-1Q) /CSM_Contig/Contig

  17. Dicty_cDB: Contig-U13254-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13254-1 no gap 575 5 203798 203233 MINUS 1 1 U13254 0 0 0 0 0 0 0 1 0 0 0 0 0 0 Show Contig...-U13254-1 Contig ID Contig-U13254-1 Contig update 2002.12.18 Contig sequence >Contig-U13254-1 (Contig...-U13254-1Q) /CSM_Contig/Contig-U13254-1Q.Seq.d AAATAATTTATTTAATTTTAAAATTAATAGATAAAAAGATGGAAATGATA A...CATTTTAACATTATTGGATAAT GTCAATGATTGGCCAANNNNNNNNN Gap no gap Contig length 575 Chromosome number (1..6, M) 5 ...2004. 6.10 Homology vs CSM-cDNA Query= Contig-U13254-1 (Contig-U13254-1Q) /CSM_Contig/Contig-U13254-1Q.Seq.d

  18. Dicty_cDB: Contig-U13891-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13891-1 no gap 1355 6 799802 798446 MINUS 4 4 U13891 0 0 0 0 1 1 1 0 0 0 1 0 0 0 Show Contig...-U13891-1 Contig ID Contig-U13891-1 Contig update 2002.12.18 Contig sequence >Contig-U13891-1 (Contig...-U13891-1Q) /CSM_Contig/Contig-U13891-1Q.Seq.d TTTTAAAATATTTCAAAATTAGCGAGCACGCATTCGCATATAAATATATT ...ACAAATAAAAAAAAAAAATAAAAAAAATA ATTTA Gap no gap Contig length 1355 Chromosome numb...own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U13891-1 (Contig-U13891-1Q) /CSM_Contig/Contig-U138

  19. Dicty_cDB: Contig-U16093-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U16093-1 gap included 1020 2 4899973 4899063 MINUS 29 31 U16093 7 0 0 0 0 2 ...18 0 0 0 0 0 2 0 Show Contig-U16093-1 Contig ID Contig-U16093-1 Contig update 2004. 6.11 Contig sequence >Contig-U16093-1 (Contig...-U16093-1Q) /CSM_Contig/Contig-U16093-1Q.Seq.d TTTTTTTTTTTTTTTTTAATTTTTTTTTTTCATAAAACTT...AAAATTAAATT Gap gap included Contig length 1020 Chromosome number (1..6, M) 2 Chr...pdate 2004. 6.23 Homology vs CSM-cDNA Query= Contig-U16093-1 (Contig-U16093-1Q) /CSM_Contig/Contig-U16093-1Q

  20. Dicty_cDB: Contig-U06384-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U06384-1 no gap 660 5 3008439 3007779 MINUS 2 2 U06384 2 0 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U06384-1 Contig ID Contig-U06384-1 Contig update 2001. 8.30 Contig sequence >Contig-U06384-1 (Contig...-U06384-1Q) /CSM_Contig/Contig-U06384-1Q.Seq.d TGAAAAAATTAGAGACAACAAGTGGATCAGCACGTAAAGTATGGCGTTTA...AAATAAAAATTAATTTCC AAAAATAAAA Gap no gap Contig length 660 Chromosome number (1.....own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U06384-1 (Contig-U06384-1Q) /CSM_Contig/Contig-U063

  1. Dicty_cDB: Contig-U12545-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12545-1 gap included 1165 3 3275272 3276395 PLUS 1 2 U12545 0 1 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U12545-1 Contig ID Contig-U12545-1 Contig update 2002.12.18 Contig sequence >Contig-U12545-1 (Contig-U12545-1Q) /CSM_Contig/Contig-U12545...CGTTCTAAATCACTCATTAAAAGATTAAAAATTAAANAAGGTAATATC TCACGACNGCTNNCTCATACACACN Gap gap included Contig length 11...vliknlskrkerkis*klyqlkriqlsl vknwlklvlnhslkd*klxkvishdxxliht own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig...-U12545-1 (Contig-U12545-1Q) /CSM_Contig/Contig-U12545-1Q.Seq.d (1175 letters) Database: CSM 6905 s

  2. Dicty_cDB: Contig-U10823-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U10823-1 gap included 1750 1 3559501 3561234 PLUS 85 124 U10823 0 5 0 30 1 0... 0 20 0 29 0 0 0 0 Show Contig-U10823-1 Contig ID Contig-U10823-1 Contig update 2002.12.18 Contig sequence >Contig-U10823-1 (Contig...-U10823-1Q) /CSM_Contig/Contig-U10823-1Q.Seq.d ACTGTTGGCCTACTGGTATTTTTGGTAGTGTGTTAAAA...CAACAAATAAAATTAAAATTA GTTATATTTTTTTTAAATTAAAAAAAAAAATAAAAAAAATAAATTATTTA TTAAATTTTT Gap gap included Contig ...4. 6.10 Homology vs CSM-cDNA Query= Contig-U10823-1 (Contig-U10823-1Q) /CSM_Contig/Contig-U10823-1Q.Seq.d (1

  3. Dicty_cDB: Contig-U13894-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U13894-1 no gap 1550 2 2081463 2079913 MINUS 30 31 U13894 1 0 15 0 9 1 1 0 1 1 1 0 0 0 Show Contig...-U13894-1 Contig ID Contig-U13894-1 Contig update 2002.12.18 Contig sequence >Contig-U13894-1 (Contig...-U13894-1Q) /CSM_Contig/Contig-U13894-1Q.Seq.d CTTTTTGATTGTATAATTGAAAAAAAAAAAAAAAAAAAAAAAAAAA...TAAATTAAATAATTAAAAAAAACAAAAAAATTAAGTGAAAATCAAAAAA Gap no gap Contig length 1550 Chromosome number (1..6, M) ...V*kkkkikk*k*sk*fklnn*kkqkn*vkikk own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U13894-1 (Contig

  4. Dicty_cDB: Contig-U15462-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U15462-1 no gap 546 4 3384206 3383661 MINUS 2 2 U15462 0 0 2 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U15462-1 Contig ID Contig-U15462-1 Contig update 2004. 6.11 Contig sequence >Contig-U15462-1 (Contig...-U15462-1Q) /CSM_Contig/Contig-U15462-1Q.Seq.d CTTTAGATTGGGGNTCAAGAAAAATATTGAAGTATTTGGTGGTGATAAGA...ATTCGATTCACTATCTTATA Gap no gap Contig length 546 Chromosome number (1..6, M) 4 Chromosome length 5430582 St...VMKLGFEVKDLITNDPKCDLFDSLS Y own update 2004. 6.23 Homology vs CSM-cDNA Query= Contig-U15462-1 (Contig

  5. Dicty_cDB: Contig-U08861-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U08861-1 gap included 1295 5 2877914 2879217 PLUS 1 2 U08861 0 0 0 0 1 0 0 0 0 0 0 0 0 0 Show Contig...-U08861-1 Contig ID Contig-U08861-1 Contig update 2002. 9.13 Contig sequence >Contig-U08861-1 (Contig-U08861-1Q) /CSM_Contig/Contig-U08861...CACATTATAAAGTACCAAATAAGTTATTAATTTTAGAAAATA AATTCCAAAGAATGCAATGTCTAAAGTTAATAAAAAAGAATACTAAAATA TTTTC Gap gap included Contig...k**iwsryccnhcl*kkqkttnef*r i*nql*tkistl*stk*vinfrk*ipknamskvnkkey*nif own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig...-U08861-1 (Contig-U08861-1Q) /CSM_Contig/Contig-U08861-1Q.Seq.d (1305 letters) Database: C

  6. Dicty_cDB: Contig-U06829-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U06829-1 no gap 449 5 4394444 4394893 PLUS 1 1 U06829 1 0 0 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U06829-1 Contig ID Contig-U06829-1 Contig update 2001. 8.30 Contig sequence >Contig-U06829-1 (Contig...-U06829-1Q) /CSM_Contig/Contig-U06829-1Q.Seq.d GTAAAAGAATGTAATGAAAATGAAAAAATTAATTTTATAATAAAATTATT ...ATGATTTAGAATTGGTACAATTAGTTTA Gap no gap Contig length 449 Chromosome number (1..6, M) 5 Chromosome length 50...04. 6.10 Homology vs CSM-cDNA Query= Contig-U06829-1 (Contig-U06829-1Q) /CSM_Contig/Contig-U06829-1Q.Seq.d (

  7. Dicty_cDB: Contig-U12073-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12073-1 gap included 912 2 2118980 2119867 PLUS 4 5 U12073 0 0 0 2 0 0 0 1 0 1 0 0 0 0 Show Contig...-U12073-1 Contig ID Contig-U12073-1 Contig update 2002.12.18 Contig sequence >Contig-U12073-1 (Contig...-U12073-1Q) /CSM_Contig/Contig-U12073-1Q.Seq.d CTGTTGGCCTACTGGNAATTGAAACAATTGTTTCAGCAAATATTA...AAGA Gap gap included Contig length 912 Chromosome number (1..6, M) 2 Chromosome length 8467578 Start point ...GPXSXDY*r own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U12073-1 (Contig-U12073-1Q) /CSM_Contig/Contig

  8. Dicty_cDB: Contig-U09615-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U09615-1 gap included 1134 3 4459395 4458259 MINUS 1 2 U09615 0 0 1 0 0 0 0 0 0 0 0 0 0 0 Show Contig...-U09615-1 Contig ID Contig-U09615-1 Contig update 2002. 9.13 Contig sequence >Contig-U09615-1 (Contig-U09615-1Q) /CSM_Contig/Contig-U0961...TGCAAGATTAGAAAGATTAGAAAAAGATGCTATGCTAAAAATA Gap gap included Contig length 1134 Chromosome number (1..6, M) ...*wcnlyfrcre*emgkcn iefhiintrfkiwphrcidtighnvgicw**fnfecsfisleiqyrv**mgirfkyw*ww s*c*irpyfnnhafqyydyiwwskfwh*...4. 6.10 Homology vs CSM-cDNA Query= Contig-U09615-1 (Contig-U09615-1Q) /CSM_Contig/Contig-U09615-1Q.Seq.d (1

  9. Dicty_cDB: Contig-U12682-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available Contig-U12682-1 no gap 1408 4 4961739 4963050 PLUS 47 48 U12682 0 0 0 5 0 0 2 30 0 10 0 0 0 0 Show Contig...-U12682-1 Contig ID Contig-U12682-1 Contig update 2002.12.18 Contig sequence >Contig-U12682-1 (Contig...-U12682-1Q) /CSM_Contig/Contig-U12682-1Q.Seq.d AAACACATCATCCCGTTCGATCTGATAAGTAAATCGACCTCAGGCC...ATGA AACTACTG Gap no gap Contig length 1408 Chromosome number (1..6, M) 4 Chromosome length 5430582 Start po... kwniikwysyinwykswyn**fihsiklqwsy*qcke*si*yiir*ny own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig

  10. Genome-wide SNP identification by high-throughput sequencing and selective mapping allows sequence assembly positioning using a framework genetic linkage map

    Directory of Open Access Journals (Sweden)

    Xu Xiangming

    2010-12-01

    Full Text Available Abstract. Background: Determining the position and order of contigs and scaffolds from a genome assembly within an organism's genome remains a technical challenge in a majority of sequencing projects. In order to exploit contemporary technologies for DNA sequencing, we developed a strategy for whole genome single nucleotide polymorphism sequencing allowing the positioning of sequence contigs onto a linkage map using the bin mapping method. Results: The strategy was tested on a draft genome of the fungal pathogen Venturia inaequalis, the causal agent of apple scab, and further validated using sequence contigs derived from the diploid plant genome Fragaria vesca. Using our novel method we were able to anchor 70% and 92% of sequence assemblies for V. inaequalis and F. vesca, respectively, to genetic linkage maps. Conclusions: We demonstrated the utility of this approach by accurately determining the bin map positions of the majority of the large sequence contigs from each genome sequence and validated our method by mapping single sequence repeat markers derived from sequence contigs on a full mapping population.
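
    The bin mapping idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual selective-mapping setup, in which each bin on the framework map has a distinct joint genotype across a small "bin set" of individuals, and a contig-derived SNP is placed in the bin whose joint genotype it matches. The bin names, contig names, genotype strings and the `assign_bin` helper are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional


def assign_bin(marker_genotypes: str, bin_signatures: Dict[str, str]) -> Optional[str]:
    """Return the bin whose joint genotype equals the marker's, or None.

    Each character is the allele call ('A' or 'B') for one individual
    of the bin set, always in the same individual order.
    """
    matches = [name for name, sig in bin_signatures.items() if sig == marker_genotypes]
    # Only an unambiguous, exact match places the marker.
    return matches[0] if len(matches) == 1 else None


# Hypothetical bin set of four individuals and three framework-map bins.
bins = {"LG1_bin1": "AABA", "LG1_bin2": "ABBA", "LG2_bin1": "BBAB"}

# Hypothetical SNP genotypes scored on the same four individuals.
contig_snps = {"contig_17": "ABBA", "contig_42": "BBAB", "contig_99": "AAAA"}

placement = {contig: assign_bin(g, bins) for contig, g in contig_snps.items()}
print(placement)
```

    A contig whose SNP pattern matches no bin (here `contig_99`) stays unplaced in this sketch; a real pipeline would also need to handle missing calls and genotyping errors.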

  11. Dicty_cDB: Contig-U16279-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available ( AB254080 |pid:none) Streptomyces kanamyceticus kanam... 47 0.002 CP000964_3229( CP000964 |pid:none) Klebsiella pneumoni...nkkmtkpvasyeldekrfltllgkligetenlqnrppalipiednag rhviealtpylkanggvleleqvhcdpvnypkrgniiie... letters Score E Sequences producing significant alignments: (bits) Value Contig-U16279-1 (Contig-U16279-1Q....................................................done Score E Sequences producing significant alignments: (bits) Val....................................done Score E Sequences producing significant alignments: (bits) Val

  12. Contig Maps and Genomic Sequencing Identify Candidate Genes in the Usher 1C Locus

    Science.gov (United States)

    Higgins, Michael J.; Day, Colleen D.; Smilinich, Nancy J.; Ni, L.; Cooper, Paul R.; Nowak, Norma J.; Davies, Chris; de Jong, Pieter J.; Hejtmancik, Fielding; Evans, Glen A.; Smith, Richard J.H.; Shows, Thomas B.

    1998-01-01

    Usher syndrome 1C (USH1C) is a congenital condition manifesting profound hearing loss, the absence of vestibular function, and eventual retinal degeneration. The USH1C locus has been mapped genetically to a 2- to 3-cM interval in 11p14–15.1 between D11S899 and D11S861. In an effort to identify the USH1C disease gene we have isolated the region between these markers in yeast artificial chromosomes (YACs) using a combination of STS content mapping and Alu–PCR hybridization. The YAC contig is ∼3.5 Mb and has located several other loci within this interval, resulting in the order CEN-LDHA-SAA1-TPH-D11S1310-(D11S1888/KCNC1)-MYOD1-D11S902-D11S921-D11S1890-TEL. Subsequent haplotyping and homozygosity analysis refined the location of the disease gene to a 400-kb interval between D11S902 and D11S1890 with all affected individuals being homozygous for the internal marker D11S921. To facilitate gene identification, the critical region has been converted into P1 artificial chromosome (PAC) clones using sequence-tagged sites (STSs) mapped to the YAC contig, Alu–PCR products generated from the YACs, and PAC end probes. A contig of >50 PAC clones has been assembled between D11S1310 and D11S1890, confirming the order of markers used in haplotyping. Three PAC clones representing nearly two-thirds of the USH1C critical region have been sequenced. PowerBLAST analysis identified six clusters of expressed sequence tags (ESTs), two known genes (BIR, SUR1) mapped previously to this region, and a previously characterized but unmapped gene NEFA (DNA binding/EF hand/acidic amino-acid-rich). GRAIL analysis identified 11 CpG islands and 73 exons of excellent quality. These data allowed the construction of a transcription map for the USH1C critical region, consisting of three known genes and six or more novel transcripts. Based on their map location, these loci represent candidate disease loci for USH1C. The NEFA gene was assessed as the USH1C locus by the sequencing of an amplified NEFA

  13. LTC: a novel algorithm to improve the efficiency of contig assembly for physical mapping in complex genomes

    Directory of Open Access Journals (Sweden)

    Feuillet Catherine

    2010-11-01

    Full Text Available Abstract. Background: Physical maps are the substrate of genome sequencing and map-based cloning and their construction relies on the accurate assembly of BAC clones into large contigs that are then anchored to genetic maps with molecular markers. High Information Content Fingerprinting has become the method of choice for large and repetitive genomes such as those of maize, barley, and wheat. However, the high level of repeated DNA present in these genomes requires the application of very stringent criteria to ensure a reliable assembly with the FingerPrinted Contig (FPC) software, which often results in short contig lengths (of 3-5 clones) before merging as well as an unreliable assembly in some difficult regions. Difficulties can originate from a non-linear topological structure of clone overlaps, low power of clone ordering algorithms, and the absence of tools to identify sources of gaps in Minimal Tiling Paths (MTPs). Results: To address these problems, we propose a novel approach that: (i) reduces the rate of false connections and Q-clones by using a new cutoff calculation method; (ii) obtains reliable clusters robust to the exclusion of single clone or clone overlap; (iii) explores the topological contig structure by considering contigs as networks of clones connected by significant overlaps; (iv) performs iterative clone clustering combined with ordering and order verification using re-sampling methods; and (v) uses global optimization methods for clone ordering and Band Map construction. The elements of this new analytical framework called Linear Topological Contig (LTC) were applied on datasets used previously for the construction of the physical map of wheat chromosome 3B with FPC. The performance of LTC vs. FPC was compared also on the simulated BAC libraries based on the known genome sequences for chromosome 1 of rice and chromosome 1 of maize. Conclusions: The results show that compared to other methods, LTC enables the construction of highly
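
    Point (iii) of this abstract treats a contig as a network of clones joined by significant overlaps. The sketch below illustrates only that general idea, not the LTC algorithm itself: clones are nodes, an edge is kept when a hypothetical overlap score (a p-value-like number, smaller meaning more significant) passes a cutoff, and connected components are reported as candidate clusters. The clone names, scores, cutoff and the `cluster_clones` function are assumptions made for illustration.

```python
from collections import defaultdict, deque


def cluster_clones(overlaps, cutoff):
    """Group clones into connected components.

    overlaps: iterable of (clone_a, clone_b, score) tuples.
    Only overlaps with score <= cutoff are kept as edges.
    """
    graph = defaultdict(set)
    for a, b, score in overlaps:
        # Register both clones even if the overlap is rejected.
        graph[a]
        graph[b]
        if score <= cutoff:
            graph[a].add(b)
            graph[b].add(a)

    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        queue, component = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(graph[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Hypothetical overlap scores between five clones.
overlaps = [
    ("c1", "c2", 1e-30),
    ("c2", "c3", 1e-25),
    ("c3", "c4", 1e-5),   # too weak at the cutoff below, so no edge
    ("c4", "c5", 1e-40),
]
clusters = cluster_clones(overlaps, cutoff=1e-15)
print(clusters)
```

    Tightening the cutoff removes weak edges and splits clusters, which is the trade-off between false connections and short contigs that the abstract describes.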

  14. A Blumeria graminis f.sp. hordei BAC library - contig building and microsynteny studies

    DEFF Research Database (Denmark)

    Pedersen, C.; Wu, B.; Giese, H.

    2002-01-01

    A bacterial artificial chromosome (BAC) library of Blumeria graminis f.sp. hordei, containing 12,000 clones with an average insert size of 41 kb, was constructed. The library represents about three genome equivalents and BAC-end sequencing showed a high content of repetitive sequences, making...... contigs, at or close to avirulence loci, were constructed. Single nucleotide polymorphism (SNP) markers were developed from BAC-end sequences to link the contigs to the genetic maps. Two other BAC contigs were used to study microsynteny between B. graminis and two other ascomycetes, Neurospora crassa...

  15. COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge.

    Science.gov (United States)

    Lu, Yang Young; Chen, Ting; Fuhrman, Jed A; Sun, Fengzhu

    2017-03-15

    The advent of next-generation sequencing technologies enables researchers to sequence complex microbial communities directly from the environment. Because assembly typically produces only genome fragments, also known as contigs, instead of an entire genome, it is crucial to group them into operational taxonomic units (OTUs) for further taxonomic profiling and downstream functional analysis. OTU clustering is also referred to as binning. We present COCACOLA, a general framework that automatically bins contigs into OTUs based on sequence composition and coverage across multiple samples. The effectiveness of COCACOLA is demonstrated in both simulated and real datasets in comparison with state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. One is using L1 distance instead of Euclidean distance for better taxonomic identification during initialization. More importantly, COCACOLA takes advantage of both hard clustering and soft clustering by sparsity regularization. In addition, the COCACOLA framework seamlessly embraces customized knowledge to facilitate binning accuracy. In our study, we have investigated two types of additional knowledge, the co-alignment to reference genomes and linkage of contigs provided by paired-end reads, as well as the ensemble of both. We find that both co-alignment and linkage information further improve binning in the majority of cases. COCACOLA is scalable and faster than CONCOCT, GroopM, MaxBin and MetaBAT. The software is available at https://github.com/younglululu/COCACOLA. Contact: fsun@usc.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
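
    The abstract above contrasts L1 distance with Euclidean distance when comparing contigs. The following is a minimal sketch of the two measures on composition profiles, not COCACOLA's own code; the function names and the toy four-component profiles are assumptions made for illustration.

```python
import math


def l1_distance(u, v):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Euclidean (L2) distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_centroid(profile, centroids, distance=l1_distance):
    """Return the index of the centroid closest to `profile`."""
    return min(range(len(centroids)), key=lambda i: distance(profile, centroids[i]))


# Hypothetical normalized composition profiles (each sums to 1).
contig = [0.40, 0.30, 0.20, 0.10]
centroids = [
    [0.10, 0.20, 0.30, 0.40],
    [0.35, 0.35, 0.20, 0.10],
]

print(l1_distance(contig, centroids[0]))
print(euclidean_distance(contig, centroids[0]))
print(nearest_centroid(contig, centroids))
```

    Because L1 distance sums absolute differences rather than squaring them, a single strongly deviating component weighs less than it does under Euclidean distance. Whether that helps on a given dataset is an empirical question.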

  16. Sequencing of BAC pools by different next generation sequencing platforms and strategies

    Directory of Open Access Journals (Sweden)

    Scholz Uwe

    2011-10-01

    Full Text Available Abstract Background Next generation sequencing of BACs is a viable option for deciphering the sequence of even large and highly repetitive genomes. In order to optimize this strategy, we examined the influence of read length on the quality of Roche/454 sequence assemblies, to what extent Illumina/Solexa mate pairs (MPs) improve the assemblies by scaffolding and whether barcoding of BACs is dispensable. Results Sequencing four BACs with both FLX and Titanium technologies revealed similar sequencing accuracy, but showed that the longer Titanium reads produce considerably fewer misassemblies and gaps. The 454 assemblies of 96 barcoded BACs were improved by scaffolding 79% of the total contig length with MPs from a non-barcoded library. Assembly of the unmasked 454 sequences without separation by barcodes revealed chimeric contig formation to be a major problem, encompassing 47% of the total contig length. Masking the sequences reduced this fraction to 24%. Conclusion Optimal BAC pool sequencing should be based on the longest available reads, with barcoding essential for a comprehensive assessment of both repetitive and non-repetitive sequence information. When interest is restricted to non-repetitive regions and repeats are masked prior to assembly, barcoding is non-essential. In any case, the assemblies can be improved considerably by scaffolding with non-barcoded BAC pool MPs.

  17. SIS: a program to generate draft genome sequence scaffolds for prokaryotes

    Directory of Open Access Journals (Sweden)

    Dias Zanoni

    2012-05-01

    Full Text Available Abstract Background Decreasing costs of DNA sequencing have made prokaryotic draft genome sequences increasingly common. A contig scaffold is an ordering of contigs in the correct orientation. A scaffold can help genome comparisons and guide gap closure efforts. One popular technique for obtaining contig scaffolds is to map contigs onto a reference genome. However, rearrangements that may exist between the query and reference genomes may result in incorrect scaffolds, if these rearrangements are not taken into account. Large-scale inversions are common rearrangement events in prokaryotic genomes. Even in draft genomes it is possible to detect the presence of inversions given sufficient sequencing coverage and a sufficiently close reference genome. Results We present a linear-time algorithm that can generate a set of contig scaffolds for a draft genome sequence represented in contigs given a reference genome. The algorithm is aimed at prokaryotic genomes and relies on the presence of matching sequence patterns between the query and reference genomes that can be interpreted as the result of large-scale inversions; we call these patterns inversion signatures. Our algorithm is capable of correctly generating a scaffold if at least one member of every inversion signature pair is present in contigs and no inversion signatures have been overwritten in evolution. The algorithm is also capable of generating scaffolds in the presence of any kind of inversion, even though in this general case there is no guarantee that all scaffolds in the scaffold set will be correct. We compare the performance of sis, the program that implements the algorithm, to seven other scaffold-generating programs. The results of our tests show that sis has overall better performance. Conclusions sis is a new easy-to-use tool to generate contig scaffolds, available both as stand-alone and as a web server. The good performance of sis in our tests adds evidence that large

  18. Assembly of 500,000 inter-specific catfish expressed sequence tags and large scale gene-associated marker development for whole genome association studies

    Energy Technology Data Exchange (ETDEWEB)

    Catfish Genome Consortium; Wang, Shaolin; Peatman, Eric; Abernathy, Jason; Waldbieser, Geoff; Lindquist, Erika; Richardson, Paul; Lucas, Susan; Wang, Mei; Li, Ping; Thimmapuram, Jyothi; Liu, Lei; Vullaganti, Deepika; Kucuktas, Huseyin; Murdock, Christopher; Small, Brian C; Wilson, Melanie; Liu, Hong; Jiang, Yanliang; Lee, Yoona; Chen, Fei; Lu, Jianguo; Wang, Wenqi; Xu, Peng; Somridhivej, Benjaporn; Baoprasertkul, Puttharat; Quilang, Jonas; Sha, Zhenxia; Bao, Baolong; Wang, Yaping; Wang, Qun; Takano, Tomokazu; Nandi, Samiran; Liu, Shikai; Wong, Lilian; Kaltenboeck, Ludmilla; Quiniou, Sylvie; Bengten, Eva; Miller, Norman; Trant, John; Rokhsar, Daniel; Liu, Zhanjiang

    2010-03-23

    Background-Through the Community Sequencing Program, a catfish EST sequencing project was carried out through a collaboration between the catfish research community and the Department of Energy's Joint Genome Institute. Prior to this project, only a limited EST resource from catfish was available for the purpose of SNP identification. Results-A total of 438,321 quality ESTs were generated from 8 channel catfish (Ictalurus punctatus) and 4 blue catfish (Ictalurus furcatus) libraries, bringing the number of catfish ESTs to nearly 500,000. Assembly of all catfish ESTs resulted in 45,306 contigs and 66,272 singletons. Over 35% of the unique sequences had significant similarities to known genes, allowing the identification of 14,776 unique genes in catfish. Over 300,000 putative SNPs have been identified, of which approximately 48,000 are high-quality SNPs identified from contigs with at least four sequences and the minor allele presence of at least two sequences in the contig. The EST resource should be valuable for identification of microsatellites, genome annotation, large-scale expression analysis, and comparative genome analysis. Conclusions-This project generated a large EST resource for catfish that captured the majority of the catfish transcriptome. The parallel analysis of ESTs from two closely related Ictalurid catfishes should also provide powerful means for the evaluation of ancient and recent gene duplications, and for the development of high-density microarrays in catfish. The inter- and intra-specific SNPs identified from all catfish EST dataset assembly will greatly benefit the catfish introgression breeding program and whole genome association studies.
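
    The high-quality SNP criterion in the abstract above (at least four sequences, minor allele present in at least two) can be sketched as a small filter. This is an illustrative sketch only, not the consortium's pipeline; it assumes the criterion is applied to the base calls observed at a single alignment column, and the function name and parameters are hypothetical.

```python
from collections import Counter


def is_high_quality_snp(bases, min_sequences=4, min_minor=2):
    """Decide whether one alignment column qualifies as a high-quality SNP.

    `bases` holds one base call per EST covering the column (assumption:
    depth at the column stands in for the number of sequences in the contig).
    """
    counts = Counter(b for b in bases if b in "ACGT")
    if sum(counts.values()) < min_sequences or len(counts) < 2:
        return False  # too few sequences, or no variation at all
    # The second most frequent base is treated as the minor allele.
    minor_count = counts.most_common()[1][1]
    return minor_count >= min_minor


print(is_high_quality_snp(list("AAGG")))   # four sequences, minor allele seen twice
print(is_high_quality_snp(list("AAAG")))   # minor allele seen only once
print(is_high_quality_snp(list("AGG")))    # only three sequences
```

    Requiring the minor allele in at least two sequences guards against counting a single sequencing error as a polymorphism.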

  19. 454 sequencing of pooled BAC clones on chromosome 3H of barley

    Directory of Open Access Journals (Sweden)

    Yamaji Nami

    2011-05-01

    Full Text Available Abstract Background Genome sequencing of barley has been delayed due to its large genome size (ca. 5,000 Mbp). Among the fast sequencing systems, 454 liquid phase pyrosequencing provides the longest reads and is the most promising method for BAC clones. Here we report the results of pooled sequencing of BAC clones selected with ESTs genetically mapped to chromosome 3H. Results We sequenced pooled barley BAC clones using a 454 parallel genome sequencer. A PCR screening system based on primer sets derived from genetically mapped ESTs on chromosome 3H was used for clone selection in a BAC library developed from cultivar "Haruna Nijo". The DNA samples of 10 or 20 BAC clones were pooled and used for shotgun library development. The homology between contig sequences generated in each pooled library and mapped EST sequences was studied. The number of contigs assigned on chromosome 3H was 372. Their lengths ranged from 1,230 bp to 58,322 bp with an average 14,891 bp. Of these contigs, 240 showed homology and colinearity with the genome sequence of rice chromosome 1. A contig annotation browser supplemented with query search by unique sequence or genetic map position was developed. The identified contigs can be annotated with barley cDNAs and reference sequences on the browser. Homology analysis of these contigs with rice genes indicated that 1,239 rice genes can be assigned to barley contigs by the simple comparison of sequence lengths in both species. Of these genes, 492 are assigned to rice chromosome 1. Conclusions We demonstrate the efficiency of sequencing gene rich regions from barley chromosome 3H, with special reference to syntenic relationships with rice chromosome 1.

  20. CBrowse: a SAM/BAM-based contig browser for transcriptome assembly visualization and analysis.

    Science.gov (United States)

    Li, Pei; Ji, Guoli; Dong, Min; Schmidt, Emily; Lenox, Douglas; Chen, Liangliang; Liu, Qi; Liu, Lin; Zhang, Jie; Liang, Chun

    2012-09-15

    To address the impending need for exploring rapidly increased transcriptomics data generated for non-model organisms, we developed CBrowse, an AJAX-based web browser for visualizing and analyzing transcriptome assemblies and contigs. Designed in a standard three-tier architecture with a data pre-processing pipeline, CBrowse is essentially a Rich Internet Application that offers many seamlessly integrated web interfaces and allows users to navigate, sort, filter, search and visualize data smoothly. The pre-processing pipeline takes the contig sequence file in FASTA format and its relevant SAM/BAM file as the input; detects putative polymorphisms, simple sequence repeats and sequencing errors in contigs and generates image, JSON and database-compatible CSV text files that are directly utilized by different web interfaces. CBrowse is a generic visualization and analysis tool that facilitates close examination of assembly quality, genetic polymorphisms, sequence repeats and/or sequencing errors in transcriptome sequencing projects. CBrowse is distributed under the GNU General Public License, available at http://bioinfolab.muohio.edu/CBrowse/. Contact: liangc@muohio.edu or liangc.mu@gmail.com; glji@xmu.edu.cn. Supplementary data are available at Bioinformatics online.

  1. Dicty_cDB: Contig-U03802-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available T-2KB Trichosurus... 48 3e-11 4 ( DY894715 ) CeleSEQ14351 Cunninghamella elegans pBluescript (... 58 4e-11 3... letters Score E Sequences producing significant alignments: (bits) Value Contig-U03802-1 (Contig-U... letters Searching..................................................done Score E Sequences producing significant al...1... 62 4e-05 1 ( EJ306703 ) 1095390099376 Global-Ocean-Sampling_GS-27-01-01-1... 62 4e-05 1 ( CP000238 ) Baumannia cicadellinicola... AY241394 |pid:none) Melopsittacus undulatus Mn superox... 244 2e-63 AF329270_1( AF329270 |pid:none) Gallus gallus manganes

  2. ESTminer: a Web interface for mining EST contig and cluster databases.

    Science.gov (United States)

    Huang, Yecheng; Pumphrey, Janie; Gingle, Alan R

    2005-03-01

    ESTminer is a Web application and database schema for interactive mining of expressed sequence tag (EST) contig and cluster datasets. The Web interface contains a query frame that allows the selection of contigs/clusters with specific cDNA library makeup or a threshold number of members. The results are displayed as color-coded tree nodes, where the color indicates the fractional size of each cDNA library component. The nodes are expandable, revealing library statistics as well as EST or contig members, with links to sequence data, GenBank records or user configurable links. Also, the interface allows 'queries within queries' where the result set of a query is further filtered by the subsequent query. ESTminer is implemented in Java/JSP and the package, including MySQL and Oracle schema creation scripts, is available from http://cggc.agtec.uga.edu/Data/download.asp. Contact: agingle@uga.edu.

  3. Generation and analysis of expressed sequence tags in the extreme large genomes Lilium and Tulipa

    Directory of Open Access Journals (Sweden)

    Shahin Arwa

    2012-11-01

    Full Text Available Abstract Background Bulbous flowers such as lily and tulip (Liliaceae family) are monocot perennial herbs that are economically very important ornamental plants worldwide. However, there are hardly any genetic studies performed and genomic resources are lacking. To build genomic resources and develop tools to speed up the breeding in both crops, next generation sequencing was implemented. We sequenced and assembled transcriptomes of four lily and five tulip genotypes using 454 pyro-sequencing technology. Results Successfully, we developed the first set of 81,791 contigs with an average length of 514 bp for tulip, and enriched the very limited number of 3,329 available ESTs (Expressed Sequence Tags) for lily with 52,172 contigs with an average length of 555 bp. The contigs together with singletons covered on average 37% of lily and 39% of tulip estimated transcriptome. Mining lily and tulip sequence data for SSRs (Simple Sequence Repeats) showed that di-nucleotide repeats were twice more abundant in UTRs (UnTranslated Regions) compared to coding regions, while tri-nucleotide repeats were equally spread over coding and UTR regions. Two sets of single nucleotide polymorphism (SNP) markers suitable for high throughput genotyping were developed. In the first set, no SNPs flanking the target SNP (50 bp on either side) were allowed. In the second set, one SNP in the flanking regions was allowed, which resulted in a 2 to 3 fold increase in SNP marker numbers compared with the first set. Orthologous groups between the two flower bulbs: lily and tulip (12,017 groups) and among the three monocot species: lily, tulip, and rice (6,900 groups) were determined using OrthoMCL. Orthologous groups were screened for common SNP markers and EST-SSRs to study synteny between lily and tulip, which resulted in 113 common SNP markers and 292 common EST-SSR. Lily and tulip contigs generated were annotated and described according to Gene Ontology terminology. Conclusions

  4. Generation and analysis of expressed sequence tags in the extreme large genomes Lilium and Tulipa.

    Science.gov (United States)

    Shahin, Arwa; van Kaauwen, Martijn; Esselink, Danny; Bargsten, Joachim W; van Tuyl, Jaap M; Visser, Richard G F; Arens, Paul

    2012-11-20

    Bulbous flowers such as lily and tulip (Liliaceae family) are monocot perennial herbs that are economically very important ornamental plants worldwide. However, there are hardly any genetic studies performed and genomic resources are lacking. To build genomic resources and develop tools to speed up the breeding in both crops, next generation sequencing was implemented. We sequenced and assembled transcriptomes of four lily and five tulip genotypes using 454 pyro-sequencing technology. Successfully, we developed the first set of 81,791 contigs with an average length of 514 bp for tulip, and enriched the very limited number of 3,329 available ESTs (Expressed Sequence Tags) for lily with 52,172 contigs with an average length of 555 bp. The contigs together with singletons covered on average 37% of lily and 39% of tulip estimated transcriptome. Mining lily and tulip sequence data for SSRs (Simple Sequence Repeats) showed that di-nucleotide repeats were twice more abundant in UTRs (UnTranslated Regions) compared to coding regions, while tri-nucleotide repeats were equally spread over coding and UTR regions. Two sets of single nucleotide polymorphism (SNP) markers suitable for high throughput genotyping were developed. In the first set, no SNPs flanking the target SNP (50 bp on either side) were allowed. In the second set, one SNP in the flanking regions was allowed, which resulted in a 2 to 3 fold increase in SNP marker numbers compared with the first set. Orthologous groups between the two flower bulbs: lily and tulip (12,017 groups) and among the three monocot species: lily, tulip, and rice (6,900 groups) were determined using OrthoMCL. Orthologous groups were screened for common SNP markers and EST-SSRs to study synteny between lily and tulip, which resulted in 113 common SNP markers and 292 common EST-SSR. Lily and tulip contigs generated were annotated and described according to Gene Ontology terminology. Two transcriptome sets were built that are valuable
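
    The two SNP marker sets described above differ only in how many neighbouring SNPs are tolerated within 50 bp of the target SNP. The following is a minimal sketch of that rule, assuming SNPs are given as unique integer positions on one contig; the function name, parameters and toy positions are hypothetical, and the sketch does not check that a full flank exists near contig ends.

```python
def select_snps(positions, flank=50, max_flanking=0):
    """Keep SNPs with at most `max_flanking` other SNPs within `flank` bp.

    max_flanking=0 mirrors the first (strict) marker set,
    max_flanking=1 mirrors the second (relaxed) marker set.
    """
    positions = sorted(positions)
    selected = []
    for p in positions:
        neighbours = sum(1 for q in positions if q != p and abs(q - p) <= flank)
        if neighbours <= max_flanking:
            selected.append(p)
    return selected


# Hypothetical SNP positions on a single contig.
snps = [100, 120, 300, 340, 345, 600]

print(select_snps(snps, max_flanking=0))  # strict set
print(select_snps(snps, max_flanking=1))  # relaxed set
```

    On this toy input the relaxed rule retains more markers than the strict rule, which is the direction of the effect the abstract reports.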

  5. The human MCP-2 gene (SCYA8): Cloning, sequence analysis, tissue expression, and assignment to the CC chemokine gene contig on chromosome 17q11.2

    Energy Technology Data Exchange (ETDEWEB)

    Van Coillie, E.; Fiten, P.; Van Damme, J.; Opdenakker, G. [Univ. of Leuven (Belgium)] [and others

    1997-03-01

    Monocyte chemotactic proteins (MCPs) form a subfamily of chemokines that recruit leukocytes to sites of inflammation and that may contribute to tumor-associated leukocyte infiltration and to the antiviral state against HIV infection. With the use of degenerate primers that were based on CC chemokine consensus sequences, the known MIP-1α/LD78α, MCP-1, and MCP-3 genes and the previously unidentified eotaxin and MCP-2 genes were isolated from a YAC contig from human chromosome 17q11.2. The amplified genomic MCP-2 fragment was used to isolate an MCP-2 cosmid from which the gene sequence was determined. The MCP-2 gene shares with the MCP-1 and MCP-3 genes a conserved intron-exon structure and a coding nucleotide sequence homology of 77%. By Northern blot analysis the 1.0-kb MCP-2 mRNA was predominantly detectable in the small intestine, peripheral blood, heart, placenta, lung, skeletal muscle, ovary, colon, spinal cord, pancreas, and thymus. Transcripts of 1.5 and 2.4 kb were found in the testis, the small intestine, and the colon. The isolation of the MCP-2 gene from the chemokine contig localized it on YAC clones of chromosome 17q11.2, which also contain the eotaxin, MCP-1, MCP-3, and NCC-1/MCP-4 genes. The combination of using degenerate primer PCR and YACs illustrates that novel genes can efficiently be isolated from gene cluster contigs with less redundancy and effort than the isolation of novel ESTs. 42 refs., 5 figs., 2 tabs.

  6. Detection of a Usp-like gene in Calotropis procera plant from the de novo assembled genome contigs of the high-throughput sequencing dataset

    KAUST Repository

    Shokry, Ahmed M.

    2014-02-01

    The wild plant species Calotropis procera (C. procera) has many potential applications and beneficial uses in medicine, industry and ornamental field. It also represents an excellent source of genes for drought and salt tolerance. Genes encoding proteins that contain the conserved universal stress protein (USP) domain are known to provide organisms like bacteria, archaea, fungi, protozoa and plants with the ability to respond to a plethora of environmental stresses. However, information on the possible occurrence of Usp in C. procera is not available. In this study, we uncovered and characterized a one-class A Usp-like (UspA-like, NCBI accession No. KC954274) gene in this medicinal plant from the de novo assembled genome contigs of the high-throughput sequencing dataset. A number of GenBank accessions for Usp sequences were blasted with the recovered de novo assembled contigs. Homology modelling of the deduced amino acids (NCBI accession No. AGT02387) was further carried out using Swiss-Model, accessible via the EXPASY. Superimposition of C. procera USPA-like full sequence model on Thermus thermophilus USP UniProt protein (PDB accession No. Q5SJV7) was constructed using RasMol and Deep-View programs. The functional domains of the novel USPA-like amino acids sequence were identified from the NCBI conserved domain database (CDD) that provide insights into sequence structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM). © 2014 Académie des sciences.

  7. [Complete genome sequencing and sequence analysis of BCG Tice].

    Science.gov (United States)

    Wang, Zhiming; Pan, Yuanlong; Wu, Jun; Zhu, Baoli

    2012-10-04

    The objective of this study is to obtain the complete genome sequence of Bacillus Calmette-Guerin Tice (BCG Tice), in order to provide more information about the molecular biology of BCG Tice and design more reasonable vaccines to prevent tuberculosis. We assembled the data from high-throughput sequencing with SOAPdenovo software, obtaining many contigs and scaffolds. Many sequence gaps and physical gaps remained as a result of regional low coverage and low quality. We designed primers at the ends of contigs and performed PCR amplification in order to link these contigs and scaffolds. By using various enzymes for PCR amplification, adjusting the PCR reaction conditions, and combining these with clone construction and sequencing, all the gaps were closed. We obtained the complete genome sequence of BCG Tice and submitted it to GenBank of National Center for Biotechnology Information (NCBI). The genome of BCG Tice is 4334064 base pairs in length, with GC content 65.65%. The problems and strategies during the finishing step of BCG Tice sequencing are illuminated here, with the hope of affording some experience to those who are involved in the finishing step of genome sequencing. The microarray data were verified by our results.
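
    The genome length and GC content quoted above are simple summary statistics of a finished sequence. As a minimal sketch of how such figures are typically computed (the function name and the toy sequence are assumptions, not taken from the study):

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        return 0.0  # no unambiguous bases to measure
    return gc / total


# Toy sequence; a real genome would be read from a FASTA file.
genome = "GGCCATGCGC"

print(len(genome))                          # genome length in base pairs
print(round(100 * gc_content(genome), 2))   # GC content in percent
```

    Ambiguous bases such as N are excluded from the denominator here; other conventions divide by the full sequence length instead.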

  8. Toward an Integrated BAC Library Resource for Genome Sequencing and Analysis; FINAL

    International Nuclear Information System (INIS)

    Simon, M. I.; Kim, U.-J.

    2002-01-01

    We developed a great deal of expertise in building large BAC libraries from a variety of DNA sources including humans, mice, corn, microorganisms, worms, and Arabidopsis. We greatly improved the technology for screening these libraries rapidly and for selecting appropriate BACs and mapping BACs to develop large overlapping contigs. We became involved in supplying BACs and BAC contigs to a variety of sequencing and mapping projects and we began to collaborate with Drs. Adams and Venter at TIGR and with Dr. Leroy Hood and his group at University of Washington to provide BACs for end sequencing and for mapping and sequencing of large fragments of chromosome 16. Together with Dr. Ian Dunham and his co-workers at the Sanger Center we completed the mapping and they completed the sequencing of the first human chromosome, chromosome 22. This was published in Nature in 1999 and our BAC contigs made a major contribution to this sequencing effort. Drs. Shizuya and Ding invented an automated highly accurate BAC mapping technique. We also developed long-term collaborations with Dr. Uli Weier at UCSF in the design of BAC probes for characterization of human tumors and specific chromosome deletions and breakpoints. Finally the contribution of our work to the human genome project has been recognized in the publication both by the international consortium and the NIH of a draft sequence of the human genome in Nature last year. Dr. Shizuya was acknowledged in the authorship of that landmark paper. Dr. Simon was also an author on the Venter/Adams Celera project sequencing the human genome that was published in Science last year

  9. Characteristics of the Lotus japonicus gene repertoire deduced from large-scale expressed sequence tag (EST) analysis.

    Science.gov (United States)

    Asamizu, Erika; Nakamura, Yasukazu; Sato, Shusei; Tabata, Satoshi

    2004-02-01

    To perform a comprehensive analysis of genes expressed in a model legume, Lotus japonicus, a total of 74472 3'-end expressed sequence tags (EST) were generated from cDNA libraries produced from six different organs. Clustering of sequences was performed with an identity criterion of 95% for 50 bases, and a total of 20457 non-redundant sequences, 8503 contigs and 11954 singletons were generated. EST sequence coverage was analyzed by using the annotated L. japonicus genomic sequence and 1093 of the 1889 predicted protein-encoding genes (57.9%) were hit by the EST sequence(s). Gene content was compared to several plant species. Among the 8503 contigs, 471 were identified as sequences conserved only in leguminous species and these included several disease resistance-related genes. This suggested that in legumes, these genes may have evolved specifically to resist pathogen attack. The rate of gene sequence divergence was assessed by comparing similarity level and functional category based on the Gene Ontology (GO) annotation of Arabidopsis genes. This revealed that genes encoding ribosomal proteins, as well as those related to translation, photosynthesis, and cellular structure were more abundantly represented in the highly conserved class, and that genes encoding transcription factors and receptor protein kinases were abundantly represented in the less conserved class. To make the sequence information and the cDNA clones available to the research community, a Web database with useful services was created at http://www.kazusa.or.jp/en/plant/lotus/EST/.

  10. Two sequence-ready contigs spanning the two copies of a 200-kb duplication on human 21q: partial sequence and polymorphisms.

    Science.gov (United States)

    Potier, M; Dutriaux, A; Orti, R; Groet, J; Gibelin, N; Karadima, G; Lutfalla, G; Lynn, A; Van Broeckhoven, C; Chakravarti, A; Petersen, M; Nizetic, D; Delabar, J; Rossier, J

    1998-08-01

    Physical mapping across a duplication can be a tour de force if the region is larger than the size of a bacterial clone. This was the case of the 170- to 275-kb duplication present on the long arm of chromosome 21 in normal human at 21q11.1 (proximal region) and at 21q22.1 (distal region), which we described previously. We have constructed sequence-ready contigs of the two copies of the duplication of which all the clones are genuine representatives of one copy or the other. This required the identification of four duplicon polymorphisms that are copy-specific and nonallelic variations in the sequence of the STSs. Thirteen STSs were mapped inside the duplicated region and 5 outside but close to the boundaries. Among these STSs 10 were end clones from YACs, PACs, or cosmids, and the average interval between two markers in the duplicated region was 16 kb. Eight PACs and cosmids showing minimal overlaps were selected in both copies of the duplication. Comparative sequence analysis along the duplication showed three single-basepair changes between the two copies over 659 bp sequenced (4 STSs), suggesting that the duplication is recent (less than 4 mya). Two CpG islands were located in the duplication, but no genes were identified after a 36-kb cosmid from the proximal copy of the duplication was sequenced. The homology of this chromosome 21 duplicated region with the pericentromeric regions of chromosomes 13, 2, and 18 suggests that the mechanism involved is probably similar to pericentromeric-directed mechanisms described in interchromosomal duplications. Copyright 1998 Academic Press.

  11. The binning of metagenomic contigs for microbial physiology of mixed cultures.

    Science.gov (United States)

    Strous, Marc; Kraft, Beate; Bisdorf, Regina; Tegetmeyer, Halina E

    2012-01-01

    So far, microbial physiology has dedicated itself mainly to pure cultures. In nature, cross feeding and competition are important aspects of microbial physiology and these can only be addressed by studying complete communities such as enrichment cultures. Metagenomic sequencing is a powerful tool to characterize such mixed cultures. In the analysis of metagenomic data, well established algorithms exist for the assembly of short reads into contigs and for the annotation of predicted genes. However, the binning of the assembled contigs or unassembled reads is still a major bottleneck and required to understand how the overall metabolism is partitioned over different community members. Binning consists of the clustering of contigs or reads that apparently originate from the same source population. In the present study eight metagenomic samples from the same habitat, a laboratory enrichment culture, were sequenced. Each sample contained 13-23 Mb of assembled contigs and up to eight abundant populations. Binning was attempted with existing methods but they were found to produce poor results, were slow, dependent on non-standard platforms or produced errors. A new binning procedure was developed based on multivariate statistics of tetranucleotide frequencies combined with the use of interpolated Markov models. Its performance was evaluated by comparison of the results between samples with BLAST and in comparison to existing algorithms for four publicly available metagenomes and one previously published artificial metagenome. The accuracy of the new approach was comparable or higher than existing methods. Further, it was up to 100 times faster. It was implemented in Java Swing as a complete open source graphical binning application available for download and further development (http://sourceforge.net/projects/metawatt).
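
    The binning procedure above starts from tetranucleotide frequencies. The following is a minimal sketch of how such a 256-dimensional composition profile can be computed for one contig; it is not the Metawatt implementation, the names are hypothetical, and it assumes strand-specific counting (many tools additionally merge reverse-complement tetramers).

```python
from collections import Counter
from itertools import product

# All 256 tetranucleotides in a fixed order, so profiles are comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each tetramer in `seq`.

    Windows containing ambiguous bases (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


profile = tetranucleotide_frequencies("ACGTACGT")
print(len(profile))                        # one value per tetramer
print(profile[TETRAMERS.index("ACGT")])    # share of windows equal to ACGT
```

    Profiles of this kind can then be fed to a multivariate method or clustering step, since contigs from the same population tend to share similar tetranucleotide usage.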

  12. The binning of metagenomic contigs for microbial physiology of mixed cultures

    Directory of Open Access Journals (Sweden)

    Marc Strous

    2012-12-01

    Full Text Available So far, microbial physiology has dedicated itself mainly to pure cultures. In nature, cross feeding and competition are important aspects of microbial physiology and these can only be addressed by studying complete communities such as enrichment cultures. Metagenomic sequencing is a powerful tool to characterize such mixed cultures. In the analysis of metagenomic data, well established algorithms exist for the assembly of short reads into contigs and for the annotation of predicted genes. However, the binning of the assembled contigs or unassembled reads is still a major bottleneck and required to understand how the overall metabolism is partitioned over different community members. Binning consists of the clustering of contigs or reads that apparently originate from the same source population. In the present study eight metagenomic samples originating from the same habitat, a laboratory enrichment culture, were sequenced. Each sample contained 13-23 Mb of assembled contigs and up to eight abundant populations. Binning was attempted with existing methods but they were found to produce poor results, were slow, dependent on non-standard platforms or produced errors. A new binning procedure was developed based on multivariate statistics of tetranucleotide frequencies combined with the use of interpolated Markov models. Its performance was evaluated by comparison of the results between samples with BLAST and in comparison to existing algorithms for four publicly available metagenomes and one previously published artificial metagenome. The accuracy of the new approach was comparable or higher than existing methods. Further, it was up to a hundred times faster. It was implemented in Java Swing as a complete open source graphical binning application available for download and further development (http://sourceforge.net/projects/metawatt).

  13. Assessment of metagenomic assembly using simulated next generation sequencing data

    DEFF Research Database (Denmark)

    Mende, Daniel R; Waller, Alison S; Sunagawa, Shinichi

    2012-01-01

    with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved...... the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition...... the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities...

  14. Generation of expressed sequence tags for discovery of genes responsible for floral traits of Chrysanthemum morifolium by next-generation sequencing technology.

    Science.gov (United States)

    Sasaki, Katsutomo; Mitsuda, Nobutaka; Nashima, Kenji; Kishimoto, Kyutaro; Katayose, Yuichi; Kanamori, Hiroyuki; Ohmiya, Akemi

    2017-09-04

    Chrysanthemum morifolium is one of the most economically valuable ornamental plants worldwide. Chrysanthemum is an allohexaploid plant with a large genome that is commercially propagated by vegetative reproduction. New cultivars with different floral traits, such as color, morphology, and scent, have been generated mainly by classical cross-breeding and mutation breeding. However, only limited genetic resources and their genome information are available for the generation of new floral traits. To obtain useful information about molecular bases for floral traits of chrysanthemums, we read expressed sequence tags (ESTs) of chrysanthemums by high-throughput sequencing using the 454 pyrosequencing technology. We constructed normalized cDNA libraries, consisting of full-length, 3'-UTR, and 5'-UTR cDNAs derived from various tissues of chrysanthemums. These libraries produced a total number of 3,772,677 high-quality reads, which were assembled into 213,204 contigs. By comparing the data obtained with those of full genome-sequenced species, we confirmed that our chrysanthemum contig set contained the majority of all expressed genes, which was sufficient for further molecular analysis in chrysanthemums. We confirmed that our chrysanthemum EST set (contigs) contained a number of contigs that encoded transcription factors and enzymes involved in pigment and aroma compound metabolism that was comparable to that of other species. This information can serve as an informative resource for identifying genes involved in various biological processes in chrysanthemums. Moreover, the findings of our study will contribute to a better understanding of the floral characteristics of chrysanthemums including the myriad cultivars at the molecular level.

  15. Improvement of methods for large scale sequencing; application to human Xq28

    Energy Technology Data Exchange (ETDEWEB)

    Gibbs, R.A.; Andersson, B.; Wentland, M.A. [Baylor College of Medicine, Houston, TX (United States)] [and others]

    1994-09-01

    Sequencing of a one-megabase region of Xq28, spanning the FRAXA and IDS loci, has been undertaken in order to investigate the practicality of the shotgun approach for large scale sequencing and as a platform to develop improved methods. The efficiency of several steps in the shotgun sequencing strategy has been increased using PCR-based approaches. An improved method for preparation of M13 libraries has been developed. This protocol combines a previously described adaptor-based protocol with the uracil DNA glycosylase (UDG)-cloning procedure. The efficiency of this procedure has been found to be up to 100-fold higher than that of previously used protocols. In addition, the novel protocol is more reliable and thus easy to establish in a laboratory. The method has also been adapted for the simultaneous shotgun sequencing of multiple short fragments by concentrating them before library construction. This protocol is suitable for rapid characterization of cDNA clones. A library was constructed from 15 PCR-amplified and concentrated human cDNA inserts; the insert sequences could easily be identified as separate contigs during the assembly process, and the sequence coverage was even along each fragment. Using this strategy, the fine structures of the FRAXA and IDS loci have been revealed and several EST homologies indicating novel expressed sequences have been identified. Use of PCR to close repetitive regions that are difficult to clone was tested by determination of the sequence of a cosmid mapping DXS455 in Xq28, containing a polymorphic VNTR. The region containing the VNTR was not represented in the shotgun library, but by designing PCR primers in the sequences flanking the gap and by cloning and sequencing the PCR product, the fine structure of the VNTR has been determined. It was found to be an AT-rich VNTR with a repeated 25-mer at the center.

  16. Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver

    Science.gov (United States)

    Blanquart, François; Golubchik, Tanya; Gall, Astrid; Bakker, Margreet; Bezemer, Daniela; Croucher, Nicholas J; Hall, Matthew; Hillebregt, Mariska; Ratmann, Oliver; Albert, Jan; Bannert, Norbert; Fellay, Jacques; Fransen, Katrien; Gourlay, Annabelle; Grabowski, M Kate; Gunsenheimer-Bartmeyer, Barbara; Günthard, Huldrych F; Kivelä, Pia; Kouyos, Roger; Laeyendecker, Oliver; Liitsola, Kirsi; Meyer, Laurence; Porter, Kholoud; Ristola, Matti; van Sighem, Ard; Cornelissen, Marion; Kellam, Paul; Reiss, Peter

    2018-01-01

    Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver’s constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered. We also

  17. Radiation hybrid maps of the D-genome of Aegilops tauschii and their application in sequence assembly of large and complex plant genomes.

    Science.gov (United States)

    Kumar, Ajay; Seetan, Raed; Mergoum, Mohamed; Tiwari, Vijay K; Iqbal, Muhammad J; Wang, Yi; Al-Azzam, Omar; Šimková, Hana; Luo, Ming-Cheng; Dvorak, Jan; Gu, Yong Q; Denton, Anne; Kilian, Andrzej; Lazo, Gerard R; Kianian, Shahryar F

    2015-10-16

    The large and complex genome of bread wheat (Triticum aestivum L., ~17 Gb) requires high resolution genome maps with saturated marker scaffolds to anchor and orient BAC contigs/sequence scaffolds for whole genome assembly. Radiation hybrid (RH) mapping has proven to be an excellent tool for the development of such maps, for it offers much higher and more uniform marker resolution across the length of the chromosome compared to genetic mapping and does not require marker polymorphism per se, as it is based on a presence (retention) vs. absence (deletion) marker assay. In this study, a 178 line RH panel was genotyped with SSRs and DArT markers to develop the first high resolution RH maps of the entire D-genome of Ae. tauschii accession AL8/78. To confirm map order accuracy, the AL8/78-RH maps were compared with: 1) a DArT consensus genetic map constructed using more than 100 bi-parental populations, 2) an RH map of the D-genome of reference hexaploid wheat 'Chinese Spring', and 3) two SNP-based genetic maps, one with anchored D-genome BAC contigs and another with anchored D-genome sequence scaffolds. Using marker sequences, the RH maps were also anchored with a BAC contig based physical map and draft sequence of the D-genome of Ae. tauschii. A total of 609 markers were mapped to 503 unique positions on the seven D-genome chromosomes, with a total map length of 14,706.7 cR. The average distance between any two marker loci was 29.2 cR which corresponds to 2.1 cM or 9.8 Mb. The average mapping resolution across the D-genome was estimated to be 0.34 Mb (Mb/cR) or 0.07 cM (cM/cR). The RH maps showed almost perfect agreement with several published maps with regard to chromosome assignments of markers. The mean rank correlations between the position of markers on AL8/78 maps and the four published maps, ranged from 0.75 to 0.92, suggesting a good agreement in marker order. With 609 mapped markers, a total of 2481 deletions for the whole D-genome were detected with an average

  18. Transcriptome analysis of carnation (Dianthus caryophyllus L.) based on next-generation sequencing technology.

    Science.gov (United States)

    Tanase, Koji; Nishitani, Chikako; Hirakawa, Hideki; Isobe, Sachiko; Tabata, Satoshi; Ohmiya, Akemi; Onozaki, Takashi

    2012-07-02

    Carnation (Dianthus caryophyllus L.), in the family Caryophyllaceae, can be found in a wide range of colors and is a model system for studies of flower senescence. In addition, it is one of the most important flowers in the global floriculture industry. However, few genomics resources, such as sequences and markers are available for carnation or other members of the Caryophyllaceae. To increase our understanding of the genetic control of important characters in carnation, we generated an expressed sequence tag (EST) database for a carnation cultivar important in horticulture by high-throughput sequencing using 454 pyrosequencing technology. We constructed a normalized cDNA library and a 3'-UTR library of carnation, obtaining a total of 1,162,126 high-quality reads. These reads were assembled into 300,740 unigenes consisting of 37,844 contigs and 262,896 singlets. The contigs were searched against an Arabidopsis sequence database, and 61.8% (23,380) of them had at least one BLASTX hit. These contigs were also annotated with Gene Ontology (GO) and were found to cover a broad range of GO categories. Furthermore, we identified 17,362 potential simple sequence repeats (SSRs) in 14,291 of the unigenes. We focused on gene discovery in the areas of flower color and ethylene biosynthesis. Transcripts were identified for almost every gene involved in flower chlorophyll and carotenoid metabolism and in anthocyanin biosynthesis. Transcripts were also identified for every step in the ethylene biosynthesis pathway. We present the first large-scale sequence data set for carnation, generated using next-generation sequencing technology. The large EST database generated from these sequences is an informative resource for identifying genes involved in various biological processes in carnation and provides an EST resource for understanding the genetic diversity of this plant.

  19. Transcriptome analysis of carnation (Dianthus caryophyllus L.) based on next-generation sequencing technology

    Directory of Open Access Journals (Sweden)

    Tanase Koji

    2012-07-01

    Background: Carnation (Dianthus caryophyllus L.), in the family Caryophyllaceae, can be found in a wide range of colors and is a model system for studies of flower senescence. In addition, it is one of the most important flowers in the global floriculture industry. However, few genomics resources, such as sequences and markers, are available for carnation or other members of the Caryophyllaceae. To increase our understanding of the genetic control of important characters in carnation, we generated an expressed sequence tag (EST) database for a carnation cultivar important in horticulture by high-throughput sequencing using 454 pyrosequencing technology. Results: We constructed a normalized cDNA library and a 3’-UTR library of carnation, obtaining a total of 1,162,126 high-quality reads. These reads were assembled into 300,740 unigenes consisting of 37,844 contigs and 262,896 singlets. The contigs were searched against an Arabidopsis sequence database, and 61.8% (23,380) of them had at least one BLASTX hit. These contigs were also annotated with Gene Ontology (GO) and were found to cover a broad range of GO categories. Furthermore, we identified 17,362 potential simple sequence repeats (SSRs) in 14,291 of the unigenes. We focused on gene discovery in the areas of flower color and ethylene biosynthesis. Transcripts were identified for almost every gene involved in flower chlorophyll and carotenoid metabolism and in anthocyanin biosynthesis. Transcripts were also identified for every step in the ethylene biosynthesis pathway. Conclusions: We present the first large-scale sequence data set for carnation, generated using next-generation sequencing technology. The large EST database generated from these sequences is an informative resource for identifying genes involved in various biological processes in carnation and provides an EST resource for understanding the genetic diversity of this plant.

  20. Construction of a yeast artificial chromosome contig spanning the spinal muscular atrophy disease gene region

    Energy Technology Data Exchange (ETDEWEB)

    Kleyn, P.W.; Wang, C.H.; Vitale, E.; Pan, J.; Ross, B.M.; Grunn, A.; Palmer, D.A.; Warburton, D.; Brzustowicz, L.M.; Gilliam, T.G. (New York State Psychiatric Institute, NY (United States)); Lien, L.L.; Kunkel, L.M. (Howard Hughes Medical Institute, Boston, MA (United States))

    1993-07-15

    The childhood spinal muscular atrophies (SMAs) are the most common, serious neuromuscular disorders of childhood second to Duchenne muscular dystrophy. A single locus for these disorders has been mapped by recombination events to a region of 0.7 centimorgan (range, 0.1-2.1 centimorgans) between loci D5S435 and MAP1B on chromosome 5q11.2-13.3. By using PCR amplification to screen yeast artificial chromosome (YAC) DNA pools and the PCR-vectorette method to amplify YAC ends, a YAC contig was constructed across the disease gene region. Nine walk steps identified 32 YACs, including a minimum of seven overlapping YAC clones (average size, 460 kb) that span the SMA region. The contig is characterized by a collection of 30 YAC-end sequence tag sites together with seven genetic markers. The entire YAC contig spans a minimum of 3.2 Mb; the SMA locus is confined to roughly half of this region. Microsatellite markers generated along the YAC contig segregate with the SMA locus in all families where the flanking markers (D5S435 and MAP1B) recombine. Construction of a YAC contig across the disease gene region is an essential step in isolation of the SMA-encoding gene. 26 refs., 3 figs., 1 tab.

  1. Why barcode? High-throughput multiplex sequencing of mitochondrial genomes for molecular systematics.

    Science.gov (United States)

    Timmermans, M J T N; Dodsworth, S; Culverwell, C L; Bocak, L; Ahrens, D; Littlewood, D T J; Pons, J; Vogler, A P

    2010-11-01

    Mitochondrial genome sequences are important markers for phylogenetics but taxon sampling remains sporadic because of the great effort and cost required to acquire full-length sequences. Here, we demonstrate a simple, cost-effective way to sequence the full complement of protein coding mitochondrial genes from pooled samples using the 454/Roche platform. Multiplexing was achieved without the need for expensive indexing tags ('barcodes'). The method was trialled with a set of long-range polymerase chain reaction (PCR) fragments from 30 species of Coleoptera (beetles) sequenced in a 1/16th sector of a sequencing plate. Long contigs were produced from the pooled sequences with sequencing depths ranging from ∼10 to 100× per contig. Species identity of individual contigs was established via three 'bait' sequences matching disparate parts of the mitochondrial genome obtained by conventional PCR and Sanger sequencing. This proved that assembly of contigs from the sequencing pool was correct. Our study produced sequences for 21 nearly complete and seven partial sets of protein coding mitochondrial genes. Combined with existing sequences for 25 taxa, an improved estimate of basal relationships in Coleoptera was obtained. The procedure could be employed routinely for mitochondrial genome sequencing at the species level, to provide improved species 'barcodes' that currently use the cox1 gene only.

  2. Global repeat discovery and estimation of genomic copy number in a large, complex genome using a high-throughput 454 sequence survey

    Directory of Open Access Journals (Sweden)

    Varala Kranthi

    2007-05-01

    Background: Extensive computational and database tools are available to mine genomic and genetic databases for model organisms, but little genomic data is available for many species of ecological or agricultural significance, especially those with large genomes. Genome surveys using conventional sequencing techniques are powerful, particularly for detecting sequences present in many copies per genome. However, these methods are time-consuming and have potential drawbacks. High throughput 454 sequencing provides an alternative method by which much information can be gained quickly and cheaply from high-coverage surveys of genomic DNA. Results: We sequenced 78 million base-pairs of randomly sheared soybean DNA which passed our quality criteria. Computational analysis of the survey sequences provided global information on the abundant repetitive sequences in soybean. The sequence was used to determine the copy number across regions of large genomic clones or contigs and discover higher-order structures within satellite repeats. We have created an annotated, online database of sequences present in multiple copies in the soybean genome. The low bias of pyrosequencing against repeat sequences is demonstrated by the overall composition of the survey data, which matches well with past estimates of repetitive DNA content obtained by DNA re-association kinetics (Cot analysis). Conclusion: This approach provides a potential aid to conventional or shotgun genome assembly, by allowing rapid assessment of copy number in any clone or clone-end sequence. In addition, we show that partial sequencing can provide access to partial protein-coding sequences.

  3. CSAR-web: a web server of contig scaffolding using algebraic rearrangements.

    Science.gov (United States)

    Chen, Kun-Tze; Lu, Chin Lung

    2018-05-04

    CSAR-web is a web-based tool that allows the users to efficiently and accurately scaffold (i.e. order and orient) the contigs of a target draft genome based on a complete or incomplete reference genome from a related organism. It takes as input a target genome in multi-FASTA format and a reference genome in FASTA or multi-FASTA format, depending on whether the reference genome is complete or incomplete, respectively. In addition, it requires the users to choose either 'NUCmer on nucleotides' or 'PROmer on translated amino acids' for CSAR-web to identify conserved genomic markers (i.e. matched sequence regions) between the target and reference genomes, which are used by the rearrangement-based scaffolding algorithm in CSAR-web to order and orient the contigs of the target genome based on the reference genome. In the output page, CSAR-web displays its scaffolding result in a graphical mode (i.e. scalable dotplot) allowing the users to visually validate the correctness of scaffolded contigs and in a tabular mode allowing the users to view the details of scaffolds. CSAR-web is available online at http://genome.cs.nthu.edu.tw/CSAR-web.

  4. A 2-megabase physical contig incorporating 43 DNA markers on the human X chromosome at p11.23-p11.22 from ZNF21 to DXS255

    Energy Technology Data Exchange (ETDEWEB)

    Boycott, K.M.; Bech-Hansen, N.T. [Univ. of Calgary, Alberta (Canada)]; Halley, G.R.; Schlessinger, D. [Washington Univ. School of Medicine, St. Louis, MO (United States)]

    1996-05-01

    A comprehensive physical contig of yeast artificial chromosomes (YACs) and cosmid clones between ZNF21 and DXS255 has been constructed, spanning 2 Mb within the region Xp11.23-p11.22. As a portion of the region was found to be particularly unstable in yeast, the integrity of the contig is dependent on additional information provided by the sequence-tagged site (STS) content of cosmid clones and DNA marker retention in conventional and radiation hybrids. The contig was formatted with 43 DNA markers, including 19 new STSs from YAC insert ends and an internal Alu-PCR product. The density of STSs across the contig ranges from one marker every 20 kb to one every 60 kb, with an average density of one marker every 50 kb. The relative order of previously known gene and expressed sequence tags in this region is predicted to be Xpter-ZNF21-DXS7465E (MG66)-DXS7927E (MG81)-WASP, DXS1011E, DXS7467E (MG21)-DXS7466E (MG44)-GATA1-DXS7469E (Xp664)-TFE3-SYP (DXS1007E)-Xcen. This contig extends the coverage in Xp11 and provides a framework for the future identification and mapping of new genes, as well as the resources for developing DNA sequencing templates. 47 refs., 1 fig., 4 tabs.

  5. Generation and analysis of large-scale expressed sequence tags (ESTs) from a full-length enriched cDNA library of porcine backfat tissue

    Directory of Open Access Journals (Sweden)

    Lee Hae-Young

    2006-02-01

    Background: Genome research in farm animals will expand our basic knowledge of the genetic control of complex traits, and the results will be applied in the livestock industry to improve meat quality and productivity, as well as to reduce the incidence of disease. A combination of quantitative trait locus mapping and microarray analysis is a useful approach to reduce the overall effort needed to identify genes associated with quantitative traits of interest. Results: We constructed a full-length enriched cDNA library from porcine backfat tissue. The estimated average size of the cDNA inserts was 1.7 kb, and the cDNA fullness ratio was 70%. In total, we deposited 16,110 high-quality sequences in the dbEST division of GenBank (accession numbers: DT319652-DT335761). For all the expressed sequence tags (ESTs), approximately 10.9 Mb of porcine sequence was generated with an average length of 674 bp per EST (range: 200–952 bp). Clustering and assembly of these ESTs resulted in a total of 5,008 unique sequences with 1,776 contigs (35.46%) and 3,232 singleton (64.54%) ESTs. From a total of 5,008 unique sequences, 3,154 (62.98%) were similar to other sequences, and 1,854 (37.02%) were identified as having no hit or low identity (Sus scrofa. Gene ontology (GO) annotation of unique sequences showed that approximately 31.7, 32.3, and 30.8% were assigned molecular function, biological process, and cellular component GO terms, respectively. A total of 1,854 putative novel transcripts resulted after comparison and filtering with the TIGR SsGI; these included a large percentage of singletons (80.64%) and a small proportion of contigs (13.36%). Conclusion: The sequence data generated in this study will provide valuable information for studying expression profiles using EST-based microarrays and assist in the condensation of current pig TCs into clusters representing longer stretches of cDNA sequences. The isolation of genes expressed in backfat tissue is the

  6. A base composition analysis of natural patterns for the preprocessing of metagenome sequences.

    Science.gov (United States)

    Bonham-Carter, Oliver; Ali, Hesham; Bastola, Dhundy

    2013-01-01

    On the pretext that sequence reads and contigs often exhibit the same kinds of base usage that is also observed in the sequences from which they are derived, we offer a base composition analysis tool. Our tool uses these natural patterns to determine relatedness across sequence data. We introduce spectrum sets (sets of motifs) which are permutations of bacterial restriction sites and the base composition analysis framework to measure their proportional content in sequence data. We suggest that this framework will increase the efficiency during the pre-processing stages of metagenome sequencing and assembly projects. Our method is able to differentiate organisms and their reads or contigs. The framework shows how to successfully determine the relatedness between these reads or contigs by comparison of base composition. In particular, we show that two types of organismal-sequence data are fundamentally different by analyzing their spectrum set motif proportions (coverage). By the application of one of the four possible spectrum sets, encompassing all known restriction sites, we provide the evidence to claim that each set has a different ability to differentiate sequence data. Furthermore, we show that the spectrum set selection having relevance to one organism, but not to the others of the data set, will greatly improve performance of sequence differentiation even if the fragment size of the read, contig or sequence is not lengthy. We show the proof of concept of our method by its application to ten trials of two or three freshly selected sequence fragments (reads and contigs) for each experiment across the six organisms of our set. Here we describe a novel and computationally effective pre-processing step for metagenome sequencing and assembly tasks. Furthermore, our base composition method has applications in phylogeny where it can be used to infer evolutionary distances between organisms based on the notion that related organisms often have much conserved code.

  7. Assembling large, complex environmental metagenomes

    Energy Technology Data Exchange (ETDEWEB)

    Howe, A. C. [Michigan State Univ., East Lansing, MI (United States). Microbiology and Molecular Genetics, Plant Soil and Microbial Sciences; Jansson, J. [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Earth Sciences Division; Malfatti, S. A. [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Tringe, S. G. [USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Tiedje, J. M. [Michigan State Univ., East Lansing, MI (United States). Microbiology and Molecular Genetics, Plant Soil and Microbial Sciences; Brown, C. T. [Michigan State Univ., East Lansing, MI (United States). Microbiology and Molecular Genetics, Computer Science and Engineering

    2012-12-28

    The large volumes of sequencing data required to sample complex environments deeply pose new challenges to sequence analysis approaches. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires significant computational resources. We apply two pre-assembly filtering approaches, digital normalization and partitioning, to make large metagenome assemblies more computationally tractable. Using a human gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes from matched Iowa corn and native prairie soils. The predicted functional content and phylogenetic origin of the assembled contigs indicate significant taxonomic differences despite similar function. The assembly strategies presented are generic and can be extended to any metagenome; full source code is freely available under a BSD license.
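    The abstract above mentions digital normalization as a pre-assembly filter but does not describe it. As it is commonly described, the idea is to estimate a read's coverage from the median abundance of its k-mers among the reads kept so far, and to discard the read once that estimate reaches a cutoff. The sketch below is a simplified illustration under stated assumptions: the function name, the parameter defaults, and the exact in-memory dictionary are choices made for this example and are not the authors' released code, which targets much larger data sets.

```python
from statistics import median


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only while the median count of its k-mers is below `cutoff`.

    Counts are accumulated only from reads that were kept, so redundant
    high-coverage reads stop contributing once coverage is saturated.
    """
    kmer_counts = {}  # exact counting; an assumption made for clarity
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k: no coverage estimate possible
        estimated_coverage = median(kmer_counts.get(km, 0) for km in kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            for km in kmers:
                kmer_counts[km] = kmer_counts.get(km, 0) + 1
    return kept
```

    With five identical copies of a read and a cutoff of 3, only the first three copies are kept; reads from low-coverage regions pass through unchanged.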

  8. Utilization of deletion bins to anchor and order sequences along the wheat 7B chromosome.

    Science.gov (United States)

    Belova, Tatiana; Grønvold, Lars; Kumar, Ajay; Kianian, Shahryar; He, Xinyao; Lillemo, Morten; Springer, Nathan M; Lien, Sigbjørn; Olsen, Odd-Arne; Sandve, Simen R

    2014-09-01

    A total of 3,671 sequence contigs and scaffolds were mapped to deletion bins on wheat chromosome 7B, providing a foundation for developing a high-resolution integrated physical map for this chromosome. Bread wheat (Triticum aestivum L.) has a large, complex and highly repetitive genome which is challenging to assemble into high quality pseudo-chromosomes. As part of the international effort to sequence the hexaploid bread wheat genome by the International Wheat Genome Sequencing Consortium (IWGSC), we are focused on assembling a reference sequence for chromosome 7B. The successful completion of the reference chromosome sequence is highly dependent on the integration of genetic and physical maps. To aid the integration of these two types of maps, we have constructed a high-density deletion bin map of chromosome 7B. Using the 270 K Nimblegen comparative genomic hybridization (CGH) array on a set of cv. Chinese Spring deletion lines, a total of 3,671 sequence contigs and scaffolds (~7.8 % of chromosome 7B physical length) were mapped into nine deletion bins. Our method of genotyping deletions on chromosome 7B relied on a model-based clustering algorithm (Mclust) to accurately predict the presence or absence of a given genomic sequence in a deletion line. The bin mapping results were validated using three different approaches, viz. (a) PCR-based amplification of randomly selected bin mapped sequences, (b) comparison with previously mapped ESTs, and (c) comparison with a 7B genetic map developed in the present study. Validation of the bin mapping results suggested a high accuracy of the assignment of 7B sequence contigs and scaffolds to the 7B deletion bins.

  9. Deep RNA sequencing of the skeletal muscle transcriptome in swimming fish.

    Directory of Open Access Journals (Sweden)

    Arjan P Palstra

    Deep RNA sequencing (RNA-seq) was performed to provide an in-depth view of the transcriptome of red and white skeletal muscle of exercised and non-exercised rainbow trout (Oncorhynchus mykiss) with the specific objective to identify expressed genes and quantify the transcriptomic effects of swimming-induced exercise. Pubertal autumn-spawning seawater-raised female rainbow trout were rested (n = 10) or swum (n = 10) for 1176 km at 0.75 body-lengths per second in a 6,000-L swim-flume under reproductive conditions for 40 days. Red and white muscle RNA of exercised and non-exercised fish (4 lanes) was sequenced and resulted in 15-17 million reads per lane that, after de novo assembly, yielded 149,159 red and 118,572 white muscle contigs. Most contigs were annotated using an iterative homology search strategy against salmonid ESTs, the zebrafish Danio rerio genome and general Metazoan genes. When selecting for large contigs (>500 nucleotides), a number of novel rainbow trout gene sequences were identified in this study: 1,085 and 1,228 novel gene sequences for red and white muscle, respectively, which included a number of important molecules for skeletal muscle function. Transcriptomic analysis revealed that sustained swimming increased transcriptional activity in skeletal muscle and specifically an up-regulation of genes involved in muscle growth and developmental processes in white muscle. The unique collection of transcripts will contribute to our understanding of red and white muscle physiology, specifically during the long-term reproductive migration of salmonids.

  10. A Scaffold Analysis Tool Using Mate-Pair Information in Genome Sequencing

    Directory of Open Access Journals (Sweden)

    Pan-Gyu Kim

    2008-01-01

    Full Text Available We have developed a Windows-based program, ConPath, as a scaffold analyzer. ConPath constructs scaffolds by ordering and orienting separate sequence contigs by exploiting the mate-pair information between contig pairs. Our algorithm builds directed graphs from link information and traverses them to find the longest acyclic graphs. Using end read pairs of fixed-sized mate-pair libraries, ConPath determines relative orientations of all contigs, estimates the gap size of each adjacent contig pair, and reports wrong assembly information by validating orientations and gap sizes. We have utilized ConPath in more than 10 microbial genome projects, including Mannheimia succiniciproducens and Vibrio vulnificus, where we verified contig assembly and identified several erroneous contigs using the four types of error defined in ConPath. ConPath also supports some convenient features and viewers that permit investigation of each contig in detail; these include a contig viewer, a scaffold viewer, an edge information list, a mate-pair list, and the printing of complex scaffold structures.
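    The graph step described in this abstract can be illustrated with a minimal sketch. It is not ConPath's code: the function name `longest_scaffold_path` and the input format (pairs of contig names, upstream then downstream, already oriented) are assumptions made for illustration, and orientation checks and gap-size estimation are left out. It builds a directed graph from mate-pair links and returns the longest chain of contigs, rejecting cyclic link sets.

```python
from collections import defaultdict


def longest_scaffold_path(links):
    """Return the longest chain of contigs in a directed acyclic link graph.

    links: iterable of (upstream, downstream) contig-name pairs, each pair
    standing for mate-pair evidence that `downstream` follows `upstream`.
    (Hypothetical input format; real link data also carries orientation
    and insert-size information.)
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for up, down in links:
        graph[up].append(down)
        indegree[down] += 1
        nodes.update((up, down))
    if not nodes:
        return []

    # Topological order (Kahn's algorithm); fails if the links form a cycle.
    order = []
    ready = sorted(n for n in nodes if indegree[n] == 0)
    while ready:
        node = ready.pop()
        order.append(node)
        for nxt in graph[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("link graph contains a cycle")

    # Dynamic programming over the topological order: longest path ending at each node.
    best_len = {n: 1 for n in nodes}
    prev = {n: None for n in nodes}
    for node in order:
        for nxt in graph[node]:
            if best_len[node] + 1 > best_len[nxt]:
                best_len[nxt] = best_len[node] + 1
                prev[nxt] = node

    # Walk back from the node that ends the longest path.
    end = max(sorted(nodes), key=lambda n: best_len[n])
    path = []
    while end is not None:
        path.append(end)
        end = prev[end]
    return path[::-1]
```

    For example, the links `[("c1", "c2"), ("c2", "c3"), ("c1", "c4")]` give the scaffold order `c1, c2, c3`, with `c4` left as a side branch for manual inspection.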

  11. Mining and gene ontology based annotation of SSR markers from expressed sequence tags of Humulus lupulus

    Science.gov (United States)

    Singh, Swati; Gupta, Sanchita; Mani, Ashutosh; Chaturvedi, Anoop

    2012-01-01

    Humulus lupulus, commonly known as hops, is a member of the family Moraceae. Many ongoing projects are leading to the accumulation of voluminous genomic and expressed sequence tag (EST) sequences in public databases. The genetically characterized domains in these databases are limited due to the non-availability of reliable molecular markers. A large body of EST sequence data is available for hops. In the present study, simple sequence repeat (SSR) markers extracted from EST data are used as molecular markers for genetic characterization. A total of 25,495 EST sequences were examined and assembled to obtain full-length sequences. The highest frequency was shown by mononucleotide SSR motifs (60.44% in contigs and 62.16% in singletons), whereas the lowest frequencies were observed for hexanucleotide SSRs in contigs (0.09%) and pentanucleotide SSRs in singletons (0.12%). The most frequent trinucleotide motif (GAA) codes for glutamic acid, while AT/TA was the most frequent repeat among dinucleotide SSRs. Flanking primer pairs were designed in silico for the SSR-containing sequences. Functional categorization of SSR-containing sequences was done through gene ontology terms such as biological process, cellular component and molecular function. PMID:22368382
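    As a rough illustration of how SSR motifs of one to six nucleotides can be located in assembled EST sequences, here is a small regular-expression sketch. The abstract does not name the mining tool or its settings, so the function `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` are assumptions chosen only for the example.

```python
import re

# Assumed minimum number of tandem copies per motif length (mono- to hexanucleotide).
# These thresholds are illustrative, not those used in the study.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return (motif, copy_number, start) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit (e.g. "ATAT").
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((motif, len(match.group(0)) // unit_len, match.start()))
    return hits
```

    Counting the returned motifs by length over all contigs and singletons would give frequency distributions of the kind reported above.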

  12. Sequence Capture and Phylogenetic Utility of Genomic Ultraconserved Elements Obtained from Pinned Insect Specimens.

    Directory of Open Access Journals (Sweden)

    Bonnie B Blaimer

    Full Text Available Obtaining sequence data from historical museum specimens has been a growing research interest, invigorated by next-generation sequencing methods that allow inputs of highly degraded DNA. We applied a target enrichment and next-generation sequencing protocol to generate ultraconserved elements (UCEs) from 51 large carpenter bee specimens (genus Xylocopa), representing 25 species with specimen ages ranging from 2-121 years. We measured the correlation between specimen age and DNA yield (pre- and post-library preparation DNA concentration) and several UCE sequence capture statistics (raw read count, UCE reads on target, UCE mean contig length and UCE locus count) with linear regression models. We performed piecewise regression to test for specific breakpoints in the relationship of specimen age and DNA yield and sequence capture variables. Additionally, we compared UCE data from newer and older specimens of the same species and reconstructed their phylogeny in order to confirm the validity of our data. We recovered 6-972 UCE loci from samples with pre-library DNA concentrations ranging from 0.06-9.8 ng/μL. All investigated DNA yield and sequence capture variables were significantly but only moderately negatively correlated with specimen age. Specimens of age 20 years or less had significantly higher pre- and post-library concentrations, UCE contig lengths, and locus counts compared to specimens older than 20 years. We found breakpoints in our data indicating a decrease of the initial detrimental effect of specimen age on pre- and post-library DNA concentration and UCE contig length starting around 21-39 years after preservation. Our phylogenetic results confirmed the integrity of our data, giving preliminary insights into relationships within Xylocopa. We consider the effect of additional factors not measured in this study on our age-related sequence capture results, such as DNA fragmentation and preservation method, and discuss the promise of the UCE

  13. The peculiar landscape of repetitive sequences in the olive (Olea europaea L.) genome.

    Science.gov (United States)

    Barghini, Elena; Natali, Lucia; Cossu, Rosa Maria; Giordani, Tommaso; Pindo, Massimo; Cattonaro, Federica; Scalabrin, Simone; Velasco, Riccardo; Morgante, Michele; Cavallini, Andrea

    2014-04-01

    Analyzing genome structure in different species provides insight into the evolution of plant genome size. Olive (Olea europaea L.) has a medium-sized haploid genome of 1.4 Gb, whose structure is largely uncharacterized, despite the growing importance of this tree as an oil crop. Next-generation sequencing technologies and different computational procedures have been used to study the composition of the olive genome and its repetitive fraction. A total of 2.03 and 2.3 genome equivalents of Illumina and 454 reads from genomic DNA, respectively, were assembled following different procedures, which produced more than 200,000 differently redundant contigs, with a mean length greater than 1,000 nt. Mapping Illumina reads onto the assembled sequences was used to estimate their redundancy. The genome data set was subdivided into highly redundant, medium redundant, and nonredundant contigs. By combining identification and mapping of repeated sequences, it was established that tandem repeats represent a very large portion of the olive genome (∼31% of the whole genome), consisting of six main families of different length, two of which were first discovered in these experiments. The other large redundant class in the olive genome is represented by transposable elements (especially long terminal repeat-retrotransposons). On the whole, the results of our analyses show the peculiar landscape of the olive genome, related to the massive amplification of tandem repeats, more than that reported for any other sequenced plant genome.

  14. Transcriptome characterization of the South African abalone Haliotis midae using sequencing-by-synthesis

    Directory of Open Access Journals (Sweden)

    Roodt-Wilding Rouvay

    2011-03-01

    Full Text Available Background: Worldwide, the genus Haliotis is represented by 56 extant species and several of these are commercially cultured. Among the six abalone species found in South Africa, Haliotis midae is the only aquacultured species. Despite its economic importance, genomic sequence resources for H. midae, and for abalone in general, are still scarce. Next generation sequencing technologies provide a fast and efficient tool to generate large sequence collections that can be used to characterize the transcriptome and identify expressed genes associated with economically important traits like growth and disease resistance. Results: More than 25 million short reads generated by the Illumina Genome Analyzer were de novo assembled into 22,761 contigs with an average size of 260 bp. With a stringent E-value threshold of 10^-10, 3,841 contigs (16.8%) had a BLAST homologous match against the GenBank non-redundant (NR) protein database. Most of these sequences were annotated using the gene ontology (GO) and eukaryotic orthologous groups of proteins (KOG) databases and assigned to various functional categories. According to annotation results, many gene families involved in immune response were identified. Thousands of simple sequence repeats (SSR) and single nucleotide polymorphisms (SNP) were detected. Setting stringent parameters to ensure a high probability of amplification, 420 primer pairs in 181 contigs containing SSR loci were designed. Conclusion: This data represents the most comprehensive genomic resource for the South African abalone H. midae to date. The amount of assembled sequences demonstrated the utility of the Illumina sequencing technology in the transcriptome characterization of a non-model species. It allowed the development of several markers and the identification of promising candidate genes for future studies on population and functional genomics in H. midae and in other abalone species.

  15. Identification of human chromosome 22 transcribed sequences with ORF expressed sequence tags

    Science.gov (United States)

    de Souza, Sandro J.; Camargo, Anamaria A.; Briones, Marcelo R. S.; Costa, Fernando F.; Nagai, Maria Aparecida; Verjovski-Almeida, Sergio; Zago, Marco A.; Andrade, Luis Eduardo C.; Carrer, Helaine; El-Dorry, Hamza F. A.; Espreafico, Enilza M.; Habr-Gama, Angelita; Giannella-Neto, Daniel; Goldman, Gustavo H.; Gruber, Arthur; Hackel, Christine; Kimura, Edna T.; Maciel, Rui M. B.; Marie, Suely K. N.; Martins, Elizabeth A. L.; Nóbrega, Marina P.; Paçó-Larson, Maria Luisa; Pardini, Maria Inês M. C.; Pereira, Gonçalo G.; Pesquero, João Bosco; Rodrigues, Vanderlei; Rogatto, Silvia R.; da Silva, Ismael D. C. G.; Sogayar, Mari C.; de Fátima Sonati, Maria; Tajara, Eloiza H.; Valentini, Sandro R.; Acencio, Marcio; Alberto, Fernando L.; Amaral, Maria Elisabete J.; Aneas, Ivy; Bengtson, Mário Henrique; Carraro, Dirce M.; Carvalho, Alex F.; Carvalho, Lúcia Helena; Cerutti, Janete M.; Corrêa, Maria Lucia C.; Costa, Maria Cristina R.; Curcio, Cyntia; Gushiken, Tsieko; Ho, Paulo L.; Kimura, Elza; Leite, Luciana C. C.; Maia, Gustavo; Majumder, Paromita; Marins, Mozart; Matsukuma, Adriana; Melo, Analy S. A.; Mestriner, Carlos Alberto; Miracca, Elisabete C.; Miranda, Daniela C.; Nascimento, Ana Lucia T. O.; Nóbrega, Francisco G.; Ojopi, Élida P. B.; Pandolfi, José Rodrigo C.; Pessoa, Luciana Gilbert; Rahal, Paula; Rainho, Claudia A.; da Ro's, Nancy; de Sá, Renata G.; Sales, Magaly M.; da Silva, Neusa P.; Silva, Tereza C.; da Silva, Wilson; Simão, Daniel F.; Sousa, Josane F.; Stecconi, Daniella; Tsukumo, Fernando; Valente, Valéria; Zalcberg, Heloisa; Brentani, Ricardo R.; Reis, Luis F. L.; Dias-Neto, Emmanuel; Simpson, Andrew J. G.

    2000-01-01

    Transcribed sequences in the human genome can be identified with confidence only by alignment with sequences derived from cDNAs synthesized from naturally occurring mRNAs. We constructed a set of 250,000 cDNAs that represent partial expressed gene sequences and that are biased toward the central coding regions of the resulting transcripts. They are termed ORF expressed sequence tags (ORESTES). The 250,000 ORESTES were assembled into 81,429 contigs. Of these, 1,181 (1.45%) were found to match sequences in chromosome 22 with at least one ORESTES contig for 162 (65.6%) of the 247 known genes, for 67 (44.6%) of the 150 related genes, and for 45 of the 148 (30.4%) EST-predicted genes on this chromosome. Using a set of stringent criteria to validate our sequences, we identified a further 219 previously unannotated transcribed sequences on chromosome 22. Of these, 171 were in fact also defined by EST or full length cDNA sequences available in GenBank but not utilized in the initial annotation of the first human chromosome sequence. Thus despite representing less than 15% of all expressed human sequences in the public databases at the time of the present analysis, ORESTES sequences defined 48 transcribed sequences on chromosome 22 not defined by other sequences. All of the transcribed sequences defined by ORESTES coincided with DNA regions predicted as encoding exons by GENSCAN (http://genes.mit.edu/GENSCAN.html). PMID:11070084

  16. Projector : automatic contig mapping for gap closure purposes

    NARCIS (Netherlands)

    van Hijum, SAFT; Zomer, AL; Kuipers, OP; Kok, J

    2003-01-01

    Projector was designed for automatic positioning of contigs from an unfinished prokaryotic genome onto a template genome of a closely related strain or species. Projector mapped 84 contigs of Lactococcus lactis MG1363 (corresponding to 81% of the assembly nucleotides) against the genome of L. lactis

  17. [Reconstruction of long polynucleotide sequences from fragments using the Iskra-226 personal computer]

    Science.gov (United States)

    Kostetskiĭ, P V; Dobrova, I E

    1988-04-01

    An algorithm for reconstructing long DNA sequences, i.e. arranging all overlapping gel readings into contigs, and the corresponding BASIC programme for the personal computer "Iskra-226" (USSR) are described. Contig construction begins with a search for all fragments overlapping the basic (longest) one, followed by determination of the coordinates of the 5' ends of the overlapping fragments. Then the gel reading with the minimal 5' end coordinate and the gel reading with the maximal 3' end coordinate are selected and used as the basic ones in the next assembly steps. The procedure is finished when no gel reading overlapping the basic one can be found. All gel readings that have entered a contig are ignored in the subsequent steps of the assembly. Finally, one or several contigs consisting of DNA fragments are obtained. The effectiveness of the algorithm was tested on a model based on the multiple assembly of the nucleotide sequence encoding the Na,K-ATPase alpha-subunit of pig kidney. The programme requires no user intervention and can build contigs up to 10,000 nucleotides long.
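    The seed-and-extend procedure outlined in this abstract can be sketched in a few lines of Python. This is an illustrative simplification, not a port of the original BASIC programme: it assumes error-free reads on a single strand, exact suffix-prefix overlaps, and a hypothetical minimum overlap `min_len`; the names `overlap` and `greedy_contigs` are invented for the example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_contigs(reads, min_len=4):
    """Build contigs by repeatedly extending the longest unused read."""
    # Longest read first; ties broken alphabetically so the result is deterministic.
    pool = sorted(set(reads), key=lambda r: (-len(r), r))
    contigs = []
    while pool:
        contig = pool.pop(0)  # the "basic" (longest remaining) fragment
        extended = True
        while extended:
            extended = False
            for read in list(pool):
                if read in contig:
                    pass  # fully contained read: absorb it without extending
                else:
                    right = overlap(contig, read, min_len)  # read extends the 3' side
                    left = overlap(read, contig, min_len)   # read extends the 5' side
                    if right > 0 and right >= left:
                        contig = contig + read[right:]
                    elif left > 0:
                        contig = read[:-left] + contig
                    else:
                        continue  # no overlap with the current contig
                # Reads that entered the contig are ignored in later steps.
                pool.remove(read)
                extended = True
        contigs.append(contig)
    return contigs
```

    With the reads `ACGTACGGA`, `CGGATTCA` and `TTCAGGCT`, the sketch returns the single contig `ACGTACGGATTCAGGCT`; a read with no overlap to the others ends up as a contig of its own.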

  18. Validation of rice genome sequence by optical mapping

    Directory of Open Access Journals (Sweden)

    Pape Louise

    2007-08-01

    Full Text Available Background: Rice feeds much of the world, and possesses the simplest genome analyzed to date within the grass family, making it an economically relevant model system for other cereal crops. Although the rice genome is sequenced, validation and gap closing efforts require purely independent means for accurate finishing of sequence build data. Results: To facilitate ongoing sequencing finishing and validation efforts, we have constructed a whole-genome SwaI optical restriction map of the rice genome. The physical map consists of 14 contigs, covering 12 chromosomes, with a total genome size of 382.17 Mb; this value is about 11% smaller than original estimates. Nine of the 14 optical map contigs are without gaps, covering chromosomes 1, 2, 3, 4, 5, 7, 8, 10, and 12 in their entirety, including centromeres and telomeres. Alignments between optical and in silico restriction maps constructed from IRGSP (International Rice Genome Sequencing Project) and TIGR (The Institute for Genomic Research) genome sequence sources are comprehensive and informative, evidenced by map coverage across virtually all published gaps, discovery of new ones, and characterization of sequence misassemblies; all totalling ~14 Mb. Furthermore, since optical maps are ordered restriction maps, identified discordances are pinpointed on a reliable physical scaffold providing an independent resource for closure of gaps and rectification of misassemblies. Conclusion: Analysis of sequence and optical mapping data effectively validates genome sequence assemblies constructed from large, repeat-rich genomes. Given this conclusion we envision new applications of such single molecule analysis that will merge advantages offered by high-resolution optical maps with inexpensive, but short sequence reads generated by emerging sequencing platforms. Lastly, the map construction techniques presented here point the way to new types of comparative genome analysis that would focus on discernment of

  19. Physical mapping in highly heterozygous genomes: a physical contig map of the Pinot Noir grapevine cultivar

    Directory of Open Access Journals (Sweden)

    Jurman Irena

    2010-03-01

    Full Text Available Background: Most of the grapevine (Vitis vinifera L.) cultivars grown today are those selected centuries ago, even though grapevine is one of the most important fruit crops in the world. Grapevine has therefore not benefited from the advances in modern plant breeding nor more recently from those in molecular genetics and genomics: genes controlling important agronomic traits are practically unknown. A physical map is essential to positionally clone such genes and instrumental in a genome sequencing project. Results: We report on the first whole genome physical map of grapevine built using high information content fingerprinting of 49,104 BAC clones from the cultivar Pinot Noir. Pinot Noir, as most grape varieties, is highly heterozygous at the sequence level. This resulted in the two allelic haplotypes sometimes assembling into separate contigs that had to be accommodated in the map framework or in local expansions of contig maps. We performed computer simulations to assess the effects of increasing levels of sequence heterozygosity on BAC fingerprint assembly and showed that the experimental assembly results are in full agreement with the theoretical expectations, given the heterozygosity levels reported for grape. The map is anchored to a dense linkage map consisting of 994 markers. A total of 436 contigs are anchored to the genetic map, covering 342 of the 475 Mb that make up the grape haploid genome. Conclusions: We have developed a resource that makes it possible to access the grapevine genome, opening the way to a new era both in grape genetics and breeding and in wine making. The effects of heterozygosity on the assembly have been analyzed and characterized by using several complementary approaches which could be easily transferred to the study of other genomes which present the same features.

  20. Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery

    Directory of Open Access Journals (Sweden)

    Benkman Craig W

    2010-03-01

    Full Text Available Background: Massively parallel sequencing of cDNA is now an efficient route for generating enormous sequence collections that represent expressed genes. This approach provides a valuable starting point for characterizing functional genetic variation in non-model organisms, especially where whole genome sequencing efforts are currently cost and time prohibitive. The large and complex genomes of pines (Pinus spp.) have hindered the development of genomic resources, despite the ecological and economical importance of the group. While most genomic studies have focused on a single species (P. taeda), genomic level resources for other pines are insufficiently developed to facilitate ecological genomic research. Lodgepole pine (P. contorta) is an ecologically important foundation species of montane forest ecosystems and exhibits substantial adaptive variation across its range in western North America. Here we describe a sequencing study of expressed genes from P. contorta, including their assembly and annotation, and their potential for molecular marker development to support population and association genetic studies. Results: We obtained 586,732 sequencing reads from a 454 GS XLR70 Titanium pyrosequencer (mean length: 306 base pairs). A combination of reference-based and de novo assemblies yielded 63,657 contigs, with 239,793 reads remaining as singletons. Based on sequence similarity with known proteins, these sequences represent approximately 17,000 unique genes, many of which are well covered by contig sequences. This sequence collection also included a surprisingly large number of retrotransposon sequences, suggesting that they are highly transcriptionally active in the tissues we sampled. We located and characterized thousands of simple sequence repeats and single nucleotide polymorphisms as potential molecular markers in our assembled and annotated sequences. High quality PCR primers were designed for a substantial number of the SSR loci

  1. YAC contig information - RGP physicalmap | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available 8908/lsdba.nbdc00318-06-001 Description of data contents YAC contigs on the rice chromosomes Data file File name: rgp_physical...map_yac_contigs.zip File URL: ftp://ftp.biosciencedbc.jp/archive/rgp-physicalmap/LATEST/rgp_physical...sciencedbc.jp/togodb/view/rgp_physicalmap_yac_contigs#en Data acquisition method The range including YAC con...m Description Chrom. No. Chromosome number Region Region number Physical map image The file name of rice physical...n Download License Update History of This Database Site Policy | Contact Us YAC contig information - RGP physicalmap | LSDB Archive ...

  2. Draft genome sequence of Acidithiobacillus ferrooxidans YQH-1

    Directory of Open Access Journals (Sweden)

    Lei Yan

    2015-12-01

    Full Text Available Acidithiobacillus ferrooxidans YQH-1 is a moderate acidophilic bacterium isolated from a river in a volcano of Northeast China. Here, we describe the draft genome of strain YQH-1, which was assembled into 123 contigs containing 3,111,222 bp with a G + C content of 58.63%. A large number of genes related to carbon dioxide fixation, dinitrogen fixation, pH tolerance, heavy metal detoxification, and oxidative stress defense were detected. The genome sequence can be accessed at DDBJ/EMBL/GenBank under the accession no. LJBT00000000.

  3. Gene content and organization of a 281-kbp contig from the genome of the extremely thermophilic archaeon, Sulfolobus solfataricus P2

    NARCIS (Netherlands)

    Charlebois, R.; Confalonieri, F.; Curtis, B.; Doolittle, W.F.; Duguet, M.; Erauso, G.; Faguy, D.; Gaasterland, T.; Garrett, R.A.; Gordon, P.; Kozera, C.; Medina, N.; Oost, van der J.; Peng, X.; Ragan, M.; She, Q.; Singh, R.K.

    2000-01-01

    The sequence of a 281-kbp contig from the crenarchaeote Sulfolobus solfataricus P2 was determined and analysed. Notable features in this region include 29 ribosomal protein genes, 12 tRNA genes (four of which contain archaeal-type introns), operons encoding enzymes of histidine biosynthesis,

  4. Complete Genome Sequence of the Probiotic Lactic Acid Bacterium Lactobacillus Rhamnosus

    Directory of Open Access Journals (Sweden)

    Samat Kozhakhmetov

    2014-01-01

    Full Text Available Introduction: Lactobacilli are bacteria commonly found in the gastrointestinal tract. Some species of this genus have probiotic properties. The most common of these is Lactobacillus rhamnosus, a microorganism generally regarded as safe (GRAS). It is also a homofermentative L-(+)-lactic acid producer. The genus Lactobacillus is characterized by an extraordinary degree of phenotypic and genotypic diversity. However, studies of the genus were conducted mostly with an unequally distributed, non-random choice of species for sequencing; thus, there is only one representative genome from the Lactobacillus rhamnosus clade available to date. The aim of this study was to characterize the genome sequencing of selected strains of Lactobacilli. Methods: 109 samples were isolated from national domestic dairy products in the laboratory of the Center for Life Sciences. After screening isolates for probiotic properties, a highly active Lactobacillus spp. strain was chosen. Genomic DNA was extracted according to the manufacturer's protocol (Wizard® Genomic DNA Purification Kit). The highly active Lactobacillus strain was identified as Lactobacillus rhamnosus according to its morphological, cultural, physiological, and biochemical properties, and a genotypic analysis. Results: The genome of Lactobacillus rhamnosus was sequenced using the Roche 454 GS FLX (454 GS FLX) platform. The initial draft assembly was prepared from 14 large contigs (20 contigs in total) by the Newbler gsAssembler 2.3 (454 Life Sciences, Branford, CT). Conclusion: Full genome sequencing of selected strains of lactic acid bacteria was carried out during the study.

  5. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.

    Science.gov (United States)

    Peng, Yu; Leung, Henry C M; Yiu, S M; Chin, Francis Y L

    2012-06-01

    Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that the sequencing depths of different regions of a genome, or of genomes from different species, are highly uneven. Most existing genome assemblers assume that sequencing depths are even, and therefore fail to construct correct long contigs. We introduce the IDBA-UD algorithm, which is based on the de Bruijn graph approach, for assembling reads from single-cell sequencing or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques have been employed to tackle the problems. Instead of using a simple threshold, we use multiple depth-relative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. The technique of local assembly with paired-end information is used to solve the branch problem of low-depth short repeat regions. To speed up the process, an error correction step is conducted to correct reads of high-depth regions that can be aligned to high-confidence contigs. Comparison of the performance of IDBA-UD and existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) on different datasets shows that IDBA-UD can reconstruct longer contigs with higher accuracy. The IDBA-UD toolkit is available at our website http://www.cs.hku.hk/~alse/idba_ud
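    The idea of a depth-relative threshold, as opposed to a single fixed cutoff, can be shown with a deliberately simplified sketch. It is not the IDBA-UD implementation: the functions `count_kmers` and `filter_relative`, the neighbour definition, and the `ratio` value of 0.2 are all assumptions made for illustration. A k-mer is discarded only when its count is small relative to the best-supported k-mer adjacent to it in the de Bruijn graph, so genuine k-mers from low-depth regions can survive while errors in high-depth regions are removed.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_relative(counts, ratio=0.2):
    """Keep k-mers whose count is at least `ratio` times the local depth.

    Local depth is taken here as the highest count among the k-mer itself and
    its possible de Bruijn neighbours (k-mers overlapping it by k-1 bases).
    Both this definition and the default ratio are illustrative assumptions.
    """
    kept = {}
    for kmer, count in counts.items():
        successors = [kmer[1:] + base for base in "ACGT"]
        predecessors = [base + kmer[:-1] for base in "ACGT"]
        local_depth = max(
            [count] + [counts.get(n, 0) for n in successors + predecessors]
        )
        if count >= ratio * local_depth:
            kept[kmer] = count
    return kept
```

    In a toy data set with 60 copies of `ACGTT`, 3 copies of an erroneous `ACGTA` and 2 copies of a low-depth read `GGCCA`, the k-mer `GTA` (count 3) is removed because its neighbour `CGT` has count 63, whereas the k-mers of `GGCCA` (count 2) are kept. No single fixed cutoff separates the two cases.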

  6. Gene discovery and molecular marker development, based on high-throughput transcript sequencing of Paspalum dilatatum Poir.

    Directory of Open Access Journals (Sweden)

    Andrea Giordano

    Full Text Available BACKGROUND: Paspalum dilatatum Poir. (common name dallisgrass) is a native grass species of South America, with special relevance to dairy and red meat production. P. dilatatum exhibits higher forage quality than other C4 forage grasses and is tolerant to frost and water stress. This species is predominantly cultivated in an apomictic monoculture, with an inherent high risk that biotic and abiotic stresses could potentially devastate productivity. Therefore, advanced breeding strategies that characterise and use available genetic diversity, or assess germplasm collections effectively, are required to deliver advanced cultivars for production systems. However, there are limited genomic resources available for this forage grass species. RESULTS: Transcriptome sequencing using second-generation sequencing platforms has been employed using pooled RNA from different tissues (stems, roots, leaves and inflorescences) at the final reproductive stage of P. dilatatum cultivar Primo. A total of 324,695 sequence reads were obtained, corresponding to c. 102 Mbp. The sequences were assembled, generating 20,169 contigs of a combined length of 9,336,138 nucleotides. The contigs were BLAST analysed against the fully sequenced grass species of Oryza sativa subsp. japonica, Brachypodium distachyon, the closely related Sorghum bicolor and foxtail millet (Setaria italica) genomes as well as against the UniRef 90 protein database, allowing a comprehensive gene ontology analysis to be performed. The contigs generated from the transcript sequencing were also analysed for the presence of simple sequence repeats (SSRs). A total of 2,339 SSR motifs were identified within 1,989 contigs and corresponding primer pairs were designed. Empirical validation of a cohort of 96 SSRs was performed, with 34% being polymorphic between sexual and apomictic biotypes. CONCLUSIONS: The development of genetic and genomic resources for P. dilatatum will contribute to gene discovery and expression

  7. DeepSimulator: a deep simulator for Nanopore sequencing

    KAUST Repository

    Li, Yu; Han, Renmin; Bi, Chongwei; Li, Mo; Wang, Sheng; Gao, Xin

    2017-01-01

    or assembled contigs, we simulate the electrical current signals by a context-dependent deep learning model, followed by a base-calling procedure to yield simulated reads. This workflow mimics the sequencing procedure more naturally. The thorough experiments

  8. Comparative analysis of transcriptomes in aerial stems and roots of Ephedra sinica based on high-throughput mRNA sequencing

    Directory of Open Access Journals (Sweden)

    Taketo Okada

    2016-12-01

    Full Text Available Ephedra plants are taxonomically classified as gymnosperms, and are medicinally important as the botanical origin of crude drugs and as bioresources that contain pharmacologically active chemicals. Here we show a comparative analysis of the transcriptomes of aerial stems and roots of Ephedra sinica based on high-throughput mRNA sequencing by RNA-Seq. De novo assembly of short cDNA sequence reads generated 23,358, 13,373, and 28,579 contigs longer than 200 bases from aerial stems, roots, or both aerial stems and roots, respectively. The presumed functions encoded by these contig sequences were annotated by BLAST (blastx). Subsequently, these contigs were classified based on gene ontology slims, Enzyme Commission numbers, and the InterPro database. Furthermore, comparative gene expression analysis was performed between aerial stems and roots. These transcriptome analyses revealed differences and similarities between the transcriptomes of aerial stems and roots in E. sinica. Deep transcriptome sequencing of Ephedra should open the door to molecular biological studies based on the entire transcriptome, tissue- or organ-specific transcriptomes, or targeted genes of interest.

  9. Draft Genome Sequence of "Terrisporobacter othiniensis" Isolated from a Blood Culture from a Human Patient

    DEFF Research Database (Denmark)

    Lund, Lars Christian; Sydenham, Thomas Vognbjerg; Høgh, Silje Vermedal

    2015-01-01

    "Terrisporobacter othiniensis" (proposed species) was isolated from a blood culture. Genomic DNA was sequenced using a MiSeq benchtop sequencer (Illumina) and assembled using the SPAdes genome assembler. This resulted in a draft genome sequence comprising 3,980,019 bp in 167 contigs containing 3...

  10. Non-contiguous finished genome sequence of the opportunistic oral pathogen Prevotella multisaccharivorax type strain (PPPA20T)

    Energy Technology Data Exchange (ETDEWEB)

    Pati, Amrita [U.S. Department of Energy, Joint Genome Institute; Gronow, Sabine [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Lu, Megan [Los Alamos National Laboratory (LANL); Lapidus, Alla L. [U.S. Department of Energy, Joint Genome Institute; Nolan, Matt [U.S. Department of Energy, Joint Genome Institute; Lucas, Susan [U.S. Department of Energy, Joint Genome Institute; Hammon, Nancy [U.S. Department of Energy, Joint Genome Institute; Deshpande, Shweta [U.S. Department of Energy, Joint Genome Institute; Cheng, Jan-Fang [U.S. Department of Energy, Joint Genome Institute; Tapia, Roxanne [Los Alamos National Laboratory (LANL); Han, Cliff [Los Alamos National Laboratory (LANL); Goodwin, Lynne A. [Los Alamos National Laboratory (LANL); Pitluck, Sam [U.S. Department of Energy, Joint Genome Institute; Liolios, Konstantinos [U.S. Department of Energy, Joint Genome Institute; Pagani, Ioanna [U.S. Department of Energy, Joint Genome Institute; Mavromatis, K [U.S. Department of Energy, Joint Genome Institute; Mikhailova, Natalia [U.S. Department of Energy, Joint Genome Institute; Huntemann, Marcel [U.S. Department of Energy, Joint Genome Institute; Chen, Amy [U.S. Department of Energy, Joint Genome Institute; Palaniappan, Krishna [U.S. Department of Energy, Joint Genome Institute; Land, Miriam L [ORNL; Hauser, Loren John [ORNL; Detter, J. Chris [U.S. Department of Energy, Joint Genome Institute; Brambilla, Evelyne-Marie [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Rohde, Manfred [HZI - Helmholtz Centre for Infection Research, Braunschweig, Germany; Goker, Markus [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Woyke, Tanja [U.S. Department of Energy, Joint Genome Institute; Bristow, James [U.S. Department of Energy, Joint Genome Institute; Eisen, Jonathan [U.S. Department of Energy, Joint Genome Institute; Markowitz, Victor [U.S. Department of Energy, Joint Genome Institute; Hugenholtz, Philip [U.S. Department of Energy, Joint Genome Institute; Kyrpides, Nikos C [U.S. Department of Energy, Joint Genome Institute; Klenk, Hans-Peter [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Ivanova, N [U.S. Department of Energy, Joint Genome Institute]

    2011-01-01

    Prevotella multisaccharivorax Sakamoto et al. 2005 is a species of the large genus Prevotella, which belongs to the family Prevotellaceae. The species is of medical interest because its members are able to cause diseases in the human oral cavity such as periodontitis, root caries and others. Although 77 Prevotella genomes have already been sequenced or are targeted for sequencing, this is only the second completed genome sequence of a type strain of a species within the genus Prevotella to be published. The 3,388,644 bp long genome is assembled in three non-contiguous contigs, harbors 2,876 protein-coding and 75 RNA genes and is a part of the Genomic Encyclopedia of Bacteria and Archaea project.

  11. An expressed sequence tag (EST) library for Drosophila serrata, a model system for sexual selection and climatic adaptation studies.

    Science.gov (United States)

    Frentiu, Francesca D; Adamski, Marcin; McGraw, Elizabeth A; Blows, Mark W; Chenoweth, Stephen F

    2009-01-21

    The native Australian fly Drosophila serrata belongs to the highly speciose montium subgroup of the melanogaster species group. It has recently emerged as an excellent model system with which to address a number of important questions, including the evolution of traits under sexual selection and traits involved in climatic adaptation along latitudinal gradients. Understanding the molecular genetic basis of such traits has been limited by a lack of genomic resources for this species. Here, we present the first expressed sequence tag (EST) collection for D. serrata that will enable the identification of genes underlying sexually-selected phenotypes and physiological responses to environmental change and may help resolve controversial phylogenetic relationships within the montium subgroup. A normalized cDNA library was constructed from whole fly bodies at several developmental stages, including larvae and adults. Assembly of 11,616 clones sequenced from the 3' end allowed us to identify 6,607 unique contigs, of which at least 90% encoded peptides. Partial transcripts were discovered from a variety of genes of evolutionary interest by BLASTing contigs against the 12 Drosophila genomes currently sequenced. By incorporating into the cDNA library multiple individuals from populations spanning a large portion of the geographical range of D. serrata, we were able to identify 11,057 putative single nucleotide polymorphisms (SNPs), with 278 different contigs having at least one "double hit" SNP that is highly likely to be a real polymorphism. At least 394 EST-associated microsatellite markers, representing 355 different contigs, were also found, providing an additional set of genetic markers. The assembled EST library is available online at http://www.chenowethlab.org/serrata/index.cgi. We have provided the first gene collection and largest set of polymorphic genetic markers, to date, for the fly D. serrata. The EST collection will provide much needed genomic resources for

  12. Prevalence of single nucleotide polymorphism among 27 diverse alfalfa genotypes as assessed by transcriptome sequencing

    Directory of Open Access Journals (Sweden)

    Li Xuehui

    2012-10-01

    Full Text Available Abstract Background Alfalfa, a perennial, outcrossing species, is a widely planted forage legume producing highly nutritious biomass. Currently, improvement of cultivated alfalfa mainly relies on recurrent phenotypic selection. Marker assisted breeding strategies can enhance alfalfa improvement efforts, particularly if many genome-wide markers are available. Transcriptome sequencing enables efficient high-throughput discovery of single nucleotide polymorphism (SNP) markers for a complex polyploid species. Result The transcriptomes of 27 alfalfa genotypes, including elite breeding genotypes, parents of mapping populations, and unimproved wild genotypes, were sequenced using an Illumina Genome Analyzer IIx. De novo assembly of quality-filtered 72-bp reads generated 25,183 contigs with a total length of 26.8 Mbp and an average length of 1,065 bp, with an average read depth of 55.9-fold for each genotype. Overall, 21,954 (87.2%) of the 25,183 contigs represented 14,878 unique protein accessions. Gene ontology (GO) analysis suggested that a broad diversity of genes was represented in the resulting sequences. The realignment of individual reads to the contigs enabled the detection of 872,384 SNPs and 31,760 InDels. High resolution melting (HRM) analysis was used to validate 91% of 192 putative SNPs identified by sequencing. Both allelic variants at about 95% of SNP sites identified among five wild, unimproved genotypes are still present in cultivated alfalfa, and all four US breeding programs also contain a high proportion of these SNPs. Thus, little evidence exists among this dataset for loss of significant DNA sequence diversity from either domestication or breeding of alfalfa. Structure analysis indicated that individuals from the subspecies falcata, the diploid subspecies caerulea, and the tetraploid subspecies sativa (cultivated tetraploid alfalfa) were clearly separated. Conclusion We used transcriptome sequencing to discover large numbers of SNPs

  13. A YAC contig and an EST map in the pericentromeric region of chromosome 13 surrounding the loci for neurosensory nonsyndromic deafness (DFNB1 and DFNA3) and Limb-Girdle muscular dystrophy type 2C (LGMD2C)

    Energy Technology Data Exchange (ETDEWEB)

    Guilford, P.; Crozet, F.; Blanchard, S. [Institut Pasteur, Paris (France)] [and others]

    1995-09-01

    Two forms of inherited childhood nonsyndromic deafness (DFNB1 and DFNA3) and a Duchenne-like form of progressive muscular dystrophy (LGMD2C) have been mapped to the pericentromeric region of chromosome 13. To clone the genes responsible for these diseases, we constructed a yeast artificial chromosome (YAC) contig spanning an 8-cM region between the polymorphic markers D13S175 and D13S221. The contig comprises 24 sequence-tagged sites, 15 of which were newly obtained. This contig allowed us to order the polymorphic markers centromere-D13S175-D13S141-D13S143-D13S115-AFM128yc1-D13S292-D13S283-AFM323vh5-D13S221-telomere. Eight expressed sequence tags, previously assigned to 13q11-q12 (D13S182E, D13S183E, D13S502E, D13S504E, D13S505E, D13S837E, TUBA2, ATP1AL1), were localized on the YAC contig. YAC screening of a cDNA library derived from mouse cochlea allowed us to identify an α-tubulin gene (TUBA2) that was subsequently precisely mapped within the candidate region. 36 refs., 2 figs., 2 tabs.

  14. De novo 454 sequencing of barcoded BAC pools for comprehensive gene survey and genome analysis in the complex genome of barley

    Directory of Open Access Journals (Sweden)

    Scholz Uwe

    2009-11-01

    Full Text Available Abstract Background De novo sequencing of a large, complex plant genome such as that of barley (Hordeum vulgare L.) is a major challenge in terms of both experimental feasibility and cost. The emergence and rapid progress of next generation sequencing technologies have brought this goal into focus, and a clone based strategy combined with the 454/Roche technology is conceivable. Results To test the feasibility, we sequenced 91 barcoded, pooled, gene containing barley BACs using the GS FLX platform and assembled the sequences under iterative change of parameters. The BAC assemblies were characterized by an N50 of ~50 kb (N80 ~31 kb, N90 ~21 kb) and a Q40 of 94%. For ~80% of the clones, the best assemblies consisted of fewer than 10 contigs at 24-fold mean sequence coverage. Moreover, we show that gene containing regions seem to assemble completely and without interruption, making the approach suitable for detecting complete and positionally anchored genes. By comparing the assemblies of four clones to their complete reference sequences generated by the Sanger method, we evaluated the distribution, quality and representativeness of the 454 sequences as well as the consistency and reliability of the assemblies. Conclusion The described multiplex 454 sequencing of barcoded BACs leads to consensus sequences that are highly representative of the clones. Assemblies are correct for the majority of contigs. Though the resolution of complex repetitive structures requires additional experimental efforts, our approach paves the way for a clone based strategy of sequencing the barley genome.
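
    The N50, N80 and N90 figures quoted in the record above are length-weighted contiguity statistics. As a minimal sketch of how such values are commonly computed (the function name `nx` and the toy contig lengths are illustrative assumptions, not taken from the study), Nx can be read as the length of the contig at which the running total of bases, taken from longest to shortest, first reaches x% of the assembly:

```python
def nx(lengths, x=50):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the length of the contig at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches x% of the total.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length


# Hypothetical contig lengths in bp (illustration only, not study data).
contigs = [80, 70, 50, 40, 30, 20, 10]
print(nx(contigs, 50), nx(contigs, 80), nx(contigs, 90))  # 70 40 30
```

    Because higher x values require more of the shorter contigs to be counted, N90 is never larger than N80, which is never larger than N50; this matches the ordering of the values reported above.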

  15. The Douglas-fir genome sequence reveals specialization of the photosynthetic apparatus in Pinaceae

    Science.gov (United States)

    David B. Neale; Patrick E. McGuire; Nicholas C. Wheeler; Kristian A. Stevens; Marc W. Crepeau; Charis Cardeno; Aleksey V. Zimin; Daniela Puiu; Geo M. Pertea; U. Uzay Sezen; Claudio Casola; Tomasz E. Koralewski; Robin Paul; Daniel Gonzalez-Ibeas; Sumaira Zaman; Richard Cronn; Mark Yandell; Carson Holt; Charles H. Langley; James A. Yorke; Steven L. Salzberg; Jill L. Wegrzyn

    2017-01-01

    A reference genome sequence for Pseudotsuga menziesii var. menziesii (Mirb.) Franco (Coastal Douglas-fir) is reported, thus providing a reference sequence for a third genus of the family Pinaceae. The contiguity and quality of the genome assembly far exceed those of other conifer reference genome sequences (contig N50 = 44,136 bp and scaffold N50...

  16. Graph mining for next generation sequencing: leveraging the assembly graph for biological insights.

    Science.gov (United States)

    Warnke-Sommer, Julia; Ali, Hesham

    2016-05-06

    The assembly of Next Generation Sequencing (NGS) reads remains a challenging task. This is especially true for the assembly of metagenomics data that originate from environmental samples potentially containing hundreds to thousands of unique species. The principal objective of current assembly tools is to assemble NGS reads into contiguous stretches of sequence called contigs while maximizing both accuracy and contig length. The end goal of this process is to produce longer contigs, with the major focus being on assembly only. Sequence read assembly is an aggregative process, during which read overlap relationship information is lost as reads are merged into longer sequences or contigs. The assembly graph is information rich and capable of capturing the genomic architecture of an input read data set. We have developed a novel hybrid graph in which nodes represent sequence regions at different levels of granularity. This model, utilized in the assembly and analysis pipeline Focus, presents a concise yet feature rich view of a given input data set, allowing for the extraction of biologically relevant graph structures for graph mining purposes. Focus was used to create hybrid graphs to model metagenomics data sets obtained from the gut microbiomes of five individuals with Crohn's disease and eight healthy individuals. Repetitive and mobile genetic elements are found to be associated with hybrid graph structure. Using graph mining techniques, a comparative study of the Crohn's disease and healthy data sets was conducted with a focus on antibiotic resistance genes associated with transposase genes. Results demonstrated significant differences in the phylogenetic distribution of categories of antibiotic resistance genes in the healthy and diseased patients. Focus was also evaluated as a pure assembly tool and produced excellent results when compared against the Meta-velvet, Omega, and UD-IDBA assemblers. Mining the hybrid graph can reveal biological phenomena captured

  17. The sequence and de novo assembly of the giant panda genome

    Science.gov (United States)

    Li, Ruiqiang; Fan, Wei; Tian, Geng; Zhu, Hongmei; He, Lin; Cai, Jing; Huang, Quanfei; Cai, Qingle; Li, Bo; Bai, Yinqi; Zhang, Zhihe; Zhang, Yaping; Wang, Wen; Li, Jun; Wei, Fuwen; Li, Heng; Jian, Min; Li, Jianwen; Zhang, Zhaolei; Nielsen, Rasmus; Li, Dawei; Gu, Wanjun; Yang, Zhentao; Xuan, Zhaoling; Ryder, Oliver A.; Leung, Frederick Chi-Ching; Zhou, Yan; Cao, Jianjun; Sun, Xiao; Fu, Yonggui; Fang, Xiaodong; Guo, Xiaosen; Wang, Bo; Hou, Rong; Shen, Fujun; Mu, Bo; Ni, Peixiang; Lin, Runmao; Qian, Wubin; Wang, Guodong; Yu, Chang; Nie, Wenhui; Wang, Jinhuan; Wu, Zhigang; Liang, Huiqing; Min, Jiumeng; Wu, Qi; Cheng, Shifeng; Ruan, Jue; Wang, Mingwei; Shi, Zhongbin; Wen, Ming; Liu, Binghang; Ren, Xiaoli; Zheng, Huisong; Dong, Dong; Cook, Kathleen; Shan, Gao; Zhang, Hao; Kosiol, Carolin; Xie, Xueying; Lu, Zuhong; Zheng, Hancheng; Li, Yingrui; Steiner, Cynthia C.; Lam, Tommy Tsan-Yuk; Lin, Siyuan; Zhang, Qinghui; Li, Guoqing; Tian, Jing; Gong, Timing; Liu, Hongde; Zhang, Dejin; Fang, Lin; Ye, Chen; Zhang, Juanbin; Hu, Wenbo; Xu, Anlong; Ren, Yuanyuan; Zhang, Guojie; Bruford, Michael W.; Li, Qibin; Ma, Lijia; Guo, Yiran; An, Na; Hu, Yujie; Zheng, Yang; Shi, Yongyong; Li, Zhiqiang; Liu, Qing; Chen, Yanling; Zhao, Jing; Qu, Ning; Zhao, Shancen; Tian, Feng; Wang, Xiaoling; Wang, Haiyin; Xu, Lizhi; Liu, Xiao; Vinar, Tomas; Wang, Yajun; Lam, Tak-Wah; Yiu, Siu-Ming; Liu, Shiping; Zhang, Hemin; Li, Desheng; Huang, Yan; Wang, Xia; Yang, Guohua; Jiang, Zhi; Wang, Junyi; Qin, Nan; Li, Li; Li, Jingxiang; Bolund, Lars; Kristiansen, Karsten; Wong, Gane Ka-Shu; Olson, Maynard; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian; Wang, Jun

    2013-01-01

    Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes. PMID:20010809

  18. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample.

    Science.gov (United States)

    Luo, Chengwei; Tsementzi, Despina; Kyrpides, Nikos; Read, Timothy; Konstantinidis, Konstantinos T

    2012-01-01

    Next-generation sequencing (NGS) is commonly used in metagenomic studies of complex microbial communities, but whether different NGS platforms recover the same diversity from a sample and whether their assembled sequences are of comparable quality remains unclear. We compared the two most frequently used platforms, the Roche 454 FLX Titanium and the Illumina Genome Analyzer (GA) II, on the same DNA sample obtained from a complex freshwater planktonic community. Despite the substantial differences in read length and sequencing protocols, the platforms provided a comparable view of the community sampled. For instance, derived assemblies overlapped in ~90% of their total sequences and in situ abundances of genes and genotypes (estimated based on sequence coverage) correlated highly between the two platforms (R² > 0.9). Evaluation of base-call error, frameshift frequency, and contig length suggested that Illumina offered equivalent, if not better, assemblies than Roche 454. The results from metagenomic samples were further validated against DNA samples of eighteen isolate genomes, which showed a range of genome sizes and G+C% content. We also provide quantitative estimates of the errors in gene and contig sequences assembled from datasets characterized by different levels of complexity and G+C% content. For instance, we noted that homopolymer-associated, single-base errors affected ~1% of the protein sequences recovered in Illumina contigs of 10× coverage and 50% G+C; this frequency increased to ~3% when non-homopolymer errors were also considered. Collectively, our results should serve as a useful practical guide for choosing proper sampling strategies and data processing protocols for future metagenomic studies.

  19. A Case Study into Microbial Genome Assembly Gap Sequences and Finishing Strategies.

    Science.gov (United States)

    Utturkar, Sagar M; Klingeman, Dawn M; Hurt, Richard A; Brown, Steven D

    2017-01-01

    This study characterized regions of DNA which remained unassembled by either PacBio or Illumina sequencing technologies for seven bacterial genomes. Two genomes were manually finished using bioinformatics and PCR/Sanger sequencing approaches, and regions not assembled by automated software were analyzed. Gaps present within Illumina assemblies mostly correspond to repetitive DNA regions such as multiple rRNA operon sequences. PacBio gap sequences were evaluated for several properties such as GC content, read coverage, gap length, ability to form strong secondary structures, and corresponding annotations. Our hypothesis that strong secondary DNA structures blocked DNA polymerases and contributed to gap sequences was not accepted. PacBio assemblies had few limitations overall, and gaps were explained as the cumulative effect of lower than average sequence coverage and repetitive sequences at contig termini. An important aspect of the present study is the compilation of biological features that interfered with assembly, which included active transposons, multiple plasmid sequences, phage DNA integration, and large sequence duplication. Our targeted genome finishing approach and systematic evaluation of the unassembled DNA will be useful for others looking to close, finish, and polish microbial genome sequences.

  20. Moleculo Long-Read Sequencing Facilitates Assembly and Genomic Binning from Complex Soil Metagenomes

    Energy Technology Data Exchange (ETDEWEB)

    White, Richard Allen; Bottos, Eric M.; Roy Chowdhury, Taniya; Zucker, Jeremy D.; Brislawn, Colin J.; Nicora, Carrie D.; Fansler, Sarah J.; Glaesemann, Kurt R.; Glass, Kevin; Jansson, Janet K.; Langille, Morgan

    2016-06-28

    functional roles in ecosystem stability and responses to environmental perturbations. This knowledge gap is largely due to the difficulty in culturing the majority of soil microbes. Thus, use of culture-independent approaches, such as metagenomics, promises the direct assessment of the functional potential of soil microbiomes. Soil is, however, a challenge for metagenomic assembly due to its high microbial diversity and variable evenness, resulting in low coverage and uneven sampling of microbial genomes. Despite increasingly large soil metagenome data volumes (>200 Gbp), the majority of the data do not assemble. Here, we used the cutting-edge approach of synthetic long-read sequencing technology (Moleculo) to assemble soil metagenome sequence data into long contigs and used the assemblies for binning of genomes.

    Author Video: An author video summary of this article is available.

  1. Next generation sequencing provides rapid access to the genome of Puccinia striiformis f. sp. tritici, the causal agent of wheat stripe rust.

    Directory of Open Access Journals (Sweden)

    Dario Cantu

    Full Text Available BACKGROUND: The wheat stripe rust fungus (Puccinia striiformis f. sp. tritici, PST) is responsible for significant yield losses in wheat production worldwide. In spite of its economic importance, the PST genomic sequence is not currently available. Fortunately, Next Generation Sequencing (NGS) has radically improved sequencing speed and efficiency with a great reduction in costs compared to traditional sequencing technologies. We used Illumina sequencing to rapidly access the genomic sequence of the highly virulent PST race 130 (PST-130). METHODOLOGY/PRINCIPAL FINDINGS: We obtained nearly 80 million high quality paired-end reads (>50x coverage) that were assembled into 29,178 contigs (64.8 Mb), which provide an estimated coverage of at least 88% of the PST genes and are available through GenBank. Extensive micro-synteny with the Puccinia graminis f. sp. tritici (PGTG) genome and high sequence similarity with annotated PGTG genes support the quality of the PST-130 contigs. We characterized the transposable elements present in the PST-130 contigs and, using an ab initio gene prediction program, we identified and tentatively annotated 22,815 putative coding sequences. We provide examples on the use of comparative approaches to improve gene annotation for both PST and PGTG and to identify candidate effectors. Finally, the assembled contigs provided an inventory of PST repetitive elements, which were annotated and deposited in Repbase. CONCLUSIONS/SIGNIFICANCE: The assembly of the PST-130 genome and the predicted proteins provide useful resources to rapidly identify and clone PST genes and their regulatory regions. Although the automatic gene prediction has limitations, we show that a comparative genomics approach using multiple rust species can greatly improve the quality of gene annotation in these species. The PST-130 sequence will also be useful for comparative studies within PST as more races are sequenced. This study illustrates the power of NGS for

  2. A post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs.

    Science.gov (United States)

    Swain, Martin T; Tsai, Isheng J; Assefa, Samual A; Newbold, Chris; Berriman, Matthew; Otto, Thomas D

    2012-06-07

    Genome projects now produce draft assemblies within weeks owing to advanced high-throughput sequencing technologies. For milestone projects such as Escherichia coli or Homo sapiens, teams of scientists were employed to manually curate and finish these genomes to a high standard. Nowadays, this is not feasible for most projects, and the quality of genomes is generally of a much lower standard. This protocol describes software (PAGIT) that is used to improve the quality of draft genomes. It offers flexible functionality to close gaps in scaffolds, correct base errors in the consensus sequence and exploit reference genomes (if available) in order to improve scaffolding and generate annotations. The protocol is most accessible for bacterial and small eukaryotic genomes (up to 300 Mb), such as pathogenic bacteria, malaria and parasitic worms. Applying PAGIT to an E. coli assembly takes ∼24 h: it doubles the average contig size and annotates over 4,300 gene models.

  3. Re-annotation of the physical map of Glycine max for polyploid-like regions by BAC end sequence driven whole genome shotgun read assembly

    Directory of Open Access Journals (Sweden)

    Shultz Jeffry

    2008-07-01

    Full Text Available Abstract Background Many of the world's most important food crops have either polyploid genomes or homeologous regions derived from segmental shuffling following polyploid formation. The soybean (Glycine max) genome has been shown to be composed of approximately four thousand short interspersed homeologous regions with 1, 2 or 4 copies per haploid genome by RFLP analysis, microsatellite anchors to BACs and by contigs formed from BAC fingerprints. Despite these similar regions, the genome has been sequenced by whole genome shotgun sequence (WGS). Here the aim was to use BAC end sequences (BES) derived from three minimum tile paths (MTP) to examine the extent and homogeneity of polyploid-like regions within contigs and the extent of correlation between the polyploid-like regions inferred from fingerprinting and the polyploid-like sequences inferred from WGS matches. Results Results show that when sequence divergence was 1–10%, the copy number of homeologous regions could be identified from sequence variation in WGS reads overlapping BES. Homeolog sequence variants (HSVs) were single nucleotide polymorphisms (SNPs; 89%) and single nucleotide indels (SNIs; 10%). Larger indels were rare but present (1%). Simulations that had predicted fingerprints of homeologous regions could be separated when divergence exceeded 2% were shown to be false. We show that a 5–10% sequence divergence is necessary to separate homeologs by fingerprinting. BES compared to WGS traces showed polyploid-like regions with less than 1% sequence divergence exist at 2.3% of the locations assayed. Conclusion The use of HSVs like SNPs and SNIs to characterize BACs will improve contig building methods. The implications for bioinformatic and functional annotation of polyploid and paleopolyploid genomes show that a combined approach of BAC fingerprint based physical maps, WGS sequence and HSV-based partitioning of BAC clones from homeologous regions to separate contigs will allow reliable de

  4. De novo assembly of human genomes with massively parallel short read sequencing

    DEFF Research Database (Denmark)

    Li, Ruiqiang; Zhu, Hongmei; Ruan, Jue

    2010-01-01

    genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities...... for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way....

  5. Sequence analysis of the genome of carnation (Dianthus caryophyllus L.).

    Science.gov (United States)

    Yagi, Masafumi; Kosugi, Shunichi; Hirakawa, Hideki; Ohmiya, Akemi; Tanase, Koji; Harada, Taro; Kishimoto, Kyutaro; Nakayama, Masayoshi; Ichimura, Kazuo; Onozaki, Takashi; Yamaguchi, Hiroyasu; Sasaki, Nobuhiro; Miyahara, Taira; Nishizaki, Yuzo; Ozeki, Yoshihiro; Nakamura, Noriko; Suzuki, Takamasa; Tanaka, Yoshikazu; Sato, Shusei; Shirasawa, Kenta; Isobe, Sachiko; Miyamura, Yoshinori; Watanabe, Akiko; Nakayama, Shinobu; Kishida, Yoshie; Kohara, Mitsuyo; Tabata, Satoshi

    2014-06-01

    The whole-genome sequence of carnation (Dianthus caryophyllus L.) cv. 'Francesco' was determined using a combination of different new-generation multiplex sequencing platforms. The total length of the non-redundant sequences was 568,887,315 bp, consisting of 45,088 scaffolds, which covered 91% of the 622 Mb carnation genome estimated by k-mer analysis. The N50 values of contigs and scaffolds were 16,644 bp and 60,737 bp, respectively, and the longest scaffold was 1,287,144 bp. The average GC content of the contig sequences was 36%. A total of 1,050, 13, 92 and 143 genes for tRNAs, rRNAs, snoRNA and miRNA, respectively, were identified in the assembled genomic sequences. For protein-encoding genes, 43,266 complete and partial gene structures excluding those in transposable elements were deduced. Gene coverage was ~98%, as deduced from the coverage of the core eukaryotic genes. Intensive characterization of the assigned carnation genes and comparison with those of other plant species revealed characteristic features of the carnation genome. The results of this study will serve as a valuable resource for fundamental and applied research of carnation, especially for breeding new carnation varieties. Further information on the genomic sequences is available at http://carnation.kazusa.or.jp. © The Author 2013. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  6. High-throughput physical map anchoring via BAC-pool sequencing

    Czech Academy of Sciences Publication Activity Database

    Cviková, Kateřina; Cattonaro, F.; Alaux, M.; Stein, N.; Mayer, K.F.X.; Doležel, Jaroslav; Bartoš, Jan

    2015-01-01

    Vol. 15, APR 11 (2015) ISSN 1471-2229 R&D Projects: GA ČR GA13-08786S; GA MŠk(CZ) LO1204 Institutional support: RVO:61389030 Keywords : Physical map * Contig anchoring * Next generation sequencing Subject RIV: EB - Genetics ; Molecular Biology Impact factor: 3.631, year: 2015

  7. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample.

    Directory of Open Access Journals (Sweden)

    Chengwei Luo

    Full Text Available Next-generation sequencing (NGS) is commonly used in metagenomic studies of complex microbial communities, but whether different NGS platforms recover the same diversity from a sample and whether their assembled sequences are of comparable quality remains unclear. We compared the two most frequently used platforms, the Roche 454 FLX Titanium and the Illumina Genome Analyzer (GA) II, on the same DNA sample obtained from a complex freshwater planktonic community. Despite the substantial differences in read length and sequencing protocols, the platforms provided a comparable view of the community sampled. For instance, derived assemblies overlapped in ~90% of their total sequences and in situ abundances of genes and genotypes (estimated based on sequence coverage) correlated highly between the two platforms (R² > 0.9). Evaluation of base-call error, frameshift frequency, and contig length suggested that Illumina offered equivalent, if not better, assemblies than Roche 454. The results from metagenomic samples were further validated against DNA samples of eighteen isolate genomes, which showed a range of genome sizes and G+C% content. We also provide quantitative estimates of the errors in gene and contig sequences assembled from datasets characterized by different levels of complexity and G+C% content. For instance, we noted that homopolymer-associated, single-base errors affected ~1% of the protein sequences recovered in Illumina contigs of 10× coverage and 50% G+C; this frequency increased to ~3% when non-homopolymer errors were also considered. Collectively, our results should serve as a useful practical guide for choosing proper sampling strategies and data processing protocols for future metagenomic studies.

  8. SWORDS: A statistical tool for analysing large DNA sequences

    Indian Academy of Sciences (India)

    Unknown

    These techniques are based on frequency distributions of DNA words in a large sequence, and have been packaged into a software called SWORDS. Using sequences available in ... tions with the cellular processes like recombination, replication .... in DNA sequences using certain specific probability laws. (Pevzner et al ...
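
    The record above describes analyses based on frequency distributions of DNA words. The following is only a minimal sketch of that general idea, counting overlapping words of a fixed length k; it is not the SWORDS implementation, and the function names and toy sequence are assumptions made for illustration:

```python
from collections import Counter


def word_counts(sequence, k):
    """Count every overlapping word of length k in a DNA sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_frequencies(sequence, k):
    """Convert word counts into relative frequencies that sum to 1."""
    counts = word_counts(sequence, k)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy sequence (illustration only): 10 bases give 8 overlapping 3-letter words.
toy = "ATGATGCATG"
counts = word_counts(toy, 3)
print(counts["ATG"])                    # 3
print(word_frequencies(toy, 3)["ATG"])  # 0.375
```

    Observed frequencies like these are typically compared with what a chosen background model would predict, so that over- or under-represented words stand out; the choice of model is left open here.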

  9. Sequence finishing and mapping of Drosophila melanogaster heterochromatin

    Energy Technology Data Exchange (ETDEWEB)

    Hoskins, Roger A.; Carlson, Joseph W.; Kennedy, Cameron; Acevedo, David; Evans-Holm, Martha; Frise, Erwin; Wan, Kenneth H.; Park, Soo; Mendez-Lago, Maria; Rossi, Fabrizio; Villasante, Alfredo; Dimitri, Patrizio; Karpen, Gary H.; Celniker, Susan E.

    2007-06-15

    Genome sequences for most metazoans are incomplete due to the presence of repeated DNA in the pericentromeric heterochromatin. The heterochromatic regions of D. melanogaster contain 20 Mb of sequence amenable to mapping, sequence assembly and finishing. Here we describe the generation of 15 Mb of finished or improved heterochromatic sequence using available clone resources and assembly and mapping methods. We also constructed a BAC-based physical map that spans approximately 13 Mb of the pericentromeric heterochromatin, and a cytogenetic map that positions approximately 11 Mb of BAC contigs and sequence scaffolds in specific chromosomal locations. The integrated sequence assembly and maps greatly improve our understanding of the structure and composition of this poorly understood fraction of a metazoan genome and provide a framework for functional analyses.

  10. Giant panda BAC library construction and assembly of a 650-kb contig spanning major histocompatibility complex class II region

    Directory of Open Access Journals (Sweden)

    Pan Hui-Juan

    2007-09-01

    Full Text Available Abstract Background The giant panda is a rare and endangered species endemic to China. Low rates of reproductive success and low resistance to infectious disease have severely hampered the development of captive and wild populations of the giant panda. The major histocompatibility complex (MHC) plays important roles in the immune response and in reproduction, such as mate choice and mother-fetus bio-compatibility. It is thus essential to understand the genetic details of the giant panda MHC. Construction of a bacterial artificial chromosome (BAC) library will provide a new tool for panda genome physical mapping and thus facilitate understanding of panda MHC genes. Results A giant panda BAC library consisting of 205,800 clones has been constructed. The average insert size was calculated to be 97 kb based on the examination of 174 randomly selected clones, indicating that the giant panda library contained 6.8-fold genome equivalents. Screening of the library with 16 giant panda PCR primer pairs revealed 6.4 positive clones per locus, in good agreement with the expected 6.8-fold genomic coverage of the library. Based on this BAC library, we constructed a contig map of the giant panda MHC class II region from BTNL2 to DAXX, spanning about 650 kb, by a three-step method: (1) PCR-based screening of the BAC library with primers from homologous MHC class II gene loci, end sequences and BAC clone shotgun sequences, (2) DNA sequencing validation of positive clones, and (3) restriction digest fingerprinting verification of inter-clone overlapping. Conclusion The identification of genes and genomic regions of interest is greatly favored by the availability of this giant panda BAC library. The giant panda BAC library thus provides a useful platform for physical mapping, genome sequencing or complex analysis of targeted genomic regions. The 650 kb sequence-ready BAC contig map of the giant panda MHC class II region from BTNL2 to DAXX, verified by the three-step method, offers a
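
    The coverage figure in this record follows from simple arithmetic: genome equivalents = clones × mean insert size / genome size. The sketch below reruns that calculation in Python. The abstract does not state the genome size, so the value derived here is only the size implied by the reported 6.8-fold coverage, and the function names are illustrative.

```python
def genome_equivalents(n_clones, mean_insert_kb, genome_size_kb):
    """Fold coverage of a clone library: total cloned DNA / genome size."""
    return n_clones * mean_insert_kb / genome_size_kb


def implied_genome_size_kb(n_clones, mean_insert_kb, coverage):
    """Invert the coverage formula to get the genome size it assumes."""
    return n_clones * mean_insert_kb / coverage


# Figures quoted in the abstract.
n_clones = 205_800
mean_insert_kb = 97
reported_coverage = 6.8

total_insert_kb = n_clones * mean_insert_kb
genome_kb = implied_genome_size_kb(n_clones, mean_insert_kb, reported_coverage)

print(f"total cloned DNA: {total_insert_kb / 1e6:.2f} Gb")
print(f"genome size implied by 6.8x coverage: {genome_kb / 1e6:.2f} Gb")
```

    With uniform representation, the expected number of positive clones per single-copy locus equals the fold coverage, which is why the observed 6.4 clones per locus is read as agreeing with 6.8-fold coverage.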

  11. Scaffold filling, contig fusion and comparative gene order inference

    Directory of Open Access Journals (Sweden)

    Rounsley Steve

    2010-06-01

    Full Text Available Abstract Background There has been a trend in increasing the phylogenetic scope of genome sequencing without finishing the sequence of the genome. Increasing numbers of genomes are being published in scaffold or contig form. Rearrangement algorithms, however, including gene order-based phylogenetic tools, require whole genome data on gene order or syntenic block order. How then can we use rearrangement algorithms to compare genomes available in scaffold form only? Can the comparative evidence predict the location of unsequenced genes? Results Our method involves optimally filling in genes missing from the scaffolds, while incorporating the augmented scaffolds directly into the rearrangement algorithms as if they were chromosomes. This is accomplished by an exact, polynomial-time algorithm. We then correct for the number of extra fusion/fission operations required to make scaffolds comparable to full assemblies. We model the relationship between the ratio of missing genes actually absent from the genome versus merely unsequenced ones, on one hand, and the increase of genomic distance after scaffold filling, on the other. We estimate the parameters of this model through simulations and by comparing the angiosperm genomes Ricinus communis and Vitis vinifera. Conclusions The algorithm solves the comparison of genomes with 18,300 genes, including 4500 missing from one genome, in less than a minute on a MacBook, putting virtually all genomes within range of the method.

  12. Scaffold filling, contig fusion and comparative gene order inference.

    Science.gov (United States)

    Muñoz, Adriana; Zheng, Chunfang; Zhu, Qian; Albert, Victor A; Rounsley, Steve; Sankoff, David

    2010-06-04

    There has been a trend in increasing the phylogenetic scope of genome sequencing without finishing the sequence of the genome. Increasing numbers of genomes are being published in scaffold or contig form. Rearrangement algorithms, however, including gene order-based phylogenetic tools, require whole genome data on gene order or syntenic block order. How then can we use rearrangement algorithms to compare genomes available in scaffold form only? Can the comparative evidence predict the location of unsequenced genes? Our method involves optimally filling in genes missing from the scaffolds, while incorporating the augmented scaffolds directly into the rearrangement algorithms as if they were chromosomes. This is accomplished by an exact, polynomial-time algorithm. We then correct for the number of extra fusion/fission operations required to make scaffolds comparable to full assemblies. We model the relationship between the ratio of missing genes actually absent from the genome versus merely unsequenced ones, on one hand, and the increase of genomic distance after scaffold filling, on the other. We estimate the parameters of this model through simulations and by comparing the angiosperm genomes Ricinus communis and Vitis vinifera. The algorithm solves the comparison of genomes with 18,300 genes, including 4500 missing from one genome, in less than a minute on a MacBook, putting virtually all genomes within range of the method.

  13. The canine sarcoglycan delta gene: BAC clone contig assembly, chromosome assignment and interrogation as a candidate gene for dilated cardiomyopathy in Dobermann dogs.

    Science.gov (United States)

    Stabej, P; Leegwater, P A J; Imholz, S; Versteeg, S A; Zijlstra, C; Stokhof, A A; Domanjko-Petrič, A; van Oost, B A

    2005-01-01

    Dilated cardiomyopathy (DCM) is a common disease of the myocardium recognized in humans, dogs and experimental animals. Genetic factors are responsible for a large proportion of cases in humans, and 17 genes with DCM-causing mutations have been identified. A genetic origin of DCM in Dobermann dogs has been suggested, but no disease genes have been identified to date. In this paper, we describe the characterization and evaluation of the canine sarcoglycan delta (SGCD) gene, a gene implicated in DCM in human and hamster. Bacterial artificial chromosomes (BACs) containing the canine SGCD gene were isolated with probes for exon 3 and exons 4-8 and were characterized by Southern blot analysis. BAC end sequences were obtained for four BACs. Three of the BACs overlapped and could be ordered relative to each other, and the end sequences of all four BACs could be anchored on the preliminary assembly of the dog genome sequence (www.ensembl.org). One of the BACs of the partial contig was localized by fluorescent in situ hybridization to canine chromosome 4q22, in agreement with the dog genome sequence. Two highly informative polymorphic microsatellite markers in intron 7 of the SGCD gene were identified. In 25 DCM-affected and 13 non-DCM-affected dogs, seven different haplotypes could be distinguished. However, no association between any of the SGCD variants and the disease locus was apparent.

  14. Next Generation Sequencing of Actinobacteria for the Discovery of Novel Natural Products

    Science.gov (United States)

    Gomez-Escribano, Juan Pablo; Alt, Silke; Bibb, Mervyn J.

    2016-01-01

    Like many fields of the biosciences, actinomycete natural products research has been revolutionised by next-generation DNA sequencing (NGS). Hundreds of new genome sequences from actinobacteria are made public every year, many of them as a result of projects aimed at identifying new natural products and their biosynthetic pathways through genome mining. Advances in these technologies in the last five years have meant not only a reduction in the cost of whole genome sequencing, but also a substantial increase in the quality of the data, having moved from obtaining a draft genome sequence comprised of several hundred short contigs, sometimes of doubtful reliability, to the possibility of obtaining an almost complete and accurate chromosome sequence in a single contig, allowing a detailed study of gene clusters and the design of strategies for refactoring and full gene cluster synthesis. The impact that these technologies are having in the discovery and study of natural products from actinobacteria, including those from the marine environment, is only starting to be realised. In this review we provide a historical perspective of the field, analyse the strengths and limitations of the most relevant technologies, and share the insights acquired during our genome mining projects. PMID:27089350

  15. Genome survey sequencing and genetic background characterization of Gracilariopsis lemaneiformis (Rhodophyta) based on next-generation sequencing.

    Science.gov (United States)

    Zhou, Wei; Hu, Yiyi; Sui, Zhenghong; Fu, Feng; Wang, Jinguo; Chang, Lianpeng; Guo, Weihua; Li, Binbin

    2013-01-01

    Gracilariopsis lemaneiformis has a high economic value and is one of the most important aquaculture species in China. Despite its economic importance, it has remained largely unstudied at the genomic level. In this study, we conducted a genome survey of Gp. lemaneiformis using next-generation sequencing (NGS) technologies. In total, 18.70 Gb of high-quality sequence data with an estimated genome size of 97 Mb were obtained by HiSeq 2000 sequencing for Gp. lemaneiformis. These reads were assembled into 160,390 contigs with an N50 length of 3.64 kb, which were further assembled into 125,685 scaffolds with a total length of 81.17 Mb. Genome analysis predicted 3490 genes and a GC% content of 48%. The identified genes have an average transcript length of 1,429 bp, an average coding sequence size of 1,369 bp, 1.36 exons per gene, exon length of 1,008 bp, and intron length of 191 bp. From the initial assembled scaffold, transposable elements constituted 54.64% (44.35 Mb) of the genome, and 7737 simple sequence repeats (SSRs) were identified. Among these SSRs, the trinucleotide repeat type was the most abundant (up to 73.20% of total SSRs), followed by the di- (17.41%), tetra- (5.49%), hexa- (2.90%), and penta- (1.00%) nucleotide repeat type. These characteristics suggest that Gp. lemaneiformis is a model organism for genetic study. This is the first report of genome-wide characterization within this taxon.
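
    Several records in this listing report an N50 length. As a reminder of what that statistic measures, here is a minimal Python sketch of the usual definition: the length L such that contigs of length L or longer contain at least half of the assembled bases. The contig lengths in the usage lines are made-up toy values, not data from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    if not lengths:
        raise ValueError("n50() needs at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths): total = 54, half = 27.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

    Because N50 is weighted by length, an assembly with many very short contigs, such as the 160,390 contigs reported here, can still have an N50 well above its mean contig length.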

  16. Genome Survey Sequencing and Genetic Background Characterization of Gracilariopsis lemaneiformis (Rhodophyta) Based on Next-Generation Sequencing

    Science.gov (United States)

    Sui, Zhenghong; Fu, Feng; Wang, Jinguo; Chang, Lianpeng; Guo, Weihua; Li, Binbin

    2013-01-01

    Gracilariopsis lemaneiformis has a high economic value and is one of the most important aquaculture species in China. Despite its economic importance, it has remained largely unstudied at the genomic level. In this study, we conducted a genome survey of Gp. lemaneiformis using next-generation sequencing (NGS) technologies. In total, 18.70 Gb of high-quality sequence data with an estimated genome size of 97 Mb were obtained by HiSeq 2000 sequencing for Gp. lemaneiformis. These reads were assembled into 160,390 contigs with an N50 length of 3.64 kb, which were further assembled into 125,685 scaffolds with a total length of 81.17 Mb. Genome analysis predicted 3490 genes and a GC% content of 48%. The identified genes have an average transcript length of 1,429 bp, an average coding sequence size of 1,369 bp, 1.36 exons per gene, exon length of 1,008 bp, and intron length of 191 bp. From the initial assembled scaffold, transposable elements constituted 54.64% (44.35 Mb) of the genome, and 7737 simple sequence repeats (SSRs) were identified. Among these SSRs, the trinucleotide repeat type was the most abundant (up to 73.20% of total SSRs), followed by the di- (17.41%), tetra- (5.49%), hexa- (2.90%), and penta- (1.00%) nucleotide repeat type. These characteristics suggest that Gp. lemaneiformis is a model organism for genetic study. This is the first report of genome-wide characterization within this taxon. PMID:23875008

  17. Gene discovery from Jatropha curcas by sequencing of ESTs from normalized and full-length enriched cDNA library from developing seeds

    Directory of Open Access Journals (Sweden)

    Sugantham Priyanka Annabel

    2010-10-01

    Full Text Available Abstract Background Jatropha curcas L. is promoted as an important non-edible biodiesel crop worldwide. Jatropha oil, which is a triacylglycerol, can be directly blended with petro-diesel or transesterified with methanol and used as biodiesel. Genetic improvement in jatropha is needed to increase the seed yield, oil content, drought and pest resistance, and to modify oil composition so that it becomes a technically and economically preferred source for biodiesel production. However, genetic improvement efforts in jatropha could not take advantage of genetic engineering methods due to the lack of cloned genes from this species. To overcome this hurdle, the current gene discovery project was initiated with the objective of isolating as many functional genes as possible from J. curcas by large-scale sequencing of expressed sequence tags (ESTs). Results A normalized and full-length enriched cDNA library was constructed from developing seeds of J. curcas. The cDNA library contained about 1 × 10^6 clones, and the average insert size of the clones was 2.1 kb. In total, 12,084 ESTs were sequenced to an average high-quality read length of 576 bp. Contig analysis revealed 2258 contigs and 4751 singletons. Contig size ranged from 2 to 23, and there were 7333 ESTs in the contigs. This resulted in 7009 unigenes, which were annotated by BLASTX. It showed 3982 unigenes with significant similarity to known genes and 2836 unigenes with significant similarity to genes of unknown, hypothetical and putative proteins. The remaining 191 unigenes, which did not show similarity with any genes in the public database, may encode unique genes. Functional classification revealed unigenes related to a broad range of cellular, molecular and biological functions. Among the 7009 unigenes, 6233 unigenes were identified to be potential full-length genes. Conclusions The high quality normalized cDNA library was constructed from developing seeds of J. curcas for the first time and 7009 unigenes coding
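
    The counts in this record are internally consistent, which is easy to confirm with a few lines of Python. The sketch below uses only numbers quoted in the abstract; the variable names are illustrative, and the "ESTs per contig" figure is a derived value that the abstract itself does not report.

```python
# Counts quoted in the abstract.
total_ests = 12_084
contigs = 2_258
singletons = 4_751
ests_in_contigs = 7_333

# Each contig and each singleton counts as one unigene.
unigenes = contigs + singletons

# Every EST is either assembled into a contig or left as a singleton.
accounted_ests = ests_in_contigs + singletons

# BLASTX annotation classes reported for the unigenes.
known = 3_982
unknown_or_hypothetical = 2_836
no_similarity = 191
annotated_total = known + unknown_or_hypothetical + no_similarity

# Derived here, not reported in the abstract.
mean_ests_per_contig = ests_in_contigs / contigs

print(f"unigenes: {unigenes}")
print(f"ESTs accounted for: {accounted_ests} of {total_ests}")
print(f"annotation classes sum to: {annotated_total}")
print(f"mean ESTs per contig: {mean_ests_per_contig:.2f}")
```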

  18. Construction of a YAC contig and STS map spanning 2.5 Mbp in Xq25, the critical region for the X-linked lymphoproliferative (XLP) gene

    Energy Technology Data Exchange (ETDEWEB)

    Lanyi, A.; Li, B.F.; Li, S. [Univ. of Nebraska Medical Center, Omaha, NE (United States)] [and others]

    1994-09-01

    X-linked lymphoproliferative disease (XLP) is characterized by a marked vulnerability to Epstein-Barr virus (EBV) infection. Infection of XLP patients with EBV invariably results in fatal mononucleosis, agammaglobulinemia or B-cell lymphoma. The XLP gene lies within a 10 cM region in Xq25 between DXS42 and DXS10. Initial chromosome studies revealed an interstitial, cytogenetically visible deletion in Xq25 in one XLP family (43-004). We estimated the size of the Xq25 deletion by dual laser flow karyotyping to involve 2% of the X chromosome, or approximately 3 Mbp of DNA sequences. To further delineate the deletion we performed a series of pulsed field gel electrophoresis (PFGE) analyses which showed that DXS6 and DXS100, two Xq25-specific markers, are missing from 45-004 DNA. Five yeast artificial chromosomes (YACs) containing sequences deleted in patient 43-004's DNA were isolated from a chromosome X-specific YAC library. These five YACs did not overlap, and their end fragments were used to screen the CEPH MegaYAC library. Seven YACs were isolated from the CEPH MegaYAC library. They could be arranged into a contig which spans between DXS6 and DXS100. The contig contains a minimum of 2.5 Mbp of human DNA. A total of 12 YAC end clones, lambda subclones and STS probes have been used to order clones within the contig. These reagents were also used in Southern blot and patients showed interstitial deletions in Xq25. The sizes of these deletions range between 0.5 and 2.5 Mbp. The shortest deletion probably represents the critical region for the XLP gene.

  19. Draft Genome Sequence of Lactobacillus plantarum XJ25 Isolated from Chinese Red Wine.

    Science.gov (United States)

    Zhao, Meijing; Liu, Shuwen; He, Ling; Tian, Yu

    2016-11-17

    Here, we present the draft genome sequence of Lactobacillus plantarum XJ25, isolated from Chinese red wine that had undergone spontaneous malolactic fermentation, which consists of 25 contigs and is 3,218,018 bp long. Copyright © 2016 Zhao et al.

  20. Dicty_cDB: Contig-U14477-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available vmvvvivfylqviynlriivilmvqlivivf*mglvmvtviiivivii i*rinnnnsnnnnnsnnnkikmif*yqiinrlnnyf*shyqkfiiiqrldfwdyqrler* *hhlyqrlvnqvvivq*fhwisl...amvlaimxxx own update 2004. 6.10 Homology vs CSM-cDNA Query= Contig-U14477-1 (Conti

  1. Characterizing novel endogenous retroviruses from genetic variation inferred from short sequence reads

    DEFF Research Database (Denmark)

    Mourier, Tobias; Mollerup, Sarah; Vinner, Lasse

    2015-01-01

    From Illumina sequencing of DNA from brain and liver tissue from the lion, Panthera leo, and tumor samples from the pike-perch, Sander lucioperca, we obtained two assembled sequence contigs with similarity to known retroviruses. Phylogenetic analyses suggest that the pike-perch retrovirus belongs...... to the epsilonretroviruses, and the lion retrovirus to the gammaretroviruses. To determine if these novel retroviral sequences originate from an endogenous retrovirus or from a recently integrated exogenous retrovirus, we assessed the genetic diversity of the parental sequences from which the short Illumina reads...

  2. Curtobacterium sp. Genome Sequencing Underlines Plant Growth Promotion-Related Traits.

    Science.gov (United States)

    Bulgari, Daniela; Minio, Andrea; Casati, Paola; Quaglino, Fabio; Delledonne, Massimo; Bianco, Piero A

    2014-07-17

    Endophytic bacteria are microorganisms residing in plant tissues without causing disease symptoms. Here, we provide the high-quality genome sequence of Curtobacterium sp. strain S6, isolated from grapevine plant. The genome assembly contains 2,759,404 bp in 13 contigs and 2,456 predicted genes. Copyright © 2014 Bulgari et al.

  3. A second generation framework for the analysis of microsatellites in expressed sequence tags and the development of EST-SSR markers for a conifer, Cryptomeria japonica

    Directory of Open Access Journals (Sweden)

    Ueno Saneyoshi

    2012-04-01

    Full Text Available Abstract Background Microsatellites or simple sequence repeats (SSRs) in expressed sequence tags (ESTs) are useful resources for genome analysis because of their abundance, functionality and polymorphism. The advent of commercial second generation sequencing machines has led to new strategies for developing EST-SSR markers, necessitating the development of a bioinformatic framework that can keep pace with the increasing quality and quantity of sequence data produced. We describe an open scheme for analyzing ESTs and developing EST-SSR markers from reads collected by Sanger sequencing and pyrosequencing of sugi (Cryptomeria japonica). Results We collected 141,097 sequence reads by Sanger sequencing and 1,333,444 by pyrosequencing. After trimming contaminant and low quality sequences, 118,319 Sanger and 1,201,150 pyrosequencing reads were passed to the MIRA assembler, generating 81,284 contigs that were analysed for SSRs. 4,059 SSRs were found in 3,694 (4.54%) contigs, giving an SSR frequency lower than that in seven other plant species with gene indices (5.4–21.9%). The average GC content of the SSR-containing contigs was 41.55%, compared to 40.23% for all contigs. Tri-SSRs were the most common SSRs; the most common motif was AT, which was found in 655 (46.3%) di-SSRs, followed by the AAG motif, found in 342 (25.9%) tri-SSRs. Most (72.8%) tri-SSRs were in coding regions, but 55.6% of the di-SSRs were in non-coding regions; the AT motif was most abundant in 3′ untranslated regions. Gene ontology (GO) annotations showed that six GO terms were significantly overrepresented within SSR-containing contigs. Forty-four EST-SSR markers were developed from 192 primer pairs using two pipelines: read2Marker and the newly-developed CMiB, which combines several open tools. Markers resulting from both pipelines showed no differences in PCR success rate and polymorphisms, but PCR success and polymorphism were significantly affected by the expected PCR product size
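
    To make the SSR-screening step concrete, the following Python sketch finds perfect di- to hexanucleotide repeats in a contig with regular expressions. It is not the read2Marker or CMiB pipeline described in the record: the minimum repeat counts are assumed thresholds chosen for illustration, the function names are hypothetical, and the test sequence is synthetic.

```python
import re

# Assumed minimum repeat counts per motif length (illustrative only;
# the abstract does not state the thresholds used in the study).
DEFAULT_MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    thresholds = DEFAULT_MIN_REPEATS if min_repeats is None else min_repeats
    seq = seq.upper()
    hits = []
    for k, min_rep in thresholds.items():
        # One k-mer followed by at least (min_rep - 1) exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. "ATAT" (a di-SSR in disguise) and homopolymers.
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


# Synthetic contig: an (AT)7 di-SSR and an (AAG)5 tri-SSR.
contig = "GGC" + "AT" * 7 + "CCGT" + "AAG" * 5 + "TTC"
for start, motif, count in find_ssrs(contig):
    print(start, motif, count)

# Share of SSR-containing contigs, from the counts quoted in the abstract.
print(f"{100 * 3694 / 81284:.2f}% of contigs contain an SSR")
```

    The sketch only reports perfect repeats; compound or interrupted SSRs, which dedicated tools usually handle, are ignored.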

  4. A second generation framework for the analysis of microsatellites in expressed sequence tags and the development of EST-SSR markers for a conifer, Cryptomeria japonica

    Science.gov (United States)

    2012-01-01

    Background Microsatellites or simple sequence repeats (SSRs) in expressed sequence tags (ESTs) are useful resources for genome analysis because of their abundance, functionality and polymorphism. The advent of commercial second generation sequencing machines has led to new strategies for developing EST-SSR markers, necessitating the development of a bioinformatic framework that can keep pace with the increasing quality and quantity of sequence data produced. We describe an open scheme for analyzing ESTs and developing EST-SSR markers from reads collected by Sanger sequencing and pyrosequencing of sugi (Cryptomeria japonica). Results We collected 141,097 sequence reads by Sanger sequencing and 1,333,444 by pyrosequencing. After trimming contaminant and low quality sequences, 118,319 Sanger and 1,201,150 pyrosequencing reads were passed to the MIRA assembler, generating 81,284 contigs that were analysed for SSRs. 4,059 SSRs were found in 3,694 (4.54%) contigs, giving an SSR frequency lower than that in seven other plant species with gene indices (5.4–21.9%). The average GC content of the SSR-containing contigs was 41.55%, compared to 40.23% for all contigs. Tri-SSRs were the most common SSRs; the most common motif was AT, which was found in 655 (46.3%) di-SSRs, followed by the AAG motif, found in 342 (25.9%) tri-SSRs. Most (72.8%) tri-SSRs were in coding regions, but 55.6% of the di-SSRs were in non-coding regions; the AT motif was most abundant in 3′ untranslated regions. Gene ontology (GO) annotations showed that six GO terms were significantly overrepresented within SSR-containing contigs. Forty-four EST-SSR markers were developed from 192 primer pairs using two pipelines: read2Marker and the newly-developed CMiB, which combines several open tools. Markers resulting from both pipelines showed no differences in PCR success rate and polymorphisms, but PCR success and polymorphism were significantly affected by the expected PCR product size and number of SSR

  5. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences.

    Science.gov (United States)

    Gao, Song; Sung, Wing-Kin; Nagarajan, Niranjan

    2011-11-01

    Scaffolding, the problem of ordering and orienting contigs, typically using paired-end reads, is a crucial step in the assembly of high-quality draft genomes. Even as sequencing technologies and mate-pair protocols have improved significantly, scaffolding programs still rely on heuristics, with no guarantees on the quality of the solution. In this work, we explored the feasibility of an exact solution for scaffolding and present a first tractable solution for this problem (Opera). We also describe a graph contraction procedure that allows the solution to scale to large scaffolding problems and demonstrate this by scaffolding several large real and synthetic datasets. In comparisons with existing scaffolders, Opera simultaneously produced longer and more accurate scaffolds demonstrating the utility of an exact approach. Opera also incorporates an exact quadratic programming formulation to precisely compute gap sizes (Availability: http://sourceforge.net/projects/operasf/ ).

  6. The European sea bass Dicentrarchus labrax genome puzzle: comparative BAC-mapping and low coverage shotgun sequencing

    Directory of Open Access Journals (Sweden)

    Volckaert Filip AM

    2010-01-01

    Full Text Available Abstract Background Food supply from the ocean is constrained by the shortage of domesticated and selected fish. Development of genomic models of economically important fishes should assist with the removal of this bottleneck. European sea bass Dicentrarchus labrax L. (Moronidae, Perciformes, Teleostei) is one of the most important fishes in European marine aquaculture; growing genomic resources put it on its way to serve as an economic model. Results End sequencing of a sea bass genomic BAC-library enabled the comparative mapping of the sea bass genome using the three-spined stickleback Gasterosteus aculeatus genome as a reference. BAC-end sequences (102,690) were aligned to the stickleback genome. The number of mappable BACs was improved using a two-fold coverage WGS dataset of sea bass resulting in a comparative BAC-map covering 87% of stickleback chromosomes with 588 BAC-contigs. The minimum size of 83 contigs covering 50% of the reference was 1.2 Mbp; the largest BAC-contig comprised 8.86 Mbp. More than 22,000 BAC-clones aligned with both ends to the reference genome. Intra-chromosomal rearrangements between sea bass and stickleback were identified. Size distributions of mapped BACs were used to calculate that the genome of sea bass may be only 1.3 fold larger than the 460 Mbp stickleback genome. Conclusions The BAC map is used for sequencing single BACs or BAC-pools covering defined genomic entities by second generation sequencing technologies. Together with the WGS dataset it initiates a sea bass genome sequencing project. This will allow the quantification of polymorphisms through resequencing, which is important for selecting highly performing domesticated fish.

  7. New Sequences with Low Correlation and Large Family Size

    Science.gov (United States)

    Zeng, Fanxin

    In direct-sequence code-division multiple-access (DS-CDMA) communication systems and direct-sequence ultra wideband (DS-UWB) radios, sequences with low correlation and large family size are important for reducing multiple access interference (MAI) and accepting more active users, respectively. In this paper, a new collection of families of sequences of length $p^n-1$, which includes three constructions, is proposed. The maximum number of cyclically distinct families without GMW sequences in each construction is $\phi(p^n-1)/n \cdot \phi(p^m-1)/m$, where $p$ is a prime number, $n$ is an even number, and $n=2m$, and these sequences can be binary or polyphase depending upon choice of the parameter $p$. In Construction I, there are $p^n$ distinct sequences within each family and the new sequences have at most $d+2$ nontrivial periodic correlation values $\{-p^m-1, -1, p^m-1, 2p^m-1, \ldots, dp^m-1\}$. In Construction II, the new sequences have large family size $p^{2n}$ and possibly take the nontrivial correlation values in $\{-p^m-1, -1, p^m-1, 2p^m-1, \ldots, (3d-4)p^m-1\}$. In Construction III, the new sequences possess the largest family size $p^{(d-1)n}$ and have at most $2d$ correlation levels $\{-p^m-1, -1, p^m-1, 2p^m-1, \ldots, (2d-2)p^m-1\}$. The three constructions are near-optimal with respect to the Welch bound because the values of their Welch-Ratios are moderate, $WR \approx d$, $WR \approx 3d-4$ and $WR \approx 2d-2$, respectively. Each family in Constructions I, II and III contains a GMW sequence. In addition, Helleseth sequences and Niho sequences are special cases in Constructions I and III, and their restriction conditions on the integers $m$ and $n$, $p^m \not\equiv 2 \pmod 3$ and $n \equiv 0 \pmod 4$, respectively, are removed in our sequences. Our sequences in Construction III include the sequences with Niho type decimation $3 \cdot 2^m-2$, too. Finally, some open questions are pointed out and an example that illustrates the performance of these sequences is given.

  8. Physical mapping and BAC-end sequence analysis provide initial insights into the flax (Linum usitatissimum L.) genome.

    Science.gov (United States)

    Ragupathy, Raja; Rathinavelu, Rajkumar; Cloutier, Sylvie

    2011-05-09

    Flax (Linum usitatissimum L.) is an important source of oil rich in omega-3 fatty acids, which have proven health benefits and utility as an industrial raw material. Flax seeds also contain lignans, which are associated with reducing the risk of certain types of cancer. Its bast fibres have broad industrial applications. However, genomic tools needed for molecular breeding were nonexistent. Hence a project, Total Utilization Flax GENomics (TUFGEN), was initiated. We report here the first genome-wide physical map of flax and the generation and analysis of BAC-end sequences (BES) from 43,776 clones, providing initial insights into the genome. The physical map consists of 416 contigs spanning ~368 Mb, assembled from 32,025 fingerprints, representing roughly 54.5% to 99.4% of the estimated haploid genome (370-675 Mb). The N50 size of the contigs was estimated to be ~1,494 kb. The longest contig was ~5,562 kb comprising 437 clones. There were 96 contigs containing more than 100 clones. Approximately 54.6 Mb representing 8-14.8% of the genome was obtained from 80,337 BES. Annotation revealed that a large part of the genome consists of ribosomal DNA (~13.8%), followed by known transposable elements at 6.1%. Furthermore, ~7.4% of sequence was identified to harbour novel repeat elements. Homology searches against flax-ESTs and NCBI-ESTs suggested that ~5.6% of the transcriptome is unique to flax. A total of 4064 putative genomic SSRs were identified and are being developed as novel markers for their use in molecular breeding. The first genome-wide physical map of flax constructed with BAC clones provides a framework for accessing target loci with economic importance for marker development and positional cloning. Analysis of the BES has provided insights into the uniqueness of the flax genome. Compared to other plant genomes, the proportion of rDNA was found to be very high whereas the proportion of known transposable elements was low. The SSRs identified from BES will be
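
    The coverage ranges quoted in this record come from dividing a fixed span by the two ends of the genome-size estimate. The short Python sketch below redoes that division with the rounded figures from the abstract; the helper name is illustrative. Because the abstract's inputs are approximate (~368 Mb), the recomputed percentages can differ from the published ones in the last digit.

```python
def percent_of_genome(span_mb, genome_mb):
    """Percentage of a genome of size genome_mb covered by span_mb."""
    return 100.0 * span_mb / genome_mb


# Estimated haploid genome size range quoted in the abstract (Mb).
GENOME_RANGE_MB = (370, 675)

# Spans quoted in the abstract (Mb).
spans = {"physical map": 368, "BAC-end sequence": 54.6}

for label, span in spans.items():
    low = percent_of_genome(span, max(GENOME_RANGE_MB))
    high = percent_of_genome(span, min(GENOME_RANGE_MB))
    print(f"{label}: {low:.1f}% to {high:.1f}% of the genome")
```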

  9. Draft Genome Sequence of Lactobacillus sp. Strain TCF032-E4, Isolated from Fermented Radish.

    Science.gov (United States)

    Mao, Yuejian; Chen, Meng; Horvath, Philippe

    2015-07-30

    Here, we report the draft genome sequence of Lactobacillus sp. strain TCF032-E4 (= CCTCC AB2015090 = DSM 100358), isolated from a Chinese fermented radish. The total length of the 57 contigs is about 2.9 Mb, with a G+C content of 43.5 mol% and 2,797 predicted coding sequences (CDSs). Copyright © 2015 Mao et al.

  10. Direct chloroplast sequencing: comparison of sequencing platforms and analysis tools for whole chloroplast barcoding.

    Directory of Open Access Journals (Sweden)

    Marta Brozynska

    Full Text Available Direct sequencing of total plant DNA using next generation sequencing technologies generates a whole chloroplast genome sequence that has the potential to provide a barcode for use in plant and food identification. Advances in DNA sequencing platforms may make this an attractive approach for routine plant identification. The HiSeq (Illumina) and Ion Torrent (Life Technology) sequencing platforms were used to sequence total DNA from rice to identify polymorphisms in the whole chloroplast genome sequence of a wild rice plant relative to cultivated rice (cv. Nipponbare). Consensus chloroplast sequences were produced by mapping sequence reads to the reference rice chloroplast genome or by de novo assembly and mapping of the resulting contigs to the reference sequence. A total of 122 polymorphisms (SNPs and indels) between the wild and cultivated rice chloroplasts were predicted by these different sequencing and analysis methods. Of these, a total of 102 polymorphisms including 90 SNPs were predicted by both platforms. Indels were more variable with different sequencing methods, with almost all discrepancies found in homopolymers. The Ion Torrent platform gave no apparent false SNP but was less reliable for indels. The methods should be suitable for routine barcoding using appropriate combinations of sequencing platform and data analysis.

  11. Draft Genome Sequence of Escherichia coli K-12 (ATCC 10798)

    OpenAIRE

    Dimitrova, Daniela; Engelbrecht, Kathleen C.; Putonti, Catherine; Koenig, David W.; Wolfe, Alan J.

    2017-01-01

    Here, we present the draft genome sequence of Escherichia coli ATCC 10798. E. coli ATCC 10798 is a K-12 strain, one of the most well-studied model microorganisms. The size of the genome was 4,685,496 bp, with a G+C content of 50.70%. This assembly consists of 62 contigs and the F plasmid.

  12. Identification of lignin genes and regulatory sequences involved in secondary cell wall formation in Acacia auriculiformis and Acacia mangium via de novo transcriptome sequencing

    Directory of Open Access Journals (Sweden)

    Cannon Charles H

    2011-07-01

    Background: Acacia auriculiformis × Acacia mangium hybrids are commercially important trees for the timber and pulp industry in Southeast Asia. Increasing pulp yield while reducing pulping costs are major objectives of tree breeding programs. The general monolignol biosynthesis and secondary cell wall formation pathways are well-characterized but genes in these pathways are poorly characterized in Acacia hybrids. RNA-seq on short-read platforms is a rapid approach for obtaining comprehensive transcriptomic data and discovering informative sequence variants. Results: We sequenced transcriptomes of A. auriculiformis and A. mangium from non-normalized cDNA libraries synthesized from pooled young stem and inner bark tissues using paired-end libraries and a single lane of an Illumina GAII machine. De novo assembly produced a total of 42,217 and 35,759 contigs with an average length of 496 bp and 498 bp for A. auriculiformis and A. mangium, respectively. The assemblies of A. auriculiformis and A. mangium had a total length of 21,022,649 bp and 17,838,260 bp, respectively, with the largest contig 15,262 bp long. We detected all ten monolignol biosynthetic genes using Blastx and further analysis revealed 18 lignin isoforms for each species. We also identified five contigs homologous to R2R3-MYB proteins in other plant species that are involved in transcriptional regulation of secondary cell wall formation and lignin deposition. We searched the contigs against a public microRNA database and predicted the stem-loop structures of six highly conserved microRNA families (miR319, miR396, miR160, miR172, miR162 and miR168) and one legume-specific family (miR2086). Three microRNA target genes were predicted to be involved in wood formation and flavonoid biosynthesis. By using the assemblies as a reference, we discovered 16,648 and 9,335 high quality putative Single Nucleotide Polymorphisms (SNPs) in the transcriptomes of A. auriculiformis and A. mangium

  13. Development and cross-species/genera transferability of microsatellite markers discovered using 454 genome sequencing in chokecherry (Prunus virginiana L.).

    Science.gov (United States)

    Wang, Hongxia; Walla, James A; Zhong, Shaobin; Huang, Danqiong; Dai, Wenhao

    2012-11-01

    Chokecherry (Prunus virginiana L.) (2n = 4x = 32) is a unique Prunus species for both genetics and disease-resistance research due to its tetraploid nature and X-disease resistance. However, no genetic and genomic information on chokecherry is available. A partial chokecherry genome was sequenced using Roche 454 sequencing technology. A total of 145,094 reads covering 4.8 Mbp of the chokecherry genome were generated and 15,113 contigs were assembled, of which 11,675 contigs were larger than 100 bp in size. A total of 481 SSR loci were identified from 234 (out of 11,675) contigs and 246 polymerase chain reaction (PCR) primer pairs were designed. Of the 246 primer pairs, 212 (86.2 %) effectively produced amplification from the genomic DNA of chokecherry. All 212 amplifiable chokecherry primers were used to amplify genomic DNA from 11 other rosaceous species (sour cherry, sweet cherry, black cherry, peach, apricot, plum, apple, crabapple, pear, juneberry, and raspberry). Thus, chokecherry SSR primers can be transferred across Prunus species and other rosaceous species. An average of 63.2 and 58.7 % of amplifiable chokecherry primers amplified DNA from cherry and other Prunus species, respectively, while 47.2 % of amplifiable chokecherry primers amplified DNA from other rosaceous species. Using random genome sequence data generated from next-generation sequencing technology to identify microsatellite loci appears to be rapid and cost-efficient, particularly for species with no sequence information available. Sequence information and confirmed transferability of the identified chokecherry SSRs among species will be valuable for genetic research in Prunus and other rosaceous species. Key message: A total of 246 SSR primers were identified from chokecherry genome sequences. Of these, 212 were confirmed amplifiable in both chokecherry and 11 other rosaceous species.
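
The abstract above does not say how SSR loci were located in the contigs. As a rough illustration only, the sketch below scans a sequence for perfect di-, tri- and tetranucleotide repeats with a regular expression. The function name `find_ssrs` and the minimum repeat counts are assumptions made for this example, not values from the study.

```python
import re

def find_ssrs(seq, min_repeats=None):
    """Return (start, motif, repeat_count) for perfect SSRs in seq.

    min_repeats maps motif length -> minimum number of tandem copies.
    The defaults are hypothetical thresholds chosen for illustration.
    """
    if min_repeats is None:
        min_repeats = {2: 6, 3: 5, 4: 4}
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n-1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. AAAA or ATAT), which belong to a smaller motif class.
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

example = "GG" + "AT" * 7 + "CC" + "GAA" * 5 + "TT"
print(find_ssrs(example))
```

Real SSR-mining tools also handle imperfect and compound repeats; this sketch covers only exact tandem copies.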

  14. Targeted sequencing of large genomic regions with CATCH-Seq.

    Directory of Open Access Journals (Sweden)

    Kenneth Day

    Current target enrichment systems for large-scale next-generation sequencing typically require synthetic oligonucleotides used as capture reagents to isolate sequences of interest. The majority of target enrichment reagents are focused on gene coding regions or promoters en masse. Here we introduce development of a customizable targeted capture system using biotinylated RNA probe baits transcribed from sheared bacterial artificial chromosome clone templates that enables capture of large, contiguous blocks of the genome for sequencing applications. This clone adapted template capture hybridization sequencing (CATCH-Seq) procedure can be used to capture both coding and non-coding regions of a gene, and resolve the boundaries of copy number variations within a genomic target site. Furthermore, libraries constructed with methylated adapters prior to solution hybridization also enable targeted bisulfite sequencing. We applied CATCH-Seq to diverse targets ranging in size from 125 kb to 3.5 Mb. Our approach provides a simple and cost effective alternative to other capture platforms because of template-based, enzymatic probe synthesis and the lack of oligonucleotide design costs. Given its similarity in procedure, CATCH-Seq can also be performed in parallel with commercial systems.

  15. A segment of the apospory-specific genomic region is highly microsyntenic not only between the apomicts Pennisetum squamulatum and buffelgrass, but also with a rice chromosome 11 centromeric-proximal genomic region.

    Science.gov (United States)

    Gualtieri, Gustavo; Conner, Joann A; Morishige, Daryl T; Moore, L David; Mullet, John E; Ozias-Akins, Peggy

    2006-03-01

    Bacterial artificial chromosome (BAC) clones from apomicts Pennisetum squamulatum and buffelgrass (Cenchrus ciliaris), isolated with the apospory-specific genomic region (ASGR) marker ugt197, were assembled into contigs that were extended by chromosome walking. Gene-like sequences from contigs were identified by shotgun sequencing and BLAST searches, and used to isolate orthologous rice contigs. Additional gene-like sequences in the apomicts' contigs were identified by bioinformatics using fully sequenced BACs from orthologous rice contigs as templates, as well as by interspecies, whole-contig cross-hybridizations. Hierarchical contig orthology was rapidly assessed by constructing detailed long-range contig molecular maps showing the distribution of gene-like sequences and markers, and searching for microsyntenic patterns of sequence identity and spatial distribution within and across species contigs. We found microsynteny between P. squamulatum and buffelgrass contigs. Importantly, this approach also enabled us to isolate from within the rice (Oryza sativa) genome contig Rice A, which shows the highest microsynteny and is most orthologous to the ugt197-containing C1C buffelgrass contig. Contig Rice A belongs to the rice genome database contig 77 (according to the current September 12, 2003, rice fingerprint contig build) that maps proximal to the chromosome 11 centromere, a feature that interestingly correlates with the mapping of ASGR-linked BACs proximal to the centromere or centromere-like sequences. Thus, relatedness between these two orthologous contigs is supported both by their molecular microstructure and by their centromeric-proximal location. Our discoveries promote the use of a microsynteny-based positional-cloning approach using the rice genome as a template to aid in constructing the ASGR toward the isolation of genes underlying apospory.

  16. A Segment of the Apospory-Specific Genomic Region Is Highly Microsyntenic Not Only between the Apomicts Pennisetum squamulatum and Buffelgrass, But Also with a Rice Chromosome 11 Centromeric-Proximal Genomic Region

    Science.gov (United States)

    Gualtieri, Gustavo; Conner, Joann A.; Morishige, Daryl T.; Moore, L. David; Mullet, John E.; Ozias-Akins, Peggy

    2006-01-01

    Bacterial artificial chromosome (BAC) clones from apomicts Pennisetum squamulatum and buffelgrass (Cenchrus ciliaris), isolated with the apospory-specific genomic region (ASGR) marker ugt197, were assembled into contigs that were extended by chromosome walking. Gene-like sequences from contigs were identified by shotgun sequencing and BLAST searches, and used to isolate orthologous rice contigs. Additional gene-like sequences in the apomicts' contigs were identified by bioinformatics using fully sequenced BACs from orthologous rice contigs as templates, as well as by interspecies, whole-contig cross-hybridizations. Hierarchical contig orthology was rapidly assessed by constructing detailed long-range contig molecular maps showing the distribution of gene-like sequences and markers, and searching for microsyntenic patterns of sequence identity and spatial distribution within and across species contigs. We found microsynteny between P. squamulatum and buffelgrass contigs. Importantly, this approach also enabled us to isolate from within the rice (Oryza sativa) genome contig Rice A, which shows the highest microsynteny and is most orthologous to the ugt197-containing C1C buffelgrass contig. Contig Rice A belongs to the rice genome database contig 77 (according to the current September 12, 2003, rice fingerprint contig build) that maps proximal to the chromosome 11 centromere, a feature that interestingly correlates with the mapping of ASGR-linked BACs proximal to the centromere or centromere-like sequences. Thus, relatedness between these two orthologous contigs is supported both by their molecular microstructure and by their centromeric-proximal location. Our discoveries promote the use of a microsynteny-based positional-cloning approach using the rice genome as a template to aid in constructing the ASGR toward the isolation of genes underlying apospory. PMID:16415213

  17. Detecting authorized and unauthorized genetically modified organisms containing vip3A by real-time PCR and next-generation sequencing.

    Science.gov (United States)

    Liang, Chanjuan; van Dijk, Jeroen P; Scholtens, Ingrid M J; Staats, Martijn; Prins, Theo W; Voorhuijzen, Marleen M; da Silva, Andrea M; Arisi, Ana Carolina Maisonnave; den Dunnen, Johan T; Kok, Esther J

    2014-04-01

    The growing number of biotech crops with novel genetic elements increasingly complicates the detection of genetically modified organisms (GMOs) in food and feed samples using conventional screening methods. Unauthorized GMOs (UGMOs) in food and feed are currently identified through combining GMO element screening with sequencing the DNA flanking these elements. In this study, a specific and sensitive qPCR assay was developed for vip3A element detection based on the vip3Aa20 coding sequences of the recently marketed MIR162 maize and COT102 cotton. Furthermore, SiteFinding-PCR in combination with Sanger, Illumina or Pacific BioSciences (PacBio) sequencing was performed targeting the flanking DNA of the vip3Aa20 element in MIR162. De novo assembly and Basic Local Alignment Search Tool searches were used to mimic UGMO identification. PacBio data resulted in relatively long contigs in the upstream (1,326 nucleotides (nt); 95 % identity) and downstream (1,135 nt; 92 % identity) regions, whereas Illumina data resulted in two smaller contigs of 858 and 1,038 nt with higher sequence identity (>99 % identity). Both approaches outperformed Sanger sequencing, underlining the potential for next-generation sequencing in UGMO identification.

  18. Simultaneous identification of long similar substrings in large sets of sequences

    Directory of Open Access Journals (Sweden)

    Wittig Burghardt

    2007-05-01

    Background: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered. Results: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a considered maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at http://www.medicago.org/genome/assembly_table.php?chr=1. Conclusion: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
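
The window constraint described above (a bounded number of errors within every overlapping window) can be illustrated with a small sketch. This is not ClustDB's implementation, which is not given in the abstract. It is a simplified, substitution-only extension with no gaps, and the function name `extend_right` and its parameters are assumptions made for the example.

```python
from collections import deque

def extend_right(a, b, i, j, window=10, max_errors=1):
    """Extend an ungapped match rightwards from a[i] / b[j].

    Returns the number of aligned columns such that every run of
    `window` consecutive columns contains at most `max_errors`
    mismatches. Gaps are not modelled in this sketch.
    """
    recent = deque()   # mismatch flags for the last `window` columns
    errors = 0         # mismatches currently inside the sliding window
    length = 0
    while i + length < len(a) and j + length < len(b):
        mismatch = a[i + length] != b[j + length]
        if len(recent) == window:
            errors -= recent.popleft()   # oldest column leaves the window
        if errors + mismatch > max_errors:
            break                        # this column would overfill a window
        recent.append(mismatch)
        errors += mismatch
        length += 1
    return length

a = "ACGTACGTACGTACGT"
b = "ACGTACCTACGAACGT"   # differs from a at columns 6 and 11
print(extend_right(a, b, 0, 0, window=10, max_errors=1))
print(extend_right(a, b, 0, 0, window=5, max_errors=1))
```

With a window of 10 the two mismatches fall in the same window, so extension stops before the second one. With a window of 5 they never share a window and extension runs to the end.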

  19. Transcriptome sequencing of lentil based on second-generation technology permits large-scale unigene assembly and SSR marker discovery

    Directory of Open Access Journals (Sweden)

    Materne Michael

    2011-05-01

    Background: Lentil (Lens culinaris Medik.) is a cool-season grain legume which provides a rich source of protein for human consumption. In terms of genomic resources, lentil is relatively underdeveloped in comparison to other Fabaceae species, with limited available data. There is hence a significant need to enhance such resources in order to identify novel genes and alleles for molecular breeding to increase crop productivity and quality. Results: Tissue-specific cDNA samples from six distinct lentil genotypes were sequenced using Roche 454 GS-FLX Titanium technology, generating c. 1.38 × 10^6 expressed sequence tags (ESTs). De novo assembly generated a total of 15,354 contigs and 68,715 singletons. The complete unigene set was sequence-analysed against genome drafts of the model legume species Medicago truncatula and Arabidopsis thaliana to identify 12,639 and 7,476 unique matches, respectively. When compared to the genome of Glycine max, a total of 20,419 unique hits were observed corresponding to c. 31% of the known gene space. A total of 25,592 lentil unigenes were subsequently annotated from GenBank. Simple sequence repeat (SSR)-containing ESTs were identified from consensus sequences and a total of 2,393 primer pairs were designed. A subset of 192 EST-SSR markers was screened for validation across a panel of 12 cultivated lentil genotypes and one wild relative species. A total of 166 primer pairs obtained successful amplification, of which 47.5% detected genetic polymorphism. Conclusions: A substantial collection of ESTs has been developed from sequence analysis of lentil genotypes using second-generation technology, permitting unigene definition across a broad range of functional categories. As well as providing resources for functional genomics studies, the unigene set has permitted significant enhancement of the number of publicly-available molecular genetic markers as tools for improvement of this species.

  20. De Novo Assembly of Human Herpes Virus Type 1 (HHV-1) Genome, Mining of Non-Canonical Structures and Detection of Novel Drug-Resistance Mutations Using Short- and Long-Read Next Generation Sequencing Technologies.

    Science.gov (United States)

    Karamitros, Timokratis; Harrison, Ian; Piorkowska, Renata; Katzourakis, Aris; Magiorkinis, Gkikas; Mbisa, Jean Lutamyo

    2016-01-01

    Human herpesvirus type 1 (HHV-1) has a large double-stranded DNA genome of approximately 152 kbp that is structurally complex and GC-rich. This makes the assembly of HHV-1 whole genomes from short-read sequencing data technically challenging. To improve the assembly of HHV-1 genomes we have employed a hybrid genome assembly protocol using data from two sequencing technologies: the short-read Roche 454 and the long-read Oxford Nanopore MinION sequencers. We sequenced 18 HHV-1 cell culture-isolated clinical specimens collected from immunocompromised patients undergoing antiviral therapy. The susceptibility of the samples to several antivirals was determined by plaque reduction assay. Hybrid genome assembly resulted in a decrease in the number of contigs in 6 out of 7 samples and an increase in N(G)50 and N(G)75 of all 7 samples sequenced by both technologies. The approach also enhanced the detection of non-canonical contigs including a rearrangement between the unique (UL) and repeat (T/IRL) sequence regions of one sample that was not detectable by assembly of 454 reads alone. We detected several known and novel resistance-associated mutations in UL23 and UL30 genes. Genome-wide genetic variability ranged from genomes will be useful in determining genetic determinants of drug resistance, virulence, pathogenesis and viral evolution. The numerous, complex repeat regions of the HHV-1 genome currently remain a barrier towards this goal.
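
The N50/N75 and NG50/NG75 statistics mentioned above are standard assembly contiguity measures. The sketch below shows one common way to compute them from a list of contig lengths. The function name `nx` and the example lengths are made up for illustration and are not data from the study.

```python
def nx(lengths, x=50, genome_size=None):
    """Return the Nx statistic, or NGx when genome_size is given.

    Nx is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least x % of the assembly.
    NGx uses x % of an (estimated) genome size instead, and returns
    None when the assembly is too short to reach that target.
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    base = genome_size if genome_size is not None else sum(ordered)
    target = base * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return None

contigs = [80, 40, 20, 10]                  # hypothetical contig lengths
print(nx(contigs, 50))                      # N50
print(nx(contigs, 75))                      # N75
print(nx(contigs, 50, genome_size=200))     # NG50
```

Because NG50 is normalised by genome size rather than assembly size, it allows assemblies of the same genome to be compared even when their total lengths differ.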

  1. Combined DECS Analysis and Next-Generation Sequencing Enable Efficient Detection of Novel Plant RNA Viruses

    Directory of Open Access Journals (Sweden)

    Hironobu Yanagisawa

    2016-03-01

    The presence of high molecular weight double-stranded RNA (dsRNA) within plant cells is an indicator of infection with RNA viruses as these possess genomic or replicative dsRNA. DECS (dsRNA isolation, exhaustive amplification, cloning, and sequencing) analysis has been shown to be capable of detecting unknown viruses. We postulated that a combination of DECS analysis and next-generation sequencing (NGS) would improve detection efficiency and usability of the technique. Here, we describe a model case in which we efficiently detected the presumed genome sequence of Blueberry shoestring virus (BSSV), a member of the genus Sobemovirus, which has not so far been reported. dsRNAs were isolated from BSSV-infected blueberry plants using the dsRNA-binding protein, reverse-transcribed, amplified, and sequenced using NGS. A contig of 4,020 nucleotides (nt) that shared similarities with sequences from other Sobemovirus species was obtained as a candidate of the BSSV genomic sequence. Reverse transcription (RT)-PCR primer sets based on sequences from this contig enabled the detection of BSSV in all BSSV-infected plants tested but not in healthy controls. A recombinant protein encoded by the putative coat protein gene was bound by the BSSV-antibody, indicating that the candidate sequence was that of BSSV itself. Our results suggest that a combination of DECS analysis and NGS, designated here as “DECS-C,” is a powerful method for detecting novel plant viruses.

  2. AcEST: [AcEST

    Lifescience Database Archive (English)

    Full Text Available CL2194Contig1 771 2 Adiantum capillus-veneris contig: CL2194contig1 sequence. Link to clone list...apillus-veneris contig: CL2194contig1 sequence. Link to clone list Link to clone list Clone ID BP915172 DK95

  3. Dicty_cDB: Contig-U05126-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available CP000939 ) Clostridium botulinum B1 str. Okra, complete genome. 34 2.5 18 ( GE803619 ) EST_scau_evk_893885 ...scauevk mixed_tissue Sebastes... 32 2.6 3 ( AM462416 ) Vitis vinifera contig VV78X219254.19, whole genom...

  4. Enhanced Methods for Local Ancestry Assignment in Sequenced Admixed Individuals

    Science.gov (United States)

    Brown, Robert; Pasaniuc, Bogdan

    2014-01-01

    Inferring the ancestry at each locus in the genome of recently admixed individuals (e.g., Latino Americans) plays a major role in medical and population genetic inferences, ranging from finding disease-risk loci, to inferring recombination rates, to mapping missing contigs in the human genome. Although many methods for local ancestry inference have been proposed, most are designed for use with genotyping arrays and fail to make use of the full spectrum of data available from sequencing. In addition, current haplotype-based approaches are very computationally demanding, requiring long run times even for moderately large sample sizes. Here we present new methods for local ancestry inference that leverage continent-specific variants (CSVs) to attain increased performance over existing approaches in sequenced admixed genomes. A key feature of our approach is that it incorporates the admixed genomes themselves jointly with public datasets, such as 1000 Genomes, to improve the accuracy of CSV calling. We use simulations to show that our approach attains accuracy similar to widely used computationally intensive haplotype-based approaches with large decreases in runtime. Most importantly, we show that our method recovers local ancestries comparable to the 1000 Genomes consensus local ancestry calls in the real admixed individuals from the 1000 Genomes Project. We extend our approach to account for low-coverage sequencing and show that accurate local ancestry inference can be attained at low sequencing coverage. Finally, we generalize CSVs to sub-continental population-specific variants (sCSVs) and show that in some cases it is possible to determine the sub-continental ancestry for short chromosomal segments on the basis of sCSVs. PMID:24743331

  5. Statistical processing of large image sequences.

    Science.gov (United States)

    Khellah, F; Fieguth, P; Murray, M J; Allen, M

    2005-01-01

    The dynamic estimation of large-scale stochastic image sequences, as frequently encountered in remote sensing, is important in a variety of scientific applications. However, the size of such images makes conventional dynamic estimation methods, for example, the Kalman and related filters, impractical. In this paper, we present an approach that emulates the Kalman filter, but with considerably reduced computational and storage requirements. Our approach is illustrated in the context of a 512 x 512 image sequence of ocean surface temperature. The static estimation step, the primary contribution here, uses a mixture of stationary models to accurately mimic the effect of a nonstationary prior, simplifying both computational complexity and modeling. Our approach provides an efficient, stable, positive-definite model which is consistent with the given correlation structure. Thus, the methods of this paper may find application in modeling and single-frame estimation.

  6. Several Families of Sequences with Low Correlation and Large Linear Span

    Science.gov (United States)

    Zeng, Fanxin; Zhang, Zhenyu

    In DS-CDMA systems and DS-UWB radios, low correlation of spreading sequences can greatly help to minimize multiple access interference (MAI) and large linear span of spreading sequences can reduce their predictability. In this letter, new sequence sets with low correlation and large linear span are proposed. Based on the construction $\mathrm{Tr}_1^m[\mathrm{Tr}_m^n(\alpha^{bt}+\gamma_i\alpha^{dt})]^r$ for generating $p$-ary sequences of period $p^n-1$, where $n=2m$, $d=up^m\pm v$, $b=u\pm v$, $\gamma_i\in GF(p^n)$, and $p$ is an arbitrary prime number, several methods to choose the parameter $d$ are provided. The obtained sequences with family size $p^n$ are of four-valued, five-valued, six-valued or seven-valued correlation and the maximum nontrivial correlation value is $(u+v-1)p^m-1$. The simulation by a computer shows that the linear span of the new sequences is larger than that of the sequences with Niho-type and Welch-type decimations, and similar to that of [10].

  7. Draft Genome Sequence of Escherichia coli K-12 (ATCC 10798).

    Science.gov (United States)

    Dimitrova, Daniela; Engelbrecht, Kathleen C; Putonti, Catherine; Koenig, David W; Wolfe, Alan J

    2017-07-06

    Here, we present the draft genome sequence of Escherichia coli ATCC 10798. E. coli ATCC 10798 is a K-12 strain, one of the most well-studied model microorganisms. The size of the genome was 4,685,496 bp, with a G+C content of 50.70%. This assembly consists of 62 contigs and the F plasmid. Copyright © 2017 Dimitrova et al.
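
A G+C content figure such as the 50.70% reported above is the share of G and C bases among all unambiguous bases in the assembly. The sketch below shows the calculation over a list of contig sequences. The function name `gc_content` and the toy sequences are assumptions for illustration, and ambiguous bases such as N are simply ignored.

```python
def gc_content(sequences):
    """Return the G+C percentage across all sequences.

    Only A, C, G and T are counted, so ambiguity codes such as N do not
    affect the result.
    """
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        strong = s.count("G") + s.count("C")
        weak = s.count("A") + s.count("T")
        gc += strong
        total += strong + weak
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return 100.0 * gc / total

toy_contigs = ["ATGCGC", "GGNNAT"]   # made-up sequences, not real data
print(round(gc_content(toy_contigs), 2))
```

Pooling counts across contigs, rather than averaging per-contig percentages, weights each contig by its length.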

  8. Draft Genome Sequence of Leuconostoc mesenteroides P45 Isolated from Pulque, a Traditional Mexican Alcoholic Fermented Beverage.

    Science.gov (United States)

    Riveros-Mckay, Fernando; Campos, Itzia; Giles-Gómez, Martha; Bolívar, Francisco; Escalante, Adelfo

    2014-11-06

    Leuconostoc mesenteroides P45 was isolated from the traditional Mexican pulque beverage. We report its draft genome sequence, assembled in 6 contigs consisting of 1,874,188 bp and no plasmids. Genome annotation predicted a total of 1,800 genes, 1,687 coding sequences, 52 pseudogenes, 9 rRNAs, 51 tRNAs, 1 noncoding RNA, and 44 frameshifted genes. Copyright © 2014 Riveros-Mckay et al.

  9. Physical mapping and BAC-end sequence analysis provide initial insights into the flax (Linum usitatissimum L.) genome

    Directory of Open Access Journals (Sweden)

    Cloutier Sylvie

    2011-05-01

    Background: Flax (Linum usitatissimum L.) is an important source of oil rich in omega-3 fatty acids, which have proven health benefits and utility as an industrial raw material. Flax seeds also contain lignans which are associated with reducing the risk of certain types of cancer. Its bast fibres have broad industrial applications. However, genomic tools needed for molecular breeding were nonexistent. Hence a project, Total Utilization Flax GENomics (TUFGEN), was initiated. We report here the first genome-wide physical map of flax and the generation and analysis of BAC-end sequences (BES) from 43,776 clones, providing initial insights into the genome. Results: The physical map consists of 416 contigs spanning ~368 Mb, assembled from 32,025 fingerprints, representing roughly 54.5% to 99.4% of the estimated haploid genome (370-675 Mb). The N50 size of the contigs was estimated to be ~1,494 kb. The longest contig was ~5,562 kb comprising 437 clones. There were 96 contigs containing more than 100 clones. Approximately 54.6 Mb representing 8-14.8% of the genome was obtained from 80,337 BES. Annotation revealed that a large part of the genome consists of ribosomal DNA (~13.8%), followed by known transposable elements at 6.1%. Furthermore, ~7.4% of sequence was identified to harbour novel repeat elements. Homology searches against flax-ESTs and NCBI-ESTs suggested that ~5.6% of the transcriptome is unique to flax. A total of 4064 putative genomic SSRs were identified and are being developed as novel markers for their use in molecular breeding. Conclusion: The first genome-wide physical map of flax constructed with BAC clones provides a framework for accessing target loci with economic importance for marker development and positional cloning. Analysis of the BES has provided insights into the uniqueness of the flax genome. Compared to other plant genomes, the proportion of rDNA was found to be very high whereas the proportion of known transposable

  10. V-GAP: Viral genome assembly pipeline

    KAUST Repository

    Nakamura, Yoji

    2015-10-22

    Next-generation sequencing technologies have allowed the rapid determination of the complete genomes of many organisms. Although it is still difficult to reconstruct shotgun sequences from large-genome organisms into perfect contigs, each of which represents a full chromosome, those from small genomes have been assembled successfully into a very small number of contigs. In this study, we show that shotgun reads from phage genomes can be reconstructed into a single contig by controlling the number of read sequences used in de novo assembly. We have developed a pipeline to assemble small viral genomes with good reliability using a resampling method from shotgun data. This pipeline, named V-GAP (Viral Genome Assembly Pipeline), will contribute to the rapid genome typing of viruses, which are highly divergent, and thus will meet the increasing need for viral genome comparisons in metagenomic studies.

  11. V-GAP: Viral genome assembly pipeline

    KAUST Repository

    Nakamura, Yoji; Yasuike, Motoshige; Nishiki, Issei; Iwasaki, Yuki; Fujiwara, Atushi; Kawato, Yasuhiko; Nakai, Toshihiro; Nagai, Satoshi; Kobayashi, Takanori; Gojobori, Takashi; Ototake, Mitsuru

    2015-01-01

    Next-generation sequencing technologies have allowed the rapid determination of the complete genomes of many organisms. Although it is still difficult to reconstruct shotgun sequences from large-genome organisms into perfect contigs, each of which represents a full chromosome, those from small genomes have been assembled successfully into a very small number of contigs. In this study, we show that shotgun reads from phage genomes can be reconstructed into a single contig by controlling the number of read sequences used in de novo assembly. We have developed a pipeline to assemble small viral genomes with good reliability using a resampling method from shotgun data. This pipeline, named V-GAP (Viral Genome Assembly Pipeline), will contribute to the rapid genome typing of viruses, which are highly divergent, and thus will meet the increasing need for viral genome comparisons in metagenomic studies.
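
The abstract describes controlling the number of reads passed to the assembler through resampling, but gives no procedural detail. The sketch below shows only the generic idea of drawing several fixed-size random subsets of reads, each of which could then be assembled separately. It is not V-GAP's actual procedure. The function name `resample_reads`, the subset sizes and the seed are assumptions, and no assembler is invoked.

```python
import random

def resample_reads(reads, subset_size, n_subsets, seed=0):
    """Draw n_subsets random subsets of subset_size reads each.

    Sampling is without replacement within a subset. A fixed seed makes
    the draw reproducible. Each subset would be handed to a de novo
    assembler in a real pipeline, which is outside this sketch.
    """
    if subset_size > len(reads):
        raise ValueError("subset_size exceeds the number of reads")
    rng = random.Random(seed)
    return [rng.sample(reads, subset_size) for _ in range(n_subsets)]

reads = ["read%03d" % i for i in range(100)]   # placeholder read IDs
subsets = resample_reads(reads, subset_size=10, n_subsets=3)
for subset in subsets:
    print(len(subset), subset[:3])
```

Comparing the assemblies from several subsets is one plausible way to judge reliability, though the abstract does not state how V-GAP does this.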

  12. Using relational databases for improved sequence similarity searching and large-scale genomic analyses.

    Science.gov (United States)

    Mackey, Aaron J; Pearson, William R

    2004-10-01

    Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. Relational databases are essential for management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large-scale genomic analyses of homology-related data. This unit describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, how to extend and use seqdb_demo for the storage of sequence similarity search results and making use of various kinds of stored search results to address aspects of comparative genomic analysis.
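
To make the idea above concrete, the sketch below stores sequences and similarity-search hits in two tables, then extracts a library subset and filters stored hits. It uses Python's built-in `sqlite3` module as a stand-in. The schema, the table and column names, and the data are all hypothetical and are not the actual seqdb_demo schema, which the abstract does not describe.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE protein (
            id INTEGER PRIMARY KEY, name TEXT, taxon TEXT, seq TEXT);
        CREATE TABLE hit (
            query_id  INTEGER REFERENCES protein(id),
            target_id INTEGER REFERENCES protein(id),
            evalue    REAL);
    """)
    con.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", [
        (1, "protA", "E. coli", "MKT"),
        (2, "protB", "E. coli", "MAL"),
        (3, "protC", "H. sapiens", "MSS"),
    ])
    con.executemany("INSERT INTO hit VALUES (?, ?, ?)", [
        (1, 2, 1e-30), (1, 3, 0.5), (2, 3, 1e-5),
    ])
    return con

def library_subset(con, taxon):
    """Return a FASTA-formatted sequence library restricted to one taxon."""
    rows = con.execute(
        "SELECT name, seq FROM protein WHERE taxon = ? ORDER BY id", (taxon,))
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in rows)

def significant_hits(con, max_evalue):
    """Return (query, target, evalue) for stored hits at or below a cutoff."""
    return con.execute("""
        SELECT q.name, t.name, h.evalue
        FROM hit h
        JOIN protein q ON q.id = h.query_id
        JOIN protein t ON t.id = h.target_id
        WHERE h.evalue <= ?
        ORDER BY h.evalue""", (max_evalue,)).fetchall()

con = build_demo_db()
print(library_subset(con, "E. coli"))
print(significant_hits(con, 1e-3))
```

Keeping hits in a table means that new questions, such as a different cutoff or a per-taxon breakdown, become queries rather than repeated searches.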

  13. Draft Genome Sequence of the Sordariomycete Lecythophora (Coniochaeta) hoffmannii CBS 245.38.

    Science.gov (United States)

    Leonhardt, Sabrina; Büttner, Enrico; Gebauer, Anna Maria; Hofrichter, Martin; Kellner, Harald

    2018-02-15

    Lecythophora (Coniochaeta) hoffmannii, a soil- and lignocellulose-inhabiting sordariomycete (Ascomycota) that can also live as a facultative tree pathogen causing soft rot, belongs to the family Coniochaetaceae. The strain CBS 245.38 sequenced here was assembled into 869 contigs, has a size of 30.8 Mb, and comprises 10,596 predicted protein-coding genes. Copyright © 2018 Leonhardt et al.

  14. Dicty_cDB: Contig-U11311-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available A from... 58 6e-07 2 ( AM481444 ) Vitis vinifera contig VV78X144362.4, whole genome... 68 1e-06 1 ( AY604469 ) Prodonto...117 4e-27 AY604469_1( AY604469 |pid:none) Prodontorhabditis wirthi strain DF... 125 5e-27 ( P25202 ) RecName

  15. High-throughput sequencing of three Lemnoideae (duckweeds) chloroplast genomes from total DNA.

    Directory of Open Access Journals (Sweden)

    Wenqin Wang

    Full Text Available BACKGROUND: Chloroplast genomes provide a wealth of information for evolutionary and population genetic studies. Chloroplasts play a particularly important role in the adaptation of aquatic plants because they float on water and their major surface is exposed continuously to sunlight. The subfamily Lemnoideae is such a collection of aquatic species and, because of photosynthesis, includes some of the fastest-growing plant species on earth. METHODS: We sequenced the chloroplast genomes from three different genera of Lemnoideae, Spirodela polyrhiza, Wolffiella lingulata and Wolffia australiana, by high-throughput DNA sequencing of genomic DNA using the SOLiD platform. Unfractionated total DNA contains high copies of plastid DNA so that sequences from the nucleus and mitochondria can easily be filtered computationally. Remaining sequence reads were assembled into contiguous sequences (contigs) using SOLiD software tools. Contigs were mapped to a reference genome of Lemna minor and gaps, selected by PCR, were sequenced on the ABI3730xl platform. CONCLUSIONS: This combinatorial approach yielded whole genomic contiguous sequences in a cost-effective manner. Over 1,000-fold coverage of the chloroplast genome from total DNA was reached by the SOLiD platform in a single spot on a quadrant slide without purification. Comparative analysis indicated that the chloroplast genome was conserved in gene number and organization with respect to the reference genome of L. minor. However, higher nucleotide substitution, abundant deletions and insertions occurred in non-coding regions of these genomes, indicating greater genomic dynamics than expected from the comparison of other related species in the Pooideae. Noticeably, there was no transition bias over transversion in Lemnoideae. The data should have immediate applications in evolutionary biology and plant taxonomy with increased resolution and statistical power.

  16. Dramatic improvement in genome assembly achieved using doubled-haploid genomes.

    Science.gov (United States)

    Zhang, Hong; Tan, Engkong; Suzuki, Yutaka; Hirose, Yusuke; Kinoshita, Shigeharu; Okano, Hideyuki; Kudoh, Jun; Shimizu, Atsushi; Saito, Kazuyoshi; Watabe, Shugo; Asakawa, Shuichi

    2014-10-27

    Improvement in de novo assembly of large genomes is still to be desired. Here, we improved draft genome sequence quality by employing doubled-haploid individuals. We sequenced wildtype and doubled-haploid Takifugu rubripes genomes, under the same conditions, using the Illumina platform and assembled contigs with SOAPdenovo2. We observed 5.4-fold and 2.6-fold improvement in the sizes of the N50 contig and scaffold of doubled-haploid individuals, respectively, compared to the wildtype, indicating that the use of a doubled-haploid genome aids in accurate genome analysis.
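
    For readers unfamiliar with the N50 statistic used above, a minimal Python sketch follows. It uses the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length); the function name and the toy lengths are illustrative assumptions, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy example: total length is 100, so half is 50.
# 40 alone covers 40; 40 + 30 covers 70, which passes 50, so N50 is 30.
toy_n50 = n50([10, 20, 30, 40])
```

    A fold improvement such as the one reported is then simply the ratio of the two N50 values computed on the doubled-haploid and wildtype assemblies.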

  17. Deep Sequencing-Identified Kanamycin-Resistant Paenibacillus sp. Strain KS1 Isolated from Epiphyte Tillandsia usneoides (Spanish Moss) in Central Florida, USA.

    Science.gov (United States)

    Lata, Pushpa; Govindarajan, Subramaniam S; Qi, Feng; Li, Jian-Liang; Sahoo, Malaya K

    2017-02-02

    Paenibacillus sp. strain KS1 was isolated from an epiphyte, Tillandsia usneoides (Spanish moss), in central Florida, USA. Here, we report a draft genome sequence of this strain, which consists of a total of 398 contigs spanning 6,508,195 bp, with a G+C content of 46.5% and comprising 5,401 predicted coding sequences. Copyright © 2017 Lata et al.

  18. Mapping of Micro-Tom BAC-End Sequences to the Reference Tomato Genome Reveals Possible Genome Rearrangements and Polymorphisms

    Science.gov (United States)

    Asamizu, Erika; Shirasawa, Kenta; Hirakawa, Hideki; Sato, Shusei; Tabata, Satoshi; Yano, Kentaro; Ariizumi, Tohru; Shibata, Daisuke; Ezura, Hiroshi

    2012-01-01

    A total of 93,682 BAC-end sequences (BESs) were generated from a dwarf model tomato, cv. Micro-Tom. After removing repetitive sequences, the BESs were similarity searched against the reference tomato genome of a standard cultivar, “Heinz 1706.” By referring to the “Heinz 1706” physical map and by eliminating redundant or nonsignificant hits, 28,804 “unique pair ends” and 8,263 “unique ends” were selected to construct hypothetical BAC contigs. The total physical length of the BAC contigs was 495,833,423 bp, covering 65.3% of the entire genome. The average coverage of euchromatin and heterochromatin was 58.9% and 67.3%, respectively. From this analysis, two possible genome rearrangements were identified: one in chromosome 2 (inversion) and the other in chromosome 3 (inversion and translocation). Polymorphisms (SNPs and Indels) between the two cultivars were identified from the BLAST alignments. As a result, 171,792 polymorphisms were mapped on 12 chromosomes. Among these, 30,930 polymorphisms were found in euchromatin (1 per 3,565 bp) and 140,862 were found in heterochromatin (1 per 2,737 bp). The average polymorphism density in the genome was 1 polymorphism per 2,886 bp. To facilitate the use of these data in Micro-Tom research, the BAC contig and polymorphism information are available in the TOMATOMICS database. PMID:23227037
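
    The density figures above can be cross-checked with a few lines of Python arithmetic. The abstract does not state which length was used as the denominator; the sketch assumes it is the total BAC contig length, an assumption that reproduces the reported genome-wide value. Variable names are illustrative.

```python
# Figures quoted in the abstract.
total_contig_bp = 495_833_423
euchromatin_polymorphisms = 30_930
heterochromatin_polymorphisms = 140_862

total_polymorphisms = euchromatin_polymorphisms + heterochromatin_polymorphisms

# Average spacing, assuming the denominator is the total BAC contig length.
bp_per_polymorphism = round(total_contig_bp / total_polymorphisms)

# Length implied by the two regional densities (count times reported spacing).
implied_bp = (
    euchromatin_polymorphisms * 3_565
    + heterochromatin_polymorphisms * 2_737
)
relative_gap = abs(implied_bp - total_contig_bp) / total_contig_bp
```

    The two counts sum to the reported 171,792, the spacing rounds to 2,886 bp, and the length implied by the regional densities agrees with the contig total to well under 0.1%.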

  19. BESST--efficient scaffolding of large fragmented assemblies.

    Science.gov (United States)

    Sahlin, Kristoffer; Vezzi, Francesco; Nystedt, Björn; Lundeberg, Joakim; Arvestad, Lars

    2014-08-15

    The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance. We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST compares favorably with the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when library insert size distribution is wide. We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.

  20. Fast and simple protein-alignment-guided assembly of orthologous gene families from microbiome sequencing reads.

    Science.gov (United States)

    Huson, Daniel H; Tappu, Rewati; Bazinet, Adam L; Xie, Chao; Cummings, Michael P; Nieselt, Kay; Williams, Rohan

    2017-01-25

    Microbiome sequencing projects typically collect tens of millions of short reads per sample. Depending on the goals of the project, the short reads can either be subjected to direct sequence analysis or be assembled into longer contigs. The assembly of whole genomes from metagenomic sequencing reads is a very difficult problem. However, for some questions, only specific genes of interest need to be assembled. This is then a gene-centric assembly where the goal is to assemble reads into contigs for a family of orthologous genes. We present a new method for performing gene-centric assembly, called protein-alignment-guided assembly, and provide an implementation in our metagenome analysis tool MEGAN. Genes are assembled on the fly, based on the alignment of all reads against a protein reference database such as NCBI-nr. Specifically, the user selects a gene family based on a classification such as KEGG and all reads binned to that gene family are assembled. Using published synthetic community metagenome sequencing reads and a set of 41 gene families, we show that the performance of this approach compares favorably with that of full-featured assemblers and that of a recently published HMM-based gene-centric assembler, both in terms of the number of reference genes detected and of the percentage of reference sequence covered. Protein-alignment-guided assembly of orthologous gene families complements whole-metagenome assembly in a new and very useful way.

  1. Genomic library screening for viruses from the human dental plaque revealed pathogen-specific lytic phage sequences.

    Science.gov (United States)

    Al-Jarbou, Ahmed Nasser

    2012-01-01

    Bacterial pathogenesis presents an astounding arsenal of virulence factors that allow bacteria to conquer many different niches throughout the course of infection. Particularly fascinating is the fact that some bacterial species are able to induce different diseases by expression of different combinations of virulence factors. Nevertheless, studies aiming at screening for the presence of bacteriophages in humans have been limited. Such screening procedures would eventually lead to identification of phage-encoded properties that impart increased bacterial fitness and/or virulence in a particular niche, and hence could potentially be used to reverse the course of bacterial infections. The human oral cavity represents a rich and dynamic ecosystem for several upper respiratory tract pathogens. However, little is known about virus diversity in human dental plaque, which is an important reservoir. We applied a culture-independent approach to characterize virus diversity in human dental plaque, making a library from a virus DNA fraction amplified using a multiple displacement method, and sequenced 80 clones. The resulting sequences showed 44% significant identities to GenBank databases by TBLASTX analysis. TBLAST homology comparisons showed that 66% was viral; 18% eukarya; 10% bacterial; 6% mobile elements. These sequences were sorted into 6 contigs and 45 single sequences, of which 4 contigs and a single sequence showed significant identity to a small region of a putative prophage in the Corynebacterium diphtheriae genome. These findings interestingly highlight the uniqueness of over half of the sequences, whilst the dominance of pathogen-specific prophage sequences implies their role in virulence.

  2. Comparative genomic mapping of the bovine Fragile Histidine Triad (FHIT) tumour suppressor gene: characterization of a 2 Mb BAC contig covering the locus, complete annotation of the gene, analysis of cDNA and of physiological expression profiles

    Directory of Open Access Journals (Sweden)

    Boussaha Mekki

    2006-05-01

    Full Text Available Abstract Background: The Fragile Histidine Triad gene (FHIT) is an oncosuppressor implicated in many human cancers, including vesical tumors. FHIT is frequently hit by deletions caused by fragility at FRA3B, the most active of human common fragile sites, where FHIT lies. Vesical tumors also affect cattle, including animals grazing in the wild on bracken fern; compounds released by the fern are known to induce chromosome fragility and may trigger cancer with the interplay of latent Papilloma virus. Results: The bovine FHIT was characterized by assembling a contig of 78 BACs. Sequence tags were designed on human exons and introns and used directly to select bovine BACs, or compared with sequence data in the bovine genome database or in the trace archive of the bovine genome sequencing project, and adapted before use. FHIT is split into ten exons as in man, with exons 5 to 9 coding for a 149-amino-acid protein. VISTA global alignments between bovine genomic contigs retrieved from the bovine genome database and the human FHIT region were performed. Conservation was extremely high over a 2 Mb region spanning the whole FHIT locus, including the size of introns. Thus, the bovine FHIT covers about 1.6 Mb compared to 1.5 Mb in man. Expression was analyzed by RT-PCR and Northern blot, and was found to be ubiquitous. Four cDNA isoforms were isolated and sequenced; they originate from alternative usage of three variants of exon 4 and have a size very close to that of the major human FHIT cDNAs. Conclusion: A comparative genomic approach allowed the assembly of a contig of 78 BACs and the complete annotation of a 1.6 Mb region spanning the bovine FHIT gene. The findings confirmed the very high level of conservation between human and bovine genomes and the importance of comparative mapping to speed the annotation process of the recently sequenced bovine genome. The detailed knowledge of the genomic FHIT region will allow study of the role of FHIT in bovine cancerogenesis

  3. Comparative genomic mapping of the bovine Fragile Histidine Triad (FHIT) tumour suppressor gene: characterization of a 2 Mb BAC contig covering the locus, complete annotation of the gene, analysis of cDNA and of physiological expression profiles.

    Science.gov (United States)

    Uboldi, Cristina; Guidi, Elena; Roperto, Sante; Russo, Valeria; Roperto, Franco; Di Meo, Giulia Pia; Iannuzzi, Leopoldo; Floriot, Sandrine; Boussaha, Mekki; Eggen, André; Ferretti, Luca

    2006-05-23

    The Fragile Histidine Triad gene (FHIT) is an oncosuppressor implicated in many human cancers, including vesical tumors. FHIT is frequently hit by deletions caused by fragility at FRA3B, the most active of human common fragile sites, where FHIT lies. Vesical tumors also affect cattle, including animals grazing in the wild on bracken fern; compounds released by the fern are known to induce chromosome fragility and may trigger cancer with the interplay of latent Papilloma virus. The bovine FHIT was characterized by assembling a contig of 78 BACs. Sequence tags were designed on human exons and introns and used directly to select bovine BACs, or compared with sequence data in the bovine genome database or in the trace archive of the bovine genome sequencing project, and adapted before use. FHIT is split into ten exons as in man, with exons 5 to 9 coding for a 149-amino-acid protein. VISTA global alignments between bovine genomic contigs retrieved from the bovine genome database and the human FHIT region were performed. Conservation was extremely high over a 2 Mb region spanning the whole FHIT locus, including the size of introns. Thus, the bovine FHIT covers about 1.6 Mb compared to 1.5 Mb in man. Expression was analyzed by RT-PCR and Northern blot, and was found to be ubiquitous. Four cDNA isoforms were isolated and sequenced; they originate from alternative usage of three variants of exon 4 and have a size very close to that of the major human FHIT cDNAs. A comparative genomic approach allowed the assembly of a contig of 78 BACs and the complete annotation of a 1.6 Mb region spanning the bovine FHIT gene. The findings confirmed the very high level of conservation between human and bovine genomes and the importance of comparative mapping to speed the annotation process of the recently sequenced bovine genome. The detailed knowledge of the genomic FHIT region will allow study of the role of FHIT in bovine cancerogenesis, especially of vesical papillomavirus-associated cancers of

  4. Genome sequencing of bacteria: sequencing, de novo assembly and rapid analysis using open source tools.

    Science.gov (United States)

    Kisand, Veljo; Lettieri, Teresa

    2013-04-01

    De novo genome sequencing of previously uncharacterized microorganisms has the potential to open up new frontiers in microbial genomics by providing insight into both functional capabilities and biodiversity. Until recently, Roche 454 pyrosequencing was the NGS method of choice for de novo assembly because it generates hundreds of thousands of long reads (tools for processing NGS data are increasingly free and open source and are often adopted for both their high quality and role in promoting academic freedom. The error rate of pyrosequencing the Alcanivorax borkumensis genome was such that thousands of insertions and deletions were artificially introduced into the finished genome. Despite a high coverage (~30 fold), it did not allow the reference genome to be fully mapped. Reads from regions with errors had low quality, low coverage, or were missing. The main defect of the reference mapping was the introduction of artificial indels into contigs through lower than 100% consensus and distracting gene calling due to artificial stop codons. No assembler was able to perform de novo assembly comparable to reference mapping. Automated annotation tools performed similarly on reference mapped and de novo draft genomes, and annotated most CDSs in the de novo assembled draft genomes. Free and open source software (FOSS) tools for assembly and annotation of NGS data are being developed rapidly to provide accurate results with less computational effort. Usability is not high priority and these tools currently do not allow the data to be processed without manual intervention. Despite this, genome assemblers now readily assemble medium short reads into long contigs (>97-98% genome coverage). A notable gap in pyrosequencing technology is the quality of base pair calling and conflicting base pairs between single reads at the same nucleotide position. Regardless, using draft whole genomes that are not finished and remain fragmented into tens of contigs allows one to characterize

  5. Dicty_cDB: Contig-U16102-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available 0 6 ( BJ408668 ) Dictyostelium discoideum cDNA clone:dds46g14, 3' ... 44 3.0 2 ( CV162186 ) CS_hyp_01d11_M13Reverse Blue crab hypoder...08_M13Reverse Blue crab hypodermis, nor... 42 3.6 2 ( AM474408 ) Vitis vinifera contig VV78X173370.5, whole

  6. BAC-end sequence-based SNPs and Bin mapping for rapid integration of physical and genetic maps in apple.

    Science.gov (United States)

    Han, Yuepeng; Chagné, David; Gasic, Ksenija; Rikkerink, Erik H A; Beever, Jonathan E; Gardiner, Susan E; Korban, Schuyler S

    2009-03-01

    A genome-wide BAC physical map of the apple, Malus x domestica Borkh., has been recently developed. Here, we report on integrating the physical and genetic maps of the apple using a SNP-based approach in conjunction with bin mapping. Briefly, BAC clones located at the ends of BAC contigs were selected and sequenced at both ends. The BAC end sequences (BESs) were used to identify candidate SNPs. Subsequently, these candidate SNPs were genetically mapped using a bin mapping strategy for the purpose of mapping the physical onto the genetic map. Using this approach, 52 (23%) out of 228 BESs tested were successfully exploited to develop SNPs. These SNPs anchored 51 contigs, spanning approximately 37 Mb in cumulative physical length, onto 14 linkage groups. The reliability of the integration of the physical and genetic maps using this SNP-based strategy is described, and the results confirm the feasibility of this approach to construct integrated physical and genetic maps for apple.

  7. MerCat: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from metagenomic and/or metatranscriptomic sequencing data

    Energy Technology Data Exchange (ETDEWEB)

    White, Richard A.; Panyala, Ajay R.; Glass, Kevin A.; Colby, Sean M.; Glaesemann, Kurt R.; Jansson, Georg C.; Jansson, Janet K.

    2017-02-21

    MerCat is a parallel, highly scalable and modular property software package for robust analysis of features in next-generation sequencing data. MerCat inputs include assembled contigs and raw sequence reads from any platform resulting in feature abundance counts tables. MerCat allows for direct analysis of data properties without reference sequence database dependency commonly used by search tools such as BLAST and/or DIAMOND for compositional analysis of whole community shotgun sequencing (e.g. metagenomes and metatranscriptomes).
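
    As an illustration of reference-free k-mer counting in general, a minimal Python sketch is shown here. It is not MerCat's implementation or interface; the function name, the choice to skip k-mers containing non-ACGT characters, and the toy sequences are assumptions made for this example.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    valid = set("ACGT")
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers with ambiguous bases such as N (an assumption here).
            if set(kmer) <= valid:
                counts[kmer] += 1
    return counts


# Toy input: one contig-like sequence and one short read with an ambiguous base.
counts = count_kmers(["ACGTACGT", "acgtn"], k=4)
```

    The resulting table of k-mer abundances is the kind of feature count that can be compared between samples without aligning anything to a reference database.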

  8. Pms2 suppresses large expansions of the (GAA·TTC)n sequence in neuronal tissues.

    Directory of Open Access Journals (Sweden)

    Rebecka L Bourn

    Full Text Available Expanded trinucleotide repeat sequences are the cause of several inherited neurodegenerative diseases. Disease pathogenesis is correlated with several features of somatic instability of these sequences, including further large expansions in postmitotic tissues. The presence of somatic expansions in postmitotic tissues is consistent with DNA repair being a major determinant of somatic instability. Indeed, proteins in the mismatch repair (MMR) pathway are required for instability of the expanded (CAG·CTG)n sequence, likely via recognition of intrastrand hairpins by MutSβ. It is not clear if or how MMR would affect instability of disease-causing expanded trinucleotide repeat sequences that adopt secondary structures other than hairpins, such as the triplex/R-loop forming (GAA·TTC)n sequence that causes Friedreich ataxia. We analyzed somatic instability in transgenic mice that carry an expanded (GAA·TTC)n sequence in the context of the human FXN locus and lack the individual MMR proteins Msh2, Msh6 or Pms2. The absence of Msh2 or Msh6 resulted in a dramatic reduction in somatic mutations, indicating that mammalian MMR promotes instability of the (GAA·TTC)n sequence via MutSα. The absence of Pms2 resulted in increased accumulation of large expansions in the nervous system (cerebellum, cerebrum, and dorsal root ganglia) but not in non-neuronal tissues (heart and kidney), without affecting the prevalence of contractions. Pms2 suppressed large expansions specifically in tissues showing MutSα-dependent somatic instability, suggesting that they may act on the same lesion or structure associated with the expanded (GAA·TTC)n sequence. We conclude that Pms2 specifically suppresses large expansions of a pathogenic trinucleotide repeat sequence in neuronal tissues, possibly acting independently of the canonical MMR pathway.

  9. Physical mapping of a large plant genome using global high-information-content-fingerprinting: the distal region of the wheat ancestor Aegilops tauschii chromosome 3DS

    Directory of Open Access Journals (Sweden)

    You Frank M

    2010-06-01

    Full Text Available Abstract Background: Physical maps employing libraries of bacterial artificial chromosome (BAC) clones are essential for comparative genomics and sequencing of large and repetitive genomes such as those of the hexaploid bread wheat. The diploid ancestor of the D-genome of hexaploid wheat (Triticum aestivum), Aegilops tauschii, is used as a resource for wheat genomics. The barley diploid genome also provides a good model for the Triticeae and T. aestivum since it is only slightly larger than the ancestor wheat D genome. Gene co-linearity between the grasses can be exploited by extrapolating from rice and Brachypodium distachyon to Ae. tauschii or barley, and then to wheat. Results: We report the use of Ae. tauschii for the construction of the physical map of a large distal region of chromosome arm 3DS. A physical map of 25.4 Mb was constructed by anchoring BAC clones of Ae. tauschii with 85 ESTs on the Ae. tauschii and barley genetic maps. The 24 contigs were aligned to the rice and B. distachyon genomic sequences and a high density SNP genetic map of barley. As expected, the mapped region is highly collinear to the orthologous chromosome 1 in rice, chromosome 2 in B. distachyon and chromosome 3H in barley. However, the chromosome scale of the comparative maps presented provides new insights into grass genome organization. The disruptions of the Ae. tauschii-rice and Ae. tauschii-Brachypodium syntenies were identical. We observed chromosomal rearrangements between Ae. tauschii and barley. The comparison of Ae. tauschii physical and genetic maps showed that the recombination rate across the region dropped from 2.19 cM/Mb in the distal region to 0.09 cM/Mb in the proximal region. The size of the gaps between contigs was evaluated by comparing the recombination rate along the map with the local recombination rates calculated on single contigs. Conclusions: The physical map reported here is the first physical map using fingerprinting of a complete

  10. BioNano genome mapping of individual chromosomes supports physical mapping and sequence assembly in complex plant genomes.

    Science.gov (United States)

    Staňková, Helena; Hastie, Alex R; Chan, Saki; Vrána, Jan; Tulpová, Zuzana; Kubaláková, Marie; Visendi, Paul; Hayashi, Satomi; Luo, Mingcheng; Batley, Jacqueline; Edwards, David; Doležel, Jaroslav; Šimková, Hana

    2016-07-01

    The assembly of a reference genome sequence of bread wheat is challenging due to its specific features such as the genome size of 17 Gbp, polyploid nature and prevalence of repetitive sequences. BAC-by-BAC sequencing based on chromosomal physical maps, adopted by the International Wheat Genome Sequencing Consortium as the key strategy, reduces problems caused by the genome complexity and polyploidy, but the repeat content still hampers the sequence assembly. Availability of a high-resolution genomic map to guide sequence scaffolding and validate physical map and sequence assemblies would be highly beneficial to obtaining an accurate and complete genome sequence. Here, we chose the short arm of chromosome 7D (7DS) as a model to demonstrate for the first time that it is possible to couple chromosome flow sorting with genome mapping in nanochannel arrays and create a de novo genome map of a wheat chromosome. We constructed a high-resolution chromosome map composed of 371 contigs with an N50 of 1.3 Mb. Long DNA molecules achieved by our approach facilitated chromosome-scale analysis of repetitive sequences and revealed a ~800-kb array of tandem repeats intractable to current DNA sequencing technologies. Anchoring 7DS sequence assemblies obtained by clone-by-clone sequencing to the 7DS genome map provided a valuable tool to improve the BAC-contig physical map and validate sequence assembly on a chromosome-arm scale. Our results indicate that creating genome maps for the whole wheat genome in a chromosome-by-chromosome manner is feasible and that they will be an affordable tool to support the production of improved pseudomolecules. © 2016 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.

  11. Transcriptome sequencing and annotation for the Jamaican fruit bat (Artibeus jamaicensis).

    Directory of Open Access Journals (Sweden)

    Timothy I Shaw

    Full Text Available The Jamaican fruit bat (Artibeus jamaicensis) is one of the most common bats in the tropical Americas. It is thought to be a potential reservoir host of Tacaribe virus, an arenavirus closely related to the South American hemorrhagic fever viruses. We performed transcriptome sequencing and annotation from lung, kidney and spleen tissues using 454 and Illumina platforms to develop this species as an animal model. More than 100,000 contigs were assembled, with 25,000 genes that were functionally annotated. Of the remaining unannotated contigs, 80% were found within bat genomes or transcriptomes. Annotated genes are involved in a broad range of activities ranging from cellular metabolism to genome regulation through ncRNAs. Reciprocal BLAST best hits yielded 8,785 sequences that are orthologous to mouse, rat, cattle, horse and human. Species tree analysis of sequences from 2,378 loci was used to achieve 95% bootstrap support for the placement of bat as sister to the clade containing horse, dog, and cattle. Through substitution rate estimation between bat and human, 32 genes were identified with evidence for positive selection. We also identified 466 immune-related genes, which may be useful for studying Tacaribe virus infection of this species. The Jamaican fruit bat transcriptome dataset is a resource that should provide additional candidate markers for studying bat evolution and ecology, and tools for analysis of the host response and pathology of disease.

  12. Construction of an American mink Bacterial Artificial Chromosome (BAC) library and sequencing candidate genes important for the fur industry

    Directory of Open Access Journals (Sweden)

    Christensen Knud

    2011-07-01

    Full Text Available Abstract Background: Bacterial artificial chromosome (BAC) libraries continue to be invaluable tools for the genomic analysis of complex organisms. Complemented by the new and fast-growing deep sequencing technologies, they provide an excellent source of information in genomics projects. Results: Here, we report the construction and characterization of the CHORI-231 BAC library constructed from a Danish-farmed, male American mink (Neovison vison). The library contains approximately 165,888 clones with an average insert size of 170 kb, representing approximately 10-fold coverage. High-density filters, each consisting of 18,432 clones spotted in duplicate, have been produced for hybridization screening and are publicly available. Overgo probes derived from expressed sequence tags (ESTs), representing 21 candidate genes for traits important for the mink industry, were used to screen the BAC library. These included candidate genes for coat coloring, hair growth and length, coarseness, and some receptors potentially involved in viral diseases in mink. The extensive screening yielded positive results for 19 of these genes. Thirty-five clones corresponding to 19 genes were sequenced using 454 Roche, and large contigs (184 kb on average) were assembled. Knowing the complete sequences of these candidate genes will enable confirmation of the association with a phenotype and the finding of causative mutations for the targeted phenotypes. Additionally, 1577 BAC clones were end sequenced; 2505 BAC end sequences (80% of BACs) were obtained. An excess of 2 Mb has been analyzed, thus giving a snapshot of the mink genome. Conclusions: The availability of the CHORI-231 American mink BAC library will aid in identification of genes and genomic regions of interest. We have demonstrated how the library can be used to identify specific genes of interest, develop genetic markers, and for BAC end sequencing and deep sequencing of selected clones. To our knowledge, this is the

  13. High-throughput sequencing and pathway analysis reveal alteration of the pituitary transcriptome by 17α-ethynylestradiol (EE2) in female coho salmon, Oncorhynchus kisutch

    Energy Technology Data Exchange (ETDEWEB)

    Harding, Louisa B. [School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA 98195 (United States); Schultz, Irvin R. [Battelle, Marine Sciences Laboratory – Pacific Northwest National Laboratory, 1529 West Sequim Bay Road, Sequim, WA 98382 (United States); Goetz, Giles W. [School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA 98195 (United States); Luckenbach, J. Adam [Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, 2725 Montlake Blvd E, Seattle, WA 98112 (United States); Center for Reproductive Biology, Washington State University, Pullman, WA 98164 (United States); Young, Graham [School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA 98195 (United States); Center for Reproductive Biology, Washington State University, Pullman, WA 98164 (United States); Goetz, Frederick W. [Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Manchester Research Station, P.O. Box 130, Manchester, WA 98353 (United States); Swanson, Penny, E-mail: penny.swanson@noaa.gov [Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, 2725 Montlake Blvd E, Seattle, WA 98112 (United States); Center for Reproductive Biology, Washington State University, Pullman, WA 98164 (United States)

    2013-10-15

    Highlights: •Studied impacts of ethynylestradiol (EE2) exposure on salmon pituitary transcriptome. •High-throughput sequencing, RNAseq, and pathway analysis were performed. •EE2 altered mRNAs for genes in circadian rhythm, GnRH, and TGFβ signaling pathways. •LH and FSH beta subunit mRNAs were most highly up- and down-regulated by EE2, respectively. •Estrogens may alter processes associated with reproductive timing in salmon. -- Abstract: Considerable research has been done on the effects of endocrine disrupting chemicals (EDCs) on reproduction and gene expression in the brain, liver and gonads of teleost fish, but information on impacts to the pituitary gland is still limited despite its central role in regulating reproduction. The aim of this study was to further our understanding of the potential effects of natural and synthetic estrogens on the brain–pituitary–gonad axis in fish by determining the effects of 17α-ethynylestradiol (EE2) on the pituitary transcriptome. We exposed sub-adult coho salmon (Oncorhynchus kisutch) to 0 or 12 ng EE2/L for up to 6 weeks and effects on the pituitary transcriptome of females were assessed using high-throughput Illumina® sequencing, RNA-Seq and pathway analysis. After 1 or 6 weeks, 218 and 670 contiguous sequences (contigs), respectively, were differentially expressed in pituitaries of EE2-exposed fish relative to control. Two of the most highly up- and down-regulated contigs were luteinizing hormone β subunit (241-fold and 395-fold at 1 and 6 weeks, respectively) and follicle-stimulating hormone β subunit (−3.4-fold at 6 weeks). Additional contigs related to gonadotropin synthesis and release were differentially expressed in EE2-exposed fish relative to controls. These included contigs involved in gonadotropin releasing hormone (GNRH) and transforming growth factor-β signaling. There was an over-representation of significantly affected contigs in 33 and 18 canonical pathways at 1 and 6 weeks

  14. High-throughput sequencing and pathway analysis reveal alteration of the pituitary transcriptome by 17α-ethynylestradiol (EE2) in female coho salmon, Oncorhynchus kisutch

    International Nuclear Information System (INIS)

    Harding, Louisa B.; Schultz, Irvin R.; Goetz, Giles W.; Luckenbach, J. Adam; Young, Graham; Goetz, Frederick W.; Swanson, Penny

    2013-01-01

    Highlights: •Studied impacts of ethynylestradiol (EE2) exposure on salmon pituitary transcriptome. •High-throughput sequencing, RNAseq, and pathway analysis were performed. •EE2 altered mRNAs for genes in circadian rhythm, GnRH, and TGFβ signaling pathways. •LH and FSH beta subunit mRNAs were most highly up- and down-regulated by EE2, respectively. •Estrogens may alter processes associated with reproductive timing in salmon. -- Abstract: Considerable research has been done on the effects of endocrine disrupting chemicals (EDCs) on reproduction and gene expression in the brain, liver and gonads of teleost fish, but information on impacts to the pituitary gland is still limited despite its central role in regulating reproduction. The aim of this study was to further our understanding of the potential effects of natural and synthetic estrogens on the brain–pituitary–gonad axis in fish by determining the effects of 17α-ethynylestradiol (EE2) on the pituitary transcriptome. We exposed sub-adult coho salmon (Oncorhynchus kisutch) to 0 or 12 ng EE2/L for up to 6 weeks and effects on the pituitary transcriptome of females were assessed using high-throughput Illumina® sequencing, RNA-Seq and pathway analysis. After 1 or 6 weeks, 218 and 670 contiguous sequences (contigs), respectively, were differentially expressed in pituitaries of EE2-exposed fish relative to control. Two of the most highly up- and down-regulated contigs were luteinizing hormone β subunit (241-fold and 395-fold at 1 and 6 weeks, respectively) and follicle-stimulating hormone β subunit (−3.4-fold at 6 weeks). Additional contigs related to gonadotropin synthesis and release were differentially expressed in EE2-exposed fish relative to controls. These included contigs involved in gonadotropin releasing hormone (GNRH) and transforming growth factor-β signaling. There was an over-representation of significantly affected contigs in 33 and 18 canonical pathways at 1 and 6 weeks

  15. High Potential Source for Biomass Degradation Enzyme Discovery and Environmental Aspects Revealed through Metagenomics of Indian Buffalo Rumen

    Directory of Open Access Journals (Sweden)

    K. M. Singh

    2014-01-01

    Full Text Available The complex microbiome of the rumen functions as an effective system for plant cell wall degradation and biomass utilization, and provides a genetic resource of microbial degradative enzymes that could be used in the production of biofuel. Therefore, the buffalo rumen microbiota was surveyed using shotgun sequencing. This metagenomic sequencing generated 3.9 GB of sequences, and the data were assembled into 137270 contiguous sequences (contigs). We identified 2614 contigs potentially encoding biomass-degrading enzymes, including glycoside hydrolases (GH: 1943 contigs), carbohydrate binding modules (CBM: 23 contigs), glycosyl transferases (GT: 373 contigs), carbohydrate esterases (CE: 259 contigs), and polysaccharide lyases (PE: 16 contigs). The hierarchical clustering of buffalo metagenomes demonstrated the similarities and dissimilarities in microbial community structure and functional capacity. This demonstrates that the buffalo rumen microbiome was considerably enriched in functional genes involved in polysaccharide degradation, with great prospects for obtaining new molecules that may be applied in the biofuel industry.

  16. Draft Genome Sequence of a “Candidatus Liberibacter europaeus” Strain Assembled from Broom Psyllids (Arytainilla spartiophila) from New Zealand

    Science.gov (United States)

    Thompson, Sarah M.; Kalamorz, Falk; David, Charles; Addison, Shea M.; Smith, Grant R.

    2018-01-01

    ABSTRACT Here, we report the draft genome sequence of “Candidatus Liberibacter europaeus” ASNZ1, assembled from broom psyllids (Arytainilla spartiophila) from New Zealand. The assembly comprises 15 contigs, with a total length of 1.33 Mb and a G+C content of 33.5%. PMID:29773636

  17. In silico differential display of defense-related expressed sequence tags from sugarcane tissues infected with diazotrophic endophytes

    Directory of Open Access Journals (Sweden)

    Lambais Marcio R.

    2001-01-01

    Full Text Available The expression patterns of 277 sugarcane expressed sequence tag contigs (EST-contigs) encoding putative defense-related (DR) proteins were evaluated using the Sugarcane EST database. The DR proteins evaluated included chitinases, beta-1,3-glucanases, phenylalanine ammonia-lyases, chalcone synthases, chalcone isomerases, isoflavone reductases, hydroxyproline-rich glycoproteins, proline-rich glycoproteins, peroxidases, catalases, superoxide dismutases, WRKY-like transcription factors and proteins involved in cell death control. Putative sugarcane WRKY proteins were compared and their phylogenetic relationships determined. A hierarchical clustering approach was used to identify DR ESTs with similar expression profiles in representative cDNA libraries. To identify DR ESTs differentially expressed in sugarcane tissues infected with Gluconacetobacter diazotrophicus or Herbaspirillum rubrisubalbicans, 179 putative DR EST-contigs expressed in non-infected tissues (leaves and roots) and/or infected tissues were selected and arrayed by similarity of their expression profiles. Changes in the expression levels of 124 putative DR EST-contigs, expressed in non-infected tissues, were evaluated in infected tissues. Approximately 42% of these EST-contigs showed no expression in infected tissues, whereas 15% and 3% showed more than 2-fold suppression in tissues infected with G. diazotrophicus or H. rubrisubalbicans, respectively. Approximately 14% and 8% of the DR EST-contigs evaluated showed more than 2-fold induction in tissues infected with G. diazotrophicus or H. rubrisubalbicans, respectively. The differential expression of clusters of DR genes may be important in the establishment of a compatible interaction between sugarcane and diazotrophic endophytes. It is suggested that the hierarchical clustering approach can be used on a genome-wide scale to identify genes likely involved in controlling plant-microorganism interactions.

  18. The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads.

    Science.gov (United States)

    Wang, Zhiwen; Hobson, Neil; Galindo, Leonardo; Zhu, Shilin; Shi, Daihu; McDill, Joshua; Yang, Linfeng; Hawkins, Simon; Neutelings, Godfrey; Datla, Raju; Lambert, Georgina; Galbraith, David W; Grassa, Christopher J; Geraldes, Armando; Cronk, Quentin C; Cullis, Christopher; Dash, Prasanta K; Kumar, Polumetla A; Cloutier, Sylvie; Sharpe, Andrew G; Wong, Gane K-S; Wang, Jun; Deyholos, Michael K

    2012-11-01

    Flax (Linum usitatissimum) is an ancient crop that is widely cultivated as a source of fiber, oil and medicinally relevant compounds. To accelerate crop improvement, we performed whole-genome shotgun sequencing of the nuclear genome of flax. Seven paired-end libraries ranging in size from 300 bp to 10 kb were sequenced using an Illumina genome analyzer. A de novo assembly, comprised exclusively of deep-coverage (approximately 94× raw, approximately 69× filtered) short-sequence reads (44-100 bp), produced a set of scaffolds with N50 = 694 kb, including contigs with N50 = 20.1 kb. The contig assembly contained 302 Mb of non-redundant sequence representing an estimated 81% genome coverage. Up to 96% of published flax ESTs aligned to the whole-genome shotgun scaffolds. However, comparisons with independently sequenced BACs and fosmids showed some mis-assembly of regions at the genome scale. A total of 43384 protein-coding genes were predicted in the whole-genome shotgun assembly, and up to 93% of published flax ESTs, and 86% of A. thaliana genes aligned to these predicted genes, indicating excellent coverage and accuracy at the gene level. Analysis of the synonymous substitution rates (Ks) observed within duplicate gene pairs was consistent with a recent (5-9 MYA) whole-genome duplication in flax. Within the predicted proteome, we observed enrichment of many conserved domains (Pfam-A) that may contribute to the unique properties of this crop, including agglutinin proteins. Together these results show that de novo assembly, based solely on whole-genome shotgun short-sequence reads, is an efficient means of obtaining nearly complete genome sequence information for some plant species. © 2012 The Authors. The Plant Journal © 2012 Blackwell Publishing Ltd.
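
    The N50 statistic quoted for the flax scaffolds and contigs can be illustrated with a minimal sketch. It assumes the common definition (the length L such that sequences of length L or longer hold at least half of the total assembled bases); the function name `n50` and the toy lengths are illustrative and do not come from the flax study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the sum.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


if __name__ == "__main__":
    # Toy lengths in kb, invented for illustration only.
    print(n50([2, 3, 4, 5, 6]))
```

    Because N50 is driven by the longest sequences, scaffolding can raise it well above the contig N50 without adding any new sequence.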

  19. Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing.

    Science.gov (United States)

    Zhao, Shanrong; Prenger, Kurt; Smith, Lance; Messina, Thomas; Fan, Hongtao; Jaeger, Edward; Stephens, Susan

    2013-06-27

    Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources that are required for large-scale whole genome sequencing data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses. Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by the Amazon Web Service. The time includes the import and export of the data using Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log the running metrics in data processing and monitoring multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies. Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For human WGS data sequenced by either the Illumina HiSeq 2000 or HiSeq 2500 platforms, Rainbow can be used straight out of the box. 
Rainbow is available

  20. Genome signature analysis of thermal virus metagenomes reveals Archaea and thermophilic signatures.

    Science.gov (United States)

    Pride, David T; Schoenfeld, Thomas

    2008-09-17

    Metagenomic analysis provides a rich source of biological information for otherwise intractable viral communities. However, study of viral metagenomes has been hampered by its nearly complete reliance on BLAST algorithms for identification of DNA sequences. We sought to develop algorithms for examination of viral metagenomes to identify the origin of sequences independent of BLAST algorithms. We chose viral metagenomes obtained from two hot springs, Bear Paw and Octopus, in Yellowstone National Park, as they represent simple microbial populations where comparatively large contigs were obtained. Thermal spring metagenomes have high proportions of sequences without significant Genbank homology, which has hampered identification of viruses and their linkage with hosts. To analyze each metagenome, we developed a method to classify DNA fragments using genome signature-based phylogenetic classification (GSPC), where metagenomic fragments are compared to a database of oligonucleotide signatures for all previously sequenced Bacteria, Archaea, and viruses. From both Bear Paw and Octopus hot springs, each assembled contig had more similarity to other metagenome contigs than to any sequenced microbial genome based on GSPC analysis, suggesting a genome signature common to each of these extreme environments. While viral metagenomes from Bear Paw and Octopus share some similarity, the genome signatures from each locale are largely unique. GSPC using a microbial database predicts most of the Octopus metagenome has archaeal signatures, while bacterial signatures predominate in Bear Paw; a finding consistent with those of Genbank BLAST. When using a viral database, the majority of the Octopus metagenome is predicted to belong to archaeal virus Families Globuloviridae and Fuselloviridae, while none of the Bear Paw metagenome is predicted to belong to archaeal viruses. 
As expected, when microbial and viral databases are combined, each of the Octopus and Bear Paw metagenomic contigs

  1. Rapid development of microsatellite markers for Callosobruchus chinensis using Illumina paired-end sequencing.

    Directory of Open Access Journals (Sweden)

    Can-Xing Duan

    Full Text Available BACKGROUND: The adzuki bean weevil, Callosobruchus chinensis L., is one of the most destructive pests of stored legume seeds such as mungbean, cowpea, and adzuki bean, and usually causes considerable loss in the quantity and quality of stored seeds during transportation and storage. However, because of a lack of genetic information on this pest, a series of genetic questions remain largely unanswered, including population genetic structure, kinship, biotype abundance, and so on. Co-dominant microsatellite markers offer great resolving power for addressing these questions. Here, we report rapid microsatellite isolation from C. chinensis via high-throughput sequencing. PRINCIPAL FINDINGS: In this study, 94,560,852 quality-filtered and trimmed reads were obtained for genome assembly using Illumina paired-end sequencing technology. In total, a genome with a total length of 497,124,785 bp, comprising 403,113 high-quality contigs, was generated by de novo assembly. More than 6800 SSR loci were detected, a set of 6303 primer pair sequences was designed, and 500 of them were randomly selected for validation. Of these, 196 primer pairs, i.e. 39.2%, produced reproducible amplicons that were polymorphic among 8 C. chinensis genotypes collected from different geographical regions. Twenty of the 196 polymorphic SSR markers were used to analyze the genetic diversity of 18 C. chinensis populations. The results showed that the twenty SSR loci were highly polymorphic among these populations. CONCLUSIONS: This study presents a first report of genome sequencing and de novo assembly for C. chinensis and demonstrates the feasibility of generating large-scale sequence information and isolating SSR loci by Illumina paired-end sequencing. Our results provide a valuable resource for C. chinensis research. These novel markers are valuable for future studies of genetic mapping, trait association, genetic structure and kinship in C. chinensis.
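
    SSR (microsatellite) detection in assembled contigs can be sketched with a regular expression that looks for perfect tandem repeats. The abstract does not name the search tool or thresholds used, so the function `find_ssrs`, the 2 to 6 bp unit sizes and the minimum of five repeats are assumptions chosen only for illustration.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Yield (start, unit, repeat_count) for perfect tandem repeats.

    Hypothetical thresholds: repeat units of min_unit..max_unit bases,
    repeated at least min_repeats times in a row.
    """
    # Lazy unit match prefers the shortest unit; \1 requires exact copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        # Caveat: a long homopolymer run is reported with a unit such as "AA".
        yield match.start(), unit, len(match.group(0)) // len(unit)


if __name__ == "__main__":
    contig = "TTT" + "AC" * 5 + "GGG"  # toy sequence, not real data
    for start, unit, count in find_ssrs(contig):
        print(start, unit, count)
```

    In practice, primers are then designed in the flanking sequence of each reported locus, which is why loci too close to a contig end are usually discarded.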

  2. PAVE: Program for assembling and viewing ESTs

    Directory of Open Access Journals (Sweden)

    Bomhoff Matthew

    2009-08-01

    Full Text Available Abstract Background New sequencing technologies are rapidly emerging. Many laboratories are simultaneously working with the traditional Sanger ESTs and experimenting with ESTs generated by the 454 Life Science sequencers. Though Sanger ESTs have been used to generate contigs for many years, no program takes full advantage of the 5' and 3' mate-pair information; hence, many tentative transcripts are assembled into two separate contigs. The new 454 technology has the benefit of high-throughput expression profiling, but introduces time and space problems for assembling large contigs. Results The PAVE (Program for Assembling and Viewing ESTs) assembler takes advantage of the 5' and 3' mate-pair information by requiring that the mate-pairs be assembled into the same contig and joined by n's if the two sub-contigs do not overlap. It handles the depth of 454 data sets by "burying" similar ESTs during assembly, which retains the expression level information while circumventing time and space problems. PAVE uses MegaBLAST for the clustering step and CAP3 for assembly; however, it assembles incrementally to enforce the mate-pair constraint, bury ESTs, and reduce incorrect joins and splits. The PAVE data management system uses a MySQL database to store multiple libraries of ESTs along with their metadata; the management system allows multiple assemblies with variations on libraries and parameters. Analysis routines provide standard annotation for the contigs including a measure of differentially expressed genes across the libraries. A Java viewer program is provided for display and analysis of the results. Our results clearly show the benefit of using the PAVE assembler to explicitly use mate-pair information and bury ESTs for large contigs. Conclusion The PAVE assembler provides a software package for assembling Sanger and/or 454 ESTs. The assembly software, data management software, Java viewer and user's guide are freely available.
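
    The mate-pair rule described above (keep the 5' and 3' sub-contigs in one contig, joined by n's when they do not overlap) can be sketched as follows. This is not PAVE's code: the function `join_subcontigs`, the exact-match overlap test, the minimum overlap and the number of n's are all assumptions made for illustration, whereas PAVE itself relies on CAP3 for assembly.

```python
def join_subcontigs(five_prime, three_prime, min_overlap=4, gap_len=10):
    """Join two sub-contigs that belong to the same mate-pair.

    If a suffix of five_prime exactly matches a prefix of three_prime
    (at least min_overlap bases), merge them on the longest such overlap.
    Otherwise keep both in one contig, separated by gap_len 'n' characters.
    """
    longest = min(len(five_prime), len(three_prime))
    for k in range(longest, min_overlap - 1, -1):
        if five_prime[-k:] == three_prime[:k]:
            return five_prime + three_prime[k:]
    # No usable overlap: record the unknown gap explicitly with n's.
    return five_prime + "n" * gap_len + three_prime


if __name__ == "__main__":
    print(join_subcontigs("ACGTTGCA", "TGCAGGTT"))       # overlapping pair
    print(join_subcontigs("AAAA", "CCCC", gap_len=3))    # gapped pair
```

    Keeping the gapped pair in a single record preserves the fact that both ends come from one transcript, which is the information lost when the two ends are reported as separate contigs.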

  3. PAVE: program for assembling and viewing ESTs.

    Science.gov (United States)

    Soderlund, Carol; Johnson, Eric; Bomhoff, Matthew; Descour, Anne

    2009-08-26

    New sequencing technologies are rapidly emerging. Many laboratories are simultaneously working with the traditional Sanger ESTs and experimenting with ESTs generated by the 454 Life Science sequencers. Though Sanger ESTs have been used to generate contigs for many years, no program takes full advantage of the 5' and 3' mate-pair information, hence, many tentative transcripts are assembled into two separate contigs. The new 454 technology has the benefit of high-throughput expression profiling, but introduces time and space problems for assembling large contigs. The PAVE (Program for Assembling and Viewing ESTs) assembler takes advantage of the 5' and 3' mate-pair information by requiring that the mate-pairs be assembled into the same contig and joined by n's if the two sub-contigs do not overlap. It handles the depth of 454 data sets by "burying" similar ESTs during assembly, which retains the expression level information while circumventing time and space problems. PAVE uses MegaBLAST for the clustering step and CAP3 for assembly, however it assembles incrementally to enforce the mate-pair constraint, bury ESTs, and reduce incorrect joins and splits. The PAVE data management system uses a MySQL database to store multiple libraries of ESTs along with their metadata; the management system allows multiple assemblies with variations on libraries and parameters. Analysis routines provide standard annotation for the contigs including a measure of differentially expressed genes across the libraries. A Java viewer program is provided for display and analysis of the results. Our results clearly show the benefit of using the PAVE assembler to explicitly use mate-pair information and bury ESTs for large contigs. The PAVE assembler provides a software package for assembling Sanger and/or 454 ESTs. The assembly software, data management software, Java viewer and user's guide are freely available.

  4. Gene discovery and transcript analyses in the corn smut pathogen Ustilago maydis: expressed sequence tag and genome sequence comparison

    Directory of Open Access Journals (Sweden)

    Saville Barry J

    2007-09-01

    Full Text Available Abstract Background Ustilago maydis is the basidiomycete fungus responsible for common smut of corn and is a model organism for the study of fungal phytopathogenesis. To aid in the annotation of the genome sequence of this organism, several expressed sequence tag (EST) libraries were generated from a variety of U. maydis cell types. In addition to utility in the context of gene identification and structure annotation, the ESTs were analyzed to identify differentially abundant transcripts and to detect evidence of alternative splicing and anti-sense transcription. Results Four cDNA libraries were constructed using RNA isolated from U. maydis diploid teliospores (U. maydis strains 518 × 521) and haploid cells of strain 521 grown under nutrient rich, carbon starved, and nitrogen starved conditions. Using the genome sequence as a scaffold, the 15,901 ESTs were assembled into 6,101 contiguous expressed sequences (contigs); among these, 5,482 corresponded to predicted genes in the MUMDB (MIPS Ustilago maydis database), while 619 aligned to regions of the genome not yet designated as genes in MUMDB. A comparison of EST abundance identified numerous genes that may be regulated in a cell type or starvation-specific manner. The transcriptional response to nitrogen starvation was assessed using RT-qPCR. The results of this suggest that there may be cross-talk between the nitrogen and carbon signalling pathways in U. maydis. Bioinformatic analysis identified numerous examples of alternative splicing and anti-sense transcription. While intron retention was the predominant form of alternative splicing in U. maydis, other varieties were also evident (e.g. exon skipping). Selected instances of both alternative splicing and anti-sense transcription were independently confirmed using RT-PCR. Conclusion Through this work: 1) substantial sequence information has been provided for U. maydis genome annotation; 2) new genes were identified through the discovery of 619

  5. Pms2 suppresses large expansions of the (GAA·TTC)n sequence in neuronal tissues.

    Science.gov (United States)

    Bourn, Rebecka L; De Biase, Irene; Pinto, Ricardo Mouro; Sandi, Chiranjeevi; Al-Mahdawi, Sahar; Pook, Mark A; Bidichandani, Sanjay I

    2012-01-01

    Expanded trinucleotide repeat sequences are the cause of several inherited neurodegenerative diseases. Disease pathogenesis is correlated with several features of somatic instability of these sequences, including further large expansions in postmitotic tissues. The presence of somatic expansions in postmitotic tissues is consistent with DNA repair being a major determinant of somatic instability. Indeed, proteins in the mismatch repair (MMR) pathway are required for instability of the expanded (CAG·CTG)(n) sequence, likely via recognition of intrastrand hairpins by MutSβ. It is not clear if or how MMR would affect instability of disease-causing expanded trinucleotide repeat sequences that adopt secondary structures other than hairpins, such as the triplex/R-loop forming (GAA·TTC)(n) sequence that causes Friedreich ataxia. We analyzed somatic instability in transgenic mice that carry an expanded (GAA·TTC)(n) sequence in the context of the human FXN locus and lack the individual MMR proteins Msh2, Msh6 or Pms2. The absence of Msh2 or Msh6 resulted in a dramatic reduction in somatic mutations, indicating that mammalian MMR promotes instability of the (GAA·TTC)(n) sequence via MutSα. The absence of Pms2 resulted in increased accumulation of large expansions in the nervous system (cerebellum, cerebrum, and dorsal root ganglia) but not in non-neuronal tissues (heart and kidney), without affecting the prevalence of contractions. Pms2 suppressed large expansions specifically in tissues showing MutSα-dependent somatic instability, suggesting that they may act on the same lesion or structure associated with the expanded (GAA·TTC)(n) sequence. We conclude that Pms2 specifically suppresses large expansions of a pathogenic trinucleotide repeat sequence in neuronal tissues, possibly acting independently of the canonical MMR pathway.

  6. Comparative analysis of catfish BAC end sequences with the zebrafish genome

    Directory of Open Access Journals (Sweden)

    Abernathy Jason

    2009-12-01

    Full Text Available Abstract Background Comparative mapping is a powerful tool to transfer genomic information from sequenced genomes to closely related species for which whole genome sequence data are not yet available. However, such an approach is still very limited in catfish, the most important aquaculture species in the United States. This project was initiated to generate additional BAC end sequences and demonstrate their applications in comparative mapping in catfish. Results We reported the generation of 43,000 BAC end sequences and their applications for comparative genome analysis in catfish. Using these and the additional 20,000 existing BAC end sequences as a resource along with linkage mapping and existing physical map, conserved syntenic regions were identified between the catfish and zebrafish genomes. A total of 10,943 catfish BAC end sequences (17.3%) had significant BLAST hits to the zebrafish genome (cutoff value ≤ e-5), of which 3,221 were unique gene hits, providing a platform for comparative mapping based on locations of these genes in catfish and zebrafish. Genetic linkage mapping of microsatellites associated with contigs allowed identification of large conserved genomic segments and construction of super scaffolds. Conclusion BAC end sequences and their associated polymorphic markers are great resources for comparative genome analysis in catfish. Highly conserved chromosomal regions were identified between catfish and zebrafish. However, it appears that the level of conservation in local genomic regions is high, while a high level of chromosomal shuffling and rearrangement exists between the catfish and zebrafish genomes. Orthologous regions established through comparative analysis should facilitate both structural and functional genome analysis in catfish.

  7. Reliable Detection of Herpes Simplex Virus Sequence Variation by High-Throughput Resequencing.

    Science.gov (United States)

    Morse, Alison M; Calabro, Kaitlyn R; Fear, Justin M; Bloom, David C; McIntyre, Lauren M

    2017-08-16

    High-throughput sequencing (HTS) has resulted in data for a number of herpes simplex virus (HSV) laboratory strains and clinical isolates. The knowledge of these sequences has been critical for investigating viral pathogenicity. However, the assembly of complete herpesviral genomes, including HSV, is complicated due to the existence of large repeat regions and arrays of smaller reiterated sequences that are commonly found in these genomes. In addition, the inherent genetic variation in populations of isolates for viruses and other microorganisms presents an additional challenge to many existing HTS sequence assembly pipelines. Here, we evaluate two approaches for the identification of genetic variants in HSV1 strains using Illumina short read sequencing data. The first, a reference-based approach, identifies variants from reads aligned to a reference sequence and the second, a de novo assembly approach, identifies variants from reads aligned to de novo assembled consensus sequences. Of critical importance for both approaches is the reduction in the number of low complexity regions through the construction of a non-redundant reference genome. We compared variants identified in the two methods. Our results indicate that approximately 85% of variants are identified regardless of the approach. The reference-based approach to variant discovery captures an additional 15% representing variants divergent from the HSV1 reference possibly due to viral passage. Reference-based approaches are significantly less labor-intensive and identify variants across the genome where de novo assembly-based approaches are limited to regions where contigs have been successfully assembled. In addition, regions of poor quality assembly can lead to false variant identification in de novo consensus sequences. For viruses with a well-assembled reference genome, a reference-based approach is recommended.

  8. HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.

    Science.gov (United States)

    Wan, Shixiang; Zou, Quan

    2017-01-01

    Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The extreme growth of next-generation sequencing data has led to a shortage of efficient ultra-large biological sequence alignment approaches that can cope with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses. Based on HAlign and the Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. Experiments on large-scale DNA and protein data sets, with files of more than 1 GB, showed that HAlign-II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resources. HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign.

  9. Development and validation of an rDNA operon based primer walking strategy applicable to de novo bacterial genome finishing.

    Directory of Open Access Journals (Sweden)

    Alexander William Eastman

    2015-01-01

    Full Text Available Advances in sequencing technology have drastically increased the depth and feasibility of bacterial genome sequencing. However, little information is available that details the specific techniques and procedures employed during genome sequencing despite the large numbers of published genomes. Shotgun approaches employed by second-generation sequencing platforms have necessitated the development of robust bioinformatics tools for in silico assembly, and complete assembly is limited by the presence of repetitive DNA sequences and multi-copy operons. Typically, re-sequencing with multiple platforms and laborious, targeted Sanger sequencing are employed to finish a draft bacterial genome. Here we describe a novel strategy based on the identification and targeted sequencing of repetitive rDNA operons to expedite bacterial genome assembly and finishing. Our strategy was validated by finishing the genome of Paenibacillus polymyxa strain CR1, a bacterium with potential in sustainable agriculture and bio-based processes. An analysis of the 38 contigs contained in the P. polymyxa strain CR1 draft genome revealed 12 repetitive rDNA operons with varied intragenic and flanking regions of variable length, unanimously located at contig boundaries and within contig gaps. These highly similar but not identical rDNA operons were experimentally verified and sequenced simultaneously with multiple, specially designed primer sets. This approach also identified and corrected significant sequence rearrangement generated during the initial in silico assembly of sequencing reads. Our approach reduces the required effort associated with blind primer walking for contig assembly, increasing both the speed and feasibility of genome finishing. Our study further reinforces the notion that repetitive DNA elements are major limiting factors for genome finishing. Moreover, we provided a step-by-step workflow for genome finishing, which may guide future bacterial genome finishing

  10. Large scale identification and categorization of protein sequences using structured logistic regression.

    Directory of Open Access Journals (Sweden)

    Bjørn P Pedersen

    Full Text Available BACKGROUND: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem. RESULTS: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps across organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases. CONCLUSIONS: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.

  11. Genotyping-by-sequencing data of 272 crested wheatgrass (Agropyron cristatum) genotypes

    Directory of Open Access Journals (Sweden)

    Pingchuan Li

    2017-12-01

    Full Text Available Crested wheatgrass [Agropyron cristatum L. (Gaertn.)] is an important cool-season forage grass widely used for early spring grazing. However, the genomic resources for this non-model plant are still lacking. Our goal was to generate the first set of next generation sequencing data using the genotyping-by-sequencing technique. A total of 272 crested wheatgrass plants representing seven breeding lines, five cultivars and five geographically diverse accessions were sequenced with an Illumina MiSeq instrument. These sequence datasets were processed using different bioinformatics tools to generate contigs for diploid and tetraploid plants and SNPs for diploid plants. Together, these genomic resources form a fundamental basis for genomic studies of crested wheatgrass and other wheatgrass species. The raw reads were deposited into the Sequence Read Archive (SRA) database under NCBI accession SRP115373 (https://www.ncbi.nlm.nih.gov/sra?term=SRP115373) and the supplementary datasets are accessible in Figshare (10.6084/m9.figshare.5345092). Keywords: Crested wheatgrass, Genotyping-by-sequencing, Diploid, Tetraploid, Raw sequence data

  12. Genome Sequence of the Freshwater Yangtze Finless Porpoise.

    Science.gov (United States)

    Yuan, Yuan; Zhang, Peijun; Wang, Kun; Liu, Mingzhong; Li, Jing; Zheng, Jingsong; Wang, Ding; Xu, Wenjie; Lin, Mingli; Dong, Lijun; Zhu, Chenglong; Qiu, Qiang; Li, Songhai

    2018-04-16

    The Yangtze finless porpoise (Neophocaena asiaeorientalis ssp. asiaeorientalis) is a subspecies of the narrow-ridged finless porpoise (N. asiaeorientalis). In total, 714.28 gigabases (Gb) of raw reads were generated by whole-genome sequencing of the Yangtze finless porpoise, using an Illumina HiSeq 2000 platform. After filtering the low-quality and duplicated reads, we assembled a draft genome of 2.22 Gb, with contig N50 and scaffold N50 values of 46.69 kilobases (kb) and 1.71 megabases (Mb), respectively. We identified 887.63 Mb of repetitive sequences and predicted 18,479 protein-coding genes in the assembled genome. The phylogenetic tree showed a relationship between the Yangtze finless porpoise and the Yangtze River dolphin, which diverged approximately 20.84 million years ago. In comparisons with the genomes of 10 other mammals, we detected 44 species-specific gene families, 164 expanded gene families, and 313 positively selected genes in the Yangtze finless porpoise genome. The assembled genome sequence and underlying sequence data are available at the National Center for Biotechnology Information under BioProject accession number PRJNA433603.
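
    The contig and scaffold N50 values quoted in this record are a standard assembly statistic: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal Python sketch of that definition follows; the function name `n50` and the toy lengths are illustrative assumptions, not data or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Walk the lengths from longest to shortest and return the length at
    which the running total first reaches half of the assembly size.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in kb (not taken from the porpoise assembly).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # total is 300; 80 + 70 reaches 150, so N50 is 70
```

    Applying the same function separately to contig lengths and to scaffold lengths would give the two N50 figures an assembly report usually lists.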

  13. SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores.

    Science.gov (United States)

    Meng, Jintao; Wang, Bingqiang; Wei, Yanjie; Feng, Shengzhong; Balaji, Pavan

    2014-01-01

    There is a widening gap between the throughput of massively parallel sequencing machines and the ability to analyze these sequencing data. Traditional assembly methods require long execution times and large amounts of memory on a single workstation, which limits their use on these massive data. This paper presents a highly scalable assembler named SWAP-Assembler for processing massive sequencing data using thousands of cores, where SWAP is an acronym for Small World Asynchronous Parallel model. In the paper, a mathematical description of multi-step bi-directed graph (MSG) is provided to resolve the computational interdependence on merging edges, and a highly scalable computational framework for SWAP is developed to automatically perform the parallel computation of all operations. Graph cleaning and contig extension are also included for generating contigs with high quality. Experimental results show that SWAP-Assembler scales up to 2048 cores on the Yanhuang dataset using only 26 minutes, which is better than several other parallel assemblers, such as ABySS, Ray, and PASHA. Results also show that SWAP-Assembler can generate high quality contigs with good N50 size and low error rate; in particular, it generated the longest N50 contig sizes for the Fish and Yanhuang datasets. In this paper, we presented a highly scalable and efficient genome assembly software, SWAP-Assembler. Compared with several other assemblers, it showed very good performance in terms of scalability and contig quality. This software is available at: https://sourceforge.net/projects/swapassembler.

  14. A NGS approach to the encrusting Mediterranean sponge Crella elegans (Porifera, Demospongiae, Poecilosclerida): transcriptome sequencing, characterization and overview of the gene expression along three life cycle stages.

    Science.gov (United States)

    Pérez-Porro, A R; Navarro-Gómez, D; Uriz, M J; Giribet, G

    2013-05-01

    Sponges can be dominant organisms in many marine and freshwater habitats where they play essential ecological roles. They also represent a key group to address important questions in early metazoan evolution. Recent approaches for improving knowledge on sponge biological and ecological functions as well as on animal evolution have focused on the genetic toolkits involved in ecological responses to environmental changes (biotic and abiotic), development and reproduction. These approaches are possible thanks to newly available, massive sequencing technologies, such as the Illumina platform, which facilitate genome and transcriptome sequencing in a cost-effective manner. Here we present the first NGS (next-generation sequencing) approach to understanding the life cycle of an encrusting marine sponge. For this we sequenced libraries of three different life cycle stages of the Mediterranean sponge Crella elegans and generated de novo transcriptome assemblies. Three assemblies were based on sponge tissue of a particular life cycle stage, including non-reproductive tissue, tissue with sperm cysts and tissue with larvae. The fourth assembly pooled the data from all three stages. By aggregating data from all the different life cycle stages we obtained a higher total number of contigs, contigs with BLAST hits and annotated contigs than from the single-stage assemblies. In that multi-stage assembly we obtained a larger number of the developmental regulatory genes known for metazoans than in any other assembly. We also advance the differential expression of selected genes in the three life cycle stages to explore the potential of RNA-seq for improving knowledge on functional processes along the sponge life cycle. © 2013 Blackwell Publishing Ltd.

  15. Draft Genome Sequence of Leptolyngbya sp. KIOST-1, a Filamentous Cyanobacterium with Biotechnological Potential for Alimentary Purposes.

    Science.gov (United States)

    Kim, Ji Hyung; Kang, Do-Hyung

    2016-09-15

    Here, we report the draft genome of cyanobacterium Leptolyngbya sp. KIOST-1 isolated from a microalgal culture pond in South Korea. The genome consists of 13 contigs containing 6,320,172 bp, and a total of 5,327 coding sequences were predicted. This genomic information will allow further exploitation of its biotechnological potential for alimentary purposes. Copyright © 2016 Kim and Kang.

  16. Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis

    Directory of Open Access Journals (Sweden)

    Arias Covadonga

    2007-06-01

    Full Text Available Abstract. Background: The ciliate protozoan Ichthyophthirius multifiliis (Ich) is an important parasite of freshwater fish that causes 'white spot disease', leading to significant losses. A genomic resource for large-scale studies of this parasite has been lacking. To study gene expression involved in Ich pathogenesis and virulence, our goal was to generate expressed sequence tags (ESTs) for the development of a powerful microarray platform for the analysis of global gene expression in this species. Here, we initiated a project to sequence and analyze over 10,000 ESTs. Results: We sequenced 10,368 EST clones using a normalized cDNA library made from pooled samples of the trophont, tomont, and theront life-cycle stages, and generated 9,769 sequences (94.2% success rate). Post-sequencing processing led to 8,432 high quality sequences. Clustering analysis of these ESTs allowed identification of 4,706 unique sequences containing 976 contigs and 3,730 singletons. These unique sequences represent over two million base pairs (~10% of the Plasmodium falciparum genome, a phylogenetically related protozoan). BLASTX searches produced 2,518 significant (E-value < 10^-5) hits and further Gene Ontology (GO) analysis annotated 1,008 of these genes. The ESTs were analyzed comparatively against the genomes of the related protozoa Tetrahymena thermophila and P. falciparum, allowing putative identification of additional genes. All the EST sequences were deposited by dbEST in GenBank (GenBank: EG957858–EG966289). Gene discovery and annotations are presented and discussed. Conclusion: This set of ESTs represents a significant proportion of the Ich transcriptome, and provides a material basis for the development of microarrays useful for gene expression studies concerning Ich development, pathogenesis, and virulence.
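
    The counts reported in this record can be cross-checked with simple arithmetic: the sequencing success rate is the number of sequences generated divided by the number of clones sequenced, and the number of unique sequences is the sum of contigs and singletons. The short Python sketch below only restates figures from the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract above.
clones_sequenced = 10_368
sequences_generated = 9_769
contigs = 976
singletons = 3_730

# Success rate: fraction of sequenced clones that yielded a sequence.
success_rate = sequences_generated / clones_sequenced

# Unique sequences after clustering: assembled contigs plus unclustered singletons.
unique_sequences = contigs + singletons

print(f"{success_rate:.1%}")  # 94.2%
print(unique_sequences)       # 4706
```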

  17. Large-Scale Sequencing: The Future of Genomic Sciences Colloquium

    Energy Technology Data Exchange (ETDEWEB)

    Margaret Riley; Merry Buckley

    2009-01-01

    Genetic sequencing and the various molecular techniques it has enabled have revolutionized the field of microbiology. Examining and comparing the genetic sequences borne by microbes - including bacteria, archaea, viruses, and microbial eukaryotes - provides researchers insights into the processes microbes carry out, their pathogenic traits, and new ways to use microorganisms in medicine and manufacturing. Until recently, sequencing entire microbial genomes has been laborious and expensive, and the decision to sequence the genome of an organism was made on a case-by-case basis by individual researchers and funding agencies. Now, thanks to new technologies, the cost and effort of sequencing is within reach for even the smallest facilities, and the ability to sequence the genomes of a significant fraction of microbial life may be possible. The availability of numerous microbial genomes will enable unprecedented insights into microbial evolution, function, and physiology. However, the current ad hoc approach to gathering sequence data has resulted in an unbalanced and highly biased sampling of microbial diversity. A well-coordinated, large-scale effort to target the breadth and depth of microbial diversity would result in the greatest impact. The American Academy of Microbiology convened a colloquium to discuss the scientific benefits of engaging in a large-scale, taxonomically-based sequencing project. A group of individuals with expertise in microbiology, genomics, informatics, ecology, and evolution deliberated on the issues inherent in such an effort and generated a set of specific recommendations for how best to proceed. The vast majority of microbes are presently uncultured and, thus, pose significant challenges to such a taxonomically-based approach to sampling genome diversity. However, we have yet to even scratch the surface of the genomic diversity among cultured microbes. 
A coordinated sequencing effort of cultured organisms is an appropriate place to begin

  18. Whole Genome Sequence Analysis of an Alachlor and Endosulfan Degrading Micrococcus sp. strain 2385 Isolated from Ochlockonee River, Florida.

    Science.gov (United States)

    Pathak, Ashish; Chauhan, Ashvini; Ewida, Ayman Y I; Stothard, Paul

    2016-01-01

    We recently isolated Micrococcus sp. strain 2385 from Ochlockonee River, Florida and demonstrated potent biodegradative activity against two commonly used pesticides, alachlor [(2-chloro-2`,6`-diethylphenyl-N (methoxymethyl)acetanilide)] and endosulfan [(6,7,8,9,10,10-hexachloro-1,5,5a,6,9,9a-hexahydro-6,9methano-2,3,4-benzo(e)di-oxathiepin-3-oxide)]. To further identify the repertoire of metabolic functions possessed by strain 2385, a draft genome sequence was obtained, assembled, annotated and analyzed. The genome sequence of Micrococcus sp. strain 2385 consisted of 1,460,461,440 bases which assembled into 175 contigs with an N50 contig length of 50,109 bases and a coverage of 600x. The genome size of this strain was estimated at 2,431,226 base pairs with a G+C content of 72.8% and a total number of 2,268 putative genes. RAST annotated a total of 340 subsystems in the genome of strain 2385 along with the presence of 2,177 coding sequences. A genome wide survey indicated that strain 2385 harbors a plethora of genes to degrade other pollutants including caprolactam, PAHs (such as naphthalene), styrene, toluene and several chloroaromatic compounds.
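
    The 600x coverage figure is consistent with the other numbers in this record if coverage is taken as total sequenced bases divided by estimated genome size. That definition is an assumption here (the abstract does not state how coverage was computed), and the function name `sequencing_depth` is illustrative.

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth of coverage, assuming depth = sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size


# Figures quoted in the abstract above.
depth = sequencing_depth(total_bases=1_460_461_440, genome_size=2_431_226)
print(f"{depth:.1f}x")  # about 600.7x, in line with the reported 600x
```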

  19. Generation and analysis of a large-scale expressed sequence Tag database from a full-length enriched cDNA library of developing leaves of Gossypium hirsutum L.

    Directory of Open Access Journals (Sweden)

    Min Lin

    Full Text Available BACKGROUND: Cotton (Gossypium hirsutum L.) is one of the world's most economically-important crops. However, its entire genome has not been sequenced, and limited resources are available in GenBank for understanding the molecular mechanisms underlying leaf development and senescence. METHODOLOGY/PRINCIPAL FINDINGS: In this study, 9,874 high-quality ESTs were generated from a normalized, full-length cDNA library derived from pooled RNA isolated throughout leaf development during the plant blooming stage. After clustering and assembly of these ESTs, 5,191 unique sequences, representing 1,652 contigs and 3,539 singletons, were obtained. The average unique sequence length was 682 bp. Annotation of these unique sequences revealed that 84.4% showed significant homology to sequences in the NCBI non-redundant protein database, and 57.3% had significant hits to known proteins in the Swiss-Prot database. Comparative analysis indicated that our library added 2,400 ESTs and 991 unique sequences to those known for cotton. The unigenes were functionally characterized by gene ontology annotation. We identified 1,339 and 200 unigenes as potential leaf senescence-related genes and transcription factors, respectively. Moreover, nine genes related to leaf senescence and eleven MYB transcription factors were randomly selected for quantitative real-time PCR (qRT-PCR), which revealed that these genes were regulated differentially during senescence. The qRT-PCR for three GhYLSs revealed that these genes are expressed preferentially in senescent leaves. CONCLUSIONS/SIGNIFICANCE: These EST resources will provide valuable sequence information for gene expression profiling analyses and functional genomics studies to elucidate their roles, as well as for studying the mechanisms of leaf development and senescence in cotton and discovering candidate genes related to important agronomic traits of cotton. These data will also facilitate future whole-genome sequence

  20. Composite Binary Sequences with a Large Ensemble and Zero Correlation Zone

    Directory of Open Access Journals (Sweden)

    S. S. Yudachev

    2015-01-01

    Full Text Available The article considers a proposed class of derived signals, composite binary sequences, for application in advanced spread spectrum radio systems of various purposes that use signals based on direct-sequence spectrum spreading. The composite sequences considered, which have a representative set of lengths and unique correlation properties, compare favorably with the large ensembles widely used at present that are formed on a single algorithmic basis. To evaluate the properties of the composite sequences generated on the basis of two components, the Barker code and Kerdock sequences, expressions for the periodic and aperiodic correlation functions are given. An algorithm for generating practical ensembles of composite sequences is presented. On the basis of the algorithm and its software implementation in C#, samples of sequence ensembles of various lengths were obtained, and their periodic and aperiodic correlation functions and statistical characteristics were studied in detail. As an illustration, some of the most typical correlation functions are presented. The most notable characteristics for assessing the feasibility of using this type of sequence in the design of specific types of radio systems are considered. On the basis of the proposed program and the performed calculations, conclusions can be drawn about the possibility of using sequences of these classes to reduce intra-system interference in planned spread spectrum CDMA systems.
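
    The periodic and aperiodic correlation functions mentioned in this record can be illustrated on one of the two named components, the Barker code. The Python sketch below is only an illustration of the standard definitions for ±1 sequences: it does not reproduce the article's composite construction, its Kerdock component, or its C# implementation, and the function names are assumptions.

```python
def aperiodic_correlation(a, b, shift):
    """Aperiodic correlation of two equal-length +/-1 sequences at shift >= 0.

    Only the overlapping part of the sequences contributes.
    """
    n = len(a)
    return sum(a[i] * b[i + shift] for i in range(n - shift))


def periodic_correlation(a, b, shift):
    """Periodic correlation: the second sequence is shifted cyclically."""
    n = len(a)
    return sum(a[i] * b[(i + shift) % n] for i in range(n))


# The length-13 Barker code.
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]

aperiodic = [aperiodic_correlation(BARKER_13, BARKER_13, s) for s in range(13)]
periodic = [periodic_correlation(BARKER_13, BARKER_13, s) for s in range(13)]

print(aperiodic)  # main peak of 13, sidelobes of magnitude at most 1
print(periodic)   # main peak of 13, every off-peak value equal to 1
```

    The low autocorrelation sidelobes shown here are the property that makes Barker codes attractive as a building block; cross-correlation between different sequences of an ensemble can be examined with the same two functions by passing two different sequences.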

  1. Draft genome sequence of a Kluyvera intermedia isolate from a patient with a pancreatic abscess.

    Science.gov (United States)

    Thele, Roland; Gumpert, Heidi; Christensen, Louise B; Worning, Peder; Schønning, Kristian; Westh, Henrik; Hansen, Thomas A

    2017-09-01

    The genus Kluyvera comprises potential pathogens that can cause many infections. This study reports a Kluyvera intermedia strain (FOSA7093) from a pancreatic cyst specimen from a long-term hospitalised patient. Whole-genome sequencing (WGS) of the K. intermedia isolate was performed and the strain was reported as sensitive to Danish-registered antibiotics although it had a fosA-like gene in the genome. There were nine contigs that aligned to a plasmid, and these contigs contained several heavy metal resistance gene homologues. Furthermore, a prophage was discovered in the genome. WGS represents an efficient tool for monitoring Kluyvera spp. and its role as a reservoir of multidrug resistance. Therefore, this susceptible K. intermedia genome has many characteristics that allow comparison of resistant K. intermedia that might be discovered in the future. Copyright © 2017 International Society for Chemotherapy of Infection and Cancer. Published by Elsevier Ltd. All rights reserved.

  2. Tablet—next generation sequence assembly visualization

    Science.gov (United States)

    Milne, Iain; Bayer, Micha; Cardle, Linda; Shaw, Paul; Stephen, Gordon; Wright, Frank; Marshall, David

    2010-01-01

    Summary: Tablet is a lightweight, high-performance graphical viewer for next-generation sequence assemblies and alignments. Supporting a range of input assembly formats, Tablet provides high-quality visualizations showing data in packed or stacked views, allowing instant access and navigation to any region of interest, and whole contig overviews and data summaries. Tablet is both multi-core aware and memory efficient, allowing it to handle assemblies containing millions of reads, even on a 32-bit desktop machine. Availability: Tablet is freely available for Microsoft Windows, Apple Mac OS X, Linux and Solaris. Fully bundled installers can be downloaded from http://bioinf.scri.ac.uk/tablet in 32- and 64-bit versions. Contact: tablet@scri.ac.uk PMID:19965881

  3. Genome puzzle master (GPM): an integrated pipeline for building and editing pseudomolecules from fragmented sequences.

    Science.gov (United States)

    Zhang, Jianwei; Kudrna, Dave; Mu, Ting; Li, Weiming; Copetti, Dario; Yu, Yeisoo; Goicoechea, Jose Luis; Lei, Yang; Wing, Rod A

    2016-10-15

    Next generation sequencing technologies have revolutionized our ability to rapidly and affordably generate vast quantities of sequence data. Once generated, raw sequences are assembled into contigs or scaffolds. However, these assemblies are mostly fragmented and inaccurate at the whole genome scale, largely due to the inability to integrate additional informative datasets (e.g. physical, optical and genetic maps). To address this problem, we developed a semi-automated software tool, Genome Puzzle Master (GPM), that enables the integration of additional genomic signposts to edit and build 'new-gen-assemblies' that result in high-quality 'annotation-ready' pseudomolecules. With GPM, loaded datasets can be connected to each other via their logical relationships which accomplishes tasks to 'group,' 'merge,' 'order and orient' sequences in a draft assembly. Manual editing can also be performed with a user-friendly graphical interface. Final pseudomolecules reflect a user's total data package and are available for long-term project management. GPM is a web-based pipeline and an important part of a Laboratory Information Management System (LIMS) which can be easily deployed on local servers for any genome research laboratory. The GPM (with LIMS) package is available at https://github.com/Jianwei-Zhang/LIMS CONTACTS: jzhang@mail.hzau.edu.cn or rwing@mail.arizona.edu Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  4. Sequencing and De Novo Transcriptome Assembly of Brachypodium sylvaticum (Poaceae)

    Directory of Open Access Journals (Sweden)

    Samuel E. Fox

    2013-03-01

    Full Text Available Premise of the study: We report the de novo assembly and characterization of the transcriptomes of Brachypodium sylvaticum (slender false-brome) accessions from native populations of Spain and Greece, and an invasive population west of Corvallis, Oregon, USA. Methods and Results: More than 350 million sequence reads from the mRNA libraries prepared from three B. sylvaticum genotypes were assembled into 120,091 (Corvallis), 104,950 (Spain), and 177,682 (Greece) transcript contigs. In comparison with the B. distachyon Bd21 reference genome and GenBank protein sequences, we estimate >90% exome coverage for B. sylvaticum. The transcripts were assigned Gene Ontology and InterPro annotations. Brachypodium sylvaticum sequence reads aligned against the Bd21 genome revealed 394,654 single-nucleotide polymorphisms (SNPs) and >20,000 simple sequence repeat (SSR) DNA sites. Conclusions: To our knowledge, this is the first report of transcriptome sequencing of an invasive plant species with a closely related sequenced reference genome. The sequences and identified SNP variant and SSR sites will provide tools for developing novel genetic markers for use in genotyping and characterization of invasive behavior of B. sylvaticum.

  5. Comparative sequence analysis of Solanum and Arabidopsis in a hot spot for pathogen resistance on potato chromosome V reveals a patchwork of conserved and rapidly evolving genome segments

    Directory of Open Access Journals (Sweden)

    Bruggmann Rémy

    2007-05-01

    Full Text Available Abstract. Background: Quantitative phenotypic variation of agronomic characters in crop plants is controlled by environmental and genetic factors (quantitative trait loci = QTL). To understand the molecular basis of such QTL, the identification of the underlying genes is of primary interest and DNA sequence analysis of the genomic regions harboring QTL is a prerequisite for that. QTL mapping in potato (Solanum tuberosum) has identified a region on chromosome V tagged by DNA markers GP21 and GP179, which contains a number of important QTL, among others QTL for resistance to late blight caused by the oomycete Phytophthora infestans and to root cyst nematodes. Results: To obtain genomic sequence for the targeted region on chromosome V, two local BAC (bacterial artificial chromosome) contigs were constructed and sequenced, which corresponded to parts of the homologous chromosomes of the diploid, heterozygous genotype P6/210. Two contiguous sequences of 417,445 and 202,781 base pairs were assembled and annotated. Gene-by-gene co-linearity was disrupted by non-allelic insertions of retrotransposon elements, stretches of diverged intergenic sequences, differences in gene content and gene order. The latter was caused by inversion of a 70 kbp genomic fragment. These features were also found in comparison to orthologous sequence contigs from three homeologous chromosomes of Solanum demissum, a wild tuber bearing species. Functional annotation of the sequence identified 48 putative open reading frames (ORFs) in one contig and 22 in the other, with an average of one ORF every 9 kbp. Ten ORFs were classified as resistance-gene-like, 11 as F-box-containing genes, 13 as transposable elements and three as transcription factors. Comparing potato to Arabidopsis thaliana annotated proteins revealed five micro-syntenic blocks of three to seven ORFs with A. thaliana chromosomes 1, 3 and 5. Conclusion: Comparative sequence analysis revealed highly conserved collinear regions

  6. Next Generation Sequencing Identifies Five Major Classes of Potentially Therapeutic Enzymes Secreted by Lucilia sericata Medical Maggots.

    Science.gov (United States)

    Franta, Zdeněk; Vogel, Heiko; Lehmann, Rüdiger; Rupp, Oliver; Goesmann, Alexander; Vilcinskas, Andreas

    2016-01-01

    Lucilia sericata larvae are used as an alternative treatment for recalcitrant and chronic wounds. Their excretions/secretions contain molecules that facilitate tissue debridement, disinfect, or accelerate wound healing and have therefore been recognized as a potential source of novel therapeutic compounds. Among the substances present in excretions/secretions various peptidase activities promoting the wound healing processes have been detected but the peptidases responsible for these activities remain mostly unidentified. To explore these enzymes we applied next generation sequencing to analyze the transcriptomes of different maggot tissues (salivary glands, gut, and crop) associated with the production of excretions/secretions and/or with digestion as well as the rest of the larval body. As a result we obtained more than 123.8 million paired-end reads, which were assembled de novo using Trinity and Oases assemblers, yielding 41,421 contigs with an N50 contig length of 2.22 kb and a total length of 67.79 Mb. BLASTp analysis against the MEROPS database identified 1729 contigs in 577 clusters encoding five peptidase classes (serine, cysteine, aspartic, threonine, and metallopeptidases), which were assigned to 26 clans, 48 families, and 185 peptidase species. The individual enzymes were differentially expressed among maggot tissues and included peptidase activities related to the therapeutic effects of maggot excretions/secretions.

  7. Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses.

    Science.gov (United States)

    Liu, Bo; Madduri, Ravi K; Sotomayor, Borja; Chard, Kyle; Lacinski, Lukasz; Dave, Utpal J; Li, Jianqiang; Liu, Chunchen; Foster, Ian T

    2014-06-01

    With the coming deluge of genome data, storing and processing large-scale genome data, providing easy access to biomedical analysis tools, and sharing and retrieving data efficiently all present significant challenges. The variability in data volume results in variable computing and storage requirements; therefore, biomedical researchers are pursuing more reliable, dynamic and convenient methods for conducting sequencing analyses. This paper proposes a Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses, which enables reliable and highly scalable execution of sequencing analysis workflows in a fully automated manner. Our platform extends the existing Galaxy workflow system by adding data management capabilities for transferring large quantities of data efficiently and reliably (via Globus Transfer), domain-specific analysis tools preconfigured for immediate use by researchers (via user-specific tools integration), automatic deployment on Cloud for on-demand resource allocation and pay-as-you-go pricing (via Globus Provision), a Cloud provisioning tool for auto-scaling (via HTCondor scheduler), and the support for validating the correctness of workflows (via semantic verification tools). Two bioinformatics workflow use cases as well as performance evaluation are presented to validate the feasibility of the proposed approach. Copyright © 2014 Elsevier Inc. All rights reserved.

  8. Genome signature analysis of thermal virus metagenomes reveals Archaea and thermophilic signatures

    Directory of Open Access Journals (Sweden)

    Pride David T

    2008-09-01

    Full Text Available Abstract Background Metagenomic analysis provides a rich source of biological information for otherwise intractable viral communities. However, study of viral metagenomes has been hampered by its nearly complete reliance on BLAST algorithms for identification of DNA sequences. We sought to develop algorithms for examination of viral metagenomes to identify the origin of sequences independent of BLAST algorithms. We chose viral metagenomes obtained from two hot springs, Bear Paw and Octopus, in Yellowstone National Park, as they represent simple microbial populations where comparatively large contigs were obtained. Thermal spring metagenomes have high proportions of sequences without significant GenBank homology, which has hampered identification of viruses and their linkage with hosts. To analyze each metagenome, we developed a method to classify DNA fragments using genome signature-based phylogenetic classification (GSPC), where metagenomic fragments are compared to a database of oligonucleotide signatures for all previously sequenced Bacteria, Archaea, and viruses. Results From both Bear Paw and Octopus hot springs, each assembled contig had more similarity to other metagenome contigs than to any sequenced microbial genome based on GSPC analysis, suggesting a genome signature common to each of these extreme environments. While viral metagenomes from Bear Paw and Octopus share some similarity, the genome signatures from each locale are largely unique. GSPC using a microbial database predicts most of the Octopus metagenome has archaeal signatures, while bacterial signatures predominate in Bear Paw; a finding consistent with those of GenBank BLAST. When using a viral database, the majority of the Octopus metagenome is predicted to belong to archaeal virus Families Globuloviridae and Fuselloviridae, while none of the Bear Paw metagenome is predicted to belong to archaeal viruses. As expected, when microbial and viral databases are combined, each of

  9. A contig-based strategy for the genome-wide discovery of microRNAs without complete genome resources.

    Directory of Open Access Journals (Sweden)

    Jun-Zhi Wen

    Full Text Available MicroRNAs (miRNAs) are important regulators of many cellular processes and exist in a wide range of eukaryotes. High-throughput sequencing is a mainstream method of miRNA identification through which it is possible to obtain the complete small RNA profile of an organism. Currently, most approaches to miRNA identification rely on a reference genome for the prediction of hairpin structures. However, many species of economic and phylogenetic importance are non-model organisms without complete genome sequences, and this limits miRNA discovery. Here, to overcome this limitation, we have developed a contig-based miRNA identification strategy. We applied this method to a triploid species of edible banana (GCTCV-119, Musa spp. AAA group) and identified 180 pre-miRNAs and 314 mature miRNAs, which is three times more than were predicted by the available dataset-based methods (represented by EST+GSS). Based on the recently published miRNA data set of Musa acuminata, the recall rate and precision of our strategy are estimated to be 70.6% and 92.2%, respectively, significantly better than those of the EST+GSS-based strategy (10.2% and 50.0%, respectively). Our novel, efficient and cost-effective strategy facilitates the study of the functional and evolutionary role of miRNAs, as well as miRNA-based molecular breeding, in non-model species of economic or evolutionary interest.
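
    The recall rate and precision quoted in this record are standard set-overlap measures. The sketch below shows how they are usually computed when a predicted set is compared with a reference set. The function name and the miRNA identifiers are hypothetical illustrations and are not data or code from the study.

```python
def recall_precision(predicted, reference):
    """Return (recall, precision) of a predicted set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Recall: fraction of reference items that were recovered.
    recall = true_positives / len(reference) if reference else 0.0
    # Precision: fraction of predictions that are in the reference.
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical toy identifiers (assumption), not results from the paper.
reference = {"miR156", "miR159", "miR160", "miR164", "miR166"}
predicted = {"miR156", "miR159", "miR160", "miRx1"}
recall, precision = recall_precision(predicted, reference)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

    In this toy case three of five reference entries are recovered and three of four predictions are correct.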

  10. De novo transcriptome assembly for a non-model species, the blood-sucking bug Triatoma brasiliensis, a vector of Chagas disease.

    Science.gov (United States)

    Marchant, A; Mougel, F; Almeida, C; Jacquin-Joly, E; Costa, J; Harry, M

    2015-04-01

    High throughput sequencing (HTS) provides new research opportunities for work on non-model organisms, such as differential expression studies between populations exposed to different environmental conditions. However, such transcriptomic studies first require the production of a reference assembly. The choice of sampling procedure, sequencing strategy and assembly workflow is crucial. To develop a reliable reference transcriptome for Triatoma brasiliensis, the major Chagas disease vector in Northeastern Brazil, different de novo assembly protocols were generated using various datasets and software. Both 454 and Illumina sequencing technologies were applied to RNA extracted from antennae and mouthparts from single or pooled individuals. The 454 library yielded 278 Mb. Fifteen Illumina libraries were constructed and yielded nearly 360 million RNA-seq single reads and 46 million RNA-seq paired-end reads for nearly 45 Gb. For the 454 reads, we used three assemblers, Newbler, CAP3 and/or MIRA, and for the Illumina reads, the Trinity assembler. Ten assembly workflows were compared using these programs separately or in combination. To compare the assemblies obtained, quantitative and qualitative criteria were used, including contig length, N50, contig number and the percentage of chimeric contigs. Completeness of the assemblies was estimated using the CEGMA pipeline. The best assembly (57,657 contigs, completeness of 80%, <1% chimeric contigs) was a hybrid assembly, which leads us to recommend the use of (1) a single individual with large representation of biological tissues, (2) merging both long reads and short paired-end Illumina reads, (3) several assemblers in order to combine the specific advantages of each.
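
    N50, one of the comparison criteria named in this record, is the contig length at which the contigs of that length or longer cover at least half of the total assembly length. The sketch below computes it together with contig number and total length. The function names and the toy lengths are assumptions for illustration and do not come from the study.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


def assembly_summary(lengths):
    """Collect the basic quantitative criteria used to compare assemblies."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths),
        "n50": n50(lengths),
    }


# Hypothetical contig lengths (assumption), chosen so the result is easy to check by hand.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(assembly_summary(toy_lengths))
```

    For the toy lengths the total is 54, and the three longest contigs (10 + 9 + 8 = 27) reach half of it, so the N50 is 8.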

  11. Conservation of gene cassettes among diverse viruses of the human gut.

    Directory of Open Access Journals (Sweden)

    Samuel Minot

    Full Text Available Viruses are a crucial component of the human microbiome, but large population sizes, high sequence diversity, and high frequencies of novel genes have hindered genomic analysis by high-throughput sequencing. Here we investigate approaches to metagenomic assembly to probe genome structure in a sample of 5.6 Gb of gut viral DNA sequence from six individuals. Tests showed that a new pipeline based on de Bruijn graph assembly yielded longer contigs that were able to recruit more reads than the equivalent non-optimized, single-pass approach. To characterize gene content, the database of viral RefSeq proteins was compared to the assembled viral contigs, generating a bipartite graph with functional cassettes linking together viral contigs, which revealed a high degree of connectivity between diverse genomes involving multiple genes of the same functional class. In a second step, open reading frames were grouped by their co-occurrence on contigs in a database-independent manner, revealing conserved cassettes of co-oriented ORFs. These methods reveal that free-living bacteriophages, while usually dissimilar at the nucleotide level, often have significant similarity at the level of encoded amino acid motifs, gene order, and gene orientation. These findings thus connect contemporary metagenomic analysis with classical studies of bacteriophage genomic cassettes. Software is available at https://sourceforge.net/projects/optitdba/.
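
    As background to the de Bruijn graph assembly mentioned in this record, the sketch below builds a small de Bruijn graph from reads and extends a contig along unambiguous edges. It is a deliberately simplified illustration and not the pipeline described in the paper: it ignores reverse complements, sequencing errors and coverage, and the walk checks only the number of outgoing edges. All names and reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Walk from start while exactly one successor exists; stop at a branch, dead end or cycle."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads (assumption) drawn from the toy sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
print(extend_unambiguous(graph, "ATG"))
```

    With k = 4 the three reads share enough 3-mer overlaps for the walk from ATG to recover the full toy sequence. Real assemblers add error correction and graph simplification on top of this idea.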

  12. Whole Genome Sequence Analysis of Pig Respiratory Bacterial Pathogens with Elevated Minimum Inhibitory Concentrations for Macrolides.

    Science.gov (United States)

    Dayao, Denise Ann Estarez; Seddon, Jennifer M; Gibson, Justine S; Blackall, Patrick J; Turni, Conny

    2016-10-01

    Macrolides are often used to treat and control bacterial pathogens causing respiratory disease in pigs. This study analyzed the whole genome sequences of one clinical isolate of Actinobacillus pleuropneumoniae, Haemophilus parasuis, Pasteurella multocida, and Bordetella bronchiseptica, all isolated from Australian pigs to identify the mechanism underlying the elevated minimum inhibitory concentrations (MICs) for erythromycin, tilmicosin, or tulathromycin. The H. parasuis assembled genome had a nucleotide transition at position 2059 (A to G) in the six copies of the 23S rRNA gene. This mutation has previously been associated with macrolide resistance but this is the first reported mechanism associated with elevated macrolide MICs in H. parasuis. There was no known macrolide resistance mechanism identified in the other three bacterial genomes. However, strA and sul2, aminoglycoside and sulfonamide resistance genes, respectively, were detected in one contiguous sequence (contig 1) of A. pleuropneumoniae assembled genome. This contig was identical to plasmids previously identified in Pasteurellaceae. This study has provided one possible explanation of elevated MICs to macrolides in H. parasuis. Further studies are necessary to clarify the mechanism causing the unexplained macrolide resistance in other Australian pig respiratory pathogens including the role of efflux systems, which were detected in all analyzed genomes.

  13. Complete Genome Sequence of Methanohalophilus halophilus DSM 3094T, Isolated from a Cyanobacterial Mat and Bottom Deposits at Hamelin Pool, Shark Bay, Northwestern Australia

    KAUST Repository

    L'Haridon, Stéphane

    2017-02-17

    The complete genome sequence of Methanohalophilus halophilus DSM 3094, a member of the Methanosarcinaceae family and the Methanosarcinales order, consists of 2,022,959 bp in one contig and contains 2,137 predicted genes. The genome is consistent with a halophilic methylotrophic anaerobic lifestyle, including the methylotrophic and CO2-H2 methanogenesis pathways.

  14. Complete Genome Sequence of Methanohalophilus halophilus DSM 3094T, Isolated from a Cyanobacterial Mat and Bottom Deposits at Hamelin Pool, Shark Bay, Northwestern Australia

    KAUST Repository

    L'Haridon, Stéphane; Corre, Erwan; Guan, Yue; Vinu, Manikandan; La Cono, Violetta; Yakimov, Mickail; Stingl, Ulrich; Toffin, Laurent; Jebbar, Mohamed

    2017-01-01

    The complete genome sequence of Methanohalophilus halophilus DSM 3094, a member of the Methanosarcinaceae family and the Methanosarcinales order, consists of 2,022,959 bp in one contig and contains 2,137 predicted genes. The genome is consistent with a halophilic methylotrophic anaerobic lifestyle, including the methylotrophic and CO2-H2 methanogenesis pathways.

  15. The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color.

    Science.gov (United States)

    Motamayor, Juan C; Mockaitis, Keithanne; Schmutz, Jeremy; Haiminen, Niina; Livingstone, Donald; Cornejo, Omar; Findley, Seth D; Zheng, Ping; Utro, Filippo; Royaert, Stefan; Saski, Christopher; Jenkins, Jerry; Podicheti, Ram; Zhao, Meixia; Scheffler, Brian E; Stack, Joseph C; Feltus, Frank A; Mustiga, Guiliana M; Amores, Freddy; Phillips, Wilbert; Marelli, Jean Philippe; May, Gregory D; Shapiro, Howard; Ma, Jianxin; Bustamante, Carlos D; Schnell, Raymond J; Main, Dorrie; Gilbert, Don; Parida, Laxmi; Kuhn, David N

    2013-06-03

    Theobroma cacao L. cultivar Matina 1-6 belongs to the most cultivated cacao type. The availability of its genome sequence and methods for identifying genes responsible for important cacao traits will aid cacao researchers and breeders. We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than a sequenced Criollo cultivar, and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10x the scaffold N50 and 4x the contig N50 as Criollo, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within the target site for a highly conserved trans-acting siRNA in dicots, found within TcMYB113, seems to affect transcript levels of this gene and therefore pod color variation. We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits.

  16. Genome and Plasmid Sequences of Escherichia coli KV7, an Extended-Spectrum β-Lactamase Isolate Derived from Feces of a Healthy Pig

    DEFF Research Database (Denmark)

    Bateman, Michael D; de Vries, Stefan P W; Gupta, Srishti

    2017-01-01

    We present single-contig assemblies for Escherichia coli strain KV7 (serotype O27, phylogenetic group D) and its six plasmids, isolated from a healthy pig, as determined by PacBio RS II and Illumina MiSeq sequencing. The chromosome of 4,997,475 bp and G+C content of 50.75% harbored 4,540 protein-...

  17. Dicty_cDB: VHP253 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VH (Link to library) VHP253 (Link to dictyBase) - - - Contig-U16349-1 - (Link to Or...iginal site) - - VHP253Z 355 - - - - Show VHP253 Library VH (Link to library) Clone ID VHP253 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16349-1 Original site URL http://dictycdb.b...CTTGGGTACCAAGAACTGACCGTCAATTTGCTGGTTCATGGTTTGC sequence update 2002. 9.10 Translated Amino Acid sequence ---MQLFAGIKSICT...VPIMRMYFHTGILDYILFKSWVPRTDRQFAGSWF Translated Amino Acid sequence (All Frames) Frame A: ---MQLFAGIKSICTEMAMD

  18. An Efficient Approach to Mining Maximal Contiguous Frequent Patterns from Large DNA Sequence Databases

    Directory of Open Access Journals (Sweden)

    Md. Rezaul Karim

    2012-03-01

    Full Text Available Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are preferable for expressing the function and structure of DNA sequences and hence can capture the common data characteristics among related sequences. Biologists are interested in finding frequent orderly arrangements of motifs that are responsible for similar expression of a group of genes. In order to reduce mining time and complexity, however, most existing sequence mining algorithms either focus on finding short DNA sequences or require explicit specification of sequence lengths in advance. The challenge is to find longer sequences without specifying sequence lengths in advance. In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets. The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.
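
    To make the term concrete, the sketch below mines maximal contiguous frequent patterns with a naive level-wise search. It is not the memory-efficient algorithm proposed in the paper. It assumes that support is the number of sequences containing a substring, and that a frequent pattern is maximal when no frequent pattern one symbol longer contains it. The function name and the toy sequences are hypothetical.

```python
def maximal_contiguous_frequent_patterns(sequences, min_support):
    """Naive level-wise search; support = number of sequences containing the substring."""
    frequent = set()
    length = 1
    while True:
        support = {}
        for seq in sequences:
            # Count each distinct substring once per sequence.
            for sub in {seq[i:i + length] for i in range(len(seq) - length + 1)}:
                support[sub] = support.get(sub, 0) + 1
        level = {sub for sub, count in support.items() if count >= min_support}
        if not level:
            # No frequent pattern of this length means none can exist at longer lengths.
            break
        frequent |= level
        length += 1
    # Keep a pattern only if no frequent pattern one symbol longer contains it.
    return sorted(
        p for p in frequent
        if not any(q[1:] == p or q[:-1] == p for q in frequent if len(q) == len(p) + 1)
    )


# Hypothetical toy DNA sequences (assumption), small enough to verify by hand.
sequences = ["ACGTAC", "TACGTT", "GACGTA"]
print(maximal_contiguous_frequent_patterns(sequences, min_support=3))
```

    Here ACGT occurs in all three sequences and cannot be extended, and TA also occurs in all three but is not part of any longer frequent pattern, so both are reported. The search needs no pattern length specified in advance, which is the property the abstract emphasizes, although this naive version makes no attempt at efficiency.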

  19. Dicty_cDB: SHI251 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SH (Link to library) SHI251 (Link to dictyBase) - - - Contig-U11819-1 - (Link to Or...iginal site) SHI251F 125 - - - - - - Show SHI251 Library SH (Link to library) Clone ID SHI251 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U11819-1 Original site URL http://dictycdb.b...XX sequence update 2002.10.25 Translated Amino Acid sequence ilfqilkistnk**IKNYYVNRVYEIIIIINICT...YKKK--- Translated Amino Acid sequence (All Frames) Frame A: ilfqilkistnk**IKNYYVNRVYEIIIIINICT

  20. Plasmid flux in Escherichia coli ST131 sublineages, analyzed by plasmid constellation network (PLACNET), a new method for plasmid reconstruction from whole genome sequences.

    Science.gov (United States)

    Lanza, Val F; de Toro, María; Garcillán-Barcia, M Pilar; Mora, Azucena; Blanco, Jorge; Coque, Teresa M; de la Cruz, Fernando

    2014-12-01

    Bacterial whole genome sequence (WGS) methods are rapidly overtaking classical sequence analysis. Many bacterial sequencing projects focus on mobilome changes, since macroevolutionary events, such as the acquisition or loss of mobile genetic elements, mainly plasmids, play essential roles in adaptive evolution. Existing WGS analysis protocols do not assort contigs between plasmids and the main chromosome, thus hampering full analysis of plasmid sequences. We developed a method (called plasmid constellation networks or PLACNET) that identifies, visualizes and analyzes plasmids in WGS projects by creating a network of contig interactions, thus allowing comprehensive plasmid analysis within WGS datasets. The workflow of the method is based on three types of data: assembly information (including scaffold links and coverage), comparison to reference sequences and plasmid-diagnostic sequence features. The resulting network is pruned by expert analysis, to eliminate confounding data, and implemented in a Cytoscape-based graphic representation. To demonstrate PLACNET sensitivity and efficacy, the plasmidome of the Escherichia coli lineage ST131 was analyzed. ST131 is a globally spread clonal group of extraintestinal pathogenic E. coli (ExPEC), comprising different sublineages with ability to acquire and spread antibiotic resistance and virulence genes via plasmids. Results show that plasmids flux in the evolution of this lineage, which is wide open for plasmid exchange. MOBF12/IncF plasmids were pervasive, adding just by themselves more than 350 protein families to the ST131 pangenome. Nearly 50% of the most frequent γ-proteobacterial plasmid groups were found to be present in our limited sample of ten analyzed ST131 genomes, which represent the main ST131 sublineages.

  1. Ulysses: accurate detection of low-frequency structural variations in large insert-size sequencing libraries.

    Science.gov (United States)

    Gillet-Markowska, Alexandre; Richard, Hugues; Fischer, Gilles; Lafontaine, Ingrid

    2015-03-15

    The detection of structural variations (SVs) in short-range Paired-End (PE) libraries remains challenging because SV breakpoints can involve large dispersed repeated sequences, or carry inherent complexity, hardly resolvable with classical PE sequencing data. In contrast, large insert-size sequencing libraries (Mate-Pair libraries) provide higher physical coverage of the genome and give access to repeat-containing regions. They can thus theoretically overcome previous limitations as they are becoming routinely accessible. Nevertheless, broad insert size distributions and high rates of chimerical sequences are usually associated with this type of library, which makes the accurate annotation of SVs challenging. Here, we present Ulysses, a tool that achieves drastically higher detection accuracy than existing tools, both on simulated and real mate-pair sequencing datasets from the 1000 Human Genome project. Ulysses achieves high specificity over the complete spectrum of variants by assessing, in a principled manner, the statistical significance of each possible variant (duplications, deletions, translocations, insertions and inversions) against an explicit model for the generation of experimental noise. This statistical model proves particularly useful for the detection of low frequency variants. SV detection performed on a large insert Mate-Pair library from a breast cancer sample revealed a high level of somatic duplications in the tumor and, to a lesser extent, in the blood sample as well. Altogether, these results show that Ulysses is a valuable tool for the characterization of somatic mosaicism in human tissues and in cancer genomes.

  2. Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database

    Directory of Open Access Journals (Sweden)

    Marais Gabriel AB

    2011-07-01

    Full Text Available Abstract Background The genus Silene is widely used as a model system for addressing ecological and evolutionary questions in plants, but advances in using the genus as a model system are impeded by the lack of available resources for studying its genome. Massively parallel sequencing cDNA has recently developed into an efficient method for characterizing the transcriptomes of non-model organisms, generating massive amounts of data that enable the study of multiple species in a comparative framework. The sequences generated provide an excellent resource for identifying expressed genes, characterizing functional variation and developing molecular markers, thereby laying the foundations for future studies on gene sequence and gene expression divergence. Here, we report the results of a comparative transcriptome sequencing study of eight individuals representing four Silene and one Dianthus species as outgroup. All sequences and annotations have been deposited in a newly developed and publicly available database called SiESTa, the Silene EST annotation database. Results A total of 1,041,122 EST reads were generated in two runs on a Roche GS-FLX 454 pyrosequencing platform. EST reads were analyzed separately for all eight individuals sequenced and were assembled into contigs using TGICL. These were annotated with results from BLASTX searches and Gene Ontology (GO) terms, and thousands of single-nucleotide polymorphisms (SNPs) were characterized. Unassembled reads were kept as singletons and together with the contigs contributed to the unigenes characterized in each individual. The high quality of unigenes is evidenced by the proportion (49%) that have significant hits in similarity searches with the A. thaliana proteome. The SiESTa database is accessible at http://www.siesta.ethz.ch. Conclusion The sequence collections established in the present study provide an important genomic resource for four Silene and one Dianthus species and will help to

  3. Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database

    Science.gov (United States)

    2011-01-01

    Background The genus Silene is widely used as a model system for addressing ecological and evolutionary questions in plants, but advances in using the genus as a model system are impeded by the lack of available resources for studying its genome. Massively parallel sequencing cDNA has recently developed into an efficient method for characterizing the transcriptomes of non-model organisms, generating massive amounts of data that enable the study of multiple species in a comparative framework. The sequences generated provide an excellent resource for identifying expressed genes, characterizing functional variation and developing molecular markers, thereby laying the foundations for future studies on gene sequence and gene expression divergence. Here, we report the results of a comparative transcriptome sequencing study of eight individuals representing four Silene and one Dianthus species as outgroup. All sequences and annotations have been deposited in a newly developed and publicly available database called SiESTa, the Silene EST annotation database. Results A total of 1,041,122 EST reads were generated in two runs on a Roche GS-FLX 454 pyrosequencing platform. EST reads were analyzed separately for all eight individuals sequenced and were assembled into contigs using TGICL. These were annotated with results from BLASTX searches and Gene Ontology (GO) terms, and thousands of single-nucleotide polymorphisms (SNPs) were characterized. Unassembled reads were kept as singletons and together with the contigs contributed to the unigenes characterized in each individual. The high quality of unigenes is evidenced by the proportion (49%) that have significant hits in similarity searches with the A. thaliana proteome. The SiESTa database is accessible at http://www.siesta.ethz.ch. Conclusion The sequence collections established in the present study provide an important genomic resource for four Silene and one Dianthus species and will help to further develop Silene as a

  4. Characterization of Aftershock Sequences from Large Strike-Slip Earthquakes Along Geometrically Complex Faults

    Science.gov (United States)

    Sexton, E.; Thomas, A.; Delbridge, B. G.

    2017-12-01

    Large earthquakes often exhibit complex slip distributions and occur along non-planar fault geometries, resulting in variable stress changes throughout the region of the fault hosting aftershocks. To better discern the role of geometric discontinuities on aftershock sequences, we compare areas of enhanced and reduced Coulomb failure stress and mean stress for systematic differences in the time dependence and productivity of these aftershock sequences. In strike-slip faults, releasing structures, including stepovers and bends, experience an increase in both Coulomb failure stress and mean stress during an earthquake, promoting fluid diffusion into the region and further failure. Conversely, Coulomb failure stress and mean stress decrease in restraining bends and stepovers in strike-slip faults, and fluids diffuse away from these areas, discouraging failure. We examine spatial differences in seismicity patterns along structurally complex strike-slip faults which have hosted large earthquakes, such as the 1992 Mw 7.3 Landers, the 2010 Mw 7.2 El Mayor-Cucapah, the 2014 Mw 6.0 South Napa, and the 2016 Mw 7.0 Kumamoto events. We characterize the behavior of these aftershock sequences with the Epidemic Type Aftershock-Sequence Model (ETAS). In this statistical model, the total occurrence rate of aftershocks induced by an earthquake is λ(t) = λ_0 + \sum_{i:t_i

  5. Dicty_cDB: SHA393 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SH (Link to library) SHA393 (Link to dictyBase) - - - Contig-U11503-1 SHA393E (Link... Clone ID SHA393 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U11503-1 Original site URL http://dict...lated Amino Acid sequence ekqfsl*iy*YMIRKSNNFSILFAIFLKIVFVVSAPLCPNSTILLNYNILTVYNSSEGCG FNNXPICTSLKDAVXRAFLLI...yhcysyfg Translated Amino Acid sequence (All Frames) Frame A: ekqfsl*iy*YMIRKSNNFSILFAIFLKIVFVVSAPLCPNSTILLNYNILTVYNSSEGCG FNNXPICT...Homology vs Protein Score E Sequences producing significant alignments: (bits) Value AF020283_1( AF020283 |pid:none) Dict

  6. The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color

    Science.gov (United States)

    2013-01-01

    Background Theobroma cacao L. cultivar Matina 1-6 belongs to the most cultivated cacao type. The availability of its genome sequence and methods for identifying genes responsible for important cacao traits will aid cacao researchers and breeders. Results We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than a sequenced Criollo cultivar, and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10x the scaffold N50 and 4x the contig N50 as Criollo, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within the target site for a highly conserved trans-acting siRNA in dicots, found within TcMYB113, seems to affect transcript levels of this gene and therefore pod color variation. Conclusions We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits. PMID:23731509

  7. Hybrid sequencing approach applied to human fecal metagenomic clone libraries revealed clones with potential biotechnological applications.

    Science.gov (United States)

    Džunková, Mária; D'Auria, Giuseppe; Pérez-Villarroya, David; Moya, Andrés

    2012-01-01

    Natural environments represent an incredible source of microbial genetic diversity. Discovery of novel biomolecules involves biotechnological methods that often require the design and implementation of biochemical assays to screen clone libraries. However, when an assay is applied to thousands of clones, one may eventually end up with very few positive clones which, in most of the cases, have to be "domesticated" for downstream characterization and application, and this makes screening both laborious and expensive. The negative clones, which are not considered by the selected assay, may also have biotechnological potential; however, unfortunately they would remain unexplored. Knowledge of the clone sequences provides important clues about potential biotechnological application of the clones in the library; however, the sequencing of clones one-by-one would be very time-consuming and expensive. In this study, we characterized the first metagenomic clone library from the feces of a healthy human volunteer, using a method based on 454 pyrosequencing coupled with a clone-by-clone Sanger end-sequencing. Instead of whole individual clone sequencing, we sequenced 358 clones in a pool. The medium-large insert (7-15 kb) cloning strategy allowed us to assemble these clones correctly, and to assign the clone ends to maintain the link between the position of a living clone in the library and the annotated contig from the 454 assembly. Finally, we found several open reading frames (ORFs) with previously described potential medical application. The proposed approach allows planning ad-hoc biochemical assays for the clones of interest, and the appropriate sub-cloning strategy for gene expression in suitable vectors/hosts.

  8. Analysis and functional annotation of expressed sequence tags from the fall armyworm Spodoptera frugiperda

    Science.gov (United States)

    Deng, Youping; Dong, Yinghua; Thodima, Venkata; Clem, Rollie J; Passarelli, A Lorena

    2006-01-01

    Background Little is known about the genome sequences of lepidopteran insects, although this group of insects has been studied extensively in the fields of endocrinology, development, immunity, and pathogen-host interactions. In addition, cell lines derived from Spodoptera frugiperda and other lepidopteran insects are routinely used for baculovirus foreign gene expression. This study reports the results of an expressed sequence tag (EST) sequencing project in cells from the lepidopteran insect S. frugiperda, the fall armyworm. Results We have constructed an EST database using two cDNA libraries from the S. frugiperda-derived cell line, SF-21. The database consists of 2,367 ESTs which were assembled into 244 contigs and 951 singlets for a total of 1,195 unique sequences. Conclusion S. frugiperda is an agriculturally important pest insect and genomic information will be instrumental for establishing initial transcriptional profiling and gene function studies, and for obtaining information about genes manipulated during infections by insect pathogens such as baculoviruses. PMID:17052344

  9. Genetic architecture of vitamin B12 and folate levels uncovered applying deeply sequenced large datasets

    DEFF Research Database (Denmark)

    Grarup, Niels; Sulem, Patrick; Sandholt, Camilla H

    2013-01-01

    of the underlying biology of human traits and diseases. Here, we used a large Icelandic whole genome sequence dataset combined with Danish exome sequence data to gain insight into the genetic architecture of serum levels of vitamin B12 (B12) and folate. Up to 22.9 million sequence variants were analyzed in combined...... in serum B12 or folate levels do not modify the risk of developing these conditions. Yet, the study demonstrates the value of combining whole genome and exome sequencing approaches to ascertain the genetic and molecular architectures underlying quantitative trait associations....

  10. The First Symbiont-Free Genome Sequence of Marine Red Alga, Susabi-nori (Pyropia yezoensis)

    Science.gov (United States)

    Nakamura, Yoji; Sasaki, Naobumi; Kobayashi, Masahiro; Ojima, Nobuhiko; Yasuike, Motoshige; Shigenobu, Yuya; Satomi, Masataka; Fukuma, Yoshiya; Shiwaku, Koji; Tsujimoto, Atsumi; Kobayashi, Takanori; Nakayama, Ichiro; Ito, Fuminari; Nakajima, Kazuhiro; Sano, Motohiko; Wada, Tokio; Kuhara, Satoru; Inouye, Kiyoshi; Gojobori, Takashi; Ikeo, Kazuho

    2013-01-01

    Nori, a marine red alga, is one of the most profitable mariculture crops in the world. However, the biological properties of this macroalga are poorly understood at the molecular level. In this study, we determined the draft genome sequence of susabi-nori (Pyropia yezoensis) using next-generation sequencing platforms. For sequencing, thalli of P. yezoensis were washed to remove bacteria attached on the cell surface and enzymatically prepared as purified protoplasts. The assembled contig size of the P. yezoensis nuclear genome was approximately 43 megabases (Mb), which is an order of magnitude smaller than the previously estimated genome size. A total of 10,327 gene models were predicted and about 60% of the genes validated lack introns and the other genes have shorter introns compared to large-genome algae, which is consistent with the compact size of the P. yezoensis genome. A sequence homology search showed that 3,611 genes (35%) are functionally unknown and only 2,069 gene groups are in common with those of the unicellular red alga, Cyanidioschyzon merolae. As color trait determinants of red algae, light-harvesting genes involved in the phycobilisome were predicted from the P. yezoensis nuclear genome. In particular, we found a second homolog of phycobilisome-degradation gene, which is usually chloroplast-encoded, possibly providing a novel target for color fading of susabi-nori in aquaculture. These findings shed light on unexplained features of macroalgal genes and genomes, and suggest that the genome of P. yezoensis is a promising model genome of marine red algae. PMID:23536760

  11. The first symbiont-free genome sequence of marine red alga, Susabi-nori (Pyropia yezoensis).

    Directory of Open Access Journals (Sweden)

    Yoji Nakamura

    Full Text Available Nori, a marine red alga, is one of the most profitable mariculture crops in the world. However, the biological properties of this macroalga are poorly understood at the molecular level. In this study, we determined the draft genome sequence of susabi-nori (Pyropia yezoensis) using next-generation sequencing platforms. For sequencing, thalli of P. yezoensis were washed to remove bacteria attached on the cell surface and enzymatically prepared as purified protoplasts. The assembled contig size of the P. yezoensis nuclear genome was approximately 43 megabases (Mb), which is an order of magnitude smaller than the previously estimated genome size. A total of 10,327 gene models were predicted and about 60% of the genes validated lack introns and the other genes have shorter introns compared to large-genome algae, which is consistent with the compact size of the P. yezoensis genome. A sequence homology search showed that 3,611 genes (35%) are functionally unknown and only 2,069 gene groups are in common with those of the unicellular red alga, Cyanidioschyzon merolae. As color trait determinants of red algae, light-harvesting genes involved in the phycobilisome were predicted from the P. yezoensis nuclear genome. In particular, we found a second homolog of phycobilisome-degradation gene, which is usually chloroplast-encoded, possibly providing a novel target for color fading of susabi-nori in aquaculture. These findings shed light on unexplained features of macroalgal genes and genomes, and suggest that the genome of P. yezoensis is a promising model genome of marine red algae.

  12. Identification of microRNAs from Eugenia uniflora by high-throughput sequencing and bioinformatics analysis.

    Science.gov (United States)

    Guzman, Frank; Almerão, Mauricio P; Körbes, Ana P; Loss-Morais, Guilherme; Margis, Rogerio

    2012-01-01

    microRNAs or miRNAs are small non-coding regulatory RNAs that play important functions in the regulation of gene expression at the post-transcriptional level by targeting mRNAs for degradation or inhibiting protein translation. Eugenia uniflora is a plant native to tropical America with pharmacological and ecological importance, and there have been no previous studies concerning its gene expression and regulation. To date, no miRNAs have been reported in Myrtaceae species. Small RNA and RNA-seq libraries were constructed to identify miRNAs and pre-miRNAs in Eugenia uniflora. Solexa technology was used to perform high throughput sequencing of the library, and the data obtained were analyzed using bioinformatics tools. From 14,489,131 small RNA clean reads, we obtained 1,852,722 mature miRNA sequences representing 45 conserved families that have been identified in other plant species. Further analysis using contigs assembled from RNA-seq allowed the prediction of secondary structures of 25 known and 17 novel pre-miRNAs. The expression of twenty-seven identified miRNAs was also validated using RT-PCR assays. Potential targets were predicted for the most abundant mature miRNAs in the identified pre-miRNAs based on sequence homology. This study is the first large scale identification of miRNAs and their potential targets from a species of the Myrtaceae family without genomic sequence resources. Our study provides more information about the evolutionary conservation of the regulatory network of miRNAs in plants and highlights species-specific miRNAs.

  13. The Transcriptome Analysis and Comparison Explorer--T-ACE: a platform-independent, graphical tool to process large RNAseq datasets of non-model organisms.

    Science.gov (United States)

    Philipp, E E R; Kraemer, L; Mountfort, D; Schilhabel, M; Schreiber, S; Rosenstiel, P

    2012-03-15

    Next generation sequencing (NGS) technologies allow a rapid and cost-effective compilation of large RNA sequence datasets in model and non-model organisms. However, the storage and analysis of transcriptome information from different NGS platforms is still a significant bottleneck, leading to a delay in data dissemination and subsequent biological understanding. Especially database interfaces with transcriptome analysis modules going beyond mere read counts are missing. Here, we present the Transcriptome Analysis and Comparison Explorer (T-ACE), a tool designed for the organization and analysis of large sequence datasets, and especially suited for transcriptome projects of non-model organisms with little or no a priori sequence information. T-ACE offers a TCL-based interface, which accesses a PostgreSQL database via a php-script. Within T-ACE, information belonging to single sequences or contigs, such as annotation or read coverage, is linked to the respective sequence and immediately accessible. Sequences and assigned information can be searched via keyword- or BLAST-search. Additionally, T-ACE provides within and between transcriptome analysis modules on the level of expression, GO terms, KEGG pathways and protein domains. Results are visualized and can be easily exported for external analysis. We developed T-ACE for laboratory environments, which have only a limited amount of bioinformatics support, and for collaborative projects in which different partners work on the same dataset from different locations or platforms (Windows/Linux/MacOS). For laboratories with some experience in bioinformatics and programming, the low complexity of the database structure and open-source code provides a framework that can be customized according to the different needs of the user and transcriptome project.

  14. A physical map of the heterozygous grapevine 'Cabernet Sauvignon' allows mapping candidate genes for disease resistance

    Directory of Open Access Journals (Sweden)

    Scalabrin Simone

    2008-06-01

    Full Text Available Abstract Background Whole-genome physical maps facilitate genome sequencing, sequence assembly, mapping of candidate genes, and the design of targeted genetic markers. An automated protocol was used to construct a Vitis vinifera 'Cabernet Sauvignon' physical map. The quality of the result was addressed with regard to the effect of high heterozygosity on the accuracy of contig assembly. Its usefulness for the genome-wide mapping of genes for disease resistance, which is an important trait for grapevine, was then assessed. Results The physical map included 29,727 BAC clones assembled into 1,770 contigs, spanning 715,684 kbp, and corresponding to 1.5-fold the genome size. Map inflation was due to high heterozygosity, which caused either the separation of allelic BACs in two different contigs, or local mis-assembly in contigs containing BACs from the two haplotypes. Genetic markers anchored 395 contigs or 255,476 kbp to chromosomes. The fully automated assembly and anchorage procedures were validated by BAC-by-BAC blast of the end sequences against the grape genome sequence, unveiling 7.3% of chimerical contigs. The distribution across the physical map of candidate genes for non-host and host resistance, and for defence signalling pathways was then studied. NBS-LRR and RLK genes for host resistance were found in 424 contigs, 133 of them (32%) were assigned to chromosomes, on which they are mostly organised in clusters. Non-host and defence signalling genes were found in 99 contigs dispersed without a discernable pattern across the genome. Conclusion Despite some limitations that interfere with the correct assembly of heterozygous clones into contigs, the 'Cabernet Sauvignon' physical map is a useful and reliable intermediary step between a genetic map and the genome sequence. This tool was successfully exploited for a quick mapping of complex families of genes, and it strengthened previous clues of co-localisation of major NBS-LRR clusters and

  15. An Ambystoma mexicanum EST sequencing project: analysis of 17,352 expressed sequence tags from embryonic and regenerating blastema cDNA libraries

    Science.gov (United States)

    Habermann, Bianca; Bebin, Anne-Gaelle; Herklotz, Stephan; Volkmer, Michael; Eckelt, Kay; Pehlke, Kerstin; Epperlein, Hans Henning; Schackert, Hans Konrad; Wiebe, Glenis; Tanaka, Elly M

    2004-01-01

    Background The ambystomatid salamander, Ambystoma mexicanum (axolotl), is an important model organism in evolutionary and regeneration research but relatively little sequence information has so far been available. This is a major limitation for molecular studies on caudate development, regeneration and evolution. To address this lack of sequence information we have generated an expressed sequence tag (EST) database for A. mexicanum. Results Two cDNA libraries, one made from stage 18-22 embryos and the other from day-6 regenerating tail blastemas, generated 17,352 sequences. From the sequenced ESTs, 6,377 contigs were assembled that probably represent 25% of the expressed genes in this organism. Sequence comparison revealed significant homology to entries in the NCBI non-redundant database. Further examination of this gene set revealed the presence of genes involved in important cell and developmental processes, including cell proliferation, cell differentiation and cell-cell communication. On the basis of these data, we have performed phylogenetic analysis of key cell-cycle regulators. Interestingly, while cell-cycle proteins such as the cyclin B family display expected evolutionary relationships, the cyclin-dependent kinase inhibitor 1 gene family shows an unusual evolutionary behavior among the amphibians. Conclusions Our analysis reveals the importance of a comprehensive sequence set from a representative of the Caudata and illustrates that the EST sequence database is a rich source of molecular, developmental and regeneration studies. To aid in data mining, the ESTs have been organized into an easily searchable database that is freely available online. PMID:15345051

  16. Transcriptomic analysis of cadmium stress response in the heavy metal hyperaccumulator Sedum alfredii Hance.

    Directory of Open Access Journals (Sweden)

    Jun Gao

    Full Text Available The Sedum alfredii Hance hyperaccumulating ecotype (HE) has the ability to hyperaccumulate cadmium (Cd), as well as zinc (Zn) and lead (Pb) in above-ground tissues. Although many physiological studies have been conducted with these plants, the molecular mechanisms underlying their hyper-tolerance to heavy metals are largely unknown. Here we report on the generation of 9.4 gigabases of adaptor-trimmed raw sequences and the assembly of 57,162 transcript contigs in S. alfredii Hance (HE) shoots by the combination of Roche 454 and Illumina/Solexa deep sequencing technologies. We also have functionally annotated the transcriptome and analyzed the transcriptome changes upon Cd hyperaccumulation in S. alfredii Hance (HE) shoots. There are 110 contigs and 123 contigs that were up-regulated (Fold Change ≥ 2.0) and down-regulated (Fold Change ... large-scale expressed sequence information and genome-wide transcriptome profiling of Cd responses in S. alfredii Hance (HE) shoots.

  17. Using nanopore sequencing to get complete genomes from complex samples

    DEFF Research Database (Denmark)

    Kirkegaard, Rasmus Hansen; Karst, Søren Michael; Nielsen, Per Halkjær

    The advantages of “next generation sequencing” have come at the cost of genome finishing. The dominant sequencing technology provides short reads of 150-300 bp, which has made genome assembly very difficult as the reads do not span important repeat regions. Genomes have thus been added...... to the databases as fragmented assemblies and not as finished contigs that resemble the chromosomes in which the DNA is organised within the cells. This is especially troublesome for genomes derived from complex metagenome sequencing. Databases with incomplete genomes can lead to false conclusions about...... the absence of genes and functional predictions of the organisms. Furthermore, it is common that repetitive elements and marker genes such as the 16S rRNA gene are missing completely from these genome bins. Using nanopore long reads, we demonstrate that it is possible to span these regions and make complete...

  18. Plasmid Flux in Escherichia coli ST131 Sublineages, Analyzed by Plasmid Constellation Network (PLACNET), a New Method for Plasmid Reconstruction from Whole Genome Sequences

    Science.gov (United States)

    Garcillán-Barcia, M. Pilar; Mora, Azucena; Blanco, Jorge; Coque, Teresa M.; de la Cruz, Fernando

    2014-01-01

    Bacterial whole genome sequence (WGS) methods are rapidly overtaking classical sequence analysis. Many bacterial sequencing projects focus on mobilome changes, since macroevolutionary events, such as the acquisition or loss of mobile genetic elements, mainly plasmids, play essential roles in adaptive evolution. Existing WGS analysis protocols do not assort contigs between plasmids and the main chromosome, thus hampering full analysis of plasmid sequences. We developed a method (called plasmid constellation networks or PLACNET) that identifies, visualizes and analyzes plasmids in WGS projects by creating a network of contig interactions, thus allowing comprehensive plasmid analysis within WGS datasets. The workflow of the method is based on three types of data: assembly information (including scaffold links and coverage), comparison to reference sequences and plasmid-diagnostic sequence features. The resulting network is pruned by expert analysis, to eliminate confounding data, and implemented in a Cytoscape-based graphic representation. To demonstrate PLACNET sensitivity and efficacy, the plasmidome of the Escherichia coli lineage ST131 was analyzed. ST131 is a globally spread clonal group of extraintestinal pathogenic E. coli (ExPEC), comprising different sublineages with ability to acquire and spread antibiotic resistance and virulence genes via plasmids. Results show that plasmids flux in the evolution of this lineage, which is wide open for plasmid exchange. MOBF12/IncF plasmids were pervasive, adding just by themselves more than 350 protein families to the ST131 pangenome. Nearly 50% of the most frequent γ–proteobacterial plasmid groups were found to be present in our limited sample of ten analyzed ST131 genomes, which represent the main ST131 sublineages. PMID:25522143

  19. Plasmid flux in Escherichia coli ST131 sublineages, analyzed by plasmid constellation network (PLACNET), a new method for plasmid reconstruction from whole genome sequences.

    Directory of Open Access Journals (Sweden)

    Val F Lanza

    2014-12-01

    Full Text Available Bacterial whole genome sequence (WGS) methods are rapidly overtaking classical sequence analysis. Many bacterial sequencing projects focus on mobilome changes, since macroevolutionary events, such as the acquisition or loss of mobile genetic elements, mainly plasmids, play essential roles in adaptive evolution. Existing WGS analysis protocols do not assort contigs between plasmids and the main chromosome, thus hampering full analysis of plasmid sequences. We developed a method (called plasmid constellation networks or PLACNET) that identifies, visualizes and analyzes plasmids in WGS projects by creating a network of contig interactions, thus allowing comprehensive plasmid analysis within WGS datasets. The workflow of the method is based on three types of data: assembly information (including scaffold links and coverage), comparison to reference sequences and plasmid-diagnostic sequence features. The resulting network is pruned by expert analysis, to eliminate confounding data, and implemented in a Cytoscape-based graphic representation. To demonstrate PLACNET sensitivity and efficacy, the plasmidome of the Escherichia coli lineage ST131 was analyzed. ST131 is a globally spread clonal group of extraintestinal pathogenic E. coli (ExPEC), comprising different sublineages with ability to acquire and spread antibiotic resistance and virulence genes via plasmids. Results show that plasmids flux in the evolution of this lineage, which is wide open for plasmid exchange. MOBF12/IncF plasmids were pervasive, adding just by themselves more than 350 protein families to the ST131 pangenome. Nearly 50% of the most frequent γ-proteobacterial plasmid groups were found to be present in our limited sample of ten analyzed ST131 genomes, which represent the main ST131 sublineages.

  20. MetaPhinder-Identifying Bacteriophage Sequences in Metagenomic Data Sets

    DEFF Research Database (Denmark)

    Jurtz, Vanessa Isabell; Villarroel, Julia; Lund, Ole

    2016-01-01

    ...and understand them. Here we present MetaPhinder, a method to identify assembled genomic fragments (i.e. contigs) of phage origin in metagenomic data sets. The method is based on a comparison to a database of whole genome bacteriophage sequences, integrating hits to multiple genomes to accommodate for the mosaic genome structure of many bacteriophages. The method is demonstrated to outperform both BLAST methods based on single hits and methods based on k-mer comparisons. MetaPhinder is available as a web service at the Center for Genomic Epidemiology https://cge.cbs.dtu.dk/services/MetaPhinder/, while the source code can be downloaded from https://bitbucket.org/genomicepidemiology/metaphinder or https://github.com/vanessajurtz/MetaPhinder.
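
    The idea of integrating hits to multiple genomes, rather than relying on a single best hit, can be illustrated with a minimal sketch. It is not MetaPhinder's actual scoring, which the abstract does not specify. The function name `merged_coverage` and the interval representation (0-based, half-open positions on the contig) are assumptions made for illustration only.

```python
def merged_coverage(contig_length, hits):
    """Fraction of a contig covered by the union of hit intervals.

    hits: list of (start, end) positions on the contig (0-based, half-open),
    pooled from alignments against any number of reference phage genomes.
    This representation is hypothetical and chosen for illustration.
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(hits):
        if current_end is None or start > current_end:
            # The previous merged block is finished; add its length.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent hit: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / contig_length


# Three hits, possibly to three different genomes, on a 1,000 bp contig.
print(merged_coverage(1000, [(0, 300), (200, 500), (700, 800)]))
```

    No single hit in this example covers more than 30% of the contig, but the pooled hits cover 60% of it, which is the kind of signal a single-hit criterion would miss for a mosaic genome.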

  1. RePS: a sequence assembler that masks exact repeats identified from the shotgun data

    DEFF Research Database (Denmark)

    Wang, Jun; Wong, Gane Ka-Shu; Ni, Peixiang

    2002-01-01

    We describe a sequence assembler, RePS (repeat-masked Phrap with scaffolding), that explicitly identifies exact 20mer repeats from the shotgun data and removes them prior to the assembly. The established software is used to compute meaningful error probabilities for each base. Clone-end-pairing information is used to construct scaffolds that order and orient the contigs. We show with real data for human and rice that reasonable assemblies are possible even at coverages of only 4x to 6x, despite having up to 42.2% in exact repeats. Publication date: 2002-May...
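
    The repeat-masking step described above can be sketched as follows. This is a simplified illustration, not the RePS implementation. The function name, the `min_count` threshold, and the use of `N` as the mask character are assumptions, and reverse-complement k-mers are ignored. In real shotgun data the threshold would have to sit well above the sequencing depth, since every genuine single-copy k-mer is also seen several times.

```python
from collections import Counter


def mask_repeated_kmers(reads, k=20, min_count=2, mask_char="N"):
    """Mask every position covered by a k-mer seen at least min_count times.

    reads: list of read sequences (strings). The threshold and mask
    character are hypothetical parameters chosen for illustration.
    """
    # Count every exact k-mer across all reads.
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    masked_reads = []
    for read in reads:
        masked = list(read)
        for i in range(len(read) - k + 1):
            if counts[read[i:i + k]] >= min_count:
                # Overwrite the k positions spanned by the repeated k-mer.
                masked[i:i + k] = mask_char * k
        masked_reads.append("".join(masked))
    return masked_reads


# Toy example with k=4: the shared stretch CCCCG is masked in both reads.
print(mask_repeated_kmers(["AAAACCCCGT", "TTCCCCGA"], k=4))
```

    The masked reads can then be handed to an assembler that skips masked bases when seeding overlaps, so that repeats cannot cause false joins.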

  2. Evaluation of a transposase protocol for rapid generation of shotgun high-throughput sequencing libraries from nanogram quantities of DNA.

    Science.gov (United States)

    Marine, Rachel; Polson, Shawn W; Ravel, Jacques; Hatfull, Graham; Russell, Daniel; Sullivan, Matthew; Syed, Fraz; Dumas, Michael; Wommack, K Eric

    2011-11-01

    Construction of DNA fragment libraries for next-generation sequencing can prove challenging, especially for samples with low DNA yield. Protocols devised to circumvent the problems associated with low starting quantities of DNA can result in amplification biases that skew the distribution of genomes in metagenomic data. Moreover, sample throughput can be slow, as current library construction techniques are time-consuming. This study evaluated Nextera, a new transposon-based method that is designed for quick production of DNA fragment libraries from a small quantity of DNA. The sequence read distribution across nine phage genomes in a mock viral assemblage met predictions for six of the least-abundant phages; however, the rank order of the most abundant phages differed slightly from predictions. De novo genome assemblies from Nextera libraries provided long contigs spanning over half of the phage genome; in four cases where full-length genome sequences were available for comparison, consensus sequences were found to match over 99% of the genome with near-perfect identity. Analysis of areas of low and high sequence coverage within phage genomes indicated that GC content may influence coverage of sequences from Nextera libraries. Comparisons of phage genomes prepared using both Nextera and a standard 454 FLX Titanium library preparation protocol suggested that the coverage biases according to GC content observed within the Nextera libraries were largely attributable to bias in the Nextera protocol rather than to the 454 sequencing technology. Nevertheless, given suitable sequence coverage, the Nextera protocol produced high-quality data for genomic studies. For metagenomics analyses, effects of GC amplification bias would need to be considered; however, the library preparation standardization that Nextera provides should benefit comparative metagenomic analyses.

  3. Dicty_cDB: VHA365 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VH (Link to library) VHA365 (Link to dictyBase) - - - Contig-U16349-1 - (Link to Or...iginal site) - - VHA365Z 352 - - - - Show VHA365 Library VH (Link to library) Clone ID VHA365 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16349-1 Original site URL http://dictycdb.b...TTGGGTACCAAGAACTGACCGTCAATTTGCTGGTTCATGGTTT sequence update 2002. 9.10 Translated Amino Acid sequence ---QLFAGIKSICT...wfmv Frame C: ---QLFAGIKSICTEMAMDGCEKCSGNSPTTTCDVLPVYSSLCMAMPDMSQCANWTKMCS SSGQLY

  4. Dicty_cDB: VHC263 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VH (Link to library) VHC263 (Link to dictyBase) - - - Contig-U16349-1 - (Link to Or...iginal site) - - VHC263Z 429 - - - - Show VHC263 Library VH (Link to library) Clone ID VHC263 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16349-1 Original site URL http://dictycdb.b...ATGG ACTCCCAAC sequence update 2002. 9.10 Translated Amino Acid sequence ---QLFAGIKSICT...kyfrkkmdsq Frame B: ---QLFAGIKSICTXMAMDGCEKCSGNSPTTTCDVLPVYSSLCMAMPDMSQCANWTKMCS

  5. SNP markers retrieval for a non-model species: a practical approach

    Directory of Open Access Journals (Sweden)

    Shahin Arwa

    2012-01-01

    Full Text Available Abstract Background SNP (Single Nucleotide Polymorphism) markers are rapidly becoming the markers of choice for applications in breeding because of next generation sequencing technology developments. For SNP development by NGS technologies, correct assembly of the huge amounts of sequence data generated is essential. Little is known about assembler's performance, especially when dealing with highly heterogeneous species that show a high genome complexity and what the possible consequences are of differences in assemblies on SNP retrieval. This study tested two assemblers (CAP3 and CLC) on 454 data from four lily genotypes and compared results with respect to SNP retrieval. Results CAP3 assembly resulted in higher numbers of contigs, lower numbers of reads per contig, and shorter average read lengths compared to CLC. Blast comparisons showed that CAP3 contigs were highly redundant. Contrastingly, CLC in rare cases combined paralogs in one contig. Redundant and chimeric contigs may lead to erroneous SNPs. Filtering for redundancy can be done by blasting selected SNP markers to the contigs and discarding all the SNP markers that show more than one blast hit. Results on chimeric contigs showed that only four out of 2,421 SNP markers were selected from chimeric contigs. Conclusion In practice, CLC performs better in assembling highly heterogeneous genome sequences compared to CAP3, and consequently SNP retrieval is more efficient. Additionally a simple flow scheme is suggested for SNP marker retrieval that can be valid for all non-model species.
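
    The redundancy filter described in the abstract (discard every SNP marker with more than one blast hit) can be sketched as below. The input format, a list of (SNP marker, contig) pairs with one pair per hit, and the function name are assumptions for illustration. The sketch counts distinct contigs per marker, which equals the number of hits only if each contig is reported at most once per marker.

```python
def filter_unique_snps(blast_hits):
    """Keep SNP markers whose sequence hits exactly one contig.

    blast_hits: iterable of (snp_id, contig_id) pairs, one per BLAST hit.
    This pair format is hypothetical; a real pipeline would parse it from
    the BLAST output.
    """
    contigs_per_snp = {}
    for snp_id, contig_id in blast_hits:
        contigs_per_snp.setdefault(snp_id, set()).add(contig_id)
    # Markers hitting more than one contig are treated as redundant.
    return sorted(
        snp for snp, contigs in contigs_per_snp.items() if len(contigs) == 1
    )


hits = [("snp1", "c1"), ("snp2", "c1"), ("snp2", "c7"), ("snp3", "c4")]
print(filter_unique_snps(hits))
```

    Here `snp2` is dropped because it matches two contigs, which suggests the assembly holds redundant copies of the same region.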

  6. Integration of transcriptomic and proteomic data from a single wheat cultivar provides new tools for understanding the roles of individual alpha gliadin proteins in flour quality and celiac disease

    Science.gov (United States)

    One-hundred-thirty-six expressed sequence tags (ESTs) encoding alpha gliadins from Triticum aestivum cv Butte 86 were identified in public databases and assembled into 19 contigs. Consensus sequences for 12 of the contigs encoded complete alpha gliadin proteins, but only two were identical to protei...

  7. Viral metagenomics: Analysis of begomoviruses by illumina high-throughput sequencing

    KAUST Repository

    Idris, Ali

    2014-03-12

    Traditional DNA sequencing methods are inefficient, lack the ability to discern the least abundant viral sequences, and are ineffective for determining the extent of variability in viral populations. Here, populations of single-stranded DNA plant begomoviral genomes and their associated beta- and alpha-satellite molecules (virus-satellite complexes) (genus, Begomovirus; family, Geminiviridae) were enriched from total nucleic acids isolated from symptomatic, field-infected plants, using rolling circle amplification (RCA). Enriched virus-satellite complexes were subjected to Illumina-Next Generation Sequencing (NGS). CASAVA and SeqMan NGen programs were implemented, respectively, for quality control and for de novo and reference-guided contig assembly of viral-satellite sequences. The authenticity of the begomoviral sequences, and the reproducibility of the Illumina-NGS approach for begomoviral deep sequencing projects, were validated by comparing NGS results with those obtained using traditional molecular cloning and Sanger sequencing of viral components and satellite DNAs, also enriched by RCA or amplified by polymerase chain reaction. As the use of NGS approaches, together with advances in software development, makes possible deep sequence coverage at a lower cost, the approach described herein will streamline the exploration of begomovirus diversity and population structure from naturally infected plants, irrespective of viral abundance. This is the first report of the implementation of Illumina-NGS to explore the diversity and identify begomoviral-satellite SNPs directly from plants naturally-infected with begomoviruses under field conditions. 2014 by the authors; licensee MDPI, Basel, Switzerland.

  8. Viral Metagenomics: Analysis of Begomoviruses by Illumina High-Throughput Sequencing

    Directory of Open Access Journals (Sweden)

    Ali Idris

    2014-03-01

    Full Text Available Traditional DNA sequencing methods are inefficient, lack the ability to discern the least abundant viral sequences, and are ineffective for determining the extent of variability in viral populations. Here, populations of single-stranded DNA plant begomoviral genomes and their associated beta- and alpha-satellite molecules (virus-satellite complexes) (genus, Begomovirus; family, Geminiviridae) were enriched from total nucleic acids isolated from symptomatic, field-infected plants, using rolling circle amplification (RCA). Enriched virus-satellite complexes were subjected to Illumina-Next Generation Sequencing (NGS). CASAVA and SeqMan NGen programs were implemented, respectively, for quality control and for de novo and reference-guided contig assembly of viral-satellite sequences. The authenticity of the begomoviral sequences, and the reproducibility of the Illumina-NGS approach for begomoviral deep sequencing projects, were validated by comparing NGS results with those obtained using traditional molecular cloning and Sanger sequencing of viral components and satellite DNAs, also enriched by RCA or amplified by polymerase chain reaction. As the use of NGS approaches, together with advances in software development, makes possible deep sequence coverage at a lower cost, the approach described herein will streamline the exploration of begomovirus diversity and population structure from naturally infected plants, irrespective of viral abundance. This is the first report of the implementation of Illumina-NGS to explore the diversity and identify begomoviral-satellite SNPs directly from plants naturally-infected with begomoviruses under field conditions.

  9. Sequence comparison of prefrontal cortical brain transcriptome from a tame and an aggressive silver fox (Vulpes vulpes)

    Directory of Open Access Journals (Sweden)

    Sun Qi

    2011-10-01

    Full Text Available Abstract Background Two strains of the silver fox (Vulpes vulpes), with markedly different behavioral phenotypes, have been developed by long-term selection for behavior. Foxes from the tame strain exhibit friendly behavior towards humans, paralleling the sociability of canine puppies, whereas foxes from the aggressive strain are defensive and exhibit aggression to humans. To understand the genetic differences underlying these behavioral phenotypes fox-specific genomic resources are needed. Results cDNA from mRNA from pre-frontal cortex of a tame and an aggressive fox was sequenced using the Roche 454 FLX Titanium platform (> 2.5 million reads & 0.9 Gbase of tame fox sequence; >3.3 million reads & 1.2 Gbase of aggressive fox sequence). Over 80% of the fox reads were assembled into contigs. Mapping fox reads against the fox transcriptome assembly and the dog genome identified over 30,000 high confidence fox-specific SNPs. Fox transcripts for approximately 14,000 genes were identified using SwissProt and the dog RefSeq databases. An at least 2-fold expression difference between the two samples (p ... Conclusions Transcriptome sequencing significantly expanded genomic resources available for the fox, a species without a sequenced genome. In a very cost efficient manner this yielded a large number of fox-specific SNP markers for genetic studies and provided significant insights into the gene expression profile of the fox pre-frontal cortex; expression differences between the two fox samples; and a catalogue of potentially important gene-specific sequence variants. This result demonstrates the utility of this approach for developing genomic resources in species with limited genomic information.

  10. Sequence comparison of prefrontal cortical brain transcriptome from a tame and an aggressive silver fox (Vulpes vulpes)

    Science.gov (United States)

    2011-01-01

    Background Two strains of the silver fox (Vulpes vulpes), with markedly different behavioral phenotypes, have been developed by long-term selection for behavior. Foxes from the tame strain exhibit friendly behavior towards humans, paralleling the sociability of canine puppies, whereas foxes from the aggressive strain are defensive and exhibit aggression to humans. To understand the genetic differences underlying these behavioral phenotypes fox-specific genomic resources are needed. Results cDNA from mRNA from pre-frontal cortex of a tame and an aggressive fox was sequenced using the Roche 454 FLX Titanium platform (> 2.5 million reads & 0.9 Gbase of tame fox sequence; >3.3 million reads & 1.2 Gbase of aggressive fox sequence). Over 80% of the fox reads were assembled into contigs. Mapping fox reads against the fox transcriptome assembly and the dog genome identified over 30,000 high confidence fox-specific SNPs. Fox transcripts for approximately 14,000 genes were identified using SwissProt and the dog RefSeq databases. An at least 2-fold expression difference between the two samples (p fox transcriptome. Conclusions Transcriptome sequencing significantly expanded genomic resources available for the fox, a species without a sequenced genome. In a very cost efficient manner this yielded a large number of fox-specific SNP markers for genetic studies and provided significant insights into the gene expression profile of the fox pre-frontal cortex; expression differences between the two fox samples; and a catalogue of potentially important gene-specific sequence variants. This result demonstrates the utility of this approach for developing genomic resources in species with limited genomic information. PMID:21967120

  11. Model SNP development for complex genomes based on hexaploid oat using high-throughput 454 sequencing technology

    Directory of Open Access Journals (Sweden)

    Chao Shiaoman

    2011-01-01

    Full Text Available Abstract Background Genetic markers are pivotal to modern genomics research; however, discovery and genotyping of molecular markers in oat has been hindered by the size and complexity of the genome, and by a scarcity of sequence data. The purpose of this study was to generate oat expressed sequence tag (EST) information, develop a bioinformatics pipeline for SNP discovery, and establish a method for rapid, cost-effective, and straightforward genotyping of SNP markers in complex polyploid genomes such as oat. Results Based on cDNA libraries of four cultivated oat genotypes, approximately 127,000 contigs were assembled from approximately one million Roche 454 sequence reads. Contigs were filtered through a novel bioinformatics pipeline to eliminate ambiguous polymorphism caused by subgenome homology, and 96 in silico SNPs were selected from 9,448 candidate loci for validation using high-resolution melting (HRM) analysis. Of these, 52 (54%) were polymorphic between parents of the Ogle1040 × TAM O-301 (OT) mapping population, with 48 segregating as single Mendelian loci, and 44 being placed on the existing OT linkage map. Ogle and TAM amplicons from 12 primers were sequenced for SNP validation, revealing complex polymorphism in seven amplicons but general sequence conservation within SNP loci. Whole-amplicon interrogation with HRM revealed insertions, deletions, and heterozygotes in secondary oat germplasm pools, generating multiple alleles at some primer targets. To validate marker utility, 36 SNP assays were used to evaluate the genetic diversity of 34 diverse oat genotypes. Dendrogram clusters corresponded generally to known genome composition and genetic ancestry. Conclusions The high-throughput SNP discovery pipeline presented here is a rapid and effective method for identification of polymorphic SNP alleles in the oat genome. The current-generation HRM system is a simple and highly-informative platform for SNP genotyping. These techniques provide

  12. Draft Genome Sequences of Sanguibacteroides justesenii, gen. nov., sp. nov., Strains OUH 308042T (= ATCC BAA-2681T) and OUH 334697 (= ATCC BAA-2682), Isolated from Blood Cultures from Two Different Patients

    DEFF Research Database (Denmark)

    Sydenham, Thomas Vognbjerg; Hasman, Henrik; Justesen, Ulrik Stenz

    2015-01-01

    We announce here the draft genome sequences of Sanguibacteroides justesenii, gen. nov., sp. nov., strains OUH 308042T (= DSM 28342T = ATCC BAA-2681T) and OUH 334697 (= DSM 28341 = ATCC BAA-2682), isolated from blood cultures from two different patients and composed of 51 and 39 contigs for totals...

  13. An accurate clone-based haplotyping method by overlapping pool sequencing.

    Science.gov (United States)

    Li, Cheng; Cao, Changchang; Tu, Jing; Sun, Xiao

    2016-07-08

    Chromosome-long haplotyping of human genomes is important to identify genetic variants with differing gene expression, in human evolution studies, clinical diagnosis, and other biological and medical fields. Although several methods have realized haplotyping based on sequencing technologies or population statistics, accuracy and cost are factors that prohibit their wide use. Borrowing ideas from group testing theories, we proposed a clone-based haplotyping method by overlapping pool sequencing. The clones from a single individual were pooled combinatorially and then sequenced. According to the distinct pooling pattern for each clone in the overlapping pool sequencing, alleles for the recovered variants could be assigned to their original clones precisely. Subsequently, the clone sequences could be reconstructed by linking these alleles accordingly and assembling them into haplotypes with high accuracy. To verify the utility of our method, we constructed 130 110 clones in silico for the individual NA12878 and simulated the pooling and sequencing process. Ultimately, 99.9% of variants on chromosome 1 that were covered by clones from both parental chromosomes were recovered correctly, and 112 haplotype contigs were assembled with an N50 length of 3.4 Mb and no switch errors. A comparison with current clone-based haplotyping methods indicated our method was more accurate. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
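
    The group-testing idea in this abstract (each clone gets a distinct pooling pattern, and an allele is traced back to its clone by the set of pools in which it appears) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the use of fixed-size pool combinations, and the exact-match decoding are all assumptions made here for illustration, and overlapping clones and sequencing errors are ignored.

```python
from itertools import combinations


def assign_pool_patterns(clone_ids, n_pools, pools_per_clone):
    """Give every clone a distinct set of pools (hypothetical toy design).

    Patterns are taken, in order, from all combinations of
    `pools_per_clone` pools chosen out of `n_pools`.
    """
    clone_ids = list(clone_ids)
    patterns = combinations(range(n_pools), pools_per_clone)
    assignment = {
        clone_id: frozenset(pattern)
        for clone_id, pattern in zip(clone_ids, patterns)
    }
    if len(assignment) < len(clone_ids):
        raise ValueError("not enough distinct pool patterns for all clones")
    return assignment


def decode(observed_pools, assignment):
    """Return the clones whose pooling pattern equals the observed pool set."""
    observed = frozenset(observed_pools)
    return [clone for clone, pattern in assignment.items() if pattern == observed]


if __name__ == "__main__":
    design = assign_pool_patterns(["c1", "c2", "c3"], n_pools=4, pools_per_clone=2)
    # An allele seen only in pools 0 and 2 is attributed to one clone.
    print(decode({0, 2}, design))
```

    With n pools and k pools per clone, such a design distinguishes at most C(n, k) clones, which is why far fewer pools than clones are needed.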

  14. Deep sequencing as a method of typing bluetongue virus isolates.

    Science.gov (United States)

    Rao, Pavuluri Panduranga; Reddy, Yella Narasimha; Ganesh, Kapila; Nair, Shreeja G; Niranjan, Vidya; Hegde, Nagendra R

    2013-11-01

    Bluetongue (BT) is an economically important endemic disease of livestock in tropics and subtropics. In addition, its recent spread to temperate regions like North America and Northern Europe is of serious concern. Rapid serotyping and characterization of BT virus (BTV) is an essential step in the identification of origin of the virus and for controlling the disease. Serotyping of BTV is typically performed by serum neutralization, and of late by nucleotide sequencing. This report describes the near complete genome sequencing and typing of two isolates of BTV using Illumina next generation sequencing platform. Two of the BTV RNAs were multiplexed with ten other unknown samples. Viral RNA was isolated and fragmented, reverse transcribed, the cDNA ends were repaired and ligated with a multiplex oligo. The genome library was amplified using primers complementary to the ligated oligo and subjected to single and paired end sequencing. The raw reads were assembled using a de novo method and reference-based assembly was performed based on the contig data. Near complete sequences of all segments of BTV were obtained with more than 20× coverage, and single read sequencing method was sufficient to identify the genotype and serotype of the virus. The two viruses used in this study were typed as BTV-1 and BTV-9E. Copyright © 2013 Elsevier B.V. All rights reserved.

  15. A BAC-based physical map of the Drosophila buzzatii genome

    Energy Technology Data Exchange (ETDEWEB)

    Gonzalez, Josefa; Nefedov, Michael; Bosdet, Ian; Casals, Ferran; Calvete, Oriol; Delprat, Alejandra; Shin, Heesun; Chiu, Readman; Mathewson, Carrie; Wye, Natasja; Hoskins, Roger A.; Schein, Jacqueline E.; de Jong, Pieter; Ruiz, Alfredo

    2005-03-18

    Large-insert genomic libraries facilitate cloning of large genomic regions, allow the construction of clone-based physical maps and provide useful resources for sequencing entire genomes. Drosophila buzzatii is a representative species of the repleta group in the Drosophila subgenus, which is being widely used as a model in studies of genome evolution, ecological adaptation and speciation. We constructed a Bacterial Artificial Chromosome (BAC) genomic library of D. buzzatii using the shuttle vector pTARBAC2.1. The library comprises 18,353 clones with an average insert size of 152 kb and a ~18X expected representation of the D. buzzatii euchromatic genome. We screened the entire library with six euchromatic gene probes and estimated the actual genome representation to be ~23X. In addition, we fingerprinted by restriction digestion and agarose gel electrophoresis a sample of 9,555 clones, and assembled them using Finger Printed Contigs (FPC) software and manual editing into 345 contigs (mean of 26 clones per contig) and 670 singletons. Finally, we anchored 181 large contigs (containing 7,788 clones) to the D. buzzatii salivary gland polytene chromosomes by in situ hybridization of 427 representative clones. The BAC library and a database with all the information regarding the high coverage BAC-based physical map described in this paper are available to the research community.
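
    The expected library representation quoted above follows from the usual relation coverage = (number of clones × mean insert size) / genome size. The sketch below applies it to the figures in the abstract. The euchromatic genome size is not stated in the abstract, so the value computed here is only the size implied by the reported ~18X figure; the function names are illustrative.

```python
def library_coverage(n_clones, mean_insert_kb, genome_size_mb):
    """Expected genome representation (X) of a clone library."""
    total_insert_mb = n_clones * mean_insert_kb / 1000
    return total_insert_mb / genome_size_mb


def implied_genome_size_mb(n_clones, mean_insert_kb, coverage):
    """Genome size (Mb) implied by a stated expected coverage."""
    total_insert_mb = n_clones * mean_insert_kb / 1000
    return total_insert_mb / coverage


if __name__ == "__main__":
    # Figures from the abstract: 18,353 clones, 152 kb mean insert, ~18X.
    size_mb = implied_genome_size_mb(18_353, 152, 18)
    print(f"implied euchromatic genome size: {size_mb:.0f} Mb")
    print(f"coverage check: {library_coverage(18_353, 152, size_mb):.1f}X")
```

    The total cloned DNA is about 2,790 Mb, so an 18X representation corresponds to a euchromatic genome of roughly 155 Mb.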

  16. Dicty_cDB: CHA851 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available CH (Link to library) CHA851 (Link to dictyBase) - - - Contig-U16368-1 - (Link to Or...iginal site) CHA851F 614 - - - - - - Show CHA851 Library CH (Link to library) Clone ID CHA851 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16368-1 Original site URL http://dictycdb.b...TCCXXXXXXXXXX sequence update 2002.10.25 Translated Amino Acid sequence VRDARPPHNLCRGFGCPEGSHCEVLEKHPVCVRNHVPPHPPPPPQICGSVNCGPGYICT...nly*skttgttttllnlcraiism*srwn dlysstkqlyqy*ipmlpis--- Frame C: VRDARPPHNLCRGFGCPEGSHCEVLEKHPVCVRNHVPPHPPPPPQICGSVNCGPGYICT

  17. Dicty_cDB: VHP243 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VH (Link to library) VHP243 (Link to dictyBase) - - - Contig-U16236-1 - (Link to Or...iginal site) VHP243F 134 - - - - - - Show VHP243 Library VH (Link to library) Clone ID VHP243 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16236-1 Original site URL http://dictycdb.b...AXXXXXXXXXX sequence update 2002.10.25 Translated Amino Acid sequence CWPTGIXKTTICT...kilsif*ynfkyyqqpkkk--- Frame B: llaywyxqnnnlyqyyyyfyl*kyflsfniilniinnpkk--- Frame C: CWPTGIXKTTICTNTTIISICKN

  18. Using Growing Self-Organising Maps to Improve the Binning Process in Environmental Whole-Genome Shotgun Sequencing

    Science.gov (United States)

    Chan, Chon-Kit Kenneth; Hsu, Arthur L.; Tang, Sen-Lin; Halgamuge, Saman K.

    2008-01-01

    Metagenomic projects using whole-genome shotgun (WGS) sequencing produce many unassembled DNA sequences and small contigs. The step of clustering these sequences, based on biological and molecular features, is called binning. A reported strategy for binning that combines oligonucleotide frequency and self-organising maps (SOM) shows high potential. We improve this strategy by identifying suitable training features, implementing a better clustering algorithm, and defining quantitative measures for assessing results. We investigated the suitability of each of di-, tri-, tetra-, and pentanucleotide frequencies. The results show that dinucleotide frequency is not a sufficiently strong signature for binning 10 kb long DNA sequences, compared to the other three. Furthermore, we observed that increased order of oligonucleotide frequency may deteriorate the assignment result in some cases, which indicates the possible existence of optimal species-specific oligonucleotide frequency. We replaced SOM with growing self-organising map (GSOM) where comparable results are obtained while gaining 7%–15% speed improvement. PMID:18288261
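
    The oligonucleotide-frequency features mentioned in this abstract can be sketched as a fixed-length vector of k-mer frequencies per sequence, which is the kind of input a SOM or GSOM would be trained on. The abstract does not specify normalisation or strand handling, so the choices below (overlapping windows, single strand, windows with ambiguous bases skipped) are assumptions, and the function name is illustrative.

```python
from collections import Counter
from itertools import product


def oligo_frequencies(sequence, k=4):
    """Return the 4**k relative k-mer frequencies of a DNA sequence.

    The vector is ordered lexicographically over the alphabet ACGT.
    Windows containing any other character (for example N) are ignored.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


if __name__ == "__main__":
    vector = oligo_frequencies("ACGTACGTTTGACCA", k=2)
    print(len(vector), round(sum(vector), 6))
```

    Setting k to 2, 3, 4 or 5 gives the di-, tri-, tetra- and pentanucleotide signatures compared in the study, with vector lengths of 16, 64, 256 and 1,024.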

  19. Hybrid sequencing approach applied to human fecal metagenomic clone libraries revealed clones with potential biotechnological applications.

    Directory of Open Access Journals (Sweden)

    Mária Džunková

    Full Text Available Natural environments represent an incredible source of microbial genetic diversity. Discovery of novel biomolecules involves biotechnological methods that often require the design and implementation of biochemical assays to screen clone libraries. However, when an assay is applied to thousands of clones, one may eventually end up with very few positive clones which, in most of the cases, have to be "domesticated" for downstream characterization and application, and this makes screening both laborious and expensive. The negative clones, which are not considered by the selected assay, may also have biotechnological potential; however, unfortunately they would remain unexplored. Knowledge of the clone sequences provides important clues about potential biotechnological application of the clones in the library; however, the sequencing of clones one-by-one would be very time-consuming and expensive. In this study, we characterized the first metagenomic clone library from the feces of a healthy human volunteer, using a method based on 454 pyrosequencing coupled with a clone-by-clone Sanger end-sequencing. Instead of whole individual clone sequencing, we sequenced 358 clones in a pool. The medium-large insert (7-15 kb) cloning strategy allowed us to assemble these clones correctly, and to assign the clone ends to maintain the link between the position of a living clone in the library and the annotated contig from the 454 assembly. Finally, we found several open reading frames (ORFs) with previously described potential medical application. The proposed approach allows planning ad-hoc biochemical assays for the clones of interest, and the appropriate sub-cloning strategy for gene expression in suitable vectors/hosts.

  20. The Douglas-Fir Genome Sequence Reveals Specialization of the Photosynthetic Apparatus in Pinaceae

    Directory of Open Access Journals (Sweden)

    David B. Neale

    2017-09-01

    Full Text Available A reference genome sequence for Pseudotsuga menziesii var. menziesii (Mirb.) Franco (Coastal Douglas-fir) is reported, thus providing a reference sequence for a third genus of the family Pinaceae. The contiguity and quality of the genome assembly far exceed that of other conifer reference genome sequences (contig N50 = 44,136 bp and scaffold N50 = 340,704 bp). Incremental improvements in sequencing and assembly technologies are in part responsible for the higher quality reference genome, but it may also be due to a slightly lower exact repeat content in Douglas-fir vs. pine and spruce. Comparative genome annotation with angiosperm species reveals gene-family expansion and contraction in Douglas-fir and other conifers which may account for some of the major morphological and physiological differences between the two major plant groups. Notable differences in the size of the NDH-complex gene family and genes underlying the functional basis of shade tolerance/intolerance were observed. This reference genome sequence not only provides an important resource for Douglas-fir breeders and geneticists but also sheds additional light on the evolutionary processes that have led to the divergence of modern angiosperms from the more ancient gymnosperms.
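
    The contig and scaffold N50 values quoted in this and several other records are a standard contiguity statistic: the length L such that sequences of length L or longer together contain at least half of the total assembly length. A minimal sketch of the calculation, with toy lengths rather than data from any of the assemblies listed here:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # total assembly length is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


if __name__ == "__main__":
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
    print(n50(toy_contigs))
```

    Because the statistic is weighted by length, a handful of long contigs can raise N50 considerably even when most contigs in the assembly are short.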

  1. ADN-Viewer: a 3D approach for bioinformatic analyses of large DNA sequences.

    Science.gov (United States)

    Hérisson, Joan; Ferey, Nicolas; Gros, Pierre-Emmanuel; Gherbi, Rachid

    2007-01-20

    Most biologists work on textual DNA sequences that are limited to the linear representation of DNA. In this paper, we address the potential offered by Virtual Reality for 3D modeling and immersive visualization of large genomic sequences. The representation of the 3D structure of naked DNA allows biologists to observe and analyze genomes in an interactive way at different levels. We developed a powerful software platform that provides a new point of view for sequence analysis: ADNViewer. Nevertheless, a classical eukaryotic chromosome of 40 million base pairs requires about 6 Gbytes of 3D data. In order to manage these huge amounts of data in real-time, we designed various scene management algorithms and immersive human-computer interaction for user-friendly data exploration. In addition, one bioinformatics study scenario is proposed.

  2. Dicty_cDB: Contig-U01541-1 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available ula chromosome 7 BAC clone mth2-7... 36 3.7 5 ( FG283242 ) 1108457714276 New World Screwworm Egg 9261 ESTs C...1-1... 40 3.8 2 ( AM448784 ) Vitis vinifera contig VV78X077229.13, whole genom... 40 3.8 5 ( FG299281 ) 1108793334783 New World...malized cDNA li... 34 3.8 3 ( FG298782 ) 1108793320683 New World Screwworm Larvae 9387 EST... 40 3.8 2 ( AC2...) Populus trichocarpa clone POP011-A24, complete se... 38 3.9 5 ( FG298363 ) 1108793311332 New World Screwwo...rm Larvae 9387 EST... 40 3.9 2 ( AE017263 ) Mesoplasma florum L1 complete genome. 34 3.9 11 ( FG290177 ) 1108793315292 New World

  3. Transcriptomic analysis of grain amaranth (Amaranthus hypochondriacus) using 454 pyrosequencing: comparison with A. tuberculatus, expression profiling in stems and in response to biotic and abiotic stress

    Directory of Open Access Journals (Sweden)

    Vargas-Ortiz Erandi

    2011-07-01

    Full Text Available Abstract Background Amaranthus hypochondriacus, a grain amaranth, is a C4 plant noted for its ability to tolerate stressful conditions and produce highly nutritious seeds. These possess an optimal amino acid balance and constitute a rich source of health-promoting peptides. Although several recent studies, mostly involving subtractive hybridization strategies, have contributed to increase the relatively low number of grain amaranth expressed sequence tags (ESTs), transcriptomic information of this species remains limited, particularly regarding tissue-specific and biotic stress-related genes. Thus, a large scale transcriptome analysis was performed to generate stem- and (a)biotic stress-responsive gene expression profiles in grain amaranth. Results A total of 2,700,168 raw reads were obtained from six 454 pyrosequencing runs, which were assembled into 21,207 high quality sequences (20,408 isotigs + 799 contigs). The average sequence length was 1,064 bp and 930 bp for isotigs and contigs, respectively. Only 5,113 singletons were recovered after quality control. Contigs/isotigs were further incorporated into 15,667 isogroups. All unique sequences were queried against the nr, TAIR, UniRef100, UniRef50 and Amaranthaceae EST databases for annotation. Functional GO annotation was performed with all contigs/isotigs that produced significant hits with the TAIR database. Only 8,260 sequences were found to be homologous when the transcriptomes of A. tuberculatus and A. hypochondriacus were compared, most of which were associated with basic house-keeping processes. Digital expression analysis identified 1,971 differentially expressed genes in response to at least one of four stress treatments tested. These included several multiple-stress-inducible genes that could represent potential candidates for use in the engineering of stress-resistant plants. The transcriptomic data generated from pigmented stems shared similarity with findings reported in developing

  4. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB760 (Link to dictyBase) - - - Contig-U12286-1 VFB760P (Link... to Original site) VFB760F 474 VFB760Z 691 VFB760P 1165 - - Show VFB760 Library VF (Link to library) Clone ID VFB760 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U12286-1 Original site URL http://dict...CCACTTAATTCCAGAAGG sequence update 2001. 6. 1 Translated Amino Acid sequence LLAYWNKCQVNSCDKTTGNCKPENLKCPDRSNECLKNTGCDDLTGCKYVSICT...cmfllqsfxnplnsrr Frame C: LLAYWNKCQVNSCDKTTGNCKPENLKCPDRSNECLKNTGCDDLTGCKYVSICTDS

  5. Dicty_cDB: VSK196 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VS (Link to library) VSK196 (Link to dictyBase) - - - Contig-U10274-1 VSK196P (Link... to Original site) VSK196F 423 VSK196Z 453 VSK196P 876 - - Show VSK196 Library VS (Link to library) Clone ID VSK196 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U10274-1 Original site URL http://dict...TTATNAACAATTAAAAAAAA sequence update 2001. 3.22 Translated Amino Acid sequence kdgklvslkdfikdqkpivlyfypkdetsict...*NKIX--- ---KLKVGDQAPDFTCPDKDGKLVSLKDFIKDQKPIVLYFYPKDETSICTKEACEFRDKY QKFIEAGADVI

  6. Comprehensive transcriptome assembly of Chickpea (Cicer arietinum L.) using Sanger and next generation sequencing platforms: development and applications.

    Directory of Open Access Journals (Sweden)

    Himabindu Kudapa

    Full Text Available A comprehensive transcriptome assembly of chickpea has been developed using 134.95 million Illumina single-end reads, 7.12 million single-end FLX/454 reads and 139,214 Sanger expressed sequence tags (ESTs) from >17 genotypes. This hybrid transcriptome assembly, referred to as Cicer arietinum Transcriptome Assembly version 2 (CaTA v2, available at http://data.comparative-legumes.org/transcriptomes/cicar/lista_cicar-201201), comprising 46,369 transcript assembly contigs (TACs), has an N50 length of 1,726 bp and a maximum contig size of 15,644 bp. Putative functions were determined for 32,869 (70.8%) of the TACs and gene ontology assignments were determined for 21,471 (46.3%). The new transcriptome assembly was compared with the previously available chickpea transcriptome assemblies as well as to the chickpea genome. Comparative analysis of CaTA v2 against transcriptomes of three legumes - Medicago, soybean and common bean, resulted in 27,771 TACs common to all three legumes indicating strong conservation of genes across legumes. CaTA v2 was also used for identification of simple sequence repeats (SSRs) and intron spanning regions (ISRs) for developing molecular markers. ISRs were identified by aligning TACs to the Medicago genome, and their putative mapping positions at chromosomal level were identified using transcript map of chickpea. Primer pairs were designed for 4,990 ISRs, each representing a single contig for which predicted positions are inferred and distributed across eight linkage groups. A subset of randomly selected ISRs representing all eight chickpea linkage groups were validated on five chickpea genotypes and showed 20% polymorphism with average polymorphic information content (PIC) of 0.27. In summary, the hybrid transcriptome assembly developed and novel markers identified can be used for a variety of applications such as gene discovery, marker-trait association, diversity analysis etc., to advance genetics research and breeding

  7. A novel approach to sequence validating protein expression clones with automated decision making

    Directory of Open Access Journals (Sweden)

    Mohr Stephanie E

    2007-06-01

    Full Text Available Abstract Background Whereas the molecular assembly of protein expression clones is readily automated and routinely accomplished in high throughput, sequence verification of these clones is still largely performed manually, an arduous and time consuming process. The ultimate goal of validation is to determine if a given plasmid clone matches its reference sequence sufficiently to be "acceptable" for use in protein expression experiments. Given the accelerating increase in availability of tens of thousands of unverified clones, there is a strong demand for rapid, efficient and accurate software that automates clone validation. Results We have developed an Automated Clone Evaluation (ACE) system – the first comprehensive, multi-platform, web-based plasmid sequence verification software package. ACE automates the clone verification process by defining each clone sequence as a list of multidimensional discrepancy objects, each describing a difference between the clone and its expected sequence including the resulting polypeptide consequences. To evaluate clones automatically, this list can be compared against user acceptance criteria that specify the allowable number of discrepancies of each type. This strategy allows users to re-evaluate the same set of clones against different acceptance criteria as needed for use in other experiments. ACE manages the entire sequence validation process including contig management, identifying and annotating discrepancies, determining if discrepancies correspond to polymorphisms and clone finishing. Designed to manage thousands of clones simultaneously, ACE maintains a relational database to store information about clones at various completion stages, project processing parameters and acceptance criteria. In a direct comparison, the automated analysis by ACE took less time and was more accurate than a manual analysis of a 93 gene clone set. Conclusion ACE was designed to facilitate high throughput clone sequence

  8. Analysis of quality raw data of second generation sequencers with Quality Assessment Software.

    Science.gov (United States)

    Ramos, Rommel Tj; Carneiro, Adriana R; Baumbach, Jan; Azevedo, Vasco; Schneider, Maria Pc; Silva, Artur

    2011-04-18

    Second generation technologies have advantages over Sanger; however, they have resulted in new challenges for the genome construction process, especially because of the small size of the reads, despite the high degree of coverage. Independent of the program chosen for the construction process, DNA sequences are superimposed, based on identity, to extend the reads, generating contigs; mismatches indicate a lack of homology and are not included. This process improves our confidence in the sequences that are generated. We developed Quality Assessment Software, with which one can review graphs showing the distribution of quality values from the sequencing reads. This software allows us to adopt more stringent quality standards for sequence data, based on quality-graph analysis and estimated coverage after applying the quality filter, providing acceptable sequence coverage for genome construction from short reads. Quality filtering is a fundamental step in the process of constructing genomes, as it reduces the frequency of incorrect alignments that are caused by measuring errors, which can occur during the construction process due to the size of the reads, provoking misassemblies. Application of quality filters to sequence data, using the software Quality Assessment, along with graphing analyses, provided greater precision in the definition of cutoff parameters, which increased the accuracy of genome construction.
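
    The workflow described here (filter reads on quality, then check that the estimated coverage is still acceptable) can be sketched generically. The abstract does not describe the tool's actual filter, so the mean-Phred criterion, the Phred+33 encoding, the default cutoff and all function names below are assumptions made for illustration only.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality_string:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_reads(reads, min_mean_quality=20.0):
    """Keep (sequence, quality) pairs whose mean Phred meets the cutoff."""
    return [(seq, qual) for seq, qual in reads
            if mean_phred(qual) >= min_mean_quality]


def estimated_coverage(reads, genome_size):
    """Estimated coverage: total bases in reads divided by genome size."""
    return sum(len(seq) for seq, _ in reads) / genome_size


if __name__ == "__main__":
    reads = [("ACGTACGT", "IIIIIIII"), ("ACGTACGT", "########")]
    kept = filter_reads(reads, min_mean_quality=20.0)
    print(len(kept), estimated_coverage(kept, genome_size=4))
```

    Raising the cutoff removes more error-prone reads but lowers the estimated coverage, which is the trade-off the quality graphs are meant to help the user judge.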

  9. Dicty_cDB: VHA386 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VH (Link to library) VHA386 (Link to dictyBase) - - - Contig-U11201-1 - (Link to Or...iginal site) - - VHA386Z 730 - - - - Show VHA386 Library VH (Link to library) Clone ID VHA386 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U11201-1 Original site URL http://dictycdb.b...Homology vs DNA Score E Sequences producing significant alignments: (bits) Value N L08646 |L08646.1 Dict...19 1 CX835130 |AF255664_1 major vault protein [Ictalurus punctatus], mRNA sequenc

  10. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB593 (Link to dictyBase) - - - Contig-U02438-1 VFB593E (Link...) Clone ID VFB593 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U02438-1 Ori...0.009 6 AC116986 |AC116986.2 Dictyostelium discoideum chromosome 2 map 2234041-25...sequence. 46 0.031 2 AC115577 |AC115577.2 Dictyostelium discoideum chromosome 2 m...ap 4657875-4914984 strain AX4, complete sequence. 34 0.051 14 AC116960 |AC116960.2 Dictyostelium discoideum

  11. Dicty_cDB: VFB330 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB330 (Link to dictyBase) - G22107 DDB0216429 Contig-U02054-1...ary VF (Link to library) Clone ID VFB330 (Link to dictyBase) Atlas ID - NBRP ID G22107 dictyBase ID DDB02164...29 Link to Contig Contig-U02054-1 | Contig-U16357-1 Original site URL http://dict...d Amino Acid sequence *k*k*NKMKLDPKALRYLSKDDFRTLVAVEMGMKNHELVPVSLICTIANLKYGGTKKSIQ TLHKFKLLFHDGRNYDGYKLTYLGY...LRYLSKDDFRTLVAVEMGMKNHELVPVSLICTIANLKYGGTKKSIQ TLHKFKLLFHDGRNYDGYKLTYLGYDFLALKTLVSRGVCSYVGNQIGVGKESDIYIVAND

  12. Insight into the transcriptome of Arthrobotrys conoides using high throughput sequencing.

    Science.gov (United States)

    Ramesh, Pandit; Reena, Patel; Amitbikram, Mohapatra; Chaitanya, Joshi; Anju, Kunjadia

    2015-12-01

    Arthrobotrys conoides is a nematode-trapping fungus belonging to Orbiliales, Ascomycota group, and traps prey nematodes by means of an adhesive network. The fungus has the potential to be used as a biocontrol agent against plant parasitic nematodes. In the present study, we characterized the transcriptome of A. conoides using high-throughput sequencing technology and characterized its virulence unigenes. A total of 7,255 cDNA contigs with an average length of 425 bp were generated and 6,184 (61.81%) transcripts were functionally annotated and characterized. The majority of unigenes were found analogous to the genes of plant pathogenic fungi. A total of 1,749 transcripts were found to be orthologous with eukaryotic proteins of KOG database. Several carbohydrate active enzymes and peptidases were identified. We also analyzed classically and nonclassically secreted proteins and confirmed by BLASTP against fungal secretome database. A total of 916 contigs were analogous to 556 unique proteins of Pathogen Host Interaction (PHI) database. Further, we identified 91 unigenes homologous to the database of fungal virulence factor (DFVF). A total of 104 putative protein kinases coding transcripts were identified by BLASTP against KinBase database, which are major players in signaling pathways. This study provides a comprehensive look at the transcriptome of A. conoides and the identified unigenes might have a role in catching and killing prey nematodes by A. conoides. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  13. Transcriptome resources for the perennial sunflower Helianthus maximiliani obtained from ecologically divergent populations.

    Science.gov (United States)

    Kawakami, Takeshi; Darby, Brian J; Ungerer, Mark C

    2014-07-01

    Next-generation sequencing (NGS) technologies provide a rapid means to generate genomic resources for species exhibiting interesting ecological and evolutionary variation but for which such resources are scant or nonexistent. In the current report, we utilize 454 pyrosequencing to obtain transcriptome information for multiple individuals and tissue types from geographically disparate and ecologically differentiated populations of the perennial sunflower species Helianthus maximiliani. A total of 850 275 raw reads were obtained averaging 355 bp in length. Reads were assembled, postprocessing, into 16 681 unique contigs with an N50 of 898 bp and a total length of 13.6 Mb. A majority (67%) of these contigs were annotated based on comparison with the Arabidopsis thaliana genome (TAIR10). Contigs were identified that exhibit high similarity to genes associated with natural variation in flowering time and freezing tolerance in other plant species and will facilitate future studies aimed at elucidating the molecular basis of clinal life history variation and adaptive differentiation in H. maximiliani. Large numbers of gene-associated simple sequence repeats (SSRs) and single-nucleotide polymorphisms (SNPs) also were identified that can be deployed in mapping and population genomic analyses. © 2014 John Wiley & Sons Ltd.

  14. Application of genotyping-by-sequencing on semiconductor sequencing platforms: a comparison of genetic and reference-based marker ordering in barley.

    Directory of Open Access Journals (Sweden)

    Martin Mascher

    Full Text Available The rapid development of next-generation sequencing platforms has enabled the use of sequencing for routine genotyping across a range of genetics studies and breeding applications. Genotyping-by-sequencing (GBS), a low-cost, reduced representation sequencing method, is becoming a common approach for whole-genome marker profiling in many species. With quickly developing sequencing technologies, adapting current GBS methodologies to new platforms will leverage these advancements for future studies. To test new semiconductor sequencing platforms for GBS, we genotyped a barley recombinant inbred line (RIL) population. Based on a previous GBS approach, we designed bar code and adapter sets for the Ion Torrent platforms. Four sets of 24-plex libraries were constructed consisting of 94 RILs and the two parents and sequenced on two Ion platforms. In parallel, a 96-plex library of the same RILs was sequenced on the Illumina HiSeq 2000. We applied two different computational pipelines to analyze sequencing data: the reference-independent TASSEL pipeline and a reference-based pipeline using SAMtools. Sequence contigs positioned on the integrated physical and genetic map were used for read mapping and variant calling. We found high agreement in genotype calls between the different platforms and high concordance between genetic and reference-based marker order. There was, however, a paucity in the number of SNPs that were jointly discovered by the different pipelines, indicating a strong effect of alignment and filtering parameters on SNP discovery. We show the utility of the current barley genome assembly as a framework for developing very low-cost genetic maps, facilitating high resolution genetic mapping and negating the need for developing de novo genetic maps for future studies in barley. 
Through demonstration of GBS on semiconductor sequencing platforms, we conclude that the GBS approach is amenable to a range of platforms and can easily be modified as new

  15. Genome sequence of the dark pink pigmented Listia bainesii microsymbiont Methylobacterium sp. WSM2598.

    Science.gov (United States)

    Ardley, Julie; Tian, Rui; Howieson, John; Yates, Ron; Bräu, Lambert; Han, James; Lobos, Elizabeth; Huntemann, Marcel; Chen, Amy; Mavromatis, Konstantinos; Markowitz, Victor; Ivanova, Natalia; Pati, Amrita; Goodwin, Lynne; Woyke, Tanja; Kyrpides, Nikos; Reeve, Wayne

    2014-01-01

    Strains of a pink-pigmented Methylobacterium sp. are effective nitrogen- (N2) fixing microsymbionts of species of the African crotalarioid genus Listia. Strain WSM2598 is an aerobic, motile, Gram-negative, non-spore-forming rod isolated in 2002 from a Listia bainesii root nodule collected at Estcourt Research Station in South Africa. Here we describe the features of Methylobacterium sp. WSM2598, together with information and annotation of a high-quality draft genome sequence. The 7,669,765 bp draft genome is arranged in 5 scaffolds of 83 contigs, contains 7,236 protein-coding genes and 18 RNA-only encoding genes. This rhizobial genome is one of 100 sequenced as part of the DOE Joint Genome Institute 2010 Genomic Encyclopedia for Bacteria and Archaea-Root Nodule Bacteria (GEBA-RNB) project.

  16. High-density rhesus macaque oligonucleotide microarray design using early-stage rhesus genome sequence information and human genome annotations

    Directory of Open Access Journals (Sweden)

    Magness Charles L

    2007-01-01

    Full Text Available Abstract Background Until recently, few genomic reagents specific for non-human primate research have been available. To address this need, we have constructed a macaque-specific high-density oligonucleotide microarray by using highly fragmented low-pass sequence contigs from the rhesus genome project together with the detailed sequence and exon structure of the human genome. Using this method, we designed oligonucleotide probes to over 17,000 distinct rhesus/human gene orthologs and increased by four-fold the number of available genes relative to our first-generation expressed sequence tag (EST)-derived array. Results We constructed a database containing 248,000 exon sequences from 23,000 human RefSeq genes and compared each human exon with its best matching sequence in the January 2005 version of the rhesus genome project list of 486,000 DNA contigs. Best matching rhesus exon sequences for each of the 23,000 human genes were then concatenated in the proper order and orientation to produce a rhesus "virtual transcriptome." Microarray probes were designed, one per gene, to the region closest to the 3' untranslated region (UTR) of each rhesus virtual transcript. Each probe was compared to a composite rhesus/human transcript database to test for cross-hybridization potential, yielding a final probe set representing 18,296 rhesus/human gene orthologs, including transcript variants, and over 17,000 distinct genes. We hybridized mRNA from rhesus brain and spleen to both the EST- and genome-derived microarrays. Besides four-fold greater gene coverage, the genome-derived array also showed greater mean signal intensities for genes present on both arrays. 
Genome-derived probes showed 99.4% identity when compared to 4,767 rhesus GenBank sequence tag site (STS) sequences, indicating that early stage low-pass versions of complex genomes are of sufficient quality to yield valuable functional genomic information when combined with finished genome information from

  17. Molecular adaptation in the world's deepest-living animal: Insights from transcriptome sequencing of the hadal amphipod Hirondellea gigas.

    Science.gov (United States)

    Lan, Yi; Sun, Jin; Tian, Renmao; Bartlett, Douglas H; Li, Runsheng; Wong, Yue Him; Zhang, Weipeng; Qiu, Jian-Wen; Xu, Ting; He, Li-Sheng; Tabata, Harry G; Qian, Pei-Yuan

    2017-07-01

    The Challenger Deep in the Mariana Trench is the deepest point in the oceans of our planet. Understanding how animals adapt to this harsh environment, characterized by high hydrostatic pressure, food limitation, darkness and cold, is of great scientific interest. Of the animals dwelling in the Challenger Deep, amphipods have been captured using baited traps. In this study, we sequenced the transcriptome of the amphipod Hirondellea gigas collected at a depth of 10,929 m from the East Pond of the Challenger Deep. Assembly of these sequences resulted in 133,041 contigs and 22,046 translated proteins. Functional annotation of these contigs was performed using the GO and KEGG databases. Comparison of these translated proteins with those of four shallow-water amphipods revealed 10,731 gene families, of which 5659 were single-copy orthologs. Base substitution analysis on these single-copy orthologs showed that 62 genes are positively selected in H. gigas, including genes related to β-alanine biosynthesis, energy metabolism and genetic information processing. For multiple-copy orthologous genes, gene family expansion analysis revealed that cold-inducible proteins (i.e., transcription factors II A and transcription elongation factor 1) as well as zinc finger domains are expanded in H. gigas. Overall, our results indicate that genetic adaptation to the hadal environment by H. gigas may be mediated by both gene family expansion and amino acid substitutions of specific proteins. © 2017 John Wiley & Sons Ltd.

  18. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    Energy Technology Data Exchange (ETDEWEB)

    Shi, CY; Yang, H; Wei, CL; Yu, O; Zhang, ZZ; Sun, J; Wan, XC

    2011-01-01

    Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Using high-throughput Illumina RNA-seq, the transcriptome from poly(A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. 
Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real

  19. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    Directory of Open Access Journals (Sweden)

    Chen Qi

    2011-02-01

    Full Text Available Abstract Background Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Results Using high-throughput Illumina RNA-seq, the transcriptome from poly(A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. Their expression patterns in different organs of the tea plant were

  20. Transcriptome analysis reveals the time of the fourth round of genome duplication in common carp (Cyprinus carpio)

    Science.gov (United States)

    2012-01-01

    Background Common carp (Cyprinus carpio) is thought to have undergone one extra round of genome duplication compared to zebrafish. Transcriptome analysis has been used to study the existence and timing of genome duplication in species for which genome sequences are incomplete. Large-scale transcriptome data for the common carp genome should help reveal the timing of the additional duplication event. Results We have sequenced the transcriptome of common carp using 454 pyrosequencing. After assembling the 454 contigs and the published common carp sequences together, we obtained 49,669 contigs and identified genes using homology searches and an ab initio method. We identified 4,651 orthologous pairs between common carp and zebrafish and found 129,984 paralogous pairs within the common carp. An estimation of the synonymous substitution rate in the orthologous pairs indicated that common carp and zebrafish diverged 120 million years ago (MYA). We identified one round of genome duplication in common carp and estimated that it had occurred 5.6 to 11.3 MYA. In zebrafish, no genome duplication event after speciation was observed, suggesting that, compared to zebrafish, common carp had undergone an additional genome duplication event. We annotated the common carp contigs with Gene Ontology terms and KEGG pathways. Compared with zebrafish gene annotations, we found that a set of biological processes and pathways were enriched in common carp. Conclusions The assembled contigs helped us to estimate the time of the fourth-round of genome duplication in common carp. The resource that we have built as part of this study will help advance functional genomics and genome annotation studies in the future. PMID:22424280

  1. DeepSimulator: a deep simulator for Nanopore sequencing

    KAUST Repository

    Li, Yu

    2017-12-23

    Motivation: Oxford Nanopore sequencing is a technology that has developed rapidly in recent years. To keep pace with the explosion of downstream data analysis tools, a versatile Nanopore sequencing simulator is needed to complement the experimental data as well as to benchmark those newly developed tools. However, all the currently available simulators are based on simple statistics of the produced reads, which have difficulty in capturing the complex nature of the Nanopore sequencing procedure, the main task of which is the generation of raw electrical current signals. Results: Here we propose a deep learning based simulator, DeepSimulator, to mimic the entire pipeline of Nanopore sequencing. Starting from a given reference genome or assembled contigs, we simulate the electrical current signals by a context-dependent deep learning model, followed by a base-calling procedure to yield simulated reads. This workflow mimics the sequencing procedure more naturally. The thorough experiments performed across four species show that the signals generated by our context-dependent model are more similar to the experimentally obtained signals than the ones generated by the official context-independent pore model. In terms of the simulated reads, we provide a parameter interface to users so that they can obtain reads with different accuracies ranging from 83% to 97%. The reads generated with the default parameters have almost the same properties as the real data. Two case studies demonstrate the application of DeepSimulator to benefit the development of tools in de novo assembly and in low coverage SNP detection. Availability: The software can be accessed freely at: https://github.com/lykaust15/DeepSimulator.

  2. Transcriptome sequence analysis of an ornamental plant, Ananas comosus var. bracteatus, revealed the potential unigenes involved in terpenoid and phenylpropanoid biosynthesis.

    Science.gov (United States)

    Ma, Jun; Kanakala, S; He, Yehua; Zhang, Junli; Zhong, Xiaolan

    2015-01-01

    Ananas comosus var. bracteatus (Red Pineapple) is an important ornamental plant for its colorful leaves and decorative red fruits. Because of its complex genome, it is difficult to understand the molecular mechanisms involved in its growth and development. Thus high-throughput transcriptome sequencing of Ananas comosus var. bracteatus is necessary to generate large quantities of transcript sequences for the purpose of gene discovery and functional genomic studies. The Ananas comosus var. bracteatus transcriptome was sequenced by the Illumina paired-end sequencing technology. We obtained a total of 23.5 million high quality sequencing reads, 1,555,808 contigs and 41,052 unigenes. Of the 41,052 unigenes of Ananas comosus var. bracteatus, 23,275 unigenes were annotated in the NCBI non-redundant protein database and 23,134 unigenes were annotated in the Swiss-Prot database. Out of these, 17,748 and 8,505 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. Functional annotation against the Kyoto Encyclopedia of Genes and Genomes Pathway database identified 5,825 unigenes which were mapped to 117 pathways. The assembly predicted many unigenes that were previously unknown. The annotated unigenes were compared against pineapple, rice, maize, Arabidopsis, and sorghum. Unigenes that did not match any of those five sequence datasets are considered to be unique to Ananas comosus var. bracteatus. We predicted unigenes encoding enzymes involved in terpenoid and phenylpropanoid biosynthesis. The sequence data provide the most comprehensive transcriptomic resource currently available for Ananas comosus var. bracteatus. To our knowledge, this is the first report on the de novo transcriptome sequencing of Ananas comosus var. bracteatus. Unigenes obtained in this study may help improve future gene expression, genetic and genomics studies in Ananas comosus var. bracteatus.

  3. Transcriptome sequence analysis of an ornamental plant, Ananas comosus var. bracteatus, revealed the potential unigenes involved in terpenoid and phenylpropanoid biosynthesis.

    Directory of Open Access Journals (Sweden)

    Jun Ma

    Full Text Available Ananas comosus var. bracteatus (Red Pineapple) is an important ornamental plant for its colorful leaves and decorative red fruits. Because of its complex genome, it is difficult to understand the molecular mechanisms involved in its growth and development. Thus high-throughput transcriptome sequencing of Ananas comosus var. bracteatus is necessary to generate large quantities of transcript sequences for the purpose of gene discovery and functional genomic studies. The Ananas comosus var. bracteatus transcriptome was sequenced by the Illumina paired-end sequencing technology. We obtained a total of 23.5 million high quality sequencing reads, 1,555,808 contigs and 41,052 unigenes. Of the 41,052 unigenes of Ananas comosus var. bracteatus, 23,275 unigenes were annotated in the NCBI non-redundant protein database and 23,134 unigenes were annotated in the Swiss-Prot database. Out of these, 17,748 and 8,505 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. Functional annotation against the Kyoto Encyclopedia of Genes and Genomes Pathway database identified 5,825 unigenes which were mapped to 117 pathways. The assembly predicted many unigenes that were previously unknown. The annotated unigenes were compared against pineapple, rice, maize, Arabidopsis, and sorghum. Unigenes that did not match any of those five sequence datasets are considered to be unique to Ananas comosus var. bracteatus. We predicted unigenes encoding enzymes involved in terpenoid and phenylpropanoid biosynthesis. The sequence data provide the most comprehensive transcriptomic resource currently available for Ananas comosus var. bracteatus. To our knowledge, this is the first report on the de novo transcriptome sequencing of Ananas comosus var. bracteatus. Unigenes obtained in this study may help improve future gene expression, genetic and genomics studies in Ananas comosus var. bracteatus.

  4. Construction of an integrated genetic linkage map for the A genome of Brassica napus using SSR markers derived from sequenced BACs in B. rapa

    Directory of Open Access Journals (Sweden)

    King Graham J

    2010-10-01

    Full Text Available Abstract Background The Multinational Brassica rapa Genome Sequencing Project (BrGSP) has developed valuable genomic resources, including BAC libraries, BAC-end sequences, genetic and physical maps, and seed BAC sequences for Brassica rapa. An integrated linkage map between the amphidiploid B. napus and diploid B. rapa will facilitate the rapid transfer of these valuable resources from B. rapa to B. napus (Oilseed rape, Canola). Results In this study, we identified over 23,000 simple sequence repeats (SSRs) from 536 sequenced BACs. 890 SSR markers (designated as BrGMS) were developed and used for the construction of an integrated linkage map for the A genome in B. rapa and B. napus. Two hundred and nineteen BrGMS markers were integrated into an existing B. napus linkage map (BnaNZDH). Among these mapped BrGMS markers, 168 were distributed only on the A genome linkage groups (LGs), 18 were distributed on both the A and C genome LGs, and 33 were distributed only on the C genome LGs. Most of the A genome LGs in B. napus were collinear with the homoeologous LGs in B. rapa, although minor inversions or rearrangements occurred on A2 and A9. The mapping of these BAC-specific SSR markers enabled assignment of 161 sequenced B. rapa BACs, as well as the associated BAC contigs, to the A genome LGs of B. napus. Conclusion The genetic mapping of SSR markers derived from sequenced BACs in B. rapa enabled direct links to be established between the B. napus linkage map and a B. rapa physical map, and thus the assignment of B. rapa BACs and the associated BAC contigs to the B. napus linkage map. This integrated genetic linkage map will facilitate exploitation of the B. rapa annotated genomic resources for gene tagging and map-based cloning in B. napus, and for comparative analysis of the A genome within Brassica species.

  5. INE: a rice genome database with an integrated map view.

    Science.gov (United States)

    Sakata, K; Antonio, B A; Mukai, Y; Nagasaki, H; Sakai, Y; Makino, K; Sasaki, T

    2000-01-01

    The Rice Genome Research Program (RGP) launched a large-scale rice genome sequencing in 1998 aimed at decoding all genetic information in rice. A new genome database called INE (INtegrated rice genome Explorer) has been developed in order to integrate all the genomic information that has been accumulated so far and to correlate these data with the genome sequence. A web interface based on Java applet provides a rapid viewing capability in the database. The first operational version of the database has been completed which includes a genetic map, a physical map using YAC (Yeast Artificial Chromosome) clones and PAC (P1-derived Artificial Chromosome) contigs. These maps are displayed graphically so that the positional relationships among the mapped markers on each chromosome can be easily resolved. INE incorporates the sequences and annotations of the PAC contig. A site on low quality information ensures that all submitted sequence data comply with the standard for accuracy. As a repository of rice genome sequence, INE will also serve as a common database of all sequence data obtained by collaborating members of the International Rice Genome Sequencing Project (IRGSP). The database can be accessed at http://www.dna.affrc.go.jp:82/giot/INE.html or its mirror site at http://www.staff.or.jp/giot/INE.html

  6. Dicty_cDB: CFE213 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available CF (Link to library) CFE213 (Link to dictyBase) - - - Contig-U16381-1 CFE213F (Link... to Original site) CFE213F 111 - - - - - - Show CFE213 Library CF (Link to library) Clone ID CFE213 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16381-1 Original site URL http://dict... E Sequences producing significant alignments: (bits) Value N AC115685 |AC115685.1 Dict...yostelium discoideum chromosome 2 map 4718821-4752388 strain AX4, complete sequence. 80 9e-24 3 X51892 |X51892.1 Dict

  7. Dicty_cDB: SSD329 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SS (Link to library) SSD329 (Link to dictyBase) - - - Contig-U16581-1 SSD329F (Link... to Original site) SSD329F 444 - - - - - - Show SSD329 Library SS (Link to library) Clone ID SSD329 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16581-1 Original site URL http://dict...A Score E Sequences producing significant alignments: (bits) Value N D16417 |D16417.1 Dictyostelium discoide... DNA sequence. 50 0.027 1 BM028890 |BM028890.1 IpSkn01670 Skin cDNA library Ictal

  8. Dicty_cDB: SSJ546 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SS (Link to library) SSJ546 (Link to dictyBase) - - - Contig-U16581-1 SSJ546F (Link... to Original site) SSJ546F 445 - - - - - - Show SSJ546 Library SS (Link to library) Clone ID SSJ546 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16581-1 Original site URL http://dict...DNA Score E Sequences producing significant alignments: (bits) Value N D16417 |D16417.1 Dictyostelium discoi...2, DNA sequence. 50 0.027 1 BM028890 |BM028890.1 IpSkn01670 Skin cDNA library Ict

  9. Dicty_cDB: SSF689 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SS (Link to library) SSF689 (Link to dictyBase) - - - Contig-U16581-1 SSF689F (Link... to Original site) SSF689F 443 - - - - - - Show SSF689 Library SS (Link to library) Clone ID SSF689 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16581-1 Original site URL http://dict...core E Sequences producing significant alignments: (bits) Value N D16417 |D16417.1 Dictyostelium discoideum ...A sequence. 50 0.027 1 BM028890 |BM028890.1 IpSkn01670 Skin cDNA library Ictaluru

  10. Dicty_cDB: CHD534 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available CH (Link to library) CHD534 (Link to dictyBase) - - - Contig-U15540-1 CHD534E (Link...) Clone ID CHD534 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U15540-1 Ori...nts: (bits) Value N AC114263 |AC114263.2 Dictyostelium discoideum chromosome 2 ma...p 215673-367476 strain AX4, complete sequence. 40 1e-05 6 AC117081 |AC117081.2 Dictyostelium discoideum chro...mosome 2 map 5862124-6045772 strain AX4, complete sequence. 40 2e-05 5 AJ277590 |AJ277590.1 Dictyostelium di

  11. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB752 (Link to dictyBase) - - - Contig-U14717-1 VFB752E (Link...) Clone ID VFB752 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U14717-1 Ori...s: (bits) Value N AC116984 |AC116984.2 Dictyostelium discoideum chromosome 2 map 2567470-3108875 strain AX4,... complete sequence. 1215 0.0 11 AC115594 |AC115594.2 Dictyostelium discoideum chr...omosome 2 map 4071862-4101005 strain AX4, complete sequence. 113 3e-47 8 AC116920 |AC116920.2 Dictyostelium

  12. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB714 (Link to dictyBase) - - - Contig-U12859-1 VFB714F (Link... to Original site) VFB714F 545 - - - - - - Show VFB714 Library VF (Link to library) Clone ID VFB714 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U12859-1 Original site URL http://dict... Homology vs DNA Score E Sequences producing significant alignments: (bits) Value N U36936 |U36936.1 Dict...Score E Sequences producing significant alignments: (bits) Value U36936_1( U36936 |pid:none) Dictyostelium d

  13. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFC124 (Link to dictyBase) - - - Contig-U12017-1 VFC124Z (Link... to Original site) - - VFC124Z 496 - - - - Show VFC124 Library VF (Link to library) Clone ID VFC124 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U12017-1 Original site URL http://dict...e E Sequences producing significant alignments: (bits) Value N AC116957 |AC116957.2 Dictyostelium discoideum... chromosome 2 map 1685067-2090751 strain AX4, complete sequence. 835 0.0 2 U67089 |U67089.1 Dict

  14. Dicty_cDB: VHF145 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VH (Link to library) VHF145 (Link to dictyBase) - - - Contig-U15430-1 VHF145E (Link...) Clone ID VHF145 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U15430-1 Ori...ology vs DNA Score E Sequences producing significant alignments: (bits) Value N AC116984 |AC116984.2 Dictyos... theta DNA for complete sequence of nucleomorph chromosome 2. 48 2e-07 2 ES451909 | PREDICTED: similar to PI...al 16.0 %: nuclear 8.0 %: vacuolar 8.0 %: endoplasmic reticulum 4.0 %: cytoskeletal >> prediction for VHF145

  15. Deep sequencing-based transcriptome analysis of chicken spleen in response to avian pathogenic Escherichia coli (APEC) infection.

    Directory of Open Access Journals (Sweden)

    Qinghua Nie

    Full Text Available Avian pathogenic Escherichia coli (APEC) leads to economic losses in poultry production and is also a threat to human health. The goal of this study was to characterize the chicken spleen transcriptome and to identify candidate genes for response and resistance to APEC infection using Solexa sequencing. We obtained 14422935, 14104324, and 14954692 Solexa read pairs for non-challenged (NC), challenged-mild pathology (MD), and challenged-severe pathology (SV), respectively. A total of 148197 contigs and 98461 unigenes were assembled, of which 134949 contigs and 91890 unigenes match the chicken genome. In total, 12272 annotated unigenes take part in biological processes (11664), cellular components (11927), and molecular functions (11963). Summing three specific contrasts, 13650 significantly differentially expressed unigenes were found in NC vs. MD (6844), NC vs. SV (7764), and MD vs. SV (2320). Some unigenes (e.g. CD148, CD45 and LCK) were involved in crucial pathways, such as the T cell receptor (TCR) signaling pathway and microbial metabolism in diverse environments. This study facilitates understanding of the genetic architecture of the chicken spleen transcriptome, and has identified candidate genes for host response to APEC infection.

  16. Whole genome sequence of the emerging oomycete pathogen Pythium insidiosum strain CDC-B5653 isolated from an infected human in the USA

    Directory of Open Access Journals (Sweden)

    Marina S. Ascunce

    2016-03-01

    Full Text Available Pythium insidiosum ATCC 200269 strain CDC-B5653, an isolate from necrotizing lesions on the mouth and eye of a 2-year-old boy in Memphis, Tennessee, USA, was sequenced using a combination of Illumina MiSeq (300 bp paired-end, 14 million reads) and PacBio (10 kb fragment library, 356,001 reads). The sequencing data were assembled using SPAdes version 3.1.0, yielding a total genome size of 45.6 Mb contained in 8992 contigs, N50 of 13 kb, 57% G + C content, and 17,867 putative protein-coding genes. This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession JRHR00000000. Keywords: Oomycete, Pythium insidiosum, Pythiosis, Human emerging pathogen, Genome sequencing

  17. Metagenomic insights into the rumen microbial fibrolytic enzymes in Indian crossbred cattle fed finger millet straw.

    Science.gov (United States)

    Jose, V Lyju; Appoothy, Thulasi; More, Ravi P; Arun, A Sha

    2017-12-01

    The rumen is a unique natural habitat, exhibiting an unparalleled genetic resource of fibrolytic enzymes of microbial origin that degrade plant polysaccharides. The objectives of this study were to identify the principal plant cell wall-degrading enzymes and the taxonomic profile of the rumen microbial communities associated with them. The cattle rumen microflora and the carbohydrate-active enzymes were functionally classified through a whole metagenomic sequencing approach. Analysis of the assembled sequences by the Carbohydrate-active enzyme analysis Toolkit identified the candidate genes encoding fibrolytic enzymes belonging to different classes of glycoside hydrolases (11,010 contigs), glycosyltransferases (6366 contigs), carbohydrate esterases (4945 contigs), carbohydrate-binding modules (1975 contigs), polysaccharide lyases (480 contigs), and auxiliary activities (115 contigs). Phylogenetic analysis of CAZyme encoding contigs revealed that a significant proportion of CAZymes were contributed by bacteria belonging to genera Prevotella, Bacteroides, Fibrobacter, Clostridium, and Ruminococcus. The results indicated that the cattle rumen microbiome and the CAZymes are highly complex, structurally similar but compositionally distinct from other ruminants. The unique characteristics of rumen microbiota and the enzymes produced by resident microbes provide opportunities to improve the feed conversion efficiency in ruminants and serve as a reservoir of industrially important enzymes for cellulosic biofuel production.

  18. QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects.

    Science.gov (United States)

    Meglécz, Emese; Costedoat, Caroline; Dubut, Vincent; Gilles, André; Malausa, Thibaut; Pech, Nicolas; Martin, Jean-François

    2010-02-01

    QDD is an open access program providing a user-friendly tool for microsatellite detection and primer design from large sets of DNA sequences. The program is designed to deal with all steps of treatment of raw sequences obtained from pyrosequencing of enriched DNA libraries, but it is also applicable to data obtained through other sequencing methods, using FASTA files as input. The following tasks are completed by QDD: tag sorting, adapter/vector removal, elimination of redundant sequences, detection of possible genomic multicopies (duplicated loci or transposable elements), stringent selection of target microsatellites and customizable primer design. It can treat up to one million sequences of a few hundred base pairs in the tag-sorting step, and up to 50,000 sequences in a single input file for the steps involving estimation of sequence similarity. QDD is freely available under the GPL licence for Windows and Linux from the following web site: http://www.univ-provence.fr/gsite/Local/egee/dir/meglecz/QDD.html. Supplementary data are available at Bioinformatics online.

  19. Generation and analysis of expressed sequence tags from Botrytis cinerea

    Directory of Open Access Journals (Sweden)

    EVELYN SILVA

    2006-01-01

    Full Text Available Botrytis cinerea is a filamentous plant pathogen of a wide range of plant species, and its infection may cause enormous damage both during plant growth and in the post-harvest phase. We have constructed a cDNA library from an isolate of B. cinerea and have sequenced 11,482 expressed sequence tags that were assembled into 1,003 contig sequences and 3,032 singletons. Approximately 81% of the unigenes showed significant similarity to genes coding for proteins with known functions: more than 50% of the sequences code for genes involved in cellular metabolism, 12% for transport of metabolites, and approximately 10% for cellular organization. Other functional categories include responses to biotic and abiotic stimuli, cell communication, cell homeostasis, and cell development. We carried out pair-wise comparisons with fungal databases to determine the B. cinerea unisequence set with relevant similarity to genes in other fungal pathogenic counterparts. Among the 4,035 non-redundant B. cinerea unigenes, 1,338 (23%) have significant homology with Fusarium verticillioides unigenes. Similar values were obtained for Saccharomyces cerevisiae and Aspergillus nidulans (22% and 24%, respectively). Lower percentages of homology were found with Magnaporthe grisea and Neurospora crassa (13% and 19%, respectively). Several genes involved in putative and known fungal virulence and general pathogenicity were identified. The results provide important information for future research on this fungal pathogen.

  20. Diversity analysis in Cannabis sativa based on large-scale development of expressed sequence tag-derived simple sequence repeat markers.

    Science.gov (United States)

    Gao, Chunsheng; Xin, Pengfei; Cheng, Chaohua; Tang, Qing; Chen, Ping; Wang, Changbiao; Zang, Gonggu; Zhao, Lining

    2014-01-01

    Cannabis sativa L. is an important economic plant for the production of food, fiber, oils, and intoxicants. However, lack of sufficient simple sequence repeat (SSR) markers has limited the development of cannabis genetic research. Here, large-scale development of expressed sequence tag simple sequence repeat (EST-SSR) markers was performed to obtain more informative genetic markers, and to assess genetic diversity in cannabis (Cannabis sativa L.). Based on the cannabis transcriptome, 4,577 SSRs were identified from 3,624 ESTs. From there, a total of 3,442 complementary primer pairs were designed as SSR markers. Among these markers, trinucleotide repeat motifs (50.99%) were the most abundant, followed by hexanucleotide (25.13%), dinucleotide (16.34%), tetranucleotide (3.8%), and pentanucleotide (3.74%) repeat motifs, respectively. The AAG/CTT trinucleotide repeat (17.96%) was the most abundant motif detected in the SSRs. One hundred and seventeen EST-SSR markers were randomly selected to evaluate primer quality in 24 cannabis varieties. Among these 117 markers, 108 (92.31%) were successfully amplified and 87 (74.36%) were polymorphic. Forty-five polymorphic primer pairs were selected to evaluate genetic diversity and relatedness among the 115 cannabis genotypes. The results showed that 115 varieties could be divided into 4 groups primarily based on geography: Northern China, Europe, Central China, and Southern China. Moreover, the coefficient of similarity when comparing cannabis from Northern China with the European group cannabis was higher than that when comparing with cannabis from the other two groups, owing to a similar climate. This study outlines the first large-scale development of SSR markers for cannabis. These data may serve as a foundation for the development of genetic linkage, quantitative trait loci mapping, and marker-assisted breeding of cannabis.
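
    The motif-class breakdown above (di- through hexanucleotide repeats) comes from scanning sequences for perfect tandem repeats. The sketch below illustrates the idea with a regular expression per motif size. It is not the pipeline used in the study: the names `find_ssrs` and `MIN_REPEATS`, the minimum repeat counts, and the toy sequence are all assumptions made for illustration.

```python
import re
from collections import Counter

MOTIF_CLASS = {2: "dinucleotide", 3: "trinucleotide", 4: "tetranucleotide",
               5: "pentanucleotide", 6: "hexanucleotide"}
# Assumed minimum number of repeat units per motif size (not taken from the abstract).
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit (e.g. ACAC, AAA).
            if any(motif == motif[:d] * (size // d)
                   for d in range(1, size) if size % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


# Hypothetical EST fragment containing an (AAG)6 and an (AC)7 repeat.
toy_est = "TT" + "AAG" * 6 + "TCTA" + "AC" * 7 + "GT"
ssrs = find_ssrs(toy_est)
print(ssrs)
print(Counter(MOTIF_CLASS[len(motif)] for _, motif, _ in ssrs))
```

    Counting hits per motif class over a whole transcriptome would give the kind of percentage table reported in the abstract, although the exact figures depend on the thresholds chosen.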

  1. GarlicESTdb: an online database and mining tool for garlic EST sequences

    Directory of Open Access Journals (Sweden)

    Choi Sang-Haeng

    2009-05-01

    Full Text Available Background: Allium sativum, commonly known as garlic, is a species in the onion genus (Allium), which is a large and diverse one containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary and medicinal purposes and for its health benefits. Currently, interest in garlic is increasing rapidly owing to its nutritional and pharmaceutical value, including for high blood pressure and cholesterol, atherosclerosis and cancer. Nevertheless, there are no comprehensive databases available for Expressed Sequence Tags (EST) of garlic for gene discovery and future efforts of genome annotation. That is why we developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression. Description: GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum) EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in JAVA and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information into MySQL databases, and graphic display of all processed data. A web application was implemented with the latest J2EE (Java 2 Platform Enterprise Edition) software technology (JSP/EJB/JavaServlet) for browsing and querying the database, for creation of dynamic web pages on the client side, and for mapping annotated enzymes to KEGG pathways; the AJAX framework was also used in part. The online resources, such as putative annotation, single nucleotide polymorphisms (SNP) and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To archive more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation

  2. GarlicESTdb: an online database and mining tool for garlic EST sequences.

    Science.gov (United States)

    Kim, Dae-Won; Jung, Tae-Sung; Nam, Seong-Hyeuk; Kwon, Hyuk-Ryul; Kim, Aeri; Chae, Sung-Hwa; Choi, Sang-Haeng; Kim, Dong-Wook; Kim, Ryong Nam; Park, Hong-Seog

    2009-05-18

    Allium sativum, commonly known as garlic, is a species in the onion genus (Allium), which is a large and diverse one containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary and medicinal purposes and for its health benefits. Currently, interest in garlic is increasing rapidly owing to its nutritional and pharmaceutical value, including for high blood pressure and cholesterol, atherosclerosis and cancer. Nevertheless, there are no comprehensive databases available for Expressed Sequence Tags (EST) of garlic for gene discovery and future efforts of genome annotation. That is why we developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression. GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum) EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in JAVA and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information into MySQL databases, and graphic display of all processed data. A web application was implemented with the latest J2EE (Java 2 Platform Enterprise Edition) software technology (JSP/EJB/JavaServlet) for browsing and querying the database, for creation of dynamic web pages on the client side, and for mapping annotated enzymes to KEGG pathways; the AJAX framework was also used in part. The online resources, such as putative annotation, single nucleotide polymorphisms (SNP) and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To archive more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation information for others to view. The Garlic

  3. A Note on Sequence Prediction over Large Alphabets

    Directory of Open Access Journals (Sweden)

    Travis Gagie

    2012-02-01

    Full Text Available Building on results from data compression, we prove nearly tight bounds on how well sequences of length n can be predicted in terms of the size σ of the alphabet and the length k of the context considered when making predictions. We compare the performance achievable by an adaptive predictor with no advance knowledge of the sequence, to the performance achievable by the optimal static predictor using a table listing the frequency of each (k + 1)-tuple in the sequence. We show that, if the elements of the sequence are chosen uniformly at random, then an adaptive predictor can compete in the expected case if k ≤ log_σ n − 3 − ε, for a constant ε > 0, but not if k ≥ log_σ n.
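
    To make the adaptive versus static comparison concrete, the sketch below counts correct guesses for two order-k predictors. The adaptive one learns its context table as it reads the sequence; the static one is handed the full table of (k + 1)-tuple frequencies in advance. Counting correct guesses is an assumption made here for simplicity, since the abstract does not state the loss measure used in the paper, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def adaptive_order_k_hits(seq, k):
    """Correct guesses of a predictor that only knows the symbols seen so far."""
    counts = defaultdict(Counter)
    hits = 0
    for i in range(k, len(seq)):
        context = tuple(seq[i - k:i])
        seen = counts[context]
        if seen:
            # Guess the symbol that has followed this context most often so far.
            guess = seen.most_common(1)[0][0]
            hits += guess == seq[i]
        counts[context][seq[i]] += 1
    return hits


def static_order_k_hits(seq, k):
    """Correct guesses of the best fixed table built from all (k + 1)-tuples."""
    table = defaultdict(Counter)
    for i in range(k, len(seq)):
        table[tuple(seq[i - k:i])][seq[i]] += 1
    # For each context the best fixed guess is its most frequent successor.
    return sum(max(successors.values()) for successors in table.values())


toy_sequence = "abababab"  # hypothetical input
print(adaptive_order_k_hits(toy_sequence, 1), static_order_k_hits(toy_sequence, 1))
```

    The gap between the two counts is the price the adaptive predictor pays for learning each context; it grows with the number of distinct contexts, which is roughly σ^k, and this is the quantity the bound on k controls.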

  4. Massively parallel sequencing and analysis of the Necator americanus transcriptome.

    Directory of Open Access Journals (Sweden)

    Cinzia Cantacessi

    2010-05-01

    Full Text Available The blood-feeding hookworm Necator americanus infects hundreds of millions of people worldwide. In order to elucidate fundamental molecular biological aspects of this hookworm, the transcriptome of the adult stage of Necator americanus was explored using next-generation sequencing and bioinformatic analyses. A total of 19,997 contigs were assembled from the sequence data; 6,771 of these contigs had known orthologues in the free-living nematode Caenorhabditis elegans, and most of them encoded proteins with WD40 repeats (10.6%), proteinase inhibitors (7.8%) or calcium-binding EF-hand proteins (6.7%). Bioinformatic analyses inferred that the C. elegans homologues are involved mainly in biological pathways linked to ribosome biogenesis (70%), oxidative phosphorylation (63%) and/or proteases (60%); most of these molecules were predicted to be involved in more than one biological pathway. Comparative analyses of the transcriptomes of N. americanus and the canine hookworm, Ancylostoma caninum, revealed qualitative and quantitative differences. For instance, proteinase inhibitors were inferred to be highly represented in the former species, whereas SCP/Tpx-1/Ag5/PR-1/Sc7 proteins (= SCP/TAPS or Ancylostoma-secreted proteins) were predominant in the latter. In N. americanus, essential molecules were predicted using a combination of orthology mapping and functional data available for C. elegans. Further analyses allowed the prioritization of 18 predicted drug targets which did not have homologues in the human host. These candidate targets were inferred to be linked to mitochondrial (e.g., processing proteins) or amino acid metabolism (e.g., asparagine t-RNA synthetase). This study has provided detailed insights into the transcriptome of the adult stage of N. americanus and examines similarities and differences between this species and A. caninum. Future efforts should focus on comparative transcriptomic and proteomic investigations of the other predominant human

  5. De Novo Assembly of the Pea (Pisum sativum L.) Nodule Transcriptome

    Directory of Open Access Journals (Sweden)

    Vladimir A. Zhukov

    2015-01-01

    Full Text Available The large size and complexity of the garden pea (Pisum sativum L.) genome hamper its sequencing and the discovery of pea gene resources. Although transcriptome sequencing provides extensive information about expressed genes, some tissue-specific transcripts can only be identified from particular organs under appropriate conditions. In this study, we performed RNA sequencing of polyadenylated transcripts from young pea nodules and root tips on an Illumina GAIIx system, followed by de novo transcriptome assembly using the Trinity program. We obtained more than 58,000 and 37,000 contigs from “Nodules” and “Root Tips” assemblies, respectively. The quality of the assemblies was assessed by comparison with pea expressed sequence tags and transcriptome sequencing project data available from the NCBI website. The “Nodules” assembly was compared with the “Root Tips” assembly and with pea transcriptome sequencing data from projects indicating tissue specificity. As a result, approximately 13,000 nodule-specific contigs were found and annotated by alignment to known plant protein-coding sequences and by Gene Ontology searching. Of these, 581 sequences were found to possess full CDSs and could thus be considered as novel nodule-specific transcripts of pea. The information about pea nodule-specific gene sequences can be applied to the creation of gene-based markers, polymorphism studies, and real-time PCR.

  6. Analysis of expressed sequence tags from Prunus mume flower and fruit and development of simple sequence repeat markers

    Directory of Open Access Journals (Sweden)

    Gao Zhihong

    2010-07-01

    Full Text Available Background: Expressed Sequence Tag (EST) sequencing has been a cost-effective tool in molecular biology and represents an abundant, valuable resource for genome annotation, gene expression, and comparative genomics in plants. Results: In this study, we constructed a cDNA library of Prunus mume flower and fruit, sequenced 10,123 clones of the library, and obtained 8,656 high-quality expressed sequence tag (EST) sequences. The ESTs were assembled into 4,473 unigenes composed of 1,492 contigs and 2,981 singletons, which have been deposited in NCBI (accession IDs: GW868575 - GW873047), among which 1,294 unique ESTs had known or putative functions. Furthermore, we found 1,233 putative simple sequence repeats (SSRs) in the P. mume unigene dataset. We randomly tested 42 pairs of PCR primers flanking potential SSRs, and 14 pairs were identified as true-to-type SSR loci and could amplify polymorphic bands from 20 individual plants of P. mume. We further used the 14 EST-SSR primer pairs to test the transferability on peach and plum. The result showed that nearly 89% of the primer pairs produced target PCR bands in the two species. A high level of marker polymorphism was observed in the plum species (65%) and a low level in the peach (46%), and the clustering analysis of the three species indicated that these SSR markers were useful in the evaluation of genetic relationships and diversity between and within the Prunus species. Conclusions: We have constructed the first cDNA library of P. mume flower and fruit, and our data provide sets of molecular biology resources for P. mume and other Prunus species. These resources will be useful for further study such as genome annotation, new gene discovery, gene functional analysis, molecular breeding, evolution and comparative genomics between Prunus species.

  7. Abiotic Stress-Related Expressed Sequence Tags from the Diploid Strawberry Fragaria vesca f. semperflorens

    Directory of Open Access Journals (Sweden)

    Maximo Rivarola

    2011-03-01

    Full Text Available Strawberry (Fragaria spp.) is a eudicotyledonous plant that belongs to the Rosaceae family, which includes other agronomically important plants such as raspberry ( L.) and several tree-fruit species. Despite the vital role played by cultivated strawberry in agriculture, few stress-related gene expression characterizations of this crop are available. To increase the diversity of available transcriptome sequence, we produced 41,430 Fragaria vesca L. expressed sequence tags (ESTs) from plants growing under water-, temperature-, and osmotic-stress conditions as well as a combination of heat and osmotic stresses that is often found in irrigated fields. Clustering and assembling of the ESTs resulted in a total of 11,836 contigs and singletons that were annotated using Gene Ontology (GO) terms. Furthermore, over 1200 sequences with no match to available Rosaceae ESTs were found, including six that were assigned the “response to stress” GO category. Analysis of EST frequency provided an estimate of steady state transcript levels, with 91 sequences exhibiting at least a 20-fold difference between treatments. This EST collection represents a useful resource to advance our understanding of the abiotic stress-response mechanisms in strawberry. The sequence information may be translated to valuable tree crops in the Rosaceae family, where whole-plant treatments are not as simple or practical.

  8. Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms

    Directory of Open Access Journals (Sweden)

    Haznedaroglu Berat Z

    2012-07-01

    Full Text Available Background: The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, the success of assembly strategies does not consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly. Results: Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented. Conclusions: This study demonstrated that different k-mer choices result in various quantities
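
    Why the choice of k changes an assembly can be seen on a very small de Bruijn graph. In the sketch below, nodes are (k − 1)-mers and each k-mer in a read contributes one edge; a node with more than one successor is a branch point where an assembler must break or resolve a contig. The function names and the toy read are assumptions for illustration and are unrelated to the three software packages evaluated in the study.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one outgoing edge, i.e. ambiguous extension points."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


toy_reads = ["ACGTACGA"]  # hypothetical read containing the repeat ACG
for k in (3, 5):
    print(k, branching_nodes(de_bruijn_edges(toy_reads, k)))
```

    In this toy case the short k leaves the repeat ambiguous while the longer k spans it. With real reads a longer k also demands deeper coverage, which is one reason assemblies at different k recover different contigs.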

  9. The first generation of a BAC-based physical map of Brassica rapa

    Directory of Open Access Journals (Sweden)

    Lee Soo

    2008-06-01

    Full Text Available Background: The genus Brassica includes the most extensively cultivated vegetable crops worldwide. Investigation of the Brassica genome presents excellent challenges to study plant genome evolution and divergence of gene function associated with polyploidy and genome hybridization. A physical map of the B. rapa genome is a fundamental tool for analysis of Brassica "A" genome structure. Integration of a physical map with an existing genetic map by linking genetic markers and BAC clones in the sequencing pipeline provides a crucial resource for the ongoing genome sequencing effort and assembly of whole genome sequences. Results: A genome-wide physical map of the B. rapa genome was constructed by the capillary electrophoresis-based fingerprinting of 67,468 Bacterial Artificial Chromosome (BAC) clones using the five restriction enzyme SNaPshot technique. The clones were assembled into contigs by means of FPC v8.5.3. After contig validation and manual editing, the resulting contig assembly consists of 1,428 contigs and is estimated to span 717 Mb in physical length. This map provides 242 anchored contigs on 10 linkage groups to serve as seed points from which to continue bidirectional chromosome extension for genome sequencing. Conclusion: The map reported here is the first physical map for the Brassica "A" genome based on the High Information Content Fingerprinting (HICF) technique. This physical map will serve as a fundamental genomic resource for accelerating genome sequencing, assembly of BAC sequences, and comparative genomics between Brassica genomes. The current build of the B. rapa physical map is available at the B. rapa Genome Project website for the user community.

  10. Large deviation estimates for exceedance times of perpetuity sequences and their dual processes

    DEFF Research Database (Denmark)

    Buraczewski, Dariusz; Collamore, Jeffrey F.; Damek, Ewa

    2016-01-01

    In a variety of problems in pure and applied probability, it is relevant to study the large exceedance probabilities of the perpetuity sequence $Y_n := B_1 + A_1 B_2 + \cdots + (A_1 \cdots A_{n-1}) B_n$, where $(A_i,B_i) \subset (0,\infty) \times \mathbb{R}$. Estimates for the stationary tail distribution of $\{ Y_n \}$ have been developed in the seminal papers of Kesten (1973) and Goldie (1991). Specifically, it is well-known that if $M := \sup_n Y_n$, then ${\mathbb P} \left\{ M > u \right\} \sim {\cal C}_M u^{-\xi}$ as $u \to \infty$. While much attention has been focused on extending... ...-time exceedance probabilities of $\{ M_n^\ast \}$, yielding a new result concerning the convergence of $\{ M_n^\ast \}$ to its stationary distribution.
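
    The perpetuity sequence defined above can be generated with a single pass that carries the running product $A_1 \cdots A_{n-1}$. The sketch below does this and tracks the running maximum, whose limit is $M = \sup_n Y_n$. The function name and the choice of a log-normal law for $A_i$ and a Gaussian law for $B_i$ are assumptions made only to have something to simulate; the abstract does not specify any distribution.

```python
import random


def perpetuity_path(pairs):
    """Return [Y_1, ..., Y_n] for Y_n = B_1 + A_1 B_2 + ... + (A_1 ... A_{n-1}) B_n."""
    path = []
    total = 0.0
    product = 1.0  # running product A_1 * ... * A_{n-1}; empty product for n = 1
    for a, b in pairs:
        total += product * b
        path.append(total)
        product *= a
    return path


# Deterministic check: A_i = 1/2, B_i = 1 gives the partial sums of a geometric series.
print(perpetuity_path([(0.5, 1.0)] * 4))

# Hypothetical random input: A_i log-normal with negative log-mean, B_i standard normal.
rng = random.Random(0)
pairs = [(rng.lognormvariate(-0.5, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]
path = perpetuity_path(pairs)
print(max(path))  # finite-horizon approximation of M = sup_n Y_n
```

    Estimating a tail probability such as ${\mathbb P}\{M > u\}$ would require repeating the simulation over many independent paths and counting exceedances of u.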

  11. Discovery and functional prioritization of Parkinson's disease candidate genes from large-scale whole exome sequencing

    NARCIS (Netherlands)

    I. Jansen (Iris); Ye, H. (Hui); Heetveld, S. (Sasja); Lechler, M.C. (Marie C.); Michels, H. (Helen); Seinstra, R.I. (Renée I.); Lubbe, S.J. (Steven J.); Drouet, V. (Valérie); S. Lesage (Suzanne); E. Majounie (Elisa); Gibbs, J.R. (J.Raphael); M.A. Nalls (Michael); M. Ryten (Mina); Botia, J.A. (Juan A.); J. Vandrovcova (Jana); J. Simón-Sánchez (Javier); Castillo-Lizardo, M. (Melissa); P. Rizzu (Patrizia); Blauwendraat, C. (Cornelis); Chouhan, A.K. (Amit K.); Li, Y. (Yarong); Yogi, P. (Puja); N. Amin (Najaf); C.M. van Duijn (Cornelia); Morris, H.R. (Huw R.); Brice, A. (Alexis); A. Singleton (Andrew); David, D.C. (Della C.); Nollen, E.A. (Ellen A.); A. Jain (Ashok); J.M. Shulman; P. Heutink (Peter); D.G. Hernandez (Dena); S. Arepalli (Sampath); J. Brooks (Janet); Price, R. (Ryan); Nicolas, A. (Aude); S. Chong (Sean); M.R. Cookson (Mark); A. Dillman (Allissa); M. Moore (Matt); B.J. Traynor (Bryan); A. Singleton (Andrew); V. Plagnol (Vincent); Nicholas W Wood,; U.-M. Sheerin (Una-Marie); Jose M Bras,; K. Charlesworth (Kate); M. Gardner (Mac); R. Guerreiro (Rita); D. Trabzuni (Danyah); Hardy, J. (John); M. Sharma; M. Saad (Mohamad); Javier Simón-Sánchez,; C. Schulte (Claudia); J.C. Corvol (Jean-Christophe); Dürr, A. (Alexandra); M. Vidailhet (M.); S. Sveinbjörnsdóttir (Sigurlaug); R.A. Barker (Roger); Caroline H Williams-Gray,; Y. Ben-Shlomo; H.W. Berendse (Henk W.); K.D. van Dijk (Karin); D. Berg (Daniela); K. Brockmann; K.D. Wurster (Kathrin); Mätzler, W. (Walter); Gasser, T. (Thomas); M. Martinez (Maria); R.M.A. de Bie (Rob); A. Biffi (Alessandro); D. Velseboer (Daan); B.R. Bloem (Bastiaan); B. Post (Bart); M. Wickremaratchi (Mirdhu); B. van de Warrenburg (Bart); Z. Bochdanovits (Zoltan); M. von Bonin (Malte); H. Pétursson (Hjörvar); O. Riess (Olaf); D.J. Burn (David); Lubbe, S. (Steven); Cooper, J.M. (J Mark); N.H. McNeill (Nathan); Schapira, A. (Anthony); Lungu, C. (Codrin); Chen, H. (Honglei); Dong, J. (Jing); Chinnery, P.F. (Patrick F.); G. 
Hudson (Gavin); Clarke, C.E. (Carl E.); C. Moorby (Catriona); C. Counsell (Carl); P. Damier (Philippe); J.-F. Dartigues; P. Deloukas (Panagiotis); E. Gray (Emma); T. Edkins (Ted); Hunt, S.E. (Sarah E.); S.C. Potter (Simon); A. Tashakkori-Ghanbaria (Avazeh); G. Deuschl (Günther); D. Lorenz (Delia); D.T. Dexter (David); F. Durif (Frank); J. Evans (Jonathan Mark); Langford, C. (Cordelia); T. Foltynie (Thomas); A.M. Goate (Alison); C. Harris (Clare); J.J. van Hilten (Jacobus); A. Hofman (Albert); J.R. Hollenbeck (John R.); J.L. Holton (Janice); Hu, M. (Michele); X. Huang (Xiaohong); Illig, T. (Thomas); P.V. Jónsson (Pálmi); J.-C. Lambert; S.S. O'Sullivan (Sean); T. Revesz (Tamas); K. Shaw (Karen); A.J. Lees (Andrew); P. Lichtner (Peter); P. Limousin (Patricia); G. Lopez; Escott-Price, V. (Valentina); J. Pearson (Justin); N. Williams (Nigel); E. Mudanohwo (Ese); J.S. Perlmutter (Joel); Pollak, P. (Pierre); F. Rivadeneira Ramirez (Fernando); A.G. Uitterlinden (André); S.J. Sawcer (Stephen); H. Scheffer (Hans); I. Shoulson (Ira); L. Shulman (Lee); Smith, C. (Colin); R. Walker (Robert); C.C.A. Spencer (Chris C.); A. Strange (Amy); H. Stefansson (Hreinn); F. Bettella (Francesco); J-A. Zwart (John-Anker); Stockton, J.D. (Joanna D.); D. Talbot; C.M. Tanner (Carlie); F. Tison (François); S. Winder-Rhodes (Sophie); K.P. Bhatia (Kailash)

    2017-01-01

    Background: Whole-exome sequencing (WES) has been successful in identifying genes that cause familial Parkinson's disease (PD). However, until now this approach has not been deployed to study large cohorts of unrelated participants. To discover rare PD susceptibility variants, we

  12. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB595 (Link to dictyBase) - - - Contig-U09552-1 VFB595E (Link...) Clone ID VFB595 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U09552-1 Ori...KLLKSDNWISTCQNLIQEYEPQ IIAVVEGFMAPSELCQKIKFCSSSSSTNDFDFIGSSTTDCEICTFISGYAENFLEENKTL EDIIKVVDDFCKILPAAYKTDCVA...A: VEGSGECLVCEFISEKIVTYLEANQTETQILQYLDNDCKLLKSDNWISTCQNLIQEYEPQ IIAVVEGFMAPSELCQKIKFCSSSSSTNDFDFIGSSTTDCEICT... Sequences producing significant alignments: (bits) Value N U66367 |U66367.1 Dictyostelium discoideum SapA (

  13. Dicty_cDB: SSE149 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SS (Link to library) SSE149 (Link to dictyBase) - - - Contig-U01658-1 | Contig-U165... library) Clone ID SSE149 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U016...58-1 | Contig-U16509-1 Original site URL http://dictycdb.biol.tsukuba.ac.jp/CSM/S...g significant alignments: (bits) Value N D16417 |D16417.1 Dictyostelium discoideum mRNA. 64 1e-17 2 AC100496...egans clone T13G4, complete sequence. 36 5.0 2 BM028890 |BM028890.1 IpSkn01670 Skin cDNA library Ictalurus p

  14. Rapid development of microsatellite markers for the endangered fish Schizothorax biddulphi (Günther) using next generation sequencing and cross-species amplification.

    Science.gov (United States)

    Luo, Wei; Nie, Zhulan; Zhan, Fanbin; Wei, Jie; Wang, Weimin; Gao, Zexia

    2012-11-14

    Tarim schizothoracin (Schizothorax biddulphi) is an endemic fish species native to the Tarim River system of Xinjiang and has been classified as an extremely endangered freshwater fish species in China. Here, for the first time, we used a next generation sequencing platform (Ion Torrent PGM™) to obtain a large number of microsatellites for S. biddulphi. A total of 40577 contigs were assembled, which contained 1379 SSRs. Among these SSRs, dinucleotide repeats were the most frequent (77.08%), and AC repeats were the most frequently occurring microsatellite, followed by AG, AAT and AT. Fifty loci were randomly selected for primer development; of these, 38 loci were successfully amplified and 29 loci were polymorphic across panels of 30 individuals. The H(o) ranged from 0.15 to 0.83, and H(e) ranged from 0.15 to 0.85, with 3.5 alleles per locus on average. Cross-species testing indicated that 20 of these markers were successfully amplified in a related, also endangered, fish species, S. irregularis. This study suggests that PGM™ sequencing is a rapid and cost-effective tool for developing microsatellite markers for non-model species, and the microsatellite markers developed in this study should be useful in Schizothorax genetic analysis.
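
    The H(o) and H(e) values above are per-locus heterozygosities. Under the usual definitions, observed heterozygosity is the fraction of individuals carrying two different alleles, and expected heterozygosity is one minus the sum of squared allele frequencies. The sketch below computes both for one locus. It assumes diploid genotypes and the simple (uncorrected) estimator; the abstract does not say which estimator the authors used, and the genotypes shown are invented.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (H_o, H_e) for one locus given diploid genotypes as (allele, allele) pairs."""
    n = len(genotypes)
    if n == 0:
        raise ValueError("need at least one genotype")
    # Observed: share of individuals whose two alleles differ.
    h_obs = sum(a != b for a, b in genotypes) / n
    # Expected: 1 - sum of squared allele frequencies (no small-sample correction).
    allele_counts = Counter(allele for genotype in genotypes for allele in genotype)
    total_alleles = 2 * n
    h_exp = 1.0 - sum((count / total_alleles) ** 2 for count in allele_counts.values())
    return h_obs, h_exp


# Hypothetical genotypes at one microsatellite locus (allele sizes in bp).
toy_locus = [(152, 152), (152, 156), (156, 156), (152, 156)]
print(heterozygosity(toy_locus))
```

    Averaging the per-locus values across all polymorphic loci gives panel-level summaries comparable to the ranges quoted in the abstract.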

  15. A gene-based high-resolution comparative radiation hybrid map as a framework for genome sequence assembly of a bovine chromosome 6 region associated with QTL for growth, body composition, and milk performance traits

    Directory of Open Access Journals (Sweden)

    Laurent Pascal

    2006-03-01

    Full Text Available Background: A number of different quantitative trait loci (QTL) for various phenotypic traits, including milk production, functional, and conformation traits in dairy cattle as well as growth and body composition traits in meat cattle, have been mapped consistently in the middle region of bovine chromosome 6 (BTA6). Dense genetic and physical maps and, ultimately, a fully annotated genome sequence as well as their mutual connections are required to efficiently identify genes and gene variants responsible for genetic variation of phenotypic traits. A comprehensive high-resolution gene-rich map linking densely spaced bovine markers and genes to the annotated human genome sequence is required as a framework to facilitate this approach for the region on BTA6 carrying the QTL. Results: Therefore, we constructed a high-resolution radiation hybrid (RH) map for the QTL-containing chromosomal region of BTA6. This new RH map with a total of 234 loci including 115 genes and ESTs displays a substantial increase in loci density compared to existing physical BTA6 maps. Screening the available bovine genome sequence resources, a total of 73 loci could be assigned to sequence contigs, which were already identified as specific for BTA6. For 43 loci, corresponding sequence contigs, which were not yet placed on the bovine genome assembly, were identified. In addition, the improved potential of this high-resolution RH map for BTA6 with respect to comparative mapping was demonstrated. Mapping a large number of genes on BTA6 and cross-referencing them with map locations in corresponding syntenic multi-species chromosome segments (human, mouse, rat, dog, chicken) achieved a refined accurate alignment of conserved segments and evolutionary breakpoints across the species included. Conclusion: The gene-anchored high-resolution RH map (1 locus/300 kb) for the targeted region of BTA6 presented here will provide a valuable platform to guide high-quality assembling and

  16. Associative anaphora: metonymic contiguity (L’Anaphore associative: contigüité métonymique)

    Directory of Open Access Journals (Sweden)

    Gemma Peña Martínez

    2008-04-01

    Full Text Available This article examines the contiguity relations required for resolving associative anaphora. These are generally metonymic relations, since the various relations, notably sociocultural in nature, between the referent and the anaphoric marker converge within a single conceptual frame, echoing elements or characteristics of the same cognitive domain. Because associative anaphora thus takes up a concrete attribute of the referent, we propose a classification of these anaphoric markers according to metonymic relations such as part to whole, object to material, and characteristic or property to object.

  17. Dicty_cDB: SSD571 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SS (Link to library) SSD571 (Link to dictyBase) - - - Contig-U16581-1 SSD571Z (Link... to Original site) - - SSD571Z 415 - - - - Show SSD571 Library SS (Link to library) Clone ID SSD571 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16581-1 Original site URL http://dict...omology vs DNA Score E Sequences producing significant alignments: (bits) Value N D16417 |D16417.1 Dict...Brassica oleracea genomic clone BONRK12, DNA sequence. 50 0.025 1 BM028890 |BM028890.1 IpSkn01670 Skin cDNA library Ict

  18. Dicty_cDB: SSK129 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SS (Link to library) SSK129 (Link to dictyBase) - - - Contig-U16021-1 SSK129Z (Link... to Original site) - - SSK129Z 372 - - - - Show SSK129 Library SS (Link to library) Clone ID SSK129 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16021-1 Original site URL http://dict...Homology vs DNA Score E Sequences producing significant alignments: (bits) Value N AC116957 |AC116957.2 Dict...419632 |CK419632.1 AUF_IpOva_21_i24 Ovary cDNA library Ictalurus punctatus cDNA 5', mRNA sequence. 36 0.54 2

  19. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB895 (Link to dictyBase) - - - Contig-U10164-1 VFB895P (Link... to Original site) VFB895F 578 VFB895Z 699 VFB895P 1277 - - Show VFB895 Library VF (Link to library) Clone ID VFB895 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U10164-1 Original site URL http://dict...ore E Sequences producing significant alignments: (bits) Value N AC115604 |AC115604.2 Dictyostelium discoide...um chromosome 2 map 4354771-4414991 strain AX4, complete sequence. 42 5e-06 9 M18106 |M18106.1 Dictyostelium

  20. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFC265 (Link to dictyBase) - - - Contig-U16459-1 VFC265Z (Link... to Original site) - - VFC265Z 278 - - - - Show VFC265 Library VF (Link to library) Clone ID VFC265 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16459-1 Original site URL http://dict...ology vs DNA Score E Sequences producing significant alignments: (bits) Value N AC123513 |AC123513.1 Dictyos...telium discoideum chromosome 2 map 2779865-2840915 strain AX4, *** SEQUENCING IN PROGRESS ***. 159 9e-77 4 AC117070 |AC117070.2 Dict

  1. A first generation BAC-based physical map of the rainbow trout genome

    Directory of Open Access Journals (Sweden)

    Thorgaard Gary H

    2009-10-01

    Full Text Available Abstract Background Rainbow trout (Oncorhynchus mykiss) are the most widely cultivated cold freshwater fish in the world and an important model species for many research areas. Coupling great interest in this species as a research model with the need for genetic improvement of aquaculture production efficiency traits justifies the continued development of genomics research resources. Many quantitative trait loci (QTL) have been identified for production and life-history traits in rainbow trout. A bacterial artificial chromosome (BAC) physical map is needed to facilitate fine mapping of QTL and the selection of positional candidate genes for incorporation in marker-assisted selection (MAS) for improving rainbow trout aquaculture production. This resource will also facilitate efforts to obtain and assemble a whole-genome reference sequence for this species. Results The physical map was constructed from DNA fingerprinting of 192,096 BAC clones using the 4-color high-information content fingerprinting (HICF) method. The clones were assembled into physical map contigs using the fingerprinting contig (FPC) program. The map is composed of 4,173 contigs and 9,379 singletons. The total number of unique fingerprinting fragments (consensus bands) in contigs is 1,185,157, which corresponds to an estimated physical length of 2.0 Gb. The map assembly was validated by (1) comparison with probe hybridization results and agarose gel fingerprinting contigs; and (2) anchoring large contigs to the microsatellite-based genetic linkage map. Conclusion The production and validation of the first BAC physical map of the rainbow trout genome is described in this paper. We are currently integrating this map with the NCCCWA genetic map using more than 200 microsatellites isolated from BAC end sequences and by identifying BACs that harbor more than 300 previously mapped markers. The availability of an integrated physical and genetic map will enable detailed comparative genome

  2. Sequence assembly

    DEFF Research Database (Denmark)

    Scheibye-Alsing, Karsten; Hoffmann, S.; Frankel, Annett Maria

    2009-01-01

    Despite the rapidly increasing number of sequenced and re-sequenced genomes, many issues regarding the computational assembly of large-scale sequencing data have remained unresolved. Computational assembly is crucial in large genome projects as well as for the evolving high-throughput technologies and...... in genomic DNA, highly expressed genes and alternative transcripts in EST sequences. We summarize existing comparisons of different assemblers and provide detailed descriptions and directions for download of assembly programs at: http://genome.ku.dk/resources/assembly/methods.html....

  3. Discovery of candidate disease genes in ENU-induced mouse mutants by large-scale sequencing, including a splice-site mutation in nucleoredoxin.

    Directory of Open Access Journals (Sweden)

    Melissa K Boles

    2009-12-01

    Full Text Available An accurate and precisely annotated genome assembly is a fundamental requirement for functional genomic analysis. Here, the complete DNA sequence and gene annotation of mouse Chromosome 11 was used to test the efficacy of large-scale sequencing for mutation identification. We re-sequenced the 14,000 annotated exons and boundaries from over 900 genes in 41 recessive mutant mouse lines that were isolated in an N-ethyl-N-nitrosourea (ENU) mutation screen targeted to mouse Chromosome 11. Fifty-nine sequence variants were identified in 55 genes from 31 mutant lines. 39% of the lesions lie in coding sequences and create primarily missense mutations. The other 61% lie in noncoding regions, many of them in highly conserved sequences. A lesion in the perinatal lethal line l11Jus13 alters a consensus splice site of nucleoredoxin (Nxn), inserting 10 amino acids into the resulting protein. We conclude that point mutations can be accurately and sensitively recovered by large-scale sequencing, and that conserved noncoding regions should be included for disease mutation identification. Only seven of the candidate genes we report have been previously targeted by mutation in mice or rats, showing that despite ongoing efforts to functionally annotate genes in the mammalian genome, an enormous gap remains between phenotype and function. Our data show that the classical positional mapping approach of disease mutation identification can be extended to large target regions using high-throughput sequencing.

  4. Features of the organization of bread wheat chromosome 5BS based on physical mapping.

    Science.gov (United States)

    Salina, Elena A; Nesterov, Mikhail A; Frenkel, Zeev; Kiseleva, Antonina A; Timonova, Ekaterina M; Magni, Federica; Vrána, Jan; Šafář, Jan; Šimková, Hana; Doležel, Jaroslav; Korol, Abraham; Sergeeva, Ekaterina M

    2018-02-09

    The IWGSC strategy for construction of the reference sequence of the bread wheat genome is based on first obtaining physical maps of the individual chromosomes. Our aim is to develop and use the physical map for analysis of the organization of the short arm of wheat chromosome 5B (5BS) which bears a number of agronomically important genes, including genes conferring resistance to fungal diseases. A physical map of the 5BS arm (290 Mbp) was constructed using restriction fingerprinting and LTC software for contig assembly of 43,776 BAC clones. The resulting physical map covered ~ 99% of the 5BS chromosome arm (111 scaffolds, N50 = 3.078 Mb). SSR, ISBP and zipper markers were employed for anchoring the BAC clones, and from these 722 novel markers were developed based on previously obtained data from partial sequencing of 5BS. The markers were mapped using a set of Chinese Spring (CS) deletion lines, and F2 and RICL populations from a cross of CS and CS-5B dicoccoides. Three approaches have been used for anchoring BAC contigs on the 5BS chromosome, including clone-by-clone screening of BACs, GenomeZipper analysis, and comparison of BAC-fingerprints with in silico fingerprinting of 5B pseudomolecules of T. dicoccoides. These approaches allowed us to reach a high level of BAC contig anchoring: 96% of 5BS BAC contigs were located on 5BS. An interesting pattern was revealed in the distribution of contigs along the chromosome. Short contigs (200-999 kb) containing markers for the regions interrupted by tandem repeats, were mainly localized to the 5BS subtelomeric block; whereas the distribution of larger 1000-3500 kb contigs along the chromosome better correlated with the distribution of the regions syntenic to rice, Brachypodium, and sorghum, as detected by the Zipper approach. The high fingerprinting quality, LTC software and large number of BAC clones selected by the informative markers in screening of the 43,776 clones allowed us to significantly increase the

  5. XLID-causing mutations and associated genes challenged in light of data from large-scale human exome sequencing.

    Science.gov (United States)

    Piton, Amélie; Redin, Claire; Mandel, Jean-Louis

    2013-08-08

    Because of the unbalanced sex ratio (1.3-1.4 to 1) observed in intellectual disability (ID) and the identification of large ID-affected families showing X-linked segregation, much attention has been focused on the genetics of X-linked ID (XLID). Mutations causing monogenic XLID have now been reported in over 100 genes, most of which are commonly included in XLID diagnostic gene panels. Nonetheless, the boundary between true mutations and rare non-disease-causing variants often remains elusive. The sequencing of a large number of control X chromosomes, required for avoiding false-positive results, was not systematically possible in the past. Such information is now available thanks to large-scale sequencing projects such as the National Heart, Lung, and Blood (NHLBI) Exome Sequencing Project, which provides variation information on 10,563 X chromosomes from the general population. We used this NHLBI cohort to systematically reassess the implication of 106 genes proposed to be involved in monogenic forms of XLID. We particularly question the implication in XLID of ten of them (AGTR2, MAGT1, ZNF674, SRPX2, ATP6AP2, ARHGEF6, NXF5, ZCCHC12, ZNF41, and ZNF81), in which truncating variants or previously published mutations are observed at a relatively high frequency within this cohort. We also highlight 15 other genes (CCDC22, CLIC2, CNKSR2, FRMPD4, HCFC1, IGBP1, KIAA2022, KLF8, MAOA, NAA10, NLGN3, RPL10, SHROOM4, ZDHHC15, and ZNF261) for which replication studies are warranted. We propose that similar reassessment of reported mutations (and genes) with the use of data from large-scale human exome sequencing would be relevant for a wide range of other genetic diseases. Copyright © 2013 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

  6. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis.

    Directory of Open Access Journals (Sweden)

    Rami A Dalloul

    2010-09-01

    Full Text Available A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.

  7. Discovery of precursor and mature microRNAs and their putative gene targets using high-throughput sequencing in pineapple (Ananas comosus var. comosus).

    Science.gov (United States)

    Yusuf, Noor Hydayaty Md; Ong, Wen Dee; Redwan, Raimi Mohamed; Latip, Mariam Abd; Kumar, S Vijay

    2015-10-15

    MicroRNAs (miRNAs) are a class of small, endogenous non-coding RNAs that negatively regulate gene expression, resulting in the silencing of target mRNA transcripts through mRNA cleavage or translational inhibition. MiRNAs play significant roles in various biological and physiological processes in plants. However, the miRNA-mediated gene regulatory network in pineapple, the model tropical non-climacteric fruit, remains largely unexplored. Here, we report a complete list of pineapple mature miRNAs obtained from high-throughput small RNA sequencing and precursor miRNAs (pre-miRNAs) obtained from ESTs. Two small RNA libraries were constructed from pineapple fruits and leaves, respectively, using Illumina's Solexa technology. Sequence similarity analysis using miRBase revealed 579,179 reads homologous to 153 miRNAs from 41 miRNA families. In addition, a pineapple fruit transcriptome library consisting of approximately 30,000 EST contigs constructed using Solexa sequencing was used for the discovery of pre-miRNAs. In all, four pre-miRNAs were identified (MIR156, MIR399, MIR444 and MIR2673). Furthermore, the same pineapple transcriptome was used to dissect the function of the miRNAs in pineapple by predicting their putative targets in conjunction with their regulatory networks. In total, 23 metabolic pathways were found to be regulated by miRNAs in pineapple. The use of high-throughput sequencing in pineapples to unveil the presence of miRNAs and their regulatory pathways provides insight into the repertoire of miRNA regulation used exclusively in this non-climacteric model plant. Copyright © 2015 Elsevier B.V. All rights reserved.

  8. Genomic divergences among cattle, dog and human estimated from large-scale alignments of genomic sequences

    Directory of Open Access Journals (Sweden)

    Shade Larry L

    2006-06-01

    Full Text Available Abstract Background Approximately 11 Mb of finished high quality genomic sequences were sampled from cattle, dog and human to estimate genomic divergences and their regional variation among these lineages. Results Optimal three-way multi-species global sequence alignments for 84 cattle clones or loci (each >50 kb of genomic sequence) were constructed using the human and dog genome assemblies as references. Genomic divergences and substitution rates were examined for each clone and for various sequence classes under different functional constraints. Analysis of these alignments revealed that the overall genomic divergences are relatively constant (0.32–0.37 change/site) for pairwise comparisons among cattle, dog and human; however substitution rates vary across genomic regions and among different sequence classes. A neutral mutation rate (2.0–2.2 × 10^-9 change/site/year) was derived from ancestral repetitive sequences, whereas the substitution rate in coding sequences (1.1 × 10^-9 change/site/year) was approximately half of the overall rate (1.9–2.0 × 10^-9 change/site/year). Relative rate tests also indicated that cattle have a significantly faster rate of substitution as compared to dog and that this difference is about 6%. Conclusion This analysis provides a large-scale and unbiased assessment of genomic divergences and regional variation of substitution rates among cattle, dog and human. It is expected that these data will serve as a baseline for future mammalian molecular evolution studies.

  9. Illumina MiSeq 16S amplicon sequence analysis of bovine respiratory disease associated bacteria in lung and mediastinal lymph node tissue.

    Science.gov (United States)

    Johnston, Dayle; Earley, Bernadette; Cormican, Paul; Murray, Gerard; Kenny, David Anthony; Waters, Sinead Mary; McGee, Mark; Kelly, Alan Kieran; McCabe, Matthew Sean

    2017-05-02

    Bovine respiratory disease (BRD) is caused by growth of single or multiple species of pathogenic bacteria in lung tissue following stress and/or viral infection. Next generation sequencing of 16S ribosomal RNA gene PCR amplicons (NGS 16S amplicon analysis) is a powerful culture-independent open reference method that has recently been used to increase understanding of BRD-associated bacteria in the upper respiratory tract of BRD cattle. However, it has not yet been used to examine the microbiome of the bovine lower respiratory tract. The objective of this study was to use NGS 16S amplicon analysis to identify bacteria in post-mortem lung and lymph node tissue samples harvested from fatal BRD cases and clinically healthy animals. Cranial lobe and corresponding mediastinal lymph node post-mortem tissue samples were collected from calves diagnosed as BRD cases by veterinary laboratory pathologists and from clinically healthy calves. NGS 16S amplicon libraries, targeting the V3-V4 region of the bacterial 16S rRNA gene were prepared and sequenced on an Illumina MiSeq. Quantitative insights into microbial ecology (QIIME) was used to determine operational taxonomic units (OTUs) which corresponded to the 16S rRNA gene sequences. Leptotrichiaceae, Mycoplasma, Pasteurellaceae, and Fusobacterium were the most abundant OTUs identified in the lungs and lymph nodes of the calves which died from BRD. Leptotrichiaceae, Fusobacterium, Mycoplasma, Trueperella and Bacteroides had greater relative abundances in post-mortem lung samples collected from fatal cases of BRD in dairy calves, compared with clinically healthy calves without lung lesions. Leptotrichiaceae, Mycoplasma and Pasteurellaceae showed higher relative abundances in post-mortem lymph node samples collected from fatal cases of BRD in dairy calves, compared with clinically healthy calves without lung lesions. Two Leptotrichiaceae sequence contigs were subsequently assembled from bacterial DNA-enriched shotgun sequences

  10. High diversity of picornaviruses in rats from different continents revealed by deep sequencing.

    Science.gov (United States)

    Hansen, Thomas Arn; Mollerup, Sarah; Nguyen, Nam-Phuong; White, Nicole E; Coghlan, Megan; Alquezar-Planas, David E; Joshi, Tejal; Jensen, Randi Holm; Fridholm, Helena; Kjartansdóttir, Kristín Rós; Mourier, Tobias; Warnow, Tandy; Belsham, Graham J; Bunce, Michael; Willerslev, Eske; Nielsen, Lars Peter; Vinner, Lasse; Hansen, Anders Johannes

    2016-08-17

    Outbreaks of zoonotic diseases in humans and livestock are not uncommon, and an important component in containment of such emerging viral diseases is rapid and reliable diagnostics. Such methods are often PCR-based and hence require the availability of sequence data from the pathogen. Rattus norvegicus (R. norvegicus) is a known reservoir for important zoonotic pathogens. Transmission may be direct via contact with the animal, for example, through exposure to its faecal matter, or indirectly mediated by arthropod vectors. Here we investigated the viral content in rat faecal matter (n=29) collected from two continents by analyzing 2.2 billion next-generation sequencing reads derived from both DNA and RNA. Among other virus families, we found sequences from members of the Picornaviridae to be abundant in the microbiome of all the samples. Here we describe the diversity of the picornavirus-like contigs including near-full-length genomes closely related to the Boone cardiovirus and Theiler's encephalomyelitis virus. From this study, we conclude that picornaviruses within R. norvegicus are more diverse than previously recognized. The virome of R. norvegicus should be investigated further to assess the full potential for zoonotic virus transmission.

  11. Tracembler – software for in-silico chromosome walking in unassembled genomes

    Directory of Open Access Journals (Sweden)

    Wilkerson Matthew D

    2007-05-01

    Full Text Available Abstract Background Whole genome shotgun sequencing produces increasingly higher coverage of a genome with random sequence reads. Progressive whole genome assembly and eventual finishing sequencing is a process that typically takes several years for large eukaryotic genomes. In the interim, all sequence reads of public sequencing projects are made available in repositories such as the NCBI Trace Archive. For a particular locus, sequencing coverage may be high enough early on to produce a reliable local genome assembly. We have developed software, Tracembler, that facilitates in silico chromosome walking by recursively assembling reads of a selected species from the NCBI Trace Archive starting with reads that significantly match sequence seeds supplied by the user. Results Tracembler takes one or multiple DNA or protein sequence(s) as input to the NCBI Trace Archive BLAST engine to identify matching sequence reads from a species of interest. The BLAST searches are carried out recursively such that BLAST matching sequences identified in previous rounds of searches are used as new queries in subsequent rounds of BLAST searches. The recursive BLAST search stops when either no more new matching sequences are found, a given maximal number of queries is exhausted, or a specified maximum number of rounds of recursion is reached. All the BLAST matching sequences are then assembled into contigs based on significant sequence overlaps using the CAP3 program. We demonstrate the validity of the concept and software implementation with an example of successfully recovering a full-length Chrm2 gene as well as its upstream and downstream genomic regions from Rattus norvegicus reads. In a second example, a query with two adjacent Medicago truncatula genes as seeds resulted in a contig that likely identifies the microsyntenic homologous soybean locus. Conclusion Tracembler streamlines the process of recursive database searches, sequence assembly, and gene
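
    The recursive search with its three stopping conditions can be sketched as follows. This is a minimal illustration, not Tracembler's actual code: the function `recursive_search`, its parameters, and the toy `OVERLAPS` table standing in for a BLAST search are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Set

def recursive_search(
    seeds: Iterable[str],
    search: Callable[[str], Set[str]],
    max_queries: int = 100,
    max_rounds: int = 5,
) -> Set[str]:
    """Collect matches by re-querying with the hits of the previous round."""
    found: Set[str] = set()
    queries = list(seeds)
    used = 0
    for _ in range(max_rounds):            # stop: round limit reached
        new_hits: Set[str] = set()
        for query in queries:
            if used >= max_queries:        # stop: query budget exhausted
                break
            used += 1
            new_hits |= search(query) - found
        found |= new_hits
        if not new_hits or used >= max_queries:   # stop: nothing new found
            break
        queries = sorted(new_hits)         # hits become the next round's queries
    return found

# Hypothetical stand-in for a similarity search: which reads each query matches.
OVERLAPS: Dict[str, Set[str]] = {
    "seed": {"r1", "r2"},
    "r1": {"r2", "r3"},
    "r2": set(),
    "r3": {"r4"},
    "r4": set(),
}

def toy_search(query: str) -> Set[str]:
    return OVERLAPS.get(query, set())

if __name__ == "__main__":
    print(sorted(recursive_search(["seed"], toy_search)))
```

    In a real pipeline the collected reads would then be passed to an assembler; that step is omitted here.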

  12. Draft genome sequence of Streptomyces sp. strain F1, a potential source for glycoside hydrolases isolated from Brazilian soil

    Directory of Open Access Journals (Sweden)

    Ricardo Rodrigues de Melo

    Full Text Available ABSTRACT Here, we present the draft genome sequence of Streptomyces sp. F1, a strain isolated from soil with great potential for secretion of hydrolytic enzymes used to deconstruct cellulosic biomass. The draft genome assembly of Streptomyces sp. strain F1 has 69 contigs with a total genome size of 8,142,296 bp and a G + C content of 72.65%. Preliminary genome analysis identified 175 proteins as Carbohydrate-Active Enzymes, of which 85 are glycoside hydrolases organized in 33 distinct families. This draft genome information provides new insights on the key genes encoding hydrolytic enzymes involved in biomass deconstruction employed by soil bacteria.

  13. Dicty_cDB: SFF154 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SF (Link to library) SFF154 (Link to dictyBase) - - - Contig-U15074-1 SFF154P (Link... to Original site) SFF154F 132 SFF154Z 514 SFF154P 646 - - Show SFF154 Library SF (Link to library) Clone ID SFF154 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U15074-1 Original site URL http://dict...22, genomic survey sequence. 48 3e-09 3 AC116921 |AC116921.2 Dictyostelium discoi...deum chromosome 2 map 4624505-4657775 strain AX4, complete sequence. 44 4e-09 7 BQ096682 |BQ096682.1 IfHdk00151 Ict

  14. Dicty_cDB: SHK620 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SH (Link to library) SHK620 (Link to dictyBase) - - - Contig-U13939-1 - (Link to Or...iginal site) SHK620F 395 - - - - - - Show SHK620 Library SH (Link to library) Clone ID SHK620 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U13939-1 Original site URL http://dictycdb.b...pdate 2002.12. 9 Homology vs DNA Score E Sequences producing significant alignments: (bits) Value N AF337815 |AF337815.1 Dict...omene cDNA clone Hm_pupb_01C06 5', mRNA sequence. 48 0.15 1 CK405838 |CK405838.1 AUF_IfSpn_234_g03 Ict

  15. Dicty_cDB: CHD405 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available CH (Link to library) CHD405 (Link to dictyBase) - - - Contig-U15984-1 CHD405E (Link... Clone ID CHD405 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U15984-1 Original site URL http://dict...ts) Value N AC115599 |AC115599.2 Dictyostelium discoideum chromosome 2 map 422909...8-4354721 strain AX4, complete sequence. 42 3e-11 9 AC115598 |AC115598.2 Dictyostelium discoideum chromosome... 2 map 581427-735498 strain AX4, complete sequence. 50 4e-11 11 CK417372 |CK417372.1 AUF_IpInt_56_d19 Intestine cDNA library Ict

  16. Dicty_cDB: VSK112 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VS (Link to library) VSK112 (Link to dictyBase) - - - Contig-U16538-1 VSK112P (Link... to Original site) VSK112F 515 VSK112Z 357 VSK112P 872 - - Show VSK112 Library VS (Link to library) Clone ID VSK112 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16538-1 Original site URL http://dict... 2004.12.25 Homology vs DNA Score E Sequences producing significant alignments: (bits) Value N Y17042 |Y17042.1 Dict...095O23 F, DNA sequence. 36 0.071 2 BM439182 |BM439182.1 IpLvr02248 Liver cDNA library Ict

  17. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB877 (Link to dictyBase) - - - Contig-U10748-1 VFB877P (Link... to Original site) VFB877F 220 VFB877Z 725 VFB877P 945 - - Show VFB877 Library VF (Link to library) Clone ID VFB877 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U10748-1 Original site URL http://dict... E Sequences producing significant alignments: (bits) Value N AC116956 |AC116956.2 Dict...63 |AL583863.16 Human DNA sequence from clone RP11-118G19 on chromosome 1. 44 0.16 2 AF193811 |AF193811.1 Dict

  18. Rotation sequence to report humerothoracic kinematics during 3D motion involving large horizontal component: application to the tennis forehand drive.

    Science.gov (United States)

    Creveaux, Thomas; Sevrez, Violaine; Dumas, Raphaël; Chèze, Laurence; Rogowski, Isabelle

    2018-03-01

    The aim of this study was to examine the respective aptitudes of three rotation sequences (Y_t X_f' Y_h'', Z_t X_f' Y_h'', and X_t Z_f' Y_h'') to effectively describe the orientation of the humerus relative to the thorax during a movement involving a large horizontal abduction/adduction component: the tennis forehand drive. An optoelectronic system was used to record the movements of eight elite male players, each performing ten forehand drives. The occurrences of gimbal lock, phase angle discontinuity and incoherency in the time course of the three angles defining humerothoracic rotation were examined for each rotation sequence. Our results demonstrated that no single sequence effectively describes humerothoracic motion without discontinuities throughout the forehand motion. The humerothoracic joint angles can nevertheless be described without singularities when considering the backswing/forward-swing and the follow-through phases separately. Our findings stress that the sequence choice may have implications for the report and interpretation of 3D joint kinematics during large shoulder range of motion. Consequently, the use of Euler/Cardan angles to represent 3D orientation of the humerothoracic joint in sport tasks requires the evaluation of the rotation sequence regarding singularity occurrence before analysing the kinematic data, especially when the task involves a large shoulder range of motion in the horizontal plane.
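
    A singularity check of the kind recommended above can be sketched as follows. It relies on a general property of three-angle rotation sequences: a symmetric (Euler) sequence such as YXY is singular when the second angle is 0° or ±180°, and an asymmetric (Cardan) sequence such as ZXY or XZY is singular when the second angle is ±90°. The function name and the 10° tolerance are illustrative assumptions, not values taken from the study.

```python
def is_near_singularity(sequence: str, second_angle_deg: float,
                        tol_deg: float = 10.0) -> bool:
    """Return True if the second rotation angle is close to gimbal lock."""
    seq = sequence.upper()
    if len(seq) != 3 or any(axis not in "XYZ" for axis in seq):
        raise ValueError("sequence must be three axes from X, Y, Z")
    if seq[0] == seq[1] or seq[1] == seq[2]:
        raise ValueError("consecutive rotations must use different axes")

    # Wrap the angle into [-180, 180) so that 350 deg is treated as -10 deg.
    angle = (second_angle_deg + 180.0) % 360.0 - 180.0

    if seq[0] == seq[2]:
        # Symmetric (Euler) sequence: singular at 0 and +/-180 degrees.
        distance = min(abs(angle), 180.0 - abs(angle))
    else:
        # Asymmetric (Cardan) sequence: singular at +/-90 degrees.
        distance = abs(abs(angle) - 90.0)
    return distance <= tol_deg

if __name__ == "__main__":
    for seq, angle in [("YXY", 5.0), ("ZXY", 5.0), ("ZXY", 88.0), ("XZY", -92.0)]:
        print(seq, angle, is_near_singularity(seq, angle))
```

    Running such a check over the time course of the second angle, frame by frame, flags the portions of a movement where a given sequence should not be trusted.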

  19. Dicty_cDB: SSI339 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SS (Link to library) SSI339 (Link to dictyBase) - - - Contig-U04467-1 SSI339Z (Link... to Original site) - - SSI339Z 563 - - - - Show SSI339 Library SS (Link to library) Clone ID SSI339 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U04467-1 Original site URL http://dict...1998. 1.22 Translated Amino Acid sequence ---FTCSNNQVISSSLVSENNCIYTVEMSGNIFCPTPTPTPTPTPTPTPNPTSNVTCKSS NGISITSSDIITCIGYGQSICT...NQVISSSLVSENNCIYTVEMSGNIFCPTPTPTPTPTPTPTPNPTSNVTCKSS NGISITSSDIITCIGYGQSICTTSSGYSCETNQTNGVLKCISPDNSISCIGNQFY

  20. Dicty_cDB: SLB394 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SL (Link to library) SLB394 (Link to dictyBase) - G24120 DDB0184167BP Contig-U13982... (Link to library) Clone ID SLB394 (Link to dictyBase) Atlas ID - NBRP ID G24120 dictyBase ID DDB0184167BP L...ink to Contig Contig-U13982-1 Original site URL http://dictycdb.biol.tsukuba.ac.j...2 Translated Amino Acid sequence TDTNPKEPSNVESIITTSEVTPSPPPSTTTSTTTNATTTITTSQPQANIIGGKRSRKDDE IISIQEALNDQLEEEKNLLEEAKEQEQEDWGDESICT...e B: TDTNPKEPSNVESIITTSEVTPSPPPSTTTSTTTNATTTITTSQPQANIIGGKRSRKDDE IISIQEALNDQLEEEKNLLEEAKEQEQEDWGDESICT

  1. Dicty_cDB: VHB478 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VH (Link to library) VHB478 (Link to dictyBase) - - - Contig-U16349-1 - (Link to Or...iginal site) - - VHB478Z 556 - - - - Show VHB478 Library VH (Link to library) Clone ID VHB478 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16349-1 Original site URL http://dictycdb.b...cid sequence ---ASGSVVXQCSSVDSISNLPTXMQLFAGIKSICTEMAMDGCEKCSGNSPTTTCDVLPV YSSLCMAMPDMSQCANWTKMCSSSGQLYNSQITT...xxcxlf*trknp*kyfrkkmdsqqkrfxr*xfn*vhlsl sgxy Frame C: ---ASGSVVXQCSSVDSISNLPTXMQLFAGIKSICTEMAMDGCEKCSGNSPTTT

  2. Fast selection of miRNA candidates based on large-scale pre-computed MFE sets of randomized sequences.

    Science.gov (United States)

    Warris, Sven; Boymans, Sander; Muiser, Iwe; Noback, Michiel; Krijnen, Wim; Nap, Jan-Peter

    2014-01-13

    Small RNAs are important regulators of genome function, yet their prediction in genomes is still a major computational challenge. Statistical analyses of pre-miRNA sequences indicated that their 2D structure tends to have a minimal free energy (MFE) significantly lower than MFE values of equivalently randomized sequences with the same nucleotide composition, in contrast to other classes of non-coding RNA. The computation of many MFEs is, however, too intensive to allow for genome-wide screenings. Using a local grid infrastructure, MFE distributions of random sequences were pre-calculated on a large scale. These distributions follow a normal distribution and can be used to determine the MFE distribution for any given sequence composition by interpolation. It allows on-the-fly calculation of the normal distribution for any candidate sequence composition. The speedup achieved makes genome-wide screening with this characteristic of a pre-miRNA sequence practical. Although this particular property alone will not be able to distinguish miRNAs from other sequences sufficiently discriminative, the MFE-based P-value should be added to the parameters of choice to be included in the selection of potential miRNA candidates for experimental verification.
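
    The MFE-based P-value described above can be sketched as a lower-tail probability under a normal distribution. This is a minimal illustration: the function name is an assumption, and the mean and standard deviation of the random-sequence MFEs are taken as given inputs (in the approach described, they would come from interpolating the pre-computed distributions for the candidate's nucleotide composition). The numbers in the usage lines are made up.

```python
from statistics import NormalDist

def mfe_p_value(candidate_mfe: float, random_mean: float, random_sd: float) -> float:
    """Probability that a randomized sequence folds with an MFE this low or lower."""
    # Lower tail: a more negative MFE (more stable fold) gives a smaller P-value.
    return NormalDist(mu=random_mean, sigma=random_sd).cdf(candidate_mfe)

if __name__ == "__main__":
    # Illustrative values in kcal/mol, not taken from the paper.
    p = mfe_p_value(candidate_mfe=-45.0, random_mean=-30.0, random_sd=5.0)
    print(f"P-value: {p:.5f}")
```

    Because only a normal CDF is evaluated per candidate, no randomized sequences need to be folded at screening time, which is the source of the speedup the abstract describes.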

  3. Transcriptome profiling of testis during sexual maturation stages in Eriocheir sinensis using Illumina sequencing.

    Directory of Open Access Journals (Sweden)

    Lin He

    Full Text Available The testis is a highly specialized tissue that plays dual roles in ensuring fertility by producing spermatozoa and hormones. Spermatogenesis is a complex process, resulting in the production of mature sperm from primordial germ cells. Significant structural and biochemical changes take place in the seminiferous epithelium of the adult testis during spermatogenesis. The gene expression pattern of testis in Chinese mitten crab (Eriocheir sinensis) has not been extensively studied, and limited genetic research has been performed on this species. The advent of high-throughput sequencing technologies enables the generation of genomic resources within a short period of time and at minimal cost. In the present study, we performed de novo transcriptome sequencing to produce a comprehensive transcript dataset for testis of E. sinensis. In two runs, we produced 25,698,778 sequencing reads corresponding with 2.31 Gb total nucleotides. These reads were assembled into 342,753 contigs or 141,861 scaffold sequences, which identified 96,311 unigenes. Based on similarity searches with known proteins, 39,995 unigenes were annotated based on having a Blast hit in the non-redundant database or ESTscan results with a cut-off E-value above 10^-5. This is the first report of a mitten crab transcriptome using high-throughput sequencing technology, and all these testes transcripts can help us understand the molecular mechanisms involved in spermatogenesis and testis maturation.

  4. The physical and genetic framework of the maize B73 genome.

    Directory of Open Access Journals (Sweden)

    Fusheng Wei

    2009-11-01

    Full Text Available Maize is a major cereal crop and an important model system for basic biological research. Knowledge gained from maize research can also be used to genetically improve its grass relatives such as sorghum, wheat, and rice. The primary objective of the Maize Genome Sequencing Consortium (MGSC) was to generate a reference genome sequence that was integrated with both the physical and genetic maps. Using a previously published integrated genetic and physical map, combined with in-coming maize genomic sequence, new sequence-based genetic markers, and an optical map, we dynamically picked a minimum tiling path (MTP) of 16,910 bacterial artificial chromosome (BAC) and fosmid clones that were used by the MGSC to sequence the maize genome. The final MTP resulted in a significantly improved physical map that reduced the number of contigs from 721 to 435, incorporated a total of 8,315 mapped markers, and ordered and oriented the majority of FPC contigs. The new integrated physical and genetic map covered 2,120 Mb (93%) of the 2,300-Mb genome, of which 405 contigs were anchored to the genetic map, totaling 2,103.4 Mb (99.2% of the 2,120 Mb physical map). More importantly, 336 contigs, comprising 94.0% of the physical map (approximately 1,993 Mb), were ordered and oriented. Finally we used all available physical, sequence, genetic, and optical data to generate a golden path (AGP) of chromosome-based pseudomolecules, herein referred to as the B73 Reference Genome Sequence version 1 (B73 RefGen_v1).

  5. eRNA: a graphic user interface-based tool optimized for large data analysis from high-throughput RNA sequencing.

    Science.gov (United States)

    Yuan, Tiezheng; Huang, Xiaoyi; Dittmar, Rachel L; Du, Meijun; Kohli, Manish; Boardman, Lisa; Thibodeau, Stephen N; Wang, Liang

    2014-03-05

    RNA sequencing (RNA-seq) is emerging as a critical approach in biological research. However, its high-throughput advantage is significantly limited by the capacity of bioinformatics tools. The research community urgently needs user-friendly tools to efficiently analyze the complicated data generated by high throughput sequencers. We developed a standalone tool with graphic user interface (GUI)-based analytic modules, known as eRNA. The capacity of performing parallel processing and sample management facilitates large data analyses by maximizing hardware usage and freeing users from tediously handling sequencing data. The module "miRNA identification" includes GUIs for raw data reading, adapter removal, sequence alignment, and read counting. The module "mRNA identification" includes GUIs for reference sequences, genome mapping, transcript assembling, and differential expression. The module "Target screening" provides expression profiling analyses and graphic visualization. The module "Self-testing" offers the directory setups, sample management, and a check for third-party package dependency. Integration of other GUIs including Bowtie, miRDeep2, and miRspring extends the program's functionality. eRNA focuses on the common tools required for the mapping and quantification analysis of miRNA-seq and mRNA-seq data. The software package provides an additional choice for scientists who require a user-friendly computing environment and high-throughput capacity for large data analysis. eRNA is available for free download at https://sourceforge.net/projects/erna/?source=directory.

  6. Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida

    Science.gov (United States)

    Pirooznia, Mehdi; Gong, Ping; Guan, Xin; Inouye, Laura S; Yang, Kuan; Perkins, Edward J; Deng, Youping

    2007-01-01

    Background Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from two E. fetida libraries enriched with genes responsive to ten ordnance related compounds using suppressive subtractive hybridization-PCR. Results A total of 3144 good quality ESTs (GenBank dbEST accession number EH669363–EH672369 and EL515444–EL515580) were obtained from the raw clone sequences after cleaning. Clustering analysis yielded 2231 unique sequences including 448 contigs (from 1361 ESTs) and 1783 singletons. Comparative genomic analysis showed that 743 or 33% of the unique sequences shared high similarity with existing genes in the GenBank nr database. Provisional function annotation assigned 830 Gene Ontology terms to 517 unique sequences based on their homology with the annotated genomes of four model organisms Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. Seven percent of the unique sequences were further mapped to 99 Kyoto Encyclopedia of Genes and Genomes pathways based on their matching Enzyme Commission numbers. All the information is stored and retrievable at a highly performed, web-based and user-friendly relational database called EST model database or ESTMD version 2. Conclusion The ESTMD containing the sequence and annotation information of 4032 E. fetida ESTs is publicly accessible at . PMID:18047730

  7. Next Generation Semiconductor Based Sequencing of the Donkey (Equus asinus) Genome Provided Comparative Sequence Data against the Horse Genome and a Few Millions of Single Nucleotide Polymorphisms

    Science.gov (United States)

    Bertolini, Francesca; Scimone, Concetta; Geraci, Claudia; Schiavo, Giuseppina; Utzeri, Valerio Joe; Chiofalo, Vincenzo; Fontanesi, Luca

    2015-01-01

    Few studies investigated the donkey (Equus asinus) at the whole genome level so far. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared obtained sequence information with the available donkey draft genome (and its Illumina reads from which it was originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than those aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~ 0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionally important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million of single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources. PMID:26151450
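The abstract above attributes the alignment difference to the larger N50 of the horse assembly. As a reminder of what that statistic measures, here is a minimal sketch of the usual N50 calculation; the helper name `n50` and the contig lengths are illustrative assumptions, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together contain
    at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths in bp.
contigs = [100, 80, 50, 30, 20, 10]
print(n50(contigs))
```

A more fragmented assembly has a smaller N50, which leaves fewer long contiguous targets for reads to align to.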

  8. Next Generation Semiconductor Based Sequencing of the Donkey (Equus asinus) Genome Provided Comparative Sequence Data against the Horse Genome and a Few Millions of Single Nucleotide Polymorphisms.

    Directory of Open Access Journals (Sweden)

    Francesca Bertolini

    Full Text Available Few studies investigated the donkey (Equus asinus) at the whole genome level so far. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared obtained sequence information with the available donkey draft genome (and its Illumina reads from which it was originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than those aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~ 0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionally important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million of single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources.

  9. Methylation-sensitive linking libraries enhance gene-enriched sequencing of complex genomes and map DNA methylation domains

    Directory of Open Access Journals (Sweden)

    Bharti Arvind K

    2008-12-01

    Full Text Available Abstract Background Many plant genomes are resistant to whole-genome assembly due to an abundance of repetitive sequence, leading to the development of gene-rich sequencing techniques. Two such techniques are hypomethylated partial restriction (HMPR) and methylation spanning linker libraries (MSLL). These libraries differ from other gene-rich datasets in having larger insert sizes, and the MSLL clones are designed to provide reads localized to "epigenetic boundaries" where methylation begins or ends. Results A large-scale study in maize generated 40,299 HMPR sequences and 80,723 MSLL sequences, including MSLL clones exceeding 100 kb. The paired end reads of MSLL and HMPR clones were shown to be effective in linking existing gene-rich sequences into scaffolds. In addition, it was shown that the MSLL clones can be used for anchoring these scaffolds to a BAC-based physical map. The MSLL end reads effectively identified epigenetic boundaries, as indicated by their preferential alignment to regions upstream and downstream from annotated genes. The ability to precisely map long stretches of fully methylated DNA sequence is a unique outcome of MSLL analysis, and was also shown to provide evidence for errors in gene identification. MSLL clones were observed to be significantly more repeat-rich in their interiors than in their end reads, confirming the correlation between methylation and retroelement content. Both MSLL and HMPR reads were found to be substantially gene-enriched, with the SalI MSLL libraries being the most highly enriched (31% align to an EST contig), while the HMPR clones exhibited exceptional depletion of repetitive DNA (to ~11%). These two techniques were compared with other gene-enrichment methods, and shown to be complementary. Conclusion MSLL technology provides an unparalleled approach for mapping the epigenetic status of repetitive blocks and for identifying sequences mis-identified as genes. Although the types and natures of

  10. A 1.7-Mb YAC contig around the human BDNF gene (11p13): integration of the physical, genetic, and cytogenetic maps in relation to WAGR syndrome

    Energy Technology Data Exchange (ETDEWEB)

    Rosier, M.F.; Martin, A.; Houlgatte, R. [Genetique Moleculaire et Biologie du Development, Villejuif (France)] [and others]

    1994-11-01

    WAGR (Wilms tumor, aniridia, genito-urinary abnormalities, mental retardation) syndrome in humans is associated with deletions of the 11p13 region. The brain-derived neurotrophic factor (BDNF) gene maps to this region, and its deletion seems to contribute to the severity of the patient's mental retardation. Yeast artificial chromosomes (YACs) carrying the BDNF gene have been isolated and characterized. Localization of two known exons of this gene leads to a minimal estimation of its size of about 40 kb. Chimerism of the BDNF YACs has been investigated by fluorescence in situ hybridization and chromosome assignment on somatic cell hybrids. Using the BDNF gene, YAC end sequence tagged sites (STS), and Genethon microsatellite markers, the authors constructed a 1.7-Mb contig and refined the cytogenetic map at 11p13. The resulting integrated physical, genetic, and cytogenetic map constitutes a resource for the characterization of genes that may be involved in the WAGR syndrome. 42 refs., 2 figs., 3 tabs.

  11. Large-scale chromosome folding versus genomic DNA sequences: A discrete double Fourier transform technique.

    Science.gov (United States)

    Chechetkin, V R; Lobzin, V V

    2017-08-07

    The use of state-of-the-art techniques combining imaging methods and high-throughput genomic mapping tools has led to significant progress in detailing the chromosome architecture of various organisms. However, a gap still remains between the rapidly growing structural data on chromosome folding and the large-scale genome organization. Could a part of the information on chromosome folding be obtained directly from the underlying genomic DNA sequences abundantly stored in the databanks? To answer this question, we developed an original discrete double Fourier transform (DDFT). DDFT serves for the detection of large-scale genome regularities associated with domains/units at the different levels of hierarchical chromosome folding. The method is versatile and can be applied to both genomic DNA sequences and corresponding physico-chemical parameters such as base-pairing free energy. The latter characteristic is closely related to replication and transcription and can also be used for the assessment of temperature or supercoiling effects on chromosome folding. We tested the method on the genome of E. coli K-12 and found good correspondence with the annotated domains/units established experimentally. As a brief illustration of further abilities of DDFT, the study of large-scale genome organization for bacteriophage PHIX174 and bacterium Caulobacter crescentus was also added. The combined experimental, modeling, and bioinformatic DDFT analysis should yield more complete knowledge on the chromosome architecture and genome organization. Copyright © 2017 Elsevier Ltd. All rights reserved.

  12. Comparison of zero-sequence injection methods in cascaded H-bridge multilevel converters for large-scale photovoltaic integration

    DEFF Research Database (Denmark)

    Yu, Yifan; Konstantinou, Georgios; Townsend, Christopher David

    2017-01-01

    Photovoltaic (PV) power generation levels in the three phases of a multilevel cascaded H-bridge (CHB) converter can be significantly unbalanced, owing to different irradiance levels and ambient temperatures over a large-scale solar PV power plant. Injection of a zero-sequence voltage is required to maintain three-phase balanced grid currents with unbalanced power generation. This study theoretically compares power balance capabilities of various zero-sequence injection methods based on two metrics which can be easily generalised for all CHB applications to PV systems. Experimental results based on a 430 V, 10 kW, three-phase, seven-level cascaded H-bridge converter prototype confirm superior performance of the optimal zero-sequence injection technique.

  13. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB866 (Link to dictyBase) - - - Contig-U16349-1 VFB866Z (Link... to Original site) - - VFB866Z 631 - - - - Show VFB866 Library VF (Link to library) Clone ID VFB866 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16349-1 Original site URL http://dict...ce ---SSVDSISNLPTTMQLFAGIKSICTEMAMDGCEKCSGNSPTTTCDVLPVYSSLCMAMP DMSQCANWTKMCSSSGQLYNSQITSDYCVASVADAVPIMRMYFH...RGCLHAIELTCSYALMLVAMTFNVALFFAV Translated Amino Acid sequence (All Frames) Frame A: ---SSVDSISNLPTTMQLFAGIKSICT

  14. Dicty_cDB: VHL364 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VH (Link to library) VHL364 (Link to dictyBase) - - - Contig-U16205-1 - (Link to Or...iginal site) - - VHL364Z 668 - - - - Show VHL364 Library VH (Link to library) Clone ID VHL364 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16205-1 Original site URL http://dictycdb.b...mino Acid sequence ---KKVTIAEAKEIFEKQYRDLYTVSQDVTKLAIQSAEQNGIVFLDEIDKICTSRESIKN GGDASTDGVQRDLLPIVEGCMVSTKYGQ...itwyh*tyrrgykf*lxys*r*nscfrysrh*ktfk Frame B: ---KKVTIAEAKEIFEKQYRDLYTVSQDVTKLAIQSAEQNGIVFLDEIDKICTSRESIKN G

  15. Dicty_cDB: VHK472 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VH (Link to library) VHK472 (Link to dictyBase) - - - Contig-U16349-1 - (Link to Or...iginal site) - - VHK472Z 392 - - - - Show VHK472 Library VH (Link to library) Clone ID VHK472 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16349-1 Original site URL http://dictycdb.b....10 Translated Amino Acid sequence ---CSSVDSISNLPTXMQLFAGIKSICTEMAMDGCEKCSGNSPTTT...wstlqlsnhirllcrlcc*rrsnhenvlshwylglypl*ilgtk n*psicwfmv Frame B: ---CSSVDSISNLPTXMQLFAGIKSICTEMAMDGCEKCSGNSP

  16. Dicty_cDB: SLE110 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SL (Link to library) SLE110 (Link to dictyBase) - - - Contig-U16520-1 SLE110E (Link... to Original site) - - - - - - SLE110E 237 Show SLE110 Library SL (Link to library) Clone ID SLE110 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16520-1 Original site URL http://dict...NLSEDVNLEEFVMSKDDLSGADIKAICTESGLLALRERRMRVTYXDFKKAKEKVLYR KTAGAPEGLYM*kkknqnq Translated Amino Acid sequence... (All Frames) Frame A: AKMNLSEDVNLEEFVMSKDDLSGADIKAICTESGLLALRERRMRVTYXDFKKAKEKVL

  17. Dicty_cDB: SLA340 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SL (Link to library) SLA340 (Link to dictyBase) - - - Contig-U16510-1 - (Link to Or...iginal site) - - SLA340Z 466 - - - - Show SLA340 Library SL (Link to library) Clone ID SLA340 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16510-1 Original site URL http://dictycdb.b.... 2 Translated Amino Acid sequence ---CGKPSPIPTCTRKLSPISICIIKLCPIFICITKLCSIYICTISICIIFICIICSNYS CNYNCNYSCNYN...qpqlqlqlqlqpqpqlqpqlqpq lplkfnqktffikyhynnffnnnsikk*y*kspffl Frame C: ---CGKPSPIPTCTRKLSPISICIIKLCPIFICITKLCSIYICT

  18. Next-Generation Sequencing of the Chrysanthemum nankingense (Asteraceae) Transcriptome Permits Large-Scale Unigene Assembly and SSR Marker Discovery

    Science.gov (United States)

    Wang, Haibin; Jiang, Jiafu; Chen, Sumei; Qi, Xiangyu; Peng, Hui; Li, Pirui; Song, Aiping; Guan, Zhiyong; Fang, Weimin; Liao, Yuan; Chen, Fadi

    2013-01-01

    Background Simple sequence repeats (SSRs) are ubiquitous in eukaryotic genomes. Chrysanthemum is one of the largest genera in the Asteraceae family. Only a few Chrysanthemum expressed sequence tag (EST) sequences have been acquired to date, so the number of available EST-SSR markers is very low. Methodology/Principal Findings Illumina paired-end sequencing technology produced over 53 million sequencing reads from C. nankingense mRNA. The subsequent de novo assembly yielded 70,895 unigenes, of which 45,789 (64.59%) unigenes showed similarity to the sequences in NCBI database. Out of 45,789 sequences, 107 have hits to the Chrysanthemum Nr protein database; 679 and 277 sequences have hits to the database of Helianthus and Lactuca species, respectively. MISA software identified a large number of putative EST-SSRs, allowing 1,788 primer pairs to be designed from the de novo transcriptome sequence and a further 363 from archival EST sequence. Among 100 primer pairs randomly chosen, 81 markers yielded amplicons and 20 were polymorphic for genotype analysis in Chrysanthemum. The results showed that most (but not all) of the assays were transferable across species and that they exposed a significant amount of allelic diversity. Conclusions/Significance SSR markers acquired by transcriptome sequencing are potentially useful for marker-assisted breeding and genetic analysis in the genus Chrysanthemum and its related genera. PMID:23626799

  19. Draft genome sequence of Thalassobius mediterraneus CECT 5383T, a poly-beta-hydroxybutyrate producer

    Directory of Open Access Journals (Sweden)

    Lidia Rodrigo-Torres

    2016-03-01

    Full Text Available Thalassobius mediterraneus is the type species of the genus Thalassobius and a member of the Roseobacter clade, an abundant representative of marine bacteria. T. mediterraneus XSM19T (=CECT 5383T) was isolated from the Western Mediterranean coast near Valencia (Spain) in 1989. We present here the draft genome sequence and annotation of this strain (ENA/DDBJ/NCBI accession number CYSF00000000), which is comprised of 3,431,658 bp distributed in 19 contigs and encodes 10 rRNA genes, 51 tRNA genes and 3276 protein coding genes. Relevant findings are discussed, including the complete set of genes required for poly-beta-hydroxybutyrate (PHB) synthesis and genes related to degradation of aromatic compounds. Keywords: Rhodobacteraceae, Roseobacter clade, PHB, Aromatic compounds

  20. Sequence recombination and conservation of Varroa destructor virus-1 and deformed wing virus in field collected honey bees (Apis mellifera).

    Directory of Open Access Journals (Sweden)

    Hui Wang

    Full Text Available We sequenced small (s) RNAs from field collected honeybees (Apis mellifera) and bumblebees (Bombus pascuorum) using the Illumina technology. The sRNA reads were assembled and resulting contigs were used to search for virus homologues in GenBank. Matches with Varroa destructor virus-1 (VDV1) and Deformed wing virus (DWV) genomic sequences were obtained for A. mellifera but not B. pascuorum. Further analyses suggested that the prevalent virus population was composed of VDV-1 and a chimera of 5'-DWV-VDV1-DWV-3'. The recombination junctions in the chimera genomes were confirmed by using RT-PCR, cDNA cloning and Sanger sequencing. We then focused on conserved short fragments (CSF, size > 25 nt) in the virus genomes by using GenBank sequences and the deep sequencing data obtained in this study. The majority of CSF sites confirmed conservation at both between-species (GenBank sequences) and within-population (dataset of this study) levels. However, conserved nucleotide positions in the GenBank sequences might be variable at the within-population level. High mutation rates (Pi>10%) were observed at a number of sites using the deep sequencing data, suggesting that sequence conservation might not always be maintained at the population level. Virus-host interactions and strategies for developing RNAi treatments against VDV1/DWV infections are discussed.

  1. Analyses of the Sequence and Structural Properties Corresponding to Pentapeptide and Large Palindromes in Proteins.

    Directory of Open Access Journals (Sweden)

    Settu Sridhar

    Full Text Available The analyses of 3967 representative proteins selected from the Protein Data Bank revealed the presence of 2803 pentapeptide and large palindrome sequences with known secondary structure conformation. These represent 2014 unique palindrome sequences. 60% of the palindromes are not associated with any regular secondary structure, 28% are in helix conformation, 11% in strand conformation and 1% in the coil conformation. The average solvent accessibility values are in the range 0-155.28 Å², suggesting that the palindromes in proteins can be either buried, exposed to the solvent or share an intermittent property. The number of residue neighborhood contacts defined by interactions ≤ 3.2 Å is in the range 0-29 residues. Palindromes of the same length in helix, strand and coil conformation are associated with different amino acid residue preferences at the individual positions. Nearly 20% of the palindromes interact with catalytic/active site residues, ligand or metal ions in proteins and may therefore be important for function in the corresponding protein. The average hydrophobicity values for the pentapeptide and large palindromes range from -4.3 to +4.32 and the number of palindromes is almost equally distributed between the negative and positive hydrophobicity values. The palindromes represent 107 different protein families, and the hydrolases, transferases, oxidoreductases and lyases contain a relatively large number of palindromes.
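To make the notion of a pentapeptide or larger palindrome concrete, the sketch below scans a protein sequence for every substring of at least five residues that reads the same in both directions. The function name `find_palindromes` and the toy sequence are assumptions for illustration; the study's exact selection criteria (for example, how overlapping or nested palindromes are counted) are not stated in the abstract and may differ.

```python
def find_palindromes(seq, min_len=5):
    """Return (start, substring) for each palindromic stretch of >= min_len residues.

    Brute-force scan over all substrings; adequate for single protein chains.
    Start positions are 0-based.
    """
    hits = []
    n = len(seq)
    for start in range(n - min_len + 1):
        for end in range(start + min_len, n + 1):
            sub = seq[start:end]
            if sub == sub[::-1]:
                hits.append((start, sub))
    return hits


# Toy sequence (hypothetical): a heptapeptide palindrome that contains
# a nested pentapeptide palindrome.
for start, pal in find_palindromes("GAVLVAG"):
    print(start, pal)
```

Nested hits are reported separately here; a real survey would have to decide whether to keep only the longest palindrome at each site.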

  2. Physical Mapping of Bread Wheat Chromosome 5A: An Integrated Approach

    Directory of Open Access Journals (Sweden)

    Delfina Barabaschi

    2015-11-01

    Full Text Available The huge size, redundancy, and highly repetitive nature of the bread wheat [Triticum aestivum (L.)] genome make it among the most difficult species to be sequenced. To overcome these limitations, a strategy based on the separation of individual chromosomes or chromosome arms and the subsequent production of physical maps was established within the frame of the International Wheat Genome Sequence Consortium (IWGSC). A total of 95,812 bacterial artificial chromosome (BAC) clones of short-arm chromosome 5A (5AS) and long-arm chromosome 5A (5AL) arm-specific BAC libraries were fingerprinted and assembled into contigs by complementary analytical approaches based on the FingerPrinted Contig (FPC) and Linear Topological Contig (LTC) tools. Combined anchoring approaches based on polymerase chain reaction (PCR) marker screening, microarray, and sequence homology searches applied to several genomic tools (i.e., genetic maps, deletion bin map, neighbor maps, BAC end sequences (BESs), genome zipper, and chromosome survey sequences) allowed the development of a high-quality physical map with an anchored physical coverage of 75% for 5AS and 53% for 5AL with high portions (64 and 48%, respectively) of contigs ordered along the chromosome. In the genome of grasses, Brachypodium [Brachypodium distachyon (L.) Beauv.], rice (Oryza sativa L.), and sorghum [Sorghum bicolor (L.) Moench] homologs of genes on wheat chromosome 5A were separated into syntenic blocks on different chromosomes as a result of translocations and inversions during evolution. The physical map presented represents an essential resource for fine genetic mapping and map-based cloning of agronomically relevant traits and a reference for the 5A sequencing projects.

  3. Microarray and cDNA sequence analysis of transcription during nerve-dependent limb regeneration

    Directory of Open Access Journals (Sweden)

    Bryant Susan V

    2009-01-01

    Full Text Available Abstract Background Microarray analysis and 454 cDNA sequencing were used to investigate a centuries-old problem in regenerative biology: the basis of nerve-dependent limb regeneration in salamanders. Innervated (NR) and denervated (DL) forelimbs of Mexican axolotls were amputated and transcripts were sampled after 0, 5, and 14 days of regeneration. Results Considerable similarity was observed between NR and DL transcriptional programs at 5 and 14 days post amputation (dpa). Genes with extracellular functions that are critical to wound healing were upregulated while muscle-specific genes were downregulated. Thus, many processes that are regulated during early limb regeneration do not depend upon nerve-derived factors. The majority of the transcriptional differences between NR and DL limbs were correlated with blastema formation; cell numbers increased in NR limbs after 5 dpa and this yielded distinct transcriptional signatures of cell proliferation in NR limbs at 14 dpa. These transcriptional signatures were not observed in DL limbs. Instead, gene expression changes within DL limbs suggest more diverse and protracted wound-healing responses. 454 cDNA sequencing complemented the microarray analysis by providing deeper sampling of transcriptional programs and associated biological processes. Assembly of new 454 cDNA sequences with existing expressed sequence tag (EST) contigs from the Ambystoma EST database more than doubled (3935 to 9411) the number of non-redundant human-A. mexicanum orthologous sequences. Conclusion Many new candidate gene sequences were discovered for the first time and these will greatly enable future studies of wound healing, epigenetics, genome stability, and nerve-dependent blastema formation and outgrowth using the axolotl model.

  4. Transcriptome Sequencing of Diverse Peanut (Arachis) Wild Species and the Cultivated Species Reveals a Wealth of Untapped Genetic Variability

    Directory of Open Access Journals (Sweden)

    Ratan Chopra

    2016-12-01

    Full Text Available To test the hypothesis that the cultivated peanut species possesses almost no molecular variability, we sequenced a diverse panel of 22 Arachis accessions representing Arachis hypogaea botanical classes, A-, B-, and K- genome diploids, a synthetic amphidiploid, and a tetraploid wild species. RNASeq was performed on pools of three tissues, and de novo assembly was performed. Realignment of individual accession reads to transcripts of the cultivar OLin identified 306,820 biallelic SNPs. Among 10 naturally occurring tetraploid accessions, 40,382 unique homozygous SNPs were identified in 14,719 contigs. In eight diploid accessions, 291,115 unique SNPs were identified in 26,320 contigs. The average SNP rate among the 10 cultivated tetraploids was 0.5, and among eight diploids was 9.2 per 1000 bp. Diversity analysis indicated grouping of diploids according to genome classification, and cultivated tetraploids by subspecies. Cluster analysis of variants indicated that sequences of B genome species were the most similar to the tetraploids, and the next closest diploid accession belonged to the A genome species. A subset of 66 SNPs selected from the dataset was validated; of 782 SNP calls, 636 (81.32%) were confirmed using an allele-specific discrimination assay. We conclude that substantial genetic variability exists among wild species. Additionally, significant but lesser variability at the molecular level occurs among accessions of the cultivated species. This survey is the first to report significant SNP level diversity among transcripts, and may explain some of the phenotypic differences observed in germplasm surveys. Understanding SNP variants in the Arachis accessions will benefit in developing markers for selection.

  5. Role of Modular Polyketide Synthases in the Production of Polyether Ladder Compounds in Ciguatoxin-Producing Gambierdiscus polynesiensis and G. excentricus (Dinophyceae).

    Science.gov (United States)

    Kohli, Gurjeet S; Campbell, Katrina; John, Uwe; Smith, Kirsty F; Fraga, Santiago; Rhodes, Lesley L; Murray, Shauna A

    2017-09-01

    Gambierdiscus, a benthic dinoflagellate, produces ciguatoxins that cause the human illness Ciguatera. Ciguatoxins are polyether ladder compounds that have a polyketide origin, indicating that polyketide synthases (PKS) are involved in their production. We sequenced transcriptomes of Gambierdiscus excentricus and Gambierdiscus polynesiensis and found 264 contigs encoding single domain ketoacyl synthases (KS; G. excentricus: 106, G. polynesiensis: 143) and ketoreductases (KR; G. excentricus: 7, G. polynesiensis: 8) with sequence similarity to type I PKSs, as reported in other dinoflagellates. In addition, 24 contigs (G. excentricus: 3, G. polynesiensis: 21) encoding multiple PKS domains (forming typical type I PKSs modules) were found. The proposed structure produced by one of these megasynthases resembles a partial carbon backbone of a polyether ladder compound. Seventeen contigs encoding single domain KS, KR, s-malonyltransacylase, dehydratase and enoyl reductase with sequence similarity to type II fatty acid synthases (FAS) in plants were found. Type I PKS and type II FAS genes were distinguished based on the arrangement of domains on the contigs and their sequence similarity and phylogenetic clustering with known PKS/FAS genes in other organisms. This differentiation of PKS and FAS pathways in Gambierdiscus is important, as it will facilitate approaches to investigating toxin biosynthesis pathways in dinoflagellates. © 2017 The Author(s) Journal of Eukaryotic Microbiology © 2017 International Society of Protistologists.

  6. XplorSeq: a software environment for integrated management and phylogenetic analysis of metagenomic sequence data.

    Science.gov (United States)

    Frank, Daniel N

    2008-10-07

    Advances in automated DNA sequencing technology have accelerated the generation of metagenomic DNA sequences, especially environmental ribosomal RNA gene (rDNA) sequences. As the scale of rDNA-based studies of microbial ecology has expanded, need has arisen for software that is capable of managing, annotating, and analyzing the plethora of diverse data accumulated in these projects. XplorSeq is a software package that facilitates the compilation, management and phylogenetic analysis of DNA sequences. XplorSeq was developed for, but is not limited to, high-throughput analysis of environmental rRNA gene sequences. XplorSeq integrates and extends several commonly used UNIX-based analysis tools by use of a Macintosh OS-X-based graphical user interface (GUI). Through this GUI, users may perform basic sequence import and assembly steps (base-calling, vector/primer trimming, contig assembly), perform BLAST (Basic Local Alignment Search Tool) searches of NCBI and local databases, create multiple sequence alignments, build phylogenetic trees, assemble Operational Taxonomic Units, estimate biodiversity indices, and summarize data in a variety of formats. Furthermore, sequences may be annotated with user-specified meta-data, which then can be used to sort data and organize analyses and reports. A document-based architecture permits parallel analysis of sequence data from multiple clones or amplicons, with sequences and other data stored in a single file. XplorSeq should benefit researchers who are engaged in analyses of environmental sequence data, especially those with little experience using bioinformatics software. Although XplorSeq was developed for management of rDNA sequence data, it can be applied to almost any sequencing project. The application is available free of charge for non-commercial use at http://vent.colorado.edu/phyloware.

  7. Draft Genome Sequencing and Comparative Analysis of Aspergillus sojae NBRC4239

    Science.gov (United States)

    Sato, Atsushi; Oshima, Kenshiro; Noguchi, Hideki; Ogawa, Masahiro; Takahashi, Tadashi; Oguma, Tetsuya; Koyama, Yasuji; Itoh, Takehiko; Hattori, Masahira; Hanya, Yoshiki

    2011-01-01

    We conducted genome sequencing of the filamentous fungus Aspergillus sojae NBRC4239 isolated from the koji used to prepare Japanese soy sauce. We used the 454 pyrosequencing technology and investigated the genome with respect to enzymes and secondary metabolites in comparison with other Aspergilli sequenced. Assembly of 454 reads generated a non-redundant sequence of 39.5 Mb possessing 13 033 putative genes and 65 scaffolds composed of 557 contigs. Of the 2847 open reading frames with Pfam domain scores of >150 found in A. sojae NBRC4239, 81.7% had a high degree of similarity with the genes of A. oryzae. Comparative analysis identified serine carboxypeptidase and aspartic protease genes unique to A. sojae NBRC4239. While A. oryzae possessed three copies of the α-amylase gene, A. sojae NBRC4239 possessed only a single copy. Comparison of 56 gene clusters for secondary metabolites between A. sojae NBRC4239 and A. oryzae revealed that 24 clusters were conserved, whereas 32 clusters differed between them; the differences included a deletion of 18 508 bp containing mfs1, mao1, dmaT, and pks-nrps for cyclopiazonic acid (CPA) biosynthesis, explaining the lack of CPA production in A. sojae. The A. sojae NBRC4239 genome data will be useful to characterize functional features of the koji moulds used in Japanese industries. PMID:21659486

  8. Sequence embedding for fast construction of guide trees for multiple sequence alignment

    LENUS (Irish Health Repository)

    Blackshields, Gordon

    2010-05-14

    Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N^2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments. Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
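
    The embedding idea in the abstract above can be illustrated with a small sketch. This is not the published implementation: the k-mer Jaccard distance, the evenly spaced choice of seed sequences, and the names kmer_distance, embed and euclidean are all assumptions made for illustration. Each sequence is represented by its distances to a few seed sequences, so only N x n_seeds distances are computed instead of all N(N-1)/2 pairs.

```python
import math


def kmers(seq, k=3):
    """Return the set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Jaccard distance between k-mer sets (a cheap stand-in for an alignment distance)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def embed(sequences, n_seeds):
    """Map each sequence to a vector of distances to n_seeds seed sequences.

    Seeds are picked at evenly spaced positions (an illustrative assumption).
    """
    step = max(1, len(sequences) // n_seeds)
    seeds = sequences[::step][:n_seeds]
    return [[kmer_distance(s, seed) for seed in seeds] for s in sequences]


def euclidean(u, v):
    """Distance between two embedded vectors; this is cheap compared with sequence comparison."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical toy sequences: two similar pairs.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
vectors = embed(seqs, n_seeds=2)
```

    In this toy input the two ACGT-rich sequences end up closer to each other in the embedded space than to the TTTT/GGGG-rich ones, which is the property a guide-tree clustering step would rely on.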

  9. Dicty_cDB: SFF103 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SF (Link to library) SFF103 (Link to dictyBase) - - - Contig-U11967-1 SFF103Z (Link... to Original site) - - SFF103Z 655 - - - - Show SFF103 Library SF (Link to library) Clone ID SFF103 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U11967-1 Original site URL http://dict...nslated Amino Acid sequence ---rgsf*FLKIIITLLAYKIICTPNHMHLTRGNHETTDMNRFYGFQGEVVAKYSEMVFD LFSELFNWFPLAFVLDESF...*rkrlsnr**wfshhcflc skll*siw*swliykynxdkikittxklxtsexppmhsqk Frame C: ---rgsf*FLKIIITLLAYKIICT

  10. Dicty_cDB: CHF177 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available CH (Link to library) CHF177 (Link to dictyBase) - - - Contig-U11892-1 - (Link to Or...iginal site) - - CHF177Z 395 - - - - Show CHF177 Library CH (Link to library) Clone ID CHF177 (Link to dicty...Base) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U11892-1 Original site URL http://dictycdb.b...LLLWDVQGFPCXFAVEG GQCIDPSSLKVGGKYSFIAFSTCRXKFDNQKIHDCDWIIQGPTTPSXCANCGKICTSKCT TNYCDRDXQT Translated Amino A...XKFDNQKIHDCDWIIQGPTTPSXCANCGKICTSKCT TNYCDRDXQT Homology vs CSM-cDNA Score E Sequences producing significant

  11. Dicty_cDB: VFJ256 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFJ256 (Link to dictyBase) - - - Contig-U10140-1 VFJ256E (Link...) Clone ID VFJ256 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U10140-1 Ori...*nq*fylv*l*vx*KMNKLHLPIKENHHQXIKSIELIKNEFPEILICTDLCLC AYTDHGHCGVLTEEGFIENEKSIIRLG...iiikiih iiyivdmpiqlldhgkvimsf*nln*fiqfllqi*liqklklnpyqdnikfqvi**lnf* dhwlrkd*nq*fylv*l*vx*KMNKLHLPIKENHHQXIKSIELIKNEFPEILICT...pdate 2002.12.18 Homology vs DNA Score E Sequences producing significant alignments: (bits) Value N ( BJ432755 ) Dict

  12. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB590 (Link to dictyBase) - - - Contig-U09552-1 VFB590P (Link... to Original site) VFB590F 225 VFB590Z 118 VFB590P 343 - - Show VFB590 Library VF (Link to library) Clone ID VFB590 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U09552-1 Original site URL http://dict...ed Amino Acid sequence IAVVEGFMAPSELCQKIKFCSSSSSTNDFDFIGSSTTDCEICTFISGYAENFLEEXKT...--- ---riyqinkvvmxhxlhn*lxvaxivxlgxvnvkihvexix Frame C: IAVVEGFMAPSELCQKIKFCSSSSSTNDFDFIGSSTTDCEICT

  13. Dicty_cDB: AFA460 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available AF (Link to library) AFA460 (Link to dictyBase) - - - Contig-U15574-1 AFA460Z (Link... to Original site) - - AFA460Z 170 - - - - Show AFA460 Library AF (Link to library) Clone ID AFA460 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U15574-1 Original site URL http://dict...date 2001. 6. 2 Translated Amino Acid sequence ---QIHQTIQVVKITLSSASSSSSSSSSSILNKTRICTYINSNSTHSLXXNIYKYKLPK T...it* Frame B: ---QIHQTIQVVKITLSSASSSSSSSSSSILNKTRICTYINSNSTHSLXXNIYKYKLPK Frame C:

  14. Key roles for freshwater Actinobacteria revealed by deep metagenomic sequencing.

    Science.gov (United States)

    Ghai, Rohit; Mizuno, Carolina Megumi; Picazo, Antonio; Camacho, Antonio; Rodriguez-Valera, Francisco

    2014-12-01

    Freshwater ecosystems are critical but fragile environments directly affecting society and its welfare. However, our understanding of genuinely freshwater microbial communities, constrained by our capacity to manipulate their prokaryotic participants in axenic cultures, remains very rudimentary. Even the most abundant components, freshwater Actinobacteria, remain largely unknown. Here, applying deep metagenomic sequencing to the microbial community of a freshwater reservoir, we were able to circumvent this traditional bottleneck and reconstruct de novo seven distinct streamlined actinobacterial genomes. These genomes represent three new groups of photoheterotrophic, planktonic Actinobacteria. We describe for the first time genomes of two novel clades, acMicro (Micrococcineae, related to Luna2) and acAMD (Actinomycetales, related to acTH1). In addition, an aggregate of contigs belonged to a new branch of the Acidimicrobiales. All are estimated to have small genomes (approximately 1.2 Mb), and their GC content varied from 40 to 61%. One of the Micrococcineae genomes encodes a proteorhodopsin, a rhodopsin type reported for the first time in Actinobacteria. The remarkable potential capacity of some of these genomes to transform recalcitrant plant detrital material, particularly lignin-derived compounds, suggests close linkages between the terrestrial and aquatic realms. Moreover, abundances of Actinobacteria correlate inversely with those of Cyanobacteria that are responsible for prolonged and frequently irretrievable damage to freshwater ecosystems. This suggests that they might serve as sentinels of impending ecological catastrophes. © 2014 John Wiley & Sons Ltd.

  15. Identification and Mapping of Simple Sequence Repeat Markers from Common Bean (Phaseolus vulgaris L.) Bacterial Artificial Chromosome End Sequences for Genome Characterization and Genetic–Physical Map Integration

    Directory of Open Access Journals (Sweden)

    Juana M. Córdoba

    2010-11-01

    Full Text Available Microsatellite markers or simple sequence repeat (SSR) loci are useful for diversity characterization and genetic–physical mapping. Different in silico microsatellite search methods have been developed for mining bacterial artificial chromosome (BAC) end sequences for SSRs. The overall goal of this study was genome characterization based on SSRs in 89,017 BAC end sequences (BESs) from the G19833 common bean (Phaseolus vulgaris L.) library. Another objective was to identify new SSRs using three tandem motif identification programs (Automated Microsatellite Marker Development [AMMD], Tandem Repeats Finder [TRF], and SSRLocator [SSRL]). Among the microsatellite search engines, SSRL identified the highest number of SSRs; however, when primer design was attempted, the number dropped due to poor primer design regions. Automated Microsatellite Marker Development software identified many SSRs with valuable AT/TA or AG/TC motifs, while TRF found fewer SSRs and produced no primers. A subgroup of 323 AT-rich, di-, and trinucleotide SSRs were selected from the AMMD results and used in a parental survey with DOR364 and G19833, of which 75 could be mapped in the corresponding population; these represented 4052 BAC clones. Together with 92 previously mapped BES- and 114 non-BES-derived markers, a total of 280 SSRs were included in the polymerase chain reaction (PCR)-based map, integrating a total of 8232 BAC clones in 162 contigs from the physical map.

  16. TIMPs of parasitic helminths - a large-scale analysis of high-throughput sequence datasets.

    Science.gov (United States)

    Cantacessi, Cinzia; Hofmann, Andreas; Pickering, Darren; Navarro, Severine; Mitreva, Makedonka; Loukas, Alex

    2013-05-30

    Tissue inhibitors of metalloproteases (TIMPs) are a multifunctional family of proteins that orchestrate extracellular matrix turnover, tissue remodelling and other cellular processes. In parasitic helminths, such as hookworms, TIMPs have been proposed to play key roles in the host-parasite interplay, including invasion of and establishment in the vertebrate animal hosts. Currently, knowledge of helminth TIMPs is limited to a small number of studies on canine hookworms, whereas no information is available on the occurrence of TIMPs in other parasitic helminths causing neglected diseases. In the present study, we conducted a large-scale investigation of TIMP proteins of a range of neglected human parasites including the hookworm Necator americanus, the roundworm Ascaris suum, the liver flukes Clonorchis sinensis and Opisthorchis viverrini, as well as the schistosome blood flukes. This entailed mining available transcriptomic and/or genomic sequence datasets for the presence of homologues of known TIMPs, predicting secondary structures of defined protein sequences, systematic phylogenetic analyses and assessment of differential expression of genes encoding putative TIMPs in the developmental stages of A. suum, N. americanus and Schistosoma haematobium which infect the mammalian hosts. A total of 15 protein sequences with high homology to known eukaryotic TIMPs were predicted from the complement of sequence data available for parasitic helminths and subjected to in-depth bioinformatic analyses. Supported by the availability of gene manipulation technologies such as RNA interference and/or transgenesis, this work provides a basis for future functional explorations of helminth TIMPs and, in particular, of their role/s in fundamental biological pathways linked to long-term establishment in the vertebrate hosts, with a view towards the development of novel approaches for the control of neglected helminthiases.

  17. Draft genome sequence of Halorubrum tropicale strain V5, a novel halophilic archaeon isolated from the solar salterns of Cabo Rojo, Puerto Rico.

    Science.gov (United States)

    Sánchez-Nieves, Rubén; Facciotti, Marc T; Saavedra-Collado, Sofía; Dávila-Santiago, Lizbeth; Rodríguez-Carrero, Roy; Montalvo-Rodríguez, Rafael

    2016-03-01

    The genus Halorubrum is a member of the family Halobacteriaceae and currently has the highest number of described species (31) of all the haloarchaea. Here we report the draft genome sequence of strain V5, a new species within this genus that was isolated from the solar salterns of Cabo Rojo, Puerto Rico. Assembly rendered the genome into 17 contigs (N50 = 515,834 bp), the largest of which contains 1,031,026 bp. The genome is 3.57 Mb in length with a G + C content of 67.6%. In general, the genome includes 4 rRNAs, 52 tRNAs, and 3246 protein-coding sequences. The NCBI accession number for this genome is LIST00000000 and the strain deposit number is CECT9000.
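
    For readers unfamiliar with the N50 statistic quoted in this abstract, a minimal sketch of how it is commonly computed follows. The function name n50 and the example lengths are hypothetical and are not taken from the Halorubrum assembly. N50 is the length L such that contigs of length L or longer together account for at least half of the total assembled bases.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the
    assembly size.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp (total 300, half 150).
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))  # prints 80: 100 + 80 = 180 is the first running total >= 150
```

    A larger N50 for the same total length indicates that more of the assembly sits in long contigs, which is why the statistic is reported alongside the contig count.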

  18. Draft genome sequence of ramie, Boehmeria nivea (L.) Gaudich.

    Science.gov (United States)

    Luan, Ming-Bao; Jian, Jian-Bo; Chen, Ping; Chen, Jun-Hui; Chen, Jian-Hua; Gao, Qiang; Gao, Gang; Zhou, Ju-Hong; Chen, Kun-Mei; Guang, Xuan-Min; Chen, Ji-Kang; Zhang, Qian-Qian; Wang, Xiao-Fei; Fang, Long; Sun, Zhi-Min; Bai, Ming-Zhou; Fang, Xiao-Dong; Zhao, Shan-Cen; Xiong, He-Ping; Yu, Chun-Ming; Zhu, Ai-Guo

    2018-05-01

    Ramie, Boehmeria nivea (L.) Gaudich, family Urticaceae, is a plant native to eastern Asia, and one of the world's oldest fibre crops. It is also used as animal feed and for the phytoremediation of heavy metal-contaminated farmlands. Thus, the genome sequence of ramie was determined to explore the molecular basis of its fibre quality, protein content and phytoremediation. To further characterize the ramie genome, different paired-end and mate-pair libraries were combined to generate 134.31 Gb of raw DNA sequences using the Illumina whole-genome shotgun sequencing approach. The highly heterozygous B. nivea genome was assembled using the Platanus Genome Assembler, which is an effective tool for the assembly of highly heterozygous genome sequences. The final length of the draft genome of this species was approximately 341.9 Mb (contig N50 = 22.62 kb, scaffold N50 = 1,126.36 kb). Based on ramie genome annotations, 30,237 protein-coding genes were predicted, and the repetitive element content was 46.3%. The completeness of the final assembly was evaluated by benchmarking universal single-copy orthologous genes (BUSCO); 90.5% of the 1,440 expected embryophytic genes were identified as complete, and 4.9% were identified as fragmented. Phylogenetic analysis based on single-copy gene families and one-to-one orthologous genes placed ramie with mulberry and cannabis, within the clade of urticalean rosids. Genome information of ramie will be a valuable resource for the conservation of endangered Boehmeria species and for future studies on the biogeography and characteristic evolution of members of Urticaceae. © 2018 John Wiley & Sons Ltd.

  19. Expression sequence tag library derived from peripheral blood mononuclear cells of Chlorocebus sabaeus

    Directory of Open Access Journals (Sweden)

    Tchitchek Nicolas

    2012-06-01

    Full Text Available Background: African Green Monkeys (AGM) are amongst the most frequently used nonhuman primate models in clinical and biomedical research; nevertheless, only a few genomic resources exist for this species. Such information would be essential for the development of dedicated new generation technologies in fundamental and pre-clinical research using this model, and would deliver new insights into primate evolution. Results: We have exhaustively sequenced an Expression Sequence Tag (EST) library made from a pool of Peripheral Blood Mononuclear Cells from sixteen Chlorocebus sabaeus monkeys. Twelve of them were infected with the Simian Immunodeficiency Virus. The mononuclear cells were either left unstimulated or stimulated in vitro with Concanavalin A, with lipopolysaccharides, or through mixed lymphocyte reaction in order to generate a representative and broad library of expressed sequences in immune cells. We report here 37,787 sequences, which were assembled into 14,410 contigs representing an estimated 12% of the C. sabaeus transcriptome. Using data from primate genome databases, 9,029 assembled sequences from C. sabaeus could be annotated. Sequences have been systematically aligned with ten cDNA references of primate species including Homo sapiens, Pan troglodytes, and Macaca mulatta to identify ortholog transcripts. For 506 transcripts, sequences were quasi-complete. In addition, 6,576 transcript fragments are potentially specific to C. sabaeus or correspond to not yet described primate genes. Conclusions: The EST library we provide here will prove useful in gene annotation efforts for future sequencing of the African Green Monkey genomes. Furthermore, this library, which particularly well represents immunological and hematological gene expression, will be an important resource for the comparative analysis of gene expression in clinically relevant nonhuman primate and human research.

  20. A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio.

    Science.gov (United States)

    Leung, Henry C M; Yiu, S M; Yang, Bin; Peng, Yu; Wang, Yi; Liu, Zhihua; Chen, Jingchi; Qin, Junjie; Li, Ruiqiang; Chin, Francis Y L

    2011-06-01

    With the rapid development of next-generation sequencing techniques, metagenomics, also known as environmental genomics, has emerged as an exciting research area that enables us to analyze the microbial environment in which we live. An important step for metagenomic data analysis is the identification and taxonomic characterization of DNA fragments (reads or contigs) resulting from sequencing a sample of mixed species. This step is referred to as 'binning'. Binning algorithms that are based on sequence similarity and sequence composition markers rely heavily on the reference genomes of known microorganisms or phylogenetic markers. Due to the limited availability of reference genomes and the bias and low availability of markers, these algorithms may not be applicable in all cases. Unsupervised binning algorithms which can handle fragments from unknown species provide an alternative approach. However, existing unsupervised binning algorithms only work on datasets either with balanced species abundance ratios or rather different abundance ratios, but not both. In this article, we present MetaCluster 3.0, an integrated binning method based on the unsupervised top-down separation and bottom-up merging strategy, which can bin metagenomic fragments of species with very balanced abundance ratios (say 1:1) to very different abundance ratios (e.g. 1:24) with consistently higher accuracy than existing methods. MetaCluster 3.0 can be downloaded at http://i.cs.hku.hk/~alse/MetaCluster/.

  1. Design of Long Period Pseudo-Random Sequences from the Addition of -Sequences over

    Directory of Open Access Journals (Sweden)

    Ren Jian

    2004-01-01

    Full Text Available Pseudo-random sequence with good correlation property and large linear span is widely used in code division multiple access (CDMA communication systems and cryptology for reliable and secure information transmission. In this paper, sequences with long period, large complexity, balance statistics, and low cross-correlation property are constructed from the addition of -sequences with pairwise-prime linear spans (AMPLS. Using -sequences as building blocks, the proposed method proved to be an efficient and flexible approach to construct long period pseudo-random sequences with desirable properties from short period sequences. Applying the proposed method to , a signal set is constructed.

  2. Salmonella enterica Prophage Sequence Profiles Reflect Genome Diversity and Can Be Used for High Discrimination Subtyping

    Directory of Open Access Journals (Sweden)

    Walid Mottawea

    2018-05-01

    Full Text Available Non-typhoidal Salmonella is a leading cause of foodborne illness worldwide. Prompt and accurate identification of the sources of Salmonella responsible for disease outbreaks is crucial to minimize infections and eliminate ongoing sources of contamination. Current subtyping tools including single nucleotide polymorphism (SNP) typing may be inadequate, in some instances, to provide the required discrimination among epidemiologically unrelated Salmonella strains. Prophage genes represent the majority of the accessory genes in bacterial genomes and have potential to be used as high discrimination markers in Salmonella. In this study, the prophage sequence diversity in different Salmonella serovars and genetically related strains was investigated. Using whole genome sequences of 1,760 isolates of S. enterica representing 151 Salmonella serovars and 66 closely related bacteria, prophage sequences were identified from assembled contigs using PHASTER. We detected 154 different prophages in S. enterica genomes. Prophage sequences were highly variable among S. enterica serovars with a median ± interquartile range (IQR) of 5 ± 3 prophage regions per genome. While some prophage sequences were highly conserved among the strains of specific serovars, few regions were lineage specific. Therefore, strains belonging to each serovar could be clustered separately based on their prophage content. Analysis of S. Enteritidis isolates from seven outbreaks generated distinct prophage profiles for each outbreak. Taken altogether, the diversity of the prophage sequences correlates with genome diversity. Prophage repertoires provide an additional marker for differentiating S. enterica subtypes during foodborne outbreaks.

  3. Dicty_cDB: SLJ344 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SL (Link to library) SLJ344 (Link to dictyBase) - - - Contig-U16255-1 SLJ344P (Link... to Original site) SLJ344F 253 SLJ344Z 273 SLJ344P 526 - - Show SLJ344 Library SL (Link to library) Clone ID SLJ344 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16255-1 Original site URL http://dict...Amino Acid sequence GTSGGTPGSCDKVNCPNGYICTIVNQLAVCVSPSSSSSSSSSTTGSHTTTGGSTTGSHTT TGGSTTGSHTTTGGSTTGSHTTTG---...li tilffniqrlykkkkkkkkkknkp*tklkin*kk Frame B: GTSGGTPGSCDKVNCPNGYICTIVNQLAVCVSPSSSSSSSSSTTGSHTTTGGSTTGSHTT

  4. Dicty_cDB: SHD573 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SH (Link to library) SHD573 (Link to dictyBase) - - - Contig-U11503-1 SHD573E (Link...Clone ID SHD573 (Link to dictyBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U11503-1 Original site URL http://dict...LFAIFLKIVFVVSAPLCPNSTILLNYNILTVYNSSEGCGFNN EPICTSLKDAVSRAFLLISNNSRVCIGIIGNINVTSEQITLGNYCGALWITSENINNENN NYTI...ststtts ax***d*eyyhcysyfgldl Frame C: fsl*iy*YMIRKSNNFSILFAIFLKIVFVVSAPLCPNSTILLNYNILTVYNSSEGCGFNN EPICTSLKD...vegicus clone CH230-428C17, WORKING DRAFT SEQUENCE, 3 unordered pieces. 48 3e-12 3 AC116984 |AC116984.2 Dict

  5. The testis and ovary transcriptomes of the rock bream (Oplegnathus fasciatus): A bony fish with a unique neo Y chromosome

    Directory of Open Access Journals (Sweden)

    Dongdong Xu

    2016-03-01

    Full Text Available The rock bream (Oplegnathus fasciatus) is considered one of the most economically important marine fish in East Asia and has a unique neo-Y chromosome system that makes it a good model for studying sex determination and differentiation in fish. In the present study, we used Illumina sequencing technology (HiSeq2000) to sequence, assemble and annotate the transcriptome of the testis and ovary tissues of rock bream. A total of 40,004,378 (NCBI SRA database SRX1406649) and 53,108,992 (NCBI SRA database SRX1406648) high quality reads were obtained from testis and ovary RNA sequencing, respectively, and 60,421 contigs (with average length of 1301 bp) were obtained after de novo assembly with Trinity software. Digital gene expression analysis reveals 14,036 contigs that show a gender-enriched expression profile, either testis-enriched (237 contigs) or ovary-enriched (581 contigs) with RPKM >100. There are 237 male- and 582 female-abundant expressed genes that show sex dimorphic expression. We hope that the gonad transcriptome and those gender-enriched transcripts of rock bream can provide some insight into the genome-wide transcriptome profile of teleost gonad tissue and give useful information on fish gonad development. Keywords: Gonad transcriptome, Testis, Ovary, Rock bream

  6. Comparative Transcriptome Analysis Identifies Candidate Genes Related to Skin Color Differentiation in Red Tilapia.

    Science.gov (United States)

    Zhu, Wenbin; Wang, Lanmei; Dong, Zaijie; Chen, Xingting; Song, Feibiao; Liu, Nian; Yang, Hui; Fu, Jianjun

    2016-08-11

    Red tilapia has become more popular for aquaculture production in China in recent years. However, pigmentation differentiation during genetic breeding is the main problem limiting the development of commercial red tilapia culture, and the genetic basis of skin color variation is still unknown. In this study, we conducted Illumina transcriptome sequencing of three color varieties of red tilapia. A total of 224,895,758 reads were generated, resulting in 160,762 assembled contigs that were used as reference contigs. The contigs of the red tilapia transcriptome had hits in the range of 53.4% to 86.7% of the unique proteins of zebrafish, fugu, medaka, three-spined stickleback and tilapia. In addition, 44,723 contigs containing 77,423 simple sequence repeats (SSRs) were identified, with 16,646 contigs containing more than one SSR. Three skin transcriptomes were compared pairwise and the results revealed 148 common significantly differentially expressed unigenes, including several key genes related to pigment synthesis, i.e. tyr, tyrp1, silv, sox10, slc24a5, cbs and slc7a11. The results will facilitate understanding of the molecular mechanisms of skin pigmentation differentiation in red tilapia and accelerate the molecular selection of specific strains with consistent skin colors.
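
    As an illustration of what identifying SSRs in assembled contigs involves, the sketch below finds perfect di- and trinucleotide tandem repeats with a regular expression. It is not the tool used in the study above, which the abstract does not name; the function name find_ssrs, the minimum of four repeat units, and the restriction to perfect repeats are assumptions chosen for brevity.

```python
import re


def find_ssrs(seq, min_repeats=4, motif_lengths=(2, 3)):
    """Return (start, motif, repeat_count) for perfect tandem repeats in seq.

    min_repeats and motif_lengths are illustrative thresholds, not values
    taken from any published SSR pipeline.
    """
    hits = []
    for k in motif_lengths:
        # A k-base motif followed by at least (min_repeats - 1) more copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if len(set(motif)) == 1:
                continue  # skip homopolymer runs such as AAAAAAAA
            hits.append((match.start(), motif, len(match.group(0)) // k))
    return hits


# Hypothetical contig fragment containing five copies of AT.
contig = "GGC" + "AT" * 5 + "CCG"
print(find_ssrs(contig))  # prints [(3, 'AT', 5)]
```

    Real SSR finders typically also handle longer motifs, imperfect repeats and compound repeats, so counts from a sketch like this would not be comparable to the figures reported in the abstract.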

  7. Genome sequence of Aspergillus luchuensis NBRC 4314

    Science.gov (United States)

    Yamada, Osamu; Machida, Masayuki; Hosoyama, Akira; Goto, Masatoshi; Takahashi, Toru; Futagami, Taiki; Yamagata, Youhei; Takeuchi, Michio; Kobayashi, Tetsuo; Koike, Hideaki; Abe, Keietsu; Asai, Kiyoshi; Arita, Masanori; Fujita, Nobuyuki; Fukuda, Kazuro; Higa, Ken-ichi; Horikawa, Hiroshi; Ishikawa, Takeaki; Jinno, Koji; Kato, Yumiko; Kirimura, Kohtaro; Mizutani, Osamu; Nakasone, Kaoru; Sano, Motoaki; Shiraishi, Yohei; Tsukahara, Masatoshi; Gomi, Katsuya

    2016-01-01

    Awamori is a traditional distilled beverage made from steamed Thai-Indica rice in Okinawa, Japan. Two microbes are involved in brewing the liquor: the local kuro (black) koji mold Aspergillus luchuensis and the awamori yeast Saccharomyces cerevisiae. While yeasts are used for ethanol fermentation throughout the world, a characteristic of Japanese fermentation industries is the use of Aspergillus molds as a source of enzymes for the maceration and saccharification of raw materials. Here we report the draft genome of a kuro (black) koji mold, A. luchuensis NBRC 4314 (RIB 2604). The total length of nonredundant sequences was nearly 34.7 Mb, comprising approximately 2,300 contigs with 16 telomere-like sequences. In total, 11,691 genes were predicted to encode proteins. Most of the housekeeping genes, such as transcription factors and the N- and O-glycosylation systems, were conserved with respect to Aspergillus niger and Aspergillus oryzae. An alternative oxidase and an acid-stable α-amylase related to citric acid production and fermentation at a low pH, as well as a unique glutamic peptidase, were also found in the genome. Furthermore, key biosynthetic gene clusters of ochratoxin A and fumonisin B were absent when compared with the A. niger genome, showing the safety of A. luchuensis for food and beverage production. This genome information will facilitate not only comparative genomics with industrial kuro-koji molds, but also molecular breeding of the molds to improve awamori fermentation. PMID:27651094

  8. Sequencing and analysis of the Mediterranean amphioxus (Branchiostoma lanceolatum) transcriptome.

    Directory of Open Access Journals (Sweden)

    Silvan Oulion

    Full Text Available BACKGROUND: The basally divergent phylogenetic position of amphioxus (Cephalochordata), as well as its conserved morphology, development and genetics, makes it the best proxy for the chordate ancestor. Particularly, studies using the amphioxus model help our understanding of vertebrate evolution and development. Thus, interest in the amphioxus model led to the characterization of both the transcriptome and complete genome sequence of the American species, Branchiostoma floridae. However, recent technical improvements allowing induction of spawning in the laboratory during the breeding season on a daily basis with the Mediterranean species Branchiostoma lanceolatum have encouraged European Evo-Devo researchers to adopt this species as a model even though no genomic or transcriptomic data have been available. To fill this need we used the pyrosequencing method to characterize the B. lanceolatum transcriptome and then compared our results with the published transcriptome of B. floridae. RESULTS: Starting with total RNA from nine different developmental stages of B. lanceolatum, a normalized cDNA library was constructed and sequenced on Roche GS FLX (Titanium mode). Around 1.4 million reads were produced and assembled into 70,530 contigs (average length of 490 bp). Overall 37% of the assembled sequences were annotated by BlastX and their Gene Ontology terms were determined. These results were then compared to genomic and transcriptomic data of B. floridae to assess similarities and specificities of each species. CONCLUSION: We obtained a high-quality amphioxus (B. lanceolatum) reference transcriptome using a high throughput sequencing approach. We found that 83% of the predicted genes in the B. floridae complete genome sequence are also found in the B. lanceolatum transcriptome, while only 41% were found in the B. floridae transcriptome obtained with traditional Sanger based sequencing. Therefore, given the high degree of sequence conservation

  9. Genome Sequence, Assembly and Characterization of Two Metschnikowia fructicola Strains Used as Biocontrol Agents of Postharvest Diseases

    Directory of Open Access Journals (Sweden)

    Edoardo Piombo

    2018-04-01

    Full Text Available The yeast Metschnikowia fructicola was reported as an efficient biological control agent of postharvest diseases of fruits and vegetables, and it is the basis of the commercial formulated product “Shemer.” Several mechanisms of action by which M. fructicola inhibits postharvest pathogens were suggested, including iron-binding compounds, induction of defense signaling genes, production of fungal cell wall degrading enzymes and relatively high amounts of superoxide anions. We assembled the whole genome sequence of two strains of M. fructicola using PacBio and Illumina shotgun sequencing technologies. Using PacBio, a high-quality draft genome consisting of 93 contigs, with an estimated genome size of approximately 26 Mb, was obtained. Comparative analysis of M. fructicola proteins with the other three available closely related genomes revealed a shared core of homologous proteins coded by 5,776 genes. Comparing the genomes of the two M. fructicola strains using a SNP calling approach resulted in the identification of 564,302 homologous SNPs with 2,004 predicted high impact mutations. The size of the genome is exceptionally high when compared with those of available closely related organisms, and the high rate of homology among M. fructicola genes points toward a recent whole-genome duplication event as the cause of this large genome. Based on the assembled genome, sequences were annotated with a gene description and gene ontology (GO) term and clustered in functional groups. Analysis of CAZymes family genes revealed 1,145 putative genes, and transcriptomic analysis of CAZyme expression levels in M. fructicola during its interaction with either grapefruit peel tissue or Penicillium digitatum revealed a high level of CAZyme gene expression when the yeast was placed in wounded fruit tissue.

  10. Non-contiguous finished genome sequence and contextual data of the filamentous soil bacterium Ktedonobacter racemifer type strain (SOSP1-21).

    Science.gov (United States)

    Chang, Yun-Juan; Land, Miriam; Hauser, Loren; Chertkov, Olga; Del Rio, Tijana Glavina; Nolan, Matt; Copeland, Alex; Tice, Hope; Cheng, Jan-Fang; Lucas, Susan; Han, Cliff; Goodwin, Lynne; Pitluck, Sam; Ivanova, Natalia; Ovchinikova, Galina; Pati, Amrita; Chen, Amy; Palaniappan, Krishna; Mavromatis, Konstantinos; Liolios, Konstantinos; Brettin, Thomas; Fiebig, Anne; Rohde, Manfred; Abt, Birte; Göker, Markus; Detter, John C; Woyke, Tanja; Bristow, James; Eisen, Jonathan A; Markowitz, Victor; Hugenholtz, Philip; Kyrpides, Nikos C; Klenk, Hans-Peter; Lapidus, Alla

    2011-10-15

    Ktedonobacter racemifer corrig. Cavaletti et al. 2007 is the type species of the genus Ktedonobacter, which in turn is the type genus of the family Ktedonobacteraceae, the type family of the order Ktedonobacterales within the class Ktedonobacteria in the phylum 'Chloroflexi'. Although K. racemifer shares some morphological features with the actinobacteria, it is of special interest because it was the first cultivated representative of a deep branching unclassified lineage of otherwise uncultivated environmental phylotypes tentatively located within the phylum 'Chloroflexi'. The aerobic, filamentous, non-motile, spore-forming Gram-positive heterotroph was isolated from soil in Italy. The 13,661,586 bp long non-contiguous finished genome consists of ten contigs and is the first reported genome sequence from a member of the class Ktedonobacteria. With its 11,453 protein-coding and 87 RNA genes, it is the largest prokaryotic genome reported so far. It comprises a large number of over-represented COGs, particularly genes associated with transposons, causing the genetic redundancy within the genome to be considerably larger than expected by chance. This work is a part of the Genomic Encyclopedia of Bacteria and Archaea project.

  11. Evaluation of next generation sequencing for the analysis of Eimeria communities in wildlife.

    Science.gov (United States)

    Vermeulen, Elke T; Lott, Matthew J; Eldridge, Mark D B; Power, Michelle L

    2016-05-01

    Next-generation sequencing (NGS) techniques are well-established for studying bacterial communities but not yet for microbial eukaryotes. Parasite communities remain poorly studied, due in part to the lack of reliable and accessible molecular methods to analyse eukaryotic communities. We aimed to develop and evaluate a methodology to analyse communities of the protozoan parasite Eimeria from populations of the Australian marsupial Petrogale penicillata (brush-tailed rock-wallaby) using NGS. An oocyst purification method for small sample sizes and polymerase chain reaction (PCR) protocol for the 18S rRNA locus targeting Eimeria was developed and optimised prior to sequencing on the Illumina MiSeq platform. A data analysis approach was developed by modifying methods from bacterial metagenomics and utilising existing Eimeria sequences in GenBank. Operational taxonomic unit (OTU) assignment at a high similarity threshold (97%) was more accurate at assigning Eimeria contigs into Eimeria OTUs but at a lower threshold (95%) there was greater resolution between OTU consensus sequences. The assessment of two amplification PCR methods prior to Illumina MiSeq, single and nested PCR, determined that single PCR was more sensitive to Eimeria as more Eimeria OTUs were detected in single amplicons. We have developed a simple and cost-effective approach to a data analysis pipeline for community analysis of eukaryotic organisms using Eimeria communities as a model. The pipeline provides a basis for evaluation using other eukaryotic organisms and potential for diverse community analysis studies. Copyright © 2016 Elsevier B.V. All rights reserved.

  12. Transcriptome sequencing and De Novo analysis of Youngia japonica using the illumina platform.

    Directory of Open Access Journals (Sweden)

    Yulan Peng

    Full Text Available Youngia japonica, a weed species distributed worldwide, has been widely used in traditional Chinese medicine. It is an ideal plant for studying the evolution of Asteraceae plants because of its short life history and abundant source. However, little is known about its evolution and genetic diversity. In this study, de novo transcriptome sequencing was conducted for the first time for the comprehensive analysis of the genetic diversity of Y. japonica. The Y. japonica transcriptome was sequenced using Illumina paired-end sequencing technology. We produced 21,847,909 high-quality reads for Y. japonica and assembled them into contigs. A total of 51,850 unigenes were identified, among which 46,087 were annotated in the NCBI non-redundant protein database and 41,752 were annotated in the Swiss-Prot database. We mapped 9,125 unigenes onto 163 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database. In addition, 3,648 simple sequence repeats (SSRs) were detected. Our data provide the most comprehensive transcriptome resource currently available for Y. japonica. C4 photosynthesis unigenes were found in the biological process of Y. japonica. There were 5,596 unigenes related to defense response and 1,344 unigenes related to signal transduction mechanisms (10.95%). These data provide insights into the genetic diversity of Y. japonica. Numerous SSRs contributed to the development of novel markers. These data may serve as a new valuable resource for genomic studies on Youngia and, more generally, Cichoraceae.
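
    SSR detection of the kind mentioned in this record is often done by scanning sequences for short motifs repeated in tandem. The following is a minimal sketch only: the record does not say which tool or thresholds were used, so the function name `find_ssrs`, the motif length range (2 to 6 bp) and the minimum of 4 copies are all hypothetical choices for illustration.

```python
import re


def find_ssrs(seq, min_repeats=4, min_motif=2, max_motif=6):
    """Return (start, motif, copies) for each perfect tandem repeat in seq.

    Thresholds are illustrative assumptions, not values from the study.
    """
    seq = seq.upper()
    # Lazy motif capture followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        # Skip homopolymer runs such as AAAAAAAA reported as motif "AA".
        if len(set(motif)) == 1:
            continue
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


if __name__ == "__main__":
    for start, motif, copies in find_ssrs("TTGACACACACACGGT"):
        print(start, motif, copies)
```

    Only perfect repeats are reported; imperfect or compound SSRs would need a more permissive approach than a single regular expression.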

  13. Transcriptome sequencing of the Antarctic vascular plant Deschampsia antarctica Desv. under abiotic stress.

    Science.gov (United States)

    Lee, Jungeun; Noh, Eun Kyeung; Choi, Hyung-Seok; Shin, Seung Chul; Park, Hyun; Lee, Hyoungseok

    2013-03-01

    Antarctic hairgrass (Deschampsia antarctica Desv.) is the only natural grass species in the maritime Antarctic. It has been studied as an extremophile that has successfully adapted to marginal land with the harshest environment for terrestrial plants. However, limited genetic research has focused on this species due to the lack of genomic resources. Here, we present the first de novo assembly of its transcriptome by massive parallel sequencing and its expression profile using D. antarctica grown under various stress conditions. Total sequence reads generated by pyrosequencing were assembled into 60,765 unigenes (28,177 contigs and 32,588 singletons). A total of 29,173 unique protein-coding genes were identified based on sequence similarities to known proteins. The combined results from all three stress conditions indicated differential expression of 3,110 genes. Quantitative reverse transcription polymerase chain reaction showed that several well-known stress-responsive genes encoding late embryogenesis abundant protein, dehydrin 1, and ice recrystallization inhibition protein were induced dramatically and that genes encoding U-box-domain-containing protein, electron transfer flavoprotein-ubiquinone, and F-box-containing protein were induced by abiotic stressors in a manner conserved with other plant species. We identified more than 2,000 simple sequence repeats that can be developed as functional molecular markers. This dataset is the most comprehensive transcriptome resource currently available for D. antarctica and is therefore expected to be an important foundation for future genetic studies of grasses and extremophiles.

  14. Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library

    Directory of Open Access Journals (Sweden)

    Salem Mohamed

    2009-11-01

    Full Text Available Abstract Background To enhance capabilities for genomic analyses in rainbow trout, such as genomic selection, a large suite of polymorphic markers that are amenable to high-throughput genotyping protocols must be identified. Expressed Sequence Tags (ESTs) have been used for single nucleotide polymorphism (SNP) discovery in salmonids. In those strategies, the salmonid semi-tetraploid genomes often led to assemblies of paralogous sequences and therefore resulted in a high rate of false positive SNP identification. Sequencing genomic DNA using primers identified from ESTs proved to be an effective but time consuming methodology of SNP identification in rainbow trout, therefore not suitable for high throughput SNP discovery. In this study, we employed a high-throughput strategy that used pyrosequencing technology to generate data from a reduced representation library constructed with genomic DNA pooled from 96 unrelated rainbow trout that represent the National Center for Cool and Cold Water Aquaculture (NCCCWA) broodstock population. Results The reduced representation library consisted of 440 bp fragments resulting from complete digestion with the restriction enzyme HaeIII; sequencing produced 2,000,000 reads providing an average 6 fold coverage of the estimated 150,000 unique genomic restriction fragments (300,000 fragment ends). Three independent data analyses identified 22,022 to 47,128 putative SNPs on 13,140 to 24,627 independent contigs. A set of 384 putative SNPs, randomly selected from the sets produced by the three analyses were genotyped on individual fish to determine the validation rate of putative SNPs among analyses, distinguish apparent SNPs that actually represent paralogous loci in the tetraploid genome, examine Mendelian segregation, and place the validated SNPs on the rainbow trout linkage map. Approximately 48% (183) of the putative SNPs were validated; 167 markers were successfully incorporated into the rainbow trout linkage map. In

  15. Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library.

    Science.gov (United States)

    Sánchez, Cecilia Castaño; Smith, Timothy P L; Wiedmann, Ralph T; Vallejo, Roger L; Salem, Mohamed; Yao, Jianbo; Rexroad, Caird E

    2009-11-25

    To enhance capabilities for genomic analyses in rainbow trout, such as genomic selection, a large suite of polymorphic markers that are amenable to high-throughput genotyping protocols must be identified. Expressed Sequence Tags (ESTs) have been used for single nucleotide polymorphism (SNP) discovery in salmonids. In those strategies, the salmonid semi-tetraploid genomes often led to assemblies of paralogous sequences and therefore resulted in a high rate of false positive SNP identification. Sequencing genomic DNA using primers identified from ESTs proved to be an effective but time consuming methodology of SNP identification in rainbow trout, therefore not suitable for high throughput SNP discovery. In this study, we employed a high-throughput strategy that used pyrosequencing technology to generate data from a reduced representation library constructed with genomic DNA pooled from 96 unrelated rainbow trout that represent the National Center for Cool and Cold Water Aquaculture (NCCCWA) broodstock population. The reduced representation library consisted of 440 bp fragments resulting from complete digestion with the restriction enzyme HaeIII; sequencing produced 2,000,000 reads providing an average 6 fold coverage of the estimated 150,000 unique genomic restriction fragments (300,000 fragment ends). Three independent data analyses identified 22,022 to 47,128 putative SNPs on 13,140 to 24,627 independent contigs. A set of 384 putative SNPs, randomly selected from the sets produced by the three analyses were genotyped on individual fish to determine the validation rate of putative SNPs among analyses, distinguish apparent SNPs that actually represent paralogous loci in the tetraploid genome, examine Mendelian segregation, and place the validated SNPs on the rainbow trout linkage map. Approximately 48% (183) of the putative SNPs were validated; 167 markers were successfully incorporated into the rainbow trout linkage map. In addition, 2% of the sequences from the

  16. Next generation DNA sequencing technology delivers valuable genetic markers for the genomic orphan legume species, Bituminaria bituminosa

    Directory of Open Access Journals (Sweden)

    Pazos-Navarro María

    2011-12-01

    Full Text Available Abstract Background Bituminaria bituminosa is a perennial legume species from the Canary Islands and Mediterranean region that has potential as a drought-tolerant pasture species and as a source of pharmaceutical compounds. Three botanical varieties have previously been identified in this species: albomarginata, bituminosa and crassiuscula. B. bituminosa can be considered a genomic 'orphan' species with very few genomic resources available. New DNA sequencing technologies provide an opportunity to develop high quality molecular markers for such orphan species. Results 432,306 mRNA molecules were sampled from a leaf transcriptome of a single B. bituminosa plant using Roche 454 pyrosequencing, resulting in an average read length of 345 bp (149.1 Mbp in total). Sequences were assembled into 3,838 isotigs/contigs representing putatively unique gene transcripts. Gene ontology descriptors were identified for 3,419 sequences. Raw sequence reads containing simple sequence repeat (SSR) motifs were identified, and 240 primer pairs flanking these motifs were designed. Of 87 primer pairs developed this way, 75 (86.2%) successfully amplified primarily single fragments by PCR. Fragment analysis using 20 primer pairs in 79 accessions of B. bituminosa detected 130 alleles at 21 SSR loci. Genetic diversity analyses confirmed that variation at these SSR loci accurately reflected known taxonomic relationships in original collections of B. bituminosa and provided additional evidence that a division of the botanical variety bituminosa into two according to geographical origin (Mediterranean region and Canary Islands) may be appropriate. Evidence of cross-pollination was also found between botanical varieties within a B. bituminosa breeding programme. Conclusions B. bituminosa can no longer be considered a genomic orphan species, having now a large (albeit incomplete) repertoire of expressed gene sequences that can serve as a resource for future genetic studies. This

  17. Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx

    Directory of Open Access Journals (Sweden)

    Colbourne John K

    2009-05-01

    Full Text Available Abstract Background New methods are needed for genomic-scale analysis of emerging model organisms that exemplify important biological questions but lack fully sequenced genomes. For example, there is an urgent need to understand the potential for corals to adapt to climate change, but few molecular resources are available for studying these processes in reef-building corals. To facilitate genomics studies in corals and other non-model systems, we describe methods for transcriptome sequencing using 454, as well as strategies for assembling a useful catalog of genes from the output. We have applied these methods to sequence the transcriptome of planulae larvae from the coral Acropora millepora. Results More than 600,000 reads produced in a single 454 sequencing run were assembled into ~40,000 contigs with five-fold average sequencing coverage. Based on sequence similarity with known proteins, these analyses identified ~11,000 different genes expressed in a range of conditions including thermal stress and settlement induction. Assembled sequences were annotated with gene names, conserved domains, and Gene Ontology terms. Targeted searches using these annotations identified the majority of genes associated with essential metabolic pathways and conserved signaling pathways, as well as novel candidate genes for stress-related processes. Comparisons with the genome of the anemone Nematostella vectensis revealed ~8,500 pairs of orthologs and ~100 candidate coral-specific genes. More than 30,000 SNPs were detected in the coral sequences, and a subset of these validated by re-sequencing. Conclusion The methods described here for deep sequencing of the transcriptome should be widely applicable to generate catalogs of genes and genetic markers in emerging model organisms. Our data provide the most comprehensive sequence resource currently available for reef-building corals, and include an extensive collection of potential genetic markers for association and

  18. How conserved are the conserved 16S-rRNA regions?

    Directory of Open Access Journals (Sweden)

    Marcel Martinez-Porchas

    2017-02-01

    Full Text Available The 16S rRNA gene has been used as a master key for studying prokaryotic diversity in almost every environment. Despite the claim of several researchers to have the best universal primers, the reality is that no primer has been demonstrated to be truly universal. This suggests that conserved regions of the gene may not be as conserved as expected. The aim of this study was to evaluate the degree of conservation of the so-called conserved regions flanking the hypervariable regions of the 16S rRNA gene. Data contained in the SILVA database (release 123) were used for the study. Primers reported as matches of each conserved region were assembled to form contigs; sequences of 12 nucleotides (12-mers) were extracted from these contigs and searched against the entire set of SILVA sequences. Frequency analysis showed that the extreme regions, 1 and 10, registered the lowest frequencies. 12-mer frequencies revealed segments of contigs that were not as conserved as expected (≤90%). Fragments corresponding to the primer contigs 3, 4, 5b and 6a were recovered from all sequences in the SILVA database. Nucleotide frequency analysis in each consensus demonstrated that only a small fraction of these so-called conserved regions is truly conserved in non-redundant sequences. It could be concluded that conserved regions of the 16S rRNA gene exhibit considerable variation that has to be considered when using this gene as a biomarker.
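
    The 12-mer frequency analysis described in this record can be illustrated with a small sketch: slide a window of length k along a primer contig and, for each k-mer, count the fraction of database sequences that contain it exactly. This is an assumption-laden illustration, not the authors' pipeline; the function name `kmer_conservation`, the exact-match criterion and the toy sequences are hypothetical.

```python
def kmer_conservation(contig, sequences, k=12):
    """For each k-mer of contig, return (offset, kmer, fraction of sequences containing it).

    Exact substring matching is an assumption made for this sketch.
    """
    if not sequences:
        raise ValueError("sequences must not be empty")
    results = []
    for offset in range(len(contig) - k + 1):
        kmer = contig[offset:offset + k]
        hits = sum(1 for seq in sequences if kmer in seq)
        results.append((offset, kmer, hits / len(sequences)))
    return results


if __name__ == "__main__":
    # Toy data with k=4 so the output is easy to follow by eye.
    toy_db = ["TTACGTACTT", "ACGTTTTT", "GGGG"]
    for offset, kmer, fraction in kmer_conservation("ACGTAC", toy_db, k=4):
        flag = "" if fraction > 0.9 else "  <- at or below the 90% level"
        print(offset, kmer, round(fraction, 2), flag)
```

    For a database of SILVA's size, a linear substring scan per k-mer would be slow; an index of all k-mers per sequence would be the more practical design.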

  19. Dicty_cDB: SFE109 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SF (Link to library) SFE109 (Link to dictyBase) - - - Contig-U10771-1 SFE109P (Link... to Original site) SFE109F 197 SFE109Z 630 SFE109P 827 - - Show SFE109 Library SF (Link to library) Clone ID SFE109 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U10771-1 Original site URL http://dict...anslated Amino Acid sequence elnykfhftlnttqiei*inknfllflf*fffyfqi*iqlcfilkllllptiink*INKI FLKKK--- ---VTTSQCESLIQAGVDGLRVGMGVGSICT...fkfnsvsy*nyyyyqpslink*ik ff*kk--- ---VTTSQCESLIQAGVDGLRVGMGVGSICTTQEVMACGRPQATAVFKCALYSSQYNVPI IADGGIRTIGHII

  20. Dicty_cDB: SFL482 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SF (Link to library) SFL482 (Link to dictyBase) - - - Contig-U15494-1 SFL482P (Link... to Original site) SFL482F 434 SFL482Z 394 SFL482P 828 - - Show SFL482 Library SF (Link to library) Clone ID SFL482 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U15494-1 Original site URL http://dict...cftclww*qmcqmcp*qnrscfln*rtknr*nclqeatkrfkt ker*EIIQINVVININHFLLIKKK--- ---ICTHIEKMVQRLTYRRRLSYRTTSNATKIVKTP...KQQK DLKQKKDKKSSK*m Translated Amino Acid sequence (All Frames) Frame A: icthiekmvqrltyrrrlsyrttsnatkivktpgg

  1. Dicty_cDB: SLD420 [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available SL (Link to library) SLD420 (Link to dictyBase) - - - Contig-U16325-1 SLD420E (Link... to Original site) - - - - - - SLD420E 434 Show SLD420 Library SL (Link to library) Clone ID SLD420 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U16325-1 Original site URL http://dict... 7 Homology vs DNA Score E Sequences producing significant alignments: (bits) Value N ( AF066071 ) Dict...yostelium discoideum SP85 (pspB) gene, comple... 860 0.0 1 ( AC117075 ) Dictyostelium

  2. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB585 (Link to dictyBase) - - - Contig-U09875-1 VFB585Z (Link... to Original site) - - VFB585Z 664 - - - - Show VFB585 Library VF (Link to library) Clone ID VFB585 (Link to dict...yBase) Atlas ID - NBRP ID - dictyBase ID - Link to Contig Contig-U09875-1 Original site URL http://dict...Score E Sequences producing significant alignments: (bits) Value N AC116551 |AC116551.2 Dictyostelium discoi...ces producing significant alignments: (bits) Value AC116551_43( AC116551 |pid:none) Dictyostelium discoideum

  3. Dicty_cDB: [Dicty_cDB

    Lifescience Database Archive (English)

    Full Text Available VF (Link to library) VFB668 (Link to dictyBase) - G00394 DDB0168247 Contig-U09555-1...brary) Clone ID VFB668 (Link to dictyBase) Atlas ID - NBRP ID G00394 dictyBase ID DDB0168247 Link to Contig ...Contig-U09555-1 Original site URL http://dictycdb.biol.tsukuba.ac.jp/CSM/VF/VFB6-...uences producing significant alignments: (bits) Value N AC117072 |AC117072.2 Dictyostelium discoideum chromo...tein Score E Sequences producing significant alignments: (bits) Value AC117076_25

  4. The use of Open Reading frame ESTs (ORESTES) for analysis of the honey bee transcriptome

    Directory of Open Access Journals (Sweden)

    Soares Ademilson EE

    2004-11-01

    Full Text Available Abstract Background The ongoing efforts to sequence the honey bee genome require additional initiatives to define its transcriptome. Towards this end, we employed the Open Reading frame ESTs (ORESTES) strategy to generate profiles for the life cycle of Apis mellifera workers. Results Of the 5,021 ORESTES, 35.2% matched with previously deposited Apis ESTs. The analysis of the remaining sequences defined a set of putative orthologs, the majority of which had their best-match hits with Anopheles and Drosophila genes. CAP3 assembly of the Apis ORESTES with the already existing 15,500 Apis ESTs generated 3,408 contigs. BLASTX comparison of these contigs with protein sets of organisms representing distinct phylogenetic clades revealed a total of 1,629 contigs that Apis mellifera shares with different taxa. Most (41%) represent genes that are common to all taxa, another 21% are shared between metazoans (Bilateria), and 16% are shared only within the Insecta clade. A set of 23 putative genes presented a best match with human genes, many of which encode factors related to cell signaling/signal transduction. 1,779 contigs (52%) did not match any known sequence. Applying a correction factor deduced from a parallel analysis performed with Drosophila melanogaster ORESTES, we estimate that approximately half of these no-match EST contigs (22%) should represent Apis-specific genes. Conclusions The versatile and cost-efficient ORESTES approach produced minilibraries for honey bee life cycle stages. Such information on central gene regions contributes to genome annotation and also lends itself to cross-transcriptome comparisons to reveal evolutionary trends in insect genomes.

  5. Transcriptome analysis in cotton boll weevil (Anthonomus grandis) and RNA interference in insect pests.

    Directory of Open Access Journals (Sweden)

    Alexandre Augusto Pereira Firmino

    Full Text Available Cotton plants are subjected to the attack of several insect pests. In Brazil, the cotton boll weevil, Anthonomus grandis, is the most important cotton pest. The use of insecticidal proteins and gene silencing by interference RNA (RNAi) as techniques for insect control are promising strategies, which have been applied in the last few years. For this insect, there is not much molecular information available in databases. Using 454-pyrosequencing methodology, the transcriptome of all developmental stages of the insect pest, A. grandis, was analyzed. The A. grandis transcriptome analysis resulted in more than 500,000 reads and a high-quality data set of 20,841 contigs. After sequence assembly and annotation, around 10,600 contigs had at least one BLAST hit against the NCBI non-redundant protein database and 65.7% were similar to Tribolium castaneum sequences. A comparison of A. grandis, Drosophila melanogaster and Bombyx mori protein families' data showed higher similarity to dipteran than to lepidopteran sequences. Several contigs of genes encoding proteins involved in the RNAi mechanism were found. PAZ domain sequences extracted from the transcriptome showed high similarity and conservation for the most important functional and structural motifs when compared to PAZ domains from 5 species. Two SID-like contigs were phylogenetically analyzed and grouped with T. castaneum SID-like proteins. No RdRP gene was found. A contig matching chitin synthase 1 was mined from the transcriptome. dsRNA microinjection of a chitin synthase gene to A. grandis female adults resulted in normal oviposition of unviable eggs and malformed live larvae that were unable to develop in artificial diet. This is the first study that characterizes the transcriptome of the coleopteran, A. grandis. A new and representative transcriptome database for this insect pest is now available. All data support the state of the art of the RNAi mechanism in insects.

  6. Transcriptome analysis in cotton boll weevil (Anthonomus grandis) and RNA interference in insect pests.

    Science.gov (United States)

    Firmino, Alexandre Augusto Pereira; Fonseca, Fernando Campos de Assis; de Macedo, Leonardo Lima Pepino; Coelho, Roberta Ramos; Antonino de Souza, José Dijair; Togawa, Roberto Coiti; Silva-Junior, Orzenil Bonfim; Pappas, Georgios Joannis; da Silva, Maria Cristina Mattar; Engler, Gilbert; Grossi-de-Sa, Maria Fatima

    2013-01-01

    Cotton plants are subjected to the attack of several insect pests. In Brazil, the cotton boll weevil, Anthonomus grandis, is the most important cotton pest. The use of insecticidal proteins and gene silencing by interference RNA (RNAi) as techniques for insect control are promising strategies, which have been applied in the last few years. For this insect, there is not much molecular information available in databases. Using 454-pyrosequencing methodology, the transcriptome of all developmental stages of the insect pest, A. grandis, was analyzed. The A. grandis transcriptome analysis resulted in more than 500,000 reads and a high-quality data set of 20,841 contigs. After sequence assembly and annotation, around 10,600 contigs had at least one BLAST hit against the NCBI non-redundant protein database and 65.7% were similar to Tribolium castaneum sequences. A comparison of A. grandis, Drosophila melanogaster and Bombyx mori protein families' data showed higher similarity to dipteran than to lepidopteran sequences. Several contigs of genes encoding proteins involved in the RNAi mechanism were found. PAZ domain sequences extracted from the transcriptome showed high similarity and conservation for the most important functional and structural motifs when compared to PAZ domains from 5 species. Two SID-like contigs were phylogenetically analyzed and grouped with T. castaneum SID-like proteins. No RdRP gene was found. A contig matching chitin synthase 1 was mined from the transcriptome. dsRNA microinjection of a chitin synthase gene to A. grandis female adults resulted in normal oviposition of unviable eggs and malformed live larvae that were unable to develop in artificial diet. This is the first study that characterizes the transcriptome of the coleopteran, A. grandis. A new and representative transcriptome database for this insect pest is now available. All data support the state of the art of the RNAi mechanism in insects.

  7. Comparing de novo assemblers for 454 transcriptome data.

    Science.gov (United States)

    Kumar, Sujai; Blaxter, Mark L

    2010-10-16

    Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, it is important to correctly assemble these to estimate the sequence of all the transcripts. Most transcriptome assembly projects use only one program for assembling 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis. Although no single assembler performed best on all our criteria, Newbler 2.5 gave longer contigs, better alignments to some reference sequences, and was fast and easy to use. SeqMan assemblies performed best on the criterion of recapitulating known transcripts, and had more novel sequence than the other assemblers, but generated an excess of small, redundant contigs. The remaining assemblers all performed almost as well, with the exception of Newbler 2.3 (the version currently used by most assembly projects), which generated assemblies that had significantly lower total length. As different assemblers use different underlying algorithms to generate contigs, we also explored merging of assemblies and found that the merged datasets not only aligned better to reference sequences than individual assemblies, but were also more consistent in the number and size of contigs. Transcriptome assemblies are smaller than genome assemblies and thus should be more computationally tractable, but are often harder because individual contigs can have highly variable read coverage. Comparing single assemblers, Newbler 2.5 performed best on our trial data set, but other assemblers were closely comparable. Combining differently optimal assemblies from different programs however gave a more credible

  8. A de novo expression profiling of Anopheles funestus, malaria vector in Africa, using 454 pyrosequencing.

    Directory of Open Access Journals (Sweden)

    Richard Gregory

    2011-02-01

    Full Text Available Anopheles funestus is one of the major malaria vectors in Africa and yet there are few genomic tools available for this species compared to An. gambiae. To start to close this knowledge gap, we sequenced the An. funestus transcriptome using cDNA libraries developed from a pyrethroid resistant laboratory strain and a pyrethroid susceptible field strain from Mali. Using a pool of life stages (pupae, larvae, adults: females and males) for each strain, 454 sequencing generated 375,619 reads (average length of 182 bp). De novo assembly generated 18,103 contigs with average length of 253 bp. The average depth of coverage of these contigs was 8.3. In total 20.8% of all reads were novel when compared to reference databases. The sequencing of the field strain generated 204,758 reads compared to 170,861 from the insecticide resistant laboratory strain. The contigs most differentially represented in the resistant strain belong to the P450 gene family and cuticular genes, which correlates with previous studies implicating both of these gene families in pyrethroid resistance. qPCR carried out on six contigs indicates that these ESTs could be suitable for gene expression studies such as microarray. 31,000 sites were estimated to contain Single Nucleotide Polymorphisms (SNPs), and analysis of SNPs from 20 contigs suggested that most of these SNPs are likely to be true SNPs. Gene conservation analysis confirmed the close phylogenetic relationship between An. funestus and An. gambiae. This study represents a significant advance for the genetics and genomics of An. funestus since it provides an extensive set of both Expressed Sequence Tags (ESTs) and SNPs which can be readily adopted for the design of new genomic tools such as microarray or SNP platforms.

  9. Complete chloroplast genome sequence of MD-2 pineapple and its comparative analysis among nine other plants from the subclass Commelinidae.

    Science.gov (United States)

    Redwan, R M; Saidin, A; Kumar, S V

    2015-08-12

    Pineapple (Ananas comosus var. comosus) is known as the king of fruits for its crown and is the third most important tropical fruit after banana and citrus. The plant, which is indigenous to South America, is the most important species in the Bromeliaceae family and is largely traded for fresh fruit consumption. Here, we report the complete chloroplast sequence of the MD-2 pineapple that was sequenced using the PacBio sequencing technology. In this study, the high error rate of PacBio long sequence reads of A. comosus's total genomic DNA was reduced by leveraging the high-accuracy but short Illumina reads for error correction via the latest error correction module from Novocraft. Error-corrected long PacBio reads were assembled using a single tool to produce a contig representing the pineapple chloroplast genome. The genome, 159,636 bp in length, features the conserved quadripartite structure of chloroplasts, containing a large single copy region (LSC) of 87,482 bp, a small single copy region (SSC) of 18,622 bp and two inverted repeat regions (IRA and IRB) of 26,766 bp each. Overall, the genome contained 117 unique coding regions, and 30 were repeated in the IR region, with its gene content, structure and arrangement similar to those of its sister taxon, Typha latifolia. A total of 35 repeat structures were detected in both the coding and non-coding regions, with a majority being tandem repeats. In addition, 205 SSRs were detected in the genome, with six protein-coding genes containing more than two SSRs. Comparative analysis of chloroplast genomes from the subclass Commelinidae revealed a conserved protein-coding gene, albeit located in a highly divergent region. Analysis of selection pressure on protein-coding genes using the Ka/Ks ratio showed significant positive selection exerted on the rps7 gene of the pineapple chloroplast with P less than 0.05. Phylogenetic analysis confirmed the recent taxonomical relation among the member of
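
    The region sizes quoted in this record can be checked against the reported genome length. The following is a minimal sketch, assuming only the standard quadripartite layout (one LSC, one SSC, two IR copies); the function name is illustrative and not taken from the paper.

```python
# Region sizes in bp, as quoted in the abstract above.
LSC = 87_482             # large single copy region
SSC = 18_622             # small single copy region
IR = 26_766              # each of the two inverted repeats (IRA, IRB)
REPORTED_TOTAL = 159_636


def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Length of a circular genome made of one LSC, one SSC and two IR copies."""
    return lsc + ssc + 2 * ir


total = quadripartite_length(LSC, SSC, IR)
ir_fraction = 2 * IR / total  # share of the genome occupied by the two IR copies

print(total, total == REPORTED_TOTAL)
print(f"IR share of genome: {ir_fraction:.1%}")
```

    The four regions add up exactly to the reported 159,636 bp, so the figures in the record are internally consistent.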

  10. Salmon louse (Lepeophtheirus salmonis) transcriptomes during post molting maturation and egg production, revealed using EST-sequencing and microarray analysis

    Directory of Open Access Journals (Sweden)

    Jonassen Inge

    2008-03-01

    Full Text Available Abstract Background Lepeophtheirus salmonis is an ectoparasitic copepod feeding on skin, mucus and blood from salmonid hosts. Initial analysis of EST sequences from pre adult and adult stages of L. salmonis revealed a large proportion of novel transcripts. In order to link unknown transcripts to biological functions we have combined EST sequencing and microarray analysis to characterize female salmon louse transcriptomes during post molting maturation and egg production. Results EST sequence analysis shows that 43% of the ESTs have no significant hits in GenBank. Sequenced ESTs assembled into 556 contigs and 1614 singletons and whenever homologous genes were identified no clear correlation with homologous genes from any specific animal group was evident. Sequence comparison of 27 L. salmonis proteins with homologous proteins in humans, zebrafish, insects and crustaceans revealed an almost identical sequence identity with all species. Microarray analysis of maturing female adult salmon lice revealed two major transcription patterns: up-regulation during the final molting followed by down regulation, and female specific up regulation during post molting growth and egg production. For a third minor group of ESTs transcription decreased during molting from pre-adult II to immature adults. Genes regulated during molting typically gave hits with cuticula proteins whilst transcripts up regulated during post molting growth were female specific, including two vitellogenins. Conclusion The copepod L. salmonis contains a high level of novel genes. Among analyzed L. salmonis proteins, sequence identities with homologous proteins in crustaceans are no higher than to homologous proteins in humans. Three distinct processes, molting, post molting growth and egg production correlate with transcriptional regulation of three groups of transcripts; two including genes related to growth, one including genes related to egg production. The function of the regulated

  11. Large-scale analysis of peptide sequence variants: the case for high-field asymmetric waveform ion mobility spectrometry.

    Science.gov (United States)

    Creese, Andrew J; Smart, Jade; Cooper, Helen J

    2013-05-21

    Large scale analysis of proteins by mass spectrometry is becoming increasingly routine; however, the presence of peptide isomers remains a significant challenge for both identification and quantitation in proteomics. Classes of isomers include sequence inversions, structural isomers, and localization variants. In many cases, liquid chromatography is inadequate for separation of peptide isomers. The resulting tandem mass spectra are composite, containing fragments from multiple precursor ions. The benefits of high-field asymmetric waveform ion mobility spectrometry (FAIMS) for proteomics have been demonstrated by a number of groups, but previous work has focused on extending proteome coverage generally. Here, we present a systematic study of the benefits of FAIMS for a key challenge in proteomics, that of peptide isomers. We have applied FAIMS to the analysis of a phosphopeptide library comprising the sequences GPSGXVpSXAQLX(K/R) and SXPFKXpSPLXFG(K/R), where X = ADEFGLSTVY. The library has defined limits enabling us to make valid conclusions regarding FAIMS performance. The library contains numerous sequence inversions and structural isomers. In addition, there are large numbers of theoretical localization variants, allowing false localization rates to be determined. The FAIMS approach is compared with reversed-phase liquid chromatography and strong cation exchange chromatography. The FAIMS approach identified 35% of the peptide library, whereas LC-MS/MS alone identified 8% and LC-MS/MS with strong cation exchange chromatography prefractionation identified 17.3% of the library.
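
    The two library templates in this record can be expanded combinatorially to see why sequence inversions are so common. This is a sketch under an explicit assumption: every X position varies independently over the ten listed residues and every peptide ends in K or R. The abstract does not state the size of the synthesized library, so the count below is a theoretical upper bound and the helper name `expand` is hypothetical.

```python
from collections import defaultdict
from itertools import product

RESIDUES = "ADEFGLSTVY"   # residues allowed at each X position (from the abstract)
C_TERMINI = "KR"          # the (K/R) alternative at the C-terminus
TEMPLATES = ["GPSGXVpSXAQLX", "SXPFKXpSPLXFG"]  # "pS" marks the phosphoserine


def expand(template: str) -> list[str]:
    """Enumerate every peptide obtained by filling each X and appending K or R."""
    parts = template.split("X")
    n_x = len(parts) - 1
    peptides = []
    for combo in product(RESIDUES, repeat=n_x):
        body = parts[0] + "".join(res + part for res, part in zip(combo, parts[1:]))
        for c_term in C_TERMINI:
            peptides.append(body + c_term)
    return peptides


library = [peptide for template in TEMPLATES for peptide in expand(template)]

# Peptides with the same residue composition have the same mass (isomers).
by_composition = defaultdict(list)
for peptide in library:
    by_composition["".join(sorted(peptide))].append(peptide)
isomer_groups = [group for group in by_composition.values() if len(group) > 1]

print(len(library))  # 2 templates x 10**3 fillings x 2 termini = 4000
```

    Grouping by composition is only a rough stand-in for "isomer": it captures sequence inversions within the library, not phosphosite localization variants.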

  12. Genome analysis and identification of gelatinase encoded gene in Enterobacter aerogenes

    Science.gov (United States)

    Shahimi, Safiyyah; Mutalib, Sahilah Abdul; Khalid, Rozida Abdul; Repin, Rul Aisyah Mat; Lamri, Mohd Fadly; Bakar, Mohd Faizal Abu; Isa, Mohd Noor Mat

    2016-11-01

    In this study, bioinformatic analysis of the genome sequence of E. aerogenes was carried out to identify genes encoding gelatinase. Enterobacter aerogenes was isolated from hot spring water, and its gelatinase activity is species-specific towards porcine and fish gelatin. This bacterium therefore offers the possibility of producing enzymes specific to the gelatin of each of these two species. The genome of Enterobacter aerogenes was partially sequenced, giving a total sequence size of 5.0 megabase pairs (Mbp). The pre-processing pipeline yielded 87.6 Mbp of total reads and 68.8 Mbp of high-quality reads (78.58%). Genome assembly produced 120 contigs, with 67.5% of contigs over 1 kilobase pair (kbp), an N50 contig length of 124,856 bp and a GC content of 55.17%. About 4705 protein-coding genes were identified by protein prediction analysis. The two candidate genes selected had the highest percentage identity to gelatinase enzymes available in the Swiss-Prot and NCBI online databases. They were NODE_9_length_26866_cov_148.013245_12, containing a 1029 base pair (bp) sequence encoding 342 amino acids, and NODE_24_length_155103_cov_177.082458_62, containing a 717 bp sequence encoding 238 amino acids. Two pairs of primers (forward and reverse) were then designed based on the open reading frames (ORFs) of the selected genes. Genome analysis of E. aerogenes thus identified genes encoding gelatinase.
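
    Two of the figures in this record follow from simple rules that are easy to sketch: the N50 contig length, and the relation between ORF length and protein length. The code below uses the common N50 definition and assumes each ORF length includes its stop codon; the toy contig lengths are hypothetical and are not the study's data.

```python
def n50(lengths: list[int]) -> int:
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("empty contig list")


def protein_length(orf_bp: int) -> int:
    """Number of amino acids encoded by an ORF whose length includes the stop codon."""
    if orf_bp % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    return orf_bp // 3 - 1


toy_contigs = [100, 80, 60, 40, 20]   # hypothetical lengths for illustration only
print(n50(toy_contigs))               # 80: the two longest contigs cover 180 of 300 bases

# ORF lengths quoted in the abstract above.
print(protein_length(1029))           # 342
print(protein_length(717))            # 238
```

    Both ORF lengths reproduce the amino acid counts given in the record (342 and 238), which is consistent with the stop codon being counted in the base pair figures.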

  13. Conditional Probabilities of Large Earthquake Sequences in California from the Physics-based Rupture Simulator RSQSim

    Science.gov (United States)

    Gilchrist, J. J.; Jordan, T. H.; Shaw, B. E.; Milner, K. R.; Richards-Dinger, K. B.; Dieterich, J. H.

    2017-12-01

    Within the SCEC Collaboratory for Interseismic Simulation and Modeling (CISM), we are developing physics-based forecasting models for earthquake ruptures in California. We employ the 3D boundary element code RSQSim (Rate-State Earthquake Simulator of Dieterich & Richards-Dinger, 2010) to generate synthetic catalogs with tens of millions of events that span up to a million years each. This code models rupture nucleation by rate- and state-dependent friction and Coulomb stress transfer in complex, fully interacting fault systems. The Uniform California Earthquake Rupture Forecast Version 3 (UCERF3) fault and deformation models are used to specify the fault geometry and long-term slip rates. We have employed the Blue Waters supercomputer to generate long catalogs of simulated California seismicity from which we calculate the forecasting statistics for large events. We have performed probabilistic seismic hazard analysis with RSQSim catalogs that were calibrated with system-wide parameters and found a remarkably good agreement with UCERF3 (Milner et al., this meeting). We build on this analysis, comparing the conditional probabilities of sequences of large events from RSQSim and UCERF3. In making these comparisons, we consider the epistemic uncertainties associated with the RSQSim parameters (e.g., rate- and state-frictional parameters), as well as the effects of model-tuning (e.g., adjusting the RSQSim parameters to match UCERF3 recurrence rates). The comparisons illustrate how physics-based rupture simulators might assist forecasters in understanding the short-term hazards of large aftershocks and multi-event sequences associated with complex, multi-fault ruptures.
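
    The idea of a conditional probability of event sequences can be illustrated on a toy catalog. This is only a sketch: it does not use RSQSim or UCERF3, the catalog, magnitude threshold and window are invented for illustration, and the function names are hypothetical. It compares the empirical chance that a large event is followed by another within a time window against the chance expected if large events were a Poisson process with the same long-term rate.

```python
import math
from bisect import bisect_right


def conditional_probability(times: list[float], window: float) -> float:
    """Fraction of events followed by at least one later event within `window` years.

    `times` must be sorted in ascending order.
    """
    if not times:
        return 0.0
    hits = 0
    for i, t in enumerate(times):
        j = bisect_right(times, t + window)  # number of events at or before t + window
        if j > i + 1:                        # at least one event after event i
            hits += 1
    return hits / len(times)


def poisson_probability(n_events: int, duration: float, window: float) -> float:
    """Chance of one or more events in `window` years for a constant-rate process."""
    rate = n_events / duration
    return 1.0 - math.exp(-rate * window)


# Hypothetical catalog: (time in years, magnitude). Not simulator output.
catalog = [(10.0, 7.2), (11.0, 7.0), (120.0, 6.1), (500.0, 7.4),
           (900.0, 7.1), (903.0, 7.6), (1500.0, 6.5), (2000.0, 7.3)]
DURATION = 2500.0   # assumed catalog length in years
WINDOW = 5.0        # assumed follow-up window in years

large = sorted(t for t, m in catalog if m >= 7.0)
p_cond = conditional_probability(large, WINDOW)
p_poisson = poisson_probability(len(large), DURATION, WINDOW)
print(f"conditional: {p_cond:.3f}  poisson: {p_poisson:.3f}")
```

    In this toy catalog the conditional estimate is far above the Poisson value, which is the kind of clustering signal that a comparison between a physics-based simulator and a statistical forecast would look for.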

  14. The Dunaliella salina organelle genomes: large sequences, inflated with intronic and intergenic DNA

    Energy Technology Data Exchange (ETDEWEB)

    Smith, David R.; Lee, Robert W.; Cushman, John C.; Magnuson, Jon K.; Tran, Duc; Polle, Juergen E.

    2010-05-07

    Abstract Background: Dunaliella salina Teodoresco, a unicellular, halophilic green alga belonging to the Chlorophyceae, is among the most industrially important microalgae. This is because D. salina can produce massive amounts of β-carotene, which can be collected for commercial purposes, and because of its potential as a feedstock for biofuels production. Although the biochemistry and physiology of D. salina have been studied in great detail, virtually nothing is known about the genomes it carries, especially those within its mitochondrion and plastid. This study presents the complete mitochondrial and plastid genome sequences of D. salina and compares them with those of the model green algae Chlamydomonas reinhardtii and Volvox carteri. Results: The D. salina organelle genomes are large, circular-mapping molecules with ~60% noncoding DNA, placing them among the most inflated organelle DNAs sampled from the Chlorophyta. In fact, the D. salina plastid genome, at 269 kb, is the largest complete plastid DNA (ptDNA) sequence currently deposited in GenBank, and both the mitochondrial and plastid genomes have unprecedentedly high intron densities for organelle DNA: ~1.5 and ~0.4 introns per gene, respectively. Moreover, what appear to be the relics of genes, introns, and intronic open reading frames are found scattered throughout the intergenic ptDNA regions -- a trait without parallel in other characterized organelle genomes and one that gives insight into the mechanisms and modes of expansion of the D. salina ptDNA. Conclusions: These findings confirm the notion that chlamydomonadalean algae have some of the most extreme organelle genomes of all eukaryotes. They also suggest that the events giving rise to the expanded ptDNA architecture of D. salina and other Chlamydomonadales may have occurred early in the evolution of this lineage. Although interesting from a genome evolution standpoint, the D. salina organelle DNA sequences will aid in the development of a viable

  15. The Dunaliella salina organelle genomes: large sequences, inflated with intronic and intergenic DNA

    Directory of Open Access Journals (Sweden)

    Tran Duc

    2010-05-01

    Full Text Available Abstract Background Dunaliella salina Teodoresco, a unicellular, halophilic green alga belonging to the Chlorophyceae, is among the most industrially important microalgae. This is because D. salina can produce massive amounts of β-carotene, which can be collected for commercial purposes, and because of its potential as a feedstock for biofuels production. Although the biochemistry and physiology of D. salina have been studied in great detail, virtually nothing is known about the genomes it carries, especially those within its mitochondrion and plastid. This study presents the complete mitochondrial and plastid genome sequences of D. salina and compares them with those of the model green algae Chlamydomonas reinhardtii and Volvox carteri. Results The D. salina organelle genomes are large, circular-mapping molecules with ~60% noncoding DNA, placing them among the most inflated organelle DNAs sampled from the Chlorophyta. In fact, the D. salina plastid genome, at 269 kb, is the largest complete plastid DNA (ptDNA) sequence currently deposited in GenBank, and both the mitochondrial and plastid genomes have unprecedentedly high intron densities for organelle DNA: ~1.5 and ~0.4 introns per gene, respectively. Moreover, what appear to be the relics of genes, introns, and intronic open reading frames are found scattered throughout the intergenic ptDNA regions -- a trait without parallel in other characterized organelle genomes and one that gives insight into the mechanisms and modes of expansion of the D. salina ptDNA. Conclusions These findings confirm the notion that chlamydomonadalean algae have some of the most extreme organelle genomes of all eukaryotes. They also suggest that the events giving rise to the expanded ptDNA architecture of D. salina and other Chlamydomonadales may have occurred early in the evolution of this lineage. Although interesting from a genome evolution standpoint, the D. salina organelle DNA sequences will aid in the

  16. De novo Transcriptome Sequencing Reveals a Considerable Bias in the Incidence of Simple Sequence Repeats towards the Downstream of ‘Pre-miRNAs’ of Black Pepper

    Science.gov (United States)

    Joy, Nisha; Asha, Srinivasan; Mallika, Vijayan; Soniya, Eppurathu Vasudevan

    2013-01-01

    Next generation sequencing has an advantage on transformational development of species with limited available sequence data as it helps to decode the genome and transcriptome. We carried out de novo sequencing using Illumina HiSeq™ 2000 to generate the first leaf transcriptome of black pepper (Piper nigrum L.), an important spice variety native to South India and also grown in other tropical regions. Despite the economic and biochemical importance of pepper, a scientifically rigorous study at the molecular level is far from complete due to the lack of sufficient sequence information and the cytological complexity of its genome. The 55 million raw reads obtained, when assembled using the Trinity program, generated 223,386 contigs and 128,157 unigenes. Reports suggest that repeat-rich genomic regions give rise to small non-coding functional RNAs. MicroRNAs (miRNAs) are the most abundant type of non-coding regulatory RNAs. In spite of the widespread research on miRNAs, little is known about the hair-pin precursors of miRNAs bearing Simple Sequence Repeats (SSRs). We used the array of transcripts generated for the in silico prediction and detection of ‘43 pre-miRNA candidates bearing different types of SSR motifs’. The analysis identified 3913 different types of SSR motifs with an average of one SSR per 3.04 MB of the transcriptome. About 0.033% of the transcriptome constituted ‘pre-miRNA candidates bearing SSRs’. The abundance, type and distribution of SSR motifs studied across the hair-pin miRNA precursors showed a significant bias in the position of SSRs towards the downstream of predicted ‘pre-miRNA candidates’. The catalogue of transcripts identified, together with the demonstration of reliable existence of SSRs in the miRNA precursors, permits future opportunities for understanding the genetic mechanism of black pepper and likely functions of ‘tandem repeats’ in miRNAs. PMID:23469176
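
    Simple sequence repeats of the kind counted in this record are short motifs repeated in tandem. The sketch below shows one way to locate them with a regular expression. It is not the tool used in the study, which the abstract does not name; the motif size range (2 to 6 bases) and the minimum copy number are assumptions, and `find_ssrs` is a hypothetical helper.

```python
import re


def find_ssrs(seq: str, min_copies: int = 4) -> list[tuple[int, str, int]]:
    """Return (start, repeat unit, copy number) for tandem repeats of 2-6 base units.

    The lazy quantifier makes the regex prefer the shortest repeat unit.
    Runs of a single base (homopolymers) are skipped.
    """
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:          # e.g. "AA": a homopolymer, not counted here
            continue
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Hypothetical sequence containing an (AT)5 and a (CAG)4 repeat.
example = "GGC" + "AT" * 5 + "CCGT" + "CAG" * 4 + "TTA"
print(find_ssrs(example))  # [(3, 'AT', 5), (17, 'CAG', 4)]
```

    Dedicated SSR finders usually apply different minimum copy numbers per motif length and also handle compound or imperfect repeats, which this sketch ignores.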

  17. Draft genome sequence of Streptomyces sp. strain F1, a potential source for glycoside hydrolases isolated from Brazilian soil.

    Science.gov (United States)

    Melo, Ricardo Rodrigues de; Persinoti, Gabriela Felix; Paixão, Douglas Antonio Alvaredo; Squina, Fábio Márcio; Ruller, Roberto; Sato, Helia Harumi

    Here, we present the draft genome sequence of Streptomyces sp. F1, a strain isolated from soil with great potential for secretion of hydrolytic enzymes used to deconstruct cellulosic biomass. The draft genome assembly of Streptomyces sp. strain F1 has 69 contigs with a total genome size of 8,142,296 bp and a G+C content of 72.65%. Preliminary genome analysis identified 175 proteins as Carbohydrate-Active Enzymes, including 85 glycoside hydrolases organized in 33 distinct families. This draft genome information provides new insights into the key genes encoding hydrolytic enzymes involved in biomass deconstruction employed by soil bacteria. Copyright © 2017 Sociedade Brasileira de Microbiologia. Published by Elsevier Editora Ltda. All rights reserved.

  18. Expressed sequences tags of the anther smut fungus, Microbotryum violaceum, identify mating and pathogenicity genes

    Directory of Open Access Journals (Sweden)

    Devier Benjamin

    2007-08-01

    Full Text Available Abstract Background The basidiomycete fungus Microbotryum violaceum is responsible for the anther-smut disease in many plants of the Caryophyllaceae family and is a model in genetics and evolutionary biology. Infection is initiated by dikaryotic hyphae produced after the conjugation of two haploid sporidia of opposite mating type. This study describes M. violaceum ESTs corresponding to nuclear genes expressed during conjugation and early hyphal production. Results A normalized cDNA library generated 24,128 sequences, which were assembled into 7,765 unique genes; 25.2% of them displayed significant similarity to annotated proteins from other organisms, 74.3% a weak similarity to the same set of known proteins, and 0.5% were orphans. We identified putative pheromone receptors and genes that in other fungi are involved in the mating process. We also identified many sequences similar to genes known to be involved in pathogenicity in other fungi. The M. violaceum EST database, MICROBASE, is available on the Web and provides access to the sequences, assembled contigs, annotations and programs to compare similarities against MICROBASE. Conclusion This study provides a basis for cloning the mating type locus, for further investigation of pathogenicity genes in the anther smut fungi, and for comparative genomics.

  19. Non-contiguous finished genome sequence and contextual data of the filamentous soil bacterium Ktedonobacter racemifer type strain (SOSP1-21T)

    Energy Technology Data Exchange (ETDEWEB)

    Chang, Yun-Juan [ORNL; Land, Miriam L [ORNL; Hauser, Loren John [ORNL; Chertkov, Olga [Los Alamos National Laboratory (LANL); Glavina Del Rio, Tijana [U.S. Department of Energy, Joint Genome Institute; Nolan, Matt [U.S. Department of Energy, Joint Genome Institute; Copeland, A [U.S. Department of Energy, Joint Genome Institute; Tice, Hope [U.S. Department of Energy, Joint Genome Institute; Cheng, Jan-Fang [U.S. Department of Energy, Joint Genome Institute; Lucas, Susan [U.S. Department of Energy, Joint Genome Institute; Han, Cliff [Los Alamos National Laboratory (LANL); Goodwin, Lynne A. [Los Alamos National Laboratory (LANL); Pitluck, Sam [U.S. Department of Energy, Joint Genome Institute; Ivanova, N [U.S. Department of Energy, Joint Genome Institute; Ovchinnikova, Galina [U.S. Department of Energy, Joint Genome Institute; Pati, Amrita [U.S. Department of Energy, Joint Genome Institute; Chen, Amy [U.S. Department of Energy, Joint Genome Institute; Palaniappan, Krishna [U.S. Department of Energy, Joint Genome Institute; Mavromatis, K [U.S. Department of Energy, Joint Genome Institute; Liolios, Konstantinos [U.S. Department of Energy, Joint Genome Institute; Brettin, Thomas S [ORNL; Fiebig, Anne [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Rohde, Manfred [HZI - Helmholtz Centre for Infection Research, Braunschweig, Germany; Abt, Birte [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Goker, Markus [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Detter, J. Chris [U.S. Department of Energy, Joint Genome Institute; Woyke, Tanja [U.S. Department of Energy, Joint Genome Institute; Bristow, James [U.S. Department of Energy, Joint Genome Institute; Eisen, Jonathan [U.S. Department of Energy, Joint Genome Institute; Markowitz, Victor [U.S. Department of Energy, Joint Genome Institute; Hugenholtz, Philip [U.S. 
Department of Energy, Joint Genome Institute; Kyrpides, Nikos C [U.S. Department of Energy, Joint Genome Institute; Klenk, Hans-Peter [DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany; Lapidus, Alla L. [U.S. Department of Energy, Joint Genome Institute

    2011-01-01

    Ktedonobacter racemifer corrig. Cavaletti et al. 2007 is the type species of the genus Ktedonobacter, which in turn is the type genus of the family Ktedonobacteraceae, the type family of the order Ktedonobacterales within the class Ktedonobacteria in the phylum Chloroflexi. Although K. racemifer shares some morphological features with the actinobacteria, it is of special interest because it was the first cultivated representative of a deep branching unclassified lineage of otherwise uncultivated environmental phylotypes tentatively located within the phylum Chloroflexi. The aerobic, filamentous, non-motile, spore-forming Gram-positive heterotroph was isolated from soil in Italy. The 13,661,586 bp long non-contiguous finished genome consists of ten contigs and is the first reported genome sequence from a member of the class Ktedonobacteria. With its 11,453 protein-coding and 87 RNA genes, it is the largest prokaryotic genome reported so far. It comprises a large number of over-represented COGs, particularly genes associated with transposons, causing the genetic redundancy within the genome to be considerably larger than expected by chance. This work is a part of the Genomic Encyclopedia of Bacteria and Archaea project.

  20. Profile hidden Markov models for the detection of viruses within metagenomic sequence data.

    Directory of Open Access Journals (Sweden)

    Peter Skewes-Cox

    Full Text Available Rapid, sensitive, and specific virus detection is an important component of clinical diagnostics. Massively parallel sequencing enables new diagnostic opportunities that complement traditional serological and PCR based techniques. While massively parallel sequencing promises the benefits of being more comprehensive and less biased than traditional approaches, it presents new analytical challenges, especially with respect to detection of pathogen sequences in metagenomic contexts. To a first approximation, the initial detection of viruses can be achieved simply through alignment of sequence reads or assembled contigs to a reference database of pathogen genomes with tools such as BLAST. However, recognition of highly divergent viral sequences is problematic, and may be further complicated by the inherently high mutation rates of some viral types, especially RNA viruses. In these cases, increased sensitivity may be achieved by leveraging position-specific information during the alignment process. Here, we constructed HMMER3-compatible profile hidden Markov models (profile HMMs) from all the virally annotated proteins in RefSeq in an automated fashion using a custom-built bioinformatic pipeline. We then tested the ability of these viral profile HMMs ("vFams") to accurately classify sequences as viral or non-viral. Cross-validation experiments with full-length gene sequences showed that the vFams were able to recall 91% of left-out viral test sequences without erroneously classifying any non-viral sequences into viral protein clusters. Thorough reanalysis of previously published metagenomic datasets with a set of the best-performing vFams showed that they were more sensitive than BLAST for detecting sequences originating from more distant relatives of known viruses. To facilitate the use of the vFams for rapid detection of remote viral homologs in metagenomic data, we provide two sets of vFams, comprising more than 4,000 vFams each, in the HMMER3