Enabling scientists to simultaneously study multiple genomes recovered directly from an environmental sample

Traditional and classical methods of genomics and microbiology allow researchers to study an individual microbial species obtained from the environment by isolating the organism into pure colonies using microbial culture techniques. However, this approach cannot capture the structure of the broader microbial community within the environmental sample, the relative representation of multiple genomes, and their interaction with each other and with the environment. Additionally, a large number of microbial species are very difficult, or impossible, to culture in vitro in the laboratory setting. The development of nextgeneration sequencing has advanced the field of metagenomics, thereby bypassing the need for microbial isolation through culturing. In a metagenomic experiment, a Niltubacin HDAC inhibitor sample is usually taken from a natural or a host-associated environment containing micro-organisms organized into communities or microbiomes. DNA is extracted from the environmental sample containing a mixture of multiple genomes and then sequenced without prior separation. The resulting dataset comprises millions of mixed sequence reads from the multiple genomes contained in the sample. Traditionally, DNA has been sequenced using Sanger sequencing technology and the reads generated are routinely 800–1000 base pairs long. However this technology is extremely cumbersome and costly. Recently next-generation sequencers, e.g., Illumina/Solexa, Applied Biosystems’ SOLiD, and Roche’s 454 Life Sciences sequencing systems, have emerged as the future of genomics with incredible ability to rapidly generate large amounts of sequence data. These new technologies greatly facilitate highthroughput while lowering the cost of metagenomic studies. However, the reads generated are of much shorter length making reads assembly and alignment more challenging. For example, Illumina/Solexa and SOLiD generate reads ranging between 35–100 base pairs while Roche 454 reads are approximately 100–400 base pairs in length. One goal of metagenomic studies is to identify what genomes are contained in the environmental sample and to estimate their relative abundance. Identification of genomes is complicated by the mixed nature of multiple genomes in the sample. A widely used approach is assigning the sequence reads to NCBI’s taxonomy tree based on sequence read homology alignment with known sequences catalogued in reference databases. The sequence reads are first aligned to the reference sequence databases using a sequence comparison program such as BLAST. Reads which have hits in the database are then assigned to the taxonomy tree based on the best match or multiple high-scoring hits. The challenge of this approach is that hits may be found in multiple genomes for a single read at a given threshold of bit-score or Expect value, due to sequence homology and overlaps associated with similarity among species. Strategy of weighting similarities for multiple BLAST hits has been used to estimate the relative genomic abundance and average size. Another representative and stand-alone analysis tool, MEGAN, assigns a read with hits in multiple genomes to their lowest common ancestor in the NCBI taxonomy tree. Thus assignments of reads to different ranks of taxonomy tree depend on what threshold for bit-score or Expect value is used. Furthermore, MEGAN assigns reads one at a time. As a consequence, the results have less false positives but lack specificity.

Leave a Reply