FOCUS ON METAGENOMICS

COMMUNITY GENOMICS IN MICROBIAL ECOLOGY AND EVOLUTION

Eric E. Allen* and Jillian F. Banfield*‡

Abstract | It is possible to reconstruct near-complete, and possibly complete, genomes of the dominant members of microbial communities from DNA that is extracted directly from the environment. Genome sequences from environmental samples capture the aggregate characteristics of the strain population from which they were derived. Comparison of the sequence data within and among natural populations can reveal the evolutionary processes that lead to genome diversification and speciation. Community genomic datasets can also enable subsequent gene expression and proteomic studies to determine how resources are invested and functions are distributed among community members. Ultimately, genomics can reveal how individual species and strains contribute to the net activity of the community.

*Department of Environmental Science, Policy, and Management, ‡Department of Earth and Planetary Sciences, University of California, Berkeley, Berkeley, California 94720, USA. Correspondence to J.F.B. e-mail: jill@eps.berkeley.edu doi:10.1038/nrmicro1157

NATURE REVIEWS | MICROBIOLOGY © 2005 Nature Publishing Group VOLUME 3 | JUNE 2005 | 489
www.nature.com/reviews/micro

Microbial genomics has, until recently, been confined to individual, isolated microbial strains. Genome sequence information for isolates from phylogenetically diverse lineages has had a marked impact on our understanding of microbial physiology, biochemistry, genetics, ecology and evolution. However, this approach is limited because we do not know how to cultivate most microorganisms1. Consequently, many questions about the roles of uncharacterized organisms in natural ecosystems remain.

Our ability to survey the resident microbiota in a given community has been greatly expanded by various cultivation-independent methodologies, which include 16S rRNA gene CLONE LIBRARY collections and group-specific fluorescence in situ hybridization (FISH)2. Although the description and quantitation of the phylogenetic diversity of microbial communities is an important first step, linking these organisms to their ecological roles remains a significant challenge.

CLONE LIBRARY
A collection of targeted DNA sequences, such as the 16S rRNA gene, most often derived from PCR amplification and subsequent cloning into a vector. Specifically, 16S rRNA gene clone libraries are often used in surveys of microbial diversity from environmental samples.

In the natural environment, individual organisms do not exist in isolation. Rather, microbial communities are dynamic CONSORTIA of microbial species populations. The understanding of consortia function will benefit from genomic information from all coexisting members. This cannot be adequately addressed by focused isolation and individual genome sequencing efforts, as isolates might not be representative of the full genetic and metabolic potential of their associated natural populations. Moreover, artificial cultivation conditions often do not replicate those found in nature. Therefore, there is a compelling impetus to move beyond the culture-centric realm of microbial sequencing and to begin focusing sequencing efforts on microbial communities en masse.

CONSORTIUM
Physical association between cells of two or more types of microorganism. Such an association might be advantageous to at least one of the microorganisms.

The analysis of genome sequence data that has been recovered directly from the environment is motivated by many objectives, which include the establishment of gene inventories and natural product discovery3,4. This approach is often referred to as metagenomics, which is defined as the functional and sequence-based analysis of the collective microbial genomes that are contained in an environmental sample3. Recent reviews have covered environmental and functional metagenomics3,5–8.

Here we centre our discussion on the opportunities for analysis of ecological and evolutionary processes in natural microbial consortia using environmentally-derived genome sequence data. We focus on community genomics, which emphasizes the analysis of species populations and their interactions, recognizing that both species composition and interactions change over time, and in response to environmental stimuli. This requires that the system under investigation can be sampled repeatedly, and defined well enough to enable in situ ecological studies and the analysis of adaptive processes. Genomics can resolve the genetic and metabolic potential of communities and establish how functions are partitioned in and among populations, reveal how genetic diversity is created and maintained, and identify the primary drivers of genome evolution and speciation.

We draw upon experiences from our ongoing analyses of an extreme acid mine drainage (AMD) ecosystem9,10 (BOX 1). We discuss the challenges that are associated with the assembly of near-complete, and potentially complete, genomes of uncultivated organisms, the documentation of genomic heterogeneity in populations and the use of these data to enable comprehensive functional studies.

Box 1 | Acid mine drainage community genomics

A decade of research on the biogeochemistry78 and microbiology79 of the Richmond Mine at Iron Mountain, California, provided the important scientific foundation for the acid mine drainage (AMD) community genome sequencing project. Initial work determined the types of organism that were present and correlated community membership with geochemical conditions80. In 2002–2003, 76 Mb of environmental sequence data were obtained from a small-insert library from a single biofilm sample9. Using these data, it was possible to reconstruct the genomes of the dominant bacterium, Leptospirillum group II (10X coverage) and the dominant archaeon, Ferroplasma type II (10X coverage). Partial reconstruction was also possible for the bacterial genomes of Leptospirillum group III (3X coverage) and a Sulfobacillus species (0.5X coverage) that is closely related to Sulfobacillus thermosulfidooxidans. Archaeal genomes that were partially reconstructed include Ferroplasma acidarmanus Type I (very closely related to F. acidarmanus fer 1; 4X coverage) and G-plasma (3X coverage), a novel group within the Thermoplasmatales.

The sequencing allowed metabolic reconstructions of these organisms based on genome annotations and an analysis of functional partitioning among community members9. Importantly, it was revealed that a relatively minor community component, Leptospirillum group III, possessed the sole complement of nitrogen fixation (nif) genes. This subsequently led to the design of a selective isolation strategy to successfully cultivate this organism using genome sequence data43. Furthermore, carbon fixation pathways and gene products that are possibly involved in iron oxidation were revealed, which provided important insights into the intricate metabolism of these chemolithoautotrophic communities.

Genomic analyses also provided insights into population structure. These included evidence for genetic recombination among archaeal populations, which revealed a high degree of genome mosaicism in these species. Furthermore, comprehensive population genomic information has allowed analysis of factors that contribute to genomic heterogeneity within species populations and the ability to assess evidence of selection based on the analysis of nucleotide substitutions (E.E.A. et al., unpublished observations). Finally, the community genomic dataset has provided the foundation for performing environmental proteomic surveys from a natural biofilm sample10. These studies have revealed the complement of genes that are expressed in situ, and therefore go beyond inferences based on genome-annotation gazing to provide insights into how functions are distributed and which functions are important in natural microbial consortia.

Approaches to community genomics

Community genomics provides a platform to assess natural microbial phenomena that include biogeochemical activities, population ecology, evolutionary processes such as lateral gene transfer (LGT) events, and microbial interactions. Only by placing these processes in their environmental context can we begin to understand complex community structure and functions, and the evolutionary constraints that define and sustain them.

Insights into the metabolic functions of uncultivated microorganisms have been facilitated by exploiting phylogenetic anchors that are contained in environmental libraries (BOX 2). For example, in large-insert environmental libraries, contiguous DNA that flanks taxonomic-specific markers such as 16S rRNA genes can provide a glimpse into the genetic potential of sampled organisms11–15. Alternatively, random clones from shotgun libraries can be sequenced. In this review, we focus primarily on the shotgun sequencing method, which represents a relatively unbiased, non-directed approach to survey the structure and metabolic capacity of a community.

As a first step, consideration should be given to the community chosen for investigation. On the one hand, simple communities with low species diversity can be characterized thoroughly with modest sequencing effort. On the other hand, complex communities are more representative of most natural microbial assemblages, but their characterization presents myriad challenges that require special consideration. For example, it is necessary to address gaps in knowledge owing to incomplete sequence COVERAGE, and limitations that might arise owing to a lack of reproducibility that results from community heterogeneity.

COVERAGE
The average number of times a nucleotide is represented by a high-quality base in the sequence data; full genome coverage is usually attained at 8–10X coverage.

Currently, both the cost of sequencing and the challenges that are associated with the management of vast datasets preclude comprehensive genomic studies of highly complex communities. Consequently, we favour an initial approach that is based on the analysis of simpler model communities. The technical and scientific lessons that have been gleaned from these studies can then be extended to more complex systems and their generality evaluated.

Box 2 | Environmental libraries

The extraction of high quality DNA is central to the success of any sequencing project. In the case of environmental samples that contain mixed consortia of organism types, the objective is to obtain a quantitatively accurate representation of all community members during extraction and subsequent construction of shotgun sequencing libraries. Realizing this importance, microbial ecologists have invested substantial effort in optimizing DNA extraction procedures for various environmental samples15,81.

The advantage of large-insert libraries (for example, ~40 kb for fosmids) is that they provide substantial contiguous genomic information that is representative of individual community members15,82,83. For community genome sequencing and assembly, paired-end sequences from large-insert libraries are particularly useful as they provide valuable linking information for orientation and scaffolding of assembled genome fragments. Furthermore, the complete sequences of large-insert clones can be used as reference sequences for the assembly and statistical analysis of environmental shotgun sequence data22.

Despite the obvious utility of large-insert libraries, certain environments present a considerable challenge to obtaining the high-molecular-weight DNA that is suitable for large-insert library construction. For example, small-insert shotgun libraries (3- to 4-kb insert size) might be the only viable option for the acid mine drainage (AMD) biofilm community from Iron Mountain, as the many steps that are required for DNA purification result in excessive DNA shearing. Nevertheless, small-insert DNA libraries alone have been used in environmental genomic studies with considerable success9,36.

System tractability. Extreme geological environments, such as acidic geothermal hot springs, highly acidic, or hypersaline habitats, provide important geochemical and selective constraints on species diversity, which makes them ideal for high-resolution studies of microbial ecology and evolution. There are other system attributes that can enhance our ability to learn about the structure of communities and the degree to which they function as integrated synergistic assemblages. These include: first, self-sustaining systems, in which all essential metabolic functions are carried out in situ and which therefore represent a complete ecosystem microcosm; second, systems that are characterized by strong and clearly defined geochemical–microbiological feedbacks, which enables analysis of the interplay between organism function and environmental conditions; third, systems that are characterized by systematic fluctuations in environmental conditions, and that can be sampled over space and time, to understand how the community-level metabolic networks change during colonization and as a function of community membership and geochemical conditions; fourth, systems that are defined by well-established species interactions, as expected in extreme environments and other specialized habitats, such as host–pathogen and host–symbiont relationships, in which organisms have co-evolved over extended evolutionary time periods; and finally, systems that have sufficient biomass for post-genomic functional assays (such as proteomic surveys).

Sampling and defining the biogeochemical framework. To understand the ecology of a community, it is necessary to describe the associations of organisms with each other and with their environment. Characterization of ABIOTIC system attributes is important to understand the factors that control community membership. Spatial and temporal environmental heterogeneity is inextricably linked to successional changes in community composition and diversity16–18. Therefore, it is important to define physicochemical gradients such as pH, temperature, osmotic strength, mineralogy and nutrient levels, and to identify sources of energy, nutrient fluxes and feedbacks owing to microorganism–mineral interactions. For instance, geochemical patterns19 might indicate important metabolic functions in the system. In combination with genomic information, the assessment of environmental conditions that contribute to spatial or temporal heterogeneity in species composition might enable the identification of traits that are important to microbial adaptation in the community.

ABIOTIC
The non-living physical and chemical attributes of a system, which include pH, temperature, pressure, osmotic strength, and chemical composition.

The biological attributes of the system are also an important consideration. For example, the presence of a microbial species might depend upon the presence (or absence) of another species. This might be due to a metabolic dependence and is often suggested to be a phenomenon that limits the success of cultivation endeavours20. The interdependence of community members might also take the form of thermodynamic control, such as that observed in microbial consortia that can couple methane oxidation to sulphate reduction21,22. Biotic features, such as grazing and phage predation, also impact community structure. Grazing pressure that is imposed by eukaryotic protozoa, such as flagellates and ciliates, is one example of a top–down control23–25. Perhaps more important, however, is the well-documented contribution of phage to microbial mortality. The efficacy of phage predation can have profound effects on the composition of microbial assemblages by controlling dominant groups26,27. Phage-induced cell lysis can also release cellular contents into the environment, thereby influencing microbial food-web dynamics and biogeochemical processes28. Furthermore, the capacity for phage-mediated DNA transfer (transduction), or the direct release of free DNA during virus-induced host-cell lysis29, can contribute to the overall mobile gene pool in natural communities. Laterally transferred genes and genome fragments can alter the metabolic properties of the host30 and represent a primary driving factor that contributes to genomic heterogeneity, and therefore evolution, in natural species populations (REF. 31 and E.E.A. et al., unpublished observations).

Estimating the community sequencing endeavour. It is possible to predict the amount of sequencing that will be needed to analyse a given community based on the desired degree of coverage of genomes and the available information about species number, relative species abundance and genome sizes. An approximation of community diversity can be made through the analysis of 16S rRNA gene libraries, together with a quantitative assessment of relative species richness (number of species) and evenness (relative abundance of each species) using FISH. However, diversity estimates are complicated by PCR bias, rrn (ribosomal RNA gene) copy number per genome, and the fact that libraries are rarely sequenced to completion. Genome sizes can be estimated from known sizes of related species, if available, or approximated using the average prokaryotic genome size (~3.16 Mb ± 1.79 Mb; calculated from 215 prokaryotic genomes published in the Genomes Online Database at the time of writing; see the Online links box). Such estimates can prove imperfect, however, owing to marked variation in genome size in a microbial species32 and the fact that current genome databases are biased towards pathogens and symbionts, which often have reduced genomes. Correlations that have been drawn between the ecology of an organism and its genome size might provide a more refined estimate of genomic complexity for community members33.

To predict the amount of sequencing that will be required for community coverage, estimates of species richness and the abundance of the dominant organism(s) can be used with statistical methods to describe the species abundance distribution34. If the abundance of a given organism is 1%, with a genome size of 3 Mb, then 2.4 Gb of sequence would be required to obtain 8X (near complete) genome coverage of that organism. A sequencing effort of this magnitude would vastly over-sample the genomes of more dominant community members. Therefore, directed strategies to target low abundance organisms may be advantageous (see below).

It is likely that sequencing projections will be imprecise. For example, although species abundance impacts on the relative proportion of DNAs that are present in sequencing libraries, cloning bias might skew species representation. Furthermore, there might be multiple genome types per species. Therefore, predictions should be refined after the assembly of an initial sequencing increment. One simple approach is to use the coverage statistics of the assembly based on a version of the Lander–Waterman equation35 that is modified to take into consideration the relative abundance of species in the community. If the equation predicts fewer contigs than are observed, the representation of organisms in the library or effective genome sizes can be refined. The prediction should be performed iteratively as more sequence data are analysed. This approach was used to successfully predict the outcome of the community sequencing project undertaken by Tyson et al.9. Specifically, estimates based solely on community characterization by 16S rRNA gene library and quantitative FISH analyses, with an average genome size of ~2 Mb, estimated that ~80 Mb of sequence would be sufficient to cover the five dominant genome types, with individual genome coverages ranging from 0.4 to 30X. Analyses post-assembly of sequencing increments (2, 10, 15 and 25 Mb) revealed that cloning bias in sequencing libraries resulted in the significant overrepresentation of the Archaea, which prompted a reappraisal of actual genome coverages9.

Community genomics. Perhaps the primary challenge of any community genomic study that aims to extract ecological insights is to correctly assign genome fragments to organism types. In our experience, the weight of this requirement falls most heavily on genome assembly. Various genome assembly programs are currently available (ARACHNE, CAP, CELERA, EULER, JAZZ, PHRAP and TIGR assemblers, to name but a few). However, the relative efficacy with which most of these programs handle mixed community DNAs has yet to be determined (JAZZ, PHRAP and CELERA have been used previously9,36).

Conventional shotgun sequencing of microbial isolates is simplified by the fact that the sequenced clones are derived from organisms with a single genome type. In environmental samples, however, each clone represents a unique sequence that is probably derived from an individual in the community, and the genomes that are sampled come from a pool of both distinct and related genome types. This might pose challenges that currently available genome assembly programs are not designed to deal with. So far, studies have revealed that the genomes of different species have sufficient nucleotide-level sequence divergence (as well as changes in gene order) to prevent co-assembly9,36. Hurdles do arise, however, owing to genomic variation within species populations.

The resolution of strain-level differences is a fundamental goal of community genomic analysis (FIG. 1). Although many comparative genomic studies of strain variants indicate a highly conserved gene order37–39, extensive genome rearrangements in members of the same species40,41 will confuse genome assembly and can preclude the assembly of environmental shotgun sequence data22. In the AMD community, genome rearrangements that involve more than two genes were extremely rare in archaeal populations, and breakdown of conserved SYNTENY occurs primarily after species divergence (E.E.A. et al., unpublished observations). In regions where single-nucleotide polymorphisms are the predominant form of genomic heterogeneity, it is possible to define composite species genome sequences (that is, an aggregate sequence comprised of multiple strain sequence types). However, assembly is problematic in regions where members of the strain population have different gene contents (FIG. 2).

SYNTENY
Refers to the presence of two or more genes on the same chromosome. However, the term is often used to refer to the shared colinearity in orthologous gene content and gene order between genomes.

It is important to develop mixed genome assembly methods to deal with differences in gene content and gene sequence, because these phenomena can artificially terminate SCAFFOLDS or separate sequencing reads into multiple scaffolds at regions of strain genome confusion. This results in separate, but homologous, DNA fragments that can be mapped onto the composite species genome dataset (FIG. 1). Other complications owing to strain assembly include inaccurate (over)estimation of genome sizes and artificial duplication of open reading frames (ORFs) in community genome datasets. If assembly heuristics can overcome complications owing

SCAFFOLD
A genome fragment constructed by the ordering and orienting of sets of unlinked contigs generated from raw shotgun sequence data by using additional information (such as paired-end sequence information or homology data) to determine proper contig linkage and placement along the chromosome. Scaffolds can be comprised of multiple contigs.
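The sequencing-effort arithmetic described under "Estimating the community sequencing endeavour" can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method. It assumes that an organism's share of the sequenced DNA equals its relative abundance, so it ignores cloning bias and differences in genome size among community members. The function names are hypothetical. The contig estimate uses the standard Lander–Waterman form N·exp(−c·σ), and scaling the read count by relative abundance is one plausible reading of the "modified" equation mentioned in the text, not a confirmed reconstruction of it.

```python
import math


def required_sequence_mb(genome_size_mb, relative_abundance, target_coverage):
    """Community sequence (Mb) needed so one organism reaches target_coverage.

    Assumes the organism's share of sequenced DNA equals relative_abundance.
    """
    if not 0.0 < relative_abundance <= 1.0:
        raise ValueError("relative_abundance must be in (0, 1]")
    return target_coverage * genome_size_mb / relative_abundance


def expected_coverage(total_sequence_mb, genome_size_mb, relative_abundance):
    """Coverage one organism receives from a given total sequencing effort."""
    return total_sequence_mb * relative_abundance / genome_size_mb


def expected_contigs(n_reads, read_length_bp, genome_size_bp,
                     relative_abundance=1.0, min_overlap_fraction=0.0):
    """Lander-Waterman expected contig count, N * exp(-c * sigma), for one organism.

    The abundance weighting (reads scaled by relative_abundance) is an
    illustrative assumption, not necessarily the modification the authors used.
    """
    reads_from_organism = n_reads * relative_abundance
    coverage = reads_from_organism * read_length_bp / genome_size_bp
    sigma = 1.0 - min_overlap_fraction  # fraction of a read not needed for overlap
    return reads_from_organism * math.exp(-coverage * sigma)


# Worked example from the text: 1% abundance, 3 Mb genome, 8X coverage.
rare = required_sequence_mb(genome_size_mb=3.0, relative_abundance=0.01,
                            target_coverage=8.0)
print(f"Sequence needed for the rare organism: {rare / 1000:.1f} Gb")

# The same effort applied to a hypothetical dominant organism (50%, 3 Mb).
dominant = expected_coverage(rare, genome_size_mb=3.0, relative_abundance=0.5)
print(f"Coverage of a 50%-abundance organism: {dominant:.0f}X")
```

The first call reproduces the 2.4 Gb figure quoted in the text. The second shows why such an effort would vastly over-sample dominant members: under the same assumptions, an organism at 50% abundance would be covered about 400X. Comparing `expected_contigs` with the number of contigs actually observed after an initial sequencing increment is one way to revise the abundance or genome-size inputs iteratively.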