High throughput analysis of gene sequences from microbial communities

Celeste J. Brown (1), Audra K. Johnson (2), James A. Foster (2), Larry J. Forney (1)

Departments of (1) Biological Sciences and (2) Computer Science, University of Idaho, Moscow, ID 83844

Corresponding Author:
Celeste J. Brown, Ph.D.
Life Sciences South, Rm 252
Department of Biological Sciences
University of Idaho
Moscow, ID 83844-3051
Phone: 1-(208)-885-4012
Fax: 1-(208)-885-7905
Email: [email protected]

Running Title: High throughput analysis of gene sequences

Keywords: sequence analysis, ribosomal RNA, bioinformatics

Summary

Microbial ecologists often deduce the structure and function of microbial communities from microbial genetic material extracted from environmental samples. Their studies use several bioinformatics tools and databases that typically require significant manual intervention, making them time consuming and infeasible for the analysis of complex communities. Two computer programs were developed to facilitate large-scale studies that require the analysis of small subunit ribosomal RNA (SSU rRNA) gene sequences. The high throughput sequence analysis program (HITSA) identifies and groups closely related sequences by automatically removing low quality and inappropriate sequences; searching an appropriate SSU rRNA database for similar sequences using BLAST; aligning sequences and their best matches using ClustalW; and finally clustering the sequences using the neighbor-joining algorithm. HITSA produces summary tables for BLAST results, a matrix of pair-wise genetic distances, and a phylogenetic tree which the user can explore with our second program, STATGEN. STATGEN calculates and reports the mean, standard deviation, minimum, maximum and quartile values for the pair-wise genetic distances among the sequences selected by the user. HITSA and STATGEN streamline sequence data analysis and presentation for large-scale studies on the composition and dynamics of microbial communities.
Introduction

Culture-independent techniques have become essential tools for microbial ecologists interested in determining the structure and function of microbial communities and the environmental factors that influence them. The move toward these techniques began with the pioneering work of Woese (Woese, 1987; Woese et al., 1990; Wheelis et al., 1992), Pace (Olsen et al., 1986; Hugenholtz et al., 1998; DeLong and Pace, 2001) and their students and colleagues. These researchers suggested that phenetic methods were problematic for categorizing microbial species, and that molecular sequences provide an evolutionary context for classifying both cultured and uncultured microbes. Numerous novel prokaryotic phylotypes have been discovered using culture-independent methods, and about half of the approximately 50 identifiable major phyla within the domain Bacteria are known only from analyses of uncultured organisms (Rappe and Giovannoni, 2003). The range of environments surveyed and the variety of bacteria discovered have greatly expanded our appreciation of the diversity of microorganisms in the world.

Culture-independent methods typically rely on the analysis of small subunit (SSU) rRNA gene sequences that have been amplified by PCR from genomic DNA isolated from a microbial community (Weisburg et al., 1991). The gene sequences roughly reflect the relative frequencies of the microbial populations from which they were derived. Importantly, the sequences obtained can be compared to previously described sequences in databases, such as the Ribosomal Database Project II (Cole et al., 2003) or GenBank (Benson et al., 2005), to estimate the amount of divergence among sets of sequences.

Quantifying the microbial diversity within or between environmental samples in a statistically meaningful way requires many sequences from many samples. Consequently, the researcher must carefully select which cloned PCR products to sequence or risk being overwhelmed by the task.
Furthermore, there may be many species in a sample that occur at very low frequency, and a few species that are in very high numbers (Curtis et al., 2002), so increasing the number of sequenced clones to get larger sample sizes increases the probability of detecting rare sequences (Fig. 1) but at the cost of increasing the number of sequences from the most common species. Completely sequencing many nearly identical sequences in both directions is uninformative and expensive. On the other hand, short, single-strand sequences are insufficient for phylogenetic analyses. We have developed a strategy by which a large number of clones are screened by sequencing a single strand of an informative region of an SSU rRNA gene; the sequences are then clustered to distinguish closely related, common sequences from rare ones. Sequences that are representative of a particular phylotype are chosen for sequencing the entire amplified region in both directions. Such a procedure provides an estimate of the frequency of different phylotypes within the population, and concentrates most of the sequencing effort on identifying as much diversity as possible.

Analyzing large sets of sequences requires efficient bioinformatics tools. In this paper, we describe a high throughput sequence analysis program (HITSA) that significantly streamlines this effort by automatically collating and passing the results from each individual program to the next. STATGEN was developed so the results of this analysis can be summarized in a meaningful way. The program generates summary statistics for the pair-wise sequence differences among user-selected groups or sequences so that the user can make informed comparisons among sequences.

Results

One approach to the analysis of microbial community composition and structure relies on PCR amplification of SSU rRNA gene sequences from genomic DNA that has been isolated from an environmental sample.
The amplification products are typically cloned into one of several suitable vectors, and a library is constructed. The DNA sequences of cloned genes are determined using standard methods, and the data are analyzed using phylogenetic methods. We have developed a high throughput sequence analysis (HITSA) program that uses the protocol that is schematically diagrammed in Fig. 2. A discussion of each step in the procedure is given below.

Data quality assurance. Each raw sequence is checked for quality and orientation (Fig. 2; Data Quality). By default, HITSA retains sequences that (a) have fewer than 3% of the bases in the sequence called as Ns by the sequencing software, and (b) are at least 500 nucleotides long after removing the low quality sequence from the 3-prime end. This step also removes the amplification primer and vector sequence when present. Finally, only sequences that are in the correct orientation are used from this edited set of sequences. This step is important when the sequences could be in either orientation relative to the coding strand due to non-directional cloning. The sequences to be used in the subsequent analyses (Fig. 2; Edited Sequences) are written for the coding strand so that they will align properly with sequences from the databases.

Searching for similar sequences. For each new sequence, the most similar sequences from a user-specified database are found using BLAST (Fig. 2; BLAST). We typically use one of three databases of ribosomal RNA sequences in our analysis. The largest database includes all SSU rRNA gene sequences from the Ribosomal Database Project (Cole et al., 2003) that are greater than or equal to 1200 nucleotides in length. The second largest database includes only those sequences from the first database that are from cultured organisms. Because the sequences came from cultured organisms, the organisms were not taxonomically classified solely on the basis of rRNA gene sequencing.
The smallest database is composed solely of sequences greater than 1200 nt in length that come from type strains of bacterial species. Our default setting for the BLAST search returns the 25 most similar sequences.

Parsing BLAST output and fetching the best match. Tab-delimited summary tables are produced by parsing the BLAST output (Fig. 2; Parser). The tables list the query name, query length, hit id, hit description, significance, percent identity, and the start, end and total length of the query sequence that aligns with the hit sequence. One table, output5.xls, provides the results for the 25 matching sequences in the BLAST output. Another table, output1.xls, provides the results for only the best match for each good sequence. The sequence from the database that is the most similar to the good sequence is then included in the subsequent analyses (Fig. 2; Fetch).

Clustering based on sequence similarity. HITSA clusters similar sequences based on their genetic distances. Each good sequence, the best match from the database for each good sequence, thirty-nine sequences representing a broad range of Eubacterial sequences and a single Archaea sequence (Fig. 2; Reference Sequences) are aligned using ClustalW or ClustalW-MPI (Fig. 2; ClustalW). The region of the alignment corresponding to the last unique start site and the first unique end site (Fig. 2; Define Region) is extracted using Seqret (Rice et al., 2000) and used for the phylogenetic analysis. Genetic distances are calculated for the aligned region by the Jukes and Cantor method (Jukes and Cantor, 1969) and the sequences are clustered based upon these distances using the neighbor joining method (Saitou and Nei, 1987) as implemented in the PHYLIP programs DNAdist and Neighbor, respectively. The cluster analysis produces a tree in the Newick format (http://evolution.genetics.washington.edu/phylip/newicktree.html).

Generating statistics for related sequences.
The second program, STATGEN, can then be used to generate summary statistics based upon the pair-wise genetic distances (corrected percent sequence differences) among sequences that cluster together in the Newick tree. The tree and the genetic distances are opened and displayed in STATGEN. At this point the user can choose the sequences whose distances will be compared by dragging either individual sequences or clusters into the comparison window (Fig. 3). Comparisons can be made at various levels: all of the sequences chosen for a particular analysis can be compared amongst themselves, two groups of sequences can be compared to each other, or the sequences can be compared against one other sequence. For comparing a group within itself, STATGEN uses every unique pair-wise distance among the sequences in the group. For comparing one group (or one sequence) with another group, STATGEN uses every unique pair-wise distance between the two groups of sequences. The statistics generated are the minimum, first quartile, median, third quartile, maximum, mean, and standard deviation of the genetic distances in the comparison. The output is displayed both as a table and as a box plot. Generally the clusters are matched to a sequence from the BLAST results, and the comparison is between the cluster and the database sequence to determine how divergent the members of the cluster are from the closest sequence in the database. These data provide a useful way of summarizing large amounts of sequence information.

Random sampling by STATGEN. STATGEN can be used to select representative clones within a group for further analysis (sequencing). The number of samples chosen depends on the degree of variability in the group to be sampled. If there is very little variability among the sequences, then one or two sequences will be sufficient to capture most of the sequence variability in the group, while more sequences will be required for more variable groups.
STATGEN uses sampling without replacement to generate random sets from the group, and the researcher decides how many replicate datasets will be generated. To start, datasets of size two are produced from the sequences. The coefficients of variation of all pair-wise genetic distances are calculated. If the coefficient is less than or equal to a predetermined value set by the researcher, two clones are chosen at random from the group. If the predetermined value is not reached, the size of the datasets increases by one, and the coefficient of variation is calculated for the new datasets. This process is continued until a sample size with the desired level of coverage is attained, and a random sample of this size is taken from the group. A list of the randomly sampled sequences is the output from this procedure.

Discussion

We have developed an efficient, high-throughput sequence analysis program, HITSA, that allows us to rapidly discover the evolutionary relationships among a large set of single-read sequences and those in a database of previously described sequences. HITSA was developed for evaluating SSU ribosomal RNA sequences from microbial communities; however, the program can analyze any group of sequences that has a database of homologous sequences for comparison. To complement HITSA, we have also developed an interactive tool, STATGEN, that allows users to generate summary statistics for the pair-wise genetic distances generated by HITSA, and to choose a random sample of sequences from a group based upon the sequence variability within the group. STATGEN displays the clustering relationships of the sequences so that possible comparisons are easy to visualize. This tool is also generic, in that it could be used for the analysis of any square, symmetric matrix of numbers and/or cluster tree with branch lengths in Newick format.

Similar tools for classifying SSU rRNA gene sequences exist.
Bio Informatic Bacterial Identification (BIBI; http://pbil.univ-lyon1.fr/bibi/) is a web-based tool for identifying the most similar sequence from a SSU rRNA database, and uses a similar strategy to the one used here (Devulder et al., 2003). Unfortunately, BIBI only processes a single sequence at a time, and is therefore not as useful as HITSA for large sets of sequences. The RDP has a web-based tool for classifying bacterial sequences based upon Bergey’s Manual of Systematic Bacteriology (http://rdp.cme.msu.edu/classifier/classifier.jsp). This tool uses a naïve Bayesian classifier to assign sequences to the lowest taxonomic level possible. HITSA provides more information, including a summary of results from the BLAST searches, the average genetic distance among sequences and a tree that clusters sequences together based upon similarity. These results integrate seamlessly with STATGEN for summarizing results.

There are aspects of this analysis that should be taken into account when interpreting HITSA results. First, ClustalW is used to align sequences, and this algorithm does a progressive alignment, in which the most similar sequences are aligned together first followed by progressively more distantly related groups of sequences. This often results in the placement of potentially homologous gaps at different sequence positions for each group. Since these misalignments are not repaired “by hand” before clustering the sequences based upon the multiple sequence alignment, HITSA overemphasizes the similarity of sequences within a cluster and the differences between clusters. Second, we are calculating pair-wise genetic distances using the Jukes and Cantor correction. This method only corrects for the possibility of back mutations and does not consider the well-known difference in the rates of transitions and transversions or other more complicated evolutionary models (Minin et al., 2003).
However, this correction is sufficient for the distances in which we are most interested, that is, between the most similar sequences. Third, we are using the neighbor joining method to cluster our sequences based upon their genetic distances. These are unproofed, single-strand sequences, and it is probably not productive to consider which of many alternative phylogenetic algorithms should be used at this point. We deem this to be acceptable, since our intention is to identify those clones that should be sequenced completely on both strands, assembled, then proofed. These complete sequences can then be used in a more extensive phylogenetic analysis that includes correcting the sequence alignment, identifying a suitable evolutionary model, and estimating statistical support for the inferred tree topology.

Of course, the classification of sequences based upon their similarity to known sequences depends upon the quality and completeness of the database being searched. For a handful of well-studied environments, this is not much of a problem. However, for novel environments with novel microbes, this can be quite problematic. For example, the closest sequence in the database may only have 80% similarity to a sample sequence, and it is very important to distinguish between this case and very high levels of similarity. The STATGEN analysis tool makes it easy to summarize the amount of genetic variability between database and experimental sequences. The choice of which summaries are needed is left up to the researcher who can choose based upon expert knowledge. Visualizing the clusters generated by the neighbor-joining algorithm lets the researcher compare sequences based upon inferred evolutionary relationships. STATGEN can also be used to study and summarize the results of subsequent, more refined, phylogenetic analyses.
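
The distinction drawn above between an 80% match and a near-identical match can be made concrete with a small Python sketch. It applies the standard Jukes and Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites, and then computes the seven statistics that STATGEN is described as reporting. This is not STATGEN's code (STATGEN is a Java application): the function names, the example proportions, the quartile interpolation rule, and the use of the sample standard deviation are all assumptions made for illustration.

```python
import math
import statistics


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        # The logarithm is undefined once p reaches 3/4 (saturation).
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def summarize(distances):
    """Min, quartiles, max, mean and standard deviation of a list of distances.

    Assumptions: quartiles use the 'inclusive' interpolation rule and the
    standard deviation is the sample (n - 1) form; STATGEN may differ.
    """
    q1, median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    return {
        "min": min(distances),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(distances),
        "mean": statistics.mean(distances),
        "sd": statistics.stdev(distances),
    }


# Hypothetical proportions of differing sites between five cluster members
# and their best database match; the last one is only 80% identical.
observed = [0.01, 0.02, 0.02, 0.03, 0.20]
corrected = [jukes_cantor_distance(p) for p in observed]
summary = summarize(corrected)

for name, value in summary.items():
    print(f"{name:>6}: {value:.4f}")
```

With these made-up inputs, the sequence that is 80% identical to its best match has a corrected distance of about 0.233, far above the others, so it shows up in the maximum and in the standard deviation while the median stays close to the values of the near-identical sequences.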

The use of HITSA and STATGEN will benefit researchers who analyze large numbers of samples to determine patterns of microbial diversity or the temporal and spatial dynamics of microbial communities. By reducing cost and increasing efficiency, researchers can conduct more expansive studies that involve larger numbers of samples, which will increase statistical power. This is important so that such studies can move beyond mere description of communities to testing hypotheses that are important to understanding the ecology of prokaryotes.

Experimental Procedures

Implementation

The high throughput sequence analysis program (HITSA) has been implemented on a Sun Fire V880 running Solaris 5.8 and on a cluster of computers running Red Hat Linux with Sun Grid Engine. HITSA is initiated by typing a command on the user’s terminal. The command line parameters are the complete path to the program followed by the complete path to a directory containing all of the sequence files in FASTA format, followed by the complete path to the parameter file. The only adaptation of the program that is necessary for installation on any Unix-type computer is changing the “scripts” variable in the HITSA script to name the directory containing all of the HITSA program files. All other variables are set in the parameter file.

STATGEN is a graphical Java 1.4 Swing application programmed using the Java API. It has been implemented on computers running under Windows XP, Mac OS X, Solaris and Linux. The user provides STATGEN with the files containing the genetic distance matrix and/or a Newick tree file. STATGEN uses the distances from the matrix if it is provided; otherwise it uses the branch lengths from the Newick tree.

HITSA and STATGEN are available at www.ibest.uidaho.edu/tools/.

Software

Several programs must be installed for HITSA and STATGEN to function. HITSA uses a bash script to invoke other programs and scripts to transform data from one program for use by another.
HITSA must have access to Perl (v. 5.8), BioPerl (v. 1.3), BLAST (Altschul et al., 1997), Seqret from the EMBOSS package (Rice et al., 2000), DNAdist and Neighbor from the PHYLIP package (Felsenstein, 2004), and ClustalW (Thompson et al., 1994) or ClustalW-MPI for parallel runs (Li, 2003). Parallel jobs are submitted using the Sun Grid Engine. STATGEN requires a Java Virtual Machine (JVM); a Java 1.4 compiler is needed to compile it from source code.

Databases

HITSA requires a database of sequences that has been formatted for BLAST searches (Table 1; Database; Fig. 2; Sequence db).

Parameter file

A parameter file is used to set the various options for HITSA. Table 1 lists the options, current defaults, and a brief description of the purpose for each option. The Suffix parameter must be a file extension that is unique to the sequence files to be analyzed. The Npercent parameter sets the proportion of ambiguous bases (N) that HITSA will allow in a good quality sequence. Primer5 and Primer3 are the nucleotide sequences that define the PCR primers used to amplify the sequences. The default setting for these parameters is currently set to the 3’ ends of 8F and 926R (Weisburg et al., 1991). The Direction parameter indicates the direction of the raw sequences relative to the database sequences. Those sequences whose orientation is the opposite of those in the database are converted into their reverse complements. In order to determine the direction of the raw sequences and to confirm that the sequences are of SSU rRNA genes (or whichever gene has been sequenced), HITSA requires an example sequence in the forward direction. The BlastSeq parameter provides the complete path to this FASTA formatted sequence. These parameters are used in the Data Quality step (Fig. 2).

The parameters Nhits, LengthCutoff, and Database are required for the BLAST search (Fig. 2). Nhits sets the number of matching sequences that BLAST reports from its search.
These are the Nhits best matches from the database as determined by their score (see BLAST documentation, www.ncbi.nlm.nih.gov/blast). The LengthCutoff is the minimum total length of the matching sequence aligned to the query sequence that is required to count as one of the Nhits matches. This parameter is necessary because some SSU rRNA gene sequences share significant similarity with the template strand of distantly related SSU rRNA genes. Some experimentation may be necessary to find the optimal setting for this parameter for other genes. Database is the complete path to the BLAST-formatted database (see Databases above).

The RefStrains and Root parameters are used to organize sequences from other sources that will be used in the alignment and clustering steps of HITSA (Fig. 2; ClustalW, Neighbor). For each raw sequence, the FASTA sequence for the match from the database that has the highest score (or the first in the list of matches returned by BLAST) is included in these subsequent steps. This sequence is copied from the BLAST-formatted database (Fig. 2; Sequence db). Another file of FASTA sequences, whose complete path is given by RefStrains, can also be included in the subsequent steps. Including the same reference sequences in the analyses of different samples provides a basis for comparing the clusters among environmental samples. The Root parameter names which of the RefStrains sequences should be used to root the tree created by Neighbor. This name should match the description line of one of the RefStrains sequences. The cluster tree and distance matrix will be labeled by the description line of each FASTA file.

Output files

When HITSA begins, it creates two directories and numerous files within the top level directory that contains the experimental sequences. The output files are in a directory named 8Fphylo. These files are output1.xls, output5.xls, finaldistances and finaltree.txt.
If HITSA is rerun using the same top level directory, all of the files and directories are overwritten by the new analysis.

The results from STATGEN can be saved as an HTML file, with boxplot graphics saved separately in jpeg format. The list of randomly sampled sequences can also be saved as a text file.

Acknowledgements

The research was supported by NIH grant P20 RR16448 from the COBRE Program of the National Center for Research Resources and NIH grant P20 RR016454 from the INBRE Program of the National Center for Research Resources. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIH. We would also like to thank Mayee Wong and Xia Zhou for help with designing the pipeline, Raymond N. Brown for helpful comments on the manuscript and Christopher J. Williams for statistical assistance.

Table 1. Options that can be customized in the parameter file of HITSA

Option         Default                                       Purpose
Suffix         .seq                                          Defining extension for sequence files
Npercent       .03                                           Proportion of uncalled bases tolerated
Primer5        GACTCGGTMC                                    5’ PCR primer
Primer3        TTGARTTTCC                                    3’ PCR primer
Direction      FORWARD                                       Desired sequence direction relative to coding strand
BlastSeq       $scripts/af243169.for [a]                     Sequence used for determining sequence orientation
Nhits          25                                            Number of matches to return from BLAST
LengthCutoff   175                                           Total length for a match to be included
Database       /PATH/databases/rdp_species/rdp_species [b]   Database formatted for BLAST
RefStrains     $scripts/RefStrains                           Reference sequences for alignment and clustering
Root           Methanococcus jannaschii                      Sequence from RefStrains that is used to root the tree produced by Neighbor

[a] $scripts is the path to the location of HITSA program files, a variable set when HITSA is installed
[b] PATH is the location of the database files and is set only in the parameter file

Figure Legends

Figure 1. Probability of detecting microbes that occur at low frequency.
Population frequency is along the x-axis and probability of detection is on the y-axis. Each curve represents a different sample size.

Figure 2. Flow chart for the HIgh Throughput Sequence Analysis. The central column shows the programs used at each step. Flanking columns show inputs into and outputs from each program with arrows indicating direction of data flow. Dark grey boxes indicate input provided by the user, black boxes indicate files that can be used in subsequent analyses.

Figure 3. A screenshot of STATGEN in action. A) Newick tree structure of clustered sequences. B) List of sequences. “Drag and Drop” is used to move sequence names from A or B to C. C) Groups of sequences and the sequences to which they will be compared. D) Summary statistics for Group Brevibacillus levickii. Genetic distances between each of the “Sequences” listed in C and highlighted in E (grey) compared to the sequence listed in C under "Compare" are summarized. The "Range Indicator" shows the range of genetic distances for the boxplot. The boxplot for the Brevibacillus levickii comparison is below, along with a table of the statistics. E) Tree view of clustered sequences.

[Figure 1: plot of Probability (0 to 1) against Population Frequency (0 to 0.03), with curves for sample sizes of 25, 100 and 400.]

[Figure 2: flow chart with boxes labeled Parameter File, Raw Sequences, Data Quality, Bad Sequences, Good Sequences, Sequence DB, BLAST Output, Parser, BLAST Summary Table, Best match list, Fetch, Best match sequences, Reference Sequences, ClustalW, Aligned sequences, Define Region, Alignment Points, Seqret, Aligned region, DNAdist, Distance Matrix, Neighbor, Newick Tree.]

[Figure 3: screenshot with panels A to E.]

References

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. (2005) GenBank. Nucleic Acids Res 33: D34-38.
Cole, J.R., Chai, B., Marsh, T.L., Farris, R.J., Wang, Q., Kulam, S.A. et al. (2003) The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res 31: 442-443.
Curtis, T.P., Sloan, W.T., and Scannell, J.W. (2002) Estimating prokaryotic diversity and its limits. Proc Natl Acad Sci U S A 99: 10494-10499.
DeLong, E.F., and Pace, N.R. (2001) Environmental diversity of bacteria and archaea. Syst Biol 50: 470-478.
Devulder, G., Perriere, G., Baty, F., and Flandrois, J.P. (2003) BIBI, a bioinformatics bacterial identification tool. J Clin Microbiol 41: 1785-1787.
Felsenstein, J. (2004) PHYLIP (Phylogenetic Inference Package). In. Department of Genome Sciences, University of Washington, Seattle: Distributed by the author.
Hugenholtz, P., Goebel, B.M., and Pace, N.R. (1998) Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. J Bacteriol 180: 4765-4774.
Jukes, T.H., and Cantor, C.R. (1969) Evolution of protein molecules. In Mammalian Protein Metabolism. Munro, N.H. (ed). New York: Academic Press, pp. 21-132.
Li, K.B. (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19: 1585-1586.
Minin, V., Abdo, Z., Joyce, P., and Sullivan, J. (2003) Performance-based selection of likelihood models for phylogeny estimation. Syst Biol 52: 674-683.
Olsen, G.J., Lane, D.J., Giovannoni, S.J., Pace, N.R., and Stahl, D.A. (1986) Microbial ecology and evolution: a ribosomal RNA approach. Annu Rev Microbiol 40: 337-365.
Rappe, M.S., and Giovannoni, S.J. (2003) The uncultured microbial majority. Annu Rev Microbiol 57: 369-394.
Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16: 276-277.
Saitou, N., and Nei, M.
(1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4: 406-425.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680.
Weisburg, W.G., Barns, S.M., Pelletier, D.A., and Lane, D.J. (1991) 16S ribosomal DNA amplification for phylogenetic study. J Bacteriol 173: 697-703.
Wheelis, M.L., Kandler, O., and Woese, C.R. (1992) On the nature of global classification. Proc Natl Acad Sci U S A 89: 2930-2934.
Woese, C.R. (1987) Bacterial evolution. Microbiol Rev 51: 221-271.
Woese, C.R., Kandler, O., and Wheelis, M.L. (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87: 4576-4579.