High throughput analysis of gene sequences from microbial communities

Celeste J. Brown¹, Audra K. Johnson², James A. Foster², Larry J. Forney¹
Departments of ¹Biological Sciences and ²Computer Science
University of Idaho, Moscow, ID 83844
Corresponding Author:
Celeste J. Brown, Ph.D.
Life Sciences South, Rm 252
Department of Biological Sciences
University of Idaho
Moscow, ID 83844-3051
Phone: 1-(208)-885-4012
Fax: 1-(208)-885-7905
Email: [email protected]
Running Title: High throughput analysis of gene sequences
Keywords: sequence analysis, ribosomal RNA, bioinformatics
Summary
Microbial ecologists often deduce the structure and function of microbial communities from microbial
genetic material extracted from environmental samples. Their studies use several bioinformatics tools and
databases that typically require significant manual intervention, making them time consuming and
infeasible for the analysis of complex communities. Two computer programs were developed to facilitate
large-scale studies that require the analysis of small subunit ribosomal RNA (SSU rRNA) gene
sequences. The high throughput sequence analysis program (HITSA) identifies and groups closely related
sequences by automatically removing low quality and inappropriate sequences; searching an appropriate
SSU rRNA database for similar sequences using BLAST; aligning sequences and their best matches using
ClustalW; and finally clustering the sequences using the neighbor-joining algorithm. HITSA produces
summary tables for BLAST results, a matrix of pair-wise genetic distances, and a phylogenetic tree which
the user can explore with our second program, STATGEN. STATGEN calculates and reports the mean,
standard deviation, minimum, maximum and quartile values for the pair-wise genetic distances among the
sequences selected by the user. HITSA and STATGEN streamline sequence data analysis and presentation
for large-scale studies on the composition and dynamics of microbial communities.
Introduction
Culture-independent techniques have become essential tools for microbial ecologists interested in
determining the structure and function of microbial communities and the environmental factors that
influence them. The move toward these techniques began with the pioneering work of Woese (Woese,
1987; Woese et al., 1990; Wheelis et al., 1992), Pace (Olsen et al., 1986; Hugenholtz et al., 1998; DeLong
and Pace, 2001) and their students and colleagues. These researchers suggested that phenetic methods
were problematic for categorizing microbial species, and that molecular sequences provide an
evolutionary context for classifying both cultured and uncultured microbes. Numerous novel prokaryotic
phylotypes have been discovered using culture-independent methods, and about half of the approximately
50 identifiable major phyla within the domain Bacteria are known only from analyses of uncultured
organisms (Rappe and Giovannoni, 2003). The range of environments surveyed and the variety of
bacteria discovered have greatly expanded our appreciation of the diversity of microorganisms in the
world.
Culture-independent methods typically rely on the analysis of small subunit (SSU) rRNA gene
sequences that have been amplified by PCR from genomic DNA isolated from a microbial community
(Weisburg et al., 1991). The gene sequences roughly reflect the relative frequencies of the microbial
populations from which they were derived. Importantly, the sequences obtained can be compared to
previously described sequences in databases, such as the Ribosomal Database Project II (Cole et al.,
2003) or GenBank (Benson et al., 2005), to estimate the amount of divergence among sets of sequences.
To quantify the microbial diversity within or between environmental samples in a statistically
meaningful way requires many sequences from many samples. Consequently, the researcher must
carefully select which cloned PCR products to sequence or risk being overwhelmed by the task.
Furthermore, there may be many species in a sample that occur at very low frequency, and a few species
that are in very high numbers (Curtis et al., 2002), so increasing the number of sequenced clones to get
larger sample sizes increases the probability of detecting rare sequences (Fig. 1) but at the cost of
increasing the number of sequences from the most common species. Completely sequencing many
nearly identical sequences in both directions is uninformative and expensive. On the other hand, short,
single-strand sequences are insufficient for phylogenetic analyses. We have developed a strategy by
which a large number of clones are screened by sequencing a single strand of an informative region of an
SSU rRNA gene; the sequences are then clustered to distinguish closely related, common sequences from rare
ones. Sequences that are representative of a particular phylotype are chosen for sequencing the entire
amplified region in both directions. Such a procedure provides an estimate of the frequency of different
phylotypes within the population, and concentrates most of the sequencing effort on identifying as much
diversity as possible.
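The trade-off between sample size and the detection of rare sequences (Fig. 1) can be illustrated numerically. The following is a minimal sketch, assuming that clones are drawn independently, so that the probability of observing at least one clone from a population at frequency f in a sample of n clones is 1 - (1 - f)^n. The function name is hypothetical, and the formula is an assumption of this sketch rather than a stated description of how Fig. 1 was computed.

```python
def detection_probability(frequency: float, sample_size: int) -> float:
    """Probability that at least one of `sample_size` independently drawn
    clones comes from a population present at relative `frequency`.

    Assumption: clones are sampled independently (binomial sampling).
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    # Complement of the probability that every clone misses the population.
    return 1.0 - (1.0 - frequency) ** sample_size


if __name__ == "__main__":
    # Sample sizes taken from the curves in Fig. 1.
    for n in (25, 100, 400):
        print(n, round(detection_probability(0.01, n), 3))
```

Under this assumption, a population at a frequency of 1% is more likely to be missed than detected in a sample of 25 clones, which is why larger screens are needed to find rare phylotypes.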
Analyzing large sets of sequences requires efficient bioinformatics tools. In this paper, we describe a
high throughput sequence analysis program (HITSA) that significantly streamlines this effort by
automatically collating and passing the results from each individual program to the next. STATGEN was
developed so the results of this analysis can be summarized in a meaningful way. The program generates
summary statistics for the pair-wise sequence differences among user-selected groups or sequences so
that the user can make informed comparisons among sequences.
Results
One approach to the analysis of microbial community composition and structure relies on PCR
amplification of SSU rRNA gene sequences from genomic DNA that has been isolated from an
environmental sample. The amplification products are typically cloned into one of several suitable
vectors, and a library is constructed. The DNA sequences of cloned genes are determined using standard
methods, and the data are analyzed using phylogenetic methods. We have developed a high throughput
sequence analysis (HITSA) program that uses the protocol that is schematically diagrammed in Fig. 2. A
discussion of each step in the procedure is given below.
Data quality assurance. Each raw sequence is checked for quality and orientation (Fig. 2; Data Quality).
By default, HITSA retains sequences that (a) have fewer than 3% of the bases in the sequence called as
Ns by the sequencing software, and (b) are at least 500 nucleotides long after removing the low quality
sequence from the 3-prime end. This step also removes the amplification primer and vector sequence
when present. Finally, only sequences that are in the correct orientation are used from this edited set of
sequences. This step is important when the sequences could be in either orientation relative to the coding
strand due to non-directional cloning. The sequences to be used in the subsequent analyses (Fig. 2; Edited
Sequences) are written for the coding strand so that they will align properly with sequences from the
databases.
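The two default retention criteria can be expressed compactly. The following is a minimal sketch, assuming that low quality 3-prime sequence, primer and vector have already been trimmed from the input; the function names are hypothetical and do not correspond to names in the HITSA scripts.

```python
def n_fraction(sequence: str) -> float:
    """Fraction of bases called as N (case-insensitive)."""
    if not sequence:
        return 1.0  # treat an empty read as entirely uncalled
    return sequence.upper().count("N") / len(sequence)


def passes_quality(sequence: str,
                   max_n_fraction: float = 0.03,
                   min_length: int = 500) -> bool:
    """Apply the two default criteria described above.

    (a) fewer than `max_n_fraction` of the bases are N, and
    (b) the trimmed sequence is at least `min_length` nucleotides long.
    Assumption: `sequence` has already been trimmed.
    """
    return len(sequence) >= min_length and n_fraction(sequence) < max_n_fraction
```

The threshold in this sketch corresponds to the Npercent option in Table 1; the minimum length is written here as an ordinary argument because Table 1 does not list a separate option for it.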
Searching for similar sequences. For each new sequence, the most similar sequences from a user-specified database are found using BLAST (Fig. 2; BLAST). We typically use one of three databases of
ribosomal RNA sequences in our analysis. The largest database includes all SSU rRNA gene sequences
from the Ribosomal Database Project (Cole et al., 2003) that are greater than or equal to 1200 nucleotides
in length. The second largest database includes only those sequences from the first database that are from
cultured organisms. Because the sequences came from cultured organisms, the organisms were not
taxonomically classified solely on the basis of rRNA gene sequencing. The smallest database is composed
solely of sequences greater than 1200 nt in length that come from type strains of bacterial species. Our
default setting for the BLAST search returns the 25 most similar sequences.
Parsing BLAST output and fetching the best match. Tab-delimited, summary tables are produced by
parsing the BLAST output (Fig. 2; Parser). The tables list the query name, query length, hit id, hit
description, significance, percent identity, and the start, end and total length of the query sequence that
aligns with the hit sequence. One table, output5.xls, provides the results for the 25 matching sequences in
the BLAST output. Another table, output1.xls, provides the results for only the best match for each good
sequence. The sequence from the database that is the most similar to the good sequence is then included
in the subsequent analyses (Fig. 2; Fetch).
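The two summary tables can be pictured as follows. This is a minimal sketch, assuming the BLAST output has already been parsed into one dictionary per hit, listed best first for each query; the column identifiers and function names are hypothetical, and the HITSA parser itself is written in Perl.

```python
import csv
import io
from typing import Dict, List

# Columns follow the fields listed in the text; the identifiers are illustrative.
COLUMNS = ["query_name", "query_length", "hit_id", "hit_description",
           "significance", "percent_identity",
           "query_start", "query_end", "aligned_length"]


def best_hits(rows: List[Dict]) -> List[Dict]:
    """Keep the first-listed hit for each query (BLAST lists hits best first)."""
    best: Dict[str, Dict] = {}
    for row in rows:
        best.setdefault(row["query_name"], row)
    return list(best.values())


def to_tsv(rows: List[Dict]) -> str:
    """Render rows as a tab-delimited table with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In this sketch, `to_tsv(rows)` plays the role of the table of all reported matches and `to_tsv(best_hits(rows))` the role of the best-match table.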
Clustering based on sequence similarity. HITSA clusters similar sequences based on their genetic
distances. Each good sequence, the best match from the database for each good sequence, thirty-nine
sequences representing a broad range of eubacterial sequences and a single archaeal sequence (Fig. 2;
Reference Sequences) are aligned using ClustalW or ClustalW-MPI (Fig. 2; ClustalW). The region of the
alignment between the last unique start site and the first unique end site (Fig. 2; Define Region)
is extracted using Seqret (Rice et al., 2000) and used for the phylogenetic analysis. Genetic distances are
calculated for the aligned region by the Jukes and Cantor method (Jukes and Cantor, 1969) and the
sequences are clustered based upon these distances using the neighbor joining method (Saitou and Nei,
1987) as implemented in the PHYLIP programs DNAdist and Neighbor, respectively. The cluster analysis
produces a tree in the Newick format (http://evolution.genetics.washington.edu/phylip/newicktree.html).
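The region definition and the distance correction can be sketched as follows. This is a minimal illustration, assuming the alignment is held as equal-length strings with "-" for gaps; the function names are hypothetical. The Jukes and Cantor distance is d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. Positions where either sequence is not A, C, G or T are skipped here, which is an assumption of this sketch; DNAdist may treat gaps and ambiguity codes differently.

```python
import math
from typing import List, Tuple

GAP = "-"


def common_region(alignment: List[str]) -> Tuple[int, int]:
    """Return (start, end) columns, end exclusive, bounded by the last start
    site and the first end site among the aligned sequences."""
    starts = [len(row) - len(row.lstrip(GAP)) for row in alignment]
    ends = [len(row.rstrip(GAP)) for row in alignment]
    start, end = max(starts), min(ends)
    if start >= end:
        raise ValueError("sequences share no common aligned region")
    return start, end


def jukes_cantor_distance(a: str, b: str) -> float:
    """Jukes and Cantor corrected distance between two aligned sequences."""
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":  # skip gaps and ambiguity codes
            compared += 1
            differences += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    if differences == 0:
        return 0.0
    p = differences / compared
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Slicing every row with the returned bounds, for example `row[start:end]`, gives the block of columns that all sequences cover before distances are calculated.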
Generating statistics for related sequences. The second program, STATGEN, can then be used to
generate summary statistics based upon the pair-wise genetic distances (corrected percent sequence
differences) among sequences that cluster together in the Newick tree. The tree and the genetic distances
are opened and displayed in STATGEN. At this point the user can choose the sequences whose distances
will be compared by dragging either individual sequences or clusters into the comparison window (Fig.
3). Comparisons can be made at various levels: all of the sequences chosen for a particular analysis can
be compared amongst themselves, two groups of sequences can be compared to each other, or the
sequences can be compared against one other sequence. For comparing a group within itself, STATGEN
uses every unique pair-wise distance among the sequences in the group. For comparing one group (or one
sequence) with another group, STATGEN uses every unique pair-wise distance between the two groups of
sequences. The statistics generated are the minimum, first quartile, median, third quartile, maximum,
mean, and standard deviation of the genetic distances in the comparison. The output is displayed both as
a table and as a box plot. Generally the clusters are matched to a sequence from the BLAST results, and
the comparison is between the cluster and the database sequence to determine how divergent the members
of the cluster are from the closest sequence in the database. These data provide a useful way of
summarizing large amounts of sequence information.
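The within-group and between-group comparisons can be sketched as follows. This is a minimal illustration, assuming the pair-wise distances are stored in a dictionary keyed by unordered pairs of sequence names and that the two groups in a between-group comparison do not overlap. STATGEN itself is a Java application; the names below are hypothetical, and the quartile convention used by STATGEN is not stated, so the "inclusive" method here is an assumption.

```python
import statistics
from itertools import combinations, product
from typing import Dict, FrozenSet, List, Sequence

Distances = Dict[FrozenSet[str], float]


def within_group(distances: Distances, group: Sequence[str]) -> List[float]:
    """Every unique pair-wise distance among the members of one group."""
    return [distances[frozenset(pair)] for pair in combinations(group, 2)]


def between_groups(distances: Distances,
                   group_a: Sequence[str],
                   group_b: Sequence[str]) -> List[float]:
    """Every pair-wise distance between two non-overlapping groups."""
    return [distances[frozenset((a, b))] for a, b in product(group_a, group_b)]


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Minimum, quartiles, maximum, mean and standard deviation."""
    if len(values) < 2:
        raise ValueError("at least two distances are required")
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": median, "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }
```

Comparing a cluster against a single database sequence is the special case in which one of the two groups contains one name.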
Random sampling by STATGEN. STATGEN can be used to select representative clones within a group
for further analysis (sequencing). The number of samples chosen depends on the degree of variability in
the group to be sampled. If there is very little variability among the sequences, then one or two sequences
will be sufficient to capture most of the sequence variability in the group, while more sequences will be
required for more variable groups.
STATGEN uses sampling without replacement to generate random sets from the group, and the
researcher decides how many replicate datasets will be generated. To start, datasets of size two are
produced from the sequences. The coefficients of variation of all pair-wise genetic distances are
calculated. If the coefficient is less than or equal to a predetermined value set by the researcher, two
clones are chosen at random from the group. If the predetermined value is not reached, the size of the
datasets increases by one, and the coefficient of variation is calculated for the new datasets. This process is
continued until a sample size with the desired level of coverage is attained, and a random sample of this
size is taken from the group. A list of the randomly sampled sequences is the output from this procedure.
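One possible reading of this procedure is sketched below. The description above does not specify exactly which quantity the coefficient of variation is taken over, so this sketch makes an explicit assumption: for each dataset size, the mean pair-wise distance is computed within every replicate dataset, and the coefficient of variation is taken across those replicate means. All names are hypothetical and the sketch is not the STATGEN implementation.

```python
import random
import statistics
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Sequence

Distances = Dict[FrozenSet[str], float]


def mean_pairwise(distances: Distances, names: Sequence[str]) -> float:
    """Mean of all unique pair-wise distances within one dataset."""
    return statistics.mean(distances[frozenset(p)] for p in combinations(names, 2))


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Standard deviation divided by the mean (0 when the mean is 0)."""
    mean = statistics.mean(values)
    if mean == 0:
        return 0.0
    return statistics.stdev(values) / mean


def choose_representatives(distances: Distances,
                           group: Sequence[str],
                           max_cv: float,
                           replicates: int = 100,
                           rng: Optional[random.Random] = None) -> List[str]:
    """Grow the dataset size from two until the replicate means are stable
    enough (assumed criterion), then draw one random sample of that size."""
    if replicates < 2:
        raise ValueError("at least two replicate datasets are required")
    rng = rng or random.Random()
    members = list(group)
    if len(members) < 2:
        return members
    size = 2
    for size in range(2, len(members) + 1):
        # random.sample draws without replacement.
        means = [mean_pairwise(distances, rng.sample(members, size))
                 for _ in range(replicates)]
        if coefficient_of_variation(means) <= max_cv:
            break
    return rng.sample(members, size)
```

Under this reading, a group with little variability stops at two sequences, while a more variable group requires larger datasets before the replicate means agree.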
Discussion
We have developed an efficient, high-throughput sequence analysis program, HITSA, that allows
us to rapidly discover the evolutionary relationships among a large set of single-read sequences and those
in a database of previously described sequences. HITSA was developed for evaluating SSU ribosomal
RNA sequences from microbial communities; however, the program can analyze any group of sequences
that has a database of homologous sequences for comparison. To complement HITSA, we have also
developed an interactive tool, STATGEN, that allows users to generate summary statistics for the pair-wise
genetic distances generated by HITSA, and to choose a random sample of sequences from a group based
upon the sequence variability within the group. STATGEN displays the clustering relationships of the
sequences so that possible comparisons are easy to visualize. This tool is also generic in that it could be used for
the analysis of any square, symmetric matrix of numbers and/or cluster tree with branch lengths in
Newick format.
Similar tools for classifying SSU rRNA gene sequences exist. Bio Informatic Bacterial
Identification (BIBI; http://pbil.univ-lyon1.fr/bibi/) is a web-based tool for identifying the most similar
sequence from an SSU rRNA database, and uses a similar strategy to the one used here (Devulder et al.,
2003). Unfortunately, BIBI only processes a single sequence at a time, and is therefore not as useful as
HITSA for large sets of sequences. The RDP has a web-based tool for classifying bacterial sequences
based upon Bergey’s Manual of Systematic Bacteriology (http://rdp.cme.msu.edu/classifier/classifier.jsp).
This tool uses a naïve Bayesian classifier to assign sequences to the lowest taxonomic level possible.
HITSA provides more information, including a summary of results from the BLAST searches, the average
genetic distance among sequences and a tree that clusters sequences together based upon similarity.
These results integrate seamlessly with STATGEN for summarizing results.
There are aspects of this analysis that should be taken into account when interpreting HITSA
results. First, ClustalW is used to align sequences, and this algorithm does a progressive alignment, in
which the most similar sequences are aligned together first followed by progressively more distantly
related groups of sequences. This often results in the placement of potentially homologous gaps at
different sequence positions for each group. Since these misalignments are not repaired “by hand” before
clustering the sequences based upon the multiple sequence alignment, HITSA overemphasizes the
similarity of sequences within a cluster and the differences between clusters. Second, we are calculating
pair-wise genetic distances using the Jukes and Cantor correction. This method only corrects for the
possibility of back mutations and does not consider the well-known difference in the rates of transitions
and transversions or other more complicated evolutionary models (Minin et al., 2003). However, this
correction is sufficient for the distances in which we are most interested, that is, between the most similar
sequences. Third, we are using the neighbor joining method to cluster our sequences based upon their
genetic distances. These are unproofed, single-strand sequences, and it is probably not productive to
consider which of many alternative phylogenetic algorithms should be used at this point. We deem this to
be acceptable, since our intention is to identify those clones that should be sequenced completely on both
strands, assembled, then proofed. These complete sequences can then be used in a more extensive
phylogenetic analysis that includes correcting the sequence alignment, identifying a suitable evolutionary
model, and estimating statistical support for the inferred tree topology.
Of course, the classification of sequences based upon their similarity to known sequences
depends upon the quality and completeness of the database being searched. For a handful of well-studied
environments, this is not much of a problem. However, for novel environments with novel microbes, this
can be quite problematic. For example, the closest sequence in the database may only have 80% similarity
to a sample sequence, and it is very important to distinguish between this case and very high levels of
similarity. The STATGEN analysis tool makes it easy to summarize the amount of genetic variability
between database and experimental sequences. The choice of which summaries are needed is left up to
the researcher who can choose based upon expert knowledge. Visualizing the clusters generated by the
neighbor-joining algorithm lets the researcher compare sequences based upon inferred evolutionary
relationships. STATGEN can also be used to study and summarize the results of subsequent, more refined,
phylogenetic analyses.
The use of HITSA and STATGEN will benefit researchers who analyze large numbers of samples
to determine patterns of microbial diversity or the temporal and spatial dynamics of microbial
communities. By reducing cost and increasing efficiency, researchers can conduct more expansive studies
that involve larger numbers of samples, which will increase statistical power. This is important so that
such studies can move beyond mere description of communities to testing hypotheses that are important
to understanding the ecology of prokaryotes.
Experimental Procedures
Implementation
The high throughput sequence analysis (HITSA) program has been implemented on a Sun Fire V880
running Solaris 5.8 and on a cluster of computers running Red Hat Linux with Sun Grid Engine. HITSA is
initiated by typing a command on the user’s terminal. The command line parameters are the complete
path to the program followed by the complete path to a directory containing all of the sequence files in
FASTA format, followed by the complete path to the parameter file. The only adaptation of the program
that is necessary for installation on any Unix-type computer is changing the “scripts” variable in the
HITSA script to name the directory containing all of the HITSA program files. All other variables are set
in the parameter file.
STATGEN is a graphical Java 1.4 Swing application programmed using the Java API. It has been
implemented on computers running under Windows XP, Mac OS X, Solaris and Linux. The user provides
STATGEN with the files containing the genetic distance matrix and/or a Newick tree file. STATGEN uses
the distances from the matrix if it is provided; otherwise it uses the branch lengths from a Newick tree.
HITSA and STATGEN are available at www.ibest.uidaho.edu/tools/.
Software
Several programs must be installed for HITSA and STATGEN to function. HITSA uses a bash
script to invoke other programs and scripts to transform data from one program for use by another.
HITSA must have access to Perl (v. 5.8), BioPerl (v. 1.3), BLAST (Altschul et al., 1997), Seqret from the
EMBOSS package (Rice et al., 2000), DNAdist and Neighbor from the PHYLIP package
(Felsenstein, 2004), and ClustalW (Thompson et al., 1994) or, for parallel runs, ClustalW-MPI (Li, 2003).
Parallel jobs are submitted using the Sun Grid Engine. STATGEN requires a Java Virtual Machine (JVM)
to run and a Java 1.4 compiler to compile from source code.
Databases
HITSA requires a database of sequences that has been formatted for BLAST searches (Table 1;
Database; Fig. 2; Sequence db).
Parameter file
A parameter file is used to set the various options for HITSA. Table 1 lists the options, current
defaults, and a brief description of the purpose for each option. The Suffix parameter must be a file extension
that is shared by all of the sequence files to be analyzed and is unique to them. The Npercent parameter sets
the proportion of ambiguous bases (N)
that HITSA will allow in a good quality sequence. Primer5 and Primer3 are the nucleotide sequences that
define the PCR primers used to amplify the sequences. The default setting for these parameters is
currently set to the 3’ ends of 8F and 926R (Weisburg et al., 1991). The Direction parameter indicates the
direction of the raw sequences relative to the database sequences. Those sequences whose orientation is
the opposite of those in the database are converted to their reverse complements. In order to determine
the direction of the raw sequences and to confirm that the sequences are of SSU rRNA genes (or
whichever gene has been sequenced), HITSA requires an example sequence in the forward direction. The
BlastSeq parameter provides the complete path to this FASTA formatted sequence. These parameters are
used in the Data Quality step (Fig. 2).
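The reorientation step can be sketched as follows. This is a minimal illustration, assuming the observed direction of each read has already been determined (for example, from the strand of its match to the example sequence); the function names and the "REVERSE" label are hypothetical. The translation table covers the IUPAC ambiguity codes, such as the M and R that appear in the default primers.

```python
# Complement table for A, C, G, T and the IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a nucleotide sequence (upper case)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def orient(sequence: str, observed: str, desired: str = "FORWARD") -> str:
    """Return `sequence` in the desired direction.

    `observed` and `desired` are labels such as "FORWARD" or "REVERSE";
    the sequence is reverse complemented only when the two differ.
    """
    if observed.upper() == desired.upper():
        return sequence.upper()
    return reverse_complement(sequence)
```

For example, `orient(read, "REVERSE")` returns the read rewritten for the coding strand when the default Direction of FORWARD is wanted.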
The parameters Nhits, LengthCutoff, and Database are required for the BLAST search (Fig. 2).
Nhits sets the number of matching sequences that BLAST reports from its search. These are the Nhits best
matches from the database as determined by their score (see BLAST documentation,
www.ncbi.nlm.nih.gov/blast). The LengthCutoff is the minimum total length of the matching sequence
aligned to the query sequence that is required to count as one of the Nhits matches. This parameter is
necessary because some SSU rRNA gene sequences share significant similarity with the template strand
of distantly related SSU rRNA genes. Some experimentation may be necessary to find the optimal setting
for this parameter for other genes. Database is the complete path to the BLAST-formatted database (see
Databases above).
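The interaction of Nhits and LengthCutoff can be sketched as follows. This is a minimal illustration, assuming the matches for one query are available as records with a score and a total aligned length; the record type and function name are hypothetical and the sketch does not reproduce the HITSA scripts.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    hit_id: str
    score: float
    aligned_length: int  # total length of the match aligned to the query


def select_hits(hits: List[Hit],
                n_hits: int = 25,
                length_cutoff: int = 175) -> List[Hit]:
    """Drop matches shorter than the cutoff, then keep the n_hits best by score."""
    long_enough = [h for h in hits if h.aligned_length >= length_cutoff]
    return sorted(long_enough, key=lambda h: h.score, reverse=True)[:n_hits]
```

Applying the length filter before ranking means that a short, high-scoring match cannot displace a longer match from the reported list.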
The RefStrains and Root parameters are used to organize sequences from other sources that will
be used in the alignment and clustering steps of HITSA (Fig. 2; ClustalW, Neighbor). For each raw
sequence, the FASTA sequence for the match from the database that has the highest score (or the first in
the list of matches returned by BLAST) is included in these subsequent steps. This sequence is copied
from the BLAST-formatted database (Fig. 2; Sequence db). Another file of FASTA sequences, whose
complete path is given by RefStrains, can also be included in the subsequent steps. Inclusion of these
sequences for analyses from different samples will allow a basis for comparing the clusters among
environmental samples. The Root parameter names which of the RefStrains should be used to root the
tree created by Neighbor. This name should match the description line of one of the RefStrains sequences.
The cluster tree and distance matrix will be labeled by the description line of each FASTA file.
Output files
When HITSA begins, it creates two directories and numerous files within the top level directory
that contains the experimental sequences. The output files are in a directory named 8Fphylo. These files
are output1.xls, output5.xls, finaldistances and finaltree.txt. If HITSA is rerun using the same top level
directory, all of the files and directories are overwritten by the new analysis.
The results from STATGEN can be saved as an HTML file, with boxplot graphics saved separately
in JPEG format. The list of randomly sampled sequences can also be saved as a text file.
Acknowledgements
The research was supported by NIH grant P20 RR16448 from the COBRE Program of the
National Center for Research Resources and NIH grant P20 RR016454 from the INBRE Program of the
National Center for Research Resources. Its contents are solely the responsibility of the authors and do
not necessarily represent the official views of NIH. We would also like to thank Mayee Wong and Xia
Zhou for help with designing the pipeline, Raymond N. Brown for helpful comments on the manuscript
and Christopher J. Williams for statistical assistance.
Table 1. Options that can be customized in the parameter file of HITSA
Option        Default                                      Purpose
Suffix        .seq                                         Defining extension for sequence files
Npercent      .03                                          Proportion of uncalled bases tolerated
Primer5       GACTCGGTMC                                   5’ PCR primer
Primer3       TTGARTTTCC                                   3’ PCR primer
Direction     FORWARD                                      Desired sequence direction relative to coding strand
BlastSeq      $scripts/af243169.for (a)                    Sequence used for determining sequence orientation
Nhits         25                                           Number of matches to return from BLAST
LengthCutoff  175                                          Total length for a match to be included
Database      /PATH/databases/rdp_species/rdp_species (b)  Database formatted for BLAST
RefStrains    $scripts/RefStrains                          Reference sequences for alignment and clustering
Root          Methanococcus jannaschii                     Sequence from RefStrains that is used to root the tree produced by Neighbor
(a) $scripts is the path to the location of HITSA program files, a variable set when HITSA is installed
(b) PATH is the location of the database files and is set only in the parameter file
Figure Legends
Figure 1. Probability of detecting microbes that occur at low frequency. Population frequency is along the
x-axis and probability of detection is on the y-axis. Each curve represents a different sample size (25, 100 or 400).
Figure 2. Flow chart for the HIgh Throughput Sequence Analysis. The central column shows the
programs used at each step. Flanking columns show inputs into and outputs from each program with
arrows indicating direction of data flow. Dark grey boxes indicate input provided by the user; black boxes
indicate files that can be used in subsequent analyses.
Figure 3. A screenshot of STATGEN in action. A) Newick tree structure of clustered sequences. B) List of
sequences. “Drag and Drop” is used to move sequence names from A or B to C. C) Groups of sequences
and the sequences to which they will be compared. D) Summary statistics for Group Brevibacillus
levickii. Genetic distances between each of the “Sequences” listed in C and highlighted in E (grey)
compared to the sequence listed in C under "Compare" are summarized. The "Range Indicator" shows the
range of genetic distances for the boxplot. The boxplot for the Brevibacillus levickii comparison is below,
along with a table of the statistics. E) Tree view of clustered sequences.
[Figure 1. Probability of detecting microbes that occur at low frequency. Plot of Probability (y-axis, 0 to 1) against Population Frequency (x-axis, 0 to 0.03), with curves for sample sizes of 25, 100 and 400.]
[Figure 2. Flow chart. Programs and steps: Data Quality, BLAST, Parser, Fetch, ClustalW, Define Region, Seqret, DNAdist, Neighbor. Inputs and outputs: Parameter File, Raw Sequences, Bad Sequences, Good Sequences, Sequence DB, BLAST Output, BLAST Summary Table, Best match list, Best match sequences, Reference Sequences, Aligned sequences, Alignment Points, Aligned region, Distance Matrix, Newick Tree.]
[Figure 3. Screenshot of STATGEN with panels A to E.]
References
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J.
(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res 25: 3389-3402.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. (2005) GenBank.
Nucleic Acids Res 33: D34-38.
Cole, J.R., Chai, B., Marsh, T.L., Farris, R.J., Wang, Q., Kulam, S.A. et al. (2003) The
Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular
updates and the new prokaryotic taxonomy. Nucleic Acids Res 31: 442-443.
Curtis, T.P., Sloan, W.T., and Scannell, J.W. (2002) Estimating prokaryotic diversity and its
limits. Proc Natl Acad Sci U S A 99: 10494-10499.
DeLong, E.F., and Pace, N.R. (2001) Environmental diversity of bacteria and archaea. Syst Biol
50: 470-478.
Devulder, G., Perriere, G., Baty, F., and Flandrois, J.P. (2003) BIBI, a bioinformatics bacterial
identification tool. J Clin Microbiol 41: 1785-1787.
Felsenstein, J. (2004) PHYLIP (Phylogenetic Inference Package). In. Department of Genome
Sciences, University of Washington, Seattle: Distributed by the author.
Hugenholtz, P., Goebel, B.M., and Pace, N.R. (1998) Impact of culture-independent studies on
the emerging phylogenetic view of bacterial diversity. J Bacteriol 180: 4765-4774.
Jukes, T.H., and Cantor, C.R. (1969) Evolution of protein molecules. In Mammalian Protein
Metabolism. Munro, N.H. (ed). New York: Academic Press, pp. 21-132.
Li, K.B. (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing.
Bioinformatics 19: 1585-1586.
Minin, V., Abdo, Z., Joyce, P., and Sullivan, J. (2003) Performance-based selection of likelihood
models for phylogeny estimation. Syst Biol 52: 674-683.
Olsen, G.J., Lane, D.J., Giovannoni, S.J., Pace, N.R., and Stahl, D.A. (1986) Microbial ecology
and evolution: a ribosomal RNA approach. Annu Rev Microbiol 40: 337-365.
Rappe, M.S., and Giovannoni, S.J. (2003) The uncultured microbial majority. Annu Rev
Microbiol 57: 369-394.
Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open
Software Suite. Trends Genet 16: 276-277.
Saitou, N., and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing
phylogenetic trees. Mol Biol Evol 4: 406-425.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity
of progressive multiple sequence alignment through sequence weighting, position-specific
gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680.
Weisburg, W.G., Barns, S.M., Pelletier, D.A., and Lane, D.J. (1991) 16S ribosomal DNA
amplification for phylogenetic study. J Bacteriol 173: 697-703.
Wheelis, M.L., Kandler, O., and Woese, C.R. (1992) On the nature of global classification. Proc
Natl Acad Sci U S A 89: 2930-2934.
Woese, C.R. (1987) Bacterial evolution. Microbiol Rev 51: 221-271.
Woese, C.R., Kandler, O., and Wheelis, M.L. (1990) Towards a natural system of organisms:
proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87:
4576-4579.