Whole-genome prokaryotic phylogeny

BIOINFORMATICS
ORIGINAL PAPER
Vol. 21 no. 10 2005, pages 2329–2335
doi:10.1093/bioinformatics/bth324
Structural bioinformatics
Whole-genome prokaryotic phylogeny
Stefan R. Henz1, Daniel H. Huson1,∗, Alexander F. Auch1, Kay Nieselt-Struwe1 and Stephan C. Schuster2
1 Center for Bioinformatics Tübingen (ZBIT), Sand 14, Tübingen 72076, Germany and 2 Max-Planck-Institute for Development Biology, Spemannstrasse 35, Tübingen, 72076, Germany
Received on September 26, 2003; revised on March 5, 2004; accepted on April 8, 2004
Advance Access publication May 27, 2004
ABSTRACT
Current understanding of the phylogeny of prokaryotes is based on the
comparison of the highly conserved small subunit rRNA (ssu-rRNA) and similar regions. Although such molecules have proved to be very useful
phylogenetic markers, mutational saturation is a problem, due to their
restricted lengths. Now, a growing number of complete prokaryotic
genomes are available. This paper addresses the problem of determining a prokaryotic phylogeny utilizing the comparison of complete
genomes. We introduce a new strategy, GBDP, ‘genome blast distance
phylogeny’, and show that different variants of this approach robustly
produce phylogenies that are biologically sound, when applied to 91
prokaryotic genomes. In this approach, first Blast is used to compare
genomes, then a distance matrix is computed, and finally a tree- or
network-reconstruction method such as UPGMA, Neighbor-Joining,
BioNJ or Neighbor-Net is applied.
Contact: [email protected]
INTRODUCTION
Our current understanding of the phylogeny of prokaryotes is based
on the comparison of the highly conserved small subunit rRNA (ssu-rRNA) as originally proposed by Carl Woese and colleagues (Woese,
1987). This rRNA subunit has all the characteristics of a good phylogenetic marker, namely: (1) it is universally present in all organisms
under consideration, (2) it was derived from a common ancestor and
thus the sequences are homologous, (3) it is an essential and complex
gene, and (4) because of its low rate of substitution, it is genetically
very stable.
Attempts to elucidate the phylogeny of prokaryotes based on the
ssu-rRNA have been quite successful (Huynen and Bork, 1998;
Olsen et al., 1994). However, saturation is a problem due to the
restricted length of the molecule and functional restrictions limiting the number of mutable sites. This issue can be addressed to
some degree by using additional phylogenetic markers, such as
23S rRNA, the β-subunit of F1F0 ATPase, and also the elongation factor Tu (Ludwig and Schleifer, 1999, http://www.vermicon.de/english/news/science/KHS99111.htm).
Another well-known problem associated with this type of approach
is that the evolutionary history of any single gene may differ from
the phylogenetic history of the whole organism from which the
corresponding molecule was isolated.
∗ To whom correspondence should be addressed.
Generally, the resolution of deep divergences such as the phylogeny of the archaea and (eu)bacteria is very difficult since information
fades as one goes deeper into the past. Indeed, it has been shown
that the sequence requirement grows exponentially with time as one
attempts to resolve deeper and deeper divergences (Sober and Steel,
2002). Therefore, it seems important to make use of as much genetic
information as available for the reconstruction of the prokaryotic
phylogeny.
Now, a growing number of complete prokaryotic genomes is
available and the question arises of how to derive phylogenies based
on the whole genomic information of organisms rather than based
on a small number of genes (Wolf et al., 2001; Mirkin et al., 2003).
One such approach is to determine a genome phylogeny based on
gene content (Snel et al., 1999, Huson and Steel, submitted for
publication).
In this paper, we introduce a new strategy, called GBDP, ‘genome blast distance phylogeny’, that combines different pieces
of standard methodology to produce a phylogenetic tree or network from a given set of whole-genome data. We have applied
this approach to 91 prokaryotic genomes, of which 90 were
obtained from NCBI (NCBI, 2003, http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/new_micr.html) and EBI (EBI, 2003, http://www.ebi.ac.uk/cgibin/genomes/genomes.cgi?genomes=Bacteria) and one
additional genome (Wolinella succinogenes) that was recently
sequenced in the laboratory of the last-named author (Baar et al.,
2003).
More precisely, GBDP starts with an all-against-all pairwise comparison of genomes using Blastn (Altschul et al., 1990). In a second
step, a distance matrix is calculated from the resulting HSPs (high-scoring
segment pairs). Here, we studied a number of variants, which are
described in detail below. This distance matrix is then processed
by a distance-based phylogenetic method to produce a phylogenetic tree or network. We considered UPGMA (Sokal and Michener,
1958), Neighbor-Joining (NJ) (Saitou and Nei, 1987; Studier and
Keppler, 1988), BioNJ (Gascuel, 1997), and NeighborNet (Bryant
and Moulton, 2002).
Comparison of our results with the NCBI taxonomy (NCBI, 2003,
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/new_micr.html)
(Fig. 4) and the phylogeny reported in (Woese, 2000) shows that
GBDP is a robust method for determining the phylogeny of a
collection of whole genomes. Of the four reconstruction methods
considered, BioNJ and NeighborNet produce the most biologically
sound phylogenies. Additionally, the latter method can lead to useful
insights, as the reticulations in the NeighborNet graph explain, in
isolated cases, why some individual taxa are placed ambiguously in
a number of the phylogenies reported.
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
[Figure 1: tree image not reproduced. Groups labeled in the figure: α-, β-, γ- and ε-Proteobacteria, Green Bacteria, Spirochaete, Firmicutes, Fusobacteria, Chlamydiae, Cyanobacteria, Aquificales, Archaea, Deinococcales, Actinobacteria and Thermotogales.]
Fig. 1. Phylogenetic tree for 91 prokaryotic genomes produced by applying UPGMA to the ‘matched distances’ matrix. Major groups are indicated.
METHODS
Phylogenetic networks
The deep divergence of the prokaryotes makes the resolution of the different
bacterial groups a difficult task and several of the inter-species and intra-species branching orders remain unclear.
In theory, the representation of the ambiguous branching orders by a phylogenetic network rather than a tree should be helpful here. One of the most
popular methods for computing phylogenetic networks, in this case so-called
splits graphs, is split decomposition (Bandelt and Dress, 1992; Huson, 1998).
In such a splits graph, bipartitions or splits of the taxa are represented by
bands of parallel edges and the graph can be interpreted as representing multiple phylogenetic branching patterns simultaneously. Unfortunately, the split
decomposition method is very conservative and so, given the level of diversity
studied here, the resulting graph is almost completely unresolved. NeighborNet (Bryant and Moulton, 2002) is a new agglomerative algorithm for
computing (planar) splits graphs that produces a much better resolution.
In our study, the phylogenetic network produced by NeighborNet (Fig. 3)
indicates, in some cases, why some individual taxa are ambiguously placed
in the trees reported. For example, the Neisseria meningitidis sequences are
placed differently in all three phylogenetic trees. However, two significant splits separating the group in the network produced by NeighborNet
clearly indicate the ambiguous evolutionary signal. Another example is
the grouping of Borrelia burgdorferi and Fusobacterium nucleatum within
the Firmicutes in all three trees. NeighborNet, too, clusters them within the
Firmicutes group and, again, conflicting splits indicate the uncertainty of this
position.
It is also interesting to note that far fewer conflicting splits appear in the
archaeal phylum than in the bacterial phylum.
The GBDP strategy
The first step in the GBDP strategy is an all-against-all pairwise comparison of
all genomes using a sequence comparison method such as Blast (Altschul
et al., 1990). For any two genomes X and Y, this produces a list of high-scoring segment pairs (HSPs), each consisting of a pair of sequence segments
in X and Y (or in either of the opposite strands) of very similar length and
whose significance is indicated by an E-value and/or score.
In our study, we used standard nucleotide–nucleotide Blastn. This is the
most time-consuming part of the process: given 91 prokaryotic genomes,
this step took about 170 CPU hours. We are currently evaluating the use of
tBlastx for this step.
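To give a feel for the size of this first step, the following minimal sketch enumerates the ordered genome pairs of an all-against-all comparison. It assumes that both directions are run for every pair (Blast is not symmetric) and that self-comparisons are skipped; the text does not spell out either detail for this step, and the function name `blast_pairs` is hypothetical.

```python
from itertools import permutations


def blast_pairs(genomes):
    """Return all ordered (query, subject) pairs of distinct genomes.

    Assumption: each pair is compared in both directions and no genome
    is compared against itself.
    """
    return list(permutations(genomes, 2))


# Hypothetical labels standing in for the 91 genomes of the study.
pairs = blast_pairs([f"genome_{i}" for i in range(91)])
seconds_per_pair = 170 * 3600 / len(pairs)
```

Under these assumptions, 91 genomes give 91 × 90 = 8190 comparisons, so 170 CPU hours corresponds to roughly 75 s per comparison on average.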
[Figure 2: tree image not reproduced. Groups labeled in the figure: α-, β-, γ- and ε-Proteobacteria, Spirochaete, Chlamydiae, Actinobacteria, Green Bacteria, Cyanobacteria, Deinococcales, Firmicutes, Fusobacteria, Archaea, Aquificales and Thermotogales.]
Fig. 2. Phylogenetic tree for 91 prokaryotic genomes produced by BioNJ on the ‘matched distances’ matrix.
The second step of the GBDP strategy is the application of a distance
transformation that computes a distance between any two genomes X and
Y, based on the set of HSPs obtained in the first step. The first distance
transformation is the coverage distance, defined as follows:
$$ d_{\mathrm{cov}}(X, Y) := -\log \frac{|X_{\mathrm{cov}}| + |Y_{\mathrm{cov}}|}{|X| + |Y|}, \qquad (1) $$
where |X| and |Y| denote the length of genomes X and Y, respectively, and
|Xcov| and |Ycov| denote the number of base pairs in X and Y, respectively,
that are covered by some HSP between X and Y. The rationale behind
Equation (1) is simple: the more closely related two genomes are, the larger
the fraction of each genome that is covered by HSPs.
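Equation (1) can be sketched in a few lines of code. The sketch below is illustrative only: it assumes that each HSP is given as a pair of closed, 1-based intervals `((x_start, x_end), (y_start, y_end))`, that the logarithm is the natural one (the base is not stated in the text), and that a pair of genomes without any HSP receives an infinite distance. The function names are hypothetical.

```python
import math


def covered_length(intervals):
    """Count the positions covered by the union of closed intervals."""
    total = 0
    block_start = block_end = None
    for start, end in sorted(intervals):
        if block_end is None or start > block_end + 1:
            # The interval is disjoint from the running block:
            # close that block and open a new one.
            if block_end is not None:
                total += block_end - block_start + 1
            block_start, block_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            block_end = max(block_end, end)
    if block_end is not None:
        total += block_end - block_start + 1
    return total


def coverage_distance(hsps, len_x, len_y):
    """Equation (1): -log((|Xcov| + |Ycov|) / (|X| + |Y|))."""
    x_cov = covered_length([x for x, _ in hsps])
    y_cov = covered_length([y for _, y in hsps])
    if x_cov + y_cov == 0:
        return math.inf  # assumption: no HSPs means maximal distance
    return -math.log((x_cov + y_cov) / (len_x + len_y))
```

Overlapping HSPs are merged before counting, so a position hit by several HSPs contributes only once to |Xcov| or |Ycov|.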
However, we observed that application of this transformation can lead to a
problematic relative placement of two genomes when one genome is essentially a subset of a much larger second one. For example, in our data set this
is true of a Buchnera genome, which is essentially a subset of the Escherichia coli genome and approximately 14% of its size (Moran and Mira, 2001).
Although a substantial amount of Buchnera is covered by HSPs between
Buchnera and E.coli, E.coli has a large amount of additional DNA that
remains uncovered, thus obscuring the subset relationship. To address this
problem, we introduce a modification of the first distance transformation:
$$ d_{\mathrm{cov}}(X, Y) := -\log \frac{|X_{\mathrm{cov}}| + |Y_{\mathrm{cov}}|}{2 \min(|X|, |Y|)}. \qquad (2) $$
In practice, this transformation performs slightly better than the first one and
throughout this paper, the distance matrix obtained using (2) is referred to as
the coverage distance.
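The effect of the modified normalization can be illustrated with a small numerical sketch. All numbers below are hypothetical and chosen only to mimic a small genome that is about 14% the size of a larger one; they are not measurements from this study. The natural logarithm is again an assumption.

```python
import math


def coverage_distance_sum(x_cov, y_cov, len_x, len_y):
    """Equation (1): normalize by the summed genome lengths."""
    return -math.log((x_cov + y_cov) / (len_x + len_y))


def coverage_distance_min(x_cov, y_cov, len_x, len_y):
    """Equation (2): normalize by twice the shorter genome length."""
    return -math.log((x_cov + y_cov) / (2 * min(len_x, len_y)))


# Hypothetical example: a 640 kb genome largely contained in a 4600 kb one,
# with 500 kb covered by HSPs in each genome.
small, large = 640_000, 4_600_000
covered = 500_000

d_sum = coverage_distance_sum(covered, covered, small, large)
d_min = coverage_distance_min(covered, covered, small, large)
```

With these illustrative values, Equation (1) yields a distance of about 1.66, whereas Equation (2) yields about 0.25, so the subset relationship is no longer masked by the uncovered DNA of the larger genome.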
Repeats of significant size or number can result in a coverage distance
that makes two different genomes seem to be more closely related than
they actually are. To address this problem, we consider a second distance
transformation, called the matched distance. This is obtained by selecting a
maximum subset of HSPs that are non-overlapping both in the X and in the
Y sequence. In other words, the set of selected HSPs has the property that
any nucleotide position in either genome is contained in at most one selected
HSP. The distance is given by:
$$ d_{\mathrm{match}}(X, Y) := -\log \frac{|X_{\mathrm{match}}| + |Y_{\mathrm{match}}|}{2 \min(|X|, |Y|)}, \qquad (3) $$
where |Xmatch | and |Ymatch | denote the number of base pairs in X and Y that
are covered by selected HSPs, respectively.
We implemented two variants of the matched distance. In the first variant, greedy selection, HSPs are selected by first sorting them by decreasing
‘significance’ (either length, E-value or score) and then greedily choosing a
subset of HSPs with non-overlapping intervals. The second variant, greedy-with-trimming, operates very similarly. However, if the currently considered
HSP overlaps with other HSPs that have already been selected, then we do not
discard it as in the previous method but rather simply trim back the offending
part of the HSP and insert the rest back into the sorted list of HSPs for later
reconsideration, see (Halpern et al., 2002) for details.
In practice, it does not seem to make much of a difference whether one uses
length, E-value or score as the value to be maximized in the greedy approach.
Moreover, we detected little difference between the greedy and greedy-with-trimming approaches. Throughout this paper, the term matched distances
will denote the distances obtained using the greedy selection by match
length.
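The greedy selection by match length and the resulting matched distance of Equation (3) might be sketched as follows. This is a simplified illustration, not the implementation used in the study: HSPs are assumed to be pairs of closed intervals `((x_start, x_end), (y_start, y_end))`, the length of an HSP is taken from its interval in X, the overlap test is a plain quadratic scan, and the trimming variant is not covered. All names are hypothetical.

```python
import math


def overlaps(a, b):
    """True if the closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def greedy_select(hsps):
    """Greedily keep HSPs, longest first, that overlap no kept HSP in X or Y."""
    selected = []
    by_length = sorted(hsps, key=lambda h: h[0][1] - h[0][0], reverse=True)
    for x_iv, y_iv in by_length:
        if all(not overlaps(x_iv, sx) and not overlaps(y_iv, sy)
               for sx, sy in selected):
            selected.append((x_iv, y_iv))
    return selected


def matched_distance(hsps, len_x, len_y):
    """Equation (3), computed on the greedily selected HSPs."""
    chosen = greedy_select(hsps)
    x_match = sum(end - start + 1 for (start, end), _ in chosen)
    y_match = sum(end - start + 1 for _, (start, end) in chosen)
    if x_match + y_match == 0:
        return math.inf  # assumption: no HSPs means maximal distance
    return -math.log((x_match + y_match) / (2 * min(len_x, len_y)))
```

Because every nucleotide position lies in at most one selected HSP, repeat-induced HSPs that pile up on the same region are counted only once.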
[Figure 3: network image not reproduced. Groups labeled in the figure: Firmicutes, Spirochaete, Fusobacteria, α-, β-, γ- and ε-Proteobacteria, Aquificales, Thermotogales, Archaea, Green Bacteria, Chlamydiae, Deinococcales, Actinobacteria and Cyanobacteria.]
Fig. 3. Phylogenetic network for 91 prokaryotic genomes produced by NeighborNet on the ‘matched distances’ matrix.
The choice whether to use the coverage distance or matched distance can
make quite a difference. For example, a Blastn comparison of N.meningitidis
serogroup A and N.meningitidis serogroup B gives rise to 350 000 HSPs that
are clearly repeat induced (Parkhill et al., 2000). In this case, the coverage
distance is 0.0538, whereas the matched distance is 0.182.
A different distance transformation is based on the concept of breakpoints
(Sankoff and Blanchette, 1997; Sankoff et al., 2000; Wang et al., 2003,
http://www.smi.stanford.edu/projects/helix/psb02/wang.pdf). A breakpoint
can be detected when two HSPs (from a set of non-overlapping HSPs
obtained, e.g. by one of the greedy heuristics described above) map consecutive
intervals in one genome onto non-consecutive intervals in the other. More
precisely, we define a breakpoint as follows. Consider an HSP h1 covering the
interval [Xi, Xj] in genome X and the interval [Yk, Ym] in genome Y. Now,
consider the next interval in X that is covered by an HSP h2, with coordinates
[Xi′, Xj′]. (For this discussion we assume that all matches are in the forward
strands of both genomes. We omit the discussion of the other cases, which is
straightforward but slightly tedious.) Let [Yk′, Ym′] denote the corresponding
interval in Y. We count a breakpoint if there exists a third, intervening HSP
h3 that covers a position between Ym and Yk′ and thus destroys the co-linearity
of h1 and h2. We define the breakpoint transformation as
$$ d_{\mathrm{brkp}}(X, Y) = -\log \left( 1 - \frac{B_X + B_Y}{M_X + M_Y} \right), \qquad (4) $$
where BX , or BY , is the number of breakpoints in X, or Y , respectively, and
MX , or MY , is the number of matched intervals in X, or Y , respectively.
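A simplified sketch of the breakpoint count and of Equation (4) is given below. It rests on several assumptions that go beyond the text: all HSPs lie on the forward strands, the input is already a set of non-overlapping HSPs, the region "between" two consecutive HSPs is taken to be the gap separating their images in the other genome (whichever of the two comes first there), and the natural logarithm is used. Function and variable names are hypothetical.

```python
import math


def count_breakpoints(hsps, axis):
    """Count breakpoints while walking along one genome.

    axis=0 walks along X and inspects the images in Y; axis=1 swaps roles.
    Each HSP is ((x_start, x_end), (y_start, y_end)).
    """
    other = 1 - axis
    ordered = sorted(hsps, key=lambda h: h[axis][0])
    breakpoints = 0
    for h1, h2 in zip(ordered, ordered[1:]):
        # Open gap between the images of h1 and h2 in the other genome.
        lo = min(h1[other][1], h2[other][1])
        hi = max(h1[other][0], h2[other][0])
        # A third HSP reaching into that gap breaks the co-linearity.
        if any(h3 is not h1 and h3 is not h2
               and h3[other][0] < hi and h3[other][1] > lo
               for h3 in hsps):
            breakpoints += 1
    return breakpoints


def breakpoint_distance(hsps):
    """Equation (4) with M_X = M_Y = number of matched intervals."""
    if not hsps:
        return math.inf  # assumption: no HSPs means maximal distance
    b_x = count_breakpoints(hsps, axis=0)
    b_y = count_breakpoints(hsps, axis=1)
    return -math.log(1 - (b_x + b_y) / (2 * len(hsps)))
```

With n matched intervals there are at most n − 1 consecutive pairs per genome, so the argument of the logarithm stays strictly positive in this sketch.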
In our study, the breakpoint transformation performed very poorly within
the framework of GBDP (not shown here). This is perhaps not surprising as
the order of the matching segments of two prokaryotic genomes is usually
extremely different, whereas it is commonly assumed that breakpoint methods
lead to good results only if the genomes have preserved a substantial amount
of co-linearity.
By definition, Blast is not symmetric (Altschul et al., 1990) and this
potentially leads to a number of different variants of GBDP. Let d(X, Y )
denote a distance based on blasting genome X against Y . We can define
a distance matrix in three different ways, namely, we can use the average
D(X, Y ) := [d(X, Y ) + d(Y , X)]/2 or use one of the two directions, either
D(X, Y ) := D(Y , X) := d(X, Y ) or D(Y , X) := D(X, Y ) := d(Y , X). Our
studies seem to indicate that this directionality does have some impact on the
performance of GBDP methods and that averaging over both directions produces
better results. Thus, all results shown here are based on the averaged distances.
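The averaging over both Blast directions can be written down directly. The sketch below assumes that the directed distances are stored in a dictionary keyed by ordered genome pairs; this data layout and the function name are illustrative choices, not part of the original method description.

```python
def symmetrize(directed):
    """Build D(X, Y) = [d(X, Y) + d(Y, X)] / 2 from directed distances.

    `directed` maps ordered pairs (x, y), x != y, to d(x, y), the distance
    obtained by blasting genome x against genome y.
    """
    names = sorted({name for pair in directed for name in pair})
    matrix = {}
    for x in names:
        for y in names:
            if x == y:
                matrix[(x, y)] = 0.0
            else:
                matrix[(x, y)] = (directed[(x, y)] + directed[(y, x)]) / 2
    return matrix
```

The result is symmetric by construction, which is what the distance-based tree and network methods of the third step expect as input.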
The third and final step of the GBDP strategy consists of applying a standard
distance-based phylogenetic tree- or network reconstruction method. In our
study, we used UPGMA (Sokal and Michener, 1958), NJ (Saitou and Nei,
1987; Studier and Keppler, 1988), as implemented in the release 3.57c of
the Phylip package (Felsenstein, 1989), and BioNJ (Gascuel, 1997), in order
to reconstruct phylogenetic trees. We used two different network methods,
split decomposition (Bandelt and Dress, 1992) and NeighborNet (Bryant and Moulton,
2002), both as implemented in Huson and Bryant (manuscript in preparation), to
compute phylogenetic networks.
As discussed above, BioNJ and NeighborNet produced the best results,
followed by UPGMA and then NJ. As is to be expected for such a large number
of highly divergent taxa, the split decomposition method produced an almost
completely unresolved graph (not shown here).
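To make the third step concrete, the following is a minimal pure-Python sketch of UPGMA, the simplest of the tree methods listed above. It is not the Phylip implementation used in the study: it returns only the topology as nested tuples, omits branch lengths, and breaks ties by dictionary order. It assumes a symmetric distance dictionary with an entry for every ordered pair of distinct taxa; all names are hypothetical.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    names: list of unique taxon labels (strings).
    dist:  dict with dist[(a, b)] for all a != b, assumed symmetric.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset((a, b)): dist[(a, b)]
         for a in names for b in names if a != b}
    while len(sizes) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair, key=str)      # fixed order for reproducibility
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del d[pair]
        for c in sizes:
            # Size-weighted average of the distances to the merged clusters.
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            d[frozenset((merged, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            )
        sizes[merged] = size_a + size_b
    return next(iter(sizes))
```

For four taxa in which A and B are close, C and D are close, and all cross-distances are large, the sketch returns `(('A', 'B'), ('C', 'D'))`.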
[Figure 4: tree image not reproduced. Groups labeled in the figure: α-, β-, γ- and ε-Proteobacteria, Archaea, Thermotoga, Aquificales, Cyanobacteria, Deinococcales, Fusobacteria, Green Bacteria, Chlamydiae, Actinobacteria, Spirochaete and Firmicutes.]
Fig. 4. The prokaryotic phylogenetic tree reflecting the current NCBI taxonomy (NCBI, 2003).
Comparison of variants of GBDP
In our studies we compared the topology of the trees computed with the
topology of the NCBI taxonomy (NCBI, 2003, http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/new_micr.html), which we regard as the ‘correct’ one,
although it is not totally resolved. In the following, we will refer to this
taxonomy as the NCBI tree (Fig. 4).
We could compare any computed phylogenetic tree or network T with the
NCBI tree T0 simply by counting the number of non-trivial splits in T that
are not contained in T0 , and vice versa, thus obtaining a number of false
positives and false negatives. However, because the NCBI tree T0 is not fully
resolved, proceeding in this way would over-count false positives. To avoid
this problem, we only count a non-trivial split from T as a false positive, if it
is missing from T0 and it is incompatible with T0 (Bandelt and Dress, 1992).
For purposes of a first evaluation, we define the compatibility-score (c-score)
of the reconstruction as:
$$ \text{c-score} := \frac{|\Sigma(T_{\mathrm{comp}})|}{|\Sigma(T)|}, \qquad (5) $$
where Σ(T) and Σ(Tcomp) denote the set of all non-trivial splits in T, and
the set of all non-trivial splits in T that are compatible with T0, respectively.
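The c-score can be sketched in code once splits are given an explicit representation. In the sketch below, a split is represented by one of its two sides as a set of taxa, a split counts as non-trivial if both sides contain at least two taxa, and two splits are treated as compatible if at least one of the four pairwise intersections of their sides is empty (the usual compatibility criterion for splits). A split of T is counted as compatible with T0 if it is compatible with every split of T0. The function names are hypothetical.

```python
def is_nontrivial(split, taxa):
    """A split is non-trivial if both of its sides hold at least two taxa."""
    return 2 <= len(split) <= len(taxa) - 2


def compatible(s1, s2, taxa):
    """Two splits are compatible if some pair of their sides is disjoint."""
    sides1 = (s1, taxa - s1)
    sides2 = (s2, taxa - s2)
    return any(not (a & b) for a in sides1 for b in sides2)


def c_score(splits_t, splits_t0, taxa):
    """Equation (5): fraction of non-trivial splits of T compatible with T0."""
    taxa = frozenset(taxa)
    candidates = [frozenset(s) for s in splits_t
                  if is_nontrivial(frozenset(s), taxa)]
    if not candidates:
        raise ValueError("T contains no non-trivial splits")
    reference = [frozenset(s) for s in splits_t0]
    kept = [s for s in candidates
            if all(compatible(s, r, taxa) for r in reference)]
    return len(kept) / len(candidates)
```

Because a network may display mutually conflicting splits, at most one side of each conflict can be compatible with a tree T0, which is why networks tend to receive lower c-scores than trees under this measure.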
We used this c-score to evaluate how well a reconstructed tree or network
might reflect the ‘correct’ biological evolution. The matched distances variants of GBDP produced the highest c-scores and the resulting phylogenies
are displayed in Figures 1–3. The c-scores for a number of different variants of GBDP are summarized in Table 1. Note that the c-scores reported
for NeighborNet are very low. This reflects the fact that the data are not very
tree-like and the computed phylogenetic networks display a range of conflicting
phylogenies, thus necessarily producing false positives.
DISCUSSION
In recent years the determination-driven classification of microbial
organisms has been given up in order to make room for a phylogenetic
system (Woese, 1987; NCBI, 2003, http://www.ncbi.nlm.nih.gov/
PMGifs/Genomes/new_micr.html). The most valuable phylogenetic
marker for the establishment of a general phylogenetic system turned
out to be the ssu-rRNA. Using this marker, the three domains of life
are recognizable, namely archaea, eukaryotes and bacteria. The aim
of this study is to extend the set of tools available for the phylogenetic
comparison of prokaryotes beyond the already published ssu-rRNA
(Woese, 1987) and gene content approaches (Snel et al., 1999; Huson
and Steel, 2004).
Using the GBDP strategy, i.e. a genome wide homology search
using Blastn followed by a distance matrix computation and tree
construction (see Methods for algorithmic details), we were able to construct phylogenetic trees and networks that resemble the previously
published phylogenies in their most important aspects, with some
noteworthy deviations.
Table 1. The c-score of different GBDP phylogenies, based on a comparison with the current NCBI taxonomy

Distance transformation   Phylogenetic method   c-score
covered                   UPGMA                 0.716
greedy length             UPGMA                 0.716
greedy score              UPGMA                 0.716
greedy E-value            UPGMA                 0.716
covered                   NJ                    0.705
greedy length             NJ                    0.716
greedy score              NJ                    0.716
greedy E-value            NJ                    0.716
covered                   BioNJ                 0.727
greedy length             BioNJ                 0.727
greedy score              BioNJ                 0.727
greedy E-value            BioNJ                 0.727
covered                   NeighborNet           0.4
greedy length             NeighborNet           0.433
greedy score              NeighborNet           0.433
greedy E-value            NeighborNet           0.433

In the first column, the words covered and greedy indicate which distance transformation was used, i.e. the covered distances, Equation (2), or matched distances, Equation (3), respectively. In the latter case, length, score and E-value indicate which variable the greedy selection was based on. The second column indicates the phylogenetic reconstruction method used. The third column reports the c-score as defined in Equation (5).
Most of the variants of GBDP that we studied identify the archaeal
domain as a monophyletic unit (Figs 1–4). Furthermore, the topology within this domain described by the ssu-rRNA approach is also
observable. Such a clustering of extreme thermophile, methanogen
or extreme halophile archaeal species can be demonstrated in the
presented phylogenetic models. A major difference between both
phylogenies can be observed for organisms that branch off early from
the bacterial phylum, such as the Thermotogales and the Aquificales.
Both are found in the GBDP approach as branching off from the
archaeal phylum, despite being considered (for good reasons) as
bacterial organisms.
In contrast, other deep branching bacterial phyla such as the
Deinococci, the Green bacteria, Actinobacteria, Spirochaetes and
Cyanobacteria group within the bacterial monophyletic
branch, albeit from different branch points than those described in
the ssu-rRNA phylogenies. Particular robustness with respect to the
different variants of GBDP that we considered is displayed by the
phyla of the Actinobacteria, Cyanobacteria and Chlamydiae, all considered to be deep branching species. Interestingly, the same pattern
can be observed for the group of the ε-Proteobacteria, which also do
not branch off the general proteobacterial branch, but rather remain
distinct.
The vast majority of the remaining species are categorized within
the Firmicutes (gram-positive) and the Proteobacteria. For the phylogenetic network constructed by the NeighborNet and matched
distance variant of GBDP (Fig. 3; see Methods for algorithmic
details), the entire group Firmicutes is found to cluster nicely
together, closely resembling published single-gene phylogenies
based on 16S rRNA, 23S rRNA, the β-subunit of F1F0 ATPase,
and also the elongation factor (Ludwig and Schleifer, 1999, http://www.vermicon.de/english/news/science/KHS99111.htm). However,
the order within the Firmicutes is disturbed by the presence of the two
species B.burgdorferi, a Spirochaete, and F.nucleatum, a Fusobacterium. This holds for all variants of GBDP that we studied,
and could therefore be indicative of the need to further substantiate their phylogenetic classification. This notion is supported by the
fact that the second Spirochaete, Treponema pallidum, is rather deep
branching and never clusters with B.burgdorferi.
A second group of organisms that does not follow the ssu-rRNA
phylogeny comprises those with heavily degraded genomes. These include
the genomes of two Rickettsia species, which were originally
classified within the α-Proteobacteria. Both genomes are known
to have recently undergone massive mutation-driven, genome-wide
deletions (Andersson et al., 1998). It might therefore be possible that
the same mutational saturation that can be observed for ssu-rRNA
markers is also affecting the comparison to the other members of
the α-Proteobacteria. This might also affect the classification of the
two Buchnera species, which in several of the presented phylogenies
do not group together with their next closest relative Escherichia
coli, a γ-Proteobacterium. In addition to ssu-rRNA based studies,
the phylogenetic relationship between Buchnera and E.coli has also
been the subject of detailed syntenic studies, which establish the
time frame of divergence between the two Buchnera strains to be 50
million years (Tamas et al., 2000).
The group of organisms comprising Haemophilus, Pasteurella,
Escherichia, Salmonella and Yersinia forms a robust phylum within
the γ-Proteobacteria that holds up in all phylogenies presented. The
larger group of γ-Proteobacteria, which also includes Xanthomonas, Xylella and Pseudomonas, constitutes a more diffuse phylogeny, which is interspersed with the β-Proteobacteria N.meningitidis
and Ralstonia solanacearum. However, this positioning of the
β-Proteobacteria in the cluster of γ-Proteobacteria also appears to
be robust among the various results.
An ambiguous grouping is also found for the firmicute Thermoanaerobacter tengcongensis, which clusters in three of four methods
(NJ, BIONJ and NeighborNet) with other thermophilic species, whether
archaeal or bacterial. This might reflect a genomic convergence that
is driven by environmental factors (temperature) rather than a shared
ancestral history.
An interesting observation of the study is that the phylogenies
computed for the Archaea are in much better agreement
with those from the ssu-rRNA analysis than the phylogenies computed for the Bacteria. Biological reasons for this may be found in the much
slower growth rate of archaeal organisms, which leads to a slower rate of
mutation, as well as a lower degree of lateral gene transfer compared
to bacterial species. To what extent lateral gene transfer influences
the performance of GBDP methods will have to be examined
by comparisons in which the flexible moieties of the genomes are
omitted.
It is also important to note that the applied methods are based on
DNA similarities. Nevertheless, it can be demonstrated that even
in cases where the average GC-content of the genomes differs by
as much as 20% (as between Campylobacter and Wolinella) a
phylogenetically correct placement can be obtained.
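As an illustration of the quantity referred to here, the following minimal sketch computes the average GC-content of two sequences and their difference. The function name gc_content and the two short sequences are invented for this example and are not real genome data; ambiguous bases such as N are assumed to be ignored.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among the unambiguous bases A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / total


# Toy sequences, invented for illustration only (hypothetical data).
low_gc = "ATATTAGCATTAATTCGATA"
high_gc = "GCGCATGCCGTAGCGGCTAC"

# Absolute difference in GC fraction between the two sequences.
gc_difference = abs(gc_content(high_gc) - gc_content(low_gc))
print(f"GC difference: {100 * gc_difference:.1f} percentage points")
```

For whole genomes the same function would be applied to the concatenated replicons of each organism.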
In summary, the presented variants of the GBDP strategy are,
within limits, robust and recover the general features of the ssu-rRNA
phylogeny, which is the molecular basis of the current microbial classification. The observed differences should be used as a starting point
both for a reinvestigation of the phylogenetic position of organisms
and also for an analysis of the weaknesses of the methods presented.
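One simple way to locate such differences is to list the clades that appear in one rooted tree but not in the other. The sketch below is only an illustration under stated assumptions: trees are written as nested tuples of leaf names, the function clades and the taxon labels A to E are hypothetical, and this is not the comparison procedure used in the study.

```python
def clades(tree):
    """Collect the leaf set of every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        # A leaf is represented by its name (a string).
        if isinstance(node, str):
            return frozenset([node])
        # An internal node is a tuple of subtrees; its clade is the union
        # of the leaf sets of its children.
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


# Hypothetical toy trees over five placeholder taxa.
reference_tree = ((("A", "B"), "C"), ("D", "E"))   # e.g. a marker-gene tree
genome_tree = ((("A", "C"), "B"), ("D", "E"))      # e.g. a whole-genome tree

shared = clades(reference_tree) & clades(genome_tree)
only_reference = clades(reference_tree) - clades(genome_tree)
only_genome = clades(genome_tree) - clades(reference_tree)

for clade in sorted(only_reference | only_genome, key=sorted):
    print("conflicting clade:", sorted(clade))
```

Clades reported as conflicting mark the taxa whose placement differs between the two trees and would be candidates for closer inspection.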
We found that the accuracy of the trees computed depends on having a sufficiently large sampling of taxa and is therefore expected to
increase further as more whole genome sequences become available.
We are currently comparing the proteomes of these species by
employing tBLASTx in the comparison stage of the GBDP strategy.
We suspect that this will yield a small gain in the biological plausibility of the phylogeny obtained, while substantially increasing the
computational cost involved.
REFERENCES
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local
alignment search tool. J. Mol. Biol., 215, 403–410.
Andersson,S.G.E., Zomorodipour,A., Andersson,J.O., Sicheritz-Ponten,T.,
Alsmark,C.M.U., Podowski,R.M., Näslund,A.K., Eriksson,A.-C., Winkler,H.H. and
Kurland,C.G. (1998) The genome sequence of Rickettsia prowazekii and the origin
of mitochondria. Nature, 396, 133–140.
Baar,C., Eppinger,M., Raddatz,G., Simon,J., Lanz,C., Klimmek,O., Nandakumar,R.,
Gross,R., Rosinus,A., Keller,H. et al. (2003) Complete genome sequence and analysis
of Wolinella succinogenes. Proc. Natl Acad. Sci. USA, 100, 11690–11695.
Bandelt,H.-J. and Dress,A.W.M. (1992) A canonical decomposition theory for metrics
on a finite set. Adv. Math., 92, 47–105.
Bryant,D. and Moulton,V. (2002) NeighborNet: an agglomerative method for the
construction of planar phylogenetic networks. In Guigó,R. and Gusfield,D. (eds),
Algorithms in Bioinformatics, WABI 2002, Volume LNCS 2452, pp. 375–391.
EBI (2003) Completed genomes bacteria.
Felsenstein,J. (1989) PHYLIP – phylogeny inference package (version 3.2). Cladistics,
5, 164–166.
Gascuel,O. (1997) BIONJ: an improved version of the NJ algorithm based on a simple
model of sequence data. Mol. Biol. Evol., 14, 685–695.
Halpern,A.L., Huson,D.H. and Reinert,K. (2002) Segment match refinement and applications. Proceedings of the Workshop in Algorithms in Bioinformatics, Springer
Verlag, New York, pp. 12–139.
Huson,D.H. (1998) SplitsTree: a program for analyzing and visualizing evolutionary
data. Bioinformatics, 14, 68–73.
Huson,D.H. and Steel,M. (2004) Phylogenetic trees based on gene content. Bioinformatics, 20, 2044–2049.
Huynen,M.A. and Bork,P. (1998) Measuring genome evolution. Proc. Natl Acad. Sci.
USA, 95, 5849–5856.
Ludwig,W. and Schleifer,K.-H. (1999) Phylogeny of bacteria beyond the 16S rRNA
standard. ASM News, 752–757.
Mirkin,B.G., Fenner,T.I., Galperin,M.Y. and Koonin,E.V. (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal
common ancestor and dominance of horizontal gene transfer in the evolution of
prokaryotes. BMC Evol. Biol., 3, 2.
Moran,N.A. and Mira,A. (2001) The process of genome shrinkage in the obligate
symbiont Buchnera aphidicola. Genome Biol., 2, 1–12.
NCBI (2003) Microbial complete genomes taxonomy.
Olsen,G.J., Woese,C.R. and Overbeek,R. (1994) The winds of (evolutionary) change:
breathing new life into microbiology. J. Bacteriol., 176, 1–6.
Parkhill,J., Achtman,M., James,K.D., Bentley,S.D., Churcher,C., Klee,S.R., Morelli,G.,
Basham,D., Brown,D., Chillingworth,T. et al. (2000) Complete DNA sequence
of a serogroup A strain of Neisseria meningitidis Z2491. Nature, 404,
502–506.
Saitou,N. and Nei,M. (1987) The Neighbor-Joining method: a new method for
reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
Sankoff,D. and Blanchette,M. (1997) The median problem for breakpoints in comparative genomics. In Jiang,T. and Lee,D.T. (eds), Computing and Combinatorics, Proc.
COCOON’97. Lecture Notes in Computer Science, Volume 1276. Springer Verlag,
New York.
Sankoff,D., Bryant,D., Denault,M., Lang,B.F. and Burger,G. (2000) Early eukaryote evolution based on mitochondrial gene order breakpoints. J. Comp. Biol., 7,
521–535.
Snel,B., Bork,P. and Huynen,M.A. (1999) Genome phylogeny based on gene content.
Nat. Genet., 21, 108–110.
Sober,E. and Steel,M. (2002) Testing the hypothesis of common ancestry. J. theor. Biol.,
218, 395–408.
Sokal,R.R. and Michener,C.D. (1958) A statistical method for evaluating systematic
relationships. University of Kansas Scientific Bulletin, 28, 1409–1438.
Studier,J.A. and Keppler,K.J. (1988) A note on the neighbour-joining algorithm of
Saitou and Nei. Mol. Biol. Evol., 5, 729–731.
Tamas,I., Klasson,L., Canbäck,B., Näslund,A.K., Eriksson,A.-E., Wernegreen,J.J.,
Sandström,J.P., Moran,N.A. and Andersson,S.G.E. (2000) 50 million years of
genomic stasis in endosymbiotic bacteria. Science, 296, 2376–2379.
Wang,L.S., Jansen,R.K., Moret,B.M.E., Raubeson,L.A. and Warnow,T. (2003) Fast
phylogenetic methods for the analysis of genome rearrangement data: an empirical
study.
Woese,C.R. (1987) Bacterial evolution. Microbiol. Rev., 51, 221–272.
Woese,C.R. (2000) Prokaryote systematics: the evolution of a science. In Balows,A.,
Trüper,H.G., Dworkin,M., Harder,W. and Schleifer,K.H. (eds), The Prokaryotes,
Volume 3. Springer Verlag, New York.
Wolf,Y.I., Rogozin,I.B., Grishin,N.V., Tatusov,R.L. and Koonin,E.V. (2001) Genome
trees constructed using five different approaches suggest new major bacterial clades.
BMC Evolut. Biol., 1, 8.