Phylogenetic Profiles: A Graphical Method for Detecting Genetic
Recombinations in Homologous Sequences
Georg F. Weiller
Bioinformatics Laboratory, Research School of Biological Sciences, Australian National University, Canberra
Phylogenetic profiles constitute a novel way of graphically displaying the coherence of the sequence relationships
over the entire length of a set of aligned homologous sequences. Using a sliding-window technique, this method
determines the pairwise distances of all sequences in the windows and evaluates, for each sequence, the degree to
which the patterns of distances in these regions agree. This method is suited for exploring data consistency as well
as detecting recombinant sequences. A computer program implementing the algorithm has been developed, and
examples with simulated and natural sequences are given to demonstrate the sensitivity and accuracy of the method
for identifying recombinant sequences and their recombination junctions as well as detecting hot spots of recombinational activity.
Introduction
Gene conversion and other recombinational events
like transposition, transduction, and intron-homing are
important processes that influence biological evolution.
They also complicate the work of the molecular phylogeneticist, as genomes rearranged in such ways become a mosaic of regions with different phylogenetic
histories. In viruses, horizontal gene transfer over a wide
range of phylogenetic distances has been a major evolutionary force (Gibbs and Keese 1995; Gorbalenya
1995; Nuttall et al. 1995), and similar processes have
been suggested to occur in prokaryotes (Whatmore and
Kehoe 1994; Bik et al. 1995), in eukaryotes (Assali,
Mache, and de Goer 1990), and even between kingdoms
(Doolittle et al. 1990); hence, phylogenetic relationships deduced from gene sequences only represent the
evolutionary history of those genes, unless recombination can be excluded. However, genetic recombination
is not limited to exchanges involving whole genes, and
there is strong evidence for intragenic recombination in
viruses (Hahn et al. 1988; Sandmeier 1994), bacteria
(DuBose, Dykhuizen, and Hartl 1988; Reeves 1993; Li
et al. 1994; Thampapillai, Lan, and Reeves 1994), and
eukaryotes (Stephens 1985; Weiller, Schueller, and
Schweyen 1989; Paquin, Laforest, and Lang 1994). The
recognition of extra- and intragenic recombination is not
only important for unraveling the phylogenetic history
of genes, but it is also crucial for molecular phylogenetic
inference, as trees derived from different genes or gene
regions may differ in topology, and taxa with mosaic
sequences will be placed incorrectly. The analytical
challenges posed by horizontal gene transfer are reviewed by Syvanen (1994).
Various methods of detecting gene conversion and
recombination in homologous DNA sequences have
been described. In Stephens’ (1985) method, a set of
aligned sequences is split into two subsets at every variable position, and the distribution of all variable sites
that support particular splits is examined. Significant deviations from a uniform distribution are used as an indication that some of the sequences are recombinants.
However, in samples of more than a few sequences, the
appropriate splits are hard to find, and sites with more
than two alternative nucleotides present a problem for
Stephens’ method, although the statistical difficulties
created by regions with variable mutation rates are lessened by simply excluding invariant sites.
Key words: genetic recombination, computer algorithm, Salmonella enterica, molecular phylogeny, evolution.
Address for correspondence and reprints: Georg F. Weiller, Bioinformatics Laboratory, Research School of Biological Sciences, Australian National University, Canberra, ACT 0200, Australia. E-mail:
[email protected].
Mol. Biol. Evol. 15(3):326–335. 1998
© 1998 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
Sawyer’s (1989) method reduces the problem of
variable mutation rates by focusing on silent polymorphic sites. His method also overcomes the partitioning
problem of Stephens’ method by analyzing the distribution of maximal-length segments common to some
pairs of sequences. A Monte Carlo test involving permutation of sites is used to estimate the significance of
the distribution. Some of the imperfections of this method include the drastically reduced amount of usable data
when only silent polymorphic sites are used and the validity of the significance test.
In Fitch and Goodman’s (1991) ‘‘PhylogeneticScanning’’ method, sets of phylogenetic trees are constructed at different intervals in the sequence alignment.
The support for some of these trees is then evaluated at
all intervals using the parsimony principle and presented
graphically. The main computational complication of
this method arises with the very large number of possible trees when more than a few sequences are analyzed. Only a tiny subset of all possible trees can be
analyzed, and it is not always clear which trees to
choose. While the graphs can be very informative, the
requirement that each tree be represented as a column
in the graph further reduces the number of alternatives
that can be tested.
Hein (1993) has developed a method that employs
a heuristic extension of the parsimony principle to infer
phylogenies from recombinant sequences. The method
assumes that a correct tree can be found for some sequence regions and tries to reconstruct the recombinational steps required to explain the tree topology found
in other regions. Similar to Fitch and Goodman’s method, this method is only applicable for a comparatively
small number of sequences and cannot detect recombinations that do not change the topology of a tree.
Recently, methods have been developed specifically for the analysis of HIV sequences (Robertson et al.
1995; Salminen et al. 1995) whereby sliding windows are
used to compare the relationships of aligned sequences
with previously determined HIV prototype sequences.
The success of these methods depends largely on the
availability of suitable prototype sequences, as only recombinations that switch the prototype can be detected.
Two newly developed and related computer methods make use of compatibility matrices (Jakobsen and
Easteal 1996) and partition matrices (Jakobsen, Wilson,
and Easteal 1997) to graphically display the consistency
of the phylogenetic signal in all columns of a multiple-sequence alignment. These methods make fewer assumptions and do not require the prior knowledge (or
even existence) of a single phylogeny. They are therefore especially helpful for exploratory analysis of a limited number of sequences.
The ‘‘phylogenetic-profile’’ method, described below, is a new computer graphic method that overcomes
some of the limitations of the currently available methods. Similar to other methods mentioned above, it is
based on the principle that phylogenetic relationships
derived from different regions of a multiple-sequence
alignment will be similar when no recombination has
occurred. Thus, the method attempts to establish consistency in sequence relationships between different
parts of the alignment. Rather than tree topologies or
compatibility matrices, the method uses distance data to
describe the relationships and thus avoids many of the
difficulties posed by constructing and comparing tree
topologies. The distance approach makes it possible to
detect recombinations that do not change the tree topology and is very fast to compute, thus allowing analysis of a large data set (>1,000 sequences). The estimate
that a recombinational event has occurred is then plotted
for every position of every sequence, and all of the information is displayed in a single diagram.
The Phylogenetic-Profile Algorithm
The phylogenetic-profile method introduces the
‘‘phylogenetic correlation’’ measure, which quantifies
the coherence of the sequence interrelationships in two
different regions of a multiple alignment. The phylogenetic correlation of any given position is determined
by evaluating regions immediately upstream and downstream of this position. Positions in which sequence relationships in the upstream region clearly differ from
their downstream counterparts exhibit low phylogenetic
correlations and are likely recombination sites.
To determine the phylogenetic correlation of a given test sequence at a given test location, the method
defines two sequence windows, located immediately before (upstream) and after (downstream) the test location,
and determines the differences between the test sequence and all other sequences in the windows, resulting
in two vectors of distance data. If the test sequence relates to the other sequences similarly in both windows,
then the two distance vectors will exhibit the same trend
and correlate well. Conversely, if the test sequence has
recombined so that the sequence fragments in both windows have different phylogenetic histories, then the two
sets of sequence relationships would correlate poorly.
Accordingly, the phylogenetic correlation has been defined as the correlation coefficient of the two distance
vectors.
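In symbols, the definition above can be written as follows. The notation is introduced here for illustration only and does not appear in the original: $d^{\mathrm{up}}_{ij}(p)$ and $d^{\mathrm{down}}_{ij}(p)$ denote the distance between test sequence $i$ and sequence $j$ in the windows immediately upstream and downstream of position $p$, bars denote means over $j$, and $n$ is the number of sequences. The formula assumes the linear (Pearson) coefficient and assumes that the sum runs over all $n$ sequences, including the zero self-distance.

```latex
\[
PC_i(p) \;=\;
\frac{\sum_{j=1}^{n}
        \bigl(d^{\mathrm{up}}_{ij}(p)-\bar d^{\mathrm{up}}_{i}(p)\bigr)
        \bigl(d^{\mathrm{down}}_{ij}(p)-\bar d^{\mathrm{down}}_{i}(p)\bigr)}
     {\sqrt{\sum_{j=1}^{n}\bigl(d^{\mathrm{up}}_{ij}(p)-\bar d^{\mathrm{up}}_{i}(p)\bigr)^{2}}\;
      \sqrt{\sum_{j=1}^{n}\bigl(d^{\mathrm{down}}_{ij}(p)-\bar d^{\mathrm{down}}_{i}(p)\bigr)^{2}}}
\]
```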
Table 1 demonstrates the computation of the phylogenetic correlation for nine of the sequences described
in the legend to figure 1. As the recombinant sequence
R1 matches sequence S1 upstream and S6 downstream
of the recombination site at position 500, the R1 distance vectors are identical to the S1 and S6 distance
vectors in the respective regions. The phylogenetic correlations of all nine sequences at the R1 recombination
site are given in the bottom row of table 1. Note the
poor phylogenetic correlation of sequence R1. The phylogenetic correlations for the sequences S1 and S6 are
slightly lower than the values for the other sequences
but are clearly higher than the R1 values. These values
reflect the degree to which the relationships of the test
sequence differ in the two windows. While the upstream
and downstream distance vectors of the recombinant R1
vary greatly, the vector pairs of the other sequences (S1–
S8) are closely related, varying mainly in their R1 component, and this variation is especially pronounced for
S1 and S6, which were used to construct R1.
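The computation behind table 1 can be retraced with a few lines of Python. This is a minimal sketch, not the author's program: the function name `pearson` is hypothetical, and the two pairs of distance vectors are the R1 and S6 columns of table 1 (ordered R1, S1, ..., S8, with the zero self-distance included). Under these assumptions the results should come out close to the 0.01 and 0.62 reported in the table.

```python
import math

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hamming distances from table 1, ordered R1, S1, S2, ..., S8.
upstream = {
    "R1": [0, 0, 78, 148, 147, 170, 178, 182, 163],
    "S6": [178, 178, 164, 177, 177, 83, 0, 141, 125],
}
downstream = {
    "R1": [0, 193, 200, 199, 198, 84, 0, 145, 139],
    "S6": [0, 193, 200, 199, 198, 84, 0, 145, 139],
}

# One phylogenetic correlation per test sequence at position 500.
correlations = {
    name: pearson(upstream[name], downstream[name]) for name in upstream
}
for name, r in correlations.items():
    print(f"{name}: {r:.2f}")
```

R1 and S6 share the same downstream vector because R1 is identical to S6 in that window; only their upstream relationships differ, which is what drives the R1 correlation toward zero.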
For each individual sequence in the alignment, the
phylogenetic correlations are computed at every position
using sliding-window techniques. If a recombination site
is located not exactly between the two windows but inside one of them, then a part of the test sequence in the
two windows will still have similar relationships, resulting in an intermediate phylogenetic correlation. Consequently, when the window moves over a recombination junction, the phylogenetic correlation decreases; it
is smallest when the recombination junction is exactly
at the junction of the two windows. The plot of all phylogenetic correlations of a sequence against the sequence
positions is termed a ‘‘phylogenetic profile,’’ and the
profiles of all individual sequences are typically superimposed in a single diagram. By examining and comparing the phylogenetic profiles for all sequences, the
recombinant sequences and the locations of recombination junctions are easily detected.
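The sliding-window scan described above can be sketched as follows, assuming fixed-width windows, Hamming distances, and the Pearson coefficient. All names (`hamming`, `pearson`, `phylogenetic_profile`) and the toy alignment are hypothetical. The toy sequences are built from a repeating six-site unit so that every 12-site window contains the same mix of sites; the recombinant R copies A before site 24 and C after it.

```python
import math

def hamming(a, b):
    """Count the aligned sites at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def pearson(x, y):
    """Linear correlation coefficient; NaN if either vector has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    if ssx == 0 or ssy == 0:
        return float("nan")
    return cov / math.sqrt(ssx * ssy)

def phylogenetic_profile(alignment, test, width):
    """Map each split position to the test sequence's phylogenetic correlation.

    A split at p puts sites [p - width, p) upstream and [p, p + width) downstream.
    The distance vectors include the test sequence's zero distance to itself.
    """
    target = alignment[test]
    profile = {}
    for p in range(width, len(target) - width + 1):
        up = [hamming(target[p - width:p], s[p - width:p])
              for s in alignment.values()]
        down = [hamming(target[p:p + width], s[p:p + width])
                for s in alignment.values()]
        profile[p] = pearson(up, down)
    return profile

# Toy alignment: four parents following the tree ((A,B),(C,D)), 48 sites each.
unit = {"A": "GAAAAA", "B": "AGAAAA", "C": "AAGAGG", "D": "AAAGGG"}
alignment = {name: u * 8 for name, u in unit.items()}
alignment["R"] = unit["A"] * 4 + unit["C"] * 4  # junction after site 24

profiles = {name: phylogenetic_profile(alignment, name, width=12)
            for name in alignment}
print(profiles["R"][12], profiles["R"][24])  # far from vs. at the junction
```

In this toy case the profile of R equals 1 where both windows lie on the same side of the junction and drops to -0.25 when the split coincides with it, while the parental and unrelated sequences stay clearly positive at that position.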
Phylogenetic profiles can exploit a variety of different measures for estimating the sequence distances as
well as for determining the phylogenetic correlation of
distance vectors. In addition, two different sliding-window techniques can be used. These parameters are briefly discussed below.
Table 1
Computation of the Phylogenetic Correlations at Position 500 of the Demonstration Data Set
      R1^a    S1    S2    S3    S4    S5    S6    S7    S8
Distances in Upstream Window (positions 1–500)^b
R1       0     0    78   148   147   170   178   182   163
S1       0     0    78   148   147   170   178   182   163
S2      78    78     0   125   130   161   164   183   160
S3     148   148   125     0    79   179   177   187   173
S4     147   147   130    79     0   178   177   185   168
S5     170   170   161   179   178     0    83   148   129
S6     178   178   164   177   177    83     0   141   125
S7     182   182   183   187   185   148   141     0    91
S8     163   163   160   173   168   129   125    91     0
Distances in Downstream Window (positions 501–1000)
R1       0   193   200   199   198    84     0   145   139
S1     193     0    85   143   153   192   193   203   187
S2     200    85     0   144   149   199   200   199   190
S3     199   143   144     0    91   195   199   186   181
S4     198   153   149    91     0   201   198   185   190
S5      84   192   199   195   201     0    84   148   148
S6       0   193   200   199   198    84     0   145   139
S7     145   203   199   186   185   148   145     0    82
S8     139   187   190   181   190   148   139    82     0
Phylogenetic Correlations at Position 500^c
      0.01  0.63  0.85  0.97  0.98  0.86  0.62  0.97  0.96
^a The sequences S1–S8 and the recombinant sequence R1 are described in figure 1.
^b Distances were computed as the number of differences (Hamming distance) within the specified region of the multiple-sequence alignment.
^c The phylogenetic correlation for each sequence is calculated as the linear correlation coefficient of the upstream and downstream distance vectors (columns in the matrices above).
Distance Estimates
Although a simple count of different nucleotides or
amino acids (Hamming distance) gives an adequate
measure of distance, the fraction of differences (p distance) is preferable if alignment gaps impede a significant number of pairwise comparisons. The phylogenetic-profile principle is, however, amenable to any distance metric that provides sensible distance values, including multiple-hit corrections and various nucleotide
or amino acid scoring matrices (PAMs etc.). For a more
detailed treatment of distance values, see Weiller, McClure, and Gibbs (1995).
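The difference between the two simplest distance estimates can be illustrated with a short sketch. The function names, the `GAP` symbol, and the rule of skipping any site at which either sequence has a gap are assumptions made for this example; the text does not specify how gaps are handled.

```python
GAP = "-"

def comparable_sites(a, b):
    """Aligned site pairs in which neither sequence has a gap."""
    return [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]

def hamming_distance(a, b):
    """Number of comparable sites at which the two sequences differ."""
    return sum(1 for x, y in comparable_sites(a, b) if x != y)

def p_distance(a, b):
    """Fraction of comparable sites that differ; NaN if none are comparable."""
    sites = comparable_sites(a, b)
    if not sites:
        return float("nan")
    return sum(1 for x, y in sites if x != y) / len(sites)

test_seq = "ACGTACGT"
gappy = "AG------"      # only two sites can be compared
complete = "TCGAACGT"   # all eight sites can be compared

for name, other in (("gappy", gappy), ("complete", complete)):
    print(name, hamming_distance(test_seq, other), p_distance(test_seq, other))
```

Here the raw count ranks the gap-rich sequence as the closer one (1 difference against 2), whereas the p distance reverses the order (0.5 against 0.25), which is the situation in which the fraction of differences is the safer choice.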
FIG. 1.—Neighbor-joining trees of the demonstration data set. Eight artificial 1,000-nt-long sequences (S1–S8) were generated using EVOL-TREE (Schoeniger and Haeseler 1995) with a 10% nucleotide substitution for each generation and 25% of each type of nucleotide. The recombinant sequence (R1) was constructed by combining the 500-bp 5′ fragment of sequence S1 with a 500-bp 3′ fragment of sequence S6, and the recombinant sequence (R2) was constructed by replacing the region 333–666 of S1 with the corresponding fragment of S6. The dendrograms were produced using the neighbor-joining program of the PHYLIP package (Felsenstein 1991). a shows the relationships of simulated sequences S1–S8. b also includes the recombinant sequences R1 and R2.
Intercorrelation Measures
A number of standard coefficients can be used to
determine the phylogenetic correlation of distance vectors. The Bray-Curtis distance, the Canberra metric, the
chi-square distance, the average Manhattan distance, and
the linear correlation coefficient, as well as nonparametric correlations like the Spearman rank-order correlation, were explored in a variety of simulated and real
sequence sets. As all these measures gave similar results, the linear correlation coefficient (Pearson coefficient) was used for all data presented here. Note that
multiplication of a distance vector by a constant will
not change the correlation value. It is therefore not necessary to normalize the sequence differences by the window width even when the widths of the upstream and
downstream windows differ. For a more detailed treatment of correlation measures, see Rohlf (1993) and William et al. (1992).
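The invariance under scaling follows directly from the definition of the linear correlation coefficient. As a short worked step (the symbols are introduced here for illustration), let $x$ and $y$ be the upstream and downstream distance vectors and let $a, b > 0$ be arbitrary constants, for example $a = 1/w_{\mathrm{up}}$ and $b = 1/w_{\mathrm{down}}$ for window widths $w_{\mathrm{up}}$ and $w_{\mathrm{down}}$:

```latex
\[
r(a\,x,\; b\,y)
  \;=\; \frac{\operatorname{cov}(a\,x,\; b\,y)}{\sigma_{a x}\,\sigma_{b y}}
  \;=\; \frac{a\,b\,\operatorname{cov}(x, y)}{a\,\sigma_{x}\; b\,\sigma_{y}}
  \;=\; r(x, y), \qquad a, b > 0 .
\]
```

Dividing each vector by its own window width therefore leaves the Pearson coefficient unchanged. The same argument does not automatically carry over to the other measures listed above, several of which are sensitive to scale.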
Sequence Windows
In general, the recombination signal will be strongest,
i.e., the phylogenetic correlation will be minimal, when
the window used for determining one distance vector
contains only sequence from one ancestor, while the other window contains only sequence from a different and
phylogenetically discordant ancestor. Multiple recombination sites within one window will probably decrease
the resolution of the method. Hence, sequences with
many recombination sites are best analyzed using appropriately narrow windows. Wider windows, on the
contrary, will be less discriminatory but will provide
more sites for estimating the interrelationships of the
sequences, and therefore enhance the signal-to-noise ratio in the resulting plot.
Two different techniques are used to control
‘‘movement’’ of the sequence windows, and these are
optimal for different types of sequence data. The first
method uses the entire sequence in two variable-sized
windows with the left edge of the upstream window
fixed at the beginning of the aligned sequences, the right
edge of the downstream window fixed at the sequence
end, and a sliding split between them. Thus, for the first
comparison, the upstream window covers only the first
site of the alignment, while the downstream window
covers all remaining sites. In successive steps, the upstream window grows by a single site, while the downstream window decreases by one site, until the downstream window covers the last site only and the upstream window covers the remaining sites.
FIG. 2.—Phylogenetic profiles of sequences S1–S8 and R1–R2. Series 1 (left column) contains only one recombinant sequence, R1 (bold line), with a single crossover at site 349. Series 2 (right column) additionally contains R2 (bold line) with a double crossover at sites 237 and 465. The parental sequences S1 and S6 of both recombinants are omitted from rows b–d. Rows a and b make use of variable window widths, while rows c and d use fixed-size windows with widths of 70 and 35 positions, respectively (see text).
The second method uses two windows with identical and fixed widths and consequently cannot analyze
sites that lie less than one window width from either
end of the sequence; the width of the two windows is
specified at the beginning of the scanning process. The
appropriate minimal width depends on the variability of
the sequences analyzed, as the method requires sufficient variable sites inside each window to reliably determine the sequence relationships. To allow for data
sets that have an uneven distribution of variable positions, invariant sites are removed from the sequences
before analysis, as these differentially dilute the recombination signal.
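The two window-movement schemes described in this section can be written down as generators of index ranges. This is a sketch under stated assumptions: the function names are hypothetical, ranges are half-open `(start, end)` pairs over 0-based site indices, and each step moves the split by one site.

```python
def variable_windows(length):
    """Variable-width scheme: the whole alignment is split at a sliding point.

    Yields (upstream, downstream) ranges; the first upstream window holds
    one site and the last downstream window holds one site.
    """
    for split in range(1, length):
        yield (0, split), (split, length)

def fixed_windows(length, width):
    """Fixed-width scheme: two adjacent windows of equal width.

    Splits closer than one window width to either end are skipped.
    """
    for split in range(width, length - width + 1):
        yield (split - width, split), (split, split + width)

print(list(variable_windows(5)))
print(list(fixed_windows(6, 2)))
```

The variable-width scheme yields `length - 1` comparisons and always uses every site, whereas the fixed-width scheme yields `length - 2 * width + 1` comparisons and leaves the ends of the alignment unscanned.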
Explorations with Simulated Data
To demonstrate the properties of phylogenetic profiles and to explore their dependency on parameters and
sequence inclusion sets, several phylogenetic profiles
have been produced using simulated sequences, as well
as recombinations of these constructed in silico.
Single Recombinant Sequences
Eight related sequences (S1–S8) and two recombinant sequences, one with a single crossover (R1) and
one with a central insertion (R2), were constructed. Dendrograms showing the relationships of these data are
given in figure 1. The 10 aligned sequences were then
condensed to the 733 variable positions only. In the condensed sequences, the recombination junctions corresponded to the positions 349 (R1) and 237/465 (R2),
respectively. Several phylogenetic profiles were derived
from this data set (fig. 2). A simple count of different
nucleotides was used to estimate sequence distances,
and the linear correlation coefficient of the distance vectors was calculated to determine their phylogenetic correlation.
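Condensing an alignment to its variable positions, and translating a junction from full-length to condensed coordinates (as in 500 to 349 above), might look like the following sketch. The function names and the three-sequence toy alignment are hypothetical, and a junction is expressed as the number of sites that precede it.

```python
def variable_columns(sequences):
    """0-based indices of alignment columns in which the sequences differ."""
    return [i for i, column in enumerate(zip(*sequences))
            if len(set(column)) > 1]

def condense(sequences):
    """Return the sequences reduced to variable columns, plus the kept indices."""
    keep = variable_columns(sequences)
    condensed = ["".join(seq[i] for i in keep) for seq in sequences]
    return condensed, keep

def map_junction(keep, junction):
    """Number of variable columns among the first `junction` alignment sites."""
    return sum(1 for i in keep if i < junction)

toy = ["AAGTCA", "AAGTTA", "ACGTCA"]
condensed, keep = condense(toy)
print(condensed, keep, map_junction(keep, 3))
```

In the toy alignment only columns 1 and 4 vary, so a junction after the third site of the full alignment falls after the first site of the condensed one.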
The profiles in the top row (a) of figure 2 include
the recombinant sequences R1 (left column) and R1/R2
(right column), together with sequences S1–S8. The progenitor sequences S1 and S6 of the recombinant sequences R1 and R2 were omitted from plots b–d. Nevertheless, the recombinant sequences can still be identified in these plots, even in series 2, where the phylogenetic background signal is fairly weak because two of
the eight sequences are recombinant.
Series a and b use the variable-width window technique, which uses the entire length of the alignment for
analysis, but it can be seen that the estimate of phylogenetic correlations becomes ‘‘noisy’’ at either end as
one or other of the windows becomes too narrow. Note
that the phylogenetic correlation for sequence R1 (bold
in a1 and b1) is small over the entire sequence, as the
R1 recombination junction is always included in one of
the sequence windows. The value is smallest at the R1
junction, as at this point, each window contains sequences exclusively from different parents. Note that data
shown in table 1 are taken from the R1 junction in a1.
The profile of recombinant R2 (bold in a2 and b2) is
decreased less, as the R2 sequence contains some sites
donated by S1 in both windows, irrespective of the position of the window split. The recombination junctions
are nevertheless clearly visible, as the profile has its
minima at the two junction sites (237 and 465), where
one window contains sequence exclusively derived from
S1, while the other window contains sequence from both
parents (S1 and S6). Note also the strong phylogenetic
correlation of R2 around site 350, where the R2 sequences in both windows contain a similar mixture of
sites of both parental sequences. This large value indicates that R2 has an insertion and would not have been
observed if the 5′ and 3′ sequences of R2 had come
from different parents.
The recombination junction of R1 cannot be determined precisely from b2 alone, as the second recombinant, R2, distorts the phylogenetic correlation of R1.
This distortion is exceptionally pronounced because R2
was constructed from the same donor sequences as R1,
and these were excluded from the analysis represented
by b2. Series c and d use fixed-size windows of sizes
70 and 35, respectively. Note that the smaller fixed-size
windows in c2, because they contain fewer contradictory sites, are better suited to pinpoint the three crossover sites than are the maximal-sized windows in b2.
The series d graphs demonstrate that when the window
width is too small, the noise generated by sampling errors obscures the recombination signals.
Recombination Hot Spots and Multiple Recombinants
The phylogenetic-profile method determines the
phylogenetic correlation of a particular sequence in different regions by comparing it with other reference sequences. However, some of the reference sequences may
themselves be recombinants; indeed, when the analysis
includes recombination hot spots, all or most sequences
might be recombinant.
FIG. 3.—Phylogenetic profiles of chimeric sequences created from the demonstration data set. All profiles use Hamming distances and fixed windows of 50 bp width. All sequences in the profiles are recombinant. Plot a contains 16 sequences, each with a crossover at position 349 (series A). Plot b contains 16 sequences, each with a double crossover at positions 237 and 465 (series B). Plot c combines all 32 sequences used in a and b (see text).
To demonstrate the properties of phylogenetic profiles in these situations, the artificial data set was modified, generating two series (A and B) of reciprocal recombinant sequences by combining every sequence s_i
with the sequence s_{i+2}. This resulted in recombinant sequences of types S1:3, S2:4, S3:5, S4:6, S5:7, S6:8, S7:1, and S8:2. Series A contained these 16 reciprocally
recombined sequences, which resembled R1 in the example above, with a single site of recombination again
at site 500 (349 in the condensed variable-sites sequences). Series B was constructed in a manner analogous to
that used to construct R2, by exchanging the center and
flanking regions of s_i with s_{i+2}. This also yielded 16
reciprocally recombinant sequences with recombination
junctions occurring at sites 333 and 666 (237 and 465
in the condensed variable-sites sequences). Note that the
s_i:s_{i+2} scheme for producing recombinants resulted in the
sequences combining regions of different similarities.
While sequences S1:3, S2:4, S5:7, and S6:8 are relatively closely related, sequences S3:5, S4:6, S7:1, and
S8:2 are more distantly related, so stronger recombination signals can be expected from combinations of the
latter. Figure 3 gives the phylogenetic profiles of the two
series of recombinants, as well as a combination of both.
Note that the sites of recombination can be clearly seen
in all three graphs. Strong and weak signals cannot be
distinguished in graphs a and b of figure 3, because here
all sequences have their recombination junction at the
same site, and the algorithm has no means to determine
which sequence is closest to the unrecombined wild-type sequence. However, when the sequences of both
series A and B are included, the strength of the recombination signal is revealed (graph c in fig. 3). This is
because the recombination sites of the two series differ;
therefore, there are always some partial sequences in the
parental (nonrecombined) configuration at every site in
the data set, allowing the algorithm to distinguish between strong and weak recombination signals. Consequently, it can be seen that the phylogenetic correlation
of some sequences is particularly small (<0 in fig. 3c)
and these come from the recombinants formed from the
most distantly related sequences, S3:5, S4:6, S7:1, and
S8:2. However, the data set still does not contain sufficient information to clearly distinguish the recombinants
of closely related sequences from those of parental sequences. In general, if a large proportion of the sequences are recombinants, the phylogenetic-profile method
may not be able to determine which of the sequences
are the parents; the identification of recombinational hot
spots, however, is not impaired.
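The construction of the two series of reciprocal recombinants can be sketched in a few lines. The function names are hypothetical, the pairing of each sequence with the one two places further on (wrapping around at the end) follows the labels S1:3 through S8:2 given above, and the six-site toy parents stand in for the 1,000-nt sequences of the demonstration data set.

```python
def reciprocal_pair(a, b, junctions):
    """Exchange alternate segments of a and b at the given junctions.

    Returns both reciprocal products; one junction gives a single crossover,
    two junctions exchange the central region.
    """
    bounds = [0, *junctions, len(a)]
    first, second = [], []
    for k, (start, end) in enumerate(zip(bounds, bounds[1:])):
        src1, src2 = (a, b) if k % 2 == 0 else (b, a)
        first.append(src1[start:end])
        second.append(src2[start:end])
    return "".join(first), "".join(second)

def recombinant_series(parents, junctions, offset=2):
    """Combine every parent i with parent i + offset (indices wrap around)."""
    n = len(parents)
    series = {}
    for i in range(n):
        j = (i + offset) % n
        series[f"S{i + 1}:{j + 1}"] = reciprocal_pair(
            parents[i], parents[j], junctions)
    return series

# Toy parents: each is a run of one letter so the mosaic structure is visible.
parents = [letter * 6 for letter in "ABCDEFGH"]
series_a = recombinant_series(parents, junctions=[3])     # single crossover
series_b = recombinant_series(parents, junctions=[2, 4])  # central exchange
print(series_a["S1:3"], series_b["S7:1"])
```

Each of the eight pairings yields two reciprocal products, giving the 16 sequences per series mentioned in the text.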
Complex Phylogenies
Multiple recombinations during the phylogenetic
development of sequences can lead to very complex
phylogenetic relationships, resulting in the translocation
of entire subtrees. In addition, continuing evolutionary
changes overwrite the initial signal in the sequences.
Hudson and Kaplan (1985) have demonstrated that recombination events may leave no evidence in the extant
sequences. The simulation given in figure 4 was chosen
to demonstrate the behavior of phylogenetic profiles in
this more realistic situation. A randomly generated sequence containing 25% of each nucleotide was evolved
over four generations with 2% nucleotide changes per
generation. At each but the last generation, one recombinant sequence was created (x, y, and z in fig. 4) and
added to the population. A phylogenetic profile of the
resulting 30 sequences is given in figure 5. Only the
parsimoniously informative sites are used for distance
calculation. Note that the four descendants of the recombinant sequence y and the two descendants of the recombinant sequence z are clearly identified. The identification of the eight descendants of the recombinant x
is more difficult, albeit still possible, and could be considered close to the limit of the resolution of this particular profile. The sequence window was deliberately
chosen to be wide (60 parsimonious sites) in order to
collect sufficient signal. When smaller windows (10–40
parsimonious sites) were used, only the recombinants
derived from sequences y and z could be clearly identified (data not shown). This was to be expected, as the
recombination x combines two sequences (sB and sA)
with only 4% sequence differences. Evolution of sequence x for three more generations added an additional
12% changes to the sequences which obscure the recombination signal. Note that although 12 of the 30 sequences are recombinant, the data set still provides sufficient
background signal to distinguish all recombinant sequences. When a large proportion of the data set is represented by recombinant sequences, the exclusion of sequences with regions of low phylogenetic correlation
can help to improve the detection of sequences with
modest recombination signals. This is shown in the following example using a real data set.
FIG. 4.—Simulation of a complex phylogeny. A 1,000-bp-long random progenitor sequence (s) containing 25% A, C, T, and G was assembled. Descendant sequences were created by introducing 20 random nucleotide changes to their respective progenitors. The recombinant sequence x was built by combining the first 500 and the last 500 nucleotides of the sequences sB and sA respectively. The remaining recombinants y and z were created similarly, by crossing sequences sAB with sBA on position 300 and sABB with sBAA on position 700. The recombinants were evolved further, yielding a total of 30 sequences.
Example with Real Data
In order to test the phylogenetic-profile method
with real gene sequences, a set of 34 gnd gene sequences
of Salmonella enterica was scanned for recombinations.
These sequences and the sites where recombinations
have probably occurred were previously reported by
Thampapillai, Lan, and Reeves (1994). The sequences
are 1,329 bp long and correspond to positions 16–1344
of the gnd coding region. Only the variable sites were
used for the phylogenetic profiles, and these are given
in figure 6.
FIG. 5.—Phylogenetic profile of a simulated complex phylogeny. The source of the 30 sequences is described in figure 4. The profile uses only the 304 parsimonious sites. For convenience, the graph is mapped to display all positions whereby the x axis gives the range from position 100 to position 900. The y axis gives the phylogenetic correlation from +1 to −0.5. The boxes labeled x, y, and z mark the sequences derived from the recombinant x, y, and z, respectively (see text and fig. 4).
FIG. 6.—gnd gene sequence of 34 Salmonella enterica strains. Only the variable sites and their locations within the gene are given. The black bar indicates a chi-like region.
FIG. 7.—Phylogenetic profiles of the gnd locus in Salmonella enterica strains. All graphs used a fixed window size of 60 bp and only the variable sites, as given in figure 6. The proportion of different nucleotides was used to determine the relationships of the sequences, and the linear correlation coefficient was calculated to determine phylogenetic correlations. The top graph shows the profiles of all 34 sequences as given in figure 6. All but one of the sequences of the strains that gave strong recombination signals (m318, m298, m130, m38, m322, m321, w2DI, west, m317, w3038, lind, m316, m325, m287) were removed from the remaining plots. The profile of the sequence with the strongest recombination signal is plotted in bold, and the name of the corresponding strain is given (see text).
Individual Sequences
Because many of these sequences are recombinants, it could be expected that strong recombination
signals would obscure the weaker ones. To avoid this,
the analysis was repeated many times, and after each
analysis, the sequence that gave the smallest phylogenetic correlation (i.e., the strongest recombination signal) was removed. Once the 14 sequences with the
smallest phylogenetic correlations were removed, they
were reintroduced into the data set individually. Some
of the resulting phylogenetic profiles are given in figure
7. The top graph in figure 7 shows a phylogenetic profile
of all 34 sequences, whereas the remaining profiles are
each of 21 sequences, with the profile of the reintroduced sequence highlighted. It can easily be seen that
the reintroduced sequences have one or more distinct
minima in their phylogenetic profiles, suggesting that
recombinations might have occurred at or close to these
minima. The possible recombination junctions, as deduced from the phylogenetic profiles in figure 7, are
summarized in table 2. Changing the various scanning
parameters resulted in very similar plots (data not
shown), varying the relative strength of the individual
recombination signals only slightly, but not changing the
conclusions that can be drawn from the analysis.
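The remove-then-reintroduce procedure used for the gnd data can be outlined as follows. This is only a sketch of the control flow: the function names are hypothetical, and `min_correlation` stands for any routine that returns, for a given set of sequences, each sequence's smallest phylogenetic correlation. The toy scorer below returns fixed made-up values, whereas in a real analysis the scores would be recomputed from the profiles of the remaining sequences after every removal.

```python
def iterative_exclusion(names, min_correlation, n_remove):
    """Repeatedly drop the sequence with the smallest phylogenetic correlation.

    Returns the sequences that remain and the removed ones in removal order.
    """
    remaining = list(names)
    removed = []
    for _ in range(n_remove):
        scores = min_correlation(remaining)  # rescored on the current subset
        worst = min(remaining, key=lambda name: scores[name])
        remaining.remove(worst)
        removed.append(worst)
    return remaining, removed

def reintroduction_panels(remaining, removed):
    """One data set per removed sequence: the background plus that sequence."""
    return {name: remaining + [name] for name in removed}

# Toy stand-in for a profile-based scorer (values are invented).
names = [f"seq{i}" for i in range(34)]
toy_scores = {name: i / 33 for i, name in enumerate(names)}

def toy_min_correlation(subset):
    return {name: toy_scores[name] for name in subset}

remaining, removed = iterative_exclusion(names, toy_min_correlation, n_remove=14)
panels = reintroduction_panels(remaining, removed)
print(len(remaining), len(panels), len(panels[removed[0]]))
```

With 34 sequences and 14 removals, each reintroduction panel contains 21 sequences, matching the numbers given in the text.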
As the purpose of these tests was to examine the
capability of the phylogenetic-profile method rather than
to give a comprehensive analysis of the evolution of the
gnd locus, further interpretation of these profiles is left
to a later publication. A comprehensive analysis of the
evolution of the gnd locus of S. enterica has been made
by Thampapillai, Lan, and Reeves (1994), who searched
for evidence of recombination in the sequences of the
strains m318, m298, m130, m38, m322, and m321. All
Table 2
Local Minima in the Phylogenetic Profiles of Figure 7 (possible recombination sites)

Strains (in table order): m318, m298, m130, m38, m322, m321, w2Di, west, m317, w3038, lind, m316

Gene Position      Variable Position
735–774*†          152–165
687–762*†          138–152
1,017–1,066        224–238
705–738*           141–144
414–477(*)†        73–83
705–738*           141–144
414–477(*)†        73–83
889–894†           172–175
930–978†           187–205
494–546            87–103
735–774*           153–165
705–744*           141–145
930–1,018          185–225
990–1,067          210–230
648–738*           131–144
507–741*           90–145
989–1,024          210–228
555–741*           105–145
930–989            185–210
1,018–1,074        225–243

NOTE.—* = overlapping with or close to the chi-like sequence at positions 744–751 (145–150); (*) = overlapping with or close to a chi-like sequence at positions 424–431 (74–76); † = independently identified by Thampapillai, Lan, and Reeves (1994).
of the recombinations reported by the authors are detected by the phylogenetic-profile method, and the positions of the recombination junctions agree in the two
studies. These sequences are shown in the graphs in the
left column of figure 7, whereas the right column features sequences with similarly strong recombination signals that have not previously been recognized. The additional sites detected illustrate the strength and sensitivity of the phylogenetic-profile method.
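One way to turn a profile into candidate regions like the position ranges listed in table 2 is to report contiguous stretches where the phylogenetic correlation falls below a cutoff. The sketch below is only an illustration: the function name, the cutoff value, and the example numbers are hypothetical, and the source does not state that the ranges in table 2 were derived this way.

```python
def low_regions(profile, threshold):
    """Return (start, end, minimum) for each contiguous run of positions
    whose correlation is below `threshold`.

    `profile` is a list of (position, correlation) pairs in positional
    order. NaN values never compare below the threshold, so they end a run.
    """
    regions = []
    run = []
    for pos, val in profile:
        if val < threshold:
            run.append((pos, val))
        elif run:
            regions.append((run[0][0], run[-1][0], min(v for _, v in run)))
            run = []
    if run:  # close a run that reaches the end of the profile
        regions.append((run[0][0], run[-1][0], min(v for _, v in run)))
    return regions


# Hypothetical profile values, for illustration only.
example = [(10, 0.9), (11, 0.4), (12, 0.1), (13, 0.3), (14, 0.95), (15, 0.2)]
print(low_regions(example, threshold=0.5))
```

A fixed cutoff is the simplest choice; in practice the cutoff would have to be judged against the background level of the other profiles in the same plot.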
Recombination Hot Spots
Many of the gnd sequences appear to exhibit particularly small phylogenetic correlations in their central
parts. Nine of the 22 local minima (marked with an asterisk in table 2) overlap the variable sites 143–153, indicating a particularly large number of recombinations
in this region. As previously reported by Thampapillai,
Lan, and Reeves (1994), a variant of the general recombination stimulating sequence, chi, is located in this region at positions 744–751 (variable sites 145–150) of
many strains in this analysis. All of the nine recombinant sequences mentioned above have the canonical sequence (5′-CCTGGTGG-3′), which is a single-base-pair
variation of the E. coli chi motif (5′-GCTGGTGG-3′).
This sequence could possibly be regarded as the S. enterica equivalent to E. coli chi.
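The relationship between the two motifs can be checked by counting mismatches, and the same idea gives a simple scan for chi-like octamers in a sequence. The following is a minimal sketch, assuming a forward-strand scan only; the function names and the example sequence are hypothetical, and the E. coli chi octamer is taken as GCTGGTGG.

```python
E_COLI_CHI = "GCTGGTGG"  # E. coli chi octamer, written 5' to 3'


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_chi_like(seq, motif=E_COLI_CHI, max_mismatches=1):
    """Return (1-based start, word, mismatches) for every window of `seq`
    that differs from `motif` at no more than `max_mismatches` sites.
    Only the given strand is scanned.
    """
    k = len(motif)
    hits = []
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        d = hamming(word, motif)
        if d <= max_mismatches:
            hits.append((start + 1, word, d))
    return hits


# The chi-like octamer discussed in the text differs from E. coli chi
# at a single site.
print(hamming("CCTGGTGG", E_COLI_CHI))

# Hypothetical example sequence containing the chi-like octamer.
print(find_chi_like("AAACCTGGTGGTTT"))
```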
In order to examine the influence of this sequence
motif on the phylogenetic profile of the sequences, a
profile was created exclusively from sequences that had
other variants of chi at this location. As can be seen in
figure 8a, none of them has a local minimum at this site
in its phylogenetic profile. In contrast, all sequences
with local phylogenetic correlation minima near the chi
site also have the canonical (CCTGGTGG) chi motif
(fig. 8b). This chi motif is also found in sequences
w2D1, m38, m130, lind, and m316, which were not included in figure 8b, as their phylogenetic profiles are
more difficult to interpret when analyzed in the context
of the sequences in figure 8b. However, as shown above
(fig. 7 and table 2), recombinations at or close to the
chi-like motif are also likely for these five sequences.
FIG. 8.—Phylogenetic profiles of sequences with and without the chi-like motif. The canonical chi-like motif (5′ 744-CCTGGTGG-751 3′)
is marked (variable sites 145–150) in both graphs. The graph in a
includes all sequences (s71, m229, m311, sofia, m261, m287, m319,
m35, m320, m313, m314, m321, m322, m326) that have different variations of this chi-like sequence, while all sequences in b (lt2, s41, m46,
m36, m73, m13, m55, m298, m295, west, m317, w3038, m318, m324,
m325) have the canonical chi-like sequence. The thick line gives the
averages of the phylogenetic correlations of all sequences that are included in each graph.
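The averaged curve described for figure 8 (the mean of the phylogenetic correlations of all sequences in a graph) amounts to a position-wise mean over the individual profiles. A minimal sketch follows; the function name, the data layout, and the decision to skip undefined (NaN) values are assumptions for illustration, not details given in the source.

```python
import math


def mean_profile(profiles):
    """Position-wise mean of several profiles.

    Each profile is a list of (position, correlation) pairs, and all
    profiles are assumed to cover the same positions in the same order.
    Undefined (NaN) correlations are left out of the mean.
    """
    averaged = []
    for column in zip(*profiles):
        pos = column[0][0]
        values = [v for _, v in column if not math.isnan(v)]
        mean = sum(values) / len(values) if values else float("nan")
        averaged.append((pos, mean))
    return averaged


# Hypothetical profiles for two sequences over two positions.
print(mean_profile([[(1, 0.5), (2, 1.0)],
                    [(1, 1.0), (2, float("nan"))]]))
```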
Implementation
All of the phylogenetic profiles were generated using the PhylPro program. PhylPro is a Microsoft Windows application developed in C++ by the author. A
prototype of the program is available from the author
free of charge.
Acknowledgments
I thank Prof. Peter Reeves for communicating the
S. enterica sequences prior to publication, Prof. Adrian
Gibbs for helpful discussions during the development of
the phylogenetic-profile method, and Holger Averdunk
for his help in exploring the parameter space of the
method.
LITERATURE CITED
ASSALI, N. E., R. MACHE, and S. L. DE GOER. 1990. Evidence
for a composite phylogenetic origin of the plastid genome
of the brown alga Pylaiella littoralis (L.) Kjellm. Plant Mol.
Biol. 15:307–315.
BIK, E. M., A. E. BUNSCHOTEN, R. D. GOUW, and F. R. MOOI.
1995. Genesis of the novel epidemic Vibrio cholerae O139
strain: evidence for horizontal transfer of genes involved in
polysaccharide synthesis. EMBO J. 14:209–216.
DOOLITTLE, R. F., D. F. FENG, K. L. ANDERSON, and M. R.
ALBERRO. 1990. A naturally occurring horizontal gene
transfer from eukaryote to prokaryote. J. Mol. Evol. 31:
383–388.
DUBOSE, R., D. DYKHUIZEN, and D. HARTL. 1988. Genetic
exchange among natural isolates of bacteria: recombination
within the phoA locus of Escherichia coli. Proc. Natl. Acad.
Sci. USA 85:7036–7040.
FELSENSTEIN, J. 1991. PHYLIP: phylogeny inference package.
Version 3.4. Distributed by the author, Department of Genetics, University of Washington, Seattle.
FITCH, D. H. A., and M. GOODMAN. 1991. Phylogenetic scanning: a computer-assisted algorithm for mapping gene conversions and other recombinational events. CABIOS 7:207–
215.
GIBBS, A. J., and P. K. KEESE. 1995. In search of origins of
viral genes. Pp. 76–91 in A. J. GIBBS, C. H. CALISHER,
and F. GARCIA-ARENAL, eds. Molecular basis of virus evolution. Cambridge University Press, Cambridge, England.
GORBALENYA, A. E. 1995. Origin of RNA viral genomes: approaching the problem by comparative sequence analysis.
Pp. 49–67 in A. J. GIBBS, C. H. CALISHER, and F. GARCIA-ARENAL, eds. Molecular basis of virus evolution. Cambridge University Press, Cambridge, England.
HAHN, C. S., S. LUSTIG, S. STRAUSS, and E. G. STRAUSS. 1988.
Western equine encephalitis virus is a recombinant virus.
Proc. Natl. Acad. Sci. USA 85:5997–6001.
HEIN, J. 1993. A heuristic method to reconstruct the history of
sequences subject to recombination. J. Mol. Evol. 36:396–
405.
HUDSON, R. R., and N. L. KAPLAN. 1985. Statistical properties
of the number of recombination events in the history of a
sample of DNA sequences. Genetics 111:147–164.
JAKOBSEN, I. B., and S. EASTEAL. 1996. A program for calculating and displaying compatibility matrices as an aid in
determining reticulate evolution in molecular sequences.
CABIOS 12:291–295.
JAKOBSEN, I. B., S. R. WILSON, and S. EASTEAL. 1997. The
partition matrix: exploring variable phylogenetic signals
along nucleotide sequence alignments. Mol. Biol. Evol. 14:
474–484.
LI, J., K. NELSON, A. C. MCWHORTER, T. S. WHITTAM, and R.
K. SELANDER. 1994. Recombinational basis of serovar diversity in Salmonella enterica. Proc. Natl. Acad. Sci. USA
91:2252–2256.
NUTTALL, P. A., M. A. MORSE, L. D. JONES, and A. PORTELA.
1995. Adaptation of members of the Orthomyxoviridae
family to transmission by ticks. Pp. 416–426 in A. J. GIBBS,
C. H. CALISHER, and F. GARCIA-ARENAL, eds. Molecular
basis of virus evolution. Cambridge University Press, Cambridge, England.
PAQUIN, B., M. J. LAFOREST, and B. F. LANG. 1994. Interspecific transfer of mitochondrial genes in fungi and creation
of a homologous hybrid gene. Proc. Natl. Acad. Sci. USA
91:11807–11810.
REEVES, P. R. 1993. Evolution of Salmonella O antigen variation by interspecific gene transfer on a large scale. Trends
Genet. 9:17–22.
ROBERTSON, D. L., P. M. SHARP, F. E. MCCUTCHAN, and B. H.
HAHN. 1995. Recombination in HIV-1. Nature 374:124–126.
ROHLF, F. J. 1993. NTSYS-pc: numerical taxonomy and multivariate analysis system. Applied Biostatistics, New York.
SALMINEN, M. O., J. K. CARR, D. S. BURKE, and F. E. MCCUTCHAN. 1995. Identification of breakpoints in intergenotypic recombinants of HIV Type 1 by bootscanning. AIDS
Res. Hum. Retroviruses 11:1423–1425.
SANDMEIER, H. 1994. Acquisition and rearrangement of sequence motifs in the evolution of bacteriophage tail fibres.
Mol. Microbiol. 12:343–350.
SAWYER, S. A. 1989. Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6:526–538.
SCHOENIGER, A., and A. HAESELER. 1995. Simulating efficiently the evolution of DNA sequences. CABIOS 11:111–115.
STEPHENS, J. C. 1985. Statistical method of DNA sequence
analysis: detection of intragenic recombination or gene conversion. Mol. Biol. Evol. 2:539–556.
SYVANEN, M. 1994. Horizontal gene transfer: evidence and
possible consequences. Annu. Rev. Genet. 28:237–261.
THAMPAPILLAI, G., R. LAN, and P. R. REEVES. 1994. Molecular
evolution in the gnd locus of Salmonella enterica. Mol.
Biol. Evol. 11:813–828.
WEILLER, G. F., M. A. MCCLURE, and A. J. GIBBS. 1995. Molecular phylogenetic analysis. Pp. 553–585 in A. J. GIBBS,
C. H. CALISHER, and F. GARCIA-ARENAL, eds. Molecular
basis of virus evolution. Cambridge University Press, Cambridge, England.
WEILLER, G. F., C. M. E. SCHUELLER, and R. J. SCHWEYEN.
1989. Putative target sites for mobile G+C rich clusters in
yeast mitochondrial DNA: single elements and tandem arrays. Mol. Gen. Genet. 218:272–283.
WHATMORE, A. M., and M. A. KEHOE. 1994. Horizontal gene
transfer in the evolution of group A streptococcal emm-like
genes: gene mosaics and variation in Vir regulons. Mol.
Microbiol. 11:363–374.
WILLIAM, H. P., S. A. TEUKOLSKY, W. T. VETTERLING, and B.
P. FLANNERY. 1992. Numerical recipes in C. Cambridge
University Press, Cambridge, England.
SIMON EASTEAL, reviewing editor
Accepted November 20, 1997