Phylogenetic Profiles: A Graphical Method for Detecting Genetic Recombinations in Homologous Sequences

Georg F. Weiller
Bioinformatics Laboratory, Research School of Biological Sciences, Australian National University, Canberra

Phylogenetic profiles constitute a novel way of graphically displaying the coherence of the sequence relationships over the entire length of a set of aligned homologous sequences. Using a sliding-window technique, this method determines the pairwise distances of all sequences in the windows and evaluates, for each sequence, the degree to which the patterns of distances in these regions agree. This method is suited for exploring data consistency as well as detecting recombinant sequences. A computer program implementing the algorithm has been developed, and examples with simulated and natural sequences are given to demonstrate the sensitivity and accuracy of the method for identifying recombinant sequences and their recombination junctions as well as detecting hot spots of recombinational activity.

Introduction

Gene conversion and other recombinational events like transposition, transduction, and intron-homing are important processes that influence biological evolution. They also complicate the work of the molecular phylogeneticist, as genomes rearranged in such ways become a mosaic of regions with different phylogenetic histories. In viruses, horizontal gene transfer over a wide range of phylogenetic distances has been a major evolutionary force (Gibbs and Keese 1995; Gorbalenya 1995; Nuttall et al. 1995), and similar processes have been suggested to occur in prokaryotes (Whatmore and Kehoe 1994; Bik et al. 1995), in eukaryotes (Assali, Mache, and de Goer 1990), and even between kingdoms (Doolittle et al. 1990); hence, phylogenetic relationships deduced from gene sequences only represent the evolutionary history of those genes, unless recombination can be excluded.
However, genetic recombination is not limited to exchanges involving whole genes, and there is strong evidence for intragenic recombination in viruses (Hahn et al. 1988; Sandmeier 1994), bacteria (DuBose, Dykhuizen, and Hartl 1988; Reeves 1993; Li et al. 1994; Thampapillai, Lan, and Reeves 1994), and eukaryotes (Stephens 1985; Weiller, Schueller, and Schweyen 1989; Paquin, Laforest, and Lang 1994). The recognition of extra- and intragenic recombination is not only important for unraveling the phylogenetic history of genes, but it is also crucial for molecular phylogenetic inference, as trees derived from different genes or gene regions may differ in topology, and taxa with mosaic sequences will be placed incorrectly. The analytical challenges posed by horizontal gene transfer are reviewed by Syvanen (1994).

Key words: genetic recombination, computer algorithm, Salmonella enterica, molecular phylogeny, evolution.
Address for correspondence and reprints: Georg F. Weiller, Bioinformatics Laboratory, Research School of Biological Sciences, Australian National University, Canberra, ACT 0200, Australia. E-mail: [email protected].
Mol. Biol. Evol. 15(3):326–335. 1998
© 1998 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

Various methods of detecting gene conversion and recombination in homologous DNA sequences have been described. In Stephens’ (1985) method, a set of aligned sequences is split into two subsets at every variable position, and the distribution of all variable sites that support particular splits is examined. Significant deviations from a uniform distribution are used as an indication that some of the sequences are recombinants.
However, in samples of more than a few sequences, the appropriate splits are hard to find, and sites with more than two alternative nucleotides present a problem for Stephens’ method, although the statistical difficulties created by regions with variable mutation rates are lessened by simply excluding invariant sites.

Sawyer’s (1989) method reduces the problem of variable mutation rates by focusing on silent polymorphic sites. His method also overcomes the partitioning problem of Stephens’ method by analyzing the distribution of maximal-length segments common to some pairs of sequences. A Monte Carlo test involving permutation of sites is used to estimate the significance of the distribution. The limitations of this method include the drastically reduced amount of usable data when only silent polymorphic sites are used and doubts about the validity of the significance test.

In Fitch and Goodman’s (1991) “Phylogenetic Scanning” method, sets of phylogenetic trees are constructed at different intervals in the sequence alignment. The support for some of these trees is then evaluated at all intervals using the parsimony principle and presented graphically. The main computational complication of this method arises from the very large number of possible trees when more than a few sequences are analyzed. Only a tiny subset of all possible trees can be analyzed, and it is not always clear which trees to choose. While the graphs can be very informative, the requirement that each tree be represented as a column in the graph further reduces the number of alternatives that can be tested.

Hein (1993) has developed a method that employs a heuristic extension of the parsimony principle to infer phylogenies from recombinant sequences. The method assumes that a correct tree can be found for some sequence regions and tries to reconstruct the recombinational steps required to explain the tree topology found in other regions.
Similar to Fitch and Goodman’s method, this method is only applicable for a comparatively small number of sequences and cannot detect recombinations that do not change the topology of a tree.

Recently, methods have been developed specifically for the analysis of HIV sequences (Robertson et al. 1995; Salminen et al. 1995) whereby sliding windows are used to compare the relationships of aligned sequences with previously determined HIV prototype sequences. The success of these methods depends largely on the availability of suitable prototype sequences, as only recombinations that switch the prototype can be detected.

Two newly developed and related computer methods make use of compatibility matrices (Jakobsen and Easteal 1996) and partition matrices (Jakobsen, Wilson, and Easteal 1997) to graphically display the consistency of the phylogenetic signal in all columns of a multiple-sequence alignment. These methods make fewer assumptions and do not require the prior knowledge (or even existence) of a single phylogeny. They are therefore especially helpful for exploratory analysis of a limited number of sequences.

The “phylogenetic-profile” method, described below, is a new computer graphic method that overcomes some of the limitations of the currently available methods. Similar to other methods mentioned above, it is based on the principle that phylogenetic relationships derived from different regions of a multiple-sequence alignment will be similar when no recombination has occurred. Thus, the method attempts to establish consistency in sequence relationships between different parts of the alignment. Rather than tree topologies or compatibility matrices, the method uses distance data to describe the relationships and thus avoids many of the difficulties posed by constructing and comparing tree topologies.
The distance approach makes it possible to detect recombinations that do not change the tree topology and is very fast to compute, thus allowing analysis of a large data set (>1,000 sequences). The estimate that a recombinational event has occurred is then plotted for every position of every sequence, and all of the information is displayed in a single diagram.

The Phylogenetic-Profile Algorithm

The phylogenetic-profile method introduces the “phylogenetic correlation” measure, which quantifies the coherence of the sequence interrelationships in two different regions of a multiple alignment. The phylogenetic correlation of any given position is determined by evaluating regions immediately upstream and downstream of this position. Positions in which sequence relationships in the upstream region clearly differ from their downstream counterparts exhibit low phylogenetic correlations and are likely recombination sites.

To determine the phylogenetic correlation of a given test sequence at a given test location, the method defines two sequence windows, located immediately before (upstream) and after (downstream) the test location, and determines the differences between the test sequence and all other sequences in the windows, resulting in two vectors of distance data. If the test sequence relates to the other sequences similarly in both windows, then the two distance vectors will exhibit the same trend and correlate well. Conversely, if the test sequence has recombined so that the sequence fragments in the two windows have different phylogenetic histories, then the two sets of sequence relationships will correlate poorly. Accordingly, the phylogenetic correlation has been defined as the correlation coefficient of the two distance vectors.

Table 1 demonstrates the computation of the phylogenetic correlation for nine of the sequences described in the legend to figure 1.
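As a worked check of this definition, the following sketch recomputes two entries of the bottom row of table 1 from the tabulated Hamming distances. It assumes that the zero self-distance is kept in each vector, which is how the table columns read; `pearson` is a hypothetical helper written for this illustration and is not taken from PhylPro.

```python
import math

def pearson(u, v):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Distances to R1, S1, ..., S8 (in that order), copied from table 1.
r1_up   = [0, 0, 78, 148, 147, 170, 178, 182, 163]    # positions 1-500
r1_down = [0, 193, 200, 199, 198, 84, 0, 145, 139]    # positions 501-1000
s1_up   = [0, 0, 78, 148, 147, 170, 178, 182, 163]
s1_down = [193, 0, 85, 143, 153, 192, 193, 203, 187]

r1_corr = pearson(r1_up, r1_down)
s1_corr = pearson(s1_up, s1_down)
print(f"R1: {r1_corr:.2f}  S1: {s1_corr:.2f}")   # prints "R1: 0.01  S1: 0.63"
```

Both values agree with the table, which suggests that the tabulated correlations were computed over all nine entries of each column.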
As the recombinant sequence R1 matches sequence S1 upstream and S6 downstream of the recombination site at position 500, the R1 distance vectors are identical to the S1 and S6 distance vectors in the respective regions. The phylogenetic correlations of all nine sequences at the R1 recombination site are given in the bottom row of table 1. Note the poor phylogenetic correlation of sequence R1. The phylogenetic correlations for the sequences S1 and S6 are slightly lower than the values for the other sequences but are clearly higher than the R1 value. These values reflect the degree to which the relationships of the test sequence differ in the two windows. While the upstream and downstream distance vectors of the recombinant R1 vary greatly, the vector pairs of the other sequences (S1–S8) are closely related, varying mainly in their R1 component, and this variation is especially pronounced for S1 and S6, which were used to construct R1.

For each individual sequence in the alignment, the phylogenetic correlations are computed at every position using sliding-window techniques. If a recombination site is located not exactly between the two windows but inside one of them, then a part of the test sequence in the two windows will still have similar relationships, resulting in an intermediate phylogenetic correlation. Consequently, when the window moves over a recombination junction, the phylogenetic correlation decreases; it is smallest when the recombination junction is exactly at the junction of the two windows. The plot of all phylogenetic correlations of a sequence against the sequence positions is termed a “phylogenetic profile,” and the profiles of all individual sequences are typically superimposed in a single diagram. By examining and comparing the phylogenetic profiles for all sequences, the recombinant sequences and the locations of recombination junctions are easily detected.
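The sliding-window procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PhylPro implementation: the names `hamming`, `pearson`, and `phylogenetic_profile` are hypothetical, fixed-width windows are used, the zero self-distance of the test sequence is kept in each vector (as in table 1), and a constant distance vector, for which the correlation is undefined, is reported as `None`.

```python
import math

def hamming(a, b):
    """Number of differing sites between two equal-length sequence fragments."""
    return sum(x != y for x, y in zip(a, b))

def pearson(u, v):
    """Pearson correlation, or None when either vector has zero variance."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return None
    return cov / math.sqrt(var_u * var_v)

def phylogenetic_profile(alignment, test, width):
    """Return [(split, correlation), ...] for one test sequence.

    `alignment` maps names to aligned strings; `split` is the number of
    sites upstream of the test location (the junction lies after that site).
    """
    names = list(alignment)
    seq = alignment[test]
    profile = []
    for split in range(width, len(seq) - width + 1):
        up = [hamming(seq[split - width:split], alignment[n][split - width:split])
              for n in names]
        down = [hamming(seq[split:split + width], alignment[n][split:split + width])
                for n in names]
        profile.append((split, pearson(up, down)))
    return profile

# Toy alignment (assumed data): R joins the first 6 sites of A to the last 6 of C.
alignment = {"A": "A" * 12, "B": "AC" * 6, "C": "C" * 12,
             "R": "A" * 6 + "C" * 6}
profile = phylogenetic_profile(alignment, "R", width=4)
junction, lowest = min(profile, key=lambda item: item[1])
print(junction, round(lowest, 2))   # prints "6 -0.45"
```

In this toy case the profile of R dips to its minimum exactly where the two windows meet at the crossover, while the other three sequences keep higher correlations at that split.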
Phylogenetic profiles can exploit a variety of different measures for estimating the sequence distances as well as for determining the phylogenetic correlation of distance vectors. In addition, two different sliding-window techniques can be used. These parameters are briefly discussed below.

Distance Estimates

Although a simple count of different nucleotides or amino acids (Hamming distance) gives an adequate measure of distance, the fraction of differences (p distance) is preferable if alignment gaps impede a significant number of pairwise comparisons. The phylogenetic-profile principle is, however, amenable to any distance metric that provides sensible distance values, including multiple-hit corrections and various nucleotide or amino acid scoring matrices (PAMs etc.). For a more detailed treatment of distance values, see Weiller, McClure, and Gibbs (1995).

Table 1
Computation of the Phylogenetic Correlations at Position 500 of the Demonstration Data Set

Distances in Upstream Window (positions 1–500) (b)

        R1(a)   S1    S2    S3    S4    S5    S6    S7    S8
R1        0      0    78   148   147   170   178   182   163
S1        0      0    78   148   147   170   178   182   163
S2       78     78     0   125   130   161   164   183   160
S3      148    148   125     0    79   179   177   187   173
S4      147    147   130    79     0   178   177   185   168
S5      170    170   161   179   178     0    83   148   129
S6      178    178   164   177   177    83     0   141   125
S7      182    182   183   187   185   148   141     0    91
S8      163    163   160   173   168   129   125    91     0

Distances in Downstream Window (positions 501–1000)

        R1      S1    S2    S3    S4    S5    S6    S7    S8
R1        0    193   200   199   198    84     0   145   139
S1      193      0    85   143   153   192   193   203   187
S2      200     85     0   144   149   199   200   199   190
S3      199    143   144     0    91   195   199   186   181
S4      198    153   149    91     0   201   198   185   190
S5       84    192   199   195   201     0    84   148   148
S6        0    193   200   199   198    84     0   145   139
S7      145    203   199   186   185   148   145     0    82
S8      139    187   190   181   190   148   139    82     0

Phylogenetic Correlations at Position 500 (c)

        R1      S1    S2    S3    S4    S5    S6    S7    S8
       0.01   0.63  0.85  0.97  0.98  0.86  0.62  0.97  0.96

(a) The sequences S1–S8 and the recombinant sequence R1 are described in figure 1.
(b) Distances were computed as the number of differences (Hamming distance) within the specified region of the multiple-sequence alignment.
(c) The phylogenetic correlation for each sequence is calculated as the linear correlation coefficient of the upstream and downstream distance vectors (columns in the matrices above).

FIG. 1. Neighbor-joining trees of the demonstration data set. Eight artificial 1,000-nt-long sequences (S1–S8) were generated using EVOL-TREE (Schoeniger and Haeseler 1995) with a 10% nucleotide substitution for each generation and 25% of each type of nucleotide. The recombinant sequence (R1) was constructed by combining the 500-bp 5′ fragment of sequence S1 with a 500-bp 3′ fragment of sequence S6, and the recombinant sequence (R2) was constructed by replacing the region 333–666 of S1 with the corresponding fragment of S6. The dendrograms were produced using the neighbor-joining program of the PHYLIP package (Felsenstein 1991). a shows the relationships of simulated sequences S1–S8. b also includes the recombinant sequences R1 and R2.

Intercorrelation Measures

A number of standard coefficients can be used to determine the phylogenetic correlation of distance vectors. The Bray-Curtis distance, the Canberra metric, the chi-square distance, the average Manhattan distance, and the linear correlation coefficient, as well as nonparametric correlations like the Spearman rank-order correlation, were explored in a variety of simulated and real sequence sets. As all these measures gave similar results, the linear correlation coefficient (Pearson coefficient) was used for all data presented here. Note that multiplication of a distance vector with a constant will not change the correlation value.
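This invariance is easy to confirm numerically. The sketch below divides the S1 upstream vector of table 1 by an assumed window width of 500 sites, which turns Hamming counts into p distances, and checks that the Pearson coefficient is unchanged up to floating-point rounding. It also shows one of the nonparametric alternatives named above, the Spearman rank-order correlation, implemented here as the Pearson coefficient of average ranks; all function names are hypothetical helpers for this illustration.

```python
import math

def pearson(u, v):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def average_ranks(values):
    """Ranks from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(u, v):
    """Spearman rank-order correlation: Pearson coefficient of the ranks."""
    return pearson(average_ranks(u), average_ranks(v))

# S1 distance vectors from table 1 (order R1, S1, ..., S8).
s1_up   = [0, 0, 78, 148, 147, 170, 178, 182, 163]
s1_down = [193, 0, 85, 143, 153, 192, 193, 203, 187]

raw = pearson(s1_up, s1_down)
scaled = pearson([d / 500 for d in s1_up], s1_down)   # counts -> p distances
print(math.isclose(raw, scaled))                       # prints "True"
print(round(spearman(s1_up, s1_down), 2))              # rank-based alternative
```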
It is therefore not necessary to normalize the sequence differences by the window width even when the widths of the upstream and downstream windows differ. For a more detailed treatment of correlation measures, see Rohlf (1993) and William et al. (1992).

Sequence Windows

In general, the recombination signal will be strongest, i.e., the phylogenetic correlation will be minimal, when the window used for determining one distance vector contains only sequence from one ancestor, while the other window contains only sequence from a different and phylogenetically discordant ancestor. Multiple recombination sites within one window will probably decrease the resolution of the method. Hence, sequences with many recombination sites are best analyzed using appropriately narrow windows. Wider windows, by contrast, will be less discriminatory but will provide more sites for estimating the interrelationships of the sequences, and therefore enhance the signal-to-noise ratio in the resulting plot.

Two different techniques are used to control “movement” of the sequence windows, and these are optimal for different types of sequence data. The first method uses the entire sequence in two variable-sized windows with the left edge of the upstream window fixed at the beginning of the aligned sequences, the right edge of the downstream window fixed at the sequence end, and a sliding split between them. Thus, for the first comparison, the upstream window covers only the first site of the alignment, while the downstream window covers all remaining sites. In successive steps, the upstream window grows by a single site, while the downstream window decreases by one site, until the downstream window covers the last site only and the upstream window covers the remaining sites. The second method uses two windows with identical and fixed widths and consequently cannot analyze sites that lie less than one window width from either end of the sequence; the width of the two windows is specified at the beginning of the scanning process. The appropriate minimal width depends on the variability of the sequences analyzed, as the method requires sufficient variable sites inside each window to reliably determine the sequence relationships. To allow for data sets that have an uneven distribution of variable positions, invariant sites are removed from the sequences before analysis, as these differentially dilute the recombination signal.

FIG. 2. Phylogenetic profiles of sequences S1–S8 and R1–R2. Series 1 (left column) contains only one recombinant sequence, R1 (bold line), with a single crossover at site 349. Series 2 (right column) additionally contains R2 (bold line) with a double crossover at sites 237 and 465. The parental sequences S1 and S6 of both recombinants are omitted from rows b–d. Rows a and b make use of variable window widths, while rows c and d use fixed-size windows with widths of 70 and 35 positions, respectively (see text).

Explorations with Simulated Data

To demonstrate the properties of phylogenetic profiles and to explore their dependency on parameters and sequence inclusion sets, several phylogenetic profiles have been produced using simulated sequences, as well as recombinations of these constructed in silico.

Single Recombinant Sequences

Eight related sequences (S1–S8) and two recombinant sequences, one with a single crossover (R1) and one with a central insertion (R2), were constructed. Dendrograms showing the relationships of these data are given in figure 1. The 10 aligned sequences were then condensed to the 733 variable positions only. In the condensed sequences, the recombination junctions corresponded to the positions 349 (R1) and 237/465 (R2), respectively. Several phylogenetic profiles were derived from this data set (fig. 2).
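The two window schemes and the condensing of an alignment to its variable sites can be sketched as follows. The function names are hypothetical, windows are expressed as 0-based half-open Python slices, and a column is treated as invariant when all sequences share a single character, which is an assumption about how gaps would be handled.

```python
def variable_windows(length):
    """Variable-width scheme: the two windows always cover the whole alignment.

    The first pair has a one-site upstream window, the last pair a one-site
    downstream window.
    """
    for split in range(1, length):
        yield slice(0, split), slice(split, length)

def fixed_windows(length, width):
    """Fixed-width scheme: sites closer than `width` to either end are skipped."""
    for split in range(width, length - width + 1):
        yield slice(split - width, split), slice(split, split + width)

def drop_invariant_sites(sequences):
    """Condense aligned sequences to their variable columns.

    Returns the condensed sequences and the original 1-based position of
    every retained column, so junctions can be mapped between coordinates.
    """
    kept = [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]
    condensed = ["".join(seq[i] for i in kept) for seq in sequences]
    return condensed, [i + 1 for i in kept]

# Toy data (assumed): only columns 1 and 4 vary among the three sequences.
condensed, positions = drop_invariant_sites(["ACGTAC", "ACGAAC", "TCGAAC"])
print(condensed, positions)              # prints "['AT', 'AA', 'TA'] [1, 4]"
print(len(list(variable_windows(4))))    # prints "3"
print(len(list(fixed_windows(10, 3))))   # prints "5"
```

Keeping the list of original positions is what allows a junction such as site 500 of the full alignment to be reported as site 349 of the condensed one.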
A simple count of different nucleotides was used to estimate sequence distances, and the linear correlation coefficient of the distance vectors was calculated to determine their phylogenetic correlation.

The profiles in the top row (a) of figure 2 include the recombinant sequences R1 (left column) and R1/R2 (right column), together with sequences S1–S8. The progenitor sequences S1 and S6 of the recombinant sequences R1 and R2 were omitted from plots b–d. Nevertheless, the recombinant sequences can still be identified in these plots, even in series 2, where the phylogenetic background signal is fairly weak because two of the eight sequences are recombinant.

Series a and b use the variable-width window technique, which uses the entire length of the alignment for analysis, but it can be seen that the estimate of phylogenetic correlations becomes “noisy” at either end as one or the other of the windows becomes too narrow. Note that the phylogenetic correlation for sequence R1 (bold in a1 and b1) is small over the entire sequence, as the R1 recombination junction is always included in one of the sequence windows. The value is smallest at the R1 junction, as at this point, each window contains sequence exclusively from a different parent. Note that the data shown in table 1 are taken from the R1 junction in a1. The profile of recombinant R2 (bold in a2 and b2) is decreased less, as the R2 sequence contains some sites donated by S1 in both windows, irrespective of the position of the window split. The recombination junctions are nevertheless clearly visible, as the profile has its minima at the two junction sites (237 and 465), where one window contains sequence exclusively derived from S1, while the other window contains sequence from both parents (S1 and S6). Note also the strong phylogenetic correlation of R2 around site 350, where the R2 sequences in both windows contain a similar mixture of sites of both parental sequences.
This large value indicates that R2 has an insertion and would not have been observed if the 5′ and 3′ sequences of R2 had come from different parents. The recombination junction of R1 cannot be determined precisely from b2 alone, as the second recombinant, R2, distorts the phylogenetic correlation of R1. This distortion is exceptionally pronounced because R2 was constructed from the same donor sequences as R1, and these were excluded from the analysis represented by b2.

Series c and d use fixed-size windows of sizes 70 and 35, respectively. Note that the smaller fixed-size windows in c2, because they contain fewer contradictory sites, are better suited to pinpoint the three crossover sites than are the maximal-sized windows in b2. The series d graphs demonstrate that when the window width is too small, the noise generated by sampling errors obscures the recombination signals.

Recombination Hot Spots and Multiple Recombinants

The phylogenetic-profile method determines the phylogenetic correlation of a particular sequence in different regions by comparing it with other reference sequences. However, some of the reference sequences may themselves be recombinants; indeed, when the analysis includes recombination hot spots, all or most sequences might be recombinant. To demonstrate the properties of phylogenetic profiles in these situations, the artificial data set was modified, generating two series (A and B) of reciprocal recombinant sequences by combining every sequence s_i with the sequence s_{i+2}. This resulted in recombinant sequences of types S1:3, S2:4, S3:5, S4:6, S5:7, S6:8, S7:1, and S8:2. Series A contained these 16 reciprocally recombined sequences, which resembled R1 in the example above, with a single site of recombination again at site 500 (349 in the condensed variable-sites sequences). Series B was constructed in a manner analogous to that used to construct R2, by exchanging the center and flanking regions of s_i with s_{i+2}.
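The construction of the two recombinant series can be sketched as follows. This is an illustration only: the function names are hypothetical, the parents are passed as a list in the order S1 to S8, crossover coordinates are 0-based slice boundaries, and the partner index wraps around so that S7 pairs with S1 and S8 with S2, as in the list of types above.

```python
def single_crossover(a, b, site):
    """Reciprocal pair that exchanges everything downstream of `site`."""
    return a[:site] + b[site:], b[:site] + a[site:]

def double_crossover(a, b, start, end):
    """Reciprocal pair that exchanges the central region start..end."""
    return (a[:start] + b[start:end] + a[end:],
            b[:start] + a[start:end] + b[end:])

def reciprocal_series(parents, make_pair):
    """Combine every parent s_i with s_{i+2}; returns 2 * len(parents) sequences."""
    n = len(parents)
    series = []
    for i in range(n):
        partner = parents[(i + 2) % n]   # wraps: S7 pairs with S1, S8 with S2
        series.extend(make_pair(parents[i], partner))
    return series

# Toy parents (assumed): each is a run of one letter so the mosaics are visible.
parents = [letter * 9 for letter in "ABCDEFGH"]
series_a = reciprocal_series(parents, lambda a, b: single_crossover(a, b, 5))
series_b = reciprocal_series(parents, lambda a, b: double_crossover(a, b, 3, 6))
print(len(series_a), series_a[0], series_a[1])   # prints "16 AAAAACCCC CCCCCAAAA"
print(len(series_b), series_b[0], series_b[1])   # prints "16 AAACCCAAA CCCAAACCC"
```

With eight parents each series contains 16 sequences, every one of them recombinant, which is the situation the hot-spot analysis is meant to probe.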
Series B likewise comprised 16 reciprocally recombinant sequences, with recombination junctions occurring at sites 333 and 666 (237 and 465 in the condensed variable-sites sequences). Note that the s_i:s_{i+2} scheme for producing recombinants resulted in the sequences combining regions of different similarities. While sequences S1:3, S2:4, S5:7, and S6:8 are relatively closely related, sequences S3:5, S4:6, S7:1, and S8:2 are more distantly related, so stronger recombination signals can be expected from combinations of the latter.

Figure 3 gives the phylogenetic profiles of the two series of recombinants, as well as a combination of both. Note that the sites of recombination can be clearly seen in all three graphs. Strong and weak signals cannot be distinguished in graphs a and b of figure 3, because here all sequences have their recombination junction at the same site, and the algorithm has no means to determine which sequence is closest to the unrecombined wild-type sequence. However, when the sequences of both series A and B are included, the strength of the recombination signal is revealed (graph c in fig. 3). This is because the recombination sites of the two series differ; therefore, there are always some partial sequences in the parental (nonrecombined) configuration at every site in the data set, allowing the algorithm to distinguish between strong and weak recombination signals. Consequently, it can be seen that the phylogenetic correlation of some sequences is particularly small (<0 in fig. 3c) and these come from the recombinants formed from the most distantly related sequences, S3:5, S4:6, S7:1, and S8:2. However, the data set still does not contain sufficient information to clearly distinguish the recombinants of closely related sequences from those of parental sequences. In general, if a large proportion of the sequences are recombinants, the phylogenetic-profile method may not be able to determine which of the sequences are the parents; the identification of recombinational hot spots, however, is not impaired.

FIG. 3. Phylogenetic profiles of chimeric sequences created from the demonstration data set. All profiles use Hamming distances and fixed windows of 50 bp width. All sequences in the profiles are recombinant. Plot a contains 16 sequences, each with a crossover at position 349 (series A). Plot b contains 16 sequences, each with a double crossover at positions 237 and 465 (series B). Plot c combines all 32 sequences used in a and b (see text).

Complex Phylogenies

Multiple recombinations during the phylogenetic development of sequences can lead to very complex phylogenetic relationships, resulting in the translocation of entire subtrees. In addition, continuing evolutionary changes overwrite the initial signal in the sequences. Hudson and Kaplan (1985) have demonstrated that recombination events may leave no evidence in the extant sequences. The simulation given in figure 4 was chosen to demonstrate the behavior of phylogenetic profiles in this more realistic situation. A randomly generated sequence containing 25% of each nucleotide was evolved over four generations with 2% nucleotide changes per generation. At each but the last generation, one recombinant sequence was created (x, y, and z in fig. 4) and added to the population. A phylogenetic profile of the resulting 30 sequences is given in figure 5. Only the parsimoniously informative sites are used for distance calculation. Note that the four descendants of the recombinant sequence y and the two descendants of the recombinant sequence z are clearly identified. The identification of the eight descendants of the recombinant x is more difficult, albeit still possible, and could be considered close to the limit of the resolution of this particular profile. The sequence window was deliberately chosen to be wide (60 parsimonious sites) in order to collect sufficient signal. When smaller windows (10–40 parsimonious sites) were used, only the recombinants derived from sequences y and z could be clearly identified (data not shown). This was to be expected, as the recombination x combines two sequences (sB and sA) with only 4% sequence differences. Evolution of sequence x for three more generations added a further 12% changes to the sequences, which obscure the recombination signal. Note that although 12 of the 30 sequences are recombinant, the data set still provides sufficient background signal to distinguish all recombinant sequences.

When a large proportion of the data set is represented by recombinant sequences, the exclusion of sequences with regions of low phylogenetic correlation can help to improve the detection of sequences with modest recombination signals. This is shown in the following example using a real data set.

FIG. 4. Simulation of a complex phylogeny. A 1,000-bp-long random progenitor sequence (s) containing 25% each of A, C, T, and G was assembled. Descendant sequences were created by introducing 20 random nucleotide changes to their respective progenitors. The recombinant sequence x was built by combining the first 500 and the last 500 nucleotides of the sequences sB and sA, respectively. The remaining recombinants y and z were created similarly, by crossing sequences sAB with sBA at position 300 and sABB with sBAA at position 700. The recombinants were evolved further, yielding a total of 30 sequences.

Example with Real Data

In order to test the phylogenetic-profile method with real gene sequences, a set of 34 gnd gene sequences of Salmonella enterica was scanned for recombinations. These sequences and the sites where recombinations have probably occurred were previously reported by Thampapillai, Lan, and Reeves (1994). The sequences are 1,329 bp long and correspond to positions 16–1344 of the gnd coding region.
Only the variable sites were used for the phylogenetic profiles, and these are given in figure 6.

Individual Sequences

Because many of these sequences are recombinants, it could be expected that strong recombination signals would obscure the weaker ones. To avoid this, the analysis was repeated many times, and after each analysis, the sequence that gave the smallest phylogenetic correlation (i.e., the strongest recombination signal) was removed. Once the 14 sequences with the smallest phylogenetic correlations had been removed, they were reintroduced into the data set individually. Some of the resulting phylogenetic profiles are given in figure 7. The top graph in figure 7 shows a phylogenetic profile of all 34 sequences, whereas the remaining profiles are each of 21 sequences, with the profile of the reintroduced sequence highlighted. It can easily be seen that the reintroduced sequences have one or more distinct minima in their phylogenetic profiles, suggesting that recombinations might have occurred at or close to these minima. The possible recombination junctions, as deduced from the phylogenetic profiles in figure 7, are summarized in table 2. Changing the various scanning parameters resulted in very similar plots (data not shown); the relative strength of the individual recombination signals varied only slightly, and the conclusions that can be drawn from the analysis did not change. As the purpose of these tests was to examine the capability of the phylogenetic-profile method rather than to give a comprehensive analysis of the evolution of the gnd locus, further interpretation of these profiles is left to a later publication.

FIG. 5.—Phylogenetic profile of a simulated complex phylogeny. The source of the 30 sequences is described in figure 4. The profile uses only the 304 parsimonious sites. For convenience, the graph is mapped to display all positions, with the x axis giving the range from position 100 to position 900. The y axis gives the phylogenetic correlation from +1 to −0.5. The boxes labeled x, y, and z mark the sequences derived from the recombinants x, y, and z, respectively (see text and fig. 4).

FIG. 6.—gnd gene sequence of 34 Salmonella enterica strains. Only the variable sites and their locations within the gene are given. The black bar indicates a chi-like region.

FIG. 7.—Phylogenetic profiles of the gnd locus in Salmonella enterica strains. All graphs used a fixed window size of 60 bp and only the variable sites, as given in figure 6. The proportion of different nucleotides was used to determine the relationships of the sequences, and the linear correlation coefficient was calculated to determine phylogenetic correlations. The top graph shows the profiles of all 34 sequences as given in figure 6. All but one of the sequences of the strains that gave strong recombination signals (m318, m298, m130, m38, m322, m321, w2DI, west, m317, w3038, lind, m316, m325, m287) were removed from the remaining plots. The profile of the sequence with the strongest recombination signal is plotted in bold, and the name of the corresponding strain is given (see text).

A comprehensive analysis of the evolution of the gnd locus of S. enterica has been made by Thampapillai, Lan, and Reeves (1994), who searched for evidence of recombination in the sequences of the strains m318, m298, m130, m38, m322, and m321. All of the recombinations reported by the authors are detected by the phylogenetic-profile method, and the positions of the recombination junctions agree in the two studies. These sequences are shown in the graphs in the left column of figure 7, whereas the right column features sequences with similarly strong recombination signals that have not previously been recognized. The additional sites detected illustrate the strength and sensitivity of the phylogenetic-profile method.

Table 2
Local Minima in the Phylogenetic Profiles of Figure 7 (possible recombination sites)

Strains: m318, m298, m130, m38, m322, m321, w2Di, west, m317, w3038, lind, m316

Gene Position      Variable Position
735–774*+          152–165
687–762*+          138–152
1,017–1,066        224–238
705–738*           141–144
414–477(*)+        73–83
705–738*           141–144
414–477(*)+        73–83
889–894+           172–175
930–978+           187–205
494–546            87–103
735–774*           153–165
705–744*           141–145
930–1,018          185–225
990–1,067          210–230
648–738*           131–144
507–741*           90–145
989–1,024          210–228
555–741*           105–145
930–989            185–210
1,018–1,074        225–243

NOTE.—* = overlapping with or close to the chi-like sequence at positions 744–751 (145–150); (*) = overlapping with or close to a chi-like sequence at positions 424–431 (74–76); + = independently identified by Thampapillai, Lan, and Reeves (1994).

Recombination Hot Spots

Many of the gnd sequences appear to exhibit particularly small phylogenetic correlations in their central parts. Nine of the 22 local minima (marked with an asterisk in table 2) overlap the variable sites 143–153, indicating a particularly large number of recombinations in this region. As previously reported by Thampapillai, Lan, and Reeves (1994), a variant of the general recombination-stimulating sequence, chi, is located in this region at positions 744–751 (variable sites 145–150) of many strains in this analysis. All of the nine recombinant sequences mentioned above have the canonical sequence (5′-CCTGGTGG-3′), which is a single-base-pair variation of the E.
coli chi motif (5′-GCTGGTGG-3′). This sequence could possibly be regarded as the S. enterica equivalent of E. coli chi. In order to examine the influence of this sequence motif on the phylogenetic profile of the sequences, a profile was created exclusively from sequences that had other variants of chi in this location. As can be seen in figure 8a, none of them has a local minimum at this site in its phylogenetic profile. In contrast, all sequences with local phylogenetic correlation minima near the chi site also have the canonical (CCTGGTGG) chi motif (fig. 8b). This chi motif is also found in sequences w2D1, m38, m130, lind, and m316, which were not included in figure 8b, as their phylogenetic profiles are more difficult to interpret when analyzed in the context of the sequences in figure 8b. However, as shown above (fig. 7 and table 2), recombinations at or close to the chi-like motif are also likely for these five sequences.

FIG. 8.—Phylogenetic profiles of sequences with and without the chi-like motif. The canonical chi-like motif (5′ 744-CCTGGTGG-751 3′) is marked (variable sites 145–150) in both graphs. The graph in a includes all sequences (s71, m229, m311, sofia, m261, m287, m319, m35, m320, m313, m314, m321, m322, m326) that have different variations of this chi-like sequence, while all sequences in b (lt2, s41, m46, m36, m73, m13, m55, m298, m295, west, m317, w3038, m318, m324, m325) have the canonical chi-like sequence. The thick line gives the averages of the phylogenetic correlations of all sequences that are included in each graph.

Implementation

All of the phylogenetic profiles were generated using the PhylPro program. PhylPro is a Microsoft Windows application developed in C++ by the author. A prototype of the program is available from the author free of charge.

Acknowledgments

I thank Prof. Peter Reeves for communicating the S. enterica sequences prior to publication, Prof.
Adrian Gibbs for helpful discussions during the development of the phylogenetic-profile method, and Holger Averdunk for his help in exploring the parameter space of the method.

LITERATURE CITED

ASSALI, N. E., R. MACHE, and S. L. DE GOER. 1990. Evidence for a composite phylogenetic origin of the plastid genome of the brown alga Pylaiella littoralis (L.) Kjellm. Plant Mol. Biol. 15:307–315.
BIK, E. M., A. E. BUNSCHOTEN, R. D. GOUW, and F. R. MOOI. 1995. Genesis of the novel epidemic Vibrio cholerae O139 strain: evidence for horizontal transfer of genes involved in polysaccharide synthesis. EMBO J. 14:209–216.
DOOLITTLE, R. F., D. F. FENG, K. L. ANDERSON, and M. R. ALBERRO. 1990. A naturally occurring horizontal gene transfer from eukaryote to prokaryote. J. Mol. Evol. 31:383–388.
DUBOSE, R., D. DYKHUIZEN, and D. HARTL. 1988. Genetic exchange among natural isolates of bacteria: recombination within the phoA locus of Escherichia coli. Proc. Natl. Acad. Sci. USA 85:7036–7040.
FELSENSTEIN, J. 1991. PHYLIP: phylogeny inference package. Version 3.4. Distributed by the author, Department of Genetics, University of Washington, Seattle.
FITCH, D. H. A., and M. GOODMAN. 1991. Phylogenetic scanning: a computer-assisted algorithm for mapping gene conversions and other recombinational events. CABIOS 7:207–215.
GIBBS, A. J., and P. K. KEESE. 1995. In search of origins of viral genes. Pp. 76–91 in A. J. GIBBS, C. H. CALISHER, and F. GARCIA-ARENAL, eds. Molecular basis of virus evolution. Cambridge University Press, Cambridge, England.
GORBALENYA, A. E. 1995. Origin of RNA viral genomes: approaching the problem by comparative sequence analysis. Pp. 49–67 in A. J. GIBBS, C. H. CALISHER, and F. GARCIA-ARENAL, eds. Molecular basis of virus evolution. Cambridge University Press, Cambridge, England.
HAHN, C. S., S. LUSTIG, S. STRAUSS, and E. G. STRAUSS. 1988. Western equine encephalitis virus is a recombinant virus. Proc. Natl. Acad. Sci. USA 85:5997–6001.
HEIN, J. 1993. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol. 36:396–405.
HUDSON, R. R., and N. L. KAPLAN. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111:147–164.
JAKOBSEN, I. B., and S. EASTEAL. 1996. A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences. CABIOS 12:291–295.
JAKOBSEN, I. B., S. R. WILSON, and S. EASTEAL. 1997. The partition matrix: exploring variable phylogenetic signals along nucleotide sequence alignments. Mol. Biol. Evol. 14:474–484.
LI, J., K. NELSON, A. C. MCWHORTER, T. S. WHITTAM, and R. K. SELANDER. 1994. Recombinational basis of serovar diversity in Salmonella enterica. Proc. Natl. Acad. Sci. USA 91:2252–2256.
NUTTALL, P. A., M. A. MORSE, L. D. JONES, and A. PORTELA. 1995. Adaptation of members of the Orthomyxoviridae family to transmission by ticks. Pp. 416–426 in A. J. GIBBS, C. H. CALISHER, and F. GARCIA-ARENAL, eds. Molecular basis of virus evolution. Cambridge University Press, Cambridge, England.
PAQUIN, B., M. J. LAFOREST, and B. F. LANG. 1994. Interspecific transfer of mitochondrial genes in fungi and creation of a homologous hybrid gene. Proc. Natl. Acad. Sci. USA 91:11807–11810.
PRESS, W. H., S. A. TEUKOLSKY, W. T. VETTERLING, and B. P. FLANNERY. 1992. Numerical recipes in C. Cambridge University Press, Cambridge, England.
REEVES, P. R. 1993. Evolution of Salmonella O antigen variation by interspecific gene transfer on a large scale. Trends Genet. 9:17–22.
ROBERTSON, D. L., P. M. SHARP, F. E. MCCUTCHAN, and B. H. HAHN. 1995. Recombination in HIV-1. Nature 374:124–126.
ROHLF, F. J. 1993. NTSYS-pc: numerical taxonomy and multivariate analysis system. Applied Biostatistics, New York.
SALMINEN, M. O., J. K. CARR, D. S. BURKE, and F. E. MCCUTCHAN. 1995. Identification of breakpoints in intergenotypic recombinants of HIV Type 1 by bootscanning. AIDS Res. Hum. Retroviruses 11:1423–1425.
SANDMEIER, H. 1994. Acquisition and rearrangement of sequence motifs in the evolution of bacteriophage tail fibres. Mol. Microbiol. 12:343–350.
SAWYER, S. A. 1989. Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6:526–538.
SCHOENIGER, A., and A. HAESELER. 1995. Simulating efficiently the evolution of DNA sequences. CABIOS 11:111–115.
STEPHENS, J. C. 1985. Statistical method of DNA sequence analysis: detection of intragenic recombination or gene conversion. Mol. Biol. Evol. 2:539–556.
SYVANEN, M. 1994. Horizontal gene transfer: evidence and possible consequences. Annu. Rev. Genet. 28:237–261.
THAMPAPILLAI, G., R. LAN, and P. R. REEVES. 1994. Molecular evolution in the gnd locus of Salmonella enterica. Mol. Biol. Evol. 11:813–828.
WEILLER, G. F., M. A. MCCLURE, and A. J. GIBBS. 1995. Molecular phylogenetic analysis. Pp. 553–585 in A. J. GIBBS, C. H. CALISHER, and F. GARCIA-ARENAL, eds. Molecular basis of virus evolution. Cambridge University Press, Cambridge, England.
WEILLER, G. F., C. M. E. SCHUELLER, and R. J. SCHWEYEN. 1989. Putative target sites for mobile G+C rich clusters in yeast mitochondrial DNA: single elements and tandem arrays. Mol. Gen. Genet. 218:272–283.
WHATMORE, A. M., and M. A. KEHOE. 1994. Horizontal gene transfer in the evolution of group A streptococcal emm-like genes: gene mosaics and variation in Vir regulons. Mol. Microbiol. 11:363–374.

SIMON EASTEAL, reviewing editor
Accepted November 20, 1997
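
Illustrative sketch

The iterative screening used for the gnd data (compute the profiles, remove the sequence with the smallest phylogenetic correlation, repeat) can be sketched in Python. This is a minimal illustration under stated assumptions and is not the PhylPro implementation. It assumes that the phylogenetic correlation of a sequence at a position is the linear correlation between its distances to all other sequences in the window to the left of that position and in the window to the right; it uses the proportion of differing sites as the distance; and, as a simplification, it evaluates only positions where both windows are complete. All function names (p_distance, pearson, phylo_correlation, profile, strip_weakest) are hypothetical.

```python
from math import sqrt


def p_distance(a, b):
    """Proportion of differing characters between two equal-length strings."""
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pearson(u, v):
    """Linear correlation coefficient; 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = sum((x - mu) ** 2 for x in u)
    sv = sum((y - mv) ** 2 for y in v)
    if su == 0 or sv == 0:
        return 0.0
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    return cov / sqrt(su * sv)


def phylo_correlation(seqs, name, pos, window):
    """Correlate the distances from `name` to every other sequence in the
    window left of `pos` with the same distances in the window right of it
    (assumed definition, see lead-in)."""
    ref = seqs[name]
    others = [s for k, s in seqs.items() if k != name]
    left = [p_distance(ref[pos - window:pos], s[pos - window:pos]) for s in others]
    right = [p_distance(ref[pos:pos + window], s[pos:pos + window]) for s in others]
    return pearson(left, right)


def profile(seqs, name, window):
    """Phylogenetic correlations of one sequence at every position that has
    a complete window on both sides (requires length >= 2 * window)."""
    n = len(seqs[name])
    return [phylo_correlation(seqs, name, pos, window)
            for pos in range(window, n - window + 1)]


def strip_weakest(seqs, window, n_remove):
    """Repeatedly remove the sequence whose profile has the lowest minimum,
    i.e. the strongest recombination signal. Returns (remaining, removed)."""
    remaining = dict(seqs)
    removed = []
    for _ in range(n_remove):
        worst = min(remaining, key=lambda k: min(profile(remaining, k, window)))
        removed.append(worst)
        del remaining[worst]
    return remaining, removed


if __name__ == "__main__":
    # Toy alignment (invented data): R joins the left half of A to the right half of C.
    toy = {
        "A": "AAAAAAAA" + "AAAAAAAA",
        "B": "AAAAAAAC" + "AAAAAAAC",
        "C": "CCCCCCAA" + "CCCCCCAA",
        "D": "CCCCCCCA" + "CCCCCCCA",
        "R": "AAAAAAAA" + "CCCCCCAA",
    }
    print(strip_weakest(toy, window=8, n_remove=1)[1])
```

Each removed sequence can then be reintroduced on its own into the reduced data set and its profile recomputed with profile(), which mirrors the reintroduction step described for figure 7.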