B RIEFINGS IN BIOINF ORMATICS . VOL 15. NO 3. 407^ 418 Advance Access published on 29 November 2013 doi:10.1093/bib/bbt083 Alignment-free phylogenetics and population genetics Bernhard Haubold Submitted: 5th July 2013; Received (in revised form) : 17th October 2013 Abstract Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on comparative data, today usually DNA sequences. These have become so plentiful that alignment-free sequence comparison is of growing importance in the race between scientists and sequencing machines. In phylogenetics, efficient distance computation is the major contribution of alignment-free methods. A distance measure should reflect the number of substitutions per site, which underlies classical alignment-based phylogeny reconstruction. Alignment-free distance measures are either based on word counts or on match lengths, and I apply examples of both approaches to simulated and real data to assess their accuracy and efficiency. While phylogeny reconstruction is based on the number of substitutions, in population genetics, the distribution of mutations along a sequence is also considered. This distribution can be explored by match lengths, thus opening the prospect of alignment-free population genomics. Keywords: phylogenetics; population genetics; mutation distance; match length; suffix tree INTRODUCTION Evolutionary biologists think about descent on two levels: phylogeny is concerned with the evolution of species and higher taxonomic orders, and population genetics with groups below the species level. Darwin’s The Origin of Species contains a single figure: a phylogeny. This reminds us that phylogenies are the central metaphor of evolutionary biology and attempts to construct them are as old as the field itself. Phylogenies are constructed by comparing homologous characters that vary. The most useful of these are selectively neutral, as their variation closely reflects the passage of time, rather than the action of evolutionary forces. In 1965, Zuckerkandl and Pauling were among the first to note that the residues of nucleotide and protein sequences record an organism’s evolutionary history [1]. To decipher this history is the aim of modern phylogeny reconstruction, a large field with wide application throughout biology [2]. Population genetics has an equally distinguished history, but instead of reconstructing descent, its central metaphor is a random tree known as the coalescent, which is used to simulate evolution [3]. More work has been done on alignment-free phylogeny than on alignment-free population genetics. Hence this review is centered on phylogeny followed by a coda on population genetics. Alignment-free sequence comparison was last reviewed comprehensively 10 years ago [4]. In addition, 10 methods were extensively compared 6 years ago yielding a set of reference implementations in Python at http://acb.qfab.org/acb/decafþpy/ [5]. A review of four methods was published 4 years ago [6]. Recently, 12 methods for alignment-free phylogenetic reconstruction were posted under a unifying interface at http://www.herbbol.org:8000/agp [7]. I concentrate on DNA-based methods because their often superior efficiency makes them most relevant to the current interest in analyzing next-generation sequencing data. These methods are traditionally model-free in the sense that the distances they yield cannot be interpreted in terms of mutations, the most widely used quantity for reconstructing phylogenies. However, I also survey two exceptions to this rule, one of which allows not only alignmentfree but also assembly-free phylogeny reconstruction Corresponding author. Bernhard Haubold. E-mail: [email protected] Bernhard Haubold is a tenured research scientist at the Max-Planck-Institute for Evolutionary Biology. Before that he worked in the biotech industry and was Professor for Bioinformatics at the Fachhochschule Weihenstephan. ß The Author 2013. Published by Oxford University Press. For Permissions, please email: [email protected] 408 Haubold [8]. In contrast to phylogeny, population genetics considers not only the number of mutations but also the distribution of the distances between them. This is equivalent to the distribution of match lengths. This review is organized as follows. I first remind the reader of classical phylogeny reconstruction (Section 2: Phylogeny). Then I describe seven alignment-free methods of distance computation that have been used in phylogeny reconstruction (Section 3: Distance Computation). The methods are classified in Figure 2, and their names and references are listed in Table 1. In addition, the Web site http://guanine.evolbio.mpg.de/aliFreeReview/ lists links to implementations of all software used in this review. In Section 4 (Applications), I apply the methods to simulated and empirical sequence data. I use three empirical data sets: seven primate mitochondrial genomes, 30 Escherichia coli/Shigella genomes and 12 Drosophila genomes. The accompanying Web site also contains links to these. In Section 5 (Population Genetics), finally, I discuss the prospect of applying alignment-free methods in population genetics. PHYLOGENY Figure 1A shows an alignment of four 100 bp sequences. The underlying phylogeny can be discovered in two different ways: (i) search for the tree that best explains the data, or (ii) compute the pairwise distances between the sequences (Figure 1B) and construct a tree from them (Figure 1C). The tree search methods are further divided according to their search criterion: maximum parsimony or maximum likelihood. The maximum parsimony criterion aims for the tree that implies the smallest number of mutations (Figure 1D). Similarly, the maximum likelihood method aims to discover the tree that best explains the data, given some explicit mutation model (Figure 1E). For a comprehensive description of alignment-based phylogeny reconstruction, see [2]. Without an alignment, it is difficult to score trees. Apart from one exception, alignment-free phylogeny reconstruction is therefore based on distances. For the exception, the sequence data are recursively partitioned according to the distribution of k-mers without first computing a distance matrix. However, the partition method does not outperform lesscomplicated distance-based methods [5] and hence I concentrate on those. Distances are clustered using an algorithm such as neighbor joining [17]. Alternatively, distances can be clustered using an algorithm that minimizes the difference between tree and data [18]. Interestingly, pairwise distances are often used to construct multiple sequence alignments. For example, the multiple sequence aligner clustalw first computes all pairwise distances between the input sequences. This is done either based on alignments or on a much faster alignment-free method. Apart from clustalw, MUSCLE [19] and MAFFT [20] can use k-mer distances for guide tree computation. The tree constructed from these pairwise distances then guides the order in which the sequences are fitted into the growing multiple sequence alignment [21]. DISTANCE COMPUTATION There are two classes of alignment-free methods for distance computation: gene-based and residue-based. The gene-based metrics include the number of shared genes in prokaryote genomes [22] and the number of topological changes necessary to convert one sequence of genes into another [23]. However, these methods include a step for identifying homologous genes, usually by alignment. It is therefore a Table 1: Methods for phylogenetic distance computation Distance Name Implementation Reference dkmer dffp dcv dco dgram dacs dkr k-mer Feature frequency profile Composition vector Co-phylog Grammar-based Average common substring Substitutions from repeats, Kr kt FFP CVtree Co-phylog gramDist kr kr [9] [10] [11] [8] [12] [13] [14] Alignment-free phylogenetics and population genetics A 409 S1 S2 S3 S4 CGCAATGTGTCACTCGGCACTGGGTGGGATTTGGGGCAAGCTTGGAGACT .....g..t...........a..aa.c....a...c.........c.... .....gt.............a..a..cc.......c.........c...g .....g..............a..a..cc.......c.........c...g 50 50 50 50 S1 S2 S3 S4 GGCCGCAACGTGCTTCCTTTGAAAAGATAGCTCCAGCCCTAGCACAGTAT ....c.t...........a....cc..........t......g....... ....c.t...........a....cc......a.................. ....c.t...........a....cc......................... 100 100 100 100 B C S1 S2 S3 S4 S1 0.00 0.18 0.17 0.14 S2 0.18 0.00 0.10 0.07 S3 0.17 0.10 0.00 0.02 S4 0.14 0.07 0.02 0.00 0.01 S4 D S3 0.01 S4 E S3 0.01 S4 S3 S2 S2 S2 S1 S1 S1 Figure 1: Phylogeny reconstruction. Sequence alignment (A), distance matrix (B), distance tree (C), maximum parsimony tree (D) and maximum likelihood tree (E). The trees all have the same topology and very similar branch lengths. All trees were calculated and midpoint-rooted using PHYLIP [15] and drawn using njplot [16]. slight overstatement to call them alignment-free. Hence I concentrate on methods that take unannotated DNA sequences as input. These are either based on the counts of words of some fixed length, or on the lengths of exact matches between pairs of sequences (Figure 2). When counting words, we can specify them exactly or inexactly. An inexact word contains a ‘do not care’ character denoted by 0, with 1 denoting match. If a word is defined as, say, 101, then ATA matches AAA and ATA, but not TAA [8]. Methods based on match lengths are subdivided according to the definition of the matching substring. The two most popular are common substrings and Lempel–Ziv factors (Figure 2). Word counts Perhaps the oldest word-based method was published in 1986 [24]. It quantifies the difference of overlapping dimer or trimer frequencies between sets of sequences. Counting words of a particular length—also known as k-mers—takes time OðknÞ OðnÞ, where n is the number of nucleotides analyzed. In practice, word counting is easy and fast. Hence a large number of k-mer distances have been defined [5,7,25]. A simple example is dkmer ðQ, SÞ ¼ 4k X ðqi si Þ2 , i¼1 where qi is the frequency of the i-th of 4k possible substrings of length k in Q and si is the frequency of the i-th substring of length k in S. A typical value for k is five [9]. Unfortunately, k-mer distances often lack power when applied to closely related sequences [5]. For example, Figure 3A shows the phylogeny of seven primates computed from their full mitochondrial genome sequences. The alignment was computed using clustalw in 248 s on an i5 CPU running at 2.50 GHz. The dkmer tree in Figure 3B was 104 times faster to compute (0.02 s), but only resolves the split between chimps and all other primates. Notice also that the branch lengths of the dkmer tree differ by more than three orders of magnitude from the clustalw tree. A more refined version of word counting is the feature frequency profile [10]. A feature is a word of some length, say 24. The collection of all features is the feature profile of a sequence. Such profiles are similar for similar sequences. This is quantified using the Jensen–Shannon divergence [26]. The resulting distance, dffp , has been used to construct the phylogeny of mammals from their mtDNA sequences [26], the phylogeny of 884 prokaryotes from their proteomes [27] and the phylogeny of 38 E. coli/ Shigella strains from their complete genomes [10]. For our seven example primates, dffp recovers the expected topology in 0.86 s (Figure 3C). A similar philosophy underlies the composition vector method [11]. Words of some length, say five, are counted and their frequencies divided by the frequencies expected by chance alone. Given two such normalized word frequency vectors from two genomes, their distance is defined as dcv ðS, QÞ ¼ 1 CðS, QÞ , 2 where P qi si CðS, QÞ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi P 2 P 2: qi si 410 Haubold Phylogeny Alignment No Alignment MP Distance ML Partition Distance Genes Residues Match Lengths Word Counts Exact dkmer dffp Inexact LZ-factors dco dgram dcv Common Substrings dacs dkr Figure 2: Classification of phylogeny reconstruction methods, with particular emphasis on alignment-free methods. MP: maximum parsimony; ML: maximum likelihood; LZ: Lempel ^Ziv. The distances d? are further explained in Table 1. A 100 100 100 100 B 10−5 C 0.05 D 0.01 P. Chimp P. Chimp P. Chimp 0.02 P. Chimp C. Chimp C. Chimp C. Chimp C. Chimp Human Human Human Gorilla Gorilla Gorilla Orangutan Orangutan Gibbon Gibbon Baboon E 0.01 Human Gorilla Orangutan Orangutan Gibbon Gibbon Baboon Baboon P. Chimp 0.01 P. Chimp C. Chimp C. Chimp C. Chimp Human Human Human Gorilla Gorilla Gorilla Gorilla Gibbon Orangutan P. Chimp C. Chimp Human Orangutan Baboon F 0.02 P. Chimp G 0.1 Baboon Gibbon Gibbon Orangutan Baboon Baboon H Gibbon Orangutan Baboon Figure 3: Primate phylogenies based on seven fully sequenced mitochondrial genomes. Distances were either computed from sequences aligned with (A) clustalw [19], or (B) using the k-mer distance dkmer [9], (C) feature frequency profile, dffp [10], (D) composition vector, dcv [11], (E) co-phylog, dco [8], (F) grammar-based distance, dgram [12], (G) average common substring, dacs [13] and (H) substitutions from repeats, dkr [14]. The numbers in (A) are bootstrap values. C. Chimp: common chimp; P. Chimp: pigmy chimp. This has been applied, for example, to the computation of the tree of life from small subunit rRNA sequences [11] and the phylogeny of 82 fungi from their complete genomes [28]. The computation of dcv is implemented in the Web-based CVtree software [29]. Its application to our primate data yields the correct tree (Figure 3D). The word count methods detailed so far are designed to recover the topology of a phylogeny rather than its branch lengths. Branch lengths are traditionally expressed as substitutions per site, which is difficult to estimate without alignment. However, the recently published method Co-phylog achieves just that [8]. It starts by counting words of a certain length, say seven, that do not have a match in the genome that differs at the middle position. This is repeated for a second genome. Among the intersection between the two sets of words, the proportion Alignment-free phylogenetics and population genetics of words that differ at the middle position is the distance measure, dco . Consider for example the word 101 and say that you have recovered fATA, ATG, CTC, CCCg from some sequence. Then CTC and CCC cancel each other, leaving fATA, ATGg. If fATA, AGGg were obtained from a different genome, the intersection of the matching parts is fA A, A Gg, of which one out of two differ at the middle position, i.e. dco ¼ 1=2. In real data, this approach gives a good approximation of the number of mismatches per site. As a result, dco yields in 0.27 s the tree with the most realistic branch lengths so far (Figure 3E). On the down side, it groups orangutan with baboon. Co-phylog is the only method in this review that is also designed to compute distances between unassembled reads. Match lengths Closely related sequences contain longer exact matches than divergent sequences. This insight is used in a class of related distance measures. Like k-mer distances, they can be computed in linear time. However, linear-time algorithms for finding matches of arbitrary length are based on an index of the input sequences known as a suffix tree, which is more demanding to compute than word counts. Figure 4 shows the suffix tree of a short example string. Its defining feature is that each suffix S½i::jSj is represented as the concatenated path label from the root to leaf i. Notice the $ at the end of the example sequence S ¼ TATACCAATCCT$. This ‘sentinel character’ differs from all other characters in the string. Its addition ensures that even a suffix that is the prefix of another suffix is represented by a leaf in 3 4 T A 5 6 C C 13 $ T $ T CC A $ $ 5 10 11 12 $ ATCCT TCACCACAAT CCT$ Figure 4: Suffix tree. T$ 8 6 T$ AATCC 2 11 12 C T T$ C TC CT 4 T$ ACCACAC TCCT$ 7 10 C AA T CCAATCCT$ ATC CT$ 13 7 8 9 A A T C 2 A $ 1 T A T 3 1 9 411 the suffix tree. For example, the last character in S apart from the sentinel, S½12 ¼ T, is a prefix of the suffix starting at position 9, S½9::12 ¼ TCCT. All repeated prefixes of individual suffixes are collapsed into single paths, which makes a suffix tree ideal for searching. To look up CC, for example, walk from the root until CC has been traversed. Then look up the leaves below the last match to find that CC starts at positions 5 and 10. This search took time proportional to the length of CC, independent of the length of S. Readers unfamiliar with suffix trees might suspect that searching independent of input size is too good to be true. But think of a book index, where searching takes time proportional to the length of the word being looked up, regardless of the length of the book. A suffix tree is a generalized index, in which any substring in the text can be looked up. Its construction in linear time was first proposed in 1973 [30] and is one of the seminal achievements of computer science [31], even though it does not feature in many of the standard texts on algorithms [32]. There are numerous applications of suffix trees in bioinformatics, most notably perhaps in the genome aligner MUMmer [33]. Among the distances surveyed in this review (Figure 2), dgram is based on suffix trees. However, the memory requirement of suffix trees exceeds that of the underlying string by at least one order of magnitude. Fortunately, suffix trees can be approximated as an array of alphabetically ordered suffixes. Table 2 shows the suffix array of our example string; it consists of the column ‘SA’, which contains the starting positions of the sorted suffixes. For example, the suffix starting at position 10 in the input string, CCT$, comes eighth after alphabetical sorting. To compute this suffix array, one could traverse the suffix tree in Figure 4 and note the leaf labels. This works because in our example suffix tree, the outgoing edges are alphabetically ordered. The implication is that suffix arrays can be computed in O(n). However, an intermediate suffix tree would cancel the space saving that makes suffix arrays attractive in the first place. When first proposed in 1993, direct construction of suffix arrays was expected to run in O(n) time [34]. This expectation was met for random sequences, with the worst-case run time being Oðn logðnÞÞ. In practice, suffix array construction took 3–10 times longer than suffix tree construction [34]. Since then there has been great interest in devising algorithms for suffix array construction that run as fast as 412 Haubold Table 2: Suffix array (SA) enhanced by a ‘longest common prefix’ array (LCP) Rank SA Suffix LCP LengthðLCPÞ 1 2 3 4 5 6 7 8 9 10 11 12 13 13 7 4 2 8 6 5 10 11 12 3 1 9 $ AATCCT$ ACCAATCCT$ ATACCAATCCT$ ATCCT$ CAATCCT$ CCAATCCT$ CCT$ CT$ T$ TACCAATCCT$ TATACCAATCCT$ TCCT$ A A AT C CC C T TA T 0 0 1 1 2 0 1 2 1 0 1 2 1 The longest common prefix of suffixi is computed by comparing it with the prefix of suffix i ^ 1; if there is no common prefix, LCP is empty. See text for further details. suffix tree construction [35]. This recently culminated in the publication of a linear-time suffix array construction algorithm together with its highly efficient implementation [36]. To simulate suffix tree operations with a suffix array, the suffix array is usually enhanced by an array of longest common prefix (LCP) lengths. Table 2 shows the LCP array corresponding to our example sequence. The suffix with Rank½8, CCT$, shares the prefix CC with the suffix at the preceding Rank½7, CCAATCCT$. Hence the LCP of this pair of suffices is CC and its length is 2. LCP arrays can be constructed from the corresponding suffix array in O(n) time using a fast algorithm [37]. I already noted that traversal of our example suffix tree yields the suffix array in Table 2. However, the reason for constructing this Table in the first place is that its traversal simulates the traversal of the corresponding suffix tree [37]. To get an intuition for how the suffix tree can be recovered from its enhanced suffix array, draw the root and a leaf labeled with the first entry in the suffix array, SA½1 ¼ 13 (Figure 5A). Label the edge connecting the root and the leaf with the first suffix, $. After this initialization step, repeat the following for each remaining Rank½i: draw an edge off the edge just drawn such that LengthðLCP½iÞ is the length of the edge label from the root to the new branch point. Attach a leaf to the new edge and label it SA½i. Label the path from the root to the new leaf with Suffix½i (Figure 5B and C). After 12 repetitions, the desired suffix tree is recovered (Figure 5D). A formalized version of this can be found, for example, in [38]. The programs used to compute dacs and dkr are based on enhanced suffix arrays (Figure 2). Computations based on suffix trees have come a long way in recent years. This means that all distances based on substring lengths can in theory be computed in O(n) time. The enhanced suffix array is only one of several space-efficient data structures for carrying out suffix tree operations. The FM index is currently the most space-efficient among these [39,40]. It underlies the programs bowtie [41] and bwa [42], which can efficiently map short sequencing reads to mammalian genomes on hardware with only 4 GB of RAM. Grammar-based distance, dgram A grammar is a set of rules for decomposing a string into its constituents. For example, the rule might be to look for repeated substrings to compress the input string, as in this naı̈ve example: AAACCCC ! A3 C4 There is a close relationship between compression and distance estimation: similar sequences are more compressible than divergent sequences. The compression scheme most often applied in distance estimation is based on the Lempel–Ziv factorization, which also underlies such programs as gzip for compressing files: look for the longest match S½i::j that starts somewhere in S½1::i 1. If such a match exists, the desired factor is S½i::j þ 1 and the search continues at S½j þ 2. If there is no such match, the factor is S½i and the search continues at S½i þ 1. The concatenation of all factors regenerates the input string. For example, TATACCAATCCT ! T, A, TAC, CA, ATC, CT In this case, S is decomposed into six substrings, which we express as cðSÞ ¼ 6. Consider now a second sequence, Q, which is identical to S. Concatenate S and Q and compute their decomposition. It contains only one extra element, all of Q; that is, cðSQÞ ¼ 7. However, if Q differs from S, cðSQÞ grows. This leads to a number of related decomposition- or grammar-based distance measures [43], the most recent of which is [12] ( cðSQÞcðSÞ dgram ðS, QÞ ¼ jSj jQj cðSÞ cðQSÞcðQÞ jQj cðQÞ jSj if jSj > jQj otherwise: Alignment-free phylogenetics and population genetics B −→ −→ AA TC A 413 $ CT $ $ 13 7 13 5 11 12 10 $ ATCCT TCACCACAAT CCT$ 8 6 T$ AATCC T$ A $ 2 T$ CC A $ T$ C CC T$ 4 AA T 7 T C 4 13 T$ ACCACAC TCCT$ 7 T CC TC AAA TCC CT$ T$ CC AA ATCTCCT$ CT$ 13 D $ −→ −→ −→ A C 3 9 1 Figure 5: Na|º ve suffix tree construction based on the enhanced suffix array in Table 2. The dgram tree in Figure 3F has the correct topology. Its underlying distances were computed in 0.15 s. Average common substring, dacs Instead of decomposing SQ, we can look up the longest match in S starting at every position in Q. In contrast to the Lempel–Ziv factors, longest matches are overlapping. Call LðQ, SÞ the average of these match lengths and define [13] 0 dacs ðQ, SÞ ¼ logðjSjÞ=LðQ, SÞ 2 logðjQjÞ=jQj: Because the labeling of sequences matters, that is 0 0 ðQ, SÞ 6¼ dacs ðS, QÞ, the final distance measure is dacs defined as the arithmetic average dacs ðQ, SÞ ¼ 0 0 dacs ðQ, SÞ þ dacs ðS, QÞ : 2 Figure 3G shows the dacs tree computed in 0.36 s. The positions of gibbon and orangutan are switched compared with the reference tree Figure 3A. From match lengths to mutation distances, dkr A run of matching nucleotides between a query and a subject sequence is interrupted by a mutation. This could be a point mutation, an indel or a rearrangement. In other words, we can interpret the common substring length that is used to compute dacs as one shorter than the distances to the next mutation. As the longest common substrings extended by one turn into the SHortest Unique subSTRING, they have been called shustrings [44]. Their lengths are converted to the number of mutations, m, by realizing that the probability that a shustring of length X is longer than some threshold t is PrfX > tg ¼ ð1 m=lÞt etm=l , ð1Þ where l is the length of the subject sequence. m can be estimated from equation (1) if query and subject are so similar that all matches are due to homology. However, the probability of random matches grows with increasing divergence. Assuming that all nucleotides have equal frequencies, random matches are corrected for using [14] l PrfX tg ¼ ð1 etm=l Þð1 4t Þ : ð2Þ Finally, m is corrected for multiple substitutions to yield an alignment-free version of the classical Jukes– Cantor distance [14,45]: 3 4 dkr ¼ ln 1 m : 4 3l ð3Þ Figure 3H shows the primate tree based on dkr values computed in 0.1 s. It has the same topology as the dacs tree. This is not surprising, as both measures are based on the same type of match lengths (Figure 2). However, dkr estimates the number of Haubold substitutions that also underlie the clustalw reference tree. Accordingly, the branch lengths of the reference tree are more similar to the branch lengths of the dkr tree than to those of the dacs tree. The seven distances we have surveyed give a paradoxical result on our primate test data: the two distances that best approximate mutation distances, dco and dkr , give false topologies. On the other hand, some of the distances that do not approximate substitution rates, including dffp , dcv and dgram , give the correct topology. This probably means that dco and dkr are more sensitive to imperfections in the data, such as differences in the sequence lengths, than other measures. I therefore investigate the relationship between substitutions and the various divergence measures in the next section. APPLICATIONS Except for dcv , for which I could not find a standalone implementation, I now apply the distance measures classified in Figure 2. I begin with simulated data and then discuss applications to large empirical data sets. Simulated data The differences in absolute and relative branch lengths in the primate trees (Figure 3) suggest two things: (i) our distance measures differ in their relationship to the number of mutations, and (ii) this relationship changes as a function of divergence. To clarify these two points I carried out simulations of evolution by mutation alone: pairs of equally long sequences are subjected to between 105 and 0.8 substitutions per site and the divergence is estimated (Figure 6). The worst measure appears to be dkmer : it is < 106 if there are less than 0.01 substitutions per site and saturates for > 0:2 substitutions per site. dacs , dffp and dgram are all off by approximately an order of magnitude and saturate for high substitution rates. In contrast, dco and dkr give the best fit between simulated and estimated distances. As noted previously [45], dkr is limited to substitution rates < 0:5. When using the substitution rate of 0:8 dco gave no distances for the 100-kb sequences simulated, but it did work with longer sequences, e.g. 1 MB. These simulations demonstrate that dco and dkr recover the correct substitution rate from data simulated under the model that also underlies the derivation of these distances. 10 1 0.1 Divergence 414 d d d d d d ideal 0.01 0.001 0.0001 1e − 05 1e − 06 0.0001 0.001 0.01 0.1 Substitutions/Site 1 Figure 6: Log/log plot of divergence measures as a function of the number of substitutions per site, K. Plotted are means from 1000 iterations. Pairs of 100 -kb sequences were simulated using the program simK, which is available from the accompanying Web site guanine.evolbio.mpg.de/aliFreeReview. Before applying distance measures to large data sets, it is important to look at run times. The implementations I use may or may not be the best possible. The run times summarized in Figure 7 should therefore be read as a rough guide to current implementations rather than a statement on how quickly the various distances could in principle be computed. Apart from dcv , I also left out dacs , because its algorithm is essentially the same as that used to compute dkr . dkmer can be computed fastest by a large margin. However, this is followed by the two alignment-free estimates of substitution rates, dco and dkr . dffp and dgram are both slower. In the following I therefore concentrate on dkr and dco as the fastest and most accurate distance measures. Empirical data The distances underlying the primate trees in Figure 3 were computed from 115 kb of mitochondrial sequence. All alignment-free computations took less than 1 s. I also applied dco and dkr to the set of 30 E. coli investigated by Yi and Jin [8]. These comprise 147 MB, that is, a thousand times more than the primate data set. The dco computation took 278 s and the dkr computation 403 s on a Xeon CPU running at 2.40 GHz. Not only is the Alignment-free phylogenetics and population genetics implementation of dco faster, it also gives the more accurate—although still not perfect—tree [8]. dkr has also originally been applied to the sequences of 12 complete Drosophila genomes comprising 2.03 GB. Its implementation took 3.25 h using 72 GB RAM to compute a tree that agreed with the reference tree [45]. However, the dco computation used only 47 min and 2.4 GB RAM to yield the tree shown in Figure 8B. It has the same topology as the reference tree (Figure 8A) and its relative branch lengths 100000 10000 d d d d d ideal Time (s) 1000 100 10 415 are also similar. The most notable exception is the invisibly short branch that separates the melanogaster clade including Drosophila willistoni from the virilis clade in Figure 8B. Moreover, the absolute branches are approximately five times longer on the reference tree than on the dco tree. We conclude that when it comes to choosing an alignment-free distance measure, dco is a strong candidate, especially when analyzing large genomes where the time and/or memory consumption of other methods is often prohibitive. The fact that dco can also be used to compute distances between unassembled reads is an additional advantage. However, recall that in Figure 3E, dco did not return the correct primate phylogeny, so the jury is still out on which method is best. I now turn to population genetics, where not only the number of mutations but also the distances between them is an important diagnostic of the evolutionary process. 1 POPULATION GENETICS 0.1 0.01 1000 10000 100000 1e + 06 1e + 07 1e + 08 Length (kb) Figure 7: Log/log plot of run times for various implementations of divergence measures as a function of sequence length. Pairs of sequences were simulated using the program simK with a divergence of 0.01. A 0.1 D. grimshawi D. virilis Population genetics is concerned with discovering the forces of evolution acting at the level of interbreeding groups of organisms. These forces are foremost mutation, recombination and selection. Their action is influenced by a wide variety of factors, including population structure and size, both of which may vary over time. For example, a sudden change in environmental conditions might precipitate a rapid reduction in population size. The B 0.02 D. grimshawi D. virilis D. mojavensis D. willistoni D. persimilis D. pseudoobscura D. ananassae D. mojavensis D. willistoni D. persimilis D. pseudoobscura D. ananassae D. erecta D. erecta D. yakuba D. yakuba D. sechellia D. sechellia D. simulans D. simulans D. melanogaster D. melanogaster Figure 8: Phylogeny of 12 Drosophila species computed from their complete genomes. (A) Reference tree distributed with the genome project [46]. (B) Tree computed from dco and midpoint-rooted using PHYLIP [15] and drawn using njplot [16]. 416 Haubold opposite of such a population bottle neck occurs when a population expands steadily into a new habitat. The current 1000 genome projects lead to a huge amount of population genetic data. Most of these data are mapped to reference sequences and hence aligned. However, once de novo assembly of these data becomes feasible, alignment-free investigations of the resulting large assemblies might be a viable filtering step, before using alignment to investigate interesting portions of the data. In the following I outline potential applications of alignment-free methods in population genetics. My aim is to pique the readers’ interest, as, to date, little alignment-free population genetics has been done. I discuss the three fundamental population genetic quantities: mutation, recombination and selection. Mutation Our aim is to compute the number of mismatches per site between a pair of homologous sequences, p, without alignment. dco could be a good tool for this even before assembly. Alternatively, we note that p is the probability of mutation per nucleotide, hence 1=p is the expected distance to the next mutation. To a first approximation, the expected distance to the next mutation is the expected shustring length used to calculate dkr : E½p ¼ E½Xi , where Xi is the length of the shustring starting at Q½i. The approximation holds if we assume that shustring lengths always correspond to homologous matches, which is a good heuristic for closely related sequences such as those sampled from within populations. Hence we can estimate p by replacing the expected shustring length by its average: pm ¼ 1 : avgðXi Þ In contrast to dkr , pm does not correct for nonhomologous matches and can thus be calculated very quickly. This makes it suitable for rapid sliding window analyses. On the other hand, it can only be applied to closely related sequences sampled from within populations rather than to the more divergent sequences sampled between species [47,48]. In contrast to phylogeny reconstruction, which usually deals with individuals sampled from separate species, population genetics is concerned with individuals sampled from interbreeding groups. This begs the question how recombination affects mutations. Recombination Recombination is the exchange of genetic material between chromosomes. In diploid sexual organisms, it can be observed under the light microscope as crossing over during gamete formation by meiosis. This process is modeled as the exchange of maternal and paternal genetic material at a random point along a pair of homologous chromosomes (Figure 9). In a population with recombination, two things can happen as we trace lines of descent backwards in time: two lines may coalesce, thereby reducing the number of lineages by one, or there is a recombination event, which may increase the number of lineages by one. The lines of descent for a sample of homologous sequences are called ‘coalescent’ and for more details on coalescent theory, the reader may consult [3]. Figure 10 shows the topology of the coalescent for four sequences with recombination [49]. Each pairwise comparison involving sequence #3 diagnoses two segments with different mutation rates due to the two coalescence times along sequence #3. Figure 11 shows a simulation of this varying time to the most recent common ancestor and the resulting clustering of mutations. Such clustering corresponds to a change in the variance of the match length. This could be used to test for recombination [50]. Given that pm is sensitive to recombination, it might actually be used to estimate the rate of recombination. However, it turns out that two values of pm correspond to a given rate of recombination if we allow recombination to exceed mutation [47], as is often observed in nature [51, p. 89]. More promising is to test the null hypothesis of no recombination by comparing the observed variance of shustring lengths with that expected in the absence of recombination [50]. Selection A favorable variant of a gene can rise in frequency in the population, and such a beneficial allele may ultimately get fixed in the population. In the process, X Figure 9: Reciprocal recombination between two homologous chromosomes at the crossing over point X. Alignment-free phylogenetics and population genetics C alignment-free work is just beginning. However, as recombination and selection affect the distribution of mutations along sequences, match-length methods should be suited for work in this field. This will be facilitated by the fact that suffix trees and their abstraction suffix arrays are rapidly becoming as ubiquitous as dynamical programming matrices were in classical bioinformatics. Most recent common ancestor C time C C R Homologous sequences 1 2 3 4 Figure 10: Ancestral recombination graph for four homologous sequences. R: recombination; C: coalescence. 1.8 1.6 1.4 1.2 TMRCA 417 1 Key Points Phylogenetics is sped up by alignment-free distance computation. Alignment-free distances are computed from word counts or match lengths. Phylogenies are based on the number of mutations; in population genetics, the spatial distribution of mutations also matters. Word counts are best for computing the number of mutations. Match lengths are best for computing the distribution of mutations. 0.8 0.6 0.4 Acknowledgements I thank Heng Li, Peter Pfaffelhuber, Angelika Börsch-Haubold and Linda Krause for helpful comments. 0.2 0 0 2 4 6 8 10 Position Figure 11: Time to the most recent common ancestor (TMRCA) as a function of genome position. TMRCA is measured in 2N generations; the vertical lines on the x-axis mark mutations. neutral variants linked to the favorable allele are eliminated from the population. In other words, the neutral variants go to fixation by hitching a ride with the favorable allele. This hitchhiking model of selection was first proposed by Maynard Smith 40 years ago [52]. Today it is a central tool in population genetics [53]. The rise of a favorable allele is also called a selective sweep, as linked neutral variants are swept out of the population in the process. A selective sweep leads to an increase in the length of the homozygous tract surrounding the favorable allele. This would be observable as an increase in the local match length. However, we have already seen that recombination can also lead to an increase in match length. To distinguish these two potential causes of match length variation, should be a rewarding topic for future research in alignment-free population genetics. In phylogeny reconstruction, the dco distance is currently the state of the art. In population genetics, References 1. Zuckerkandl E, Pauling L. Molecules as documents of evolutionary history. J Theor Biol 1965;8:357–66. 2. Felsenstein J. Inferring Phylogenies. Sunderland: Sinauer, 2004. 3. Wakeley J. Coalescent Theory: An Introduction. Colorado: Roberts & Company, 2009. 4. Vinga S, Almeida J. Alignment-free sequence comparison— a review. Bioinformatics 2003;19:513–23. 5. Höhl M, Ragan M. Is multiple-sequence alignment required for accurate inference of phylogeny? Syst Biol 2007;56(2):206–21. 6. Guyon F, Brochier-armanet C, Guénoche A. Comparison of alignment free string distances for complete genome phylogeny. Adv Data Anal Classif 2009;3:95–108. 7. Cheng J, Cao F, Liu Z. AGP: a multimethods web server for alignment-free genome phylogeny. Mol Biol Evol 2013; 30:1032–37. 8. Yi H, Jin L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res 2013;41:e75. 9. Yang K, Zhang L. Performance comparison between ktuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Res 2008;36(5):e33. 10. Sims GE, Kim SH. Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc Natl Acad Sci USA 2011;108:8329–34. 11. Qi J, Wang B, Hao BI. Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J Mol Evol 2004;58:1–11. 418 Haubold 12. Russell DJ, Way SF, Benson AK, Sayood K. A grammarbased distance metric enables fast and accurate clustering of large sets of 16S sequences. BMC Bioinformatics 2010;11: 601. 13. Ulitsky I, Burstein D, Tuller T, Chor B. The average common substring approach to phylogenomic reconstruction. J Comput Biol 2006;13:336–50. 14. Haubold B, Pfaffelhuber P, Domazet-Lošo M, Wiehe T. Estimating mutation distances from unaligned genomes. J Comput Biol 2009;16:1487–500. 15. Felsenstein J. PHYLIP (phylogeny interference package) version 3.6. 2005. 16. Perrière G, Gouy M. WWW-Query: an on-line retrieval system for biological sequence banks. Biochimie 1996;78: 364–9. 17. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1986;4:406–25. 18. Fitch WM, Margoliash E. Construction of phylogenetic trees. Science 1967;155:279–84. 19. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004; 32:1792–7. 20. Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinformatics 2008;9:286–98. 21. Larkin M, Blackshields G, Brown N, et al. Clustal W and Clustal X version 2.0. Bioinformatics 2007;23:2947–8. 22. Tettelin H, Masignani V, Cieslewicz M, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial ‘‘pan-genome’’. Proc Natl Acad Sci USA 2005;102(39):13950–5. 23. Nadeau JH, Sankoff D. Counting on comparative maps. Trends Genet 1998;14:495–501. 24. Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci USA 1986;83:5155–9. 25. Reinert G, Chew D, Sun F, Waterman M. Alignment-free sequence comparison (I): statistics and power. J Comput Biol 2009;16:1615–34. 26. Sims EG, Jun SR, Wu AG, Kim SH. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolution. Proc Natl Acad Sci USA 2009;106: 2677–82. 27. Jun SR, Sims GE, Wu GA, Kim SH. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution. Proc Natl Acad Sci USA 2010;107:133–8. 28. Wang H, Xu Z, Gao L, Hao B. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evol Biol 2009;9:195. 29. Qi J, Luo H, Hao B. CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res 2004;32:W45–7. 30. Weiner P. Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and AutomataTheory (swat 1973). Washington, DC, USA: IEEE Computer Society SWAT’73, 1973;1–11. 31. Gusfield D. Algorithms on Strings,Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1973. 32. Cormen TH, Leiserson CE, Rivest RL, Stein CS. Introduction toAlgorithms. Cambridge, MA: The MIT Press, 2009. 33. Kurtz S, Phillippy A, Delcher A, et al. Versatile and open software for comparing large genomes. Genome Biol 2004; 5(2):R12. 34. Manber U, Myers EW. Suffix arrays: a new method for online string searches. SIAM J Comput 1993;22:935–48. 35. Puglisi SJ, Smyth WF, Turpin AH. A taxonomy of suffix array construction algorithms. ACMComputSurv, 2007;39:4. 36. Nong G. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACMTrans Inf Syst 2013;31:1–15. 37. Kasai T, Lee G, Arimura H, et al. Linear-time longestcommon-prefix computation in suffix arrays and its applications. LNCS 2001;2089:181–92. 38. Abouelhoda M, Kurtz S, Ohlebusch E. The enhanced suffix array and its applications to genome analysis. In: Proceedings of the Second Workshop on Algorithms in Bioinformatics. Lecture Notes in Computer Science 2452, 2002, pp. 449–63. Springer-Verlag. 39. Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proceedings of the 41st Symposium on Foundation of Computer Science (FOCS 2000). IEEE Computer Society, 2000;390–98. 40. Ferragina P, González R, Navarro G. Compressed text indexes: from theory to practice. ACM J Exp Algorithmics 2008. 13, 1.12:1–31. 41. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:R25. 42. Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60. 43. Otu HH, Sayood K. A new sequence distance measure for phylogenetic tree construction. Bioinformatics 2003;19:2122–30. 44. Haubold B, Pierstorff N, Möller F, Wiehe T. Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics 2005;6:123. 45. Domazet-Lošo M, Haubold B. Efficient estimation of pairwise distances between genomes. Bioinformatics 2009;25: 3221–7. 46. Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature 2007;450: 203–18. 47. Haubold B, Reed FA, Pfaffelhuber P. Alignment-free estimation of nucleotide diversity. Bioinformatics 2011;27:449–55. 48. Haubold B, Pfaffelhuber P. Alignment-free population genomics: an efficient estimator of sequence diversity. Genes Genomes Genetics 2012;2:883–9. 49. Hudson RR. Gene genealogies and the coalescent process. Oxford Surveys Evol Biol 1990;7:1–44. 50. Haubold B, Krause L, Horn T, Pfaffelhuber P. An alignment-free test for recombination. Bioinformatics. doi:10. 1093/bioinformatics/btt550 (Advance Access publication 12 October 2013). 51. Lynch M. The Origins of Genome Architecture. Sunderland: Sinauer, 2007. 52. Maynard Smith J, Haigh J. The hitch-hiking effect of a favourable gene. Genet Res Cambridge 1974;23:23–35. 53. Stephan W. Genetic hitchhiking versus background selection: the controversy and its implications. PhilosTrans R Soc Lond B 2010;365:1245–53.
© Copyright 2026 Paperzz