Alignment-free phylogenetics and population genetics

B RIEFINGS IN BIOINF ORMATICS . VOL 15. NO 3. 407^ 418
Advance Access published on 29 November 2013
doi:10.1093/bib/bbt083
Alignment-free phylogenetics and
population genetics
Bernhard Haubold
Submitted: 5th July 2013; Received (in revised form) : 17th October 2013
Abstract
Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on comparative
data, today usually DNA sequences. These have become so plentiful that alignment-free sequence comparison is of
growing importance in the race between scientists and sequencing machines. In phylogenetics, efficient distance
computation is the major contribution of alignment-free methods. A distance measure should reflect the number
of substitutions per site, which underlies classical alignment-based phylogeny reconstruction. Alignment-free distance measures are either based on word counts or on match lengths, and I apply examples of both approaches to
simulated and real data to assess their accuracy and efficiency. While phylogeny reconstruction is based on the
number of substitutions, in population genetics, the distribution of mutations along a sequence is also considered.
This distribution can be explored by match lengths, thus opening the prospect of alignment-free population
genomics.
Keywords: phylogenetics; population genetics; mutation distance; match length; suffix tree
INTRODUCTION
Evolutionary biologists think about descent on two
levels: phylogeny is concerned with the evolution of
species and higher taxonomic orders, and population
genetics with groups below the species level. Darwin’s
The Origin of Species contains a single figure: a phylogeny. This reminds us that phylogenies are the central metaphor of evolutionary biology and attempts
to construct them are as old as the field itself. Phylogenies are constructed by comparing homologous
characters that vary. The most useful of these are
selectively neutral, as their variation closely reflects
the passage of time, rather than the action of evolutionary forces. In 1965, Zuckerkandl and Pauling
were among the first to note that the residues of
nucleotide and protein sequences record an organism’s evolutionary history [1]. To decipher this
history is the aim of modern phylogeny reconstruction, a large field with wide application throughout
biology [2]. Population genetics has an equally distinguished history, but instead of reconstructing descent, its central metaphor is a random tree known as
the coalescent, which is used to simulate evolution
[3]. More work has been done on alignment-free
phylogeny than on alignment-free population genetics. Hence this review is centered on phylogeny
followed by a coda on population genetics.
Alignment-free sequence comparison was last
reviewed comprehensively 10 years ago [4]. In addition, 10 methods were extensively compared 6 years
ago yielding a set of reference implementations in
Python at http://acb.qfab.org/acb/decafþpy/ [5]. A
review of four methods was published 4 years ago
[6]. Recently, 12 methods for alignment-free phylogenetic reconstruction were posted under a unifying
interface at http://www.herbbol.org:8000/agp [7]. I
concentrate on DNA-based methods because their
often superior efficiency makes them most relevant
to the current interest in analyzing next-generation
sequencing data. These methods are traditionally
model-free in the sense that the distances they
yield cannot be interpreted in terms of mutations,
the most widely used quantity for reconstructing
phylogenies. However, I also survey two exceptions
to this rule, one of which allows not only alignmentfree but also assembly-free phylogeny reconstruction
Corresponding author. Bernhard Haubold. E-mail: [email protected]
Bernhard Haubold is a tenured research scientist at the Max-Planck-Institute for Evolutionary Biology. Before that he worked in the
biotech industry and was Professor for Bioinformatics at the Fachhochschule Weihenstephan.
ß The Author 2013. Published by Oxford University Press. For Permissions, please email: [email protected]
408
Haubold
[8]. In contrast to phylogeny, population genetics
considers not only the number of mutations but
also the distribution of the distances between them.
This is equivalent to the distribution of match
lengths.
This review is organized as follows. I first remind
the reader of classical phylogeny reconstruction
(Section 2: Phylogeny). Then I describe seven
alignment-free methods of distance computation
that have been used in phylogeny reconstruction
(Section 3: Distance Computation). The methods
are classified in Figure 2, and their names and references are listed in Table 1. In addition, the Web site
http://guanine.evolbio.mpg.de/aliFreeReview/ lists
links to implementations of all software used in this
review. In Section 4 (Applications), I apply the
methods to simulated and empirical sequence data.
I use three empirical data sets: seven primate mitochondrial genomes, 30 Escherichia coli/Shigella
genomes and 12 Drosophila genomes. The accompanying Web site also contains links to these. In
Section 5 (Population Genetics), finally, I discuss
the prospect of applying alignment-free methods in
population genetics.
PHYLOGENY
Figure 1A shows an alignment of four 100 bp
sequences. The underlying phylogeny can be discovered in two different ways: (i) search for the
tree that best explains the data, or (ii) compute the
pairwise distances between the sequences (Figure 1B)
and construct a tree from them (Figure 1C). The tree
search methods are further divided according to their
search criterion: maximum parsimony or maximum
likelihood. The maximum parsimony criterion aims
for the tree that implies the smallest number of mutations (Figure 1D). Similarly, the maximum likelihood method aims to discover the tree that best
explains the data, given some explicit mutation
model (Figure 1E). For a comprehensive description
of alignment-based phylogeny reconstruction, see
[2].
Without an alignment, it is difficult to score trees.
Apart from one exception, alignment-free phylogeny reconstruction is therefore based on distances.
For the exception, the sequence data are recursively
partitioned according to the distribution of k-mers
without first computing a distance matrix. However,
the partition method does not outperform lesscomplicated distance-based methods [5] and hence
I concentrate on those.
Distances are clustered using an algorithm such as
neighbor joining [17]. Alternatively, distances can be
clustered using an algorithm that minimizes the difference between tree and data [18]. Interestingly,
pairwise distances are often used to construct multiple sequence alignments. For example, the multiple
sequence aligner clustalw first computes all pairwise distances between the input sequences. This is
done either based on alignments or on a much faster
alignment-free method. Apart from clustalw,
MUSCLE [19] and MAFFT [20] can use k-mer distances for guide tree computation. The tree constructed from these pairwise distances then guides
the order in which the sequences are fitted into the
growing multiple sequence alignment [21].
DISTANCE COMPUTATION
There are two classes of alignment-free methods for
distance computation: gene-based and residue-based.
The gene-based metrics include the number of
shared genes in prokaryote genomes [22] and the
number of topological changes necessary to convert
one sequence of genes into another [23]. However,
these methods include a step for identifying homologous genes, usually by alignment. It is therefore a
Table 1: Methods for phylogenetic distance computation
Distance
Name
Implementation
Reference
dkmer
dffp
dcv
dco
dgram
dacs
dkr
k-mer
Feature frequency profile
Composition vector
Co-phylog
Grammar-based
Average common substring
Substitutions from repeats, Kr
kt
FFP
CVtree
Co-phylog
gramDist
kr
kr
[9]
[10]
[11]
[8]
[12]
[13]
[14]
Alignment-free phylogenetics and population genetics
A
409
S1
S2
S3
S4
CGCAATGTGTCACTCGGCACTGGGTGGGATTTGGGGCAAGCTTGGAGACT
.....g..t...........a..aa.c....a...c.........c....
.....gt.............a..a..cc.......c.........c...g
.....g..............a..a..cc.......c.........c...g
50
50
50
50
S1
S2
S3
S4
GGCCGCAACGTGCTTCCTTTGAAAAGATAGCTCCAGCCCTAGCACAGTAT
....c.t...........a....cc..........t......g.......
....c.t...........a....cc......a..................
....c.t...........a....cc.........................
100
100
100
100
B
C
S1
S2
S3
S4
S1
0.00
0.18
0.17
0.14
S2
0.18
0.00
0.10
0.07
S3
0.17
0.10
0.00
0.02
S4
0.14
0.07
0.02
0.00
0.01
S4
D
S3
0.01
S4
E
S3
0.01
S4
S3
S2
S2
S2
S1
S1
S1
Figure 1: Phylogeny reconstruction. Sequence alignment (A), distance matrix (B), distance tree (C), maximum parsimony tree (D) and maximum likelihood tree (E). The trees all have the same topology and very similar branch
lengths. All trees were calculated and midpoint-rooted using PHYLIP [15] and drawn using njplot [16].
slight overstatement to call them alignment-free.
Hence I concentrate on methods that take unannotated DNA sequences as input. These are either
based on the counts of words of some fixed length,
or on the lengths of exact matches between pairs of
sequences (Figure 2). When counting words, we can
specify them exactly or inexactly. An inexact word
contains a ‘do not care’ character denoted by 0, with
1 denoting match. If a word is defined as, say, 101,
then ATA matches AAA and ATA, but not TAA [8].
Methods based on match lengths are subdivided according to the definition of the matching substring.
The two most popular are common substrings and
Lempel–Ziv factors (Figure 2).
Word counts
Perhaps the oldest word-based method was published in 1986 [24]. It quantifies the difference of
overlapping dimer or trimer frequencies between
sets of sequences. Counting words of a particular
length—also known as k-mers—takes time
OðknÞ OðnÞ, where n is the number of nucleotides
analyzed. In practice, word counting is easy and fast.
Hence a large number of k-mer distances have been
defined [5,7,25]. A simple example is
dkmer ðQ, SÞ ¼
4k
X
ðqi si Þ2 ,
i¼1
where qi is the frequency of the i-th of 4k possible
substrings of length k in Q and si is the frequency of
the i-th substring of length k in S. A typical value for
k is five [9]. Unfortunately, k-mer distances often
lack power when applied to closely related sequences
[5]. For example, Figure 3A shows the phylogeny of
seven primates computed from their full mitochondrial genome sequences. The alignment was computed using clustalw in 248 s on an i5 CPU
running at 2.50 GHz. The dkmer tree in Figure 3B
was 104 times faster to compute (0.02 s), but only
resolves the split between chimps and all other primates. Notice also that the branch lengths of the
dkmer tree differ by more than three orders of magnitude from the clustalw tree.
A more refined version of word counting is the
feature frequency profile [10]. A feature is a word of
some length, say 24. The collection of all features is
the feature profile of a sequence. Such profiles are
similar for similar sequences. This is quantified using
the Jensen–Shannon divergence [26]. The resulting
distance, dffp , has been used to construct the phylogeny of mammals from their mtDNA sequences
[26], the phylogeny of 884 prokaryotes from their
proteomes [27] and the phylogeny of 38 E. coli/
Shigella strains from their complete genomes [10].
For our seven example primates, dffp recovers the
expected topology in 0.86 s (Figure 3C).
A similar philosophy underlies the composition
vector method [11]. Words of some length, say
five, are counted and their frequencies divided by
the frequencies expected by chance alone. Given
two such normalized word frequency vectors from
two genomes, their distance is defined as
dcv ðS, QÞ ¼
1 CðS, QÞ
,
2
where
P
qi si
CðS, QÞ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
P 2 P 2:
qi si
410
Haubold
Phylogeny
Alignment
No Alignment
MP
Distance
ML
Partition
Distance
Genes
Residues
Match Lengths
Word Counts
Exact
dkmer
dffp
Inexact
LZ-factors
dco
dgram
dcv
Common Substrings
dacs
dkr
Figure 2: Classification of phylogeny reconstruction methods, with particular emphasis on alignment-free methods. MP: maximum parsimony; ML: maximum likelihood; LZ: Lempel ^Ziv. The distances d? are further explained in
Table 1.
A
100
100
100
100
B 10−5
C 0.05
D
0.01
P. Chimp
P. Chimp
P. Chimp
0.02
P. Chimp
C. Chimp
C. Chimp
C. Chimp
C. Chimp
Human
Human
Human
Gorilla
Gorilla
Gorilla
Orangutan
Orangutan
Gibbon
Gibbon
Baboon
E 0.01
Human
Gorilla
Orangutan
Orangutan
Gibbon
Gibbon
Baboon
Baboon
P. Chimp
0.01
P. Chimp
C. Chimp
C. Chimp
C. Chimp
Human
Human
Human
Gorilla
Gorilla
Gorilla
Gorilla
Gibbon
Orangutan
P. Chimp
C. Chimp
Human
Orangutan
Baboon
F
0.02
P. Chimp
G 0.1
Baboon
Gibbon
Gibbon
Orangutan
Baboon
Baboon
H
Gibbon
Orangutan
Baboon
Figure 3: Primate phylogenies based on seven fully sequenced mitochondrial genomes. Distances were either computed from sequences aligned with (A) clustalw [19], or (B) using the k-mer distance dkmer [9], (C) feature frequency
profile, dffp [10], (D) composition vector, dcv [11], (E) co-phylog, dco [8], (F) grammar-based distance, dgram [12], (G)
average common substring, dacs [13] and (H) substitutions from repeats, dkr [14]. The numbers in (A) are bootstrap
values. C. Chimp: common chimp; P. Chimp: pigmy chimp.
This has been applied, for example, to the computation of the tree of life from small subunit rRNA
sequences [11] and the phylogeny of 82 fungi from
their complete genomes [28]. The computation of
dcv is implemented in the Web-based CVtree software [29]. Its application to our primate data yields
the correct tree (Figure 3D).
The word count methods detailed so far are designed to recover the topology of a phylogeny rather
than its branch lengths. Branch lengths are traditionally expressed as substitutions per site, which is difficult to estimate without alignment. However, the
recently published method Co-phylog achieves just
that [8]. It starts by counting words of a certain
length, say seven, that do not have a match in the
genome that differs at the middle position. This is
repeated for a second genome. Among the intersection between the two sets of words, the proportion
Alignment-free phylogenetics and population genetics
of words that differ at the middle position is the
distance measure, dco . Consider for example the
word 101 and say that you have recovered
fATA, ATG, CTC, CCCg from some sequence. Then
CTC and CCC cancel each other, leaving
fATA, ATGg. If fATA, AGGg were obtained from a
different genome, the intersection of the matching
parts is fA A, A Gg, of which one out of two differ
at the middle position, i.e. dco ¼ 1=2. In real data,
this approach gives a good approximation of the
number of mismatches per site. As a result, dco
yields in 0.27 s the tree with the most realistic
branch lengths so far (Figure 3E). On the down
side, it groups orangutan with baboon. Co-phylog
is the only method in this review that is also designed
to compute distances between unassembled reads.
Match lengths
Closely related sequences contain longer exact
matches than divergent sequences. This insight is
used in a class of related distance measures. Like
k-mer distances, they can be computed in linear
time. However, linear-time algorithms for finding
matches of arbitrary length are based on an index of
the input sequences known as a suffix tree, which is
more demanding to compute than word counts.
Figure 4 shows the suffix tree of a short example
string. Its defining feature is that each suffix S½i::jSj
is represented as the concatenated path label from the
root to leaf i. Notice the $ at the end of the example
sequence S ¼ TATACCAATCCT$. This ‘sentinel
character’ differs from all other characters in the
string. Its addition ensures that even a suffix that is
the prefix of another suffix is represented by a leaf in
3 4
T A
5 6
C C
13
$
T
$
T
CC
A
$
$
5
10
11
12
$
ATCCT
TCACCACAAT
CCT$
Figure 4: Suffix tree.
T$
8
6
T$
AATCC
2
11 12
C T
T$
C
TC
CT
4
T$
ACCACAC
TCCT$
7
10
C
AA
T
CCAATCCT$
ATC
CT$
13
7 8 9
A A T
C
2
A
$
1
T
A
T
3
1
9
411
the suffix tree. For example, the last character in S
apart from the sentinel, S½12 ¼ T, is a prefix of the
suffix starting at position 9, S½9::12 ¼ TCCT. All repeated prefixes of individual suffixes are collapsed
into single paths, which makes a suffix tree ideal for
searching. To look up CC, for example, walk from
the root until CC has been traversed. Then look up
the leaves below the last match to find that CC starts
at positions 5 and 10. This search took time proportional to the length of CC, independent of the length
of S. Readers unfamiliar with suffix trees might suspect that searching independent of input size is too
good to be true. But think of a book index, where
searching takes time proportional to the length of the
word being looked up, regardless of the length of the
book. A suffix tree is a generalized index, in which
any substring in the text can be looked up. Its construction in linear time was first proposed in 1973
[30] and is one of the seminal achievements of computer science [31], even though it does not feature in
many of the standard texts on algorithms [32].
There are numerous applications of suffix trees in
bioinformatics, most notably perhaps in the genome
aligner MUMmer [33]. Among the distances surveyed in this review (Figure 2), dgram is based on
suffix trees. However, the memory requirement of
suffix trees exceeds that of the underlying string by at
least one order of magnitude. Fortunately, suffix
trees can be approximated as an array of alphabetically ordered suffixes. Table 2 shows the suffix array of
our example string; it consists of the column ‘SA’,
which contains the starting positions of the sorted
suffixes. For example, the suffix starting at position
10 in the input string, CCT$, comes eighth after
alphabetical sorting.
To compute this suffix array, one could traverse
the suffix tree in Figure 4 and note the leaf labels.
This works because in our example suffix tree, the
outgoing edges are alphabetically ordered. The
implication is that suffix arrays can be computed in
O(n). However, an intermediate suffix tree would
cancel the space saving that makes suffix arrays
attractive in the first place.
When first proposed in 1993, direct construction
of suffix arrays was expected to run in O(n) time [34].
This expectation was met for random sequences,
with the worst-case run time being Oðn logðnÞÞ. In
practice, suffix array construction took 3–10 times
longer than suffix tree construction [34]. Since
then there has been great interest in devising algorithms for suffix array construction that run as fast as
412
Haubold
Table 2: Suffix array (SA) enhanced by a ‘longest
common prefix’ array (LCP)
Rank
SA
Suffix
LCP
LengthðLCPÞ
1
2
3
4
5
6
7
8
9
10
11
12
13
13
7
4
2
8
6
5
10
11
12
3
1
9
$
AATCCT$
ACCAATCCT$
ATACCAATCCT$
ATCCT$
CAATCCT$
CCAATCCT$
CCT$
CT$
T$
TACCAATCCT$
TATACCAATCCT$
TCCT$
A
A
AT
C
CC
C
T
TA
T
0
0
1
1
2
0
1
2
1
0
1
2
1
The longest common prefix of suffixi is computed by comparing it with
the prefix of suffix i ^ 1; if there is no common prefix, LCP is empty.
See text for further details.
suffix tree construction [35]. This recently culminated in the publication of a linear-time suffix
array construction algorithm together with its
highly efficient implementation [36].
To simulate suffix tree operations with a suffix
array, the suffix array is usually enhanced by an
array of longest common prefix (LCP) lengths.
Table 2 shows the LCP array corresponding to our
example sequence. The suffix with Rank½8, CCT$,
shares the prefix CC with the suffix at the preceding
Rank½7, CCAATCCT$. Hence the LCP of this pair
of suffices is CC and its length is 2. LCP arrays can be
constructed from the corresponding suffix array in
O(n) time using a fast algorithm [37]. I already
noted that traversal of our example suffix tree
yields the suffix array in Table 2. However, the
reason for constructing this Table in the first place
is that its traversal simulates the traversal of the corresponding suffix tree [37].
To get an intuition for how the suffix tree can be
recovered from its enhanced suffix array, draw
the root and a leaf labeled with the first entry in
the suffix array, SA½1 ¼ 13 (Figure 5A). Label the
edge connecting the root and the leaf with the first
suffix, $. After this initialization step, repeat the following for each remaining Rank½i: draw an edge off
the edge just drawn such that LengthðLCP½iÞ is the
length of the edge label from the root to the new
branch point. Attach a leaf to the new edge and label
it SA½i. Label the path from the root to the new leaf
with Suffix½i (Figure 5B and C). After 12 repetitions,
the desired suffix tree is recovered (Figure 5D). A
formalized version of this can be found, for example,
in [38]. The programs used to compute dacs and dkr
are based on enhanced suffix arrays (Figure 2).
Computations based on suffix trees have come a
long way in recent years. This means that all distances based on substring lengths can in theory be
computed in O(n) time. The enhanced suffix array is
only one of several space-efficient data structures for
carrying out suffix tree operations. The FM index is
currently the most space-efficient among these
[39,40]. It underlies the programs bowtie [41]
and bwa [42], which can efficiently map short
sequencing reads to mammalian genomes on hardware with only 4 GB of RAM.
Grammar-based distance, dgram
A grammar is a set of rules for decomposing a string
into its constituents. For example, the rule might be
to look for repeated substrings to compress the input
string, as in this naı̈ve example:
AAACCCC ! A3 C4
There is a close relationship between compression
and distance estimation: similar sequences are more
compressible than divergent sequences. The compression scheme most often applied in distance estimation is based on the Lempel–Ziv factorization,
which also underlies such programs as gzip for
compressing files: look for the longest match S½i::j
that starts somewhere in S½1::i 1. If such a match
exists, the desired factor is S½i::j þ 1 and the search
continues at S½j þ 2. If there is no such match, the
factor is S½i and the search continues at S½i þ 1. The
concatenation of all factors regenerates the input
string. For example,
TATACCAATCCT ! T, A, TAC, CA, ATC, CT
In this case, S is decomposed into six substrings,
which we express as cðSÞ ¼ 6. Consider now a
second sequence, Q, which is identical to S.
Concatenate S and Q and compute their decomposition. It contains only one extra element, all of Q;
that is, cðSQÞ ¼ 7. However, if Q differs from
S, cðSQÞ grows. This leads to a number of related
decomposition- or grammar-based distance measures
[43], the most recent of which is [12]
( cðSQÞcðSÞ
dgram ðS, QÞ ¼
jSj
jQj
cðSÞ
cðQSÞcðQÞ
jQj
cðQÞ
jSj
if jSj > jQj
otherwise:
Alignment-free phylogenetics and population genetics
B
−→
−→
AA
TC
A
413
$
CT
$
$
13
7
13
5
11
12
10
$
ATCCT
TCACCACAAT
CCT$
8
6
T$
AATCC
T$
A
$
2
T$
CC
A
$
T$
C
CC
T$
4
AA
T
7
T
C
4
13
T$
ACCACAC
TCCT$
7
T
CC
TC
AAA
TCC CT$
T$
CC
AA
ATCTCCT$
CT$
13
D
$
−→ −→ −→
A
C
3
9
1
Figure 5: Na|º ve suffix tree construction based on the enhanced suffix array in Table 2.
The dgram tree in Figure 3F has the correct topology. Its underlying distances were computed in
0.15 s.
Average common substring, dacs
Instead of decomposing SQ, we can look up the
longest match in S starting at every position in Q.
In contrast to the Lempel–Ziv factors, longest
matches are overlapping. Call LðQ, SÞ the average
of these match lengths and define [13]
0
dacs
ðQ, SÞ ¼ logðjSjÞ=LðQ, SÞ 2 logðjQjÞ=jQj:
Because the labeling of sequences matters, that is
0
0
ðQ, SÞ 6¼ dacs
ðS, QÞ, the final distance measure is
dacs
defined as the arithmetic average
dacs ðQ, SÞ ¼
0
0
dacs
ðQ, SÞ þ dacs
ðS, QÞ
:
2
Figure 3G shows the dacs tree computed in 0.36 s.
The positions of gibbon and orangutan are switched
compared with the reference tree Figure 3A.
From match lengths to mutation distances, dkr
A run of matching nucleotides between a query and
a subject sequence is interrupted by a mutation. This
could be a point mutation, an indel or a rearrangement. In other words, we can interpret the common
substring length that is used to compute dacs as one
shorter than the distances to the next mutation.
As the longest common substrings extended by one
turn into the SHortest Unique subSTRING, they
have been called shustrings [44]. Their lengths are
converted to the number of mutations, m, by realizing that the probability that a shustring of length X is
longer than some threshold t is
PrfX > tg ¼ ð1 m=lÞt etm=l ,
ð1Þ
where l is the length of the subject sequence. m can
be estimated from equation (1) if query and subject
are so similar that all matches are due to homology.
However, the probability of random matches grows
with increasing divergence. Assuming that all nucleotides have equal frequencies, random matches
are corrected for using [14]
l
PrfX tg ¼ ð1 etm=l Þð1 4t Þ :
ð2Þ
Finally, m is corrected for multiple substitutions to
yield an alignment-free version of the classical Jukes–
Cantor distance [14,45]:
3
4
dkr ¼ ln 1 m :
4
3l
ð3Þ
Figure 3H shows the primate tree based on dkr
values computed in 0.1 s. It has the same topology
as the dacs tree. This is not surprising, as both measures are based on the same type of match lengths
(Figure 2). However, dkr estimates the number of
Haubold
substitutions that also underlie the clustalw reference tree. Accordingly, the branch lengths of the
reference tree are more similar to the branch lengths
of the dkr tree than to those of the dacs tree.
The seven distances we have surveyed give a paradoxical result on our primate test data: the two distances that best approximate mutation distances, dco
and dkr , give false topologies. On the other hand,
some of the distances that do not approximate substitution rates, including dffp , dcv and dgram , give the
correct topology. This probably means that dco and
dkr are more sensitive to imperfections in the data,
such as differences in the sequence lengths, than
other measures. I therefore investigate the relationship between substitutions and the various divergence measures in the next section.
APPLICATIONS
Except for dcv , for which I could not find a standalone implementation, I now apply the distance
measures classified in Figure 2. I begin with simulated data and then discuss applications to large
empirical data sets.
Simulated data
The differences in absolute and relative branch
lengths in the primate trees (Figure 3) suggest two
things: (i) our distance measures differ in their relationship to the number of mutations, and (ii) this
relationship changes as a function of divergence.
To clarify these two points I carried out simulations
of evolution by mutation alone: pairs of equally long
sequences are subjected to between 105 and 0.8
substitutions per site and the divergence is estimated
(Figure 6).
The worst measure appears to be dkmer : it is
< 106 if there are less than 0.01 substitutions per
site and saturates for > 0:2 substitutions per site.
dacs , dffp and dgram are all off by approximately an
order of magnitude and saturate for high substitution
rates. In contrast, dco and dkr give the best fit between
simulated and estimated distances. As noted previously [45], dkr is limited to substitution rates < 0:5.
When using the substitution rate of 0:8 dco gave no
distances for the 100-kb sequences simulated, but it
did work with longer sequences, e.g. 1 MB. These
simulations demonstrate that dco and dkr recover the
correct substitution rate from data simulated under
the model that also underlies the derivation of these
distances.
10
1
0.1
Divergence
414
d
d
d
d
d
d
ideal
0.01
0.001
0.0001
1e − 05
1e − 06
0.0001
0.001
0.01
0.1
Substitutions/Site
1
Figure 6: Log/log plot of divergence measures as a
function of the number of substitutions per site, K.
Plotted are means from 1000 iterations. Pairs of 100 -kb
sequences were simulated using the program simK,
which is available from the accompanying Web site
guanine.evolbio.mpg.de/aliFreeReview.
Before applying distance measures to large data
sets, it is important to look at run times. The implementations I use may or may not be the best possible.
The run times summarized in Figure 7 should therefore be read as a rough guide to current implementations rather than a statement on how quickly the
various distances could in principle be computed.
Apart from dcv , I also left out dacs , because its algorithm is essentially the same as that used to compute
dkr . dkmer can be computed fastest by a large margin.
However, this is followed by the two alignment-free
estimates of substitution rates, dco and dkr . dffp and
dgram are both slower. In the following I therefore
concentrate on dkr and dco as the fastest and most
accurate distance measures.
Empirical data
The distances underlying the primate trees in
Figure 3 were computed from 115 kb of mitochondrial sequence. All alignment-free computations
took less than 1 s. I also applied dco and dkr to the
set of 30 E. coli investigated by Yi and Jin [8]. These
comprise 147 MB, that is, a thousand times more
than the primate data set. The dco computation
took 278 s and the dkr computation 403 s on a
Xeon CPU running at 2.40 GHz. Not only is the
Alignment-free phylogenetics and population genetics
implementation of dco faster, it also gives the more
accurate—although still not perfect—tree [8]. dkr has
also originally been applied to the sequences of 12
complete Drosophila genomes comprising 2.03 GB.
Its implementation took 3.25 h using 72 GB RAM
to compute a tree that agreed with the reference tree
[45]. However, the dco computation used only
47 min and 2.4 GB RAM to yield the tree shown
in Figure 8B. It has the same topology as the reference tree (Figure 8A) and its relative branch lengths
100000
10000
d
d
d
d
d
ideal
Time (s)
1000
100
10
415
are also similar. The most notable exception is the
invisibly short branch that separates the melanogaster
clade including Drosophila willistoni from the virilis
clade in Figure 8B. Moreover, the absolute branches
are approximately five times longer on the reference
tree than on the dco tree.
We conclude that when it comes to choosing an
alignment-free distance measure, dco is a strong candidate, especially when analyzing large genomes
where the time and/or memory consumption of
other methods is often prohibitive. The fact that
dco can also be used to compute distances between
unassembled reads is an additional advantage.
However, recall that in Figure 3E, dco did not
return the correct primate phylogeny, so the jury is
still out on which method is best.
I now turn to population genetics, where not only
the number of mutations but also the distances
between them is an important diagnostic of the
evolutionary process.
1
POPULATION GENETICS
0.1
0.01
1000
10000 100000 1e + 06 1e + 07 1e + 08
Length (kb)
Figure 7: Log/log plot of run times for various implementations of divergence measures as a function of sequence length. Pairs of sequences were simulated using
the program simK with a divergence of 0.01.
A
0.1
D. grimshawi
D. virilis
Population genetics is concerned with discovering
the forces of evolution acting at the level of interbreeding groups of organisms. These forces are foremost mutation, recombination and selection. Their
action is influenced by a wide variety of factors,
including population structure and size, both of
which may vary over time. For example, a sudden
change in environmental conditions might precipitate a rapid reduction in population size. The
B
0.02
D. grimshawi
D. virilis
D. mojavensis
D. willistoni
D. persimilis
D. pseudoobscura
D. ananassae
D. mojavensis
D. willistoni
D. persimilis
D. pseudoobscura
D. ananassae
D. erecta
D. erecta
D. yakuba
D. yakuba
D. sechellia
D. sechellia
D. simulans
D. simulans
D. melanogaster
D. melanogaster
Figure 8: Phylogeny of 12 Drosophila species computed from their complete genomes. (A) Reference tree distributed with the genome project [46]. (B) Tree computed from dco and midpoint-rooted using PHYLIP [15] and drawn
using njplot [16].
416
Haubold
opposite of such a population bottle neck occurs
when a population expands steadily into a new
habitat.
The current 1000 genome projects lead to a huge
amount of population genetic data. Most of these
data are mapped to reference sequences and hence
aligned. However, once de novo assembly of these
data becomes feasible, alignment-free investigations
of the resulting large assemblies might be a viable
filtering step, before using alignment to investigate
interesting portions of the data.
In the following I outline potential applications of
alignment-free methods in population genetics. My
aim is to pique the readers’ interest, as, to date, little
alignment-free population genetics has been done.
I discuss the three fundamental population genetic
quantities: mutation, recombination and selection.
Mutation
Our aim is to compute the number of mismatches
per site between a pair of homologous sequences, p,
without alignment. dco could be a good tool for this
even before assembly. Alternatively, we note that p is
the probability of mutation per nucleotide, hence
1=p is the expected distance to the next mutation.
To a first approximation, the expected distance to
the next mutation is the expected shustring length
used to calculate dkr :
E½p ¼ E½Xi ,
where Xi is the length of the shustring starting at
Q½i. The approximation holds if we assume that
shustring lengths always correspond to homologous
matches, which is a good heuristic for closely related
sequences such as those sampled from within populations. Hence we can estimate p by replacing the
expected shustring length by its average:
pm ¼
1
:
avgðXi Þ
In contrast to dkr , pm does not correct for nonhomologous matches and can thus be calculated very
quickly. This makes it suitable for rapid sliding
window analyses. On the other hand, it can only
be applied to closely related sequences sampled
from within populations rather than to the more
divergent sequences sampled between species
[47,48]. In contrast to phylogeny reconstruction,
which usually deals with individuals sampled from
separate species, population genetics is concerned
with individuals sampled from interbreeding
groups. This begs the question how recombination
affects mutations.
Recombination
Recombination is the exchange of genetic material
between chromosomes. In diploid sexual organisms,
it can be observed under the light microscope as
crossing over during gamete formation by meiosis.
This process is modeled as the exchange of maternal
and paternal genetic material at a random point along
a pair of homologous chromosomes (Figure 9).
In a population with recombination, two things
can happen as we trace lines of descent backwards
in time: two lines may coalesce, thereby reducing
the number of lineages by one, or there is a recombination event, which may increase the number of
lineages by one. The lines of descent for a sample of
homologous sequences are called ‘coalescent’ and for
more details on coalescent theory, the reader may
consult [3]. Figure 10 shows the topology of the coalescent for four sequences with recombination [49].
Each pairwise comparison involving sequence #3
diagnoses two segments with different mutation
rates due to the two coalescence times along sequence
#3. Figure 11 shows a simulation of this varying time
to the most recent common ancestor and the resulting
clustering of mutations. Such clustering corresponds
to a change in the variance of the match length. This
could be used to test for recombination [50].
Given that pm is sensitive to recombination, it might
actually be used to estimate the rate of recombination.
However, it turns out that two values of pm correspond to a given rate of recombination if we allow
recombination to exceed mutation [47], as is often
observed in nature [51, p. 89]. More promising is to
test the null hypothesis of no recombination by comparing the observed variance of shustring lengths with
that expected in the absence of recombination [50].
Selection
A favorable variant of a gene can rise in frequency in
the population, and such a beneficial allele may ultimately get fixed in the population. In the process,
X
Figure 9: Reciprocal recombination between two
homologous chromosomes at the crossing over point X.
Alignment-free phylogenetics and population genetics
C
alignment-free work is just beginning. However, as
recombination and selection affect the distribution of
mutations along sequences, match-length methods
should be suited for work in this field. This will be
facilitated by the fact that suffix trees and their abstraction suffix arrays are rapidly becoming as ubiquitous as dynamical programming matrices were in
classical bioinformatics.
Most recent common ancestor
C
time
C
C
R
Homologous sequences
1
2
3
4
Figure 10: Ancestral recombination graph for four
homologous sequences. R: recombination; C:
coalescence.
1.8
1.6
1.4
1.2
TMRCA
417
1
Key Points
Phylogenetics is sped up by alignment-free distance computation.
Alignment-free distances are computed from word counts or
match lengths.
Phylogenies are based on the number of mutations; in population
genetics, the spatial distribution of mutations also matters.
Word counts are best for computing the number of mutations.
Match lengths are best for computing the distribution of
mutations.
0.8
0.6
0.4
Acknowledgements
I thank Heng Li, Peter Pfaffelhuber, Angelika Börsch-Haubold
and Linda Krause for helpful comments.
0.2
0
0
2
4
6
8
10
Position
Figure 11: Time to the most recent common ancestor
(TMRCA) as a function of genome position. TMRCA is
measured in 2N generations; the vertical lines on the
x-axis mark mutations.
neutral variants linked to the favorable allele are
eliminated from the population. In other words,
the neutral variants go to fixation by hitching a
ride with the favorable allele. This hitchhiking
model of selection was first proposed by Maynard
Smith 40 years ago [52]. Today it is a central tool
in population genetics [53]. The rise of a favorable
allele is also called a selective sweep, as linked neutral
variants are swept out of the population in the
process.
A selective sweep leads to an increase in the length
of the homozygous tract surrounding the favorable
allele. This would be observable as an increase in the
local match length. However, we have already seen
that recombination can also lead to an increase in
match length. To distinguish these two potential
causes of match length variation, should be a rewarding topic for future research in alignment-free population genetics.
In phylogeny reconstruction, the dco distance is
currently the state of the art. In population genetics,
References
1.
Zuckerkandl E, Pauling L. Molecules as documents of evolutionary history. J Theor Biol 1965;8:357–66.
2. Felsenstein J. Inferring Phylogenies. Sunderland: Sinauer,
2004.
3. Wakeley J. Coalescent Theory: An Introduction. Colorado:
Roberts & Company, 2009.
4. Vinga S, Almeida J. Alignment-free sequence comparison—
a review. Bioinformatics 2003;19:513–23.
5. Höhl M, Ragan M. Is multiple-sequence alignment
required for accurate inference of phylogeny? Syst Biol
2007;56(2):206–21.
6. Guyon F, Brochier-armanet C, Guénoche A. Comparison
of alignment free string distances for complete genome
phylogeny. Adv Data Anal Classif 2009;3:95–108.
7. Cheng J, Cao F, Liu Z. AGP: a multimethods web server
for alignment-free genome phylogeny. Mol Biol Evol 2013;
30:1032–37.
8. Yi H, Jin L. Co-phylog: an assembly-free phylogenomic
approach for closely related organisms. Nucleic Acids Res
2013;41:e75.
9. Yang K, Zhang L. Performance comparison between ktuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Res 2008;36(5):e33.
10. Sims GE, Kim SH. Whole-genome phylogeny of Escherichia
coli/Shigella group by feature frequency profiles (FFPs).
Proc Natl Acad Sci USA 2011;108:8329–34.
11. Qi J, Wang B, Hao BI. Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition
approach. J Mol Evol 2004;58:1–11.
418
Haubold
12. Russell DJ, Way SF, Benson AK, Sayood K. A grammarbased distance metric enables fast and accurate clustering
of large sets of 16S sequences. BMC Bioinformatics 2010;11:
601.
13. Ulitsky I, Burstein D, Tuller T, Chor B. The average
common substring approach to phylogenomic reconstruction. J Comput Biol 2006;13:336–50.
14. Haubold B, Pfaffelhuber P, Domazet-Lošo M, Wiehe T.
Estimating mutation distances from unaligned genomes.
J Comput Biol 2009;16:1487–500.
15. Felsenstein J. PHYLIP (phylogeny interference package)
version 3.6. 2005.
16. Perrière G, Gouy M. WWW-Query: an on-line retrieval
system for biological sequence banks. Biochimie 1996;78:
364–9.
17. Saitou N, Nei M. The neighbor-joining method: a new
method for reconstructing phylogenetic trees. Mol Biol Evol
1986;4:406–25.
18. Fitch WM, Margoliash E. Construction of phylogenetic
trees. Science 1967;155:279–84.
19. Edgar RC. MUSCLE: multiple sequence alignment with
high accuracy and high throughput. Nucleic Acids Res 2004;
32:1792–7.
20. Katoh K, Toh H. Recent developments in the MAFFT
multiple sequence alignment program. Brief Bioinformatics
2008;9:286–98.
21. Larkin M, Blackshields G, Brown N, et al. Clustal W and
Clustal X version 2.0. Bioinformatics 2007;23:2947–8.
22. Tettelin H, Masignani V, Cieslewicz M, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae:
implications for the microbial ‘‘pan-genome’’. Proc Natl Acad
Sci USA 2005;102(39):13950–5.
23. Nadeau JH, Sankoff D. Counting on comparative maps.
Trends Genet 1998;14:495–501.
24. Blaisdell BE. A measure of the similarity of sets of sequences
not requiring sequence alignment. Proc Natl Acad Sci USA
1986;83:5155–9.
25. Reinert G, Chew D, Sun F, Waterman M. Alignment-free
sequence comparison (I): statistics and power. J Comput Biol
2009;16:1615–34.
26. Sims EG, Jun SR, Wu AG, Kim SH. Alignment-free
genome comparison with feature frequency profiles (FFP)
and optimal resolution. Proc Natl Acad Sci USA 2009;106:
2677–82.
27. Jun SR, Sims GE, Wu GA, Kim SH. Whole-proteome
phylogeny of prokaryotes by feature frequency profiles: an
alignment-free method with optimal feature resolution. Proc
Natl Acad Sci USA 2010;107:133–8.
28. Wang H, Xu Z, Gao L, Hao B. A fungal phylogeny based
on 82 complete genomes using the composition vector
method. BMC Evol Biol 2009;9:195.
29. Qi J, Luo H, Hao B. CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res
2004;32:W45–7.
30. Weiner P.
Linear pattern matching algorithms. In:
Proceedings of the 14th Annual Symposium on Switching and
AutomataTheory (swat 1973). Washington, DC, USA: IEEE
Computer Society SWAT’73, 1973;1–11.
31. Gusfield D. Algorithms on Strings,Trees, and Sequences: Computer
Science and Computational Biology. Cambridge: Cambridge
University Press, 1973.
32. Cormen TH, Leiserson CE, Rivest RL, Stein CS. Introduction toAlgorithms. Cambridge, MA: The MIT Press, 2009.
33. Kurtz S, Phillippy A, Delcher A, et al. Versatile and open
software for comparing large genomes. Genome Biol 2004;
5(2):R12.
34. Manber U, Myers EW. Suffix arrays: a new method for online string searches. SIAM J Comput 1993;22:935–48.
35. Puglisi SJ, Smyth WF, Turpin AH. A taxonomy of suffix
array construction algorithms. ACMComputSurv, 2007;39:4.
36. Nong G. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACMTrans Inf Syst 2013;31:1–15.
37. Kasai T, Lee G, Arimura H, et al. Linear-time longestcommon-prefix computation in suffix arrays and its applications. LNCS 2001;2089:181–92.
38. Abouelhoda M, Kurtz S, Ohlebusch E. The enhanced
suffix array and its applications to genome analysis. In:
Proceedings of the Second Workshop on Algorithms in
Bioinformatics. Lecture Notes in Computer Science 2452,
2002, pp. 449–63. Springer-Verlag.
39. Ferragina P, Manzini G. Opportunistic data structures with
applications. In: Proceedings of the 41st Symposium on Foundation
of Computer Science (FOCS 2000). IEEE Computer Society,
2000;390–98.
40. Ferragina P, González R, Navarro G. Compressed text
indexes: from theory to practice. ACM J Exp Algorithmics
2008. 13, 1.12:1–31.
41. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and
memory-efficient alignment of short DNA sequences to the
human genome. Genome Biol 2009;10:R25.
42. Li H., Durbin R. Fast and accurate short read alignment with
Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60.
43. Otu HH, Sayood K. A new sequence distance measure for
phylogenetic tree construction. Bioinformatics 2003;19:2122–30.
44. Haubold B, Pierstorff N, Möller F, Wiehe T. Genome
comparison without alignment using shortest unique substrings. BMC Bioinformatics 2005;6:123.
45. Domazet-Lošo M, Haubold B. Efficient estimation of pairwise distances between genomes. Bioinformatics 2009;25:
3221–7.
46. Drosophila 12 Genomes Consortium. Evolution of genes
and genomes on the Drosophila phylogeny. Nature 2007;450:
203–18.
47. Haubold B, Reed FA, Pfaffelhuber P. Alignment-free estimation of nucleotide diversity. Bioinformatics 2011;27:449–55.
48. Haubold B, Pfaffelhuber P. Alignment-free population genomics: an efficient estimator of sequence diversity. Genes
Genomes Genetics 2012;2:883–9.
49. Hudson RR. Gene genealogies and the coalescent process.
Oxford Surveys Evol Biol 1990;7:1–44.
50. Haubold B, Krause L, Horn T, Pfaffelhuber P. An alignment-free test for recombination. Bioinformatics. doi:10.
1093/bioinformatics/btt550 (Advance Access publication
12 October 2013).
51. Lynch M. The Origins of Genome Architecture. Sunderland:
Sinauer, 2007.
52. Maynard Smith J, Haigh J. The hitch-hiking effect of a
favourable gene. Genet Res Cambridge 1974;23:23–35.
53. Stephan W. Genetic hitchhiking versus background selection: the controversy and its implications. PhilosTrans R Soc
Lond B 2010;365:1245–53.