Supplementary Online Material
PASTA: ultra-large multiple sequence
alignment
Siavash Mirarab, Nam Nguyen, and Tandy Warnow
University of Texas at Austin - Department of Computer
Science
{smirarab,bayzid,tandy}@cs.utexas.edu
Abstract
We introduce PASTA, a new method for multiple sequence alignment of datasets with up to 200,000 sequences in [3]. Here we provide
supplementary information not provided in the main paper. We give
exact commands used for running the experiments, we provide extra results that did not fit in the main paper, and we provide some
supplementary discussion of the results.
Contents
A Appendix - Method version numbers and commands
B Appendix - Proofs
C Appendix - Further Results
C.1 Performance on 1000-taxon datasets
C.2 Performance on biological datasets
C.3 Impact of Starting Tree
C.4 Choice of Opal vs. Muscle for merging alignments
C.5 Impact of alignment subset size
C.6 Muscle running time
D Appendix - Additional Discussion
D.1 Alignment Accuracy Measures
D.2 Comparison between SATé and PASTA
A Appendix - Method version numbers and commands
• ClustalW version 2.1:
clustalw2 -quicktree -align -infile=[input sequences] -outfile=[output alignment] -output=fasta
• Muscle version 3.8.31:
muscle -in [input sequences] -out [output alignment]
• HMMBUILD version 3.0:
hmmbuild --symfrac 0.0 --dna [output profile] [backbone alignment]
• HMMALIGN version 3.0:
hmmalign --dna [output profile] [query file] > [output alignment]
• Mafft-profile version 6.956b:
mafft --add [query file] [backbone alignment] > [output alignment]
• FastTree version 2.1.5 SSE3:
fasttree -nt -gtr [input fasta] > [output tree]
• SATé version 2.2.7:
python run_sate.py config.sate2.txt
(The config.sate2.txt file is defined as follows:)
[commandline]
two_phase = False
datatype = <dna or rna>
untrusted = False
multilocus = False
input = <input_fasta>
treefile = <starting_tree>
aligned = False
raxml_search_after = False
auto = False
[fasttree]
model = -gtr
args =
options = -nosupport -fastest
[sate]
time_limit = -1.0
iter_without_imp_limit = -1
time_without_imp_limit = -1.0
break_strategy = centroid
start_tree_search_from_current = True
blind_after_iter_without_imp = -1
max_mem_mb = 4024
blind_mode_is_final = True
blind_after_time_without_imp = -1.0
max_subproblem_size = 200
merger = muscle
num_cpus = 12
after_blind_time_without_imp_limit = -1.0
max_subproblem_frac = 0.0
blind_after_total_time = -1.0
after_blind_time_term_limit = -1.0
aligner = mafft
iter_limit = 3
blind_after_total_iter = -1
tree_estimator = fasttree
after_blind_iter_term_limit = -1
return_final_tree_and_alignment = False
move_to_blind_on_worse_score = True
after_blind_iter_without_imp_limit = -1
• PASTA version 1.1.0:
python run_pasta.py config.txt
(The config.txt file is defined as follows:)
[commandline]
two_phase = False
datatype = <dna or rna>
untrusted = False
multilocus = False
input = <input_fasta>
treefile = <starting_tree>
aligned = False
raxml_search_after = False
[sate]
time_limit = -1.0
iter_without_imp_limit = -1
time_without_imp_limit = -1.0
break_strategy = centroid
start_tree_search_from_current = True
blind_after_iter_without_imp = -1
max_mem_mb = 1024
blind_mode_is_final = True
blind_after_time_without_imp = -1.0
max_subproblem_size = 200
merger = opal
num_cpus = 12
after_blind_time_without_imp_limit = -1.0
max_subproblem_frac = 0.0
blind_after_total_time = -1.0
after_blind_time_term_limit = -1.0
aligner = mafft
iter_limit = 3
blind_after_total_iter = -1
tree_estimator = fasttree
after_blind_iter_term_limit = -1
return_final_tree_and_alignment = True
move_to_blind_on_worse_score = True
after_blind_iter_without_imp_limit = -1
mask_gappy_sites = <0.1% of dataset size>
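The two configuration files above share most of their settings. The following is a minimal sketch, assuming only Python's standard-library configparser and a hand-copied subset of the [sate] keys listed above, that makes the differences explicit. The helper names load_section and diff_settings are illustrative and are not part of SATé or PASTA.

```python
import configparser

# Hand-copied subset of the [sate] sections shown above (not the full files).
SATE_CFG = """
[sate]
max_mem_mb = 4024
max_subproblem_size = 200
merger = muscle
aligner = mafft
iter_limit = 3
tree_estimator = fasttree
return_final_tree_and_alignment = False
"""

PASTA_CFG = """
[sate]
max_mem_mb = 1024
max_subproblem_size = 200
merger = opal
aligner = mafft
iter_limit = 3
tree_estimator = fasttree
return_final_tree_and_alignment = True
"""


def load_section(text, section="sate"):
    """Parse INI text and return one section as a plain dict of strings."""
    # Interpolation is disabled so that literal '%' characters in values are safe.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser[section])


def diff_settings(first, second):
    """Return {key: (value in first, value in second)} for keys that differ."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k)) for k in keys if first.get(k) != second.get(k)}


differences = diff_settings(load_section(SATE_CFG), load_section(PASTA_CFG))
for key, (sate_value, pasta_value) in differences.items():
    print(f"{key}: SATe={sate_value} PASTA={pasta_value}")
```

On this subset the differing keys are the merger (muscle versus opal), the memory limit, and whether the final tree and alignment are returned. The full PASTA file additionally sets mask_gappy_sites, which has no counterpart in the SATé file.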
B Appendix - Proofs
Lemma 1: Let X, Y, and Z be disjoint sequence datasets, and let A and A' be alignments on X ∪ Z and Y ∪ Z, respectively, that induce identical alignments on Z. Let K be the length of the longest sequence in X, Y, and Z, and L be the total number of sites in A and A'. Then we can merge alignments A and A' using transitivity in O(L + (|X| + |Y| + |Z|)K) time.
Proof: To represent an alignment, we use a data structure with two elements: 1) the unaligned sequence and 2) a list of integers giving the position of each letter in the aligned sequences.
Assume A has k columns, and A' has k' columns. We start by finding the sequences that belong to Z. We then find the columns of A that are non-gap in at least one shared sequence, and do the same for A' (we call these shared columns). Calculating shared columns can be done in O(K|Z|) time, because our data structure for representing alignments has the list of column positions for each character of each sequence of Z. Let k_s denote the number of shared columns. After computing shared columns we know that the final alignment will have k + k' − k_s columns. We simultaneously iterate through the k columns in A and the k' columns in A', and map these numbers to position numbers in the output alignment. We start at the leftmost position of both alignments, and keep a position in A (denoted by p), a position in A' (denoted by p'), and a position in the output alignment (denoted by q). If both p and p' correspond to a shared column, we map both to q and increment all three. Otherwise, w.l.o.g. assume p is not a shared column; we map p to q and increment only p and q. At the end of this process, we have a mapping from columns of both input alignments to the columns of the output alignment, and this procedure takes O(k + k' − k_s) = O(L) time. Finally, we build the output alignment by adding sequences from the original alignments, and by mapping their column indices using the mapping computed above. This step takes O(K(|X| + |Y| + |Z|)) time, because each sequence has at most K letters whose positions must be remapped. Thus, the final running time is O(L + K(|X| + |Y| + |Z|)). Note that for any single gene alignment, L << K(|X| + |Y| + |Z|), and therefore L can be omitted from the analysis.
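The procedure in the proof of Lemma 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the PASTA implementation: alignments are stored as a dict mapping each sequence name to a pair (unaligned letters, column positions), the function names are hypothetical, and the code trusts the lemma's precondition that both inputs induce the same alignment on the shared sequences (it only checks that the shared columns pair up).

```python
def from_rows(rows):
    """Convert {name: gapped string} into ({name: (letters, positions)}, n_columns)."""
    aln = {}
    for name, row in rows.items():
        positions = [i for i, ch in enumerate(row) if ch != "-"]
        aln[name] = (row.replace("-", ""), positions)
    return aln, len(next(iter(rows.values())))


def shared_columns(aln, shared_names):
    """Columns that are non-gap in at least one shared sequence."""
    cols = set()
    for name in shared_names:
        cols.update(aln[name][1])
    return cols


def transitivity_merge(a, k, b, k2):
    """Merge alignment a (k columns) with alignment b (k2 columns)."""
    shared_names = set(a) & set(b)
    shared_a = shared_columns(a, shared_names)
    shared_b = shared_columns(b, shared_names)
    map_a, map_b = {}, {}
    p = p2 = q = 0  # positions in a, in b, and in the output
    while p < k or p2 < k2:
        a_is_shared = p < k and p in shared_a
        b_is_shared = p2 < k2 and p2 in shared_b
        if a_is_shared and b_is_shared:
            # Shared columns are matched pairwise and advance all three counters.
            map_a[p], map_b[p2] = q, q
            p, p2 = p + 1, p2 + 1
        elif p < k and not a_is_shared:
            map_a[p] = q
            p += 1
        elif p2 < k2 and not b_is_shared:
            map_b[p2] = q
            p2 += 1
        else:
            raise ValueError("inputs do not pair up on their shared columns")
        q += 1
    merged = {name: (seq, [map_a[c] for c in pos]) for name, (seq, pos) in a.items()}
    for name, (seq, pos) in b.items():
        merged.setdefault(name, (seq, [map_b[c] for c in pos]))
    return merged, q


def to_rows(aln, n_cols):
    """Render the (letters, positions) representation back into gapped strings."""
    rows = {}
    for name, (seq, positions) in aln.items():
        row = ["-"] * n_cols
        for ch, pos in zip(seq, positions):
            row[pos] = ch
        rows[name] = "".join(row)
    return rows


a, k = from_rows({"x": "AC-G", "z": "A-TG"})   # alignment on X ∪ Z
b, k2 = from_rows({"z": "AT-G", "y": "-TCG"})  # alignment on Y ∪ Z
merged, n_cols = transitivity_merge(a, k, b, k2)
merged_rows = to_rows(merged, n_cols)
```

In this toy example both inputs have 4 columns and 3 shared columns, so the output has 4 + 4 − 3 = 5 columns, matching the count given in the proof.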
Theorem 1: Let {A1, A2, ..., Ak} be a set of Type 1 sub-alignments, and let {X1, X2, ..., Xl} be a set of Type 2 sub-alignments defined by any spanning tree on this set. Then, (a) any two orderings of pairwise transitivity mergers, using the procedure described above, produce the same final multiple sequence alignment, and (b) if no Ai has any false positive homologies (with respect to the true alignment) then the final multiple sequence alignment will also not have any false positive homologies.
Proof: An alignment is entirely defined by the set of homologies it contains. Each set of homologies in each column of an alignment creates an equivalence class, and thus the alignment constitutes an equivalence relation. Thus, the transitive closure of a set A of alignments, denoted tc(A), is defined to be the equivalence relation defined by the transitive closure of the equivalence relations defining each alignment in A. It is easy to see that this is equivalent to saying that letters x and y are in the same column within tc(A) if and only if there is a sequence of alignments A1, A2, ..., Ak with each Ai ∈ A, and letters x1, x2, ..., xk, such that x and x1 are in the same column in A1, x_i and x_{i+1} are in the same column within A_{i+1} (i = 1, 2, ..., k − 1), and xk and y are in the same column in Ak. However, by the constraints on the Type 2 sub-alignments (they cannot modify the Type 1 sub-alignments), the sequence of alignments that implies that x and y are in the same column cannot repeat any sub-alignment. Hence, when we apply the transitivity merge of two sub-alignments and infer a new homology between letters x and y, it is because there is some z such that x and z are in the same column in one sub-alignment, and z and y are in the same column in the other sub-alignment. This will be helpful to us in proving that the simple algorithm we have suffices to define the transitivity merge.
We prove that the transitivity merge algorithm computes the transitive
closure tc(A) and so is independent of the particular ordering on the edge
contractions. Note that at any point in the transitive merge algorithm, some
number of Type 3 sub-alignments may have been computed. We can therefore
trace the creation of any Type 3 sub-alignment from the beginning set of
Type 1 and Type 2 sub-alignments, and note the sequence and number of
the pairwise transitivity mergers involved. We can also note the initial set of
Type 2 sub-alignments involved in creating the Type 3 sub-alignment. Note
that the number n of such additional operations is at least 1 for every Type
3 sub-alignment, but never more than k − 1, where k is the number of Type
1 sub-alignments.
We will prove by induction on n that every Type 3 sub-alignment that is
created by n pairwise mergers applied to a set A of Type 2 sub-alignments
is identical to tc(A).
The base case is n = 1. Let A and A' be Type 2 sub-alignments on X ∪ Z and Y ∪ Z, respectively, with X ∩ Y = ∅. The definition of the pairwise merge of these two Type 2 sub-alignments matches the definition of tc({A, A'}), and so the base case is proven.
Now assume by induction that there is some n ≥ 1 such that any Type 3
sub-alignment formed by at most n pairwise transitivity mergers applied to
a set A of Type 2 sub-alignments is identical to tc(A). Let A1 be some Type
3 sub-alignment created by n + 1 pairwise mergers, and let A2 and A3 be
the pair of sub-alignments (each is either Type 2 or Type 3) that are merged
together through the transitivity merger to form A1. Hence, A2 is on X ∪ Y,
A3 is on Y ∪ Z, and X ∩ Z = ∅. Note that A2 and A3 were each created
using at most n pairwise transitivity mergers, and so are identical to the
transitive closure applied to a set of Type 2 sub-alignments. The result of
merging these two sub-alignments therefore adds in remaining homologies
that can only be inferred through transitivity. Hence, the result of applying
the transitive closure algorithm is an alignment that has only those pairwise
homologies that can be inferred through transitivity. As a result, if each
Type 2 sub-alignment has no false positives, then neither can the result of
the transitive closure algorithm. The only thing we now need to show is that
all pairwise homologies that can be inferred through transitivity are in the
final alignment. So suppose letter x in X and letter z in Z should be in the
same column in the true transitive closure of all Type 2 sub-alignments in
A. Then the path of alignments linking x to z must go directly from x to
some y ∈ Y via some Type 2 sub-alignment and then from y directly to z via
some Type 2 sub-alignment. However, the Type 2 sub-alignments in A are
obtained using the edges of the spanning tree, and hence the path linking x
to z must be obtained using sub-alignments A and A'. Hence the pairwise
transitivity merge of A and A' would correctly detect that x and z are in the
same equivalence class, and the result is proved.
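The transitive closure tc(A) used in this proof can be illustrated with a small union-find sketch. It is only an illustration of the definition, not the merge algorithm of Lemma 1: every letter is identified by an assumed key (sequence name, index in the unaligned sequence), letters that share a column in any input alignment are unioned, and the resulting classes are the columns of tc(A). All function names are hypothetical.

```python
def find(parent, item):
    """Return the representative of item's class, with path compression."""
    parent.setdefault(item, item)
    root = item
    while parent[root] != root:
        root = parent[root]
    while parent[item] != root:
        parent[item], item = root, parent[item]
    return root


def union(parent, first, second):
    """Put the classes of the two letters into one class."""
    root_first, root_second = find(parent, first), find(parent, second)
    if root_first != root_second:
        parent[root_second] = root_first


def transitive_closure(alignments):
    """Union all letters that share a column in any of the given alignments."""
    parent = {}
    for rows in alignments:
        next_index = {name: 0 for name in rows}
        n_cols = len(next(iter(rows.values())))
        for col in range(n_cols):
            anchor = None
            for name, row in rows.items():
                if row[col] == "-":
                    continue
                letter = (name, next_index[name])
                next_index[name] += 1
                find(parent, letter)  # registers singleton columns too
                if anchor is None:
                    anchor = letter
                else:
                    union(parent, anchor, letter)
    return parent


def same_column(parent, first, second):
    return find(parent, first) == find(parent, second)


def count_classes(parent):
    return len({find(parent, letter) for letter in list(parent)})


# Two toy alignments that overlap on sequence "z".
closure = transitive_closure([
    {"x": "AC-G", "z": "A-TG"},
    {"z": "AT-G", "y": "-TCG"},
])
```

Here the last letters of x and y end up in one class only because each is aligned to the last letter of z, which is the kind of homology the theorem says is inferred through transitivity. Because the closure is a property of the set of input alignments, the result does not depend on the order in which the alignments are processed.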
Theorem 2: Given m Type 1 alignments and m − 1 Type 2 alignments, the algorithm to compute the transitivity merge of these alignments uses O(Km log m + mL) time, where K is the maximum length of any sequence (not counting gaps) in any Type 1 alignment and L is the length of the output alignment.
Proof: Let our dataset consist of N sequences, with each sequence of length at most K, and for the sake of simplicity, assume that our decomposition produces m subsets, all with equal sizes (note that centroid decomposition produces balanced subsets, so this assumption is justified). As described before, in Step 5, we choose an edge e = (v, w) from the spanning tree, contract that edge, and perform two transitivity merges: one between S(v) and Label(e), and another between the result of the first merger and S(w). Based on Lemma 1, the first transitivity merge will have a running time of O(K(|S(v)| + 2) + L), and the second merge will have a cost of O(K(|S(v)| + |S(w)|) + L), and thus the cost of each edge contraction is O(K(2|S(v)| + |S(w)|) + L). Now, imagine the case where the spanning tree is a path. If we start merging from one end to the other end, we get the total running time of O(K(3 + 4 + ... + m)) = O(Km^2); however, we can improve on that. The important observation is that the spanning tree should be traversed such that transitivity mergers are between alignments with a balanced number of sequences on each side.
The order in which edges are processed in PASTA is obtained by a recursive approach. Given the spanning tree, we divide it into two halves on the centroid edge, and thus obtain two subtrees of roughly equal size. We process each half recursively using the same strategy, and thus get two single leaves at the endpoints of the centroid edge. Each leaf represents the merger of all alignments in its half, and by construction the two have roughly equal size. We then contract the centroid edge, merge the two sides, and obtain the full alignment. If each half has roughly x sequences, the cost of the final edge contraction is O(K(2x + x) + L) = O(3Kx + L) (as shown before). If f(x) denotes the cost of applying our transitivity merger on a spanning tree with x nodes, we have
f(2x) = 2f(x) + 3Kx + L
which has an O(Kx log(x) + xL) solution. Therefore, our particular order of traversing the spanning tree results in a total cost of O(Km log(m) + mL).
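The gap between the two merge orders discussed in this proof can be checked numerically. The sketch below uses a deliberately simplified cost model that is an assumption of this example rather than the exact accounting of the proof: every subset counts as one unit, a merge of groups with a and b units costs K(a + b), and the additive L term and the constant factors are ignored.

```python
def path_order_cost(m, K=1):
    """Cost of merging m unit-size alignments one at a time along a path."""
    total, merged = 0, 1
    for _ in range(m - 1):
        total += K * (merged + 1)  # the growing alignment is touched again each time
        merged += 1
    return total


def balanced_order_cost(m, K=1):
    """Cost of recursively merging two halves of roughly equal size."""
    if m <= 1:
        return 0
    left = m // 2
    right = m - left
    return balanced_order_cost(left, K) + balanced_order_cost(right, K) + K * m


for m in (16, 128, 1024):
    print(m, path_order_cost(m), balanced_order_cost(m))
```

Under this model the path order grows quadratically in m while the balanced order grows as m log m, which is the behaviour the recurrence above describes.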
C Appendix - Further Results
This section includes:
• Results on 1000-taxon simulated datasets
• Tree error using reference trees based on different bootstrap support
thresholds for the Gutell datasets
• Comparison of PASTA results based on different starting trees
• Comparison of PASTA results based on using Opal or Muscle to produce Type 2 sub-alignments
• Comparison of PASTA results for different alignment subset sizes
• Running time of Muscle as a function of dataset size
C.1 Performance on 1000-taxon datasets
[Figure 1 plot not reproduced here. Y-axis: Missing Branch Rate (FN), 10% to 40%. X-axis: model conditions 1000M3, 1000S3, 1000M2, 1000L2, 1000S2, 1000L3, 1000S1, 1000L1, 1000M1. Series: True Alignment, Mafft-linsi, OPAL, MUSCLE, SATé-II, PASTA.]
Figure 1: Tree errors on 1000-taxon datasets. We report the missing branch rate of trees estimated by FastTree-2 on the true alignment, and on alignments estimated using PASTA, SATé-II, and other alignment methods, on challenging 1000-taxon datasets from [1].
In Figure 1, we show FN error rates of maximum likelihood trees estimated using FastTree-2 on different alignments, using datasets studied in
[1, 2]. The model conditions are labelled by the gap length distribution (M
for medium length, S for short, and L for long), and increase in difficulty
(higher rates of indels and substitutions) from left to right. Note that the
most accurate results are obtained by ML on the true alignment, but that
PASTA and SATé-II have almost indistinguishable accuracy on these data.
C.2 Performance on biological datasets
[Figure 2 plot not reproduced here. Panels: 16S.3, 16S.B.ALL, 16S.T. Y-axis: Tree Error (FN rate), 0.0 to 0.3. X-axis: Bootstrap Support Threshold (33, 50, 75, 80, 85, 90, 95, 99). Series: Muscle, SATe2, PASTA.]
Figure 2: Tree error (FN) rates on biological datasets as a function of the bootstrap threshold chosen to define the reference tree.
Biological datasets present a challenge for benchmarking because the true
phylogeny cannot be known with certainty. We used the reference alignment
(estimated based on secondary structures) produced by the Gutell laboratory,
and then estimated a maximum likelihood tree on each dataset using RAxML
with bootstrapping. By contracting all branches with “low support” we
obtain a reference tree. In the text we reported results based on a bootstrap
support threshold of 75%; here we present results based on other thresholds.
Note that for high thresholds, more branches will be collapsed, whereas for
low thresholds, fewer branches will be collapsed.
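The effect of the threshold on the FN rate can be sketched as follows. This is a toy illustration, not the evaluation code used in the study: each internal branch is represented by the set of taxa on one side of it, a branch is kept in the reference tree when its support is at least the threshold (whether the comparison is strict is an assumption here), and the taxa, support values, and function names are invented for the example.

```python
def reference_bipartitions(supported, threshold):
    """Keep only the branches whose bootstrap support reaches the threshold."""
    return {bip for bip, support in supported.items() if support >= threshold}


def fn_rate(reference, estimated):
    """Fraction of reference branches that are missing from the estimated tree."""
    if not reference:
        return 0.0
    return len(reference - estimated) / len(reference)


# Hypothetical bootstrap supports for three internal branches of a reference tree.
supported = {
    frozenset("AB"): 99,
    frozenset("ABC"): 60,
    frozenset("EF"): 80,
}
# Hypothetical branches of an estimated tree on the same taxa.
estimated = {frozenset("AB"), frozenset("EF"), frozenset("ABD")}

fn_at_75 = fn_rate(reference_bipartitions(supported, 75), estimated)
fn_at_50 = fn_rate(reference_bipartitions(supported, 50), estimated)
```

In this example the weakly supported branch is collapsed at a threshold of 75, so the estimated tree misses nothing, while at a threshold of 50 that branch stays in the reference tree and counts as a missing branch.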
Figure 2 reports results only for Muscle, SATé-II, and PASTA, the three methods with the best performance on these biological datasets, and explores the impact of changing the threshold for contracting branches. The difference in performance is small in most cases, and relative performance generally does not change much as a function of the threshold. Typically PASTA has slightly better tree accuracy than both SATé and Muscle. However, there are a few thresholds and datasets where the relative ordering between SATé, PASTA, and Muscle changes, so that Muscle is tied for best, or PASTA is less accurate than SATé. With respect to the remaining methods, the PASTA starting tree had the least accurate results, with especially poor accuracy on the 16S.T dataset, where the starting alignment did not include one of the taxa, and so that taxon was added randomly into the tree computed without its sequence. MAFFT-profile and Clustal-quicktree had generally good performance, but were clearly less accurate than Muscle, PASTA, and SATé.
C.3 Impact of Starting Tree
We also looked at the impact of using different starting trees. We tested
starting trees obtained by using FastTree-2 on three different alignments: the
Mafft-Profile alignment, the Mafft-PartTree alignment, and our new method
for computing the starting alignment (HMMER-based). Table 1 shows the
PASTA alignment and tree accuracy starting from these starting trees, after
running PASTA for three iterations. There does not seem to be any impact
of the starting tree on the TC score or the final tree accuracy. The
alignment SP-score and Modeler score do decrease as the error of the
starting tree increases, but the differences are very small. Thus, PASTA is highly
robust to the starting tree when run for three iterations.
                               Alignment Accuracy                   Tree Accuracy
Starting tree                  SP-score   Modeler score   TC        1-FN
Starting Tree (FN = 12.4%)     88.5%      89.1%           145       89.3%
Mafft-Profile (FN = 19.1%)     87.2%      87.8%           148       89.5%
Mafft-PartTree (FN = 28.7%)    83.2%      84.6%           144       89.3%
Table 1: Effects of the starting tree in the PASTA algorithm, based on three iterations of PASTA on one replicate of the 10K RNASim dataset. We show PASTA using three different starting trees: our default HMMER-based technique, the tree computed using FastTree-2 on the Mafft-Profile alignment, and the tree based on FastTree-2 on the Mafft-PartTree alignment. Boldface indicates the best performance on this data.
C.4 Choice of Opal vs. Muscle for merging alignments
We also explored the difference between PASTA using Opal (the default merger) or Muscle for merging Type 1 alignments into Type 2 alignments; see Table 2. This comparison showed that Opal can result in better final alignments and trees compared to Muscle. For example, on the 10K RNASim dataset, PASTA with Opal and with Muscle had tree errors of 10.7% and 11.2%, respectively, a slight improvement for Opal, while alignment accuracy changed substantially, especially when considering the average of the SP and modeler scores.
Parameters             Tree Error   SP-Score   Modeler Score   TC score   Running Time (sec)
Opal Type 2 merger     10.7%        88.5%      89.1%           145        13,478
Muscle Type 2 merger   11.2%        67.2%      80.6%           136        14,884
Table 2: Impact of using Muscle versus Opal as the alignment merger technique. We report results on one replicate of the 10K RNASim dataset, using three iterations of PASTA using all default settings for other algorithmic parameters. We report the missing branch rate for the tree error, and two accuracy measures for alignments: the Total Column (TC) score, and the average of the SP-score and modeler score. Boldface indicates the best performance on this data.
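The alignment accuracy measures named in these tables are not defined in this supplement. The sketch below follows the commonly used definitions, which are an assumption of this example: the SP-score is the fraction of homologous letter pairs in the reference alignment that also appear in the estimated alignment, the modeler score is the fraction of pairs in the estimated alignment that appear in the reference, and TC is the number of reference columns recovered exactly. Counting only columns with at least two letters toward TC is a choice made here for simplicity.

```python
from itertools import combinations


def columns(rows):
    """Return each column as a frozenset of (sequence name, letter index)."""
    next_index = {name: 0 for name in rows}
    n_cols = len(next(iter(rows.values())))
    cols = []
    for col in range(n_cols):
        letters = []
        for name, row in rows.items():
            if row[col] != "-":
                letters.append((name, next_index[name]))
                next_index[name] += 1
        cols.append(frozenset(letters))
    return cols


def homologies(cols):
    """All pairs of letters that share a column."""
    pairs = set()
    for col in cols:
        pairs.update(combinations(sorted(col), 2))
    return pairs


def accuracy(reference, estimated):
    """Return (SP-score, modeler score, TC); assumes both have at least one pair."""
    ref_cols, est_cols = columns(reference), columns(estimated)
    ref_pairs, est_pairs = homologies(ref_cols), homologies(est_cols)
    recovered = len(ref_pairs & est_pairs)
    sp_score = recovered / len(ref_pairs)
    modeler = recovered / len(est_pairs)
    tc = len({col for col in ref_cols if len(col) > 1} & set(est_cols))
    return sp_score, modeler, tc


reference = {"s1": "AC-G", "s2": "A-TG"}
estimated = {"s1": "ACG", "s2": "ATG"}
sp_score, modeler, tc = accuracy(reference, estimated)
```

In this toy case the estimated alignment recovers both reference homologies but also aligns one pair of letters that the reference keeps apart, so the SP-score is perfect while the modeler score is not. The tables report the first two measures as percentages and TC as a raw count.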
C.5 Impact of alignment subset size
We also compared results for PASTA where we changed the alignment subset size from 200 (default) to smaller values; see Table 3. We show results for one replicate of the 10K RNASim dataset, and also for the 16S.T dataset. Note that the tree error changes only slightly, with slight differences between the two datasets. As the alignment subset size is reduced from 200 to 50, there is a decrease in error for the 16S.T dataset, but the tree error for the 10K dataset first decreases and then returns to the original value. The sum-of-pairs alignment accuracy measures do not seem to be consistently affected by the change in alignment subset size either (an initial decrease in accuracy and then an increase for 10K, and a consistent decrease in accuracy for 16S.T), but here also the changes are small. On the other hand, there is a consistent change in the TC score, where decreasing the alignment subset size improves the TC score for both datasets, and substantially for the 10K dataset. Finally, the running time went down for both datasets as the subset size was reduced from 200 to 100 and then to 50.
Thus, although changes in alignment subset size do seem to impact the alignment and tree accuracy, the differences are generally small and may not be statistically significant. Furthermore, the impact may depend on the dataset. However, the impact on the running time is substantial: reducing the subset size from 200 to 50 reduced the running time by 55% for the 10K dataset and by 37% for the 16S.T dataset.
Dataset   Parameters        Tree Error   Alignment Accuracy   TC score   Running Time (secs)
10K       Subset size 200   10.7%        88.8                 145        13,478
10K       Subset size 100   10.4%        87.4                 185        8,235
10K       Subset size 50    10.7%        88.6                 210        6,015
16S.T     Subset size 200   8.2%         82.7                 121        9,120
16S.T     Subset size 100   8.1%         82.0                 125        7,086
16S.T     Subset size 50    7.9%         79.0                 129        5,780
Table 3: Impact of alignment subset size. We report results on one replicate of the 10K RNASim dataset and also on the 16S.T dataset, using three iterations of PASTA in which we explore the impact of changing the subset size from 200 (the default) to 100 and 50; all other algorithmic parameters use default values. We report the missing branch rate for the tree error, two accuracy measures for alignments (the Total Column (TC) score, and the average of the SP-score and modeler score), and the running time in seconds. Boldface indicates the best performance on this data.
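The running-time reductions quoted in the text for this section can be checked directly against the values in Table 3. The short sketch below only redoes that arithmetic; the function name is illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return 100.0 * (before - after) / before


# Running times in seconds for subset sizes 200 and 50, copied from Table 3.
reduction_10k = percent_reduction(13478, 6015)
reduction_16s_t = percent_reduction(9120, 5780)
print(round(reduction_10k), round(reduction_16s_t))
```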
C.6 Muscle running time
Figure 3 shows that the running time of MUSCLE mergers scales linearly with
the alignment length, but grows more rapidly with the number of sequences.
[Figure 3 plots not reproduced here. Panel (a): running time against Alignment Length. Panel (b): running time against Number of Sequences. The y-axes are labelled Running Time (minutes) and Running Time (seconds).]
Figure 3: Running time for Muscle mergers as a function of (a) the alignment length with fixed number of sequences (10,000) and (b) the number of sequences with fixed sequence length (2000bp).
D Appendix - Additional Discussion
D.1 Alignment Accuracy Measures
One of the interesting observations in this study is that the average of SP-score and modeler score is not always predictive of tree accuracy. For example, on the Gutell datasets, MAFFT-profile has better sum-of-pairs alignment accuracy than Muscle but produces less accurate trees. Similarly, the starting alignments used by PASTA (computed using the HMM-based technique) had close to the best sum-of-pairs alignment accuracy but produced less accurate trees than PASTA. SATé likewise had much lower alignment accuracy scores than MAFFT-profile on the Gutell and the 10K RNASim datasets, but produced more accurate trees.
This is generally not what has been expected, and perhaps runs counter
to earlier studies that have explored the impact of alignment estimation on
tree estimation. However, this inconsistency between relative performance
in terms of tree accuracy and standard alignment accuracy criteria has been
observed before on large phylogenetic datasets [2], where Opal was observed
to produce alignments with very good sum-of-pairs scores but where trees
on Opal alignments were not particularly accurate. Thus, the disconnect
between standard alignment criteria and phylogenetic accuracy may be a
general issue for large datasets.
The explanation may be that not all homologies have the same evolutionary signal, and that failing to recover some homologies may not have much
impact on tree estimation, while other homologies may be essential for phylogenetic accuracy. Furthermore, alignment methods that aim to recover the
conserved regions may be able to have high alignment accuracy scores but
fail to produce good trees - because conserved regions may not be as useful
for phylogeny estimation as regions that change. Thus, the sites and even
specific homologies that are most informative of the phylogenetic branching
process may not be the homologies that many alignment methods are trained
to recover.
More generally, then, this disconnect suggests a real challenge in using
alignment metrics to predict the utility of an alignment, especially if the
purpose of the alignment is phylogeny estimation.
D.2 Comparison between SATé and PASTA
The technique to re-align sequences on a guide tree used within PASTA was
designed to address the running time challenges in using Opal or Muscle
to repeatedly merge alignments until the entire alignment is created. As we
saw, the last pairwise merge is the most expensive, and becomes prohibitively
expensive on large datasets. The new design, which computes overlapping
compatible Type 2 sub-alignments and then merges these using transitivity,
addresses that computational challenge. However, as we saw, it also provides
accuracy improvements on large datasets. Here we make a direct comparison
between the two methods, to clarify where PASTA has an advantage.
• 1000-taxon datasets from [1]: PASTA and SATé have very similar tree
accuracy, and improve relative to the other methods tested. The running time differences are small on these datasets because the datasets are
comparatively small.
• Gutell datasets, 16S.3, 16S.T, and 16S.B.ALL. These datasets have between 6K and 28K sequences. PASTA is slightly more accurate than
SATé with respect to tree accuracy, but has much higher alignment
accuracy. In addition, PASTA uses less time. The running time advantage depends on the number of sequences, however; for example, on
16S.3, the smallest of these three datasets, PASTA uses about 80% of
the time used by SATé, but on 16S.B.ALL, the largest of these datasets,
PASTA finishes in 6 hours while SATé finishes in 24 hours.
• RNASim datasets, ranging from 10K to 200K sequences. When restricted to the 24 hour limit, SATé can only complete analyses for
the smallest of these datasets. The comparison between SATé and
PASTA on this model condition (10 replicates) shows that PASTA is
much faster, finishing in less than 4 hours while SATé uses more than 6
hours. PASTA and SATé are close in terms of tree error, with a small
advantage to PASTA, but PASTA has much better alignment accuracy.
• Running SATé on a machine where runs are not limited to 24 hours
produces a tree after roughly 70 hours of running time per iteration for
the RNASim dataset with 50K sequences. Thus SATé is much slower
than PASTA, which finishes each iteration in about 5 hours on this
dataset. Moreover, the final tree and alignment of SATé are not nearly
as accurate as PASTA; the tree error is 50% higher (12.6% for SATé
versus 8.2% for PASTA), and the alignment is much less accurate (e.g.
SATé recovers only 30 columns entirely correctly, but PASTA recovers
311 columns).
In general, therefore, PASTA and SATé have very similar accuracy on
small datasets, but PASTA has a large advantage over SATé on large datasets
with respect to alignment accuracy measures, and a smaller advantage with
respect to tree accuracy. However, PASTA is much faster than SATé, and the
running time advantage increases with the size of the dataset. Thus, PASTA
dominates SATé for large-scale phylogeny and alignment estimation.
Acknowledgments
This research was supported in part by NSF grant DBI 0733029 to TW,
by an International Predoctoral Fellowship to SM from the HHMI, and by
a subgrant from the University of Alberta to TW, made possible through
a donation from Musea Ventures, which is held by Professor Gane Ka-Shu
Wong. The authors wish to thank the anonymous referees for their helpful
comments.
References
[1] K. Liu, S. Raghavan, S. Nelesen, C. R. Linder, and T. Warnow. Rapid and
accurate large-scale coestimation of sequence alignments and phylogenetic
trees. Science, 324(5934):1561–1564, 2009.
[2] K. Liu, T.J. Warnow, M.T. Holder, S. Nelesen, J. Yu, A. Stamatakis, and
C.R. Linder. SATé-II: Very fast and accurate simultaneous estimation of
multiple sequence alignments and phylogenetic trees. Syst Biol, 61(1):90–
106, 2011.
[3] S. Mirarab, N. Nguyen, and T. Warnow. PASTA: ultra-large multiple
sequence alignment. In Proc. ISMB 2014, 2014.