Statistically motivated multiple sequence alignment

Bioinformatics Journal Club
16.02.10
Aleksander Sudakov
MSA – multiple sequence alignment
Fast Statistical Alignment.
Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, et al. PLoS Comput Biol. 2009;5(5):e1000392. doi:10.1371/journal.pcbi.1000392
Mind the gaps: evidence of bias in estimates of multiple sequence alignments.
Golubchik T, Wise MJ, Easteal S, Jermiin LS. Mol Biol Evol. 2007 Nov;24(11):2433-42.
Multiple sequence alignment.
Edgar RC, Batzoglou S. Curr Opin Struct Biol. 2006 Jun;16(3):368-73.
Problems with MSA
Why do alignment programs continue to be developed, and why are new tools not more widely adopted by biologists?
● ClustalW + manual editing of alignments
● Visual inspection, the expert's eye
● "No software will find the correct alignment"
Many MSA programs
● Numerous new alignment programs, algorithms and publications; increasing demand from sequencing
● Even within the same program, different settings produce markedly different results
● Accuracy vs. speed? Not so simple.
● Which approach gives the best results?
● CLUSTALW
  Advantages: Uses less memory than other programs
  Cautions: Less accurate or scalable than modern programs
● DIALIGN
  Advantages: Attempts to distinguish between alignable and non-alignable regions
  Cautions: Less accurate than CLUSTALW on global benchmarks
● MAFFT, MUSCLE
  Advantages: Faster and more accurate than CLUSTALW; good trade-off of accuracy and computational cost. Options to run even faster, with lower average accuracy, for high-throughput applications.
  Cautions: For very large data sets (say, more than 1000 sequences) select time- and memory-saving options
● PROBCONS
  Advantages: Highest accuracy score on several benchmarks
  Cautions: Computation time and memory usage are limiting factors for large alignment problems (>100 sequences)
● ProDA
  Advantages: Does not assume global alignability; allows repeated, shuffled and absent domains
  Cautions: High computational cost and less accurate than CLUSTALW on global benchmarks
● T-COFFEE
  Advantages: High accuracy and the ability to incorporate heterogeneous types of information
  Cautions: Computation time and memory usage are limiting factors for large alignment problems (>100 sequences)
CLUSTALW
● Maximum-likelihood
● Introduced in 1994
● Method of choice for biologists; represented dramatic progress in alignment sensitivity combined with speed
● CLUSTALW is still the most widely used MSA program.
● No significant improvements have been made to the algorithm since 1994; several modern methods achieve better accuracy, speed or both
Benchmarks
● Validation of an MSA program typically uses a benchmark dataset of reference alignments
● BAliBASE
● Several new benchmarks: OXBENCH, PREFAB, SABmark, IRMBASE (largely constructed by automated means)
● BRAliBase - ncRNA
Benchmark measurement
Reference alignments contain globally alignable sequences
Benchmarks mostly measure sensitivity
● Sensitivity - correctly aligned positions: true positives / (true positives + false negatives)
● Specificity - correctly non-aligned positions
● Accuracy - correctness of all sequence positions
● PPV - positive predictive value: true positives / (true positives + false positives)
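A minimal sketch of how sensitivity and PPV could be computed for a predicted alignment against a reference, counting residue pairs that share a column. The function names and the choice to score residue pairs (rather than whole columns) are assumptions made for illustration, not taken from any particular benchmark suite.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' is a gap).
    Each residue is identified as (sequence_index, ungapped_position).
    """
    pairs = set()
    counters = [0] * len(alignment)  # next ungapped position per sequence
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sensitivity_and_ppv(predicted, reference):
    """Score a predicted alignment against a reference alignment."""
    pred, ref = aligned_pairs(predicted), aligned_pairs(reference)
    tp = len(pred & ref)   # pairs aligned in both
    fp = len(pred - ref)   # pairs aligned only in the prediction
    fn = len(ref - pred)   # reference pairs the prediction missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


reference = ["AC-GT", "ACTGT"]
predicted = ["ACG-T", "ACTGT"]
print(sensitivity_and_ppv(predicted, reference))
```

An aligner that over-aligns unrelated residues can keep sensitivity high while PPV drops, which is why reporting sensitivity alone can be misleading.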
Fast Statistical Alignment
● Based on a statistical model
● Seeks to find the alignment closest to the truth
● Estimates of the alignment accuracy and uncertainty for every column and character of the alignment - previously available only with alignment programs that use computationally expensive Markov Chain Monte Carlo approaches
FSA goals
● Maximize the expected alignment accuracy ("truth")
● Robust to variation in evolutionary parameters
● High accuracies on protein, RNA and DNA sequences without additional input
● Fast enough for large-scale problems
● Modular code base
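To make "expected alignment accuracy" concrete, here is a small pairwise sketch. It assumes accuracy counts a character as correct when it is matched to its true partner or correctly left unaligned, and that posterior match probabilities `post[i][j]` are already available (for example from a pair HMM). The function name and the exact normalization are illustrative assumptions, not FSA's code.

```python
def expected_accuracy(matches, post, len_x, len_y):
    """Expected fraction of correctly placed characters for one candidate alignment.

    matches: list of (i, j) pairs aligned in the candidate alignment.
    post[i][j]: posterior probability that x[i] is aligned to y[j].
    """
    # Posterior probability that a character is aligned to a gap.
    gap_x = [1.0 - sum(post[i]) for i in range(len_x)]
    gap_y = [1.0 - sum(post[i][j] for i in range(len_x)) for j in range(len_y)]

    matched_x = {i for i, _ in matches}
    matched_y = {j for _, j in matches}

    # A matched pair places two characters at once, hence the factor 2.
    score = sum(2.0 * post[i][j] for i, j in matches)
    score += sum(gap_x[i] for i in range(len_x) if i not in matched_x)
    score += sum(gap_y[j] for j in range(len_y) if j not in matched_y)
    return score / (len_x + len_y)


post = [[0.9, 0.0],
        [0.1, 0.6]]
print(expected_accuracy([(0, 0), (1, 1)], post, 2, 2))
print(expected_accuracy([(0, 0)], post, 2, 2))
```

Under this definition, leaving a low-confidence pair unaligned can score better than forcing a match, which is the behaviour an accuracy-maximizing aligner is after.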
Algorithm
● Pairwise comparison: pair HMM to infer the posterior probability distribution that characters in two sequences are aligned
Merging probabilities for sequence
● Sequence annealing
● Greedy algorithm + iterative refinement
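The greedy merging step can be sketched as follows: start with every character in its own column, then merge columns in order of decreasing pairwise probability, rejecting any merge that would make the alignment inconsistent. This is a naive illustration only; the names `anneal` and `has_cycle`, the use of raw probabilities as weights, and the full cycle check after every merge are assumptions, and iterative refinement is omitted.

```python
from collections import defaultdict


def find(parent, c):
    """Union-find lookup with path halving."""
    while parent[c] != c:
        parent[c] = parent[parent[c]]
        c = parent[c]
    return c


def has_cycle(parent, lengths):
    """True if the column order implied by the sequences is inconsistent."""
    succ = defaultdict(set)
    for s, n in enumerate(lengths):
        for i in range(n - 1):
            # Column of character i must precede column of character i + 1.
            succ[find(parent, (s, i))].add(find(parent, (s, i + 1)))
    state = {}  # 1 = on the DFS stack, 2 = finished

    def visit(node):
        state[node] = 1
        for nxt in succ.get(node, ()):
            if state.get(nxt) == 1 or (nxt not in state and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in list(succ))


def anneal(lengths, scored_pairs):
    """Greedily merge columns; scored_pairs holds (prob, (s1, i1), (s2, i2))."""
    parent = {(s, i): (s, i) for s, n in enumerate(lengths) for i in range(n)}
    for _prob, a, b in sorted(scored_pairs, reverse=True):
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            continue
        trial = dict(parent)
        trial[ra] = rb
        if not has_cycle(trial, lengths):  # keep only consistent merges
            parent = trial
    columns = defaultdict(list)
    for c in sorted(parent):
        columns[find(parent, c)].append(c)
    return sorted(columns.values())


pairs = [(0.9, (0, 0), (1, 0)), (0.8, (0, 0), (1, 1)), (0.7, (0, 1), (1, 1))]
print(anneal([2, 2], pairs))
```

In the example, the 0.8 pair is rejected because it would place two characters of the same sequence in one column; the lower-scoring 0.7 pair is still accepted afterwards.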
Learning
● Query-specific learning procedure for parameter estimation
● Unsupervised Expectation Maximization (EM) algorithm to estimate transition (gap) and emission (substitution) probabilities
● Needs only 2 unaligned DNA sequences or 4 protein sequences to estimate the parameters
Anchoring
Long matches of local homology anchor regions of the global alignment
● Megabase-long alignments
● Fast exact matches using MUMmer (default)
● Inexact matches using exonerate
● Modularity permits FSA to incorporate almost any source of potential homology information
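The idea of exact-match anchors can be illustrated with a toy stand-in: seed on k-mers that occur exactly once in each sequence, then extend each seed in both directions while the sequences agree. This is not MUMmer's algorithm (which uses suffix-based indexing); the function name and the parameter `k` are assumptions for illustration.

```python
from collections import Counter


def unique_exact_anchors(x, y, k=8):
    """Return maximal exact matches seeded by k-mers unique in both sequences.

    Each anchor is (start_in_x, start_in_y, length).
    """
    def unique_kmers(seq):
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        counts = Counter(kmers)
        return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}

    ux, uy = unique_kmers(x), unique_kmers(y)
    anchors = set()
    for kmer in ux.keys() & uy.keys():
        i, j = ux[kmer], uy[kmer]
        end_i, end_j = i + k, j + k
        # Extend the seed to the left while characters agree.
        while i > 0 and j > 0 and x[i - 1] == y[j - 1]:
            i -= 1
            j -= 1
        # Extend the seed to the right while characters agree.
        while end_i < len(x) and end_j < len(y) and x[end_i] == y[end_j]:
            end_i += 1
            end_j += 1
        anchors.add((i, j, end_i - i))  # overlapping seeds collapse to one anchor
    return sorted(anchors)


print(unique_exact_anchors("GGGGACGTTGCACCCC", "TTACGTTGCAAAAA", k=4))
```

Anchors like these restrict the expensive dynamic programming to the regions between them, which is what makes very long alignments tractable.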
Parallelization
Visualization module
FSA benchmarks: protein
FSA benchmarks: RNA
FSA benchmarks: DNA genomic alignment
FSA benchmarks: speed
FSA benchmarks: non-related
FSA benchmarks: similarity between protein and DNA alignment
FSA summary
● Maximizes accuracy (other programs maximize sensitivity or likelihood)
● Automated query-based learning
● Robust to non-homologous sequences and evolutionary biases
● Agnostic to phylogeny
Existing MSA deficiencies
● Speed and memory are limiting factors
● Progressive alignment with guide tree vs pairwise alignment
● Assumes alignability/homology
● Discordances in protein, DNA, RNA alignment
● Nevertheless, the ClustalW program, published in 1994, remains the most widely used MSA program.