Statistically motivated multiple sequence alignment
Bioinformatics Journal Club, 16.02.10
Aleksander Sudakov

MSA – multiple sequence alignment

Fast Statistical Alignment. Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, et al. 2009. PLoS Comput Biol 5(5): e1000392. doi:10.1371/journal.pcbi.1000392
Mind the gaps: evidence of bias in estimates of multiple sequence alignments. Golubchik T, Wise MJ, Easteal S, Jermiin LS. Mol Biol Evol. 2007 Nov;24(11):2433-42
Multiple sequence alignment. Edgar RC, Batzoglou S. Curr Opin Struct Biol. 2006 Jun;16(3):368-73

Problems with MSA
Why do alignment programs continue to be developed, and why are new tools not more widely adopted by biologists?
● ClustalW + manual editing of alignments
● Visual inspection, expert's eye
● „No software will find correct alignment“

Many MSA programs
● Numerous new alignment programs, algorithms and publications; increasing demand from sequencing
● Even within the same program, different settings produce many distinct results
● Accuracy vs. speed? Not so simple.
● Which approach gives the best results?

● CLUSTALW: Uses less memory than other programs. Less accurate or scalable than modern programs.
● DIALIGN: Attempts to distinguish between alignable and non-alignable regions. Less accurate than CLUSTALW on global benchmarks.
● MAFFT, MUSCLE: Faster and more accurate than CLUSTALW; good trade-off of accuracy and computational cost. Options to run even faster, with lower average accuracy, for high-throughput applications.
For very large data sets (say, more than 1000 sequences), select time- and memory-saving options.
● PROBCONS: Highest accuracy score on several benchmarks. Computation time and memory usage are limiting factors for large alignment problems (>100 sequences).
● ProDA: Does not assume global alignability; allows repeated, shuffled and absent domains. High computational cost and less accurate than CLUSTALW on global benchmarks.
● T-COFFEE: High accuracy and the ability to incorporate heterogeneous types of information. Computation time and memory usage are limiting factors for large alignment problems (>100 sequences).

CLUSTALW
● Maximum-likelihood
● Introduced in 1994
● Method of choice for biologists; represented dramatic progress in alignment sensitivity combined with speed
● CLUSTALW is still the most widely used MSA program
● No significant improvements have been made to the algorithm since 1994; several modern methods achieve better performance in accuracy, speed or both

Benchmarks
● Validation of an MSA program typically uses a benchmark dataset of reference alignments
● BAliBASE
● Several new benchmarks: OXBENCH, PREFAB, SABmark, IRMBASE (largely constructed by automated means)
● BRAliBase – ncRNA

Benchmark measurement
Reference alignments contain globally alignable sequences
Mostly measured sensitivity
● Sensitivity - correctly aligned positions
● Specificity - correctly non-aligned positions
● Accuracy - correctness of all sequence positions
● PPV - positive predictive value: true positives / (true positives + false positives)

Fast Statistical Alignment
● Based on a statistical model
● Seeks to find the alignment closest to the truth
● Estimates of alignment accuracy and uncertainty for every column and character of the alignment - previously available only from alignment programs that use computationally expensive Markov chain Monte Carlo approaches

FSA goals
● Maximize the expected alignment accuracy („truth“)
● Robust to variation in evolutionary parameters
● High
accuracies on protein, RNA and DNA sequences without additional input
● Fast enough for large-scale problems
● Modular code base

Algorithm
● Pairwise comparison: pair HMM to infer the posterior probability distribution that characters in two sequences are aligned
Merging probabilities for sequence
● Sequence annealing
● Greedy algorithm + iterative refinement

Learning
● Query-specific learning procedure for parameter estimation
● Unsupervised Expectation Maximization (EM) algorithm to estimate transition (gap) and emission (substitution) probabilities
● Needs only 2 unaligned DNA sequences or 4 protein sequences for estimation

Anchoring
Long matches of local homology anchor regions of global alignment
● Megabase-long alignments
● Fast exact matches using MUMmer (default)
● Inexact matches using exonerate
● Modularity permits FSA to incorporate almost any source of potential homology information

Parallelization

Visualization module

FSA benchmarks: protein

FSA benchmarks: RNA

FSA benchmarks: DNA genomic alignment

FSA benchmarks: speed

FSA benchmarks: non-related

FSA benchmarks: similarity between protein and DNA alignment

FSA summary
● Maximize accuracy (other programs maximize sensitivity or likelihood)
● Automated query-based learning
● Robust with non-homologous sequences, evolutionary biases
● Agnostic to phylogeny

Existing MSA deficiencies
● Speed and memory are limiting factors
● Progressive alignment with guide tree vs pairwise alignment
● Assumes alignability/homology
● Discordances in protein, DNA, RNA alignment

● Nevertheless, the ClustalW program, published in 1994, remains the most widely-used MSA program.
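
The measures on the "Benchmark measurement" slide can be made concrete by scoring pairs of residues that share a column. The following is a minimal sketch, assuming each alignment is a list of equal-length gapped strings with "-" as the gap character and the same sequence order in both alignments. The function names `aligned_pairs` and `pair_scores` are illustrative and do not come from any benchmark suite; only pair-based sensitivity and PPV are computed.

```python
from itertools import combinations


def aligned_pairs(rows, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is (sequence index, position in the ungapped sequence).
    Assumes all rows have the same length.
    """
    counters = [0] * len(rows)  # next ungapped position for each sequence
    pairs = set()
    for column in zip(*rows):
        present = []
        for seq_index, char in enumerate(column):
            if char != gap:
                present.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        # Every two residues in the same column count as one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def pair_scores(reference, predicted):
    """Return (sensitivity, PPV) of predicted pairs against reference pairs."""
    ref = aligned_pairs(reference)
    pred = aligned_pairs(predicted)
    tp = len(ref & pred)   # pairs aligned in both
    fp = len(pred - ref)   # pairs aligned only in the prediction
    fn = len(ref - pred)   # reference pairs the prediction missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


if __name__ == "__main__":
    reference = ["AC-GT", "ACGGT"]
    predicted = ["ACG-T", "ACGGT"]
    print(pair_scores(reference, predicted))
```

In this toy example three of the four reference pairs are recovered and one predicted pair is wrong, so both measures come out at 0.75. A program that aligns everything can raise sensitivity while lowering PPV, which is why the two are reported separately.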
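
The greedy step named on the "Algorithm" slide can be illustrated with a simplified sketch. It is not FSA's implementation: it assumes the pairwise posterior probabilities are already available, ranks candidate pairs by raw posterior probability, uses a hypothetical cutoff `min_prob`, and omits iterative refinement. Each character starts in its own column, and two columns are merged only if the result is still a valid alignment (at most one character per sequence in a column, and columns can be ordered left to right).

```python
def is_consistent(col_of, lengths):
    """True if the column assignment is a valid (partial) alignment."""
    # A column may hold at most one character from each sequence.
    seen = set()
    for (seq, _pos), col in col_of.items():
        if (seq, col) in seen:
            return False
        seen.add((seq, col))
    # The column of character i must precede the column of character i + 1.
    succ = {col: set() for col in col_of.values()}
    indeg = {col: 0 for col in succ}
    for seq, n in enumerate(lengths):
        for pos in range(n - 1):
            a, b = col_of[(seq, pos)], col_of[(seq, pos + 1)]
            if b not in succ[a]:
                succ[a].add(b)
                indeg[b] += 1
    # Kahn's algorithm: the columns can be ordered only if there is no cycle.
    ready = [col for col, deg in indeg.items() if deg == 0]
    visited = 0
    while ready:
        col = ready.pop()
        visited += 1
        for nxt in succ[col]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return visited == len(succ)


def anneal(lengths, posteriors, min_prob=0.5):
    """Greedily merge columns, most probable pair first.

    lengths: length of each sequence.
    posteriors: iterable of (probability, (seq_a, pos_a), (seq_b, pos_b)).
    Returns a dict mapping each character to a column label.
    """
    col_of = {(s, i): (s, i) for s, n in enumerate(lengths) for i in range(n)}
    for prob, x, y in sorted(posteriors, key=lambda t: t[0], reverse=True):
        if prob < min_prob:
            break  # remaining pairs are too uncertain to align
        cx, cy = col_of[x], col_of[y]
        if cx == cy:
            continue  # already in the same column
        trial = {ch: (cx if col == cy else col) for ch, col in col_of.items()}
        if is_consistent(trial, lengths):
            col_of = trial
    return col_of


if __name__ == "__main__":
    # Two sequences of length 2; the two candidate pairs cross each other.
    pairs = [(0.9, (0, 0), (1, 1)), (0.8, (0, 1), (1, 0))]
    columns = anneal([2, 2], pairs)
    print(columns[(0, 0)] == columns[(1, 1)])  # the stronger pair is aligned
    print(columns[(0, 1)] == columns[(1, 0)])  # the crossing pair is rejected
```

Copying the whole assignment for each trial merge keeps the sketch short at the cost of speed; a practical implementation would maintain the column ordering incrementally.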