Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders
Weiwei Zhong

Topics
• Background
• Algorithm Design
• Test Results

Background: Definitions

What is a Sequence Alignment?
Given
• 2 or more sequences
• a scoring scheme: match score, mismatch score, gap penalty
Insert gaps in each sequence so that
• all sequences have the same length
• the pairing score is maximal

Scoring Matrix
Simplified scoring
• match = 2
• mismatch = -1
• gap penalty = -2
In practice: a scoring matrix

Global vs. Local Alignments
Global: entire lengths of sequences
    F G K - G K G
    F G K F G K G
Local: regions of sequences
    - - - F G K G K G
    F G K F G K G - -

Pairwise Alignment vs. Multiple Sequence Alignment (MSA)
Pairwise: 2 sequences
    F G K G K G
    F G K F G K G
MSA: more than 2 sequences
    [Figure: example alignment of four short sequences over the residues F, G, K, Q; the column layout was lost in extraction.]

Background: Basic Dynamic Programming

Dynamic Programming Algorithm for Pairwise Alignments
Two sequences
• GAATTC
• GGATC
Scoring scheme
• match = 2
• mismatch = -1
• gap penalty g = -2

1. Initialization
The first row and the first column of the table are set to 0.

2. Table fill
Each cell $M_{i,j}$, for characters $c_i$ and $c_j$, is computed from its three neighbors $M_{i-1,j-1}$, $M_{i-1,j}$, and $M_{i,j-1}$:

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S(c_i, c_j) \\ M_{i,j-1} + g \\ M_{i-1,j} + g \end{cases}$$

            G   A   A   T   T   C
        0   0   0   0   0   0   0
    G   0   2   0  -1  -1  -1  -1
    G   0   2   1  -1  -2  -2  -2
    A   0   0   4   3   1  -1  -3
    T   0  -1   2   3   5   3   1
    C   0  -1   0   1   3   4   5

3. Trace back
Following the path back through the filled table gives the alignment:
    G A A T T C
    |   |   | |
    G G A - T C

Multidimensional Dynamic Programming for MSA
• For n strings of length L each, the running time is $O(L^n)$.
• Impractical beyond roughly 5-7 proteins of 200-300 residues each.

Algorithm Design: An MSA Heuristic

Feng-Doolittle Progressive Alignment
1. Align 2 of the sequences, $S_i$ and $S_j$.
2. Align a 3rd sequence $S_k$ to the alignment of $S_i$ and $S_j$. A column of the existing alignment is scored against a character of the new sequence by averaging. Example: if column $c_j$ contains T and A, and the new sequence has S at position $c_i$, then
   $S(c_i, c_j) = (S(T, S) + S(A, S)) / 2$
3. Repeat step 2 until all sequences are aligned.
Running time: $O(nL^2)$

Features of the Feng-Doolittle Algorithm
• Once a gap, always a gap
• Alignment order is important
• Early mistakes cannot be corrected
Two alignments of the same three sequences:
    x: G A A G T T        x: G A A G T T
    y: G A C - T T        y: G A - C T T
    z: G A A C T G        z: G A A C T G

Algorithm Design: TspMsa, First Version

Traveling Salesman Problem (TSP)
Given
• n nodes
• distances for each pair of nodes
Find a round trip that
• visits each node exactly once
• has minimal total length
NP-hard (the decision version is NP-complete)
Well studied

TspMsa: Algorithm Design
1. Calculate pairwise distances

            0    1    2    3    4
       0    0    1   15   51   61
       1    1    0   14   24   58
       2   15   14    0   46   67
       3   51   24   46    0   38
       4   61   58   67   38    0

2. Determine a TSP tour
3. Feng-Doolittle alignment, following the tour
Alignment order: 0 2 4 3 1

Starting Point and Direction of TSP Tour
[Figure: TSP tour for data set kinase_ref3, annotated with the alignment score obtained for each starting point and direction.]

Algorithm Design: TspMsa, Modified Design

TspMsa: Modified Algorithm Design
1. Calculate pairwise distances
2. Determine a TSP tour
3. Align the closest nodes
4. Only one node left? If no, go back to step 3. If yes, end.
[Figure: worked example on five nodes. Intermediate groups shown: (1, 0), (3, 1, 0), (2, 4). Final group: (3, 1, 0, 2, 4).]

Modified Algorithm is Better
Alignment order for kinase_ref3:
5 6 7 8 10 9 0 1 4 2 3 18 17 14 15 16 11 12 13 22 21 20 19
• Original TspMsa: 0.603 (worst) to 0.772 (best)
• Modified TspMsa: 0.836

Test Results: What to Compare With?

Existing MSA Programs
[Figure: existing MSA programs (multal, clustalw, saga, prrp, multalign, pileup, poa, hmmt) grouped as iterative or progressive and arranged by alignment quality and computation time. Labels: best quality, better quality, fast, less computation time.]

CLUSTALW
1. Calculate pairwise distances
2. Derive a guide tree by the Neighbor Joining method
   • Choose the 2 closest nodes i and j, and derive an internal node x
   • Repeat until one node is left at the center

   $r_i = \frac{\sum_k d_{ik}}{n - 2}$
   $d_{ix} = \frac{d_{ij} + r_i - r_j}{2}$
   $d_{jx} = d_{ij} - d_{ix}$
   $d_{xm} = \frac{d_{im} + d_{jm} - d_{ij}}{2}$

   [Figure: nine-node example tree built by repeated joining.]
3. Progressively align all sequences following the guide tree
   • Weighted sequences
         1  p e e k s a v t a l
         2  g e e k a a v l a l
         3  e g e w q l v l h v
     Without weights: Score = $[S(t,v) + S(l,v)] / 2$
     With weights: Score = $[S(t,v) \, w_1 w_3 + S(l,v) \, w_2 w_3] / 2$
   • 2 gap penalty values: opening and extension
   • Dynamically changes the gap penalty and the scoring matrix

POA
1. Convert sequences to partial order graphs
   [Figure: example sequences and their partial order graphs; the layout was lost in extraction.]
2. Align 2 sequences
3. Align one sequence to the current group
4. Repeat step 3 until all sequences are aligned

Test Results: Quality Evaluation

BAliBASE Benchmark
• Reference 1: equidistant sequences with various levels of similarity
  • < 25% sequence identity
  • 20-40% sequence identity
  • > 35% sequence identity
• Reference 2: closely related sequences with a highly divergent "orphan" sequence
• Reference 3: subgroups with < 25% identity between groups
• Reference 4: sequences with N/C-terminal extensions
• Reference 5: sequences with internal insertions
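The guide-tree formulas in the CLUSTALW slide above can be sketched as a single joining step. This is a minimal illustration, not CLUSTALW's code: the function name `nj_join_step`, the dictionary layout, and the five-node distance values are all hypothetical. The slide says "choose 2 closest nodes"; the sketch assumes this means the usual neighbor-joining criterion, the pair minimizing d_ij - r_i - r_j.

```python
from itertools import combinations


def nj_join_step(nodes, d):
    """Perform one neighbor-joining step.

    nodes: list of node labels (at least 3).
    d: dict mapping frozenset({a, b}) -> distance between a and b.
    Returns the joined pair (i, j), the branch lengths d_ix and d_jx to the
    new internal node x, and a dict of distances d_xm from x to each other node.
    """
    n = len(nodes)

    def dist(a, b):
        return 0.0 if a == b else d[frozenset((a, b))]

    # r_i = (sum over k of d_ik) / (n - 2)
    r = {i: sum(dist(i, k) for k in nodes) / (n - 2) for i in nodes}

    # Assumption: "closest" is read as the pair minimizing d_ij - r_i - r_j.
    i, j = min(combinations(nodes, 2),
               key=lambda p: dist(p[0], p[1]) - r[p[0]] - r[p[1]])

    d_ij = dist(i, j)
    d_ix = (d_ij + r[i] - r[j]) / 2          # branch from i to x
    d_jx = d_ij - d_ix                       # branch from j to x
    d_xm = {m: (dist(i, m) + dist(j, m) - d_ij) / 2
            for m in nodes if m not in (i, j)}
    return (i, j), d_ix, d_jx, d_xm


# Illustrative distances, not taken from the slides.
nodes = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
d = {frozenset(k): float(v) for k, v in raw.items()}

pair, d_ix, d_jx, d_xm = nj_join_step(nodes, d)
print(pair, d_ix, d_jx, d_xm)
```

On these values the step joins a and b, with branch lengths of about 2 and 3. Repeating the step on the reduced node set, with x in place of i and j, yields the full guide tree.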
Reference 1: Sequences with < 25% Identity
[Figure, "All Test Scores": alignment scores of TspMsa, CLUSTALW, and POA for test cases 1-21, grouped into short, medium, and long sequences.]
[Figure, "Average Score": average score of each program on ref.1 (<25%).]

Reference 1: Sequences with 20-40% Identity
[Figure, "All Test Scores": alignment scores of TspMsa, CLUSTALW, and POA for test cases 1-29, grouped into short, medium, and long sequences.]
[Figure, "Average Score": average score of each program on ref.1 (20-40%).]

Reference 1: Sequences with > 35% Identity
[Figure, "All Test Scores": alignment scores of TspMsa, CLUSTALW, and POA for test cases 1-27, grouped into short, medium, and long sequences.]
[Figure, "Average Score": average score of each program on ref.1 (>35%).]

Reference 2
[Figure, "All Test Scores": alignment scores of TspMsa, CLUSTALW, and POA for test cases 1-23, grouped into short, medium, and long sequences.]
[Figure, "Average Score": average score of each program on ref.2.]

Reference 3
[Figure, "All Test Scores": alignment scores of TspMsa, CLUSTALW, and POA for test cases 1-12, grouped into short, medium, and long sequences.]
[Figure, "Average Score": average score of each program on ref.3.]

Reference 4 and Reference 5
[Figure, "All Test Scores": alignment scores of TspMsa, CLUSTALW, and POA for the Reference 4 and Reference 5 test cases.]
[Figure, "Average Score": average score of each program on ref.4 and ref.5.]

Alignment Quality Comparison
TspMsa and POA: TspMsa better
TspMsa and CLUSTALW: comparable
• Reference 1
  • < 25% identity: similar *
  • 20-40% identity: similar *
  • > 35% identity: similar
• Reference 2: similar *
• Reference 3: TspMsa better
• Reference 4: CLUSTALW better
• Reference 5: similar
* CLUSTALW slightly better for short sequences.
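All three programs compared above rest on pairwise alignment, so it helps to have the basic table fill from the Background slides in executable form. The sketch below is a minimal illustration under the slide's simplified scoring (match = 2, mismatch = -1, gap = -2) and its zero-initialized first row and column. The function names `fill_table` and `trace_back`, and the tie-breaking order (diagonal first, then left, then up), are assumptions, not details given in the slides.

```python
def fill_table(a, b, match=2, mismatch=-1, gap=-2):
    """Fill the DP table with `a` along the columns and `b` along the rows.

    The first row and column stay 0, as on the Initialization slide.
    """
    rows, cols = len(b) + 1, len(a) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if b[i - 1] == a[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # align the two characters
                          M[i][j - 1] + gap,     # gap in b
                          M[i - 1][j] + gap)     # gap in a
    return M


def trace_back(M, a, b, match=2, mismatch=-1, gap=-2):
    """Recover one alignment by walking back from the bottom-right cell."""
    i, j = len(b), len(a)
    top, bottom = [], []
    while i > 0 and j > 0:
        s = match if b[i - 1] == a[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:       # assumed preference: diagonal
            top.append(a[j - 1]); bottom.append(b[i - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i][j - 1] + gap:       # then left: gap in b
            top.append(a[j - 1]); bottom.append("-"); j -= 1
        else:                                    # then up: gap in a
            top.append("-"); bottom.append(b[i - 1]); i -= 1
    while j > 0:                                 # leftover prefix of a
        top.append(a[j - 1]); bottom.append("-"); j -= 1
    while i > 0:                                 # leftover prefix of b
        top.append("-"); bottom.append(b[i - 1]); i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


table = fill_table("GAATTC", "GGATC")
aligned = trace_back(table, "GAATTC", "GGATC")
for row in table:
    print(row)
print(aligned[0])
print(aligned[1])
```

For GAATTC and GGATC this reproduces the table and the alignment shown in the Background slides. The table has (len(a) + 1) × (len(b) + 1) cells, which is why the pairwise distance step dominates the running time when it is repeated for every pair of sequences.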
Test Results: Execution Time Evaluation

Fast Mode TspMsa
Most time-consuming step: pairwise distance calculations
• Slow mode: full dynamic programming (accurate)
• Fast mode: a fast approximate method (heuristic)

Quality Impact of the Fast Mode
[Figure: scores of CLUSTALW, CLUSTALW-fast, TspMsa, TspMsa-fast, and POA on ref.1 (<25%), ref.1 (20-40%), ref.1 (>35%), ref.2, ref.3, ref.4, and ref.5.]

Execution Time Evaluation
CLUSTALW and TspMsa in fast mode
[Figure: execution time in minutes (axis 0 to 200) of CLUSTALW, TspMsa, and POA for 100, 200, 500, and 1000 sequences.]

Conclusions
QUALITY
Slow mode
• close to CLUSTALW (slow mode)
• better than POA
Fast mode (not as good as slow mode)
• comparable to CLUSTALW (fast mode)
• better than POA
SPEED
Fast mode
• faster than CLUSTALW (fast mode)
• comparable to POA

Acknowledgement
Dr. Robert Robinson
Dr. Russell Malmberg
Dr. Eileen Kraemer
Computer Science Department