Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders
Weiwei Zhong
Topics
• Background
• Algorithm Design
• Test Results
Background
Definitions
What is a Sequence Alignment?
Given
• 2 or more sequences
• a scoring scheme
  • match score
  • mismatch score
  • gap penalty
Insert gaps into the sequences so that
• all sequences have the same length
• the total pairing score is maximized
Scoring Matrix
Simplified Scoring
• match = 2
• mismatch = -1
• gap penalty = -2
In Practice
Scoring matrix
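As a small illustration of the simplified scoring scheme, the sketch below sums the column scores of two already-aligned rows. The function name `score_alignment` and the use of `-` as the gap character are assumptions made for this example, not part of the slides.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap=-2):
    """Sum the column scores of two equal-length aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap          # any column containing a gap
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total


# Four matches, one mismatch, one gap: 4*2 - 1 - 2 = 5
print(score_alignment("GAATTC", "GGA-TC"))
```

In practice the match/mismatch values would be looked up in a scoring matrix instead of being two constants.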
Global vs. Local Alignments
Global: entire lengths of sequences
F G K - G K G
F G K F G K G
Local: regions of sequences
- - - F G K G K G
F G K F G K G - -
Pairwise Alignment vs.
Multiple Sequence Alignment (MSA)
Pairwise: 2 sequences
F G K  G K G
F G K F G K G
MSA: more than 2 sequences
[Figure: example multiple sequence alignment of four sequences]
Background
Basic Dynamic Programming
Dynamic Programming Algorithm for
Pairwise Alignments
1. Initialization
Two sequences
• GAATTC
• GGATC
Scoring scheme
• match = 2
• mismatch = -1
• gap penalty = -2
Table with the first row and first column set to 0:
        G   A   A   T   T   C
    0   0   0   0   0   0   0
G   0
G   0
A   0
T   0
C   0
2. Table fill
Each cell M(i,j) is computed from its upper-left, upper and left neighbors; ci and cj are the characters labeling row i and column j.
M(i,j) = max { M(i-1,j-1) + S(ci, cj), M(i,j-1) + g, M(i-1,j) + g }
Scoring scheme
• match = 2
• mismatch = -1
• gap g = -2
        G   A   A   T   T   C
    0   0   0   0   0   0   0
G   0   2   0  -1  -1  -1  -1
G   0   2   1  -1  -2  -2  -2
A   0   0   4   3   1  -1  -3
T   0  -1   2   3   5   3   1
C   0  -1   0   1   3   4   5
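The table fill can be sketched in a few lines of Python. This is a minimal sketch under the slide's conventions (first row and column fixed at 0, rows labeled by GGATC, columns by GAATTC); the function name `fill_table` is an assumption for this example.

```python
def fill_table(col_seq, row_seq, match=2, mismatch=-1, gap=-2):
    """Fill the DP table; the first row and first column stay 0, as on the slide."""
    rows, cols = len(row_seq) + 1, len(col_seq) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + s,   # align the two characters
                M[i][j - 1] + gap,     # gap in the row sequence
                M[i - 1][j] + gap,     # gap in the column sequence
            )
    return M


for row in fill_table("GAATTC", "GGATC"):
    print(row)
```

The last row printed is `[0, -1, 0, 1, 3, 4, 5]`, matching the bottom row of the table.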
3. Trace back
Trace back from the bottom-right cell (score 5) to the top-left cell:
G A A T T C
|   |   | |
G G A - T C
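A possible trace back routine is sketched below. It refills the table so that the block is self-contained, then walks from the bottom-right cell to the top-left cell. Preferring the diagonal move when several moves tie is an assumption of this sketch; the slides do not state a tie-breaking rule. The name `align_pair` is also hypothetical.

```python
def align_pair(col_seq, row_seq, match=2, mismatch=-1, gap=-2):
    """Fill the table (zero borders), then trace back from the bottom-right cell."""
    rows, cols = len(row_seq) + 1, len(col_seq) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)

    top, bottom = [], []
    i, j = len(row_seq), len(col_seq)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:      # diagonal move (preferred on ties)
                move = "diag"
            elif M[i][j] == M[i][j - 1] + gap:      # came from the left
                move = "left"
            else:                                    # came from above
                move = "up"
        else:
            move = "left" if j > 0 else "up"         # on a border: only one way to go
        if move == "diag":
            top.append(col_seq[j - 1]); bottom.append(row_seq[i - 1]); i -= 1; j -= 1
        elif move == "left":
            top.append(col_seq[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(row_seq[i - 1]); i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), M[-1][-1]


print(align_pair("GAATTC", "GGATC"))
```

For the two example sequences this returns the rows `GAATTC` and `GGA-TC` with score 5.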
Multidimensional Dynamic Programming for
MSA
• For n strings of length L each, the running time is O(L^n).
• Impractical beyond about 5-7 proteins of 200-300 residues each.
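To get a feel for the O(L^n) growth, the following sketch counts the cells of an n-dimensional table with L entries per axis. The helper name `table_cells` is an assumption, and the count ignores constant factors such as the number of neighbors examined per cell.

```python
def table_cells(length, n_sequences):
    """Number of cells in an n-dimensional DP table with `length` entries per axis."""
    return length ** n_sequences


for n in (2, 3, 5, 7):
    print(f"n = {n}: about {table_cells(300, n):.2e} cells")
```

Going from 2 to 5 sequences of 300 residues already raises the count from 90,000 cells to about 2.4 trillion.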
Topics
• Background
• Algorithm Design
• Test Results
Algorithm Design
An MSA Heuristic
Feng-Doolittle Progressive Alignment
1. Align 2 of the sequences, Si and Sj
2. Align a 3rd sequence Sk to the alignment of Si and Sj
   Example: residue ci = S of Sk scored against alignment column cj = (T, A):
   S(ci, cj) = (S(T, S) + S(A, S)) / 2
3. Repeat step 2 until all sequences are aligned
Running Time
O(n L^2)
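The column score in step 2 is an average of pairwise scores, which can be sketched as follows. The names `simple_score` and `residue_vs_column` are assumptions, the match/mismatch values stand in for a real scoring matrix, and the column is assumed to contain no gaps.

```python
def simple_score(x, y, match=2, mismatch=-1):
    """Placeholder pairwise score; a real program would use a scoring matrix."""
    return match if x == y else mismatch


def residue_vs_column(residue, column, score=simple_score):
    """Average pairwise score of `residue` against every residue in `column`."""
    if not column:
        raise ValueError("column must contain at least one residue")
    return sum(score(c, residue) for c in column) / len(column)


# Residue S against the column (T, A): (S(T, S) + S(A, S)) / 2
print(residue_vs_column("S", ["T", "A"]))
```

With the placeholder scores both pairs are mismatches, so the result is -1.0.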
Features of Feng-Doolittle Algorithm
• Once a gap, always a gap
• Alignment order is important
• Early mistakes cannot be corrected
x: G A A G T T
y: G A C - T T
z: G A A C T G

x: G A A G T T
y: G A - C T T
z: G A A C T G
Algorithm Design
TspMsa: First Version
Traveling Salesman Problem (TSP)
Given
• n nodes
• distances for each pair of nodes
Find a round trip that
• visits each node exactly once
• has minimal total length
NP-hard (the decision version is NP-complete)
Well studied
TspMsa: Algorithm Design
calculate pairwise distances (sequences 0-4)
        0   1   2   3   4
    0   0   1  15  51  61
    1   1   0  14  24  58
    2  15  14   0  46  67
    3  51  24  46   0  38
    4  61  58  67  38   0
determine a TSP tour
Alignment order: 0 2 4 3 1
Feng-Doolittle alignment
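For the five-sequence example, the tour step can be sketched by brute force. This is only an illustration: the slides do not say which TSP solver TspMsa uses, exhaustive search is practical only for a handful of nodes, and the names `shortest_tour` and `alignment_order` are assumptions.

```python
from itertools import permutations

# Pairwise distances from the example matrix (sequences 0-4).
DIST = [
    [0, 1, 15, 51, 61],
    [1, 0, 14, 24, 58],
    [15, 14, 0, 46, 67],
    [51, 24, 46, 0, 38],
    [61, 58, 67, 38, 0],
]


def tour_length(tour, dist):
    """Length of the closed tour, including the edge back to the start."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))


def shortest_tour(dist):
    """Exhaustive search over all tours that start at node 0."""
    best = None
    for rest in permutations(range(1, len(dist))):
        tour = [0, *rest]
        if best is None or tour_length(tour, dist) < tour_length(best, dist):
            best = tour
    return best


def alignment_order(tour, start, reverse=False):
    """Read the tour from `start`, in the given or the reversed direction."""
    seq = tour[::-1] if reverse else list(tour)
    k = seq.index(start)
    return seq[k:] + seq[:k]


tour = shortest_tour(DIST)
print(tour, tour_length(tour, DIST))
print(alignment_order(tour, start=0, reverse=True))
```

The shortest tour has length 145; read from node 0 in the reverse direction it gives the order 0 2 4 3 1. The same tour yields a different alignment order for every choice of starting point and direction.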
Starting Point and Direction of TSP Tour
data set: kinase_ref3
[Figure: TSP tour for the kinase_ref3 data set, annotated with alignment scores for different starting points and directions]
Algorithm Design
TspMsa: Modified Design
TspMsa: Modified Algorithm Design
calculate pairwise distances
determine a TSP tour
align closest nodes
one node left?
• no: go back to "align closest nodes"
• yes: end
Example with nodes 0-4 (tour edge lengths 1, 15, 24, 38, 67): groups formed include (1, 0), (2, 4) and (3, 1, 0); final alignment order 3, 1, 0, 2, 4
Modified Algorithm is Better
Alignment order for Kinase_ref3
5 6 7 8 10 9 0 1 4 2 3
18 17 14 15 16 11 12 13 22 21 20 19
Original TspMsa: 0.603 (worst) to 0.772 (best)
Modified TspMsa: 0.836
Topics
• Background
• Algorithm Design
• Test Results
Test Results
What to Compare With?
Existing MSA Programs
[Figure: existing MSA programs (multal, clustalw, saga, prrp, multalign, pileup, poa, hmmt), grouped as iterative or progressive and placed on a spectrum from fast (less computation time) to best quality]
CLUSTALW
1. Calculate pairwise distances
2. Derive a guide tree by the Neighbor Joining method
• Choose the 2 closest nodes i and j, derive an internal node x
  r_i = (Σ_k d_ik) / (n - 2)
  d_ix = (d_ij + r_i - r_j) / 2
  d_jx = d_ij - d_ix
  d_xm = (d_im + d_jm - d_ij) / 2
• Repeat until one node is left at the center
[Figure: nodes 1-9 being joined pairwise into a guide tree]
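The four formulas for one joining step can be checked with a short sketch. The function name `join_step` is an assumption, the pair (i, j) is supplied by the caller, and the distance matrix in the usage example is made up for illustration (it comes from a small tree with branch lengths 2, 3, 4, 1 and 5).

```python
def join_step(d, i, j):
    """Join nodes i and j into a new internal node x.

    Returns (d_ix, d_jx, d_xm), where d_xm maps every other node m to its
    distance from x. `d` is a symmetric matrix with zeros on the diagonal.
    """
    n = len(d)
    if n < 3:
        raise ValueError("need at least 3 nodes")
    r = [sum(row) / (n - 2) for row in d]            # r_i = (sum_k d_ik) / (n - 2)
    d_ix = (d[i][j] + r[i] - r[j]) / 2
    d_jx = d[i][j] - d_ix
    d_xm = {m: (d[i][m] + d[j][m] - d[i][j]) / 2
            for m in range(n) if m not in (i, j)}
    return d_ix, d_jx, d_xm


# Illustrative distances for leaves A, B, C, D (indices 0-3).
D = [
    [0, 5, 7, 11],
    [5, 0, 8, 12],
    [7, 8, 0, 6],
    [11, 12, 6, 0],
]
print(join_step(D, 0, 1))
```

Joining A and B recovers their branch lengths 2 and 3, and gives distances 5 and 9 from the new node to C and D.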
CLUSTALW
3. Progressively align all sequences following the
guide tree
• Weighted sequences
1 p e e k s a v t a l
2 g e e k a a v l a l
3 e g e w q l v l h v
Without weights
Score = [S(t,v) + S(l,v)] / 2
With weights
Score = [S(t,v)*w1*w3 + S(l,v)*w2*w3] / 2
• 2 gap penalty values: opening, extension
• Dynamically changes the gap penalty and
the scoring matrix
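The weighted score can be sketched as an average over all residue pairs between two groups, with each pair multiplied by the weights of its two sequences. The function names, the placeholder match/mismatch scores and the weight values in the usage example are assumptions for illustration only.

```python
def simple_score(x, y, match=2, mismatch=-1):
    """Placeholder pairwise score; a real program would use a scoring matrix."""
    return match if x == y else mismatch


def weighted_column_score(group_a, group_b, score=simple_score):
    """Average weighted score between two alignment columns.

    Each group is a list of (residue, sequence_weight) pairs.
    """
    total = sum(score(x, y) * wx * wy
                for x, wx in group_a
                for y, wy in group_b)
    return total / (len(group_a) * len(group_b))


# Column (t, l) from sequences 1 and 2 against v from sequence 3,
# with made-up weights w1 = 0.5, w2 = 1.0, w3 = 2.0.
print(weighted_column_score([("t", 0.5), ("l", 1.0)], [("v", 2.0)]))
```

Setting every weight to 1 reduces this to the unweighted average.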
POA
1. Convert sequences to partial order graphs
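A single sequence E T N K becomes a linear graph: E → T → N → K
An alignment
E T - - P K M I V R
E T T H - K M L V R
becomes a graph that branches where the sequences differ:
E → T → (P | T → H) → K → M → (I | L) → V → R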
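One way to build such a graph from aligned rows is sketched below. It is a simplified illustration, not the POA program's own data structure: identical residues in the same column share a node, and each sequence contributes edges between its consecutive residues. The name `po_graph` is an assumption.

```python
def po_graph(rows):
    """Build a partial order graph from equal-length aligned rows ('-' is a gap).

    Returns (nodes, edges): nodes maps (column, residue) to a node id,
    edges is a set of (from_id, to_id) pairs.
    """
    nodes = {}
    edges = set()
    for row in rows:
        prev = None
        for col, res in enumerate(row):
            if res == "-":
                continue                          # gaps create no node
            node = nodes.setdefault((col, res), len(nodes))
            if prev is not None:
                edges.add((prev, node))           # follow this sequence's path
            prev = node
    return nodes, edges


nodes, edges = po_graph(["ET--PKMIVR", "ETTH-KMLVR"])
print(len(nodes), "nodes,", len(edges), "edges")
```

For the two aligned rows this gives 11 nodes and 12 edges; a single sequence such as `ETNK` gives a chain of 4 nodes and 3 edges.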
POA
2. Align 2 sequences
3. Align one sequence to the current group
[Figure: the sequence E T N K aligned to the partial order graph of the current group]
4. Repeat 3 until all sequences are aligned
Test Results
Quality Evaluation
BAliBASE Benchmark
• Reference 1: equidistant sequences with various
levels of similarity.
• < 25% sequence identity
• 20-40% sequence identity
• > 35% sequence identity
• Reference 2: closely related sequences with a highly
divergent “orphan” sequence.
• Reference 3: subgroups with <25% identity between
groups.
• Reference 4: sequences with N/C-terminal extensions.
• Reference 5: sequences with internal insertions.
Reference 1 Sequences with < 25% Identity
[Figure: alignment scores of TspMsa, CLUSTALW and POA on ref.1 (<25%). Panel "All Test Scores": score per test case, grouped into short, medium and long sequences. Panel "Average Score": average score per program.]
Reference 1 Sequences with 20-40% Identity
[Figure: alignment scores of TspMsa, CLUSTALW and POA on ref.1 (20-40%). Panel "All Test Scores": score per test case, grouped into short, medium and long sequences. Panel "Average Score": average score per program.]
Reference 1 Sequences with >35% Identity
[Figure: alignment scores of TspMsa, CLUSTALW and POA on ref.1 (>35%). Panel "All Test Scores": score per test case, grouped into short, medium and long sequences. Panel "Average Score": average score per program.]
Reference 2
[Figure: alignment scores of TspMsa, CLUSTALW and POA on ref.2. Panel "All Test Scores": score per test case, grouped into short, medium and long sequences. Panel "Average Score": average score per program.]
Reference 3
[Figure: alignment scores of TspMsa, CLUSTALW and POA on ref.3. Panel "All Test Scores": score per test case, grouped into short, medium and long sequences. Panel "Average Score": average score per program.]
Reference 4 and Reference 5
[Figure: alignment scores of TspMsa, CLUSTALW and POA on ref.4 and ref.5. Panel "All Test Scores": score per test case for Reference 4 and Reference 5. Panel "Average Score": average score per program for ref.4 and ref.5.]
Alignment Quality Comparison
TspMsa and POA: TspMsa better
TspMsa and CLUSTALW: comparable
Reference 1:
  <25% identity: Similar *
  20-40% identity: Similar *
  >35% identity: Similar
Reference 2: Similar *
Reference 3: TspMsa better
Reference 4: CLUSTALW better
Reference 5: Similar
* CLUSTALW slightly better for short sequences.
Test Results
Execution Time Evaluation
Fast Mode TspMsa
Most time-consuming step: pairwise distance calculations
• Slow mode: full dynamic programming (accurate)
• Fast mode: a fast approximate method (heuristic)
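The slides do not say which approximation the fast mode uses. Purely as an illustration of how a distance can be estimated without dynamic programming, the sketch below compares the sets of k-mers (substrings of length k) of two sequences. The function name `kmer_distance`, the choice k = 2 and the Jaccard-style formula are all assumptions, not the method used by TspMsa or CLUSTALW.

```python
def kmer_distance(a, b, k=2):
    """1 minus the fraction of shared k-mers (Jaccard distance on k-mer sets)."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    if not union:
        return 0.0                    # both sequences shorter than k
    return 1 - len(kmers_a & kmers_b) / len(union)


print(kmer_distance("GAATTC", "GGATC"))
```

Each comparison takes time roughly linear in the sequence lengths, whereas filling a full dynamic programming table takes time proportional to their product.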
Quality Impact of the Fast Mode
[Figure: alignment scores (0 to 1) of CLUSTALW, CLUSTALW-fast, TspMsa, TspMsa-fast and POA on ref.1 (<25%), ref.1 (20-40%), ref.1 (>35%), ref.2, ref.3, ref.4 and ref.5]
Execution Time Evaluation
[Figure: execution time (min, 0 to 200) versus number of sequences (100, 200, 500, 1000) for CLUSTALW, TspMsa and POA, with CLUSTALW and TspMsa in fast mode]
Conclusions
QUALITY
Slow mode
• close to CLUSTALW (slow mode)
• better than POA
Fast mode (not as good as slow mode)
• comparable to CLUSTALW (fast mode)
• better than POA
SPEED
Fast mode
• faster than CLUSTALW (fast mode)
• comparable to POA
Acknowledgement
Dr. Robert Robinson
Dr. Russell Malmberg
Dr. Eileen Kraemer
Computer Science Department