Optimality, Gene Sequence Alignment

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
CS 312: Algorithm Design & Analysis
Lecture #24: Optimality, Gene Sequence Alignment
Slides by: Eric Ringger, with contributions from Mike Jones, Eric Mercer, Sean Warnick
Announcements
 Homework #15 due now
 Project #5: Gene Sequence Alignment
 Kick-off: today
 Read directions now
 Whiteboard experience: due Monday
 Early: Monday after mid-term exam
 Due: Wednesday after mid-term exam
 Mid-term Exam
 Start preparing your one page of notes
 Must be prepared by you. No cutting and pasting.
Objectives
 Revisit the main ideas behind Dynamic
Programming
 Define the optimality property for DP
 Develop the algorithm for gene sequence
alignment (or at least begin)
 Prepare for Project #5
Dynamic Programming
The six steps:
1. Ask: am I solving an optimization problem?
2. Devise a minimal description (address) for any problem
instance and sub-problem
3. Divide problems into sub-problems: define the recurrence to
specify the relationship of problems to sub-problems
4. Check that the optimality property holds: An optimal
solution to a problem is built from optimal solutions to sub-problems.
5. Store results – typically in a table – and re-use the solutions
to sub-problems in the table as you build up to the overall
solution.
6. Back-trace / analyze the table to extract the composition of
the final solution.
Optimality Property
An optimal solution to a problem is built from
optimal solutions to sub-problems.
 The optimality property is a necessary
condition for solving an optimization problem
by DP!
 It allows us to store and re-use optimal results to
sub-problems.
Optimality
\[
\text{optimalsolution}(\text{parent}) = \min\ (\text{or } \max)
\begin{cases}
f_1(\text{optimalsolution}(\text{child}_1)) \\
f_2(\text{optimalsolution}(\text{child}_2)) \\
\quad\vdots \\
f_n(\text{optimalsolution}(\text{child}_n))
\end{cases}
\]
[Figure: tree of sub-problems with nodes labeled A through K]
Shortest Path
[Figure: road map connecting American Fork, Sundance, Orem, Geneva, and Provo; road segments are labeled with distances 20, 10, 12, 15, 18, 3, 10, and 12]
Goal: the shortest path from AF to Provo.
Does this problem exhibit the optimality property? Pair up. Discuss
Questions
[Figure: road map connecting American Fork, Sundance, Orem, Geneva, and Provo, with distances on the road segments]
 Q. In general, do you know which sub-problem solutions to use in advance?
 A. No. So a very greedy algorithm is not an option. (But
Dijkstra’s is.)
 Q: How does having a table of intermediate shortest path
results help find the shortest path from AF to Provo?
 A: Reuse those results for intermediate destinations as you try
different routes.
 Q. Do you have to reconsider alternative sub-optimal
solutions for the intermediate destinations?
 A. No.
 Thus, the Optimality Property holds.
 Therefore, the shortest path problem can be solved by DP.
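To make the table idea concrete, here is a minimal sketch of shortest path by DP with memoization. The city names come from the slide, but the assignment of distances to road segments is an assumption made for illustration (it is not read off the figure), and roads are treated as one-way toward Provo so that the sub-problems form a DAG. The function name `shortest_to` is hypothetical.

```python
from functools import lru_cache

# Hypothetical one-way road segments in miles; NOT the figure's actual layout.
SEGMENTS = {
    ("American Fork", "Sundance"): 20,
    ("American Fork", "Orem"): 10,
    ("American Fork", "Geneva"): 12,
    ("Geneva", "Orem"): 3,
    ("Geneva", "Provo"): 12,
    ("Orem", "Provo"): 10,
    ("Sundance", "Provo"): 15,
}
START = "American Fork"


@lru_cache(maxsize=None)  # the cache plays the role of the DP table
def shortest_to(dest):
    """Return (miles, route) for the shortest route from START to dest."""
    if dest == START:
        return 0, (START,)
    best = (float("inf"), ())
    for (src, dst), miles in SEGMENTS.items():
        if dst == dest:
            # Reuse the optimal solution to the sub-problem START -> src.
            so_far, route = shortest_to(src)
            if so_far + miles < best[0]:
                best = (so_far + miles, route + (dest,))
    return best


miles, route = shortest_to("Provo")
print(miles, " -> ".join(route))
```

Each intermediate destination is solved once and its optimal result is reused for every route that passes through it; sub-optimal routes to an intermediate city are never reconsidered.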
Optimality in Driving
The shortest route from American
Fork to Provo passes through
Orem.
Assume we have found this route.
Then what can we say about
the shortest route from AF
to Orem?
It is the AF-to-Orem portion of that optimal route
from AF to Provo.
Could it be otherwise?
A related problem
Now suppose you drive from AF to Orem as fast as you can
on your way to Provo,
but you are limited by the gas in your tank.
Does the Optimality Property Hold?
Start with 10 gallons.
[Figure: AF → Orem → Provo. Each leg offers three alternative roads labeled time/gas: 5/9, 10/5, and 20/1. For example, 20/1 means “takes 20 minutes using 1 gallon of gas”.]
Goal: get to Provo in as little time as possible. No refueling.
Does this problem (formulation) satisfy the optimality property or not? Why?
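One way to explore the question is to brute-force the tiny instance. The sketch below assumes a particular reading of the figure: each of the two legs (AF to Orem, Orem to Provo) offers the same three roads, given as (minutes, gallons). The names `ROADS`, `TANK`, and `fastest_feasible` are hypothetical.

```python
from itertools import product

# Assumed reading of the figure: three roads per leg, as (minutes, gallons).
ROADS = [(5, 9), (10, 5), (20, 1)]
TANK = 10  # gallons available, no refueling


def fastest_feasible(legs, tank):
    """Try every combination of roads; return (minutes, choice) for the
    fastest combination whose total gas use fits in the tank."""
    best = None
    for choice in product(*legs):
        minutes = sum(m for m, _ in choice)
        gallons = sum(g for _, g in choice)
        if gallons <= tank and (best is None or minutes < best[0]):
            best = (minutes, choice)
    return best


best_to_orem = fastest_feasible([ROADS], TANK)
best_to_provo = fastest_feasible([ROADS, ROADS], TANK)
print("fastest AF -> Orem:", best_to_orem)
print("fastest AF -> Orem -> Provo:", best_to_provo)
```

Under these assumptions, the first leg of the fastest feasible trip to Provo is not the fastest way to reach Orem, so a sub-problem addressed by location alone does not satisfy the optimality property. Adding the remaining gas to the sub-problem description is one possible repair.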
Problem Solving Advice
 Start by asking: which sub-problems should be
solved?
 If you know how to choose in advance using local
information only, then
 greedy might work.
 Else if sub-problems don’t overlap, then
 divide and conquer would be a good choice.
 Else if the optimality property holds, then
 DP is a good choice.
 Else the optimality property does NOT
hold, so
 apply another strategy.
(Stay tuned for more guidance)
Important!
Gene Sequence Alignment
x=ACGCTGA
y=ACTGT
Virtually Identical Problems
 Edit Distance
 aka Levenshtein Distance
 Sequence Alignment
 E.g., Gene Sequence Alignment
 Fundamentally the same thing!
 We’re focusing on gene sequence
alignment.
Edit Distance / Sequence
Alignment Problem
 Given:
 2 strings: 𝑥 and 𝑦; 𝑙𝑒𝑛𝑔𝑡ℎ(𝑥) = 𝑚; 𝑙𝑒𝑛𝑔𝑡ℎ(𝑦) = 𝑛
 Return:
 Smallest cost to transform string 𝒙 into string 𝒚 (or vice versa)
 Another perspective: smallest cost of aligning 𝒙 to 𝒚 (or vice versa)
Contrast the 2
perspectives.
 Alignment example (the ‘-’ is a “gap”):
x: ACGCT-C
y: A--CTGT
 Divide the alignment into pairs (columns). Each pair has a type and a cost:
 Match: cost = cmatch
 Insertion into x (= deletion from y), aka “indel”: cost = cindel
 Insertion into y (= deletion from x), also an “indel”: cost = cindel
 Substitution of a base of x for a base of y (or vice versa): cost = csub
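As a small illustration of scoring an alignment pair by pair, the sketch below classifies each column of the example alignment and sums the costs. The slides leave cmatch, cindel, and csub symbolic, so the numeric values here are assumptions (unit-cost edit distance), and the helper names are hypothetical.

```python
# Assumed illustrative costs; the slides keep these symbolic.
C_MATCH, C_INDEL, C_SUB = 0, 1, 1


def pair_type(a, b):
    """Classify one column of an alignment."""
    if a == "-" and b == "-":
        raise ValueError("a column cannot pair a gap with a gap")
    if a == "-" or b == "-":
        return "indel"
    return "match" if a == b else "sub"


def alignment_cost(x_row, y_row):
    """Sum the per-pair costs of an alignment given as two gapped rows."""
    if len(x_row) != len(y_row):
        raise ValueError("aligned rows must have equal length")
    cost_of = {"match": C_MATCH, "indel": C_INDEL, "sub": C_SUB}
    return sum(cost_of[pair_type(a, b)] for a, b in zip(x_row, y_row))


types = [pair_type(a, b) for a, b in zip("ACGCT-C", "A--CTGT")]
total = alignment_cost("ACGCT-C", "A--CTGT")
print(types)
print(total)
```

With these assumed costs the example alignment has three matches, three indels, and one substitution.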
How would you
solve this problem?
Solution Ideas
 Enumerate all and score
 Pro: Easy to code
 Pro: Optimal
 Con: exponential
 Greedy: work from left to right, gobbling up matches and inserting
gaps or allowing substitutions as necessary
 Pro: Easy
 Pro: Linear = fast / efficient
 Con: not optimal
 DP
 Pre-req: optimality property
 Pre-req: define addressable sub-problems
 Pre-req: determine relationship between problem and sub-problems
 Pro: Optimal
 Con: ?
 Divide and Conquer?
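To get a feel for why enumerating every alignment is expensive, the sketch below counts them. It assumes an alignment is a sequence of columns, each of which is a match/substitution, a gap in x, or a gap in y, and that alignments differing only in the order of adjacent gaps count as distinct. The function name `count_alignments` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of ways to align a string of length m with one of length n."""
    if m == 0 or n == 0:
        return 1  # only one option remains: pair every leftover base with a gap
    return (
        count_alignments(m - 1, n - 1)  # last column pairs two bases
        + count_alignments(m - 1, n)    # last column pairs a base of x with a gap
        + count_alignments(m, n - 1)    # last column pairs a gap with a base of y
    )


for size in (2, 5, 10, 20):
    print(size, count_alignments(size, size))
print(count_alignments(7, 5))  # lengths of x=ACGCTGA and y=ACTGT
```

The count itself is computed cheaply here because the recursion is memoized; it is the number of alignments a brute-force scorer would have to visit that grows so quickly.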
Designing the DP Algorithm for
Gene Sequence Alignment
DP?
 Define each sub-problem 𝑆(𝑖, 𝑗) to be the best score for
aligning
 the first 𝑖 bases of sequence 𝑥 with
 the first 𝑗 bases of sequence 𝑦
 Does that suffice as a minimal description?
 In those terms, what is our objective function?
 minimize 𝑆(𝑚, 𝑛), where 𝑚 = |𝑥|, 𝑛 = |𝑦|
 Can we divide this problem into sub-problems?
 How many?
 Hint: how many sub-problems are one step away from 𝑆(𝑖, 𝑗)?
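One standard way to follow up on the hint is the textbook edit-distance recurrence, in which three sub-problems are one step away from 𝑆(𝑖, 𝑗): 𝑆(𝑖−1, 𝑗−1), 𝑆(𝑖−1, 𝑗), and 𝑆(𝑖, 𝑗−1). The lecture defers its own development to Lecture #25, so the sketch below is an illustration under assumptions, not the course's stated solution: the cost values are assumed (unit-cost edit distance) and the function name `alignment_table` is hypothetical.

```python
# Assumed illustrative costs; the slides keep cmatch, cindel, csub symbolic.
C_MATCH, C_INDEL, C_SUB = 0, 1, 1


def alignment_table(x, y):
    """Fill S so that S[i][j] is the best cost of aligning x[:i] with y[:j]."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * C_INDEL  # first i bases of x against nothing: all gaps
    for j in range(1, n + 1):
        S[0][j] = j * C_INDEL  # first j bases of y against nothing: all gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = C_MATCH if x[i - 1] == y[j - 1] else C_SUB
            S[i][j] = min(
                S[i - 1][j - 1] + pair,  # pair x[i-1] with y[j-1]
                S[i - 1][j] + C_INDEL,   # pair x[i-1] with a gap
                S[i][j - 1] + C_INDEL,   # pair a gap with y[j-1]
            )
    return S


table = alignment_table("ACGCTGA", "ACTGT")
best = table[-1][-1]
print(best)
```

The table has (𝑚+1)(𝑛+1) cells and each cell looks at three neighbors, so this sketch runs in time proportional to 𝑚𝑛. Extracting the alignment itself would require a back-trace through the table, which is not shown here.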
Example: Sub-problems
x=ACGCTGA
y=ACTGT
 To be continued in Lecture #25
Assignment
 HW #16
 Read Section 6.3, if you haven’t done so
already.
 Thursday: Screencast & Quiz