h , g

The Population
Haplotyping problem
NOTATION: each SNP only two values in a population (bio).
Call them 0 and 1. Also, call * the fact that a site is heterozygous
HAPLOTYPE: string over 0, 1
GENOTYPE: string over 0, 1, *
where 0={0}, 1={1}, *={0,1}
01
00
10
11
1*
01
10
**
10
10 10
0*
11 11
11
10
00
*0
10
00 *0
ALGEBRA OF HAPLOTYPES:
0+
0=
--0
1+
1=
--1
Homozygous sites
10
11
1*
01
10
**
10
10
0+
1=
--*
1 +
0=
--*
Heterozygous (ambiguous) sites
01
00
0*
11
00
10
**
00
10
*0
Phasing the alleles
11101
10000
11100
10001
1**0*
11001
10100
11000
10101
For k heterozygous (ambiguous) sites, there are 2 k-1 possible phasings
THE PHASING (or HAPLOTYPING) PROBLEM
It is too expensive to determine haplotypes directly
Much cheaper to determine genotypes, and then infer haplotypes in silico:
Given genotypes of k individuals, determine the phasings
of all heterozygous sites.
This yields a set H, of (at most) 2k haplotypes. H is a resolution of G.
The input is GENOTYPE data
*1**1
11**1
11011
****1
00011
INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }
The input is GENOTYPE data
11011
11101
00011
11101
11011
01101
*1**1
11**1
****1
11011
11011
00011
00011
11011
00011
INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }
OUTPUT: H = { 11011, 11101, 00011, 01101}
Each genotype is resolved by two haplotypes
We will define some objectives for H
OBJECTIVES
-without objectives/constraints, the
haplotyping problem would be
(mathematically)trivial
E.g., always put 0 above and 1 below
**0*1  00001
11011
1*0**  10000
11011
-the objectives/constraints must be
“driven by biology”
OBJECTIVES
1°) Clark’s inference rule
2°) Perfect Phylogeny
3°) Disease Association
4°) (parsimony): minimize |H|
1st
Obj: Clark’s rule
Inference Rule
for a
compatible pair h , g
known haplotype h
1011001011 +
?????????? =
1**1001*1*
known (ambiguos) genotype g
Inference Rule
for a
compatible pair h , g
known haplotype h
new (derived) haplotype h’
1011001011 +
1101001110 =
1**1001*1*
known (ambiguos) genotype g
We write
h + h’ = g
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.
apply the rule to any such (h, g) obtaining h’
4.
set H = H + {h’} and G = G - {g}
5. end while
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.
apply the rule to any such (h, g) obtaining h’
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.
apply the rule to any such (h, g) obtaining h’
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
0000
1000
**00
11**
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.
apply the rule to any such (h, g) obtaining h’
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
0000
1000
**00
11**
1100
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.
apply the rule to any such (h, g) obtaining h’
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
0000
1000
**00
11**
1100
1111
SUCCESS
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.
apply the rule to any such (h, g) obtaining h’
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
0000
1000
**00
11**
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.
apply the rule to any such (h, g) obtaining h’
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
0000
1000
**00
11**
0100
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.
apply the rule to any such (h, g) obtaining h’
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
0000
1000
**00
11**
0100
FAILURE (can’t resolve 1122 )
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.
apply the rule to any such (h, g) obtaining h’
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic: the algorithm could end without explaining
all genotypes even if an explanation was possible.
The number of genotypes solved depends on order of application.
OBJ: find order of application rule that leaves the fewest elements in G
The problem was studied by Gusfield
(ISMB 2000, and Journal of Comp. Biol., 2001)
- problem is APX-hard
- it corresponds to finding largest forest in a graph with
haplotypes as nodes and arcs for possible derivations
-solved via ILP of exponential-size
(practical for small real instances)
2nd
Obj: Perfect Phylogeny
- Parsimony does not take into account
mutations/evolution of haplotypes
- parsimony is very relialable on “small”
haplotype blocks
- when haplotypes are large (span several
SNPs, we should consider evolutionionary
events and recombination)
- the cleanest model for evolution is the
perfect phylogeny
3rd objective is based on perfect phylogeny
- A phylogeny expalains set of binary features (e.g. flies, has fur…)
with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
3rd objective is based on perfect phylogeny
- A phylogeny expalains set of binary features (e.g. flies, has fur…)
with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
has 2 legs
has tail
flies
But…a new species may come along so that no
-Perfect
A phylogeny
expalainsisset
of binary features (e.g. flies, has fur…)
phylogeny
possible…
with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
has 2 legs
has tail
flies
flies
two legs
tail
Human
1
0
0
Mouse
0
1
0
Spider
0
0
0
Eagle
1
0
1
Theorem: such matrix has p.p. iff there is not a 00
10
01
11
4x2 minor
flies
two legs
tail
Human
1
0
0
Mouse
0
1
0
Spider
0
0
0
Eagle
1
0
1
Mickey mouse
1
1
0
Theorem: such matrix has p.p. iff there is not a 00
10
01
11
4x2 minor
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny
(Rationale : we assume haplotypes have evolved independently
along a tree)
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny
(Rationale : we assume haplotypes have evolved independently
along a tree)
01*0
*10*
*0*0
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny
(Rationale : we assume haplotypes have evolved independently
along a tree)
01*0
*10*
*0*0
0100
0110
1101
0100
1000
0010
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny
(Rationale : we assume haplotypes have evolved independently
along a tree)
01*0
*10*
*0*0
0100
0110
1101
0100
1000
0010
NOT a perfect phylogeny solution !
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny
(Rationale : we assume haplotypes have evolved independently
along a tree)
01*0
010*
000*
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny
(Rationale : we assume haplotypes have evolved independently
along a tree)
01*0
010*
000*
0100
0110
0100
1101
0000
0001
A perfect phylogeny
Theorem: The Perfect Phylogeny
Haplotyping problem is polynomial
Theorem: The Perfect Phylogeny
Haplotyping problem is polynomial
Algorithms are of combinatorial nature
- There is a graph for which SNPs are columns and edges are of two types
(forced and free)
- forced edges connect pairs of SNPs that must be phased in the same way
**  00 + 11
or
**  01 + 10
- a complex visit of the graph decides how to phase free SNPs
3rd
Obj: Disease Association
Some diseases may be due to a gene which has “faulty” configurations
RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased
one must have both copies faulty. With one copy one is a carrier of the disease
DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be
diseased it is enough to have one faulty copy
Two individuals of which one is healthy and the other diseased
may have the same genotype.
The explanation of the disease lies in a difference in their haplotypes
11**1
*1**1
0*011
0**01
11**1
00011
INPUT: GD = {11**1,*1**1,0*011},
GH = {11**1,0**01,00011}
11011
01101
11011
11101
11**1
01101
00001
0**01
01011
00011
*1**1
11001
11111
11**1
0*011
00011
00011
00011
INPUT: GD = {11**1,*1**1,0*011},
GH = {11**1,0**01,00011}
OUTPUT: H = { 11011,01011,00001,11111,11101,00011,01101}
H contains HD, s.t. each diseased has >=1 haplotype in HD and each healty none
Theorem 1 is proved via a reduction from 3 SAT
Theorem 2 has a mathematical proof (coloring argument) with
little relation to biology:
There is R (depending on input) s.t. a haplotype is healthy
if the sum of its bits is congruent to R modulo 3
This means the model must be refined!
4th
Obj: Max Parsimony
 See separate slides…
Summary:
- haplotyping in-silico needed for economical reasons
- several objectives, all biologically driven
- nice combinatorial problems
(mostly due to binary nature of SNPs)
- these problems are technology-dependant and may
become obsolete (hopefully after we have retired)
111
111
011
101
010
001
111
111
010
011
221
022
012
011
000
022
211
222
011
012
222
111
022
012
2nd Objective (parsimony) :
minimize |H|
1. The problem is APX-Hard
Reduction from VERTEX-COVER
B
A
C
D
E
A
B
A
C
D
E
B
C
D
E
*
A
B
A
C
D
E
AB
BC
AE
DE
AD
B
C
D
E
*
A
B
A
C
D
E
AB
BC
AE
DE
AD
A
B
C
D
E
B
C
D
E
*
B
A
C
D
E
AB
BC
AE
DE
AD
A
B
C
D
E
A
B
C
2
2
2
2
D
2
2
2
2
E
2
2
*
B
A
C
D
E
A
B
C
AB
BC
AE
DE
AD
2
2
2
2
A
B
C
D
E
0
D
2
2
2
2
E
2
2
0
0
0
0
*
B
A
C
D
E
A
B
AB
BC
AE
DE
AD
2
2
2
A
B
C
D
E
0
C
D
E
2
2
2
2
2
2
2
0
0
0
0
*
2
2
2
2
2
0
0
0
0
0
B
A
C
D
E
A
B
C
D
E
*
AB
BC
AE
DE
AD
2
1
2
1
2
2
2
1
1
1
1
2
1
1
1
1
1
1
2
2
1
1
2
2
1
2
2
2
2
2
A
B
C
D
E
0
1
1
1
1
1
0
1
1
1
1
1
0
1
1
1
1
1
0
1
1
1
1
1
0
0
0
0
0
0
B
A
C
D
E
A
B
C
D
E
*
AB
BC
AE
DE
AD
2
1
2
1
2
2
2
1
1
1
1
2
1
1
1
1
1
1
2
2
1
1
2
2
1
2
2
2
2
2
A
B
C
D
E
0
1
1
1
1
1
0
1
1
1
1
1
0
1
1
1
1
1
0
1
1
1
1
1
0
0
0
0
0
0
G = (V,E) has a node cover X of size k  there is a set H of |V | + k haplotypes that
explain all genotypes
B
A
C
D
E
A
B
C
D
E
*
AB
BC
AE
DE
AD
2
1
2
1
2
2
2
1
1
1
1
2
1
1
1
1
1
1
2
2
1
1
2
2
1
2
2
2
2
2
A
B
C
D
E
0
1
1
1
1
1
0
1
1
1
1
1
0
1
1
1
1
1
0
1
1
1
1
1
0
0
0
0
0
0
G = (V,E) has a node cover X of size k  there is a set H of |V | + k haplotypes that
explain all genotypes
B
A
C
D
E
A
B
C
D
E
*
AB
BC
AE
DE
AD
2
1
2
1
2
2
2
1
1
1
1
2
1
1
1
1
1
1
2
2
1
1
2
2
1
2
2
2
2
2
A
B
C
D
E
A’
B’
E’
0
1
1
1
1
0
1
1
1
0
1
1
1
1
0
1
1
1
0
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
0
1
1
0
0
0
0
0
0
1
1
1
G = (V,E) has a node cover X of size k  there is a set H of |V | + k haplotypes that
explain all genotypes
A basic ILP formulation
A basic ILP formulation
Expand your input G in all possible ways
220
120
022
A basic ILP formulation
Expand your input G in all possible ways
220
010 + 100, 000 + 110
120
100 + 110
022
000 + 011, 001 + 010
A basic ILP formulation
Expand your input G in all possible ways
220
120
010 + 100, 000 + 110
100 + 110
xh
xh
y
h1 , h2 


h1 , h2 
022
000 + 011, 001 + 010
The resulting Integer Program (IP1):
Other ILP formulation are possible. E.g. POLY-SIZE ILP formulations
The input is GENOTYPE data
11011
11101
00011
11101
11011
01101
*1**1
11**1
****1
11011
11011
00011
00011
11O11
OOO11
INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }
OUTPUT: H = { 11011, 11101, 00011, 01101}
Each genotype is explained by two haplotypes
We will define some objectives for H
1st Objective (open research problem):
minimize |H|
2nd Objective based on inference rule:
1st Objective (parsimony) :
minimize |H|
An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2
genotypes, hence, we need at least LB = SQRT(n) haplotypes.
BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a
solution of <= 2n haplotypes, i.e. <= SQRT(n) * LB
It’s difficult, but not impossible, to come up with better approximations, like constants
(Lancia, Pinotti, Rizzi ’02)
2nd Objective based on inference rule:
Inference Rule
known haplotype h
xoxxooxoxx +
********** =
x??xoox?x?
known (ambiguos) genotype g
Inference Rule
known haplotype h
new (derived) haplotype h’
known (ambiguos) genotype g
xoxxooxoxx +
xxoxooxxxo =
x??xoox?x?
Inference Rule
known haplotype h
new (derived) haplotype h’
xoxxooxoxx +
xxoxooxxxo =
x??xoox?x?
known (ambiguos) genotype g
We write
h + h’ = g
g and h must be compatible to derive h’
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes
2. while exists ambiguos genotype g in G
3.
take h in H compatible with g and let h + h’ = g
4.
set H = H + {h’} and G = G - {g}
5. end while
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes
2. while exists ambiguos genotype g in G
3.
take h in H compatible with g and let h + h’ = g
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes
2. while exists ambiguos genotype g in G
3.
take h in H compatible with g and let h + h’ = g
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
oooo
xooo
??oo
xx??
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes
2. while exists ambiguos genotype g in G
3.
take h in H compatible with g and let h + h’ = g
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
oooo
xooo
??oo
xx??
xxoo
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes
2. while exists ambiguos genotype g in G
3.
take h in H compatible with g and let h + h’ = g
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
oooo
xooo
??oo
xx??
xxoo
xxxx
SUCCESS
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes
2. while exists ambiguos genotype g in G
3.
take h in H compatible with g and let h + h’ = g
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
oooo
xooo
??oo
xx??
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes
2. while exists ambiguos genotype g in G
3.
take h in H compatible with g and let h + h’ = g
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
oooo
xooo
??oo
xx??
oxoo
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes
2. while exists ambiguos genotype g in G
3.
take h in H compatible with g and let h + h’ = g
4.
set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
oooo
xooo
??oo
xx??
oxoo
FAILURE (can’t resolve xx?? )
OBJ: find order of application rule that leaves the fewest elements in G
- Problem is APX-hard (Gusfield,00)
- Graph-Model + Integer Programming for practical solution (G.,01)
- Problem is APX-hard (Gusfield,00)
- Graph-Model + Integer Programming for practical solution (G.,01)
x??o?
1. expand
genotypes
- Problem is APX-hard (Gusfield,00)
- Graph-Model + Integer Programming for practical solution (G.,01)
xoooo
xooox
x??o?
xoxoo
xoxox
xxooo
1. expand
genotypes
xxoox
xxxoo
xxxox
- Problem is APX-hard (Gusfield,00)
- Graph-Model + Integer Programming for practical solution (G.,01)
xoooo
xooox
x??o?
xoxoo
xoxox
2. create (h, h’) if exists g s.t. h’ can be
derived from g and h
xxooo
1. expand
genotypes
xxoox
xxxoo
xxxox
3. Largest number of nodes in forest
rooted at unambiguos genotpes =
= largest number of ambiguous genotypes
resolved
Hence, find largest number of nodes in forest rooted at unambiguos
genotpes. Use I.P. model with vars x(ij).
This reduction is exponential. Is there a better practical approach?
3rd Objective (open research problem)
Disease Detection:
?x??x
xx??x
??oxx
????x
oooxx
INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }
3rd Objective (open research problem)
Disease Detection:
xxoxx
xxxox
oooxx
xxxox
xxoxx
oxxox
xx??x
?x??x
xxoxx
oooxx
????x
oooxx
oooxx
??oxx
oooxx
INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }
OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}
H contains H’, s.t. each diseased has one haplotype in H’ and each healty none
minimize | H|
Genome Rearrangements and
Evolutionary Distances
Each species has a genome (organized in pairs of chromosomes)
tcgtgatggat………………ttgatggattga
tcgattatggat………………ttttgatatcca
Genomes evolve by means of
•Insertions
•Deletions
•Inversions
•Transpositions
•Translocations
of DNA regions
deletion
deletion
insertion
deletion
translocation
insertion
insertion
deletion
translocation
inversion
insertion
deletion
transposition
translocation
inversion
Combinatorial problem: given 2 permutations P, Q and operators in a set F find a
shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))
Very difficult problem! We focus on operators all of the same type (e.g. inversions)
(…still difficult…)
Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions,
transpositions…)
We focus on inversions, that are the most important in Nature
Example:
5
1
6
2
4
3
8
8
3
4
2
6
1
5
9
9
7
7
1
1
1
2
2
2
3
3
3
8
6
6
4
5
5
5
4
4
6
8
8
9
9
7
7
7
9
1
1
2
2
3
3
4
4
5
5
6
6
8
7
7
8
9
9
Combinatorial problem: given 2 permutations P, Q and operators in a set F find a
shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))
Very difficult problem! We focus on operators all of the same type (e.g. inversions)
(…still difficult…)
Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions,
transposition…)
We focus on inversions, that are the most important in Nature
Example:
+5 +6 -4 -8 -3 -2 -1 -9 +7
+1 +2 +3 +8 +4 -6 -5 -9 +7
+1 +2 +3 +8 +4 +5 +6 -9 +7
+1 +2 +3 -6 -5 -4 -8 -9 +7
+1 +2 +3 -6 -5 -4 -8 -7 +9
+1 +2 +3 +4 +5 +6 -8 -7 +9
+1 +2 +3 +4 +5 +6 +7 +8 +9
There is also a SIGNED VERSION of the problem !
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)
Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and
Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)
Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and
Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
The concept of breakpoint
Breakpoint at position i if | p(i) - p(i+1) | > 1
0
5
7
8
2
1
4
3
6
9
10
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)
Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and
Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
The concept of breakpoint
Breakpoint at position i if | p(i) - p(i+1) | > 1
0
5
7
8
2
1
4
3
6
9
10
d(p) = inversion distance
b(p) = # breakpoints
TRIVIAL BOUND: d(p) >= b(p) / 2
Example: d(p) >= 6 / 2 = 3
The Breakpoint Graph
0
5
7
8
2
1
4
3
6
9
10
The Breakpoint Graph
0
5
7
8
2
1
4
3
6
9
10
The Breakpoint Graph
0
5
7
8
2
1
4
3
6
9
10
The Breakpoint Graph
0
5
7
8
2
1
4
3
6
9
10
The Breakpoint Graph
0
5
7
8
2
1
4
Each node has degree...
10
0
4
6
2
or 4
…
hence the graph can be decomposed in cycles!
3
6
9
10
The Breakpoint Graph
0
5
7
8
2
1
4
3
Alternating cycle decomposition
6
9
10
The Breakpoint Graph
0
5
7
8
2
1
4
3
Alternating cycle decomposition
6
9
10
The Breakpoint Graph
0
5
7
8
2
1
4
3
6
9
Alternating cycle decomposition
c(p) = max # cycles in alternating decomposition
VERY STRONG BOUND : d (p) >= b(p) - c(p)
Example: c(p)= 2 and d (p) >= 6 - 2 = 4
10
The Breakpoint Graph
0
5
7
8
2
1
4
3
6
9
10
The best algorithm for this problem is based on an Integer Programming
formulation of the max cycle decomposition
A variable xC for each cycle (exponential # of vars…)
A constraint S C containing e xC = 1 for each edge e
Objective: maximize
SC xC
PRIMAL
max S xC
S
C
xC
=1
for all edges e
C\ni e
xC \in {0,1}
for all alt. cycles C
DUAL
min S ye
S
e
ye
<= 1
for all alt. Cycles C
e\in C
ye \in R
for all edges e
PRIMAL
max S xC
S
C
xC
=1
for all edges e
C\ni e
xC \in {0,1}
for all alt. cycles C
DUAL
min S ye
S
e
ye
<= 1
for all alt. Cycles C
e\in C
ye \in R
for all edges e
0
5
7
8
2
1
4
3
6
Pricing out the cycles for which y*(C) < 1
9
10
0
5
7
8
2
1
4
3
6
9
10
0
5
7
8
2
1
4
3
6
9
10
Split the graph in two copies
0
5
7
8
2
1
4
3
6
9
10
0
5
7
8
2
1
4
3
6
9
10
Connect twins
0
5
7
8
2
1
4
3
6
9
10
0
5
7
8
2
1
4
3
6
9
10
A perfect matching corresponds to (a set of) alternating cycles
0
5
7
8
2
1
4
3
6
9
10
0
5
7
8
2
1
4
3
6
9
10
A perfect matching corresponds to (a set of) alternating cycles
0
5
7
8
2
1
4
3
6
9
10
0
5
7
8
2
1
4
3
6
9
10
A perfect matching corresponds to (a set of) alternating cycles
0
5
7
8
2
1
4
3
6
9
10
0
5
7
8
2
1
4
3
6
9
10
A perfect matching corresponds to (a set of) alternating cycles
0
5
7
8
2
1
4
3
6
9
10
0
5
7
8
2
1
4
3
6
9
10
A perfect matching corresponds to (a set of) alternating cycles
.6
1
.4
0
5
7
8
2
1
4
3
6
9
10
0
5
7
8
2
1
4
3
6
9
10
0
.2
.5
The weight of the matching is the y*-weight of the cycles
.6
1
.4
0
5
7
8
2
1
4
3
6
9
10
0
5
7
8
2
1
4
3
6
9
10
100000
.2
Forcing a cycle to use a certain node
.5
- These cycles would not use the same node twice, but with simple trick
is possible to model (OMISSIS)
BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)
BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)
KS can solve at most n=40. Take days for n=50
CLN can solve for n=200. Takes few seconds (say 5) for n=100
NP-hard problem practically solved to optimality!
Statistical view of evolution
• Genome evolve by random inversions
• It’s like a random walk on a huge graph with an edge for each
permutation an edge for each inversion
• It is not clear why the shortest solution should be the one
followed by Nature (in fact, often it isn’t)
• We want to find the most likely number of inversions that
lead from (1 2 … n ) to p
• We use the expected number of breakpoints after k
inversions as a way to guess the # of inversions
Given a p obtained by h random inversions from (1 … n ) we want to estimate h
The inversion distance is only a lower bound: h >= d(p) but the gap could be big
Let B(k) be the (r.v.) number of breakpoint after k random inversions from (1..n)
We estimate E[B(k)]. Then, faced with some p, we pick h such that E[B(h)] is as
close as possible to b(p) (maximum likelihood). CL ,2000, have shown:
E[B(k)] = ( n - 1 )
(1-(
n-3
n-1
k
)
)
Question: estimate E[D(k)], the (r.v.) inversion distance after k random inversions
Example: n = 200, k (u.a.r. in 1…n) inversions
k
8
19
68
69
73
85
86
87
118
184
k’
d(p)
b
8
19
67
73
79
91
85
90
117
184
8
19
67
68
73
83
83
84
109
135
16
34
98
104
109
120
115
119
138
168
Protein Structure Alignments: the
Maximum Contact Map Overlap
Problem
A Protein is a complex molecule with a primary,
linear structure (a sequence of aminoacids) and a
3-Dimensional structure (the protein fold).
Protein STRUCTURE determines its FUNCTION
For instance, the Drug Design problem
calls for constructing peptides with a 3D
shape complementary to a protein, so as
to dock onto it.
Problem:
Align two 3D protein structures
Motivation:
Structure Alignment is Important for:
- Discovery of Protein Function (shape determines function)
- Search in 3D data bases
- Protein Classification and Evolutionary Studies
- ...
Contact Maps
CONTACT MAPS
Unfolded protein
CONTACT MAPS
Unfolded protein
Folded protein = contacts
CONTACT MAPS
Unfolded protein
Folded protein = contacts
Contact map = graph
CONTACT MAPS
Unfolded protein
Folded protein = contacts
Contact map = graph
OBJECTIVE: align 3d folds of proteins
= align contact maps
Contact Map Alignments
Non-crossing Alignments
Protein 1
Protein 2
non-crossing map of residues in protein 1 and protein 2
The value of an alignment
The value of an alignment
The value of an alignment
The value of an alignment
Value = 3
The value of an alignment
Value = 3
We want to maximize the value
The value of an alignment
NP-Hard
Integer Programming
Formulation
Integer Programming
Formulation
e
0-1 VARIABLES
yef
yef for e and f contacts
f
Integer Programming
Formulation
e
0-1 VARIABLES
yef
yef for e and f contacts
e
e’
f
CONSTRAINTS
yef + ye’f’ <= 1
f
f’
Integer Programming
Formulation
e
0-1 VARIABLES
yef
yef for e and f contacts
e
e’
f
CONSTRAINTS
yef + ye’f’ <= 1
f
f’
OBJECTIVE
max SeSf yef
Independent Set Problem
It’s just a huge max independent set problem in Gy:
• a node for each sharing
• an edge for each pair of incompatible sharings
e’’
e’
f’
e
e’
e
f
e’’
f’’
f’’
f
f’
Independent Set Problem
It’s just a huge max independent set problem in Gy:
• a node for each sharing
• an edge for each pair of incompatible sharings
e’’
e’
f’
e
e’
e
f
e’’
f’’
f’’
f
f’
|Gy|=|E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each)
The best exact algorithm for independent set can solve for at most a few hundred
nodes
Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
e
NEW VARIABLES
xij for i and j residues
i
xij
j
yef
f
Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
e
NEW VARIABLES
xij for i and j residues
NEW CONSTRAINTS
i’
i
j’
j
xij + xi’j’ <= 1
i
xij
j
yef
f
Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
e
NEW VARIABLES
xij for i and j residues
i
yef
xij
j
f
NEW CONSTRAINTS
i’
i
j’
p
i
j
q
j
xij + xi’j’ <= 1
y(ip)(jq) <= xij and y(ip)(jq) <= xpq
Clique Constraints
Variables x define a graph Gx:
• A node for each line
• An edge between each pair of crossing lines
i’
i
i
j
i’
j’
j’
j
Clique Constraints
Variables x define a graph Gx:
• A node for each line
• An edge between each pair of crossing lines
i’
i
i
j
i’
j’
j’
j
• Gx is much smaller than Gy
• Gx has nice proprieties (it’s a perfect graph)
• It’s easier to find large independent sets in Gx
Clique Constraints
Non-crossing constraints can be extended to
CLIQUE CONSTRAINTS
S
xij <= 1
[i,j] in M
For all sets M of mutually incompatible (i.e. crossing) lines
All clique constraints satisfied (and Gx perfect) imply a strong bound!
Structure of Maximal cliques in
Gx
1. Pick two subsets of same size
Structure of Maximal cliques in
Gx
2. Connect them in a zig-zag fashion
Structure of Maximal cliques in
Gx
Structure of Maximal cliques in
Gx
Structure of Maximal cliques in
Gx
Structure of Maximal cliques in
Gx
Structure of Maximal cliques in
Gx
Structure of Maximal cliques in
Gx
Structure of Maximal cliques in
Gx
Structure of Maximal cliques in
Gx
Structure of Maximal cliques in
Gx
Structure of Maximal cliques in
Gx
3. Throw in all lines included in a zig or a zag
Structure of Maximal cliques in
Gx
3. Throw in all lines included in a zig or a zag
Structure of Maximal cliques in
Gx
The result is a maximal clique in Gx
Separation of Clique Inequalities
Separation of Clique Inequalities
PROBLEM
There exist exponentially many such cliques (O(22n) inequalities).
We need to generate in polynomial time a clique inequality when needed,
i.e., when violated by the current LP solution x*
S
x*ij > 1
[i,j] in M
THEOREM
We can find the most violated clique inequality in time O(n2)
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path
1
2
3
4 5 6
7
8
Separation of Clique Inequalities
PROOF (sketch)
2) Flip one graph: zigzag
1) Clique = zigzag path
1
2
3
4 5 6
7
8
8
7
6
5 4
leftright
3
2
1
Separation of Clique Inequalities
PROOF (sketch)
2) Flip one graph: zigzag
1) Clique = zigzag path
1
2
3
4 5 6
7
8
8
7
6
5 4
leftright
3
2
1
3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use Dyn. Progr.
to find longest path in grid, time O(n^2)
Separation of cliques
1
2
i
1
2
u
n2
Create n1 x n2 grid
Orient all edges and give weights
n1
Separation of cliques
1
2
n1
i
1
2
u
x*iu
x*iu
n2
Create n1 x n2 grid
Orient all edges and give weights
Separation of cliques
B=(n1,1)
A=(1,n2)
Create n1 x n2 grid
Orient all edges and give weights
There is violated clique iff longest A,B path has length > 1
Gx is a Perfect Graph
We show why polynomial separation is possible:
Gx is weakly triangulated (no chordless cycles >= 5 in Gx or Gx)
=> Gx is perfect (Hayward, 1985)
Gx is a Perfect Graph
PROOF (Sketch, for Gx)
L1 and L3 don’t cross. Wlog RIGHT(L3, L1)
L1
L7
L2
L6
L3
L5
L4
Gx is a Perfect Graph
L1 and L3 don’t cross. Wlog RIGHT(L3, L1)
L1
L7
L2
L6
L3
L5
L1
L4
L3
Gx is a Perfect Graph
For i=4,5,… Li crosses Li-1 but not L1
=> RIGHT (Li, L1)
L1
L7
L2
L6
L3
L5
L1
L4
L3
Gx is a Perfect Graph
For i=4,5,… Li crosses Li-1 but not L1
=> RIGHT (Li, L1)
L1
L7
L2
L6
L3
L4
L5
L1
L4
L3
Gx is a Perfect Graph
For i=4,5,… Li crosses Li-1 but not L1
=> RIGHT (Li, L1)
L1
L7
L2
L6
L3
L4
L5
L1
L4
L5
Gx is a Perfect Graph
For i=4,5,… Li crosses Li-1 but not L1
=> RIGHT (Li, L1)
L1
L7
L2
L6
L3
L5
L1
L4
L6 L5
Gx is a Perfect Graph
We get LEFT(L1, {L3, L4, L5, L6})
L1
L7
L2
L6
L3
L5
L1
L6
L4
L3, L4, L5 L6
Gx is a Perfect Graph
A symmetric argument started at
L6, with LEFT(L1, L6) implies
LEFT(Li, L6) for i=2,3,4,5
L1
L7
L2
L6
L3
L5
L1
L6
L4
L3, L4, L5 L6
Gx is a Perfect Graph
A symmetric argument started at
L6, with LEFT(L1, L6) implies
LEFT(Li, L6) for i=2,3,4,5
L1
L7
L2
L6
L3
L5
L1
L6
L4
L2, L3, L4 L5
L3, L4, L5 L6
Gx is a Perfect Graph
Then {L3, L4, L5} are between L1 and L6
L1
L7
L2
L6
L3
L5
L1
L6
L4
L2, L3, L4 L5
L3, L4, L5 L6
Gx is a Perfect Graph
Then {L3, L4, L5} are between L1 and L6
But L7 crosses L1 and L6, and
so should cross them all !
L1
L7
L2
L6
L3
L7
L5
L1
L6
L4
L2, L3, L4 L5
L3, L4, L5 L6
The approach just seen is due to Lancia, Carr, Istrail, Walenz (2001)
It can be applied to small or moderate proteins (up to 80 residues/150 contacts)
In 2002, a new approach, by Caprara and Lancia, based on LAGRANGIAN
RELAXATION. Approach borrowed from Quadratic Assignment. With new
approach we can solve important proteins (up to 150 residues/300 contacts)
What about Heuristics?
E.g., genetic algorithms…
Genetic Algorithm Overview
• A Population of candidate solutions that
evolve (improve) over time
Population
at time t
Recombination
operators
Evaluation
function
Population
at time t+1
• Recombination creates new candidate solutions via
crossover and mutation
Crossover
• Crossover selects pieces from both parents and creates two
offspring solutions
Offspring
Blue Parent
Red Parent
Crossover
• Crossover selects pieces from both parents and creates two
offspring solutions
– Select a set of edges in one parent to copy to the child
Crossover
• Crossover selects pieces from both parents and creates two
offspring solutions
– Select a set of edges in one parent to copy to the child
Crossover
• Crossover selects pieces from both parents and creates two
offspring solutions
– Select a set of edges in one parent to copy to the child
– Copy as many edges as possible from the other parent
Crossover
• Crossover selects pieces from both parents and creates two
offspring solutions
– Select a set of edges in one parent to copy to the child
– Copy as many edges as possible from the other parent
These edges conflict with existing
edges and are not copied
Crossover
• Crossover selects pieces from both parents and creates two
offspring solutions
– Select a set of edges in one parent to copy to the child
– Copy as many edges as possible from the other parent
– Add random edges to fill any remaining space
Crossover
• Crossover selects pieces from both parents and creates two
offspring solutions
– Select a set of edges in one parent to copy to the child
– Copy as many edges as possible from the other parent
– Add random edges to fill any remaining space
Mutation
• Mutation introduces small changes to existing solutions by
shifting edge endpoints
Mutation
• Mutation introduces small changes to existing solutions by
shifting edge endpoints
– Select a set of endpoints to shift
Mutation
• Mutation introduces small changes to existing solutions by
shifting edge endpoints
– Select a set of endpoints to shift
Mutation
• Mutation introduces small changes to existing solutions by
shifting edge endpoints
– Select a set of endpoints to shift
This edge “fell off” the
end of the contact map
and is removed
Mutation
• Mutation introduces small changes to existing solutions by
shifting edge endpoints
– Select a set of endpoints to shift
– Randomly add new edges
Mutation
• Mutation introduces small changes to existing solutions by
shifting edge endpoints
– Select a set of endpoints to shift
– Randomly add new edges
Computational Results
Computational Results
• 269 proteins
– 70 -100 residues
– 80 to 140 contacts
•
•
•
•
Picked 10,000 pairs of proteins out of 36046 possible
Took a weekend on PC
500 were solved to optimality
2500 had a gap <= 10 contacts
Skolnick Clustering Test
Skolnick Results
• Four Families
1
2
3
4
•
•
•
•
•
Flavodoxin-like fold Che-Y related
Plastocyanin
TIM Barrel
Ferratin
alpha-beta
8 structures
up to 124 residues
15-30% sequence similarity
< 3Å RMSD
Skolnick Results
• Four Families
1
2
3
4
•
•
•
•
•
Flavodoxin-like fold Che-Y related
Plastocyanin
TIM Barrel
Ferratin
beta
8 structures
up to 99 residues
35-90% sequence similarity
< 2Å RMSD
Skolnick Results
• Four Families
1
2
3
4
•
•
•
•
•
Flavodoxin-like fold Che-Y related
Plastocyanin
TIM Barrel
Ferratin
alpha-beta
11 structures
up to 250 residues
30-90% sequence similarity
< 2Å RMSD
Skolnick Results
• Four Families
1
2
3
4
•
•
•
•
•
Flavodoxin-like fold Che-Y related
Plastocyanin
TIM Barrel
Ferratin
alpha
6 structures
up to 170 residues
7-70% sequence similarity
< 4Å RMSD
Skolnick Results
• Four Families
1
2
3
4
Flavodoxin-like fold Che-Y related
Plastocyanin
TIM Barrel
Ferratin
Family
1
Style
alpha-beta
Residues
124
Seq. Sim.
15-30%
RMSD
< 3A
2
beta
99
35-90%
< 2A
3
alpha-beta
250
30-90%
< 2A
170
7-70%
< 4A
4
Proteins
1b00, 1dbw, 1nat, 1ntr,
1qmp, 1rnl, 3cah, 4tmy
1baw, 1byo, 1kdi, 1nin,
1pla, 3b3i, 2pcy, 2plt
1amk, 1aw2, 1b9b, 1btm,
1hti, 1tmh, 1tre, 1tri,
1ydv, 3ypi, 8tim
1b71, 1bcf, 1dps, 1fha,
1ier, 1rcd
Clustering
Define score(P1, P2) as
# shared contacts
0 <=
<= 1
Min # of contacts of P1,P2
Put P1, P2 in same family if score(P1, P2) >= threshold
Clustering
Define score(P1, P2) as
# shared contacts
0 <=
<= 1
Min # of contacts of P1,P2
Put P1, P2 in same family if score(P1, P2) >= threshold
If P1, P2 too big, use G.A. and local search to compute score
L.P. gives then bounds:
HEUR score <= OPT score <= LP bound
and we know how far off OPT we are
Clustering validation
We got some known families from biologists, PDB.
Experiment: Take a family F of proteins and align them against each
other and against the remaining.
Clustering validation
We got some known families from biologists, PDB.
Experiment: Take a family F of proteins and align them against each
other and against the remaining.
TYPICAL BEHAVIOUR
score
proteins were…
0.05
0.1
0.15
0.2
0.25
0.3
0.35
……
1.0
MISMATCH
MISMATCH
MISMATCH
MISMATCH
MISMATCH
MISMATCH
MATCH
……
MATCH
Skolnick Results
• Performance
– 528 alignments
– 1.3% false negative
– 0.0% false positive
Clustering
Computed, for 1st time, provably optimal alignments for 150 pairs
(inter-family)
Used the CMO value to cluster: retrieves the clusters.
Set S(i,j) = 1 if CMO >= a, S(i,j) = 0 otherwise
Use TSP to find a block diagonal structure for S
Clustering
Last Open Problem
?
?