Computational Molecular Biology
Lecture Nine: Average length of a longest common subsequence
Semester I, 2009-10
Graham Ellis
NUI Galway, Ireland
General Computational Problem
Let
X = {V1, V2, ..., Vm}
and
Y = {W1, W2, ..., Wn}
be two sets of words. We are interested in calculating the average
length of a longest common subsequence between a word in X and
a word in Y.
For example, X might be the set of genes in a mouse’s genome, Y
might be the set of genes in a rat’s genome, and we might be
interested in quantifying the similarity between mice and rats.
Illustration: toy example
Let
X = {CATG , CTGA, AGTC }
and
Y = {CTAG , GTAC }.
What is the average length of a longest common subsequence
between a word v in X and a word w in Y ?
Toy example (cont.)
v       w       Length of LCS
CATG    CTAG    3
CATG    GTAC    1
CTGA    CTAG    3
CTGA    GTAC    2
AGTC    CTAG    2
AGTC    GTAC    3
For instance, GTC is a longest common subsequence of AGTC and GTAC.
Average length of a longest common subsequence is
\[ \frac{3+1+3+2+2+3}{6} = \frac{14}{6} = \frac{7}{3}. \]
Toy example (cont.)
In bigger examples it is useful to compute the average length by
first computing
Prob(Length LCS = k)
for each possible length k.
In the example:
Prob(Length LCS = 4) = 0
Prob(Length LCS = 3) = 1/2
Prob(Length LCS = 2) = 1/3
Prob(Length LCS = 1) = 1/6
Prob(Length LCS = 0) = 0
So the average length of a longest common subsequence is
\[ 4 \times 0 + 3 \times \frac{1}{2} + 2 \times \frac{1}{3} + 1 \times \frac{1}{6} + 0 \times 0 = \frac{7}{3}. \]
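The toy calculation can be checked mechanically. The following is a minimal sketch, assuming the standard dynamic-programming recurrence for the LCS length (the slides do not prescribe an algorithm at this point). The function names lcs_length, lcs_distribution and average_lcs_length are illustrative and not part of the lecture.

```python
from collections import Counter
from fractions import Fraction


def lcs_length(v, w):
    """Length of a longest common subsequence of v and w (standard DP)."""
    # table[i][j] = LCS length of the prefixes v[:i] and w[:j]
    table = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i, a in enumerate(v, start=1):
        for j, b in enumerate(w, start=1):
            if a == b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(v)][len(w)]


def lcs_distribution(X, Y):
    """Map each length k to Prob(Length LCS = k) over all pairs (v, w)."""
    counts = Counter(lcs_length(v, w) for v in X for w in Y)
    total = len(X) * len(Y)
    return {k: Fraction(c, total) for k, c in sorted(counts.items())}


def average_lcs_length(X, Y):
    """Average LCS length, computed from the distribution."""
    return sum(k * p for k, p in lcs_distribution(X, Y).items())


X = ["CATG", "CTGA", "AGTC"]
Y = ["CTAG", "GTAC"]
print(lcs_distribution(X, Y))
print(average_lcs_length(X, Y))
```

Exact fractions are used rather than floating point so that the output can be compared directly with a hand calculation.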
A harder toy example
Let
X =set of all 24 permutations of ACGT,
Y =set of all 24 permutations of ACGT.
What is the average length of a longest common subsequence
between a word v in X and a word w in Y ?
We’ll first compute
Prob(LCS(v,w) = k)
for each possible k.
We’ll handle permutations of 1234 instead of permutations of
ACGT and, for simplicity, assume v = 1234. Relabelling the letters
does not change any LCS length, so the probabilities are the same
for every fixed v. With v = 1234, a longest common subsequence of
v and w is a longest increasing subsequence of w.
Harder example (cont.)
w     LCS    w     LCS    w     LCS    w     LCS
1234  4      2134  3      3124  3      4123  3
1324  3      2314  3      3214  2      4213  2
1342  3      2341  3      3241  2      4231  2
1432  2      2431  2      3421  2      4321  1
1423  3      2413  2      3412  2      4312  2
1243  3      2143  2      3142  2      4132  2
Prob(LCS(v,w) = 1) = 1/24
Prob(LCS(v,w) = 2) = 13/24
Prob(LCS(v,w) = 3) = 9/24
Prob(LCS(v,w) = 4) = 1/24
Average Length =
\[ \frac{1 \times 1 + 2 \times 13 + 3 \times 9 + 4 \times 1}{24} = \frac{29}{12}. \]
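The table for permutations of 1234 can be reproduced by brute force. With v = 1234, the LCS length of v and w equals the length of a longest increasing subsequence of w. The sketch below computes that length with the patience-sorting technique, which is a choice made here and not a method taken from the slides. The names lis_length and lis_counts are illustrative.

```python
from bisect import bisect_left
from collections import Counter
from fractions import Fraction
from itertools import permutations


def lis_length(w):
    """Length of a longest strictly increasing subsequence of w."""
    # tails[i] = smallest possible last element of an increasing
    # subsequence of length i + 1 found so far (patience sorting).
    tails = []
    for x in w:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def lis_counts(n):
    """Count the permutations of 1..n by longest-increasing-subsequence length."""
    return Counter(lis_length(w) for w in permutations(range(1, n + 1)))


counts = lis_counts(4)
total = sum(counts.values())
average = sum(Fraction(k * c, total) for k, c in counts.items())
print(sorted(counts.items()))
print(average)
```

Enumerating all n! permutations is only practical for small n, which is why the next slides turn to a formula.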
Even harder toy example
Let
X =set of all 5040 permutations of 1234567,
Y =set of all 5040 permutations of 1234567.
What is the average length of a longest common subsequence
between a word v in X and a word w in Y ?
For this kind of example we need more mathematics!
A Theorem
Let L_n be the length of a longest increasing subsequence of a
random permutation of the numbers 1 2 3 · · · n. Then
\[ \mathrm{Prob}(L_n = k) = \frac{1}{n!} \sum_{\lambda \vdash n,\ |\lambda| = k} (d_\lambda)^2 \]
where
◮ d_λ = number of standard Young tableaux of shape λ
◮ λ ranges over all partitions of n into k parts (here |λ| denotes the number of parts of λ).
Illustration of the theorem
Let’s calculate Prob(L4 = 2).
Need to consider the partitions (2, 2) ⊢ 4 and (3, 1) ⊢ 4.
For λ = (2, 2) there are d_λ = 2 standard Young tableaux.
1 2      1 3
3 4      2 4
For λ = (3, 1) there are d_λ = 3 standard Young tableaux.
1 2 3    1 2 4    1 3 4
4        3        2
Illustration of the theorem (cont.)
\[ \mathrm{Prob}(L_4 = 2) = \frac{1}{4!} \{ 2^2 + 3^2 \} = \frac{13}{24} \]
This agrees with our earlier calculation of this probability.
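The tableau counts used in this illustration can be checked without listing the tableaux by hand. The following is a minimal sketch, assuming the standard observation that the largest entry of a standard Young tableau must occupy a corner cell, so removing it leaves a standard tableau of a smaller shape. The name count_syt is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_syt(shape):
    """Number of standard Young tableaux of the given shape.

    shape is a tuple of weakly decreasing positive row lengths.
    """
    if not shape:
        return 1  # the empty tableau
    total = 0
    for i, row in enumerate(shape):
        # Row i ends in a corner cell if the next row is strictly shorter.
        next_row = shape[i + 1] if i + 1 < len(shape) else 0
        if row > next_row:
            smaller = shape[:i] + (row - 1,) + shape[i + 1:]
            # Drop an emptied last row so shapes stay normalised.
            smaller = tuple(r for r in smaller if r > 0)
            total += count_syt(smaller)
    return total


print(count_syt((2, 2)), count_syt((3, 1)))
```

For the shapes (2, 2) and (3, 1) this reproduces the counts d_λ = 2 and d_λ = 3 used in the calculation.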
A Proposition
To apply the theorem for, say, n = 7 we need a method for
calculating dλ .
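Proposition (hook length formula):
\[ d_\lambda = \frac{n!}{\prod_{c} h_c} \]
where the product runs over all cells c of the Young diagram of λ and
h_c = 1 + (number of cells to the right of c in its row)
+ (number of cells below c in its column)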
For example, for λ = (3, 1) we have
\[ d_\lambda = \frac{4!}{4 \times 2 \times 1 \times 1} = 3. \]
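The proposition and the theorem combine into a short computation. The sketch below is one possible way to organise it, not a prescribed implementation: it enumerates the partitions of n, evaluates d_λ from the hook lengths, and sums the squares of d_λ over partitions with k parts. The helper names partitions, hook_lengths, num_syt and lis_distribution are illustrative.

```python
from fractions import Fraction
from math import factorial


def partitions(n, largest=None):
    """Yield all partitions of n as weakly decreasing tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def hook_lengths(shape):
    """Hook length h_c of every cell c, read row by row."""
    hooks = []
    for i, row in enumerate(shape):
        for j in range(row):
            right = row - j - 1  # cells to the right of c in its row
            below = sum(1 for r in shape[i + 1:] if r > j)  # cells below c
            hooks.append(1 + right + below)
    return hooks


def num_syt(shape):
    """d_lambda computed as n! divided by the product of the hook lengths."""
    product = 1
    for h in hook_lengths(shape):
        product *= h
    return factorial(sum(shape)) // product


def lis_distribution(n):
    """Prob(L_n = k) for k = 1..n, using partitions of n into k parts."""
    counts = {k: 0 for k in range(1, n + 1)}
    for shape in partitions(n):
        counts[len(shape)] += num_syt(shape) ** 2
    return {k: Fraction(c, factorial(n)) for k, c in counts.items()}


dist = lis_distribution(4)
print(dist)
print(sum(k * p for k, p in dist.items()))
```

For n = 4 the output can be compared with the probabilities found by enumeration earlier. Calling lis_distribution(7) applies the same computation to permutations of 1234567.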
Lengthy exercise
Let
X =set of all 5040 permutations of 1234567,
Y =set of all 5040 permutations of 1234567.
What is the average length of a longest common subsequence
between a word v in X and a word w in Y ?