Language Approach

Word and clump statistics
and an application to DNA
evolution
P. Nicodème
CNRS - Team Calin - LIPN - University Paris13
00100
00100
111100
00100
111100
10111111100
00100
111100
10111111100
01100
00100
111100
10111111100
01100
0100
00100
111100
10111111100
01100
0100
11100
00100
111100
10111111100
01100
0100
11100
11100
00100
111100
10111111100
01100
0100
11100
11100
010100
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
110110110011011010100101100110000011010111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
110110110011011010100101100110000011010111
0111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
110110110011011010100101100110000011010111
0111
111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
110110110011011010100101100110000011010111
0111
111
001100010111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
110110110011011010100101100110000011010111
0111
111
001100010111
0100010010010100000100111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
110110110011011010100101100110000011010111
0111
111
001100010111
0100010010010100000100111
00111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
110110110011011010100101100110000011010111
0111
111
001100010111
0100010010010100000100111
00111
000101100101010000111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
110110110011011010100101100110000011010111
0111
111
001100010111
0100010010010100000100111
00111
000101100101010000111
101010101000101011001010100000010000110111
00100
111100
10111111100
01100
0100
11100
11100
010100
0110100
00100
10111111010111111100
00010100
0101100
100010100111
1100010100111
1000010111
0011000011000000111
010101010101000111
110110110011011010100101100110000011010111
0111
111
001100010111
0100010010010100000100111
00111
000101100101010000111
101010101000101011001010100000010000110111
1
1
1000011100
1
2
1
2
1000011100
0001101000010110
1
2
3
1
2
3
1000011100
0001101000010110
1001110101
1
2
3
4
1
2
3
1000011100
0001101000010110
1001110101
111011110
1
2
3
4
5
1
2
3
1000011100
0001101000010110
1001110101
111011110
111011111
1
2
3
4
5
6
1
2
3
4
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
1
2
3
4
5
6
7
1
2
3
4
5
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1
2
3
4
5
6
7
8
1
2
3
4
5
6
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
1
2
3
4
5
6
7
8
9
10
11
1
2
3
4
5
6
7
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
1
2
3
4
5
6
7
8
9
10
11
12
1
2
3
4
5
6
7
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
1
2
3
4
5
6
7
8
9
10
11
12
13
1
2
3
4
5
6
7
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
1
2
3
4
5
6
7
8
9
10
11
12
13
14
1
2
3
4
5
6
7
8
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1
2
3
4
5
6
7
8
9
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
1
2
3
4
5
6
7
8
9
10
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
1
2
3
4
5
6
7
8
9
10
11
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
1
2
3
4
5
6
7
8
9
10
11
12
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
1
2
3
4
5
6
7
8
9
10
11
12
13
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
1
2
3
4
5
6
7
8
9
10
11
12
13
14
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
0111010111
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
0111010111
0001001010110
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
0111010111
0001001010110
111110111
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
0111010111
0001001010110
111110111
0110111000011
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
0111010111
0001001010110
111110111
0110111000011
1000000000
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
0111010111
0001001010110
111110111
0110111000011
1000000000
00111111011
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
0111010111
0001001010110
111110111
0110111000011
1000000000
00111111011
11000011001
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
0111010111
0001001010110
111110111
0110111000011
1000000000
00111111011
11000011001
0111001000
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
1000011100
0001101000010110
1001110101
111011110
111011111
001000101010
11001011011
1001100010
10111010011
0011000001010
0111110010
11010111100010
111010011
101001011000
001000011011
001000011110
000011000111111
1001010011
00010101001110001
111101011
10101001001000
101001111100
0111010111
0001001010110
111110111
0110111000011
1000000000
00111111011
11000011001
0111001000
Waiting time (1000 random texts)
100
100
80
80
µ̂ = 7.98
σ̂ = 4.74
60
40
40
20
20
0
µ̂ = 13.67
σ̂ = 11.00
60
20
40
100
60
80
0
20
40
111
60
80
Waiting time (1000 random texts)
µ=8
σ = 4.90
100
100
80
80
µ̂ = 7.98
σ̂ = 4.74
60
40
µ̂ = 13.67
σ̂ = 11.00
60
40
20
20
0
µ = 14
σ = 11.92
20
40
100
60
80
0
20
40
111
60
80
What is going on?
I Probability of appearance at a given position
1
8
1
P(111) =
8
P(100) =
What is going on?
I Probability of appearance at a given position
1
8
1
P(111) =
8
P(100) =
I BUT the
111
occur often by CLUMPS
· · · 0111110
111
111 · · ·
I while the
100
NEVER OVERLAP
· · · 011110
111 · · ·
What is going on?
I Probability of appearance at a given position
1
8
1
P(111) =
8
P(100) =
I BUT the
111
occur often by CLUMPS
· · · 0111110
· · · 011110
111 · · ·
111
111 · · ·
I while the
100
NEVER OVERLAP
Expected waiting time:
111 − 14
100 − 7
Number of occurrences 1000 texts of size 500
80
80
60
60
µ̂ = 62.38
σ̂ = 4.90
40
20
20
0
µ̂ = 62.07
σ̂ = 10.79
40
30
40
50
60
100
70
80
90
100
0
30
40
50
60
111
70
80
90
100
Part I
Statistics
of reduced patterns
1 word - Bernoulli model
A
alphabet,
w
considered word
Polynomial of autocorrelation
Cw = { h,
w.h = u.w
and
|u| < |w| }
w = ababa
ababa
ababa|
ababa
ababa
Cababa = {, ba, baba}
X
Cababa (z) =
P(v)z |v| = 1 + ωa ωb z 2 + ωa2 ωb2 z 4
v ∈ Cababa
Guibas-Odlyzko decomposition for a word u
aaaa
aaaa
aaaa
aaaa
aaaa
Guibas-Odlyzko decomposition for a word u
∈R
aaaa
aaaa
aaaa
aaaa
aaaa
Guibas-Odlyzko decomposition for a word u
∈R
aaaa
∈ uM
aaaa
aaaa
aaaa
∈ uM
aaaa
Guibas-Odlyzko decomposition for a word u
∈R
aaaa
∈ uU
∈ uM
aaaa
aaaa
aaaa
∈ uM
aaaa
Guibas-Odlyzko decomposition for a word u
∈R
aaaa
∈ uU
∈ uM
aaaa
aaaa
aaaa
aaaa
∈ uM
A? = N ∪ R.M?.U
L• = N ∪ R•.(M•)?.U
need to compute
N , R, M, U
The languages R, M and U
R = { t = u.w and 6 ∃r, s, t = r.w.s }
aaaaaababa ∈ R, bbbbbabababa 6∈ R
First
The languages R, M and U
R = { t = u.w and 6 ∃r, s, t = r.w.s }
aaaaaababa ∈ R, bbbbbabababa 6∈ R
First
Minimal
M = { t,
w.t = u.w
and
6 ∃r, s,
w.t = r.w.s }
ababa
ababa
ababa
aaaaababa ∈ M
babbbbbbbbababa 6∈ M
ba ∈ M
The languages R, M and U
R = { t = u.w and 6 ∃r, s, t = r.w.s }
aaaaaababa ∈ R, bbbbbabababa 6∈ R
First
Minimal
M = { t,
w.t = u.w
and
6 ∃r, s,
w.t = r.w.s }
ababa
ababa
ababa
aaaaababa ∈ M
babbbbbbbbababa 6∈ M
ba ∈ M
Ultimate
U = {t,
6 ∃r, s,
w.t = r.w.s}
ababa
aabbbabbbbbbb ∈ U
ababa
babbbbbbbbbb 6∈ U
The languages R, M and U
R = { t = u.w and 6 ∃r, s, t = r.w.s }
aaaaaababa ∈ R, bbbbbabababa 6∈ R
First
Minimal
M = { t,
w.t = u.w
and
6 ∃r, s,
w.t = r.w.s }
ababa
ababa
ababa
aaaaababa ∈ M
babbbbbbbbababa 6∈ M
ba ∈ M
Ultimate
U = {t,
6 ∃r, s,
w.t = r.w.s}
ababa
aabbbabbbbbbb ∈ U
Not
N = A? .w.A? = {t, 6 ∃r, s,
ababa
babbbbbbbbbb 6∈ U
t = r.w.s}
Régnier-Szpankowski - Equations over the langages
I
(I) A? = U + MA? ⇔ w.A? = w.U + w.MA?
− for any word beginning with w
1. 1 single occurrence of
2. several occurrences of
w
w
⇒
⇒
w.U
w.M.A?
Régnier-Szpankowski - Equations over the langages
I
(I) A? = U + MA? ⇔ w.A? = w.U + w.MA?
− for any word beginning with w
1. 1 single occurrence of
2. several occurrences of
I
w
w
⇒
⇒
(II) A? w = R.C + R.A? .w
− for any word nishing by w
1. rst occurrence
w
(remark ⊂ C )
overlaps last one
or single occurrence of
2. rst occurrence of
w.U
w.M.A?
w
w
does not overlap last
⇒ R.C
⇒ R.A? .w
Equations over the languages (continued)
I
(III)
1.
2.
3.
⇔
M+ = A? .w +
C−
w.M+ = w.A? .w + w.(C − )
w.M+ , ⇒ words beginning and nishing with w
w.A? .w, ⇒ rst and last occurrences do not overlap
w.(C − ) ⇒ rst and last occurrences overlap
Equations over the languages (continued)
I
(III)
1.
2.
3.
I
⇔
M+ = A? .w +
C−
w.M+ = w.A? .w + w.(C − )
w.M+ , ⇒ words beginning and nishing with w
w.A? .w, ⇒ rst and last occurrences do not overlap
w.(C − ) ⇒ rst and last occurrences overlap
(IV) N .A = R + N − − concatenate a letter to a word of N ⇒
1. creates a match
⇒R
2. does not create a match
3. (empty word
forbidden)
⇒N
From Languages to Generating Functions
A? = U + MA?
1
M (z)
= U (z) +
1−z
1−z
(II)
A? w = R.A + R.A? .w
ωw z |w|
ωw z |w|
= F (z) C(z) +
1−z
1−z
(III)
M+ = A? .w + A − M (z)
ωw z |w|
=
+ C(z) − 1
1 − M (z)
1−z
(IV)
N .A = R + N − zN (z) = F (z) + N (z) − 1
(I)
!
Solving the system
R(z) =
N (z) =
ωw z |w|
ωw z |w| + (1 − z)C(z)
ωw
z |w|
C(z)
+ (1 − z)C(z)
U (z) =
ωw
z |w|
M (z) = 1+
ωw
1
+ (1 − z)C(z)
z |w|
z−1
+ (1 − z)C(z)
L• = N ∪ R•.(M•)?.U
L(z, u) = N (z)+
R(z)uU (z)
=
1 − uM (z)
1
(u − 1)ωw z |w|
1−z−
1 − (u − 1)(C(z) − 1)
Moments
Xn random variable counting the number of occurrences of
the word w in a random text of length n
I
I
E(Xn ) =
[z n ]
Var(Xn ) =
∂L(z, u) = (n − |w| + 1)ωw
∂u u=1
[z n ]
∂ ∂L(z, u) u
∂u
∂u u=1
− E2 (Xn )
= n × ωw 2C(1) − 1 − (2|w| − 1)ωw + O(1)
Number of occurrences
100
L(z, u)
m(2) (z)
µn
σn
µ500
σ500
1−u
(4z + 2z 2 )
8
1−u
1−z−
(4z−2z 2 −z 3 )
8
1+
1
1 − z + (1 − u)
µ(z)
111
z3
8
z3
8(1 − z)2
z3
8(1 − z)2
z3
z3
−
2
8(1−z)
32(1−z)3
8z 3 − 4z 5 − 2z 6
32(1 − z)3
(n − 2)/8
√
3n/8
(n − 2)/8
√
15n − 40/8
249/4 = 62.25
√
1500/8 ≈ 4.84
249/4 = 62.25
√
7460/8 ≈ 10.80
Reduced compound patterns
W = {w1 , w2 }
w1 )
and
w1
(resp.
w2 )
is
not factor of w2
Correlation of words
C(z) = (Ci,j (z))
Example
Languages
Ci,j = {h, wi .h = u.wj }
W = {aab, abaa}
1
ωa2 z 2
C(z) =
ωb z 1 + ωa2 ωb z 3
Right, Minimal, Ultimate
Ri , Mi,j , Ui
(resp.
Guibas-Odlyzko decomposition for a pattern (u1 , u2 )
U = (aaaa, aaab)
aaaa
u1 = aaaa
u2 = aaab
aaaa
aaab
aaaa
aaaa
Guibas-Odlyzko decomposition for a pattern (u1 , u2 )
U = (aaaa, aaab)
u1 = aaaa
u2 = aaab
∈ R1
aaaa
aaaa
aaab
aaaa
aaaa
Guibas-Odlyzko decomposition for a pattern (u1 , u2 )
U = (aaaa, aaab)
∈ R1
aaaa
u1 = aaaa
u2 = aaab
∈ u1 M12
aaaa
aaaa
aaab
∈ u2 M21
aaaa
Guibas-Odlyzko decomposition for a pattern (u1 , u2 )
U = (aaaa, aaab)
∈ R1
aaaa
u1 = aaaa
u2 = aaab
∈ u1 U1
∈ u1 M12
aaaa
aaaa
aaab
∈ u2 M21
aaaa
Guibas-Odlyzko decomposition for a pattern (u1 , u2 )
U = (aaaa, aaab)
∈ R1
aaaa
u1 = aaaa
u2 = aaab
∈ u1 U1
∈ u1 M12
aaaa
aaaa
aaab
∈ u2 M21
aaaa
Guibas-Odlyzko decomposition for a pattern (u1 , u2 )
U = (aaaa, aaab)
∈ R1
u1 = aaaa
u2 = aaab
∈ u1 U1
∈ u1 M12
aaaa
aaaa
aaaa
aaaa
aaab
∈ u2 M21
I
The Right language
Ri = {r | r = e · ui
I
I
associated to the word
υ∈U
ui
such that
is the set of words
r = xυy
Mij leading from a word ui to
words Mij = {m | ui · m = e · uj and there is no υ ∈ U
xυy with |x| > 0, |y| > 0}.
The Minimal language
The Ultimate language
ui
Ui
with
a word
uj
such that
|y| > 0}.
is the set of
ui · m =
of words following the last occurrence of the word
U in the text) is the set
ui · u = xυy with |x| > 0}.
(such that this occurrence is the last occurrence of
of words
I
Ri
and there is no
Ui = {u |
there is no
The Not language
N = {n |
υ∈U
N is the set
υ ∈ U such
there is no
such that
of words with no occurrences of
that
n = xυy}.
U,
Guibas-Odlyzko decomposition for a pattern (u1 , u2 )
U = (aaaa, aaab)
∈ R1
aaaa
u1 = aaaa
u2 = aaab
∈ u1 U1
∈ u1 M12
aaaa
aaaa
aaaa
aaab
∈ u2 M21
F (z, x1 , x2 )
= N (z) + (R1 (z)x1 , R2 (z)x2 )
M11 (z)x1
M21 (z)x1
M12 (z)x2
M22 (z)x2
U1 (z)
U2 (z)
Computing the languages
I
Régnier-Szpankowski Equations
[
Mk
i,j
= A? ·uj + Cij − δij , Ui ·A =
[
Mij + Ui − ,
j
k≥1
A·Rj − (Rj − uj ) =
[
ui Mij ,
N ·uj = Rj +
i
[
Ri (Cij − δij ) ,
i
L(z, x1 , x2 )
= N (z) + (R1 (z)x1 , R2 (z)x2 )
I
M11 (z)x1
M21 (z)x1
M12 (z)x2
M22 (z)x2
The generating functions are also computable by automata
U1 (z)
U2 (z)
Markov case - One word w
I
Conditional probability matrix:
P = (pij ),
i, j ∈ {1, . . . , |A|}
I
w = aabaa
I
C = {baa, abaa}
I
Cw (z) = pab pba p2aa z 3 + paa pab p2aa z 4
I
Cw (z)
is the
conditionned generating function of the
autocorrelation set
Markov case - One word w = w1 .w2 . . . .wm
Conditional probability matrix: P = (pij ), i, j ∈ {1, . . . , |A|}
Stationnary vector: µ = (µ
,...,µ )
1  |A|
µ
Stationnary matrix: Π = 

.
.
.

 = limn→∞ Pn
µ
P (w)z m
D
1
U (z) =
D
R(z) =
z−1
D
D − P (w)z m
N (z) =
(1 − z)D
M (z) = 1 +
D = (1 − z)Cw (z) + P (w)z m 1 − (1 − z)F (z)
F (z) =
−1 i
1 h
(P − Π) I − (P − Π)z
µw1
wm ,w1
Limit Laws
I
I
Also
Poisson when number of occurrences is O(1)
Gaussian when number of occurrences is Θ(n)
large deviation result
Bibliography
I
I
Lothaire book (2005), Chapter 7
Régnier (2000) A unied approach to word occurrences
probabilities
Part II
Clump Analysis
Counting clumps
w = aa
T = bbbbaaaabbbbaaa
I word occurrences counting with overlaps:
5
(aa|a|a|, aa|a|)
I clumps counting:
2
matches (aaaa|, aaa|)
matches
Counting clumps
w = aa
T = bbbbaaaabbbbaaa
I word occurrences counting with overlaps:
5
matches
(aa|a|a|, aa|a|)
I clumps counting:
2
matches (aaaa|, aaa|)
A clump of occurrences of a word w is
I either composed of a single occurrence of w
I or of a maximal set of occurrences of w such that each
occurrence of w overlaps at least another occurrence of w
Generalization to clumps of occurrences of a set of words
{u1 , . . . , ur }
Aim of our work
I a
combinatorial analysis for counting clumps of reduced sets
of words
I an
algorithmic contruction by automata that solves the
counting problem in the general case and implies a normal
limit law
Probability of start at a position
(A) word
w = 1k
p = P(1) = 1 − P(0) = 1 − q
k
...
P(start) = pk
i
z
1
}|
1 ...
{
1
...
Probability of start at a position
(A) word
w = 1k
i
p = P(1) = 1 − P(0) = 1 − q
k
...
}|
1 ...
z
1
{
1
...
P(start) = pk
k
(B) clump
P(start) =
Γ = 1k .1?
pk
...
...
...
...
z
1
1
1
}|
1 ...
1 ...
1 ...
{
1
0
...
1
1
0
...
1
1
1
0
...
...
× q × (1 + p + p2 + . . . ) = pk ×
q
= pk
1−p
Probability of start at a position
(A) word
w = 1k
i
p = P(1) = 1 − P(0) = 1 − q
k
...
}|
1 ...
z
1
{
1
...
P(start) = pk
k
(B) clump
P(start) =
Γ = 1k .1?
pk
...
...
...
...
z
1
1
1
}|
1 ...
1 ...
1 ...
{
1
0
...
1
1
0
...
1
1
1
0
...
...
× q × (1 + p + p2 + . . . ) = pk ×
FALSE
q
= pk
1−p
Probability of start at a position
(A) word
i
w = 1k
k
...
z
1
}|
1 ...
{
1
...
P(start) = pk
(B) clump
Γ = 1k .1?
k
z }| {
. . . 1| 1 . . . 1 . . .
no clump
beginning at position i
k
z }| {
. . . 0| 1 . . . 1 . . .
a clump
P(start) = pk × q = pk × (1 − p)
begins at position i
Probabilistic approach
Prum, Reinert, Schbath, Pape, ...
w = aaa
P(a)
small
P(start-of-a-clump-at-a-position)
small
Poisson law for the number of clumps
Geometric law for the number of words in a clump
This extends to the general case.
Chen-Stein Poisson approximation provides bounds for the total
variation distance to the composed law.
A - Combinatorial approach
Language equations for the clumps
∈ R− w
∈ wU
∈ wC ?
aaaa
aaaa
aaaa
aaaa
aaaa
∈ (M−K)−
Notation:
if
L=W ·w
we write
L− = W
Combinatorial decomposition
?
A? = N + R− Γ (M − K)− Γ U
?
= N + R− w C ? (M − K)− w C ? U
?
= N + R− w K? (M − K) − w K? U
Clump:
Γ = w C ? = w K?
Some combinatorial properties
w = aaaaa
C − {} = {a, aa, aaa, aaaa}
K = {a}
M = {a, b(b + ab + aab + aaab + aaaab)? aaaaa}
Properties
I
K⊂M
I
M − K = Lw
I
K? = C ?
and
K?
is
unambiguous
A Prex Code generating the clumps
Lemma.
Let
C◦ = C − {}
be the strict autocorrelation set of a word
I the Prex code K = C◦ − C◦ A+ generates
C + −{}, which implies that K? = C◦ ?
I K? is
unambiguous
w
unambiguously
Generating functions
F (z, •, •, •) = N (z)+
I number of
w
R(z)
Γ(•z,•,•)
ωw z |w|
1
U(z)
M(z)−K(z)
1−
×Γ(•z,•,•)
ωw z |w|
and number of clumps:
Γ(z, u, x) = uxωw z |w|
1
1 − xK(z)
I number of clumps and total number of positions inside clumps:
Γ(tz, u) = uωw (tz)|w|
I number of
w
1
1 − K(tz)
and total number of positions inside clumps:
1
1 − xK(tz)
stuttering w and total
Γ(tz, x) = xωw (tz)|w|
I number of
number of positions inside
clumps:
Γ(tz, x) =
x
ωw (tz)|w|
1−x
1
x
1−
K(tz)
1−x
One word - Expectation - Variance
clumps
E(OnK ) = (n − |w| + 1)ωw (1 − K(1)) − ωw K0 (1)
Var(OnK ) = n × (1 − K(1))2 Vw − n × ωw (1 − K(1))(K(1) − 2ωw K0 (1))
one word
E(Onw ) = (n − |w| + 1)ωw ,
Var(Onw ) = n × Vw + O(1).
Vw = ωw 2C(1) − 1 − (2|w| − 1)ωw
Putting up equations for clumps of two words
Minimal Correlation Language
Lemma
: Mij − Kij = Lwj
: Kij = Cij − Cij A+
K11 K12
K=
K21 K22
w1 Ξ11 w1 Ξ12
?
Ξ=K
G=
w2 Ξ21 w2 Ξ22
?
A =N +
−
(R−
1 , R2 ) G
U1
?
(M − K) G
U2
−
B - Automaton approach
Clumps of the set of words
Ew1 ,w2
extension set from
w1
to
U = {aabaa, baab}
w2
Eaabaa,aabaa = {baa, abaa}
Eaabaa,baab = {b}
Ebaab,baab = {aab}
Ebaab,aabaa = {aa}
Clumps of the set of words
Ew1 ,w2
extension set from
w1
to
w2
Eaabaa,aabaa = {baa, abaa}
Eaabaa,baab = {b}
Algorithm
1. Build the set of strings
U = {aabaa, baab}
Ebaab,baab = {aab}
Ebaab,aabaa = {aa}
Clumps of the set of words
Ew1 ,w2
extension set from
w1
to
U = {aabaa, baab}
w2
Eaabaa,aabaa = {baa, abaa}
Eaabaa,baab = {b}
Ebaab,baab = {aab}
Ebaab,aabaa = {aa}
Algorithm
1. Build the set of strings
X=
∪
{aabaa.( + Eaabaa,aabaa )}
{baab.( + Ebaab,baab )}
∪ {aabaa.Eaabaa,baab }
∪ {baab.Ebaab,aabaa }
Clumps of the set of words
Ew1 ,w2
extension set from
w1
to
U = {aabaa, baab}
w2
Eaabaa,aabaa = {baa, abaa}
Eaabaa,baab = {b}
Ebaab,baab = {aab}
Ebaab,aabaa = {aa}
Algorithm
1. Build the set of strings
X=
∪
2. Build a trie
{aabaa.( + Eaabaa,aabaa )}
{baab.( + Ebaab,baab )}
T
on
X
∪ {aabaa.Eaabaa,baab }
∪ {baab.Ebaab,aabaa }
Clumps of the set of words
Ew1 ,w2
extension set from
w1
to
U = {aabaa, baab}
w2
Eaabaa,aabaa = {baa, abaa}
Eaabaa,baab = {b}
Ebaab,baab = {aab}
Ebaab,aabaa = {aa}
Algorithm
1. Build the set of strings
X=
∪
2. Build a trie
{aabaa.( + Eaabaa,aabaa )}
{baab.( + Ebaab,baab )}
T
on
∪ {aabaa.Eaabaa,baab }
∪ {baab.Ebaab,aabaa }
X
3. Build a Aho-Corasick like automaton upon
with access word
δ(ν, `) =
of v.`
v,
T.
For each node
use the transition function
node accessed by the
ν
of
T
δ
longuest prex in X
that is
sux
X = {aabaa, aabaabaa, aabaaabaa, aabaab
}
b
a
a
a
b
a
a
a
a
b
a
a
X = {aabaa, aabaabaa, aabaaabaa, aabaab, baab, baabaab }
b
a
a
a
b
a
a
a
a
b
a
a
b
a
a
b
a
a
b
X = {aabaa, aabaabaa, aabaaabaa, aabaab, baab, baabaab }
δ(aabaaabaa, a) = aabaaa
a
b
a
a
a
b
a
a
a
a
b
a
a
b
a
a
b
a
a
b
a
a
b
a
a
a
A
b
a
a
a
b
b
a
b
a
a
a
b
B
a
An automaton for
a
b
a
a
V = {v1 = aabaa, v2 = baab}. All
A and B are omitted.
ending respectively on state
b
a
transitions labeled by
a
and
b
a
a
b
a
a
a
A
b
a
a
a
•
•
b
b
a
•
b
•
a
a
a
b
B
a
An automaton for
a
b
•
a
b
a
•
•
V = {v1 = aabaa, v2 = baab}. All
A and B are omitted.
transitions labeled by
a
a
and
ending respectively on state
I
•, • → the corresponding
aabaa, baab.
I red states →
prex (or state) ends with some occurrence of
states where we have entered a
new clump
b
a
a
b
a
a
A
b
a
a(τ 4 x1 )
a
a(γτ 5 x1 )
•
•
b(τ x2 )
a
•
b(τ x2 )
•
b
B
b(γτ 4 x2 )
a
An automaton for
a
•
a
b(τ x2 )
a
a
b(τ x2 )
a(τ 2 x1 )
a(τ 2 x1 )
•
•
V = {v1 = aabaa, v2 = baab}. All
A and B are omitted.
transitions labeled by
a
a
and
ending respectively on state
I
•, • → the corresponding
aabaa, baab.
I red states →
prex (or state) ends with some occurrence of
states where we have entered a
Formal weights on transitions
I γ → the number of clumps
I τ → total length of clumps
I x1 , x2 → occurrences of aabaa, baab
new clump
b
F (a, b, γ, x1 , x2 , τ )


= (1, 0, . . . ) I − T(a, b, γ, x1 , x2 , τ ) −1 

1
1




b
a
A
b
a
a(τ 4 x1 )
a
a(γτ 5 x1 )
a
a
a
•
•
b(τ x2 )
a
•
b
B
b(γτ 4 x2 )
a
An automaton for
a
•
a
b(τ x2 )
•
b(τ x2 )
a
a
b(τ x2 )
a(τ 2 x1 )
a(τ 2 x1 )
•
•
V = {v1 = aabaa, v2 = baab}. All
A and B are omitted.
transitions labeled by
a
a
and
ending respectively on state
I
•, • → the corresponding
aabaa, baab.
I red states →
prex (or state) ends with some occurrence of
states where we have entered a
Formal weights on transitions
I γ → the number of clumps
I τ → total length of clumps
I x1 , x2 → occurrences of aabaa, baab
new clump
b
F (a, b, γ, x1 , x2 , τ )


= (1, 0, . . . ) I − T(a, b, γ, x1 , x2 , τ ) −1 

a
πa z, b
1
1
πb z
[z n ]F (πa z, πb z, . . . ) → ()Tn (πa , πb , . . . )()




b
a
A
b
a
a(τ 4 x1 )
a
a(γτ 5 x1 )
a
a
a
•
•
b(τ x2 )
a
•
b(τ x2 )
•
b
B
b(γτ 4 x2 )
a
An automaton for
a
•
a
b(τ x2 )
a
a
b(τ x2 )
a(τ 2 x1 )
a(τ 2 x1 )
•
•
V = {v1 = aabaa, v2 = baab}. All
A and B are omitted.
transitions labeled by
a
a
and
ending respectively on state
I
•, • → the corresponding
aabaa, baab.
I red states →
prex (or state) ends with some occurrence of
states where we have entered a
Formal weights on transitions
I γ → the number of clumps
I τ → total length of clumps
I x1 , x2 → occurrences of aabaa, baab
new clump
b
Asymptotic Limit Laws
I
One word
-
I
OnK = O(1)
Poisson law for the number of clumps
General non-reduced sets
-
OnK = Θ(n)
Normal limit law (number of clumps, size covered)
Proofs
I Poisson law: Rouché theorem, singularity analysis
I Normal law: automaton, Perron-Frobenius for
singularity analysis, large powers theorem
T(. . . ),
Complexity of computing the prex code(s)
I
one word
|K| log(|K|)
I
several words (reduced case)




X
X

|Ki,j | log 
|Ki,j |
i,j
Complexity of
i,j
insertion of random keys in tries
Throwing at dimers on an integral segment
1 2
0
i
i+2
n
Throwing at dimers on an integral segment
1 2
0
1
2
3
4
5
6
7
8
Size covered by the dimers?
n
Throwing at dimers on an integral segment
a a
stuttering
t
0
P(a) = P(b) =
stuttering
by
1
2
t
t
occurrences of aa on (a + b)n
t
P(aa) =
substituting
t
1
4
x
t
t
t
x
1
1− x
4
n
in
Γ(zt, x)
Throwing at dimers on an integral segment
Size covered
20
15
10
5
Number of dimers
0
10
20
30
40
n = |segment| = 20
50
Part III
Application of Clump Analysis
to Genomics
Revisiting waiting times in DNA evolution
Biological Motivation
I
Promoters are DNA sequences located upstream of the gene
they regulate; regulation can be positive for enhancers or
negative for repressors.
I The promoters contain binding sites for regulatory proteins
such as Transcription Factors (TFs) that are short
stretches of DNA.
Biological Motivation
I
Promoters are DNA sequences located upstream of the gene
they regulate; regulation can be positive for enhancers or
negative for repressors.
I The promoters contain binding sites for regulatory proteins
such as Transcription Factors (TFs) that are short
stretches of DNA.
I
Waiting time: how long it takes for a Transcription Factor
to appear in a promoter under a probabilistic model of
evolution helps understanding the overall evolution of
promoters within species and between species?
From innitesimal to discrete evolution model
I
I
innitesimal time
P(t) evolution matrix from time x and time x + t
Q(t)dt
evolution matrix for
P(t) = eQ(t)
I
(Karlin-Taylor 1975)
P(1) = (πα→β ) evolution matrix
years), α, β ∈ {A, C, G, T}
for
one generation (20
Probability of occurrence of a
at time 1
I
length n at time 0
Sn (1) sequence obtained from Sn (0) by evolution at time 1
b a k -mer (word of length k over A = {A,C,G,T})
I
Pn (b) probability that b
I
I
Sn (0) random DNA sequence
k -mer
of
I
occurs at time 1
I
while not occurring at time 0
Pn (b) = P(b ∈ Sn (1) | b 6∈ Sn (0))
Probability of occurrence of a
at time 1
I
length n at time 0
Sn (1) sequence obtained from Sn (0) by evolution at time 1
b a k -mer (word of length k over A = {A,C,G,T})
I
Pn (b) probability that b
I
I
Sn (0) random DNA sequence
k -mer
of
I
occurs at time 1
I
while not occurring at time 0
Pn (b) = P(b ∈ Sn (1) | b 6∈ Sn (0))
Expectation of the Waiting time
I
En (b) ≈
1
Pn (b)
(geometric
En (b)
distribution
− BehVin2010)
Dierent computations
1.
of
Pn
Behrens-Vingron (2010)
neglecting words correlation.
Ecient computation of Pn with respect to this assumption.
I Approach
I
2.
Behrens-Nicaud-P.N. (2012)
I
Rigorous and ecient approach by automata.
hiding the quasi-linear behaviour of Pn
I Approach
3.
P.N. (NCMA2012)
I
I
Non-ecient approach by clump analysis, either by
combinatorics of words or by automata.
Proof by singularity analysis of the quasi-linear behaviour
of
Pn
Initial
α
A
C
G
T
ν(α)
and Substitution Probabilities
A
A
ν(α)
A
C
0.23889
A
G
0.26242
A
T
0.25865
C
A
0.24004
C
C
C
G
C
T
G
A
G
C
substitution
probability
πα→β
G
G
G
T
T
A
for one generation
T
C
(20 years)
T
G
T
T
πα→β
0.9999999763
4.54999994943 × 10−9
1.57499995613 × 10−8
3.40000001733 × 10−9
6.14999993408 × 10−9
0.99999996495
7.14999984731 × 10−9
2.17499993935 × 10−8
2.17499993935 × 10−8
7.14999984731 × 10−9
0.99999996495
6.14999993408 × 10−9
3.40000001733 × 10−9
1.57499995613 × 10−8
4.54999994943 × 10−9
0.9999999763
Numerical remarks
I
I
I
length of promoters n ∈ [500 − 2000]
k ∈ {5, 6, 7, 8, 9, 10} for the k -mers
Mutation probability πα→β ≈ 10−9
Numerical remarks
I
I
I
length of promoters n ∈ [500 − 2000]
k ∈ {5, 6, 7, 8, 9, 10} for the k -mers
Mutation probability πα→β ≈ 10−9
We have
I
I
pr : probability of Mutation to b from a r-neighbour of
b with r ≥ 2
r
pr ≤ n × πα→β
≤ 2000 × 10−18 < 2.10−6 × πα→β
qs : probability that s 1-neighbours simultaneously
mutate to b with s ≥ 2
s
qs ≤ n × πα→β
≤ 2000 × 10−18 < 2.10−6 × πα→β
Numerical remarks
I
I
I
length of promoters n ∈ [500 − 2000]
k ∈ {5, 6, 7, 8, 9, 10} for the k -mers
Mutation probability πα→β ≈ 10−9
We have
I
I
pr : probability of Mutation to b from a r-neighbour of
b with r ≥ 2
r
pr ≤ n × πα→β
≤ 2000 × 10−18 < 2.10−6 × πα→β
qs : probability that s 1-neighbours simultaneously
mutate to b with s ≥ 2
s
qs ≤ n × πα→β
≤ 2000 × 10−18 < 2.10−6 × πα→β
Therefore assuming a single mutation in the promoter
is numerically sound
Behrens-Vingron 2010
I
d+ (b)
neighbors of
b
by substitution
S(0)
ν(α)
d+ (b)
S(1)
d+ (b)
πα→β
b
b
Behrens-Vingron 2010
I
d+ (b)
neighbors of
b
by substitution
S(0)
ν(α)
d+ (b)
d+ (b)
πα→β
S(1)
b
b

bn/kc

X

n − i(k − 1) i

i+1

(−1)
Φ

 Pn ≈
i
i=1



Φ=



X
(a1 ,...,ak )∈Ak \{b1 ,...,bk }
ν(a1 ) × · · · × ν(ak ) ·
k
Y
j=1
πai →bi (1)
Behrens-Nicaud-P.N. 2012
Construct an
I
on the
automaton
I recognizing
I
Σ = A × A with A = {A, C, G, T}
sequences S(b) = S(0) ⊗ S(1)
alphabet
such that
1.
b 6∈ S(0)
2.
b ∈ S(1)
Using the Knuth-Morris-Pratt automaton
C
0
A, C
A
A
1
C
2
C
A
C
0
1
C
MACC = {Q, δ, s = 0, F }
A, C
A
A
3
2
C
3
MACC = {Q, δ, s = 0, Q \ F }
A

Mb = (Q = {0, . . . , k}, δb , 0, {k})


Mb = (Q = {0, . . . , k}, δb , 0, {0, . . . , k − 1})


Nb = Mb ⊗ Mb = (Q × Q, ∆, q00 = (0, 0), F 0 = {0,. . ., k−1} × {k})
∆((r, s), (α, β)) = (δb (r, α), δb (s, β))
NACC = MACC ⊗ MACC
The automaton
1,0
A
0,0
C
with matrix
Notations for the transitions:
2,0
C
(
C, C
0,1
C
2,1
C
1,1
A
A=
A=
A
,
A
A
,
C
C=
C=
C
C
C
A
C, C
sink
A, A, C, C
a missing label of a transition is
C, C
set to the letter at the bottom
1,2
A
0,2
C
2,2
C
C, C
of its ending state
1,0
C
2,0
C
C, A
0,3
C
A, A
C, C
1,3
A
2,3
C
A, A
A, A
P
is labelled by
C
NACC = MACC ⊗ MACC
The automaton
1,0
A
0,0
C
with matrix
Notations for the transitions:
2,0
C
(
C, C
0,1
C
2,1
C
1,1
A
A=
A=
A
,
A
A
,
C
C=
C=
C
C
C
A
C, C
sink
A, A, C, C
a missing label of a transition is
C, C
set to the letter at the bottom
1,2
A
0,2
C
2,2
C
of its ending state
C, C
1,0
C
2,0
C
C, A
0,3
C
A, A
C, C
1,3
A
is labelled by
2,3
C
C
A, A
A, A
?
P
?
?
?
Pn = P(Sn (1) ∈ A bA |Sn (0) 6∈ A bA ) =
Vq00 Pn VF 0 t
t
1 − Vq00 Pn Vsink
Results for
BNN
of DNA
BV
EBNN (T1000 )
EBV (T1000 )
EBNN (T1000 )/106
Rank
EBV (T1000 )/106
9,105
1021
6,304
1
1.44
9,570
1022
6,666
142
1.44
10,401
1023
7,457
993
1.39
10,656
1024
7,654
1024
1.39
7,047
699
6,446
11
1.09
7,076
737
6,477
17
1.09
7,076
738
6,477
21
1.09
7,127
787
6,518
31
1.09
7,263
883
6,679
148
1.09
...
...
...
...
...
CCCCC
GGGGG
TTTTT
AAAAA
CGCGC
TCCCC
CCCCT
GCGCG
CTCTC
...


5-mers
5-mers
0.2% of the 7-mers

0.002% of the 10-mers
4% of the
Rank
verify EBNN (T1000 ) > 1.05%
EBV (T1000 )
Putative-hit positions.
sequence S(0) not containing a k-mer b,
a putative-hit position is any position of S(0) that can lead
by a mutation to an occurrence of b in S(1),
where we assume that a single mutation has occurred.
I Given a
I
I
S(0) = CCCAACAC,
putative-hit positions
b = ACC
underlined in S(0).
S(0) = CCCAACAC,
Putative-hit positions.
sequence S(0) not containing a k-mer b,
a putative-hit position is any position of S(0) that can lead
by a mutation to an occurrence of b in S(1),
where we assume that a single mutation has occurred.
I Given a
I
I
S(0) = CCCAACAC,
putative-hit positions
b = ACC
S(0) = CCCAACAC,
underlined in S(0).
In a random sequence of length n, let
(n)
number of putative-hit-positions
A → C,
(n)
I H
C→A number of putative-hit-positions
C → A,
I
HA→C
Then
(n)
(n)
Pn ≈ E(HA→C ) × πA→C + E(HC→A ) × πC→A
Computing via generating functions
Aim:
Compute
Fb (z, tA→C , tC→A ) =
X
X
fn,i,j tiA→C , tjC→A z n
X
n≥0 0≤i≤n−|b| 0≤j≤n−|b|
where
length
I
i
fn,i,j is the probability
n, contains
putative-hit positions
I and
j
that a sequence
Sn (0) with no b,
of
A→C
putative-hit positions
C→A
We have
n
Pn = [z ] πA→C
!
∂F (z, 1, tC→A ) ∂F (z, tA→C , 1) + πC→A
∂tA→C
∂tC→A
tA→C =1
tC→A =1
Putative-Hit-Positions and clump analysis
A = {A, C} b = ACC −→ d(ACC, 1) = {CCC, AAC, ACA}
CCCCAAACAAACAAACAAAACACAAC
CCC AAC AAC AAC AAC
CCC
ACA
ACA
ACA
AAC
I
II
I (left)
III
IV
b = ACC
V
- in clump I, when the right extension of a clump
adds a new putative-hit position, this position is not necessarily in
the extension, but possibly backwards left
Putative-Hit-Positions and clump analysis
A = {A, C} b0 = AAA −→ d(AAA, 1) = {CAA, ACA, AAC}
CCCAACAACAACCCCCCCCAACACCACA
CAA
CAA
ACA
AAC
AAC
ACA
ACA
CAA
AAC
CAA
AAC
I
0
I (right)b
4
= AAA
7 occurrences of d(AAA),
b0 = AAA. The number of word
- clump I contains
putative-hit positions for
II
III
but only
occurrences is not the relevant statistics for counting putative-hit
positions
A - Language approach
Clump Analysis
∈R
aaaa
(Bassino-Clément-Fayolle-P.N. 2008)
∈ uU
∈ u C?
aaaa
aaaa
aaaa
∈ uM
aaaa
Clump Analysis
∈R
aaaa
(Bassino-Clément-Fayolle-P.N. 2008)
∈ uU
∈ u C?
aaaa
aaaa
aaaa
aaaa
∈ uM
I
residual
I
L2 − L1 = L2 \ L1 = {h; h ∈ L2 , h 6∈ L1 }
language
D = L.u− :
D = {h, h · u ∈ L}
Combinatorial decomposition (one word)
A? = N + Ru− u C ? (M − K)u− u C ?
?
U
Clump Analysis
∈R
aaaa
(Bassino-Clément-Fayolle-P.N. 2008)
∈ uU
∈ u C?
aaaa
aaaa
aaaa
aaaa
∈ uM
I
residual
I
L2 − L1 = L2 \ L1 = {h; h ∈ L2 , h 6∈ L1 }
language
D = L.u− :
D = {h, h · u ∈ L}
Combinatorial decomposition (one word)
?
A? = N + Ru− u C ? (M − K)u− u C ? U
?
= N + Ru− u K? (M − K)u− u K? U
?
= N + Ru− S (M − K)u− S U
Clumps: S = u C ? = u K?
Constrained Guibas-Odlyzko languages
Example: b = AA,
d` (b) = (AC, CA)
avoiding AA in S(0) and therefore in the Right,
Minimal and Ultimate languages
Build the Régnier-Szpankowski languages
for the pattern (AC, CA, AA)
I We need
I



M11 M12 M13
U1
L = N + (R1 , R2 , R3 )  M21 M22 M23   U2 
M31 M32 M33
U3
Lb = N + (R1 , R2 )
Notations:
write
b, R
ci , M
dij , U
cj
N
M11 M12
M21 M22
for
U1
U2
constrained languages
Constrained clumps
I
nite code languages Kij easy to compute
we must however avoid the forbidden word b while
extending clumps
I
vi , vj ∈ d` (b)
I
sets
Kij
nite
b ij = {h ∈ Kij ;
K
=⇒
computation of
|vi .h|b = 0}
b ij
K
by string-matching
Constrained clumps
I
nite code languages Kij easy to compute
we must however avoid the forbidden word b while
extending clumps
I
vi , vj ∈ d` (b)
I
sets
I
Kij
nite
b ij = {h ∈ Kij ;
K
=⇒
computation of
|vi .h|b = 0}
b ij
K
by string-matching
Decomposition by constrained clumps
 
? Ub.1
b (M
b − K)
b −G
b  .. 
c? = N
b + (R
b 1 v − , . . ., R
b r vr− ) G
A
1
b
Ubr

b
b

 K = (Kij ),
b
b ?,
S=K
with

 G
b = vi b
Sij
Counting putative hit-positions
b = ACAC
P(A) = P(C) = 1/2
d` (b) = (AAAC, ACAA, ACCC, CCAC) d` (b)(z, t) =



cb (z, t) = 
K



The extension
z4 t z4 t z4 t z4 t
16 , 16 , 16 , 16
z2 t
4
z3 t
8
z3 t
8

z3 t
z2
8 + 4
z2 t
4
z3 t
8
0
0
0
0
0
z2 t
4
z2 t
4
z2
z3 t
8 + 4
z3 t
8






0
b ACAA,AAAC
AC ∈ K
as in
putative-hit position, in contrast to
ACAA|AC leads
ACAA|AAC
3
2
b ACAA,AAAC (z, t) = z t + z
K
8
4
to
no new
Generating function of the number of putative-hit positions
I
I
vi (z, t) = ν(vi )tz |vi | for each vi ∈ d(b).
b ij , we can compute by string matching the number
for each K
b ij .
of putative-hit positions in each word of vi .K
X
b ij (z, t) =
K
ν(w)tput-hit-pos(vi .w)−1 z |w| ,
b ij
w∈K
Generating function of the number of putative-hit positions
I
I
vi (z, t) = ν(vi )tz |vi | for each vi ∈ d(b).
b ij , we can compute by string matching the number
for each K
b ij .
of putative-hit positions in each word of vi .K
X
b ij (z, t) =
K
ν(w)tput-hit-pos(vi .w)−1 z |w| ,
b ij
w∈K
−1
b t) = K
b t)
b ij (z, t) , b
K(z,
S(z, t) = I − K(z,
,
b t) = vi (z, t)b
G(z,
Sij (z, t) .
Generating function of the number of putative-hit positions
I
I
vi (z, t) = ν(vi )tz |vi | for each vi ∈ d(b).
b ij , we can compute by string matching the number
for each K
b ij .
of putative-hit positions in each word of vi .K
X
b ij (z, t) =
K
ν(w)tput-hit-pos(vi .w)−1 z |w| ,
b ij
w∈K
−1
b t) = K
b t)
b ij (z, t) , b
K(z,
S(z, t) = I − K(z,
,
b t) = vi (z, t)b
G(z,
Sij (z, t) .
c? (z, t)
Fb (z, t) = A
b


? Ub1.(z)

b t) (M
b − K)
b − (z)G(z,
b t) 
b (z) + (R
b 1 v − (z), . . ., R
b r vr− (z)) G(z,
=N
 .. 
1
Ubr (z)
B - Automaton approach
Automaton for constrained clumps of
d(AAA) = {AAC, ACA, CAA}
A
C•
start
1
C•
0
C•
6
χ
C
(χ = 16)
I
I
I
A
7
8
C•
C•
A
I
A
C•
9
C• A
10
A
12
A
5
C•
C•
2
A
A
3
A
4
11
C•
13
C•
14
Double circles signal an occurrence of a word of d(aaa).
Avoiding AAA leads to missing transitions A
The missing transitions C point to the state χ.
• characters mark putative-hit-positions
A
15
A
Automaton for constrained clumps of
d(AAA) = {AAC, ACA, CAA}
A
e
C•
start
1
C•
6
χ
C
(χ = 16)
I
I
I
e
A
7
8
C•
C•
A
I
A
0
C•
C•
2
A
A
3
A
4
e
C•
5
e
C•
9
C• A
10
A
11
e
A
12
e
A
C•
13
e
C•
14
A
15
Double circles signal an occurrence of a word of d(aaa).
Avoiding AAA leads to missing transitions A
The missing transitions C point to the state χ.
• characters mark putative-hit-positions
I Transitions covered by tildes (e
A, e
C) emit a signal counting a putative-hit
position.
Automaton for constrained clumps of
d(AAA) = {AAC, ACA, CAA}
A
C•
A
start
1
A
6
χ
(χ = 16)
I O = {q,
7
8
C•
C•
A
C
A
5
C•
C•
9
C• A
10
12
A
A
C•
C•
0
C•
3
2
A
4
A
11
C•
13
C•
δ(0, w) = q, w ∈ X}, (occurrence
14
A
of a word of
15
d(aaa)).
A
Automaton for constrained clumps of
d(AAA) = {AAC, ACA, CAA}
A
C•
A
start
1
A
6
χ
7
10
A
prexes
d
δ(0, w) = q, w ∈ Pref(d(b))},
d(b).
C•
C•
9
A
11
A
C•
13
C•
of words of
5
C• A
12
(χ = 16)
I E = {q,
8
C•
C•
A
C
A
A
C•
C•
0
C•
3
2
A
4
with
14
A
d
Pref(d(b))
15
set of
strict
Automaton for constrained clumps of
d(AAA) = {AAC, ACA, CAA}
A
C•
A
start
1
A
6
χ
7
10
A
I
C•
C•
with
prexes of words of d(b).
Clump-Core of the automaton E = Q \ E
C•
9
A
11
A
C•
13
d
δ(0, w) = q, w ∈ Pref(d(b))},
5
C• A
12
(χ = 16)
I E = {q,
8
C•
C•
A
C
A
A
C•
C•
0
C•
3
2
A
4
14
A
d
Pref(d(b))
15
set of
strict
Automaton for constrained clumps of
d(AAA) = {AAC, ACA, CAA}
A
C•
A
start
1
A
6
χ
7
10
A
I
I
C•
C•
9
A
11
A
C•
13
C•
d
δ(0, w) = q, w ∈ Pref(d(b))},
5
C• A
12
(χ = 16)
I E = {q,
8
C•
C•
A
C
A
A
C•
C•
0
C•
3
2
A
4
with
14
A
d
Pref(d(b))
15
set of
prexes of words of d(b).
Clump-Core of the automaton E = Q \ E
Markov property: ∀q ∈ E,
|{w ∈ A|b| ; δ(x, w) = q}| = 1
strict
Denition of an auxiliary function θ
d(AAA) = {AAC, ACA, CAA}
C•
A
1
C•
6
χ
(χ = 16)
7
8
C•
C•
A
C
A
4
5
C•
C•
9
C• A
10
12
A
A
C•
A
0
C•
3
2
A
start
A
A
11
C•
13
C•
14
A
15
A
Denition of an auxiliary function θ
d(AAA) = {AAC, ACA, CAA}
C•
start
1
C•
6
χ
(χ = 16)
θ(7) = ACA
7
8
C•
C•
A
C
A
4
5
C•
C•
9
C• A
10
12
A
A
C•
A
0
C•
3
2
A
A
A
A
11
C•
13
C•
14
A
15
A
Denition of an auxiliary function θ
d(AAA) = {AAC, ACA, CAA}
C•
start
1
C•
6
χ
A
7
C
10
A
5
C•
C•
9
A
11
C•
13
C•
θ(5) = AA
A
C• A
12
(χ = 16)
θ(7) = ACA ,
8
C•
C•
A
4
C•
A
0
C•
3
2
A
A
A
14
A
15
A
Denition of an auxiliary function θ
d(AAA) = {AAC, ACA, CAA}
C•
start
1
C•
6
χ
A
7
C
10
A
5
C•
C•
9
A
11
C•
13
C•
θ(5) = AA ,
A
C• A
12
(χ = 16)
θ(7) = ACA ,
8
C•
C•
A
4
C•
A
0
C•
3
2
A
A
A
14
θ(15) = A
A
15
A
Formal denition of θ
For each state
θ(o) =

w
















By the
o∈O
with
(recognizing an occurrence of
|w| ≤ |b|,
of
d(b)),
maximal length,
(a) there exists q such that δ(q, w) = o,
verifying d
(b) there is no u ∈ Pref(w)
such that δ(q, u) ∈ O
Markov property, θ(o) denes a unique word
Adjacency matrix H(t) = (hij (t))
d(AAA) = {AAC, ACA, CAA}
A
C•
start
1
C•
C•
6
χ
C
(χ = 16)
h7,10 = νC z
A
7
8
C•
C•
A
I
A
0
5
C•
C•
9
C• A
10
12
A
A
C•
2
A
A
3
4
A
11
C•
13
C•
14
A
15
A
Adjacency matrix H(t) = (hij (t))
d(AAA) = {AAC, ACA, CAA}
A
C•
start
1
C•
C•
6
χ
C
(χ = 16)
I
h7,10 = νC z
h6,7 (t) = νA tC→A z
A
7
8
C•
C•
A
I
A
0
5
C•
C•
9
C• A
10
12
A
A
C•
2
A
A
3
4
A
11
C•
13
C•
14
A
15
A
Adjacency matrix H(t) = (hij (t))
d(AAA) = {AAC, ACA, CAA}
A
C•
start
1
C•
A
0
C•
6
χ
C
(χ = 16)
I
I
7
h7,10 = νC z
h6,7 (t) = νA tC→A z
h3,4 (t) = h45 (t) = νA z
8
C•
C•
A
I
A
5
C•
C•
9
C• A
10
12
A
A
C•
2
A
A
3
4
A
11
C•
13
C•
14
A
15
A
Adjacency matrix H(t) = (hij (t))
d(AAA) = {AAC, ACA, CAA}
A
C•
start
1
C•
A
0
C•
6
χ
C
(χ = 16)
I
I
I
7
h7,10 = νC z
h6,7 (t) = νA tC→A z
h3,4 (t) = h45 (t) = νA z
h13,14 (t) = νC tC→A z
8
C•
C•
A
I
A
5
C•
C•
9
C• A
10
12
A
A
C•
2
A
A
3
4
A
11
C•
13
C•
14
A
15
A
Formal denition of the adjacency matrix H(t)
(a)
hij (t) = 0
if there is no transition from
δ(i, α) = j ,


 ν(α) if j 6∈ O,

j ∈ O and θ(j)
hi,j (t) =



ν(α) × t elsewhere
i
to
j
(b) With
contains
no putative-hit position
Formal denition of the adjacency matrix H(t)
(a)
hij (t) = 0
if there is no transition from
δ(i, α) = j ,


 ν(α) if j 6∈ O,

j ∈ O and θ(j)
hi,j (t) =



ν(α) × t elsewhere
i
to
j
(b) With
contains
no putative-hit position
From matrix to generating function
Fb (z, t) = (1, 0, . . . , 0) × I + zH(t) + · · · + z n Hn (t) + . . . × 1t
= (1, 0, . . . , 0) × (I − zH(t))−1 × 1t .
Rational functions and gfun
rational function
f (z)
g(z)
recurrence equations
→
gfun[diffeqtorec]
→
recurrence equations
→
gfun[rectoproc]
→
procedure Proc(n)
Rational functions, Taylor coecient of order n and gfun
rational function
f (z)
g(z)
recurrence equations
→
gfun[diffeqtorec]
→
recurrence equations
→ gfun[rectoproc]
→ procedure Proc(n)
X
P (z, 1)
P (z, t)
fbn(b) z n =
Fb (z, t) =
and
Fb (z, 1) =
,
Q(z, t)
Q(z, 1)
n≥0
X
P 0 (z, 1) P (z, 1)Q0t (z, 1)
∂
n
= t
E(z) =
ηn z =
Fb (z, t)
−
∂t
Q(z, 1)
Q2 (z, 1)
t=1
n
Rational functions, Taylor coecient of order n and gfun
rational function
f (z)
g(z)
→
gfun[diffeqtorec]
→
recurrence equations
recurrence equations
→ gfun[rectoproc]
→ procedure Proc(n)
X
P (z, 1)
P (z, t)
fbn(b) z n =
Fb (z, t) =
and
Fb (z, 1) =
,
Q(z, t)
Q(z, 1)
n≥0
X
∂
P 0 (z, 1) P (z, 1)Q0t (z, 1)
n
E(z) =
ηn z =
Fb (z, t)
−
= t
∂t
Q(z, 1)
Q2 (z, 1)
t=1
n
where,
P (z, t)
Sn (0)
of length
and
n
are polynoms, and, in a random sequence
no occurrence of b,
Q(z, t)
with
Rational functions, Taylor coecient of order n and gfun
rational function
f (z)
g(z)
→
gfun[diffeqtorec]
→
recurrence equations
recurrence equations
→ gfun[rectoproc]
→ procedure Proc(n)
X
P (z, 1)
P (z, t)
fbn(b) z n =
Fb (z, t) =
and
Fb (z, 1) =
,
Q(z, t)
Q(z, 1)
n≥0
X
∂
P 0 (z, 1) P (z, 1)Q0t (z, 1)
n
E(z) =
ηn z =
Fb (z, t)
−
= t
∂t
Q(z, 1)
Q2 (z, 1)
t=1
n
where,
P (z, t)
Sn (0)
of length
I
and
n
are polynoms, and, in a random sequence
no occurrence of b,
Q(z, t)
with
(b)
fbn = P(Sn (0))
Rational functions, Taylor coecient of order n and gfun
rational function
f (z)
g(z)
→
gfun[diffeqtorec]
→
recurrence equations
recurrence equations
→ gfun[rectoproc]
→ procedure Proc(n)
X
P (z, 1)
P (z, t)
fbn(b) z n =
Fb (z, t) =
and
Fb (z, 1) =
,
Q(z, t)
Q(z, 1)
n≥0
X
∂
P 0 (z, 1) P (z, 1)Q0t (z, 1)
n
E(z) =
ηn z =
Fb (z, t)
−
= t
∂t
Q(z, 1)
Q2 (z, 1)
t=1
n
where,
P (z, t)
Sn (0)
of length
and
n
I
(b)
fbn = P(Sn (0))
I
ηn
is the
are polynoms, and, in a random sequence
no occurrence of b,
Q(z, t)
with
unconditionned probability of the expectation of the
count of putative-hit positions
Rational functions, Taylor coecient of order n and gfun
rational function
f (z)
g(z)
→
gfun[diffeqtorec]
→
recurrence equations
recurrence equations
→ gfun[rectoproc]
→ procedure Proc(n)
X
P (z, 1)
P (z, t)
fbn(b) z n =
Fb (z, t) =
and
Fb (z, 1) =
,
Q(z, t)
Q(z, 1)
n≥0
X
∂
P 0 (z, 1) P (z, 1)Q0t (z, 1)
n
E(z) =
ηn z =
Fb (z, t)
−
= t
∂t
Q(z, 1)
Q2 (z, 1)
t=1
n
where,
P (z, t)
Sn (0)
of length
and
n
I
(b)
fbn = P(Sn (0))
I
ηn
is the
are polynoms, and, in a random sequence
no occurrence of b,
Q(z, t)
with
unconditionned probability of the expectation of the
count of putative-hit positions
I
Conditionned expectation: ηen = ηn fbn(b)
An unexpected behaviour
(A→C)
ηn = E(Hn
(C→A)
) + E(Hn
(y)
fbn = P(|Sn (0)|y = 0)
)
(y
n
b = ACAC
b0 = AACC
1
ν(A) = ν(C) =
2
πA→C = πC→A
is
b
or
b0 )
n
∂Fb (z, t) E(H) =
∂t t=1
An unexpected behaviour
(A→C)
ηn = E(Hn
(C→A)
) + E(Hn
(y
n
b = ACAC
.
(y)
ηen = ηn fbn
(y)
fbn = P(|Sn (0)|y = 0)
)
b0 = AACC
1
ν(A) = ν(C) =
2
πA→C = πC→A
is
b
or
b0 )
n
∂Fb (z, t) E(H) =
∂t t=1
n
A proof by singularity analysis
P (z, t)
P (z, t) and Q(z, t)
Q(z, t)
X
P (z, 1)
fbn(b) z n =
Fb (z, 1) =
Q(z, 1)
Fb (z, t) =
polynomials
n≥0
(b)
fbn
probability that
E(z) =
X
n≥0
Sn (0)
E(Hn )z n =
has
no occurrence of b.
Px0 (z, 1) P (z, 1)Q0x (z, 1)
−
Q(z, 1)
Q2 (z, 1)
A proof by singularity analysis
P (z, t)
P (z, t) and Q(z, t)
Q(z, t)
X
P (z, 1)
fbn(b) z n =
Fb (z, 1) =
Q(z, 1)
Fb (z, t) =
polynomials
n≥0
(b)
fbn
probability that
E(z) =
X
n≥0
The
Sn (0)
E(Hn )z n =
has
no occurrence of b.
Px0 (z, 1) P (z, 1)Q0x (z, 1)
−
Q(z, 1)
Q2 (z, 1)
dominant singularity τ
is the smallest positive solution of
Q(z, 1) = 0. Use suitable Cauchy integrals

 fbn(b) = ψ × τ −(n−1) (1 + O (B n )) ,

(B < 1)
E(Hn ) = [z n ]E(z) = τ −n (φ1 ×n + φ2 ) × (1 + O (B n ))
e n ) = E(Hn ) = (c1 × n + c2 ) × (1 + O (B n )) , (B < 1).
=⇒ E(H
(b)
fbn
General case
Compute
Fb (z, tA→C , tA→G , tA→T , tC→A , . . . , tT→C , tT→G )
(b)
fbn = [z n ]Fb (z, 1, 1, . . . , 1, 1)
n
Pn ≈ [z ]
X
α6=β∈{A,C,G,T}
dominant singularities of all the terms of the sum are
equal to the dominant singularity of Fb (z, 1, 1, . . . , 1, 1)
Pn behaves quasi-linearly
I The
I
∂Fb (z, 1, . . . , 1, πα→β tα→β , 1, . . . ) fbn(b)
∂tα→β
tα→β =1
(from Behrens-Nicaud-P.N., JCB 19,5, 2012)

Download Report

Language Approach

Paperzz.com

Your Paperzz