Introduction
Using PMC
Applications
Pattern Markov Chains (PMC) to study the
distribution of patterns in Markovian sequences
Grégory Nuel
Laboratoire Statistique et Génome
University of Evry, CNRS (8071), INRA (1152)
France
Algorithmique, combinatoire du texte
et applications en bio-informatique,
Marne-la-Vallée,
September 26-28, 2007
G. Nuel
PMC for patterns in Markovian sequences
Outline
1. Introduction
   Motivation and notations
   Automata and languages
   Pattern Markov Chains
2. Using PMC
   Exact results
   Asymptotic approximations
3. Applications
   A toy example
   TATA-box
   PROSITE signatures
Examples of functional patterns in DNA
Example (patterns from Bacillus subtilis)
cgcg is a restriction site: restriction enzymes cut DNA at this site
  ⇒ negative selection, cgcg should be rare
ccgct is a crossover hotspot instigator (CHI): endonucleases degrade open DNA but are inactivated when they encounter the CHI
  ⇒ positive selection, ccgct should be frequent
Conclusion
functional pattern ⇒ pattern with exceptional frequency
Does the converse (⇐) hold?
Finding patterns with exceptional frequencies
pattern   functional   frequency   expected    p-value
ctat      no           5484        5393.20     0.15
cgcg      yes          8475        10598.46    $10^{-93}$
acac      no           10353       10266.51    0.35
ctatg     no           2310        2392.02     0.06
ccgct     yes          5690        4420.48     $10^{-75}$
gattt     no           9290        9369.56     0.34
Idea
Take into account composition bias in the sequence and pattern structure with a background model used to compute p-values
Notations
$X = X_1 \ldots X_\ell$ is an order $m$ Markov chain over the alphabet $A$ of size $k$ (e.g. $A = \{a, c, g, t\}$), with $\mu_0$ and $\pi$ such that, for all $a_1, \ldots, a_m, b \in A$ and all $m < t \leq \ell$:
$$\mu_0(a_1 \ldots a_m) = P(X_1 = a_1, \ldots, X_m = a_m)$$
$$\pi(a_1 \ldots a_m, b) = P(X_t = b \mid X_{t-m} = a_1, \ldots, X_{t-1} = a_m)$$
$w$ is a word (e.g. $w$ = ccgct) and $W$ a pattern, i.e. a set of words (e.g. $W$ = cc[ag]ct = {ccact, ccgct}). The number of occurrences of $W$ in $X$ is
$$N_X(W) = \sum_{i=1}^{\ell} \mathbb{I}\{W \text{ ends in position } i\}$$
Main objective
To study the distribution of $N_X(W)$
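As a minimal sketch of this definition (a naive count, not the method of the talk), the number of occurrences can be obtained by testing every end position. The function name `count_occurrences` and the sample sequence are assumptions made for illustration only.

```python
def count_occurrences(x: str, pattern: set[str]) -> int:
    """Number of positions i of x at which some word of the pattern ends.

    Overlapping occurrences are all counted, as in the definition of N_X(W).
    """
    return sum(
        any(x.endswith(w, 0, i) for w in pattern)  # does a word of W end at position i?
        for i in range(1, len(x) + 1)
    )


# W = cc[ag]ct = {ccact, ccgct}
W = {"ccact", "ccgct"}
sequence = "accgctccactccgcta"  # hypothetical sequence, for illustration only
n = count_occurrences(sequence, W)
```

The cost of this naive count grows with the number of words in the pattern, which is the limitation that the automaton approach is meant to remove.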
State of the art
Exact methods:
  Combinatorial (Pevzner et al. 1989, Régnier & Szpankowski 1998, Robin & Daudin 1999)
  FMCI, finite Markov chain imbedding (Fu & Koutras 1994, Fu 1996)
  Exponential families (Stefanov & Pakes 1997, Nicodème et al. 2002, Stefanov & Crochemore 2003)
  ...
Asymptotic approximations:
  Gaussian (Cowan 1991, Kleffe & Borodovsky 1992, Prum et al. 1995)
  Binomial (van Helden et al. 1998, Nuel 2005)
  Poisson (Chryssaphinou & Papastavridis 1988, Arratia et al. 1990, Geske et al. 1995, Schbath 1995)
  Large deviations (Nuel 2001, Nuel 2004, Pudlo 2004)
Pattern cardinality
Example
DNA: $W$ = act.(0-3)tta ⇒ $|W| = \sum_{i=0}^{3} 4^i = 85$
protein: $W$ = W.WH.CH.H[YN]HS[MI][DE] ⇒ $|W| = 64\,000$
cardinality can reach $|W| = 10^{30}$ for some patterns!
Problem
Most pattern methods have a complexity that is linear (or worse) in the cardinality $|W|$ ⇒ high cardinality patterns cannot be handled
A solution
Inspired by pattern matching, Nicodème et al. (2002) and Stefanov & Crochemore (2003) proposed to overcome the cardinality problem by using automata to compute moment-generating functions
Our goal: extend this approach to most classical methods
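The two cardinalities of the example can be checked with a few lines of arithmetic. This is only a sketch; the variable names are assumptions, and the alphabet sizes (4 nucleotides, 20 amino acids) are the usual ones.

```python
# DNA pattern act.(0-3)tta: the gap holds 0, 1, 2 or 3 arbitrary letters
# taken from a 4-letter alphabet, hence 4^0 + 4^1 + 4^2 + 4^3 words.
dna_cardinality = sum(4**i for i in range(0, 4))

# Protein pattern W.WH.CH.H[YN]HS[MI][DE]: three free positions over the
# 20 amino acids and three positions with two allowed letters each.
protein_cardinality = 20**3 * 2**3
```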
Deterministic finite-state automaton
Definition (DFA)
Let $A$ be a finite alphabet, $Q$ a finite set of states, $s \in Q$ a starting state, $F \subset Q$ a subset of final states and $\delta : Q \times A \to Q$ a transition function; then $(A, Q, s, F, \delta)$ is a DFA.
Example
$Q = \{0, \ldots, 4\}$, $s = 0$,
$A = \{a, b\}$, $F = \{4\}$,
$\delta(0, a) = 1$, $\delta(0, b) = 0$,
$\delta(1, a) = 1$, $\delta(1, b) = 2$,
$\delta(2, a) = 3$, $\delta(2, b) = 0$,
$\delta(3, a) = 1$, $\delta(3, b) = 4$,
$\delta(4, a) = 1$, $\delta(4, b) = 0$
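The example DFA can be written down directly as a dictionary. The sketch below assumes one simple encoding (states as integers, `delta` keyed by state and letter); the helper names `run` and `accepts` are illustrative.

```python
from typing import Dict, Set, Tuple

# Transition function of the example, copied from the definition above
delta: Dict[Tuple[int, str], int] = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 0,
    (3, "a"): 1, (3, "b"): 4,
    (4, "a"): 1, (4, "b"): 0,
}
start: int = 0
final: Set[int] = {4}


def run(word: str) -> int:
    """State reached from the starting state after reading the whole word."""
    state = start
    for letter in word:
        state = delta[(state, letter)]
    return state


def accepts(word: str) -> bool:
    """A word is accepted when reading it ends in a final state."""
    return run(word) in final
```

For instance, `accepts("abab")` is true because the path 0, 1, 2, 3, 4 ends in the final state, while reading `"ababab"` ends in state 2.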
Regular (rational) language and DFA
Definition (Regular language)
Any $L \subset A^*$ (where $A^*$ is the set of all texts over $A$) is called a language. A language is regular if it can be obtained from finite subsets of $A^*$ by a finite number of regular operations (concatenation $\times$, union $+$ and Kleene star $*$).
Theorem
For any regular language $L$ there exists a unique (up to a renaming of the states) smallest DFA which accepts $L$, meaning that $L$ is the set of all texts labelling a path from $s$ to $F$.
From regular language to smallest DFA
Thompson's construction: regular language to NFA (nondeterministic finite automaton)
determinization: NFA to DFA
minimization: DFA to smallest DFA
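For a pattern given as a finite set of words, a DFA recognizing $A^* W$ can also be built directly, without going through the three steps above. The sketch below uses a swapped-in technique: a prefix-based (Aho-Corasick style) construction, not Thompson's construction, and the resulting DFA is not guaranteed to be the smallest one. The function name `build_dfa` and the string encoding of the states are assumptions.

```python
def build_dfa(alphabet: str, words: set[str]):
    """DFA recognizing A*W for a finite set of words W.

    States are the prefixes of the words of W. Reading a letter moves to the
    longest suffix of (state + letter) that is still a prefix of some word.
    """
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    delta = {}
    for p in prefixes:
        for a in alphabet:
            t = p + a
            while t not in prefixes:  # terminates: the empty prefix is always a state
                t = t[1:]
            delta[(p, a)] = t
    # A state is final when a word of W ends there
    final = {p for p in prefixes if any(p.endswith(w) for w in words)}
    return prefixes, "", final, delta


states, start, final, delta = build_dfa("ab", {"aba"})

# Count the visits to a final state while reading a text
state, visits = start, 0
for letter in "babbababbababaabaaabab":
    state = delta[(state, letter)]
    visits += state in final
```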
Connection between DFA and patterns
Example
$A = \{a, b\}$, $X$ = babbababbababaabaaabab,
$W = \{aba\}$, $L = A^* W$
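[Figure: DFA recognizing $L = A^* W$, with states 0 to 3, starting state 0 and final state 3; transitions $\delta(0, a) = 1$, $\delta(0, b) = 0$, $\delta(1, a) = 1$, $\delta(1, b) = 2$, $\delta(2, a) = 3$, $\delta(2, b) = 0$, $\delta(3, a) = 1$, $\delta(3, b) = 2$]
Reading $X$ letter by letter from state 0 gives the sequence of states $Y$:
X - b a b b a b a b b a b a b a a b a a a b a b
Y 0 0 1 2 0 1 2 3 2 0 1 2 3 2 3 1 2 3 1 1 2 3 2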
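The sequence of states can be reproduced with a short sketch. The transitions are those of the DFA for $W = \{aba\}$; the dictionary encoding and the function name `trace` are assumptions.

```python
# DFA recognizing A*W for W = {aba}: states 0..3, start 0, final state 3
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 0,
    (3, "a"): 1, (3, "b"): 2,
}


def trace(x: str, start: int = 0) -> list[int]:
    """Sequence Y_0 ... Y_l of states visited while reading x."""
    y = [start]
    for letter in x:
        y.append(delta[(y[-1], letter)])
    return y


X = "babbababbababaabaaabab"
Y = trace(X)
# W ends in position i of X exactly when Y_i is the final state 3
end_positions = [i for i, q in enumerate(Y) if q == 3]
```

Here the final state is visited five times, so the pattern has five (overlapping) occurrences in $X$.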
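Independent case
Theorem (PMC in case m = 0)
Let $X = X_1 \ldots X_\ell$ be an order 0 MC (parameter $\mu$), $W$ a pattern and $(A, Q, s, F, \delta)$ a DFA that recognizes $L = A^* W$. Then the PMC $Y = Y_0 \ldots Y_\ell$ defined by $Y_0 = s$ and $Y_i = \delta(Y_{i-1}, X_i)$ is an order 1 MC with transition matrix $\Pi$ given by
$$\Pi(p, q) = \begin{cases} \mu(a) & \text{if } \delta(p, a) = q \\ 0 & \text{if } q \notin \delta(p, A) \end{cases}$$
(when several letters $a$ lead from $p$ to $q$, their probabilities $\mu(a)$ add up) and having the following property:
$W$ ends in position $i$ in $X$ $\iff$ $F$ appears in position $i$ in $Y$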
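A minimal sketch of the theorem for $m = 0$, assuming the DFA of $W = \{aba\}$ and an arbitrary letter distribution; the function name `pmc_matrix` and the numerical values of `mu` are hypothetical.

```python
# DFA recognizing A*W for W = {aba}: states 0..3, start 0, final state 3
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 0,
    (3, "a"): 1, (3, "b"): 2,
}


def pmc_matrix(delta, n_states, mu):
    """Transition matrix of the PMC when the letters are independent (m = 0)."""
    Pi = [[0.0] * n_states for _ in range(n_states)]
    for (p, a), q in delta.items():
        Pi[p][q] += mu[a]  # probabilities add up if several letters lead from p to q
    return Pi


mu = {"a": 0.3, "b": 0.7}  # hypothetical letter distribution
Pi = pmc_matrix(delta, 4, mu)
```

Each row has at most one non-zero term per letter, which is why the matrix is sparse.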
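Properties of PMC
the state space is the same as for the DFA (cardinality $L$, the number of states)
the transition matrix is sparse (at most $k \times L$ non-zero terms)
the transition matrix decomposes naturally as $\Pi = P + Q$ (where $Q$ contains the transitions ending in $F$)
Example
Pattern $W = \{aba\}$ over the binary alphabet $A = \{a, b\}$:
[Figure: DFA recognizing $A^* W$ with states 0 to 3 and final state 3]
$$\Pi = \begin{pmatrix} \mu_b & \mu_a & 0 & 0 \\ 0 & \mu_a & \mu_b & 0 \\ \mu_b & 0 & 0 & \mu_a^* \\ 0 & \mu_a & \mu_b & 0 \end{pmatrix}$$
where the starred term belongs to $Q$ and all the other terms belong to $P$.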
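Order m Markov chain (1)
Definition (m-ambiguous DFA)
A DFA is called $m$-ambiguous if there exists $q \in Q$ such that
$$\delta^{-m}(q) = \{a_1 \ldots a_m \in A^m, \ \exists p \in Q, \ \delta^m(p, a_1 \ldots a_m) = q\}$$
has cardinality $\geq 2$, where $\delta^m(p, a_1 \ldots a_m)$ denotes the state reached from $p$ after reading $a_1 \ldots a_m$.
Theorem (PMC in general case)
Let $X = X_1 \ldots X_\ell$ be an order $m$ MC (parameter $\pi$), $W$ a pattern and $(A, Q, s, F, \delta)$ a non $m$-ambiguous DFA that recognizes $L = A^* W$. Then the PMC $Y = Y_m \ldots Y_\ell$ defined by $Y_0 = s$ and $Y_i = \delta(Y_{i-1}, X_i)$ is an order 1 MC with transition matrix $\Pi$:
$$\Pi(p, q) = \begin{cases} \pi(\delta^{-m}(p), b) & \text{if } \delta(p, b) = q \\ 0 & \text{if } q \notin \delta(p, A) \end{cases}$$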
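The definition of $m$-ambiguity can be checked by brute force on a small DFA. The sketch below enumerates all words of length $m$ from all states, so its cost grows as $k^m$; it is only an illustration of the definition, and the names `delta_inverse` and `is_ambiguous` are assumptions. The DFA used is the one for $W = \{aba\}$ over $\{a, b\}$.

```python
from itertools import product

# DFA recognizing A*W for W = {aba}: states 0..3
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 0,
    (3, "a"): 1, (3, "b"): 2,
}
states = [0, 1, 2, 3]
alphabet = "ab"


def delta_inverse(delta, states, alphabet, m):
    """For each state q, the set of words of length m that can lead to q."""
    inverse = {q: set() for q in states}
    for p in states:
        for word in product(alphabet, repeat=m):
            q = p
            for a in word:
                q = delta[(q, a)]
            inverse[q].add("".join(word))
    return inverse


def is_ambiguous(delta, states, alphabet, m):
    """True when some state can be reached by at least two words of length m."""
    inverse = delta_inverse(delta, states, alphabet, m)
    return any(len(words) >= 2 for words in inverse.values())


past_of_order_1 = delta_inverse(delta, states, alphabet, 1)
past_of_order_2 = delta_inverse(delta, states, alphabet, 2)
```

With this DFA every state is entered by a single letter, so it is non 1-ambiguous and can be used directly with an order 1 model; state 1 can however be reached by both `aa` and `ba`, so it is 2-ambiguous.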
Order m Markov chain (2)
Proposition
One can produce a non $m$-ambiguous DFA from a non $(m-1)$-ambiguous one by duplicating the ambiguous states (a complete algorithm is available but not detailed here).
Example ($A = \{a, b, c\}$, pattern $W = \{aaa, aba, aca\}$)
[Figure: initial DFA for $W$ (states 0 to 5). Since $\delta^{-1}(0) = \{b, c\}$ and $\delta^{-1}(3) = \{b, c\}$, this DFA is 1-ambiguous.]
[Figure: DFA obtained after duplicating the ambiguous states (states 0 to 7). It is non 1-ambiguous, so that for instance $\Pi(0, 1) = \pi(c, a)$, but it is 2-ambiguous since $\delta^{-2}(5) = \{ba, ca\}$.]
[Figure: DFA obtained after a further duplication of the ambiguous states (states 0 to 11). It is non 2-ambiguous, so that for instance $\Pi(3, 5) = \pi(ac, a)$, but it is 3-ambiguous since $\delta^{-3}(3) = \{aac, bac, cac\}$.]
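Renewal occurrences
Proposition (renewal)
If the DFA $(A, Q, s, F, \delta)$ counts the overlapping occurrences of $W$, then the DFA $(A, Q, s, F, \delta')$ counts the renewal (non-overlapping) ones, with
$$\delta'(p, a) = \begin{cases} \delta(p, a) & \text{if } p \notin F \\ \delta(s, a) & \text{if } p \in F \end{cases}$$
Example
[Figure: the DFA counting the overlapping occurrences of $W = \{aba\}$ (states 0 to 3, final state 3) and its renewal version, in which the transitions leaving state 3 are replaced by those leaving state 0]
X  - b a b b a b a b b a b a b a a b a a a b a b
Y  0 0 1 2 0 1 2 3 2 0 1 2 3 2 3 1 2 3 1 1 2 3 2
Y′ 0 0 1 2 0 1 2 3 0 0 1 2 3 0 1 1 2 3 1 1 2 3 0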
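The renewal transformation is a one-line change of the transition function. The sketch below applies it to the DFA of $W = \{aba\}$ and recomputes both traces; the function names `renewal` and `trace` are assumptions.

```python
# DFA counting the overlapping occurrences of W = {aba}
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 0,
    (3, "a"): 1, (3, "b"): 2,
}
start, final = 0, {3}


def renewal(delta, start, final):
    """Transition function delta': final states behave like the starting state."""
    return {
        (p, a): (delta[(start, a)] if p in final else q)
        for (p, a), q in delta.items()
    }


def trace(delta, x, start=0):
    """Sequence of states visited while reading x."""
    y = [start]
    for letter in x:
        y.append(delta[(y[-1], letter)])
    return y


X = "babbababbababaabaaabab"
Y = trace(delta, X)                                   # overlapping occurrences
Y_renewal = trace(renewal(delta, start, final), X)    # renewal occurrences
n_overlap = sum(q in final for q in Y)
n_renewal = sum(q in final for q in Y_renewal)
```

On this text the overlapping count is 5 and the renewal count is 4: the occurrence ending in position 14 overlaps the one ending in position 12 and is therefore not counted.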
Framework
$X = X_1 \ldots X_\ell$ order $m$ MC (parameters $\mu_0$ and $\pi$), $W$ a pattern, $N_X(W)$ its number of (overlapping or renewal) occurrences
⇓
$(A, Q, s, F, \delta)$ non $m$-ambiguous DFA that recognizes the (overlapping or renewal) occurrences of $W$
⇓
the PMC $Y = Y_m \ldots Y_\ell$ is an order 1 MC (parameters $\mu_0$ and $\Pi$) with $N_X(W) = N_Y(F)$
Waiting time
Proposition
For all $i \geq m$ and $q \in Q$, define the waiting time from $Y_i = q$ to the next occurrence by
$$\tau_i(q) = \inf\{t \geq 1, \ Y_{i+t} \in F\}$$
then for all $t \geq 1$ we have:
$$P(\tau_i(q) = t) = e_q P^{t-1} Q e_F^T$$
where $e_I$ is the indicator row-vector of $I \subset Q$.
Applications
study the distribution of a pattern along the sequence by computing a p-value for each occurrence
simulate the positions of the occurrences of a pattern without having to simulate the underlying sequence
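The waiting time formula only needs vector-matrix products. The sketch below evaluates it for the PMC of $W = \{aba\}$ with independent, equally likely letters (an assumption chosen to make the values easy to check by hand); the helper names `split`, `vec_mat` and `waiting_time_pmf` are illustrative, and dense lists are used for simplicity rather than a sparse representation.

```python
def split(Pi, final):
    """Decompose Pi = P + Q, where Q holds the transitions ending in a final state."""
    n = len(Pi)
    P = [[0.0 if q in final else Pi[p][q] for q in range(n)] for p in range(n)]
    Q = [[Pi[p][q] if q in final else 0.0 for q in range(n)] for p in range(n)]
    return P, Q


def vec_mat(u, M):
    """Row vector times matrix."""
    return [sum(u[p] * M[p][q] for p in range(len(u))) for q in range(len(M[0]))]


def waiting_time_pmf(Pi, final, q, t_max):
    """[P(tau = 1), ..., P(tau = t_max)] for the waiting time starting from state q."""
    P, Q = split(Pi, final)
    u = [1.0 if p == q else 0.0 for p in range(len(Pi))]  # e_q
    pmf = []
    for _ in range(t_max):
        hit = vec_mat(u, Q)                      # e_q P^(t-1) Q
        pmf.append(sum(hit[f] for f in final))   # ... times e_F^T
        u = vec_mat(u, P)                        # stay outside F for one more step
    return pmf


# PMC of W = {aba} with mu(a) = mu(b) = 0.5 (hypothetical parameters)
Pi = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
]
from_start = waiting_time_pmf(Pi, {3}, 0, 400)
after_occurrence = waiting_time_pmf(Pi, {3}, 3, 400)
```

Starting right after an occurrence (state 3), the next occurrence cannot end one position later but can end two positions later by reading `ba`, which has probability 0.25 here.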
Moments
Proposition
For all $a \in Q$, we denote by $N(a)$ the number of occurrences of state $a$ in $Y$ and we get:
$$E[N(a)] = \sum_{i=m}^{\ell} \mu_0 \Pi^{i-m} e_a^T$$
and for all $b \in Q$ we have:
$$E[N(a)N(b)] = \mathbb{I}_{a=b} E[N(a)] + C(a, b) + C(b, a)$$
with
$$C(a, b) = \sum_{i=m}^{\ell-1} (\mu_0 \Pi^{i-m} e_a^T) \sum_{j=i+1}^{\ell} (e_a \Pi^{j-i} e_b^T)$$
Complexities
$E[N(W)]$: $O(L)$ in space, $O(k \times L \times \ell)$ in time
$V[N(W)]$: $O(F \times \ell + L)$ in space, $O(k \times L \times F \times \ell)$ in time
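The expectation can be accumulated while the distribution of $Y_i$ is propagated one step at a time. The sketch below does this for $m = 0$, where the chain starts in $Y_0 = s$; it uses the PMC of $W = \{aba\}$ with equally likely letters (an assumption), and the names `vec_mat` and `expected_count` are illustrative. Only the expectation is covered, not the variance.

```python
def vec_mat(u, M):
    """Row vector times matrix."""
    return [sum(u[p] * M[p][q] for p in range(len(u))) for q in range(len(M[0]))]


def expected_count(Pi, start, final, length):
    """E[N(F)] for a PMC started in `start` and run for `length` steps (m = 0)."""
    u = [1.0 if p == start else 0.0 for p in range(len(Pi))]  # distribution of Y_0
    total = 0.0
    for _ in range(length):
        u = vec_mat(u, Pi)                  # distribution of Y_i
        total += sum(u[f] for f in final)   # P(Y_i in F)
    return total


# PMC of W = {aba} with mu(a) = mu(b) = 0.5 (hypothetical parameters)
Pi = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
]
expectation = expected_count(Pi, 0, {3}, 22)
```

For a sequence of length 22 this agrees with the direct computation $(22 - 2) \times 0.5^3 = 2.5$, since the word aba can end in any of the positions 3 to 22.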
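FMCI (1)
Definition
For all $c \in \mathbb{N}$ we define the FMCI (finite Markov chain imbedding) $Z = Z_m \ldots Z_\ell$ by
$$Z_j = \begin{cases} (Y_j, N_j) & \text{if } N_j < c \\ c & \text{if } N_j \geq c \end{cases}$$
where $N_j = N_{X_1 \ldots X_j}(W)$. $Z$ is an order 1 Markov chain and its transition matrix $T$ is defined by blocks of size $L$.
Example
With $c = 3$ we get
$$T = \begin{pmatrix} P & Q & 0 & 0 \\ 0 & P & Q & 0 \\ 0 & 0 & P & \Sigma Q \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
where $\Sigma Q$ denotes the column vector of the row sums of $Q$.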
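FMCI (2)
Proposition
For all $0 \leq i < c$ we get
$$P(N_\ell = i) = M_0 T^{\ell-m} E_i^T \quad \text{and} \quad P(N_\ell \geq c) = M_0 T^{\ell-m} E_c^T$$
with $M_0 = (\mu_0 \ 0 \ldots 0 \mid 0)$ and $E_i$ the block indicator of $i$.
Lemma
If $u = (u_0 \ldots u_{c-1} \mid u_c)$ is a vector of size $cL + 1$ then the in-place steps
$u_c = u_c + u_{c-1} \times \Sigma Q$
for $j = (c-1) \ldots 1$: $u_j = u_j \times P + u_{j-1} \times Q$
$u_0 = u_0 \times P$
update $u$ with $u \times T$ in $O(c \times k \times L)$.
⇒ this allows computing the distribution of $N_\ell$ in $O(c \times k \times L \times \ell)$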
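The block update of the lemma can be sketched as follows for $m = 0$, where the chain starts in $Y_0 = s$ with no occurrence. The PMC is that of $W = \{aba\}$ with equally likely letters (an assumption); the names `split`, `vec_mat` and `count_distribution` are illustrative, and dense lists are used instead of a sparse representation, so the stated complexity is not reached by this sketch.

```python
def split(Pi, final):
    """Decompose Pi = P + Q, where Q holds the transitions ending in a final state."""
    n = len(Pi)
    P = [[0.0 if q in final else Pi[p][q] for q in range(n)] for p in range(n)]
    Q = [[Pi[p][q] if q in final else 0.0 for q in range(n)] for p in range(n)]
    return P, Q


def vec_mat(u, M):
    """Row vector times matrix."""
    return [sum(u[p] * M[p][q] for p in range(len(u))) for q in range(len(M[0]))]


def count_distribution(Pi, start, final, length, c):
    """[P(N = 0), ..., P(N = c - 1), P(N >= c)] after `length` steps, for c >= 1."""
    n = len(Pi)
    P, Q = split(Pi, final)
    sigma_q = [sum(row) for row in Q]        # row sums of Q
    u = [[0.0] * n for _ in range(c)]        # blocks u_0 ... u_{c-1}
    u[0][start] = 1.0
    tail = 0.0                               # u_c
    for _ in range(length):
        tail += sum(x * s for x, s in zip(u[c - 1], sigma_q))
        for j in range(c - 1, 0, -1):        # descending, so u_{j-1} is still the old block
            u[j] = [x + y for x, y in zip(vec_mat(u[j], P), vec_mat(u[j - 1], Q))]
        u[0] = vec_mat(u[0], P)
    return [sum(block) for block in u] + [tail]


# PMC of W = {aba} with mu(a) = mu(b) = 0.5 (hypothetical parameters)
Pi = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
]
dist = count_distribution(Pi, 0, {3}, 22, 12)
mean = sum(i * p for i, p in enumerate(dist))
```

A p-value for an observed count $n$ is then obtained by taking $c = n$ and reading the last entry, $P(N_\ell \geq n)$.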
Summary for exact methods
Achievements
waiting time ⇒ distribution of the occurrences along the sequence
expectation and variance ⇒ Gaussian approximations
FMCI ⇒ distribution of $N(W)$
Remarks
the central complexity is $O(k \times L \times \ell)$
the complexity is the same for heterogeneous Markov models
non-stationary regions can be taken into account
Reliability of the approximations
Gaussian:
  + works with heterogeneous models
  − not reliable for significant patterns
binomial/Poisson:
  + very simple approximations
  − do not work for overlapping patterns
geometric/compound Poisson:
  + work with overlapping patterns
  − p-value computations are delicate
Large deviations:
  + best approximation for significant patterns
  − require sophisticated numerical linear algebra
Theoretical complexities
method                      time                      space
exact                       $O(nkL\ell)$              $O(kL + nL)$
Gaussian (exact)            $O(kLF\ell)$              $O(kL + F\ell)$
Gaussian (approx.)          $O(kLF \log \ell)$        $O(kL + F \log \ell)$
binomial                    $O(kL + \log n)$          $O(kL)$
Poisson                     $O(kL + \log n)$          $O(kL)$
geometric Poisson           $O(kL + n)$               $O(kL)$
compound Poisson            $O(kL + F^3 + n^3)$       $O(kL + F^2 + n^2)$
large deviations            $O(kL)$                   $O(kL)$
precise large deviations    $O(kL)$                   $O(kL)$
with $k = |A|$, $n$ observed occurrences, $\ell$ sequence length, $L$ number of PMC states and $F$ number of final states.
Running time
method                      h=2     h=3     h=4     h=5     h=6     h=7
exact                       20.66   15.63   17.07   19.31   25.02   43.68
Gaussian approx             0.00    0.00    0.00    0.01    0.03    0.11
binomial                    0.00    0.00    0.00    0.01    0.03    0.09
geometric Poisson           0.00    0.00    0.01    0.01    0.04    0.12
large deviations            0.02    0.07    0.42    2.22    9.34    27.16
precise large deviations    0.06    0.31    1.82    9.33    38.25   100.97
Time (in s) to treat all DNA words of length $h$ ($L = h + 3$, $F = 1$) over the HIV type 1 genome ($\ell$ = 10 kb) with the model M1.
Summary for asymptotic approximations
Gaussian approximations are not reliable in the extremes but efficient in the center of the distribution
Binomial/Poisson approximations are simple but efficient as rough heuristics
Geometric/compound Poisson approximations perform well but have a high algorithmic cost for large $F$ or large $n$
Precise large deviations are best in the extremes but less reliable in the center of the distribution
A strategy to test a large set of patterns
first stage: filter out the non-significant patterns using Gaussian approximations
second stage: refine the computations on the significant patterns only, using either the exact method or precise large deviations
Example
We consider the pattern $W_k = ab A^k aa A^k ab$ over the binary alphabet $A = \{a, b\}$ and build its non 1-ambiguous PMC
k     |W_k|        L        F       time (s)
1     4            12       1       0.00
2     16           27       3       0.00
3     64           57       6       0.00
4     256          122      13      0.01
5     1 024        262      28      0.01
6     4 096        562      60      0.02
7     16 384       1 207    129     0.06
8     65 536       2 592    277     0.23
9     262 144      5 567    595     0.91
10    1 048 576    11 957   1 278   3.85
11    4 194 304    25 682   2 745   16.89
Strict TATA-box
Example
We consider the well known DNA pattern
$W$ = ttgaca.(17)tataat ($|W| = 4^{17} \simeq 1.7 \times 10^{10}$)
m     L        F     PMC time (s)   overall time (s)
0     764      34    0.04           0.18
1     1002     34    0.05           0.19
2     2068     34    0.09           0.25
3     5738     34    0.23           0.40
4     17344    34    0.65           0.92
PMC time: time to build the PMC. Overall time: total time to build the PMC, to count occurrences on Escherichia coli K12 (4.6 Mb) and to perform a Gaussian approximation.
Relaxed TATA-box
Example
We consider now the DNA pattern
$W$ = ttgaca.(15-19)tataat ($|W| \simeq 3.7 \times 10^{11}$)
m     L        F     PMC time (s)   overall time (s)
0     1474     55    0.12           0.28
1     1866     55    0.14           0.30
2     3553     55    0.21           0.37
3     9601     55    0.46           0.66
4     28951    55    1.13           1.49
PMC time: time to build the PMC. Overall time: total time to build the PMC, to count occurrences on Escherichia coli K12 (4.6 Mb) and to perform a Gaussian approximation.
Example
PROSITE1 = W.WH.CH.H[YN]HS[MI][DE]
PROSITE2 = [LIVMF]GE.[GAS][LIVM].(5-11)R[STAQ]A.[LIVMA].[STACV]
PROSITE3 = CG(2).(4-7)G.(3)C.(4-5)C.(3-5)[NHGS].[FYWMI].(2)Q
pattern             PROSITE1       PROSITE2                PROSITE3
|W|                 64000          $5.2 \times 10^{20}$    $2.0 \times 10^{31}$
L                   21 (96)        329 (1393)              16064 (66620)
F                   1 (2)          30 (78)                 163 (163)
PMC time (s)        0.00 (0.00)    0.07 (0.14)             57.79 (63.10)
overall time (s)    2.68 (2.89)    2.98 (3.09)             60.06 (67.04)
Time to build the PMC with $m = 0$ ($m = 1$ in parentheses), to count occurrences on Uniprot ($\ell \simeq 86 \times 10^6$) and to perform a Gaussian approximation.
Summary
What have we done?
Thanks to the PMC we have a simple and optimal way to consider any pattern problem
Most pattern methods easily adapt to this new framework and take full advantage of its optimality
Possibility to treat highly degenerate patterns such as structured motifs, PROSITE signatures, etc.
Outlook
Take into account parameter estimation
Approximate occurrences of patterns
Joint or conditional distributions
References
Outlook
SPatt (Statistics for Patterns) freely available at: http://stat.genopole.cnrs.fr/spatt
G. Nuel. PMC pour l'étude des occurrences de motifs dans les séquences markoviennes. HDR, University of Evry, 2006.
G. Nuel and B. Prum. Analyse statistique des séquences biologiques: modélisation markovienne, alignements et motifs. Hermes, 2007.
G. Nuel. Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. Journal of Applied Probability, revised.