Point processes in biological sequence analysis

Stochastics Meeting Lunteren 2007
Probabilistic analysis of simple models
Niels Richard Hansen
University of Copenhagen
Department of Mathematical Sciences
Outline
Probability theory
- Some biological motivation. Intro to the ACGT/ACGU of life.
- A concrete probabilistic analysis of the occurrence of stem-loop
motifs in iid sequences.
Statistical applications
- The point process limit point-of-view and general biological motifs.
- Three statistical examples.
AUCG's and ATCG's
[Figure: RNA (ribonucleic acid) and DNA (deoxyribonucleic acid) side by side, showing the sugar-phosphate backbone, the base pairs and the nitrogenous bases cytosine (C), guanine (G), adenine (A), uracil (U, in RNA) and thymine (T, in DNA). Uracil replaces thymine in RNA. Source: National Human Genome Research Institute, Division of Intramural Research, National Institutes of Health.]
RNA molecular structure
Let-7 (precursor) from C. elegans.
UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA
[Figure: stem-loop secondary structure of the let-7 precursor sequence above, with the base-paired stem (pairs such as UA, GC, CG, AU, GU, UG) and the unpaired loop and bulge nucleotides.]
Let-7 is a member of the family of microRNAs, which terminate or inhibit the
translation of mRNA to protein. The precursor is embedded as a gene in
the DNA; we want to find genes with similar structure.
A scoring approach
With $X = X_1, \ldots, X_n$ a sequence with letters in $E$, a stem-loop is a set of
indices
\[ z = \{(i_1, j_1), \ldots, (i_m, j_m)\} \]
fulfilling $1 \le i_m < \ldots < i_1 < j_1 < \ldots < j_m \le n$.
Large loops and mismatching nucleotides are not favorable. We
introduce a scoring function $f : E \times E \to \mathbb{R}$ and the score
\[ S(z, X) = \sum_{(i,j) \in z} f(X_i, X_j) + \alpha r_z + \beta l_z \]
where $r_z$ and $l_z$ are the number and total length, respectively, of gaps in $z$,
and $\alpha, \beta \in (-\infty, 0)$. See [4].
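The score S(z, X) can be evaluated directly from its definition. The following is a minimal sketch, not the implementation from [4]. It assumes that a gap is any run of skipped positions between two consecutive pairs, counted separately on the 5' side and the 3' side. The function names and the values returned by base_pair_score are hypothetical choices made for illustration only.

```python
from typing import Callable, Iterable, Sequence, Tuple


def base_pair_score(x: str, y: str) -> float:
    """Hypothetical f: +2 for Watson-Crick pairs, +1 for G-U wobble, -1 otherwise."""
    if (x, y) in {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}:
        return 2.0
    if (x, y) in {("G", "U"), ("U", "G")}:
        return 1.0
    return -1.0


def stem_loop_score(
    z: Iterable[Tuple[int, int]],
    seq: Sequence[str],
    f: Callable[[str, str], float],
    alpha: float,
    beta: float,
) -> float:
    """Return S(z, X) = sum of f over pairs + alpha * r_z + beta * l_z.

    Indices in z are 1-based, as on the slide. Assumption: each run of
    skipped positions between consecutive pairs is one gap, on either side.
    """
    pairs = sorted(z, key=lambda p: p[1] - p[0])  # innermost pair first
    if not pairs or pairs[0][0] >= pairs[0][1]:
        raise ValueError("z must contain pairs (i, j) with i < j")
    if pairs[-1][0] < 1 or pairs[-1][1] > len(seq):
        raise ValueError("indices must lie in 1..n")
    score = sum(f(seq[i - 1], seq[j - 1]) for i, j in pairs)
    gap_count = 0   # r_z
    gap_length = 0  # l_z
    for (i_in, j_in), (i_out, j_out) in zip(pairs, pairs[1:]):
        if not (i_out < i_in and j_in < j_out):
            raise ValueError("pairs must be nested")
        for skipped in (i_in - i_out - 1, j_out - j_in - 1):
            if skipped > 0:
                gap_count += 1
                gap_length += skipped
    return score + alpha * gap_count + beta * gap_length


sequence = "GGGAAACCC"
full_stem = stem_loop_score([(1, 9), (2, 8), (3, 7)], sequence, base_pair_score, -2.0, -0.5)
gapped_stem = stem_loop_score([(1, 9), (3, 7)], sequence, base_pair_score, -2.0, -0.5)
```

In this toy example the full stem has three G-C pairs and no gaps, while the second stem-loop skips one position on each side and pays for two gaps of length one.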
A simplified scoring approach
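\[
X_1 \ldots X_{i-1}
\ \underbrace{X_i \ldots X_{i+\delta}}_{\text{5'-stem, } \delta+1}
\ \underbrace{X_{i+\delta+1} \ldots X_{j-\delta-1}}_{\text{hairpin-loop, } j-i-2\delta-1}
\ \underbrace{X_{j-\delta} \ldots X_j}_{\text{3'-stem, } \delta+1}
\ X_{j+1} \ldots X_n .
\]
We introduce here the penalty function $g : \mathbb{N} \to (-\infty, 0]$, the scoring
function $f : E \times E \to \mathbb{R}$ and the triangular matrix
\[ T_{i,j} = \max_{\delta} \Big\{ \sum_{k=0}^{\delta} Z_{i+k,j-k} + g(j - i - 2\delta - 1) \Big\}, \qquad i < j, \]
\[ Z_{i,j} = f(X_i, X_j). \]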
A recursion
Initialization: $T_{k,k} = g(1)$ and $T_{k,k+1} = g(2)$.
Recursion:
\[ T_{i,j} = \max\{T_{i+1,j-1} + Z_{i,j},\ g(j - i + 1)\}, \qquad Z_{i,j} = f(X_i, X_j). \]
[Figure: score matrix for the example RNA sequence CAAGUACCCUCA, with the structure index $i$ (letters $X_i$ at positions $k-1, \ldots, k-5$) on one axis, the index $j$ (letters $X_j$ at positions $k+1, \ldots, k+6$) on the other, and the scores $T_{i,j}$ as entries.]
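The recursion and its initialization can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the implementation behind the slides: indices are 0-based, and both the pair score toy_pair_score and the linear penalty linear_penalty are hypothetical choices.

```python
from typing import Callable, List, Optional


def toy_pair_score(x: str, y: str) -> float:
    """Hypothetical f: +1 for Watson-Crick pairs, -1 otherwise."""
    return 1.0 if (x, y) in {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")} else -1.0


def linear_penalty(m: int) -> float:
    """Hypothetical penalty g(m) = -m, which takes values in (-inf, 0]."""
    return -float(m)


def stack_scores(
    seq: str,
    f: Callable[[str, str], float],
    g: Callable[[int], float],
) -> List[List[Optional[float]]]:
    """Fill the triangular matrix T[i][j], i <= j, with 0-based indices.

    Initialization: T[k][k] = g(1) and T[k][k+1] = g(2).
    Recursion: T[i][j] = max(T[i+1][j-1] + f(X_i, X_j), g(j - i + 1)).
    Entries below the main diagonal are left as None.
    """
    n = len(seq)
    T: List[List[Optional[float]]] = [[None] * n for _ in range(n)]
    for k in range(n):
        T[k][k] = g(1)
        if k + 1 < n:
            T[k][k + 1] = g(2)
    for d in range(2, n):          # d = j - i, processed in increasing order
        for i in range(n - d):
            j = i + d
            T[i][j] = max(T[i + 1][j - 1] + f(seq[i], seq[j]), g(d + 1))
    return T


scores = stack_scores("GGGAAACCC", toy_pair_score, linear_penalty)
```

Processing the cells in order of increasing j - i guarantees that T[i+1][j-1] is available when T[i][j] is computed. For the toy sequence, the three stacked G-C pairs around the AAA loop give the largest entry at scores[0][8].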
Arratia, Goldstein and Gordon
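Theorem (Thm. 1, [1]): If $N^n = \sum_{a \in I} V_a$ and if
\[ \beta_{1,n} = \sum_{a \in I,\, b \in B_a} E(V_a) E(V_b) \to 0, \]
\[ \beta_{2,n} = \sum_{a \in I,\, b \in B_a,\, b \ne a} E(V_a V_b) \to 0, \]
\[ \beta_{3,n} = \sum_{a \in I} E\,|E(V_a \mid \mathcal{F}_a) - E(V_a)| \to 0 \]
for $n \to \infty$ (here $\mathcal{F}_a = \sigma(X_b, b \notin B_a)$), then
\[ \|\mathcal{D}(N^n) - \mathrm{Poi}(E(N^n))\| \to 0, \]
and the following bound holds:
\[ \|\mathcal{D}(N^n) - \mathrm{Poi}(E(N^n))\| \le 2(\beta_{1,n} + \beta_{2,n} + \beta_{3,n}). \]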
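As a sanity check of the bound, consider the simplest case of n independent indicators with success probability p and neighbourhoods B_a = {a}. Then the second and third terms vanish and the first equals n p^2, so the bound reads 2 n p^2. The sketch below compares this with the exact distance between the Binomial(n, p) and Poisson(np) distributions. It assumes that the norm is the sum of absolute differences of the point masses (twice the largest difference over events); under the other common convention the distance is half as large, so the inequality still holds. The function name and the values of n and p are illustrative.

```python
import math


def l1_distance_binomial_poisson(n: int, p: float) -> float:
    """Sum over k >= 0 of |Binomial(n, p)(k) - Poisson(n p)(k)|."""
    lam = n * p
    total = 0.0
    poisson_mass_seen = 0.0
    poisson_pmf = math.exp(-lam)  # Poisson point mass at k = 0
    for k in range(n + 1):
        binomial_pmf = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        total += abs(binomial_pmf - poisson_pmf)
        poisson_mass_seen += poisson_pmf
        poisson_pmf *= lam / (k + 1)  # move on to the point mass at k + 1
    # The binomial has no mass above n, so the remaining Poisson tail counts in full.
    total += max(0.0, 1.0 - poisson_mass_seen)
    return total


n, p = 200, 0.01
distance = l1_distance_binomial_poisson(n, p)
chen_stein_bound = 2.0 * n * p * p  # 2 * (beta_1 + beta_2 + beta_3) with beta_2 = beta_3 = 0
```

For dependent indicators, such as the diagonal exceedances considered in this talk, the second and third terms no longer vanish and must be controlled separately.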
The counting construction
With $X_1, \ldots, X_n$ i.i.d. and
\[ T_{i,j} = \max\{T_{i+1,j-1} + f(X_i, X_j),\ g(j - i + 1)\}, \qquad i < j, \]
we let $N^n(t)$ be the number of diagonals in $(T_{i,j})$ exceeding level $t$.
With $I$ the set of sets of diagonal indices,
\[ N^n(t) = \sum_{a \in I} 1\Big(\max_{(i,j) \in a} T_{i,j} > t\Big). \]
This construction is not directly suitable for the theorem of Arratia et al.
A modification
We need to band-limit the matrix and introduce
\[ V_a(t) = 1\Big(\max_{(i,j) \in a,\ |j-i| \le 2b_n} T_{i,j} > t\Big) \]
and $\tilde{N}^n(t) = \sum_{a \in I} V_a(t)$ with $b_n \to \infty$.
Two indicators $V_a(t)$ and $V_b(t)$ are independent if the corresponding
diagonals are at distance $> 2b_n$.
Need to control
\[ P(V_a(t) = 1) \quad\text{and}\quad P(V_a(t) = 1, V_b(t) = 1) \]
where $a \ne b$.
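The number of diagonals exceeding a level, with and without the band limit, can be sketched as follows. The assumptions are: a diagonal is read as the set of cells (i, j), i < j, with i + j constant (the cells linked by the recursion), indices are 0-based, the argument band plays the role of 2 b_n, and the pair score and penalty are hypothetical toy choices.

```python
from typing import Callable, Dict, Optional, Tuple


def toy_pair_score(x: str, y: str) -> float:
    """Hypothetical f: +1 for Watson-Crick pairs, -1 otherwise."""
    return 1.0 if (x, y) in {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")} else -1.0


def linear_penalty(m: int) -> float:
    """Hypothetical penalty g(m) = -m."""
    return -float(m)


def diagonal_maxima(
    seq: str,
    f: Callable[[str, str], float],
    g: Callable[[int], float],
    band: Optional[int] = None,
) -> Dict[int, float]:
    """Maximum of T[i, j] over each diagonal i + j = c, using cells with i < j.

    If band is given, only cells with j - i <= band enter the maximum.
    """
    n = len(seq)
    T: Dict[Tuple[int, int], float] = {}
    for k in range(n):
        T[(k, k)] = g(1)
        if k + 1 < n:
            T[(k, k + 1)] = g(2)
    for d in range(2, n):
        for i in range(n - d):
            j = i + d
            T[(i, j)] = max(T[(i + 1, j - 1)] + f(seq[i], seq[j]), g(d + 1))
    maxima: Dict[int, float] = {}
    for (i, j), value in T.items():
        if i < j and (band is None or j - i <= band):
            c = i + j
            maxima[c] = max(maxima.get(c, float("-inf")), value)
    return maxima


def count_exceedances(seq, f, g, t: float, band: Optional[int] = None) -> int:
    """Number of diagonals whose (optionally band-limited) maximum exceeds t."""
    return sum(1 for m in diagonal_maxima(seq, f, g, band).values() if m > t)


sequence = "GGGAAACCC"
n_high = count_exceedances(sequence, toy_pair_score, linear_penalty, t=0.0)
n_low = count_exceedances(sequence, toy_pair_score, linear_penalty, t=-1.5)
n_low_banded = count_exceedances(sequence, toy_pair_score, linear_penalty, t=-1.5, band=4)
```

In the toy example the band limit removes the two exceedances at level -1.5 that are only reached by cells far from the main diagonal.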
The reflected random walk
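The process
\[ T_n = \max\{T_{n-1} + Z_n,\ g(n)\}, \qquad T_0 = 0, \]
for i.i.d. $(Z_n)_{n \ge 0}$ and $g : \mathbb{N} \to (-\infty, 0]$ ($g(0) = 0$) is a random walk
reflected at the barrier $g$.
It fulfills
\[ T_n = S_n + \underbrace{\max_{0 \le k \le n} \{g(k) - S_k\}}_{L_n} \]
where $S_n = \sum_{k=1}^{n} Z_k$, $S_0 = 0$. Moreover,
\[ M_n := \max_{0 \le k \le n} T_k \nearrow M := \sup_{k \ge 0} T_k. \]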
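The representation of the reflected walk as S_n plus a running maximum can be checked numerically. The sketch below simulates the recursion and compares it with the closed form. The increment distribution (+1 with probability 0.3, -1 otherwise), the seed and the function names are illustrative assumptions; the logarithmic barrier is one of the barriers shown in the examples of this talk.

```python
import math
import random
from itertools import accumulate
from typing import Callable, List, Sequence


def log_barrier(n: int) -> float:
    """Barrier g(n) = -15 log(n) for n >= 1, with g(0) = 0."""
    return 0.0 if n == 0 else -15.0 * math.log(n)


def reflected_walk(increments: Sequence[float], g: Callable[[int], float]) -> List[float]:
    """T_0 = 0 and T_n = max(T_{n-1} + Z_n, g(n))."""
    path = [0.0]
    for n, z in enumerate(increments, start=1):
        path.append(max(path[-1] + z, g(n)))
    return path


def reflected_walk_via_identity(increments: Sequence[float], g: Callable[[int], float]) -> List[float]:
    """T_n = S_n + L_n with L_n = max over 0 <= k <= n of g(k) - S_k."""
    partial_sums = [0.0] + list(accumulate(increments))
    path = []
    running_max = float("-inf")  # becomes L_n after the update below
    for n, s in enumerate(partial_sums):
        running_max = max(running_max, g(n) - s)
        path.append(s + running_max)
    return path


rng = random.Random(2007)
increments = [1.0 if rng.random() < 0.3 else -1.0 for _ in range(200)]
by_recursion = reflected_walk(increments, log_barrier)
by_identity = reflected_walk_via_identity(increments, log_barrier)
running_maximum = list(accumulate(by_recursion, max))  # M_n, nondecreasing in n
```

The two paths agree up to floating-point rounding, and the reflected path never falls below the barrier.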
Three examples
[Figure: example paths over the range 0 to 200 (vertical axis from −200 to 150) for the three barriers g(n) = 0, g(n) = −15 log(n) and g(n) = −n.]
Results
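Let $\theta^* > 0$ solve $E \exp(\theta Z_1) = 1$, and let $P^*$ have local Radon-Nikodym
derivative $\exp(\theta^* S_n)$ w.r.t. $P$. Then with
\[ L = \sup_{n \ge 0} \{g(n) - S_n\} = \lim_{n \to \infty} L_n \]
we have
\[ P(M < \infty) = 1 \quad\Leftrightarrow\quad E^* \exp(\theta^* L) < \infty. \]
Moreover (with $L_{-1} = -\infty$),
\[ E^* \exp(\theta^* L) = \sum_{n=0}^{\infty} \exp(\theta^* g(n))\, E\big(1 - \exp(\theta^* (L_{n-1} - L_n))\big) \le \sum_{n=0}^{\infty} \exp(\theta^* g(n)). \]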
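For a concrete increment distribution, the exponent solving E exp(θ Z_1) = 1 and the upper bound in the last display are easy to compute numerically. The following sketch uses bisection, which is a generic root-finding choice and not a method taken from the talk. The two-point increment distribution is a hypothetical example, and the truncation of the infinite sum at 2000 terms is an assumption that is harmless for the linear barrier g(n) = -n.

```python
import math
from typing import Dict


def mgf_minus_one(theta: float, dist: Dict[float, float]) -> float:
    """E exp(theta * Z) - 1 for a finite distribution {value: probability}."""
    return sum(p * math.exp(theta * z) for z, p in dist.items()) - 1.0


def solve_theta_star(dist: Dict[float, float], iterations: int = 200) -> float:
    """Positive root of E exp(theta * Z) = 1, found by bisection.

    Requires negative mean and a positive atom, so that the function is
    negative just to the right of 0 and eventually positive.
    """
    mean = sum(z * p for z, p in dist.items())
    if mean >= 0 or max(dist) <= 0:
        raise ValueError("need E Z < 0 and P(Z > 0) > 0")
    hi = 1.0
    while mgf_minus_one(hi, dist) <= 0:
        hi *= 2.0
    lo = hi
    while mgf_minus_one(lo, dist) >= 0:
        lo /= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mgf_minus_one(mid, dist) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical increments: +1 with probability 0.3, -1 with probability 0.7.
increment_dist = {1.0: 0.3, -1.0: 0.7}
theta_star = solve_theta_star(increment_dist)

# Upper bound sum over n >= 0 of exp(theta* g(n)) for g(n) = -n, truncated at 2000 terms.
upper_bound = sum(math.exp(-theta_star * n) for n in range(2000))
```

For this two-point distribution the root is available in closed form, log(0.7 / 0.3), and the geometric series sums to 1 / (1 - 3/7) = 1.75, which gives a direct check of the numerics.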
Results
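Theorem (Thm 2.3 in [2]): If $E^* \exp(\theta^* L) < \infty$ and if $Z_1$ is non-arithmetic, then
\[ P(M > u) \sim \underbrace{\frac{1 - P(\tau_+ < \infty)}{\theta^* E^* S_{\tau_+}}}_{C}\ \exp(-\theta^* u)\ \underbrace{E^* \exp(\theta^* L)}_{C_g} \]
for $u \to \infty$, where
\[ \tau_+ = \inf\{n \ge 0 \mid S_n > 0\} \quad\text{and}\quad L = \sup_{n \ge 0} \{g(n) - S_n\}. \]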
Results
Taking $g_i(n) = g(2n + i)$, $i = 0, 1$, increments with distribution $f(X_1, X_2)$
[+ technical assumptions on $f$] and using the Azuma-Hoeffding inequality to
control $P(V_a(t) = 1, V_b(t) = 1)$ we obtain the following theorem:
Theorem (Thm. 1 in [3]): If $C_{g_i} < \infty$ then with
\[ t_n = \frac{\log\{n C (C_{g_0} + C_{g_1})\} + x}{\theta^*} \]
it holds that
\[ \|\mathcal{D}(N^n(t_n)) - \mathcal{D}(\tilde{N}^n(t_n))\| \to 0 \]
and
\[ \|\mathcal{D}(\tilde{N}^n(t_n)) - \mathrm{Poi}(\exp(-x))\| \to 0 \]
for $n \to \infty$.
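The threshold t_n is chosen so that n C (C_{g_0} + C_{g_1}) exp(-θ* t_n) equals exp(-x), which is the mean of the limiting Poisson distribution. The sketch below only verifies this piece of arithmetic. All numerical constants in it are hypothetical placeholders, not values from [3].

```python
import math


def threshold(n: int, x: float, theta_star: float, c: float, c_g0: float, c_g1: float) -> float:
    """t_n = (log(n * C * (C_g0 + C_g1)) + x) / theta*."""
    return (math.log(n * c * (c_g0 + c_g1)) + x) / theta_star


# Hypothetical constants, chosen for illustration only.
n, x = 10_000, 1.0
theta_star, c, c_g0, c_g1 = 0.85, 0.4, 1.2, 1.1

t_n = threshold(n, x, theta_star, c, c_g0, c_g1)
limiting_mean = n * c * (c_g0 + c_g1) * math.exp(-theta_star * t_n)
```

Since t_n grows like log(n) / θ*, doubling the sequence length raises the threshold by log(2) / θ* for fixed x.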
Point process version
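Introduce the bivariate point process
\[ \mu_n = \sum_{a = [(i,j)] \in I} \delta_{\left(\frac{i+j}{2n},\, Q_a^n\right)} \]
on $[0, 1] \times \mathbb{R}$ where
\[ Q_a^n = \max_{(k,l) \in a} \{T_{k,l} - t_n\} \quad\text{and}\quad t_n = \frac{\log\{n C (C_{g_0} + C_{g_1})\}}{\theta^*}. \]
Then the process version of Theorem 1 in [3] reads
\[ \mu_n(\cdot \cap [0, 1] \times (0, \infty)) \xrightarrow{\ \mathcal{D}\ } \mu \]
where $\mu$ is a Poisson random measure with intensity
\[ \lambda(t, x) = \exp(-x). \]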
Some concluding remarks
The Poisson process limit provides us at best with a null model for
the distribution of (stem-loop) motifs in real genomic sequences.
It provides a justification for the use of Poisson process
scan statistics, as used for instance by Leung et al. in [5], to test for
clustering.
The theoretical study of Poisson process limits provides insight into
how the underlying distribution, the score and penalty functions,
and the algorithms affect the resulting point process of motif
occurrences.
References
[1] Arratia, R., Goldstein, L. and Gordon, L. (1989). Two moments suffice for Poisson
approximations: the Chen-Stein method. Ann. Probab. 17, 9–25.
[2] Hansen, N. R. (2006). The maximum of a random walk reflected at a general barrier. Ann.
Appl. Probab. 16, 15–29.
[3] Hansen, N. R. (2007). Asymptotics for local maximal stack scores with general loop
penalty. Advances in Applied Probability 39(3), 776–798.
[4] Hansen, N. R. (2007). Statistical models of local RNA stem-loop scores. Submitted to
Bioinformatics.
[5] Leung, M. Y., Choi, K. P., Xia, A. and Chen, L. H. Y. (2005). Nonrandom clusters of
palindromes in herpesvirus genomes. Journal of Computational Biology 12(3), 331–354.
[6] Reinert, G. and Schbath, S. (1998). Compound Poisson and Poisson process
approximations for occurrences of multiple words in Markov chains. Journal of
Computational Biology 5, 223–253.