Introduction to motif representation
and detection
Statistics 246
Week 13
Spring 2006
Lecture 1
1
Motivation
We saw in Lecture 1 last week that transcription of the lac
gene in E. coli involved several proteins binding DNA at
specific sites with consensus sequences as follows:
RNA polymerase binding at TATAAT (-10) and TTGACA (-35)
UP binding at AAA(A/T)(A/T)T(A/T)TTTNNAAA
CAP binding at AAATGTGATCTAGATCACATTT, and
Repressor binding at GTGGAATTGTGAGCGGATAACAATTTTC
These proteins bind at many sequences similar but not
identical to the consensus, and in this lecture we explore the
representation and utilization of this diversity.
2
Motifs - Sites - Signals - Domains
For this lecture, I’ll use these terms interchangeably
to describe recurring elements of interest to us.
In PROTEINS we have: transmembrane domains,
coiled-coil domains, EGF-like domains, signal
peptides, phosphorylation sites, antigenic
determinants, …, and protein families.
In DNA / RNA we have: enhancers, promoters,
terminators, splicing signals, translation initiation
sites, centromeres, ...
3
Motifs and models
• Motifs typically represent regions of structural significance with specific biological function.
• They are generalisations from known examples.
• The models can be highly specific.
• Multiple models can be used to give higher sensitivity & specificity in their detection.
• They can sometimes be generated automatically from examples or multiple alignments.
4
Why (probability) models
for biomolecular motifs?
• to characterize them
• to help identify them
• for incorporation into larger
models, e.g. for an entire gene
5
Transcription initiation in E. coli
In E. coli transcription is initiated at the promoter, and the sequence of
the promoter is recognised by the σ factor of RNA polymerase.
6
Determinism 1: consensus sequences
                 Promoter consensus sequence
σ factor         -35          -10
σ70              TTGACA       TATAAT
σ28              CTAAA        CCGATAT

Similarly for σ32, σ38 and σ54.
Consensus sequences have the obvious limitation:
there is usually some deviation from them.
7
The human transcription factor Sp1
has 3 Cys-Cys-His-His zinc finger DNA binding domains
8
Prosite patterns
An early effort at collecting descriptors for
functionally important protein motifs. They
do not attempt to describe a complete
domain or protein, but simply try to identify
the most important residue combinations,
such as the catalytic site of an enzyme.
They use regular expression syntax, and
focus on the most highly conserved residues
in a protein family.
http://au.expasy.org
9
Seeking consensus in sequences
This pattern, which must be at the N-terminus of the sequence (‘<’), means:

<A-x-[ST](2)-x(0,1)-V-{LI}

Ala - any - [Ser or Thr] - [Ser or Thr] - (any or none) - Val - (any but Leu, Ile)
10
Example: C2H2 zinc finger DNA binding domain
The characteristic motif of a Cys-Cys-His-His zinc
finger DNA binding domain has regular expression
C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
Here, as in algebra, X denotes an unknown (arbitrary) residue, so X(2,4)
means 2 to 4 arbitrary residues. The sequence of our example domain 1ZNF
is as follows, and it clearly fits the model.
XYKCGLCERSFVEKSALSRHQRVHKNX
11
Searching with regular expressions
http://www.isrec.isb-sib.ch/software/PATFND_form.html
c.{2,4}c...[livmfywc]........h.{3,5}h
PatternFind output
[ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001
...
gp|AF234161|7188808|01AEB01ABAC4F945 nuclear
protein NP94b [Homo sapiens] Occurences: 2
Position : 514 CYICKASCSSQQEFQDHMSEPQH
Position : 606 CTVCNRYFKTPRKFVEHVKSQGH
........
12
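As a rough aside (not in the original slides), the same kind of search can be reproduced with a few lines of Python's re module; the pattern is the regular-expression form of the C2H2 motif given above, and the 1ZNF sequence is the example from the previous slide.

```python
import re

# Regular-expression form of C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
c2h2 = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

# The 1ZNF zinc-finger sequence shown earlier (X marks flanking residues)
seq = "XYKCGLCERSFVEKSALSRHQRVHKNX"

for m in c2h2.finditer(seq):
    print(m.start(), m.group())  # start position and matching substring
```

This reports one match, the zinc-finger core of 1ZNF, starting at 0-based position 3.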
Regular expressions can be limiting
The regular expression syntax is too rigid to represent
many highly divergent protein motifs.
Also, short patterns are sometimes insufficient with today’s large
databases: even requiring perfect matches, you might find many false
positives. On the other hand, some real sites might not be perfect
matches.
We need to go beyond apparently equally likely alternatives and ranges
for gaps. We deal with the former first, by having a distribution at
each position.
13
Cys-Cys-His-His profile: sequence logo form
1ZNF: ..YKCGLCERSFVEKSALSRHQRVHKN..
A sequence logo is a scaled position-specific a.a. distribution.
Scaling is by a measure of a position’s information content.
(Note that we’ve lost the option of variable spacing.)
14
60 human TATA boxes
[Two figures omitted, courtesy of Anders Krogh.]
15
Weight matrix model (WMM) =
stochastic consensus sequence

Counts from 242 known σ70 sites, positions 1-6:

        1     2     3     4     5     6
A       9   214    63   142   118     8
C      22     7    26    31    52    13
G      18     2    29    38    29     5
T     193    19   124    31    43   216

Relative frequencies fbl:

        1     2     3     4     5     6
A     0.04  0.88  0.26  0.59  0.49  0.03
C     0.09  0.03  0.11  0.13  0.21  0.05
G     0.07  0.01  0.12  0.16  0.12  0.02
T     0.80  0.08  0.51  0.13  0.18  0.89

Position specific scoring matrix (PSSM), entries 10×log2(fbl/pb):

        1     2     3     4     5     6
A     -38    19     1    12    10   -48
C     -15   -38    -8   -10    -3   -32
G     -13   -48    -6    -7   -10   -40
T      17   -32     8    -9    -6    19

Informativeness of position l: 2 + Σb pbl log2 pbl
16
Derivation of PSSM entries
Suppose that we have aligned sequence data on a
number of instances of a given type of site.
candidate sequence   CTATAATC....
aligned position     123456

Hypotheses:
S = site (and independence)
R = random (equiprobable, independence)

log2 [ pr(CTATAA | S) / pr(CTATAA | R) ]
  = log2 [ (.09 × .03 × .26 × .13 × .51 × .01) / (.25 × .25 × .25 × .25 × .25 × .25) ]
  = (2 + log2 .09) + ... + (2 + log2 .01)
  = (1/10) × {-15 - 32 + 1 - 9 + 10 - 48}

Generally, the PSSM score is sbl = log2(fbl / pb), where l = position, b = base, and pb = background frequency.
17
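A minimal sketch of this construction (not from the slides): turn the count table into relative frequencies and then into 10×log2(f/p) scores. The uniform background pb = 0.25 and the absence of pseudocounts are assumptions, so the output need not reproduce the matrix on the slide exactly.

```python
import math

# Counts from the 242 known sigma-70 sites (positions 1..6), as tabulated above
counts = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}
n_sites = 242

def pssm_from_counts(counts, n, p_b=0.25, scale=10):
    """PSSM entries s_bl = scale * log2(f_bl / p_b), rounded to whole numbers."""
    return {base: [round(scale * math.log2((c / n) / p_b)) for c in row]
            for base, row in counts.items()}

pssm = pssm_from_counts(counts, n_sites)
for base in "ACGT":
    print(base, pssm[base])
```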
Use of a PSSM to find sites

Move the matrix along the sequence and score each window. Peaks should
occur at the true sites. Of course, in general any threshold will have
some false positive and false negative rate.

[Figure: the PSSM above slid along the sequence ...CTATAATC...; the three
windows score
CTATAA   -93
TATAAT   +85
ATAATC   -95
so the peak falls at the true site TATAAT.]
18
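A sketch of the scanning step (again not from the slides), using the PSSM reconstructed above. Windows of width 6 are scored by summing one entry per position; for the candidate sequence CTATAATC this gives -93, +85 and -95, peaking at the true site TATAAT.

```python
# PSSM from the earlier slide: rows are bases, columns are motif positions 1..6
pssm = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -40],
    "T": [17, -32, 8, -9, -6, 19],
}
WIDTH = 6

def scan(sequence, pssm, width=WIDTH):
    """Score every window of length `width` along the sequence."""
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[base][i] for i, base in enumerate(window))
        yield start, window, score

for start, window, score in scan("CTATAATC", pssm):
    print(start, window, score)

# Windows scoring above a chosen threshold are called as putative sites;
# any threshold trades false positives against false negatives.
```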
12 examples of 5’ splice (donor) sites

exon | intron
TCG  | GTGAGT
TGG  | GTGTGT
CCG  | GTCCGT
ATG  | GTAAGA
TCT  | GTAAGT
CAG  | GTAGGA
CAG  | GTAGGG
AAG  | GTAAGG
AGG  | GTATGG
TGG  | GTAAGG
GAG  | GTTAGT
CAT  | GTGAGT
There are many thousands
of instances of such sites.
19
Sequence logo for human splice donor sites

[Sequence logo figure; the underlying base frequencies (%) by position:]

Base  -3  -2  -1   0  +1  +2  +3  +4  +5
A     33  61  10   0   0  53  71   7  16
C     37  13   3   0   0   3   8   6  16
G     18  12  80 100   0  42  12  81  22
T     12  14   7   0 100   2   9   6  46
20
Representation of motifs: the next steps
Missing from the position-specific distribution
representation of motifs are good ways of
dealing with:
• Length distributions for insertions/deletions
• Non-local association of amino acids
Profiles, and then hidden Markov models, help with the first. The second
remains a hard unsolved problem. Local (e.g. neighbour) associations can
be dealt with straightforwardly, generalizing PSSMs.
21
Profiles
Profiles are a variation of the position specific scoring matrix
approach just described. They are calculated slightly differently, to
reflect amino acid substitutions and the possibility of gaps, but are
used in the same way.
In general a profile entry Mla for location l and amino acid a is
calculated by

    Mla = Σb wlb Sab,

where b ranges over amino acids, wlb is a weight (e.g. the observed
frequency of a.a. b at position l) and Sab is the (a,b) entry of a
substitution matrix (e.g. PAM or BLOSUM).
Position specific gap penalties can also be included.
22
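To make the formula concrete, here is a small sketch (not from the slides) of one profile entry Mla for a single alignment column. The column is the first column of the Ig alignment on the next slide; the substitution scores are invented stand-ins for a PAM or BLOSUM matrix.

```python
from collections import Counter

def profile_entry(column, a, S):
    """M_la = sum over b of w_lb * S[a][b], with w_lb = frequency of b in the column."""
    freqs = Counter(column)
    n = len(column)
    return sum((count / n) * S[a][b] for b, count in freqs.items())

# Invented substitution scores for the residues in this column; a real profile
# would take these from a full PAM or BLOSUM substitution matrix.
S = {
    "L": {"E": -3, "G": -4, "L": 4, "V": 1},
    "V": {"E": -2, "G": -3, "L": 1, "V": 4},
}

column = ["E", "G", "L", "V"]   # first aligned position of g1, m3, k2 and l3
for a in ("L", "V"):
    print(a, round(profile_entry(column, a, S), 2))
```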
Derivation of a profile for Ig domains

GCG PileUp multiple alignment (pileup.msf, MSF: 49, type P; GapWeight
3.000, GapLengthWeight 0.100; all four sequences weighted 1.00):

g1   ELVKAGSSVK MSCKATGYTF SSYE....LY WVRQAPGQGL EDLGYISSS
m3   GLVEPGGSLR LSCSASGFTF SAND....MN WVRQAPGKGL EWLSFIGGS
k2   LPVTPGEPAS ISCRSSQSLL DSGDGNTYLN WYLQKAGQSP QLLIYTLSY
l3   VSVALGQTVR ITCQ.GDSLR GYDAA..... WYQQKPGQAP LLVIYGRNN
23
(Peptide) PROFILEMAKE v4.40 of /home/ucsb00/George/biochem/1999/immunoglob.msf{*};
Length: 49, Sequences: 4, MaxScore: 27.72; Gap: 1.00, GapRatio: 0.33,
Len: 1.00, LenRatio: 0.10; symbol comparison table profilepep.cmp;
stringent treatment of non-observed characters; exponential weighting of
characters.

[Profile table: one row per alignment position, giving the consensus
residue, a score column for each amino acid code (A, B, C, ..., Z), and
position-specific Gap and Len penalty columns (mostly 100).]
24
[The profile rows for consensus positions 13 C, 14 K, 15 A and 16 S,
shown against the aligned sequences g1, m3, k2 and l3 and their Gap/Len
columns.]
These days profiles of this kind have been largely replaced by
Hidden Markov Models for sequence searching and alignment.
We now turn to HMMs.
25
Hidden Markov models
Processes {(St,Ot), t=1,…}, where St is the hidden
state and Ot the observation at time t, such that
pr(St | St-1, Ot-1, St-2, Ot-2, …) = pr(St | St-1)
pr(Ot | St, St-1, Ot-1, St-2, Ot-2, …) = pr(Ot | St, St-1)
The basics of HMMs were laid bare in a series of
beautiful papers by L E Baum and colleagues around
1970, and their formulation has been used almost
unchanged to this day.
26
HMMs: generalities
As the name suggests, the series O = (O1,O2 ,O3, .. OT) is
observed, while the hidden states S = (S1 ,S2 ,S3 ,……., ST)
are not.
S = states {s0, s1, ….., sn};  V = output alphabet {v0, v1, ….., vm}
A = {aij} = transition probabilities, si → sj
B = {bi(j)} = probabilities of outputting vj in state si
A and B together form the parameters θ.
What is the most likely sequence of states that produced a
given sequence of observations? What is the likelihood of a
sequence? Estimate the parameters.
There are elegant algorithms for calculating pr(O|θ), arg maxθ
pr(O|θ) in certain special cases, and arg maxS pr(S|O,θ).
27
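To give one of these "elegant algorithms" in concrete form, here is a minimal forward-algorithm sketch (not part of the slides) for pr(O | θ) in the basic case where Ot depends only on St. The two-state model and its numbers are invented purely for illustration.

```python
def forward(obs, pi, A, B):
    """Likelihood pr(O | theta) of an observation sequence under an HMM."""
    n_states = len(pi)
    # alpha[i] = pr(O_1..O_t, S_t = i); initialise with t = 1
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n_states))
                 for j in range(n_states)]
    return sum(alpha)

# An invented two-state HMM over the output alphabet {0, 1}
pi = [0.6, 0.4]                    # initial state distribution
A = [[0.7, 0.3], [0.2, 0.8]]       # transition probabilities a_ij
B = [[0.9, 0.1], [0.3, 0.7]]       # emission probabilities b_i(j)

print(forward([0, 1, 1, 0], pi, A, B))
```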
Hidden Markov models: extensions
Many variants are now used. For example, the distribution of O
may not depend on previous S but on previous O values,
pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St ), or
pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St , St-1 ,Ot-1) .
Most importantly for us, the times of S and O may be decoupled,
permitting the Observation corresponding to State time t to be a
string whose length and composition depends on St (and
possibly St-1 and part or all of the previous Observations). This
is called a hidden semi-Markov or generalized hidden Markov
model.
28
Alignment using profile
Hidden Markov models
There are now many HMMs for protein families
such as globins, and these models can be used
to infer alignments of new globin sequences to
other members of the family.
Such models can also be used to determine
whether a given sequence is or is not a member
of a specified family.
29
A very short profile HMM
M = Match state, I = Insert state, D = Delete state.
To operate, go from left to right. I and M states output
amino acids; B, D and E states are “silent”.
30
How profile HMMs work: in brief
Instances of the motif are identified by calculating
log{pr(sequence | M)/pr(sequence | B)},
where M and B are the motif and background HMMs.
Alignments of instances of the motif to the HMM are
found by calculating
arg maxstates pr(states | instance, M).
Estimation of HMM parameters is by calculating
arg max parameters pr(sequences| M, parameters).
In all cases, we use the efficient HMM algorithms.
31
Pfam domain-HMMs
Pfam is a library of models of recurrent protein
domains. They are constructed semi-automatically
using hidden Markov models (HMMs).
Pfam families have permanent accession numbers and contain functional
annotation and cross-references to other databases, while Pfam-B
families are re-generated at each release and are unannotated.
See http://pfam.wustl.edu/
32
HMM-type software available
SAM: oldest profile HMM software.
HMMER: profile HMMs to do sensitive database searching.
PFTOOLS: “generalized profiles”, similar to profile HMMs, used in Prosite Profiles.
PROBE: multiple ungapped HMM motifs, including Gibbs sampling.
META-MEME: motif models similar to PROBE.
PSI-BLAST: stripped-down, ultra-fast iterative profile HMM server from the NCBI.
33
Beyond independence
Weight array matrices, Zhang & Marr (1993), consider dependencies
between adjacent positions, i.e. (non-stationary) first-order Markov
models. The number of parameters increases exponentially if we use full
higher-order Markov models.
Variable length Markov models, Rissanen (1986), Buhlmann & Wyner (1999),
help us get over this problem. In the last few years, many variants have
appeared: all make use of trees.
[The interpolated Markov models of Salzberg et al (1998) address the
same problem.]
34
Our aim and some notation
L: length of the sequence motif.
Xi: discrete random variable at position i, taking values in a finite set χ.
Given a number of instances of a sequence motif x = (x1, …, xL) of
length L, we want a model for the probability P(x) of x.
We denote by x_i^j (i < j) the sequence (xj, xj-1, …, xi) in reverse
time order.
35
Variable Length Markov Models
Factorize P(x) in the usual telescopic way:

P(X1 = x1) × Π_{l=2..L} P(Xl = xl | X_1^{l-1} = x_1^{l-1}),

then simplify this using context functions cl, l = 2, …, L, to

P(X1 = x1) × Π_{l=2..L} P(Xl = xl | cl(X_1^{l-1}) = cl(x_1^{l-1})),

where cl : x_1^{l-1} → x_{l-m}^{l-1} is suitably defined on (l-1)-tuples.
36
VLMM, cont.
Here cl : χ^{l-1} → ∪_{i=0..l-1} χ^i, and m = ml is given by

ml(x_1^{l-1}) = min { r : P(Xl = x | X_1^{l-1} = x_1^{l-1})
                        = P(Xl = x | X_{l-r}^{l-1} = x_{l-r}^{l-1}) for all x ∈ χ }.

The function cl defines the sequence-specific context, and ml defines
the sequence-specific memory, or order, of the Markov property for
position l.
37
VLMM: an illustrative example
A full set of 16 contexts of order 2.
Pruned set of 12 contexts: P(X3 | X2 = C, X1 = G) = P(X3 | X2 = C), etc.
38
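A rough sketch (not in the slides) of how a pruned context tree is used: conditional distributions are stored only for the retained contexts, and a position looks up the longest stored suffix of its history. The contexts and probabilities below are invented, but the fallback mirrors the pruning in the example above, where P(X3 | X2=C, X1=G) reduces to P(X3 | X2=C).

```python
# Retained contexts for position 3, keyed by (older, ..., most recent) symbols.
# All contexts and probabilities here are invented for illustration.
contexts_pos3 = {
    ("A", "C"): {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4},       # full order-2 context kept
    ("C",):     {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},       # pruned: only X2 = C matters
    ():         {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},   # root (order-0) context
}

def conditional(history, contexts):
    """P(X_l = . | history), using the longest retained suffix of the history."""
    for k in range(len(history), -1, -1):
        suffix = tuple(history[len(history) - k:])
        if suffix in contexts:
            return contexts[suffix]
    raise KeyError("no root context stored")

print(conditional(("A", "C"), contexts_pos3))  # uses the full (X1=A, X2=C) context
print(conditional(("G", "C"), contexts_pos3))  # falls back to the X2=C context
```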
VLMM cont.
A VLMM for a biomolecular motif of length L is specified by
• a distribution for X1, and, for l = 2, …, L,
• a constrained distribution for Xl given Xl-1, …, X1.
That is, we need L-1 context functions, or trees.
But there is a difficulty here.
39
Sequence dependencies
(interactions) are not always local
3-dimensional folding; DNA, RNA & protein interactions
The methods outlined so far all fail to incorporate long-range
( ≥4 bp or a.a.) interactions. New model types are needed.
40
Modeling “long-range” dependency
The principal work in this area is Burge & Karlin’s (1997) maximal
dependence decomposition (MDD).
More recently, Cai et al (2000) and Barash et al (2003) used
Bayes networks (BN).
Ellrott et al (2002) optimized the sequence order in which a
stationary Markov chain models the motif.
We have adapted this last idea, to give permuted variable length
Markov models (PVLMM). Potamianos & Jelinek (1998) have
related work on decision trees (PVLMM(D)).
41
Maximal Dependence Decomposition
MDD starts with a mutual independence model as with
WMMs. The data are then iteratively subdivided, at
each stage splitting on the most dependent position,
suitably defined. At the tips of the tree so defined, a
mutual independence model across all remaining
positions is used.
The details can vary according to the splitting criterion
(Burge & Karlin used χ2), the actual splits (binary, etc),
and the stopping rule.
However, the result is always a single tree.
42
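A compact sketch (not in the slides) of the first MDD step under one simple choice of the splitting criterion: measure each position's dependence on every other position by a χ² statistic and split on the position with the largest total. The within-node models and the stopping rule are omitted, and the example sequences are invented.

```python
from collections import Counter

def chi2(seqs, i, j, alphabet="ACGT"):
    """Pearson chi-square statistic for dependence between positions i and j."""
    n = len(seqs)
    pair = Counter((s[i], s[j]) for s in seqs)
    row = Counter(s[i] for s in seqs)
    col = Counter(s[j] for s in seqs)
    stat = 0.0
    for a in alphabet:
        for b in alphabet:
            expected = row[a] * col[b] / n
            if expected > 0:
                stat += (pair[(a, b)] - expected) ** 2 / expected
    return stat

def mdd_split_position(seqs):
    """Position whose total chi-square dependence on the other positions is largest."""
    L = len(seqs[0])
    totals = {i: sum(chi2(seqs, i, j) for j in range(L) if j != i) for i in range(L)}
    return max(totals, key=totals.get)

# Invented donor-like sites; at the chosen position the data would then be
# partitioned (e.g. most frequent base vs the rest) and the step repeated.
sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGG", "TGGGTGAGT"]
print(mdd_split_position(sites))
```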
A simple illustration: transcription
factor binding sites
These are of great interest, their signals are
very weak, and we typically have only a few
instances.
43
Transcription factor binding
site (TFBS) recognition
We extracted all known TFBS from the TRANSFAC database with a) length ≤9,
and b) ≥ 20 known sites. In all this gave 1,419 sites corresponding to 43 TF.
Next we randomly inserted each site into a background sequence of length
1,000 simulated from a stationary 3rd order MM, with parameters estimated
from a large collection of human sequence upstream of genes.
Finally, we used the PVLMM, MDD and WMM to scan these sequences within
a 10-fold cross-validation framework, to select a number of top-scoring
sequences as putative binding sites. We always made this number equal to
the true number in the sequences, and so sn = sp.
44
20 instances of
P$DOF3_01
GTCTAAAGCGT
aattAAAGTAA
GACGAAAGCAA
aattAAAGTGC
GTCTAAAGCga
GCGAAAAGCGA
GCGTAAAGCAG
TAGAAAAGGCG
aattAAAGTAC
CACAAAAGCCC
GCCCAAAgatc
tGACAAAGCGT
GCGGAAAgatc
aattAAAGCAA
CAAAAAAGGCG
tAAAAAAGGCT
CAGCAAAGACg
GGAAAAAGCAA
AGCAAAAGTGC
GCAGAAAGTCA
45
Modelling P$DOF3_01
            Sn Sp
WMM         .15
MDD         .60
PVLMM(D)    .55
46
Human Transcription Factor Binding Sites
47
TFBS: three results
TF             wmm    pvlmm   mdd
P$DOF3.01      .15    .55     .46
V$CIZ.01       .64    .77     .91
P$EMBP1.Q2     .50    .62     .77

Entry: sensitivity/specificity
Of the 43 TF, our dependence methods led to no improvement in 27 cases.
48
TFBS: more results
49
Now we touch on finding
protein-coding genes
50
12 examples of 5’ splice (donor) sites

exon | intron
TCG  | GTGAGT
TGG  | GTGTGT
CCG  | GTCCGT
ATG  | GTAAGA
TCT  | GTAAGT
CAG  | GTAGGA
CAG  | GTAGGG
AAG  | GTAAGG
AGG  | GTATGG
TGG  | GTAAGG
GAG  | GTTAGT
CAT  | GTGAGT
51
Splice site dataset
Human splice donor sequences from SpliceDB,
Burset et al (2001):
• 15,155 canonical donor sites of length 9, with
GT conserved at positions 0 and 1
• 47,495 false donor sites from the set of all
sequences which lie within 40 bp on both sides
of the characteristic donor dinucleotide GT.
52
Weight matrix models, Staden (1984)
Base  -3  -2  -1   0  +1  +2  +3  +4  +5
A     33  61  10   0   0  53  71   7  16
C     37  13   3   0   0   3   8   6  16
G     18  12  80 100   0  42  12  81  22
T     12  14   7   0 100   2   9   6  46

A weight matrix for donor sites.
Essentially a mutual independence model.
An improvement over the consensus CAGGTAAGT.
53
Sequence logo for human splice donor sites
[Sequence logo figure; the underlying base frequencies (%) by position:]

Base  -3  -2  -1   0  +1  +2  +3  +4  +5
A     33  61  10   0   0  53  71   7  16
C     37  13   3   0   0   3   8   6  16
G     18  12  80 100   0  42  12  81  22
T     12  14   7   0 100   2   9   6  46
54
Part of a context (decision) tree for position
-2 of a splice donor PVLMM
Node #s:counts;
Edge #s:split variables.
Sequence order: +2(A/G) +5(T) -1(G) +4(G) -2(A) +3(A) -3(A)
55
Parts of MDD trees for splice donors
In each case, splits are into the most frequent nt vs the others.
Optimal permutation: +2 +5 -1 +4 -2 +3 -3
56
Sp vs Sn
Comparison of PVLMM decision tree (NML, ord = 5), MDD (chi-square), WAM and WMM.
57
Model assessment:
Integrated recognition
For this assessment, we integrate the splice
donor models into SLAM, Pachter et al
(2002), a eukaryotic cross-species gene
finder.
The training data consists of 3,735 aligned
human and mouse gene sequences. The
resulting SLAM model is then tested on
the Rosetta set of 117 single human gene
sequences.
58
Results at the nucleotide level

              VLMM(D)   MDD     PVLMM(D)
Sensitivity    94.5     94.2     94.3
Specificity    99.7     99.7     99.7

Results at the exon level

              VLMM(D)   MDD       PVLMM(D)
Correct       362/470   365/465   371/465
Partial        92/470    83/465    79/465
Wrong          18/470    17/465    16/465
Missing        23/465    23/465    23/465
Sensitivity   362/465   365/465   371/465
Specificity   360/470   365/465   371/465
59
Some future work
• More fully elucidate the relationships between
all the different tree-based model classes
• Comprehensively test alternative strategies for
searching and comparing models
• Find the best combination for splice donors,
TFBS and some other motif detection problems
• Joint modelling of human and mouse sites
• Joint modelling of multiple motifs in one species
• Including indels and dependence
60
Summary and conclusions
• Modelling sequence motifs is important
• We have a hierarchy of probabilistic models for motifs extending the consensus sequence
• All common models are independence models
• Incorporating dependence into the models helps in many cases
• There remain many unsolved problems
61
Acknowledgements
Xiaoyue Zhao, UCB
Mauro Delorenzi (ISREC)
Sourav Chatterji, UCB
The SLAM team:
Simon Cawley, Affymetrix
Lior Pachter, UCB
Marina Alexandersson, FCC
62
References
Biological Sequence Analysis
R Durbin, S Eddy, A Krogh and G Mitchison
Cambridge University Press, 1998.
Bioinformatics The machine learning approach
P Baldi and S Brunak
The MIT Press, 1998
Post-Genome Informatics
M Kanehisa
Oxford University Press, 2000
63
64
Interpretation of PVLMM model selected
We use sequence logos to provide simple
interpretations of our selected PVLMM,
including the optimal permutation.
65
Beginning of the splicing process
[Figure showing the splice donor and splice acceptor sites.]
66
“Long-range” dependence in the chosen model
[Figure: the U1 snRNA sequence G U C C A U U C A aligned against the splice donor site.]
Optimal permutation: +2 +3 -1 +4 -2 +5 -3
67
The context tree for +5
+5
68
69
Bayes networks (BN): an example
Here each node corresponds to a sequence position, and the tree defines
conditional independence constraints on the distribution.
70
To find genes, we need to model splice sites
71
Issues in modeling short motifs
In any study of this kind, essential items are:
• the model class (e.g. VLMM)
• the way we search through the model class (e.g. by forward
selection)
• the way we compare models when searching (e.g. by χ2),
and finally,
• the way we assess the final model in relation to our aims
(e.g. by cross-validation).
We always need interesting, high-quality datasets.
72
Model classes and model search
For illustrative purposes, we will compare
WMM, WAM, MDD and PVLMM(decision)
for splice donor recognition.
Here we search using a simple procedure:
recursively choosing the best extension of a
current model, or forward selection. Our
slower alternative moves through the models
using reversible jump Markov chain Monte
Carlo (RJMCMC).
73
Model comparison
We fit the models using maximum likelihood,
and compare fitted models using both AIC and
BIC, standard penalties for model complexity.
Better than either of these two is approximate
normalized maximum likelihood (NML),
Barron et al (1998). We use mixture models for
the data with Jeffreys’ (Dirichlet) priors.
74
Model assessment:
Stand-alone splice site recognition
M: motif model
B: background model
Given a sequence x = (x1, …, xL), we predict
x to be a motif (here splice donor) if
log {P(x | M) / P(x | B)} > c,
for a suitably chosen threshold value c.
75
Model assessment: terms
TP: true positives, FP: false positives
TN: true negatives, FN: false negatives
Sensitivity (sn) and specificity (sp) are given by
sn = TP / [TP + FN]
sp = TP / [TP + FP].
A 5-fold cross-validation is used in assessing performance.
76
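Tying the last two slides together, a small sketch (not from the slides): classify each candidate by its log-odds score against a threshold c, then compute sn and sp from the resulting counts. The scores and labels below are invented.

```python
def assess(scores, labels, c):
    """scores: log P(x|M)/P(x|B) per candidate; labels: True for genuine sites."""
    tp = sum(s > c and y for s, y in zip(scores, labels))
    fp = sum(s > c and not y for s, y in zip(scores, labels))
    fn = sum(s <= c and y for s, y in zip(scores, labels))
    sn = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    sp = tp / (tp + fp) if tp + fp else 0.0   # specificity, as defined on this slide
    return sn, sp

# Invented log-odds scores and true labels for six candidate sequences
scores = [4.1, -2.0, 3.3, 0.2, -1.5, 2.8]
labels = [True, False, True, False, False, True]
print(assess(scores, labels, c=1.0))
```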