CMB14

(Regulatory) Motif Finding

• Clustering of genes: find the binding sites responsible for common expression patterns
• Chromatin immunoprecipitation on a microarray chip ("ChIP-chip")
Finding Regulatory Motifs

Given a collection of genes with common expression, find the TF-binding motif they have in common.
Characteristics of Regulatory Motifs
• Tiny
• Highly variable
• ~Constant size, because a constant-size transcription factor binds
• Often repeated
• Low-complexity-ish
Essentially a Multiple Local Alignment
• Find the "best" multiple local alignment
• The alignment score is defined differently in the probabilistic and combinatorial cases
Algorithms
• Combinatorial: CONSENSUS, TEIRESIAS, SP-STAR, and others
• Probabilistic:
  1. Expectation Maximization: MEME
  2. Gibbs sampling: AlignACE, BioProspector
Combinatorial Approaches to Motif Finding

Discrete Formulations

Given sequences S = {x1, …, xn}:
• A motif W is a consensus string w1…wK
• Find the motif W* with the "best" match to x1, …, xn

Definition of "best":
d(W, xi) = minimum Hamming distance between W and any K-long word in xi
d(W, S) = Σi d(W, xi)
Approaches
• Exhaustive Searches
• CONSENSUS
• MULTIPROFILER
• TEIRESIAS, SP-STAR, WINNOWER
Exhaustive Searches

1. Pattern-driven algorithm:

For W = AA…A to TT…T   (4^K possibilities)
    Find d(W, S)
Report W* = argmin_W d(W, S)

Running time: O(K N 4^K)
(where N = Σi |xi|)

Advantage: finds the provably "best" motif W*
Disadvantage: time
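
A minimal Python sketch of the pattern-driven search (function and variable names are illustrative, not from any particular tool):

```python
from itertools import product

def min_hamming(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-long word in x."""
    K = len(W)
    return min(sum(a != b for a, b in zip(W, x[i:i + K]))
               for i in range(len(x) - K + 1))

def d(W, S):
    """d(W, S) = sum_i d(W, x_i) over the sequences in S."""
    return sum(min_hamming(W, x) for x in S)

def pattern_driven(S, K):
    """Enumerate all 4^K consensus strings; provably optimal, exponential in K."""
    return min((''.join(p) for p in product('ACGT', repeat=K)),
               key=lambda W: d(W, S))
```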
Exhaustive Searches

2. Sample-driven algorithm:

For W = any K-long word occurring in some xi
    Find d(W, S)
Report W* = argmin_W d(W, S), or report a local improvement of W*

Running time: O(K N^2)

Advantage: time
Disadvantage: if the true motif is weak and does not occur exactly in the data, then a random motif may score better than any instance of the true motif
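
The sample-driven variant only scores words that actually occur in the data; a sketch reusing d() from the previous snippet:

```python
def sample_driven(S, K):
    """Score only the K-long words occurring in some sequence: O(K N^2)."""
    candidates = {x[i:i + K] for x in S for i in range(len(x) - K + 1)}
    return min(candidates, key=lambda W: d(W, S))  # d() from the sketch above
```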
MULTIPROFILER

• Extended sample-driven approach

Given a K-long word W, define:
Nα(W) = { words W' in S such that d(W, W') ≤ α }

Idea: assume W is an occurrence of the true motif W*; use Nα(W) to correct the "errors" in W.
MULTIPROFILER

Assume W differs from the true motif W* in at most L positions, where L is smaller than the word length K.

Define: a wordlet G of W is an L-long pattern with blanks, differing from W.

Example: K = 7, L = 3
W = ACGTTGA
G = --A--CG
MULTIPROFILER

Algorithm:
For each word W in S:
    For L = 1 to Lmax:
        1. Find the α-neighbors Nα(W) of W in S
        2. Find all "strong" L-long wordlets G in Nα(W)
        3. For each wordlet G:
            a. Modify W by the wordlet G, obtaining W'
            b. Compute d(W', S)
Report W* = argmin d(W', S)

Finding the strong wordlets in Nα(W) (step 2) is itself a smaller motif-finding problem; use exhaustive search.
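
A loose sketch of the neighborhood-correction idea, not the published MULTIPROFILER: here the "strong wordlet" selection is simplified to a per-position consensus over the α-neighborhood, and up to L_max positions of W are corrected toward it (d() is the function from the earlier sketches):

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def multiprofiler_sketch(S, K, alpha, L_max):
    """Correct each data word W toward the consensus of its alpha-neighborhood."""
    words = [x[i:i + K] for x in S for i in range(len(x) - K + 1)]
    best = min(words, key=lambda W: d(W, S))
    for W in words:
        neighbors = [V for V in words if hamming(W, V) <= alpha]
        counts = [Counter(V[i] for V in neighbors) for i in range(K)]
        consensus = [c.most_common(1)[0][0] for c in counts]
        # Positions where the neighborhood disagrees with W, strongest first.
        diffs = sorted((i for i in range(K) if consensus[i] != W[i]),
                       key=lambda i: -counts[i][consensus[i]])
        W_prime = list(W)
        for i in diffs[:L_max]:
            W_prime[i] = consensus[i]
        W_prime = ''.join(W_prime)
        if d(W_prime, S) < d(best, S):
            best = W_prime
    return best
```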
CONSENSUS

Algorithm, cycle 1:
For each word W in S:
    For each word W' in S:
        Create a gap-free alignment of W, W' (of fixed length!)
Keep the C1 best alignments A1, …, A_C1

e.g., kept pairs:
ACGGTTG   CGAACTT   GGGCTCT
ACGCCTG , AGAACTA , GGGGTGT , …
CONSENSUS

Algorithm, cycle t:
For each word W in S:
    For each alignment Aj from cycle t−1:
        Create a gap-free alignment of W, Aj
Keep the Ct best alignments A1, …, A_Ct

e.g., kept alignments:
ACGGTTG   CGAACTT   GGGCTCT
ACGCCTG   AGAACTA   GGGGTGT
   …         …         …
ACGGCTC , AGATCTT , GGCGTCT , …
CONSENSUS

• C1, …, Cn are user-defined heuristic constants
  - N is the sum of the sequence lengths
  - n is the number of sequences

Running time:
O(N^2) + O(N C1) + O(N C2) + … + O(N Cn) = O(N^2 + N Ctotal)

where Ctotal = Σi Ci, typically O(nC) for some large constant C
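
A simplified greedy sketch of the CONSENSUS idea, with a single beam width C instead of per-cycle constants, processing sequences in order; the real program scores columns by information content, while here the score is just the majority-letter count per column:

```python
def column_score(words):
    """Score a gap-free alignment (equal-length words) by majority letters."""
    K = len(words[0])
    return sum(max(sum(w[i] == c for w in words) for c in 'ACGT')
               for i in range(K))

def consensus_sketch(S, K, C):
    """Greedy beam search; assumes at least two sequences in S."""
    words = [[x[i:i + K] for i in range(len(x) - K + 1)] for x in S]
    # Cycle 1: all word pairs from the first two sequences.
    alignments = [[u, v] for u in words[0] for v in words[1]]
    alignments = sorted(alignments, key=column_score, reverse=True)[:C]
    # Cycles 2..n-1: extend each kept alignment with every word of the next sequence.
    for seq_words in words[2:]:
        alignments = [A + [w] for A in alignments for w in seq_words]
        alignments = sorted(alignments, key=column_score, reverse=True)[:C]
    return alignments[0]   # best gap-free multiple alignment found
```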
Expectation Maximization in Motif Finding

Expectation Maximization

Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is either motif or background
3. Find the likeliest motif model, background model, and classification of words into motif or background
Expectation Maximization

• Given sequences x1, …, xN
• Find all K-long words X1, …, Xn
• Define the motif model:
    M = (M1, …, MK); Mi = (Mi1, …, Mi4)   (assume alphabet {A, C, G, T})
  where Mij = Prob[ letter j occurs in motif position i ]
• Define the background model:
    B = (B1, …, B4)
  where Bj = Prob[ letter j in background sequence ]
Expectation Maximization

• Define:
    Zi1 = 1 if Xi is motif, 0 otherwise
    Zi2 = 1 − Zi1
• Given a word Xi = x[1]…x[K]:
    P[ Xi, Zi1 = 1 ] = λ · M1,x[1] · … · MK,x[K]
    P[ Xi, Zi2 = 1 ] = (1 − λ) · Bx[1] · … · Bx[K]
  Let λ1 = λ; λ2 = 1 − λ
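
A small sketch of these two joint probabilities; the names and array layout are my own (M is a K×4 profile, B a length-4 background distribution):

```python
import numpy as np

ALPHABET = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def word_likelihoods(X, M, B, lam):
    """P[X, Z1 = 1] and P[X, Z2 = 1] for a K-long word X.

    M[i, j] = Prob[letter j at motif position i]  (K x 4)
    B[j]    = Prob[letter j in background]        (length 4)
    lam     = lambda, the prior weight of the motif component
    """
    idx = [ALPHABET[c] for c in X]
    p_motif = lam * np.prod([M[i, j] for i, j in enumerate(idx)])
    p_background = (1.0 - lam) * np.prod(B[idx])
    return p_motif, p_background
```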
Expectation Maximization

Define:
• Parameter space θ = (M, B), with θ1 the motif model and θ2 the background model

Objective: maximize the log likelihood of the model:

log P(X1…Xn, Z | θ, λ) = Σi=1..n Σj=1,2 Zij log( λj P(Xi | θj) )
                       = Σi Σj Zij log P(Xi | θj) + Σi Σj Zij log λj
Expectation Maximization

• Maximize the expected likelihood by iterating two steps:
  - Expectation: find the expected value of the log likelihood, E[ log P(X1…Xn, Z | θ, λ) ]
  - Maximization: maximize this expected value over θ, λ
Expectation Maximization: E-step

Expectation: find the expected value of the log likelihood:

E[ log P(X1…Xn, Z | θ, λ) ] = Σi=1..n Σj=1,2 E[Zij] log P(Xi | θj) + Σi Σj E[Zij] log λj

where the expected values of Z can be computed as follows:

E[Zij] = λj P(Xi | θj) / ( λ P(Xi | θ1) + (1 − λ) P(Xi | θ2) ) = Z*ij
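
Continuing the sketch above, the E-step computes Z*_i1 for every word (Z*_i2 = 1 − Z*_i1):

```python
def e_step(words, M, B, lam):
    """Posterior probability that each word was generated by the motif model."""
    Z = np.empty(len(words))
    for i, X in enumerate(words):
        p_m, p_b = word_likelihoods(X, M, B, lam)  # from the sketch above
        Z[i] = p_m / (p_m + p_b)
    return Z
```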
Expectation Maximization: M-step

Maximization: maximize the expected value over θ and λ independently.

For λ, this is easy:

λNEW = argmax_λ Σi=1..n ( Z*i1 log λ + Z*i2 log(1 − λ) ) = (Σi=1..n Z*i1) / n
Expectation Maximization: M-step

• For θ = (M, B), define:
    cjk = E[ # times letter k appears in motif position j ]
    c0k = E[ # times letter k appears in background ]
• The cjk values are calculated easily from the Z* values.

It easily follows:

MNEWjk = cjk / Σk=1..4 cjk
BNEWk = c0k / Σk=1..4 c0k

(To not allow any 0's, add pseudocounts.)
Initial Parameters Matter!

Consider the following "artificial" example. x1, …, xN contain:
• 2^12 patterns on {A, T}: A…A, A…AT, …, T…T
• 2^12 patterns on {C, G}: C…C, C…CG, …, G…G
• D << 2^12 occurrences of the 12-mer ACTGACTGACTG

Some local maxima:
• λ ≈ ½; B = ½C, ½G; Mi = ½A, ½T for i = 1, …, 12
• λ ≈ D/2^(k+1); B = ¼A, ¼C, ¼G, ¼T; M1 = 100% A, M2 = 100% C, M3 = 100% T, etc.
Overview of EM Algorithm

1. Initialize parameters θ = (M, B), λ:
   - Try different values of λ, from N^(-1/2) up to 1/(2K)
2. Repeat:
   a. Expectation
   b. Maximization
3. Until the change in θ = (M, B), λ falls below ε
4. Report results for several "good" λ
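
Tying the earlier sketches together, one possible outer loop; the random initialization and the stopping rule are assumptions that follow the slide's outline:

```python
def em_motif(S, K, lam0, max_iter=1000, eps=1e-6):
    """Run EM from one starting point; returns the fitted (M, B, lambda)."""
    rng = np.random.default_rng(0)
    words = [x[i:i + K] for x in S for i in range(len(x) - K + 1)]
    M = rng.dirichlet(np.ones(4), size=K)    # random initial K x 4 motif model
    B = np.full(4, 0.25)                     # uniform initial background
    lam = lam0
    for _ in range(max_iter):
        Z = e_step(words, M, B, lam)
        M_new, B_new, lam_new = m_step(words, Z)
        change = max(np.abs(M_new - M).max(), np.abs(B_new - B).max(),
                     abs(lam_new - lam))
        M, B, lam = M_new, B_new, lam_new
        if change < eps:                     # parameters stopped moving
            break
    return M, B, lam
```

In practice one would restart from several values of λ between N^(-1/2) and 1/(2K), as the slide suggests, and keep the best-scoring run.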
Overview of EM Algorithm

• Running time of one iteration: O(NK)
  - Usually < N iterations are needed for convergence, and < N starting points
  - Overall complexity: unclear; typically between O(N^2 K) and O(N^3 K)
• EM is a local optimization method
• Initial parameters matter

MEME: Bailey and Elkan, ISMB 1994.