Motif Finding Algorithm

Motif Discovery: Algorithm and Application
Dan Scanfeld
Hong Xue
Sumeet Gupta
Varun Aggarwal
Objective: Motif discovery and use for deriving biological information
• Get sequences bound and unbound by the TF Nanog in human ES cells
• Find a motif using a motif-finding algorithm
• Genome-wide functional analysis using the motif to find biological patterns
Why Nanog: Relevance to ES Cells
1 genome, 1 cell
>200 phenotypes, 10^13 cells
• Activate certain genes essential for cell growth.
• Repress a key set of genes needed for an embryo to develop.
• This key set of repressed genes activates entire networks for generating many different specialized cells and tissues.
Location Analysis (ChIP-chip) in Human ES Cells (Boyer et al., Cell 122: 947-956)
• Crosslink
• Fragment
• Enrich for Nanog
• Differentially label
• Agilent 44k, 10-array set
ChIP-chip Data Analysis
• Obtain intensities using GenePix (negative control subtracted)
• Perform median normalization (set-normalized)
• Probe-set p-value (p = 0.005)
• May 2004 genome release
• Sequences (500 bp)
[Figure: IP signal, WCE signal and enrichment ratio along chromosomal position, with thresholds P<=0.001, P<=0.005 and P<=0.01]
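
The slide names median normalization and an enrichment ratio without giving formulas. A minimal sketch of one common reading follows. It assumes the whole-cell-extract (WCE) channel is rescaled so that its median matches the IP channel, and that the enrichment ratio is IP divided by normalized WCE. Both choices, the function names and the intensities are illustrative assumptions, not the Young Lab error model.

```python
from statistics import median

def median_normalize(ip, wce):
    """Rescale WCE intensities so their median equals the IP median (assumed scheme)."""
    scale = median(ip) / median(wce)
    return [w * scale for w in wce]

def enrichment_ratios(ip, wce):
    """Per-probe ratio of IP signal to median-normalized WCE signal."""
    wce_norm = median_normalize(ip, wce)
    return [i / w for i, w in zip(ip, wce_norm)]

# Hypothetical background-subtracted intensities for five probes.
ip = [120.0, 95.0, 400.0, 110.0, 100.0]
wce = [60.0, 50.0, 55.0, 52.0, 48.0]
ratios = enrichment_ratios(ip, wce)
most_enriched = max(range(len(ratios)), key=ratios.__getitem__)
print(most_enriched, round(ratios[most_enriched], 2))
```

After normalization most probes sit near a ratio of 1, so a probe with a much larger ratio stands out as a candidate bound region.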
Motif Finding Algorithm (MacIsaac et al., 2006)
• Use structural prior (database, MacIsaac et al.)
• Refinement: Expectation-Maximization (ZOOPS)
• Score of found motifs: classification on unseen data
• Significance testing on score: use of empirical p-value
Refinement: Expectation-Maximization
Differences from EM in Lab 1
• Use of structural prior (beta = strength of prior)
• ZOOPS (zero or one occurrence per sequence) model
• 5th-order Markov model for the background, trained over unbound sequences
• SVM for hypothesis testing
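
To make the background model concrete, here is a minimal sketch of a k-th order Markov model trained by counting, with k = 5 as on the slide. The function names, the pseudocount value and the decision to skip the first k bases of each sequence are assumptions made for illustration.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=5, pseudocount=1.0):
    """Count how often each base follows each `order`-long context."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def prob(context, base):
        """P(base | context) with pseudocounts; unseen contexts give 1/4."""
        c = counts.get(context, {})
        total = sum(c.values()) + 4 * pseudocount
        return (c.get(base, 0.0) + pseudocount) / total

    return prob

def log_prob_background(seq, prob, order=5):
    """log P(seq | B), skipping the first `order` bases (no full context yet)."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

# Hypothetical unbound training sequences.
unbound = ["ACGTACGTACGTACGT"] * 3
prob = train_markov(unbound)
print(prob("ACGTA", "C"), log_prob_background("ACGTACGT", prob))
```

A 5th-order model has 4^5 = 1024 contexts, so it needs a reasonably large set of unbound sequences before the counts dominate the pseudocounts.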
ZOOPS Model (Bailey & Elkan 1994)
• $B$: background model; $M$: motif model
• $\Lambda$: fraction of bound sequences (mixture model parameter)
• Sequences are drawn from the distribution
$$P(S) = \Lambda\, P(S \mid M) + (1 - \Lambda)\, P(S \mid B)$$
• Hidden variable for EM: $Z_{ij} \in \{0, 1\}$, where position $j$ in sequence $i$ is bound by the TF (1) or not (0)
E-step:
$$\mathrm{Prob}(Z_{ij}) = \frac{\Lambda\, P(S_i \text{ bound at } j \mid M)}{(1 - \Lambda)\, P(S_i \mid B) + \Lambda \sum_{j'} P(S_i \text{ bound at } j' \mid M)}$$
Here $\mathrm{Prob}(Z_{ij})$ is $P(M \text{ bound at } j \mid S_i)$ and the denominator is $P(S_i)$.
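
The E-step can be sketched for a single sequence as follows. To keep the example short it makes three assumptions that are not on the slide: the background is zeroth-order (a dictionary `bg` of base frequencies) rather than a 5th-order Markov model, a uniform prior over start positions is folded into P(S_i bound at j | M), and both numerator and denominator are divided by P(S_i | B) so that only likelihood ratios are needed.

```python
def e_step(seq, pwm, bg, lam):
    """Return Prob(Z_ij) for every motif start position j of one sequence.

    pwm: list of {base: probability} dicts, one per motif position.
    bg:  {base: probability} zeroth-order background (an assumption here).
    lam: current estimate of Lambda, the fraction of bound sequences.
    """
    w = len(pwm)
    n_pos = len(seq) - w + 1
    ratios = []
    for j in range(n_pos):
        # P(S_i bound at j | M) / P(S_i | B), with a uniform prior over j.
        r = 1.0 / n_pos
        for k in range(w):
            base = seq[j + k]
            r *= pwm[k][base] / bg[base]
        ratios.append(r)
    denom = (1.0 - lam) + lam * sum(ratios)
    return [lam * r / denom for r in ratios]

# Hypothetical 2-bp motif "TA" and a uniform background.
pwm = [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
       {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}]
bg = {b: 0.25 for b in "ACGT"}
z = e_step("CCTACC", pwm, bg, lam=0.5)
print([round(v, 3) for v in z])
```

Under the ZOOPS assumption the values for one sequence sum to at most 1; the remainder is the probability that the sequence carries no site at all.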
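M-step: (same as before)
• Updating $M$ (motif model): for each position $p$ of the motif model and each base $b$ (A, C, G or T), with $\mathrm{Base}_{ij}$ the base at position $j$ of the $i$th sequence,
$$\mathrm{PWM}(p, b) \propto \sum_i \sum_j \mathrm{Prob}(Z_{i,\,j-p+1})\,[\mathrm{Base}_{ij} = b] + \text{pseudocounts}$$
then normalize so that $\sum_b \mathrm{PWM}(p, b) = 1$.
• Updating the background model: we don't update the background.
• Updating $\Lambda$:
$$\Lambda = \frac{\sum_i \sum_j \mathrm{Prob}(Z_{ij})}{\text{number of sequences}}$$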
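
A minimal sketch of these two updates, assuming the posteriors Prob(Z_ij) are already available as one list per sequence, indexed by motif start position. The code uses 0-based indices, so a site starting at j contributes its weight to motif column k through the base at position j + k. The pseudocount value and the function name are illustrative.

```python
def m_step(seqs, z, width, pseudocount=0.01):
    """Re-estimate the PWM and Lambda from the posteriors z[i][j].

    seqs:  list of sequences over the alphabet ACGT.
    z:     z[i][j] is Prob(Z_ij) for a site starting at position j of seqs[i].
    width: motif length.
    """
    counts = [{b: pseudocount for b in "ACGT"} for _ in range(width)]
    for seq, z_i in zip(seqs, z):
        for j, z_ij in enumerate(z_i):
            for k in range(width):
                counts[k][seq[j + k]] += z_ij
    # Normalize each motif column so it sums to 1.
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({b: c / total for b, c in col.items()})
    # Lambda: expected fraction of sequences that contain a site.
    lam = sum(sum(z_i) for z_i in z) / len(seqs)
    return pwm, lam

# Hypothetical posteriors for two 6-bp sequences and a 2-bp motif.
seqs = ["CCTACC", "GTAGGG"]
z = [[0.0, 0.0, 0.9, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0, 0.0]]
pwm, lam = m_step(seqs, z, width=2)
print(round(pwm[0]["T"], 3), round(lam, 2))
```

The background model is deliberately left out of the update, matching the slide.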
Hypothesis testing
• Get motifs from EM
• Use 2 sets of bound and unbound sequences (train and test)
• Train a linear SVM on the train set
• Find classification error on the test set
• Error = misclassifications / total samples
• Score = 1 - error
[Diagram: bound sequences (B) + EM give the motif (M). Classifier input = P(S|M)/P(S|B); output = B (bound) or UB (unbound). Train set (B, UB): train classifier. Test set (B, UB): test classifier.]
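
The scoring step can be illustrated with a small stand-in classifier. This is not an SVM: with a single input feature, a linear classifier reduces to a cut on that feature, so the sketch simply picks the cut that minimizes training errors. The feature values (meant as log-likelihood ratios), the labels and the function names are hypothetical.

```python
def train_threshold(features, labels):
    """Pick the cut on a 1-D feature that minimizes training errors.

    Stand-in for the linear SVM on the slide; labels are True for bound.
    """
    xs = sorted(set(features))
    candidates = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])

    def errors(t):
        return sum((x > t) != y for x, y in zip(features, labels))

    return min(candidates, key=errors)

def score(threshold, features, labels):
    """Score = 1 - error, with error = misclassifications / total samples."""
    errors = sum((x > threshold) != y for x, y in zip(features, labels))
    return 1.0 - errors / len(labels)

# Hypothetical log-likelihood ratios for train and test sequences.
train_x = [2.1, 1.7, 0.9, -0.5, -1.2, 0.2]
train_y = [True, True, True, False, False, False]
test_x = [1.5, 0.4, -0.3, -0.8]
test_y = [True, True, False, False]

cut = train_threshold(train_x, train_y)
print(cut, score(cut, test_x, test_y))
```

A score of 0.5 on a balanced test set corresponds to chance, which is the baseline a motif has to beat.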
Expectation-Maximization
When to stop? Will it overtrain?
• Rules of thumb (stop when the likelihood increases very slowly):
  • Second derivative is negative for a given number of iterations
  • Euclidean distance is less than a given value
• EM overtrains to the given sequences:
  • It maximizes the likelihood of the motif in the given sequences and disregards its likelihood in unbound sequences.
• Find the test classification error at each EM step using SVMs.
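
The first rule of thumb can be sketched as a check on the log-likelihood trace. The sketch assumes that "second derivative" means the second finite difference of the log-likelihood across iterations, and that the negative values must occur in consecutive iterations; the slides do not say either. The defaults of 60 and 5 follow the run details given later in the deck.

```python
def should_stop(log_likelihoods, min_iters=60, patience=5):
    """True once at least `min_iters` iterations have run and the last
    `patience` second differences of the log-likelihood are all negative."""
    n = len(log_likelihoods)
    if n < max(min_iters, patience + 2):
        return False
    recent = log_likelihoods[-(patience + 2):]
    second_diffs = [recent[t] - 2 * recent[t - 1] + recent[t - 2]
                    for t in range(2, len(recent))]
    return all(d < 0 for d in second_diffs)

# Hypothetical trace: rises quickly at first, then flattens out.
trace = [-100.0 * 0.9 ** t for t in range(70)]
print(should_stop(trace[:30]), should_stop(trace))
```

This criterion only detects a plateau in the training likelihood; it says nothing about overtraining, which is why the slide pairs it with a test classification error at each step.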
Expectation-Maximization
A different methodology:
• 4 sets of data:
  • Bound (for EM)
  • B & UB (train SVM)
  • B & UB (test SVM)
  • B & UB (validation)
• At each EM iteration, train the SVM and find the test error.
• Use two kinds of motifs:
  • Best-test-error motif
  • EM last-iteration motif
• Choose the 10 best hypotheses
• Use a larger validation set
[Diagram: from the initial points, each EM iteration is followed by SVM training and an error estimate, up to the final motif.]
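
The selection logic in this methodology can be sketched as follows, assuming each EM run keeps a history of (motif, test score) pairs, one per iteration. The function names and the placeholder motif labels are hypothetical.

```python
def select_motifs(history):
    """history: list of (motif, test_score), one entry per EM iteration.

    Returns the best-of-iteration entry and the end-of-iteration entry.
    """
    best_of_iteration = max(history, key=lambda entry: entry[1])
    end_of_iteration = history[-1]
    return best_of_iteration, end_of_iteration

def top_hypotheses(candidates, validation_score, k=10):
    """Keep the k candidates with the highest validation score."""
    return sorted(candidates, key=validation_score, reverse=True)[:k]

# Hypothetical run: the test score peaks before EM has converged.
history = [("motif_iter0", 0.52), ("motif_iter1", 0.60), ("motif_iter2", 0.57)]
best, last = select_motifs(history)
print(best, last)
```

Because the best-of-iteration motif is chosen using the test set, its test score is optimistically biased; the separate, larger validation set is what gives an unbiased comparison between hypotheses.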
Expectation-Maximization
Details of run
• Transcription factor: Nanog
• Beta = [0 0.2 0.35 0.5 0.6 0.7 1] (strength of prior)
• 5 motifs per beta, by masking motifs
• Motif length: 8
• 25 bound sequences for EM
• 500 base pairs in each sequence
• 150 total train sequences (SVM) [low: noisy]
• 150 total test sequences (SVM) [low: noisy]
• 500 total validation sequences
• c = [1e-3, 0.05, 100.0] (SVM: budget for misclassifications)
• EM runs for a minimum of 60 iterations; stop when the second derivative is negative for five iterations
Expectation-Maximization
Representative score graphs during EM iterations
[Figure: four panels, for beta = 0.0, 0.35, 0.6 and 0.7. X-axis: EM iteration; Y-axis: score of motif.]
Expectation-Maximization
Test and validation error of refined motifs
[Figure: two panels, test classification score and validation classification score. X-axis: beta value; Y-axis: score of motif. *: end-of-iteration EM result; o: best of iteration.]
Expectation-Maximization
When is it the best-of-iteration?
[Figure: for each run, the total number of EM iterations and the iteration at which the best-of-iteration motif occurred.]
Expectation-Maximization
Results:
• 6 out of 7 top-ranking motifs were best-of-iteration and 1 was end-of-iteration (6 out of 10 as well)
• Best motif: validation error over a set of 500
  Score: 61.2%, Error: 38.8%
Position  1         2         3         4         5         6         7         8
A         0.003392  0.764554  0.995187  0.072268  0.063644  0.459349  0.000033  0.088069
C         0.268216  0.050266  0.000149  0.000022  0.303880  0.003363  0.472214  0.201074
G         0.039865  0.000023  0.002015  0.205620  0.105970  0.537248  0.446827  0.228689
T         0.688527  0.185157  0.002648  0.722090  0.526506  0.000040  0.080927  0.482167
Consensus: T A A T T (A or G) (C or G) T
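
The consensus can be derived from the matrix mechanically. The sketch below reports, for each position, every base whose probability is at least 0.4, falling back to the single most probable base if none qualifies. The 0.4 cutoff is an assumption chosen for illustration; the slides do not state how the consensus was read off.

```python
# Position weight matrix of the best motif, copied from the results above.
PWM = {
    "A": [0.003392, 0.764554, 0.995187, 0.072268, 0.063644, 0.459349, 0.000033, 0.088069],
    "C": [0.268216, 0.050266, 0.000149, 0.000022, 0.303880, 0.003363, 0.472214, 0.201074],
    "G": [0.039865, 0.000023, 0.002015, 0.205620, 0.105970, 0.537248, 0.446827, 0.228689],
    "T": [0.688527, 0.185157, 0.002648, 0.722090, 0.526506, 0.000040, 0.080927, 0.482167],
}

def consensus(pwm, min_prob=0.4):
    """List the bases with probability >= min_prob at each motif position."""
    width = len(pwm["A"])
    out = []
    for p in range(width):
        bases = [b for b in "ACGT" if pwm[b][p] >= min_prob]
        if not bases:
            # No base reaches the cutoff: keep the single most probable one.
            bases = [max("ACGT", key=lambda b: pwm[b][p])]
        out.append("".join(bases))
    return out

print(consensus(PWM))  # ['T', 'A', 'A', 'T', 'T', 'AG', 'CG', 'T']
```

Positions 6 and 7 are genuinely split between two bases, which is why the consensus is degenerate there.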
Assumptions and Caveats
• Random baseline: end-of-run motif in EM
• Low number of sequences for test error
• Bound sets may not actually be bound. Better to use highly probable sequences as bound.
• All runs (including beta = 0) used the structural prior as the starting point.
GSEA (Subramanian et al. 2005)
• Gene Set Enrichment Analysis (GSEA) determines whether an a priori defined set of genes shows statistically significant differences between two biological states.
GSEA Output
• Enrichment plot
• Gene list
• Gene set information
GSEA Ranked List
• Set of promoter sequences for every human gene.
• 2000 bp upstream and 200 bp downstream of the transcription initiation site.
• Score each promoter for likelihood of the motif.
• Input this ranked list into GSEA.
• Search for gene sets enriched in the ranked list.
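
A minimal sketch of building the ranked list. It assumes each promoter is scored by its best-matching window, as a log-odds score against a uniform background; the slides do not say how the per-promoter likelihood was computed. The gene names, the short sequences and the 2-bp motif are hypothetical.

```python
import math

def best_window_score(seq, pwm, bg=0.25):
    """Highest log-odds score of the motif over all windows of one promoter."""
    w = len(pwm)
    best = -math.inf
    for j in range(len(seq) - w + 1):
        s = sum(math.log(pwm[k][seq[j + k]] / bg) for k in range(w))
        best = max(best, s)
    return best

def rank_promoters(promoters, pwm):
    """Return (gene, score) pairs sorted from most to least motif-like."""
    scored = [(gene, best_window_score(seq, pwm))
              for gene, seq in promoters.items()]
    return sorted(scored, key=lambda gs: gs[1], reverse=True)

# Hypothetical 2-bp motif "TA" and three toy promoters.
pwm = [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
       {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}]
promoters = {"geneA": "CCTACC", "geneB": "CCTCCC", "geneC": "CCCCCC"}
for gene, s in rank_promoters(promoters, pwm):
    print(gene, round(s, 3))
```

Scoring by the best window rewards one strong site; summing over windows instead would favor promoters with several weaker sites, and the two choices can rank genes differently.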
Results
• Human embryonic stem cell genes OCT4, NANOG, STELLAR, and GDF3 are expressed in both seminoma and breast carcinoma (Ezeh et al. 2006).
• Breast cancer gene set found at p-value 0.008
Implementation Details
• Young Lab error model for ChIP-chip data analysis
• Motif-finding algorithm in MATLAB
  • Implemented Markov model
  • Implemented ZOOPS model
  • Integrated SVM Toolbox (by S. R. Gunn) with code
• Used structural prior from MacIsaac et al. 2006
• Used GSEA software for functional analysis
Future Directions
• Algorithm
  • Better use of classification error
  • Maximize likelihood in bound + minimize likelihood in unbound (multi-objective optimization using GAs)
  • Biological information: distance from transcription site, conservation
  • Integrating expression data
  • Cross-species motif search and functional analysis, maybe using GO terms
• Scoring
• Sequence length
Acknowledgments
• Fraenkel Lab
• Young Lab
• Kenzie D. MacIsaac
• Dr. David Gifford (CSAIL)
• Dr. Richard Young (WIBR)
• Dr. Tommi Jaakkola (CSAIL)