Modeling Dependencies in Protein-DNA Binding Sites
Yoseph Barash 1
Gal Elidan 1
Nir Friedman 1
Tommy Kaplan 1,2
1 School of Computer Science & Engineering
2 Hadassah Medical School
The Hebrew University, Jerusalem, Israel
Dependent positions in binding sites
[Figure: a binding site in the promoter of a gene; labels: A, T, C, ?]
Most approaches assume position independence
To model or not to model dependencies?
[Man & Stormo 2001; Bulyk et al., 2002; Benos et al., 2002]
Pros: Biology suggests dependencies
• A single amino acid interacts with two nucleotides
• Change in conformation of protein or DNA
Cons: Modeling dependencies is harder
• Additional parameters
• Requires more data, not as robust
Data-driven approach
• Can we learn dependencies from available genomic data? Yes
• Do dependency models perform better? Yes
Outline
• Flexible models of dependencies
• Learning from (un)aligned sequences
• Systematic evaluation
• Biological insights
How to model binding sites?
Represent a distribution over binding sites, P(X_1, \dots, X_5), where X_i is the nucleotide at position i.

Profile: independence model
  P(X_1,\dots,X_5) = \prod_{i=1}^{5} P(X_i)

Tree: direct dependencies
  P(X_1,\dots,X_5) = \prod_{i=1}^{5} P(X_i \mid X_{pa(i)})
  where pa(i) is the parent of position i in the tree (the root has no parent)

Mixture of Profiles: global dependencies
  P(X_1,\dots,X_5) = \sum_{T} P(T) \prod_{i=1}^{5} P(X_i \mid T)
  where T is the mixture component

Mixture of Trees: both types of dependencies
  P(X_1,\dots,X_5) = \sum_{T} P(T) \prod_{i=1}^{5} P(X_i \mid T, X_{pa(i)})
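The model families on this slide differ only in how the probability of a site factorizes. A minimal Python sketch of three of them follows; the toy parameter values, the function names and the table layout are illustrative assumptions, not taken from the slides. A mixture of trees would combine `mixture_prob` and `tree_prob` in the same way.

```python
from math import prod

# Hypothetical toy parameters for a 3-position site (not from the slides).
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

def profile_prob(site, columns):
    """Profile: P(x) = prod_i P(x_i), positions are independent."""
    return prod(col[base] for col, base in zip(columns, site))

def mixture_prob(site, weights, components):
    """Mixture of profiles: P(x) = sum_t P(T=t) * prod_i P(x_i | T=t)."""
    return sum(w * profile_prob(site, comp)
               for w, comp in zip(weights, components))

def tree_prob(site, parents, tables):
    """Tree: P(x) = prod_i P(x_i | x_pa(i)).

    parents[i] is the parent position of i, or None for the root.
    tables[i] is a distribution over bases for the root, and a dict
    parent_base -> distribution over bases for every other position.
    """
    p = 1.0
    for i, base in enumerate(site):
        if parents[i] is None:
            p *= tables[i][base]
        else:
            p *= tables[i][site[parents[i]]][base]
    return p
```

When every conditional table ignores its parent, `tree_prob` reduces to `profile_prob`, which is one way to see the profile as a special case of the tree.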
Learning models: Aligned binding sites
Aligned binding sites
GCGGGGCCGGGC
TGGGGGCGGGGT
AGGGGGCGGGGG
TAGGGGCCGGGC
TGGGGGCGGGGT
AAAGGGCCGGGC
GGGAGGCCGGGA
GCGGGGCGGGGC
GAGGGGACGAGT
CCGGGGCGGTCC
ATGGGGCGGGGC
[Figure: the aligned binding sites are passed to the learning machinery, which fits the candidate models (Profile, Tree, Mixture of Profiles, Mixture of Trees)]
Learning machinery: select the maximum likelihood model
Learning is based on methods for probabilistic graphical models (Bayesian networks)
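The slides do not say which procedure is used to learn the tree. One standard option for maximum likelihood trees is the Chow-Liu construction: weight every pair of positions by its empirical mutual information and keep a maximum-weight spanning tree. The sketch below assumes that technique; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def mutual_information(sites, i, j):
    """Empirical mutual information (in bits) between columns i and j."""
    n = len(sites)
    ci = Counter(s[i] for s in sites)
    cj = Counter(s[j] for s in sites)
    cij = Counter((s[i], s[j]) for s in sites)
    return sum((c / n) * log2(c * n / (ci[a] * cj[b]))
               for (a, b), c in cij.items())

def chow_liu_edges(sites):
    """Maximum-weight spanning tree over positions (Kruskal + union-find)."""
    length = len(sites[0])
    pairs = sorted(combinations(range(length), 2),
                   key=lambda p: mutual_information(sites, *p),
                   reverse=True)
    root = list(range(length))

    def find(x):
        # Follow parent pointers to the set representative, compressing paths.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    edges = []
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:          # adding (i, j) does not close a cycle
            root[ri] = rj
            edges.append((i, j))
    return edges
```

The returned edges are undirected; choosing any position as the root and directing edges away from it gives the parent function of the tree model.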
Evaluation using aligned data
95 TFs with ≥ 20 binding sites from TRANSFAC database
[Wingender et al., 2001]
Estimate generalization of each model:
Test: how probable is the site given the model?
Cross-validation:
[Figure: the data set is split into a training set and a test set; each site is scored when it falls in the test set]

Binding site    Test log-likelihood
GCGGGGCCGGGC    -20.34
TGGGGGCGGGGT    -23.03
AGGGGGCGGGGG    -21.31
TAGGGGCCGGGC    -19.10
TGGGGGCGGGGT    -18.42
TGGGGGCCGGGC    -19.70
ATGGGGCGGGGC    -22.39
GTGGGGCGGGGC    -23.54
ATGGGGCGGGGC    -22.39
GTGGGGCGGGGC    -23.54
GCGGGGCGGGGC    -18.07
GAGGGGACGAGT    -19.18
CCGGGGCGGTCC    -18.31
ATGGGGCGGGGC    -21.43

Test avg. LL = -20.77
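The cross-validation test can be sketched for the simplest model, the profile. The pseudocount, the number of folds, the base-2 logarithm and all function names below are assumptions made for illustration; the slides do not give these details for the aligned data.

```python
from math import log2

BASES = "ACGT"

def fit_profile(sites, pseudocount=1.0):
    """Estimate per-position base frequencies, smoothed with a pseudocount."""
    columns = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        columns.append({b: c / total for b, c in counts.items()})
    return columns

def log_likelihood(site, columns):
    """log2 P(site | profile): how probable is the site given the model?"""
    return sum(log2(col[b]) for col, b in zip(columns, site))

def cross_validated_ll(sites, folds=5):
    """Average test log-likelihood per site over all folds."""
    scores = []
    for k in range(folds):
        test = sites[k::folds]
        train = [s for idx, s in enumerate(sites) if idx % folds != k]
        model = fit_profile(train)   # learn from the training set only
        scores.extend(log_likelihood(s, model) for s in test)
    return sum(scores) / len(scores)
```

Running the same loop with a dependency model in place of `fit_profile` gives the per-site differences that a paired test can compare.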
Arabidopsis ABA binding factor 1
[Figure: learned models. Mixture of Profiles with component weights 76% and 24%; Tree over positions X4 to X12]
Profile: test LL per instance -19.93
Dependency models (Tree, Mixture of Profiles):
• Test LL per instance -18.47 (+1.46; improvement in likelihood > 2.5-fold)
• Test LL per instance -18.70 (+1.23; improvement in likelihood > 2-fold)
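A gain in log-likelihood per instance translates into a fold-change in likelihood by exponentiation. The sketch below assumes base-2 logarithms, which is consistent with the figures quoted on this slide (+1.46 giving more than 2.5-fold, +1.23 giving more than 2-fold); the slides do not state the base explicitly.

```python
def fold_change(delta_ll, base=2.0):
    """Fold-change in likelihood for a log-likelihood gain of delta_ll.

    Assumes the log-likelihoods were computed with logarithms of this base.
    """
    return base ** delta_ll

# Gains quoted on the slide, relative to the profile's -19.93.
gain_a = -18.47 - (-19.93)   # +1.46
gain_b = -18.70 - (-19.93)   # +1.23
```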
Likelihood improvement over profiles
95 aligned TRANSFAC data sets
[Figure: fold-change in likelihood (log scale, 0.5 to 128) per data set (x-axis 10 to 90); markers: Significant (paired t-test) / Not significant]
• Significant improvement in generalization
• Data often exhibits dependencies
Evaluation for unaligned data
Motif finding problem
Input: A set of potentially co-regulated genes
Output: A common motif in their promoters
Sources of data:
• Gene annotation (e.g. Hughes et al, 2000)
• Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000)
• ChIP (e.g. Simon et al, 2001; Lee et al, 2002)
Learning models: unaligned data
Use the EM algorithm to simultaneously
• Identify binding site positions
• Learn a dependency model
[Figure: starting from unaligned data, the EM algorithm alternates between identifying binding sites and learning a model, for each of the model types]
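The alternation between locating sites and learning a model can be sketched for the simplest case: a profile motif, exactly one site per sequence, and a uniform background. All of these are simplifying assumptions, as are the function name, the seed-based initialisation and the pseudocount; the dependency models on the slides would replace the profile in the M-step.

```python
from math import prod

BASES = "ACGT"

def em_motif(seqs, seed, iterations=20, pseudocount=0.1, background=0.25):
    """EM for one motif occurrence per sequence, starting from a seed word."""
    width = len(seed)
    # Initial profile: favour the seed base at every position (assumption).
    profile = [{b: (0.7 if b == seed[i] else 0.1) for b in BASES}
               for i in range(width)]
    for _ in range(iterations):
        # E-step: posterior over the site's start position in each sequence.
        posteriors = []
        for s in seqs:
            ratios = [prod(profile[i][s[p + i]] / background
                           for i in range(width))
                      for p in range(len(s) - width + 1)]
            total = sum(ratios)
            posteriors.append([r / total for r in ratios])
        # M-step: re-estimate the profile from expected base counts.
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for p, w in enumerate(post):
                for i in range(width):
                    counts[i][s[p + i]] += w
        profile = [{b: c / sum(col.values()) for b, c in col.items()}
                   for col in counts]
    # Most probable start position per sequence under the last E-step.
    starts = [max(range(len(post)), key=post.__getitem__)
              for post in posteriors]
    return profile, starts
```

EM only finds a local optimum, so in practice one would try several seeds and keep the highest-scoring result.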
ChIP location analysis
[Lee et al, 2002]
Yeast genome-wide location experiments
Target genes for 106 TFs in 146 experiments
Gene      ABF1 Targets   ...   ZAP1 Targets
YAL001C   +                    –
YAL002W   –                    +
YAL003W   +                    –
YAL005C   –                    –
...
YAL010C   +                    –
YAL012C   –                    +
YAL013W   –                    +
YPR201W   –                    –
(# genes: ~6000)
Example: Models learned for ABF1 (YPD)
Autonomously replicating sequence-binding factor 1
[Figure: sequence logos of the known profile (from TRANSFAC), the learned profile, and the learned Mixture of Profiles; labels in figure: 43, 492]
Evaluating Performance
Detect target genes on a genomic scale:
[Figure: a promoter sequence (ACGTAT...AGGGATGC) spanning positions -1000 to 0, with position -473 marked]
[Figure: Gal4 regulates Gal80. p-value (log scale, 10^-1 to 10^-8) by promoter position (-180 to -60) for Profile and Mix of Trees; the biologically verified site is marked; threshold: Bonferroni corrected p-value ≤ 0.01]
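The Bonferroni threshold in the plot can be illustrated with a few lines of Python. How many tests enter the correction is not stated on the slide; the sketch assumes it is the number of candidate p-values passed in, and the function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Flag p-values that stay at or below alpha after Bonferroni correction.

    Each p-value is multiplied by the number of tests (capped at 1.0).
    """
    m = len(p_values)
    return [min(1.0, p * m) <= alpha for p in p_values]
```

With four candidates, a raw p-value of 0.002 passes (0.008 after correction) while 0.004 does not (0.016).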
Evaluation using ChIP location data
[Lee et al, 2002]
Evaluate using a 5-fold cross-validation test:

Gene      Prediction  True  Outcome
YAL001C   +           +     √
YAL002W   –           –     √
YAL003W   +           +     √
YAL005C   –           –     √
YAL007C   –           +     FN
YAL008W   –           –     √
YAL009W   –           –     √
YAL010C   +           +     √
YAL012C   +           –     FP
YAL013W   –           –     √
YPR201W   –           –     √
Example: ROC curve of HSF1
[Figure: ROC curves for Profile, Tree, Mixture of Profiles and Mixture of Trees; y-axis: True Positive Rate (Sensitivity), 0% to 90%; x-axis: False Positive Rate, 0% to 5%; ~60 FP marked near 1%]
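A ROC curve like the one on this slide is obtained by sweeping a score threshold over the genes and recording the false and true positive rates at each step. The sketch below is a generic construction, not the authors' code; it assumes one score per gene and ignores tied scores.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    scores: one model score per gene, higher meaning more likely a target.
    labels: 1 for true targets, 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, is_target in ranked:
        # Lowering the threshold past this gene predicts it as a target.
        if is_target:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points
```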
Improvement in sensitivity & specificity
105 unaligned data sets from Lee et al.
Sensitivity = TP / True
Specificity = TP / Predicted
(TP: the overlap between the True and the Predicted target sets)
[Figure: three scatter plots of Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60), with the counts labeled in each plot:]
• Tree vs. Profile: 3, 30, 15, 6
• Mixture of Profiles vs. Profile: 0, 52, 18, 17
• Mixture of Trees vs. Profile: 1, 84, 2, 16
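The two measures as defined on these slides can be computed directly from the predicted and true target sets. The sketch below applies them to the 11-gene cross-validation example from the earlier slide; the function name is hypothetical. Note that TP / Predicted is the quantity often called precision or positive predictive value elsewhere.

```python
def sensitivity_specificity(predicted, true):
    """Sensitivity = TP / True, Specificity = TP / Predicted (slide definitions)."""
    tp = len(predicted & true)
    return tp / len(true), tp / len(predicted)

# Genes marked "+" in the True and Prediction columns of the 11-gene example.
true_targets = {"YAL001C", "YAL003W", "YAL007C", "YAL010C"}
predicted_targets = {"YAL001C", "YAL003W", "YAL010C", "YAL012C"}

sens, spec = sensitivity_specificity(predicted_targets, true_targets)
```

Here three of the four true targets are recovered and three of the four predictions are correct, so both measures equal 0.75.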
“Is it worthwhile to model dependencies?”
Evaluation clearly supports this
What about the underlying biology?
(with Prof. Hanah Margalit, Hadassah Medical School)
Distance between dependent positions
Tree models learned from the aligned data sets
[Figure: histogram of the number of dependencies (0 to 50) by distance between the dependent positions (1 to 11), split by strength: Weak (< 0.3 bits), Medium (< 0.7 bits), Strong; annotation: < 1/3 of the dependencies]
Structural families
Dependency models vs. Profile on aligned data sets
[Figure: two panels of fold-change in likelihood (log scale, 0.5 to 128) per data set (x-axis 10 to 90); markers: Significant (paired t-test) / Not significant]
Conclusions
• Flexible framework for learning dependencies
• Dependencies are found in many cases
• It is worthwhile to model them: better learning and binding site prediction
Future work
• Link to the underlying structural biology
• Incorporate as part of other regulatory mechanism models
http://compbio.cs.huji.ac.il/TFBN