Modeling Dependencies in Protein-DNA Binding Sites

Yoseph Barash 1, Gal Elidan 1, Nir Friedman 1, Tommy Kaplan 1,2
1 School of Computer Science & Engineering
2 Hadassah Medical School
The Hebrew University, Jerusalem, Israel

Dependent positions in binding sites
  [Figure: a binding site within a gene promoter]
  Most approaches assume position independence

To model or not to model dependencies?
[Man & Stormo 2001, Bulyk et al, 2002, Benos et al, 2002]
  Pros: Biology suggests dependencies
    Single amino-acid interacts with two nucleotides
    Change in conformation of protein or DNA
  Cons: Modeling dependencies is harder
    Additional parameters
    Requires more data, not as robust

Data driven approach
  Can we learn dependencies from available genomic data? Yes
  Do dependency models perform better? Yes

Outline
  Flexible models of dependencies
  Learning from (un)aligned sequences
  Systematic evaluation
  Biological insights

How to model binding sites?
  Represent a distribution of binding sites: P(X_1, X_2, X_3, X_4, X_5) = ?
  Profile: Independency model
    P(X_1, \ldots, X_5) = P(X_1) P(X_2) P(X_3) P(X_4) P(X_5)
  Tree: Direct dependencies
    P(X_1, \ldots, X_5) = \prod_i P(X_i \mid X_{pa(i)}), where pa(i) is the parent of position i in the tree (the root has no parent)
  Mixture of Profiles: Global dependencies (T is the mixture component)
    P(T, X_1, \ldots, X_5) = P(T) \prod_i P(X_i \mid T)
  Mixture of Trees: Both types of dependencies
    P(T, X_1, \ldots, X_5) = P(T) \prod_i P(X_i \mid T, X_{pa(i)})

Learning models: Aligned binding sites
  Aligned binding sites:
    GCGGGGCCGGGC
    TGGGGGCGGGGT
    AGGGGGCGGGGG
    TAGGGGCCGGGC
    TGGGGGCGGGGT
    AAAGGGCCGGGC
    GGGAGGCCGGGA
    GCGGGGCGGGGC
    GAGGGGACGAGT
    CCGGGGCGGTCC
    ATGGGGCGGGGC
  Aligned binding sites -> Learning Machinery -> Models (Profile, Tree, Mixture of Profiles, Mixture of Trees)
  Select maximum likelihood model
  Learning based on methods for probabilistic graphical models (Bayesian networks)

Evaluation using aligned data
  95 TFs with ≥ 20 binding sites from TRANSFAC database [Wingender et al, 2001]
  Estimate generalization of each model:
    Test: how probable is the site given the model?
    Cross-validation: split the data set into a training set and a test set

    Site            Test log-likelihood
    GCGGGGCCGGGC    -20.34
    TGGGGGCGGGGT    -23.03
    AGGGGGCGGGGG    -21.31
    TAGGGGCCGGGC    -19.10
    TGGGGGCGGGGT    -18.42
    TGGGGGCCGGGC    -19.70
    ATGGGGCGGGGC    -22.39
    GTGGGGCGGGGC    -23.54
    ATGGGGCGGGGC    -22.39
    GTGGGGCGGGGC    -23.54
    GCGGGGCGGGGC    -18.07
    GAGGGGACGAGT    -19.18
    CCGGGGCGGTCC    -18.31
    ATGGGGCGGGGC    -21.43

    Test avg. LL = -20.77

Arabidopsis ABA binding factor 1
  [Figure: learned Profile, Tree (over positions X4 to X12), and Mixture of Profiles (components weighted 76% and 24%)]
  Profile: Test LL per instance -19.93
  Dependency models (Tree, Mixture of Profiles):
    Test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)
    Test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)

Likelihood improvement over profiles
  TRANSFAC: 95 aligned data sets
  [Figure: fold-change in likelihood (log scale, 0.5 to 128) for each data set; markers distinguish significant (paired t-test) from not significant]
  Significant improvement in generalization
  Data often exhibits dependencies

Evaluation for unaligned data
  Motif finding problem
    Input: A set of potentially co-regulated genes
    Output: A common motif in their promoters
  Sources of data:
    Gene annotation (e.g. Hughes et al, 2000)
    Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000)
    ChIP (e.g. Simon et al, 2001; Lee et al, 2002)

Learning models: unaligned data
  Use EM algorithm to simultaneously
    Identify binding site positions
    Learn a dependency model
  [Figure: Unaligned Data -> EM algorithm (alternating "Learn a model" and "Identify binding sites") -> Models]

ChIP location analysis [Lee et al, 2002]
  Yeast genome-wide location experiments
  Target genes for 106 TFs in 146 experiments

    Gene       ABF1 Targets   ...   ZAP1 Targets
    YAL001C    +              ...   –
    YAL002W    –              ...   +
    YAL003W    +              ...   –
    YAL005C    –              ...   –
    ...        ...            ...   ...
    YAL010C    +              ...   –
    YAL012C    –              ...   +
    YAL013W    –              ...   +
    YPR201W    –              ...   –
    (# genes ~ 6000)

Example: Models learned for ABF1 (YPD)
  Autonomously replicating sequence-binding factor 1
  [Figure: known profile (from TRANSFAC), learned profile, and learned Mixture of Profiles; figure labels: 43, 492]

Evaluating Performance
  Detect target genes on a genomic scale:
  [Figure: promoter sequence ACGTAT…AGGGATGC spanning positions -1000 to 0, with a site marked at -473]
  [Figure: p-value (10^-1 to 10^-8) of Profile and Mix of Trees predictions by promoter position (-180 to -60); threshold: Bonferroni corrected p-value ≤ 0.01; the biologically verified site is marked; example: Gal4 regulates Gal80]

Evaluation using ChIP location data [Lee et al, 2002]
  Evaluate using a 5-fold cross-validation test:

    Gene       Prediction   True   Outcome
    YAL001C    +            +      √
    YAL002W    –            –      √
    YAL003W    +            +      √
    YAL005C    –            –      √
    YAL007C    –            +      FN
    YAL008W    –            –      √
    YAL009W    –            –      √
    YAL010C    +            +      √
    YAL012C    +            –      FP
    YAL013W    –            –      √
    YPR201W    –            –      √

Example: ROC curve of HSF1
  [Figure: true positive rate (sensitivity, 0% to 90%) vs. false positive rate (0% to 5%) for Mixture of Trees, Mixture of Profiles, Tree, and Profile; annotation "~60 FP" near the 1% false positive rate]

Improvement in sensitivity & specificity
  105 unaligned data sets from Lee et al.
  Sensitivity = TP / True
  Specificity = TP / Predicted
  [Figures: three scatter plots of Δ specificity (-25 to 20) vs. Δ sensitivity (-20 to 60), each relative to Profile]
    Tree vs. Profile: counts shown on plot 3, 30, 15, 6
    Mixture of Profiles vs. Profile: counts shown on plot 0, 52, 18, 17
    Mixture of Trees vs. Profile: counts shown on plot 1, 84, 2, 16

"Is it worthwhile to model dependencies?"
  Evaluation clearly supports this
  What about the underlying biology?
  (with Prof. Hanah Margalit, Hadassah Medical School)

Distance between dependent positions
  Tree models learned from the aligned data sets
  [Figure: number of dependencies (0 to 50) by distance between positions (1 to 11), split into Weak (< 0.3 bits), Medium (< 0.7 bits), and Strong; annotation "< 1/3 of the dependencies"]

Structural families
  Dependency models vs. Profile on aligned data sets
  [Figure: fold-change in likelihood (log scale, 0.5 to 128) per data set; markers distinguish significant (paired t-test) from not significant]

Conclusions
  Flexible framework for learning dependencies
  Dependencies are found in many cases
  It is worthwhile to model them
  Better learning and binding site prediction

Future work
  Link to the underlying structural biology
  Incorporate as part of other regulatory mechanism models

http://compbio.cs.huji.ac.il/TFBN
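
The profile and tree models from the "How to model binding sites?" slide, and the held-out likelihood test from the "Evaluation using aligned data" slide, can be illustrated with a small sketch. This is not the authors' implementation. It assumes add-one pseudocounts, leave-one-out cross-validation, log base 2, and the standard Chow-Liu procedure (maximum spanning tree over pairwise mutual information) as a stand-in for the tree learner. All function names are hypothetical. The 11 sites are the ones shown on the "Learning models: Aligned binding sites" slide, far fewer than the ≥ 20 sites per factor used in the actual evaluation, so the numbers it prints are only illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

# Aligned sites copied from the "Learning models: Aligned binding sites" slide.
SITES = [
    "GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG", "TAGGGGCCGGGC",
    "TGGGGGCGGGGT", "AAAGGGCCGGGC", "GGGAGGCCGGGA", "GCGGGGCGGGGC",
    "GAGGGGACGAGT", "CCGGGGCGGTCC", "ATGGGGCGGGGC",
]


def fit_profile(sites, pseudo=1.0):
    """Profile model: one independent nucleotide distribution per position."""
    total = len(sites) + pseudo * len(ALPHABET)
    profile = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        profile.append({a: (counts[a] + pseudo) / total for a in ALPHABET})
    return profile


def profile_loglik(profile, site):
    """log2 P(site) under the profile model."""
    return sum(math.log2(profile[i][c]) for i, c in enumerate(site))


def mutual_information(sites, i, j, pseudo=1.0):
    """Smoothed mutual information (bits) between positions i and j."""
    n = len(sites) + pseudo * len(ALPHABET) ** 2
    joint = Counter((s[i], s[j]) for s in sites)
    pij = {(a, b): (joint[(a, b)] + pseudo) / n
           for a in ALPHABET for b in ALPHABET}
    pi = {a: sum(pij[(a, b)] for b in ALPHABET) for a in ALPHABET}
    pj = {b: sum(pij[(a, b)] for a in ALPHABET) for b in ALPHABET}
    return sum(p * math.log2(p / (pi[a] * pj[b])) for (a, b), p in pij.items())


def fit_tree(sites, pseudo=1.0):
    """Tree model via Chow-Liu (an assumption): grow a maximum spanning tree
    over pairwise mutual information, rooted at position 0, then estimate
    P(X_j | X_parent(j)) with pseudocounts."""
    length = len(sites[0])
    parent = {0: None}
    while len(parent) < length:
        # Prim's step: best edge from the current tree to a new position.
        i, j = max(((i, j) for i in parent
                    for j in range(length) if j not in parent),
                   key=lambda e: mutual_information(sites, e[0], e[1], pseudo))
        parent[j] = i
    cond = {}
    for j, i in parent.items():
        if i is None:
            counts = Counter(s[j] for s in sites)
            total = len(sites) + pseudo * len(ALPHABET)
            cond[j] = {a: (counts[a] + pseudo) / total for a in ALPHABET}
        else:
            joint = Counter((s[i], s[j]) for s in sites)
            cond[j] = {}
            for a in ALPHABET:
                total = sum(joint[(a, b)] for b in ALPHABET) + pseudo * len(ALPHABET)
                for b in ALPHABET:
                    cond[j][(a, b)] = (joint[(a, b)] + pseudo) / total
    return parent, cond


def tree_loglik(model, site):
    """log2 P(site) under the tree model."""
    parent, cond = model
    return sum(
        math.log2(cond[j][site[j] if i is None else (site[i], site[j])])
        for j, i in parent.items()
    )


def loo_avg_loglik(sites, fit, loglik):
    """Average test log-likelihood, holding out one site at a time."""
    scores = [loglik(fit(sites[:k] + sites[k + 1:]), held_out)
              for k, held_out in enumerate(sites)]
    return sum(scores) / len(scores)


profile_ll = loo_avg_loglik(SITES, fit_profile, profile_loglik)
tree_ll = loo_avg_loglik(SITES, fit_tree, tree_loglik)
print(f"Profile: test avg. LL = {profile_ll:.2f}")
print(f"Tree:    test avg. LL = {tree_ll:.2f}")
```

The difference `tree_ll - profile_ll` is in bits per site, so `2 ** (tree_ll - profile_ll)` gives the corresponding fold-change in likelihood. On a data set this small the tree can just as easily overfit as improve on the profile, which is one reason the evaluation in the slides restricts itself to factors with at least 20 known sites.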