Motif Finding Problem - Faculty of Engineering, HKU

Repeated (Conserved)
Patterns in Bioinformatics
Francis Y.L. Chin {钱玉麟教授}
Taikoo Chair of Engineering
Chair Professor of Computer Science
Associate Dean of Engineering
University of Hong Kong
March, 2010
Bioinformatics

Using laboratory experiments to understand
biological processes is difficult, laborious,
expensive and time-consuming.
Nowadays, large volumes of biological data
are available.
Bioinformatics aims to exploit this data to
understand biological processes through
computational approaches.
What are Repeated Patterns?

Repeated patterns are similar patterns that
appear many times, e.g., common or
conserved patterns.
Repeated patterns can be measured by the
probability of their occurrence in a random
environment (p-value)
Low p-value means “information or signal
bearing” and not “artifact”
High p-value implies “inconclusive”
Why Repeated Patterns?
Finding repeated patterns is important in
bioinformatics research

Sequence analysis
Analysis of mutations
Comparative genomics
Evolutionary biology
Biodiversity
Protein-protein interaction
Repeated (Conserved)
Patterns in DNA Sequences
The Central Dogma
[Figure: a gene on the DNA produces a protein]
http://research.microsoft.com/uai2004/Slides/FriedmanUAI2004-part-II.ppt
Binding Site
[Figure: a transcription factor (a protein) bound to a binding site on the DNA near a gene]
Identifying the TF binding sites on DNA sequence
is an important problem.
Binding Sites Identification
Find genes associated with the same TF.
Search for short similar patterns (motif), i.e., binding
sites, in the DNA sequences.
Example: TF GCN4 (promoter regions | gene)
TRP4  AGTTATGACTAATATT …TATCATGTCCGAGGCGACTTTG…
HIS7  AAAATTGAGTCATATC…GAGAATGCCGGTCGTTCACGTG…
ADE4  CCGAATGACTGCTCAT…AAAAATGTGTGGTATTTTAGGTA…
Example
>SPR3
……CTGGTCGTAATACAAATAGAAGAGGTAAACCAATCAATGGCCC
GTTAGTTTGCCATTTGCTGCATCCTTCCCATGCAAAGTGTCTT……
>COX6
……ACAGAAAATTCCAATCAAAAAGTTGGTGTTAGGCTATACTGAT
GGCCGTATCGCTCCATACGAGCCAATCAGGGCCCCGCGCGTTA……
>QCR8
……CCACGTGACTAGTCCAAGGATTTTTTTTAAGCCAATTAAAATG
AAGAAATGCGTGATCGGAAATTACGGGTAGTACGAGAAGGAAA……
>CYC1
……GGGCTTGATCCACCAACCAACGCTCGCCAAATGAACTGGCGCT
TTGGTCTTCTGCCATCGTCCGTAAACCCCTTCCAAAGAGACCG……
Hypothesis: The binding sites are short,
similar string patterns in each sequence.
Example
>SPR3
……CTGGTCGTAATACAAATAGAAGAGGTAAACCAATCAATGGCCC
GTTAGTTTGCCATTTGCTGCATCCTTCCCATGCAAAGTGTCTT……
>COX6
……ACAGAAAATTCCAATCAAAAAGTTGGTGTTAGGCTATACTGAT
GGCCGTATCGCTCCATACGAGCCAATCAGGGCCCCGCGCGTTA……
>QCR8
……CCACGTGACTAGTCCAAGGATTTTTTTTAAGCCAATTAAAATG
AAGAAATGCGTGATCGGAAATTACGGGTAGTACGAGAAGGAAA……
>CYC1(reverse)
……GGGCTTGATCCACCAACCAACGCTCGCCAAATGAACTGGCGCT
TTGGTCTTCTGCCATCGTCCGTAAACCCCTTCCAAAGAGACCG……
Each sequence contains a binding site that is similar to a
common pattern, CCAATCA, called the motif.
String Motif
GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC
C. elegans
Binding sites
GTTACCATGGTAAC – Consensus string (motif)
Motif represents the common pattern of binding sites
The binding sites are variants of the motif.
The consensus motif gives the minimum total number of
errors. Finding the motif with the minimum maximum error
is NP-complete (Li et al., JCSS 2002)
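The consensus string above can be reproduced column by column. The following is a minimal sketch, not the speaker's code: the function names are hypothetical, and ties within a column are broken alphabetically, which is an assumption (position 4 of these sites has A and T tied at two occurrences each).

```python
from collections import Counter

# The five C. elegans binding sites listed on the slide.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

def consensus(sites):
    """Most frequent letter per column; ties broken alphabetically (assumption)."""
    result = []
    for column in zip(*sites):
        counts = Counter(column)
        result.append(min(counts, key=lambda base: (-counts[base], base)))
    return "".join(result)

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

motif = consensus(SITES)
total_errors = sum(hamming(motif, s) for s in SITES)
max_error = max(hamming(motif, s) for s in SITES)
print(motif, total_errors, max_error)
```

The column-wise majority minimizes the total number of errors; minimizing the maximum error is the harder problem cited on the slide.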
Planted (l,d)-Motif Problem (PMP)
(Pevzner and Sze, ISMB 2000)
l = length of motif M
d = maximum Hamming distance between M and a binding site
Input:
T = t length-n sequences, each with at
least one binding site.
Problem:
Find M and the binding sites (sub-strings
within Hamming distance d from M)
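Once a candidate motif M is known, the binding sites in the problem statement are simply the length-l windows within Hamming distance d of M. A minimal sketch of that check follows; `find_sites` is a hypothetical helper name, and the two test strings are short fragments of the SPR3 and QCR8 sequences from the example slide. Finding M itself is the hard part and is not attempted here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_sites(sequence, motif, d):
    """Return (position, substring) for every window within distance d of motif."""
    l = len(motif)
    return [
        (i, sequence[i:i + l])
        for i in range(len(sequence) - l + 1)
        if hamming(sequence[i:i + l], motif) <= d
    ]

SPR3_FRAGMENT = "GGTAAACCAATCAATGGCCC"   # fragment of the SPR3 sequence
QCR8_FRAGMENT = "TTAAGCCAATTAAAATG"      # fragment of the QCR8 sequence

print(find_sites(SPR3_FRAGMENT, "CCAATCA", 0))
print(find_sites(QCR8_FRAGMENT, "CCAATCA", 1))
```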
Can Motif Always be Found?
Many methods (EM, Gibbs sampling, exhaustive search,
maximal clique, …) exist to find the motif.
The motif cannot be found when

Too few sequences/binding sites (t and n are too small)

Binding sites are “too” short (l is too small)

Binding sites vary too much (d is too large)
because the p-value of such similar patterns is too high,
i.e., more than one possible solution results.
Motif finding succeeds only when the “similar” patterns are “many”
and “long” (low p-values),
i.e., the probability that they occur by chance is very low.
Problems with String Motif
Binding sites:
GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC
GTTACCATGGTAAC is the motif (consensus string)
CTTACCATGGTAAC is not a binding site, although it differs from the motif at only one position
Hamming distance cannot model the real situation
Hypothesis: The binding sites are short,
similar string patterns in each sequence,
while some positions are conserved.
Motif represented by Matrix
Probability Matrix M
M(α,j) = probability that the jth position is α
Binding sites:
TCCATGG
CGCATGG
CCCATGC
GCCATGG
CCCTTGG
Matrix motif (PSSM)
     1    2    3    4    5    6    7
A    0    0    0   .8    0    0    0
C   .6   .8    1    0    0    0   .2
G   .2   .2    0    0    0    1   .8
T   .2    0    0   .2    1    0    0
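The matrix motif is just the per-position letter frequency of the five binding sites listed on this slide. A minimal sketch of that computation is shown below; the function name `probability_matrix` is hypothetical, and no pseudocounts are used because the slide's values have none.

```python
# The five binding sites listed on the slide.
SITES = ["TCCATGG", "CGCATGG", "CCCATGC", "GCCATGG", "CCCTTGG"]

def probability_matrix(sites, alphabet="ACGT"):
    """M[a][j] = fraction of sites whose (j+1)-th letter is a."""
    n = len(sites)
    length = len(sites[0])
    return {
        a: [sum(site[j] == a for site in sites) / n for j in range(length)]
        for a in alphabet
    }

PSSM = probability_matrix(SITES)
for base in "ACGT":
    print(base, PSSM[base])
```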
Problems about Matrix
Representation
Given the matrix, what are the binding sites?

(For the string representation: substrings within a given Hamming distance)
Which matrix can be the motif?

(For the string representation: the pattern that gives the largest number of
binding sites)
Motif Representation
Given matrix M, is σ = GGCTTGC a binding
site?
$p(M,\sigma) = \Pr(\sigma \text{ generated by } M) = \prod_{j=1}^{l} M(\sigma[j], j)$
     1    2    3    4    5    6    7
A    0    0    0   .8    0    0    0
C   .6   .8    1    0    0    0   .2
G   .2   .2    0    0    0    1   .8
T   .2    0    0   .2    1    0    0
p(M,“GGCTTGC”)
= 0.2 × 0.2 × 1 × 0.2
× 1 × 1 × 0.2
= 0.0016
Motif Representation
$p(B,\sigma) = \Pr(\sigma \text{ generated by background}) = \prod_{j=1}^{l} B(\sigma[j])$
B(A) = 0.2
B(C) = 0.3
B(G) = 0.3
B(T) = 0.2
p(B,“GGCTTGC”)
= 0.3 × 0.3 × 0.3 × 0.2
× 0.2 × 0.3 × 0.3
= 0.0000972
What are the Binding Sites?
σ is a binding site iff
$\log\left(\frac{p(M,\sigma)}{p(B,\sigma)}\right) \ge t$ (threshold)
Large ⇒ likely to be a binding site

Example: “GGCTTGC”
$\log\left(\frac{p(M,\text{GGCTTGC})}{p(B,\text{GGCTTGC})}\right) = \log\left(\frac{0.0016}{0.0000972}\right) \approx \log 16.5 \approx 1.218$ (a binding site if t = 1)
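The worked example can be checked numerically. The sketch below uses the probability matrix and background frequencies from the slides; the function names are hypothetical, and the base-10 logarithm is an assumption inferred from the slide's value of about 1.2. Without rounding the ratio to 16.5, the score comes out at roughly 1.216.

```python
import math

# Probability matrix M (rows: bases, columns: positions 1..7) from the slides.
M = {
    "A": [0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0],
    "C": [0.6, 0.8, 1.0, 0.0, 0.0, 0.0, 0.2],
    "G": [0.2, 0.2, 0.0, 0.0, 0.0, 1.0, 0.8],
    "T": [0.2, 0.0, 0.0, 0.2, 1.0, 0.0, 0.0],
}
# Background frequencies B from the slides.
B = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}

def p_motif(site):
    """Probability that the site is generated by M."""
    return math.prod(M[base][j] for j, base in enumerate(site))

def p_background(site):
    """Probability that the site is generated by the background."""
    return math.prod(B[base] for base in site)

def log_odds(site):
    """log10 of p(M, site) / p(B, site); minus infinity if M forbids the site."""
    pm = p_motif(site)
    return math.log10(pm / p_background(site)) if pm > 0 else float("-inf")

def is_binding_site(site, t):
    return log_odds(site) >= t

print(p_motif("GGCTTGC"), p_background("GGCTTGC"), log_odds("GGCTTGC"))
```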
Which is the Correct Matrix?
Each matrix M is given a score
(Information Content), IC(M)
The score increases with
Number of binding sites
Similarity with M, log(p(M,σ)/p(B,σ)) – t
The correct matrix M* has the maximum
score.
$IC(M) = \sum_{\sigma:\ \log(p(M,\sigma)/p(B,\sigma)) \ge t} \left( \log\left(\frac{p(M,\sigma)}{p(B,\sigma)}\right) - t \right)$
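A minimal sketch of the score IC(M) follows, reusing the matrix and background from the earlier slides. Two points are assumptions rather than the authors' stated method: the sum is taken over every length-l window of the input sequences, and the logarithm is base 10. The function names are hypothetical.

```python
import math

M = {
    "A": [0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0],
    "C": [0.6, 0.8, 1.0, 0.0, 0.0, 0.0, 0.2],
    "G": [0.2, 0.2, 0.0, 0.0, 0.0, 1.0, 0.8],
    "T": [0.2, 0.0, 0.0, 0.2, 1.0, 0.0, 0.0],
}
B = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}

def log_odds(site):
    """log10 of p(M, site) / p(B, site); minus infinity if M forbids the site."""
    pm = math.prod(M[base][j] for j, base in enumerate(site))
    pb = math.prod(B[base] for base in site)
    return math.log10(pm / pb) if pm > 0 else float("-inf")

def information_content(sequences, t, length=7):
    """Sum of (log-odds - t) over all windows whose log-odds is at least t."""
    total = 0.0
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            score = log_odds(seq[i:i + length])
            if score >= t:
                total += score - t
    return total

print(information_content(["AAGGCTTGCAA", "TTCCCATGGTT"], t=1))
```

More binding sites and higher similarity both raise the sum, which matches the two properties listed on the slide.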
Motif-Finding Problem
Input:

A set of sequences bound by a particular
transcription factor
Output:

A motif (probability matrix)
Positions of binding sites in each sequence
Leung and Chin, "Finding Exact Optimal Motif in Matrix
Representation by Partitioning", Bioinformatics, Vol 21,
Supp 2, ECCB/JBI, ii86-92 (September 2005)
False positives
The pattern “CGCGCG” appears many times in the
sequences, but its occurrences are not binding sites. Why?
Some genes are not regulated (they contain no
binding sites), yet their promoter regions also
contain many “CGCGCG” patterns
These sequences can be used as a negative set
(control set)
Hypothesis: Those sequences without
binding sites probably do not contain
any string patterns similar to the motif.
Generalized Motif-Finding Problem
Input:
T = sequences containing binding sites
F = sequences not containing binding sites
Output:
Motif M and the binding sites
Leung and Chin, "Finding Motifs from All
Sequences With and Without Binding Sites,"
Bioinformatics 2006
Color Ratio and Binding Energy
Microarray experiments can be used to indicate
gene expression which is measured by color
intensity (the probability of TF binding)
Hypothesis: Higher color intensity means
stronger binding.
The amount of binding energy (strong or weak)
can be estimated by the color ratio.
Hypothesis: Each binding site, depending
on its pattern, has different binding
energy with the TF.
Energy-based Model
Seq 1 to Seq 4: example sites and their binding energies
CCAGATGAGATG  -5.1
GACGATGAACGC  -4.6
AGTGCTGAGGCT  -4.8
CCACCAGCTATT  -0.5
Energy Matrix:
       1      2      3      4      5      6
A   -0.5   -0.7    0.5   -0.5   -0.6    0.1
C    0.3   -0.4   -0.1    0.3    0.1    0.1
G   -1.1    0.5    0.2   -0.8   -0.2   -0.4
T    0.1    0.8   -1.5   -0.2    0.3   -0.2
Problem with binding energy
Input:
A set of DNA sequences
The binding energy between TF and each
sequence (color intensity in microarray)
Output:
The motif (energy matrix) M which produces
the binding energy of each sequence
The binding sites are those patterns in
the sequence with the lowest energy.
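Scoring a site with an energy matrix is a sum of one entry per position, and the predicted binding site is the window with the lowest sum. The sketch below uses the energy matrix values from the Energy-based Model slide, read as rows for bases and columns for positions 1 to 6 (an interpretation of the flattened slide; it reproduces the four energies shown there for the windows GATGAG, GATGAA, GCTGAG and CCAGCT). The function names are hypothetical.

```python
# Energy matrix from the slide: ENERGY[base][j] is the energy contribution
# of `base` at position j + 1 of a length-6 site (row/column reading assumed).
ENERGY = {
    "A": [-0.5, -0.7, 0.5, -0.5, -0.6, 0.1],
    "C": [0.3, -0.4, -0.1, 0.3, 0.1, 0.1],
    "G": [-1.1, 0.5, 0.2, -0.8, -0.2, -0.4],
    "T": [0.1, 0.8, -1.5, -0.2, 0.3, -0.2],
}
SITE_LENGTH = 6

def site_energy(site):
    """Binding energy of one length-6 site: sum of per-position entries."""
    return sum(ENERGY[base][j] for j, base in enumerate(site))

def best_site(sequence):
    """Return (site, energy) for the lowest-energy window of the sequence."""
    windows = (
        sequence[i:i + SITE_LENGTH]
        for i in range(len(sequence) - SITE_LENGTH + 1)
    )
    return min(((w, site_energy(w)) for w in windows), key=lambda pair: pair[1])

print(best_site("CCAGATGAGATG"))
```

Learning the matrix from measured energies, which is the actual problem on the slide, is not covered by this sketch.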
Simulated Data
Results of the algorithms on simulated data for 200
sequences of length 700 where 10 of them contain B
binding sites of length 17 with expected likelihood -10
       Expected number    EBMF           AlignACE       MEME
       of matrices        Find?  rank    Find?  rank    Find?  rank
B=7    149475             yes    1       no     -       no     -
B=8    0.000439           yes    1       no     -       yes    1
B=9    7.7×10^-7          yes    1       yes    1       yes    1
Leung, Chin, Yiu, et al, "Finding Motifs with Insufficient Number
of Strong Binding Sites", Journal of Computational Biology,
2005, preliminary version appeared in RECOMB 2004.
Real Data
GAL4 (motif pattern is CGGN11CCG)
                                           EBMF           AlignACE       MEME
                                           Find?  rank    Find?  rank    Find?  rank
Top 100 sequences in the original data     yes    2       yes    1       yes    1
Top 100 except sequences 2, 3, 4 and 6     yes    1       no     -       no     -
Top 100 except sequences 1 to 6            yes    10      no     -       no     -
Top 100 except sequences 1 to 8            yes    5       no     -       no     -
Further Information about
Transcription Factors
Transcription factors are proteins with a 3D
structure and can be grouped into classes
Zinc finger
Leucine zipper
Helix-Turn-Helix
Their binding sites have different characteristics
Six Major Classes
Classes            Sub-Classes                  Characteristics                        Freq
Zinc Finger        I. Cys2His2                  G..G | G..G..G | [CG]..[CG]..[CG]      13%
                   II. Cys4                     AGGTCA | TGACCT                        13%
Leucine zipper     III. bZip                    TGA.*TCA                               23%
                   IV. bHLH                     CA..TG                                 3%
Helix-Turn-Helix   V. Homeodomain               TAAT | ATTA                            11%
                   VI. Others (e.g. Forkhead)   unknown                                33%
Guess the motif class based on the characteristics
of binding sites
Search motifs in that class by modifying their
likelihood accordingly
3D Binding Domains
and DNA Binding Sites
TFs - classified by 3D binding domains.
3D binding domains and DNA binding sites are
related
DNA binding sites - classified accordingly
Hypothesis: Most transcription factors can be
classified into a few protein structures and
binding domains, most binding sites should
have a few patterns.
Leung and Chin, "Discovering Motifs with Transcription
Factor Domain Knowledge", PSB2007
Experimental Results
Number of motifs with known sites discovered
MEME / DIMDom

MEME: 38
DIMDom: 47
Average accuracy

MEME: 0.3141
DIMDom: 0.4471
Higher accuracy

MEME: 9 data sets
DIMDom: 26 data sets
same accuracy: 3 data sets
Protein Sequences:
A Motif-Pair for Binding
Leung, Siu, Yiu, Chin, Sung, "Finding Linear
Motif Pair from Protein Interaction Networks:
A Probabilistic Approach", CSB2007
Protein-Protein Interaction
(PPI) Network
Vertex

Protein sequence
Edge (interaction)
between u and v

Protein u and protein v
can bind together
Many interactions are
missing and erroneous
Motif Pairs for Protein Interaction
Problem: When two proteins interact, where
are their binding sites (domains)?
Binding sites and Motif
PTLPPR
PIKPPR
GLFPSNY
PTAPQR
GFIPGNY
PTLPSR
GVFPGNY
PPLPNR
GIFPLNY
PPLPTR
Interaction
1) Different proteins may
have similar binding sites.
2) The proteins that these
proteins bind to also contain
another set of similar
binding sites.
Hypothesis
If M and M’ are two motifs representing
two sets of real binding sites that
interact,
we expect that the sequences containing
instances of the corresponding motifs
should have many interactions.
The Motif Pair Finding Problem
Input: protein-protein interaction network
Problem: To find a pair of motifs (M1, M2) such that
sequences containing M1 and sequences containing M2
have unexpectedly large number of interactions.
[Figure: two candidate motif pairs; one is not a real motif pair, the other may be a real motif pair]
Another Problem related to PPI:
Predicting Protein Complexes
Leung, Xiang, Yiu and Chin, "Predicting Protein
Complexes from PPI Data: A Core-Attachment
Approach", Journal of Computational Biology,
2009. Preliminary version presented in
RECOMB Satellite 2008.
Protein Complex
Protein Complex

A group of proteins that bind together
Represented as a connected
subgraph in PPI network
Hypothesis: Proteins in the same complex
have more interactions among them.
Problem: To predict protein complexes
from PPI network
Protein complexes (dense subgraphs)
Heuristics to Find Dense Sub-graphs
Markov Cluster (MCL)
(Enright et al., Nucl Acids Res 2002)
Uses random walks in the graph: a walk tends to be
trapped inside a cluster and rarely leaves for another cluster
Molecular Complex Detection (MCODE)
(Bader and Hogue, BMC Bioinf 2003)
Start with high-degree vertices and recursively merge with
neighbors to ensure its density is above a given threshold
CFinder (Adamcsek et al., Bioinformatics 2006)
Locate overlapping cliques by merging two k-cliques
if they share k-1 nodes
Core and Attachment
(Leung et al., JCB 2009)
Based on biological information that each complex has a core.
Difficulties
Many interactions are missing
The protein complexes may not necessarily
be dense subgraphs, especially when the
complexes are large
Some proteins are present in multiple complexes
Structure of Protein Complex
Attachments
Core

Proteins in a complex consist of [Gavin et al. 2006]

Modules
Core proteins
Attachments
Each protein complex has a unique set of core proteins
Core Proteins
Relatively more interactions among themselves

Dense subgraphs
Attachments bind to core protein to form
complexes

Attachments are neighbors of cores
Each protein complex has a unique set of core
proteins

Cores are disjoint
Core proteins are not present in other complexes
Experimental Results on Cores

Compare with Mediator [Andreopoulos et al.
2007] on Gavin dataset
# of correct cores   acc ≥ 0.4   acc ≥ 0.6   acc ≥ 0.8
Mediator             29          8           0
Ours                 267         169         103
Experiments on Complexes

Compare with 3 methods (MCL, MCODE
and CFinder) on 3 datasets
Datasets   Number of proteins   Number of interactions   Average degree
DIP        4,928                17,201                   6.98
Krogan     2,675                7,080                    5.29
Gavin      1,430                6,531                    9.13
Comparison with MCL
# of correct complexes (MCL / Ours)
          acc ≥ 0.6   acc ≥ 0.7   acc ≥ 0.8
DIP       30 / 36     20 / 26     10 / 15
Krogan    28 / 37     16 / 21     7 / 11
Gavin     32 / 35     26 / 29     11 / 17
Comparison with MCODE
# of correct complexes (MCODE / Ours)
          acc ≥ 0.6   acc ≥ 0.7   acc ≥ 0.8
DIP       17 / 29     13 / 22     7 / 13
Krogan    6 / 24      5 / 16      2 / 8
Gavin     23 / 32     19 / 27     8 / 17
Comparison with CFinder
# of correct complexes (CFinder / Ours)
          acc ≥ 0.6   acc ≥ 0.7   acc ≥ 0.8
DIP       19 / 28     14 / 22     10 / 13
Krogan    28 / 28     13 / 19     5 / 11
Gavin     25 / 29     20 / 26     13 / 16
Different Biological Processes
between Two Groups of Species
Given two groups of species, each with a
metabolic network
Find those reactions or metabolic pathways
which belong to most of the networks in one
group but not in the other.
Hypothesis: There must exist a set of
reactions or metabolic pathways in one
group of species but not in the other.
Repeated Patterns Make
Genome Assembly Difficult
Peng, Leung, Yiu and Chin, “IDBA - Iterative
de Bruijn Graph de Novo Assembler”,
RECOMB 2010 (to appear).
De novo Assembling
Genome with unknown sequence
Sequencing
Read (45 – 140bp)
Assembling
Genome with known sequence
However, there are many problems
Problems
Error in reads
1-2% error rate per nucleotide
e.g. 1% error rate, 75bp read length:
1 – (1 – 1%)^75 ≈ 53% of reads have an error
Gap
Positions covered by no read
Repeat
If the length of a repeat ≥ read length, it is
impossible to assemble
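The 53% figure follows from assuming that each nucleotide is wrong independently with the same probability. A minimal sketch of the arithmetic is given below; the function name is hypothetical and the independence assumption is a simplification.

```python
def fraction_reads_with_error(error_rate, read_length):
    """Probability that a read has at least one erroneous base,
    assuming independent per-base errors at a constant rate."""
    return 1 - (1 - error_rate) ** read_length

# 1% per-base error rate, 75 bp reads, as on the slide.
print(round(fraction_reads_with_error(0.01, 75), 3))
```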
Repeats in E. coli
Length   # of repeats
30       3899
40       2784
50       2248
100      1074
200      536
300      345
500      200
1000     101
De novo Assembling
[Figure: a genome containing repeats is sequenced into reads with errors and gaps; assembling the reads gives contigs, with new gaps]
Assembling Problem

Input:

A set of reads from a genome
Objective:

Construct contigs of the genome

Accuracy (>99.9%)
Coverage
Length: N50 (length of the shortest contig in
a set of the longest contigs covering ≥ 50% of the genome)
Existing Approaches

Greedy

SSAKE (Warren, 2007), SHARCGS (Dohm,
2007)
Contig: extended while the overlap with the next read is ≥ k
Stop when overlap < k (favors small k)
Stop at an error or repeat (favors large k)
Works well on high-coverage, error-free data
String Graph

Edena (Hernandez, 2008)
Vertex: read
Edge: overlap ≥ k
Simple path: contig
Ex: k = 5
Reads: GTACTCT, TACTCTA, CTCTAGC, CTAGCTG, AGCTGCT, CTGCTCC
Dead-end read: TAGCTGA
Contig: GTACTCTAGCTGCTCC…
Handles some errors (dead-ends), but not other errors
Large graph (requires memory)
Works on high-coverage, error-free data
De Bruijn Graph

Velvet (Zerbino, 2008)
Vertex: k-mer in read (length-k substring)
Edge: overlap of k – 1 bp
Simple path: contig
Ex: k = 5
Reads: GTACTCT, TACTCTA, CTCTAGC, CTAGCTG
k-mers: GTACT, TACTC, ACTCT, CTCTA, TCTAG, CTAGC, TAGCT, AGCTG
Advantage:

Gain information in error reads
Better filtering
Filtering

Sequencing depth: 75
Read length: 75bp
String graph

correct read appears ~1 time
error read appears ~1 time
De Bruijn Graph

correct 30-mer appears ~45 times
error 30-mer appears ~1 – 2 times
Can filter error reads easily
Reads
ACGCCATCACGTTC
CGCCTTCACGTTCT
CCATCACGTTCTCA
CCATCACGCTCTCA
CATCACGTTCTCAG
Correct 7-mers

CATCACG: 4 times
ATCACGT: 3 times
TCACGTT: 4 times
Error 7-mers

CTTCACG: 1 time
TTCACGT: 1 time
ATCACGC: 1 time
TCACGCT: 1 time
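The 7-mer counts on this slide can be reproduced by counting every length-k substring of the five reads. The sketch below is illustrative only: the function name is hypothetical, and the filtering threshold of 2 occurrences is an assumption chosen for this toy example, not a value from the talk.

```python
from collections import Counter

# The five reads shown on the slide.
READS = [
    "ACGCCATCACGTTC",
    "CGCCTTCACGTTCT",
    "CCATCACGTTCTCA",
    "CCATCACGCTCTCA",
    "CATCACGTTCTCAG",
]

def kmer_counts(reads, k):
    """Count every length-k substring over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

counts = kmer_counts(READS, 7)

# Keep k-mers seen at least twice (hypothetical threshold for this example).
MIN_COUNT = 2
solid_kmers = {kmer for kmer, c in counts.items() if c >= MIN_COUNT}

for kmer in ["CATCACG", "ATCACGT", "TCACGTT", "CTTCACG", "ATCACGC"]:
    print(kmer, counts[kmer], kmer in solid_kmers)
```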
Problem of Existing Algorithms
Find Suitable k
Sequence: GTACTACTATGC
Reads: GTACTAC, TACTACT, ACTACTA, CTACTAT, ACTATGC
k = 5: GTACT TACTA ACTAC CTACT ACTAT CTATG TATGC (branch problem)
k = 6: GTACTA TACTAC ACTACT CTACTA TACTAT ACTATG CTATGC (gap problem)
No suitable k

                                             Small k   Large k   Target
Branch problem (due to repeat / error)       More      Less      Less
Gap problem (due to low coverage / error)    Less      More      Less

Partly solved by selecting a suitable k, but short contigs are still produced
Remark: these problems appear in greedy and string graph
algorithms too
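The branch and gap problems can be seen by building the de Bruijn graph of the five reads for k = 5 and k = 6. The sketch below makes one assumption that the slides do not state: an edge between two k-mers is added only when the corresponding (k+1)-mer actually occurs in some read. All function names are hypothetical.

```python
from collections import defaultdict

# Reads from the slide; the read TACTATG is missing, which models low coverage.
READS = ["GTACTAC", "TACTACT", "ACTACTA", "CTACTAT", "ACTATGC"]

def de_bruijn(reads, k):
    """Nodes are k-mers; edge u -> v exists if the (k+1)-mer u + v[-1]
    occurs in a read (assumed edge rule)."""
    nodes = set()
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
        for i in range(len(read) - k):
            edges[read[i:i + k]].add(read[i + 1:i + k + 1])
    return nodes, edges

def branch_nodes(nodes, edges):
    """k-mers with more than one successor."""
    return {n for n in nodes if len(edges.get(n, ())) > 1}

def dead_end_nodes(nodes, edges):
    """k-mers with no successor."""
    return {n for n in nodes if len(edges.get(n, ())) == 0}

for k in (5, 6):
    nodes, edges = de_bruijn(READS, k)
    print(k, sorted(branch_nodes(nodes, edges)), sorted(dead_end_nodes(nodes, edges)))
```

With k = 5 the repeated k-mer TACTA has two successors (a branch). With k = 6 the branch disappears, but TACTAT has no successor even though the sequence continues (a gap).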
Iterative de Bruijn Graph
Assembler
Construct an algorithm that achieves a better
result than any single choice of k.
Start with small k:

fewer gaps but more branches
many short contigs
Iterate to higher k
to resolve branches
to get longer contigs by merging.
Simulated Data

Genome

Escherichia coli (O157:H7 str. EC4115)
5.6 million bp
Sequencing depth: 30x
Read length: 75bp
Error rate: 1%
Paired-end

Insert distance: 250bp
Requirements for contigs

Accuracy > 99.9%
Length > 100bp
Comparison (Simulated)
N50: the largest length N such that the contigs of length
at least N together cover at least 50% of the genome
E. coli, L = 75, depth = 30, error rate = 1%
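N50 can be computed by sorting the contigs from longest to shortest and accumulating lengths until half of the genome is covered. The sketch below follows the genome-based definition used in this talk; the function name and the toy contig lengths are hypothetical.

```python
def n50(contig_lengths, genome_length):
    """Length of the contig at which the running total of the longest
    contigs first reaches half of the genome length; 0 if never reached."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_length:
            return length
    return 0

# Toy example: contigs of a hypothetical 290 bp genome.
print(n50([80, 70, 50, 40, 30, 20], 290))
```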
Other Statistics
                                 Contigs
          time    Mem     k       #      N50     max len.   avg. len.   cov.    wrong #   wrong len.
velvet    155s    1641M   30      1369   19284   96905      2652        94.6%   19        9813
edena     957s    678M            4672   5104    46908      900         97.2%   650       72019
abyss     1113s   1749M   40      1390   22109   87118      2966        95.1%   66        34998
IDBA      370s    360M    25-50   1550   63218   217365     2210        97.5%   10        3935
optimal                           1561   63218   217365     2051        99.1%
The slide also shows the values 40 and 50 without a clear position; they may be the k values for edena and optimal.
Other Setting
N50 under other settings
                            High coverage    Low coverage     High coverage     Low coverage
                            Low error rate   Low error rate   High error rate   High error rate
                            (100x, 0.5%)     (30x, 1%)        (100x, 2%)        (30x, 2%)
Edena (string graph)        63256            5104             53491             147
Velvet (de Bruijn graph)    63214            24772            59285             16527
Abyss (de Bruijn graph)     58678            22109            50009             10992
IDBA (our algorithm)        63218            63218            59287             32612
Real Biological Data

Genome

Bacillus Subtilis
Sequencing depth: ~45x
Read length: 75bp
Error rate: ~1%
Requirement for contigs

Length > 100bp
Comparison (Real)
Bacillus Subtilis, L = 75, depth = 45, error rate = 1%
Other Statistics
                                 Contigs
          time    Mem     k       #     N50      max len.   avg. len.   total len.
velvet    89s     893M    35      476   35136    164023     8580        4.08M
edena     649s    632M    40      926   19423    66455      4444        4.11M
abyss     729s    923M    45      445   30081    134067     9184        4.09M
IDBA      313s    310M    25-50   283   122574   602412     14489       4.1M
Conclusions

IDBA outperforms existing assembly
algorithms in

Contig length
Accuracy
Main Idea

Using multiple values of k
Guarantee better results by increasing values of k
Can be downloaded at

http://www.cs.hku.hk/~alse/idba/
Future Works

Develop more effective algorithm for
paired-end reads
Assembling transcriptome (RNA)

Different expression level
Alternative splicing
Metagenomic sequencing

Mixed genomes with different coverages
Thanks and Questions
