Macromolecular Sequence Analysis
Finding motifs in sequence
Pattern matching
Didier Gonze – 13/10/2015
Sequence motifs
What are sequence motifs?
Sequence motifs are short, conserved elements of a sequence alignment. They can be a short stretch of contiguous residues or a more distributed pattern. Functionally related sequences share similar distribution patterns of critical functional residues that are not necessarily contiguous.
[Figure: a contiguous motif and a "distributed" motif]
For example, conserved amino acid residues that make up an enzyme's active site may be distant from each other in the protein sequence, but they will still occur in a recognizable pattern because they must come together in a particular spatial configuration to form the active site in the 3D structure.
Zvelebil & Baum, 2008
DNA sequence motifs
What are DNA sequence motifs?
DNA sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases, transcription factors, or restriction enzymes. Others are involved in important processes at the RNA level, including ribosome binding, mRNA processing (splicing, polyadenylation) and transcription termination.
Examples:
- The restriction enzyme EcoRI specifically cuts DNA at instances of GAATTC in E. coli.
- The Pho4p transcription factor binds specifically to the CACGTG DNA sequence in yeast.
- The E2F transcription factor binds the TTTCGCGC DNA sequence in eukaryotes.
- The TATA box (TATAAA) is recognised by TBP (TATA-box binding protein) in eukaryotes.
D'haeseleer (2006) Nat. Biotech. 24: 423-425.
Regulatory binding sites may be organized into modules (CRM = cis-regulatory modules).
Source: M. Thomas-Chollier
RNA sequence motifs
What are RNA sequence motifs?
RNA motifs include:
• structural motifs: they can correspond to loops, hairpins, or junctions. RNA motifs allow the specific interactions that maintain the compact folding of RNAs.
• specific protein or ligand binding sites.
• splicing sites (donor + acceptor sites, in mRNA).
• motifs involved in transcriptional regulation (riboswitches in mRNA).
Examples:
- The Shine-Dalgarno sequence AGGAGGU indicates the ribosome binding site on mRNA.
- The motifs AGUAAGU ... CAGG are common donor/acceptor splice sites in eukaryotes.
- The polyadenylation sequence AAUAAA is recognized by a protein complex (CPSF) to start polyadenylation.
NB: Such sequences can often be detected already at the level of DNA, but not always (e.g. the poly-A tail is added after transcription by a dedicated protein machinery involving CPSF, and is not encoded in the genome).
Protein sequence motifs
What are protein sequence motifs?
Protein sequence motifs are short, recurring patterns in proteins that are presumed to have a biological function or a particular structure. They can be responsible for protein-protein interactions, act as nuclear localisation signals (NLS), or constitute the active site of an enzyme.
Examples:
- The sequence PKKKRKV is a nuclear localisation signal (NLS) specifically recognized by importin α.
- MLSLRGSIRFFLPATATLCSSAYLL is a signal peptide for transport to the mitochondria.
- C2H2 Zn-finger transcription factors share the consensus sequence C-N4-C-N12-H-N3-H.
- Many SH3-binding epitopes of proteins (part of the protein recognized by the immune system, e.g. antibody or T-cell) have the consensus sequence XPpXP (where X stands for an aliphatic amino acid, P for proline, and the lowercase p indicates a position that is often a proline).
Protein sequence motifs
Motif vs domain vs fingerprint
Structural bioinformaticians often distinguish between:
Motif: any conserved sequence.
Example: HLH transcription factor, DNA binding domain (in color: HLH motif).
Domain: a conserved sequence that can be extracted from the whole protein sequence and still form a correct fold. It is characterized by a particular 3D structure. In contrast to domains, motifs may not be stable when they are extracted from the protein sequence. For example, a protein can have a DNA binding domain with a helix-turn-helix motif.
Fingerprint (or Print): a conserved motif (or group of motifs) used to characterize a protein family.
In the following, we will describe tools to detect conserved motifs in general (without distinction between the various types of motifs).
Finding motifs in sequences
Why find motifs in DNA and protein sequences?
Clearly, identifying patterns in DNA and protein sequences provides important clues about their possible regulation, function, or structure. Identifying particular DNA motifs will also be crucial in deciphering genomic sequences (cf. gene prediction).
Starting from a completely sequenced genome, different questions can be addressed by bioinformatic approaches, depending on the information at hand:
§ You want to find out if your favorite genes possess a particular regulatory element in their promoter.
§ You have identified a set of co-expressed genes and you want to know if they are co-regulated by a common transcription factor.
§ You have a collection of patterns and you want to find the genes that may contain them (i.e. possible targets of a given transcription factor).
Finding motifs in sequences
Pattern matching vs pattern discovery
Different approaches are used depending on the question:
§ You know the genes, you know the pattern ... and you want to know if the genes have the given pattern ⇒ pattern matching
§ You know the genes, you don't know the patterns ... and you want to detect possible patterns common to all genes ⇒ pattern discovery or pattern matching with a library
§ You know the patterns, you don't know the genes ... and you want to detect possible genes having a given pattern ⇒ classification
Source: Jacques van Helden
Finding motifs in sequences
Pattern matching vs pattern discovery
Source: M. Thomas-Chollier
Finding motifs in sequences
Computational detection of TFBS: Challenges
Source: M. Thomas-Chollier
Finding motifs in sequences
Computational detection of TFBS: Characteristics
Source: M. Thomas-Chollier
Representation of sequence motifs
How to represent a sequence motif?
In some cases, a pattern can be represented by a single short string of nucleotides.
Example: GAATTC
Restriction enzymes need to bind to their DNA targets in a highly sequence-specific manner, because they are part of a primitive bacterial immune system designed to chop up viral DNA from infecting phages. Straying from their consensus binding site specificity would be the equivalent of an autoimmune reaction that could lead to irreversible damage to the bacterial genome. For example, EcoRI binds to the 6-mer GAATTC, and only to that sequence. Note that this motif is a palindrome, reflecting the fact that the EcoRI protein binds to the DNA as a homodimer. Unfortunately, this representation is very restrictive. Indeed, most of the time, variability in the pattern is allowed...
D'haeseleer (2006) Nat. Biotech. 24: 423-425.
Representation of sequence motifs
How to account for the variability of the patterns?
Example: HindII binds to the sequences GTYRAC, where Y stands for ‘C or T’ (pYrimidine), and R stands for ‘A or G’ (puRine).
The variability allowed in the patterns makes the representation and the identification of patterns challenging.
We can calculate how often we would expect these consensus sequences to occur, based on their length and degeneracy. The probability that a random 6-mer matches the EcoRI binding site is $(1/4)^6$, so the site occurs about once every $4^6$ (= 4,096) bp in a random DNA sequence. The HindII binding site, containing two positions where two out of four bases can match, would occur once per $4^4 \times 2^2$ (= 1,024) bp.
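The arithmetic above can be sketched in a few lines of Python. This is only an illustration under the same assumption as the text (a random sequence with equiprobable bases); the names `IUPAC_DEGENERACY`, `match_probability` and `expected_spacing` are chosen here for the example and do not come from any library.

```python
# Number of bases matched by each IUPAC nucleotide code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def match_probability(consensus):
    """Probability that a random site matches the consensus,
    assuming the four bases are equiprobable and independent."""
    p = 1.0
    for symbol in consensus.upper():
        p *= IUPAC_DEGENERACY[symbol] / 4
    return p


def expected_spacing(consensus):
    """Average number of bp between two matches in a random sequence."""
    return 1 / match_probability(consensus)


print(expected_spacing("GAATTC"))  # EcoRI: 4096.0
print(expected_spacing("GTYRAC"))  # HindII: 1024.0
```

With a biased base composition, the factor 1/4 would have to be replaced by the background probability of each base.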
D'haeseleer (2006) Nat. Biotech. 24: 423-425.
Representation of sequence motifs
Three common motif representations are:
§ Consensus sequence / regular expressions
§ Profile matrices (PWM, PSSM)
§ Hidden Markov models (HMM)
Depending on the representation chosen, several tools for pattern matching have been designed. They all have advantages, drawbacks, and limitations.
Consensus sequence
A consensus sequence is a string that best summarizes the pattern. Each letter of the consensus is the letter that has the highest probability of being found at that position. A consensus is usually built from the multiple alignment of the individual pattern sequences.
Example:
DNA motif (alignment):
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A C C G - - A T C
Consensus sequence:
A C A C - - A T C
Consensus sequence
Algorithm: Principle
Source: M. Thomas-Chollier
Consensus sequence
Algorithm: Remarks
Source: M. Thomas-Chollier
Consensus sequence
The IUPAC code can be used to refine the consensus sequence: it provides letters that stand for 2 or 3 possible nucleotides at a given position. For example, the letter Y means either C or T (pYrimidine).
IUPAC code
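As a small illustration of how the IUPAC code is used in practice, the sketch below translates an IUPAC consensus into an equivalent regular expression and searches a sequence with it. The table lists the standard IUPAC nucleotide codes; the function name `iupac_to_regex` and the test sequence are made up for this example.

```python
import re

# Standard IUPAC nucleotide codes, written as regular-expression character classes.
IUPAC_TO_CLASS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus such as GTYRAC into a regular expression."""
    return "".join(IUPAC_TO_CLASS[symbol] for symbol in consensus.upper())


pattern = re.compile(iupac_to_regex("GTYRAC"))   # HindII site
sequence = "AAGTCGACTTGTTAACGG"                  # hypothetical test sequence
hits = [match.start() + 1 for match in pattern.finditer(sequence)]
print(iupac_to_regex("GTYRAC"))  # GT[CT][AG]AC
print(hits)                      # 1-based start positions: [3, 11]
```

Note that `finditer` reports non-overlapping matches only; overlapping sites would require a lookahead or a sliding window.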
Consensus sequence
Example:
ROX1 is a transcription factor in S. cerevisiae involved in the regulation of heme-repressed and heme-induced genes.
ROX1 binding sites
Consensus sequence
Such a consensus sequence can be used to scan any DNA sequence for strings that match it. A certain number of mismatches may also be allowed.
D'haeseleer (2006) Nat. Biotech. 24: 423-425.
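Allowing mismatches can be sketched as a sliding window that counts the differences between each window and the consensus. The example below is a minimal illustration, not the method of a particular tool; it uses the Pho4p site CACGTG mentioned earlier, and the scanned sequence is invented for the example.

```python
def count_mismatches(site, consensus):
    """Number of positions at which two strings of equal length differ."""
    return sum(1 for a, b in zip(site, consensus) if a != b)


def find_with_mismatches(sequence, consensus, max_mismatches=1):
    """Return (1-based position, site, mismatches) for every window of the
    sequence that differs from the consensus at no more than max_mismatches positions."""
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        mismatches = count_mismatches(site, consensus)
        if mismatches <= max_mismatches:
            hits.append((start + 1, site, mismatches))
    return hits


# Hypothetical sequence containing one exact and one degenerate CACGTG site.
print(find_with_mismatches("TTCACGTGAACACGTTTT", "CACGTG", max_mismatches=1))
# [(3, 'CACGTG', 0), (11, 'CACGTT', 1)]
```

The more mismatches are tolerated, the more spurious matches are expected, which is one motivation for the scoring matrices described later.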
Regular expression
A regular expression (regexp) is a notation that describes a set of strings. It is an alternative representation of the consensus sequence, which is more general and more flexible.
Example
DNA motif (alignment):
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A C C G - - A T C
Regular expression:
[AT]C[AC][ACGT]*A[TG][GC]
where [AT] means either A or T, and [ACGT]* means either A, C, G or T, any number of times.
Regular expression
Some programming languages, such as Perl, have very efficient regular expression functions.
A regular expression (regexp) matches a text string if some substring of the text matches the pattern. A regexp evaluates to true for the strings it matches and to false for the strings it doesn't. Evaluation of a regular expression is called pattern matching.
Examples
Find the consensus [AT]C[AC][ACGT]*A[TG][GC] in a given sequence:
    # $sequence is assumed to hold the DNA sequence to search
    my $pattern = qr/[AT]C[AC][ACGT]*A[TG][GC]/;   # precompiled regular expression
    if ($sequence =~ m/$pattern/) {
        print "pattern found!\n";
    }
Find all occurrences of ATG (start codon) in a given sequence and print the positions of these occurrences:
    my $start = "ATG";
    my $i = 0;                          # number of start codons found
    while ($sequence =~ m/$start/g) {
        $i++;
        # pos() is the offset just after the match; subtracting 2
        # gives the 1-based position of the A of ATG
        my $ici = pos($sequence) - 2;
        print "start codon found at $ici\n";
    }
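For readers more familiar with Python, the same two searches can be sketched with the standard `re` module. The sequence below is a made-up example; positions are reported 1-based, as in the Perl version.

```python
import re

sequence = "GGATGACACTTATCCCATGA"   # hypothetical example sequence

# 1) Is the consensus [AT]C[AC][ACGT]*A[TG][GC] present in the sequence?
motif = re.compile(r"[AT]C[AC][ACGT]*A[TG][GC]")
found = motif.search(sequence) is not None
if found:
    print("pattern found!")

# 2) Positions (1-based) of all occurrences of the start codon ATG.
starts = [match.start() + 1 for match in re.finditer("ATG", sequence)]
for position in starts:
    print(f"start codon found at {position}")
```

`re.finditer` returns non-overlapping matches, which is sufficient here because ATG cannot overlap with itself.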
Regular expression
Example: Abf1p transcription factor binding site found in YEASTRACT
For each yeast transcription factor, one or
several regular expressions are given
(depending on the source). Some allow more
variability in the sequence than others.
http://www.yeastract.com/tflist.php
http://www.yeastract.com/help/help_searchdnamotifintfbs.php
Regular expression
Example: C2H2 Zinc-finger motif
Regular expression (as found in PROSITE) :
C-x(2,4)-C-x(12)-H-x(3)-H
C = cysteine
H = histidine
x(i) = any aa (i times)
http://www.expasy.ch/cgi-bin/prosite-search-ac?PDOC00028
Transcription factor Sp1 binding to DNA
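A PROSITE pattern such as the one above can be converted into an ordinary regular expression. The sketch below handles only a subset of the PROSITE syntax (single residues, `x`, `[...]`, `{...}` and repetition counts); the function name `prosite_to_regex` and the test peptide are assumptions made for the example, not part of PROSITE.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, optionally
# followed by a repetition count (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element}")
        residue, low, high = match.groups()
        if residue == "x":                      # any amino acid
            residue = "."
        elif residue.startswith("{"):           # any amino acid except those listed
            residue = "[^" + residue[1:-1] + "]"
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(residue + repeat)
    return "".join(parts)


zinc_finger = prosite_to_regex("C-x(2,4)-C-x(12)-H-x(3)-H")
print(zinc_finger)  # C.{2,4}C.{12}H.{3}H

# Hypothetical peptide built to contain one match starting at position 3.
peptide = "MK" + "C" + "AA" + "C" + "G" * 12 + "H" + "GGG" + "H" + "K"
hit = re.search(zinc_finger, peptide)
print(hit.start() + 1)  # 3
```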
Regular expression
Regular expressions do not capture the statistics of the variation in
sequence patterns - they just tell what letters are permissible at each
position in the pattern.
DNA motif (alignment):
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A C C G - - A T C
Regular expression: [AT]C[AC][ACGT]*A[TG][GC]
Consensus sequence: A C A C - - A T C
Improbable sequence: T C C T - - A G G
The consensus sequence and the improbable sequence are not distinguished by the regular expression.
Profile matrices
Position specific scoring matrices (PSSM)
Profiles capture the frequency of each letter at each position in the
pattern so you can tell how well a potential site matches the pattern
(the probability of the site).
Starting from a multiple alignment, one can build a matrix which
reflects the preferred residues at each position:
§ Each column represents a position
§ Each row represents a residue
(20 rows for proteins, 4 rows for DNA)
§ The cells indicate the frequency of each residue at each position
of the multiple alignment.
Hertz & Stormo (1999) Bioinformatics 15: 563-577.
Profile matrices (PFM)
Multiple alignment
                   1 2 3 4 5 6 7 8
Site (1)           A G A T C C A T
Site (2)           T G A C T G A T
Site (3)           T C A T C G T T
Site (4)           A G A T T G A T
Site (5)           T C A A G G A T
Site (6)           T G A T C G A C
Site (7)           A A A T C G A T
consensus:         T G A T C G A T
IUPAC consensus:   W V A H B S W Y
Position-specific frequency matrix (counts)
     1  2  3  4  5  6  7  8
A    3  1  7  1  0  0  6  0
C    0  2  0  1  4  1  0  1
G    0  4  0  0  1  6  0  0
T    4  0  0  5  2  0  1  6
The counts are converted into frequencies by dividing each count by the number of sites:
$f_{i,j} = n_{i,j} / N$
where $n_{i,j}$ is the count of residue i at position j, N is the number of sequences (here N = 7), j = 1, 2, ..., L (number of positions), and i runs over the A residues of the alphabet (A = alphabet size: 4 for nucleic acids, 20 for amino acids).
Position-specific frequency matrix (frequencies)
      1     2     3     4     5     6     7     8
A   0.43  0.14  1     0.14  0     0     0.86  0
C   0     0.29  0     0.14  0.57  0.14  0     0.14
G   0     0.57  0     0     0.14  0.86  0     0
T   0.57  0     0     0.71  0.29  0     0.14  0.86
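The construction of the count and frequency matrices can be sketched as follows. The seven sites are those of the multiple alignment shown on the slide; the function names are chosen for this example only.

```python
ALPHABET = "ACGT"

# The seven aligned sites of the example.
SITES = [
    "AGATCCAT",
    "TGACTGAT",
    "TCATCGTT",
    "AGATTGAT",
    "TCAAGGAT",
    "TGATCGAC",
    "AAATCGAT",
]


def count_matrix(sites):
    """Count each residue at each position of the aligned sites."""
    width = len(sites[0])
    counts = {base: [0] * width for base in ALPHABET}
    for site in sites:
        for position, base in enumerate(site):
            counts[base][position] += 1
    return counts


def frequency_matrix(counts, n_sites):
    """Divide each count by the number of sites."""
    return {base: [n / n_sites for n in row] for base, row in counts.items()}


counts = count_matrix(SITES)
freqs = frequency_matrix(counts, len(SITES))
for base in ALPHABET:
    print(base, counts[base], [round(f, 2) for f in freqs[base]])
```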
Consider the site ACATTCAC in the sequence ...ACACATTCACTAC... The frequencies of its residues at positions 1 to 8 of the matrix are:
A: 0.43   C: 0.29   A: 1   T: 0.71   T: 0.29   C: 0.14   A: 0.86   C: 0.14
The probability that the sequence ACATTCAC is "generated" by the PWM is:
$P(\text{ACATTCAC} \mid \text{PWM}) = 0.43 \times 0.29 \times 1 \times 0.71 \times 0.29 \times 0.14 \times 0.86 \times 0.14 = 4.33 \times 10^{-4}$
This probability needs to be compared to the probability that the sequence ACATTCAC is obtained "by chance".
The probability that the sequence ACATTCAC is obtained "by chance" is (assuming that each residue has a prior probability of 0.25):
$P(\text{ACATTCAC} \mid \text{background model}) = 0.25^8 = 1.53 \times 10^{-5}$
Thus:
$\frac{P(\text{ACATTCAC} \mid \text{PWM model})}{P(\text{ACATTCAC} \mid \text{background model})} = \frac{4.33 \times 10^{-4}}{1.53 \times 10^{-5}} \approx 28$
It is about 28 times more likely that the sequence ACATTCAC corresponds to a motif than to a background sequence.
Note however the importance of the background model (i.e. the prior probabilities).
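The comparison between the motif model and the background model can be checked numerically. The sketch below uses the rounded frequencies of the slide and a uniform background of 0.25; the function name `site_probability` is an assumption for the example.

```python
# Frequency matrix of the example (rounded values, as on the slide).
FREQ = {
    "A": [0.43, 0.14, 1.00, 0.14, 0.00, 0.00, 0.86, 0.00],
    "C": [0.00, 0.29, 0.00, 0.14, 0.57, 0.14, 0.00, 0.14],
    "G": [0.00, 0.57, 0.00, 0.00, 0.14, 0.86, 0.00, 0.00],
    "T": [0.57, 0.00, 0.00, 0.71, 0.29, 0.00, 0.14, 0.86],
}


def site_probability(site, freq):
    """Product of the frequencies of the residues of the site, position by position."""
    p = 1.0
    for position, base in enumerate(site):
        p *= freq[base][position]
    return p


site = "ACATTCAC"
p_motif = site_probability(site, FREQ)
p_background = 0.25 ** len(site)      # uniform prior probabilities
ratio = p_motif / p_background

print(f"P(site | PWM)        = {p_motif:.2e}")
print(f"P(site | background) = {p_background:.2e}")
print(f"likelihood ratio     = {ratio:.1f}")
```

A different background model (for example one reflecting the GC content of the genome) would change the denominator and therefore the ratio.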
Profile matrices (PSSM)
If known, the prior probabilities (background model) can be used to convert the frequency matrix into a scoring matrix:
$W_{i,j} = \ln\left( f_{i,j} / p_i \right)$
where $p_i$ = prior probability of residue i (here: $p_A = p_C = p_G = p_T = 0.25$).
Position-weight matrix (PWM) = position-specific scoring matrix (PSSM)
      1      2      3      4      5      6      7      8
A   0.54  -0.58   1.39  -0.58    ?      ?     1.23    ?
C    ?     0.15    ?    -0.58   1.04  -0.58    ?    -0.58
G    ?     0.82    ?      ?    -0.58   1.23    ?      ?
T   0.82    ?      ?     1.04   0.15    ?    -0.58   1.23
The cells marked "?" correspond to a frequency of 0, for which the logarithm is not defined.
The pseudo-weights have been introduced by Hertz & Stormo (1999) to account for the small number of sequences used to build the PWM matrix. The frequencies are corrected before the logarithm is taken:
$W_{i,j} = \ln\left( f'_{i,j} / p_i \right)$ with $f'_{i,j} = \frac{n_{i,j} + p_i k}{N + k}$
where k = pseudo-count (here: k = 0.1), $n_{i,j}$ is the count of residue i at position j and N is the number of sites.
Position-weight matrix (PWM) with pseudo-weights
      1      2      3      4      5      6      7      8
A   0.54  -0.58   1.39  -0.58  -4.27  -4.27   1.23  -4.27
C  -4.27   0.15  -4.27  -0.58   1.04  -0.58  -4.27  -0.58
G  -4.27   0.82  -4.27  -4.27  -0.58   1.23  -4.27  -4.27
T   0.82  -4.27  -4.27   1.04   0.15  -4.27  -0.58   1.23
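The conversion of a count matrix into a scoring matrix can be sketched as below. The code assumes the pseudo-count correction f' = (n + p·k) / (N + k) and natural logarithms, with k = 0.1 and uniform priors as in the example; the function name `scoring_matrix` is hypothetical. Because the correction is applied to every cell, the values can differ from the table above in the second decimal.

```python
import math

COUNTS = {
    "A": [3, 1, 7, 1, 0, 0, 6, 0],
    "C": [0, 2, 0, 1, 4, 1, 0, 1],
    "G": [0, 4, 0, 0, 1, 6, 0, 0],
    "T": [4, 0, 0, 5, 2, 0, 1, 6],
}
PRIORS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def scoring_matrix(counts, priors, k=0.1):
    """Log-odds weights ln(f'/p), with f' = (n + p*k) / (N + k)."""
    n_sites = sum(row[0] for row in counts.values())   # N, from the first column
    weights = {}
    for base, row in counts.items():
        p = priors[base]
        weights[base] = [math.log(((n + p * k) / (n_sites + k)) / p) for n in row]
    return weights


weights = scoring_matrix(COUNTS, PRIORS)
for base, row in weights.items():
    print(base, " ".join(f"{w:6.2f}" for w in row))
```

Thanks to the pseudo-count, residues that were never observed at a position receive a strongly negative but finite weight instead of minus infinity.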
Profile matrices - pseudo-counts
How to choose the appropriate pseudo-count k ?
Most binding-site matrices have been constructed from a small number of sites, often fewer than 10. The Pho4p matrix, as found in TRANSFAC for example, has been built from 8 known binding sites. Some residues thus have a frequency of 0. This gives a weight of -∞, which means that we consider it completely impossible for the transcription factor to bind a site with such a residue. However, this might be an artifact of incomplete sampling.
Hertz & Stormo (1999) proposed to "correct" the frequencies by introducing a pseudo-weight (k):
$f'_{i,j} = \frac{n_{i,j} + p_i k}{N + k}$
where $n_{i,j}$ is the count of residue i at position j, N is the number of sites and $p_i$ is the prior probability of residue i.
The pseudo-weight has to be chosen so as to guarantee that the sum of the frequencies is still one.
A typical value for the pseudo-weight is 1/(alphabet size), but more elaborate theories have been developed to better choose the pseudo-weight.
Profile matrices - background probabilities
How to choose the appropriate background probabilities pi ?
The four bases in DNA sequences do not occur equally often. The GC content varies strongly from one organism to another (51% in E. coli, 41% in human, 36% in S. cerevisiae, 19% in Plasmodium falciparum, 72% in Streptomyces coelicolor).
This bias can easily be taken into account in the background probabilities pi. Background probabilities can also take into account local variation in the residue content (moving windows) or dependence between successive residues (Markov chain).
Moving windows: the background probability is computed from a window of sequence around the PSSM.
Markov chains: p(Rj) = f(Rj-1, Rj-2, ...), where Rj is the residue at position j.
Profile matrices - background probabilities
How to choose the appropriate background probabilities pi ?
Markov models show strong variations between organisms.
[Figure: Markov models for Mycoplasma genitalium (Firmicutes, Bacteria), Plasmodium falciparum (Apicomplexa, eukaryotic parasite), Saccharomyces cerevisiae (Fungus), Anopheles gambiae (Insect), Escherichia coli K12 (Proteobacteria) and Homo sapiens (Mammalian). CpG islands excluded. Source: J. van Helden]
Profile matrices - background probabilities
How to compute background probabilities using a Markov model?
The example below illustrates the computation of the probability of a sequence segment (CCTACTATATGCCCAGAATT) with a Markov chain of order 2, calibrated from 3-nt frequencies on the yeast genome.
Source: Jacques van Helden
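The principle of this computation can be sketched in Python. The yeast trinucleotide frequencies are not reproduced here, so the model below is calibrated on a short made-up training sequence (`TRAINING` is purely hypothetical), and a pseudo-count of 1 is added so that unseen trinucleotides do not get a probability of zero. The function names are assumptions for the example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train(training_sequence, order=2):
    """Count all words of length order+1 and their prefixes of length order."""
    words = Counter(
        training_sequence[i:i + order + 1]
        for i in range(len(training_sequence) - order)
    )
    prefixes = Counter()
    for word, n in words.items():
        prefixes[word[:-1]] += n
    return words, prefixes


def transition_probability(words, prefixes, word, pseudo=1.0):
    """P(last residue of word | preceding residues), with a pseudo-count."""
    return (words[word] + pseudo) / (prefixes[word[:-1]] + len(ALPHABET) * pseudo)


def log_probability(segment, words, prefixes, order=2, pseudo=1.0):
    """Natural log of the probability of the segment under the Markov chain."""
    n_prefixes = len(ALPHABET) ** order
    total = sum(prefixes.values())
    # Probability of the first `order` residues, from the prefix frequencies.
    logp = math.log((prefixes[segment[:order]] + pseudo) / (total + n_prefixes * pseudo))
    # Each following residue is conditioned on the `order` residues before it.
    for j in range(order, len(segment)):
        word = segment[j - order:j + 1]
        logp += math.log(transition_probability(words, prefixes, word, pseudo))
    return logp


# Hypothetical training sequence; a real model would be calibrated on a genome.
TRAINING = "ATGCCTACTATATGCCCAGAATTGCATTACGATTAGCCATATTAGGCAT" * 3
words, prefixes = train(TRAINING)
logp = log_probability("CCTACTATATGCCCAGAATT", words, prefixes)
print(logp)
```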
Profile matrices - example
ROX1 binding sites
Consensus sequence
Position-specific frequency matrix
D'haeseleer (2006) Nat. Biotech. 24: 423-425.
Scoring a sequence with a profile matrix
How to score a sequence with a given position-weight matrix ?
      1      2      3      4      5      6      7      8
A   0.54  -0.58   1.39  -0.58  -4.27  -4.27   1.23  -4.27
C  -4.27   0.15  -4.27  -0.58   1.04  -0.58  -4.27  -0.58
G  -4.27   0.82  -4.27  -4.27  -0.58   1.23  -4.27  -4.27
T   0.82  -4.27  -4.27   1.04   0.15  -4.27  -0.58   1.23
A window of the same width as the matrix (here 8 positions) is moved along the sequence, and a score is computed for each position of the window.
Sequence: ...ACGATCGATCTAC...
Window: A C G A T C G A
score/position: 0.54 0.15 -4.27 -0.58 0.15 -0.58 -4.27 -4.27
score total: -13.13
A negative value indicates that the sequence is more likely an instance of the background model than an instance of the PSSM model,
i.e. P(ACGATCGA | background model) > P(ACGATCGA | PSSM model)
Sequence: ...ACGATCGATCTAC..., window shifted by one position.
Window: C G A T C G A T
score/position: -4.27 0.82 1.39 1.04 1.04 1.23 1.23 1.23
score total: 3.71
A positive value indicates that the sequence is more likely an instance of the PSSM model than an instance of the background model,
i.e. P(CGATCGAT | PSSM model) > P(CGATCGAT | background model)
Sequence: ...ACGATCGATCTAC..., window shifted by one more position.
Window: G A T C G A T C
score/position: -4.27 -0.58 -4.27 -0.58 -0.58 -4.27 -0.58 -0.58
score total: -15.71
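The sliding-window scoring illustrated above can be sketched as a short function. The matrix is the PSSM of the example; the function name `scan` is an assumption. For the first two windows of ACGATCGATCTAC, the totals agree with the values given above (-13.13 and 3.71).

```python
# PSSM of the example (rows: residues, columns: positions 1 to 8).
PSSM = {
    "A": [0.54, -0.58, 1.39, -0.58, -4.27, -4.27, 1.23, -4.27],
    "C": [-4.27, 0.15, -4.27, -0.58, 1.04, -0.58, -4.27, -0.58],
    "G": [-4.27, 0.82, -4.27, -4.27, -0.58, 1.23, -4.27, -4.27],
    "T": [0.82, -4.27, -4.27, 1.04, 0.15, -4.27, -0.58, 1.23],
}


def scan(sequence, pssm):
    """Score every window of the sequence; return (1-based position, window, score)."""
    width = len(next(iter(pssm.values())))
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # The score of a window is the sum of the weights of its residues.
        score = sum(pssm[base][j] for j, base in enumerate(window))
        results.append((start + 1, window, round(score, 2)))
    return results


results = scan("ACGATCGATCTAC", PSSM)
for position, window, score in results:
    print(position, window, score)
```

Only one strand is scanned here; for DNA motifs the reverse complement would normally be scored as well.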
Scoring a sequence with a profile matrix
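[Figure: score as a function of the position along the sequence AGATCCATTGACCGTTAGATTGAAGATTGATAGATTGATTTTGATCGACAAAGTG...; positions where the score exceeds the threshold (s > sth, here sth = 0) are reported as matches.]
How to choose the threshold?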
Scoring a sequence with a profile matrix
Example
Source: M. Thomas-Chollier / J. van Helden
Scoring a sequence with a profile matrix
How to choose the threshold?
There is a trade-off:
§ Threshold too high => high selectivity, but low sensitivity
High confidence in the predicted sites, but many real sites are missed
§ Threshold too low => high sensitivity, but low selectivity
The real sites are drowned in a lot of false positives.
The threshold can be chosen on the basis of the scores returned when the matrix is used to scan known binding sites for the factor.
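The trade-off can be made concrete with a toy calculation. All scores below are invented for the illustration: `known_site_scores` stands for the scores of experimentally known sites and `background_scores` for scores obtained on sequences assumed to contain no site.

```python
def evaluate_threshold(known_site_scores, background_scores, threshold):
    """Return (sensitivity, number of false positives) for a given threshold.
    A position is called a match when its score is above the threshold."""
    detected = sum(score > threshold for score in known_site_scores)
    false_positives = sum(score > threshold for score in background_scores)
    return detected / len(known_site_scores), false_positives


# Hypothetical scores, for illustration only.
known_site_scores = [3.7, 5.2, 6.1, 4.4]
background_scores = [-13.1, -15.7, 0.4, 1.2, 3.9, -2.0]

for threshold in (0, 2, 4):
    sensitivity, false_positives = evaluate_threshold(
        known_site_scores, background_scores, threshold
    )
    print(threshold, sensitivity, false_positives)
```

In this toy example, raising the threshold from 0 to 4 removes all false positives but one of the four known sites is lost.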
Scoring a sequence with a profile matrix
How to choose the threshold?
Source: M. Thomas-Chollier
Scoring a sequence with a profile matrix
Example: Here is the matrix for the Pho4p binding site (S. cerevisiae)
Transfac matrix F$PHO4_01
This count matrix has been converted into a PSSM accounting for prior (background) probabilities (pi) based on the GC content in S. cerevisiae:
Source: Jacques van Helden
Scoring a sequence with a profile matrix
Example: Here are the matches found when scanning the PHO3, PHO5, PHO11, PHO12, PHO81, PHO84, PHO87, and PHO88 promoters with the Pho4p PSSM (obtained with RSA Tools). Many matches (indicated by blue boxes whose size is scaled according to their score) are found. However, many of them are not real Pho4p binding sites...
Source: Jacques van Helden
Scoring a sequence with a profile matrix
Example: Here are the matches found when scanning the PHO3, PHO5, PHO11, PHO12, PHO81, PHO84, PHO87, and PHO88 promoters with the Pho4p PSSM for a higher threshold. Note that in this case PHO3 is not displayed because it does not contain a single match.
When the threshold is set at a higher value, the spurious matches are filtered out and only the actual binding sites are retained.
Source: Jacques van Helden
PSI BLAST
PSI-BLAST : Position-Specific Iterated BLAST (Altschul et al, 1997)
1. BLAST runs a first time in normal mode on the query sequence.
2. The resulting sequences are aligned together (multiple sequence alignment) and a PSSM is calculated.
3. This PSSM is used to scan the database for new matches.
4. Steps 2-3 can be iterated several times. This procedure typically converges after a few cycles.
[Figure: query sequence, BLAST, aligned output sequences, PSSM, BLAST (iterated)]
PSI BLAST
PSI-BLAST : Position-Specific Iterated BLAST (Altschul et al, 1997)
§ Known problems:
§ Over-represented subfamilies may bias profile.
§ Inappropriate E-value calculation may lead to the acceptance of false positive matches
§ Domain boundaries may not be properly identified during the first round.
§ Advice:
§ The result of an iterative profile search may be verified by starting the procedure with a different seed.
Information content
How to measure the quality of a PSSM?
Which one of the following matrices is "the best"?
First matrix:
     1  2  3  4  5  6  7  8
A    2  0  8  0  0  0  0  0
C    3  8  0  8  0  0  0  2
G    3  0  0  0  8  0  5  4
T    0  0  0  0  0  8  3  2
Second matrix:
     1  2  3  4  5  6  7  8
A    2  1  5  0  1  1  1  1
C    3  4  3  5  1  1  0  2
G    2  3  0  2  4  1  5  3
T    1  1  0  1  2  7  2  2
Information content
The information content measures the conservation of the residues at a given position.
By definition the information content is:
$I_{i,j} = f'_{i,j} \ln\left( f'_{i,j} / p_i \right)$
where $f'_{i,j}$ is the corrected frequency of residue i at position j and $p_i$ is the prior probability of residue i.
§ $I_{i,j} > 0$ when $f'_{i,j} > p_i$ (i.e. when residue i is more frequent at position j than expected by chance)
§ $I_{i,j} < 0$ when $f'_{i,j} < p_i$
§ $I_{i,j}$ tends towards 0 when $f'_{i,j} \to 0$. Indeed, $\lim_{x \to 0} x \ln x = 0$.
Information content of a column:
$I_j = \sum_{i=1}^{A} I_{i,j}$, where A = alphabet size
Information content of the matrix:
$I_{matrix} = \sum_{j=1}^{w} I_j$, where w = length of the matrix
See details in info_content.pdf
Source: Jacques van Helden
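As an illustration, the information content of the two count matrices shown earlier can be computed with a few lines of Python. The sketch assumes the definition I = f' ln(f'/p) summed over residues and positions, uniform priors, and a pseudo-count k = 1 (an arbitrary choice for the example); the function name is hypothetical.

```python
import math

MATRIX_1 = {
    "A": [2, 0, 8, 0, 0, 0, 0, 0],
    "C": [3, 8, 0, 8, 0, 0, 0, 2],
    "G": [3, 0, 0, 0, 8, 0, 5, 4],
    "T": [0, 0, 0, 0, 0, 8, 3, 2],
}
MATRIX_2 = {
    "A": [2, 1, 5, 0, 1, 1, 1, 1],
    "C": [3, 4, 3, 5, 1, 1, 0, 2],
    "G": [2, 3, 0, 2, 4, 1, 5, 3],
    "T": [1, 1, 0, 1, 2, 7, 2, 2],
}
PRIORS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def information_content(counts, priors, k=1.0):
    """Return (information per column, total information of the matrix), in nats."""
    width = len(next(iter(counts.values())))
    per_column = []
    for j in range(width):
        n_column = sum(counts[base][j] for base in counts)
        info = 0.0
        for base in counts:
            # Corrected frequency; never zero because of the pseudo-count.
            f = (counts[base][j] + priors[base] * k) / (n_column + k)
            info += f * math.log(f / priors[base])
        per_column.append(info)
    return per_column, sum(per_column)


columns_1, total_1 = information_content(MATRIX_1, PRIORS)
columns_2, total_2 = information_content(MATRIX_2, PRIORS)
print([round(i, 2) for i in columns_1], round(total_1, 2))
print([round(i, 2) for i in columns_2], round(total_2, 2))
```

Under these assumptions the first matrix, whose columns are more conserved, has the higher total information content.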
Information content
Example: CRP binding site.
§ cAMP Receptor Protein (CRP) is a dimer of two identical subunits.
§ cAMP-CRP activates expression of many genes in E. coli, by binding to specific sites on the DNA where it directly interacts with RNA Polymerase.
§ Nucleotide sequencing and analysis of CRP binding sites established a consensus binding sequence consisting of an imperfect 5 bp palindrome:
TGTGA---N6---TCACA
CRP binding site logo
Information content
Example: CRP binding site.
[Figure: 23 sites identified as binding sites for CRP in E. coli, with the corresponding PWM (frequencies), PSSM (scores) and information content per position.]
Stormo & Hartzell, PNAS (1989)
Sequence logos
A logo is a simple piece of type bearing two or more separate elements.
Schneider et al (1990) use logos to display aligned sets of sequences, such as binding sites, described by a PSSM.
Schneider TD, Stephens RM (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18: 6097-100.
Sequence logos
The sequence logo shows:
§ the general consensus of a sequence
§ the order of predominance of the residues at every position
§ the amount of information present at every position
§ an initiation point, cut point, or other significant location (if appropriate).
Schneider TD, Stephens RM (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18: 6097-100.
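The letter heights of a logo can be sketched as follows. The code uses the usual definition in bits (maximum information log2(4) = 2 bits for DNA, minus the Shannon entropy of the column) and, as a simplification, leaves out the small-sample correction of the original paper. The count matrix is the 7-site example used for the profile matrices; the function name is an assumption.

```python
import math

COUNTS = {
    "A": [3, 1, 7, 1, 0, 0, 6, 0],
    "C": [0, 2, 0, 1, 4, 1, 0, 1],
    "G": [0, 4, 0, 0, 1, 6, 0, 0],
    "T": [4, 0, 0, 5, 2, 0, 1, 6],
}


def logo_heights(counts):
    """For each position, return the height (in bits) of each letter of the logo."""
    alphabet = list(counts)
    width = len(counts[alphabet[0]])
    columns = []
    for j in range(width):
        n = sum(counts[base][j] for base in alphabet)
        freqs = {base: counts[base][j] / n for base in alphabet}
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        information = math.log2(len(alphabet)) - entropy   # height of the whole stack
        # Each letter takes a share of the stack proportional to its frequency.
        columns.append({base: f * information for base, f in freqs.items()})
    return columns


heights = logo_heights(COUNTS)
for position, column in enumerate(heights, start=1):
    stack = sorted(column.items(), key=lambda item: item[1], reverse=True)
    print(position, [(base, round(h, 2)) for base, h in stack if h > 0])
```

A perfectly conserved position (position 3 here, always A) reaches the maximum of 2 bits.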
Sequence logos
ROX1 binding sites
Consensus sequence
Position-specific frequency matrix
D'haeseleer (2006) Nat. Biotech. 24: 423-425.
Sequence logo
In (d), all the positions have the same height. In (e) and (f), the size of the letters is scaled according to the information content. In (f), the information content is corrected to account for the background frequencies (observed in yeast). Because of the low GC content, the G in the middle is higher, reflecting the fact that it carries more information than the flanking A and T.
Profile matrices : limitations
Some limitations of the PSSM are:
§ It is difficult to recognize instances of the pattern that contain insertions or deletions. If a PSSM is designed to detect the motif GGCACGTGTA (and its variants), it will most likely fail to detect GGCACCTGTGTA.
§ It cannot capture positional dependencies. Suppose, in a particular motif, we always see either RD or QH at positions j and j + 1, but never QD or RH. This is an example of a pattern that a PSSM cannot represent.
§ PSSMs are not well suited to represent variable-length patterns.
§ This approach is not well suited for detecting sharp boundaries between two regions.
Source: http://www.cs.cmu.edu/~durand/
Hidden Markov Model
DNA motif (alignment)
Regular expression
Consensus sequence
Improbable sequence
Durbin, R. M., Eddy, S. R., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis Cambridge University Press.
Krogh A (1998) An introduction to hidden Markov models for biological sequences, Chapter 4 in Computational Methods in Molecular Biology.
Hidden Markov Model
Hidden Markov models are a kind of finite-state automaton that can be used, for example, to represent and to find specific motifs in a sequence.
§ An HMM is defined by a set of states, each of which has a limited number of transitions to other states and a limited number of emissions.
§ Each transition between states has an assigned probability, the value of which is independent of the history of previous states encountered. It is this property that makes these a class of Markov models.
§ Each of the models considered here has a start state and an end state, and any path through the model from the start to the end state will produce a sequence.
§ In these HMMs, each state emission is a residue of the sequence.
Source: Zvelebil & Baum, 2008
Hidden Markov Model
General structure of a "sequence motif" HMM
This is a complete profile HMM, with a start and an end state and four match states. The match, insert and delete states are labeled Mi, Ii, and Di, respectively. Each transition between these states is associated with a certain probability. From a set of aligned sequences associated with a given pattern, we can build a particular HMM that describes our pattern in terms of probabilities.
Source: Zvelebil & Baum, 2008
Hidden Markov Model
HMM: Example
Krogh A (1998) An introduction to hidden Markov models for biological sequences, Chapter 4 in Computational Methods in Molecular Biology.
Hidden Markov Model
HMM: scoring an alignment (using probabilities)
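Hidden Markov Model
Log-odds scores
$\text{log-odds}(S) = \log \frac{P(S \mid \text{HMM})}{P(S \mid \text{background})}$
where the numerator is the probability of the sequence S according to the HMM and the denominator is its prior (background) probability (assumed here equal for each residue).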
Hidden Markov Model
Finding motifs in a sequence with an HMM
[Figure: HMM with START and END states and two background states, each emitting A, C, G and T with probability 0.25, on either side of the motif model.]
AGATCCATTGACCGTTACACATCAGATTGATAGATTGATTTTGATCGACAAAGTG...
bbbbbbbbbbbbbbbbmmmmmmmbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb...
b = background
m = motif
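The labelling of a sequence into background (b) and motif (m) positions can be sketched with the Viterbi algorithm on a deliberately simplified HMM: one background state with uniform emissions and a chain of motif states without insert or delete states. Everything in this sketch is an assumption made for illustration: the motif emissions are taken from the 7-site count matrix used earlier (with a pseudo-count), the probability `p_enter` of entering the motif is arbitrary, and the test sequence is invented. It is not the model of Krogh's example.

```python
import math

COUNTS = {
    "A": [3, 1, 7, 1, 0, 0, 6, 0],
    "C": [0, 2, 0, 1, 4, 1, 0, 1],
    "G": [0, 4, 0, 0, 1, 6, 0, 0],
    "T": [4, 0, 0, 5, 2, 0, 1, 6],
}
NEG_INF = float("-inf")


def corrected_frequencies(counts, k=1.0, prior=0.25):
    """Emission probabilities of the motif states (one dict per motif position)."""
    n_sites = sum(row[0] for row in counts.values())
    width = len(counts["A"])
    return [
        {b: (counts[b][j] + prior * k) / (n_sites + k) for b in counts}
        for j in range(width)
    ]


def viterbi(sequence, motif, p_enter=0.01, background=0.25):
    """Most probable state path; state 0 = background, 1..width = motif positions."""
    width = len(motif)
    n_states = width + 1

    def log_emit(state, base):
        return math.log(background if state == 0 else motif[state - 1][base])

    def log_trans(prev, cur):
        if prev == 0:                                  # stay in background or enter motif
            if cur == 0:
                return math.log(1 - p_enter)
            return math.log(p_enter) if cur == 1 else NEG_INF
        nxt = 0 if prev == width else prev + 1         # motif states advance one by one
        return 0.0 if cur == nxt else NEG_INF

    score = [NEG_INF] * n_states
    score[0] = log_emit(0, sequence[0])                # the path starts in the background
    pointers = []
    for base in sequence[1:]:
        new_score, ptr = [], []
        for cur in range(n_states):
            best = max(range(n_states), key=lambda p: score[p] + log_trans(p, cur))
            ptr.append(best)
            new_score.append(score[best] + log_trans(best, cur) + log_emit(cur, base))
        score = new_score
        pointers.append(ptr)
    state, path = 0, [0]                               # the path ends in the background
    for ptr in reversed(pointers):
        state = ptr[state]
        path.append(state)
    return "".join("b" if s == 0 else "m" for s in reversed(path))


sequence = "C" * 10 + "TGATCGAT" + "C" * 10           # hypothetical sequence
path = viterbi(sequence, corrected_frequencies(COUNTS))
print(sequence)
print(path)
```

A full profile HMM would add insert and delete states, which is what allows it to handle variable-length instances of the motif.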
Hidden Markov Model
Limitations and disadvantages of HMMs
§ HMMs are less easy to manipulate than PSSMs, and it can take a lot of (CPU) time to build them.
§ The number of parameters needed to define an HMM is very large. For example, the "simple" HMM shown as an example has 36 parameters but is built from only 5 sequences! Ideally, to avoid over-fitting, HMMs should be built from a large amount of data.
§ In practice, however, the number of sequences is limited. Indeed, most transcription factors regulate a small number of genes, and thus only a small number of binding sites can be used to compute all the probabilities. For the same reason as for PSSMs, pseudo-counts should often be used to correct for possible missing observations.
Motifs databases
Collections of DNA and protein patterns have been stored in specialized databases:
Transcription factor binding sites:
§ YEASTRACT and RegulonDB
§ TRANSFAC and JASPAR
§ TRED
Protein motifs and domains:
§ BLOCKS
§ PROSITE
§ PFAM
Be aware: Motifs found in databases may have been obtained by different means (experimentally on real data, by SELEX experiments, or predicted by bioinformatic means) and are therefore of varying quality. The number of sites used to build the profile also influences its quality.
RegulonDB, YEASTRACT, SCPD
RegulonDB, YEASTRACT and SCPD provide collections of binding sites for transcription factors in E. coli and in yeast.
§ RegulonDB database
§ provides PSSM matrices for TF in bacteria.
§ Huerta et al (1998) Nucleic Acids Res. 26:55-9
§ web: http://regulondb.ccg.unam.mx/
§ YEASTRACT database
§ provides consensus sequences for TF in yeast.
§ Teixeira et al (2006). Nucleic Acids Res. 34: D446-51.
§ web: http://www.yeastract.com/http
§ SCPD (Promoter Database of S. cerevisiae)
§ provides consensus sequences and PSSMs for TF in yeast.
§ Zhu, Zhang (1999) Bioinformatics. 15: 607-11.
§ web: http://rulai.cshl.edu/SCPD/
TRANSFAC & JASPAR
Collections of PSSMs for transcription factor binding sites can be found in specialized databases:
§ TRANSFAC database
§ 834 matrices.
§ Matys et al (2003) Nucl Acids Res. 31: 374-378
§ web: http://www.biobase-international.com/pages/index.php?id=transfac
§ JASPAR database
§ 138 matrices.
§ Sandelin et al (2004) Nucl Acids Res. 32: D91-D94
§ web: http://jaspar.genereg.net/
TRED
TRED (Transcriptional Regulatory Element Database)
Availability: https://cb.utdallas.edu/cgi-bin/TRED/tred.cgi?process=home
TransFind
Roider HG, Kanhere A, Manke T, Vingron M. (2007) Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics. 23:134-41.
Kielbasa SM, Klein H, Roider HG, Vingron M, Blüthgen N (2010) TransFind--predicting transcriptional regulators for gene sets. Nucleic Acids Res. 38 Suppl:W275-80.
Availability: http://transfind.sys-bio.net
BLOCKS
BLOCKS: multiply aligned ungapped segments of the most highly conserved regions of proteins
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Block Searcher, Get Blocks and Block Maker are aids to detection and verification of protein sequence homology. They compare a protein or DNA sequence to a database of protein blocks (current version), retrieve blocks, and create new blocks, respectively.
Web: http://blocks.fhcrc.org/
NB: The Blocks Database is no longer updated. We suggest you use InterPro
(http://www.ebi.ac.uk/interpro/) to annotate your sequences.
Sonnhammer et al (1998) Nucl Acids Res. 26: 302-322.
PROSITE
PROSITE: a documented database using patterns and profiles as motif descriptors
PROSITE is an annotated collection of motif descriptors dedicated to the identification of protein families and domains. The motif descriptors used in PROSITE are either patterns or profiles, which are derived from multiple alignments of homologous sequences.
The core of PROSITE is composed of two text files:
§ PROSITE.DAT (computer-readable): contains all the information necessary to scan sequence(s) for the occurrence of patterns or profiles.
§ PROSITE.DOC: contains textual information that fully documents each pattern or profile.
The latest version of PROSITE (Sept 2007) contains 1319 patterns and 745 profiles.
Web: http://www.expasy.ch/prosite/
Sigrist et al (2002) Brief. Bioinfo. 3: 265-274
Hulo et al (2007) Nucl Acids Res. 36: D245-249.
PFAM
PFAM: multiple sequence alignment and HMM-profiles of protein domains
PFAM contains 527 manually verified domain families, which are described by profile HMM models. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high quality, manually curated families. Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, in order to give a more comprehensive coverage of known proteins we also generate a supplement using the PRODOM database. These automatically generated entries are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
Web: http://pfam.janelia.org/
Sonnhammer et al (1998) Nucl Acids Res. 26: 302-322.
Pattern matching: summary
Pattern matching is the art of finding known patterns in a sequence. Since variability is often allowed, this is not a trivial task!
Several pattern descriptors have been introduced:
§ Consensus sequences, which can be found in a sequence using regular expression matching.
§ Position-specific matrices (PSSM), which can be used to associate a score to each match and to select only the most significant ones.
§ Hidden Markov Models, which rely on a probabilistic description of the pattern.
Pattern collections can be found in specialized databases.
Further reading
Reviews:
D'haeseleer (2006) What are DNA sequence motifs? Nat. Biotech. 24: 423-425.
D'haeseleer (2006) How does DNA sequence motif discovery work? Nat. Biotech. 24: 959-961.
Stormo (2000) DNA binding sites: representation and discovery. Bioinformatics 16:16-23.
Hertz and Stormo (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15:563-77.
van Helden (2005) The analysis of regulatory sequences. Lecture Notes, Les Houches, 2005.