
PLPTH 890 Introduction to Genomic Bioinformatics
Lecture 16
Biological Sequence Pattern Analysis
Liangjiang (LJ) Wang
March 8, 2005
Outline
• Basic concepts and biological problems.
• Regular expression for:
– Pattern matching (sequence motifs),
– Pattern discovery (promoter elements).
• Position Weight Matrix (PWM) for:
– Pattern matching (TransFac, TESS, etc),
– Pattern discovery (MEME, Gibbs sampling).
• Hidden Markov Models (HMMs) for protein
domain analysis (next lecture).
Biological Sequence Patterns
• In nucleotide sequences:
– Transcription start and termination sites,
– Promoter cis regulatory elements,
– Intron/exon splice sites,
– Translation start and stop sites,
– mRNA cis regulatory elements.
• In protein sequences:
– Functional motifs such as signal peptides,
– Conserved protein domains.
Promoter cis Regulatory Elements
• Cells respond to various stimuli by regulating the
expression of particular genes.
• Transcription factors regulate gene expression by
binding to specific DNA sequence motifs.
• Transcription factor binding sites are often
short (5 – 25 bases) and degenerate DNA motifs.
• Co-regulated genes may have common regulatory
motifs in their promoters.
[Figure: MyoD HLH dimer bound to DNA (labels: H1, H2, L; binding site CAACTGAC)]
How to Represent a Sequence Pattern?
• Regular expressions:
– A pattern is represented by a string of characters
such as TATAAAA (the TATA box).
– Ambiguous characters, wild-cards and gaps are
allowed, but no position-specific information.
• Position Weight Matrices (PWM):
– Also called Position-Specific Score Matrix (PSSM).
– Often an ungapped pattern specified by a table.
• Stochastic models:
– Hidden Markov Models (HMM), neural nets, etc.
– Based on probability / machine learning theory.
Pattern Matching vs. Pattern Discovery
• Pattern matching:
– Scanning a nucleotide or protein sequence for
matches to a known pattern.
– The major consideration is how to achieve
better sensitivity and specificity.
• Pattern discovery:
– Given a set of sequences, discovering a
pattern that is shared by the sequences. The
pattern is not known in advance.
– Using search or learning approaches.
– A much harder problem than pattern matching.
Pattern Matching with RegExp
• A regular expression (RegExp) can represent:
– Ambiguous character: e.g., [AG] or R.
– Wild-card: e.g., X for any amino acid.
– Gap: e.g., x(i, j) in PROSITE patterns.
• Pattern matching with regular expressions is
straightforward, and often very useful.
• For example, find all the Arabidopsis proteins
which contain the following motif:
[RK][LVI]X{5}[QH][LA]
(These proteins may be targeted to the peroxisome)
Patmatch at TAIR (http://www.arabidopsis.org/)
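The motif search above can be sketched with Python's standard `re` module. This is a minimal illustration, not how Patmatch is implemented: the wild-card X{5} is written as `.{5}`, a lookahead is used so that overlapping matches are also reported, and the function name `find_motif` and the toy protein sequence are hypothetical.

```python
import re

# [RK][LVI]X{5}[QH][LA] in Python regex syntax: X (any amino acid) becomes "."
# The lookahead (?=...) lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=([RK][LVI].{5}[QH][LA]))")

def find_motif(protein):
    """Return (1-based start, matched text) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(protein)]

# Hypothetical toy sequence, for illustration only
toy_protein = "MSGRLAAAAAHLTTT"
print(find_motif(toy_protein))  # [(4, 'RLAAAAAHL')]
```

To scan a proteome, the same function would be applied to each sequence read from a FASTA file.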
Pattern Discovery Using RegExp
Enumerate all the possible
regular expression patterns
with ambiguous characters.
e.g., CWTNC, CRTGTW,
YCGGAYRRAWG, ……
over {A, C, G, T, R, Y, S,
W, M, K, V, H, D, B, N}
Count the occurrences of all
the patterns in the input
sequences (word counting).
Compute statistical
significance based on the
background distribution.
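e.g., z-score:
$z = \frac{N - E(X)}{\sigma(X)}$
where N is the observed number of occurrences of a pattern, and E(X) and σ(X)
are the mean and standard deviation of its count X under the background distribution.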
(The method works for simple patterns such as short
nucleotide motifs, but not for long and/or complex patterns)
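The enumerate, count and score steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any published tool: the background is i.i.d. with equal base frequencies, the count is treated as binomial (overlap correlations are ignored), and the names `count_occurrences`, `z_score` and `top_patterns` are hypothetical.

```python
import itertools
import math
import re

# IUPAC nucleotide codes and the bases each one stands for
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "M": "AC", "K": "GT", "V": "ACG",
         "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}

def count_occurrences(pattern, seqs):
    """Count matches (overlapping allowed) of an IUPAC pattern in all sequences."""
    body = "".join("[" + IUPAC[c] + "]" for c in pattern)
    regex = re.compile("(?=" + body + ")")
    return sum(len(regex.findall(s)) for s in seqs)

def z_score(pattern, seqs, bg=None):
    """z = (N - E(X)) / sigma(X) under an i.i.d. background (an assumption)."""
    bg = bg or {n: 0.25 for n in "ACGT"}
    # Probability that a random window matches the pattern
    q = math.prod(sum(bg[n] for n in IUPAC[c]) for c in pattern)
    positions = sum(max(len(s) - len(pattern) + 1, 0) for s in seqs)
    observed = count_occurrences(pattern, seqs)
    expected = positions * q
    sd = math.sqrt(positions * q * (1 - q))  # binomial approximation
    return (observed - expected) / sd if sd > 0 else 0.0

def top_patterns(seqs, width, alphabet="ACGT", n=5):
    """Enumerate every pattern of a given width and rank them by z-score."""
    scored = [("".join(p), z_score("".join(p), seqs))
              for p in itertools.product(alphabet, repeat=width)]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:n]
```

Passing the full 15-letter IUPAC alphabet to `top_patterns` enumerates 15^width patterns, which is why exhaustive search is practical only for short motifs.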
Applications to Promoter Analysis
• The RegExp pattern enumeration method has
been used to find cis regulatory motifs that are
statistically overrepresented in a given
promoter sequence dataset:
– Sinha and Tompa, 2002. Discovery of novel
transcription factor binding sites by statistical
overrepresentation. NAR, 30:5549-5560.
– YMF is available at
http://wingless.cs.washington.edu/YMF/YMFWeb/YMFInput.pl.
• Complete search: all motifs in the search
space are enumerated and tested for statistical
overrepresentation.
Problems with RegExp
• Do not specify the relative
frequencies of nucleotides
at a position.
• Cannot express the relative
importance of a position for
the pattern.
• Cannot capture a possible
relationship between two
positions.
Example alignment of eight binding sites:
AGTCC
AGTCC
AGTAC
AGTAC
AGTGG
AGTGG
AACTT
AAGTT
RegExp (IUPAC) representation: A R B N B
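A short sketch shows how such a consensus is derived and why it loses information: each column is reduced to the set of observed bases, so the frequencies are discarded. The function name `iupac_consensus` is hypothetical.

```python
# IUPAC nucleotide codes and the bases each one stands for
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "M": "AC", "K": "GT", "V": "ACG",
         "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}
# Reverse lookup: set of bases -> IUPAC code
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}

def iupac_consensus(aligned_sites):
    """Collapse each alignment column to the IUPAC code for its set of bases."""
    return "".join(CODE_FOR[frozenset(column)] for column in zip(*aligned_sites))

sites = ["AGTCC", "AGTCC", "AGTAC", "AGTAC",
         "AGTGG", "AGTGG", "AACTT", "AAGTT"]
print(iupac_consensus(sites))  # ARBNB
```

Column 2 becomes R whether G occurs in six of eight sites or in one of eight, which is exactly the frequency information a RegExp cannot carry.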
PWM Representation of a Motif
• A motif is assumed to have a fixed width, W.
• In the PWM, pnk is the probability (relative
frequency) of nucleotide n in column k.
Aligned sites:
AGTCC
AGTCC
AGTAC
AGTAC
AGTGG
AGTGG
AACTT
AAGTT

PWM (rows: nucleotide n; columns: position k):
       1      2      3      4      5
A      1      0.25   0      0.25   0
C      0      0      0.125  0.25   0.5
G      0      0.75   0.125  0.25   0.25
T      0      0      0.75   0.25   0.25
Have we lost information here?
• Background probability: pn0 is the probability
of n in the background (i.e., outside the motif).
Equal distribution: pA0 = pC0 = pG0 = pT0 = ¼.
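The PWM for the eight aligned sites can be computed as column-wise relative frequencies. The sketch below is illustrative; the function name `build_pwm` and the optional `pseudocount` parameter are assumptions that do not appear on the slides.

```python
def build_pwm(aligned_sites, alphabet="ACGT", pseudocount=0.0):
    """Return {nucleotide: [p_n1, ..., p_nW]} from equal-length aligned sites."""
    width = len(aligned_sites[0])
    pwm = {n: [0.0] * width for n in alphabet}
    for k, column in enumerate(zip(*aligned_sites)):
        total = len(column) + pseudocount * len(alphabet)
        for n in alphabet:
            # Relative frequency of nucleotide n in column k
            pwm[n][k] = (column.count(n) + pseudocount) / total
    return pwm

sites = ["AGTCC", "AGTCC", "AGTAC", "AGTAC",
         "AGTGG", "AGTGG", "AACTT", "AAGTT"]
pwm = build_pwm(sites)
for n in "ACGT":
    print(n, pwm[n])
```

With `pseudocount=0.0` the output reproduces the table above. A small positive pseudocount avoids zero entries, which matters once log-odds scores are taken.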
Visualization of PWM Patterns
• The pattern captured by an MSA or PWM may
be visualized using a sequence logo.
• Information Content (IC) of the nucleotide
PWM at position k is:
$IC(k) = 2 + \sum_{n \in \{A,C,G,T\}} p_{nk} \log_2 p_{nk} = 2 - \mathrm{Entropy}(k)$
where pnk is the probability of n at position k,
assuming equal background probability for A, C, G and T (1/4).
$IC(k) \in [0, 2]$
Information Content (IC)
• IC is a measure of a site’s tolerance for
substitution: high IC, low tolerance.
• If pA1 = 1, pC1 = 0, pG1 = 0, pT1 = 0,
$IC(1) = 2 + \sum_{n \in \{A,C,G,T\}} p_{n1} \log_2 p_{n1} = 2 + 0 = 2$
• If pA4 = ¼, pC4 = ¼, pG4 = ¼, pT4 = ¼,
$IC(4) = 2 + 4 \times \frac{1}{4} \times \log_2(\frac{1}{4}) = 2 - 2 = 0$
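The two worked cases can be checked with a few lines of code. This is a minimal sketch assuming a uniform background (hence the constant 2); terms with p = 0 are skipped because p log2 p tends to 0. The function name `information_content` is hypothetical.

```python
import math

def information_content(column_probs):
    """IC(k) = 2 + sum_n p_nk * log2(p_nk), in bits, for one PWM column."""
    return 2.0 + sum(p * math.log2(p) for p in column_probs if p > 0)

# Columns of the example PWM, in the order (A, C, G, T)
columns = {
    1: (1.0, 0.0, 0.0, 0.0),
    2: (0.25, 0.0, 0.75, 0.0),
    3: (0.0, 0.125, 0.125, 0.75),
    4: (0.25, 0.25, 0.25, 0.25),
    5: (0.0, 0.5, 0.25, 0.25),
}
for k, probs in columns.items():
    print(k, round(information_content(probs), 3))
```

The per-column IC values are what a sequence logo draws as the total stack height at each position.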
Pattern Matching with PWM
• Given a Position Weight Matrix (PWM) of a
pattern, find all the occurrences of the
pattern on the input sequence.
• Sliding window analysis:
[Figure: a window sliding along the sequence is matched with the PWM]
• How to score a match?
$\mathrm{Score} = \prod_{k=1}^{W} \frac{p_{ck}}{q_c}$
(Often the log-odds score is used)
pck is the PWM entry at position k
and corresponding to character c
of the sequence, and qc is the
background probability of c.
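A sliding-window scan with log-odds scores might look like the sketch below. The `floor` parameter, which replaces zero PWM entries before the logarithm is taken, is an assumption made here to keep scores finite; the function name `scan` is hypothetical, and the PWM is the one from the earlier slide. Sequences are assumed to contain only A, C, G and T.

```python
import math

# PWM from the lecture example: {nucleotide: probabilities at positions 1..5}
LECTURE_PWM = {
    "A": [1.0, 0.25, 0.0, 0.25, 0.0],
    "C": [0.0, 0.0, 0.125, 0.25, 0.5],
    "G": [0.0, 0.75, 0.125, 0.25, 0.25],
    "T": [0.0, 0.0, 0.75, 0.25, 0.25],
}

def scan(sequence, pwm, background=None, floor=0.01):
    """Return (1-based start, window, log2-odds score) for every window."""
    background = background or {n: 0.25 for n in "ACGT"}
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum of log2(p_ck / q_c) equals the log of the product score
        score = sum(math.log2(max(pwm[c][k], floor) / background[c])
                    for k, c in enumerate(window))
        hits.append((start + 1, window, score))
    return hits

hits = scan("TTAGTCCAA", LECTURE_PWM)
best = max(hits, key=lambda h: h[2])
print(best[0], best[1], round(best[2], 2))  # 3 AGTCC 6.17
```

In practice a score threshold is applied to the hits; choosing it is the sensitivity versus specificity trade-off mentioned earlier.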
Resources for Promoter Analysis
• TransFac (http://www.gene-regulation.com/):
– A database on eukaryotic transcription factors (TF)
and their DNA binding sites (PWMs).
– Provides TF classification and search options.
• TESS (Transcription Element Search System at
http://www.cbil.upenn.edu/cgi-bin/tess/tess?RQ=WELCOME):
– A web tool for predicting TF binding sites.
– Uses PWMs from TransFac and others.
• SCPD (http://cgsigma.cshl.org/jian/):
– The promoter database of Saccharomyces cerevisiae.
– Tools for site prediction and promoter retrieval.
Pattern Discovery Using PWM
• The Problem:
– Given a set of unaligned sequences, discover a
PWM pattern shared by the sequences.
– The pattern locations on the sequences are also
unknown in advance.
• Two sets of parameters to estimate (or learn):
– PWM of a potential pattern.
– Pattern offset matrix.
• Algorithmic approaches:
– Expectation Maximization.
– Gibbs sampling.
[Figure: a motif shared by a set of sequences]
Pattern Offset Matrix
• The element Zij of the pattern offset matrix Z is
the probability that the pattern (given in p)
starts at position j of sequence i (Xi):
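$Z_{ij} = \Pr(Z_{ij} = 1 \mid X_i, p)$

Example pattern offset matrix Z (rows: sequences; columns: start position j):
       1     2     3     4     5
X1     0.1   0.1   0.5   0.2   0.1
X2     0.4   0.2   0.1   0.1   0.2
X3     0.1   0.6   0.1   0.1   0.1
X4     0.2   0.1   0.2   0.3   0.2

• The probability of a sequence Xi with the pattern starting at j is:
$\Pr(X_i \mid Z_{ij} = 1, p) = \prod_{k=1}^{j-1} p_{n_k,0} \; \prod_{k=j}^{j+W-1} p_{n_k,k-j+1} \; \prod_{k=j+W}^{L} p_{n_k,0}$
The three products cover the positions before the motif, within the motif, and after the motif.
Here n_k is the character at position k of Xi, and L is the length of Xi.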
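The probability above, and one row of Z derived from it, can be sketched as follows. Normalizing the per-start probabilities so that each row sums to 1 assumes exactly one motif occurrence per sequence and a uniform prior over start positions; that assumption, and the names `prob_given_start` and `offset_row`, are not taken from the slides.

```python
def prob_given_start(seq, j, pwm, bg):
    """Pr(X_i | Z_ij = 1, p) for a 1-based motif start position j."""
    width = len(pwm["A"])
    prob = 1.0
    for k, c in enumerate(seq, start=1):
        if j <= k < j + width:
            prob *= pwm[c][k - j]  # motif column k - j + 1 (0-based index k - j)
        else:
            prob *= bg[c]          # background column 0
    return prob

def offset_row(seq, pwm, bg):
    """Row i of Z: per-start probabilities normalized to sum to 1."""
    width = len(pwm["A"])
    likes = [prob_given_start(seq, j, pwm, bg)
             for j in range(1, len(seq) - width + 2)]
    total = sum(likes)  # assumes at least one start has non-zero probability
    return [x / total for x in likes]

# Hypothetical two-column motif "AC" and uniform background
toy_pwm = {"A": [1.0, 0.0], "C": [0.0, 1.0], "G": [0.0, 0.0], "T": [0.0, 0.0]}
uniform = {n: 0.25 for n in "ACGT"}
print(offset_row("GACG", toy_pwm, uniform))  # [0.0, 1.0, 0.0]
```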
Expectation Maximization (EM)
Given: length W, sequence dataset
set initial values for p
do {
    re-estimate Z from p    (E-step)
    re-estimate p from Z    (M-step)
} until (change in p < ε)
return p, Z
[Figure: the E-step and M-step alternate, updating Z from p and p from Z]
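The loop above might be realized as in the following sketch. It is one possible reading of the pseudocode, not a reference implementation: it assumes one motif occurrence per sequence, a fixed uniform background, a pseudocount in the M-step, and a starting PWM derived from a subsequence. The names `init_from_subsequence` and `em_motif`, the parameter defaults, and the toy sequences are all hypothetical.

```python
def init_from_subsequence(sub, weight=0.5):
    """Starting PWM: `weight` for the observed letter, the rest shared equally."""
    rest = (1.0 - weight) / 3
    return {n: [weight if c == n else rest for c in sub] for n in "ACGT"}

def em_motif(seqs, width, init_pwm, pseudocount=0.1, tol=1e-4, max_iter=200):
    bg = {n: 0.25 for n in "ACGT"}  # fixed uniform background (assumption)
    p, Z = init_pwm, []
    for _ in range(max_iter):
        # E-step: re-estimate Z from p. The likelihood ratio within the window
        # is used; the background factor shared by all starts cancels out.
        Z = []
        for s in seqs:
            likes = []
            for j in range(len(s) - width + 1):
                like = 1.0
                for k in range(width):
                    like *= p[s[j + k]][k] / bg[s[j + k]]
                likes.append(like)
            total = sum(likes)
            Z.append([x / total for x in likes])
        # M-step: re-estimate p from Z using expected letter counts
        counts = {n: [pseudocount] * width for n in "ACGT"}
        for s, z in zip(seqs, Z):
            for j, w in enumerate(z):
                for k in range(width):
                    counts[s[j + k]][k] += w
        new_p = {n: [0.0] * width for n in "ACGT"}
        for k in range(width):
            col_total = sum(counts[n][k] for n in "ACGT")
            for n in "ACGT":
                new_p[n][k] = counts[n][k] / col_total
        change = max(abs(new_p[n][k] - p[n][k])
                     for n in "ACGT" for k in range(width))
        p = new_p
        if change < tol:
            break
    return p, Z

# Hypothetical toy dataset with the motif AGTCC planted in each sequence
toy_seqs = ["GGAGTCCTTT", "TTTTAGTCCG", "AGTCCGTGTG", "TGTGGAGTCC"]
p, Z = em_motif(toy_seqs, 5, init_from_subsequence("AGTCC"))
starts = [max(range(len(z)), key=z.__getitem__) + 1 for z in Z]
print(starts)  # most probable 1-based start in each sequence
```

The returned Z is the one computed in the last E-step, so it corresponds to the PWM just before the final M-step update.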
More about the EM Algorithm
• EM is a heuristic algorithm for discovering
PWM motifs shared by a set of sequences.
• EM converges to a local maximum in the
likelihood of the data given the model p:
$\prod_i \Pr(X_i \mid p)$
• EM usually converges in a small number of
iterations.
• EM is sensitive to the starting point (i.e.,
the initial values in p).
MEME
• MEME (Multiple EM for Motif Elicitation) is
widely used for motif discovery.
• MEME is based on the EM algorithm with
several extensions.
• MEME is available at
http://meme.sdsc.edu/meme/website/meme.html.
• The dataset contains 30 yeast promoters from
a co-regulated gene cluster. These genes are
mostly involved in respiration, and are co-regulated under various stress conditions.
• What is the TF binding site in the shared motif?
The MEME Algorithm
MEME (dataset, W, NSITES, PASSES) {
    for i = 1 to PASSES {
        for each subsequence in dataset {
            run EM for 1 iteration with starting point
            derived from this subsequence
        }
        choose a motif model with the highest likelihood
        run EM to convergence from starting point
        which generated that model
        print converged model of the shared motif
        erase appearances of the motif from the dataset
    }
}
MEME Enhancements to the
Basic EM Approach
• Trying many starting points by using
every distinct subsequence of length W
in the dataset.
• Not assuming that there is exactly one
motif occurrence in every sequence.
• Allowing multiple motifs to be learned.
Gibbs Sampling
• For motif discovery, Gibbs sampling can be
viewed as a stochastic analog of EM:
– In the EM algorithm, we maintained a
distribution Zi over the possible motif starting
positions for each sequence;
– In the Gibbs sampling approach, we maintain
a specific starting position for each sequence,
but keep re-sampling the starting positions.
• Gibbs sampling may be less susceptible to
local optima than EM.
A Gibbs Sampling Algorithm
Given: length W, sequence dataset
choose random motif positions for a
do {
    pick a sequence Xi
    estimate p using motif positions in a
    (all sequences but Xi)                  (update step)
    sample a new motif position ai for Xi   (sampling step)
} until (change in p < ε)
return p, a
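A compact version of this sampler is sketched below. It departs from the pseudocode in one labeled respect: it runs for a fixed number of iterations instead of testing the change in p. The uniform background, the pseudocount, the seeded random generator and the name `gibbs_motif` are likewise assumptions for illustration; this is not the Gibbs Motif Sampler or AlignACE implementation.

```python
import random

def gibbs_motif(seqs, width, iterations=500, pseudocount=0.5, seed=0):
    rng = random.Random(seed)
    bg = {n: 0.25 for n in "ACGT"}  # fixed uniform background (assumption)
    # Choose random 0-based motif start positions a
    a = [rng.randrange(len(s) - width + 1) for s in seqs]

    def estimate(exclude=None):
        """Estimate p from the current positions, leaving out one sequence."""
        counts = {n: [pseudocount] * width for n in "ACGT"}
        used = 0
        for i, (s, start) in enumerate(zip(seqs, a)):
            if i == exclude:
                continue
            used += 1
            for k in range(width):
                counts[s[start + k]][k] += 1
        total = used + 4 * pseudocount
        return {n: [c / total for c in counts[n]] for n in "ACGT"}

    for _ in range(iterations):
        i = rng.randrange(len(seqs))  # pick a sequence Xi
        p = estimate(exclude=i)       # update step
        s = seqs[i]
        weights = []
        for j in range(len(s) - width + 1):
            w = 1.0
            for k in range(width):
                w *= p[s[j + k]][k] / bg[s[j + k]]
            weights.append(w)
        # Sampling step: draw a new start for Xi in proportion to its weight
        a[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return estimate(), a

# Hypothetical toy dataset with the motif AGTCC planted in each sequence
toy_seqs = ["GGAGTCCTTT", "TTTTAGTCCG", "AGTCCGTGTG", "TGTGGAGTCC"]
p, a = gibbs_motif(toy_seqs, 5, iterations=200, seed=1)
print([start + 1 for start in a])  # sampled 1-based start positions
```

Because positions are sampled instead of chosen greedily, separate runs with different seeds can end in different places; running several and keeping the best-scoring result is a common precaution.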
Gibbs Motif Sampler and AlignACE
• Gibbs Motif Sampler:
– Based on the work by Lawrence, et al. 1993.
Science, 262:208-214.
– Available at
http://bayesweb.wadsworth.org/gibbs/gibbs.html.
• AlignACE:
– Based on the Gibbs sampling algorithm with
several extensions.
– Available at http://atlas.med.harvard.edu/.
Summary
• For simple sequence patterns, regular
expression is a useful tool.
• For some complex sequence patterns,
position weight matrix (PWM) is preferred.
• Expectation Maximization (EM) and Gibbs
sampling are two useful approaches for
sequence pattern discovery.
• Next: protein domain analysis using HMM
Reading
• (Optional) Lawrence et al., 1993. Detecting
subtle sequence signals: a Gibbs sampling
strategy for multiple alignment. Science,
262:208-214.
• Eddy, 2004. What is a hidden Markov model?
Nature Biotechnology, 22:1315-1316.
• Eddy, 1998. Multiple alignment and multiple
sequence based searches. Trends Guide to
Bioinformatics, 15-18.
For This Week’s Lab
• Collect a set of promoter sequences (10-500
sequences in FASTA format) from co-regulated
or related genes. The promoter sequences
should be the 500-1500 nucleotides upstream
of the transcription start sites.
• Collect a set of protein sequences (10-50
sequences in FASTA format) from a gene
family or superfamily.