Week 3
Topics for this lecture:
•  Bayesian approaches to deciding which model gave rise to 2 sequences
•  A frequentist approach to assigning alignment significance.
The big idea for this lecture:
The significance of an
alignment is how
probable it would be in
random junk.
Last week
The major topic of last week:
•  The score of an alignment is the log
of the odds ratio between a simplified
evolutionary history and a background
model.
•  The more positive the score, the more
likely the history.
•  We can optimize affine scores with a quadratic-time algorithm (sketched below).
•  Affine scores correspond to
independent indel events of geometric
lengths.
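
As a refresher, here's a minimal score-only Python sketch of that quadratic-time DP, in the style of Gotoh's three-matrix algorithm. The scoring parameters below are illustrative, not the course's:

NEG_INF = float("-inf")

def affine_align_score(s, t, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Global alignment score with affine gap penalties, O(n*m) time.

    M[i][j]: best score ending with s[i-1] aligned to t[j-1].
    X[i][j]: best score ending with s[i-1] aligned to a gap.
    Y[i][j]: best score ending with t[j-1] aligned to a gap.
    (Adjacent gaps in opposite sequences are disallowed, as usual.)"""
    n, m = len(s), len(t)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):              # leading gap in t
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):              # leading gap in s
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = sub + max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
            X[i][j] = max(M[i-1][j] + gap_open, X[i-1][j] + gap_extend)
            Y[i][j] = max(M[i][j-1] + gap_open, Y[i][j-1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

print(affine_align_score("ACGTTACG", "ACGTACG"))   # 2.0: seven matches, one opened gap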
This is actually profound
There are major consequences of this:
•  If the model of related sequences is
not appropriate for your sequences,
you won’t get decent alignments
–  This is relevant, for example, for membrane proteins: the alignments that give rise to the BLOSUM matrices come mostly from globular proteins, so those matrices aren't a good fit for membrane proteins.
•  Different scoring matrices correspond
to different amounts of mutation
being allowed
•  You might want to use a better model (columns not independent, non-geometrically distributed gap lengths, …)
•  Humans don’t make the matrices; they're estimated from data.
How does this help?
Which model gave rise to a particular pair of sequences (are they homologous?)
Recall: the alignment score is exactly the log odds, so a score of 150 corresponds to an odds ratio of 2^150 ≈ 10^45, strongly favouring the alignment's being from the related-sequences model.
Or does it?
What if we know the sequences aren’t
related? Then it doesn’t matter how
good the alignment is, they’re still not
related.
(Same thing if we know they are related)
Conditional probability
This is the first time we have to use
Bayes’ rule in this course, but there
will be many.
You did learn it in STAT 230. I know.
Conditional probability: Probability of
an event, given another event:
Pr [A|B] = Pr [A,B]/Pr [B].
(Easier with a picture…)
Probability of seeing sequences S and T
given model H: Pr [S,T|H] =
Pr [S,T,H]/Pr [H].
Probability of seeing sequences S and T given model R: Pr [S,T|R] = Pr [S,T,R]/Pr [R].
That’s not what we want, though.
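
A toy numeric check of the definition, with a die instead of sequences:

from fractions import Fraction

# Pr[A|B] = Pr[A,B] / Pr[B], with A = "die shows 6", B = "die shows even".
pr = Fraction(1, 6)                                              # fair six-sided die
outcomes = [1, 2, 3, 4, 5, 6]
pr_a_and_b = sum(pr for o in outcomes if o == 6 and o % 2 == 0)  # = 1/6
pr_b = sum(pr for o in outcomes if o % 2 == 0)                   # = 1/2
print(pr_a_and_b / pr_b)                                         # 1/3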
Bayes’ rule
We want to know:
Pr [Model H | Sequences S,T].
Bayes’ rule:
Pr [A|B] = Pr [B|A] Pr [A] / Pr [B]
(There are lots of ways to try to
remember this, but I always do it with
Venn diagrams…)
So: Pr [Model H | Sequences S, T] =
Pr [S,T|H] Pr [H] / Pr [S, T]
We need to do lots of things to fill in the
details of this equation. But this is the
best way we can explain how we got
these sequences: was it from the
homology model, H, or the random
model, R?
Using Bayes’ rule
Pr [Model H | Sequences S, T] =
Pr [S,T|H] Pr [H] / Pr [S, T]
First off, what’s Pr [S, T|H]?
What’s the probability of seeing S and T
when we pick two related sequences?
The probability of the best explanation of S and T given H is the probability we got from the optimal alignment! Let’s assume that’s much bigger than that of any other explanation. (FALSE IN PRACTICE)
What about Pr [S,T]? There are two ways to get S and T: either they come from H or from R.
Pr [S, T] = Pr [S, T|H] Pr [H] +
Pr [S, T|R] Pr [R].
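
A small Python sketch of this computation, with made-up likelihood numbers just for illustration:

def posterior_h(lik_h, lik_r, prior_h):
    """Pr[H | S,T] via Bayes' rule, with Pr[S,T] expanded over the two models.

    lik_h = Pr[S,T|H], lik_r = Pr[S,T|R], prior_h = Pr[H]."""
    evidence = lik_h * prior_h + lik_r * (1 - prior_h)   # Pr[S,T]
    return lik_h * prior_h / evidence

# Hypothetical numbers: H explains the data 1000x better than R,
# but we only gave homology a 1% prior.
print(posterior_h(lik_h=1e-20, lik_r=1e-23, prior_h=0.01))   # ≈ 0.91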
Finishing off Bayes’ rule
What about Pr [H]?
What’s the probability two sequences are
related? That depends!
To use Bayes’ rule, you need to have a
guess of Pr [H].
This is called the prior. We’ll call it π.
With that, though, we’re set.
Pr [H | Sequences S,T] =
Pr [S,T|H] Pr [H] / Pr [S,T] =
Pr [S,T|H] π / (Pr [S,T|H] π + Pr [S,T|R] (1 − π))
Now, remember:
log (Pr [S,T|H] / Pr [S,T|R]) is at least the score of the best alignment for S and T.
(Pr [S,T|H] sums over every alignment, so it is at least the best alignment's contribution.)
What next?
Pr [H | Sequences S,T] =
Pr [S,T|H] Pr [H] / Pr [S,T] =
Pr [S,T|H] π / (Pr [S,T|H] π + Pr [S,T|R] (1 − π))
= (Pr [S,T|H]/Pr [S,T|R]) (π/(1 − π)) /
  [(Pr [S,T|H]/Pr [S,T|R]) (π/(1 − π)) + 1]
(Last change: divide numerator and denominator by (1 − π) Pr [S,T|R].)
The alignment odds ratio is 2^score.
The other term in the numerator is π/(1 − π), which was the prior odds ratio; call that r.
So Pr [H | Sequences S,T] ≈ 2^score · r / (2^score · r + 1)
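
As code, the final formula is a one-liner; this sketch rewrites it so that large bit scores can't overflow:

def posterior_from_score(score_bits, prior):
    """Pr[H | S,T] ≈ 2^score * r / (2^score * r + 1), with r = prior/(1 - prior).

    Rewritten as 1 / (1 + 2^(-score)/r) so that huge scores can't overflow."""
    r = prior / (1.0 - prior)            # prior odds ratio
    return 1.0 / (1.0 + 2.0 ** (-score_bits) / r)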
But here’s a taste
Suppose we really think there’s a 1% chance that sequences we look at are actually related.
The alignment scores +30 bits (log base 2).
What’s Pr [H|S,T]?
Pr [H|S,T] ≈ 2^score · r / (2^score · r + 1) = (2^30/100) / (2^30/100 + 1).
That’s extremely close to 1.
If we’d initially thought that Pr [H] was around 2^−30, then the prior and the evidence would balance out, giving Pr [H|S,T] ≈ 1/2.
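
Checking the slide's numbers directly (self-contained, using the same formula):

# Score of +30 bits against a 1% prior:
r = 0.01 / 0.99                           # prior odds, roughly 1/100
p = 2.0**30 * r / (2.0**30 * r + 1)
print(p)                                  # ≈ 1 - 1e-7: essentially certain

# Same score against a skeptical prior of about 2^-30:
r = 2.0**-30 / (1 - 2.0**-30)
print(2.0**30 * r / (2.0**30 * r + 1))    # ≈ 0.5: evidence and prior balance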
A quick bit of philosophy
We’re about to navigate one of the
nastiest traumas in science.
Why would we trust the prior???
That is, where on earth did π, our original guess of Pr [H], come from?
This is the biggest objection to Bayesian
thinking.
It’s not that Bayes’ Rule is wrong. It’s
that Bayesian thinking, where we start
with a prior, and then estimate the
new value of Pr [H] based on
something else, is all dependent on
our guess of the prior.
Frequentist approach
One approach: Bayesian methods.
•  The basic problem with all Bayesian
approaches: what’s the prior
probability of Model H or Model R?
A different approach: frequentist or
classical methods:
•  Suppose all alignments come from Model R.
What is the probability that, out of
many such alignments, one will have a
given score?
Note: This is hard!!
What’s the distribution?
What does the score of a random alignment from R look like?
Let’s assume that it’s ungapped, for ease.
Easiest case: ungapped global alignments
of length n.
Let’s use scoring scheme
+1 = match, -1 = mismatch.
What’s the expected score?
Well, for uniform random DNA, the probability of a match is .25 and of a mismatch is .75.
The expected score of n columns is −n/2.
The std. deviation is sqrt(3n/4).
The number of matches is binomial, so the score is a shifted binomial around that.
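
A quick simulation to sanity-check these moments, assuming uniform random DNA:

import random
import statistics

def random_ungapped_score(n):
    """+1/-1 score of an ungapped length-n alignment of two uniform random DNA sequences."""
    return sum(1 if random.choice("ACGT") == random.choice("ACGT") else -1
               for _ in range(n))

n, trials = 300, 10_000
scores = [random_ungapped_score(n) for _ in range(trials)]
print(statistics.mean(scores))        # near -n/2 = -150
print(statistics.stdev(scores))       # near sqrt(3n/4) = 15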
Central limit theorem, etc.
Recall from 230/231 [I don’t know
which]:
If n is big, the distribution is essentially
normal.
•  Adding n iid variables  normal
distribution
(Really, it’s a bell-shaped curve.)
So an alignment from Model R has
expected score –n/2, with std. dev.
sqrt(3n/4).
OK, so now I see an alignment of length
300 with score +120. Is that weird?
How weird?
Philosophical commentary
Remember what we’re saying here:
•  Suppose I have an alignment of length
300 with score +120. What’s the
chance it would just happen in model
R (the random sequences)?
Mean would be -150.
Standard deviation would be
sqrt (.75*300) = 15.
So this alignment is 18 standard deviations above the mean, which is really, really, really uncommon (about 1 in 10^73).
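
We can check that directly; math.erfc has enough dynamic range to evaluate an 18-standard-deviation tail:

import math

mean = -150.0                             # Model-R mean for n = 300
sd = math.sqrt(0.75 * 300)                # = 15
z = (120 - mean) / sd                     # 18 standard deviations
tail = math.erfc(z / math.sqrt(2)) / 2    # Pr[Z > z] for Z ~ N(0,1)
print(z, tail)                            # 18.0, ≈ 9.7e-73 (the "1 in 10^73")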
But what if we have a pile of alignments,
and this is the best? Is +120 still
unusual?
Many alignments
Now we have a new problem:
•  Given k alignments of length n from
Model R.
•  What’s the probability that the best of
them has score at least S?
Suppose we know the cumulative
distribution function for alignments.
(Remember:
CDF (x) = Pr [alignment score < x].
Also, we can approximate it, because
we do know the CDF for normal
distributions)
So Pr [All alignments score < S] = [CDF(S)]^k.
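
A sketch of that calculation; since the per-alignment probability is tiny, we compute 1 − [CDF(S)]^k through logs rather than directly:

import math

def pr_best_exceeds(p_one, k):
    """Pr[best of k iid alignments scores >= S], where p_one = Pr[one scores >= S].

    Computes 1 - (1 - p_one)^k via logs so a tiny p_one doesn't round to zero;
    for very rare events this is approximately k * p_one."""
    return -math.expm1(k * math.log1p(-p_one))

print(pr_best_exceeds(p_one=1e-73, k=1e10))    # ≈ 1e-63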
I’m going to simplify this
To save time, I’m not going to do this in full gory detail.
With a lot of pain…
One can show that if there are k random choices from the N(0,1) distribution, the highest value approximately follows a special distribution called the Gumbel distribution:
mean ≈ sqrt(2 ln k),
standard deviation ≈ 1/sqrt(2 ln k).
Then subtract the mean and divide by
the standard deviation, and use that to
find out how surprising a given
alignment is.
So what?
Well, we know that our alignment of
score +120 was 18 standard
deviations away from the mean.
So we can just as easily think of it as
being an N(0,1) variable with score
+18, instead. (That just makes life
easier…)
Suppose we’ve seen 10^10 alignments.
Then the expected max is
sqrt(2 ln 10^10 − ln ln 10^10 − ln 4π),
which is 6.4. (The extra terms are because the sqrt(2 ln k) is approximate.)
So we’d expect to see something as
much as 6.4 standard deviations away
from the mean, but our alignment was
fully 18 of them. How odd is that?
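
Evaluating both approximations for k = 10^10:

import math

k = 1e10
lk = math.log(k)                          # ln k ≈ 23.0
print(math.sqrt(2 * lk))                  # crude expected max ≈ 6.8
print(math.sqrt(2 * lk - math.log(lk) - math.log(4 * math.pi)))  # refined ≈ 6.4
print(1 / math.sqrt(2 * lk))              # Gumbel spread ≈ 0.15 std. dev.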
End of the calculation
We’re at 18 standard deviations.
Pr [X_k < 18] ≈ 1 − 10^−63, where X_k is the max of the k = 10^10 scores.
That probability is extremely close to 1. Equivalently, the probability of an alignment as good as ours appearing among 10^10 random alignments is roughly 10^−63; this compares with the 10^−73 probability from before.
(Note: this is not perfect, but an
approximation of reality…)
For very rare events, you get pretty close by just multiplying the number of alignments by the probability for one alignment: here, 10^10 × 10^−73 = 10^−63.
E- and P- values
After much more grief, this gives rise to the BLASTP E-value and P-value scores.
E-value: the expected number of local
alignments that score above a
threshold.
P-value: the probability that at least one alignment above the threshold would have been found.
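
Under the standard assumption that the number of above-threshold hits is roughly Poisson, the two are related by P = 1 − e^−E; a small sketch of that conversion:

import math

def p_from_e(e_value):
    """P-value from E-value, assuming above-threshold hit counts are Poisson: P = 1 - e^(-E)."""
    return -math.expm1(-e_value)

print(p_from_e(10.0))      # ≈ 1.0: we expect plenty of junk hits this good
print(p_from_e(1e-5))      # ≈ 1e-5: for rare hits, E-value ≈ P-value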
The significance of an
alignment is how
probable it would be in
random junk.
And this means that by looking at P-values, we’re fetishizing the model.
Assessment…
So we have two ways to assess the
quality of an alignment:
Bayesian: given a prior probability of Model H, what do we think that probability is after we see the alignment of S and T?
Frequentist: How weird is the alignment
of S and T, given that I might have
seen billions of junk alignments?
The second tends to be much more
popular, because we don’t need a
prior.