Week 3

CS 482/682, Spring 2014, Week 3

Topics for this lecture:
• Bayesian approaches to deciding which model gave rise to two sequences.
• The frequentist approach to assigning alignment significance.

The big idea for this lecture: the significance of an alignment is how probable it would be in random junk.

Last week

The major topics of last week:
• The score of an alignment is the log of the odds ratio between a simplified evolutionary history and a background model.
• The more positive the score, the more likely the history.
• We can optimize affine gap scores using a quadratic-time algorithm.
• Affine scores correspond to independent indel events with geometrically distributed lengths.

This is actually profound

There are major consequences of this:
• If the model of related sequences is not appropriate for your sequences, you won’t get decent alignments.
  – This is relevant, for example, for membrane proteins. The alignments that give rise to the BLOSUM matrix come mostly from globular proteins, so it’s not helpful for such proteins.
• Different scoring matrices correspond to different amounts of mutation being allowed.
• You might want to use a better model (columns not independent, non-geometrically distributed gap lengths, …).
• Humans don’t make the matrices.

How does this help?

Which model gave rise to a particular sequence alignment (are the sequences homologous)?

Recall: the alignment score is exactly the log odds, so a score of 150 (in bits) gives an odds ratio of 2^150 ≈ 10^45, strongly favouring the alignment’s being from the homology model.

Or does it? What if we know the sequences aren’t related? Then it doesn’t matter how good the alignment is; they’re still not related. (Same thing if we know they are related.)

Conditional probability

This is the first time we have to use Bayes’ rule in this course, but there will be many. You did learn it in STAT 230. I know.
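Before moving on to Bayes’ rule, the score-to-odds conversion above can be sketched in Python (a minimal sketch; the 150-bit score is just the slide’s example):

```python
import math

def odds_ratio_from_score(score_bits):
    # The alignment score is the log2 of the odds ratio
    # Pr[S,T | H] / Pr[S,T | R], so invert the log to recover the odds.
    return 2.0 ** score_bits

# The slide's example: a score of 150 bits.
ratio = odds_ratio_from_score(150)
print("odds ratio = 2^150 = 10^%.1f" % math.log10(ratio))  # 10^45.2
```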
Conditional probability: the probability of an event, given another event:
Pr[A|B] = Pr[A,B] / Pr[B]. (Easier with a picture…)

Probability of seeing sequences S and T given model H: Pr[S,T|H] = Pr[S,T,H] / Pr[H].
Probability of seeing sequences S and T given model R: Pr[S,T|R] = Pr[S,T,R] / Pr[R].

That’s not what we want, though.

Bayes’ rule

We want to know: Pr[Model H | Sequences S,T].

Bayes’ rule: Pr[A|B] = Pr[B|A] Pr[A] / Pr[B]. (There are lots of ways to try to remember this, but I always do it with Venn diagrams…)

So: Pr[Model H | Sequences S,T] = Pr[S,T|H] Pr[H] / Pr[S,T].

We need to do lots of things to fill in the details of this equation. But this is the best way we can explain how we got these sequences: was it from the homology model, H, or the random model, R?

Using Bayes’ rule

Pr[Model H | Sequences S,T] = Pr[S,T|H] Pr[H] / Pr[S,T]

First off, what’s Pr[S,T|H]? What’s the probability of seeing S and T when we pick two related sequences? The probability of the best explanation of S and T given H is the probability we got from the optimal alignment! Let’s assume that’s much more than any other explanation. (FALSE IN PRACTICE.)

What about Pr[S,T]? There are two ways to get S and T: either from H or from R.
Pr[S,T] = Pr[S,T|H] Pr[H] + Pr[S,T|R] Pr[R]

Finishing off Bayes’ rule

What about Pr[H]? What’s the probability that two sequences are related? That depends! To use Bayes’ rule, you need a guess of Pr[H]. This is called the prior; we’ll call it π.

With that, though, we’re set:
Pr[H | Sequences S,T] = Pr[S,T|H] Pr[H] / Pr[S,T]
                      = Pr[S,T|H] π / (Pr[S,T|H] π + Pr[S,T|R] (1−π))

Now, remember: log(Pr[S,T|H] / Pr[S,T|R]) is at least the score of the best alignment for S and T. (No better explanation under H exists.)

What next?
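Before working through the remaining algebra, the posterior expression above can be evaluated for concrete numbers. A minimal Python sketch (the score and prior here are made-up example values):

```python
def posterior_homology(score_bits, prior):
    # Bayes' rule, using Pr[S,T|H] / Pr[S,T|R] = 2^score:
    # Pr[H | S,T] = 2^score * r / (2^score * r + 1),
    # where r = prior / (1 - prior) is the prior odds ratio.
    r = prior / (1.0 - prior)
    odds = (2.0 ** score_bits) * r
    return odds / (odds + 1.0)

# Hypothetical example: a +30-bit alignment score with a 1% prior.
print(posterior_homology(30.0, 0.01))  # extremely close to 1
# A zero score leaves a 50/50 prior untouched:
print(posterior_homology(0.0, 0.5))    # 0.5
```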
Pr[H | Sequences S,T] = Pr[S,T|H] π / (Pr[S,T|H] π + Pr[S,T|R] (1−π))
                      = (Pr[S,T|H]/Pr[S,T|R]) · [π/(1−π)] / ((Pr[S,T|H]/Pr[S,T|R]) · [π/(1−π)] + 1)

(Last change: divide numerator and denominator by (1−π) Pr[S,T|R].)

The alignment odds ratio Pr[S,T|H]/Pr[S,T|R] is 2^score. The other term in the numerator, π/(1−π), is the prior odds ratio; call that r. So

Pr[H | Sequences S,T] ≈ 2^score · r / (2^score · r + 1)

But here’s a taste

Say we really think there’s a 1% chance that sequences we look at are actually related, and the alignment scores +30 bits (log base 2). What’s Pr[H|S,T]?

Pr[H|S,T] ≈ 2^score · r / (2^score · r + 1) ≈ (2^30/100) / (2^30/100 + 1).

That’s extremely close to 1. If we’d initially thought that Pr[H] was around 2^−30, then the prior and the score would roughly balance out.

A quick bit of philosophy

We’re about to navigate one of the nastiest traumas in science: why would we trust the prior??? That is, where on earth did π, our original guess of Pr[H], come from?

This is the biggest objection to Bayesian thinking. It’s not that Bayes’ rule is wrong. It’s that Bayesian thinking, where we start with a prior and then estimate the new value of Pr[H] based on something else, is entirely dependent on our guess of the prior.

Frequentist approach

One approach: Bayesian methods.
• The basic problem with all Bayesian approaches: what’s the prior probability of Model H or Model R?

A different approach: frequentist or classical methods.
• Suppose all alignments come from Model R. What is the probability that, out of many such alignments, one will have a given score?

Note: this is hard!!

What’s the distribution?

What does the score of a random alignment, from R, look like? Let’s assume it’s ungapped, for ease.

Easiest case: ungapped global alignments of length n. Let’s use the scoring scheme +1 for a match, −1 for a mismatch. What’s the expected score?
Well, the probability of a match is 0.25 and the probability of a mismatch is 0.75. The expected score over n columns is −n/2, and the standard deviation is sqrt(3n/4). The distribution is binomial around that.

Central limit theorem, etc.

Recall from STAT 230/231 [I don’t know which]: if n is big, the distribution is essentially normal.
• Adding n i.i.d. variables gives a normal distribution. (Really, it’s a bell-shaped curve.)

So an alignment from Model R has expected score −n/2, with standard deviation sqrt(3n/4).

OK, so now I see an alignment of length 300 with score +120. Is that weird? How weird?

Philosophical commentary

Remember what we’re saying here:
• Suppose I have an alignment of length 300 with score +120. What’s the chance it would just happen in Model R (the random sequences)?

The mean would be −150. The standard deviation would be sqrt(0.75 × 300) = 15. So this alignment is 18 standard deviations above the mean, which is really, really, really uncommon (about 1 in 10^73).

But what if we have a pile of alignments, and this is the best? Is +120 still unusual?

Many alignments

Now we have a new problem:
• Given k alignments of length n from Model R, what’s the probability that the best of them has score at least S?

Suppose we know the cumulative distribution function for alignment scores. (Remember: CDF(x) = Pr[alignment score < x]. Also, we can approximate it, because we do know the CDF for normal distributions.)

So Pr[all k alignments score < S] = [CDF(S)]^k.

I’m going to simplify this

To save time, I’m not going to do this in full gory detail. With a lot of pain, one can show that if there are k random choices from an N(0,1) distribution, the highest value follows a special distribution called a Gumbel distribution, with mean ≈ sqrt(2 ln k) and standard deviation ≈ 1/sqrt(2 ln k).

Then subtract the mean and divide by the standard deviation, and use that to find out how surprising a given alignment is.
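The expected-maximum recipe above is easy to compute directly. A short Python sketch of the approximation, including the correction terms beyond the leading sqrt(2 ln k):

```python
import math

def expected_max_of_k(k):
    # Approximate expected maximum of k i.i.d. N(0,1) draws:
    # sqrt(2 ln k - ln ln k - ln 4*pi).
    return math.sqrt(2 * math.log(k)
                     - math.log(math.log(k))
                     - math.log(4 * math.pi))

# For k = 10^10 random z-scores, the best one is expected to sit
# around 6.4 standard deviations above the mean.
print(round(expected_max_of_k(1e10), 1))  # 6.4
```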
So what?

Well, we know that our alignment of score +120 was 18 standard deviations above the mean. So we can just as easily think of it as an N(0,1) variable with value +18 instead. (That just makes life easier…)

Suppose we’ve seen 10^10 alignments. Then the expected max is sqrt(2 ln 10^10 − ln ln 10^10 − ln 4π), which is about 6.4. (The extra terms are there because sqrt(2 ln k) alone is approximate.)

So we’d expect to see something as much as 6.4 standard deviations away from the mean, but our alignment was fully 18 of them. How odd is that?

End of the calculation

We’re at 18 standard deviations. Pr[X_k < 18] ≈ 1 − 10^−63, which is extremely large. Basically, this tells us that the probability of seeing an alignment as good as ours among 10^10 random alignments is roughly 10^−63; compare this with the 10^−73 probability for a single alignment from before. (Note: this is not perfect, but an approximation of reality…)

For very rare events, you get pretty close by just multiplying the number of alignments by the probability for one alignment.

E- and P-values

After much more grief, this gives rise to the BLASTP E-value and P-value scores.

E-value: the expected number of local alignments that score above a threshold.
P-value: the probability that any alignment scoring above the threshold would have been found at all.

The significance of an alignment is how probable it would be in random junk. And this means that by looking at P-values, we’re fetishizing the model.

Assessment…

So we have two ways to assess the quality of an alignment:

Bayesian: given a prior probability of having picked model H, what do we think that probability is after we see the alignment of S and T?

Frequentist: how weird is the alignment of S and T, given that I might have seen billions of junk alignments?

The second tends to be much more popular, because we don’t need a prior.
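The frequentist assessment above boils down to a best-of-k tail probability. A Python sketch using the normal approximation (the z-score 18 and k = 10^10 come from the slides; everything else is a standard normal-tail computation):

```python
import math

def normal_sf(z):
    # Upper tail of N(0,1): Pr[Z >= z].
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def pvalue_best_of_k(z, k):
    # Pr[the best of k random z-scores is >= z]:
    # exact form 1 - CDF(z)^k, plus the slide's rare-event
    # shortcut k * (tail probability of one alignment).
    exact = 1.0 - (1.0 - normal_sf(z)) ** k
    approx = k * normal_sf(z)
    return exact, approx

# Moderate case: the two forms agree to within tens of percent.
exact, approx = pvalue_best_of_k(5.0, 10 ** 6)
print(exact, approx)

# The slides' case: 18 sigma among 10^10 alignments. Here the
# "exact" form underflows to 0 in double precision, but the
# product rule still gives a sensible order of magnitude,
# in the same ballpark as the slides' 10^-63.
_, tail = pvalue_best_of_k(18.0, 10 ** 10)
print(math.log10(tail))
```

This is why, as the slide says, for very rare events multiplying the number of alignments by the single-alignment probability is the practical calculation.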