Computational Statistics with Application to Bioinformatics
Prof. William H. Press
Spring Term, 2008
The University of Texas at Austin, CS 395T

Unit 1: Probability Theory and Bayesian Inference

Unit 1: Probability Theory and Bayesian Inference (Summary)

• Prove informally some basic theorems in probability
  – Additivity or "Law of Or-ing"
  – Law of Exhaustion
  – Multiplicative Rule or "Law of And-ing"; conditional probabilities
  – Definition of "independence"
  – Law of Total Probability or "Law of de-And-ing"; Humpty-Dumpty
  – Bayes Theorem
• Elementary probability examples
  – dice
  – fish in a barrel
• What comprises a "calculus of inference", and how to talk like a Bayesian?
  – calculating probabilities of hypotheses as if they were events
  – should there be an International Talk Like a Bayesian Day?
• Simple examples of Bayesian inference
  – Trolls-under-the-bridge
    • data changes probabilities
  – Monty Hall ("Let's Make a Deal") example
    • use relabeling symmetry to save some work
    • data changes probabilities, really!
    • posterior vs. prior
    • differing background information makes probabilities subjective
• Evidence is commutative/associative
  – get the same answers independent of the order in which data is seen
  – supports the notion of a "calculus of inference"
  – but requires EME hypotheses (give goodness-of-fit as a contrary example)
• Example: "the white shoe is a red herring"
  – the folk wisdom "an instance of a hypothesis adds support to it" is not always true
  – the meaning of data depends on the exact nature of the hypothesis space
  – Bayes says "data supports the hypothesis in which it is most likely"
  – evidence factors can be overcome by strong priors
• Example (Jailer's Tip) with marginalization over an unknown parameter
  – statistical model with an unknown parameter
  – non-informative priors, "insensitive to the prior"
• Estimate a posterior probability distribution from Bernoulli trials data
  – Bayesians use the exact data seen, not equivalent data not seen
  – Beta probability distribution
  – get the normalization by symbolic integration
  – find the mode, mean, and standard error of the Beta distribution
    • the mode is the "naïve" NB/N
    • the mean encodes a uniform prior ("pseudocounts")
  – the utility of choosing a conjugate prior

Laws of Probability

"There is this thing called probability. It obeys the laws of an axiomatic system. When identified with the real world, it gives (partial) information about the future."
• What axiomatic system?
• How do we identify it with the real world?
  – Bayesian and frequentist viewpoints are somewhat different "mappings" from axiomatic probability theory to the real world
  – yet both are useful

"And, it gives a consistent and complete calculus of inference."
• This is only a Bayesian viewpoint
  – it's sort of true and sort of not true, as we will see!
• R.T. Cox (1946) showed that reasonable assumptions about "degree of belief" uniquely imply the axioms of probability (and Bayes)
  – degrees of belief are ordered (transitive): A > B > C
  – they are "composable" (belief in AB depends only on A and B|A)
  – [what about, e.g., "fuzzy logic"? what assumption is violated?]

Axioms:
I. P(A) ≥ 0 for an event A
II. P(Ω) = 1, where Ω is the set of all possible outcomes
III. If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

Example of a theorem:
Theorem: P(∅) = 0
Proof: A ∩ ∅ = ∅, so P(A) = P(A ∪ ∅) = P(A) + P(∅), q.e.d.

Basically this is a theory of measure on Venn diagrams, so we can (informally) cheat and prove theorems by inspection of the appropriate diagrams, as we now do.

Additivity or "Law of Or-ing"

P(A ∪ B) = P(A) + P(B) − P(AB)

"Law of Exhaustion"

If the Ri are exhaustive and mutually exclusive (EME), then
Σi P(Ri) = 1

Multiplicative Rule or "Law of And-ing"

(same picture as before)

P(AB) = P(A)P(B|A) = P(B)P(A|B)

P(B|A) = P(AB)/P(A) is the "conditional probability" of B "given" A: renormalize the outcome space to A.

Similarly, for multiple And-ing:
P(ABC) = P(A)P(B|A)P(C|AB)

Independence: events A and B are independent if P(A|B) = P(A), so that
P(AB) = P(B)P(A|B) = P(A)P(B)

A symmetric die has P(1) = P(2) = ... = P(6) = 1/6. Why? Because Σi P(i) = 1 and P(i) = P(j). Not because of "frequency of occurrence in N trials". That comes later!

The sum of the faces of two dice (red and green) is > 8. What is the probability that the red face is 4?
P(R4 | >8) = P(R4 ∩ >8) / P(>8) = (2/36) / (10/36) = 0.2

Law of Total Probability or "Law of de-Anding"

The H's are exhaustive and mutually exclusive (EME):
P(B) = P(BH1) + P(BH2) + ... = Σi P(BHi) = Σi P(B|Hi)P(Hi)

"How to put Humpty-Dumpty back together again."

Example: A barrel has 3 minnows and 2 trout, with equal probability of being caught. Minnows must be thrown back. Trout we keep. What is the probability that the 2nd fish caught is a trout?

H1 ≡ 1st caught is a minnow, leaving 3 + 2
H2 ≡ 1st caught is a trout, leaving 3 + 1
B ≡ 2nd caught is a trout

P(B) = P(B|H1)P(H1) + P(B|H2)P(H2) = (2/5)·(3/5) + (1/4)·(2/5) = 0.34

Bayes Theorem (Thomas Bayes, 1702–1761)

(same picture as before)

P(Hi|B) = P(HiB) / P(B)                      [Law of And-ing]
        = P(B|Hi)P(Hi) / Σj P(B|Hj)P(Hj)     [Law of de-Anding]

We usually write this as
P(Hi|B) ∝ P(B|Hi)P(Hi)
which means, "compute the normalization by using the completeness of the Hi's."

• As a theorem relating probabilities, Bayes is unassailable.
• But we will also use it in inference, where the H's are hypotheses, while B is the data
  – "what is the probability of an hypothesis, given the data?"
  – some (defined as frequentists) consider this dodgy
  – others (Bayesians like us) consider this fantastically powerful and useful
  – in real life, the war between Bayesians and frequentists is long since over, and most statisticians adopt a mixture of techniques appropriate to the problem
• Note that you generally have to know a complete set of EME hypotheses to use Bayes for inference
  – perhaps its principal weakness
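Before moving on to inference, here is a quick numeric sanity check of the two elementary examples above (the dice and the fish in a barrel). This is a minimal sketch in base Matlab, not from the original slides:

  % Dice: P(red face = 4 | sum > 8), by enumerating all 36 equally likely outcomes
  [r, g] = meshgrid(1:6, 1:6);                         % red and green faces
  s = r + g;
  p_dice = sum(r(:) == 4 & s(:) > 8) / sum(s(:) > 8);  % = (2/36)/(10/36) = 0.2

  % Fish in a barrel: Law of Total Probability over the two EME hypotheses
  p_fish = (2/5)*(3/5) + (1/4)*(2/5);                  % P(B|H1)P(H1) + P(B|H2)P(H2) = 0.34

  fprintf('dice: %.3f   fish: %.3f\n', p_dice, p_fish)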
Let's work a couple of examples using Bayes Law.

Example: Trolls Under the Bridge

Trolls are bad. Gnomes are benign. Every bridge has 5 creatures under it:
  20% have TTGGG (H1)
  20% have TGGGG (H2)
  60% have GGGGG (benign) (H3)

Before crossing a bridge, a knight captures one of the 5 creatures at random. It is a troll. "I now have an 80% chance of crossing safely," he reasons, "since only the case [20% had TTGGG (H1) → now have TGGG] is still a threat."

So,
P(Hi|T) ∝ P(T|Hi)P(Hi)
P(H1|T) = (2/5 · 1/5) / (2/5 · 1/5 + 1/5 · 1/5 + 0 · 3/5) = 2/3

The knight's chance of crossing safely is actually only 33.3%. Before he captured a troll ("saw the data") it was 60%. Capturing a troll actually made things worse! [well… discuss] (80% was never the right answer!)

Data changes probabilities! Probabilities after assimilating data are called posterior probabilities.

Example: The Monty Hall or Let's Make a Deal Problem

• Three doors
• Car (prize) behind one door
• You pick a door, but don't open it yet
• Monty then opens one of the other doors, always revealing no car (he knows where it is)
• You now get to switch doors if you want
• Should you?
• Most people reason: the two remaining doors were equiprobable before, and nothing has changed. So it doesn't matter whether you switch or not.
• Marilyn vos Savant ("highest IQ person in the world") famously thought otherwise (Parade magazine, 1990)
• No one seems to care what Monty Hall thought!

Hi = car behind door i, i = 1, 2, 3
Wlog ("without loss of generality"), you pick door 2 (relabeling). Wlog, Monty opens door 3 (relabeling).

P(Hi|O3) ∝ P(O3|Hi)P(Hi)
P(H1|O3) ∝ 1 · 1/3 = 2/6
P(H2|O3) ∝ 1/2 · 1/3 = 1/6     (ignorance of Monty's preference between doors 1 and 3, so take 1/2)
P(H3|O3) ∝ 0 · 1/3 = 0

So you should always switch: it doubles your chances!

Exegesis on Monty Hall

• Very important example! Master it.
• P(Hi) = 1/3 is the "prior probability" or "prior"
• P(Hi|O3) is the "posterior probability" or "posterior"
• P(O3|Hi) is the "evidence factor" or "evidence"
• Bayes says posterior ∝ prior × evidence

Bayesian viewpoint: probabilities are modified by data. This makes them intrinsically subjective, because different observers have access to different amounts of data (including their "background information" or "background knowledge").

Commutativity/Associativity of Evidence

P(Hi|D1D2) is desired.
We see D1:       P(Hi|D1) ∝ P(D1|Hi)P(Hi)
Then we see D2:  P(Hi|D1D2) ∝ P(D2|HiD1)P(Hi|D1)     (this is now a prior!)
But,
P(D2|HiD1)P(D1|Hi)P(Hi) = P(D1D2|Hi)P(Hi)
and this being symmetrical shows that we would get the same answer regardless of the order of seeing the data.

All priors P(Hi) are actually P(Hi|D), conditioned on previously seen data! Often this is background information; we write it as P(Hi|I).
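The order-independence just derived is easy to check numerically. Below is a minimal base-Matlab sketch (not from the slides). It reuses the bridge priors and the troll-capture likelihoods from above; the second observation D2 and its likelihood values are purely hypothetical, invented just to have something to commute, and for simplicity D2 is assumed conditionally independent of D1 given each hypothesis:

  % Normalized discrete Bayes update over an EME hypothesis set: posterior ∝ likelihood × prior
  bayes = @(prior, like) (like .* prior) / sum(like .* prior);

  prior  = [0.2 0.2 0.6];    % P(H1), P(H2), P(H3) for the bridge example
  likeT  = [2/5 1/5 0];      % P(D1|Hi), D1 = "captured creature is a troll"
  likeD2 = [0.7 0.3 0.5];    % P(D2|Hi) for a hypothetical second observation

  postA = bayes(bayes(prior, likeT), likeD2);   % see D1, then D2
  postB = bayes(bayes(prior, likeD2), likeT);   % see D2, then D1
  postC = bayes(prior, likeT .* likeD2);        % see both at once
  disp([postA; postB; postC])                   % all three rows are identical

The first update alone, bayes(prior, likeT), reproduces the knight's posterior [2/3, 1/3, 0] from the troll example.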
Bayes Law is a "calculus of inference", often better (and certainly more self-consistent) than folk wisdom.

Example: Hempel's Paradox

Folk wisdom: a case of a hypothesis adds support to that hypothesis. For example, "All crows are black" is supported by each new observation of a black crow.

But "all crows are black" ⇔ "all non-black things are non-crows", and the latter is supported by the observation of a white shoe. So the observation of a white shoe is thus evidence that all crows are black!

I.J. Good: "The White Shoe is a Red Herring" (1966)

We observe one bird, and it is a black crow (the data D).
a) Which world are we in?
b) Are all crows black?

(Figure on the slide: two hypothetical worlds. In H1 all crows are black; in H2 some crows are white. The chance that a randomly observed bird is a black crow is P(D|H1) = 0.0001 and P(D|H2) = 0.001.)

P(H1|D) / P(H2|D) = [P(D|H1)P(H1)] / [P(D|H2)P(H2)] = (0.0001 / 0.001) · P(H1)/P(H2) = 0.1 · P(H1)/P(H2)

So the observation strongly supports H2 and the existence of white crows. Hempel's folk-wisdom premise is not true. Data supports the hypotheses in which it is more likely, compared with other hypotheses. We must have some kind of background information about the universe of hypotheses, otherwise data has no meaning at all.

Example: Jailer's Tip (our first "estimation of parameters")

• Of 3 prisoners (A, B, C), 2 will be released tomorrow.
• A asks the jailer for the name of one of the lucky – but not himself.
• The jailer says, truthfully, "B".
• "Darn," thinks A, "now my chances are only 1/2, C or me."
• Is this like Monty Hall? Did the data ("B") change the probabilities?

Suppose (unlike Monty Hall) the jailer is not indifferent about responding "B" versus "C" when both are to be released:
P(SB|BC) = x, (0 ≤ x ≤ 1)        (SB ≡ jailer "says B")

P(A|SB) = P(AB|SB) + P(AC|SB)
        = P(SB|AB)P(AB) / [P(SB|AB)P(AB) + P(SB|BC)P(BC) + P(SB|CA)P(CA)] + 0
        = (1 · 1/3) / (1 · 1/3 + x · 1/3 + 0)
        = 1 / (1 + x)

So if A knows the value of x, he can calculate his chances. If x = 1/2 (like Monty Hall), his chances are 2/3, same as before; so (unlike Monty Hall) he got no new information. If x ≠ 1/2, he does get new info – his chances change. But what if he doesn't know x at all?

"Marginalization" (this is important!)

• When a model has unknown, or uninteresting, parameters we "integrate them out",
• multiplying by any knowledge of their distribution
  – at worst, just a prior informed by background information
  – at best, a narrower distribution based on data
• This is not any new assumption about the world – it's just the Law of de-Anding

Law of de-Anding (e.g., Jailer's Tip):
P(A|SB I) = ∫ P(A|SB x I) p(x|I) dx = ∫ [1/(1+x)] p(x|I) dx

This is the first time we've seen a continuous probability distribution, but we'll skip the obvious repetition of all the previous laws:
Σi Pi = 1  ↔  Σi p(xi) dxi = 1  ↔  ∫ p(x) dx = 1
We abbreviate p(x) ≡ p(x|I). (Notice that p(x) is a probability of a probability! That is fairly common in Bayesian inference.)

What should Prisoner A take for p(x)? Maybe the "uniform prior"?
p(x) = 1, (0 ≤ x ≤ 1)
P(A|SB I) = ∫[0,1] 1/(1+x) dx = ln 2 = 0.693

Not the same as the "massed prior at x = 1/2" (a Dirac delta function: substitute the value and remove the integral):
p(x) = δ(x − 1/2), (0 ≤ x ≤ 1)
P(A|SB I) = 1/(1 + 1/2) = 2/3
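As a quick numeric check of those two priors, here is a minimal sketch in base Matlab (not from the slides; the function integral requires a fairly recent Matlab release, and quad is the older equivalent):

  % Marginalize P(A|SB,x) = 1/(1+x) over two different priors for x
  pAgivenx = @(x) 1 ./ (1 + x);              % P(A | SB, x)

  P_uniform = integral(pAgivenx, 0, 1);      % uniform prior p(x) = 1 on [0,1]: ln(2) ≈ 0.693
  P_massed  = pAgivenx(1/2);                 % prior massed at x = 1/2: 2/3

  fprintf('uniform prior: %.4f (ln 2 = %.4f)   massed prior: %.4f\n', P_uniform, log(2), P_massed)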
Where are we?

P(A|SB I) = ∫ P(A|SB x I) p(x|I) dx = ∫ [1/(1+x)] p(x|I) dx

We are trying to estimate a probability x = P(SB|BC), (0 ≤ x ≤ 1). The form of our estimate is a (Bayesian) probability distribution.

This is a sterile exercise if it is just a debate about priors. What we need is data! Data might be a previous history of choices by the jailer in identical circumstances:

  BCBCCBCCCBBCBCBCCCCBBCBCCCBCBCBBCCB
  N = 35, NB = 15, NC = 20

(What's wrong with just taking x = 15/35 = 0.43? Hold on…)

We hypothesize (and might later try to check) that these are i.i.d. ("independent and identically distributed") "Bernoulli trials" and therefore informative about x. As good Bayesians, we now need P(data|x).

P(data|x) can mean different things in frequentist vs. Bayesian contexts, so this is a good time to understand the differences (we'll use both ideas as appropriate).

The frequentist considers the universe of what might have been, imagining repeated trials, even if they weren't actually tried; since the trials are i.i.d., only the N's can matter:
  P(data|x) = (N choose NB) x^NB (1−x)^NC
where the binomial coefficient (n choose k) = n!/(k!(n−k)!) counts the number of equivalent arrangements, and x^NB (1−x)^NC is the probability of the exact sequence seen.

The Bayesian considers only the exact data seen:
  P(x|data) ∝ x^NB (1−x)^NC p(x|I)
The prior is still with us. There is no binomial coefficient, since it is independent of x and is absorbed in the proportionality. Use only the data you see, not "equivalent arrangements" that you didn't see. This issue is one we'll return to, not always entirely sympathetically to Bayesians (e.g., goodness-of-fit).

Use Matlab to calculate the posterior probability of x (note we are doing this symbolically to get the general result):

  syms nn nb x
  num = x^nb * (1-x)^(nn-nb)
     num = x^nb*(1-x)^(nn-nb)
  denom = int(num, 0, 1)
     denom = gamma(nn-nb+1)*gamma(nb+1)/gamma(nn+2)
  p = num / denom
     p = x^nb*(1-x)^(nn-nb)/gamma(nn-nb+1)/gamma(nb+1)*gamma(nn+2)
  ezplot(subs(p,[nn,nb],[35,15]),[0,1])

So this is what the data tells us about p(x).

The same calculation in Mathematica:

  num = x^nb (1 - x)^(nn - nb)
     (1 - x)^(nn - nb) x^nb
  denom = Integrate[num, {x, 0, 1}, GenerateConditions -> False]
     Gamma[1 + nb] Gamma[1 - nb + nn] / Gamma[2 + nn]
  p[x_] = num / denom
     (1 - x)^(nn - nb) x^nb Gamma[2 + nn] / (Gamma[1 + nb] Gamma[1 - nb + nn])
  Plot[p[x] /. {nn -> 35, nb -> 15}, {x, 0, 1}, PlotRange -> All, Frame -> True]

[Plot of the posterior p(x) for nn = 35, nb = 15.] So this is what the data tells us about p(x).

Find the mean, standard error, and mode of our estimate for x:

  Dp = diff(p);                     % the derivative has this simple factor:
  simplify(Dp)
     ans = -x^(nb-1)*(1-x)^(nn-nb-1)*(-nb+x*nn)*gamma(nn+2)/gamma(nb+1)/gamma(nn-nb+1)
  solve(Dp)                         % the "maximum likelihood" answer: estimate x as exactly the fraction seen
     ans = nb/nn
  mean = int(x*p, 0, 1)             % the mean is the 1st moment
     mean = (nb+1)/(nn+2)
  sigma = simplify(sqrt(int(x^2 * p, 0, 1) - mean^2))    % the standard error involves the 2nd moment, as shown
     sigma = (-(nb+1)*(-nn+nb-1)/(nn+2)^2/(nn+3))^(1/2)
  pretty(sigma)                     % they call this "pretty"!
     sqrt( (nb + 1)(nn - nb + 1) / ((nn + 2)^2 (nn + 3)) )

This shows how p(x) gets narrower as the amount of data increases.

Or in Mathematica:

  Simplify[D[p[x], x]]
     -((1 - x)^(nn - nb - 1) x^(nb - 1) (-nb + nn x) Gamma[2 + nn]) / (Gamma[1 + nb] Gamma[1 - nb + nn])
  Solve[Simplify[D[p[x], x]] == 0, x]
     {{x -> nb/nn}}
  mean = Integrate[x p[x], {x, 0, 1}, GenerateConditions -> False]
     (1 + nb)/(2 + nn)
  sigma = Sqrt[FullSimplify[Integrate[x^2 p[x], {x, 0, 1}, GenerateConditions -> False] - mean^2]]
     Sqrt[(1 + nb)(1 - nb + nn) / ((2 + nn)^2 (3 + nn))]

Editorial: Mathematica has prettier output for symbolic calculations, but Matlab is significantly better at handling numerical data (which we will be doing more of).
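As a small taste of that numerical side, here is a base-Matlab cross-check of the symbolic mode, mean, and standard error for the actual data (nn = 35, nb = 15). It is a sketch, not from the original slides; it simply evaluates the posterior on a grid instead of doing the integrals symbolically:

  nn = 35; nb = 15;
  x  = linspace(0, 1, 100001);
  p  = x.^nb .* (1 - x).^(nn - nb);            % unnormalized posterior (uniform prior)
  p  = p / trapz(x, p);                        % normalize numerically

  [pmax, imax] = max(p);
  mode_x = x(imax);                            % should be close to nb/nn    = 0.4286
  mean_x = trapz(x, x .* p);                   % should match (nb+1)/(nn+2)  = 0.4324
  sig_x  = sqrt(trapz(x, x.^2 .* p) - mean_x^2);   % should match the sigma formula, ≈ 0.0803
  fprintf('mode %.4f   mean %.4f   sigma %.4f\n', mode_x, mean_x, sig_x)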
Assimilating new data:

If we got a new batch of data, say with new counts NN and NB, we could use our p(x) based on the old nn and nb as the prior, and then redo the calculations above. However, you should be able to immediately write down the answers for the mean, standard error, and mode, as above.

Priors that preserve the analytic form of p(x) are called "conjugate priors". There is nothing special about them except mathematical convenience: x^α (1 − x)^β is a conjugate prior for our Beta distribution x^NB (1 − x)^(N − NB).
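To make the conjugate-prior point concrete, here is a minimal base-Matlab sketch (not from the slides). The counts in the second batch are purely hypothetical, invented just to show that assimilating new data only adds counts ("pseudocounts") to the Beta exponents:

  % Conjugate (pseudocount) updating of the Beta posterior.
  % A uniform prior corresponds to Beta(1,1); batch 1 is the jailer data above:
  % N = 35 trials with NB = 15 and NC = 20.
  a = 1 + 15;              % Beta parameter a = 1 + NB
  b = 1 + 20;              % Beta parameter b = 1 + NC
  mean1 = a / (a + b);     % posterior mean a/(a+b) = (NB+1)/(N+2) = 16/37

  % Hypothetical second batch: 10 more B's and 15 more C's.
  % Conjugacy means we just add the new counts; no new integrals are needed.
  a = a + 10;
  b = b + 15;
  mean2 = a / (a + b);     % updated posterior mean
  fprintf('mean after batch 1: %.4f   after batch 2: %.4f\n', mean1, mean2)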