
Basics of Probability Theory for Ph.D. students in Education, Social
Sciences and Business
(Shing On LEUNG and Hui Ping WU)
(May 2015)
This is a series of three talks, respectively on:
A. Probability Theory
B. Hypothesis Testing
C. Bayesian Inference
Lecture 3: Bayesian Inference
(Most statistical details can be found via web search. These lectures
emphasize conceptual understanding rather than technical details.)
C. Bayesian approach
Bayes Theorem
Pr(A|B) = Pr(A & B) / Pr(B) = Pr(B|A) * Pr(A) / Pr(B)
Pr(B) = Pr(B|A) * Pr(A) + Pr(B|not A) * Pr(not A)
For our convenience, let A = θ (model), B = X (observations)
So, Pr(θ|X) = Pr(θ & X) / Pr(X) = Pr(X|θ) * Pr(θ) / Pr(X)
The roles of X and θ are interchanged
Classical
• Pr(X|θ): the sampling distribution of the data
• Frequentists implicitly assume "many" realizations of the data (e.g. each
day it can be raining or not raining), but in reality the data occur only
once (each day only one event happens: rain or no rain) (Yuen 2011)
Bayesian
• Pr(θ|X): uncertainty over the parameter space
• In Bayesian terms, rain yesterday is X (it happened, only once): the data;
rain tomorrow is y (predictive), which is governed by θ, the parameter
An easy example:
• M1 = 1 black and 9 white balls, θ = θ1 = 0.1
• M2 = 9 black and 1 white ball, θ = θ2 = 0.9
• Procedure: select a bag at random (Pr(θ1) = Pr(θ2) = 0.5), draw 1 ball,
and guess which bag (M1 or M2) it comes from
• X = B if the selected ball is black; X = W if it is white
• Pr(X=B) = Pr(X=B|θ1)*Pr(θ1) + Pr(X=B|θ2)*Pr(θ2) = 0.1*0.5 + 0.9*0.5 = 0.5
• Pr(θ1|X=B) = Pr(X=B|θ1)*Pr(θ1) / Pr(X=B) = 0.05 / 0.5 = 0.1
• Pr(θ2|X=B) = 0.9
Pr(θ1|B) vs Pr(θ2|B) = 0.1 vs 0.9
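As a quick check, here is a minimal Python sketch of the bag-of-balls
calculation above (exact fractions are used so the posterior 0.1 vs 0.9
comes out exactly):

```python
# Minimal sketch: Bayes theorem on the bag-of-balls example.
from fractions import Fraction

prior = {"M1": Fraction(1, 2), "M2": Fraction(1, 2)}        # Pr(θ1) = Pr(θ2) = 0.5
lik_black = {"M1": Fraction(1, 10), "M2": Fraction(9, 10)}  # Pr(X=B | θ)

# Law of total probability: Pr(X=B) = Σ Pr(X=B|θ) Pr(θ)
p_black = sum(lik_black[m] * prior[m] for m in prior)

# Bayes theorem: Pr(θ | X=B) = Pr(X=B|θ) Pr(θ) / Pr(X=B)
posterior = {m: lik_black[m] * prior[m] / p_black for m in prior}

print(p_black)    # 1/2
print(posterior)  # {'M1': Fraction(1, 10), 'M2': Fraction(9, 10)}
```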
Interpretation
• If the ball is black, we guess it comes from M2; otherwise, M1
• We make an inference (M) based on our observations (X)
• X can be a medical symptom (data); M can be a disease
• X can be students' achievements; M can be SES, effort, esteem,
attitude, etc.
• * Important: Classical Pr(X|θ) vs Bayesian Pr(θ|X)
Prior and Posterior
(In its continuous form)
p(θ|x) = p(x|θ) * p(θ) / p(x), where p(x) = ∫ p(x|θ) * p(θ) dθ
• Pr(M) or p(θ) is the prior
• Pr(M|X) or p(θ|x) is the posterior
Simple example with Hypothesis Testing (Classical Approach)
H0: The ball is from M1
H1: The ball is from M2
• Decision rule: if the ball is black (X=B), we reject H0 and say the
ball is from M2.
• Under H0, Pr(X=B|M1) = 0.1 > 0.05, so we do not reject H0 (unless we
had, say, 5 black out of 100 balls)
• With p = 0.1, we can use the term "marginally significant". Still, we
are protecting H0.
Under H1,
Pr(X=B|M2) = 0.9 (the power of the test is good)
Pr(X=W|M2) = 0.1 (this is the Type II error)
But we set the decision rule without knowing what happens under M2, as
hypothesis testing always concentrates on H0 (here, M1). Hence, the power
can be low, and the Type II error can be large.
No statement is made on Pr(θ1|B) vs Pr(θ2|B) = 0.1 vs 0.9
Credible Interval vs Confidence Interval
• A credible interval (CreI) is an interval of Pr(θ|X) (the posterior) that
specifies the most probable range (say 95%) of a parameter.
• In a multi-dimensional parameter space, it is a credible region
• A 95% confidence interval (ConI): if we had many samples, 95% of the
ConIs would contain the true value (in practice we get only one sample).
Here, parameters are fixed and intervals are random (see the simulation
sketch below)
• CreI is not equal to ConI because (i) a prior exists, and (ii) nuisance
parameters are treated differently
• For (i), if the prior is "reasonably unbiased", the differences are minor.
• For (ii), the ConI (classical approach) takes nuisance parameters at
their MLE values. The Bayesian approach has to integrate them out (and
that is the most difficult task)
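The "many samples" reading of a confidence interval can be checked by
simulation; a minimal sketch (the true mean, σ, and n are arbitrary
illustration values, not from the lecture):

```python
# Minimal sketch: coverage of a 95% CI for a normal mean with known σ.
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n = 5.0, 2.0, 30     # arbitrary illustration values
covered = 0
for _ in range(10_000):              # "many samples"
    x = rng.normal(true_mu, sigma, n)
    half = 1.96 * sigma / np.sqrt(n)
    m = x.mean()
    covered += (m - half <= true_mu <= m + half)
print(covered / 10_000)              # ≈ 0.95: intervals are random, µ is fixed
```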
Simple Example: t-test
Golf scores for males and females are:
Male: 82, 80, 85, 85, 78, 87, 82;
Female: 75, 76, 80, 77, 80, 77, 73
Sample difference: y = x̄1 − x̄2 = 5.85
n = 7 for both sexes (equal n is not necessary)
t-test from SPSS, path: Analyze – Compare Means – Independent-Samples
T Test
t = 3.83, df = 12, p-value = 0.002, 95% CI (2.52, 9.19)
(Many people use the t-test without the t-distribution, i.e. there are
complications behind it)
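For readers without SPSS, the same classical test takes a few lines of
Python; a minimal sketch with SciPy:

```python
# Minimal sketch: classical independent-samples t-test on the golf data.
from scipy import stats

male = [82, 80, 85, 85, 78, 87, 82]
female = [75, 76, 80, 77, 80, 77, 73]

# Pooled-variance two-sample t-test (equal variances assumed, as in SPSS)
res = stats.ttest_ind(male, female, equal_var=True)
print(res.statistic, res.pvalue)  # t ≈ 3.83, p ≈ 0.002 (df = 12)
# Recent SciPy versions also expose the 95% CI:
# print(res.confidence_interval())  # ≈ (2.52, 9.19)
```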
Bayesian Inference of this example
x1 ~ N(µ1, σ1²) and x2 ~ N(µ2, σ2²); then
x̄1 ~ N(µ1, σ1²/n1),
x̄2 ~ N(µ2, σ2²/n2)
• Assume σ1 and σ2 are known, but … (more on this later)
• Let θ = µ1 − µ2, the parameter for the difference, and let
y = x̄1 − x̄2
be the random variable; then
y|θ ~ N(θ, σ1²/n1 + σ2²/n2)
p(θ|y) = p(y|θ) * p(θ) / p(y); or p(θ|y) ∝ p(y|θ) * p(θ)
• Assume the prior θ ~ N(0, σ3²); mean 0 implies no bias; σ3 is known
• If everything is Normal, the posterior p(θ|y) is Normal, and
θ|y ~ N(µ, σ²),
where
µ = y * σ3² / (σ3² + σ1²/n1 + σ2²/n2)
and
1/σ² = 1/σ3² + 1/(σ1²/n1 + σ2²/n2).
The Credible Interval is CreI (U1, U2) = (µ − 2σ, µ + 2σ)
But,
• Everything being Normal is not always the case
• If σ is unknown, p(θ|y) is not Normal even if everything else is Normal
• In general, the posterior p(θ|y) can be very complicated. This is one
of the difficulties in using the Bayesian approach. (A numerical sketch
of the Normal case follows.)
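A minimal sketch of this conjugate update, plugging the sample variances
in as the "known" σ1² and σ2² (an assumption of the sketch); it reproduces
the µ column of the table that follows, up to rounding:

```python
# Minimal sketch: Normal-Normal conjugate posterior for θ = µ1 − µ2.
import numpy as np

male = np.array([82, 80, 85, 85, 78, 87, 82])
female = np.array([75, 76, 80, 77, 80, 77, 73])

y = male.mean() - female.mean()  # observed difference, ≈ 5.86 (5.85 in the slides)
# Sampling variance of y: σ1²/n1 + σ2²/n2, with sample variances as "known"
var_y = male.var(ddof=1) / len(male) + female.var(ddof=1) / len(female)

for sigma3 in (1, 2, 3, 10, 50):                # prior θ ~ N(0, σ3²)
    post_var = 1 / (1 / sigma3**2 + 1 / var_y)  # 1/σ² = 1/σ3² + 1/var_y
    post_mean = post_var * y / var_y            # prior mean 0 adds nothing
    print(sigma3, round(post_mean, 2), round(post_var**0.5, 2))
```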
Prior and Credible Interval

σ3      µ       CreI (U1, U2)
1       1.75    (0.38, 3.13)
2       3.69    (0.80, 6.58)
3       4.64    (1.00, 8.28)
10      5.72    (1.24, 10.20)
50      5.85    (1.26, 10.43)
CI      5.85    (2.52, 9.19)

(The last row is the classical 95% confidence interval, for comparison.)
Interpretation:
• The final result is a compromise between (i) the prior and (ii) the
observations
• The prior mean is 0. When the prior spread σ3 is small, the effect of the
prior is stronger and the final result is closer to 0.
• If σ3 is known and very large, we make no strong assumption about prior
belief (an unbiased or non-informative prior); the final result depends
largely on the observations
• Generally, a non-informative (large-variance) prior makes the Bayesian
result similar to the Classical one
• Another factor: σ3 is assumed to be known, which is unrealistic. Such
quantities are called "nuisance parameters" (or hyper-parameters)
• Everyone uses computer packages; later, we introduce a "Bayesian t-test"
web-based calculator. There are many others.
Nuisance parameters (Nuisance but important!)
• Classically, we take nuisance parameters such as σ3 at their MLE
• Bayesian evidence:
p(θ|x) = p(x|θ) * p(θ) / p(x), or
Posterior = Evidence * Prior / p(x)
p(x|θ) = ∫ p(x, σ3|θ) dσ3 is called the "evidence"
• To compute the evidence, we need to integrate ("∫") over all possible
values of σ3 in the parameter space.
• "... this averaging automatically controls the complexity of different
models ..." (Wetzels et al 2011)
• The following terms all carry the same technical implication:
i. penalty towards extra parameters
ii. over-fitting
iii. complexity of the models
iv. weighted average of the likelihood
v. posterior probability
vi. evidence
• That simple sign "∫" can imply a 100-dimensional integral in the
parameter space. It is not only computationally difficult, but also …
• Models become more complex, …
• Large prediction errors, …
• In Bayesian terms: more parameters, more "∫", more complexity, larger
prediction errors, … etc. Eventually, such models are not preferred
• Approximation methods are used, say MCMC, BIC (closely related to AIC),
the Laplace method, …, etc. (web-search)
Bayes Factor
• Recall … Posterior = Evidence * Prior / p(x)
• To compare two models, M0 and M1, we compare their posteriors:
Posterior of M0 = Evidence(M0) * Prior(M0) / p(x)
Posterior of M1 = Evidence(M1) * Prior(M1) / p(x)
• Since (i) p(x) is the same, and (ii) if we further assume the two priors
are equally likely, comparing posteriors becomes comparing evidence
• So, the Bayes Factor is:
B01 = Evidence(M0) / Evidence(M1), or
B01 = (∫ Pr(X|θ,M0) * Pr(θ|M0) dθ) / (∫ Pr(X|θ,M1) * Pr(θ|M1) dθ)
• In Bayesian terms, there is no preference between M0 and M1, unlike the
frequentist approach (which protects H0 and tries to reject it)
• But the evidence (for M0 and M1) needs "∫" over all nuisance
parameters! (A numerical sketch follows.)
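To make the "∫" concrete, here is a minimal sketch computing a Bayes
factor for the golf example: M0 fixes θ = 0, while M1 uses the prior
θ ~ N(0, σ3²) from before (σ3 = 2 and var_y ≈ 2.34 are this sketch's
assumptions, carried over from the earlier conjugate example):

```python
# Minimal sketch: Bayes factor B01 by integrating the evidence over θ.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

y, var_y = 5.85, 2.34   # observed difference and its sampling variance
sigma3 = 2.0            # prior spread under M1

# Evidence(M0): θ is fixed at 0, so no integration is needed
ev_m0 = norm.pdf(y, loc=0, scale=np.sqrt(var_y))

# Evidence(M1): ∫ Pr(y|θ) Pr(θ|M1) dθ — a weighted average of the likelihood
ev_m1, _ = quad(lambda th: norm.pdf(y, loc=th, scale=np.sqrt(var_y))
                           * norm.pdf(th, loc=0, scale=sigma3),
                -np.inf, np.inf)

print(ev_m0 / ev_m1)  # B01 ≈ 0.016 < 1/10: substantially prefer M1
```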
• The Bayes Factor can apply to many situations, some with many
parameters (say Factor Analysis, SEM, etc.), and some with very few
parameters (say the Bayesian t-test).
• Now, if all nuisance parameters are taken at their MLE (instead of "∫"),
the Bayes Factor equals the Likelihood Ratio (and we use the LRT … and
are back to the Classical approach)
Interpretations of Bayes Factor

B01          Interpretation
< 1/10       Substantially prefer M1 (more than 10 times as likely)
1/10 ~ 1/3   Slightly prefer M1 (between 3 and 10 times as likely)
1/3 ~ 3      Indifferent (within 3 times as likely)
3 ~ 10       Slightly prefer M0 (between 3 and 10 times as likely)
> 10         Substantially prefer M0 (more than 10 times as likely)

• B01 = Evidence(M0) / Evidence(M1)
• B10 = Evidence(M1) / Evidence(M0); so B01 = 1/B10
• Because of this reciprocal relationship, the "strength" of the index is
the same for B01 and 1/B01; say, "1/3" and "3" are equally strong.
Bayesian t-test
• There are many Bayesian t-tests; Rouder et al. (2009) is just one
convenient example:
http://pcl.missouri.edu/bf-two-sample
(see Rouder et al. 2009 and Wetzels et al. 2011)
References:
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting
the null hypothesis. Psychonomic Bulletin & Review, 16, 225-237.
Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in
experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6(3), 291-298.
µ: mean of the difference; σ: standard deviation of the difference
δ = µ/σ: the effect size of the difference
M0: δ = 0 (hence µ = 0) (the so-called null)
M1: δ ~ N(0, σδ²), where σδ is the prior spread of δ (the alternative)
Note:
• The effect size δ gives a standard way to compare means of different
populations, and "…researchers have an intrinsic scale about the
ranges of effect sizes that applies broadly…" (Rouder et al, 2009)
• For M0, δ = 0, and the evidence under M0 takes only δ = 0
• For M1, δ ~ N(0, σδ²), and the evidence under M1 automatically averages
over all δ in the whole range. So there is one more integration
under M1
• In the above example, we only need to input n1 = n2 = 7, t = 3.83,
r = 0.707; the results are:
• Scaled JZS Bayes Factor = 14.7
• Scaled-Information Bayes Factor = 17.7
• But we need to take the reciprocals of these values, because the
web calculator computes B10 instead of B01, so:
• Scaled JZS Bayes Factor: B01 = 1/14.7 = 0.068
• Scaled-Information Bayes Factor: B01 = 1/17.7 = 0.056
• We expect more packages for Bayesian applications in the future, but
the principles remain the same (a sketch of the JZS computation follows
below)
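Finally, a minimal sketch of the two-sample JZS Bayes factor itself,
following the integral in Rouder et al. (2009); this is one reading of
the formula, not the calculator's own code, but it should approximately
reproduce B10 ≈ 14.7:

```python
# Minimal sketch: two-sample JZS Bayes factor (Rouder et al., 2009).
import numpy as np
from scipy.integrate import quad

def jzs_bf10(t, n1, n2, r=0.707):
    """B10 for a two-sample t statistic with a Cauchy(0, r) prior on δ."""
    N = n1 * n2 / (n1 + n2)  # effective sample size
    df = n1 + n2 - 2

    # Under M1, δ|g ~ N(0, g·r²) with g ~ InverseGamma(1/2, 1/2); the
    # marginal likelihood integrates over g (the "automatic averaging").
    def integrand(g):
        a = 1 + N * g * r**2
        return (a**-0.5 * (1 + t**2 / (a * df)) ** (-(df + 1) / 2)
                * (2 * np.pi) ** -0.5 * g**-1.5 * np.exp(-1 / (2 * g)))

    m1, _ = quad(integrand, 0, np.inf)
    m0 = (1 + t**2 / df) ** (-(df + 1) / 2)  # under M0, δ = 0
    return m1 / m0

bf10 = jzs_bf10(t=3.83, n1=7, n2=7)
print(bf10, 1 / bf10)  # B10 ≈ 14.7, so B01 ≈ 0.068
```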
Summary (statistical terms differ from everyday English)
• Bayesian inference is made from the data we observed (i.e. Pr(θ|X), the
posterior), which is more neutral
• In doing so, we need to specify a prior, which is mostly unbiased or
non-informative
• The posterior can have a distribution too complicated to solve
analytically, which holds back the Bayesian approach and its packages
• Handling nuisance parameters is the major technical difficulty, but it
automatically handles model complexity (penalizing complicated models)
and gives direct probability statements (Pr(Model|Data))
*** Important: Classical Pr(X|θ) vs Bayesian Pr(θ|X)
Q&A
Shing On LEUNG
[email protected]
Hui Ping WU
[email protected]