Basics of Probability Theory for Ph.D. Students in Education, Social Sciences and Business
Shing On LEUNG and Hui Ping WU (May 2015)

This is a series of three talks:
A. Probability Theory
B. Hypothesis Testing
C. Bayesian Inference

Lecture 3: Bayesian Inference
(Most statistical details can be found via web search. These lectures emphasize conceptual understanding rather than technical details.)

C. Bayesian approach

Bayes' Theorem
Pr(A|B) = Pr(A & B) / Pr(B) = Pr(B|A) * Pr(A) / Pr(B)
Pr(B) = Pr(B|A)*Pr(A) + Pr(B|not A)*Pr(not A)
For our convenience, let A = θ (model) and B = X (observations). So,
Pr(θ|X) = Pr(θ & X) / Pr(X) = Pr(X|θ) * Pr(θ) / Pr(X)

The roles of X and θ are interchanged
Classical Pr(X|θ): the sampling distribution of the data. Frequentists implicitly assume "many" realizations of the data (e.g. each day can be rainy or not), but in reality the data occur only once (on any given day only one event happens: rain or no rain) (Yuen 2011).
Bayesian Pr(θ|X): uncertainty over the parameter space. Bayesianly, rain yesterday is X (it happened, only once) -- the data; rain tomorrow is y (predictive), which is governed by the parameter θ.

An easy example
M1 = 1 black and 9 white balls, θ = θ1 = 0.1
M2 = 9 black and 1 white ball, θ = θ2 = 0.9
Procedure: select a bag at random (Pr(θ1) = Pr(θ2) = 0.5), draw one ball, and guess which bag (M1 or M2) it came from.
X = B if the selected ball is black; X = W if it is white.
Pr(X=B) = Pr(X=B|θ1)*Pr(θ1) + Pr(X=B|θ2)*Pr(θ2) = 0.1*0.5 + 0.9*0.5 = 0.5
Pr(θ1|X=B) = Pr(X=B|θ1)*Pr(θ1) / Pr(X=B) = 0.05 / 0.5 = 0.1
Pr(θ2|X=B) = 0.9
Pr(θ1|B) vs Pr(θ2|B) = 0.1 vs 0.9
(A short computational sketch of this example appears at the end of this part.)

Interpretation
If the ball is black, we guess it came from M2; otherwise, M1.
We make an inference (M) based on our observations (X).
X can be a medical symptom (data), and M a disease; X can be students' achievements, and M can be SES, effort, esteem, attitude, etc.
* Important: Classical Pr(X|θ) vs Bayesian Pr(θ|X)

Prior and Posterior (in continuous form)
p(θ|x) = p(x|θ) * p(θ) / p(x), where p(x) = ∫ p(x|θ) * p(θ) dθ
Pr(M) or p(θ) is the prior; Pr(M|X) or p(θ|x) is the posterior.

Simple example with Hypothesis Testing (Classical Approach)
H0: the ball is from M1.
H1: the ball is from M2.
Decision rule: if the ball is black (X=B), we reject H0 and say the ball is from M2.
Under H0, Pr(X=B|M1) = 0.1 > 0.05, so we do not reject H0 (we could reject at the 5% level only if, say, M1 had 5 black balls out of 100). With p = 0.1, we may use the term "marginally significant". Still, we are protecting H0.
Under H1, Pr(X=B|M2) = 0.9 (the power of the test is good); Pr(X=W|M2) = 0.1 (this is the Type II error).
But we set the decision rule without knowing what happens under M2, as hypothesis testing always concentrates on M1. Hence power can be low and the Type II error large. There is no statement about Pr(θ1|B) vs Pr(θ2|B) = 0.1 vs 0.9.

Credible Interval vs Confidence Interval
A credible interval (CreI) is an interval of Pr(θ|X) (the posterior) specifying the most probable range (say 95%) of a parameter. In a multi-dimensional parameter space, it is a credible region.
A 95% confidence interval (ConI): if we drew many samples, 95% of the ConIs would contain the true value (in practice we get only one sample). Here, parameters are fixed and intervals are random.
CreI is not equal to ConI because (i) a prior exists, and (ii) nuisance parameters are treated differently. For (i), if the prior is "reasonably unbiased", the differences are minor. For (ii), the ConI (classical approach) takes nuisance parameters at their MLE values, while the Bayesian has to carry out integrations (and that is the most difficult task).
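Here is the computational sketch of the bag-and-ball example promised above, a minimal illustration in Python (the variable names are ours; the probabilities are exactly those of the example):

```python
# Two-bag example: compute Pr(theta | X = black) directly from Bayes' theorem.
priors = {"M1": 0.5, "M2": 0.5}        # Pr(theta1) = Pr(theta2) = 0.5
lik_black = {"M1": 0.1, "M2": 0.9}     # Pr(X=B | theta) for each bag

# Pr(X=B) = sum over models of Pr(X=B | theta) * Pr(theta): the denominator
p_black = sum(lik_black[m] * priors[m] for m in priors)   # = 0.5

posterior = {m: lik_black[m] * priors[m] / p_black for m in priors}
print(posterior)   # {'M1': 0.1, 'M2': 0.9}: if the ball is black, guess M2
```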
Simple Example: t-test
Golf scores for males and females:
Male: 82, 80, 85, 85, 78, 87, 82
Female: 75, 76, 80, 77, 80, 77, 73
Sample difference: y = x̄1 - x̄2 = 5.85
n = 7 for each sex (equal n is not necessary).
t-test from SPSS (path: Analyze - Compare Means - Independent-Samples T Test):
t = 3.83, df = 12, p-value = 0.002, 95% CI (2.52, 9.19)
(Many people use the t-test without knowing the t-distribution behind it, i.e. the complications stay hidden.)

Bayesian inference for this example
x1 ~ N(µ1, σ1²) and x2 ~ N(µ2, σ2²); then x̄1 ~ N(µ1, σ1²/n1) and x̄2 ~ N(µ2, σ2²/n2).
Assume σ1 and σ2 are known, but ... (more on this later).
Let θ = µ1 - µ2 be the parameter for the difference, and let y = x̄1 - x̄2 be the random variable.
y|θ ~ N(θ, σ1²/n1 + σ2²/n2)
p(θ|y) = p(y|θ) * p(θ) / p(y); or p(θ|y) ∝ p(y|θ) * p(θ)
Assume a prior θ ~ N(0, σ3²); a prior mean of 0 implies no bias; σ3 is known.
If everything is Normal, the posterior p(θ|y) is also Normal:
θ|y ~ N(µ, σ²), where, writing s² = σ1²/n1 + σ2²/n2,
µ = y * σ3² / (σ3² + s²) and σ² = σ3² * s² / (σ3² + s²).
The credible interval is CreI(U1, U2) = (µ - 2σ, µ + 2σ).
But "everything is Normal" is not always the case. If σ is unknown, p(θ|y) is not Normal even when everything else is. In general the posterior p(θ|y) can be very complicated; this is one of the difficulties in using the Bayesian approach.
(A computational sketch of this posterior appears at the end of this part.)

Prior and Credible Interval
  σ3     µ       CreI(U1, U2)
   1     1.75    (0.38, 3.13)
   2     3.69    (0.80, 6.58)
   3     4.64    (1.00, 8.28)
  10     5.72    (1.24, 10.20)
  50     5.85    (1.26, 10.43)
  CI     5.85    (2.52, 9.19)

Interpretation: the final result is a mixture of (i) the prior and (ii) the observations.
The prior mean is 0. When the prior variance σ3² is small, the effect of the prior is stronger and the final result will be closer to 0.
If σ3 is known and very large, we make no strong assumption about prior belief (an unbiased or non-informative prior), and the final result depends largely on the observations.
Generally, a non-informative (large-variance) prior makes the Bayesian answer similar to the classical one.
Another factor: σ3 is assumed to be known, which is unrealistic. Such parameters are called "nuisance parameters" (or hyper-parameters).
Everyone uses computer packages; later we introduce a web-based "Bayesian t-test" calculator. There are many others.

Nuisance parameters (nuisance but important!)
Classically, we take nuisance parameters, such as σ3, at their MLE values.
Bayesian evidence: p(θ|x) = p(x|θ) * p(θ) / p(x), or Posterior = Evidence * Prior / p(x), where
p(x|θ) = ∫ p(x, σ3|θ) dσ3 is called the "evidence".
To compute the evidence, we need to integrate "∫" over all possible values of σ3 in the parameter space. "... this averaging automatically controls the complexity of different models ..." (Wetzels et al. 2011).
The following terms all carry the same technical implication:
i. penalty towards extra parameters
ii. over-fitting
iii. complexity of the models
iv. weighted average of the likelihood
v. posterior probability
vi. evidence
That simple sign "∫" can imply a 100-dimensional integral in the parameter space. It is not only computationally difficult: the more complex the model, the larger the prediction errors. Bayesianly, more parameters mean more "∫", more complexity, larger prediction errors, etc.; eventually such models are not preferred.
Approximation methods are used, e.g. MCMC, BIC (a close relative of AIC), the Laplace method, etc. (web-search).
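Here is the sketch of the Normal-Normal posterior promised above (Python). The slides assume σ1 and σ2 are known; we plug in the sample variances for illustration, so the intervals need not reproduce the table exactly, but the posterior means do match its µ column:

```python
# Normal-Normal posterior for the golf example: prior theta ~ N(0, sigma3^2),
# data y | theta ~ N(theta, s^2) with s^2 = sigma1^2/n1 + sigma2^2/n2.
import statistics

male   = [82, 80, 85, 85, 78, 87, 82]
female = [75, 76, 80, 77, 80, 77, 73]

y  = statistics.mean(male) - statistics.mean(female)   # ~= 5.86 (5.85 on the slides)
s2 = (statistics.variance(male) / len(male)
      + statistics.variance(female) / len(female))     # sigma1, sigma2 "known"

def posterior(sigma3, y=y, s2=s2):
    """Posterior mean and variance of theta, given the prior N(0, sigma3^2)."""
    post_mean = y * sigma3**2 / (sigma3**2 + s2)
    post_var  = sigma3**2 * s2 / (sigma3**2 + s2)
    return post_mean, post_var

for sigma3 in (1, 2, 3, 10, 50):
    m, v = posterior(sigma3)
    sd = v ** 0.5
    # CreI as on the slides: posterior mean +/- 2 posterior standard deviations
    print(f"sigma3={sigma3:>2}: mu={m:.2f}, CreI=({m - 2*sd:.2f}, {m + 2*sd:.2f})")
```

Making σ3 very large reproduces something close to the classical answer, as the σ3 = 50 row of the table suggests.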
Bayes Factor
Recall: Posterior = Evidence * Prior / p(x).
To compare two models, M0 and M1, we compare their posteriors:
Posterior of M0 = Evidence(M0) * Prior(M0) / p(x)
Posterior of M1 = Evidence(M1) * Prior(M1) / p(x)
Since (i) p(x) is the same, and (ii) if we further assume the two priors are equally likely, comparing posteriors becomes comparing evidence.
So the Bayes Factor is B01 = Evidence(M0) / Evidence(M1), or
B01 = (∫ Pr(X|θ, M0) * Pr(θ|M0) dθ) / (∫ Pr(X|θ, M1) * Pr(θ|M1) dθ)
Bayesianly, there is no preference between M0 and M1, unlike the frequentist approach (which protects H0 and tries to reject it).
But the evidence for each of M0 and M1 needs "∫" over all nuisance parameters!
The Bayes Factor applies to many situations: some with many parameters (say Factor Analysis, SEM, etc.), some with very few (say the Bayesian t-test).
Note: if all nuisance parameters are taken at their MLE values (instead of "∫"), the Bayes Factor reduces to the Likelihood Ratio (and we use the LRT ... and we are back to the classical approach).

Interpretations of the Bayes Factor
  B01            Interpretation
  < 1/10         Substantially prefer M1 (more than 10 times as likely)
  1/10 ~ 1/3     Slightly prefer M1 (3 to 10 times as likely)
  1/3 ~ 3        Indifferent (within 3 times as likely)
  3 ~ 10         Slightly prefer M0 (3 to 10 times as likely)
  > 10           Substantially prefer M0 (more than 10 times as likely)
B01 = Evidence(M0) / Evidence(M1); B10 = Evidence(M1) / Evidence(M0); so B01 = 1/B10.
Because of this reciprocal relationship, the "strength" of the index is the same for B01 and 1/B01: say, "1/3" and "3" are equally strong.

Bayesian t-test
There are many Bayesian t-tests; Rouder et al. (2009) is just one convenient example.
http://pcl.missouri.edu/bf-two-sample
References:
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225-237.
Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6(3), 291-298.
µ: mean of the difference; σ: standard deviation of the difference; δ = µ/σ: effect size of the difference.
M0: δ = 0 (hence µ = 0) (the so-called null)
M1: δ ~ N(0, σδ²), where σδ² is the variance of δ (the alternative)
Note: the effect size δ gives a standard way to compare means of different populations, and "... researchers have an intrinsic scale about the ranges of effect sizes that applies broadly ..." (Rouder et al. 2009).
For M0, δ = 0, and the evidence under M0 takes only δ = 0. For M1, δ ~ N(0, σδ²), and the evidence under M1 automatically averages over all δ in the whole range. So there is one more integration under M1.
In the example above, we only need to input n1 = n2 = 7, t = 3.83, r = 0.707; the results are:
Scaled JZS Bayes Factor = 14.7
Scaled-Information Bayes Factor = 17.7
But we need to take the reciprocals of these values, because the web calculator computes B10 instead of B01, so:
Scaled JZS Bayes Factor: B01 = 1/14.7 = 0.068
Scaled-Information Bayes Factor: B01 = 1/17.7 = 0.056
(A computational sketch of the JZS Bayes factor follows.)
We expect more packages for Bayesian applications in the future, but the principles remain the same.
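Here is a minimal sketch of the two-sample scaled JZS Bayes factor, assuming the one-dimensional integral representation in Rouder et al. (2009) (a Cauchy prior with scale r on δ, equivalently g ~ Inverse-Gamma(1/2, r²/2)); the function name is ours:

```python
# Two-sample JZS Bayes factor B10 via numerical integration,
# following the integral representation in Rouder et al. (2009).
import numpy as np
from scipy import integrate

def jzs_bf10(t, n1, n2, r=0.707):
    nu = n1 + n2 - 2                 # degrees of freedom
    n_eff = n1 * n2 / (n1 + n2)      # effective sample size, two-sample case
    # Evidence under M0 (delta = 0), up to a constant shared with M1
    m0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    # Evidence under M1: average the likelihood over g, where
    # g ~ Inverse-Gamma(1/2, r^2/2), i.e. a Cauchy(0, r) prior on delta
    def integrand(g):
        a = (1 + n_eff * g) ** (-0.5)
        b = (1 + t**2 / ((1 + n_eff * g) * nu)) ** (-(nu + 1) / 2)
        prior_g = r / np.sqrt(2 * np.pi) * g ** (-1.5) * np.exp(-r**2 / (2 * g))
        return a * b * prior_g
    m1, _ = integrate.quad(integrand, 0, np.inf)
    return m1 / m0                   # B10 = evidence(M1) / evidence(M0)

b10 = jzs_bf10(t=3.83, n1=7, n2=7, r=0.707)
print(f"B10 = {b10:.1f}, B01 = {1/b10:.3f}")  # should be close to 14.7 and 0.068
```

Note that the only "∫" here is one-dimensional, which is why simple quadrature suffices; with more nuisance parameters the same averaging becomes a high-dimensional integral, and MCMC or Laplace approximations take over.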
Summary (statistical terms differ from everyday English)
Bayesian inference is made from the data we observed, i.e. Pr(θ|X), the posterior, which is more neutral.
In doing so, we need to specify a prior, which is mostly unbiased or non-informative.
The posterior can have a distribution too complicated to be solved analytically, which is an obstacle for the Bayesian approach and its packages.
Handling nuisance parameters is the major technical difficulty, but it automatically handles model complexity (penalizing complicated models) and gives a direct probability statement (Pr(Model|Data)).
*** Important: Classical Pr(X|θ) vs Bayesian Pr(θ|X)

Q&A
Shing On LEUNG  [email protected]
Hui Ping WU  [email protected]