Computer intensive statistical methods
Lecture 5
Jonas Wallin, Chalmers, Gothenburg

The agenda of the day
• Repetition
• Bayesian statistics: basic ideas
• Priors
• Bayesian hypothesis testing
The main idea today is to build "realistic" models.

Repetition: Self-normalized IS
Assume f(x) is known only up to a normalizing constant c_f > 0, i.e. f(x) = z(x)/c_f, where we can evaluate z(x) = c_f f(x) but not f(x). Then, as before,
\[
\tau = \mathbb{E}_f(h(X)) = \int_{\chi} h(x) f(x)\,dx
  = \frac{\int_{f(x)>0} h(x)\, c_f f(x)\,dx}{\int_{f(x)>0} c_f f(x)\,dx}
  = \frac{\int_{g(x)>0} h(x)\,\frac{c_f f(x)}{g(x)}\, g(x)\,dx}{\int_{g(x)>0} \frac{c_f f(x)}{g(x)}\, g(x)\,dx}
  = \frac{\int_{g(x)>0} h(x)\,\omega_u(x)\, g(x)\,dx}{\int_{g(x)>0} \omega_u(x)\, g(x)\,dx}
  = \frac{\mathbb{E}_g(h(X)\,\omega_u(X))}{\mathbb{E}_g(\omega_u(X))},
\]
where we are able to evaluate the unnormalized weights
\[
\omega_u : \{x \in \chi : g(x) > 0\} \ni x \mapsto \frac{z(x)}{g(x)}.
\]

Repetition: Control variates
Let Y be a real-valued random variable, referred to as a control variate, such that (i) E(Y) = m is known and (ii) h(X) − Y can be simulated at the same complexity as h(X). Then for any α ∈ R, set Z = h(X) + α(Y − m), so that
\[
\mathbb{E}(Z) = \mathbb{E}\bigl(h(X) + \alpha(Y - m)\bigr)
  = \underbrace{\mathbb{E}(h(X))}_{=\tau} + \alpha \underbrace{(\mathbb{E}(Y) - m)}_{=0} = \tau.
\]

Repetition: Control variates (cont.)
If h(X) and Y have covariance C(h(X), Y), it holds that
\[
\mathbb{V}(Z) = \mathbb{V}(h(X) + \alpha Y) = \mathbb{C}(h(X) + \alpha Y,\, h(X) + \alpha Y)
  = \mathbb{V}(h(X)) + 2\alpha\, \mathbb{C}(h(X), Y) + \alpha^2\, \mathbb{V}(Y).
\]
Differentiating w.r.t. α and minimizing yields
\[
0 = 2\,\mathbb{C}(h(X), Y) + 2\alpha\,\mathbb{V}(Y)
  \;\Leftrightarrow\; \alpha = \alpha^* = -\frac{\mathbb{C}(h(X), Y)}{\mathbb{V}(Y)},
\]
which provides the optimal coefficient α* in terms of variance.

Repetition: Control variates reconsidered
A problem with the control variate approach is that the optimal coefficient
\[
\alpha^* = -\frac{\mathbb{C}(h(X), Y)}{\mathbb{V}(Y)}
\]
is generally unknown. So instead we plug in an estimate of each term,
\[
\hat{\alpha}^* = -\frac{C_N}{V_N}, \qquad
C_N \overset{\text{def.}}{=} \frac{1}{N}\sum_{i=1}^{N} h(X_i)(Y_i - m), \qquad
V_N \overset{\text{def.}}{=} \frac{1}{N}\sum_{i=1}^{N} (Y_i - m)^2.
\]

Repetition: Control variates, multivariate
There is a multivariate extension of control variates. Let Y ∈ R^d be a random vector such that E(Y) = m is known. Now for any α ∈ R^d,
\[
Z = h(X) + \alpha^{\mathsf{T}}(\mathbf{Y} - m)
\]
has E[Z] = τ. Further,
\[
\mathbb{V}(Z) = \mathbb{V}(h(X)) + 2\alpha^{\mathsf{T}}\, \mathbb{C}(h(X), \mathbf{Y}) + \alpha^{\mathsf{T}}\, \mathbb{V}(\mathbf{Y})\, \alpha.
\]
Thus the optimal vector is
\[
\alpha^* = -\mathbb{V}(\mathbf{Y})^{-1}\, \mathbb{C}(h(X), \mathbf{Y}).
\]

Repetition: Control variates, multivariate example
Let's estimate π again, using the fact that P(U1² + U2² ≤ 1) = π/4 for independent U1, U2 ∼ U(0, 1). Use three control variates:
\[
Y_1 = \mathbb{I}(U_1 + U_2 \le 1), \qquad
Y_2 = \mathbb{I}(U_1 + U_2 \le \sqrt{2}), \qquad
Y_3 = (U_1 + U_2 - 1)\,\mathbb{I}(1 < U_1 + U_2 \le \sqrt{2}).
\]
All expectations are explicit:
\[
\mathbb{E}[Y_1] = \frac{1}{2}, \qquad
\mathbb{E}[Y_2] = 1 - \frac{(2 - \sqrt{2})^2}{2}, \qquad
\mathbb{E}[Y_3] = \frac{(\sqrt{2} - 1)^2}{2} - \frac{(\sqrt{2} - 1)^3}{3}.
\]
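As an illustration, here is a minimal sketch (assuming NumPy; not part of the original slides) of the multivariate control variate estimator for π, with the optimal α vector estimated from the same sample, as described above. The printed variance ratio should be comparable to the 1,2,3 entry of the table that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
u1, u2 = rng.random(N), rng.random(N)

# h(X): indicator of the quarter circle, E[h] = pi / 4
h = (u1**2 + u2**2 <= 1).astype(float)

s = u1 + u2
sq2 = np.sqrt(2.0)
# The three control variates with known expectations
Y = np.column_stack([
    (s <= 1).astype(float),
    (s <= sq2).astype(float),
    (s - 1) * ((s > 1) & (s <= sq2)),
])
m = np.array([0.5,
              1 - (2 - sq2) ** 2 / 2,
              (sq2 - 1) ** 2 / 2 - (sq2 - 1) ** 3 / 3])

# Estimate alpha* = -V(Y)^{-1} C(h, Y) from the sample
V = np.cov(Y, rowvar=False)                                   # d x d covariance of Y
C = np.array([np.cov(h, Y[:, j])[0, 1] for j in range(Y.shape[1])])
alpha = -np.linalg.solve(V, C)

Z = h + (Y - m) @ alpha                                       # variance-reduced observations
print("plain estimate of pi :", 4 * h.mean())
print("control-variate est. :", 4 * Z.mean())
print("relative variance    :", Z.var() / h.var())
```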
Repetition: Control variates, multivariate example (cont.)
Variance reduction obtained with the various control variates, measured by the relative variance
\[
\frac{\mathbb{V}(Z^*)}{\mathbb{V}(h(X))} = 1 - \frac{\alpha^{*\mathsf{T}}\, \mathbb{V}(\mathbf{Y})\, \alpha^*}{\mathbb{V}(h(X))}:
\]

Control variates   Relative variance
(none)             1
1                  0.727
2                  0.242
3                  0.999
1,2                0.222
1,3                0.62
2,3                0.181
1,2,3              0.175

Bayesian statistics: Introduction

What is a statistical model?
• I define a statistical model as any model where the data is a random variable.
• Recall from the first course in statistics:
\[
y_i = \beta_0 + x_i \beta_1 + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2).
\]
  This model gives a likelihood f(Y|Θ) given a set of parameters Θ = {σ, β0, β1}.
• To get "the best" (most likely) model, we set the parameters Θ̂ such that the likelihood of the data Y is maximized (typically found numerically).

What is a Bayesian statistical model?
• In a Bayesian model the parameters are not fixed but are treated as random variables.
• First one defines a distribution (the prior) π(Θ) on the parameters.
• Then one assumes a model (or a likelihood) given the parameters, f(Y|Θ).
• Together this creates a joint distribution f(Y, Θ) = f(Y|Θ) π(Θ).
• Instead of trying to find "the best" model, one studies the posterior distribution of the parameters, f(Θ|Y).

Bayes' formula
To combine the prior and the likelihood we use Bayes' formula:
\[
f(\Theta|\mathbf{y}) = \frac{f(\mathbf{y}|\Theta)\,\pi(\Theta)}{f(\mathbf{y})}
  = \frac{f(\mathbf{y}|\Theta)\,\pi(\Theta)}{\int_{\chi} f(\mathbf{y}|\Theta')\,\pi(\Theta')\,d\Theta'}.
\]
f(Θ|y) is called the posterior, or "a posteriori", distribution. Sometimes f(y) is called the prior predictive distribution. Often only the proportionality relation
\[
f(\Theta|\mathbf{y}) \propto \pi(\Theta, \mathbf{y}) = f(\mathbf{y}|\Theta)\,\pi(\Theta),
\]
seen as a function of Θ, is needed.
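To make Bayes' formula concrete, here is a minimal numerical sketch (assuming NumPy; the data values are hypothetical and not from the slides): the unnormalized posterior f(y|p)π(p) is evaluated on a grid and divided by a numerical approximation of the evidence f(y).

```python
import numpy as np

# Hypothetical example: y successes out of n Bernoulli trials, uniform prior on p.
y, n = 7, 20
p = np.linspace(1e-6, 1 - 1e-6, 10_000)            # grid over the parameter space

prior = np.ones_like(p)                             # pi(p) proportional to 1 on (0, 1)
log_lik = y * np.log(p) + (n - y) * np.log1p(-p)    # binomial log-likelihood kernel
unnorm = prior * np.exp(log_lik - log_lik.max())    # f(y|p) * pi(p), up to a constant

dp = p[1] - p[0]
posterior = unnorm / (unnorm.sum() * dp)            # divide by the evidence f(y), numerically
post_mean = (p * posterior * dp).sum()
print("posterior mean:", post_mean)                 # should be close to (y+1)/(n+2) ~ 0.364
```

The same grid approach works whenever Θ is low-dimensional; in higher dimensions one instead resorts to Monte Carlo methods.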
An example
• Suppose a hospital has around 200 beds occupied each day, and we want to know the underlying risk that a patient will be infected by MRSA (methicillin-resistant Staphylococcus aureus).
• Looking back at the first six months of the year, we count y = 20 infections in 40000 bed-days.
• Let θ be the risk of infection per 10000 bed-days.
• A reasonable model is y ∼ Po(4θ).

An example (classical modelling)
• ML estimate: θ̂ = y/4 = 20/4 = 5.
• An approximate confidence interval based on a normal approximation:
\[
\hat{\theta} \pm \lambda_{0.975} \sqrt{\frac{\hat{\theta}}{4}}.
\]

An example (Bayesian modelling)
• However, other evidence about the underlying risk may exist, such as the previous year's rates or rates in similar hospitals.
• Suppose this other information, on its own, suggests plausible values of θ of around 10 per 10000, with 95% of the support for θ lying between 5 and 17.
• This can be expressed as a prior on θ:
\[
\theta \sim \Gamma(a, b), \qquad a = 10,\; b = 1.
\]
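A quick check (assuming SciPy; not part of the slides) that the Γ(10, 1) prior indeed encodes the stated belief of a mean around 10 with roughly 95% of its mass between 5 and 17:

```python
from scipy.stats import gamma

a, b = 10, 1                           # shape a and rate b of the Gamma prior
prior = gamma(a, scale=1 / b)          # SciPy uses shape and scale = 1/rate

print("prior mean         :", prior.mean())               # a / b = 10
print("central 95% region :", prior.ppf([0.025, 0.975]))  # roughly (5, 17)
```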
• The posterior distribution is now
\[
f(\theta|y) \propto \theta^{y} e^{-4\theta}\, \theta^{a-1} e^{-b\theta}
  \propto \theta^{y + a - 1} e^{-\theta(4 + b)}.
\]
  The Gamma distribution is thus a conjugate prior for the Poisson likelihood (more about this later).

An example (Bayesian modelling, cont.)
• The posterior is θ | y ∼ Γ(y + a, 4 + b).
• If we want a point estimate of θ, the maximum a posteriori estimator can be used:
\[
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} f(\theta|y).
\]
• A credible (or posterior probability) interval can be found using the quantiles of the posterior distribution.

An example (Bayesian modelling, cont.)
Let's compare what the different distributions look like for the parameter θ:
[Figure: prior, likelihood and posterior densities for θ, plotted over 0 ≤ θ ≤ 20.]

An example: predictions
So what is the difference in prediction, ML vs Bayesian?
[Figure: predictive distributions of a new count z, 0 ≤ z ≤ 50, under the ML plug-in and the Bayesian approach.]

An example: next year
Now suppose that during the following year we observe another 12000 bed-days and we get 72 infections. What happens to the posterior distribution of θ?
[Figure: posterior distribution for θ, 0 ≤ θ ≤ 20, after including the new data.]
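A minimal sketch of these computations (assuming SciPy; not part of the slides), giving the MAP estimate, a 95% credible interval, and the update with the following year's data:

```python
from scipy.stats import gamma

a, b = 10, 1                 # Gamma(a, b) prior (shape, rate)
y, t = 20, 4                 # 20 infections in 40000 bed-days = 4 units of 10000 bed-days

# Posterior: theta | y ~ Gamma(a + y, b + t)
post = gamma(a + y, scale=1 / (b + t))
theta_map = (a + y - 1) / (b + t)           # mode of a Gamma(shape, rate) when shape > 1
print("MAP estimate        :", theta_map)                   # (10 + 20 - 1) / 5 = 5.8
print("95% credible interv.:", post.ppf([0.025, 0.975]))

# Update with the next year's data as stated on the slide: 72 infections in 12000 bed-days
y2, t2 = 72, 1.2
post2 = gamma(a + y + y2, scale=1 / (b + t + t2))
print("updated 95% interv. :", post2.ppf([0.025, 0.975]))
```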
Bayesian statistics: Priors

Prior distributions
We will spend a few slides presenting various forms of priors and their interpretation.
• The main criticism of the Bayesian method is that different priors generate different posterior distributions. From a philosophical point of view you can never reject a prior, since it represents the prior knowledge you had, which is subjective.
• The influence of a prior can be negligible, moderate or enormous. What is important is to know the effect of the prior.
• Often priors are chosen to
  • be non-informative, or
  • generate a computationally convenient posterior density.

Prior distributions (cont.)
The prior distribution is always a choice. For example, suppose we have a binomial likelihood, Bin(n, p), and assume a uniform prior on p, that is, we assume any value of the probability is equally likely.
• The implied prior distribution of the transformed r.v. p^i (here i = 2, ..., 5) is then no longer uniform, as the simulated densities show (see also the sketch below):
[Figure: simulated densities of p^2, p^3, p^4 and p^5 when p is uniform on (0, 1).]
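As a small illustration of this point (a sketch assuming NumPy, not from the slides): under a uniform prior on p, the induced distribution of p^i piles up near zero instead of being uniform.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(1_000_000)               # draws from the uniform prior on p

# If p**i were also "non-informative" (uniform), each probability below would be 0.1.
# Instead it equals 0.1**(1/i), i.e. the induced prior puts much more mass near 0.
for i in (2, 3, 4, 5):
    print(f"P(p^{i} < 0.1) ~", np.mean(p**i < 0.1))
```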
Conjugate priors
One of the most common classes of priors is the conjugate priors.

Definition (Conjugate prior)
A family F of probability distributions on Θ is said to be conjugate (or closed under sampling) for a likelihood function f(x|Θ) if, for every π ∈ F, the posterior distribution f(Θ|x) also belongs to F.

• This definition is mainly of use for families that belong to the same class of parametric distributions.
• The main motivation is typically computational convenience.

Conjugate priors (cont.)
Conjugate priors for Θ for some common likelihoods. All parameters except θ are assumed fixed, and the data (y1, ..., yn) are conditionally independent given Θ = θ.

Likelihood      Prior          Posterior
Bin(n, θ)       Beta(α, β)     Beta(α + y, β + n − y)
Ge(θ)           Beta(α, β)     Beta(α + n, β + Σ_{i=1}^n y_i − n)
NegBin(n, θ)    Beta(α, β)     Beta(α + n, β + y − n)
Gamma(k, θ)     Gamma(α, β)    Gamma(α + nk, β + Σ_{i=1}^n y_i)
Po(θ)           Gamma(α, β)    Gamma(α + Σ_{i=1}^n y_i, β + n)
N(µ, θ⁻¹)       Gamma(α, β)    Gamma(α + n/2, β + Σ_{i=1}^n (y_i − µ)²/2)
N(θ, σ²)        N(m, s²)       N((m/s² + nȳ/σ²)/(1/s² + n/σ²), 1/(1/s² + n/σ²))

Improper priors
Suppose we want "agreement" between classical and Bayesian statistics, so that all information about the parameter comes from the likelihood. The only choice is then π(Θ) ∝ 1. This is an improper prior.

Improper priors (cont.)
An improper prior is a prior that is not a proper density.
• Examples:
  • π(Θ) ∝ 1 when the parameter space is R^d.
  • π(Θ) ∝ e^Θ when the parameter space is R^d.
  • π(θ) ∝ |θ|⁻¹ when the parameter space is R.
• Improper priors might still generate a proper posterior for some likelihoods; a numerical check for the Poisson example is sketched below.
  • For more complex models this is hard to verify.
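A small numerical check of that last point (assuming NumPy and SciPy; not part of the slides): with the Poisson likelihood from the MRSA example, the flat improper prior π(θ) ∝ 1 yields a posterior proportional to θ^y e^{−4θ}, which integrates to a finite value and is therefore the proper Gamma(y + 1, 4) distribution.

```python
import numpy as np
from scipy.stats import gamma

# Flat (improper) prior pi(theta) ∝ 1 with the Poisson likelihood y ~ Po(4*theta).
y, t = 20, 4
theta = np.linspace(1e-6, 30, 200_000)
unnorm = theta**y * np.exp(-t * theta)          # unnormalized posterior on a grid

dtheta = theta[1] - theta[0]
norm = (unnorm * dtheta).sum()                  # finite, so the posterior is proper
grid_mean = (theta * unnorm * dtheta).sum() / norm

print("posterior mean (grid)      :", grid_mean)
print("posterior mean Gamma(y+1,4):", gamma(y + 1, scale=1 / t).mean())   # (y+1)/t = 5.25
```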