CSE 788.04: Topics in Machine Learning                          Lecture Date: April 9, 2012

Lecture 5: Bayesian Linear Regression

Lecturer: Brian Kulis                                           Scribe: Jimmy Voss

1 Linear Regression

1.1 Classical Approach

Given input pairs
\[
(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), \qquad x_i \in \mathbb{R}^d, \quad y_i \in \mathbb{R},
\]
linear regression models the predicted value for each $y_i$ as a linear combination of the features of $x_i = (x_i(1), \ldots, x_i(d))^T$:
\[
\hat{y}_i = w_0 + w_1 x_i(1) + \cdots + w_d x_i(d) = w_0 + w^T x_i.
\]
To compress the notation, the bias term $w_0$ can be absorbed by mapping
\[
x_i \mapsto \begin{pmatrix} x_i \\ 1 \end{pmatrix}, \qquad
w \mapsto \begin{pmatrix} w \\ w_0 \end{pmatrix},
\]
so that the linear regression equation becomes $\hat{y}_i = w^T x_i$.

The observed value of $y_i$ is often considered to be made up of two components,
\[
y_i = w^T x_i + \epsilon, \tag{1}
\]
where $\epsilon$ is an error term assumed to be Gaussian noise: $\epsilon \sim N(0, \sigma^2)$. This gives the following probability model for $y_i$:
\[
P(y_i \mid w, x_i, \sigma^2) = N(w^T x_i, \sigma^2), \qquad
P(y \mid w, X, \sigma^2) = \prod_i P(y_i \mid w, x_i, \sigma^2).
\]
Classically, this likelihood function is maximized with respect to $w$. Define
\[
X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}.
\]
Notice that each $x_i$ appears as a row of $X$ rather than a column. To find the maximum likelihood estimate of $w$, one typically maximizes the log-likelihood:
\[
\ln P(y \mid w, X, \sigma^2) = \sum_i \ln N(y_i \mid w^T x_i, \sigma^2)
= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - w^T x_i)^2.
\]
The maximization problem is therefore equivalent to finding
\[
w_{MLE} = \operatorname*{argmin}_w \; \frac{1}{2}\sum_{i=1}^n \left(y_i - x_i^T w\right)^2. \tag{2}
\]
Differentiating the right-hand side with respect to $w$ and setting the result equal to zero to obtain the critical points yields
\[
\sum_{i=1}^n \left(y_i - x_i^T w\right) x_i = 0
\quad\Longrightarrow\quad
w = (X^T X)^{-1} X^T y.
\]

1.2 Treating Regression as an Optimization Problem

Solving for $w_{MLE}$ in equation (2) can be viewed as an optimization problem. This view is useful because it can avoid numerical issues such as non-invertible matrices, and it allows a penalty to be placed on overly complicated weight vectors $w$. Two such techniques follow:

1. Ridge regression minimizes (with respect to $w$)
\[
\frac{1}{2}\sum_{i=1}^n (y_i - w^T x_i)^2 + \frac{\lambda}{2} w^T w, \tag{3}
\]
using an $L_2$-norm regularization term. The result is
\[
w = (\lambda I + X^T X)^{-1} X^T y. \tag{4}
\]

2. The Lasso minimizes (with respect to $w$)
\[
\frac{1}{2}\sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda \sum_{j=1}^d |w_j|,
\]
using an $L_1$-norm regularization term. This often results in sparse solutions, that is, many of the $w_j$ are 0.

2 Bayesian Linear Regression

In the Bayesian case, one puts prior distributions on one or both of $w$ and $\sigma^2$.

2.1 $w$ is given a prior distribution

The conjugate prior for $w$ is the normal distribution, $P(w) \sim N(\mu_0, S_0)$. The posterior distribution is
\[
P(w \mid y) \propto P(y \mid w)P(w), \qquad P(w \mid y) \sim N(\mu, S),
\]
with
\[
S^{-1} = S_0^{-1} + \frac{1}{\sigma^2} X^T X, \qquad
\mu = S\left(S_0^{-1}\mu_0 + \frac{1}{\sigma^2} X^T y\right).
\]
If one assumes that $S_0 = (1/\alpha)I$ and $\mu_0 = 0$, then
\[
\mu = \left(\alpha I + \frac{1}{\sigma^2} X^T X\right)^{-1}\frac{1}{\sigma^2} X^T y
= \left(\alpha\sigma^2 I + X^T X\right)^{-1} X^T y,
\]
which is the weight vector found via ridge regression in equation (4) with $\lambda = \alpha\sigma^2$. Looking at the maximum a posteriori (MAP) estimate yields the same result.
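To make this equivalence concrete, the following sketch (not part of the original notes; the synthetic data and the choices $\alpha = 2$, $\sigma^2 = 0.25$ are illustrative assumptions) computes the posterior mean under the prior $S_0 = (1/\alpha)I$, $\mu_0 = 0$ and checks numerically that it matches the ridge solution of equation (4) with $\lambda = \alpha\sigma^2$:

# Minimal NumPy sketch (not from the lecture): posterior for w under the
# Gaussian prior S0 = (1/alpha) I, mu0 = 0, compared against ridge regression.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma2 = 0.25                      # noise variance sigma^2 (assumed known)
alpha = 2.0                        # prior precision: S0 = (1/alpha) I

X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior: S^{-1} = S0^{-1} + (1/sigma^2) X^T X,  mu = S (1/sigma^2) X^T y
S_inv = alpha * np.eye(d) + (X.T @ X) / sigma2
mu = np.linalg.solve(S_inv, X.T @ y / sigma2)

# Ridge regression, equation (4), with lambda = alpha * sigma^2
lam = alpha * sigma2
w_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

print(np.allclose(mu, w_ridge))    # True: posterior mean equals the ridge estimate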
Alternatively, one can use as the prior the Laplace distribution centered at the origin:
\[
P(w_i \mid \alpha) = \frac{1}{2}\cdot\frac{\alpha}{2}\exp\left(-\frac{\alpha}{2}|w_i|\right).
\]
Notice that the Laplace distribution has a sharp peak at the origin, which gets higher as $\alpha$ is made larger. Using the Laplace distribution as a prior for each $w_i$ yields
\[
P(w \mid \alpha) = \left(\frac{1}{2}\cdot\frac{\alpha}{2}\right)^d \exp\left(-\frac{\alpha}{2}\sum_i |w_i|\right).
\]
The MAP estimate for $w$ is calculated as
\[
\operatorname*{argmax}_w P(w \mid y) \propto P(y \mid w)P(w \mid \alpha), \qquad
\ln P(w \mid y) = \ln P(y \mid w) + \ln P(w \mid \alpha) + \text{constant},
\]
which is an equivalent problem to minimizing the Lasso objective.

Given a new input $x_{new}$ for which $y_{new}$ is to be predicted, the classical framework predicts
\[
\hat{y}_{new} = w^T x_{new}.
\]
In the Bayesian framework, one looks at the predictive distribution
\[
P(y_{new} \mid y, X, x_{new}, \sigma^2) = \int P(y_{new} \mid w, x_{new}, \sigma^2)\, P(w \mid y, X)\, dw,
\]
which is Gaussian. The mean of $P(y_{new} \mid y, X, x_{new}, \sigma^2)$ is $\tilde{w}^T x_{new}$, where $\tilde{w}$ is the posterior mean.

2.2 $w$ and $\sigma^2$ are both given prior distributions

For the conjugate priors in this case, we use
\[
w \mid \sigma^2 \sim N(\mu_0, \sigma^2 S_0), \qquad \sigma^2 \sim IG(\alpha, \beta),
\]
where $IG$ denotes the inverse gamma distribution. The resulting posterior distribution is
\[
P(w, \sigma^2 \mid y, X) \sim N(\mu_n, \sigma^2 \Lambda_n^{-1})\, IG(\alpha_n, \beta_n),
\]
where
\[
\alpha_n = \alpha + n/2, \qquad
\beta_n = \beta + \frac{1}{2}\left(y^T y + \mu_0^T S_0^{-1}\mu_0 - \mu_n^T \Lambda_n \mu_n\right),
\]
\[
\mu_n = (X^T X + S_0^{-1})^{-1}(X^T y + S_0^{-1}\mu_0), \qquad
\Lambda_n = X^T X + S_0^{-1}.
\]
In practice, this model is too informative and has too many constants to set. People often prefer the g-prior, which is a special case of this general prior distribution. In the g-prior, the constants $S_0$ and $\mu_0$ are set to
\[
S_0 = g(X^T X)^{-1}, \qquad \mu_0 = 0,
\]
and one takes $\alpha \to 0$, $\beta \to 0$ via limits, since the inverse gamma is not defined at $\alpha = 0$, $\beta = 0$. Then $P(\sigma^2) \propto 1/\sigma^2$ is the Jeffreys prior, and
\[
w \mid \sigma^2 \sim N\!\left(0,\; g\sigma^2 (X^T X)^{-1}\right).
\]
Under the g-prior, the weight vector $w$ has posterior mean
\[
E(w \mid y) = \frac{1}{g+1}\,\mu_0 + \frac{g}{g+1}\, w_{LS},
\]
where $w_{LS}$ is the least squares solution for $w$. With $\mu_0 = 0$, this becomes
\[
E(w \mid y) = \frac{g}{g+1}\, w_{LS}.
\]
Note that $X^T X$ gives some indication of which coordinates carry more relevance; this is a very high level argument for why $X^T X$ might be useful within the definition of the prior distribution.

This leaves a question: how does one set the value of $g$? Several strategies can be employed:

1. Put a prior on $g$. This is the fully Bayesian answer; typically an inverse gamma distribution is used. However, at some point one must stop going down the Bayesian rabbit hole and actually assign values directly to the hyperparameters.

2. Use cross-validation. A portion of the data is held out and used to determine the best value of $g$.

3. Empirical Bayes: choose
\[
\hat{g} = \operatorname*{argmax}_g P(y \mid X, g),
\]
which maximizes the marginal likelihood of $y$ after integrating out all other parameters; see the sketch after this list. Typically this is computed numerically. While it has the disadvantage that $g$ is chosen partially based on the input data, empirical Bayes remains an important and practical method.
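As a rough illustration of strategy 3 (not part of the original notes), the sketch below selects $g$ by numerically maximizing the marginal likelihood over a grid. It assumes the g-prior with $\mu_0 = 0$ and the Jeffreys prior $P(\sigma^2) \propto 1/\sigma^2$ described above; under those assumptions, a standard calculation (integrating out $w$ and then $\sigma^2$) gives, up to an additive constant,
\[
\ln P(y \mid X, g) = -\frac{d}{2}\ln(1+g) - \frac{n}{2}\ln\!\left(y^T y - \frac{g}{1+g}\, y^T H y\right),
\qquad H = X(X^T X)^{-1}X^T.
\]
The synthetic data and the grid of candidate values are arbitrary choices.

# Hedged sketch of empirical Bayes for g (strategy 3): maximize the marginal
# likelihood p(y | X, g) over a grid, using the closed form stated above
# (g-prior with mu0 = 0 and Jeffreys prior on sigma^2).
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=0.5, size=n)

yTy = y @ y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)   # least squares solution w_LS
yHy = y @ (X @ w_ls)                       # y^T H y = y^T X (X^T X)^{-1} X^T y

def log_marginal(g):
    # ln p(y | X, g) up to an additive constant independent of g
    return -0.5 * d * np.log1p(g) - 0.5 * n * np.log(yTy - (g / (1.0 + g)) * yHy)

grid = np.logspace(-2, 4, 200)             # candidate values of g
g_hat = grid[np.argmax([log_marginal(g) for g in grid])]
print("empirical Bayes estimate of g:", g_hat)

# Posterior mean of w under the selected g: shrink the least squares solution
w_post = (g_hat / (g_hat + 1.0)) * w_ls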