CSE 788.04: Topics in Machine Learning                          Lecture Date: April 9, 2012

Lecture 5: Bayesian Linear Regression

Lecturer: Brian Kulis                                           Scribe: Jimmy Voss

1 Linear Regression

1.1 Classical Approach

Given input pairs
\[
(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), \qquad x_i \in \mathbb{R}^d, \quad y_i \in \mathbb{R},
\]
linear regression models the predicted value for each $y_i$ as a linear combination of the features of $x_i = (x_i(1), \ldots, x_i(d))^T$:
\[
\hat{y}_i = w_0 + w_1 x_i(1) + \cdots + w_d x_i(d) = w_0 + w^T x_i.
\]
To compress the notation, the bias term $w_0$ can be absorbed by mapping
\[
x_i \mapsto \begin{pmatrix} x_i \\ 1 \end{pmatrix}, \qquad
w \mapsto \begin{pmatrix} w \\ w_0 \end{pmatrix},
\]
so that the linear regression equation becomes $\hat{y}_i = w^T x_i$.

The observed value of $y_i$ is often considered to be made up of two components,
\[
y_i = w^T x_i + \epsilon, \tag{1}
\]
where $\epsilon$ is an error term assumed to be Gaussian noise: $\epsilon \sim N(0, \sigma^2)$. This gives the following probability model for $y_i$:
\[
P(y_i \mid w, x_i, \sigma^2) = N(w^T x_i, \sigma^2), \qquad
P(y \mid w, X, \sigma^2) = \prod_i P(y_i \mid w, x_i, \sigma^2).
\]
Classically, this likelihood function is maximized with respect to $w$. Define
\[
X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}.
\]
Notice that each $x_i$ appears as a row of $X$ rather than a column. To find the maximum likelihood estimate of $w$, one typically maximizes the log-likelihood:
\[
\ln P(y \mid w, X, \sigma^2) = \sum_i \ln N(y_i \mid w^T x_i, \sigma^2)
= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - w^T x_i)^2.
\]
The maximization problem is therefore equivalent to finding
\[
w_{MLE} = \operatorname*{argmin}_w \; \frac{1}{2}\sum_{i=1}^n \left(y_i - x_i^T w\right)^2. \tag{2}
\]
Differentiating the right-hand side with respect to $w$ and setting the result equal to zero to obtain the critical points yields
\[
\sum_{i=1}^n \left(y_i - x_i^T w\right) x_i = 0
\quad\Longrightarrow\quad
w = (X^T X)^{-1} X^T y.
\]

1.2 Treating Regression as an Optimization Problem

Solving for $w_{MLE}$ in equation (2) can be viewed as an optimization problem. This view is useful because it can avoid numerical issues such as non-invertible matrices, and it allows a penalty to be placed on overly complicated weight vectors $w$. Two such techniques follow:

1. Ridge regression minimizes (with respect to $w$)
\[
\frac{1}{2}\sum_{i=1}^n (y_i - w^T x_i)^2 + \frac{\lambda}{2} w^T w, \tag{3}
\]
using an $L_2$-norm regularization term. The result is
\[
w = (\lambda I + X^T X)^{-1} X^T y. \tag{4}
\]

2. The Lasso minimizes (with respect to $w$)
\[
\frac{1}{2}\sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda \sum_{j=1}^d |w_j|,
\]
using an $L_1$-norm regularization term. This often results in sparse solutions, that is, many of the $w_j$ are 0.

2 Bayesian Linear Regression

In the Bayesian case, one puts prior distributions on one or both of $w$ and $\sigma^2$.

2.1 $w$ is given a prior distribution

The conjugate prior for $w$ is the normal distribution, $P(w) \sim N(\mu_0, S_0)$. The posterior distribution is
\[
P(w \mid y) \propto P(y \mid w)P(w), \qquad P(w \mid y) \sim N(\mu, S),
\]
with
\[
S^{-1} = S_0^{-1} + \frac{1}{\sigma^2} X^T X, \qquad
\mu = S\left(S_0^{-1}\mu_0 + \frac{1}{\sigma^2} X^T y\right).
\]
If one assumes that $S_0 = (1/\alpha)I$ and $\mu_0 = 0$, then
\[
\mu = \left(\alpha I + \frac{1}{\sigma^2} X^T X\right)^{-1}\frac{1}{\sigma^2} X^T y
= \left(\alpha\sigma^2 I + X^T X\right)^{-1} X^T y,
\]
which is the weight vector found via ridge regression in equation (4) with $\lambda = \alpha\sigma^2$. Looking at the maximum a posteriori (MAP) estimate yields the same result.
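To make this equivalence concrete, the following sketch (not part of the original notes; the synthetic data and the choices $\alpha = 2$, $\sigma^2 = 0.25$ are illustrative assumptions) computes the posterior mean under the prior $S_0 = (1/\alpha)I$, $\mu_0 = 0$ and checks numerically that it matches the ridge solution of equation (4) with $\lambda = \alpha\sigma^2$:

# Minimal NumPy sketch (not from the lecture): posterior for w under the
# Gaussian prior S0 = (1/alpha) I, mu0 = 0, compared against ridge regression.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma2 = 0.25                      # noise variance sigma^2 (assumed known)
alpha = 2.0                        # prior precision: S0 = (1/alpha) I

X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior: S^{-1} = S0^{-1} + (1/sigma^2) X^T X,  mu = S (1/sigma^2) X^T y
S_inv = alpha * np.eye(d) + (X.T @ X) / sigma2
mu = np.linalg.solve(S_inv, X.T @ y / sigma2)

# Ridge regression, equation (4), with lambda = alpha * sigma^2
lam = alpha * sigma2
w_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

print(np.allclose(mu, w_ridge))    # True: posterior mean equals the ridge estimate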
Alternatively, one can use as the prior the Laplace distribution centered at the origin:
\[
P(w_i \mid \alpha) = \frac{1}{2}\cdot\frac{\alpha}{2}\exp\left(-\frac{\alpha}{2}|w_i|\right).
\]
Notice that the Laplace distribution has a sharp peak at the origin, which gets higher as $\alpha$ is made larger. Using the Laplace distribution as a prior for each $w_i$ yields
\[
P(w \mid \alpha) = \left(\frac{1}{2}\cdot\frac{\alpha}{2}\right)^d \exp\left(-\frac{\alpha}{2}\sum_i |w_i|\right).
\]
The MAP estimate for $w$ is calculated as
\[
\operatorname*{argmax}_w P(w \mid y) \propto P(y \mid w)P(w \mid \alpha), \qquad
\ln P(w \mid y) = \ln P(y \mid w) + \ln P(w \mid \alpha) + \text{constant},
\]
which is an equivalent problem to minimizing the Lasso objective.

Given a new input $x_{new}$ for which $y_{new}$ is to be predicted, the classical framework predicts
\[
\hat{y}_{new} = w^T x_{new}.
\]
In the Bayesian framework, one looks at the predictive distribution
\[
P(y_{new} \mid y, X, x_{new}, \sigma^2) = \int P(y_{new} \mid w, x_{new}, \sigma^2)\, P(w \mid y, X)\, dw,
\]
which is Gaussian. The mean of $P(y_{new} \mid y, X, x_{new}, \sigma^2)$ is $\tilde{w}^T x_{new}$, where $\tilde{w}$ is the posterior mean.

2.2 $w$ and $\sigma^2$ are both given prior distributions

For the conjugate priors in this case, we use
\[
w \mid \sigma^2 \sim N(\mu_0, \sigma^2 S_0), \qquad \sigma^2 \sim IG(\alpha, \beta),
\]
where $IG$ denotes the inverse gamma distribution. The resulting posterior distribution is
\[
P(w, \sigma^2 \mid y, X) \sim N(\mu_n, \sigma^2 \Lambda_n^{-1})\, IG(\alpha_n, \beta_n),
\]
where
\[
\alpha_n = \alpha + n/2, \qquad
\beta_n = \beta + \frac{1}{2}\left(y^T y + \mu_0^T S_0^{-1}\mu_0 - \mu_n^T \Lambda_n \mu_n\right),
\]
\[
\mu_n = (X^T X + S_0^{-1})^{-1}(X^T y + S_0^{-1}\mu_0), \qquad
\Lambda_n = X^T X + S_0^{-1}.
\]
In practice, this model is too informative and has too many constants to set. People often prefer the g-prior, which is a special case of this general prior distribution. In the g-prior, the constants $S_0$ and $\mu_0$ are set to
\[
S_0 = g(X^T X)^{-1}, \qquad \mu_0 = 0,
\]
and one takes $\alpha \to 0$, $\beta \to 0$ via limits, since the inverse gamma is not defined at $\alpha = 0$, $\beta = 0$. Then $P(\sigma^2) \propto 1/\sigma^2$ is the Jeffreys prior, and
\[
w \mid \sigma^2 \sim N\!\left(0,\; g\sigma^2 (X^T X)^{-1}\right).
\]
Under the g-prior, the weight vector $w$ has posterior mean
\[
E(w \mid y) = \frac{1}{g+1}\,\mu_0 + \frac{g}{g+1}\, w_{LS},
\]
where $w_{LS}$ is the least squares solution for $w$. With $\mu_0 = 0$, this becomes
\[
E(w \mid y) = \frac{g}{g+1}\, w_{LS}.
\]
Note that $X^T X$ gives some indication of which coordinates carry more relevance; this is a very high level argument for why $X^T X$ might be useful within the definition of the prior distribution.

This leaves a question: how does one set the value of $g$? Several strategies can be employed:

1. Put a prior on $g$. This is the fully Bayesian answer; typically an inverse gamma distribution is used. However, at some point one must stop going down the Bayesian rabbit hole and actually assign values directly to the hyperparameters.

2. Use cross-validation. A portion of the data is held out and used to determine the best value of $g$.

3. Empirical Bayes: choose
\[
\hat{g} = \operatorname*{argmax}_g P(y \mid X, g),
\]
which maximizes the marginal likelihood of $y$ after integrating out all other parameters; see the sketch after this list. Typically this is computed numerically. While it has the disadvantage that $g$ is chosen partially based on the input data, empirical Bayes remains an important and practical method.
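As a rough illustration of strategy 3 (not part of the original notes), the sketch below selects $g$ by numerically maximizing the marginal likelihood over a grid. It assumes the g-prior with $\mu_0 = 0$ and the Jeffreys prior $P(\sigma^2) \propto 1/\sigma^2$ described above; under those assumptions, a standard calculation (integrating out $w$ and then $\sigma^2$) gives, up to an additive constant,
\[
\ln P(y \mid X, g) = -\frac{d}{2}\ln(1+g) - \frac{n}{2}\ln\!\left(y^T y - \frac{g}{1+g}\, y^T H y\right),
\qquad H = X(X^T X)^{-1}X^T.
\]
The synthetic data and the grid of candidate values are arbitrary choices.

# Hedged sketch of empirical Bayes for g (strategy 3): maximize the marginal
# likelihood p(y | X, g) over a grid, using the closed form stated above
# (g-prior with mu0 = 0 and Jeffreys prior on sigma^2).
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=0.5, size=n)

yTy = y @ y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)   # least squares solution w_LS
yHy = y @ (X @ w_ls)                       # y^T H y = y^T X (X^T X)^{-1} X^T y

def log_marginal(g):
    # ln p(y | X, g) up to an additive constant independent of g
    return -0.5 * d * np.log1p(g) - 0.5 * n * np.log(yTy - (g / (1.0 + g)) * yHy)

grid = np.logspace(-2, 4, 200)             # candidate values of g
g_hat = grid[np.argmax([log_marginal(g) for g in grid])]
print("empirical Bayes estimate of g:", g_hat)

# Posterior mean of w under the selected g: shrink the least squares solution
w_post = (g_hat / (g_hat + 1.0)) * w_ls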