
Linear Regression
Rebecca C. Steorts
Bayesian Methods and Modern Statistics: STA 360/601
Module 10
Setup
Let's assume that $D_i = (x_i, y_i)$ for all $i$. Assume
$$Y_i \overset{iid}{\sim} N(w^T x_i, \sigma^2).$$
Assume $\sigma^2$ known and $\theta = w$.
What is the MLE?
$$\theta_{MLE} = \arg\max_{\theta \in \Theta} p(D \mid \theta)$$
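To make this setup concrete, here is a minimal simulation sketch (not from the slides; the coefficients, noise level, and covariates are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                               # hypothetical sample size and dimension
w_true = np.array([1.0, -2.0, 0.5])         # hypothetical true coefficients w
sigma = 1.0                                 # sigma^2 assumed known

X = rng.normal(size=(n, d))                 # covariates x_i, stacked as rows
y = X @ w_true + rng.normal(scale=sigma, size=n)   # Y_i ~ N(w^T x_i, sigma^2)
```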
What is the likelihood? (Want to get to the MLE.)
Define $y = (y_1, \ldots, y_n)$. Note that $w^T x_i = x_i^T w$. Define $A = (x_1^T, \ldots, x_n^T)$. ($A$ is often called the design matrix.)
\begin{align}
p(D \mid \theta) &= p(y \mid x, \theta) \\
&= \prod_i p(y_i \mid x_i, \theta) \\
&= \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-1/(2\sigma^2)\,(y_i - w^T x_i)^2\} \\
&= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\Big\{-1/(2\sigma^2) \sum_i (y_i - w^T x_i)^2\Big\} \\
&= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\{-1/(2\sigma^2)\,(y - Aw)^T (y - Aw)\}
\end{align}
Goal: minimize
$$(y - Aw)^T (y - Aw).$$
(Think about why we're minimizing.)
Goal: minimize
$$(y - Aw)^T (y - Aw).$$
Expand what we have above:
$$g := (y - Aw)^T (y - Aw) = y^T y - 2 w^T A^T y + w^T A^T A w.$$
Now take the gradient (derivative) with respect to $w$ and set it equal to zero:
$$\frac{\partial g}{\partial w} = -2 A^T y + 2 A^T A w = 0.$$
This implies that
$$A^T y = A^T A w \implies \hat{\theta} = (A^T A)^{-1} A^T y.$$
Why is $A^T A$ invertible? (exercise). Hint: this also shows that $\hat{\theta}$ is unique!
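A numerical sketch of this closed form (self-contained; the simulated data and true coefficients are hypothetical), checking $(A^T A)^{-1} A^T y$ against a standard least-squares solver:

```python
import numpy as np

# simulate data as in the earlier sketch (hypothetical w_true and noise level)
rng = np.random.default_rng(0)
n, d = 100, 3
w_true = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(n, d))                          # design matrix, rows x_i^T
y = A @ w_true + rng.normal(scale=1.0, size=n)

# MLE: theta_hat = (A^T A)^{-1} A^T y; solve() is numerically safer than forming the inverse
theta_hat = np.linalg.solve(A.T @ A, A.T @ y)

# the same answer from a standard least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(theta_hat, theta_lstsq)
print(theta_hat)                                     # close to w_true
```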
Matrix Facts on previous slide
Note: we're using the fact from matrix algebra that
$$\frac{\partial}{\partial w_j}\, a^T w = \frac{\partial}{\partial w_j} \sum_i a_i w_i = a_j.$$
The second fact we use concerns the derivative of a quadratic form. Assume $B$ is symmetric. Then
$$\frac{\partial}{\partial w_k}\, w^T B w = \frac{\partial}{\partial w_k} \sum_{i,j=1}^n w_i w_j b_{ij},$$
where the term $w_i w_j b_{ij}$ contributes
$$\begin{cases} 2 w_i b_{ij} & \text{if } i = j = k, \\ w_i b_{ij} & \text{if } j = k,\ i \neq j, \end{cases}$$
and symmetrically $w_j b_{ij}$ if $i = k$, $j \neq i$. Summing over $i, j$ and using the symmetry of $B$ gives $\frac{\partial}{\partial w}\, w^T B w = 2 B w$, the form used on the previous slide.
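These identities are easy to sanity-check numerically; a small self-contained sketch (arbitrary hypothetical test vector and a random symmetric matrix) using central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
w = rng.normal(size=d)
avec = rng.normal(size=d)                      # the vector a in a^T w
B = rng.normal(size=(d, d))
B = (B + B.T) / 2                              # make B symmetric

def num_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# d/dw (a^T w) = a
assert np.allclose(num_grad(lambda v: avec @ v, w), avec, atol=1e-6)
# d/dw (w^T B w) = 2 B w   (B symmetric)
assert np.allclose(num_grad(lambda v: v @ B @ v, w), 2 * B @ w, atol=1e-5)
```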
We picked up some nice tricks for working with gradients.
Also, we can show that $\hat{\theta}$ is unbiased (exercise).
What is the variance of $\hat{\theta}$? (exercise)
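For intuition on both exercises, a self-contained Monte Carlo sketch (hypothetical fixed design, coefficients, and noise level) comparing the empirical mean and covariance of $\hat{\theta}$ across repeated data sets with $\theta$ and with the standard OLS covariance $\sigma^2 (A^T A)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 3, 1.0                          # hypothetical sizes and noise level
w_true = np.array([1.0, -2.0, 0.5])                # hypothetical true coefficients
A = rng.normal(size=(n, d))                        # fixed design matrix

reps = 5000
est = np.empty((reps, d))
for r in range(reps):
    y = A @ w_true + rng.normal(scale=sigma, size=n)
    est[r] = np.linalg.solve(A.T @ A, A.T @ y)     # MLE for this data set

print(est.mean(axis=0) - w_true)                   # ~ 0: theta_hat looks unbiased
print(np.cov(est, rowvar=False))                   # compare with the next line
print(sigma**2 * np.linalg.inv(A.T @ A))           # sigma^2 (A^T A)^{-1}
```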
Bayesian linear regression
We derived the MLE. Why not just use the MLE?
The MLE often overfits the data. It also gives no notion of uncertainty.
Now suppose we want to predict a new point, but what if this is a diagnostic for a patient, or an investment for a stock portfolio? How certain are you? We're not certain at all, so let's put in error bars.
Why Bayesian?
The Bayesian approach allows you to say, "I don't know!"
We can tie this back to decision theory and optimize a loss function by working with the predictive distribution
$$p(y \mid x, D).$$
Setup
$D = (x_i, y_i)$ for all $i$. Let $a = 1/\sigma^2$ and $b = 1/\tau^2$.
\begin{align}
y_i \mid w &\overset{ind}{\sim} N(w^T x_i, a^{-1}) \\
w &\sim MVN(0, b^{-1} I)
\end{align}
We assume that $a, b$ are known. Here, $\theta = w$.
Recall: review the multivariate normal modules, as those results are needed to understand this module.
Computing the Posterior
What is the likelihood?
$$p(D \mid w) \propto \exp\{-(a/2)\,(y - Aw)^T (y - Aw)\}$$
What is the posterior?
\begin{align}
p(w \mid D) &\propto p(D \mid w)\, p(w) \\
&\propto \exp\{-(a/2)\,(y - Aw)^T (y - Aw)\} \times \exp\{-(b/2)\, w^T w\}
\end{align}
Just like in the multivariate modules, we simplify by completing the square. (Check these details on your own.)
$$p(w \mid D) \propto MVN(w \mid \mu, \Lambda^{-1}),$$
where $\Lambda = a A^T A + b I$ and $\mu = a \Lambda^{-1} A^T y$.
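A short, self-contained sketch (hypothetical simulated data and assumed known precisions $a$ and $b$) computing $\Lambda$ and $\mu$ directly from these formulas:

```python
import numpy as np

# posterior for Bayesian linear regression: Lambda = a A^T A + b I, mu = a Lambda^{-1} A^T y
rng = np.random.default_rng(0)
n, d = 100, 3
a, b = 1.0, 0.5                                # hypothetical precisions a = 1/sigma^2, b = 1/tau^2
w_true = np.array([1.0, -2.0, 0.5])            # hypothetical coefficients used to simulate data
A = rng.normal(size=(n, d))                    # design matrix
y = A @ w_true + rng.normal(scale=1/np.sqrt(a), size=n)

Lam = a * A.T @ A + b * np.eye(d)              # posterior precision Lambda
mu = a * np.linalg.solve(Lam, A.T @ y)         # posterior mean mu = a Lambda^{-1} A^T y
post_cov = np.linalg.inv(Lam)                  # posterior covariance Lambda^{-1}
print(mu)
print(post_cov)
```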
You can show (exercise) that the maximum a posteriori (MAP) estimate of $w$ is
$$a(a A^T A + b I)^{-1} A^T y = (A^T A + (b/a) I)^{-1} A^T y.$$
How does this compare to the MLE estimate? Think about this on your own!
You will see more about Bayesian linear regression in lab. (For more on this, see Hoff.)
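The identity above says the MAP estimate is a ridge-regression solution with penalty $b/a$; a self-contained numerical sketch (hypothetical data and precisions) that checks this and compares it to the MLE:

```python
import numpy as np

# MAP estimate a (a A^T A + b I)^{-1} A^T y equals the ridge solution (A^T A + (b/a) I)^{-1} A^T y
rng = np.random.default_rng(0)
n, d = 100, 3
a, b = 1.0, 0.5                                       # hypothetical precisions
A = rng.normal(size=(n, d))
y = A @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1/np.sqrt(a), size=n)

w_map = a * np.linalg.solve(a * A.T @ A + b * np.eye(d), A.T @ y)
w_ridge = np.linalg.solve(A.T @ A + (b / a) * np.eye(d), A.T @ y)
w_mle = np.linalg.solve(A.T @ A, A.T @ y)

assert np.allclose(w_map, w_ridge)
print(w_map)    # shrunk toward zero relative to...
print(w_mle)    # ...the MLE; the amount of shrinkage is controlled by b/a
```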
The predictive distribution
\begin{align}
p(y \mid x, D) &= \int p(y \mid x, D, w)\, p(w \mid x, D)\, dw \\
&= \int p(y \mid x, w)\, p(w \mid D)\, dw \\
&= \int N(y \mid w^T x, a^{-1})\, N(w \mid \mu, \Lambda^{-1})\, dw \\
&\propto \int \exp\{-(a/2)\,(y - w^T x)^2\}\, \exp\{-(1/2)\,(w - \mu)^T \Lambda (w - \mu)\}\, dw \\
&\propto \int \exp\{-(a/2)\,(y^2 - 2(w^T x)y + (w^T x)^2) - (1/2)\,(w^T \Lambda w - 2 w^T \Lambda \mu + \mu^T \Lambda \mu)\}\, dw
\end{align}
\begin{align}
p(y \mid x, D) &= \int N(y \mid w^T x, a^{-1})\, N(w \mid \mu, \Lambda^{-1})\, dw \\
&\propto \int \exp\{-(a/2)\,(y^2 - 2(w^T x)y + (w^T x)^2) - (1/2)\,(w^T \Lambda w - 2 w^T \Lambda \mu + \mu^T \Lambda \mu)\}\, dw
\end{align}
Our goal is to make the above into
$$\int N(w \mid -)\, g(y)\, dw = g(y) \propto N(y \mid -).$$
How can you do this?
Exercise
1. First show that the above can be written as $\int N(w \mid -)\, g(y)\, dw$, and give the parameters of the Gaussian.
2. Next, show (trivially) that $\int N(w \mid -)\, g(y)\, dw = g(y)$ with the parameters in 1.
3. Finally, show that $g(y) \propto N(y \mid -)$ and give the parameters for the normal of the predictive distribution.
To summarize, you should in the end find that
$$p(y \mid x, D) \propto \exp\{-(\lambda/2)(y - u)^2\},$$
where $\lambda = a(1 - a x^T L^{-1} x)$, $u = \lambda^{-1} a x^T L^{-1} \Lambda \mu$, and $L = a x x^T + \Lambda$.
(You may assume that $w$ has mean $m$.)
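A self-contained Monte Carlo sketch (hypothetical data, precisions, and test point $x$) that checks this closed form by drawing $w$ from the posterior and then $y$ given $w$ (here the posterior mean $\mu$ plays the role of $m$):

```python
import numpy as np

# Monte Carlo check of the predictive distribution against the closed form above
rng = np.random.default_rng(0)
n, d = 100, 3
a, b = 1.0, 0.5                                       # hypothetical precisions
A = rng.normal(size=(n, d))
y = A @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1/np.sqrt(a), size=n)

Lam = a * A.T @ A + b * np.eye(d)                     # posterior precision
mu = a * np.linalg.solve(Lam, A.T @ y)                # posterior mean
x_new = rng.normal(size=d)                            # a hypothetical new input x

# closed form from the exercise: L = a x x^T + Lambda,
# lambda = a(1 - a x^T L^{-1} x), u = lambda^{-1} a x^T L^{-1} Lambda mu
L = a * np.outer(x_new, x_new) + Lam
lam = a * (1 - a * x_new @ np.linalg.solve(L, x_new))
u = (a / lam) * x_new @ np.linalg.solve(L, Lam @ mu)

# Monte Carlo: draw w ~ N(mu, Lambda^{-1}), then y_new ~ N(w^T x, 1/a)
ws = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=200_000)
y_new = ws @ x_new + rng.normal(scale=1/np.sqrt(a), size=len(ws))
print(u, y_new.mean())                                # predictive mean
print(1/lam, y_new.var())                             # predictive variance 1/lambda
```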
Food for Thought
- When does $\hat{\theta}_B = \hat{\theta}_{MLE}$? (What does this mean in terms of the prior variance or precision?) A numerical sketch follows below.
- Suppose $\hat{\theta}_B \neq \hat{\theta}_{MLE}$. What benefit are we getting from Bayesian linear regression in this case over ordinary least squares? Explain.
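For the first question, one way to explore it numerically (a sketch with hypothetical data; the point of interest is the behavior as the prior precision $b$ shrinks toward zero, i.e. the prior variance grows):

```python
import numpy as np

# As the prior precision b -> 0 (prior variance -> infinity), the MAP estimate approaches the MLE.
rng = np.random.default_rng(0)
n, d, a = 100, 3, 1.0                                # hypothetical sizes and noise precision
A = rng.normal(size=(n, d))
y = A @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1/np.sqrt(a), size=n)

theta_mle = np.linalg.solve(A.T @ A, A.T @ y)
for b in [10.0, 1.0, 0.01, 1e-6]:
    theta_map = np.linalg.solve(A.T @ A + (b / a) * np.eye(d), A.T @ y)
    print(b, np.linalg.norm(theta_map - theta_mle))  # gap shrinks as b -> 0
```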