Linear Regression
10-601 Introduction to Machine Learning
Carnegie Mellon University, School of Computer Science
Readings: Bishop 3.1
Matt Gormley
Lecture 5, September 14, 2016

Reminders
•  Homework 2
   –  Extension: due Friday (9/16) at 5:30pm
•  Recitation schedule posted on course website

Outline
•  Linear Regression
   –  Simple example
   –  Model
•  Learning (aka. Least Squares)
   –  Gradient Descent
   –  SGD (aka. Least Mean Squares (LMS))
   –  Closed Form (aka. Normal Equations)
•  Advanced Topics
   –  Geometric and Probabilistic Interpretation of LMS
   –  L2 Regularization (aka. Ridge Regression)
   –  L1 Regularization (aka. LASSO)
   –  Features (aka. non-linear basis functions)

Linear regression
•  Our goal is to estimate w from training data of ⟨xi, yi⟩ pairs:
       y = wx + ε
•  Optimization goal: minimize the squared error (least squares):
       arg min_w Σᵢ (yᵢ − w xᵢ)²
•  Why least squares?
   –  minimizes squared distance between measurements and the predicted line
   –  has a nice probabilistic interpretation (see HW)
   –  the math is pretty
[Figure: data points scattered around a fitted line in the (X, Y) plane]
Solving linear regression
•  To optimize – closed form:
•  We just take the derivative w.r.t. w and set it to 0:

       ∂/∂w Σᵢ (yᵢ − w xᵢ)² = 2 Σᵢ −xᵢ (yᵢ − w xᵢ)   ⇒
       2 Σᵢ xᵢ (yᵢ − w xᵢ) = 0   ⇒   2 Σᵢ xᵢ yᵢ − 2 Σᵢ w xᵢ² = 0   ⇒
       Σᵢ xᵢ yᵢ = Σᵢ w xᵢ²   ⇒
       w = Σᵢ xᵢ yᵢ / Σᵢ xᵢ²
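As a sanity check, the closed-form slope is two lines of NumPy. A minimal sketch (the data and variable names here are illustrative, not from the lecture):

```python
import numpy as np

# Toy data: y ≈ 2x plus noise (w = 2 is the "true" slope)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(0, 1, size=100)

# Closed-form least-squares slope for a line through the origin:
# w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x * x)
print(w)  # should be close to 2
```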
Linear regression
•  Given an input x we would like to compute an output y
•  In linear regression we assume that y and x are related by the equation:
       y = wx + ε
   where w is a parameter and ε represents measurement or other noise
   (y is what we are trying to predict; the ⟨x, y⟩ pairs are the observed values)
[Figure: observed values plotted in the (X, Y) plane around the line y = wx]
Regression examples
•  Generated: w = 2, noise std = 1 → Recovered: w = 2.03
•  Generated: w = 2, noise std = 2 → Recovered: w = 2.05
•  Generated: w = 2, noise std = 4 → Recovered: w = 2.08
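The pattern in these examples is easy to reproduce. A sketch under the same setup (NumPy assumed; the recovered values will differ slightly from the slide's, since they depend on the particular random draw):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)

for std in (1, 2, 4):
    # Generate data from y = 2x + Gaussian noise with this std
    y = 2.0 * x + rng.normal(0, std, size=200)
    # Recover the slope with the closed-form estimator
    w_hat = np.sum(x * y) / np.sum(x * x)
    print(f"noise std={std}: recovered w = {w_hat:.3f}")
```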
Bias term
•  So far we assumed that the line passes through the origin
•  What if the line does not?
•  No problem, simply change the model to
       y = w₀ + w₁x + ε
•  Can use least squares to determine w₀, w₁:
       w₀ = Σᵢ (yᵢ − w₁ xᵢ) / n
       w₁ = Σᵢ xᵢ (yᵢ − w₀) / Σᵢ xᵢ²
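These two equations are coupled (each uses the other's parameter), so in practice one solves for both jointly; the simplest route is to prepend a constant-1 feature and reuse ordinary least squares. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=100)  # true w0=3, w1=2

# Prepend a column of ones so w0 plays the role of the bias term
X = np.column_stack([np.ones_like(x), x])
# Solve the least-squares problem min ||Xw - y||^2
w0, w1 = np.linalg.lstsq(X, y, rcond=None)[0]
print(w0, w1)  # should be close to (3, 2)
```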
Linear Regression

Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.
       D = {(x⁽ⁱ⁾, y⁽ⁱ⁾)}ᵢ₌₁ᴺ   where x ∈ ℝᴷ and y ∈ ℝ
What are some example problems of this form?

Prediction: Output is a linear function of the inputs:
       ŷ = h_θ(x) = θ₁x₁ + θ₂x₂ + … + θ_K x_K = θᵀx
(We assume x₁ is 1.)

Learning: finds the parameters that minimize some objective function:
       θ* = argmin_θ J(θ)
Least Squares

Learning: finds the parameters that minimize some objective function:
       θ* = argmin_θ J(θ)
We minimize the sum of the squares:
       J(θ) = ½ Σᵢ₌₁ᴺ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²
Why?
1.  Reduces distance between true measurements and predicted hyperplane (line in 1D)
2.  Has a nice probabilistic interpretation
This is a very general optimization setup. We could solve it in lots of ways. Today, we'll consider three ways.
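For reference, the objective is a one-liner in code. A minimal sketch assuming NumPy, with X the N×K design matrix and y the length-N output vector:

```python
import numpy as np

def least_squares_objective(theta, X, y):
    """J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2."""
    residuals = X @ theta - y
    return 0.5 * np.dot(residuals, residuals)
```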
Least Squares
Learning: Three approaches to solving θ* = argmin_θ J(θ)
Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)

Gradient Descent
Algorithm 1 Gradient Descent
1: procedure GD(D, θ⁽⁰⁾)
2:     θ ← θ⁽⁰⁾
3:     while not converged do
4:         θ ← θ − λ ∇_θ J(θ)
5:     return θ

In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. the vector of partial derivatives):
       ∇_θ J(θ) = [ dJ/dθ₁, dJ/dθ₂, …, dJ/dθ_K ]ᵀ

There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance:
       ‖∇_θ J(θ)‖₂
Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
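A minimal sketch of Algorithm 1 for the least-squares objective, assuming NumPy. The step size and tolerance are illustrative; for least squares the step must be small enough (below 2/λ_max(XᵀX)) or the iterates diverge:

```python
import numpy as np

def gradient_descent(X, y, lr=1e-3, tol=1e-6, max_iter=100_000):
    """Batch gradient descent for J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2."""
    theta = np.zeros(X.shape[1])        # theta^(0)
    for _ in range(max_iter):
        grad = X.T @ (X @ theta - y)    # full gradient of J at theta
        if np.linalg.norm(grad) < tol:  # convergence test: small gradient norm
            break
        theta -= lr * grad              # step opposite the gradient
    return theta
```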
Stochastic Gradient Descent (SGD)
Algorithm 2 Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ⁽⁰⁾)
2:     θ ← θ⁽⁰⁾
3:     while not converged do
4:         for i ∈ shuffle({1, 2, …, N}) do
5:             θ ← θ − λ ∇_θ J⁽ⁱ⁾(θ)
6:     return θ

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let J(θ) = Σᵢ₌₁ᴺ J⁽ⁱ⁾(θ), where J⁽ⁱ⁾(θ) = ½ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)².
Stochastic Gradient Descent (SGD)
Algorithm 2 Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ⁽⁰⁾)
2:     θ ← θ⁽⁰⁾
3:     while not converged do
4:         for i ∈ shuffle({1, 2, …, N}) do
5:             for k ∈ {1, 2, …, K} do
6:                 θ_k ← θ_k − λ (d/dθ_k) J⁽ⁱ⁾(θ)
7:     return θ

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let J(θ) = Σᵢ₌₁ᴺ J⁽ⁱ⁾(θ), where J⁽ⁱ⁾(θ) = ½ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)². Let's start by calculating this partial derivative for the Linear Regression objective function.
Partial Derivatives for Linear Reg.
Let J(θ) = Σᵢ₌₁ᴺ J⁽ⁱ⁾(θ), where J⁽ⁱ⁾(θ) = ½ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)².

   (d/dθ_k) J⁽ⁱ⁾(θ) = (d/dθ_k) ½ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²
                    = ½ (d/dθ_k) (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²
                    = (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) (d/dθ_k) (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)
                    = (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) (d/dθ_k) ( Σ_{k′=1}^K θ_{k′} x_{k′}⁽ⁱ⁾ − y⁽ⁱ⁾ )
                    = (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) x_k⁽ⁱ⁾
Partial Derivatives for Linear Reg.
Let J(θ) = Σᵢ₌₁ᴺ J⁽ⁱ⁾(θ), where J⁽ⁱ⁾(θ) = ½ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)².

   (d/dθ_k) J⁽ⁱ⁾(θ) = (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) x_k⁽ⁱ⁾              ← used by SGD (aka. LMS)
   (d/dθ_k) J(θ) = Σᵢ₌₁ᴺ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) x_k⁽ⁱ⁾           ← used by Gradient Descent
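Derivatives like these are easy to get wrong, so a finite-difference check is a useful habit. A hedged sketch assuming NumPy (the helper names are ours, not the lecture's):

```python
import numpy as np

def grad_J(theta, X, y):
    """Analytic gradient: dJ/dtheta_k = sum_i (theta^T x_i - y_i) * x_ik."""
    return X.T @ (X @ theta - y)

def numeric_grad(J, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for k in range(len(theta)):
        d = np.zeros_like(theta)
        d[k] = eps
        g[k] = (J(theta + d) - J(theta - d)) / (2 * eps)
    return g

# Usage: compare the two on random data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
theta = rng.normal(size=3)
J = lambda t: 0.5 * np.sum((X @ t - y) ** 2)
print(np.allclose(grad_J(theta, X, y), numeric_grad(J, theta)))  # True
```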
Least Mean Squares (LMS)
Algorithm 3 Least Mean Squares (LMS)
1: procedure LMS(D, θ⁽⁰⁾)
2:     θ ← θ⁽⁰⁾
3:     while not converged do
4:         for i ∈ shuffle({1, 2, …, N}) do
5:             for k ∈ {1, 2, …, K} do
6:                 θ_k ← θ_k − λ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) x_k⁽ⁱ⁾
7:     return θ

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
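A compact sketch of Algorithm 3 in NumPy. It does one pass over a shuffled dataset per outer iteration and uses a fixed illustrative step size; the vectorized update handles all k at once:

```python
import numpy as np

def lms(X, y, lr=1e-3, epochs=50):
    """SGD on the per-example least-squares objective (the LMS algorithm)."""
    theta = np.zeros(X.shape[1])
    rng = np.random.default_rng(0)
    for _ in range(epochs):                # "while not converged" (fixed budget here)
        for i in rng.permutation(len(y)):  # i in shuffle({1, ..., N})
            residual = X[i] @ theta - y[i] # theta^T x^(i) - y^(i)
            theta -= lr * residual * X[i]  # updates every coordinate k at once
    return theta
```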
Optimization for Linear Reg. vs. Logistic Reg.
•  Can use the same tricks for both:
   –  regularization
   –  tuning the learning rate on development data
   –  shuffling examples out-of-core (if they can't fit in memory) and streaming over them
   –  local hill climbing yields the global optimum (both problems are convex)
   –  etc.
•  But Logistic Regression does not have a closed form solution for MLE parameters…
   …what about Linear Regression?

Least Squares
Learning: Three approaches to solving θ* = argmin_θ J(θ)
Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)

The normal equations
•  Write the cost function in matrix form:

       J(θ) = ½ Σᵢ₌₁ⁿ (xᵢᵀθ − yᵢ)²
            = ½ (Xθ − y)ᵀ(Xθ − y)
            = ½ (θᵀXᵀXθ − θᵀXᵀy − yᵀXθ + yᵀy)

   where X is the matrix whose rows are the training inputs x₁ᵀ, x₂ᵀ, …, xₙᵀ, and y = (y₁, y₂, …, yₙ)ᵀ stacks the outputs.

•  To minimize J(θ), take the derivative and set it to zero:

       ∇_θ J = ½ ∇_θ tr(θᵀXᵀXθ − θᵀXᵀy − yᵀXθ + yᵀy)
             = ½ (∇_θ tr θᵀXᵀXθ − 2 ∇_θ tr yᵀXθ + ∇_θ tr yᵀy)
             = ½ (XᵀXθ + XᵀXθ − 2Xᵀy)
             = XᵀXθ − Xᵀy = 0

   ⇒  XᵀXθ = Xᵀy   (the normal equations)
   ⇓
       θ* = (XᵀX)⁻¹ Xᵀy
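Numerically, it is better to solve the normal equations than to form the inverse (XᵀX)⁻¹ explicitly. A sketch assuming NumPy:

```python
import numpy as np

def normal_equations(X, y):
    """Solve X^T X theta = X^T y without explicitly inverting X^T X."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq(X, y, rcond=None) is the more robust library route,
# since it also handles rank-deficient X.
```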
Some matrix derivatives
•  For f : ℝ^{m×n} → ℝ, define:

       ∇_A f(A) = [ ∂f/∂A₁₁ … ∂f/∂A₁ₙ ; ⋮ ⋱ ⋮ ; ∂f/∂Aₘ₁ … ∂f/∂Aₘₙ ]

•  Trace:
       tr A = Σᵢ₌₁ⁿ Aᵢᵢ,    tr a = a (for a scalar a),    tr ABC = tr CAB = tr BCA
•  Some facts of matrix derivatives (without proof):
       ∇_A tr AB = Bᵀ,    ∇_A tr ABAᵀC = CAB + CᵀABᵀ,    ∇_A |A| = |A| (A⁻¹)ᵀ
Comments on the normal equation
•  In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space, and the matrix X has full column rank. If this condition holds, then it is easy to verify that XᵀX is necessarily invertible.
•  The assumption that XᵀX is invertible implies that it is positive definite, thus the critical point we have found is a minimum.
•  What if X has less than full column rank? → regularization (later).
Direct and Iterative methods
•  Direct methods: we can achieve the solution in a single step by solving the normal equations
   –  Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
   –  Can be infeasible when data are streaming in in real time, or are very large
•  Iterative methods: stochastic or steepest gradient descent
   –  Converge in a limiting sense
   –  But more attractive for large practical problems
   –  Caution is needed in choosing the learning rate α
Convergence rate
•  Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the learning rate α is sufficiently small.
•  A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small α, or gradually decrease α.
Convergence Curves
[Figure: training MSE vs. iteration for the batch and online updates]
•  For the batch method, the training MSE is initially large due to uninformed initialization
•  In the online update, the N updates in every epoch reduce the MSE to a much smaller value
Least Squares
Learning: Three approaches to solving θ* = argmin_θ J(θ)
Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
•  pros: conceptually simple, guaranteed convergence
•  cons: batch, often slow to converge
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
•  pros: memory efficient, fast convergence, less prone to local optima
•  cons: convergence in practice requires tuning and fancier variants
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
•  pros: one shot algorithm!
•  cons: does not scale to large datasets (matrix inverse is bottleneck)

Matching Game
Goal: Match the Algorithm to its Update Rule

1. SGD for Logistic Regression:   h_θ(x) = p(y|x)
2. Least Mean Squares:            h_θ(x) = θᵀx
3. Perceptron (next lecture):     h_θ(x) = sign(θᵀx)

4. θ_k ← θ_k + λ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
5. θ_k ← θ_k + λ / (1 + exp(h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾))
6. θ_k ← θ_k + λ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) x_k⁽ⁱ⁾

A. 1=5, 2=4, 3=6    B. 1=5, 2=6, 3=4    C. 1=6, 2=4, 3=4
D. 1=5, 2=6, 3=6    E. 1=6, 2=6, 3=6
Geometric Interpretation of LMS
•  The predictions on the training data are:
       ŷ = Xθ* = X (XᵀX)⁻¹ Xᵀ y
•  Note that
       ŷ − y = ( X (XᵀX)⁻¹ Xᵀ − I ) y
   and
       Xᵀ(ŷ − y) = ( XᵀX (XᵀX)⁻¹ Xᵀ − Xᵀ ) y = ( Xᵀ − Xᵀ ) y = 0 !!
   (X and y are the design matrix and output vector defined above)
•  ŷ is the orthogonal projection of y into the space spanned by the columns of X
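A quick numerical check of the orthogonality claim, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta_star

# The residual is orthogonal to every column of X
print(np.allclose(X.T @ (y_hat - y), 0))  # True
```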
Probabilistic Interpretation of LMS
•  Let us assume that the target variable and the inputs are related by the equation:
       yᵢ = θᵀxᵢ + εᵢ
   where ε is an error term capturing unmodeled effects or random noise
•  Now assume that ε follows a Gaussian N(0, σ); then we have:
       p(yᵢ | xᵢ; θ) = (1 / (√(2π) σ)) exp( −(yᵢ − θᵀxᵢ)² / (2σ²) )
•  By the independence assumption:
       L(θ) = Πᵢ₌₁ⁿ p(yᵢ | xᵢ; θ) = (1 / (√(2π) σ))ⁿ exp( −Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)² / (2σ²) )
Probabilistic Interpretation of LMS, cont.
•  Hence the log-likelihood is:
       ℓ(θ) = n log (1 / (√(2π) σ)) − (1/σ²) · ½ Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)²
•  Do you recognize the last term? Yes, it is:
       J(θ) = ½ Σᵢ₌₁ⁿ (xᵢᵀθ − yᵢ)²
•  Thus, under the independence assumption, LMS is equivalent to MLE of θ!
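A small numerical illustration of this equivalence, assuming NumPy and a fixed σ (which only scales the objective): the least-squares solution cannot be improved under the negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1, size=100)

def neg_log_likelihood(theta, sigma=1.0):
    """-log L: n*log(sqrt(2*pi)*sigma) + sum_i (y_i - theta^T x_i)^2 / (2*sigma^2)."""
    r = y - X @ theta
    return len(y) * np.log(np.sqrt(2 * np.pi) * sigma) + r @ r / (2 * sigma**2)

theta_ls = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares (= MLE) solution
# Perturbing theta_ls in any direction can only increase the NLL:
for _ in range(5):
    d = 0.1 * rng.normal(size=3)
    assert neg_log_likelihood(theta_ls) <= neg_log_likelihood(theta_ls + d)
```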
Ridge Regression
•  Adds an L2 regularizer to Linear Regression:
       J_RR(θ) = J(θ) + λ ‖θ‖₂²
               = ½ Σᵢ₌₁ᴺ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)² + λ Σₖ₌₁ᴷ θₖ²
•  Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters:
       θ_MAP = argmax_θ [ Σᵢ₌₁ᴺ log p_θ(y⁽ⁱ⁾ | x⁽ⁱ⁾) + log p(θ) ]
             = argmin_θ J_RR(θ)
   where p(θ) = N(0, λ⁻¹ 𝟙)
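Ridge also has a closed form: setting ∇J_RR = 0 gives (XᵀX + 2λI)θ = Xᵀy under the scaling above (the factor of 2 depends on how the regularizer is written). A sketch assuming NumPy:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize 1/2 * ||X theta - y||^2 + lam * ||theta||^2."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + 2 * lam * np.eye(K), X.T @ y)
```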
LASSO
•  Adds an L1 regularizer to Linear Regression:
       J_LASSO(θ) = J(θ) + λ ‖θ‖₁
                  = ½ Σᵢ₌₁ᴺ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)² + λ Σₖ₌₁ᴷ |θₖ|
•  Bayesian interpretation: MAP estimation with a Laplace prior on the parameters:
       θ_MAP = argmax_θ [ Σᵢ₌₁ᴺ log p_θ(y⁽ⁱ⁾ | x⁽ⁱ⁾) + log p(θ) ]
             = argmin_θ J_LASSO(θ)
   where p(θ) = Laplace(0, f(λ))
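LASSO has no closed form. One simple solver — not covered in the lecture, shown here only as a sketch — is proximal gradient descent (ISTA), which alternates a gradient step on the smooth part with soft-thresholding for the L1 term; the soft-thresholding is what produces exact zeros:

```python
import numpy as np

def lasso_ista(X, y, lam, iters=1000):
    """Minimize 1/2 * ||X theta - y||^2 + lam * ||theta||_1 via ISTA."""
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (X @ theta - y)        # gradient of the smooth part
        z = theta - g / L                # gradient step
        # soft-thresholding: shrinks coordinates and zeroes out small ones
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return theta
```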
Ridge Regression vs Lasso
[Figure: level sets of J(β) (the βs with constant J(β)) shown against the constraint sets in the (β1, β2) plane: βs with constant ℓ2 norm (ridge) and βs with constant ℓ1 norm (lasso)]
Lasso (ℓ1 penalty) results in sparse solutions – vectors with more zero coordinates.
Good for high-dimensional problems – don't have to store all coordinates!

Non-Linear basis function
•  So far we only used the observed values x1, x2, …
•  However, linear regression can be applied in the same way to functions of these values
   –  E.g.: to add a term w·x1x2, add a new variable z = x1x2 so each example becomes: x1, x2, …, z
•  As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem:
       y = w0 + w1x1 + w2x2 + … + wkxk + ε
Non-linear basis functions
•  What type of functions can we use?
•  A few common examples:
   –  Polynomial: φj(x) = xʲ for j = 0 … n
   –  Gaussian: φj(x) = exp( −(x − μj)² / (2σj²) )
   –  Sigmoid: φj(x) = 1 / (1 + exp(−sj x))
   –  Logs: φj(x) = log(x + 1)
Any function of the input values can be used. The solution for the parameters of the regression remains the same.
General linear regression problem
•  Using our new notation for the basis functions, linear regression can be written as
       y = Σⱼ₌₀ⁿ wⱼ φⱼ(x)
•  where φj(x) can be either xj for multivariate regression or one of the non-linear basis functions we defined
•  … and φ0(x) = 1 for the intercept term
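The practical recipe: build a design matrix whose columns are φj evaluated on the data, then reuse the same least-squares machinery unchanged. A sketch assuming NumPy, with a polynomial basis:

```python
import numpy as np

def design_matrix(x, degree):
    """Columns are phi_j(x) = x**j for j = 0..degree (phi_0 is the intercept)."""
    return np.column_stack([x**j for j in range(degree + 1)])

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(3 * x) + rng.normal(0, 0.1, size=50)  # non-linear target

Phi = design_matrix(x, degree=3)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]  # same least-squares solver as before
```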
An example: polynomial basis vectors on a small dataset
–  From Bishop Ch 1
[Figures: 0th, 1st, 3rd and 9th order polynomial fits to a dataset of n = 10 points]

Over-fitting
[Figure: Root-Mean-Square (RMS) error vs. polynomial order; table of polynomial coefficients]

Data Set Size
[Figures: the 9th order polynomial fit as the data set size grows]
Penalize large coefficient values
2
& λ
1 # i
i
J X,y (w) = ∑%% y − ∑ w jφ j (x )(( −
w
2 i $
2
'
j
2
Regularization:
+
Polynomial Coefficients
none
exp(18)
huge
Over Regularization:
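Combining the last two ideas — basis functions plus an L2 penalty — reproduces the effect in the figure. A sketch assuming NumPy; under the λ/2 scaling above, the closed form is (ΦᵀΦ + λI)w = Φᵀy:

```python
import numpy as np

def ridge_poly_fit(x, y, degree, lam):
    """Fit a degree-d polynomial with penalty (lam/2) * ||w||^2."""
    Phi = np.column_stack([x**j for j in range(degree + 1)])
    K = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

# lam=0 recovers the over-fit 9th order polynomial; increasing lam shrinks
# the coefficients toward zero (a huge lam over-regularizes and flattens the fit).
```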