School of Computer Science
10-601 Introduction to Machine Learning
Linear Regression
Readings: Bishop, 3.1
Matt Gormley
Lecture 5, September 14, 2016

Reminders
• Homework 2:
  – Extension: due Friday (9/16) at 5:30pm
• Recitation schedule posted on course website

Outline
• Linear Regression
  – Simple example
  – Model
• Learning (aka. Least Squares)
  – Gradient Descent
  – SGD (aka. Least Mean Squares (LMS))
  – Closed Form (aka. Normal Equations)
• Advanced Topics
  – Geometric and Probabilistic Interpretation of LMS
  – L2 Regularization (aka. Ridge Regression)
  – L1 Regularization (aka. LASSO)
  – Features (aka. non-linear basis functions)

Linear regression
• Our goal is to estimate $w$ from training data of $\langle x_i, y_i \rangle$ pairs, under the model $y = wx + \epsilon$.
• Optimization goal: minimize the squared error (least squares):
  $\arg\min_w \sum_i (y_i - w x_i)^2$
• Why least squares?
  – minimizes the squared distance between the measurements and the predicted line
  – has a nice probabilistic interpretation
  – the math is pretty simple (see HW)

Solving linear regression
• To optimize (closed form), we just take the derivative w.r.t. $w$ and set it to 0:
  $\frac{\partial}{\partial w} \sum_i (y_i - w x_i)^2 = -2 \sum_i x_i (y_i - w x_i) = 0$
  $\Rightarrow \sum_i x_i y_i - \sum_i w x_i^2 = 0 \;\Rightarrow\; \sum_i x_i y_i = w \sum_i x_i^2 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$

Linear regression
• Given an input $x$ we would like to compute an output $y$.
• In linear regression we assume that $y$ and $x$ are related by the equation
  $y = wx + \epsilon$
  where $w$ is a parameter and $\epsilon$ represents measurement or other noise.
  (Figure: $y$ is what we are trying to predict; $x$ holds the observed values.)

Regression examples (synthetic data generated with $w = 2$)
• Noise std = 1: recovered $w = 2.03$
• Noise std = 2: recovered $w = 2.05$
• Noise std = 4: recovered $w = 2.08$

Bias term
• So far we assumed that the line passes through the origin. What if it does not?
• No problem: simply change the model to $y = w_0 + w_1 x + \epsilon$
• Can use least squares to determine $w_0, w_1$:
  $w_0 = \frac{\sum_i (y_i - w_1 x_i)}{n}$,   $w_1 = \frac{\sum_i x_i (y_i - w_0)}{\sum_i x_i^2}$

Linear Regression
Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.
  $\mathcal{D} = \{x^{(i)}, y^{(i)}\}_{i=1}^N$ where $x \in \mathbb{R}^K$ and $y \in \mathbb{R}$
  What are some example problems of this form?
Prediction: Output is a linear function of the inputs.
  $\hat{y} = h_\theta(x) = \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_K x_K = \theta^T x$   (we assume $x_1$ is 1)
Learning: finds the parameters that minimize some objective function.
  $\theta^* = \arg\min_\theta J(\theta)$

Least Squares
Learning: finds the parameters that minimize some objective function.
  $\theta^* = \arg\min_\theta J(\theta)$
We minimize the sum of the squares:
  $J(\theta) = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2$
Why?
1. Reduces the distance between the true measurements and the predicted hyperplane (a line in 1D)
2. Has a nice probabilistic interpretation
This is a very general optimization setup. We could solve it in lots of ways. Today, we'll consider three ways.
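Before looking at those three ways, here is a minimal sketch (not from the slides) that checks the 1-D closed-form estimate $w = \sum_i x_i y_i / \sum_i x_i^2$ on synthetic data like the $w = 2$ examples above; for the bias-term case it solves for $w_0$ and $w_1$ jointly via least squares rather than the coupled equations on the slide. The data and variable names are illustrative.

```python
# Minimal sketch (not from the slides): checks the 1-D closed-form estimates
# above on synthetic data like the w = 2 examples. Names/data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=100)
y = 2.0 * x + rng.normal(0.0, 1.0, size=100)    # true w = 2, noise std = 1

# Line through the origin: w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x * x)

# With a bias term, this sketch solves for w0 and w1 jointly via least squares
# (stack a constant column), rather than the coupled equations on the slide.
X = np.column_stack([np.ones_like(x), x])       # columns: [1, x]
w0, w1 = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"no-bias w ≈ {w:.3f};  with bias: w0 ≈ {w0:.3f}, w1 ≈ {w1:.3f}")
```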
Least Squares
Learning: Three approaches to solving $\theta^* = \arg\min_\theta J(\theta)$
Approach 1: Gradient Descent (take larger, more certain, steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)

Gradient Descent
Algorithm 1: Gradient Descent
  1: procedure GD($\mathcal{D}$, $\theta^{(0)}$)
  2:   $\theta \leftarrow \theta^{(0)}$
  3:   while not converged do
  4:     $\theta \leftarrow \theta - \lambda \nabla_\theta J(\theta)$
  5:   return $\theta$
In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. the vector of partial derivatives):
  $\nabla_\theta J(\theta) = \left[ \frac{d}{d\theta_1} J(\theta), \; \frac{d}{d\theta_2} J(\theta), \; \ldots, \; \frac{d}{d\theta_K} J(\theta) \right]^T$
There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient, $\|\nabla_\theta J(\theta)\|_2$, is below some small tolerance. Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.

Stochastic Gradient Descent (SGD)
Algorithm 2: Stochastic Gradient Descent (SGD)
  1: procedure SGD($\mathcal{D}$, $\theta^{(0)}$)
  2:   $\theta \leftarrow \theta^{(0)}$
  3:   while not converged do
  4:     for $i \in$ shuffle($\{1, 2, \ldots, N\}$) do
  5:       for $k \in \{1, 2, \ldots, K\}$ do
  6:         $\theta_k \leftarrow \theta_k - \lambda \frac{d}{d\theta_k} J^{(i)}(\theta)$
  7:   return $\theta$
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
We need a per-example objective: let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$, where $J^{(i)}(\theta) = \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$.
Let's start by calculating this partial derivative for the Linear Regression objective function.

Partial Derivatives for Linear Reg.
Let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$, where $J^{(i)}(\theta) = \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$. Then
  $\frac{d}{d\theta_k} J^{(i)}(\theta) = \frac{d}{d\theta_k} \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$
  $= \frac{1}{2} \frac{d}{d\theta_k} (\theta^T x^{(i)} - y^{(i)})^2$
  $= (\theta^T x^{(i)} - y^{(i)}) \frac{d}{d\theta_k} (\theta^T x^{(i)} - y^{(i)})$
  $= (\theta^T x^{(i)} - y^{(i)}) \frac{d}{d\theta_k} \left( \sum_{j=1}^K \theta_j x_j^{(i)} - y^{(i)} \right)$
  $= (\theta^T x^{(i)} - y^{(i)})\, x_k^{(i)}$
Summary:
  $\frac{d}{d\theta_k} J^{(i)}(\theta) = (\theta^T x^{(i)} - y^{(i)})\, x_k^{(i)}$   (used by SGD, aka. LMS)
  $\frac{d}{d\theta_k} J(\theta) = \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})\, x_k^{(i)}$   (used by Gradient Descent)

Least Mean Squares (LMS)
Algorithm 3: Least Mean Squares (LMS)
  1: procedure LMS($\mathcal{D}$, $\theta^{(0)}$)
  2:   $\theta \leftarrow \theta^{(0)}$
  3:   while not converged do
  4:     for $i \in$ shuffle($\{1, 2, \ldots, N\}$) do
  5:       for $k \in \{1, 2, \ldots, K\}$ do
  6:         $\theta_k \leftarrow \theta_k - \lambda (\theta^T x^{(i)} - y^{(i)})\, x_k^{(i)}$
  7:   return $\theta$
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
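As a concrete illustration, the LMS update above fits in a few lines of NumPy. This is a minimal sketch, not the course's reference implementation; the learning rate, epoch count, and synthetic data are illustrative choices, and the fixed epoch loop stands in for a real convergence check.

```python
# Minimal sketch (not the course's reference code) of the LMS / SGD update
# derived above: theta_k <- theta_k - lr * (theta^T x^(i) - y^(i)) * x_k^(i).
import numpy as np

def lms(X, y, lr=0.01, epochs=50, seed=0):
    """SGD on J(theta) = 1/2 * sum_i (theta^T x^(i) - y^(i))^2."""
    rng = np.random.default_rng(seed)
    N, K = X.shape
    theta = np.zeros(K)
    for _ in range(epochs):                     # "while not converged" stand-in
        for i in rng.permutation(N):            # shuffle({1, ..., N})
            residual = X[i] @ theta - y[i]      # theta^T x^(i) - y^(i)
            theta -= lr * residual * X[i]       # all K coordinate updates at once
    return theta

# Example: K = 3 features, where the first column is the constant 1.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_theta = np.array([0.5, 2.0, -1.0])
y = X @ true_theta + rng.normal(0.0, 0.1, size=200)
print(lms(X, y))    # should come out close to [0.5, 2.0, -1.0]
```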
Optimization for Linear Reg. vs. Logistic Reg.
• Can use the same tricks for both:
  – regularization
  – tuning the learning rate on development data
  – shuffling examples out-of-core (if they can't fit in memory) and streaming over them
  – local hill climbing yields the global optimum (both problems are convex)
  – etc.
• But Logistic Regression does not have a closed form solution for the MLE parameters...
  ...what about Linear Regression?

The normal equations
• Write the cost function in matrix form:
  $J(\theta) = \frac{1}{2} \sum_{i=1}^n (x_i^T \theta - y_i)^2 = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2} \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)$
  where $X$ is the $n \times k$ design matrix whose rows are the inputs $x_i^T$, and $\vec{y} = (y_1, \ldots, y_n)^T$.
• To minimize $J(\theta)$, take the derivative and set it to zero:
  $\nabla_\theta J = \frac{1}{2} \nabla_\theta \, \mathrm{tr}\left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)$
  $= \frac{1}{2} \left( \nabla_\theta \, \mathrm{tr}\, \theta^T X^T X \theta - 2 \nabla_\theta \, \mathrm{tr}\, \vec{y}^T X \theta + \nabla_\theta \, \mathrm{tr}\, \vec{y}^T \vec{y} \right)$
  $= \frac{1}{2} \left( X^T X \theta + X^T X \theta - 2 X^T \vec{y} \right) = X^T X \theta - X^T \vec{y} = 0$
  $\Rightarrow X^T X \theta = X^T \vec{y}$   (the normal equations)
  $\Rightarrow \theta^* = (X^T X)^{-1} X^T \vec{y}$
(© Eric Xing @ CMU, 2006-2011)

Some matrix derivatives
• For $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, define:
  $\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$
• Trace: $\mathrm{tr}\, A = \sum_{i=1}^n A_{ii}$,   $\mathrm{tr}\, a = a$ (for a scalar $a$),   $\mathrm{tr}\, ABC = \mathrm{tr}\, CAB = \mathrm{tr}\, BCA$
• Some facts of matrix derivatives (without proof):
  $\nabla_A \, \mathrm{tr}\, AB = B^T$,   $\nabla_A \, \mathrm{tr}\, ABA^T C = CAB + C^T A B^T$,   $\nabla_A |A| = |A| (A^{-1})^T$

Comments on the normal equation
• In most situations of practical interest, the number of data points $N$ is larger than the dimensionality $k$ of the input space and the matrix $X$ has full column rank. If this condition holds, then it is easy to verify that $X^T X$ is necessarily invertible.
• The assumption that $X^T X$ is invertible implies that it is positive definite, thus the critical point we have found is a minimum.
• What if $X$ has less than full column rank? → regularization (later).

Direct and Iterative methods
• Direct methods: we can reach the solution in a single step by solving the normal equations
  – Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
  – It can be infeasible when data are streaming in in real time, or are of very large volume
• Iterative methods: stochastic or steepest gradient descent
  – Converge in a limiting sense
  – But more attractive for large practical problems
  – Caution is needed when choosing the learning rate $\alpha$

Convergence rate
• Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the learning rate $\alpha$ is sufficiently small.
• A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small $\alpha$, or gradually decrease $\alpha$.

Convergence Curves
• (Figure: training MSE vs. iterations for the batch and online updates.)
• For the batch method, the training MSE is initially large due to the uninformed initialization.
• In the online update, the $N$ updates per epoch reduce the MSE to a much smaller value.
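To make the closed-form route concrete, here is a minimal sketch (assumptions: NumPy and synthetic data of my choosing) that solves the normal equations $X^T X \theta = X^T y$. In practice one solves the linear system, or uses a QR / least-squares routine, rather than forming $(X^T X)^{-1}$ explicitly.

```python
# Minimal sketch (assumptions: NumPy, synthetic data) of the closed-form
# solution theta* = (X^T X)^{-1} X^T y. We solve the linear system
# X^T X theta = X^T y rather than forming the matrix inverse explicitly.
import numpy as np

rng = np.random.default_rng(2)
N, K = 500, 4
X = rng.normal(size=(N, K))
true_theta = rng.normal(size=K)
y = X @ true_theta + rng.normal(0.0, 0.1, size=N)

theta = np.linalg.solve(X.T @ X, X.T @ y)            # the normal equations
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically preferable

print(np.allclose(theta, theta_lstsq, atol=1e-6))    # both recover ~true_theta
```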
Least Squares
Learning: Three approaches to solving $\theta^* = \arg\min_\theta J(\theta)$
Approach 1: Gradient Descent (take larger, more certain, steps opposite the gradient)
  • pros: conceptually simple, guaranteed convergence
  • cons: batch, often slow to converge
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
  • pros: memory efficient, fast convergence, less prone to local optima
  • cons: convergence in practice requires tuning and fancier variants
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
  • pros: one-shot algorithm!
  • cons: does not scale to large datasets (the matrix inverse is the bottleneck)

Matching Game
Goal: Match the Algorithm to its Update Rule
1. SGD for Logistic Regression, $h_\theta(x) = p(y|x)$
2. Least Mean Squares, $h_\theta(x) = \theta^T x$
3. Perceptron (next lecture), $h_\theta(x) = \mathrm{sign}(\theta^T x)$
4. $\theta_k \leftarrow \theta_k - \lambda (h_\theta(x^{(i)}) - y^{(i)})$
5. $\theta_k \leftarrow \theta_k - \lambda \frac{1}{1 + \exp\left( \lambda (h_\theta(x^{(i)}) - y^{(i)}) \right)}$
6. $\theta_k \leftarrow \theta_k - \lambda (h_\theta(x^{(i)}) - y^{(i)})\, x_k^{(i)}$
A. 1=5, 2=4, 3=6   B. 1=5, 2=6, 3=4   C. 1=6, 2=4, 3=4   D. 1=5, 2=6, 3=6   E. 1=6, 2=6, 3=6

Geometric Interpretation of LMS
• The predictions on the training data are:
  $\hat{\vec{y}} = X \theta^* = X (X^T X)^{-1} X^T \vec{y}$
• Note that
  $\hat{\vec{y}} - \vec{y} = \left( X (X^T X)^{-1} X^T - I \right) \vec{y}$
  and
  $X^T (\hat{\vec{y}} - \vec{y}) = X^T \left( X (X^T X)^{-1} X^T - I \right) \vec{y} = \left( X^T X (X^T X)^{-1} X^T - X^T \right) \vec{y} = 0$
• So $\hat{\vec{y}}$ is the orthogonal projection of $\vec{y}$ onto the space spanned by the columns of $X$.

Probabilistic Interpretation of LMS
• Let us assume that the target variable and the inputs are related by the equation:
  $y_i = \theta^T x_i + \epsilon_i$
  where $\epsilon_i$ is an error term capturing unmodeled effects or random noise.
• Now assume that $\epsilon$ follows a Gaussian $\mathcal{N}(0, \sigma^2)$; then we have:
  $p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)$
• By the independence assumption:
  $L(\theta) = \prod_{i=1}^n p(y_i \mid x_i; \theta) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left( -\frac{\sum_{i=1}^n (y_i - \theta^T x_i)^2}{2\sigma^2} \right)$

Probabilistic Interpretation of LMS, cont.
• Hence the log-likelihood is:
  $\ell(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^n (y_i - \theta^T x_i)^2$
• Do you recognize the last term? Yes, it is:
  $J(\theta) = \frac{1}{2} \sum_{i=1}^n (x_i^T \theta - y_i)^2$
• Thus, under the independence assumption, LMS is equivalent to MLE of $\theta$!

Ridge Regression
• Adds an L2 regularizer to Linear Regression:
  $J_{RR}(\theta) = J(\theta) + \lambda \|\theta\|_2^2 = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^K \theta_k^2$
• Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters
  $\theta_{MAP} = \arg\max_\theta \sum_{i=1}^N \log p_\theta(y^{(i)} \mid x^{(i)}) + \log p(\theta) = \arg\min_\theta J_{RR}(\theta)$
  where $p(\theta) \sim \mathcal{N}(0, \lambda^{-1})$

LASSO
• Adds an L1 regularizer to Linear Regression:
  $J_{LASSO}(\theta) = J(\theta) + \lambda \|\theta\|_1 = \frac{1}{2} \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^K |\theta_k|$
• Bayesian interpretation: MAP estimation with a Laplace prior on the parameters
  $\theta_{MAP} = \arg\max_\theta \sum_{i=1}^N \log p_\theta(y^{(i)} \mid x^{(i)}) + \log p(\theta) = \arg\min_\theta J_{LASSO}(\theta)$
  where $p(\theta) \sim \mathrm{Laplace}(0, f(\lambda))$

Ridge Regression vs Lasso
• (Figure: level sets of $J(\beta)$ in the $(\beta_1, \beta_2)$ plane, together with the constraint regions: $\beta$s with constant $\ell_2$ norm for Ridge Regression, and $\beta$s with constant $\ell_1$ norm for Lasso.)
• Lasso ($\ell_1$ penalty) results in sparse solutions: vectors with more zero coordinates.
• Good for high-dimensional problems: you don't have to store all the coordinates!
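A minimal sketch (not from the slides) of the ridge closed form: setting the gradient of $J_{RR}$ to zero gives $\theta = (X^T X + \lambda' I)^{-1} X^T y$, where the scaling of $\lambda'$ relative to the $\lambda$ above depends on whether the penalty is written as $\lambda\|\theta\|_2^2$ or $\frac{\lambda}{2}\|\theta\|_2^2$. The data, and the simplification of penalizing every coordinate (including any bias), are illustrative choices.

```python
# Minimal sketch (not from the slides) of ridge regression's closed form:
# theta = (X^T X + lam * I)^{-1} X^T y. The exact scaling of `lam` depends on
# the penalty convention (lambda vs. lambda/2). Every coordinate is penalized
# here for simplicity; in practice the bias column is usually left unpenalized.
import numpy as np

def ridge_fit(X, y, lam=1.0):
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.5, size=50)   # only feature 0 matters

print(ridge_fit(X, y, lam=1e-8)[:3])   # essentially ordinary least squares
print(ridge_fit(X, y, lam=10.0)[:3])   # coefficients shrunk toward zero
```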
Non-Linear basis functions
• So far we only used the observed values $x_1, x_2, \ldots$
• However, linear regression can be applied in the same way to functions of these values.
  – E.g.: to add a term $w\, x_1 x_2$, add a new variable $z = x_1 x_2$ so each example becomes: $x_1, x_2, \ldots, z$
• As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem, e.g.
  $y = w_0 + w_1 x + w_2 x^2 + \ldots + w_k x^k + \epsilon$

Non-linear basis functions
• What type of functions can we use? A few common examples:
  – Polynomial: $\phi_j(x) = x^j$ for $j = 0, \ldots, n$
  – Gaussian: $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$
  – Sigmoid: $\phi_j(x) = \frac{1}{1 + \exp(-s_j x)}$
  – Logs: $\phi_j(x) = \log(x + 1)$
• Any function of the input values can be used. The solution for the parameters of the regression remains the same.

General linear regression problem
• Using our new notation for the basis functions, linear regression can be written as
  $y = \sum_{j=0}^n w_j \phi_j(x)$
• where $\phi_j(x)$ can be either $x_j$ for multivariate regression or one of the non-linear basis functions we defined
• ... and $\phi_0(x) = 1$ for the intercept term.

An example: polynomial basis vectors on a small dataset (from Bishop Ch. 1)
• (Figures: fits of 0th, 1st, 3rd, and 9th order polynomials to a dataset of n = 10 points.)

Over-fitting
• (Figures: Root-Mean-Square (RMS) error for training and test data as a function of polynomial order; table of polynomial coefficients; effect of data set size on the 9th order polynomial fit.)

Regularization
• Penalize large coefficient values:
  $J_{X,y}(w) = \frac{1}{2} \sum_i \left( y^{(i)} - \sum_j w_j \phi_j(x^{(i)}) \right)^2 + \frac{\lambda}{2} \|w\|^2$
• (Figures and table: 9th order polynomial fits under different amounts of regularization, from none to over-regularization, with the corresponding polynomial coefficients, which are huge without the penalty and shrink as it grows.)
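To tie the basis-function idea back to the earlier closed form, here is a minimal sketch (the helper names and data are illustrative, not from the slides): build polynomial features $\phi_j(x) = x^j$ and reuse the (optionally L2-regularized) least-squares fit. The model is non-linear in $x$ but still linear in the parameters $w$, which is why the same solution applies.

```python
# Minimal sketch (helper names and data are illustrative, not from the slides):
# polynomial basis features phi_j(x) = x^j plus the closed-form
# (optionally L2-regularized) least-squares fit from earlier in the lecture.
import numpy as np

def poly_features(x, degree):
    """phi_0(x) = 1, phi_1(x) = x, ..., phi_degree(x) = x^degree."""
    return np.column_stack([x ** j for j in range(degree + 1)])

def fit(Phi, y, lam=0.0):
    """Closed-form (optionally ridge-penalized) least squares in feature space."""
    if lam == 0.0:
        return np.linalg.lstsq(Phi, y, rcond=None)[0]
    K = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 10)                         # n = 10, as in Bishop Ch. 1
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)  # noisy sine, as in Bishop Ch. 1

Phi = poly_features(x, degree=9)
w_unreg = fit(Phi, y)             # 9th order, no penalty: large, wild coefficients
w_reg = fit(Phi, y, lam=1e-3)     # small L2 penalty shrinks them
print(np.round(w_unreg, 1))
print(np.round(w_reg, 1))
```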