Linear Models for Regression
Sargur Srihari
[email protected]

Linear Regression with One Input
• Simplest form of linear regression: a linear function of a single input variable
  – y(x,w) = w0 + w1 x
  – The task is to learn the weights w0, w1 from data D = {(x_i, t_i)}, i = 1,..,N
• More useful class of functions: polynomial curve fitting
  – y(x,w) = w0 + w1 x + w2 x^2 + ... = Σ_j w_j x^j
  – More generally, a linear combination of non-linear functions φ_j(x) of the input variable, used in place of x^j; these are called basis functions
• Such models are linear functions of the parameters (which gives them simple analytical properties), yet are nonlinear with respect to the input variables

Plan of Discussion
• Discuss supervised learning, starting with regression
• Goal: predict the value of one or more target variables t
• Given a d-dimensional vector x of input variables
• Terminology
  – Regression: t is continuous-valued
  – Classification: t takes values that are labels (non-ordered categories)
  – Ordinal regression: discrete values, ordered categories

Regression with Multiple Inputs
• Polynomial regression above used a single scalar input variable x
• Generalizations
  – Predict the value of a continuous target variable t given the values of d input variables x = [x1,..,xd]
  – t can also be a set of variables (multiple regression)
  – The models are linear functions of adjustable parameters: specifically, linear combinations of nonlinear functions of the input variables

Simplest Linear Model with d Inputs
• Regression with d input variables:
  y(x,w) = w0 + w1 x1 + .. + wd xd = w^T x, where x = (x1,..,xd)^T is augmented with a constant component x0 = 1
  – This differs from linear regression with one variable and polynomial regression with one variable
• Called linear regression since it is a linear function of
  – the parameters w0,..,wd
  – the input variables x1,..,xd
• Significant limitation, since it is a linear function of the input variables
  – In the one-dimensional case this amounts to a straight-line fit (degree-one polynomial): y(x,w) = w0 + w1 x

Fitting a Regression Plane
• Assume t is a function of the inputs x1, x2,..,xd (the independent variables); the goal is to find the best linear regressor of t on all the inputs
• For d = 2 this means fitting a plane through N samples (x1, x2, t); in d dimensions it is a hyperplane
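As a concrete companion to the regression-plane slide above, the following is a minimal NumPy sketch, not part of the original slides, of fitting t = w0 + w1 x1 + w2 x2 by least squares. The synthetic data, the noise level, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.uniform(0, 1, size=(N, 2))                   # N samples of (x1, x2), d = 2
true_w = np.array([0.5, 2.0, -1.0])                  # assumed [w0, w1, w2]
t = true_w[0] + X @ true_w[1:] + rng.normal(0, 0.1, N)

X_aug = np.hstack([np.ones((N, 1)), X])              # prepend a column of ones for w0
w, *_ = np.linalg.lstsq(X_aug, t, rcond=None)        # least-squares plane fit
print("estimated [w0, w1, w2]:", w)
```

Augmenting X with a constant column is one common way to learn the bias w0 together with the other weights; with more input variables the same call fits a hyperplane.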
Learning to Rank (LeToR) Problem
• Multiple inputs
• Target value
  – t is discrete (e.g., 1,2,..,6) in the training set, but a continuous value in [1,6] is learnt and used to rank objects

Regression with Multiple Inputs: LeToR
• Input (x): d features of a query-URL pair (d > 200)
  – The LETOR 4.0 dataset has 46 query-document features and a maximum of 124 URLs/query; the Yahoo! dataset has d = 700
  – Example features: log frequency of query in anchor text; query word in color on page; number of images on page; number of (out) links on page; PageRank of page; URL length; URL contains "~"; page length
  – Traditional IR uses TF/IDF
• Output (y): relevance value (the target variable)
  – Point-wise labels (0,1,2,3); regression returns a continuous value
  – Allows fine-grained ranking of URLs

Input Features
• See http://research.microsoft.com/en-us/projects/mslr/feature.aspx (Feature List of the Microsoft Learning to Rank Datasets)
• Each feature is computed over one of five streams: body, anchor, title, URL, whole document

  Feature ids   Description (one feature per stream)
  1-5           covered query term number
  6-10          covered query term ratio
  11-15         stream length
  16-20         IDF (inverse document frequency)
  21-25         sum of term frequency
  26-30         min of term frequency
  31-35         max of term frequency
  36-40         mean of term frequency
  41-45         variance of term frequency
  46-50         sum of stream-length-normalized term frequency

• IDF(t,D) = log N / |{d ∈ D : t ∈ d}|

Feature Statistics
• Most of the 46 features are normalized as continuous values from 0 to 1; the exceptions are a few features that are all 0s

  Feature  Min  Max  Mean      Feature  Min  Max  Mean
  1        0    1    0.254     24       0    1    0.5453
  2        0    1    0.1598    25       0    1    0.5622
  3        0    1    0.1392    26       0    1    0.1675
  4        0    1    0.2158    27       0    1    0.1377
  5        0    1    0.1322    28       0    1    0.1249
  6        0    0    0.1614    29       0    1    0.126
  7        0    0    0         30       0    1    0.2109
  8        0    0    0         31       0    1    0.1705
  9        0    0    0         32       0    1    0.1694
  10       0    0    0         33       0    1    0.1603
  11       0    1    0         34       0    1    0.1275
  12       0    1    0.2841    35       0    1    0.0762
  13       0    1    0.1382    36       0    1    0.0762
  14       0    1    0.2109    37       0    1    0.0728
  15       0    1    0.1218    38       0    1    0.5479
  16       0    1    0.2879    39       0    1    0.5574
  17       0    1    0.1533    40       0    1    0.5502
  18       0    1    0.2258    41       0    1    0.5673
  19       0    1    0.3057    42       0    1    0.4321
  20       0    1    0.3332    43       0    0    0.3361
  21       0    1    0.1534    44       0    1    0
  22       0    1    0.5473    45       0    1    0.1065
  23       0    1    0.5592    46       0    1    0.1211

Linear Regression with M Basis Functions
• Extended by considering nonlinear functions of the input variables:
  y(x,w) = w0 + Σ_{j=1}^{M-1} w_j φ_j(x)
  – where the φ_j(x) are called basis functions
  – We now need M weights for the basis functions instead of d weights for the features
• Defining φ_0(x) = 1, this can be written as
  y(x,w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
  – where w = (w0, w1,.., w_{M-1})^T and φ = (φ_0, φ_1,.., φ_{M-1})^T
• Basis functions allow non-linearity with d input variables

Polynomial Basis with One Variable
• Linear basis function model: y(x,w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
• Polynomial basis (for a single variable x): φ_j(x) = x^j, giving a degree M-1 polynomial
• Disadvantages
  – Global: changes in one region of input space affect others
  – Difficult to formulate: the number of polynomial terms increases exponentially with M
  – Can divide the input space into regions and use a different polynomial in each region: equivalent to spline functions
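To make the polynomial basis concrete, here is a small sketch, again not from the slides, that builds the design matrix with φ_j(x) = x^j and fits the weights by least squares. The target function sin(2πx), the noise level, and the helper name poly_design_matrix are assumptions made for the example.

```python
import numpy as np

def poly_design_matrix(x, M):
    """Design matrix with columns phi_j(x) = x**j for j = 0,..,M-1 (phi_0 = 1)."""
    return np.vstack([x ** j for j in range(M)]).T   # shape (N, M)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

Phi = poly_design_matrix(x, M=4)                     # degree-3 polynomial
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)          # least-squares weights
print("polynomial weights w0..w3:", w)
```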
Polynomial Basis with d Variables (not used in practice!)
• Consider, for a vector x, the basis φ_j(x) = ||x||^j = [x1^2 + x2^2 + .. + xd^2]^{j/2}
  – x = (2,1) and x = (1,2) have the same squared sum, so this is unsatisfactory
  – The vector is being converted into a scalar value, thereby losing information
• Better approach:
  – A polynomial of degree M-1 has terms with variables taken none, one, two,.., M-1 at a time
  – Use a multi-index j = (j1, j2,.., jd) such that j1 + j2 + .. + jd ≤ M-1
  – For a quadratic (M = 3) with three variables (d = 3):
    y(x,w) = Σ_{(j1,j2,j3)} w_j φ_j(x)
           = w_0 + w_{1,0,0} x1 + w_{0,1,0} x2 + w_{0,0,1} x3
             + w_{1,1,0} x1 x2 + w_{1,0,1} x1 x3 + w_{0,1,1} x2 x3
             + w_{2,0,0} x1^2 + w_{0,2,0} x2^2 + w_{0,0,2} x3^2
  – The number of quadratic terms is 1 + d + d(d-1)/2 + d; for d = 46 it is 1128

Gaussian Radial Basis Functions
• Gaussian basis: φ_j(x) = exp( -(x - μ_j)^2 / (2 s^2) )
  – Does not necessarily have a probabilistic interpretation
  – The usual normalization term is unimportant, since each basis function is multiplied by a weight w_j
• Choice of parameters
  – μ_j govern the locations of the basis functions
    • Can be an arbitrary set of points within the range of the data; can choose some representative data points
  – s governs the spatial scale
    • Could be chosen from the data set, e.g., the average variance
• Several variables
  – A Gaussian kernel would be chosen for each dimension; for each j a different set of means would be needed, perhaps chosen from the data:
    φ_j(x) = exp( -(1/2) (x - μ_j)^T Σ^{-1} (x - μ_j) )

Sigmoidal Basis Function
• Sigmoid: φ_j(x) = σ( (x - μ_j) / s ), where σ(a) = 1 / (1 + exp(-a)) is the logistic sigmoid
• Equivalently tanh, since it is related to the logistic sigmoid by tanh(a) = 2σ(2a) - 1
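The Gaussian radial basis functions above can be turned into a design matrix in a few lines. The sketch below, not part of the slides, assumes scalar inputs, centers μ_j placed uniformly over the data range, and a hand-picked scale s; none of these choices come from the original deck.

```python
import numpy as np

def gaussian_rbf_design(x, centers, s):
    """Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)) for scalar inputs x_n."""
    diff = x[:, None] - centers[None, :]             # shape (N, M)
    return np.exp(-diff ** 2 / (2 * s ** 2))

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 50)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

centers = np.linspace(0, 1, 9)                       # mu_j spread over the data range
s = 0.1                                              # spatial scale, picked by hand
Phi = gaussian_rbf_design(x, centers, s)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)          # least-squares weights
print("RBF weights:", w)
```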
Other Basis Functions
• Fourier: expansion in sinusoidal functions; infinite spatial extent
• Signal processing: functions localized in both time and frequency, called wavelets
  – Useful for lattices such as images and time series
• The further discussion is independent of the choice of basis, including φ(x) = x

Probabilistic Formulation
• Given:
  – A data set of N observations {x_n}, n = 1,.., N
  – Corresponding target values {t_n}
• Goal: predict the value of t for a new value of x
• Simplest approach: construct a function y(x) whose values are the predictions
• Probabilistic approach: model the predictive distribution p(t|x)
  – It expresses uncertainty about the value of t for each x
  – From this conditional distribution we can predict t for any value of x so as to minimize a loss function
  – The typical loss is squared loss, for which the solution is the conditional expectation of t

Relationship between Maximum Likelihood and Least Squares
• We will show that minimizing the sum-of-squares error is the same as finding the maximum likelihood solution under a Gaussian noise model
• The target variable is a scalar t given by a deterministic function y(x,w) with additive Gaussian noise:
  t = y(x,w) + ε, where ε is zero-mean Gaussian with precision (inverse variance) β
• Thus the distribution of t is univariate normal, with mean y(x,w) and variance β^{-1}:
  p(t|x,w,β) = N( t | y(x,w), β^{-1} )

Likelihood Function
• Data set: input X = {x_1,.., x_N} with targets t = {t_1,.., t_N}
  – The target variables t_n are scalars forming a vector of size N
• Likelihood of the target data: the probability of observing the data, assuming the observations are independent:
  p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )

Log-Likelihood Function
• Likelihood: p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )
• Log-likelihood:
  ln p(t | w, β) = Σ_{n=1}^{N} ln N( t_n | w^T φ(x_n), β^{-1} )
                 = (N/2) ln β - (N/2) ln 2π - β E_D(w)
  – where E_D(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 is the sum-of-squares error function
  – We have used the standard form of the univariate Gaussian:
    N(x | μ, σ^2) = (2πσ^2)^{-1/2} exp{ -(x - μ)^2 / (2σ^2) }

Maximizing the Log-Likelihood Function
• Maximizing the likelihood is equivalent to minimizing E_D(w)
• Take the derivative of ln p(t | w, β) with respect to w, set it equal to zero, and solve for w
  – or, equivalently, take the derivative of -E_D(w)

Gradient of the Log-Likelihood with respect to w
• ∇_w ln p(t | w, β) = β Σ_{n=1}^{N} { t_n - w^T φ(x_n) } φ(x_n)^T
  – using ∇_w [ -(1/2)(a - w b)^2 ] = (a - w b) b
• The gradient is set to zero and we solve for w:
  0 = Σ_{n=1}^{N} t_n φ(x_n)^T - w^T ( Σ_{n=1}^{N} φ(x_n) φ(x_n)^T )
  as shown in the next slide
• The second derivative is negative, making this a maximum

Maximum Likelihood Solution for w
• Solving for w:  w_ML = Φ^+ t
  – X = {x_1,.., x_N} are the samples (vectors of d variables); t = {t_1,.., t_N} are the targets (scalars)
  – Φ^+ = (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudo-inverse of the N x M design matrix Φ
• Design matrix: rows correspond to the N samples, columns to the M basis functions:
  Φ = [ φ_0(x_1)  φ_1(x_1)  ...  φ_{M-1}(x_1)
        φ_0(x_2)    ...
        φ_0(x_N)    ...         φ_{M-1}(x_N) ]
• Pseudo-inverse: a generalization of the notion of matrix inverse to non-square matrices
  – If the design matrix is square and invertible, the pseudo-inverse is the same as the inverse
• The φ_j(x_n) are M basis functions, e.g., Gaussians centered on M data points:
  φ_j(x) = exp( -(1/2) (x - μ_j)^T Σ^{-1} (x - μ_j) )

What is the Role of the Bias Parameter w0?
• Regression function: y(x,w) = w0 + Σ_{j=1}^{M-1} w_j φ_j(x)
• Error function: E_D(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2
• With an explicit bias parameter:
  E_D(w) = (1/2) Σ_{n=1}^{N} { t_n - w0 - Σ_{j=1}^{M-1} w_j φ_j(x_n) }^2
• Setting the derivative with respect to w0 equal to zero and solving:
  w0 = t̄ - Σ_{j=1}^{M-1} w_j φ̄_j,  where  t̄ = (1/N) Σ_{n=1}^{N} t_n  and  φ̄_j = (1/N) Σ_{n=1}^{N} φ_j(x_n)
• The first term is the average of the N values of t; the second is a weighted sum of the averages of the basis function values
• Thus the bias w0 compensates for the difference between the average target value and the weighted sum of the averages of the basis function values

Maximum Likelihood for the Precision β
• We have determined the maximum likelihood solution for w using the probabilistic formulation
  p(t|x,w,β) = N( t | y(x,w), β^{-1} ),  giving  w_ML = Φ^+ t
• With log-likelihood ln p(t | w, β) = (N/2) ln β - (N/2) ln 2π - β E_D(w)
• Taking the gradient with respect to β gives
  1/β_ML = (1/N) Σ_{n=1}^{N} { t_n - w_ML^T φ(x_n) }^2
• Thus the inverse of the noise precision is the residual variance of the target values around the regression function
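A compact way to realize the maximum likelihood solution and the noise precision estimate of the preceding slides is via the Moore-Penrose pseudo-inverse. This sketch, not from the slides, assumes a cubic polynomial design matrix and synthetic sin(2πx) data; np.linalg.pinv computes Φ^+ via an SVD, which is numerically safer than forming (Φ^T Φ)^{-1} directly.

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum likelihood weights and noise precision for the Gaussian noise model."""
    w_ml = np.linalg.pinv(Phi) @ t                   # w_ML = pseudo-inverse of Phi times t
    residuals = t - Phi @ w_ml
    beta_inv = np.mean(residuals ** 2)               # 1/beta_ML = residual variance
    return w_ml, 1.0 / beta_inv

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)
Phi = np.vstack([x ** j for j in range(4)]).T        # cubic polynomial design matrix
w_ml, beta_ml = fit_ml(Phi, t)
print("w_ML:", w_ml, " beta_ML:", beta_ml)           # estimated weights and noise precision
```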
Geometry of the Least Squares Solution
• Consider an N-dimensional space with axes t_n, so that t = (t_1,.., t_N)^T is a vector in this space
• Each basis function φ_j(x_n), corresponding to the jth column of Φ, is represented in this space as a vector ϕ_j
  – If M < N, the vectors ϕ_j span a subspace S of dimension M
• y is the N-dimensional vector with elements y(x_n, w)
• The sum-of-squares error equals the squared Euclidean distance between y and t
• The solution w corresponds to the y lying in the subspace S that is closest to t
  – This corresponds to the orthogonal projection of t onto S
• (Figure: the M columns of the design matrix Φ, i.e., the basis functions represented as N-dimensional vectors ϕ_0, ϕ_1,.., ϕ_{M-1}, span the subspace onto which t is projected.)

Difficulty of a Direct Solution
• Direct solution of the normal equations w_ML = Φ^+ t, with Φ^+ = (Φ^T Φ)^{-1} Φ^T, can lead to numerical difficulties
  – When two basis functions are collinear, Φ^T Φ is singular (determinant = 0)
  – Parameters can have large magnitudes
  – Not uncommon with real data sets
• Can be addressed using
  – Singular Value Decomposition
  – Addition of a regularization term, which ensures the matrix is non-singular

Sequential (On-line) Learning
• The maximum likelihood solution w_ML = (Φ^T Φ)^{-1} Φ^T t is a batch technique
  – It processes the entire training set in one go
  – It is computationally expensive for large data sets, due to the huge N x M design matrix Φ
• The solution is to use a sequential algorithm where samples are presented one at a time
  – Called stochastic gradient descent

Stochastic Gradient Descent
• The error function E_D(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 sums over the data
• Denoting E_D(w) = Σ_n E_n, update the parameter vector w using
  w^(τ+1) = w^(τ) - η ∇E_n
• Substituting for the derivative:
  w^(τ+1) = w^(τ) + η ( t_n - w^(τ)T φ(x_n) ) φ(x_n)
• w is initialized to some starting vector w^(0)
• η is chosen with care to ensure convergence
• Known as the Least Mean Squares (LMS) algorithm
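The LMS update can be written directly from the formula above. The following sketch, not part of the slides, assumes a fixed learning rate η = 0.1, a fixed number of epochs, and a simple (1, x) basis; in practice η would need a schedule, as discussed in the slides that follow.

```python
import numpy as np

def lms(Phi, t, eta=0.1, epochs=50, seed=0):
    """Least Mean Squares: w <- w + eta*(t_n - w.phi_n)*phi_n, one sample at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])                       # starting vector w^(0)
    for _ in range(epochs):
        for n in rng.permutation(len(t)):            # present samples one at a time
            phi_n = Phi[n]
            w = w + eta * (t[n] - w @ phi_n) * phi_n
    return w

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 100)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, x.size)       # targets from an assumed line
Phi = np.vstack([np.ones_like(x), x]).T              # basis phi(x) = (1, x)
print("LMS weights (generated from [1, 2]):", lms(Phi, t))
```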
Derivation of the Gradient Descent Rule
• The error surface shows the desirability of every weight vector: for this error function it is a parabola with a single global minimum

Simple Gradient Descent
• Procedure Gradient-Descent( θ^1: initial starting point; f: function to be minimized; δ: convergence threshold )
    t ← 1
    do
      θ^(t+1) ← θ^t - η ∇f(θ^t)
      t ← t + 1
    while || θ^t - θ^(t-1) || > δ
    return θ^t
• Intuition
  – Taylor expansion of f(θ) in the neighborhood of θ^t: f(θ) ≈ f(θ^t) + (θ - θ^t)^T ∇f(θ^t)
  – Let θ = θ^(t+1) = θ^t + h; thus f(θ^(t+1)) ≈ f(θ^t) + h^T ∇f(θ^t)
  – A step h along ∇f(θ^t) increases f, while a step along -∇f(θ^t) decreases it
  – In other words, the slope ∇f(θ^t) points in the direction of steepest ascent; if we take a step of size η in the opposite direction we decrease the value of f
• One-dimensional example
  – Let f(θ) = θ^2; this function has its minimum at θ = 0, which we want to find by gradient descent
  – We have f'(θ) = 2θ; gradient descent updates θ by -η f'(θ)
  – If θ^t > 0 then θ^(t+1) < θ^t; if θ^t < 0 then f'(θ^t) = 2θ^t is negative, thus θ^(t+1) > θ^t
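The Gradient-Descent procedure and the one-dimensional example translate directly into code. In this sketch the learning rate η, the starting point θ = 3, and the iteration cap are arbitrary choices made for illustration, not values prescribed by the slides.

```python
import numpy as np

def gradient_descent(theta, grad_f, eta=0.1, delta=1e-6, max_iter=10000):
    """Repeat theta <- theta - eta * grad_f(theta) until the step size falls below delta."""
    for _ in range(max_iter):
        theta_new = theta - eta * grad_f(theta)
        if np.linalg.norm(theta_new - theta) <= delta:
            return theta_new
        theta = theta_new
    return theta

# One-dimensional example from the slide: f(theta) = theta^2, so f'(theta) = 2*theta.
theta_star = gradient_descent(np.array([3.0]), lambda th: 2 * th, eta=0.1)
print("minimum found near:", theta_star)             # approaches theta = 0
```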
Difficulties with Simple Gradient Descent
• Performance depends on the choice of learning rate η
  – Large learning rate: "overshoots"
  – Small learning rate: extremely slow
• Solution
  – Start with a large η and settle on an optimal value
  – Need a schedule for shrinking η

Improvements to Gradient Descent
• Line search: adaptively choose η at each step
  – Define a "line" in the direction of the gradient: g(η) = θ^t + η ∇f(θ^t)
  – Find three points η1 < η2 < η3 so that f(g(η2)) is smaller than at both of the others; then set η' = (η1 + η2)/2
• (Figure, Brent's method: the solid line is f(g(η)); the three points bracket its minimum; the dashed line shows the quadratic fit.)

Conjugate Gradient Ascent
• In simple gradient ascent, two consecutive gradients are orthogonal:
  ∇f_obj(θ^(t+1)) is orthogonal to ∇f_obj(θ^t)
  – Progress is zig-zag: progress toward the maximum slows down
  – (Figures: quadratic f_obj(x,y) = -(x^2 + 10y^2) and exponential f_obj(x,y) = exp[-(x^2 + 10y^2)])
• Solution
  – Remember past directions and bias the new direction h^t as a combination of the current gradient g^t and the past direction h^(t-1)
  – Faster convergence than simple gradient ascent

Steepest Descent
• Gradient descent search determines a weight vector w that minimizes E_D(w) by
  – Starting with an arbitrary initial weight vector
  – Repeatedly modifying it in small steps
  – At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface

Definition of the Gradient Vector
• The gradient (derivative) of E with respect to each component of the vector w:
  ∇E[w] ≡ [ ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n ]
• Notice that ∇E[w] is a vector of partial derivatives
• It specifies the direction that produces the steepest increase in E
• The negative of this vector specifies the direction of steepest decrease

Direction of Steepest Descent
• (Figure: the negated gradient is the direction in the w0-w1 plane producing the steepest descent.)

Gradient Descent Rule
• w ← w + Δw, where Δw = -η ∇E[w]
  – η is a positive constant called the learning rate
  – It determines the step size of the gradient descent search
• Component form of gradient descent
  – Can also be written as w_i ← w_i + Δw_i, where Δw_i = -η ∂E/∂w_i

Regularized Least Squares
• A regularization term is added to the error function to control over-fitting
  – Error function to minimize: E(w) = E_D(w) + λ E_W(w)
  – λ is the regularization coefficient: it controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w)
• A simple form of regularizer: E_W(w) = (1/2) w^T w
• The total error function then becomes (with this "quadratic" regularizer):
  E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w
• This regularizer is called weight decay
  – because in sequential learning the weight values decay towards zero unless supported by the data

Quadratic Regularizer
• E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w
• Advantage: the error function remains a quadratic function of w
  – Its exact minimizer can be found in closed form
• Setting the gradient to zero and solving for w:
  w = ( λI + Φ^T Φ )^{-1} Φ^T t
• This is a simple extension of the least-squares solution w_ML = (Φ^T Φ)^{-1} Φ^T t

Generalization of the Quadratic Regularizer
• Regularized error: (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q
  – q = 2 corresponds to the quadratic regularizer; q = 1 is known as the lasso
• (Figure: contours of constant regularization term Σ_j |w_j|^q in the (w1, w2) plane for q = 0.5, q = 1 (lasso), q = 2 (quadratic), and q = 4.)

Geometric Interpretation of the Regularizer
• Contours of the unregularized error function in (w1, w2) space
  – We are trying to find the w that minimizes Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2
  – Any value of w on a contour gives the same error
• Choose that value of w subject to the constraint Σ_{j=1}^{M} |w_j|^q ≤ η
  – We don't want the weights to become too large
• The two approaches are related by using Lagrange multipliers

Minimization of the Unregularized Error Subject to the Constraint
• (Figure: blue contours of the unregularized error function, the constraint region, and the optimum value w*; minimization with the quadratic regularizer, q = 2.)

Sparsity with the Lasso Constraint
• With q = 1 and λ sufficiently large, some of the coefficients w_j are driven to zero
• This leads to a sparse model, where the corresponding basis functions play no role
• The origin of the sparsity is illustrated by the figures: the quadratic regularizer gives a solution where both w0* and w1* are nonzero, while minimization with the lasso regularizer gives a sparse solution with w1* = 0
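The closed-form solution w = (λI + Φ^T Φ)^{-1} Φ^T t of the quadratic (q = 2) regularizer is easy to verify numerically. The sketch below, not from the slides, uses an assumed degree-9 polynomial basis on 10 noisy points; a tiny λ stands in for the (nearly) unregularized fit so the growth of the weight magnitudes becomes visible. The lasso (q = 1) has no such closed form and would need an iterative solver.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form minimizer of 0.5*||t - Phi w||^2 + 0.5*lam*w^T w."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 10)                            # few points, flexible model
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
Phi = np.vstack([x ** j for j in range(10)]).T       # degree-9 polynomial basis

for lam in (1e-8, 1e-2, 1.0):                        # tiny lam approximates lam = 0
    w = ridge_fit(Phi, t, lam)
    print(f"lambda={lam:g}: max |w_j| = {np.abs(w).max():.3g}")
```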
Regularization: Conclusion
• Regularization allows complex models to be trained on small data sets without severe over-fitting
• It limits model complexity, i.e., how many basis functions to use
• The problem of limiting complexity is shifted to one of determining a suitable value of the regularization coefficient λ

Returning to the LeToR Problem
• Try several basis functions and quadratic regularization
• Express the results as E_RMS = sqrt( 2 E(w*) / N )
  – rather than as the squared error E(w*) or as an error rate with thresholded results

Multiple Outputs
• Several target variables t = (t_1,.., t_K), where K > 1
• Can be treated as multiple (K) independent regression problems
  – Different basis functions for each component of t
• More common solution: use the same set of basis functions to model all components of the target vector:
  y(x,w) = W^T φ(x)
  – where y is a K-dimensional column vector, W is an M x K matrix of weights, and φ(x) is an M-dimensional column vector with elements φ_j(x)

Solution for Multiple Outputs
• The set of observations t_1,.., t_N is combined into a matrix T of size N x K such that the nth row is given by t_n^T
• Combine the input vectors x_1,.., x_N into a matrix X
• The log-likelihood function is maximized
• The solution is similar: W_ML = (Φ^T Φ)^{-1} Φ^T T
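Finally, the multiple-output solution W_ML = (Φ^T Φ)^{-1} Φ^T T is the same pseudo-inverse computation applied to an N x K target matrix. In the sketch below, not part of the slides, the K = 2 targets, the shared quadratic basis, and the use of E_RMS = sqrt(2E/N) aggregated over both outputs are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 100, 2
x = rng.uniform(0, 1, N)
Phi = np.vstack([np.ones_like(x), x, x ** 2]).T      # M = 3 shared basis functions
T = np.column_stack([np.sin(2 * np.pi * x),          # N x K target matrix
                     np.cos(2 * np.pi * x)]) + rng.normal(0, 0.1, (N, K))

W_ml = np.linalg.pinv(Phi) @ T                       # M x K weight matrix
E = 0.5 * np.sum((T - Phi @ W_ml) ** 2)              # sum-of-squares error over all outputs
print("W_ML shape:", W_ml.shape, " E_RMS:", np.sqrt(2 * E / N))
```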