1 Regularized linear models

Let us consider the linear model
\[ Y_i = x_i^T \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n. \tag{1} \]
We have seen in the first part of the course that the best linear unbiased estimator for \(\beta\) can be found by solving the equation
\[ (X^T X)\beta = X^T y. \tag{2} \]
However, to solve this equation we need the matrix \(X^T X\) to be invertible. This is not the case if the number of predictors \(p\) is larger than the number of observations \(n\) (because the \(p \times p\) matrix \(X^T X\) has rank at most \(n < p\)), or if some predictor vectors are collinear. We therefore need to propose an alternative estimator for \(\beta\).

1.1 Ridge regression

Hoerl and Kennard (1970) proposed a class of estimators obtained from a modification of the score equation (2), based on a small perturbation \(\lambda\) of the matrix \(X^T X\) that makes it invertible. Thus, we solve the equation
\[ (X^T X + \lambda I)\beta = X^T y, \tag{3} \]
where \(I\) is the identity matrix. The estimator for \(\beta\) is therefore
\[ \hat{\beta}(\lambda) = (X^T X + \lambda I)^{-1} X^T y \]
and it is called the ridge estimator. This estimator is of course biased, but let us consider its MSE:
\[ \mathrm{MSE}(\hat{\beta}(\lambda)) = \|E[\hat{\beta}(\lambda)] - \beta\|_2^2 + \mathrm{trace}\big(\mathrm{Var}(\hat{\beta}(\lambda))\big) = \lambda^2 \beta^T W^2 \beta + \sigma^2\, \mathrm{trace}\big(W (X^T X) W\big), \]
where \(W = (X^T X + \lambda I)^{-1}\). It is easy to see that the first term is zero when \(\lambda = 0\) and is increasing in \(\lambda\). The second term approaches \(\sigma^2\, \mathrm{trace}((X^T X)^{-1})\) when \(\lambda \to 0\) (and thus diverges if \(X^T X\) is ill-conditioned) and is decreasing in \(\lambda\). Intuitively, it should therefore be possible to find a value of \(\lambda\) which balances the contributions of these two terms (bias-variance trade-off) so that we obtain the minimum MSE. Indeed, the following result can be found in Hoerl and Kennard (1970).

Theorem 1.1. There exists a constant \(C > 0\) such that for \(\lambda \in (0, C)\),
\[ \mathrm{MSE}(\hat{\beta}(\lambda)) < \mathrm{MSE}(\hat{\beta}_{OLS}), \]
where \(\hat{\beta}_{OLS} = (X^T X)^{-1} X^T y\) is the ordinary least squares estimator.

It is worth noticing that this result guarantees the existence of a ridge estimator which performs better than the ordinary least squares estimator even when \(X^T X\) is invertible.

The ridge estimator \(\hat{\beta}(\lambda)\) can alternatively be seen as the solution of the optimization problem
\[ \hat{\beta}(\lambda) = \arg\min_{\beta} \sum_{i=1}^n (y_i - x^{(i)}\beta)^2 + \lambda \|\beta\|_2^2, \tag{4} \]
where \(\|\beta\|_2^2 = \sum_{k=1}^p \beta_k^2\) and \(x^{(i)}\) is the \(i\)-th row of the design matrix \(X\). Indeed, if we look for the solution of problem (4) by setting its gradient to zero, we recover equation (3). In problem (4), \(\lambda\) can be interpreted as the Lagrange multiplier of a constrained optimization problem of the form
\[ \hat{\beta}(s) = \arg\min_{\beta} \sum_{i=1}^n (y_i - x^{(i)}\beta)^2 \quad \text{subject to} \quad \|\beta\|_2^2 < s, \]
with \(s\) and \(\lambda\) linked by a one-to-one correspondence. This formulation highlights how the procedure constrains the coefficient estimates in order to avoid an excessively high variance.

The penalization of \(\|\beta\|_2^2\) assumes that all the variables are on the same scale; otherwise the coefficients of the variables with a larger scale would receive a larger weight in the penalty without any good reason. Therefore we usually center and rescale the columns of \(X\) before estimating the model.

The difficult task is of course to find the optimal value of \(\lambda\), and many approaches have been proposed. A qualitative one consists in looking at the ridge trace, i.e. the plot of the estimated coefficients \(\hat{\beta}_k(\lambda)\), \(k = 1, \ldots, p\), as a function of \(\lambda\), and choosing the smallest \(\lambda\) for which all the estimates are "stable". Clearly this is very subjective, and thus more quantitative methods have been developed, based on cross-validation.
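To make the discussion above concrete, the following is a minimal sketch in Python/NumPy (not part of the original notes): it computes the ridge estimator \(\hat{\beta}(\lambda) = (X^T X + \lambda I)^{-1} X^T y\) of equation (3) over a grid of \(\lambda\) values, i.e. exactly the coefficient path that a ridge trace plot displays. The simulated data, the grid, and the function names are illustrative assumptions.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Ridge estimator (X'X + lambda*I)^{-1} X'y for each lambda in a grid.

    X is assumed to be centered and rescaled, y centered, as discussed above.
    Returns an array of shape (len(lambdas), p): one coefficient vector per lambda.
    """
    XtX = X.T @ X
    Xty = X.T @ y
    p = X.shape[1]
    betas = []
    for lam in lambdas:
        # Solve (X'X + lambda*I) beta = X'y, i.e. equation (3)
        betas.append(np.linalg.solve(XtX + lam * np.eye(p), Xty))
    return np.array(betas)

# Small illustrative example with simulated, nearly collinear predictors
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 3] + 0.01 * rng.normal(size=n)      # near-collinearity
beta_true = np.array([2.0, 0.0, -1.0, 0.5, 0.5])
y = X @ beta_true + rng.normal(size=n)

# Center and rescale, as recommended before fitting a penalized model
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

lambdas = np.logspace(-3, 3, 25)
path = ridge_path(X, y, lambdas)
```

Plotting each column of `path` against the (log-scaled) grid of \(\lambda\) values reproduces the ridge trace described above; the same grid of estimates can be reused to evaluate the cross-validation criteria discussed next.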
The idea of (leave-one-out) cross-validation is to remove one observation from the dataset, fit the model, and evaluate the error made in predicting the removed observation. We repeat this for all the observations, and we can then pick the value of \(\lambda\) for which this error is minimum. Luckily, it is not necessary to fit \(n\) models for each value of \(\lambda\): we can use the projection matrix as in the ordinary least squares case. Let us call \(H(\lambda) = X(X^T X + \lambda I)^{-1} X^T\) the projection matrix that maps the observations \(y\) to the fitted values for the chosen \(\lambda\). Then the cross-validation error is given by
\[ CV(\lambda) = \frac{1}{n} \sum_{i=1}^n \frac{\big(y_i - x^{(i)}\hat{\beta}(\lambda)\big)^2}{\big(1 - H_{ii}(\lambda)\big)^2}. \]
It is often preferable to use the generalized cross-validation error:
\[ GCV(\lambda) = \frac{1}{n} \sum_{i=1}^n \frac{\big(y_i - x^{(i)}\hat{\beta}(\lambda)\big)^2}{\big(1 - \mathrm{trace}(H(\lambda))/n\big)^2}. \]
Thus, \(\lambda\) can be chosen by minimizing the cross-validation error or the generalized cross-validation error.

1.2 LASSO

The shortcoming of ridge regression is that in general it does not perform variable selection: all the coefficients are shrunk, but none are set to zero. If we believe that only a few predictors are really relevant in our model, having many small but non-zero estimated coefficients is terribly inefficient. This problem can be solved by changing the norm used to penalize the coefficient vector in (4). The LASSO (least absolute shrinkage and selection operator) estimator is defined as
\[ \hat{\beta}(\lambda) = \arg\min_{\beta} \sum_{i=1}^n (y_i - x^{(i)}\beta)^2 + \lambda \|\beta\|_1, \tag{5} \]
where \(\|\beta\|_1 = \sum_{k=1}^p |\beta_k|\), or, equivalently,
\[ \hat{\beta}(s) = \arg\min_{\beta} \sum_{i=1}^n (y_i - x^{(i)}\beta)^2 \quad \text{subject to} \quad \|\beta\|_1 < s. \]
Fig. 1 illustrates graphically why this different choice of norm allows the coefficients associated with irrelevant predictors to be set exactly to zero (for each choice of \(\lambda\)). The drawback is that there is no analytical solution to (5), and it has to be solved numerically (a minimal computational sketch is given after the references). The optimal value of \(\lambda\) again needs to be chosen by cross-validation. An alternative is to use Mallows' \(C_p\) statistic, defined as
\[ C_p = \sum_{i=1}^n \big(y_i - x^{(i)}\hat{\beta}(\lambda)\big)^2 + 2\tilde{p}\,\hat{\sigma}^2, \]
where \(\tilde{p}\) here is the number of non-zero coefficients. This is again a criterion which penalizes the complexity of the model. Usually we look for a "knee" in the plot of \(C_p\) for different choices of \(\lambda\).

Many statistical software packages (including R) prefer to use as tuning parameter for the LASSO estimator the proportion \(t = s / \|\hat{\beta}_{OLS}\|_1\). In this way we have a tuning parameter \(t\) between 0 and 1. See Hastie et al. (2009) for more details about how to solve the LASSO problem computationally and about the choice of the tuning parameter.

Figure 1: Graphical representation of the constrained optimization problems associated with LASSO estimation (left) and ridge estimation (right). The shape of the constraint region associated with the \(\ell_1\) norm forces the coefficients that would otherwise have been "small" to be exactly zero. This picture is inspired by Figure 3.11 of Hastie et al. (2009).

References

Hoerl, A. E. and Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems". Technometrics, 12:55–67.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, New York.
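As anticipated in Section 1.2, problem (5) has no closed-form solution. The sketch below (not part of the original notes) shows one common numerical strategy, cyclic coordinate descent with soft-thresholding, again in Python/NumPy; the simulated data, the value of \(\lambda\), and the stopping rule are illustrative assumptions, and dedicated implementations (such as those available in R) should be preferred in practice.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200, tol=1e-8):
    """Minimize sum_i (y_i - x^(i) beta)^2 + lam * ||beta||_1 by cyclic coordinate descent.

    X is assumed centered and rescaled, y centered. The values of n_iter and tol
    are illustrative stopping choices, not prescribed by the notes.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_norm2 = (X ** 2).sum(axis=0)              # X_j' X_j for each column
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            # Partial residual excluding the contribution of predictor j
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho_j = X[:, j] @ r_j
            # Exact minimizer of the one-dimensional subproblem in beta_j
            beta[j] = soft_threshold(rho_j, lam / 2.0) / col_norm2[j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# Illustrative use on simulated data where only a few predictors are relevant
rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                  # sparse truth
y = X @ beta_true + rng.normal(size=n)
y = y - y.mean()

beta_hat = lasso_coordinate_descent(X, y, lam=50.0)
```

Because the soft-thresholding step can return exactly zero, several entries of `beta_hat` are typically set exactly to zero, which is the variable-selection behaviour that distinguishes the LASSO from ridge regression.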