1 Regularized linear models

Let us consider a linear model
\[ Y_i = x_i^T \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad i = 1, \dots, n. \tag{1} \]
We have seen in the first part of the course that the best linear unbiased
estimator for $\beta$ can be found by solving the equation
\[ (X^T X)\beta = X^T y. \tag{2} \]
However, to solve this equation we need the matrix $X^T X$ to be invertible.
This is not the case if the number of predictors $p$ is larger than the number
of observations $n$ (because the $p \times p$ matrix $X^T X$ has rank at most $n$) or if there
is collinearity among the predictor vectors. We then need to propose an
alternative estimator for $\beta$.
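As a quick numerical check (a minimal Python sketch with arbitrary, illustrative dimensions), one can verify that with $p > n$ the matrix $X^T X$ has rank at most $n$ and is therefore singular:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # more predictors than observations
X = rng.normal(size=(n, p))

XtX = X.T @ X                      # p x p, but its rank is at most n
print(np.linalg.matrix_rank(XtX))  # prints 20, not 50: X^T X is singular
\end{verbatim}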
1.1 Ridge regression
Hoerl and Kennard (1970) proposed a class of estimators obtained from a
modification of the score equation (2), based on a small perturbation $\lambda$ of
the matrix $X^T X$ that makes it invertible. Thus, we solve the equation
\[ (X^T X + \lambda I)\beta = X^T y, \tag{3} \]
where $I$ is the identity matrix. The estimator for $\beta$ is therefore
\[ \hat{\beta}(\lambda) = (X^T X + \lambda I)^{-1} X^T y \]
and it is called the ridge estimator. This estimator is of course biased, but if we
consider the MSE
\[ \mathrm{MSE}(\hat{\beta}(\lambda)) = \| E[\hat{\beta}(\lambda) - \beta] \|_2^2 + \mathrm{trace}(\mathrm{Var}(\hat{\beta}(\lambda))) = \lambda^2 \beta^T W^2 \beta + \sigma^2\, \mathrm{trace}(W (X^T X) W), \]
where $W = (X^T X + \lambda I)^{-1}$. It is easy to see that the first term is zero when
$\lambda = 0$ and is increasing in $\lambda$. The second term approaches $\sigma^2\,\mathrm{trace}((X^T X)^{-1})$
as $\lambda \to 0$ (and thus diverges if $X^T X$ is ill-conditioned) and is decreasing in $\lambda$. Intuitively, it should be possible to find a
value of $\lambda$ which balances the contributions of these two terms (bias-variance
trade-off) so that we obtain the minimum MSE. Indeed, the following result
can be found in Hoerl and Kennard (1970).
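To make the trade-off concrete, the following small simulation (a sketch with purely illustrative sizes, in which the true $\beta$ and $\sigma^2$ are known by construction) evaluates the two terms of the MSE decomposition on a grid of $\lambda$ values: the squared bias is zero at $\lambda = 0$ and grows with $\lambda$, while the variance term shrinks.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 10, 1.0
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)          # "true" coefficients, known in a simulation
XtX = X.T @ X

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    W = np.linalg.inv(XtX + lam * np.eye(p))
    bias2 = lam**2 * beta @ W @ W @ beta      # lambda^2 beta^T W^2 beta
    var = sigma**2 * np.trace(W @ XtX @ W)    # sigma^2 trace(W X^T X W)
    print(f"lambda={lam:7.1f}  bias^2={bias2:.4f}  variance={var:.4f}")
\end{verbatim}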
Theorem 1.1. There exists a constant $C$ such that, for $\lambda \in (0, C)$,
\[ \mathrm{MSE}(\hat{\beta}(\lambda)) < \mathrm{MSE}(\hat{\beta}_{OLS}), \]
where $\hat{\beta}_{OLS} = (X^T X)^{-1} X^T y$ is the ordinary least squares estimator.
It is worth noticing that this result guarantees that there exists a ridge estimator
which performs better than the ordinary least squares estimator even if $X^T X$
is invertible.
The ridge estimator $\hat{\beta}(\lambda)$ can alternatively be seen as the solution of the
optimization problem
\[ \hat{\beta}(\lambda) = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x^{(i)} \beta)^2 + \lambda \|\beta\|_2^2, \tag{4} \]
where $\|\beta\|_2^2 = \sum_{k=1}^{p} \beta_k^2$ and $x^{(i)}$ is the $i$-th row of the design matrix $X$.
Indeed, if we look for the solution of problem (4) by equating its gradient
to zero, we recover equation (3). In problem (4), $\lambda$ can be interpreted as
the Lagrange multiplier of a constrained optimization problem of the form
\[ \hat{\beta}(s) = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x^{(i)} \beta)^2 \quad \text{subject to } \|\beta\|_2^2 \le s, \]
with $s$ and $\lambda$ linked by a one-to-one correspondence. This expression of the
ridge estimator highlights how the procedure constrains the coefficient
estimates in order to avoid excessively high variance.
The penalization of $\|\beta\|_2^2$ assumes that all the variables are on the same scale;
otherwise coefficients that are large merely because of the scale of their variables
would carry a larger weight in the penalization without any good reason. Therefore we usually center and
rescale the columns of the design matrix $X$ before estimating the model.
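A minimal sketch (plain numpy, with the function name and the simulated data chosen only for illustration) of standardizing the predictors and then computing the ridge estimator from equation (3):
\begin{verbatim}
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator on centered and rescaled predictors."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # each column: mean 0, sd 1
    yc = y - y.mean()                          # centering removes the intercept
    p = X.shape[1]
    # Solve (X^T X + lambda I) beta = X^T y rather than forming an inverse.
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5)) * np.array([1.0, 10.0, 0.1, 5.0, 2.0])  # mixed scales
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(size=50)
print(ridge_fit(X, y, lam=1.0))
\end{verbatim}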
The difficult task is of course to find the optimal value of $\lambda$, and many
approaches have been proposed. A qualitative one consists in looking at the
ridge trace, i.e. the plot of the estimated coefficients $\hat{\beta}_k(\lambda)$, $k = 1, \dots, p$, as
a function of $\lambda$, and choosing the smallest $\lambda$ for which all the estimates are
“stable”. Clearly this is very subjective, and thus more quantitative methods
have been developed, based on cross-validation. The idea is to remove one
observation from the dataset, fit the model, and evaluate the error in
predicting the removed observation. We repeat this for all the observations and
pick the value of $\lambda$ for which this error is minimum.
Luckily, it is not necessary to fit all $n$ models for each value of $\lambda$: we
can use the projection matrix, as in the ordinary least squares case. Let
us call $H(\lambda) = X(X^T X + \lambda I)^{-1} X^T$ the projection matrix that maps the
observations $y$ to the fitted values for the chosen $\lambda$. Then, the cross-validation
error is given by
\[ CV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \frac{(y_i - x^{(i)} \hat{\beta}(\lambda))^2}{(1 - H_{ii}(\lambda))^2}. \]
It is often preferable to use the generalized cross-validation error:
\[ GCV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \frac{(y_i - x^{(i)} \hat{\beta}(\lambda))^2}{(1 - \mathrm{trace}(H(\lambda))/n)^2}. \]
Thus, $\lambda$ can be chosen by minimizing the cross-validation error or the generalized cross-validation error.
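The following sketch (plain numpy, hypothetical data) computes both $CV(\lambda)$ and $GCV(\lambda)$ from a single fit through $H(\lambda)$ and selects $\lambda$ by minimizing the GCV error over a grid:
\begin{verbatim}
import numpy as np

def cv_gcv(X, y, lam):
    """Leave-one-out CV and generalized CV errors for ridge regression,
    computed from the projection matrix H(lambda) without refitting."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y                           # y_i - x^(i) beta_hat(lambda)
    cv = np.mean((resid / (1.0 - np.diag(H)))**2)
    gcv = np.mean((resid / (1.0 - np.trace(H) / n))**2)
    return cv, gcv

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=80)
grid = np.logspace(-3, 3, 25)
best = min(grid, key=lambda lam: cv_gcv(X, y, lam)[1])  # minimize GCV
print("lambda chosen by GCV:", best)
\end{verbatim}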
1.2 LASSO
The shortcoming of ridge regression is that in general it does not perform
variable selection: all the coefficients are shrunk, but none are set exactly to
zero. If we believe that only a few predictors are really relevant in our model,
having a lot of small but non-zero estimated coefficients is terribly inefficient.
This problem can be solved by changing the norm used to penalize the
coefficient vector in (4). The LASSO (least absolute shrinkage and selection
operator) estimator is defined as
\[ \hat{\beta}(\lambda) = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x^{(i)} \beta)^2 + \lambda \|\beta\|_1, \tag{5} \]
where $\|\beta\|_1 = \sum_{k=1}^{p} |\beta_k|$,
or, equivalently,
\[ \hat{\beta}(s) = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x^{(i)} \beta)^2 \quad \text{subject to } \|\beta\|_1 \le s. \]
Fig. 1 illustrates graphically why this different choice of norm allows the
coefficients associated with irrelevant predictors to be set to zero (for each
choice of $\lambda$). The drawback is that there is no analytical solution for (5),
and it has to be solved numerically. The optimal value of $\lambda$ again needs to
be chosen by cross-validation. An alternative is to use Mallows' $C_p$ statistic,
defined as
\[ C_p = \sum_{i=1}^{n} (y_i - x^{(i)} \hat{\beta}(\lambda))^2 + 2\tilde{p}\hat{\sigma}^2, \]
where $\tilde{p}$ here is the number of non-zero coefficients. This is again a criterion
which penalizes the complexity of the model. Usually we look for a
“knee” in the plot of $C_p$ for different choices of $\lambda$. Many statistical
packages (including R) prefer to use as tuning parameter for the LASSO
estimator the proportion $t = s/\|\hat{\beta}_{OLS}\|_1$. In this way we have a tuning
parameter $t$ between 0 and 1. See Hastie et al. (2009) for more details
about how to solve the LASSO problem computationally and about the
choice of the tuning parameter.
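As a sketch of one common numerical strategy (not necessarily the algorithm used by any particular package, and with purely illustrative data), problem (5) can be solved by cyclic coordinate descent, where each coefficient update is a soft-thresholding step:
\begin{verbatim}
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the objective of equation (5).
    Assumes X is centered and scaled and y is centered."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)                  # x_k^T x_k for each column
    for _ in range(n_iter):
        for k in range(p):
            r_k = y - X @ beta + X[:, k] * beta[k]   # partial residual without x_k
            beta[k] = soft_threshold(X[:, k] @ r_k, lam / 2.0) / col_sq[k]
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=50.0), 3))  # only the first two coefficients stay non-zero
\end{verbatim}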
Figure 1: Graphical representation of the constrained optimization problems
associated with LASSO estimation (left) and ridge estimation (right). The
shape of the constraint associated with the $\ell_1$ norm forces the coefficients that
would have been “small” to be zero. This picture is inspired by Figure 3.11
of Hastie et al. (2009).
References
Hoerl, A. E. and Kennard, R. W. (1970). “Ridge Regression: Biased Estimation for Nonorthogonal Problems”. Technometrics, 12:55–67.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. New York: Springer.