
Simplicity Matters? The Case of Non-Parametric Models
Introduction. Influential arguments on the importance of parametric simplicity for model
selection have been biased by their focus on parametric models; see Forster and Sober (1994),
Forster (2001), and Hitchcock and Sober (2004). Such a focus leads us to believe that there is a
fundamental trade-off between parametric simplicity and goodness of fit. But no such trade-off
is considered when we select certain non-parametric models with few adjustable parameters, like
KNN ('k nearest neighbours') regression models. We can increase the fit of a KNN regression model
while keeping its number of adjustable parameters at 1.
The trade-off appears to make sense when we are trying to adjust a parametric regression
model, such as a linear regression model, to our data. Nonetheless, if our main goal is to make
accurate predictions and we do not care about the interpretability of our model, then we
might not even consider those parametric models in the first place. A good KNN regression model
can fit the data much better and can be a great predictive tool. In other words, parametric simplicity
might be the least of our worries if our main goal is to make accurate predictions.
There is, however, a fundamental trade-off that is made as we select any kind of regression
model: the trade-off between the bias and the variance of an estimator for a certain function. In fact,
the apparent importance of parametric simplicity is reducible to the importance of that trade-off.
In other words, a fit/simplicity trade-off is desirable only to the extent that it allows us to make
an appropriate bias/variance trade-off.
Even so, we can always use k-fold cross-validation in order to assess whether we have made a
good bias/variance trade-off. Such a validation can be computed without taking parametric
simplicity into account. Furthermore, it is possible to show that a selection criterion based on k-fold
cross-validation can outperform selection criteria that are computed from the number of
adjustable parameters, such as the Akaike information criterion (AIC).
In the first part of this paper, I present two very different types of regression: linear
regression and KNN regression. The first is parametric and the second is non-parametric. I describe
some of their respective properties. The comparison between the two models allows me to
considerably downplay the importance of parametric simplicity in model selection. In the second
part, I explain the nature of the bias/variance trade-off and how it can be made appropriately.
This allows me to underscore the fact that we can select good models without taking their
parametric simplicity into account, and that we can do so with selection criteria that outperform
those relying on that kind of simplicity.
1. Parametric and Non-Parametric Regressions
Regression models are used when we are trying to estimate the function that relates a
set of independent variables (𝑋), the predictors, to a dependent variable (π‘Œ), the predicted
variable. In reality, the dependent variable will most probably not be fully determined by that
function. We might not have all the relevant predictors at our disposal, or there might be some
unmeasurable variance in our data (James et al. 2013, 18). Thus, what we are usually trying to
estimate is 𝑓(𝑋) in the following expression, where πœ€ is the irreducible error term.
π‘Œ = 𝑓(𝑋) + πœ€
In other words, we are often facing scenarios where a perfect estimate will not predict the
dependent variable perfectly.
We can estimate that function by making strong or weak assumptions about its form. For
example, when we use a parametric model, we are making strong assumptions about the form
of the function. A linear (parametric) model, for instance, can have the following form, where p'
stands for the number of adjustable parameters (excluding the intercept) and j indexes them:

$$ Y = \beta_0 + \sum_{j=1}^{p'} \beta_j x_j + \varepsilon $$
Our estimate of the function, 𝑓̂(𝑋) (I use the notation β€˜^’ to identify an estimate), will be obtained by adjusting the free parameters to the data.
Linear regressions usually have at least 2 adjustable parameters.
Here is a simple illustration. Suppose that the function relating an independent variable
(π‘₯) and a dependent variable (𝑦) can be expressed as follows, where πœ€ is an error term that
follows a normal distribution centered on 0 and of variance equal to 1.
π‘Œ = 𝑓(𝑋) + πœ€ = 20 +
π‘₯2
+ πœ€
6
This function is unknown to us, but we can measure the dependent and the independent
variables. Thus, we might try to fit a linear regression model with 2 adjustable parameters on 100
independent observations of (x, y). The resulting estimation of 𝑓(π‘₯) will have the following form:

$$ \hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x $$
It is a straight line that can be visualised in Figure 1. The small circles in that figure represent the
observations.
Figure 1
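To keep the illustration concrete, here is a minimal sketch of this simulation in Python (using numpy). The seed, the range chosen for the predictor, and the variable names are my own illustrative assumptions; only the sample size of 100, the error distribution, and the true function come from the running example, and this is not meant as the exact code behind Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed

# Simulate 100 observations from the "unknown" function Y = 20 + x^2/6 + eps
n = 100
x = rng.uniform(-10, 10, size=n)               # assumed range for the predictor
eps = rng.normal(loc=0.0, scale=1.0, size=n)   # irreducible error, N(0, 1)
y = 20 + x**2 / 6 + eps

# Fit the two-parameter linear model f_hat(x) = b0 + b1*x by least squares
b1, b0 = np.polyfit(x, y, deg=1)               # polyfit returns the highest-degree coefficient first
print(f"f_hat(x) = {b0:.2f} + {b1:.2f} * x")
```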
Now, we could have approached the same problem very differently. We could have
decided to fit a non-parametric model on the observations. One possible advantage of this
alternative is that we do not have to assume any particular form for 𝑓(π‘₯) in order to fit
non-parametric models. The assumptions that we need to make are thus much weaker in that
respect.
For example, when we use a KNN regression model, we may estimate 𝑓(π‘₯) with a function
𝑓̂(π‘₯0) that finds the K points nearest to any given π‘₯0 (using Euclidean distance); call this set of
neighbours 𝑁0. It then assigns to π‘₯0 the mean of their respective values for π‘Œ. Here is the
estimating function:

$$ \hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i $$
The rationale behind this estimator is that similar values for π‘₯ should have similar values for π‘Œ.
In this particular case, the KNN regression has only 1 adjustable parameter: K. We can see what
𝑓̂(π‘₯0) looks like in Figure 2 when K = 10. It has the shape of a staircase and it fits the data much
better than the simple linear regression model.
Figure 2
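To make the estimating function concrete, here is a minimal sketch of the KNN prediction rule written directly from the formula above (pure numpy; the helper name knn_predict and the prediction grid are mine, and K = 10 matches the model shown in Figure 2).

```python
import numpy as np

def knn_predict(x0, x_train, y_train, k=10):
    """KNN regression estimate at x0: average the Y-values of the K
    training points whose x-values are closest to x0."""
    dist = np.abs(x_train - x0)            # Euclidean distance in one dimension
    neighbours = np.argsort(dist)[:k]      # indices of the K nearest points (the set N_0)
    return y_train[neighbours].mean()      # mean of their Y-values

# Usage with the simulated (x, y) data from the previous sketch:
# grid = np.linspace(-10, 10, 200)
# y_hat = np.array([knn_predict(x0, x, y, k=10) for x0 in grid])
```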
The parametric and the non-parametric approaches to estimating 𝑓(π‘₯) are very
different. The parametric approach can be qualified as a top-down approach: we first
assume a particular form for 𝑓(π‘₯) and then we see if we can successfully adjust the parameters
with the data at hand. The non-parametric approach, however, can be qualified as a bottom-up
approach: the data at hand determine the form that our estimation of 𝑓(π‘₯) will take.
Moreover, non-parametric models can be simpler in many respects. They are
grounded in fewer assumptions about 𝑓(π‘₯), and models like KNN regressions do not assume any
distribution for the error term. They may also contain fewer adjustable parameters. In fact, as far
as parametric simplicity goes, it does not get any better than the KNN model described above.
Of course, the fact that non-parametric models rest on fewer assumptions than their
parametric counterparts is an obvious advantage in certain scenarios. When we lack the
necessary knowledge to construct a parametric model, a non-parametric option can be more
than welcome. But it is not obvious that the parametric simplicity of non-parametric models is
a desirable feature at all.
If we look at some statements found in the philosophy of science literature, their
parametric simplicity should matter a great deal because it is supposed to prevent overfitting by
making an appropriate trade-off between fit and simplicity:
Simplicity matters. A sufficiently simple hypothesis, formulated on the basis of a given body of data, will not
drastically overfit the data. It does not contain too many parameters whose values have been set according to the
data. Thus, a simple hypothesis that successfully accommodates a given body of data can be expected to make more
accurate predictions about new data than a more complex theory that fits the data equally well (Hitchcock & Sober
2004, 22).
Model selection involves a trade-off between simplicity and fit for reasons that are now fairly well understood
(Forster 2001, 83).
Perhaps the most interesting of the standard arguments in favor of simplicity is based upon the concept of
β€˜overfitting’. The idea is that predicting the future by means of an equation with too many free parameters compared
to the size of the sample is more likely to produce a prediction far from the true value. […] this argument is sound
and compelling, so far as using an equation for predictive purposes is concerned (Kelly 2007, 113).
If we only consider our linear model, this makes a lot of sense. Overfitting is a
phenomenon that occurs when we end up modeling the irreducible error in our data, such that
it looks like we are estimating 𝑓(π‘₯) quite well. However, if we were to make more observations,
we would realise that our model is not good at all, because the irreducible error cannot be
captured. Now, it is possible to model the irreducible error with a linear regression simply by
adding independent variables into our model. In effect, this comes down to increasing the fit of
our model by adding more adjustable parameters. Evidently, a good fit with the data
is important. But we should not fit our data so well that we end up modeling the irreducible
error term by adding too many adjustable parameters. Therefore, we may be tempted to conclude
that there is an important trade-off between parametric simplicity and goodness of fit.
But no such trade-off is considered when we look at our KNN regression model. Its
number of adjustable parameters is 1, and it is possible to severely overfit the data by setting the
value of K to 1. Figure 3 offers an explicit visualisation of the phenomenon. The regression line
passes through each data point, such that the fit of the model is perfect. But we would definitely
be unable to perfectly predict any future data. In other words, the apparent perfection of our
model is due to the fact that we are modeling the irreducible error: we are overfitting the data.
Figure 3
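A quick way to see this numerically is to compare the model's error on the data used to fit it with its error on fresh data drawn from the same process when K = 1. The sketch below assumes the simulated setup of the running example and uses scikit-learn's KNeighborsRegressor; the seed and the sample sizes are arbitrary choices of mine.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

def simulate(n, rng):
    x = rng.uniform(-10, 10, size=n)
    y = 20 + x**2 / 6 + rng.normal(0, 1, size=n)
    return x.reshape(-1, 1), y

X_train, y_train = simulate(100, rng)   # the observed data
X_test, y_test = simulate(100, rng)     # "future" data from the same process

knn1 = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)
print("training MSE:", mean_squared_error(y_train, knn1.predict(X_train)))  # essentially 0: a "perfect" fit
print("test MSE:", mean_squared_error(y_test, knn1.predict(X_test)))        # well above Var(eps) = 1
```

The training error is essentially zero because each training point is its own nearest neighbour, while the error on fresh data is bounded below by the irreducible error and in practice is noticeably larger: the perfect fit was an artefact of modeling the noise.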
The real challenge when we are using a KNN regression model is to figure out the
appropriate value for K such that we will not overfit the data. It makes no sense to try to figure
out which KNN model strikes the right balance between goodness of fit and parametric
simplicity. Therefore, parametric simplicity does not always matter in model selection. The trade-off between simplicity and fit is not a fundamental one in model selection.
In fact, even if the trade-off makes sense in the parametric case, we might not even
consider linear models in the first place if our only goal is to make accurate predictions. It is
true that a good KNN regression model is much more difficult to interpret than a good linear
regression model. For example, the estimating function given by a KNN regression model does
not allow us to tell what happens to (π‘Œ) when we increase or decrease (𝑋). In contrast, a good
linear regression model will easily inform us about the expected increase of (π‘Œ) when (π‘₯𝑖) increases.
But this is totally irrelevant if we only care about making good predictions. In that case, the
fit/simplicity trade-off can be the last of our worries.
Now, Elliott Sober and Christopher Hitchcock (2004) discuss the importance of parametric
simplicity in a context where the only goal is to make accurate predictions. This makes
their emphasis on the importance of a fit/simplicity trade-off all the more peculiar. A good KNN
regression model can be a great predictive tool, and its parametric simplicity does not guard it at
all against the sin of overfitting:
Attaining the goal of predictive accuracy requires that one guard against overfitting […], and giving due weight to
simplicity is a means to that end (Hitchcock & Sober 2004, 14).
In sum, parametric simplicity does not play any role when we are selecting a KNN model
with only 1 adjustable parameter. Such models can severely overfit the data even if they are
extremely simple. Therefore, the trade-off between simplicity and fit is not fundamental. It does
not make sense in every regression modeling scenario. Moreover, even if there is a family of
models where such a trade-off can make sense, we might not even consider those models if our
main goal is to obtain a model that makes accurate predictions. Non-parametric models with very
few adjustable parameters, like KNN regression models, can provide some of the best
predictive tools when they do not overfit the data.
2. The Bias/Variance Trade-off
There is, however, a fundamental trade-off that is made as we select any kind of regression
model: the trade-off between the bias and the variance of the estimator 𝑓̂(π‘₯). In theory, the quality
of the estimator can be assessed by calculating the mean squared error (MSE), given by the
expression 𝐸(𝑦0 βˆ’ 𝑓̂(π‘₯0))Β². The greater the MSE, the worse the model is. (It is worth mentioning
that the MSE is not useful by itself; what is useful is to compare the MSEs of different models in
order to make an appropriate choice.) Now, what is particularly interesting is that this expression
can be broken down into three fundamental components as follows:

$$ E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon) $$
The first term on the right side of the equality stands for the variance of our estimator. It
gives us an idea of how much our estimator would change, given a new data set. Ideally, we
want an estimator that is relatively constant across different independent data sets taken from
the same population. The second term stands for the bias of our estimator. Ideally, we want
an estimator that is equal to 𝑓(π‘₯) on average. Finally, the third term stands for the variance
of the irreducible error term.
If we consider regression models more specifically, we can say that the variance of the
estimator provided by a linear regression model will increase as the number of free parameters
increases. Likewise, the variance of an estimator provided by a KNN regression model will increase
as the value of K decreases. This does not matter if the bias decreases in such a way that the
MSE also decreases. But there always comes a point where the decrease in bias no longer
compensates for the increase in variance, and the MSE starts to increase as well. This means that,
regardless of the nature of our model (parametric or non-parametric), we always need to figure
out an appropriate trade-off between
the variance of our estimator and its bias. This is the fundamental trade-off (James et al. 2013,
35). It explains why parametric simplicity seems to be so important.
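Because we know 𝑓(π‘₯) in the simulated example, the decomposition can be illustrated directly: draw many training sets, fit the KNN estimator for a given K, and split the expected squared error at a test point into variance, squared bias, and Var(πœ€). The sketch below is only a demonstration under the assumptions of the running example; the number of simulations, the test point x0 = 3, and the candidate values of K are arbitrary choices of mine.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

def f(x):
    return 20 + x**2 / 6                 # the true function (known only because we simulate)

x0, n_sims, n = 3.0, 500, 100            # test point, number of training sets, sample size

for k in (1, 10, 50):
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(-10, 10, size=n)
        y = f(x) + rng.normal(0, 1, size=n)
        model = KNeighborsRegressor(n_neighbors=k).fit(x.reshape(-1, 1), y)
        preds[s] = model.predict([[x0]])[0]
    variance = preds.var()                      # Var(f_hat(x0)) across training sets
    bias_sq = (preds.mean() - f(x0)) ** 2       # [Bias(f_hat(x0))]^2
    # Expected test MSE at x0 is roughly variance + bias^2 + Var(eps), with Var(eps) = 1
    print(f"K={k:3d}  variance={variance:.3f}  bias^2={bias_sq:.3f}  MSE~{variance + bias_sq + 1:.3f}")
```

Small K gives low bias but high variance; large K gives low variance but higher bias; a good K is one that balances the two.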
But in practice, we do not know 𝑓(π‘₯). Hence, we cannot calculate the real MSE explicitly
in order to find out whether we have made the appropriate trade-off. Luckily, there are ways to
estimate the real MSE of our model. Regardless of the model that we are using, we can use a
general technique called k-fold cross-validation (CV). A k-fold CV consists in randomly dividing our
data set into k groups, or folds. We then construct our model with all the data except one fold and
calculate the observed MSE on the fold that was not used in the construction of the model. We
repeat this procedure for each of the k folds, and therefore obtain k observed MSE scores. The
average of those scores will give us an estimate of the real MSE of our model:
π‘˜
1
πΆπ‘‰π‘˜ = βˆ‘ 𝑀𝑆𝐸𝑖
π‘˜
𝑖=1
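In practice, the whole procedure (divide the data into k folds, fit on k βˆ’ 1 of them, compute the MSE on the held-out fold, average the k scores) takes only a few lines of code. Here is a sketch using scikit-learn's cross_val_score with k = 10 to choose K for the KNN model of the running example; the grid of candidate values for K is my own choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
x = rng.uniform(-10, 10, size=100)
y = 20 + x**2 / 6 + rng.normal(0, 1, size=100)
X = x.reshape(-1, 1)

# CV_k = (1/k) * sum of the k observed MSEs; here k = 10
cv_mse = {}
for K in (1, 2, 5, 10, 20, 40):
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=K), X, y,
                             cv=10, scoring="neg_mean_squared_error")
    cv_mse[K] = -scores.mean()          # sklearn returns negated MSEs, so flip the sign

best_K = min(cv_mse, key=cv_mse.get)
print(cv_mse, "-> chosen K:", best_K)
```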
If k equals the number of observations n, then we call this kind of cross-validation
β€˜leave-one-out cross-validation’ (LOOCV). But there is also a bias/variance trade-off to consider
when we are estimating the real MSE of a model. The greater k is, the smaller the bias of our
estimator is; LOOCV therefore seems optimal in that respect. However, the greater k is, the
greater the variance of our estimator. Thus, LOOCV is actually far from being optimal:
To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically,
given these considerations, one performs k-fold cross-validation using k=5 or k=10, as these values have been shown
empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance
(James et al. 2013, 184).
Now the first point that I wish to underscore in this section is the fact that the apparent
importance of parametric simplicity in certain contexts is reducible to the importance of a
fundamental trade-off between the bias and the variance of our models. By itself, parametric
simplicity is not a desirable feature of a model, as we could see when we considered KNN
regression models with only 1 free parameter. The second point that I want to make concerns
the relevance of taking parametric simplicity into consideration as we select our models.
Clearly, the idea that there is an important trade-off between simplicity and fit did not
appear out of thin air. The Akaike information criterion (AIC), for example, explicitly relies on a
trade-off between fit and simplicity.
Akaike (1973) showed that an unbiased estimate of the predictive accuracy of a model can be obtained by
considering both its fit-to-data and its simplicity, as measured by the number of adjustable parameters it contains.
[…] Akaike’s remarkable result is that (modulo certain assumptions):
An unbiased estimate of the predictive accuracy of model M β‰… log[Pr(Data|L(M))] βˆ’ k (Hitchcock & Sober 2004, 12).
Here, k refers to the number of adjustable parameters, and L(M) refers to the model whose
adjustable parameters have been fixed by maximum likelihood estimation. We can easily
see that the estimate of predictive accuracy is penalised by an increase in k.
Needless to say, this criterion cannot be applied in a context where we are using a
KNN regression model. But it is often used when we are trying to choose a reasonable parametric
regression model. In such a context, it can be shown to be an unbiased estimator of the predictive
accuracy of our model (its MSE). Nevertheless, there are alternative methods to evaluate the quality
of our models, and some of them do not involve parametric simplicity at all. A selection method
based on k-fold CV, for example, does not rely on parametric simplicity. Moreover, a k-fold CV
can provide a more efficient selection criterion than the AIC.
The AIC is actually asymptotically equivalent to LOOCV (Hitchcock & Sober 2004, 13).
Consequently, the AIC suffers from the same defect. As we know, it is not enough to have an
unbiased estimator of the predictive accuracy of our model. We also need an estimator that does
not have too much variance, and LOOCV is very poor in that respect. Therefore, a k-fold CV, where
k is smaller than the number of observations n, can be preferable.
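To see how a CV-based criterion can stand in for the AIC in the parametric case, here is a sketch that selects the degree of a polynomial regression in two ways: with the AIC of a Gaussian linear model, computed in one common form as n Β· log(RSS/n) + 2k with additive constants dropped (this particular formula is my assumption, not something taken from the paper), and with 10-fold cross-validation. The data-generating process is the one from the running example, and the candidate degrees are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
x = rng.uniform(-10, 10, size=100)
y = 20 + x**2 / 6 + rng.normal(0, 1, size=100)

for degree in (1, 2, 3, 4, 6):
    # Design matrix with columns x, x^2, ..., x^degree (the intercept is handled by the model)
    X = np.column_stack([x**d for d in range(1, degree + 1)])
    model = LinearRegression().fit(X, y)
    rss = ((y - model.predict(X)) ** 2).sum()
    n_params = degree + 1                                   # slopes plus the intercept
    aic = len(y) * np.log(rss / len(y)) + 2 * n_params      # one common AIC form, constants dropped
    cv_mse = -cross_val_score(LinearRegression(), X, y,
                              cv=10, scoring="neg_mean_squared_error").mean()
    print(f"degree={degree}  AIC={aic:8.2f}  10-fold CV MSE={cv_mse:.3f}")
```

Both criteria should favour the quadratic model here; the point is simply that the CV column is computed without ever counting parameters.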
Conclusion.
Overall, I have reached three important conclusions. The first is that there is no
fundamental trade-off between parametric simplicity and goodness of fit; it does not make sense
in every modeling scenario. This is especially true if our main goal is to make accurate
predictions. The second is that the apparent trade-off between parametric simplicity and
goodness of fit can be reduced to a trade-off between the bias and the variance of our estimator
for 𝑓(π‘₯). We can evaluate whether we have made the appropriate bias/variance trade-off by estimating
the real MSE of our model. This leads me to my final conclusion, which is that it is never necessary
to take the parametric simplicity of our models into account. In fact, model selection criteria that
do not take parametric simplicity into consideration can outperform those that do.
References
Akaike, H. (1973), β€œInformation Theory as an Extension of the Maximum Likelihood Principle”, (In
B. Petrov and F. Csaki (Eds.), Second International Symposium on Information Theory. (pp. 267-281). Budapest: Akademiai Kiado).
Forster, M. (2001), β€œThe New Science of Simplicity”, (In A. Zellner, H. Keuzenkamp, and M.
McAleer (Eds.), Simplicity, Inference and Modeling. (pp. 83-119). Cambridge: Cambridge
University Press).
Forster, M. and Sober, E. (1994), β€œHow to Tell When Simpler, More Unified, or Less Ad Hoc
Theories will Provide More Accurate Predictions”, The British Journal for the Philosophy of
Science, 45: 1-35.
Hitchcock, C. and Sober, E. (2004), β€œPrediction Versus Accommodation and the Risk of
Overfitting”, The British Journal for the Philosophy of Science, 55: 1-34.
James, G. et al. (2013), An Introduction to Statistical Learning with Applications in R, Dordrecht:
Springer.
Kelly, K. T. (2007), β€œHow Simplicity Helps You Find the Truth Without Pointing at it”, (In V.
Harizanov, M. Friend, and N. Goethe (Eds.), Philosophy of Mathematics and Induction. (pp. 111-143). Dordrecht: Springer).