Simplicity Matters? The Case of Non-Parametric Models

Introduction.

Influential arguments about the importance of parametric simplicity for model selection have been biased by their focus on parametric models (see Forster & Sober 1994; Forster 2001; Hitchcock & Sober 2004). Such a focus leads us to believe that there is a fundamental trade-off between parametric simplicity and goodness of fit. But no such trade-off is considered when we select certain non-parametric models with few adjustable parameters, such as KNN ("k nearest neighbours") regression models. We can increase the fit of a KNN regression model while keeping its number of adjustable parameters at 1. The trade-off appears to make sense when we are trying to adjust a parametric regression model to our data, such as a linear regression model. Nonetheless, if our main goal is to make accurate predictions and we do not care about the interpretability of our model, then we might not even consider those parametric models in the first place. A good KNN regression model can fit the data much better and is a great predictive tool. In other words, parametric simplicity might be the least of our worries if our main goal is to make accurate predictions.

There is, however, a fundamental trade-off that is made as we select any kind of regression model: the trade-off between the bias and the variance of an estimator for a certain function. In fact, the apparent importance of parametric simplicity is reducible to the importance of that trade-off. In other words, a fit/simplicity trade-off is desirable only to the extent that it allows us to make an appropriate bias/variance trade-off. Even so, we can always use a k-fold cross-validation to assess whether we have made a good bias/variance trade-off. Such a validation can be computed without taking parametric simplicity into account. Furthermore, it is possible to show that a selection criterion based on k-fold cross-validation can outperform selection criteria that are computed with the number of adjustable parameters, such as the Akaike information criterion (AIC).

In the first part of this paper I present two very different types of regression: linear regression and KNN regression. The first is parametric and the second is non-parametric. I describe some of their respective properties. The comparison between the two models allows me to considerably downplay the importance of parametric simplicity in model selection. In the second part, I explain the nature of the bias/variance trade-off and how it can be made appropriately. This allows me to underscore the fact that we can select good models without taking their parametric simplicity into account. We can do so and still outperform model selection criteria that rely on that kind of simplicity.

1. Parametric and Non-Parametric Regressions

Regression models are used when we are trying to estimate the function that relates a set of independent variables ($X$) (the predictors) to a dependent variable ($Y$) (the predicted variable). In reality, the dependent variable will most probably not be fully determined by that function. We might not have all the relevant predictors at our disposal, or there might be some unmeasurable variance in our data (James et al. 2013, 18). Thus, what we are usually trying to estimate is $f(X)$ in the following expression, where $\varepsilon$ is the irreducible error term:

$$Y = f(X) + \varepsilon$$

In other words, we often face scenarios where even a perfect estimate will not predict the dependent variable perfectly.
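To make this point concrete, here is a minimal sketch in Python (my own illustration, not part of the original argument). It simulates data from the example function $20 + x^2/6$ used in the illustration below, with $\varepsilon$ drawn from a standard normal distribution; the sample size, the range of the x values, and the seed are arbitrary choices. Even an estimate equal to the true $f(x)$ leaves a mean squared error of roughly $\mathrm{Var}(\varepsilon) = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The data-generating function from the illustration in the next paragraphs.
    return 20 + x**2 / 6

n = 10_000
x = rng.uniform(-10, 10, size=n)
eps = rng.normal(0, 1, size=n)   # irreducible error, Var(eps) = 1
y = f(x) + eps                   # Y = f(X) + epsilon

# Even predicting with the *true* f leaves a mean squared error of about Var(eps).
print(np.mean((y - f(x)) ** 2))  # roughly 1.0
```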
We can estimate that function by making strong or weak assumptions about its form. When we use a parametric model, for example, we are making strong assumptions about the form of the function. A linear (parametric) model, for instance, can have the following form, where $p'$ stands for the number of adjustable parameters (excluding the intercept):

$$Y = \beta_0 + \sum_{j=1}^{p'} \beta_j x_j + \varepsilon$$

Our estimate of the function, $\hat{f}(X)$ (I use the notation "^" to identify an estimate), is obtained by adjusting the free parameters to the data. Linear regressions usually have at least 2 adjustable parameters.

Here is a simple illustration. Suppose that the function relating an independent variable ($x$) and a dependent variable ($y$) can be expressed as follows, where $\varepsilon$ is an error term that follows a normal distribution centered on 0 and with variance equal to 1:

$$Y = f(X) + \varepsilon = 20 + \frac{x^2}{6} + \varepsilon$$

This function is unknown to us, but we can measure the dependent and the independent variables. Thus we might try to fit a linear regression model with 2 adjustable parameters on 100 independent observations of $(x, y)$. The resulting estimate of $f(x)$ will have the following form:

$$\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$$

It is a straight line that can be visualised in Figure 1. The small circles in that figure represent the observations.

[Figure 1: the fitted linear regression line plotted over the 100 observations.]

Now, we could have approached the same problem very differently. We could have decided to fit a non-parametric model on the observations. One possible advantage of this alternative choice is that we do not have to assume any particular form for $f(x)$ in order to fit non-parametric models. The assumptions that we need to make are thus much weaker in that respect. For example, when we use a KNN regression model, we may estimate $f(x)$ with a function $\hat{f}(x_0)$ that finds the K points nearest to any $x_0$ (calculated by using a Euclidean distance), collects them in a set of observations $N_0$, and then associates with $x_0$ the mean of their respective values for $Y$. Here is the estimating function:

$$\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i$$

The rationale behind this estimator is that similar values for $x$ should have similar values for $Y$. In this particular case, the KNN regression has only 1 adjustable parameter: K. We can see what $\hat{f}(x_0)$ looks like in Figure 2 when K = 10. It has the shape of a staircase and it fits the data much better than the simple linear regression model.

[Figure 2: the KNN regression fit with K = 10 plotted over the same observations.]

The parametric and non-parametric approaches to estimating $f(x)$ are extremely different. The parametric approach can be qualified as a top-down approach: we first assume a particular form for $f(x)$ and then see whether we can successfully adjust the parameters with the data at hand. The non-parametric approach, however, can be qualified as a bottom-up approach: the data at hand determine the form that our estimate of $f(x)$ will take. Moreover, non-parametric models can be simpler in many respects. They are grounded on fewer assumptions about $f(x)$, and models like KNN regressions do not assume any distribution for the error term. They may also contain fewer adjustable parameters. In fact, as far as parametric simplicity goes, it does not get any better than the KNN model described above. Of course, the fact that non-parametric models rest on fewer assumptions than their parametric counterparts is an obvious advantage in certain scenarios. When we lack the necessary knowledge to construct a parametric model, a non-parametric option can be more than welcome.
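The following sketch reproduces the two fits just described with scikit-learn. The data-generating function, the sample size (100), and K = 10 come from the example above; the range of the x values and the seed are arbitrary choices of mine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(-10, 10, size=100)
y = 20 + x**2 / 6 + rng.normal(0, 1, size=100)   # Y = 20 + x^2/6 + eps
X = x.reshape(-1, 1)                              # scikit-learn expects a 2-D array

# Parametric, top-down: assume f(x) = b0 + b1*x and adjust the two parameters.
linear = LinearRegression().fit(X, y)

# Non-parametric, bottom-up: average the y-values of the 10 nearest neighbours.
knn10 = KNeighborsRegressor(n_neighbors=10).fit(X, y)

print("training MSE, linear  :", mean_squared_error(y, linear.predict(X)))
print("training MSE, KNN(10) :", mean_squared_error(y, knn10.predict(X)))
# The KNN fit tracks the curvature of f(x) and fits the data much better
# than the straight line, as Figures 1 and 2 illustrate.
```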
But it is not obvious how the parametric simplicity of non-parametric models is a desirable feature at all. If we look at some statements found in the philosophy of science literature, their parametric simplicity should matter a great deal, because it is supposed to prevent overfitting by making an appropriate trade-off between fit and simplicity:

Simplicity matters. A sufficiently simple hypothesis, formulated on the basis of a given body of data, will not drastically overfit the data. It does not contain too many parameters whose values have been set according to the data. Thus, a simple hypothesis that successfully accommodates a given body of data can be expected to make more accurate predictions about new data than a more complex theory that fits the data equally well (Hitchcock & Sober 2004, 22).

Model selection involves a trade-off between simplicity and fit for reasons that are now fairly well understood (Forster 2001, 83).

Perhaps the most interesting of the standard arguments in favor of simplicity is based upon the concept of "overfitting". The idea is that predicting the future by means of an equation with too many free parameters compared to the size of the sample is more likely to produce a prediction far from the true value. […] this argument is sound and compelling, so far as using an equation for predictive purposes is concerned (Kelly 2007, 113).

If we only consider our linear model, this makes a lot of sense. Overfitting is a phenomenon that occurs when we end up modeling the irreducible error in our data, such that it looks like we are estimating $f(x)$ quite well. However, if we were to make more observations, we would realise that our model is not good at all, because the irreducible error cannot be captured. Now, it is possible to model the irreducible error with a linear regression simply by adding independent variables to our model. In effect, this comes down to increasing the fit of our model by adding more adjustable parameters. Evidently, a good fit with the data is important. But we should not fit our data so well that we end up modeling the irreducible error term by adding too many adjustable parameters. Therefore, we may be tempted to conclude that there is an important trade-off between parametric simplicity and goodness of fit.

But no such trade-off is considered when we look at our KNN regression model. Its number of adjustable parameters is 1, and it is possible to severely overfit the data by setting the value of K to 1. Figure 3 offers an explicit visualisation of the phenomenon. The regression line passes through each data point, such that the fit of the model is perfect. But we would definitely be unable to perfectly predict any future data. In other words, the apparent perfection of our model is due to the fact that we are modeling the irreducible error: we are overfitting the data.

[Figure 3: the KNN regression fit with K = 1, passing through every observation.]

The real challenge when we are using a KNN regression model is to figure out the appropriate value for K such that we will not overfit the data. It makes no sense to try to figure out which KNN model strikes the right balance between goodness of fit and parametric simplicity. Therefore, parametric simplicity does not always matter in model selection. The trade-off between simplicity and fit is not a fundamental one in model selection. In fact, even if the trade-off makes sense in the parametric case, we might not even consider the linear models in the first place if our only goal is to make accurate predictions.
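The following sketch continues the same simulation and evaluates a straight-line fit, a KNN fit with K = 1, and a KNN fit with K = 10 on fresh observations. It illustrates both the overfitting pictured in Figure 3 (a perfect training fit that predicts new data poorly) and the predictive advantage of a well-chosen KNN model over the linear one. The seed and the size of the test sample are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)

def simulate(n):
    x = rng.uniform(-10, 10, size=n)
    y = 20 + x**2 / 6 + rng.normal(0, 1, size=n)
    return x.reshape(-1, 1), y

X_train, y_train = simulate(100)
X_test, y_test = simulate(1000)   # "future" observations

models = {
    "linear  ": LinearRegression(),
    "KNN K=1 ": KNeighborsRegressor(n_neighbors=1),
    "KNN K=10": KNeighborsRegressor(n_neighbors=10),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}  training MSE = {train_mse:6.2f}   test MSE = {test_mse:6.2f}")
# K=1 fits the training data perfectly (training MSE = 0) yet predicts new data
# worse than K=10; the straight line is the weakest predictor of the three.
```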
It is true that a good KNN regression model is much more difficult to interpret than a good linear regression model. For example, the estimating function given by a KNN regression model does not allow us to tell what happens to ($Y$) when we increase or decrease ($X$). In contrast, a good linear regression model will easily inform us about the expected increase in ($Y$) when ($x_j$) increases. But this is totally irrelevant if we only care about making good predictions. In that case, the fit/simplicity trade-off can be the last of our worries. Now, Christopher Hitchcock and Elliott Sober (2004) discuss the importance of parametric simplicity in a context where the only goal is to make accurate predictions. This makes their emphasis on the importance of a fit/simplicity trade-off all the more peculiar. A good KNN regression model can be a great predictive tool, and its parametric simplicity does not guard it at all against the sin of overfitting:

Attaining the goal of predictive accuracy requires that one guard against overfitting […], and giving due weight to simplicity is a means to that end (Hitchcock & Sober 2004, 14).

In sum, parametric simplicity does not play any role when we are selecting a KNN model with only 1 adjustable parameter. Such models can severely overfit the data even though they are extremely simple. Therefore, the trade-off between simplicity and fit is not fundamental: it does not make sense in every regression modeling scenario. Moreover, even if there is a family of models where such a trade-off can make sense, we might not even consider those models if our main goal is to obtain a model that makes accurate predictions. Non-parametric models with very few adjustable parameters, like KNN regression models, can provide some of the best predictive tools when they do not overfit the data.

2. The Bias/Variance Trade-off

There is, however, a fundamental trade-off that is made as we select any kind of regression model: the trade-off between the bias and the variance of the estimator $\hat{f}(x)$. In theory, the quality of the estimator can be assessed by calculating the mean squared error (MSE), given by the expression $E\big[(y_0 - \hat{f}(x_0))^2\big]$. The greater the MSE, the worse the model. (The MSE is not useful by itself; what is useful is to compare the MSEs of different models in order to make an appropriate choice.) Now, what is particularly interesting is that this expression can be broken down into three fundamental terms as follows:

$$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}(\hat{f}(x_0)) + \big[\mathrm{Bias}(\hat{f}(x_0))\big]^2 + \mathrm{Var}(\varepsilon)$$

The first term on the right side of the equality stands for the variance of our estimator. It gives us an idea of how much our estimator would change, given a new data set. Ideally, we want an estimator that is relatively constant across different independent data sets taken from the same population. The second term stands for the bias of our estimator. Ideally, we want an estimator that is equal to $f(x)$ on average. Finally, the third term stands for the variance of the irreducible error term.

If we consider regression models more specifically, we can say that the variance of the estimator provided by a linear regression model will increase as the number of free parameters increases. Likewise, the variance of an estimator provided by a KNN regression model will increase as the value of K decreases. This does not matter if the bias decreases in such a way that the MSE also decreases. But there always comes a point where the decrease in bias no longer compensates for the increase in variance, and the MSE, naturally, starts to increase as well.
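The decomposition can be checked numerically. The sketch below (my own illustration, not a computation from James et al.) refits a KNN regression on many simulated data sets drawn from the running example and estimates the variance and squared bias of $\hat{f}(x_0)$ at a single point $x_0$ for several values of K. The point $x_0$, the number of simulations, and the K values are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)

def f(x):
    return 20 + x**2 / 6

x0 = np.array([[2.0]])     # the point at which the estimator is assessed
n, n_sims = 100, 2000

for K in (1, 10, 50):
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(-10, 10, size=n)
        y = f(x) + rng.normal(0, 1, size=n)
        model = KNeighborsRegressor(n_neighbors=K).fit(x.reshape(-1, 1), y)
        preds[s] = model.predict(x0)[0]
    variance = preds.var()                      # how much f-hat(x0) varies across data sets
    bias_sq = (preds.mean() - f(x0[0, 0])) ** 2  # squared deviation from the true f(x0)
    # Expected test MSE at x0 = variance + bias^2 + Var(eps), with Var(eps) = 1.
    print(f"K={K:2d}  variance={variance:.3f}  bias^2={bias_sq:.3f}  "
          f"expected MSE ~ {variance + bias_sq + 1:.3f}")
```

In this set-up, a very small K gives a high-variance, low-bias estimator, while a very large K gives a low-variance, high-bias one; somewhere in between the two terms balance out and the expected MSE is lowest.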
This means that, regardless of the nature of our model (parametric or non-parametric), we always need to figure out an appropriate trade-off between the variance of our estimator and its bias. This is the fundamental trade-off (James et al. 2013, 35). It explains why parametric simplicity seems to be so important.

But in practice, we do not know $f(x)$. Hence, we cannot calculate the real MSE explicitly in order to find out whether we have made the appropriate trade-off. Luckily, there are ways to estimate the real MSE of our model. Regardless of the model that we are using, we can use a general technique called k-fold cross-validation (CV). A k-fold CV consists in randomly dividing our data set into k groups, or folds. We then construct our model with all but one of the folds and calculate the observed MSE on the data that were not used in the construction of the model. We repeat this procedure k times, each fold serving once as the held-out set. We therefore obtain k observed MSE scores, and the average of those scores gives us an estimate of the real MSE of our model:

$$CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i$$

If k equals the number of observations n, then we call this kind of cross-validation "leave-one-out cross-validation" (LOOCV). But there is also a bias/variance trade-off to consider when we are estimating the real MSE of a model. The greater k is, the smaller the bias of our estimator. LOOCV therefore seems optimal in that respect. However, the greater k is, the greater the variance of our estimator. Thus LOOCV is actually far from being optimal:

To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k=5 or k=10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance (James et al. 2013, 184).

Now, the first point that I wish to underscore in this section is that the apparent importance of parametric simplicity in certain contexts is reducible to the importance of a fundamental trade-off between the bias and the variance of our models. By itself, parametric simplicity is not a desirable feature of a model, as we saw when we considered KNN regression models with only 1 free parameter. The second point that I want to make concerns the relevance of taking parametric simplicity into consideration as we select our models.

Clearly, the idea that there is an important trade-off between simplicity and fit did not appear out of thin air. The Akaike information criterion (AIC), for example, explicitly relies on a trade-off between fit and simplicity:

Akaike (1973) showed that an unbiased estimate of the predictive accuracy of a model can be obtained by considering both its fit-to-data and its simplicity, as measured by the number of adjustable parameters it contains. […] Akaike's remarkable result is that (modulo certain assumptions): An unbiased estimate of the predictive accuracy of model M ≈ log[Pr(Data | L(M))] − k (Hitchcock & Sober 2004, 12).

Here, k refers to the number of adjustable parameters and L(M) refers to the model M with its adjustable parameters determined by maximum likelihood estimation. We can easily see that the estimate of predictive accuracy is penalised by an increase in k.
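As a rough illustration of how this estimate behaves (again my own sketch rather than anything computed by Hitchcock and Sober), the snippet below evaluates log[Pr(Data | L(M))] − k for a straight-line and a quadratic linear model fitted to the running example, assuming Gaussian errors so that maximum likelihood coincides with least squares; the 5-fold CV estimate of the MSE defined above is computed alongside it. The seed, the sample size, and the two candidate models are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
x = rng.uniform(-10, 10, size=100)
y = 20 + x**2 / 6 + rng.normal(0, 1, size=100)

def design(x, degree):
    # Polynomial design matrix without the intercept column (added by the model).
    return np.column_stack([x**d for d in range(1, degree + 1)])

def loglik_minus_k(X, y):
    model = LinearRegression().fit(X, y)
    resid = y - model.predict(X)
    n = len(y)
    sigma2 = np.mean(resid**2)                       # maximum-likelihood error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                               # slopes + intercept; counting the error
    return loglik - k                                # variance as well would not change the comparison

for degree in (1, 2):
    X = design(x, degree)
    aic_style = loglik_minus_k(X, y)
    cv_mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: log L - k = {aic_style:8.1f}   CV_5 MSE = {cv_mse:.2f}")
# Both estimates favour the quadratic model here, but only the CV score
# requires no count of adjustable parameters at all.
```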
Needless to say, this criterion cannot be applied in a context where we are using a KNN regression model. But it is often used when we are trying to choose a reasonable parametric regression model. In such a context, it can be shown to be an unbiased estimator of the predictive accuracy of our model (its real MSE). Nevertheless, there are alternative methods for evaluating the quality of our models, and some of them do not involve parametric simplicity at all. A selection method based on k-fold CV, for example, does not rely on parametric simplicity. Moreover, a k-fold CV can provide a more efficient selection criterion than the AIC. The AIC is actually asymptotically equivalent to LOOCV (Hitchcock & Sober 2004, 13). Consequently, the AIC suffers from the same defect. As we know, it is not enough to have an unbiased estimator of the predictive accuracy of our model. We also need an estimator that is not too variable, and LOOCV is very poor in that respect. Therefore, a k-fold CV where k is smaller than the number of observations n can be preferable.

Conclusion.

Overall, I have reached three important conclusions. The first is that there is no fundamental trade-off between parametric simplicity and goodness of fit: it does not make sense in every modeling scenario. This is especially true if our main goal is to make accurate predictions. The second is that the apparent trade-off between parametric simplicity and goodness of fit can be reduced to a trade-off between the bias and the variance of our estimator of $f(x)$. We can evaluate whether we have made the appropriate bias/variance trade-off by estimating the real MSE of our model. This leads me to my final conclusion, which is that it is never necessary to take into account the parametric simplicity of our models. In fact, model selection criteria that do not take parametric simplicity into consideration can outperform those that do.

References

Akaike, H. (1973), "Information Theory as an Extension of the Maximum Likelihood Principle", in B. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory (pp. 267-281). Budapest: Akademiai Kiado.

Forster, M. (2001), "The New Science of Simplicity", in A. Zellner, H. Keuzenkamp, and M. McAleer (eds.), Simplicity, Inference and Modeling (pp. 83-119). Cambridge: Cambridge University Press.

Forster, M. and Sober, E. (1994), "How to Tell When Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions", The British Journal for the Philosophy of Science, 45: 1-35.

Hitchcock, C. and Sober, E. (2004), "Prediction Versus Accommodation and the Risk of Overfitting", The British Journal for the Philosophy of Science, 55: 1-34.

James, G. et al. (2013), An Introduction to Statistical Learning with Applications in R, Dordrecht: Springer.

Kelly, K. T. (2007), "How Simplicity Helps You Find the Truth Without Pointing at it", in V. Harazinov, M. Friend, and N. Goethe (eds.), Philosophy of Mathematics and Induction (pp. 111-143). Dordrecht: Springer.