Applied Soft Computing 11 (2011) 3859–3869

The predictive accuracy of feed forward neural networks and multiple regression in the case of heteroscedastic data

Mukta Paliwal, Usha A. Kumar*
Shailesh J. Mehta School of Management, Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India
* Corresponding author. Tel.: +91 22 25767786; fax: +91 22 25722872. E-mail addresses: [email protected] (M. Paliwal), [email protected] (U.A. Kumar).

Article history: Received 17 May 2010; Received in revised form 2 December 2010; Accepted 24 January 2011; Available online 3 March 2011.
Keywords: Monte Carlo simulation; Heteroscedasticity; Prediction; Regression; Neural networks.
doi:10.1016/j.asoc.2011.01.043

Abstract

This paper compares the performance of neural networks and regression analysis when the data deviate from the homoscedasticity assumption of regression. To carry out this comparison, data sets are simulated that vary systematically on dimensions such as sample size, noise level and number of independent variables. The analysis is performed using appropriate experimental designs and the results are presented. Prediction intervals for both methods in the case of nonconstant error variance are also calculated and compared graphically. Two real life data sets that are heteroscedastic have been analyzed, and the findings are in line with the results obtained from the experiments using simulated data sets. © 2011 Elsevier B.V. All rights reserved.

1. Introduction

Neural networks are mathematical tools that resulted from attempts to model the capabilities of the human brain. The literature [1–4] points out the potential of neural networks for prediction problems, which has led to a number of studies comparing the performance of neural networks and regression analysis. However, there are very few studies addressing the consequences of deviation from the underlying assumptions of the regression technique on the comparative performance of these two techniques. Pickard et al. [5] used simulation to create data sets with a known underlying model and with non-normal characteristics such as skewness, unstable variance and outliers that are frequently found in software cost modeling problems. Multiple regression and classification and regression trees were compared, and the viability of simulation for allowing such comparisons under controlled conditions was demonstrated. Standard multiple regression techniques were found to be the best when the data exhibit moderate non-normality; under more extreme conditions such as heteroscedasticity, nonparametric techniques were found to have the best performance. Gaudart et al. [6] compared the performance of the multilayer perceptron and linear regression for epidemiological data when the data deviate from some of the underlying assumptions of regression. One of the functional forms considered in that study pertains to heteroscedasticity in the error variance. For heteroscedastic errors, the authors concluded that predictions from both models are of the same size and order. However, the findings are limited to the characteristics of the functional form considered in that study. The available literature is sparse, particularly on comparing the performance of these two techniques when the data are heteroscedastic.
Thus there is a need for an extensive and systematic study that takes into account various patterns of heteroscedasticity and a large number of data characteristics while comparing the performance of the two techniques. The aim of the present study is to systematically compare the performance of the neural network and regression techniques when the data deviate from the assumption of homoscedasticity. This comparison is carried out by simulating data sets for different patterns of heteroscedasticity, possessing data characteristics that vary in sample size, amount of noise and number of independent variables. Appropriate prediction intervals for both methods in the presence of heteroscedasticity are also obtained and compared graphically. To gain insight into how the comparative performance of the two techniques under heteroscedasticity translates to real life situations, two data sets have also been analyzed and the results are discussed.

The next section describes the regression model with nonconstant error variances. The weighted least squares method, a remedial measure to deal with unequal error variances, is also discussed in this section. Section 3 presents the construction of prediction intervals for the regression and neural network techniques when the data are heteroscedastic. The experimental design, data generation procedure and performance evaluation criterion are discussed in Section 4. Section 5 presents the data analyses, and the results from the simulation study are presented in Section 6. Section 7 presents the analysis of two real life data sets, and the last section concludes this paper.

2. The model

Consider the multiple linear regression model given by the equation

Y_i = β_0 + β_1 X_1i + ... + β_j X_ji + ... + β_p X_pi + ε_i,  for i = 1, 2, ..., n and j = 1, 2, ..., p   (1)

where Y_i is the dependent variable, the X_ji are independent variables, the β_j are unknown parameters and ε_i is the error term. In order for the estimates of the β's to have desirable properties, we need the following assumptions, called the Gauss–Markov conditions:

E(ε_i) = 0;  E(ε_i^2) = σ^2;  E(ε_i ε_k) = 0 when i ≠ k,  for all i, k = 1, ..., n.

These conditions, if they hold, assure that the ordinary least squares (OLS) estimates are the minimum variance estimates among the class of linear unbiased estimators. Violation of the second condition is called heteroscedasticity, in which the error term of a regression model does not have constant variance over all the observations, i.e. Var(ε_i) = σ_i^2. OLS estimates of the regression coefficients under heteroscedasticity are still unbiased and consistent, but they are no longer the best linear unbiased estimators. Heteroscedasticity causes the variances of the parameter estimates to be large and thus can substantially affect R^2, the estimate of σ^2, and various inferential procedures. In this study, heteroscedasticity in the error term is introduced by defining ε_i as h_i u_i, where the u_i (random errors) are independent N(0, σ^2) and h_i is some function of one of the independent variables; thus Var(ε_i) = σ_i^2 = h_i^2 σ^2. It is common for heteroscedastic error variance to increase or decrease as the expectation of Y grows larger, or there may be a systematic relationship between the error variance and a particular X. In this situation, the nonconstant error variances σ_i^2 = h_i^2 σ^2 are known up to a proportionality constant, where σ^2 is the proportionality constant. The weighted least squares estimation procedure is one of the methods that can be used to correct this form of heteroscedasticity and is the topic of discussion in the next subsection.

2.1. Weighted least squares

Weighted least squares (WLS) regression is useful for estimating the regression parameters when heteroscedasticity is present in the data. Here, we assume that the error variances are known up to a proportionality constant, i.e. Var(ε_i) = σ_i^2 = h_i^2 σ^2. Estimation of parameters by the method of weighted least squares is closely related to parameter estimation by ordinary least squares. The weighted least squares criterion includes an additional weight, w_i (= 1/h_i^2), and the estimates of the parameters are obtained by minimizing

Q_w = Σ_{i=1}^{n} w_i (Y_i − β_0 − β_1 X_1i − β_2 X_2i − ... − β_p X_pi)^2   (2)

Since the weight w_i is inversely related to the variance σ_i^2, it reflects the amount of information contained in the observation Y_i. Thus, an observation Y_i that has a large variance receives less weight than another observation that has a smaller variance. The weighted least squares estimators of the regression coefficients for model (1) can be easily expressed in matrix terms. The following matrices are defined: Y is the n×1 vector of responses (Y_1, Y_2, ..., Y_n)', and X is the n×(p+1) design matrix whose first column is a column of ones and whose remaining columns contain the values X_i1, X_i2, ..., X_ip of the independent variables   (3). Let W be the n×n diagonal matrix containing the weights w_i,

W = diag(w_1, w_2, ..., w_n)   (4)

The normal equations can then be expressed as

(X'WX) b_w = X'WY   (5)

and the weighted least squares estimators of the regression coefficients are

b_w = (X'WX)^{-1} X'WY   (6)

where b_w is the (p+1)×1 vector of estimated regression coefficients obtained by weighted least squares. The variance–covariance matrix of the weighted least squares estimated regression coefficients is

σ^2{b_w} = σ^2 (X'WX)^{-1}   (7)

This matrix is not known, since the proportionality constant σ^2 is not known. However, it can be estimated as

s^2{b_w} = s_w^2 (X'WX)^{-1}   (8)

where s_w^2 is based on the weighted squared residuals and can be calculated as Σ w_i (Y_i − Ŷ_i)^2 / (n − p − 1). Weighted least squares can also be viewed as ordinary least squares on the transformed model

Y_w = X_w β + ε_w   (9)

where Y_w = W^{1/2} Y, X_w = W^{1/2} X and ε_w = W^{1/2} ε, with W^{1/2} being a diagonal matrix containing the square roots of the weights w_i. Here

b_w = (X'WX)^{-1} X'WY = (X_w' X_w)^{-1} X_w' Y_w  and  σ^2{b_w} = σ^2 (X'WX)^{-1} = σ^2 (X_w' X_w)^{-1}.

It can also be observed that

σ^2{ε_w} = W^{1/2} σ^2{ε} W^{1/2} = W^{1/2} σ^2 W^{-1} W^{1/2} = σ^2 I,

satisfying the constant variance assumption.

2.2. Prediction intervals

Prediction intervals for predicting a future observation, Y_0, from the regression model and from the neural network model in the case of heteroscedastic error variance are discussed in this section. In the case of heteroscedastic error variance, a modified form of the standard prediction interval needs to be used, as the usual prediction intervals will tend to be too wide for small values of the error variance and too narrow for large values of the error variance. Incorporating the nonconstant error variance component h_i in the regression technique, an appropriate 100(1 − α)% prediction interval for Y_0 is given by

Ŷ_0 ± t_{α/2}^{(n−p−1)} s_w √( h_i^2 + X_0 (X_w' X_w)^{-1} X_0' )   (10)

where (X_w' X_w)^{-1} and s_w are computed from the transformed data as defined in model (9). Here X_0 = [1 x_01 x_02 ... x_0p] is a row vector containing the values of the independent variables for the new observation.
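As a concrete illustration of Eqs. (6)–(10), the following is a minimal numerical sketch of weighted least squares estimation and of the heteroscedastic prediction interval. The toy data, the choice of h_i and all variable names are hypothetical and only mirror the notation above; this is not the authors' code.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, p = 200, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
    beta = np.array([1.0, 2.0, -1.0])
    h = 1.0 + np.abs(X[:, 2])                 # h_i: illustrative variance function of one regressor
    y = X @ beta + h * rng.normal(size=n)     # eps_i = h_i * u_i, u_i ~ N(0, sigma^2)

    w = 1.0 / h**2                            # weights w_i = 1 / h_i^2
    Xw = np.sqrt(w)[:, None] * X              # transformed model (9)
    yw = np.sqrt(w) * y
    XtWX_inv = np.linalg.inv(Xw.T @ Xw)
    b_w = XtWX_inv @ Xw.T @ yw                # WLS estimates, Eq. (6)

    resid = y - X @ b_w
    s2_w = np.sum(w * resid**2) / (n - p - 1) # weighted residual variance s_w^2

    # 95% prediction interval for a new observation x0, Eq. (10)
    x0 = np.array([1.0, 0.5, 1.2])
    h0 = 1.0 + abs(x0[2])
    t = stats.t.ppf(0.975, n - p - 1)
    half = t * np.sqrt(s2_w) * np.sqrt(h0**2 + x0 @ XtWX_inv @ x0)
    y0_hat = x0 @ b_w
    print(y0_hat - half, y0_hat + half)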
For the neural network, the prediction intervals in the case of nonconstant error variance are obtained using the bootstrap method given by Heskes [7]. In this technique, a number of bootstrap samples are created by repeatedly sampling with replacement from the original data set, and a network is trained on each of these samples individually. If B bootstrap samples are created, a committee of B networks of the same architecture is formed. The output of the committee for an input vector X is the average of the B network outputs. Suppose we have a set of observations D_n = {(X_i, Y_i), i = 1, 2, ..., n} that satisfy the neural model

Y_i = g(X_i) + ε_i   (11)

where g(X_i) is the output of the neural network and ε_i is the error. Sampling B times from the original data set, the predicted output of the committee is the average of the B network outputs,

Ŷ_i = (1/B) Σ_{b=1}^{B} ĝ(X_i)^(b)   (12)

For the observation (X_0, Y_0), a 100(1 − α)% prediction interval can be calculated as

Ŷ_0 ± t_{α/2} σ̂_p(X_0)   (13)

where σ_p^2 is given by the sum of the model uncertainty variance (σ_m^2) and the data error variance (σ_e^2), which is nonconstant. From the bootstrap method, the model uncertainty variance σ_m^2 is estimated as

σ̂_m^2 = (1/(B − 1)) Σ_{b=1}^{B} ( ĝ(X_i)^(b) − Ŷ_i )^2   (14)

and, for estimating the error variance σ_e^2, which is a function of the independent variable X_2i, a supplementary network is trained using the squared residuals (Y_i − Ŷ_i)^2 as output values and X_2i as input values.
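The bootstrap committee of Eqs. (11)–(14) can be sketched as follows. This is only a sketch: make_net stands for any user-supplied factory returning a regressor with fit/predict (for instance a small neural network), var_e0 stands for the noise variance at X_0 estimated by the supplementary model, and a normal quantile is used in place of the t quantile.

    import numpy as np
    from scipy.stats import norm

    def committee_predict(make_net, X_train, y_train, X0, B=25, seed=0):
        # Eqs. (11)-(12): train B networks on bootstrap resamples, collect their outputs.
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(B):
            idx = rng.integers(0, len(y_train), size=len(y_train))  # resample with replacement
            net = make_net().fit(X_train[idx], y_train[idx])
            preds.append(net.predict(X0))
        return np.stack(preds)                                       # shape (B, n0)

    def committee_interval(preds, var_e0, alpha=0.05):
        # Eqs. (13)-(14): committee mean, model variance plus local noise variance.
        y_hat = preds.mean(axis=0)                                   # Eq. (12)
        var_m = preds.var(axis=0, ddof=1)                            # Eq. (14)
        z = norm.ppf(1 - alpha / 2)                                  # stand-in for t quantile
        half = z * np.sqrt(var_m + var_e0)
        return y_hat, y_hat - half, y_hat + half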
3. Methodology

The Monte Carlo method is used to simulate data sets from a linear functional model of the form (1) in such a way that they deviate from the homoscedasticity assumption of the multiple regression model. Heteroscedasticity in the error term is introduced by defining ε_i as h_i u_i, where the u_i are independent N(0, σ^2) and h_i is some function of X_2. The three patterns of h_i considered in this study are:

(i) h_i = √|X_2i|,  (ii) h_i = |X_2i|,  (iii) h_i = 1 + 2/(X_2i + 1).

These variance configurations are similar to those considered by Glejser [8] and Wilcox [9]. The other factors considered in this study, the details of the data generation and the performance evaluation criterion are explained in the subsequent subsections.

3.1. Experimental design

The impact of nonconstant error variance on the performance of regression analysis and neural networks may depend on factors like sample size, number of independent variables, and the amount of variation in σ_i^2. Thus, the experiment considered is a 2 × 3 × 3 × 3 repeated-measures factorial design whose four factors are the method of analysis, the number of independent variables, the sample size and the noise in the data. The levels considered for each of these factors are as follows.

Method of analysis: Multiple linear regression and feed forward neural network. To compare the performance of the two techniques, the regression coefficients are estimated using weighted least squares (WLS) and the neural network is trained using the Levenberg–Marquardt algorithm [10,11].

Number of variables (p): The three levels considered for the number of explanatory variables are 2, 4 and 10.

Sample size (n): For choosing the sample size, a rule of thumb for the subject-to-variable ratio has been used, as given in Sawyer [12]. Three levels of sample size, namely small, medium and large, are chosen corresponding to the three levels of p = 2, 4 and 10. The specific values of n for different values of p are given in Table 1.

Table 1. Sample sizes for different values of p.

Sample size (n)   p = 2   p = 4   p = 10
Small             18      30      65
Medium            150     255     560
Large             500     840     1840

Fig. 1. Residual plots for different forms of heteroscedasticity.

Error variance (σ^2): This factor determines the variation in the random (Gaussian) noise present in the generated data. This variation is measured in terms of the signal-to-noise ratio (SNR), defined as β'β/σ^2. The three values of σ^2 considered in this study are 1, 16 and 100. The value of β'β used is 100, resulting in three values of SNR that correspond to the low, medium and high noise levels, respectively.

3.2. Data generation

Simulation allows comparison of analytical techniques even when the assumptions of the techniques under study are violated, by generating data sets that have the required characteristics. Data matrices in which the error variances are nonconstant are generated for each of the 27 experimental conditions (3 levels of number of independent variables, 3 levels of sample size, 3 levels of random noise). Thirty replications are considered in this experiment, and hence a total of 30 data matrices were generated for each of the 27 design conditions. Three functional forms are considered for heteroscedasticity, as described earlier in this section, resulting in three different experiments. The independent variables are assumed to be normally distributed with mean 0 and standard deviation 1 and to be independent of each other. For a selected value of β'β, the regression coefficients are derived via normalization of p uniform random numbers n_i, i = 1, 2, ..., p, generated from U[−10, 10]. The regression coefficients are obtained as given in Delaney and Chatterjee [13]:

β_i = ( √(β'β) / r_n ) n_i   (15)

where r_n^2 = Σ n_i^2. The value of β'β used in this study is 100. The numbers of independent variables and the sample sizes are chosen to represent a variety of conditions found in past research, as well as to explore the range of situations that arise in various applications. The data thus generated facilitate comparison between the neural network and regression techniques when the assumption of homoscedasticity does not hold.

3.3. Performance evaluation criterion

The mean square error, an unbiased estimate of σ^2 for a linear regression equation of the form (1), is given by

s^2 = (1/(n − p − 1)) Σ_{i=1}^{n} (Y_i − Ŷ_i)^2   (16)

The estimate s^2 is a measure of the variation in the prediction error. But this mean square error (MSE) will tend to understate the inherent variability in making future predictions using the trained model. The actual prediction capability of the trained model can be measured by the mean squared prediction error, which is obtained by predicting each observation in the test data using the trained model and then calculating the mean of the squared prediction errors,

MSPR = (1/n*) Σ_{i=1}^{n*} (Y_i − Ŷ_i)^2   (17)

where n* is the number of observations in the test data set. If MSPR is fairly close to the MSE of the trained model, then the MSE can also be used as a measure of the predictive capability. If the MSPR is much larger than the MSE, then MSPR is the appropriate indicator of how well the trained model will predict in the future. In order to have the magnitude of the error in the same units as the observations, the square root of MSPR, denoted RMSP, is used for the test data, and the root mean square error (RMSE) is the error measure used for the training data.
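A compact sketch of the data generation scheme of Sections 3.2–3.3 is given below, assuming one of the three heteroscedasticity patterns is selected. The function names are illustrative and not taken from the original simulation code, which was written in SAS [19]; for the third pattern, |X_2i| is used here purely to keep h_i positive with standard normal regressors.

    import numpy as np

    def generate_dataset(n, p, sigma2, pattern=1, seed=0):
        # Simulate heteroscedastic data: Y = X beta + h(X2) * u, with u ~ N(0, sigma2).
        rng = np.random.default_rng(seed)
        raw = rng.uniform(-10, 10, size=p)
        beta = np.sqrt(100.0) * raw / np.linalg.norm(raw)   # Eq. (15): beta'beta = 100
        X = rng.normal(size=(n, p))                         # independent N(0, 1) regressors
        x2 = X[:, 1]
        if pattern == 1:
            h = np.sqrt(np.abs(x2))
        elif pattern == 2:
            h = np.abs(x2)
        else:
            h = 1.0 + 2.0 / (np.abs(x2) + 1.0)              # |x2| keeps h positive (illustrative)
        y = X @ beta + h * rng.normal(scale=np.sqrt(sigma2), size=n)
        return X, y, beta

    def rmsp(y_true, y_pred):
        # Root mean squared prediction error, square root of Eq. (17).
        return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))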
4. Data analysis

To start the analyses, the data sets for each of the experimental conditions were generated with double the sample sizes mentioned in Table 1. These data sets were then divided randomly into two sets. The first data set is used to build the model and is called the training set; the second, independent data set, which is not used for training the model, is referred to as the holdout or test sample. This holdout sample is used for validating the trained models from both techniques and comparing their predictive ability. The entire process has been replicated 30 times for each of the experimental conditions, and the resulting data sets are analyzed using both techniques. In designed experiments, replication allows an estimate of the prediction error to be obtained and also yields precise estimates of the effects of the various factors considered in the study.

Regression analysis is performed using a modified form of the OLS technique in which the coefficients are estimated by the weighted least squares procedure. To perform the neural network analysis, a three layer feed forward network is considered and the Levenberg–Marquardt training algorithm is used to train the network. The input layer contains p nodes, depending on the experimental condition, and the output layer contains one node corresponding to the single response variable. The hyperbolic tangent activation function is used at the hidden layer and the identity activation function is used at the output layer. One of the important architectural decisions involves the determination of the number of nodes in the hidden layer. The number of nodes in the hidden layer depends on the complexity of the problem at hand [14] and needs to be determined empirically to get the best performance for the data being considered. Further, the goal of training neural networks is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process that generalizes well to new data. In practical applications of a feed forward neural network, if the network is over-fit to the noise in the training data, it memorizes the training data and gives poor generalization. Network generalization ability can be improved by regularization. One of the simplest forms of regularizer, namely weight decay, has been used in the present study; details are discussed in Bishop [22]. It has been found empirically that a regularizer of this form can lead to significant improvements in network generalization [15]. There are no theoretical guidelines for determining the appropriate values of the weight decay parameter and the number of nodes in the hidden layer. In this study, the optimum values of the weight decay parameter and the number of nodes in the hidden layer have been obtained by trial and error. Three nodes in the hidden layer were found to be optimal for all the experimental conditions except the experiment pertaining to the small sample size and p = 2. For this experiment, two nodes were selected in the hidden layer due to the limitation on the number of sample points.
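For concreteness, the network configuration described above could be specified along the following lines. This is a sketch using scikit-learn, whose MLPRegressor does not offer the Levenberg–Marquardt optimizer, so the L-BFGS solver is used as a stand-in; alpha plays the role of the weight decay parameter (set here to the 0.2 value adopted below).

    from sklearn.neural_network import MLPRegressor

    # Three-layer feed forward network: p inputs, 3 tanh hidden nodes, identity output.
    net = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                       solver="lbfgs", alpha=0.2, max_iter=5000, random_state=0)

    # net.fit(X_train, y_train) and net.predict(X_test) would then be used to
    # compute RMSE on the training set and RMSP on the holdout set.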
The weight decay parameter values in the range 0.1–0.4 were found to be optimum for all the experimental conditions. As the prediction error was not very sensitive to any value of the weight decay parameter in this range, a value of 0.2 is chosen for all the experiments. The RMSE is calculated for the training data set and the RMSP for the test data set for each of the 30 replications. Analysis of variance for the four-factor design with repeated measures on the last factor (method of analysis) is used to analyze whether significant differences exist in the performance of the two techniques with respect to the design factors, namely number of independent variables, sample size and noise level, for the three different forms of heteroscedasticity. Multiple comparison tests are performed to see at which level of one factor a difference exists in the other factor, causing the interaction effect to be significant. Appropriate prediction intervals are obtained for the test data set using the regression and neural network techniques when the error variances are heteroscedastic. In order to further compare the performances, the average CPU time taken by both techniques is also presented in Section 5.

5. Results

Appropriate error measures are computed for WLS regression and the neural network for all 30 replications, for the training and test data sets, as described in the previous section. Three functional forms are considered to generate heteroscedasticity in the data, and the comparison of the two methods is carried out for each of these cases separately. Residual plots are used to detect systematic relationships that may exist between the error variance and any particular independent variable. Fig. 1(a)–(c) shows the residual plots against the absolute value of X_2 for the three forms of heteroscedasticity for one of the design conditions. This design pertains to data with a large sample size, high noise and two independent variables. The means and standard deviations of the RMSE and RMSP for the two techniques, for each of the three cases of heteroscedasticity, for the training and test data sets are presented in Tables 2–4, respectively.

Table 2. Mean and standard deviation of errors for training and test data for h_i = √|X_2i|.
Table 3. Mean and standard deviation of errors for training and test data for h_i = |X_2i|.
Table 4. Mean and standard deviation of errors for training and test data for h_i = 1 + 2/(X_2i + 1).

From these tables, it can be observed that the mean errors of the neural network for the training data sets are generally smaller than the mean errors of the weighted regression analysis. The differences are larger for the small sample size and become smaller for the large sample size. But for the test data sets, the mean error values of weighted least squares regression are less than those of the neural network for most of the design conditions considered in this study. This shows the problem of overfitting in the neural network in spite of the presence of a regularization parameter in training the model. The results of regression analysis do not show such a problem, and its performance is consistent on both the training and test data sets. It is further analyzed whether these differences in the mean error values of the two techniques for the test data sets are statistically significant.

A full factorial analysis of variance with repeated measures is conducted to investigate whether significant differences exist in the performance of the two techniques for all the design factors. The model under which the usual F-tests in a repeated-measures factorial experiment are valid assumes homogeneity of the covariance matrices across the levels of the between-subjects factors (compound symmetry) and equality of the variances of all pairwise differences between variables (sphericity). For this experiment, the sphericity test is not required, as there are only two levels of the within-subject factor (method of analysis). Box's M test for homogeneity of the covariance matrices is carried out and turns out to be significant. For a balanced design (equal group sizes), the empirical literature [16,17] recommends the use of univariate F-tests with adjusted degrees of freedom to carry out the analysis. The results from the univariate tests for between- and within-subject effects for each of the three cases of heteroscedasticity are summarized in Table 5 for the test data sets.

Table 5. Repeated measures analysis of variance for test data, for the three forms of heteroscedasticity. Note: + indicates a significant effect at the 1% level of significance; a blank cell indicates that the corresponding factor is not significant.

Multiple comparison tests are performed for those interaction effects that turned out to be significant, in order to see the level at which the differences are present in the other factors. To carry out multiple comparisons in this repeated-measures factorial design, F-tests with appropriate numerator and denominator are used, as explained in Winer [18]. Results from the multiple comparison tests suggest that the performance of regression is generally better than that of the neural network for the small sample size, irrespective of the amount of noise and the number of independent variables. However, when the number of independent variables is two, the performance of the two techniques is almost the same for the second form of heteroscedasticity at the low and medium noise levels. In the case of the medium sample size, regression is seen to perform better when the noise level is high. However, with a decrease in the noise level, either the performance of both techniques is almost the same or the neural network is seen to perform better in some of the cases. For the case of the large sample size, the performance of the two techniques is found to be nearly equal for almost all data conditions. Only for the second form of heteroscedasticity does the performance of the neural network become better for medium and large samples at the low noise level. For all other experimental conditions, no statistically significant difference is found in the performance of the two techniques.

Appropriate prediction intervals are obtained for the test data for the three forms of heteroscedasticity, using both the regression and neural network analyses, for each of the experimental designs considered. The prediction intervals corresponding to all levels of noise for the case of p = 2 and a large sample size are shown as scatter plots in Fig. 2 for the first form of heteroscedasticity.

Fig. 2. Scatter plots of prediction intervals of regression and NN for the case of two independent variables and large sample size at various levels of noise for h_i = √|X_2i|.

It can be seen from these plots that the range of the prediction intervals of regression is generally smaller than the corresponding range of the prediction intervals obtained from the neural network. Further, the prediction intervals become narrower as we move from the high noise level to the low noise level. Similar observations can be made from the prediction intervals pertaining to the other numbers of independent variables, namely 4 and 10, and to the other two forms of heteroscedasticity, and hence the corresponding scatter plots are not shown here.
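The interval-length comparison reported next (Table 6) reduces to counting, for each test observation, whether the regression interval is shorter than the neural network interval. A minimal sketch of that count, assuming the lower and upper bounds from both methods are already available as arrays (names are illustrative):

    import numpy as np

    def pct_regression_shorter(reg_lo, reg_hi, nn_lo, nn_hi):
        # Percentage of observations whose regression prediction interval is
        # shorter than the corresponding neural network interval.
        reg_len = np.asarray(reg_hi) - np.asarray(reg_lo)
        nn_len = np.asarray(nn_hi) - np.asarray(nn_lo)
        return 100.0 * np.mean(reg_len < nn_len)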
We have further analyzed the lengths of the prediction intervals of both methods for all the experimental conditions, and the results are summarized in Table 6. This table gives the percentage of times the length of the prediction interval given by regression is less than the corresponding length given by the neural network; a typical value such as 96.29 is the percentage of observations for which the regression interval is the shorter one.

Table 6. Comparison of lengths of prediction intervals from regression and neural networks, by form of heteroscedasticity, sample size, number of variables and noise level.

For all three forms of heteroscedasticity, the prediction intervals pertaining to the small sample size are comparatively wider for the neural network for almost all levels of noise and numbers of independent variables. However, the gaps between the prediction intervals tend to decrease with an increase in sample size.

The CPU time taken by the two techniques for one of the experimental conditions is shown in Table 7. The table reports the average CPU time over 10 runs for the first form of heteroscedasticity, with the number of variables being 10, the noise being high and the sample size at all three levels. The entire data analysis is performed in SAS software [19] on a P-IV machine, and thus this measure can be compared for the two techniques.

Table 7. Average CPU time for the two techniques.

Sample size   Reg     NN
Small         0.022   0.248
Medium        0.038   1.062
Large         0.055   3.25

It can be observed from this table that the time taken by regression analysis is much less than that of the neural network model, irrespective of the sample size. The neural network, being trained using an iterative algorithm, takes more time to learn and converge as compared to regression analysis. Further, the determination of various parameters associated with neural networks, such as the number of hidden layers and the number of nodes in the hidden layer, is not straightforward, and finding the optimal configuration of a neural network is again a time consuming process. In the next section, two real life data sets are analyzed and the results are discussed.

6. Illustration

The predictive performance of the neural network and regression is compared on two real life data sets. The first data set pertains to a study of the diastolic blood pressure (Y) and age (X) of 54 healthy adult women 20–60 years old [20]. A simple scatter plot of diastolic blood pressure against age, shown in Fig. 3(a), indicates the presence of heteroscedasticity, as the spread of diastolic blood pressure seems to increase with age.

Fig. 3. Various plots pertaining to the first example.

OLS regression is performed and the residuals are analyzed to check whether all the assumptions of regression hold. The fanning-out pattern in the residual plot against age, given in Fig. 3(b), indicates the error variances to be a function of the independent variable, age. Weighted least squares regression is used to remedy the nonconstant error variance problem by choosing the weights w_i as 1/x_i^2. The residuals obtained from the WLS regression are plotted against x in Fig. 3(c), which appears to be free from the presence of heteroscedasticity. The resulting weighted regression model is

Ŷ = 55.07 + 0.61(x)   (18)
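Assuming the age and blood pressure measurements are available as arrays, the weighted fit described above could be reproduced along the following lines (a sketch using statsmodels; the loader and variable names are hypothetical).

    import statsmodels.api as sm

    # x, y = load_blood_pressure_data()        # hypothetical loader for the 54 observations

    def wls_blood_pressure(x, y):
        X = sm.add_constant(x)                 # intercept plus age
        weights = 1.0 / x**2                   # w_i = 1 / x_i^2, as in the text
        fit = sm.WLS(y, X, weights=weights).fit()
        return fit.params                      # intercept and slope of Eq. (18)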
Further, this data set has also been analyzed using a feed forward neural network model. A three layer network architecture with one node in the hidden layer and a weight decay parameter value of 0.05 was chosen and trained using the Levenberg–Marquardt algorithm. As the sample size is small, the leave-one-out cross validation method has been used to validate the performance of both techniques. The model is trained 54 times (the sample size) separately, using both techniques, on all the data except one observation, and the trained model is used to predict that held-out observation. The average of the prediction errors over the 54 runs is used to assess the predictive performance of the two techniques. The values of this average for the regression model and the neural network model are 6.55 (5.02) and 6.71 (5.31), respectively; the numbers in parentheses are the standard deviations of the errors over the 54 runs. It can be seen that the error from the regression analysis is less than that from the neural network model, and that the results are more stable than those of the neural network technique. The characteristics of this data set correspond to somewhere between a small and medium sample size, with the noise level being high and the number of independent variables being low. The finding from this real life data set agrees with the results obtained from the simulation experiment for the corresponding characteristics of the data.
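A leave-one-out loop of the kind described above might look as follows. This is a sketch, not the authors' code: fit_model stands for either the weighted regression or the network (anything returning a fitted object with a predict method), and the absolute error is used as the per-run prediction error.

    import numpy as np

    def loocv_errors(X, y, fit_model):
        # Train on all but one observation, predict the held-out point, repeat.
        errors = []
        for i in range(len(y)):
            mask = np.arange(len(y)) != i
            model = fit_model(X[mask], y[mask])
            errors.append(abs(y[i] - model.predict(X[i:i + 1])[0]))
        return float(np.mean(errors)), float(np.std(errors))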
Formal test procedures like Glesjer and Park also provided sufficient evidence of heteroscedasticity. As the error variances i2 are unknown, weights were estimated using iteratively reweighted least squares method [20]. However this does not seem to correct the problem of nonconstant error variance as can be seen from the residual plot shown in Fig. 4(d). A log transformation on both the sides of regression equation was tried and this appears to reasonably correct the situation as shown in Fig. 4(e). Fig. 4(f) shows the qq-plot of the residuals of transformed observations indicating the distribution of error to be normal. The resulting regression model is log(MPG) = 10.11 − 0.36 log(HP) − 0.689 log(WT) (19) This model seems to reasonably satisfy the assumptions of regression. A three layer feed forward neural network model with one node in the hidden layer is trained optimally using Levenberg–Marquardt algorithm and weight decay regularizer is used to avoid over fitting. To validate the trained model, 10-fold cross validation method has been performed. This would help in testing robustness of both the techniques with respect to sampling fluctuations. For performing 10-fold cross validation, the original data is split into ten mutually exclusive subsets of nearly equal size where nine sub-samples are used for training and the remaining one is used for testing purpose. This process is repeated ten times and the average of these ten runs is used in estimating the generalization error. The average RMSP for test samples for regression and neural network models are 3.99(0.43) and 3.89(0.44), respectively. The values in parenthesis are standard deviations over 10 runs. The RMSP values indicate similar performances of both the techniques M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869 for this data set. This also agrees with our finding from the simulation experiment corresponding to medium to large sample size for low level of noise and number of independent variables being two. 7. Conclusion In this study, simulation experiments are conducted to assess the consequences of deviation from the underlying assumption of homoscedasticity on the comparative performance of regression analysis and neural networks. The predictive performance of these two techniques is compared using simulated data sets having data characteristics that vary in sample size, number of independent variables and amount of noise present in the data. Models are validated by considering independent data sets that were not used in training the models. Results of this study show that the performance of regression analysis is generally better than neural network when the size of sample is small irrespective of the amount of noise and number of independent variables in the data. This gives further evidence for neural networks not being able to learn optimally for small samples. In case of medium and large sample sizes, performance of regression technique is better or equivalent to neural network for high level of noise. However with the decrease in the noise level, neural network seems to perform better for some of the design conditions. This finding contradicts the claim of literature where neural network is said to be a robust technique with respect to noise in the data. For the second form of heteroscedasticity, some of the results are not consistent with the findings mentioned above. 
This could be due to the extent of heteroscedasticity being greater for this function, as the rate of increase in the error variances is comparatively high with respect to the increase in the values of the independent variable. To make the results mentioned above easier to comprehend, the findings are summarized in Table 8. In this table, R represents the fact that regression performs better, E stands for the performance being the same for both techniques, and N stands for better performance of the neural network.

Table 8. Summary of results.

           Sample size
Noise      Small    Medium    Large
High       R        R         E
Med        R        E         E
Low        R        N/E       N/E

Thus it can be observed that the predictive performance of both techniques is affected by characteristics of the data such as sample size, amount of noise and functional form of heteroscedasticity, though not greatly by the number of independent variables. Analysis of the prediction intervals of the two techniques for all three forms of heteroscedasticity reveals the prediction intervals obtained from regression to be smaller for the small sample size, for almost all levels of noise and numbers of independent variables. For medium and large sample sizes, the percentage of times the length of the prediction interval given by regression is less than that of the neural network decreases with an increase in sample size. These observations tend to support the results obtained from the analysis of the predictive error values. Two real life data sets exhibiting heteroscedasticity have also been analyzed, and the findings agree with the results obtained from the simulation experiment for the corresponding data characteristics.

In this study, various combinations of data characteristics are identified where one method would perform better than the other in terms of its predictive ability. It is further observed that regression analysis is much faster, as its running time is comparatively small. Being an iterative procedure, neural networks take more time to converge, and additional effort and time are required to optimize the architectural parameters. Regression analysis remains an attractive option because of its established methodology and its ability to interpret the explanatory variables. However, neural networks can serve as an alternative model when the patterns of heteroscedasticity are not known a priori. This work being empirical, further research is required to strengthen the findings of this study.

References

[1] W.L. Gorr, D. Nagin, J. Szczypula, Comparative study of artificial neural network and statistical models for predicting student grade point averages, International Journal of Forecasting 10 (1994) 17–34.
[2] T. Hill, W. Remus, Neural network models for intelligent support of managerial decision making, Decision Support Systems 11 (1994) 449–459.
[3] B. Warner, M. Misra, Understanding neural networks as statistical tools, The American Statistician 50 (4) (1996) 284–293.
[4] P.C. Pendharkar, Scale economies and production function estimation for object-oriented software component and source code documentation size, European Journal of Operational Research 172 (2006) 1040–1050.
[5] L. Pickard, B. Kitchenham, S.J. Linkman, Using simulated datasets to compare data analysis techniques used for software cost modeling, IEE Proceedings – Software 148 (6) (2001) 164–175.
[6] J. Gaudart, B. Giusiano, L. Huiart, Comparison of the performance of multilayer perceptron and linear regression for epidemiological data, Computational Statistics and Data Analysis 44 (2004) 547–570.
[7] T. Heskes, Practical confidence and prediction intervals, in: M.C. Mozer, M.I. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, vol. 9, MIT Press, Cambridge, MA, 1997, pp. 176–182.
[8] H. Glejser, A new test for heteroskedasticity, Journal of the American Statistical Association 64 (325) (1969) 316–323.
[9] R.R. Wilcox, Confidence intervals for the slope of a regression line when the error term has nonconstant variance, Computational Statistics and Data Analysis 22 (1996) 89–98.
[10] K. Levenberg, A method for the solution of certain problems in least squares, Quarterly of Applied Mathematics 2 (1944) 164–168.
[11] D.W. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, Journal of the Society for Industrial and Applied Mathematics 11 (1963) 431–441.
[12] R. Sawyer, Sample size and the accuracy of predictions made from multiple regression equations, Journal of Educational Statistics 7 (2) (1982) 91–104.
[13] N.J. Delaney, S. Chatterjee, Use of the bootstrap and cross validation in ridge regression, Journal of Business and Economic Statistics 4 (2) (1986) 255–262.
[14] W.Y. Huang, R.P. Lippmann, Comparisons between neural net and conventional classifiers, in: IEEE First International Conference on Neural Networks, IV, San Diego, CA, 1987, pp. 485–493.
[15] G.E. Hinton, Learning translation invariant recognition in a massively parallel network, in: Proceedings of the Conference on Parallel Architectures and Languages Europe, Eindhoven, The Netherlands, 1987, pp. 1–13.
[16] H. Huynh, Some approximate tests for repeated measurements, Psychometrika 43 (1978) 161–175.
[17] H.J. Keselman, K.C. Carriere, L.M. Lix, Testing repeated measures hypotheses when covariance matrices are heterogeneous, Journal of Educational Statistics 18 (4) (1993) 305–319.
[18] B.J. Winer, Statistical Principles in Experimental Design, second ed., McGraw-Hill, New York, 1971.
[19] SAS Institute Inc., Statistical Analysis System, Version 9.1, SAS Institute Inc., Cary, NC, 2007.
[20] J. Neter, M.H. Kutner, C.J. Nachtsheim, W. Wasserman, Applied Linear Statistical Models, third ed., McGraw-Hill, 1996, ISBN 0-256-11736-5.
[21] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2007. Available from: http://www.ics.uci.edu/~mlearn/MLRepository.html.
[22] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, London, U.K., 1995.