18.pdf

Applied Soft Computing 11 (2011) 3859–3869
Contents lists available at ScienceDirect
Applied Soft Computing
journal homepage: www.elsevier.com/locate/asoc
The predictive accuracy of feed forward neural networks and multiple regression
in the case of heteroscedastic data
Mukta Paliwal, Usha A. Kumar ∗
Shailesh J. Mehta School of Management, Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India
a r t i c l e
i n f o
Article history:
Received 17 May 2010
Received in revised form 2 December 2010
Accepted 24 January 2011
Available online 3 March 2011
Keywords:
Monte Carlo simulation
Heteroscedaticity
Prediction
Regression
Neural networks
a b s t r a c t
This paper compares the performances of neural networks and regression analysis when the data deviate
from the homoscedasticity assumption of regression. To carry out this comparison, datasets are simulated
that vary systematically on various dimensions like sample size, noise levels and number of independent
variables. Analysis is performed using appropriate experimental designs and the results are presented.
Prediction intervals for both the methods for the case of nonconstant error variance are also calculated
and are graphically compared. Two real life data sets that are heteroscedastic have been analyzed and
the findings are in line with the results obtained from experiments using simulated data sets.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Neural networks are mathematical tools that resulted from
attempts to model the capabilities of human brains. Literature [1–4]
points out the potential of neural networks for prediction problems
which has led to a number of studies comparing the performance
of neural networks and regression analysis. However, there are
very few studies addressing the consequences of deviation from
the underlying assumptions of regression technique on the comparative performances of these two techniques. Pickard et al. [5]
have used simulation to create data sets with known underlying model and with non-normal characteristics such as skewness,
unstable variance and outliers that are frequently found in software cost modeling problem. Multiple regression and classification
and regression tree have been compared and viability of simulation to allow such comparisons under controlled mechanism is
demonstrated. Standard multiple regression techniques are found
to be the best when the data exhibit moderate non-normality and
under more extreme condition such as heteroscedasticity, nonparametric techniques are found to have the best performance.
Gaudart et al. [6] have done comparison of the performance
of multilayer perceptron and linear regression for epidemiological
data when the data deviate from some of the underlying assumptions of regression. One of the functional forms considered in this
∗ Corresponding author. Tel.: +91 22 25767786; fax: +91 22 25722872.
E-mail addresses: [email protected] (M. Paliwal), [email protected]
(U.A. Kumar).
1568-4946/$ – see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.asoc.2011.01.043
study pertains to heteroscedasticity in the error variance. For the
heteroscedastic errors, authors have concluded that predictions
from both the models are of the same size and order. However, the
findings are limited to the characteristics of the functional form
considered in this study.
The available literature is sparse, particularly comparing the
performance of these two techniques when the data is heteroscedastic. Thus there is a need for extensive and systematic
study that takes into account various patterns of heteroscedasticity and a large number of data characteristics while comparing the
performance of the two techniques.
The aim of the present study is to systematically compare the
performance of neural network and regression technique when the
data deviate from the assumption of homoscedasticity. This comparison is carried out by simulating data sets for different patterns
of heteroscedasticity possessing various data characteristics that
vary in sample size, amount of noise and number of independent
variables. Appropriate prediction intervals for both the methods in
presence of heteroscedasticity are also obtained and are graphically
compared. To gain insight on how the comparative performance of
both the techniques under heteroscedasticity translates to real life
situations, two data sets have also been analyzed and results are
discussed.
The next section describes the regression model with nonconstant error variances. Weighted least squares method, a remedial
measure to deal with unequal error variances is also discussed
in this section. Section 3 presents the construction of prediction
intervals for regression and neural network techniques when the
data is heteroscedastic. The experimental design, data generation
3860
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
procedure and performance evaluation criterion are discussed in
Section 4. Section 5 presents the data analyses and results from
the simulation study are presented in Section 6. Section 7 presents
analysis of two real life data sets and the last section concludes this
paper.
be easily expressed using matrix terms. The following matrices are
defined as:
⎛
⎜
⎝
Y =⎜
n×1
2. The model
Consider the multiple linear regression model given by the
equation,
Yi = ˇ0 + ˇ1 X1i + · · · + ˇj Xji + · · · + ˇp Xpi + εi for i = 1, 2, . . . , n
j = 0, 1, 2, . . . , p + 1
and
(1)
where Yi is the dependent variable, the Xji are independent variables, ˇj are unknown parameters and εi is the error term. In order
for estimates of ˇ’s to have desirable properties, we need the following assumptions, called Gauss–Markov conditions:
E(εi ) = 0;
E(ε2i )
2
= ;
E(εi εk ) = 0,
when i =
/ j,
These conditions, if they hold, assure that ordinary least squares
(OLS) estimates are best linear minimum variance estimates among
the class of unbiased estimators. Violation of the second condition
is called heteroscedasticity in which the error term of a regression
model does not have constant variance over all the observations,
i.e. Var(εi ) = i2 . OLS estimates of regression coefficients under
heteroscedasticity are still unbiased and consistent but they are
no longer the best linear unbiased estimators. Heteroscedasticity
causes variances of parameter estimates to be large and thus can
affect R2 , estimate of 2 , and various inferential procedures substantially.
In this study, heteroscedasticity in the error term is introduced
by defining εi as hi ui , where ui (random error) are independent N(0,
2 ) and hi is some function of one of the independent variables (Xi ’s)
and thus Var(εi ) = i2 = (h2i 2 ). It is common for heteroscedastic
error variance to increase or decrease as the expectation of Y grows
larger, or there may be systematic relationship between error variance and a particular X. In this situation, the nonconstant error
variances i2 = (h2i 2 ) are known up to a proportionality constant,
where 2 is the proportionality constant. Weighted least squares
estimation procedure is one of the methods that can be used to correct such form of heteroscedasticity and is the topic of discussion
in the next subsection.
2.1. Weighted least squares
n
wi (Yi − ˇ0 − ˇ1 X1i − ˇ2 X2i − · · · − ˇp Xpi )2
⎛
X11
X21
..
.
Xn1
1
⎜1
X =⎜ .
⎝ ..
n×(p+1)
1
⎟
⎟
⎠
X12
X22
..
.
Xn1
⎞
· · · X1p
· · · X1p ⎟
.. ⎟
⎠
.
· · · Xnp
(3)
⎛
w1
⎜ 0
W =⎜ .
⎝ ..
nxn
0
⎞
...
···
..
.
···
0
w2
..
.
0
0
0 ⎟
.. ⎟
⎠
.
wn
(4)
The normal equations can be expressed as follows:
(X WX)bw = X WY
(5)
The weighted least squares estimators of the regression coefficients
are:
= (X WX)
−1
X WY
(6)
where bw is the vector of the estimated regression coefficients
obtained by weighted least squares. The variance–covariance
matrix of the weighted least squares estimated regression coefficients is:
␴2
{bw }
(p+1)×(p+1)
= 2 (X WX)
−1
(7)
and this matrix is not known since the proportionality constant 2
is not known. However, it can be estimated as
s2
{bw }
(p+1)×(p+1)
2
= sw
(X WX)
−1
(8)
2 is based on weighted squared residuals and can be calcuwhere sw
2
wi (Yi − Ŷi ) /(n − p − 1). Weighted least squares can also
lated as
be viewed as ordinary least squares on transformed model given by
Yw = Xw ␤ + ␧w
(9)
1/2
1/2
1/2
where Yw = W Y, Xw = W X and ␧w = W ␧ with W1/2
being a diagonal matrix containing square roots of the weights wi .
Here
bw
(p+1)×1
= (X WX)
−1
X WY = (Xw Xw )
= 2 (X WX)
−1
= 2 (Xw Xw )
−1
Xw Yw
and 2
{bw }
(p+1)×(p+1)
−1
It can be observed that
Weighted least squares (WLS) regression is useful for estimating the regression parameters when heteroscedasticity is present
in the data. Here, we assume that the error variances are known
up to proportionality constant i.e. Var(εi ) = i2 = (h2i 2 ). Estimation of parameters by the method of weighted least squares is
closely related to parameter estimation by ordinary least squares.
The weighted least squares criterion includes an additional weight,
wi (= 1/h2i ), where the estimates of parameters are obtained by
minimizing
Qw =
⎞
Let the matrix W be a diagonal matrix containing the weights wi :
bw
(p+1)×1
for all i, j = 1, . . . , n
Y1
Y2
..
.
Yn
(2)
i=1
Since the weight wi is inversely related to the variance i2 , it reflects
the amount of information contained in the observation Yi . Thus,
an observation Yi that has a large variance receives less weight than
another observation that has a smaller variance. The weighted least
squares estimators of the regression coefficients for model (1) can
␴2 {εw } = W1/2 ␴2 {ε}W1/2 = W1/2 2 W−1 W1/2 = 2
satisfying the constant variance assumption.
2.2. Prediction intervals
Prediction intervals to predict the future observation, Y0 , from
regression model and neural network model in case of heteroscedastic error variance are discussed in this section.
In case of heteroscedastic error variance, a modified form of
standard prediction interval needs to be used as usual prediction
intervals will tend to be wider for small values of error variance and
will tend to be narrower for large values of error variance. Incorporating the nonconstant error variance component hi in regression
technique, an appropriate 100(1 − ˛) % prediction interval for Y0 is
given by
(n−p−1)
[Ŷ0 ± t˛/2
sw
h2i + X0 (Xw Xw )
−1
X0 ]
(10)
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
3861
−1
where (Xw Xw ) and sw are computed from the transformed data
as defined in model (9). Here X0 = [1 x01 x02 ... x0p ] is a row vector
containing the values of independent variables for the new observation.
For neural network, the prediction intervals in case of nonconstant error are obtained using the bootstrap method given by Hesks
[7]. In this technique, a number of bootstrap samples are created
by repeatedly sampling with replacement from the original data
set and training each of these samples individually. If B bootstrap
samples are created, a committee of B networks of the same architecture is formed. The output of the network for the input vector X
will be the average of the B network outputs.
Suppose we have a set of observations Dn = (Xi , Yi ), i = 1, 2, . . ., n,
that satisfy the neural model:
Yi = g (Xi ) + εi
(11)
where Yi is the output of the neural network g␪ (Xi ) and εi is the
error. Repeatedly sampling B times from the original data set, the
predicted output of the network would be the average of B network
outputs given by
1
gˆ (Xi )(b)
B
B
Ŷi =
(12)
b=1
For the observation (X0 , Y0 ), 100(1 − ˛) % prediction interval can be
calculated as
Ŷi ± t˛/2 ˆ p (X0 )
(13)
2)
where p2 is given by the sum of model uncertainty variance (m
and data error variance (e2 ) which is nonconstant. From the boot2 is estimated as
strap method, the model uncertainty variance, m
1 =
B−1
B
2
ˆ m
b=1
2
g ∧ (Xi )
(b)
− Ŷi
(14)
and for estimating the error variance e2 , which is a function of independent variable (X2i ), a supplementary network is trained using
2
the squared residuals (Yi − Ŷi ) as output values and X2i as input
values.
3. Methodology
Monte Carlo method is used to simulate data sets from a linear
functional model of the form (1) in such a way that they deviate
from the assumption of homoscedasticity of multiple regression
model. Heteroscedasticity in the error term is introduced by defining εi as hi ui , where ui are independent N(0, 2 ) and hi is some
function of X2 . The three patterns of hi considered in this study are:
(i) hi =
X2i , (ii) hi = X2i , (iii) hi = 1 + 2 X2i + 1
These variance configurations are similar to those considered by
Glejser [8] and Wilcox [9]. The other factors considered in this
study, details of the data generation and performance evaluation
criterion are explained in the subsequent subsections.
3.1. Experimental design
The impact of nonconstant error variance on the performance
of regression analysis and neural network may depend on factors
like sample size, number of independent variables, and amount of
variation in i2 . Thus, the experiment considered is a 2 × 3 × 3 × 3
repeated factorial design whose four factors are the two methods of
Fig. 1. Residual plots for different forms of heteroscedasticity.
analysis, no. of independent variables, sample size and noise in the
data. Different levels that are considered for each of these factors
are as follows:
Methods of analysis: Multiple linear regression and feed forward neural network. To compare the performance of these two
techniques, the coefficients of regression are estimated using the
weighted least square (WLS) and the neural network is trained
using Levenberg–Marquardt algorithm [10,11].
Number of variables (p): Three different levels considered for the
number of explanatory variables are 2, 4, and 10, respectively.
3862
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
Table 1
Sample sizes for different values of p.
by predicting each observation in the test data using the trained
model and then calculating the mean of the squared prediction.
Sample size (n)
p=2
p=4
p = 10
Small
Medium
Large
18
150
500
30
255
840
65
560
1840
Sample size (n): For choosing the sample size, a rule of thumb
for deciding the subject-to-variable ratio has been used as given
in Sawyer [12]. Three levels of sample size, namely small, medium
and large samples corresponding to three levels of p = 2, 4 and 10
are chosen. The specific values of n for different values of p are given
in Table 1.
Error variance ( 2 ): This factor determines the variation in the
random (Gaussian) noise present in the generated data. This variation is measured in terms of the signal-to-noise ratio (SNR). The
SNR is defined as ˇ ˇ/ 2 . Three values of 2 considered in this study
are 1, 16 and 100. The value of ˇ ˇ used is 100 resulting in three values of SNR considered as high noise, medium noise and low noise
levels, respectively.
3.2. Data generation
Simulation allows comparison of analytical techniques even
when assumptions of the techniques under study are violated by
generating data sets that have the required characteristics. Data
matrices where the error variances are nonconstant are generated
for each of the 27 experimental designs (3 levels of independent
variables, 3 levels of sample sizes, 3 levels of random noise levels).
Thirty replications are considered in this experiment and hence
a total of 30 data matrices were generated for each of the 27
design conditions. Three functional forms are considered for heteroscedasticity as described earlier in this section resulting in three
different experiments. The independent variables are assumed to
be normally distributed with mean of 0 and standard deviation of
1 and are independent of each other. For a selected value of ˇ ˇ,
the regression coefficients are derived via normalization of p uniform random numbers ni , i = 1,2,. . .,p generated from U[−10,10].
The regression coefficients are obtained as given in Delaney and
Chatterjee [13].
ˇi =
␤
rn
ni
(15)
2
n2i . The value of ␤ used in this study is 100. The
where rn2 =
number of independent variables and sample sizes are chosen to
represent a variety of conditions found in past research as well
as to explore the range of situations that arise in various applications. The data thus generated facilitates comparison between
neural network and regression techniques when the assumption
of homoscedasticity does not hold true.
3.3. Performance evaluation criterion
Mean square error, an unbiased estimate of 2 for the linear
regression equation of the form (1) is given by
2
1
(Yi − Ŷi )
n−p−1
n
s2 =
(16)
i=1
The estimate s2 is a measure of the variation in the prediction error.
But this mean square error (MSE) will tend to understate the inherent variability in making the future predictions using the trained
model. The actual prediction capability of the trained model can
be measured by mean squared prediction error which is obtained
2
1
(Yi − Ŷi )
n∗
n
MSPR =
(17)
i=1
where n* is the number of observations in the test data set.
If MSPR is fairly close to the MSE of the trained model then this
measure can also be used as a measure of the predictive capability. If
the MSPR is much larger then the MSE, then MSPR is an appropriate
indicator of how well the trained model will predict in future.
In order to have the magnitude of the error in the same units
as the observations, the square root of MSPR denoted by RMSP is
used for test data and root mean square error (RMSE) is the error
measure used for training data.
4. Data analysis
To start with the analyses, the data sets for each of the experimental conditions were generated with double the sample sizes as
mentioned in Table 1. These data sets were then divided randomly
into two sets. The first data set is used to build the model and is
called as training set and the second independent data set, which is
not used for training the model, is referred to as holdout sample or
test sample. This holdout sample is used for validating the trained
model from both the techniques and comparing their predictive
ability. The entire process has been replicated 30 times for each
of the experimental conditions and are analyzed using both the
techniques. In designed experiments, replication allows obtaining
an estimate of the prediction error and also in obtaining a precise
estimate of the effects of various factors considered in the study.
Regression analysis is performed using a modified form of OLS
technique where the coefficients are estimated using weighted
least squares procedure. To perform the neural network analysis, a three layer feed forward network is considered and
Levenberg–Marquardt training algorithm is used to train the
network. The input layer contains p nodes depending on the
experimental condition and the output layer contains one node
corresponding to one response variable. The hyperbolic tangent
activation function is used at the hidden layer and the identity
activation function is used at the output layer.
One of the important architectural decisions involves the determination of the number of nodes in the hidden layer. The number
of nodes in the hidden layer depends on the complexity of the problem at hand [14] and needs to be empirically determined to get the
best performance for the data being considered. Further, the goal
of training neural networks is not to learn an exact representation
of the training data itself, but rather to build a statistical model
of the process that generalizes well to the new data. In practical
applications of a feed forward neural network, if the network is
over-fit to the noise on the training data, it memorizes the training
data and gives poor generalization. Network generalization ability
can be improved by the process of regularization. One of the simplest forms of regularizer namely weight decay, has been used in
the present study, details of which are discussed in Bishop [22]. It
has been found empirically that a regularizer of this form can lead
to significant improvements in network generalization [15].
There are no theoretical guidelines for determining the appropriate values of the weight decay parameter and number of nodes
in the hidden layer. In this study, the optimum values of weight
decay parameter and number of nodes in the hidden layer have
been obtained by trial and error. Three nodes in the hidden layer
are found to be optimum for all the experimental conditions except
one experiment pertaining to small sample size and p = 2. For this
experiment, two nodes were selected in the hidden layer due to
the limitations on the number of sample points. The weight decay
0.086
0.063
0.037
0.035
0.036
0.023
1.385
0.961
1.392
0.942
1.397
0.913
0.436
0.039
0.274
0.031
0.159
0.020
2.713
0.896
2.607
0.899
2.578
0.902
0.070
0.218
0.089
0.125
0.054
0.038
1.363
1.065
1.434
1.020
1.412
0.973
0.694
0.072
0.370
0.062
0.163
0.040
2.602
0.901
2.595
0.898
2.325
0.915
0.255
0.850
0.219
0.972
0.132
0.586
1.452
2.132
1.548
2.395
1.531
2.471
0.726
0.259
0.542
0.198
0.208
0.178
2.311
1.166
2.028
0.989
1.996
0.920
10
4
WLS
NN
WLS
NN
WLS
NN
2
10
4
4.697
2.987
4.570
2.631
4.398
2.563
10
4
2
Med
Low
0.145
0.176
0.122
0.108
0.099
0.105
3.684
3.648
3.762
3.670
3.723
3.635
0.393
0.103
0.247
0.120
0.115
0.086
4.762
3.584
4.714
3.523
4.680
3.569
0.277
0.345
0.232
0.259
0.142
0.145
3.701
3.899
3.702
3.803
3.773
3.844
0.312
0.264
0.232
0.173
0.218
0.165
4.610
3.379
4.573
3.461
4.524
3.448
0.654
1.195
0.678
1.252
0.388
0.960
4.115
5.298
4.024
6.064
3.941
6.703
0.385
0.413
0.308
0.362
0.219
0.230
WLS
NN
WLS
NN
WLS
NN
1.395
1.075
0.583
0.748
0.544
0.412
Mean
9.004
9.170
8.977
9.141
8.997
9.097
0.495
0.473
0.287
0.298
0.198
0.217
Std Dev
Mean
10.348
8.627
10.270
8.819
10.263
8.792
0.845
1.008
0.548
0.628
0.324
0.327
Std Dev
Mean
9.013
9.699
9.058
9.710
8.921
9.615
0.818
0.851
0.524
0.528
0.281
0.336
Std Dev
Std Dev
10.284
7.856
9.987
5.717
10.246
6.400
WLS
NN
WLS
NN
WLS
NN
2
High
Std Dev
10.363
8.631
10.302
8.698
10.360
8.678
Mean
Mean
Mean
Training
Training
1.964
2.605
1.828
2.385
1.135
1.703
Medium sample size
Testing
Small sample size
Method
VAR
NOISE
Table 2
|x2i |.
Mean and standard deviation of errors for training and test data for hi =
Appropriate error measures are computed for WLS regression
and neural network for all the 30 replications for training data
and test data sets as described in the previous section. Three functional forms are considered to generate heteroscedasticity in the
data and comparison of the two methods is carried out for each
of these cases separately. Residual plots are used in detecting systematic relationships that may exist between the error variance
and any of the particular independent variables. Fig. 1(a)–(c) shows
the residual plots against absolute value of X2 for the three forms
of heteroscedasticity for one of the design conditions. This design
pertains to data with large sample size, high noise and two independent variables.
The mean and standard deviations of the RMSE and RMSP for
the two techniques for each of the three cases of heteroscedasticity
for training and test data sets are presented in Tables 2–4, respectively. From these tables, it can be observed that the mean errors for
the training data sets are generally smaller than the mean errors of
weighted regression analysis. The differences are more for the small
sample size and become less for large sample size. But for the test
data sets, the mean error values of weighted least squares regression is less than that of neural networks for most of the design
conditions considered in this study. This shows the problem of
overfitting in neural network in spite of presence of regularization
parameter in training the model. The results of regression analysis
do not show such a problem and its performance is consistent in
both training and test data sets. It is further analyzed to see whether
these differences in the mean error values of the two techniques for
the test data sets are statistically significant.
A full factorial analysis of variance with repeated measure is
conducted to investigate whether significant difference exists in
the performance of two techniques for all the design factors. The
model under which the usual F-tests in a repeated measure factorial experiment are valid assumes that homogeneity of covariance
matrices across the levels of the between subjects factors (compound symmetry) and variances of all pairwise differences between
variables are equal (sphericity). For this experiment, the sphericity
test is not required as there are only two levels in the within subject
factor (method of analysis). Box’M test for homogeneity of covariance matrices is carried out and is coming out to be significant. For a
balanced design (group sizes are equal), empirical literature [16,17]
recommends the use of univariate F-tests with adjusted degrees of
freedom to carry out the analysis. The results from univariate tests
for between and within subject effects for each of the three cases of
heteroscedasticity are summarized in Table 5 for the test data sets.
Testing
5. Results
9.284
12.151
9.563
13.750
9.379
13.576
Training
Large sample size
Testing
parameter values in the range 0.1–0.4 were found to be optimum
for all the experimental conditions. As the prediction error was not
very sensitive to any value of weight decay parameter in this range,
a value of 0.2 is chosen for all the experiments.
RMSE is calculated for training data set and RMSP is calculated
for test data set for each of the 30 replications. Analysis of variance
for four factorial design with repeated measure on the last factor
(method of analysis) is used to analyze whether significant differences exist in the performance of the two techniques with respect
to various design factors like independent variable, sample size,
and noise levels for the three different forms of heteroscedasticity.
Multiple comparison tests are performed to see at what level of one
factor, difference exists in the other factor causing interaction effect
to be significant. Appropriate prediction intervals are obtained for
the test data set using regression and neural network technique
when the error variances are heteroscedastic. In order to further
compare the performances, the average CPU time taken by both
the techniques is presented in Section 5.
3863
1.743
2.442
1.390
1.189
0.626
1.247
Std Dev
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
3864
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
Fig. 2. Scatter plots of prediction intervals of regression and NN for the case of two independent variables and large sample size at various levels of noise for hi =
Multiple comparison tests are performed for those interaction
effects that turned out to be significant in order to see the level
at which the differences are present in the other factors. To carry
out multiple comparisons of this repeated measure factorial design,
F-tests with appropriate numerator and denominator are used as
explained in Winer [18]. Results from the multiple comparison tests
|x2i |.
suggest the performance of regression to be generally better than
neural network for small sample size irrespective of the amount
of noise and number of independent variables. However when the
number of independent variables is two, the performance of the
two techniques is almost the same for the second form of heteroscedasticity corresponding to the case of low and medium noise
Table 3
Mean and standard deviation of errors values for training and test data for hi = |x2i |.
NOISE
VAR
Method
Small sample size
Training
High
2
4
10
Med
2
4
Low
2
4
10
Training
Large sample size
Testing
Training
Testing
Mean
Std Dev
Mean
Std Dev
Mean
Std Dev
Mean
Std Dev
Mean
Std Dev
Mean
Std Dev
WLS
NN
WLS
NN
WLS
NN
12.609
7.212
10.911
6.114
10.380
6.816
3.178
2.234
1.640
1.630
1.082
1.526
8.637
11.517
9.921
14.722
10.032
15.170
1.972
2.551
2.415
3.584
1.534
1.795
25.981
9.388
15.184
9.252
14.536
9.691
22.902
1.168
5.766
0.787
1.855
0.663
10.511
10.710
9.721
10.552
10.063
10.728
1.804
1.576
0.816
1.251
0.567
0.563
54.683
9.738
22.265
9.743
19.768
9.804
85.660
0.748
8.730
0.547
3.799
0.257
10.683
10.501
10.092
10.353
10.105
10.244
1.848
0.964
0.548
0.664
0.278
0.332
WLS
NN
WLS
NN
WLS
NN
12.055
3.060
6.332
2.650
5.559
2.532
19.790
0.866
2.802
0.630
1.084
0.537
4.851
5.179
4.512
7.068
4.248
7.035
2.264
1.935
1.028
1.794
0.796
1.623
13.738
3.597
13.094
3.642
11.088
3.789
6.014
0.426
6.208
0.350
2.487
0.247
4.384
4.472
4.358
4.337
4.182
4.330
0.499
0.727
0.392
0.363
0.221
0.293
40.324
3.906
22.955
3.942
18.749
3.969
28.233
0.250
10.561
0.191
6.590
0.119
4.561
4.107
4.342
4.136
4.217
4.045
0.807
0.261
0.405
0.200
0.185
0.140
WLS
NN
WLS
NN
WLS
NN
7.050
1.135
3.906
1.042
3.729
0.958
7.176
0.319
1.805
0.249
0.920
0.182
2.610
2.626
1.897
2.514
1.921
2.514
3.462
1.656
0.494
0.901
0.279
0.671
14.778
0.961
13.419
0.967
11.694
0.996
14.543
0.097
7.215
0.068
3.744
0.066
2.205
1.158
1.824
1.168
1.811
1.087
1.153
0.186
0.238
0.182
0.201
0.076
32.687
0.980
24.276
1.010
19.281
0.998
32.343
0.068
13.040
0.057
5.303
0.027
2.237
1.055
1.866
1.044
1.865
1.026
1.072
0.082
0.330
0.065
0.253
0.031
Table 4
Mean and standard deviation of errors values for training and test data for hi = 1 + 2/( X2i + 1).
NOISE
VAR
Method
Small sample size
Training
High
2
4
10
Med
2
4
10
Low
2
4
10
Medium sample size
Testing
Training
Large sample size
Testing
Training
Testing
Std Dev
Mean
Std Dev
Mean
Std Dev
Mean
Std Dev
Mean
Std Dev
Mean
Std Dev
WLS
NN
WLS
NN
WLS
NN
10.481
22.759
9.985
17.650
9.916
17.848
1.483
6.908
1.749
3.866
0.949
3.702
23.301
30.533
22.774
30.985
24.178
31.576
4.931
6.885
3.252
6.929
2.438
3.694
9.822
21.501
10.057
21.903
9.973
21.449
0.609
1.374
0.421
1.086
0.281
0.638
22.796
24.018
22.682
24.046
22.659
24.178
1.523
1.728
1.248
1.341
0.711
0.885
10.002
22.379
10.027
22.500
10.010
22.322
0.271
0.758
0.202
0.495
0.182
0.397
22.779
23.115
22.734
23.197
22.639
23.136
0.749
0.807
0.613
0.655
0.447
0.427
WLS
NN
WLS
NN
WLS
NN
4.118
9.007
4.183
7.136
4.027
6.299
0.713
2.042
0.650
1.883
0.394
1.114
8.930
12.026
10.113
13.542
9.665
14.254
1.605
1.943
1.167
2.485
1.127
1.588
4.036
8.943
4.027
8.777
4.026
8.744
0.283
0.678
0.188
0.482
0.126
0.302
9.091
9.547
9.172
9.737
9.151
9.665
0.556
0.644
0.373
0.441
0.293
0.379
4.049
9.066
4.034
9.026
4.022
8.965
0.154
0.396
0.101
0.234
0.066
0.168
9.170
9.225
9.168
9.264
9.097
9.193
0.283
0.287
0.221
0.240
0.118
0.115
WLS
NN
WLS
NN
WLS
NN
1.024
2.400
1.097
1.855
1.111
1.735
0.214
0.711
0.129
0.284
0.084
0.239
2.611
4.113
2.804
5.012
2.717
4.929
0.402
1.137
0.420
1.229
0.276
0.879
1.097
2.279
1.120
2.296
1.107
2.260
0.071
0.173
0.046
0.106
0.030
0.060
2.471
2.395
2.499
2.398
2.481
2.400
0.156
0.182
0.102
0.145
0.070
0.099
1.110
2.266
1.106
2.270
1.101
2.259
0.040
0.081
0.029
0.066
0.021
0.039
2.479
2.312
2.466
2.289
2.476
2.289
0.080
0.065
0.073
0.068
0.045
0.033
3865
Mean
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
10
Medium sample Size
Testing
3866
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
levels. In case of medium sample size, regression is seen to perform
better when the noise level is high. However with the decrease in
the noise level, either the performance of both the techniques is
almost the same or neural network is seen to perform better in some
of the cases. For the case of large sample size, the performance of
both the techniques is found to be nearly equal for almost all data
conditions. Only in the second form of heteroscedasticity, performance of neural network becomes better for medium and large
samples corresponding to low level of noise. For all other experimental conditions, no statistical significant difference is found in
the performance of the two techniques.
Appropriate prediction intervals are obtained for the test data
for the three forms of heteroscedasticity using both regression and
neural networks analyses for each of the experimental designs considered. The prediction intervals corresponding to all levels of noise
for the case of p = 2 and sample size being large are shown as scatter plots in Fig. 2 for the first form of heteroscedasticity. It can be
seen from these plots that the range of the prediction intervals of
regression is generally lesser than the corresponding range of prediction intervals obtained from neural network. Further, the size of
the prediction intervals becomes narrower as we move from high
noise level to low noise level. Similar observations can be made
from the prediction intervals pertaining to other levels of no. of
independent variables, namely 4 and 10 and other two forms of
heteroscedasticity and hence the corresponding scatter plots are
not shown here.
We have further analyzed the coverage of prediction intervals
of both the methods for all the experimental conditions and the
results are summarized in Table 6. This table gives the percentage
of times the length of prediction interval given by regression being
less than the corresponding length given by neural network. One
such typical value 96.29 represents the percentage of observations
for which length of prediction interval given by regression is lesser
than the interval given by neural network. For all three forms of
heteroscedasticity, prediction intervals pertaining to small sample
size are comparatively wider for neural network almost for all levels
of noise and number of independent variables. However, the gaps
between the prediction intervals tend to decrease with the increase
in sample size.
The CPU time taken by the two techniques for one of the experimental conditions is shown in Table 7. This table reports the average
CPU time of 10 runs for the first form of heteroscedasticity with
number of variables being 10, noise being high and sample size
at all three levels. The entire data analysis is performed in SAS
software [19] on a P-IV machine and thus this measure can be compared for the two techniques. It can be observed from this table
that the time taken by regression analysis is comparatively much
lesser than that of neural network model irrespective of the sample sizes. Neural network being trained using an iterative algorithm
takes more time to learn and converge as compared to regression
analysis. Further, the determination of various parameters like the
number of hidden layers, number of nodes in the hidden layer etc.
associated with neural networks is not straightforward and finding the optimal configuration of neural networks is again a time
consuming process. In the next section, two real life data sets have
been analyzed and results are discussed.
6. Illustration
The predictive performance of neural network and regression is
compared on two real life data sets. The first data set pertains to
a study of diastolic blood pressure (Y) and age (X) of 54 healthy
adult women of 20–60 years old. [20]. A simple scatter plot of
the diastolic blood pressure and age shown in Fig. 3(a), indicates
the presence of heteroscedasticity as the spread of diastolic blood
pressure seem to increase with an increase in age. The OLS regres-
Fig. 3. Various plots pertaining to first example.
sion is performed and the residuals are analyzed to check whether
all the assumptions of regression hold true. Fanning out patterns in
residual plot against age given in Fig. 3(b), indicates the error variances to be a function of the independent variable, age. Weighted
least squares regression is used to remedy the non-constant error
variance problem by choosing the weights, wi as 1/xi2 . The residuals obtained from WLS regression are plotted against x and are
shown in Fig. 3(c) which appears to be free from the presence of
heteroscedasticity.
The resulting weighted regression model is
Y = 55.07
1
x2
− 0.61(x)
(18)
Further, the real life data set has also been analyzed using
feed forward neural network model. A three layer network
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
3867
Fig. 4. Various plots regarding the second real life data analysis.
architecture with one node in the hidden layer and weight
decay parameter value of 0.05 was chosen and trained using the
Levenberg–Marquardt algorithm. As the sample size is small, leave
one out cross validation method has been used to validate the performance of both the techniques. The model is trained 54 times
(sample size) separately, using both the techniques on all the data
except one observation where the trained model is used to predict
this observation. The average of prediction errors over 54 runs is
used to assess the predictive performance of the two techniques.
The values of this average for regression model and neural network
model are 6.55(5.02) and 6.71(5.31), respectively. The numbers
reported in parenthesis are the standard deviations of errors for
54 runs. It can be seen that the error from the regression analysis
is less than that from neural network model and the results are
more stable as compared to neural network technique. The characteristics of this data set corresponds to somewhere between small
to medium sample size with noise level being high and the number of independent variables being low. The finding from this real
life data set agrees with the results obtained from the simulation
experiment for corresponding characteristics of the data.
The second data set has been taken from the UCI Machine Learning Data Repository [21]. The data pertains to the average miles per
3868
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
Table 5
Repeated measures analysis of variance for test data.
hi =
Source
Tests of hypotheses for between subjects effects
VAR
NOISE
VAR*NOISE
SIZE
VAR*SIZE
NOISE*SIZE
VAR*NOISE*SIZE
Tests of hypotheses for within subjects effects
Method
Method*VAR
Method*NOISE
Method*VAR*NOISE
Method*SIZE
Method*VAR*SIZE
Method*NOISE*SIZE
Method*VAR*NOISE*SIZE
|x2i |
+
+
+
+
+
hi = |x2i |
hi = 1 + 2/( X2i + 1)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Note: + indicates significant effect at 1% level of significance and blank cell represent that the corresponding factors are not significant.
Table 6
Comparison of length of prediction intervals from regression and neural networks.
h
1
2
3
Sample size
Noise levels
Noise levels
Noise levels
High
Med
Low
High
Med
Small
10
4
2
Variable
96.29
93.33
88.89
93.84
100
94.44
100
100
100
96.92
90
94.44
98.46
93.33
77.78
Med
10
4
2
73.75
72.55
68.67
63.39
74.9
53.33
68.57
58.82
66.67
74.64
73.73
74
73.39
67.84
54
Large
10
4
2
60.65
46.3
55.2
51.9
57.02
43
63.36
80.83
65.6
65.92
64.4
70.2
64.56
60
64
Table 7
Average CPU time for the two techniques.
Sample size
CPU time
Small
Med
Large
Reg
NN
0.022
0.038
0.055
0.248
1.062
3.25
gallons (MPG) of 392 cars with four continuous valued explanatory variables as displacement (DP), horsepower (HP), weight (WT)
and acceleration (AC). A preliminary regression analysis carried
out on this data set indicated DP and AC to be insignificant and
these variables were also found to be highly correlated with the
other variables and hence dropped from further analysis. Regression analysis is performed on the resulting data set and the residual
plot shows fanning out of residuals with respect to increase in the
values of predicted MPG as can be seen from Fig. 4(a). The residuals plotted against each of the independent variables as shown in
Fig. 4(b) and (c) indicate the presence of heteroscedasticity in the
Table 8
Summary of results.
Noise
High
Med
Low
Sample size
Small
Medium
Large
R
R
R
R
E
N/E
E
E
N/E
Low
High
Med
Low
83.08
93.33
44.44
96.92
80
72.22
100
96.67
83.33
69.82
68.23
86
57.86
56.86
49.33
71.79
54.9
42.67
65.18
59.61
40.67
64.72
74.28
66.2
66.36
65.36
63.8
60.98
67.62
56.6
62.34
78.45
36.6
100
100
100
data. Formal test procedures like Glesjer and Park also provided
sufficient evidence of heteroscedasticity. As the error variances i2
are unknown, weights were estimated using iteratively reweighted
least squares method [20]. However this does not seem to correct
the problem of nonconstant error variance as can be seen from the
residual plot shown in Fig. 4(d). A log transformation on both the
sides of regression equation was tried and this appears to reasonably correct the situation as shown in Fig. 4(e). Fig. 4(f) shows the
qq-plot of the residuals of transformed observations indicating the
distribution of error to be normal. The resulting regression model
is
log(MPG) = 10.11 − 0.36 log(HP) − 0.689 log(WT)
(19)
This model seems to reasonably satisfy the assumptions of
regression. A three layer feed forward neural network model
with one node in the hidden layer is trained optimally using
Levenberg–Marquardt algorithm and weight decay regularizer is
used to avoid over fitting. To validate the trained model, 10-fold
cross validation method has been performed. This would help in
testing robustness of both the techniques with respect to sampling
fluctuations. For performing 10-fold cross validation, the original
data is split into ten mutually exclusive subsets of nearly equal size
where nine sub-samples are used for training and the remaining
one is used for testing purpose. This process is repeated ten times
and the average of these ten runs is used in estimating the generalization error. The average RMSP for test samples for regression and
neural network models are 3.99(0.43) and 3.89(0.44), respectively.
The values in parenthesis are standard deviations over 10 runs. The
RMSP values indicate similar performances of both the techniques
M. Paliwal, U.A. Kumar / Applied Soft Computing 11 (2011) 3859–3869
for this data set. This also agrees with our finding from the simulation experiment corresponding to medium to large sample size for
low level of noise and number of independent variables being two.
7. Conclusion
In this study, simulation experiments are conducted to assess
the consequences of deviation from the underlying assumption of
homoscedasticity on the comparative performance of regression
analysis and neural networks. The predictive performance of these
two techniques is compared using simulated data sets having data
characteristics that vary in sample size, number of independent
variables and amount of noise present in the data. Models are validated by considering independent data sets that were not used in
training the models.
Results of this study show that the performance of regression
analysis is generally better than neural network when the size of
sample is small irrespective of the amount of noise and number of
independent variables in the data. This gives further evidence for
neural networks not being able to learn optimally for small samples.
In case of medium and large sample sizes, performance of regression technique is better or equivalent to neural network for high
level of noise. However with the decrease in the noise level, neural
network seems to perform better for some of the design conditions. This finding contradicts the claim of literature where neural
network is said to be a robust technique with respect to noise in the
data. For the second form of heteroscedasticity, some of the results
are not consistent with the findings mentioned above. This could be
due to the extent of heteroscedasticity being more for this function
as the rate of increase in the error variances is comparatively high
with respect to increase in the values of independent variable. To
easily comprehend the results mentioned above, the findings are
presented in Table 8.
In this table, R represents the fact that regression performs
better, E stands for the performance being same for both the techniques and N stands for better performance of neural network. Thus
it can be observed that the predictive performance of both the techniques gets affected by characteristics of the data such as sample
size, amount of noise, and functional form of heteroscedasticity
though not largely by the number of independent variables.
Analysis of prediction intervals of the two techniques for all
three forms of heteroscedasticity reveal the prediction intervals
obtained from regression to be smaller for small sample size, almost
for all levels of noise and number of independent variables. For
medium and large samples sizes, percentage of times the length of
prediction interval given by regression being lesser as compared
to that of neural network decreases with an increase in sample
size. These observations tend to support the results obtained from
the analysis of predictive error values. Two real life data sets having heteroscedasticity have also been analyzed and the findings
agree with the results obtained from the simulation experiment
for corresponding data characteristics.
In this study, various combinations of data characteristics are
identified where one method would perform better than the other
method in terms of its predictive ability. It is further observed that
3869
regression analysis is much faster as its running time is comparatively less. Being an iterative procedure, neural networks take more
time to converge and additional efforts and time are required to
optimize the architectural parameters. Regression analysis remains
an attractive option because of its established methodology and its
ability to interpret the explanatory variables. However, neural networks can serve to be an alternative model when the patterns of
heteroscedasticity are not known a priori. This work being empirical, further research is required to strengthen the findings of this
study.
References
[1] W.L. Gorr, D. Nagin, J. Szczypula, Comparative study of artificial neural network
and statistical models for predicting student grade point averages, International Journal of Forecasting 10 (1994) 17–34.
[2] T. Hill, W. Remus, Neural network models for intelligent support of managerial
decision making, Decision Support Systems 11 (1994) 449–459.
[3] B. Warner, M. Misra, Understanding neural networks as statistical tools, The
American Statistician 50 (4) (1996) 284–293.
[4] P.C. Pendharkar, Scale economies and production function estimation for
object-oriented software component and source code documentation size,
European Journal of Operational Research 172 (2006) 1040–1050.
[5] L. Pickard, B. Kitchenham, S.J. Linkman, Using simulated datasets to compare
data analysis techniques used for software cost modeling, IEE ProceedingsSoftware 148 (6) (2001) 164–175.
[6] J. Gaudart, B. Giusiano, L. Huiart, Comparison of the performance of multilayer perceptron and linear regression for epidemiological data, Computational
Statistics and Data Analysis 44 (2004) 547–570.
[7] T. Hesks, Practical confidence and prediction intervals, in: M.C. Mozer, M.I. Jordan, T. Pestsche (Eds.), Advances in Neural Information Processing Systems, vol.
9, MIT Press, Cambridge, MA, 1997, pp. 176–182.
[8] H. Glejser, A new test for heteroskedasticity, Journal of the American Statistical
Association 64 (325) (1969) 316–323.
[9] R.R. Wilcox, Confidence intervals for the slope of a regression line when the
error term has nonconstant variance, Computational Statistics and Data Analysis 22 (1996) 89–98.
[10] K. Levenberg, A method for the solution of certain problems in least squares,
Quarterly Applied Mathematics 2 (1944) 164–168.
[11] D.W. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, Journal of Society of Industrial Mathematics 11 (1963) 431–441.
[12] R. Sawyer, Sample size and the accuracy of predictions made from multiple
regression equations, Journal of Educational Statistics 7 (2) (1982) 91–104.
[13] N.J. Delaney, S. Chatterjee, Use of the bootstrap and cross validation in
ridge regression, Journal of Business and Economic Statistics 4 (2) (1986)
255–262.
[14] W.Y. Huang, R.P. Lippmann, Comparisons between neural net and conventional
classifiers, in: IEEE First International Conference on Neural Networks, IV, San
Diego, CA, 1987, pp. 485–493.
[15] G.E. Hinton, Learning translation invariant recognition in a massively parallel network, in: Proceedings of the Conference on Parallel Architectures and
Languages Europe, Eindhoven, The Netherlands, 1987, pp. 1–13.
[16] H. Huynh, Some approximate tests for repeated measurements, Psychometrika
43 (1978) 161–175.
[17] H.J. Keselman, K.C. Carriere, L.M. Lix, Testing repeated measures hypotheses
when covariance matrices are heterogeneous, Journal of Educational Statistics
18 (4) (1993) 305–319.
[18] B.J. Winer, Statistical Principles in Experimental Design, second ed., McGrawHill, Inc., New York, 1971.
[19] SAS Institute Inc., Statistical Analysis System, Version 9.1, SAS Institute Inc.,
Cary, NC, 2007.
[20] J. Neter, M.H. Kutner, C.J. Nachtsheim, W. Wasserman, Applied Linear Statistical
Models, third ed., The McGraw-Hill Companies, Inc., 1996, ISBN 0-256-11736-5.
[21] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2007, Available
from: http://www.ics.uci.edu/∼mlearn/MLRepository.html.
[22] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford Univ. Press, London, U.K., 1995.