Collinearity
David Gerbing
School of Business Administration
Portland State University
October 27, 2014
Contents
1 Collinearity
Collinearity
Ideally, each predictor variable in a regression model satisfies two conditions: (a) it is related to the response variable, and (b) it contributes new information relative to the remaining predictor variables in the regression model. For some regression models, however, a problem arises in that two or more predictor variables are highly correlated.
Collinearity: Occurs when the value of a predictor variable can be
linearly predicted from one or more of the remaining predictor variables
The information supplied by these collinear predictor variables is redundant. Little gain in
predictive efficiency results from the addition to the model of a new predictor variable that
is substantially related to an existing predictor variable. Conversely, dropping one of the
redundant variables from the model does not substantially diminish predictive accuracy.
What is the influence of collinearity on the estimation of the regression coefficients, the intercept and the slope coefficients? Estimation of each partial slope coefficient follows the principle of ceteris paribus: the effect of a unit change in one predictor variable on the response variable is estimated with the values of all other predictor variables held constant. The more collinearity, the more difficult it is to separate the effects of the correlated predictor variables from each other. This difficulty of separation increases the standard errors of the slope coefficients for the correlated predictor variables. In the extreme, when two predictor variables are simply different names for the same concept, the standard errors of the estimates are badly inflated because the separate effects cannot be disentangled.
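To make this inflation concrete, consider the following minimal simulation sketch in base R (illustrative only; the variable names, seed, and correlation level are assumptions, not part of the original example). It fits the same response with and without a nearly redundant second predictor and compares the standard error of the first slope.

# simulate a predictor x1 and a nearly redundant copy x2, r about 0.97
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.97 * x1 + sqrt(1 - 0.97^2) * rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

# standard error of the x1 slope: alone, then with the collinear x2
summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

The second standard error is roughly four times the first, the inflation that the VIF quantifies below.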
Consider the extreme case, illustrated in the following example, in which two predictor
variables are almost perfectly correlated. Management seeks to understand the factors that
contribute to the maintenance costs of its machines used to manufacture its product. One
consideration is the number of days a machine runs until a breakdown occurs. At a meeting
to discuss the various possibilities, one manager suggests that the Age of the machine is a
contributing factor. Another manager suggests that the Time actually used to manufacture
the product since the machine was purchased is the critical variable. During any given
week, a machine could be idle most if not all of the time. Yet another manager suggests
simultaneously studying both variables.
The data, presented below, are read into R from the web with:
> mydata <- Read("http://web.pdx.edu/~gerbing/data/collinearity.csv")
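Read is from the lessR package. If lessR is not loaded, the same file can be read with base R's read.csv, which accepts the same URL:

> mydata <- read.csv("http://web.pdx.edu/~gerbing/data/collinearity.csv")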
The following analyses were all obtained from one of the following Regression calls: a multiple regression with both predictors present, and a regression for each predictor separately.
> Regression(Days ~ Age + Used)
> Regression(Days ~ Age)
> Regression(Days ~ Used)
> mydata
    Age Used Days
1  2014 1833  703
2  1938 1739 1214
3  2289 2072 1271
4  1447 1202   20
5  1831 1790  169
6  1293 1117   39
7  2014 1739  466
8  1889 1740  707
9  1889 1669  941
10 1889 1748  560
Correlations
One method of detecting collinearity is to examine the correlations among the predictors. These correlations are part of a multiple regression as reported by the lessR regression function. In this example, r_Age,Used = 0.97, to two decimal digits.
Correlations
Days Age Used
Days 1.00 0.79 0.73
Age 0.79 1.00 0.97
Used 0.73 0.97 1.00
The primary limitation of this method is that, by definition, a correlation coefficient is a
function of only two variables, yet collinearity can be a function of more than two predictor
variables. In practice, however, many instances of collinearity can be detected from an
examination of the correlations.
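These correlations can also be verified directly with the base R cor function, applied to the data frame read earlier and rounded to two decimals to match the matrix above:

> round(cor(mydata[ , c("Days", "Age", "Used")]), 2)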
Variance Inflation Factor and Tolerance
A more formal method for detecting collinearity considers the relationship between each predictor variable and the remaining predictors. To evaluate this relationship, create a regression model in which the jth predictor, X_j, becomes the response variable, and all remaining predictors are still predictor variables. If X_j is collinear with one or more of the remaining predictors, then the resulting R²_j should approach 1. If there is no collinearity, then R²_j should approach 0.
With this R²_j defined for the jth predictor, two different indices of collinearity are commonly defined: Tolerance and the Variance Inflation Factor, or VIF.
Tolerance(b_j) = 1 − R²_j

and

VIF(b_j) = 1 / Tolerance(b_j) = 1 / (1 − R²_j)
VIF is defined to provide a useful interpretation of the impact of any collinearity. The square root of the VIF indicates how much larger the standard error of the corresponding slope coefficient is, compared to the baseline standard error that would apply if that predictor were uncorrelated with the other predictor variables in the model.
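As an illustration of these definitions, the following base R sketch computes R²_j, Tolerance, and VIF for Age by hand. With only two predictors, the regression of Age on the one remaining predictor, Used, supplies R²_j directly (lm and summary are base R; the object names are illustrative):

# R-squared from regressing the predictor Age on the remaining predictor Used
r2_age  <- summary(lm(Age ~ Used, data = mydata))$r.squared
tol_age <- 1 - r2_age    # Tolerance
vif_age <- 1 / tol_age   # VIF, the reciprocal of Tolerance
c(tolerance = tol_age, VIF = vif_age)

For these data the result should reproduce the Tolerance of about 0.05 and VIF of about 19.7 reported below.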
These collinearity indices are part of the output of Regression for a multiple regression. Running the separate regressions of each predictor variable on the remaining predictors is not necessary. VIF and Tolerance for the two predictor variables X_Age and X_Used appear below.
Collinearity

       Tolerance     VIF
Age        0.051  19.657
Used       0.051  19.657
Both Tolerances are low, which indicates the presence of collinearity. A rough rule of thumb is to be concerned with collinearity if any of the Tolerances are below 0.20. Because VIF is the reciprocal of Tolerance, the corresponding cutoff for VIF is 1/0.20 = 5: values of VIF larger than 5 indicate collinearity. These indices unequivocally indicate collinearity for this model.
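For a cross-check outside of lessR, the vif function in the car package (assuming that package is installed) computes the same quantity from a fitted lm model:

> library(car)
> vif(lm(Days ~ Age + Used, data = mydata))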
Multiple Regression
Days (to breakdown) as a function of Age and (Time) Used
Model Coefficients

              Estimate   Std Err   t-value   p-value   Lower 95%   Upper 95%
(Intercept) -1888.461   695.105    -2.717     0.030    -3532.124    -244.799
Age             2.435     1.548     1.573     0.160       -1.226       6.096
Used           -1.205     1.532    -0.787     0.457       -4.827       2.417
The multiple regression model to predict the number of days between Breakdowns from
the Age in days and the number of days the machine is Used is,
Ŷ = −1888.46 + 2.44 X_Age − 1.21 X_Used
Model Fit

Standard deviation of residuals: 298.74 for 7 degrees of freedom
R-squared: 0.657    Adjusted R-squared: 0.559
Null hypothesis that population R-squared=0
F-statistic: 6.703     df: 2 and 7     p-value: 0.024
The accuracy of predictability using this model is a considerable improvement over using no model at all, as indicated by R² = .66 and R²_adj = .56. Unfortunately, the estimated slope coefficients b_Age = 2.44 and b_Used = −1.21 are unstable, as is evident from their respective standard errors, 1.55 and 1.53. These estimates result in extremely wide 95% confidence intervals of −1.23 to 6.10 and −4.83 to 2.42 (for t.025 = 2.365 at df = 7), respectively. With 95% confidence, the population slope coefficients β may lie anywhere within these wide intervals.
Both of these intervals include zero, indicating that a population regression coefficient of zero is plausible for each predictor. Neither population regression coefficient can be distinguished from zero.
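As a check on these limits, a confidence interval for a slope is the estimate plus or minus t.025 times its standard error. A base R sketch for the Age slope, using the 7 residual degrees of freedom reported in the model fit:

b_age  <- 2.435                       # estimated slope for Age
se_age <- 1.548                       # its standard error
t_crit <- qt(0.975, df = 7)           # t.025 for df = 7, about 2.365
b_age + c(-1, 1) * t_crit * se_age    # about -1.23 to 6.10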
Age of machine and Time the machine was used are almost perfectly correlated, r = 0.97.
Using both of these variables as predictors in a regression model to predict the time until
the next machine breakdown results in a collinear model. The estimated slope coefficients
are unstable because the model attempts to portray the separate effects of each predictor
variable on time between Breakdowns, yet the two predictor variables are essentially two
names for the same concept. Using both predictors simultaneously in the model does not
improve the accuracy of prediction over the use of either one of the variables alone, and it
also yields highly unstable, unusable estimates of the resulting slope coefficients.
One-Predictor Regressions
Age only:
Days (to breakdown) as a function of Age
Model Coefficients

              Estimate   Std Err   t-value   p-value   Lower 95%   Upper 95%
(Intercept) -1700.364   636.930    -2.670     0.028    -3169.128    -231.601
Age             1.249     0.341     3.664     0.006        0.463       2.035
With just Age as a predictor, its slope coefficient is now significantly different from zero, with p-value = 0.006 < α = 0.05.
Model Fit

Standard deviation of residuals: 291.54 for 8 degrees of freedom
R-squared: 0.627    Adjusted R-squared: 0.580
Null hypothesis that population R-squared=0
F-statistic: 13.428     df: 1 and 8     p-value: 0.006

time Used only:
Days (to breakdown) as a function of time Used
Model Coefficients

              Estimate   Std Err   t-value   p-value   Lower 95%   Upper 95%
(Intercept) -1292.922   634.358    -2.038     0.076    -2755.755     169.910
Used            1.142     0.376     3.038     0.016        0.275       2.009
With just time Used as a predictor, its slope coefficient is now significantly different from zero, with p-value = 0.016 < α = 0.05.
Model Fit

Standard deviation of residuals: 325.10 for 8 degrees of freedom
R-squared: 0.536    Adjusted R-squared: 0.478
Null hypothesis that population R-squared=0
F-statistic: 9.232     df: 1 and 8     p-value: 0.016
The corresponding separate regression models for predicting the number of days between
Breakdowns from the Age in days or the number of days the machine is Used are,
Ŷ = −1700.36 + 1.25 X_Age
Ŷ = −1292.92 + 1.14 X_Used
The precision of predictability using these one-predictor models is reasonable; it is comparable to the precision obtained from the multiple regression model, as indicated by the respective R²_adj of .58 and .48. The estimated slope coefficients b_Age = 1.25 and b_Used = 1.14 are considerably more stable, given the dramatically reduced standard errors of .34 and .38, respectively. The considerably narrower 95% confidence intervals around the estimated slope coefficients, computed from t.025 = 2.306 for df = 8, are 0.46 to 2.04 and 0.28 to 2.01.
The regression model for predicting time between breakdowns should be estimated with
either one of the predictor variables, Age of machine or Time the machine has been Used.
Both predictor variables should not be used simultaneously.
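To summarize the comparison, the three models can also be fit with base R's lm and their adjusted R² extracted side by side (a sketch equivalent to the lessR Regression calls above; the list names are illustrative):

# adjusted R-squared for the collinear model and the two one-predictor models
fits <- list(
  both = lm(Days ~ Age + Used, data = mydata),
  age  = lm(Days ~ Age,  data = mydata),
  used = lm(Days ~ Used, data = mydata)
)
round(sapply(fits, function(f) summary(f)$adj.r.squared), 2)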