Multiple Regression

W&W, Chapter 13, 15(3-4)
Introduction
Multiple regression extends bivariate regression to take into account more than one independent variable. The simplest multivariate model can be written as:
Y = β0 + β1X1 + β2X2 + ε
We make the same assumptions about the error term (ε) that we did in the bivariate case.
Example
Suppose we examine the impact of fertilizer on crop yields, but this time we want to control for another factor that we think affects yield levels: rainfall.
We collect the following data.
Data
Y (yield)   X1 (fertilizer)   X2 (rainfall)
40          100               10
50          200               20
50          300               10
70          400               30
65          500               20
65          600               20
80          700               30
Partial Slope Coefficients
β1 is interpreted geometrically as the marginal effect of fertilizer (X1) on yield (Y), holding rainfall (X2) constant.
The OLS model is estimated as:
Y = b0 + b1X1 + b2X2 + e
Solving for b0, b1, and b2 becomes more complicated than
it was in the bivariate model because we have to
consider the relationships between X1 and Y, X2 and Y,
and X1 and X2.
Finding the Slopes
We would solve the following equations simultaneously for
this problem:
Σ(X1−Mx1)(Y−My) = b1Σ(X1−Mx1)² + b2Σ(X1−Mx1)(X2−Mx2)
Σ(X2−Mx2)(Y−My) = b1Σ(X1−Mx1)(X2−Mx2) + b2Σ(X2−Mx2)²
b0 = My − b1Mx1 − b2Mx2
These are called the normal or estimating equations.
Solution
b1 = [(X1-Mx1)(Y-My)][(X2-Mx2)2] – [(X2-Mx2)(Y-My)][ (X1-Mx1)(X2-Mx2)]
(X1-Mx1)2 (X2-Mx2)2 -  (X1-Mx1)(X2-Mx2)2
b2 = [(X2-Mx2)(Y-My)][(X1-Mx1)2] – [(X1-Mx1)(Y-My)][ (X1-Mx1)(X2-Mx2)]
(X1-Mx1)2 (X2-Mx2)2 -  (X1-Mx1)(X2-Mx2)2
Good thing we have computers to calculate this for us!
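As a sketch of how a computer does this arithmetic, the following Python snippet (assuming NumPy is available) plugs the example data into the normal-equation solution above; for these data it should print roughly b0 ≈ 28.10, b1 ≈ 0.038, and b2 ≈ 0.833.

import numpy as np

# Example data from the table above
Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)
X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)

# Deviations from the means
y  = Y  - Y.mean()
x1 = X1 - X1.mean()
x2 = X2 - X2.mean()

# Sums of squares and cross-products that appear in the normal equations
S11 = np.sum(x1 * x1)   # Σ(X1-Mx1)²
S22 = np.sum(x2 * x2)   # Σ(X2-Mx2)²
S12 = np.sum(x1 * x2)   # Σ(X1-Mx1)(X2-Mx2)
S1y = np.sum(x1 * y)    # Σ(X1-Mx1)(Y-My)
S2y = np.sum(x2 * y)    # Σ(X2-Mx2)(Y-My)

denom = S11 * S22 - S12**2
b1 = (S1y * S22 - S2y * S12) / denom
b2 = (S2y * S11 - S1y * S12) / denom
b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()

print(b0, b1, b2)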
Hypothesis Testing for β
We can calculate a confidence interval:
β = b ± tα/2(seb)
df = n − k − 1, where k = # of regressors
We can also use a t-test for each independent variable to test the following hypotheses (as one or two tailed tests):
H0: β1 = 0    HA: β1 ≠ 0
H0: β2 = 0    HA: β2 ≠ 0
where t = bi/(sebi)
Dropping Regressors
We may be tempted to throw out variables that are insignificant, but dropping them can bias the remaining coefficients in the model. Such an omission of
important variables is called omitted variable
bias. If you have a strong theoretical reason to
include a variable, then you should keep it in the
model. One way to minimize such bias is to use
randomized assignment of the treatment
variables.
Interpreting the Coefficients
In the bivariate regression model, the slope (b) represents
a change in Y that accompanies a one unit change in X.
In the multivariate regression model, each slope coefficient
(bi) represents the change in Y that accompanies a one
unit change in the regressor (Xi) if all other regressors
remain constant. This is like taking a partial derivative in
calculus, which is why we refer to these as partial slope
coefficients.
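In calculus notation, the analogy for the two-regressor model is ∂Y/∂X1 = β1 and ∂Y/∂X2 = β2, with the other regressor held fixed.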
Partial Correlation
Partial correlation calculates the correlation
between Y and Xi with the other regressors held
constant:
Partial r = [b/(seb)] / √{[b/(seb)]² + (n−k−1)} = t / √{t² + (n−k−1)}
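A minimal sketch of that conversion in Python (the function name is just for illustration):

import math

def partial_r(t, n, k):
    """Partial correlation of Y with one regressor, from its t-ratio."""
    return t / math.sqrt(t**2 + (n - k - 1))

# e.g. partial_r(t=2.5, n=7, k=2) for a regressor in the fertilizer example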
Calculating Adjusted R2
R2 = SSR/SST (regression sum of squares divided by total sum of squares)
Problem: R2 increases as k increases, so some
people advocate the use of the adjusted R2:
R2A = [(n−1)R2 − k] / (n − k − 1)
We subtract k in the numerator as a “penalty” for
increasing the size of k (# of regressors).
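A quick sketch of the adjustment in code (this is algebraically the same as the more common form 1 − (1 − R2)(n − 1)/(n − k − 1)):

def adjusted_r2(r2, n, k):
    """Adjusted R2: penalizes R2 for the number of regressors k."""
    return ((n - 1) * r2 - k) / (n - k - 1)

# e.g. adjusted_r2(0.95, n=7, k=2)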
Stepwise Regression
W&W talk about stepwise regression (pages
499-500). This is an atheoretical procedure that
selects variables on the basis of how they
increase R2. Don’t use this technique because it
is not theoretically driven and R2 is a very
problematic statistic (as you will learn later).
Standard error of the estimate
A better measure of model fit is the standard error of the
estimate:
s = √[ Σ(Y − Yp)² / (n − k − 1) ]
This is just the square root of the SSE divided by its degrees of freedom.
A model with a smaller standard error of the estimate is
better.
See Chris Achen’s Sage monograph on regression for a
good discussion of this measure.
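A sketch of this measure in code (assuming NumPy; the helper name is illustrative, and y_pred would be the fitted values from whatever model is being compared):

import numpy as np

def std_error_of_estimate(y, y_pred, k):
    """Standard error of the estimate: sqrt(SSE / (n - k - 1))."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    sse = np.sum((y - y_pred) ** 2)
    return np.sqrt(sse / (len(y) - k - 1))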
Multicollinearity
An additional assumption we must make in
multiple regression is that none of the
independent variables are perfectly correlated
with each other.
In the simple multivariate model, for example:
Y = b0 + b1X1 + b2X2 + e
r12 ≠ 1
Multicollinearity
With perfect multicollinearity, you cannot estimate the
partial slope coefficients. To see why this is so, rewrite
the estimate for b1 in the model with two independent
variables as:
b1 = [(ry1 − r12·ry2) / (1 − r12²)] · (sy / s1)
where ry1 = correlation between Y and X1, r12 = correlation between X1 and X2, ry2 = correlation between Y and X2, sy = standard deviation of Y, and s1 = standard deviation of X1.
Multicollinearity
We can see that if r12 = 1 or −1, we would be dividing by zero, which is impossible.
Oftentimes, if r12 is high but not equal to one, you will get a good overall model fit (high R2, significant F-statistic) but insignificant t-ratios.
You should always examine the correlations between your independent variables to determine if this might be an issue.
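One quick way to check is the regressors' correlation matrix; a minimal sketch with NumPy on the fertilizer example (where the off-diagonal entry r12 comes out around 0.66):

import numpy as np

X1 = [100, 200, 300, 400, 500, 600, 700]
X2 = [10, 20, 10, 30, 20, 20, 30]

# Pairwise correlation matrix of the regressors; the off-diagonal
# element is r12, which should stay well below 1 in absolute value.
print(np.corrcoef(X1, X2))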
Multicollinearity
Multicollinearity does not bias our estimates, but it inflates the variance and thus the standard error of the parameters (that is, it increases inefficiency).
This is why we get insignificant t-ratios: because t = b/(seb), inflating seb depresses the t-ratio, making it less likely that we will reject the null hypothesis.
Standard error for bi
We can calculate the standard error for b1, for
example, as:
se1 = s / √[ Σ(X1−Mx1)² (1 − R1²) ]
where R1 = the multiple correlation of X1 with all the other regressors.
As R1 increases, our standard error increases.
Note that for bivariate regression, the term (1 − R1²) drops out.
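A sketch of this formula in code (the helper is just for illustration; s is the standard error of the estimate from above and r1_squared is the squared multiple correlation of X1 with the other regressors, both assumed to be computed already):

import math

def se_slope(s, x, r1_squared):
    """Standard error of a slope: s / sqrt(Σ(x − mean(x))² · (1 − R1²))."""
    mean_x = sum(x) / len(x)
    ssx = sum((xi - mean_x) ** 2 for xi in x)
    return s / math.sqrt(ssx * (1.0 - r1_squared))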