Topic 3. Multiple Regression Analysis:
Estimation
Motivation for multiple regression analysis
1) Assumption SLR.4 is often violated in a two-variable regression model, and therefore the OLS estimators are biased.
2) A multiple linear regression model allows for a
more general functional form, e.g. quadratic
function
3) More variation in y can be explained ( R 2 is
higher). Thus, we get better model for predicting
the dependent variable.
Consider the model
log(price) = β0 + β1·log(dist) + u,
where price is the housing price and dist is the distance to a recently built garbage incinerator. We suspect that the estimate of β1 is biased, since some of the omitted variables that affect price are likely to correlate with dist.
For example, we may expect that one such factor is the age of the house, age. So we may include it in the model:
log(price) = β0 + β1·log(dist) + β2·age + u.
In this case, we still need the assumption that the error term is uncorrelated with dist and age:
E(u | dist, age) = 0.
Is this assumption likely to hold? Do we need to include additional independent variables in the model?
In some cases, we expect that the effect of x on y is not constant and depends on the level of x. In this case we may include a quadratic term in the regression model.
For example, we may hypothesize that advertising
has a diminishing marginal effect on the revenue
of a firm. We may specify that as follows:
(1.1) revenue = β0 + β1·adv + β2·adv² + u,
where revenue is annual revenue of a firm in thousands of dollars and adv is annual advertising expenditure in thousands of dollars.
Generally, the model with two independent variables can be written as follows:
(1.2) y = β0 + β1·x1 + β2·x2 + u,
where
β0 is the intercept,
β1 measures the change in y with respect to x1, holding other factors fixed,
β2 measures the change in y with respect to x2, holding other factors fixed.
Note that when the model includes quadratic terms, the interpretation of the parameters is different.
QUESTION 2: Consider model (1.1). How would annual revenue change if annual advertising expenditure increases by 1 thousand dollars, holding other factors fixed? (see quadratic functions in Lecture 1)
OLS estimates and interpretation
The multiple linear regression model in the population can be written as
(1.3) y = β0 + β1·x1 + β2·x2 + ... + βk·xk + u,
where β0 is the intercept and β1, β2, ..., βk are slope parameters.
Note that since there are k slope parameters and the intercept, the multiple regression model contains (k + 1) parameters. This is important to remember in order to calculate degrees of freedom, which we will need in hypothesis testing.
The sample regression function (SRF), which is also called the OLS regression line, is
(1.4) ŷ = β̂0 + β̂1·x1 + β̂2·x2 + ... + β̂k·xk.
For each observation i, we can get the fitted value of y, denoted ŷi:
(1.5) ŷi = β̂0 + β̂1·xi1 + β̂2·xi2 + ... + β̂k·xik.
Note: independent variables have two subscripts: i is the observation number, i = 1, 2, ..., n. The second subscript is the way we distinguish between different independent variables. There are k independent variables.
The residual, i.e. the difference between the observed value yi in the sample and the predicted value ŷi, is
(1.6) ûi = yi − ŷi = yi − β̂0 − β̂1·xi1 − ... − β̂k·xik.
The method of ordinary least squares (OLS) chooses β̂0, β̂1, β̂2, ..., β̂k such that the sum of squared residuals
(1.7) Σ_{i=1}^{n} (yi − β̂0 − β̂1·xi1 − β̂2·xi2 − ... − β̂k·xik)²
is as small as possible.
Having presented the general intuition of the OLS method, we will not discuss how β̂0, β̂1, β̂2, ..., β̂k are actually computed. You may use any econometric software to compute the estimates of the parameters in the multiple linear regression model.
In Stata use “regress” command:
regress depvar [indepvars] [if] [in] [weight] [,
options]
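The same estimates can also be computed outside Stata. Below is a minimal sketch in Python, using numpy's least-squares solver on simulated data; all variable names and parameter values here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)  # true parameters: 1, 2, -0.5

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# OLS: chooses the coefficients that minimize the sum of squared residuals (1.7)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With n = 500 the estimates in beta_hat should be close to the true values 1, 2 and −0.5, though not exactly equal to them.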
Interpretation of the sample regression function (SRF)
(1.8) ŷ = β̂0 + β̂1·x1 + β̂2·x2 + ... + β̂k·xk
is the same as in the case of the simple linear regression function. The SRF is an estimated version of the population regression function:
(1.9) E(y | x1, x2, ..., xk) = β0 + β1·x1 + β2·x2 + ... + βk·xk.
ŷ is the predicted value of y, that is, an estimate of the expected value E(y | x1, x2, ..., xk).
β̂0 is the predicted value of y when the variables x1, x2, ..., xk are all zero.
To interpret slope coefficients in the SRF, take the difference of (1.8):
(1.10) Δŷ = β̂1·Δx1 + β̂2·Δx2 + ... + β̂k·Δxk.
Set the changes in x2, ..., xk equal to zero (Δx2 = 0, Δx3 = 0, ..., Δxk = 0) and solve (1.10) with respect to β̂1:
(1.11) β̂1 = Δŷ/Δx1.
Thus, β̂1 is the partial effect of x1 on ŷ, holding x2, ..., xk fixed. Sometimes researchers say that we control for the variables x2, ..., xk when estimating the effect of x1 on ŷ.
Other coefficients have a similar interpretation.
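"Holding the other regressors fixed" has a concrete algebraic counterpart (the Frisch-Waugh result, not covered in these notes): β̂1 also equals the slope from regressing y on the part of x1 not explained by the other regressors. A sketch with simulated data (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)        # x1 correlates with x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Full multiple regression of y on an intercept, x1 and x2
X = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Partial out x2 from x1, then regress y on the residualized x1
Z = np.column_stack([np.ones(n), x2])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
r1 = x1 - Z @ g                            # part of x1 uncorrelated with x2
b1_partial = (r1 @ y) / (r1 @ r1)          # simple-regression slope on r1
```

The two numbers b_full[1] and b1_partial coincide, which is exactly what "partial effect of x1, controlling for x2" means.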
QUESTION 3: Suppose a researcher decided to examine the relationship between education and wages of individuals and obtained the following results (you may use the WAGE1.DTA file to replicate the results):
log(wage)ˆ = 0.583 + 0.083·educ,
where wage is average hourly earnings and educ is years of education. Then the researcher decided to control for years of experience, exper, and years with the current employer, tenure, and obtained the following results:
log(wage)ˆ = 0.284 + 0.092·educ + 0.0041·exper + 0.022·tenure.
a) Interpret the estimates of every coefficient in both models.
b) Why do you think the researcher decided to control for exper and tenure?
c) Is 0.092 necessarily closer to the true parameter than 0.083? Explain.
In the question above we may also want to calculate the effect of a simultaneous increase in experience and tenure by 1 year, with no change in education. In this case we take the difference of the model:
Δlog(wage)ˆ = 0.092·Δeduc + 0.0041·Δexper + 0.022·Δtenure,
then set Δeduc = 0, Δtenure = 1, Δexper = 1 and calculate the proportionate change in hourly earnings:
Δlog(wage)ˆ = 0.0041 + 0.022 = 0.0261.
To get the percentage increase, multiply the estimate by 100:
0.0261 × 100% = 2.61%.
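The arithmetic above can be checked directly; a trivial sketch, using the coefficients from the estimated equation:

```python
# Coefficients from the estimated log(wage) equation above
b_exper = 0.0041
b_tenure = 0.022

# One extra year of experience and of tenure, education unchanged
d_log_wage = b_exper * 1 + b_tenure * 1   # proportionate change in wage
pct_change = d_log_wage * 100             # approximate percentage change
```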
As in the simple linear regression model, we may obtain R² as
(1.12) R² = SSE/SST = 1 − SSR/SST.
R² is the proportion of the sample variation in y that is explained by the independent variables x1, x2, ..., xk. Thus it shows how well the model, or OLS line, fits the data.
R² has a number of limitations:
1) R² never decreases and usually increases when another independent variable is added to the model. Regardless of whether a variable belongs in the true model or not, R² increases if you add the variable. Therefore, R² should not be used as a tool to decide whether to add a variable.
2) As we discussed in Lecture 4, R² does not show whether the estimates are reliable or not. Reliability of the estimates depends on whether the assumptions underlying the analysis hold.
Note that R² can be shown to equal the squared correlation coefficient between the actual yi and the fitted values ŷi.
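This equality between R² and the squared correlation of yi and ŷi is easy to verify numerically; a sketch with simulated data (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x1, x2 = rng.normal(size=(2, n))
y = 0.5 + x1 - 2.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

sst = np.sum((y - y.mean()) ** 2)            # total sum of squares
ssr = np.sum((y - y_hat) ** 2)               # sum of squared residuals
r2 = 1 - ssr / sst                           # (1.12)
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2   # squared corr(y, y_hat)
```

The two quantities r2 and r2_corr agree up to rounding error.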
We have already defined sample variance and sample covariance in Lecture 3. An estimator of the correlation coefficient between two random variables X and Y is
(1.13) R_XY = S_XY/(S_X·S_Y) = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / {[Σ_{i=1}^{n} (Xi − X̄)²]^{1/2} · [Σ_{i=1}^{n} (Yi − Ȳ)²]^{1/2}}.
In Stata you may use command “correlate” to get
sample correlation or sample covariance of the
variables.
Unbiasedness of the OLS Estimators
Assumption MLR.1 (Linear in Parameters): The model in the population can be written as
(1.14) y = β0 + β1·x1 + β2·x2 + ... + βk·xk + u,
where β0, β1, ..., βk are the unknown parameters (constants) of interest and u is an unobservable random error, or disturbance, term.
The key idea of the assumption is that the population model (also called the true model) is linear in the parameters β0, β1, ..., βk.
Assumption MLR.2 (Random Sampling): We have a random sample of n observations, {(xi1, xi2, ..., xik, yi) : i = 1, 2, ..., n}, following the population model in Assumption MLR.1.
Assumption MLR.3 (No Perfect Collinearity): In
the sample (and therefore in the population), none
of the independent variables is constant, and there
are no exact linear relationships among the
independent variables.
Note that this is the only one of assumptions MLR.1-MLR.5 that involves not only population considerations but also sample considerations. All other assumptions deal with the population.
If an independent variable is an exact linear combination of the other independent variables, for example
x1 = a + b·x2,
then we say that the model suffers from perfect collinearity, and it cannot be estimated by OLS. In this case, the correlation between x1 and x2 will be perfect: |corr(x1, x2)| = 1.
There are a number of reasons why this might be the case:
a) the two variables in the model sum to a constant by definition. For example, if we control for the percentage of female employees, f, in a company and the percentage of male employees, m, in the company: f = 100 − m.
b) one of the variables is a constant. A linear regression with an intercept can be thought of as including a regressor with all observations equal to 1. Therefore, if one of the variables is constant, it will be perfectly collinear with this "intercept" regressor.
c) including the quadratic term while taking the log of it. For example, consider model (1.1). If you would like to specify it as a log-log model and take the log of each variable, then you get:
log(revenue) = β0 + β1·log(adv) + β2·log(adv²) + u.
The independent variables are perfectly collinear since
log(adv²) = 2·log(adv).
You need to specify the model as
log(revenue) = β0 + β1·log(adv) + β2·[log(adv)]² + u.
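Perfect collinearity shows up directly in the rank of the design matrix: a perfectly collinear column reduces the rank below the number of parameters. A sketch with simulated adv values (invented numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
adv = rng.uniform(1, 10, size=n)

# log(adv^2) = 2*log(adv): the last two columns are perfectly collinear
X_bad = np.column_stack([np.ones(n), np.log(adv), np.log(adv ** 2)])
rank_bad = np.linalg.matrix_rank(X_bad)    # rank 2, not 3: OLS cannot be computed

# [log(adv)]^2 is a genuinely different regressor
X_ok = np.column_stack([np.ones(n), np.log(adv), np.log(adv) ** 2])
rank_ok = np.linalg.matrix_rank(X_ok)      # full rank 3
```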
Mathematically, perfect collinearity is a problem since it results in division by zero in the OLS formulas.
Intuitively, it is a problem since it poses an illogical question. For example, consider case a) above. The coefficient estimate of f would show the change in the dependent variable when the percentage of female employees increases by one percentage point, holding the percentage of male employees fixed. However, this is impossible.
The solution to the perfect collinearity problem is
to drop a regressor which is perfectly collinear
with another regressor. We will discuss a bigger
problem called multicollinearity later.
Another requirement is that there be at least as many observations in the sample as parameters in the model: n ≥ (k + 1). Otherwise it is impossible to estimate the parameters by OLS.
Assumption MLR.4 (Zero Conditional Mean): The
error u has an expected value of zero given any
values of the independent variables. In other
words,
(1.15) E(u | x1, x2, ..., xk) = 0.
There are a number of reasons why the assumption
may fail:
1) Functional form misspecification. For example,
the true model (population) is the log-log model
while we specify the level-level model. We discuss
this problem later.
2) Omitted variable problem: a variable that affects y and is correlated with one of the independent variables is omitted from the regression. The reason for omitting the variable might be data limitations or the fact that the variable is not measurable.
3) Measurement error problem. This is discussed in more detail at the end of the course. The idea is that if there is an error in the measurement of a variable, the error should not correlate with the disturbance term.
4) Simultaneity. One of the independent variables
is a function of the dependent variable y . For
example, wages can be considered as a function of
prices and prices are a function of wages. In this
case, we estimate simultaneous equations models.
The models are not often used in practice and we
do not cover them in the course.
If an independent variable does not correlate with the error term, it is called an exogenous variable. If an independent variable correlates with the error term, it is called an endogenous variable.
Theorem (Unbiasedness of OLS): Under Assumptions MLR.1-MLR.4,
(1.16) E(β̂j) = βj, j = 0, 1, ..., k,
for any values of the population parameter βj. In other words, the OLS estimators are unbiased estimators of the population parameters.
Remember that unbiasedness does not mean that the OLS estimate equals the true parameter. It only says that if we got OLS estimates from all possible samples, the expected value of the estimates would equal the true parameter. In practice, when we deal with one sample, we get only one estimate of the parameter, and it can be far from the true parameter.
Note that correlation between even a single independent variable and the error term makes the estimators of all parameters biased.
The endogeneity problem is one of the main concerns in practical applications. If we suspect that some of the variables are endogenous, then the OLS estimators are biased. One of the conventional ways to deal with endogeneity is the Instrumental Variable (IV) estimator. However, this estimator also suffers from a number of problems, especially the weak instruments problem. At the end of the course, we will discuss the IV estimator, as well as the test for endogeneity. In most applications, researchers estimate the model with the OLS estimator and then check the robustness of the results by estimating with the IV estimator.
The Variances of the OLS Estimators
The variance of an OLS estimator shows how the estimates from all possible samples of the population are spread around the mean.
If the estimators are unbiased, the mean is the value of the true parameter.
We are interested in making the variances as small as possible.
Assumption MLR.5 (Homoscedasticity): The error u has the same variance given any values of the explanatory variables. In other words,
var(u | x1, x2, ..., xk) = σ².
If the assumption fails, the model exhibits heteroscedasticity.
Note that
var(y | x1, x2, ..., xk) = var(u | x1, x2, ..., xk) = σ².
Assumptions MLR.1-MLR.5 are known as the Gauss-Markov assumptions.
Theorem (Sampling Variances of the OLS Slope Estimators): Under Assumptions MLR.1-MLR.5, conditional on the sample values of the independent variables,
(1.17) var(β̂j) = σ²/[SSTj·(1 − R²j)]
for j = 1, 2, ..., k, where SSTj = Σ_{i=1}^{n} (xij − x̄j)² is the total sample variation in xj, and R²j is the R² from regressing xj on all other independent variables (and including an intercept).
For hypothesis testing we need the standard deviation of the OLS estimators, sd(β̂j), which is the square root of the variance:
(1.18) sd(β̂j) = σ/[SSTj·(1 − R²j)]^{1/2}.
The only component of (1.17) that we do not know is the error variance σ². It can be estimated by
(1.19) σ̂² = Σ_{i=1}^{n} ûi²/(n − k − 1) = SSR/df.
Since the standard deviation of the error term, σ, is not known, we use its estimator
(1.20) σ̂ = (σ̂²)^{1/2},
which is called the standard error of the regression.
Now we can get estimators of the standard deviations of β̂j,
(1.21) se(β̂j) = σ̂/[SSTj·(1 − R²j)]^{1/2},
which are called standard errors.
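Formula (1.21) agrees with the matrix formula σ̂²·(X′X)⁻¹ that software uses internally; a numerical check with simulated data (invented numbers):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 2
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + x1 + x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2_hat = resid @ resid / (n - k - 1)      # (1.19): SSR / df

# se(beta_1) via formula (1.21): need SST_1 and R^2_1
sst1 = np.sum((x1 - x1.mean()) ** 2)
Z = np.column_stack([np.ones(n), x2])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
r2_1 = 1 - np.sum((x1 - Z @ g) ** 2) / sst1   # R^2 from regressing x1 on x2
se_formula = np.sqrt(sigma2_hat / (sst1 * (1 - r2_1)))

# Matrix version: square root of the x1 diagonal entry of sigma2*(X'X)^{-1}
se_matrix = np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])
```

The two standard errors coincide up to rounding error, since the formulas are algebraically identical.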
An important thing to remember for hypothesis testing is that the standard errors given in formula (1.21) are not valid if the assumption of homoscedasticity is violated. Stata, as well as other econometric software, computes standard errors using formula (1.21) by default. Therefore, if we suspect heteroscedasticity, we need to correct for that. We will discuss this issue later.
Let's discuss the components of the variance of the OLS estimator given in formula (1.17).
1. The error variance, σ². The larger the error variance, the larger are the variances of the OLS estimators. The more variables we omit from the regression, the larger is the error variance, and the larger are the variances of the OLS estimators. By adding more explanatory variables to the model, we may reduce the error variance.
2. The total sample variation in an explanatory variable, SSTj. The larger is SSTj, the smaller are the variances of the OLS estimators. Typically, the larger the sample size, the larger is SSTj and therefore the more precise are the OLS estimators.
3. Linear relationships among the independent variables, R²j. This is the R² from regressing xj on all other independent variables, including an intercept. In general, R²j shows the proportion of the total variation in xj that can be explained by the other independent variables in the model.
R²j = 0 is the best case, since in this situation var(β̂j) is at its minimum. This can happen only if xj has zero sample correlation with the other independent variables.
The extreme case R²j = 1 is not allowed. It implies division by zero in (1.17) and violates assumption MLR.3.
As R²j increases to 1, var(β̂j) gets larger and larger. Thus a higher degree of linear relationship between the independent variables leads to a higher variance of the OLS estimator of a parameter.
High correlation between independent variables, when R²j is close to 1, is called multicollinearity. This is not a violation of assumption MLR.3. However, multicollinearity may be a problem since it may lead to a high variance of the OLS estimator. This makes the estimator less precise, as we discuss later in the course.
Multicollinearity can be mitigated by a lower σ² or a larger SSTj. Therefore, in general, there is no benchmark value of R²j that would indicate multicollinearity. However, in some cases researchers become concerned about multicollinearity if R²j > 0.9.
We may also discuss multicollinearity in terms of the variance inflation factor:
(1.22) VIFj = 1/(1 − R²j).
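The VIF in (1.22) is straightforward to compute by hand; a sketch with two strongly correlated simulated regressors (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)   # x2 strongly correlated with x1

# R^2_1: from regressing x1 on the other regressors (here just x2, with intercept)
Z = np.column_stack([np.ones(n), x2])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
resid = x1 - Z @ g
r2_1 = 1 - resid @ resid / np.sum((x1 - x1.mean()) ** 2)
vif_1 = 1 / (1 - r2_1)                     # (1.22)
```

With this degree of correlation, r2_1 is well above 0.9 and the VIF is large, flagging a potential multicollinearity concern.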
The general conclusion is that we prefer less correlation between the independent variables. Also, note that correlation among a subset of independent variables does not affect the variances of the estimators of the coefficients of the other independent variables. In contrast, correlation between even a single independent variable and the error term makes the estimators of all parameters biased.
Solutions to multicollinearity.
1) Drop one of the independent variables that has a strong correlation with the variable of interest. However, in this case, although we reduce the variance of the estimator, we may introduce omitted variable bias.
2) Collect more data to increase the sample size and thus increase SSTj. This entails economic costs.
3) In some cases, the variables that are strongly correlated can be modified. For example, a number of variables representing subcategories (e.g. subcategories of expenditure) can be aggregated into one variable that represents a category (e.g. total expenditure). In other cases, we may modify a variable to net out the part that correlates with another variable. For example, if firm output and the exchange rate correlate, we can subtract exports from total firm output and thus reduce its correlation with the exchange rate.
Overspecifying the Model and Underspecifying the Model
Overspecifying the model: we include an irrelevant variable in the model. An irrelevant variable is one that does not belong in the population model; in other words, it has no partial effect on y in the population.
Overspecification of the model does not lead to biased OLS estimators. However, it leads to higher variances of the OLS estimators if the irrelevant variable included in the model is correlated with the other independent variables. This is the cost of overspecification.
Underspecifying the model: we do not include a variable that actually belongs in the population model (it has a nonzero partial effect on y).
If the omitted variable correlates with the included independent variable(s), then this leads to a violation of assumption MLR.4 and therefore to omitted variable bias. Note that in this case the OLS estimators of all parameters are biased.
Gauss-Markov Theorem
Gauss-Markov Theorem: Under Assumptions MLR.1-MLR.5, β̂0, β̂1, ..., β̂k are the best linear unbiased estimators (BLUEs) of β0, β1, ..., βk, respectively.
"Best" means that the OLS estimators have the smallest variances among all linear unbiased estimators. This theorem justifies the application of OLS estimators. It also shows why assumptions MLR.1-MLR.5 are important: if these assumptions do not hold, then the OLS estimators are not BLUE.
We need assumptions MLR.1-MLR.4 for OLS to be unbiased, and we need the additional assumption MLR.5 for OLS to have the smallest variance. If the MLR.5 homoscedasticity assumption is violated, then OLS does not have the smallest variance and we would use another estimator.
QUESTION 4 (Problem 3.13 in the textbook): The following equation represents the effects of tax revenue mix on subsequent employment growth for the population of counties in the United States:
growth = β0 + β1·shareP + β2·shareI + β3·shareS + u,
where growth is the percentage change in employment from 1980 to 1990, shareP is the share of property taxes in total tax revenue, shareI is the share of income tax revenues, and shareS is the share of sales tax revenues. All of these variables are measured in 1980. The omitted share, shareF, includes fees and miscellaneous taxes. By definition, the four shares add up to one.
1) Why must we omit one of the tax share variables (shareF was omitted) from the equation?
2) Give a careful interpretation of β1.