30C00200
Econometrics
2) Linear regression model
and the OLS estimator
Timo Kuosmanen
Professor, Ph.D.
http://nomepre.net/index.php/timokuosmanen
Today's topics
• Multiple linear regression – what changes?
  – Model and its interpretation
  – Deriving the OLS estimator
• Standard error
• Multicollinearity
• Goodness of fit: R2 and Adj. R2
• Fitting nonlinear functions using linear regression
Hedonic model of housing market
Dependent variable (y)
• Market price of apartment (€)
Explanatory variables (x) and expected signs
• Size (m2) – positive
• Number of bedrooms (#) – negative
• Age (years) – negative
Model
Regression equation with K parameters:
yi = β1 + β2x2i + β3x3i + … + βKxKi + εi
• β1 is the intercept term (constant)
• βk is the slope coefficient of variable xk (marginal effect)
• εi is the disturbance term for observation i
Interpretation
Interpretation of the slope β2 for the variable "size" (m2)
Single regression model
• Value of an additional m2, on average, in apartments located in Tapiola
Multiple regression model
• Value of an additional m2, on average, in apartments with the same number of bedrooms and age in Tapiola
Interpretation
Why is the intercept denoted β1?
• Consider a model without a constant term:
yi = β1x1i + β2x2i + β3x3i + … + βKxKi + εi
• where x1 = (1, 1, 1, …, 1)
• This model is identical to
yi = β1 + β2x2i + β3x3i + … + βKxKi + εi
• We can think of the intercept term as the slope coefficient of a regressor that is constant (1) for all observations.
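This equivalence is easy to check numerically. The sketch below uses simulated data (the coefficients, sample size, and noise level are made up for illustration, not taken from the Tapiola dataset): the intercept is recovered as the coefficient of an explicit column of ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x2 = rng.uniform(30, 120, n)                      # hypothetical apartment sizes (m2)
y = -30000 + 5500 * x2 + rng.normal(0, 1000, n)   # simulated prices

# Treat the intercept as the slope of a regressor that equals 1 for every i:
X = np.column_stack([np.ones(n), x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # b[0] is the intercept, b[1] the slope of size
```

The fitted b[0] and b[1] are close to the simulated −30000 and 5500, since the only difference from the textbook model is that the constant regressor is written out explicitly.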
Excel output of the hedonic model
- Single regression with m2

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.808717
R Square           0.654023
Adjusted R Square  0.648701
Standard Error     110242
Observations       67

ANOVA
            df   SS        MS        F        Significance F
Regression   1   1.49E+12  1.49E+12  122.874  1.26E-16
Residual    65   7.9E+11   1.22E+10
Total       66   2.28E+12

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  -31430.8      33447.06        -0.93972  0.350841  -98229.2   35367.56
size m2    5460.683      492.6256        11.08486  1.26E-16  4476.842   6444.524
Excel output of the hedonic model
- Multiple regression

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.905971
R Square           0.820784
Adjusted R Square  0.81225
Standard Error     80593.26
Observations       67

ANOVA
            df   SS        MS        F         Significance F
Regression   3   1.87E+12  6.25E+11  96.17703  1.76E-23
Residual    63   4.09E+11  6.5E+09
Total       66   2.28E+12

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     141366.7      36401.58        3.883532  0.000249  68623.96   214109.5
size m2       6972.404      857.5745        8.130378  2.11E-11  5258.678   8686.13
nr. bedrooms  -65182.6      20401.6         -3.19498  0.002186  -105952    -24413.3
age           -2820.1       495.0578        -5.6965   3.46E-07  -3809.39   -1830.8
OLS estimator
OLS problem with K parameters:

\min \; RSS = \sum_{i=1}^{n} e_i^2

s.t. \quad y_i = b_1 + b_2 x_{2,i} + b_3 x_{3,i} + \ldots + b_K x_{K,i} + e_i

Or equivalently

\min \; RSS = \sum_{i=1}^{n} (y_i - b_1 - b_2 x_{2,i} - b_3 x_{3,i} - \ldots - b_K x_{K,i})^2
OLS estimator
First-order conditions:

\frac{\partial RSS}{\partial b_1} = -2 \sum_{i=1}^{n} (y_i - b_1 - b_2 x_{2,i} - b_3 x_{3,i} - \ldots - b_K x_{K,i}) = 0

\frac{\partial RSS}{\partial b_2} = -2 \sum_{i=1}^{n} x_{2,i} (y_i - b_1 - b_2 x_{2,i} - b_3 x_{3,i} - \ldots - b_K x_{K,i}) = 0

\vdots

\frac{\partial RSS}{\partial b_K} = -2 \sum_{i=1}^{n} x_{K,i} (y_i - b_1 - b_2 x_{2,i} - b_3 x_{3,i} - \ldots - b_K x_{K,i}) = 0

System of K linear equations with K unknowns
- best solved by using matrix algebra
OLS estimator
Closed form solution in matrix form:

b = (X'X)^{-1} X'y

Note: to study econometrics at a more advanced level, you have to master matrix algebra.
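The matrix formula is short enough to try directly. A minimal sketch with simulated data (all coefficient values are made up for illustration); `np.linalg.solve` is used on the normal equations X'X b = X'y rather than forming the inverse explicitly, which is numerically more stable:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 3
# Design matrix: a column of ones plus K-1 simulated regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + rng.normal(0, 0.1, n)

# Closed-form OLS: b = (X'X)^{-1} X'y, computed via the normal equations
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)   # close to the simulated beta
```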
OLS estimator
Closed form solutions:
The intercept term

b_1 = \bar{y} - b_2 \bar{x}_2 - b_3 \bar{x}_3 - \ldots - b_K \bar{x}_K
OLS estimator
Closed form solutions:
Special case with two regressors x2 and x3
Slope b2:

b_2 = \frac{Est.Cov(x_2, y) \cdot Est.Var(x_3) - Est.Cov(x_3, y) \cdot Est.Cov(x_2, x_3)}{Est.Var(x_2) \cdot Est.Var(x_3) - \left[ Est.Cov(x_2, x_3) \right]^2}
OLS estimator
Important: explanatory variable x3 influences the slope of regressor x2 through the sample covariances.

Note: if the regressor x2 does not correlate with the other regressor x3, that is, the sample covariance is

Est.Cov(x_2, x_3) = 0

then the slope b2 estimated from the multiple regression model is exactly the same as that of the single regression of y on x2, leaving the effects of x3 to the disturbance term:

b_2 = \frac{Est.Cov(x_2, y)}{Est.Var(x_2)}
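The two-regressor closed form can be verified against the matrix solution on simulated data (the data-generating values below are made up for illustration). The regressors are deliberately correlated, so the covariance terms matter:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x2 = rng.normal(size=n)
x3 = 0.6 * x2 + rng.normal(size=n)       # correlated with x2 on purpose
y = 1.0 + 2.0 * x2 - 1.5 * x3 + rng.normal(0, 0.5, n)

# Closed-form slope of x2 using sample covariances and variances
c2y, c3y = np.cov(x2, y)[0, 1], np.cov(x3, y)[0, 1]
v2, v3 = np.var(x2, ddof=1), np.var(x3, ddof=1)
c23 = np.cov(x2, x3)[0, 1]
b2 = (c2y * v3 - c3y * c23) / (v2 * v3 - c23 ** 2)

# Cross-check against the matrix solution b = (X'X)^{-1} X'y
X = np.column_stack([np.ones(n), x2, x3])
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b2, b[1])   # the two numbers agree
```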
OLS estimator as a random variable
OLS estimates are calculated from observed data using the formula:

b = (X'X)^{-1} X'y

Important: the OLS estimator b is itself a random variable. (Why?)
That is, the elements of vector b have a probability distribution with expected value E(b) and variance Var(b).
The estimated standard deviation of b is called the standard error.
Excel output of the hedonic model
- Multiple regression
(Same multiple-regression output as shown earlier; note the Standard Error column.)
Standard error
Unbiased estimator of Var(\varepsilon) = \sigma^2 is

s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - K}

Standard error of parameter estimate b_k is

s.e.(b_k) = \sqrt{s_e^2 \left[ (X'X)^{-1} \right]_{kk}}

The general expression requires matrix algebra.
Standard error
Special case of two regressors (x2, x3):

s.e.(b_2) = \sqrt{\frac{s_e^2}{(n-1) \cdot Est.Var(x_2) \cdot (1 - r_{x_2,x_3}^2)}}

where r_{x2,x3} is the sample correlation coefficient.

Decomposed as a product of 4 components:

s.e.(b_2) = s_e \cdot \sqrt{\frac{1}{n-1}} \cdot \sqrt{\frac{1}{Est.Var(x_2)}} \cdot \sqrt{\frac{1}{1 - r_{x_2,x_3}^2}}
Standard error

s.e.(b_2) = s_e \cdot \sqrt{\frac{1}{n-1}} \cdot \sqrt{\frac{1}{Est.Var(x_2)}} \cdot \sqrt{\frac{1}{1 - r_{x_2,x_3}^2}}

Four potential ways to improve precision:
1) Decrease s_e (improve empirical fit; add explanatory variables?)
2) Increase sample size n
3) Increase sample variance of x2
4) Decrease correlation of x2 and x3
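The four-component decomposition can be checked against the general matrix formula. The sketch below uses simulated data with strongly correlated regressors (all numbers made up for illustration), so the 1/(1 − r²) factor visibly inflates the standard error:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x2 = rng.normal(size=n)
x3 = 0.9 * x2 + 0.3 * rng.normal(size=n)   # strongly correlated regressors
y = 1 + 2 * x2 + x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s_e = np.sqrt(e @ e / (n - 3))             # K = 3 parameters
r23 = np.corrcoef(x2, x3)[0, 1]

# Product of the four components
se_b2 = (s_e * np.sqrt(1 / (n - 1)) * np.sqrt(1 / np.var(x2, ddof=1))
             * np.sqrt(1 / (1 - r23 ** 2)))
# Cross-check: general formula s_e^2 * [(X'X)^{-1}]_kk
se_matrix = np.sqrt(s_e ** 2 * np.linalg.inv(X.T @ X)[1, 1])
print(se_b2, se_matrix)                    # the two routes agree
```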
Multicollinearity
• High correlation between explanatory variables x can cause loss of precision
• In practice, there is always some correlation
  – Example: the sample correlation between variables "size (m2)" and "nr. of bedrooms" is 0.887
• Multiple regression analysis takes the correlation explicitly into account – correlation as such is not a problem
  – Whether it is depends on circumstances, particularly the sample size n, the variance of disturbance ε, and the sample variances of regressors x
• When high correlation is considered a problem, it is referred to as multicollinearity
Multicollinearity
Typical symptoms of multicollinearity:
• Model has high R2 and is jointly significant in the F-test
• Slope coefficients are large (small) but still insignificant
• Slope coefficients have high standard errors
• Slope coefficients have unexpected signs
• Slope coefficients are much smaller or larger than expected

Note: If two variables are perfectly correlated, the OLS estimator is ill-defined and cannot be computed. This is the extreme type of multicollinearity.
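The extreme case can be demonstrated directly: with a perfectly collinear regressor, X'X has rank K − 1 and the normal equations have no unique solution. A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x2 = rng.normal(size=n)
x3 = 2 * x2                                # x3 is an exact linear function of x2
y = 1 + x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])
rank = np.linalg.matrix_rank(X.T @ X)      # 2 < 3: X'X is singular

singular = False
try:
    np.linalg.solve(X.T @ X, X.T @ y)      # no unique b exists
except np.linalg.LinAlgError:
    singular = True
print(rank, singular)
```

With high but imperfect correlation the system remains solvable; the problem then shows up as inflated standard errors rather than an outright failure.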
Multicollinearity
Indirect methods to influence the other factors:
1) Improve the empirical fit to decrease s_e
   – Include additional explanatory variables
2) Increase sample size n
3) Increase sample variance of the problem variables

Direct treatments:
• Exclude one of the problem variables
  – problem: omitted variable bias
• Impose a theoretical restriction
Goodness of fit
ANOVA and the coefficient of determination R2:

TSS = ESS + RSS

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2

R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}
Excel output of the hedonic model
- Multiple regression

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.905971
R Square           0.820784
Adjusted R Square  0.81225
Standard Error     80593.26
Observations       67

ANOVA
            df   SS                 MS        F         Significance F
Regression   3   1 874 088 309 956  6.25E+11  96.17703  1.76E-23
Residual    63     409 202 213 242  6.5E+09
Total       66   2 283 290 523 198

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     141366.7      36401.58        3.883532  0.000249  68623.96   214109.5
size m2       6972.404      857.5745        8.130378  2.11E-11  5258.678   8686.13
nr. bedrooms  -65182.6      20401.6         -3.19498  0.002186  -105952    -24413.3
age           -2820.1       495.0578        -5.6965   3.46E-07  -3809.39   -1830.8
Goodness of fit

R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}

Question: What happens to R2 if we include an additional explanatory variable xK+1 in the model?
Answer:
• TSS remains the same
• RSS tends to decrease / ESS to increase. Thus, R2 is likely to increase
• ESS and R2 can never decrease
  – In the worst case, bK+1 = 0, and there is no effect.
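This can be illustrated by adding a regressor that is pure noise, unrelated to y by construction: R2 still does not fall. A minimal sketch with simulated data:

```python
import numpy as np

def r_squared(X, y):
    """R2 = 1 - RSS/TSS for an OLS fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return 1 - (e @ e) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(7)
n = 100
x2 = rng.normal(size=n)
noise = rng.normal(size=n)                 # irrelevant regressor
y = 1 + 2 * x2 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x2])
X_big = np.column_stack([X_small, noise])  # same model plus the noise column
r2_small, r2_big = r_squared(X_small, y), r_squared(X_big, y)
print(r2_small, r2_big)                    # r2_big >= r2_small
```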
Adjusted R2

Adj.R^2 = \frac{n-1}{n-K} R^2 - \frac{K-1}{n-K}

• Simple (but rather arbitrary) degrees of freedom correction to the usual R2
• Adding a new variable xK+1 will increase Adj. R2 if and only if |tK+1| ≥ 1
• Dougherty concludes: "There is little to be gained by fine-tuning [R2] with a 'correction' of dubious value."
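The correction can be checked against the Excel output above: the multiple regression has R2 = 0.820784 with n = 67 observations and K = 4 parameters, and the formula reproduces the reported Adjusted R Square:

```python
def adj_r2(R2, n, K):
    # Adj.R2 = (n-1)/(n-K) * R2 - (K-1)/(n-K),
    # equivalently 1 - (n-1)/(n-K) * (1 - R2)
    return (n - 1) / (n - K) * R2 - (K - 1) / (n - K)

# Figures from the hedonic multiple-regression output
print(adj_r2(0.820784, 67, 4))   # ≈ 0.81225, matching the Excel output
```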
Linearity of the model
Regression equation:
yi = β1 + β2x2i + β3x3i + … + βKxKi + εi
• β1 is the intercept term (constant)
• βk is the slope coefficient of variable xk (marginal effect)
• εi is the disturbance term for observation i
Important: the model must be linear in the parameters βk and the disturbance εi.
It does not need to be linear in the variables x.
Examples
• Polynomial functional form:
yi = β1 + β2x2i + β3x2i^2 + … + βKx2i^(K-1) + εi
• Log-linear (Cobb-Douglas, 1928) functional form:
ln yi = β1 + β2 ln x2i + β3 ln x3i + … + βK ln xKi + εi
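The log-linear case can be sketched with simulated data (the parameter values 0.5 and 1.5 are made up for illustration): the model is nonlinear in x, but after taking logs it is linear in the parameters and OLS applies directly.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x = rng.uniform(1, 10, n)

# Cobb-Douglas data: y = exp(b1) * x^b2 * exp(eps), with b1=0.5, b2=1.5
y = np.exp(0.5) * x ** 1.5 * np.exp(rng.normal(0, 0.1, n))

# Linear in parameters after logs: ln y = b1 + b2 * ln x + eps
X = np.column_stack([np.ones(n), np.log(x)])
b = np.linalg.solve(X.T @ X, X.T @ np.log(y))
print(b)   # close to [0.5, 1.5]
```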
Exercise
Exam question (Fall 2014):
We are interested in estimating parameters β1, β2, and β3. Which of the following 10 models can be stated as a multiple linear regression (MLR) model, possibly after some variable transformations or other mathematical operations, such that parameters β1, β2, and β3 can be estimated by OLS? If some adjustments are required, briefly state the required operations and the resulting MLR equation that can be estimated by OLS. If the parameters cannot be estimated by OLS, briefly point out the problem.

a) y_i = \beta_1 + \beta_2 x_i + \beta_3 x_i^2 + \varepsilon_i
b) y_i = \beta_1 x_{1i}^{\beta_2} x_{2i}^{\beta_3} + \varepsilon_i
c) y_i = \beta_1 x_{1i}^{\beta_2} x_{2i}^{\beta_3} \exp(\varepsilon_i)
d) y_i = \beta_1 + \beta_2 \ln x_i + \beta_3 x_i^{0.4} + \varepsilon_i
e) y_i = \beta_1 + \dfrac{\varepsilon_i}{\beta_2 x_{1i} + \beta_3 x_{2i}}
f) y_i = \varepsilon_i + \beta_1 + \beta_2 x_{1i} + \beta_3 x_{2i}
g) y_i = \ln \beta_1 + \beta_2 x_{1i} + \beta_3 x_{2i} + \varepsilon_i
h) y_i - \beta_1 x_{1i}^{\beta_2} x_{2i}^{\beta_3} \exp(\varepsilon_i) = 0
i) y_i = \beta_1 + \dfrac{\beta_2}{\beta_2 x_{1i} + \beta_3 x_{2i}} + \varepsilon_i
j) \varepsilon_i = \beta_1 + \beta_2 \ln x_{1i} + \beta_3 x_{2i}^3 - y_i

Notation: y and x denote the dependent and the independent variables, coefficients β are the parameters to be estimated, and ε are the random disturbance terms, respectively.
Next time – Mon 14 Sept
Topic:
• Statistical properties of the OLS estimator