Basic Econometrics Tools Correlation and Regression Analysis

Christopher Grigoriou
Executive MBA – HEC Lausanne
2007/2008
A collector of antique grandfather clocks wants to know if the price received for the clocks
increases linearly with the age of the clocks.
The following model is considered: yi = a0 + a1*x1i + εi,
where yi = auction price of clock i, x1i = age of clock i (years), and εi = error term.
A sample of 32 auction prices of grandfather clocks, along with their age, is given in the next table.
Table 1 - Auction price and age

 i   Auction price, y   Age, x1 (years)    i   Auction price, y   Age, x1 (years)    i   Auction price, y   Age, x1 (years)
 1        1235               127          13        1147               137          25         729               108
 2        1080               115          14        1092               153          26        1792               179
 3         845               127          15        1152               117          27        1175               111
 4        1522               150          16        1336               126          28        1593               187
 5        1047               156          17        2131               170          29         785               111
 6        1979               182          18        1550               182          30         744               115
 7        1822               156          19        1884               162          31        1356               194
 8        1253               132          20        2041               184          32        1262               168
 9        1297               137          21         845               143
10         946               113          22        1483               159
11        1713               137          23        1055               108
12        1024               117          24        1545               175
Correlation analysis (see Appendix 1)
Figure 1 - Scatter plot of auction price (y, vertical axis) against age (x1, horizontal axis)
From this figure => positive correlation between Auction price and Age?
=> Correlation coefficient
Excel menu path: Tools / Data Analysis / Correlation / Input Range / Output Range / OK
-1 ≤ rx,y ≤ +1
rx,y < 0 => negative correlation ;
rx,y > 0 => positive correlation
rx,y = 0 => no linear correlation
rx,y = ± 1 => perfect linear correlation
The relationship we consider is LINEAR!
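As a cross-check on the spreadsheet tool, the correlation coefficient can also be computed by hand. The following is a minimal Python sketch, assuming the Table 1 data are typed in as two lists; the function name `pearson_r` is chosen here for illustration only.

```python
import math

# Table 1 data: age in years (x1) and auction price (y) for the 32 clocks.
age = [127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117,
       137, 153, 117, 126, 170, 182, 162, 184, 143, 159, 108, 175,
       108, 179, 111, 187, 111, 115, 194, 168]
price = [1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946, 1713, 1024,
         1147, 1092, 1152, 1336, 2131, 1550, 1884, 2041, 845, 1483, 1055, 1545,
         729, 1792, 1175, 1593, 785, 744, 1356, 1262]


def pearson_r(x, y):
    """Linear (Pearson) correlation coefficient between two samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(age, price)
print(f"r(age, price) = {r:.4f}")
```

A value between 0 and +1 here is consistent with the upward-sloping cloud of points in Figure 1.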
Correlation does not mean causality
• Spurious correlation (e.g. population in Egypt and economic growth in China)
• Reverse causality
One step further : the simple regression (see Appendix 2)
Ordinary Least Squares
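Optimization process: $\min_{a_0, a_1} \sum_{i=1}^{T} \varepsilon_i^2 = \sum_{i=1}^{T} (y_i - a_0 - a_1 x_{1i})^2$
Solutions: $\hat{a}_1 = \frac{\sum_{i=1}^{T} (x_{1i} - \bar{x}_1)(y_i - \bar{y})}{\sum_{i=1}^{T} (x_{1i} - \bar{x}_1)^2}$ and $\hat{a}_0 = \bar{y} - \hat{a}_1 \bar{x}_1$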
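The least-squares problem for the model yi = a0 + a1*x1i + εi has a well-known closed-form solution (slope = covariation of x and y divided by variation of x; intercept from the sample means). A minimal Python sketch applying it to the Table 1 data follows; the helper name `ols_simple` is an illustrative choice, not something from the course material.

```python
# Table 1 data: age in years (x1) and auction price (y) for the 32 clocks.
age = [127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117,
       137, 153, 117, 126, 170, 182, 162, 184, 143, 159, 108, 175,
       108, 179, 111, 187, 111, 115, 194, 168]
price = [1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946, 1713, 1024,
         1147, 1092, 1152, 1336, 2131, 1550, 1884, 2041, 845, 1483, 1055, 1545,
         729, 1792, 1175, 1593, 785, 744, 1356, 1262]


def ols_simple(x, y):
    """Return (a0, a1) minimising the sum of squared residuals of y = a0 + a1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a1 = sxy / sxx                # slope
    a0 = mean_y - a1 * mean_x     # intercept: the line passes through the means
    return a0, a1


a0, a1 = ols_simple(age, price)
print(f"price = {a0:.2f} + {a1:.2f} * age")
```

The estimates should agree, up to rounding, with the trend line reported on the scatter plot.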
Figure 1 - Scatter plot of auction price (y) against age (x1), with fitted line y = 10.48x - 192.05
• yi= - 192.05 + 10.48*x1i + εi
An increase of 1 year in the age => an estimated increase of 10.48 units in the price
Constant?
R-squared - Strength of the Relationship/the goodness of fit
(see Appendix 3)
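Let’s split yi into an explained and an unexplained part as follows: $y_i = \hat{y}_i + \hat{\varepsilon}_i$, with $\hat{y}_i = \hat{a}_0 + \hat{a}_1 x_{1i}$
Fundamental Variance Equation:
$\sum_{i=1}^{T} (y_i - \bar{y})^2 = \sum_{i=1}^{T} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{T} \hat{\varepsilon}_i^2$, i.e. TSS = MSS + RSS
Total Variability (TSS) = Explained Variability (MSS) + Unexplained Variability (RSS)
=> from this equation we can assess the goodness of fit of the model: $R^2 = \frac{MSS}{TSS} = 1 - \frac{RSS}{TSS}$, with $0 \le R^2 \le 1$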
In other words: how much of the variance of Y is explained by our model?
Figure 1 - Scatter plot of auction price (y) against age (x1), with fitted line y = 10.48x - 192.05 and R2 = 0.5324
Here, the R-squared is 0.5324.
=> It means that 53.24% of the variance of the price of the clocks is explained by the variance
of their age.
=> In other words, the age of the clocks explains 53.24% of the variability of their price.
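The decomposition of total variability into explained and unexplained parts can be checked numerically. The sketch below, in Python, refits the simple regression on the Table 1 data and computes TSS, MSS and RSS directly; variable names are illustrative.

```python
# Table 1 data: age in years (x1) and auction price (y) for the 32 clocks.
age = [127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117,
       137, 153, 117, 126, 170, 182, 162, 184, 143, 159, 108, 175,
       108, 179, 111, 187, 111, 115, 194, 168]
price = [1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946, 1713, 1024,
         1147, 1092, 1152, 1336, 2131, 1550, 1884, 2041, 845, 1483, 1055, 1545,
         729, 1792, 1175, 1593, 785, 744, 1356, 1262]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(price) / n

# OLS estimates of the simple regression price = a0 + a1 * age.
a1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(age, price))
      / sum((x - mean_x) ** 2 for x in age))
a0 = mean_y - a1 * mean_x
fitted = [a0 + a1 * x for x in age]

tss = sum((y - mean_y) ** 2 for y in price)              # total variability
mss = sum((f - mean_y) ** 2 for f in fitted)             # explained by the model
rss = sum((y - f) ** 2 for y, f in zip(price, fitted))   # left in the residuals

r_squared = mss / tss
print(f"TSS = {tss:.1f}, MSS = {mss:.1f}, RSS = {rss:.1f}")
print(f"R-squared = {r_squared:.4f}")
```

In a regression with a constant, TSS equals MSS + RSS up to floating-point rounding, so MSS / TSS and 1 - RSS / TSS give the same R-squared.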
Confidence interval and test on the coefficient (see Appendix 4)
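Assuming independent and identically distributed (i.i.d.) error terms that follow a Gaussian (normal) distribution: $\frac{\hat{a}_1 - a_1}{\sigma_{\hat{a}_1}} \sim N(0, 1)$
Since we don’t know the true (i.e. theoretical) standard deviation of the error term, we replace it with its estimate and use the following: $\frac{\hat{a}_1 - a_1}{\hat{\sigma}_{\hat{a}_1}} \sim t_{T-2}$
$t_{T-2}$ being a Student distribution with T-2 degrees of freedom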
From these distributions, one can:
Compare coefficients
Compute a confidence interval for the regression coefficient
Test on the coefficient
H0: a1 = β versus H1: a1 ≠ β
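We compute the following t-statistic: $t = \frac{\hat{a}_1 - \beta}{\hat{\sigma}_{\hat{a}_1}}$
Then, the decision rule for a threshold α (usually 5%) is as follows:
If $|t| > t_{T-2}^{\alpha/2}$: we reject H0; a1 is significantly different from β, with a probability of being wrong ≤ α
If $|t| \le t_{T-2}^{\alpha/2}$: we cannot reject H0; a1 is not significantly different from β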
Critical probability
P-value: the probability of a type 1 error, i.e. the probability of being wrong when rejecting H0. There are two kinds of error:

              H0 is right     H0 is wrong
Reject        Type 1 error    OK
Not reject    OK              Type 2 error

Software packages (Stata, E-views, Pc-Give, but also Excel) usually report the p-value associated with any
test => you just need to know H0 and H1.
p-value < 5% => low risk of a type 1 error => we reject H0 and conclude in favour of H1
p-value > 5% => high risk of a type 1 error => we cannot reject H0 and retain it
Confidence Interval of the regression coefficient: $\hat{a}_1 \pm t_{T-2}^{\alpha/2} \, \hat{\sigma}_{\hat{a}_1}$
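The t-test and the confidence interval for the slope can be worked through on the Table 1 data. The Python sketch below makes two assumptions that are not in the slides: it tests the usual significance hypothesis H0: a1 = 0 (that is, β = 0), and it hard-codes the two-sided 5% critical value of a Student distribution with 30 degrees of freedom (2.042, from a standard t-table) rather than calling a statistics library.

```python
import math

# Table 1 data: age in years (x1) and auction price (y) for the 32 clocks.
age = [127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117,
       137, 153, 117, 126, 170, 182, 162, 184, 143, 159, 108, 175,
       108, 179, 111, 187, 111, 115, 194, 168]
price = [1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946, 1713, 1024,
         1147, 1092, 1152, 1336, 2131, 1550, 1884, 2041, 845, 1483, 1055, 1545,
         729, 1792, 1175, 1593, 785, 744, 1356, 1262]

n = len(age)                      # T = 32 observations
mean_x = sum(age) / n
mean_y = sum(price) / n
sxx = sum((x - mean_x) ** 2 for x in age)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, price))
a1 = sxy / sxx
a0 = mean_y - a1 * mean_x

# Estimated residual variance uses T-2 degrees of freedom;
# the standard error of the slope is sqrt(residual variance / Sxx).
rss = sum((y - a0 - a1 * x) ** 2 for x, y in zip(age, price))
se_a1 = math.sqrt(rss / (n - 2) / sxx)

# Assumption: two-sided 5% critical value for 30 degrees of freedom (t-table).
T_CRIT_30_DF = 2.042

beta_h0 = 0.0                     # assumption: H0 is a1 = 0
t_stat = (a1 - beta_h0) / se_a1
reject_h0 = abs(t_stat) > T_CRIT_30_DF

ci_low = a1 - T_CRIT_30_DF * se_a1
ci_high = a1 + T_CRIT_30_DF * se_a1

print(f"a1 = {a1:.2f}, standard error = {se_a1:.2f}, t = {t_stat:.2f}")
print(f"reject H0 at 5%: {reject_h0}")
print(f"95% confidence interval for a1: [{ci_low:.2f}, {ci_high:.2f}]")
```

The two tools agree by construction: H0: a1 = β is rejected at 5% exactly when β falls outside the 95% confidence interval.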
Multiple regression: what does it change? (see Appendix 6)
Now the collector of antique grandfather clocks hypothesizes that the auction price of the clocks
will increase linearly as the number of bidders increases. Thus, the following model is hypothesized
now:
y=β0 + β1*x1 + β2*x2 + ε,
where y=Auction price, x1=Age of clock (years), x2=Number of bidders
See the sheet “multiple regression” and interpret the “new” findings.
The interpretation is very similar to the simple case (R-squared, coefficient, t-test,
confidence interval). What changes is that every comment now holds all other things being equal.
In other words, the impact of an increase in one of the variables is assessed controlling for the level
of the other (main) factors supposed to affect the dependent variable.
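The mechanics of a regression with two explanatory variables can be sketched in Python by solving the normal equations X'X b = X'y. The number-of-bidders data are not reproduced in these slides, so the example below uses small synthetic, noise-free data generated from known coefficients (5, 2 and 3, all hypothetical); it only illustrates that each coefficient measures the effect of its own variable with the other held fixed. The function names are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b_i] for row, b_i in zip(a, b)]   # augmented matrix
    for col in range(n):
        # Bring the row with the largest pivot into position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(regressors, y):
    """OLS coefficients [b0, b1, ...] from the normal equations X'X b = X'y."""
    rows = [[1.0] + list(r) for r in regressors]        # prepend the constant
    k = len(rows[0])
    xtx = [[sum(r[j] * r[m] for r in rows) for m in range(k)] for j in range(k)]
    xty = [sum(r[j] * y_i for r, y_i in zip(rows, y)) for j in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic (x1, x2) pairs and a noise-free y = 5 + 2*x1 + 3*x2 (hypothetical values).
x = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [5 + 2 * x1 + 3 * x2 for x1, x2 in x]

b0, b1, b2 = ols(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Because the synthetic data contain no noise, the estimates recover the generating coefficients; with real data each estimate would come with a standard error and a t-test, as in the simple case.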
Logarithmic Specification (see Appendix 7)
y=β0 + β1*x1 + β2*x2 + ε => lny=β’0 + β’1*lnx1 + β’2*lnx2 + ε
Why a logarithmic specification ?
interpretation of the coefficients as elasticities
smaller weights to the highest values
makes a non-additive (multiplicative) model linear in logs
lny=-1.32 + 1.42*lnx1 + 0.65*lnx2 + ε
A 1% increase in the age will result, all other things being equal, in an increase
in the price of 1.42%
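The elasticity reading of the log-log coefficients can be verified numerically. The Python sketch below plugs the estimated equation into a prediction function and raises the age by 1% while holding the number of bidders fixed; the baseline values (age 150, 10 bidders) are hypothetical and chosen only for illustration.

```python
import math

# Estimated coefficients reported in the text:
# ln y = -1.32 + 1.42 * ln x1 + 0.65 * ln x2
B0, B_AGE, B_BIDDERS = -1.32, 1.42, 0.65


def predicted_price(age, bidders):
    """Price implied by the estimated log-log equation (error term ignored)."""
    return math.exp(B0 + B_AGE * math.log(age) + B_BIDDERS * math.log(bidders))


# Hypothetical baseline clock, for illustration only.
age, bidders = 150.0, 10.0

base = predicted_price(age, bidders)
bumped = predicted_price(age * 1.01, bidders)   # age up by 1%, bidders unchanged
pct_change = 100.0 * (bumped / base - 1.0)

print(f"price change for a 1% increase in age: {pct_change:.3f}%")
```

The result is close to, but not exactly, 1.42%: the elasticity is a derivative, so the interpretation is exact only for very small changes and does not depend on the chosen baseline.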
Quadratic Specification (see Appendix 8)
Examine the relationship between the average utility bills for homes of a particular size, Y, and the
average monthly temperature, X (in degrees F). The next table represents the average monthly bill
and temperature for each month of the past year. The simple linear regression model is given to the
right of the data.
y=β0 + β1*x1 + β2*x2 + ε => y=β0 + β1*x1 + β2*x2 + β3*(x2*x2) + ε
Why a quadratic specification? To try to capture a non-linear relationship between y and x2.
$\frac{\partial y}{\partial x_2} = \beta_2 + 2 \beta_3 x_2$
Depending on the signs of β2 and β3 you can estimate an acceleration or a reversal
of the impact beyond a certain threshold.
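How the signs of β2 and β3 produce a reversal can be illustrated with a short Python sketch. The coefficients below are hypothetical (they are not estimates from the utility-bill data); they are chosen so that the effect of x2 is negative at low values and positive at high values. The marginal effect is the derivative of y with respect to x2, and the threshold is where that derivative equals zero.

```python
# Hypothetical coefficients on x2 and x2 squared (NOT estimates from the text).
beta2, beta3 = -10.0, 0.1


def marginal_effect(x2):
    """Derivative of y = ... + beta2*x2 + beta3*x2**2 with respect to x2."""
    return beta2 + 2.0 * beta3 * x2


# Threshold beyond which the sign of the effect changes: beta2 + 2*beta3*x2 = 0.
turning_point = -beta2 / (2.0 * beta3)

for x2 in (30.0, turning_point, 70.0):
    print(f"x2 = {x2:5.1f}  marginal effect = {marginal_effect(x2):6.2f}")
```

With β2 and β3 of opposite signs the effect reverses at the threshold; with the same sign there is no reversal for positive x2 and the effect accelerates instead.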
Warnings !
Sensitivity of the results to any change in:
A variable of the model (added or dropped)
The sample
Interpretation
Correlation and causality…(a major problem in econometrics)
The third variable problem: the joint movement of the two variables can be explained by a third
(omitted) variable
Reverse causality: you want to assess the impact of a change in X on Y (typically supply and
demand) but the causality runs in both directions.
Your findings result from a particular sample and a particular framework; you must keep this in
mind when interpreting them