Basic Econometrics Tools
Correlation and Regression Analysis
Christopher Grigoriou
Executive MBA – HEC Lausanne 2007/2008

A collector of antique grandfather clocks wants to know whether the price received for the clocks increases linearly with their age. The following model is considered:

$y_i = a_0 + a_1 x_{1i} + \varepsilon_i$

where $y_i$ = auction price of clock $i$ and $x_{1i}$ = age of clock $i$ (years).

A sample of 32 auction prices of grandfather clocks, along with their ages, is given in the next table.

Table 1 - Auction price and age

 i    Auction price, y    Age, x1 (years)
 1         1235                127
 2         1080                115
 3          845                127
 4         1522                150
 5         1047                156
 6         1979                182
 7         1822                156
 8         1253                132
 9         1297                137
10          946                113
11         1713                137
12         1024                117
13         1147                137
14         1092                153
15         1152                117
16         1336                126
17         2131                170
18         1550                182
19         1884                162
20         2041                184
21          845                143
22         1483                159
23         1055                108
24         1545                175
25          729                108
26         1792                179
27         1175                111
28         1593                187
29          785                111
30          744                115
31         1356                194
32         1262                168

Correlation analysis (see Appendix 1)

[Figure 1 - Scatter plot of auction price (y) against age (x1)]

From this figure => positive correlation between auction price and age?

=> Correlation coefficient

In Excel: Tools / Data Analysis / Correlation / input range / output range / OK

$-1 \le r_{x,y} \le +1$
$r_{x,y} < 0$ => negative correlation; $r_{x,y} > 0$ => positive correlation
$r_{x,y} = 0$ => no linear correlation
$r_{x,y} = \pm 1$ => perfect correlation

The relationship we consider is LINEAR!
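As a minimal sketch of what the Excel correlation tool computes, the correlation coefficient between age and auction price can be obtained directly from the Table 1 data. The function name `pearson_r` and the variable names are illustrative choices, not part of the course material; the formula used is the standard sample (Pearson) correlation.

```python
import math

# Table 1 data: auction price (y) and age in years (x1) for the 32 clocks.
price = [1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946, 1713, 1024,
         1147, 1092, 1152, 1336, 2131, 1550, 1884, 2041, 845, 1483, 1055, 1545,
         729, 1792, 1175, 1593, 785, 744, 1356, 1262]
age = [127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117,
       137, 153, 117, 126, 170, 182, 162, 184, 143, 159, 108, 175,
       108, 179, 111, 187, 111, 115, 194, 168]


def pearson_r(x, y):
    """Sample linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(age, price)
print(f"r = {r:.4f}")
```

The coefficient comes out at roughly 0.73, which supports the positive linear association suggested by the scatter plot.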
Correlation does not mean causality
• Spurious correlation (e.g. population in Egypt and economic growth in China)
• Reverse causality

One step further: the simple regression (see Appendix 2)

Ordinary Least Squares

Optimization process:

$\min_{a_0, a_1} \sum_{i=1}^{T} \varepsilon_i^2 = \sum_{i=1}^{T} (y_i - a_0 - a_1 x_{1i})^2$

Solutions:

$\hat{a}_1 = \dfrac{\sum_{i=1}^{T} (x_{1i} - \bar{x}_1)(y_i - \bar{y})}{\sum_{i=1}^{T} (x_{1i} - \bar{x}_1)^2}$, $\hat{a}_0 = \bar{y} - \hat{a}_1 \bar{x}_1$

[Figure 1 - Scatter plot of auction price (y) against age (x1), with fitted line y = 10.48x - 192.05]

• $y_i = -192.05 + 10.48\, x_{1i} + \hat{\varepsilon}_i$

An increase of 1 year in age => an increase of 10.48 units in the price.
And the constant?

R-squared - Strength of the relationship / goodness of fit (see Appendix 3)

Let's split $y_i$ into an explained and an unexplained part as follows:

$y_i = \hat{y}_i + \hat{\varepsilon}_i$

Fundamental Variance Equation:

$\underbrace{\sum_{i=1}^{T} (y_i - \bar{y})^2}_{TSS} = \underbrace{\sum_{i=1}^{T} (\hat{y}_i - \bar{y})^2}_{MSS} + \underbrace{\sum_{i=1}^{T} \hat{\varepsilon}_i^2}_{RSS}$

Total variability (TSS) = Explained variability (MSS) + Unexplained variability (RSS)

=> from this equation we can assess the goodness of fit of the model:

$R^2 = \dfrac{MSS}{TSS} = 1 - \dfrac{RSS}{TSS}$

In other words: how much of the variance of Y is explained by our model?

[Figure 1 - Scatter plot of auction price (y) against age (x1), with fitted line y = 10.48x - 192.05 and R2 = 0.5324]

Here, the R-squared is 0.5324.
=> It means that 53.24% of the variance of the price of the clocks is explained by the variance of their age.
=> In other words, the age of the clocks explains 53.24% of the variability of their price.
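The OLS estimates and the variance decomposition above can be checked on the Table 1 data with a short sketch. The helper `ols_simple` and the variable names are illustrative, not taken from the course material; the code applies the textbook closed-form OLS solution for one regressor.

```python
# Table 1 data: auction price (y) and age in years (x1) for the 32 clocks.
price = [1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946, 1713, 1024,
         1147, 1092, 1152, 1336, 2131, 1550, 1884, 2041, 845, 1483, 1055, 1545,
         729, 1792, 1175, 1593, 785, 744, 1356, 1262]
age = [127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117,
       137, 153, 117, 126, 170, 182, 162, 184, 143, 159, 108, 175,
       108, 179, 111, 187, 111, 115, 194, 168]


def ols_simple(x, y):
    """Return (a0, a1) minimising the sum of squared residuals of y = a0 + a1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    a1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    a0 = mean_y - a1 * mean_x
    return a0, a1


a0, a1 = ols_simple(age, price)
fitted = [a0 + a1 * xi for xi in age]
mean_price = sum(price) / len(price)

# Fundamental variance equation: TSS = MSS + RSS.
tss = sum((yi - mean_price) ** 2 for yi in price)
mss = sum((fi - mean_price) ** 2 for fi in fitted)
rss = sum((yi - fi) ** 2 for yi, fi in zip(price, fitted))
r_squared = mss / tss

print(f"a0 = {a0:.2f}, a1 = {a1:.2f}, R2 = {r_squared:.4f}")
```

The estimates should agree with the fitted line and R-squared reported on the figure, and `tss` should equal `mss + rss` up to rounding.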
Confidence interval and test on the coefficient (see Appendix 4)

Assuming independent and identically distributed (i.i.d.) residuals following the Gaussian law:

$\dfrac{\hat{a}_1 - a_1}{\sigma_{\hat{a}_1}} \sim N(0, 1)$

Since we don't know the true (i.e. theoretical) standard deviation of the error term, we use its estimate instead:

$\dfrac{\hat{a}_1 - a_1}{\hat{\sigma}_{\hat{a}_1}} \sim t_{T-2}$

$t_{T-2}$ being a Student law with T-2 degrees of freedom.

From these distributions, one can:
• Compare coefficients
• Compute a confidence interval for the regression coefficient

Test on the coefficient

H0: $a_1 = \beta$ versus H1: $a_1 \neq \beta$

We compute the following t-statistic:

$t = \dfrac{\hat{a}_1 - \beta}{\hat{\sigma}_{\hat{a}_1}}$

Then, the decision rule for a threshold $\alpha$ (usually 5%) is as follows:

$|t| > t_{T-2}^{\alpha/2}$: we reject H0: $a_1$ is significantly different from $\beta$, with a probability of being wrong ≤ $\alpha$
$|t| \le t_{T-2}^{\alpha/2}$: we cannot reject H0: $a_1$ is not significantly different from $\beta$

Critical probability (p-value): the probability of being wrong when rejecting H0, i.e. of making a type 1 error. There are two kinds of error:

              H0 is right      H0 is wrong
Reject        Type 1 error     OK
Not reject    OK               Type 2 error

Software (Stata, EViews, PcGive, but also Excel) usually provides the p-value associated with any test => you just need to know H0 and H1.

p-value < 5% => type 1 error low => we reject H0 and conclude on H1
p-value > 5% => type 1 error high => we cannot reject H0 and keep H0

Confidence interval of the regression coefficient:

$a_1 \in \left[\hat{a}_1 - t_{T-2}^{\alpha/2}\, \hat{\sigma}_{\hat{a}_1} \; ; \; \hat{a}_1 + t_{T-2}^{\alpha/2}\, \hat{\sigma}_{\hat{a}_1}\right]$

Multiple regression: what does it change? (see Appendix 6)

Now the collector of antique grandfather clocks hypothesizes that the auction price of the clocks will also increase linearly as the number of bidders increases. Thus, the following model is now hypothesized:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

where y = auction price, x1 = age of clock (years), x2 = number of bidders.

See the sheet "multiple regression" and interpret the "new" findings.

The interpretation is very similar to the simple case (R-squared, coefficients, t-test, confidence interval). What changes is that the comments now hold all other things being equal.
In other words, the impact of an increase in one of the variables is assessed controlling for the level of the other (main) factors supposed to affect the dependent variable.

Logarithmic Specification (see Appendix 7)

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ => $\ln y = \beta'_0 + \beta'_1 \ln x_1 + \beta'_2 \ln x_2 + \varepsilon$

Why a logarithmic specification?
• Interpretation of the coefficients as elasticities
• Smaller weights given to the highest values
• Makes a non-additive model log-linear

$\ln y = -1.32 + 1.42 \ln x_1 + 0.65 \ln x_2 + \varepsilon$

A 1% increase in age results, all other things being equal, in a 1.42% increase in the price.

Quadratic Specification (see Appendix 8)

Examine the relationship between the average utility bills for homes of a particular size, Y, and the average monthly temperature, X (in degrees F). The next table represents the average monthly bill and temperature for each month of the past year. The simple linear regression model is given to the right of the data.

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ => $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \varepsilon$

Why a quadratic specification?
To try to capture a non-linear relationship between y and x2:

$\dfrac{\partial y}{\partial x_2} = \beta_2 + 2 \beta_3 x_2$

Depending on the signs of $\beta_2$ and $\beta_3$, you can estimate an acceleration (same signs) or a reversal (opposite signs) of the impact beyond a certain threshold. In the reversal case the marginal effect changes sign at $x_2 = -\beta_2 / (2 \beta_3)$.

Warnings!

Sensitivity of the results to any change in:
• A variable of the model (added or dropped)
• The sample

Interpretation:
• Correlation and causality (a major problem in econometrics)
• The third variable problem: the joint movement of the two variables can be explained by a third (omitted) variable
• Reverse causality: you want to assess the impact of a change in X on Y (typically supply and demand) but the causality runs in both directions

Your findings result from a particular sample and a particular framework; you must keep this in mind when interpreting them.
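The warning about sensitivity to the sample can be illustrated on the clock data from Table 1. The sketch below re-estimates the slope of the simple regression of price on age after dropping each clock in turn. This leave-one-out check is an assumption chosen for illustration, not a procedure from the course, and the helper `slope` is a hypothetical name.

```python
# Table 1 data: auction price (y) and age in years (x1) for the 32 clocks.
price = [1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946, 1713, 1024,
         1147, 1092, 1152, 1336, 2131, 1550, 1884, 2041, 845, 1483, 1055, 1545,
         729, 1792, 1175, 1593, 785, 744, 1356, 1262]
age = [127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117,
       137, 153, 117, 126, 170, 182, 162, 184, 143, 159, 108, 175,
       108, 179, 111, 187, 111, 115, 194, 168]


def slope(x, y):
    """OLS slope of the simple regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
            / sum((xi - mean_x) ** 2 for xi in x))


full = slope(age, price)

# Re-estimate the slope 32 times, each time leaving out one clock.
loo = [slope(age[:i] + age[i + 1:], price[:i] + price[i + 1:])
       for i in range(len(age))]

print(f"full-sample slope: {full:.2f}")
print(f"leave-one-out slopes: min {min(loo):.2f}, max {max(loo):.2f}")
```

The spread between the smallest and largest leave-one-out slope gives a rough idea of how much a single observation can move the estimate in a sample of this size.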