Estimating a demand function: it’s about time

© 2016, Jeffrey S. Simonoff

Our earlier look at estimating a demand function demonstrated how multiple regression could be used to estimate the demand for gasoline as a function of various predictors, including its price. The chosen model there was the following:

Regression Analysis: logGpc versus logPG, logI, logPD, logPN, logPS

Analysis of Variance

Source          DF    Adj SS    Adj MS  F-Value  P-Value
Regression       5  0.148017  0.029603   431.44    0.000
  logPG          1  0.004579  0.004579    66.73    0.000
  logI           1  0.014972  0.014972   218.20    0.000
  logPD          1  0.001372  0.001372    19.99    0.000
  logPN          1  0.002646  0.002646    38.57    0.000
  logPS          1  0.005953  0.005953    86.76    0.000
Error           30  0.002058  0.000069
Total           35  0.150076

Model Summary

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0082834  98.63%     98.40%      98.06%

Coefficients

Term         Coef  SE Coef  T-Value  P-Value     VIF
Constant   -3.348    0.339    -9.89    0.000
logPG     -0.4985   0.0610    -8.17    0.000  130.80
logI       1.1622   0.0787    14.77    0.000   24.98
logPD       0.802    0.179     4.47    0.000  441.63
logPN       1.172    0.189     6.21    0.000  955.86
logPS      -1.204    0.129    -9.31    0.000  617.64

Regression Equation

logGpc = -3.348 - 0.4985 logPG + 1.1622 logI + 0.802 logPD + 1.172 logPN - 1.204 logPS

Although this model fits the data reasonably well, it does suffer from a difficulty: it does not address the time ordering of the data. In fact, the residuals from this model exhibit autocorrelation, as can be seen from this time series plot:

[Figure: time series plot of the residuals]

The Durbin–Watson statistic supports this, as it equals 1.02; so does the runs test, although its evidence is somewhat weaker:

Runs test for SRES1

Runs above and below K = 0

The observed number of runs = 13
The expected number of runs = 18.7778
20 observations above K, 16 below
P-value = 0.048

The ACF plot of the standardized residuals also indicates autocorrelation:

[Figure: ACF plot of the standardized residuals]

As we’ve discussed, one approach for handling autocorrelation is to use a lagged version of the target variable as a predictor (Lagged logGpc, saying that the previous year’s gas consumption goes a long way to predicting this year’s consumption, due to basic stability in the process). Also, in thinking about the dynamics of how people decide to use their automobiles, it seems reasonable to consider also using a lagged version of the price index of gasoline, Lagged logPG (saying that consumption might be affected not only by the current price but also by the previous price, because people perceive prices to be increasing or decreasing). Generally speaking, lagged versions of predictors are not used specifically to address autocorrelation (as the lagged target often is), but rather because such use makes sense in context.

Here is a scatter plot of logged per capita consumption on the previous year’s logged per capita consumption. We can see that there is a strong relationship, although it is apparently weaker for the higher values. I haven’t bothered to give the plot of logged per capita consumption versus the previous year’s price index, since it looks very similar to the one for the current year’s price index that we saw earlier.

Here is output for a regression using these variables, along with logPG, as predictors (I could have used a best subsets regression here, but it’s clear that all three variables provide a lot of predictive power):

Analysis of Variance

Source           DF    Adj SS    Adj MS  F-Value  P-Value
Regression        3  0.127284  0.042428   637.51    0.000
  Lagged logGpc   1  0.072538  0.072538  1089.94    0.000
  logPG           1  0.005443  0.005443    81.79    0.000
  Lagged logPG    1  0.004972  0.004972    74.71    0.000
Error            31  0.002063  0.000067
Total            34  0.129347

Model Summary

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0081580  98.40%     98.25%      97.96%

Coefficients

Term              Coef  SE Coef  T-Value  P-Value    VIF
Constant       -0.0529   0.0306    -1.73    0.094
Lagged logGpc   1.0751   0.0326    33.01    0.000   2.34
logPG          -0.3278   0.0363    -9.04    0.000  45.42
Lagged logPG    0.2902   0.0336     8.64    0.000  39.41

Regression Equation

logGpc = -0.0529 + 1.0751 Lagged logGpc - 0.3278 logPG + 0.2902 Lagged logPG

The model fits very well, and the autocorrelation has apparently been removed. We might also consider a further simplification of the model. Note that the estimated slopes for logged price and lagged logged price are very similar in magnitude and of opposite sign; that suggests that replacing the two variables with their difference could provide a similar fit, and would be easily interpretable as implying that it is simply the change in price, along with the previous year’s consumption, that is related to current consumption. A partial F-test of this hypothesis ($\beta_{\text{logPG}} = -\beta_{\text{Lagged logPG}}$), however, does not support this simplification (F = 22.8, p < .0001), so we will not pursue this further.
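
The mechanics of a partial F-test like the one just described can be sketched as follows. This is only an illustration under stated assumptions: the series are synthetic (not the gasoline data), the function names sse and partial_f_opposite_slopes are made up for this sketch, and the fits use plain least squares from NumPy rather than Minitab. Under the null hypothesis the two price terms collapse to the single predictor (price minus lagged price), so the test compares the error sums of squares of the full and restricted fits.

```python
import numpy as np


def sse(X, y):
    """Residual sum of squares from the least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


def partial_f_opposite_slopes(y, lag_y, x, lag_x):
    """Partial F-test of H0: slope(x) = -slope(lag_x).

    Under H0 the two price terms collapse to the single predictor x - lag_x.
    Returns (F statistic, numerator df, error df of the full model).
    """
    n = len(y)
    ones = np.ones(n)
    full = np.column_stack([ones, lag_y, x, lag_x])
    restricted = np.column_stack([ones, lag_y, x - lag_x])
    sse_full, sse_restricted = sse(full, y), sse(restricted, y)
    df_error = n - full.shape[1]
    f_stat = (sse_restricted - sse_full) / (sse_full / df_error)
    return f_stat, 1, df_error


# Synthetic series (NOT the gasoline data): 36 "years" of price and demand.
rng = np.random.default_rng(0)
price = rng.normal(size=36)
demand = np.zeros(36)
for t in range(1, 36):
    # Last year's price has no effect here, so H0 is false by construction.
    demand[t] = 0.9 * demand[t - 1] - 0.3 * price[t] + rng.normal(scale=0.01)

# Lagging costs the first observation, leaving n = 35 usable rows.
y, lag_y = demand[1:], demand[:-1]
x, lag_x = price[1:], price[:-1]

f_stat, df_num, df_error = partial_f_opposite_slopes(y, lag_y, x, lag_x)
print(f"F = {f_stat:.1f} on ({df_num}, {df_error}) df")
```

A p-value would come from the upper tail of the F distribution on (1, 31) degrees of freedom (for example via scipy.stats.f.sf, which is not imported in this sketch).
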
A time series plot of the residuals, however, shows that there is a clear outlier:

[Figure: time series plot of the residuals]

This outlier corresponds to 1991:

Row  Year   logGpc     SRES2       HI2     COOK2
  1  1960  0.85772         *         *         *
  2  1961  0.85583  -2.19803  0.166530  0.241328
  3  1962  0.86806   0.02270  0.174651  0.000027
  4  1963  0.87579  -0.80749  0.145953  0.027857
  5  1964  0.89125   0.07544  0.130924  0.000214
  6  1965  0.90611   0.61353  0.113090  0.011999
  7  1966  0.92591   0.88899  0.089974  0.019534
  8  1967  0.93752  -0.14895  0.073089  0.000437
  9  1968  0.96368   1.35005  0.068043  0.033268
 10  1969  0.98779   1.19743  0.067586  0.025983
 11  1970  1.00721   0.01578  0.094891  0.000007
 12  1971  1.02345  -0.61246  0.127361  0.013687
 13  1972  1.03491  -1.30374  0.157625  0.079513
 14  1973  1.05138   0.80589  0.135470  0.025443
 15  1974  1.02465  -0.97127  0.237654  0.073521
 16  1975  1.03286   0.15422  0.051697  0.000324
 17  1976  1.04569   0.34341  0.060323  0.001893
 18  1977  1.05460   0.08992  0.069092  0.000150
 19  1978  1.07060   0.77236  0.081818  0.013289
 20  1979  1.04468   0.09022  0.221465  0.000579
 21  1980  0.99919  -1.22083  0.316834  0.172805
 22  1981  0.99262   1.04975  0.148019  0.047863
 23  1982  0.99460  -0.55284  0.118739  0.010295
 24  1983  1.01066   1.50700  0.103501  0.065548
 25  1984  1.01604   0.23891  0.080059  0.001242
 26  1985  1.01414  -0.34561  0.073825  0.002380
 27  1986  1.04995  -0.15029  0.292040  0.002329
 28  1987  1.05783   0.63400  0.049303  0.005211
 29  1988  1.05873  -0.78952  0.063589  0.010582
 30  1989  1.06109   0.86471  0.058482  0.011611
 31  1990  1.05335   0.55444  0.084320  0.007077
 32  1991  1.03261  -3.51351  0.076899  0.257095
 33  1992  1.04080   0.58789  0.068258  0.006330
 34  1993  1.04596   0.00685  0.068665  0.000001
 35  1994  1.04710  -0.29816  0.065460  0.001557
 36  1995  1.05415   0.63187  0.064769  0.006913

This year was the year of a serious recession and the first Gulf War (Operation Desert Storm), so apparently gasoline consumption decreased during this time period. As an outlier, we could contemplate removing this case and reanalyzing the data.
Unfortunately, if we do that, we will disturb the natural time ordering in the data. An alternative approach is to substitute a “reasonable” value, such as the average of the two neighboring values, for the outlying value, and then reanalyze the entire adjusted data set. This is admittedly an ad hoc solution, and more complex (and theoretically justified) substitution methods are possible. Still, very simple techniques like this can work quite adequately. For these data, the gas consumption of 1.03261 is too low relative to the values of 1.05335 for 1990 and 1.04080 for 1992, so the averaged value of 1.04708 is substituted (of course, when discussing our results, we must note that they no longer apply to 1991, or to future years that might be like 1991, such as recession years). Here is the resultant regression output:

Analysis of Variance

Source           DF    Adj SS    Adj MS  F-Value  P-Value
Regression        3  0.128926  0.042975   912.79    0.000
  Lagged logGpc   1  0.073058  0.073058  1551.75    0.000
  logPG           1  0.005653  0.005653   120.06    0.000
  Lagged logPG    1  0.005227  0.005227   111.02    0.000
Error            31  0.001460  0.000047
Total            34  0.130386

Model Summary

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0068616  98.88%     98.77%      98.52%

Coefficients

Term              Coef  SE Coef  T-Value  P-Value    VIF
Constant       -0.0565   0.0258    -2.19    0.036
Lagged logGpc   1.0789   0.0274    39.39    0.000   2.34
logPG          -0.3341   0.0305   -10.96    0.000  45.42
Lagged logPG    0.2976   0.0282    10.54    0.000  39.41

Regression Equation

logGpc = -0.0565 + 1.0789 Lagged logGpc - 0.3341 logPG + 0.2976 Lagged logPG

The model fits slightly better, but the coefficients have changed little. More importantly, there is no autocorrelation, and no outliers are apparent:

Runs test for SRES2

Runs above and below K = 0

The observed number of runs = 22
The expected number of runs = 18.1429
20 observations above K, 15 below
P-value = 0.176

Row  Year   logGpc     SRES2       HI2     COOK2
  1  1960  0.85772         *         *         *
  2  1961  0.85583  -2.55989  0.166530  0.327330
  3  1962  0.86806   0.09034  0.174651  0.000432
  4  1963  0.87579  -0.90839  0.145953  0.035255
  5  1964  0.89125   0.13495  0.130924  0.000686
  6  1965  0.90611   0.78300  0.113090  0.019544
  7  1966  0.92591   1.09185  0.089974  0.029467
  8  1967  0.93752  -0.15210  0.073089  0.000456
  9  1968  0.96368   1.61432  0.068043  0.047567
 10  1969  0.98779   1.42412  0.067586  0.036752
 11  1970  1.00721  -0.00708  0.094891  0.000001
 12  1971  1.02345  -0.76761  0.127361  0.021499
 13  1972  1.03491  -1.59821  0.157625  0.119489
 14  1973  1.05138   0.93727  0.135470  0.034414
 15  1974  1.02465  -1.09989  0.237654  0.094282
 16  1975  1.03286   0.12996  0.051697  0.000230
 17  1976  1.04569   0.33488  0.060323  0.001800
 18  1977  1.05460   0.02913  0.069092  0.000016
 19  1978  1.07060   0.82485  0.081818  0.015157
 20  1979  1.04468   0.10898  0.221465  0.000845
 21  1980  0.99919  -1.44476  0.316834  0.242012
 22  1981  0.99262   1.16183  0.148019  0.058629
 23  1982  0.99460  -0.81402  0.118739  0.022320
 24  1983  1.01066   1.64740  0.103501  0.078331
 25  1984  1.01604   0.14236  0.080059  0.000441
 26  1985  1.01414  -0.54445  0.073825  0.005907
 27  1986  1.04995  -0.45074  0.292040  0.020952
 28  1987  1.05783   0.63208  0.049303  0.005180
 29  1988  1.05873  -1.08116  0.063589  0.019844
 30  1989  1.06109   0.91787  0.058482  0.013083
 31  1990  1.05335   0.55781  0.084320  0.007163
 32  1991  1.04708  -2.15160  0.076899  0.096413
 33  1992  1.04080   0.55002  0.068258  0.005541
 34  1993  1.04596  -0.14783  0.068665  0.000403
 35  1994  1.04710  -0.50620  0.065460  0.004487
 36  1995  1.05415   0.60268  0.064769  0.006289

The residual versus fitted plot gives a slight indication of structure, but given the very high $R^2$ here, it is unlikely that any corrective action would make much of a difference. Note that 1980 and 1986 are potential leverage points, which we will not pursue here.

This new gas demand function has an appealing intuitive justification.
Given the last two years’ prices, gasoline demand is directly related to last year’s demand (1% higher demand last year is associated with a 1.08% estimated expected increase this year). Given last year’s demand and price, this year’s demand is inversely related to this year’s price, which is the inverse demand/price relationship expected from economic theory (1% higher price is associated with a .33% estimated expected decrease in demand). Further, given this year’s price and last year’s demand, this year’s demand is directly related to last year’s price (1% higher price last year is associated with a .30% estimated expected increase in demand this year). This also makes sense, since a higher value of last year’s price, given that this year’s price is fixed, is consistent with a decreasing price trend, which would encourage additional consumption. The standard error of the estimate implies that per capita gas demand can be predicted to within roughly 3% (two standard errors on the log scale give $10^{2 \times 0.0068616} = 10^{0.013724} \approx 1.03$).

The fill-in method for handling an outlier used here has two limitations that are worth noting. First, adjusting the target (y) value will not fix leverage points, as they are characterized by unusual predictor values, not unusual target values. Second, unusual observations often occur in “patches” in time series data, reflecting a temporary change in the underlying structure of the process; a constant fill-in for four or five (say) consecutive time periods obviously does not accurately reflect what we think the series really should be.

An alternative that addresses both of these points (and is thus the only alternative for handling leverage points) is to create an indicator variable that defines the unusual observation or patch of unusual observations, equaling one for all observations in the patch, and zero otherwise (isolated unusual observations that are not in a consecutive patch of time points have a 0/1 variable defined for each of them).
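
The indicator construction just described can be sketched in a few lines. This is an illustration only: the helper name year_indicator is made up for this sketch, and the patch years and the second isolated year are hypothetical choices, not findings from the gasoline data. Only the 1991 indicator corresponds to something used in this handout.

```python
import numpy as np


def year_indicator(years, flagged_years):
    """Return a 0/1 array equal to 1 for every year in flagged_years."""
    return np.isin(years, list(flagged_years)).astype(float)


# 35 usable years once the 1960 row is lost to lagging.
years = np.arange(1961, 1996)

# One isolated unusual year: a single indicator (Year1991 in the text).
year1991 = year_indicator(years, [1991])

# A hypothetical patch of consecutive years (illustrative only): the whole
# patch shares one indicator.
patch = year_indicator(years, range(1979, 1982))

# Two hypothetical isolated years that are not adjacent: one indicator each.
isolated = {f"Year{yr}": year_indicator(years, [yr]) for yr in (1974, 1991)}
```

Each array would then be added as an extra predictor column alongside the lagged consumption and price variables.
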
Including this variable in the regression will effectively remove the influence of the unusual values from the regression fit. Here is how this works for these data (with Year1991 defining only 1991).

Analysis of Variance

Source           DF    Adj SS    Adj MS  F-Value  P-Value
Regression        4  0.128105  0.032026   773.86    0.000
  Lagged logGpc   1  0.073260  0.073260  1770.20    0.000
  logPG           1  0.005820  0.005820   140.63    0.000
  Lagged logPG    1  0.005415  0.005415   130.85    0.000
  Year1991        1  0.000822  0.000822    19.85    0.000
Error            30  0.001242  0.000041
Total            34  0.129347

Model Summary

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0064331  99.04%     98.91%           *

Coefficients

Term               Coef  SE Coef  T-Value  P-Value    VIF
Constant        -0.0604   0.0242    -2.49    0.018
Lagged logGpc    1.0830   0.0257    42.07    0.000   2.35
logPG           -0.3407   0.0287   -11.86    0.000  45.89
Lagged logPG     0.3054   0.0267    11.44    0.000  40.06
Year1991       -0.02983  0.00670    -4.46    0.000   1.05

Regression Equation

logGpc = -0.0604 + 1.0830 Lagged logGpc - 0.3407 logPG + 0.3054 Lagged logPG - 0.02983 Year1991

The fitted coefficients are virtually the same as when the fill-in method is used. One additional piece of information from this approach is the coefficient for Year1991: given the previous year’s gasoline consumption, and this year’s and last year’s gasoline price index, the observed logged per capita consumption for 1991 is seen to have been .0298 lower than expected (translating to a demand roughly 6.6% lower than expected that year, since $10^{-0.02983} \approx 0.934$), and this amount is significantly different from zero (p < .001). Thus, the t-test for the indicator variable is a formal test of whether the point is an outlier (but remember that it would not necessarily be significant for a leverage point).

Two issues related to software in general, and Minitab in particular, arise if you use this method:

1. You should set the standardized residual for any observation identified by a single 0/1 variable equal to 0 (Minitab sets it equal to *, since it is technically 0/0). Note that the leverage value for any such observation will equal 1.00, but that is not an issue, since the observation is intended to be omitted. You should not include any indicator variables constructed for this purpose in the total number of predictors p when determining a guideline for extreme leverage points.

2. If you are doing model selection (using best subsets, for example), you must be careful to force the indicator variables that define the unusual observations into all models, to make sure that those points are effectively omitted from the sample. This is particularly important for a leverage point, since its corresponding indicator will not necessarily be identified as an important predictor by best subsets, even if its inclusion could greatly change the fitted regression. Also, if you use this method to account for unusual observations, you should not count the indicator variables in the number of predictors p for measures like $C_p$, $AIC_C$, and so on, and you should not count the observations being accounted for in the value of n used in those measures.
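
The claims above (that a single-observation indicator effectively omits the point, leaves it with leverage 1, and gives it a residual of exactly 0) can be checked numerically. The sketch below uses synthetic data with a planted outlier, not the gasoline data, and plain NumPy least squares; the variable names and the position k of the planted outlier are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic regression data (NOT the gasoline data): intercept + 3 predictors.
rng = np.random.default_rng(1)
n = 35
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([0.5, 1.0, -0.3, 0.3]) + rng.normal(scale=0.05, size=n)

k = 30          # position of the planted outlier (arbitrary)
y[k] -= 1.0

# Fit 1: keep every row, add a 0/1 indicator that equals 1 only for row k.
indicator = np.zeros(n)
indicator[k] = 1.0
X_ind = np.column_stack([X, indicator])
coef_ind, *_ = np.linalg.lstsq(X_ind, y, rcond=None)

# Fit 2: simply delete row k and refit without the indicator.
keep = np.arange(n) != k
coef_del, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)

# Hat matrix X (X'X)^{-1} X' of the indicator fit; its diagonal holds the
# leverage values.
hat = X_ind @ np.linalg.pinv(X_ind)
leverage = np.diag(hat)
resid = y - X_ind @ coef_ind

print("slopes agree:", np.allclose(coef_ind[:4], coef_del))
print("leverage of flagged row:", round(float(leverage[k]), 6))
print("residual of flagged row:", round(float(resid[k]), 6))
# The indicator coefficient is the gap between the observed value and the
# prediction from the fit that omits the point.
print("indicator coefficient:", coef_ind[4], "vs", y[k] - X[k] @ coef_del)
```

Because the residual and one minus the leverage are both zero for the flagged row, the standardized residual is the 0/0 noted in item 1, and the effective sample size and predictor count for model selection measures are n - 1 and the number of non-indicator predictors.
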