Estimating a demand function — it’s about time
Our earlier look at estimating a demand function demonstrated how multiple regression could be used to estimate the demand for gasoline as a function of various predictors,
including its price. The chosen model there was the following:
Regression Analysis: logGpc versus logPG, logI, logPD, logPN, logPS

Analysis of Variance

Source      DF    Adj SS    Adj MS  F-Value  P-Value
Regression   5  0.148017  0.029603   431.44    0.000
  logPG      1  0.004579  0.004579    66.73    0.000
  logI       1  0.014972  0.014972   218.20    0.000
  logPD      1  0.001372  0.001372    19.99    0.000
  logPN      1  0.002646  0.002646    38.57    0.000
  logPS      1  0.005953  0.005953    86.76    0.000
Error       30  0.002058  0.000069
Total       35  0.150076

Model Summary

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0082834  98.63%     98.40%      98.06%

Coefficients

Term         Coef  SE Coef  T-Value  P-Value     VIF
Constant   -3.348    0.339    -9.89    0.000
logPG     -0.4985   0.0610    -8.17    0.000  130.80
logI       1.1622   0.0787    14.77    0.000   24.98
logPD       0.802    0.179     4.47    0.000  441.63
logPN       1.172    0.189     6.21    0.000  955.86
logPS      -1.204    0.129    -9.31    0.000  617.64

Regression Equation

logGpc = -3.348 - 0.4985 logPG + 1.1622 logI + 0.802 logPD + 1.172 logPN
         - 1.204 logPS
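As a quick check on how the pieces of this output fit together, the Model Summary values can be recomputed from the Error and Total rows of the Analysis of Variance table. This is only a sketch using the rounded values printed above, so agreement is to rounding error; the variable names are illustrative.

```python
import math

# Values copied from the Analysis of Variance table above.
sse, df_error = 0.002058, 30    # Error row: Adj SS and DF
sst, df_total = 0.150076, 35    # Total row: Adj SS and DF

s = math.sqrt(sse / df_error)                        # standard error of the estimate, S
r_sq = 1 - sse / sst                                 # R-sq
r_sq_adj = 1 - (sse / df_error) / (sst / df_total)   # R-sq(adj)

print(f"S = {s:.5f}, R-sq = {100 * r_sq:.2f}%, R-sq(adj) = {100 * r_sq_adj:.2f}%")
```

The same arithmetic applies to each of the later regression outputs.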
© 2016, Jeffrey S. Simonoff
Although this model fits the data reasonably well, it does suffer from a difficulty — it
does not address the time ordering of the data. In fact, the residuals from this model exhibit
autocorrelation, as can be seen from this time series plot:
The Durbin–Watson statistic supports this, as it equals 1.02, well below the value of 2 expected when there is no autocorrelation; so does the runs test, although its evidence is somewhat weaker:
Runs test for SRES1
Runs above and below K = 0
The observed number of runs = 13
The expected number of runs = 18.7778
20 observations above K, 16 below
P-value = 0.048
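The two diagnostics just quoted can be sketched in a few lines. The functions below are a minimal illustration, not Minitab's implementation: `durbin_watson` applies the usual definition to a list of residuals (the residuals themselves are not listed here, so the value 1.02 is not recomputed), and `runs_test` uses the standard normal approximation, which appears to reproduce the expected number of runs and the p-value shown above from the reported counts.

```python
import math

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

def runs_test(n_runs, n_above, n_below):
    """Normal-approximation runs test: returns (expected runs, z, two-sided p-value)."""
    n = n_above + n_below
    pairs = 2 * n_above * n_below
    expected = pairs / n + 1
    variance = pairs * (pairs - n) / (n ** 2 * (n - 1))
    z = (n_runs - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return expected, z, p_value

# Counts reported in the runs test output above.
expected, z, p_value = runs_test(n_runs=13, n_above=20, n_below=16)
print(f"expected runs = {expected:.4f}, z = {z:.2f}, p = {p_value:.3f}")
```

Fewer runs than expected (a negative z) corresponds to residuals that stay on one side of zero for stretches at a time, which is the pattern positive autocorrelation produces.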
The ACF plot of the standardized residuals also indicates autocorrelation:
As we’ve discussed, one approach for handling autocorrelation is to use a lagged version
of the target variable as a predictor (Lagged logGpc, saying that the previous year’s gas
consumption goes a long way to predicting this year’s consumption, due to basic stability
in the process). Also, in thinking about the dynamics of how people decide to use their
automobiles, it seems reasonable to consider also using a lagged version of the price index
of gasoline, Lagged logPG (saying that consumption might be affected not only by current
price, but previous price, because of the perception of people that prices are increasing
or decreasing). Generally speaking, lagged versions of predictors are not used
specifically to address autocorrelation (as the lagged target variable often is);
rather, they are used when doing so makes sense in context.
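Constructing a lagged variable amounts to shifting the series by one time period, which leaves the first year without a value. The sketch below uses a hypothetical helper `lag` and the first four logGpc values from the data listing later in this handout; in a statistics package this would normally be a built-in lag or shift operation.

```python
def lag(values, k=1):
    """Shift a time-ordered list back by k periods; the first k entries have no predecessor."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [None] * k + list(values[:len(values) - k])

years = [1960, 1961, 1962, 1963]
log_gpc = [0.85772, 0.85583, 0.86806, 0.87579]   # first four logGpc values from the data listing

lagged_log_gpc = lag(log_gpc)
for year, current, previous in zip(years, log_gpc, lagged_log_gpc):
    print(year, current, previous)
```

The missing lagged value for 1960 is why the regressions that use lagged variables are based on 35 observations rather than 36.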
Here is a scatter plot of logged per capita consumption on the previous year’s logged
per capita consumption. We can see that there is a strong relationship, although it is
apparently weaker for the higher values. I haven’t bothered to give the plot of logged per
capita consumption versus previous year’s price index, since it looks very similar to the one
for current year’s price index that we saw earlier.
Here is output for a regression using these variables, along with logPG, as predictors (I
could have used a best subsets regression here, but it’s clear that all three variables provide
a lot of predictive power):
Analysis of Variance

Source           DF    Adj SS    Adj MS  F-Value  P-Value
Regression        3  0.127284  0.042428   637.51    0.000
  Lagged logGpc   1  0.072538  0.072538  1089.94    0.000
  logPG           1  0.005443  0.005443    81.79    0.000
  Lagged logPG    1  0.004972  0.004972    74.71    0.000
Error            31  0.002063  0.000067
Total            34  0.129347

Model Summary

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0081580  98.40%     98.25%      97.96%

Coefficients

Term              Coef  SE Coef  T-Value  P-Value    VIF
Constant       -0.0529   0.0306    -1.73    0.094
Lagged logGpc   1.0751   0.0326    33.01    0.000   2.34
logPG          -0.3278   0.0363    -9.04    0.000  45.42
Lagged logPG    0.2902   0.0336     8.64    0.000  39.41
Regression Equation
logGpc = -0.0529 + 1.0751 Lagged logGpc - 0.3278 logPG + 0.2902 Lagged logPG
The model fits very well, and the autocorrelation has apparently been removed. We
might also consider a further simplification of the model. Note that the estimated slopes
for logged price and lagged logged price are very similar in magnitude and of opposite sign;
that suggests that replacing the two variables with their difference could provide similar fit,
and would be easily interpretable as implying that it is simply the change in price, along
with the previous year’s consumption, that are related to current consumption. A partial
F-test of this hypothesis ($\beta_{\text{logPG}} = -\beta_{\text{Lagged logPG}}$), however, does not support this simplification
($F = 22.8$, $p < .0001$), so we will not pursue this further.
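The partial F-test compares the error sum of squares of the full model with that of the reduced model in which the two price variables are replaced by their difference (one restriction). The sketch below shows the arithmetic. Note the assumption: the reduced model's error sum of squares is not reported in this handout, so the value used here is back-calculated from the reported F = 22.8 purely to illustrate the formula.

```python
def partial_f(sse_reduced, sse_full, n_restrictions, df_error_full):
    """Partial F statistic comparing a reduced (restricted) model with the full model."""
    return ((sse_reduced - sse_full) / n_restrictions) / (sse_full / df_error_full)

# Full-model values from the Analysis of Variance table above.
sse_full, df_error_full = 0.002063, 31

# ASSUMPTION: not a reported value; implied by the reported F = 22.8 with one restriction.
reported_f = 22.8
sse_reduced = sse_full * (1 + reported_f / df_error_full)

f_stat = partial_f(sse_reduced, sse_full, n_restrictions=1, df_error_full=df_error_full)
print(f"implied reduced-model SSE = {sse_reduced:.6f}, F = {f_stat:.1f}")
```

The statistic is referred to an F distribution on 1 and 31 degrees of freedom.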
A time series plot of the residuals, however, shows that there is a clear outlier:
This outlier corresponds to 1991:
Row  Year   logGpc     SRES2       HI2     COOK2
  1  1960  0.85772         *         *         *
  2  1961  0.85583  -2.19803  0.166530  0.241328
  3  1962  0.86806   0.02270  0.174651  0.000027
  4  1963  0.87579  -0.80749  0.145953  0.027857
  5  1964  0.89125   0.07544  0.130924  0.000214
  6  1965  0.90611   0.61353  0.113090  0.011999
  7  1966  0.92591   0.88899  0.089974  0.019534
  8  1967  0.93752  -0.14895  0.073089  0.000437
  9  1968  0.96368   1.35005  0.068043  0.033268
 10  1969  0.98779   1.19743  0.067586  0.025983
 11  1970  1.00721   0.01578  0.094891  0.000007
 12  1971  1.02345  -0.61246  0.127361  0.013687
 13  1972  1.03491  -1.30374  0.157625  0.079513
 14  1973  1.05138   0.80589  0.135470  0.025443
 15  1974  1.02465  -0.97127  0.237654  0.073521
 16  1975  1.03286   0.15422  0.051697  0.000324
 17  1976  1.04569   0.34341  0.060323  0.001893
 18  1977  1.05460   0.08992  0.069092  0.000150
 19  1978  1.07060   0.77236  0.081818  0.013289
 20  1979  1.04468   0.09022  0.221465  0.000579
 21  1980  0.99919  -1.22083  0.316834  0.172805
 22  1981  0.99262   1.04975  0.148019  0.047863
 23  1982  0.99460  -0.55284  0.118739  0.010295
 24  1983  1.01066   1.50700  0.103501  0.065548
 25  1984  1.01604   0.23891  0.080059  0.001242
 26  1985  1.01414  -0.34561  0.073825  0.002380
 27  1986  1.04995  -0.15029  0.292040  0.002329
 28  1987  1.05783   0.63400  0.049303  0.005211
 29  1988  1.05873  -0.78952  0.063589  0.010582
 30  1989  1.06109   0.86471  0.058482  0.011611
 31  1990  1.05335   0.55444  0.084320  0.007077
 32  1991  1.03261  -3.51351  0.076899  0.257095
 33  1992  1.04080   0.58789  0.068258  0.006330
 34  1993  1.04596   0.00685  0.068665  0.000001
 35  1994  1.04710  -0.29816  0.065460  0.001557
 36  1995  1.05415   0.63187  0.064769  0.006913
The year 1991 saw a serious recession and the first Gulf War (Operation Desert
Storm), so apparently gasoline consumption decreased during this time period. Since this
case is an outlier, we could contemplate removing it and reanalyzing the data. Unfortunately,
if we do that, we will disturb the natural time ordering in the data. An alternative approach is to
substitute a “reasonable” value, such as the average of the two neighboring values, for the
outlying value, and then reanalyze the entire adjusted data set. This is admittedly an ad hoc
solution, and more complex (and theoretically justified) substitution methods are possible.
Still, very simple techniques like this can work quite adequately.
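The neighbor-average fill-in is easy to sketch. The helper name `neighbor_average_fill` is illustrative, and the three logGpc values are taken from the data listing above (1990 through 1992); the function assumes the outlying point is not the first or last observation in the series.

```python
def neighbor_average_fill(series, index):
    """Return a copy of series with series[index] replaced by the mean of its two neighbors."""
    if not 0 < index < len(series) - 1:
        raise ValueError("index must refer to an interior observation")
    filled = list(series)
    filled[index] = (series[index - 1] + series[index + 1]) / 2
    return filled

years = [1990, 1991, 1992]
log_gpc = [1.05335, 1.03261, 1.04080]   # logGpc values from the data listing above

adjusted = neighbor_average_fill(log_gpc, years.index(1991))
print(f"fill-in value for 1991: {adjusted[1]:.6f}")
```

Only the outlying value changes; the original series is left untouched so the two analyses can be compared.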
For these data, the gas consumption of 1.03261 is too low, relative to the values of 1.05335
for 1990 and 1.0408 for 1992, so the averaged value of 1.04708 is substituted (of course, when
discussing our results, we must note that they no longer apply to 1991, or to future years that
might be like 1991, such as recession years). Here is the resulting regression output:
Analysis of Variance

Source           DF    Adj SS    Adj MS  F-Value  P-Value
Regression        3  0.128926  0.042975   912.79    0.000
  Lagged logGpc   1  0.073058  0.073058  1551.75    0.000
  logPG           1  0.005653  0.005653   120.06    0.000
  Lagged logPG    1  0.005227  0.005227   111.02    0.000
Error            31  0.001460  0.000047
Total            34  0.130386

Model Summary

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0068616  98.88%     98.77%      98.52%

Coefficients

Term              Coef  SE Coef  T-Value  P-Value    VIF
Constant       -0.0565   0.0258    -2.19    0.036
Lagged logGpc   1.0789   0.0274    39.39    0.000   2.34
logPG          -0.3341   0.0305   -10.96    0.000  45.42
Lagged logPG    0.2976   0.0282    10.54    0.000  39.41
Regression Equation
logGpc = -0.0565 + 1.0789 Lagged logGpc - 0.3341 logPG + 0.2976 Lagged logPG
The model fits slightly better, but the coefficients have changed little. More importantly,
there is no autocorrelation, and no outliers are apparent:
Runs test for SRES2
Runs above and below K = 0
The observed number of runs = 22
The expected number of runs = 18.1429
20 observations above K, 15 below
P-value = 0.176
Row  Year   logGpc     SRES2       HI2     COOK2
  1  1960  0.85772         *         *         *
  2  1961  0.85583  -2.55989  0.166530  0.327330
  3  1962  0.86806   0.09034  0.174651  0.000432
  4  1963  0.87579  -0.90839  0.145953  0.035255
  5  1964  0.89125   0.13495  0.130924  0.000686
  6  1965  0.90611   0.78300  0.113090  0.019544
  7  1966  0.92591   1.09185  0.089974  0.029467
  8  1967  0.93752  -0.15210  0.073089  0.000456
  9  1968  0.96368   1.61432  0.068043  0.047567
 10  1969  0.98779   1.42412  0.067586  0.036752
 11  1970  1.00721  -0.00708  0.094891  0.000001
 12  1971  1.02345  -0.76761  0.127361  0.021499
 13  1972  1.03491  -1.59821  0.157625  0.119489
 14  1973  1.05138   0.93727  0.135470  0.034414
 15  1974  1.02465  -1.09989  0.237654  0.094282
 16  1975  1.03286   0.12996  0.051697  0.000230
 17  1976  1.04569   0.33488  0.060323  0.001800
 18  1977  1.05460   0.02913  0.069092  0.000016
 19  1978  1.07060   0.82485  0.081818  0.015157
 20  1979  1.04468   0.10898  0.221465  0.000845
 21  1980  0.99919  -1.44476  0.316834  0.242012
 22  1981  0.99262   1.16183  0.148019  0.058629
 23  1982  0.99460  -0.81402  0.118739  0.022320
 24  1983  1.01066   1.64740  0.103501  0.078331
 25  1984  1.01604   0.14236  0.080059  0.000441
 26  1985  1.01414  -0.54445  0.073825  0.005907
 27  1986  1.04995  -0.45074  0.292040  0.020952
 28  1987  1.05783   0.63208  0.049303  0.005180
 29  1988  1.05873  -1.08116  0.063589  0.019844
 30  1989  1.06109   0.91787  0.058482  0.013083
 31  1990  1.05335   0.55781  0.084320  0.007163
 32  1991  1.04708  -2.15160  0.076899  0.096413
 33  1992  1.04080   0.55002  0.068258  0.005541
 34  1993  1.04596  -0.14783  0.068665  0.000403
 35  1994  1.04710  -0.50620  0.065460  0.004487
 36  1995  1.05415   0.60268  0.064769  0.006289
The residual versus fitted plot gives a slight indication of structure, but given the very
high $R^2$ here, it is unlikely that any corrective action would make much of a difference. Note
that 1980 and 1986 are potential leverage points, which we will not pursue here.
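One way to see why 1980 and 1986 stand out is to compare the HI2 column of the listing above with a leverage cutoff. The cutoff used below, 2.5(p + 1)/n, is a common rule of thumb and is an assumption here, since this handout does not state which guideline was applied. The sketch also checks the general fact that the leverage values sum to p + 1.

```python
# Leverage values (HI2) for 1961-1995, copied from the listing above.
# 1960 is omitted because it has no lagged values and is not used in the fit.
hi = [
    0.166530, 0.174651, 0.145953, 0.130924, 0.113090, 0.089974, 0.073089,
    0.068043, 0.067586, 0.094891, 0.127361, 0.157625, 0.135470, 0.237654,
    0.051697, 0.060323, 0.069092, 0.081818, 0.221465, 0.316834, 0.148019,
    0.118739, 0.103501, 0.080059, 0.073825, 0.292040, 0.049303, 0.063589,
    0.058482, 0.084320, 0.076899, 0.068258, 0.068665, 0.065460, 0.064769,
]
years = list(range(1961, 1996))

p = 3          # predictors: Lagged logGpc, logPG, Lagged logPG
n = len(hi)    # 35 observations used in the fit

# ASSUMED guideline (not stated in the text): flag leverage above 2.5 * (p + 1) / n.
cutoff = 2.5 * (p + 1) / n
flagged = [year for year, h in zip(years, hi) if h > cutoff]

print(f"sum of leverages = {sum(hi):.4f}, cutoff = {cutoff:.4f}, flagged years = {flagged}")
```

Under this assumed cutoff the flagged years are the two mentioned in the text; a different guideline could flag a different set.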
This new gas demand function has an appealing intuitive justification. Given the last two
years’ prices, gasoline demand is directly related to last year’s demand (1% higher demand last year
is associated with 1.08% estimated expected increase this year). Given last year’s demand
and price, this year’s demand is inversely related to this year’s price, which is the inverse
demand / price relationship expected from economic theory (1% higher price is associated
with .33% estimated expected decrease in demand). Further, given this year’s price and last
year’s demand, this year’s demand is directly related to last year’s price (1% higher price
last year is associated with .30% estimated expected increase in demand this year). This
also makes sense, since a higher value of last year’s price, given this year’s price is fixed,
is consistent with a decreasing price trend, which would encourage additional consumption.
The standard error of the estimate implies that per capita gas demand can be predicted to
within roughly 3% ($10^{0.013724} = 1.03$, where the exponent is approximately twice the standard error of the estimate).
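The percentage statements in this paragraph can be checked numerically. The sketch below uses the coefficients and S from the fill-in regression output, and takes the logs to be base 10, as the back-transformation in the text suggests; treating the prediction bound as plus or minus two standard errors is the usual rough approximation and is an assumption here.

```python
s = 0.0068616   # standard error of the estimate from the Model Summary

# Fitted slopes from the regression equation above.
slopes = {
    "last year's demand": 1.0789,
    "this year's price": -0.3341,
    "last year's price": 0.2976,
}

# A rough +/- 2S band on the log10 scale becomes a multiplicative factor.
factor = 10 ** (2 * s)
print(f"predictions within a factor of about {factor:.3f}")

# In a log-log model each slope is an elasticity: the percentage change in
# expected demand associated with a 1% increase in that predictor.
pct_change = {name: 100 * (1.01 ** b - 1) for name, b in slopes.items()}
for name, pct in pct_change.items():
    print(f"1% higher {name}: {pct:+.2f}% change in expected demand")
```

Because the changes are small, each percentage is very close to the slope itself, which is why the coefficients can be read directly as elasticities.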
The fill–in method for handling an outlier used here has two limitations that are worth
noting. First, adjusting the target (y) value will not fix leverage points, as they are characterized by unusual predictor values, not unusual target values. Second, unusual observations
often occur in “patches” in time series data, reflecting a temporary change in the underlying
structure of the process; a constant fill–in for four or five (say) consecutive time periods is
obviously not accurately reflecting what we think the series really should be.
An alternative that addresses both of these points (and is thus the only one of the two
approaches suitable for handling leverage points) is to create an indicator variable that defines the unusual
observation or patch of unusual observations, equaling one for all observations in the patch,
and zero otherwise (isolated unusual observations that are not in a consecutive patch of time
points have a 0/1 variable defined for each of them). Including this variable in the regression
will effectively remove the influence of the unusual values from the regression fit. Here is
how this works for these data (with Year1991 defining only 1991).
Analysis of Variance

Source           DF    Adj SS    Adj MS  F-Value  P-Value
Regression        4  0.128105  0.032026   773.86    0.000
  Lagged logGpc   1  0.073260  0.073260  1770.20    0.000
  logPG           1  0.005820  0.005820   140.63    0.000
  Lagged logPG    1  0.005415  0.005415   130.85    0.000
  Year1991        1  0.000822  0.000822    19.85    0.000
Error            30  0.001242  0.000041
Total            34  0.129347

Model Summary

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0064331  99.04%     98.91%           *

Coefficients

Term               Coef  SE Coef  T-Value  P-Value    VIF
Constant        -0.0604   0.0242    -2.49    0.018
Lagged logGpc    1.0830   0.0257    42.07    0.000   2.35
logPG           -0.3407   0.0287   -11.86    0.000  45.89
Lagged logPG     0.3054   0.0267    11.44    0.000  40.06
Year1991       -0.02983  0.00670    -4.46    0.000   1.05

Regression Equation

logGpc = -0.0604 + 1.0830 Lagged logGpc - 0.3407 logPG + 0.3054 Lagged logPG
         - 0.02983 Year1991
The fitted coefficients are virtually the same as when the fill–in method is used. One
additional piece of information from this approach is the coefficient for Year1991: given
previous year’s gasoline consumption, and this and last year’s gasoline price index, the
observed logged per capita consumption for 1991 is seen to have been .0298 lower than
expected (translating to a demand roughly 6.6% lower than expected that year), and this
amount is significantly different from zero (p < .001). Thus, the t–test for the indicator
variable is a formal test of whether the point is an outlier (but remember that it would not
necessarily be significant for a leverage point).
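A minimal sketch of the indicator approach: build the 0/1 column that singles out 1991, then back-transform its fitted coefficient to a percentage. The variable names are illustrative, and the logs are taken to be base 10, consistent with the 6.6% figure quoted above.

```python
years = list(range(1960, 1996))

# 0/1 indicator: one for the unusual observation, zero for every other year.
year1991 = [1 if year == 1991 else 0 for year in years]

coef_year1991 = -0.02983         # fitted coefficient from the output above
ratio = 10 ** coef_year1991      # observed demand relative to expected demand
pct_lower = 100 * (1 - ratio)

print(f"{sum(year1991)} year flagged; demand about {pct_lower:.1f}% lower than expected")
```

For a patch of consecutive unusual years, the same list comprehension would test membership in a set of years instead of equality with a single year.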
Two issues, related to software in general and Minitab in particular, arise if you use this method:
1. You should set the standardized residual for any observations identified by a single
0/1 variable equal to 0 (Minitab sets them equal to *, since they are technically 0/0).
Note that the leverage value for any such observation will equal 1.00, but that is not
an issue, since the observation is intended to be omitted. You should not include any
indicator variables constructed for this purpose in the total number of predictors p
when determining a guideline for extreme leverage points.
2. If you are doing model selection (using best subsets, for example), you must be careful
to force the indicator variables that define the unusual observations into all models, to
make sure that those points are effectively omitted from the sample. This is particularly
important for a leverage point, since its corresponding indicator will not necessarily be
identified as an important predictor by best subsets, even if its inclusion could greatly
change the fitted regression.
Also, if you use this method to account for unusual observations, you should not count the
indicator variables in the number of predictors p for measures like $C_p$, $AIC_C$, and so on, and
you should not count the observations being accounted for in the value of n used in those
measures.