Accuracy and Accuracy-Implication Metrics for Intermittent Demand

Issue 4 June 2006
FORESIGHT
The International Journal of Applied Forecasting
A Publication of the International Institute of Forecasters (IIF)
by John Boylan and Aris Syntetos
Preview: John and Aris distinguish between forecast-accuracy metrics, which measure the errors resulting
from a forecast method, and accuracy-implication metrics, which measure the achievement of the
organization’s stock-holding and service-level goals. Both measurements are important. The correct choice
of a forecast-accuracy metric depends on the organization’s inventory rules and on whether accuracy is to
be gauged for a single item or across a range of items. The authors recommend specific accuracy and
accuracy-implication metrics for each context.
John Boylan is Professor of Management Science at Buckinghamshire Chilterns University College.
Previously, he worked in OR at Rolls-Royce and at the Unipart Group. His research and publications
(Journal of the OR Society, International Journal of Production Economics, International Journal of
Forecasting) have increasingly focused on the challenges of forecasting slow, intermittent and
lumpy demands.
Aris Syntetos is a Lecturer in Operations Management and Operational Research at the University of
Salford, UK. His research interests include intermittent-demand forecasting and the interface between
forecasting and stock control. On behalf of the Salford Business School, he is currently involved in
two inventory-management projects, one with an engineering firm and one with an international
wholesaling company.
In considering forecasting-accuracy metrics for
intermittent demand, we should begin by looking
at the inventory method. Depending on that
method, we may need estimates of mean demand,
variance of demand, percentiles of demand, and
probabilities of high-demand values.
When a forecast of mean demand is needed, the
accuracy of the forecast for an individual item can
be judged by the mean absolute error (MAE). To
assess forecast accuracy across a range of items,
a scale-independent metric, such as the ratio of
the mean absolute error to the mean demand, is
appropriate. Alternatively, the geometric mean
absolute error (GMAE) may be used.
If forecasts of percentiles of demand or
probabilities of high-demand values are needed,
then an appropriate chi-square test should be
used, concentrating on the upper end of the
distribution (for example, the 95th percentile).
No matter which inventory system is used, the
accuracy-implication metrics of stock-holding and
service levels should always be considered.
Introduction
In the February 2006 edition of Foresight, Kenneth Kahn
poses the following question: “Should we view forecast
accuracy as an end in itself or rather as a means to an end?”
(Kahn, 2006, p. 25). Most commonly, intermittent-demand
forecasting is a means toward the twin ends of lowering
stock-holding costs (including costs of stock obsolescence)
and maintaining or improving stock availability (“service
level”). The achievement of these goals depends not only
on the accuracy of the forecasting method but also on the
suitability of the inventory rules determining the timing and
size of orders. The relationship between these factors and
the system’s goals is shown in Figure 1.
If we regard the design and implementation of a stock-management system as a means toward an end, then the
outcome measures on the right-hand side of Figure 1 should
not be ignored. These measures ensure that forecasters and
inventory managers do not lose sight of the system’s purpose.
Accuracy-Implication Metrics
Most managers would regard stock-holding costs and
service level as outcome measures rather than accuracy
measures. But if we keep the inventory rules fixed and try
different forecasting methods, these outcome measures
become accuracy-implication measures. The term
“accuracy implication” is used instead of “accuracy”
because metrics such as service level do not measure the
accuracy of a forecasting method, but they do measure the
implication of its accuracy under a given inventory rule.
A good example of this approach is the study by Eaves and
Kingsman (2004). They estimate the effect of forecasting-method choice on 18,750 line items, including intermittent
items. Importantly, they assume a constant service-level
requirement, allowing accuracy implications to be assessed.
For example, for quarterly data, they find that using single
exponential smoothing instead of Croston’s method
requires an additional stock investment of £1.28m.
Therefore, instead of simply reporting that Croston’s
method is more accurate than smoothing, they show the
cost implication of making the wrong choice.
Figure 1. Relationship Among Forecasting, Inventory Rules and Performance Measures
[Diagram: Forecasting Method and Inventory Rules, within the Stock Management System, lead to the outcome measures Stock-holding Costs and Service Level on the right-hand side.]
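To make the accuracy-implication idea concrete, the following is a minimal sketch in Python, not the procedure used by Eaves and Kingsman. It holds a deliberately simple inventory rule fixed and reports the average stock held and the share of demand served for two sets of forecasts. The function name `accuracy_implications`, the `cover` parameter, the order-up-to rule, zero lead time, and lost sales are all assumptions made for illustration.

```python
def accuracy_implications(demand, forecasts, cover=2.0):
    """Return (average stock held, share of demand served) under a fixed rule.

    Assumed rule (illustrative only): at the start of each period, stock is
    raised to cover * forecast with zero lead time; demand then occurs and
    any unmet demand is lost.
    """
    stock = 0.0
    total_held = 0.0
    total_served = 0.0
    for actual, forecast in zip(demand, forecasts):
        stock = max(stock, cover * forecast)  # order up to the target level
        served = min(stock, actual)           # cannot serve more than is on hand
        total_served += served
        stock -= served
        total_held += stock                   # stock carried at the end of the period
    return total_held / len(demand), total_served / sum(demand)


demand = [0, 0, 4, 0, 0, 2]          # made-up intermittent series
method_a = [1.0] * len(demand)       # hypothetical forecasts from one method
method_b = [2.0] * len(demand)       # hypothetical forecasts from another method

print(accuracy_implications(demand, method_a))
print(accuracy_implications(demand, method_b))  # (3.0, 1.0)
```

In this toy case the second set of forecasts serves all demand but carries more than twice the average stock of the first, which is the kind of trade-off that accuracy-implication metrics are meant to expose.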
What Types of Forecasts Are Required?
To identify the most appropriate accuracy metrics, we must
first ask what is to be forecast. The variables to be forecast
depend on the inventory method. For example, suppose
that we use a periodic (R, s, S) inventory rule. This means
that we review the inventory system every R periods, and
when the stock level drops to a certain reorder point (called
s) or lower, then we order enough stock to take us back up
to the order-up-to level (called S). Some of the most effective
methods for finding s and S, including Naddor’s heuristic
(Naddor, 1975), require only estimates of the mean and
variance of demand.
To take a second example, suppose that we use an (s, Q)
inventory method. In this case, we review the stock
continuously and place an order of fixed quantity Q if the
stock drops to the reorder level s or below. This system is
also known as (r, Q). In some systems, we wish to ensure
that there is no more than a 10% chance of stockout during
the replenishment cycle (review time plus lead time). For
this case, we need to estimate the 90th percentile of the
distribution of demand over the replenishment cycle, rather
than the mean and the variance.
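As a rough illustration of what a percentile estimate means here, the sketch below takes the 90th percentile directly from a history of replenishment-cycle demands using the nearest-rank rule. This empirical approach and the function name `empirical_percentile` are assumptions for illustration; in practice the percentile would come from the forecasting method being evaluated.

```python
def empirical_percentile(values, percent):
    """Nearest-rank percentile: the smallest observed value such that at least
    `percent` percent of the observations are at or below it.

    `percent` is an integer between 1 and 100.
    """
    ordered = sorted(values)
    rank = -(-percent * len(ordered) // 100)  # ceiling division with integers
    return ordered[max(rank, 1) - 1]


# Made-up demand totals over ten past replenishment cycles.
cycle_demands = [0, 0, 2, 0, 10, 0, 1, 0, 0, 3]
print(empirical_percentile(cycle_demands, 90))  # 3
```

With a reorder level of 3, demand would have exceeded the stock on hand in only one of the ten cycles.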
Another alternative is that we wish to ensure that at least
90% of demand is satisfied directly off the shelf. Please
note that this is not the same as a 90% chance of no
stockouts; this point is discussed in greater detail by Silver, Pyke,
& Peterson (1998, pp. 266-270). In this case, we need to
estimate the probabilities of any demands that exceed the
reorder level. Here, instead of estimates of percentiles, we
want estimates of the probabilities of high demand.
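The difference between the two service measures can be seen in a small made-up example. The sketch below assumes that the same amount of stock is available at the start of every replenishment cycle and that unmet demand is lost; the function names are hypothetical.

```python
def no_stockout_share(cycle_demands, stock):
    """Share of cycles in which demand did not exceed the available stock."""
    return sum(1 for d in cycle_demands if d <= stock) / len(cycle_demands)


def fill_rate(cycle_demands, stock):
    """Share of total demand satisfied directly off the shelf."""
    served = sum(min(d, stock) for d in cycle_demands)
    return served / sum(cycle_demands)


cycle_demands = [0, 0, 2, 0, 10, 0, 1, 0, 0, 3]  # made-up data
print(no_stockout_share(cycle_demands, stock=3))  # 0.9
print(fill_rate(cycle_demands, stock=3))          # 0.5625
```

Nine cycles out of ten pass without a stockout, yet only about 56% of the units demanded are served, because the single stockout coincides with the one large demand.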
Should We Forecast the Entire Demand
Distribution?
Willemain (this issue) suggests that the general problem
is to forecast the whole distribution of demand. It is true
that this is the most general statement of the problem.
However, as we have already noted, some inventory systems
require estimates of only the mean and variance. For other
systems, estimates of high percentiles and probabilities of
high-demand values are needed; even in these cases, we
do not need a forecast of the entire distribution.
Measures based on the entire distribution can be
misleading. A good overall “goodness of fit” statistic may
result from excellent forecasts of the chances of low-demand values, which can mask poor forecasts of the
chances of high-demand values. It may be that for other
applications (for example, revenue forecasts), forecasts of
low percentiles are required (Willemain et al., 2004).
However, for inventory calculations, we suggest that
attention be restricted to the upper end of the distribution
(the 90th or 95th percentiles).
In summary, percentile forecasts and estimates of probabilities
of demand are required for some inventory systems. For other
systems, we need forecasts of the mean and variance of
demand. All these quantities are features of the overall
distribution of demand. It is the accuracy of determining the
key quantities (for example, mean demand, 90th percentile)
required for the inventory rules that is important, rather than
the accuracy across the entire demand distribution.
Estimates of Mean Demand
When it is necessary to forecast the mean demand level,
there are two issues to address:
1. What is the best forecasting method for a particular stock-keeping unit (SKU)?
2. What is the best forecasting method across a range of SKUs?
The second problem is more common in practice, but
answering the first question gives us some insight into
how to answer the second.
The case of a single SKU
For a single SKU, we may use a simple measure such as
the mean absolute error (MAE) to measure a method’s
accuracy in forecasting mean demand. (The mean absolute
error is calculated by noting each of the errors and treating
them all as positive in sign, and then averaging them.)
The mean squared error is not suitable for intermittent-
demand items because it is sensitive to the occurrence of
very high forecast errors.
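A short Python sketch shows the calculation and the sensitivity of the mean squared error to one large error. The data are made up and the function names are ours.

```python
def mean_absolute_error(actuals, forecasts):
    """Average of the forecast errors with all signs treated as positive."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


def mean_squared_error(actuals, forecasts):
    """Average of the squared forecast errors."""
    return sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals)


actuals = [0, 0, 6, 0]    # one demand occurrence in four periods
forecasts = [1, 1, 1, 1]  # constant forecast of mean demand

print(mean_absolute_error(actuals, forecasts))  # 2.0
print(mean_squared_error(actuals, forecasts))   # 7.0
```

The single error of 5 contributes 25 of the 28 units of squared error, so it dominates the mean squared error far more than it does the MAE.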
The accuracy of a method’s mean-demand forecasts can
be compared with another method by calculating the
percentage of series for which it has a lower MAE. This
approach is known as the Percentage Better method, which
is discussed in more detail by Boylan (2005). The approach
can be easily extended to the comparison of more than two
methods; in that case, it would be termed Percentage Best.
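A minimal sketch of the Percentage Better calculation follows. It assumes that the MAE of each method has already been computed for every series, and it counts a tie as "not better"; that tie rule and the function name are assumptions.

```python
def percentage_better(mae_method_a, mae_method_b):
    """Percentage of series for which method A has a strictly lower MAE than method B."""
    wins = sum(1 for a, b in zip(mae_method_a, mae_method_b) if a < b)
    return 100.0 * wins / len(mae_method_a)


# Made-up MAEs for four series under two methods.
mae_a = [1.0, 2.0, 0.5, 3.0]
mae_b = [1.5, 1.0, 0.8, 4.0]
print(percentage_better(mae_a, mae_b))  # 75.0
```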
A limitation of the Percentage Better method is that,
although it summarizes the frequency with which one
method outperforms another, it does not inform the user
of the degree of improvement in accuracy. Averaging the
values of mean absolute error across series would seem to
be the obvious answer. Unfortunately, such measures can
be dominated by a small number of SKUs with large errors.
This problem is known as scale dependence.
For non-intermittent data, an effective way of addressing
the scale-dependence problem is to calculate the mean
absolute percentage error (MAPE). However, the MAPE
measure fails for intermittent data because the denominator
(actual value) is frequently zero. Amending the
denominator to unity when the actual value is zero, as
suggested by Jim Hoover in this section, is a pragmatic
idea, but it is without any foundation in statistical theory.
Another option mentioned by Hoover is the symmetric
MAPE (sMAPE), in which the numerator is the absolute
value of the actual minus forecast, and the denominator is
the average of the actual and forecast values. However,
whenever the actual value is zero and the forecast is not, the sMAPE entry will
have a value of two, regardless of the size of the forecast. If the actual
is zero and our forecast is 1, then
sMAPE = |0 - 1| / ((0 + 1) / 2) = 2.
If our forecast is 100, then
sMAPE = |0 - 100| / ((0 + 100) / 2) = 2.
Therefore, sMAPE cannot be recommended because when
actual demand is zero, it does not discriminate between
forecasting methods.
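The same arithmetic can be checked for any positive forecast with a few lines of Python. The function name `smape_term` is ours, and the sketch computes a single-period entry only.

```python
def smape_term(actual, forecast):
    """Single-period sMAPE entry: |actual - forecast| divided by the average of the two.

    Undefined (division by zero) when both actual and forecast are zero.
    """
    return abs(actual - forecast) / ((actual + forecast) / 2)


for forecast in (0.5, 1, 100):
    print(forecast, smape_term(0, forecast))  # the entry is 2.0 every time
```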
Scale-independent metrics
Scale-independent metrics are required to assess forecast
accuracy across a range of items. For intermittent data, a good
scale-independent measure is the ratio of the mean absolute
error to the mean demand, as suggested by Jim Hoover. A
variation on this approach is to compare the accuracy of one
method to another by taking the ratio of mean absolute errors.
Alternatively, instead of MAEs, we can compute the ratio of
the geometric root mean square error (GRMSE) of one method
to that of another. Although this metric is more complex, it is
even more robust (less sensitive) than the MAE regarding
outlying observations. Fildes (1992) showed that, in the
GRMSE calculation, the distorting effect of large errors
cancels out. For details on the application of the GRMSE to
intermittent-demand items, see Syntetos and Boylan (2005).
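The two MAE-based ratios described above can be sketched as follows. The data are made up, and the function names are ours rather than taken from the articles cited.

```python
def mean_absolute_error(actuals, forecasts):
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


def mae_to_mean_demand(actuals, forecasts):
    """Ratio of the MAE to the mean demand of the series (scale-independent)."""
    mean_demand = sum(actuals) / len(actuals)
    return mean_absolute_error(actuals, forecasts) / mean_demand


def relative_mae(actuals, forecasts_a, forecasts_b):
    """Ratio of method A's MAE to method B's MAE; below 1 favors method A."""
    return mean_absolute_error(actuals, forecasts_a) / mean_absolute_error(actuals, forecasts_b)


small_sku = [0, 4, 0, 0]
large_sku = [0, 400, 0, 0]  # same demand pattern at 100 times the scale

print(mae_to_mean_demand(small_sku, [1] * 4))    # 1.5
print(mae_to_mean_demand(large_sku, [100] * 4))  # 1.5
print(relative_mae(small_sku, [1] * 4, [2] * 4))  # 0.75
```

The plain MAEs of the two SKUs are 1.5 and 150, so a simple average across them would be dominated by the larger item, whereas the ratio to mean demand treats both alike.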
In his accompanying article, Rob Hyndman correctly notes
that the geometric root mean square error is identical to
the geometric mean absolute error (GMAE). Because the
GMAE is easier to calculate than the GRMSE, and delivers
the same result, we will use it in the example that follows.
The geometric mean is an alternative to averaging by the
arithmetic mean, in which we multiply all observations
and then find the nth root. For example, suppose we have
three observations: 1, 4, and 16. The geometric mean is 4,
as this number is the cube root of 64 (= 1 x 4 x 16).
This approach can also be applied to the absolute forecast
errors. Suppose we have four forecast errors: -3, 1, -5, and
4. The absolute errors are 3, 1, 5, and 4. Then the geometric
mean is the fourth root of 60 (= 3 x 1 x 5 x 4), namely
2.783. This is the geometric mean absolute error, and it is
identical to the geometric root mean square error.
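The worked example above, and the equivalence of the two measures, can be verified with a short Python sketch. The function names are ours, and `math.prod` requires Python 3.8 or later.

```python
import math


def gmae(errors):
    """Geometric mean absolute error: nth root of the product of absolute errors."""
    return math.prod(abs(e) for e in errors) ** (1 / len(errors))


def grmse(errors):
    """Geometric root mean square error: (2n)th root of the product of squared errors."""
    return math.prod(e * e for e in errors) ** (1 / (2 * len(errors)))


errors = [-3, 1, -5, 4]
print(round(gmae(errors), 3))                   # 2.783
print(math.isclose(gmae(errors), grmse(errors)))  # True
print(gmae([0, 5, -2]))                         # 0.0
```

The last line shows the behavior discussed next: a single zero error drives the whole measure to zero.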
A potential problem with the GMAE is that if any one
forecast error is zero, then the GMAE is also zero, regardless
of the size of the other forecast errors. Zero forecast errors
can arise in two ways for intermittent demand:
1. Non-zero actual demand: an identical non-zero value is forecast.
2. Zero actual demand: zero is forecast.
In our experience, the first case does not arise frequently
in practice, and it never occurred in the dataset of 3,000
SKUs analyzed by Syntetos and Boylan (2005). Methods
based on exponential smoothing (ES), such as Croston’s
method and the Syntetos-Boylan approximation, do not
generally produce whole-number estimates of the mean
demand; therefore they do not typically generate zero
errors. Consequently, series with zero GMAEs will be rare
if ES-based methods are compared, and they can be
excluded from an across-series analysis. If other methods
are used, such as the naïve method, then the GMAE will
not always be well defined. However, the naïve method is
sensitive to large demands and will generate high forecasts
in such instances, making it inappropriate for practical
inventory applications.
The second case, highlighted to us in a private e-mail
correspondence by Jack Hayya, can occur when it has been a
very long time since there have been any non-zero
observations. This may signal that the item is at the end of its
life and should be reviewed for classification as “obsolescent,”
requiring no subsequent forecasts. If the item is nearing
obsolescence (but is not yet obsolete), there would have been
some evidence of demand in recent years, and a zero mean
demand forecast is inappropriate and should be reviewed.
Estimates of Demand Variance
Why does the variance of forecast error need to be
estimated? There are two reasons: (1) in some cases, the
variance of demand is estimated as an intermediate step
in finding a percentile of forecast demand, or the probability
of high values of demand; (2) in other cases, the variance
of demand is input to a formula that will be used to estimate
inventory parameters, such as the reorder point (s).
It is not possible to assess the accuracy of variance estimates
directly, unless assumptions are made about the demand
distribution. However, indirect approaches are available.
If the variance is estimated to find a percentile of demand,
we can examine the accuracy of the resulting percentile
estimate. To do this, we identify the percentile of interest
(for example, the 90th percentile) and compare how many
observations exceed the percentile estimate against the
expected value. This can be achieved using the chi-square
test, as discussed by Tom Willemain. A similar approach
can be adopted if the variance estimate is used to calculate
probabilities of high values of demand.
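The comparison of observed and expected exceedances can be sketched as a generic two-category chi-square goodness-of-fit calculation. This is not necessarily the exact procedure described by Willemain; the function name, the made-up data, and the use of a 5% significance level are assumptions. The critical value 3.841 is the standard table value for one degree of freedom at the 5% level.

```python
CHI_SQUARE_CRITICAL_5PCT_1DF = 3.841  # standard table value, 1 degree of freedom


def exceedance_chi_square(observations, percentile_estimate, tail_probability):
    """Chi-square statistic comparing how many observations exceed the
    percentile estimate with the number expected.

    tail_probability is the expected share above the estimate,
    for example 0.10 for a 90th-percentile estimate.
    """
    n = len(observations)
    observed_above = sum(1 for x in observations if x > percentile_estimate)
    observed_within = n - observed_above
    expected_above = n * tail_probability
    expected_within = n - expected_above
    return ((observed_above - expected_above) ** 2 / expected_above
            + (observed_within - expected_within) ** 2 / expected_within)


# Made-up history: 200 periods, 32 of which exceed the estimated 90th percentile of 3.
observations = [5] * 32 + [0] * 168
statistic = exceedance_chi_square(observations, percentile_estimate=3, tail_probability=0.10)
print(statistic > CHI_SQUARE_CRITICAL_5PCT_1DF)  # True
```

Here 20 exceedances were expected and 32 were observed, giving a statistic of about 8, so the percentile estimate would be judged too low. With short histories the expected count above the percentile may be small, in which case the chi-square approximation should be treated with caution.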
If the variance is used as an input to an inventory formula,
we can look at the measures of inventory cost and service.
This would enable different approaches to variance
estimation to be compared indirectly and is an example of
the accuracy-implication approach advocated in this paper.
Conclusions
In considering forecasting-accuracy metrics for intermittent
demand, we should begin by looking at the inventory
method. Which forecasts are required for the particular
inventory method? The answer may be estimates of mean
demand, variance of demand, percentiles of demand, or
probabilities of high-demand values. There are appropriate
accuracy metrics for each type of estimate.
No matter which inventory system is in use, the accuracy-implication metrics of stock-holding costs and service
levels should always be considered because these are of
prime importance to the organization. The use of these
measures should not be limited to situations in which it is
difficult to assess forecast error directly. Accuracy-implication metrics also offer a basis for the comparison
of different forecasting methods.
References
Boylan, J. (2005). Intermittent and lumpy demand: A
forecasting challenge, Foresight: The International
Journal of Applied Forecasting, Issue 1, 36-42.
Eaves, A. H. C. & Kingsman, B. G. (2004). Forecasting
for the ordering and stock-holding of spare parts, Journal
of the Operational Research Society, 55, 431-437.
Fildes, R. (1992). The evaluation of extrapolative forecasting
methods, International Journal of Forecasting, 8, 91-98.
Kahn, K. B. (2006). Commentary: Putting forecast
accuracy into perspective, Foresight: The International
Journal of Applied Forecasting, Issue 3, 25-26.
Naddor, E. (1975). Optimal and heuristic decisions on
single and multi-item inventory systems, Management
Science, 21, 1234-1249.
Silver, E. A., Pyke, D. F. & Peterson, R. (1998). Inventory
Management and Production Planning and Scheduling,
3rd ed., New York: John Wiley & Sons.
Syntetos, A. A. & Boylan, J. E. (2005). The accuracy of
intermittent demand estimates, International Journal of
Forecasting, 21, 303-314.
Willemain, T. R., Smart, C. N. & Schwarz, H. F. (2004). A
new approach to forecasting intermittent demand for
service parts inventories, International Journal of
Forecasting, 20, 375-387.
Contact Info:
John Boylan
Buckinghamshire Chilterns
University College, UK
Aris Syntetos
University of Salford, UK