Centre for Risk & Insurance Studies
enhancing the understanding of risk and insurance

CRIS Discussion Paper Series – 2008.IV

Backtesting Stochastic Mortality Models: An Ex-Post Evaluation of Multi-Period-Ahead Density Forecasts

Kevin Dowd*, Andrew J.G. Cairns#, David Blake♣, Guy D. Coughlan♦, David Epstein♦, Marwa Khalaf-Allah♦

September 8, 2008

Abstract

This study sets out a backtesting framework applicable to the multi-period-ahead forecasts from stochastic mortality models and uses it to evaluate the forecasting performance of six different stochastic mortality models applied to English & Welsh male mortality data. The models considered are: Lee-Carter's 1992 one-factor model; a version of Renshaw-Haberman's 2006 extension of the Lee-Carter model to allow for a cohort effect; the age-period-cohort model of Currie (2006), which is a simplified version of Renshaw-Haberman; Cairns, Blake and Dowd's 2006 two-factor model; and two generalised versions of the latter with an added cohort effect. For the data set used here, the results from applying this methodology suggest that the models perform adequately by most backtests, and that there is little difference between the performances of five of the models. The remaining model, however, shows forecast instability. The study also finds that density forecasts that allow for uncertainty in the parameters of the mortality model are more plausible than forecasts that do not allow for such uncertainty.

Key words: backtesting, forecasting performance, mortality models

* Centre for Risk & Insurance Studies, Nottingham University Business School, Jubilee Campus, Nottingham, NG8 1BB, United Kingdom. Corresponding author: email at [email protected]. The authors thank Lixia Loh and Liang Zhao for excellent research assistance.
# Maxwell Institute for Mathematical Sciences, and Actuarial Mathematics and Statistics, Heriot-Watt University, Edinburgh, EH14 4AS, United Kingdom.
♣ Pensions Institute, Cass Business School, 106 Bunhill Row, London, EC1Y 8TZ, United Kingdom.
♦ Pension ALM Group, JPMorgan Chase Bank, 125 London Wall, London EC2Y 5AJ, United Kingdom.

1. Introduction

A recent paper, Cairns et al. (2007), examined the empirical fits of eight different stochastic mortality models, variously labelled M1 to M8. Seven [1] of these models were further analysed in a second study, Cairns et al. (2008), which evaluated the ex-ante plausibility of the models' probability density forecasts. Amongst other findings, that study found that one of these models – M8, a version of the Cairns-Blake-Dowd (CBD) model (Cairns et al., 2006) with an allowance for a cohort effect – generated implausible forecasts on US data, and consequently this model was also dropped from further consideration. A third study, Dowd et al. (2008), then examined the 'goodness of fit' of the remaining six models by analysing the statistical properties of their various residual series. The six models that were examined are: M1, the Lee-Carter model (Lee and Carter, 1992); M2B, a version of Renshaw and Haberman's cohort-effect generalisation of the Lee-Carter model (Renshaw and Haberman, 2006); M3B, a version of Currie's age-period-cohort model (Currie, 2006), which is a simplified version of M2B; [2] M5, the CBD model; and M6 and M7, two alternative cohort-effect generalisations of the CBD model. Details of these models' specifications are given in Appendix A.

It is quite possible for a model to provide a good in-sample fit to historical data and produce forecasts that appear to be 'plausible' ex ante, but still produce poor ex-post forecasts, that is, forecasts that differ significantly from subsequently realised outcomes.
A 'good' model should therefore produce forecasts that perform well out-of-sample when evaluated using appropriate forecast evaluation or backtesting methods, as well as provide good fits to the historical data and plausible forecasts ex ante. The primary purpose of the present paper is, accordingly, to set out a backtesting framework that can be used to evaluate the ex-post forecasting performance of a mortality model.

[1] The model omitted from this second study was the P-splines model of Currie et al. (2004) and Currie (2006). This model was dropped in part because of its poor performance relative to the other models when assessed by the Bayes Information Criterion, and in part because it cannot be used to project stochastic mortality outcomes.

[2] M2B and M3B are the versions of M2 and M3 that assume an ARIMA(1,1,0) process for the cohort effect (see Cairns et al., 2008).

A secondary purpose of the paper is to illustrate this backtesting framework by applying it to the six models listed above. The backtesting framework is applied to each model over various forecast horizons using a particular data set, namely, LifeMetrics data for the mortality rates of English & Welsh males [3] for ages 60 to 89, spanning the years 1971 to 2006.

The backtesting of mortality models is still in its infancy. Most studies that assess mortality forecasts focus on ex-post forecast errors (see, e.g., Keilman (1997, 1998), National Research Council (2000), or Koissi et al. (2006)). This approach is limited insofar as it ignores all the information contained in the probability density forecasts except that reflected in the mean forecast or 'best estimate'. A more sophisticated treatment – and a good indicator of the current state of best practice in the evaluation of mortality models – is provided by Lee and Miller (2001). They evaluated the performance of the Lee-Carter (1992) model by examining the behaviour of forecast errors (comparing MAD, RMSE, etc.)
and plots of 'percentile error distributions', although they did not report any formal test results based on these latter plots. However, they also reported plots showing mortality prediction intervals and subsequently realised mortality rates, together with the frequencies with which realised mortality observations fell within the prediction intervals. These provide a more formal (i.e., probabilistic) sense of model performance. More recently, CMI (2006) included backtesting evaluations of the P-spline model, and CMI (2007) included backtesting evaluations of M1 and M2. Their evaluations were based on plots of realised outcomes against forecasts, where the forecasts included projections of both central values and prediction intervals, which also give some probabilistic sense of model performance.

The backtesting framework used in this paper can best be understood in terms of the following key steps:

1. We begin by selecting the metric of interest, namely the forecasted variable that is the focus of the backtest. Possible metrics include the mortality rate, life expectancy, future survival rates, and the prices of annuities and other life-contingent financial instruments. Different metrics are relevant for different purposes: for example, in evaluating the effectiveness of a hedge of longevity or mortality risk, the relevant metric is the monetary value of the underlying exposure. In this paper, we focus on the mortality rate itself, but, in principle, backtests could be conducted on any of these other metrics as well.

[3] See Coughlan et al. (2007) and www.lifemetrics.com for the data and a description of LifeMetrics. The original source of the data was the UK Office for National Statistics.

2.
We select the historical 'lookback' window used to estimate the parameters of each model for any given year: thus, if we wish to estimate the parameters for year t and we use a lookback window of length n, then we estimate the parameters for year t using observations from years t − n + 1 to t. In this paper, we use a fixed-length [4] lookback window of 10 years [5].

3. We select the horizon (i.e., the 'lookforward' window) over which we will make our forecasts, based on the estimated parameters of the model. In the present study, we focus on relatively long-horizon forecasts, because it is the accuracy of these forecasts with which pension plans are principally concerned, and they also pose the greatest modelling challenges.

4. Given the above, we decide on the backtest to be implemented and specify what constitutes a 'pass' or 'fail' result. Note that we use the term 'backtest' here to refer to any method of evaluating forecasts against subsequently realised outcomes. 'Backtests' in this sense might involve the use of plots whose goodness of fit is interpreted informally, as well as formal statistical tests of predictions generated under the null hypothesis that a model produces adequate forecasts.

The above framework for backtesting stochastic mortality models is a very general one.

[4] The use of a fixed-length lookback window simplifies the underlying statistics and means that our results are equally responsive to the 'news' in any given year's observations. By contrast, an expanding lookback window is more difficult to handle and also means that, as time goes by, the 'news' contained in the most recent observations receives less weight relative to the expanding number of older observations in the lookback sample.

[5] We choose a 10-year lookback window because that is the window emerging as the standard amongst market practitioners. However, we acknowledge that a lookback window of 10 years is not necessarily optimal from a forecasting perspective.
The choice of lookback window inevitably involves a tradeoff: a long window is less responsive than a short window to changes in the trend of mortality rates, but is also less influenced by random fluctuations in mortality rates. The choice of lookback window is therefore to some extent subjective.

Within this broad framework, we implement the following four backtest procedures for each model, noting that the metric (the mortality rate) and the lookback window (10 years) are the same in each case:

● Contracting horizon backtests: First, we evaluate the convergence properties of the forecast mortality rate to the actual mortality rate at a specified future date. We chose 2006 as the forecast date and we examine forecasts of that year's mortality rate, where the forecasts are made at different dates in the past. The first forecast was made with a model estimated from 10 years of historical observations up to 1980, the second with the model estimated from 10 years of historical observations up to 1981, and so forth, up to 2006. In other words, we are examining forecasts with a fixed end date and a contracting horizon, from 26 years ahead down to one year ahead. The backtest procedure then involves a graphical comparison of the evolution of the forecasts towards the actual mortality rate for 2006; intuitively, 'good' forecasts should converge in a fairly steady manner towards the realised value for 2006.

● Expanding horizon backtests: Second, we consider the accuracy of forecasts of mortality rates over expanding horizons from a common fixed start date, or 'stepping off' year. For example, the start date might be 1980, and the forecasts might be for one up to 26 years ahead, that is, for 1981 up to 2006. The backtest procedure again involves a graphical comparison of forecasts against realised outcomes, and the 'goodness' of the forecasts can be assessed in terms of their closeness to the realised outcomes.
● Rolling fixed-length horizon backtests: Third, we consider the accuracy of forecasts over fixed-length horizons as the stepping-off date moves sequentially forward through time. This involves examining plots of mortality prediction intervals for some fixed-length horizon (e.g., 15 years) rolling forward over time, with subsequently realised outcomes superimposed on them.

● Mortality probability density forecast tests: Fourth, we carry out formal hypothesis tests in which we use each model, as of one or more start dates, to simulate the forecasted mortality probability density at the end of some horizon period, and we then compare the realised mortality rate(s) against this forecasted probability density (or densities). The forecast 'passes' the test if each realised rate lies within the more central region of the forecast probability density, and the forecast 'fails' if it lies too far out in either tail to be plausible given the forecasted probability density. [6]

For each of the four classes of test, we examine two types of mortality forecast. The first are forecasts in which it is assumed that the parameters of the mortality models are known with certainty. The second are forecasts in which we make allowance for the fact that the parameters of the models are only estimated, i.e., their 'true' values are unknown. We would regard the latter as more plausible, and evidence from other studies suggests that allowing for parameter uncertainty can make a considerable difference to estimates of quantifiable uncertainty (see, e.g., Cairns et al. (2006), Dowd et al. (2006) or Blake et al. (2008)). For convenience, we label these the 'parameter certain' (PC) and 'parameter uncertain' (PU) cases respectively. [7] More details of the latter case are given in Appendix B.

This paper is organised as follows.
Section 2 considers the contracting horizon backtests, section 3 considers the expanding horizon backtests, section 4 considers the rolling fixed-length backtests for an illustrative horizon of 15 years, and section 5 considers the mortality probability forecast tests. Section 6 concludes.

[6] In principle, we could also perform further backtests based on the Probability Integral Transformation (PIT) series, which are predicted to be standard uniform under the null of forecast adequacy; we might do this, e.g., using tests of the Kolmogorov-Smirnov family (the Kolmogorov-Smirnov test itself, Kuiper tests, Lilliefors tests, etc.). We could also put the PIT series through an inverse standard normal distribution function, as suggested by Berkowitz (2001), and then test these series against the predictions of standard normality, e.g., a zero mean, a unit standard deviation, or a Jarque-Bera test of the skewness and kurtosis predictions of normality. We carried out such tests but do not report them here because the PIT and Berkowitz series are not predicted to be iid in a multi-period-ahead forecasting context (and, more to the point, in many cases iid is clearly rejected), and the absence of iid undermines the validity of standard tests such as the Kolmogorov-Smirnov test. A potential solution is offered by Dowd (2007), who suggests that the Berkowitz series should be a moving average of iid standard normal series whose order is determined by the properties of the year-ahead forecast, but this is only a conjecture. An alternative possibility is to fit some ad hoc process to the Berkowitz series (e.g., an ARMA-type process, as suggested by Dowd, 2008). We have not followed up these suggestions in the present study.

[7] To be more precise, the PU versions of the models use a Bayesian approach to take account of the uncertainty in the parameters driving the period and, where applicable, the cohort effects. There is no need to take account of uncertainty in the age effects, because we can rely on the law of large numbers to give us fairly precise estimates of these latter effects, i.e., the estimation errors in the age effects will be negligible.

2. Contracting Horizon Backtests: Examining the Convergence of Forecasts through Time

The first kind of backtest in our framework examines the consistency of forecasts for a fixed future year (in our example below, the year 2006) made in earlier years. For a well-behaved model, we would expect consecutive forecasts to converge towards the realised outcome as the date the forecasts are made (the stepping-off date) approaches the forecast year.

Figures 1, 2 and 3 show plots of forecasts of the 2006 mortality rate for 65-year-old, 75-year-old and 85-year-old males, respectively, made in years 1980, 1981, …, 2006. The central lines in each plot are the relevant model's median forecast of the 2006 mortality rate based on 5,000 simulation paths, and the dotted lines on either side are estimates of the model's 90% prediction interval or risk bounds (i.e., the 5th and 95th percentiles). The starred point in each chart is the realised mortality rate for that age in 2006. These plots show the following patterns:

● For models M1, M3B, M5, M6 and M7, the forecasts tend to decline in a stable way over time towards a value close to the realised value, and the prediction intervals narrow over time. [8]

● For these same models, the PU prediction intervals are considerably wider than the PC prediction intervals, but there is negligible difference between the PC and PU median plots.

[8] It is noteworthy, however, that models M1 and M5 both produce projected mortality rates for the mid-60s that are systematically below the observed mortality rate in 2006.
This appears to be due to a significant cohort effect in the data that the other models are picking up, and this would explain why the realised mortality rates for these models in the age-65 charts in Figure 1 are a little out of line with the forecasts from the periods running up to 2006.

● For M2B, the forecasts are sometimes unstable, exhibiting major spikes, and there is also typically less difference between the PC and PU forecasts than is the case with the other models.

Note that we have only evaluated the convergence of the models for one end date and one data set. This is insufficient to draw general conclusions about the forecasting capabilities of the various models. However, the instability observed in M2 (or at least our variant of M2, M2B) is an issue.

Figure 1: Forecasts of the 2006 Mortality Rate from 1980 Onwards: Males aged 65
[Six panels, one per model (M1, M2B, M3B, M5, M6, M7), each plotting the forecast mortality rate for males aged 65 against the stepping-off year, 1980-2005.]

Notes: Forecasts based on estimates using English & Welsh male mortality data for ages 60-89 and a rolling 10-year historical window. The stepping-off year is the final year in the rolling window. The fitted model is then used to estimate the median and 90% prediction interval for both parameter-certain forecasts (given by the dashed lines) and parameter-uncertain forecasts (given by the continuous lines). The realised mortality rate for 2006 is denoted by *. Based on 5,000 simulation trials.
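The medians and 90% risk bounds plotted in these figures are percentiles of the simulated mortality rates. As a minimal sketch of this final step, assuming 5,000 simulated rates for one age and one target year are already in hand (the lognormal draws below are a hypothetical stand-in for simulated model output, not the output of any of M1-M7):

```python
import numpy as np

rng = np.random.default_rng(42)

def prediction_interval(simulated_rates, level=0.90):
    """Median and central prediction interval from simulated mortality rates.

    simulated_rates: 1-D array of simulated rates for one age and one target
    year (e.g. 5,000 trials). For level=0.90, returns the 5th, 50th and 95th
    percentiles, i.e. (lower risk bound, median, upper risk bound)."""
    lo = (1.0 - level) / 2.0
    lower, median, upper = np.quantile(simulated_rates, [lo, 0.5, 1.0 - lo])
    return lower, median, upper

# Hypothetical stand-in for 5,000 simulated 2006 mortality rates at age 65.
sims = rng.lognormal(mean=np.log(0.015), sigma=0.10, size=5000)
lower, median, upper = prediction_interval(sims)
assert lower < median < upper
```

Repeating this for each stepping-off year traces out the central line and dotted risk bounds shown in each panel.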
Figure 2: Forecasts of the 2006 Mortality Rate from 1980 Onwards: Males aged 75
[Six panels as in Figure 1, for males aged 75.]
Notes: As per Figure 1.

Figure 3: Forecasts of the 2006 Mortality Rate from 1980 Onwards: Males aged 85
[Six panels as in Figure 1, for males aged 85.]
Notes: As per Figure 1.

3. Expanding Horizon Backtests

In the second class of backtest, we consider the accuracy of forecasts over increasing horizons against the realised outcomes for those horizons. Accuracy is reflected in the degree of consistency between the outcome and the prediction interval associated with each forecast. These backtests are best evaluated graphically using charts of mortality prediction intervals. Figure 4 shows the mortality prediction intervals for age 65 for forecasts starting in 1980 (with model parameters estimated using data from the preceding 10 years).
The charts show the 90% prediction intervals (or 'risk bounds') as dashed lines for the PC forecasts and continuous lines for the PU forecasts. Roughly speaking, if a model is adequate, we can be 90% confident of any given outcome occurring between the dashed risk bounds if we 'believe' the PC forecasts, and 90% confident of any given outcome occurring between the continuous risk bounds if we 'believe' the PU forecasts. The prediction intervals all show the same basic shape: they fan out somewhat over time around a gradually decreasing trend and show a little more uncertainty on the upper side than on the lower side. But the most striking finding is that the PU risk bounds are considerably wider than their PC equivalents, indicating that the PU forecasts are more plausible (in the sense of having higher likelihoods) than the PC ones.

The charts also show the forecast medians as dotted lines for the PC forecasts and continuous lines for the PU forecasts, and we see the median projections falling as the forecast horizon lengthens. As in the earlier Figures, there are virtually no differences between the PC- and PU-based median projections. Superimposed on the charts are the realised outcomes, indicated by stars. For the most part, the paths of realised outcomes tend to lie below the median projection and move downwards over time relative to the forecasts. This suggests that, viewed from 1980, most forecasts are biased upwards (i.e., they under-estimate the downward trend of future mortality rates) and that the bias tends to increase with the length of the forecast horizon. However, the size and significance of this bias varies considerably across different models and between PC and PU forecasts.
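One simple way to make the upward bias just described concrete is to compute, for each horizon from a common stepping-off year, the signed gap between the median forecast and the realised rate. A sketch, with hypothetical illustrative numbers (the function name and inputs are ours, not from the paper):

```python
import numpy as np

def forecast_bias_by_horizon(median_forecasts, realised):
    """Signed forecast error (median forecast minus realised rate) at each
    horizon h = 1, 2, ... from a common stepping-off year.  Positive values
    mean the forecast over-states mortality, i.e. it under-estimates the
    downward trend; an upward bias that grows with the horizon shows up as
    values increasing in h."""
    errors = np.asarray(median_forecasts) - np.asarray(realised)
    return {h + 1: float(e) for h, e in enumerate(errors)}

# Hypothetical two-year example: median forecasts above realised rates,
# with the gap widening at the longer horizon.
bias = forecast_bias_by_horizon([0.030, 0.029], [0.028, 0.025])
assert bias[2] > bias[1] > 0
```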
Figure 4: Mortality Prediction-Interval Charts from 1980: Males aged 65
[Six panels, one per model (M1, M2B, M3B, M5, M6, M7), each plotting the forecast mortality rate for males aged 65 against the year, 1980-2006, and reporting the PU and PC quadruplets [xL, xM, xU, n].]

Notes: Forecasts based on estimates using English & Welsh male mortality data for ages 60-89 and years 1971-1980. The dashed lines show the forecast medians and the bounds of the 90% prediction interval for the parameter-certain (PC) forecasts, and the continuous lines are their equivalents for the parameter-uncertain (PU) forecasts. The realised mortality rates are denoted by *. For each of these cases, xL and xM are the numbers of realised rates below the lower 5% and 50% prediction bounds, xU is the number of realised mortality rates above the upper 5% bound, and n is the number of forecasts, including that for the starting point of the forecasts. Based on 5,000 simulation trials.
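The [xL, xM, xU, n] quadruplets reported in these panels can be computed mechanically from the realised rates and the forecast percentile bounds. A sketch, assuming those series are already available (the function name and the illustrative numbers are ours):

```python
import numpy as np

def exceedance_counts(realised, lower, median, upper):
    """Compute the [xL, xM, xU, n] quadruplet for one panel.

    realised, lower, median, upper: arrays of the realised mortality rates
    and the forecast 5%, 50% and 95% bounds for each year of the horizon.
    xL: realised rates below the lower 5% bound;
    xM: realised rates below the projected median;
    xU: realised rates above the upper 95% bound;
    n : number of forecasts."""
    realised = np.asarray(realised)
    xL = int(np.sum(realised < np.asarray(lower)))
    xM = int(np.sum(realised < np.asarray(median)))
    xU = int(np.sum(realised > np.asarray(upper)))
    return xL, xM, xU, len(realised)

# Hypothetical four-year illustration with flat bounds.
quad = exceedance_counts([0.010, 0.020, 0.030, 0.040],
                         [0.015] * 4, [0.025] * 4, [0.035] * 4)
assert quad == (1, 2, 1, 4)
```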
Figure 4 also shows quadruplets of the form [xL, xM, xU, n], where n is the number of mortality forecasts, xL is the number of lower exceedances, i.e., the number of realised outcomes out of n falling below the lower risk bound, xM is the number of observations falling below the projected median, and xU is the number of upper exceedances, i.e., observations falling above the upper risk bound. [9] These statistics provide useful indicators of the adequacy of model forecasts. If the forecasts are adequate, for any given PC or PU case, we would expect xL/n and xU/n to be about 5% and we would expect xM/n to be about 50%. Too many exceedances, on the other hand, would suggest that the relevant bounds were incorrectly forecasted: for example, too many realisations above the upper risk bound would suggest that the upper risk bound is too low; too many observations below the median bound would suggest that the forecasts are biased upwards; and too many observations below the lower risk bound would suggest that the lower risk bound is too high.

[9] Note, though, that the positions of the individual observations within the prediction intervals are not independent. If, for example, the 1990 observation is 'low', then the 1995 observation is also likely to be 'low'.

As a robustness check, Figure 5 shows the comparable chart for 65-year-old males starting from year 1990 (with model parameters estimated using data for years 1981-1990), with horizons now going out 16 years ahead, to 2006. Figures 6-9 show the corresponding results for ages 75 and 85 for both PC and PU forecasts. These Figures suggest the following:

• The models forecast decreases in median mortality rates, forecast risk bounds that fan out over the forecast horizon, and, in most cases considered, show a bias that increases with the forecast horizon.

• The PU forecasts perform better than the PC forecasts.

• Forecast performance tends to improve with higher ages.
• For the sample periods considered, the forecasts are fairly robust to the choice of sample period.

Figure 5: Mortality Prediction-Interval Charts from 1990: Males aged 65
[Six panels as in Figure 4, with forecasts starting from 1990.]
Notes: As per Figure 4, but using data over years 1981-1990.
Figure 6: Mortality Prediction-Interval Charts from 1980: Males aged 75
[Six panels as in Figure 4, for males aged 75.]
Notes: As per Figure 4.
Figure 7: Mortality Prediction-Interval Charts from 1990: Males aged 75
[Six panels as in Figure 5, for males aged 75.]
Notes: As per Figure 5.
Figure 8: Mortality Prediction-Interval Charts from 1980: Males aged 85
[Six panels as in Figure 4, for males aged 85.]
Notes: As per Figure 4.
Figure 9: Mortality Prediction-Interval Charts from 1990: Males aged 85
[Six panels as in Figure 5, for males aged 85.]
Notes: As per Figure 5.

It is also useful to investigate further the performance of the different models across PC and PU forecasts. To help do so, we can examine Table 1, which summarises the models' average scores (i.e., xL, xM and xU) from Figures 4-9, where the averages are taken for each model across all these Figures. The first six rows give the average percentages of outcomes that are exceedances (i.e., xL/n, xM/n and xU/n). The next six rows give the excess exceedances in absolute value form – that is, |xL/n − 0.05|, |xM/n − 0.5| and |xU/n − 0.05| – and these are expected to be 'close' to zero under the null hypothesis: they show how the actual exceedances compare against expectations under the null.
The final six rows give the models' rankings by how closely the actual exceedances match these expectations.

Table 1: Exceedances Results for Expanding Horizon Backtests

(a) Exceedances
           Parameters Certain            Parameters Uncertain
Model    xL/n     xM/n     xU/n        xL/n     xM/n     xU/n
M1      27.3%    81.8%     1.5%        3.0%    81.8%     1.5%
M2B     33.3%    64.4%     1.5%       15.9%    65.9%     1.5%
M3B     24.2%    81.8%     2.3%        2.3%    81.8%     2.3%
M5      36.4%    81.8%     1.5%       13.6%    81.8%     1.5%
M6      30.3%    76.5%     2.3%       11.4%    76.5%     2.3%
M7      27.3%    78.0%     2.3%        3.0%    78.0%     2.3%

(b) Absolute values of excess exceedances
           Parameters Certain                  Parameters Uncertain
Model  |xL/n−0.05| |xM/n−0.5| |xU/n−0.05|  |xL/n−0.05| |xM/n−0.5| |xU/n−0.05|
M1      22.3%      31.8%      3.5%          2.0%       31.8%      3.5%
M2B     28.3%      14.4%      3.5%         10.9%       15.9%      3.5%
M3B     19.2%      31.8%      2.7%          2.7%       31.8%      2.7%
M5      31.4%      31.8%      3.5%          8.6%       31.8%      3.5%
M6      25.3%      26.5%      2.7%          6.4%       26.5%      2.7%
M7      22.3%      28.0%      2.7%          2.0%       28.0%      2.7%

(c) Ranking by absolute values of excess exceedances
           Parameters Certain            Parameters Uncertain
Model    xL       xM                   xL       xM
M1       =2       =4                   =1       =4
M2B       5        1                    6        1
M3B       1       =4                    3       =4
M5        6       =4                    5       =4
M6        4        2                    4        2
M7       =2        3                   =1        3

Notes: Based on the exceedance results in Figures 4-9. xL, xM and xU are the numbers of observations below the lower risk bound, below the median bound and above the upper risk bound respectively, and n is the number of forecasts. Upper exceedances are not ranked in panel (c), for the reason given in the text.

The main points to note are:

• Lower exceedances: The PC forecasts in general have far too many lower exceedances. By contrast, the corresponding PU lower exceedances are much closer to expectations, especially so for models M1 and M7 (which at 2.0% rank equal first by this criterion) and M3B, which follows close behind at 2.7%. These findings confirm our claim that the PU forecasts are more plausible than the PC forecasts.

• Median exceedances: There are negligible differences between the PC and PU median exceedance results, and the proportions of median exceedances are much higher than expected. This confirms that there is, in general, a tendency for the forecasts to be biased upwards.
The top performing models by this criterion are M2B (at 15.9% for the PU case), M6 (at 26.5%) and M7 (at 28.0%). The remaining models all score a little higher at 31.8%.

• Upper exceedances: Finally, there are very few upper exceedances – in every case there are only 0 or 1 – so the upper-exceedance results are close to expectations, and there are no notable differences across the models or across the PC and PU forecasts. Accordingly, there is no point in trying to rank the models' upper-exceedance performance.

4. Rolling Fixed-Length Horizon Backtests

In the third class of backtests, we consider the accuracy of forecasts over fixed-length horizons as these roll forward through time. Once again, accuracy is reflected in the degree of consistency between the collective set of realised outcomes and the prediction intervals associated with the forecasts. We examine the case of an illustrative 15-year forecast horizon and, from this point on, we restrict ourselves to backtesting the performance of the PU forecasts only.

Figures 10-15 show the rolling prediction intervals for each of the six models in turn. Each plot shows the rolling prediction intervals and realised mortality outcomes for ages 65, 75 and 85 over the years 1995 to 2006, where the forecasts are obtained using data from 15 years before. Figure 10 gives the prediction intervals for model M1, Figure 11 gives those for M2B, and so forth. To facilitate comparison, the y-axes in all the charts have the same range, running from 0.01 to 0.2 on a logarithmic scale.

The most striking feature of these charts is the instability of the forecasts of M2B in Figure 11. As in Figures 1-3, this model generates occasional sharp and seemingly implausible 'spikes' in its forecasts. The other charts indicate that there is not much to choose between the remaining five models.
The models all show periods in which the realised outcomes lie within the prediction intervals for all ages, especially at higher ages (i.e., age 85). They also show periods in which the realised outcomes have drifted below the prediction intervals, particularly for age 65.

Figure 10: Rolling 15-Year Ahead Prediction Intervals: M1
[Figure: rolling prediction intervals and realised mortality rates for ages 65, 75 and 85 over 1995-2006, with exceedance counts [xL, xM, xU, n]: age 85 [1, 10, 0, 12]; age 75 [0, 11, 0, 12]; age 65 [1, 12, 0, 12].]
Notes: Model estimates based on English & Welsh male mortality data for ages 60-89 and a rolling 10-year historical window starting from 1971-1980. The continuous, short-dashed and long-dashed lines are the parameter-uncertain (PU) forecast medians and bounds of the 90% prediction interval for ages 65, 75 and 85, respectively. The realised mortality rates are denoted by *. For each age, xL and xM are the numbers of realised rates below the lower 5% and 50% prediction bounds, xU is the number of realised mortality rates above the upper 5% bound, and n is the number of forecasts, including that for the starting point of the forecasts. Based on 5,000 simulation trials.

Figure 11: Rolling 15-Year Ahead Prediction Intervals: M2B
[Figure: as Figure 10, with exceedance counts: age 85 [1, 5, 0, 12]; age 75 [0, 12, 0, 12]; age 65 [8, 12, 0, 12].]
Notes: As per Figure 10.

Figure 12: Rolling 15-Year Ahead Prediction Intervals: M3B
[Figure: as Figure 10, with exceedance counts: age 85 [0, 8, 0, 12]; age 75 [0, 12, 0, 12]; age 65 [2, 12, 0, 12].]
Notes: As per Figure 10.
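The per-age exceedance counts [xL, xM, xU, n] annotated on these charts can be computed mechanically from a model's simulated forecast distributions. A minimal sketch, with hypothetical simulated rates standing in for a model's output and the 5%, 50% and 95% quantiles taken as the prediction bounds:

```python
import numpy as np

def exceedance_counts(sim_rates, realised):
    """Exceedance counts [xL, xM, xU, n] as annotated on the charts:
    realised rates below the lower 5% bound, below the median, and
    above the upper 5% bound of the simulated forecast distribution."""
    lower = np.quantile(sim_rates, 0.05, axis=0)
    median = np.quantile(sim_rates, 0.50, axis=0)
    upper = np.quantile(sim_rates, 0.95, axis=0)
    xL = int(np.sum(realised < lower))
    xM = int(np.sum(realised < median))
    xU = int(np.sum(realised > upper))
    return xL, xM, xU, len(realised)

# Toy example with upward-biased forecasts, as diagnosed in the text:
# the realised rates sit below the simulated lower bound in every year.
rng = np.random.default_rng(0)
sim = rng.lognormal(mean=np.log(0.11), sigma=0.05, size=(5000, 12))
obs = np.full(12, 0.10)
print(exceedance_counts(sim, obs))
```

Under the null, roughly 5% of outcomes should fall below the lower bound, 50% below the median and 5% above the upper bound; the counts above are compared against those expectations.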
Figure 13: Rolling 15-Year Ahead Prediction Intervals: M5
[Figure: as Figure 10, with exceedance counts: age 85 [0, 8, 0, 12]; age 75 [0, 12, 0, 12]; age 65 [9, 12, 0, 12].]
Notes: As per Figure 10.

Figure 14: Rolling 15-Year Ahead Prediction Intervals: M6
[Figure: as Figure 10, with exceedance counts: age 85 [0, 4, 0, 12]; age 75 [0, 12, 0, 12]; age 65 [10, 12, 0, 12].]
Notes: As per Figure 10.

Figure 15: Rolling 15-Year Ahead Prediction Intervals: M7
[Figure: as Figure 10, with exceedance counts: age 85 [0, 8, 0, 12]; age 75 [0, 12, 0, 12]; age 65 [4, 12, 0, 12].]
Notes: As per Figure 10.

Table 2 presents a summary of the exceedance results reported in Figures 10-15. The Table leaves out upper exceedances because none were observed in any of these Figures. We find that:

• Lower exceedances: All models exhibit excess lower exceedances, but the degree of excess is very low (0.6%) for M1 and M3B, 6.1% for M7, and a little over 20% for the remaining models.

• Median exceedances: All models show high excess median exceedances. The best performing models by this criterion are M6 (27.8%) and M2B (30.6%). M3B, M5 and M7 then come in joint third at 38.9%, and M1 comes in last at 41.7%.
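The excess exceedances and tie-aware rankings of this kind (e.g. the '=1' entries in Table 2) can be reproduced directly from the counts read off Figures 10-15. A sketch, with the lower-exceedance counts summed across the three ages (n = 3 ages × 12 forecasts):

```python
models = ["M1", "M2B", "M3B", "M5", "M6", "M7"]
# Lower-exceedance counts xL summed over ages 65, 75 and 85,
# read off Figures 10-15.
xL = {"M1": 2, "M2B": 9, "M3B": 2, "M5": 9, "M6": 10, "M7": 4}
n = 36

# Excess exceedances: distance of the observed proportion from the 5%
# expected under the null.
excess = {m: abs(xL[m] / n - 0.05) for m in models}

def tie_rank(values):
    """Rank smallest-first; tied values share the smallest applicable
    rank, reported as '=k' in the Tables."""
    return {m: 1 + sum(v < values[m] for v in values.values())
            for m in values}

ranks = tie_rank(excess)
for m in models:
    print(m, round(100 * excess[m], 1), ranks[m])
```

Running this reproduces the lower-exceedance column of Table 2 panels (b) and (c): M1 and M3B tie at rank 1 with 0.6% excess, M7 is third at 6.1%, M2B and M5 tie at rank 4 with 20.0%, and M6 is last at 22.8%.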
Table 2: Exceedances Results for Rolling Fixed-Horizon Backtests: Parameters Uncertain

(a) Exceedances
Model     xL/n      xM/n
M1        5.6%     91.7%
M2B      25.0%     80.6%
M3B       5.6%     88.9%
M5       25.0%     88.9%
M6       27.8%     77.8%
M7       11.1%     88.9%

(b) Excess exceedances
Model   xL/n − 0.05   xM/n − 0.5
M1         0.6%         41.7%
M2B       20.0%         30.6%
M3B        0.6%         38.9%
M5        20.0%         38.9%
M6        22.8%         27.8%
M7         6.1%         38.9%

(c) Ranking by excess exceedances
Model     xL        xM
M1        =1         6
M2B       =4         2
M3B       =1        =3
M5        =4        =3
M6         6         1
M7         3        =3

Notes: Based on the exceedance results in Figures 10-15. xL and xM are the numbers of observations below the lower risk bound and below the median bound respectively, and n is the number of forecasts.

5. Mortality Probability Density Forecast Backtests: Backtests based on Statistical Hypothesis Tests

The fourth and final class of backtests involves formal hypothesis tests based on comparisons of realised outcomes against forecasts of the relevant probability densities (see note 10). To elaborate, suppose we wish to use data up to and including 1980 to evaluate the forecasted probability density function of the mortality rate of, say, 65-year-olds in 2006, involving a forecast horizon of 26 years ahead. A simple way to implement such a test is to use each model to forecast the cumulative distribution function (CDF) of the mortality rate in 2006, as of 1980, and compare the realised mortality rate for 2006 with its forecasted distribution.

Note 10: Since we are testing forecasts of future mortality rates, we can also regard the tests considered here as similar to tests of q-forward forecasts using the different models; q-forwards are mortality rate forward contracts (see Coughlan et al., 2007).

To carry out this backtest for any given model, we use the data from 1971 to 1980 to make a forecast of the CDF of the mortality rate of 65-year-olds in 2006 (see note 11). The curves shown in Figure 16 are the CDFs for each of the six models.
The null hypothesis is that the realised mortality rate for 65-year-old males in 2006 is consistent with the forecasted CDF, so we superimpose on these CDF curves the realised mortality rate for this age group in 2006. We then determine the p-value associated with the null hypothesis, which is obtained by taking the value of the CDF where the vertical line drawn up from the realised mortality rate on the x-axis crosses the CDF curve. A model passes the test if the p-value is large enough (i.e., is not too far out into the tails of the forecasted distribution; see note 12) and otherwise fails it.

Note 11: The reader will recall that we are now dealing with PU forecasts only.

Note 12: This statement does, however, assume that we are dealing with CDF-values on the left-hand side of the Figure. Strictly speaking, the forecasts would also fail if the realised mortality rate were associated with a CDF value in the upper tail of the forecasted distribution. This raises the issue of 1-sided versus 2-sided tests, which is dealt with further in note 14 below.

For the backtests illustrated in Figure 16, we get estimated p-values of 15.9%, 2.1%, 7.4%, 4.9%, 5.2% and 16.5% for models M1, M2B, M3B, M5, M6 and M7, respectively (see note 13). These results tell us that the null hypothesis that M1 generates adequate forecasts over a horizon of 26 years is associated with a p-value of 15.9%, so M1 would 'pass' a test based on a conventional significance level of 5%. Given their p-values, the forecasts of M2B and M5 'fail' at the 5% but 'pass' at the 1% significance level, whereas those of the other models 'pass' at both significance levels (see note 14).

Note 13: We should be careful to resist the temptation to use the estimated p-values to rank models. The p-values from different models are not comparable because the alternative hypotheses are not comparable across the models. We can therefore only use the p-values to test each model on a stand-alone basis.

Note 14: Where the realised values fall into the left-hand tails, the reported p-values are those associated with 1-sided tests of the null. Of course, we may wish to work with 2-sided tests instead, and we should in principle also take account of the possibility that we might get 'fail' results because realised mortality outcomes fall into the extreme right-hand side of the forecasted density. These extensions are fairly obvious and there is no need to dwell on them further here.

Figure 16: Bootstrapped P-Values of Realised Mortality Outcomes: Males aged 65, 1980 Start, Horizon = 26 Years Ahead
[Figure: six panels, one per model, each showing the forecast CDF under the null with the realised mortality rate q = 0.0149 superimposed. The p-values are 0.159 (M1), 0.021 (M2B), 0.074 (M3B), 0.049 (M5), 0.052 (M6) and 0.165 (M7).]
Note: Forecasts based on models estimated using English & Welsh male mortality data for ages 60-89 and years 1971-1980 assuming parameter uncertainty. Each figure shows the bootstrapped forecast CDF of the mortality rate in 2006, where the forecast is based on the density forecast made 'as if' in 1980. Each black vertical line gives the realised mortality rate in 2006 and its associated p-value in terms of the forecast CDF. Based on 5,000 simulation trials.

Now suppose we are interested in carrying out a test of the models' 25-year-ahead mortality density forecasts. There are now two cases to consider: we can start in 1980 and carry out the test for a forecast of the 2005 mortality density, or we can start in 1981 and carry out the test for a forecast of the 2006 mortality density (see note 15). These are illustrated in Figures 17 and 18 below. Pulling the results from these three Figures together, models M1, M3B and M7 always 'pass' tests at conventional significance levels, whereas models M2B, M5 and M6 'fail' at least once at the 5% significance level.

Note 15: In principle, both these tests should give much the same answer under the null hypothesis and there is nothing to choose between them other than the somewhat extraneous consideration (for present purposes) that the latter test uses slightly later data.

Figure 17: Bootstrapped P-Values of Realised Mortality Outcomes: Males aged 65, 1980 Start, Horizon = 25 Years Ahead
[Figure: as Figure 16, but for the mortality rate in 2005. The p-values are 0.152 (M1), 0.020 (M2B), 0.077 (M3B), 0.054 (M5), 0.045 (M6) and 0.164 (M7).]
Note: Forecasts based on models estimated using English & Welsh male mortality data for ages 60-89 and years 1971-1980 assuming parameter uncertainty. Each figure shows the bootstrapped forecast CDF of the mortality rate in 2005, where the forecast is based on the density forecast made 'as if' in 1980. Each black vertical line gives the realised mortality rate in 2005 and its associated p-value in terms of the forecast CDF. Based on 5,000 simulation trials.

Figure 18: Bootstrapped P-Values of Realised Mortality Outcomes: Males aged 65, 1981 Start, Horizon = 25 Years Ahead
[Figure: as Figure 16, but with models estimated on data for 1972-1981 and tested on the mortality rate in 2006. The estimated p-values range from 0.066 to 0.287, so all six models 'pass' at the 5% level.]
Note: Forecasts based on estimates using English & Welsh male mortality data for ages 60-89 and years 1972-1981 assuming parameter uncertainty. Each figure shows the bootstrapped forecast CDF of the mortality rate in 2006, where the forecast is based on the density forecast made 'as if' in 1981. Each black vertical line gives the realised mortality rate in 2006 and its associated p-value in terms of the forecast CDF. Based on 5,000 simulation trials.
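The p-value computation used throughout this section – reading the bootstrapped forecast CDF at the realised mortality rate – reduces to evaluating an empirical CDF. A sketch, with hypothetical simulated rates standing in for a model's bootstrap output:

```python
import numpy as np

def cdf_p_value(sim_rates, realised):
    """Evaluate the empirical (bootstrapped) forecast CDF at the
    realised mortality rate: the fraction of simulated rates at or
    below the realised outcome (the 1-sided, lower-tail case)."""
    sim_sorted = np.sort(np.asarray(sim_rates))
    return np.searchsorted(sim_sorted, realised, side="right") / len(sim_sorted)

# Toy example: a forecast density centred well above the realised
# rate q = 0.0149, so the realised rate lies in the lower tail and
# the resulting p-value is small.
rng = np.random.default_rng(1)
sim = rng.normal(loc=0.020, scale=0.003, size=5000)
p = cdf_p_value(sim, 0.0149)
print(round(p, 3))
```

A small p-value here corresponds to a realised rate far into the lower tail of the forecast distribution, i.e. a model that 'fails' the test at the chosen significance level.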
Along the same lines, if we wish to test the models' 24-year-ahead density forecasts, we have three choices: start in 1980 and forecast the density for 2004, start in 1981 and forecast the density for 2005, or start in 1982 and forecast the density for 2006 (see note 16). We can carry on in the same way for forecasts 23 years ahead, 22 years ahead, and so on, down to forecasts 1 year ahead. By the time we get to 1-year-ahead forecasts, we have 26 different possibilities (i.e., start in 1980, start in 1981, etc., up to start in 2005).

Note 16: Again, in principle each of these tests should yield similar results under the null.

So, for any given model, we can construct a sample of 26 different estimates of the p-value of the null associated with 1-year-ahead forecasts, a sample of 25 different estimates of the p-value associated with 2-year-ahead forecasts, and so forth. We can then construct plots of estimated p-values for any given horizon or set of horizons.

Some examples are given in Figure 19, which plots the p-values for 65-year-olds associated with forecast horizons of 5, 10 and 15 years. The p-values are fairly variable, but our best estimate of each p-value is the mean: for example, our best estimate of the p-value associated with 5-year-ahead forecasts is the mean of the 22 sample 'observations' of the estimated 5-year-ahead p-value, our best estimate of the p-value associated with 10-year-ahead forecasts is the mean of the 17 sample 'observations' of the estimated 10-year-ahead p-values, and so forth; the Figure also reports these mean estimates (see note 17). Figures 20 and 21 give the corresponding results for 75-year-olds and 85-year-olds. In most cases considered, the average p-values decline as the forecast horizon lengthens, indicating that forecast performance usually deteriorates as the horizon lengthens.
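The rolling-origin bookkeeping just described can be sketched as follows; `p_value` here is a hypothetical placeholder for a model's bootstrapped test, not one of the paper's models:

```python
# With data ending in 2006 and start years from 1980 onwards, an
# h-year-ahead test has (2006 - h) - 1980 + 1 admissible start years,
# and the 'best' estimate of the h-year-ahead p-value is the mean of
# the estimates over those starts.
def p_value(start_year, horizon):
    return 0.3 / (1 + 0.1 * horizon)  # stand-in for a bootstrapped test

LAST_DATA_YEAR = 2006

def mean_p_by_horizon(first_start=1980, horizons=(5, 10, 15)):
    means = {}
    for h in horizons:
        starts = range(first_start, LAST_DATA_YEAR - h + 1)
        ps = [p_value(s, h) for s in starts]
        means[h] = (len(ps), sum(ps) / len(ps))
    return means

# 22 starts for h = 5 (1980-2001), 17 for h = 10, 12 for h = 15 and
# 26 for h = 1, matching the sample sizes quoted in the text.
for h, (count, mean_p) in mean_p_by_horizon().items():
    print(h, count, round(mean_p, 3))
```

The same loop, run with a real model's bootstrapped `p_value`, yields the horizon-wise mean p-values reported in Figures 19-21.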
For ages 75 and 85, we find that no model 'fails' even at the 5% significance level; for age 65, however, we find three 'failures' (out of a possible 18) at the 5% level: M2B, M5 and M6 'fail' for h = 15. Thus, out of a total of 3 × 18 = 54 possible 'failures', we find no 'fails' at the 1% significance level and only three at the 5% significance level (see note 18).

Note 17: We also repeat our earlier warning from section 3, just before Figure 4: the outcomes are not independent, so we cannot treat alternative estimates of the p-values for any given model and horizon h as independent. However, this does not invalidate our claim that the 'best' estimate of any of these p-values is the mean, as reported in Figures 19-21.

Note 18: With so few 'fails' to report, there is little point in summarising the results in a Table as we did with the results from the previous two sections.

Figure 19: Long-Horizon P-Values of Realised Future Mortality Rates: Males aged 65
[Figure: six panels, one per model, plotting the estimated p-values against starting year for forecast horizons of h = 5, 10 and 15 years, with the mean p-value for each horizon reported in each panel; the three h = 15 means that fall below 5% belong to M2B, M5 and M6.]
Notes: Forecasts based on estimates using English & Welsh male mortality data for ages 60-89 and a rolling 10-year window assuming parameter uncertainty. The three lines in each panel refer to p-values over horizons of h = 5, 10 and 15 years respectively. Based on 5,000 simulation trials.

Figure 20: Long-Horizon P-Values of Realised Future Mortality Rates: Males aged 75
[Figure: as Figure 19, for males aged 75; the mean p-values lie between roughly 0.23 and 0.33 for all models and horizons.]
Notes: As per Figure 19.
Figure 21: Long-Horizon P-Values of Realised Future Mortality Rates: Males aged 85
[Figure: as Figure 19, for males aged 85; the mean p-values lie between roughly 0.24 and 0.39 for all models and horizons.]
Notes: As per Figure 19.

6. Conclusions

The purposes of this paper are: (i) to set out a backtesting framework that can be used to evaluate the ex-post forecasting performance of stochastic mortality models; and (ii) to illustrate this framework by evaluating the forecasting performance of a number of mortality models calibrated on a particular data set.
The backtesting framework presented here is based on the idea that forecast distributions should be compared against subsequently realised mortality outcomes: if the realised outcomes are compatible with their forecasted distributions, this suggests that the forecasts, and the models that generated them, are good ones; if the forecast distributions and realised outcomes are incompatible, this suggests that the forecasts and models are poor. We discussed four different classes of backtest building on this general idea: (i) backtests based on the convergence of forecasts through time towards the mortality rate(s) in a given year; (ii) backtests based on the accuracy of forecasts over expanding horizons; (iii) backtests based on the accuracy of forecasts over rolling fixed-length horizons; and (iv) backtests based on formal hypothesis tests that compare realised outcomes against forecasts of the relevant densities over specified horizons.

As far as the individual models are concerned, we find that models M1, M3B, M5, M6 and M7 perform well most of the time, and there is relatively little to choose between them. Of the Lee-Carter class of models examined here, it is difficult to choose between M1 and M3B; of the CBD models (M5, M6 and M7), the evidence presented here points to M7 being the best performer on this dataset, with little to choose between M5 and M6. Model M2B, by contrast, repeatedly shows evidence of considerable forecast instability. However, we should make clear that we have examined a particular version of M2 in this study, and so cannot rule out the possibility that other specifications of, or extensions to, M2 might resolve the stability problem identified both here and elsewhere (see, e.g., Cairns et al. (2008) and Dowd et al. (2008)).
We would emphasise that these results are obtained using one particular data set – that for English & Welsh males – over limited sample periods. Accordingly, we make no claim for how these models might perform over other data sets or sample periods.

We would also make two other comments of a more general nature. If one looks over Figures 4-15, it is clear that, in most but not all cases, and in varying degrees depending on the model and on whether we take account of parameter uncertainty:

• There are fewer upper exceedances than predicted under the null;
• There are usually more median exceedances – realised outcomes below the median – than predicted; and
• There are usually many more lower exceedances than predicted, especially for the parameter-certain forecasts.

These findings suggest that all the bounds – especially the PC lower bounds – are generally too high. In short, in most cases, the forecasted bounds are biased upwards. Needless to say, this is useful information to potential users.

Finally, the results presented strongly suggest that models that take account of parameter uncertainty generate more plausible and better performing forecasts than models that do not. We obtained this finding repeatedly and for every model considered, which leads us to expect similar findings for other models and datasets. Thus, regardless of the particular model one might use, our results highlight the importance of parameter uncertainty and the need to take account of it if one is to make plausible forecasts of future mortality.

References

Berkowitz, J. (2001) "Testing density forecasts, with applications to risk management." Journal of Business and Economic Statistics 19: 465-474.

Blake, D., K. Dowd, and A. J. G. Cairns (2008) "Longevity risk and the grim reaper's toxic tail: The survivor fan charts." Insurance: Mathematics and Economics 42: 1062-1068.

Cairns, A. J. G.
(2000) "A discussion of parameter and model uncertainty in insurance." Insurance: Mathematics and Economics 27: 313-330.

Cairns, A. J. G., D. Blake and K. Dowd (2006) "A two-factor model for stochastic mortality with parameter uncertainty: Theory and calibration." Journal of Risk and Insurance 73: 687-718.

Cairns, A. J. G., D. Blake, K. Dowd, G. D. Coughlan, D. Epstein, A. Ong, and I. Balevich (2007) "A quantitative comparison of stochastic mortality models using data from England & Wales and the United States." Pensions Institute Discussion Paper PI-0701.

Cairns, A. J. G., D. Blake, K. Dowd, G. D. Coughlan, D. Epstein, and M. Khalaf-Allah (2008) "Mortality density forecasts: An analysis of six stochastic mortality models." Pensions Institute Discussion Paper PI-0801.

Continuous Mortality Investigation (2006) "Stochastic projection methodologies: Further progress and P-Spline model features, example results and implications." Working Paper 20.

Continuous Mortality Investigation (2007) "Stochastic projection methodologies: Lee-Carter model features, example results and implications." Working Paper 25.

Coughlan, G., D. Epstein, A. Sinha, and P. Honig (2007) "q-Forwards: Derivatives for transferring longevity and mortality risks." Available at www.jpmorgan.com/lifemetrics.

Currie, I. D., M. Durban, and P. H. C. Eilers (2004) "Smoothing and forecasting mortality rates." Statistical Modelling 4: 279-298.

Currie, I. D. (2006) "Smoothing and forecasting mortality rates with P-splines." Talk given at the Institute of Actuaries, June 2006. See http://www.ma.hw.ac.uk/~iain/research/talks.html.

Diebold, F. X., T. A. Gunther, and A. S. Tay (1998) "Evaluating density forecasts with applications to financial risk management." International Economic Review 39: 863-883.

Dowd, K. (2007) "Backtesting risk models within a standard normality framework." Journal of Risk 9: 93-111.

Dowd, K.
(2008) "The GDP fan charts: An empirical evaluation." National Institute Economic Review 203(1): 59-67.

Dowd, K., A. J. G. Cairns and D. Blake (2006) "Mortality-dependent measures of financial risk." Insurance: Mathematics and Economics 38: 427-440.

Dowd, K., A. J. G. Cairns, D. Blake, G. D. Coughlan, D. Epstein, and M. Khalaf-Allah (2008) "Evaluating the goodness of fit of stochastic mortality models." Pensions Institute Discussion Paper PI-0802.

Embrechts, P., C. Klüppelberg, and T. Mikosch (1997) Modelling Extremal Events for Insurance and Finance. Berlin: Springer-Verlag.

Keilman, N. (1997) "Ex-post errors in official population forecasts in industrialized countries." Journal of Official Statistics (Statistics Sweden) 13: 245-247.

Keilman, N. (1998) "How accurate are the United Nations world population projections?" Population and Development Review 24: 15-41.

Koissi, M.-C., A. Shapiro, and G. Högnäs (2006) "Evaluating and extending the Lee-Carter model for mortality forecasting: Bootstrap confidence interval." Insurance: Mathematics and Economics 38: 1-20.

Lee, R. D., and L. R. Carter (1992) "Modeling and forecasting U.S. mortality." Journal of the American Statistical Association 87: 659-675.

Lee, R. D., and T. Miller (2001) "Evaluating the performance of the Lee-Carter method for forecasting mortality." Demography 38: 537-549.

Longstaff, F. A., and E. S. Schwartz (1992) "Interest rate volatility and the term structure: A two-factor general equilibrium model." Journal of Finance 47: 1259-1282.

National Research Council (2000) Beyond Six Billion: Forecasting the World's Population, edited by J. Bongaarts and R. A. Bulatao. Washington, DC: National Academy Press.

Renshaw, A. E., and S. Haberman (2006) "A cohort-based extension to the Lee-Carter model for mortality reduction factors." Insurance: Mathematics and Economics 38: 556-570.

Vasicek, O. (1977) "An equilibrium characterization of the term structure." Journal of Financial Economics 5: 177-188.
Wilmoth, J. R. (1998) "Is the pace of Japanese mortality decline converging toward international trends?" Population and Development Review 24: 593-600.

Appendix A: The Stochastic Mortality Models Considered in this Study

Model M1

Model M1, the Lee-Carter model, postulates that the true underlying death rate, m(t, x) = -\log(1 - q(t, x)), satisfies:

\log m(t, x) = \beta_x^{(1)} + \beta_x^{(2)} \kappa_t^{(2)}    (A1)

where the state variable \kappa_t^{(2)} follows a one-dimensional random walk with drift:

\kappa_t^{(2)} = \kappa_{t-1}^{(2)} + \mu + C Z_t^{(2)}    (A2)

in which \mu is a constant drift term, C a constant volatility and Z_t^{(2)} a one-dimensional iid N(0,1) error. This model also satisfies the identifiability constraints:

\sum_t \kappa_t^{(2)} = 0 and \sum_x \beta_x^{(2)} = 1    (A3)

Model M2B

Model M2B is an extension of Lee-Carter that allows for a cohort effect. It postulates that m(t, x) satisfies:

\log m(t, x) = \beta_x^{(1)} + \beta_x^{(2)} \kappa_t^{(2)} + \beta_x^{(3)} \gamma_c^{(3)}    (A4)

where the state variable \kappa_t^{(2)} follows (A2) and \gamma_c^{(3)} is a cohort effect, where c = t - x is the year of birth. We follow Cairns et al. (2008) and CMI (2007) and model \gamma_c^{(3)} as an ARIMA(1,1,0) process:

\Delta\gamma_c^{(3)} = \mu^{(\gamma)} + \alpha^{(\gamma)} (\Delta\gamma_{c-1}^{(3)} - \mu^{(\gamma)}) + \sigma^{(\gamma)} Z_c^{(\gamma)}    (A5)

with parameters \mu^{(\gamma)}, \alpha^{(\gamma)} and \sigma^{(\gamma)}, and Z_c^{(\gamma)} iid N(0,1). This model satisfies the identifiability constraints:

\sum_t \kappa_t^{(2)} = 0, \sum_x \beta_x^{(2)} = 1, \sum_c \gamma_c^{(3)} = 0 and \sum_x \beta_x^{(3)} = 1    (A6)

Model M3B

This model is a simplified version of M2B which postulates:

\log m(t, x) = \beta_x^{(1)} + \kappa_t^{(2)} + \gamma_c^{(3)}    (A7)

where the variables (including the cohort effect) are the same as for M2B.
This model satisfies the constraints:

$$\sum_t \kappa_t^{(2)} = 0 \quad \text{and} \quad \sum_{x,t} \gamma_{t-x}^{(3)} = 0 \quad \text{(A8)}$$

Given these constraints, we let $\beta_x^{(1)} = 10^{-1} \sum_t \log m(t,x)$ – bearing in mind that our lookback sample window has a length of 10 years – and obtain

$$\delta = -\frac{\sum_x (x - \bar{x}) \left( \beta_x^{(1)} - \bar{\beta}^{(1)} \right)}{\sum_x (x - \bar{x})^2} \quad \text{(A9)}$$

where $\bar{x}$ and $\bar{\beta}^{(1)}$ denote averages over the ages in the dataset, and thence obtain the following revised parameter estimates:

$$\begin{aligned}
\kappa_t^{(2)} &= \kappa_t^{(2)} - \delta (t - \bar{t}) \\
\gamma_{t-x}^{(3)} &= \gamma_{t-x}^{(3)} + \delta \left( (t - \bar{t}) - (x - \bar{x}) \right) \\
\beta_x^{(1)} &= \beta_x^{(1)} + \delta (x - \bar{x})
\end{aligned} \quad \text{(A10)}$$

which are the ones on which the forecasts are based.

Model M5

Model M5 is a reparameterised version of the CBD two-factor mortality model (Cairns et al., 2006) and postulates that $q(t,x)$ satisfies:

$$\operatorname{logit} q(t,x) = \kappa_t^{(1)} + \kappa_t^{(2)} (x - \bar{x}) \quad \text{(A11)}$$

where $\bar{x}$ is the average of the ages used in the dataset and where the state variables follow a two-dimensional random walk with drift:

$$\kappa_t = \kappa_{t-1} + \mu + C Z_t \quad \text{(A12)}$$

in which $\mu$ is a constant $2 \times 1$ drift vector, $C$ is a constant $2 \times 2$ upper triangular ‘volatility’ matrix (to be precise, the Cholesky ‘square root’ of the variance-covariance matrix) and $Z_t$ is a two-dimensional standard normal variable, each component of which is independent of the other.

Model M6

M6 is a generalised version of M5 with a cohort effect:

$$\operatorname{logit} q(t,x) = \kappa_t^{(1)} + \kappa_t^{(2)} (x - \bar{x}) + \gamma_c^{(3)} \quad \text{(A13)}$$

where the $\kappa_t$ process follows (A12) and the $\gamma_c^{(3)}$ process follows (A5). This model satisfies the identifiability constraints:

$$\sum_{c \in C} \gamma_c^{(3)} = 0 \quad \text{and} \quad \sum_{c \in C} c\, \gamma_c^{(3)} = 0 \quad \text{(A14)}$$

Model M7

M7 is a second generalised version of M5 with a cohort effect:

$$\operatorname{logit} q(t,x) = \kappa_t^{(1)} + \kappa_t^{(2)} (x - \bar{x}) + \kappa_t^{(3)} \left( (x - \bar{x})^2 - \sigma_x^2 \right) + \gamma_c^{(4)} \quad \text{(A15)}$$

where the state variables $\kappa_t$ follow a three-dimensional random walk with drift, $\sigma_x^2$ is the variance of the age range used in the dataset, and $\gamma_c^{(4)}$ is a cohort effect that is modelled as an AR(1) process.
This model satisfies the identifiability constraints:

$$\sum_{c \in C} \gamma_c^{(4)} = 0, \quad \sum_{c \in C} c\, \gamma_c^{(4)} = 0 \quad \text{and} \quad \sum_{c \in C} c^2 \gamma_c^{(4)} = 0 \quad \text{(A16)}$$

More details on the models can be found in Cairns et al. (2007).

Appendix B: Simulating Model Parameters under Conditions of Parameter Uncertainty

This Appendix explains the procedure used to simulate the values of the parameters driving the $\kappa_t$ process and, where the model involves a cohort effect $\gamma_c$, the parameters of the process that drives $\gamma_c$. The approach used is a Bayesian one based on non-informative Jeffreys prior distributions (see Cairns, 2000, and Cairns et al., 2006).

Simulating the values of the parameters driving the $\kappa_t$ process

Recall from Appendix A that each of the models has an underlying random walk with drift:

$$\kappa_t = \kappa_{t-1} + \mu + C Z_t \quad \text{(B1)}$$

where, writing $V = CC^T$ for the variance-covariance matrix of the innovations: in the case of M1, M2B and M3B, $\kappa_t$, $\mu$ and $V$ are scalars; in the case of M5 and M6, $\kappa_t$ and $\mu$ are $2 \times 1$ vectors and $V$ is a $2 \times 2$ matrix; and in the case of M7, $\kappa_t$ and $\mu$ are $3 \times 1$ vectors and $V$ is a $3 \times 3$ matrix. For the sake of generality, we refer below to $\mu$ as a vector and $V$ as a matrix. The parameters of (B1) are estimated by MLE using a sample of size $n = 10$ for the results presented in this paper.

We first simulate $V$ from its posterior distribution, under which $V^{-1}$ has the Wishart$(n-1, n^{-1}\hat{V}^{-1})$ distribution. To do so, we carry out the following steps:

● We simulate $n-1$ iid vectors $\alpha_1, \ldots, \alpha_{n-1}$ from a multivariate normal distribution with mean vector 0 and covariance matrix $n^{-1}\hat{V}^{-1}$. These vectors will have dimensions $1 \times 1$ for M1, M2B and M3B, dimensions $2 \times 1$ for M5 and M6, and dimensions $3 \times 1$ for M7.

● We construct the matrix $X = \sum_{i=1}^{n-1} \alpha_i \alpha_i^T$.

● We then invert $X$ to obtain our simulated positive-definite covariance matrix $V (= X^{-1})$.

Having obtained our simulated $V$ matrix, we simulate the $\mu$ vector from MVN$(\hat{\mu}, n^{-1}V)$.

Simulating the values of the parameters driving the $\gamma_c$ process

Recall that $c$ is the year of birth.
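As a concrete illustration of the $(V, \mu)$ simulation steps just described, the following is a minimal Python/NumPy sketch. The function name and the numerical inputs (`mu_hat`, `V_hat`, `n`) are illustrative placeholders, not estimates from the paper:

```python
import numpy as np

def simulate_drift_params(mu_hat, V_hat, n, rng):
    """Draw (mu, V) from their joint posterior under a Jeffreys prior.

    mu_hat : (d,) MLE of the drift vector mu
    V_hat  : (d, d) MLE of the innovation covariance matrix V
    n      : sample size used in the MLE estimation
    """
    d = len(mu_hat)
    # Step 1: n-1 iid draws from MVN(0, n^{-1} V_hat^{-1}), so that
    # X = sum_i alpha_i alpha_i^T ~ Wishart(n-1, n^{-1} V_hat^{-1}).
    alphas = rng.multivariate_normal(
        np.zeros(d), np.linalg.inv(n * V_hat), size=n - 1
    )
    X = alphas.T @ alphas
    # Step 2: invert X to obtain the simulated covariance matrix V.
    V = np.linalg.inv(X)
    # Step 3: given V, draw mu from MVN(mu_hat, n^{-1} V).
    mu = rng.multivariate_normal(mu_hat, V / n)
    return mu, V

# Illustrative two-dimensional (M5/M6-style) inputs:
rng = np.random.default_rng(42)
mu, V = simulate_drift_params(
    np.array([-0.5, 0.01]),
    np.array([[0.04, 0.001], [0.001, 0.0001]]),
    10,
    rng,
)
```

Each call returns one joint draw of $(\mu, V)$; in a simulation exercise the function would be called once per sample path so that parameter uncertainty feeds through into the density forecasts.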
We first note that, for model M7, the cohort effect $\gamma_c^{(4)}$ follows an AR(1) process, and that, for models M2B, M3B and M6, the cohort effect $\gamma_c^{(3)}$ follows an ARIMA(1,1,0) process. The latter implies that the first difference of $\gamma_c^{(3)}$, $\Delta\gamma_c^{(3)}$, follows an AR(1) process. Hence our task is to simulate the values of the parameters $\mu^{(\gamma)}$, $\alpha^{(\gamma)}$ and $\sigma^{(\gamma)}$ in the following equation:

$$\gamma_c = \mu^{(\gamma)} + \alpha^{(\gamma)} \gamma_{c-1} + \sigma^{(\gamma)} \varepsilon_c \quad \text{(B2)}$$

where $\gamma_c$ is, as appropriate, either $\Delta\gamma_c^{(3)}$ or $\gamma_c^{(4)}$. The method we use is taken from Cairns (2000, pp. 320-321) and is based on a Jeffreys prior distribution for $(\mu^{(\gamma)}, \alpha^{(\gamma)}, \sigma^{(\gamma)})$ and the MLE estimates of these parameters, $(\hat{\mu}^{(\gamma)}, \hat{\alpha}^{(\gamma)}, \hat{\sigma}^{(\gamma)})$. This involves the following steps:

• We simulate the value of $\alpha^{(\gamma)}$, given $\hat{\alpha}^{(\gamma)}$ and $n$, from its posterior distribution

$$f(\alpha^{(\gamma)}) \propto \left( (\alpha^{(\gamma)})^2 - 2 \alpha^{(\gamma)} \hat{\alpha}^{(\gamma)} + 1 \right)^{-(n-1)/2}$$

using a suitable algorithm (e.g., the rejection method).

• We then simulate $X \sim \chi^2_{n-1}$ and obtain a simulated value of $(\sigma^{(\gamma)})^2$ from

$$(\sigma^{(\gamma)})^2 = \frac{(n-1)(\hat{\sigma}^{(\gamma)})^2 \left( 1 + (\alpha^{(\gamma)} - \hat{\alpha}^{(\gamma)})^2 / (1 - (\hat{\alpha}^{(\gamma)})^2) \right)}{X}$$

• Finally, we simulate $Z \sim N(0,1)$ and obtain a simulated value of $\mu^{(\gamma)}$ from

$$\mu^{(\gamma)} = \hat{\mu}^{(\gamma)} + \sqrt{\frac{(\sigma^{(\gamma)})^2}{(n-1)(1 - \alpha^{(\gamma)})^2}} \, Z$$

Further details are provided in Cairns (2000, pp. 320-321).
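The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `mu_hat`, `alpha_hat`, `sigma_hat` and `n` are placeholder names for the MLE inputs, and a uniform proposal on $(-1, 1)$ is one concrete choice for the rejection step, which the text leaves open ("a suitable algorithm"):

```python
import numpy as np

def simulate_ar1_params(mu_hat, alpha_hat, sigma_hat, n, rng):
    """Draw (mu, alpha, sigma) for the AR(1) cohort process from the
    posterior described above (Jeffreys prior, after Cairns, 2000)."""
    # Step 1: rejection-sample alpha on (-1, 1) from the kernel
    #   f(alpha) ∝ (alpha^2 - 2*alpha*alpha_hat + 1)^(-(n-1)/2),
    # which peaks at alpha = alpha_hat with value (1 - alpha_hat^2)^(-(n-1)/2).
    while True:
        alpha = rng.uniform(-1.0, 1.0)
        ratio = (
            (alpha**2 - 2 * alpha * alpha_hat + 1) / (1 - alpha_hat**2)
        ) ** (-(n - 1) / 2)
        if rng.uniform() <= ratio:  # accept with probability f(alpha)/f_max
            break
    # Step 2: chi-squared draw for the innovation variance.
    X = rng.chisquare(n - 1)
    sigma2 = (
        (n - 1)
        * sigma_hat**2
        * (1 + (alpha - alpha_hat) ** 2 / (1 - alpha_hat**2))
        / X
    )
    # Step 3: normal draw for the drift parameter, given alpha and sigma.
    mu = mu_hat + np.sqrt(sigma2 / ((n - 1) * (1 - alpha) ** 2)) * rng.standard_normal()
    return mu, alpha, np.sqrt(sigma2)

# Illustrative placeholder inputs:
rng = np.random.default_rng(1)
mu, alpha, sigma = simulate_ar1_params(0.0, 0.5, 0.1, 10, rng)
```

Because the kernel is bounded on $(-1, 1)$, the rejection loop terminates with probability one; each call then returns one joint draw of the three cohort-process parameters.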