PREDICTIVE EVALUATION OF LOGISTIC MODELS
by
Françoise Seillier-Moiseiwitsch
Department of Biostatistics, University of
North Carolina at Chapel Hill, NC.
Institute of Statistics Mimeo Series No. 2101
June 1992
ABSTRACT
Logistic models are assessed for their ability to produce valid forecasts for new
observables rather than for their goodness of fit to past observations. The diagnostics
described here are based on scoring rules and, being updated easily, are particularly
well-suited to a sequential context. Test statistics measuring different aspects of
empirical validity are described. Simulations for moderate sample sizes complement
results pertaining to their asymptotic behaviour.
Key Words: deviance, logistic model, prediction, scoring rule.
1. Introduction
In the work of Guttman (1967), Akaike (1974), Stone (1974), Geisser (1975) and Geisser & Eddy
(1979), attention has turned, in model selection, from the capability to explain past data to the ability
to predict future out-turns. Putting philosophical standpoints aside, predictive model assessment offers
sound evaluations: the data are not used simultaneously for estimation and validation. This approach
also provides a safeguard against overparametrization. Furthermore, these papers rightly emphasize the
importance of sound model evaluation once the model building phase is completed.
Here, the evaluation of logistic models is cast in a sequential framework. This often reflects constraints imposed by the data collection process. It also mimics the way the model will be used in the
future. So, at instance i, the forecast relies on the previous i-1 observations (as well as on the covariate
information associated with all observations up to and including instance i) and, once the outcome of
the ith event becomes known, it will in turn be used to forecast the next event.
Therefore, neither will estimated parameter values be compared to posited ones nor will expectations be taken over unrealized values (as is done when calculating the mean squared error of prediction). The model evaluation is based solely on probabilities for future events generated by the model.
These will be measured against the outcomes via a scoring rule. The training set, of size n, consists of
the observations from which the first parameter estimates are generated. All other data points serve, in
turn, two purposes: validation and estimation.
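To make this scheme concrete, the following is a minimal Python sketch of such a one-step-ahead evaluation loop. It is an illustration only: the function name is hypothetical, a full refit at each step stands in for the single updating step just described, and a recent scikit-learn is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_step_ahead_forecasts(X, y, n_train):
    """Forecast event i from observations 1..i-1 (i > n_train); once the
    ith outcome is observed it joins the estimation set for event i+1."""
    forecasts = []
    for i in range(n_train, len(y)):
        model = LogisticRegression(penalty=None).fit(X[:i], y[:i])
        forecasts.append(model.predict_proba(X[i:i + 1])[0, 1])
    return np.array(forecasts)   # p_{n+1}, ..., p_N
```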
What is described here is an adaptation of cross-validation to sequential set-ups (Dawid, 1984). Each
new data point involves a single re-estimation of the parameters, while cross-validation would require,
in its one-out-at-a-time version, as many new fits as there are data points. As evidenced later, this set-up is more amenable to formal testing than cross-validation. Earlier papers have looked into the well-foundedness of the proposed diagnostics from a general view-point (Seillier-Moiseiwitsch &
Dawid, 1993; Seillier-Moiseiwitsch et al., 1992).
The next section reviews scoring rules and their use. The proposed diagnostic tests are described in
section 3. Proofs of their distributional properties are relegated to the Appendix. Section 4 summarizes
some empirical convergence results. In section 5, several bootstrap procedures are described and used
to improve the relevance of the reference distribution of the tests. Some empirical power studies are
presented in section 6.
2. Scoring Rules
To be of use to a decision-maker, model-based probabilistic statements should exhibit a high
degree of realism. Their realism can be quantified via measures of calibration and resolution (DeGroot
& Fienberg, 1983). In well-calibrated forecasting systems, the frequency of occurrence of the event of
interest among those instances which were assigned a probability p (a(p), say) should be close to p.
Perfect resolution is attained when all the events with the same likelihood of occurrence are given the
same probability. In the context of simple two-decision situations, the performance of a model can
always be improved by its calibrated version (i.e. quoting a(p) rather than p) (Schervish, 1989). Hence,
high resolution is the more desirable of these two attributes. Indeed, given enough data, recalibration is
always possible while improving resolution would involve a reassessment of the information at hand
(and hence the formulation of a new model).
Scoring rules were devised to help in determining the suitability of probability forecasts. Among
the class of all possible such functions, those which encourage honesty, i.e. which are optimized whenever the forecasters report their actual beliefs, are termed proper. These have been shown to be an
aggregate measure of both calibration and resolution (DeGroot & Fienberg, 1983). Let A_1, ..., A_N
denote the sequence of events of interest and p_i the model's probability that A_i occurs (calculated on
the basis of A_1, ..., A_{i-1}). For simplicity, assume that the A_i's are binary events. A generic bounded
scoring rule can be characterized as follows:

    S(A_i, p_i) = J(p_i) + (A_i - p_i) dJ(p_i)/dp_i,

where J(p_i) is a strictly convex, differentiable and bounded function on [0,1]. Then the average score
after N instances can be partitioned as follows:

    S̄(N) = (1/N) Σ_{i=1}^{N} S(A_i, p_i) = S_1(N) + S_2(N)

with

    S_1(N) = Σ_{p∈P} f(p) { a(p) (S(1,p) - S(1,a(p))) + (1-a(p)) (S(0,p) - S(0,a(p))) }   and   S_2(N) = Σ_{p∈P} f(p) Φ(a(p)),

where Φ(t) = t S(1,t) + (1-t) S(0,t), 0 ≤ t ≤ 1, P is the set of allowable probabilities and f(p) the frequency
of prediction p.
Example 1: The Brier score

    BS = (1/m) Σ_{i=n+1}^{N} (A_i - p_i)²,

where m = N-n and n is the number of observations that ensures that the first estimates of the parameters are relatively stable.

Example 2: The overall calibration score

    OC = (1/m) Σ_{i=n+1}^{N} (A_i - p_i)

quantifies the overall bias in the forecasts.

Example 3: The logarithmic score

    (1/m) Σ_{i=n+1}^{N} { A_i log p_i + (1 - A_i) log(1 - p_i) }

is simply the average loglikelihood under the predictive distribution.
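A direct transcription of these three examples, assuming binary outcomes y in {0,1} and forecasts p in (0,1) over the evaluation instances n+1, ..., N (function names are illustrative, not the paper's):

```python
import numpy as np

def brier_score(y, p):
    return np.mean((y - p) ** 2)        # Example 1

def overall_calibration(y, p):
    return np.mean(y - p)               # Example 2: overall bias

def log_score(y, p):
    # Example 3: average loglikelihood under the predictive distribution
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```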
When enough data are available, an in-depth study of predictive performance would involve partitioning the original sequence into a number of subsequences and computing a score on each subsequence separately. These subsequences can be generated in a number of ways. For instance, one can
divide the probability range into subintervals and consider instances which were assigned probability
forecasts in the same subinterval. One can also stratify the set of events according to their covariates: similar covariate values would put the events into the same subset.
For a general survey of scoring rules and their merits, see Murphy & Epstein (1967a,b), Winkler
& Murphy (1968), Murphy & Winkler (1970), Dawid (1986).
"
3. Diagnostic Tests for Predictive Validity
Starting with the case where the A_i's are binary events, we assume that

    A_i ~ Bernoulli(p_i(β))   with   logit p_i(β) = β^T x_i   for i ∈ {1, ..., N},

where both β and x_i are k × 1 vectors. Denote by P_β the probability distribution stemming from this
model.
As in classical hypothesis testing, the diagnostics take the form of Z-statistics. The raw scores are
therefore standardized. But, here, the standardization is performed with respect to the forecasting
model, that is, assuming that the estimated probability of the event A_i (based on β̂^(i-1), the estimate
from the first i-1 observations) is in fact the sampling model. As a result, after subtracting the expectation under the predictive model, the scoring functions considered fit the following expression:

    S̄(N) = (1/N) Σ_{i=1}^{N} (A_i - p_i) g(p_i),   where g(·) is an arbitrary function and p_i = P_{β̂^(i-1)}(A_i = 1).
Examples: g(p_i) = 1 leads to the normalized overall calibration score, g(p_i) = 1 - 2p_i to the normalized Brier score, and g(p_i) = logit p_i to the normalized logarithmic score.
Limit theorems being unaffected by early realizations, the results are stated for N, rather than N-n,
scores.
Theorem:
Let g denote a bounded function of the probability forecast and assume that its first derivative is also
bounded. Then

    Σ_{i=1}^{N} (A_i - p_i) g(p_i) / { Σ_{i=1}^{N} p_i (1 - p_i) (g(p_i))² }^{1/2}  →_D  N(0,1)   under P_β as N → ∞.
The proof is given in the Appendix. For the logarithmic score, though g is not bounded, this result still
holds if one assumes either that the model does not generate categorical forecasts or that, if it does,
these are correct. This is certainly the case under the null hypothesis that the assumed model is generating the observations.
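The theorem translates directly into a computable Z-statistic. The sketch below (illustrative names, numpy assumed) standardizes a raw score for any choice of g and specializes it to the three scores named above:

```python
import numpy as np

def z_statistic(y, p, g):
    """(sum (A_i - p_i) g(p_i)) / sqrt(sum p_i (1-p_i) g(p_i)^2);
    approximately N(0,1) under the forecasting model."""
    gp = g(p)
    return ((y - p) * gp).sum() / np.sqrt((p * (1 - p) * gp ** 2).sum())

# The three normalized scores of the text:
z_oc = lambda y, p: z_statistic(y, p, np.ones_like)                  # overall calibration
z_bs = lambda y, p: z_statistic(y, p, lambda q: 1 - 2 * q)           # Brier
z_lo = lambda y, p: z_statistic(y, p, lambda q: np.log(q / (1 - q))) # logarithmic
```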
The next result can be applied when one divides the original sequence into a number of subsequences in order to investigate the predictive performance of the model in more depth.
Corollary 1:
Consider a partition into K subsequences {E_k, k = 1, ..., K} and let d(A) take the value 1 if A
occurs and 0 otherwise. Then

    Σ_{k=1}^{K} [ { Σ_{i=1}^{N} d(A_i ∈ E_k) (A_i - p_i) g(p_i) }² / Σ_{i=1}^{N} d(A_i ∈ E_k) p_i (1 - p_i) (g(p_i))² ]  →_D  χ²_K   under P_β as N → ∞.

This follows from applying the above theorem to each subinterval and invoking a multivariate martingale central limit theorem (Aalen, 1977).
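As a sketch of how Corollary 1 might be applied with the Brier choice g(p) = 1 - 2p and subsequences defined by probability subintervals (all names illustrative, not the paper's):

```python
import numpy as np

def chi_square_diagnostic(y, p, K=11):
    """Sum over K probability-subinterval subsequences of
    (grouped score)^2 / (grouped variance); approximately chi-square
    with K degrees of freedom under the forecasting model."""
    g = 1 - 2 * p                                   # Brier-score weight
    bins = np.minimum((p * K).astype(int), K - 1)   # subinterval of each forecast
    stat = 0.0
    for k in range(K):
        m = bins == k
        if m.any():
            stat += ((y[m] - p[m]) * g[m]).sum() ** 2 / \
                    (p[m] * (1 - p[m]) * g[m] ** 2).sum()
    return stat
```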
For discrete covariates taking a fixed number of possible values, the usual limit distribution for the
deviance obtains for a predictive version of this statistic. Let m_j refer to the number of outcomes with
covariate vector x_j and let J be the number of distinct covariate vectors. The index i runs from 1 to N, the
total number of binary events. Here the convergence is considered under the condition that each m_j
goes to infinity.
Predictive Deviance
Corollary 2:
Letting A_j = Σ_i A_i Δ(x_i = x_j),

    2 Σ_j { A_j log(A_j / Σ_i p_i Δ(x_i = x_j)) + (m_j - A_j) log((m_j - A_j) / (m_j - Σ_i p_i Δ(x_i = x_j))) }  →_D  χ²_{J-k}

under P_β as m_j → ∞ for all j.
An outline of the proof is given in the Appendix. If m_j = 1 for all j, McCullagh's (1986) distributional
results for the deviance apply in this setting: the predictive deviance is degenerate conditionally
on the sufficient statistic and thus provides no information regarding the predictive performance of the
model.
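A sketch of the grouped computation in Corollary 2 (illustrative names; x_index labels each event's covariate pattern, and the boundary cases A_j = 0 and A_j = m_j are handled by dropping the vanishing terms):

```python
import numpy as np

def predictive_deviance(y, p, x_index):
    """2 * sum_j { A_j log(A_j/E_j) + (m_j - A_j) log((m_j - A_j)/(m_j - E_j)) }
    with A_j, E_j, m_j the observed, expected and total counts in pattern j."""
    dev = 0.0
    for j in np.unique(x_index):
        sel = x_index == j
        a, e, m = y[sel].sum(), p[sel].sum(), sel.sum()
        if a > 0:
            dev += a * np.log(a / e)
        if a < m:
            dev += (m - a) * np.log((m - a) / (m - e))
    return 2 * dev   # approximately chi-square with J - k degrees of freedom
```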
The above theorem is easily generalized to the case where A_i ~ Binomial(n_i, p_i(β)) with
i ∈ {1, ..., N}. Let p_ik = P_β̂(A_i = k) and A_ik = Δ(A_i = k) with k ∈ {1, ..., n_i}. Then

    S(A_i, P_i) = Σ_{k=1}^{n_i} S(A_ik, p_ik).
Corollary 3:
Let E and Var denote the expectation and variance under the forecast distribution. Then

    { Σ_{i=1}^{N} S(A_i, P_i) - Σ_{i=1}^{N} Σ_{k=1}^{n_i} E(S(A_ik, p_ik)) } / { Σ_{i=1}^{N} Var(S(A_i, P_i)) }^{1/2}  →_D  N(0,1)

under P_β as N → ∞, with the n_i fixed.

Examples:
For the Brier score,

    E(S(A_i, P_i)) = Σ_{k=1}^{n_i} p_ik (1 - p_ik)

and

    Var(S(A_i, P_i)) = Σ_{k=1}^{n_i} p_ik (1 - p_ik) (1 - 2p_ik)² - Σ_{k,m=1; k≠m}^{n_i} p_ik p_im (1 - 2p_ik)(1 - 2p_im).
As for the logarithmic rule,

    S(A_i, P_i) = Σ_{k=1}^{n_i} A_ik log p_ik   and   E(S(A_i, P_i)) = Σ_{k=1}^{n_i} p_ik log p_ik,

    Var(S(A_i, P_i)) = Σ_{k=1}^{n_i} p_ik (1 - p_ik) (log p_ik)² - Σ_{k,m=1; k≠m}^{n_i} p_im p_ik log p_im log p_ik.
This approach is unsuitable for the overall calibration score since then S(A_i, P_i) = 0. One can, however, compare actual outcomes with expected numbers under the predictive distribution:
Corollary 4:

    Σ_{i=1}^{N} (A_i - n_i p_i) / { Σ_{i=1}^{N} n_i p_i (1 - p_i) }^{1/2}  →_D  N(0,1)   under P_β as N → ∞.

The conditional distribution of the predictive deviance is described in McCullagh (1986).
The next sections look into the behaviour of these diagnostics, with regard to rates of convergence and power, on simulated data.
4. Empirical Rate of Convergence
In the first instance, we investigate the rate of convergence of the limit theorem presented in the
preceding section when the predictions are computed from the actual model. The latter contains 4
parameters:

    logit p_i(β) = 1.96 - 0.35 X1 + 0.25 X2 - 2.58 X3.

X1 has 4 categories. Both X2 and X3 are continuous. The A_i's are binary random variables.
Table 1 gives Kolmogorov-Smirnov test statistics calculated from 1000 scores. CA and BI denote
the χ²-statistics based, respectively, on the overall calibration and the Brier scores. The probability
range was divided into 11 subintervals, which entails that the asymptotic distribution is χ²_11. Clearly,
under the true model, the empirical distributions of OC, BS and LO do not depart significantly from
that of a standard normal variable with as few as 75 data points, while the distributional convergence
of CA and BI requires at least some 200 observations.
Sample Size   OC        BS        LO           CA            BI
50            0.03271   0.03585   0.04871 **   0.10100 ***   0.10539 ***
75            0.03209   0.03086   0.04293 *    0.08689 ***   0.07626 ***
100           0.01554   0.02617   0.03297      0.06621 ***   0.06140 ***
125           0.02546   0.03192   0.03180      0.05735 ***   0.05539 ***
150           0.02221   0.02046   0.02142      0.04864 **    0.05640 ***
175           0.01209   0.03085   0.03953 *    0.04336 **    0.04477 **
200           0.02595   0.02010   0.02973      0.03260       0.02934

Table 1: Kolmogorov-Smirnov test statistics for scores based on the true model (1000 simulations).
Significance levels: * = 90%, ** = 95%, *** = 99%.
The reliability of the tests revolves around the percentiles of the score distribution being close to
nominal ones. Table 2 gives the numbers of scores which, under this scenario, fall below the
0.5, 1, 2.5 and 5 percentiles and above the 95, 97.5, 99 and 99.5 percentiles of the normal distribution.
The results are again based on 1000 simulations. Table 3 does the same for the aggregate scores. The asterisks indicate whether the entries are between 1 and 2 (*), between 2 and 3 (**) or further than 3 standard deviations (***) away from the expected numbers. It appears that, for sample sizes greater than 75,
the percentiles of OC approximate well those of the standard normal distribution, while for BS and LO
only the 95 and 97.5 percentiles are reliably estimated. Clearly, when they fail to do so, their overall
tendency is to err on the conservative side. Regarding CA and BI, for sample sizes of 125 and above,
the numbers in the tail are as expected. Thus, if one is checking a fully specified model, all these test
statistics can be computed on an original sequence of fairly moderate size.
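The comparisons reported in these tables can be replicated in a few lines. The sketch below (illustrative name, scipy assumed) measures the Kolmogorov-Smirnov distance between a batch of simulated standardized scores and the standard normal:

```python
import numpy as np
from scipy.stats import kstest, norm

def ks_distance(z_scores):
    """One standardized score per simulated sequence, as in Tables 1-4."""
    return kstest(np.asarray(z_scores), norm.cdf).statistic
```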
Score  Sample Size   0.5     1       2.5     5       95      97.5    99      99.5
OC     50            6       13      33 *    36 **   52      9 ***   1 **    0 **
       75            8 *     11      27      42 *    53      21      12      4
       100           6       10      30 *    44      57 *    24      6 *     2 *
       125           7       15 *    26      50      50      23      4 *     1 *
       150           7       15 *    23      47      49      16 *    7       4
       175           8 *     13      25      47      47      21      9       7
       200           4       11      31 *    48      50      25      10      6
BS     50            0 **    1 **    6       27      58 *    34      18 **   8
       75            0 **    7       10      37      52      28      15 *    11
       100           1 *     5 *     13      40      46      23      16 *    8
       125           0 **    6 *     17      36      48      27      18 **   11
       150           0 **    5 *     16      40      49      31      15 *    10
       175           0 **    2 **    14      46      55      33      11      5
       200           1 *     4 *     18      36      54      33      19 **   9
LO     50            0 **    0 ***   2       15 ***  58      38 **   17 **   8
       75            0 **    2 **    10      36 *    58      28      17 **   11
       100           1 *     3 **    11      36 *    45      30      19 **   12
       125           0 **    4 *     16      34 **   50      29      17 **   9
       150           0 **    3 **    15      38 *    44      26      13      10
       175           0 **    1 **    13      37 *    61      29      14      7
       200           1 *     3 **    14      31 **   65      34 *    16      10

Table 2: Number of scores, based on true model, below 0.5, 1, 2.5 and 5 percentiles and above 95,
97.5, 99 and 99.5 percentiles of normal distribution (1000 simulations)
                     CA                     BI
Sample Size    95         99          95          99
50             41 *       19 **       40 *        20 ***
75             66 **      25 ***      72 ***      25 ***
100            55         20 ***      61 **       18 **
125            51         9           47          11
150            50         10          52          13
175            48         17 **       48          11
200            59 *       12          50          16

Table 3: Number of scores, based on true model, above 95 and 99 percentiles
of asymptotic distribution χ²_11 (1000 simulations)
Now the parameters are estimated from an increasing training sample. In the tables below, the
training size refers to the size of the first training set. Table 4 displays Kolmogorov-Smirnov statistics
calculated from 1,000 overall calibration and Brier scores for various sample and training set sizes. For
OC, distributional convergence is reached at moderate sample sizes: the two significant statistics (sample size of 200 with training sizes of 50 and 75) may well be flukes as both smaller and larger sample
sizes, for similar training sets, did not produce significant results. For BS and LO, on the other hand, none
of the size combinations produced a non-significant statistic.
OC
Sample \ Training   50      75      100     125     150     200     250     275     300
100                 0.041   0.029   –       –       –       –       –       –       –
150                 –       0.034   0.045   –       –       –       –       –       –
200                 0.031   0.029   0.024   0.023   0.046   –       –       –       –
250                 0.028   0.020   0.025   0.035   0.023   0.015   –       –       –
300                 –       –       –       –       0.026   0.018   0.034   –       –
350                 –       –       –       –       –       –       0.026   0.030   0.015

BS
Sample \ Training   50      75      100     125     150     200     250     275     300
100                 0.313   0.169   –       –       –       –       –       –       –
150                 –       0.220   0.267   –       –       –       –       –       –
200                 0.241   0.148   0.217   0.200   0.333   –       –       –       –
250                 0.300   0.134   0.174   0.157   0.111   0.141   –       –       –
300                 –       –       –       –       0.074   0.089   0.041   –       –
350                 –       –       –       –       –       –       0.120   0.078   0.055

Table 4: Kolmogorov-Smirnov statistics for overall calibration and Brier scores based on
true model when one estimates parameters (1000 simulations)
If one applies a somewhat less stringent criterion and considers the number of scores, out of 1,000,
falling in the tails of the normal distribution, the outcome for OC and BS is shown in tables 5, 6 and
7. The summary statistics for aggregate scores, based on 11 subintervals, appear in table 8.

For OC, numbers are acceptable for training sizes of 75 and above and sample sizes of 150 and above.
The outcome of the simulations suggests using the majority of the observations to obtain accurate
parameter estimates and setting aside no fewer than 75 data points for evaluation. The summary statistics
indeed tend to move towards the critical region when the size of the training set is such that fewer
than 75 points are left for assessment. These numbers will, of course, depend on the number of
parameters in the underlying model.
For BS, the score distribution is highly asymmetric. For all sample and training sizes selected, the
number of test statistics falling below or above the considered percentiles differed from the expected
values by more than 3 standard deviations. However, when one looks at the numbers of scores outside
central intervals with 90%, 95%, 98% and 99% nominal coverage (table 7), the picture changes. For
the 95% and 98% intervals, acceptable numbers are attained with a sample of 250 observations, 175
of which are used as training set, and a sample of 300 observations, 200 of which are set aside for
estimation. For the logarithmic score, similar results were obtained. As expected, the aggregate scores
yield very conservative tests: the frequencies obtained by simulation grossly overestimate tail areas of
the normal distribution.
Note that, in these tables, the entries tend to decrease as the training size increases, which seems to
point to the long-lasting effect of earlier unreliable predictions. On the other hand, for fixed training
size, the entries, apart from a couple of exceptions, decrease as the sample size becomes large. This
reflects the central limit property of the statistics.
Sample Size  Training Size   0.5      1        2.5     5        95       97.5     99       99.5
100          50              8 *      21 ***   44 ***  71 ***   65 **    31 *     8        3
             75              11 **    20 ***   43 ***  73 ***   58 **    35 **    11       6
150          75              11 **    13 *     34 *    64 **    72 ***   39 **    19 **    7
             100             3        17 **    32 *    53       73 ***   35 **    12       7
200          50              6        11       28      56       68       40       19 **    8
             75              9 *      16 *     28      50       63       32       15 *     5
             100             9 *      14 *     30 *    57 *     62       28       11       8
             125             7        14 *     31 *    57 *     43       30       10       5
             150             11 **    12       30 *    52       43       31       9        7
250          50              2 *      11       27      65       59 *     29       10       6
             75              5        10       24      57       60 *     28       10       2 *
             100             3        8        28      56       53       26       7        4
             125             6        10       24      52       48       26       11       6
             150             6        10       32      54       48       27       13       7
             175             8 *      18       34      58       47       26       13       7
             200             4        8        28      65       58 *     29       8        5

Table 5: Numbers of overall calibration scores, based on true model, below 0.5, 1, 2.5 and 5 percentiles and above
95, 97.5, 99 and 99.5 percentiles of normal distribution, when one estimates parameters (1000 simulations)
Sample  Training Set   0.5     1       2.5      5        95        97.5      99        99.5
100     50             0 **    0 ***   1 ***    4 ***    251 ***   175 ***   125 ***   93 ***
        75             0 **    0 ***   1 ***    9 ***    175 ***   126 ***   83 ***    57 ***
150     75             0 **    1 **    2 ***    10 ***   136 ***   114 ***   54 ***    45 ***
        100            0 **    1 **    4 ***    7 ***    183 ***   89 ***    69 ***    33 ***
200     50             1 *     2 **    4 ***    8 ***    265 ***   196 ***   118 ***   82 ***
        75             1 *     2 **    5 ***    8 ***    199 ***   142 ***   83 ***    56 ***
        100            1 *     1 **    3 ***    9 ***    162 ***   111 ***   61 ***    34 ***
        125            1 *     1 **    3 ***    10 ***   139 ***   95 ***    60 ***    41 ***
        150            0 **    0 ***   8 ***    20 ***   115 ***   65 ***    43 ***    28 ***
250     50             0 **    0 ***   2 ***    3 ***    265 ***   180 ***   119 ***   88 ***
        75             0 **    0 ***   3 ***    8 ***    201 ***   146 ***   80 ***    52 ***
        100            0 **    0 ***   2 ***    6 ***    163 ***   108 ***   54 ***    29 ***
        125            1 *     1 **    1 ***    11 ***   140 ***   84 ***    45 ***    29 ***
        150            0 **    0 ***   4 ***    10 ***   118 ***   67 ***    41 ***    25 ***
        175            1 *     1 **    2 ***    18 ***   104 ***   68 ***    36 ***    25 ***
        200            0 **    0 ***   3 ***    13 ***   105 ***   64 ***    35 ***    18 ***
300     125            0 **    0 ***   4 ***    18 ***   125 ***   77 ***    35 ***    21 ***
        175            0 **    0 ***   3 ***    20 ***   108 ***   70 ***    39 ***    26 ***
        200            0 **    2 **    6 ***    19 ***   109 ***   63 ***    35 ***    24 ***
        225            0 **    1 **    10 ***   27 ***   87 ***    54 ***    27 ***    21 ***
        250            0 **    0 ***   7 ***    28 ***   76 ***    43 ***    21 ***    14 ***

Table 6: Numbers of scores, based on true model, below 0.5, 1, 2.5 and 5 percentiles and above 95,
97.5, 99 and 99.5 percentiles of normal distribution, when one estimates parameters (1000 simulations)
Sample Size  Training Size   99%       98%       95%       90%
100          50              93 ***    125 ***   176 ***   255 ***
             75              57 ***    83 ***    127 ***   184 ***
150          75              45 ***    56 ***    116 ***   146 ***
             100             33 ***    70 ***    93 ***    190 ***
200          50              83 ***    120 ***   200 ***   273 ***
             75              57 ***    85 ***    147 ***   207 ***
             100             35 ***    62 ***    114 ***   171 ***
             125             42 ***    61 ***    98 ***    149 ***
             150             28 ***    43 ***    72 ***    135 ***
250          50              88 ***    119 ***   182 ***   268 ***
             75              52 ***    80 ***    149 ***   209 ***
             100             29 ***    54 ***    110 ***   169 ***
             125             30 ***    46 ***    85 ***    151 ***
             150             25 ***    41 ***    71 ***    128 **
             175             26 ***    37 ***    70 **     122 **
             200             18 **     35 ***    67 **     118 *
300          125             21 ***    35 ***    81 ***    143 ***
             175             26 ***    39 ***    73 ***    128 **
             200             24 ***    37 ***    69 **     128 **
             225             21 ***    28 *      64 **     114 *
             250             14 *      21        50        104

Table 7: Numbers of Brier scores, based on true model, in the middle 90%, 95%, 98% and 99%
of normal distribution, when one estimates parameters (1000 simulations)
                              CA                      BI
Sample Size  Training Size   90     95     99       90     95     99
200          50              302    224    123      303    229    130
             75              245    196    86       262    191    90
             100             213    146    73       222    153    70
             125             187    133    67       191    130    68
             150             146    100    56       143    99     56
250          50              277    185    106      290    207    113
             75              239    150    70       243    164    71
             100             215    149    61       225    139    57
             125             181    136    58       183    133    61
             150             170    117    48       171    118    46
             175             167    113    54       163    117    49
             200             147    101    52       148    102    56
300          125             165    106    42       170    102    42
             175             172    116    40       162    106    43
             200             171    106    41       158    106    42
             225             152    100    46       159    105    49
             250             133    92     41       129    95     39

Table 8: Numbers of aggregate overall calibration and Brier scores (11 subintervals),
based on true model, above 90, 95 and 99 percentiles of chi-square distribution,
when one estimates parameters (1000 simulations)
5. Bootstrap Tests

These tests can be extended further using a predictive bootstrap approach. Such an approach would
make the reference distribution more relevant to the data at hand when the sample size does not guarantee that normality holds. The simulation results from the preceding section indeed show that the convergence is somewhat slow. A description of possible bootstrap approaches follows.

The first two methods mimic the evaluation procedure. The parameter vector is first estimated
from A^(n) = (A_1, ..., A_n), which yields β̂^(n). The probability that the next event A_{n+1} occurs is then computed using β̂^(n)
and the covariates associated with A_{n+1}. This probability generates a bootstrap observation A*_{n+1}. It is this
realization that enters the scoring function. For the first procedure, this whole process is repeated, in
turn, on A^(n+1), A^(n+2), ..., A^(N-1). For the second procedure, the bootstrap observations become part of the
training set: for event i, the bootstrap probability distribution is based on {A^(n), A*_{n+1}, ..., A*_{i-1}}. The
second approach is therefore more likely to yield a bootstrap score distribution with large spread.
The last three procedures involve generating realizations from the model fitted on the full data set.
The third procedure is a replica of the original evaluation process but now on the bootstrap outcomes.
The last two procedures are similar to the first two above with the A_i's replaced by data generated from the
fitted model.
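For the first procedure, the following is a minimal sketch (hypothetical names) of the bootstrap reference distribution: the one-step-ahead forecasts produced during the actual evaluation are reused, a bootstrap outcome is drawn from each, and the chosen score is recomputed on every bootstrap sequence.

```python
import numpy as np

def bootstrap_reference(forecasts, score_fn, n_boot=100, seed=None):
    """Score distribution under procedure 1: the forecasts stay fixed (they
    are built from the actual observations); only the outcomes are resampled."""
    rng = np.random.default_rng(seed)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        y_star = rng.binomial(1, forecasts)   # A*_{n+1}, ..., A*_N
        scores[b] = score_fn(y_star, forecasts)
    return scores   # the observed score is then referred to these percentiles
```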
Table 9 displays the numbers of overall calibration scores, out of 200 simulations based on the
actual model, which fall below the 1st and 5th and above the 95th and 99th percentiles of the bootstrap distribution. The numbers outside the middle 90% and 98% of this distribution are also shown. Table 10 does the same
for the Brier and logarithmic scores. For each iteration, 100 bootstrap samples were generated. Results for 200 bootstrap samples were not substantially different. Evidently, procedures 1 and
3 performed best, the other three being overly conservative. For the overall calibration score, contrasting tables 5 and 9 reveals that a substantial improvement obtains with method 1 and particularly with
method 3. For the Brier score (cf. table 7), method 1 yields coverage probabilities somewhat closer to
the nominal ones. Method 3, on the other hand, achieves the nominal levels. For other combinations
of sample and training sizes (data not shown), the overall features remain as these sizes increase. Also,
as already observed, LO yields tail frequencies slightly worse than BS does.
Method  Sample  Training Set   1       5       95      99      90      98
1       75      50             4 *     6 *     7       4 *     13 *    8 *
        100     50             2       8       8       4 *     16      6
        100     75             5 **    6 *     14 *    2       20      7 *
2       75      50             2       9       8       3       17      5
        100     50             0 *     10      12      4 *     22      4
        100     75             0 *     8       –       –       –       –
3       75      50             1       4 *     –       –       –       –
        100     50             3       9       –       –       –       –
4       75      50             8 ***   11      20 ***  2       31 **   10 **
        100     75             8 ***   10      19 **   3       29 **   11 ***
5       75      50             5 **    17 **   14 *    7 ***   31 **   12 ***
        100     75             6 **    6 *     19 **   5 **    25 *    11 ***

Table 9: Numbers of overall calibration scores, based on true model, below 1 and 5, above 95 and 99
percentiles, and outside the middle 90% and 98% of the bootstrap distribution (200 simulations)
Score  Method  Sample  Training Set   1      5       95       99       90       98
BS     1       75      50             2      4 *     40 ***   11 ***   44 ***   13 ***
               100     50             0 *    2 **    45 ***   19 ***   47 ***   19 ***
               100     75             0 *    8       28 ***   7 ***    36 ***   7 *
       2       75      50             2      4 *     40 ***   15 ***   44 ***   17 ***
               100     50             0 *    2 **    46 ***   18 ***   48 ***   18 ***
               100     75             0 *    8       32 ***   8 ***    40 ***   8 *
       3       75      50             4 *    13      7        0 *      20       4
               100     50             4 *    12      10       3        22       7 *
       4       75      50             3      9       30 ***   11 ***   39 ***   14 ***
               100     75             4 *    8       32 ***   10 ***   40 ***   14 ***
       5       75      50             2      7       44 ***   15 ***   51 ***   17 ***
               100     75             4 *    11      33 ***   12 ***   44 ***   16 ***
LO     1       75      50             2      5 *     47 ***   20 ***   52 ***   22 ***
               100     50             0 *    2 **    56 ***   27 ***   58 ***   27 ***
               100     75             0 *    8       34 ***   14 ***   42 ***   14 ***
       2       75      50             2      4 *     47 ***   20 ***   51 ***   22 ***
               100     50             0 *    2 **    62 ***   29 ***   64 ***   29 ***
               100     75             0 *    9       36 ***   9 ***    45 ***   9 **
       3       75      50             4 *    12      7        1        19       5
               100     50             3      13      9        0 *      21       3
       4       75      50             3      10      34 ***   14 ***   44 ***   17 ***
               100     75             4 *    9       37 ***   10 ***   46 ***   14 ***
       5       75      50             2      8       41 ***   18 ***   49 ***   20 ***
               100     75             5 **   13      32 ***   16 ***   45 ***   21 ***

Table 10: Numbers of Brier and logarithmic scores, based on true model, below 1 and 5, above
95 and 99 percentiles, and outside the middle 90% and 98% of the bootstrap distribution (200 simulations)
6. Power Studies
In order to investigate the behaviour of these tests when the observations and the predictions are
generated from different models, three types of departure from the underlying model are considered:
ignoring one of the covariates, adding a redundant variable and substituting one of the covariates with
a correlated one.

Table 11 gives the number of scores in the tails of the standard normal distribution if one were to
use the coefficients of the actual model in computing the forecast probabilities. It therefore shows the
best results attainable and hence the limitations of scoring-rule based tests. Numbers in parentheses
refer to the correlation between X3 and X5. Both are normally distributed covariables. X4 is a binary
variable.
Even under these ideal circumstances the overall calibration score has almost no power to distinguish between correlated explanatory variables. On the other hand, the Brier score is sensitive to this
type of model misspecification. Both scores are able to detect, with high probability, the omission of an
explanatory variable. Neither could reliably identify redundant variables. Again, the results for the logarithmic score follow closely those of the Brier score. The performance of the overall calibration score
is explained by its evaluation of average properties of the model rather than of the forecasts on an
individual basis (as is the focus of the Brier score).
Model          Sample Size  Score   0.5    2.5    97.5    99.5
-X3            75           OC      256    473    1       0
                            BR      0      0      596     414
               150          OC      503    710    0       0
                            BR      0      0      854     722
+ .15 X4       75           OC      11     45     14      2
                            BR      0      14     35      15
               150          OC      15     46     10      3
                            BR      0      17     36      13
+ .55 X4       75           OC      43     117    3       1
                            BR      0      10     65      25
               150          OC      72     194    2       0
                            BR      0      4      74      30
+ .95 X4       75           OC      106    277    1       1
                            BR      0      1      143     65
               150          OC      235    421    0       0
                            BR      0      2      205     88
-X3+X5 (.25)   75           OC      57     99     119     53
                            BR      0      0      1000    998
               150          OC      44     90     97      36
                            BR      0      0      995     980
-X3+X5 (.45)   75           OC      31     80     73      27
                            BR      0      0      922     846
               150          OC      32     66     69      25
                            BR      0      0      996     985
-X3+X5 (.65)   75           OC      18     49     43      12
                            BR      0      0      457     292
               150          OC      22     51     48      16
                            BR      0      0      728     547
-X3+X5 (.85)   75           OC      10     40     29      6
                            BR      0      3      125     44
               150          OC      15     33     29      9
                            BR      0      2      175     59

Table 11: Number of scores, based on models differing from true one (X1+X2+X3), below 0.5 and 2.5
percentiles and above 97.5 and 99.5 percentiles of normal distribution (1000 simulations)
When one estimates parameters, the power of these scores is drastically reduced, as evidenced by
table 12. The entries in this table give the numbers of scores, resulting from 1,000 simulations, above
and below percentiles of the normal distribution. The total sample size used is 150 and the training
size 75. Again, the overall calibration score exhibits the least power. The logarithmic score is actually
the most likely, of these three rules, to detect departures from the actual model. By contrast with the
previous scenario, the scores have highest power against the inclusion of a redundant covariate.
With bootstrap reference distributions, as generated by procedure 1, the chance of picking up
model misspecifications, of the type investigated here, is higher than if one uses the asymptotic distribution. The entries of table 13 are based on 200 simulations and, for each of these, 100 bootstrap samples. The total sample size is 100 and the training size 50. Procedure 3 lacks ability to select the
correct model. This is indeed expected as it makes use of bootstrap observations generated from a
model estimated on the whole sample.
Model          Score   0.5    2.5    97.5    99.5
-X3            OC      7      30     25      7
               BR      0      2      110     38
               LO      0      2      116     39
+X4            OC      4      29     35      9
               BR      0      3      187     93
               LO      0      3      220     131
-X3+X5 (.25)   OC      2      28     31      5
               BR      0      2      181     64
               LO      0      2      192     68
-X3+X5 (.45)   OC      2      34     28      4
               BR      0      1      151     56
               LO      0      1      172     64
-X3+X5 (.65)   OC      3      31     25      7
               BR      0      3      125     54
               LO      0      3      144     66
-X3+X5 (.85)   OC      3      38     24      7
               BR      0      4      139     56
               LO      0      4      157     60

Table 12: Numbers of scores above and below percentiles of the normal distribution when
the model is different from the actual one (1000 simulations)
Model          Method  Score   1      5      95     99
-X3            1       OC      3      8      10     4
                       BR      0      0      54     17
                       LO      0      0      56     17
               3       OC      1      8      9      4
                       BR      3      9      7      3
                       LO      3      9      7      3
+X4            1       OC      2      9      8      5
                       BR      1      2      62     30
                       LO      1      2      77     38
               3       OC      3      11     11     4
                       BR      3      8      7      2
                       LO      3      10     4      2
-X3+X5 (.25)   1       OC      3      8      11     3
                       BR      0      1      65     23
                       LO      0      1      74     28
               3       OC      2      11     13     4
                       BR      1      4      12     5
                       LO      1      6      9      3
-X3+X5 (.45)   1       OC      4      9      9      3
                       BR      1      1      54     27
                       LO      1      2      60     27
               3       OC      1      10     8      6
                       BR      1      10     10     4
                       LO      1      8      10     3

Table 13: Numbers of scores above and below percentiles of the bootstrap distribution when
the model is different from the actual one (200 simulations)
7. Conclusion
The sequential test statistics for predictive performance, considered here, were shown to converge
to their expected distribution. Simulation results demonstrate that this convergence is relatively slow,
which leads to conservative tests. This can be remedied through bootstrap procedures. From these
simulations, it also transpires that, for reasons of reliability, these tests should be employed in their
two-sided version.
The power of these tests is drastically reduced when estimating parameters and using the asymptotic reference distribution. The bootstrap procedures actually improve power. Adding a redundant
variable is the type of departure most often picked up when the model parameters need to be
estimated. Removal of a covariate and replacement by a correlated one seem difficult to detect.
Of the three scoring rules studied here, the overall calibration score exhibits almost no power
though it converges most quickly to the theoretical asymptotic distribution. The logarithmic score, with
convergence similar in rate to that of the Brier score, is the most powerful.
Of the bootstrap approaches considered, the procedure that is most similar to the evaluation process and calls on the actual observations as training sample is to be preferred. Procedures based on
bootstrap samples generated from distributions estimated from the whole sample lack the predictive
appeal.
In sum, the statistics described here test different aspects of model-based predictions. The predictive deviance allows the comparison of nested models as in the usual goodness-of-fit setting. For
models involving a moderate number of parameters, as employed in the simulation work presented in
sections 4, 5 and 6, practical guidelines would state that the overall calibration score can be reliably
referred to the standard normal distribution with 150 observations, 50 of which are left aside for
model evaluation, while the reference distribution for the other statistics should be obtained by
bootstrap procedures.
References

Aalen, O.O. (1977) Weak convergence of stochastic integrals related to counting processes. Zeitschrift
für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 38, 261-77.

Akaike, H. (1974) A new look at statistical model identification. I.E.E.E. Transactions on Automatic
Control, AC-19, 716-23.

Dawid, A.P. (1984) Statistical theory: The prequential approach (with discussion). Journal of the Royal
Statistical Society, A, 147, 278-92.

Dawid, A.P. (1986) Probability forecasting. Encyclopedia of Statistical Sciences, vol. 7, edited by S.
Kotz, N.L. Johnson & C.B. Read. Wiley-Interscience, 210-218.

DeGroot, M.H. & Fienberg, S.E. (1983) The comparison and evaluation of forecasters. The Statistician, 32, 14-22.

Geisser, S. (1975) The predictive sample reuse method with applications. Journal of the American Statistical Association, 70, 320-8.

Geisser, S. & Eddy, W.F. (1979) A predictive approach to model selection. Journal of the American
Statistical Association, 74, 153-60.

Guttman, I. (1967) The use of the concept of a future observation in goodness-of-fit problems. Journal
of the Royal Statistical Society, B, 29, 83-100.

McCullagh, P. (1986) The conditional distribution of goodness-of-fit statistics for discrete data. Journal
of the American Statistical Association, 81, 104-7.

McCullagh, P. & Nelder, J.A. (1989) Generalized Linear Models, Second Edition. Chapman and Hall,
London.

Murphy, A.H. & Epstein, E.S. (1967a) Verification of probabilistic predictions: A brief review. Journal
of Applied Meteorology, 6, 748-55.

Murphy, A.H. & Epstein, E.S. (1967b) A note on probability forecasts and "hedging". Journal of
Applied Meteorology, 6, 1002-4.

Murphy, A.H. & Winkler, R.L. (1970) Scoring rules in probability assessment and evaluation. Acta
Psychologica, 34, 273-86.

Schervish, M.J. (1989) A general method for comparing probability assessors. The Annals of Statistics,
17, 1856-79.

Seillier-Moiseiwitsch, F. & Dawid, A.P. (1993) On testing the validity of probability forecasts. To
appear in the Journal of the American Statistical Association.

Seillier-Moiseiwitsch, F., Sweeting, T.J. & Dawid, A.P. (1992) Prequential tests of model fit. To
appear in the Scandinavian Journal of Statistics.

Stone, M. (1974) Cross-validatory choice and assessment of statistical models (with discussion). Journal of the Royal Statistical Society, B, 36, 111-47.

Winkler, R.L. & Murphy, A.H. (1968) "Good" probability assessors. Journal of Applied Meteorology, 7,
751-8.
Appendix
Proof of Theorem:

Let p_i(β) be the probability that A_i occurs under the actual model, i.e. p_i(β) = e^{β^T x_i} / (1 + e^{β^T x_i}). Denote by ' differentiation with respect to β and by H(f(β)) the Hessian of the function f(·) with respect to β. Let
L_n(β) be the loglikelihood for the first n instances, i.e.

    L_n(β) = Σ_{i=1}^{n} { A_i log p_i(β) + (1-A_i) log(1 - p_i(β)) } = Σ_{i=1}^{n} { A_i β^T x_i + log(1 - p_i(β)) }.

Let X_n = (x_1 x_2 ... x_n)^T; then p_i'(β) = p_i(β)(1 - p_i(β)) x_i and b_n(β) ≡ L_n'(β) = X_n^T (A_1 - p_1, ..., A_n - p_n)^T.
Let V_n = diag(p_1(β)(1 - p_1(β)), ..., p_n(β)(1 - p_n(β))) and a_n = -L_n''(β) = X_n^T V_n X_n.

Consider the following Taylor expansion:

    p_i(β̂_{i-1}) - p_i(β) = (β̂_{i-1} - β)^T p_i'(β) + O_p((i-1)^{-1})
                          = { b_{i-1}(β)^T + ½ ((β̂_{i-1}-β)^T H¹_{i-1} (β̂_{i-1}-β), ..., (β̂_{i-1}-β)^T H^k_{i-1} (β̂_{i-1}-β)) } a_{i-1}^{-1} p_i'(β) + O_p((i-1)^{-1}),

where H^j_{i-1} = H(∂L_{i-1}(β)/∂β_j) is evaluated at a point β*_{i-1} whose entries fall between the corresponding ones of β and β̂_{i-1}. Let

    R_i = ½ ((β̂_{i-1}-β)^T H¹_{i-1} (β̂_{i-1}-β), ..., (β̂_{i-1}-β)^T H^k_{i-1} (β̂_{i-1}-β)) a_{i-1}^{-1} p_i'(β) - ½ (β̂_{i-1}-β)^T p_i''(β†_{i-1}) (β̂_{i-1}-β),

where p_i''(β) = H(p_i(β)) and the entries of β†_{i-1} fall between the corresponding ones of β and β̂_{i-1}. Let p_i*
be between p_i and p_i(β) and let g_i* stand for the first derivative of g(·) evaluated at p_i*. Then
    Σ_{i=1}^{N} (A_i - p_i) g(p_i)
      = Σ_{i=1}^{N} (A_i - p_i(β)) g(p_i(β)) + Σ_{i=2}^{N} b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) g_i* (A_i - p_i(β)) + (A_1 - p_1) g(p_1) - (A_1 - p_1(β)) g(p_1(β)) + Σ_{i=2}^{N} R_i g_i* (A_i - p_i(β))
        - Σ_{i=2}^{N} g(p_i(β)) b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) - Σ_{i=2}^{N} p_i²(β)(1 - p_i(β))² (b_{i-1}(β)^T a_{i-1}^{-1} x_i)² g'(p_i*) - Σ_{i=2}^{N} b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) R_i g_i* + Σ_{i=2}^{N} R_i g(p_i).

Now, R_i = O_p((i-1)^{-1}) in view of the fact that
(i) all entries of H^j_i(β) (1 ≤ j ≤ k, 1 ≤ i ≤ N-1) and p_i''(β) (1 ≤ i ≤ N) are bounded for all β, and
(ii) (β̂_{i-1} - β) is a vector the entries of which are O_p((i-1)^{-1/2}) (McCullagh & Nelder, 1989).
Hence, since we assumed that both g and its first derivative are bounded, writing K_j(β) = (A_j - p_j(β)) x_j so that b_{i-1}(β) = Σ_{j=1}^{i-1} K_j(β), and with g_i denoting the first derivative of g evaluated at p_i(β),

    Σ_{i=1}^{N} (A_i - p_i) g(p_i) = Σ_{i=1}^{N} (A_i - p_i(β)) g(p_i(β)) + Σ_{j=1}^{N-1} K_j(β)^T Σ_{i=j+1}^{N} a_{i-1}^{-1} p_i'(β) g_i (A_i - p_i(β)) - Σ_{j=1}^{N-1} K_j(β)^T Σ_{i=j+1}^{N} g(p_i(β)) a_{i-1}^{-1} p_i'(β)
        - Σ_{i=2}^{N} p_i²(β)(1 - p_i(β))² (b_{i-1}(β)^T a_{i-1}^{-1} x_i)² g_i + O_p(N^{1/2}).
One can show that

    Var_β( Σ_{j=1}^{N-1} K_j(β)^T Σ_{i=j+1}^{N} a_{i-1}^{-1} p_i'(β) g_i (A_i - p_i(β)) ) = O(log N),

    Var_β( Σ_{i=2}^{N} p_i²(β)(1 - p_i(β))² (b_{i-1}(β)^T a_{i-1}^{-1} x_i)² g_i ) = O(log² N),

    Cov_β( Σ_{j=1}^{N-1} K_j(β)^T Σ_{i=j+1}^{N} a_{i-1}^{-1} p_i'(β) g(p_i(β)), Σ_{i=2}^{N} p_i²(β)(1 - p_i(β))² (b_{i-1}(β)^T a_{i-1}^{-1} x_i)² g_i ) = O(log N),

    Cov_β( Σ_{i=1}^{N} g(p_i(β))(A_i - p_i(β)), Σ_{j=1}^{N-1} K_j(β)^T Σ_{i=j+1}^{N} a_{i-1}^{-1} p_i'(β) g_i (A_i - p_i(β)) ) = O(N^{1/2}),

    Cov_β( Σ_{i=1}^{N} g(p_i(β))(A_i - p_i(β)), Σ_{i=2}^{N} p_i²(β)(1 - p_i(β))² (b_{i-1}(β)^T a_{i-1}^{-1} x_i)² g_i ) = O(N^{1/2}),

    Cov_β( Σ_{j=1}^{N-1} K_j(β)^T Σ_{i=j+1}^{N} a_{i-1}^{-1} p_i'(β) g_i (A_i - p_i(β)), Σ_{j=1}^{N-1} K_j(β)^T Σ_{i=j+1}^{N} a_{i-1}^{-1} p_i'(β) g(p_i(β)) ) = O(N^{1/2}),

    Cov_β( Σ_{j=1}^{N-1} K_j(β)^T Σ_{i=j+1}^{N} a_{i-1}^{-1} p_i'(β) g_i (A_i - p_i(β)), Σ_{i=2}^{N} p_i²(β)(1 - p_i(β))² (b_{i-1}(β)^T a_{i-1}^{-1} x_i)² g_i ) = O(log^{3/2} N).
Therefore,

    Var_β( Σ_{i=1}^{N} (A_i - p_i) g(p_i) )
      = Var_β( Σ_{i=1}^{N} (A_i - p_i(β)) g(p_i(β)) ) + 2 Σ_{i=2}^{N} g(p_i(β)) p_i'(β)^T Σ_{k=i+1}^{N} a_{k-1}^{-1} p_k'(β) g(p_k(β)) - 2 Σ_{j=1}^{N-1} g(p_j(β)) p_j'(β)^T Σ_{i=j+1}^{N} g(p_i(β)) a_{i-1}^{-1} p_i'(β)
          + o(N) + O(log N) + O(log² N) + O(N^{1/2}) + O(log^{5/2} N)
      = Σ_{i=1}^{N} (g(p_i(β)))² p_i(β)(1 - p_i(β)) + o(N).

Also, since p_i → p_i(β) with probability 1 as i → ∞,

    (1/N) Σ_{i=1}^{N} p_i (1 - p_i) g²(p_i) - (1/N) Σ_{i=1}^{N} p_i(β)(1 - p_i(β)) g²(p_i(β)) → 0   w.p. 1.
It remains to show that the terms which do not include R_i,

    Σ_{i=1}^{N} (A_i - p_i(β)) { g(p_i(β)) + b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) g_i } - Σ_{j=1}^{N-1} K_j(β)^T Σ_{i=j+1}^{N} g(p_i(β)) a_{i-1}^{-1} p_i'(β) - Σ_{i=2}^{N} p_i²(β)(1 - p_i(β))² (b_{i-1}(β)^T a_{i-1}^{-1} x_i)² g_i,

satisfy the conditions for a central limit theorem. As one can show that

    Σ_{i=2}^{N} p_i²(β)(1 - p_i(β))² (b_{i-1}(β)^T a_{i-1}^{-1} x_i)² g_i = o_p(N^{1/2}),

one can disregard this term by invoking Slutzky's lemma. The remaining expressions can be written as
follows:

    Σ_{i=1}^{N} (A_i - p_i(β)) { g(p_i(β)) + b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) g_i - x_i^T Σ_{j=i+1}^{N} g(p_j(β)) a_{j-1}^{-1} p_j'(β) }.

As these independent summands are bounded they satisfy the Lindeberg condition.  □
Proof of Corollary 2:

Let

    Dev_N(p_j) = 2 Σ_{j=1}^{J} { A_j log(A_j / (m_j p_j)) + (m_j - A_j) log((m_j - A_j) / (m_j - m_j p_j)) },   where Σ_j m_j = N.

One must show that

(a) Dev_N( m_j^{-1} Σ_i p_i Δ(x_i = x_j) ) / Dev_N( p_j(β) ) → 1;

(b) Dev_N( p_j(β̂_N) ) / Dev_N( p_j(β) ) → 1.

For (a), it is sufficient to show that

    E_β( { Σ_{j=1}^{J} m_j^{-1} Σ_{i=1}^{N} Δ(x_i = x_j) (b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) + R_i) (1 - p_j*)^{-1} (A_j p_j*^{-1} - m_j) }² { Σ_{j=1}^{J} (A_j log p_j(β) + (m_j - A_j) log(1 - p_j(β))) }^{-2} ) → 0   as m_j → ∞

for all j, where p_j* falls between p_j(β) and m_j^{-1} Σ_{i=1}^{N} p_i Δ(x_i = x_j). Indeed, the left-hand side is

    ≤ O(N^{-2}) E_β( ( Σ_{i=1}^{N} (b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) + R_i) )² )
    ≤ O(N^{-2}) { Σ_{i=1}^{N} p_i'(β)^T a_{i-1}^{-1} p_i'(β) + Σ_{i=1}^{N} E_β^{1/2}(R_i²) Σ_{i=1}^{N} E_β^{1/2}((b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β))²) + E_β( Σ_{i,k=1; i≠k}^{N} p_i'(β)^T a_{i-1}^{-1} b_{i-1}(β) b_{k-1}(β)^T a_{k-1}^{-1} p_k'(β) )
          + E_β( Σ_{i,k=1; i≠k}^{N} R_i R_k ) + E_β( Σ_{i,k=1; i≠k}^{N} b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) R_k ) }
    ≤ O(N^{-1}) + o(N^{-1/2}) + O(N^{-2}) { E_β( Σ_{i,k=1; i<k}^{N} p_i'(β)^T a_{i-1}^{-1} b_{i-1}(β) b_{k-1}(β)^T a_{k-1}^{-1} p_k'(β) ) + E_β( ( Σ_{i=1}^{N} R_i )² ) + E_β( ( Σ_{i=1}^{N} b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) ) ( Σ_{k=1}^{N} R_k ) ) }
    ≤ O(N^{-1}) + O(N^{-1/2}) E_β^{1/2}[ ( Σ_{i=1}^{N} b_{i-1}(β)^T a_{i-1}^{-1} p_i'(β) )² ] E_β^{1/2}[ ( Σ_{k=1}^{N} R_k )² ]
    ≤ O(N^{-1}) + o(N^{-1/2}).
For (b), it is sufficient to show that

    E_β( { (b_N(β)^T a_N^{-1} p_j'(β) + R_j*) (1 - p_j†)^{-1} (A_j p_j†^{-1} - m_j) }² { Σ_{j=1}^{J} (A_j log p_j(β) + (m_j - A_j) log(1 - p_j(β))) }^{-2} ) → 0   as m_j → ∞

for all j, where p_j† falls between p_j(β̂_N) and p_j(β),

    R_j* = ½ ((β̂_N - β)^T H¹_N (β̂_N - β), ..., (β̂_N - β)^T H^k_N (β̂_N - β)) a_N^{-1} p_j'(β) - ½ (β̂_N - β)^T p_j''(β‡) (β̂_N - β),

and the entries of β‡ fall between those of β and β̂_N. Since J is finite, one only needs to demonstrate
that each summand converges to 0. Now,

    E_β( { (b_N(β)^T a_N^{-1} p_j'(β) + R_j*) (1 - p_j†)^{-1} (A_j p_j†^{-1} - m_j) }² { ... }^{-2} )
      ≤ O(1) E_β( b_N(β)^T a_N^{-1} p_j'(β) + R_j* )
      = O(1) ( p_j'(β)^T a_N^{-1} p_j'(β) + E_β(R_j*) + E_β( b_N(β)^T a_N^{-1} p_j'(β) R_j* ) )
      = O(N^{-1}).  □