
Jørgen Hilden, November 1998
Contribution to workshop arranged by the
Nordic Network for Biostatistics Research,
entitled 'Prediction in Medical Statistics',
Odense University, February 4-5, 1999 (updated March 2008*)
Scoring Rules for Evaluation of
Prognosticians and Prognostic Rules
J. Hilden
The presentation will focus on: 1) the concept of a proper scoring rule and its rationale, viz. that of
ensuring that a poor or dishonest source of prognoses will not outperform a good (low-loss) or honest
one; 2) its mathematical link to tangents of convex functions and decision-theoretic loss; 3) extending
these concepts from diagnosis (categorial alternatives) to continuous outcomes, such as future
measurements of physical performance or time to recovery/death.
There may be time left to touch briefly upon how these ideas could be further extended
to: 4) comparing several prognosticators on one evaluation sample (in clinical informatics the idea of
paired data seems to have blown unheeded over the top of many a head); 5) repetitive events and
updatable prognostications; 6) evaluation of therapeutic guidance (vectors of management-conditional
prognoses); 7) feed-back to calibrate or fine-tune a prognostic algorithm. For each of the latter
purposes, a superstructure must be added to the basic proper-scoring formalism, and the standard
evaluation study, with its passive recording of consecutive prediction-outcome pairs, may have to be
redesigned.
Jørgen Hilden, assoc. prof.
Dept. of Biostatistics
University of Copenhagen
Tlph +45-35 32 79 17
Fax +45-35 32 79 07
email [email protected]
*The overhead transparencies from 1999 have been tidied up, and explanatory remarks and figure
captions have been added. A remark on censoring has been corrected. Some of the formulae that
involve functional derivatives are still left for intuitive appreciation, using a notation that mimics the
simple notation of Fig. 1.
Topics
1) Proper Scoring Rules ─ rationale =
ensuring that a poor or dishonest source of
prognoses will not outperform a good or
honest one
2) mathematical link to tangents of convex
functions and decision-theoretic loss
3) extending these concepts from diagnosis
to continuous outcomes
4) comparing several prognosticators on
one evaluation sample ─ the use of paired data
5) repetitive events
and updatable prognostications
6) evaluation of therapeutic guidance
7) feed-back to fine-tune a prognostic algorithm
The problem:
WE WANT CHANCES OF RECOVERY AND WHEN
─ not just a shot guess
How to score (diagnostic or)
prognostic PROBABILITY advice?
Case i (i = 1, 2, ...):
data_i → [BLACK BOX] → advice_i ;  outcome t_i
(advice_i , t_i) → score_i
(When this type of scoring is adhered to,
and the sample is representative,
the average score is pertinent;
so let's focus on one case)
A scoring rule must judge closeness
between an unknown distribution
(denoted by π)
(of 'all similar cases')
and the (BLACK BOX-made) distribution (denoted by p)
on the basis of a SINGLE outcome
-- because exact repetition doesn't exist
(large clinical data space; no two patients are alike)
-- and regression etc. would involve
assumptions (→ risk of mis-evaluation)
PROPER SCORING RULES reward closeness
and honesty (qua maximizing E{score})
1) HUMAN Black Box: dishonesty doesn't pay &
(s)he can't exploit knowledge of
how the scoring will be done
2) COMPUTER Black Box: protection against
wrong purchase
2a) Vendor can't exploit knowledge...
Do such scoring rules exist?
Yes
What criterion of closeness?
Utility or loss function...
Mathematical basis (cf. Fig. 1)
Notation: p and h are distributions of T,
guesstimators of the true π .
An arbitrary convex functional J(p)
gives a proper scoring rule:
Q(p|t) = J(p) + J'(p)(δ(t) − p) ,
[interpret notation as required; δ(t) is the one-point distribution at t]
'the tangent at J(p) evaluated at P{T=t} = 1.'
(if strictly convex, then strictly proper)
E_h Q(p|T) = J(p) + J'(p)(h − p)
= Q(p|h) < Q(h|h) = J(h) ;
E_p Q(p|T) = Q(p|p) = J(p) . See Fig. 1!
Conversely, any Q satisfying this inequality
defines a convex J.
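The tangent construction can be checked numerically in the binary case; below is a minimal sketch, assuming the arbitrarily chosen convex functional J(p) = p² with p = P{T = 1}:

```python
# Sketch of the tangent construction Q(p|t) = J(p) + J'(p)(t - p)
# for a binary outcome, using the (arbitrarily chosen) convex J(p) = p^2.

def Q(p, t):
    """Score when P{T=1} = p is announced and outcome t in {0, 1} occurs."""
    J = p * p          # convex functional J(p)
    dJ = 2 * p         # its derivative J'(p)
    return J + dJ * (t - p)

def EQ(p, h):
    """Expected score E_h Q(p|T) when the true probability of T = 1 is h."""
    return h * Q(p, 1) + (1 - h) * Q(p, 0)

# Properness: for any true h, announcing p = h maximizes the expected score,
# and the maximum attained is Q(h|h) = J(h).
for h in (0.1, 0.37, 0.8):
    grid = [i / 1000 for i in range(1001)]
    assert all(EQ(p, h) <= EQ(h, h) + 1e-12 for p in grid)
    assert abs(EQ(h, h) - h * h) < 1e-12   # = J(h)
```

Any other convex J would do; strict convexity is what makes the honest p = h the unique maximizer.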
.....................................
Figure 1. A proper scoring rule for the binary case (t = 0 vs. t = 1). Prediction p occasions the scores Q(p|0) and
Q(p|1), which arise as the ends of a tangent at fraction p to a convex curve J(p); i.e., Q(p|t) = J(p) + J'(p)(t − p). If
the true probability is π, the expected score, denoted by Q(p|π), is smaller (by the asterisk-marked amount) than the
score expectation that could have been obtained by knowing the truth, viz. J(π) = Q(π|π). Below, left: a non-strictly
convex J-curve gives rise to a non-strictly proper scoring rule: any prediction in the interval in which J is linear gives
the same expected score as predicting the true π. Below, right: as J has a corner at p, any of the pencil bundle of
tangents at p can be used (if used consistently!), so a tie-breaking rule is required (tied recommendation).
"Closeness to the true π is rewarded"
in the sense that
if p_a = aπ + (1−a)p (0 < a < 1),
then
E_π Q(p_a|T) > E_π Q(p|T).
Closeness → Properness
Def.: The EQUIVALENCE SET of π,
e(π) = {p: Q(p| π) =* Q(π| π)} .
*with = rather than < .
If Q is strictly proper, e(π) is just {π}.
.......................................
Other important properties
Q equivalent to aQ + f(outcome t), a > 0.
( f(t)? Why pay the meteorologist for sunshine? )
................................
PSRs encourage use of all data
- which the Black Box knows how to interpret
................................
Numerous ways of constructing
one Q from others.
Note that, given a censored course with independent censoring at time C,
one may still score strictly properly the observable (lumped)
version of t, defined on the scale ]0, C] ∪ {“t > C”} .
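A small sketch of this lumping, assuming discrete survival times and logarithmic scoring (both my choices, not fixed by the text):

```python
import math

# Lump a discrete survival distribution at censoring time C: times <= C
# keep their probabilities; all later mass becomes one category "t > C",
# the observable version of t under censoring at C.

def lump(probs, C):
    """probs: {time: prob}; returns the lumped distribution."""
    out = {t: q for t, q in probs.items() if t <= C}
    out[">C"] = 1.0 - sum(out.values())
    return out

def log_score(lumped, obs):
    return math.log(lumped[obs])

def expected(p_lumped, true_lumped):
    return sum(q * log_score(p_lumped, obs) for obs, q in true_lumped.items())

true_pi = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
C = 2
pi_l = lump(true_pi, C)
# A prediction agreeing with pi on the lumped scale scores as well as pi;
# one disagreeing below C scores worse in expectation (properness on the
# lumped scale).
for p in ({1: 0.1, 2: 0.2, 3: 0.5, 4: 0.2},   # same lumped version
          {1: 0.2, 2: 0.2, 3: 0.3, 4: 0.3}):  # differs below C
    assert expected(lump(p, C), pi_l) - expected(pi_l, pi_l) <= 1e-12
```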
Decision-theoretic construction
of a SPSR Q via expected Utility.
(Let's first take a look at an example;)
Graphical example illustrating
decision-theoretic construction of a SPSR
We imagine an
Investment A in a project of fixed
Duration D that leads to
Reward R if survival t > D .
Figure 2. Survival curves with investment decisions. Upper part: the (D, A/R) point is below the predicted
survival curve (i.e., A/R < 1 − p(D)), so the investment-to-reward ratio, A/R, is low enough for making the
investment profitable, despite the D months of delayed return. Lower part: a situation with multiple micro-
decisions. Here prediction p should be rewarded for suggesting the same investments and abstentions from
investment as the unknown true π: so p is as good as π (hence belongs to e(π)) when none of the decision
points lie between the p-based survival curve and the one based on π.
U(invest, t) = ─ A + I(t > D) R ,
U(don't, t) = 0 .
The better option according to p:
β(p) = invest iff p{T > D} >* A/R ,
*with a tied decision in case of =.
so let's mark point (D, A/R) on the survival diagram:
if the p-prediction runs above it, invest!
With an inventory of
such decisions (A,D,R)
(independent and additive)
we have a swarm of points ─ and
the e(π) comprises those p's
whose graphs are
so close to the true π-curve
that they are
not separated from it by any
of these points.
Mathematical formulation: Suppose
U(option β, p) = ∫ U(β, t) dp(t)
is maximized by β(p).
Q(p|t) = maxchoice_β{U(β, p); U(β, t)}
= U(β(p), t) ;
i.e., the utility attained by trusting prediction p
and therefore choosing action β(p),
when the true situation turns out to be t.
[Note: maxchoice_i{a(i); b(i)} = b(the i that maximizes a(i)) .]
J(p) = Q(p|p) = U(β(p), p) .
Q(p|h) = U(β(p), h)
< U(β(h), h) = J(h),
i.e. properness ─
strictly if the β-space is 'rich'
or a 'rich' set of
such decisions is envisaged,
otherwise 'sufficiently proper,'
i.e., proper enough for the
medical application at hand.
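A numerical sketch of the decision-theoretic construction, with a hypothetical inventory of (A, D, R) decisions and discrete survival times (all numbers my own):

```python
import random

# A scoring rule built from an inventory of independent, additive
# investment decisions (A, D, R). beta(p) invests exactly when
# p{T > D} > A/R; the score Q(p|t) = U(beta(p), t) is the utility
# actually attained by trusting p.

DECISIONS = [(0.3, 2, 1.0), (0.5, 4, 1.0), (0.2, 6, 1.0)]   # (A, D, R)

def p_survive(p, D):
    """p{T > D} when p lists the probabilities of T = 1, 2, ..., len(p)."""
    return sum(q for t, q in enumerate(p, start=1) if t > D)

def beta(p):
    return [p_survive(p, D) > A / R for (A, D, R) in DECISIONS]

def utility(choices, t):
    return sum(-A + (R if t > D else 0.0)
               for inv, (A, D, R) in zip(choices, DECISIONS) if inv)

def Q(p, t):            # Q(p|t) = U(beta(p), t)
    return utility(beta(p), t)

def EQ(p, h):           # Q(p|h) = E_h Q(p|T)
    return sum(q * Q(p, t) for t, q in enumerate(h, start=1))

random.seed(0)
def random_dist(n=8):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# Properness: E_h Q(p|T) = U(beta(p), h) <= U(beta(h), h) = Q(h|h).
h = random_dist()
for _ in range(200):
    assert EQ(random_dist(), h) <= EQ(h, h) + 1e-12
```

With only three decision points the rule is merely 'sufficiently proper': many p share the same β(p) and hence the same score.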
Utility-based SPSRs formulated in terms of
functional derivatives
Q(p|t) = J(p) + J'(p)(δ(t) − p)
= J(p) + J'(p → δ(t))
= U(β(p), p) + U'(β(p), p)(δ(t) − p)
= U(β(p), t) ; (*)
(*) Depends on a tie-breaking policy
when β(p) is not unique ─ i.e.
when p is located at a CORNER of J(p)
[Fig. 1, bottom right].
Q(p|h) = J(p) + J'(p → h)
= U(β(p), p) + U'(β(p), p)(h − p)
= U(β(p), h) . (*)
Derivatives in distribution space
(tiptoeing over intricacies)
Categorial case:
p = (p₁, p₂, ... )ᵀ = (p_x)ᵀ
F'(p → h) =
lim ε⁻¹[F(p + ε(h − p)) − F(p)]
= F'(p)ᵀ(h − p) ,
F'(p) = (vector of partial derivatives), or
lim ε⁻¹[F(p + ε(δ(x) − p)) − F(p)] ,
x = outcome index above,
δ(x) = (0, 0, …, 1, …, 0)ᵀ with 1 at position x.
Here:
Q(p|t) = J(p) + J'(p)ᵀ(δ(t) − p)
= J(p) + J'(p → δ(t))
= J + J'_t − Σ_j p_j J'_j .
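As a check of the categorial formula Q(p|t) = J + J'_t − Σ_j p_j J'_j: choosing J(p) = Σ_j p_j log p_j (negative entropy, a convex functional; my choice of example) recovers the logarithmic score. A sketch:

```python
import math

# The tangent formula Q(p|t) = J + J'_t - sum_j p_j J'_j, applied to the
# convex functional J(p) = sum_j p_j log p_j, yields the logarithmic score.

def J(p):
    return sum(q * math.log(q) for q in p)

def Q_tangent(p, t):
    dJ = [math.log(q) + 1.0 for q in p]      # partial derivatives J'_j
    return J(p) + dJ[t] - sum(q * d for q, d in zip(p, dJ))

p = [0.2, 0.5, 0.3]
for t in range(len(p)):
    assert abs(Q_tangent(p, t) - math.log(p[t])) < 1e-12   # = log p_t
```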
------------------------------------
Continuous case:
F'(p → h) =
lim ε⁻¹[F(p + ε(h − p)) − F(p)]
= F'(p)(h − p)
= ∫ F'(p)(x) (dh(x) − dp(x)) ,
the component F'(p)(x) being a function of x
which (regularity conditions!) equals
lim ε⁻¹[F(p + ε(δ(x) − p)) − F(p)] .
Some non-utilistic PSRs for continuous outcomes
Abbreviations: m* = Ep{T}, s* = SDp{T}
J(p) = −2√m*, leading to Q(p|t) = −(t/√m* + √m*),
exemplifies:
Let f* = Ep{f(T)}, and let v be a convex function.
Choosing J(p) = v(f*) leads to
Q(p|t) = v(f*) + v’(f*)(f(t) – f*).
e(π) = { p: f* = Eπ{f(T)} }.
Choosing J(p) = –2s* leads to
Q(p|t) = –((t – m*)2/s* + s*) .
e(π) = { p: p and π have the same mean
and also the same variance }.
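The claim about e(π) can be checked in closed form: E_π Q(p|T) = −((s_π² + (m_π − m_p)²)/s_p + s_p), maximized exactly when p matches π's mean and SD. A sketch (the distribution is my arbitrary choice):

```python
import math

# The score Q(p|t) = -((t - m*)^2/s* + s*) depends on p only through its
# mean m* and SD s*; its pi-expectation is maximized when these match pi's.

def mean_sd(vals, probs):
    m = sum(v * q for v, q in zip(vals, probs))
    var = sum(q * (v - m) ** 2 for v, q in zip(vals, probs))
    return m, math.sqrt(var)

def EQ(m_p, s_p, m_pi, s_pi):
    """E_pi Q(p|T), in closed form."""
    return -((s_pi ** 2 + (m_pi - m_p) ** 2) / s_p + s_p)

vals = [1, 2, 3, 4]
pi = [0.1, 0.2, 0.3, 0.4]
m_pi, s_pi = mean_sd(vals, pi)

best = EQ(m_pi, s_pi, m_pi, s_pi)            # = -2 s_pi = J(pi)
for m_p in (m_pi - 0.5, m_pi, m_pi + 0.5):
    for s_p in (0.5 * s_pi, s_pi, 2 * s_pi):
        assert EQ(m_p, s_p, m_pi, s_pi) <= best + 1e-12
assert abs(best + 2 * s_pi) < 1e-12
```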
J(p) = −E_{p,p iid}{|T₁ − T₂|} = −2∫p(x)(1 − p(x))dx .
Note: p(x) now denotes the predicted c.d.f., and the step function
I(t ≤ x) is its observed counterpart when T = t is observed.
Q(p|t) = −2E_p{|t − T|} − J(p) = −2∫(p(x) − I(t≤x))²dx ;
Q(p|h) = −2∫[p(x)² + h(x) − 2p(x)h(x)]dx .
This scoring rule is strictly proper, e(π) = {π}.
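Strict properness here follows from E_h Q(p|T) − Q(h|h) = −2∫(p(x) − h(x))² dx; a discretized check, with two hypothetical exponential c.d.f.s:

```python
import math

# Discretized check of Q(p|t) = -2 * Int (p(x) - I(t<=x))^2 dx:
# E_h Q(p|T) - Q(h|h) = -2 * Int (p(x) - h(x))^2 dx <= 0, = 0 iff p = h.

def cdf_exp(rate, xs):                 # two hypothetical c.d.f. shapes
    return [1 - math.exp(-rate * x) for x in xs]

dx = 0.01
xs = [i * dx for i in range(3000)]     # grid on [0, 30)

def EQ(p, h):
    """E_h Q(p|T) = -2 * Int [p^2 - 2ph + h] dx (since E I(T<=x) = h(x))."""
    return -2 * sum((pi * pi - 2 * pi * hi + hi) * dx
                    for pi, hi in zip(p, h))

h = cdf_exp(1.0, xs)
p = cdf_exp(1.5, xs)
gap = EQ(p, h) - EQ(h, h)
closed = -2 * sum((pi - hi) ** 2 * dx for pi, hi in zip(p, h))
assert gap < 0
assert abs(gap - closed) < 1e-9
```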
Given z, let ζ be the z-fractile of p (i.e., p(ζ) = z) .
J(p) = – [Ep{T – ζ | T > ζ } + Ep{ζ – T | T < ζ }] leads to
Q(p|t) = – [ (t – ζ)+/(1 – z) + (ζ – t)+/z ] .
e(π) = {p : p and π have the same z-fractile}
A monotone increasing transformation of T preserves this property,
and the scoring rules thus obtained are the only
fractile-only-sensitive scoring rules.
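A Monte-Carlo sketch of the fractile score (Exp(1) outcomes and z = 0.8 are my choice of example):

```python
import math
import random

# The fractile score Q(p|t) = -[(t - zeta)+/(1 - z) + (zeta - t)+/z]
# is maximized, on average, by announcing the true z-fractile.

random.seed(1)
z = 0.8
sample = [random.expovariate(1.0) for _ in range(10000)]   # true T ~ Exp(1)

def avg_score(zeta):
    return -sum(max(t - zeta, 0) / (1 - z) + max(zeta - t, 0) / z
                for t in sample) / len(sample)

grid = [i / 50 for i in range(1, 200)]
best = max(grid, key=avg_score)
true_fractile = -math.log(1 - z)       # 0.8-fractile of Exp(1), about 1.61
assert abs(best - true_fractile) < 0.15
```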
"Repetitive events and updatable prognostications"
(modernize terminology if you like)
Simple survival, as a special case:
predicted integrated hazard Γ, with hazard γ = (d/dx)Γ
Q(Γ|t) = 2γ(t) − ∫₀ᵗ γ²(x)dx
is the analogue of the quadratic
scoring rule for categorical outcomes.
Let Truth = Λ, with hazard λ:
Q(Γ|Λ) = ∫(2λ(x)γ(x) − γ²(x))exp[−Λ(x)]dx ,
Q(Λ|Λ) − Q(Γ|Λ) =
∫(λ(x) − γ(x))²exp[−Λ(x)]dx > 0 .
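A numerical check of this identity for constant hazards (the values γ = 0.3, λ = 0.5 are hypothetical):

```python
import math

# Check, for constant hazards, that
# Q(Lambda|Lambda) - Q(Gamma|Lambda) = Int (lam - gam)^2 exp(-Lambda(x)) dx,
# which for constant hazards equals (lam - gam)^2 / lam.

def Q_expected(gamma, lam, tau=40.0, n=40000):
    """Int_0^tau (2*lam*gamma - gamma^2) * exp(-lam*x) dx, midpoint rule."""
    dx = tau / n
    return sum((2 * lam * gamma - gamma * gamma)
               * math.exp(-lam * (i + 0.5) * dx) * dx
               for i in range(n))

lam, gamma = 0.5, 0.3
gap = Q_expected(lam, lam) - Q_expected(gamma, lam)
assert gap > 0                                        # honest hazard wins
assert abs(gap - (lam - gamma) ** 2 / lam) < 1e-3     # closed form
```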
We now see 4 extensions:
1) the quadratic could be replaced with
any convex function ─ which could
be allowed to depend on time or
the history so far;
2) when death is replaced with a repetitive
event, the corresponding terms could be
appended, including one for the final
event-free period (ending at τ), if any ;
3) several event types (competing risks)
could also be handled by simple addition;
4) the hazards could be made dependent
on the history so far ...
In that case we would have not just one
true Λ but an entire hierarchy,
one better than the other; however,
the inequality tells us that
prediction scheme Γ could never (in
expectation) do better than an ideal
prognosticator with access to the same
current history data.
With the prognosis-maker offering
γ = γ(x) = (d/dx)Γ(x|hist(x−)),
and the advice evaluator choosing
u = u(γ; hist(x−)),
convex in its 1st argument,
we have, with N = observed outcome events (counting process):
Q(Γ|N) = ∫₀^τ [(u − u'γ)dx + u' dN(x)] ;
by convexity,
Q(Γ|Λ) = ∫[u + (λ − γ)u']exp(−Λ)dx
< Q(Λ|Λ) = ∫ u(λ; hist(x−)) exp(−Λ)dx .
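For simple survival with constant hazards and u(γ) = γ², the counting-process score reduces to Q(Γ|N) = 2γ − γ²t for an event at time t; a Monte-Carlo sketch (parameter values are my own):

```python
import random

# With u(g) = g^2 and one event at time t, the realized score is
# Int_0^t (u - u'g) dx + u'(g) = (g^2 - 2g^2)*t + 2g = 2g - g^2 * t.
# Its average over T ~ Exp(lam) is Q(Gamma|Lambda) = 2g - g^2/lam,
# which stays below Q(Lambda|Lambda) = lam.

random.seed(3)
lam, gamma = 0.5, 0.3

def realized_score(g, t):
    return -g * g * t + 2 * g

n = 100000
avg = sum(realized_score(gamma, random.expovariate(lam))
          for _ in range(n)) / n
expected = 2 * gamma - gamma * gamma / lam     # = Q(Gamma|Lambda)
assert abs(avg - expected) < 0.01              # Monte-Carlo consistency
assert expected < lam                          # honest hazard scores better
```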
Comparing several BLACK BOXES
As SPSRs promise a larger expected
reward to low-loss prognosticators
in each case,
it is in harmony with the philosophy
to compare prognosticators on the basis
of average score differences ─
even when their case-to-case distribution
is skewed, leptokurtic, etc.
(Randomisation tests, ... ?)
Too often the paired nature of such
evaluation data is overlooked!
But there are other ways to take
outliers seriously.
Estimation (parameterized black boxes)
When the logarithmic scoring rule,
Q(p|outcome) = log p{outcome} ,
is used, the procedure is max likelihood.
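A sketch of this equivalence for a Bernoulli parameter (simulated data; a grid search stands in for the score-maximizing procedure):

```python
import math
import random

# Fitting a Bernoulli parameter by maximizing the average logarithmic
# score over the sample is exactly maximum likelihood: the argmax is the
# sample mean.

random.seed(2)
data = [1 if random.random() < 0.3 else 0 for _ in range(2000)]

def avg_log_score(p):
    return sum(math.log(p if t == 1 else 1 - p) for t in data) / len(data)

grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=avg_log_score)
mle = sum(data) / len(data)                 # the ML estimate
assert abs(best - mle) < 0.002              # agree up to grid resolution
```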
Other scoring rules must have the usual
properties of alternatives to maximum likelihood:
advantages?
Two BBs (p and q) may be combined via
c(p, q; φ) ,
e.g. so that c = p when φ = φ1 ,
c = q when φ = φ2 , ...
Evaluation of therapeutic guidance ─
vectors of management-conditional prognoses
Imagine Black Box prediction p replaced with
one for each management arm.
(Crossover trials? Not so difficult.
Otherwise:)
Following Black Box advice in treatment allocation
is risky (also to patients).
Also, "regression toward the mean" →
patients fare worse than predicted
Leaving it to clinicians ?
(same risks as in retrospective outcome studies ─ !?)
So randomization is necessary
─ both for Black Box evaluation
and for treatment evaluation.
- Need everyone be randomized?
- Skewed randomization?
- Black Box-based subgroup analysis, etc?
---- /// ----