Jørgen Hilden, November 1998
Contribution to the workshop 'Prediction in Medical Statistics', arranged by the Nordic Network for Biostatistics Research, Odense University, Febr. 4-5, 1999. /Updated March 2008*/

Scoring Rules for Evaluation of Prognosticians and Prognostic Rules

J. Hilden

The presentation will focus on:
1) the concept of a proper scoring rule and its rationale, viz. that of ensuring that a poor or dishonest source of prognoses will not outperform a good (low-loss) or honest one;
2) its mathematical link to tangents of concave functions and decision-theoretic loss;
3) extending these concepts from diagnosis (categorial alternatives) to continuous outcomes, such as future measurements of physical performance or time to recovery/death.
There may be time left to touch briefly upon how these ideas could be further extended to:
4) comparing several prognosticators on one evaluation sample (in clinical informatics the idea of paired data seems to have blown unheeded over the top of many a head);
5) repetitive events and updatable prognostications;
6) evaluation of therapeutic guidance (vectors of management-conditional prognoses);
7) feed-back to calibrate or fine-tune a prognostic algorithm.
For each of the latter purposes, a superstructure must be added to the basic proper-scoring formalism, and the standard evaluation study, with its passive recording of consecutive prediction-outcome pairs, may have to be redesigned.

Jørgen Hilden, assoc. prof.
Dept. of Biostatistics, University of Copenhagen
Tlph +45-35 32 79 17   Fax +45-35 32 79 07   email [email protected]

*The overhead transparencies from 1999 have been tidied up, and explanatory remarks and figure captions have been added. A remark on censoring has been corrected. Some of the formulae that involve functional derivatives are still left for intuitive appreciation, using a notation that mimics the simple notation of Fig. 1.

-1-

Topics
1) Proper scoring rules ─ rationale = ensuring that a poor or dishonest source of prognoses will not outperform a good or honest one
2) mathematical link to tangents of concave functions and decision-theoretic loss
3) extending these concepts from diagnosis to continuous outcomes
4) comparing several prognosticators on one evaluation sample ─ the use of paired data
5) repetitive events and updatable prognostications
6) evaluation of therapeutic guidance
7) feed-back to fine-tune a prognostic algorithm

2

The problem: WE WANT CHANCES OF RECOVERY, AND WHEN ─ not just a shot-in-the-dark guess.
How to score (diagnostic or) prognostic PROBABILITY advice?

Case i (i = 1, 2, ...):    data_i → [BLACK BOX] → advice_i        outcome t_i
                                                      ↓                ↓
                                                            score_i

(When this type of scoring is adhered to, and the sample is representative, the average score is pertinent; so let's focus on one case.)

A scoring rule must judge closeness between an unknown distribution (denoted by π) (of 'all similar cases') and the (BLACK BOX-made) distribution (denoted by p) on the basis of a SINGLE outcome
-- because exact repetition doesn't exist (large clinical data space; no two patients are alike)
-- and regression etc. would involve assumptions (→ risk of mis-evaluation).

3

PROPER SCORING RULES reward closeness and honesty (qua maximizing E{score})
1) HUMAN Black Box: dishonesty doesn't pay & (s)he can't exploit knowledge of how the scoring will be done
2) COMPUTER Black Box: protection against a wrong purchase
2a) Vendor can't exploit knowledge ...
Do such scoring rules exist? Yes.
What criterion of closeness? Utility or loss function ...
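As a minimal numerical sketch of these claims (my own illustration in Python, not part of the original transparencies): it assumes a binary outcome, a hypothetical true probability pi = 0.3, and the quadratic (Brier-type) rule Q(p|t) = -(t - p)^2 as the scoring rule; the simulated cases stand in for the data_i → advice_i / outcome t_i → score_i scheme of the previous page.

# Quadratic (Brier-type) rule for a binary outcome t in {0, 1}; higher = better.
# Whatever the true probability pi, the expected score E_pi{Q(p|T)} is maximized
# by reporting p = pi, so neither hedging nor exaggeration beats honest,
# well-informed advice in expectation.
import numpy as np

def Q(p, t):
    return -(t - p) ** 2

def expected_score(p, pi):
    """E_pi{Q(p|T)} for T ~ Bernoulli(pi)."""
    return pi * Q(p, 1) + (1 - pi) * Q(p, 0)

pi = 0.3                                           # hypothetical true probability
grid = np.linspace(0.01, 0.99, 99)                 # candidate reported probabilities
best = grid[np.argmax(expected_score(grid, pi))]
print(f"Expected score is maximized by reporting p = {best:.2f} (true pi = {pi})")

# Case-by-case scoring, averaged over a representative sample of 'similar cases':
rng = np.random.default_rng(0)
t_obs = rng.binomial(1, pi, size=10_000)
for p_reported in (0.3, 0.5, 0.9):
    print(f"reported p = {p_reported}: average score {Q(p_reported, t_obs).mean():+.4f}")

Analytically, E_pi{Q(p|T)} = -(pi(1 - p)^2 + (1 - pi)p^2): about -0.21 for the honest p = 0.3, -0.25 for the hedged p = 0.5, and -0.57 for the exaggerated p = 0.9, so a source that distorts its probabilities can only lose in expectation.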
Mathematical basis (cf. Fig. 1)

Notation: p and h are distributions of T, guestimators of the true π; δ(t) denotes the degenerate distribution putting all its mass on the observed outcome t.

An arbitrary convex functional J(p) gives a proper scoring rule:
   Q(p|t) = J(p) + J'(p)(δ(t) - p) ,   [interpret the notation as required]
'the tangent at J(p) evaluated at P{T = t} = 1.'  (If J is strictly convex, the rule is strictly proper.)
   E_h{Q(p|T)} = J(p) + J'(p)(h - p) = Q(p|h) < Q(h|h) = J(h) ;
   E_p{Q(p|T)} = Q(p|p) = J(p) .
See Fig. 1! Conversely, any Q satisfying this inequality defines a convex J. (A numerical sketch of the tangent construction is appended after slide 7.)
.....................................

4

Figure 1. A proper scoring rule for the binary case (t = 0 vs. t = 1). Prediction p occasions the scores Q(p|0) and Q(p|1), which arise as the ends of a tangent at fraction p to a convex curve J(p); i.e., Q(p|t) = J(p) + J'(p)(t - p). If the true probability is π, the expected score, denoted by Q(p|π), is smaller (by the asterisk-marked amount) than the score expectation that could have been obtained by knowing the truth, viz. J(π) = Q(π|π). Below, left: a non-strictly convex J-curve gives rise to a non-strictly proper scoring rule: any prediction in the interval in which J is linear gives the same expected score as predicting the true π. Below, right: as J has a corner at p, any of the pencil bundle of tangents at p can be used (if used consistently!), so a tie-breaking rule is required (tied recommendation).

5

"Closeness to the true π is rewarded" in the sense that
   if p_a = aπ + (1 - a)p (0 < a < 1), then E_π{Q(p_a|T)} > E_π{Q(p|T)}.
Closeness ↔ properness.

Def.: The EQUIVALENCE SET of π,  e(π) = {p: Q(p|π) =* Q(π|π)} ,   *with = rather than <.
If Q is strictly proper, e(π) is just {π}.
.......................................
Other important properties
Q is equivalent to aQ + f(outcome t), a > 0.  ( f(t)? Why pay the meteorologist for sunshine? )
................................
PSRs encourage use of all data ─ which the Black Box knows how to interpret.
................................
Numerous ways of constructing one Q from others.
Note that, given a censored course with independent censoring at time C, one may still score strictly properly the observable (lumped) version of t, defined by the scale ]0, C] ∪ {“t > C”}.

6

Decision-theoretic construction of an SPSR (strictly proper scoring rule) Q via expected utility. (Let's first take a look at an example.)

Graphical example illustrating the decision-theoretic construction of an SPSR:
We imagine that an Investment A in a project of fixed Duration D leads to Reward R if survival t > D.

Figure 2. Survival curves with investment decisions. Upper part: the (D, A/R) point is below the predicted survival curve (i.e., A/R < 1 - p(D), where p denotes the predicted c.d.f.), so the investment-to-reward ratio, A/R, is low enough to make the investment profitable, despite the D months of delayed return. Lower part: a situation with multiple micro-decisions. Here prediction p should be rewarded for suggesting the same investments and abstentions from investment as the unknown true π: so p is as good as π (hence belongs to e(π)) when none of the decision points lie between the p-based survival curve and the one based on π.

7

   U(invest, t) = -A + I(t > D)·R ,   U(don't, t) = 0 .
The better option according to p:  β(p) = invest iff p{T > D} >* A/R ,   *with a tied decision in case of =.
So let's mark the point (D, A/R) on the survival diagram: if the p-prediction runs above it, invest!
With an inventory of such decisions (A, D, R) (independent and additive) we have a swarm of points ─ and e(π) comprises those p's whose graphs are so close to the true π-curve that they are not separated from it by any of these points.
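The following is the numerical sketch of the tangent construction promised under slide 4 (my own, in Python; the two convex J's and the true π = 0.65 are arbitrary illustrative choices, the binary case of Fig. 1 being assumed): Q(p|t) = J(p) + J'(p)(t - p) is formed from J and its derivative, and the expected score Q(p|π) is checked on a grid of predictions.

# Build Q(p|t) = J(p) + J'(p)*(t - p) from a convex J and check numerically that
# the expected score Q(p|pi) peaks at p = pi and never exceeds J(pi).
import numpy as np

def make_scoring_rule(J, Jprime):
    """Tangent construction: the score is the tangent of J at p, read off at t in {0, 1}."""
    return lambda p, t: J(p) + Jprime(p) * (t - p)

# Two convex choices of J: negative entropy (yields the logarithmic rule) and a
# squared form (yields a quadratic/Brier-type rule).  Both are illustrative choices.
candidates = {
    "logarithmic": (lambda p: p * np.log(p) + (1 - p) * np.log(1 - p),
                    lambda p: np.log(p) - np.log(1 - p)),
    "quadratic":   (lambda p: p ** 2 + (1 - p) ** 2,
                    lambda p: 4 * p - 2),
}

pi = 0.65                                      # hypothetical true probability
grid = np.linspace(0.01, 0.99, 981)            # candidate predictions p
for name, (J, Jprime) in candidates.items():
    Q = make_scoring_rule(J, Jprime)
    eQ = pi * Q(grid, 1) + (1 - pi) * Q(grid, 0)         # Q(p|pi) on the grid
    print(f"{name}: argmax_p Q(p|pi) = {grid[np.argmax(eQ)]:.3f}, "
          f"max Q(p|pi) = {eQ.max():.4f}, J(pi) = {J(pi):.4f}")

For both choices the expected score peaks exactly at p = π and its maximum equals J(π), in agreement with E_h{Q(p|T)} = Q(p|h) ≤ Q(h|h) = J(h); the negative-entropy J reproduces the familiar logarithmic rule Q(p|t) = log p{outcome}.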
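And here is a small simulation sketch of the investment construction of slides 6-7 (my own toy setup: the (A, D, R) values are invented, predictions are taken to be exponential survival curves S_p(x) = exp(-rate·x), and the true survival time is exponential with rate 0.10 per month). Each prediction is scored by the utility actually attained when its recommended decisions β(p) are followed, i.e. Q(p|t) = U(β(p), t) summed over the inventory of micro-decisions.

# Score a predicted exponential survival curve S_p(x) = exp(-rate*x) by the total
# utility attained when its recommended investment decisions are followed.
# The true survival time is exponential with rate 0.10 per month (an assumption
# of this sketch), and the (A, D, R) inventory below is made up.
import numpy as np

decisions = [(0.2, 3.0, 1.0), (0.5, 6.0, 1.0), (0.3, 12.0, 1.0),
             (0.6, 18.0, 1.0), (0.4, 24.0, 1.0)]         # hypothetical (A, D, R) swarm

def scores(rate, t):
    """Q(p|t) = sum over micro-decisions of U(beta(p), t), for an array of outcomes t."""
    total = np.zeros_like(t)
    for A, D, R in decisions:
        if np.exp(-rate * D) > A / R:                    # beta(p): invest iff p{T > D} > A/R
            total = total + (-A + R * (t > D))           # U(invest, t) = -A + I(t > D)*R
        # otherwise U(don't, t) = 0 contributes nothing
    return total

rng = np.random.default_rng(1)
true_rate = 0.10
t_obs = rng.exponential(1 / true_rate, size=50_000)      # outcomes of 'similar cases'

for rate in (0.05, 0.08, 0.10, 0.12, 0.20):
    print(f"predicted rate {rate:.2f}: average score {scores(rate, t_obs).mean():+.4f}")

With these numbers the predicted rates 0.05, 0.08 and 0.10 recommend exactly the same investments as the true curve and therefore earn exactly the same average score (they all belong to e(π)), whereas the rates 0.12 and 0.20 are separated from the true curve by the (D, A/R) points at (6, 0.5) and (12, 0.3), forgo those investments, and score lower.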
8

Mathematical formulation:
Suppose U(option β, p) = ∫ U(β, t) dp(t) is maximized by β(p). Then
   Q(p|t) = maxchoice_β{U(β, p); U(β, t)} = U(β(p), t) ,
i.e., the utility attained by trusting prediction p and therefore choosing action β(p), when the true situation turns out to be t.
[Note: maxchoice_i{a(i); b(i)} = b(the i that maximizes a(i)) .]
   J(p) = Q(p|p) = U(β(p), p) .
   Q(p|h) = U(β(p), h) < U(β(h), h) = J(h) ,   i.e. properness ─ strict if the β-space is 'rich' or a 'rich' set of such decisions is envisaged, otherwise 'sufficiently proper', i.e., proper enough for the medical application at hand.

9

Utility-based SPSRs formulated in terms of functional derivatives:
   Q(p|t) = J(p) + J'(p)(δ(t) - p) = J(p) + J'(p → δ(t))
          = U(β(p), p) + U'(β(p), p)(δ(t) - p) = U(β(p), t) ;  (*)
[the prime denotes differentiation with respect to the distribution argument; since U(β, h) = ∫U(β, t)dh(t) is linear in h, U'(β(p), p)(δ(t) - p) = U(β(p), t) - U(β(p), p)]
(*) Depends on a tie-breaking policy when β(p) is not unique ─ i.e. when p is located at a CORNER of J(p) [Fig. 1, bottom right].
   Q(p|h) = J(p) + J'(p → h) = U(β(p), p) + U'(β(p), p)(h - p) = U(β(p), h) .  (*)

10

Derivatives in distribution space (tiptoeing over intricacies)

Categorial case: p = (p_1, p_2, ...)^T = (p_x)^T .
   F'(p → h) = lim_{ε→0} ε⁻¹[F(p + ε(h - p)) - F(p)] = F'(p)^T(h - p) ,   F'(p) = the vector of partial derivatives,
or, taking h = δ(x):
   F'(p → δ(x)) = lim_{ε→0} ε⁻¹[F(p + ε(δ(x) - p)) - F(p)] ,
where x = the outcome index above and δ(x) = (0, 0, …, 1, …, 0)^T with the 1 at position x.
Here:
   Q(p|t) = J(p) + J'(p)^T(δ(t) - p) = J(p) + J'(p → δ(t)) = J + J'_t - Σ_j p_j J'_j .
-----------------------------------
Continuous case:
   F'(p → h) = lim_{ε→0} ε⁻¹[F(p + ε(h - p)) - F(p)] = F'(p)(h - p) = ∫ F'(p)(x)(dh(x) - dp(x)) ,
the component F'(p)(x) being a function of x which, under regularity conditions, equals
   lim_{ε→0} ε⁻¹[F(p + ε(δ(x) - p)) - F(p)] .

11

Some non-utilistic PSRs for continuous outcomes

Abbreviations: m* = E_p{T}, s* = SD_p{T}.

J(p) = -2√m*, leading to Q(p|t) = -(t/√m* + √m*), exemplifies the following recipe:
let f* = E_p{f(T)} and let v be a convex function; choosing J(p) = v(f*) leads to
   Q(p|t) = v(f*) + v'(f*)(f(t) - f*) ;   e(π) = { p: f* = E_π{f(T)} }.

Choosing J(p) = -2s* leads to
   Q(p|t) = -((t - m*)²/s* + s*) ;   e(π) = { p: p and π have the same mean and also the same variance }.

J(p) = -E_{p,p iid}{|T_1 - T_2|} = -2∫p(x)(1 - p(x))dx .
(Note: p(x) is the predicted c.d.f., as before, and the step function I(t ≤ x) is its observed counterpart when T = t is observed.)
   Q(p|t) = -2E_p{|t - T|} - J(p) = -2∫(p(x) - I(t ≤ x))²dx ;
   Q(p|h) = J(h) - 2∫(p(x) - h(x))²dx .
This scoring rule is strictly proper, e(π) = {π}. (A numerical sketch of this rule is appended after slide 12.)

Given z, let ζ be the z-fractile of p (i.e., p(ζ) = z). Choosing
   J(p) = - [E_p{T - ζ | T > ζ} + E_p{ζ - T | T < ζ}]
leads to
   Q(p|t) = - [ (t - ζ)+/(1 - z) + (ζ - t)+/z ] ;   e(π) = {p: p and π have the same z-fractile}.
A monotonically increasing transformation of T preserves this property, and the scoring rules thus obtained are the only fractile-only-sensitive scoring rules.

12

"Repetitive events and updatable prognostications" (modernize the terminology if you like)

Simple survival, as a special case. Predicted cumulative hazard Γ, with hazard γ = (d/dx)Γ:
   Q(Γ|t) = 2γ(t) - ∫_0^t γ²(x)dx
is the analogue of the quadratic scoring rule for categorical outcomes.
Let Truth = Λ, with hazard λ:
   Q(Γ|Λ) = ∫(2λ(x)γ(x) - γ²(x)) exp[-Λ(x)]dx ,
   Q(Λ|Λ) - Q(Γ|Λ) = ∫(λ(x) - γ(x))² exp[-Λ(x)]dx > 0 .
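Here is the numerical sketch promised under slide 11 (my own Monte Carlo illustration; the Normal predictive distributions and the true π = N(0, 1) are invented). It uses the representation Q(p|t) = -2E_p{|t - T|} - J(p) with J(p) = -E_{p,p}{|T_1 - T_2|} and averages the score over outcomes drawn from the true distribution.

# Average the distribution-sensitive score Q(p|t) = -2 E_p|t - T| + E_{p,p}|T1 - T2|
# over outcomes drawn from the true pi = N(0, 1), for several Normal predictions p
# (mu = mean, sigma = SD).  Expectations under p are estimated by Monte Carlo.
import numpy as np

rng = np.random.default_rng(2)

def avg_score(mu, sigma, t_obs, n=4000):
    """Average Q(p|t) over the observed outcomes, p = Normal(mu, sigma)."""
    T1 = rng.normal(mu, sigma, n)
    T2 = rng.normal(mu, sigma, n)
    e_abs = np.abs(t_obs[:, None] - T1[None, :]).mean(axis=1)   # E_p|t - T| per case
    return (-2.0 * e_abs + np.abs(T1 - T2).mean()).mean()

t_obs = rng.normal(0.0, 1.0, size=2000)                  # outcomes from the true distribution
for mu, sigma in [(0.0, 1.0), (1.0, 1.0), (0.0, 2.0), (0.0, 0.4)]:
    print(f"p = Normal(mu={mu}, sd={sigma}): average score {avg_score(mu, sigma, t_obs):+.3f}")

Only the honest p = N(0, 1) comes close to the ceiling J(π) = -E{|T_1 - T_2|} ≈ -1.13 (T_1, T_2 i.i.d. standard normal); the shifted and the mis-dispersed predictions score clearly lower, in line with Q(p|π) = J(π) - 2∫(p(x) - π(x))²dx.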
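A companion sketch (again my own toy example) for the hazard-based quadratic rule of slide 12, in the simplest possible setting of a constant predicted hazard γ against a constant true hazard λ = 0.10, so that Q(Γ|t) = 2γ - γ²t per case and Q(Γ|Λ) = (2λγ - γ²)/λ.

# Average Q(Gamma|t) = 2*gamma(t) - Integral_0^t gamma(x)^2 dx over simulated survival
# times; with constant hazards this is simply 2*gamma - gamma**2 * t per case.
import numpy as np

rng = np.random.default_rng(3)
lam = 0.10                                               # true (constant) hazard
t_obs = rng.exponential(1 / lam, size=100_000)           # simulated true survival times

for gamma in (0.05, 0.08, 0.10, 0.12, 0.20):
    mean_score = (2 * gamma - gamma ** 2 * t_obs).mean()
    theory = (2 * lam * gamma - gamma ** 2) / lam        # Q(Gamma|Lambda)
    print(f"gamma = {gamma:.2f}: mean score {mean_score:+.4f}   (theory {theory:+.4f})")

The mean score peaks at γ = λ = 0.10, where it attains Q(Λ|Λ) = λ; the shortfall for any other γ is (λ - γ)²/λ, the constant-hazard version of ∫(λ(x) - γ(x))² exp[-Λ(x)]dx.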
13

We now see four extensions:
1) the quadratic could be replaced with any convex function ─ which could be allowed to depend on time or on the history so far;
2) when death is replaced with a repetitive event, the corresponding terms could be appended, including one for the final event-free period (ending at τ), if any;
3) several event types (competing risks) could also be handled by simple addition;
4) the hazards could be made dependent on the history so far ... In that case we would have not just one true Λ but an entire hierarchy of them, each better than the previous one; however, the inequality tells us that prediction scheme Γ could never (in expectation) do better than an ideal prognosticator with access to the same current history data.

With the prognosis-maker offering γ = γ(x) = (d/dx)Γ(x | hist(x-)), and the advice evaluator choosing u = u(γ; hist(x-)), convex in its first argument (u' denoting ∂u/∂γ), we have, with N = the observed outcome events (counting process):
   Q(Γ|N) = ∫_0^τ [ (u - u'γ)dx + u' dN(x) ] ;
by convexity,
   Q(Γ|Λ) = ∫[ u + (λ - γ)u' ] exp(-Λ)dx < Q(Λ|Λ) = ∫ u(λ; hist(x-)) exp(-Λ)dx .

14

Comparing several BLACK BOXES

As SPSRs promise a larger expected reward to low-loss prognosticators in each case, it is in harmony with the philosophy to compare prognosticators on the basis of average score differences ─ even when their case-to-case distribution is skewed, leptokurtic, etc. (Randomisation tests, ...?) Too often the paired nature of such evaluation data is overlooked! But there are other ways to take outliers seriously. (A toy sketch of such a paired randomisation test is appended after slide 16.)

Estimation (parameterized black boxes)
When the logarithmic scoring rule, Q(p|outcome) = log p{outcome}, is used, the procedure is maximum likelihood. Other scoring rules must have the usual properties of alternatives to maximum likelihood: advantages?
Two BBs (p and q) may be combined via c(p, q; φ), e.g. so that c = p when φ = φ_1, c = q when φ = φ_2, ...

15

Evaluation of therapeutic guidance ─ vectors of management-conditional prognoses

Imagine the Black Box prediction p replaced with one prediction for each management arm. (Crossover trials? Not so difficult. Otherwise:)
Following Black Box advice in treatment allocation is risky (also to patients). Also, "regression towards the mean" → patients fare worse than predicted.
Leaving it to the clinicians? (Same risks as in retrospective outcome studies ─ !?)
So randomization is necessary ─ both for Black Box evaluation and for treatment evaluation.
- Need everyone be randomized?
- Skewed randomization?
- Black Box-based subgroup analysis, etc.?

---- /// ----

16
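Finally, the toy sketch of a paired comparison promised under slide 14 (entirely invented data; the logarithmic score and the sign-flip randomisation test are my illustrative choices, not a recommendation from the transparencies). Two black boxes are scored case by case with the same proper rule, and the per-case score differences carry the comparison.

# Two hypothetical black boxes give probabilities for the same 400 cases.
# Score each case with the logarithmic rule, form paired differences, and
# test 'equal expected score' by randomly flipping the signs of the differences.
import numpy as np

rng = np.random.default_rng(4)
n = 400
pi = rng.uniform(0.05, 0.95, n)                          # case-specific true probabilities
t = rng.binomial(1, pi)                                  # observed outcomes
p_A = np.clip(pi + rng.normal(0, 0.05, n), 0.01, 0.99)   # box A: close to the truth
p_B = np.clip(pi + rng.normal(0, 0.20, n), 0.01, 0.99)   # box B: noisier advice

def log_score(p, t):
    """Logarithmic proper scoring rule, Q(p|outcome) = log p{outcome}."""
    return np.where(t == 1, np.log(p), np.log(1 - p))

d = log_score(p_A, t) - log_score(p_B, t)                # paired per-case differences
observed = d.mean()

flips = rng.choice([-1.0, 1.0], size=(10_000, n))        # sign flips under the null
null_means = (flips * d).mean(axis=1)
p_value = np.mean(np.abs(null_means) >= abs(observed))
print(f"mean paired score difference (A - B): {observed:+.4f}, randomisation p = {p_value:.4f}")

Because both boxes are scored on the same cases, the case-to-case variation they share cancels out of the differences; comparing the two average scores as if they came from independent samples would waste exactly this precision.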