SCORING RULES

ROBERT L. WINKLER
Fuqua School of Business, Duke University, Durham, North Carolina

VICTOR RICHMOND R. JOSE
McDonough School of Business, Georgetown University, Washington, D.C.

Wiley Encyclopedia of Operations Research and Management Science, edited by James J. Cochran. Copyright © 2010 John Wiley & Sons, Inc.

INTRODUCTION

Uncertainty is a pervasive feature of our world, and fields such as decision analysis and statistics provide methods to help us make decisions, forecasts, and inferences in the face of uncertainty. Our everyday language includes many terms that relate to the degree of uncertainty in a situation: for example, rain is unlikely today, the chances are good that a surgical procedure will be successful, the prospects for an improved economic situation are not favorable, and so on. As the mathematical language of uncertainty, probability theory provides a structure to quantify uncertainty. Probabilities are encountered in the media (e.g., the probability of rain this afternoon is 20%) and widely used in modeling.

Although probability forecasts are formulated and used extensively, very often they are not evaluated after the event or variable of interest is observed. Scoring rules provide such evaluations by giving a numerical score based on the probabilities and on the actual observation. For example, a probability of rain of 40% in a simple two-state setting of rain versus no rain will receive a higher score than a probability of 20% if it rains, and a lower score if it does not rain. In this manner, we can use scoring rules to compare the sources of the probabilities, which might be experts, models, or simply past data. The first scoring rule used on a regular basis was a quadratic rule developed by Brier [1] to evaluate probabilistic weather forecasts. Indeed, weather forecasting is the area in which scoring rules have been used most extensively.

The presence of such an ex post evaluation using suitably designed scoring rules also provides ex ante incentives for careful formulation of probability forecasts. Much of the early development of scoring rules emphasized this ex ante role [e.g., 2–6]. Attention was focused on strictly proper scoring rules, for which a forecaster can maximize his or her expected score only by honestly reporting the probabilities and also has the incentive to obtain further information to increase the accuracy of the probabilities. This ex ante motivation yields rules that reward probabilities that have good characteristics ex post, as we shall see. For a general discussion of scoring rules and reviews of the scoring rule literature, see Winkler [7] and Gneiting and Raftery [8].

We discuss some basic properties of scoring rules in the second section, focusing on the aspects related to ex ante incentives, and present some commonly encountered rules. In the next section, we turn to ex post evaluation, showing how some notions involving strictly proper scoring rules relate to ex post evaluation. The next two sections involve scoring rules with special characteristics, namely those that provide evaluations of probabilities relative to baseline distributions and those that take into account any ordering of the events of interest. A brief summary and discussion, including some connections with other fields, is presented in the final section.

STRICTLY PROPER SCORING RULES

We begin by considering the simplest possible situation, that of a single event A and its complement. Suppose that an expert is assessing a probability for A and is being evaluated with a scoring rule S. If a probability r is reported for A, then the score will be
S(r, e), where e = 1 if A occurs and e = 0 if
A does not occur. Furthermore, assume that
the expert’s best judgment is that the probability of A is denoted by p. Then the expected
score is S(r, p) = pS(r, 1) + (1 − p)S(r, 0). The scoring rule S is said to be strictly proper if

S(p, p) > S(r, p) for any r ≠ p.  (1)
To maximize the expected score with a
strictly proper rule, the expert should set
r = p, thereby reporting the probability
honestly. The scoring rules discussed here
are oriented such that a higher score is
better. Some rules in the literature, such
as the Brier score [1], are oriented with a
negative score being better, in which case
the expert should set r = p to minimize the
expected score. Rules such as the Brier score
can be converted to a positive orientation by
changing the sign of the score, so the focus
on scores with a positive orientation here is
not restrictive.
The expected score S(p, p) for honest
reporting from a strictly proper scoring rule
is strictly convex, and conversely, a strictly
proper scoring rule can be generated from
any strictly convex function of p that is
taken as the expected score function S(p, p)
for honest reporting [6]. Thus, there are an
infinite number of rules satisfying Equation
(1). Three commonly used rules are as
follows:
Quadratic:  S(r, e) = 1 − 2(e − r)²,  (2)

Logarithmic:  S(r, e) = log[re + (1 − r)(1 − e)],  (3)

Spherical:  S(r, e) = [re + (1 − r)(1 − e)]/[r² + (1 − r)²]^{1/2}.  (4)
These and any other strictly proper rules can
be scaled as desired (e.g., to avoid negative
scores), because any positive affine transformation of a strictly proper rule is itself
strictly proper. Figure 1 shows S(r, 1), S(r, 0), and S(p, p) for the quadratic scoring rule. Note that for this simple two-event setting, S(r, 1) and S(r, 0) are mirror images of each other, with S(r, 1) increasing in r and S(r, 0) decreasing in r.

Figure 1. (a) Score functions S(r, 1) and S(r, 0) and (b) expected score S(p, p) under honest reporting for the quadratic scoring rule.
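As a quick illustration of the propriety condition in Equation (1), the three rules in Equations (2)–(4) can be sketched in Python (the function names are ours, not from the source):

```python
import math

# Python sketch of Equations (2)-(4); function names are illustrative.
def quadratic(r, e):
    return 1 - 2 * (e - r) ** 2

def logarithmic(r, e):
    return math.log(r * e + (1 - r) * (1 - e))

def spherical(r, e):
    return (r * e + (1 - r) * (1 - e)) / math.sqrt(r ** 2 + (1 - r) ** 2)

def expected_score(rule, r, p):
    # S(r, p) = p S(r, 1) + (1 - p) S(r, 0)
    return p * rule(r, 1) + (1 - p) * rule(r, 0)

# Equation (1): over a grid of possible reports, the expected score of each
# rule is maximized at the honest report r = p.
p = 0.7
grid = [i / 100 for i in range(1, 100)]  # avoid r = 0, 1 for the log rule
for rule in (quadratic, logarithmic, spherical):
    best = max(grid, key=lambda r: expected_score(rule, r, p))
    assert abs(best - p) < 1e-9
```

On the grid, each rule's expected score peaks exactly at the honest report, as strict propriety requires.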
With the general concept of a strictly
proper scoring rule established for the case of
a single event (and its complement), we next
generalize to the case of a set of mutually
exclusive and exhaustive events {A1 , . . . , Ak },
for which the expert’s probabilities are
given by the vector p = (p1 , . . . , pk ) and the
reported probabilities are r = (r1 , . . . , rk ).
With a scoring rule S, the expert’s score
is S(r, ei ) if Ai occurs, where ei is a vector
with the ith element equal to 1 and the
other elements all equal to 0. The expected score from the perspective of the expert is S(r, p) = Σ_{i=1}^k pi S(r, ei), and the scoring rule is strictly proper if S(p, p) > S(r, p) for any r ≠ p. Quadratic, logarithmic, and spherical
rules for this case are

S(r, ei) = 2ri − Σ_{j=1}^k rj²,  (5)

S(r, ei) = log ri,  (6)

and

S(r, ei) = ri/(Σ_{j=1}^k rj²)^{1/2},  (7)

respectively. Note that this setup could be used when we are considering a discrete
used when we are considering a discrete
distribution of a random variable, which
could include a discretization of a continuous
random variable into a set of intervals.
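Equations (5)–(7) translate directly into code; the sketch below (our names, 0-based index i for the event Ai that occurred) scores a three-event report:

```python
import math

# Sketch of the k-event rules in Equations (5)-(7); names are illustrative.
def quadratic_k(r, i):
    return 2 * r[i] - sum(x ** 2 for x in r)

def logarithmic_k(r, i):
    return math.log(r[i])       # depends only on r_i (locality)

def spherical_k(r, i):
    return r[i] / math.sqrt(sum(x ** 2 for x in r))

r = [0.5, 0.3, 0.2]             # reported distribution over three events
assert abs(quadratic_k(r, 0) - 0.62) < 1e-12   # 2(0.5) - 0.38 = 0.62
```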
Finally, we present scoring rules for probability distributions of a continuous random
variable x̃. Let p denote the expert’s probability density function for x̃, and let r denote
the corresponding reported density function.
Then, for a scoring rule that gives a score
S(r, x) when x̃ = x, the expert's expected score is S(r, p) = ∫_{−∞}^{∞} S(r, x)p(x) dx. Quadratic, logarithmic, and spherical scoring rules in the continuous case are

S(r, x) = 2r(x) − ∫_{−∞}^{∞} r²(x) dx,  (8)

S(r, x) = log r(x),  (9)

and

S(r, x) = r(x)/[∫_{−∞}^{∞} r²(x) dx]^{1/2}.  (10)
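The integrals in Equations (8) and (10) can be approximated numerically. The sketch below assumes a standard normal reported density and a simple midpoint Riemann sum; both choices are purely illustrative:

```python
import math

# Numerical sketch of Equations (8)-(10) for a reported density r(x).
def normal_pdf(x, mu=0.0, sigma=1.0):
    # standard normal density used as an illustrative reported density
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def integral_r_squared(density, lo=-10.0, hi=10.0, n=20000):
    # midpoint Riemann-sum approximation of the integral of r(x)^2
    dx = (hi - lo) / n
    return sum(density(lo + (k + 0.5) * dx) ** 2 for k in range(n)) * dx

def quadratic_cont(density, x):
    return 2 * density(x) - integral_r_squared(density)

def log_cont(density, x):
    return math.log(density(x))

def spherical_cont(density, x):
    return density(x) / math.sqrt(integral_r_squared(density))

# For a standard normal report, the integral of r^2 is 1/(2*sqrt(pi)).
assert abs(integral_r_squared(normal_pdf) - 1 / (2 * math.sqrt(math.pi))) < 1e-6
```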
Our focus has been on strictly proper
scoring rules, which have been developed
with the goal of providing the expert
with an incentive to report honestly. But
if the expert is not well-informed with
respect to the situation, reporting probabilities honestly may not mean reporting
‘‘good’’ probabilities. Not all probability
forecasts are necessarily ‘‘good’’ forecasts.
Fortunately, assuming honest forecasting,
strictly proper scoring rules will reward
forecasts by providing higher expected
scores to forecasts for which p is closer
to 0 or 1. To see how this works for the
quadratic, logarithmic, and spherical rules
that have been presented here, we note
that they share an important characteristic:
they are symmetric. In the single-event
case, that means that the expected score
for an honestly reported probability of r
is the same as the expected score for an
honestly reported probability of 1 − r. As
noted earlier, reporting r = p under a strictly
proper scoring rule results in an expected
score function S(p, p) that is strictly convex
in p. This convexity, combined with the
symmetry, means that S(p, p) is minimized
at p = 0.5 and increases as p → 0 or p → 1
from p = 0.5. These features of S(p, p) are
illustrated for the quadratic rule in Fig. 1.
That means that under honest reporting,
the expected score is higher for probability
forecasts that are sharper, where sharpness
refers to the degree that p is closer to 0 or
1. For example, a probability of 0 or 1 is
perfectly sharp, whereas a probability of
0.5 admits a lot of uncertainty about the
outcome.
To illustrate the incentives from strictly
proper scoring rules for both honesty and
sharpness, consider a decomposition of the
expected score for the quadratic rule in
the case of a probability for a single event.
The expected score is S(r, p) = p[1 − 2(1 − r)²] + (1 − p)(1 − 2r²). Expanding, adding and subtracting p², and rearranging yields

S(r, p) = 1 − 2(p − r)² − 2p(1 − p).  (11)
The second term on the right-hand side of
Equation (11) can be viewed as a penalty
(because of the negative sign) for not setting
r = p, and it thus provides an incentive
for honesty. The last term is a penalty
for lack of sharpness, because p(1 − p) is
maximized at p = 0.5 and decreases as p → 0
or p → 1. The best possible expected score
is one, and dishonesty (r = p) or lack of
perfect sharpness (0 < p < 1) will reduce the
expected score. Other strictly proper rules
(e.g., the logarithmic and spherical rules)
can be decomposed in a similar manner.
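The algebra behind Equation (11) is easy to confirm numerically; the following sketch checks the identity on a small grid of (p, r) pairs:

```python
# Check of the decomposition in Equation (11): the expected quadratic score
# equals 1 minus an honesty penalty 2(p - r)^2 minus a sharpness penalty
# 2p(1 - p).
def expected_quadratic(r, p):
    return p * (1 - 2 * (1 - r) ** 2) + (1 - p) * (1 - 2 * r ** 2)

for p in (0.1, 0.5, 0.9):
    for r in (0.2, 0.5, 0.8):
        lhs = expected_quadratic(r, p)
        rhs = 1 - 2 * (p - r) ** 2 - 2 * p * (1 - p)
        assert abs(lhs - rhs) < 1e-12
```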
Keep in mind that to maximize expected
score, the expert has to report honestly, so
attempting to have probabilities look sharp
artificially (i.e., sharp reported probabilities
that are not consistent with the expert’s
judgments) will decrease the expected score,
not increase it. Note from Equation (11) that
the sharpness term relates to the sharpness
of p, not the sharpness of r. The primary
aspect of strictly proper scoring rules is to
encourage honesty, and the nature of strictly
proper scoring rules is such that honest
reporting by experts who have sharper probabilities will yield higher expected scores
than honest reporting by experts who have
probabilities that are not so sharp. In the
final analysis, then, strictly proper scoring
rules reward both honesty and sharpness.
Strictly proper scoring rules differ in some
characteristics. We will present different
types of strictly proper rules in later sections
and discuss some of their characteristics. As
we shall see, not all strictly proper rules are
symmetric in the sense discussed above, and
in some cases it may be desirable to use a rule
that is not symmetric. Another characteristic
of note is that the logarithmic rule is the
only rule for which the score depends only
on the probability or density that has been
assigned to the event or value of the variable
that actually occurs. It does not depend on
the probability or density assigned to other
events or values. For example, if we consider
scoring rules for k mutually exclusive and
exhaustive events and event Ai occurs, the
logarithmic score, log ri, depends only on ri, whereas the quadratic score, 2ri − Σ_{j=1}^k rj²,
depends on all of the probabilities r1 , . . . , rk .
This property, unique to the logarithmic
rule when k > 2, is called locality, and it
is consistent in spirit with the likelihood
principle that plays a major role in statistics.
Although the quadratic, logarithmic, and
spherical rules given above are the usual
suspects when we think about scoring rules,
they are special cases of two rich families
of strictly proper scoring rules [9]. When
probabilities for a set of k mutually exclusive
and exhaustive events are reported, scores
for the pseudospherical and power families
are given by
SSβ(r, ei) = (1/(β − 1)) { [ri/Er(r^{β−1})^{1/β}]^{β−1} − 1 }  (12)

and

SPβ(r, ei) = (ri^{β−1} − 1)/(β − 1) − (Er(r^{β−1}) − 1)/β,  (13)

respectively, where Er(r^{β−1}) = Σ_{i=1}^k ri(ri)^{β−1} and −∞ < β < ∞. When β = 2, Equations (12) and (13) yield the spherical and quadratic rules, respectively. When β → 1, both families converge to the logarithmic rule.
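A sketch of Equations (12) and (13) (our function names; β ≠ 1 assumed) confirms the limiting relationships stated above, with β = 2 recovering the spherical and quadratic rules up to positive affine transformations:

```python
import math

# Sketch of the pseudospherical and power families, Equations (12)-(13).
def e_r(r, beta):
    # E_r(r^(beta-1)) = sum_i r_i * r_i^(beta-1)
    return sum(ri * ri ** (beta - 1) for ri in r)

def pseudospherical(r, i, beta):
    norm = e_r(r, beta) ** (1 / beta)
    return ((r[i] / norm) ** (beta - 1) - 1) / (beta - 1)

def power(r, i, beta):
    return (r[i] ** (beta - 1) - 1) / (beta - 1) - (e_r(r, beta) - 1) / beta

r = [0.5, 0.3, 0.2]
# beta = 2: spherical and quadratic rules up to a positive affine transform
spherical = r[0] / math.sqrt(sum(x ** 2 for x in r))
assert abs(pseudospherical(r, 0, 2.0) - (spherical - 1)) < 1e-12
quadratic = 2 * r[0] - sum(x ** 2 for x in r)
assert abs(power(r, 0, 2.0) - (quadratic - 1) / 2) < 1e-12
# beta -> 1: both families approach the logarithmic rule log(r_i)
assert abs(pseudospherical(r, 0, 1.0000001) - math.log(r[0])) < 1e-4
assert abs(power(r, 0, 1.0000001) - math.log(r[0])) < 1e-4
```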
In summary, the most important characteristic of scoring rules in an ex ante sense
related to probability assessment is that they
should be strictly proper, and there are many
such rules from which to choose. If a rule is
indeed strictly proper, then it should provide incentives for an expert to honestly
report probabilities and to invest effort in an
attempt to make those probabilities sharper.
Such effort might be directed toward such
things as gathering more data, using more
powerful methods to analyze the data, and
learning more about the processes affecting
the events in question or about forecasts provided by others.
EX-POST EVALUATION WITH STRICTLY
PROPER SCORING RULES
We shift now from the ex ante perspective
of the previous section to consider the ex
post evaluation of probability forecasts. For
a given situation, we assume that an expert
has probabilities that are based on the information available and consistent with the
expert’s best judgment. The ex ante viewpoint involves rules that provide incentives
for the expert to attempt to come up with
‘‘good’’ probabilities and to report those probabilities honestly. Many of the characteristics discussed in the preceding section have
counterparts when we consider their use for
evaluation purposes.
As is the case with statistical analysis of
data in general, a single observation is not
very informative. To have a reliable evaluation (when comparing experts or models
in terms of their probabilities, for example),
we would like to have a large number of
observations. Thus, instead of considering a
given situation with a single probability or
set of probabilities, we have a set of data
consisting of many situations with different
probabilities.
Suppose that we have a sample of probability forecasts and the corresponding observations for the occurrence of an event. For
example, this might consist of probabilities
of default for loans (generated by a model or
assessed by a bank officer) or probabilities
of rain for different days at a given location.
First, we can look at all of the occasions in
the data set for which a particular value
of the reported probability r (say, 0.30) was
used, and determine the relative frequency
of occurrence of the event of interest on those
occasions. Denote this relative frequency
by fr. With the quadratic scoring rule, the average score on all of the occasions with this value of r is S(r, fr) = fr[1 − 2(1 − r)²] + (1 − fr)(1 − 2r²). Ex ante, the expected score from the perspective of the expert is a function of r and p, with p being known only to the expert. Ex post, the average score is a function of r and fr, where we are able to observe fr:

S(r, fr) = 1 − 2(fr − r)² − 2fr(1 − fr).  (14)
Note that this is simply Equation (11) with
fr used in place of p.
The second term on the right-hand side
of Equation (14) is a measure of calibration,
which involves the correspondence between
the reported probability and the relative
frequency of occurrence of the event when
that probability is used. If fr = r, then the
reported probabilities of r are perfectly
calibrated. The more the relative frequency
deviates from r, the worse the calibration
is. The last term on the right-hand side of
Equation (14) is a measure of sharpness,
which is better as fr → 0 or fr → 1. Poorer
calibration and less sharpness lead to lower
average scores.
In data with reported probabilities and
outcomes, different probabilities will be used
on different occasions. The overall average
score S for an expert is found by aggregating
the average scores for the different values of
r. If we let nr represent the number of times a reported probability of r is used in the data set and let n = Σr nr represent the overall sample size, the overall average score can be expressed as

S = Σr (nr/n)S(r, fr) = 1 − 2 Σr (nr/n)(fr − r)² − 2 Σr (nr/n)fr(1 − fr).  (15)
The first summation on the right-hand side
of Equation (15) is an overall measure of
calibration for the data set, and the second
summation is an overall measure of sharpness. This decomposition into calibration and
sharpness components can be generalized
beyond the quadratic rule to any strictly
proper scoring rule [10].
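The decomposition in Equation (15) is an exact identity, as the following sketch verifies on a small invented data set of (reported probability, outcome) pairs:

```python
from collections import defaultdict

# Invented illustrative data: (reported probability r, outcome e) pairs.
data = [(0.3, 0), (0.3, 1), (0.3, 0), (0.3, 0),
        (0.7, 1), (0.7, 1), (0.7, 0), (0.7, 1),
        (0.9, 1), (0.9, 1)]

groups = defaultdict(list)
for r, e in data:
    groups[r].append(e)

n = len(data)
calibration = 0.0  # sum over r of (n_r/n)(f_r - r)^2
sharpness = 0.0    # sum over r of (n_r/n) f_r (1 - f_r)
for r, es in groups.items():
    f = sum(es) / len(es)   # relative frequency f_r for this value of r
    w = len(es) / n         # weight n_r / n
    calibration += w * (f - r) ** 2
    sharpness += w * f * (1 - f)

overall = 1 - 2 * calibration - 2 * sharpness

# Equation (15) matches the directly averaged quadratic scores 1 - 2(e - r)^2.
direct = sum(1 - 2 * (e - r) ** 2 for r, e in data) / n
assert abs(overall - direct) < 1e-12
```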
One convenient way to think about
calibration and sharpness is to think of the
probability assessment process for a single
event as a two-step process. First, the expert
puts forecast situations in equivalence
classes, or ‘‘boxes,’’ such that the expert feels
that the events in a given box have roughly
the same probability of occurrence. Second,
the expert assigns numbers (probability
values) to the boxes. Calibration is then an
evaluation of how well the expert assigns the
numbers. Sharpness, on the other hand, is
unrelated to the probability values. Instead,
it measures how effective the expert is in creating boxes for which the relative frequency
of occurrence of the events is close to 0 or
1. This is unlikely to be the way an expert
really thinks about the forecasting process,
but it is a convenient way to emphasize
key differences between calibration and
sharpness. For one thing, we can always
attempt to correct for miscalibration. If an
expert always gives probabilities that are
too high, for example, a decision maker
using those probabilities can reduce reported
probabilities from that expert. Correcting for
poor sharpness is a much trickier business.
We have illustrated the notion of decomposition of an average score using the
quadratic scoring rule, but the same idea
can be applied to other rules. Also, an ex post
average score or an ex ante expected score
can be decomposed in different ways. The
decomposition into terms measuring calibration and sharpness is the most frequently
used decomposition, and it is arguably the
most important decomposition. Gneiting
and Raftery [8] comment that ‘‘the goal of
probabilistic forecasting is to maximize the
sharpness of the (probabilities) subject to
calibration.’’ Although that seems reasonable, we feel that in reported evaluations,
too much emphasis is typically given to
calibration and not enough to sharpness.
Ex post evaluation with strictly proper
scoring rules involves most of the same
ideas encountered in the role of scoring
rules in terms of ex ante incentives. Ex ante
incentives for honesty translate into ex post
evaluations of calibration, and ex ante measures of sharpness based on the probabilities
translate into ex post measures based on
the relative frequencies of occurrence of
the events of interest for given probability
values (given boxes). The use of scoring
rules and their decompositions to evaluate
probabilities ex post can be thought of as
exploratory data analysis. Such evaluations
can be used to compare experts or models.
In the case of a single expert or model, they
can be used to learn more about that expert’s
or model’s characteristics and abilities as
a probability forecaster. Feedback can also
help the expert understand his or her own
characteristics as a probability forecaster
and attempt to improve them in the future.
SCORING RULES WITH BASELINE
DISTRIBUTIONS
Scoring rules such as the quadratic, logarithmic, and spherical rules given earlier can be
thought of as providing an ‘‘absolute’’ evaluation of probabilities. Often, we would like to
have a relative evaluation by comparing how
good a probability or probability distribution
is, relative to some baseline. When assessing
probabilities of rain, for example, it is easier
to get a high score in a location where rain
seldom occurs than it is in a location where it
rains reasonably often. Does that mean that
the probability forecasts in the drier area are
better?
As noted earlier, the most commonly used
scoring rules are symmetric in the sense that
any permutation of the labels on the events
and their associated probabilities does not
change the expected score. One implication
of this symmetry, when combined with the
convexity of the expected score function, is
that the expected score is minimized for a
uniform distribution that gives a probability
of 1/k to each event in the k-event case.
Thus, these scores are implicitly being
evaluated relative to a uniform baseline
distribution.
To avoid comparison with a uniform distribution, we could consider the percentage
improvement in average scores over the
scores for a baseline distribution. In assessing a probability of rain, for example, we
might use climatology, which is the long-term
relative frequency of rain in a given location
at a specific time of year, as a baseline.
However, a percentage improvement in the
score over the baseline, which is called a skill
score, is not strictly proper. For a strictly
proper rule with a baseline distribution that
is not uniform, we can choose a desired convex expected score function and generate a
strictly proper rule that yields that expected
score function [11]. For example, we might
choose a function that is minimized at what
might be viewed as a ‘‘least skillful’’ forecast.
In forecasting rain, climatology might be considered ‘‘least skillful’’ among forecasts that
seem reasonable, since it just involves looking
up some past data and does not require any
weather-forecasting expertise. In contrast,
although a uniform distribution requires no
expertise, it may not seem at all reasonable,
as in the case of a dry location with a very
low climatological relative frequency of rain.
Asymmetric rules can be generated from
symmetric rules. For example, in the single-event case for which it is felt that the expected
score with honest reporting should be minimized at a probability of q, we can take
any symmetric strictly proper scoring rule S
and create a new rule S∗ (r, e|q) = [S(r, e) −
S(q, e)]/T(q), where T(q) = S(1, 1) − S(q, 1) if
r ≥ q and S(0, 0) − S(q, 0) if r ≤ q.
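The construction of S∗(r, e|q) can be sketched with the quadratic rule as the base S; the probe values of p and q below are arbitrary:

```python
# Sketch of the asymmetric rule S*(r, e | q) built from the quadratic base
# rule, following the construction in the text.
def quadratic(r, e):
    return 1 - 2 * (e - r) ** 2

def s_star(r, e, q):
    # T(q) depends on which side of q the report r falls
    if r >= q:
        t = quadratic(1, 1) - quadratic(q, 1)
    else:
        t = quadratic(0, 0) - quadratic(q, 0)
    return (quadratic(r, e) - quadratic(q, e)) / t

def expected(r, p, q):
    return p * s_star(r, 1, q) + (1 - p) * s_star(r, 0, q)

q = 0.3                                 # illustrative baseline probability
grid = [i / 100 for i in range(101)]
for p in (0.1, 0.3, 0.6, 0.9):
    best = max(grid, key=lambda r: expected(r, p, q))
    assert abs(best - p) < 1e-9         # the scaled rule stays strictly proper
assert abs(expected(q, q, q)) < 1e-12   # expected score 0 at the baseline q
```

Numerically, the expected score under honest reporting is 0 at p = q and positive elsewhere, so the rule's minimum sits at the chosen baseline rather than at 0.5.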
More generally, the families of scoring
rules given by Equations (12) and (13)
can be generalized to pseudospherical and
power families of strictly proper scoring
rules that allow for the incorporation of
baseline distributions [9].

Figure 2. (a) Score functions S(r, e1|q) and S(r, e2|q) and (b) expected score S(p, p|q) under honest reporting for the power scoring rule with β = 2 and q = (0.2, 0.8).

If the baseline
distribution for a set of k mutually exclusive
and collectively exhaustive events is denoted
by q = (q1 , . . . , qk ), then we can define
the pseudospherical and power families of
scoring rules with baselines as follows:
SSβ(r, ei|q) = (1/(β − 1)) { [(ri/qi)/Er[(r/q)^{β−1}]^{1/β}]^{β−1} − 1 }  (16)

and

SPβ(r, ei|q) = [(ri/qi)^{β−1} − 1]/(β − 1) − (Er[(r/q)^{β−1}] − 1)/β,  (17)

where Er[(r/q)^{β−1}] = Σ_{i=1}^k ri(ri/qi)^{β−1} and −∞ < β < ∞. These scoring rules are scaled so that they yield scores of 0 when r = q. Thus, a positive score represents improvement over the baseline and a negative score indicates a forecast worse than the baseline. (The expert's expected score with honest reporting is positive except at r = q, where it is 0.) As with Equations (12) and (13), β = 2 corresponds to spherical and quadratic rules, respectively, in Equations (16) and (17), and both families converge to a logarithmic rule when β → 1. Figure 2 shows SSβ(r, e1|q), SSβ(r, e2|q), and S(p, p|q) for the power scoring rule with β = 2.

The consideration of baseline distributions provides a relative evaluation as opposed to an absolute evaluation, and relative evaluations are often of great interest. In addition, evaluations with baseline distributions can be useful in evaluating probabilities (and evaluating the forecasters providing the probabilities) that are made under different circumstances. For example, if one weather forecaster assesses probabilities of rain in a very dry climate (say, with a climatology of 0.05) and another forecaster assesses probabilities in a more moist climate (climatology 0.40), then it is much easier for the first forecaster to obtain higher scores in an absolute evaluation. This is because the first forecaster is able, on average, to make sharper forecasts. If we use a relative evaluation with climatology as the baseline, then we are comparing the two forecasters in terms of how effective they are at improving upon a forecast based solely on climatology, thereby adjusting for the differences in the forecast situations. While not perfect, this will tend to even the playing field somewhat and make for fairer comparisons.

SCORING RULES THAT ARE SENSITIVE TO DISTANCE
In some situations, the events of interest are
ordered. For example, in a soccer match, a
team can win, lose, or tie. A win is better
than a tie, which in turn is better than a loss,
so there is an ordering. If we are giving probabilities for x̃, the amount of rain in inches
on a given day, we might assess probabilities
for x̃ = 0, 0 < x̃ ≤ 0.5, 0.5 < x̃ ≤ 1, and x̃ > 1.
These four events are ordered. The scoring
rules for multiple events, discussed in the
preceding sections, ignore any ordering. Suppose that two experts report probabilities of
(0.3,0.4,0.2,0.1) and (0.3,0.1,0.2,0.4). If there
is no rain, then the experts will receive the
same score. They both gave a probability of
0.3 for the event that occurred, and they
gave probabilities of 0.4, 0.2, and 0.1 for the
other three events. The fact that the latter
three probabilities were in different orders
does not affect the score because the scoring rule does not take ordering of the events
into account. Some might argue that the first
expert gave more probability on the event
closest to the event that occurred and should
therefore receive a higher score. Scoring rules
have been developed that would take ordering into account in this way, and we say that
such rules are sensitive to distance. Informally, this means that for the probability not
assigned to the event that occurs, a higher
score will result if more probability is given
to events closer to the event that occurs and
less probability to ‘‘more distant’’ events.
The first strictly proper sensitive-to-distance scoring rule was a quadratic rule called the ranked probability score [12]:

S(r, ei) = −Σ_{j=1}^{i−1} Rj² − Σ_{j=i}^{k−1} (1 − Rj)²,

where Rj = Σ_{l=1}^j rl is a cumulative probability. By
connecting the score to cumulative probabilities, the rule is able to take sensitivity to
distance into consideration. As probability
‘‘moves’’ from events more distant from the
event that occurs to events closer to the event
that occurs, the cumulative probabilities
change accordingly and result in an increase
in the score.
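Applying the ranked probability score to the two forecasts (0.3, 0.4, 0.2, 0.1) and (0.3, 0.1, 0.2, 0.4) discussed earlier illustrates sensitivity to distance (a sketch using 0-based event indices):

```python
# Ranked probability score sketch [12]; i is the 0-based index of the event
# that occurred.
def rps(r, i):
    k = len(r)
    big_r = [sum(r[: j + 1]) for j in range(k - 1)]   # cumulative R_1..R_{k-1}
    return (-sum(big_r[j] ** 2 for j in range(i))
            - sum((1 - big_r[j]) ** 2 for j in range(i, k - 1)))

expert_a = [0.3, 0.4, 0.2, 0.1]
expert_b = [0.3, 0.1, 0.2, 0.4]
# If the first event (no rain) occurs, expert A scores higher because more of
# A's remaining probability sits on events close to the one that occurred.
assert rps(expert_a, 0) > rps(expert_b, 0)
```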
The same idea that is used to generate the
ranked probability score from the quadratic
score can be used to obtain a sensitive-to-distance scoring rule S∗ based on any strictly
proper rule S for a single event and its
complement:
S∗(r, ei) = Σ_{j=1}^{i−1} S(Rj, 0) + Σ_{j=i}^{k−1} S(Rj, 1).  (18)

The corresponding expected score is S∗(r, p) = Σ_{i=1}^{k−1} [Pi S(Ri, 1) + (1 − Pi)S(Ri, 0)], using a vector P of cumulative probabilities Pj = Σ_{l=1}^j pl based on p. Note that Equation
(18) can be used to generate new pseudospherical and power families of strictly
proper scoring rules, with or without baseline
distributions [13]. If baseline distributions
are used, they will be expressed in cumulative form, with a vector Q of cumulative probabilities Qj = Σ_{l=1}^j ql representing the cumulative baseline distribution.
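As a check on Equation (18), taking the base rule S to be the binary quadratic rule yields a positive affine transformation of the ranked probability score; expanding the quadratic terms gives S∗ = (k − 1) + 2 × RPS. A sketch (0-based indices, our names):

```python
# Equation (18) sketch: a sensitive-to-distance rule S* built from a strictly
# proper binary rule; here the base is the binary quadratic rule.
def quad(r, e):
    return 1 - 2 * (e - r) ** 2

def s_star(base, r, i):
    # i is the 0-based index of the event that occurred
    k = len(r)
    big_r = [sum(r[: j + 1]) for j in range(k - 1)]
    return (sum(base(big_r[j], 0) for j in range(i))
            + sum(base(big_r[j], 1) for j in range(i, k - 1)))

def rps(r, i):
    k = len(r)
    big_r = [sum(r[: j + 1]) for j in range(k - 1)]
    return (-sum(big_r[j] ** 2 for j in range(i))
            - sum((1 - big_r[j]) ** 2 for j in range(i, k - 1)))

# With the quadratic base, Equation (18) reproduces the ranked probability
# score up to the positive affine transformation (k - 1) + 2 * RPS.
r = [0.3, 0.4, 0.2, 0.1]
for i in range(4):
    assert abs(s_star(quad, r, i) - (len(r) - 1 + 2 * rps(r, i))) < 1e-9
```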
It is important to mention that properties
of the scoring rule S in Equation (18), other
than the fact that it is strictly proper, are
not necessarily inherited by S∗ . For example,
if S is logarithmic, S∗ will not inherit the
property of locality mentioned earlier. The
score is based on the cumulative probabilities
R1 , . . . , Rk−1 , so it clearly depends on more
than just ri . Also, if S∗ is determined from
Equation (18) using a symmetric, strictly
proper S without a baseline distribution,
then the expected score S∗ (p,p) is minimized
at p = (0.5, 0, . . . , 0, 0.5). That is, if a baseline
distribution is not chosen, the default baseline distribution is (0.5, 0, . . . , 0, 0.5), not
(1/k, . . . , 1/k), and the score for a uniform
distribution will not be the same for all
events because the ordering of the events is
relevant and some events are more distant
from others. Note that the default baseline
distribution of (0.5, 0, . . . , 0, 0.5) for S∗ translates to (0.5, . . . , 0.5) when expressed in terms
of the cumulative probabilities (R1 , . . . , Rk−1 ).
Since the relevant probabilities are cumulative in a score that is sensitive to distance,
the baseline distribution is uniform in the
cumulative probabilities. Furthermore, this
distribution will give the same score S∗
regardless of which event occurs, because
Rj = 0.5 in each of the S(Rj , 0) and S(Rj , 1)
terms in Equation (18).
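The claim that the default baseline (0.5, 0, . . . , 0, 0.5) receives the same score S∗ whichever event occurs can be checked directly, since every cumulative probability Rj equals 0.5 (a sketch with a logarithmic base rule):

```python
import math

# Sketch: under Equation (18), the distribution (0.5, 0, ..., 0, 0.5) yields
# identical scores for every outcome because each cumulative R_j is 0.5.
def log_rule(r, e):
    return math.log(r * e + (1 - r) * (1 - e))

def s_star(base, r, i):
    k = len(r)
    big_r = [sum(r[: j + 1]) for j in range(k - 1)]
    return (sum(base(big_r[j], 0) for j in range(i))
            + sum(base(big_r[j], 1) for j in range(i, k - 1)))

r = [0.5, 0.0, 0.0, 0.5]          # default baseline distribution, k = 4
scores = [s_star(log_rule, r, i) for i in range(4)]
assert all(abs(s - scores[0]) < 1e-12 for s in scores)
assert abs(scores[0] - 3 * math.log(0.5)) < 1e-12   # (k - 1) log(0.5)
```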
Commonly encountered scoring rules
ignore any ordering of the events. When
the events of interest are ordered, however,
the ordering may be important in terms
of the underlying real-world situation. For
instance, with forecasts of returns on an
investment, high probabilities for values that
are not identical to the returns that actually
occur but are close to those returns would
seem to be more valuable for investment
purposes than high probabilities for values
that are quite distant from the actual
returns. In such a setting, a scoring rule
that is sensitive to distance might provide
incentives and an ex post evaluation that are
more consistent with the decision making
problem.
SUMMARY AND DISCUSSION
Probability forecasts are important inputs
to quantify uncertainty in inferential and
decision making problems. It is therefore
important to have appropriate incentives for
careful formulation of probability forecasts
and to have measures to evaluate the forecasts once the uncertainty is resolved and we
see what actually happens. Those are exactly
the roles that are played by scoring rules. In
particular, strictly proper scoring rules provide incentives for making good forecasts (i.e.,
sharp forecasts) and reporting them honestly.
In terms of ex post evaluation, the incentive
for honest reporting ex ante translates into
measures of the calibration of the forecasts,
and given good calibration, sharper forecasts
will earn higher scores on average. A few
scoring rules tend to be used most often, but
rich families of strictly proper scoring rules
have been developed. Beyond the basic rules,
there are options that add more flexibility
while maintaining the strictly proper nature
of the rules. Some rules allow the evaluation
of probabilities relative to a chosen baseline
distribution. Among other things, this makes
scores for probabilities made in different situations more comparable. Other rules take
into account any ordering of the events and
are sensitive to distance in the sense of giving higher scores to probability distributions
assigning higher probabilities to events near
the event that occurs, all other things being
equal. This feature is relevant when being
‘‘close’’ with the probability forecast can lead
to better decisions or inferences.
How might a user choose a scoring rule
in a given situation? Among the basic rules,
some have different properties from others
and different rules can lead to different scores
and different rankings of experts [14]. Thus,
the choice of a rule might depend on how
one feels about those properties and about
the situation at hand. For example, with
probabilistic answers for a multiple-choice
test, the locality property of the logarithmic
rule might have strong appeal. At the same
time, the possibility of a score of negative
infinity with the logarithmic score is a potential concern; some claim that the logarithmic
rule has undesirable properties [15], and in
certain settings, locality becomes less important. For example, in a two-event setting, all
rules satisfy locality, and in a setting where
sensitivity to distance is considered important, a sensitive-to-distance logarithmic rule
no longer has the locality property. There is
no general agreement on a single ‘‘best’’ rule
for all situations.
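The locality property mentioned above can be made concrete: the logarithmic score depends only on the probability assigned to the realized event, while the quadratic and spherical scores also depend on how the remaining probability is spread. A small sketch with two hypothetical forecasts that agree on the realized event:

```python
import numpy as np

def log_score(r, k):
    """Logarithmic score: depends only on the probability of event k."""
    return np.log(r[k])

def quad_score(r, k):
    """Quadratic score: 2*r[k] - sum(r**2)."""
    return 2 * r[k] - np.sum(np.asarray(r, dtype=float) ** 2)

def sph_score(r, k):
    """Spherical score: r[k] / ||r||."""
    r = np.asarray(r, dtype=float)
    return r[k] / np.sqrt(np.sum(r ** 2))

# Both forecasts give probability 0.5 to the realized event (k = 0),
# but spread the remaining 0.5 differently.
r1 = [0.5, 0.5, 0.0]
r2 = [0.5, 0.25, 0.25]
k = 0

assert log_score(r1, k) == log_score(r2, k)   # local: identical scores
assert quad_score(r1, k) != quad_score(r2, k) # non-local: scores differ
assert sph_score(r1, k) != sph_score(r2, k)
```

This also illustrates why different rules can rank the same forecasters differently: the quadratic and spherical rules reward r2 over r1 here, while the logarithmic rule is indifferent between them.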
The use of a scoring rule that involves the
choice of a baseline distribution depends on
whether a relative evaluation is desired and
whether there is a specific baseline distribution against which to compare probabilities.
An important thing to keep in mind is that
using the basic rules without choosing a baseline distribution means that probabilities are
being evaluated relative to the default distribution, which is a uniform distribution.
Another choice when events are ordered is
whether to use a rule that is sensitive to distance, and this choice is related to whether
giving higher probabilities to events close to
the event that occurs is viewed as important
for the situation at hand.
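The role of the baseline can be sketched with the logarithmic rule, where the score relative to a baseline q takes the simple form log(p_k / q_k). The distributions below are hypothetical; the check confirms the point made above, that the basic rule amounts to evaluation against a uniform default (up to the additive constant log n).

```python
import numpy as np

def rel_log_score(p, q, k):
    """Logarithmic score of forecast p relative to baseline q
    when event k occurs: log(p[k] / q[k])."""
    return np.log(p[k] / q[k])

p = [0.7, 0.2, 0.1]               # hypothetical forecast
uniform = [1/3, 1/3, 1/3]         # the default baseline
climatology = [0.5, 0.3, 0.2]     # hypothetical non-uniform baseline

# Relative to the uniform baseline, the score is log(p[k]) + log(n):
# a fixed shift of the plain logarithmic score.
k, n = 0, 3
assert abs(rel_log_score(p, uniform, k) - (np.log(p[k]) + np.log(n))) < 1e-12

# Against a sharper baseline the same forecast earns less credit.
score_vs_clim = rel_log_score(p, climatology, k)   # log(0.7 / 0.5)
assert score_vs_clim < rel_log_score(p, uniform, k)
```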
What about practical issues in using the
rules? The use of the rules ex post to evaluate
probabilities is straightforward: one simply computes the score for each forecast using the formula of whichever scoring rule is chosen.
Scores can then be used as feedback to enable
experts and modelers to see their performance and perhaps learn from it. The use of
scoring rules ex ante means that they should
be part of a general probability assessment
process that might include some training
regarding probability if necessary. For many
experts, the connection of the scoring rule
formulas with the incentives is too opaque
to make it valuable to dwell on the formulas.
Discussing the incentives in an intuitive fashion is generally more effective. One option
for relatively simple cases (e.g., a probability
for a single event) is to present the possible
scores in graphical or tabular form.
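For a single-event probability, the tabular presentation suggested above might look like the following sketch. It uses one common scaling of the binary quadratic rule (the scaling itself is an assumption; other affine transformations are equally valid).

```python
def quad_binary(r, occurred):
    """Quadratic score for a single-event probability r, scaled to [0, 1]:
    1 - (1 - r)^2 if the event occurs, 1 - r^2 if it does not."""
    return 1 - (1 - r) ** 2 if occurred else 1 - r ** 2

# A simple score table an expert could be shown before assessing r.
print(f"{'r':>5}  {'event occurs':>12}  {'does not occur':>14}")
for r in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"{r:5.2f}  {quad_binary(r, True):12.4f}  {quad_binary(r, False):14.4f}")
```

A table like this conveys the incentive structure (e.g., reporting 1.0 when unsure risks a score of 0) without dwelling on the formula itself.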
The incentive to maximize expected score
with strictly proper scoring rules is probably
reasonable in most cases. In the context of
thinking of the expert as wanting to maximize expected utility, it implies that the
utility function is linear in the score (or linear
in money if the score is translated into a monetary reward). If the expert’s utility function
U(S) for the score is known, a modification
of the score to S′ = U⁻¹(S) with a strictly
proper S will adjust for U, since U(S′) =
U[U⁻¹(S)] = S, and will thereby encourage
honest reporting. A practical problem with
this is that we are not likely to know U,
and eliciting it from the expert is not easy. In
many cases, we feel that the importance of the
score is probably not large enough to cause
major violations of such linear utility due
to risk aversion or risk taking, for example.
However, if there are other stakes related to
the probability forecasts, those stakes might
cause significant shifting of the probabilities
away from honest reporting. For example, if
the situation is viewed as a contest against
other experts (rightly or wrongly), the consideration of strategic play might lead the expert
to give more extreme probabilities (i.e., probabilities closer to and often equal to 0 or 1)
than justified by the expert’s best judgments,
in order to try to win the contest [16]. In
most cases, however, we would expect that
experts will try to come up with the best set
of probabilities given the information that is
available to them, and will not think strategically. In any event, strictly proper scoring
rules can always be used for ex post evaluation purposes, and any hedging of reported
probabilities can be expected to lead to lower
average scores.
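The utility adjustment S′ = U⁻¹(S) can be illustrated numerically. The sketch below assumes a hypothetical risk-averse expert with known utility U(s) = √s over scores in [0, 1]: facing the raw quadratic score the expert's expected utility is maximized by a hedged report, while the adjusted score restores honesty.

```python
import numpy as np

def quad_binary(r, occurred):
    """Strictly proper quadratic score for one event, scaled to [0, 1]."""
    return 1 - (1 - r) ** 2 if occurred else 1 - r ** 2

U = np.sqrt       # hypothetical risk-averse utility, assumed known
U_inv = np.square # its inverse; the adjusted score is S' = U_inv(S)

def best_report(payoff, p, grid):
    """Report on `grid` maximizing expected payoff under true belief p."""
    vals = [p * payoff(r, True) + (1 - p) * payoff(r, False) for r in grid]
    return grid[int(np.argmax(vals))]

p = 0.7
grid = np.round(np.linspace(0.01, 0.99, 99), 2)

# Raw score: the expert maximizes E[U(S)] and hedges below p.
raw = best_report(lambda r, o: U(quad_binary(r, o)), p, grid)
# Adjusted score: E[U(S')] = E[S], so honesty is optimal again.
adj = best_report(lambda r, o: U(U_inv(quad_binary(r, o))), p, grid)

assert raw < p          # risk aversion distorts the raw report
assert abs(adj - p) < 1e-6   # the adjusted rule recovers honesty
```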
In closing, we note that work on scoring rules has interesting connections to other
fields. Scoring rules are closely connected to
decision theory/decision analysis. A decision
maker may hire an expert to report probabilities for events related to the decision
and might like to tailor a scoring rule to
the decision-making problem, in the spirit
of Savage’s ‘‘share of the business’’ notion
[6]. The expert’s reported probabilities can
be viewed as new information by the decision maker, and connections between scores
and the value of that information are of
interest. These notions are related to the literature on incentives and mechanism design
in economics and especially to agency theory. On a different tack, expected scores from
strictly proper scoring rules are related to
information measures from signal processing and information theory [9]. For example,
the expected score for honest reporting under
a logarithmic scoring rule is the negative
Shannon entropy of the expert’s probability
distribution p and is the Kullback–Leibler
divergence of p with respect to q if the baseline distribution q is chosen. Finally, extensive experimental work by psychologists has
investigated the degree to which individuals’
probability assessments are well calibrated
and has led to various theories of the calibration of subjective probabilities [17]. Judgments about others’ probabilities are important in competitive situations, and economics
and psychology are both relevant for this
issue. Given the importance of probability
forecasts in decision modeling and statistics
as well as the connections with these different (and somewhat disparate) fields, we
expect the interest in scoring rules and the
application of such rules in practice to grow.
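The information-theoretic connections noted above can be verified directly: under honest reporting, the expected logarithmic score equals the negative Shannon entropy of p, and the expected baseline-relative score equals the Kullback–Leibler divergence of p from q. A sketch with hypothetical distributions:

```python
import numpy as np

def log_score(r, k, q=None):
    """Logarithmic score, optionally relative to a baseline q."""
    return np.log(r[k]) if q is None else np.log(r[k] / q[k])

p = np.array([0.5, 0.3, 0.2])     # hypothetical honest distribution
q = np.array([0.25, 0.25, 0.5])   # hypothetical baseline

# Expected scores under honest reporting of p.
exp_plain = sum(p[k] * log_score(p, k) for k in range(len(p)))
exp_rel   = sum(p[k] * log_score(p, k, q) for k in range(len(p)))

shannon_entropy = -np.sum(p * np.log(p))
kl_divergence   = np.sum(p * np.log(p / q))

assert abs(exp_plain - (-shannon_entropy)) < 1e-12
assert abs(exp_rel - kl_divergence) < 1e-12
```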
REFERENCES
1. Brier GW. Verification of forecasts expressed
in terms of probability. Mon Weather Rev
1950;78(1):1–3.
2. Good IJ. Rational decisions. J R Stat Soc [Ser
B] 1952;14(1):107–114.
3. McCarthy J. Measures of the value of information. Proc Natl Acad Sci USA 1956;42(9):
654–655.
4. de Finetti B. Does it make sense to speak
of ‘‘good probability appraisers’’? In: Good IJ,
editor. The scientist speculates: an anthology
of partly-baked ideas. New York: Wiley; 1962.
pp. 357–363.
5. Winkler RL, Murphy AH. ‘‘Good’’ probability assessors. J Appl Meteorol 1968;7(5):
751–758.
6. Savage LJ. Elicitation of personal probabilities and expectations. J Am Stat Assoc
1971;66(336):783–801.
7. Winkler RL. Scoring rules and the evaluation
of probabilities. Test 1996;5(1):1–60.
8. Gneiting T, Raftery A. Strictly proper scoring
rules, prediction, and estimation. J Am Stat
Assoc 2007;102(477):359–378.
9. Jose VRR, Nau RF, Winkler RL. Scoring rules,
generalized entropy, and utility maximization. Oper Res 2008;56(5):1146–1157.
10. DeGroot MH, Fienberg SE. Assessing probability assessors: calibration and refinement.
In: Gupta SS, Berger JO, editors. Statistical
decision theory and related topics. New York:
Academic Press; 1982. pp. 291–314.
11. Winkler RL. Evaluating probabilities: asymmetric scoring rules. Manage Sci 1994;40(11):
1395–1405.
12. Epstein ES. A scoring system for probability
forecasts of ranked categories. J Appl Meteorol
1969;8(6):985–987.
13. Jose VRR, Nau RF, Winkler RL. Sensitivity to
distance and baseline distributions in forecast
evaluation. Manage Sci 2009;55(4):582–590.
14. Bickel JE. Some comparisons among
quadratic, spherical, and logarithmic rules.
Decis Anal 2007;4(2):49–65.
15. Selten R. Axiomatic characterization of the
quadratic scoring rule. Exp Econ 1998;1(1):
43–62.
16. Lichtendahl KC, Winkler RL. Probability
elicitation, scoring rules, and competition
among forecasters. Manage Sci 2007;53(11):
1745–1756.
17. O’Hagan A, Buck CE, Daneshkhah A, et al.
Uncertain judgements: eliciting experts’ probabilities. Chichester: Wiley; 2006.