Nau - Scoring Rules and Generalized Entropy

Scoring Rules, Generalized Entropy,
and Utility Maximization
Robert Nau
Fuqua School of Business
Duke University
(with Victor Jose and Robert Winkler)
Presentation for IEOR Seminar
Berkeley, October 29, 2006
Overview
• Scoring rules are reward functions for defining
subjective probabilities and eliciting them in
forecasting applications and experimental
economics (de Finetti, Brier, Savage, Selten...)
• Cross-entropy, or divergence, is a physical
measure of information gain in communication
theory and machine learning (Shannon, Kullback-Leibler...)
• Utility maximization is the decision maker’s
objective in Bayesian decision theory and game
theory (von Neumann & Morgenstern, Savage...)
General connections
• Any decision problem under uncertainty may
be used to define a scoring rule or measure of
divergence between probability distributions.
• The expected score or divergence is merely
the expected-utility gain that results from
solving the problem using the decision maker’s
“true” probability distribution p rather than
some other “baseline” distribution q.
Specific results
• We explore the connections among the best-known
parametric families of generalized scoring rules,
divergence measures, and utility functions.
• The expected scores obtained by truthful probability
assessors turn out to correspond exactly to well-known
generalized divergences.
• They also correspond exactly to expected-utility
gains in financial investment problems with utility
functions drawn from the linear-risk-tolerance (a.k.a.
HARA) family.
• These results generalize to incomplete markets via
a primal-dual pair of convex programs.
Part 1: Scoring rules
• Consider a probability forecast for a discrete event
with n possible outcomes (“states of the world”).
• Let $e_i = (0, \ldots, 1, \ldots, 0)$ denote the indicator vector for
the ith state (where 1 appears in the ith position).
• Let $p = (p_1, \ldots, p_n)$ denote the forecaster’s true
subjective probability distribution over states.
• Let $r = (r_1, \ldots, r_n)$ denote the forecaster’s reported
distribution (if different from p).
• Let $q = (q_1, \ldots, q_n)$ denote a baseline (“prior”)
distribution upon which the forecaster seeks to
improve.
Definition of a scoring rule
• A scoring rule is a function $S(r, e_i, q)$ that determines
the forecaster’s score (reward) for giving the
forecast r, relative to the baseline q, when the ith
state is subsequently observed to occur.
• Let
\[ S(r, p, q) = \sum_{i=1}^{n} p_i \, S(r, e_i, q) \]
denote the
forecaster’s expected score for reporting r when her
true distribution is p and the baseline distribution is q.
• Thus, in general, a scoring rule can be expressed as
a function of three vector-valued arguments, which is
linear in the 2nd argument.
Proper scoring rules
• The scoring rule S is [strictly] proper if
$S(p, p, q) \ge [>]\; S(r, p, q)$ for all r [≠ p], i.e., if the
forecaster’s expected score is [uniquely] maximized
when she reports her true probabilities.
• Henceforth let
\[ S(p, q) \equiv S(p, p, q) \]
denote the
forecaster’s expected score for a truthful forecast,
as a function of p and q.
• S is [strictly] proper iff
$S(p, q)$ is [strictly] convex in p.
Proper scoring rules, continued
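• If S is strictly proper, then it is uniquely determined
from $S(p, q)$
by McCarthy’s (1956) formula:
\[ S(r, e_i, q) = S(r, q) + \nabla S(r, q) \cdot (e_i - r) \]
where $\nabla$ denotes the gradient with respect to the first argument.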
• Thus, a strictly proper scoring rule is completely
characterized by its expected-score function.
• Henceforth only strictly proper scoring rules will be
considered, and it will be assumed that r = p.
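As a minimal sketch of McCarthy’s formula, the snippet below recovers a score from its expected-score function. It assumes the quadratic rule’s truthful expected score is sum_j p_j^2 (uniform baseline suppressed) and uses a central-difference gradient; the function names are illustrative, not from the presentation.

```python
def expected_quadratic(p):
    """Truthful expected score under the quadratic rule: sum_j p_j^2."""
    return sum(x * x for x in p)


def numerical_gradient(f, p, h=1e-6):
    """Central-difference gradient of f at p, one coordinate at a time."""
    grad = []
    for k in range(len(p)):
        up, down = list(p), list(p)
        up[k] += h
        down[k] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def mccarthy_score(f, p, i):
    """Score in state i: f(p) + grad f(p) . (e_i - p)."""
    grad = numerical_gradient(f, p)
    return f(p) + grad[i] - sum(g * x for g, x in zip(grad, p))


p = [0.05, 0.25, 0.70]
recovered = [mccarthy_score(expected_quadratic, p, i) for i in range(len(p))]
direct = [2 * p[i] - expected_quadratic(p) for i in range(len(p))]
print(recovered)  # scores rebuilt from the expected-score function
print(direct)     # quadratic scores 2*p_i - sum_j p_j^2 computed directly
```

The two printed vectors should agree up to finite-difference error, and the probability-weighted average of the recovered scores returns the expected-score function itself.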
Standard scoring rules
• The three most commonly used scoring rules all
assume a uniform baseline distribution (q = 1/n),
which will be temporarily suppressed.
• Quadratic scoring rule: $S(p, e_i) = 2p_i - \sum_j p_j^2$
• Spherical scoring rule: $S(p, e_i) = p_i / (\sum_j p_j^2)^{1/2}$
• Logarithmic scoring rule: $S(p, e_i) = \ln p_i$
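A short sketch of strict propriety for the three standard rules, assuming their usual forms (quadratic 2r_i - sum_j r_j^2, spherical r_i / ||r||_2, logarithmic ln r_i). The distributions p and r are made-up illustrative numbers.

```python
import math


def quadratic_score(r, i):
    """Quadratic score when state i occurs."""
    return 2.0 * r[i] - sum(x * x for x in r)


def spherical_score(r, i):
    """Spherical score when state i occurs."""
    return r[i] / math.sqrt(sum(x * x for x in r))


def log_score(r, i):
    """Logarithmic score when state i occurs."""
    return math.log(r[i])


def expected_score(rule, r, p):
    """Expected score of report r when the true distribution is p."""
    return sum(p_i * rule(r, i) for i, p_i in enumerate(p))


p = [0.05, 0.25, 0.70]  # true beliefs (illustrative)
r = [0.10, 0.30, 0.60]  # a shaded, non-truthful report (illustrative)
for rule in (quadratic_score, spherical_score, log_score):
    truthful = expected_score(rule, p, p)
    shaded = expected_score(rule, r, p)
    print(rule.__name__, round(truthful, 4), round(shaded, 4))
```

For each rule the truthful report earns the higher expected score, which is what strict propriety requires.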
History of standard scoring rules
• The quadratic scoring rule was introduced by
de Finetti (1937, 1974) to define subjective
probability; later used by Brier (1950) as a tool
for evaluating and paying weather forecasters;
more recently advocated by Selten (1998) for
paying subjects in economic experiments.
• The spherical and logarithmic rules were
introduced by I.J. Good (1971), who also
noted that the spherical and quadratic rules
could be generalized to positive exponents
other than 2, leading to...
Generalized scoring rules (uniform q)
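• Power scoring rule (→ quadratic at β = 2): $S_\beta(p, e_i) = \beta p_i^{\beta-1} - (\beta-1) \sum_j p_j^\beta$
• Pseudospherical scoring rule (→ spherical at β = 2): $S_\beta(p, e_i) = p_i^{\beta-1} / (\sum_j p_j^\beta)^{(\beta-1)/\beta}$
• Both rules → rescaled logarithmic rule at β = 1.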
Weighted scoring rules (arbitrary q)
• Our first contribution is merely to point out that
the power and pseudospherical rules can be
weighted by an arbitrary baseline distribution q
and scaled so as to be valid for all real β.
• Under the weighted rules, the score is zero in all
states iff p = q, and the expected score is
positive iff p ≠ q.
• Thus, the weighted rules measure the “value
added” of p over q as seen from the forecaster’s
perspective.
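Weighted power scoring rule:
\[ S^P_\beta(p, e_i, q) = \frac{(p_i/q_i)^{\beta-1} - 1}{\beta-1} - \frac{E_p[(p/q)^{\beta-1}] - 1}{\beta} \]
Weighted pseudospherical scoring rule:
\[ S^S_\beta(p, e_i, q) = \frac{1}{\beta-1} \left[ \left( \frac{p_i/q_i}{(E_p[(p/q)^{\beta-1}])^{1/\beta}} \right)^{\beta-1} - 1 \right] \]
where $E_p[(p/q)^{\beta-1}] = \sum_j p_j (p_j/q_j)^{\beta-1}$.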
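The sketch below implements the two weighted rules under the assumption that they take the forms ((p_i/q_i)^(β-1) - 1)/(β-1) - (A - 1)/β for the power rule and ((p_i/q_i)^(β-1) / A^((β-1)/β) - 1)/(β-1) for the pseudospherical rule, with A = sum_j p_j (p_j/q_j)^(β-1). Function names and the numerical inputs are illustrative, and β = 0 and β = 1 are excluded because they are limiting cases.

```python
def ratio_moment(p, q, beta):
    """A = E_p[(p/q)^(beta-1)] = sum_j p_j^beta * q_j^(1-beta)."""
    return sum(pi * (pi / qi) ** (beta - 1) for pi, qi in zip(p, q))


def weighted_power_score(p, q, beta, i):
    """Weighted power score in state i (beta not in {0, 1})."""
    a = ratio_moment(p, q, beta)
    return ((p[i] / q[i]) ** (beta - 1) - 1) / (beta - 1) - (a - 1) / beta


def weighted_pseudospherical_score(p, q, beta, i):
    """Weighted pseudospherical score in state i (beta not in {0, 1})."""
    a = ratio_moment(p, q, beta)
    scaled = (p[i] / q[i]) ** (beta - 1) / a ** ((beta - 1) / beta)
    return (scaled - 1) / (beta - 1)


def expected(score, p, q, beta):
    """Expected score of a truthful forecast p against baseline q."""
    return sum(pi * score(p, q, beta, i) for i, pi in enumerate(p))


p = [0.05, 0.25, 0.70]
q = [0.20, 0.30, 0.50]
betas = [-1.0, 0.5, 2.0, 3.0]
for beta in betas:
    print(beta,
          round(expected(weighted_power_score, p, q, beta), 4),
          round(expected(weighted_pseudospherical_score, p, q, beta), 4))
```

Reporting the baseline itself (p = q) yields a zero score in every state, while any p ≠ q yields a positive expected score, matching the “value added” reading of the weighted rules.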
Properties of weighted scoring rules
• Both rules are strictly proper for all real β.
• Both rules → weighted logarithmic rule $\ln(p_i/q_i)$ at β = 1.
• For the same p, q, and β, the vector of weighted power
scores is an affine transformation of the vector of
weighted pseudospherical scores, since both are
affine functions of $(p_i/q_i)^{\beta-1}$.
• However, the two rules present different incentives for
information-gathering and honest reporting.
• The special cases β = 0 and β = ½ have interesting
properties but have not been previously studied.
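Two of these properties can be checked numerically. The sketch assumes the same functional forms for the weighted rules as in the previous formulas (power: ((p_i/q_i)^(β-1) - 1)/(β-1) - (A - 1)/β; pseudospherical: ((p_i/q_i)^(β-1)/A^((β-1)/β) - 1)/(β-1)); the helper name and the inputs are illustrative.

```python
import math


def weighted_scores(p, q, beta):
    """Return (power scores, pseudospherical scores, scale) for beta not in {0, 1}."""
    ratios = [(pi / qi) ** (beta - 1) for pi, qi in zip(p, q)]
    a = sum(pi * x for pi, x in zip(p, ratios))  # E_p[(p/q)^(beta-1)]
    scale = a ** ((beta - 1) / beta)
    power = [(x - 1) / (beta - 1) - (a - 1) / beta for x in ratios]
    pseudo = [(x / scale - 1) / (beta - 1) for x in ratios]
    return power, pseudo, scale


p = [0.05, 0.25, 0.70]
q = [0.20, 0.30, 0.50]

# Near beta = 1 both rules approach the weighted logarithmic rule ln(p_i/q_i).
power_near_1, pseudo_near_1, _ = weighted_scores(p, q, 1 + 1e-6)
log_scores = [math.log(pi / qi) for pi, qi in zip(p, q)]

# At any fixed beta, power = intercept + slope * pseudospherical (slope = scale).
power_2, pseudo_2, slope = weighted_scores(p, q, 2.0)
intercept = power_2[0] - slope * pseudo_2[0]
residuals = [pw - (intercept + slope * ps) for pw, ps in zip(power_2, pseudo_2)]

print(log_scores, power_near_1, pseudo_near_1)
print(residuals)
```

The slope of the affine map is positive, so the two rules rank the states identically for given p, q, and β even though their incentives differ.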
Special cases of weighted scores
Weighted expected score functions
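• Weighted power expected score:
\[ S^P_\beta(p, q) = \frac{\sum_i p_i (p_i/q_i)^{\beta-1} - 1}{\beta(\beta-1)} \]
• Weighted pseudospherical expected score:
\[ S^S_\beta(p, q) = \frac{\left( \sum_i p_i (p_i/q_i)^{\beta-1} \right)^{1/\beta} - 1}{\beta-1} \]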
Special cases of expected scores
Power
Pseudospherical
Figure 1. Weighted power score vs. beta (uniform q)
[Plot: score (vertical axis, -3.5 to 2) against beta (horizontal axis, -2 to 3) for State 1 (p=0.05), State 2 (p=0.25), and State 3 (p=0.70).]
• Behavior of the weighted power score for n = 3.
• For fixed p and q, the scores diverge as β → ±∞.
• For β << 0 [β >> 2] only the lowest [highest]
probability event is distinguished from the others.
Figure 2. Weighted pseudospherical score vs. beta (uniform q)
[Plot: score (vertical axis, -3.5 to 2) against beta (horizontal axis, -2 to 3) for State 1 (p=0.05), State 2 (p=0.25), and State 3 (p=0.70).]
• By comparison, the weighted pseudospherical
scores approach fixed limits as β → ±∞.
• Again, for β << 0 [β >> 2] only the lowest [highest]
probability event is distinguished from the others.
Figure 3. Expected scores vs. beta (p=0.05, 0.25, 0.70, uniform q)
[Plot: expected score (vertical axis, 0.2 to 1) against beta (horizontal axis, -2 to 3) for the Pseudospherical and Power rules.]
The corresponding expected scores vs. β are equal
at β = 1, where both rules converge to the weighted
logarithmic scoring rule, but elsewhere the weighted
power expected score is strictly larger.
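The comparison in Figure 3 can be reproduced in outline with the sketch below. It assumes the expected scores take the closed forms (A - 1)/(β(β-1)) and (A^(1/β) - 1)/(β-1) with A = sum_i p_i^β q_i^(1-β), and it treats β = 0 and β = 1 as limits (reverse KL and KL for the power rule; 1 - exp(-reverse KL) and KL for the pseudospherical rule), which are derived here rather than quoted from the slides.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence sum_i p_i ln(p_i/q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def ratio_moment(p, q, beta):
    """A = sum_i p_i^beta * q_i^(1-beta)."""
    return sum(pi ** beta * qi ** (1 - beta) for pi, qi in zip(p, q))


def power_expected(p, q, beta):
    if beta == 1:
        return kl(p, q)            # limit as beta -> 1
    if beta == 0:
        return kl(q, p)            # limit as beta -> 0
    return (ratio_moment(p, q, beta) - 1) / (beta * (beta - 1))


def pseudospherical_expected(p, q, beta):
    if beta == 1:
        return kl(p, q)                    # limit as beta -> 1
    if beta == 0:
        return 1 - math.exp(-kl(q, p))     # limit as beta -> 0
    return (ratio_moment(p, q, beta) ** (1 / beta) - 1) / (beta - 1)


p = [0.05, 0.25, 0.70]
q = [1 / 3] * 3
betas = [-2 + 0.25 * k for k in range(21)]  # -2.0, -1.75, ..., 3.0
table = [(b, power_expected(p, q, b), pseudospherical_expected(p, q, b))
         for b in betas]
for b, pw, ps in table:
    print(f"{b:5.2f}  power={pw:.4f}  pseudospherical={ps:.4f}")
```

On this grid the power column is never below the pseudospherical column, and the two coincide at β = 1.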
Part 2. Entropy
• In statistical physics, the entropy of a system with n
possible internal states having probability distribution
p is defined (up to a multiplicative constant) by
\[ H(p) = -\sum_{i=1}^{n} p_i \ln p_i \]
• In communication theory, the negative entropy −H(p)
is the “self-information” of an event from a stationary
random process with distribution p, measured in
terms of the average number of bits required to
optimally encode it (Shannon 1948).
The KL divergence
• The cross-entropy, or Kullback-Leibler divergence,
between two distributions p and q measures the
expected information gain (reduction in average
number of bits per event) due to replacing the
“wrong” distribution q with the “right” distribution p:
\[ D_{KL}(p \| q) = \sum_{i=1}^{n} p_i \ln(p_i/q_i) \]
Properties of the KL divergence
• Additivity with respect to independent partitions of
the state space:
\[ D_{KL}(p^A \times p^B \,\|\, q^A \times q^B) = D_{KL}(p^A \| q^A) + D_{KL}(p^B \| q^B) \]
• Thus, if A and B are independent events whose
initial distributions $q^A$ and $q^B$ are respectively
updated to $p^A$ and $p^B$, the total expected information
gain in their product space is the sum of the
separate expected information gains, as measured
by their KL divergences.
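A minimal numerical check of additivity, using the standard definition of the KL divergence with natural logarithms. The marginal distributions are made-up illustrative numbers.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence sum_i p_i ln(p_i/q_i), skipping p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def product(a, b):
    """Joint distribution of two independent variables, flattened row by row."""
    return [x * y for x in a for y in b]


p_a, q_a = [0.7, 0.3], [0.5, 0.5]
p_b, q_b = [0.2, 0.5, 0.3], [1 / 3, 1 / 3, 1 / 3]

joint_gain = kl(product(p_a, p_b), product(q_a, q_b))
separate_gain = kl(p_a, q_a) + kl(p_b, q_b)
print(joint_gain, separate_gain)
```

The two printed numbers agree up to rounding, which is the additivity property for independent events.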
Properties of the KL divergence
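• Recursivity with respect to the splitting of events:
\[ D_{KL}(p_1, \ldots, p_n \,\|\, q_1, \ldots, q_n) = D_{KL}(p_1 + p_2, p_3, \ldots, p_n \,\|\, q_1 + q_2, q_3, \ldots, q_n) + (p_1 + p_2)\, D_{KL}\!\left( \tfrac{p_1}{p_1+p_2}, \tfrac{p_2}{p_1+p_2} \,\Big\|\, \tfrac{q_1}{q_1+q_2}, \tfrac{q_2}{q_1+q_2} \right) \]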
• Thus, the total expected information gain does not
depend on whether the true state is resolved all at
once or via a sequential splitting of events.
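Recursivity can be illustrated by merging the first two states, scoring the coarse partition, and then adding the probability-weighted gain from splitting the merged event. This is a sketch with illustrative numbers; the helper name is an assumption.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence sum_i p_i ln(p_i/q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_in_two_stages(p, q):
    """Resolve {state 0 or 1} vs. the rest first, then split states 0 and 1."""
    p_merged, q_merged = p[0] + p[1], q[0] + q[1]
    coarse = kl([p_merged] + p[2:], [q_merged] + q[2:])
    split = kl([p[0] / p_merged, p[1] / p_merged],
               [q[0] / q_merged, q[1] / q_merged])
    return coarse + p_merged * split


p = [0.05, 0.25, 0.70]
q = [0.20, 0.30, 0.50]
all_at_once = kl(p, q)
sequential = kl_in_two_stages(p, q)
print(all_at_once, sequential)
```

Both routes give the same total expected information gain.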
Other divergence/distance measures
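• The Chi-square divergence (Pearson 1900) is used by
frequentist statisticians to measure goodness of fit:
\[ \chi^2(p \| q) = \sum_{i=1}^{n} \frac{(p_i - q_i)^2}{q_i} \]
• The Hellinger distance is a symmetric measure of
distance between two distributions that is popular in
machine learning applications:
\[ D_H(p \| q) = \left( \tfrac{1}{2} \sum_{i=1}^{n} (\sqrt{p_i} - \sqrt{q_i})^2 \right)^{1/2} \]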
Onward to generalized divergence...
• The properties of additivity and recursivity can be
considered as axioms for a measure of expected
information gain which imply the KL divergence.
• However, weaker axioms of “pseudoadditivity” and
“pseudorecursivity” lead to parametric families of
generalized divergence.
• These generalized divergences “interpolate” and
“extrapolate” beyond the KL divergence, the
Chi-square divergence, and the Hellinger distance.
Power divergence
• The directed divergence of order β, a.k.a. the power
divergence, was proposed by Havrda & Charvát
(1967) and further elaborated by Rathie & Kannappan
(1972), Cressie & Read (1980), and Haussler & Opper
(1997), among others:
\[ D^P_\beta(p \| q) = \frac{\sum_i p_i^\beta q_i^{1-\beta} - 1}{\beta(\beta-1)} \]
• It is pseudoadditive and pseudorecursive for all β,
and it coincides with the KL divergence at β = 1.
• It is identical to the weighted power expected score,
hence the power divergence is the implicit information
measure behind the weighted power scoring rule.
Pseudospherical divergence
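• An alternative generalized entropy was introduced
by Arimoto (1971) and further studied by Sharma &
Mittal (1975), Boekee & Van der Lubbe (1980) and
Lavenda & Dunning-Davies (2003), for β > 1:
\[ H_\beta(p) = \frac{\beta}{\beta-1} \left[ 1 - \left( \sum_i p_i^\beta \right)^{1/\beta} \right] \]
• The corresponding divergence, which we call the
pseudospherical divergence, is obtained by
introducing a baseline distribution q and dividing
out the unnecessary β in the numerator:
\[ D^S_\beta(p \| q) = \frac{\left( \sum_i p_i^\beta q_i^{1-\beta} \right)^{1/\beta} - 1}{\beta-1} \]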
Properties of the pseudospherical
divergence
• It is defined for all real β (not merely β > 1).
• It is pseudoadditive but generally not
pseudorecursive.
• It is identical to the weighted pseudospherical
expected score, hence the pseudospherical
divergence is the implicit information measure
behind the weighted pseudospherical scoring rule.
Interesting special cases
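• The power and pseudospherical divergences
both coincide with the KL divergence at β = 1.
• At β = 0, β = ½, and β = 2 they are linearly (or at
least monotonically) related to the reverse KL
divergence, the squared Hellinger distance, and
the Chi-square divergence, respectively:
β = 0: power $= D_{KL}(q \| p)$; pseudospherical $= 1 - e^{-D_{KL}(q \| p)}$
β = ½: power $= 4 D_H^2$; pseudospherical $= 4 D_H^2 - 2 D_H^4$
β = 2: power $= \tfrac{1}{2} \chi^2$; pseudospherical $= \sqrt{1 + \chi^2} - 1$
where $D_H^2 = \tfrac{1}{2} \sum_i (\sqrt{p_i} - \sqrt{q_i})^2$, $\chi^2 = \sum_i (p_i - q_i)^2 / q_i$, and the β = 0 entries are limits.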
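These special cases can be checked numerically. The sketch assumes the divergences have the forms (A - 1)/(β(β-1)) and (A^(1/β) - 1)/(β-1) with A = sum_i p_i^β q_i^(1-β), normalizes the squared Hellinger distance with a factor of ½ (conventions vary), and approximates β = 0 by a small positive value. The closed-form relationships on the right-hand sides are derived from those assumed definitions.

```python
import math


def ratio_moment(p, q, beta):
    return sum(pi ** beta * qi ** (1 - beta) for pi, qi in zip(p, q))


def power_divergence(p, q, beta):
    return (ratio_moment(p, q, beta) - 1) / (beta * (beta - 1))


def pseudospherical_divergence(p, q, beta):
    return (ratio_moment(p, q, beta) ** (1 / beta) - 1) / (beta - 1)


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def chi_square(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def hellinger_squared(p, q):
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))


p = [0.05, 0.25, 0.70]
q = [0.20, 0.30, 0.50]
chi2, h2, reverse_kl = chi_square(p, q), hellinger_squared(p, q), kl(q, p)
eps = 1e-6  # stands in for the limit beta -> 0

checks = {
    "power, beta=2": (power_divergence(p, q, 2.0), 0.5 * chi2),
    "pseudospherical, beta=2": (pseudospherical_divergence(p, q, 2.0),
                                math.sqrt(1 + chi2) - 1),
    "power, beta=1/2": (power_divergence(p, q, 0.5), 4 * h2),
    "pseudospherical, beta=1/2": (pseudospherical_divergence(p, q, 0.5),
                                  4 * h2 - 2 * h2 ** 2),
    "power, beta->0": (power_divergence(p, q, eps), reverse_kl),
    "pseudospherical, beta->0": (pseudospherical_divergence(p, q, eps),
                                 1 - math.exp(-reverse_kl)),
}
for name, (lhs, rhs) in checks.items():
    print(f"{name:28s} {lhs:.6f} {rhs:.6f}")
```

The power divergence is linear in each classical measure, while the pseudospherical divergence is an increasing but nonlinear function of it.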
Where we’ve gotten so far...
• There are two parametric families of weighted,
strictly proper scoring rules which correspond exactly
to two well-known families of generalized
divergence, each of which has a full “spectrum” of
possibilities (−∞ < β < ∞).
• But what is the decision-theoretic significance of
these quantities?
• What are some guidelines for choosing between the
two families and their parameters?
Part 3. Financial decisions under
uncertainty with linear risk tolerance
• Suppose that an investor with subjective probability
distribution p and utility function u bets or trades
optimally against a risk-neutral opponent or
contingent claim market with distribution q.
• For any risk-averse utility function, the investor’s
gain in expected utility yields an economic measure
of the divergence between p and q.
• In particular, suppose the investor’s utility function
belongs to the linear risk tolerance (HARA) family,
i.e., the family of generalized exponential,
logarithmic, and power utility functions.
Risk aversion and risk tolerance
• Let y denote gain or loss relative to a (riskless) status
quo wealth position, and let u(y) denote the utility of y.
• The monetary quantity $\tau(y) \equiv -u'(y)/u''(y)$ is the
investor’s local risk tolerance at y (the reciprocal of
the Pratt-Arrow measure of local risk aversion).
• The usual decision-analytic rule of thumb is as
follows: an investor with current wealth y and local
risk tolerance τ(y) is roughly indifferent to accepting a
50-50 gamble between the wealth positions y + τ(y)
and y − ½τ(y), i.e., indifferent to gaining τ(y) or losing
½τ(y) with equal probability.
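The rule of thumb can be illustrated with two textbook utility functions, chosen here as assumptions: logarithmic utility of wealth, whose risk tolerance equals wealth, and exponential utility, whose risk tolerance is a constant T. The sketch computes the certainty equivalent of the 50-50 gamble in each case; the wealth levels are made up.

```python
import math


def certainty_equivalent(u, u_inverse, outcomes):
    """Certainty equivalent of a 50-50 gamble over two wealth levels."""
    expected_utility = 0.5 * u(outcomes[0]) + 0.5 * u(outcomes[1])
    return u_inverse(expected_utility)


# Logarithmic utility of wealth w: -u'(w)/u''(w) = w.
w = 100.0
ce_log = certainty_equivalent(math.log, math.exp, (w + w, w - 0.5 * w))

# Exponential utility -exp(-y/T): constant risk tolerance T.
T = 50.0
y = 100.0
ce_exp = certainty_equivalent(
    lambda v: -math.exp(-v / T),
    lambda util: -T * math.log(-util),
    (y + T, y - 0.5 * T),
)

print(ce_log - w)        # log utility: indifference holds up to rounding
print((ce_exp - y) / T)  # exponential utility: gap is under 1% of T
```

For logarithmic utility the indifference is exact; for exponential utility the gamble is worth slightly less than the status quo, which is why the rule is stated as approximate.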
Linear risk tolerance (LRT) utility
• The most commonly used utility functions in decision
analysis and financial economics have the property of
linear risk tolerance, i.e., τ(y) = τ + βy,
where τ > 0 is the risk tolerance at y = 0 and β is the
risk tolerance coefficient.
• If the unit of money is normalized so that the risk
tolerance equals 1 at y = 0 (status quo wealth), then
τ = 1, and the utility function is $u(y) = g_\beta(y)$, where:
\[ g_\beta(y) = \frac{1}{\beta-1} \left[ (1 + \beta y)^{(\beta-1)/\beta} - 1 \right] \]
Special cases of normalized LRT utility
Qualitative properties of LRT utility
$g_\beta(0) = 0$ and $g_\beta'(0) = 1$ for all β: the functions $\{g_\beta(y)\}$
are mutually tangent with dollar-utile parity at y = 0.
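A sketch of the normalized LRT family, assuming the form g_β(y) = ((1 + βy)^((β-1)/β) - 1)/(β-1) with the limits 1 - exp(-y) at β = 0 and ln(1 + y) at β = 1. The special cases named in Figure 4 (quadratic, exponential, logarithmic, square-root) then fall out directly. The function names are illustrative.

```python
import math


def lrt_utility(y, beta):
    """Normalized LRT utility g_beta(y), defined where 1 + beta*y > 0."""
    if beta == 0:
        return 1.0 - math.exp(-y)   # exponential utility (limit)
    if beta == 1:
        return math.log(1.0 + y)    # logarithmic utility (limit)
    return ((1.0 + beta * y) ** ((beta - 1.0) / beta) - 1.0) / (beta - 1.0)


def slope_at_zero(beta, h=1e-6):
    """Central-difference estimate of g_beta'(0)."""
    return (lrt_utility(h, beta) - lrt_utility(-h, beta)) / (2 * h)


betas = [-1.0, 0.0, 0.5, 1.0, 2.0]
for beta in betas:
    print(beta, lrt_utility(0.0, beta), round(slope_at_zero(beta), 6))

y = 0.3
quadratic = lrt_utility(y, -1.0)    # equals y - y^2/2
square_root = lrt_utility(y, 2.0)   # equals sqrt(1 + 2y) - 1
print(quadratic, square_root)
```

Every member passes through the origin with unit slope, which is the tangency property stated above.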
Figure 4. Normalized LRT utility functions (beta = risk tolerance coefficient)
[Plot: utility against payoff y for beta = -1 (quadratic), beta = 0 (exponential), beta = 1 (logarithmic), and beta = 2 (square-root).]
The investor’s decision model
• Model Y: the investor seeks the payoff vector y
that maximizes her own LRT expected utility under
her distribution p subject to not decreasing the
opponent’s linear expected utility (i.e., expected
value) under his distribution q.
• The investor’s reward in state i is her own ex post
utility payoff $g_\beta(y_i)$.
A modified decision model
• Modified Model Y: the investor seeks the payoff vector y
that maximizes the sum of her own LRT expected
utility under her distribution p and the opponent’s
linear expected utility (expected value) under his
distribution q.
• The investor’s reward in state i is her own ex post
utility payoff $g_\beta(y_i)$ plus the opponent’s ex ante
expected monetary payoff.
Main result
1. In the solution of Model Y, the investor’s utility
payoff in state i is the weighted pseudospherical
score, whose expected value is the
pseudospherical divergence.
2. In the solution of the modified Model Y, the investor’s
reward in state i is the weighted power score,
whose expected value is the power divergence.
3. For any p, q, and β, the weighted power expected
score (power divergence) is greater than or equal
to the weighted pseudospherical expected score
(pseudospherical divergence).
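The first two results can be checked on a small example. The closed-form optimal payoffs below are derived here from first-order conditions under assumed functional forms (g_β(y) = ((1 + βy)^((β-1)/β) - 1)/(β-1) and the weighted scores built from A = sum_i p_i^β q_i^(1-β)); they are a sketch, not formulas quoted from the presentation. The constrained model is the one in which the opponent’s expected value must not fall; the modified model maximizes the sum of the two parties’ expected utilities.

```python
def ratio_moment(p, q, beta):
    return sum(pi * (pi / qi) ** (beta - 1) for pi, qi in zip(p, q))


def g(y, beta):
    """Normalized LRT utility for beta not in {0, 1}."""
    return ((1 + beta * y) ** ((beta - 1) / beta) - 1) / (beta - 1)


def power_score(p, q, beta, i):
    a = ratio_moment(p, q, beta)
    return ((p[i] / q[i]) ** (beta - 1) - 1) / (beta - 1) - (a - 1) / beta


def pseudospherical_score(p, q, beta, i):
    a = ratio_moment(p, q, beta)
    return ((p[i] / q[i]) ** (beta - 1) / a ** ((beta - 1) / beta) - 1) / (beta - 1)


def solve_constrained_model(p, q, beta):
    """Maximize sum_i p_i g(y_i) subject to sum_i q_i y_i = 0."""
    a = ratio_moment(p, q, beta)
    return [((pi / qi) ** beta / a - 1) / beta for pi, qi in zip(p, q)]


def solve_modified_model(p, q, beta):
    """Maximize sum_i p_i g(y_i) - sum_i q_i y_i (no constraint)."""
    return [((pi / qi) ** beta - 1) / beta for pi, qi in zip(p, q)]


p, q, beta = [0.05, 0.25, 0.70], [1 / 3] * 3, 0.5
n = len(p)

y = solve_constrained_model(p, q, beta)
opponent_value = -sum(qi * yi for qi, yi in zip(q, y))  # zero at the optimum
payoff_constrained = [g(yi, beta) for yi in y]

y_mod = solve_modified_model(p, q, beta)
opponent_gain = -sum(qi * yi for qi, yi in zip(q, y_mod))
payoff_modified = [g(yi, beta) + opponent_gain for yi in y_mod]

print(payoff_constrained, [pseudospherical_score(p, q, beta, i) for i in range(n)])
print(payoff_modified, [power_score(p, q, beta, i) for i in range(n)])
```

State by state, the constrained model reproduces the weighted pseudospherical scores and the modified model reproduces the weighted power scores.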
Observations
• Insofar as Model Y is a more “realistic” investment
problem than the modified Model Y, the pseudospherical
divergence appears to be more economically
meaningful than the power divergence.
• The same results are obtained if the investor is
endowed with linear utility while the opponent is risk
averse with risk tolerance coefficient 1 − β.
• Both of these problems involve non-decreasing risk
tolerance on the part of the more-risk-averse agent
only if β is between 0 and 1.
Extension to incomplete markets
• Suppose the investor faces an incomplete market in
which asset prices are supported by a convex set of
risk neutral distributions.
• Let Q denote the matrix whose rows are the extreme
points of the set of risk neutral distributions.
• Then the investor seeks the payoff vector y that
maximizes her own LRT expected utility under her
distribution p subject to the constraint Qy ≤ 0.
• This is a convex optimization problem whose dual is to
find the risk neutral distribution in the convex hull of
the rows of Q that minimizes the pseudospherical
divergence from p.
Details of duality relationship
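• Let z denote a vector of non-negative weights,
summing to 1, for the k rows of Q.
• Then $z^T Q$ is a supporting risk neutral distribution in
the convex hull of the rows of Q, and the primal-dual
pair of optimization problems is as follows:
\[ \text{Primal:} \quad \max_{y} \; \sum_i p_i \, g_\beta(y_i) \quad \text{s.t.} \quad Qy \le 0 \]
\[ \text{Dual:} \quad \min_{z} \; D^S_\beta(p \,\|\, z^T Q) \quad \text{s.t.} \quad z \ge 0, \; \mathbf{1}^T z = 1 \]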
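A small illustration of the duality with k = 2 extreme points and n = 3 states. The sketch assumes the pseudospherical divergence has the form (A^(1/β) - 1)/(β-1) with A = sum_i p_i^β q_i^(1-β), solves the dual by a ternary search over the mixing weight, and then builds the investor’s best response to the minimizing distribution from a closed form derived for the complete-market case. All numbers and function names are illustrative assumptions.

```python
def ratio_moment(p, q, beta):
    return sum(pi ** beta * qi ** (1 - beta) for pi, qi in zip(p, q))


def pseudospherical_divergence(p, q, beta):
    return (ratio_moment(p, q, beta) ** (1 / beta) - 1) / (beta - 1)


def g(y, beta):
    """Normalized LRT utility for beta not in {0, 1}."""
    return ((1 + beta * y) ** ((beta - 1) / beta) - 1) / (beta - 1)


def best_response(p, q, beta):
    """Optimal payoffs against a single risk neutral distribution q."""
    a = ratio_moment(p, q, beta)
    return [((pi / qi) ** beta / a - 1) / beta for pi, qi in zip(p, q)]


def mixture(z, rows):
    """Distribution z * rows[0] + (1 - z) * rows[1]."""
    return [z * a + (1 - z) * b for a, b in zip(rows[0], rows[1])]


def argmin_unit_interval(f, iterations=200):
    """Ternary search for the minimizer of a unimodal f on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


p = [0.5, 0.3, 0.2]
Q = [[0.6, 0.2, 0.2], [0.2, 0.4, 0.4]]  # extreme risk neutral distributions
beta = 0.5

# Dual: closest supporting distribution to p in pseudospherical divergence.
z_star = argmin_unit_interval(
    lambda z: pseudospherical_divergence(p, mixture(z, Q), beta))
q_star = mixture(z_star, Q)
dual_value = pseudospherical_divergence(p, q_star, beta)

# Primal candidate: trade optimally against q_star, then check Qy <= 0.
y_star = best_response(p, q_star, beta)
primal_value = sum(pi * g(yi, beta) for pi, yi in zip(p, y_star))
slack = [sum(qi * yi for qi, yi in zip(row, y_star)) for row in Q]

print(z_star, dual_value, primal_value, slack)
```

The best response to the divergence-minimizing distribution satisfies Qy ≤ 0 up to search tolerance and attains an expected utility equal to the dual objective, which is the equality the primal-dual pair asserts.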
Conclusions
• The commonly used power & pseudospherical scoring
rules can be improved by incorporating a
not-necessarily-uniform baseline distribution.
• The resulting weighted expected scores are equal to
well-known generalized divergences.
• The weighted pseudospherical scoring rule and its
divergence have a more natural utility-theoretic
interpretation than the weighted power versions.
• Values of β between 0 and 1 appear to be the most
interesting, and the cases β = 0 and β = ½ have been
so far under-explored.