Information Measures, Experiments,
Multi-category Hypothesis Tests, and Surrogate Losses
John C. Duchi1,2
Khashayar Khosravi2
Feng Ruan1
Departments of 1 Statistics and 2 Electrical Engineering, Stanford University
{jduchi,khosravi,fengruan}@stanford.edu
Abstract
We provide a unifying view of statistical information measures, multi-class classification
problems, multi-way Bayesian hypothesis testing, and loss functions, elaborating equivalence
results between all of these objects. In particular, we consider a generalization of
f -divergences to multiple distributions, and we show that there is a constructive equivalence
between f -divergences, statistical information (in the sense of uncertainty as elaborated by
DeGroot), and loss functions for multi-category classification. We also study an extension of
our results to multi-class classification problems in which we must both infer a discriminant
function γ and a data representation (or, in the setting of a hypothesis testing problem, an
experimental design), represented by a quantizer q from a family of possible quantizers Q.
There, we give a complete characterization of the equivalence between loss functions, meaning
that optimizing either of two losses yields the same optimal discriminant and quantizer q. A main
consequence of our results is to describe those convex loss functions that are Fisher consistent
for jointly choosing a data representation and minimizing the (weighted) probability of error in
multi-category classification and hypothesis testing problems.
1 Introduction
Consider the following multiclass classification problem: a decision maker receives a pair of random
variables (X, Y ) ∈ X × {1, . . . , k}, where Y is unobserved, and wishes to assign the variable X to
one of the k classes {1, 2, . . . , k} so as to minimize the probability of a misclassification, that is,
predicting some $\hat Y \ne Y$. Formally, we represent the decision maker via a discriminant function
γ : X → Rk , where each component γy (x), y = 1, . . . , k, represents the margin (or a score or
perceived likelihood) the decision maker assigns to class y for datum x. The goal is then to
minimize the expected loss, or Ψ-risk,
\[
R_\Psi(\gamma) := \mathbb{E}[\Psi_Y(\gamma(X))], \tag{1}
\]
where Ψy : Rk → R measures the loss of decision γ(X) ∈ Rk when the true label of X is y
and the expectation (1) is taken jointly over (X, Y ). When the loss Ψ is the 0-1 loss, that is,
$\Psi_y(\gamma(x)) = 1\{\gamma_y(x) \le \gamma_i(x) \text{ for some } i \ne y\}$, the formulation (1) measures the misclassification
probability $\mathbb{P}(\mathrm{argmax}_y \gamma_y(X) \ne Y)$. More broadly, we may consider the classical k-way Bayesian
experiment: given a random variable X ∈ X drawn according to one of the k hypotheses
H1 : X ∼ P1 ,
H2 : X ∼ P2 ,
...
Hk : X ∼ Pk
with prior probabilities $\pi_1, \pi_2, \ldots, \pi_k$, we wish to make a decision minimizing the expected loss
$\mathbb{E}[\Psi_Y(\gamma(X))]$, or (in a pointwise sense) the posterior $\sum_y \mathbb{P}(Y = y \mid X = x)\,\Psi_y(\gamma(x))$.
In many applications, making decisions based on a raw vector X is undesirable—the vector X
may be extremely high-dimensional, carry useless information impinging on statistical efficiency,
or we may need to store or communicate the sample using limited memory or bandwidth. If all
we wish to do is to classify a person as being taller or shorter than 160 centimeters, it makes little
sense to track his or her blood type and eye color. With the increase in the number and variety of
measurements we collect, such careful design choices are—in our estimation—yet more important
for maintaining statistical power, interpretability, accountability, efficient downstream use, and
mitigating false discovery [3, 12]. This desire to give “better” representations of data X has led to
a rich body of work in statistics, machine learning, and engineering, highlighting the importance of
careful measurement, experimental design, and data representation strategies [29, 9, 10, 27, 33, 24].
To allow better fidelity to real-world considerations, one thus frequently expands this classical
formulation to include a preprocessing stage in which the vector X is mapped into a vector Z via
a (data-dependent) mapping q : X → Z. A number of situations suggest such an approach. In
most practical classification scenarios [30], one includes an equivalent “feature selection” stage to
reduce the dimension of X or increase its interpretability. As a second motivation, consider the
decentralized detection problem [34] in communication applications in engineering, where data are
collected through remote sensors and communicated through limited bandwidth or stored using
limited memory. In this case, the central decision maker can infer the initial distribution Pi only
after communication of the transformed vector Z = q(X), and one wishes to choose a quantizer
q from a family Q of acceptable (low complexity) quantizers. In fuller abstraction, we may treat
the problem as a Bayesian experimental design problem, where the mapping q : X → Z may be
stochastic and is chosen from a family Q of possible experiments (observation channels). In each
of the preceding examples, the desire to incorporate a quantizer q into the testing classification
procedure poses a more complicated problem, as it is now required to simultaneously find both a
data representation q and decision rule γ. The goal, paralleling that for the risk (1), thus becomes
joint minimization of the quantized Ψ-risk
RΨ,π (γ | q) := E [ΨY (γ(q(X)))]
(2)
over a prespecified family Q of quantizers q : X → Z, where γ : Z → Rk .
Frequently—for example, in the zero-one error case—the loss function Ψ is non-convex (and even
discontinuous), so that its population or empirical minimization is intractable. It is thus common
to replace the loss with a convex surrogate and minimize this surrogate instead. We call such a
surrogate Fisher consistent if its minimization yields a Bayes optimal discriminant γ for the original
loss Ψ (for any distribution P on (X, Y)), and for the classical binary and multiclass classification
problems, several researchers have characterized conditions when solutions using convex surrogate
losses yield Fisher consistency [22, 36, 21, 2, 32]. Nguyen et al. [24] substantially extend these
results in the binary case to the problem of joint selection of the discriminant function γ and
quantizer q : X → Z when q ∈ Q, for any Q. In particular, Nguyen et al. exhibit a correspondence
between binary margin-based loss functions and f -divergences—measures of the similarity between
two probability distributions originally developed in information theory and statistics [1, 8, 35]—to
give a general characterization of loss equivalence through classes of divergences. An interesting
consequence of their results is that, in spite of positive results [36, 21, 2] for Fisher consistency in
binary classification problems, essentially only the hinge loss is consistent for the 0-1 loss.
Outline and informal discussion of our contributions
In the present paper, we build on these prior results on Fisher consistency to provide a unifying
framework that relates statistical information measures, loss functions, and uncertainty functions
in the context of multi-category classification. To begin, we present a generalization of Ali and
Silvey’s and Csiszár’s f-divergences that applies to multiple distributions, showing
analogues of their positivity properties, data-processing inequalities, and discrete approximation
(Section 2). We begin our main contributions in Section 3, where we establish a correspondence
between loss functions Ψ, concave measures of uncertainty on discrete distributions as defined by
DeGroot [9], and multi-way f-divergences. In particular, letting $\pi \in \mathbb{R}_+^k$ satisfying $1^T\pi = 1$ be
a prior distribution on the class labels Y, $\tilde\pi(X)$ be the vector of posterior probabilities of the label Y
conditional on the observation X, and $U : \mathbb{R}_+^k \to \underline{\mathbb{R}} = \mathbb{R} \cup \{-\infty\}$ be a concave function, DeGroot
defines the information associated with the experiment X as
\[
I(X, \pi; U) := U(\pi) - \mathbb{E}[U(\tilde\pi(X))]. \tag{3}
\]
We show that there is an explicit mapping from concave uncertainty functions U to (potentially
multiple) loss functions Ψ, and conversely that each loss function induces a concave uncertainty U
and statistical information (3). The mapping U 7→ Ψ is one-to-many, while the mapping Ψ 7→ U
is many-to-one, but we show that there is always at least one explicitly constructable convex loss
Ψ = {Ψ1 , . . . , Ψk } generating U . We also show how these uncertainty functions U have natural
connections to classification calibration.
In Section 4 we present our main results on consistency of joint selection of quantizer (data
representation/experiment) q and discriminant function γ. We give a complete characterization
of the situations in which two multi-category loss functions $\Psi^{(1)} = \{\Psi_1^{(1)}, \ldots, \Psi_k^{(1)}\}$ and
$\Psi^{(2)} = \{\Psi_1^{(2)}, \ldots, \Psi_k^{(2)}\}$ always yield equivalent optimal quantizers and discriminant functions via
minimization of the quantized risk (2). In particular, we show that loss functions are equivalent
in this way if and only if their associated uncertainty functions U1 and U2 —as constructed in
Section 3—satisfy U1 (π) = aU2 (π) + bT π + c for some a > 0, b ∈ Rk , and c ∈ R. We give a
similar characterization that reposes on the generalized f -divergences associated with each loss
function (again, as constructed in Section 3). Measures of statistical information and divergence
have been central to the design of communication and quantization schemes in the signal processing literature [34, 26, 20, 17, 24], and one interpretation of our results is a characterization
of those divergence measures that—when optimized for—yield optimal detection and quantization
procedures. More broadly, we give necessary and sufficient conditions for the Fisher-consistency of
different loss functions for joint selection of data representation (i.e. the experimental design) and
classification scheme, and we provide a result showing when empirical minimization of a surrogate
risk yields a consistent risk-minimization procedure.
A number of researchers have studied the connections between divergence measures and risk
for binary and multi-category experiments, which point to the results we present. Indeed, in
the 1950s Blackwell [6] shows that if a quantizer q1 induces class-conditional distributions with
larger divergence than those induced by q2 , then there are prior probabilities such that q1 allows
tests with lower probability of error than q2 . Liese and Vajda [19] give a broad treatment of
f -divergences, using their representation as the difference between prior and posterior risk in a
binary experiment [25] to derive a number of their properties; see also the paper [28]. García-García and Williamson [13] show how multi-distribution f-divergences [14] arise naturally in the
context of multi-category classification problems as the gap between prior and posterior risk in
classification, as in the work of Liese and Vajda. In the binary case, these results point the way
towards the characterization of Fisher consistency for quantization and binary classification Nguyen
et al. [24] realize. We pursue this line of research to draw the connections between Fisher consistency,
information measures, multi-category classification, surrogate losses, and divergences.
Notation We let 0 and 1 denote the all-zeros and all-ones vectors, respectively. For a vector or
collection of objects {t1 , . . . , tm }, the notation t1:m = {t1 , . . . , tm }. The set indicator function I {·}
is +∞ if its argument is false, 0 otherwise, while 1 {·} is 1 if its argument is true, 0 otherwise.
For vectors $v, u \in \mathbb{R}^k$, we let $v \odot u = [v_i u_i]_{i=1}^k \in \mathbb{R}^k$ be their Hadamard (pointwise) product. We
let $\Delta_k = \{v \in \mathbb{R}_+^k : 1^T v = 1\}$ denote the probability simplex in $\mathbb{R}^k$. For any $m \in \mathbb{N}$, we set
$[m] = \{1, \ldots, m\}$. For a set $A \subset \mathbb{R}^k$, we let $\mathrm{aff}\, A = \{\sum_{i=1}^m \lambda_i x_i : \lambda^T 1 = 1, x_i \in A, m \in \mathbb{N}\}$ denote
the affine hull of A, and $\mathrm{rel\,int}\, A$ denotes the interior of A relative to $\mathrm{aff}\, A$. We let $\overline{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$
and $\underline{\mathbb{R}} = \mathbb{R} \cup \{-\infty\}$ be the upward and downward extended real numbers. For $f : \mathbb{R}^k \to \overline{\mathbb{R}}$, we let
$\mathrm{epi}\, f = \{(x, t) : f(x) \le t\}$ denote the epigraph of f. We say a convex function f is closed if $\mathrm{epi}\, f$
is a closed set, though we abuse notation and say that a concave function f is closed if $\mathrm{epi}(-f)$ is
closed. For a convex function $f : \mathbb{R}^k \to \overline{\mathbb{R}}$, we say that f is strictly convex at a point $t \in \mathbb{R}^k$ if for
all $\lambda \in (0, 1)$ and $t_1, t_2 \ne t$ such that $t = \lambda t_1 + (1-\lambda)t_2$ we have $f(t) < \lambda f(t_1) + (1-\lambda)f(t_2)$. We
define the (Fenchel) conjugate of a function $f : \mathbb{R}^k \to \overline{\mathbb{R}}$ by
\[
f^*(s) = \sup_{t \in \mathbb{R}^k}\big\{ s^T t - f(t) \big\}. \tag{4}
\]
For any function f, the conjugate $f^*$ is a closed convex function [16, Chapter X]. For measures ν
and µ, we let $d\nu/d\mu$ denote the Radon-Nikodym derivative of ν with respect to µ. For random
variables $X_n$, we say $X_n \stackrel{L^p}{\to} X_\infty$ if $\mathbb{E}[|X_n - X_\infty|^p] \to 0$.
2 Multi-distribution f-divergences
Divergence measures for probability distributions have significant statistical, decision-, and information-theoretic applications [1, 8, 18, 26, 17], including in optimal testing, minimax rates of convergence,
and the design of communication and encoding schemes. Central to this work is the f -divergence,
introduced by Ali and Silvey [1] and Csiszár [8], and defined as follows. Given a pair of distributions
P, Q defined on a common set X , a closed convex function f : [0, ∞) → R satisfying f (1) = 0, and
any measure µ satisfying P ≪ µ and Q ≪ µ, the f -divergence between P and Q is
\[
D_f(P\,\|\,Q) := \int_{\mathcal X} f\Big(\frac{p(x)}{q(x)}\Big)\, q(x)\, d\mu(x) = \int f\Big(\frac{dP}{dQ}\Big)\, dQ. \tag{5}
\]
Here $p = dP/d\mu$ and $q = dQ/d\mu$ denote the densities of P and Q, respectively, and the value $u f(t/u)$ is
defined appropriately for t = 0 and u = 0 (e.g. [19]). A number of classical divergence measures
arise out of the f-divergence; taking variously $f(t) = t\log t$, $f(t) = \frac{1}{2}(\sqrt{t} - 1)^2$, and $f(t) = \frac{1}{2}|t - 1|$
yields (respectively) the KL-divergence, squared Hellinger distance, and total variation distance.
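For concreteness, the following Python sketch (ours, not part of the paper) evaluates the binary divergence (5) for finitely supported distributions with the three generators just listed; it assumes q(x) > 0 everywhere, so no boundary conventions are needed.

```python
# A minimal sketch (not from the paper): the binary f-divergence (5) for
# finitely supported P and Q with q > 0 on the common support.
import numpy as np

def f_divergence(p, q, f):
    """Compute D_f(P||Q) = sum_x f(p(x)/q(x)) q(x) for discrete p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(f(p / q) * q))

kl        = lambda t: t * np.log(t)                # f(t) = t log t
hellinger = lambda t: 0.5 * (np.sqrt(t) - 1) ** 2  # f(t) = (1/2)(sqrt(t) - 1)^2
tv        = lambda t: 0.5 * np.abs(t - 1)          # f(t) = (1/2)|t - 1|

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]
for name, f in [("KL", kl), ("squared Hellinger", hellinger), ("TV", tv)]:
    print(name, f_divergence(P, Q, f))
```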
Central to our study of multi-way hypothesis testing and classification is an understanding of
relationships between multiple distributions simultaneously. Thus, we use the following generalization [14, 13] of the classical f -divergence to multiple distributions.
Definition 2.1. Let P1 , . . . , Pk be probability distributions on a common σ-algebra F over a set X .
Let $f : \mathbb{R}_+^{k-1} \to \overline{\mathbb{R}}$ be a closed convex function satisfying $f(1) = 0$. Let µ be any σ-finite measure
such that $P_i \ll \mu$ for all i, and let $p_i = dP_i/d\mu$. The f-divergence between $P_1, \ldots, P_{k-1}$ and $P_k$ is
\[
D_f(P_1, \ldots, P_{k-1} \| P_k) := \int_{\mathcal X} f\Big(\frac{p_1(x)}{p_k(x)}, \ldots, \frac{p_{k-1}(x)}{p_k(x)}\Big)\, p_k(x)\, d\mu(x). \tag{6}
\]
We have not specified the value of the integrand in Def. 2.1 when $p_k(x) = 0$. In this case, the
function $\tilde f : \mathbb{R}_+^k \to \overline{\mathbb{R}}$ defined, for an arbitrary $t' \in \mathrm{rel\,int\,dom}\, f$, by
\[
\tilde f(t, u) = \begin{cases} u f(t/u) & \text{if } u > 0 \\ \lim_{s \to 0} s f(t' + t/s) & \text{if } u = 0 \\ +\infty & \text{otherwise} \end{cases} \tag{7}
\]
is a closed convex function whose value is independent of $t'$; this $\tilde f$ is the closure of the perspective
function of f (see Hiriart-Urruty and Lemaréchal [15, Proposition IV.2.2.2]). Any time we consider
a perspective function $(u, t) \mapsto u f(t/u)$ we treat it as its closure (7) without comment.
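The following sketch (ours, with a hypothetical generator) computes the multi-distribution divergence of Definition 2.1 for finitely supported $P_1, \ldots, P_k$, assuming $p_k > 0$ everywhere so the closure (7) is never needed.

```python
# A minimal sketch of Definition 2.1 for finitely supported P_1, ..., P_k
# (assuming p_k > 0 everywhere, so the perspective closure (7) is not needed).
import numpy as np

def multi_f_divergence(ps, f):
    """D_f(P_1,...,P_{k-1} || P_k) = sum_x f(p_1/p_k, ..., p_{k-1}/p_k) p_k."""
    ps = np.asarray(ps, float)      # shape (k, n): rows are the k pmfs
    ratios = ps[:-1] / ps[-1]       # shape (k-1, n): likelihood ratios at each point
    return float(np.sum(f(ratios) * ps[-1]))

# A hypothetical generator: f(t) = sum_i (t_i log t_i - t_i + 1) is closed,
# convex, and satisfies f(1) = 0, as Definition 2.1 requires.
f = lambda t: np.sum(t * np.log(t) - t + 1.0, axis=0)

P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4]]               # the last row is the reference distribution P_k
print(multi_f_divergence(P, f))
```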
We now enumerate a few properties of multi-way f -divergences, showing how they are natural
generalizations of classical binary f -divergences. We focus on basic properties that will be useful for
our further results on Bayes risk, classification, and hypothesis testing. Specifically, we show that
generalized f-divergences parallel the binary divergence (5): they are well-defined, have continuity
properties with respect to discrete approximations, and satisfy data-processing inequalities. These
results are essentially known (we somewhat more carefully address measurability concerns) and we
use them as definitional building blocks, so we defer all proofs to Appendix D.
As our first step, we note that Definition 2.1 gives a well-defined quantity, independent of the
base measure µ. (See Appendix D.1 for a proof.)
Lemma 2.1. In expression (6), the value of the divergence does not depend on the choice of the
dominating measure µ. Moreover,
Df (P1 , . . . , Pk−1 ||Pk ) ≥ 0,
and the inequality is strict if f is strictly convex at 1 and the Pi are not all identical.
We now consider approximation of the f -divergence. Given a partition P of X , meaning a finite
or countable collection of disjoint measurable sets A ∈ P satisfying ∪{A | A ∈ P} = X , we define
the partitioned f -divergence as
\[
D_f(P_1, \ldots, P_{k-1} \| P_k \mid \mathcal P) = \sum_{A \in \mathcal P} f\Big(\frac{P_1(A)}{P_k(A)}, \ldots, \frac{P_{k-1}(A)}{P_k(A)}\Big)\, P_k(A).
\]
Equivalently, given a quantizer q, meaning a measurable mapping $q : \mathcal X \to \mathbb{N}$ (any countable image
space $\mathrm{Im}\, q$ yields an equivalent definition), we define the quantized f-divergence as
\[
D_f(P_1, \ldots, P_{k-1} \| P_k \mid q) = \sum_{A \in q^{-1}(\mathbb{N})} f\Big(\frac{P_1(A)}{P_k(A)}, \ldots, \frac{P_{k-1}(A)}{P_k(A)}\Big)\, P_k(A).
\]
As in the binary case [35, 19], we have the following approximability result, that is, that quantizers
give arbitrarily good approximations to generalized f -divergences (see Appendix D.2).
Proposition 1 (Approximation). Let f be a closed convex function with f (1) = 0. Then
\[
D_f(P_1, \ldots, P_{k-1} \| P_k) = \sup_{\mathcal P} D_f(P_1, \ldots, P_{k-1} \| P_k \mid \mathcal P) = \sup_{q} D_f(P_1, \ldots, P_{k-1} \| P_k \mid q), \tag{8}
\]
where the first supremum is over finite measurable partitions of X and the second is over quantizers
of X with finite range.
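As an illustration of Proposition 1 and of the data-processing behavior discussed next, the sketch below (a hypothetical discrete example of ours) evaluates the quantized divergence for a fine and a coarse partition of a four-point space; coarsening the partition can only decrease the value.

```python
# A minimal sketch (hypothetical example): quantized multi-way f-divergences on a
# four-point space. The finest partition recovers the unquantized divergence, and
# merging cells can only decrease D_f (Corollary 1).
import numpy as np

def f(t):  # generator f(t) = sum_i (t_i log t_i - t_i + 1), convex with f(1) = 0
    return np.sum(t * np.log(t) - t + 1.0, axis=0)

def quantized_divergence(ps, cells, f):
    """cells lists the index sets q^{-1}(z); ps is the (k, n) array of class pmfs."""
    ps = np.asarray(ps, float)
    total = 0.0
    for cell in cells:
        masses = ps[:, cell].sum(axis=1)         # (P_1(A), ..., P_k(A))
        total += f(masses[:-1] / masses[-1]) * masses[-1]
    return float(total)

P = np.array([[0.60, 0.30, 0.05, 0.05],
              [0.20, 0.40, 0.20, 0.20],
              [0.25, 0.25, 0.25, 0.25]])
fine   = [[0], [1], [2], [3]]     # identity quantizer: equals the full divergence
coarse = [[0, 1], [2, 3]]         # merges points pairwise: value can only shrink
print(quantized_divergence(P, fine, f))
print(quantized_divergence(P, coarse, f))
```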
An important property of f -divergences is that they satisfy data processing inequalities (see,
for example, Csiszár [8], Györfi and Nemetz [14], Cover and Thomas [7], or Liese and Vajda [19])
which state that processing or transforming an observation X drawn from one of the distributions
P1 , . . . , Pk , decreases the divergence between them. That is, transforming X cannot yield more
“information.” Recall that Q is a Markov kernel from a set X to Z if Q(· | x) is a probability
distribution on Z for each x ∈ X , and for each measurable A ⊂ Z, the mapping x 7→ Q(A | x) is
measurable. We have the following general data processing inequality, whose proof we include in
Appendix D.3 (see also Garcı́a-Garcı́a and Williamson [13, Theorem 4]).
Proposition 2. Let f be a closed convex function with $f(1) = 0$. Let Q be a Markov kernel from
$\mathcal X$ to $\mathcal Z$ and define the marginals $QP(A) := \int_{\mathcal X} Q(A \mid x)\, dP(x)$. Then
\[
D_f(QP_1, \ldots, QP_{k-1} \| QP_k) \le D_f(P_1, \ldots, P_{k-1} \| P_k).
\]
This proposition immediately yields the intuitive result that quantization reduces information: for
any quantizer q the indicator Q(A | x) = 1 {q(x) ∈ A} defines a Markov kernel, yielding
Corollary 1. Let f be closed convex, satisfy f (1) = 0, and q be a quantizer of X . Then
Df (P1 , . . . , Pk−1 ||Pk | q) ≤ Df (P1 , . . . , Pk−1 ||Pk ) .
We also see that if q1 and q2 are quantizers of X , and q1 induces a finer partition of X than q2 ,
meaning that for x, x′ ∈ X the equality q1 (x) = q1 (x′ ) implies q2 (x) = q2 (x′ ), we have
Df (P1 , . . . , Pk−1 ||Pk | q1 ) ≤ Df (P1 , . . . , Pk−1 ||Pk | q2 ) .
In the sequel, we show how the ordering of quantizers by divergence (and information content)—as
related to the quantized risk (2)—is essential to our understanding of consistency.
3 Risks, information measures, and f-divergences
Having established the basic properties of generalized f -divergences, showing that they essentially
satisfy the same properties as the binary case (cf. [19]), we turn to a more detailed look at their
relationships with multi-way hypothesis tests, multi-class classification, uncertainty measures and
statistical informations relating multiple distributions. We build a set of correspondences between
these ideas that parallels those available for binary experiments and classification problems (see,
for example, Liese and Vajda [19], Nguyen et al. [24], and Reid and Williamson [28]).
We begin by recalling the probabilistic model for classification and Bayesian hypothesis testing
problems described in the introduction. We have a prior distribution π ∈ ∆k = {p ∈ Rk+ : 1T p = 1}
and probability distributions P1 , . . . , Pk defined on a set X . The coordinate Y ∈ [k] = {1, . . . , k}
is drawn according to a multinomial with probabilities π, and conditional on Y = y, we draw
X ∼ Py . Following DeGroot [9], we refer to this as an experiment. Associated with any experiment
(equivalently, a multi-category classification problem) is a family of information measures DeGroot
defines as follows. Let $\tilde\pi$ be the posterior distribution on Y given the observation X = x,
\[
\tilde\pi_i(x) = \frac{\pi_i\, dP_i(x)}{\sum_{j=1}^k \pi_j\, dP_j(x)}.
\]
Given any closed concave function U : Rk+ → R, which we refer to as an uncertainty function, the
information associated with the experiment is the reduction of uncertainty from prior to posterior
(recall equation (3)), that is,
\[
I(X, \pi; U) = U(\pi) - \mathbb{E}[U(\tilde\pi(X))].
\]
The expectation is taken over X drawn marginally according to $\sum_{i=1}^k \pi_i P_i$. That $I(X, \pi; U) \ge 0$ is
immediate by concavity; DeGroot [9, Theorem 2.1] shows that I(X, π; U ) ≥ 0 for all distributions
P1 , . . . , Pk and priors π if and only if U is concave on ∆k .
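As a concrete illustration, the following sketch (ours; the prior and class-conditional distributions are hypothetical) computes DeGroot's information (3) for a small discrete experiment under two choices of uncertainty function.

```python
# A minimal sketch (hypothetical discrete example) of DeGroot's statistical
# information (3): I(X, pi; U) = U(pi) - E[U(posterior)], for a concave U.
import numpy as np

def degroot_information(pi, ps, U):
    """pi: prior over k classes; ps: (k, n) class-conditional pmfs; U: concave uncertainty."""
    pi, ps = np.asarray(pi, float), np.asarray(ps, float)
    marginal = pi @ ps                           # mixture pmf over the n points
    posteriors = (pi[:, None] * ps) / marginal   # column x is the posterior given X = x
    post_vals = np.array([U(posteriors[:, x]) for x in range(ps.shape[1])])
    return U(pi) - np.sum(marginal * post_vals)

entropy  = lambda p: -np.sum(p * np.log(np.where(p > 0, p, 1.0)))  # concave on the simplex
zero_one = lambda p: 1.0 - np.max(p)                               # uncertainty of Example 1

pi = [0.5, 0.3, 0.2]
ps = [[0.7, 0.2, 0.1],
      [0.1, 0.8, 0.1],
      [0.2, 0.2, 0.6]]
print(degroot_information(pi, ps, entropy))    # reduction in Shannon entropy
print(degroot_information(pi, ps, zero_one))   # reduction in Bayes error
```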
In this section we develop equivalence results between classification risk, loss functions, generalized f -divergences, and uncertainty measures. More concretely, consider a vector Ψ = {Ψ1 , . . . , Ψk }
of loss functions with Ψi : Rk → R, and recall the risk (1), defined as RΨ (γ) = E[ΨY (γ(X))], where
γ ∈ Γ, the set of measurable functions γ : X → Rk . We show in Section 3.1 that each such loss
Ψ induces a concave uncertainty measure U : ∆k → R, and conversely, each uncertainty function
U is induced by (potentially many) convex loss functions Ψ. In Section 3.2, we illustrate Fisher
consistency (see [36, 32]) properties that uncertainty functions U directly imply about the convex
losses Ψ that induce them. Finally, we connect these results in Section 3.3 with our generalized
f-divergences. We show for any loss Ψ there exists an uncertainty function $U_\Psi$ such that the following holds: for any $\pi \in \Delta_k$, there exists a convex function $f_{\Psi,\pi} : \mathbb{R}_+^{k-1} \to \overline{\mathbb{R}}$ with $f_{\Psi,\pi}(1) = 0$ such
that the gap between the prior Bayes Ψ-risk—the best expected loss attainable without observing
X—and the posterior Bayes risk $\inf_{\gamma\in\Gamma} R_\Psi(\gamma)$ is
\[
\inf_{\alpha\in\mathbb{R}^k} \sum_{i=1}^k \pi_i\Psi_i(\alpha) - \inf_{\gamma\in\Gamma} R_\Psi(\gamma) = U_\Psi(\pi) - \mathbb{E}[U_\Psi(\tilde\pi(X))] = D_{f_{\Psi,\pi}}(P_1, \ldots, P_{k-1}\|P_k).
\]
Conversely, given any closed convex $f : \mathbb{R}_+^{k-1} \to \overline{\mathbb{R}}$ with $f(1) = 0$, there exist convex losses
$\Psi = \{\Psi_1, \ldots, \Psi_k\}$, an associated uncertainty function $U_\Psi$, and a prior $\pi \in \Delta_k$ satisfying
\[
D_f(P_1, \ldots, P_{k-1}\|P_k) = U_\Psi(\pi) - \mathbb{E}[U_\Psi(\tilde\pi(X))] = \inf_{\alpha\in\mathbb{R}^k}\sum_{i=1}^k \pi_i\Psi_i(\alpha) - \inf_{\gamma\in\Gamma} R_\Psi(\gamma).
\]

3.1 Uncertainty functions and losses
We now construct a natural bidirectional mapping between losses and uncertainty functions. We
begin by showing how any vector of loss functions for multi-category classification induces a concave
uncertainty function, giving a few examples to illustrate our constructions. Following DeGroot [9,
Eq. (2.14)], we note that the function
\[
U_\Psi(\pi) = \inf_{\alpha\in\mathbb{R}^k}\sum_{i=1}^k \pi_i\Psi_i(\alpha) - \mathrm{I}\{\pi \in \Delta_k\} \tag{9}
\]
is a closed concave function, as it is the infimum of linear functionals of π restricted to $\pi \in \Delta_k$; it
is thus an uncertainty function. The gap $U_\Psi(\pi) - \mathbb{E}[U_\Psi(\tilde\pi(X))]$ between prior and posterior
uncertainty is non-negative, and by the construction (9) any loss function Ψ gives rise to a closed
concave uncertainty UΨ . The following two examples with zero-one loss are illustrative.
Example 1 (Zero-one loss): Consider the zero-one loss $\Psi^{\rm zo}$ defined for $y \in [k]$ by
\[
\Psi_y^{\rm zo}(\alpha) = 1\{\alpha_y \le \alpha_j \text{ for some } j \ne y\} = \begin{cases} 1 & \text{if there is } j \ne y \text{ such that } \alpha_j \ge \alpha_y \\ 0 & \text{otherwise, i.e.\ } \alpha_y > \max_{j\ne y}\alpha_j. \end{cases}
\]
Then we have
\[
U_{\Psi^{\rm zo}}(\pi) = \inf_\alpha\Big\{\sum_{i=1}^k \pi_i 1\{\alpha_i \le \alpha_j \text{ for some } j \ne i\}\Big\} = 1 - \max_j \pi_j.
\]
This uncertainty function is concave, bounded on ∆k , and satisfies UΨzo (ei ) = 0 for each standard
basis vector ei . ♣
Example 2 (Cost-weighted classification): In some scenarios, we allow different costs for classifying certain classes y as others; for example, it may be more costly to misclassify a benign tumor
as cancerous than to misclassify a cancerous tumor as benign. In this case, we assume a matrix
$C = [c_{yi}]_{y,i=1}^k \in \mathbb{R}_+^{k\times k}$, where $c_{yi} \ge 0$ is the cost for classifying an observation of class y as class i
(i.e. as X having been drawn from $P_i$ instead of $P_y$ in the experiment). We assume that $c_{yy} = 0$
for each y and define
\[
\Psi_y^{\rm cw}(\alpha) = \max_i\{c_{yi} \mid \alpha_i = \max_j \alpha_j\}, \tag{10}
\]
the maximal loss for those indices of $\alpha \in \mathbb{R}^k$ attaining $\max_j \alpha_j$. Let $C = [c_1 \cdots c_k]$ be the column
representation of C. If $c_y^T\pi = \min_l c_l^T\pi$, then by choosing any α such that $\alpha_y > \alpha_j$ for all $j \ne y$,
we have
\[
U_{\Psi^{\rm cw}}(\pi) = \inf_\alpha\sum_{y=1}^k \pi_y \max_i\{c_{yi} \mid \alpha_i = \max_j \alpha_j\} = \min_l \pi^T c_l.
\]
The uncertainty $U_{\Psi^{\rm cw}}$ is concave, bounded on $\Delta_k$, continuous, and satisfies $U_{\Psi^{\rm cw}}(e_i) = 0$ for any
standard basis vector ei . We recover Example 1 by taking C = 11T − Ik×k . ♣
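The two uncertainty functions of Examples 1 and 2 are easy to evaluate directly; the sketch below (ours, not from the paper) checks that the cost-weighted uncertainty with $C = 11^T - I$ reduces to the zero-one uncertainty.

```python
# A minimal sketch (not from the paper) of the uncertainty functions in
# Examples 1 and 2: U_zo(pi) = 1 - max_j pi_j and U_cw(pi) = min_l pi^T c_l.
import numpy as np

def U_zero_one(pi):
    return 1.0 - np.max(pi)

def U_cost_weighted(pi, C):
    """C[y, i] is the cost of classifying class y as class i; c_l are the columns of C."""
    return float(np.min(np.asarray(pi) @ np.asarray(C, float)))

pi = np.array([0.5, 0.3, 0.2])
C = np.ones((3, 3)) - np.eye(3)                 # C = 11^T - I recovers Example 1
print(U_zero_one(pi), U_cost_weighted(pi, C))   # both equal 0.5
```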
The forward mapping (9) from loss function Ψ to uncertainty measures $U_\Psi$ is straightforward,
though it is many-to-one. Using convex duality and conjugacy arguments (recall the definition (4)
of the conjugate function), we can show a converse: given any (closed concave) uncertainty function
U defined on ∆k , there exists a loss function Ψ such that U has the form (9).
Proposition 3. For any closed concave uncertainty function U : ∆k → R, the loss functions
Ψi : Rk → R,
Ψi (α) = −αi + (−U )∗ (α),
for i ∈ {1, . . . , k}, are closed, convex, and satisfy the equality (9).
Proof
Using standard Fenchel conjugacy relationships [16, Chapter X], we have
\[
U(\pi) = \inf_{\alpha\in\mathbb{R}^k}\big\{-\pi^T\alpha + (-U)^*(\alpha)\big\}, \quad\text{where } (-U)^*(\alpha) := \sup_{\pi\in\Delta_k}\big\{\alpha^T\pi + U(\pi)\big\}. \tag{11}
\]
Defining $\Psi_i(\alpha) = -\alpha_i + (-U)^*(\alpha)$ for $i = 1, \ldots, k$, we can write
\[
U(\pi) = \inf_{\alpha\in\mathbb{R}^k}\big\{-\pi^T\alpha + (-U)^*(\alpha)\big\} = \inf_{\alpha\in\mathbb{R}^k}\big\{-\pi^T\alpha + \pi^T 1\cdot(-U)^*(\alpha)\big\} = \inf_{\alpha\in\mathbb{R}^k}\Big\{\sum_{i=1}^k \pi_i\Psi_i(\alpha)\Big\}.
\]
This is our desired infimal relationship.
Proposition 3 shows that associated with every concave utility defined on the simplex, there is
at least one convex loss function Ψ generating the utility via the infimal representation (9), and
there is thus a mapping from loss functions to uncertainty functions and from uncertainty functions
to losses. The mapping from uncertainty functions U to loss functions generating U as in (9) may
be one-to-many; we may begin with a non-convex loss Ψ′ in the definition (9), then construct a
convex Ψ via Proposition 3 satisfying UΨ = UΨ′ .
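To illustrate the construction of Proposition 3 in a case where the conjugate is available in closed form, the sketch below (ours) takes U to be the Shannon entropy, for which $(-U)^*$ is the log-sum-exp function, and verifies numerically that the resulting loss regenerates U via (9); this is the logistic-loss correspondence of Example 4 below.

```python
# A minimal sketch of Proposition 3 (not from the paper): building the loss
# Psi_i(alpha) = -alpha_i + (-U)^*(alpha) from a concave uncertainty U. Here U is
# the Shannon entropy, whose conjugate (-U)^* over the simplex is log-sum-exp.
import numpy as np

def neg_U_conjugate(alpha):     # (-U)^*(alpha) = sup_{pi in simplex} {alpha^T pi + U(pi)}
    return np.log(np.sum(np.exp(alpha)))

def Psi(i, alpha):              # the loss of Proposition 3; here the multiclass logistic loss
    return -alpha[i] + neg_U_conjugate(alpha)

pi = np.array([0.5, 0.3, 0.2])
U = -np.sum(pi * np.log(pi))

# For this U the infimum in (9) is attained at alpha = log(pi); check the match.
alpha_star = np.log(pi)
bayes_risk = sum(pi[i] * Psi(i, alpha_star) for i in range(3))
print(U, bayes_risk)            # the two values agree, illustrating U_Psi = U
```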
3.2 Surrogate risk consistency and uncertainty functions
The construction (11) of loss functions is a somewhat privileged construction, as it often yields
desirable properties of the convex loss function itself, especially as related to the non-convex zero-one loss. Indeed, it is often the case that the convex loss Ψ so generated is Fisher consistent; to
make this explicit, we recall the following definition [36, 32].
Definition 3.1. Let Ψ = {Ψ1 , . . . , Ψk } be a vector of loss functions with Ψi : Rk → R. Then Ψ is
classification calibrated for the zero-one loss if for any π ∈ ∆k and any i∗ such that πi∗ < maxj πj ,
we have
\[
\inf_{\alpha\in\mathbb{R}^k}\Big\{\sum_{i=1}^k \pi_i\Psi_i(\alpha)\Big\} < \inf_{\alpha\in\mathbb{R}^k}\Big\{\sum_{i=1}^k \pi_i\Psi_i(\alpha) : \alpha_{i^*} \ge \max_j \alpha_j\Big\}. \tag{12}
\]
Given a cost matrix $C = [c_1 \cdots c_k] \in \mathbb{R}_+^{k\times k}$ as in the cost-weighted loss of Example 2, we say Ψ
is classification calibrated for the cost matrix C if for any $\pi \in \Delta_k$ and any $i^*$ such that $c_{i^*}^T\pi > \min_j c_j^T\pi$, we have
\[
\inf_{\alpha\in\mathbb{R}^k}\Big\{\sum_{i=1}^k \pi_i\Psi_i(\alpha)\Big\} < \inf_{\alpha\in\mathbb{R}^k}\Big\{\sum_{i=1}^k \pi_i\Psi_i(\alpha) : \alpha_{i^*} \ge \max_j \alpha_j\Big\}. \tag{13}
\]
Tewari and Bartlett [32, Theorem 2] and Zhang [36, Theorem 3] show the importance of Definition 3.1. Indeed, let R(γ) be the zero-one risk (Ex. 1) or the cost-weighted risk (Ex. 2). If
inf α Ψi (α) > −∞ for all i, then Ψ is classification calibrated if and only if for any sequence of
functions γn : X → Rk we have the Fisher consistency guarantee that
\[
R_\Psi(\gamma_n) \to \inf_{\gamma\in\Gamma} R_\Psi(\gamma) \quad\text{implies}\quad R(\gamma_n) \to \inf_{\gamma\in\Gamma} R(\gamma).
\]
That is, classification calibration is equivalent to surrogate risk or Fisher consistency of the losses
Ψ, and minimization of RΨ implies minimization of the true risk R.
We now show how—under minor restrictions on the uncertainty function U —the construction (11) yields losses that are classification calibrated. We first define the restrictions.
Definition 3.2. A convex function $f : \mathbb{R}^k \to \overline{\mathbb{R}}$ is $(\lambda, \kappa, \|\cdot\|)$-uniformly convex for some $\kappa \ge 2$ over
$C \subset \mathbb{R}^k$ if it is closed and for all $t \in [0, 1]$ and $x_1, x_2 \in C$ we have
\[
f(t x_1 + (1-t)x_2) \le t f(x_1) + (1-t)f(x_2) - \frac{\lambda}{2}\, t(1-t)\, \|x_1 - x_2\|^\kappa\big[(1-t)^{\kappa-1} + t^{\kappa-1}\big].
\]
We say, without qualification, that a function f is uniformly convex on a set C if $\mathrm{dom}\, f \supset C$ and
there exist some $\lambda > 0$, norm $\|\cdot\|$, and constant $\kappa < \infty$ such that Definition 3.2 holds, and we say that
f is uniformly concave if −f is uniformly convex. Definition 3.2 is an extension of the usual notion
of strong convexity, which holds when κ = 2, and is essentially a slightly quantified notion of strict
convexity.¹

¹Strict convexity requires that $f(tx_1 + (1-t)x_2) < tf(x_1) + (1-t)f(x_2)$ for $t \in (0, 1)$ and $x_1 \ne x_2$. If C is
compact, then letting $\mathrm{diam}(C)$ be the diameter of C, we may take $\lambda = \epsilon/\mathrm{diam}(C)^\kappa$ for some ε arbitrarily close to 0;
then Def. 3.2 is implied by $f(tx_1 + (1-t)x_2) \le tf(x_1) + (1-t)f(x_2) - \epsilon t(1-t)[(1-t)^{\kappa-1} + t^{\kappa-1}]$, and taking $\kappa \uparrow \infty$
shows that uniform convexity is not much stronger than strict convexity for compact C.

With this definition, we have the following two propositions.

Proposition 4. Assume that U is closed concave, symmetric, and finite on $\Delta_k$ with $\mathrm{dom}\, U = \Delta_k$,
and let $\Psi_i$ have definition (11). Additionally assume either that (a) U is strictly concave, and the
infimum $\inf_\alpha \sum_{i=1}^k \pi_i\Psi_i(\alpha)$ is attained for all $\pi \in \Delta_k$, or (b) U is uniformly concave. Then Ψ is
classification calibrated.
Even when U is not strictly concave, we can give classification calibration results. Indeed, recall
Example 1, which showed that for the zero-one loss, we have $U_{\Psi^{\rm zo}}(\pi) = 1 - \max_j \pi_j$.
Proposition 5. Let U (π) = 1 − maxj πj . The loss (11) defined by Ψi (α) = −αi + (−U )∗ (α) is
classification calibrated. Moreover, we have for any π ∈ ∆k and α ∈ Rk that
\[
\sum_{i=1}^k \pi_i\Psi_i(\alpha) - \inf_\alpha \sum_{i=1}^k \pi_i\Psi_i(\alpha) \ge \frac{1}{k}\Big[\sum_{i=1}^k \pi_i\Psi_i^{\rm zo}(\alpha) - \inf_\alpha \sum_{i=1}^k \pi_i\Psi_i^{\rm zo}(\alpha)\Big].
\]
These two propositions, whose proofs we provide in Appendix A, show that uncertainty functions
very naturally give rise to classification calibrated loss functions; we collect several examples of
these results in Section 3.4.
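The following numerical sketch (ours; it uses scipy for the constrained minimization, and the prior is hypothetical) checks Definition 3.1 directly for the multiclass logistic loss of Example 4: forcing the score of a class $i^*$ with $\pi_{i^*} < \max_j \pi_j$ to be largest strictly increases the inner risk.

```python
# A minimal numerical check (hypothetical) of classification calibration (12) for
# the multiclass logistic loss: the constrained inner risk exceeds the unconstrained one.
import numpy as np
from scipy.optimize import minimize

def inner_risk(alpha, pi):
    # sum_i pi_i log(sum_j exp(alpha_j - alpha_i)), the logistic loss of Example 4
    return float(np.sum(pi * np.array([np.log(np.sum(np.exp(alpha - a))) for a in alpha])))

pi = np.array([0.5, 0.3, 0.2])
i_star = 2                                   # a class with pi_{i_star} < max_j pi_j

unconstrained = minimize(inner_risk, x0=np.zeros(3), args=(pi,)).fun
constrained = minimize(
    inner_risk, x0=np.zeros(3), args=(pi,),
    constraints=[{"type": "ineq", "fun": lambda a, j=j: a[i_star] - a[j]} for j in range(3)],
).fun
print(unconstrained, constrained)            # the strict gap required by (12)
```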
3.3 Divergences, risk, and uncertainty functions
In this section, we show that f-divergences as given in Definition 2.1 have a precise correspondence
with uncertainty functions and losses; García-García and Williamson [13] also establish a correspondence between f-divergences and infimal losses, though we present it here to give a complete
picture. We begin as in equation (9) with a concave uncertainty function U and corresponding loss
Ψ satisfying $U(\pi) = \inf_{\alpha\in\mathbb{R}^k}\sum_{i=1}^k \pi_i\Psi_i(\alpha)$; by Proposition 3 it is no loss of generality to assume
this correspondence. Let µ be any measure such that Pi ≪ µ for each i, pi = dPi /dµ, and Γ be the
collection of measurable functions γ : X → Rk . The posterior Bayes risk for the loss Ψ is
\[
U_\Psi(\pi, P_{1:k}) := \inf_{\gamma\in\Gamma} \int_{\mathcal X} \sum_{i=1}^k \pi_i\Psi_i(\gamma(x))\, p_i(x)\, d\mu(x) \tag{14}
\]
\[
= \int_{\mathcal X} \inf_{\alpha\in\mathbb{R}^k}\Big\{\sum_{i=1}^k \Psi_i(\alpha)\, \frac{\pi_i p_i(x)}{\sum_{j=1}^k \pi_j p_j(x)}\Big\}\, \sum_{j=1}^k \pi_j p_j(x)\, d\mu(x) = \mathbb{E}[U_\Psi(\tilde\pi(X))],
\]
where $\tilde\pi(x)$ is the posterior distribution on Y conditional on X = x. The information measure (3)
is thus the gap between the prior Bayes Ψ-risk and posterior Bayes Ψ-risk. We may then write
\[
\inf_{\alpha\in\mathbb{R}^k}\sum_{i=1}^k \pi_i\Psi_i(\alpha) - \inf_{\gamma\in\Gamma} R_\Psi(\gamma) = U_\Psi(\pi) - U_\Psi(\pi, P_{1:k}) = I(X, \pi; U_\Psi)
\]
\[
= \int_{\mathcal X} \sup_{\alpha\in\mathbb{R}^k}\Big(U_\Psi(\pi) - \sum_{i=1}^{k-1}\pi_i\Psi_i(\alpha)\,\frac{p_i(x)}{p_k(x)} - \pi_k\Psi_k(\alpha)\Big)\, p_k(x)\, d\mu(x) = D_{f_{\Psi,\pi}}(P_1, \ldots, P_{k-1}\|P_k),
\]
where the closed convex function $f_{\Psi,\pi} : \mathbb{R}_+^{k-1} \to \overline{\mathbb{R}}$ is defined by
\[
f_{\Psi,\pi}(t) := \sup_{\alpha\in\mathbb{R}^k}\Big(U_\Psi(\pi) - \sum_{i=1}^{k-1}\pi_i\Psi_i(\alpha)\, t_i - \pi_k\Psi_k(\alpha)\Big). \tag{15}
\]
As fΨ,π is the supremum of affine functions of its argument t, it is closed convex, and fΨ,π (1) =
UΨ (π) − UΨ (π) = 0. That is, equation (15) shows that given any loss Ψ = {Ψ1 , . . . , Ψk } or
uncertainty function U , the information measure I(X, π; UΨ ), gap between prior and posterior
Ψ-risk, and fΨ,π -divergence between distributions P1 , . . . , Pk−1 and Pk are identical.
We can also give a converse result that shows that every f -divergence can be written as the gap
between prior and posterior risks for a convex loss function. We first state a result that allows us
to write Df (P1:k−1 ||Pk ) as a statistical information (3) based on a particular concave uncertainty
function U (see also [13, Theorem 3]).
Proposition 6. For any closed and convex function f such that f (1) = 0, let π = 1/k and
\[
U(t_1, \ldots, t_k) = -k\, t_k\, f\Big(\frac{t_1}{t_k}, \ldots, \frac{t_{k-1}}{t_k}\Big),
\]
where we implicitly use the closure of the perspective (cf. definition (7)). Then
\[
D_f(P_1, \ldots, P_{k-1}\|P_k) = U(1/k) - \mathbb{E}[U(\tilde\pi(X))],
\]
where the expectation is taken according to $\frac{1}{k}\sum_{i=1}^k P_i = \sum_{i=1}^k \pi_i P_i$.
Proof The function U so defined is concave, as it is a negative scalar multiple of the perspective
transform of the function f , which is convex [15, Section IV.2.2]. We have that U (1/k) = 0, and
we also have for $t \in \mathbb{R}_+^k$ that
\[
U\Big(\frac{t_1}{\sum_{i=1}^k t_i}, \ldots, \frac{t_k}{\sum_{i=1}^k t_i}\Big) = -\frac{k\, t_k}{\sum_{i=1}^k t_i}\, f\Big(\frac{t_1}{t_k}, \ldots, \frac{t_{k-1}}{t_k}\Big),
\]
so that
\[
D_f(P_1, \ldots, P_{k-1}\|P_k) = U(1/k) - \int U\Big(\frac{dP_1(x)}{\sum_{i=1}^k dP_i(x)}, \ldots, \frac{dP_k(x)}{\sum_{i=1}^k dP_i(x)}\Big)\, d\bar P(x),
\]
where $\bar P = \frac{1}{k}\sum_{i=1}^k P_i$.
By combining Propositions 3 and 6 with the infimal representation (9) of UΨ , we immediately
obtain the following corollary.
Corollary 2. Let $\pi^0 = 1/k$. For any closed and convex function f such that $f(1) = 0$, the convex
loss Ψ defined by
\[
\Psi_i(\alpha) = -\alpha_i + \sup_{\pi\in\Delta_k}\Big\{\pi^T\alpha - k\pi_k\, f\Big(\frac{\pi_1}{\pi_k}, \ldots, \frac{\pi_{k-1}}{\pi_k}\Big)\Big\}
\]
satisfies $f(t) = \sup_{\alpha\in\mathbb{R}^k}\{U_\Psi(\pi^0) - \sum_{i=1}^{k-1}\pi_i^0\Psi_i(\alpha)t_i - \pi_k^0\Psi_k(\alpha)\}$. Additionally,
\[
D_f(P_{1:k-1}\|P_k) = U_\Psi(\pi^0) - \mathbb{E}[U_\Psi(\tilde\pi(X))] = \inf_{\alpha\in\mathbb{R}^k}\sum_{i=1}^k \pi_i^0\Psi_i(\alpha) - \inf_\gamma \mathbb{E}[\Psi_Y(\gamma(X))],
\]
where the expectation is taken according to $Y \sim \pi^0 = 1/k$ and $X \sim P_y$ conditional on Y = y.
Corollary 2, coupled with the information representation given by the f -divergence (15), shows
that there is a complete equivalence between f -divergences, loss functions Ψ, and uncertainty
measures U . For any f -divergence, there exists a loss function Ψ and prior π = 1/k such that
Df (P1:k−1 ||Pk ) = UΨ (π) − UΨ (π, P1:k ). Conversely, for any loss function Ψ and prior π, there exists
a multi-distribution f -divergence such that the gap UΨ (π) − UΨ (π, P1:k ) = Df (P1:k−1 ||Pk ).
3.4 Examples of uncertainty and loss correspondences
We give several examples illustrating the correspondence between (concave) uncertainty functions
and the loss construction (11), using Propositions 4 and 5 to guarantee classification calibration.
Example 3 (0-1 loss, Example 1, continued): We use the uncertainty function U (π) = 1−maxj πj
generated by the zero-one loss to derive a convex loss function Ψ that gives the same uncertainty
function via the representation (9). Consider the conjugate
\[
(-U)^*(\alpha) = 1 + \max\Big\{\alpha_{(1)} - 1,\ \frac{\alpha_{(1)} + \alpha_{(2)}}{2} - \frac{1}{2},\ \ldots,\ \frac{\sum_{i=1}^k \alpha_{(i)}}{k} - \frac{1}{k}\Big\}, \tag{16}
\]
where $\alpha_{(1)} \ge \alpha_{(2)} \ge \cdots$ are the entries of $\alpha \in \mathbb{R}^k$ in sorted order (we show this equality subsequently). Then the convex “family-wise” loss (named for its similarity to family-wise error control
in hypothesis tests)
\[
\Psi_i^{\rm fw}(\alpha) = 1 - \alpha_i + \max_{l\in\{1,\ldots,k\}}\Big\{\frac{1}{l}\sum_{j=1}^l \alpha_{(j)} - \frac{1}{l}\Big\}
\]
generates the same uncertainty function UΨfw and associated f -divergence as the zero-one loss.
Moreover, Proposition 5 guarantees that $\Psi^{\rm fw}_i$ is classification calibrated (Def. 3.1). It appears that
the loss $\Psi^{\rm fw}_i$ is a new convex classification-calibrated loss function.
Let us now demonstrate the equality (16) for $(-U)^*(\alpha) = \sup_{\pi\in\Delta_k}\{\pi^T\alpha + 1 - \|\pi\|_\infty\}$. Formulating the Lagrangian for the supremum with dual variables $\theta \in \mathbb{R}$, $\lambda \in \mathbb{R}_+^k$, we have
\[
L(\pi, \theta, \lambda) = \pi^T\alpha - \|\pi\|_\infty - \theta(1^T\pi - 1) + \lambda^T\pi,
\]
which has dual objective
\[
\sup_\pi L(\pi, \theta, \lambda) = \begin{cases} \theta & \text{if } \|\alpha + \lambda - \theta 1\|_1 \le 1\\ +\infty & \text{otherwise.}\end{cases}
\]
As $\inf_{\lambda\ge 0}\|\alpha + \lambda - \theta 1\|_1 = \sum_{j=1}^k [\alpha_j - \theta]_+$ and strong duality obtains (the problems are linear), we
have that the supremum in expression (16) is
\[
\sup_\pi\big\{\pi^T\alpha - \|\pi\|_\infty \mid \pi^T 1 = 1,\ \pi \ge 0\big\} = \inf\Big\{\theta \mid \sum_{j=1}^k [\alpha_j - \theta]_+ \le 1\Big\}.
\]
Without loss of generality we may assume that $\alpha_1 \ge \alpha_2 \ge \cdots$ by symmetry, so that over the
domain $\theta \in (-\infty, \alpha_1]$, the function $\theta \mapsto \sum_{j=1}^k [\alpha_j - \theta]_+$ is strictly decreasing. Thus, there is a
unique smallest θ satisfying $\sum_{j=1}^k [\alpha_j - \theta]_+ \le 1$ (attaining the equality), and by inspection, this
must be one of
\[
\theta \in \Big\{\alpha_1 - 1,\ \frac{\alpha_1 + \alpha_2}{2} - \frac{1}{2},\ \frac{\alpha_1 + \alpha_2 + \alpha_3}{3} - \frac{1}{3},\ \ldots,\ \frac{1^T\alpha}{k} - \frac{1}{k}\Big\}.
\]
(Any θ makes some number of the terms $[\alpha_j - \theta]_+$ positive; fixing the number of terms and solving
for θ gives the preceding equality.) Expression (16) follows. ♣
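A small numerical sketch (ours, with a hypothetical prior) makes the family-wise loss concrete: the inner risk is bounded below by $U(\pi) = 1 - \max_j \pi_j$, with equality at a suitable vertex-like score vector.

```python
# A minimal sketch (not from the paper) of the "family-wise" loss of Example 3 and
# a check that its inner risk is lower-bounded by U(pi) = 1 - max_j pi_j.
import numpy as np

def psi_fw(i, alpha):
    a = np.sort(alpha)[::-1]                   # alpha_(1) >= alpha_(2) >= ...
    partial = np.cumsum(a)
    l = np.arange(1, len(alpha) + 1)
    return 1.0 - alpha[i] + np.max(partial / l - 1.0 / l)

def inner_risk(alpha, pi):
    return sum(pi[i] * psi_fw(i, alpha) for i in range(len(pi)))

pi = np.array([0.5, 0.3, 0.2])
print(1.0 - np.max(pi))                        # U_{Psi^fw}(pi) = 0.5
for alpha in [np.zeros(3), np.array([0.5, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])]:
    print(inner_risk(alpha, pi))               # >= 0.5, with equality at alpha = e_1
```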
Now we consider the logistic loss for multiclass classification problems; this does not directly
correspond to the zero-one loss, but it generates Shannon entropy and information.
Example 4 (Logistic loss and entropy): The multi-class logistic loss is
\[
\Psi_i(\alpha) = \log\Big(\sum_{j=1}^k e^{\alpha_j - \alpha_i}\Big), \quad \text{for } 1 \le i \le k.
\]
The uncertainty function (prior Bayes risk) for the logistic loss is the Shannon entropy,
\[
U(\pi) = \inf_{\alpha\in\mathbb{R}^k} \sum_{i=1}^k \pi_i \log\Big(\sum_{j=1}^k e^{\alpha_j - \alpha_i}\Big) = -\sum_{i=1}^k \pi_i\log\pi_i. \tag{17}
\]
Moreover, the negative entropy is strongly convex over the simplex $\Delta_k$ (this is Pinsker’s inequality [7, Chapter 17.3]), so Proposition 4 immediately implies the classification calibration of the
multiclass logistic loss (cf. [36, Section 4.4]). The information measure (3) associated with the
logistic loss is the mutual information between the observation X and label Y. Indeed, we have
\[
I(X, \pi; U) = U(\pi) - \mathbb{E}[U(\tilde\pi(X))] = H(Y) - \int_{\mathcal X} H(Y \mid X = x)\, d\bar P(x) = H(Y) - H(Y \mid X) = I(X; Y),
\]
where H denotes the Shannon entropy, $\bar P = \sum_{i=1}^k \pi_i P_i$, and I(X; Y) is the (Shannon) mutual information between X and Y. ♣
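To close the loop on Section 3.3, the following sketch (ours; the distributions and prior are hypothetical, and the inner supremum in (15) is computed numerically with scipy) checks that the information $U(\pi) - \mathbb{E}[U(\tilde\pi(X))]$ for the logistic loss matches the $f_{\Psi,\pi}$-divergence of Definition 2.1.

```python
# A minimal numerical check (hypothetical discrete example): for the logistic loss,
# the information gap U(pi) - E[U(posterior)] equals the f_{Psi,pi}-divergence (15).
import numpy as np
from scipy.optimize import minimize

def logistic_loss(i, alpha):
    return np.log(np.sum(np.exp(alpha - alpha[i])))

def f_psi_pi(t, pi):
    # f_{Psi,pi}(t) = sup_alpha { U(pi) - sum_{i<k} pi_i Psi_i(alpha) t_i - pi_k Psi_k(alpha) }
    U_pi = -np.sum(pi * np.log(pi))
    k = len(pi)
    def neg_objective(alpha):
        losses = np.array([logistic_loss(i, alpha) for i in range(k)])
        return -(U_pi - np.sum(pi[:-1] * losses[:-1] * t) - pi[-1] * losses[-1])
    return -minimize(neg_objective, x0=np.zeros(k)).fun

pi = np.array([0.4, 0.35, 0.25])
ps = np.array([[0.7, 0.2, 0.1],    # class-conditional pmfs P_1, P_2, P_3 on three points
               [0.1, 0.8, 0.1],
               [0.2, 0.3, 0.5]])

# Information U(pi) - E[U(posterior)], i.e. the mutual information of Example 4.
marginal = pi @ ps
posterior = (pi[:, None] * ps) / marginal
H = lambda p: -np.sum(p * np.log(p))
info = H(pi) - np.sum(marginal * np.array([H(posterior[:, x]) for x in range(3)]))

# The f_{Psi,pi}-divergence D_f(P_1, P_2 || P_3) of Definition 2.1.
div = sum(f_psi_pi(ps[:-1, x] / ps[-1, x], pi) * ps[-1, x] for x in range(3))
print(info, div)                   # the two values agree up to optimization error
```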
4 Comparison of loss functions
In Section 3, we demonstrated the correspondence between loss functions, concave measures of
uncertainty, statistical information, and f -divergences. These correspondences assume that decision
makers have access to the entire observation X, which is often not the case; as noted in the
introduction, it is often beneficial to pre-process data to make it lower dimensional, communicate
or store it efficiently, or to improve statistical behavior. In such cases, it is useful to understand
correspondence between criteria used for the design of experiments, quantization methods, and
data representation, which we now explore.
4.1 A model of quantization and experimental design
Abstractly, we treat the design of an experiment or choice of data representation as a quantization
problem, where a quantizer q maps the space X to a countable space Z (we assume with no loss
of generality that Z ⊂ N). Then, for a loss Ψ and discriminant function γ : Z → Y = {1, . . . , k},
we consider the quantized risk (2), which we recall is
RΨ,π (γ | q) := E [ΨY (γ(q(X)))] .
Given a prior π ∈ ∆k on the label Y and class-conditional distributions P1 , . . . , Pk (equivalently,
hypotheses Hi : Pi in the Bayesian testing setting), and collection Q of quantizers, our design
criterion is to choose the quantizer q that allows the best attainable risk. That is, we consider
the quantized Bayes Ψ-risk, defined as the infimum of the risk (2) over all possible discriminant
functions Γ = {γ : Z → Rk },
\[
\inf_{\gamma\in\Gamma} R_{\Psi,\pi}(\gamma \mid q) = \sum_{z\in\mathcal Z}\, \inf_{\alpha\in\mathbb{R}^k}\Big\{\sum_{i=1}^k \pi_i\Psi_i(\alpha)\, P_i(q^{-1}(z))\Big\}. \tag{18}
\]
Whether for computational or analytic reasons, minimizing the loss (18) is often intractable;
the zero-one loss Ψzo (Ex. 1), for example, is non-convex and discontinuous. It is thus of interest to
understand the asymptotic consequences of using a surrogate loss Ψ in place of the desired loss (say
Ψzo ), including in the face of quantization or a further dimension reduction embodied by the choice
q ∈ Q. In the binary case, Nguyen et al. [24] have studied this problem, but the consequences of
using a surrogate for consistency of the resulting quantization and classification procedure are a priori
unclear: we do not know when using such a surrogate can be done without penalty. To that end,
we now characterize when two loss functions Ψ(1) and Ψ(2) provide equivalent criteria for choosing
quantizers (experimental designs or data representations) according to the Bayes Ψ-risk (18).
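As a concrete instance of the design criterion (18), the sketch below (ours; a hypothetical four-point space and prior) evaluates the quantized Bayes risk of the zero-one loss for two candidate binary quantizers, which is exactly the comparison the equivalence results below address.

```python
# A minimal sketch (hypothetical example) of the quantized Bayes risk (18) for the
# zero-one loss: each cell contributes its total mass minus the largest weighted
# class mass, and we compare two candidate quantizers q1, q2 under a prior pi.
import numpy as np

def quantized_bayes_risk_zero_one(pi, ps, cells):
    """cells lists the index sets q^{-1}(z); ps is the (k, n) array of class pmfs."""
    pi, ps = np.asarray(pi, float), np.asarray(ps, float)
    risk = 0.0
    for cell in cells:
        weights = pi * ps[:, cell].sum(axis=1)   # (pi_i P_i(q^{-1}(z)))_i
        risk += weights.sum() - weights.max()    # inner infimum for the zero-one loss
    return float(risk)

pi = [1/3, 1/3, 1/3]
ps = [[0.6, 0.2, 0.1, 0.1],
      [0.1, 0.6, 0.2, 0.1],
      [0.1, 0.1, 0.2, 0.6]]
q1 = [[0, 1], [2, 3]]      # two candidate binary quantizers of the 4-point space
q2 = [[0, 2], [1, 3]]
print(quantized_bayes_risk_zero_one(pi, ps, q1))
print(quantized_bayes_risk_zero_one(pi, ps, q2))
```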
4.2 Universal Equivalence of Loss Functions
Recalling our arguments in Section 3.3 that statistical information (the gap between prior and
posterior risks) is a multi-way f -divergence between distributions P1 , . . . , Pk−1 and Pk , we give a
quantized version of this construction. In analogy with the results of Section 3.3, the quantized
statistical information is
\[
I(X, \pi; U_\Psi \mid q) := U_\Psi(\pi) - \mathbb{E}[U_\Psi(\tilde\pi(q(X)))] = \inf_{\alpha\in\mathbb{R}^k}\sum_{i=1}^k \pi_i\Psi_i(\alpha) - \inf_\gamma R_{\Psi,\pi}(\gamma \mid q) = D_{f_{\Psi,\pi}}(P_1, \ldots, P_{k-1}\|P_k \mid q), \tag{19}
\]
where $U_\Psi(\pi) = \inf_{\alpha\in\mathbb{R}^k}\sum_{i=1}^k \pi_i\Psi_i(\alpha)$ as in (9), the convex function $f_{\Psi,\pi}$ is defined as in expression (15) and does not depend on the quantizer q, and $\tilde\pi(q(X))$ denotes the posterior distribution
on $Y \in [k]$ conditional on observing q(X).
Using the representation (19), we define loss functions to be equivalent if they provide the same
ordering of quantizers q under the information measure (19).
Definition 4.1. Loss functions $\Psi^{(1)}$ and $\Psi^{(2)}$ are universally equivalent for the prior π, denoted
$\Psi^{(1)} \equiv^{u}_{\pi} \Psi^{(2)}$, if for any distributions $P_1, \ldots, P_k$ on $\mathcal X$ and quantizers $q_1$ and $q_2$
\[
I(X, \pi; U_{\Psi^{(1)}} \mid q_1) \le I(X, \pi; U_{\Psi^{(1)}} \mid q_2) \quad\text{iff}\quad I(X, \pi; U_{\Psi^{(2)}} \mid q_1) \le I(X, \pi; U_{\Psi^{(2)}} \mid q_2).
\]
Definition 4.1 of equivalent losses is evidently equivalent to the ordering condition
\[
\inf_\gamma R_{\Psi^{(1)},\pi}(\gamma \mid q_1) \le \inf_\gamma R_{\Psi^{(1)},\pi}(\gamma \mid q_2) \quad\text{iff}\quad \inf_\gamma R_{\Psi^{(2)},\pi}(\gamma \mid q_1) \le \inf_\gamma R_{\Psi^{(2)},\pi}(\gamma \mid q_2), \tag{20}
\]
for all distributions P1 , . . . , Pk , on the quantized Bayes Ψ-risk (18). This definition is somewhat
stringent: losses are universally equivalent only if they induce the same quantizer ordering for all
population distributions. Whenever a quantizer q1 is finer than a quantizer q2 , all losses yield
I (X, π; UΨ | q2 ) ≤ I (X, π; UΨ | q1 ) by the data processing inequality (Corollary 1 of Section 2).
The stronger equivalence notion is important for nonparametric classification settings in which the
underlying distribution on (X, Y ) is only weakly constrained and neither of a pair of quantizers
q1 , q2 ∈ Q is finer than the other.
Definition 4.1 and the representation (19) suggest that the uncertainty function UΨ associated
with the loss Ψ through the infimal representation (9) and the f -divergence associated with Ψ via
the construction (15) are important for the equivalence of two loss functions. This is indeed the
case. First, we have the following result on universal equivalence of loss functions based on their
associated uncertainty functions.

Theorem 1. Let $\Psi^{(1)}$ and $\Psi^{(2)}$ be bounded below loss functions. Let $U_{\Psi^{(1)}}$ and $U_{\Psi^{(2)}}$ be the associated uncertainty functions as in the construction (9). Then $\Psi^{(1)}$ and $\Psi^{(2)}$ are universally equivalent
if and only if there exist a > 0, b ∈ Rk , and c ∈ R such that for all π ∈ ∆k ,
\[
U_{\Psi^{(1)}}(\pi) = a U_{\Psi^{(2)}}(\pi) + b^T\pi + c.
\]
Secondly, we can characterize universal equivalence for a fixed prior π using f -divergences.
Theorem 2. Let $\pi \in \Delta_k$ and, as in Theorem 1, let $\Psi^{(1)}$ and $\Psi^{(2)}$ be bounded below loss functions,
with $f_\pi^{(1)}$ and $f_\pi^{(2)}$ the associated f-divergences as in the construction (15). Then $\Psi^{(1)}$ and $\Psi^{(2)}$ are
universally equivalent for the prior π if and only if there exist $a > 0$, $b \in \mathbb{R}^{k-1}$, and $c \in \mathbb{R}$
such that
\[
f_\pi^{(1)}(t) = a f_\pi^{(2)}(t) + b^T t + c \quad\text{for all } t \in \mathbb{R}_+^{k-1}. \tag{21}
\]
We provide a proof of each of these results in Section 5.
Theorems 1 and 2 show precisely when two loss functions $\Psi^{(1)}$ and $\Psi^{(2)}$ are equivalent, namely, if and
only if their associated uncertainty measures U or f -divergences are positive scalar multiples of one
another (potentially with a linear shift). A major application of these theorems, which we touch
on in the examples in Section 4.3 to follow, is to show that certain non-convex loss functions (such
as the zero-one loss) are equivalent to convex loss functions, including variants of the hinge loss.
As a first application of the theorems, however, we consider the Bayes consistency of empirical
risk minimization strategies for selecting a discriminant γ and quantizer q. In this case, we are
given a sample {(X1 , Y1 ), . . . , (Xn , Yn )}, and we define the empirical risk
\[
\widehat R_{\Psi,n}(\gamma \mid q) := \frac{1}{n}\sum_{i=1}^n \Psi_{Y_i}(\gamma(q(X_i))).
\]
Now, let Q1 ⊂ Q2 ⊂ · · · ⊂ Q be a non-decreasing collection of quantizers, indexed by sample size n,
and similarly let Γ1 ⊂ Γ2 ⊂ · · · ⊂ Γ be a non-decreasing collection of discriminant functions, where
we assume the collections satisfy the estimation and approximation error conditions
"
#
b
E
sup RΨ,n (γ | q) − RΨ (γ | q) ≤ ǫest
(22a)
n
γ∈Γn ,q∈Qn
and
inf
γ∈Γn ,q∈Qn
RΨ (γ | q) −
inf
γ∈Γ,q∈Q
RΨ (γ | q) ≤ ǫapp
n ,
(22b)
app
where ǫest
→ 0 as n → ∞. Additionally, let R be the risk functional for the costn → 0 and ǫn
weighted misclassification loss Ψcw (Example 2), where Ψcw
y (α) = maxi {cyi | αi = maxj αj }. Then
we have the following result, which is a consequence of Theorems 1 and 2.
Theorem 3. Assume the conditions (22) hold and that γn and qn are approximate empirical Ψ-risk
minimizers, that is,
\[
\epsilon_n^{\rm opt} := \mathbb{E}\Big[\widehat R_{\Psi,n}(\gamma_n \mid q_n) - \inf_{\gamma\in\Gamma_n, q\in\mathcal Q_n}\widehat R_{\Psi,n}(\gamma \mid q)\Big] \to 0 \quad\text{as } n \to \infty.
\]
Let $R^\star(\mathcal Q) = \inf_{\gamma\in\Gamma, q\in\mathcal Q} R(\gamma \mid q)$. If the loss Ψ is (i) classification calibrated for the cost-weighted
loss $\Psi^{\rm cw}$ and (ii) universally equivalent to the cost-weighted loss, then
\[
R(\gamma_n \mid q_n) - R^\star(\mathcal Q) \stackrel{L_1}{\to} 0.
\]
Theorem 3 guarantees—at least under the estimation and approximation conditions (22)—that
empirical risk minimization yields a consistent procedure for minimizing the quantized Bayes risk
whenever the loss Ψ is classification calibrated and equivalent to the desired loss. We provide its
proof in Appendix C.
4.3 Examples of universal equivalence
In this section, we give several examples that build off of Theorems 1 and 2, showing that there
exist convex losses that allow optimal joint design of quantizers (or measurement strategies) and
discriminant functions, opening the way for potentially efficient convex optimization strategies. To
that end, we give two examples of hinge-like loss functions that are universally equivalent to the
zero-one loss for all prior distributions π. We also give examples of classification calibrated loss
functions that are not universally equivalent to the zero-one loss, although minimizing them sans
quantization yields Bayes-optimal classifiers.
Example 5 (Hinge losses): In this example, we consider a pairwise-type multiclass hinge loss,
where $\Psi_i^{\rm hinge} : \mathbb{R}^k \to \overline{\mathbb{R}}$ is defined by
\[
\Psi_i^{\rm hinge}(\alpha) = \sum_{j\ne i} [1 + \alpha_j]_+ + \mathrm{I}\{1^T\alpha = 0\}.
\]
We also consider the slight extension to weighted loss functions to address asymmetric losses of the
form (10) from Example 2. In this case, given the loss matrix $C \in \mathbb{R}_+^{k\times k}$, we set
\[
\Psi_i^{\rm hinge}(\alpha) = \sum_{j=1}^k c_{ij}[1 + \alpha_j]_+ + \mathrm{I}\{1^T\alpha = 0\}.
\]
We make the following observation, essentially due to Zhang [36, Theorem 8]. For completeness,
we include a proof in Appendix A.4.
Observation 1. Let $\phi : \mathbb{R} \to \mathbb{R}$ be any convex function that is bounded below, differentiable on
$(-\infty, 0]$, and satisfies $\phi'(0) < 0$. Then the loss $\Psi_y(\alpha) = \sum_{i=1}^k c_{yi}\phi(-\alpha_i)$ is classification calibrated
for the cost matrix C (Def. 3.1, Eq. (13)).
Taking $C = 11^T - I_{k\times k}$, we see that the hinge loss is calibrated for the zero-one loss (Ex. 1); taking
arbitrary $C \in \mathbb{R}_+^{k\times k}$, the weighted hinge loss is calibrated for the cost matrix C.
We claim that the associated uncertainty function (as defined by the construction (9)) for the
weighted hinge loss with matrix C = [c1 · · · ck ] is
\[
U(\pi) = \inf_{\alpha\in\mathbb{R}^k}\Big\{\sum_{i=1}^k \pi_i\Psi_i^{\rm hinge}(\alpha)\Big\} = k\,\min_l \pi^T c_l \tag{23}
\]
for π ∈ ∆k . Deferring the derivation of this equality, we have that UΨhinge (π) = k minl π T cl =
kUΨcw (π) (recall Example 2). Theorem 1 immediately guarantees that the (weighted) hinge loss
is universally equivalent to the (weighted) 0-1 loss. Even more, as we show in Lemma C.1 in
Appendix C (recall also Proposition 5), we have the quantitative calibration guarantee
\[
\sum_{i=1}^k \pi_i\Psi_i^{\rm hinge}(\alpha) - \inf_{\alpha'}\sum_{i=1}^k \pi_i\Psi_i^{\rm hinge}(\alpha') \ge \sum_{i=1}^k \pi_i\Psi_i^{\rm cw}(\alpha) - \inf_{\alpha'}\sum_{i=1}^k \pi_i\Psi_i^{\rm cw}(\alpha')
\]
for all π ∈ ∆k and α ∈ Rk , strengthening Observation 1.
We return to demonstrating equality (23) with $C = [c_1 \cdots c_k]$. We note that
\[
U(\pi) = \inf_{\alpha^T 1 = 0}\Big\{\sum_{y=1}^k \pi_y\sum_{i=1}^k c_{yi}[1+\alpha_i]_+\Big\} = \inf_{\alpha^T 1 = 0}\sum_{i=1}^k \pi^T c_i\, [1+\alpha_i]_+,
\]
and formulating the Lagrangian by introducing dual variable $\theta \in \mathbb{R}$, we have
\[
L(\alpha, \theta) = \sum_{i=1}^k \pi^T c_i\,[1+\alpha_i]_+ - \theta 1^T\alpha.
\]
The generalized KKT conditions [4] for this problem are given by taking subgradients of the Lagrangian. At optimum, we must have
\[
\nu_i \in \partial[1+\alpha_i]_+ = \begin{cases} 0 & \text{if } \alpha_i < -1\\ [0, 1] & \text{if } \alpha_i = -1\\ 1 & \text{if } \alpha_i > -1,\end{cases} \qquad \pi^T c_i\,\nu_i - \theta = 0, \quad 1^T\alpha = 0, \quad i = 1, \ldots, k.
\]
Without loss of generality, assume that $\pi^T c_1 = \min_j \pi^T c_j$. Set $\alpha^\star \in \mathbb{R}^k$ and $\theta^\star$ via
\[
\alpha_1^\star = (k-1), \quad \alpha_i^\star = -1 \text{ for } i \ge 2, \qquad \theta^\star = \min_j \pi^T c_j = \pi^T c_1.
\]
We have $1^T\alpha^\star = (k-1) - (k-1) = 0$, and setting $\nu_i = 1$ for i such that $\pi^T c_i = \min_j \pi^T c_j$ and
$\nu_i = \frac{\min_j \pi^T c_j}{\pi^T c_i} \in [0, 1]$ otherwise, we see that $\pi^T c_i\nu_i - \theta^\star = \min_j \pi^T c_j - \min_j \pi^T c_j = 0$ for all i; the
KKT conditions are satisfied. Thus $\alpha^\star$ and $\theta^\star$ are primal-dual optimal, yielding expression (23). ♣
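A short numerical check (ours, with a hypothetical prior) of equality (23): evaluating the weighted hinge risk at the primal point constructed from the KKT conditions above recovers $k\,\min_l \pi^T c_l$.

```python
# A minimal numerical check (hypothetical) of equality (23): the weighted hinge risk
# on {alpha : 1^T alpha = 0}, evaluated at the KKT point above, equals k * min_l pi^T c_l.
import numpy as np

def hinge_risk(alpha, pi, C):
    # sum_y pi_y sum_j c_{yj} [1 + alpha_j]_+  (alpha is assumed to satisfy 1^T alpha = 0)
    return float(pi @ (C @ np.maximum(1.0 + alpha, 0.0)))

pi = np.array([0.5, 0.3, 0.2])
C = np.ones((3, 3)) - np.eye(3)              # zero-one costs, so pi^T c_l = 1 - pi_l
k = len(pi)

l_star = np.argmin(pi @ C)                   # index with pi^T c_l minimal
alpha_star = -np.ones(k)                     # alpha*_i = -1 for i != l_star ...
alpha_star[l_star] = k - 1                   # ... and alpha*_{l_star} = k - 1
print(hinge_risk(alpha_star, pi, C))         # 1.5
print(k * np.min(pi @ C))                    # k * min_l pi^T c_l = 1.5

# Any other feasible alpha gives a value at least this large, e.g.:
print(hinge_risk(np.zeros(k), pi, C))        # 2.0
```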
Example 6 (Max-type losses and zero-one loss): We return to Example 3 and let $\Psi^{\rm fw}$ be the family-wise loss
\[
\Psi_i^{\rm fw}(\alpha) = 1 - \alpha_i + \max\Big\{\alpha_{(1)} - 1,\ \frac{\alpha_{(1)} + \alpha_{(2)}}{2} - \frac{1}{2},\ \ldots,\ \frac{1^T\alpha}{k} - \frac{1}{k}\Big\}.
\]
We know by Example 3 that the associated uncertainty is
\[
U_{\Psi^{\rm fw}}(\pi) = \inf_\alpha\Big\{\sum_{i=1}^k \pi_i\Psi_i^{\rm fw}(\alpha)\Big\} = 1 - \max_j \pi_j
\]
for $\pi \in \Delta_k$, and Proposition 5 shows that $\Psi^{\rm fw}$ is classification calibrated. As the uncertainty
function satisfies $U_{\Psi^{\rm fw}} \equiv U_{\Psi^{\rm zo}}$, we have that $\Psi^{\rm fw}$ and $\Psi^{\rm zo}$ are universally equivalent by the construction (15) and Theorems 1 and 2. ♣
For our final example, we consider the logistic loss, which is classification calibrated but—as we
show—not universally equivalent to the zero-one loss.
Example 7 (Logistic loss, Example 4 continued): The logistic loss $\Psi_i^{\log}(\alpha) = \log(\sum_{j=1}^k e^{\alpha_j - \alpha_i})$
has uncertainty function $U(\pi) = -\sum_{i=1}^k \pi_i\log\pi_i$ by Example 4. It is clear that there are no a, b, c
such that $U_{\Psi^{\rm zo}}(\pi) = 1 - \max_j\pi_j = a U_{\Psi^{\log}}(\pi) + b^T\pi + c$ for all $\pi \in \Delta_k$. Theorem 1 shows that the
logistic loss is not universally equivalent to the zero-one loss. ♣
5 Proof of Theorems 1 and 2
We divide the proof of the theorems into two parts. The “if” part is straightforward; the “only if”
is substantially more complex, so we present it in parts.
Proof (if direction) We give the proof for Theorem 2, as that for Theorem 1 is identical. Assume
that $\mathrm{dom}\, f_\pi^{(1)} = \mathrm{dom}\, f_\pi^{(2)}$ and there exist some $a > 0$, $b \in \mathbb{R}^{k-1}$, and $c \in \mathbb{R}$ such that equation (21)
holds. By Definition 2.1 of multi-way f-divergences, for any quantizer q, we have
\[
D_{f_\pi^{(1)}}(P_1, \ldots, P_{k-1}\|P_k \mid q) = a\, D_{f_\pi^{(2)}}(P_1, \ldots, P_{k-1}\|P_k \mid q) + b^T 1 + c,
\]
as $\int_{\mathcal X} dP_i = 1$. Applying the relationship (19), we obtain
\[
I(X, \pi; U_{\Psi^{(1)}} \mid q) = a\, I(X, \pi; U_{\Psi^{(2)}} \mid q) + b^T 1 + c.
\]
As q was arbitrary and a > 0, the universal equivalence of Ψ(1) and Ψ(2) follows immediately.
We turn to the “only if” part of the proofs of Theorems 1 and 2. A roadmap of the proof is as
follows: we first provide a definition of what we call order equivalence of convex functions, which is
related to the equivalence of f-divergences and uncertainty functions (Def. 5.1). Then, for any two
loss functions Ψ(1) and Ψ(2) that are universally equivalent, we show that the associated uncertainty
functions UΨ(1) and UΨ(2) , as constructed in the infimal representation (9), and the functions f (1) and
f (2) generating the f -divergences via expression (15), are order equivalent (Lemmas 5.1 and 5.3).
After this, we provide a characterization of order equivalent closed convex functions (Lemma 5.4),
which is the lynchpin of our analysis. The lemma shows that for any two order equivalent closed
convex functions f1 and f2 with dom f1 = dom f2 ⊂ Rk+ , there are parameters a > 0, b ∈ Rk−1 , and
$c \in \mathbb{R}$ such that $f_1(t) = a f_2(t) + b^T t + c$ for all $t \in \mathrm{dom}\, f_1 = \mathrm{dom}\, f_2$. This proves the “only if”
part of Theorems 1 and 2, yielding the desired result.
in the body of the paper, deferring a number of technical nuances to appendices.
5.1 Universal equivalence and order equivalence
By Definition 4.1 (and its equivalent variant stated in (20)), universally equivalent losses Ψ(1) and
Ψ(2) induce the same ordering of quantized information measures and f -divergences. The next
definition captures this ordering in a slightly different way.
Definition 5.1. Let $f_1 : \mathbb{R}_+^k \to \mathbb{R}$ and $f_2 : \mathbb{R}_+^k \to \mathbb{R}$ be closed convex functions. Let $m \in \mathbb{N}$ be
arbitrary and the matrices $A, B \in \mathbb{R}_+^{k\times m}$ satisfy $A1 = B1$, where A has columns $a_1, \ldots, a_m \in \mathbb{R}_+^k$
and B has columns $b_1, \ldots, b_m \in \mathbb{R}_+^k$. Then $f_1$ and $f_2$ are order-equivalent if for all $m \in \mathbb{N}$ and all
such matrices A and B we have
\[
\sum_{j=1}^m f_1(a_j) \le \sum_{j=1}^m f_1(b_j) \quad\text{if and only if}\quad \sum_{j=1}^m f_2(a_j) \le \sum_{j=1}^m f_2(b_j). \tag{24}
\]
As suggested by the above context, order equivalence has strong connections with universal
equivalence of loss functions Ψ and associated f -divergences and uncertainty functions. The next
three lemmas make this explicit; we begin by considering uncertainty functions.
Lemma 5.1. If losses Ψ(1) and Ψ(2) are lower bounded and universally equivalent, then the associated uncertainty functions of the construction (9) are order equivalent over ∆k ⊂ Rk+ .
Proof Let $U_1$ and $U_2$ be the associated uncertainty functions, noting that $\mathrm{dom}\, U_1 = \mathrm{dom}\, U_2 = \Delta_k$
because $\inf_{\pi\in\Delta_k}\min_i U_i(\pi) > -\infty$. Let the matrices $A = [a_1 \cdots a_m] \in \mathbb{R}_+^{k\times m}$ and $B \in \mathbb{R}_+^{k\times m}$
satisfy $a_i, b_i \in \Delta_k$ for each $i = 1, \ldots, m$, and let $v = \frac{1}{m}A1 = \frac{1}{m}B1 \in \Delta_k$. We will show that
$\sum_{j=1}^m U_1(a_j) \le \sum_{j=1}^m U_1(b_j)$ if and only if $\sum_{j=1}^m U_2(a_j) \le \sum_{j=1}^m U_2(b_j)$, that is, expression (24)
holds, by constructing appropriate distributions $P_{1:k}$ and π, then applying the universal equivalence
of losses $\Psi^{(1)}$ and $\Psi^{(2)}$.

Let $M_0$ be any integer large enough that $v_0 = \frac{1}{k}(1 + \frac{1}{M_0})1 - \frac{1}{M_0}v \in \mathbb{R}_+^k$, so that $v_0 \in \Delta_k$. Then
define the vectors $\tilde a_1 = v_0, \ldots, \tilde a_{mM_0} = v_0$, and let
\[
A^{\rm ext} = [a_1 \cdots a_m\ \tilde a_1 \cdots \tilde a_{mM_0}] \in \mathbb{R}_+^{k\times M} \quad\text{and}\quad B^{\rm ext} = [b_1 \cdots b_m\ \tilde a_1 \cdots \tilde a_{mM_0}] \in \mathbb{R}_+^{k\times M},
\]
where $M = (M_0 + 1)m$. These satisfy $A^{\rm ext}1 = B^{\rm ext}1 = \frac{M}{k}1$. We let $a^{\rm ext}$ and $b^{\rm ext}$ denote the
columns of these extended matrices.

Now, let the spaces $\mathcal X = [M] \times [M]$ and $\mathcal Z = [M]$. Define quantizers $q_1, q_2 : \mathcal X \to \mathcal Z$ by
$q_1(i, j) = i$ and $q_2(i, j) = j$. For $l = 1, \ldots, k$, define the distributions $P_l$ on $\mathcal X$ by
\[
P_l(i, j) = \frac{k^2}{M^2}\, a^{\rm ext}_{il}\, b^{\rm ext}_{jl}, \quad\text{so}\quad \sum_{j=1}^M P_l(i, j) = \frac{k}{M}a^{\rm ext}_{il}\cdot\frac{k}{M}\sum_{j=1}^M b^{\rm ext}_{jl} = \frac{k}{M}a^{\rm ext}_{il},
\]
and similarly $\sum_i P_l(i, j) = \frac{k}{M}b^{\rm ext}_{jl}$. Let $\pi = \frac{1}{k}1$ be the uniform prior distribution on the label
$Y \in \{1, \ldots, k\}$, and note that the posterior probability
\[
\tilde\pi(q_1^{-1}(\{i\})) = \bigg[\frac{\pi_l\sum_j P_l(i, j)}{\sum_{l'}\pi_{l'}\sum_j P_{l'}(i, j)}\bigg]_{l=1}^k = \bigg[\frac{a^{\rm ext}_{il}}{\sum_{l'} a^{\rm ext}_{il'}}\bigg]_{l=1}^k = a^{\rm ext}_i \in \Delta_k,
\]
because $P_l(q_1^{-1}(i)) = \sum_j P_l(i, j) = \frac{k}{M}a^{\rm ext}_{il}$, and similarly $\tilde\pi(q_2^{-1}(\{j\})) = b^{\rm ext}_j \in \Delta_k$. In particular,
taking the expectation over X drawn according to $\sum_{l=1}^k \pi_l P_l$, we have
\[
\mathbb{E}[U_\Psi(\tilde\pi(q_1^{-1}(q_1(X))))] = \sum_{i,l}\frac{1}{k}P_l(q_1^{-1}(i))\, U_\Psi(\tilde\pi(q_1^{-1}(i))) = \frac{1}{M}\sum_{i,l} a^{\rm ext}_{il}\, U_\Psi(a^{\rm ext}_i) = \frac{1}{M}\sum_{i=1}^M U_\Psi(a^{\rm ext}_i),
\]
because $\sum_l a^{\rm ext}_{il} = 1$. Similarly, we have $\mathbb{E}[U_\Psi(\tilde\pi(q_2^{-1}(q_2(X))))] = \frac{1}{M}\sum_{j=1}^M U_\Psi(b^{\rm ext}_j)$. Recalling
the definitions (19) and (3) of the (quantized) information associated with the uncertainty U, we
immediately find that the universal equivalence of losses $\Psi^{(1)}$ and $\Psi^{(2)}$ implies that (for $\pi = \frac{1}{k}1$)
\[
I(X, \pi; U_1 \mid q_1) = U_1(\pi) - \frac{1}{M}\sum_{i=1}^M U_1(a^{\rm ext}_i) \le U_1(\pi) - \frac{1}{M}\sum_{i=1}^M U_1(b^{\rm ext}_i) = I(X, \pi; U_1 \mid q_2)
\]
if and only if
\[
I(X, \pi; U_2 \mid q_1) = U_2(\pi) - \frac{1}{M}\sum_{i=1}^M U_2(a^{\rm ext}_i) \le U_2(\pi) - \frac{1}{M}\sum_{i=1}^M U_2(b^{\rm ext}_i) = I(X, \pi; U_2 \mid q_2).
\]
Noting that $a^{\rm ext}_i = b^{\rm ext}_i$ for each $i \ge m + 1$, we rearrange the preceding equivalent statements by
adding $\frac{1}{M}\sum_{i\ge m+1} U(a^{\rm ext}_i)$ to each side to obtain that the $U_i$ satisfy inequality (24).
We now provide a result for f -divergences similar to that for uncertainty functions. Before
stating the lemma, we give a matrix characterization of non-negative vectors with equal sums
similar to the characterization of majorization via doubly stochastic matrices (cf. Marshall et al.
[23]; see Appendix B.1 for a proof).
Lemma 5.2. Vectors $a, b \in \mathbb{R}_+^m$ satisfy $1^T a = 1^T b$ if and only if there exists a matrix $Z \in \mathbb{R}_+^{m\times m}$
such that $Z1 = a$ and $Z^T 1 = b$.
Using Lemma 5.2, we can construct a particular discrete space X , discrete measure on X , and
quantizers to show that universal equivalence implies order equivalence for f -divergence functionals.
Lemma 5.3. If losses Ψ(1) and Ψ(2) are universally equivalent for the prior π (Def. 4.1) and
lower-bounded, the corresponding f -divergences fΨ(1) ,π and fΨ(2) ,π of construction (15) are order
equivalent.
Proof Let $f_i = f_{\Psi^{(i)},\pi}$ for shorthand. Note that $\mathrm{dom}\, f_1 = \mathrm{dom}\, f_2 = \mathbb{R}_+^{k-1}$ by the construction (15), because $\min_i \inf_{\alpha\in\mathbb{R}^k}\Psi_i(\alpha) > -\infty$. Now, given matrices $A, B \in \mathbb{R}_+^{(k-1)\times m}$ satisfying
$A1 = B1$ (where $A = [a_1 \cdots a_m]$ and $B = [b_1 \cdots b_m]$), we construct distributions $P_i$ and
quantizers $q_1$ and $q_2$ such that for any f we have
\[
D_f(P_1, \ldots, P_{k-1}\|P_k \mid q_1) = C\sum_{j=1}^m f(a_j) \quad\text{and}\quad D_f(P_1, \ldots, P_{k-1}\|P_k \mid q_2) = C\sum_{j=1}^m f(b_j),
\]
where $C > 0$ is a constant. We then use Definition 4.1 of loss equivalence to show that $f_1$ and $f_2$
are order equivalent.

With that in mind, take M to be any positive integer such that $M > \max\{\|A1\|_\infty, m\}$. We
enlarge A and B into matrices $A^{\rm ext}, B^{\rm ext} \in \mathbb{R}_+^{(k-1)\times M}$, respectively, adding $M - m$ columns. To
construct these matrices, let $a^{\rm ext}_{i,m+1} = M - \sum_{j=1}^m a_{ij}$ and $b^{\rm ext}_{i,m+1} = M - \sum_{j=1}^m b_{ij}$, for $i = 1, \ldots, k-1$,
and set $a^{\rm ext}_{il} = b^{\rm ext}_{il} = 0$ for all $m + 1 < l \le M$. The enlarged matrices $A^{\rm ext}$ and $B^{\rm ext}$ thus satisfy
$\frac{1}{M}A^{\rm ext}1 = \frac{1}{M}B^{\rm ext}1 = 1$, and their columns belong to $\mathrm{dom}\, f_1 = \mathrm{dom}\, f_2 = \mathbb{R}_+^{k-1}$.

Let the spaces $\mathcal X = \{(i, j) \mid 1 \le i \le M, 1 \le j \le M\}$ and $\mathcal Z = \{1, 2, \ldots, M\}$. Define quantizers
$q_1, q_2 : \mathcal X \to [M]$ by $q_1(i, j) = i$ and $q_2(i, j) = j$. As $A^{\rm ext}1 = B^{\rm ext}1 = M1$, Lemma 5.2 guarantees
the existence of matrices $Z^l = [z^l_{ij}] \in \mathbb{R}_+^{M\times M}$ such that
\[
Z^l 1 = \frac{1}{M}[a^{\rm ext}_{lj}]_{j=1}^M \in \mathbb{R}_+^M \quad\text{and}\quad (Z^l)^T 1 = \frac{1}{M}[b^{\rm ext}_{lj}]_{j=1}^M \in \mathbb{R}_+^M,
\]
which implies that for all $l \in [k-1]$, the matrix $Z^l$ satisfies $\sum_{ij} z^l_{ij} = 1$. For each $l \in [k-1]$,
define the probability distribution $P_l$ on $\mathcal X$ by $P_l((i, j)) = z^l_{ij}$ for $i, j \in [M]$. Moreover, let $P_k$ be
the distribution defined by $P_k((i, j)) = M^{-2}$.

Under this quantizer design and choice of distributions $P_l$, we have for any prior $\pi \in \mathbb{R}_+^k$ (upon
which the functions $f_1$ and $f_2$ implicitly depend) and $f = f_1$ or $f = f_2$ that
\[
D_f(P_1, \ldots, P_{k-1}\|P_k \mid q_1) = \frac{1}{M}\sum_{j=1}^M f(a^{\rm ext}_j) \quad\text{and}\quad D_f(P_1, \ldots, P_{k-1}\|P_k \mid q_2) = \frac{1}{M}\sum_{j=1}^M f(b^{\rm ext}_j),
\]
because $\sum_{j=1}^M P_l((i, j)) = \sum_{j=1}^M z^l_{ij} = a^{\rm ext}_{li}/M$ and $\sum_{j=1}^M P_k((i, j)) = M^{-1}$, and similarly for $B^{\rm ext}$.
By the loss equivalence of $\Psi^{(1)}$ and $\Psi^{(2)}$ (recall Def. 4.1), we obtain that
\[
\sum_{j=1}^M f_1(a^{\rm ext}_j) \le \sum_{j=1}^M f_1(b^{\rm ext}_j) \quad\text{if and only if}\quad \sum_{j=1}^M f_2(a^{\rm ext}_j) \le \sum_{j=1}^M f_2(b^{\rm ext}_j).
\]
Note that $a^{\rm ext}_j = b^{\rm ext}_j \in \mathrm{dom}\, f_i$ for each $j > m$ and $i = 1, 2$. Moreover, $a^{\rm ext}_j = a_j$ and $b^{\rm ext}_j = b_j$ for
$1 \le j \le m$, so the preceding display is equivalent to the order equivalence (24).
5.2
Characterization of the order equivalence of convex functions
Lemmas 5.1 and 5.3 show that order equivalence (Def. 5.1) and universal equivalence (Def. 4.1)
are strongly related. With that in mind, we now present a lemma fully characterizing the order
equivalence of closed convex functions with identical domains.
Lemma 5.4. Let f1 , f2 : Ω → R be closed convex functions, where Ω ⊂ Rk+ is a convex set. Then
f1 and f2 are order equivalent on Rk if and only if there exist a > 0, b ∈ Rk , and c ∈ R such that
for all t ∈ Ω
f1 (t) = af2 (t) + bT t + c.
(25)
Lemma 5.4 immediately shows that Theorems 1 and 2 hold. In the first case, we apply Lemma 5.1,
which shows that the uncertainty functionals UΨ(1) and UΨ(2) are order equivalent on Ω = ∆k ;
21
Lemma 5.4 yields Theorem 1 as desired. In the second case, Lemma 5.3 implies that the divergence
functionals fΨ(1) ,π and fΨ(2) ,π are order equivalent, thus giving Theorem 2.
The proof of Lemma 5.4 relies on several intermediate lemmas and definitions, which we turn
to presently. In brief, we prove Lemma 5.4 via the following steps:
1. We show that it is no loss of generality to assume that int Ω 6= ∅.
2. We show that Lemma 5.4 holds on any simplex with non-empty interior. Explicitly, considering
any two order equivalent functions f1 and f2 , for any simplex E there exist aE , bE , and cE such
that f1 (t) = aE f2 (t) + bTE t + cE , for all t ∈ E. (Lemma 5.9)
3. We show that any convex set can be covered by simplices (Lemma 5.10).
4. We show that if for each simplex E ⊂ Ω the relationship (25) holds for some aE , bE , and cE ,
then there exist a, b, and c such that the relationship (25) holds on all of Ω (Lemma 5.12).
We defer technical (convex-analytic) proofs to appendices to avoid disrupting the flow of the argument.
5.2.1
It is no loss of generality to assume non-empty interior in Lemma 5.12
Let H = aff Ω be the affine hull of Ω, where dim H = l ≥ 1. (If dim H = 0, then Ω is a single
point, and Lemma 5.4 is trivial.) We argue that if Lemma 5.4 holds for sets Ω such that int Ω 6= ∅
it holds generally; thus we temporarily
assume its truth
for convex Ω with int Ω 6= ∅.
Since dim H = l, we have H = Av + d | v ∈ Rl for some full column-rank matrix A ∈ Rk×l
and d ∈ Rk . As Ω has non-empty interior relative to H (e.g. [15, Theorem 2.1.3]), we have
Ω = {Av + d | v ∈ Ω0 } for a convex set Ω0 ⊂ Rl with int Ω0 6= ∅. Defining fe1 (v) = f1 (Av + d) and
fe2 (v) = f2 (Av + d), where fei (v) = ∞ for v 6∈ Ω0 , we have that fe1 and fe2 are order equivalent on
Ω0 ⊂ Rl . By assumption, Lemma 5.4 holds for Ω0 , so there exist a > 0, eb ∈ Rl , e
c ∈ R such that
f1 (Av + d) = fe1 (v) = afe2 (v) + heb, vi + c, for v ∈ Ω0 .
As A is full column rank, for all t ∈ Ω there exists a unique vt ∈ Ω0 such that t = Avt + d, and
the mapping t 7→ vt is linear, i.e., vt = Et + g for some E ∈ Rl×k and g ∈ Rl (we may take E ∈ Rl×k
to be any left-inverse of A and g = −Ed, so v = Et − Eg). We obtain that fi (t) = fei (Et + g) for
i = 1, 2, whence
f1 (t) = fe1 (Et + g) = afe2 (Et + g) + heb, Et + gi + e
c = af2 (t) + hE T eb, ti + heb, gi + e
c
for all t ∈ Ω, which is our desired result. From this point forward, we thus assume w.l.o.g. that
int Ω 6= ∅.
5.2.2
Proof of Lemma 5.4 for simplices
As outlined above, we first show Lemma 5.4 holds for simplices. To do this, we require the definition
of affine independence.
Definition 5.2. Points x0 , x1 , . . . xm ∈ Rk , m ≤ k + 1, are affinely independent if the vectors
x1 − x0 , x2 − x0 , . . . , xm − x0 ,
are linearly independent. A set E ⊂ Rk is a simplex if E = Conv{x0 , x1 , . . . , xk } where x0 , . . . , xk
are affinely independent.
22
Note that the simplex E = Conv{x0 , x1 , . . . , xk } ⊂ Rk has non-empty interior.
We now state four technical lemmas, whose proofs we defer, that we use to prove Lemma 5.9,
which states that Lemma 5.4 holds on simplices.
P
k
Lemma 5.5. Let x1 , . . . , xm ∈ Rk+ , α ∈ Qm satisfy 1T α = 1, and y = m
i=1 αi xi with y ∈ R+ . If
k
k
f1 : R+ → R and f2 : R+ → R are order equivalent, then
m
X
i=1
αi f1 (xi ) ≤ f1 (y) if and only if
m
X
i=1
αi f2 (xi ) ≤ f2 (y).
See Section B.2 for a proof of Lemma 5.5. As an immediate consequence of Lemma 5.5, we see that
if α ∈ Qn+ satisfies 1T α = 1 and x1 , . . . , xn ∈ Rk+ , then
!
!
n
n
n
n
X
X
X
X
αi f2 (xi ).
(26)
αi xi =
αi f1 (xi ) if and only if f2
αi xi =
f1
i=1
i=1
i=1
i=1
Closed convex functions equal on a dense subset of a convex set Ω are equal on Ω (e.g. [15,
Proposition IV.1.2.5]); this is useful as it means we need only prove equivalence results for convex
functions on dense subsets of their domains.
Lemma 5.6. Let f1 : Ω → R and f2 : Ω → R be closed convex functions satisfying f1 (t) = f2 (t)
for all t in a dense subset of Ω. Then we have f1 (t) = f2 (t) for t ∈ Ω.
The next lemma shows that we can make the equality (25) hold for the extreme points and
centroid of any simplex (see Section B.3 for a proof).
Lemma 5.7. Let f1 , f2 : Ω → R be closed convex and let x0 , x1 , . . . , xk ∈ Ω ⊂ Rk be affinely
independent. Then there exist a > 0, b P
∈ Rk , and c such that f1 (x) = af2 (x) + bT x + c for
k
1
x ∈ {x0 , . . . , xk , xcent }, where xcent = k+1
i=0 xi .
Lastly, we characterize the linearity of convex functions over convex hulls of finite collections of
points. (See Sec. B.4 for a proof of this lemma.)
1 Pm
Lemma
5.8.
Let
f
:
Ω
→
R
be
convex
with
x
,
.
.
.
,
x
∈
Ω
and
x
=
1
m
cent
i=1 xi . If f (xcent ) =
m
1 Pm
T
m
i=1 f (xi ), then for all λ ∈ R+ such that 1 λ = 1,
m
!
m
m
X
X
λi f (xi ).
λi xi =
f
i=1
i=1
We now state and prove Lemma 5.4 for simplices using the preceding four lemmas, assuming
that int Ω 6= ∅.
Lemma 5.9. Let E = Conv{x0 , . . . , xk } ⊂ Ω where x0 , . . . , xk are affinely independent. If f1 and
f2 are order equivalent, then there exist a > 0, b ∈ Rk , and c ∈ R such that
f1 (t) = af2 (t) + bT t + c for all t ∈ E.
23
Proof Let δi = xi −x0 , 1 ≤ i ≤ k, be a linearly independent basis for Rk and let D = [δ1 · · · δk ] ∈
Rk×k . Define the two convex functions g1 : Rk+ → R and g2 : Rk+ → R via
!
!
k
k
X
X
xi y i
δi yi = fj (1 − 1T y)x0 +
gj (y) = fj (x0 + Dy) = fj x0 +
i=1
i=1
for j = 1, 2. If we define Y = {y ∈ Rk+ | 1T y ≤ 1}, then we know that g1 and g2 are continuous,
defined, convex, and order equivalent on Y . The lemma is equivalent to the existence of a > 0,
b ∈ Rk , and c ∈ R such that g1 (y) = ag2 (y) + bT y + c for y ∈ Y .
Let ei ∈ Rk for 1 ≤ i ≤ k be theP
standard basis for Rk and e0 = 0 be shorthand for the all-zeros
1
vector. Further, let ecenter = k+1 ki=0 ei be the centroid of Y (so Y = Conv{e0 , . . . , ek }). We
divide our discussion into two cases.
1 Pk
Linear case: Suppose that g1 (ecenter ) = k+1
i=0 g1 (ei ). Then by order equivalence of g1 and
1 Pk
g2 (Eq. (26)) we have g2 (ecenter ) = k+1 i=0 g2 (ei ). Lemma 5.8 thus implies that g1 and g2 are
linear on Y = Conv{e0 , . . . , ek }. That is, gi (t) = bTi t + ci for some bi ∈ Rk and ci ∈ R for i = 1, 2.
The conclusion of the lemma follows by choosing a =P
1, b = b1 − b2 , and c = c1 − c2 .
k
1
Nonlinear case: Suppose that g1 (ecenter ) 6= k+1
i=0 g1 (ei ). By the convexity of g1 , we then
P
k
1
have g1 (ecenter ) < k+1 i=0 g1 (ei ), and order equivalence (Lemma 5.5) implies that g2 (ecenter ) <
1 Pk
k
i=0 g2 (ei ). Lemma 5.7 guarantees the existence of a > 0, b ∈ R , c ∈ R such that
k+1
g1 (y) = ag2 (y) + bT y + c for y ∈ {e0 , e1 , . . . , ek , ecenter }.
For shorthand, let h1 (y) = g1 (y) and h2 (y) = ag2 (y) + bT y + c, noting that h1 (y) = h2 (y) for
y ∈ {e0 , . . . , ek , ecenter } and that h1 and h2 are order equivalent. We will show that this implies
h1 (y) = h2 (y) for y ∈ Y = {y ∈ Rk+ | 1T y ≤ 1} = Conv{e0 , e1 , . . . , ek , ecenter },
(27)
which of course is equivalent to g1 (y) = ag2 (y)+bT y+c, yielding our desired result. For j =P
1, 2, and
T
T
k
y ∈ Q+ such that 1 y ≤ 1, we use y0 = 1 − 1 y for shorthand. We may thus write y = ki=0 yi ei
P
and have ki=0 yi = 1, so [y0 y1 · · · yk ] ∈ ∆k+1 . We define the linear functions ϕj : R → R as
ϕj (r) = −rhj (y) +
k X
i=0
1−r
yi r −
k+1
hj (ei ) + (1 − r)hj (ecenter )
for j = 1, 2. Then we have
k
1 X
ϕj (0) = hj (ecenter ) −
hj (ei ) < 0.
k+1
i=0
by the non-linearity of g1 and g2 . Further, by the convexity of h1 and h2 we have
ϕj (1) = −hj (y) +
k
X
i=0
yi hj (ei ) = −hj (y) +
k
X
i=1
yi hj (ei ) + (1 − 1T y)hj (e0 ) ≥ 0.
We claim that the order equivalence of h1 and h2 on Y = {y ∈ Rk+ | 1T y ≤ 1} implies that
sign(ϕ1 (r)) = sign(ϕ2 (r)) for r ∈ Q ∩ [0, 1].
24
(28)
To see this, fix any r ∈ Q ∩ (0, 1], and note that y0 = 1 − 1T y ∈ [0, 1] ∩ Q. Consider the expression
k X
i=0
1−r
yi r −
k+1
hj (ei ) + (1 − r)hj (ecenter ) ≤ rhj (y).
1−r
) ∈ Q for i = 0, . . . , k and αk+1 =
Dividing both sides by r > 0 and defining αi = r −1 (yi r − k+1
we see that
k
X
αi hj (ei ) + αk+1 hj (ecenter ) ≤ hj (y).
ϕj (r) ≤ 0 if and only if
1−r
r ,
i=0
Moreover, we have
k
X
1T α
= 1, and
αi ei + αk+1 ecenter =
k
X
i=0
i=0
k
X
1−r
r
yi ei −
ei +
ecenter = y.
r(k + 1)
1−r
i=0
Applying Lemma 5.5 immediately yields that ϕ1 (r) ≤ 0 if and only if ϕ2 (r) ≤ 0 for all r ∈ (0, 1] ∩Q.
Noting that ϕ1 (0) < 0 and ϕ2 (0) < 0, we obtain the sign equality (28).
Using the equality (28), the linearity of ϕj then implies sign(ϕ1 (r)) = sign(ϕ2 (r)) for all r ∈
[0, 1], and, moreover, there exists some r ⋆ ∈ (0, 1] such that ϕ1 (r ⋆ ) = ϕ2 (r ⋆ ) = 0. In particular, we
have
0 = ϕ1 (r ⋆ ) − ϕ2 (r ⋆ ) = −r ⋆ h1 (y) + r ⋆ h2 (y),
where we have used that h1 (ei ) = h2 (ei ) for i = 0, . . . , k and h1 (ecenter ) = h2 (ecenter ). As r ⋆ > 0
and y ∈ Qk+ was arbitrary subject to 1T y ≤ 1, we find that
h1 (y) = h2 (y)
for all y ∈ Qk+ such that 1T y ≤ 1.
The set Y ∩ Q is dense in Y and h1 and h2 are closed, so Lemma 5.6 implies that h1 (y) = h2 (y)
for y ∈ Y = Conv{e0 , . . . , ek } = {y ∈ Rk+ | 1T y ≤ 1}. That is, expression (27) holds, and the proof
is complete.
5.2.3
Covering sets with simplices
With Lemma 5.9 in hand, we now show that the special case for simplices is sufficient to show the
general Lemma 5.4. First, we show that simplices essentially cover convex sets Ω.
Lemma 5.10. Let x, y be arbitrary points in int Ω ⊂ Rk . Then, there exist x0 , x1 , . . . , xk ∈ Ω
such that x, y ∈ int E, where E = Conv{x0 , x1 , . . . , xk }. Moreover, for any points x2 , . . . , xk
that make x, y, x2 , . . . , xk affinely independent (Def. 5.2), there exist x0 , x1 ∈ Ω such that x, y ∈
int Conv{x0 , x1 , . . . , xk }.
Before proving Lemma 5.10, we state a technical lemma about interior points of convex sets.
Lemma 5.11 (Hiriart-Urruty and Lemaréchal [15], Lemma III.2.1.6). Let C ⊂ Rk be a convex set,
a ∈ rel int C and b ∈ cl C. Then for any λ ∈ [0, 1), we have λa + (1 − λ)b ∈ rel int C.
25
Ω
z2 = x2
x1
y
x
x0
Figure 1: The construction in Lemma 5.10.
Proof of Lemma 5.10
Take z2 , . . . , zk ∈ Ω arbitrarily but in general position, so that the
points x, y, z2P
, . . . , zk and y, x, z2 , . . . , zk are affinely independent (Def. 5.2). Now, define zcent =
k
1
(x
+
y
+
i=2 zi ), so that zcent ∈ int Ω (Lemma 5.11), and choose ǫ > 0 small enough that the
k+1
points z0 = x + ǫ(x − zcent ) and z1 = y + ǫ(y − zcent ) are both in Ω. (This is possible as x, y ∈ int Ω;
see Figure 1.) Then we find that x = (z0 + ǫzcent )/(1 + ǫ) and y = (z1 + ǫzcent )/(1 + ǫ), so that
!
k
X
1
z0
z1
2ǫ
zcent =
zi ,
+
+
zcent +
k+1 1+ǫ 1+ǫ 1+ǫ
i=2
or by rearranging, zcent = λ0 (z0 + z1 ) + λ1
λ0 =
2ǫ
1−
(k + 1)(1 + ǫ)
−1
Pk
i=2 zi ,
where
1
and λ1 =
(k + 1)(1 + ǫ)
2ǫ
1−
(k + 1)(1 + ǫ)
−1
1
.
k+1
Noting that 2λ0 + (k − 1)λ1 = 1 and λi > 0, we obtain that zcent ∈ int Conv{z0 , . . . , zk }. As the
points zi are in general position, and as
x=
ǫ
1
ǫ
1
z0 +
zcent and y =
z1 +
zcent ,
1+ǫ
1+ǫ
1+ǫ
1+ǫ
we have x, y ∈ int Conv{z0 , . . . , zk } by Lemma 5.11.
5.2.4
Extension from a single simplex to all of Ω
We use Lemma 5.10 to show the following lemma, which implies Lemma 5.4.
26
Lemma 5.12. In addition to the conditions of Lemma 5.4, assume that for all simplices E ⊂ int Ω
there exist aE > 0, bE ∈ Rk , and cE ∈ R such that f1 (t) = aE f2 (t) + bTE t + cE for t ∈ E. Then
there exist a > 0, b ∈ Rk , and c ∈ R such that
f1 (t) = af2 (t) + bT t + c for t ∈ Ω.
Coupled with Lemma 5.9, Lemma 5.12 immediately yields Lemma 5.4; indeed, Lemma 5.9 shows
that for any simplex E the conditions of Lemma 5.12 holds, so that Lemma 5.4 follows, i.e.,
f1 (t) = af2 (t) + bT t + c for all t ∈ Ω.
Proof For i ∈ {1, 2} define the sets
Si = {(x, y) ∈ int Ω × int Ω | ∇fi (x) 6= ∇fi (y)}
∪ {(x, y) ∈ int Ω × int Ω | ∇fi (x) or ∇fi (y) does not exist} ⊂ Ω × Ω.
We divide our discussion into two cases.
Case 1. First we suppose that S1 = S2 = ∅. Then for i = 1, 2, fi are differentiable in int Ω in
this case, we have that ∇fi (x) = ∇fi (y) for all x, y ∈ int Ω. Then by continuity of the fi on int Ω,
we must have fi (t) = bTi t + ci for i = 1, 2. The result follows by taking a = 1, b = b1 − b2 and
c = c1 − c2 and applying Lemma 5.6.
Case 2. We have that at least one of S1 and S2 is non-empty. Without loss of generality, say
S1 6= ∅. Choose a pair (x⋆ , y ⋆ ) ∈ S1 and consider the collection of sets
M = {E = Conv{x0 , x1 , . . . , xk } | x⋆ , y ⋆ ∈ int E and x0 , x1 , . . . xk ∈ Ω} .
We show that for any E1 , E2 ∈ M, we have aE1 = aE2 , bE1 = bE2 and cE1 = cE2 . We know that
int E1 ∩ int E2 6= ∅, as x⋆ ∈ int E for all E ∈ M. Now, assume for the sake of contradiction that
aE1 6= aE2 . For all t ∈ E1 ∩ E2 we have
f1 (t) = aE1 f2 (t) + bTE1 t + cE1 and f1 (t) = aE2 f2 (t) + bTE2 t + cE2 .
(29)
By subtracting the preceding equations from one another after multiplying by aE2 and aE1 , respectively, one obtains that for t ∈ E1 ∩ E2 ,
f1 (t) =
aE 2
1
(aE2 bE1 − aE1 bE2 )T t + (aE2 cE1 − aE1 cE2 ) .
− aE 1
This yields a contradiction, since we have assumed that either (i) ∇f1 (x⋆ ) or ∇f2 (y ⋆ ) does not exist
or (ii) ∇f1 (x⋆ ) 6= ∇f1 (y ⋆ ), while x⋆ , y ⋆ ∈ int E1 ∩ int E2 = int(E1 ∩ E2 ). Thus aE1 = aE2 . To obtain
bE1 = bE2 , note that int(E1 ∩ E2 ) 6= ∅, as x⋆ , y ⋆ ∈ int(E1 ∩ E2 ). Subtracting the equalities (29),
we find 0 = f1 (t) − f1 (t) = (bE1 − bE2 )T t + cE1 − cE2 for t ∈ E1 ∩ E2 , which has non-empty interior.
That bE1 = bE2 immediately implies cE1 = cE2 . Hence, there exist some a > 0, b ∈ Rk , and c ∈ R
such that for all sets E ∈ M, we have aE = S
a, bE = b, and cE = c.
We complete the proof by showing that {E | E ∈ M} is dense in Ω. Define Ω◦ ⊂ Ω as
Ω◦ = {t ∈ Ω | there exist z3 , . . . , zk such that x⋆ , y ⋆ , t, z3 , . . . , zk are affinely independent} .
The set Ω◦ forms a dense subset of Ω, as x⋆ ∈ int Ω 6= ∅. For any t ∈ Ω◦ , LemmaS5.10 guarantees
the existence of E ∈ M such that t ∈ E and x⋆ , y ⋆ ∈ int E. We thus have Ω◦ ⊂ {E | E ∈ M}.
As f1 (t) = af2 (t) + bT t + c for all t ∈ Ω◦ by the previous paragraph, Lemma 5.6 allows us to extend
the equality to f1 (t) = af2 (t) + bT t + c for all t ∈ Ω.
27
References
[1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from
another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.
[2] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101:138–156, 2006.
[3] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful
approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300,
1995.
[4] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[5] P. Billingsley. Probability and Measure. Wiley, Second edition, 1986.
[6] D. Blackwell. Comparison of experiments. In Proceedings of the 2nd Berkeley Symposium on
Probability and Statistics, pages 93–102. University of California Press, 1951.
[7] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition. Wiley, 2006.
[8] I. Csiszár. Information-type measures of difference of probability distributions and indirect
observation. Studia Scientifica Mathematica Hungary, 2:299–318, 1967.
[9] M. H. DeGroot. Uncertainty, information, and sequential experiments. Annals of Mathematical
Statistics, 33(2):404–419, 1962.
[10] M. H. DeGroot. Optimal Statistical Decisions. Mcgraw-Hill College, 1970.
[11] J. C. Duchi, L. Mackey, and M. I. Jordan. The asymptotics of ranking algorithms. Annals of
Statistics, 41(5):2292–2323, 2013.
[12] B. Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Insitute of Mathematical Statistics Monographs. Cambridge University Press, 2012.
[13] D. Garcı́a-Garcı́a and R. C. Williamson. Divergences and risks for multiclass experiments. In
Proceedings of the Twenty Fifth Annual Conference on Computational Learning Theory, 2012.
[14] L. Györfi and T. Nemetz. f -dissimilarity: A generalization of the affinity of several distributions. Annals of the Institute of Statistical Mathematics, 30:105–113, 1978.
[15] J. Hiriart-Urruty and C. Lemaréchal.
Springer, New York, 1993.
Convex Analysis and Minimization Algorithms I.
[16] J. Hiriart-Urruty and C. Lemaréchal.
Springer, New York, 1993.
Convex Analysis and Minimization Algorithms II.
[17] T. Kailath. The divergence and bhattacharyya distance measures in signal selection. IEEE
Transactions on Communication Technology, 15(1):52–60, 1967.
[18] L. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, 1986.
28
[19] F. Liese and I. Vajda. On divergences and informations in statistics and information theory.
IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
[20] M. Longo, T. D. Lookabaugh, and R. M. Gray. Quantization for decentralized hypothesis
testing under communication constraints. IEEE Transactions on Information Theory, 36(2):
241–255, 1990.
[21] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods.
Annals of Statistics, 32(1):30–55, 2004.
[22] S. Mannor, R. Meir, and T. Zhang. Greedy algorithms for classification—consistency, convergence rates and adaptivity. Journal of Machine Learning Research, 4:713–741, 2003.
[23] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: Theory of Majorization and Its
Applications. Springer, second edition, 2011.
[24] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On surrogate loss functions and f -divergences.
Annals of Statistics, 37(2):876–904, 2009.
[25] F. Österreicher and I. Vajda. Statistical information and discrimination. IEEE Transactions
on Information Theory, 39(3):1036–1039, 1993.
[26] H. V. Poor and J. B. Thomas. Applications of ali-silvey distance measures in the design of
generalized quantizers for binary decision systems. IEEE Transactions on Communications,
25(9):893–900, 1977.
[27] F. Pukelsheim. Optimal Design of Experiments. Classics in Applied Mathematics. SIAM, 1993.
[28] M. Reid and R. Williamson. Information, divergence, and risk for binary experiments. Journal
of Machine Learning Research, 12:731–817, 2011.
[29] H. Robbins. Some aspects of the sequential design of experiments. Bulletin American Mathematical Society, 55:527–535, 1952.
[30] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
[31] I. Steinwart. How to compare different loss functions. Constructive Approximation, 26:225–287,
2007.
[32] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal
of Machine Learning Research, 8:1007–1025, 2007.
[33] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In The 37’th
Allerton Conference on Communication, Control, and Computing, 1999.
[34] J. N. Tsitsiklis. Decentralized detection. In Advances in Signal Processing, Vol. 2, pages
297–344. JAI Press, 1993.
[35] I. Vajda. On the f -divergence and singularity of probability measures. Periodica Mathematica
Hungarica, 2(1–4):223–234, 1972.
[36] T. Zhang. Statistical analysis of some multi-category large margin classification methods.
Journal of Machine Learning Research, 5:1225–1251, 2004.
29
A
Proofs of classification calibration results
In this section, we prove Propositions 4 and 5. Before proving the propositions proper, we state
several technical lemmas and enumerate continuity properties of Fenchel conjugates that will prove
useful. We also collect a few important definitions related to convexity and norms here, which we
use without comment in this appendix. For a norm k·k on Rk , we recall the definition of the dual
norm k·k∗ as kyk∗ = supkxk≤1 xT y. For a convex function f : Rk → R, we let
∂f (x) = {g ∈ Rk | f (y) ≥ f (x) + g T (y − x) for all y ∈ Rk }
denote the subgradient set of f at the point x. This set is non-empty if x ∈ rel int dom f (see [15,
Chapter VI]).
A.1
Technical preliminaries
We now give technical preliminaries on convex functions. We recall Definition 3.2 of uniform
convexity, that f is (λ, κ, k·k)-uniformly convex over C ⊂ Rk if it is closed and for all t ∈ [0, 1] and
x1 , x2 ∈ C we have
λ
f (tx1 + (1 − t)x2 ) ≤ tf (x1 ) + (1 − t)f (x2 ) − t(1 − t) kx1 − x2 kκ (1 − t)κ−1 + tκ−1 .
(30)
2
We state a related definition of smoothness.
Definition A.1. A function f is (L, β, k·k)-smooth if it has β-Hölder continuous gradient with
respect to the norm k·k, meaning that
k∇f (x1 ) − ∇f (x2 )k∗ ≤ L kx1 − x2 kβ for x1 , x2 ∈ dom f.
Our first technical lemma is an equivalence result for uniform convexity.
Lemma A.1. Let f : Ω → R, where f is closed convex and Ω is a closed convex set. Then f is
(λ, κ, k·k)-uniformly convex over Ω if and only if for x1 ∈ rel int Ω and all y,
λ
kx1 − x2 kκ for s1 ∈ ∂f (x1 ).
(31)
2
If inequality (30) holds, then inequality (31) also holds for any points x1 ∈ Ω and s1 such that
∂f (x1 ) 6= ∅ and s1 ∈ ∂f (x1 ).
f (x2 ) ≥ f (x1 ) + sT1 (x2 − x1 ) +
See Section A.5.1 for a proof of this lemma. There is also a natural duality between uniform
convexity and smoothness of a function’s Fenchel conjugate f ∗ (y) = supx {y T x − f (x)}. Such
dualities are frequent in convex analysis (cf. [16, Chapter X.4]).
Lemma A.2. Let Ω ⊂ Rk be a closed convex set and f : Ω → R be (λ, κ, k·k)-uniformly convex
1
1
, k·k∗ )-smooth (Def. A.1) over dom f ∗ = Rk .
over Ω. Then f ∗ is (λ− κ−1 , κ−1
See Section A.5.2 for a proof of Lemma A.2. We also have two results on the properties of smooth
functions, whose proofs we provide in Sections A.5.3 and A.5.4, respectively.
Lemma A.3. Let f be (L, β, k·k) smooth. Then
f (x2 ) ≤ f (x1 ) + ∇f (x1 )T (x2 − x1 ) +
L
kx1 − x2 kβ+1 .
β+1
Lemma A.4. Let f be (L, β, k·k) smooth over Rk and inf x f (x) > −∞. If the sequence xn satisfies
limn f (xn ) = inf x f (x), then ∇f (xn ) → 0.
30
A.2
Proof of Proposition 4
We state two intermediate lemmas before proving Proposition 4.
Lemma A.5. If U is symmetric, closed and strictly concave, then (−U )∗ is continuously differentiable over Rk . Moreover, if αi ≥ αj , then p = ∇(−U )∗ (α) satisfies pi ≥ pj .
Proof As U is strictly concave and (−U )∗ (α) = supπ∈∆k {π T α+U (π)} < ∞ for all α (suprema of
closed concave functions over compact sets are attained), we have that (−U )∗ is continuously differentiable by standard results in convex analysis (e.g. Hiriart-Urruty and Lemaréchal [16, Theorem
X.4.1.1]), and ∇(−U )∗ (α) = argmaxp∈∆k {pT α + U (p)}.
Now let α satisfy αi ≥ αj . As ∇(−U )∗ (α) = argmaxp∈∆k {pT α + U (p)}, let us assume for the
sake of contradiction that pi < pj . Then letting A be the permutation matrix swapping entries i
and j, the vector p′ = Ap satisfies U (p′ ) = U (p) but U ( 12 (p′ +p)) > 21 U (p′ )+ 21 U (p) = U (p) = U (p′ ),
and
αT p − αT p′ = αi pi − αi pj + αj pj − αj pi = (αi − αj )(pi − pj ) ≤ 0.
Thus we have −αT p ≥ −αT p′ , and so
1
1 ′
1 ′
1
1 T
′
T
α (p + p ) + U
p + p ≥ −α p + U
p + p > αT p + U (p),
2
2
2
2
2
a contradiction to the assumed optimality of p. We must have pi ≥ pj whenever αi ≥ αj .
P
P
Lemma A.6. If U is symmetric and ki=1 πi Ψi (α⋆ ) = inf α ki=1 πi Ψi (α), then πi > πj implies
that α⋆i ≥ α⋆j . If U is strictly concave, then α⋆i > α⋆j .
Proof Let π satisfy πi > πj as assumed in the lemma, and suppose that α⋆i < α⋆j for the sake
of contradiction. Let A be the permutation matrix that swaps α⋆i and α⋆j . Then (−U )∗ (Aα⋆ ) =
(−U )∗ (α⋆ ), and (−U )∗ is also symmetric, and
k
X
i=1
πi Ψi (Aα⋆ ) −
k
X
i=1
πi Ψi (α⋆ ) = −π T Aα⋆ + π T α⋆ = (πi − πj )(α⋆i − α⋆j ) < 0,
a contradiction to the optimality of α⋆ .
P
If U is strictly concave, then Lemma A.5 implies that for α⋆ minimizing ki=1 πi Ψi (α), we have
π = ∇(−U )∗ (α⋆ ). Moreover, by Lemma A.5, if if πi > πj we must have α⋆i > α⋆j .
Pk
Proof of Proposition 4
If the infimum in the definition
π Ψ (α) is attained, then
i=1
P i i
Lemma A.6 gives the result. Otherwise, recall that U (π) = inf α { ki=1 πi Ψi (α)} > −∞, and
P
let πi > πj . Let α(m) be any sequence such that ki=1 πi Ψi (α(m) ) → U (π).
Using that U is uniformly concave, we have ∇(−U )∗ is Hölder continuous over Rk (recall
Lemma A.2). This implies that that π − ∇(−U )∗ (α(m) ) → 0 as m → ∞ (recall Lemma A.4), or
lim ∇(−U )∗ (α(m) ) = π.
m→∞
(m)
Now, by Lemma A.5, for any p(m) = ∇(−U )∗ (α(m) ), we have that αi
(m)
pj .
Thus, if πi > πj , it must be the case that eventually we have
31
(m)
≥ αj
(m)
αi
>
(m)
implies pi
(m)
αj .
≥
Moreover, if
(m)
lim inf |αi
(m)
− αj
(m)
| = 0, then we must have lim inf |pi
(m)
pi
(m)
pj
(m)
− pj
| = 0, which would contradict that
(m)
πi > πj , as we have
−
→ πi − πj . We thus find that lim inf m (αi
sequence tends to the infimum U (π), which implies that
)
( k
X
πi Ψi (α) : αi ≤ αj > U (π).
inf
α
(m)
− αj
) > 0 if the
i=1
The loss (11) is thus classification calibrated (Def. 3.1).
A.3
Proof of Proposition 5
Without loss of generality, we assume that πk < maxj πj , so that restricting to αk ≥ maxj αj forces
α to have larger zero-one risk than 1 − maxj πj . We present two lemmas, based on convex duality
that imply the result.
Lemma A.7. Let v ∈ Rk . Then
(
0
inf
v α=
αk ≥maxj αj
−∞
Pk−1
if vk = − i=1
vi , v\k 0
otherwise.
T
k−1
for the constraints αk ≥ αj for j 6= k, we
Proof By introducing Lagrange multipliers β ∈ R+
have Lagrangian
(
T
0
if vk = β T 1, v\k = −β
β
L(α, β) = v +
α,
so
inf
L(α,
β)
=
α
−β T 1
−∞ otherwise.
Substuting β = −v\k and noting that β 0 gives the result.
Thus, if we define the matrix C = 11T − Ik×k with columns cl , we find by strong duality that
(
)
T
−π α + (−UΨzo )∗ (α) =
inf
−π T α + sup inf pT α + q T C T p
inf
αk ≥maxj αj
αk ≥maxj αj
= sup
p∈∆k q∈∆k
inf
p∈∆k αk ≥maxj αj
T
T
min cl p + (p − π) α
l
T
T
= sup min cl p : π\k p\k , pk = πk + 1 (π\k − p\k )
l
p∈∆k
T
= sup 1 − max pl : π\k p\k , pk = πk + 1 (π\k − p\k ) ,
l
p∈∆k
where in the second to final line we took v = p − π. The next lemma then immediately implies
Proposition 5 once we note that UΨzo (π) = 1 − maxj πj .
Lemma A.8. Let the function Gk (π) = inf αk ≥maxj αj {−π T α + (−UΨzo )∗ (α)}. Then
Gk (π) − (1 − max πj ) ≥
j
32
1
(max πj − πk ).
k j
Proof
We essentially construct the optimal p ∈ ∆k vector for the supremum in the definition
of Gk . Without loss of generality, we assume that π1 = maxj πj > πk , as the result is trivial if
πk = maxj πk . For t ∈ R, define
L(t) = πk +
k
X
j=1
[πj − t]+ and R(t) = t.
By definining tlow = πk and thigh = π1 we have
L(tlow ) ≥ πk + [π1 − πk ]+ = π1 > πk = R(tlow ) and
L(thigh ) = πk < π1 = R(thigh ),
and the fact that L is strictly decreasing in [tlow , thigh ] and R is strictly increasing implies that there
exists a unique root t⋆ ∈ (tlow , thigh ) such that L(t⋆ ) = t⋆ = R(t⋆ ). Now, we define the vector p by
pj = min{t⋆ , πj } for j ≤ k − 1, pk = t⋆ .
Pk−1 ⋆
Pk−1
Then we have 1T p = t⋆ + j=1
t ∧ πj = πk + j=1
([πj − t⋆ ]+ + t⋆ ∧ πj ) = 1, and moreover, we
⋆
have Gk (π) ≥ 1 − maxl pl = 1 − t .
It remains to show that 1 − t⋆ − (1 − π1 ) ≥ k1 (π1 − πk ); equivalently, we must show that
⋆
t ≤ (1 − k1 )π1 + k1 πk . Suppose for the sake of contradiction that this does not hold, that is, that
t⋆ > (1 − k1 )π1 + k1 πk . We know that tlow = πk < t⋆ < π1 = thigh by assumption. With t⋆ satisfying
these two constraints, we have that
L(t⋆ ) = πk +
k
X
j=1
[πj − t⋆ ]+ ≤ πk +
k−1
X
j=1
[π1 − t⋆ ]+ = πk + (k − 1)(π1 − t⋆ )
k−1
1
< πk + (k − 1) π1 −
π1 − πk
k
k
=
k−1
1
π1 + πk < t⋆ ,
k
k
the two strict inequalities by assumption on t⋆ . But then we would have L(t⋆ ) < R(t⋆ ) = t⋆ , a
contradiction to our choice of t⋆ . Because L is decreasing, it must thus be the case that t⋆ ≤
(1 − k1 )π1 + k1 πk , giving the result.
A.4
Proof of Observation 1
Our proof is essentially a trivial modification of Zhang [36, Theorem 8]. We assume without loss
of generality that φ(·) ≥ 0. Let π ∈ ∆k , and recalling that the cost matrix C = [c1 · · · ck ], let
L(π, α) =
k
X
y=1
πy
k
X
cyi φ(−αi ) =
k
X
π T ci φ(−αi ),
i=1
i=1
noting that π T ci ≥ 0 for each i. Without loss of generality, we may assume that π T c1 > π T c2 =
minl π T cl . If we can show that inf α L(π, α) < inf α1 ≥maxj αj L(π, α), then the proof will be complete.
(m)
(m)
Let α(m) ∈ Rk be any sequence satisfying 1T α(m) = 0 and α1 ≥ maxj αj such that L(π, α(m) ) →
inf α1 ≥maxj αj L(π, α).
We first show that it is no loss of generality to assume that α(m) converges. Suppose for the sake
(m)
of contradiction that lim supm α(m) = ∞. Then as 1T α(m) = 0, we must have lim supm α1 = ∞,
33
(m)
so that lim supm φ(−α1 ) = ∞ because φ′ (0) < 0 and φ is convex. As it must be the case
(m)
that π T c1 φ(−α1 ) remains bounded (for the convergence of L(π, α(m) )), we would then have
that π T c1 = 0, which is a contradiction because π T c1 > minl π T cl ≥ 0. Thus we must have
lim supm α(m) < ∞, and so there is a subsequence of α(m) converging; without loss of generality,
we assume that α(m) → α⋆ . Then L(π, α⋆ ) = inf α1 ≥max αj L(π, α) by continuity of φ.
We show that by swapping the value of α⋆1 with the value α⋆2 (or increasing the latter slightly),
we can always improve the value L(π, α⋆ ). We consider three cases, noting in each that α⋆1 ≥ 0 as
1T α⋆ = 0.
1. Let α⋆1 = α⋆2 ≥ 0. Then φ′ (−α⋆1 ) = φ′ (−α⋆2 ) ≤ φ′ (0) < 0, and additionally cT1 πφ′ (−α⋆1 ) −
cT2 πφ′ (−α⋆2 ) < 0 because cT2 π < cT1 π. For sufficently small δ > 0, we thus have
cT1 πφ(−α⋆1 + δ) + cT2 πφ(−α⋆2 − δ) < cT1 πφ(−α⋆1 ) + cT2 πφ(−α⋆2 ),
whence L(π, α⋆ ) > L(π, α⋆ − δ(e1 − e2 )).
2. Let α⋆1 > α⋆2 and φ(−α⋆1 ) > φ(−α⋆2 ). Then taking α ∈ Rk such that α1 = α⋆2 , α2 = α⋆1 , and
αi = α⋆i for i ≥ 3, it is clear that L(π, α) < L(π, α⋆ ).
3. Let α⋆1 > α⋆2 and φ(−α⋆1 ) ≤ φ(−α⋆2 ). Using that −α⋆1 < 0, we see that for sufficiently small δ > 0
we have φ(−α⋆1 + δ) < φ(−α⋆1 ) and φ(−α⋆2 ) ≥ φ(−α⋆2 − δ), because φ must be non-decreasing at
the point −α⋆2 as φ′ (0) < 0. Thus we have L(π, α⋆ − δ(e1 − e2 )) < L(π, α⋆ ).
A.5
A.5.1
Proofs of technical lemmas on smoothness
Proof of Lemma A.1
Let t ∈ (0, 1) and xt = tx2 + (1 − t)x1 . Assume that inequality (31) holds and let x1 ∈ rel int Ω.
Then
κ
λ
λ
f (x2 ) ≥ f (xt ) + sTt (x2 − xt ) + xt − x2 = f (xt ) + (1 − t)sTt (x2 − x1 ) + (1 − t)κ kx1 − x2 kκ ,
2
2
where st ∈ ∂f (xt ), which must exist as xt ∈ rel int Ω (see Lemma 5.11 and [15, Chapter VI]).
Similarly, we have
λ
f (x1 ) ≥ f (xt ) + tsTt (x1 − x2 ) + tκ kx1 − x2 kκ ,
2
and multiplying the preceding inequalities by t and (1 − t), respectively, we have
λ
kx1 − x2 kκ [t(1 − t)κ + (1 − t)tκ ]
2
λ
= f (xt ) + t(1 − t) kx1 − x2 kκ (1 − t)κ−1 + tκ−1
2
tf (x2 ) + (1 − t)f (x1 ) ≥ f (xt ) +
for any x1 ∈ rel int Ω.
We now consider the case that x1 ∈ Ω \ rel int Ω, as the preceding display is equivalent to the
uniform convexity condition (30). Let x1 ∈ Ω \ rel int Ω and x′1 ∈ rel int Ω. For a ∈ [0, 1] define the
functions h(a) = tf (x2 ) + (1 − t)f ((1 − a)x1 + ax′1 ) and
ht (a) = f (tx2 + (1 − t)((1 − a)x1 + ax′1 )) +
κ λ
t(1 − t) (1 − a)x1 + ax′1 − x2 (1 − t)κ−1 + tκ−1 .
2
34
Then h and ht are closed one-dimensional convex functions, which are thus continuous [15, Chapter
I], and we have h(a) ≥ ht (a) for all a ∈ (0, 1) as (1 − a)x1 + ax′1 ∈ rel int Ω. Thus
tf (x1 ) + (1 − t)f (x2 ) = h(0) = lim h(a) ≥ lim ht (a) = ht (0)
a→0
a→0
= f (tx2 + (1 − t)x1 ) +
λ
t(1 − t) kx1 − x2 kκ (1 − t)κ−1 + tκ−1 .
2
This is equivalent to the uniform convexity condition (30).
We now prove the converse. Assume the uniform convexity condition (30), which is equivalent
to
f (xt ) − f (x1 ) λ
+ (1 − t) kx1 − x2 kκ [(1 − t)κ−1 + tκ−1 ] ≤ f (x2 ) − f (x1 )
t
2
(x)
be the directional derivative of
for all x1 , x2 ∈ Ω and t ∈ (0, 1). Let f ′ (x, d) = limt↓0 f (x+td)−f
t
′
f in direction d, recalling that if ∂f (x) 6= ∅ then f (x, d) = supg∈∂f (x) hg, di (see [16, Ch. VI.1]).
Then taking t ↓ 0, we have
f ′ (x1 , x2 − x1 ) +
λ
kx1 − x2 kκ ≤ f (x2 ) − f (x1 ).
2
This implies the subgradient condition (31) because f ′ (x1 , x2 − x1 ) = supg∈∂f (x1 ) hg, x2 − x1 i.
A.5.2
Proof of Lemma A.2
First, we note that as dom f = Ω and f is uniformly convex, it is 1-coercive, meaning that
limkxk→∞ f (x)/ kxk = ∞. Thus dom f ∗ = Rk ; see [16, Proposition X.1.3.8].
As f is strictly convex by assumption, we have that f ∗ is differentiable [16, Theorem X.4.1.1].
Moreover, as f is closed convex, f = f ∗∗ , and we have for any s ∈ Rk that x1 = ∇f ∗ (s) if and
only if s ∈ ∂f (x1 ), meaning that f is subdifferentiable on the set Im ∇f ∗ = {∇f ∗ (s) : s ∈ Rk },
whence Ω ⊃ Im ∇f ∗ . Now, let s1 , s2 ∈ Rk and x1 = ∇f ∗ (s1 ) and x2 = ∇f ∗ (s2 ). We must then
have s1 ∈ ∂f (x1 ) and s2 ∈ ∂f (x2 ) by standard results in convex analysis [16, Corollary X.1.4.4], so
that ∂f (x1 ) 6= ∅ and ∂f (x2 ) 6= ∅.
Now we use the uniform convexity condition (31) of Lemma A.1 to see that
f (x2 ) − f (x1 ) ≥ hs1 , x2 − x1 i +
λ
λ
kx1 − x2 kκ and f (x1 ) − f (x2 ) ≥ hs2 , x1 − x2 i + kx1 − x2 kκ .
2
2
Adding these equations, we find that
hs1 − s2 , x1 − x2 i ≥ λ kx1 − x2 kκ ,
so
ks1 − s2 k∗ kx1 − x2 k ≥ λ kx1 − x2 kκ
by Hölder’s inequality. Dividing each side by kx1 − x2 k = k∇f ∗ (s1 ) − ∇f ∗ (s2 )k, we obtain kx1 − x2 k ≤
1
1
λ− κ−1 ks1 − s2 k∗κ−1 , the desired result.
35
A.5.3
Proof of Lemma A.3
Using Taylor’s theorem, we have
Z 1
h∇f (tx2 + (1 − t)x1 ), x2 − x1 i dt
f (x2 ) = f (x1 ) +
0
Z 1
h∇f (x1 + t(x2 − x1 )) − ∇f (x1 ), x2 − x1 i dt
= f (x1 ) + h∇f (x1 ), x2 − x1 i +
0
Z 1
k∇f (x1 + t(x2 − x1 )) − ∇f (x1 )k∗ kx1 − x2 k dt
≤ f (x1 ) + h∇f (x1 ), x2 − x1 i +
0
Z 1
Ltβ kx1 − x2 kβ+1 dt.
≤ f (x1 ) + h∇f (x1 ), x2 − x1 i +
0
Computing the final integral as
A.5.4
R1
0
tβ dt =
1
β+1
gives the result.
Proof of Lemma A.4
Let gn = ∇f (xn ) and suppose for the sake of contradiction that gn 6→ 0. Then there is a subsequence, which without loss of generality we take to be the full sequence, such that kgn k∗ ≥ c > 0
for all n. Fix δ > 0, which we will choose later. By Lemma A.3, defining yn = xn − δgn / kgn k∗ , we
have
L
L β
β+1
f (yn ) ≤ f (xn ) − δ kgn k +
Lδ
≤ f (xn ) − δ c −
δ .
β+1
β+1
In particular, we see that if
δ≤
c β+1
·
2
L
1/β
, then f (yn ) ≤ f (xn ) −
δc
2
for all n, which contradicts the fact that f (xn ) → inf x f (x) > −∞.
B
Proofs for order equivalent and convex functions
In this appendix, we collect the proofs of our various technical results on order equivalent functions
(Definition 5.1).
B.1
Proof of Lemma 5.2
and Z1 = a while Z T 1 = b, then 1T Z1 = 1T a =
One direction of the proof is easy: if Z ∈ Rm×m
+
T
b 1.
We prove the converse using induction. When m = 1, the result is immediate. We claim that
it is no loss of generality to assume that a1 ≥ a2 ≥ · · · ≥ am and b1 ≥ · · · ≥ bm ; indeed, let Pa and
Pb be permutation matrices such that Pa a and Pb b are in sorted (decreasing) order. Then if we
e = Pa a and ZeT 1 = Pb b, we have Z = P T ZP
e b satisfies Z ∈ Rm×m
such that Z1
construct Ze ∈ Rm×m
a
+
+
e = b.
e = P T Pa a = a and Z T 1 = P T Z1
and Z1 = PaT Z1
a
b
Now, suppose that the statement of the lemma is true for all vectors of dimension up to m − 1;
we argue that the result holds for a, b ∈ Rm
+ . It is no loss of generality to assume that am ≤ bm .
36
Let zm,m = am , zi,m = 0 for 2 ≤ i ≤ m − 1, and set z1,m = bm − am ≥ 0. Additionally, we set
m−1×m−1
be the matrix defined by the upper
zm,i = 0 for 1 ≤ i ≤ m − 1. Now, let Zinner ∈ R+
(m − 1) × (m − 1) sub-matrix of Z, and define the vectors
ainner = [a1 + am − bm a2 a3 · · · am−1 ]T and binner = [b1 b2 · · · bm−1 ]T .
Then as a1 ≥ (1/m)1T a = (1/m)1T b ≥ bm , we have ainner ≥ 0 and binner ≥ 0, and moreover,
1T ainner = 1T a − bm = 1T b − bm = 1T binner . In particular, we have by the inductive hypothesis
T
that we may choose Zinner such that Zinner 1 = ainner and Zinner
1 = binner . By inspection, setting
bm − a m
0
Zinner
.
..
Z=
,
0
0 ··· 0
am
and Z1 = a and Z T 1 = b.
we have Z ∈ Rm×m
+
B.2
Proof of Lemma 5.5
For each i, let λi = [αi ]+ and νi = [αi ]− be the positive and negative parts of the αi , and let λi =
and −νi = qsi where qi , ri , s ∈ N. Then we have
m
X
i=1
λi xi = y −
and 1T r = s + 1T q as 1T (r − q)/s =
m
X
νi xi or
i=1 αi
ri xi = sy +
m
X
q i xi ,
i=1
i=1
i=1
Pm
m
X
ri
s
= 1. Defining the matrices
A = [x1 · · · x1 · · · xm · · · xm ] and B = [y · · · y x1 · · · x1 · · · xm · · · xm ],
| {z }
| {z }
| {z }
| {z } | {z }
r1 times
rm times
q1 times
s times
qm times
T
k×1 r
. Thus, the definition (24) of order equivalence implies that
we have A1 = B1 and A, B ∈ R+
m
X
i=1
ri f1 (xi ) ≤ sf1 (y) +
m
X
qi f1 (xi ) iff
m
X
i=1
i=1
ri f2 (xi ) ≤ sf2 (y) +
By algebraic manipulations, each of these is equivalent to
B.3
Proof of Lemma 5.7
Pm
i=1 αi fj (xi )
m
X
qi f2 (xi ).
i=1
≤ fj (y) for j = 1, 2.
We assume that f1 and f2 are non-linear over Conv{x0 , . . . , xk }, as otherwise the result is trivial.
Let the vectors g1 and g2 and matrix X ∈ Rk×k be defined by
T
x1 − xT0
f1 (x1 ) − f1 (x0 )
f2 (x1 ) − f2 (x0 )
..
..
..
X=
, g1 =
, g2 =
.
.
.
.
xTk − xT0
f1 (xk ) − f1 (x0 )
37
f2 (xk ) − f2 (x0 )
For any a > 0, define the vector b(a) ∈ Rk so that
b(a)T (xi − x0 ) = f1 (xi ) − f1 (x0 ) − a(f2 (xi ) − f2 (x0 )) = g1,i − ag2,i ,
that is, as b(a) = X −1 (g1 − ag2 ), which is possible as X is full rank. By choosing c(a) = f1 (x0 ) −
af2 (x0 ) − b(a)T x0 , we then obtain that
f1 (xi ) = af2 (xi ) + b(a)T xi + c(a)
by algebraic manipulations. We now consider xcent . We have
k
T
b(a) xcent
1
1
1 X
b(a)T xi = b(a)T x0 +
1T Xb(a) = b(a)T x0 +
1T (g1 − ag2 ),
=
k+1
k+1
k+1
i=0
so that
af2 (xcent ) + b(a)T xcent + c(a)
k
= af2 (xcent ) +
1 X
[f1 (xi ) − f1 (x0 ) + af2 (x0 ) − af2 (xi )] + f1 (x0 ) − af2 (x0 )
k+1
1
= af2 (xcent ) +
k+1
i=1
k
X
i=0
k
1 X
f1 (xi ) − a
f2 (xi ).
k+1
i=0
Thus we may choose an a > 0 such that our desired equalities hold if and only if there exists a > 0
such that
k
k
1 X
1 X
f1 (xcent ) −
f1 (xi ) = af2 (xcent ) − a
f2 (xi ).
k+1
k+1
i=0
i=0
By the
assumption that f1 and f2 P
are non-linear, we have that (by Lemma 5.8) f1 (xcent ) <
k
1
1 Pk
f
(x
)
and
f
(x
)
<
2 cent
i=0 1 i
i=0 f2 (xi ). Thus, setting
k+1
k+1
1 Pk
f1 (xcent ) − k+1
i=0 f1 (xi )
a=
>0
P
k
1
f2 (xcent ) − k+1
i=0 f2 (xi )
gives the desired result.
B.4
Proof of Lemma 5.8
Without loss of generality, we assume that λ1 = kλk∞ and that λ1 > 1/m. Then we have
!
m
m
m
m
X
X
X
X
(λi − λ1 )f (xi )
f (xi ) +
λi f (xi ) = λ1
λi x i ≤
f
i=1
i=1
i=1
= mλ1 f (xcent ) +
i=1
m
X
i=2
Let Λ =
Pm
i=1 (λ1
xcent
(λ1 − λi )f (xi ).
− λi ) = mλ1 − 1 > 0. Then
m
m
m
m
1 X
1 X
1 X
Λ X λ1 − λi
=
λi x i +
(λ1 − λi )xi =
λi x i +
xi ,
mλ1
mλ1
mλ1
mλ1
Λ
i=1
i=1
i=1
38
i=1
(32)
and we have
!
!#
m
m
X
X
Λ
λ1 − λi
1
λi x i +
f
f
xi
mλ1 f (xcent ) ≤ nλ1
mλ1
mλ1
Λ
i=1
i=2
!
m
m
X
X
λ1 − λi
λi xi + Λ
≤f
f (xi ).
Λ
"
i=1
i=2
In particular, the inequalities (32) must have been equalities, giving the result.
C
Proof of Theorem 3
The proof of Theorem 3 is almost immediate from the following (somewhat technical) lemma, which
exhibits the power of calibration and universal equivalence. The lemma may be of independent
interest, especially the quantitative guarantee of its second statement.
Lemma C.1. Suppose that Ψ is classification-calibrated and universally equivalent to the weighted
k×k
. There exists a continuous
misclassification loss Ψcw with cost matrix C = [c1 · · · ck ] ∈ R+
concave function h : R+ → [0, maxij cij ] with h(0) = 0 such that
⋆
R(γ | q) − inf R(γ | q) ≤ h RΨ (γ | q) − inf RΨ (q) .
γ∈Γ,q∈Q
q∈Q
P
With the choice Ψy (α) = ki=1 cyi [1 + αi ]+ , then we may take h to be the truncated and scaled
identity function h(ǫ) = (1 + k1 ) min{ǫ, maxij cij }, that is,
1
RΨ (γ | q) − inf RΨ (γ | q) .
R(γ | q) − inf R(γ | q) ≤ 1 +
γ∈Γ,q∈Q
γ∈Γ,q∈Q
k
We postpone the proof of Lemma C.1 to the end of this section, proceeding with the proof of
the theorem. By Lemma C.1, we ahve
⋆
R(γn | qn ) − R⋆ (Q) ≤ h (RΨ (γn | qn ) − RΨ
(Q))
for some h concave, bounded, and satisfying h(0) = 0. It is thus sufficient to show that E[RΨ (γn |
⋆ (Q)] → 0. Now, if we let γ ⋆ and q⋆ minimize R (γ | q) over the set Γ × Q (or be
q n ) − RΨ
Ψ
n
n
n
n
arbitrarily close to minimizing the Ψ-risk), we have
bΨ,n (γn | qn ) + R
bΨ,n (γn | qn ) − RΨ (γn⋆ | q⋆n )
RΨ (γn | qn ) − RΨ (γn⋆ | q⋆n ) = RΨ (γn | qn ) − R
bΨ,n (γn | qn ) + R
bΨ,n (γ ⋆ | q⋆ ) − RΨ (γ ⋆ | q⋆ )
≤ RΨ (γn | qn ) − R
n
n
n
n
{z
}
|
bΨ,n (γ|q)−RΨ (γ|q)|
≤2 supγ∈Γn ,q∈Qn |R
bΨ,n (γn | qn ) −
+R
Consequently, we have the expectation bound
inf
γ∈Γn ,q∈Qn
bΨ,n (γ | q).
R
⋆
⋆
E [RΨ (γn | qn ) − RΨ
(Q)] ≤ E [RΨ (γn | qn ) − RΨ (γn⋆ | q⋆n )] + RΨ (γn⋆ | q⋆n ) − RΨ
(Q)
opt
app
≤ 2ǫest
n + ǫn + ǫn ,
39
which converges to zero as desired.
Proof of Lemma C.1
With the definition R⋆ (q) := inf γ R(γ | q), we have
R(γ | q) −
inf
γ∈Γ,q∈Q
R(γ | q) = R(γ | q) − R⋆ (q) + R⋆ (q) − inf R⋆ (q).
q∈Q
(33)
For the second two terms, we note that by Theorem 1, we have
R⋆ (q) = E[UΨcw (e
π (q(X)))] = aE[UΨ (e
π (q(X)))] + bT π + c,
and similarly for inf q R⋆ (q), so that
⋆
⋆
∗
∗
R (q) − inf R (q) = a RΨ (q) − inf RΨ (q) ≤ a RΨ (γ | q) −
q∈Q
q∈Q
inf
γ∈Γ,q∈Q
RΨ (γ | q) .
Clearly t 7→ at is concave, so that it remains to bound R(γ | q) − R⋆ (q) in expression (33). To that
end, define the function
k
k
k
k
X
X
X
X
cw ′
π
Ψ
(α
)
≥
ǫ
πy Ψcw
(α)
−
inf
πy Ψy (α′ ) |
πy Ψy (α) − inf′
H(ǫ) = inf
y
y
π∈∆k ,α
α′
α
y=1
y=1
y=1
y=1
and let H ∗∗ be its Fenchel biconjugate. Then (see Zhang [36, Proposition 25 and Corollary 26],
as well as the papers [32, 31, 11, Proposition 1]) we have that H ∗∗ (ǫ) > 0 for all ǫ > 0, and
defining h(ǫ) = sup{δ : δ ≥ 0, H ∗∗ (δ) ≤ ǫ} yields the desired concave function except that h may be
unbounded. Noting that the loss Ψ is bounded, we may replace t 7→ h(t)+at by min{maxij cij , h(t)+
at} with no loss of generality gives the first result of the lemma.
Now we give the second result. Without loss of generality, we may assume that the vector
γ(z) ∈ Rk has a unique maximal coordinate; we may otherwise assume a deterministic rule for
breaking ties. Let y(γ(z)) = argmaxj γj (z), assumed w.l.o.g. to be unique. Consider that
X
⋆
T
T
e(z) ,
R(γ | q) − R (q) =
p̄(z) cy(γ(z)) π
e(z) − min cl π
l
z
P
where p̄(z) = ki=1 πi Pi (q−1 (z)) and π
e(z) = [πi Pi (q−1 (z))]ki=1 /p̄(z) are shorthand. To show the
P
second part of the lemma, it is sufficient to argue that for Ψy (α) = ki=1 cyi [1 + αi ]+ , we have
X
X
X
′
T
T
T
πy Ψy (α ) =
πy Ψy (α) − k min π cl ≥ cy(α) π − min π cl
(34)
πy Ψy (α) − inf′
y
α
l
y
y
l
for any vector π ∈ ∆k .
To show inequality (34), assume without loss of generality that there exists an index l⋆ < k such
= minl cTl π, while cTl π > cT1 π for l > l⋆ . (If l⋆ = k, then inequality (34)
that cT1 π = cT2 π = · · · = cTl⋆ πP
is trivial.) We always have y πy Ψy (α) ≥ k minl cTl π; let us suppose that αl ≥ maxj αj for some
l > l⋆ ; without loss of generality take l = k. Then we have
k
X
y=1
πy Ψy (α) ≥
inf
αk ≥maxj αj ,1T α=0
k
X
y=1
πy
k
X
cyj [1 + αj ]+ =
j=1
40
inf
αk ≥maxj αj ,1T α=0
k
X
l=1
cTl π [1 + αl ]+ . (35)
Writing out the Lagrangian for this problem and introducing variables λ ≥ 0 for the inequality
αk ≥ maxj αj and θ ∈ R for the equality 1T α = 0, we have
L(α, λ, θ) =
k
X
l=1
T
T
π cl [1 + αl ]+ + θ1 α + λ max αj − αk .
j
⋆
⋆
k−l −1
−1
⋆
Set α⋆1 = · · · = α⋆l⋆ = k−l
l⋆ +1 , αk = l⋆ +1 , and αl = −1 for l > l , l 6= k. We claim that these are
optimal. Indeed, set θ ⋆ = −(1 − ǫ)π T c1 − ǫπ T ck for some ǫ ∈ (0, 1), whose value we specify later.
Taking subgradients of the Lagrangian with respect to α, we have
T
c1
..
⋆
⋆
∂α L(α , λ, θ ) = diag(ν) . π + θ ⋆ 1 + λ [Conv{e1 , . . . , el⋆ , ek } − ek ] ,
cTk
where νi ∈ [0, 1] are arbitrary scalars satisfying νi ∈ ∂ [1 + α⋆i ]+ , i.e. ν1 = · · · = νl⋆ = νk = 1,
νl ∈ [0, 1] for l ∈ {l⋆ +1, . . . , k −1}. Notably, we have π T cl +θ ⋆ = ǫπ T (c1 −ck ) ≤ 0 for l ∈ {1, . . . , l⋆ }
and π T ck + θ ⋆ = (1 − ǫ)π T (ck − c1 ) ≥ 0. Choosing ǫ small enough that ǫπ T ck + (1 − ǫ)π T c1 < π T cl
T
T
for l ∈ {l⋆ + 1, . . . , k − 1}, it is clear that we may take νl = (1−ǫ)ππTc1c+ǫπ ck ∈ [0, 1] for each
l
l 6∈ {1, . . . , l⋆ , k}, and we show how to choose λ⋆ ≥ 0 so that 0 ∈ ∂α L(α⋆ , λ, θ ⋆ ). Assume without
loss of generality (by scaling of λ ≥ 0) that π T (ck − c1 ) = 1. Eliminating extraneous indices in ∂α L,
we see that we seek a setting of λ and a vector v ∈ ∆l⋆ +1 such that
el⋆ +1 − ǫ1l⋆ +1 + λ(v − el⋆ +1 ) = 0.
1
[ǫ · · · ǫ (1 − ǫ)]T ∈ ∆l⋆ +1 , and set λ⋆ = l⋆ ǫ + (1 − ǫ) ≥ 0.
This is straightforward: take v = l⋆ ǫ+(1−ǫ)
Summarizing, we find that our preceding choices of α⋆ were optimal for the problem (35), that
⋆ −1
⋆
is, α⋆1 = · · · = αl⋆ = k−l
l⋆ +1 = αk , with αl = −1 for l < l < k. We thus have
k − l⋆ − 1
k − l⋆ − 1
T
πy Ψy (α) ≥
π cl 1 +
+ π ck 1 +
l⋆ + 1
l⋆ + 1
y=1
l=1
k
l⋆
⋆
T
⋆
(k − l − 1) + ⋆
cT π.
= min π cl l + ⋆
l
l +1
l +1 k
⋆
l
X
k
X
T
In particular, we have
k
X
y=1
πy Ψy (α) − k min π cl ≥ min π cl l⋆ +
T
l⋆
k
⋆
(k − l − 1) + ⋆
cT π − k min cTl π
⋆
l
l +1
l +1 k
cTk π − min cTl π .
T
l
l
=
l⋆
k
+1
l
Recalling that we must have had l⋆ < k, we obtain inequality (34).
D
Proofs of basic properties of f -divergences
In this section, we collect the proofs of the characterizations of generalized f -divergences (Def. 2.1).
41
D.1
Proof of Lemma 2.1
Let µ1 and µ2 be dominating measures; then µ = µ1 + µ2 also dominates P1 , . . . , Pk as well as µ1
and µ2 . We have for ν = µ1 or ν = µ2 that
dPi /dν dν/dµ
dPi /dµ
dPi dν
dPi
dPi /dν
=
=
and
=
,
dPk /dν
dPk /dν dν/dµ
dPk /dµ
dν dµ
dµ
the latter two equalities holding µ-almost surely by definition of the Radon-Nikodym derivative.
Thus we obtain for ν = µ1 or ν = µ2 that
Z Z dP1 /dν
dP1 /dν
dPk−1 /dν dPk
dPk−1 /dν dPk dν
f
,...,
dν = f
,...,
dµ
dPk /dν
dPk /dν
dν
dPk /dν
dPk /dν
dν dµ
Z dPk−1 /dµ dPk
dP1 /dµ
,...,
dµ
= f
dPk /dµ
dPk /dµ
dµ
dPi /dµ
by definition of the Radon-Nikodym derivative. Moreover, we see that ppki = dP
a.s.-µ, which
k /dµ
shows that the base measure µ does not affect
P the integral.
To see the positivity, we may take µ = k1 ki=1 Pi , in which case Jensen’s inequality implies (via
the joint-convexity of the perspective function (7)) that
dPk−1
dP1
(X), . . . ,
(X) dPk (X)
Df (P1 , . . . , Pk−1 ||Pk ) = E f
dPk
dPk
E[dP1 ]
E[dPk−1 ]
≥f
,...,
E[dPk ] = f (1) = 0,
E[dPk ]
E[dPk ]
where the expectation is taken under the distribution µ. Moreover, the inequality is strict for f
strictly convex at 1 as long as dPi /dPk is non-constant for some i, meaning that there exists an i
such that Pi 6= Pk .
D.2
Proof of Proposition 1
Before proving the proposition, we first establish a more general continuity result for multi-way
f -divergences. This result is a fairly direct generalization of results of Vajda [35, Theorem 5]. Given
a sub-σ-algebra G ⊂ F, we let P G denote the restriction of the measure P , defined on F, to G.
Lemma D.1. Let F1 ⊂ F2 ⊂ . . . be a sequence of sub-σ-algebras of F and let F∞ = σ(∪n≥1 Fn ).
Then
F∞
F2
F1
||PkF∞ ,
||PkF2 ≤ · · · ≤ Df P1F∞ , . . . , Pk−1
Df P1F1 , . . . , Pk−1
||PkF1 ≤ Df P1F2 , . . . , Pk−1
and moreover,
F∞
Fn
||PkF∞ .
||PkFn = Df P1F∞ , . . . , Pk−1
lim Df P1Fn , . . . , Pk−1
n→∞
Proof
Define the measure ν =
1
k
Vn =
Pk
and vectors V n via the Radon-Nikodym derivatives
!
Fn
dPk−1
dP1Fn dP2Fn
.
,
,...,
dν Fn dν Fn
dν Fn
i=1 Pi
1
·
k
42
Then (1 − 1T V n )dν Fn = k1 dPkFn , and V n is a martingale adapted to the filtration Fn by standard
k−1
|
properties of conditional expectation (under the measure ν). Letting the set Ck = {v ∈ R+
T
1 v ≤ 1}, we define g : Ck → R by
v
g(v) = f
(1 − 1T v).
1 − 1T v
We see that g is convex (it is a perspective function), and we have
!
!
Z
k−1
Fn
dPk−1
1 Fn
dP1Fn
1 X dPiFn
Fn
Fn
n
dν
=
.
D
P
||P
,
.
.
.
,
P
Eν [g(V )] = f
,
.
.
.
,
1
−
f
1
k−1 k
k
dν Fn
k
dPkFn
dPkFn
i=1
n
Because V n ∈ Ck for all n, we see that
This gives the first result of the
g(V ) is a submartingale.
Fn
Fn
Fn
is non-decreasing in n.
lemma, that is, that the sequence Df P1 , . . . , Pk−1 ||Pk
Now, assume that the limit in the second statement is finite, as otherwise the result is trivial.
Using that f (1) = 0, we have by convexity that for any v ∈ Ck ,
v
(1 − 1T v) + f (1)1T v ≥ f v + 1(1T v) ≥ inf f (v + 1(1T v)) > −∞,
g(v) = f
T
v∈Ck
1−1 v
the final inequality a consequence of the fact that f is closed and hence attains its infimum. In
particular, the sequence g(V n ) − inf v∈Ck g(v) is a non-negative submartingale, and thus
sup Eν |g(V n ) − inf g(v)| = lim Eν g(V n ) − inf g(v) < ∞.
n
v∈Ck
n
v∈Ck
Coupled with this integrability guarantee, Doob’s second martingale convergence theorem [5, Thm. 35.5]
thus yields the existence of a random vector V ∞ ∈ F∞ such that
n
0 ≤ lim Eν g(V ) − inf g(v) = Eν [g(V ∞ )] − inf g(v) < ∞.
n
v∈Ck
v∈Ck
Because inf v∈Ck g(v) > −∞, we have Eν [|g(V ∞ )|] < ∞, giving the lemma.
P
We now give the proof of Proposition 1 proper. Let the the base measure µ = k1 ki=1 Pi and
n
i
let pi = dP
dµ be the associated densities of the Pi . Define the increasing sequence of partitions P
of X by sets Aαn for vectors α = (α1 , . . . , αk−1 ) and B with
αj − 1
pj (x)
αj
Aαn = x ∈ X |
≤
< n for j = 1, . . . , k − 1
2n
pk (x)
2
where we let each αj range over {−n2n , −n2n + 1, . . . , n2n }, and define and B = (∪α Aαn )c =
X \ ∪α Aαn . Then we have
X
p1 (x)
pk−1 (x)
α
,...,
χA (x) + (n, . . . , n)χB (x),
= lim
n→∞
pk (x)
pk (x)
2n αn
n
n k−1
α∈{−n2 ,...,n2 }
43
where χA denotes the {0, 1}-valued characteristic function of the set A. Each term on the righthand-side of the previous display is Fn -measurable, where Fn denotes the sub-σ-field generated by
the partition P n . Defining F∞ = σ(∪n≥1 Fn ), we have
Fn
||PkFn
Df (P1 , . . . , Pk−1 ||Pk | P n ) = Df P1Fn , . . . , Pk−1
F∞
||PkF∞ = Df (P1 , . . . , Pk−1 ||Pk ) ,
−→ Df P1F∞ , . . . , Pk−1
n↑∞
where the limiting operation follows by Lemma D.1 and the final equality because of the measurap
bility containment ( ppk1 , . . . , k−1
pk ) ∈ F∞ .
D.3
Proof of Proposition 2
Before proving the proposition, we state an inequality that generalizes the classical log-sum inequality (cf. [7, Theorem 2.7.1]).
Lemma D.2. Let f : Rk+ → R be convex. Let a : X → R+ and bi : X → R+ for 1 ≤ i ≤ k be
nonnegative measurable functions. Then for any finite measure µ defined on X , we have
R
R
Z
Z
b1 dµ
bk dµ
b1 (x)
bk (x)
adµ f R
,..., R
≤ a(x)f
,...,
dµ(x).
a(x)
a(x)
adµ
adµ
Proof Recall that the perspective function g(y, t) defined by g(y, t) = tf (y/t) for t > 0 is jointly
convex in y and t. The measure ν = µ/µ(X ) defines a probability measure, so that for X ∼ ν,
Jensen’s inequality implies
R
R
Z
bk dν
b1 dν
b1 (X)
bk (X)
R
R
,...,
≤ Eν a(X)f
,...,
adν f
a(X)
a(X)
adν
adν
Z
b1 (x)
bk (x)
1
,...,
a(x)f
dµ(x).
=
µ(X )
a(x)
a(x)
R
R
R
R
Noting that bi dν/ adν = bi dµ/ adµ gives the result.
We use this lemma and Proposition 1 to give Proposition 2. Indeed Proposition 1 implies
Df QP1 , . . . , QPk−1 ||QPk = sup Df QP1 , . . . , QPk−1 ||QPk | q : q is a quantizer of Z .
It is P
consequently no loss of generality to assume that Z is finite and Z = {1, 2, . . . , m}. Let
µ = ki=1 Pi be a dominating measure and let pi = dPi /dµ. Letting q(j | x) = Q({j} | x) and
qPi (j) = QPi ({j}), we obtain
m
X
qPk−1 (i)
qP1 (i)
qPk (i)f
Df QP1 , . . . , QPk−1 ||QPk =
,...,
qPk (i)
qPk (i)
i=1
R
R
Z
m
X
q(i | x)p1 (x)dµ(x)
q(i | x)pk−1 (x)dµ(x)
X
X
q(i | x)pk (x)dµ(x) f R
=
,..., R
q(i | x)pk (x)dµ(x)
X
X
X q(i | x)pk (x)dµ(x)
i=1
m Z
X
p1 (x)
pk−1 (x)
q(i | x)f
,...,
≤
pk (x)dµ(x)
pk (x)
pk (x)
X
i=1
P
by Lemma D.2. Noting that m
i=1 q(i | x) = 1, we obtain our desired result.
44
© Copyright 2026 Paperzz