Sequence prediction for arbitrary classes of processes

Daniil Ryabko
SequeL, INRIA Lille
UK Easter Probability Meeting

Outline: Introduction · The 3 problems · total variation · KL · examples: KL · general results

X is a finite set; in this talk mostly X = {0, 1}.
µ is an unknown probability measure over X^∞ that generates x_1, x_2, ...
At time 0 we have to predict the probabilities of x_1 = 0 and x_1 = 1.
Then x_1 is revealed and we have to give the probabilities of x_2 given x_1,
and so on: at time n, given x_1, ..., x_n generated by µ, we forecast
the probabilities for x_{n+1}.
Any predictor ρ is also a probability measure: indeed, for each n
and each x_1, ..., x_n it gives ρ(x_{n+1} = 0 | x_1, ..., x_n) and
ρ(x_{n+1} = 1 | x_1, ..., x_n), so we can define

  ρ(x_1, ..., x_k) = ρ(x_1) ρ(x_2 | x_1) ... ρ(x_k | x_1, ..., x_{k-1}).

Environments and predictors live in the same space: they are
probability distributions on the space of one-way infinite sequences.
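
The chain rule above is all one needs to turn next-symbol forecasts into a measure on finite strings. A minimal Python sketch of this correspondence (the function names are mine, not from the talk):

    def uniform_predictor(history):
        # a trivial example predictor: always forecasts 1/2 for each symbol
        return {0: 0.5, 1: 0.5}

    def induced_measure(predictor, string):
        # rho(x_1,...,x_k) = rho(x_1) * rho(x_2|x_1) * ... * rho(x_k|x_1..x_{k-1})
        prob = 1.0
        for k, symbol in enumerate(string):
            prob *= predictor(string[:k])[symbol]
        return prob

    print(induced_measure(uniform_predictor, [0, 1, 1]))  # 0.125 = (1/2)^3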

Measuring the quality of prediction

The quality of the predicted probabilities ρ(x_n | x_1, ..., x_{n-1}) should
be measured with respect to the "real" (unknown) probabilities
µ(x_n | x_1, ..., x_{n-1}). We measure them using the Kullback-Leibler
divergence

  d_n(µ, ρ) = E_µ Σ_{k=1..n} Σ_{x∈X} µ(x_k = x | x_1, ..., x_{k-1}) log [ µ(x_k = x | x_1, ..., x_{k-1}) / ρ(x_k = x | x_1, ..., x_{k-1}) ].

Using the chain rule for log we have

  d_n(µ, ρ) = E_µ log [ µ(x_1, ..., x_n) / ρ(x_1, ..., x_n) ].

If (1/n) d_n(µ, ρ) → 0 then we say that ρ predicts µ in KL divergence.
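
A quick sanity check of the definition: if µ is Bernoulli i.i.d. with parameter p and ρ is the uniform measure, ρ(x_1, ..., x_n) = 2^{-n}, then d_n(µ, ρ) = n(1 − h(p)), where h(p) = −p log p − (1 − p) log(1 − p) is the binary entropy (logs base 2). Hence (1/n) d_n(µ, ρ) → 1 − h(p) > 0 unless p = 1/2: the uniform predictor does not predict biased coins in KL.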

The adversarial argument

For every predictor ρ there is a measure µ such that d_n(µ, ρ) ≥ n,
so that lim inf_{n→∞} (1/n) d_n(µ, ρ) ≥ 1.
Indeed, for every predictor ρ one can construct such a measure µ
as follows: for every x_1, ..., x_n set the conditional probability
µ(x_{n+1} = 0 | x_1, ..., x_n) := 1 if ρ(x_{n+1} = 0 | x_1, ..., x_n) ≤ 1/2, and
µ(x_{n+1} = 0 | x_1, ..., x_n) := 0 otherwise.
Under this µ the sequence is deterministic and ρ assigns it conditional
probability at most 1/2 at every step, so d_n(µ, ρ) ≥ n (logs base 2).
This is very sad.
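
A minimal Python sketch of this adversary (the predictor it is run against is an arbitrary illustration, not from the talk):

    import math

    def biased_predictor(history):
        # an arbitrary example predictor: always forecasts P(next symbol = 1) = 0.7
        return {0: 0.3, 1: 0.7}

    def adversarial_loss(predictor, n):
        # cumulative log-loss log2(mu/rho) along the (deterministic) sequence
        # that the adversarial measure mu generates against the predictor
        history, loss = [], 0.0
        for _ in range(n):
            probs = predictor(history)
            # mu puts probability 1 on a symbol to which rho assigns probability <= 1/2
            x = 0 if probs[0] <= 0.5 else 1
            loss += math.log2(1.0 / probs[x])
            history.append(x)
        return loss

    print(adversarial_loss(biased_predictor, 100))  # >= 100; here about 173.7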

Laplace: i.i.d.

Thus we may want to assume that the measure µ belongs to some
known class of processes (measures).
Laplace: x_1, x_2, ..., x_n, ... are independent and identically
distributed (i.i.d.).
The predictor suggested by Laplace is

  ρ_L(x_{n+1} = 1 | x_1, ..., x_n) = (k + 1)/(n + 2),

where k is the number of 1s in x_1, ..., x_n.
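
A minimal sketch of the Laplace predictor in Python (the names are mine):

    def laplace(history):
        # rho_L(x_{n+1} = 1 | x_1..x_n) = (k + 1) / (n + 2), k = number of 1s
        k, n = sum(history), len(history)
        p1 = (k + 1) / (n + 2)
        return {0: 1.0 - p1, 1: p1}

    print(laplace([]))            # {0: 0.5, 1: 0.5}: no data, forecast 1/2
    print(laplace([1, 1, 0, 1]))  # P(1) = (3 + 1)/(4 + 2) = 2/3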

Problem P1

Given a set C of distributions, we want to have a predictor ρ such
that ρ predicts every µ in C:

  (1/n) d_n(µ, ρ) → 0 for every µ ∈ C.

Problem P2

We have a set C of distributions. Each of them predicts some
measures, possibly outside of C. We want to have a predictor ρ
that predicts everything that is predicted by some distribution in C:

  (1/n) d_n(ν, ρ) → 0 for every ν such that (1/n) d_n(ν, µ) → 0 for some µ ∈ C.

Problem P3

We have a set C of distributions. We want a predictor ρ that
predicts every measure at least as well as any measure in C:

  sup_{µ∈C} lim sup_{n→∞} (1/n) (d_n(ν, ρ) − d_n(ν, µ)) ≤ 0

for every ν whatsoever.

summarizing

Problem 1: realizable case. We assume something about the data,
and work under this assumption.
Problem 2: non-realizable case (or not quite realizable). We have a
set of models (or predictors), and we assume that one of them
predicts the data well. We want a predictor that predicts
everything that one of the models predicts.
Problem 3: completely agnostic. We have a set of predictors, and
we want to predict any data whatsoever as well as the best (for
this data) predictor from the set.
Mostly, Problems 1 and 3 are studied separately: Problem 1 under
some specific assumptions (e.g. Markov, parametric), Problem 3
for finite or countable sets of predictors (or experts: prediction
with expert advice [Cesa-Bianchi and Lugosi, 2006]). However, the
methods to solve them are very similar, so maybe the problems
are worth solving together.
no parametric assumptions
Assumptions on C: none.

Total variation distance

Introduce another measure of the quality of prediction: total
variation. For two distributions µ, ρ let

  v(µ, ρ) := sup_{A∈F} |µ(A) − ρ(A)|.

We say that ρ predicts µ in total variation distance if

  v(µ(·|x_1, ..., x_n), ρ(·|x_1, ..., x_n)) → 0  µ-a.s.

Clearly, we can formulate all those 3 problems for total variation
too.

tv is strong but very nice

Prediction in total variation means being able to give good
forecasts of probabilities of arbitrarily far-off events in the future.
It's very strong.
ρ predicts µ in total variation if [Blackwell and Dubins, 1962] and
only if [Kalai and Lehrer, 1994] µ is absolutely continuous with
respect to ρ.
This is very nice, since in probability theory we know everything
about absolute continuity.
We will first see what can be done for this nice measure of
predictive quality, why it's not enough to be happy, and then we'll
see what can be transferred to KL.
P1=P2 for tv
Absolute continuity (def.): µ(A) > 0 implies ρ(A) > 0 for every
event A. Denote this ρ ≥tv µ.
Since ≥tv is transitive, we immediately see that P2=P1; that is, if
ρ predicts µ and µ predicts ν (in total variation), then also ρ
predicts ν.
P1=P2=P3 for tv
The following lemma is an easy extension of [Blackwell and
Dubins, 1962] using Lebesgue’s decomposition.
Lemma
Let µ, ρ be two measures. Then v(µ(·|x_{1..n}), ρ(·|x_{1..n})) converges to either 0
or 1 with µ-probability 1.
Corollary
P1=P2=P3 for total variation.

Example: countable C

Let C = {µ_1, µ_2, ...}.
Define the predictor ρ = Σ_{k=1}^{∞} w_k µ_k, where (for example)
w_k > 0 and Σ_{k∈N} w_k = 1.
For every k and every event A, if µ_k(A) > 0 then
ρ(A) ≥ w_k µ_k(A) > 0, so that ρ ≥tv µ_k.
All problems are solved. Happy.

Theorem
Let C ⊂ P. The following statements about C are equivalent.
(i) There exists a solution to Problem 1 in total variation.
(ii) There exists a solution to Problem 2 in total variation.
(iii) There exists a solution to Problem 3 in total variation.
(iv) C is upper-bounded with respect to ≥tv.
(v) There exists a sequence µ_k ∈ C, k ∈ N, such that for some
(equivalently, for every) sequence of weights w_k ∈ (0, 1], k ∈ N,
with Σ_{k∈N} w_k = 1, the measure ν = Σ_{k∈N} w_k µ_k
satisfies ν ≥tv µ for every µ ∈ C.
(vi) C is separable with respect to the total variation distance.
(vii) Let C+ := {µ ∈ P : ∃ρ ∈ C  ρ ≥tv µ}. Every disjoint (with
respect to ≥tv) subset of C+ is at most countable.
Moreover, every solution to any of the Problems 1-3 is a solution
to the other two, as is any upper bound for C. The sequence µ_k in
statement (v) can be taken to be any dense (in the total variation
distance) countable subset of C.

Example: iid

Let C be the set of all i.i.d. processes. We can set
C = {µ_p : p ∈ [0, 1]} where p = µ_p(x_1 = 1).
Then for every p define the set A_p = "the limiting fraction of 1s
equals p". These sets are disjoint, and µ_p(A_p) = 1. If there were
a predictor ρ such that ρ ≥tv µ_p for all p, then ρ(A_p) > 0 for all p,
which is impossible, since there are uncountably many different
(disjoint) A_p.
So: if prediction quality is measured in total variation distance,
there is no predictor for all iid processes. Fail.
P1-3 are easy to analyze for total variation, but total variation is
too strong. Next we will see what results can be transferred to KL.

P1 ≠ P2 ≠ P3

Proposition
For the case of prediction in KL divergence, all three problems
are different:
There exists a set C such that there is a solution to Problem 1
for C but there is no solution to Problem 2.
There exists a set C such that there is a solution to Problem 2
for C but there is no solution to Problem 3.

KL: countable

Let C = {µ_1, µ_2, ...} and define the predictor ρ := Σ_{k∈N} w_k µ_k,
where the w_k > 0 and sum to 1. Then for every ν whatsoever, and for
every k, we have

  d_n(ν, ρ) = E_ν log [ν(x_1, ..., x_n) / ρ(x_1, ..., x_n)]
            ≤ E_ν log [ν(x_1, ..., x_n) / (w_k µ_k(x_1, ..., x_n))]
            = log w_k^{-1} + d_n(ν, µ_k).

This means that ρ is the solution to all 3 problems, for the
countable case, in terms of both KL and tv.
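
A small numerical illustration of this bound, assuming a countable class of Bernoulli i.i.d. measures with uniform weights (the class and the weights are my choice, not from the talk):

    import math
    from itertools import product

    ps = [k / 10 for k in range(1, 10)]   # mu_k = Bernoulli(p_k), p_k = 0.1, ..., 0.9
    ws = [1 / len(ps)] * len(ps)          # weights w_k > 0 summing to 1

    def bern(string, p):
        k = sum(string)
        return p ** k * (1 - p) ** (len(string) - k)

    def mixture(string):
        # rho(x_1..n) = sum_k w_k mu_k(x_1..n)
        return sum(w * bern(string, p) for w, p in zip(ws, ps))

    def d_n(nu_p, measure, n):
        # d_n(nu, measure) for nu = Bernoulli(nu_p), by exact enumeration of {0,1}^n
        return sum(bern(s, nu_p) * math.log(bern(s, nu_p) / measure(s))
                   for s in product((0, 1), repeat=n))

    n, nu_p, k = 10, 0.34, 2              # nu is close to mu_3 = Bernoulli(0.3)
    lhs = d_n(nu_p, mixture, n)
    rhs = math.log(1 / ws[k]) + d_n(nu_p, lambda s: bern(s, ps[k]), n)
    print(lhs, "<=", rhs)                 # the bound holds for every k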

iid/finite memory/stationary

For iid processes, Problem 1 was solved by Laplace (though not in terms
of KL divergence). For KL we can have a bound
d_n(µ, ρ_L) ≤ log(n + 1), where µ is any iid process and ρ_L is the
Laplace predictor. Better predictors (the 1 replaced by 1/2), as well as
predictors for k-order Markov processes, were obtained in
[Krichevsky, 1983]. There has been a lot of further work on the
bounds for these classes of processes. For the case when C is the
set of all stationary processes, a predictor was obtained in
[B. Ryabko, 1988].
Problems 2 and 3 are much less explored. P3 is mostly studied for
finite sets C [Cesa-Bianchi and Lugosi, 1996]. For P3, the iid case is
solved [Freund, 1996]. There is a lot of work on finite/countable
classes, both in general and for such special cases as finite
automata or computable predictors, but I couldn't find any result
even for the class of all finite-memory predictors.
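
A quick numerical check of the log(n + 1) bound for the Laplace predictor, by exact enumeration of {0,1}^n for small n (my own sanity check; logs base 2):

    import math
    from itertools import product

    def laplace_prob(string):
        # rho_L(x_1..x_n) via the chain rule, with rho_L(1 | x_1..x_k) = (#1s + 1)/(k + 2)
        prob = 1.0
        for k, x in enumerate(string):
            p1 = (sum(string[:k]) + 1) / (k + 2)
            prob *= p1 if x == 1 else 1.0 - p1
        return prob

    def d_n(p, n):
        # d_n(mu_p, rho_L) = sum over x in {0,1}^n of mu_p(x) * log2(mu_p(x) / rho_L(x))
        total = 0.0
        for s in product((0, 1), repeat=n):
            mu = p ** sum(s) * (1 - p) ** (n - sum(s))
            total += mu * math.log2(mu / laplace_prob(s))
        return total

    for n in (2, 5, 10, 14):
        print(n, round(d_n(0.7, n), 3), "<=", round(math.log2(n + 1), 3))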

another distance, and separability

Definition
Define the distance d_∞(µ_1, µ_2) on process measures as follows:

  d_∞(µ_1, µ_2) = lim sup_{n→∞} sup_{x_{1..n} ∈ X^n} (1/n) log [µ_1(x_{1..n}) / µ_2(x_{1..n})].   (1)

This distance looks at how differently two measures behave on all
possible sequences.
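
For two i.i.d. measures the inner sup does not depend on n, so d_∞ can be read off directly; a brute-force check for two Bernoulli measures (my own illustration, not from the talk):

    import math
    from itertools import product

    def bern(string, p):
        k = sum(string)
        return p ** k * (1 - p) ** (len(string) - k)

    def sup_log_ratio(p, q, n):
        # sup over x_1..n of (1/n) log( mu_p(x_1..n) / mu_q(x_1..n) )
        return max(math.log(bern(s, p) / bern(s, q)) / n
                   for s in product((0, 1), repeat=n))

    p, q = 0.3, 0.6
    print(sup_log_ratio(p, q, 10))                            # brute force over {0,1}^10
    print(max(math.log(p / q), math.log((1 - p) / (1 - q))))  # same value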
Theorem
Separability with respect to d∞ is a sufficient but not a necessary
condition for the existence of a solution to P3.
(The proof is quite simple.) I don't know of any distance that would
give necessary and sufficient conditions for either P2 or P3.
Examples: finite memory, stationary
Corollary
There is a solution to P3 for the set of all distributions with finite
memory.
Recall that P1 has a solution for the set of all stationary processes.
Theorem
P2 (and thus P3) has no solution for the set of all stationary
processes.
So it makes a difference whether you want to predict all processes
from the set, or all processes for which one of them is a good
predictor.

mixture predictors

To generalize the idea of taking a sum of countably many
measures, we do not need separability; we can do it directly.
The following can be proven for P1 and P2, both in KL divergence
and in tv distance.

Theorem (countable Bayes)
Let C be any set of measures. If there is a solution to Problem 1
(Problem 2) for C, then there is a sequence of measures µ_k ∈ C,
k = 1, 2, ..., such that ν := Σ_{k∈N} w_k µ_k is a solution to Problem 1
(Problem 2), for some weights w_k.

This means that we do not need to look outside of C to construct
a predictor; we just have to look for a suitable countable subset
of C.

proof idea

We are given a set of measures C and a predictor ρ that predicts
all µ ∈ C. We need to find a sequence µ_i ∈ C of measures such
that Σ_{i∈N} w_i µ_i predicts all µ ∈ C (for some weights w_i).
For each n ∈ N we construct a cover of X^n as follows:

  T_µ^n := { x_{1..n} ∈ X^n : µ(x_{1..n}) ≥ (1/n) ρ(x_{1..n}) }.   (2)
Order the sets T_µ^n with respect to ρ as follows. Define
m_1^n := max_{µ∈C} ρ(T_µ^n); find any µ_1^n such that ρ(T^n_{µ_1^n}) = m_1^n
and let T_1^n := T^n_{µ_1^n}.
For k > 1, let m_k^n := max_{µ∈C} ρ(T_µ^n \ T_{k−1}^n), let µ_k^n be any µ ∈ C
such that ρ(T^n_{µ_k^n} \ T_{k−1}^n) = m_k^n, and let T_k^n := T_{k−1}^n ∪ T^n_{µ_k^n}.
Finally, define

  ν_n := Σ_{k=1}^{K_n} w_k µ_k^n   and   ν := Σ_{n∈N} w_n ν_n.   (3)

The increments T^n_{µ_k^n} \ T_{k−1}^n are disjoint and their ρ-masses m_k^n
are non-increasing, so that m_k^n ≤ 1/k. This can be used to show that,
for each µ, the set T_µ^n is covered by increments with subexponential
indices, up to a small probability, since otherwise ρ would have a
linear error on µ.
It remains to notice that on each set T^n_{µ_k^n} the mixture ν is as good
as ρ.

The same does not hold for P3!

Mixture predictors
A measure ϕ is called a mixture predictor (aka Bayesian predictor)
over a set C if ϕ(A) = ∫_C µ(A) dW(µ) for some measure W over
(a measurable subset of) C.

The same does not hold for P3!

Theorem
There exists a family C of distributions and a predictor ρ that
predicts every process ν as well as the best predictor in C,

  sup_{µ∈C} lim sup_{n→∞} (1/n) (d_n(ν, ρ) − d_n(ν, µ)) ≤ 0,

but every mixture predictor ϕ must have non-zero error: there
exists ν such that for some µ ∈ C

  lim inf_{n→∞} (1/n) d_n(ν, ϕ) ≥ const + lim sup_{n→∞} (1/n) d_n(ν, µ).

This means that in order to combine the predictive power of
measures in a set C it may be necessary to look somewhere
completely outside C.

proof idea

Take a Bernoulli i.i.d. measure β_p with p = 1/3, and the set S of
sequences typical for this measure. For each x ∈ S define µ_x as the
measure that behaves as β_p on x, and on all other sequences it
behaves as some fixed (independent of x) measure. C is the
resulting set of measures.

β_p can be recovered with a mixture predictor from the set S: it is
enough to take β_p itself as a prior over S. Such a predictor will
then be as good as β_p on any measure. Observe that for every
x_{1..T} it puts weight of about exp(−h_p T) on the set of sequences
from S that start with x_{1..T} (where h_p is the binary entropy for
the p = 1/3 of the example). The loss it achieves on measures from S
is thus h_p T, and this is the minimax loss one can achieve on S.

However, it is not possible to achieve the same loss (and to recover
β_p) with a mixture predictor whose prior is concentrated on the
set C:
each measure µ in C already attaches too little weight to the
sequence from S that it is based on: it is the same exp(−h_p T) that
the mixture gives to the corresponding deterministic sequence.
The best the mixture can do is give another exp(−h_p T), which means
that the resulting loss is going to be double the best possible loss
one can obtain on measures from S with the best possible predictor.
Thanks!