Planning, Learning, Prediction, and Games
Predicting with Many Experts: 11/20/09 - 12/11/09
Lecturer: Patrick Briest                                      Scribe: Philipp Brandes

3 Predicting with Many Experts
In the previous section on predicting with expert advice we made the assumption that the number of experts is reasonably small. In particular, this made it possible for algorithm Hedge to keep track of an explicit weight distribution on experts reflecting their past performance. We will now consider the case that the number of experts is too large for this kind of approach. As an example, consider the problem of deciding which route to take to work in the morning. We can model the network of streets as a graph and view each possible route as an expert incurring cost equal to the total delay on the edges of the path. Since the total number of paths connecting a pair of source and destination vertices can be exponentially large, maintaining an explicit weight distribution is infeasible in this case.
This example has the special property that experts can be described as vectors in a low-dimensional vector space and the cost of experts is a linear function. Let $e_1, \ldots, e_m$ be the edges of the graph. Then a path $P$ - and the corresponding "expert" - can be described as a vector $v(P) \in \{0,1\}^m$ with $v(P)_j = 1$ iff $e_j \in P$. If the delay on edge $e_j$ at time $t$ is $\varphi^t(e_j)$, the cost function at time $t$ is $c^t(x) = (\varphi^t(e_1), \ldots, \varphi^t(e_m)) \cdot x$. We will derive an algorithm that achieves almost the same worst-case regret as Hedge on this type of instance against oblivious randomized adversaries.
3.1 Some Notation

Experts form a bounded set $E \subseteq \mathbb{R}^n$. We let
\[ D = \sup_{x,y \in E} \|x - y\|_1 \]
denote the diameter of $E$. An oblivious adversary specifies a sequence of cost vectors $c^1, \ldots, c^T \in \mathbb{R}^n$. Let
\[ A = \max_{1 \le t \le T} \|c^t\|_1 . \]
An online algorithm chooses a sequence of experts (or "strategies") $x^1, \ldots, x^T \in E$. The cost of strategy $x$ at time $t$ is $c^t \cdot x = \sum_{i=1}^n c^t_i x_i$. Let
\[ R = \sup_{1 \le t \le T,\; x,y \in E} |c^t \cdot (x - y)| . \]
Define $c^{i \ldots j} = \sum_{t=i}^{j} c^t$ as the cumulative cost function of time steps $i, \ldots, j$ and, for an algorithm ALG picking strategies $x^1, \ldots, x^T$, define
\[ \mathrm{ALG}(i \ldots j) = \sum_{t=i}^{j} c^t \cdot x^t \]
as the cumulative cost of algorithm ALG.
For any vector $c$, $M(c)$ is an arbitrary element of $\operatorname{argmin}_{x \in E} c \cdot x$, i.e., a cost-minimal strategy for $c$. We will assume throughout that we can find an element $M(c)$ (efficiently) for any cost vector $c$, i.e., that we can minimize linear cost functions over the set $E$. In our route-to-work example above, this corresponds to computing a shortest path for a given vector of edge delays.
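As a concrete illustration, here is a minimal sketch of such an oracle for the case that $E$ is given explicitly as a finite list of vectors (this brute-force argmin is our own illustration, not part of the notes; for the routing example one would instead call a shortest-path routine on the graph with edge weights $c$):

\begin{verbatim}
import numpy as np

def M(c, E):
    """Return a cost-minimal strategy for cost vector c,
    i.e., an arbitrary element of argmin_{x in E} c . x."""
    E = np.asarray(E, dtype=float)
    c = np.asarray(c, dtype=float)
    return E[np.argmin(E @ c)]   # brute force; only sensible for small E
\end{verbatim}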
3.2 The "Follow the Leader" Algorithm

A natural idea for an algorithm might be the following: at time $t$, select an expert with minimum past cumulative cost. This is formalized in algorithm FTL below.

Algorithm 1: "Follow the Leader" Algorithm (FTL).

• At time $t$, choose strategy
\[ x^t = M(c^{1 \ldots t-1}) . \]
Unfortunately, this approach does not work too well, as demonstrated by the following example.
Example 3.1 Let $n = 2$, $E = \{(0,1), (1,0)\} \subset \mathbb{R}^2$,
\[ c^1 = \left( \tfrac{1}{3}, \tfrac{2}{3} \right) , \]
and, for $t \ge 2$,
\[ c^t = \begin{cases} (1,0) & \text{if } t \text{ is even} \\ (0,1) & \text{if } t \text{ is odd.} \end{cases} \]
Now, for $t$ even we have:
\[ c^{1 \ldots t}((1,0)) = \tfrac{1}{3} + \tfrac{t}{2} , \qquad c^{1 \ldots t}((0,1)) = \tfrac{2}{3} + \tfrac{t-2}{2} \quad \Rightarrow \quad M(c^{1 \ldots t}) = (0,1) . \]
So we pick $(0,1)$ in each odd round. If $t$ is odd, on the other hand, then:
\[ c^{1 \ldots t}((1,0)) = \tfrac{1}{3} + \tfrac{t-1}{2} , \qquad c^{1 \ldots t}((0,1)) = \tfrac{2}{3} + \tfrac{t-1}{2} \quad \Rightarrow \quad M(c^{1 \ldots t}) = (1,0) . \]
So we pick $(1,0)$ in each even round. Thus, FTL incurs cost $\ge T - 2/3$, whereas both fixed strategies have cost $T/2 + O(1)$ and an ideal sequence of strategies would incur cost only $1/3$. In particular, FTL suffers regret $T/2 - O(1)$, and we do not have vanishing per-time-step regret as in the case of Hedge.
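A quick simulation of Example 3.1 (our own sketch, not part of the notes; the tie-breaking in round 1 is arbitrary) makes the failure tangible:

\begin{verbatim}
import numpy as np

E = np.array([[0.0, 1.0], [1.0, 0.0]])            # the two strategies
T = 1000
costs = [np.array([1/3, 2/3])]                    # c^1
costs += [np.array([1.0, 0.0]) if t % 2 == 0 else np.array([0.0, 1.0])
          for t in range(2, T + 1)]

cum = np.zeros(2)                                 # cumulative cost c^{1...t-1}
ftl_cost = 0.0
for c in costs:
    x = E[np.argmin(E @ cum)]                     # FTL step: x^t = M(c^{1...t-1})
    ftl_cost += c @ x
    cum += c

print(ftl_cost)          # close to T = 1000
print(min(E @ cum))      # best fixed strategy: about T/2
\end{verbatim}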
Algorithm FPL below is a variation of FTL which randomly perturbs the cumulative cost function by adding an initial cost vector $c^0$ sampled according to some distribution $p$. Surprisingly, this small change in the algorithm is powerful enough to obtain strong bounds on the algorithm's expected regret for appropriate choices of $p$.

Algorithm 2: "Follow the Perturbed Leader" Algorithm (FPL$_p$).

• Choose a random cost vector $c^0 \in \mathbb{R}^n$ according to probability distribution $p$.

• At time $t$, choose strategy
\[ x^t = M(c^{0 \ldots t-1}) . \]
Let $\varepsilon > 0$ be any positive number. We will sample $c^0$ according to one of two distributions $\mu$ or $\alpha$, with density functions
\[ d\mu(x) = \left( \frac{\varepsilon}{2} \right)^n e^{-\varepsilon \|x\|_1} \]
and
\[ d\alpha(x) = \begin{cases} \left( \frac{\varepsilon}{2} \right)^n & \text{if } \|x\|_\infty \le \frac{1}{\varepsilon} \\ 0 & \text{else.} \end{cases} \]
Note that $\mu$ is a two-sided exponential (Laplace) distribution around $0$, while $\alpha$ is the uniform distribution on $[-1/\varepsilon, 1/\varepsilon]^n$. Drawing a sample from these is "easy", since in both cases the entries of $c^0$ are independent random variables, two-sided exponentially or uniformly distributed, respectively.
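In code, drawing the perturbation coordinate-wise is straightforward. The following sketch (our own, assuming NumPy; the Laplace distribution with scale $1/\varepsilon$ is exactly the two-sided exponential above) also shows the FPL step itself, reusing the oracle M from the sketch in Section 3.1:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sample_alpha(n, eps):
    # uniform on [-1/eps, 1/eps]^n
    return rng.uniform(-1.0 / eps, 1.0 / eps, size=n)

def sample_mu(n, eps):
    # density (eps/2)^n exp(-eps ||x||_1): i.i.d. Laplace coordinates
    return rng.laplace(loc=0.0, scale=1.0 / eps, size=n)

def fpl(costs, E, c0):
    """Run FPL on a fixed cost sequence with initial perturbation c0."""
    cum = np.array(c0, dtype=float)   # cumulative cost c^{0...t-1}
    total = 0.0
    for c in costs:
        x = M(cum, E)                 # x^t = M(c^{0...t-1})
        total += c @ x
        cum += c
    return total
\end{verbatim}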
3.3 Analysis of FPL

Let "Be the Perturbed Leader" (BPL) be an algorithm similar to FPL, except that it gets a one-step lookahead and chooses
\[ x^t = M(c^{0 \ldots t}) . \]
Obviously, BPL is not an online algorithm, since the cost vector $c^t$ is revealed only after an online algorithm selects its strategy $x^t$! It will turn out to be a very helpful tool in the analysis of FPL, though.
Let $\mathrm{OPT}(1 \ldots T) = c^{1 \ldots T} \cdot M(c^{1 \ldots T})$ denote the total cost of an optimal fixed strategy. Our analysis of FPL will proceed in 3 steps:

1. Because of its lookahead, BPL is as good as OPT, except for its random perturbation of the cost function by $c^0$. We will show that $\mathrm{BPL}(1 \ldots T) \le \mathrm{OPT}(1 \ldots T) + \mathbb{E}\left[ c^0 \cdot (M(c^{1 \ldots T}) - M(c^0)) \right]$.

2. $\mathrm{FPL}(1 \ldots T)$ is not much greater than $\mathrm{BPL}(1 \ldots T)$. This is because the distributions of $c^{0 \ldots t}$ and $c^{0 \ldots t-1}$ are closely related.¹

3. $\mathbb{E}\left[ c^0 \cdot (x - y) \right]$ is small for all $x, y \in E$.

¹ Throughout the remainder of this section, we will refer to a vector $c^0 + c$ and the distribution it induces on $\mathbb{R}^n$, if $c^0$ is sampled according to some fixed distribution, interchangeably. In particular, if $c^0$ is sampled from a distribution $p$ with density function $dp$, then the distribution induced by $c^0 + c$ has density function $dq(\cdot) = dp(\cdot - c)$.
3.3.1 Step I

Lemma 3.2 For all $i \le j$,
\[ \sum_{t=i}^{j} c^t \cdot M(c^{i \ldots t}) \le c^{i \ldots j} \cdot M(c^{i \ldots j}) . \]

Proof: We prove this by induction on $j - i$. If $j - i = 0$, the claim reduces to
\[ c^i \cdot M(c^i) \le c^i \cdot M(c^i) , \]
which is trivially true. Let then $j - i = \ell$ and assume that the claim holds for all $i, j$ with $j - i < \ell$. Then,
\begin{align*}
\sum_{t=i}^{j} c^t \cdot M(c^{i \ldots t}) &= \sum_{t=i}^{j-1} c^t \cdot M(c^{i \ldots t}) + c^j \cdot M(c^{i \ldots j}) \\
&\le c^{i \ldots j-1} \cdot M(c^{i \ldots j-1}) + c^j \cdot M(c^{i \ldots j}) && \text{since } j - 1 - i < \ell \\
&\le c^{i \ldots j-1} \cdot M(c^{i \ldots j}) + c^j \cdot M(c^{i \ldots j}) && \text{since } M(c^{i \ldots j-1}) \in \operatorname{argmin}_x c^{i \ldots j-1} \cdot x \\
&= c^{i \ldots j} \cdot M(c^{i \ldots j}) .
\end{align*}
This finishes the proof.
Corollary 3.3 It holds that
\[ \mathrm{BPL}(1 \ldots T) \le \mathrm{OPT}(1 \ldots T) + \mathbb{E}\left[ c^0 \cdot (M(c^{1 \ldots T}) - M(c^0)) \right] . \]

Proof: Applying Lemma 3.2 with $i = 0$, $j = T$ yields
\[ \sum_{t=0}^{T} c^t \cdot M(c^{0 \ldots t}) \le c^{0 \ldots T} \cdot M(c^{0 \ldots T}) \le c^{0 \ldots T} \cdot M(c^{1 \ldots T}) . \]
Subtracting $c^0 \cdot M(c^0)$ from both sides, we obtain
\begin{align*}
\mathrm{BPL}(1 \ldots T) &\le \underbrace{c^{1 \ldots T} \cdot M(c^{1 \ldots T})}_{= \mathrm{OPT}(1 \ldots T)} + c^0 \cdot M(c^{1 \ldots T}) - c^0 \cdot M(c^0) \\
&= \mathrm{OPT}(1 \ldots T) + c^0 \cdot (M(c^{1 \ldots T}) - M(c^0)) .
\end{align*}
Taking the expectation of both sides yields the claim.
3.3.2 Step II

Definition 3.4 For two distributions $p, q$ on $\mathbb{R}^n$, their multiplicative distance $d_\times(p,q)$ is the minimum value of $\delta$ such that their density functions satisfy
\[ dp(x) \le (1 + \delta)\, dq(x) \quad \text{and} \quad dq(x) \le (1 + \delta)\, dp(x) \]
for all $x \in \mathbb{R}^n$.
Their additive distance $d_+(p,q)$ is the minimum $\delta$ such that there exists a probability distribution $\mu$ on pairs $(x,y) \in \mathbb{R}^n \times \mathbb{R}^n$ with
\[ \mu(x \ne y) \le \delta \]
and, for all measurable sets $S \subseteq \mathbb{R}^n$,
\[ \mu(x \in S) = p(S) \quad \text{and} \quad \mu(y \in S) = q(S) . \]
Distribution $\mu$ is called a coupling of $p$ and $q$.
Lemma 3.5 Let $p, q$ be distributions on $\mathbb{R}^n$.
(1) For any function $f : E \to [\ell, h]$ with $h - \ell \le R$,
\[ \mathbb{E}_{c \sim p}[f(M(c))] \le \mathbb{E}_{c \sim q}[f(M(c))] + R\, d_+(p,q) . \]
(2) For any function $f : E \to \mathbb{R}_+$,
\[ \mathbb{E}_{c \sim p}[f(M(c))] \le (1 + d_\times(p,q))\, \mathbb{E}_{c \sim q}[f(M(c))] . \]

Proof: For (1), let $\mu$ be a coupling of $p$ and $q$ as in Definition 3.4. Then
\begin{align*}
\mathbb{E}_{c \sim p}[f(M(c))] - \mathbb{E}_{c \sim q}[f(M(c))]
&= \mathbb{E}_{(c_1,c_2) \sim \mu}[f(M(c_1))] - \mathbb{E}_{(c_1,c_2) \sim \mu}[f(M(c_2))] \\
&= \mathbb{E}_{(c_1,c_2) \sim \mu}[f(M(c_1)) - f(M(c_2))] \\
&= \underbrace{\mathbb{E}_{(c_1,c_2) \sim \mu}[f(M(c_1)) - f(M(c_2)) \mid c_1 = c_2]}_{=0} \cdot\, \mu(c_1 = c_2) \\
&\quad + \underbrace{\mathbb{E}_{(c_1,c_2) \sim \mu}[f(M(c_1)) - f(M(c_2)) \mid c_1 \ne c_2]}_{\le R} \cdot \underbrace{\mu(c_1 \ne c_2)}_{\le d_+(p,q)} \\
&\le R\, d_+(p,q) .
\end{align*}
For (2), note that
\begin{align*}
\mathbb{E}_{c \sim p}[f(M(c))] &= \int_{\mathbb{R}^n} f(M(c))\, dp(c) \\
&\le (1 + d_\times(p,q)) \int_{\mathbb{R}^n} f(M(c))\, dq(c) && \text{since } dp(c) \le (1 + d_\times(p,q))\, dq(c)\ \forall c \in \mathbb{R}^n \\
&= (1 + d_\times(p,q))\, \mathbb{E}_{c \sim q}[f(M(c))] .
\end{align*}
Recall that $R = \sup_{1 \le t \le T,\, x,y \in E} |c^t \cdot (x - y)|$.
Corollary 3.6 (1) Suppose $d_+(c^0, c + c^0) \le \delta$ for all $c \in \{c^1, \ldots, c^T\}$. Then
\[ \mathrm{FPL}(1 \ldots T) \le \mathrm{BPL}(1 \ldots T) + \delta R T . \]
(2) Suppose that $d_\times(c^0, c + c^0) \le \delta$ for all $c \in \{c^1, \ldots, c^T\}$. Moreover, assume that $c \cdot x \ge 0$ for all $c \in \{c^1, \ldots, c^T\}$ and $x \in E$. Then
\[ \mathrm{FPL}(1 \ldots T) \le (1 + \delta) \cdot \mathrm{BPL}(1 \ldots T) . \]
Proof: First, note that
\[ d_+(c^{0 \ldots t-1}, c^{0 \ldots t}) = d_+\!\left( \sum_{i=0}^{t-1} c^i, \sum_{i=0}^{t} c^i \right) = d_+(c^0, c^0 + c^t) , \]
and, similarly, $d_\times(c^{0 \ldots t-1}, c^{0 \ldots t}) = d_\times(c^0, c^0 + c^t)$. Now (1) follows immediately by observing that
\begin{align*}
\mathrm{FPL}(1 \ldots T) &= \sum_{t=1}^{T} \mathbb{E}\left[ c^t \cdot M(c^{0 \ldots t-1}) \right] \\
&\le \sum_{t=1}^{T} \Big( \mathbb{E}\left[ c^t \cdot M(c^{0 \ldots t}) \right] + R \underbrace{d_+(c^0, c^0 + c^t)}_{\le \delta} \Big) \\
&\le \mathrm{BPL}(1 \ldots T) + T R \delta ,
\end{align*}
where the first inequality follows by applying Part (1) of Lemma 3.5 with $f(x) = c^t \cdot x$ and using that $|c^t \cdot x - c^t \cdot y| \le R$ for all $x, y \in E$. Claim (2) follows analogously by applying Part (2) of Lemma 3.5.
Lemma 3.7 Let $c \in \mathbb{R}^n$ be any vector with $\|c\|_1 \le A$. If $c^0$ is a random sample from distribution $\alpha$ as defined above, then
\[ d_+(c^0, c + c^0) \le \varepsilon A . \]
If $c^0$ is a random sample from distribution $\mu$ as defined above, then
\[ d_\times(c^0, c + c^0) \le e^{\varepsilon A} - 1 . \]

Proof: See Exercise 5.2 on Homework Assignment 5.
3.3.3 Step III

Lemma 3.8 If $\|x - y\|_1 \le D$ and $\mathbb{E}[\|c^0\|_\infty] \le M$, then
\[ \mathbb{E}\left[ c^0 \cdot (x - y) \right] \le D M . \]

Proof: For all $v, w \in \mathbb{R}^n$ we may write
\[ v \cdot w = \sum_{i=1}^{n} v_i w_i \le \sum_{i=1}^{n} |v_i| |w_i| \le \sum_{i=1}^{n} \|v\|_\infty |w_i| = \|v\|_\infty \|w\|_1 . \]
Applying this fact with $v = c^0$, $w = x - y$ yields
\[ \mathbb{E}\left[ c^0 \cdot (x - y) \right] \le \mathbb{E}\left[ \|c^0\|_\infty \|x - y\|_1 \right] \le \|x - y\|_1\, \mathbb{E}\left[ \|c^0\|_\infty \right] \le D M , \]
which completes the proof.
Lemma 3.9 (1) If $c^0$ is sampled from distribution $\alpha$, then
\[ \mathbb{E}\left[ \|c^0\|_\infty \right] \le \frac{1}{\varepsilon} . \]
(2) If $c^0$ is sampled from distribution $\mu$, then
\[ \mathbb{E}\left[ \|c^0\|_\infty \right] = O\!\left( \frac{\log(n)}{\varepsilon} \right) , \]
provided that $n \ge 3$.

Proof: Claim (1) is obvious, because $\mathrm{Prob}_\alpha\left( \|c^0\|_\infty > \frac{1}{\varepsilon} \right) = 0$.
For (2), first note that the absolute values of the coordinates of $c^0$ when sampled according to $\mu$ are independent exponentially distributed random variables $X_i$ with density function $\varepsilon e^{-\varepsilon x}$ and expectation
\begin{align*}
\mathbb{E}[X_i] &= \int_0^\infty x \varepsilon e^{-\varepsilon x}\, dx = -\int_0^\infty x \left( -\varepsilon e^{-\varepsilon x} \right) dx \\
&= \left. -x e^{-\varepsilon x} \right|_0^\infty + \int_0^\infty e^{-\varepsilon x}\, dx = 0 - \left. \frac{1}{\varepsilon} e^{-\varepsilon x} \right|_0^\infty = \frac{1}{\varepsilon} ,
\end{align*}
where the second line uses integration by parts (see Remark A.1). Such a random variable $X_i$ satisfies
\[ \mathbb{E}\left[ X_i - r \mid X_i \ge r \right] = \frac{1}{\varepsilon} \]
for all $r \ge 0$ (see Exercise 6.2 on Homework Assignment 6). Let $X_i = |(c^0)_i|$. Then we have
\[ \mathbb{E}\!\left[ X_i - \frac{\ln(n)}{\varepsilon} \,\Big|\, X_i \ge \frac{\ln(n)}{\varepsilon} \right] = \frac{1}{\varepsilon} , \]
and
\[ \mathrm{Prob}\!\left( X_i \ge \frac{\ln(n)}{\varepsilon} \right) = \int_{\ln(n)/\varepsilon}^{\infty} \varepsilon e^{-\varepsilon x}\, dx = \left. -e^{-\varepsilon x} \right|_{\ln(n)/\varepsilon}^{\infty} = 0 + e^{-\ln(n)} = \frac{1}{n} . \]
This yields
\begin{align*}
\mathbb{E}\left[ \|c^0\|_\infty \right] &= \mathbb{E}\!\left[ \max_{1 \le i \le n} X_i \right] \\
&= \frac{\ln(n)}{\varepsilon} + \mathbb{E}\!\left[ \max_{1 \le i \le n} \left\{ X_i - \frac{\ln(n)}{\varepsilon} \right\} \right] \\
&\le \frac{\ln(n)}{\varepsilon} + \sum_{i=1}^{n} \mathbb{E}\!\left[ \max\left\{ X_i - \frac{\ln(n)}{\varepsilon},\, 0 \right\} \right] \\
&= \frac{\ln(n)}{\varepsilon} + \sum_{i=1}^{n} \underbrace{\mathrm{Prob}\!\left( X_i \ge \frac{\ln(n)}{\varepsilon} \right)}_{= \frac{1}{n}} \cdot \underbrace{\mathbb{E}\!\left[ X_i - \frac{\ln(n)}{\varepsilon} \,\Big|\, X_i \ge \frac{\ln(n)}{\varepsilon} \right]}_{= \frac{1}{\varepsilon}} \\
&= \frac{\ln(n)}{\varepsilon} + \frac{1}{\varepsilon} \le \frac{2 \ln(n)}{\varepsilon} ,
\end{align*}
where the last inequality uses that $\ln(n) \ge 1$ for $n \ge 3$.
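A small Monte Carlo experiment (our own sanity check, not part of the proof) confirms the order of this bound: the empirical mean of $\|c^0\|_\infty$ under $\mu$ comes out near $(\ln(n) + 0.58)/\varepsilon$, comfortably below $2\ln(n)/\varepsilon$ for larger $n$:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
eps, n, trials = 0.1, 100, 10000

samples = rng.laplace(scale=1.0 / eps, size=(trials, n))
print(np.abs(samples).max(axis=1).mean())   # roughly (ln(n) + 0.58) / eps
print(2 * np.log(n) / eps)                  # the bound from Lemma 3.9
\end{verbatim}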
3.3.4 Putting it all together

Theorem 3.10 (1) For FPL$_\alpha$ with $\varepsilon = \sqrt{D/(RAT)}$, we have
\[ \mathrm{FPL}_\alpha(1 \ldots T) \le \mathrm{OPT}(1 \ldots T) + O\!\left( \sqrt{DRAT} \right) . \]
(2) For FPL$_\mu$ with any $\varepsilon \le 1/A$, we have
\[ \mathrm{FPL}_\mu(1 \ldots T) \le (1 + O(\varepsilon A))\, \mathrm{OPT}(1 \ldots T) + \frac{e \ln(n) D}{\varepsilon} . \]

Proof: Claim (1):
\begin{align*}
\mathrm{FPL}_\alpha(1 \ldots T) &\le \mathrm{BPL}_\alpha(1 \ldots T) + \varepsilon R A T && \text{by Corollary 3.6 and Lemma 3.7} \\
&\le \mathrm{OPT}(1 \ldots T) + \varepsilon R A T + \frac{D}{\varepsilon} && \text{by Corollary 3.3, Lemmas 3.8 and 3.9} \\
&= \mathrm{OPT}(1 \ldots T) + O\!\left( \sqrt{DRAT} \right) .
\end{align*}
Claim (2):
\begin{align*}
\mathrm{FPL}_\mu(1 \ldots T) &\le e^{\varepsilon A}\, \mathrm{BPL}_\mu(1 \ldots T) && \text{by Corollary 3.6 and Lemma 3.7} \\
&\le (1 + O(\varepsilon A))\, \mathrm{OPT}(1 \ldots T) + \frac{e \ln(n) D}{\varepsilon} && \text{by Corollary 3.3, Lemmas 3.8 and 3.9.}
\end{align*}
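Tying the pieces together, here is a sketch (again our own illustration) that runs FPL$_\alpha$ with the $\varepsilon$ from Theorem 3.10 on a random cost sequence, reusing the functions M and fpl from the sketches above:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
E = np.array([[0.0, 1.0], [1.0, 0.0]])
T, n = 1000, 2
costs = [rng.uniform(0.0, 1.0, size=n) for _ in range(T)]

D = 2.0                                           # diameter of E in ||.||_1
A = max(np.abs(c).sum() for c in costs)
R = max(abs(c @ (E[0] - E[1])) for c in costs)
eps = np.sqrt(D / (R * A * T))

c0 = rng.uniform(-1.0 / eps, 1.0 / eps, size=n)   # perturbation from alpha
print(fpl(costs, E, c0))                          # FPL_alpha's total cost
print(min(E @ sum(costs)))                        # OPT(1...T)
\end{verbatim}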
3.4 Comparing FPL$_\mu$ and Hedge for Expert Learning (with "few" experts)

The standard expert prediction problem with $n$ experts can be encoded as follows: Define strategy set $\{e_1, \ldots, e_n\} \subseteq \mathbb{R}^n$, where unit vector $e_i$ stands for expert $i$. The components of cost vector $c^t$ at time $t$ are the costs of the experts in that step, i.e., expert $i$ has cost $c^t \cdot e_i = (c^t)_i$.
In this setting it holds that $\|c\|_\infty \le 1$ and, thus, $\|c\|_1 \le n$ for all cost vectors $c$. So, $A \le n$, $D \le 2$ and, by our bounds on the regret of FPL$_\mu$,
\[ \mathrm{FPL}_\mu(1 \ldots T) \le (1 + O(\varepsilon n)) \cdot \mathrm{OPT}(1 \ldots T) + O\!\left( \frac{\ln(n)}{\varepsilon} \right) . \]
On the other hand, Theorem 2.3 stated for Hedge that
\[ \mathrm{Hedge}(1 \ldots T) \le (1 + O(\varepsilon)) \cdot \mathrm{OPT}(1 \ldots T) + O\!\left( \frac{\ln(n)}{\varepsilon} \right) . \]
We can improve the bound for FPL$_\mu$ via the following little trick. Expand each cost vector $c^t$ into the $n$ cost vectors
\[ (c^t)_1 \cdot e_1,\ (c^t)_2 \cdot e_2,\ \ldots,\ (c^t)_n \cdot e_n . \]
Now we have that $\|(c^t)_j \cdot e_j\|_1 \le 1$. So on this new sequence of cost vectors,
\[ \mathrm{FPL}_\mu(1 \ldots nT) \le (1 + O(\varepsilon)) \cdot \mathrm{OPT}(1 \ldots nT) + O\!\left( \frac{\ln(n)}{\varepsilon} \right) . \]
Clearly, it holds that $\mathrm{OPT}(1 \ldots nT) = \mathrm{OPT}(1 \ldots T)$. One can also show that $\mathbb{E}[\mathrm{FPL}_\mu(1 \ldots T)] \le \mathbb{E}[\mathrm{FPL}_\mu(1 \ldots nT)]$ (see Exercise 6.1 of Problem Set 6), and so the improved bound holds for the original sequence of cost functions as well. Note, however, that the bounds for FPL$_\mu$ only hold against oblivious adversaries, whereas our bounds for Hedge hold against adaptive adversaries. FPL$_\mu$ is sometimes referred to as the "Hannan-Kalai-Vempala" algorithm.
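The expansion itself is a one-liner (our own sketch, assuming NumPy): each round's cost vector becomes $n$ consecutive cost vectors, each supported on a single coordinate.

\begin{verbatim}
import numpy as np

def expand(c):
    # split c in R^n into the n cost vectors c_1*e_1, ..., c_n*e_n;
    # each has ||.||_1 <= 1 whenever ||c||_inf <= 1
    return list(np.diag(np.asarray(c, dtype=float)))
\end{verbatim}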
[Figure 1: A 1-dimensional convex function and an illustration of the first-order condition.]
3.5 Online Convex Optimization
In what follows we will look at a generalization of the online linear programming problem from
Section 3.1 and consider the case that cost functions are not necessarily linear, but convex. On
the other hand, we will make a stronger assumption regarding the set of strategies, which will be
required to be convex.
3.5.1 Some facts about convex sets and functions

Set $S \subseteq \mathbb{R}^n$ is convex iff
\[ x, y \in S \ \Rightarrow\ \forall \lambda \in [0,1]:\ (1 - \lambda) \cdot x + \lambda \cdot y \in S . \]
A function $f : S \to \mathbb{R}$ with convex domain $S$ is called convex iff
\[ \forall x, y \in S,\ \lambda \in [0,1]:\ f((1 - \lambda) x + \lambda y) \le (1 - \lambda) f(x) + \lambda f(y) . \]
For a convex set $S \subseteq \mathbb{R}^n$ and $z \in \mathbb{R}^n$, let
\[ P_S(z) := \operatorname{argmin}_{x \in S} \|x - z\|_2 \]
denote the projection of $z$ onto $S$. Proposition 3.11 below states that $P_S(z)$ is well-defined.
Proposition 3.11 If $S$ is convex, then
\[ \left| \operatorname{argmin}_{x \in S} \|x - z\|_2 \right| = 1 \]
for all $z \in \mathbb{R}^n$. So $P_S(z)$ is unambiguous.
For a proof of this simple fact see Fig. 2(a).
Proposition 3.12 Let $S \subseteq \mathbb{R}^n$ be convex, $z \in \mathbb{R}^n$, $y \in S$. Then
\[ \|y - P_S(z)\|_2 \le \|y - z\|_2 . \]

[Figure 2: (a) If there exist $x, y \in S$ ($x \ne y$, $S$ convex) at equal distance $d = \|x - z\|_2 = \|y - z\|_2$ from $z \in \mathbb{R}^n$, then $\|(1 - \lambda)x + \lambda y - z\|_2 < d$ for all $\lambda \in (0,1)$. (b) Illustration of the proof of Proposition 3.12.]

Sketch of Proof: Since $P_S(z)$ is unambiguous, $P_S(z)$ is the only point of $S$ contained in the ball of radius $\|z - P_S(z)\|_2$ around $z$. Thus, all of $S$ lies in the halfspace opposite to $z$ defined by the hyperplane through $P_S(z)$ with normal vector $z - P_S(z)$. Hence, $(y - P_S(z)) \cdot (z - P_S(z)) \le 0$. Let $\varphi$ denote the angle between the vectors $y - P_S(z)$ and $z - P_S(z)$ as depicted in Fig. 2(b). It follows that
\[ \cos \varphi = \frac{(y - P_S(z)) \cdot (z - P_S(z))}{\|y - P_S(z)\|_2 \cdot \|z - P_S(z)\|_2} \le 0 , \]
and, thus, $\varphi \ge \frac{\pi}{2}$. Considering the triangle defined by $z$, $P_S(z)$ and $y$, the fact that the angle at $P_S(z)$ is at least $\frac{\pi}{2}$ implies that the line segment between $z$ and $y$ is a longest side of the triangle; thus, $\|y - z\|_2 \ge \|y - P_S(z)\|_2$. □
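For concreteness, here are the standard closed-form projections onto two common convex sets (our own sketch; either can serve as $P_S$ in the algorithm below, and Proposition 3.12 can be checked numerically against them):

\begin{verbatim}
import numpy as np

def project_ball(z, r=1.0):
    # projection onto the Euclidean ball of radius r around the origin
    nrm = np.linalg.norm(z)
    return z if nrm <= r else (r / nrm) * z

def project_box(z, lo=0.0, hi=1.0):
    # projection onto the box [lo, hi]^n: clip coordinate-wise
    return np.clip(z, lo, hi)
\end{verbatim}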
Proposition 3.13 (First-Order Condition) Let $S$ be convex and $f : S \to \mathbb{R}$ differentiable. Then $f$ is convex iff
\[ f(y) \ge f(x) + \nabla f(x) \cdot (y - x) \]
for all $x, y \in S$.

Remark: $\nabla f(x) = \left( \frac{\partial}{\partial x_1} f(x), \frac{\partial}{\partial x_2} f(x), \ldots, \frac{\partial}{\partial x_n} f(x) \right)$ denotes the gradient of $f$ at $x$.

3.5.2 The Online Convex Optimization Problem and Zinkevich's Algorithm
Given

• a bounded convex set $S \subseteq \mathbb{R}^n$ and

• a sequence of convex cost functions $f_1, \ldots, f_T$ (with domain $S$, revealed one at a time),

find a sequence of strategies $x_1, \ldots, x_T \in S$ minimizing
\[ \sum_{t=1}^{T} f_t(x_t) . \]
Each $x_t$ needs to be chosen before $f_t$ is revealed!
We are going to analyze the following algorithm for the online convex optimization problem defined
above:
Algorithm 3: Zinkevich’s Algorithm
Given step sizes $\eta_1 \ge \eta_2 \ge \ldots$, choose
\[ x_{t+1} = P_S\left( x_t - \eta_t \nabla f_t(x_t) \right) . \]
We will consider the problem of choosing appropriate step sizes later on. First, let's review some facts about gradients and develop some intuition for what Zinkevich's algorithm is doing.
The gradient $\nabla f(x)$ of $f$ at $x$ points in the direction of the greatest rate of increase of $f$, i.e., the direction in which the directional derivative has the largest value, and $\|\nabla f(x)\|_2$ is the value of that directional derivative. To see this, let $u \in \mathbb{R}^n$ be a unit-length vector. Then the derivative of $f$ at $x$ in direction $u$ is $\nabla f(x) \cdot u$. This is maximized by choosing $u = \frac{\nabla f(x)}{\|\nabla f(x)\|_2}$.
So Zinkevich's algorithm moves in the direction opposite to the direction of the greatest rate of increase of the last observed cost function. (This is similar to the gradient descent method in optimization.)
The algorithm uses no information about the history except the last observed cost function and its last strategy to determine its actions. The algorithm does not use randomization: the convexity of $S$ allows mixing strategies explicitly instead of sampling randomly.
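The following minimal sketch (our own) implements the algorithm, using the step sizes $\eta_t = \sqrt{1/t} \cdot D/A$ derived at the end of this section; project is any projection onto $S$ (such as project_ball above), and each cost function is supplied through its gradient:

\begin{verbatim}
import numpy as np

def zinkevich(x1, grads, project, D, A):
    """Projected online gradient descent (Zinkevich's algorithm).

    x1:      starting point in S
    grads:   grads[t-1](x) returns the gradient of f_t at x
    project: the projection P_S onto the convex set S
    """
    x = np.asarray(x1, dtype=float)
    history = [x]
    for t, grad in enumerate(grads, start=1):
        eta = (D / A) / np.sqrt(t)        # step size eta_t
        x = project(x - eta * grad(x))    # x_{t+1} = P_S(x_t - eta_t grad f_t(x_t))
        history.append(x)
    return history
\end{verbatim}

Note that, as required, $x_{t+1}$ depends only on $x_t$ and $\nabla f_t(x_t)$, i.e., only on the last strategy and the last observed cost function.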
Theorem 3.14 Assume that
\[ \|x - y\|_2 \le D \quad \text{for all } x, y \in S \]
and
\[ \|\nabla f\|_2 \le A \quad \text{for all } f \in \{f_1, \ldots, f_T\} . \]
Then for every $x \in S$, the sequence of strategies $x_1, \ldots, x_T$ selected by Zinkevich's algorithm satisfies
\[ \sum_{t=1}^{T} f_t(x_t) \ \le\ \sum_{t=1}^{T} f_t(x) + \frac{D^2}{2\eta_T} + \frac{A^2}{2} \sum_{t=1}^{T} \eta_t . \]
Proof: Let $x \in S$ be fixed and define a potential function
\[ \phi(y) = \|y - x\|_2^2 . \]
By Proposition 3.12 we know that $\phi(P_S(y)) \le \phi(y)$ for all $y \in \mathbb{R}^n$. Let us evaluate the change in potential between steps $t$ and $t+1$. We may write
\begin{align*}
\phi(x_{t+1}) - \phi(x_t) &= \phi\left( P_S(x_t - \eta_t \nabla f_t(x_t)) \right) - \phi(x_t) \\
&\le \phi\left( x_t - \eta_t \nabla f_t(x_t) \right) - \phi(x_t) \\
&= \|x_t - \eta_t \nabla f_t(x_t) - x\|_2^2 - \|x_t - x\|_2^2 \\
&= \big( (x_t - x) - \eta_t \nabla f_t(x_t) \big)^2 - (x_t - x)^2 \\
&= (x_t - x)^2 - 2 \eta_t \nabla f_t(x_t) \cdot (x_t - x) + \eta_t^2 \nabla f_t(x_t)^2 - (x_t - x)^2 \\
&= -2 \eta_t \nabla f_t(x_t) \cdot (x_t - x) + \eta_t^2 \underbrace{\|\nabla f_t(x_t)\|_2^2}_{\le A^2} .
\end{align*}
By Proposition 3.13, we know $f_t(x) \ge f_t(x_t) + \nabla f_t(x_t) \cdot (x - x_t)$ and, thus,
\[ \nabla f_t(x_t) \cdot (x_t - x) \ge f_t(x_t) - f_t(x) . \]
It follows that
\[ \phi(x_{t+1}) - \phi(x_t) \le -2 \eta_t \big( f_t(x_t) - f_t(x) \big) + \eta_t^2 A^2 . \]
Intuitively, this says the following: If, under cost function $f_t$, the fixed point $x$ incurs much smaller cost than $x_t$, then the gradient of $f_t$ at $x_t$ points away from $x$. Thus, the algorithm will take a step towards $x$ and decrease the potential.
Rearranging the above yields
\begin{align*}
f_t(x_t) - f_t(x) &\le \frac{\phi(x_t) - \phi(x_{t+1})}{2\eta_t} + \frac{1}{2} A^2 \eta_t \\
&\le \frac{\phi(x_t) - \phi(x_{t+1})}{2\eta_T} + \frac{1}{2} A^2 \eta_t .
\end{align*}
Summing over all times $t$ we obtain
\begin{align*}
\sum_{t=1}^{T} \left( f_t(x_t) - f_t(x) \right) &\le \sum_{t=1}^{T} \left( \frac{\phi(x_t) - \phi(x_{t+1})}{2\eta_T} + \frac{1}{2} A^2 \eta_t \right) \\
&= \frac{1}{2\eta_T} \sum_{t=1}^{T} \left( \phi(x_t) - \phi(x_{t+1}) \right) + \frac{1}{2} A^2 \sum_{t=1}^{T} \eta_t \\
&= \frac{1}{2\eta_T} \underbrace{\left( \phi(x_1) - \phi(x_{T+1}) \right)}_{\le D^2} + \frac{1}{2} A^2 \sum_{t=1}^{T} \eta_t \\
&\le \frac{D^2}{2\eta_T} + \frac{A^2}{2} \sum_{t=1}^{T} \eta_t ,
\end{align*}
which completes the proof.
We still need to decide what step sizes $\eta_1, \eta_2, \ldots$ to use in the algorithm. Our bound on the regret of Zinkevich's algorithm consists of two additive terms, $D^2/(2\eta_T)$ and $(1/2) A^2 \sum_{t=1}^T \eta_t$. The first term is small when $\eta_T$ is large; for the second term to be small we need the sequence of $\eta_t$'s to decrease quickly, so $\eta_T$ needs to be small. What's the best tradeoff? We want that
\[ \frac{D^2}{2\eta_T} = \frac{1}{2} A^2 \sum_{t=1}^{T} \eta_t \quad \text{or, equivalently,} \quad \eta_T \sum_{t=1}^{T} \eta_t = \frac{D^2}{A^2} . \]
Clearly, we can obtain this by choosing $\eta_t = (D/A) f(t)$, where $f(T)^{-1} \approx \sum_{t=1}^T f(t)$. Let's try $f(t) = t^{-q}$, where $q \in (0,1)$. What is $\sum_{t=1}^T t^{-q}$?
We have (see Fig. 3)
\[ \int_0^T (x+1)^{-q}\, dx \ \le\ \sum_{t=1}^{T} t^{-q} \ \le\ \int_0^T x^{-q}\, dx . \]

[Figure 3: Upper and lower bound for $\sum_{t=1}^T t^{-q}$.]

Since
\[ \int_0^T x^{-q}\, dx = \frac{1}{1-q} T^{1-q} \]
and
\[ \int_0^T (x+1)^{-q}\, dx = \int_1^{T+1} x^{-q}\, dx = \frac{1}{1-q} (T+1)^{1-q} - \frac{1}{1-q} \ \ge\ \frac{1}{1-q} T^{1-q} - \frac{1}{1-q} , \]
we obtain
\[ \frac{1}{1-q} T^{1-q} - \frac{1}{1-q} \ \le\ \sum_{t=1}^{T} t^{-q} \ \le\ \frac{1}{1-q} T^{1-q} . \]
We wanted $T^q \approx T^{1-q}$ and, thus, need to choose $q = \frac{1}{2}$ and
\[ \eta_t = \sqrt{\frac{1}{t}} \cdot \frac{D}{A} . \]
Plugging this into our bound on the regret of Zinkevich's algorithm yields:
\[ \sum_{t=1}^{T} f_t(x_t) \ \le\ \sum_{t=1}^{T} f_t(x) + \underbrace{\frac{D^2}{2\eta_T}}_{= \frac{1}{2} \sqrt{T} \frac{A}{D} \cdot D^2 = O(DA\sqrt{T})} + \underbrace{\frac{A^2}{2} \sum_{t=1}^{T} \eta_t}_{\le \frac{1}{2} A^2 \frac{D}{A} \cdot 2\sqrt{T} = O(DA\sqrt{T})} \ =\ \sum_{t=1}^{T} f_t(x) + O\big( DA\sqrt{T} \big) . \]
Finally, we briefly mention that the algorithm compares favorably against a slightly stronger benchmark, too (see Exercise 7.1 of Problem Set 7). For a sequence $z_1, \ldots, z_T \in S$, let
\[ L(z_1, \ldots, z_T) = \sum_{t=1}^{T-1} \|z_{t+1} - z_t\|_2 . \]

Theorem 3.15 The strategies $x_1, \ldots, x_T$ chosen by Zinkevich's algorithm satisfy
\[ \sum_{t=1}^{T} f_t(x_t) \ \le\ \sum_{t=1}^{T} f_t(z_t) + \frac{D^2}{2\eta_T} + \frac{A^2}{2} \sum_{t=1}^{T} \eta_t + \frac{2 D\, L(z_1, \ldots, z_T)}{\eta_T} \]
for any sequence of cost functions $f_1, \ldots, f_T$ and any sequence of strategies $z_1, \ldots, z_T$.
A Appendix

Remark A.1 Let $f$, $g$ be continuously differentiable functions. Then
\[ \int f g' = f g - \int f' g . \]
This is referred to as the rule of integration by parts.