CS 683 — Learning, Games, and Electronic Markets                    Spring 2007
Notes from Week 7: Learning with Many Experts                      5–9 Mar 2007
Instructor: Robert Kleinberg

1   The Hannan-Kalai-Vempala Algorithm
Sometimes one needs to solve a version of the best-expert problem in which the
number of experts is exponential in the size of the problem’s natural representation.
For example, consider the problem of choosing a route to take from home to work
every day. The road network forms a directed graph, with edge delays that vary from
day to day. The number of paths joining two nodes is exponential in general. An
algorithm originally discovered by Hannan, and rediscovered by Kalai and Vempala,
demonstrates how to learn the best strategy efficiently, provided that:
• Strategies can be represented as vectors in a low-dimensional vector space. The
costs of different strategies are defined by a linear function on this vector space.
• There is an efficient algorithm for minimizing linear functions on the set of
strategies. (For the routing example this is a shortest-path computation; see
the sketch below.)
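For concreteness, here is a minimal sketch of such a linear-minimization oracle for the routing example (Python; the function name and graph encoding are our own, not part of the original algorithm). It assumes the road network is a directed acyclic graph whose nodes are numbered in topological order; this choice conveniently also handles the negative edge costs produced by the perturbations introduced in Section 1.2.

```python
def dag_path_oracle(n_nodes, edges, source, sink):
    """Return M(c): maps an edge-cost vector c to the 0/1 incidence
    vector of a min-cost source->sink path.  Assumes a DAG whose nodes
    are numbered in topological order (so negative costs are fine), and
    that the sink is reachable from the source."""
    order = sorted(range(len(edges)), key=lambda i: edges[i][0])

    def M(c):
        inf = float("inf")
        dist = [inf] * n_nodes
        prev = [None] * n_nodes      # edge index used to reach each node
        dist[source] = 0.0
        for i in order:              # relax edges in topological order
            u, v = edges[i]
            if dist[u] + c[i] < dist[v]:
                dist[v] = dist[u] + c[i]
                prev[v] = i
        x = [0.0] * len(edges)       # rebuild the chosen path as a vector
        node = sink
        while node != source:
            i = prev[node]
            x[i] = 1.0
            node = edges[i][0]
        return x

    return M
```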
1.1   Notation

The set of experts, S, is a bounded subset of R^n. Let D be the ℓ1-diameter of S, i.e.

    D = \sup_{x,y \in S} \|x - y\|_1 .
An oblivious adversary specifies a sequence of cost vectors c_1, c_2, ..., c_T. Let

    A = \sup_{1 \le t \le T} \|c_t\|_1 .
An online algorithm chooses a sequence of strategies x_1, x_2, ..., x_T. The cost of strategy x at time t is the dot product c_t · x. Let

    R = \sup_{1 \le t \le T, \; x,y \in S} |c_t \cdot (x - y)| .
For convenience we will define the notation

    c_{i..j} = \sum_{t=i}^{j} c_t .
If ALG is a (possibly randomized) algorithm which chooses a sequence of strategies x_1, x_2, ..., x_T ∈ S, define

    ALG(i..j) = \sum_{t=i}^{j} c_t \cdot x_t .
For any vector c we will define M(c) to be an arbitrary element of \arg\min_{x \in S} c \cdot x.
We will be interested in algorithms which learn to approximately minimize the average cost of the chosen strategies. We will present algorithms which satisfy a multiplicative bound

    E\left[ \sum_{t=1}^{T} c_t \cdot x_t \right] \le (1 + O(\varepsilon A)) \, E[c_{1..T} \cdot M(c_{1..T})] + O\left( \frac{D \log(n)}{\varepsilon} \right)        (1)

or an additive bound

    E\left[ \sum_{t=1}^{T} c_t \cdot x_t \right] \le E[c_{1..T} \cdot M(c_{1..T})] + O\left( \sqrt{DRAT} \right).        (2)
1.2   “Follow the leader” and “Follow the perturbed leader”

One natural algorithm for this problem is “follow the leader” (FTL), which always picks the strategy that has performed best in the past, i.e. the algorithm which selects its strategies according to the rule

    x_t = M(c_{1..t-1}) .
This algorithm performs very well when the cost functions are i.i.d. samples from a
distribution, but can perform very poorly when the cost functions are chosen adversarially.
Example 1. Suppose n = 2, S = {(0, 1), (1, 0)}, c_1 = (1/3, 2/3), and for t > 1,

    c_t = \begin{cases} (1, 0) & \text{if } t \text{ is even} \\ (0, 1) & \text{if } t \text{ is odd.} \end{cases}

For both elements of S, the cumulative cost grows at a rate of T/2 + O(1), but FTL always chooses the strategy whose cost in the current period is 1, resulting in a cumulative loss of T − O(1).
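A few lines of Python make the failure concrete (a minimal sketch; the tie at t = 1 is broken arbitrarily, matching the arbitrary choice in the definition of M):

```python
T = 1000
S = [(0.0, 1.0), (1.0, 0.0)]
costs = [(1 / 3, 2 / 3)] + [((1.0, 0.0) if t % 2 == 0 else (0.0, 1.0))
                            for t in range(2, T + 1)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

hist = [0.0, 0.0]    # cumulative cost vector c_{1..t-1}
ftl_cost = 0.0
for c in costs:
    x = min(S, key=lambda s: dot(hist, s))   # x_t = M(c_{1..t-1})
    ftl_cost += dot(c, x)
    hist = [h + ci for h, ci in zip(hist, c)]

best_fixed = min(dot(hist, s) for s in S)
print(ftl_cost, best_fixed)   # roughly T - O(1) versus T/2 + O(1)
```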
The example demonstrates that FTL can perform badly in cases in which it is “badly synchronized” with the input sequence c_1, c_2, ..., c_T. This suggests using randomization to avoid such synchronization problems. Let c_0 be a random vector sampled from some probability distribution p on R^n. Think of c_0 as an extra cost vector which the algorithm “hallucinates” at time 0. The algorithm “follow the perturbed leader” (FPL_p) is the same as “follow the leader” except that it minimizes the sum of actual cost and hallucinated cost, i.e. it samples c_0 randomly from distribution p and then chooses its strategy at time t using the rule

    x_t = M(c_{0..t-1}) .
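In code, FPL_p is a thin wrapper around the oracle M; a minimal sketch (the perturbation sampler is passed in as a function, e.g. one of the two samplers defined below):

```python
def fpl(M, cost_vectors, sample_perturbation):
    """Follow the perturbed leader: hallucinate c_0 once up front, then
    play x_t = M(c_{0..t-1}) in every round.  Returns the total cost."""
    hist = list(sample_perturbation())   # running sum c_{0..t-1}
    total = 0.0
    for c in cost_vectors:
        x = M(hist)                      # x_t = M(c_{0..t-1})
        total += sum(ci * xi for ci, xi in zip(c, x))
        hist = [h + ci for h, ci in zip(hist, c)]
    return total
```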
Let ε > 0 be any positive number. We will be analyzing the performance of FPL_p when p is one of the distributions µ, α with the following two density functions:

    d\mu(x) = \left( \frac{\varepsilon}{2} \right)^n e^{-\varepsilon \|x\|_1}

    d\alpha(x) = \begin{cases} \left( \frac{\varepsilon}{2} \right)^n & \text{if } \|x\|_\infty \le 1/\varepsilon \\ 0 & \text{otherwise.} \end{cases}

A random sample from µ is generated by sampling each coordinate independently using the following procedure: draw a random positive number y from the exponential distribution Pr(y > r) = e^{-εr}, and change its sign from positive to negative with probability 1/2. A random sample from α is generated by sampling each coordinate independently from the uniform distribution on [−1/ε, 1/ε].
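Both sampling procedures translate directly into Python; a sketch using only the standard library (random.expovariate(eps) draws y with Pr(y > r) = e^{-εr}):

```python
import random

def sample_mu(n, eps):
    """One sample from mu: each coordinate is exponential with rate
    eps, negated with probability 1/2."""
    return [random.expovariate(eps) * random.choice((-1.0, 1.0))
            for _ in range(n)]

def sample_alpha(n, eps):
    """One sample from alpha: each coordinate uniform on [-1/eps, 1/eps]."""
    return [random.uniform(-1.0 / eps, 1.0 / eps) for _ in range(n)]
```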
1.3   “Be the perturbed leader”

Define an algorithm “be the perturbed leader” (BPL) analogously to FPL, except that it has one-step lookahead. (Hence it is not an online algorithm!) In other words, BPL chooses x_t = M(c_{0..t}).
Let OPT(1..T) = c_{1..T} · M(c_{1..T}). Our analysis of FPL will be based on three facts.

1. BPL(1..T) is not much greater than OPT(1..T). Specifically,

       BPL(1..T) \le OPT(1..T) + E[c_0 \cdot (M(c_{1..T}) - M(c_0))] .

2. FPL(1..T) is not much greater than BPL(1..T). (Reason: the distributions of c_{0..t} and c_{0..t-1} are so similar that their minimizers are very closely related.)

3. E[c_0 \cdot (x - y)] is small for any x, y ∈ S.
The remainder of this section is devoted to proving the first of these three facts.
Lemma 1. For all i ≤ j,

    \sum_{t=i}^{j} c_t \cdot M(c_{i..t}) \le c_{i..j} \cdot M(c_{i..j}) .
Proof. The proof is by induction on j − i. When j − i = 0 the lemma is trivial since both sides of the inequality are equal to c_i · M(c_i). For j − i ≥ 1, the induction hypothesis (together with the definition of M(c_{i..j-1})) yields the inequality

    \sum_{t=i}^{j-1} c_t \cdot M(c_{i..t}) \le c_{i..j-1} \cdot M(c_{i..j-1}) \le c_{i..j-1} \cdot M(c_{i..j}) .

Adding c_j · M(c_{i..j}) to both sides, we obtain

    \sum_{t=i}^{j} c_t \cdot M(c_{i..t}) \le c_{i..j} \cdot M(c_{i..j})

as desired.
Corollary 2. BPL(1..T) ≤ OPT(1..T) + E[c_0 · (M(c_{1..T}) − M(c_0))].
Proof. Applying Lemma 1 with i = 0, j = T leads to the inequality

    \sum_{t=0}^{T} c_t \cdot M(c_{0..t}) \le c_{0..T} \cdot M(c_{0..T}) \le c_{0..T} \cdot M(c_{1..T}) .

Subtracting c_0 · M(c_0) from both sides, we obtain

    \sum_{t=1}^{T} c_t \cdot M(c_{0..t}) \le c_0 \cdot M(c_{1..T}) + c_{1..T} \cdot M(c_{1..T}) - c_0 \cdot M(c_0)
                                         = OPT(1..T) + c_0 \cdot [M(c_{1..T}) - M(c_0)] .

The corollary follows by taking the expected value of both sides.
1.4   Comparing BPL and FPL

As we said, the comparison of BPL and FPL will be accomplished by showing that c_{0..t-1} and c_{0..t} have very similar distributions. For this, we need a measure of similarity of distributions. Actually, two measures of similarity will be useful, because we are trying to prove both a multiplicative bound (1) and an additive bound (2).
Definition 1. For two distributions p, q on R^n, their multiplicative distance (denoted by d_×(p, q)) is the minimum δ such that their density functions satisfy

    dp(x) \le (1 + \delta) \, dq(x)
    dq(x) \le (1 + \delta) \, dp(x)

for all x. Their additive distance (denoted by d_+(p, q)) is the minimum δ such that there exists a probability distribution ν on pairs (x, y) ∈ R^n × R^n such that

    \nu(x \ne y) \le \delta
Lemma 3. Let p, q be two distributions on R^n.

• For any f : S → R satisfying |f(x) − f(y)| ≤ R for all x, y ∈ S,

      E_{c \leftarrow p}[f(M(c))] \le E_{c \leftarrow q}[f(M(c))] + R \, d_+(p, q) .

• For any f : S → R_+,

      E_{c \leftarrow p}[f(M(c))] \le (1 + d_\times(p, q)) \, E_{c \leftarrow q}[f(M(c))] .
Proof. For the first part, let (c, c') be sampled from a coupling ν whose marginals are p, q (respectively) and such that ν(c ≠ c') ≤ δ, where δ = d_+(p, q). Then

    E_{c \leftarrow p}[f(M(c))] - E_{c \leftarrow q}[f(M(c))] = E[f(M(c)) - f(M(c'))]
        = E[f(M(c)) - f(M(c')) \mid c = c'] \, \nu(c = c')
          + E[f(M(c)) - f(M(c')) \mid c \ne c'] \, \nu(c \ne c') .

The first term on the right side is 0, and the second term is bounded above by Rδ, since f(x) − f(y) ≤ R for all x, y ∈ S and ν(c ≠ c') ≤ δ.

For the second part, with δ = d_×(p, q),

    E_{c \leftarrow p}[f(M(c))] = \int f(M(c)) \, dp(c) \le \int f(M(c)) (1 + \delta) \, dq(c) = (1 + \delta) \, E_{c \leftarrow q}[f(M(c))] .
Corollary 4. Suppose d_+(c_0, c + c_0) ≤ δ for all c ∈ {c_1, ..., c_T}. Then

    FPL(1..T) \le BPL(1..T) + \delta R T .

Suppose d_×(c_0, c + c_0) ≤ δ for all c ∈ {c_1, ..., c_T}. Moreover suppose c · x ≥ 0 for all c ∈ {c_1, ..., c_T} and x ∈ S. Then

    FPL(1..T) \le (1 + \delta) \cdot BPL(1..T) .

Proof. Recall that FPL(1..T) is the expected value of \sum_{t=1}^{T} c_t \cdot M(c_{0..t-1}) and BPL(1..T) is the expected value of \sum_{t=1}^{T} c_t \cdot M(c_{0..t}). The corollary follows by using Lemma 3, with f(x) = c_t · x, to compare the sums term-by-term.
Lemma 5. Let c be any vector such that \|c\|_1 \le A. If c_0 is a random sample from distribution α, then

    d_+(c_0, c + c_0) \le \varepsilon A .

If c_0 is a random sample from distribution µ,

    d_\times(c_0, c + c_0) \le e^{\varepsilon A} - 1 .

Proof. Assume without loss of generality that every component of the vector c is non-negative. Define a random vector c_0'' coordinate-by-coordinate by

    (c_0'')_i = \begin{cases} (c_0)_i & \text{if } |(c_0)_i - c_i| \le 1/\varepsilon \\ 1/\varepsilon + c_i - (c_0)_i & \text{otherwise.} \end{cases}

It is an exercise to check that the random vector c_0'' has the same distribution as c + c_0 (the second case reflects the out-of-range part of the sample into the shifted interval [c_i − 1/ε, c_i + 1/ε]), and that Pr(c_0 ≠ c_0'') ≤ εA. This proves the first part of the lemma.
The second part of the lemma follows from the calculation

    d\mu(x + c) = \left( \frac{\varepsilon}{2} \right)^n e^{-\varepsilon \|x + c\|_1}
                \le \left( \frac{\varepsilon}{2} \right)^n e^{-\varepsilon \|x\|_1 + \varepsilon \|c\|_1}
                \le d\mu(x) \cdot e^{\varepsilon A} .
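The coordinatewise coupling from the first part is easy to sample, which makes the “exercise” straightforward to check empirically; a sketch, assuming the coordinates of c are non-negative:

```python
import random

def coupled_pair(c, eps):
    """Sample (c0, c0pp) coordinatewise: c0 ~ alpha, and c0pp (which is
    distributed as c + alpha) agrees with c0 except on an event of
    probability at most eps * ||c||_1."""
    c0, c0pp = [], []
    for ci in c:                              # assumes ci >= 0
        y = random.uniform(-1.0 / eps, 1.0 / eps)
        c0.append(y)
        if abs(y - ci) <= 1.0 / eps:
            c0pp.append(y)                    # coupled: same value
        else:
            c0pp.append(1.0 / eps + ci - y)   # reflect into the shifted box
    return c0, c0pp
```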
1.5   Bounding the expectation of c_0 · (x − y)

Lemma 6. If \|x - y\|_1 \le D and E[\|c_0\|_\infty] \le M, then

    E[c_0 \cdot (x - y)] \le DM .
Proof. The inequality \|v\|_\infty \|w\|_1 \ge v \cdot w holds for all vectors v, w, because

    v \cdot w = \sum_i v_i w_i \le \sum_i |v_i| |w_i| \le \sum_i \|v\|_\infty |w_i| = \|v\|_\infty \|w\|_1 .

The lemma follows by applying this inequality with v = c_0 and w = x − y.
Lemma 7. If c_0 is sampled from distribution α,

    E[\|c_0\|_\infty] \le \frac{1}{\varepsilon} .

If c_0 is sampled from distribution µ,

    E[\|c_0\|_\infty] \le O\left( \frac{\log n}{\varepsilon} \right)

provided that n ≥ 3.
Proof. The first statement is obvious. The second statement holds because the absolute values of the coordinates of c_0 are independent exponentially distributed random variables with mean 1/ε. By memorylessness, such a random variable y satisfies

    E(y - r \mid y > r) = \frac{1}{\varepsilon}

for all r > 0. Letting y_i = |(c_0)_i|, we have

    E\left( y_i - \frac{\ln(n)}{\varepsilon} \,\Big|\, y_i > \frac{\ln(n)}{\varepsilon} \right) = \frac{1}{\varepsilon} .

It follows that

    E(\|c_0\|_\infty) = E(\max_i y_i)
        = \frac{\ln(n)}{\varepsilon} + E\left( \max_i \left( y_i - \frac{\ln(n)}{\varepsilon} \right) \right)
        \le \frac{\ln(n)}{\varepsilon} + \sum_i E\left( \max\left\{ y_i - \frac{\ln(n)}{\varepsilon}, \, 0 \right\} \right)
        = \frac{\ln(n)}{\varepsilon} + \sum_i \Pr\left( y_i > \frac{\ln(n)}{\varepsilon} \right) E\left( y_i - \frac{\ln(n)}{\varepsilon} \,\Big|\, y_i > \frac{\ln(n)}{\varepsilon} \right)
        = \ln(n)/\varepsilon + n \cdot (1/n) \cdot (1/\varepsilon)
        \le 2 \ln(n)/\varepsilon

assuming n ≥ 3, since \Pr(y_i > \ln(n)/\varepsilon) = e^{-\ln n} = 1/n.
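A quick Monte Carlo sanity check of this bound, reusing the sample_mu sketch from Section 1.2:

```python
import math
import statistics

n, eps, trials = 50, 0.1, 5000
maxima = [max(abs(ci) for ci in sample_mu(n, eps)) for _ in range(trials)]
print(statistics.mean(maxima), 2 * math.log(n) / eps)
# the empirical mean of ||c_0||_inf should fall below 2 ln(n) / eps
```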
1.6   Putting it all together

For FPL_α, with ε = \sqrt{D/(RAT)}, Lemma 5 gives δ = εA in Corollary 4, and Corollary 2 together with Lemmas 6 and 7 bounds BPL_α, so

    FPL_α(1..T) \le BPL_α(1..T) + \varepsilon RAT
                \le OPT(1..T) + \varepsilon RAT + \frac{D}{\varepsilon}
                = OPT(1..T) + O(\sqrt{DRAT}) .

For FPL_µ (assuming, as in Corollary 4, that c_t · x ≥ 0 for all t and all x ∈ S), with any ε ≤ 1/A, we have

    FPL_µ(1..T) \le e^{\varepsilon A} \, BPL_µ(1..T)
                \le (1 + O(\varepsilon A)) \, OPT(1..T) + O\left( \frac{D \ln(n)}{\varepsilon} \right) .
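As a toy illustration, here is a hypothetical end-to-end run of FPL_α on a four-edge routing instance, tying together the sketches above; the constants D, R, A are worked out by hand for this particular instance.

```python
import math

T = 200
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]    # two disjoint routes 0 -> 3
M = dag_path_oracle(4, edges, source=0, sink=3)

# The two paths differ in all four edges, so D = 4; each cost vector
# below has l1-norm 3, so A = 3; the per-step cost gap is at most 1.
D, R, A = 4.0, 1.0, 3.0
eps = math.sqrt(D / (R * A * T))

costs = [[0.5, 1.0, 0.5, 1.0] if t % 2 == 0 else [1.0, 0.5, 1.0, 0.5]
         for t in range(T)]
total = fpl(M, costs, lambda: sample_alpha(len(edges), eps))
print("FPL_alpha total cost:", total)
```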
1.7   Applying this to the best expert problem

We have seen that the best expert problem is a special case of online linear optimization. Here the cost vectors satisfy \|c\|_\infty \le 1, hence \|c\|_1 \le n, so applying the analysis above directly would lead to the bound

    FPL_µ(1..T) \le (1 + O(\varepsilon n)) \, OPT(1..T) + O\left( \frac{\ln(n)}{\varepsilon} \right) .
By comparison, we know that Hedge satisfies a similar bound but with a factor of 1 + O(ε) instead of 1 + O(εn). It turns out that we can prove that FPL_µ satisfies the same bound (up to constant factors) by a clever trick. Replace the sequence c_1, c_2, ..., c_T with a new sequence c'_1, c'_2, ..., c'_{nT} by expanding each vector c_i into n consecutive vectors (c_i)_1 e_1, (c_i)_2 e_2, ..., (c_i)_n e_n, where e_k denotes the k-th standard basis vector. Note that the cost vectors in the new sequence have ℓ1-norms bounded by 1, so

    FPL_µ(1..nT) \le (1 + O(\varepsilon)) \, OPT(1..nT) + O\left( \frac{\ln(n)}{\varepsilon} \right) .
OPT(1..nT) on this new sequence is equal to OPT(1..T) on the old sequence. And

    E(FPL_µ(1..T)) \le E(FPL_µ(1..nT))

because the probability of incurring cost (c_i)_k at time (i − 1)n + k, when the cost vector is (c_i)_k e_k, is at least as great as the probability that the algorithm incurs cost (c_i)_k at time i in the original sequence. (The total past cost of choice k, relative to the alternatives, only looks better in the new sequence, because some of the alternatives have already been hit with additional costs.)
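The expansion trick takes only a few lines; a sketch:

```python
def expand_costs(cost_vectors, n):
    """Replace each c_t by the n one-hot vectors (c_t)_1 e_1, ...,
    (c_t)_n e_n, so every vector in the new sequence has l1-norm <= 1
    (assuming the original coordinates are bounded by 1)."""
    expanded = []
    for c in cost_vectors:
        for k in range(n):
            e = [0.0] * n
            e[k] = c[k]
            expanded.append(e)
    return expanded
```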
2   Online convex optimization

Now suppose S ⊂ R^n is a bounded convex set, and suppose that we have a sequence of convex cost functions f_1, f_2, ..., f_T. In comparison with the online linear optimization problem studied in the preceding section, we are now making a more specific assumption about the set S (it is convex and bounded, not just bounded) and a less specific assumption about the functions f_t (they are convex, not necessarily linear). We will make the following additional assumptions about S and the functions f_t.

1. \|x - y\|_2 \le D for all x, y ∈ S.

2. \|\nabla f(x)\|_2 \le A for all f ∈ {f_1, f_2, ..., f_T} and all x ∈ S.
For any point z ∈ R^n, let P(z) denote the closest point of S to z, i.e.

    P(z) = \arg\min_{x \in S} \|x - z\|_2 .

We have seen in an earlier lecture that the convexity of S implies that the set \arg\min_{x \in S} \|x - z\|_2 consists of a single point, so the definition of P(z) is unambiguous.
Zinkevich’s algorithm produces a sequence of strategies x_1, x_2, ..., x_T using a “gradient descent” algorithm. There is a predefined sequence of decreasing step sizes η_1, η_2, ..., and the algorithm starts with an arbitrary strategy x_1 and updates its strategy according to the rule

    x_{t+1} = P(x_t - \eta_t \nabla f_t(x_t)) .
Note that this is a deterministic algorithm, in contrast to the primarily randomized learning algorithms we’ve seen up to this point. (The fact that we can get away with using a deterministic algorithm here is related to the fact that the strategy set S is convex, so instead of randomizing over a set of strategies we could simply play the strategy which is their weighted average.) Also note that we always move in the direction opposite the gradient of the most recent cost function. So the past history has no influence on the algorithm except that it influences the position of the current strategy vector x_t. It is somewhat surprising that the algorithm still accomplishes a non-trivial learning task, given that it keeps track of the past history in such an indirect way.
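Here is a minimal sketch of the algorithm in Python (the names are our own; grads supplies the gradient of each f_t, project implements the Euclidean projection P, and the step sizes are any decreasing sequence, such as the choice η_t = (D/A)\sqrt{1/t} analyzed below):

```python
def zinkevich(x1, grads, project, etas):
    """Projected online gradient descent:
    x_{t+1} = P(x_t - eta_t * grad f_t(x_t))."""
    x = list(x1)
    plays = []
    for grad_ft, eta in zip(grads, etas):
        plays.append(list(x))            # play x_t, then observe f_t
        g = grad_ft(x)                   # gradient of f_t at x_t
        x = project([xi - eta * gi for xi, gi in zip(x, g)])
    return plays

def project_box(lo, hi):
    """Euclidean projection onto the box [lo, hi]^n, a simple convex S."""
    return lambda z: [min(max(zi, lo), hi) for zi in z]
```

For example, with S = [0, 1]^n one would pass project_box(0.0, 1.0) as the projection.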
Theorem 8. For every x ∈ S, the sequence of strategies selected by Zinkevich’s algorithm satisfies

    \sum_{t=1}^{T} f_t(x_t) \le \sum_{t=1}^{T} f_t(x) + \frac{D^2}{\eta_T} + \frac{1}{2} A^2 \sum_{t=1}^{T} \eta_t .
Proof. The analysis of the algorithm is inspired by the following informal train of thought. Time steps when f_t(x_t) ≤ f_t(x) are no problem. In a step when f_t(x_t) > f_t(x), the gradient descent rule ensures that the algorithm will take a step which brings it closer to x. This suggests the following line of attack: define a potential function based on the distance from x, and show that every increase in the regret is offset by a corresponding decrease in the potential.
Consider the potential function Φ(y) = \|y - x\|_2^2. For any z we have Φ(P(z)) ≤ Φ(z). This is clear when z ∈ S since P(z) = z. Otherwise, it follows from the fact that the triangle with vertices z, P(z), x has an obtuse angle at P(z); this is a fact that we worked out when studying the existence of Nash equilibria and again during the proof of Blackwell’s approachability theorem.

Armed with this useful fact, let’s evaluate the change in potential from time t to t + 1.

    \Phi(x_{t+1}) - \Phi(x_t) \le \Phi(x_t - \eta_t \nabla f_t(x_t)) - \Phi(x_t)
        = \|(x_t - x) - \eta_t \nabla f_t(x_t)\|_2^2 - \|x_t - x\|_2^2
        = -2 \eta_t \nabla f_t(x_t) \cdot (x_t - x) + \eta_t^2 \|\nabla f_t(x_t)\|_2^2 .
By convexity of f_t, we have

    \nabla f_t(x_t) \cdot (x_t - x) \ge f_t(x_t) - f_t(x) .

Hence

    \Phi(x_{t+1}) - \Phi(x_t) \le -2 \eta_t [f_t(x_t) - f_t(x)] + A^2 \eta_t^2 .

Rearranging terms,

    f_t(x_t) - f_t(x) \le \frac{\Phi(x_t) - \Phi(x_{t+1})}{2 \eta_t} + \frac{1}{2} A^2 \eta_t .        (3)
We now want to manipulate things so that when we sum over t, the Φ(·) terms form a telescoping sum. Summing (3) over t and applying summation by parts to the first term on the right side, using 0 ≤ Φ(·) ≤ D^2 and the fact that the sequence η_1, η_2, ... is decreasing, we get

    \sum_{t=1}^{T} \frac{\Phi(x_t) - \Phi(x_{t+1})}{2 \eta_t}
        = \frac{\Phi(x_1)}{2 \eta_1} - \frac{\Phi(x_{T+1})}{2 \eta_T} + \sum_{t=2}^{T} \Phi(x_t) \left( \frac{1}{2 \eta_t} - \frac{1}{2 \eta_{t-1}} \right)
        \le \frac{D^2}{2 \eta_1} + \sum_{t=2}^{T} D^2 \left( \frac{1}{2 \eta_t} - \frac{1}{2 \eta_{t-1}} \right)
        = \frac{D^2}{2 \eta_T} .

Thus

    \sum_{t=1}^{T} [f_t(x_t) - f_t(x)] \le \frac{D^2}{\eta_T} + \frac{1}{2} A^2 \sum_{t=1}^{T} \eta_t .
Now we have to choose η_t to trade off the conflicting goals of making η_T large but making \sum_{t=1}^{T} \eta_t small. The optimum trade-off (up to constant factors) is achieved by setting η_t = \frac{D}{A} \sqrt{1/t}, which leads to

    \sum_{t=1}^{T} f_t(x_t) \le \sum_{t=1}^{T} f_t(x) + O(DA \sqrt{T}) .

2.1   Dealing with concept drift
One of the great features of Zinkevich’s algorithm is that it can deal with a limited
amount of “concept drift,” i.e. it does well not only against the benchmark of the best
fixed strategy, but against a benchmark with a limited amount of power to change
its strategy over time.
For a sequence z_1, z_2, ..., z_T ∈ S, let L(z_1, ..., z_T) = \sum_{t=1}^{T-1} \|z_{t+1} - z_t\|_2. We have the following theorem.
Theorem 9. Zinkevich’s algorithm satisfies the following bound for any sequence z_1, z_2, ..., z_T.

    \sum_{t=1}^{T} f_t(x_t) \le \sum_{t=1}^{T} f_t(z_t) + \frac{D^2}{\eta_T} + \frac{2 D L(z_1, ..., z_T)}{\eta_T} + \frac{A^2}{2} \sum_{t=1}^{T} \eta_t .
Proof. Define Φ(x, z) = \|x - z\|_2^2. The argument is parallel to the argument given above, but with a couple of extra steps.

    \Phi(x_{t+1}, z_{t+1}) - \Phi(x_t, z_t) = [\Phi(x_{t+1}, z_{t+1}) - \Phi(x_{t+1}, z_t)] + [\Phi(x_{t+1}, z_t) - \Phi(x_t, z_t)]
        \le [\Phi(x_{t+1}, z_{t+1}) - \Phi(x_{t+1}, z_t)] - 2 \eta_t [f_t(x_t) - f_t(z_t)] + A^2 \eta_t^2 .

The last line is derived from the line above it by using the technique from the proof of Theorem 8, with z_t in place of the variable labeled x in that proof.
To finish up, we must bound Φ(x_{t+1}, z_{t+1}) − Φ(x_{t+1}, z_t). The easiest way to do this is to let v = x_{t+1} − z_{t+1} and w = x_{t+1} − z_t, and observe that \|v\|_2, \|w\|_2 \le D, hence:

    \|v\|_2^2 - \|w\|_2^2 = (v + w) \cdot (v - w) \le \|v + w\|_2 \|v - w\|_2 \le 2D \|z_{t+1} - z_t\|_2 .
Hence

    \Phi(x_{t+1}, z_{t+1}) - \Phi(x_t, z_t) \le 2D \|z_{t+1} - z_t\|_2 - 2 \eta_t [f_t(x_t) - f_t(z_t)] + A^2 \eta_t^2 ,

and rearranging (then bounding D/η_t by 2D/η_T, using the decreasing step sizes),

    f_t(x_t) - f_t(z_t) \le \frac{D}{\eta_t} \|z_{t+1} - z_t\|_2 + \frac{\Phi(x_t, z_t) - \Phi(x_{t+1}, z_{t+1})}{2 \eta_t} + \frac{A^2}{2} \eta_t
        \le \frac{2D}{\eta_T} \|z_{t+1} - z_t\|_2 + \frac{\Phi(x_t, z_t) - \Phi(x_{t+1}, z_{t+1})}{2 \eta_t} + \frac{A^2}{2} \eta_t .

Summing over t and handling the Φ terms by summation by parts, exactly as in the proof of Theorem 8 (using 0 ≤ Φ(·, ·) ≤ D^2), we conclude

    \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(z_t) \le \frac{2D}{\eta_T} L(z_1, ..., z_T) + \frac{D^2}{\eta_T} + \frac{A^2}{2} \sum_{t=1}^{T} \eta_t .