
Weighted Last-Step Min-Max Algorithm with Improved
Sub-logarithmic Regret
Edward Moroshko and Koby Crammer
Department of Electrical Engineering,
The Technion, Haifa, Israel
[email protected], [email protected]
Abstract. In online learning the performance of an algorithm is typically compared to the performance of a fixed function from some class, with a quantity
called regret. Forster [4] proposed a last-step min-max algorithm which was
simpler than the algorithm of Vovk [12], yet with the same regret. In fact the
algorithm he analyzed assumed that the choices of the adversary are bounded,
which artificially yields only the two extreme cases. We fix this problem by weighing
the examples in such a way that the min-max problem is well defined, and
provide an analysis with logarithmic regret that may have a better multiplicative factor than both bounds of Forster [4] and Vovk [12]. We also derive a new bound
that may be sub-logarithmic, as a recent bound of Orabona et al. [9], but may have
a better multiplicative factor. Finally, we analyze the algorithm in a weak type of
non-stationary setting, and show a bound that is sublinear if the non-stationarity
is sublinear as well.
1 Introduction
We consider the online learning regression problem, in which a learning algorithm tries
to predict real numbers in a sequence of rounds given some side-information or inputs
xt . Real-world example applications for these algorithms are weather or stock-market
predictions. The goal of the algorithm is to have a small discrepancy between its predictions and the associated outcomes yt . This discrepancy is measured with a loss function, such as the square loss. It is common to evaluate algorithms by their regret, the
difference between the cumulative loss of an algorithm with the cumulative loss of any
function taken from some class.
Forster [4] proposed a last-step min-max algorithm for online regression that makes
a prediction assuming it is the last example to be observed, and the goal of the algorithm
is indeed to minimize the regret with respect to linear functions. The resulting optimization problem he obtained was convex in both the algorithm’s choice and the adversary’s
choice, which yielded an unbounded problem. Forster circumvented this problem by
assuming a bound Y over the choices of the adversary that should be known to the
algorithm, yet his analysis is for the version with no bound.
We propose a modified last-step min-max algorithm with weights over examples
that are controlled in a way that yields a concave-convex problem over the adversary's
choices and the algorithm's choices. We analyze our algorithm and show a logarithmic regret that may have a better multiplicative factor than the analysis of Forster.
N.H. Bshouty et al. (Eds.): ALT 2012, LNAI 7568, pp. 245–259, 2012.
© Springer-Verlag Berlin Heidelberg 2012
We derive
additional analysis that is logarithmic in the loss of the reference function, rather than
the number of rounds T . This behaviour was recently shown by Orabona et al. [9] for a
certain online gradient descent algorithm. Yet, their bound [9] has a multiplicative factor similar to that of Forster [4], while our bound has a potentially better multiplicative
factor, and it has the same dependence on the cumulative loss of the reference function as
Orabona et al. [9]. Additionally, our algorithm and analysis are entirely free of assuming
a bound Y or knowing its value.
Competing with the best single function might not suffice for some problems. In
many real-world applications, the true target function is not fixed, but may change from
time to time. We also bound the performance of our algorithm in a non-stationary environment, where we measure the complexity of the environment by the
total deviation of a collection of linear functions from some fixed reference point. We
show that our algorithm maintains an average loss close to that of the best sequence of
functions, as long as the total of this deviation is sublinear in the number of rounds T .
2 Problem Setting
We work in the online setting for regression evaluated with the squared loss. Online
algorithms work in rounds or iterations. On each iteration an online algorithm receives
an instance $x_t \in \mathbb{R}^d$ and predicts a real value $\hat{y}_t \in \mathbb{R}$; it then receives a label $y_t \in \mathbb{R}$,
possibly chosen by an adversary, suffers loss $\ell_t(\text{alg}) = \ell(y_t, \hat{y}_t) = (\hat{y}_t - y_t)^2$, updates
its prediction rule, and proceeds to the next round. The cumulative loss suffered by the
algorithm over $T$ iterations is $L_T(\text{alg}) = \sum_{t=1}^{T} \ell_t(\text{alg})$. The goal of the algorithm is to
perform well compared to any predictor from some function class.
A common choice is to compare the performance of an algorithm with respect to a
single function, or specifically a single linear function, $f(x) = x^\top u$, parameterized by
a vector $u \in \mathbb{R}^d$. Denote by $\ell_t(u) = \left( x_t^\top u - y_t \right)^2$ the instantaneous loss of a vector $u$,
and by $L_T(u) = \sum_{t=1}^{T} \ell_t(u)$. The regret with respect to $u$ is defined to be,
\[ R_T(u) = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - L_T(u) \,. \]
A desired goal of the algorithm is to have $R_T(u) = o(T)$, that is, the average loss
suffered by the algorithm will converge to the average loss of the best linear function $u$.
Below in Sec. 5 we will also consider an extension of this form of regret, and evaluate
the performance of an algorithm against some $T$-tuple of functions, $(u_1, \ldots, u_T) \in \left( \mathbb{R}^d \right)^T$,
\[ R_T(u_1, \ldots, u_T) = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - L_T(u_1, \ldots, u_T) \,, \quad \text{where } L_T(u_1, \ldots, u_T) = \sum_{t=1}^{T} \ell_t(u_t) \,. \]
Clearly, with no restriction on the $T$-tuple, any algorithm may suffer a
regret linear in $T$, as one can set $u_t = x_t \left( y_t / \|x_t\|^2 \right)$ and suffer zero quadratic loss in
all rounds. Thus, we restrict below the possible choices of the $T$-tuple either explicitly, or
implicitly via some penalty.
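To make these definitions concrete, the following small Python sketch (our own illustration; the synthetic data, the trivial all-zeros predictor, and all variable names are assumptions, not from the paper) computes the regret $R_T(u)$ against the best fixed linear function found in hindsight by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 3
X = rng.normal(size=(T, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # enforce ||x_t|| <= 1
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=T)

y_hat = np.zeros(T)                                # a trivial online predictor (u = 0)

L_alg = np.sum((y - y_hat) ** 2)                   # cumulative loss of the "algorithm"
u_star = np.linalg.lstsq(X, y, rcond=None)[0]      # best fixed linear function in hindsight
L_u = np.sum((y - X @ u_star) ** 2)                # L_T(u*)
regret = L_alg - L_u                               # R_T(u*)
```

Since the all-zeros predictor is itself a linear function ($u = 0$), the regret here is necessarily nonnegative; an algorithm achieving $R_T(u) = o(T)$ would close this gap on average.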
3 A Last Step Min-Max Algorithm
Our algorithm is derived based on a last-step min-max prediction, proposed by Forster
[4] and also Takimoto and Warmuth [10]. An algorithm following this approach outputs
Fig. 1. An illustration of the minmax objective function G(yT , ŷT ) (5). The black line is the value
of the objective as a function of yT for the optimal predictor ŷT . Left: Forster’s optimization
function (convex in yT ). Center: our optimization function (strictly concave in yT , case 1 in
Thm. 1). Right: our optimization function (invariant to yT , case 2 in Thm. 1).
the min-max prediction assuming the current iteration is the last one. The algorithm we
describe below is based on an extension of this notion. For this purpose we introduce a
weighted cumulative loss using positive input-dependent weights $\{a_t\}_{t=1}^{T}$,
\[ L^a_T(u) = \sum_{t=1}^{T} a_t \left( y_t - u^\top x_t \right)^2 , \qquad L^a_T(u_1, \ldots, u_T) = \sum_{t=1}^{T} a_t \left( y_t - u_t^\top x_t \right)^2 . \]
The exact values of the weights $a_t$ will be defined below.
Our variant of the last-step min-max algorithm predicts¹
\[ \hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_u \left( b \|u\|^2 + L^a_T(u) \right) \right] , \tag{1} \]
for some positive constant b > 0. We next compute the actual prediction based on the
optimal last step min-max solution. We start with additional notation,
\[ A_t = bI + \sum_{s=1}^{t} a_s x_s x_s^\top \in \mathbb{R}^{d \times d} \tag{2} \]
\[ b_t = \sum_{s=1}^{t} a_s y_s x_s \in \mathbb{R}^d . \tag{3} \]
The solution of the internal infimum over u is summarized in the following lemma,
Lemma 1. For all $t \geq 1$, the function $f(u) = b\|u\|^2 + \sum_{s=1}^{t} a_s \left( y_s - u^\top x_s \right)^2$ is
minimal at a unique point $u_t$ given by,
\[ u_t = A_t^{-1} b_t \quad \text{and} \quad f(u_t) = \sum_{s=1}^{t} a_s y_s^2 - b_t^\top A_t^{-1} b_t . \tag{4} \]
¹ $y_T$ and $\hat{y}_T$ serve both as quantifiers (over the min and max operators, respectively) and as
the optimal values of this optimization problem.
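Lemma 1 is the familiar closed form of weighted ridge regression. The following sketch (ours; the random data and weights are arbitrary stand-ins) checks the minimizer $u_t = A_t^{-1} b_t$ and the minimal value (4) numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
d, t, b = 3, 10, 2.0
xs = rng.normal(size=(t, d))
ys = rng.normal(size=t)
a = 1.0 + rng.random(t)                     # arbitrary positive weights a_s

# A_t and b_t as in (2), (3)
A = b * np.eye(d) + sum(a_s * np.outer(x, x) for a_s, x in zip(a, xs))
bv = sum(a_s * y_s * x for a_s, y_s, x in zip(a, ys, xs))

def f(u):
    """b ||u||^2 + sum_s a_s (y_s - u . x_s)^2, the objective of Lemma 1."""
    return b * (u @ u) + np.sum(a * (ys - xs @ u) ** 2)

u_t = np.linalg.solve(A, bv)                                  # claimed minimizer A_t^{-1} b_t
f_min = np.sum(a * ys ** 2) - bv @ np.linalg.solve(A, bv)     # claimed minimum, eq. (4)
```

Perturbing `u_t` in any direction should only increase `f`, consistent with the claimed unique minimum.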
The proof is similar to the proof of Lemma 1 by Forster [4]. Substituting (4) back in (1)
we get the following form of the min-max problem,
\[ \min_{\hat{y}_T} \max_{y_T} G(y_T, \hat{y}_T) \quad \text{for} \quad G(y_T, \hat{y}_T) = \alpha(a_T) y_T^2 + 2\beta(a_T, \hat{y}_T) y_T + \hat{y}_T^2 , \tag{5} \]
for some functions $\alpha(a_T)$ and $\beta(a_T, \hat{y}_T)$. Clearly, for this problem to be well defined
the function $G$ should be convex in $\hat{y}_T$ and concave in $y_T$.
A previous choice, proposed by Forster [4], is to have uniform weights and set $a_T = 1$,
which for the particular function $\alpha(a_T)$ yields $\alpha(a_T) > 0$. Thus, $G(y_T, \hat{y}_T)$ is a
convex function of $y_T$, implying that the optimal value of $G$ is not bounded from above.
Forster [4] addressed this problem by restricting $y_T$ to belong to a predefined interval
$[-Y, Y]$, known also to the learner. As a consequence, the adversary's optimal prediction
is in fact either $y_T = Y$ or $y_T = -Y$, which in turn yields an optimal predictor
that is clipped at this bound, $\hat{y}_T = \mathrm{clip}\left( b_{T-1}^\top A_T^{-1} x_T , Y \right)$, where for $y > 0$ we define
$\mathrm{clip}(x, y) = x$ if $|x| \leq y$ and $\mathrm{clip}(x, y) = y \, \mathrm{sign}(x)$ otherwise.
This phenomenon is illustrated in the left panel of Fig. 1 (best viewed in color). For
the minmax optimization function defined by Forster [4], for a constant value of ŷT ,
the function is convex in yT , and the adversary would achieve a maximal value at the
boundary of the feasible values of yT interval. That is, either yT = Y or yT = −Y , as
indicated by the two magenta lines at yT = ±10. The optimal predictor ŷT is achieved
somewhere along the lines yT = Y or yT = −Y .
We propose an alternative approach that makes the min-max optimal solution bounded
by appropriately setting the weight $a_T$ such that $G(y_T, \hat{y}_T)$ is concave in $y_T$ for a constant $\hat{y}_T$. We explicitly consider two cases. First, set $a_T$ such that $G(y_T, \hat{y}_T)$ is strictly
concave in $y_T$, and thus attains a single maximum with no need to artificially restrict
the value of $y_T$. In this case the optimal predictor $\hat{y}_T$ is achieved at the unique saddle point, as illustrated in the center panel of Fig. 1. A second case is to set $a_T$ such
that α(aT ) = 0 and the minmax function G(yT , ŷT ) becomes linear in yT . Here, the
optimal prediction is achieved by choosing ŷT such that β(aT , ŷT ) = 0 which turns
G(yT , ŷT ) to be invariant to yT , as illustrated in the right panel of Fig. 1.
Equipped with Lemma 1 we develop the optimal solution of the min-max predictor,
summarized in the following theorem.
Theorem 1. Assume that $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T \leq 0$. Then the optimal prediction
for the last round $T$ is
\[ \hat{y}_T = b_{T-1}^\top A_{T-1}^{-1} x_T . \tag{6} \]
The proof of the theorem makes use of the following technical lemma.
Lemma 2. For all $t = 1, 2, \ldots, T$,
\[ a_t^2 x_t^\top A_t^{-1} x_t + 1 - a_t = \frac{1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} . \tag{7} \]
The proof is omitted due to lack of space. We now turn to prove the theorem.
Proof. The adversary can choose any $y_T$, thus the algorithm should predict $\hat{y}_T$ such
that the following quantity is minimal,
\[ \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + \sum_{t=1}^{T} a_t \left( y_t - u^\top x_t \right)^2 \right) \right] \stackrel{(4)}{=} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \sum_{t=1}^{T} a_t y_t^2 + b_T^\top A_T^{-1} b_T \right] .
\]
That is, we need to solve the following min-max problem,
\[ \min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \sum_{t=1}^{T} a_t y_t^2 + b_T^\top A_T^{-1} b_T \right] . \]
We use the following relation to re-write the optimization problem,
\[ b_T^\top A_T^{-1} b_T = b_{T-1}^\top A_T^{-1} b_{T-1} + 2 a_T y_T b_{T-1}^\top A_T^{-1} x_T + a_T^2 y_T^2 x_T^\top A_T^{-1} x_T . \tag{8} \]
Omitting all terms that do not depend on $y_T$ and $\hat{y}_T$,
\[ \min_{\hat{y}_T} \max_{y_T} \left[ (y_T - \hat{y}_T)^2 - a_T y_T^2 + 2 a_T y_T b_{T-1}^\top A_T^{-1} x_T + a_T^2 y_T^2 x_T^\top A_T^{-1} x_T \right] . \]
We manipulate the last problem into the form (5) using Lemma 2,
\[ \min_{\hat{y}_T} \max_{y_T} \left[ \frac{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T} \, y_T^2 + 2 y_T \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right) + \hat{y}_T^2 \right] , \tag{9} \]
where $\alpha(a_T) = \frac{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T}$ and $\beta(a_T, \hat{y}_T) = a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T$.
We consider two cases: (1) $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T < 0$ (corresponding to the
middle panel of Fig. 1), and (2) $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T = 0$ (corresponding to the
right panel of Fig. 1), starting with the first case,
\[ 1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T < 0 . \tag{10} \]
Denote the inner maximization problem by $f(y_T) = \frac{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T} y_T^2 + 2 y_T \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right) + \hat{y}_T^2$. This function is strictly concave with respect to
$y_T$ because of (10). Thus, it has a unique maximal value given by,
\[ f^{\max}(\hat{y}_T) = - \frac{a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T} \, \hat{y}_T^2
+ \frac{2 a_T b_{T-1}^\top A_T^{-1} x_T \left( 1 + a_T x_T^\top A_{T-1}^{-1} x_T \right)}{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T} \, \hat{y}_T
- \frac{\left( a_T b_{T-1}^\top A_T^{-1} x_T \right)^2 \left( 1 + a_T x_T^\top A_{T-1}^{-1} x_T \right)}{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T} . \]
Next, we solve $\min_{\hat{y}_T} f^{\max}(\hat{y}_T)$, which is strictly convex with respect to $\hat{y}_T$ because
of (10). Solving this problem we get the optimal last-step min-max predictor,
\[ \hat{y}_T = b_{T-1}^\top A_T^{-1} x_T \left( 1 + a_T x_T^\top A_{T-1}^{-1} x_T \right) . \tag{11} \]
We further derive the last equation. From (2) we have,
\[ A_T^{-1} a_T x_T x_T^\top A_{T-1}^{-1} = A_T^{-1} \left( A_T - A_{T-1} \right) A_{T-1}^{-1} = A_{T-1}^{-1} - A_T^{-1} . \tag{12} \]
Substituting (12) in (11) we have the following equality, as desired,
\[ \hat{y}_T = b_{T-1}^\top A_T^{-1} x_T + b_{T-1}^\top A_T^{-1} a_T x_T x_T^\top A_{T-1}^{-1} x_T = b_{T-1}^\top A_{T-1}^{-1} x_T . \tag{13} \]
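The identity (12) and the equality of (11) and (13) are easy to verify numerically. In this sketch (ours; the matrix `M` and the vector `b_prev` are arbitrary stand-ins for the quantities that would arise from earlier rounds):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.normal(size=(d, d))
A_prev = 2.0 * np.eye(d) + M @ M.T           # a valid A_{T-1}: b I plus a PSD sum
b_prev = rng.normal(size=d)                  # stands in for b_{T-1}
x = rng.normal(size=d)
x /= 2 * np.linalg.norm(x)                   # ||x_T|| = 1/2 <= 1

Ainv_prev = np.linalg.inv(A_prev)
a_T = 1.0 / (1.0 - x @ Ainv_prev @ x)        # the weight of case 2, eq. (14)
A_T = A_prev + a_T * np.outer(x, x)
Ainv_T = np.linalg.inv(A_T)

# identity (12): A_T^{-1} a_T x x^T A_{T-1}^{-1} = A_{T-1}^{-1} - A_T^{-1}
lhs_12 = Ainv_T @ (a_T * np.outer(x, x)) @ Ainv_prev

# predictor (11) versus its simplified form (13)
y_11 = (b_prev @ Ainv_T @ x) * (1.0 + a_T * (x @ Ainv_prev @ x))
y_13 = b_prev @ Ainv_prev @ x
```

Note the equality of (11) and (13) holds for any positive weight, not only the choice (14) used here.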
We now move to the second case, for which $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T = 0$, which is
written equivalently as,
\[ a_T = 1 / \left( 1 - x_T^\top A_{T-1}^{-1} x_T \right) . \tag{14} \]
Substituting (14) in (9) we get, $\min_{\hat{y}_T} \max_{y_T} \left[ 2 y_T \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right) + \hat{y}_T^2 \right]$.
For $\hat{y}_T \neq a_T b_{T-1}^\top A_T^{-1} x_T$, the value of the optimization problem is not bounded, as
the adversary may choose $y_T = z^2 \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right)$ for $z \to \infty$. Thus, the
optimal last-step min-max prediction is to set $\hat{y}_T = a_T b_{T-1}^\top A_T^{-1} x_T$. Substituting $a_T =
1 + a_T x_T^\top A_{T-1}^{-1} x_T$ and following the derivation from (11) to (13) above yields the
desired identity.
We conclude by noting that although we did not restrict the form of the predictor
$\hat{y}_T$, it turns out to be a linear predictor defined by $\hat{y}_T = x_T^\top w_{T-1}$ for $w_{T-1} = A_{T-1}^{-1} b_{T-1}$. In other words, the functional form of the optimal predictor is the same
as the form of the comparison function class - linear functions in our case. We call
the algorithm (defined using (2), (3) and (6)) WEMM, for weighted min-max prediction.
WEMM can also be seen as an incremental off-line algorithm [1], or follow-the-leader, on
a weighted sequence. The prediction $\hat{y}_T = x_T^\top w_{T-1}$ is with a model that is optimal
over a prefix of length $T - 1$. The prediction of the optimal predictor defined in (4) is
$x_T^\top u_{T-1} = x_T^\top A_{T-1}^{-1} b_{T-1} = \hat{y}_T$, where $\hat{y}_T$ was defined in (6).
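As a concrete illustration, here is a minimal Python sketch of WEMM (our own rendering, not the authors' code), using the weight choice $a_t = 1/(1 - x_t^\top A_{t-1}^{-1} x_t)$ of the second case; with $b > 1$ and $\|x_t\| \leq 1$ the denominator stays bounded away from zero:

```python
import numpy as np

def wemm(X, y, b=2.0):
    """One pass of the weighted last-step min-max (WEMM) predictor.

    X: (T, d) array of inputs with ||x_t|| <= 1;  y: (T,) labels;  b > 1.
    """
    d = X.shape[1]
    A = b * np.eye(d)                 # A_0 = b I, as in (2)
    bv = np.zeros(d)                  # b_0 = 0, as in (3)
    preds = []
    for x_t, y_t in zip(X, y):
        Ainv = np.linalg.inv(A)
        preds.append(bv @ Ainv @ x_t)            # prediction (6): b_{t-1}^T A_{t-1}^{-1} x_t
        a_t = 1.0 / (1.0 - x_t @ Ainv @ x_t)     # weight choice (14)
        A = A + a_t * np.outer(x_t, x_t)
        bv = bv + a_t * y_t * x_t
    return np.array(preds)

# demo on noiseless linear data, where L_T(u) = 0 for the generating u
rng = np.random.default_rng(3)
T, d = 100, 3
X = rng.normal(size=(T, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
u = np.array([1.0, -1.0, 0.5])
y = X @ u
preds = wemm(X, y, b=2.0)
total_loss = np.sum((y - preds) ** 2)
```

On this noiseless data, Thm. 2 of the next section predicts $L_T(\text{alg}) \leq b\|u\|^2 + L^a_T(u) = b\|u\|^2$, since every $\ell_t(u)$ is zero; here $b\|u\|^2 = 4.5$, and the demo's total loss indeed stays below it.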
4 Analysis
We analyze the algorithm in two steps. First, in Thm. 2 we show that the algorithm
suffers a constant regret compared with the optimal weight vector $u$ evaluated using
the weighted loss $L^a(u)$. Second, in Thm. 3 and Thm. 4 we show that the difference
between the weighted loss $L^a(u)$ and the true loss $L(u)$ is only logarithmic in $T$ or in $\sum_t \ell_t(u)$.
Theorem 2. Assume $1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t \leq 0$ for $t = 1 \ldots T$ (which is satisfied by our
choice later). Then, the loss of the last-step min-max predictor (6), $\hat{y}_t = b_{t-1}^\top A_{t-1}^{-1} x_t$
for $t = 1 \ldots T$, is upper bounded by,
\[ L_T(\text{alg}) \leq \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + L^a_T(u) \right) . \]
Furthermore, if $1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t = 0$, then the last inequality is in fact an equality.
Proof Sketch: Long algebraic manipulation given in Sec. A yields,
\[ \ell_t(\text{alg}) + \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + L^a_{t-1}(u) \right) - \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + L^a_t(u) \right) = \frac{1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} \left( y_t - \hat{y}_t \right)^2 \leq 0 . \]
Summing over $t$ gives the desired bound.
Next we decompose the weighted loss $L^a_T(u)$ into a sum of the actual loss $L_T(u)$ and
a logarithmic term. We give two bounds - one is logarithmic in $T$ (Thm. 3), and the
second is logarithmic in $\sum_t \ell_t(u)$ (Thm. 4). We use the following notation for the loss
suffered by $u$ over the worst example,
\[ S = S(u) = \sup_{1 \leq t \leq T} \ell_t(u) , \tag{15} \]
where clearly $S$ depends explicitly on $u$, which is omitted for simplicity. We now turn
to state our first result,
that at =
Theorem 3. Assume xt ≤ 1 for t = 1 . . . T and b > 1. Assume
1
1
b
a
.
for
t
=
1
.
.
.
T
.
Then
L
(u)
≤
L
(u)
+
S
ln
A
T
T
b−1
b T
1−x A−1 x
t
t−1
t
The proof follows similar steps to Forster [4]. A detailed proof is given in Sec. B.
Proof Sketch: We decompose the weighted loss,
\[ L^a_T(u) = L_T(u) + \sum_t (a_t - 1) \ell_t(u) \leq L_T(u) + S \sum_t (a_t - 1) . \tag{16} \]
From the definition of $a_t$ we have $a_t - 1 = a_t^2 x_t^\top A_t^{-1} x_t \leq \frac{b}{b-1} a_t x_t^\top A_t^{-1} x_t$ (see
(30)). Finally, following similar steps to Forster [4] we have,
\[ \sum_{t=1}^{T} a_t x_t^\top A_t^{-1} x_t \leq \ln \left| \tfrac{1}{b} A_T \right| . \]
Next we show a bound that may be sub-logarithmic if the comparison vector $u$ suffers a
sub-linear amount of loss. Such a bound was previously proposed by Orabona et al. [9].
We defer the discussion of the bound until after the proof below.
Theorem 4. Assume $\|x_t\| \leq 1$ for $t = 1 \ldots T$, and $b > 1$. Assume further that
\[ a_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t} \tag{17} \]
for $t = 1 \ldots T$. Then,
\[ L^a_T(u) \leq L_T(u) + \frac{b}{b-1} \, S d \left( 1 + \ln \left( 1 + \frac{L_T(u)}{S d} \right) \right) . \tag{18} \]
We prove the theorem with a refined bound on the sum $\sum_t (a_t - 1) \ell_t(u)$ of (16), using
the following two lemmas. In Thm. 3 we bound the loss of all examples with $S$ and
then bound the remaining term. Here, instead, we show a relation to a subsequence,
"pretending" that each of its examples suffers a loss $S$, yet with the same cumulative loss,
yielding an effectively shorter sequence, which we then bound. In the next lemma we
show how to find this subsequence, and in the following one we bound the performance.
Lemma 3. Let $I \subset \{1 \ldots T\}$ be the indices of the $T' = \left\lceil \sum_{t=1}^{T} \ell_t / S \right\rceil$ largest elements
of $a_t$, that is, $|I| = T'$ and $\min_{t \in I} a_t \geq a_\tau$ for all $\tau \in \{1 \ldots T\} \setminus I$. Then,
\[ \sum_{t=1}^{T} \ell_t(u) (a_t - 1) \leq S \sum_{t \in I} (a_t - 1) . \]
Proof. For a vector $v \in \mathbb{R}^T$ define by $I(v)$ the set of indices of the $T'$ maximal
absolute-valued elements of $v$, and define $f(v) = \sum_{t \in I(v)} |v_t|$. The function $f(v)$ is
a norm [3] with a dual norm $g(u) = \max \left( \|u\|_\infty , \frac{1}{T'} \|u\|_1 \right)$. From the property of dual
norms we have $v \cdot u \leq f(v) g(u)$. Applying this inequality to $v = (a_1 - 1, \ldots, a_T - 1)$
and $u = (\ell_1, \ldots, \ell_T)$ we get, $\sum_{t=1}^{T} \ell_t(u)(a_t - 1) \leq \max \left( S , \frac{\sum_{t=1}^{T} \ell_t}{T'} \right) \sum_{t \in I} (a_t - 1)$.
Combining with $S T' \geq \sum_{t=1}^{T} \ell_t$ completes the proof.
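The norm/dual-norm pair used in this proof can be checked numerically. In this sketch (ours; the vectors are random stand-ins, with `Tp` playing the role of the subset size $|I|$), `f_v` sums the `Tp` largest absolute entries and `g_w` is the claimed dual norm:

```python
import numpy as np

rng = np.random.default_rng(4)
T, Tp = 50, 7
v = rng.normal(size=T)              # stands in for (a_1 - 1, ..., a_T - 1)
w = rng.normal(size=T)              # stands in for the losses (l_1, ..., l_T)

f_v = np.sort(np.abs(v))[-Tp:].sum()               # f(v): sum of the Tp largest |v_t|
g_w = max(np.abs(w).max(), np.abs(w).sum() / Tp)   # g(w) = max(||w||_inf, ||w||_1 / Tp)
```

The Hölder-type inequality $v \cdot w \leq f(v) g(w)$ should hold for any such pair, since the unit ball of $g$ is $\{w : \|w\|_\infty \leq 1\} \cap \{w : \|w\|_1 \leq T'\}$ scaled appropriately.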
Note that the quantity $\sum_{t \in I} a_t$ is based only on $T'$ examples, yet was generated using
all $T$ examples. In fact, by running the algorithm with only these $T'$ examples the corresponding sum cannot get smaller. Specifically, assume the algorithm is run with inputs
$(x_1, y_1) \ldots (x_T, y_T)$ and generated a corresponding sequence $(a_1 \ldots a_T)$. Let $I$ be the set
of indices with maximal values of $a_t$ as before. Assume the algorithm is run with the
subsequence of examples from $I$ (in the same order) and generated $\alpha_1 \ldots \alpha_T$ (where
we set $\alpha_t = 0$ for $t \notin I$). Then $\alpha_t \geq a_t$ for all $t \in I$. This statement follows from
(2), from which we get that the matrix $A_t$ is monotonically increasing in $t$. Thus, by
removing examples we get another, smaller matrix, which leads to a larger value of $\alpha_t$.
We continue the analysis with a sequence of length $T'$ rather than a subsequence of
the original sequence of length $T$ being analyzed. The next lemma upper bounds the
sum $\sum_t a_t$ over $\tau$ inputs with another sum of the same length, yet using an orthonormal set
of vectors of size $d$.
Lemma 4. Let $x_1 \ldots x_\tau$ be any $\tau$ inputs with unit norm. Assume the algorithm performs updates using (17) for some $A_0$, resulting in a sequence $a_1 \ldots a_\tau$. Let $E =
\{v_1 \ldots v_d\} \subset \mathbb{R}^d$ be an eigen-decomposition of $A_0$ with corresponding eigenvalues
$\lambda_1 \ldots \lambda_d$. Then there exists a sequence of indices $j_1 \ldots j_\tau$, where $j_i \in \{1 \ldots d\}$, such that
$\sum_t a_t \leq \sum_t \alpha_t$, where the $\alpha_t$ are generated using (17) on the sequence $v_{j_1} \ldots v_{j_\tau}$.
Additionally, let $n_s$ be the number of times eigenvector $v_s$ is used ($s = 1 \ldots d$), that is,
$n_s = |\{t : j_t = s\}|$ (and $\sum_s n_s = \tau$); then,
\[ \sum_t \alpha_t \leq \tau + \sum_{s=1}^{d} \sum_{r=1}^{n_s} \frac{1}{\lambda_s + r - 2} . \]
Proof. By induction over $\tau$. For $\tau = 1$ we want to upper bound $a_1 = 1/(1 - x_1^\top A_0^{-1} x_1)$,
which is maximized when $x_1 = v_d$, the eigenvector with minimal eigenvalue $\lambda_d$; in this
case we have $\alpha_1 = 1/(1 - 1/\lambda_d) = 1 + 1/(\lambda_d - 1)$, as desired.
Table 1. Comparison of regret bounds for online regression

Algorithm            Bound on Regret $R_T(u)$
Vovk [13]            $b\|u\|^2 + dY^2 \ln(1 + T/(db))$
Forster [4]          $b\|u\|^2 + dY^2 \ln(1 + T/(db))$
Orabona et al. [9]   $2\|u\|^2 + d(U+Y)^2 \ln\left(1 + \frac{\sum_t \ell_t(u)}{d(U+Y)^2}\right)$
Thm. 3               $b\|u\|^2 + Sd\,\frac{b}{b-1} \ln\left(1 + \frac{T}{d(b-1)}\right)$
Thm. 4               $b\|u\|^2 + Sd\,\frac{b}{b-1} \ln\left(1 + \frac{\sum_t \ell_t(u)}{Sd}\right)$
Next we assume the lemma holds for some $\tau - 1$ and show it for $\tau$. Let $x_1$ be
the first input, and let $\{\gamma_s\}$ and $\{u_s\}$ be the eigenvalues and eigenvectors of $A_1 =
A_0 + a_1 x_1 x_1^\top$. The induction assumption implies that $\sum_{t=2}^{\tau} \alpha_t \leq (\tau - 1) +
\sum_{s=1}^{d} \sum_{r=1}^{n_s} \frac{1}{\gamma_s + r - 2}$. From Theorem 8.1.8 of [6] we know that the eigenvalues of $A_1$
satisfy $\gamma_s = \lambda_s + m_s$ for some $m_s \geq 0$ and $\sum_s m_s = 1$. We thus conclude that
\[ \sum_t a_t \leq 1 + \frac{1}{\lambda_d - 1} + (\tau - 1) + \sum_{s=1}^{d} \sum_{r=1}^{n_s} \frac{1}{\lambda_s + m_s + r - 2} . \]
The last term is convex in $m_1 \ldots m_d$ and thus is maximized over a vertex of the simplex,
that is, when $m_u = 1$ for some $u$ and zero otherwise. In this case, the eigenvectors
$\{u_s\}$ of $A_1$ are in fact the eigenvectors $\{v_s\}$ of $A_0$, and the proof is completed.
Equipped with these lemmas we now prove Thm. 4.
Proof. Let $T' = \left\lceil \sum_{t=1}^{T} \ell_t / S \right\rceil$. Our starting point is the equality $L^a_T(u) = L_T(u) +
\sum_{t=1}^{T} \ell_t(u) (a_t - 1)$ stated in (16). From Lemma 3 we get,
\[ \sum_{t=1}^{T} \ell_t(u) (a_t - 1) \leq S \sum_{t \in I} (a_t - 1) \leq S \sum_t (\alpha_t - 1) , \tag{19} \]
where $I$ is the subset of $T'$ indices for which the $a_t$ are maximal, and the $\alpha_t$ are the resulting
coefficients computed with (17) using only the subsequence of examples $x_t$ with $t \in I$.
By definition $A_0 = bI$, and thus from Lemma 4 we further bound (19) with,
\[ \sum_{t=1}^{T} \ell_t(u) (a_t - 1) \leq S \sum_{s=1}^{d} \sum_{r=1}^{n_s} \frac{1}{b + r - 2} , \tag{20} \]
for some $n_s$ such that $\sum_s n_s = T'$. The last equation is maximized when all the counts
$n_s$ are about the same (as $d$ may not divide $T'$), and thus we further bound (20) with,
\[ \sum_{t=1}^{T} \ell_t(u) (a_t - 1) \leq S \sum_{s=1}^{d} \sum_{r=1}^{\lceil T'/d \rceil} \frac{1}{b + r - 2} \leq S d \sum_{r=1}^{\lceil T'/d \rceil} \frac{b}{b-1} \cdot \frac{1}{r} \leq S d \, \frac{b}{b-1} \left( 1 + \ln \frac{T'}{d} \right) \leq S d \, \frac{b}{b-1} \left( 1 + \ln \left( 1 + \frac{L_T(u)}{S d} \right) \right) , \]
which completes the proof.
It is instructive to compare bounds of similar algorithms, summarized in Table 1. Our
first bound of Thm. 3 is most similar to the bounds of Forster [4] and Vovk [13]. The
bound in the table is obtained by noting that log-det is a concave function of the eigenvalues of the matrix, upper bounded when all the eigenvalues are equal (with the same
trace). They have a multiplicative factor $Y^2$ of the logarithm, while we have the worst
loss of $u$ over all examples. Thus, our first bound is better on problems that are approximately linear, $y_t \approx u \cdot x_t$ for $t = 1, \ldots, T$, and $Y$ is large, and their bound is better if
$Y$ is small. Note that the analysis of Forster [4] assumes that the labels $y_t$ are bounded,
and formally the algorithm should know this bound, while we assume that the inputs
are bounded.
Our second bound of Thm. 4 is similar to the bound of Orabona et al. [9]. Both
bounds have potentially sub-logarithmic regret, as the cumulative loss $L(u)$ may be
sublinear in $T$. Yet, their bound has a multiplicative factor of $(U + Y)^2$, while our bound
has only the maximal loss $S$, which, as before, can be much smaller. Additionally, their
analysis assumes that both the inputs $x_t$ and the labels $y_t$ are bounded, while we only
assume that the inputs are bounded; furthermore, our algorithm does not need to
assume and know a compact set which contains $u$ ($\|u\| \leq U$), as opposed to their
algorithm.
5 Learning in Non-stationary Environment
In this section we present a generalization of the last-step min-max predictor (1) for
non-stationary problems. We define the predictor to be,
\[ \hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_{u_1, \ldots, u_T, \bar{u}} \left( b\|\bar{u}\|^2 + c V_m + L^a_T(u_1, \ldots, u_T) \right) \right] \tag{21} \]
for $V_m = \sum_{t=1}^{T} \|u_t - \bar{u}\|^2$, positive constants $b, c > 0$, and weights $a_t \geq 1$ for $1 \leq t \leq T$.
As mentioned above, we use an extended notion of function class, using different
vectors $u_t$ across time. We circumvent here the problem mentioned at the end of
Sec. 2, and restrict the adversary from choosing an arbitrary $T$-tuple $(u_1, \ldots, u_T)$ by introducing a reference weight-vector $\bar{u}$. Specifically, we replace the single-weight
cumulative loss $L^a_T(u)$ in (1) with a multi-weight cumulative loss $L^a_T(u_1, \ldots, u_T)$ in
(21); yet, we add the term $c V_m$ to (21), penalizing a $T$-tuple $(u_1, \ldots, u_T)$ whose elements $\{u_t\}$ are far from some single point $\bar{u}$. Intuitively, $V_m$ serves as a measure of
complexity of the $T$-tuple by measuring the deviation of its elements from some vector.
The new formulation of (21) clearly subsumes the formulation of (1), as if $u_1 =
\ldots = u_T = \bar{u} = u$, then (21) reduces to (1). We now show that, in fact, the two notions
of last-step min-max predictors are equivalent. The following lemma characterizes the
solution of the inner infimum of (21) over $(u_1, \ldots, u_T)$ for a fixed $\bar{u}$.
Lemma 5. For any $\bar{u} \in \mathbb{R}^d$, the function $J(u_1, \ldots, u_T) = b\|\bar{u}\|^2 +
c \sum_{t=1}^{T} \|u_t - \bar{u}\|^2 + \sum_{t=1}^{T} a_t \left( y_t - u_t^\top x_t \right)^2$ is minimal for
\[ u_t = \bar{u} + \frac{c^{-1}}{a_t^{-1} + c^{-1} \|x_t\|^2} \left( y_t - \bar{u}^\top x_t \right) x_t \quad \text{for } t = 1 \ldots T . \]
The minimal value of $J(u_1, \ldots, u_T)$ is given by
\[ J_{\min} = b\|\bar{u}\|^2 + \sum_{t=1}^{T} \frac{1}{a_t^{-1} + c^{-1} \|x_t\|^2} \left( y_t - \bar{u}^\top x_t \right)^2 . \tag{22} \]
The proof is omitted due to space constraints, and is obtained by setting the derivative
of the objective to zero. Substituting (22) in (21) we obtain the following form of the
last-step min-max predictor,
\[ \hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_{\bar{u} \in \mathbb{R}^d} \left( b\|\bar{u}\|^2 + \sum_{t=1}^{T} \frac{1}{a_t^{-1} + c^{-1} \|x_t\|^2} \left( y_t - x_t^\top \bar{u} \right)^2 \right) \right] . \tag{23} \]
Clearly, both equations (23) and (1) are equivalent when identifying,
\[ \tilde{a}_t = 1 / \left( a_t^{-1} + c^{-1} \|x_t\|^2 \right) . \tag{24} \]
Therefore, we can use the results of the previous sections. Specifically, if $1 +
\tilde{a}_T x_T^\top A_{T-1}^{-1} x_T - \tilde{a}_T \leq 0$, the optimal predictor developed in Thm. 1 for the stationary case is given by $\hat{y}_T = b_{T-1}^\top A_{T-1}^{-1} x_T$, where we replace (2) with $A_t = bI +
\sum_{s=1}^{t} \tilde{a}_s x_s x_s^\top = bI + \sum_{s=1}^{t} \frac{1}{a_s^{-1} + c^{-1}\|x_s\|^2} x_s x_s^\top$ and (3) with $b_t = \sum_{s=1}^{t} \tilde{a}_s y_s x_s =
\sum_{s=1}^{t} \frac{1}{a_s^{-1} + c^{-1}\|x_s\|^2} y_s x_s$. Although most of the analysis above holds for $1 +
\tilde{a}_t x_t^\top A_{t-1}^{-1} x_t - \tilde{a}_t \leq 0$, at the end of the day Thm. 3 assumed that this inequality holds
with equality. Substituting $\tilde{a}_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t}$ in (24) and solving for $a_t$ we obtain,
\[ a_t = 1 / \left( 1 - x_t^\top A_{t-1}^{-1} x_t - c^{-1} \|x_t\|^2 \right) . \tag{25} \]
The last-step min-max predictor (21) is convex if $a_t \geq 0$, which holds if $1/b + 1/c \leq 1$,
because $A_{t-1}^{-1} \preceq A_0^{-1} = (1/b) I$ and we assume that $\|x_t\|^2 \leq 1$.
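The matrix recursion of the non-stationary variant is the same as in the stationary case; only the analysis weight $a_t$ of (25) changes. The following sketch (ours; random unit-norm inputs, all names assumed) runs the recursion and checks that, whenever $1/b + 1/c \leq 1$, the weights of (25) are positive and consistent with the identification (24):

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, b, c = 3, 50, 4.0, 4.0        # 1/b + 1/c = 1/2 <= 1
X = rng.normal(size=(T, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

A = b * np.eye(d)
a_vals = []
for x_t in X:
    Ainv = np.linalg.inv(A)
    atil = 1.0 / (1.0 - x_t @ Ainv @ x_t)                      # effective weight, as in Thm. 3
    a_t = 1.0 / (1.0 - x_t @ Ainv @ x_t - (x_t @ x_t) / c)     # analysis weight, eq. (25)
    a_vals.append(a_t)
    A = A + atil * np.outer(x_t, x_t)                          # same recursion as eq. (2)
a_vals = np.array(a_vals)
```

The denominator in (25) is at least $1 - 1/b - 1/c$, so the weights stay bounded whenever the stated constraint on $b$ and $c$ holds.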
Let us state the analogous statements of Thm. 2 and Thm. 3. Substituting Lemma 5
in Thm. 2 we bound the cumulative loss of the algorithm with the weighted loss of any
T -tuple (u1 , . . . , uT ),
Corollary 1. Assume $\|x_t\| \leq 1$, $1 + \tilde{a}_t x_t^\top A_{t-1}^{-1} x_t - \tilde{a}_t \leq 0$ for $t = 1 \ldots T$, and
$1/b + 1/c \leq 1$. Then, the loss of the last-step min-max predictor, $\hat{y}_t = b_{t-1}^\top A_{t-1}^{-1} x_t$
for $t = 1 \ldots T$, is upper bounded by,
\[ L_T(\text{alg}) \leq \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + L^{\tilde{a}}_T(u) \right) = \inf_{u_1, \ldots, u_T, \bar{u}} \left( b\|\bar{u}\|^2 + c \sum_{t=1}^{T} \|u_t - \bar{u}\|^2 + L^a_T(u_1, \ldots, u_T) \right) . \]
Furthermore, if $1 + \tilde{a}_t x_t^\top A_{t-1}^{-1} x_t - \tilde{a}_t = 0$, then the last inequality is in fact an equality.
Next we relate the weighted cumulative loss $L^a_T(u_1, \ldots, u_T)$ to the loss itself,
$L_T(u_1, \ldots, u_T)$.
Corollary 2. Assume $\|x_t\| \leq 1$ for $t = 1 \ldots T$, $b > 1$ and $1/b + 1/c \leq 1$. Assume
additionally that $a_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t - c^{-1}\|x_t\|^2}$ as given in (25). Then
\[ L^a_T(u_1, \ldots, u_T) \leq L_T(u_1, \ldots, u_T) + \frac{b}{b-1} \, S \ln \left| \tfrac{1}{b} A_T \right| + T S \, \frac{1}{c(1 - b^{-1})^2 - (1 - b^{-1})} . \]
Proof. We start as in the proof of Thm. 3 and decompose the weighted loss, with
$\tilde{a}_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t}$ as identified via (24),
\[ L^a_T(u_1, \ldots, u_T) = L_T(u_1, \ldots, u_T) + \sum_t (a_t - 1) \ell_t(u_t) \leq L_T(u_1, \ldots, u_T) + S \sum_t (\tilde{a}_t - 1) + S \sum_t (a_t - \tilde{a}_t) . \tag{26} \]
We bound the sum of the third term,
\[ a_t - \tilde{a}_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t - c^{-1}\|x_t\|^2} - \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t} \leq \frac{1}{c \left( 1 - b^{-1} - c^{-1} \right) \left( 1 - b^{-1} \right)} = \frac{1}{c(1 - b^{-1})^2 - (1 - b^{-1})} . \tag{27} \]
Additionally, as in Thm. 3, the second term is bounded by $\frac{b}{b-1} S \ln \left| \tfrac{1}{b} A_T \right|$. Substituting this bound and (27) in (26) completes the proof.
Combining the last two corollaries yields the main result of this section.
Corollary 3. Under the conditions of Cor. 2, the cumulative loss of the last-step min-max
predictor is upper bounded by,
\[ L_T(\text{alg}) \leq \inf_{u_1, \ldots, u_T, \bar{u}} \left( b\|\bar{u}\|^2 + c V_m + L_T(u_1, \ldots, u_T) \right) + \frac{S b}{b-1} \ln \left| \tfrac{1}{b} A_T \right| + \frac{T S}{c(1 - b^{-1})^2 - (1 - b^{-1})} , \]
where $V_m$ is the deviation of $\{u_t\}$ from some fixed weight-vector. Additionally, setting
$c_V = \frac{b}{b-1} \left( 1 + \sqrt{ST/V_m} \right)$, minimizing the above bound over $c$,
\[ L_T(\text{alg}) \leq \inf_{u_1, \ldots, u_T, \bar{u}} \left( b\|\bar{u}\|^2 + L_T(u_1, \ldots, u_T) \right) + \frac{S b}{b-1} \ln \left| \tfrac{1}{b} A_T \right| + \frac{b}{b-1} \left( V_m + 2\sqrt{S T V_m} \right) . \]
A few comments. First, it is straightforward to verify that $c_V = \frac{b}{b-1} \left( 1 + \sqrt{ST/V_m} \right)$ satisfies
the constraint $1/b + 1/c_V \leq 1$. Second, this bound strictly generalizes the bound for
the stationary case, since Cor. 2 reduces to Thm. 3 when all the weight-vectors equal
each other, $u_1 = \ldots = u_T = \bar{u}$ (i.e. $V_m = 0$). Third, the constant $c$ (or $c_V$) is not used by
the algorithm, but only in the analysis. So there is no need to know the actual deviation
$V_m$ to tune the algorithm. In other words, the bound applies essentially to the same
last-step min-max predictor defined in Thm. 1. Finally, we can obtain a bound for the
non-stationary case based on Thm. 4 instead of Thm. 3 (omitted due to lack of space).
6 Related Work and Summary
The problem of predicting reals in an online manner has been studied for more than five
decades. Clearly we cannot cover all previous work here, and the reader is referred to
the encyclopedic book of Cesa-Bianchi and Lugosi [2] for a full survey.
An online version of the ridge regression algorithm in the worst-case setting was
proposed and analyzed by Foster [5]. A related algorithm, called the Aggregating Algorithm (AA), was studied by Vovk [12]. The recursive least squares (RLS) algorithm [7] is a similar
algorithm proposed for adaptive filtering. Both algorithms make use of second-order information, as they maintain a weight-vector and a covariance-like positive semi-definite
(PSD) matrix used to re-weight the input. The eigenvalues of this covariance-like matrix
grow with time $t$, a property which is used to prove logarithmic regret bounds. Orabona
et al. [9] showed that better-than-logarithmic regret bounds can be achieved when the total
loss of the best linear model is sublinear in $T$. We derive a similar bound, with a multiplicative
factor that depends on the worst loss of $u$, rather than on a bound $Y$ on the labels.
The derivation of our algorithm shares similarities with the work of Forster [4]. Both
algorithms are motivated from the last-step min-max predictor. Yet, the formulation
of Forster [4] yields a convex optimization for which the max operation over yt is not
bounded, and thus he used an artificial clipping operation to avoid unbounded solutions.
With a proper tuning of at and a weighted loss, we are able to obtain a problem that is
convex in ŷt and concave in yt , and thus well defined.
Most recent work is focused on the stationary setting. We also discuss a specific
weak notion of a non-stationary setting, in which a few weight-vectors can be used for
comparison and their total deviation is computed with respect to some single weight-vector. Recently, Vaits and Crammer [11] proposed an algorithm designed for non-stationary environments. Herbster and Warmuth [8] discussed general gradient descent
algorithms with projection of the weight-vector using the Bregman divergence, and
Zinkevich [14] developed an algorithm for online convex programming. They all use a
stronger notion of diversity between vectors, as their distance is measured between consecutive vectors (that is, drift that may end far from the starting point). Thus, the bounds
cannot be compared in general.
Summary: We proposed a modification of the last-step min-max algorithm [4] using
weights over examples, and showed how to choose these weights for the problem to be
well defined – convex – which enabled us to develop the last step min-max predictor,
without requiring the labels to be bounded. Our analysis bounds the regret with quantities that depend only on the loss of the competitor, with no need for any knowledge of
the problem. Our prediction algorithm was motivated from the last-step min-max predictor problem for the stationary setting, but we showed that the same algorithm can be used
to derive a bound for a class of non-stationary problems as well.
We plan to perform an extensive empirical study comparing the algorithm to other
algorithms. An interesting direction would be to extend the algorithm to general loss
functions rather than the squared loss, or to classification tasks.
Acknowledgements: The research is partially supported by an Israeli Science Foundation grant ISF-1567/10.
A Proof of Thm. 2
Proof. Using the Woodbury matrix identity we get
$$A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{\frac{1}{a_t} + x_t^\top A_{t-1}^{-1} x_t} \qquad (28)$$
therefore
$$A_t^{-1} x_t = A_{t-1}^{-1} x_t - \frac{A_{t-1}^{-1} x_t \, x_t^\top A_{t-1}^{-1} x_t}{\frac{1}{a_t} + x_t^\top A_{t-1}^{-1} x_t} = \frac{A_{t-1}^{-1} x_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} \,. \qquad (29)$$
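The rank-one updates (28) and (29) can be checked numerically; a minimal NumPy sketch, in which the dimension, the regularizer $b$ and the weight $a_t$ are arbitrary test values:

```python
import numpy as np

# Sanity check of the rank-one inverse updates (28) and (29) on random data;
# d, b and a_t below are arbitrary choices for the test, not values from the paper.
rng = np.random.default_rng(0)
d, b, a_t = 4, 2.0, 1.3
X_past = rng.normal(size=(3, d))
A_prev = b * np.eye(d) + X_past.T @ X_past    # a positive-definite A_{t-1}
x_t = rng.normal(size=d)

A_prev_inv = np.linalg.inv(A_prev)
A_t = A_prev + a_t * np.outer(x_t, x_t)       # A_t = A_{t-1} + a_t x_t x_t^T

# (28): Woodbury (Sherman-Morrison) update of the inverse
A_t_inv = A_prev_inv - (A_prev_inv @ np.outer(x_t, x_t) @ A_prev_inv) / (
    1.0 / a_t + x_t @ A_prev_inv @ x_t)
assert np.allclose(A_t_inv, np.linalg.inv(A_t))

# (29): A_t^{-1} x_t = A_{t-1}^{-1} x_t / (1 + a_t x_t^T A_{t-1}^{-1} x_t)
assert np.allclose(A_t_inv @ x_t,
                   (A_prev_inv @ x_t) / (1.0 + a_t * (x_t @ A_prev_inv @ x_t)))
print("updates (28)-(29) verified")
```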
For $t = 1, \ldots, T$ we have
\begin{align*}
& \ell_t(\mathrm{alg}) + \inf_{u \in \mathbb{R}^d}\left( b\|u\|^2 + L^a_{t-1}(u) \right) - \inf_{u \in \mathbb{R}^d}\left( b\|u\|^2 + L^a_{t}(u) \right) \\
&\stackrel{(4),(8)}{=} (y_t - \hat{y}_t)^2 + \sum_{s=1}^{t-1} a_s y_s^2 - b_{t-1}^\top A_{t-1}^{-1} b_{t-1} - \sum_{s=1}^{t} a_s y_s^2 + b_t^\top A_t^{-1} b_t \\
&= (y_t - \hat{y}_t)^2 - a_t y_t^2 - b_{t-1}^\top A_{t-1}^{-1} b_{t-1} + b_{t-1}^\top A_t^{-1} b_{t-1} + 2 a_t y_t b_{t-1}^\top A_t^{-1} x_t + a_t^2 y_t^2 x_t^\top A_t^{-1} x_t \\
&= (y_t - \hat{y}_t)^2 - a_t y_t^2 - b_{t-1}^\top A_t^{-1} a_t x_t x_t^\top A_{t-1}^{-1} b_{t-1} + 2 a_t y_t b_{t-1}^\top A_t^{-1} x_t + a_t^2 y_t^2 x_t^\top A_t^{-1} x_t \\
&\stackrel{(12)}{=} (y_t - \hat{y}_t)^2 - a_t y_t^2 + a_t \left( -\hat{y}_t b_{t-1}^\top + 2 y_t b_{t-1}^\top + a_t y_t^2 x_t^\top \right) A_t^{-1} x_t \\
&\stackrel{(29)}{=} (y_t - \hat{y}_t)^2 - a_t y_t^2 + a_t \left( -\hat{y}_t b_{t-1}^\top + 2 y_t b_{t-1}^\top + a_t y_t^2 x_t^\top \right) \frac{A_{t-1}^{-1} x_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} \\
&= \frac{1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} \, (y_t - \hat{y}_t)^2 \le 0 \,.
\end{align*}
Summing over $t \in \{1, \ldots, T\}$ yields $L_T(\mathrm{alg}) - \inf_{u \in \mathbb{R}^d}\left( b\|u\|^2 + L^a_T(u) \right) \le 0$.
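The per-round inequality can be tested numerically with a short NumPy sketch. It uses the closed form $\inf_u \big(b\|u\|^2 + L^a_t(u)\big) = \sum_s a_s y_s^2 - b_t^\top A_t^{-1} b_t$, and assumes the forms $\hat{y}_t = x_t^\top A_{t-1}^{-1} b_{t-1}$ and $a_t = 1/(1 - x_t^\top A_{t-1}^{-1} x_t)$ as read off the derivation (eqs. (12)-(13) are not reproduced here); under this reading the last expression above is exactly zero:

```python
import numpy as np

# Numerical check of the per-round inequality in the proof above. The forms of
# yhat_t and a_t below are assumptions read off the derivation, under which the
# final expression in the chain vanishes exactly (up to floating-point error).
rng = np.random.default_rng(2)
d, b, T = 3, 2.0, 25
A, bvec = b * np.eye(d), np.zeros(d)
sum_ay2, phi_prev, max_gap = 0.0, 0.0, -np.inf   # phi_0 = inf_u b||u||^2 = 0
for _ in range(T):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))          # enforce ||x_t|| <= 1
    y = rng.normal()
    A_inv = np.linalg.inv(A)
    a = 1.0 / (1.0 - x @ A_inv @ x)           # assumed tuning of a_t
    yhat = x @ A_inv @ bvec                   # assumed predictor x_t^T A_{t-1}^{-1} b_{t-1}
    A += a * np.outer(x, x)                   # A_t = A_{t-1} + a_t x_t x_t^T
    bvec = bvec + a * y * x                   # b_t = b_{t-1} + a_t y_t x_t
    sum_ay2 += a * y * y
    phi = sum_ay2 - bvec @ np.linalg.inv(A) @ bvec   # inf_u (b||u||^2 + L^a_t(u))
    max_gap = max(max_gap, (y - yhat) ** 2 + phi_prev - phi)
    phi_prev = phi
assert max_gap <= 1e-8
print("per-round inequality holds")
```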
B Proof of Thm. 3
Proof. From (28) we see that $A_t^{-1} \prec A_{t-1}^{-1}$, and because $A_0 = bI$ we get
$$x_t^\top A_t^{-1} x_t < x_t^\top A_{t-1}^{-1} x_t < x_t^\top A_{t-2}^{-1} x_t < \ldots < x_t^\top A_0^{-1} x_t = \frac{1}{b}\|x_t\|^2 \le \frac{1}{b} \,,$$
therefore $1 \le a_t \le \frac{1}{1 - \frac{1}{b}} = \frac{b}{b-1}$.
From (29) we have $x_t^\top A_t^{-1} x_t = \frac{x_t^\top A_{t-1}^{-1} x_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} = \frac{1 - \frac{1}{a_t}}{1 + a_t \left(1 - \frac{1}{a_t}\right)} = \frac{a_t - 1}{a_t^2}$,
so we can bound the term $a_t - 1$ as follows,
$$a_t - 1 = a_t^2 \, x_t^\top A_t^{-1} x_t \le \frac{b}{b-1} \, a_t x_t^\top A_t^{-1} x_t \,. \qquad (30)$$
With an argument similar to [4] we have $a_t x_t^\top A_t^{-1} x_t \le \ln \frac{|A_t|}{|A_t - a_t x_t x_t^\top|} = \ln \frac{|A_t|}{|A_{t-1}|}$.
Summing the last inequality over $t$ and using the initial value $\ln \left| \frac{1}{b} A_0 \right| = 0$
we get $\sum_{t=1}^{T} a_t x_t^\top A_t^{-1} x_t \le \ln \left| \frac{1}{b} A_T \right|$. Substituting
the last inequality in (30) we get the logarithmic bound $\sum_{t=1}^{T} (a_t - 1) \le \frac{b}{b-1} \ln \left| \frac{1}{b} A_T \right|$, as required.
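The chain of inequalities can be simulated directly; the sketch below assumes the tuning $a_t = 1/(1 - x_t^\top A_{t-1}^{-1} x_t)$ read off this proof (it is the choice under which $a_t - 1 = a_t^2 x_t^\top A_t^{-1} x_t$ holds), and checks both the range of $a_t$ and the summed logarithmic bound:

```python
import numpy as np

# Numerical check of the Thm. 3 bound. The tuning a_t = 1/(1 - x_t^T A_{t-1}^{-1} x_t)
# is an assumption read off the proof above, not a formula quoted from the paper body.
rng = np.random.default_rng(1)
d, b, T = 3, 2.0, 50
A = b * np.eye(d)
a_vals, sum_excess = [], 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))          # enforce ||x_t|| <= 1
    r = x @ np.linalg.inv(A) @ x              # x_t^T A_{t-1}^{-1} x_t, always < 1/b
    a = 1.0 / (1.0 - r)
    A = A + a * np.outer(x, x)                # A_t = A_{t-1} + a_t x_t x_t^T
    # identity a_t - 1 = a_t^2 x_t^T A_t^{-1} x_t, used in (30)
    assert np.isclose(a - 1.0, a**2 * (x @ np.linalg.inv(A) @ x))
    a_vals.append(a)
    sum_excess += a - 1.0
log_det = np.log(np.linalg.det(A / b))        # ln |(1/b) A_T|
assert all(1.0 <= a <= b / (b - 1.0) + 1e-12 for a in a_vals)
assert sum_excess <= b / (b - 1.0) * log_det + 1e-9
print("Thm. 3 bound verified numerically")
```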
References
[1] Azoury, K.S., Warmuth, M.K.: Relative loss bounds for on-line density estimation with the
exponential family of distributions. Machine Learning 43(3), 211–246 (2001)
[2] Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University
Press, New York (2006)
[3] Dekel, O., Long, P.M., Singer, Y.: Online learning of multiple tasks with a shared loss.
Journal of Machine Learning Research 8, 2233–2264 (2007)
[4] Forster, J.: On Relative Loss Bounds in Generalized Linear Regression. In: Ciobanu, G.,
Păun, G. (eds.) FCT 1999. LNCS, vol. 1684, pp. 269–280. Springer, Heidelberg (1999)
[5] Foster, D.P.: Prediction in the worst case. The Annals of Statistics 19(2), 1084–1090 (1991)
[6] Golub, G.H., Van Loan, C.F.: Matrix computations, 3rd edn. Johns Hopkins University
Press, Baltimore (1996)
[7] Hayes, M.H.: 9.4: Recursive least squares. In: Statistical Digital Signal Processing and
Modeling, p. 541 (1996)
[8] Herbster, M., Warmuth, M.K.: Tracking the best linear predictor. Journal of Machine Learning Research 1, 281–309 (2001)
[9] Orabona, F., Cesa-Bianchi, N., Gentile, C.: Beyond logarithmic bounds in online learning.
In: AISTATS (2012)
[10] Takimoto, E., Warmuth, M.K.: The Last-Step Minimax Algorithm. In: Arimura, H., Sharma,
A.K., Jain, S. (eds.) ALT 2000. LNCS (LNAI), vol. 1968, pp. 279–290. Springer, Heidelberg
(2000)
[11] Vaits, N., Crammer, K.: Re-adapting the Regularization of Weights for Non-stationary Regression. In: Kivinen, J., Szepesvári, C., Ukkonen, E., Zeugmann, T. (eds.) ALT 2011.
LNCS, vol. 6925, pp. 114–128. Springer, Heidelberg (2011)
[12] Vovk, V.G.: Aggregating strategies. In: Proceedings of the Third Annual Workshop on Computational Learning Theory, pp. 371–383. Morgan Kaufmann (1990)
[13] Vovk, V.: Competitive on-line statistics. International Statistical Review 69 (2001)
[14] Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In:
Proceedings of the Twentieth International Conference on Machine Learning, pp. 928–936
(2003)