
E0 370 Statistical Learning Theory                          Lecture 20 (Nov 17, 2011)
Online Learning from Experts: Weighted Majority and Hedge
Lecturer: Shivani Agarwal                                    Scribe: Saradha R

1 Introduction
In this lecture, we will look at the problem of learning from multiple experts in an online fashion. There is a finite number of experts, who give their predictions ξ_1(x), ..., ξ_N(x) on each instance x. The learning algorithm has to combine these predictions and come up with its own prediction ŷ. The total number of mistakes made by the algorithm is compared with the performance of the best expert under consideration.
2 Online Prediction from Experts
A general online prediction problem proceeds as follows.
Online (binary) prediction using multiple experts
For t = 1, ..., T:
  – Receive instance x^t ∈ X
  – Receive expert predictors ξ_1(x^t), ..., ξ_N(x^t) ∈ {±1}
  – Predict ŷ^t ∈ {±1}
  – Receive true label y^t ∈ {±1}
  – Incur loss ℓ(y^t, ŷ^t)
2.1 Halving Algorithm
Here we assume that the set of experts under consideration contains an expert which gives the correct label on all instances. In the halving algorithm, at every iteration only the consistent experts are retained: once a predictor makes a mistake, it no longer contributes to the prediction process.
Halving Algorithm
Initialize weights w_i^1 = 1 for all i ∈ [N]
For t = 1, ..., T:
  – Receive instance x^t ∈ X
  – Receive expert predictors ξ_1(x^t), ..., ξ_N(x^t) ∈ {±1}
  – Predict ŷ^t = sign( Σ_{i=1}^N w_i^t · ξ_i^t )   (majority vote)
  – Receive true label y^t ∈ {±1}
  – Incur loss ℓ(y^t, ŷ^t)
  – Update: for all i ∈ [N],
      w_i^{t+1} ← 0 if ξ_i^t ≠ y^t, and w_i^{t+1} ← w_i^t otherwise
On every trial on which the algorithm errs, the majority of the surviving experts must have erred, so at least half of the surviving experts are eliminated; the perfect expert is never eliminated. Thus the maximum number of mistakes, i.e. the sum of the 0-1 losses over any given sequence, is bounded by the logarithm of the number of predictors:

    L^{0-1}_S[Halving] ≤ log_2 N .
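To make the procedure concrete, here is a minimal Python sketch of the halving algorithm (illustrative only, not part of the original notes; the function and argument names are ours, and the expert predictions and labels are assumed to be supplied as ±1 arrays).

    import numpy as np

    def halving(expert_preds, labels):
        """Halving algorithm for online binary prediction with a perfect expert.

        expert_preds: (T, N) array with entries in {-1, +1}; expert_preds[t, i] = xi_i(x^t).
        labels:       (T,) array with entries in {-1, +1}.
        Returns the number of mistakes made by the algorithm.
        """
        T, N = expert_preds.shape
        w = np.ones(N)                                 # w_i^1 = 1 for every expert
        mistakes = 0
        for t in range(T):
            xi = expert_preds[t]
            y_hat = 1 if np.dot(w, xi) >= 0 else -1    # majority vote of surviving experts
            if y_hat != labels[t]:
                mistakes += 1
            w[xi != labels[t]] = 0.0                   # discard every expert that erred on this trial
        return mistakes

Provided one expert is correct on every trial, the value returned is at most log_2 N, matching the bound above.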
2.2 Weighted Majority (WM) Algorithm
In the halving algorithm, once a predictor makes even one mistake, it can no longer contribute to the prediction in the subsequent iterations. When we do not have an expert that predicts correctly on all samples, this is not a suitable approach. The weighted majority algorithm works well in such situations. Here every predictor is initially assigned an equal weight, say 1. Then, as the experts make binary predictions on instances, the weight of each predictor that commits a mistake is decreased by a multiplicative update. The rate at which the weights are updated is governed by the parameter η.
Weighted Majority Algorithm
Initialize weights w_i^1 = 1 for all i ∈ [N]
Choose parameter η > 0 to be used in the weight update rule.
For t = 1, ..., T:
  – Receive instance x^t ∈ X
  – Receive expert predictors ξ_1(x^t), ..., ξ_N(x^t) ∈ {±1}
  – Predict ŷ^t = sign( Σ_{i=1}^N w_i^t · ξ_i^t )   (weighted majority vote)
  – Receive true label y^t ∈ {±1}
  – Incur loss ℓ(y^t, ŷ^t)
  – Update: if ŷ^t ≠ y^t, then for all i ∈ [N],
      w_i^{t+1} ← w_i^t · exp(−η · I(ξ_i^t ≠ y^t));
    otherwise w_i^{t+1} ← w_i^t
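A corresponding Python sketch of the weighted majority update (again illustrative, not from the notes; same interface as the halving sketch above, with the update-rate parameter eta as an assumed argument):

    import numpy as np

    def weighted_majority(expert_preds, labels, eta=0.5):
        """Weighted majority: multiplicative updates on mistake trials.

        expert_preds: (T, N) array with entries in {-1, +1}.
        labels:       (T,) array with entries in {-1, +1}.
        eta:          update-rate parameter eta > 0.
        Returns the number of mistakes made by the algorithm.
        """
        T, N = expert_preds.shape
        w = np.ones(N)
        mistakes = 0
        for t in range(T):
            xi = expert_preds[t]
            y_hat = 1 if np.dot(w, xi) >= 0 else -1    # weighted majority vote
            if y_hat != labels[t]:
                mistakes += 1
                w[xi != labels[t]] *= np.exp(-eta)     # penalize only the experts that erred
        return mistakes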
Theorem 2.1. Let ξ_1, ..., ξ_N ∈ {±1}^T, let S = (y^1, ..., y^T) ∈ {±1}^T, and let η > 0. Then the total number of mistakes satisfies

    L^{0-1}_S[WeightedMajority(η)] ≤ (η / ln(2/(1 + exp(−η)))) · min_i L^{0-1}_S[ξ_i] + (1 / ln(2/(1 + exp(−η)))) · ln N .
Proof. Denote L^{0-1}_S[WeightedMajority(η)] = L and, for each expert i, L_i = L^{0-1}_S[ξ_i].
Consider any trial t on which the algorithm makes a mistake. Let W_maj denote the total weight of the experts that erred on this trial and W_min the total weight of the experts that were correct; since the weighted majority vote was wrong, W_maj ≥ W_min. Then

    W^{t+1} = Σ_{i=1}^N w_i^{t+1} = Σ_{i=1}^N w_i^t · exp(−η · I(y^t ≠ ξ_i^t))
            = Σ_{i: y^t ≠ ξ_i^t} w_i^t · exp(−η) + Σ_{i: y^t = ξ_i^t} w_i^t
            = exp(−η) · W_maj + W_min
            ≤ exp(−η) · W_maj + W_min + ((1 − exp(−η))/2) · (W_maj − W_min)
            = ((1 + exp(−η))/2) · (W_maj + W_min)
            = ((1 + exp(−η))/2) · W^t .

Thus for every mistake trial t we have W^{t+1}/W^t ≤ (1 + exp(−η))/2, while for every other trial W^{t+1}/W^t ≤ 1 (the weights are unchanged). Multiplying over t = 1, ..., T gives

    W^{T+1}/W^1 ≤ ((1 + exp(−η))/2)^L .
Taking logarithms, we get

    L ≤ (ln W^1 − ln W^{T+1}) / ln(2/(1 + exp(−η))) .

We now lower bound ln W^{T+1}: for every i,

    W^{T+1} = Σ_{j=1}^N w_j^{T+1} ≥ w_i^{T+1} ≥ exp(−η · L_i) · w_i^1 ,

since the weight of expert i is multiplied by exp(−η) on at most L_i trials and is unchanged otherwise. Using W^1 = N and w_i^1 = 1 (the argument only requires w_j^1 > 0 for all j), this gives, for every i,

    L ≤ (ln W^1 + η · L_i − ln w_i^1) / ln(2/(1 + exp(−η))) = (ln N + η · L_i) / ln(2/(1 + exp(−η))) .

Taking the minimum over i, we obtain the result.
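As a quick sanity check (a remark not in the original notes): if some expert is perfect, so that min_i L^{0-1}_S[ξ_i] = 0, the bound reduces to ln N / ln(2/(1 + exp(−η))); letting η → ∞, this tends to ln N / ln 2 = log_2 N, recovering the mistake bound of the halving algorithm.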
2.3 Weighted Majority: Continuous Version (WMC)
We now look at the continuous version of the weighted majority algorithm. The final prediction is a weighted average of the expert predictor values. Here the expert predictions ξ_i(x^t), the algorithm's prediction ŷ^t, and the true labels y^t all take values in [0, 1].
Weighted Majority Algorithm: Continuous Version (WMC)
Initialize weights w_i^1 = 1 for all i ∈ [N]
Choose parameter η > 0 to be used in the weight update rule.
For t = 1, ..., T:
  – Receive instance x^t ∈ X
  – Receive expert predictors ξ_1(x^t), ..., ξ_N(x^t) ∈ [0, 1]
  – Predict ŷ^t = ( Σ_{i=1}^N w_i^t · ξ_i^t ) / ( Σ_{i=1}^N w_i^t ) ∈ [0, 1]   (weighted average)
  – Receive true label y^t ∈ [0, 1]
  – Incur loss ℓ_abs(y^t, ŷ^t) = |y^t − ŷ^t|
  – Update: for all i ∈ [N],
      w_i^{t+1} ← w_i^t · exp(−η · |ξ_i^t − y^t|)
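A minimal Python sketch of WMC under the same conventions (illustrative only, not from the notes; predictions and labels are assumed to lie in [0, 1]):

    import numpy as np

    def wmc(expert_preds, labels, eta=0.5):
        """Continuous-version weighted majority (WMC).

        expert_preds: (T, N) array with entries in [0, 1].
        labels:       (T,) array with entries in [0, 1].
        eta:          update-rate parameter eta > 0.
        Returns the cumulative absolute loss of the algorithm.
        """
        T, N = expert_preds.shape
        w = np.ones(N)
        total_loss = 0.0
        for t in range(T):
            xi = expert_preds[t]
            y_hat = np.dot(w, xi) / w.sum()               # weighted-average prediction
            total_loss += abs(labels[t] - y_hat)          # absolute loss on this trial
            w *= np.exp(-eta * np.abs(xi - labels[t]))    # multiplicative update on every trial
        return total_loss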
Theorem 2.2. Let ξ_1, ..., ξ_N ∈ [0, 1]^T, let S = (y^1, ..., y^T) ∈ [0, 1]^T, and let η > 0. Then the total absolute loss satisfies

    L^abs_S[WMC(η)] ≤ (η / (1 − exp(−η))) · min_i L^abs_S[ξ_i] + (1 / (1 − exp(−η))) · ln N .
Proof. Denote L^abs_S[WMC(η)] = L and, for each expert i, L_i = L^abs_S[ξ_i].
For each trial t we have

    W^{t+1} = Σ_{i=1}^N w_i^{t+1} = Σ_{i=1}^N w_i^t · exp(−η · |y^t − ξ_i^t|)
            ≤ Σ_{i=1}^N w_i^t · [1 − (1 − exp(−η)) · |y^t − ξ_i^t|] ,

where the inequality uses the fact that exp(−ηx) ≤ 1 − (1 − exp(−η)) · x for all x ∈ [0, 1] (by convexity of x ↦ exp(−ηx)).
Dividing and multiplying by W^t = Σ_{i=1}^N w_i^t,

    W^{t+1} ≤ W^t · [1 − (1 − exp(−η)) · ( Σ_{i=1}^N w_i^t · |y^t − ξ_i^t| ) / ( Σ_{i=1}^N w_i^t )]
            ≤ W^t · [1 − (1 − exp(−η)) · |ŷ^t − y^t|] ,

where the last step uses ( Σ_i w_i^t · |y^t − ξ_i^t| ) / ( Σ_i w_i^t ) ≥ | y^t − ( Σ_i w_i^t · ξ_i^t ) / ( Σ_i w_i^t ) | = |y^t − ŷ^t|. Since 1 − x ≤ exp(−x), this gives

    W^{t+1} ≤ W^t · exp(−(1 − exp(−η)) · |ŷ^t − y^t|) ,

i.e. W^{t+1}/W^t ≤ exp(−(1 − exp(−η)) · |ŷ^t − y^t|). Multiplying over t = 1, ..., T,

    W^{T+1}/W^1 ≤ exp(−(1 − exp(−η)) · Σ_{t=1}^T |ŷ^t − y^t|) = exp(−(1 − exp(−η)) · L) .
Taking logarithms, we get

    L ≤ (ln W^1 − ln W^{T+1}) / (1 − exp(−η)) .

To lower bound ln W^{T+1}, note that for every i,

    W^{T+1} ≥ w_i^{T+1} = exp(−η · L_i) · w_i^1 .

Thus, using W^1 = N and w_i^1 = 1, we obtain, for every i,

    L ≤ (ln W^1 + η · L_i − ln w_i^1) / (1 − exp(−η)) = (ln N + η · L_i) / (1 − exp(−η)) ,

and taking the minimum over i gives the result.
3 Online Allocation
The problem of online allocation arises in scenarios where we need to allocate fractions of a resource among N different options. The loss associated with every option becomes available at the end of each iteration, and we would like to minimize the total loss suffered by our allocations. The allocation for the next iteration is then revised, based on the losses suffered in the current iteration, using a multiplicative update.
Hedge Algorithm (η)
Initialize weights w_i^1 = 1 for all i ∈ [N]
Choose parameter η > 0 to be used in the weight update rule.
For t = 1, ..., T:
  – Make allocation p^t ∈ Δ_N, where p_i^t = w_i^t / ( Σ_{j=1}^N w_j^t )
  – Receive vector of losses ℓ^t = (ℓ_1^t, ..., ℓ_N^t) ∈ [0, 1]^N
  – Incur loss p^t · ℓ^t = Σ_{i=1}^N p_i^t · ℓ_i^t
  – Update: for all i ∈ [N],
      w_i^{t+1} ← w_i^t · exp(−η · ℓ_i^t)
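A minimal Python sketch of Hedge (illustrative only, not from the notes; the loss matrix is an assumed input format):

    import numpy as np

    def hedge(losses, eta=0.5):
        """Hedge algorithm for online allocation.

        losses: (T, N) array; losses[t, i] is the loss of option i at iteration t, in [0, 1].
        eta:    update-rate parameter eta > 0.
        Returns the cumulative loss sum_t p^t . l^t of the algorithm.
        """
        T, N = losses.shape
        w = np.ones(N)
        total_loss = 0.0
        for t in range(T):
            p = w / w.sum()                            # allocation p^t on the simplex
            total_loss += float(np.dot(p, losses[t]))  # loss incurred at iteration t
            w *= np.exp(-eta * losses[t])              # multiplicative update
        return total_loss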
Theorem 3.1. Let ℓ^1, ..., ℓ^T ∈ [0, 1]^N. The cumulative loss of the algorithm is

    L[A] = Σ_{t=1}^T p^t · ℓ^t ,

and the loss of option i over the T iterations is

    L_i = Σ_{t=1}^T ℓ_i^t .

Then

    L[Hedge(η)] ≤ (η / (1 − exp(−η))) · min_i L_i + (1 / (1 − exp(−η))) · ln N .
Proof. Denote L[Hedge(η)] = L.
For each trial t we have

    W^{t+1} = Σ_{i=1}^N w_i^{t+1} = Σ_{i=1}^N w_i^t · exp(−η · ℓ_i^t)
            ≤ Σ_{i=1}^N w_i^t · [1 − (1 − exp(−η)) · ℓ_i^t]
            = W^t · [1 − (1 − exp(−η)) · ( Σ_{i=1}^N w_i^t · ℓ_i^t ) / ( Σ_{i=1}^N w_i^t )]
            = W^t · [1 − (1 − exp(−η)) · p^t · ℓ^t] ,

again using exp(−ηx) ≤ 1 − (1 − exp(−η)) · x for x ∈ [0, 1]. Since 1 − x ≤ exp(−x), this gives

    W^{t+1}/W^t ≤ exp(−(1 − exp(−η)) · p^t · ℓ^t) ,

and multiplying over t = 1, ..., T,

    W^{T+1}/W^1 ≤ exp(−(1 − exp(−η)) · Σ_{t=1}^T p^t · ℓ^t) = exp(−(1 − exp(−η)) · L) .
Taking logarithms, we get

    L ≤ (ln W^1 − ln W^{T+1}) / (1 − exp(−η)) .

To lower bound ln W^{T+1}, note that for every i,

    W^{T+1} ≥ w_i^{T+1} = exp(−η · L_i) · w_i^1 .

Thus, using W^1 = N and w_i^1 = 1, we obtain, for every i,

    L ≤ (ln W^1 + η · L_i − ln w_i^1) / (1 − exp(−η)) = (ln N + η · L_i) / (1 − exp(−η)) ,

and taking the minimum over i gives the result.
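As a quick numerical illustration (not from the notes), the following sketch checks the bound of Theorem 3.1 on random losses, reusing the hedge sketch given above:

    import numpy as np

    # Sanity check of Theorem 3.1 on random losses (illustrative only).
    rng = np.random.default_rng(0)
    T, N, eta = 1000, 10, 0.5
    losses = rng.uniform(0.0, 1.0, size=(T, N))

    alg_loss = hedge(losses, eta=eta)
    best_option = losses.sum(axis=0).min()                          # min_i L_i
    bound = (eta * best_option + np.log(N)) / (1.0 - np.exp(-eta))  # Theorem 3.1

    # Theorem 3.1 guarantees alg_loss <= bound for any loss sequence.
    print(f"Hedge loss: {alg_loss:.1f}, best option: {best_option:.1f}, bound: {bound:.1f}")
    assert alg_loss <= bound + 1e-9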
4 Next Lecture
In the next lecture, we will introduce the idea of minimax regret, in an adversarial learning setting.