The Forgetron:
A kernel-based Perceptron on a fixed budget
Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
The Hebrew University, Jerusalem, Israel

Problem Setting
•Online learning:
•A sequence {(x1,y1),…,(xT,yT)} such that K(xt,xt) <= 1
•For t=1,2,…
•Receive an instance xt and predict sign(ft(xt))
•If yt ft(xt) <= 0, update the hypothesis ft
•Goal: minimize the number of prediction mistakes
•Kernel-based hypotheses: ft(x) = sum_{i in It} αi K(xi,x), where It is the set of active examples
•Example: the dual Perceptron
•Goal: a provably correct algorithm on a budget: |It| <= B
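As a concrete reference point for this setting, here is a minimal Python sketch of the dual (kernel) Perceptron mentioned above; the Gaussian kernel and all identifiers (gaussian_kernel, active, stream) are illustrative choices rather than anything specified on the poster.

import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    # Illustrative kernel; any K with K(x, x) <= 1 fits the setting above.
    return np.exp(-np.linalg.norm(np.asarray(x1) - np.asarray(x2)) ** 2 / (2 * sigma ** 2))

def dual_perceptron(stream, kernel=gaussian_kernel):
    # Online kernel Perceptron: every prediction mistake adds one active example.
    active = []        # pairs (x_i, y_i) defining the current hypothesis
    mistakes = 0
    for x_t, y_t in stream:
        f_t = sum(y_i * kernel(x_i, x_t) for x_i, y_i in active)
        if y_t * f_t <= 0:               # prediction mistake
            active.append((x_t, y_t))    # the active set grows without bound
            mistakes += 1
    return active, mistakes

Note that the active set grows with every mistake; bounding its size by a fixed budget B is exactly the problem the Forgetron addresses.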
The Forgetron
•Initialize: choose an initial hypothesis f0 (e.g., f0 ≡ 0)
•For t=1,2,…
•Receive an instance xt, predict sign(ft(xt)), and then receive yt
•If yt ft(xt) <= 0 set Mt = Mt-1 + 1 and update:
Step (1) - Perceptron
•f't = ft + yt K(xt,·); the weight assigned to the newly inserted example is always 1
•If |I't| <= B skip the next two steps
Step (2) - Shrinking
•Scale the hypothesis down: f''t = φt f't with 0 < φt <= 1
Step (3) - Removal
•define rt = min It
•Update rule: remove the oldest active example rt, so that |It+1| <= B
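The following Python sketch mirrors the three-step update above. The shrink_coefficient callback stands in for the poster's concrete choice of the shrinking factor φt (which is not reproduced here), and all identifiers are illustrative.

from collections import deque

def forgetron_round(active, x_t, y_t, kernel, shrink_coefficient, budget):
    # One online round of a Forgetron-style update (sketch).
    # `active` is a deque of [x_i, y_i, weight_i] entries, oldest example first.
    f_t = sum(w_i * y_i * kernel(x_i, x_t) for x_i, y_i, w_i in active)
    mistake = y_t * f_t <= 0
    if mistake:
        # Step (1) - Perceptron: insert the new example with weight 1
        active.append([x_t, y_t, 1.0])
        if len(active) > budget:
            # Step (2) - Shrinking: scale every weight by phi_t in (0, 1]
            phi_t = shrink_coefficient(active)
            for entry in active:
                entry[2] *= phi_t
            # Step (3) - Removal: discard the oldest active example (r_t)
            active.popleft()
    return mistake

# Example wiring (illustrative):
# active = deque()
# made_mistake = forgetron_round(active, x_t, y_t, gaussian_kernel, lambda a: 0.9, budget=100)

For example, shrink_coefficient=lambda active: 0.9 applies a fixed decay; the analysis in the next boxes is what dictates how aggressive this factor is allowed to be.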
Proof Technique
•Competitive analysis
•Compete against any g: hinge loss max { 0 , 1 - yt g(xt) }
•Progress measure:
•define the progress on round t as ||ft - g||^2 - ||ft+1 - g||^2
•The total progress can be rewritten as the gain of the Perceptron step minus the deviations caused by the shrinking and removal steps
•The Perceptron update leads to positive progress (a standard derivation is sketched below)
•The shrinking and removal steps might lead to negative progress
Can we quantify the deviation they might cause?
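Here is the standard calculation behind the positive-progress claim for the Perceptron step alone; the symbols Δt and ℓ*t are introduced for this sketch and may not match the poster's notation.

\begin{aligned}
\Delta_t &= \|f_t - g\|^2 - \|f_t + y_t K(x_t,\cdot) - g\|^2
          = 2\,y_t\bigl(g(x_t) - f_t(x_t)\bigr) - K(x_t,x_t) \\
         &\ge 2\,(1 - \ell^*_t) - 1 = 1 - 2\,\ell^*_t,
\qquad \text{where } \ell^*_t = \max\{0,\ 1 - y_t\, g(x_t)\},
\end{aligned}

using yt ft(xt) <= 0 on a mistake round and K(xt,xt) <= 1. So Step (1) gains at least 1 - 2ℓ*t on every mistake round; the question above is how much of this gain the shrinking and removal steps can take back.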
Deviation due to Shrinking
•Case I: the shrinking is a projection onto the ball of radius U = ||g||:
•define φt = min{ 1 , U/||f't|| } (the scaling that projects f't onto the ball); projecting onto a ball that contains g cannot increase the distance to g, so this case causes no negative progress (see the check below)
•Case II: aggressive shrinking
•Shrinking more aggressively than this projection can move the hypothesis away from g; the resulting deviation must be bounded in terms of the extra scaling that was applied
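A small self-contained Python check of the two cases above, using plain vectors rather than kernel expansions; the dimensions and random draws are arbitrary illustration, not the poster's setting.

import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=5)                       # the competitor
U = np.linalg.norm(g)                        # ball radius U = ||g||

for _ in range(1000):
    f = rng.normal(size=5) * rng.uniform(0.5, 3.0)
    norm_f = np.linalg.norm(f)
    if norm_f <= U:
        continue                             # projection only acts outside the ball
    f_proj = f * (U / norm_f)                # Case I: shrink exactly onto the ball
    assert np.linalg.norm(f_proj - g) <= np.linalg.norm(f - g) + 1e-9

# Case II: shrinking far below the projection level can move away from g
f = g.copy()                                 # distance to g is 0
f_aggressive = 0.1 * f                       # distance to g becomes 0.9 * ||g||
print("deviation from aggressive shrinking:", np.linalg.norm(f_aggressive - g))

The aggressive case is exactly why a separate bound on the shrinking deviation is needed.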
Deviation due to Removal
•define the deviation caused by removing the oldest active example rt
•If example (x,y) is activated on round r and deactivated on round t then its weight has been multiplied by every shrinking coefficient applied in between, and is therefore small by the time it is removed
•Deviation due to removal: the cumulative deviation must stay small relative to the Perceptron's progress for the mistake bound below to follow
Mistake Bound
•A sequence {(x1,y1),…,(xT,yT)} such that K(xt,xt) <= 1
•Budget B >= 84
•define a norm threshold that grows with the budget B
•Competitor g s.t. ||g|| is at most this threshold
•Then, the number of mistakes of the Forgetron is bounded in terms of the cumulative hinge loss of g on the sequence
Hardness Result
•For any kernel-based algorithm that satisfies |It| <= B we can find g and an arbitrarily long sequence such that the algorithm errs on every single round whereas g suffers no loss
•Proof idea: for each ft based on B examples there exists x in X such that ft(x) = 0, so a mistake can be forced on every round
•The norm of the competitor g in this construction depends only on B; hence no algorithm on a budget of B examples can compete with hypotheses whose norm is larger than a threshold determined by B
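One concrete way to realize such a construction, sketched here under the assumption of B+1 mutually orthogonal unit-norm instances (the poster's exact construction may differ):

\begin{aligned}
&K(x_i, x_j) = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}, \qquad
g = \sum_{i=1}^{B+1} y_i\, K(x_i, \cdot), \qquad \|g\|^2 = B + 1, \qquad
y_i\, g(x_i) = 1 \ \text{for all } i, \\
&\text{while any } f_t \text{ supported on at most } B \text{ of the } x_i
\text{ satisfies } f_t(x_j) = 0 \text{ for some unused } x_j.
\end{aligned}

Presenting such an unused xj on every round gives yt ft(xt) = 0 <= 0, i.e., a mistake by the convention above, while g attains margin 1 on every round and suffers no hinge loss.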
Experiments
•Compare to Crammer, Kandola, Singer (NIPS'03) (CKS)
•Display online error vs. budget constraint
•Datasets: MNIST, USPS, census-income (adult), and a synthetic dataset with 10% label noise
[Plots: average error as a function of the budget size B, for the Forgetron and CKS, one panel per dataset]
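Finally, a sketch of how curves like these would be produced, assuming some run_learner routine (for instance a loop around the forgetron_round sketch above) that returns a per-round 0/1 mistake indicator; the budget values and helper names are illustrative, not the poster's experimental protocol.

def error_vs_budget(stream, budgets, run_learner):
    # run_learner(stream, budget) -> list of 0/1 mistake indicators, one per round
    curve = {}
    for B in budgets:
        mistakes = run_learner(stream, B)
        curve[B] = sum(mistakes) / len(mistakes)    # average online error at budget B
    return curve

# Illustrative wiring with a placeholder learner that errs on every round:
if __name__ == "__main__":
    dummy = lambda stream, B: [1 for _ in stream]
    print(error_vs_budget(list(range(100)), [100, 500, 1000, 2000], dummy))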