Online
Algorithms
Lecturer: Yishay Mansour
Elad Walach
Alex Roitenberg
Introduction
Up until now, our algorithms started with the entire input and worked on it
Now suppose the input arrives a little at a time, and an instant response is needed
Oranges example
Suppose we are to build a robot that removes bad oranges from a kibbutz packaging line
After classification, the kibbutz worker looks at the orange and tells our robot whether its classification was correct
And this repeats indefinitely
Our model:
Input: unlabeled orange 𝑥
Output: classification (good or bad) 𝑏
The algorithm then gets the correct
classification 𝐶𝑡 (𝑥)
Introduction
At every step t, the algorithm predicts the classification based on some hypothesis H_t
The algorithm then receives the correct classification C_t(x)
A mistake is an incorrect prediction: H_t(x) ≠ C_t(x)
The goal is to build an algorithm with a bounded number of mistakes
The number of mistakes should be independent of the input size
Linear Separators
Linear separator
The goal: find w_0 and w defining a hyperplane w · x = w_0
All positive examples will be on one side of the hyperplane and all the negative ones on the other
I.e. w · x > w_0 for positive x only
We will now look at several algorithms to find
the separator
Perceptron
The idea: correct? Do nothing. Wrong? Move the separator towards the mistake
We'll scale all x's so that ‖x‖ = 1, since this doesn't affect which side of the plane they are on
The perceptron algorithm
1. Initialize w_1 = 0, t = 1
2. Given x_t, predict positive iff w_t · x_t > 0
3. On a mistake:
   1. Mistake on positive: w_{t+1} ← w_t + x_t
   2. Mistake on negative: w_{t+1} ← w_t − x_t
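A minimal sketch of the update rule in Python (assuming labels are ±1 and the inputs are already scaled to unit norm; the function name and interface are ours):

```python
import numpy as np

def perceptron(samples, labels):
    """Online perceptron. samples: unit-norm vectors, labels: +1/-1.
    Returns the final weight vector and the number of mistakes."""
    w = np.zeros(len(samples[0]))
    mistakes = 0
    for x, y in zip(samples, labels):
        x = np.asarray(x, dtype=float)
        prediction = 1 if np.dot(w, x) > 0 else -1
        if prediction != y:
            # mistake on a positive example adds x, on a negative one subtracts x
            w = w + y * x
            mistakes += 1
    return w, mistakes
```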
The perceptron algorithm
Suppose a positive sample x
If we misclassified x, then after the update we'll get w_{t+1} · x = (w_t + x) · x = w_t · x + 1 (since ‖x‖ = 1)
x was positive, but since we made a mistake, w_t · x was negative, so the correction was made in the right direction
Mistake Bound Theorem
Let S = ⟨x_i⟩ be a sequence consistent with some w*: w* · x > 0 ⟺ l(x) = 1
M = |{i : l(x_i) ≠ b_i}| is the number of mistakes
Then M ≤ 1/γ², where γ = min_{x_i ∈ S} |w* · x_i| / (‖w*‖ · ‖x_i‖) is the margin of w*:
the minimal distance of the samples in S from the hyperplane defined by w* (after normalizing both w* and the samples)
Mistake Bound Proof
WLOG, the algorithm makes a mistake on every step (otherwise nothing happens)
Claim 1: w_{t+1} · w* ≥ w_t · w* + γ
Proof:
Positive example (w* · x > 0): w_{t+1} · w* = (w_t + x) · w* = w_t · w* + x · w* ≥ w_t · w* + γ, by definition of γ
Negative example (w* · x < 0): w_{t+1} · w* = (w_t − x) · w* = w_t · w* − x · w* ≥ w_t · w* + γ, by definition of γ
Proof Cont.
Claim 2: ‖w_{t+1}‖² ≤ ‖w_t‖² + 1
Proof:
Positive example: ‖w_{t+1}‖² = ‖w_t + x‖² = ‖w_t‖² + 2 w_t · x + ‖x‖² = ‖w_t‖² + 2 w_t · x + 1 ≤ ‖w_t‖² + 1,
since w_t · x < 0 (the algorithm made a mistake)
Negative example: ‖w_{t+1}‖² = ‖w_t − x‖² = ‖w_t‖² − 2 w_t · x + 1 ≤ ‖w_t‖² + 1,
since w_t · x > 0 (the algorithm made a mistake)
Proof Cont.
From Claim 1: w_{M+1} · w* ≥ Mγ
From Claim 2: ‖w_{M+1}‖ ≤ √M
Also: w_t · w* ≤ ‖w_t‖, since ‖w*‖ = 1
Combining: Mγ ≤ w_{M+1} · w* ≤ ‖w_{M+1}‖ ≤ √M
⇒ M ≤ 1/γ²
The world is not perfect
What if there is no perfect separator?
The world is not perfect
Claim 1 (reminder): w_{t+1} · w* ≥ w_t · w* + γ
Previously we made γ progress on each mistake; now we might be making negative progress
TD_γ = total distance we would have to move the points to make them separable with margin γ
So: w_{M+1} · w* ≥ Mγ − TD_γ
With Claim 2: Mγ − TD_γ ≤ w_{M+1} · w* ≤ ‖w_{M+1}‖ ≤ √M
⇒ M ≤ 1/γ² + (2/γ) · TD_γ
The world is not perfect
Alternative view: (1/γ) · TD_γ is the total hinge loss of w*
Hinge loss definition: max(0, 1 − y), where y = l(x) · (x · w*)/γ
[Hinge loss illustration: the loss as a function of x · w*]
Perceptron for maximizing
margins
The idea: update w_t whenever the margin of the correct classification is less than γ/2
No. of steps polynomial in 1/γ
Generalization: update margin γ/2 → (1 − ε)γ
No. of steps polynomial in 1/(εγ)
Perceptron Algorithm
(maximizing margin)
Assuming ∀x_i ∈ S, ‖x_i‖ = 1
Init: w_1 ← l(x_1) · x_1
Predict:
(w_t · x)/‖w_t‖ ≥ γ/2 → predict positive
(w_t · x)/‖w_t‖ ≤ −γ/2 → predict negative
(w_t · x)/‖w_t‖ ∈ (−γ/2, γ/2) → margin mistake
On a mistake (prediction or margin), update:
w_{t+1} ← w_t + l(x) · x
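A minimal sketch of this variant in Python (labels assumed ±1, inputs unit-normalized, and the target margin gamma supplied by the caller; names are ours):

```python
import numpy as np

def margin_perceptron(samples, labels, gamma):
    """Margin perceptron: update on any prediction or margin mistake, i.e.
    whenever the normalized margin of the correct label is below gamma/2."""
    samples = [np.asarray(x, dtype=float) for x in samples]
    w = labels[0] * samples[0]                 # init: w_1 = l(x_1) * x_1
    updates = 0
    for x, y in zip(samples[1:], labels[1:]):
        margin = y * np.dot(w, x) / np.linalg.norm(w)
        if margin < gamma / 2:                 # prediction mistake or margin mistake
            w = w + y * x
            updates += 1
    return w, updates
```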
Mistake Bound Theorem
Let S = ⟨x_i⟩ be consistent with w*: w* · x > 0 ⟺ l(x) = 1
M = No. of mistakes + No. of margin mistakes
Then M ≤ 12/γ², where γ = min_{x_i ∈ S} |w* · x_i| / (‖w*‖ · ‖x_i‖) is the margin of w*
The proof is similar to the perceptron proof.
Claim 1 remains the same: w_{t+1} · w* ≥ w_t · w* + γ
We only have to bound ‖w_{t+1}‖
Mistake bound proof
WLOG, the algorithm makes a mistake (prediction or margin) on every step
Claim 2: ‖w_{t+1}‖ ≤ ‖w_t‖ + γ/2 + 1/(2‖w_t‖)
Proof:
‖w_{t+1}‖ = ‖w_t + l(x) · x‖ = ‖w_t‖ · √(1 + 2 l(x)(w_t · x)/‖w_t‖² + 1/‖w_t‖²)
And since √(1 + α) ≤ 1 + α/2:
‖w_{t+1}‖ ≤ ‖w_t‖ · (1 + l(x)(w_t · x)/‖w_t‖² + 1/(2‖w_t‖²)) = ‖w_t‖ + l(x)(w_t · x)/‖w_t‖ + 1/(2‖w_t‖)
Proof Cont.
Since the algorithm made a mistake (or a margin mistake) on step t:
l(x) · (w_t · x)/‖w_t‖ ≤ γ/2
Thus:
‖w_{t+1}‖ ≤ ‖w_t‖ + γ/2 + 1/(2‖w_t‖)
Proof Cont.
So: ‖w_{t+1}‖ ≤ ‖w_t‖ + γ/2 + 1/(2‖w_t‖)
If ‖w_t‖ ≥ 2/γ, then 1/(2‖w_t‖) ≤ γ/4, so ‖w_{t+1}‖ ≤ ‖w_t‖ + 3γ/4
→ ‖w_{M+1}‖ ≤ 1 + 2/γ + γ/2 + 3γ/4 + (3γ/4) · M
From Claim 1 as before: Mγ ≤ w_{M+1} · w* ≤ ‖w_{M+1}‖
Solving, we get: M ≤ 12/γ²
The mistake bound model
CON Algorithm
At step t:
C_t ⊆ C is the set of concepts consistent with x_1, x_2, …, x_{t−1}
Randomly choose a concept c ∈ C_t
Predict b_t = c(x_t)
CON Algorithm
Theorem:
For any concept class C, CON makes at most |C| − 1 mistakes
Proof: at first C_1 = C.
After each mistake |C_t| decreases by at least 1
|C_t| ≥ 1, since c_t ∈ C_t at any t
Therefore the number of mistakes is bounded by |C| − 1
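A toy sketch of one CON step in Python (the concept class is modeled as a collection of callables returning 0/1; this representation and the names are our assumptions):

```python
import random

def con_step(consistent_concepts, x, true_label):
    """One step of CON: predict with a randomly chosen consistent concept,
    then keep only the concepts that agree with the revealed label."""
    c = random.choice(list(consistent_concepts))
    prediction = c(x)
    mistake = prediction != true_label
    survivors = [h for h in consistent_concepts if h(x) == true_label]
    return prediction, mistake, survivors
```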
The bounds of CON
This bound is too high!
There are 2^(2^n) different functions on {0,1}^n
We can do better!
HAL – halving algorithm
C_t ⊆ C is the set of concepts consistent with x_1, x_2, …, x_{t−1}
At step t:
Conduct a vote amongst all c ∈ C_t
Predict b_t according to the majority
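A matching sketch of one halving step (same toy representation of the concept class as in the CON sketch above):

```python
def halving_step(consistent_concepts, x, true_label):
    """One step of HAL: predict by a majority vote of the version space,
    then remove every concept that got the example wrong."""
    positive_votes = sum(1 for c in consistent_concepts if c(x) == 1)
    prediction = 1 if 2 * positive_votes >= len(consistent_concepts) else 0
    mistake = prediction != true_label
    survivors = [c for c in consistent_concepts if c(x) == true_label]
    return prediction, mistake, survivors
```

On a mistake the majority of the remaining concepts were wrong, so at most half of them survive — exactly the fact the theorem below uses.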
HAL – halving algorithm
Theorem:
For any concept class C, HAL makes at most log₂|C| mistakes
Proof: C_1 = C. After each mistake |C_{t+1}| ≤ |C_t|/2, since a majority of the concepts were wrong.
Therefore the number of mistakes is bounded by log₂|C|
Mistake Bound model and
PAC
The mistake bound model generates strong online algorithms
In the past we have seen PAC
The restrictions of the mistake bound model are much harsher than those of PAC
If we know that A learns C in the mistake bound model, does A also learn C in the PAC model?
Mistake Bound model and
PAC
A – a mistake bound algorithm
Our goal: to construct Apac, a PAC algorithm
Assume that after A gets x_i it constructs hypothesis h_i
Definition: a mistake bound algorithm A is conservative iff for every sample x_i, if c_t(x_i) = h_{i−1}(x_i) then at the i-th step the algorithm chooses h_i = h_{i−1}
I.e. the hypothesis changes only when a mistake is made
Conservative equivalent of
Mistake Bound Algorithm
Let A be an algorithm whose number of mistakes is bounded by M
A_k is A's hypothesis after it has seen {x_1, x_2, …, x_k}
Define A':
Initially h_0 = A_0.
At x_i update:
Guess h_{i−1}(x_i)
If c_t(x_i) = h_{i−1}(x_i), set h_i = h_{i−1}
Else feed x_i to A and set h_i to A's new hypothesis
If we ran A only on S = {x_t : c_t(x_t) ≠ h_{t−1}(x_t)} it would make |S| mistakes ⇒
A' makes at most as many mistakes as A (at most M), and it is conservative
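A sketch of the conservative wrapper A' in Python (the `predict`/`update` interface of the underlying learner A is an assumption for illustration):

```python
class ConservativeWrapper:
    """Wrap a mistake-bound learner A so its hypothesis changes only on mistakes."""

    def __init__(self, learner):
        self.learner = learner              # the underlying algorithm A

    def predict(self, x):
        return self.learner.predict(x)      # current hypothesis h_{i-1}

    def observe(self, x, true_label):
        # feed the example to A only when the current hypothesis errs on it,
        # so A sees exactly the subsequence S of mistakes
        if self.learner.predict(x) != true_label:
            self.learner.update(x, true_label)
```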
Building Apac
Let k_i = (i/ε) · ln(M/δ), for 0 ≤ i ≤ M − 1 (the index where block i starts)
Apac algorithm:
Run A' over a sample of size (M/ε) · ln(M/δ), divided into M equal blocks
Build hypothesis h_{k_i} for each block
Run the hypothesis on the next block
If there are no mistakes, output h_{k_i}
[Diagram: the sample split into M blocks of size (1/ε) · ln(M/δ); each h_{k_i} is either consistent or inconsistent with the block that follows it]
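A schematic sketch of Apac in Python, reusing the conservative wrapper's hypothetical `predict`/`observe` interface; the block size, the sampling oracle `draw_sample`, and the test-then-train ordering within each block are assumptions for illustration:

```python
import math

def a_pac(conservative_learner, draw_sample, M, eps, delta):
    """Test the current hypothesis on a fresh block; if it makes no mistakes,
    output it, otherwise keep training and move on to the next block."""
    block_size = math.ceil((1.0 / eps) * math.log(M / delta))
    for _ in range(M):
        block = [draw_sample() for _ in range(block_size)]      # (x, label) pairs
        if all(conservative_learner.predict(x) == y for x, y in block):
            return conservative_learner.predict                 # consistent hypothesis
        for x, y in block:                                      # learn from the block
            conservative_learner.observe(x, y)
    return conservative_learner.predict                         # after M blocks
```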
Building Apac
If A’ makes at most M mistakes then Apac
guarantees to finish
𝑀 → APAC outputs a perfect classifier
What happens otherwise?
Theorem: Apac learns PAC
Proof: Pr ℎ𝑘𝑖 𝑠𝑢𝑐𝑐𝑒𝑒𝑑𝑠 𝑜𝑛 𝑏𝑙𝑜𝑐𝑘 𝑤ℎ𝑖𝑙𝑒 𝑏𝑒𝑖𝑛𝑔 𝜀 −
M -1
M -1
i 0
M
Pr( APAC outputs - bad h ) Pr(0 i M 1 s.t. h k i is - bad) Pr( h k i is - bad)
i 0
Disjunction of Conjunctions
We have proven that every algorithm in the mistake bound model can be converted to PAC
Let's now look at some algorithms in the mistake bound model
Disjunction Learning
Our goal: learn the class of disjunctions
e.g. x_1 ∨ x_2 ∨ x_6 ∨ x_8
Let L be the literal set {x_1, x̄_1, x_2, x̄_2, …, x_n, x̄_n}
h = ⋁{x : x ∈ L}
Given a sample y do (a code sketch follows the example below):
1. If our hypothesis makes a mistake (h_t(y) ≠ c_t(y)) then:
   L ← L \ S, where S = {all x_i for which y_i is positive, and all x̄_i for which y_i is negative}
2. Else do nothing
3. Return to step 1 (with the updated hypothesis)
Example
If we have only 2 variables:
L is {x_1, x̄_1, x_2, x̄_2}
h_t = x_1 ∨ x̄_1 ∨ x_2 ∨ x̄_2
Assume the first sample is y = (1, 0), so h_t(y) = 1
If c_t(y) = 0, we update L = {x̄_1, x_2} (removing S = {x_1, x̄_2})
h_{t+1} = x̄_1 ∨ x_2
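A minimal sketch of the elimination algorithm in Python (literals encoded as (index, sign) pairs; the encoding and names are ours):

```python
def learn_disjunction(samples, n):
    """Online elimination algorithm for disjunctions over n boolean variables.
    samples: iterable of (y, label) with y a 0/1 tuple of length n, label 0/1."""
    # start with all 2n literals: (i, True) = x_i, (i, False) = not x_i
    literals = {(i, sign) for i in range(n) for sign in (True, False)}
    mistakes = 0
    for y, label in samples:
        # hypothesis h = OR of the remaining literals
        prediction = 1 if any((y[i] == 1) == sign for i, sign in literals) else 0
        if prediction != label:
            mistakes += 1
            # mistakes only happen on negative examples: drop every literal
            # satisfied by y (x_i with y_i = 1, and not-x_i with y_i = 0)
            literals = {(i, sign) for i, sign in literals if (y[i] == 1) != sign}
    return literals, mistakes
```

On the two-variable example above, the first negative sample (1, 0) removes x_1 and x̄_2, leaving {x̄_1, x_2}.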
Mistake Bound Analysis
The number of mistakes is bounded by n + 1,
where n is the number of variables
Proof:
Let R be the set of literals in c_t
Let L_t be the hypothesis (literal set) after the first t samples
Mistake Bound Analysis
We prove by induction that R ⊆ L_t
For t = 0 it is obvious that R ⊆ L_0
Assume that after t − 1 samples R ⊆ L_{t−1}
If c_t(y_t) = 1, then c_t(y_t) = h_t(y_t) (since R ⊆ L_{t−1}) and we don't update
If c_t(y_t) = 0, then no literal of R is satisfied by y_t, so S and R don't intersect
Either way R ⊆ L_t
Thus we can only make mistakes when c_t(y) = 0
Mistake analysis proof
At the first mistake we eliminate n literals
At any further mistake we eliminate at least 1 literal
L_0 has 2n literals
So we can have at most n + 1 mistakes
k-DNF
Definition: k-DNF functions are functions that can be represented by a disjunction of conjunctions, each with at most k literals
E.g. 3-DNF: (x_1 ∧ x_2 ∧ x_6) ∨ (x_1 ∧ x_3 ∧ x_5)
The number of conjunctions of i terms is (n choose i) · 2^i:
We choose i variables ((n choose i) ways), and for each of them a sign (2^i)
k-DNF classification
We can learn this class by changing the previous algorithm to deal with terms (conjunctions) instead of variables
Reduce the space X = {0,1}^n to Y = {0,1}^{O(n^k)}, one coordinate per conjunction of at most k literals; a k-DNF over X becomes a disjunction over Y
2 usable algorithms:
ELIM for PAC
The previous algorithm (in the mistake bound model), which makes O(n^k) mistakes
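A small sketch of the reduction in Python (one coordinate of Y per conjunction of at most k literals; the encoding is our choice). Feeding the expanded examples to the disjunction learner above gives the O(n^k) mistake bound:

```python
from itertools import combinations, product

def expand_k_dnf(y, k):
    """Map an example y in {0,1}^n to its image in Y: one 0/1 coordinate per
    conjunction of at most k literals, set to 1 iff y satisfies the conjunction."""
    n = len(y)
    expanded = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):                  # choose i variables
            for signs in product((True, False), repeat=size):      # choose their signs
                satisfied = all((y[i] == 1) == s for i, s in zip(idxs, signs))
                expanded.append(1 if satisfied else 0)
    return expanded
```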
Winnow
Monotone disjunction: a disjunction containing only positive literals,
e.g. x_1 ∨ x_3 ∨ x_5
Purpose: to learn the class of monotone disjunctions in the mistake-bound model
We look at Winnow, which is similar to the perceptron
One main difference: it uses multiplicative steps rather than additive ones
Winnow
Same classification scheme as the perceptron:
h(x): x · w ≥ θ ⇒ positive classification
h(x): x · w < θ ⇒ negative classification
(in the analysis below, θ = n)
Initialize w_0 = (1, 1, …, 1)
Update scheme:
On a positive misclassification (h(x) = 1, c_t(x) = 0): ∀x_i = 1: w_i ← w_i/2
On a negative misclassification (h(x) = 0, c_t(x) = 1): ∀x_i = 1: w_i ← 2w_i
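A minimal sketch of Winnow in Python (boolean inputs, threshold θ = n as in the proof below; names are ours):

```python
def winnow(samples, labels):
    """Winnow for monotone disjunctions: multiplicative weight updates.
    samples: list of 0/1 tuples of length n; labels: 0/1."""
    n = len(samples[0])
    w = [1.0] * n          # all weights start at 1
    theta = float(n)       # threshold
    mistakes = 0
    for x, label in zip(samples, labels):
        prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
        if prediction != label:
            mistakes += 1
            if prediction == 1:                     # false positive: demote
                w = [wi / 2 if xi == 1 else wi for wi, xi in zip(w, x)]
            else:                                   # false negative: promote
                w = [wi * 2 if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, mistakes
```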
Mistake bound analysis
Similar to the perceptron: if the margin is bigger than γ, then one can prove the mistake bound is Θ(1/γ²)
Winnow Proof: Definitions
Let S = {x_{i_1}, x_{i_2}, …, x_{i_r}} be the set of relevant variables in the target concept
I.e. C_t = x_{i_1} ∨ x_{i_2} ∨ … ∨ x_{i_r}
We define W_r = {w_{i_1}, w_{i_2}, …, w_{i_r}}, the weights of the relevant variables
Let w(t) be the weight w at time t
Let TW(t) be the total weight at time t, over both relevant and irrelevant variables
Winnow Proof: Positive
Mistakes
Let's look at the positive mistakes
Any mistake on a positive example doubles (at least) one of the relevant weights:
∃w ∈ W_r s.t. w(t + 1) = 2w(t)
If ∃w_i s.t. w_i ≥ n, then every example with x_i = 1 gives x · w ≥ n and therefore always a positive classification, so w_i is never doubled again
So each w_i can be doubled at most 1 + log n times
Thus we can bound the number of positive mistakes: M_+ ≤ r(1 + log n)
Winnow Proof: Positive
Mistakes
For a positive mistake:
w_1(t)x_1 + … + w_n(t)x_n < n
TW(t + 1) = TW(t) + (w_1(t)x_1 + … + w_n(t)x_n)
(1) TW(t + 1) < TW(t) + n
Winnow Proof: Negative
Mistakes
On negative examples none of the relevant weights change
Thus ∀w ∈ W_r: w(t + 1) ≥ w(t)
For a negative mistake to occur: w_1(t)x_1 + … + w_n(t)x_n ≥ n
TW(t + 1) = TW(t) − (w_1(t)x_1 + … + w_n(t)x_n)/2
⇒ (2) TW(t + 1) ≤ TW(t) − n/2
Winnow Proof: Cont.
Combining equations (1), (2):
(3) 0 < TW(t) ≤ TW(0) + n·M_+ − (n/2)·M_−
At the beginning all weights are 1, so
(4) TW(0) = n
(3), (4) ⇒ M_− < 2 + 2M_+ ≤ 2 + 2r(log n + 1)
⇒ M_− + M_+ ≤ 2 + 3r(log n + 1)
What should we know? I
Linear separators:
Perceptron algorithm: M ≤ 1/γ²
Margin perceptron: M ≤ 12/γ²
The mistake bound model:
CON algorithm: M ≤ |C| − 1, but C may be very large!
HAL, the halving algorithm: M ≤ log₂|C|
What should you know? II
The relation between PAC and the mistake bound model
Basic algorithm for learning disjunctions of conjunctions
Learning k-DNF functions
Winnow algorithm: M ≤ 2 + 3r(log n + 1)
Questions?