Online learning and game theory
Adam Kalai
(joint with Sham Kakade)
How do we learn?
Goal: learn a function f: X → Y
Batch (offline) model:
  Y = {–,+}
  Get training data (x1,y1),…,(xn,yn) drawn independently from some distribution D over X × Y
  We output f: X → Y with low error P[f(x) ≠ y]
Online (repeated game) model (distribution-free learning), for i = 1,2,…,n:
  Observe ith example xi ∈ X
  We predict its label
  Observe true label yi ∈ {–,+}
  Goal: make as few mistakes as possible
Outline
1. Online/batch learnability of F
   Online learnability ⇒ batch learnability
   Finite learning: batch and online (via weighted maj.)
   Batch learnability ⇔ online learnability?
2. Online learning in repeated games
   Zero-sum: Weighted majority
   General-sum: No "internal regret" ⇒ corr. eq.
Online learning
X = R², Y = {–,+}
Online alg. A(x1,y1,…,xi-1,yi-1,xi) = zi
Adversary picks (x1,y1) ∈ X × Y; we see x1, we predict z1, we see y1
…
Adversary picks (xn,yn) ∈ X × Y; we see xn, we predict zn, we see yn
"empirical error": err(A,data) = |{i | zi ≠ yi}| / n
Batch Learning
X = R², Y = {–,+}
[Figure: sample of points labeled +/– drawn from a distribution D over X × Y]
data: (x1,y1),…,(xn,yn) → learning algorithm A → f: X → Y
Batch Learning
X = R², Y = {–,+}
data: (x1,y1),…,(xn,yn) → learning algorithm A → f: X → Y
"empirical error": err(f,data) = |{i | f(xi) ≠ yi}| / n
"generalization error": err(f,D) = Pr(x,y)~D[f(x) ≠ y]
[Figure: the training sample vs. the underlying distribution D over X × Y]
Online/batch learnability of F
Family F of functions f: X → Y (Y = {–,+})
Alg. A learns F online if ∃ k, c > 0:
  Online input: data (x1,y1),…,(xn,yn)
  Regret(A,data) = err(A,data) – min_{g∈F} err(g,data)
  ∀ data: E[Regret(A,data)] ≤ k/n^c
Alg. B batch learns F if ∃ k, c > 0:
  Input: (x1,y1),…,(xn,yn) drawn independently from a distribution D over X × Y
  Output: f ∈ F; Regret(f,D) = err(f,D) – min_{g∈F} err(g,D)
  ∀ D: E_data[Regret(B,D)] ≤ k/n^c
Online learnable ⇒ Batch learnable
Given online learning algorithm A, define batch learning algorithm B:
  Input: (x1,y1),(x2,y2),…,(xn,yn) drawn independently from D
  Let fi: X → Y be fi(x) = A(x1,y1,…,xi-1,yi-1,x)
  Pick i ∈ {1,2,…,n} at random and output fi
Analysis:
  E[Regret(A,data)] = E[err(A,data)] – E[min_{g∈F} err(g,data)]
  E[Regret(B,D)] = E[err(B,D)] – min_{g∈F} err(g,D)
               = E[err(A,data)] – min_{g∈F} err(g,D)   (a random prefix hypothesis on a fresh example is exactly A's next online prediction)
               ≤ E[err(A,data)] – E[min_{g∈F} err(g,data)]   (E[min] ≤ min E)
               = E[Regret(A,data)]
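A minimal sketch of this online-to-batch conversion in Python, under the assumption that the online algorithm is given as a function online_alg(history, x) returning a ±1 prediction (that interface, and the toy majority_so_far learner, are hypothetical and only for illustration):

```python
import random

def online_to_batch(online_alg, data):
    """Online-to-batch conversion: pick a random prefix length and return
    the hypothesis f_i(x) = A(x1,y1,...,x_{i-1},y_{i-1},x)."""
    i = random.randrange(len(data))     # uniform prefix length in {0,...,n-1}
    prefix = data[:i]                   # the first i labeled examples
    return lambda x: online_alg(prefix, x)

# Toy online algorithm: predict the majority label seen so far.
def majority_so_far(history, x):
    return +1 if sum(y for _, y in history) >= 0 else -1

data = [((0.2,), +1), ((0.7,), -1), ((0.9,), -1), ((0.1,), +1)]
f = online_to_batch(majority_so_far, data)
print(f((0.5,)))    # prediction of the randomly chosen prefix hypothesis
```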
Outline
1. Online/batch learnability of F
   Online learnability ⇒ batch learnability
   Finite learning: batch and online (via weighted maj.)
   Batch learnability ⇏ online learnability
   Batch learnability ⇒ (transductive) online learnability
2. Online learning in repeated games
   Zero-sum: Weighted majority ⇒ eq.
   General-sum: No "internal regret" ⇒ corr. eq.
Online majority algorithm
              x1  x2  x3  …  xn
  f1           +   –   +
  f2           –   –   +
  f3           +   +   +
  …
  fF           +   +   –
  (live) maj.  +   +   +
  truth y      +   –   –
Say there is some perfect f* ∈ F, err(f*,data) = 0, and |F| = F
Predict according to the majority of the consistent f's
Each mistake Maj makes eliminates ≥ ½ of the f's
Maj's #mistakes ≤ log2(F), so err(Maj,data) ≤ log2(F)/n
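A minimal sketch of this halving idea in Python, assuming each f in F is encoded (hypothetically, for illustration) as its vector of ±1 predictions on x1,…,xn:

```python
def halving(predictions, truth):
    """Majority vote over the experts still consistent with all labels seen.
    predictions: list of F lists, predictions[f][i] in {+1,-1}
    truth:       list of n true labels in {+1,-1}
    Returns the number of mistakes; it is at most log2(F) if some expert is perfect."""
    alive = list(range(len(predictions)))    # indices of still-consistent experts
    mistakes = 0
    for i, y in enumerate(truth):
        vote = sum(predictions[f][i] for f in alive)
        z = +1 if vote >= 0 else -1          # (live) majority prediction
        if z != y:
            mistakes += 1
        # eliminate every expert that erred on example i
        alive = [f for f in alive if predictions[f][i] == y]
    return mistakes

# Toy run: 4 experts on 3 examples, expert 0 is perfect.
preds = [[+1, -1, -1], [-1, -1, +1], [+1, +1, +1], [+1, +1, -1]]
truth = [+1, -1, -1]
print(halving(preds, truth))   # at most log2(4) = 2 mistakes
```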
Naive batch learning
              x1  x2  x3  …  xn
  f1           +   –   +       –
  f2           –   –   +       +
  f3           +   +   +       –
  …
  fF           +   +   –       –
  truth y      +   –   –       –
Say there is some perfect f* ∈ F, err(f*,data) = 0, and |F| = F
Select any consistent f
Say every g ≠ f* has err(g,D) = log(F)/n; then P[err(g,data) = 0] = (1 – log(F)/n)^n ≈ e^(–log F)
Wow! Online looks like batch.
Naive batch learning
Naive batch algorithm: choose the f ∈ F that minimizes err(f,data)
For any f ∈ F: P[|err(f,data) – err(f,D)| > ε] ≤ 2e^(–2nε²)
Taking ε = 10√(ln(F)/n):
  P[∃ f∈F: |err(f,data) – err(f,D)| > ε] ≤ 2F·e^(–200 ln F) ≤ 2^(–100)
E[Regret(n.b.,D)] ≤ c√(ln(F)/n)          (F = |F|)
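A minimal sketch of the naive batch rule (empirical risk minimization over a finite F) in Python; the grid of threshold functions and the noisy data generator are hypothetical stand-ins, used only for illustration:

```python
import random

def erm(hypotheses, data):
    """Naive batch learning: return the hypothesis with the smallest
    empirical error on the data."""
    def emp_err(f):
        return sum(1 for x, y in data if f(x) != y) / len(data)
    return min(hypotheses, key=emp_err)

# Hypothetical finite class: thresholds on a grid of cut points in [0,1].
F = [(lambda x, c=c: +1 if x > c else -1) for c in [i / 10 for i in range(11)]]

# Data from a true threshold at 0.35 with 10% label noise.
random.seed(0)
data = []
for _ in range(200):
    x = random.random()
    y = +1 if x > 0.35 else -1
    if random.random() < 0.1:
        y = -y
    data.append((x, y))

f_hat = erm(F, data)
print(sum(1 for x, y in data if f_hat(x) != y) / len(data))  # empirical error of ERM
```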
Weighted majority' [LW89]
Assign a weight to each f: ∀f∈F, w(f) = 1
On period i = 1,2,…,n:
  Predict the weighted majority of the f's
  For each f: if f(xi) ≠ yi, w(f) := w(f)/2
WM' errs ⇒ total weight decreases by ≥ 25%
Final total weight ≤ F·(3/4)^#mistakes(WM')
Final total weight ≥ 2^(–min_f #mistakes(f))
⇒ #mistakes(WM') ≤ 2.41·(min_f #mistakes(f) + log2(F)), i.e. err(WM',data) ≤ 2.41·(min_f err(f,data) + log2(F)/n)          (F = |F|)
Weighted majority [LW89]
Assign a weight to each f: ∀f∈F, w(f) = 1
On period i = 1,2,…,n:
  Predict the weighted majority of the f's
  For each f: if f(xi) ≠ yi, w(f) := w(f)·(1 – ε)
Thm: E[Regret(WM,data)] ≤ 2√(ln(F)/n)          (F = |F|)
Wow! Online looks like batch.
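A minimal sketch of this update in Python, with the same hypothetical encoding of F as prediction vectors; setting eps on the order of √(ln(F)/n) is the standard tuning behind bounds of the 2√(ln(F)/n) form:

```python
def weighted_majority(predictions, truth, eps=0.1):
    """Weighted majority [LW89]: keep one weight per expert, predict the
    weighted-majority label, and multiply the weight of every expert that
    errs by (1 - eps)."""
    F = len(predictions)
    w = [1.0] * F
    mistakes = 0
    for i, y in enumerate(truth):
        vote = sum(w[f] * predictions[f][i] for f in range(F))
        z = +1 if vote >= 0 else -1
        if z != y:
            mistakes += 1
        for f in range(F):
            if predictions[f][i] != y:
                w[f] *= (1 - eps)
    return mistakes

preds = [[+1, -1, -1, +1], [-1, -1, +1, +1], [+1, +1, +1, -1]]
truth = [+1, -1, -1, +1]
print(weighted_majority(preds, truth))
```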
Weighted majority extensions…
[Same table of expert predictions, WM predictions, and truth y as before, with a window W of rounds highlighted]
Tracking: on any window W of rounds,
  E[Regret(WM,W)] ≤ c√(log(F)/|W|)
Weighted majority extensions…
[Same table of expert predictions as before]
Multi-armed bandit:
  You don't see xi
  You pick an f
  You find out only whether you erred (not how the other f's would have done)
  E[Regret] ≤ c√(F·log(F)/n)
Outline
1. Online/batch learnability of F
   Online learnability ⇒ batch learnability
   Finite learning: batch and online (via weighted maj.)
   Batch learnability ⇏ online learnability
   Batch learnability ⇒ (transductive) online learnability
2. Online learning in repeated games
   Zero-sum: Weighted majority ⇒ eq.
   General-sum: No "internal regret" ⇒ corr. eq.
Batch ⇏ Online
Simple threshold functions F = {fc | c ∈ [0,1]}, where fc: [0,1] → {+,–}, fc(x) = sgn(x – c)
Batch learnable: Yes
Online learnable: No!
  Adversary does a "random binary search": x1 = .5, then x2, x3, … bisect the remaining interval
  Each label is equally likely to be +/–, yet some fc is always consistent with all labels so far
  ⇒ E[Regret] = ½ for any online algorithm
[Figure: the interval [0,1] with x1 = .5 and bisection points x2,…,x5; labels – near 0, + near 1]
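A minimal sketch of the adversary in Python: each label is a fair coin flip and each new point bisects the interval of thresholds still consistent with past labels, so any online predictor errs about half the time while some fc in hindsight makes no mistakes (the midpoint_learner shown is a hypothetical online algorithm, only for illustration):

```python
import random

def binary_search_adversary(predict, n=1000, seed=0):
    """Random-binary-search adversary against an online predictor.
    predict(history, x) must return +1 or -1; history is the list of
    (x_i, y_i) pairs revealed so far."""
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0                  # interval of thresholds consistent so far
    history, mistakes = [], 0
    for _ in range(n):
        x = (lo + hi) / 2              # next query bisects the interval
        z = predict(history, x)
        y = rng.choice([+1, -1])       # the label is a fair coin flip
        if z != y:
            mistakes += 1
        history.append((x, y))
        if y == +1:
            hi = x                     # label + means the threshold c is below x
        else:
            lo = x                     # label - means the threshold c is above x
    # some threshold in (lo, hi) is consistent with every label,
    # so the best f_c in hindsight makes 0 mistakes
    return mistakes / n

# A hypothetical online learner: predict + iff x lies above the midpoint
# of the thresholds still consistent with the labels seen so far.
def midpoint_learner(history, x):
    lo, hi = 0.0, 1.0
    for xi, yi in history:
        if yi == +1:
            hi = xi
        else:
            lo = xi
    return +1 if x > (lo + hi) / 2 else -1

print(binary_search_adversary(midpoint_learner))   # error rate ≈ 0.5
```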
Key idea: transductive online learning [KakadeK05]
We see x1, x2, …, xn ∈ X in advance
y1, y2, …, yn ∈ {+,–} are revealed online
[Figure: the same bisection points x1 = .5, x2, x3, x4 on [0,1], now all known up front]
Key idea: transductive online learning [KakadeK05]
X = R², Y = {–,+}
Adversary picks (x1,y1),…,(xn,yn) ∈ X × Y
Adversary reveals x1, x2, …, xn
We predict z1, we see y1
We predict z2, we see y2
…
We predict zn, we see yn
"empirical error": err(A,data) = |{i | zi ≠ yi}| / n
Trans. online alg. T(x1,y1,…,xi-1,yi-1,xi,xi+1,…,xn) = zi
Algorithm for trans. online learning [KK05]
We see x1, x2, …, xn ∈ X in advance; y1, y2, …, yn ∈ {+,–} are revealed online
Let L = number of distinct labelings (f(x1), f(x2), …, f(xn)) over all f ∈ F
The effective size of F is L
Run WM on the L distinct labelings
E[Regret(WM,data)] ≤ 2√(ln(L)/n)
              x1  x2  x3  …  xn
  f1           +   –   +       +
  f2           +   –   +       +   (same labeling as f1)
  f3           +   +   +       –
  …
  fF           +   +   –       –
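A minimal sketch in Python for the threshold class: because x1,…,xn are known in advance, the n+1 distinct labelings can be enumerated explicitly and weighted majority run over them (function names and the eps tuning are hypothetical choices for illustration):

```python
import math, random

def threshold_labelings(xs):
    """All distinct labelings of the known points x1..xn by f_c(x) = sgn(x - c):
    sorting the points and cutting below each one (or below all of them)
    gives the n+1 distinct labelings."""
    cuts = [-float("inf")] + sorted(xs)
    return [[+1 if x > c else -1 for x in xs] for c in cuts]

def transductive_wm(xs, ys, eps=None):
    """Transductive online learning: run weighted majority over the L
    distinct labelings induced by the threshold class on x1..xn."""
    experts = threshold_labelings(xs)
    L, n = len(experts), len(xs)
    if eps is None:
        eps = min(0.5, math.sqrt(math.log(L) / n))
    w = [1.0] * L
    mistakes = 0
    for i, y in enumerate(ys):
        vote = sum(w[e] * experts[e][i] for e in range(L))
        z = +1 if vote >= 0 else -1
        if z != y:
            mistakes += 1
        for e in range(L):
            if experts[e][i] != y:
                w[e] *= (1 - eps)
    return mistakes / n

# Labels come from a true threshold at 0.3; the per-round error rate stays small
# even though the labels arrive online, because the points were known in advance.
random.seed(1)
xs = [random.random() for _ in range(500)]
ys = [+1 if x > 0.3 else -1 for x in xs]
print(transductive_wm(xs, ys))
```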
How many labelings? Shattering & VC
Def: S ⊆ X is shattered if there are 2^|S| ways to label S by f ∈ F
VC(F) = max{|S| : S is shattered by F}
[Figure: example point sets labeled +/– in all possible ways]
VC dimension captures the complexity of F
How many labelings? Shattering & VC
Sauer's lemma: # labelings L = O(n^VC(F))
⇒ E[Regret(WM,data)] ≤ 2√(ln(L)/n) = O(√(VC(F)·log(n)/n))
Cannot batch learn faster than VC(F):
  Take a shattered set S with |S| = VC(F), n > 0, and a distribution over X × Y putting probability 1/n on each x ∈ S
  Batch training set of size n: each x ∈ S is not in the training set with probability (1 – 1/n)^n ≈ e^(–1)
  ⇒ E[Regret(B,D)] ≥ c·VC(F)/n
Putting it together
Transductive online: E[Regret(WM,data)] = O(√(VC(F)·log(n)/n))   (almost identical to the standard VC bound)
Batch: E[Regret(B,D)] ≥ c·VC(F)/n
Trans. online learnable ⇔ batch learnable ⇔ finite VC(F)
Learnability conclusions
Finite VC(F) characterizes batch and transductive online learnability
Open problem: what property of F characterizes (nontransductive) online learnability?
Efficiency!?
  The WM algorithm requires enumeration of F
  Thm [KK05]: if one can efficiently find a lowest-error f ∈ F, then one can design an efficient online learning algorithm
Online learning in repeated games
Repeated games
Example (rock-paper-scissors):
                     Pl. 2
                R       P       S
  Pl. 1   R    0,0    -1,1    1,-1
          P    1,-1    0,0    -1,1
          S   -1,1     1,-1    0,0
Rounds i = 1,2,…,n:
  Players simultaneously choose actions
  Players receive payoffs; goal: maximize total payoff
Learning: players need not know the opponent or the game
Feedback: a player only finds out the payoff of his action and the alternatives (not the opponent's action)
(Mixed) Nash Equilibrium
Each player chooses a distribution over actions
Players are optimizing relative to the opponent(s)
In rock-paper-scissors, both players mixing (1/3, 1/3, 1/3) over R, P, S is a Nash equilibrium:
                      Pl. 2
               R (1/3)   P (1/3)   S (1/3)
  Pl. 1  R (1/3)   0,0     -1,1      1,-1
         P (1/3)   1,-1     0,0     -1,1
         S (1/3)  -1,1      1,-1     0,0
Online learning in 0-sum games (Schapire recap)
Payoff is A(i,j) for pl. 1 and –A(i,j) for pl. 2
Going first is a disadvantage:
  max_i min_j A(i,j) ≤ min_j max_i A(i,j)
Mixed strategies p, q (distributions over rows and columns):
  max_p min_q A(p,q) ≤ min_q max_p A(p,q)
Min-max theorem: "=" holds
Online learning in 0-sum games (Schapire recap)
Each player uses weighted majority:
  Maintain a weight on each action, initially equal
  Choose an action with probability proportional to its weight
  Find out the payoffs of each action (assume payoffs are in [-1,1])
  For each action: weight ← weight · (1 + ε·payoff)
Regret = possible improvement (in hindsight) from always playing a single action
WM ⇒ regret is low
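A minimal sketch of this dynamic in Python for the rock-paper-scissors game above, with both players running the multiplicative update on full payoff feedback (the learning rate eps and round count are arbitrary choices for illustration):

```python
import random

# Payoff matrix A(i,j) for player 1 over (rock, paper, scissors); player 2 gets -A(i,j).
A = [[ 0, -1,  1],
     [ 1,  0, -1],
     [-1,  1,  0]]

def mw_selfplay(n=20000, eps=0.05, seed=0):
    rng = random.Random(seed)
    w1, w2 = [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]
    counts1, counts2 = [0, 0, 0], [0, 0, 0]
    total = 0.0
    for _ in range(n):
        a = rng.choices(range(3), weights=w1)[0]   # pl. 1 action, prob. proportional to weight
        b = rng.choices(range(3), weights=w2)[0]   # pl. 2 action, prob. proportional to weight
        counts1[a] += 1
        counts2[b] += 1
        total += A[a][b]
        # each player sees the payoff every alternative action would have received
        for i in range(3):
            w1[i] *= (1 + eps * A[i][b])           # pl. 1 payoffs
            w2[i] *= (1 + eps * -A[a][i])          # pl. 2 payoffs
    avg_payoff = total / n
    emp1 = [c / n for c in counts1]
    emp2 = [c / n for c in counts2]
    return avg_payoff, emp1, emp2

# The average payoff tends toward the value of the game (0) and the empirical
# strategies tend toward (1/3, 1/3, 1/3).
print(mw_selfplay())
```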
Online learning in 0-sum games (Schapire recap)
Actions played are (a1,b1),(a2,b2),…,(an,bn)
Regret of pl. 1 is max_i (1/n)∑_t A(i,bt) – (1/n)∑_t A(at,bt)
Let p̂, q̂ be the empirical distributions of the actions a1,…,an and b1,…,bn, respectively
Then (1/n)∑_t A(i,bt) = A(i,q̂), so the regret of pl. 1 equals max_p A(p,q̂) – (1/n)∑_t A(at,bt)
Online learning in 0-sum games (Schapire recap)
WM ⇒ "min-max" theorem:
  max_p min_q A(p,q) = min_q max_p A(p,q) = "value of the game"
Using WM, each player guarantees regret → 0, regardless of the opponent
  Can beat an idiot in tic-tac-toe
  Reasonable strategy to use
Justifies how such equilibria might arise
General-sum games
            A      B
   A      1,1    0,0
   B      0,0    2,2
No unique "value"
Many very different equilibria, e.g. (A,A) and (B,B)
Can't naively improve a "no regret" algorithm (by playing a single mixed strategy)
Low regret for both players ⇏ equilibrium
General sum games
Low regret ⇏ Nash equilibrium, e.g. the play sequence (1,1),(2,2),(1,1),(2,2),(1,1),(2,2),…
             1        2        3       4
   1        0,0     -1,-1     1,1    -1,1
   2       -1,-1     0,0     -1,1     1,1
   3        1,1      1,-1     1,1     1,1
   4        1,-1     1,1      1,1     1,1
Refined notion of regret
Can't naively improve a "no regret" algorithm (by playing a single mixed strategy)
Might be able to naively improve it by replacing: "When the alg. suggests 1, play 3"
(Same 4×4 game as above.)
Internal regret
Internal regret IR(i,j) is how much we could have improved by replacing all occurrences of action i with action j
No internal regret ⇒ correlated equilibrium
Calibration ⇒ correlated equilibrium [FosterVohra]
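A minimal sketch in Python of computing internal regret from a history of play, assuming (hypothetically, for illustration) that the player's payoff for action i in round t is available as a function payoff(i, t):

```python
def internal_regret(actions, payoff, num_actions):
    """IR(i,j): total improvement from replacing every round in which action i
    was played by action j, keeping all other rounds fixed."""
    n = len(actions)
    ir = [[0.0] * num_actions for _ in range(num_actions)]
    for t, a in enumerate(actions):
        for j in range(num_actions):
            ir[a][j] += payoff(j, t) - payoff(a, t)
    # report the largest per-round internal regret over all pairs (i, j)
    return max(ir[i][j] for i in range(num_actions) for j in range(num_actions)) / n

# Toy example: 2 actions; action 1 always pays 1, action 0 always pays 0.
actions = [0, 1, 0, 0, 1]
payoff = lambda i, t: float(i)
print(internal_regret(actions, payoff, 2))   # IR(0,1)/n = 3/5
```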
Correlated Equilibrium [Aumann]
A mediator (the fish) draws a cell from a public distribution and privately tells each player his own action, e.g. "Play row 1" / "Play col. 2"
Best strategy is to listen to the fish
             col 1        col 2        col 3
  row 1    1,0 (1/6)    0,1 (1/6)      0,0
  row 2      0,0        1,0 (1/6)    0,1 (1/6)
  row 3    0,1 (1/6)      0,0        1,0 (1/6)
(probabilities of the correlated distribution in parentheses)
Low internal regret → correlated eq.
Consider a sequence of play like (1,1),(2,1),(3,2),…
Think about this (the empirical distribution of the played pairs) as a distribution P
No internal regret ⇔ correlated eq.
(Same game and correlated distribution as above.)
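A minimal sketch in Python of the target notion: under a joint distribution P over action pairs, a correlated equilibrium is exactly a P under which no swap rule "when told i, play j instead" gains anything for either player, i.e. no internal regret; the 2×2 coordination game from the earlier slide is used as a stand-in example:

```python
def max_swap_gain(P, payoff, player):
    """Largest gain a player can get from a swap rule 'when the recommendation
    is i, play j instead' under joint distribution P.
    P[a][b] is the probability of the pair (a, b); payoff[a][b] is this player's
    payoff. P is a correlated equilibrium iff this is <= 0 for both players."""
    n = len(P)
    best = 0.0
    for i in range(n):
        for j in range(n):
            if player == 0:   # row player swaps his own action i -> j
                gain = sum(P[i][b] * (payoff[j][b] - payoff[i][b]) for b in range(n))
            else:             # column player swaps his own action i -> j
                gain = sum(P[a][i] * (payoff[a][j] - payoff[a][i]) for a in range(n))
            best = max(best, gain)
    return best

# 2x2 coordination game from the general-sum slide.
R = [[1, 0], [0, 2]]   # row player's payoffs
C = [[1, 0], [0, 2]]   # column player's payoffs

P = [[0.5, 0.0], [0.0, 0.5]]   # half the time (A,A), half the time (B,B)
print(max_swap_gain(P, R, 0), max_swap_gain(P, C, 1))   # both 0: a correlated eq.
```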
Online learning in games conclusions
Online learning in zero-sum games:
  Weighted majority (low regret)
  Achieves the value of the game
Online learning in general-sum games:
  Low internal regret
  Achieves correlated equilibrium
Open problems:
  Are there natural dynamics that ⇒ Nash equilibrium?
  Is correlated equilibrium going to take over?