Online and batch learnability and game theory

Online learning and game theory
Adam Kalai
(joint with Sham Kakade)
How do we learn?

- Goal: learn a function f: X → Y
- Batch (offline) model
  - Y = {–,+}
  - Get training data (x1,y1),…,(xn,yn) drawn independently from some distribution D over X × Y
  - We output f: X → Y with low error Pr[f(x) ≠ y]
- Online (repeated game) model ("distribution-free" learning), for i = 1,2,…,n:
  - Observe ith example xi ∈ X
  - We predict its label
  - Observe true label yi ∈ {–,+}
  - Goal: make as few mistakes as possible
Outline
1. Online/batch learnability of F
   - Online learnability ⇒ batch learnability
   - Finite learning: batch and online (via weighted majority)
   - Batch learnability ⇔ online learnability?
2. Online learning in repeated games
   - Zero-sum: weighted majority
   - General-sum: no "internal regret" ⇒ correlated equilibrium
Online learning
- "empirical error": err(A,data) = |{i : zi ≠ yi}| / n
- X = R², Y = {–,+}
- Online algorithm: A(x1,y1,…,xi-1,yi-1,xi) = zi
- Protocol:
  - Adversary picks (x1,y1) ∈ X × Y
  - We see x1, we predict z1, we see y1
  - …
  - Adversary picks (xn,yn) ∈ X × Y
  - We see xn, we predict zn, we see yn
Batch learning
[Figure: labeled (+/–) training points in X × Y are fed to a learning algorithm A, which outputs a classifier]
- X = R², Y = {–,+}
- data: (x1,y1),…,(xn,yn)
- Learning algorithm A outputs f: X → Y
Batch learning
[Figure: the learned classifier f on the training points and on a fresh sample from the distribution over X × Y]
- X = R², Y = {–,+}
- data: (x1,y1),…,(xn,yn)
- "empirical error": err(f,data) = |{i : f(xi) ≠ yi}| / n
- "generalization error": err(f,D) = Pr(x,y)~D[f(x) ≠ y]
Online/batch learnability of F
- Family F of functions f: X → Y (Y = {–,+})
- Algorithm A learns F online if ∃ k, c > 0:
  - Online input: data (x1,y1),…,(xn,yn)
  - Regret(A,data) = err(A,data) − min_{g∈F} err(g,data)
  - ∀ data: E[Regret(A,data)] ≤ k/n^c
- Algorithm B batch learns F if ∃ k, c > 0:
  - Input: (x1,y1),…,(xn,yn) drawn independently from D over X × Y
  - Output: f ∈ F, with Regret(f,D) = err(f,D) − min_{g∈F} err(g,D)
  - ∀ D: E_data[Regret(B,D)] ≤ k/n^c
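To make the regret definition concrete, here is a minimal sketch (not from the slides; the interface and names are hypothetical) that computes the empirical regret of an online predictor against the best fixed function in a finite class F:

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (x, y) with y in {-1, +1} standing in for {-, +}

def empirical_error(f: Callable[[float], int], data: Sequence[Example]) -> float:
    """err(f, data) = fraction of examples that f mislabels."""
    return sum(1 for x, y in data if f(x) != y) / len(data)

def online_regret(A: Callable[[List[Example], float], int],
                  data: Sequence[Example],
                  F: Sequence[Callable[[float], int]]) -> float:
    """Regret(A, data) = err(A, data) - min_{g in F} err(g, data)."""
    history: List[Example] = []
    mistakes = 0
    for x, y in data:
        z = A(history, x)          # A may only use past labeled examples and the current x
        mistakes += (z != y)
        history.append((x, y))
    err_A = mistakes / len(data)
    return err_A - min(empirical_error(g, data) for g in F)
```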
Online learnable ⇒ Batch learnable
- Given an online learning algorithm A
- Define a batch learning algorithm B:
  - Input: (x1,y1),(x2,y2),…,(xn,yn) drawn from D
  - Let fi: X → Y be fi(x) = A(x1,y1,…,xi-1,yi-1,x)
  - Pick i ∈ {1,2,…,n} at random and output fi
- Analysis:
  - E[Regret(A,data)] = E[err(A,data)] − E[min_{g∈F} err(g,data)]
  - E[Regret(B,D)] = E[err(B,D)] − min_{g∈F} err(g,D)
  - E[err(B,D)] = E[err(A,data)] (example i is a fresh draw from D), and min_{g∈F} err(g,D) ≥ E[min_{g∈F} err(g,data)]
  - So E[Regret(B,D)] ≤ E[Regret(A,data)]
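A minimal sketch of this online-to-batch conversion, assuming a generic online predictor interface (hypothetical names):

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]

def online_to_batch(A: Callable[[List[Example], float], int],
                    data: Sequence[Example]) -> Callable[[float], int]:
    """Online-to-batch conversion: pick i uniformly at random and output
    f_i(x) = A(x1,y1,...,x_{i-1},y_{i-1}, x), i.e. A frozen after i-1 examples."""
    i = random.randrange(1, len(data) + 1)      # i in {1, ..., n}
    prefix = list(data[: i - 1])                # the first i-1 labeled examples
    return lambda x: A(prefix, x)
```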
Outline
1. Online/batch learnability of F
   - Online learnability ⇒ batch learnability
   - Finite learning: batch and online (via weighted majority)
   - Batch learnability ⇏ online learnability
   - Batch learnability ⇒ online learnability?
2. Online learning in repeated games
   - Zero-sum: weighted majority ⇒ equilibrium
   - General-sum: no "internal regret" ⇒ correlated equilibrium
Online majority algorithm

             x1  x2  x3  …  xn
  f1         +   –   +   …
  f2         –   –   +   …
  f3         +   +   +   …
  …          …   …   …   …
  fF         +   +   –   …
  (live) maj +   +   +   …
  truth y    +   –   –   …

- Perfect f ∈ F: say there is some perfect f* ∈ F, i.e. err(f*,data) = 0
- Say |F| = F
- Predict according to the majority of the consistent f's
- Each mistake Maj makes eliminates ≥ ½ of the f's
- Maj's #mistakes ≤ log2(F)
- err(Maj,data) ≤ log2(F)/n
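A minimal sketch of this consistent-majority ("halving") predictor over a finite class, assuming some f* ∈ F is perfect (hypothetical names; labels in {-1,+1}):

```python
from typing import Callable, List, Sequence

def majority_online(F: Sequence[Callable[[float], int]],
                    xs: Sequence[float],
                    ys: Sequence[int]) -> int:
    """Predict with the majority vote of the still-consistent functions in F.
    If some f* in F is perfect, each mistake removes at least half of the
    consistent functions, so mistakes <= log2(|F|)."""
    consistent: List[Callable[[float], int]] = list(F)
    mistakes = 0
    for x, y in zip(xs, ys):
        votes = sum(f(x) for f in consistent)   # sum of +/-1 predictions
        z = 1 if votes >= 0 else -1
        mistakes += (z != y)
        consistent = [f for f in consistent if f(x) == y]  # drop inconsistent f's
    return mistakes
```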
Naive batch learning

             x1  x2  x3  …  xn
  f1         +   –   +   …  –
  f2         –   –   +   …  +
  f3         +   +   +   …  –
  …          …   …   …   …  …
  fF         +   +   –   …  –
  truth y    +   –   –   …  –

- Perfect f ∈ F: say there is some perfect f* ∈ F, i.e. err(f*,data) = 0
- Say |F| = F
- Select any consistent f
- Say some g ≠ f* has err(g,D) ≥ log(F)/n; then P[err(g,data) = 0] ≤ (1 − log(F)/n)^n ≈ 1/F
- Wow! Online looks like batch.
Naive batch learning
- Naive batch algorithm: choose the f ∈ F that minimizes err(f,data)
- For any f ∈ F: P[|err(f,data) − err(f,D)| > ε] ≤ 2·exp(−2nε²)
- P[∃ f ∈ F: |err(f,data) − err(f,D)| > ε] ≤ 2F·exp(−2nε²)
  - e.g. with ε = 10·sqrt(ln(F)/n), this is ≤ 2F·exp(−200·ln F) ≤ 2^(−100)
- E[Regret(n.b.,D)] ≤ c·sqrt(ln(F)/n)
  (F = |F|)
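A minimal sketch of this naive batch (empirical-risk-minimization) rule, assuming F is small enough to enumerate (hypothetical names):

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]  # (x, y) with y in {-1, +1}

def naive_batch(F: Sequence[Callable[[float], int]],
                data: Sequence[Example]) -> Callable[[float], int]:
    """Naive batch learning: return the f in F with the smallest empirical error."""
    def err(f: Callable[[float], int]) -> float:
        return sum(1 for x, y in data if f(x) != y) / len(data)
    return min(F, key=err)
```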
Weighted majority' [LW89]
- Assign a weight to each f: ∀ f ∈ F, w(f) = 1
- On period i = 1,2,…,n:
  - Predict the weighted majority of the f's
  - For each f: if f(xi) ≠ yi, set w(f) := w(f)/2
- WM' errs ⇒ total weight decreases by ≥ 25%
- Final total weight ≤ F·(3/4)^#mistakes(WM')
- Final total weight ≥ 2^(−min_f #mistakes(f))
- #mistakes(WM') ≤ 2.41·(min_f #mistakes(f) + log2(F)), i.e. err(WM',data) ≤ 2.41·(min_f err(f,data) + log2(F)/n)
  (F = |F|)
Weighted majority [LW89]
- Assign a weight to each f: ∀ f ∈ F, w(f) = 1
- On period i = 1,2,…,n:
  - Predict the weighted majority of the f's (randomizing in proportion to the weights)
  - For each f: if f(xi) ≠ yi, set w(f) := w(f)·(1 − ε)
- Thm: E[Regret(WM,data)] ≤ 2·sqrt(ln(F)/n)
- Wow! Online looks like batch.
  (F = |F|)
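A minimal sketch of the randomized weighted-majority update described above (hypothetical names; the learning-rate choice is one standard tuning, not taken from the slides):

```python
import math
import random
from typing import Callable, Sequence

def randomized_weighted_majority(F: Sequence[Callable[[float], int]],
                                 xs: Sequence[float],
                                 ys: Sequence[int]) -> int:
    """Randomized weighted majority: follow a random f in proportion to the
    weights; every f that errs is penalized by a (1 - eps) factor."""
    n, K = len(xs), len(F)
    eps = min(0.5, math.sqrt(math.log(K) / max(n, 1)))  # tuned learning rate
    w = [1.0] * K
    mistakes = 0
    for x, y in zip(xs, ys):
        f = random.choices(list(F), weights=w, k=1)[0]   # sample an expert by weight
        mistakes += (f(x) != y)
        w = [wi * (1 - eps) if fi(x) != y else wi for wi, fi in zip(w, F)]
    return mistakes
```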
Weighted majority extensions…
[Table: experts f1,…,fF predicting on x1,…,xn, with the WM prediction and the true labels y; a window W of rounds is highlighted]
Tracking
- On any window W, E[Regret(WM,W)] ≤ c·…
Weighted majority extensions…
[Table: experts f1,…,fF predicting on x1,…,xn, with the WM prediction and the true labels y]
Multi-armed bandit
- You don't see xi
- You pick an f
- You find out only whether you erred
- E[Regret] ≤ c·…
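The slide does not name a specific bandit algorithm; a minimal EXP3-style sketch of the setting (only the chosen expert's loss is observed each round; names and the exploration rate are hypothetical) might look like this:

```python
import math
import random

def exp3_bandit(loss_of_expert, K: int, n: int, gamma: float = 0.1) -> float:
    """EXP3-style sketch: only the loss of the chosen expert is revealed each round.
    loss_of_expert(i, t) -> loss in [0, 1] of expert i at round t (a callback)."""
    w = [1.0] * K
    total_loss = 0.0
    for t in range(n):
        W = sum(w)
        # mix the weight-proportional distribution with uniform exploration
        p = [(1 - gamma) * wi / W + gamma / K for wi in w]
        i = random.choices(range(K), weights=p, k=1)[0]
        loss = loss_of_expert(i, t)          # only this expert's loss is observed
        total_loss += loss
        est = loss / p[i]                    # importance-weighted loss estimate
        w[i] *= math.exp(-gamma * est / K)   # exponential update for the chosen arm
    return total_loss
```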
Outline
1. Online/batch learnability of F
   - Online learnability ⇒ batch learnability
   - Finite learning: batch and online (via weighted majority)
   - Batch learnability ⇏ online learnability
   - Batch learnability ⇒ (transductive) online learnability
2. Online learning in repeated games
   - Zero-sum: weighted majority ⇒ equilibrium
   - General-sum: no "internal regret" ⇒ correlated equilibrium
Batch ⇏ Online
- Define fc: [0,1] → {+,–}, fc(x) = sgn(x – c)
- Simple threshold functions: F = {fc | c ∈ [0,1]}
- Batch learnable: yes
- Online learnable: no!
  - Adversary does a "random binary search"
  - Each label is equally likely to be +/–
  - E[Regret] = ½ for any online algorithm
[Figure: points x1 = 0.5, x2, x3, x4, x5 in [0,1] placed by the binary search, with labels – – + +]
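A minimal simulation of this "random binary search" adversary (hypothetical names; labels in {-1,+1}): it presents the midpoint of the current interval, flips a fair coin for the label, and keeps the interval consistent with some threshold, so every online learner errs with probability ½ while some fc remains perfect.

```python
import random

def random_binary_search_adversary(predict, n: int) -> float:
    """Simulate the threshold adversary against an online predictor.
    predict(history, x) -> +1 or -1; returns the learner's mistake rate.
    Some threshold c in the final interval is consistent with all labels shown,
    so min_g err(g, data) = 0 and the mistake rate equals the regret."""
    lo, hi = 0.0, 1.0
    history, mistakes = [], 0
    for _ in range(n):
        x = (lo + hi) / 2                  # midpoint of the consistent interval
        z = predict(history, x)
        y = random.choice([-1, +1])        # label is a fair coin flip
        mistakes += (z != y)
        if y == +1:
            hi = x                          # '+' means x > c, so c lies below x
        else:
            lo = x                          # '-' means x <= c, so c lies above x
        history.append((x, y))
    return mistakes / n
```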
Key idea: transductive online learning [KakadeK05]
- We see x1,x2,…,xn ∈ X in advance
- y1,y2,…,yn ∈ {+,–} are revealed online
[Figure: points x1 = 0.5, x2, x3, x4 in [0,1] with labels – – + +]
Key idea: transductive online learning [KakadeK05]
- "empirical error": err(A,data) = |{i : zi ≠ yi}| / n
- X = R², Y = {–,+}
- Protocol:
  - Adversary picks (x1,y1),…,(xn,yn) ∈ X × Y
  - Adversary reveals x1,x2,…,xn
  - We predict z1, we see y1
  - We predict z2, we see y2
  - …
  - We predict zn, we see yn
- Transductive online algorithm: T(x1,y1,…,xi-1,yi-1, xi, xi+1,…,xn) = zi
Algorithm for transductive online learning [KK05]
- We see x1,x2,…,xn ∈ X in advance
- y1,y2,…,yn ∈ {+,–} are revealed online
- Let L = number of distinct labelings (f(x1),f(x2),…,f(xn)) over all f ∈ F
- The effective size of F is L
- Run WM on the L labelings
- E[Regret(WM,data)] ≤ 2·sqrt(ln(L)/n)
[Table: the L distinct labelings of x1,…,xn, treated as experts f1,…,fL]
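For the threshold class from the earlier slide, the distinct labelings of the known points x1,…,xn are easy to enumerate (at most n+1 of them), so one can run weighted majority over those labelings as the "experts". A minimal sketch with hypothetical names:

```python
import math
import random
from typing import List, Sequence, Tuple

def threshold_labelings(xs: Sequence[float]) -> List[Tuple[int, ...]]:
    """All distinct labelings of xs by thresholds f_c(x) = sgn(x - c): at most len(xs)+1."""
    order = sorted(set(xs))
    cuts = [min(order) - 1.0] + list(order)          # one representative c per labeling
    labelings = {tuple(1 if x > c else -1 for x in xs) for c in cuts}
    return sorted(labelings)

def transductive_wm(xs: Sequence[float], ys: Sequence[int]) -> int:
    """Run randomized weighted majority over the effective labelings (the 'experts')."""
    experts = threshold_labelings(xs)
    n, L = len(xs), len(experts)
    eps = min(0.5, math.sqrt(math.log(max(L, 2)) / n))
    w = [1.0] * L
    mistakes = 0
    for i, y in enumerate(ys):
        e = random.choices(range(L), weights=w, k=1)[0]
        mistakes += (experts[e][i] != y)
        w = [wj * (1 - eps) if experts[j][i] != y else wj for j, wj in enumerate(w)]
    return mistakes
```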
How many labelings? Shattering & VC
- Def: S ⊆ X is shattered if there are 2^|S| ways to label S by f ∈ F
- VC(F) = max{ |S| : S is shattered by F }
- Example: [figure of a small point set labeled in every possible way]
- VC dimension captures the complexity of F
How many labelings? Shattering & VC
- Sauer's lemma: #labelings L = O(n^VC(F))
  ⇒ E[Regret(WM,data)] ≤ c·sqrt(VC(F)·ln(n)/n)
- Cannot batch learn faster than VC(F):
  - Take a shattered set S with |S| = VC(F), n > 0, and a distribution over X × Y putting probability 1/n on each x ∈ S
  - Draw a batch training set of size n
  - Each x ∈ S is not in the training set with probability (1 − 1/n)^n ≈ e^(−1)
  - ⇒ E[Regret(B,D)] ≥ c·VC(F)/n
Putting it together
- Transductive online: E[Regret(WM,data)] = O(sqrt(VC(F)·ln(n)/n))   (almost identical to the standard VC bound)
- Batch: E[Regret(B,D)] ≥ c·VC(F)/n
- Transductive online learnable ⇔ batch learnable ⇔ finite VC(F)
Learnability conclusions
- Finite VC(F) characterizes batch and transductive online learnability
- Open problem: what property of F characterizes (non-transductive) online learnability?
- Efficiency!?
  - The WM algorithm requires enumerating F
  - Thm [KK05]: if one can efficiently find the lowest-error f ∈ F, then one can design an efficient online learning algorithm
Online learning in repeated games
Repeated games
- Example (Rock-Paper-Scissors):

                      Pl. 2
                 R       P       S
  Pl. 1   R     0,0    -1,1     1,-1
          P     1,-1    0,0    -1,1
          S    -1,1     1,-1    0,0

- Rounds i = 1,2,…,n:
  - Players simultaneously choose actions
  - Players receive payoffs; goal: maximize total payoff
  - Learning: players need not know the opponent or the game
  - Feedback: a player only finds out the payoff of his action and the alternatives (not the opponent's action)
(Mixed) Nash Equilibrium
- Each player chooses a distribution over actions
- Players are optimizing relative to their opponent(s)
- Example: in Rock-Paper-Scissors, each player mixes 1/3, 1/3, 1/3 over R, P, S:

                        Pl. 2
                 1/3 R   1/3 P   1/3 S
  Pl. 1  1/3 R    0,0    -1,1     1,-1
         1/3 P    1,-1    0,0    -1,1
         1/3 S   -1,1     1,-1    0,0
Online learning in 0-sum games (Schapire recap)
- Payoff is A(i,j) for player 1 and −A(i,j) for player 2
- Going first is a disadvantage:
  max_i min_j A(i,j) ≤ min_j max_i A(i,j)
- Mixed strategies p, q:
  max_p min_q A(p,q) ≤ min_q max_p A(p,q)
- Min-max theorem: "="
Online learning in 0-sum games (Schapire recap)
- Each player uses weighted majority:
  - Maintain a weight on each action, initially equal
  - Choose an action with probability proportional to its weight
  - (assume payoffs are in [-1,1])
  - Find out the payoffs of each action
  - For each action: weight ← weight·(1 + ε·payoff)
- Regret = possible improvement (in hindsight) from always playing a single action
- WM ⇒ regret is low
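A minimal sketch of both players running this multiplicative-weights update against each other in Rock-Paper-Scissors (hypothetical names); their empirical average strategies approach the minimax equilibrium (1/3, 1/3, 1/3):

```python
# Rock-Paper-Scissors payoff matrix for player 1 (player 2 gets the negative).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def multiplicative_weights_selfplay(n: int = 10000, eps: float = 0.05):
    """Both players run weight <- weight*(1 + eps*payoff); returns the
    empirical (time-averaged) mixed strategies of both players."""
    w1, w2 = [1.0] * 3, [1.0] * 3
    avg1, avg2 = [0.0] * 3, [0.0] * 3
    for _ in range(n):
        p = [w / sum(w1) for w in w1]          # player 1's mixed strategy this round
        q = [w / sum(w2) for w in w2]          # player 2's mixed strategy this round
        avg1 = [a + pi / n for a, pi in zip(avg1, p)]
        avg2 = [a + qi / n for a, qi in zip(avg2, q)]
        # expected payoff of each pure action against the opponent's current mix
        u1 = [sum(A[i][j] * q[j] for j in range(3)) for i in range(3)]
        u2 = [sum(-A[i][j] * p[i] for i in range(3)) for j in range(3)]
        w1 = [w * (1 + eps * u) for w, u in zip(w1, u1)]
        w2 = [w * (1 + eps * u) for w, u in zip(w2, u2)]
    return avg1, avg2   # both approach (1/3, 1/3, 1/3)
```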
Online learning in 0-sum games (Schapire recap)
- Actions played are (a1,b1),(a2,b2),…,(an,bn)
- Regret of player 1 is max_i (1/n)·Σ_t A(i,bt) − (1/n)·Σ_t A(at,bt)
- Let p̂ and q̂ be the empirical distributions of the actions a1,…,an and b1,…,bn, respectively
- Then max_i (1/n)·Σ_t A(i,bt) = max_p A(p,q̂)
Online learning in 0-sum games (Schapire recap)
- WM ⇒ "min-max" theorem:
  max_p min_q A(p,q) = min_q max_p A(p,q) = "value of the game"
- Using WM, each player guarantees regret → 0, regardless of the opponent:
  - Can beat an idiot in tic-tac-toe
  - A reasonable strategy to use
  - Justifies how such equilibria might arise
General-sum games
- Example:

            A       B
    A      1,1     0,0
    B      0,0     2,2

- No unique "value"
- Many very different equilibria, e.g. (A,A) and (B,B)
- Can't naively improve a "no regret" algorithm (by playing a single mixed strategy)
- Low regret for both players ⇏ equilibrium
General-sum games
- Low regret ⇏ Nash equilibrium, e.g. the play sequence (1,1),(2,2),(1,1),(2,2),(1,1),(2,2),…

            1        2        3        4
    1      0,0     -1,-1     1,1     -1,1
    2     -1,-1     0,0     -1,1      1,1
    3      1,1      1,-1     1,1      1,1
    4      1,-1     1,1      1,1      1,1
Refined notion of regret
- Can't naively improve a "no regret" algorithm (by playing a single mixed strategy)
- Might be able to naively improve it by a replacement rule, e.g.: "when the algorithm suggests 1, play 3"
[Same 4×4 payoff matrix as on the previous slide]
Internal regret
- Internal regret IR(i,j) is how much we could have improved by replacing all occurrences of action i with action j
- No internal regret ⇒ correlated equilibrium
- Calibration ⇒ correlated equilibrium [FosterVohra]
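A minimal sketch of computing internal regret from a recorded play sequence (hypothetical names; payoff(a, b) is player 1's payoff for action a against opponent action b):

```python
from typing import Sequence, Tuple

def internal_regret(payoff, plays: Sequence[Tuple[int, int]], K: int) -> float:
    """IR(i, j): average gain from replacing every round where we played i by
    action j, holding the opponent's actions fixed; returns the largest IR(i, j)."""
    n = len(plays)
    best = 0.0
    for i in range(K):
        rounds_i = [(a, b) for (a, b) in plays if a == i]
        for j in range(K):
            if i == j:
                continue
            gain = sum(payoff(j, b) - payoff(i, b) for (_, b) in rounds_i) / n
            best = max(best, gain)
    return best
```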
Correlated Equilibrium [Aumann]
- A mediator (the fish) draws a cell from a public distribution and privately tells each player only his own action, e.g. "play row 1" / "play column 2"
- Best strategy is to listen to the fish

                col 1         col 2         col 3
    row 1     1,0 (1/6)     0,1 (1/6)      0,0
    row 2     0,0           1,0 (1/6)      0,1 (1/6)
    row 3     0,1 (1/6)     0,0            1,0 (1/6)
Low internal regret → correlated eq.
- Think of a play sequence like (1,1),(2,1),(3,2),… as a distribution P over joint actions
- No internal regret ⇔ P is a correlated equilibrium
[Same game and 1/6 distribution as on the previous slide]
Online learning in games: conclusions
- Online learning in zero-sum games:
  - Weighted majority (low regret)
  - Achieves the value of the game
- Online learning in general-sum games:
  - Low internal regret
  - Achieves correlated equilibrium
Open problems
- Are there natural dynamics ⇒ Nash equilibrium?
- Is correlated equilibrium going to take over?