Kilian's slides

Learning in Games
Fictitious Play
Notation!
For n players we have:

- n finite strategy spaces $S^1, S^2, \dots, S^n$
- n opponent strategy spaces $S^{-1}, S^{-2}, \dots, S^{-n}$
- n payoff functions $u^1, u^2, \dots, u^n$
- for each $i$ and each $s^{-i} \in S^{-i}$, a set of best responses $BR^i(s^{-i})$
What is Fictitious Play?
Each player forms an assessment of the opponents' strategies in the form of a weight function

$$\kappa_0^i : S^{-i} \to \mathbb{R}_+$$

which is updated after every round:

$$\kappa_t^i(s^{-i}) = \kappa_{t-1}^i(s^{-i}) + \begin{cases} 1 & \text{if } s_{t-1}^{-i} = s^{-i} \\ 0 & \text{if } s_{t-1}^{-i} \neq s^{-i} \end{cases}$$
Prediction
The probability that player $i$ assigns to player $-i$ playing $s^{-i}$ at time $t$:

$$\gamma_t^i(s^{-i}) = \frac{\kappa_t^i(s^{-i})}{\sum_{\tilde{s}^{-i} \in S^{-i}} \kappa_t^i(\tilde{s}^{-i})}$$
Fictitious Play is …

… any rule $\rho_t^i(\gamma_t^i)$ that assigns

$$\rho_t^i(\gamma_t^i) \in BR^i(\gamma_t^i)$$

Such a rule is NOT UNIQUE!
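A minimal sketch of one such step in Python (the function and variable names are my own, not from the slides): keep a weight vector over the opponent's actions, normalize it into the prediction $\gamma_t^i$, and best-respond to it.

```python
import numpy as np

def fictitious_play_step(weights, payoff, last_opponent_action):
    """One fictitious-play step from the row player's point of view.

    weights: weight vector kappa over the opponent's actions
    payoff:  payoff matrix, payoff[a, b] = own payoff for action a vs. b
    last_opponent_action: the opponent's previous action (None in round 0)
    """
    if last_opponent_action is not None:
        weights[last_opponent_action] += 1    # kappa_t = kappa_{t-1} + indicator
    prediction = weights / weights.sum()      # gamma_t: normalized weights
    expected = payoff @ prediction            # expected payoff of each own action
    return np.argmax(expected), weights       # one best response (ties -> lowest index)
```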
Further Definitions
In 2-player games, the marginal empirical distribution of $j$'s play ($j = -i$):

$$d_t^j(s^j) = \frac{\kappa_t^j(s^j) - \kappa_0^j(s^j)}{t}$$
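For example (a tiny sketch with my own variable names, using the weights the row player reaches after $t = 7$ steps in the matching-pennies example below):

```python
import numpy as np

kappa_t = np.array([6.5, 4.0])   # row player's weights over (H, T) after t steps
kappa_0 = np.array([1.5, 2.0])   # initial weights
t = 7
d_t = (kappa_t - kappa_0) / t    # marginal empirical distribution of column's play
print(d_t)                       # [0.714..., 0.285...]
```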
Asymptotic Behavior
Propositions:

- Strict Nash equilibria are absorbing for the process of fictitious play.
- Any pure-strategy steady state of fictitious play must be a Nash equilibrium.
Example “matching pennies”
        H       T
   H   1,-1   -1,1
   T   -1,1    1,-1
Weights:

Starting weights: the row player's weights over the column player's actions are H = 1.5, T = 2; the column player's weights over the row player's actions are H = 2, T = 1.5. Fictitious play then generates the following weight sequence (each round, a player best-responds to its normalized weights and then adds 1 to the weight of the action the opponent just played):

   t    Row player (H, T)    Col player (H, T)
   0       1.5,  2              2,  1.5
   1       1.5,  3              2,  2.5
   2       2.5,  3              2,  3.5
   3       3.5,  3              2,  4.5
   4       4.5,  3              3,  4.5
   5       5.5,  3              4,  4.5
   6       6.5,  3              5,  4.5
   7       6.5,  4              6,  4.5
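A short Python simulation that reproduces the table above (a sketch; tie-breaking by argmax is my choice, the slides do not specify one, but no ties occur in these eight steps):

```python
import numpy as np

# Matching pennies: the row player wants to match, the column player to mismatch.
R = np.array([[1.0, -1.0], [-1.0, 1.0]])   # row player's payoffs
C = -R                                     # column player's payoffs (zero-sum)

row_w = np.array([1.5, 2.0])  # row's weights over column's actions (H, T)
col_w = np.array([2.0, 1.5])  # column's weights over row's actions (H, T)

for t in range(8):
    print(t, row_w, col_w)                     # one line per row of the table
    a = np.argmax(R @ (row_w / row_w.sum()))   # row's best response to its prediction
    b = np.argmax((col_w / col_w.sum()) @ C)   # column's best response to its prediction
    row_w[b] += 1                              # row observed the column player's move
    col_w[a] += 1                              # column observed the row player's move
```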
Convergence?
Strategies cycle and do not converge …
… but the marginal empirical distributions do:

$$d_t^j(s^j) = \frac{\kappa_t^j(s^j) - \kappa_0^j(s^j)}{t} \;\longrightarrow\; \frac{1}{2} \quad (t \to \infty)$$
MATLAB Simulation - Pennies
[Figure: game play, weights over time, and payoff]
Proposition
Under fictitious play, if the empirical
distributions over each player’s
choices converge, the strategy profile
corresponding to the product of these
distributions is a Nash equilibrium.
Rock-Paper-Scissors
        A      B      C
   A   ½,½    1,0    0,1
   B   0,1    ½,½    1,0
   C   1,0    0,1    ½,½

[Figure: game play, weights over time, and payoff]
Shapley Game
   0,0   1,0   0,1
   0,1   0,0   1,0
   1,0   0,1   0,0

[Figure: game play, weights over time, and payoff]
Persistent miscoordination
        A      B
   A   0,0    1,1
   B   1,1    0,0

Initial weights: 1 and 1 for both players.
Nash equilibria: (1,0), (0,1), (0.5,0.5).

[Figure: game play, weights over time, and payoff]
Persistent miscoordination (continued)

The same game with initial weights 2 and 2 for both players; the miscoordination persists.

[Figure: game play, weights over time, and payoff]
Summary on fictitious play
- In case of convergence, the time average of strategies forms a Nash equilibrium.
- The average payoff need not be that of a Nash equilibrium (e.g. miscoordination).
- The time average may not converge at all (e.g. the Shapley game).
References
Fudenberg D., Levine D. K. (1998). The Theory of Learning in Games. MIT Press.
Nash Convergence of Gradient Dynamics in General-Sum Games
Notation
2 players:

- Strategies $\begin{pmatrix} \alpha \\ 1 - \alpha \end{pmatrix}$ and $\begin{pmatrix} \beta \\ 1 - \beta \end{pmatrix}$
- Payoff matrices

$$R = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix}, \qquad C = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}$$
Objective Functions
Payoff functions:

$$V_r(\alpha, \beta) = r_{11}\,\alpha\beta + r_{22}\,(1-\alpha)(1-\beta) + r_{12}\,\alpha(1-\beta) + r_{21}\,(1-\alpha)\beta$$

$$V_c(\alpha, \beta) = c_{11}\,\alpha\beta + c_{22}\,(1-\alpha)(1-\beta) + c_{12}\,\alpha(1-\beta) + c_{21}\,(1-\alpha)\beta$$
Hillclimbing Idea
Gradient ascent for iterated games. With

$$u = (r_{11} + r_{22}) - (r_{21} + r_{12}), \qquad u' = (c_{11} + c_{22}) - (c_{21} + c_{12})$$

the partial derivatives are

$$\frac{\partial V_r(\alpha, \beta)}{\partial \alpha} = \beta u - (r_{22} - r_{12}), \qquad \frac{\partial V_c(\alpha, \beta)}{\partial \beta} = \alpha u' - (c_{22} - c_{21})$$
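A quick numerical check of the first formula (a sketch; the payoff matrix is an arbitrary example):

```python
import numpy as np

R = np.array([[3.0, 0.0], [5.0, 1.0]])   # arbitrary example payoff matrix

def V_r(a, b):
    # V_r(alpha, beta) as defined above
    return (R[0, 0] * a * b + R[1, 1] * (1 - a) * (1 - b)
            + R[0, 1] * a * (1 - b) + R[1, 0] * (1 - a) * b)

u = (R[0, 0] + R[1, 1]) - (R[1, 0] + R[0, 1])
a, b, eps = 0.3, 0.7, 1e-6

closed_form = b * u - (R[1, 1] - R[0, 1])                    # beta*u - (r22 - r12)
numerical = (V_r(a + eps, b) - V_r(a - eps, b)) / (2 * eps)  # central difference
print(closed_form, numerical)                                # both -1.7
```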
Update Rule
$$\alpha_{k+1} = \alpha_k + \eta \, \frac{\partial V_r(\alpha_k, \beta_k)}{\partial \alpha}, \qquad \beta_{k+1} = \beta_k + \eta \, \frac{\partial V_c(\alpha_k, \beta_k)}{\partial \beta}$$

with step size $\eta > 0$; $\alpha_0, \beta_0$ can be arbitrary strategies.
Problem
The gradient can lead the players to an infeasible point outside the unit square.

[Figure: a gradient step leaving the unit square $[0,1]^2$]
Solution:
Redefine the gradient as the projection of the true gradient onto the boundary. This defines the constrained dynamics.

[Figure: the projected gradient along the boundary of the unit square]
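A minimal sketch of the update rule with the projection implemented by clipping the iterates to the unit square (a simplification of the boundary projection described above; names are my own):

```python
import numpy as np

def gradient_ascent(R, C, alpha, beta, eta=0.01, steps=1000):
    """Simultaneous gradient ascent on V_r (over alpha) and V_c (over beta)."""
    u  = (R[0, 0] + R[1, 1]) - (R[1, 0] + R[0, 1])
    up = (C[0, 0] + C[1, 1]) - (C[1, 0] + C[0, 1])
    for _ in range(steps):
        da = beta * u - (R[1, 1] - R[0, 1])     # dV_r / dalpha
        db = alpha * up - (C[1, 1] - C[1, 0])   # dV_c / dbeta
        alpha = float(np.clip(alpha + eta * da, 0.0, 1.0))  # stay feasible
        beta  = float(np.clip(beta + eta * db, 0.0, 1.0))
    return alpha, beta

# matching pennies: the strategy pair cycles around the mixed
# equilibrium (0.5, 0.5) instead of converging to it
R = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(gradient_ascent(R, -R, alpha=0.2, beta=0.9))
```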
Infinitesimal Gradient Ascent (IGA)
Taking the limit $\eta \to 0$, $\alpha(t)$ and $\beta(t)$ become functions of time, and the update rule becomes the differential equation

$$\begin{pmatrix} \frac{\partial \alpha}{\partial t} \\[2pt] \frac{\partial \beta}{\partial t} \end{pmatrix} = \underbrace{\begin{pmatrix} 0 & u \\ u' & 0 \end{pmatrix}}_{U} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \begin{pmatrix} -(r_{22} - r_{12}) \\ -(c_{22} - c_{21}) \end{pmatrix}$$
Case 1: $U$ is invertible.
[Figure: the two possible qualitative forms of the unconstrained strategy pair trajectory]

Case 2: $U$ is not invertible.
[Figure: some examples of qualitative forms of the unconstrained strategy pair trajectory]
Convergence
If both players follow the IGA rule, then both players' average payoffs converge to the expected payoff of some Nash equilibrium. If the strategy pair trajectory converges at all, then it converges to a Nash pair.
Proposition
Both previous propositions also hold with a finite, decreasing step size.
References
Singh S., Kearns M., Mansour Y. (2000). Nash Convergence of Gradient Dynamics in General-Sum Games. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pages 541-548.
Dynamic Computation of Nash Equilibria in Two-Player General-Sum Games
Notation
2 players:

- Strategies $p = (p_1, \dots, p_n)^T$ and $q = (q_1, \dots, q_n)^T$
- Payoff matrices

$$R = \begin{pmatrix} r_{11} & \cdots & r_{1n} \\ \vdots & \ddots & \vdots \\ r_{n1} & \cdots & r_{nn} \end{pmatrix}, \qquad C = \begin{pmatrix} c_{11} & \cdots & c_{1n} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{nn} \end{pmatrix}$$
Objective Functions
Payoff functions:

- Row player: $V_r(p, q) = p^T R q$
- Column player: $V_c(p, q) = p^T C q$
Observation!
$V_r(p, q)$ is linear in each $p_i$ and $q_j$. Let $x_i$ denote the pure strategy for action $i$. This means: if $V_r(x_i, q) > V_r(p, q)$, then increasing the value of $p_i$ increases the payoff.
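In code, the pure-strategy payoff $V_r(x_i, q)$ is simply the $i$-th entry of $Rq$ (a small illustration; the matrix and strategies are arbitrary examples):

```python
import numpy as np

R = np.array([[3.0, 0.0], [5.0, 1.0]])   # arbitrary example payoff matrix
p = np.array([0.3, 0.7])
q = np.array([0.6, 0.4])

x0 = np.array([1.0, 0.0])                # pure strategy for action 0
print(x0 @ R @ q, (R @ q)[0])            # identical: V_r(x_0, q) = (Rq)_0
print(p @ R @ q)                         # mixed-strategy payoff V_r(p, q)
```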
Hill climbing (again)
Multiplicative update rules ($i = 1, \dots, n$):

$$\Delta p_i \;\propto\; p_i(t)\,\big[ V_r(x_i, q) - V_r(p, q) \big] = p_i(t)\,\big[ (Rq)_i - p^T R q \big]$$

$$\Delta q_i \;\propto\; q_i(t)\,\big[ V_c(p, x_i) - V_c(p, q) \big] = q_i(t)\,\big[ (p^T C)_i - p^T C q \big]$$
Hill climbing (again)
System of differential equations ($i = 1, \dots, n$):

$$\frac{\partial p_i}{\partial t} = p_i(t)\,\big[ (Rq)_i - p^T R q \big]$$

$$\frac{\partial q_i}{\partial t} = q_i(t)\,\big[ (p^T C)_i - p^T C q \big]$$
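A minimal Euler-integration sketch of these dynamics (the step size, horizon, game and starting point are arbitrary example choices):

```python
import numpy as np

def multiplicative_dynamics(R, C, p, q, dt=0.01, steps=10000):
    """Euler integration of the coupled differential equations above."""
    for _ in range(steps):
        dp = p * (R @ q - p @ R @ q)   # dp_i/dt = p_i [(Rq)_i - p^T R q]
        dq = q * (p @ C - p @ C @ q)   # dq_i/dt = q_i [(p^T C)_i - p^T C q]
        p, q = p + dt * dp, q + dt * dq
    return p, q

# example: matching pennies, started away from the mixed equilibrium
R = np.array([[1.0, -1.0], [-1.0, 1.0]])
p, q = multiplicative_dynamics(R, -R, np.array([0.6, 0.4]), np.array([0.5, 0.5]))
print(p, q)
```

Note that the increments sum to zero, so (up to integration error) each strategy stays on the probability simplex.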
Fixed Points?
$\frac{\partial p_i}{\partial t} = p_i(t)\,\big[ (Rq)_i - p^T R q \big] = 0$ holds for all $i$ iff, for each $i$, either

$$p_i(t) = 0 \qquad \text{or} \qquad (Rq)_i = p^T R q$$
When is a Fixed Point a Nash?

- Proposition: provided all $p_i(0)$ are neither 0 nor 1, if $(p, q)$ converges to $(p^*, q^*)$, then $(p^*, q^*)$ is a Nash equilibrium.
Unit Square?
No problem! $p_i = 0$ or $p_i = 1$ both set $\frac{\partial p_i}{\partial t}$ to zero.
Convergence of the average payoff

If the $(p, q)$ trajectory and both players' payoffs converge in average, the average payoff must be the payoff of some Nash equilibrium.
2 Player 2 Action Case
Either the strategies converge immediately to some pure strategy, or the difference between the Kullback-Leibler distances of $(p, q)$ to some mixed Nash equilibrium $(p^*, q^*)$ is constant:

$$KL(p, p^*) = p \log\!\left(\frac{p}{p^*}\right) + (1 - p) \log\!\left(\frac{1 - p}{1 - p^*}\right)$$

$$KL(p, p^*) - KL(q, q^*) = \text{const.}$$
Trajectories of the difference between the Kullback-Leibler distances

[Figure: trajectories around the Nash equilibrium]
But…
… for games with more than 2 actions, convergence is not guaranteed!
Counterexample: the Shapley game.