Learning in Games: Fictitious Play

Notation

For n players we have:
• n finite strategy spaces S^1, S^2, …, S^n
• n opponent strategy spaces S^{-1}, S^{-2}, …, S^{-n}
• n payoff functions u^1, u^2, …, u^n
• for each i and each s^{-i} in S^{-i}, a set of best responses BR^i(s^{-i})

What is Fictitious Play?

Each player forms an assessment of the opponents' play in the form of a weight function \kappa^i_0 : S^{-i} \to \mathbb{R}_+, updated by

\kappa^i_t(s^{-i}) = \kappa^i_{t-1}(s^{-i}) + \begin{cases} 1 & \text{if } s^{-i}_{t-1} = s^{-i} \\ 0 & \text{if } s^{-i}_{t-1} \neq s^{-i} \end{cases}

The probability that player i assigns at time t to player -i playing s^{-i} is

\gamma^i_t(s^{-i}) = \frac{\kappa^i_t(s^{-i})}{\sum_{\tilde{s}^{-i} \in S^{-i}} \kappa^i_t(\tilde{s}^{-i})}

Fictitious play is …

… any rule \rho^i_t(\gamma^i_t) that assigns \rho^i_t(\gamma^i_t) \in BR^i(\gamma^i_t). Such a rule is NOT UNIQUE!

Further Definitions

In 2-player games, the marginal empirical distribution of j's play (j = -i) is

d^j_t(s^j) = \frac{\kappa^j_t(s^j) - \kappa^j_0(s^j)}{t}

Asymptotic Behavior

Propositions:
• Strict Nash equilibria are absorbing for the process of fictitious play.
• Any pure-strategy steady state of fictitious play must be a Nash equilibrium.

Example "Matching Pennies"

      H      T
H   1,-1   -1,1
T   -1,1   1,-1

Starting from the initial weights in round 0, each player best-responds to its assessment and then updates the weights on the opponent's actions:

Round | Play (Row, Col) | Row's weights (H, T) | Col's weights (H, T)
  0   |        -        |      1.5, 2.0        |      2.0, 1.5
  1   |      (T, T)     |      1.5, 3.0        |      2.0, 2.5
  2   |      (T, H)     |      2.5, 3.0        |      2.0, 3.5
  3   |      (T, H)     |      3.5, 3.0        |      2.0, 4.5
  4   |      (H, H)     |      4.5, 3.0        |      3.0, 4.5
  5   |      (H, H)     |      5.5, 3.0        |      4.0, 4.5
  6   |      (H, H)     |      6.5, 3.0        |      5.0, 4.5
  7   |      (H, T)     |      6.5, 4.0        |      6.0, 4.5

Convergence?

The strategies cycle and do not converge … but the marginal empirical distributions do:

d^j_t(s^j) = \frac{\kappa^j_t(s^j) - \kappa^j_0(s^j)}{t} \to \frac{1}{2}

[MATLAB simulation, matching pennies: panels "Game Play", "Weight / Time", "Payoff"; a Python sketch follows the proposition below.]

Proposition

Under fictitious play, if the empirical distributions over each player's choices converge, then the strategy profile corresponding to the product of these distributions is a Nash equilibrium.
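The MATLAB code itself is not part of the deck; the following minimal Python sketch (the tie-breaking rule "take the first maximizer" is an assumption) reproduces the weight table above:

```python
import numpy as np

# Fictitious play on matching pennies with the initial weights from the example.
R = np.array([[1, -1], [-1, 1]])   # row player's payoffs (wants to match)
C = -R                             # column player's payoffs (wants to mismatch)

w_row = np.array([1.5, 2.0])       # row's weights kappa on column's (H, T)
w_col = np.array([2.0, 1.5])       # column's weights kappa on row's (H, T)
w_row0, w_col0 = w_row.copy(), w_col.copy()

T = 8
for t in range(1, T + 1):
    # Best responses to the normalized assessments gamma of the opponent.
    a_row = int(np.argmax(R @ (w_row / w_row.sum())))
    a_col = int(np.argmax((w_col / w_col.sum()) @ C))
    # Update each weight vector with the action the opponent actually played.
    w_row[a_col] += 1
    w_col[a_row] += 1
    print(t, "HT"[a_row], "HT"[a_col], w_row, w_col)

# Marginal empirical distributions d_T: they drift toward (1/2, 1/2)
# even though the realized actions keep cycling.
print("d_T of Col's play:", (w_row - w_row0) / T)
print("d_T of Row's play:", (w_col - w_col0) / T)
```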
Example "Rock-Paper-Scissors"

      A      B      C
A   ½,½    1,0    0,1
B   0,1    ½,½    1,0
C   1,0    0,1    ½,½

[MATLAB simulation, rock-paper-scissors: panels "Game Play", "Weight / Time", "Payoff".]

Example "Shapley Game"

    0,0   1,0   0,1
    0,1   0,0   1,0
    1,0   0,1   0,0

[MATLAB simulation, Shapley game: panels "Game Play", "Weight / Time", "Payoff".]

Example "Persistent Miscoordination"

      A      B
A   0,0    1,1
B   1,1    0,0

Nash equilibria: the pure profiles (1,0) vs. (0,1) and (0,1) vs. (1,0), and the mixed profile (0.5, 0.5) for both players.

With initial weights (1, 1.4) for both players, each believes the opponent will play B and best-responds with A; both play A, and the weights become (2, 1.4). Now each believes the opponent will play A and best-responds with B; both play B, and the weights become (2, 2.4), and so on. Play cycles through (A,A), (B,B), …, so the realized payoff is 0 in every round, although the empirical frequencies converge to the mixed Nash equilibrium, whose expected payoff is 0.5.

[MATLAB simulation, miscoordination game: panels "Game Play", "Weight / Time", "Payoff".]

Summary on Fictitious Play

• In case of convergence, the time average of the strategies forms a Nash equilibrium.
• The average payoff need not be that of a Nash equilibrium (e.g., persistent miscoordination).
• The time average may not converge at all (e.g., the Shapley game).

References

Fudenberg D., Levine D. K. (1998) The Theory of Learning in Games. MIT Press.

Nash Convergence of Gradient Dynamics in General-Sum Games

Notation

2 players with strategies \alpha and \beta (the probabilities of playing the first of two actions) and payoff matrices

R = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix}, \qquad C = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}

Objective Functions

Payoff functions:

V_r(\alpha, \beta) = r_{11}\alpha\beta + r_{22}(1-\alpha)(1-\beta) + r_{12}\alpha(1-\beta) + r_{21}(1-\alpha)\beta
V_c(\alpha, \beta) = c_{11}\alpha\beta + c_{22}(1-\alpha)(1-\beta) + c_{12}\alpha(1-\beta) + c_{21}(1-\alpha)\beta

Hillclimbing Idea

Gradient ascent for iterated games. With u = (r_{11} + r_{22}) - (r_{21} + r_{12}) and u' = (c_{11} + c_{22}) - (c_{21} + c_{12}):

\frac{\partial V_r(\alpha, \beta)}{\partial \alpha} = \beta u - (r_{22} - r_{12}), \qquad \frac{\partial V_c(\alpha, \beta)}{\partial \beta} = \alpha u' - (c_{22} - c_{12})

Update Rule

\alpha_{k+1} = \alpha_k + \eta \frac{\partial V_r(\alpha_k, \beta_k)}{\partial \alpha}, \qquad \beta_{k+1} = \beta_k + \eta \frac{\partial V_c(\alpha_k, \beta_k)}{\partial \beta}

\alpha_0, \beta_0 can be arbitrary strategies; \eta is the step size.

Problem

The gradient can lead the players to an infeasible point outside the unit square. Solution: redefine the gradient as the projection of the true gradient onto the boundary. Let this define the constrained dynamics!

[Figures: gradient steps leaving the unit square, and their projections back onto its boundary.]

Infinitesimal Gradient Ascent (IGA)

In the limit \eta \to 0, \alpha(t) and \beta(t) become functions of time:

\frac{\partial}{\partial t} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} 0 & u \\ u' & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \begin{pmatrix} -(r_{22} - r_{12}) \\ -(c_{22} - c_{12}) \end{pmatrix}

Case 1: the off-diagonal matrix U is invertible. [Figure: the two possible qualitative forms of the unconstrained strategy pair trajectory.]

Case 2: U is not invertible. [Figure: some examples of qualitative forms of the unconstrained strategy pair trajectory.]

Convergence

• If both players follow the IGA rule, then both players' average payoffs converge to the expected payoff of some Nash equilibrium.
• If the strategy pair trajectory converges at all, then it converges to a Nash pair.

Proposition

Both previous propositions also hold with a finite decreasing step size.

References

Singh S., Kearns M., Mansour Y. (2000) Nash Convergence of Gradient Dynamics in General-Sum Games. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pages 541-548.

Dynamic Computation of Nash Equilibria in Two-Player General-Sum Games

Notation

2 players with mixed strategies p = (p_1, …, p_n) and q = (q_1, …, q_n) and payoff matrices

R = \begin{pmatrix} r_{11} & \cdots & r_{1n} \\ \vdots & & \vdots \\ r_{n1} & \cdots & r_{nn} \end{pmatrix}, \qquad C = \begin{pmatrix} c_{11} & \cdots & c_{1n} \\ \vdots & & \vdots \\ c_{n1} & \cdots & c_{nn} \end{pmatrix}

Objective Functions

Payoff functions:
• Row player: V_r(p, q) = p^T R q
• Column player: V_c(p, q) = p^T C q

Observation!

V_r(p, q) is linear in each p_i and q_j. Let x_i denote the pure strategy for action i. This means: if V_r(x_i, q) > V_r(p, q), then increasing the value of p_i increases the payoff.

Hill Climbing (again)

Multiplicative update rules:

p_i \leftarrow p_i + \eta\, p_i \left[ V_r(x_i, q_t) - V_r(p_t, q_t) \right], \qquad q_i \leftarrow q_i + \eta\, q_i \left[ V_c(p_t, x_i) - V_c(p_t, q_t) \right]

In the continuous-time limit this gives a system of differential equations (i = 1, …, n):

\frac{\partial p_i(t)}{\partial t} = p_i \left[ (Rq)_i - p^T R q \right], \qquad \frac{\partial q_i(t)}{\partial t} = q_i \left[ (p^T C)_i - p^T C q \right]

Fixed Points?

\frac{\partial p_i(t)}{\partial t} = 0 holds for every i iff, for each i, either p_i = 0 or (Rq)_i - p^T R q = 0 (and analogously for q).
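To make the fixed-point condition concrete, here is a minimal Euler-integration sketch of these differential equations; the choice of the miscoordination game from the fictitious-play part, the starting strategies, and the step size are illustrative assumptions:

```python
import numpy as np

# Euler integration of the multiplicative dynamics on the miscoordination game.
R = np.array([[0.0, 1.0], [1.0, 0.0]])  # row payoffs
C = np.array([[0.0, 1.0], [1.0, 0.0]])  # column payoffs

p = np.array([0.6, 0.4])   # row's mixed strategy (interior start)
q = np.array([0.3, 0.7])   # column's mixed strategy
dt = 0.01

for _ in range(5000):
    vp = R @ q             # V_r(x_i, q) for each pure strategy x_i of the row player
    vq = p @ C             # V_c(p, x_i) for each pure strategy x_i of the column player
    p = p + dt * p * (vp - p @ vp)   # dp_i/dt = p_i [ V_r(x_i,q) - V_r(p,q) ]
    q = q + dt * q * (vq - q @ vq)   # dq_i/dt = q_i [ V_c(p,x_i) - V_c(p,q) ]

print("p =", p.round(4), " q =", q.round(4))
# Fixed-point check: for each i, either p_i ~ 0 or (Rq)_i ~ p^T R q.
print("row residuals:", (p * (R @ q - p @ R @ q)).round(6))
```

From this interior start the trajectory settles on the pure profile (A, B), and the printed residuals p_i[(Rq)_i - p^T R q] are numerically zero, matching the fixed-point condition above.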
When is a Fixed Point a Nash Equilibrium?

• Proposition: Provided all p_i(0) are neither 0 nor 1, if (p, q) converges to (p*, q*), then (p*, q*) is a Nash equilibrium.

Unit Square? No Problem!

p_i = 0 and p_i = 1 both set \partial p_i / \partial t to zero, so the dynamics never leave the feasible region; unlike IGA, no projection is needed.

Convergence of the Average Payoff

If the (p, q) trajectory and both players' payoffs converge in average, then the average payoff must be the payoff of some Nash equilibrium.

The 2-Player, 2-Action Case

Either the strategies converge immediately to some pure strategy, or the difference between the Kullback-Leibler distances of (p, q) to some mixed Nash equilibrium stays constant:

KL(p, p^*) = p \log\left(\frac{p}{p^*}\right) + (1-p) \log\left(\frac{1-p}{1-p^*}\right)

KL(p, p^*) - KL(q, q^*) = \text{const.}

[Figure: trajectories of the difference between the Kullback-Leibler distances, with the Nash equilibrium marked.]

But …

… for games with more than 2 actions, convergence is not guaranteed! Counterexample: the Shapley game.
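The Shapley counterexample can be checked numerically with the same Euler sketch as before (step size, horizon, and the interior starting point are again arbitrary assumptions):

```python
import numpy as np

# Multiplicative dynamics on the Shapley game: the unique Nash equilibrium
# is the uniform mixed profile, but the trajectory does not converge to it.
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])   # row payoffs
C = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # column payoffs

p = np.array([0.5, 0.3, 0.2])    # interior starting strategies
q = np.array([0.2, 0.5, 0.3])
dt = 0.01

for step in range(1, 20001):
    p = p + dt * p * (R @ q - p @ R @ q)
    q = q + dt * q * (p @ C - p @ C @ q)
    if step % 2000 == 0:
        # The printed strategy pairs keep moving among near-pure
        # profiles instead of settling at (1/3, 1/3, 1/3).
        print(step, p.round(3), q.round(3))
```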