Multi-agent learning and repeated matrix games
Bruno Bouzy
Université Paris Descartes, France
Outline
• The five agendas of Multi-Agent Learning (MAL)
• Game theory
  – {Repeated} matrix games (RMG)
  – Equilibria (Nash, Correlated)
• MAL algorithms
  – Equilibria learning
  – Best response learning
  – Competition & cooperation
  – Regret oriented algorithms
  – Leader algorithms
• Discussion and experiments
  – Algorithms evaluation
  – Some results
• Conclusion
Multi-agent learning (MAL)
• From single-agent Reinforcement Learning to MAL
  – Other agents change the environment
  – "Non-stationarity"
• Game theory
  – Players = learners
  – Equilibria
• Multi-agent background
  – Adaptive and autonomous agents
The five agendas of MAL (1/2)
• AI Journal (2007)
  – Shoham, Powers, Grenager: "If multi-agent learning is the answer, what is the question?"
  – Game theory: fictitious play, equilibria
  – Artificial Intelligence: single-agent learning
  – Since 2004, the MAL literature has been growing
• Computational agenda
  – MAL algorithms used to compute equilibria
• Descriptive agenda
  – How do natural agents learn in the context of other learners?
  – Investigate models that fit human or population behavior
• Normative agenda
  – How to account for equilibria reached by given learning rules?
The five agendas of MAL (2/2)
• Two "prescriptive" agendas
  – How should agents learn?
• "Cooperative" prescriptive agenda
  – Decentralized control in real-world applications
  – Control theory, distributed computing
  – Joint policy
  – Communication between agents is allowed
• "Non-cooperative" prescriptive agenda (NCPA)
  – How to obtain high reward in repeated games?
  – Convergence is not a goal
  – Communication between agents is forbidden
  – Set of players, tournaments
Cooperative or non-cooperative?
• Nash's meaning
  – Non-cooperative games (Annals of Mathematics, 1951)
  – Two-person cooperative games (Econometrica, 1953)
  – Cooperative = communication allowed; non-cooperative = communication not allowed
• Common meaning
  – Cooperative = friendly
  – Non-cooperative = competitive
• Repeated matrix game meaning
  – Non-cooperative agenda = no communication between players
  – Cooperative game = team game
    • all agents receive the same reward
  – Competitive game = zero-sum 2-person game
Game theory (outline)
• {Repeated} matrix game examples
• Nash equilibria
• Correlated equilibria
2-player matrix games
Payoffs are written (row player, column player).

• Coordination
            c       d
  a       2, 2    0, 0
  b       0, 0    1, 1

• Battle of Sexes
            Foot    Theater
  Foot      2, 1    0, 0
  Theater   0, 0    1, 2

• Competition (zero-sum)
            c       d
  a      -3, 3    0, 0
  b      -1, 1   -2, 2

• Prisoner Dilemma
            Coop    Defect
  Coop      3, 3    1, 4
  Defect    4, 1    2, 2

• Stackelberg
            c       d
  a       1, 0    3, 2
  b       2, 1    4, 0

• Chicken
            Chicken   Dare
  Chicken   6, 6      2, 7
  Dare      7, 2      0, 0
Nash equilibrium
• Pure strategy = an action
• Mixed strategy = a probability distribution over actions
• Best response = a strategy maximizing an agent's reward given the other agents' strategies
• Nash equilibrium:
  – Every strategy is a best response to the other strategies
  – No player wants to change its strategy unilaterally (checked in the sketch below)
• Pure (resp. mixed) equilibrium
  – The strategies are pure (resp. mixed)
• Property
  – Every matrix game has at least one mixed equilibrium
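As an illustration (not from the slides), here is a minimal Python sketch that checks whether a pure joint action is a Nash equilibrium of a 2-player matrix game; the function name and array layout are my own assumptions.

    import numpy as np

    def is_pure_nash(A, B, i, j):
        """A[i, j] = row player's payoff, B[i, j] = column player's payoff."""
        row_best = A[i, j] >= A[:, j].max()   # the row player cannot gain by deviating
        col_best = B[i, j] >= B[i, :].max()   # the column player cannot gain by deviating
        return row_best and col_best

    # Battle of Sexes from the previous slide: (Foot, Foot) and (Theater, Theater) are pure NE.
    A = np.array([[2, 0], [0, 1]])            # row player's payoffs
    B = np.array([[1, 0], [0, 2]])            # column player's payoffs
    assert is_pure_nash(A, B, 0, 0) and is_pure_nash(A, B, 1, 1)
    assert not is_pure_nash(A, B, 0, 1)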
Nash equilibria
(of the example games; matrices as on the previous slide)
• Coordination: pure NE (a, c) and (b, d)
• Battle of Sexes: pure NE (Foot, Foot) and (Theater, Theater)
• Competition: mixed NE with p(a) = 1/4 and p(c) = 1/2
• Prisoner Dilemma: pure NE (Defect, Defect)
• Stackelberg: pure NE (b, c)
• Chicken: pure NE (Chicken, Dare) and (Dare, Chicken)
Correlated Equilibrium (1/3)
• Correlated Equilibrium (CE) (Aumann 1974)
  – Probability distribution D over joint actions
  – The players know D
  – A joint action is drawn from D by a "referee" or "device"
  – The referee gives every elementary action to its player
  – If no player wants to deviate, then D is a CE (checked in the sketch below)
• Properties
  – Every mixed NE is a CE
  – Regret minimization methods converge to CE
  – Expected reward of a CE >= expected reward of a NE
  – The players communicate (or cooperate) through the referee
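A minimal sketch (again not from the slides, and with illustrative names) of the CE check: for every recommended action, following the recommendation must be at least as good as any deviation, conditionally on that recommendation.

    import numpy as np

    def is_correlated_eq(A, B, D, eps=1e-9):
        """A, B: row/column payoff matrices; D[i, j]: probability of joint action (i, j)."""
        n, m = A.shape
        for i in range(n):                 # row player is told to play i
            for k in range(n):             # candidate deviation k
                if np.dot(D[i, :], A[i, :]) < np.dot(D[i, :], A[k, :]) - eps:
                    return False
        for j in range(m):                 # column player is told to play j
            for l in range(m):
                if np.dot(D[:, j], B[:, j]) < np.dot(D[:, j], B[:, l]) - eps:
                    return False
        return True

    # Chicken game from the slides: uniform weight on {(C,C), (C,D), (D,C)} is a CE.
    A = np.array([[6, 2], [7, 0]])
    B = A.T
    D = np.array([[1/3, 1/3], [1/3, 0.0]])
    assert is_correlated_eq(A, B, D)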
Correlated Equilibrium (2/3)
• Battle of Sexes
            Foot    Theater
  Foot      2, 1    0, 0
  Theater   0, 0    1, 2
• Nash equilibria
  – Pure: (F, F) and (T, T)
  – Mixed: (1/3, 2/3) and (2/3, 1/3)
  – Expected rewards = (2/3, 2/3)
• Correlated equilibrium
  – { ((F,F), 1/2), ((T,T), 1/2) }
  – Expected rewards = (3/2, 3/2)
Correlated Equilibrium (3/3)
• Chicken game
            Chicken   Dare
  Chicken   6, 6      2, 7
  Dare      7, 2      0, 0
• Nash equilibria
  – Pure: (D, C) and (C, D)
  – Mixed: (2/3, 2/3)
  – Expected rewards = (14/3, 14/3)
• Correlated equilibrium
  – { ((C,C), 1/3), ((C,D), 1/3), ((D,C), 1/3) }
  – Expected rewards = (5, 5)
MAL algorithms (outline)
• Equilibria learning
• Best response learning
• Dealing with cooperation and competition
• Regret oriented algorithms
• Leader algorithms
Equilibria learning (1/4)
• Minimax-Q (Littman 1994)
  – Action values are based on joint actions (see the sketch below):
    $V_i(s) = \max_{a \in A_i} \min_{o \in A_{-i}} Q_i(s, (a, o))$
    with a the learning agent's action and o the opponent's action
  – The agent can observe the opponent's actions
  – The values do not depend on the opponents' strategies
  – Converges to the game-theoretic optimal value in 2-player zero-sum games
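A minimal sketch of the max-min value that Minimax-Q backs up, restricted here to pure strategies on a single state (the full algorithm maximizes over mixed strategies via linear programming); the table Q and its layout are illustrative assumptions.

    import numpy as np

    def maxmin_value(Q):
        """Q[a, o] = learner's joint-action value for its action a against opponent action o."""
        return Q.min(axis=1).max()        # worst case over o, best case over a

    Q = np.array([[-3.0, 0.0],
                  [-1.0, -2.0]])          # the Competition game from the earlier slide
    print(maxmin_value(Q))                # -2.0: action b guarantees at least -2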
Equilibria learning (2/4)
• Nash-Q (Hu & Wellman 1998)
  – Minimax-Q extended to 2-player general-sum games:
    $V_1(s) = \mathrm{Nash}(Q_1(s), Q_2(s)) = Q_1(s, \pi_1(s), \pi_2(s))$
    where $(\pi_1(s), \pi_2(s))$ is the NE of the matrix game $(Q_1(s), Q_2(s))$
  – Nash-Q computes mixed strategies
  – Converges only under some (restrictive) conditions:
    • There exists only one equilibrium for the entire game and for every game defined by the Q functions during learning
    • All equilibria must be of the same type: adversarial or coordination
Equilibria learning (3/4)
• Friend-or-Foe-Q (FF-Q) (Littman 2001)
  – Converges under less restrictive conditions than Nash-Q, but needs to classify the opponent as "friend" or "foe"
  – Two algorithms, one per class of opponent
  – Foe opponent (Foe-Q): $V_i(s) = \max_{a \in A_i} \min_{o \in A_{-i}} Q_i(s, (a, o))$
  – Friend opponent (Friend-Q): $V_i(s) = \max_{(a, o) \in A_i \times A_{-i}} Q_i(s, (a, o))$
  – How to classify the opponents? Not answered in Littman's paper
Equilibria learning (4/4)
• Correlated-Q (CE-Q) (Greenwald & Hall 2003)
  – Generalization of Nash-Q and FF-Q
  – Allows several equilibria in the game
  – Looks for correlated equilibria
  – Four classes of opponents are defined => 4 CE types => 4 algorithms:
    "utilitarian" (uCE-Q), "egalitarian" (eCE-Q), "republican" (rCE-Q), "libertarian" (lCE-Q)
  – Equilibria are computed with linear programming (see the sketch below)
  – The algorithm must compute the opponent's Q-values
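A sketch of the linear program behind the utilitarian variant, as I understand it from the slide (maximize the sum of the two players' expected rewards over all correlated equilibria); the function name and the use of scipy are my own choices, not the paper's code.

    import numpy as np
    from scipy.optimize import linprog

    def utilitarian_ce(A, B):
        """Return a distribution D over joint actions maximizing the total expected reward."""
        n, m = A.shape
        c = -(A + B).reshape(-1)                 # linprog minimizes, so negate the objective
        rows = []
        for i in range(n):                       # row player's incentive constraints
            for k in range(n):
                if k != i:
                    g = np.zeros((n, m))
                    g[i, :] = A[k, :] - A[i, :]  # deviating from i to k must not pay
                    rows.append(g.reshape(-1))
        for j in range(m):                       # column player's incentive constraints
            for l in range(m):
                if l != j:
                    g = np.zeros((n, m))
                    g[:, j] = B[:, l] - B[:, j]
                    rows.append(g.reshape(-1))
        res = linprog(c, A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                      A_eq=np.ones((1, n * m)), b_eq=[1.0], bounds=(0, 1))
        return res.x.reshape(n, m)

    A = np.array([[6, 2], [7, 0]])               # Chicken game from the slides
    print(utilitarian_ce(A, A.T))                # should put weight ~[[0.5, 0.25], [0.25, 0]]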
From equilibria to best-response
• Focus on asymptotically achieving an equilibrium in self-play
• Finding an equilibrium is not always the answer
  – All agents have to play the equilibrium
  – Bounded rationality: one agent may not play the equilibrium
  – Multiple equilibria: agents have to cooperate to play the same equilibrium
• Long-term best-response play
  – Learning strategies that play the best response to the opponent's observed strategy
Best-response learning
• Learn to play the best response to the opponent's observed behavior
• Naïve approach: single-agent learning
• Take into account the possibility that other agents' strategies might change
• Examples
  – Fictitious play
  – Q-learning
Q-learning (Watkins 1989, 1992)
• Single-agent learning
• The Q-learning update (used in the sketch below):
  $Q(s, a) \leftarrow Q(s, a) + \alpha \, [\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,]$
• The greedy policy:
  $\pi(s) = \arg\max_{a \in A} Q(s, a)$
• Q-learning is guaranteed to converge in stationary environments if:
  – every state-action pair continues to be visited,
  – and the learning rate is decreased appropriately over time
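A minimal sketch (illustrative, not from the talk) of tabular Q-learning used as a repeated-matrix-game player against a fixed opponent; with a single state, the Q table reduces to one value per action. Parameter values and the helper names are assumptions.

    import random

    def q_learning_rmg(payoff, opponent, n_actions, steps=10_000,
                       alpha=0.1, gamma=0.9, epsilon=0.1):
        """payoff[a][o]: learner's reward; opponent(): returns the opponent's action."""
        Q = [0.0] * n_actions
        for _ in range(steps):
            if random.random() < epsilon:
                a = random.randrange(n_actions)                  # explore
            else:
                a = max(range(n_actions), key=Q.__getitem__)     # exploit
            r = payoff[a][opponent()]
            Q[a] += alpha * (r + gamma * max(Q) - Q[a])          # Q-learning update
        return Q

    pd = [[3, 1], [4, 2]]                     # Prisoner Dilemma, rows: Coop, Defect
    Q = q_learning_rmg(pd, lambda: 1, 2)      # against an opponent that always defects
    print(Q)                                  # Defect (action 1) ends up with the higher value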
Fictitious play (FP) (Brown 1951)
• The agent observes the time-average frequency of the other players' action choices and models
  $\mathrm{prob}(a_k) = \dfrac{\#\ \text{times } a_k \text{ observed}}{\text{total}\ \#\ \text{observations}}$
• The agent then plays a best response to this model (see the sketch below)
• Converges in two-player zero-sum games
• When it converges, the convergence point is a Nash equilibrium
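A minimal sketch (my own, illustrative) of one fictitious-play decision for the row player: build the empirical opponent model from observed counts, then best-respond to it.

    import numpy as np

    def fictitious_play_action(A, opponent_counts):
        """A[a, o]: row player's payoffs; opponent_counts[o]: how often o was observed."""
        freq = opponent_counts / opponent_counts.sum()   # empirical model of the opponent
        return int(np.argmax(A @ freq))                  # best response to that model

    A = np.array([[3, 1], [4, 2]])       # Prisoner Dilemma, rows/columns: Coop, Defect
    counts = np.array([7.0, 3.0])        # opponent played Coop 7 times, Defect 3 times
    print(fictitious_play_action(A, counts))   # 1: Defect is the best response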
Previous experimental works
• MALT (Zawadzki 2005)
  – Q-learning (QL) is better than Nash-based algorithms
• GAMUT (Nudelman et al. 2004)
• (Airiau 2007)
  – FP is ranked first
  – Specific representative RMG
Cooperation and competition (CC 1/3)
• Prisoner Dilemma
            Coop    Defect
  Coop      3, 3    1, 4
  Defect    4, 1    2, 2
• Nash equilibria
  – Pure: (D, D)
  – Mixed: none
  – Expected reward = 2
• Pareto-optimal state that is better than (D, D)
  – (C, C)
  – Expected reward = 3
• Best response algorithms and regret oriented algorithms
  – Converge to (D, D)
S algorithm (CC 2/3)
• Satisficing (S) algorithm (Stimpson & Goodrich 2003)
• At instant t (one step is sketched below):
  – Receive the reward R(t)
  – Select the action depending on the "aspiration" of agent i:
    • If R(t) >= Aspiration(i, t), then Action(i, t+1) = Action(i, t)
    • Else Action(i, t+1) = random choice
  – Update the aspiration:
    • Aspiration(i, t+1) = λ Aspiration(i, t) + (1-λ) R(t)
    • Aspiration(i, 0) = Rmax
• In self-play, the S algorithm converges to the Pareto-optimal states
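A minimal sketch of one step of the satisficing rule as written on the slide; λ's value and the function signature are assumptions.

    import random

    def s_step(action, aspiration, reward, n_actions, lam=0.99):
        """Return (next_action, next_aspiration) from the current step's outcome."""
        if reward >= aspiration:
            next_action = action                          # satisfied: repeat the action
        else:
            next_action = random.randrange(n_actions)     # unsatisfied: search randomly
        next_aspiration = lam * aspiration + (1 - lam) * reward
        return next_action, next_aspiration

    # Aspiration(i, 0) is initialized to Rmax, e.g. 4 in the Prisoner Dilemma above.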
M Qubed (CC 3/3)
• M Qubed (Crandall & Goodrich 2005)
• Combination of the S algorithm and Q-learning
• Uses Max or MiniMax (M3) strategies according to the situation (friendly or inimical)
• Goals reached
  – Max level against friendly agents
  – Security level against inimical agents
Regret oriented algorithms
• HMC (Hart & Mas-Colell 2000)
  – Regret of action A over action B (internal regret)
  – Probability of action X linear in its regret over the last action (see the sketch below)
  – Convergence to CE
• Bandit algorithms: UCB (Auer et al. 2002)
• Exp3 and derived algorithms
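A simplified sketch of the HMC action-choice rule (my reading of it, not the paper's code): each alternative action is played with probability proportional to its positive regret over the last played action, and the last action keeps the remaining mass; μ is a scaling constant assumed large enough for the probabilities to sum to at most one.

    import random

    def hmc_choose(last_action, regret, mu=100.0):
        """regret[k]: average regret of having played last_action instead of k."""
        n = len(regret)
        p = [max(0.0, regret[k]) / mu for k in range(n)]
        p[last_action] = max(0.0, 1.0 - (sum(p) - p[last_action]))   # remaining mass
        r, acc = random.random(), 0.0
        for k, pk in enumerate(p):
            acc += pk
            if r < acc:
                return k
        return last_action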
Exp3 (Auer et al. 2002)
• "Exponential-weight algorithm for Exploration and Exploitation"
• Designed for non-stochastic (adversarial) multi-armed bandit problems
• Action selection scheme (see the sketch below): the selection probability of action i is
  $p_i(t) = (1-\gamma)\,\dfrac{w_i(t)}{\sum_j w_j(t)} + \dfrac{\gamma}{K}$
  – γ: mixture rate, K: number of actions
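A minimal sketch (with an arbitrary γ) of the Exp3 probabilities and of the usual exponential weight update on an importance-weighted reward estimate; rewards are assumed rescaled into [0, 1].

    import math, random

    def exp3_probs(w, gamma):
        total, K = sum(w), len(w)
        return [(1 - gamma) * wi / total + gamma / K for wi in w]

    def exp3_update(w, probs, chosen, reward, gamma):
        K = len(w)
        x_hat = reward / probs[chosen]              # importance-weighted reward estimate
        w[chosen] *= math.exp(gamma * x_hat / K)

    w = [1.0, 1.0, 1.0]
    probs = exp3_probs(w, gamma=0.1)
    chosen = random.choices(range(3), weights=probs)[0]
    exp3_update(w, probs, chosen, reward=0.8, gamma=0.1)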
UCB (Auer et al. 2002)
• Select an action among a set of actions so as to both exploit and explore
• Action = argmax_i ( m(i) + sqrt( log(T) / t(i) ) )  (sketched below)
  – m(i): mean reward of action i, t(i): number of times i was selected, T: total number of trials
• Used in UCT for computer Go and Amazons
  – UCT = UCB for Trees
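A minimal sketch following the slide's formula (the published UCB1 bonus has an extra constant factor); the convention of trying each untried action first is my addition.

    import math

    def ucb_action(means, counts):
        """means[i]: empirical mean reward of action i; counts[i]: times i was selected."""
        for i, c in enumerate(counts):
            if c == 0:
                return i                            # try every action at least once
        T = sum(counts)
        return max(range(len(means)),
                   key=lambda i: means[i] + math.sqrt(math.log(T) / counts[i]))

    print(ucb_action([0.4, 0.6, 0.5], [10, 5, 1]))  # 2: the exploration bonus favors arm 2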
Leader algorithms
• Implicit negotiation in repeated games (Littman & Stone 2001)
• Bully
  – Being a leader
  – Assume the opponent is a best-response algorithm
  – Action = argmax_i m1(i, J*(i)), where J*(i) = argmax_j m2(i, j)
  – Example (Stackelberg game, sketch below):
              a       b
      a     1, 0    3, 2
      b     2, 1    4, 0
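A minimal sketch applying the slide's Bully formula; the NumPy layout and the comment about the equilibrium payoff are my own.

    import numpy as np

    def bully_action(m1, m2):
        """m1[i, j], m2[i, j]: leader's and follower's payoffs for joint action (i, j)."""
        j_star = m2.argmax(axis=1)                  # follower's best response to each i
        return int(max(range(m1.shape[0]), key=lambda i: m1[i, j_star[i]]))

    m1 = np.array([[1, 3], [2, 4]])                 # Stackelberg game from the slide
    m2 = np.array([[0, 2], [1, 0]])
    print(bully_action(m1, m2))                     # 0: committing to 'a' yields 3,
                                                    # more than the 2 of the Nash equilibrium (b, a)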
Discussion
• Theoretical MAL criteria
• Experimental evaluation criteria
• Some experimental results
Theoretical MAL criteria
• Bowling and Veloso (2001) proposed criteria for MAL algorithms. Learning should:
  (1) always converge to a stationary policy
  (2) only terminate with a best response to the play of the other agents
• Conitzer and Sandholm (2003) added another criterion:
  (3) converge to a Nash equilibrium in self-play
Experimental evaluation in the NCP Agenda
• Players' confrontation
  – One-to-one confrontations
  – All-against-all tournaments built from one-to-one confrontations
• Set of games
  – Cooperative games
  – Competitive games
  – General-sum games
  – GAMUT
Tournament settings
• Parameters
  – #actions = 3
  – -9 <= R <= +9
  – #repetitions = 100,000
  – #MG = 100
• For each MG, an all-against-all tournament is set up (sketched below)
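A minimal sketch of such a tournament loop; the agent interface (act() / observe()) and the pairing scheme are assumptions for illustration, not the talk's actual framework.

    import itertools

    def play_rmg(agent1, agent2, m1, m2, repetitions=100_000):
        """Return the cumulative rewards of both agents on one repeated matrix game."""
        total1 = total2 = 0
        for _ in range(repetitions):
            a1, a2 = agent1.act(), agent2.act()
            r1, r2 = m1[a1][a2], m2[a1][a2]
            agent1.observe(a1, a2, r1)              # each agent sees the joint action
            agent2.observe(a2, a1, r2)              # and its own reward
            total1, total2 = total1 + r1, total2 + r2
        return total1, total2

    def tournament(agent_factories, m1, m2):
        """agent_factories: dict name -> zero-argument constructor of a fresh agent."""
        scores = {name: 0 for name in agent_factories}
        for n1, n2 in itertools.combinations(agent_factories, 2):
            r1, r2 = play_rmg(agent_factories[n1](), agent_factories[n2](), m1, m2)
            scores[n1] += r1
            scores[n2] += r2
        return scores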
Tournament results on general-sum MG
1: M3
2: S
3: UCB
4: FP
5: Exp3
6: Bully
7: HMC
8: QL
9: MinMax
10: Rand
Tournament results on team games
1: Exp3
2: M3
3: Bully
4: S
5: HMC
6: FP
7: UCB
8: QL
9: MinMax
10: Rand
Tournament results on zero-sum games
1: Exp3
2: M3
3: MinMax
4: FP
5: S
6: UCB
7: QL
8: HMC
9: Bully
10: Rand
Conclusions and future works
• MAL Non-Cooperative Prescriptive Agenda
  – Maximize the cumulative returns against many players in many RMG
  – At the intersection of game theory, AI, and reinforcement learning
• Experimental results
  – UCB and M3 work well in general-sum games
  – Exp3 works well in both zero-sum games and team games
• Current work: two important features
  – Averaging over the near past, and forgetting the far past
  – Using states, where a state corresponds to the past joint action
• Future works
  – Hedging algorithms, or expert algorithms
  – Algorithm game