Multi-agent learning and repeated matrix games

Bruno Bouzy
Université Paris Descartes, France
Outline

• The five agendas of Multi-Agent Learning (MAL)
• Game theory
  – {Repeated} matrix games (RMG)
  – Equilibria (Nash, Correlated)
• MAL algorithms
  – Equilibria learning
  – Best-response learning
  – Competition and cooperation
  – Regret-oriented algorithms
  – Leader algorithms
• Discussion and experiments
  – Algorithm evaluation
  – Some results
• Conclusion
Multi-agent learning (MAL)

• From single-agent Reinforcement Learning to MAL
  – Other agents change the environment
  – "Non-stationarity"
• Game theory
  – Players = learners
  – Equilibria
• Multi-agent background
  – Adaptive and autonomous agents
The five agendas of MAL (1/2)

• AI Journal (2007)
  – Shoham, Powers, and Grenager, "If multi-agent learning is the answer, what is the question?"
  – Game theory, fictitious play → equilibria
  – Artificial Intelligence, single-agent learning
  – Since 2004, the MAL literature has grown
• Computational agenda
  – MAL algorithms used to compute equilibria
• Descriptive agenda
  – How do natural agents learn in the context of other learners?
  – Investigate models that fit human or population behavior
• Normative agenda
  – How to account for the equilibria reached by given learning rules?
The five agendas of MAL (2/2)

• Two "prescriptive" agendas
  – How should agents learn?
• "Cooperative" prescriptive agenda
  – Decentralized control in real-world applications
    • Control theory, distributed computing
    • Joint policy
  – Communication between agents is allowed
• "Non-cooperative" prescriptive agenda (NCPA)
  – How to obtain high reward in repeated games?
  – Convergence is not a goal
  – Communication between agents is forbidden
  – Set of players, tournaments
Cooperative or non-cooperative?

• Nash's meaning
  – "Non-Cooperative Games" (Annals of Mathematics, 1951)
  – "Two-Person Cooperative Games" (Econometrica, 1953)
  – Cooperative = communication allowed; non-cooperative = communication forbidden
• Common meaning
  – Cooperative = friendly
  – Non-cooperative = competitive
• Repeated matrix game meaning
  – Non-cooperative agenda = no communication between players
  – Cooperative game = team game
    • All agents receive the same reward
  – Competitive game = zero-sum two-person game
Game theory (outline)

• {Repeated} matrix game examples
• Nash equilibria
• Correlated equilibria
2-player matrix games

Each cell lists (row player's reward, column player's reward).

• Coordination

            c        d
    a    (2, 2)   (0, 0)
    b    (0, 0)   (1, 1)

• Battle of Sexes

              Foot    Theater
    Foot     (2, 1)   (0, 0)
    Theater  (0, 0)   (1, 2)

• Competition (zero-sum)

            c         d
    a    (-3, 3)   (0, 0)
    b    (-1, 1)   (-2, 2)

• Prisoner's Dilemma

             Coop    Defect
    Coop    (3, 3)   (1, 4)
    Defect  (4, 1)   (2, 2)

• Stackelberg

            c        d
    a    (1, 0)   (3, 2)
    b    (2, 1)   (4, 0)

• Chicken

              Chicken   Dare
    Chicken   (6, 6)    (2, 7)
    Dare      (7, 2)    (0, 0)
Nash equilibrium

• Pure strategy = an action
• Mixed strategy = a probability distribution over actions
• Best response = a strategy that maximizes an agent's reward given the other agents' strategies
• Nash equilibrium (NE):
  – Every strategy is a best response to the other strategies
  – No player wants to change its strategy unilaterally
• Pure (resp. mixed) equilibrium
  – The strategies are pure (resp. mixed)
• Property
  – Every matrix game has at least one mixed equilibrium
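As an illustration, the sketch below checks whether each pure joint action of a 2-player matrix game is a Nash equilibrium, using the Battle of Sexes payoffs above; the representation and function names are illustrative choices, not from the talk.

    # Minimal sketch: enumerate pure Nash equilibria of a 2-player matrix game.
    # Payoffs are dicts mapping (row_action, col_action) -> (row_reward, col_reward).

    BATTLE_OF_SEXES = {
        ("F", "F"): (2, 1), ("F", "T"): (0, 0),
        ("T", "F"): (0, 0), ("T", "T"): (1, 2),
    }

    def pure_nash_equilibria(game):
        rows = sorted({a for a, _ in game})
        cols = sorted({b for _, b in game})
        equilibria = []
        for a in rows:
            for b in cols:
                r_row, r_col = game[(a, b)]
                # Best-response conditions: no unilateral deviation improves the deviator's reward.
                row_ok = all(game[(a2, b)][0] <= r_row for a2 in rows)
                col_ok = all(game[(a, b2)][1] <= r_col for b2 in cols)
                if row_ok and col_ok:
                    equilibria.append((a, b))
        return equilibria

    print(pure_nash_equilibria(BATTLE_OF_SEXES))  # [('F', 'F'), ('T', 'T')]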
Nash equilibria (of the example games above)

• Coordination: two pure equilibria, (a, c) and (b, d)
• Battle of Sexes: two pure equilibria, (Foot, Foot) and (Theater, Theater), plus a mixed equilibrium
• Competition (zero-sum): a unique mixed equilibrium with p(a) = 1/4 and p(c) = 1/2
• Prisoner's Dilemma: a unique pure equilibrium, (Defect, Defect)
• Stackelberg: a unique pure equilibrium, (b, c)
• Chicken: two pure equilibria, (Chicken, Dare) and (Dare, Chicken), plus a mixed equilibrium
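The mixed equilibrium of a 2x2 game can be recovered from the indifference conditions: each player mixes so that the opponent is indifferent between its two actions. A small sketch for the Competition game above (variable names are mine):

    # Minimal sketch: mixed equilibrium of the 2x2 zero-sum "Competition" game
    # via indifference conditions. The payoff layout follows the matrix above.

    # Row player's rewards m1 and column player's rewards m2, indexed [row][col].
    m1 = [[-3, 0], [-1, -2]]   # rows a, b; columns c, d
    m2 = [[ 3, 0], [ 1,  2]]

    # Column player puts probability q on c so that the row player is indifferent:
    #   m1[a][c]*q + m1[a][d]*(1-q) = m1[b][c]*q + m1[b][d]*(1-q)
    q = (m1[1][1] - m1[0][1]) / (m1[0][0] - m1[0][1] - m1[1][0] + m1[1][1])

    # Row player puts probability p on a so that the column player is indifferent:
    #   m2[a][c]*p + m2[b][c]*(1-p) = m2[a][d]*p + m2[b][d]*(1-p)
    p = (m2[1][1] - m2[1][0]) / (m2[0][0] - m2[1][0] - m2[0][1] + m2[1][1])

    print(p, q)   # 0.25 0.5, i.e. p(a) = 1/4 and p(c) = 1/2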
Correlated Equilibrium (1/3)

• Correlated Equilibrium (CE) (Aumann 1974)
  – A probability distribution D over joint actions
  – The players know D
  – A joint action is drawn from D by a "referee" (or "device")
  – The referee privately tells each player its own component of the joint action
  – If no player wants to deviate from its recommended action, then D is a CE
• Properties
  – Every mixed NE is a CE
  – Regret-minimization methods converge to the set of CE
  – A CE can yield higher expected rewards than any NE (see the examples below)
  – The players coordinate (or cooperate) through the referee
Correlated Equilibrium (2/3)

• Battle of Sexes

              Foot    Theater
    Foot     (2, 1)   (0, 0)
    Theater  (0, 0)   (1, 2)

• Nash equilibria
  – Pure: (F, F) and (T, T)
  – Mixed: the row player plays Foot with probability 2/3, the column player with probability 1/3
  – Expected rewards = (2/3, 2/3)
• Correlated equilibrium
  – { ((F,F), 1/2), ((T,T), 1/2) }
  – Expected rewards = (3/2, 3/2)
Correlated Equilibrium (3/3)

• Chicken game

              Chicken   Dare
    Chicken   (6, 6)    (2, 7)
    Dare      (7, 2)    (0, 0)

• Nash equilibria
  – Pure: (D, C) and (C, D)
  – Mixed: each player plays Chicken with probability 2/3
  – Expected rewards = (14/3, 14/3)
• Correlated equilibrium
  – { ((C,C), 1/3), ((C,D), 1/3), ((D,C), 1/3) }
  – Expected rewards = (5, 5)
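A quick sanity check of the Chicken distribution above: the sketch verifies the CE incentive constraints (conditioned on its recommended action, no player gains by deviating) and recomputes the expected rewards of (5, 5). Names are illustrative.

    # Minimal sketch: verify that a distribution over joint actions is a
    # correlated equilibrium of the Chicken game and compute expected rewards.

    CHICKEN = {  # (row_action, col_action) -> (row_reward, col_reward)
        ("C", "C"): (6, 6), ("C", "D"): (2, 7),
        ("D", "C"): (7, 2), ("D", "D"): (0, 0),
    }
    D = {("C", "C"): 1/3, ("C", "D"): 1/3, ("D", "C"): 1/3, ("D", "D"): 0.0}
    ACTIONS = ("C", "D")

    def is_correlated_equilibrium(game, dist, eps=1e-9):
        # Row player: for each recommended action a, deviating to a2 must not help.
        for a in ACTIONS:
            for a2 in ACTIONS:
                gain = sum(dist[(a, b)] * (game[(a2, b)][0] - game[(a, b)][0]) for b in ACTIONS)
                if gain > eps:
                    return False
        # Column player: symmetric check on its recommended action b.
        for b in ACTIONS:
            for b2 in ACTIONS:
                gain = sum(dist[(a, b)] * (game[(a, b2)][1] - game[(a, b)][1]) for a in ACTIONS)
                if gain > eps:
                    return False
        return True

    expected = tuple(sum(D[ja] * CHICKEN[ja][i] for ja in D) for i in range(2))
    print(is_correlated_equilibrium(CHICKEN, D), expected)   # True (5.0, 5.0)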
MAL algorithms (outline)

• Equilibria learning
• Best-response learning
• Dealing with cooperation and competition
• Regret-oriented algorithms
• Leader algorithms
Equilibria learning (1/4)

• Minimax-Q (Littman 1994)
  – Action values are based on joint actions:

        V_i(s) = max_{a ∈ A_i} min_{o ∈ A_{-i}} Q_i(s, (a, o))

    where a is the learning agent's action and o is the opponent's action
  – The agent can observe the opponent's actions
  – The values do not depend on the opponents' strategies
  – Converges to the game-theoretic optimal value in 2-player zero-sum games
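A sketch of the value computation, in the pure-strategy max-min form written above (Littman's original Minimax-Q actually maximizes over mixed strategies via linear programming; this simpler form is what the slide states). The Q-table layout is my own illustrative choice.

    # Minimal sketch: the max-min value used by Minimax-Q, in the pure-strategy
    # form shown on the slide. Q[a][o] holds the learning agent's action value
    # for its action a against the opponent's action o.

    def maxmin_value(Q):
        # For each of our actions, assume the opponent replies with the worst case,
        # then pick the action whose worst case is best.
        return max(min(row) for row in Q)

    # Example Q-table (illustrative numbers, not from the talk).
    Q = [
        [ 1.0, -2.0,  0.5],   # action 0 against opponent actions 0, 1, 2
        [ 0.0,  0.5, -1.0],   # action 1
        [-0.5,  0.2,  0.3],   # action 2
    ]
    print(maxmin_value(Q))    # -0.5: action 2 has the best worst-case value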
Equilibria learning (2/4)

• Nash-Q (Hu & Wellman 1998)
  – Minimax-Q extended to 2-player general-sum games:

        V_1(s) = Nash(Q_1(s), Q_2(s)) = Q_1(s, π_1(s), π_2(s))

    where (π_1(s), π_2(s)) is a Nash equilibrium of the matrix game (Q_1(s), Q_2(s))
  – Nash-Q computes mixed strategies
  – Converges only under restrictive conditions:
    • There exists only one equilibrium for the entire game and for every game defined by the Q functions during learning
    • All equilibria must be of the same type: adversarial or coordination
Equilibria learning (3/4)

Friend-and-Foe-Q (FF-Q) (Littman 2001)

Converges with less restrictive conditions than Nash-Q, but needs to class
the opponent as “friend” or “foe”

Two algorithms, one per class of opponents

 Foe opponent (Foe-Q) :
Vi ( s ) = max min Qi ( s, ( a, o))
 Friend opponent (Friend-Q) :
Vi ( s ) =
a∈Ai
o∈A−i
max
( a , o )∈Ai ×A−i
Qi ( s, ( a, o))
How to class the opponents ?
 Not answered in Littman paper
Bruno Bouzy
Multi-agent learning and repeated matrix games
17
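A tiny sketch contrasting the two FF-Q value operators, reusing the same Q-table layout as before (illustrative, not the full FF-Q algorithm):

    # Minimal sketch: the two FF-Q value operators on a Q-table Q[a][o].
    def foe_value(Q):
        # Foe-Q: the opponent minimizes our value -> max-min.
        return max(min(row) for row in Q)

    def friend_value(Q):
        # Friend-Q: the opponent helps maximize our value -> max over joint actions.
        return max(max(row) for row in Q)

    Q = [[1.0, -2.0, 0.5], [0.0, 0.5, -1.0], [-0.5, 0.2, 0.3]]
    print(foe_value(Q), friend_value(Q))   # -0.5 1.0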
Equilibria learning (4/4)

• Correlated-Q (CE-Q) (Greenwald & Hall 2003)
  – Generalization of Nash-Q and FF-Q
  – Allows several equilibria in the game
  – Looks for correlated equilibria
  – Four equilibrium-selection criteria => 4 CE types => 4 algorithms:
    • "utilitarian" (uCE-Q)
    • "egalitarian" (eCE-Q)
    • "republican" (rCE-Q)
    • "libertarian" (lCE-Q)
  – Equilibria are computed with linear programming
  – The algorithm must also estimate the opponents' Q-values
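To make the linear-programming step concrete, here is a sketch that computes a "utilitarian" correlated equilibrium (the CE maximizing the sum of expected rewards), in the spirit of uCE-Q's selection step. It assumes scipy is available; the game encoding and all names are illustrative, and the example reuses the Chicken payoffs.

    # Minimal sketch: utilitarian CE of a 2-player matrix game via an LP.
    import numpy as np
    from scipy.optimize import linprog

    def utilitarian_ce(r1, r2):
        """r1, r2: (nA, nB) reward arrays for the row and column player."""
        nA, nB = r1.shape
        n = nA * nB                                  # one variable per joint action
        idx = lambda a, b: a * nB + b
        c = -(r1 + r2).reshape(n)                    # maximize the total reward
        A_ub, b_ub = [], []
        for a in range(nA):                          # row player's incentive constraints
            for a2 in range(nA):
                if a2 == a: continue
                row = np.zeros(n)
                for b in range(nB):
                    row[idx(a, b)] = r1[a2, b] - r1[a, b]
                A_ub.append(row); b_ub.append(0.0)
        for b in range(nB):                          # column player's incentive constraints
            for b2 in range(nB):
                if b2 == b: continue
                row = np.zeros(n)
                for a in range(nA):
                    row[idx(a, b)] = r2[a, b2] - r2[a, b]
                A_ub.append(row); b_ub.append(0.0)
        A_eq, b_eq = [np.ones(n)], [1.0]             # probabilities sum to 1
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
        return res.x.reshape(nA, nB)

    r1 = np.array([[6, 2], [7, 0]]); r2 = np.array([[6, 7], [2, 0]])
    print(utilitarian_ce(r1, r2))    # roughly [[0.5, 0.25], [0.25, 0.0]] for Chicken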
From equilibria to best-response

• The previous algorithms focus on asymptotically reaching an equilibrium in self-play
• Finding an equilibrium is not always the answer:
  – All agents have to play the equilibrium
  – Bounded rationality
    • One agent may not play the equilibrium
  – Multiple equilibria
    • Agents have to coordinate on the same equilibrium
• Long-term best-response play
  – Learn strategies that play the best response to the opponent's observed strategy
Best-response learning

• Learn to play the best response to the opponent's observed behavior
• Naïve approach
  – Single-agent learning
  – Take into account the possibility that other agents' strategies might change
• Two representatives
  – Q-learning
  – Fictitious play
Q-learning (Watkins 1989, 1992)

• Single-agent learning
• The Q-learning update:

      Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

  and the greedy policy:

      π(s) = argmax_{a ∈ A} Q(s, a)

• Q-learning is guaranteed to converge in stationary environments if:
  – every state-action pair continues to be visited,
  – and the learning rate is decreased appropriately over time
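A compact sketch of the tabular update with ε-greedy action selection; the class name, interface, and hyperparameter values are illustrative.

    # Minimal sketch of tabular Q-learning with epsilon-greedy action selection.
    import random
    from collections import defaultdict

    class QLearner:
        def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
            self.Q = defaultdict(float)          # (state, action) -> value
            self.actions = actions
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

        def act(self, state):
            if random.random() < self.epsilon:   # explore
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.Q[(state, a)])  # exploit

        def update(self, state, action, reward, next_state):
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            best_next = max(self.Q[(next_state, a)] for a in self.actions)
            td_target = reward + self.gamma * best_next
            self.Q[(state, action)] += self.alpha * (td_target - self.Q[(state, action)])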
Fictitious play (FP) (Brown 1951)

• The agent observes the time-average frequencies of the other players' action choices and models:

      prob(a_k) = (# times a_k observed) / (total # observations)

• The agent then plays a best response to this model
• Converges in two-player zero-sum games
• When it converges, the convergence point is a Nash equilibrium
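A sketch of fictitious play for the row player of a 2-player matrix game: keep counts of the opponent's actions and best-respond to the resulting empirical mixed strategy. The payoff representation matches the earlier examples; the rest is an illustrative implementation.

    # Minimal sketch of fictitious play for the row player.
    # game maps (row_action, col_action) -> (row_reward, col_reward).

    class FictitiousPlayer:
        def __init__(self, my_actions, opp_actions, game):
            self.my_actions, self.opp_actions, self.game = my_actions, opp_actions, game
            self.counts = {o: 1 for o in opp_actions}   # start from uniform pseudo-counts

        def act(self):
            total = sum(self.counts.values())
            model = {o: c / total for o, c in self.counts.items()}   # empirical opponent mix
            # Best response: maximize expected reward against the empirical model.
            def expected(a):
                return sum(model[o] * self.game[(a, o)][0] for o in self.opp_actions)
            return max(self.my_actions, key=expected)

        def observe(self, opp_action):
            self.counts[opp_action] += 1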
Previous experimental works

• MALT (Zawadzki 2005)
  – Q-learning (QL) is better than Nash-based algorithms
• GAMUT (Nudelman et al. 2004)
  – Specific representative RMGs
• (Airiau 2007)
  – FP is ranked first
Cooperation and competition (CC 1/3)

• Prisoner's Dilemma

             Coop    Defect
    Coop    (3, 3)   (1, 4)
    Defect  (4, 1)   (2, 2)

• Nash equilibria
  – Pure: (D, D)
  – Mixed: none
  – Expected reward = 2
• A Pareto-optimal state that is better than (D, D)
  – (C, C)
  – Expected reward = 3
• Best-response algorithms and regret-oriented algorithms
  – Converge to (D, D)
S algorithm (CC 2/3)

• Satisficing (S) algorithm (Stimpson & Goodrich 2003)
• At instant t:
  – Receive the reward R(t)
  – Select the next action depending on agent i's "aspiration":
    • If R(t) >= Aspiration(i, t), then Action(i, t+1) = Action(i, t)
    • Else Action(i, t+1) = random choice
  – Update the aspiration:
    • Aspiration(i, t+1) = λ Aspiration(i, t) + (1 − λ) R(t)
    • Aspiration(i, 0) = Rmax
• In self-play, the S algorithm converges to Pareto-optimal states
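A sketch of the satisficing rule above; λ and the initial aspiration follow the slide, everything else is illustrative.

    # Minimal sketch of the satisficing (S) rule: keep the current action while the
    # reward meets the aspiration level, otherwise switch to a random action.
    import random

    class SatisficingAgent:
        def __init__(self, actions, r_max, lam=0.99):
            self.actions = actions
            self.aspiration = r_max              # Aspiration(i, 0) = Rmax
            self.lam = lam
            self.action = random.choice(actions)

        def act(self):
            return self.action

        def observe(self, reward):
            if reward < self.aspiration:         # unsatisfied -> explore
                self.action = random.choice(self.actions)
            # Aspiration(i, t+1) = lambda * Aspiration(i, t) + (1 - lambda) * R(t)
            self.aspiration = self.lam * self.aspiration + (1 - self.lam) * reward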
M-Qubed (CC 3/3)

• (Crandall & Goodrich 2005)
• Combination of the S algorithm and Q-learning
• Uses Max or MiniMax (M3) strategies according to the situation (friendly or hostile opponent)
• Goals reached
  – Max level against friendly agents
  – Security level against hostile agents
Regret-oriented algorithms

• HMC (Hart & Mas-Colell 2000)
  – Regret of action A over action B (internal regret)
  – The probability of playing an action is linear in its regret over the last played action
  – Convergence to the set of correlated equilibria
• Bandit algorithms: UCB (Auer et al. 2002)
• Exp3 and its derivatives
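A sketch in the spirit of Hart & Mas-Colell's regret matching: play each action with probability proportional to its positive cumulative regret. This is the simpler unconditional form, shown only to illustrate the idea; the HMC procedure on the slide uses regrets conditional on the last played action.

    # Minimal sketch of regret matching: action probabilities proportional to
    # positive cumulative regrets.
    import random

    class RegretMatcher:
        def __init__(self, n_actions):
            self.n = n_actions
            self.regret = [0.0] * n_actions      # cumulative regret per action

        def act(self):
            positive = [max(r, 0.0) for r in self.regret]
            total = sum(positive)
            if total <= 0.0:
                return random.randrange(self.n)  # no regret yet: play uniformly
            return random.choices(range(self.n), weights=positive)[0]

        def observe(self, reward_vector, played):
            # reward_vector[a] = reward we would have got by playing a this round
            # (computable in a matrix game once the opponent's action is observed).
            for a in range(self.n):
                self.regret[a] += reward_vector[a] - reward_vector[played]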
Exp3 (Auer et al. 2002)

• "Exponential-weight algorithm for Exploration and Exploitation"
• Designed for non-stochastic (adversarial) multi-armed bandit problems
• Action selection scheme: the selection probability of action i is

      p_i(t) = (1 − γ) w_i(t) / Σ_j w_j(t) + γ / K

  where γ is the mixture (exploration) rate and K the number of actions
• The weights w_i are updated multiplicatively from the observed rewards
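A sketch of Exp3 with the probability rule above; the weight update (an importance-weighted exponential bump of the played arm) follows the standard Auer et al. formulation, with rewards assumed to be rescaled into [0, 1].

    # Minimal sketch of Exp3 for K arms, assuming rewards rescaled into [0, 1].
    import math
    import random

    class Exp3:
        def __init__(self, K, gamma=0.1):
            self.K, self.gamma = K, gamma
            self.w = [1.0] * K

        def probabilities(self):
            total = sum(self.w)
            # p_i = (1 - gamma) * w_i / sum_j w_j + gamma / K
            return [(1 - self.gamma) * wi / total + self.gamma / self.K for wi in self.w]

        def act(self):
            return random.choices(range(self.K), weights=self.probabilities())[0]

        def observe(self, arm, reward):
            p = self.probabilities()[arm]
            x_hat = reward / p                   # importance-weighted reward estimate
            self.w[arm] *= math.exp(self.gamma * x_hat / self.K)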
UCB (Auer et al. 2002)

• Selects an action from a set of actions, balancing exploitation and exploration
• Selection rule:

      Action = argmax_i ( m(i) + sqrt( log(T) / t(i) ) )

  where m(i) is the empirical mean reward of action i, t(i) its play count, and T the total number of plays
• Used in UCT for computer Go and Amazons
  – UCT = UCB applied to Trees
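A sketch of the selection rule above (the constant inside the square root varies across UCB variants; this follows the slide's formula, and unplayed actions are tried first):

    # Minimal sketch of UCB action selection: argmax_i m(i) + sqrt(log(T) / t(i)).
    import math

    class UCB:
        def __init__(self, n_actions):
            self.counts = [0] * n_actions        # t(i)
            self.means = [0.0] * n_actions       # m(i)

        def act(self):
            T = sum(self.counts)
            for i, c in enumerate(self.counts):
                if c == 0:
                    return i                     # play every action once before using the bound
            ucb = [m + math.sqrt(math.log(T) / c) for m, c in zip(self.means, self.counts)]
            return max(range(len(ucb)), key=ucb.__getitem__)

        def observe(self, action, reward):
            self.counts[action] += 1
            # incremental update of the empirical mean m(i)
            self.means[action] += (reward - self.means[action]) / self.counts[action]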
Leader algorithms

• Implicit negotiation in repeated games (Littman & Stone 2001)
• Bully
  – Be a leader: commit to an action and let the opponent adapt
  – Assume the opponent is a best-response algorithm
  – Action = argmax_i m1(i, J*(i)), where J*(i) = argmax_j m2(i, j)
• Example (Stackelberg game; row rewards m1, column rewards m2)

            a        b
    a    (1, 0)   (3, 2)
    b    (2, 1)   (4, 0)
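A sketch of the Bully rule on the matrix above: for each of its actions i, Bully assumes the follower will best-respond with J*(i) and keeps the action whose induced outcome is best for itself. On this game Bully commits to a, the follower's best response is b, and the outcome (3, 2) beats the (2, 1) that mutual best-response play would settle on.

    # Minimal sketch of the Bully leader rule on the Stackelberg matrix above.
    # m1[i][j] is the leader's (row) reward, m2[i][j] the follower's (column) reward.

    m1 = [[1, 3], [2, 4]]
    m2 = [[0, 2], [1, 0]]

    def bully_action(m1, m2):
        n_rows, n_cols = len(m1), len(m1[0])
        def follower_reply(i):                       # J*(i) = argmax_j m2(i, j)
            return max(range(n_cols), key=lambda j: m2[i][j])
        # Action = argmax_i m1(i, J*(i))
        return max(range(n_rows), key=lambda i: m1[i][follower_reply(i)])

    i = bully_action(m1, m2)
    j = max(range(2), key=lambda j: m2[i][j])
    print(i, j, (m1[i][j], m2[i][j]))   # 0 1 (3, 2): Bully plays a, follower replies b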
Discussion

• Theoretical MAL criteria
• Experimental evaluation criteria
• Some experimental results
Theoretical MAL criteria

• Bowling and Veloso (2001) proposed criteria for MAL algorithms; learning should:
  – (1) always converge to a stationary policy
  – (2) only terminate with a best response to the play of the other agents
• Conitzer and Sandholm (2003) added another criterion:
  – (3) converge to a Nash equilibrium in self-play
Experimental evaluation in the NCP agenda

• Players' confrontations
  – One-to-one confrontations
  – All-against-all tournaments made of one-to-one confrontations
• Set of games
  – Cooperative games
  – Competitive games
  – General-sum games
  – GAMUT
Tournament settings

• Parameters
  – #actions = 3
  – −9 <= R <= +9
  – #repetitions = 100,000
  – #MG = 100
• For each matrix game (MG), an all-against-all tournament is set up
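A sketch of the evaluation protocol: for each sampled matrix game, every pair of algorithms plays a long repeated game and cumulative rewards are averaged. Everything here (the agent interface, the random-game generator) is an illustrative stand-in for the actual experimental code.

    # Minimal sketch of an all-against-all tournament on repeated matrix games.
    import random
    from itertools import combinations
    from collections import defaultdict

    def random_matrix_game(n_actions=3, r_min=-9, r_max=9):
        return {(i, j): (random.randint(r_min, r_max), random.randint(r_min, r_max))
                for i in range(n_actions) for j in range(n_actions)}

    def play_repeated_game(agent1, agent2, game, repetitions=100_000):
        total1 = total2 = 0
        for _ in range(repetitions):
            a1, a2 = agent1.act(), agent2.act()
            r1, r2 = game[(a1, a2)]
            agent1.observe(a1, a2, r1)           # own action, opponent action, own reward
            agent2.observe(a2, a1, r2)
            total1, total2 = total1 + r1, total2 + r2
        return total1 / repetitions, total2 / repetitions

    def tournament(agent_factories, n_games=100):
        scores = defaultdict(float)
        for _ in range(n_games):
            game = random_matrix_game()
            for name1, name2 in combinations(agent_factories, 2):
                a1, a2 = agent_factories[name1](), agent_factories[name2]()
                s1, s2 = play_repeated_game(a1, a2, game)
                scores[name1] += s1
                scores[name2] += s2
        return sorted(scores.items(), key=lambda kv: -kv[1])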
Tournament results on general-sum MG

• Ranking:
  1: M3
  2: S
  3: UCB
  4: FP
  5: Exp3
  6: Bully
  7: HMC
  8: QL
  9: MinMax
  10: Rand
Tournament results on team games

• Ranking:
  1: Exp3
  2: M3
  3: Bully
  4: S
  5: HMC
  6: FP
  7: UCB
  8: QL
  9: MinMax
  10: Rand
Tournament results on zero-sum games

• Ranking:
  1: Exp3
  2: M3
  3: MinMax
  4: FP
  5: S
  6: UCB
  7: QL
  8: HMC
  9: Bully
  10: Rand
Conclusions and future work

• MAL Non-Cooperative Prescriptive Agenda
  – Maximize the cumulative return against many players in many RMG
  – Intersection of game theory, AI, and reinforcement learning
• Experimental results
  – UCB and M3 work well on general-sum games
  – Exp3 works well on both zero-sum games and team games
• Current work: two important features
  – Averaging over the near past and forgetting the far past
  – Using states, where a state corresponds to the previous joint action
• Future work
  – Hedging algorithms, or expert algorithms
  – The algorithm game