Continuation Methods for Structured Games
Ben Blum
Christian Shelton
Daphne Koller
Stanford University
Outline



Game Theory Description
Nash Equilibria: Solutions to Games
How to Find Them: Continuation Methods




Normal Form Games: Govindan and Wilson 2002
Graphical Games
Results for Graphical Games
Multi-Agent Influence Diagrams


Overview of MAIDs
Continuation methods for MAIDs
Game Theory

Normal-form games





Model the joint behavior of multiple agents
All players move once, simultaneously
Each player’s payoff depends on actions of all
others
Representation exponential in number of players
Structured games (graphical games, MAIDs)


Computer science’s contribution to game theory:
exploit structure, independencies
More compact, elegant representations
Nash Equilibria

Strategy profile




Assigns a strategy to every player
A pure strategy chooses one action
A mixed strategy is a distribution over pure
strategies
A Nash equilibrium is a “solution” to the game


A strategy profile where no player can improve his
payoff by unilaterally deviating from his strategy
Not the perfect notion of a solution, but useful
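
The definition above can be checked mechanically for small games. The following is a minimal sketch, not part of the original slides: the function names and the matching-pennies payoffs are illustrative assumptions. It tests only pure deviations, which suffices because a player's expected payoff is linear in that player's own mixed strategy.

```python
from itertools import product

def expected_payoff(payoffs, profile, player):
    """Expected payoff to `player` when everyone follows the mixed `profile`.

    payoffs[joint_action] is a tuple with one payoff per player;
    profile[p][a] is the probability that player p chooses action a.
    """
    total = 0.0
    for joint in product(*(range(len(s)) for s in profile)):
        prob = 1.0
        for p, a in enumerate(joint):
            prob *= profile[p][a]
        total += prob * payoffs[joint][player]
    return total

def is_nash(payoffs, profile, tol=1e-9):
    """True if no player gains more than `tol` by a unilateral pure deviation."""
    for p, strategy in enumerate(profile):
        current = expected_payoff(payoffs, profile, p)
        for a in range(len(strategy)):
            pure = [1.0 if b == a else 0.0 for b in range(len(strategy))]
            deviated = list(profile)
            deviated[p] = pure
            if expected_payoff(payoffs, deviated, p) > current + tol:
                return False
    return True

# Matching pennies (illustrative): player 0 wins on a match, player 1 on a mismatch.
matching_pennies = {
    (0, 0): (1, -1), (0, 1): (-1, 1),
    (1, 0): (-1, 1), (1, 1): (1, -1),
}
uniform = [[0.5, 0.5], [0.5, 0.5]]      # mixed profile
both_heads = [[1.0, 0.0], [1.0, 0.0]]   # pure profile
```

Here `is_nash(matching_pennies, uniform)` holds, while `both_heads` fails because player 1 gains by switching. The enumeration is exponential in the number of players, which is the cost the structured representations aim to avoid.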
Why Compute Equilibria?

Descriptive power


Describes stable outcomes of systems
Useful for testing accuracy of an economic model



See if model’s equilibria correspond to real behavior
Fast computation of equilibria required
Prescriptive power


Choose the way an agent should act
The minimum requirement for an optimal strategy

Lets us prune the continuous space of strategies to a few
discrete possibilities
Big Computational Idea

Govindan and Wilson ‘02:
Continuation method for normal-form games

Perturb game by a vector of bonuses


Pay each player an additional bonus for each one of their
actions independently
If the perturbation is large, the game is easily
solvable

Unique pure equilibrium in which each player chooses
the action with the highest bonus
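
A small brute-force sketch of the perturbation idea, under stated assumptions: the bonus values, the scale factor, and the helper names are hypothetical, and the game is matching pennies, which has no pure equilibrium when unperturbed. Once the scaled bonuses dominate the original payoff differences, the only pure equilibrium is the one where each player picks the action with the highest bonus.

```python
from itertools import product

def perturbed_payoff(payoffs, bonus, lam, joint, player):
    """Payoff to `player` at `joint`, plus lam times the bonus for the player's own action."""
    return payoffs[joint][player] + lam * bonus[player][joint[player]]

def pure_equilibria(payoffs, bonus, lam, n_actions):
    """Enumerate pure-strategy equilibria of the perturbed game by brute force."""
    found = []
    for joint in product(*(range(k) for k in n_actions)):
        stable = True
        for p, k in enumerate(n_actions):
            here = perturbed_payoff(payoffs, bonus, lam, joint, p)
            for a in range(k):
                alt = joint[:p] + (a,) + joint[p + 1:]
                if perturbed_payoff(payoffs, bonus, lam, alt, p) > here:
                    stable = False
        if stable:
            found.append(joint)
    return found

matching_pennies = {
    (0, 0): (1, -1), (0, 1): (-1, 1),
    (1, 0): (-1, 1), (1, 1): (1, -1),
}
# Hypothetical bonus vector: player 0 prefers action 1, player 1 prefers action 0.
bonus = [[0.3, 0.9], [0.7, 0.2]]

unperturbed = pure_equilibria(matching_pennies, bonus, 0.0, (2, 2))
heavily_perturbed = pure_equilibria(matching_pennies, bonus, 10.0, (2, 2))
```

With a scale factor of 10 the bonus gaps (6 and 5) exceed the largest payoff gap (2), so `heavily_perturbed` contains the single profile `(1, 0)`.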
Big Idea: The Picture


Choose a random bonus vector $b$, perturb game $g$ by $g \oplus \lambda b$, $\lambda$ a scale factor
Follow the path back from $(g \oplus \lambda b, \sigma)$ to $(g, \sigma)$

Finds equilibria of $g \oplus \lambda b$ as $\lambda \to 0$
Solutions at vertical axis
Multiple equilibria found with a single ray $b$
Continuation Methods

General framework for satisfying continuous
constraints


Start at easily solved perturbed problem, then
trace the solution back to the original problem
Task Specification


$\lambda$ is the scale factor for perturbation: 1 is fully
perturbed, 0 is unperturbed
Formulate a set of constraints on the solution, $\sigma$, to
the $\lambda$-perturbed problem: $F(\lambda, \sigma) = 0$

$F$ continuous, zero iff $\sigma$ is a solution to the perturbed
problem at $\lambda$.
Continuation Methods

General Methodology

Start with $\sigma_0$ such that $F(1, \sigma_0) = 0$
Take time derivative: $\nabla_\lambda F(\lambda, \sigma)\,\frac{\partial \lambda}{\partial t} + \nabla_\sigma F(\lambda, \sigma)\,\frac{\partial \sigma}{\partial t} = 0$

If $\lambda$ and $\sigma$ change by small amounts in these
directions, $F$ is unchanged
Follow differential system with small discrete steps

Requires inverting the matrix $\nabla_\sigma F(\lambda, \sigma)$ at each
step
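
The methodology can be illustrated on a toy scalar problem. This is a sketch under stated assumptions, not the game constraint function: the target equation, the convex-combination perturbation, and all names are hypothetical. With a single unknown, the matrix inversion reduces to a division, and the scale factor is assumed to decrease monotonically from 1 to 0, which need not hold for paths traced in games.

```python
def f(s):
    """Hypothetical target problem: find s with f(s) = 0."""
    return s**3 + s - 1.0

def F(lam, s, s0):
    """Perturbed constraint: F(1, s0) = 0 trivially, and F(0, s) = f(s)."""
    return (1.0 - lam) * f(s) + lam * (s - s0)

def dF_dlam(lam, s, s0):
    """Partial derivative of F with respect to the scale factor."""
    return -f(s) + (s - s0)

def dF_ds(lam, s, s0):
    """Partial derivative of F with respect to s (a 1x1 'Jacobian')."""
    return (1.0 - lam) * (3.0 * s**2 + 1.0) + lam

def trace(s0, steps=50, newton_iters=3):
    """Follow the solution of F(lam, s) = 0 from lam = 1 down to lam = 0."""
    lam, s = 1.0, s0
    h = 1.0 / steps
    for i in range(steps):
        # Predictor: dF_dlam * dlam + dF_ds * ds = 0, so ds = -(dF_dlam / dF_ds) * dlam.
        ds_dlam = -dF_dlam(lam, s, s0) / dF_ds(lam, s, s0)
        s += ds_dlam * (-h)
        lam = (steps - i - 1) / steps
        # Corrector: a few Newton steps in s at the new, fixed lam.
        for _ in range(newton_iters):
            s -= F(lam, s, s0) / dF_ds(lam, s, s0)
    return s

root = trace(2.0)
```

Starting from the trivially solved problem at `s0 = 2.0`, the trace ends at the real root of `f`, roughly 0.6823. The Newton corrector is an addition for numerical stability; the slides describe only the small discrete steps.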
Continuation Method for Games



Set of constraints, F, based on
homeomorphism between space of games
and their equilibria [Kohlberg & Mertens ’86]
One constraint for every action, $a_p^1, \ldots, a_p^k$,
of each player, $p$

In $\nabla_\sigma F(\lambda, \sigma)$, the entry at $(a_p^i, a_{p'}^j)$ corresponds
to the payoff to player $p$ when $p$ and $p'$
deviate from $\sigma$ by playing $a_p^i$ and $a_{p'}^j$
respectively
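
The deviation payoff described above can be computed directly in a small normal-form game. The sketch below is an illustration only: the three-player coordination game and the function name are assumptions, and it computes just this payoff quantity, not the full constraint function or its Jacobian.

```python
from itertools import product

def deviation_payoff(payoffs, profile, p, fixed):
    """Expected payoff to player p when the players in `fixed` play the given
    pure actions and every other player follows the mixed `profile`.

    fixed maps a player index to the pure action that player deviates to.
    """
    total = 0.0
    for joint in product(*(range(len(s)) for s in profile)):
        if any(joint[q] != a for q, a in fixed.items()):
            continue  # inconsistent with the fixed deviations
        prob = 1.0
        for q in range(len(profile)):
            if q not in fixed:
                prob *= profile[q][joint[q]]
        total += prob * payoffs[joint][p]
    return total

# Hypothetical 3-player game: everyone earns 1 if all three actions match, else 0.
coordination = {
    joint: (1.0, 1.0, 1.0) if len(set(joint)) == 1 else (0.0, 0.0, 0.0)
    for joint in product(range(2), repeat=3)
}
profile = [[0.5, 0.5], [0.5, 0.5], [0.25, 0.75]]

both_play_0 = deviation_payoff(coordination, profile, 0, {0: 0, 1: 0})
mismatch = deviation_payoff(coordination, profile, 0, {0: 0, 1: 1})
```

When players 0 and 1 both deviate to action 0, player 0 is paid only if player 2 also plays 0, which happens with probability 0.25. The loop still visits every joint action, so the cost grows exponentially with the number of players.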
Implementation of GNM



Implemented Govindan and Wilson’s normal
form algorithm (the Global Newton Method)
Much faster than other leading algorithms for
normal-form games
Available on the web under the GNU General Public
License


http://dags.stanford.edu/Games/gametracer.html
Still too slow for large games

Calulcation of F requires exponential time
Graphical Games

How to reduce exponential representation?




Exploit structure!
Only connect players whose payoffs depend on
each other
Each player has a payoff matrix, a function of
neighbors and self only
d1
Representation: O(n  a )
where d is number of
neighbors
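
To make the size difference concrete, the two counts can be compared directly. This is a rough sketch with hypothetical sizes: 36 players, 2 actions each, and 4 neighbors per player are chosen for illustration, and the function names are not from the slides.

```python
def normal_form_size(n, a):
    """Payoff entries when each of n players' payoffs depends on all n actions."""
    return n * a**n

def graphical_size(n, a, d):
    """Payoff entries when each payoff depends on the player and d neighbors."""
    return n * a**(d + 1)

# Hypothetical sizes: n = 36 players, a = 2 actions, d = 4 neighbors each.
full = normal_form_size(36, 2)
compact = graphical_size(36, 2, 4)
```

For these sizes the graphical representation needs 1152 entries, against about 2.5 trillion for the normal form.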
Example Graphical Games

Road game

Landowners along
a road

Grid game

Territorial issues
Solving Graphical Games

Want to find equilibria in graphical games


Graphical structure is a useful way to represent large games
Current exact algorithms are impractical

Too slow
Enforce unreasonable restrictions [Kearns, Littman, Singh ’01]
 2 actions per player
 Tree structure

Approximate equilibria are problematic [Vickrey & Koller ’01]
 No guarantee of an exact equilibrium in the vicinity
 Granularity must be crude for reasonable execution time

Best of both worlds: a general exact algorithm
Our Algorithm

Based on Govindan and Wilson’s



Uses the same continuation method constraint function
Trace solutions along a ray of perturbed games
Efficient computation using structure

Game structure allows us to compute the components of $\nabla F$
locally

If two players aren’t adjacent, payoffs don’t depend on each
other so derivative is zero
Otherwise, can use the local game matrix
Exponential in family size, NOT in game size
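
A minimal sketch of the locality argument, under stated assumptions: the four-player road, the "agree with your neighbors" payoff, and the function name are hypothetical. The expected payoff for one player's action is found by enumerating only that player's neighbors, so the strategies of non-adjacent players never enter the computation.

```python
from itertools import product

def local_expected_payoff(payoff_matrix, neighbors, profile, own_action):
    """Expected payoff for `own_action`, enumerating only the neighbors' actions.

    payoff_matrix[(own_action, neighbor_actions)] is the local payoff;
    profile[q][a] is the probability that player q chooses action a.
    """
    total = 0.0
    for nbr_actions in product(*(range(len(profile[q])) for q in neighbors)):
        prob = 1.0
        for q, a in zip(neighbors, nbr_actions):
            prob *= profile[q][a]
        total += prob * payoff_matrix[(own_action, nbr_actions)]
    return total

# Hypothetical road 0 - 1 - 2 - 3: player 1's payoff is the number of
# neighbors (players 0 and 2) who choose the same action as player 1.
payoff_1 = {
    (a, nb): float(sum(1 for b in nb if b == a))
    for a in range(2)
    for nb in product(range(2), repeat=2)
}
profile = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]

value_action_0 = local_expected_payoff(payoff_1, [0, 2], profile, 0)
value_action_1 = local_expected_payoff(payoff_1, [0, 2], profile, 1)
```

Changing player 3's strategy leaves both values untouched, which corresponds to the zero derivative between non-adjacent players.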
Graphical Game Results


6x6 grid (intractable for most algorithms): 27s
Results for different road sizes:

Equilibrium error


$10^{-4}$ (Vickrey & Koller)
$< 10^{-14}$ (GNM)
Multi-Agent Influence Diagrams


Directed acyclic graph, like a BN
Three types of nodes




Chance nodes: acts of nature
Decision nodes: acts of players
Utility nodes: payoffs for a player, can’t be parents
Multiple agents (players)


Payoff is expected sum of owned utility nodes
Strategies: entries in the CPTs of owned decision
nodes
Tree Killer Example


Alice (dark gray) must decide whether to poison
Bob’s tree to get a better patio view
Bob (light gray) must decide whether to call a tree
doctor
MAIDs vs. Other Games

MAIDs correspond to extensive form games
(game trees)



Different from normal form games: sequential
actions
Different homeomorphism and constraint function
needed
Different strategy representation required
Finding Equilibria in MAIDs

Another continuation method


Based on Govindan’s and Wilson’s extensive form
constraint function
A MAID induces an extensive form game




Exponentially larger than the MAID itself
Induced strategy profiles are just as compact as in the
MAID
We can therefore use the extensive form constraint
function, with all computations done inside the MAID
New compact strategy representation:
non-exclusion probabilities
Non-Exclusion Probabilities

Assumes a game with perfect recall



No player “forgets” anything that he has learned
If $D_2$ comes after $D_1$, then $Pa_{D_1} \cup \{D_1\} \subseteq Pa_{D_2}$
Non-exclusion probability representation:

For player $i$, topologically sort decision variables
$\{D_1^i, D_2^i, \ldots, D_k^i\}$

One non-exclusion probability for each
instantiation of $Pa_{D_k^i} \cup \{D_k^i\}$
For outcome $z$, $p_i(z) = \prod_j \Pr[D_j^i = z(D_j^i) \mid Pa_{D_j^i} = z(Pa_{D_j^i})]$
Non-Exclusion Constraints
Strategy representation now has additional constraints

Decision node CPTs:
A1:  a0 = 0.4 (a1 entry not shown)
A2, by instantiation of (A1, B1):
        a0,b0   a0,b1   a1,b0   a1,b1
  a2    0.5     0.25    (not shown)
  a3    0.5     0.75    (not shown)

Non-exclusion probabilities:
A1:  a0, a1
B1:  b0, b1
A2, by instantiation of (A1, B1):
        a0,b0   a0,b1   a1,b0   a1,b1
  a2    0.2     0.1     (not shown)
  a3    0.2     0.3     (not shown)

Each A2 column sums to the value for its A1 action: 0.2 + 0.2 = 0.1 + 0.3 = 0.4
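
The numbers in the example tables can be reproduced by multiplying a player's own decision probabilities along each path. This is a minimal sketch under stated assumptions: A1 is taken to have no observed parents, the value 0.6 for a1 is inferred as 1 - 0.4 rather than read from the slides, and the function name is hypothetical. Bob's choice b0 or b1 selects which A2 column applies but contributes no factor, since only Alice's own decisions enter her non-exclusion probabilities.

```python
# Decision-node CPT entries for Alice, as read from the example tables.
cpt_A1 = {"a0": 0.4, "a1": 0.6}  # a1 = 0.6 is an assumption (1 - 0.4)
cpt_A2 = {
    ("a0", "b0"): {"a2": 0.5, "a3": 0.5},
    ("a0", "b1"): {"a2": 0.25, "a3": 0.75},
}

def non_exclusion(cpt_first, cpt_second):
    """Multiply the player's own decision probabilities along each path."""
    table = {}
    for (a1, b1), dist in cpt_second.items():
        for a2, prob in dist.items():
            table[(a1, b1, a2)] = cpt_first[a1] * prob
    return table

table = non_exclusion(cpt_A1, cpt_A2)

# Constraint: entries sharing an (A1, B1) instantiation sum to the A1 value.
column_sums = {
    key: sum(v for (a1, b1, _), v in table.items() if (a1, b1) == key)
    for key in cpt_A2
}
```

Both column sums equal 0.4 up to floating-point rounding, which is the additional constraint the non-exclusion representation must satisfy.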
Calculation of Jacobian


One component of F for each non-exclusion
probability
In F, each element is again the payoff to
one player when he and another deviate



This can be calculated by changing the decision
node CPTs (no longer probabilities)
Zero out entries leading to other outcomes
Run “inference” to find reward node expectations
What’s next?

Implement a MAID algorithm




We have a continuation method constraint
function F
Calculation of F can be done with standard
probabilistic inference in the MAID
Calculation of the retraction operator is linear in
the strategy representation, because constraints
are orthogonal; can project onto each one in turn
Now we just have to implement
Conclusions

Applied new methods from economics to
structured games

Graphical games



MAIDs


Fastest general algorithm
Exact
Adapted extensive form theoretical framework to
calculations entirely within the MAID
Software available for download
