Type-based Methods for Interaction in
Multiagent Systems
Stefano Albrecht
Prashant Doshi
University of Texas at Austin
University of Georgia
Tutorial at AAAI-16
February 12, 2016
Introduction
Multiagent Systems
Multiagent Systems
Many applications:
Robotic elderly care
Multi-robot search and rescue
Adaptive user interfaces
Automated trading agents
...
(Image credits: Biomimetic Control Research Center, RIKEN; Team Hector, Technische Universität Darmstadt)
Multiagent Systems
How to interact with other agents?
Reinforcement learning
Classic opponent modelling
Equilibrium solutions
...
Type-based methods
Types
Originally defined as any private knowledge relevant to agent
E.g. payoff function, beliefs
(Harsanyi, 1967)
More generally: complete behaviour specification
Type-based Method
Assume other agents’ types drawn from predefined type space
Type space known or hypothesised
Plan own actions with respect to most likely types
Based on agents’ observed actions
Type-based Method
Type-based Method – Example
(Image credit: RAD Group, The University of Edinburgh)
Type-based Method
Where do types come from?
Domain expert
Problem structure
Historical data
How to implement?
Tree expansion
Trajectory sampling
Partial-order planning
More details later...
Type-based Method – Features
Many useful features...
Can use prior knowledge:
In type space
(constrain type space to plausible types)
In prior beliefs
(focus prior beliefs on most likely types)
Type-based Method – Features
High flexibility:
Traditional opponent modelling restricted to class of
behaviours (e.g. trees, finite state automata)
Type-based method can specify any behaviours as types
Plus: can include opponent modelling as special type
[Figure: class of behaviours vs. individual types]
Type-based Method – Features
Fast learning:
Few parameters to learn (beliefs)
Often rapid belief convergence
Efficient planning:
Use types to plan in unseen state regions
No explicit exploration needed
(exploration implicit – more details later)
Type-based Method – However...
However...
How to incorporate observations?
When will posterior beliefs be correct?
What impact do prior beliefs have?
What if hypothesised types incorrect?
How to decide if types incorrect?
...
Research on Type-based Methods
Long history:
Game Theory (since 1960’s)
Artificial Intelligence (since 1990’s)
Studied in different ways:
Types used by one agent or all agents
Focus on equilibrium solution or best-response
Observe states/actions directly or indirectly
Solve model offline or online
Multiagent Interaction without Prior Coordination
Recent focus of type-based method on
Multiagent Interaction without Prior Coordination (MIPC)
E.g. ad hoc teamwork (Stone et al., 2010)
Design autonomous agent which achieves flexible and efficient
interaction in multiagent systems without prior coordination
between controlled agent and other agents.
Multiagent Interaction without Prior Coordination
Workshop series at AAAI:
MIPC 2014, Quebec City, Canada
MIPC 2015, Austin, Texas, USA
MIPC 2016, Phoenix, Arizona, USA
(tomorrow, 9am – 5pm, room 104A)
Special Issue on MIPC at AAMAS Journal
mipc.inf.ed.ac.uk
Tutorial Roadmap
Type-based methods in...
Tutorial Roadmap
Type-based methods in...
Bayesian games
Relax the assumption of perfect knowledge of agents' rewards
Type system
Agent's type: encompasses private information relevant to the agent's behavior
Joint probability distribution over types, which is common knowledge
Bayesian games
In Harsanyi's own words:
"... we can regard the attribute vector ci as representing certain physical, social, and psychological attributes of player i himself in that it summarizes some crucial parameters of player i's own payoff function Ui as well as the main parameters of his beliefs about his social and physical environment ..."
Bayesian games – Example
Two possible payoff matrices (rows: Police Patrol, columns: Criminals; payoffs: police, criminals):

Policing is weak:
            Enter     Stay out
  Enter     0,-1      2,0
  Stay out  2,1       3,0

Policing is strong:
            Enter     Stay out
  Enter     1.5,-1    3.5,0
  Stay out  2,1       3,0

Type space: Θ_Police = {R_Weak, R_Strong}
Bayesian games – Example
(Weak and strong policing matrices as above.)

Let p be the probability that policing is weak. The police's strategies are pairs (action if weak, action if strong). Expected payoffs:

                        Enter                         Stay out
  Enter, Enter          1.5(1-p), -1                  2p + 3.5(1-p), 0
  Enter, Stay out       2(1-p), -p + (1-p)            2p + 3(1-p), 0
  Stay out, Enter       2p + 1.5(1-p), p - (1-p)      3p + 3.5(1-p), 0
  Stay out, Stay out    2, 1                          3, 0
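As a quick check of the table above, a minimal Python sketch (not part of the tutorial) that computes the expected payoff matrix over the police's type-contingent strategies for a given p; the payoff entries are taken from the weak/strong matrices above, the names and structure are illustrative.

```python
from itertools import product

# Payoff pairs (police, criminals) indexed by (police action, criminal action);
# actions: 0 = Enter, 1 = Stay out
WEAK   = {(0, 0): (0.0, -1.0), (0, 1): (2.0, 0.0), (1, 0): (2.0, 1.0), (1, 1): (3.0, 0.0)}
STRONG = {(0, 0): (1.5, -1.0), (0, 1): (3.5, 0.0), (1, 0): (2.0, 1.0), (1, 1): (3.0, 0.0)}

def expected_matrix(p):
    """Expected payoffs for police strategies (action-if-weak, action-if-strong) vs criminal actions."""
    table = {}
    for strat in product((0, 1), repeat=2):          # (a_weak, a_strong)
        for a_c in (0, 1):                           # criminal action
            u_pol = p * WEAK[(strat[0], a_c)][0] + (1 - p) * STRONG[(strat[1], a_c)][0]
            u_cri = p * WEAK[(strat[0], a_c)][1] + (1 - p) * STRONG[(strat[1], a_c)][1]
            table[(strat, a_c)] = (round(u_pol, 3), round(u_cri, 3))
    return table

for key, val in expected_matrix(p=0.4).items():
    print(key, val)
```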
Bayesian games – Example

(Weak and strong policing matrices as above.)

For all p ≥ 0, (Enter, Enter) and (Enter, Stay out) are dominated:

                        Enter                         Stay out
  Enter, Enter          1.5(1-p), -1                  2p + 3.5(1-p), 0
  Enter, Stay out       2(1-p), -p + (1-p)            2p + 3(1-p), 0
  Stay out, Enter       2p + 1.5(1-p), p - (1-p)      3p + 3.5(1-p), 0
  Stay out, Stay out    2, 1                          3, 0
Bayesian games – Example

(Weak and strong policing matrices as above.)

For all p ≥ 0, (Enter, Enter) and (Enter, Stay out) are dominated, so the game collapses into:

                        Enter                         Stay out
  Stay out, Enter       2p + 1.5(1-p), p - (1-p)      3p + 3.5(1-p), 0
  Stay out, Stay out    2, 1                          3, 0
Bayesian games – Example

(Weak and strong policing matrices as above.)

                        Enter                 Stay out
  Stay out, Enter       1.5 + 0.5p, 2p - 1    3.5 - 0.5p, 0
  Stay out, Stay out    2, 1                  3, 0

For p > 0.5, Enter is a dominant action for the criminals and {(Stay out, Stay out), Enter} is a Nash equilibrium
For p ≤ 0.5, {(Stay out, Stay out), Enter} and {(Stay out, Enter), Stay out} are Nash equilibria
Bayesian games – Example

(Weak and strong policing matrices as above.)

                        Enter                 Stay out
  Stay out, Enter       1.5 + 0.5p, 2p - 1    3.5 - 0.5p, 0
  Stay out, Stay out    2, 1                  3, 0

Mixed equilibrium: let x be the probability that the criminals enter.
EU(Stay out, Enter) = (1.5 + 0.5p)x + (1 - x)(3.5 - 0.5p) = 3.5 - 0.5p + x(p - 2)
EU(Stay out, Stay out) = 2x + 3(1 - x) = 3 - x
The police are indifferent when 3.5 - 0.5p + x(p - 2) = 3 - x, i.e. x = 1/2
Bayesian games – Example

(Weak and strong policing matrices as above.)

                        Enter                 Stay out
  Stay out, Enter       1.5 + 0.5p, 2p - 1    3.5 - 0.5p, 0
  Stay out, Stay out    2, 1                  3, 0

Let y be the probability that the police play (Stay out, Enter).
EU(Enter) = (2p - 1)y + 1(1 - y) = (2p - 2)y + 1
EU(Stay out) = 0
The criminals are indifferent when 1 + y(2p - 2) = 0, i.e. y = 1 / (2(1 - p))
Bayesian games – Example

(Weak and strong policing matrices as above.)

3 Bayesian Nash equilibria:
  {(Stay out, Stay out), Enter} for any p
  {(Stay out, Enter), Stay out} if p ≤ 0.5
  { ( 1/(2(1-p)), (1-2p)/(2(1-p)) ), ( 1/2, 1/2 ) } if p ≤ 0.5
  (police mix over (Stay out, Enter) and (Stay out, Stay out); criminals mix over Enter and Stay out)
Bayesian games
In general, a strategy profile {π_i, π_j} is a Bayesian Nash equilibrium if, for each agent i and each of its types θ_i,
  π_i(θ_i) = argmax_{a_i ∈ A_i} Σ_{θ_j ∈ Θ_j} R_{θ_i}(a_i, π_j(θ_j)) p(θ_i, θ_j)
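A minimal sketch of the best-response computation inside this definition, assuming finite action and type sets; the dictionary-based interface for R, π_j and p is an illustrative choice, not the tutorial's.

```python
def best_response(theta_i, A_i, Theta_j, R, pi_j, p):
    """arg max_{a_i} sum_{theta_j} R[theta_i](a_i, pi_j[theta_j]) * p[(theta_i, theta_j)].

    R    : dict mapping theta_i -> payoff function (a_i, a_j) -> float
    pi_j : dict mapping theta_j -> action of agent j
    p    : dict mapping (theta_i, theta_j) -> joint probability
    """
    def value(a_i):
        return sum(R[theta_i](a_i, pi_j[theta_j]) * p[(theta_i, theta_j)]
                   for theta_j in Theta_j)
    return max(A_i, key=value)

# Tiny illustrative use (hypothetical payoffs, not from the tutorial):
R = {"t1": lambda a_i, a_j: 1.0 if a_i == a_j else 0.0}
pi_j = {"u1": "left", "u2": "right"}
p = {("t1", "u1"): 0.7, ("t1", "u2"): 0.3}
print(best_response("t1", ["left", "right"], ["u1", "u2"], R, pi_j, p))  # -> "left"
```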
Repeated games
In game theory, two models of decision-making in repeated interactions are popular:
Fictitious play
Rational learning
Repeated games – Fictitious play
Simplest model of decision-making in repeated games
At each stage, an agent ascribes a mixed strategy to the other, b_i(a_j)
Other agent is assumed to act according to this mixed strategy
The strategy is computed by maintaining a frequency count of the other's previous actions:
  F^t(a_j) = F^{t-1}(a_j) + 1 if a_j^{t-1} = a_j, and F^t(a_j) = F^{t-1}(a_j) otherwise
  b_i^t(a_j) = F^t(a_j) / Σ_{a_j' ∈ A_j} F^t(a_j')
Agent computes its best response to the mixed strategy of the other
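A minimal fictitious-play sketch in Python following the frequency-count update above; the game and initial counts are chosen to match the coordination-game example on the next slides, but are otherwise arbitrary.

```python
import numpy as np

# Coordination game from the example: actions 0 = Enter, 1 = Stay out
# U[i][a1][a2] = payoff to player i when player 1 plays a1 and player 2 plays a2
U = [np.array([[0, 1], [1, 0]]), np.array([[0, 1], [1, 0]])]

def fictitious_play(rounds=10):
    # counts[i][a] = player i's count of the OTHER player's past plays of action a
    counts = [np.array([1.0, 0.5]), np.array([1.0, 0.5])]   # initial counts as in the example
    for t in range(1, rounds + 1):
        beliefs = [c / c.sum() for c in counts]
        # Each player best-responds to the empirical mixed strategy of the other
        a1 = int(np.argmax(U[0] @ beliefs[0]))     # player 1's expected payoff per own action
        a2 = int(np.argmax(beliefs[1] @ U[1]))     # player 2's expected payoff per own action
        counts[0][a2] += 1                         # player 1 observes player 2's action
        counts[1][a1] += 1                         # player 2 observes player 1's action
        print(t, ["Enter", "Stay out"][a1], ["Enter", "Stay out"][a2], counts[0], counts[1])

fictitious_play()
```

Running this reproduces the alternating play and belief counts shown in the example table.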
Fictitious play – Example
                      Police patrol 2
                      Enter    Stay out
Police     Enter      0,0      1,1
patrol 1   Stay out   1,1      0,0

Coordination game: 2 pure strategy Nash equilibria and one mixed strategy Nash equilibrium
  {Enter, Stay out}
  {Stay out, Enter}
  { (0.5, 0.5), (0.5, 0.5) }
Fictitious play – Example
(Coordination game as above.)

Round   Patrol 1    Patrol 2    1's belief    2's belief
0       -           -           (1, 0.5)      (1, 0.5)
1       Stay out    Stay out    (1, 1.5)      (1, 1.5)
2       Enter       Enter       (2, 1.5)      (2, 1.5)
3       Stay out    Stay out    (2, 2.5)      (2, 2.5)
4       Enter       Enter       (3, 2.5)      (3, 2.5)
...     ...         ...         ...           ...

(Beliefs are frequency counts of the other's actions, as (Enter count, Stay out count).)
Fictitious play – Example
(Same game and play sequence as above.)
The empirical action frequencies converge to the mixed strategy Nash equilibrium!
Fictitious play
Interesting properties
  If an action vector is a strict Nash equilibrium of a stage game, it is a steady state of fictitious play in the repeated game
  If the empirical distribution of each agent's strategies converges in fictitious play, then it converges to a Nash equilibrium
  Fictitious play in repeated games converges if the game is a 2x2 game with generic payoffs or is a zero-sum game
Rational learning
Kalai and Lehrer in 1993 published a
seminal paper on learning in repeated
games
Framework: Each agent knows its own strategy and maintains a belief over other agents' strategies (a belief over pure strategies is equivalent to a behavioral strategy (Kuhn '53))
Agents update their beliefs using the Bayesian belief update
An agent's type is a behavioral strategy
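A minimal sketch of this Bayesian belief update over a finite set of candidate behavioral strategies (a simplification of the rational-learning setting); the candidate strategies below are hypothetical.

```python
def bayes_update(belief, strategies, history, observed_action):
    """belief: dict strategy_name -> prior probability.
    strategies: dict strategy_name -> function(history) -> dict action -> probability.
    Returns the posterior after observing `observed_action` given `history`."""
    posterior = {}
    for name, prior in belief.items():
        likelihood = strategies[name](history).get(observed_action, 0.0)
        posterior[name] = prior * likelihood
    total = sum(posterior.values())
    if total == 0.0:   # observation ruled out by every candidate (cf. the ACC discussed below)
        raise ValueError("Observed play has zero probability under all candidate strategies")
    return {name: w / total for name, w in posterior.items()}

# Hypothetical candidates: 'tit_for_tat' copies the last observed action, 'always_d' defects
strategies = {
    "tit_for_tat": lambda h: {h[-1]: 1.0} if h else {"c": 1.0},
    "always_d": lambda h: {"d": 1.0},
}
belief = {"tit_for_tat": 0.5, "always_d": 0.5}
print(bayes_update(belief, strategies, history=["c"], observed_action="c"))
# -> all posterior mass moves to tit_for_tat
```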
Subjective equilibrium in rational
learning
Theoretical Analysis:
[Figure: observation histories (paths of play) in the single-agent tiger problem, branching over growls (GL/GR) after action-observation pairs such as [TL, L], [TL, OL], [TL, OR]]
Subjective Equilibrium in I-POMDPs
Agents i and j's joint policies induce a true distribution over the future observation sequences
[Figure: true distribution over observation histories]
Agent i's beliefs over j's models and its own policy induce a subjective distribution over the future observation sequences
[Figure: subjective distribution over observation histories]
Subjective equilibrium in rational
learning
Absolute Continuity Condition (ACC)
Subjective distribution should not rule out the
observation histories considered possible by the
true distribution
Cautious beliefs → "Grain of truth" assumption
“Grain of truth” is sufficient but not necessary to
satisfy the ACC
Subjective equilibrium in rational
learning
Proposition 1 (Convergence): Under ACC, an
agent’s belief over other’s strategies updated using the
Bayesian belief update converges with probability 1
Proof sketch: Show that Bayesian learning in I-POMDPs is a Martingale
Apply the Martingale Convergence Theorem (Doob 53)
ε-closeness of distributions: the two distributions assign probabilities to the relevant events within a multiplicative factor of (1 ± ε)
Subjective equilibrium in rational
learning
Lemma (Blackwell & Dubins 62): For all agents, if their initial beliefs satisfy ACC, then after finite time T(ε), each of their beliefs is ε-close to the true distribution over the future observation paths
Subjective ε-Equilibrium (Kalai & Lehrer 93): A profile of strategies of agents, each of which is an exact best response to a belief that is ε-close to the true distribution over the observation history
Prediction: Subjective equilibrium is stable under learning and optimization
Subjective equilibrium in rational
learning
Main Results
Proposition 1: If agents' beliefs satisfy the ACC, then after finite time T, their strategies are in subjective ε-equilibrium, where ε is a function of T
  When ε = 0, subjective equilibrium obtains
  Proof follows from the convergence of the Bayesian belief update & Blackwell & Dubins 62
  ACC is a sufficient condition, but not a necessary one
Subjective equilibrium is close to Nash equilibrium
Proposition 2: An ε-subjective equilibrium induces a distribution over future paths of play that is ε-close to the distribution over the paths of play induced by an ε-Nash equilibrium
Computational difficulties in achieving
equilibrium
There exist computable strategies that
admit no computable exact best responses
(Nachbar&Zame96)
If possible strategies are assumed computable, then i's best response may not be computable. Therefore, j's cautious beliefs may fail to contain a grain of truth
Subtle tension between prediction and
optimization
Strictness of ACC
Computational difficulties in achieving
equilibrium
Proposition 3 (Impossibility): Within the
rational learning framework, all the
agents’ beliefs will never simultaneously
satisfy the grain of truth assumption
Difficult to realize the equilibrium!
Universal type spaces – Bayesian games
Explicit Harsanyi type space for i, with components:
  Θi — the non-empty set of i's types
  Si — the set of all sigma algebras on Θi
  Σi : Θi → Sj
  βi — the belief associated with i's types
Redefining Bayesian Game (BG), with components:
  X — the non-empty set of payoff-relevant states of nature
  Ai — the set of i's actions
  Ri : X × A → ℝ
Defining level-beliefs (from βi over X × Θ):
  Zero-level belief, bi,0 : belief over states of nature, x
  First-level belief, bi,1 : belief over j's zero-level beliefs
  ... and so on
  βi(θi) induces an infinite belief hierarchy: {bi,0(θi), bi,1(θi), ...}
  Note: βi(θi) is analogous to p(·|θi); p is the common prior over X × Θ
(Mertens et al., 1985)
Universal type spaces – Belief hierarchy
S_0 = X
S_1 = S_0 × Δ(S_0)
S_2 = S_1 × Δ(S_1)
...
S_n = S_{n-1} × Δ(S_{n-1})

θ_i = ⟨b_i^1, b_i^2, b_i^3, ...⟩ ∈ ∏_{j=0}^{∞} Δ(S_j)
Kets type spaces – Finite-level beliefs
Fig: Conditional beliefs of player i over the payoff states and types of j and analogously for j.
Kets type spaces differ from Harsanyi’s type spaces in 2 broad aspects:
Kets types induce a finite-level belief hierarchy whereas Harsanyi types
induce an infinite-level hierarchy
Ex-interim expected utility may not be well-defined in Kets
This is because a strategy may not be comprehensible for a Kets type because
distributions for types within a partition may differ
In Harsanyi’s type spaces, each player’s belief is over a partition of others’
types whose elements are of size 1. So, strategies are always comprehensible!
Tutorial Roadmap
Type-based methods in...
Type-based Methods in Multiagent Systems
This part focuses on full observability of states and actions
Topics covered:
1. Stochastic Bayesian games and HBA algorithm
2. Different implementations and domains
3. Properties of beliefs over types
4. Exploration in type-based methods
5. Incorrect hypothesised types
Type-based Methods in Multiagent Systems
Topics covered:
1. Stochastic Bayesian games and HBA algorithm
2. Different implementations and domains
3. Properties of beliefs over types
4. Exploration in type-based methods
5. Incorrect hypothesised types
Stochastic Bayesian Game (SBG)
State space S, initial state s^0 ∈ S, terminal states S̄ ⊂ S
Players N = {1, ..., n} and for each i ∈ N:
  Set of actions A_i (where A = ×_i A_i)
  Type space Θ_i (where Θ = ×_i Θ_i)
  Payoff function u_i : S × A × Θ_i → R
  Strategy function π_i : H × A_i × Θ_i → [0, 1]
  H is the set of histories H^t = ⟨s^0, a^0, s^1, a^1, ..., s^t⟩, t ≥ 0
State transition function T : S × A × S → [0, 1]
Type distribution Υ : Θ+ → [0, 1], Θ+ subset of Θ
(Albrecht and Ramamoorthy, 2013)
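A minimal sketch (assumed, not from the paper) of how the SBG components could be collected in code; the callable signatures mirror the definition above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Type = Hashable
History = Tuple  # alternating states and joint actions: (s0, a0, s1, a1, ..., st)

@dataclass
class StochasticBayesianGame:
    states: List[State]
    initial_state: State
    terminal_states: set
    actions: Dict[int, List[Action]]                                   # A_i per player i
    type_spaces: Dict[int, List[Type]]                                 # Theta_i per player i
    payoff: Callable[[int, State, Tuple[Action, ...], Type], float]    # u_i(s, a, theta_i)
    strategy: Callable[[int, History, Action, Type], float]            # pi_i(H, a_i, theta_i)
    transition: Callable[[State, Tuple[Action, ...], State], float]    # T(s, a, s')
    type_distribution: Callable[[Tuple[Type, ...]], float]             # Upsilon(theta_1, ..., theta_n)
```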
Stochastic Bayesian Game (SBG)
SBG starts at time t = 0 in state s^0:
1. In s^t, types θ_1^t, ..., θ_n^t sampled with probability Υ(θ_1^t, ..., θ_n^t); each player i informed about own type θ_i^t
2. Each player i chooses a_i^t ∈ A_i with probability π_i(H^t, a_i^t, θ_i^t), resulting in joint action a^t = (a_1^t, ..., a_n^t)
3. Each player i receives payoff u_i(s^t, a^t, θ_i^t); game transitions into s^{t+1} with probability T(s^t, a^t, s^{t+1})
Process repeated until terminal state s^t ∈ S̄ reached.
Assumptions
Full observability of states and actions
→ history H t argument in πi (most general variant)
All elements of game known to us except Υ
→ sometimes Θ+ unknown too
Θ+ finite or infinite
→ decides form of beliefs
Posterior Belief
Problem: Type distribution Υ unknown
Can compute posterior belief
  Pr(θ_{-i} | H^t) = ∏_{j≠i} Pr_j(θ_j | H^t)
where
  Pr_j(θ_j | H^t) = η L(H^t | θ_j) P_j(θ_j)
L is the likelihood, P_j is the prior belief, η is a normaliser
Assumes independent types...
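A minimal sketch of the posterior for a single other agent j using the product likelihood defined later in this section; the full Pr(θ_{-i} | H^t) is then the product of these per-agent posteriors. Representing each hypothesised type simply as a list of per-step predicted action distributions is an illustrative simplification.

```python
from math import prod  # Python 3.8+

def posterior_over_type(observed_actions, types, prior):
    """Pr_j(theta_j | H^t) = eta * L(H^t | theta_j) * P_j(theta_j), with the product likelihood
    L(H^t | theta_j) = prod_tau pi_j(H^tau, a_j^tau, theta_j).
    types: dict name -> list of per-step predicted distributions [{action: prob}, ...],
           i.e. pi_j evaluated on each history prefix."""
    weights = {}
    for name, predictions in types.items():
        likelihood = prod(pred.get(a, 0.0) for pred, a in zip(predictions, observed_actions))
        weights[name] = likelihood * prior[name]
    eta = sum(weights.values())
    return {name: w / eta for name, w in weights.items()} if eta > 0 else dict(prior)

# Hypothetical example: two candidate types for agent j over two observed steps
types = {
    "random":   [{"c": 0.5, "d": 0.5}, {"c": 0.5, "d": 0.5}],
    "always_c": [{"c": 1.0}, {"c": 1.0}],
}
print(posterior_over_type(["c", "c"], types, prior={"random": 0.5, "always_c": 0.5}))
# -> posterior roughly {'random': 0.2, 'always_c': 0.8}
```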
Posterior Belief
Independence of types:
Υ(θ1 , ..., θn ) = Υ1 (θ1 ) ∗ ... ∗ Υn (θn )
Common assumption
Behaviours still depend on each other
→ actions observed in history H t
Some works consider correlated types
e.g. (Albrecht and Ramamoorthy, 2014)
Hypothesised Types
Problem: True type spaces Θ_j^+ (may be) unknown
Can hypothesise type spaces Θ∗j ⊂ Θj
Each θj∗ ∈ Θ∗j is hypothesis for behaviour of player j
Hypothesise Θ∗j from historical data, problem structure,
domain expert...
Can learn new types online via opponent modelling
Harsanyi-Bellman Ad Hoc Coordination (HBA)
Algorithm HBA defined as
  a_i^t ∼ argmax_{a_i ∈ A_i} E_{s^t}^{a_i}(H^t)
where
  E_s^{a_i}(Ĥ) = Σ_{θ*_{-i} ∈ Θ*_{-i}} Pr(θ*_{-i} | Ĥ) Σ_{a_{-i} ∈ A_{-i}} Q_s^{(a_i, a_{-i})}(Ĥ) ∏_{j≠i} π_j(Ĥ, a_j, θ_j*)
  Q_s^a(Ĥ) = Σ_{s' ∈ S} T(s, a, s') [ u_i(s, a) + γ max_{a_i} E_{s'}^{a_i}(⟨Ĥ, a, s'⟩) ]
γ ∈ [0, 1] is the discount factor
(Albrecht and Ramamoorthy, 2013)
Harsanyi-Bellman Ad Hoc Coordination (HBA)
HBA is general algorithmic description of type-based method
Choose action which maximises expected long-term payoff
with respect to current beliefs over types
Esai corresponds to i’s component of Bayes-Nash equilibrium
→ (Harsanyi, 1967)
Qsa corresponds to Bellman optimality equation
→ (Bellman, 1957)
Harsanyi-Bellman Ad Hoc Coordination (HBA)
Informal description:
1. For each action available to HBA, unfold tree of all future interaction trajectories after taking action
2. Associate each trajectory with utility and probability (action probabilities of types & posterior beliefs)
3. Calculate expected payoff of action by traversing to root
4. Choose best action
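A minimal depth-limited sketch of this computation for a single other agent, with beliefs held fixed during planning (so it omits the within-planning belief revision discussed later in this section); the `game` interface is assumed for illustration.

```python
def hba_values(state, history, beliefs, depth, game, gamma=0.95):
    """Expected payoff E_s^{a_i}(H) of each own action a_i under fixed beliefs over the
    other agent's hypothesised types. The assumed `game` interface provides:
      actions_i, actions_j : lists of actions
      types                : dict name -> pi_j(history, a_j) -> probability
      transition(s, a_i, a_j) -> list of (s_next, prob)
      reward(s, a_i, a_j)  -> float (payoff to the controlled agent i)
      is_terminal(s)       -> bool
    """
    def value_of(s, h, a_i, d):
        total = 0.0
        for a_j in game.actions_j:
            # type-weighted prediction of the other agent's action
            p_aj = sum(beliefs[name] * pi_j(h, a_j) for name, pi_j in game.types.items())
            if p_aj == 0.0:
                continue
            for s_next, p_s in game.transition(s, a_i, a_j):
                future = 0.0
                if d > 0 and not game.is_terminal(s_next):
                    h_next = h + ((a_i, a_j), s_next)
                    future = max(value_of(s_next, h_next, b, d - 1) for b in game.actions_i)
                total += p_aj * p_s * (game.reward(s, a_i, a_j) + gamma * future)
        return total

    return {a_i: value_of(state, history, a_i, depth) for a_i in game.actions_i}
```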
Type-based Methods in Multiagent Systems
Topics covered:
1. Stochastic Bayesian games and HBA algorithm
2. Different implementations and domains
3. Properties of beliefs over types
4. Exploration in type-based methods
5. Incorrect hypothesised types
Finite Tree Expansion
Finite tree expansion:
Unfold tree of future trajectories with fixed depth
Direct implementation
Optimal with respect to tree depth
Least efficient: exponential in # states, actions, agents
(Albrecht and Ramamoorthy, 2013)
Finite Tree Expansion – Experiment
Human-machine experiment at science exhibition:
Prisoner’s Dilemma (PD) or Rock-Paper-Scissors (RPS)
Tree expansion with depth 10 for PD and 1 for RPS
Type distribution (Υ) and space (Θ+ ) unknown
(Albrecht and Ramamoorthy, 2013)
Finite Tree Expansion – Types
Hypothesised 5 types for PD and 6 types for RPS, e.g.

RPS type              Definition
Copycat               a_i^0 ∼ U(A_i), a_i^t = a_j^{t-1}
RetryIfWon            a_i^t ∼ U(A_i) if t = 0 ∨ u_i(a^{t-1}) < 0, else a_i^t = a_i^{t-1}
i-focused(h),         π_i(a_i, H^t) = g(a_i, x) / Σ_{â_i ∈ A_i} g(â_i, x), x = min[t, h],
  h ∈ {1, 2}            where g(a_i, x) = max[0, x − Σ_{τ=1}^{x} [a_i^{t−τ} = a_i]_1 (x + 1 − τ)]
j-focused(h),         a_i^t ∼ argmax_{a_i} Σ_{a_j ∈ A_j} π_j(a_j, H^t) u_i(a_i, a_j),
  h ∈ {1, 2}            where π_j(a_j, H^t) is obtained using i-focused(h) for j

(Albrecht and Ramamoorthy, 2013)
Finite Tree Expansion – Results
Participants played two matches of one game:
One match against HBA, one against C/JAL
Each match consisted of 20 rounds of one-shot game
427 participants (including internet participants)
[Figure: (a) PD – total payoff and total welfare for HBA vs. CJAL; (b) RPS – total payoff and % of games won/drawn for HBA vs. JAL]
(Albrecht and Ramamoorthy, 2013)
Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS):
Sample interaction trajectories based on beliefs over types
Action selection with UCT (Kocsis and Szepesvári, 2006)
Combine with reinforcement learning (Q-values)
Approximate but efficient
(Barrett et al., 2011)
Monte Carlo Tree Search – Experiment
Method tested in Pursuit domain:
4 predator agents trap 1 prey agent
Control single predator
Other predators have predefined behaviours
(Barrett et al., 2011)
Monte Carlo Tree Search – Types
Many different types:
Author-defined types (4)
Student-defined types (12)
Each implemented in agent template
Various scenarios:
Type distribution (Υ) unknown
Type distribution and space (Υ, Θ+ ) unknown
Learning new types online (decision tree learning)
(Barrett et al., 2011)
Monte Carlo Tree Search – Results

Author-defined types (Υ unknown, Θ+ known):
[Figure: capture results in (a) 5x5, (b) 10x10 and (c) 20x20 worlds; MCTS(All) means the MCTS-based agent planned considering all of the known predator models according to their current probabilities]

Student-defined types (Υ, Θ+ unknown; using author types as Θ*): (selection)
[Figure: results with student teams in the 10x10 world]

(Barrett et al., 2011)
MAP/Thompson Response
So far, type space Θ+ finite
What if Θ+ uncountable (infinite)? → Belief is density over Θ+
MAP/Thompson Response:
Play best-response to...
MAP: most likely type from posterior
Thompson: sampled type from posterior
Can approximate both with importance sampling
(Southey et al., 2005)
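A minimal sketch of the two response rules when the posterior over an uncountable type space is approximated by weighted samples (e.g. from importance sampling); `best_response` is an assumed black box.

```python
import random

def map_response(samples, weights, best_response):
    """Best-response to the most likely sampled type (MAP)."""
    map_type = max(zip(samples, weights), key=lambda sw: sw[1])[0]
    return best_response(map_type)

def thompson_response(samples, weights, best_response):
    """Best-response to a type drawn from the (approximate) posterior (Thompson sampling)."""
    sampled_type = random.choices(samples, weights=weights, k=1)[0]
    return best_response(sampled_type)
```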
MAP/Thompson Response – Experiment
Poker:
Texas Hold 'Em – very large game, 10^18 states
Leduc Hold ’Em – smaller game for testing
Actions observed but cards in deck
and hands private
Authors use domain-specific
method to compute beliefs
(Southey et al., 2005)
MAP/Thompson Response – Types
How to represent
uncountable type space?
⇒ Density over
behaviour parameters:
(Southey et al., 2005)
MAP/Thompson Response – Results
Leduc Hold 'Em – Opponent sampled from prior
[Figure: average bankroll and average winning rate per hands played for BBR, MAP, Thompson's, Opti, and Frequentist vs. prior]
(Southey et al., 2005)
MAP/Thompson Response – Results
Texas Hold 'Em – Opponent sampled from prior
[Figure: average bankroll per hands played for Thompson's, Frequentist, and Opti vs. prior]
(Southey et al., 2005)
Offline Model Solving
Presented methods are online
Response computed during interaction
Can use types for offline model solving
(Chalkiadakis and Boutilier, 2003)
Interactive POMDPs → Part 3
Type-based Methods in Multiagent Systems
Topics covered:
1. Stochastic Bayesian games and HBA algorithm
2. Different implementations and domains
3. Properties of beliefs over types
4. Exploration in type-based methods
5. Incorrect hypothesised types
Beliefs Over Types
Prj (θj |H t ) = η L(H t |θj ) Pj (θj )
Beliefs over types are central aspect of any type-based method:
Prior beliefs Pj : before any actions are observed
Posterior beliefs Prj : after actions have been observed
Determine choice of own actions
Beliefs Over Types
Prj (θj |H t ) = η L(H t |θj ) Pj (θj )
Important questions:
How to incorporate observations into beliefs?
When will posterior beliefs be correct?
What long-term impact do prior beliefs have?
Posterior Beliefs
How to incorporate observations into beliefs?
Depends on assumption about type distribution:
Are types fixed?
  ∃θ : Υ(θ) = 1
Are types changing?
  ∀θ : Υ(θ) < 1
Are types correlated?
  Υ cannot be factored into Υ(θ) = ∏_i Υ_i(θ_i)
Posterior Beliefs
How to incorporate observations into beliefs?
Depends on assumption about type distribution:
Are types fixed?
∃θ : Υ(θ) = 1
Product posterior:
  L(H^t | θ_j) = ∏_{τ=0}^{t-1} π_j(H^τ, a_j^τ, θ_j)
→ Most common in literature
Convergence
Product posterior:
  L(H^t | θ_j) = ∏_{τ=0}^{t-1} π_j(H^τ, a_j^τ, θ_j)

Theorem: Let Γ be a SBG with pure Υ. If HBA uses a product posterior and if P_j is positive, then for any ε > 0, there is a time t from which (τ ≥ t)
  P_Pr(H^τ, H^∞)(1 − ε) ≤ P_Υ(H^τ, H^∞) ≤ (1 + ε) P_Pr(H^τ, H^∞)
for all H^∞ with P_Υ(H^τ, H^∞) > 0.

Not guaranteed to learn true type or mixed Υ
Assumes Θ+ is known
(Albrecht and Ramamoorthy, 2014)
Posterior Formulations
Other posterior formulations exist, e.g.

Sum posterior:
  L(H^t | θ_j) = Σ_{τ=0}^{t-1} π_j(H^τ, a_j^τ, θ_j)

Correlated posterior:
  Pr(θ_{-i} | H^t) = η P(θ_{-i}) ∏_{τ=0}^{t-1} Σ_{θ_j ∈ θ_{-i}} π_j(H^τ, a_j^τ, θ_j)

(Albrecht and Ramamoorthy, 2014)
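A small sketch contrasting the product and sum likelihoods on the same per-step predictions (same illustrative representation of types as in the earlier posterior sketch).

```python
from math import prod

def product_likelihood(predictions, observed_actions):
    # L(H^t | theta_j) = prod_tau pi_j(H^tau, a_j^tau, theta_j)
    return prod(pred.get(a, 0.0) for pred, a in zip(predictions, observed_actions))

def sum_likelihood(predictions, observed_actions):
    # L(H^t | theta_j) = sum_tau pi_j(H^tau, a_j^tau, theta_j)
    return sum(pred.get(a, 0.0) for pred, a in zip(predictions, observed_actions))

preds = [{"c": 0.9, "d": 0.1}, {"c": 0.1, "d": 0.9}]
print(product_likelihood(preds, ["c", "c"]), sum_likelihood(preds, ["c", "c"]))  # 0.09 vs 1.0
```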
Prior Beliefs
Prj (θj |H t ) = η L(H t |θj ) Pj (θj )
Subjective likelihood of types before any actions observed
Questions:
1. Do prior beliefs have long-term impact on payoffs?
2. If so, how?
3. Can we compute prior beliefs automatically?
(Albrecht et al., 2015)
Prior Beliefs
Empirical study:
78 repeated 2 × 2 matrix games
10 performance criteria
3 automatic methods to generate types
10 automatic methods to compute prior beliefs
Against adaptive and learning opponents
(Albrecht et al., 2015)
Prior Beliefs
Types automatically generated from games:
Leader-Follower-Trigger Agents (LFT)
hybrid (deterministic/stochastic)
Co-Evolved Decision Trees (CDT)
fully deterministic
Co-Evolved Neural Networks (CNN)
fully stochastic
10 randomly generated types provided to HBA
(Albrecht et al., 2015)
Prior Beliefs
Automatic prior beliefs:
  Uniform prior: P_j(θ_j*) = |Θ_j*|^{-1}
  Random prior: P_j(θ_j*) = .0001 for random half, rest uniform
  Value priors: P_j(θ_j*) = η ψ(θ_j*)^b
    e.g. Utility prior: ψ_U(θ_j*) = U_i^t(θ_j*)
  LP-priors: quadratic "loss" matrix A_{j,j'} in linear program
    e.g. LP-Utility prior: A_{j,j'} = ψ_U(θ_j*) − U_i^t(θ_j* | θ_{j'}*)
(Albrecht et al., 2015)
Prior Beliefs
Results:
1. Prior beliefs can have significant long-term impact
2. Planning horizon of HBA is important factor
3. Can compute prior beliefs with consistent effects

[Figure: average payoff of player 1 over time for the Uniform, Random, Utility, Stackelberg, Welfare, Fairness, LP-Utility, LP-Stackelberg, LP-Welfare and LP-Fairness priors]

(Albrecht et al., 2015)
Type-based Methods in Multiagent Systems
Topics covered:
1. Stochastic Bayesian games and HBA algorithm
2. Different implementations and domains
3. Properties of beliefs over types
4. Exploration in type-based methods
5. Incorrect hypothesised types
Exploration in Type-based Methods
Belief convergence is passive
Does not tell us how to choose actions
Posterior belief may never learn true type!
This can be a problem
Knowledge of true type may lead to better results
→ May need exploration
Exploration in Type-based Methods

Exploration-exploitation dilemma:
  Should we explore interaction or exploit knowledge?

For example, the Prisoner's Dilemma (PD):
  Each player has two actions, cooperate (c) and defect (d)
  Payoffs (row player, column player):
            c       d
    c      3,3     0,5
    d      5,0     1,1

Two types for the opponent: Grim and Evil
[Figure: the opponent's strategy and the agent's opponent model, shown as finite state machines]

(Carmel and Markovitch, 1999)
Exploration in Type-based Methods

Best response against Grim is to always cooperate
Best response against Evil is to always defect

Defect is the safe choice (cannot be exploited by the other player)
But bad in the long term if the true type is Grim

⇒ What to do?
Value of Information
HBA has built-in solution for optimal exploration:
Implicitly encodes value of information (Howard, 1966)
What can action reveal, how will this impact interaction?
HBA does so by revising beliefs during planning:
  E_s^{a_i}(Ĥ) = Σ_{θ*_{-i} ∈ Θ*_{-i}} Pr(θ*_{-i} | Ĥ) Σ_{a_{-i} ∈ A_{-i}} Q_s^{(a_i, a_{-i})}(Ĥ) ∏_{j≠i} π_j(Ĥ, a_j, θ_j*)
  Q_s^a(Ĥ) = Σ_{s' ∈ S} T(s, a, s') [ u_i(s, a) + γ max_{a_i} E_{s'}^{a_i}(⟨Ĥ, a, s'⟩) ]
Value of Information

If we cooperate and the opponent...
  defects, then we know that he is Evil and the best response is to always defect → Limit average payoff = 1
  cooperates, then we know that he is Grim and the best response is to always cooperate → Limit average payoff = 3

If we defect and the opponent...
  defects, then we know that he is Evil and the best response is to always defect → Limit average payoff = 1
  cooperates, then we know that he is Grim and the best response is to always defect → Limit average payoff = 1

⇒ Choose to cooperate!
(but prior beliefs and planning depth may have impact)
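A minimal numeric check of the case analysis above, assuming a prior probability p_grim that the opponent is Grim; the limit-average values 3 and 1 come from the slides, the comparison code is illustrative.

```python
def limit_average_value(first_action, p_grim):
    """Expected limit-average payoff after an initial exploratory action
    (the first-step payoff does not affect the limit average)."""
    if first_action == "c":
        # Cooperating reveals the type: best-respond with all-c vs Grim (3) or all-d vs Evil (1)
        return 3.0 * p_grim + 1.0 * (1.0 - p_grim)
    # Defecting also reveals the type, but the best response is all-d against either type, giving 1
    return 1.0

for p_grim in (0.1, 0.5, 0.9):
    print(p_grim, limit_average_value("c", p_grim), limit_average_value("d", p_grim))
```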
Exploration in Type-based Methods
Not all type-based methods revise beliefs during planning:
e.g. (Barrett et al., 2011)
Easier to implement and less expensive to compute
Other forms of exploration exist:
Undirected/directed methods (Carmel and Markovitch, 1999)
Myopic EVOI (Chalkiadakis and Boutilier, 2003)
Type-based Methods in Multiagent Systems
Topics covered:
1. Stochastic Bayesian games and HBA algorithm
2. Different implementations and domains
3. Properties of beliefs over types
4. Exploration in type-based methods
5. Incorrect hypothesised types
Incorrect Hypothesised Types
What if hypothesised types Θ∗ incorrect?
Types wrong → predictions wrong → suboptimal actions
Question: What relation must hypothesised types have to true
types in order for HBA to complete task, despite incorrectness?
Difficult question: have to consider types and beliefs
(Albrecht and Ramamoorthy, 2014)
Methodology
Two stochastic processes induced by system:
Ideal process X — knows all current and future types
Always chooses true optimal action
User process Y — uses Pr and Θ∗
May choose sub-optimal actions
What relation must Y have to X to satisfy termination guarantees?
(Albrecht and Ramamoorthy, 2014)
Termination Guarantees
Use Probabilistic real-time Computation Tree Logic (PCTL)
(Hansson and Jonsson, 1994) to specify termination guarantees:
  F_{⊵p}^{≤t} term,   F_{⊵p}^{<∞} term,   ⊵ ∈ {>, ≥}
s ∈ S labeled with term iff s ∈ S̄ (terminal state)
Write s |=C φ to say that state s satisfies PCTL formula φ in
process C ∈ {X , Y }.
(Albrecht and Ramamoorthy, 2014)
Termination Guarantees
Property 1: s^0 |=_X F_{>0}^{<∞} term ⇒ s^0 |=_Y F_{>0}^{<∞} term
Property 2: s^0 |=_X F_{≥1}^{<∞} term ⇒ s^0 |=_Y F_{≥1}^{<∞} term
Property 3: s^0 |=_X F_{≥p}^{<∞} term ⇒ s^0 |=_Y F_{≥p}^{<∞} term
Property 4: s^0 |=_X F_{≥p}^{≤t} term ⇒ s^0 |=_Y F_{≥p}^{≤t} term

{Property 1, Property 2} ⊂ Property 3 ⊂ Property 4
Property 4 is a strong criterion for optimality of Θ*
→ Perform as well as if we knew the true types
(Albrecht and Ramamoorthy, 2014)
Probabilistic Bisimulation
A probabilistic bisimulation (Larsen and Skou, 1991) between X
and Y is an equivalence relation B ⊆ S × S such that
(i) (s^0, s^0) ∈ B
(ii) s_X |= term ⇔ s_Y |= term for all (s_X, s_Y) ∈ B
(iii) μ(H_X^t, Ŝ | X) = μ(H_Y^t, Ŝ | Y) for any histories H_X^t, H_Y^t with (s_X^t, s_Y^t) ∈ B, and all equivalence classes Ŝ under B.
Concept used in automatic verification (model checking)
(Albrecht and Ramamoorthy, 2014)
Optimal Θ∗
Theorem: Property 4 holds in both directions if there is a
probabilistic bisimulation B between X and Y .
Practical implications:
Can use model checking methods to verify optimality of Θ∗
for task completion
Types can be arbitrarily wrong as long as probabilistic
bisimulation exists
(Albrecht and Ramamoorthy, 2014)
Behavioural Hypothesis Testing
Question:
Given interaction history H and hypothesis θj∗ for agent j,
does agent j really behave according to θj∗ ?
If persistently reject hypothesis:
Construct alternative hypothesis
Resort to default plan of action (e.g. maximin)
No universal theory to decide truth of behavioural hypothesis!
(Albrecht and Ramamoorthy, 2015)
Two-Sample Problem
We control i and observe j
θ_j^+ is the true behaviour of j
θ_j^* is the hypothesised behaviour of j
Question: θ_j^+ = θ_j^* ?
Cannot answer directly since θ_j^+ unknown, but
  We know a_j^t = (a_j^0, ..., a_j^{t-1}) from H_i^t
  Can generate â_j^t = (â_j^0, ..., â_j^{t-1}) using θ_j^*
Two-sample problem: were a_j^t and â_j^t generated by θ_j^*?
(Albrecht and Ramamoorthy, 2015)
Frequentist Hypothesis Test
Compute p-value:
  p = P( |T(ã_j^t, â_j^t)| ≥ |T(a_j^t, â_j^t)| ),   ã_j^t ∼ δ^t(θ_j^*) = ( θ_j^*(H_i^0), ..., θ_j^*(H_i^{t-1}) )
Reject θ_j^* if p is below some "significance level" α ∈ [0, 1]
(Albrecht and Ramamoorthy, 2015)
Test Statistic
Test statistic:
  T(ã_j^t, â_j^t) = (1/t) Σ_{τ=1}^{t} T_τ(ã_j^τ, â_j^τ)
  T_τ(ã_j^τ, â_j^τ) = Σ_{k=1}^{K} w_k [ z_k(ã_j^τ, θ_j^*) − z_k(â_j^τ, θ_j^*) ]
w_k ∈ R is the weight for score function z_k ∈ Z
(Albrecht and Ramamoorthy, 2015)
Example Score Functions
z_1(a_j^t, θ_j^*) = (1/t) Σ_{τ=0}^{t-1} θ_j^*(H_i^τ)[a_j^τ] / max_{a_j ∈ A_j} θ_j^*(H_i^τ)[a_j]

z_2(a_j^t, θ_j^*) = (1/t) Σ_{τ=0}^{t-1} ( 1 − E_{a_j ∼ θ_j^*(H_i^τ)} | θ_j^*(H_i^τ)[a_j^τ] − θ_j^*(H_i^τ)[a_j] | )

z_3(a_j^t, θ_j^*) = Σ_{a_j ∈ A_j} min[ (1/t) Σ_{τ=0}^{t-1} [a_j^τ = a_j]_1 , (1/t) Σ_{τ=0}^{t-1} θ_j^*(H_i^τ)[a_j] ]
(Albrecht and Ramamoorthy, 2015)
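A minimal sketch of score function z_1 and the weighted test statistic; here per-step predicted distributions stand in for θ_j^* evaluated on the history prefixes, and the interface is illustrative.

```python
def z1(actions, predictions):
    """Average ratio of the predicted probability of the observed action to the maximum
    predicted probability at that step (score function z_1)."""
    t = len(actions)
    return sum(pred[a] / max(pred.values()) for a, pred in zip(actions, predictions)) / t

def test_statistic(real_actions, simulated_actions, predictions, scores, weights):
    """T = (1/t) * sum_tau sum_k w_k [ z_k(real prefix) - z_k(simulated prefix) ]."""
    t = len(real_actions)
    total = 0.0
    for tau in range(1, t + 1):
        total += sum(w * (z(real_actions[:tau], predictions[:tau]) -
                          z(simulated_actions[:tau], predictions[:tau]))
                     for z, w in zip(scores, weights))
    return total / t
```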
Learning the Test Distribution
Test statistic eventually normal, but:
Shaped gradually over time
Initially skewed in either direction
Need special distribution to capture dynamics:
Skew-normal distribution (Azzalini, 1985)
Skew-normal density:
  f(x | ξ, ω, β) = (2/ω) φ((x − ξ)/ω) Φ(β (x − ξ)/ω)
Learn parameters ξ, ω, β during interaction
(Albrecht and Ramamoorthy, 2015)
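The skew-normal density written out and, assuming SciPy is available, checked against scipy.stats.skewnorm (β corresponds to SciPy's shape parameter).

```python
from scipy.stats import norm, skewnorm

def skew_normal_pdf(x, xi, omega, beta):
    """f(x | xi, omega, beta) = (2/omega) * phi((x - xi)/omega) * Phi(beta * (x - xi)/omega)."""
    z = (x - xi) / omega
    return 2.0 / omega * norm.pdf(z) * norm.cdf(beta * z)

x, xi, omega, beta = 0.3, 0.1, 0.5, 2.0
print(skew_normal_pdf(x, xi, omega, beta))
print(skewnorm.pdf(x, beta, loc=xi, scale=omega))   # should match
```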
Experiments – Random Behaviours
[Figure: % of correct decisions and p-values over time for π_j* = π_j vs. π_j* ≠ π_j, with |A_j| ∈ {2, 10, 20} and different score-function combinations]
(Albrecht and Ramamoorthy, 2015)
Experiments – Adaptive Behaviours
[Figure: % of correct decisions for adaptive behaviours (LFT, CDT, CNN) and random behaviours, using score-function combination [1 2 3]]
Passive: does not actively probe specific aspects of hypothesis!
(Albrecht and Ramamoorthy, 2015)
Tutorial Roadmap
Type-based methods in...
Partially observable stochastic game
POSGs are a generalization of POMDPs and
normal form games to multiple states and
multiple agents
POSG: multiple states, multiple agents
POMDP: multiple states, single agent
Normal form game: single state, multiple agents
Multiagent POMDPs
Multiagent POMDP frameworks generalize POMDPs to
multiagent settings
Decentralized POMDPs (DEC-POMDPs)
Objective view of the interaction
(What should all agents do?)
Applicable to team problems
Initial beliefs of agents are common knowledge
Interactive POMDPs (I-POMDPs)
Subjective view of the interaction
(What should a particular agent do?)
Applicable to cooperative and non-cooperative problems
Beliefs of other agents are unknown
I-POMDP
Key ideas
Include possible behavioral models of other agents in
the state space. Agent’s beliefs are distributions over
the physical state and models of others
Intentional (types) and subintentional models
Intentional models contain beliefs. Beliefs over models
give rise to interactive belief systems
Interactive epistemology, recursive modeling
Finitely nested belief system as a computable
approximation of the interactive belief system
Compute best response to agent’s belief (subjective
rationality)
Potential applications
Robotics
Planetary exploration
Surface mapping by rovers
Coordinate to explore predefined region optimally
Spirit
Opportunity
Uncertainty due to sensors
Robot soccer
Coordinate with teammates
and deceive opponents
Anticipate and track others’
actions
RoboCup Competition
I-POMDP
Definition of a finitely nested I-POMDP of strategy level l for agent i in a 2-agent setting:
  ⟨IS_{i,l}, A, T_i, Ω_i, O_i, R_i, OC_i⟩
IS_{i,l} is the set of interactive states
  IS_{i,l} = S × M_{j,l-1}, where M_{j,l-1} = Θ_{j,l-1} ∪ SM_j
  Intentional models: θ_{j,l-1} = ⟨b_{j,l-1}, A, T_j, Ω_j, O_j, R_j, OC_j⟩, with j assumed Bayes rational
I-POMDP
Definition of a finitely nested I-POMDP of strategy level l for agent i in a 2-agent setting:
  ⟨IS_{i,l}, A, T_i, Ω_i, O_i, R_i, OC_i⟩
IS_{i,l} is the set of interactive states
A is the set of joint actions
T_i is the transition function defined on the physical state (beliefs of others cannot be directly manipulated)
Ω_i is the set of observations of agent i
O_i is the observation function (beliefs of others are not directly observable)
R_i is the reward function of agent i
Interactive beliefs in I-POMDP
“In interactive contexts […], it is important to take into
account not only what the players believe about
substantive matters […] but also what they believe
about the beliefs of other players.”
“One specifies what each player believes about the
substantive matters, about the beliefs of others about
these matters, about the beliefs of others about the
beliefs of others, and so on ad infinitum.”
- Robert J. Aumann
New concept: Interactive beliefs
New approach to game theory: Epistemic, decision
analytic
Interactive beliefs in I-POMDP
Agent i's belief is a distribution over the physical state and models of j:
  b_i ∈ Δ(IS_i)
Uncountably infinite
[Figure: nesting of beliefs — b_i ∈ Δ(IS_i) contains b_j ∈ Δ(IS_j), which contains b_i ∈ Δ(IS_i), and so on]
Hierarchical belief systems have been explored in game theory
Observation
Amount of information in interactive belief
hierarchy is finite
Information content decreases asymptotically with
the number of levels
Question 1: How many levels should we
include?
Answer: As many as we can
Can one work with infinite levels?
Answer: Yes, in some special cases
I-POMDP
Integrate models of others in a decision-theoretic
framework
An important model is a POMDP describing an agent – it
includes all factors relevant to agent’s decision making.
These are intentional models or types (BDI)
Represent uncertainty by maintaining beliefs over the state
and models of other agents. This gives rise to interactive
belief systems
interactive epistemology
When no other agents are present, beliefs become "flat" and the classical POMDP results
Computable approximation of the interactive beliefs:
finitely nested belief systems
infinitely nested beliefs are computable if there is common
knowledge – Nash equilibria
Belief update in I-POMDP
Formalization
Belief update in I-POMDP
Multiagent Tiger problem
  Task: Maximize collection of gold over a finite or infinite number of steps while avoiding the tiger
  Each agent hears growls as well as creaks (S, CL, or CR)
  Each agent may open doors or listen
  Each agent is unable to perceive the other's observation
  Agents i & j
Understanding the I-POMDP (level 1) belief update:
[Figure: evolution of agent i's belief over the tiger location and over j's belief p_j(TL), after agent i listens (L) and observes, e.g., (GL, S), for j's possible action-observation pairs such as (L, GL) and (L, GR)]
DP in I-POMDP
Recurse through levels, beginning with level 0
  Agent j: level 0 models of horizon 1 (assumes agent i is noise)
  [Figure: agent j's level 0 policy trees of horizon 1]

DP in I-POMDP
Best response to level 1 belief at horizon 1
  [Figure: agent i's level 1 policy against agent j's level 0 models of horizon 1]

DP in I-POMDP
  Agent j: level 0 models of horizon 2
  [Figure: agent j's level 0 policy trees of horizon 2, alongside agent i's level 1 models]

DP in I-POMDP
Best response to level 1 belief at horizon 2
  [Figure: agent i's level 1 policy against agent j's level 0 models of horizon 2]

DP in I-POMDP
  Agent j: level 0 models of horizon 3
  [Figure: agent j's level 0 policy trees of horizon 3, alongside agent i's level 1 models]

DP in I-POMDP
Best response to level 1 belief at horizon 3
  [Figure: agent i's level 1 policy against agent j's level 0 models of horizon 3]
POMDPs and I-POMDPs
Beliefs – probability distributions over states
are sufficient statistics
They fully summarize the information contained in
any sequence of observations
Solving POMDPs is hard (PSPACE)
We need approximations (e.g., particle filtering)
Solving I-POMDPs is at least as hard
An approximation: interactive particle filtering
If recursion does not terminate, look for fixed
points
Improving DP in I-POMDP
The interactive state space is very large because
it includes models of other agents. Theoretically,
the space of computable models is countably
infinite
The curse of dimensionality is especially potent for
I-POMDP
I-POMDP faces the curse of history afflicting both
agents
Can we reduce the size of the interactive state
space and thereby mitigate the curse of
dimensionality?
Issue 1: Space of agent models is
infinite
Approach
Select a few initial models of the other agent
Need to ensure that the true model is within this set,
otherwise the belief update is inconsistent
Select models so that the Absolute Continuity
Condition is satisfied
Subjective distribution over future observations
(paths of play) should not rule out the observation
histories considered possible by the true distribution
How to satisfy ACC?
Cautious beliefs
  Select a finite set of models, Θ̃_{i/j}, with the partial (domain) knowledge that the true or an equivalent model is one of them
Issue 2: Representing nested beliefs is
difficult
Level 0 beliefs are standard discrete distributions
(vectors of probabilities that sum to 1)
Level 1 beliefs could be represented as probability
density functions over level 0 beliefs
Probability density functions over level 1 beliefs
may not be computable in general
Parameters of level 1 beliefs may not be bounded (e.g., a
polynomial of any degree)
Level 2 beliefs are strictly partial recursive functions
Approach
  We previously limited the set of models to Θ̃_{i/j}
  Level l belief becomes a discrete probability distribution:
    ĨS_{i,l} = S × Θ̃_{j,l-1},   b̃_{i,l} ∈ Δ(ĨS_{i,l})
  Candidate agent models grow over time and must be tracked
Type equivalence
[Figure: equivalence classes of beliefs P1, P2, P3]

Equivalence classes of interactive states
  Definition: combination of a physical state and an equivalence class of models

Lossless aggregation
  In a finitely nested I-POMDP, a probability distribution over the physical states and equivalence classes of models provides a sufficient statistic for the past history of i's observations
  Transformation of the interactive state space into behavioral equivalence classes is value-preserving
  Optimal policy of the transformed finitely nested I-POMDP remains unchanged
Solving I-POMDPs exactly
Procedure Solve-IPOMDP( AGENTi, Belief Nesting L ) : Returns Policy
  If L = 0 Then
    Return { Policy := Solve-POMDP( AGENTi ) }
  Else
    For all AGENTj ≠ AGENTi
      Policyj := Solve-IPOMDP( AGENTj, L-1 )
    End
    Mj := Behavioral-Equivalence-Types( Policyj )
    ECISi := S × { ×j Mj }
    Policy := Modified-GIP( ECISi, Ai, Ti, Ωi, Oi, Ri )
    Return Policy
  End
Beliefs on ECIS
[Figure: agent j's policy and beliefs over equivalence classes of interactive states]

Agent i's policy in the presence of another agent j
  Policy becomes more diverse as i's ability to observe j's actions improves
[Figure: agent i's policies]
Discussion on ECIS
A method that enables exact solution of
finitely nested interactive POMDPs
Aggregate agent models into behavioral
equivalence classes
Discretization is lossless
Interesting behaviors emerge in the multiagent Tiger problem
Summary of I-POMDPs
I-POMDPs: A framework for decision making in uncertain multiagent settings
Analogous to POMDPs but with an enriched state space (interactive beliefs)
Uses a decision-theoretic solution concept (MEU)
For infinitely nested beliefs, look for fixed points
Intractability of I-POMDPs
Curse of dimensionality: belief space complexity
Curse of history: policy space complexity
Exact: Equivalence classes of interactive states
Lossless transformation of IS into a discrete space
Approximation 1: Interactive Particle Filter
Randomized algorithm for approximating the nested belief update
Partial error bounds
Approximation 2: Interactive Point-based Value Iteration
Algorithm for partial update of the value function
Linear program not needed
Partial and loose error bounds
Approximation 3: Interactive Bounded Policy Iteration
Update the nested policy directly
Represent policies using finite-state machines
Local optima
Graphical model counterpart: Interactive Dynamic Influence Diagrams (I-DIDs)
Types in probabilistic graphical models
[Figure: multiagent influence diagram (MAID) for the tiger problem, with chance nodes Growl_i, Growl_j and Tiger loc, decision nodes Open-or-Listen_i and Open-or-Listen_j, and utility nodes R_i and R_j]
Multiagent influence diagram (MAID) (Koller & Milch 01)
Types in probabilistic graphical models
[MAID as above]
MAIDs offer a richer representation for a game and may be transformed into a normal- or extensive-form game
Types in probabilistic graphical models
[MAID as above]
A strategy of an agent is an assignment of a decision rule to every decision node of that agent
What if the agents are using differing models of the same game to make decisions, or are uncertain about the mental models others are using (types)?
[MAID as above]
Let agent i believe with probability p that j will listen, and with 1-p that j will play the best response decision
Analogously, j believes that i will open a door with probability q, otherwise play the best response

Network of IDs (NID):
[Figure: top-level block with parameters q (Open) and p (Listen), pointing to Block L and Block O]

CPTs over j's action in the blocks:
             L      OL     OR
  Block L    0.9    0.05   0.05
  Block O    0.1    0.45   0.45

(Gal & Pfeffer 08)
Let agent i believe with probability p that j will likely listen, and with 1-p that j will play the best response decision
Analogously, j believes that i will mostly open a door with probability q, otherwise play the best response
[MAID as above]
Top-level Block – MAID
Let agent i believe with probability p that j will likely listen, and with 1-p that j will play the best response decision
Analogously, j believes that i will mostly open a door with probability q, otherwise play the best response
[Figure: MAID representation for the NID, with model nodes Mod[j; Di] and Mod[i; Dj], best-response nodes BR[i]^TL and BR[j]^TL, the top-level tiger-problem nodes Growl^TL_i, Growl^TL_j, Tiger loc^TL, Open-or-Listen^TL_i, Open-or-Listen^TL_j, R^TL_i, R^TL_j, and the Open^O and Listen^L decisions from the lower blocks]
MAID representation for the NID
MAIDs and NIDs
Rich languages for games based on IDs that
models problem structure by exploiting
conditional independence
Focus is on computing equilibrium which
does not allow for best response to a
distribution of non-equilibrium behaviors
Generalize IDs to dynamic
interactions in multiagent settings
Challenge: Other agents could be
updating beliefs and changing
strategies
Interactive IDs

[Figure: level l I-ID for agent i — an influence diagram over Tiger loc_i, Growl_i, Open-or-Listen_i and R_i, augmented with the other agent's decision node Open-or-Listen_j and a model node M_{j,l-1}]
  Model node M_{j,l-1}: models or types of agent j at level l-1
  Policy link (dashed arrow): distribution over the other agent's actions given its models
  Belief on M_{j,l-1}: Pr(M_{j,l-1} | s)

[Figure: members of the model node M_{j,l-1} — models m_{j,l-1}^1, m_{j,l-1}^2 with associated chance nodes A_j^1, A_j^2 and the selector node Mod[M_j]]
  Different chance nodes are solutions of the models m_{j,l-1}
  Mod[M_j] represents the different models of agent j
  m_{j,l-1}^1, m_{j,l-1}^2 could be I-IDs, IDs or simple distributions
[Figure: the chance node A_j aggregating A_j^1 and A_j^2 via Mod[M_j]]
  The CPT of the chance node A_j is a multiplexer: it assumes the distribution of each of the action nodes (A_j^1, A_j^2) depending on the value of Mod[M_j]
Could I-IDs be extended over
time?
We must address the challenge
Interactive Dynamic IDs (I-DIDs)

[Figure: two time-slices of a level l I-DID — nodes A_i^t, A_j^t, S^t, O_i^t, R_i and M_{j,l-1}^t at time t, connected to A_i^{t+1}, A_j^{t+1}, S^{t+1}, O_i^{t+1}, R_i and M_{j,l-1}^{t+1} at time t+1 via the model update link]

[Figure: the model update — models m_{j,l-1}^{t,1}, m_{j,l-1}^{t,2} in Mod[M_j^t], combined with j's actions A_j and observations O_j, yield the updated models m_{j,l-1}^{t+1,1}, ..., m_{j,l-1}^{t+1,4} in Mod[M_j^{t+1}]]

These models differ in their initial beliefs, each of which is the result of j updating its beliefs due to its actions and possible observations
Applications of Type-Based Methods
Adversarial reasoning in the context of money laundering (Ng et al., 2010)
Behavioral modeling of recursive reasoning data in the Centipede Game (Doshi et al., 2010)
Predicting opponent strategies in the Lemonade Stand Game (Wunder et al., 2011)
Learning from human teachers in the context of robotics (Woodward & Wood, 2012)
Generalizations or specializations
Trust enabled I-POMDPs (Seymour & Peterson, 2009)
Models of the other agent include trust levels as well
Parameterized I-POMDPs (Wunder et al., 2011)
The distribution over lower-level models is a parameter learned from the agent population
Intention-aware POMDPs (Hoang & Low, 2012)
Specialization: Assumes that the other agent observes its state perfectly
Hierarchy reduces to a nested MDP
Reinforcement learning in I-POMDPs (Ng et al., 2012)
Bayes-adaptive RL
Application 1: Adversarial reasoning in
money laundering
Money laundering domain
Red team (money launderers) hold money in
accounts
{dirty pot, bank accounts, securities, shell companies,...}
Blue team (law enforcement) must sense the
money
{no sensors, bank accounts, shell companies, casinos,...}
Red team’s actions involve placing, layering or
integrating the money, and observing the blue
team’s sensors
Blue team’s actions involve placing the sensors,
and observing reports and sensor information
|S| = 99, |Ai| = 9, |Aj| = 4, |Ωi| = 11, |Ωj| = 4
Application: Adversarial reasoning
(contd.)
Approach
Formulate a level 1 I-POMDP for each team
Combine I-PF with a sampled reachability tree
for both agents to generate separate policy
trees for red and blue teams with initial beliefs
Experiments
The laundering game was played by simulating the two teams' policy trees across 50 trials
For most settings of the number of particles and agent solution horizons, the red team has the advantage!
The blue team wins when each team models its opponent at just horizon 1
Application 2: Behavioral modeling of
recursive reasoning data
Two large studies involving human subjects
on levels of recursive reasoning
Two-player alternating-move game with
complete and perfect information
General sum game & fixed sum game
Experimental studies (contd.)
Two levels of reasoning, corresponding to two opponent types:
myopic
predictive
Computational model: Interactive
POMDP
Modeling the behavioral data gathered from the study
Multiagent setting: the state space includes the other agents' models
A finitely nested I-POMDP of agent i with strategy level l, interacting with another agent j, is defined as the tuple
I-POMDPi,l = ⟨ ISi,l, A, Ti, Ωi, Oi, Ri ⟩, where
ISi,l: interactive states, defined as ISi,l = S × Mj,l-1, with Mj,l-1 = Θj,l-1 ∪ SMj for l ≥ 1, and ISi,0 = S, where
S is the set of states of the physical environment
Θj,l-1: intentional models of agent j, of the form θj,l-1 = ⟨ bj,l-1, θ̂j ⟩, where bj,l-1 is j's level l-1 belief and θ̂j is the frame
SMj: subintentional models of j
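As a lightweight rendering of this definition, the sketch below pairs a physical state with a model of j, and an intentional model pairs j's lower-level belief with its frame. It is only a data-structure sketch; field names are illustrative, not part of the formal definition.

```python
# Data-structure sketch of IS_{i,l} = S x M_{j,l-1}: an interactive state couples
# a physical state with a model of j; an intentional model couples j's level l-1
# belief with its frame.
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class IntentionalModel:
    belief: Dict[Any, float]   # b_{j,l-1}: j's belief at level l-1
    frame: Any                 # the frame (j's actions, observations, dynamics, rewards, ...)

@dataclass
class InteractiveState:
    physical_state: Any        # s in S
    model_of_j: Any            # element of M_{j,l-1} (intentional or subintentional)
```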
Empirically informed I-POMDP
I-POMDPi,2:
Interactive States:
physical state space S = {A,B,C,D} (perfectly
observable)
model set consisting of two opponent models:
the level 1 predictive model of the opponent
the level 0 myopic model of the opponent
Action:
Ai = Aj = {Stay, Move} (deterministic)
Observation:
Ωi = {Stay, Move}
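A small enumeration of the interactive state space implied by this setup: four physical states crossed with two opponent models gives eight interactive states. The model labels below are mine, used only for illustration.

```python
# Interactive state space of the I-POMDP_{i,2} above: 4 physical states x 2 opponent
# models = 8 interactive states. Model labels are illustrative.
physical_states = ["A", "B", "C", "D"]
opponent_models = ["predictive_level1", "myopic_level0"]
interactive_states = [(s, m) for s in physical_states for m in opponent_models]
assert len(interactive_states) == 8
```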
Empirically informed I-POMDP
(contd.)
Descriptive decision model
Subjects made non-normative choice
Rationality errors observed
Quantal response model
q(ai* | Ai) = e^{λ·U(bi, ai*)} / Σ_{ai ∈ Ai} e^{λ·U(bi, ai)}
q(ai | Ai) is the probability assigned to action ai by the model
U(bi, ai) is the utility to i of performing action ai given its belief bi
λ controls how responsive the model is to value differences
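The quantal response model above is a softmax over action utilities; a minimal sketch follows, with the utilities being placeholders rather than values from the study.

```python
import math

def quantal_response(utilities, lam):
    """q(a | A_i) proportional to exp(lambda * U(b_i, a)).
    lam -> infinity approaches the best response; lam = 0 gives uniform choice.
    `utilities` maps each action to U(b_i, a)."""
    exps = {a: math.exp(lam * u) for a, u in utilities.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

# Illustrative utilities only:
print(quantal_response({"Stay": 1.0, "Move": 0.5}, lam=2.0))
```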
Empirically informed I-POMDP
(contd.)
Descriptive judgment model
Subjects learned from previous game
learning is slow
subjects could be underweighting the evidence that
they observe
Updating belief: the likelihood of the observed evidence enters the Bayesian update raised to the power γ
Underweighting when γ < 1
Overweighting when γ > 1
Normative updating when γ = 1
γ controls the learning rate
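A hedged sketch of this γ-weighted update follows, under the assumption that the evidence likelihood is simply exponentiated by γ inside an otherwise standard Bayesian update; the priors and likelihoods are illustrative.

```python
def gamma_weighted_update(prior, likelihood, gamma):
    """Bayesian update with the evidence likelihood raised to the power gamma
    (assumed form): gamma < 1 underweights the evidence, gamma > 1 overweights it,
    gamma = 1 is the normative update. Inputs map hypotheses to probabilities."""
    unnorm = {h: prior[h] * (likelihood[h] ** gamma) for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# Illustrative: evidence favouring the predictive model is partially discounted.
print(gamma_weighted_update({"myopic": 0.5, "predictive": 0.5},
                            {"myopic": 0.2, "predictive": 0.8}, gamma=0.5))
```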
Learning
Two parameters to learn
γ controls learning rate
λ controls non-normative choice
Gradient Descent
Error function: the inverse of the data likelihood
The data likelihood is computed over the action from Ai that subject i selected in each g-th game
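To connect the pieces, here is a sketch of fitting (γ, λ) by descending the negative log-likelihood of the subjects' observed choices with a simple numerical gradient; the data format and the forward-prediction routine are assumptions made for illustration, not the authors' implementation.

```python
# Sketch: fit (gamma, lambda) by gradient descent on the negative log-likelihood of
# the subjects' observed actions. `predict_choice_probs(gamma, lam, game)` is assumed
# to run the empirically informed model forward and return q(. | A_i) for that game.
import math

def neg_log_likelihood(gamma, lam, games, predict_choice_probs):
    nll = 0.0
    for game in games:
        probs = predict_choice_probs(gamma, lam, game)
        nll -= math.log(max(probs[game["chosen_action"]], 1e-12))
    return nll

def fit(games, predict_choice_probs, gamma=1.0, lam=1.0, lr=0.01, eps=1e-4, steps=200):
    for _ in range(steps):
        # Finite-difference gradients, for simplicity of the sketch.
        base = neg_log_likelihood(gamma, lam, games, predict_choice_probs)
        d_gamma = (neg_log_likelihood(gamma + eps, lam, games, predict_choice_probs) - base) / eps
        d_lam = (neg_log_likelihood(gamma, lam + eps, games, predict_choice_probs) - base) / eps
        gamma -= lr * d_gamma
        lam -= lr * d_lam
    return gamma, lam
```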
Results
We utilized the learned values to parameterize the underweighting and quantal response models within the I-POMDP
Comparison of model predictions with actual
data
Application 3: Learning from a human
teacher
Domain
Agent (robot) learning interactively from a
non-technical human teacher
Learning by demonstration
Learning by reinforcements
Interaction consists of signals generated by
the agent and teacher
Examples of signals: words, gestures, facial
expressions, eye gaze, rewards, ...
Application to learning (contd.)
Approach
Model the learning problem as an I-POMDP
All signals from the teacher and the environment are modeled as the agent's observations
The teacher is modeled in the agent's interactive states
The teacher's beliefs about the state of the world and about the agent's variables and beliefs are maintained
Action selection accounts for the predicted future
actions of the teacher
Benefits of the approach
Principled formulation of the problem
Complex interactions possible due to nested
modeling
Application to learning (contd.)
Benefits (contd.)
Acting to reduce inconsistency in the agent's model of the teacher's model
Interrupt the teacher to request a change in the teaching subject
Ask for clarification on whether the teacher's previous action was about a different topic
Issue a correction to the teacher about the topic of the question that the agent had asked
Bibliography
Game theory
1. Fudenberg, D., & Tirole, J. Game Theory. MIT Press (textbook)
2. Owen, G. Game Theory. 3rd Edition, Academic Press (textbook)
3. Binmore, K. Essays on Foundations of Game Theory. Pitman (edited book)
4. Harsanyi, J. C. (1967). Games with incomplete information played by 'Bayesian' players. Management Science, 14(3), 159–182 (reference on Bayesian games)
5. Fudenberg, D., & Levine, D. (1997). The Theory of Learning in Games. MIT Press (book for fictitious play)
6. Aumann, R. J. (1999). Interactive epistemology I: Knowledge. International Journal of Game Theory, 28, 263–300
7. Brandenburger, A., & Dekel, E. (1993). Hierarchies of beliefs and common knowledge. Journal of Economic Theory, 59, 189–198 (ref. on hierarchical belief systems)
8. Kalai, E., & Lehrer, E. (1993). Rational learning leads to Nash equilibrium. Econometrica, 61(5), 1019–1045.
Bibliography
Type-based methods
1) Albrecht, S., Crandall, J., Ramamoorthy, S., 2016. Belief and Truth in Hypothesised Behaviours. URL: http://arxiv.org/abs/1507.07688 (Complete reference for Part 2)
2) Albrecht, S., Ramamoorthy, S., 2015. Are you doing what I think you are doing? Criticising uncertain agent models. In: 31st Conf. on Uncertainty in Artificial Intelligence. pp. 52–61.
3) Albrecht, S., Crandall, J., Ramamoorthy, S., 2015. An empirical study on the practical impact of prior beliefs over policy types. In: 29th AAAI Conf. on Artificial Intelligence. pp. 1988–1994.
4) Albrecht, S., Ramamoorthy, S., 2014. On convergence and optimality of best-response learning with policy types in multiagent systems. In: 30th Conf. on Uncertainty in Artificial Intelligence. pp. 12–21.
5) Albrecht, S., Ramamoorthy, S., 2013. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In: 12th Int. Conf. on Autonomous Agents and Multiagent Systems. pp. 1155–1156.
6) Barrett, S., Stone, P., Kraus, S., Rosenfeld, A., 2013. Teamwork with limited knowledge of teammates. In: 27th AAAI Conf. on Artificial Intelligence. pp. 102–108.
7) Barrett, S., Stone, P., Kraus, S., 2011. Empirical evaluation of ad hoc teamwork in the pursuit domain. In: 10th Int. Conf. on Autonomous Agents and Multiagent Systems. pp. 567–574.
8) Southey, F., Bowling, M., Larson, B., Piccione, C., Burch, N., Billings, D., Rayner, C., 2005. Bayes’ bluff: opponent modelling in poker. In: 21st Conf. on Uncertainty in Artificial Intelligence. pp. 550–558.
9) Chalkiadakis, G., Boutilier, C., 2003. Coordination in multiagent reinforcement learning: a Bayesian approach. In: 2nd Int. Conf. on Autonomous Agents and Multiagent Systems. pp. 709–716.
10) Carmel, D., Markovitch, S., 1999. Exploration strategies for model-based learning in multi-agent systems: exploration strategies. Autonomous Agents and Multi-Agent Systems 2 (2), 141–172.
Bibliography
Type-based methods (contd.)
1) Stone, P., Kaminka, G., Kraus, S., Rosenschein, J., 2010. Ad hoc
autonomous agent teams: collaboration without pre-coordination. In:
24th AAAI Conf. on Artificial Intelligence. pp. 1504–1509.
2) Kocsis, L. and Szepesvari, C., 2006. Bandit based Monte-Carlo
planning. In Machine Learning: ECML 2006, pages 282–293.
3) Hansson, H., Jonsson, B., 1994. A logic for reasoning about time and
reliability. Formal Aspects of Computing 6 (5), 512–535.
4) Larsen, K., Skou, A., 1991. Bisimulation through probabilistic testing.
Information and Computation 94 (1), 1–28.
5) Azzalini, A., 1985. A class of distributions which includes the normal
ones. Scandinavian Journal of Statistics 12, 171–178.
6) Howard, R., 1966. Information value theory. IEEE Transactions on
Systems Science and Cybernetics 2 (1), 22–26.
7) Bellman, R., 1957. Dynamic Programming. Princeton University
Press.
Bibliography
Interactive POMDP
1. Gmytrasiewicz, P. J., & Doshi, P. (2005). A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79 (ref. on I-POMDP)
2. Doshi, P., & Gmytrasiewicz, P. J. (2006). On the Difficulty of Achieving Equilibrium in Interactive POMDPs. Twenty-First National Conference on Artificial Intelligence (AAAI) (ref. on convergence properties of I-POMDP)
3. Doshi, P. (2007). Improved State Estimation in Multiagent Settings with Continuous or Large Discrete State Spaces. Twenty-Second Conference on Artificial Intelligence (AAAI) (ref. on state estimation for continuous state spaces)
4. Doshi, P., & Gmytrasiewicz, P. J. (2009). Monte Carlo Sampling Methods for Approximating Interactive POMDPs. Journal of Artificial Intelligence Research, 34:297–337 (ref. on PF in I-POMDP)
5. Perez, D., & Doshi, P. (2008). Generalized Point Based Value Iteration for Interactive POMDPs. Twenty-Third Conference on Artificial Intelligence (AAAI) (ref. on PBVI in I-POMDP)
6. Sonu, E., & Doshi, P. (2012). Generalized and Bounded Policy Iteration for Interactive POMDPs. Eleventh International Autonomous Agents and Multiagent Systems Conference (AAMAS) (ref. on BPI in I-POMDP)
7. Hoang, T., & Low, K. (2012). Intention-Aware Planning under Uncertainty for Interacting with Self-Interested, Boundedly Rational Agents. Eleventh International Autonomous Agents and Multiagent Systems Conference (AAMAS) (ref. on specialization, IA-POMDPs)
8. Ng, B., Boakye, K., Meyers, C., & Wang, A. (2012). Bayes-Adaptive Interactive POMDPs. Twenty-Sixth AAAI Conference on Artificial Intelligence (ref. on RL in I-POMDP)
Bibliography
Applications of Interactive POMDP
1. Seymour, R.S., & Peterson, G.L. (2009). Responding to
Sneaky Agents in Multi-agent Domains. Twenty-Second
International Florida Artificial Intelligence Research Society
Conference (FLAIRS)
2. Ng, B., Meyers, C., Boakye, K., & Nitao, J. (2010). Towards
applying interactive POMDPs to real-world adversary
modeling. Innovative Applications in Artificial Intelligence
(IAAI)
3. Wunder, M., Kaisers, M., Yaros, J.R., Littman, M. (2011).
Using Iterated Reasoning to Predict Opponent Strategies.
Tenth International Conference on Autonomous Agents and
Multiagent Systems (AAMAS)
4. Woodward, M.P., & Wood, R.J. (2012). Learning from
Humans as an I-POMDP. CoRR,
http://arxiv.org/abs/1204.0274
5. Doshi, P., Qu, X., Goodie, A., & Young, D. (2012). Modeling
Human Recursive Reasoning using Empirically-Informed
Interactive POMDPs. IEEE Transactions on Systems, Man and
Cybernetics (SMC), Part A, Vol. 42(6):1529-1542