
Software Multiagent Systems:
Lecture 13
Milind Tambe
University of Southern California
[email protected]
Teamwork
When agents act together
Understanding Teamwork
• Ordinary traffic
• Driving in a convoy
• Two friends A & B driving together in a convoy
• B secretly following A
• A pass play in soccer
• Contracting with a software company
• An orchestra
Understanding Teamwork
Together
Joint Goal
Co-labor → Collaborate
• Not just a union of simultaneous coordinated actions
• Different from contracting
Why Teamwork?
Why not: Master-Slave? Contracts?
Why Teams
Robust organizations
Responsibility to substitute
Mutual assistance
Information communicated to peers
Still capable of structure (not necessarily flat)
Subteams, subsubteams
Variations in capabilities and limitations
Approach
Theory
Practical teamwork architectures
Taking a step back…
Key Approaches in Multiagent Systems
Distributed Constraint Optimization (DCOP)
Distributed POMDPs
Belief-Desire-Intention (BDI): logics and psychology
Market mechanisms: auctions
Hybrid DCOP/POMDP/Auctions/BDI
• Essential in large-scale multiagent teams
• Synergistic interactions
[Figure: DCOP constraint graph over variables x1, x2, x3, x4]
Example BDI teamwork formula, the joint persistent goal:
(JPG p) ≡ (MB ¬p) ∧ (MG p) ∧ (Until [(MB p) ∨ (MB □¬p)] (WMG p))
Key Approaches for Multiagent Teams
[Table: DCOP, distributed POMDPs, BDI, markets, BDI-POMDP, and hybrid approaches compared along four dimensions: local interactions & plan structure, uncertainty, human usability, and local utility]
Distributed POMDPs
Three papers on the web page; what to read:
• Ignore all the proofs
• Ignore the complexity results
• JAIR article: the model and the results at the end
• Understand the fundamental principles
Domain: Teamwork for Disaster Response
Multiagent Team Decision Problem (MTDP)
MTDP: ⟨S, A, P, Ω, O, R⟩
S: s1, s2, s3…
Single global world state, one per epoch
A: domain-level actions; A = ⟨A1, A2, A3, …, An⟩
Ai is the set of actions available to agent i
A joint action is a tuple with one action per agent
MTDP
P: Transition function:
P(s’ | s, a1, a2, …an)
R: reward function, R(s, a1, a2, …, an)
One common team reward; not separate rewards per agent
Central to teamwork
MTDP (cont’d)
Ω: observations
Each agent i has its own finite set of possible observations:
Ω1, Ω2, …
O: observation function
O(destination state, joint action, joint observation):
P(o1, o2, …, on | a1, a2, …, an, s′)
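To make the tuple concrete, here is a minimal Python sketch with a toy one-fire instance (all names, and the exact reward and observation choices, are illustrative assumptions, not taken from the papers):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

JointAction = Tuple[str, ...]   # one action per agent
JointObs = Tuple[str, ...]      # one observation per agent

@dataclass
class MTDP:
    states: List[str]                                        # S
    agent_actions: List[List[str]]                           # A1, ..., An
    agent_observations: List[List[str]]                      # Omega_1, ..., Omega_n
    transition: Callable[[str, JointAction, str], float]     # P(s' | s, a)
    observe: Callable[[JointObs, JointAction, str], float]   # O(o | a, s')
    reward: Callable[[str, JointAction], float]              # shared team reward R(s, a)

# Toy instance: a single fire that goes out only if both agents fight it together.
def P(s, a, s_next):
    if s == "burning" and a == ("fight", "fight"):
        return 1.0 if s_next == "out" else 0.0
    return 1.0 if s_next == s else 0.0

def O(o, a, s_next):
    # Both agents observe the true fire status (a collectively observable toy case).
    return 1.0 if all(oi == s_next for oi in o) else 0.0

def R(s, a):
    r = -0.2 * sum(ai == "fight" for ai in a)   # per-action cost, as in the scenario slide
    return r + (20.0 if s == "burning" and a == ("fight", "fight") else 0.0)

fire_mtdp = MTDP(["burning", "out"],
                 [["fight", "wait"]] * 2,
                 [["burning", "out"]] * 2,
                 P, O, R)
```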
Simple Scenario
Cost of each action: −0.2
Must fight fires together
Each agent observes its own location and the fire status
[Figure: grid world with two fires, worth +20 and +40]
MTDP Policy
The problem: Find optimal JOINT policies
One policy for each agent
π: action policy
Maps belief states into domain actions
(Bi → A) for each agent
Belief state: the agent's sequence of observations so far
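Since a belief state here is just the observation history, a deterministic finite-horizon policy can be stored as a plain lookup table; a sketch reusing the toy domain above (the entries are illustrative):

```python
# Agent 1's policy: observation history -> action.
policy_agent1 = {
    (): "fight",                       # first epoch, before any observation
    ("burning",): "fight",
    ("out",): "wait",
    ("burning", "burning"): "fight",
    ("burning", "out"): "wait",
}

def act(policy, history):
    """Look up the action prescribed for the observation history so far."""
    return policy[tuple(history)]

print(act(policy_agent1, ["burning"]))   # -> "fight"
```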
MTDP Domain Types
Collectively partially observable: general case, no assumptions
Collectively observable: the team as a whole observes the state
For every joint observation there is a state s such that, for all other states s′ ≠ s, Pr(o1, o2, …, on | s′) = 0
Pr(o1, o2, …, on | s) = ?
Pr(s | o1, o2, …, on) = ?
Individually observable: each agent observes the state
For every individual observation there is a state s such that, for all other states s′ ≠ s, Pr(oi | s′) = 0
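These classes can be tested mechanically from O; a sketch of the collectively-observable check, reusing the MTDP container above:

```python
from itertools import product

def collectively_observable(mtdp):
    """True if every joint observation identifies a unique state:
    for each o there is one s with Pr(o | s) > 0 and Pr(o | s') = 0 otherwise."""
    for a in product(*mtdp.agent_actions):
        for o in product(*mtdp.agent_observations):
            support = [s for s in mtdp.states if mtdp.observe(o, a, s) > 0.0]
            if len(support) > 1:
                return False
    return True

print(collectively_observable(fire_mtdp))   # -> True for the toy domain
```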
From MTDP to COM-MTDP
Two separate actions: communication vs domain actions
Two separate reward types:
Communication rewards and domain rewards
Total reward: the sum of the two
Explicit treatment of communication enables analysis of communication decisions
Communicative MTDPs (COM-MTDPs)
Σ: communication capabilities, the possible "speech acts"
e.g., "I am moving to fire1."
RΣ: communication cost (over messages)
e.g., saying "I am moving to fire1" has a cost
RΣ ≤ 0
Why ever communicate?
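A hedged numeric illustration of why communicating can still pay off: suppose sending "I am moving to fire1" costs 1 (RΣ = −1), but staying silent leaves the teammate fighting the wrong fire with probability 0.5, forfeiting a joint reward of 20, an expected domain loss of 0.5 × 20 = 10. Paying 1 to avoid an expected loss of 10 improves the total (domain + communication) reward by 9, so the optimal communication policy sends the message. (The numbers are illustrative, not from the papers.)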
Two Stage Decision Process
[Figure: agent architecture. The agent observes the world; state estimator SE1 produces pre-communication belief b1; the communication policy P1 chooses messages to send and receives messages from teammates; state estimator SE2 folds the communications into post-communication belief b2; the action policy P2 then chooses a domain action.]
• P1: communication policy; P2: action policy
• Two state estimators, hence two belief-state updates per epoch
COM-MTDP Continued
B: belief states (each Bi is a history of observations and communications)
Two-stage belief update:
Stage 1: pre-communication belief state for agent i (updated from observations only):
⟨⟨Ωi0, Σ0⟩, ⟨Ωi1, Σ1⟩, …, ⟨Ωi t−1, Σ t−1⟩, ⟨Ωi t, ·⟩⟩
Stage 2: post-communication belief state for i (updated from observations and communications):
⟨⟨Ωi0, Σ0⟩, ⟨Ωi1, Σ1⟩, …, ⟨Ωi t−1, Σ t−1⟩, ⟨Ωi t, Σt⟩⟩
In the general case the agent cannot compress this history into a probability distribution over states (unlike a single-agent POMDP)
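A minimal sketch of the two-stage update, with beliefs kept as raw histories (the representation is an illustrative assumption):

```python
# Belief = list of (observation, messages) pairs, one pair per epoch.

def pre_communication_update(belief, new_obs):
    """Stage 1: append this epoch's observation; the message slot stays empty."""
    return belief + [(new_obs, None)]

def post_communication_update(belief, messages):
    """Stage 2: fill in the messages exchanged this epoch."""
    obs, _ = belief[-1]
    return belief[:-1] + [(obs, messages)]

b1 = pre_communication_update([], "burning")             # [("burning", None)]
b2 = post_communication_update(b1, ["moving-to-fire1"])  # message slot filled
```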
COM-MTDP Continued
The problem: Find optimal JOINT policies
One policy for each agent
πΣ: communication policy
Maps pre-communication belief states into messages
(Bi → Σ) for each agent
πA: action policy
Maps post-communication belief states into domain actions
(Bi → A) for each agent
More Domain Types
General communication: no assumptions on RΣ
Free communication: RΣ(s, σ) = 0
No communication: RΣ(s, σ) is negatively infinite
Teamwork Complexity Results
                        Individual        Collective        Collectively
                        observability     observability     partial obs.
No communication        P-complete        NEXP-complete     NEXP-complete
General communication   P-complete        NEXP-complete     NEXP-complete
Full communication      P-complete        P-complete        PSPACE-complete
Classifying Different Models
Placing existing models in the same grid:
• MMDP: individual observability
• DEC-POMDP, POIPSG: collectively partial observability, no communication
• Xuan-Lesser: collectively partial observability, general communication
• COM-MTDP: spans the full range of communication assumptions
True or False?
1. If the agents communicated all their observations at every step, the distributed POMDP would essentially become a single-agent POMDP.
2. In distributed POMDPs, each agent plans its own policy.
3. Solving a distributed POMDP with two agents has the same complexity as solving two separate individual POMDPs.
Algorithms
NEXP-complete: no known efficient algorithms
Brute-force search:
1. Generate the space of possible joint policies
2. For each policy in the policy space:
3.   Evaluate it over finite horizon T
Complexity: (number of joint policies) × (cost of evaluating one policy)
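The blow-up can be computed directly: a deterministic finite-horizon policy fixes an action for every observation history of length < T, so each agent has |A|^(1 + |Ω| + … + |Ω|^(T−1)) policies. A sketch with illustrative toy parameters:

```python
def num_histories(num_obs, horizon):
    # Observation histories of length 0 .. horizon-1: 1 + |O| + ... + |O|^(T-1)
    return sum(num_obs ** t for t in range(horizon))

def num_joint_policies(num_actions, num_obs, num_agents, horizon):
    per_agent = num_actions ** num_histories(num_obs, horizon)   # |A|^(#histories)
    return per_agent ** num_agents

# Two agents, 2 actions, 2 observations each:
for T in range(1, 5):
    print(T, num_joint_policies(2, 2, 2, T))
# 1 4
# 2 64
# 3 16384
# 4 1073741824   -- doubly exponential in T: brute force dies almost immediately
```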
Locally optimal search:
Joint equilibrium-based search for policies (JESP)
Nash Equilibrium in Team Games
Nash equilibrium vs Global optimal reward for the team
[Figure: two payoff matrices for agents A (rows u, v) and B (columns x, y, z). Left: a game with separate payoffs per agent, the pairs (3,6), (7,1), (5,1), (8,2), (6,0), (6,2). Right: a team game with a single shared payoff per cell: 9, 8, 6 and 10, 6, 8. A Nash equilibrium need not achieve the globally optimal team reward.]
JESP: Locally Optimal Joint Policy
• Iterate, keeping one agent's policy fixed
• More complex policies are handled the same way

A single shared team payoff per cell (A picks the row, B picks the column):

        x    y    z
  u     9    5    8
  v     6    7   10
  w     6    3    8
Joint Equilibrium-based Search
Description of algorithm:
1. Repeat until convergence:
2.   For each agent i:
3.     Fix the policies of all agents apart from i
4.     Find the policy for i that maximizes joint reward
Exhaustive-JESP: brute-force search in the policy space of agent i
Expensive
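A sketch of Exhaustive-JESP as alternating best responses; evaluate_joint (the expected team reward of a joint policy) is an assumed helper, and policies are the history→action tables sketched earlier:

```python
from itertools import product

def all_policies(actions, histories):
    """Every deterministic mapping from observation histories to actions."""
    for choice in product(actions, repeat=len(histories)):
        yield dict(zip(histories, choice))

def exhaustive_jesp(agent_actions, agent_histories, evaluate_joint, initial_joint):
    """Cycle through agents, replacing each agent's policy by an exhaustive
    best response to the others, until no single-agent change improves the
    joint reward (a local equilibrium)."""
    joint = list(initial_joint)
    best = evaluate_joint(joint)
    improved = True
    while improved:
        improved = False
        for i in range(len(joint)):
            for candidate in all_policies(agent_actions[i], agent_histories[i]):
                trial = joint[:i] + [candidate] + joint[i + 1:]
                value = evaluate_joint(trial)
                if value > best:
                    best, joint, improved = value, trial, True
    return joint, best
```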
JESP: Joint Equilibrium Search
(Nair et al, IJCAI 03)
Repeat until convergence to local equilibrium, for each agent K:
Fix policy for all except agent K
Find optimal response policy for agent K
Optimal response policy for K, given fixed policies for the others in the MTDP:
Transformed into a single-agent POMDP problem:
"Extended" state defined as ⟨world state, other agents' observation histories⟩, not as the world state alone
Define a new transition function
Define a new observation function
Define a multiagent belief state
Dynamic programming over belief states
Fast computation of the optimal response
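A sketch of the key construction for the two-agent case: with the other agent's policy fixed, agent i's belief is a distribution over extended states ⟨world state, other agent's observation history⟩, updated by ordinary POMDP filtering (function and field names are illustrative; mtdp is the container sketched earlier):

```python
def multiagent_belief_update(belief, a_i, o_i, other_policy, mtdp):
    """belief: {(state, other_history): prob}. Returns the updated belief
    after agent i takes a_i and observes o_i; the other agent's fixed
    policy predicts its action from its observation history."""
    new_belief = {}
    for (s, h_other), p in belief.items():
        joint_a = (a_i, other_policy[h_other])    # other's action is determined
        for s_next in mtdp.states:
            p_s = mtdp.transition(s, joint_a, s_next)
            if p_s == 0.0:
                continue
            for o_other in mtdp.agent_observations[1]:
                p_o = mtdp.observe((o_i, o_other), joint_a, s_next)
                if p_o == 0.0:
                    continue
                key = (s_next, h_other + (o_other,))
                new_belief[key] = new_belief.get(key, 0.0) + p * p_s * p_o
    z = sum(new_belief.values())                  # Pr(o_i | belief, a_i)
    return {k: v / z for k, v in new_belief.items()} if z else {}
```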
Extended State, Belief State
[Figure: sample progression of multiagent beliefs; HL ("hear left") and HR ("hear right") are the observations, and action a2 is Listen]
Run-time Results
Run times by time horizon T (Exhaustive-JESP becomes intractable beyond T = 3):

Method            T=2   T=3      T=4   T=5   T=6    T=7
Exhaustive-JESP   10    317800   –     –     –      –
DP-JESP           0     0        20    110   1360   30030
Is JESP guaranteed to find the global optimum?

        x    y    z
  u     9    5    8
  v     6    7   10
  w     6    3    8

No: JESP can converge to a local equilibrium such as (u, x) = 9 above, missing the global optimum (v, z) = 10
Remedy: random restarts
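Random restarts simply wrap the local search; a sketch on top of exhaustive_jesp above (random_initial is an assumed helper that samples a joint policy):

```python
def jesp_with_restarts(num_restarts, random_initial, run_jesp):
    """Run JESP from several random initial joint policies and keep the
    best local equilibrium found, raising the chance of reaching the
    global optimum (e.g., (v, z) = 10 rather than (u, x) = 9 above)."""
    best_joint, best_value = None, float("-inf")
    for _ in range(num_restarts):
        joint, value = run_jesp(random_initial())
        if value > best_value:
            best_joint, best_value = joint, value
    return best_joint, best_value
```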
Scaling up Distributed POMDPs for Agent Networks
Not All Agents are Equal
POMDP vs. distributed POMDP
Distributed POMDPs more complex
Joint transition and observation functions
Better policy
Free communication = POMDP
Less dependency = lower complexity
BDI vs. distributed POMDP
BDI teamwork                     Distributed POMDP teamwork
Explicit joint goal              Explicit joint reward
Plan/organization hierarchies    Unstructured plans/teams
Explicit commitments             Implicit commitments
No costs / uncertainties         Costs & uncertainties included