
Incremental Pruning: A Simple, Fast, Exact Method for
Partially Observable Markov Decision Processes
Anthony Cassandra
Computer Science Dept.
Brown University
Providence, RI 02912
[email protected]
Michael L. Littman
Dept. of Computer Science
Duke University
Durham, NC 27708-0129
[email protected]
Nevin L. Zhang
Computer Science Dept.
The Hong Kong U. of Sci. & Tech.
Clear Water Bay, Kowloon, HK
[email protected]

Presented by
Costas Djouvas
POMDPs: Who Needs them?
Tony Cassandra
St. Edwards University
Austin, TX
http://www.cassandra.org/pomdp/talks/who-needs-pomdps/index.shtml
Markov Decision Processes (MDP)

- A discrete model for decision making under uncertainty.
- The four components of the MDP model:
  - States: the world is divided into states.
  - Actions: each state has a finite number of actions to choose from.
  - Transition Function: the probability of moving to each next state, given the current state and the chosen action.
  - Reward Function: the expected reward of taking action a in state s.
MDP More Formally

- S = a set of possible world states.
- A = a set of possible actions.
- Transition Function: a real number function T(s, a, s') = Pr(s' | s, a).
- Reward Function: a real number function R(s, a).
MDP Example (1/2)

- S = {OK, DOWN}.
- A = {NO-OP, ACTIVE-QUERY, RELOCATE}.
- Reward Function R(a, s):

                     s = OK    s = DOWN
    NO-OP              +1        -10
    ACTIVE-QUERY       -5         -5
    RELOCATE          -22        -20
MDP Example (2/2)

- Transition Functions:

    T(s, NO-OP, s')            s' = OK   s' = DOWN
      s = OK                     0.98      0.02
      s = DOWN                   0.00      1.00

    T(s, ACTIVE-QUERY, s')     s' = OK   s' = DOWN
      s = OK                     0.98      0.02
      s = DOWN                   0.00      1.00

    T(s, RELOCATE, s')         s' = OK   s' = DOWN
      s = OK                     1.00      0.00
      s = DOWN                   1.00      0.00
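For concreteness, the reconstructed transition tables above can be written down as a small data structure. This is only an illustrative sketch; the dictionary encoding and names are mine, not the talk's.

```python
# A minimal encoding of the example's transition model as nested dicts:
# T[action][s][s2] = Pr(s2 | s, action).  The reward table R(a, s) from the
# previous slide could be encoded the same way, e.g. R["NO-OP"]["OK"] = +1.

STATES = ["OK", "DOWN"]
ACTIONS = ["NO-OP", "ACTIVE-QUERY", "RELOCATE"]

T = {
    "NO-OP":        {"OK": {"OK": 0.98, "DOWN": 0.02}, "DOWN": {"OK": 0.00, "DOWN": 1.00}},
    "ACTIVE-QUERY": {"OK": {"OK": 0.98, "DOWN": 0.02}, "DOWN": {"OK": 0.00, "DOWN": 1.00}},
    "RELOCATE":     {"OK": {"OK": 1.00, "DOWN": 0.00}, "DOWN": {"OK": 1.00, "DOWN": 0.00}},
}

# Sanity check: each row is a probability distribution over next states.
for a in ACTIONS:
    for s in STATES:
        assert abs(sum(T[a][s].values()) - 1.0) < 1e-9
```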
POMDP
Best Strategy

- Value Iteration Algorithm:
  - Input: actions, states, reward function, probabilistic transition function.
  - Derives a mapping from states to "best" actions for a given horizon of time.
  - Starts with horizon length 1 and iteratively builds the value function for the desired horizon.
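A minimal sketch of the value iteration described above, for the fully observable MDP case. The dictionary layout of T and R, the discount factor gamma, and the function name are assumptions for illustration, not part of the talk.

```python
def value_iteration(states, actions, T, R, gamma=0.95, horizon=100):
    """Finite-horizon value iteration for a fully observable MDP.

    T[a][s][s2] = Pr(s2 | s, a), R[a][s] = expected immediate reward.
    Starts from horizon 1 and iteratively extends the value function,
    as described on the slide.
    """
    V = {s: 0.0 for s in states}          # horizon-0 value function
    policy = {s: None for s in states}
    for _ in range(horizon):
        V_new = {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                q = R[a][s] + gamma * sum(T[a][s][s2] * V[s2] for s2 in states)
                if q > best_q:
                    best_a, best_q = a, q
            V_new[s], policy[s] = best_q, best_a
        V = V_new
    return V, policy
```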
Optimal Policy

- Maps states to actions (S → A).
- It depends only on the current state (Markov Property).
- To apply this we must know the agent's state.

Partially Observable
Markov Decision Processes

- Domains with partial information available about the current state (we cannot observe the current state directly).
- The observations can be probabilistic:
  - We need an observation function.
  - There is uncertainty about the current state.
  - The process is non-Markovian: it requires keeping track of the entire history.
Partially Observable
Markov Decision Processes

In addition to MDP model we have:

Observation: A set of observation of the state.


Z = A set of observations.
Observation Function: Relation between the state
and the observation.

O(s, a, z) = Pr(z |s, a).
POMDP Example

- In addition to the definitions of the MDP example, we must define the observation set and the observation probability function.
- Z = {ping-ok (PO), ping-timeout (PT), active-ok (AO), active-down (AD)}.

    O(s, NO-OP, z)           PO       PT       AO       AD
      s = OK                0.970    0.030    0.000    0.000
      s = DOWN              0.025    0.975    0.000    0.000

    O(s, ACTIVE-QUERY, z)    PO       PT       AO       AD
      s = OK                0.000    0.000    0.999    0.001
      s = DOWN              0.000    0.000    0.010    0.990

    O(s, RELOCATE, z)        PO       PT       AO       AD
      s = OK                0.250    0.250    0.250    0.250
      s = DOWN              0.250    0.250    0.250    0.250
Optimal Policy
Background on Solving POMDPs

We have to find a mapping from probability
distribution over states to actions.




Belief State: the probability distribution over states.
Belief Space: the entire probability space.
Assuming finite number of possible actions and
observations, there are finite number of possible next
beliefs states.
Our next belief state is fully determined and it depends
only on the current belief state (Markov Property).
Background on Solving POMDPs
Next Belief State
Background on Solving POMDPs
- Start from belief state b (the yellow dot in the figure).
- Two states s1, s2.
- Two actions a1, a2.
- Three observations z1, z2, z3.

[Figure: the belief states reachable from b, shown in the belief space.]
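The "next belief state" shown in these figures is computed with Bayes' rule. Below is a minimal sketch, assuming the observation probability is conditioned on the resulting state s' (the convention the paper's notation uses later); the function name and dictionary layout are mine, not the talk's.

```python
def update_belief(b, a, z, states, T, O):
    """Return the next belief state after taking action a and observing z.

    b[s]        = current probability of state s
    T[a][s][s2] = Pr(s2 | s, a)
    O[a][s2][z] = Pr(z | s2, a)  (observation conditioned on the resulting state)
    """
    unnormalized = {
        s2: O[a][s2][z] * sum(T[a][s][s2] * b[s] for s in states)
        for s2 in states
    }
    pr_z = sum(unnormalized.values())     # Pr(z | b, a), the normalizing constant
    if pr_z == 0.0:
        raise ValueError("Observation z has zero probability under (b, a).")
    return {s2: p / pr_z for s2, p in unnormalized.items()}
```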
Policies for POMDPs

- An optimal POMDP policy maps belief states to actions.
- To use a computed policy, start with some a priori belief about where you are in the world, then continually:
  1. Use the policy to select an action for the current belief state;
  2. Execute the action;
  3. Receive an observation;
  4. Update the belief state using the current belief, action and observation;
  5. Repeat.
  (A code sketch of this loop follows below.)
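A hedged sketch of that loop, reusing the update_belief sketch from above; policy, execute_action and get_observation are hypothetical stand-ins for a real controller and environment interface.

```python
def run_policy(b0, policy, execute_action, get_observation,
               states, T, O, num_steps=100):
    """Run a POMDP controller: select, act, observe, update belief, repeat.

    `policy(b)` maps a belief state to an action; `execute_action` and
    `get_observation` are placeholders for the real environment interface.
    """
    b = dict(b0)                                     # start from the a priori belief
    for _ in range(num_steps):
        a = policy(b)                                # 1. select action for current belief
        execute_action(a)                            # 2. execute the action
        z = get_observation()                        # 3. receive an observation
        b = update_belief(b, a, z, states, T, O)     # 4. update the belief state
        # 5. repeat (until the step budget runs out)
    return b
```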
Example for Optimal Policy

    Pr(OK)           Action
    0.000 – 0.237    RELOCATE
    0.237 – 0.485    ACTIVE
    0.485 – 0.493    ACTIVE
    0.493 – 0.713    NO-OP
    0.713 – 0.928    NO-OP
    0.928 – 0.989    NO-OP
    0.989 – 1.000    NO-OP

[Figure: the belief space from Pr(OK) = 0 to Pr(OK) = 1, partitioned into regions labeled RELOCATE, ACTIVE, ACTIVE, NO-OP, NO-OP, NO-OP, NO-OP.]
Value Function
Policy Graph
Value Function
- The Optimal Policy computation is based on Value Iteration.
- The main problem in using value iteration is that the space of all belief states is continuous.
Value Function

- For each belief state we get a single expected value.
- Finding the expected value of all belief states yields a value function defined over the entire belief space.
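Because this value function is piecewise linear and convex, it can be represented by a finite set of |S|-vectors, and evaluating it at a belief state is just a max over dot products. A minimal sketch (beliefs and vectors as plain lists, which is my convention, not the talk's):

```python
def evaluate_value(b, vectors):
    """V(b) = max over alpha of b . alpha, for a piecewise-linear convex value function."""
    return max(sum(bs * al for bs, al in zip(b, alpha)) for alpha in vectors)

def best_vector(b, vectors):
    """Return the vector that attains the max at belief b.

    In the full algorithm each vector is tagged with an action, so the
    maximizing vector is also what the policy reads its action from.
    """
    return max(vectors, key=lambda alpha: sum(bs * al for bs, al in zip(b, alpha)))
```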
Value Iteration Example

- Two states, two actions, three observations.
- We will use a figure to represent the belief space and the transformed value function.
- We will use the S(a, z) transformation to transform the value function over the continuous belief space.

[Figure: the dot product with each vector gives the transformed value over the belief space.]
Value Iteration Example

- Start from belief state b.
- One available action, a1, for the first decision; after that, two actions a1 and a2.
- Three possible observations z1, z2, z3.
Value Iteration Example

- For each of the three new belief states, compute the new value function for all actions.

[Figures: transformed value functions for all observations; partition for action a1.]

Value Iteration Example

[Figures: value function and partition for action a1; value function and partition for action a2; combined a1 and a2 value functions; value functions for horizon 2.]
Transformed Value Example
MDP Example
Incremental Pruning: A Simple, Fast, Exact Method
for Partially Observable Markov Decision Processes

- The agent is not aware of its current state.
- It only knows its information (belief) state x, a probability distribution over the possible states.
Notations

- S: a finite set of states, s ∈ S.
- A: a finite set of possible actions, a ∈ A.
- Z: a finite set of possible observations, z ∈ Z.
- r^a(s) ∈ R: the immediate reward for taking action a in state s.
- Transition function: Pr(s'|s, a) ∈ [0, 1].
- Observation function: Pr(z|s', a) ∈ [0, 1].
- New information state x_z^a after taking action a in information state x and observing z:

    x_z^a(s') = Pr(s'|x, a, z) = Pr(z|s', a) Σ_{s∈S} Pr(s'|s, a) x(s) / Pr(z|x, a),

  where Pr(z|x, a) = Σ_{s'∈S} Pr(z|s', a) Σ_{s∈S} Pr(s'|s, a) x(s).
Introduction

- Algorithms for POMDPs use a form of dynamic programming called dynamic programming updates (DPU).
- One value function is transformed into another.
- Some of the algorithms using DPU:
  - One pass (Sondik 1971)
  - Exhaustive (Monahan 1982)
  - Linear support (Cheng 1988)
  - Witness (Littman, Cassandra & Kaelbling 1996)
  - Dynamic Pruning (Zhang & Liu 1996)
Dynamic Programming Updates

- Idea: define a new value function V' in terms of a given value function V.
- Using value iteration in the infinite-horizon setting, V' represents an approximation that comes arbitrarily close to the optimal value function.
- V' is defined by (γ is the discount factor):

    V'(x) = max_{a∈A} [ Σ_{s∈S} r^a(s) x(s) + γ Σ_{z∈Z} Pr(z|x, a) V(x_z^a) ]

- So the value functions can be expressed using sets of vectors, e.g.

    V'(x) = max_{α∈S'} x·α

  for some finite sets of |S|-vectors S_z^a, S^a, S' (one set for each of the intermediate value functions and for V' itself).
- These transformations preserve piecewise linearity and convexity (Smallwood & Sondik, 1973).
Dynamic Programming Updates
Some more notation

- Vector comparison: α1 > α2 if and only if α1(s) > α2(s) for all s ∈ S.
- Vector dot product: α·β = Σ_s α(s) β(s).
- Cross sum: A ⊕ B = {α + β | α ∈ A, β ∈ B}.
- Set subtraction: A \ B = {α ∈ A | α ∉ B}.
Dynamic Programming Updates

- Using this notation, we can characterize the "S" sets described earlier as:

    S_z^a = purge({ τ(α, a, z) | α ∈ S })
    S^a   = purge(S_{z1}^a ⊕ S_{z2}^a ⊕ ... ⊕ S_{z|Z|}^a)
    S'    = purge(∪_{a∈A} S^a)

  where τ(α, a, z)(s) = r^a(s)/|Z| + γ Σ_{s'} α(s') Pr(z|s', a) Pr(s'|s, a).

- purge(·) takes a set of vectors and reduces it to its unique minimum form.
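Under the reconstruction above, each vector of S_z^a comes from a vector α of S via the linear transformation τ(α, a, z). A hedged sketch of that transformation, reusing the dictionary conventions of the earlier MDP/POMDP sketches (which are my assumptions, not the paper's notation); purge(·) itself is sketched after the next slide.

```python
def tau(alpha, a, z, states, R, T, O, gamma, num_obs):
    """tau(alpha, a, z)(s) = r^a(s)/|Z| + gamma * sum_{s'} alpha(s') Pr(z|s',a) Pr(s'|s,a).

    alpha is a list indexed like `states`; R[a][s], T[a][s][s2] and O[a][s2][z]
    follow the dictionary layout used in the earlier sketches.
    """
    return [
        R[a][s] / num_obs
        + gamma * sum(alpha[j] * O[a][s2][z] * T[a][s][s2]
                      for j, s2 in enumerate(states))
        for s in states
    ]

# S_z^a would then be obtained as
#   purge([tau(alpha, a, z, states, R, T, O, gamma, num_obs) for alpha in S])
# where purge(.) is sketched after the next slide.
```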
Pruning Sets of Vectors

- Given a set of |S|-vectors A and a vector α, define:

    R(α, A) = { x | x·α > x·α' for all α' ∈ A \ {α} },

  the "witness region": the set of information states for which vector α is the clear "winner" (has the largest dot product) compared to all the other vectors of A.

- Using the definition of R, we can define:

    purge(A) = { α ∈ A | R(α, A) ≠ ∅ },

  the set of vectors in A that have a non-empty witness region, which is precisely the minimum-size set representing the same value function.
Pruning Sets of Vectors

- Implementation of purge(F):
  - FILTER returns the vectors in F with a non-empty witness region.
  - DOMINATE returns an information state x for which α gives a larger dot product than any vector in A, or reports that no such x exists.
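A sketch of those two routines: the dominance test is a linear program, and the naive version of purge simply tests every vector against all the others (the FILTER routine on the slide is more refined but relies on the same test). scipy is assumed to be available; the exact LP formulation and the tie handling below are my choices, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def dominate(alpha, A, eps=1e-9):
    """Return a belief x where alpha's dot product beats every vector in A, or None.

    Solves:  maximize delta  s.t.  x.(alpha - alpha') >= delta for all alpha' in A,
             sum(x) = 1, x >= 0.
    """
    n = len(alpha)
    if not A:
        return np.ones(n) / n                        # no competitors: any belief works
    alpha = np.asarray(alpha, dtype=float)
    c = np.zeros(n + 1); c[-1] = -1.0                # variables: x (n entries), then delta; maximize delta
    A_ub = np.array([np.append(np.asarray(ap) - alpha, 1.0) for ap in A])  # x.(a'-a) + delta <= 0
    b_ub = np.zeros(len(A))
    A_eq = np.array([np.append(np.ones(n), 0.0)])    # sum(x) = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and -res.fun > eps:               # delta > 0: witness region is non-empty
        return res.x[:n]
    return None

def purge(F):
    """Keep only the vectors of F with a non-empty witness region (naive version).

    Exact duplicates are both dropped here; the paper breaks such ties
    lexicographically, which this sketch does not attempt.
    """
    F = [list(v) for v in F]
    return [v for i, v in enumerate(F)
            if dominate(v, [w for j, w in enumerate(F) if j != i]) is not None]
```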
Incremental Pruning

- Computes S^a efficiently:
  - Conceptually easier than witness.
  - Superior performance and asymptotic complexity.
  - If A = purge(A), B = purge(B) and W = purge(A ⊕ B), then |W| ≥ max(|A|, |B|).
  - So the intermediate set never grows explosively compared to its final size.
Incremental Pruning

- We first construct all of the S(a, z) sets.
- We then form all combinations (the cross sum) of the S(a, z1) and S(a, z2) vectors.
Incremental Pruning

- This yields the new value function.
- We then eliminate all useless (light blue) vectors.
Incremental Pruning

- We are left with just three vectors.
- We then combine these three with the vectors in S(a, z3).
- This is repeated for the other action.
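Putting the pieces together, the incremental pruning step interleaves cross sums with pruning, exactly as the sequence of figures above describes. A minimal sketch, reusing the purge routine from the earlier sketch:

```python
def cross_sum(A, B):
    """A (+) B = {alpha + beta | alpha in A, beta in B}, componentwise."""
    return [[x + y for x, y in zip(alpha, beta)] for alpha in A for beta in B]

def incremental_pruning(S_az_sets):
    """Compute S^a = purge(S_z1^a (+) S_z2^a (+) ...) by pruning after every cross sum."""
    result = purge(S_az_sets[0])
    for S_next in S_az_sets[1:]:
        result = purge(cross_sum(result, S_next))   # prune before the set can blow up
    return result
```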
Generalizing Incremental Pruning

- Modify FILTER to take advantage of the fact that the set of vectors has a great deal of regularity.
- Replace x ← DOMINATE(φ, W) with x ← DOMINATE(φ, D \ {φ}).
- Recall:
  - A ⊕ B: the set of vectors being filtered.
  - W: the set of winning vectors found so far.
  - φ: the vector currently being tested for a witness point against the winners W.
  - D ⊆ A ⊕ B.
Generalizing Incremental Pruning

- D must satisfy any one of the following properties:
  (1)
  (2)
  (3)
  (4)
  (5)
- Different choices of D result in different incremental pruning algorithms.
- The smaller the set D, the more efficient the algorithm.
Generalizing Incremental Pruning

- The IP algorithm uses equation (1).
- A variation of the incremental pruning method that uses a combination of (4) and (5) is referred to as the restricted region (RR) algorithm.
- The asymptotic total number of linear programs does not change; RR actually requires slightly more linear programs than IP in the worst case.
- However, empirically the savings in the total number of constraints usually saves more time than the extra linear programs cost.
Generalizing Incremental Pruning
Complete RR algorithm
Empirical Results
Total execution time
Total time spent constructing the S^a sets.
Conclusions

- We examined the incremental pruning method for performing dynamic programming updates in partially observable Markov decision processes.
- It compares favorably in terms of ease of implementation to the simplest of the previous algorithms.
- It has asymptotic performance as good as or better than the most efficient of the previous algorithms, and is empirically the fastest algorithm of its kind.
Conclusion

- In any event, even the slowest variation of the incremental pruning method that we studied is a consistent improvement over earlier algorithms.
- This algorithm will make it possible to greatly expand the set of POMDP problems that can be solved efficiently.
- Issues to be explored:
  - All algorithms studied have a precision parameter ε, which differs from algorithm to algorithm.
  - Develop better best-case and worst-case analyses for RR.