Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes

Anthony Cassandra, Computer Science Dept., Brown University, Providence, RI 02912 ([email protected])
Michael L. Littman, Dept. of Computer Science, Duke University, Durham, NC 27708-0129 ([email protected])
Nevin L. Zhang, Computer Science Dept., The Hong Kong U. of Sci. & Tech., Clear Water Bay, Kowloon, HK ([email protected])

Presented by Costas Djouvas

Introductory material: "POMDPs: Who Needs Them?", Tony Cassandra, St. Edward's University, Austin, TX
http://www.cassandra.org/pomdp/talks/who-needs-pomdps/index.shtml

Markov Decision Processes (MDPs)

A discrete model for decision making under uncertainty. The four components of the MDP model:
- States: the world is divided into states.
- Actions: each state has a finite number of actions to choose from.
- Transition function: the probabilistic relationship between states and the actions available in each state.
- Reward function: the expected reward of taking action a in state s.

MDPs More Formally

- S: a set of possible world states.
- A: a set of possible actions.
- Transition function: a real-valued function T(s, a, s') = Pr(s' | s, a).
- Reward function: a real-valued function R(s, a).

MDP Example (1/2)

S = {OK, DOWN}
A = {NO-OP, ACTIVE-QUERY, RELOCATE}

Reward function R(s, a):

              a = NO-OP   a = ACTIVE-QUERY   a = RELOCATE
  s = OK         +1            -5                -22
  s = DOWN      -10            -5                -20

MDP Example (2/2)

Transition functions T(s, a, s'):

  T(s, NO-OP, s')           s' = OK   s' = DOWN
    s = OK                    0.98      0.02
    s = DOWN                  0.00      1.00

  T(s, ACTIVE-QUERY, s')    s' = OK   s' = DOWN
    s = OK                    0.98      0.02
    s = DOWN                  0.00      1.00

  T(s, RELOCATE, s')        s' = OK   s' = DOWN
    s = OK                    1.00      0.00
    s = DOWN                  1.00      0.00

Finding the Best Strategy: Value Iteration

- Input: actions, states, reward function, probabilistic transition function.
- Derives a mapping from states to "best" actions for a given horizon of time.
- Starts with horizon length 1 and iteratively finds the value function for the desired horizon.

Optimal Policy

- Maps states to actions (S -> A).
- Depends only on the current state (Markov property).
- To apply it, we must know the agent's state.

Partially Observable Markov Decision Processes

- Domains with only partial information available about the current state (we cannot observe the current state directly).
- Observations can be probabilistic, so we need an observation function.
- There is uncertainty about the current state.
- Over the states themselves the process is non-Markovian: it requires keeping track of the entire history.

Partially Observable Markov Decision Processes

In addition to the MDP model we have:
- Observations: a set Z of observations of the state.
- Observation function: the relation between states and observations, O(s, a, z) = Pr(z | s, a).

POMDP Example

In addition to the definitions of the MDP example, we must define the observation set and the observation probability function.

Z = {ping-ok (PO), ping-timeout (PT), active-ok (AO), active-down (AD)}

  O(s, NO-OP, z)           PO      PT      AO      AD
    s = OK                0.970   0.030   0.000   0.000
    s = DOWN              0.025   0.975   0.000   0.000

  O(s, ACTIVE-QUERY, z)    PO      PT      AO      AD
    s = OK                0.000   0.000   0.999   0.001
    s = DOWN              0.000   0.000   0.010   0.990

  O(s, RELOCATE, z)        PO      PT      AO      AD
    s = OK                0.250   0.250   0.250   0.250
    s = DOWN              0.250   0.250   0.250   0.250

Background on Solving POMDPs

- We have to find a mapping from probability distributions over states to actions.
- Belief state: a probability distribution over states.
- Belief space: the space of all belief states.
- Assuming a finite number of possible actions and observations, there is a finite number of possible next belief states.
- The next belief state is fully determined and depends only on the current belief state (Markov property).
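Since a POMDP agent acts on its belief state, the following is a minimal Python sketch of the Bayes belief update for the server-monitoring example above. The arrays mirror the example's transition and observation tables; the function name belief_update and the NumPy layout are our own illustration, not part of the paper.

```python
import numpy as np

STATES = ["OK", "DOWN"]
OBS = ["ping-ok", "ping-timeout", "active-ok", "active-down"]

# T[a][s, s'] = Pr(s' | s, a), mirroring the example's transition tables.
T = {
    "NO-OP":        np.array([[0.98, 0.02], [0.00, 1.00]]),
    "ACTIVE-QUERY": np.array([[0.98, 0.02], [0.00, 1.00]]),
    "RELOCATE":     np.array([[1.00, 0.00], [1.00, 0.00]]),
}

# O[a][s', z] = Pr(z | s', a), mirroring the example's observation tables.
O = {
    "NO-OP":        np.array([[0.970, 0.030, 0.000, 0.000],
                              [0.025, 0.975, 0.000, 0.000]]),
    "ACTIVE-QUERY": np.array([[0.000, 0.000, 0.999, 0.001],
                              [0.000, 0.000, 0.010, 0.990]]),
    "RELOCATE":     np.array([[0.25, 0.25, 0.25, 0.25],
                              [0.25, 0.25, 0.25, 0.25]]),
}

def belief_update(b, a, z):
    """Bayes update: b'(s') is proportional to Pr(z|s',a) * sum_s Pr(s'|s,a) b(s)."""
    zi = OBS.index(z)
    predicted = b @ T[a]                  # sum_s b(s) Pr(s'|s,a)
    unnormalized = predicted * O[a][:, zi]
    norm = unnormalized.sum()             # = Pr(z | b, a)
    if norm == 0.0:
        raise ValueError("observation %r has zero probability" % z)
    return unnormalized / norm

# Start believing the server is OK with probability 0.9.
b = np.array([0.9, 0.1])
b = belief_update(b, "NO-OP", "ping-timeout")
print(b)   # belief shifts toward DOWN, roughly [0.19, 0.81]
```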
Background on Solving POMDPs: The Next Belief State

Start from belief state b. There are two states s1, s2; two actions a1, a2; and three observations z1, z2, z3.
[Figure: the belief states reachable from b for each action-observation pair, shown as points in the belief space.]

Policies for POMDPs

An optimal POMDP policy maps belief states to actions. To use a computed policy, start with some a priori belief about where you are in the world, then continually:
1. Use the policy to select the action for the current belief state;
2. Execute the action;
3. Receive an observation;
4. Update the belief state using the current belief, action, and observation;
5. Repeat.

Example of an Optimal Policy

  Pr(OK)            Action
  0.000 - 0.237     RELOCATE
  0.237 - 0.485     ACTIVE-QUERY
  0.485 - 0.493     ACTIVE-QUERY
  0.493 - 0.713     NO-OP
  0.713 - 0.928     NO-OP
  0.928 - 0.989     NO-OP
  0.989 - 1.000     NO-OP

[Figure: the belief space [0, 1] partitioned into RELOCATE, ACTIVE-QUERY, and NO-OP regions, together with the value function and the policy graph.]

Value Function

- The optimal policy computation is based on value iteration.
- The main problem in applying value iteration is that the space of all belief states is continuous.
- Value function: each belief state gets a single expected value; finding the expected value of every belief state yields a value function defined over the entire belief space.

Value Iteration Example

- Two states, two actions, three observations.
- We use a figure to represent the belief space and the transformed value function.
- We use the S(a, z) transformation to transform the value function over the continuous belief space.
[Figure: dot products of beliefs with vectors; transformed value over the belief space.]

- Start from belief state b.
- One available action, a1, for the first decision, and then two, a1 and a2.
- Three possible observations, z1, z2, z3.

- For each of the three new belief states, compute the new value function for all actions.
[Figures: transformed value functions for all observations; partition for action a1; value function and partition for action a1; value function and partition for action a2; combined a1 and a2 value functions; value functions for horizon 2; transformed value example; MDP example.]

Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes

The agent is not aware of its current state. It only knows its information (belief) state x, a probability distribution over the possible states.

Notation

- S: a finite set of states, s ∈ S.
- A: a finite set of possible actions, a ∈ A.
- Z: a finite set of possible observations, z ∈ Z.
- r_a(s) ∈ R: the reward for taking action a in state s.
- Transition function: Pr(s' | s, a) ∈ [0, 1].
- Observation function: Pr(z | s', a) ∈ [0, 1].
- x_z^a: the new information state after taking action a in information state x and observing z, where x_z^a(s') = Pr(z | s', a) Σ_s Pr(s' | s, a) x(s) / Pr(z | x, a).

Introduction

Algorithms for POMDPs use a form of dynamic programming called dynamic programming updates: one value function is transformed into another. Some of the algorithms using dynamic programming updates:
- One pass (Sondik, 1971)
- Exhaustive (Monahan, 1982)
- Linear support (Cheng, 1988)
- Witness (Littman, Cassandra & Kaelbling, 1996)
- Dynamic pruning (Zhang & Liu, 1996)

Dynamic Programming Updates

Idea: define a new value function V' in terms of a given value function V. Repeating this update (value iteration) over the infinite horizon makes V' an arbitrarily close approximation of the optimal value function. V' is defined by

  V'(x) = max_{a ∈ A} Σ_{z ∈ Z} [ x·r_a / |Z| + γ Pr(z | x, a) V(x_z^a) ].

The function V' can be expressed using finite sets of |S|-vectors S_a^z, S_a, and S'. These transformations preserve piecewise linearity and convexity (Smallwood & Sondik, 1973).
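To make the vector representation concrete, here is a small Python sketch of building the S_a^z sets and evaluating a piecewise-linear convex value function. It assumes a discount factor gamma and dictionaries R[a] (reward vector r_a(s)), T[a], and O[a] laid out as in the earlier belief-update sketch; the helper names make_tau, transformed_sets, and value_at are illustrative, not from the paper.

```python
import numpy as np

def make_tau(R, T, O, gamma, n_obs):
    """tau(alpha, a, z)(s) = r_a(s)/|Z|
       + gamma * sum_{s'} alpha(s') Pr(z|s',a) Pr(s'|s,a)."""
    def tau(alpha, a, z):
        # O[a][:, z] * alpha is a vector over s'; T[a] @ (...) sums over s'.
        return R[a] / n_obs + gamma * (T[a] @ (O[a][:, z] * alpha))
    return tau

def transformed_sets(V, R, T, O, gamma, n_obs, actions):
    """Build S_a^z = { tau(alpha, a, z) : alpha in V } for every action a and
    observation index z (pruning is applied separately)."""
    tau = make_tau(R, T, O, gamma, n_obs)
    return {(a, z): [tau(alpha, a, z) for alpha in V]
            for a in actions for z in range(n_obs)}

def value_at(x, vectors):
    """Piecewise-linear convex value function: V(x) = max_alpha x . alpha."""
    return max(float(x @ alpha) for alpha in vectors)
```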
Dynamic Programming Updates: More Notation

- Vector comparison: α1 > α2 if and only if α1(s) > α2(s) for all s ∈ S.
- Vector dot product: α·β = Σ_s α(s)β(s).
- Cross sum: A ⊕ B = {α + β | α ∈ A, β ∈ B}.
- Set subtraction: A \ B = {α ∈ A | α ∉ B}.

Using these notations, we can characterize the S sets described earlier as

  S_a^z = purge({ τ(α, a, z) | α ∈ S }),   where S is the set of vectors representing V,
  S_a   = purge(S_a^z1 ⊕ S_a^z2 ⊕ ...),
  S'    = purge(∪_a S_a),
  τ(α, a, z)(s) = r_a(s)/|Z| + γ Σ_{s'} α(s') Pr(z | s', a) Pr(s' | s, a).

purge(·) takes a set of vectors and reduces it to its unique minimum form.

Pruning Sets of Vectors

Given a set of |S|-vectors A and a vector α, define the "witness region"

  R(α, A) = { x | x·α > x·α' for all α' ∈ A \ {α} },

the set of information states for which vector α is the clear "winner" (has the largest dot product) compared to all the other vectors of A. Using the definition of R, we can define

  purge(A) = { α ∈ A | R(α, A) is non-empty },

the set of vectors in A that have a non-empty witness region, which is precisely the minimum-size set.

Implementation of purge(F):
- purge(F) returns the vectors in F with a non-empty witness region.
- DOMINATE(α, A) returns an information state x for which α gives a larger dot product than any vector in A, or reports that no such x exists.
(A Python sketch of purge and the incremental cross sum appears after the Conclusions below.)

Incremental Pruning

Incremental pruning computes S_a efficiently:
- It is conceptually easier than witness.
- It has superior performance and asymptotic complexity.
- With A = purge(A), B = purge(B), and W = purge(A ⊕ B), we have |W| ≥ max(|A|, |B|), so the intermediate sets never grow explosively compared to their final size.

- We first construct all of the S(a, z) sets.
- We form all combinations (the cross sum) of the S(a, z1) and S(a, z2) vectors.
- This yields the new value function; we then eliminate all useless (dominated) vectors.
- We are left with just three vectors, which we then combine with the vectors in S(a, z3).
- This is repeated for the other action.

Generalizing Incremental Pruning

Modify FILTER to take advantage of the fact that the set of vectors has a great deal of regularity: replace x <- DOMINATE(φ, W) with x <- DOMINATE(φ, D \ {φ}). Recall:
- A ⊕ B: the set of vectors being filtered;
- W: the set of winning vectors found so far;
- φ: the candidate "winner" vector currently being tested;
- D: a subset of A ⊕ B.

D must satisfy any one of five properties (equations 1-5 in the paper). Different choices of D result in different incremental pruning algorithms; the smaller the set D, the more efficient the algorithm.

The IP algorithm uses equation (1). A variation of the incremental pruning method that uses a combination of (4) and (5) is referred to as the restricted region (RR) algorithm. The asymptotic total number of linear programs does not change; RR actually requires slightly more linear programs than IP in the worst case. Empirically, however, the savings in the total number of constraints usually saves more time than the extra linear programs require.

[Figure: the complete RR algorithm.]

Empirical Results

[Figures: total execution time; total time spent constructing the S_a sets.]

Conclusions

- We examined the incremental pruning method for performing dynamic programming updates in partially observable Markov decision processes.
- It compares favorably in ease of implementation to the simplest of the previous algorithms.
- It has asymptotic performance as good as or better than the most efficient of the previous algorithms, and it is empirically the fastest algorithm of its kind.
- Even the slowest variation of the incremental pruning method that we studied is a consistent improvement over earlier algorithms.
- This algorithm will make it possible to greatly expand the set of POMDP problems that can be solved efficiently.
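For concreteness, the sketch below (referenced from the pruning slides above) illustrates purge(·) and the incremental cross sum, using a linear program to search for a witness point. The names dominate, purge, cross_sum, and incremental_pruning are our own, SciPy's linprog is assumed, and the paper's pointwise-domination check and BEST tie-breaking step are omitted for brevity.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def dominate(alpha, others, eps=1e-9):
    """Return a belief x where alpha beats every vector in `others` by the
    largest possible margin d, or None if no such x exists.
    LP variables: x(s) for each state plus the margin d; maximize d."""
    n = len(alpha)
    if not others:
        return np.ones(n) / n                      # any belief is a witness
    c = np.zeros(n + 1); c[-1] = -1.0              # maximize d == minimize -d
    # Constraints: x.(beta - alpha) + d <= 0 for every competing beta.
    A_ub = np.array([np.append(beta - alpha, 1.0) for beta in others])
    b_ub = np.zeros(len(others))
    A_eq = np.array([np.append(np.ones(n), 0.0)])  # belief sums to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and res.x[-1] > eps:            # positive margin => witness
        return res.x[:n]
    return None

def purge(vectors):
    """Keep only vectors with a non-empty witness region.
    Note: exact duplicates are both discarded here; the paper's FILTER
    avoids this with a BEST tie-breaking rule."""
    kept = []
    for i, alpha in enumerate(vectors):
        others = [v for j, v in enumerate(vectors) if j != i]
        if dominate(alpha, others) is not None:
            kept.append(alpha)
    return kept

def cross_sum(A, B):
    """A (+) B = { alpha + beta : alpha in A, beta in B }."""
    return [a + b for a, b in product(A, B)]

def incremental_pruning(S_az_sets):
    """S_a = purge(...purge(purge(S_a^z1 (+) S_a^z2) (+) S_a^z3)...)."""
    result = purge(S_az_sets[0])
    for S_az in S_az_sets[1:]:
        result = purge(cross_sum(result, purge(S_az)))
    return result
```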
Issues to Be Explored

- All algorithms studied have a precision parameter ε, which differs from algorithm to algorithm.
- Develop better best-case and worst-case analyses for RR.