POMDP

MDP (Markov Decision Process)
• An MDP model contains:
  – A set of states S
  – A set of actions A
  – A state-transition description T
    • Deterministic or stochastic
  – A reward function R(s, a)

POMDP
• A POMDP model contains:
  – A set of states S
  – A set of actions A
  – A state-transition description T
  – A reward function R(s, a)
  – A finite set of observations Ω
  – An observation function O: S × A → Π(Ω)
    • O(s', a, o) is the probability of observing o after taking action a and ending in state s'

POMDP Problem
• 1. Belief state
  – First approach: choose the most probable state of the world, given past experience
    • The informational properties of observations are not made explicit
  – Second approach: maintain a probability distribution over the states of the world

POMDP Problem
• 2. Finding an optimal policy:
  – The optimal policy is the solution of a continuous-space "belief MDP" with:
    • The set of belief states B
    • The set of actions A
    • The state-transition function τ(b, a, b')
    • The reward function on belief states ρ(b, a)

Value Function
• Policy Tree
• Infinite Horizon
• Witness Algorithm

Policy Tree
• A tree of depth t that specifies a complete t-step policy.
  – Nodes: actions; the top node determines the first action to be taken.
  – Edges: the resulting observations

Sample Policy Tree
(figure of a sample policy tree omitted)

Policy Tree
• Value evaluation:
  – Vp(s) is the value of starting from state s and executing the t-step policy tree p.
  – Expected value under policy tree p: Vp(b) = Σ_s b(s)·Vp(s)
    • Where αp = ⟨Vp(s1), …, Vp(sn)⟩, so that Vp(b) = b · αp
  – Expected value when executing the best of a set of policy trees from an initial belief state b: Vt(b) = max_p b · αp (see the evaluation sketch below)
  – Vt with only two states: (figure omitted)
  – Vt with three states: (figure omitted)
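The value-evaluation slides above can be made concrete with a small sketch. This is a minimal illustration rather than anything from the source deck: the PolicyTree class, the nested-dict encodings of T, R, and O, and the discount factor gamma are assumptions introduced here. The recursion implements Vp(s) = R(s, a) + γ Σ_s' T(s, a, s') Σ_o O(s', a, o) Vpo(s'), where a is the action at the root of p and po is its subtree for observation o.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical container for a policy tree as described in the slides:
# an action at the root and one subtree per observation.
@dataclass
class PolicyTree:
    action: str
    subtrees: Dict[str, "PolicyTree"] = field(default_factory=dict)  # observation -> subtree

def tree_value(p, s, T, R, O, gamma):
    """Vp(s): expected discounted value of starting in state s and executing
    policy tree p, with nested-dict models T[s][a][s'], R[s][a], O[s'][a][o]."""
    a = p.action
    v = R[s][a]
    if not p.subtrees:            # depth-1 tree: value is just the immediate reward
        return v
    for s2, t_prob in T[s][a].items():
        for o, o_prob in O[s2][a].items():
            v += gamma * t_prob * o_prob * tree_value(p.subtrees[o], s2, T, R, O, gamma)
    return v

def belief_value(p, belief, T, R, O, gamma):
    """Vp(b) = sum_s b(s) * Vp(s): the dot product b . alpha_p from the slides."""
    return sum(b_s * tree_value(p, s, T, R, O, gamma) for s, b_s in belief.items())

def best_tree(trees, belief, T, R, O, gamma):
    """Vt(b) = max_p b . alpha_p: pick the best tree in a candidate set for belief b."""
    return max(trees, key=lambda p: belief_value(p, belief, T, R, O, gamma))
```

Evaluating every candidate tree at a belief state and keeping the maximum is exactly the Vt(b) = max_p b · αp step above; the algorithms in the next section differ only in how they avoid enumerating all trees.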
Infinite Horizon
• Three algorithms for computing V:
  – Naive approach
  – Improved by choosing useful policy trees
  – Witness algorithm

Infinite Horizon
• Naive approach:
  – ε is a small number bounding the acceptable approximation error
  – A t-step policy tree contains (|Ω|^t − 1)/(|Ω| − 1) nodes
    • Each node can be labeled with any of |A| possible actions
  – Total number of distinct t-step policy trees: |A|^((|Ω|^t − 1)/(|Ω| − 1))

Infinite Horizon
• Improved by choosing useful policy trees:
  – Vt−1, the set of useful (t − 1)-step policy trees, can be used to construct a superset Vt+ of the useful t-step policy trees.
  – There are |A|·|Vt−1|^|Ω| elements in Vt+.

Infinite Horizon
• Witness algorithm:
  – For each action a, collect the set of t-step policy trees that have action a at their root
  – The corresponding value function is Q_t^a(b) = max b · αp over the trees in that set
  – And Vt(b) = max over a of Q_t^a(b)

Infinite Horizon
• Witness algorithm:
  – Finding witnesses:
    • At each iteration we ask: is there some belief state b for which the true value, Q_t^a(b), computed by one-step lookahead using Vt−1, is different from the estimated value, computed using the current set U of policy trees?
    • Provided U is a subset of the useful trees, the estimate never exceeds the true value, so any such b (a witness point) shows where U is still incomplete.
  – Now we can state the witness theorem [25]: the true Q-function Q_t^a differs from the approximate Q-function built from U if and only if there is some p in U, some observation o, and some p' in Vt−1 for which there is some belief state b such that the tree pnew, obtained from p by replacing its o-subtree with p', satisfies b · αpnew > b · αp̃ for every p̃ in U.
  – The linear program used to find witness points maximizes δ subject to b · αpnew ≥ b · αp̃ + δ for all p̃ in U, with b constrained to be a probability distribution; if the optimal δ is positive, the optimizing b is a witness point.

Infinite Horizon
• Witness algorithm:
  – Complete value iteration:
    • An agenda, initially containing any single policy tree
    • A set U containing the useful policy trees found so far
    • For each tree taken from the agenda, build pnew and test whether it is an improvement over the policy trees in U
      – 1. If no witness points are discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
      – 2. If a witness point is discovered, the best policy tree for that point is computed and added to U, and all policy trees that differ from it in a single subtree are added to the agenda.

Infinite Horizon
• Witness algorithm:
  – Complexity:
    • No more witness points are discovered than there are useful policy trees (each witness adds a tree to the set of useful policy trees), so only on the order of |Ω|·|Vt−1| trees per witness can ever be added to the agenda (in addition to the one tree in the initial agenda).
    • Each linear program solved has |S| + 1 variables (the belief state plus the gap δ) and no more than |U| + |S| + 1 constraints.
    • Each of these linear programs either removes a policy tree from the agenda (this happens at most as many times as trees are added to the agenda) or discovers a witness point (this happens at most once per useful policy tree).

Tiger Problem
• Two doors:
  – Behind one door is a tiger
  – Behind the other door is a large reward
• Two states:
  – sl, when the tiger is behind the left door, and sr, when it is behind the right door
• Three actions:
  – left, right, and listen
• Rewards:
  – The reward for opening the correct door is +10, the penalty for opening the door with the tiger behind it is -100, and the cost of listen is -1
• Observations:
  – Hearing the tiger on the left (Tl) or hearing the tiger on the right (Tr)
  – In state sl, the listen action results in observation Tl with probability 0.85 and observation Tr with probability 0.15; conversely for world state sr

Tiger Problem
• Decreasing listening reliability from 0.85 down to 0.65 (figures omitted; see the belief-update and reliability-comparison sketches below):
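The listen dynamics of the tiger problem are small enough to track by hand, but a sketch makes the belief-MDP machinery from the earlier slides concrete. Only the probabilities (0.85 / 0.15) come from the slides; the function names tiger_listen_model and belief_update, the nested-dict encoding, and the decision to omit the door-opening actions and rewards are assumptions made here for illustration.

```python
# Tiger problem from the slides: states s_l / s_r, actions left / right / listen,
# observations T_l / T_r.  Opening a door resets the problem, so only the
# "listen" dynamics are modeled here.
def tiger_listen_model(p_correct=0.85):
    # T[s][a][s']: listening leaves the hidden state unchanged.
    T = {s: {"listen": {s: 1.0}} for s in ("s_l", "s_r")}
    # O[s'][a][o]: probability of each observation after acting and reaching s'.
    O = {
        "s_l": {"listen": {"T_l": p_correct, "T_r": 1.0 - p_correct}},
        "s_r": {"listen": {"T_l": 1.0 - p_correct, "T_r": p_correct}},
    }
    return T, O

def belief_update(belief, action, obs, T, O):
    """Bayes-rule belief update b'(s') proportional to O(s', a, o) * sum_s T(s, a, s') * b(s).
    Together with the observation probabilities this is what defines the
    belief-MDP transition function tau(b, a, b') from the earlier slides."""
    new_b = {
        s2: O[s2][action].get(obs, 0.0)
            * sum(T[s][action].get(s2, 0.0) * b_s for s, b_s in belief.items())
        for s2 in belief
    }
    norm = sum(new_b.values())  # this normalizer is Pr(o | a, b)
    return {s: v / norm for s, v in new_b.items()}

# Starting from the uniform belief, one T_l observation yields (0.85, 0.15);
# a second consistent T_l observation yields roughly (0.97, 0.03).
T, O = tiger_listen_model()
b = {"s_l": 0.5, "s_r": 0.5}
b = belief_update(b, "listen", "T_l", T, O)
print(b)
b = belief_update(b, "listen", "T_l", T, O)
print(b)
```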
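To illustrate the final slide's point about decreasing listening reliability, the sketch below reuses tiger_listen_model and belief_update from above and counts how many consecutive, consistent observations are needed before the belief becomes sharp. The 0.95 threshold is purely illustrative; the optimal policy's actual switching points depend on the +10 / -100 / -1 rewards, not on any fixed belief cutoff.

```python
# Reuses tiger_listen_model and belief_update from the sketch above.
def listens_needed(p_correct, threshold=0.95):
    """Number of consecutive T_l observations before belief in s_l exceeds
    a (hypothetical) confidence threshold."""
    T, O = tiger_listen_model(p_correct)
    belief = {"s_l": 0.5, "s_r": 0.5}
    n = 0
    while belief["s_l"] < threshold:
        belief = belief_update(belief, "listen", "T_l", T, O)
        n += 1
    return n

print(listens_needed(0.85))  # 2 consistent observations suffice
print(listens_needed(0.65))  # 5 consistent observations are needed
```

This matches the qualitative point of the slide: with noisier observations the agent must listen more, and pay the -1 listening cost more often, before opening a door becomes worthwhile.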