POMDP

POMDP
MDP (Markov Decision Process)
• An MDP model contains:
– A set of states S
– A set of actions A
– A state-transition function T
• Deterministic or stochastic
– A reward function R(s, a)
POMDP
• A POMDP model contains:
– A set of states S
– A set of actions A
– A state-transition function T
– A reward function R(s, a)
– A finite set of observations Ω
– An observation function O: S × A → Π(Ω)
• O(s’, a, o) is the probability of making observation o after taking action a and landing in state s’ (see the sketch below)
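To make the tuple ⟨S, A, T, R, Ω, O⟩ concrete, here is a minimal sketch in Python; the dictionary layout and the specific numbers are illustrative choices, not part of the slides.

# Minimal sketch of a POMDP <S, A, T, R, Omega, O> as plain Python data.
# The dictionary layout and the numbers are illustrative only.

S = ["s0", "s1"]            # states
A = ["go", "stay"]          # actions
Omega = ["o0", "o1"]        # observations

# T[s][a][s2] = Pr(s2 | s, a): stochastic state-transition description
T = {
    "s0": {"go": {"s0": 0.1, "s1": 0.9}, "stay": {"s0": 0.9, "s1": 0.1}},
    "s1": {"go": {"s0": 0.9, "s1": 0.1}, "stay": {"s0": 0.1, "s1": 0.9}},
}

# R[s][a] = immediate reward for taking action a in state s
R = {
    "s0": {"go": 0.0, "stay": 1.0},
    "s1": {"go": 0.0, "stay": -1.0},
}

# O[s2][a][o] = Pr(o | a, s2): probability of observing o after taking
# action a and landing in state s2
O = {
    "s0": {"go": {"o0": 0.85, "o1": 0.15}, "stay": {"o0": 0.85, "o1": 0.15}},
    "s1": {"go": {"o0": 0.15, "o1": 0.85}, "stay": {"o0": 0.15, "o1": 0.85}},
}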
POMDP Problem
• 1. Belief state
– First approach: act on the most probable state of the world, given past experience
• The informational properties carried by the observations are not made explicit, so the agent’s uncertainty is lost
– Second approach: maintain probability distributions over states of the world (belief states)
POMDP Problem
• 2. Finding an optimal policy:
– The optimal policy is the solution of a continuous-space “Belief MDP”:
• The set of belief states B
• The set of actions A
• The state-transition function τ(b, a, b’)
• The reward function on belief states ρ(b, a)
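One way to make τ and ρ concrete is a Bayes-filter belief update; the sketch below assumes the dictionary layout T[s][a][s'], O[s'][a][o], R[s][a] from the previous sketch (the helper names are my own, not from the slides).

# Sketch of the belief-MDP ingredients, assuming the T/O/R dictionaries from
# the previous sketch.  tau(b, a, b') is induced by this Bayes update together
# with the observation probability Pr(o | a, b).

def belief_update(b, a, o, S, T, O):
    """b'(s2) proportional to O(s2, a, o) * sum_s T(s, a, s2) * b(s)."""
    unnorm = {s2: O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in S) for s2 in S}
    z = sum(unnorm.values())        # Pr(o | a, b); zero means o is impossible
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / z for s2, p in unnorm.items()}

def belief_reward(b, a, R):
    """rho(b, a) = sum_s b(s) * R(s, a)."""
    return sum(b[s] * R[s][a] for s in b)

# Tiny two-state example (illustrative numbers)
S = ["s0", "s1"]
T = {"s0": {"a": {"s0": 1.0, "s1": 0.0}},
     "s1": {"a": {"s0": 0.0, "s1": 1.0}}}
O = {"s0": {"a": {"o0": 0.85, "o1": 0.15}},
     "s1": {"a": {"o0": 0.15, "o1": 0.85}}}
R = {"s0": {"a": 1.0}, "s1": {"a": -1.0}}
b = {"s0": 0.5, "s1": 0.5}
print(belief_update(b, "a", "o0", S, T, O))   # {'s0': 0.85, 's1': 0.15}
print(belief_reward(b, "a", R))               # 0.0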
Value function
• Policy Tree
• Infinite Horizon
• Witness Algorithm
Policy Tree
• A tree of depth t that specifies a complete
t-step policy.
– Nodes are labeled with actions; the top node determines the first action to be taken
– Edges are labeled with the resulting observations (see the data-structure sketch below)
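As a data structure, a policy tree can be sketched as a node that stores its action plus one subtree per observation; the class below is an illustrative representation, not notation from the slides.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PolicyTree:
    """A t-step policy tree: a root action plus one (t-1)-step subtree per
    observation; a leaf (t = 1) has no subtrees."""
    action: str
    subtrees: Dict[str, "PolicyTree"] = field(default_factory=dict)

    def depth(self) -> int:
        if not self.subtrees:
            return 1
        return 1 + max(child.depth() for child in self.subtrees.values())

# Example 2-step tree: the root action is taken first, then the edge followed
# depends on the observation that results (names are placeholders).
example = PolicyTree("listen", {
    "heard_left": PolicyTree("open_right"),
    "heard_right": PolicyTree("open_left"),
})
assert example.depth() == 2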
Sample Policy Tree
Policy Tree
• Value Evaluation:
– Vp(s) is the expected t-step value obtained by starting in state s and executing policy tree p
Policy Tree
• Value Evaluation:
– Expected value under policy tree p: Vp(b) = Σs b(s) Vp(s) = b · αp
• where αp = ⟨Vp(s1), …, Vp(sn)⟩
– Expected value when executing different policy trees from different initial belief states: Vt(b) = maxp b · αp (evaluated in the sketch below)
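The evaluation above can be sketched directly: compute αp = ⟨Vp(s1), …, Vp(sn)⟩ by recursing over the observation subtrees, then take Vt(b) = maxp b · αp. The code assumes the PolicyTree class and the T/O/R dictionaries from the earlier sketches, plus a discount factor γ.

# Sketch: alpha_p(s) = V_p(s) computed by recursion over the policy tree,
# then V_t(b) = max_p b . alpha_p over a set of candidate trees.
# Assumes the PolicyTree class and the T/O/R dictionaries sketched earlier.

def alpha_vector(p, S, T, O, R, gamma):
    """V_p(s) = R(s,a) + gamma * sum_s2 T(s,a,s2) * sum_o O(s2,a,o) * V_{p_o}(s2)."""
    a = p.action
    # alpha-vectors of the observation subtrees (empty for a 1-step tree)
    sub = {o: alpha_vector(q, S, T, O, R, gamma) for o, q in p.subtrees.items()}
    alpha = {}
    for s in S:
        future = sum(T[s][a][s2] * O[s2][a][o] * sub[o][s2]
                     for s2 in S for o in sub)
        alpha[s] = R[s][a] + gamma * future
    return alpha

def value_at_belief(trees, b, S, T, O, R, gamma):
    """V_t(b) = max_p b . alpha_p over the candidate policy trees."""
    best = float("-inf")
    for p in trees:
        alpha = alpha_vector(p, S, T, O, R, gamma)
        best = max(best, sum(b[s] * alpha[s] for s in S))
    return best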
Policy Tree
• Value Evaluation:
– Vt with only two states:
Policy Tree
• Value Evaluation:
– Vt with three states:
Infinite Horizon
• Three algorithms to compute V:
– Naive approach
– Improved by choosing useful policy trees
– Witness algorithm
Infinite Horizon
• Naive approach:
– ε is a small number; run value iteration out to a horizon t large enough that the t-step solution is within ε of optimal
– A t-step policy tree contains:
• (|Ω|^t − 1) / (|Ω| − 1) nodes
• Each node can be labeled with any of |A| possible actions
– Total number of policy trees: |A|^((|Ω|^t − 1) / (|Ω| − 1)) (checked numerically in the sketch below)
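The counting above is easy to check numerically; the snippet below simply evaluates the two expressions for small |A|, |Ω| and t (assuming |Ω| ≥ 2).

# Check the counting argument: a t-step tree has (|Omega|**t - 1)/(|Omega| - 1)
# nodes, and each node can carry any of |A| actions.

def num_nodes(n_obs: int, t: int) -> int:
    return (n_obs ** t - 1) // (n_obs - 1)     # assumes n_obs >= 2

def num_policy_trees(n_actions: int, n_obs: int, t: int) -> int:
    return n_actions ** num_nodes(n_obs, t)

print(num_nodes(2, 4))             # 15 nodes for |Omega| = 2, t = 4
print(num_policy_trees(3, 2, 4))   # 3**15 = 14348907 trees already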
Infinite Horizon
• Improved by choosing useful policy trees:
– Vt−1, the set of useful (t − 1)-step policy trees, can be used to construct a superset Vt+ of the useful t-step policy trees
– And there are |A| · |Vt−1|^|Ω| elements in Vt+ (the construction is sketched below)
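The superset Vt+ is built by putting every action at the root and assigning one useful (t − 1)-step tree to each observation; a sketch, assuming the PolicyTree class from before:

from itertools import product

# Sketch: build the superset V_t+ of candidate t-step trees from the useful
# (t-1)-step trees, giving |A| * |V_{t-1}|**|Omega| candidates in total.
# Assumes the PolicyTree class sketched earlier.

def build_vt_plus(V_prev, A, Omega):
    candidates = []
    for a in A:
        # one (t-1)-step subtree chosen per observation
        for choice in product(V_prev, repeat=len(Omega)):
            candidates.append(PolicyTree(a, dict(zip(Omega, choice))))
    return candidates

# len(build_vt_plus(V_prev, A, Omega)) == len(A) * len(V_prev) ** len(Omega)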
Infinite Horizon
• Witness algorithm:
– Qt^a is the set of t-step policy trees that have action a at their root
– Qt^a(b) = maxp∈Qt^a b · αp is the corresponding value function
– And Vt(b) = maxa∈A Qt^a(b)
Infinite Horizon
• Witness algorithm:
– Finding witness:
• At each iteration we ask: is there some belief state b for which the true value, Qt^a(b), computed by one-step lookahead using Vt−1, is different from the estimated value, Q̂t^a(b) = maxp∈U b · αp, computed using the set U?
• Provided by one-step lookahead: Qt^a(b) = Σs b(s) R(s, a) + γ Σo Pr(o | a, b) Vt−1(bo^a), where bo^a is the belief resulting from b after taking action a and observing o
Infinite Horizon
• Witness algorithm:
– Finding witness:
• Now we can state the witness theorem [25]: the true Q-function, Qt^a, differs from the approximate Q-function, Q̂t^a, if and only if there is some p ∈ U, some o ∈ Ω, and some p' ∈ Vt−1 for which there is some b such that Vpnew(b) > Vp̃(b) for every p̃ ∈ U, where pnew is the tree identical to p except that its o-subtree is replaced by p' (the pnew construction is sketched below)
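The pnew construction in the theorem is a single-subtree replacement; a small sketch, assuming the PolicyTree class from earlier:

# Sketch of p_new from the witness theorem: identical to p except that the
# subtree for observation o is swapped for another useful (t-1)-step tree.
# Assumes the PolicyTree class sketched earlier.

def make_p_new(p, o, p_prime):
    subtrees = dict(p.subtrees)     # shallow copy; other branches are shared
    subtrees[o] = p_prime
    return PolicyTree(p.action, subtrees)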
Infinite Horizon
• Witness algorithm:
– Finding witness:
• The linear program used to find witness points:
– maximize δ subject to b · αpnew ≥ b · αp̃ + δ for all p̃ ∈ U, Σs b(s) = 1, and b(s) ≥ 0 for all s
– a strictly positive optimal δ means b is a witness point
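A hedged sketch of that linear program with scipy.optimize.linprog: given the α-vector of the candidate tree pnew and those of the trees currently in U, look for a belief where pnew wins by some margin δ > 0 (function and variable names are my own).

import numpy as np
from scipy.optimize import linprog

# Sketch of the witness LP:
#   maximize delta
#   s.t.  b . alpha_new >= b . alpha_p + delta   for every p in U
#         sum_s b(s) = 1,   b(s) >= 0
# A strictly positive optimal delta means b is a witness point.

def find_witness(alpha_new, alphas_U):
    """alpha_new: (n_states,) array; alphas_U: list of (n_states,) arrays."""
    n = len(alpha_new)
    c = np.zeros(n + 1)                            # variables: b(1..n), delta
    c[-1] = -1.0                                   # linprog minimizes, so -delta
    A_ub = np.array([np.append(a_p - alpha_new, 1.0) for a_p in alphas_U])
    b_ub = np.zeros(len(alphas_U))
    A_eq = np.array([np.append(np.ones(n), 0.0)])  # belief sums to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and -res.fun > 1e-9:            # delta > 0: witness found
        return res.x[:n]                           # the witness belief state
    return None

print(find_witness(np.array([1.0, 0.0]), [np.array([0.0, 1.0])]))  # ~[1. 0.]
print(find_witness(np.array([0.0, 0.0]), [np.array([1.0, 1.0])]))  # None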
Infinite Horizon
• Witness algorithm:
– Complete value iteration:
• An agenda containing any single policy tree
• A set U containing the useful policy trees found so far
• Each candidate pnew taken from the agenda is tested for whether it is an improvement over the policy trees in U:
– 1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
– 2. If a witness point is discovered, the best policy tree for that point is calculated and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda. (A simplified version of this loop is sketched below.)
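One simplified way to organize that loop is sketched below; it relies on the find_witness sketch above and on hypothetical helpers best_tree_at(b) (the best t-step tree with root action a for belief b), neighbours(p) (the trees differing from p in one subtree) and alpha_of(p), so it is an illustration of the bookkeeping rather than the full algorithm.

from collections import deque

# Simplified sketch of the agenda loop for a fixed root action a.  It assumes
# find_witness from the LP sketch above and hypothetical helpers passed in by
# the caller; here the agenda starts from the single-subtree variations of one
# useful tree.

def witness_q_a(some_belief, best_tree_at, neighbours, alpha_of):
    first = best_tree_at(some_belief)       # one useful policy tree
    U = [first]                             # useful trees found so far
    agenda = deque(neighbours(first))       # candidate trees to test
    while agenda:
        p_new = agenda.popleft()
        b = find_witness(alpha_of(p_new), [alpha_of(u) for u in U])
        if b is None:
            continue                        # 1. no witness: drop p_new for good
        # 2. witness found: the best tree at b is useful; keep p_new around for
        #    re-checking against the enlarged U, and queue the single-subtree
        #    variations of the newly found useful tree
        p_best = best_tree_at(b)
        U.append(p_best)
        agenda.append(p_new)
        agenda.extend(neighbours(p_best))
    return U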
Infinite Horizon
• Witness algorithm:
– Complexity:
• Since we know that no more than |Qt^a| witness points are discovered (each one adds a tree to the set of useful policy trees),
– only |Qt^a| · |Ω| · (|Vt−1| − 1) trees can ever be added to the agenda (in addition to the one tree in the initial agenda).
• Each linear program solved has |S| + 1 variables (the belief state plus δ) and no more than |Qt^a| + |S| + 1 constraints.
• Each of these linear programs either removes a policy tree from the agenda (this happens at most 1 + |Qt^a| · |Ω| · (|Vt−1| − 1) times) or discovers a witness point (this happens at most |Qt^a| times).
Tiger Problem
• Two doors:
– Behind one door is a tiger
– Behind the other door is a large reward
• Two states:
– denote the state of the world when the tiger is on the left as sl and when it is on the right as sr
• Three actions:
– left, right, and listen.
• Rewards:
– the reward for opening the correct door is +10, the penalty for opening the door with the tiger behind it is -100, and the cost of the listen action is -1
• Observations:
– to hear the tiger on the left (Tl) or to hear the tiger on the right (Tr)
– in state sl, the listen action results in observation Tl with probability 0.85
and the observation Tr with probability 0.15; conversely for world state sr.
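These numbers plug straight into the belief update from earlier; the self-contained snippet below tracks Pr(tiger is left) as the agent listens, then evaluates the expected reward of opening the right door (a hand-rolled illustration, not the optimal policy computation).

# Tiger problem belief update: listening does not move the tiger, so only the
# observation probabilities (0.85 correct / 0.15 incorrect) matter.

def listen_update(b_left, heard_left, accuracy=0.85):
    """Return Pr(tiger is left) after one listen observation."""
    p_obs_left = accuracy if heard_left else 1.0 - accuracy       # Pr(obs | sl)
    p_obs_right = (1.0 - accuracy) if heard_left else accuracy    # Pr(obs | sr)
    numer = p_obs_left * b_left
    return numer / (numer + p_obs_right * (1.0 - b_left))

b = 0.5
for _ in range(2):                      # hear the tiger on the left twice
    b = listen_update(b, heard_left=True)
    print(round(b, 3))                  # 0.85, then ~0.97

# Expected reward of opening the right door when Pr(tiger left) = b:
#   b * (+10) + (1 - b) * (-100)
print(b * 10 + (1 - b) * (-100))        # ~6.7: only now worth opening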
Tiger Problem
• Decreasing listening reliability from 0.85
down to 0.65: