Markov Decision Processes

Nov 14th: Homework 4 due; Project 4 due 11/26
How do we use planning graph heuristics?
Progression
[Figure: progression search over blocks-world states (propositions h-A, h-B, cl-A, cl-B, he, onT-A, onT-B, on-A-B, on-B-A; actions Pick-A, Pick-B, St-A-B, St-B-A, Ptdn-A, Ptdn-B); a planning graph is grown from each child state to estimate its distance to the goal.]
Regression
[Figure: regression search over partial states for the same problem; a single planning graph grown from the initial state is used to estimate the cost of every regression node.]
Qn: How far should we grow each planning graph?
Ans: At most to level-off (i.e., the prop-lists don't change between consecutive levels); see the sketch below.
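A minimal Python sketch of the level-off test, on an illustrative and heavily simplified blocks-world fragment; delete effects are ignored, so this is the relaxed planning graph typically used for heuristics. The function names and proposition encodings are assumptions for illustration, not from any particular planner.

```python
# Minimal sketch: grow a *relaxed* planning graph (delete effects ignored)
# until level-off, i.e. the proposition set stops changing between levels.

def grow_to_leveloff(init_props, actions):
    """actions: list of (preconditions, add_effects) frozenset pairs."""
    levels = [frozenset(init_props)]            # prop list at level 0
    while True:
        current = levels[-1]
        nxt = set(current)                      # no-ops carry props forward
        for pre, add in actions:
            if pre <= current:                  # action applicable at this level
                nxt |= add
        nxt = frozenset(nxt)
        if nxt == current:                      # level-off: nothing new appears
            return levels
        levels.append(nxt)

def level_heuristic(levels, goals):
    """h = index of the first level containing all goals;
    None if some goal never appears (unreachable even in the relaxation)."""
    for i, props in enumerate(levels):
        if set(goals) <= props:
            return i
    return None

# Toy blocks-world fragment (propositions as strings such as 'cl-A', 'h-A'):
acts = [
    (frozenset({'cl-A', 'he', 'onT-A'}), frozenset({'h-A'})),      # Pick-A
    (frozenset({'h-A', 'cl-B'}),         frozenset({'on-A-B'})),   # St-A-B
]
pg = grow_to_leveloff({'cl-A', 'cl-B', 'he', 'onT-A', 'onT-B'}, acts)
print(len(pg), level_heuristic(pg, {'on-A-B'}))   # levels grown, h(on-A-B)
```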
Use of PG in Progression vs Regression
Remember the Altimeter metaphor..
• Progression
  – Need to compute a PG for each child state
    • As many PGs as there are leaf nodes!
    • Much higher cost for heuristic computation
  – Can try exploiting overlap between different PGs
  – However, the states in progression are consistent..
    • So, handling negative interactions is not that important
    • Overall, the PG gives better guidance even without mutexes
• Regression
  – Need to compute the PG only once, for the given initial state.
    • Much lower cost in computing the heuristic
  – However, states in regression are "partial states" and can thus be inconsistent
    • So, taking negative interactions into account using mutexes is important (see the sketch after this list)
  – Costlier PG construction
  • Overall, PG's guidance is not as good unless higher-order mutexes are also taken into account
Historically, the heuristic was first used with progression planners. Then they used it with regression planners. Then they found progression planners do better. Then they found that combining them is even better.
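A hedged companion sketch for the regression side: a single leveled planning graph (assumed to have been built once from the initial state, with proposition levels and pairwise mutexes precomputed elsewhere) is reused to score every partial state. The data and names below are illustrative only.

```python
# Hedged sketch: with mutexes, a set-level-style heuristic for a regression
# (partial) state is the first level at which all of its subgoals appear
# together with no pair marked mutex.  `prop_levels[i]` and `mutex_at[i]`
# are assumed to come from a leveled PG built elsewhere.

def set_level(prop_levels, mutex_at, partial_state):
    goals = set(partial_state)
    for i, props in enumerate(prop_levels):
        if goals <= props and not any(
            frozenset({p, q}) in mutex_at[i]
            for p in goals for q in goals if p != q
        ):
            return i
    return float('inf')        # inconsistent partial state: prune it

# Toy usage: 'h-A' and 'he' (holding A, hand empty) stay mutex forever, so a
# regression node demanding both gets h = infinity and can be pruned.
prop_levels = [frozenset({'cl-A', 'he'}), frozenset({'cl-A', 'he', 'h-A'})]
mutex_at    = [set(),                     {frozenset({'h-A', 'he'})}]
print(set_level(prop_levels, mutex_at, {'cl-A', 'h-A'}))   # 1
print(set_level(prop_levels, mutex_at, {'h-A', 'he'}))     # inf
```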
There is a whole lot more..
• Planning graph heuristics can also be made sensitive to
  – Negative interactions
  – Non-uniform cost actions
  – Partial satisfaction planning
    • Where actions have costs and goals have utilities, and the best plan may not achieve all goals
• See rakaposhi.eas.asu.edu/pg-tutorial
  – Or the AI Magazine paper, Spring 2007
What if you didn't have any hard goals..?
And got rewards continually?
And had stochastic actions?
MDPs as Utility-based problem solving agents
[can generalize to have action costs C(a,s)]
If the Mij matrix (the transition model: the probability of ending up in state j when an action is done in state i) is not known a priori, then we have a reinforcement learning scenario..
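A minimal sketch of the ingredients the slide refers to: states, actions, transition matrices M, rewards R, and optional action costs C(a,s). The container and field names are illustrative assumptions, not any standard API.

```python
# Illustrative MDP container: M[a][i][j] = P(next = s_j | state = s_i, action a),
# R[s] = reward received in state s, C[(a, s)] = optional action cost.
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: list
    actions: list
    M: dict          # M[a] is an |S| x |S| row-stochastic matrix
    R: dict          # reward per state
    C: dict = field(default_factory=dict)   # optional action costs C(a, s)
    gamma: float = 0.95                     # discount factor

# Tiny 2-state example: 'stay' is deterministic, 'jump' is stochastic.
mdp = MDP(
    states=['s0', 's1'],
    actions=['stay', 'jump'],
    M={'stay': [[1.0, 0.0], [0.0, 1.0]],
       'jump': [[0.2, 0.8], [0.1, 0.9]]},
    R={'s0': 0.0, 's1': 1.0},
)
```

If M is not given to the agent, it has to be learned (or values learned directly), which is the reinforcement learning scenario mentioned above.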
What does a solution to an MDP look like?
• The solution should tell the optimal action to do in each state (called a "Policy")
  – Policy is a function from states to actions (*see finite horizon case below*)
  – Not a sequence of actions anymore
    • Needed because of the non-deterministic actions
  – If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
• How do we get the best policy?
  – Pick the policy that gives the maximal expected reward
  – For each policy p
    • Simulate the policy (take actions suggested by the policy) to get behavior traces, and evaluate those traces to estimate the policy's expected reward (see the sketch below)
We will concentrate on infinite horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite. They could be finite and end in a sink state)
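A hedged sketch of the "simulate the policy to get behavior traces" idea, with discounting and a trace cutoff standing in for the infinite horizon; traces may also end early at a sink state, as noted above. All names and numbers are illustrative.

```python
# Monte Carlo policy evaluation by simulation: roll the policy out many
# times, discount the rewards along each trace, and average.
import random

def evaluate_policy(policy, M, R, start, sinks, gamma=0.95,
                    n_traces=1000, cutoff=200):
    """Average discounted reward of simulated traces of `policy`."""
    states = list(R)                        # state names, in a fixed order
    total = 0.0
    for _ in range(n_traces):
        s, discount, ret = start, 1.0, 0.0
        for _ in range(cutoff):             # cutoff approximates the infinite horizon
            ret += discount * R[s]
            if s in sinks:                  # finite trace ending in a sink state
                break
            row = M[policy[s]][states.index(s)]   # transition row for (s, policy(s))
            s = random.choices(states, weights=row)[0]
            discount *= gamma
        total += ret
    return total / n_traces

# Toy usage (hypothetical 2-state MDP).  Brute-force policy selection would
# evaluate all |A|^|S| policies this way and keep the best one.
M = {'go': [[0.1, 0.9], [0.0, 1.0]]}
R = {'s0': 0.0, 's1': 1.0}
print(evaluate_policy({'s0': 'go', 's1': 'go'}, M, R, 's0', sinks={'s1'}))
```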
U* is the maximal expected utility (value) assuming optimal policy
Called the value function U*
Think of these U* values as related to h*() values…
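For reference, the standard Bellman optimality equation characterizing U* (not spelled out on the slide; R and M as above, with γ the discount factor used in the sketches):

$$U^*(s) \;=\; R(s) \;+\; \gamma \,\max_{a} \sum_{s'} M^{a}_{s s'}\, U^*(s')$$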
Why are they called Markov decision processes?
• Markov property means that the state contains all the information (to decide the reward or the transition)
  – Reward of a state Sn is independent of the path used to get to Sn
  – Effect of doing an action A in state Sn doesn't depend on the way we reached state Sn
  – (As a consequence of the above) Maximal expected utility of a state S doesn't depend on the path used to get to S
• Markov properties are assumed (to make life simple)
  – It is possible to have non-markovian rewards (e.g., you will get a reward in state Si only if you came to Si through Sj)
    • E.g., if you picked up a coupon before going to the theater, then you will get a reward
  – It is possible to convert non-markovian rewards into markovian ones, but it leads to a blow-up in the state space. In the theater example above, add "coupon" as part of the state (it becomes an additional state variable, increasing the state space two-fold); see the sketch below.
  – It is also possible to have non-markovian effects, especially if you have partial observability
    • E.g., suppose there are two states of the world where the agent can get banana smell…
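A tiny illustrative sketch of the coupon trick mentioned above (locations and reward numbers are made up): the theater reward is non-markovian on its own, but becomes markovian once the coupon flag is folded into the state.

```python
# Augment the state with the history bit it depends on; this doubles the
# state space but makes the reward a function of the (augmented) state only.

def augment(location, has_coupon):
    """Augmented state = (location, has_coupon)."""
    return (location, has_coupon)

def reward(aug_state):
    location, has_coupon = aug_state
    if location == 'theater' and has_coupon:
        return 10.0        # depends only on the augmented state, not the path
    return 0.0

print(reward(augment('theater', True)), reward(augment('theater', False)))
```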
MDPs and Deterministic Search
• Problem solving agent search corresponds to what special case of MDP?
  – Actions are deterministic; goal states are all equally valued, and are all sink states.
• Is it worth solving the problem using MDPs?
  – The construction of an optimal policy is overkill
    • The policy, in effect, gives us the optimal path from every state to the goal state(s)
  – The value function, or its approximations, on the other hand, are useful. How?
    • As heuristics for the problem solving agent's search
• This shows an interesting connection between dynamic programming and "state search" paradigms
  – DP solves many related problems on the way to solving the one problem we want
  – State search tries to solve just the problem we want
  – We can use DP to find heuristics to run state search.. (see the sketch below)
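A hedged sketch of that last point: goal distances for every state are computed by dynamic programming (backward BFS on a toy unit-cost graph; in practice this would be done on a relaxed or abstracted version of the problem), and then fed to A* as the heuristic. The graph and all names are illustrative.

```python
# DP solves "distance to goal" for *every* state; state search then solves
# just the one start-to-goal problem, guided by those DP values as h.
from collections import deque
import heapq

def dp_distances(edges, goal):
    """Backward BFS = DP over distance-to-goal for unit-cost edges."""
    back = {}
    for u, v in edges:
        back.setdefault(v, []).append(u)
    dist, q = {goal: 0}, deque([goal])
    while q:
        v = q.popleft()
        for u in back.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

def astar(edges, start, goal, h):
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    frontier, seen = [(h.get(start, 0), 0, start)], set()
    while frontier:
        f, g, s = heapq.heappop(frontier)
        if s == goal:
            return g
        if s in seen:
            continue
        seen.add(s)
        for t in succ.get(s, []):
            heapq.heappush(frontier, (g + 1 + h.get(t, 0), g + 1, t))
    return None

edges = [('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'G')]
h = dp_distances(edges, 'G')       # DP: every state's distance to the goal
print(astar(edges, 'A', 'G', h))   # state search: just the A -> G problem
```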
Optimal Policies depend on rewards..
[Figure: optimal policies for the same environment under different reward values; the value of a policy is computed over the behaviors it generates ("sequence of states" = "behavior")]
How about the deterministic case?
U(si) is the shortest path to the goal