
Probabilistic Robotics
Planning and Control:
Markov Decision Processes
Problem Classes
• Deterministic vs. stochastic actions
• Full vs. partial observability
Deterministic, Fully Observable
Stochastic, Fully Observable
Stochastic, Partially Observable
Markov Decision Process (MDP)
[Figure: example MDP with five states s1–s5, stochastic transitions, and state rewards ranging from r = -10 to r = 20.]
Markov Decision Process (MDP)
• Given:
  • States x
  • Actions u
  • Transition probabilities p(x'|u,x)
  • Reward / payoff function r(x,u)
• Wanted:
  • Policy π(x) that maximizes the future expected reward (see the sketch below)
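One way to make these ingredients concrete is a small container for a finite MDP. The class name, field layout, and the tiny two-state example below are illustrative assumptions for the following sketches, not part of the slides.

```python
from dataclasses import dataclass

@dataclass
class DiscreteMDP:
    """Finite MDP: states x, actions u, transitions p(x'|u,x), rewards r(x,u).

    All names and the dictionary layout are illustrative assumptions; the
    slides only define the abstract ingredients and the discount factor.
    """
    states: list          # e.g. ["s1", "s2"]
    actions: list         # e.g. ["stay", "go"]
    p: dict               # p[(x, u)] -> {x_next: probability}
    r: dict               # r[(x, u)] -> immediate payoff
    gamma: float = 0.95   # discount factor

# Tiny two-state example (values made up for illustration).
mdp = DiscreteMDP(
    states=["s1", "s2"],
    actions=["stay", "go"],
    p={("s1", "stay"): {"s1": 1.0},
       ("s1", "go"):   {"s2": 0.9, "s1": 0.1},
       ("s2", "stay"): {"s2": 1.0},
       ("s2", "go"):   {"s1": 1.0}},
    r={("s1", "stay"): 0.0, ("s1", "go"): 0.0,
       ("s2", "stay"): 1.0, ("s2", "go"): 0.0},
)
```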
Rewards and Policies
• Policy (general case):
  \pi : z_{1:t-1}, u_{1:t-1} \rightarrow u_t
• Policy (fully observable case):
  \pi : x_t \rightarrow u_t
• Expected cumulative payoff (see the sketch below):
  R_T = E\left[ \sum_{\tau=1}^{T} \gamma^{\tau} r_{t+\tau} \right]
• T = 1: greedy policy
• T > 1: finite-horizon case, typically no discount
• T = \infty: infinite-horizon case, finite reward if discount \gamma < 1
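As a quick numeric illustration of the discounted sum above, the sketch below evaluates \sum_{\tau} \gamma^{\tau} r_{t+\tau} for a given reward sequence; the reward values are made up.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative payoff: sum over tau of gamma^tau * r_{t+tau}.

    `rewards` holds r_{t+1}, ..., r_{t+T}; the values used below are made up.
    """
    return sum(gamma ** (tau + 1) * r for tau, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 20.0], gamma=0.9))  # 0.9^3 * 20 = 14.58
```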
Policies contd.
• Expected cumulative payoff of policy \pi:
  R_T^{\pi}(x_t) = E\left[ \sum_{\tau=1}^{T} \gamma^{\tau} r_{t+\tau} \;\middle|\; u_{t+\tau} = \pi(z_{1:t+\tau-1}, u_{1:t+\tau-1}) \right]
• Optimal policy:
  \pi^{*} = \operatorname{argmax}_{\pi} R_T^{\pi}(x_t)
• 1-step optimal policy:
  \pi_1(x) = \operatorname{argmax}_{u} r(x,u)
• Value function of 1-step optimal policy (see the sketch below):
  V_1(x) = \gamma \max_{u} r(x,u)
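Under the illustrative DiscreteMDP container sketched earlier, the 1-step greedy policy and its value function can be written directly; this is only a sketch of the two formulas above.

```python
def greedy_policy_and_value(mdp):
    """1-step policy pi_1(x) = argmax_u r(x,u) and value V_1(x) = gamma * max_u r(x,u).

    Assumes the illustrative dictionary-based DiscreteMDP sketched earlier.
    """
    pi1, V1 = {}, {}
    for x in mdp.states:
        best_u = max(mdp.actions, key=lambda u: mdp.r[(x, u)])
        pi1[x] = best_u
        V1[x] = mdp.gamma * mdp.r[(x, best_u)]
    return pi1, V1
```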
2-step Policies
• Optimal policy:
  \pi_2(x) = \operatorname{argmax}_{u} \left[ r(x,u) + \int V_1(x')\, p(x'|u,x)\, dx' \right]
• Value function:
  V_2(x) = \gamma \max_{u} \left[ r(x,u) + \int V_1(x')\, p(x'|u,x)\, dx' \right]
T-step Policies
• Optimal policy:
  \pi_T(x) = \operatorname{argmax}_{u} \left[ r(x,u) + \int V_{T-1}(x')\, p(x'|u,x)\, dx' \right]
• Value function (one Bellman backup per step; see the sketch below):
  V_T(x) = \gamma \max_{u} \left[ r(x,u) + \int V_{T-1}(x')\, p(x'|u,x)\, dx' \right]
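The recursion above is one Bellman backup per horizon step. A discrete-state sketch, again assuming the illustrative dictionary-based MDP from before and replacing the integral with a sum, could look like this:

```python
def bellman_backup(mdp, V_prev):
    """One backup: V_T(x) = gamma * max_u [ r(x,u) + sum_x' V_{T-1}(x') p(x'|u,x) ]."""
    V_new, pi = {}, {}
    for x in mdp.states:
        def q(u):  # expected payoff of taking u in x, then continuing with V_prev
            return mdp.r[(x, u)] + sum(p * V_prev[x2]
                                       for x2, p in mdp.p[(x, u)].items())
        best_u = max(mdp.actions, key=q)
        pi[x] = best_u
        V_new[x] = mdp.gamma * q(best_u)
    return V_new, pi
```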
Infinite Horizon
• Optimal policy:
  V_\infty(x) = \gamma \max_{u} \left[ r(x,u) + \int V_\infty(x')\, p(x'|u,x)\, dx' \right]
• This is the Bellman equation.
• Its fixed point is the value function of the optimal policy.
• Satisfying it is a necessary and sufficient condition for optimality.
Value Iteration
• for all x do
    \hat{V}(x) \leftarrow r_{\min}
  endfor
• repeat until convergence
    for all x do
      \hat{V}(x) \leftarrow \gamma \max_{u} \left[ r(x,u) + \int \hat{V}(x')\, p(x'|u,x)\, dx' \right]
    endfor
  endrepeat
• Resulting policy (see the sketch below):
  \pi(x) = \operatorname{argmax}_{u} \left[ r(x,u) + \int \hat{V}(x')\, p(x'|u,x)\, dx' \right]
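A compact discrete-state version of this loop, assuming the same illustrative dictionary-based MDP container as before; the r_min initialization value and the convergence threshold eps are arbitrary choices for the sketch.

```python
def value_iteration(mdp, r_min=-100.0, eps=1e-6):
    """Value iteration for a finite MDP, followed by greedy policy extraction."""
    V = {x: r_min for x in mdp.states}        # pessimistic initialization
    while True:
        V_new = {}
        for x in mdp.states:
            V_new[x] = mdp.gamma * max(
                mdp.r[(x, u)] + sum(p * V[x2] for x2, p in mdp.p[(x, u)].items())
                for u in mdp.actions)
        converged = max(abs(V_new[x] - V[x]) for x in mdp.states) < eps
        V = V_new
        if converged:
            break
    # Extract the greedy policy with respect to the converged value function.
    pi = {}
    for x in mdp.states:
        pi[x] = max(mdp.actions,
                    key=lambda u: mdp.r[(x, u)] +
                                  sum(p * V[x2] for x2, p in mdp.p[(x, u)].items()))
    return V, pi
```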
Value Iteration for Motion Planning
Value Function and Policy Iteration
• Often the optimal policy has been reached long before the value function has converged.
• Policy iteration calculates a new policy based on the current value function and then calculates a new value function based on this policy.
• This process often converges faster to the optimal policy (see the sketch below).
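A minimal policy-iteration sketch under the same illustrative MDP representation; evaluating the policy iteratively (rather than solving the linear system exactly) is an implementation choice, not something the slides prescribe.

```python
def policy_iteration(mdp, eps=1e-6):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    pi = {x: mdp.actions[0] for x in mdp.states}   # arbitrary initial policy
    V = {x: 0.0 for x in mdp.states}
    while True:
        # Policy evaluation: iterate V(x) = gamma * [ r(x,pi(x)) + sum_x' V(x') p(x'|pi(x),x) ].
        while True:
            delta = 0.0
            for x in mdp.states:
                v_new = mdp.gamma * (mdp.r[(x, pi[x])] +
                                     sum(p * V[x2] for x2, p in mdp.p[(x, pi[x])].items()))
                delta = max(delta, abs(v_new - V[x]))
                V[x] = v_new
            if delta < eps:
                break
        # Policy improvement: act greedily with respect to the evaluated value function.
        stable = True
        for x in mdp.states:
            best_u = max(mdp.actions,
                         key=lambda u: mdp.r[(x, u)] +
                                       sum(p * V[x2] for x2, p in mdp.p[(x, u)].items()))
            if best_u != pi[x]:
                pi[x] = best_u
                stable = False
        if stable:
            return V, pi
```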