Probabilistic Robotics
Planning and Control: Markov Decision Processes

Problem Classes
• Deterministic vs. stochastic actions
• Full vs. partial observability

Deterministic, Fully Observable

Stochastic, Fully Observable

Stochastic, Partially Observable

Markov Decision Process (MDP)
[Figure: example MDP with five states s1–s5, stochastic transitions between them, and per-state rewards ranging from r = -10 to r = 20]

Markov Decision Process (MDP)
• Given:
  • States x
  • Actions u
  • Transition probabilities p(x'|u,x)
  • Reward / payoff function r(x,u)
• Wanted:
  • Policy \pi(x) that maximizes the future expected reward

Rewards and Policies
• Policy (general case): \pi : z_{1:t-1}, u_{1:t-1} \to u_t
• Policy (fully observable case): \pi : x_t \to u_t
• Expected cumulative payoff: R_T = E[\sum_{\tau=1}^{T} \gamma^{\tau} r_{t+\tau}]
• T = 1: greedy policy
• T > 1: finite-horizon case, typically no discount
• T = \infty: infinite-horizon case, finite reward if the discount factor \gamma < 1

Policies contd.
• Expected cumulative payoff of policy \pi:
  R_T^{\pi}(x_t) = E[\sum_{\tau=1}^{T} \gamma^{\tau} r_{t+\tau} \mid u_{t+\tau} = \pi(z_{1:t+\tau-1}, u_{1:t+\tau-1})]
• Optimal policy: \pi^{*} = \argmax_{\pi} R_T^{\pi}(x_t)
• 1-step optimal policy: \pi_1(x) = \argmax_u r(x,u)
• Value function of the 1-step optimal policy: V_1(x) = \gamma \max_u r(x,u)

2-step Policies
• Optimal policy: \pi_2(x) = \argmax_u [r(x,u) + \int V_1(x') p(x'|u,x) dx']
• Value function: V_2(x) = \gamma \max_u [r(x,u) + \int V_1(x') p(x'|u,x) dx']

T-step Policies
• Optimal policy: \pi_T(x) = \argmax_u [r(x,u) + \int V_{T-1}(x') p(x'|u,x) dx']
• Value function: V_T(x) = \gamma \max_u [r(x,u) + \int V_{T-1}(x') p(x'|u,x) dx']

Infinite Horizon
• Optimal value function: V_{\infty}(x) = \gamma \max_u [r(x,u) + \int V_{\infty}(x') p(x'|u,x) dx']
• This is the Bellman equation.
• Its fixed point is the optimal value function, from which the optimal policy follows.
• Satisfying it is a necessary and sufficient condition for optimality.

Value Iteration
(a runnable Python sketch of this loop appears after the last slide)
• for all x do
    \hat{V}(x) \leftarrow r_{min}
  endfor
• repeat until convergence
    for all x do
      \hat{V}(x) \leftarrow \gamma \max_u [r(x,u) + \int \hat{V}(x') p(x'|u,x) dx']
    endfor
  endrepeat
• \pi(x) = \argmax_u [r(x,u) + \int \hat{V}(x') p(x'|u,x) dx']

Value Iteration for Motion Planning

Value Function and Policy Iteration
• Often the optimal policy has been reached long before the value function has converged.
• Policy iteration calculates a new policy based on the current value function and then calculates a new value function based on this policy.
• This process often converges faster to the optimal policy (a sketch also follows after the last slide).
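As a rough illustration of the value iteration loop above, here is a minimal Python/NumPy sketch for a small discrete MDP. The states, transition probabilities, and rewards below are made-up placeholders rather than the exact numbers from the slide's state diagram, and the code follows the slides' convention of multiplying the whole backup by the discount \gamma.

```python
import numpy as np

# Toy discrete MDP (illustrative numbers, NOT the figures from the slide's diagram).
# P[a, s, s2] = p(x'|u, x), R[s, a] = r(x, u).
n_states, n_actions = 3, 2
P = np.array([
    [[0.8, 0.2, 0.0],
     [0.1, 0.7, 0.2],
     [0.0, 0.3, 0.7]],   # action u = 0
    [[0.5, 0.5, 0.0],
     [0.0, 0.6, 0.4],
     [0.2, 0.0, 0.8]],   # action u = 1
])
R = np.array([
    [ 0.0,  1.0],
    [ 0.0,  0.0],
    [20.0, -10.0],
])

gamma = 0.9                       # discount factor
V = np.full(n_states, R.min())    # initialize V_hat(x) <- r_min, as on the slide

# Repeat the Bellman backup until convergence:
# V_hat(x) <- gamma * max_u [ r(x,u) + sum_x' p(x'|u,x) V_hat(x') ]
for _ in range(10_000):
    expected_next = np.einsum('ast,t->sa', P, V)   # sum_x' p(x'|u,x) V_hat(x')
    V_new = (gamma * (R + expected_next)).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

# Extract the greedy policy with respect to the converged value function.
Q = gamma * (R + np.einsum('ast,t->sa', P, V))
policy = Q.argmax(axis=1)
print("V  =", V)
print("pi =", policy)
```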
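The policy iteration idea from the last slide can be sketched in the same style, assuming the same illustrative toy MDP as in the value iteration example above: evaluate the current policy exactly by solving a linear system, then improve it greedily, and stop as soon as the policy no longer changes.

```python
import numpy as np

# Same toy MDP as in the value iteration sketch (illustrative numbers only).
n_states, n_actions = 3, 2
P = np.array([
    [[0.8, 0.2, 0.0],
     [0.1, 0.7, 0.2],
     [0.0, 0.3, 0.7]],   # action u = 0
    [[0.5, 0.5, 0.0],
     [0.0, 0.6, 0.4],
     [0.2, 0.0, 0.8]],   # action u = 1
])
R = np.array([
    [ 0.0,  1.0],
    [ 0.0,  0.0],
    [20.0, -10.0],
])
gamma = 0.9

policy = np.zeros(n_states, dtype=int)   # start from an arbitrary policy

for _ in range(100):
    # Policy evaluation: solve V = gamma * (r_pi + P_pi V) exactly.
    P_pi = P[policy, np.arange(n_states), :]      # p(x'|pi(x), x)
    r_pi = R[np.arange(n_states), policy]         # r(x, pi(x))
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, gamma * r_pi)

    # Policy improvement: greedy with respect to the current value function.
    Q = gamma * (R + np.einsum('ast,t->sa', P, V))
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):        # policy stable -> optimal
        break
    policy = new_policy

print("pi =", policy)
print("V  =", V)
```

Because the improvement step only needs the greedy action ordering, not converged values, the policy typically stabilizes after a handful of evaluation/improvement rounds, which is the speed-up the last slide refers to.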