Take the First Right and Go Straight Forever: Novel Planning Algorithms in Stochastic Infinite Domains
Judah Schvimer; Advisor: Prof. Michael Littman

[Figure: worked example in a card-game domain (steps 1-6), with moves such as "Draw a 9 of Diamonds", "Draw an Ace of Spades", and "5 of Hearts on 6 of Spades"]

Probabilistic Heuristic Search Algorithm
1. Set Policy: Choose the policy with the greatest probability of reaching the goal using a standard planning algorithm, assuming optimistically that unexplored states are goal states.
   1. If there is a tie, choose the policy with the greatest expected total reward.
   2. If there is still a tie, choose a policy arbitrarily, though consistently.
2. Short Circuiting (optional): If the policy's pessimistic estimate of the probability of reaching the goal is better than the best optimistic estimate from any other first action, go to Step 6 and return only the optimal first action.
3. Termination: If there are no more fringe states in the current policy, go to Step 6; otherwise continue to Step 4.
4. Choose Expansion State: Among all fringe states, choose the one reached with the greatest probability.
   1. If there is a tie, choose one state arbitrarily, though consistently.
5. Expand: Expand the chosen fringe state by seeing where its actions transition and adding those states to the MDP; go to Step 1.
6. Policy Choice: Return the last expanded policy.
(A code sketch of this search loop appears at the end of this section.)

Termination
The optimal policy is finite when:
- Actions transition to a finite number of states.
- The greatest probability of reaching the goal is 1*.
- The reward scheme causes states within a finite number of steps of the start to have greater values than states an infinite number of steps from the start.

Reward Functions
- Action Penalty: step -> -1, goal -> 0
- Goal Reward: step -> 0, goal -> 1
- With Discounting: 0 ≤ γ < 1
- No Discounting: γ = 1
These combine into four schemes: APND (Action Penalty, No Discounting), APWD (Action Penalty, With Discounting), GRWD (Goal Reward, With Discounting), and GRND (Goal Reward, No Discounting).

Modified Breadth-First Search
- Uses short-circuiting termination.
- Guaranteed to find the optimal policy, but not guaranteed to terminate once it has, unless the greatest probability of reaching the goal equals 1.
- Neither this algorithm nor the Probabilistic Heuristic Search Algorithm finds the policy with the fewest expected steps.

APWD vs. GRWD
- They expand the same states, find the same answer, and take the same time.
- Why? The algorithm uses rewards only in the Set Policy step, where the tie breaker chooses the policy with the greatest expected total reward, and APWD's and GRWD's expected total rewards are linear transformations of each other (see the derivation at the end of this section).

APND vs. GRWD/APWD
- The two are not directly comparable: domains exist where each expands fewer states.
- GRWD/APWD makes different decisions depending on the discount factor.
[Figures: "GRWD Expands Fewer States" and "APND Expands Fewer States", example domains annotated with "APND's Choice", "GRWD's Choice", and "APWD's Reward"]

GRND Doesn't Terminate
- Worst possible performance: undirected, infinite wandering.
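The following is a minimal sketch of the search loop described above, not the poster's implementation. The `actions`, `successors`, and `is_goal` callables are hypothetical interfaces; the fixed number of value-iteration sweeps, the single-visit reach-probability pass, and the assumption that every non-goal state has at least one action are simplifications. Short circuiting (Step 2) and the reward-based tie breaker (Step 1.1) are omitted for brevity.

```python
# Sketch of the Probabilistic Heuristic Search loop on an incrementally expanded MDP.
from collections import defaultdict


def prob_heuristic_search(start, actions, successors, is_goal, max_expansions=1000):
    explored = {start}                       # states whose outgoing transitions are known
    transitions = {(start, a): successors(start, a) for a in actions(start)}

    policy = {}
    for _ in range(max_expansions):
        # Step 1 (Set Policy): estimate each explored state's probability of reaching
        # the goal, optimistically treating unexplored (fringe) states as goals.
        v = {s: 1.0 if is_goal(s) else 0.0 for s in explored}

        def value(t):
            return v[t] if t in explored else 1.0   # optimism: fringe counts as a goal

        def q(s, a):
            return sum(p * value(t) for t, p in transitions[(s, a)].items())

        for _ in range(100):                 # fixed sweeps; a convergence test would be better
            for s in explored:
                if not is_goal(s):
                    v[s] = max(q(s, a) for a in actions(s))
        policy = {s: max(actions(s), key=lambda a: q(s, a))
                  for s in explored if not is_goal(s)}

        # Steps 3-4 (Termination / Choose Expansion State): collect fringe states reachable
        # under the policy, with approximate probabilities of reaching them.
        reach, frontier = {start: 1.0}, [start]
        fringe_prob = defaultdict(float)
        while frontier:
            s = frontier.pop()
            if is_goal(s):
                continue
            for t, p in transitions[(s, policy[s])].items():
                if t not in explored:
                    fringe_prob[t] += reach[s] * p
                elif t not in reach:         # visit each explored state once (approximation)
                    reach[t] = reach[s] * p
                    frontier.append(t)
        if not fringe_prob:
            return policy                    # Step 6: no fringe left; return the last policy

        # Step 5 (Expand): expand the most probable fringe state and loop back to Step 1.
        s = max(fringe_prob, key=fringe_prob.get)
        explored.add(s)
        for a in actions(s):
            transitions[(s, a)] = successors(s, a)
    return policy                            # expansion budget exhausted
```

In a toy version of the card-game domain from the figure, `successors(state, action)` would return a dictionary mapping each possible next card configuration to its draw probability, and `is_goal` would test for a winning layout.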
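The claim that APWD's and GRWD's expected total rewards are linear transformations of each other can be sketched as follows. This is an assumption-laden derivation, not the poster's own: it supposes a trajectory reaches the goal after T steps (T = ∞ if it never does) and that the final transition's reward is discounted by γ^(T-1); other reward-timing conventions change the constants but not the conclusion.

\[
V_{\mathrm{GR}} = \mathbb{E}\!\left[\gamma^{\,T-1}\,\mathbf{1}\{T < \infty\}\right],
\qquad
V_{\mathrm{AP}} = \mathbb{E}\!\left[-\sum_{t=0}^{T-1}\gamma^{\,t}\right]
             = -\frac{1}{1-\gamma} + \frac{1}{1-\gamma}\,\mathbb{E}\!\left[\gamma^{\,T}\right]
             = -\frac{1}{1-\gamma} + \frac{\gamma}{1-\gamma}\,V_{\mathrm{GR}}.
\]

Because γ/(1-γ) > 0 for 0 < γ < 1, V_AP is an increasing affine function of V_GR, so APWD and GRWD rank policies identically and the tie breaker in the Set Policy step makes the same choice under either reward scheme.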