Take the First Right and Go Straight Forever: Novel Planning Algorithms in Stochastic Infinite Domains
Judah Schvimer; Advisor: Prof. Michael Littman

[Figure: an example card-game domain; recoverable action labels include "Draw a 9 of Diamonds", "Draw an Ace of Spades", and "5 of Hearts on 6 of Spades".]
Probabilistic Heuristic Search Algorithm
1. Set Policy: Choose the policy with the greatest probability of reaching the goal using a standard planning algorithm, assuming optimistically that unexplored states are goal states.
   1. If there is a tie, choose the policy with the greatest expected total reward.
   2. If there is still a tie, choose a policy arbitrarily, though consistently.
2. Short Circuiting (optional): If the policy's pessimistic estimate of the probability of reaching the goal is better than the best optimistic estimate from any other first action, go to Step 6 and return only the optimal first action.
3. Termination: If there are no more fringe states in the current policy, go to Step 6; otherwise continue to Step 4.
4. Choose Expansion State: Among all fringe states, choose the one reached with the greatest probability.
   1. If there is a tie, choose one state arbitrarily, though consistently.
5. Expansion: Expand the chosen fringe state by seeing where its actions transition and adding those states to the MDP; go to Step 1.
6. Policy Choice: Return the last expanded policy. (A sketch of this loop follows below.)
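The loop below is a minimal editorial sketch of the six steps above, not the author's implementation. It assumes two hypothetical helpers supplied by the caller: plan(), which re-solves the currently known MDP and reports, for each candidate first action, its optimistic and pessimistic probabilities of reaching the goal, its expected total reward, and its fringe states; and expand(state), which adds a fringe state's outgoing transitions to the known MDP. The Candidate record is likewise an assumption.

    from dataclasses import dataclass
    from typing import Callable, Hashable

    State = Hashable
    Action = Hashable


    @dataclass
    class Candidate:
        """One candidate policy (identified by its first action) over the known MDP.

        The fields are assumptions about what a standard planning algorithm could
        report; they are not the author's data structures.
        """
        first_action: Action
        optimistic_goal_prob: float   # fringe states treated as goal states
        pessimistic_goal_prob: float  # fringe states treated as dead ends
        expected_total_reward: float
        fringe_states: dict[State, float]  # fringe state -> prob. the policy reaches it


    def probabilistic_heuristic_search(
        plan: Callable[[], list[Candidate]],
        expand: Callable[[State], None],
        short_circuit: bool = True,
    ) -> Candidate:
        """Main loop of the Probabilistic Heuristic Search Algorithm (sketch)."""
        while True:
            candidates = plan()  # re-plan on the known MDP after every expansion

            # Step 1, Set Policy: greatest optimistic probability of reaching the
            # goal; ties broken by expected total reward, then consistently
            # (here, by the repr of the first action).
            candidates.sort(
                key=lambda c: (c.optimistic_goal_prob,
                               c.expected_total_reward,
                               repr(c.first_action)),
                reverse=True,
            )
            policy = candidates[0]

            # Step 2, Short Circuiting (optional): the chosen policy's pessimistic
            # goal probability already beats every other first action's optimistic
            # estimate, so only policy.first_action is needed.
            if short_circuit:
                others = [c for c in candidates
                          if c.first_action != policy.first_action]
                if not others or policy.pessimistic_goal_prob > max(
                        c.optimistic_goal_prob for c in others):
                    return policy  # Step 6: only the first action is meaningful

            # Step 3, Termination: no fringe states remain in the current policy.
            if not policy.fringe_states:
                return policy  # Step 6, Policy Choice

            # Step 4, Choose Expansion State: the fringe state reached with the
            # greatest probability, ties broken consistently.
            state = max(policy.fringe_states,
                        key=lambda s: (policy.fringe_states[s], repr(s)))

            # Step 5, Expansion: add the chosen state's successors to the known
            # MDP, then return to Step 1.
            expand(state)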
Termination
The algorithm is guaranteed to terminate when:
- The optimal policy is finite
- Actions transition to a finite number of states
- The greatest probability of reaching the goal is 1*
- The reward scheme causes states within a finite number of steps of the start to have greater values than states an infinite number of steps away from the start
Reward Functions

                                             With Discounting (0 ≤ γ < 1)   No Discounting (γ = 1)
    Action Penalty (Step -> -1, Goal -> 0)              APWD                        APND
    Goal Reward    (Step -> 0,  Goal -> 1)              GRWD                        GRND
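Read as code, the table could look like the sketch below. The function names, the dictionary, and the example discount factor 0.9 (standing in for any 0 ≤ γ < 1) are editorial assumptions; the abbreviation expansions in the comments simply follow the table's row and column labels.

    def action_penalty(reached_goal: bool) -> float:
        """Action Penalty: every step costs -1; reaching the goal gives 0."""
        return 0.0 if reached_goal else -1.0


    def goal_reward(reached_goal: bool) -> float:
        """Goal Reward: steps give 0; reaching the goal gives +1."""
        return 1.0 if reached_goal else 0.0


    # Each scheme pairs a reward function with a discount factor gamma.
    # 0.9 stands in for any 0 <= gamma < 1; it is not a value from the poster.
    REWARD_SCHEMES = {
        "APWD": (action_penalty, 0.9),  # Action Penalty, With Discounting
        "APND": (action_penalty, 1.0),  # Action Penalty, No Discounting
        "GRWD": (goal_reward, 0.9),     # Goal Reward, With Discounting
        "GRND": (goal_reward, 1.0),     # Goal Reward, No Discounting
    }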
Modified Breadth First Search
- Uses short circuiting termination
- Guaranteed to find the optimal policy, but not guaranteed to terminate when it does unless the greatest probability of reaching the goal equals 1
- Neither it nor the Probabilistic Heuristic Search Algorithm finds the policy with the fewest expected steps
GRWD Expands Fewer States
[Figure: a domain in which GRWD expands fewer states than APND; labels include "APWD's Reward", "APND's Choice", and "GRWD's Choice".]

APWD and GRWD:
- Expand the same states
- Find the same answer
- Take the same time
Why?
- The algorithm uses rewards only in the Set Policy step
- The tie breaker chooses the policy with the greatest expected total reward
- APWD's and GRWD's expected total rewards are linear transformations of each other (see the derivation sketch below)
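A sketch of that linear-transformation claim, under the assumed convention that the reward for the t-th action is discounted by γ^(t-1) (a different convention only changes the constants): for a trajectory that reaches the goal after k steps,

    \[
    R_{\mathrm{APWD}} \;=\; -\sum_{t=1}^{k} \gamma^{\,t-1} \;=\; -\frac{1-\gamma^{\,k}}{1-\gamma},
    \qquad
    R_{\mathrm{GRWD}} \;=\; \gamma^{\,k-1},
    \qquad\text{so}\qquad
    R_{\mathrm{APWD}} \;=\; \frac{\gamma\, R_{\mathrm{GRWD}} - 1}{1-\gamma}.
    \]

A trajectory that never reaches the goal gives R_APWD = -1/(1-γ) and R_GRWD = 0, which satisfies the same relation. Taking expectations over trajectories, every policy's expected total reward under APWD is the same increasing affine function (for 0 < γ < 1) of its expected total reward under GRWD, so the tie breaker in the Set Policy step makes identical choices under the two schemes.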
APND and GRWD/APWD are not comparable
- Domains exist where each performs better
- GRWD/APWD makes different decisions based on the discount factor
APND Expands Fewer States
[Figure: a domain in which APND expands fewer states than GRWD; labels include "GRWD's Choice" and "APND's Choice".]

GRND Doesn't Terminate
- Worst possible performance: undirected, infinite wandering (see the note below)
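One way to see why, tied to the last termination condition above: compare the optimistic value each reward scheme assigns to a policy whose pretended goal (an unexplored fringe state) lies n steps from the start, using the same discounting convention as in the derivation sketch above. These expressions are an editorial illustration, not taken from the poster.

    \[
    V^{\mathrm{APWD}}_{n} = -\frac{1-\gamma^{\,n}}{1-\gamma},
    \qquad
    V^{\mathrm{GRWD}}_{n} = \gamma^{\,n-1},
    \qquad
    V^{\mathrm{APND}}_{n} = -n,
    \qquad
    V^{\mathrm{GRND}}_{n} = 1.
    \]

Under APWD and GRWD with 0 < γ < 1, and under APND, these values strictly decrease as n grows, so the Set Policy tie breaker favors policies whose optimistic goals lie finitely many steps from the start. Under GRND every optimistically goal-reaching policy is tied at value 1 no matter how deep its fringe state is, so nothing steers the search toward nearby states, which is consistent with undirected, unbounded wandering and non-termination.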