LEAP Algorithm
Reinforcement Learning with Adaptive Partitioning
Tsufit Cohen, Eyal Radiano
Supervised by: Andrey Bernstein

Agenda
– Intro
– Q-learning
– The LEAP algorithm
– Simulation
– LEAP vs. Q-learning
– Conclusions

Intro
Reinforcement Learning:
– An agent learns an optimal policy by trial and error
– It receives a reward for "good" steps
– Its performance improves over time

Q-learning
Key specifications (definitions follow the paper, p. 6):
– Table representation: $Q(s,a)$ is stored explicitly for every state-action pair
– Q-value update: $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R(s,a) + \gamma \max_{a' \in A} Q(s',a') - Q(s,a) \right]$
– Exploration policy: epsilon-greedy

LEAP
Learning Entities (LE) with Adaptive Partitioning.
Key specifications:
– Macro states
– Multi-partitioning (each partition is called a Learning Entity, LE)
– Pruning and joining of partitions

Algorithm
– Action selection
– Incoherence criterion
– JLE generation
– Update
– Pruning mechanism

Action Selection
The global Q-value combines the estimates of all LEs active at state $s$, weighted by their inverse variances:
$Q(s,a) = \sum_{i \in L(s)} \omega_i(s,a)\, Q_i(s,a)$, where $\omega_i(s,a) = \dfrac{1/\hat{\sigma}_i^2(s,a)}{\sum_{j \in L(s)} 1/\hat{\sigma}_j^2(s,a)}$

Action Selection (cont.)
The value of a candidate action is estimated by a one-step lookahead with the transition model:
$Q(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a) + \gamma \max_{a' \in A} Q(s',a') \right]$

Incoherence Criterion, JLE Generation, Update
– TD error: $\delta(s,a) = R(s,a) + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)$
– Per-LE update: $Q_i(s,a) \leftarrow Q_i(s,a) + \alpha_i \left[ R(s,a) + \gamma \max_{a' \in A} Q(s',a') - Q_i(s,a) \right]$
– The variance estimate $\hat{\sigma}_i^2(s,a)$ is updated from the same errors
– Visit count: $v_i(s,a) \leftarrow v_i(s,a) + 1$
– When the observed error is incoherent with an LE's estimate, a Joint Learning Entity (JLE) is generated to refine the partition

Pruning Mechanism

Changes and Add-ons to the Algorithm
– The order of pruning and updating was changed
– The epsilon-greedy policy starts from $\varepsilon = 0.9$
– Boundary condition: $Q = 0$ at the end of a game (terminal states)

Implementation
Key operations:
– Finding the active LE list for a given state
– Finding a macro state within an LE
– Adding/removing a JLE and/or a macro state

Data Structures
– LE is the base class; Basic LE and JLE inherit from it
– LE fields: CList<macrostate> Macro_list, Int* ID_arr_, Int order
– Basic LE adds: CList<JLE>* Sons_lists_arr

General Data Structure
– An array of Basic LEs: Basic LE 1, Basic LE 2, Basic LE 3, ...
– Each Basic LE holds its macro list, ID array, order, and a sons-list array with one pointer per JLE order (e.g., the order-1 list empty, plus lists for order 2 and order 3)

3D Grid World Implementation Example
– Basic LE array: Basic LE X, Basic LE Y, Basic LE Z
– Sons lists (indexed 0-2): X's list holds JLE XY, JLE XZ and JLE XYZ; Y's list holds JLE YZ; Z's list is empty

Simulation 1 – 2D Grid World
Environment properties:
– Size: 20x20
– Step cost: -1
– Award: +2
– Available moves: Up, Down, Left, Right
– Wall bumping: no movement
– Taking the award starts a new episode
– Basic LEs: X and Y (one partition along the x axis, one along the y axis); the start point and the prize are at fixed cells (see figure)
A minimal sketch of this environment is given below.
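The deck does not show the simulation code itself, so the following is a minimal C++ sketch of such an environment under the stated properties (20x20 grid, step cost -1, award +2, no movement on wall bumping, new episode when the award is taken). The names (GridWorld, StepResult), the prize coordinates, and the additive reward on the prize cell are illustrative assumptions, not the project's actual implementation.

```cpp
#include <cstdio>

// Hypothetical sketch of the Simulation 1 environment (not the project's code).
// 20x20 grid, step cost -1, award +2 at a fixed prize cell (position assumed),
// bumping into a wall leaves the agent in place, taking the award ends the episode.
enum Action { UP, DOWN, LEFT, RIGHT };

struct StepResult {
    int x, y;        // next state
    double reward;   // immediate reward
    bool episodeEnd; // true when the award was taken
};

class GridWorld {
public:
    GridWorld(int size = 20, int prizeX = 19, int prizeY = 0)
        : size_(size), prizeX_(prizeX), prizeY_(prizeY) {}

    StepResult step(int x, int y, Action a) const {
        int nx = x, ny = y;
        switch (a) {
            case UP:    ny = y + 1; break;
            case DOWN:  ny = y - 1; break;
            case LEFT:  nx = x - 1; break;
            case RIGHT: nx = x + 1; break;
        }
        // Wall bumping: the move is cancelled and the agent stays in place.
        if (nx < 0 || nx >= size_ || ny < 0 || ny >= size_) { nx = x; ny = y; }

        StepResult r;
        r.x = nx;
        r.y = ny;
        r.reward = -1.0;                                  // step cost
        r.episodeEnd = (nx == prizeX_ && ny == prizeY_);  // prize reached?
        if (r.episodeEnd) r.reward += 2.0;                // award (assumed additive)
        return r;
    }

private:
    int size_, prizeX_, prizeY_;
};

int main() {
    GridWorld env;
    StepResult r = env.step(0, 19, RIGHT);  // one step from an assumed start cell
    std::printf("next=(%d,%d) reward=%.1f done=%d\n", r.x, r.y, r.reward, (int)r.episodeEnd);
    return 0;
}
```

The obstacle of Simulation 2 (an extra -3 cell in a 5x5 grid) would only add one more reward term to the same step function.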
Simulation 1 Results – Policy
[Figure: the learned policy for Simulation 1 – one arrow (UP/DOWN/LEFT/RIGHT) per cell of the 20x20 grid, leading from the start point toward the prize.]

Results – Average Reward and Refined Macro-state Count
[Figures: average reward vs. number of trials (x50), and number of refined macro states vs. number of trials (x50).]

Simulation 2 – Grid World with an Obstacle
Environment properties:
– Size: 5x5
– Step cost: -1
– Award: +2
– Obstacle: -3
– The start point and the prize are at fixed cells (see figure)

Simulation 2 Results
[Figures: the learned policy over the 5x5 grid and the average reward vs. number of trials (x50).]
– Note: the policy keeps changing because the epsilon-greedy policy continues to explore

LEAP vs. Q-learning
[Figure: average reward vs. number of trials (x50) for LEAP with multi-partitioning and for regular Q-learning.]
A sketch of the tabular Q-learning baseline used in this comparison is given below.
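For reference, this is a minimal C++ sketch of such a "regular Q-learning" baseline: a single tabular Q-table with an epsilon-greedy policy (starting from epsilon = 0.9, as in the modified setup) and the standard update from the Q-learning slide. The class name, learning rate, and discount factor are illustrative assumptions, not the values used in the project.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical sketch of a tabular Q-learning agent with epsilon-greedy
// exploration; it illustrates the baseline of the comparison, not the project's code.
class QLearner {
public:
    QLearner(int numStates, int numActions,
             double alpha = 0.1, double gamma = 0.95, double epsilon = 0.9)
        : numActions_(numActions), alpha_(alpha), gamma_(gamma), epsilon_(epsilon),
          q_(numStates, std::vector<double>(numActions, 0.0)) {}

    // Epsilon-greedy action selection: explore with probability epsilon.
    int selectAction(int s) const {
        if (static_cast<double>(std::rand()) / RAND_MAX < epsilon_)
            return std::rand() % numActions_;   // random exploratory action
        return argmax(s);                       // greedy action
    }

    // Standard tabular update: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)].
    // For terminal transitions the bootstrap term is dropped (Q = 0 at the end of a game).
    void update(int s, int a, double r, int sNext, bool terminal) {
        double target = r;
        if (!terminal) target += gamma_ * q_[sNext][argmax(sNext)];
        q_[s][a] += alpha_ * (target - q_[s][a]);
    }

    void decayEpsilon(double factor) { epsilon_ *= factor; }  // anneal exploration

private:
    int argmax(int s) const {
        int best = 0;
        for (int a = 1; a < numActions_; ++a)
            if (q_[s][a] > q_[s][best]) best = a;
        return best;
    }

    int numActions_;
    double alpha_, gamma_, epsilon_;
    std::vector<std::vector<double>> q_;
};
```

LEAP replaces this single full Q-table with smaller per-LE tables over macro states, combined with the inverse-variance weights shown in the Action Selection slide.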
Conclusions
– Memory reduction
– Complexity of implementation
– Deviation from the optimal policy

Questions?