LEAP Algorithm – Learning Entity Adaptive Partitioning
Reinforcement Learning with Adaptive Partitioning
Tsufit Cohen
Eyal Radiano
Supervised by: Andrey Bernstein
Agenda
– Intro
– Q-learning
– LEAP
– Algorithm
– Simulation
– LEAP vs Q-Learning
– Conclusions
Intro
Reinforcement Learning
– Learn an optimal policy by trial and error
– Reward for "good" steps
– Performance improves with experience
[Figure: the agent interacting with its environment]
Q-learning
Definitions:
– Q(s,a): the expected discounted return of taking action a in state s and acting greedily afterwards
Key specifications:
– Table representation: one Q(s,a) entry per state–action pair
– Update rule: $Q(s,a) \leftarrow Q(s,a) + \alpha\big(R(s,a) + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)\big)$
– Exploration policy: epsilon-greedy
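A minimal C++ sketch of the tabular Q-learning update and the epsilon-greedy policy summarized above. The class name, the Step struct and the integer state/action indexing are illustrative assumptions, not the project's actual code.

#include <vector>
#include <random>
#include <algorithm>

// Outcome of one environment transition (hypothetical interface).
struct Step { int nextState; double reward; bool terminal; };

// Minimal tabular Q-learner with epsilon-greedy exploration.
class QLearner {
public:
    QLearner(int nStates, int nActions, double alpha, double gamma, double epsilon)
        : Q(nStates, std::vector<double>(nActions, 0.0)),
          alpha(alpha), gamma(gamma), epsilon(epsilon), rng(std::random_device{}()) {}

    // Epsilon-greedy action selection over the Q table.
    int selectAction(int s) {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (coin(rng) < epsilon) {
            std::uniform_int_distribution<int> u(0, (int)Q[s].size() - 1);
            return u(rng);                                   // explore
        }
        return (int)(std::max_element(Q[s].begin(), Q[s].end()) - Q[s].begin()); // exploit
    }

    // One-step update: Q(s,a) += alpha * (R + gamma * max_a' Q(s',a') - Q(s,a)).
    void update(int s, int a, const Step& t) {
        double target = t.reward;
        if (!t.terminal)                                     // boundary condition: Q = 0 at end of game
            target += gamma * *std::max_element(Q[t.nextState].begin(), Q[t.nextState].end());
        Q[s][a] += alpha * (target - Q[s][a]);
    }

private:
    std::vector<std::vector<double>> Q;                      // table representation
    double alpha, gamma, epsilon;
    std::mt19937 rng;
};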
LEAP – Learning Entity (LE) Adaptive Partitioning
Key specifications:
– Macro states
– Multi-partitioning (each partition is called an LE)
– Pruning and joining
Algorithm
– Action Selection
– Incoherence Criterion
– JLE Generation
– Update
– Pruning Mechanism
Action Selection
$$Q(s,a) = \sum_{i \in L(s)} \omega_i \, Q_i(s,a), \qquad \omega_i = \frac{1/\hat{\sigma}_i(s,a)^2}{\sum_{j \in L(s)} 1/\hat{\sigma}_j(s,a)^2}$$
Action Selection (cont.)
$$T(s,a,s') = R(s,a) + \gamma \max_{a' \in A} Q(s',a')$$
Incoherence Criterion
JLE Generation
Update
$$\hat{\sigma}(s,a) \leftarrow \hat{\sigma}(s,a) + \eta\big(R(s,a) + \gamma \max_{a' \in A} Q(s',a') - \hat{\sigma}(s,a)\big)$$
For every active LE $i \in L(s)$:
$$Q_i(s,a) \leftarrow Q_i(s,a) + \alpha_i\big(R(s,a) + \gamma \max_{a' \in A} Q(s',a') - Q_i(s,a)\big)$$
$$v_i(s,a) \leftarrow v_i(s,a) + 1$$
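A sketch of the per-LE update above, applied to every active LE after one transition. The step size alpha_i = 1/v_i and the recursion used here for the spread estimate sigma_i are assumptions made for illustration.

#include <vector>
#include <cmath>

// Per-LE statistics stored for one (s,a) entry (illustrative layout).
struct LEEntry { double q; double sigma; int visits; };

// Apply the update to every active LE, given the reward r and the bootstrap
// value maxNextQ = max over a' of the combined estimate Q(s',a').
void updateActiveLEs(std::vector<LEEntry>& active, double r, double maxNextQ,
                     double gamma, double eta) {
    double target = r + gamma * maxNextQ;                        // TD target T(s,a,s')
    for (LEEntry& e : active) {
        e.visits += 1;                                           // v_i(s,a) <- v_i(s,a) + 1
        double alpha = 1.0 / e.visits;                           // assumed step size alpha_i = 1/v_i
        e.q += alpha * (target - e.q);                           // Q_i moves toward the TD target
        e.sigma += eta * (std::fabs(target - e.q) - e.sigma);    // assumed running spread estimate
    }
}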
Pruning Mechanism
Changes and Add-ons to the Algorithm
– Changed the order of the pruning and update steps
– The epsilon-greedy policy starts from epsilon = 0.9
– Boundary condition: Q = 0 at the end of a game (terminal states)
Implementation
Key operations:
– Finding the active LE list for a given state
– Finding a macro state within an LE
– Adding/removing a JLE and/or a macro state
Data Structures
LE (base class; Basic LE and JLE inherit from it):
– CList<macrostate> Macro_list
– Int* ID_arr_
– Int order
Basic LE:
– CList<JLE>* Sons_lists_arr
JLE
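A rough C++ sketch of this hierarchy; std::list and std::vector stand in for the project's CList and raw arrays, and the Macrostate contents are placeholders.

#include <list>
#include <vector>

struct Macrostate { /* aggregated states plus Q / sigma / visit statistics */ };

// Base learning entity: its macro-state list, the IDs of the basic LEs it
// spans, and its order (number of joined dimensions).
class LE {
public:
    std::list<Macrostate> Macro_list;
    std::vector<int> ID_arr_;
    int order = 1;                        // 1 for a basic LE, >1 for a JLE
    virtual ~LE() = default;
};

// Joint LE: a refinement generated when the active LEs become incoherent.
class JLE : public LE {};

// Basic LE: one per state dimension; also owns the lists of JLEs derived
// from it, grouped by order (the order-1 slot stays empty).
class BasicLE : public LE {
public:
    std::vector<std::list<JLE>> Sons_lists_arr;
};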
General Data Structure Implementation
[Figure: the Basic LE array holds Basic LE 1, Basic LE 2 and Basic LE 3. Magnifying Basic LE 1 shows its macro list, ID array and order, plus its sons-list array: a pointer to the JLE list of order 1 (empty), a pointer to the JLE list of order 2, and a pointer to the JLE list of order 3.]
3D Grid World Implementation Example
[Figure: the Basic LE array holds Basic LE X, Basic LE Y and Basic LE Z. X's sons-list array holds JLE XY and JLE XZ in the order-2 slot and JLE XYZ in the order-3 slot; Y's holds JLE YZ in the order-2 slot; Z's sons lists are empty.]
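An illustrative sketch, with simplified stand-in types, of how the 3D example above could be laid out: each basic LE keeps its derived JLEs grouped by order, and each JLE is stored under only one of its basic LEs (the lowest-ID dimension, as in the figure).

#include <string>
#include <vector>

struct JLERef { std::string dims; };                  // e.g. "XY", "XYZ"

struct BasicLERef {
    std::string dim;                                  // "X", "Y" or "Z"
    std::vector<std::vector<JLERef>> sonsByOrder;     // slot k holds JLEs of order k+1
};

// Populate the structures exactly as drawn in the figure.
std::vector<BasicLERef> buildExample() {
    BasicLERef x, y, z;
    x.dim = "X"; x.sonsByOrder = { {}, { {"XY"}, {"XZ"} }, { {"XYZ"} } };  // order-1 slot empty
    y.dim = "Y"; y.sonsByOrder = { {}, { {"YZ"} }, {} };
    z.dim = "Z"; z.sonsByOrder = { {}, {}, {} };
    return {x, y, z};
}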
Simulation 1 – 2D Grid World
Environment properties:
– Size: 20x20
– Step cost: -1
– Award: +2
– Available moves: Up, Down, Left, Right
– Wall bumping: no movement
– Award taking: starts a new episode
– Basic LEs: X, Y
[Figure: the grid with the start point and the prize, showing the partition by x and the partition by y]
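A minimal sketch of the grid-world dynamics listed above. The coordinate convention (y growing downward) and the choice that the +2 award replaces the -1 step cost on the prize cell are assumptions.

// One transition outcome in the grid world.
struct Outcome { int x, y; double reward; bool episodeEnds; };

enum Move { UP, DOWN, LEFT, RIGHT };

// Apply one move on a 20x20 grid: walls block movement, every step costs -1,
// and reaching the prize yields +2 and ends the episode.
Outcome step(int x, int y, Move m, int prizeX, int prizeY) {
    int nx = x + (m == RIGHT) - (m == LEFT);
    int ny = y + (m == DOWN) - (m == UP);              // y grows downward (assumption)
    if (nx < 0 || nx >= 20 || ny < 0 || ny >= 20) {    // wall bumping: no movement
        nx = x;
        ny = y;
    }
    bool prize = (nx == prizeX && ny == prizeY);
    double reward = prize ? 2.0 : -1.0;                // award +2 / step cost -1
    return {nx, ny, reward, prize};                    // award taking starts a new episode
}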
Simulation 1 Results - Policy
[Figure: the learned greedy policy over the 20x20 grid, one move (UP/DOWN/LEFT/RIGHT) per cell, mostly DOWN and RIGHT, leading from the start cell toward the prize]
Results – Average Reward and Refined Macrostate Count
[Plot: average reward vs. number of trials (x50)]
[Plot: refined macrostate count vs. number of trials (x50)]
Simulation 2 – Grid World with an Obstacle
Environment properties:
– Size: 5x5
– Step cost: -1
– Award: +2
– Obstacle: -3
[Figure: the 5x5 grid with the start point and the prize]
Simulation 2 Results
[Figure: the learned greedy policy on the 5x5 grid (one move per cell) alongside a plot of average reward vs. number of trials (x50)]
Note: the displayed policy keeps changing slightly because of the epsilon-greedy exploration.
LEAP vs Q-Learning
[Plot: average reward vs. number of trials (x50), comparing LEAP with multi-partitioning against regular Q-learning]
Conclusions
– Memory reduction
– Complexity of implementation
– Deviation from the optimal policy
Questions?