Neural Dynamics and Reinforcement Learning

Presented By:
Matthew Luciw
DFT SUMMER SCHOOL, 2013
Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA)
IDSIA – Lugano, Switzerland
www.idsia.ch
Our Lab’s Director: Juergen Schmidhuber
Research topics: cognitive robotics, robot learning, universal search and learning algorithms, Kolmogorov complexity, algorithmic
probability, Speed Prior, minimal description length, generalization and data compression, recurrent neural networks,
financial forecasting with low-complexity nets, independent component analysis, low-complexity codes,
reinforcement learning in partially observable environments, adaptive subgoal generation, multiagent learning,
artificial evolution, probabilistic program evolution, automatic music composition, metalearning, self-modifying
policies, Gödel machines, low-complexity art, theories of interestingness and beauty
Motivation
• How do we learn sequences of behavior, to
achieve goals, in the DFT framework?
• How are these sequences learned from delayed
rewards?
• How do sequences of these behaviors emerge
as an agent autonomously explores its
environment?
Reinforcement Environment
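[Diagram: the agent-environment loop. The AGENT receives state / situation s_t and reward r_t and emits action a_t; the ENVIRONMENT returns the next state s_{t+1} and reward r_{t+1}.]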
Reinforcement Learning Basics
• Agent in situation s_t chooses action a_t
• Outcome: new situation s_{t+1}
• Agent perceives situation s_{t+1} and reward r_{t+1}
• Policy: law of how the agent acts
• Reinforcement learning is both improving
the policy and selecting actions to provide
experience stream (history)
s_0 a_0 s_1 r_1 a_1 s_2 r_2 a_2 …
• Goal: produce history to maximize sum of r_i
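The interaction loop above can be sketched in a few lines of Python. This is only an illustration: `TwoStateEnv`, `random_policy`, and `run_episode` are hypothetical names, and the two-state environment and its rewards are made up for the example.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment: taking action 1 in state 0 pays reward 1."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Returns the new situation s_{t+1} and the reward r_{t+1}.
        if self.state == 0 and action == 1:
            self.state, reward = 1, 1.0
        else:
            self.state, reward = 0, 0.0
        return self.state, reward


def random_policy(state, rng):
    """Policy: the law of how the agent acts (here: uniformly random)."""
    return rng.choice([0, 1])


def run_episode(env, policy, steps, seed=0):
    """Generate a history of (s_t, a_t, s_{t+1}, r_{t+1}) and the summed reward."""
    rng = random.Random(seed)
    history, total = [], 0.0
    s = env.state
    for _ in range(steps):
        a = policy(s, rng)          # agent in situation s_t chooses a_t
        s_next, r = env.step(a)     # outcome: s_{t+1} and r_{t+1}
        history.append((s, a, s_next, r))
        total += r
        s = s_next
    return history, total
```

Calling `run_episode(TwoStateEnv(), random_policy, 10)` returns the experience stream and the quantity the learner is trying to maximize.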
Reinforcement Learning: What We Need
1. What is the learner’s internal state?
– e.g., state values, state-action values
– Needs states and actions to be defined
2. How does the agent sense the world state?
– Sensors? Features?
3. How are possible actions evaluated?
– e.g., state-action values, one-step state predictor and state values
4. How are possible actions chosen?
– policy
– exploration method
5. How are the actions executed?
– e.g., low-level controllers
6. How is the internal state updated?
– e.g., value iteration, Q-learning, SARSA
Elementary Behaviors for RL (Example: Find Color)
[Figure: the Find Color elementary behavior. Labeled components: Sensory Input; Perceptual Field Activity and Perceptual Field Output (axes: Color Hue, Pixel Column); Preshape; Intention Nodes (Current Intention: Green); Motor Field (Heading Direction); Motors; CoS node.]
Elementary behaviors (EBs) cover #2 (how does the agent sense the
world state?) and #5 (how are the actions executed?)
Learner’s Internal State – Value Function
• Value function – predicts reward, estimates total
future reward given a course of action
State values
(γ is a discount factor)
State-action values
Learner estimates values from experience
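For reference, the standard definitions of the two value functions in the notation of Sutton & Barto (1998) are given below. That these are the exact forms shown on the slide is an assumption; π denotes the policy being followed.

```latex
\begin{align*}
V^{\pi}(s)   &= \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s\right] \\
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a\right]
\end{align*}
```

With 0 ≤ γ < 1, rewards further in the future count for less, and the sum stays finite for bounded rewards.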
Elementary Behaviors can function as
States for an RL system
Discretize the continuous world…
Behavior Chaining
Functionally a deterministic state transition…
Let’s add multiple outcome EBs, and (possibly multiple) ways to select
one of them…
Adaptive Value Nodes for Policy Learning
[Figure: architecture with CoS Nodes, Intention Nodes, Value Nodes, and the Perceptual Field Output.]
Greedy policy execution becomes: for the previously
completed EB, select the intention of the next most
valuable EB … this encodes a sequence of EBs
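Greedy execution as described above can be sketched with a plain value table. This is a tabular stand-in rather than the neural-dynamic value nodes themselves; the function names and the numbers in `values` are hypothetical.

```python
def greedy_next_eb(values, completed_eb):
    """Index of the most valuable next EB, given the EB that just completed."""
    row = values[completed_eb]
    return max(range(len(row)), key=lambda j: row[j])


def greedy_sequence(values, start_eb, length):
    """Follow the greedy choice repeatedly; the result is the encoded EB sequence."""
    seq = [start_eb]
    for _ in range(length - 1):
        seq.append(greedy_next_eb(values, seq[-1]))
    return seq


# Hypothetical learned values: row = completed EB, column = candidate next EB.
values = [
    [0.0, 0.9, 0.1, 0.2],
    [0.1, 0.0, 0.8, 0.3],
    [0.2, 0.1, 0.0, 0.7],
    [0.6, 0.1, 0.2, 0.0],
]
print(greedy_sequence(values, 0, 4))  # [0, 1, 2, 3]
```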
This covers #1 – what is the internal state of the learner?
and #3 – how are possible actions evaluated?
Learning the Policy
• In RL, we’re generally trying to learn an
optimal policy π*
• If we know the dynamics of the environment
and the reward function (together: the
model), we can use dynamic programming
to get π*
• Dynamics of environment: P(s' | s, a), the probability of reaching s' after taking a in s
• Reward function: R(s, a), the expected reward for taking a in s
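When the model is known, value iteration is one dynamic programming method for obtaining an optimal policy. The sketch below assumes a small tabular MDP; the three-state chain and the data layout (`P[s][a]` as a list of `(probability, next_state)` pairs, `R[s][a]` as expected reward) are choices made for this example only.

```python
def backup(P, R, V, s, a, gamma):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])


def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Return optimal state values and a greedy policy for a known model."""
    n = len(P)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            best = max(backup(P, R, V, s, a, gamma) for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = [
        max(range(len(P[s])), key=lambda a: backup(P, R, V, s, a, gamma))
        for s in range(n)
    ]
    return V, policy


# Hypothetical chain: action 0 = stay, action 1 = advance; state 2 is absorbing.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 2)]],
    [[(1.0, 2)], [(1.0, 2)]],
]
R = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
V, policy = value_iteration(P, R)
```

Here the only reward is for advancing out of state 1, so the greedy policy advances in states 0 and 1.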
Temporal Difference Learning
• We can learn an optimal policy without learning the
model, using model-free methods
• These learn directly (on-line) from experience
• Update the estimate of V(s) after visiting state s
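The simplest such update is TD(0) from Sutton & Barto (1998): after moving from s to s' and receiving r, shift V(s) toward r + γV(s'). A minimal sketch, with `alpha` (step size) and the example numbers chosen arbitrarily:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to V in place and return the TD error."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error


# Example: state 0 is currently valued 0.0, its successor state 1 is valued 1.0.
V = [0.0, 1.0]
err = td0_update(V, s=0, r=0.5, s_next=1)
```

No transition probabilities or reward function appear anywhere in the update, which is what makes the method model-free.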
Our DFT TD-Learning Algorithm
DN-SARSA(λ) combines:
• a process description of DFT, to allow operation in
real-time, continuous environments,
• with the RL algorithm SARSA(λ), to enable the agent to
learn sequences of behaviors that lead to reward
Deals with #6 – how is the internal state
updated?
TD Algorithm DN-SARSA(λ)
SARSA(λ)
Dynamic Neural-SARSA(λ)
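For orientation, textbook tabular SARSA(λ) with accumulating eligibility traces (Sutton & Barto, 1998) can be sketched as below. This is the discrete-time algorithm, not the neural-dynamic variant; the `step` interface, the four-state chain, and all parameter values are assumptions made for the example.

```python
import random


def sarsa_lambda(step, n_states, n_actions, start, episodes=200,
                 alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1, seed=0):
    """Tabular SARSA(lambda); step(s, a) must return (next_state, reward, done)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def choose(s):
        # Epsilon-greedy action choice with random tie-breaking.
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        best = max(Q[s])
        return rng.choice([a for a in range(n_actions) if Q[s][a] == best])

    for _ in range(episodes):
        E = [[0.0] * n_actions for _ in range(n_states)]  # eligibility traces
        s, done = start, False
        a = choose(s)
        while not done:
            s2, r, done = step(s, a)
            a2 = choose(s2)
            target = r if done else r + gamma * Q[s2][a2]
            delta = target - Q[s][a]          # TD error
            E[s][a] += 1.0                    # accumulating trace
            for i in range(n_states):
                for j in range(n_actions):
                    Q[i][j] += alpha * delta * E[i][j]
                    E[i][j] *= gamma * lam    # decay every trace
            s, a = s2, a2
    return Q


def chain_step(s, a):
    """Hypothetical chain 0..3: action 1 moves right, 0 moves left; state 3 pays +1."""
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3


Q = sarsa_lambda(chain_step, n_states=4, n_actions=2, start=0)
```

The trace table `E` is what lets a single reward at the end of the chain update the state-action pairs that led up to it.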
DN-SARSA(λ) Architecture
• The value opposition field is where the TD-error calculation lives
• Eligibility trace: this particular implementation uses Item and Order working memory (Sohrob)
• Transient pulse cells do state-transition signaling and memory of the last state-action
E-puck in a Color Sequence Learning Task
Four EBs: 1. Find Blue, 2. Find Red, 3. Find Yellow, 4. Find Purple
[Figure: (a) Cumulative Reward vs. Time Step; (b) Avg. TD-Error vs. Error Measurements. The Explore phase is marked in both panels.]
The Eligibility Trace is Important for sequence learning
~~“The Tree of Life”~~
[Figure: a tree of all possible histories over the behaviors A, B, C, and D, running from α to Ω.]
Somewhere, A Reward…
[Figure: a history ending in C, D receives a reward of +100.]
What Caused It?
[Figure: one history ending in C, D receives +100; another history ending in C, D receives +0.]
Memory Capacity Can Matter!
[Figure, built up over three slides: the history A, B, C, D receives +100 and is marked REINFORCED; a second history over A, B, C, D receives +0 and is marked NOPE.]
Grid World Analogy
The Eligibility Trace is Essential for our system…
But the length of sequences it can learn is limited…
Note: if a sequence is very long, you couldn’t learn it either
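One way to see the length limit: in standard SARSA(λ), the trace of a state-action pair visited k steps before the reward has decayed to (γλ)^k, so early steps of a long sequence receive almost no credit. The sketch below uses that textbook decay; the `threshold` cutoff is a hypothetical stand-in for whatever resolution the real system has, and in the DFT implementation the limit comes from the working-memory dynamics rather than this formula.

```python
def trace_credit(steps_back, gamma=0.9, lam=0.8):
    """Eligibility remaining for a pair visited `steps_back` steps before the reward."""
    return (gamma * lam) ** steps_back


def max_learnable_length(threshold, gamma=0.9, lam=0.8):
    """Number of most recent steps whose trace is still at or above `threshold`."""
    k = 0
    while trace_credit(k, gamma, lam) >= threshold:
        k += 1
    return k


for k in range(0, 10, 3):
    print(k, round(trace_credit(k), 4))
```

Raising λ or γ lengthens the reach of the trace, at the cost of spreading credit over more (possibly irrelevant) earlier steps.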
One Last Thing: Policy Iteration
• More than TD value updates are needed to
achieve an optimal policy
• TD value updates constitute policy evaluation – prediction
of return for some policy
• But we will only learn the values of the policy
through which the agent is sampling the state-action space
• Policy improvement – change the policy to increase the
predicted return
• Need to interleave policy evaluation and policy
improvement to get an optimal policy
• Epsilon-greedy: more random exploration early,
(hopefully) mostly exploitation later
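A minimal sketch of epsilon-greedy selection with a decaying epsilon follows. The linear schedule and its parameters (`eps_start`, `eps_end`, `decay_steps`) are assumptions for illustration; the slides do not say which schedule was used.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon explore at random, otherwise exploit the best value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


rng = random.Random(0)
q_row = [0.1, 0.9, 0.3]
early = epsilon_greedy(q_row, epsilon_at(0), rng)      # almost always random
late = epsilon_greedy(q_row, epsilon_at(5000), rng)    # almost always action 1
```

Acting greedily on the current values is the policy improvement step; the residual randomness keeps the agent sampling parts of the state-action space its current policy would never visit.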
Plots
Should have used ε-greedy!!!
[Figure: plots with Explore and Exploit phases marked, panels labeled (b) through (e). Recoverable panel titles and axes: Cumulative Reward vs. Time Step [S * 32]; Avg. TD-Error vs. Error Measurements; Time When Correct Sequence Learned (Time Step, x 10^4) vs. Run #; Sequence Finding Difficulty (Run 6) vs. Time Step [S * 32].]
Dynamics of Behavioral Transitions
Simulated Environment: Exploration
[Video: see demo material]
Sequence Learned Transferred to…
THE REAL WORLD
Different Agent+Environment
Possible Transitions
Reward if A → B → C → D → E
[Figure: transition graph over Start, Search (A), Approach (B), Grab (C), Transp. (D), Drop (E), and FAIL.]
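The reward rule above amounts to checking whether the most recently completed behaviors match the goal sequence. A minimal sketch, assuming "Transp." abbreviates "Transport" and using a hypothetical reward magnitude of 1.0:

```python
# Goal sequence A -> B -> C -> D -> E ("Transport" is an assumed expansion of "Transp.").
GOAL = ["Search", "Approach", "Grab", "Transport", "Drop"]


def sequence_reward(completed, goal=GOAL, reward=1.0):
    """Return `reward` if the latest completed behaviors match `goal`, else 0."""
    n = len(goal)
    return reward if completed[-n:] == goal else 0.0


print(sequence_reward(["Search", "Approach", "Grab", "Transport", "Drop"]))  # 1.0
print(sequence_reward(["Search", "Grab", "Transport", "Drop"]))              # 0.0
```

Because the reward arrives only after the fifth behavior, the learner has to assign credit back across all four earlier transitions.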
Goal Sequence
[Video: see demo material]
Learning the Sequence (Now with Epsilon-Greedy!)
Exploration Mishaps
[Video: see demo material]
Nao Experiment
Boris Duran, Gauss Lee, Robert Lowe
Motivation
Dynamic Field Theory
Behavioral Organization in DFT
SARSA / DN-SARSA
Conclusion
[Video: see demo material]
Conclusions
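• Reinforcement learning can enable neural
dynamics models to autonomously learn
rewarding behavioral sequences
• The current method has limitations: the length of
sequences it can learn is bounded by the eligibility
trace, and learning depends on the exploration method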
References
Kazerounian*, S., Luciw*, M., Richter, M., Sandamirskaya, Y. (2013). Autonomous Reinforcement of Behavioral Sequences in Neural Dynamics. Proceedings of the International Joint Conference on Neural Networks (IJCNN).
Duran, B., Lee, G., Lowe, R. (2013). Learning a DFT-Based Sequence with Reinforcement Learning: A NAO Implementation. PALADYN Journal of Behavioral Robotics.
Sandamirskaya, Y., Richter, M., Schöner, G. (2011). A Neural-Dynamic Architecture for Behavioral Organization of an Embodied Agent. Proceedings of the International Conference on Development and Learning (ICDL).
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction.
Material from: http://www.inf.ed.ac.uk/teaching/courses/rl/slides/