Neural Dynamics and Reinforcement Learning
Presented by: Matthew Luciw
DFT Summer School, 2013
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA)
IDSIA – Lugano, Switzerland
www.idsia.ch

Our Lab's Director: Juergen Schmidhuber
Cognitive robotics, robot learning, universal search and learning algorithms, Kolmogorov complexity, algorithmic probability, Speed Prior, minimal description length, generalization and data compression, recurrent neural networks, financial forecasting with low-complexity nets, independent component analysis, low-complexity codes, reinforcement learning in partially observable environments, adaptive subgoal generation, multiagent learning, artificial evolution, probabilistic program evolution, automatic music composition, metalearning, self-modifying policies, Gödel machines, low-complexity art, theories of interestingness and beauty

Motivation
• How do we learn sequences of behavior, to achieve goals, in the DFT framework?
• How are these sequences learned from delayed rewards?
• How do sequences of these behaviors emerge as an agent autonomously explores its environment?

Reinforcement Environment
[Figure: agent-environment loop. The AGENT receives state / situation s_t and reward r_t and emits action a_t; the ENVIRONMENT returns s_{t+1} and r_{t+1}.]

Reinforcement Learning Basics
• Agent in situation s_t chooses action a_t
• Outcome: new situation s_{t+1}
• Agent perceives situation s_{t+1} and reward r_{t+1}
• Policy: law of how the agent acts
• Reinforcement learning is both improving the policy and selecting actions to provide the experience stream (history) s_0 a_0 s_1 r_1 a_1 s_2 r_2 a_2 …
• Goal: produce a history that maximizes the sum of r_i

Reinforcement Learning: What We Need
1. What is the learner's internal state?
   – e.g., state values, state-action values
   – Needs states and actions to be defined
2. How does the agent sense the world state?
   – Sensors? Features?
3. How are possible actions evaluated?
   – e.g., state-action values, one-step state predictor and state values
4. How are possible actions chosen?
   – policy
   – exploration method
5. How are the actions executed?
   – e.g., low-level controllers
6. How is the internal state updated?
   – e.g., value iteration, Q-learning, SARSA

Elementary Behaviors for RL (Example: Find Color)
[Figure: architecture of the "Find Color" elementary behavior (EB). Labeled components: Sensory Input; Perceptual Field Activity (axes: Pixel Column, Color Hue); Perceptual Field Output; Preshape; Motor Field; Intention Nodes (Current Intention: Green); Heading Direction; Motors; CoS node.]
EBs cover #2 (how does the agent sense the world state?) and #5 (how are the actions executed?).

Learner's Internal State – Value Function
• Value function: predicts reward; estimates the total future reward given a course of action
• State values (γ is a discount factor)
• State-action values
• Learner estimates values from experience

Elementary Behaviors can function as States for an RL system
Discretize the continuous world…

Behavior Chaining
Functionally a deterministic state transition…
Let's add multiple outcome EBs, and (possibly multiple) ways to select one of them…

Adaptive Value Nodes for Policy Learning
[Figure: CoS Nodes, Intention Nodes, Value Nodes, Perceptual Field Output.]
Greedy policy execution becomes: for the previously completed EB, select the intention of the next most valuable EB. This encodes a sequence of EBs.
This covers #1 (what is the internal state of the learner?) and #3 (how are possible actions evaluated?).
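The greedy read-out described on the value-node slide can be sketched in a few lines. This is only an illustrative table-based stand-in, not the neural-dynamic value nodes themselves: the EB names, the numeric values, and the helper names `discounted_return`, `greedy_next_eb`, and `greedy_chain` are all assumptions made for the example. The first helper shows the usual discounted sum of rewards that a value function with discount factor γ is meant to estimate.

```python
from typing import Dict, List

# values[completed_eb][next_eb] is a hypothetical state-action value:
# how valuable it is to start next_eb right after completed_eb finished.
Values = Dict[str, Dict[str, float]]


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma**k * r_k over a finite reward sequence."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


def greedy_next_eb(values: Values, completed_eb: str) -> str:
    """Select the most valuable next EB given the EB that just completed."""
    candidates = values[completed_eb]
    return max(candidates, key=candidates.get)


def greedy_chain(values: Values, start_eb: str, length: int) -> List[str]:
    """Follow the greedy choice repeatedly to read out a sequence of EBs."""
    chain = [start_eb]
    for _ in range(length - 1):
        chain.append(greedy_next_eb(values, chain[-1]))
    return chain


# Made-up values for the four color-finding EBs (illustration only).
example_values: Values = {
    "find_blue":   {"find_red": 0.9, "find_yellow": 0.1, "find_purple": 0.2},
    "find_red":    {"find_blue": 0.0, "find_yellow": 0.8, "find_purple": 0.3},
    "find_yellow": {"find_blue": 0.1, "find_red": 0.2, "find_purple": 0.7},
    "find_purple": {"find_blue": 0.5, "find_red": 0.1, "find_yellow": 0.1},
}

print(greedy_chain(example_values, "find_blue", 4))
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
```

With these made-up numbers the greedy chain visits blue, red, yellow, purple, which is the sense in which a table of values "encodes a sequence of EBs".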
Learning the Policy
• In RL, we're generally trying to learn an optimal policy
• If we know the dynamics of the environment and the reward function (together: the model), we can use dynamic programming to compute an optimal policy
• Dynamics of the environment: the probability of reaching each next state, given the current state and action
• Reward function: the expected reward for each such transition

Temporal Difference Learning
• We can learn an optimal policy without learning the model, using model-free methods
• These learn directly (on-line) from experience
• Update the estimate of V(s) after visiting state s

Our DFT TD-Learning Algorithm
DN-SARSA(λ) combines:
• a process description of DFT, to allow operation in real-time, continuous environments, with
• the RL algorithm SARSA(λ), to enable the agent to learn sequences of behaviors that lead to reward
Deals with #6 (how is the internal state updated?).

TD Algorithm DN-SARSA(λ)
[Figure: SARSA(λ) shown alongside Dynamic Neural-SARSA(λ).]

DN-SARSA(λ) Architecture
• The value opposition field is where the TD-error calculation lives
• Eligibility trace: this particular implementation uses Item and Order working memory (Sohrob)
• Transient pulse cells do state-transition signaling and hold the memory of the last state-action

Epuck in a Color Sequence Learning Task
Four EBs:
1. Find Blue
2. Find Red
3. Find Yellow
4. Find Purple
[Figure: (a) Cumulative Reward vs. Time Step, with the Explore phase marked; (b) Avg. TD-Error vs. Error Measurements, with the Explore phase marked.]

The Eligibility Trace is Important for sequence learning

~~"The Tree of Life"~~
[Figure: tree of all possible histories over the behaviors A, B, C, D, from α to Ω.]

Somewhere, A Reward…
[Figure: a history ending in C, D receives +100!]

What Caused It?
[Figure: one history ending in C, D receives +100; another history ending in C, D receives +0.]

Memory Capacity Can Matter!
[Figure: the history A, B, C, D ends in +100 and is marked REINFORCED; the history B, A, C, D ends in +0 and is marked NOPE.]
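To make the credit-assignment point concrete, here is a minimal sketch of the standard tabular SARSA(λ) update with accumulating eligibility traces (the textbook form from Sutton and Barto), applied to a hypothetical history A, B, C, D that is rewarded only at the end. It is not the DN-SARSA(λ) neural-dynamic implementation; the function names, the state and action labels, and the parameter values are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

SA = Tuple[Hashable, Hashable]  # (state, action) pair


def sarsa_lambda_step(
    q: DefaultDict[SA, float],
    trace: DefaultDict[SA, float],
    s: Hashable, a: Hashable, reward: float,
    s_next: Hashable, a_next: Hashable,
    alpha: float = 0.5, gamma: float = 0.9, lam: float = 0.8,
) -> float:
    """Apply one tabular SARSA(lambda) update in place; return the TD error."""
    # TD error: received reward plus discounted next value, minus the prediction.
    delta = reward + gamma * q[(s_next, a_next)] - q[(s, a)]
    # Accumulating trace: mark the pair just visited as eligible.
    trace[(s, a)] += 1.0
    # Credit every eligible pair, then let its eligibility decay.
    for key in list(trace):
        q[key] += alpha * delta * trace[key]
        trace[key] *= gamma * lam
    return delta


def run_episode(lam: float) -> DefaultDict[SA, float]:
    """Run one hypothetical episode A, B, C, D with reward only after D."""
    q: DefaultDict[SA, float] = defaultdict(float)
    trace: DefaultDict[SA, float] = defaultdict(float)
    # (state, action, reward): the state is the previously completed behavior.
    steps = [("start", "A", 0.0), ("A", "B", 0.0), ("B", "C", 0.0), ("C", "D", 100.0)]
    for i, (s, a, r) in enumerate(steps):
        if i + 1 < len(steps):
            s_next, a_next = steps[i + 1][0], steps[i + 1][1]
        else:
            s_next, a_next = "end", None  # terminal pair, its value stays 0
        sarsa_lambda_step(q, trace, s, a, r, s_next, a_next, lam=lam)
    return q


with_trace = run_episode(lam=0.8)
without_trace = run_episode(lam=0.0)
print(with_trace[("start", "A")], without_trace[("start", "A")])
```

With λ = 0 only the last pair (C, D) is updated by the +100, so the early choices that led there get no credit in that episode; with λ > 0 the earlier pairs are updated too, by an amount that shrinks with their distance from the reward. That shrinking is also why very long sequences are hard to learn this way.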
Grid World Analogy
[Figure: grid world illustration.]

The Eligibility Trace is Essential for our system…
But the length of the sequences it can learn is limited…
Note: if a sequence is very long, you couldn't learn it either

One Last Thing: Policy Iteration
• More than TD value updates are needed to achieve an optimal policy
• TD value updates constitute policy evaluation: prediction of return for some policy
• But we will only learn the values of the policy through which the agent is sampling the state-action space
• Policy improvement: change the policy to increase the prediction of return
• Need to interleave policy evaluation and policy improvement to get an optimal policy
• Epsilon-greedy: more random exploration early, (hopefully) mostly exploitation later

Plots
Should have used epsilon-greedy!!!
[Figure panels: Cumulative Reward vs. Time Step [S * 32], with Explore and Exploit phases marked; Avg. TD-Error vs. Error Measurements, with Explore and Exploit phases marked; Time When Correct Sequence Learned (Time Step vs. Run #); Sequence Finding Difficulty (Run 6): Cumulative Reward vs. Time Step [S * 32].]

Dynamics of Behavioral Transitions
[Figure]

Simulated Environment: Exploration
Video – See Demo Material

Video – See Demo Material
Sequence Learned Transferred to… THE REAL WORLD

Different Agent+Environment
[Figure]

Possible Transitions
Reward if A -> B -> C -> D -> E
[Figure: transition diagram with nodes Start, Search (A), Approach (B), Grab (C), Transp. (D), Drop (E), and FAIL.]

Goal Sequence
Video – See Demo Material
Video – See Demo Material

Learning the Sequence (Now with Epsilon-Greedy!)
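The epsilon-greedy idea from the policy iteration slide (more random exploration early, mostly exploitation later) can be sketched as follows. The linear annealing schedule, its default parameters, and the names `epsilon_at` and `epsilon_greedy` are assumptions for illustration; the slides do not say which schedule was used in the experiments.

```python
import random
from typing import Dict, Optional


def epsilon_at(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
               decay_steps: int = 1000) -> float:
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(values: Dict[str, float], epsilon: float,
                   rng: Optional[random.Random] = None) -> str:
    """With probability epsilon pick a random action, else the most valuable one."""
    rng = rng if rng is not None else random.Random()
    if rng.random() < epsilon:
        return rng.choice(list(values))  # explore
    return max(values, key=values.get)   # exploit


# Hypothetical values for the next EB after some completed EB.
next_eb_values = {"find_red": 0.9, "find_yellow": 0.1, "find_purple": 0.2}
rng = random.Random(0)
for step in (0, 500, 1000):
    eps = epsilon_at(step)
    print(step, round(eps, 3), epsilon_greedy(next_eb_values, eps, rng))
```

Early on nearly every choice is random, so the agent keeps sampling the state-action space; as epsilon shrinks, the greedy choice dominates, which interleaves policy evaluation with policy improvement without a hard switch between an explore phase and an exploit phase.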
Exploration Mishaps
Video – See Demo Material
Video – See Demo Material
Video – See Demo Material

Nao Experiment
Boris Duran, Gauss Lee, Robert Lowe
Motivation; Dynamic Field Theory; Behavioral Organization in DFT; SARSA / DN-SARSA; Conclusion
Video – See Demo Material
Video – See Demo Material

Conclusions
Reinforcement Learning can enable Neural Dynamics models to autonomously learn rewarding behavioral sequences
There are some limitations of the current method

References
Kazerounian*, S., Luciw*, M., Richter, M., Sandamirskaya, Y. (2013). Autonomous Reinforcement of Behavioral Sequences in Neural Dynamics. Proceedings of the International Joint Conference on Neural Networks (IJCNN).
Duran, B., Lee, G., Lowe, R. (2013). Learning a DFT-Based Sequence with Reinforcement Learning: A NAO Implementation. PALADYN Journal of Behavioral Robotics.
Sandamirskaya, Y., Richter, M., Schöner, G. (2011). A Neural-Dynamic Architecture for Behavioral Organization of an Embodied Agent. Proceedings of the International Conference on Development and Learning (ICDL).
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction.
Material from: http://www.inf.ed.ac.uk/teaching/courses/rl/slides/