Goal-Directed Feature Learning
Cornelius Weber and Jochen Triesch
Frankfurt Institute for Advanced Studies (FIAS)
IJCNN, Atlanta, 17th June 2009

For taking action, we need only the relevant features.
[Figure: unsupervised learning in cortex provides the state space for an actor trained by reinforcement learning in the basal ganglia (after Doya, 1999)]

Reinforcement learning
[Figure: grid-world agent deciding "go up? go right? go down? go left?"]

Reinforcement learning
[Figure: one-layer network; the input s feeds through weights to the action a]
v(s,a): value of a state-action pair (coded in the weights)

Minimizing the value estimation error
Repeatedly running to the goal: in state s the agent performs the best action a (with some randomness), yielding s' and a'.
  Δv(s,a) ≈ 0.9 v(s',a') - v(s,a)   (moving target value)
  Δv(s,a) ≈ 1 - v(s,a)              (fixed at the goal)
--> values and action choices converge
(a SARSA code sketch follows after the slides)

Reinforcement learning
[Figure: actor over the input (state space); with a simple input the choice is clear ("go right!"), with a complex input it is not ("go right? go left?") -- a single layer can't handle this]

Complex input scenario
Bars are controlled by the actions 'up', 'down', 'left', 'right'; reward is given if the horizontal bar is at a specific position.
[Figure: sensory input, action, reward]
--> need another layer (or layers) to pre-process the complex data
(a toy bars world is sketched after the slides)

Network definition
[Figure: I input --> W weight matrix (feature detection) --> s state (position of the relevant bar) --> Q weight matrix, encodes v (action selection) --> a action]
  s = softmax(W I)
  P(a = 1) = softmax(Q s)
  v = aᵀ Q s
(code sketch after the slides)

Network training
Minimize the error with respect to the current target:
  E = (0.9 v(s',a') - v(s,a))² = δ²
  ΔQ ∝ dE/dQ ≈ δ a sᵀ                 reinforcement learning
  ΔW ∝ dE/dW ≈ δ (Qᵀa ⊙ s) Iᵀ + ε      δ-modulated unsupervised learning

Network training
Minimize the error with respect to the target Vπ.
Identities used: [derivation not shown]
Note: non-negativity constraint on the weights; SARSA with a winner-take-all (WTA) input layer.

Learning the 'short bars' data
[Figure: data, learned feature weights W, RL action weights Q; action and reward signals]
Short bars in a 12x12 input; average number of steps to the goal: 11.

Learning the 'long bars' data
[Figure: data/input, learned feature weights W, RL action weights Q, reward; 2 actions not shown]
Conditions compared: WTA with non-negative weights; SoftMax with no weight constraints; SoftMax with non-negative weights.

Models' background
- gradient descent methods generalize RL to several layers: Sutton & Barto, Reinforcement Learning (1998); Tesauro (1992; 1995)
- reward-modulated Hebbian learning: Triesch, Neural Comput 19, 885-909 (2007); Roelfsema & van Ooyen, Neural Comput 17, 2176-2214 (2005); Franz & Triesch, ICDL (2007)
- reward-modulated activity leads to input selection: Nakahara, Neural Comput 14, 819-44 (2002)
- reward-modulated STDP: Izhikevich, Cereb Cortex 17, 2443-52 (2007); Florian, Neural Comput 19, 1468-1502 (2007); Farries & Fairhall, J Neurophysiol 98, 3648-65 (2007); ...
- RL models learn a partitioning of the input space: e.g. McCallum, PhD thesis, Rochester, NY, USA (1996)

Discussion
- two-layer SARSA RL performs gradient descent on the value estimation error
- the approximation with winner-take-all leads to a local rule with δ-feedback
- only action-relevant features are learned
- non-negative coding aids feature extraction
- a link between unsupervised and reinforcement learning
- a demonstration with more realistic data is still needed

Sponsors
Bernstein Focus Neurotechnology, BMBF grant 01GQ0840
EU project 231722 "IM-CLeVeR", call FP7-ICT-2007-3
Frankfurt Institute for Advanced Studies, FIAS

Thank you ...
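
Code sketch: the SARSA value update
A minimal sketch of the value update on the "minimizing the value estimation error" slide, for the simple one-layer case with a tabular (one-hot) state code. The grid size, learning rate and the ε-greedy action choice are illustrative assumptions; only the form of δ comes from the slides.

```python
import numpy as np

gamma = 0.9        # discount used on the slides (0.9 v(s',a'))
eta = 0.1          # learning rate (assumed)
epsilon = 0.1      # "best action, with some randomness" (assumed epsilon-greedy)
n_states, n_actions = 16, 4   # toy grid world (assumed)

rng = np.random.default_rng(0)
v = np.zeros((n_states, n_actions))   # v(s,a), "coded in the weights"

def choose_action(s):
    """Pick the best action for state s, with a little randomness."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(v[s]))

def sarsa_update(s, a, reward, s_next, a_next, at_goal):
    """delta = 0.9 v(s',a') - v(s,a) while running; delta = 1 - v(s,a) at the goal."""
    target = reward if at_goal else reward + gamma * v[s_next, a_next]
    delta = target - v[s, a]
    v[s, a] += eta * delta        # move v(s,a) towards the (moving) target
    return delta
```

With reward 1 at the goal and 0 elsewhere, repeated runs to the goal make the values, and with them the action choices, converge.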
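
Code sketch: two-layer network and δ-modulated learning
The network definition (s = softmax(W I), P(a) = softmax(Q s), v = aᵀ Q s) and the training rules translate into a few lines of linear algebra. Layer sizes, learning rates and the exact reading of the hidden-layer gradient (here δ (Qᵀa ⊙ s) Iᵀ for the slide's "δ Q s I") are assumptions; the non-negativity constraint on the weights is from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_act = 144, 12, 4   # e.g. a 12x12 input, 12 features, 4 actions (assumed)

# Non-negative weights, as noted on the training slide (initial scale assumed).
W = np.abs(rng.normal(0.0, 0.1, (n_hid, n_in)))    # feature-detection weights
Q = np.abs(rng.normal(0.0, 0.1, (n_act, n_hid)))   # action/value weights, encode v

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(I):
    """s = softmax(W I), P(a) = softmax(Q s), v = a' Q s."""
    s = softmax(W @ I)                          # state layer (feature detection)
    p = softmax(Q @ s)                          # action probabilities
    a = np.eye(n_act)[rng.choice(n_act, p=p)]   # sample one action, one-hot coded
    v = a @ Q @ s                               # value of the chosen state-action pair
    return s, a, v

def train_step(I, s, a, delta, eta_q=0.1, eta_w=0.01):
    """delta-modulated weight updates; eta_q and eta_w are assumed learning rates."""
    feedback = Q.T @ a                          # feedback from the chosen action's row of Q
    Q[:] = np.maximum(Q + eta_q * delta * np.outer(a, s), 0.0)            # dQ ~ delta a s'
    W[:] = np.maximum(W + eta_w * delta * np.outer(feedback * s, I), 0.0) # dW ~ delta (Q'a) s I'
```

During an episode δ = 0.9 v(s',a') - v(s,a); at the goal δ = 1 - v(s,a). The top layer is plain SARSA, while the bottom layer receives only the scalar δ and the feedback through Q, which is what makes it a δ-modulated unsupervised (Hebbian) rule.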
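
Code sketch: a toy bars world
The slides specify only that the bars are moved by the four actions and that reward is given when the horizontal bar reaches a specific position in a 12x12 image. Everything else below (wrap-around moves, full-length bars, one distractor vertical bar, which bar each action moves) is an assumption made to keep the sketch runnable.

```python
import numpy as np

class BarsWorld:
    """Toy stand-in for the 'bars' data: a size x size image with one horizontal
    bar (task-relevant) and one vertical bar (distractor).  'up'/'down' move the
    horizontal bar, 'left'/'right' the vertical one; reward 1 when the horizontal
    bar sits on the target row.  Details beyond the slides are assumptions."""

    ACTIONS = ('up', 'down', 'left', 'right')

    def __init__(self, size=12, target_row=0, rng=None):
        self.size = size
        self.target_row = target_row
        self.rng = rng or np.random.default_rng()
        self.reset()

    def reset(self):
        self.h_row = int(self.rng.integers(self.size))   # row of the horizontal bar
        self.v_col = int(self.rng.integers(self.size))   # column of the vertical bar
        return self.observe()

    def observe(self):
        img = np.zeros((self.size, self.size))
        img[self.h_row, :] = 1.0      # horizontal bar
        img[:, self.v_col] = 1.0      # vertical bar
        return img.ravel()            # flattened sensory input I

    def step(self, action):
        if action == 'up':
            self.h_row = (self.h_row - 1) % self.size
        elif action == 'down':
            self.h_row = (self.h_row + 1) % self.size
        elif action == 'left':
            self.v_col = (self.v_col - 1) % self.size
        elif action == 'right':
            self.v_col = (self.v_col + 1) % self.size
        reward = 1.0 if self.h_row == self.target_row else 0.0
        return self.observe(), reward, reward > 0.0   # next input I, reward, episode done
```

Only the row of the horizontal bar matters for the reward, so a network trained on this input should, as on the results slides, develop features for the positions of the relevant bar.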