Machine Learning, 28, 77–104 (1997)
© 1997 Kluwer Academic Publishers. Manufactured in The Netherlands.

CHILD: A First Step Towards Continual Learning

MARK B. RING    [email protected]
Adaptive Systems Research Group, GMD — German National Research Center for Information Technology, Schloß Birlinghoven, D-53754 Sankt Augustin, Germany

Editors: Lorien Pratt and Sebastian Thrun

Abstract. Continual learning is the constant development of increasingly complex behaviors; the process of building more complicated skills on top of those already developed. A continual-learning agent should therefore learn incrementally and hierarchically. This paper describes CHILD, an agent capable of Continual, Hierarchical, Incremental Learning and Development. CHILD can quickly solve complicated non-Markovian reinforcement-learning tasks and can then transfer its skills to similar but even more complicated tasks, learning these faster still.

Keywords: Continual learning, transfer, reinforcement learning, sequence learning, hierarchical neural networks

1. Introduction

Supervised learning speeds the process of designing sophisticated software by allowing developers to focus on what the system must do rather than on how the system should work. This is done by specifying the system's desired outputs for various representative inputs; but even this simpler, abstracted task can be onerous and time consuming. Reinforcement learning takes the abstraction a step further: the developer of a reinforcement-learning system need not even specify what the system must do, but only recognize when the system does the right thing. The developer's remaining challenge is to design an agent sophisticated enough that it can learn its task using reinforcement-learning methods. Continual learning is the next step in this progression, eliminating the task of building a sophisticated agent by allowing the agent to be trained with its necessary foundation instead. The agent can then be kept and trained on even more complex tasks.

Continual learning is a continual process, where learning occurs over time, and time is monotonic: a continual-learning agent's experiences occur sequentially, and what it learns at one time step while solving one task, it can use later, perhaps to solve a completely different task. A continual learner:

• Is an autonomous agent. It senses, takes actions, and responds to the rewards in its environment.
• Can learn context-dependent tasks, where previous senses can affect future actions.
• Learns behaviors and skills while solving its tasks.
• Learns incrementally. There is no fixed training set; learning occurs at every time step; and the skills the agent learns now can be used later.
• Learns hierarchically. Skills it learns now can be built upon and modified later.
• Is a black box. The internals of the agent need not be understood or manipulated. All of the agent's behaviors are developed through training, not through direct manipulation. Its only interface to the world is through its senses, actions, and rewards.
• Has no ultimate, final task. What the agent learns now may or may not be useful later, depending on what tasks come next.

Humans are continual learners. During the course of our lives, we continually grasp ever more complicated concepts and exhibit ever more intricate behaviors.
Learning to play the piano, for example, involves many stages of learning, each built on top of the previous one: learning finger coordination and tone recognition, then learning to play individual notes, then simple rhythms, then simple melodies, then simple harmonies, then learning to understand clefs and written notes, and so on without end. There is always more to learn; one can always add to one's skills.

Transfer in supervised learning involves reusing the features developed for one classification and prediction task as a bias for learning related tasks (Baxter, 1995; Caruana, 1993; Pratt, 1993; Sharkey & Sharkey, 1993; Silver & Mercer, 1995; Thrun, 1996; Yu & Simmons, 1990). Transfer in reinforcement learning involves reusing the information gained while learning to achieve one goal to learn to achieve other goals more easily (Dayan & Hinton, 1993; Kaelbling, 1993a; Kaelbling, 1993b; Ring, 1996; Singh, 1992; Thrun & Schwartz, 1995). Continual learning, on the other hand, is the transfer of skills developed so far towards the development of new skills of greater complexity.

Constructing an algorithm capable of continual learning spans many different dimensions. Transfer across classification tasks is one of these. An agent's ability to maintain information from previous time steps is another. The agent's environment may be sophisticated in many ways, and the continual-learning agent must eventually be capable of building up to these complexities. CHILD, capable of Continual, Hierarchical, Incremental Learning and Development, is the agent presented here, but it is not a perfect continual learner; rather, it is a first step in the development of a continual-learning agent. CHILD can only learn in a highly restricted subset of possible environments, but it exhibits all of the properties described above.

1.1. An Example

Figure 1 presents a series of mazes ranging from simple to more complex. In each maze, the agent can occupy any of the numbered positions (states). The number in each position uniquely represents the configuration of the walls surrounding that position. (Since there are four walls, each of which can be present or absent, there are sixteen possible wall configurations.) By perceiving this number, the agent can detect the walls immediately surrounding it. The agent can move north, south, east, or west. (It may not enter the barrier positions in black, nor may it move beyond the borders of the maze.) The agent's task is to learn to move from any position to the goal (marked by the food dish), where it receives positive reinforcement.

[Figure 1 appears here; the nine maze diagrams are omitted in this text version.]

Figure 1. These nine mazes form a progression of reinforcement-learning environments from simple to more complex. In a given maze, the agent may occupy any of the numbered positions (states) at each time step and takes one of four actions: North, East, West, or South. Attempting to move into a wall or barrier (black regions) will leave the agent in its current state. The digits in the maze represent the sensory inputs as described in the text. The goal (a reward of 1.0) is denoted by the food dish in the lower left corner. Each maze is similar to but larger than the last, and each introduces new state ambiguities — more states that share the same sensory input.

The series of mazes in Figure 1 is intended to demonstrate the notion of continual learning in a very simple way through a series of very simple environments — though even such seemingly simple tasks can be very challenging for current reinforcement-learning agents to master, due to the large amount of ambiguous state information (i.e., the large number of different states having identical labels). A continual-learning agent trained in one maze should be able to transfer some of what it has learned to the task of learning the next maze. Each maze preserves enough structure from the previous one that the skills learned for one might still be helpful in the next; thus, the basic layout (a room, a doorway, and two hallways) is preserved for all mazes. But each maze also makes new demands on the agent, requiring it to extend its previous skills to adapt to the new environment, and each is therefore larger than the previous one and introduces more state ambiguities.

In the first maze, the agent can learn to choose an action in each state based exclusively upon its immediate sensory input such that it will reach the goal no matter where it begins. To do this, it must learn to move south in all positions labeled "6" and to move west in all positions labeled "12". In the second maze, however, there is an ambiguity due to the two occurrences of the input "9". When the agent is transferred to this maze, it should continue to move east when it senses "9" in the upper position, but it should move west upon sensing "9" in the lower position. It must distinguish the two different positions though they are labeled the same. It can make this distinction by using the preceding context: only in the lower position has it just seen input "12". When placed in the third maze, the agent's behavior would need to be extended again, so that it would move west upon seeing "9" whenever its previous input was either a "12" or a "9".

This progression of complexity continues in the remaining mazes in that each has more states than the last and introduces new ambiguities. The continual-learning agent should be able to transfer to the next maze in the sequence skills it developed while learning the previous maze, coping with every new situation in a similar way: old responses should be modified to incorporate new exceptions by making use of contextual information that disambiguates the situations.
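The paper never spells out the bit encoding of the wall labels, but the examples in the text (label 12 marking walls to the east and south, label 6 a corridor with walls to the east and west) are consistent with treating the label as a four-bit mask. The sketch below shows one way such an environment could be written; the class design and the exact bit assignment (N=1, W=2, E=4, S=8) are assumptions for illustration, not taken from the paper.

```python
# Sketch of a maze environment in the spirit of Figure 1; not the paper's code.
# The wall-label encoding (N=1, W=2, E=4, S=8) is inferred from the examples in the
# text (e.g., label 12 = walls east and south); everything else is assumed.
import random

MOVES = {"N": (-1, 0), "E": (0, 1), "W": (0, -1), "S": (1, 0)}
BITS = {"N": 1, "W": 2, "E": 4, "S": 8}

class Maze:
    def __init__(self, open_cells, goal):
        self.open = set(open_cells)          # positions the agent may occupy
        self.goal = goal                     # position of the food dish
        self.pos = None

    def label(self, pos):
        """Sensory input: one integer encoding the walls around `pos`."""
        lab = 0
        for action, (dr, dc) in MOVES.items():
            if (pos[0] + dr, pos[1] + dc) not in self.open:
                lab |= BITS[action]          # wall, barrier, or border in that direction
        return lab

    def reset(self):
        self.pos = random.choice(sorted(self.open))   # trials start from a random state
        return self.label(self.pos)

    def step(self, action):
        dr, dc = MOVES[action]
        nxt = (self.pos[0] + dr, self.pos[1] + dc)
        if nxt in self.open:                 # blocked moves leave the state unchanged
            self.pos = nxt
        done = self.pos == self.goal
        return self.label(self.pos), (1.0 if done else 0.0), done
```

A maze from Figure 1 would then be specified simply by its set of open cells and its goal cell.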
2. Temporal Transition Hierarchies

CHILD combines Q-learning (Watkins, 1989) with the Temporal Transition Hierarchies (TTH) learning algorithm. The former is the most widely used reinforcement-learning method and will not be described here. The Temporal Transition Hierarchies algorithm is a constructive neural-network-based learning system that focuses on the most important but least predictable events and creates new units that learn to predict these events.

As learning begins, the untrained TTH network assumes that event probabilities are constants: "the probability that event A will lead to event B is x_AB." (For example, the probability that pressing button C2 on the vending machine will result in the appearance of a box of doughnuts is 0.8). The network's initial task is to learn these probabilities. But this is only the coarsest description of a set of events, and knowing the context (e.g., whether the correct amount of change was deposited in the slot) may provide more precise probabilities. The purpose of the new units introduced into the network is to search the preceding time steps for contextual information that will allow an event to be predicted more accurately.

In reinforcement-learning tasks, TTH's can be used to find the broader context in which an action should be taken when, in a more limited context, the action appears to be good sometimes but bad other times. In Maze 2 of Figure 1 for example, when the agent senses "9", it cannot determine whether "move east" is the best action, so a new unit is constructed to discover the context in which the agent's response to input 9 should be to move east. The units created by the TTH network resemble the "synthetic items" of Drescher (1991), created to determine the causes of an event (the cause is determined through training, after which the item represents that cause). However, the simple structure of the TTH network allows it to be trained with a straightforward gradient descent procedure, which is derived next.

2.1. Structure, Dynamics, and Learning

Transition hierarchies are implemented as a constructive, higher-order neural network. I will first describe the structure, dynamics, and learning rule for a given network with a fixed number of units, and then describe in Section 2.3 how new units are added.

Each unit (u^i) in the network is either: a sensory unit (s^i ∈ S); an action unit (a^i ∈ A); or a high-level unit (l^i_{xy} ∈ L) that dynamically modifies w_xy, the connection from sensory unit y to action unit x. (It is the high-level units which are added dynamically by the network.)¹ The action and high-level units can be referred to collectively as non-input units (n^i ∈ N). The next several sections make use of the following definitions:

\[ S \stackrel{\mathrm{def}}{=} \{u^i \mid 0 \le i < ns\} \tag{1} \]
\[ A \stackrel{\mathrm{def}}{=} \{u^i \mid ns \le i < ns + na\} \tag{2} \]
\[ L \stackrel{\mathrm{def}}{=} \{u^i \mid ns + na \le i < nu\} \tag{3} \]
\[ N \stackrel{\mathrm{def}}{=} \{u^i \mid ns \le i < nu\} \tag{4} \]
\[ u^i(t) \stackrel{\mathrm{def}}{=} \text{the value of the } i\text{th unit at time } t \tag{5} \]
\[ T^i(t) \stackrel{\mathrm{def}}{=} \text{the target value for } a^i(t), \tag{6} \]

where ns is the number of sensory units, na is the number of action units, and nu is the total number of units. When it is important to indicate that a unit is a sensory unit, it will be denoted as s^i; similarly, action units will be denoted as a^i, high-level units will be denoted as l^i, and non-input units will be denoted when appropriate as n^i.

The activation of the units is very much like that of a simple, single-layer (no hidden units) network with a linear activation function,

\[ n^i(t) = \sum_j \hat{w}_{ij}(t)\, s^j(t). \tag{7} \]

The activation of the ith action or higher-level unit is simply the sum of the sensory inputs multiplied by their respective weights, ŵ_ij, that lead into n^i. (Since these are higher-order connections, they are capable of non-linear classifications.) The higher-order weights are produced as follows:

\[ \hat{w}_{ij}(t) = \begin{cases} w_{ij} + l_{ij}(t-1) & \text{if unit } l_{ij} \text{ for weight } w_{ij} \text{ exists} \\ w_{ij} & \text{otherwise.} \end{cases} \tag{8} \]

If no l unit exists for the connection from i to j, then w_ij is used as the weight. If there is such a unit, however, its previous activation value is added to w_ij to compute the higher-order weight ŵ_ij.²
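Equations 7 and 8 translate almost directly into code. The following is a minimal, hypothetical rendering (dictionary-based data structures of my own choosing, not the paper's implementation): `l_prev` holds the high-level units' activations from the previous time step, and `unit_for` records which high-level unit, if any, modifies each connection.

```python
# Hedged sketch of the TTH forward pass (Equations 7 and 8); not the author's code.
def propagate(senses, weights, unit_for, l_prev, non_input_units):
    """senses: {j: s^j(t)}; weights: {(i, j): w_ij}; unit_for: {(i, j): index of l_ij};
    l_prev: {unit index: activation at t-1}; returns {i: n^i(t)}."""
    activation = {}
    for i in non_input_units:                      # action units and high-level units
        total = 0.0
        for j, s_j in senses.items():
            w_hat = weights.get((i, j), 0.0)       # w_ij
            if (i, j) in unit_for:                 # higher-order weight (Eq. 8)
                w_hat += l_prev.get(unit_for[(i, j)], 0.0)
            total += w_hat * s_j                   # Eq. 7
        activation[i] = total
    return activation
```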
2.2. Deriving the Learning Rule

A learning rule can be derived that performs gradient descent in the error space and is much simpler than gradient descent in recurrent networks; so, though the derivation that follows is a bit lengthy, the result at the end (Equation 29) is a simple learning rule, easy to understand as well as to implement. As is common with gradient-descent learning techniques, the network weights are modified so as to reduce the total sum-squared error:

\[ E = \sum_t E(t), \qquad E(t) = \frac{1}{2} \sum_i \bigl(T^i(t) - a^i(t)\bigr)^2. \tag{9} \]

In order to allow incremental learning, it is also common to approximate strict gradient descent by modifying the weights at every time step. At each time step, the weights are changed in the direction opposite their contribution to the error, E(t):

\[ \Delta w_{ij}(t) \stackrel{\mathrm{def}}{=} \sum_{\tau=0}^{t} \frac{\partial E(t)}{\partial w_{ij}(\tau)} \tag{10} \]
\[ w_{ij}(t+1) = w_{ij}(t) - \eta\, \Delta w_{ij}(t), \tag{11} \]

where η is the learning rate. The weights, w_ij, are time indexed in Equation 10 for notational convenience only and are assumed for the purposes of the derivation to remain the same at all time steps (as is done with all on-line neural-network training methods).

It can be seen from Equations 7 and 8 that it takes a specific number of time steps for a weight to have an effect on the network's action units. Connections to the action units affect the action units at the current time step. Connections to the first level of high-level units — units that modify connections to the action units — affect the action units after one time step. The "higher" in the hierarchy a unit is, the longer it takes for it (and therefore its incoming connections) to affect the action units. With this in mind, Equation 10 can be rewritten as:

\[ \Delta w_{ij}(t) \stackrel{\mathrm{def}}{=} \sum_{\tau=0}^{t} \frac{\partial E(t)}{\partial w_{ij}(\tau)} = \frac{\partial E(t)}{\partial w_{ij}(t - \tau^i)}, \tag{12} \]

where τ^i is the constant value for each action or high-level unit n^i that specifies how many time steps it takes for a change in unit i's activation to affect the network's output. Since this value is directly related to how "high" in the hierarchy unit i is, τ is very easy to compute:

\[ \tau^i = \begin{cases} 0 & \text{if } n^i \text{ is an action unit, } a^i \\ 1 + \tau^x & \text{if } n^i \text{ is a higher-level unit, } l^i_{xy}. \end{cases} \tag{13} \]

The derivation of the gradient proceeds as follows. Define δ^i to be the partial derivative of the error with respect to non-input unit n^i:

\[ \delta^i(t) \stackrel{\mathrm{def}}{=} \frac{\partial E(t)}{\partial n^i(t - \tau^i)}. \tag{14} \]

What must be computed is the partial derivative of the error with respect to each weight in the network:

\[ \frac{\partial E(t)}{\partial w_{ij}(t - \tau^i)} = \frac{\partial E(t)}{\partial n^i(t - \tau^i)}\, \frac{\partial n^i(t - \tau^i)}{\partial w_{ij}(t - \tau^i)} = \delta^i(t)\, \frac{\partial n^i(t - \tau^i)}{\partial w_{ij}(t - \tau^i)}. \tag{15} \]

From Equations 7 and 8, the second factor can be derived simply as:

\[ \frac{\partial n^i(t - \tau^i)}{\partial w_{ij}(t - \tau^i)} = s^j(t - \tau^i)\, \frac{\partial \hat{w}_{ij}(t - \tau^i)}{\partial w_{ij}(t - \tau^i)} \tag{16} \]
\[ = s^j(t - \tau^i) \times \begin{cases} 1 + \dfrac{\partial l_{ij}(t - \tau^i - 1)}{\partial w_{ij}(t - \tau^i)} & \text{if } l_{ij} \text{ exists} \\ 1 & \text{otherwise.} \end{cases} \tag{17} \]

Because w_ij(t − τ^i) does not contribute to the value of l_ij(t − τ^i − 1),

\[ \frac{\partial n^i(t - \tau^i)}{\partial w_{ij}(t - \tau^i)} = s^j(t - \tau^i). \tag{18} \]

Therefore, combining 12, 15 and 18,

\[ \Delta w_{ij}(t) = \frac{\partial E(t)}{\partial w_{ij}(t - \tau^i)} = \delta^i(t)\, s^j(t - \tau^i). \tag{19} \]

Now, δ^i(t) can be derived as follows. First, there are two cases depending on whether node i is an action unit or a high-level unit:

\[ \delta^i(t) = \begin{cases} \dfrac{\partial E(t)}{\partial a^i(t - \tau^i)} & \text{if } n^i \text{ is an action unit, } a^i \\[1.5ex] \dfrac{\partial E(t)}{\partial l^i_{xy}(t - \tau^i)} & \text{if } n^i \text{ is a higher-level unit, } l^i_{xy}. \end{cases} \tag{20} \]

The first case is simply the immediate derivative of the error with respect to the action units from Equation 9.
Since τ^i is zero when n^i is an action unit,

\[ \frac{\partial E(t)}{\partial a^i(t - \tau^i)} = \frac{\partial E(t)}{\partial a^i(t)} = a^i(t) - T^i(t). \tag{21} \]

The second case of Equation 20 is somewhat more complicated. From Equation 8,

\[ \frac{\partial E(t)}{\partial l^i_{xy}(t - \tau^i)} = \frac{\partial E(t)}{\partial \hat{w}_{xy}(t - \tau^i + 1)}\, \frac{\partial \hat{w}_{xy}(t - \tau^i + 1)}{\partial l^i_{xy}(t - \tau^i)} \tag{22} \]
\[ = \frac{\partial E(t)}{\partial \hat{w}_{xy}(t - \tau^i + 1)}. \tag{23} \]

Using Equation 7, this can be factored as:

\[ \frac{\partial E(t)}{\partial \hat{w}_{xy}(t - \tau^i + 1)} = \frac{\partial E(t)}{\partial n^x(t - \tau^i + 1)}\, \frac{\partial n^x(t - \tau^i + 1)}{\partial \hat{w}_{xy}(t - \tau^i + 1)}. \tag{24} \]

Because n^i is a high-level unit, τ^i = τ^x + 1 (Equation 13). Therefore,

\[ \frac{\partial E(t)}{\partial l^i_{xy}(t - \tau^i)} = \frac{\partial E(t)}{\partial n^x(t - \tau^x)}\, \frac{\partial n^x(t - \tau^x)}{\partial \hat{w}_{xy}(t - \tau^x)} \tag{25} \]
\[ = \delta^x(t)\, s^y(t - \tau^x), \tag{26} \]

from Equations 7 and 14. Finally, from Equation 19,

\[ \frac{\partial E(t)}{\partial l^i_{xy}(t - \tau^i)} = \Delta w_{xy}(t). \tag{27} \]

Returning now to Equations 19 and 20, and substituting in Equations 21 and 27: the change ∆w_ij(t) to the weight w_ij from sensory unit s^j to action or high-level unit n^i can be written as:

\[ \Delta w_{ij}(t) = \delta^i(t)\, s^j(t - \tau^i) \tag{28} \]
\[ = s^j(t - \tau^i) \times \begin{cases} a^i(t) - T^i(t) & \text{if } n^i \text{ is an action unit, } a^i \\ \Delta w_{xy}(t) & \text{if } n^i \text{ is a higher-level unit, } l^i_{xy}. \end{cases} \tag{29} \]

Equation 29 is a particularly nice result, since it means that the only values needed to make a weight change at any time step are (1) the error computable at that time step, (2) the input recorded from a specific previous time step, and (3) other weight changes already calculated. This third point is not necessarily obvious; however, each high-level unit is higher in the hierarchy than the units on either side of the weight it affects: (i > x) ∧ (i > y), for all l^i_{xy}. This means that the weights may be modified in a simple bottom-up fashion. Error values are first computed for the action units, then weight changes are calculated from the bottom of the hierarchy to the top so that the ∆w_xy(t) in Equation 29 will already have been computed before ∆w_ij(t) is computed, for all high-level units l^i_{xy} and all sensory units j.

The intuition behind the learning rule is that each high-level unit, l^i_{xy}(t), learns to utilize the context at time step t to correct its connection's error, ∆w_xy(t + 1), at time step t + 1. If the information is available, then the higher-order unit uses it to reduce the error. If the needed information is not available at the previous time step, then new units may be built to look for the information at still earlier time steps.

While testing the algorithm, it became apparent that changing the weights at the bottom of a large hierarchy could have an explosive effect: the weights would oscillate to ever larger values. This indicated that a much smaller learning rate was needed for these weights. Two learning rates were therefore introduced: the normal learning rate, η, for weights without higher-level units (i.e., w_xy where no unit l^i_{xy} exists); and a fraction, η_L, of η for those weights whose values are affected by higher-level units (i.e., w_xy where a unit l^i_{xy} does exist).
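To make the bottom-up order of Equation 29 concrete, here is a hedged Python sketch of the weight-change computation. It is not the paper's implementation; the data structures are assumed (a per-unit delay `tau` from Equation 13, a buffer `past_senses` of earlier sense vectors, and the `unit_for` map used in the forward-pass sketch above).

```python
# Hedged sketch of the weight-change rule (Equations 20, 21, 27-29); not the author's code.
# past_senses[k] is the sense vector k steps back (index 0 = current time step).
def weight_changes(actions, targets, non_input_units, sensory_units,
                   unit_for, tau, past_senses):
    delta = {i: actions[i] - targets[i] for i in actions}   # Eq. 21: action-unit errors
    dw = {}
    for i in sorted(non_input_units):                       # bottom-up: i in ascending order
        for j in sensory_units:
            s_past = past_senses[tau[i]].get(j, 0.0)        # s^j(t - tau^i)
            dw[(i, j)] = delta[i] * s_past                  # Eqs. 28-29
            if (i, j) in unit_for:                          # delta of l_ij is this weight change
                delta[unit_for[(i, j)]] = dw[(i, j)]        # Eqs. 20 and 27
    return dw                                               # applied as w_ij -= eta * dw (Eq. 11)
```

Because each high-level unit's index is larger than both endpoints of the connection it modifies, its delta is always filled in before the loop reaches its own incoming weights — exactly the bottom-up property discussed above.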
2.3. Adding New Units

So as to allow transfer to new tasks, the net must be able to create new higher-level units whenever they might be useful. Whenever a transition varies, that is, when the connection weight should be different in different circumstances, a new unit is required to dynamically set the weight to its correct value. A unit is added whenever a weight is pulled strongly in opposite directions (i.e., when learning forces the weight to increase and to decrease at the same time). The unit is created to determine the contexts in which the weight is pulled in each direction.

In order to decide when to add a new unit, two long-term averages are maintained for every connection. The first of these, ∆w̄_ij, is the average change made to the weight. The second, ∆w̃_ij, is the average magnitude of the change. When the average change is small but the average magnitude is large, this indicates that the learning algorithm is changing the weight by large amounts but about equally in the positive as in the negative direction; i.e., the connection is being simultaneously forced to increase and to decrease by a large amount. Two parameters, Θ and ε, are chosen, and when

\[ \Delta\tilde{w}_{ij} > \Theta\,|\Delta\bar{w}_{ij}| + \varepsilon, \tag{30} \]

a new unit is constructed for w_ij.

Since new units need to be created when a connection is unreliable in certain contexts, the long-term average is only updated when changes are actually made to the weight; that is, when ∆w_ij(t) ≠ 0. The long-term averages are computed as follows:

\[ \Delta\bar{w}_{ij}(t) = \begin{cases} \Delta\bar{w}_{ij}(t-1) & \text{if } \Delta w_{ij}(t) = 0 \\ \sigma\,\Delta w_{ij}(t) + (1-\sigma)\,\Delta\bar{w}_{ij}(t-1) & \text{otherwise,} \end{cases} \tag{31} \]

and the long-term average magnitude of change is computed as:

\[ \Delta\tilde{w}_{ij}(t) = \begin{cases} \Delta\tilde{w}_{ij}(t-1) & \text{if } \Delta w_{ij}(t) = 0 \\ \sigma\,|\Delta w_{ij}(t)| + (1-\sigma)\,\Delta\tilde{w}_{ij}(t-1) & \text{otherwise,} \end{cases} \tag{32} \]

where the parameter σ specifies the duration of the long-term average. A smaller value of σ means the averages are kept for a longer period of time and are therefore less sensitive to momentary fluctuations. A small value for σ is therefore preferable if the algorithm's learning environment is highly noisy, since this will cause fewer units to be created due strictly to environmental stochasticity. In more stable, less noisy environments, a higher value of σ may be preferable so that units will be created as soon as unpredictable situations are detected.

When a new unit is added, its incoming weights are initialized to zero. It has no output weights: its only task is to anticipate and reduce the error of the weight it modifies. In order to keep the number of new units low, whenever a unit l^n_{ij} is created, the statistics for all connections into the destination unit (u^i) are reset: ∆w̄_ij(t) ← −1.0 and ∆w̃_ij(t) ← 0.0.
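The unit-creation test of Equations 30–32 amounts to a pair of exponential moving averages per connection. A minimal sketch under assumed names is given below; the default parameter values are illustrative only, and the way a "new unit" is recorded (an index in the hypothetical `unit_for` map) is a simplification of the bookkeeping a full implementation would need.

```python
# Hedged sketch of the unit-creation statistics (Equations 30-32); not the paper's code.
def update_statistics(i, j, dw, lta, ltmad, unit_for,
                      SIGMA=0.3, THETA=0.56, EPSILON=0.11):   # illustrative values
    """lta[(i, j)]: long-term average change (Eq. 31);
    ltmad[(i, j)]: long-term average magnitude of change (Eq. 32)."""
    if dw == 0.0:                          # statistics change only when the weight changes
        return False
    lta[(i, j)] = SIGMA * dw + (1 - SIGMA) * lta.get((i, j), 0.0)
    ltmad[(i, j)] = SIGMA * abs(dw) + (1 - SIGMA) * ltmad.get((i, j), 0.0)
    if ltmad[(i, j)] > THETA * abs(lta[(i, j)]) + EPSILON:    # Eq. 30: pulled both ways
        unit_for[(i, j)] = max(unit_for.values(), default=0) + 1   # append a new unit index
        for (x, y) in list(lta):           # reset statistics for connections into unit i
            if x == i:
                lta[(x, y)], ltmad[(x, y)] = -1.0, 0.0
        return True
    return False
```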
2.4. The Algorithm

Because of the simple learning rule and method of adding new units, the learning algorithm is very straightforward. Before training begins, the network has no high-level units and all weights (from every sensory unit to every action unit) are initialized to zero. The outline of the procedure is as follows:

    For (Ever)
        1) Initialize values.
        2) Get senses.
        3) Propagate Activations.
        4) Get Targets.
        5) Calculate Weight Changes; Change Weights & Weight Statistics; Create New Units.

The second and fourth of these are trivial and depend on the task being performed. The first step is simply to make sure all unit values and all delta values are set to zero for the next forward propagation. (The values of the l units at the last time step must, however, be stored for use in step 3.)

    1) Initialize values
    Line
          /* Reset all old unit and delta values to zero. */
    1.1   For each unit, u(i)
    1.2       u(i) ← zero;
    1.3       delta(i) ← zero;

The third step is nearly the same as the forward propagation in standard feed-forward neural networks, except for the presence of higher-order units and the absence of hidden layers.

    3) Propagate Activations
    Line
          /* Calculate new output values. */
    3.1   For each Non-input unit, n(i)
    3.2       For each Sensory unit, s(j)
                  /* UnitFor(i, j) returns the input of unit l_ij at the last time step. */
                  /* Zero is returned if the unit did not exist. (See Equation 8.)       */
    3.3           l ← UnitFor(i, j);
                  /* To n^i's input, add s^j's value times the (possibly modified)       */
                  /* weight from j to i. (See Equation 7.)                               */
    3.4           n(i) ← n(i) + s(j) * (l + Weight(i, j));

The fifth step is the heart of the algorithm. Since the units are arranged as though the input, output, and higher-level units are concatenated into a single vector (i.e., k < j < i for all s^k, a^j, l^i), whenever a unit l^i_{jk} is added to the network, it is appended to the end of the vector; and therefore (j < i) ∧ (k < i). This means that when updating the weights, the δ^i's and ∆w_ij's of Equation 29 must be computed with i in ascending order, so that ∆w_xy will be computed before any ∆w_ij for unit l^i_{xy} is computed (line 5.3). If a weight change is not zero (line 5.6), it is applied to the weight (line 5.9 or 5.19). If the weight has no higher-level unit (line 5.8), the weight statistics are updated (lines 5.10 and 5.11) and checked to see whether a higher-level unit is warranted (line 5.12). If a unit is warranted for the weight leading from unit j to unit i (line 5.12), a unit is built for it (line 5.13), and the statistics are reset for all weights leading into unit i (lines 5.14–5.16). If a higher-level unit already exists (line 5.17), that unit's delta value is calculated (line 5.18) and used (at line 5.5) in a following iteration (of the loop starting at line 5.3) when its input weights are updated.

    5) Update Weights and Weight Statistics; Create New Units.
    Line
          /* Calculate δ^i for the action units, a^i. (See Equations 20 and 21.) */
    5.1   For each action unit, a(i)
    5.2       delta(i) ← a(i) - Target(i);

          /* Calculate all ∆w_ij's, ∆w̄_ij's, ∆w̃_ij's.                */
          /* For higher-order units l^i, calculate δ^i's.              */
          /* Change weights and create new units when needed.          */
    5.3   For each Non-input unit, n(i), with i in ascending order
    5.4       For each Sensory unit, s(j)
                  /* Compute weight change (Equations 28 and 29).      */
                  /* Previous(j, i) retrieves s^j(t − τ^i).             */
    5.5           delta_w(i, j) ← delta(i) * Previous(j, i);
                  /* If ∆w_ij ≠ 0, update weight and statistics. (Eqs. 31 and 32.) */
    5.6           if (delta_w(i, j) ≠ 0)
                      /* IndexOfUnitFor(i, j) returns n for l^n_ij; or -1 if l^n_ij doesn't exist. */
    5.7               n ← IndexOfUnitFor(i, j);
                      /* If l^n_ij doesn't exist: update statistics, learning rate is η. */
    5.8               if (n = -1)
                          /* Change weight w_ij. (See Equation 11.) */
    5.9                   Weight(i, j) ← Weight(i, j) - ETA * delta_w(i, j);
                          /* Update long-term average, ∆w̄_ij. (See Equation 31.) */
    5.10                  lta(i, j) ← SIGMA * delta_w(i, j) + (1 - SIGMA) * lta(i, j);
                          /* Update long-term mean absolute deviation, ∆w̃_ij. (Eq. 32.) */
    5.11                  ltmad(i, j) ← SIGMA * abs(delta_w(i, j)) + (1 - SIGMA) * ltmad(i, j);
                          /* If higher-order unit l^n_ij should be created (Equation 30) ... */
    5.12                  if (ltmad(i, j) > THETA * abs(lta(i, j)) + EPSILON)
                              /* ... create unit l^N_ij (where N is the current network size). */
    5.13                      BuildUnitFor(i, j);
                              /* Reset statistics for all incoming weights. */
    5.14                      For each Sensory unit, s(k)
    5.15                          lta(i, k) ← -1.0;
    5.16                          ltmad(i, k) ← 0.0;
                      /* If l^n_ij does exist (n ≠ −1), store δ^n (Equations 20 and 27). */
                      /* Change w_ij, learning rate = η_L * η.                            */
    5.17              else
    5.18                  delta(n) ← delta_w(i, j);
    5.19                  Weight(i, j) ← Weight(i, j) - ETA_L * ETA * delta_w(i, j);
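For readers who prefer running code to pseudocode, the For (Ever) loop can be read as the skeleton below. It reuses the hypothetical helpers sketched earlier (`propagate`, `weight_changes`, `update_statistics`) and an invented `net` dictionary; it glosses over the fact that, with Q-learning, the targets actually refer to the previous time step (Equation 33 in Section 3). It is only meant to show how the five steps connect, not to reproduce the author's implementation.

```python
# Hedged skeleton of one pass through the For (Ever) loop; all names are assumptions.
def child_step(net, senses, get_targets, eta=0.25, eta_L=0.09):
    acts = propagate(senses, net["weights"], net["unit_for"],        # step 3 (Eqs. 7-8)
                     net["l_prev"], net["non_input_units"])
    targets = get_targets(acts)                                      # step 4 (e.g., Eq. 33)
    net["past_senses"].insert(0, senses)                             # history for Previous()
    dw = weight_changes({i: acts[i] for i in net["action_units"]}, targets,
                        net["non_input_units"], net["sensory_units"],
                        net["unit_for"], net["tau"], net["past_senses"])   # step 5 (Eq. 29)
    for (i, j), change in dw.items():
        if change == 0.0:
            continue
        has_unit = (i, j) in net["unit_for"]
        lr = eta_L * eta if has_unit else eta                        # two learning rates
        net["weights"][(i, j)] = net["weights"].get((i, j), 0.0) - lr * change   # Eq. 11
        if not has_unit:                                             # Eqs. 30-32
            update_statistics(i, j, change, net["lta"], net["ltmad"], net["unit_for"])
    net["l_prev"] = {u: acts[u] for u in acts if u not in net["action_units"]}
    return acts
```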
3. Testing CHILD in Continual-Learning Environments

This section demonstrates CHILD's ability to perform continual learning in reinforcement environments. CHILD combines Temporal Transition Hierarchies with Q-learning (Watkins, 1989). Upon arriving in a state, the agent's sensory input from that state is given to a transition hierarchy network as input. The output from the network's action units are Q-values (one Q-value for each action) which represent the agent's estimate of its discounted future reward for taking that action (Watkins, 1989). At each time step an action unit, a^{C(t)}, is chosen stochastically using a Gibbs distribution formed from these values and a temperature value T. The agent then executes the action associated with the chosen action unit a^{C(t)}. The temperature is initially set to 1.0, but its value is decreased at the beginning of each trial to be 1/(1 + n∆T), where n is the number of trials so far and ∆T is a user-specified parameter. The network is updated like the networks of Lin (1992):

\[ T^i(t-1) = \begin{cases} r(t-1) + \gamma \max_k \bigl(a^k(t)\bigr) & \text{if } a^i \equiv a^{C(t-1)} \\ a^i(t-1) & \text{otherwise,} \end{cases} \tag{33} \]

where T^i(t−1) is the target as specified in Equation 29 for action unit a^i at time step t−1; r(t−1) is the reward the agent received after taking action a^{C(t−1)}; and γ is the discount factor. Only action unit a^{C(t−1)} will propagate a non-zero error to its incoming weights.

The sequence of mazes introduced in Figure 1 is used as test environments. In these environments there are 15 sensory units — one for each of the 15 possible wall configurations surrounding the agent; therefore exactly one sensory unit is on (has a value of 1.0) in each state. There are 4 action units — N (move one step north), E (east), W (west), and S (south).

CHILD was tested in two ways: learning each maze from scratch (Section 3.1), and using continual learning (Section 3.2). In both cases, learning works as follows. The agent "begins" a maze under three possible conditions: (1) it is the agent's first time through the maze, (2) the agent has just reached the goal in the previous trial, or (3) the agent has just "timed out", i.e., the agent took all of its allotted number of actions for a trial without reaching the goal. Whenever the agent begins a maze, the learning algorithm is first reset, clearing its short-term memory (i.e., resetting all unit activations and erasing the record of previous network inputs). A random state in the maze is then chosen and the agent begins from there.

3.1. Learning from Scratch

100 agents were separately created, trained, and tested in each maze (i.e., a total of 900 agents), all with different random seeds. Each agent was trained for 100 trials, up to 1000 steps for each trial. The agent was then tested for 100 trials; i.e., learning was turned off and the most highly activated action unit was always chosen. If the agent did not reach the goal on every testing trial, training was considered to have failed, and the agent was trained for 100 more trials and tested again. This process continued until testing succeeded or until the agent was trained for a total of 1000 trials.

Since CHILD is a combination of Q-learning and Temporal Transition Hierarchies, there are seven modifiable parameters: two from Q-learning and five from TTH's. The two from Q-learning are: γ — the discount factor from Equation 33, and ∆T — the temperature decrement. The five from the TTH algorithm are: σ, Θ, and ε from Equations 30, 31 and 32; η — the learning rate from Equation 11; and η_L — the fraction of η for weights with high-level units. Before training began, all seven parameters were (locally) optimized for each maze independently to minimize the number of trials and units needed. The optimization was done using a set of random seeds that were not later used during the tests reported below.
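A short sketch of the Q-learning layer just described — Gibbs/Boltzmann action selection with the temperature schedule 1/(1 + n∆T), and the target of Equation 33 — is given below. The function names and data layout are mine, and the default discount factor is illustrative, not a claim about the paper's code.

```python
# Hedged sketch of CHILD's Q-learning layer (Gibbs selection and Eq. 33); names assumed.
import math
import random

def select_action(q_values, trial, delta_T):
    """q_values: list of action-unit outputs; returns the index of the chosen action."""
    temperature = 1.0 / (1.0 + trial * delta_T)      # decreased at the start of each trial
    prefs = [math.exp(q / temperature) for q in q_values]
    total = sum(prefs)
    r, acc = random.random() * total, 0.0
    for i, p in enumerate(prefs):
        acc += p
        if acc >= r:
            return i
    return len(prefs) - 1

def q_targets(prev_q, chosen, reward, next_q, gamma=0.91):
    """Targets for the action units at time t-1 (Eq. 33): only the chosen action's
    target moves; the others are clamped to their own outputs (zero error)."""
    targets = list(prev_q)
    targets[chosen] = reward + gamma * max(next_q)
    return targets
```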
3.2. The Continual-Learning Case

To measure its continual-learning ability, CHILD was allowed in a separate set of tests to use what it had learned in one maze to help it learn the next. This is a very tricky learning problem, since, besides the added state ambiguities, the distance from most states to the goal changes as the series of mazes progresses, and the Q-values for most of the input labels therefore need to be re-learned. There were three differences from the learning-from-scratch case: (1) after learning one maze, the agent was transferred to the next maze in the series; (2) the agent was tested in the new maze before training — if testing was successful, it was moved immediately to the following maze; and (3) the parameters (which were not optimized for this approach) were the same for all mazes:

    η = 0.25       σ = 0.3
    η_L = 0.09     Θ = 0.56
    γ = 0.91       ε = 0.11
    ∆T = 2.1

T was reset to 1.0 when the agent was transferred to a new maze.

3.3. Comparisons

For both the learning-from-scratch case and the continual-learning case, the total number of steps during training was averaged over all 100 agents in each maze. These results are shown in Figure 2A. For the continual-learning approach, both the average number of steps and the average accumulated number of steps are shown. The average number of units created during training is given in Figure 2B. Figure 2C compares the average number of test steps. Since it was possible for an agent to be tested several times in a given training run before testing was successful, only the final round was used for computing the averages (i.e., the last 100 testing trials for each agent in each maze).

[Figure 2 appears here; the four graphs, "Continual Learning vs. Learning From Scratch" (A: Average Training Steps, B: Average Number of New Units, C: Average Test Steps, D: Num. States / Ave. Distance to Goal / Ambiguity), are omitted in this text version.]

Figure 2. Graph (A) compares Learning from Scratch with Continual Learning on the nine Maze tasks. The middle line shows the average cumulative number of steps used by the continual-learning algorithm in its progression from the first to the ninth maze. Graph (B) compares the number of units created. The line for the continual-learning case is, of course, cumulative. Graph (C) compares the performance of both methods during testing. The values shown do not include cases where the agent failed to learn the maze. Graph (D) shows the number of states in each maze (solid line, units shown at far left); the average distance from each state to the goal (dotted line, units at left); and the "ambiguity" of each maze, i.e., the average number of states per state label (dashed line, units at right).

There were five failures while learning from scratch — five cases in which 1000 training trials were insufficient to get testing correct. There were two failures for the continual-learning case. All failures occurred while learning the ninth maze. When a failure occurred, the values for that agent were not averaged into the results shown in the graphs.
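For concreteness, the train-then-test protocol used in these comparisons can be summarized as the loop below. The trial counts and step limits are the ones reported in Section 3.1; the agent and maze interfaces are the hypothetical ones from the earlier sketches, not the paper's code.

```python
# Hedged sketch of the training/testing protocol of Section 3.1; interfaces are assumed.
def train_and_test(agent, maze, step_limit=1000, max_training_trials=1000):
    trained = 0
    while trained < max_training_trials:
        for _ in range(100):                          # train for 100 trials ...
            run_trial(agent, maze, step_limit, learn=True)
        trained += 100
        if all(run_trial(agent, maze, step_limit, learn=False)   # ... then test for 100
               for _ in range(100)):                  # (learning off, greedy action choice)
            return trained                            # training trials needed until success
    return None                                       # failure: 1000 trials were not enough

def run_trial(agent, maze, step_limit, learn):
    agent.reset_short_term_memory()                   # clear activations and input history
    sense, reward = maze.reset(), 0.0                 # random start state
    for _ in range(step_limit):
        sense, reward, done = maze.step(agent.act(sense, reward, learn=learn))
        if done:
            return True                               # goal reached within the step limit
    return False                                      # timed out
```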
4. Analysis

Due to the large amount of state ambiguity, these tasks can be quite difficult for reinforcement-learning agents. Even though a perfect agent could solve any of these mazes with a very small amount of state information, learning the mazes is more challenging. Consider for example the fact that when the agent attempts to move into a barrier, its position does not change, and it again receives the same sensory input. It is not in any way informed that its position is unchanged. Yet it must learn to avoid running into barriers nevertheless. On the other hand, the bottom row in Maze 9 is exactly the opposite: the agent must continue to move east, though it repeatedly perceives the same input. Also, once a maze has been learned, the sequence of states that the agent will pass through on its path to the goal will be the same from any given start state, thereby allowing the agent to identify its current state using the context of recent previous states. However, while learning the task, the agent's moves are inconsistent and erratic, and information from previous steps is unreliable. What can the agent deduce, for example, in Maze 9 if its current input is 4 and its last several inputs were also 4? If it has no knowledge of its previous actions and they are also not yet predictable, it cannot even tell whether it is in the upper or lower part of the maze.³

The progression of complexity over the mazes in Figure 1 is shown with three objective measures in Figure 2D. The number of states per maze is one measure of complexity. Values range from 12 to 60 and increase over the series of mazes at an accelerating pace. The two other measures of complexity are the average distance from the goal and the average ambiguity of each maze's state labels (i.e., the average number of states sharing the same label). The latter is perhaps the best indication of a maze's complexity, since this is the part of the task that most confuses the agent: distinguishing the different states that all produce the same sensory input. Even in those cases where the agent should take the same action in two states with the same label, these states will still have different Q-values and therefore will need to be treated differently by the agent. All three measures increase monotonically and roughly reflect the contour of graphs A–C for the learning-from-scratch agents.

The performance of the learning-from-scratch agents also serves as a measure of a maze's complexity because the parameters for these agents were optimized to learn each maze as quickly as possible. Due to these optimized parameters, the learning-from-scratch agents learned the first maze faster (by creating new units faster) than the continual-learning agents did. After the first maze, however, the continual-learning agents always learned faster than the learning-from-scratch agents. In fact, after the third maze, despite the disparity in parameter optimality, even the cumulative number of steps taken by the continual-learning agent was less than the number taken by the agent learning from scratch. By the ninth maze the difference in training is drastic. The number of extra steps needed by the continual-learning agent was tiny in comparison to the number needed without continual learning. The cumulative total was about a third of that needed by the agent learning from scratch.
Furthermore, the trends shown in the graphs indicate that as the mazes get larger, as the size and amount of ambiguity increases, the difference between continual learning and learning from scratch increases drastically.

Testing also compares favorably for the continual learner: after the fourth maze, the continual-learning agent found better solutions as well as finding them faster. This is perhaps attributable to the fact that, after the first maze, the continual-learning agent is always in the process of making minor corrections to its existing Q-values, and hence to its policy. The corrections it makes are due both to the new environment and to errors still present in its Q-value estimates.

The number of units needed for the continual learner escalates quickly at first, indicating both that the parameters were not optimal and that the skills learned for each maze did not transfer perfectly but needed to be extended and modified for the following mazes. This number began to level off after a while, however, showing that the units created in the first eight tasks were mostly sufficient for learning the ninth. The fact that the learning-from-scratch agent created about the same number shows that these units were also necessary.

That CHILD should be able to transfer its skills so well is far from given. Its success at continual learning in these environments is due to its ability to identify differences between tasks and then to make individual modifications even though it has already learned firmly established rules. Moving from Maze 4 to Maze 5, for example, involves learning that the tried and true rule 4 → S (move south when the input is "4"), which was right for all of the first four mazes, is now only right within a certain context, and that in a different context a move in the opposite direction is now required. At each stage of continual learning, the agent must widen the context in which its skills were valid to include new situations in which different behaviors are indicated.

4.1. Hierarchy Construction in the Maze Environments

It is interesting to examine what CHILD learns: which new units are built and when. The following progression occurred while training one of the 100 agents tested above. For the first maze, only one unit was constructed, l^20_{W,12}. This was unit number 20, which modifies the connection from sensory unit 12 to action unit W (move west). (Units 1–15 are the sensory units; units 16–19 are the action units; and the high-level units begin at number 20.) Unit 20 resolved the ambiguity of the two maze positions labeled 12. With the unit in place, the agent learned to move east in the position labeled 10. It then could move north in position 12 when having seen a 10 in the previous step, but west when having seen a 6. Thus, the weight from s^6 to l^20_{W,12} was positive, and the weight from s^10 to l^20_{W,12} was negative. The weight from s^0 to l^20_{W,12} was also negative, reflecting the fact that, though the optimal route does not involve moving south from the position labeled 0, during training this would happen nevertheless. For the same reason, there was also a negative weight from s^12 to l^20_{W,12}. If the agent was placed in one of the two positions labeled 12 at the beginning of the trial, it would move north, since it had no way of distinguishing its state. (It would then move south if it saw a 6 or east if it saw a 0.)

The second maze contains two new ambiguities: the positions labeled 9. Two new units were created: l^21_{W,9} and l^22_{N,12}.
The first, l^21_{W,9}, was needed to disambiguate the two 9 positions. It had a strong positive weight from s^12 so that the agent would move west from a position labeled 9 if it had just seen a 12. The second, l^22_{N,12}, complemented l^20_{W,12}. It was apparently needed during training to produce more accurate Q-values when the new 9 position was introduced, and had a weak positive connection from s^9 and a weaker negative connection from s^10.

The third maze introduces another position labeled 9. This caused a strong weight to develop from s^9 to l^21_{W,9}. Only one new unit was constructed, however: l^23_{20,12}. The usefulness of this unit was minimal: it only had an effect if, in a position labeled 12, the agent moved south or east. As a consequence, the unit had a weak negative connection from s^10 and all other connections were zero.

The fourth maze introduces new ambiguities with the additional positions labeled 0 and 9 and two new labels, 1 and 8. No new units were constructed for this maze. Only the weights were modified.

The fifth maze introduces the additional ambiguities labeled 0 and 4. The label 4 definitely requires disambiguation, since the agent should choose different actions (N and S) in the two positions with this label. Since the agent can still move to the goal optimally by moving east in positions labeled 0, no unit to disambiguate this label is necessary. Five new units were created: l^24_{S,4}, l^25_{21,9}, l^26_{25,9}, l^27_{26,9}, and l^28_{E,9}. The first disambiguates the positions labeled 4. It has a positive weight from s^9 and negative weights from s^0, s^4, and s^12. The second, third, and fourth new units, l^25_{21,9}, l^26_{25,9}, and l^27_{26,9}, all serve to predict the Q-values in the states labeled 9 more accurately. The last new unit, l^28_{E,9}, also helps nail down these Q-values and that of the upper position labeled 9.

Though the above example was one of the agent's more fortunate, in which CHILD tested perfectly in the remaining mazes without further training, similar unit construction occurred in the remainder of the mazes for the agent's slower-learning instantiations.

4.2. Non-Catastrophic Forgetting

Continual learning is a process of growth. Growth implies a large degree of change and improvement of skills, but it also implies that skills are retained to a certain extent. There are some skills that we undoubtedly lose as we develop abilities that replace them (how to crawl efficiently, for example). Though it is not the case with standard neural networks that forgotten abilities are regained with less effort than they had originally demanded, this tends to be the case with CHILD. To test its ability to re-learn long-forgotten skills, CHILD was returned to the first maze after successfully learning the ninth. The average number of training steps needed for the 100 cases was about 20% of the number taken to learn the maze initially, and in two-thirds of these cases, no retraining of any kind was required. (That is, the average over the cases where additional learning was required was about 60% of the number needed the first time.) The network built on average less than one new unit. However, the average testing performance was 20% worse than when the maze was first learned.

4.3. Distributed Senses

One problem with TTH's is that they have no hidden units in the traditional sense, and their activation functions are linear.
Linear activation functions sometimes raise a red flag in the connectionist community due to their inability in traditional networks to compute non-linear mappings. However, transition hierarchies use higher-order connections, involving the multiplication of input units with each other (Equations 7 and 8). This means that the network can in fact compute non-linear mappings. Nevertheless, these non-linearities are constructed from inputs at previous time steps, not from the input at the current time step alone. This is not a problem, however, if the current input is repeated for several time steps, which can be accomplished in reinforcement environments by giving the agent a "stay" action that leaves it in the same state. The agent can then stay in the same position for several time steps until it has made the necessary non-linear discrimination of its input data — simulating the temporal process of perception more closely than traditional one-step, feed-forward computations. Since the stay action is treated the same as other actions, it also develops Q-values to reflect the advantage of staying and making a non-linear distinction over taking one of the other, perhaps unreliable actions. Again, context from preceding states can help it decide in which cases staying is worthwhile.

With the additional stay action, CHILD was able to learn the mazes using a distributed sense vector consisting of five units: bias, WN (wall north), WW (wall west), WE (wall east), and WS (wall south). The bias unit was always on (i.e., always had a value of 1.0). The other units were on only when there was a wall immediately to the corresponding direction of the agent. For example, in positions labeled 12, the bias, WE, and WS units were on; in positions labeled 7, the bias, WN, WW, and WE units were on, etc. The agent was able to learn the first maze using distributed senses, but required much higher training times than with local senses (averaging nearly 6000 steps).

It turned out, however, that the agent was still able to learn the mazes effectively even without the stay action. In all positions except those labeled 0, the effect of the stay action could be achieved simply by moving into a barrier. Furthermore, the agent could often make the necessary discriminations simply by using the context of its previous senses. Whenever it moved into a new state, information from the previous state could be used for disambiguation (just as with the locally encoded sense labels above). In the distributed case, however, previous sensory information could be used to distinguish states that were in principle unambiguous, but which were in practice difficult to discriminate.

After the agent had learned the first maze, it was transferred to the second in the same continual-learning process as described above. An interesting result was that CHILD was able to generalize far better, and in some training episodes was able to solve all the mazes after being trained on only the first two. It did this by learning a modified right-hand rule, where it would follow the border of the maze in a clockwise direction until it reached the goal; or, if it first hit a state labeled 0, it would instead move directly east. In one case it did this having created only six high-level units. In most training episodes, more direct routes to the goal were discovered; but the cost was the creation of more units (usually 15–20).
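The five-unit distributed sense vector described above can be derived directly from the same wall-configuration label used for the local encoding. The sketch below assumes the bit layout inferred earlier from the text's examples (label 12 → WE and WS on; label 7 → WN, WW, WE on); the function itself is mine, not the paper's.

```python
# Hedged sketch: local wall label -> distributed sense vector (bias, WN, WW, WE, WS).
# The bit layout (N=1, W=2, E=4, S=8) is an inference from the examples in the text.
def distributed_senses(label):
    return [
        1.0,                           # bias: always on
        1.0 if label & 1 else 0.0,     # WN: wall to the north
        1.0 if label & 2 else 0.0,     # WW: wall to the west
        1.0 if label & 4 else 0.0,     # WE: wall to the east
        1.0 if label & 8 else 0.0,     # WS: wall to the south
    ]

# e.g. distributed_senses(12) -> [1.0, 0.0, 0.0, 1.0, 1.0]  (bias, WE, WS on)
#      distributed_senses(7)  -> [1.0, 1.0, 1.0, 1.0, 0.0]  (bias, WN, WW, WE on)
```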
4.4. Limitations

As was mentioned in the introduction, CHILD is not a perfect continual learner and can only learn in a restricted subset of possible environments. Though the TTH algorithm is very fast (exhibiting a speedup of more than two orders of magnitude over a variety of recurrent neural networks on supervised-learning benchmarks (Ring, 1993)), it can only learn a subset of possible finite state automata. In particular, it can only learn Markov-k sequences: sequences in which all the information needed to make a prediction at the current time step occurred within the last k time steps if it occurred at all. Though TTH's can learn k and indefinitely increase it, they cannot keep track of a piece of information for an arbitrary period of time. Removing this limitation would require adding some kind of arbitrary-duration memory, such as high-level units with recurrent connections. Introducing the ability to learn push-down automata (as opposed to arbitrary FSA) would even more greatly relax CHILD's restrictions, but would be far more challenging.

Transition hierarchies also have a propensity for creating new units. In fact, if the parameters are set to allow it, the algorithm will continue building new units until all sources of non-determinism are removed, under the assumption that ignorance is the sole cause of all unpredictable events. When the parameters are properly set, only the needed units are created. Since there is no a priori way of knowing when an unpredictable event is due to ignorance or to true randomness, different kinds of environments will have different optimal parameter settings. Unnecessary units may also result when tasks change and previously learned skills are no longer useful. Finding a method for removing the extra units is nontrivial because of the difficulty involved in identifying which units are useless. This is more complicated than the problem of weight elimination in standard neural networks. A large hierarchy may be very vital to the agent, and the lower-level units of the hierarchy may be indispensable, but before the hierarchy is completed, the lower units may appear useless.

One problem that faced CHILD in the previous section was that the environments of Figure 1 kept changing. This resulted in constant modification of the reinforcement landscape, and the agent had to re-learn most or all of its Q-values again and again. It might be better and more realistic to use a single environment composed of multiple layers of complexity such that the agent, once it has learned some simple skills, can use them to achieve an ever increasing density of reward by continually uncovering greater environmental complexities and acquiring ever more sophisticated skills.

The mazes in Figure 1 were designed to give CHILD the opportunity to perform continual learning. An ideal continual learner would be capable of learning in any environment and transferring whatever skills were appropriate to any arbitrary new task. But, of course, it will always be possible to design a series of tasks in which skills learned in one are not in any way helpful for learning the next — for example, by rearranging the state labels randomly or maliciously. There will always be a certain reliance of the agent on its trainer to provide it with tasks appropriate to its current level of development.
This is another argument in favor of the "multiple layers" approach described in the previous paragraph, which may allow the agent smoother transitions between levels of complexity in the skills that it learns.

5. Related Work

Continual learning, CHILD, and Temporal Transition Hierarchies bring to mind a variety of related work in the areas of transfer, recurrent networks, hierarchical adaptive systems, and reinforcement-learning systems designed to deal with hidden state.

5.1. Transfer

Sharkey and Sharkey (1993) discuss transfer in its historical, psychological context in which it has a broad definition. In the connectionist literature, the term has so far most often referred to transfer across classification tasks in non-temporal domains. But, as with the other methods described in this volume, continual learning can also be seen as an aspect of task transfer in which transfer across classification tasks is one important component.

There seem to be roughly two kinds of transfer in the connectionist literature. In one case, the network is simultaneously trained on related classification tasks to improve generalization in its current task (Caruana, 1993; Yu & Simmons, 1990). In the other case, feedforward networks trained on one classification task are modified and reused on a different classification task (Baxter, 1995; Pratt, 1993; Sharkey & Sharkey, 1993; Silver & Mercer, 1995; Thrun, 1996). Sometimes the new tasks lie in contradiction to the old tasks, that is, the old task and the new are inconsistent in that a classification made in one task would necessarily be made differently in the other. The autoassociation and bit-displacement tasks explored by Sharkey and Sharkey (1993), as one example, require different outputs for the same input. In other cases there may not be any necessary inconsistency and the tasks might be seen as separate parts of a larger task. For example, the two tasks of recognizing American speech and recognizing British speech, as described by Pratt (1993), are each sub-tasks of recognizing English speech.

Continual learning and CHILD introduce into this discussion the issue of context: in certain cases, one mapping may be correct; in others, the identical mapping may be incorrect. The issue is thus one of identifying the larger context in which each is valid. Transfer across classification tasks is therefore one aspect of continual learning, where temporal context, incremental development, and hierarchical growth are some of the others. Another kind of transfer that is not confined to classification tasks is the very general "learning how to learn" approach of Schmidhuber (1994), which not only attempts transfer in reinforcement domains, but also attempts to learn how to transfer (and to learn how to learn to transfer, etc.).

5.2. Recurrent Networks

Transition hierarchies resemble recurrent neural networks such as those introduced by Elman (1993), Jordan (1986), Robinson and Fallside (1987), Pollack (1991), etc., in that they learn temporal tasks using connectionist methods. Most relevant here, Elman showed that his network was capable of learning complex grammatical structures by "starting small" — by imposing storage limitations on the recurrent hidden units and then gradually relaxing them. Elman's interest in learning complex sequences by first learning simpler sequences is closely related to continual learning, though there are a few important differences between his system and CHILD.
The first is subtle but important: Elman discovered that his network could only learn certain complex sequences if he imposed artificial constraints and then gradually relaxed them; CHILD, on the other hand, was designed specifically for the purpose of continual learning. As a result, the internals of Elman's net must be externally altered as learning progresses, whereas CHILD detects for itself when, how, and to what degree to increase its capacity. Also, Elman used a fixed-sized network and only two kinds of inputs: simple and complex, whereas CHILD adds new units so that it can grow indefinitely, learning a never-ending spectrum of increasing complexity. Elman's net is therefore limited to a certain final level of development, whereas for CHILD, there is no final level of complexity. It must be emphasized, however, that Elman's intention was to shed light on certain aspects of human cognition, not to develop a learning agent.

Recurrent networks are in general capable of representing a greater range of FSA than are TTH's, but they tend to learn long time dependencies extremely slowly. They also are not typically constructive learning methods, though RCC, Recurrent Cascade Correlation (Fahlman, 1991), and the network introduced by Giles, et al. (1995) are exceptions and introduce new hidden units one at a time during learning. Both these networks are unsuitable for continual learning in that they must be trained with a fixed data set, instead of incrementally as data is presented. In addition, these nets do not build new units into a structure that takes advantage of what is already known about the learning task, whereas TTH units are placed into exact positions to solve specific problems. Wynne-Jones (1993), on the other hand, described a method that examines an existing unit's incoming connections to see in which directions they are being pulled and then creates a new unit to represent a specific area of this multidimensional space. In contrast, Sanger's network adds new units to reduce only a single weight's error (Sanger, 1991), which is also what the TTH algorithm does. However, both Wynne-Jones' and Sanger's networks must be trained over a fixed training set, and neither network can be used for tasks with temporal dependencies.

5.3. Hierarchical Adaptive Systems

The importance of hierarchy in adaptive systems that perform temporal tasks has been noted often, and many systems have been proposed in which hierarchical architectures are developed by hand top down as an efficient method for modularizing large temporal tasks (Albus, 1979; Dayan & Hinton, 1993; Jameson, 1992; Lin, 1993; Roitblat, 1988; Roitblat, 1991; Schmidhuber & Wahnsiedler, 1993; Singh, 1992; Wixson, 1991). In the systems of Wixson (1991), Lin (1993), and Dayan and Hinton (1995), each high-level task is divided into sequences of lower-level tasks where any task at any level may have a termination condition specifying when the task is complete. Jameson's system (Jameson, 1992) is somewhat different in that higher levels "steer" lower levels by adjusting their goals dynamically. The system proposed by Singh (1992), when given a task defined as a specific sequence of subtasks, automatically learns to decompose it into its constituent sequences. In all these systems, hierarchy is enlisted for task modularization, allowing higher levels to represent behaviors that span broad periods of time.
Wilson (1989) proposed a hierarchical classifier system that implied but did not implement the possibility of automatic hierarchy construction by a genetic algorithm. The schema system proposed by Drescher (1991) supports two hierarchical constructs: “composite actions” (sequences of actions that lead to specific goals) and “synthetic items” (concepts used to define the pre-conditions and results of actions). Drescher’s goal — simulating early stages of Piagetian development — is most congruous with the philosophy of continual learning, though the complexity of his mechanism makes learning more cumbersome.

Another constructive, bottom-up approach is Dawkins’ “hierarchy of decisions” (Dawkins, 1976), similar to Schmidhuber’s “history compression” (Schmidhuber, 1992), in which an item in a sequence that reliably predicts the next several items is used to represent the predicted items in a reduced description of the entire sequence. The procedure can be applied repeatedly to produce a many-leveled hierarchy. This is not an incremental method, however — all data must be specified in advance — and it is also not clear how an agent could use this method for choosing actions.

Macro-operators in STRIPS (see Barr & Feigenbaum, 1981) and chunking in SOAR (Laird et al., 1986) also construct temporal hierarchies. Macro-operators are constructed automatically to memorize frequently occurring sequences. SOAR memorizes solutions to subproblems and then reuses these when the subproblem reappears. Both methods require the results of actions to be specified in advance, which makes learning in stochastic environments difficult. For real-world tasks in general, both seem less suitable than methods based on closed-loop control, such as reinforcement learning.

5.4. Reinforcement Learning with Hidden State

Chrisman’s Perceptual Distinctions Approach (PDA) (Chrisman, 1992) and McCallum’s Utile Distinction Memory (UDM) (McCallum, 1993) are two methods for reinforcement learning in environments with hidden state. Both combine traditional hidden Markov model (HMM) training methods with Q-learning, and both build new units that represent states explicitly. UDM is an improvement over PDA in that it creates fewer new units. In an environment introduced by McCallum (1993), CHILD learned about five times faster than UDM (Ring, 1994) and created about one third the number of units. In larger state spaces, such as Maze 9 above, transition hierarchies can get by with a small number of units — just enough to disambiguate which action to take in each state — whereas the HMM approaches need to represent most or all of the actual states. Also, both UDM and PDA must label every combination of sensory inputs uniquely (i.e., senses must be locally encoded), and the number of labels scales exponentially with the dimensionality of the input. Distributed representations are therefore often preferable and can also lead to enhanced generalization (as was shown in Section 4.3).

CHILD learns to navigate hidden-state environments without trying to identify individual states. It takes a completely action-oriented perspective and only needs to distinguish states well enough to generate a reasonable Q-value. If the same state label always indicates that a particular action needs to be taken, then none of the states with that label need to be distinguished.
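As a toy illustration of this action-oriented view, consider four underlying states that emit only two observation labels. The labels, states, and actions below are hypothetical and are not drawn from the experiments above; the point is simply that an aliased label calls for extra temporal context only when the states sharing it demand different actions.

# Hypothetical illustration of the action-oriented perspective: states that
# share an observation label need to be told apart only if their required
# actions differ; otherwise a single Q-entry keyed by the label suffices.

observation = {'s1': 3, 's2': 3, 's3': 5, 's4': 5}          # state -> observation label
best_action = {'s1': 'E', 's2': 'E', 's3': 'N', 's4': 'W'}  # state -> optimal action

def labels_needing_history(observation, best_action):
    """Return the observation labels whose states disagree on the best action."""
    actions_per_label = {}
    for state, label in observation.items():
        actions_per_label.setdefault(label, set()).add(best_action[state])
    return [lbl for lbl, acts in actions_per_label.items() if len(acts) > 1]

print(labels_needing_history(observation, best_action))   # -> [5]
# Label 3 is aliased (two states share it) but harmless: both states want 'E'.
# Only label 5 forces a look at earlier senses, which is where a new unit
# examining the previous time step would be built.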
Very recently, McCallum has introduced the U-Tree algorithm (McCallum, 1996), which also takes an action-oriented approach. Like Temporal Transition Hierarchies, U-Tree is limited to learning Markov-k domains. Also like TTH’s, U-Tree learns to solve tasks by incrementally extending a search backwards in time for sensory information that helps it disambiguate current Q-values. It then (like TTH’s) keeps track of this information and uses it while moving forward in time to choose actions and update Q-values. But unlike TTH’s, this method stores its information in an explicit tree structure that reflects the dependencies of the Q-values on the previous sensory information. It also makes assumptions about the environment’s Markov properties that the TTH algorithm does not. In particular, it assumes that all Markov dependencies manifest themselves within a fixed number of time steps. As a result, it can miss dependencies that span more than a few steps. This apparent weakness, however, enables the algorithm to perform statistical tests so that it only builds new units when needed (i.e., only when a dependency within this fixed period actually does exist). The algorithm can thereby create fewer units than the TTH algorithm, which is inclined (depending on parameter settings) to continue building new units indefinitely in the search for the cause of its incorrect predictions.

6. Discussion and Conclusions

Given an appropriate underlying algorithm and a well-developed sequence of environments, the effort spent on early training can more than pay for itself later on. An explanation for this is that skills acquired quickly in the easy tasks can then speed the learning of the more difficult ones. One such skill that CHILD developed was a small dance that it always performed upon beginning a maze in certain ambiguous positions. The dance is necessary since in these ambiguous states, neither the current state nor the correct actions can be determined.4 Once the agent had performed the dance, it would move directly to the goal. Often the dance was very general, and it worked just as well in the later mazes as in the earlier ones. Sometimes the skills did not generalize as well, and the agent had to be trained in each of the later environments. However, before training began, it would have been extremely difficult for the trainer to know many (if any) of the skills that would be needed. In real-world tasks it is even more difficult to know beforehand which skills an agent will need. This is precisely why continual learning is necessary — to remove the burden of such decisions from the concerns of the programmer/designer/trainer.

The problems that CHILD solves are difficult and it solves them quickly, but it solves them even more quickly with continual learning. The more complicated the reinforcement-learning tasks in Section 3 became, the more CHILD benefited from continual learning. By the ninth maze, the agent showed markedly better performance. (It also showed no sign of catastrophic interference from earlier training, and was in fact able to return to the first maze of the series and solve the task again with only minimal retraining.)
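The training regime behind these results amounts to a simple protocol: a single agent, carrying all of its learned weights and units with it, is trained on the mazes in order. The sketch below is only a schematic of that loop; the agent and maze interfaces (reset, step, choose_action, learn) are hypothetical placeholders rather than the implementation actually used.

# Schematic of the continual-learning protocol: one agent is carried, with
# its learned structure intact, from each maze to the next, so that skills
# acquired early can speed later learning. Interfaces are hypothetical.

def train_continually(agent, mazes, trials_per_maze=1000):
    average_steps = []
    for maze in mazes:                       # Maze 1, Maze 2, ..., Maze 9
        steps_per_trial = []
        for _ in range(trials_per_maze):
            obs = maze.reset()               # random start position
            steps, done = 0, False
            while not done:
                action = agent.choose_action(obs)
                obs, reward, done = maze.step(action)
                agent.learn(obs, reward)     # learning occurs at every time step
                steps += 1
            steps_per_trial.append(steps)
        average_steps.append(sum(steps_per_trial) / len(steps_per_trial))
    return average_steps                     # one average per maze

A non-continual baseline differs only in that a fresh agent is created for each maze, discarding everything learned before.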
Taking a fast algorithm and greatly improving its performance through continual learning shows two things. It shows the usefulness of continual learning in general, and it demonstrates that the algorithm is capable of taking advantage of and building onto earlier training. CHILD is a particularly good system for continual learning because it exhibits the seven properties of continual learning listed in the introduction.

• It is an autonomous agent. It senses, takes actions, and responds to the rewards in its environment. It handles reinforcement-learning problems well, since it is based on the Temporal Transition Hierarchy algorithm, which can predict continuous values in noisy domains (learning supervised-learning tasks more than two orders of magnitude faster than recurrent networks (Ring, 1993)).

• It learns context-dependent tasks, where previous senses affect future actions. It does this by adding new units to examine the temporal context of its actions for clues that help predict their correct Q-values. The information the agent uses to improve its predictions can eventually extend into the past for any arbitrary duration, creating behaviors that can also last for any arbitrary duration.

• It learns behaviors and skills while solving its tasks. The “dance” mentioned above, for example, is a behavior that it learned in one maze and was then able to use in later mazes. In some cases the dance was sufficient for later tasks; in other cases it needed to be modified as appropriate for the new environment.

• Learning is incremental. The agent can learn from the environment continuously, in whatever order data is encountered. It acquires new skills as they are needed. New units are added only for strongly oscillating weights, representing large differences in Q-value predictions and therefore large differences in potential reward. Predictions vary the most where prediction improvements will lead to the largest performance gains, and this is exactly where new units will be built first (see the sketch following this list).

• Learning is hierarchical. New behaviors are composed from variations on old ones. When new units are added to the network hierarchy, they are built on top of existing hierarchies to modify existing transitions. Furthermore, there is no distinction between learning complex behaviors and learning simple ones, or between learning a new behavior and subtly amending an existing one.

• It is a black box. Its internals need not be understood or manipulated. New units, new behaviors, and modifications to existing behaviors are developed automatically by the agent as training progresses, not through direct manipulation. Its only interface to its environment is through its senses, actions, and rewards.

• It has no ultimate, final task. What the agent learns now may or may not be useful later, depending on what tasks come next. If an agent has been trained for a particular task, its abilities may later need to be extended to include further details and nuances. The agent might have been trained only up to maze number eight in Figure 1, for example (its training thought to be complete), but later the trainer might have needed an agent that could solve the ninth maze as well. Perhaps there will continue to be more difficult tasks after the ninth one, too.
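The unit-creation trigger mentioned in the fourth item can be pictured in a few lines. The sketch below is a simplified, hypothetical rendering of the idea rather than the exact statistics used by the TTH algorithm: a weight whose updates are large but keep cancelling each other out is being pulled toward different values in different temporal contexts, and a new higher-level unit is built to supply that context.

# Simplified, hypothetical sketch of the "oscillating weight" trigger. A
# weight whose successive updates are large but tend to cancel is being
# pulled in opposite directions in different contexts, so a new unit that
# conditions on the previous time step is warranted for that connection.

class WeightMonitor:
    def __init__(self, threshold=0.5, decay=0.9):
        self.threshold = threshold
        self.decay = decay
        self.avg_delta = 0.0        # running average of the updates themselves
        self.avg_magnitude = 0.0    # running average of their absolute size

    def update(self, delta):
        """Record one weight update; return True if a new unit is warranted."""
        self.avg_delta = self.decay * self.avg_delta + (1 - self.decay) * delta
        self.avg_magnitude = (self.decay * self.avg_magnitude
                              + (1 - self.decay) * abs(delta))
        # Large updates that mostly cancel one another indicate oscillation.
        oscillation = self.avg_magnitude - abs(self.avg_delta)
        return oscillation > self.threshold

When such a monitor fires for a weight, a new unit would be created to predict what that weight should be from the senses of the previous time step, in the spirit of the l units described in the paper and its notes below.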
In the long term, continual learning is the learning methodology that makes the most sense. In contrast to other methods, continual learning emphasizes the transfer of skills developed so far towards the development of new skills of ever greater complexity. It provides a motivation for training an intelligent agent when there is no final goal or task to be learned. We do not begin our education by working on our dissertations. It takes (too) many years of training before even beginning a dissertation seems feasible. It seems equally unreasonable for our learning algorithms to learn the largest, most monumental tasks from scratch rather than building up to them slowly.

Our very acts of producing technology, building mechanisms, and writing software are attempts to make automatic those things that we can do ourselves only slowly and with the agony of constant attention. The process of making our behaviors reflexive is onerous. We build our technology as extensions of ourselves. Just as we develop skills, stop thinking about their details, and then use them as the foundation for new skills, we develop software, forget about how it works, and use it for building newer software. We design robots to do our manual work, but we cannot continue to design specific solutions to every complicated, tiny problem. As efficient as programming may be in comparison with doing a task ourselves, it is nevertheless a difficult, tedious, and time-consuming process — and program modification is even worse. We need robots that learn tasks without specially designed software solutions. We need agents that learn and don’t forget, keep learning, and continually modify themselves with newly discovered exceptions to old rules.

Acknowledgments

This paper is a condensation of the reinforcement-learning aspects of my dissertation (Ring, 1994), and I would therefore like to acknowledge everyone as I did there. Space, however, forces me just to mention their names: Robert Simmons, Tim Cleghorn, Peter Dayan, Rick Froom, Eric Hartman, Jim Keeler, Ben Kuipers, Kadir Liano, Risto Miikkulainen, Ray Mooney, David Pierce. Thanks also to Amy Graziano, and to NASA for supporting much of this research through their Graduate Student Researchers Program.

Notes

1. A connection may be modified by at most one l unit. Therefore $l_i$, $l^i_{xy}$, and $l_{xy}$ are identical but used as appropriate for notational convenience.

2. With a bit of mathematical substitution, it can be seen that these connections, though additive, are indeed higher-order connections in the usual sense. If one substitutes the right-hand side of Equation 7 for $l_{ij}$ in Equation 8 (assuming a unit $n_x \equiv l_{ij}$ exists) and then replaces $\hat{w}_{ij}$ in Equation 7 with the result, then
$$n_i(t) = \sum_j s_j(t)\Big[w_{ij}(t) + \sum_{j'} s_{j'}(t-1)\,\hat{w}_{xj'}(t-1)\Big] = \sum_j \Big[s_j(t)\,w_{ij}(t) + \sum_{j'} s_j(t)\,s_{j'}(t-1)\,\hat{w}_{xj'}(t-1)\Big].$$
As a consequence, l units introduce higher orders while preserving lower orders.

3. For this reason, it is sometimes helpful to provide the agent with the proprioceptive ability to sense the actions it chooses. Though this entails the cost of a larger input space, it can sometimes help reduce the amount of ambiguity the agent must face (Ring, 1994).

4. The dance is similar to the notion of a distinguishing sequence in automata theory (which allows any state of a finite state machine to be uniquely identified), except that CHILD does not need to identify every state; it only needs to differentiate those that have significantly different Q-values.

References

Albus, J. S. (1979). Mechanisms of planning and problem solving in the brain. Mathematical Biosciences, 45:247–293.
Barr, A. & Feigenbaum, E. A. (1981). The Handbook of Artificial Intelligence, volume 1. Los Altos, California: William Kaufmann, Inc.
Baxter, J. (1995). Learning model bias. Technical Report NC-TR-95-46, Royal Holloway College, University of London, Department of Computer Science.
Caruana, R. (1993). Multitask learning: A knowledge-based source of inductive bias. In Machine Learning: Proceedings of the Tenth International Conference, pages 41–48. Morgan Kaufmann Publishers.
Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the National Conference on Artificial Intelligence (AAAI-92). Cambridge, MA: AAAI/MIT Press.
Dawkins, R. (1976). Hierarchical organisation: a candidate principle for ethology. In Bateson, P. P. G. and Hinde, R. A., editors, Growing Points in Ethology, pages 7–54. Cambridge: Cambridge University Press.
Dayan, P. & Hinton, G. E. (1993). Feudal reinforcement learning. In Giles, C. L., Hanson, S. J., and Cowan, J. D., editors, Advances in Neural Information Processing Systems 5, pages 271–278. San Mateo, California: Morgan Kaufmann Publishers.
Drescher, G. L. (1991). Made-Up Minds: A Constructivist Approach to Artificial Intelligence. Cambridge, Massachusetts: MIT Press.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48:71–99.
Fahlman, S. E. (1991). The recurrent cascade-correlation architecture. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 190–196. San Mateo, California: Morgan Kaufmann Publishers.
Giles, C., Chen, D., Sun, G., Chen, H., Lee, Y., & Goudreau, M. (1995). Constructive learning of recurrent neural networks: Problems with recurrent cascade correlation and a simple solution. IEEE Transactions on Neural Networks, 6(4):829.
Jameson, J. W. (1992). Reinforcement control with hierarchical backpropagated adaptive critics. Submitted to Neural Networks.
Jordan, M. I. (1986). Serial order: A parallel distributed processing approach. ICS Report 8604, Institute for Cognitive Science, University of California, San Diego.
Kaelbling, L. P. (1993a). Hierarchical learning in stochastic domains: Preliminary results. In Machine Learning: Proceedings of the Tenth International Conference, pages 167–173. Morgan Kaufmann Publishers.
Kaelbling, L. P. (1993b). Learning to achieve goals. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1094–1098. Chambéry, France: Morgan Kaufmann.
Laird, J. E., Rosenbloom, P. S., & Newell, A. (1986). Chunking in Soar: The anatomy of a general learning mechanism. Machine Learning, 1:11–46.
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293–321.
Lin, L.-J. (1993). Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University. Also appears as Technical Report CMU-CS-93-103.
McCallum, A. K. (1996). Learning to use selective attention and short-term memory in sequential tasks. In From Animals to Animats, Fourth International Conference on Simulation of Adaptive Behavior (SAB’96).
McCallum, R. A. (1993). Overcoming incomplete perception with Utile Distinction Memory. In Machine Learning: Proceedings of the Tenth International Conference, pages 190–196. Morgan Kaufmann Publishers.
Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7:227–252.
Pratt, L. Y. (1993). Discriminability-based transfer between neural networks. In Giles, C. L., Hanson, S. J., and Cowan, J. D., editors, Advances in Neural Information Processing Systems 5, pages 204–211. San Mateo, California: Morgan Kaufmann Publishers.
Ring, M. B. (1993). Learning sequential tasks by incrementally adding higher orders. In Giles, C. L., Hanson, S. J., and Cowan, J. D., editors, Advances in Neural Information Processing Systems 5, pages 115–122. San Mateo, California: Morgan Kaufmann Publishers.
Ring, M. B. (1994). Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, Austin, Texas 78712.
Ring, M. B. (1996). Finding promising exploration regions by weighting expected navigation costs. Arbeitspapiere der GMD 987, GMD — German National Research Center for Information Technology.
Robinson, A. J. & Fallside, F. (1987). The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department.
Roitblat, H. L. (1988). A cognitive action theory of learning. In Delacour, J. and Levy, J. C. S., editors, Systems with Learning and Memory Abilities, pages 13–26. Elsevier Science Publishers B.V. (North-Holland).
Roitblat, H. L. (1991). Cognitive action theory as a control architecture. In Meyer, J. A. and Wilson, S. W., editors, From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, pages 444–450. MIT Press.
Sanger, T. D. (1991). A tree structured adaptive network for function approximation in high-dimensional spaces. IEEE Transactions on Neural Networks, 2(2):285–301.
Schmidhuber, J. (1992). Learning unambiguous reduced sequence descriptions. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems 4, pages 291–298. San Mateo, California: Morgan Kaufmann Publishers.
Schmidhuber, J. (1994). On learning how to learn learning strategies. Technical Report FKI–198–94 (revised), Technische Universität München, Institut für Informatik.
Schmidhuber, J. & Wahnsiedler, R. (1993). Planning simple trajectories using neural subgoal generators. In Meyer, J. A., Roitblat, H., and Wilson, S., editors, From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior, pages 196–199. MIT Press.
Sharkey, N. E. & Sharkey, A. J. (1993). Adaptive generalisation. Artificial Intelligence Review, 7:313–328.
Silver, D. L. & Mercer, R. E. (1995). Toward a model of consolidation: The retention and transfer of neural net task knowledge. In Proceedings of the INNS World Congress on Neural Networks, volume III, pages 164–169. Washington, DC.
Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8:323–340.
Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8. MIT Press.
Thrun, S. & Schwartz, A. (1995). Finding structure in reinforcement learning. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7, pages 385–392. MIT Press.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College.
Wilson, S. W. (1989). Hierarchical credit allocation in a classifier system. In Elzas, M. S., Ören, T. I., and Zeigler, B. P., editors, Modeling and Simulation Methodology. Elsevier Science Publishers B.V.
Wixson, L. E. (1991). Scaling reinforcement learning techniques via modularity. In Birnbaum, L. A. and Collins, G. C., editors, Machine Learning: Proceedings of the Eighth International Workshop (ML91), pages 368–372. Morgan Kaufmann Publishers.
Wynne-Jones, M. (1993). Node splitting: A constructive algorithm for feed-forward neural networks. Neural Computing and Applications, 1(1):17–22.
Yu, Y.-H. & Simmons, R. F. (1990). Extra output biased learning. In Proceedings of the International Joint Conference on Neural Networks. Hillsdale, NJ: Erlbaum Associates.