Evol. Intel.
DOI 10.1007/s12065-014-0121-7
An evolutionary cognitive architecture made of a bag of networks
Alexander W. Churchill • Chrisantha Fernando
Received: 15 September 2014 / Revised: 1 November 2014 / Accepted: 4 November 2014
© Springer-Verlag Berlin Heidelberg 2014

A. W. Churchill (✉) · C. Fernando
School of Electronic Engineering and Computer Science, Queen Mary, University of London, London, UK
e-mail: [email protected]
Abstract A cognitive architecture is presented for modelling some properties of sensorimotor learning in infants,
namely the ability to accumulate adaptations and skills
over multiple tasks in a manner which allows recombination and re-use of task specific competences. The control
architecture invented consists of a population of compartments (units of neuroevolution) each containing networks
capable of controlling a robot with many degrees of freedom. The nodes of the network undergo internal mutations,
and the networks undergo stochastic structural modifications, constrained by a mutational and recombinational
grammar. The nodes used consist of dynamical systems
such as dynamic movement primitives, continuous time
recurrent neural networks and high-level supervised and
unsupervised learning algorithms. Edges in the network
represent the passing of information from a sending node to
a receiving node. The networks in a compartment operate
in parallel and encode a space of possible subsumption-like
architectures that are used to successfully evolve a variety
of behaviours for a NAO H25 humanoid robot.
Keywords Cognitive architecture · Darwinian neurodynamics · Open-ended evolution · Robotics
1 Introduction
Learning a controller for a complex multiple degree of
freedom device such as a child or a robot is a non-trivial
problem and the principles by which development achieves
goals with such high-dimensional systems have been the
subject of considerable investigation from the early work
of Bernstein [54], the dynamical systems approach of
Thelen and Smith [64], to the robotics experiments of
Schaal’s group on Dynamic Movement Primitives [50]. A fascinating and critical property of developmental systems is
their ability to accumulate adaptations without catastrophic
forgetting, a topic that has recently been highlighted by
[39] as being highly suggestive of a role for structural
plasticity in the brain along constructivist principles [8]. A
holy grail in developmental robotics would be to achieve
open-ended adaptation over long periods of time, with the
same cognitive architecture being able to bootstrap prior
knowledge and competences to achieve more and more
complex tasks.
In this paper we present a cognitive architecture formed
of parallel cyclic networks of potentially complex nodes.
These nodes can range from simply outputting a constant
value to encapsulating a voice recognition unit. The
composition of these networks is produced using an evolutionary approach, a practice first introduced by Husbands et al. in 1993 [28]. A key motivation behind
Evolutionary Robotics is to produce complex, robust and
adaptable behaviour through feedback from the environment. Using evolution, it is possible for complex behaviour to emerge from interactions among several parts,
which could have extremely complicated or chaotic
dynamics. For example, Koos et al. produced a highly
robust robot which can adapt to unforeseen circumstances,
such as damaging a leg [31]. In this paper experiments are
performed on a humanoid robot, which is challenging due
to many degrees of freedom. Evolutionary techniques
have been used to produce reaching (e.g. [41, 48]), dancing
(e.g. [16]), and bipedal motion (e.g. [17, 18]) behaviour in
humanoid robots.
Various attempts have been made to ‘‘scale up’’ the
behaviour of robotic controllers to produce a growing
library of learned competencies, such as through subsumption [4] and incremental evolution [26]. In the former
case, independent behaviours are encoded in separate
modules, which get activated through sensory conditions.
Koza used Genetic Programming (GP) to evolve a subsumption architecture, with controllers based on rule-sets
[32]. This has been adapted to use modules made of neural
network controllers in [66] but scaling up to high-dimensional robots has not been attempted. Incremental evolution
has been demonstrated to be successful for one or two
steps, e.g. in the system of Gomez and Miikkulainen [21],
and recent work from Mouret and Doncieux has produced
promising results by bootstrapping controllers based on
behavioural diversity [43]. However, an open-ended
accumulation of adaptation has not been achieved.
Other than subsumption and incremental evolution,
there have been attempts to evolve neural network controllers that can switch between behaviours given different
contexts. Calabretta et al. presented an early example of
this, evolving two different behaviours in two individual
neural networks [7]. More recent work from Durr et al. also
explicitly evolves two neural networks for a robotic task
which has changing goals depending on the current context
[15]. The modular approach, i.e. using multiple networks,
was found to greatly outperform a single network. In the
world of videogames, Schrum et al. evolved modular
neural networks that produced multiple behaviours in Ms.
Pac-Man [52].
The system presented here is influenced by the research
outlined above, and can implement a subsumption-like
architecture, through the interactions of disjoint graphs. It
deviates from the reported work in a number of ways. The
nodes are complex and could consist of a neural network
like many of the examples above but could also use any
type of processing unit, such as a finite state machine. This
is similar to [67], which evolves a subsumption architecture
with several modules with predefined behaviours (e.g.
wandering, recharging, etc.). Each module can consist of
any structure capable of producing motor commands. The
architecture presented in this paper differs in a key respect:
in [67] only one module can be active at a time, while here
many modules can operate in parallel. This can enable
complex control of a high-dimensional robot, as well as
enabling modules to be active that do not directly affect the
environment. For example, the ability to build models
could be a key component for open-ended adaptation [24].
Many researchers have used models for solving robotic
problems. Typically these are divided into forward models
(predicting the next state given the current state and an
action) and inverse models (predicting the action required
to reach a desired state from the current state) [30]. These
are generally learnt through machine learning techniques such as Gaussian Process Regression (e.g. in
[60]), Bayesian Networks (e.g. in [12]), Symbolic
Regression (e.g. in [42]), Learning Classifier Systems (e.g.
in [6]) and Reinforcement Learning (e.g. in [2]). Recent
work from Bellas et al. builds models with evolutionary techniques [3], and shows that this is especially useful in
non-stationary environments.
Many researchers have used models as a driver for
Intrinsic Motivation or Artificial Curiosity, where a learning machine autonomously creates its own goals, i.e. there
is no explicit external task set by a human experimenter
[1]. One method is to set up reward functions to maximise
learning progress, for example Schmidhuber’s function to
maximise the change in compressibility over time [51] and
Oudeyer et al.’s maximisation of prediction error [44].
These functions drive the learning agent to discover
something new about the world, while refraining from
attempting to model regions with no or very little information, such as noise. Although not implemented in the
experiments in this paper, below we describe how the
presented system could be used to evolve its own tasks.
There are also similarities between the architecture presented in this paper and Learning Classifier Systems (LCS). While
there are many different varieties of classifier system, the
canonical ‘‘Michigan style’’ evolves a set of classifiers
consisting of a condition, an action and a prediction (a
detailed introduction to LCS can be found in [5]). LCS
have been applied to learning agents both in simulation,
e.g. the early work of Wilson’s ZCS implemented in [9]
and Butz and Herbort’s application of XCSF to a robotic
arm [6], and on real robots such as in Hurst and Bull’s work
using ZCS and neural classifiers [27], and Studley and
Bull’s work using an extension of XCS [59]. The work on
LCS is very relevant to the system presented here. The
controller graphs that are evolved here have an input
condition that, if satisfied, will activate the graph (allow
input to flow downstream) and produce an output which
can be interpreted as an action, similar to the rules in an
LCS such as ZCS. The most related literature to the work
presented here is Dorigo and Colombetti’s robot shaping
using hierarchical LCS modules [14]. In their influential
book where they experiment on a number of robotic
problems, the hierarchies and architectures are designed by
hand, and modules are incrementally trained and then
frozen, in a similar manner to incremental evolution. In the
work presented here, the architectures themselves can
evolve. Hence, unlike most Learning Classifier Systems, here the unit of evolution is a group of ‘‘classifiers’’ rather than a single rule. This group contains many
graphs that can operate in parallel, meaning that there is
both parallel message passing between graphs and parallel
sensor-motor contingencies. This allows the system to
Evol. Intel.
control a real-world high degree-of-freedom humanoid
robot with continuous sensor and motor commands, while
Dorigo and Colombetti concentrate on a discretised version
of the world.
In this work we use a directed graph based control
architecture capable of parallel processing, in which cyclic
networks are allowed. The network is analogous to a neural
network controller, however, the nodes can be much more
complex than simple neurons. There is a long history of
evolving parallel potentially cyclic directed graph representations, for example Teller and Veloso use parallel
evolution for data mining (the PADO algorithm) to solve
supervised learning problems [62]. The same authors also
introduced the first GP method for producing neural networks, called ‘‘Neural Programming’’ [63]. Ricardo Poli
uses a parallel genetic programming system called PDGP
(parallel distributed genetic programming) to evolve
feedforward parallel programs to solve the XOR problem
amongst others [46]. Cartesian genetic programming [40]
also evolves feed-forward programs. The dominant paradigm for structural evolution of networks is now NEAT
[56], a powerful system that maintains diversity, protects
innovations, and controls the dimensionality of a genotype
carefully. The hyper-NEAT extension moves to an indirect
encoding, which produces much better performance on
large networks [55].
The system presented here differs from the above
approaches in several respects. First, it is intended to be a
cognitive architecture for the real-time control of robots,
secondly the nodes are intended to be complex machine
learning algorithms, not arithmetic primitives or simple
mathematical functions, thirdly, our system is intended to
be capable of controlling a humanoid robot. As in the
existing systems, our nodes have multiple functions and
transmit information between each other. Our motivation
for using the graph representation is to eventually achieve
compositionality, systematicity, and productivity because it
is possible for specific node types to be physical symbol
systems [20] if the rules for operating on them are grammatical. Previous attempts have been made to achieve
grammar-based genetic programming [38] in tree structures, but not in cyclic distributed parallel graph structures
of the types described above. The challenge is to discover
an evolvable grammar that generates fit variants with high
probability, a task not unlike that faced in language
learning by the fluid construction grammar [57]. The system could also be used to evolve fitness functions, providing a method for intrinsic motivation, a crucial
component of open ended cognition in robotic agents.
The structure of the paper is as follows. The network
representation is described followed by the stochastic
grammatical variation operators, and the overall evolutionary algorithm that controls the robot in real-time by
generating and selecting active units. Some preliminary
adaptation results are presented, and the network structures
responsible for the behaviour are described. We have not yet
achieved the accumulation of adaptation, but we have shown
that a variety of distinct tasks can be evolved with disjoint
network representations that are suitable for recombination.
Several critical problems are encountered and discussed, and
an approach is sketched for making progress.
2 Methods
2.1 Nodes
The basic building block of the model is a node. It gets
vector inputs, performs a processing operation, and sends
vector messages to other nodes with some delay. When
activated, nodes transmit this activity state to downstream
nodes, and remain active for a fixed period. Nodes can have
internal states and are connected to create potentially disjoint directed graphs. A node is defined by the following
general tuple {t, i.d., I, O, P, f}, where t is the node type; i.d. is a unique identification number; I is a vector that represents the inputs to the node: the i.d.s and/or raw sensory locations from which the inputs originate, and the delay associated with transmission from each i.d. listed; O is a vector that represents the outputs of the node: the i.d. and activation state of this node, and the outputs of the node function; P is a vector that represents the internal states of the node (all nodes have an activity state, and a time to remain active after turning on); and f is the transformation function or learning algorithm, converting inputs to outputs, possibly using and modifying the internal states.
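To make the tuple less abstract, it can be sketched as a data structure. The following is an illustrative reading of the formalism only, not the authors' implementation; the class and field names (InputSpec, step, and so on) are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class InputSpec:
    source_id: int   # i.d. of the sending node, or a raw sensor location
    delay: int       # transmission delay, in control time steps

@dataclass
class Node:
    node_type: str             # t: e.g. "sensor", "linear_transform", "motor"
    node_id: int               # i.d.: unique identification number
    inputs: List[InputSpec]    # I: message sources and their delays
    params: dict = field(default_factory=dict)  # P: internal states
    func: Optional[Callable] = None              # f: input -> output transform
    active: bool = False       # generic activation state shared by all nodes
    time_remaining: int = 0    # steps to remain active after turning on

    def step(self, input_vector):
        # O: the output message carries the node's i.d., its activation
        # state, and the outputs of the node function f (which may use and
        # update the internal states in P).
        out = self.func(input_vector, self.params) if self.func else []
        return (self.node_id, self.active, out)
```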
We define several classes of nodes in order to reduce the
abstractness of the above formalism. Each class contains
various sub-classes of nodes. All nodes share the generic
‘‘activation’’ property: when a node becomes activated, it
transmits this information in its output message. Sensory
nodes get raw sensory input, or act as internal pattern
generators that encode the amplitudes and frequencies of a
vector of sine waves. Sensor nodes are the only initiator
nodes for a graph, i.e. the only nodes that can initiate
activation. Motor nodes contain a vector of motors that
they control. Their input vector is used to set the motor
angles of these motors at each time-step that the node is
active, and a re-afferent copy or corollary discharge of the
command angles is always output as in [11]. If there is no
input, a default motor angle for each motor is stored.
Processing nodes receive inputs only from other nodes.
They include Euclidean distance nodes, which take a vector from their input nodes and calculate the distance to an internally stored parameter vector; a linear transform node that computes the dot product of the input vector with an internally stored weight matrix to produce a vector of outputs;
and an Izhikevich neuron based liquid state machine [29]
node. A reinforcement learning node has a two-part input
vector, a data part and a reward part. The data part consists
of inputs from a set of other nodes, typically transform
nodes or sensory nodes e.g. signalling joint angles. The
reward part must be a single scalar value obtained from a
transform node, for example this could signal the prediction error or the proximity to a desired state. The output of
a learning node is a vector that will typically encode some
parameters for downstream motor nodes to use, or it might
be a temporal difference error signal. Types of reinforcement learning node include a stochastic hill climbing node,
an actor-critic node, a simulated annealing node and a
genetic algorithm node. Supervised learning nodes undertake online supervised learning based on an input training
vector, and an input target vector. They do function
approximation (regression) or classification. Such a node is
essential for efficiently learning models of the world, e.g.
forward models, or inverse models [23, 34, 53]. The type of
model depends purely on the identity of the transform node
that sends it the training and target vectors during its
training. Unsupervised learning nodes take an input vector
and compress it, e.g. a K-means clustering node does
online clustering and outputs the class of each novel data
point input to it; a Principal Component Analysis node takes an input vector, has a parameter n which is the number of principal components to output, and outputs as its vector the intensities of the first n principal components.
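As an illustration of how such processing nodes fit the node tuple, the transformation functions f of two of the sub-classes above might look as follows. This is a hedged sketch reusing the illustrative Node convention introduced earlier; the parameter keys "W" and "target" are invented for the example.

```python
import numpy as np

def linear_transform_f(input_vector, params):
    # Dot product of the input vector with an internally stored weight
    # matrix (held in the node's internal states P under the key "W").
    return params["W"] @ np.asarray(input_vector)

def euclidean_distance_f(input_vector, params):
    # Distance from the input vector to an internally stored parameter
    # vector (key "target"), returned as a one-element output vector.
    return [float(np.linalg.norm(np.asarray(input_vector) - params["target"]))]
```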
In the NAO setup, there are high-level face recognition
nodes, i.e. a sensor node that receives input from a camera
and outputs a label and x,y retinal coordinate of a face; and
a speech recognition node, which is a sensor node that
receives input from a microphone and outputs either the
text or label of the speech. Other high level nodes in the
NAO include walk nodes, motor nodes that control the legs
to walk to a particular location and take high level Cartesian instructions as input.
To summarise, the repertoire of nodes is essentially
unbounded and is only limited by their encoding into the
framework. In this manner, any combination of such nodes
can form, thus enabling an open-ended evolution of complex controllers. This advantage is also a disadvantage: the number of nodes is huge, and the search space of possible networks is enormous. This makes evolution in
such a space extremely difficult, which is discussed in
detail later in this paper.
2.2 Bags of networks
The unit of neuroevolution consists of a set of disjoint
graphs of connected nodes in a compartment or bag of
networks to which fitness is assigned as a unit. Activation
of the networks in a compartment always starts at the
sensory nodes; these activate in parallel the downstream
processing nodes with the appropriate delays, which in turn
are typically connected to motor nodes. There may be
recurrent connections. The motivation for allowing disjoint
graphs is that it enables a robot to be subject to potentially multiple independent behavioural policies that together control the whole robot in parallel. By allowing
graphs to be disjoint, a bag of potentially useful network
motifs can be preserved silently and become utilised by the
system in later variation events during evolution. This
unusual nature of compartment representations will become clearer when the initialisation of networks and the variation operators acting on them are described below.
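The activation flow just described can be sketched as a control loop, reusing the illustrative Node and InputSpec classes from Sect. 2.1. This is a deliberately simplified assumption-laden sketch (activation persistence via time_remaining is omitted), not the authors' control loop.

```python
import collections

# Connectivity is encoded at the receiving end (each node's input list)
# and every edge carries its own delay. Only sensor nodes self-activate,
# so disjoint graphs whose sensors never fire stay silently dormant.

def run_compartment(nodes, sensor_ids, steps):
    inbox = collections.defaultdict(list)   # (time, receiver id) -> payloads
    for t in range(steps):
        for node in nodes:
            msgs = inbox.pop((t, node.node_id), [])
            if not (node.node_id in sensor_ids or msgs or node.active):
                continue                     # never reached: stays dormant
            node.active = True
            _, _, out = node.step([x for m in msgs for x in m])
            for receiver in nodes:           # route output along delayed edges
                for spec in receiver.inputs:
                    if spec.source_id == node.node_id:
                        inbox[(t + spec.delay, receiver.node_id)].append(list(out))
```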
2.3 Evolvable action grammar
We now describe the variation operations or graph rewrite
rules [47]. De novo node constructor operators make new
networks at initialisation, giving the robot a primary sensorimotor repertoire of reflexes. For example, such a network may consist of a single sensory node with a random
number of sensory inputs connected to a linear transform
node with a delay. The linear transform unit may have a
random initial weight matrix sending output with random
delays to three motor nodes. Each motor node may control
1 to 4 motors, randomly chosen. A complex constructor is
also used throughout evolution as a macro mutation device
for inserting functional motifs that consist of random sensory-(linear transform)-motor triplets in a chain of activation. Graph replicator operators produce perfect copies of
graphs. These are required for replication and evolution of
compartments (bag of networks). Graph connectivity
operators only influence the connectivity between already
existing nodes within a single graph, i.e. the messages that
nodes get from other nodes within the graph, the delays
with which the messages influence activation, and the time
period for which a node is active after activation. For
example, the connection deletor removes a message i.d.
from the input vector of a node. A connection adder adds a
new message i.d. of a node in the graph to the input vector
of another node in the graph. For now, all nodes are
allowed to connect to/from any other node type, but evolvability in future work will demand that this be constrained. A critical feature of the above operators is that
deletion of an edge may create two disjoint graphs that
remain in the same compartment. Node duplication operations make copies of nodes within a single graph. The
parallel node replicator takes a node and copies it entirely
along with its input and output links such that this node
now exists in parallel with its parent node, with the same
inputs and outputs of the parent node. The serial node
replicator copies a single node but instead of copying the
inputs and outputs of the node, it makes the input to the
offspring node the output of the parent node, and makes the
offspring node connect to those nodes that the parent node
connected to, i.e. it inserts a copy of the parent node
downstream of the parent node. A node deletor operator
removes a node along with its input and output edges, thus
potentially disconnecting a graph into two (or more)
components. All these operators act on the message i.d. list
in the input vector of a node. Thus connectivity is encoded
at the input end of nodes, not at the output ends. Intra-node
operators modify the sensory inputs, the motors controlled,
the transfer functions and the internal states of the nodes.
Intra-node mutations depend on the specific node type.
Intra-node operators do not mutate the type of the node.
This greatly restricts the mutability of the genome, but
increases its evolvability. Multi-graph recombinators take
N networks and construct a network that is derived from
properties of the N parent networks. In analogy with
genetic programming [33], a standard operator is the
branch crossover operator that is defined only on trees and
not recurrent graphs. This chooses two nodes and crosses
them over along with any downstream nodes to which they
are connected. Motif crossover operators allow crossover
only between functionally isomorphic digraph motifs, in
order to prevent crossover from being destructive due to
the competing conventions problem.
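Two of these operators are sketched below to fix their semantics, again using the illustrative Node/InputSpec representation introduced in Sect. 2.1 (the function names and signatures are assumptions, not the authors' code). Both act only on the input vectors of receiving nodes, which is where connectivity is encoded.

```python
import copy
import random

def connection_deleter(nodes):
    # Remove one message i.d. from the input vector of a random node.
    # This may split a graph into two disjoint graphs that remain in the
    # same compartment, silently preserving the severed motif.
    candidates = [n for n in nodes if n.inputs]
    if candidates:
        node = random.choice(candidates)
        node.inputs.remove(random.choice(node.inputs))

def serial_node_replicator(nodes, parent, new_id, delay=1):
    # Insert a copy of `parent` directly downstream of it: nodes that
    # listened to the parent now listen to the offspring, and the
    # offspring listens to the parent.
    child = copy.deepcopy(parent)
    child.node_id = new_id
    for n in nodes:
        for spec in n.inputs:
            if spec.source_id == parent.node_id:
                spec.source_id = child.node_id
    child.inputs = [InputSpec(source_id=parent.node_id, delay=delay)]
    nodes.append(child)
```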
2.4 Intrinsic motivation
Robot goals are usually set by an external user. On one
extreme, ‘‘dumb’’ robots’ goals and actions are hand-designed, allowing no flexibility at all. Optimal control
algorithms that enable the automatic computation of
(approximate) solutions that achieve a given task are a
more robust and efficient way [36, 65]. Machine learning
algorithms, such as reinforcement learning, enable the
robot to learn what actions to do in order to solve a specific
game or goal [37, 61]. And finally, on the other extreme,
curious robots learn to act and move in order to explore
their environment, i.e. learn to map and categorise it
autonomously [2, 13, 22, 45].
However, in all these examples, the goals or tasks are set
by the external user; even curious robots’ goals are set, i.e.
go and learn. As described above, this can be accomplished
by rewarding learning progress such as by maximising
prediction error. This could encourage a learning agent to
explore unseen parts of the environment. A goal node is a
special class of node that receives input vectors and outputs
a scalar. Using a goal node as the sink node in a network
consisting of the nodes described above (with the exception
of motor nodes), fitness functions can be encoded. In all of
the experiments below, the fitness functions have been encoded as goal graphs. The co-evolution of goal networks and action networks has the potential to produce open-ended behaviour by allowing the agent to produce its own tasks to develop skills and competencies. This is the next step for the architecture and is not presented in the results below.

2.5 Evolutionary algorithm

The evolutionary algorithm consists of a population of 10–100 units of evolution (bags of networks). A binary-tournament microbial steady-state Genetic Algorithm is used to select pairs of units, which are evaluated sequentially on the robot [25]. The fitness of the two units is compared and the winner overwrites the loser with mutation and recombination as described above in the operator section. This continues until the behaviour is interesting, or the robot breaks or overheats.
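The selection loop can be sketched as follows. Here evaluate, mutate and recombine stand in for the robot trial and the grammar operators of Sect. 2.3; their names and signatures are assumptions introduced for the sketch rather than the authors' code.

```python
import copy
import random

def microbial_ga(population, evaluate, mutate, recombine, tournaments):
    # Steady-state binary-tournament microbial GA over bags of networks:
    # two compartments are evaluated sequentially on the robot, and the
    # loser is overwritten by a mutated recombinant of the winner.
    for _ in range(tournaments):
        a, b = random.sample(range(len(population)), 2)
        fa, fb = evaluate(population[a]), evaluate(population[b])
        winner, loser = (a, b) if fa >= fb else (b, a)
        offspring = recombine(copy.deepcopy(population[winner]),
                              population[loser])
        population[loser] = mutate(offspring)
    return population
```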
3 Results

This section should be read while referring to the accompanying videos in the Supplementary Material (available at http://www.alexchurchill.com/papers/evin2014). Simulations are conducted in Webots for NAO, and real world experiments are conducted with the NAO robot itself.

3.1 Maximising distance travelled by a humanoid robot

In the first experiment, the control networks are evolved to
maximise the distance travelled by the NAO during a
single episode of 100 time steps. The fitness function is

f(x) = \sqrt{(x_{t=1} - x_{t=100})^2 + (y_{t=1} - y_{t=100})^2}    (1)
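As a worked reading of Eq. (1), the fitness is the planar Euclidean distance between the torso's position at the first and last time steps of the episode. In the architecture this scalar would be emitted by a goal node at the sink of a goal graph (Sect. 2.4); the helper below is an illustrative sketch, not the authors' code.

```python
import math

def distance_fitness(xs, ys):
    # xs, ys: logged x/y torso positions for t = 1..100.
    # Eq. (1): distance between the first and last positions.
    return math.hypot(xs[0] - xs[-1], ys[0] - ys[-1])
```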
In order to test the efficacy of the intra-node mutations, an
initial graph de novo constructor was used. An example
initial graph can be seen in Fig. 1 and is formed of three
nodes—a sensor that is used for activation only, i.e. with no
output, and two motor nodes, which operate the same
motors. The activation and delay times were initialised to
create an oscillating effect by alternating between motor
angles set to opposite directions in each motor node.
In the first evolutionary run of this system, only the
motor angles and the delay and activation times of the
action graphs were mutable, not the graph topology. This
meant that fitness could only be improved through
increasing the speed and strength of oscillations during the
fixed time period. Each trial lasted for 50 time steps of
0.1 s, with the simulated NAO robot beginning from a
resting supine position, as illustrated in Fig. 2. Evolution
was performed for 250 generations, each consisting of 10
binary tournaments and 20 evaluations. The fitness of the
individuals at every 10 generations is shown in Fig. 4a.
There is a clear, although not dramatic, increase in fitness
(distance travelled) over the course of evolution. Fitness
improvements occur after 100 and 120 generations, where
the best solutions move from a distance of around 0.3 m to
close to 0.4 and then close to 0.5 m. The initial setup
conditions create a largely symmetrical rocking motion that
moves the robot slowly towards its left. The best behaviour
after 50 generations rocks the robot with much greater
force, achieved by raising the legs higher, which causes it
to move further to the left over the course of a time trial. At
100 generations, the legs move higher and are raised for
more time, which causes the robot to rotate further, and
thus increases the distance travelled. At 150 generations, it employs the same behaviour, moving to its right, but with the legs oscillating much more quickly. This final behaviour appears to be a local optimum for the robot.

Fig. 1 Showing a hand-designed controller graph, where sensor nodes (pentagrams) provide activation signals only. Edge numbers indicate delay times, and the integer within a motor node (rectangle) denotes the number of steps it remains active for. Motors are labelled in the motor node (red). Motor key: HY: Head Yaw, HP: Head Pitch, LSP: L Shoulder Pitch, LSR: L Shoulder Roll, LEY: L Elbow Yaw, LER: L Elbow Roll, LWY: L Wrist Yaw, LH: L Hand, LHYP: L Hip Yaw Pitch, LHR: L Hip Roll, LHP: L Hip Pitch, LKP: L Knee Pitch, LAP: L Ankle Pitch, LAR: L Ankle Roll, RHYP: R Hip Yaw Pitch, RHR: R Hip Roll, RHP: R Hip Pitch, RKP: R Knee Pitch, RAP: R Ankle Pitch, RAR: R Ankle Roll, RSP: R Shoulder Pitch, RSR: R Shoulder Roll, REY: R Elbow Yaw, RER: R Elbow Roll, RWY: R Wrist Yaw, RH: R Hand

Fig. 2 Showing the resting position of the NAO, which it was reset to at the start of each trial. Images are from different angles of the same scene. a Front. b Side
In a second evolutionary run, the complexity of node
mutations was increased to make motor identities mutable, but again with graph topology held fixed. The same
starting conditions, i.e. the node structure and parameters,
were used as in the first run. The fitness of individuals
over the course of evolution is shown in Fig. 4b. Here, a
similar performance to the first run is achieved after only
70 generations, and by 100 evaluations the system
exceeds the best results found on the first run. After 50
generations the behaviour looks similar to that exhibited
by the control structures found in the first run. The robot
rotates from side to side, biased towards movements on
its left side. The action graphs in Fig. 3 show that after 50
generations, five left sided motors and three right sided
motors are employed. After 100 generations, the behaviour of the best compartment has changed. The left leg
now extends quite far and then drags the robot towards its
left side. Looking at the node structure, the second node
is now composed entirely of left sided motors. At 200
generations the right leg also extends and a stepping
motion emerges. The behaviour of the best controller after
100 and 5,000 evaluations is shown in video V1 in the
Supplementary Material, both in simulation and on the
real robot.
3.2 Evolving headstands in a humanoid robot
In the second experiment, bags of networks were evolved
for 3,000 binary tournaments (6,000 evaluations) with all
of the variation operators described in Sect. 2.3 above
applied in the evolutionary process. Fitness was obtained
by maximising the sum of the z-accelerometer sensor in the
torso of NAO over the course of 150 time steps,
f(x) = \sum_{t=1}^{150} z_t    (2)

Fig. 3 Showing the graphical structure of the best controller graphs after being evolved for different numbers of generations on the second evolutionary run of the maximise distance task. Sensor nodes (pentagrams) provide activation signals only. Edge numbers indicate delay times, and the integer within a motor node (rectangle) denotes the number of steps it remains active for. Refer to Fig. 1 for the motor key. a Gen 50. b Gen 100. c Gen 150. d Gen 200. e Gen 250

Fig. 4 Showing the fitness of action bags of networks over 250 generations of evolution on the maximise distance task. a Activation and Delay time and motor angle mutations only. b Activation and Delay time, motor angle and motor mutations

Fig. 5 Showing the fitness of controller graphs over 300 generations of evolution on the maximise z-accelerometer task
Maximal fitness would be attained by the robot standing
on its head. Figure 5 shows the fitness of each evaluated
member of the population at each pseudo-generation of 10
binary tournaments. The bags of networks are clearly
increasing in fitness over the course of evolution. There is
an especially large jump over the first 50 generations, and
then a more gradual but positive ascent throughout the
rest of evolution. After 32 generations have passed there
is a sudden drop in fitness. This occurred because the
NAO moved into a position on its side where it was
balanced and unable to return to the rest position on its
back. In this situation compartments are now selected in
effectively a different environment. After 36 generations,
the NAO moves out of this position and can reset normally again.
The bag of networks representing the fittest individual
found at the end of evolution can be seen in Fig. 6. The
compartment is made up of three disconnected graphs that
get activated and produce movement over the course of a
trial, labelled as Groups A–C, with a small number of
nodes that will lie dormant (never activated). Among the
dormant nodes there is a sensor node connected to a linear
transform node that will be activated but will have no
effect on the behaviour of the robot. The fitnesses obtained
by the three separate active groups can be seen in Table 1.
The behaviour of each entry in the table can be seen in
Video V2. The compartment containing all three groups
performs the best, and the fitness is considerably greater
than any one group on its own, showing that the disjoint
graphs do interact to create a more adaptive behaviour than
any single disjoint component of the bag of networks. The
final robot stance obtained by this compartment can be seen
in Fig. 7a. Group A, the largest disjoint graph, has the
highest fitness when considered on its own. It causes both
of NAO’s legs to move, as well as its right arm, which
causes it to move backwards into an arch-like position,
raising its torso and so increasing its z-accelerometer reading.
Group B also scores well by itself, mainly moving NAO’s
legs, with a larger emphasis on the right one. This leads it
to move its torso backwards. Group C does not produce
much movement when starting from a resting position. The
combination of Groups A and B moves both legs far back
underneath the torso, arching NAO backwards at a sharp
angle, as can be seen in Fig. 7b. With all three groups
together, the robot moves its legs even further backwards,
drawing the torso even closer to a vertical position.

Fig. 6 Showing the compartment that represents the best individual found on the maximise z-accelerometer task. There are three disjoint connected graphs that get activated during a run, and these have been highlighted and labelled A, B and C. Pentagrams show sensor nodes with the key denoted below, triangles represent linear transform nodes with a maximum of five outputs, diamonds Euclidean distance nodes, stars K-means nodes and rectangles motor nodes with motors denoted inside (refer to Fig. 1 for the motor key). Sensor key: HY Head Yaw, LKP L Knee Pitch, RER R Elbow Roll, RHP R Hip Pitch, GY Gyroscope Y, ZA Accelerometer Z, DZ Distance Z

Table 1 Fitnesses of the individual and combined disjoint graphs from the best compartment found on the maximise z-accelerometer task, labelled as shown in Fig. 6

Group                                            Fitness
A                                                575
B                                                459
C                                                272
A + B                                            772
A + C                                            634
B + C                                            590
A + B + C                                        807
A + B + C (Z-accelerometer sensors changed
to X-accelerometer sensors)                      467
The two highest scoring groups (A and B) both make
use of the Z-accelerometer sensor (143 in Fig. 6). To test
the impact of this sensor, the action graphs were modified
to replace all occurrences of the Z-accelerometer sensor
with the X-accelerometer, which had the effect of oscillating the robot’s torso from side to side, leading to a much
reduced fitness score. The final stance produced by the
modified compartment can be seen in Fig. 7c. This demonstrates that a true closed-loop solution has been found.
An interesting variant of the above task involves initialising the robot from the prone position rather than the
supine position. Figure 8 shows that evolving to maximise
z from the prone position is much faster and more effective
than evolving to maximise z from a supine position. Video
V3 shows the best behaviour evolved after 1,400 evaluations. The solution is quite different: it exploits the initial position of the NAO and is elegantly simple.

Fig. 7 Showing the final position of the NAO after different disjoint graphs are activated together to control the robot, labelled as shown in Fig. 6. In a graphs A, B and C are activated together, in b graphs A and B are activated and in c graphs A, B and C are activated with the z-accelerometer sensor substituted for the x-accelerometer. a A, B and C. b A and B. c A, B and C (X-accel substituted for Z-accel)

Fig. 8 Evolution to maximise the z-accelerometer from the prone position. a There is a punctuated increase in fitness. b The final pose obtained is of high fitness. a Evolutionary fitness dynamics. b Final pose obtained
3.3 Evolving bouncing behaviour in a humanoid robot
on a Jolly Jumper
In the third set of experiments, a physical NAO robot is
placed in a Jolly Jumper infant bouncer (see Fig. 9), which
consists of a harness attached to a stand by a spring. This is
influenced by Thelen’s work [64], which describes how an
infant placed in a bouncer discovers configurations of
movement dynamics that provide acceptable solutions and
then tunes these to produce efficient bouncing. The robot is
placed in the harness and suspended with the tips of its feet firmly touching a cushion on the ground in the first
experiment, and just touching the floor in the second. The
fitness function maximises the first derivative of the z-accelerometer (z) at each time step (t),

f(x) = \sum_{t=1}^{T} |z_{t+1} - z_t|    (3)
Each trial lasts for T = 100 time steps of 100 ms each. At
the end of the trial, the robot’s joints are relaxed and it rests
for 3 s. It may not return to the same position at the start of
each trial, and the spring will not be completely damped from the previous evaluation, which introduces noisy fitness
evaluations.
3.3.1 Fixed topology
In the first experiment, a Microbial GA [25] with a population of 10 is applied for 800 evaluations to a fixed
graphical structure, evolving only the parameters. The
compartment can be seen in Fig. 10, and consists of four
Dynamic Movement Primitive (DMP) [49] nodes, connected to
the knee and ankle motors, and receiving afferent signals
from the motors as input. Additionally, two separate graphs
take the force sensitive resistors (FSR) from each foot as
input, and output to the ankle and knee motors for their
respective sides of the body. These graphs will take control
if the FSR signal is above an evolvable threshold. Figure 11 shows the fitness of the population over the course
of evolution. The run proceeds in several stages, which are shown in Video V4. Initially a kicking motion develops that enables fitness to increase quickly. After 120 evaluations, the feet get stuck on the cushion, and the robot is unable to use its previous motion. A new strategy emerges, moving the knee as far back as possible before kicking. After 200 evaluations, with the feet now unstuck, a new solution is found, moving the left knee far back and relying on a fast kicking motion in the right leg to bounce. After 350 evaluations, the cushion is removed, and the robot is now suspended above the ground at the start of each trial. Fitness falls sharply, as previous strategies relying on contact with the ground are much less successful. Fitness quickly improves, and solutions adapt to this new condition, with the knees travelling much greater distances during each oscillation. After 500 evaluations fitness returns to the previous level and is then quickly surpassed, with the best solutions achieving a fitness of over 9. This shows that the presented architecture is able to adapt quickly in a dynamic environment.

Fig. 9 The NAO robot in a Jolly Jumper infant bouncer

3.3.2 Evolved topology

A second experiment explores a similar task but with an evolved topology. A more complex evolutionary process is employed in this experiment, inspired by NEAT [56]. A population is seeded with 20 identical copies of a small graph consisting of a DMP node connected to the left knee and receiving an afferent signal as input. This is initialised to perform a basic oscillatory motion. As in NEAT, each node is assigned a unique id. An individual, i, is sequentially compared to each member, j, of each species using the similarity measure

s(i, j) = \frac{1}{4}\left[s_s(i, j) + s_p(i, j) + s_m(i, j) + s_{shc}(i, j)\right]    (4)

where s_s(i, j), s_m(i, j) and s_p(i, j) are functions which return the number of shared sensors, motors and processing node functions respectively, and s_shc(i, j) returns the number of shared connections between nodes with the same unique id. If s(i, j) is below a threshold, s_thresh (0.5 in this experiment), i is assigned to the species of j and the search is halted. Otherwise, i is assigned to a new species. The fitness score of each individual is divided by the number of individuals in its species. A mutation count, mut_c, initialised at 0, is also assigned to each new individual, and incremented if a mutation event occurs. Structural mutations are only permitted if mut_c > 3. In this way, innovations are protected and behavioural diversity is encouraged in the population. Additional variation operators pertaining to the DMP nodes are also included in this experiment.
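A sketch of this speciation step is given below. It assumes each individual exposes sets of sensors, processing functions, motors and id-matched connections; these attribute names are an illustrative reading of the four s terms, not the paper's implementation.

```python
def similarity(i, j):
    # Eq. (4): mean of the four shared-component counts.
    return 0.25 * (len(i.sensors & j.sensors)
                   + len(i.processing & j.processing)
                   + len(i.motors & j.motors)
                   + len(i.connections & j.connections))

def assign_species(i, species_list, s_thresh=0.5):
    # i joins the species of the first member j with s(i, j) below the
    # threshold, as described in the text; otherwise it founds a new
    # species. Fitness is then shared within each species.
    for species in species_list:
        for j in species:
            if similarity(i, j) < s_thresh:
                species.append(i)
                return
    species_list.append([i])
```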
Fig. 10 The fixed topology compartment used in the first Jolly
Jumper experiment. Pentagrams denote sensor nodes (LFSR left force
sensitive resistor, RFSR right force sensitive resistor), triangles denote processing nodes (DMP dynamic movement primitive, LT linear
transform), rectangles denote motor nodes. Disjoint graphs are
separated by broken lines
Fig. 11 Showing the fitness of solutions at every 10 evaluations, over 800 evaluations on the first Jolly Jumper experiment
Disjoint graphs containing a DMP have an explicit probability of duplicating, and a DMP in one graph can be
replicated and replace an existing DMP in another.
Evolution was again run directly on a real NAO robot,
over the course of 11 h. Figure 12 shows the fitness of each
individual in the population after each evaluation. There is
a clear improvement in competence on this task from the
individuals in the initial population to those at the end of
the run. The initial behaviour that the population was
seeded with provided an acceptable fitness of between 3
and 5. After around 500 evaluations, structural mutations
begin to have a noticeable effect on fitness. There is a
climb in fitness from evaluations 500 to 1,000, where the
best individuals improve from a fitness of 5 to 6.5, and again
from evaluations 1,100 to 1,500, where the fitness of the
best increases to around 8.5. Figure 13 shows the compartment of the best individual from the final population.
As with the compartment shown in Fig. 6, it contains
redundancy, with several graphs lacking sensors or motors.
The main driving force is a single DMP, which is used to
control the left hip, right hip, left shoulder and right knee.
A second graph also controls the right knee, which can
reset actions sent from the DMP, and create additional
oscillatory movements. This solution produces rapid
kicking motions, enabling it to move quickly along the
longitudinal axis, and thus attain fast changes in its
z-accelerometer sensor.
Fig. 12 Showing the fitness of all solutions over 11 h of evaluations on the second Jolly Jumper experiment

4 Discussion

An outline and some preliminary investigations of a modular and evolvable control architecture for high-dimensional robotic systems have been described. The results show that using this architecture, interesting
behaviour can emerge on a number of tasks using only a
small number of evaluations. The majority of the paper
describes the generative operators we have chosen as a first
attempt to study ontogenic cognitive evolvability. We have
only skimmed the surface of node types and variability
operators that are possible within this general formulation.
This architecture opens an entire research program about
what kinds of operator and node allow lifetime evolvability of adaptive behaviour, i.e. about the grammar of
thought and the grammar of action. Our work can be seen
as an action fluid grammar akin to the work of Luc Steels
on Fluid Construction Grammar for language, but in the
domain of behaviour [19]. A related approach is that of the
‘‘options framework’’ [58], however no algorithm has so
far been proposed that is capable of effective stochastic
search over option space. Our framework is richer than the
options framework and does show how stochastic search
over an unlimited behaviour space could take place.
The unlimited behaviour space is both a blessing and a
curse for the architecture in its present form. As seen in the
experiments that allow free topologies for the headstand
and Jolly Jumper tasks, the evolved networks can suffer
from redundant nodes and bloated topologies. The methods
inspired by NEAT described in Sect. 3.3.2 can encourage diversity and protect innovations that must be given time to develop. They also place constraints that encourage
slow growth. However, this alone does not seem to be
enough to prevent bloat and it is clear that extra constraints
must be added to the system. There are several constraints
that should be further explored. For example, one could
impose restrictions on which nodes can be connected
together, or weight the probabilities of new connecting
nodes, to make certain node combinations more likely to
occur. Putting a cost on adding connections and nodes
could go a long way towards preventing bloated structures
and redundant nodes. This regularisation could also produce more modular networks [10]. Learning Classifier
Systems such as XCS [68] have been designed to produce
specialist classifiers. Although they do not deal with bloat
and redundancy in terms of the structure of graphs, they
can prevent several graphs responding to the same sensory
input and producing similar actions. An interesting extension to the current work would be to use the nodes and
variation operators described in this paper to evolve
graphical classifiers in an XCS-style system. While this
could create complex controllers, a system would have to
be designed to allow several classifiers to operate in parallel, perhaps by extending Dorigo and Colombetti’s work
[14].

Fig. 13 Showing the best compartment found at the end of evolution on the second Jolly Jumper experiment. Pentagrams denote sensor nodes, triangles denote processing nodes (DMP dynamic movement primitive, LT linear transform), rectangles denote motor nodes. Disjoint graphs are separated by broken lines
Other constraints can be imposed on node types and
variation operators from the child development literature,
especially the work under the umbrella of Esther Thelen’s
dynamical systems approach [64]. For example we know
how rhythmic movements are assembled from component
oscillators in models that fit child motor control. Motor
synergies of various kinds limit the degrees of freedom in a
control problem, for example fixed joint velocity ratios
between shoulder and elbow. There is also progressive
uncoupling of joint rotations, i.e. freezing and unfreezing
of couplings, which could be introduced into the system.
As well as improving the system by making it more
efficient and evolvable and extending the library of node
types, there are two main areas that should be explored
with the architecture described in this paper—(1) the
accumulation of adaptive behaviours and (2) the co-evolution of behaviours and tasks. For the first issue, we have
been exploring the use of an archive to store learnt
behaviours. Both individual graphs and full bags of networks
could be added to a solution from the archive over the
course of evolution. In this way learnt skills can be
remembered without catastrophic forgetting and integrated
into new solutions.
As described in Sect. 2.4 above, tasks can be evolved in
the framework through the use of special goal nodes, and
all fitness functions in the experiments above have been
encoded in this way. An idea that is currently being
explored is the co-evolution of controller and task compartments. A population (or island) of controllers is
assessed based on its performance on a particular task. By
having several islands, each with a different goal network,
we are able to assess and evolve the goal networks themselves. For example, a binary tournament could take place
between two islands, with the goal network and/or population of the winner replacing the loser, with mutation of
the goal network. The difficulty is finding the appropriate
function to evaluate the goal networks. Initially we have
tried competence progress, the improvement in fitness of
controllers from the start to the end of a fixed number of
evolutionary events. This was found to lead to trivial tasks,
as goal networks could be trivially modified to enable easy
adaptation by the controllers. Additional measures are
certainly needed, with novelty being a promising candidate
[35].
5 Conclusion
A system has been introduced that allows complex control
motifs to be evolved, producing interesting behaviour on a
number of tasks demonstrated both in simulation and on a
physical robot. It remains to be seen whether the
architecture is capable of the accumulation of adaptation
and transfer learning across multiple tasks by the use of an
archive of control motifs. A challenge for the utility of the
architecture is simultaneously its strength, i.e. in its present
form it is too rich. A number of methods for constraining
the system have been outlined and a detailed investigation
of their integration into the architecture is needed. One
aspect missing from the architecture currently is gating and
channeling at the inputs and outputs. Basic switching has
been implemented but evolvable control flow operations
are also needed. This adds an even greater complexity to
the evolvability problem.
The system supports the evolution of tasks (fitness
functions) as well as control structures, and so it is possible
for it to be applied as an evolutionary method for intrinsic
motivation. If the issues surrounding bloated networks can
be controlled and an evaluation method introduced to
evolve non-trivial goal tasks, the architecture described in
this paper could be used to produce open-ended cognition.
Acknowledgments The work is funded by the FQEB Templeton
grant ‘‘Bayes, Darwin and Hebb’’, and the FP-7 FET OPEN Grant
INSIGHT.
References
1. Baldassarre G, Mirolli M (2013) Intrinsically motivated learning
in natural and artificial systems. Springer, New York
2. Baranes A, Oudeyer PY (2013) Active learning of inverse models
with intrinsically motivated goal exploration in robots. Robot
Auton Syst 61(1):49–73
3. Bellas F, Duro RJ, Faiña A, Souto D (2010) Multilevel darwinist
brain (mdb): artificial evolution in a cognitive architecture for
real robots. IEEE Trans Auton Mental Dev 2(4):340–354
4. Brooks RA (1990) Elephants don’t play chess. Robot Auton Syst
6(1):3–15
5. Bull L, Kovacs T (2005) Foundations of learning classifier systems, vol 183. Springer, New York
6. Butz MV, Herbort O (2008) Context-dependent predictions and
cognitive arm control with xcsf. In: Proceedings of the 10th
annual conference on Genetic and evolutionary computation,
ACM, New York, pp 1357–1364
7. Calabretta R, Nolfi S, Parisi D, Wagner GP (2000) Duplication of
modules facilitates the evolution of functional specialization.
Artif Life 6(1):69–84
8. Chklovskii D, Mel B, Svoboda K (2004) Cortical rewiring and
information storage. Nature 431:782–788
9. Cliff D, Ross S (1994) Adding temporary memory to zcs. Adapt
Behav 3(2):101–150
10. Clune J, Mouret JB, Lipson H (2013) The evolutionary origins of
modularity. Proc R Soc B Biol Sci 280(1755):20122863
11. Crapse TB, Sommer MA (2008) Corollary discharge across the animal kingdom. Nat Rev Neurosci 9(8):587–600
12. Dearden A, Demiris Y (2005) Learning forward models for
robots. In: IJCAI, vol 5, p 1440
13. Der R, Martius G (2012) The playful machine. Cognitive systems
monographs. Springer, New York
14. Dorigo M (1998) Robot shaping: an experiment in behavior
engineering. MIT press, Cambridge
15. Durr P, Mattiussi C, Floreano D (2010) Genetic representation
and evolvability of modular neural controllers. IEEE Comput
Intell Mag 5(3):10–19
16. Eaton M (2013) An approach to the synthesis of humanoid robot
dance using non-interactive evolutionary techniques. In: IEEE
international conference on systems, man, and cybernetics
(SMC), pp 3305–3309. IEEE
17. Eaton M, Davitt TJ (2007) Evolutionary control of bipedal
locomotion in a high degree-of-freedom humanoid robot: first
steps. Artif Life Robot 11(1):112–115
18. Farchy A, Barrett S, MacAlpine P, Stone P (2013) Humanoid
robots learning to walk faster: from the real world to simulation
and back. In: Proceedings of the 2013 international conference on
autonomous agents and multi-agent systems, pp 39–46. International foundation for autonomous agents and multiagent systems
19. Fernando C, Zachar I, Szathmáry E (2010) Linguistic constructions as neuronal replicators. In: Steels L (ed) Fluid construction grammar. John Benjamins, Oxford
20. Fodor J, Pylyshyn Z (1988) Connectionism and cognitive architecture: a critical analysis. Cognition 28:3–71
21. Gomez F, Miikkulainen R (1997) Incremental evolution of
complex general behavior. Adapt Behav 5(3–4):317–342
22. Gordon G, Ahissar E (2012) A curious emergence of reaching. Lecture notes in computer science, vol 7429, chap 1. Springer, Berlin, pp 1–12
23. Gordon G, Ahissar E (2012) Hierarchical curiosity loops and active sensing. Neural Netw 32:119–129
24. Haruno M, Wolpert D, Kawato M (2001) Mosaic model for
sensorimotor learning and control. Neural Comput 13:2201–2220
25. Harvey I (2011) The microbial genetic algorithm. In: Advances in
artificial life. Darwin meets von Neumann. Springer, New York,
pp 126–133
26. Harvey I, Husbands P, Cliff D (1994) Seeing the light: artificial
evolution, real vision. School of Cognitive and Computing Sciences, University of Sussex, Falmer
27. Hurst J, Bull L (2006) A neural learning classifier system with
self-adaptive constructivism for mobile robot control. Artif Life
12(3):353–380
28. Husbands P, Harvey I, Cliff D (1993) An evolutionary approach
to situated ai. In: Proceedings of the 9th Bi-annual conference of
the society for the study of artificial intelligence and the simulation of behaviour (AISB 93), pp 61–70
29. Izhikevich EM (2006) Polychronization: computation with spikes. Neural Comput 18(2):245–282
30. Jordan MI, Rumelhart DE (1992) Forward models: supervised
learning with a distal teacher. Cogn Sci 16(3):307–354
31. Koos S, Cully A, Mouret JB (2013) Fast damage recovery in
robotics with the t-resilience algorithm. Int J Robot Res
32(14):1700–1723. doi:10.1177/0278364913499192
32. Koza JR (1992) Evolution of subsumption using genetic programming. In: Proceedings of the 1st European conference on
artificial life, pp 110–119
33. Koza JR (1999) Genetic programming III: darwinian invention and problem solving. Morgan Kaufmann, San Francisco
34. Lalazar H, Vaadia E (2008) Neural basis of sensorimotor learning: modifying internal models. Curr Opin Neurobiol 18(6):573–581
35. Lehman J, Stanley KO (2011) Abandoning objectives: evolution
through the search for novelty alone. Evol Comput
19(2):189–223
36. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal control, 3rd edn.
Wiley, Hoboken
37. Li H, Liao X, Carin L (2009) Multi-task reinforcement learning
in partially observable stochastic environments. J Mach Learn Res 10:1131–1186
38. Mckay RI, Hoai NX, Whigham PA, Shan Y, ONeill M (2010)
Grammar-based genetic programming: a survey. Genet Program
Evol Mach 11(3–4):365–396
39. Miller JF, Khan GM (2011) Where is the brain inside the brain?
Memet Comput 3(3):217–228
40. Miller JF, Thomson P (2000) Cartesian genetic programming. In:
Genetic programming. Springer, New York, pp 121–132
41. Mohamed Z, Kitani M, Kaneko Si, Capi G (2014) Humanoid
robot arm performance optimization using multi objective evolutionary algorithm. Int J Control Autom Syst 12(4):870–877
42. Moriguchi H, Lipson H (2011) Learning symbolic forward
models for robotic motion planning and control. In: Proceedings
of European conference of artificial life (ECAL 2011),
pp 558–564
43. Mouret JB, Doncieux S (2009) Overcoming the bootstrap problem in evolutionary robotics using behavioral diversity. In: IEEE
congress on evolutionary computation, 2009. CEC’09,
pp 1161–1168. IEEE
44. Oudeyer PY, Kaplan F, Hafner VV (2007) Intrinsic motivation
systems for autonomous mental development. IEEE Trans Evol
Comput 11(2):265–286
45. Pape L, Oddo CM, Controzzi M, Cipriani C, Förster A, Carrozza MC, Schmidhuber J (2012) Learning tactile skills through curious exploration. Front Neurorobot 6
46. Poli R (1996) Parallel distributed genetic programming. Citeseer
47. Rozenberg G (ed) (1997) Handbook of graph grammars and computing by graph transformation, vol 1: foundations. World Scientific, Singapore
48. Savastano P, Nolfi S (2013) A robotic model of reaching and
grasping development
49. Schaal S (2006) Dynamic movement primitives-a framework for
motor control in humans and humanoid robotics. In: Adaptive
motion of animals and machines. Springer, New York,
pp 261–280
50. Schaal S, Peters J, Nakanishi J, Ijspeert A (2005) Learning
movement primitives. In: Robotics research. Springer, New York,
pp 561–572
51. Schmidhuber J (2009) Driven by compression progress: a simple
principle explains essential aspects of subjective beauty, novelty,
surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. In: Anticipatory behavior in adaptive learning
systems. Springer, New York, pp 48–76
52. Schrum J, Miikkulainen R (2014) Evolving multimodal behavior with modular neural networks in ms. pac-man. In: Proceedings of the genetic and evolutionary computation conference (GECCO 2014). Vancouver, BC, Canada, pp 325–332. http://nn.cs.utexas.edu/?schrum:gecco2014. Best Paper: Digital Entertainment and Arts
53. Shadmehr R (2004) Generalization as a behavioral window to the neural mechanisms of learning internal models. Hum Mov Sci 23:543–568
54. Sporns O, Edelman GM (1993) Solving bernstein’s problem: a proposal for the development of coordinated movement by selection. Child Dev 64:960–981
55. Stanley KO, D’Ambrosio DB, Gauci J (2009) A hypercube-based encoding for evolving large-scale neural networks. Artif Life 15(2):185–212
56. Stanley KO, Miikkulainen R (2002) Evolving neural networks through augmenting topologies. Evol Comput 10(2):99–127
57. Steels L, De Beule J (2006) Unify and merge in fluid construction grammar
58. Stolle M, Precup D (2002) Learning options in reinforcement learning. Lecture notes in computer science, vol 2371, chap 16. Springer, Berlin, pp 212–223
59. Studley M, Bull L (2005) X-tcs: accuracy-based learning classifier system robotics. In: The 2005 IEEE congress on evolutionary computation, vol 3, pp 2099–2106. IEEE
60. Sturm J, Plagemann C, Burgard W (2008) Unsupervised body scheme learning through self-perception. In: IEEE international conference on robotics and automation, 2008. ICRA 2008, pp 3328–3333. IEEE
61. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press, Cambridge, MA
62. Teller A, Veloso M (1995) Program evolution for data mining. Int J Exp Syst Res Appl 8:213–236
63. Teller A, Veloso M (1996) Neural programming and an internal reinforcement policy. In: Late breaking papers at the genetic programming 1996 conference, pp 186–192. Citeseer, New Jersey
64. Thelen E (1995) Motor development: a new synthesis. Am Psychol 50(2):79
65. Todorov E, Jordan MI (2002) Optimal feedback control as a theory of motor coordination. Nat Neurosci 5(11):1226–1235. doi:10.1038/nn963
66. Togelius J (2004) Evolution of a subsumption architecture neurocontroller. J Intell Fuzzy Syst 15(1):15–20
67. Urzelai J, Floreano D, Dorigo M, Colombetti M (1998) Incremental robot shaping. Connect Sci 10(3–4):341–360
68. Wilson SW (2000) Get real! xcs with continuous-valued inputs. In: Learning classifier systems. Springer, New York, pp 209–219