ARTICLE
International Journal of Advanced Robotic Systems

The Evolution of Motivated and Modulated Robot Selection
Regular Paper

Fernando Montes-Gonzalez1,* and Carlos M. Contreras2,3

1 Facultad de Física e Inteligencia Artificial, Universidad Veracruzana, Xalapa, Veracruz, Mexico
2 Laboratorio de Neurofarmacología, Instituto de Neuroetología, Universidad Veracruzana, Xalapa, Veracruz, Mexico
3 Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, Xalapa, Veracruz, Mexico
* Corresponding author E-mail: [email protected]

Received 3 Jul 2012; Accepted 2 Oct 2012
DOI: 10.5772/53991

© 2013 Montes-Gonzalez and Contreras; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract: In this paper we focus on the development of a selection model that allows a robot to select pre-evolved behaviour in a motivated environment. A collection task is set in an arena where the Khepera robot has to collect cylinders, which simulate food. Two basic motivations, labelled 'obesity' and 'anorexia', affect the selection of the behavioural repertoire. Additionally, this model makes it possible to modulate selection to be either more or less selective, depending on the tonic value of the simulated 'mesolimbic/cortical tone'. Next, we use evolution to optimize the motivated selection network employed for behavioural switching. Finally, we describe the results of varying, by hand and by evolution, the levels of reward set by the mesolimbic-cortical tone.

Keywords: Computational Systems Biology

1. Introduction

The problem of action selection is similar to one found in ethology, where it is identified as a behaviour-switching problem. The selection of a set of actions, which together form a single behavioural pattern, is mostly observed when several behavioural modules try to gain access to the motor plant. However, an internal arbitration scheme (action selection) is necessary to choose one behavioural module at a time, until completion or until its execution proves ineffective. The behaviour-based approach has been implemented by Artificial Intelligence researchers [1,2], who followed a methodology based on the modular decomposition of complex behaviour, to model an example of the action selection problem in robotics. In this approach behavioural modules are developed as separate components that can be executed depending on internal and external elements, which are evaluated for execution by an Action Selection Mechanism (ASM). Commonly, both the interaction between modules and the modules themselves are hand-coded [3]. In this work, these kinds of modules are implemented as neural networks, programming routines and a mixture of both. Final selection parameters are optimized by the use of artificial evolution. In our previous work we have employed hand-coded selection to produce regular patterns of behaviour [4]. Furthermore, we have used coevolution to optimize both behaviour and selection, which provided an economy in the selection of modules and produced non-regular patterns of behaviour [5].
In our previous work we let the ASM evolve without the use of internal motivations, to facilitate optimization and avoid instability in the system. In our current work we set a foraging task where two simple motivations ('obesity' and 'anorexia') contribute to the calculation of the salience signals through which competing behavioural modules try to gain execution of the motor effectors of the robotic agent. Furthermore, selection may regulate the strength of the motor output and the permissiveness in the selection of multiple behavioural outputs, by changing the tonic value of a parameter denominated the mesolimbic-cortical tone.

1.1 Evolutionary Robotics

Evolutionary Robotics is a methodology that uses evolutionary computation to develop controllers for autonomous robots [6]. Genetic algorithms and neural networks are the natural candidates in this methodology for developing complete solutions derived from single evolved neural controllers. These controllers are individuals of populations under evaluation, which produce better-adapted solutions through a series of computer-program iterations. Individuals may evolve together in incremental steps by the use of staged evolution, focusing on the development and refinement of single prominent features. On the other hand, pairs or groups of individuals can coevolve together. Therefore, similar to its biological counterpart, behavioural coevolution means that features are 'inhibited' or 'promoted' during the evolution of single individuals and are also affected by changes in the evolution of related individuals within the group [7].

1.2 Action Selection in the Vertebrate Brain

Action selection can be characterized as the problem of deciding what to do next: how to choose a pertinent action from a behavioural repertoire. In ethology the action selection problem is related to decision-making, where a behavioural pattern is expressed until completion or until another behavioural pattern takes over and the animal engages in a new behavioural activity. Recent works support the idea that specific structures in the vertebrate brain play an important role in central action selection [8]. The basal ganglia act as a relay station in the planning and execution of movements (behaviour), gathering information from the cortex, including the motor cortex. These brain structures are also able to mediate cognitive and muscular processes. The basal ganglia, together with the cerebellum and the sensory cerebrum, are able to veto muscular contraction by denying the motor areas sufficient activation. In [9] the author set out the initial development of a robot basal ganglia model that was implemented in a Khepera robot [10] within a foraging task. Then, a computational model of the intrinsic basal ganglia circuitry [11] was employed to arbitrate between hand-coded behavioural modules. Additionally, the robot basal ganglia model was driven by simple motivations labelled 'fear' and 'hunger', and it was shown that setting different levels of simulated 'dopamine' could modulate selection. Later on, behavioural modules were developed as evolvable modules and embedded within robot basal ganglia selection [12]. In this paper, we employ a motivated and modulated central action selection model that provides similar functionality to that of the robot basal ganglia model.

1.3 Dopamine Modulation of Action Selection

The neuromodulator dopamine is known to play an important role in animal behaviour switching.
Abnormal dopamine levels are also related to human disorders of the basal ganglia; for instance, a reduced dopamine (DA) condition is critical in Parkinson's disease. Behavioural similarities have been pointed out between animals treated with the dopamine blocker haloperidol and those treated with the dopamine agonist amphetamine. Several researchers [13-15] have provided helpful insights into the modulatory role of dopamine in animal behaviour switching, as mediated by the basal ganglia. These brain structures are also implicated in many mental illnesses, and abnormal DA levels have been identified in some of them, including Parkinson's disease [16], Huntington's disease [17], Tourette's syndrome [18,19] and schizophrenia [20,21]. In the basal ganglia dopamine has effects in both tonic (continuous) and phasic (intermittent) ways. Where it is released tonically, for instance in projections from the substantia nigra, reduced levels of extracellular DA can act as a 'brake' on the systems to which it projects [15]. On the other hand, DA is released in a phasic manner when DA neuron firing is triggered by behaviourally meaningful stimuli [22]. Here we focus on the effects of relatively long-term changes in tonic dopamine levels. Redgrave et al. [15] summarize a range of evidence suggesting that slight increases in tonic DA activity tend to promote animal behavioural switching, while equivalent decreases in transmission impede switching. Other known effects involved in switching include a change in dominance relations between behaviours, changes in the variability of behaviour and failure to complete behaviours. For example, when a rat is placed in a novel environment it usually investigates its surroundings in a fairly systematic fashion. At first the rat will fear the novel environment and restrict its exploration to the perceived safety of the walls of the open field. As it habituates to its surroundings it investigates further afield. When the rat is motivated to eat, in a normal condition, it picks up the food with its forepaws and eats from one pellet until it is finished and then leaves the area. There are several known effects of using DA agonists and antagonists in experiments of this type. For instance, Salamone [23] found that administration of haloperidol causes fragmented and disorganized responses. On the other hand, high doses of amphetamine prevent the process of habituation and can produce psychosis-like effects in the animal; this is exhibited as stereotypy, such as continuous sniffing in rats [24]. In his work Montes [9] presented an example of varying the effects of simulated dopamine in the robot basal ganglia model. This implementation of the basal ganglia showed that less dopamine in the robot caused early interruption and failure to complete behaviours, as well as less exploration and interruption of the usual movement of the robotic arm. In a high dopamine condition, abrupt movements of the robot arm and a constant lifting and lowering of the arm while the robot was on the move were identified.

2. Evolution and Design of Behavioural Modules

In our experiments the development of selection and behaviour was carried out using staged evolution. The initial step was to evolve behaviour; then the modulation parameters of selection were tuned by artificial evolution. In general, evolution was carried out in a similar fashion for both behaviour and modulated selection.
Initially, behavioural patterns of the Khepera robot were evolved in the Webots robot simulator [25] and later they were further evolved on the real Khepera robot. The infrared sensors of the Khepera are distributed around the body and directed towards the frontal part of the robot; two DC motors control the movement of the wheels. An additional gripper turret can be attached to the body of the Khepera. The arm of the gripper has two degrees of freedom with encoders for determining its position, and two sensors in the gripper hand for detecting the presence and the resistivity of a collected item. Next, we set a foraging task in a square walled arena where the robot has to collect simulated 'food' in the form of wooden cylinders. For the cylinder-collection task, behavioural patterns can be identified as belonging to two different kinds: some related to travelling the arena and others related to handling objects with the gripper. The behavioural repertoire is as follows: cylinder-seek locates and positions the robot body in front of a cylinder in order to activate cylinder-pickup, which moves the robot backwards to safely lower the robot arm and then pick up a cylinder; wall-seek travels the arena searching for the closest wall, and then corner-seek runs parallel to a wall until the robot finds a corner; finally, cylinder-deposit lowers the robot arm, opens the gripper and returns the arm to an upper position.

2.1 Exploration Behaviour

The exploration of the arena was carried out using information from the infrared sensors of the Khepera robot. In order to develop a common framework for the exploration of the arena, the behavioural patterns wall-seek, corner-seek and cylinder-seek were encoded as neural controllers. These behavioural patterns employ a fully connected feed-forward multilayer perceptron with no recurrent connections. The topology of the neural network is six neurons in the input layer, four neurons in the hidden layer and two in the output layer. The sigmoid transfer function is used at the hidden and output neurons. The infrared sensor readings of the Khepera, ranging from 0 to 1023 on the six front sensors, form the input to the neural network. The output of the neural network is scaled to the ±20 values required for driving the DC motors at full speed. Next, a genetic algorithm with selection, crossover and mutation operators is applied to the neural network, and the desired behaviour for each individual module is shaped using different fitness functions (Eqs. 1-3). The weights of each neural network are directly encoded into a vector w of 32 elements, initialized with random values in the range −1 < wi < 1. Therefore, a single vector representation is used to define each of the individuals in the population. The initial population G0 consists of n = 100 neural controllers. Selection uses elitism to replicate the two best individuals from one generation to the next. Next, a tournament allows random parents to be chosen from (n/2)−1 competitions. The fittest parents are bred in pairs with a random crossover point, generated with a probability of 0.5. Each individual in the new population is then affected by a mutation probability of 0.01. The initial location and orientation of individuals are randomized across trials, and fitness is scored by running each individual in the simulator for about 30 seconds.
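To make the controller and evolutionary loop concrete, the sketch below implements the 6-4-2 perceptron and a genetic algorithm with the parameters given above (population of 100, elitism of two, tournament selection, crossover probability 0.5, mutation probability 0.01). It is a minimal illustration under stated assumptions, not the authors' original code: the Gaussian mutation step size and the exact output scaling to ±20 are our choices, and `fitness_fn` stands in for a simulator evaluation such as Eq. (1).

```python
import numpy as np

N_IN, N_HID, N_OUT = 6, 4, 2                 # topology: 6 IR inputs, 4 hidden, 2 motor outputs
N_WEIGHTS = N_IN * N_HID + N_HID * N_OUT     # 32 weights, directly encoded in one vector
MAX_SPEED = 20.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def controller(w, ir_readings):
    """Map six front IR readings (0-1023) to two wheel speeds in [-20, +20]."""
    w1 = w[:N_IN * N_HID].reshape(N_IN, N_HID)
    w2 = w[N_IN * N_HID:].reshape(N_HID, N_OUT)
    x = np.asarray(ir_readings) / 1023.0     # normalize sensor input
    h = sigmoid(x @ w1)
    o = sigmoid(h @ w2)
    return (2.0 * o - 1.0) * MAX_SPEED       # scale sigmoid output to +/- MAX_SPEED (assumption)

def evolve(fitness_fn, n=100, generations=100, p_cross=0.5, p_mut=0.01):
    """fitness_fn(individual) is assumed to run a ~30 s simulator trial and return a score."""
    pop = np.random.uniform(-1.0, 1.0, (n, N_WEIGHTS))
    for _ in range(generations):
        scores = np.array([fitness_fn(ind) for ind in pop])
        elite = pop[np.argsort(scores)[-2:]]          # elitism: carry over the two best
        children = [e.copy() for e in elite]
        while len(children) < n:
            # tournament: the fitter of two random individuals becomes a parent
            a, b = np.random.randint(n, size=2)
            p1 = pop[a] if scores[a] > scores[b] else pop[b]
            c, d = np.random.randint(n, size=2)
            p2 = pop[c] if scores[c] > scores[d] else pop[d]
            child = p1.copy()
            if np.random.rand() < p_cross:            # single random crossover point
                cut = np.random.randint(1, N_WEIGHTS)
                child[cut:] = p2[cut:]
            mask = np.random.rand(N_WEIGHTS) < p_mut  # per-gene mutation
            child[mask] += np.random.normal(0.0, 0.2, mask.sum())  # step size is hypothetical
            children.append(child)
        pop = np.array(children)
    return pop
```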
The behavioural pattern for locating a wall (wall-seek) resembles obstacle-avoidance behaviour, because the arena has to be explored while avoiding cylinders. The robot stops when it is positioned in front of a wall. The fitness formula for this behavioural module was

$f_{c1} = \sum_{i=0}^{3700} |ls_i| \cdot (1 - ds_i) \cdot (1 - \mathrm{max\_ir}_i)$   (1)

where for iteration i: ls is the linear speed of both wheels (the absolute value of the sum of the left and right speeds), ds is the differential speed of both wheels (a measurement of the angular speed) and max_ir is the highest normalized infrared value. A formula such as this favours the evolution of fast individuals that run in a straight line while avoiding obstacles.

The behavioural module for locating a corner relies on running parallel to a wall until an additional wall is found, though round obstacles blocking a straight path to the nearest corner have to be avoided first. The module stops when a corner is detected. The fitness formula employed for the behaviour corner-seek was

$f_{c2} = f_{c1} \cdot (tgh)^2$   (2)

The formula employs a thigmotaxis factor (tgh) that accounts for the tendency to remain next to walls and is calculated as the fraction of the test period for which an individual is close to any of the walls in the arena. This formula therefore evolves individuals that avoid obstacles while travelling parallel to the arena walls.

The cylinder-seek behavioural module facilitates the exploration of the arena while avoiding walls, and ends when a cylinder is located in the middle of the arena. A cylinder is considered located once it is detected by the front-most pair of infrared sensors. Then the robot has to stop completely to let the gripping behaviour handle cylinder collection. The formula for a behavioural module such as this was

$f_{c3} = f_{c1} + K_1 \cdot c_{near} + K_2 \cdot c_{front}$   (3)

In this formula avoidance is exhibited during the exploration of the arena. The constants K1 and K2, with K1 < K2, are employed for rewarding the robot when a cylinder is detected around the body of the robot, assuming that the cylinder is near (cnear). However, the robot is most rewarded when it aligns its frontal part with a nearby cylinder (cfront).

2.2 Gripper-Handling Behaviour

The previous behavioural modules can be considered as sequences of actions triggered by an initial sensory stimulus. However, behaviour related to handling the gripper should be modelled as a sequence of specific actions always executed in the same order and with the same duration. Thus, behavioural patterns consisting of timed sequences of actions can be thought of as fixed action patterns [26]. Take cylinder-pickup, for instance, which requires the gripper hand to be opened before the robot moves backwards to create free space in front of the body. Next the gripper can be closed and the arm moved back into the upright position. On the other hand, cylinder-deposit requires a fixed sequence of lowering the arm, opening the gripper and then raising the arm. Therefore, these two behavioural modules were implemented as algorithmic routines following the aforementioned action sequences.
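Because these gripper routines are fixed action patterns, they can be sketched as a timed, open-loop sequence. The robot interface (`open_gripper`, `set_wheels`, and so on) and the step durations below are hypothetical placeholders rather than the actual Khepera API:

```python
import time

def cylinder_pickup(robot):
    """Fixed action pattern: the same actions, order and timing on every execution."""
    steps = [
        (robot.open_gripper,                 0.5),  # free the gripper hand first
        (lambda: robot.set_wheels(-10, -10), 1.0),  # move backwards to clear space
        (lambda: robot.set_wheels(0, 0),     0.1),  # stop before using the arm
        (robot.lower_arm,                    0.8),
        (robot.close_gripper,                0.5),  # grasp the cylinder
        (robot.raise_arm,                    0.8),  # return to the upright position
    ]
    for action, duration in steps:          # durations are illustrative assumptions
        action()
        time.sleep(duration)
```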
3. Evolution of Central Action Selection

Different models have been proposed for designing systems that are able to exhibit a variety of behaviours and to arbitrate between them [1,27,28]. Nevertheless, models based on explicit design do not seem to be scalable enough for developing systems capable of displaying a large variety of behavioural patterns that cope with task/environmental variations. In previous research we have shown that a computational model of the intrinsic circuitry of the vertebrate basal ganglia [29] produces action selection when embedded in a robot control system [8,30]. The motivated robot basal ganglia model has been set in a similar environment with both hand-coded [8] and evolved behavioural patterns [12]. The importance of the basal ganglia in natural action selection becomes evident when we observe that these nuclei are an archaic feature, common to all vertebrate animals [29]. However, we have also worked on an alternative selection model named CASSF (Central Action Selection with Sensor Fusion) [31] that shares common features with the robot basal ganglia model. Both are centralized and produce motor selection by building up perceptual information from raw sensory input. However, CASSF is a simpler model than that of the basal ganglia, having fewer parameters to be adjusted in order to obtain an appropriate selection.

3.1 Motivated Action Selection

One of the main features of CASSF is that it is modular and able to cope with the variations of a dynamic environment. However, in this study we have extended this model to include internal motivations in the calculation of motor selection (Figure 1). Furthermore, this is an effective action selection mechanism [32] that is centralized and presents sufficient persistence to complete a task. Tasks such as foraging can be implemented in CASSF by determining a set of behavioural patterns that can be integrated in time to complete the final behavioural setup. Additionally, the decision neural network weights (the selection parameters) have been optimized by the use of evolution. The adjustment of selection parameters and behaviour has been tuned by coevolution in CASSF, as described in [5]. The foraging activity for our behavioural setup has been loosely based on observations of hungry rats placed in a box containing a central small dish of food. These animals, even when deprived of food for twenty-four hours, will be fearful and exhibit a preference for staying next to walls and corners. Later on, they will go across the arena to collect food from the dish, which is then consumed in a corner. In this model the salience (the urgency to be selected) of each of the behavioural modules is tuned to provide appropriate behavioural selections that simulate the avoidance-related and food-acquisition-related behaviour observed in these animals. Therefore, the salience (si) of each module depends on the values of a number of extrinsic and intrinsic variables. Extrinsic values are bi-polar perceptual variables calculated from the robot's raw sensory information. These perceptual variables are labelled wall_detector (ew), gripper_sensor (eg), cylinder_detector (ec) and corner_detector (er). They form the context vector, which is constructed as e = [ew, eg, ec, er], with ew, eg, ec, er ∈ {−1, 1}. The information from the sensors is updated at every step of the simulation and the perceptual variables are recalculated depending on the presence (+1) or absence (−1) of the relevant target feature (e.g. a cylinder, a wall, a corner, or an object in the gripper).
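A sketch of how the context vector might be recomputed at each step is shown below; the four detector predicates are hypothetical stand-ins for the feature extraction performed on the raw sensor readings:

```python
def context_vector(robot):
    """Bi-polar perceptual variables e = [ew, eg, ec, er], recomputed at every
    simulation step: +1 when the target feature is present, -1 when absent."""
    detectors = (robot.wall_detected,        # wall_detector (ew)
                 robot.object_in_gripper,    # gripper_sensor (eg)
                 robot.cylinder_detected,    # cylinder_detector (ec)
                 robot.corner_detected)      # corner_detector (er)
    return [1 if detected() else -1 for detected in detectors]
```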
Additionally, behavioural modules are also able to generate a 'busy signal' (ci) that facilitates their own selection during critical phases of activity. The busy signal is a binary value that is on when a critical period of activity has been reached. The current busy-status vector is formed as c = [cs, cp, cw, cr, cd], with cs, cp, cw, cr, cd ∈ {1, 0}, for cylinder-seek, cylinder-pickup, wall-seek, corner-seek and cylinder-deposit respectively.

Figure 1. In the model named CASSF, perceptual variables (ei) form the input to the decision neural network (notice that not all connections are shown). Motivations (eo, ea) are added to the perceptual variables as inputs (Ii) to the decision network. The output of winning behaviours with the highest saliences (si) is gated to the motors of the Khepera. Take for instance the busy-status signal (c1) from behaviour B1 to the output neuron O1. The behavioural repertoire (B1-Bn) is extended by preserving similar connections for each of the additional behavioural modules. Thus, modules receive raw sensory information to express motor activity (mi). Motor activity is then converted to robot motor commands depending on the simulated mesolimbic-cortical tone (th, tl).

Intrinsic variables are produced by motivational modules and are functions of recent experience and internal state. In our experiments these roughly model 'obesity' (which increases with time and is reduced when 'food' is deposited outside the arena) and 'anorexia' (initially high and reduced when exploring the arena). The value of each simulated motivation is a single scalar in the range (0-1) that can be either increased or decreased over time. 'Obesity' is also reduced by a fixed amount when a cylinder is deposited in a corner of the arena. The simulated motivations 'obesity' (eo) and 'anorexia' (ea) are added to the context vector, where 0 ≤ eo, ea ≤ 1. As a result, the salience is calculated from the relevant information for each behavioural module, composed of the perceptual variables (bi-polar), its own busy signal (binary) and the intrinsic motivations (scalar values). These signals constitute the input vector Ii of the selection network. Activation is computed at every step of the simulation and the output of the network (Oi) produces the raw salience (si) of each behavioural module (Bi). Then, the salience is passed through a limiter, li = L(si), which constrains the output from zero to a maximum value of one (0 ≤ li ≤ 1). Motor outputs of the behavioural modules are encoded into positive and negative vectors (mi = [m+i, m−i], 0 ≤ mi ≤ 1), which represent changes in direction for the left and right motor wheels and for lifting the arm and opening the gripper. The positive component encodes desired positive changes in direction, that is, forwards for the wheels and lowering and closing for the gripper, and is given by m+i = [m+il, m+ir, m+ia, m+ig]. The negative component m−i = [m−il, m−ir, m−ia, m−ig] encodes desired negative changes in direction, that is, a backward movement of the left and right wheels and the lifting and opening of the gripper. The motor output of the winning behavioural module is allowed to take control of the motor plant, depending on the binary values of the salience signals (oi = mi·li).
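Putting these pieces together, one step of the salience computation could look like the following sketch. The decision network is taken to be a single-layer linear map from the eleven-element input vector to five raw saliences, which matches the 55 selection weights reported in Section 3.3; the motivation update rates are invented for illustration:

```python
import numpy as np

def update_motivations(obesity, anorexia, dt, deposited, exploring):
    """'Obesity' grows with time and drops by a fixed amount on each deposit;
    'anorexia' starts high and decays while the arena is explored.
    The rates (0.01, 0.3, 0.005) are hypothetical."""
    obesity = min(1.0, obesity + 0.01 * dt)
    if deposited:
        obesity = max(0.0, obesity - 0.3)
    if exploring:
        anorexia = max(0.0, anorexia - 0.005 * dt)
    return obesity, anorexia

def salience(weights, e, motivations, busy):
    """Limited saliences l_i = L(s_i) for the five behavioural modules.
    weights: 5 x 11 evolved decision-network matrix; e: four bi-polar
    perceptual variables; motivations: (e_o, e_a); busy: five binary signals."""
    I = np.concatenate([e, motivations, busy])   # input vector I_i (11 elements)
    s = weights @ I                              # raw salience s_i per module
    return np.clip(s, 0.0, 1.0)                  # limiter, 0 <= l_i <= 1
```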
In a normal dopamine state only one behaviour is allowed to win the competition, and the resultant vector forms the negative and positive final motor activity for the left, right, arm and gripper motors (o = [o+l, o+r, o+a, o+g, o−l, o−r, o−a, o−g], 0 ≤ oi ≤ 1). Motor activity, which ranges from 0 to 1, is converted into actual motor commands as follows:

$\mathrm{left\_motor} = (o^{+}_{l} - o^{-}_{l}) \cdot \mathrm{MAX\_SPEED}, \qquad \mathrm{right\_motor} = (o^{+}_{r} - o^{-}_{r}) \cdot \mathrm{MAX\_SPEED}$   (4)

$\mathrm{arm} = \begin{cases} \mathrm{DOWN}: & o^{+}_{a} > 0.5 \;\wedge\; o^{+}_{a} > o^{-}_{a} \\ \mathrm{MIDDLE}: & o^{+}_{a} > 0.0 \;\wedge\; o^{+}_{a} > o^{-}_{a} \\ \mathrm{UP}: & o^{-}_{a} > 0.5 \end{cases}$   (5)

$\mathrm{gripper} = \begin{cases} \mathrm{OPEN}: & o^{-}_{g} > 0.0 \;\wedge\; o^{-}_{g} > o^{+}_{g} \\ \mathrm{CLOSE}: & o^{+}_{g} > 0.0 \end{cases}$   (6)

Next, perceptions in CASSF are converted into valid motor commands within a main loop, in which sensor readings are updated and motor commands are executed. At each time step, at the normal mesolimbic-cortical tone, salience is calculated and the competition between behavioural components is solved in a winner-takes-all manner.

Figure 2. A partial view of the raw salience space sampled by the robot in a typical run. Axes denote the salience of the winning behavioural module (horizontal) and of the most salient loser (vertical). Colour intensity indicates the proportion (darker = greater) of 5,000 salience pairs falling within a particular window. Winner saliences take a value around 1.0, with the highest losing competitor at around 0.2, which occurred in 28% of the typical run of the robot.
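A direct transcription of Eqs. (4)-(6) into code might read as follows. The threshold structure follows our reconstruction of the garbled equations above, so the arm and gripper conditions should be treated as a best-effort interpretation rather than the definitive mapping:

```python
MAX_SPEED = 20.0

def to_motor_commands(o_pos, o_neg):
    """Map final motor activity (positive/negative components for left 'l',
    right 'r', arm 'a' and gripper 'g', each in [0, 1]) to motor commands."""
    left = (o_pos['l'] - o_neg['l']) * MAX_SPEED           # Eq. (4)
    right = (o_pos['r'] - o_neg['r']) * MAX_SPEED
    if o_pos['a'] > 0.5 and o_pos['a'] > o_neg['a']:       # Eq. (5), as reconstructed
        arm = 'DOWN'
    elif o_pos['a'] > 0.0 and o_pos['a'] > o_neg['a']:
        arm = 'MIDDLE'
    elif o_neg['a'] > 0.5:
        arm = 'UP'
    else:
        arm = 'HOLD'        # no arm change requested (our assumption)
    if o_neg['g'] > 0.0 and o_neg['g'] > o_pos['g']:       # Eq. (6), as reconstructed
        gripper = 'OPEN'
    elif o_pos['g'] > 0.0:
        gripper = 'CLOSE'
    else:
        gripper = 'HOLD'    # no gripper change requested (our assumption)
    return left, right, arm, gripper
```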
3.2 Modulated Action Selection

The busy signal and the context vector form the input of the selection network. The output of the selection network is the salience that drives the selection of the most activated module, allowing the execution of only one behaviour. Nevertheless, selection can be modulated by a variation in the values of the simulated mesolimbic-cortical tone. These are regulated by two variables: tl for a low state and th for a high modulation. An absolute value of 0.2 defines the tonic value for both variables, and whichever variable takes the higher value dominates over the other. The modulation of the mesolimbic-cortical tone in the motor output is provided by introducing a multiplicative factor in the final motor output (m) of the selection network. Thus, modulation with a high value will exhibit an increase in the permissiveness of selection by allowing the expression of competitor signals when the effective weight is (1 + th), where 0 ≤ th ≤ 1. In contrast, a reduced condition will reflect low permissiveness in the selection of a single winner when the weight is (1 − tl), where 0 ≤ tl ≤ 1.

At the regular tonic value (for both tl and th) competitions between behaviours are resolved by producing a single winner that takes over the robot motors. However, in a low state (tl > 0.2 and tl > th) we calculate the difference between the winning salience and the most salient loser, and only if that difference is greater than the established low value (tl) is the winning behavioural module activated. We model a low state in this way to let only strong signals win the competition over poor competitor signals. In Figure 2, at the normal mesolimbic-cortical tone, the raw salience of the winning activity and the most salient loser are shown. On the other hand, in a high condition (th > 0.2 and th > tl), if the salience of a losing signal, increased by a proportion of the high value (th), is greater than the winning activity, then that behavioural module is also selected and its output is combined with the output of the other winning activities. Therefore, an increase in the mesolimbic-cortical tone is modelled as a heightening of the salience of losing competitors. In a high state the regulated salience vector (l) gates the outputs of all the expressed channels, which are summed together; this sum is then constrained by the limiter L, and the resultant vector (o′, 0 ≤ o′ ≤ 1) forms the final motor activity:

$o' = L\left(\sum_{i=1}^{n} m_i l_i\right)$   (7)

3.3 CASSF Evolution

In this paper artificial evolution is employed to adjust the weights of the decision network. Evolution was carried out in two steps using the same evolutionary method. Initially, the exploration behavioural modules were evolved. Next, the decision network weights were optimized using evolution. Six selection weights depend on the input of the context vector including the intrinsic motivations (ei), and five more weights depend on the current busy status (ci). The salience (si), or urgency, is calculated from the input of the decision network Ii, which in turn modifies the output mi of the behavioural modules by allowing the most salient signals to win the competition, depending on the modulation of the mesolimbic-cortical tone (oi). Hence, selection is evolved for five behavioural modules with a context vector composed of eleven elements, making a selection vector chs of 55 weights with initial random values −Kw < chs < Kw, where Kw = 0.75. The chromosome takes two additional values for the mesolimbic-cortical tone (chh, cho) that are constrained to fall in the range −1 < chh, cho < 1. Therefore, the final chromosome consists of 57 weights encoded in a single chromosome. The fitness formula for the evolution of the weights of the decision network was

$f_{c4} = (K_1 \cdot \mathit{cwfactor}) + (K_2 \cdot f_{c2}) + (K_3 \cdot \mathit{pkfactor} \cdot (1.0 - e_a)) + (K_4 \cdot \mathit{dpfactor} \cdot e_o)$   (8)

The evolution of the weights of the selection network was optimized using the fitness formula (fc4), with the constants K1, K2, K3 and K4 (K1 < K2 < K3 < K4), to select those individuals that locate corners and walls in the arena (cwfactor). The fitness formula also rewards locating cylinders (fc2), their collection inside the arena (pkfactor), multiplied by a proportional value of 'anorexia' (1.0 − ea), and their release near the outside walls (dpfactor), multiplied by 'obesity' (eo). We included motivations in the calculation of the fitness to reward individuals that collect cylinders when they are less fearful and hungrier. The average fitness of the population and the maximum individual fitness over 100 generations are shown in Figure 3.

Figure 4. Mesolimbic-cortical tone during evolution after 100 evolution cycles. Blue circles denote the initial values of one hundred individuals, green dots represent the various values across evolution and black squares are common values after the fitness stabilizes. The simulated mesolimbic-cortical tone at the end of evolution takes the value of 0.297647 for the low condition and 0.659772 for the high state.
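The low- and high-tone rules of Section 3.2 can be summarized in a single gating function. This is our sketch of the described logic, not the evolved controller itself; in particular, what happens when the low-tone margin is not met (no module gains the motors) is our reading of the text:

```python
import numpy as np

def gate_motor_output(l, m, t_low=0.2, t_high=0.2):
    """Select and combine module outputs under mesolimbic-cortical tone
    modulation. l: limited saliences (length n); m: n x 8 motor vectors."""
    order = np.argsort(l)[::-1]
    winner, runner_up = order[0], order[1]
    if t_low > 0.2 and t_low > t_high:
        # low tone: the winner acts only if it clearly beats the best loser
        if l[winner] - l[runner_up] > t_low:
            return m[winner] * l[winner]
        return np.zeros_like(m[winner])      # otherwise no module is expressed (assumption)
    if t_high > 0.2 and t_high > t_low:
        # high tone: boosted losers may express output alongside the winner
        out = m[winner] * l[winner]
        for i in order[1:]:
            if l[i] * (1.0 + t_high) > l[winner]:
                out = out + m[i] * l[i]
        return np.clip(out, 0.0, 1.0)        # limiter L of Eq. (7)
    # normal tone: winner-takes-all
    return m[winner] * l[winner]
```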
4. Experiments and Results

The foraging task was set in an arena with four cylinders simulating food (see Figure 5). It is important to notice that the use of a fitness function for evolution shapes selection by optimizing behaviour in time and in the physical environment, and that selection also depends on the chosen mesolimbic-cortical tone values. In this paper a final behaviour is considered as the joint product of the environment, the observer and the internal status of the robot. Hence, a regular grasping-depositing pattern in the foraging task should be the result of the selection of the behavioural modules cylinder-seek, cylinder-pickup, wall-seek, corner-seek and cylinder-deposit, in that order. The use of selection with hand-coded parameters and evolved behavioural modules is shown in Figure 6, in addition to some elementary statistics shown in Table 1.

Figure 3. Fitness is plotted across 100 generations. For each generation the highest fitness of one individual was obtained from the average fitness of five trials under similar conditions. The maximum fitness of all individuals was averaged as a measure of the population fitness. Individuals are rewarded more if they avoid obstacles, collect cylinders and deposit cylinders close to corners. The evolution is stopped after fitness stabilizes at a value of around 6500.

Individuals in the population also evolve the values of the mesolimbic-cortical tone. Initially, random values for the high and low states are generated, and during evolution (Figure 4) the best individuals take an absolute value of around 0.5, which is the tonic value for the simulated low and high conditions. Variations in the levels of the mesolimbic-cortical tone help to optimize the selection of behavioural modules and selection in time, reducing the number of behaviour executions needed to complete the collection task.

Figure 5. The selection of behavioural modules and their related motivations are shown in the image. The robot simulator window and the window for the gripper and infrared sensors are also displayed. In the motivations window, the blue line corresponds to 'anorexia' and the green line represents 'obesity'.

Module            Freq   Latency   TotDur   TotDur%   Mean    StdDev   StdErr   MinDur   MaxDur
none              1.00    0.00      0.03     0.03      0.03    0.00     0.00     0.03     0.03
cylinder-seek     5.00    4.41     74.37    68.60     14.87   18.24     8.16     2.89    47.10
cylinder-pickup   7.00    7.30      8.97     8.27      1.28    0.69     0.26     0.41     2.22
wall-seek         5.00    0.03      2.82     2.60      0.56    0.31     0.14     0.30     1.05
corner-seek       8.00    1.08     16.93    15.61      2.12    1.09     0.39     0.47     3.34
cylinder-deposit  4.00   11.61      5.29     4.88      1.32    0.03     0.01     1.30     1.36
Total            30.00    0.00    108.41   100.00      3.61    8.54     1.56     0.03    47.10

Table 1. Hand-coded selection is summarized in this table. Labels in the table are as follows: Freq shows the frequency of selection of a module; Latency represents the time at which the module was first selected; the total duration of the module is indicated by TotDur and its percentage by TotDur%; Mean, Standard Deviation (StdDev) and Standard Error (StdErr) are simple statistics; MinDur represents the minimal time the module was selected and MaxDur the maximal time. We notice that although cylinder-pickup was selected 7 times, cylinder-deposit was only selected 4 times. The latter is due to the fact that failed attempts in the collection of cylinders are counted as triggered behaviour. Regular collection patterns can be disrupted if, for example, the cylinder slips from the gripper or a corner is immediately found.
Another cause of disruption in a pattern occurs after long search periods, when a cylinder is not promptly located. For instance, travelling for a long time increases the value of 'obesity' up to its maximum value, which makes locating a cylinder erratic and increases exploration periods (after 35 seconds in Figure 6). The use of motivations also causes interruptions in the collection of cylinders. An additional factor that may alter the order of behaviour selection is the fitness scored by the agent when solving the foraging task. Moreover, interruption may occur through the selection of a low mesolimbic-cortical tone value. A regular behavioural pattern, in a normal mesolimbic-cortical tone condition, is commonly observed for hand-coded selection, whereas for evolved selection and optimized mesolimbic-cortical tone values the use of redundant behaviours is avoided in order to optimize selection in time. The use of evolved behaviour and selection in a low mesolimbic-cortical tone state is presented in Figure 7.

Figure 6. Ethogram for a run of typical hand-coded behaviour selection at a normal mesolimbic-cortical tone (tl = 0.2, th = 0.2). Behavioural modules are numbered as 1-cylinder-seek, 2-cylinder-pickup, 3-wall-seek, 4-corner-seek, 5-cylinder-deposit and 6-no action selected. The deposit of four cylinders can be easily observed when green lines fall near a value of zero. Thus, a regular grasping-depositing pattern of selection (1-2-3-4-5) is exhibited by hand-coded individuals.

Figure 7. A typical run for evolved selection in a low mesolimbic-cortical tone condition (tl = 0.4, th = 0.2). Behavioural modules are numbered as in Figure 6. A standard grasping-depositing pattern is not exhibited at all. Only one cylinder is collected and never released. As a result, the selection of only two behaviours (1, 2) is observed in the graph.

Figure 8. A typical run for evolved selection and evolved mesolimbic-cortical tone parameters (with tl = 0.3, th = 0.7). Behavioural modules are numbered as in Figure 6. A standard grasping-depositing pattern is not easily observed here. Because of opportunism in the delivery of cylinders, artificial evolution optimizes away the use of wall-seek, which is never selected. Instead, corner-seek is used after collection has happened, to take the cylinder to the nearest wall, where it is released. A standard pattern never occurs, due to the multiple selection of modules in a high mesolimbic-cortical tone condition.

Module             Freq   Latency   TotDur   TotDur%   Mean   StdDev   StdErr   MinDur   MaxDur
none               1.00    0.00      0.02     0.02     0.02    0.00     0.00     0.02     0.02
cylinder-seek     79.00    3.45      1.47     1.84     0.02    0.01     0.00     0.00     0.06
cylinder-pickup   91.00    0.03     77.20    96.50     0.85    0.87     0.09     0.02     6.51
wall-seek          0.00   79.90      0.00     0.00     NaN     NaN      NaN      0.00     0.00
corner-seek       13.00    0.02      1.29     1.62     0.10    0.09     0.03     0.02     0.33
cylinder-deposit   0.00   79.90      0.00     0.00     NaN     NaN      NaN      0.00     0.00
Total            184.00    0.00     79.90   100.00     0.43    0.73     0.05     0.00     6.51

Table 2. Evolved selection is presented as some basic statistics. Labels are the same as in the previous table. Here we observe that cylinder-seek and corner-seek are both selected several times. However, cylinder collection/delivery occurs only three times (as shown in Figure 8, where the motivations cross).
Redundancy of behaviour occurs due to the adopted strategy of moving while lifting and lowering the robot arm. Additionally, failed attempts at collection/delivery are accounted for as triggered behaviour. In the case where a behaviour is never selected, its latency accounts for the total time of the run. It is also important to notice the occurrence of NaN (Not a Number) values in the table, which result from divisions by zero when a module is never selected. Selection at low mesolimbic-cortical tone levels causes the robot to diminish the level of permissiveness in the selection of a single winner. In Figure 7 we notice that the robot is able to select the cylinder-pickup behaviour only if the value of 'fear' ('anorexia') is very low and 'obesity' is near its maximum level (around 60 seconds in the same figure). In Figure 8 and Table 2 we present selection after evolution through one hundred generations. Mesolimbic-cortical tone levels for selection after evolution were rounded to tl = 0.3, th = 0.7. Evolution tends to optimize the selection of behavioural patterns, and because of the high mesolimbic-cortical tone condition the selection of more than one behaviour is displayed. Therefore, the robot develops a strategy for collecting cylinders consisting of lifting and lowering the gripper arm while running across the arena. Additionally, optimization by evolution includes discarding redundant activities. In the same figure we observe that the selection of wall-seek is avoided. Cylinder-deposit is selected for lifting and lowering the arm but never selected on its own. These behavioural modules are discarded in order to improve the delivery time of the robot. In terms of fitness, evolved selection exhibits an improvement over hand-coded selection with evolved behaviour (as seen in Table 3).

In Table 3 we perform a T-test to determine whether two behaviours from an assumed normal distribution could have the same mean when the standard deviations are unknown but assumed equal. We are interested in the probability (significance) that the observed value of T could be as large or larger under the null hypothesis that the mean of one behaviour is equal to the mean of the other behaviour. For each behaviour the result h = 1 signifies that we can reject the null hypothesis. For example, for cylinder-seek the significance is 0.00, which shows that values of T this extreme are very unlikely to arise by chance. The 95% confidence interval for the mean difference is [+11.16, +18.55], which excludes the hypothesized difference of zero. On the other hand, for cylinder-pickup h = 0, so we accept the null hypothesis. The rationale behind this is that, because this behaviour is sequential, it is always executed in the same order. In summary, we notice a rejection of the null hypothesis for the behaviours used for exploring the arena (cylinder-seek and corner-seek).

Module            Mean1   StDev1   N1    Mean2   StDev2    N2     df   MeanDiff   CStErr   T-test   Null Hyp.   Significance   Conf. Interv.
none               0.03     0.00    1     0.02     0.00     1      0     0.02      NaN      NaN       NaN          NaN         (NaN, NaN)
cylinder-seek     14.87    18.24    5     0.02     0.01    79     82    14.86      1.86     7.99      1.00         0.00        (+11.16, +18.55)
cylinder-pickup    1.28     0.69    7     0.85     0.86    91     96     0.43      0.34     1.29      0.00         0.20        (−0.23, +1.10)
wall-seek          0.56     0.31    5     NaN      NaN      0      3     NaN       NaN      NaN       NaN          NaN         (NaN, NaN)
corner-seek        2.12     1.09    8     0.10     0.09    13     19     2.02      0.30     6.74      1.00         0.00        (+1.39, +2.64)
cylinder-deposit   1.32     0.03    4     NaN      NaN      0      2     NaN       NaN      NaN       NaN          NaN         (NaN, NaN)
Total              3.61     8.54   30     0.43     0.73   184    212     3.18      0.64     5.00      1.00         0.00        (+1.93, +4.43)

Table 3. Comparison of unpaired hand-coded and evolved behaviour. In the table the Mean_i, Standard Deviation (StDev_i) and the size of each group (N_i) are shown, together with the degrees of freedom (df) of the unpaired T-test, the Mean Difference (MeanDiff), the Combined Standard Error (CStErr) and the value of the T-test. Furthermore, we show the result of the test of the Null Hypothesis, its two-sided probability (Significance) and the Confidence Interval.
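The unpaired T-test of Table 3 can be reproduced with standard tools. The sketch below uses SciPy's pooled-variance test on hypothetical bout-duration arrays, since the raw behaviour logs are not part of the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical per-selection durations (seconds) of one behavioural module
# under hand-coded and evolved selection; the real logs are not published.
durations_hand_coded = np.array([2.89, 5.60, 9.20, 10.50, 47.10])
durations_evolved = np.random.uniform(0.00, 0.06, size=79)

# Unpaired T-test assuming equal (pooled) variance, as in Table 3
t_stat, significance = stats.ttest_ind(durations_hand_coded,
                                       durations_evolved, equal_var=True)
h = int(significance < 0.05)   # h = 1 rejects the null hypothesis of equal means
```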
5. Discussion and Future Work

For the sake of our discussion it is important to remember that there is evidence of central selection in the vertebrate brain [29], particularly in the basal ganglia, buried under the cortex. These structures receive information from several different regions of the cerebral cortex. We have based the development of the motivated and modulated CASSF [31] on that of the robot basal ganglia [33]. Here we have developed a central action selection model that makes use of artificial evolution for the optimization of both neural behaviour and the decision network (excluding the evolution of sequential behaviour). In our implementation of selection we have built an intrinsic perception of the world based on raw sensory information, providing pre-processed information to the decision network in order to produce a unified perception of the 'extrinsic' world. Intrinsic variables, such as simulated 'obesity' and 'anorexia', affect the results of selection. Additionally, the selected mesolimbic-cortical tone values help to build up a reward condition that ultimately modifies the permissiveness of the selection mechanism. Therefore, selection arbitrates amongst competing behavioural modules to allow the execution of behaviour in response to a specific configuration of the world, the modulated condition of the internal selection mechanism and the internal motivational status of the animal robot.

In relation to dopamine, an influential concept in contemporary computational neuroscience is the reward prediction error hypothesis of phasic dopaminergic function. It maintains that midbrain dopaminergic neurons signal the occurrence of unpredicted reward, which is used in appetitive learning to reinforce the existing actions that most often lead to reward. However, the limited afferent sensory processing available and the precise timing of dopaminergic signals suggest that they may instead play a central role in identifying which aspects of context and behavioural output are critical in causing unpredicted events. In this work we developed a model where the modulation of selection can be changed by establishing different values of the mesolimbic-cortical tone. In Figure 4 we showed how the mesolimbic-cortical tone varies through evolution until the collection-delivery task is finally completed. In this way, we may think of the mesolimbic-cortical tone as a reinforcer for identifying novel aspects that will eventually lead to reward. In future implementations we intend to develop a predator-prey setup, where the prey employs evolvable modulated and motivated action selection and the predator a purely evolvable approach, both optimized by means of coevolution. This would allow us to study any potential improvements in selection under such a competitive scheme.
6. Conclusion

In our experiments the development of motivated and modulated action selection led us to include a component of simulated 'mesolimbic-cortical tone', similar to the one reported as 'dopamine' in [30], to regulate behaviour through the motor commands sent to the Khepera robot. We first analysed motivated behaviour at the normal mesolimbic-cortical tone to see the elicitation of movement in a regular condition. Next, we mimicked an abnormal condition as the result of inducing different, low and high, levels of the simulated mesolimbic-cortical tone. The selection mechanism and the neural behaviour were evolved at two separate stages. The experiments presented in this paper provide an insight into the effects of evolution when optimizing behaviour that needs to be coupled within a regular pattern. The use of evolution constrains candidate solutions to those that maximize the proposed fitness function. In addition, the permissiveness of selection is modified by the final mesolimbic-cortical tone values after evolution. Finally, our current work aims to reduce the number of decisions made by the human designer when evolving both selection and behaviour.

7. Acknowledgments

This work has been sponsored by CONACYT-MEXICO grant SEP No. 0100895.

8. References

[1] Arkin, R. C. (1998) Behaviour-Based Robotics. The MIT Press, USA.
[2] Brooks, R. A. (1999) Cambrian Intelligence. The MIT Press, USA.
[3] Bryson, J. (2001) Intelligence by Design: Principles of Modularity and Coordination for Engineering Complex Adaptive Agents. PhD thesis, Massachusetts Institute of Technology, USA.
[4] Montes-González, F., Santos Reyes, J. and Ríos Figueroa, H. (2006) Integration of Evolution with a Robot Action Selection Model. In A. Gelbukh and C. A. Reyes-García (Eds.), MICAI 2006, LNAI 4293: 1160-1170.
[5] Montes-Gonzalez, F. (2007) The Coevolution of Robot Behaviour and Central Action Selection. In J. Mira and J. R. Alvarez (Eds.), IWINAC 2007, Part II, LNCS 4528: 439-448.
[6] Nolfi, S. and Floreano, D. (2000) Evolutionary Robotics. The MIT Press, USA.
[7] Lapchin, L. and Guillemaud, T. (2005) Asymmetry in host and parasitoid diffuse coevolution: when the red queen has to keep a finger in more than one pie. Frontiers in Zoology 2: 4.
[8] Prescott, T. J., Montes González, F. M., Gurney, K., Humphries, M. D. and Redgrave, P. (2006) A robot model of the basal ganglia: Behaviour and intrinsic processing. Neural Networks 19: 31-61.
[9] Montes Gonzalez, F. (2001) An Action Selection Mechanism based on the Vertebrate Basal Ganglia. PhD Dissertation, Psychology Department, University of Sheffield, UK.
[10] Mondada, F., Franzi, E. and Ienne, P. (1993) Mobile robot miniaturization: a tool for investigation in control algorithms. Proceedings of the Third International Symposium on Experimental Robotics.
[11] Gurney, K., Prescott, T. J. and Redgrave, P. (2001) A computational model of action selection in the basal ganglia II: Analysis and simulation of behaviour. Biological Cybernetics 84: 411-423.
[12] Montes-Gonzalez, F., Prescott, T. J. and Negrete-Martinez, J. (2007) Minimizing human intervention in the development of basal ganglia-inspired robot control. Applied Bionics and Biomechanics 4(3): 101-109.
[13] Cools, A. R. (1980) Role of the neostriatal dopaminergic activity in sequencing and selecting behavioural strategies: facilitation of processes involved in selecting the best strategy in a stressful situation. Behavioural Brain Research 1: 361-378.
[14] Ikemoto, S. and Panksepp, J. (1999) The role of nucleus accumbens dopamine in motivated behaviour: a unifying interpretation with special reference to reward seeking. Brain Research Reviews 31: 6-41.
[15] Redgrave, P., Prescott, T. and Gurney, K. N. (1999) The basal ganglia: A vertebrate solution to the selection problem. Neuroscience 89: 1009-1023.
[16] Khan, Z. U., Gutiérrez, A., Martín, R., Peñafiel, A., Rivera, A. and De La Calle, A. (1998) Differential regional and cellular distribution of dopamine D2-like receptors: an immunocytochemical study of subtype-specific antibodies on rat and human brain. Journal of Comparative Neurology 402: 353-371.
[17] Nolte, J. (1999) The Human Brain: An Introduction to Its Functional Anatomy. Missouri, Mosby.
[18] Brito, G. N. O. (1997) A neurobiological model for Tourette syndrome centered on the nucleus accumbens. Medical Hypotheses 49: 133-142.
[19] Purves, D., Augustine, G. J., Fitzpatrick, D., Katz, L. C., LaMantia, A. S. and McNamara, J. O. (1997) Neuroscience. Massachusetts, Sinauer Associates.
[20] Andreasen, N. C. (1995) Symptoms, signs and diagnosis of schizophrenia. The Lancet 346(8973): 477-481.
[21] Calabresi, P., De Murtas, M. and Bernardi, G. (1997) The neostriatum beyond the motor function: Experimental and clinical evidence. Neuroscience 78: 39-60.
[22] Grace, A. A. (1995) The tonic/phasic model of dopamine system regulation: its relevance for understanding how stimulant abuse can alter basal ganglia function. Drug and Alcohol Dependence 37: 111-129.
[23] Salamone, J. D. (1988) Dopaminergic involvement in activational aspects of motivation - effects of haloperidol on schedule-induced activity, feeding, and foraging in rats. Psychobiology 16(3): 196-206.
[24] Turgeon, S. M., Pollack, A. E. and Fink, J. S. (1997) Enhanced CREB phosphorylation and changes in c-Fos and FRA expression in striatum accompany amphetamine sensitization. Brain Research 749: 120-126.
[25] Cyberbotics (2012) Webots, Commercial Mobile Robot Simulation Software, http://www.cyberbotics.com. Accessed 2012 Jun 28.
[26] McFarland, D. (1993) Animal Behaviour. Harlow, Essex, Longman Scientific and Technical.
[27] Brooks, R. A. (1986) A Robust Layered Control System for a Mobile Robot. IEEE Journal of Robotics and Automation 2(1): 14-23.
[28] Maes, P. (1989) The Dynamics of Action Selection. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI-89), Detroit, MI: 991-997.
[29] Prescott, T. J., Redgrave, P. and Gurney, K. (1999) Layered control architectures in robots and vertebrates. Adaptive Behaviour 7(1): 99-127.
[30] Montes Gonzalez, F., Prescott, T. J., Gurney, K., Humphries, M. and Redgrave, P. (2000) An embodied model of action selection mechanisms in the vertebrate brain. From Animals to Animats 6: Proceedings of the 6th International Conference on the Simulation of Adaptive Behaviour: 157-166.
[31] Montes-González, F. M. and Marín-Hernández, A. (2004) Central Action Selection using Sensor Fusion. Proceedings of the Fifth Mexican International Conference in Computer Science (ENC'04). Mexico, IEEE Press: 289-296.
[32] Montes-González, F. M., Marín Hernández, A. and Ríos Figueroa, H. (2006) An Effective Robotic Model of Action Selection. In R. Marín et al. (Eds.), CAEPIA 2005, LNAI 4177: 123-132.
[33] Prescott, T. J., Gurney, K., Montes-Gonzalez, F., Humphries, M. and Redgrave, P. (2002) The Robot Basal Ganglia: Action selection by an embedded model of the basal ganglia. In Nicholson, L. F. B. and Faull, R. L. M. (Eds.), Basal Ganglia VII. New York: Kluwer Academic/Plenum Press: 349-358.