Broekens and Verbeek, MNAS 2005.

Computational Investigations of the Regulative Role of Pleasure in Adaptive Behavior
Action-Selection Biased by Pleasure-Regulated Simulated Interaction
Joost Broekens, Fons J. Verbeek, LIACS, Leiden University, The Netherlands.
Overview
[Architecture figure. Reactive behavior loop: the environment sends a stimulus and reinforcement; perception produces a percept for the distributed-state RL model; action-selection turns predicted interactions into an action on the environment. Cognitive influence loop: an emotion process uses pleasure to drive interaction-selection, feeding a simulated interaction and simulated reinforcement back into the RL model.]
Learning
• The agent's memory structure is modelled with a
directed graph.
• Constructs a distributed state-prediction of next
states.
• Learns through continuous interaction.
– The memory is adapted while the agent interacts with its
environment.
– The agent selects an action a, executes it, and combines it with the resulting perception p into a situation s1 = <a, p>. The memory adds s1 if s1 does not yet exist.
– Do another action, resulting in s2; add s2, and
– connect s1 to s2 by creating an interactron node I1.
– Recursively apply this process; the interactrons are used to predict situations (a minimal sketch of this construction follows below).
– (!Encoding situations in this way is too symbolic for real-world applications!)
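A minimal sketch (Python, not the authors' implementation) of this memory-graph construction; the class and method names are illustrative assumptions:

```python
# Minimal sketch of the memory-graph construction described above
# (illustrative, not the authors' implementation). Situations <a, p> are
# nodes; interactrons connect nodes that occurred in sequence. In the full
# model interactrons can themselves be connected by higher-level
# interactrons (the recursive step shown in the figure).

class Situation:
    def __init__(self, action, percept):
        self.action, self.percept = action, percept

class Interactron:
    """Connects two nodes (situations or interactrons) that occurred in sequence."""
    def __init__(self, first, second):
        self.first, self.second = first, second

class Memory:
    def __init__(self):
        self.situations = {}     # (action, percept) -> Situation
        self.interactrons = {}   # (first node, second node) pair -> Interactron
        self.previous = None     # most recently visited node

    def observe(self, action, percept):
        key = (action, percept)
        # add the situation s = <a, p> only if it does not exist yet
        s = self.situations.setdefault(key, Situation(action, percept))
        if self.previous is not None:
            pair = (id(self.previous), id(s))
            # connect the previous node to the new one with an interactron
            self.interactrons.setdefault(pair, Interactron(self.previous, s))
        self.previous = s
        return s

# Usage: m = Memory(); m.observe("a", "p1") adds s1, a later
# m.observe("b", "p2") adds s2 and an interactron I1 connecting s1 to s2.
```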
[Figure, panels a–e: the memory graph grows step by step; situations s1, s2, s3 are added as nodes, and interactrons I1, I2, I3 are created to connect consecutive situations and, recursively, other interactrons.]
Learning: example
• Agent environment: a grid world consisting of
– lava (red), r = -1; the agent can walk on lava but is discouraged from doing so,
– food (yellow), r = 1,
– the agent (black),
– path (white), r = 0.
Learning: reinforcement
• Learning to predict values follows standard RL techniques (Sutton
and Barto, 1998).
– Except that learning is per interactron (node) and that there are two prediction values per node.
• Every interactron (representing a sequence of occurred situations) maintains
– a direct reinforcement value: the interactron's own predicted value, changed by the reinforcement r from the environment,
– a lazy-propagated indirect reinforcement value that estimates the future reinforcement based on the predicted next interactions, and
– a combined value μ equal to the sum of the direct and indirect reinforcement values (a bookkeeping sketch follows below).
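A hedged sketch of this per-node bookkeeping; the incremental update rules and the parameters alpha and gamma are standard RL assumptions in the spirit of Sutton and Barto (1998), not the paper's exact equations:

```python
# Illustrative per-interactron value bookkeeping (assumed update rules,
# not the paper's exact equations).

class InteractronValues:
    def __init__(self, alpha=0.1, gamma=0.9):
        self.alpha, self.gamma = alpha, gamma
        self.direct = 0.0    # updated from the environment's reinforcement r
        self.indirect = 0.0  # lazily propagated estimate of future reinforcement

    def update_direct(self, r):
        # move the direct value toward the received reinforcement
        self.direct += self.alpha * (r - self.direct)

    def propagate_indirect(self, predicted_values, probabilities):
        # lazy propagation: estimate future reinforcement from the predicted
        # next interactions, weighted by their prediction probabilities
        expectation = sum(p * v for p, v in zip(probabilities, predicted_values))
        self.indirect += self.alpha * (self.gamma * expectation - self.indirect)

    @property
    def mu(self):
        # combined value: direct plus indirect reinforcement
        return self.direct + self.indirect
```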
Learning: reinforcement example
[Figure: lazy reinforcement propagation through the memory graph.]
Action-Selection
Action-Selection
• Integrate distributed predictions into action values.
– Action values result from the parallel inhibition and excitation of actions in the agent's set of actions, A, calculated with:
l_t(a_h) = \sum_{i=1}^{k} \sum_{j=1}^{X_{y_i}} \mu_t(x_{ij} \mid y_i) \cdot P(x_{ij} \mid y_i)
– with l_t(a_h) = the resulting level of activation of an action a_h ∈ A at time t,
– y_i an active interactron (k interactrons active in total), and
– x_ij an interaction predicted by y_i that predicts action a_h, with combined value μ_t and prediction weight P.
⇒ Sum the weighted values of all predicted interactions into action values.
• Action-selection is based on these action values (see the sketch below).
– If any l_t(a_h) > threshold ⇒ select a_select such that l_t(a_select) = max(l_t(a_1), …, l_t(a_|A|)).
– If all l_t(a_h) < threshold ⇒ select a_select stochastically from l_t(a_1), …, l_t(a_|A|).
• Other selection mechanisms are possible, e.g. Boltzmann. (!With a static threshold our selection suffers from a lack of exploration!)
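A minimal sketch of this selection rule (not the paper's code); the `predictions` tuples of (action, value, weight) are an assumed interface for the active interactrons' predicted interactions:

```python
import random

# Sketch of the action-selection rule above: sum value * weight over all
# predicted interactions per action, then pick greedily above a threshold,
# otherwise stochastically.

def action_levels(actions, predictions):
    """predictions: iterable of (action, mu_value, weight) tuples gathered
    from the active interactrons' predicted interactions."""
    levels = {a: 0.0 for a in actions}
    for action, mu, weight in predictions:
        if action in levels:
            levels[action] += mu * weight        # weighted sum of predicted values
    return levels

def select_action(levels, threshold=0.0):
    # the threshold value is a free parameter (static in the baseline version)
    if any(l > threshold for l in levels.values()):
        return max(levels, key=levels.get)       # greedy above the threshold
    return random.choice(list(levels.keys()))    # otherwise choose stochastically
```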
Action-Selection: example
l_t(a_h) = \sum_{i=1}^{k} \sum_{j=1}^{X_{y_i}} \mu_t(x_{ij} \mid y_i) \cdot P(x_{ij} \mid y_i)
Internal Simulation
Thinking as Internal Simulation of
Behavior
• Internal simulation of behavior:
– covertly execute and evaluate potential interactions using sensory-motor substrates (Hesslow, 2002; Damasio; Cotterill, 2001), but see also
– "interaction potentialities" (Bickhard), and
– "state anticipation" (Butz, Sigaud, Gérard, 2003).
– Existing mechanisms are the basis for simulation
⇒ Evolutionary continuity!
Simulation: action-selection bias
At every step, instead of immediate action-selection, select a subset of the predicted interactions from the reinforcement learning model and feed it back to the RL model (see the sketch below):
1. Interaction-selection: select a subset of the predicted interactions.
2. Simulate-and-bias-predicted-benefit: feed each selected interaction back to the model as if it were a real interaction.
3. Reset-memory-state: the memory advances to time t+1 during simulation, so reset it to time t to be able to select an appropriate action.
4. Action-selection: select the next action using the action-selection mechanism explained earlier, based on the now-biased action values.
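A sketch of these four steps; every method on `memory` (`predicted_interactions`, `snapshot_state`, `feed_back`, `restore_state`, `action_values`) is a hypothetical placeholder for the RL model's interface, and the way the threshold selects a subset is one possible reading:

```python
# Sketch of the four-step simulation loop above; the `memory` interface is
# a hypothetical placeholder, not the paper's actual API.

def interaction_selection(predicted, threshold):
    # 1. Interaction-selection: keep a subset of the predicted interactions.
    #    One reading of the simulation threshold: a high threshold keeps only
    #    the best-valued predictions, a low threshold keeps many of them.
    ranked = sorted(predicted, key=lambda p: p.value, reverse=True)
    n = max(1, round((1.0 - threshold) * len(ranked)))
    return ranked[:n]

def simulation_biased_step(memory, actions, threshold, select_action):
    subset = interaction_selection(memory.predicted_interactions(), threshold)

    # 2. Simulate-and-bias-predicted-benefit: feed each selected interaction
    #    back into the model as if it were a real interaction.
    state = memory.snapshot_state()              # remember the state at time t
    for interaction in subset:
        memory.feed_back(interaction)            # biases the predicted values

    # 3. Reset-memory-state: simulation advanced the memory to t+1,
    #    so restore its temporal state to time t before acting.
    memory.restore_state(state)

    # 4. Action-selection on the now-biased action values.
    return select_action(memory.action_values(actions))
```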
Simulation: example
• Action list before simulation (!hypothetical example!):
– {up = 0.2, down = -0.5, right = -1, left = -1}
• Action-selection would have selected "up",
– at least using our naive action-selection mechanism;
– with Boltzmann selection, "up" would get a high probability.
• Simulate all predicted interactions.
– [Figure: grid world with a roadblock, r = -0.5.]
– Propagate back the predicted values by simulating the interaction with the environment.
– The effect is a "value look-ahead" of one step.
• Action list after simulation:
– {up = 0.1, down = 0.5, right = -1, left = -1}
• Action-selection now selects "down".
• In this example, simulating all predicted interactions helps (see the snippet below).
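To make the effect concrete, the slide's hypothetical action values spelled out in a few lines: the simulation bias flips the greedy choice from "up" to "down".

```python
# The slide's hypothetical numbers: the one-step look-ahead bias from
# simulation changes which action is selected greedily.
before = {"up": 0.2, "down": -0.5, "right": -1.0, "left": -1.0}
after = {"up": 0.1, "down": 0.5, "right": -1.0, "left": -1.0}

assert max(before, key=before.get) == "up"    # naive selection without simulation
assert max(after, key=after.get) == "down"    # selection after the simulation bias
```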
But: Simulating Everything is not
Always Best
• Even apart from the fact that simulating everything costs mental effort:
• earlier experiments (Broekens, 2005) showed that
– simulation has a benefit, especially when many interactions are simulated. This is not surprising (it is a better heuristic). However,
– in some cases less simulation resulted in better learning.
⇒ There is a dynamic relation between the environment and the simulation "strategy" (i.e. the simulation threshold: the percentage of all predicted interactions to be simulated).
⇒ Emotion as metalearning to adapt the amount of internal simulation? (Doya, 2002)
– Pleasure is an indication of the current performance of the agent (Clore and Gasper, 2000). Also,
– high pleasure ⇒ top-down thinking, and
– low pleasure ⇒ bottom-up thinking (Fiedler and Bless, 2000).
Pleasure Modulates Simulation
Pleasure Modulates Simulation
• There are many theories of emotion.
• We use the core-affect (or activation-valence) theory of emotion as a basis.
– Two fundamental factors: pleasure and arousal (Russell, 2003).
– Pleasure relates to emotional valence, and
– arousal relates to action-readiness, or activity.
• In this study we model pleasure as a simulation threshold.
– We use pleasure to dynamically adapt the number of interactions that are simulated; it thus acts as a dynamic simulation threshold.
– We study the indirect effect of emotion as a metalearning parameter affecting information processing, which in turn influences action-selection.
• Many models study emotion as a direct influence on action-selection (or motivation(-al states)) (Avila-Garcia and Cañamero, 2004; Cañamero, 1997; Velasquez, 1998), or as information (e.g. Botelho and Coelho).
• Example of an exception: Belavkin (2004), on the relation between emotion, entropy and information processing.
Pleasure Modulates Simulation
• Pleasure: an indication of current performance relative to what the agent is used to.
– We try to capture this with the normalized difference between the short-term average reinforcement signal and the long-term average reinforcement signal.
• Continuous pleasure feedback:
e_p = \frac{r_{star} - (r_{ltar} - f_{ltar})}{2 f_{ltar}}
– with r_star the short-term average reinforcement, r_ltar the long-term average reinforcement, and f_ltar a normalization term, so that e_p is about 0.5 when the two averages coincide.
– High pleasure, going well? Continue the strategy: goal-directed thinking.
• High e_p ⇒ high threshold, simulate only the predicted best interactions.
– Low pleasure? Look broader, pay more attention to all predicted interactions.
• Low e_p ⇒ low threshold, simulate many interactions (a sketch follows below).
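A hedged sketch of this feedback loop; the window sizes and the normalization term f_ltar are assumptions (the experiments sweep the window parameters), while the direct use of e_p as the simulation threshold follows the description above:

```python
from collections import deque

# Sketch of the pleasure signal and its use as a dynamic simulation
# threshold (assumed window sizes and normalization, not the paper's
# exact parameter values).

class PleasureSignal:
    def __init__(self, short_window=10, long_window=100, f_ltar=1.0):
        self.short = deque(maxlen=short_window)   # r_star window (short-term)
        self.long = deque(maxlen=long_window)     # r_ltar window (long-term)
        self.f_ltar = f_ltar                      # normalization term

    def update(self, r):
        self.short.append(r)
        self.long.append(r)
        r_star = sum(self.short) / len(self.short)
        r_ltar = sum(self.long) / len(self.long)
        # normalized difference: 0.5 when short- and long-term averages match,
        # above 0.5 when recent reinforcement is better than usual
        e_p = (r_star - (r_ltar - self.f_ltar)) / (2 * self.f_ltar)
        return min(1.0, max(0.0, e_p))

def simulation_threshold(e_p):
    # high pleasure -> high threshold (simulate only the best predictions),
    # low pleasure  -> low threshold  (simulate many predictions)
    return e_p
```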
Experiments
• To measure the adaptive effect of pleasure-modulated simulation: force the agent to adapt to a new task.
– First the agent has 128 trials to learn task 1, then
– the environment is switched to a new task, with 128 trials to learn task 2.
– Repeat for many different parameter settings (e.g. the windows of the long- and short-term average reinforcement signals, the learning rate, etc.). A schematic of this protocol follows below.
• Pleasure predictions:
– Pleasure increases to a value near 1 (the agent gets better at the task),
– then slowly converges down to 0.5 (the agent gets used to the task).
– At the switch: pleasure drops (new task, drop in performance),
– then increases to a value near 1 and converges down to 0.5 (the agent gets used to the new task).
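A schematic sketch of this two-task protocol; `env.set_task`, `env.run_trial`, and `agent.pleasure` are hypothetical placeholder names, and the parameter sweep is only indicated:

```python
# Schematic of the two-task protocol above (placeholder interfaces; the
# parameter sweep over windows, learning rate, etc. is not shown).

def run_experiment(agent, env, trials_per_task=128):
    pleasure_trace = []
    for task in ("task 1", "task 2"):
        env.set_task(task)                       # switch the environment
        for _ in range(trials_per_task):
            r = env.run_trial(agent)             # one learning trial
            pleasure_trace.append(agent.pleasure(r))
    return pleasure_trace                        # compare against the predicted curve
```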
Results
• The performance of pleasure-modulated simulation is comparable with simulating ALL / the best 50% of predicted interactions (static simulation thresholds), but uses only 30% / 70% of the mental resources.
Results
• Some settings even show significantly better performance at lower mental cost.
• The predicted pleasure curve was confirmed.
Conclusions
• Simple pleasure feedback can be used to determine the broadness of internal simulation: when simulation is used as an action-selection bias, performance is comparable while mental effort decreases.
– Since we introduce few new mechanisms for simulation, this is relevant to the understanding of the evolutionary plausibility of the simulation hypothesis: increased adaptation at lower cost is an evolutionarily advantageous feature.
• Our results provide clues to a relation between the simulation hypothesis and emotion theory.
Action-selection: discussion and questions
• Use emotion to:
– vary action-selection distribution (Doya, 2002), and/or
– vary the interaction-selection distribution (e.g. the temperature of Boltzmann selection, the threshold of our AS mechanism).
• Interplay between covert interaction (simulation) and overt interaction (action-selection).
– Simulate the best interaction, but choose an action stochastically, see also (Gadanho, 2003):
⇒ gives extra "drive" to certain actions.
– The inverse? Seems rational too:
⇒ simulate bad actions for "mental (covert) exploration", choose the best actions for "overt exploitation".
⇒ Early experiments do not (yet) show a clear benefit.
• How to integrate internal-simulation input into action-selection?
• Questions?