Computational Investigations of the Regulative Role of Pleasure in Adaptive Behavior
Action-Selection Biased by Pleasure-Regulated Simulated Interaction
Joost Broekens,
Fons J Verbeek,
LIACS, Leiden University, The Netherlands.
Overview
[Architecture diagram: the ENVIRONMENT provides a stimulus and reinforcement; perception turns the stimulus into a percept for the distributed-state RL model, whose action-selection returns an action to the environment (reactive behavior); an emotion process takes the model's predicted interactions, performs interaction-selection, and feeds a simulated interaction and simulated reinforcement back into the model (cognitive influence), regulated by pleasure.]
Learning
• The agent's memory structure is modelled as a directed graph.
• It constructs a distributed state-prediction of next states.
• It learns through continuous interaction (a minimal sketch follows after the figure below):
  – The memory is adapted while the agent interacts with its environment.
  – The agent selects an action a, executes a, and combines it with the resulting perception p into a situation s1 = <a, p>. Memory adds s1 if s1 does not yet exist.
  – Do another action, resulting in s2; add s2 and connect s1 to s2 by creating an interactron node I1.
  – Recursively apply this process, and use interactrons to predict situations.
  – (!Encoding situations in this way is too symbolic for real-world applications!)
[Figure, panels a–e: stepwise construction of the memory graph; situations s1, s2, s3 are added as they occur and connected by interactron nodes I1, I2, I3, with higher-level interactrons connecting lower-level ones.]
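To make the graph construction concrete, here is a minimal sketch, assuming hypothetical names (Memory, add_situation, connect, step) and omitting the recursive creation of interactrons over interactrons; it is not the authors' implementation, only an illustration of adding situations <a, p> and linking consecutive ones with interactron nodes.

```python
# Minimal sketch of the directed-graph memory (hypothetical names): situations
# <action, percept> are nodes, and an interactron node connects two consecutive
# situations so it can later be used to predict the next situation.

class Memory:
    def __init__(self):
        self.situations = set()    # situation nodes s = (action, percept)
        self.interactrons = {}     # (s_prev, s_next) -> interactron node

    def add_situation(self, action, percept):
        s = (action, percept)      # situation s = <a, p>
        self.situations.add(s)     # added only if it does not exist yet
        return s

    def connect(self, s_prev, s_next):
        key = (s_prev, s_next)
        if key not in self.interactrons:
            # the interactron predicts s_next given s_prev; it keeps a direct
            # and an indirect reinforcement value (see the next slides)
            self.interactrons[key] = {"predicts": s_next,
                                      "direct": 0.0, "indirect": 0.0}
        return self.interactrons[key]

    def step(self, s_prev, action, percept):
        """One learning step: store the new situation and link it to the previous one."""
        s_new = self.add_situation(action, percept)
        if s_prev is not None:
            self.connect(s_prev, s_new)
        return s_new
```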
Learning: example
• Agent environment: a grid world consisting of
  – lava (red), r = -1: the agent can walk on lava but is discouraged from doing so,
  – food (yellow), r = 1,
  – the agent (black), and
  – path (white), r = 0.
Learning: reinforcement
• Learning to predict values follows standard RL techniques (Sutton and Barto, 1998),
  – except that learning is per interactron (node) and that there are two prediction values per node.
• Every interactron (representing a sequence of occurred situations) maintains
  – a direct reinforcement value: the interactron's own predicted value, changed by the reinforcement r received from the environment,
  – a lazily propagated indirect reinforcement value that estimates the future reinforcement based on the predicted next interactions, and
  – a combined value μ, the sum of the direct and indirect values.
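A minimal sketch of the per-node update, assuming an exponential moving-average rule with hypothetical parameters alpha and gamma; the slide only states that standard RL techniques are applied per interactron, so the exact update rule below is an assumption.

```python
# Sketch of the two prediction values kept by each interactron (hypothetical
# parameter names). The direct value follows the reinforcement r from the
# environment; the indirect value lazily aggregates the discounted,
# probability-weighted values of the predicted next interactrons; mu combines both.

def mu(node):
    # combined value, later used for action selection
    return node["direct"] + node["indirect"]

def update_direct(node, r, alpha=0.1):
    # move the interactron's own predicted value towards the received r
    node["direct"] += alpha * (r - node["direct"])

def update_indirect(node, predictions, gamma=0.9):
    # predictions: list of (probability, next_node) pairs predicted by this node
    node["indirect"] = gamma * sum(p * mu(nxt) for p, nxt in predictions)
```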
Learning: reinforcement example
[Figure: lazy reinforcement propagation through the memory graph (example).]
Action-Selection
[Overview architecture diagram repeated for this section (action-selection in the distributed-state RL model).]
Action-Selection
• Integrate distributed predictions into action values.
  – Action values result from the parallel inhibition and excitation of actions in the agent's set of actions, A, calculated using the formula

    l_t(a_h) = \sum_{i=1}^{k} \sum_{j=1}^{X_{y_i}} \mu_t(x_{ij} \mid y_i) \cdot p(x_{ij} \mid y_i)

  – with l_t(a_h) the resulting activation level of an action a_h ∈ A at time t,
  – y_i an active interactron (i = 1, …, k),
  – x_{ij} the j-th of the X_{y_i} interactions predicted by y_i (only those x_{ij} that predict a_h contribute to l_t(a_h)), and
  – μ_t(x_{ij} | y_i) the combined value and p(x_{ij} | y_i) the probability with which y_i predicts x_{ij}.
  The formula sums the weighted values of all predicted interactions into action values.
• Action-selection is based on these action values.
  – If any l_t(a_h) > threshold, select a_select such that l_t(a_select) = max(l_t(a_1), …, l_t(a_|A|)).
  – If all l_t(a_h) < threshold, select a_select stochastically from l_t(a_1), …, l_t(a_|A|).
• Other selection mechanisms are possible, e.g. Boltzmann. (!With a static threshold our selection suffers from a lack of exploration!)
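A minimal sketch of the value integration and the naive threshold-based selection; the data layout (dictionaries with "mu", "action", and "predictions" fields) is assumed, and the stochastic branch uses a uniform choice for simplicity rather than drawing from the activation levels.

```python
import random

# Each active interactron y lists its predicted interactions x together with
# the prediction probability p(x | y); every x predicts one action and carries
# its combined value mu. The activation level of an action is the sum of
# mu * p over all predictions that predict it.

def action_levels(active_interactrons, actions):
    levels = {a: 0.0 for a in actions}
    for y in active_interactrons:
        for x, p in y["predictions"]:
            levels[x["action"]] += x["mu"] * p
    return levels

def select_action(levels, threshold=0.0):
    best_action, best_level = max(levels.items(), key=lambda kv: kv[1])
    if best_level > threshold:
        return best_action                 # exploit the strongest activation
    return random.choice(list(levels))     # otherwise select stochastically
```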
Action-Selection: example
l_t(a_h) = \sum_{i=1}^{k} \sum_{j=1}^{X_{y_i}} \mu_t(x_{ij} \mid y_i) \cdot p(x_{ij} \mid y_i)
Internal Simulation
[Overview architecture diagram repeated for this section (internal simulation via the emotion process).]
Thinking as Internal Simulation of
Behavior
• Internal simulation of behavior:
  – covertly execute and evaluate potential interactions using sensory-motor substrates (Hesslow, 2002; Damasio; Cotterill, 2001), but see also
  – "interaction potentialities" (Bickhard), and
  – "state anticipation" (Butz, Sigaud, Gérard, 2003).
  – Existing mechanisms are the basis for simulation: evolutionary continuity!
Simulation: action-selection bias
At every step, instead of selecting an action directly, first select a subset of predicted interactions from the reinforcement-learning model and feed them back into the RL model (see the sketch after the figure below):
1. Interaction-selection: select a subset of the predicted interactions.
2. Simulate-and-bias-predicted-benefit: feed the selected interactions back to the model as if they were real interactions (note that the memory advances to time t+1).
3. Reset-memory-state: reset the memory to time t to be able to select an appropriate action.
4. Action-selection: select the next action using the action-selection mechanism explained earlier, now based on the biased action values.
[Overview architecture diagram repeated, here labelling the model as a hierarchical-state RL model.]
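A minimal sketch of one simulate-then-act cycle; the model interface (predicted_interactions, process, save_memory_state, restore_memory_state, select_action) is hypothetical, and the mapping from the simulation threshold to the kept fraction of predictions (a high threshold keeps only the best few) is an assumption.

```python
# One covert-simulation cycle preceding overt action selection (hypothetical
# model interface). Selected predictions are fed back as if they really
# happened, which biases the action values; the memory is then reset to time t
# before the overt action is chosen.

def select_subset(predictions, threshold):
    # rank predictions by combined value mu; a high threshold keeps only the
    # best few, a low threshold keeps many (assumed mapping)
    ranked = sorted(predictions, key=lambda x: x["mu"], reverse=True)
    keep = max(1, int(round((1.0 - threshold) * len(ranked))))
    return ranked[:keep]

def simulate_and_act(model, threshold):
    selected = select_subset(model.predicted_interactions(), threshold)  # 1.
    snapshot = model.save_memory_state()
    for interaction in selected:
        model.process(interaction, simulated=True)  # 2. bias; memory -> t+1
    model.restore_memory_state(snapshot)            # 3. reset memory to time t
    return model.select_action()                    # 4. act on biased values
```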
Simulation: example
• Action list before simulation (!hypothetical example!):
  – {up = 0.2, down = -0.5, right = -1, left = -1}
• Action-selection would have selected "up",
  – at least using our naive action-selection mechanism;
  – with Boltzmann selection, "up" would get a high probability.
• Simulate all interactions.
  [Figure: grid world containing a roadblock with r = -0.5.]
  – Propagate back the predicted values by simulating interaction with the environment.
  – The effect is a "value look-ahead" of one step.
• Action list after simulation:
  – {up = 0.1, down = 0.5, right = -1, left = -1}
• Action-selection now selects "down".
• In this example, simulating all predicted interactions helps.
But: Simulating Everything is not
Always Best
• Even apart from the fact that simulating everything costs mental effort:
• Earlier experiments (Broekens, 2005) showed that
  – simulation has a benefit, especially when many interactions are simulated. This is not surprising (a better heuristic). However,
  – in some cases less simulation resulted in better learning.
• There is a dynamic relation between the environment and the simulation "strategy" (i.e. the simulation threshold: the percentage of all predicted interactions to be simulated).
• Emotion as metalearning to adapt the amount of internal simulation? (Doya, 2002)
  – Pleasure is an indication of the current performance of the agent (Clore and Gasper, 2000). Also,
  – high pleasure → top-down thinking, and low pleasure → bottom-up thinking (Fiedler and Bless, 2000).
Pleasure Modulates Simulation
[Overview architecture diagram repeated for this section (pleasure regulating the emotion process).]
Pleasure Modulates Simulation
• There are many theories of emotion.
• We use the core-affect (or activation-valence) theory of emotion as a basis.
  – It has two fundamental factors, pleasure and arousal (Russell, 2003):
  – pleasure relates to emotional valence, and
  – arousal relates to action-readiness, or activity.
• In this study we model pleasure as the simulation threshold.
  – We use pleasure to dynamically adapt the number of interactions that are simulated; it is thus used as a dynamic simulation threshold.
  – We study the indirect effect of emotion as a metalearning parameter that affects information processing, which in turn influences action-selection.
• Many models study emotion as a direct influence on action-selection (or motivation(-al states)) (Avila-Garcia and Cañamero, 2004; Cañamero, 1997; Velasquez, 1998), or as information (e.g. Botelho and Coelho).
• An example of an exception: Belavkin (2004), on the relation between emotion, entropy and information processing.
Pleasure Modulates Simulation
• Pleasure: an indication of current performance relative to what the agent is used to.
  – We tried to capture this with the normalized difference between the short-term average reinforcement signal and the long-term average reinforcement signal.
• Continuous pleasure feedback:

  e_p = \frac{r_{star} - (r_{ltar} - f_{ltar})}{2 f_{ltar}}

  – High pleasure: things are going well? Continue the strategy, goal-directed thinking.
    • High e_p → high threshold: simulate only the predicted best interactions.
  – Low pleasure? Look broader, pay more attention to all predicted interactions.
    • Low e_p → low threshold: simulate many interactions.
  (A sketch of this signal follows after the figure below.)
[Overview architecture diagram repeated, here labelling the model as a hierarchical-state RL model and pleasure as e_p.]
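A minimal sketch of the pleasure signal and its use as a dynamic simulation threshold; the window sizes, the value of f_ltar, and the clamping to [0, 1] are illustrative assumptions.

```python
from collections import deque

# Pleasure as the normalized difference between the short-term and long-term
# average reinforcement: e_p = (r_star - (r_ltar - f_ltar)) / (2 * f_ltar),
# so r_star == r_ltar gives e_p = 0.5. The returned e_p is then used directly
# as the simulation threshold (high e_p -> simulate only the best predictions).

class PleasureSignal:
    def __init__(self, short_window=10, long_window=100, f_ltar=1.0):
        self.short = deque(maxlen=short_window)
        self.long = deque(maxlen=long_window)
        self.f_ltar = f_ltar          # half-width of the normalization band

    def update(self, r):
        self.short.append(r)
        self.long.append(r)
        r_star = sum(self.short) / len(self.short)   # short-term average
        r_ltar = sum(self.long) / len(self.long)     # long-term average
        e_p = (r_star - (r_ltar - self.f_ltar)) / (2.0 * self.f_ltar)
        return min(1.0, max(0.0, e_p))               # keep e_p in [0, 1]
```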
Experiments
• To measure the adaptive effect of pleasure-modulated simulation, force the agent to adapt to a new task:
  – first the agent has 128 trials to learn task 1, then
  – the environment is switched to a new task and the agent gets 128 trials to learn task 2.
  – This is repeated for many different parameter settings (e.g. the windows of the long- and short-term average reinforcement signals, the learning rate, etc.).
• Pleasure predictions:
  – pleasure increases to a value near 1 (the agent gets better at the task),
  – then slowly converges down to 0.5 (the agent gets used to the task);
  – at the switch, pleasure drops (new task, drop in performance),
  – then it increases to a value near 1 again and converges down to 0.5 (the agent gets used to the new task).
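A minimal sketch of the task-switch protocol; make_agent and run_trial are hypothetical callables standing in for the actual agent and grid-world code, and the logged quantities are only the ones mentioned on these slides.

```python
# Task-switch protocol: 128 trials on task 1, then 128 trials on task 2,
# logging performance (steps), mental effort (number of simulated
# interactions), and the pleasure signal e_p per trial.

def run_condition(make_agent, tasks, run_trial, trials=128):
    agent, log = make_agent(), []
    for task in tasks:                        # task 1, then the switched task 2
        for trial in range(trials):
            steps, n_simulated, e_p = run_trial(agent, task)
            log.append({"task": task, "trial": trial, "steps": steps,
                        "simulated": n_simulated, "e_p": e_p})
    return log
```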
Results
• The performance of pleasure-modulated simulation is comparable with simulating ALL / the best 50% of predicted interactions (a static simulation threshold), but using only 30% / 70% of the mental resources.
Results
• Some settings even show a significantly better performance at a lower mental cost.
• The predicted pleasure curve was confirmed.
Conclusions
• Simple pleasure feedback can be used to determine the broadness of internal simulation; when simulation is used as an action-selection bias, performance stays comparable while mental effort decreases.
  – Since we introduce few new mechanisms for simulation, this is relevant to the understanding of the evolutionary plausibility of the simulation hypothesis: increased adaptation at lower cost is an evolutionarily advantageous feature.
• Our results provide clues about a relation between the simulation hypothesis and emotion theory.
Action-selection discussion, and
questions.
• Use emotion to:
  – vary the action-selection distribution (Doya, 2002), and/or
  – vary the interaction-selection distribution (e.g. the temperature of Boltzmann selection, or the threshold of our action-selection mechanism); a Boltzmann sketch follows at the end of this slide.
• Interplay between covert interaction (simulation) and overt interaction (action-selection).
  – Simulate the best interactions, but choose an action stochastically, see also (Gadanho, 2003):
    this gives an extra "drive" to certain actions.
  – The inverse seems rational too:
    simulate bad actions for "mental (covert) exploration", choose the best actions for "overt exploitation".
    Early experiments do not (yet) show a clear benefit.
• How to integrate internal simulation input into action-selection?
• Questions?
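For the Boltzmann alternative mentioned in the first bullet, a minimal sketch of softmax action selection with an emotion-modulated temperature; the linear mapping from e_p to temperature is an illustrative assumption, not something implemented in this work.

```python
import math
import random

# Boltzmann (softmax) action selection whose temperature is modulated by the
# pleasure signal: low e_p -> high temperature -> more exploration,
# high e_p -> low temperature -> more exploitation.

def boltzmann_select(levels, e_p, t_min=0.1, t_max=2.0):
    temperature = t_max - e_p * (t_max - t_min)    # assumed linear mapping
    actions = list(levels)
    weights = [math.exp(levels[a] / temperature) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```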