Surprise-based learning: Neuromodulation by surprise
in multi-factor learning rules
Mohammad Javad Faraji¹, Kerstin Preuschoff¹, and Wulfram Gerstner¹
1. School of Life Sciences, Brain Mind Institute & School of Computer and Communication Sciences, EPFL, Switzerland
Abstract
The role of surprise in synaptic learning rules in neural networks is
largely undetermined. We address (1) how surprise affects learning
and (2) how surprise signals are reflected in neural networks. We
show that surprise can in principle improve learning by
- modulating the learning rate,
- regulating the exploration-exploitation trade-off, and
- generating new environmental states.
Surprise triggers new clusters
A classic K-means clustering algorithm is modified such that, if the total number of
clusters is initially unknown, the agent (classifier) equipped with surprise is able to
add more clusters (black circles) whenever it judges a pattern (colored data point)
to be surprising, i.e., a pattern which may belong to none of the existing clusters.
It represents an agent that is able to generate (trigger) new states, an essential
feature for learning new environments.
Modulating learning rate by surprise
Computational model for learning rate
Adaptive reinforcement learning in dynamic environments
The reward prediction error δ_n = r_n − µ̂_{n−1} and the estimated risk σ̂_n of the environment are used to measure surprise, S_n = f(|δ_n|/σ̂_{n−1}). The dynamics of the learning rate α are then controlled by S_n and by the level of estimation uncertainty û_n = std(µ̂), which determines the variation of the estimated mean reward µ̂. In other words, α̇ = −α/(k û_{n−1}) + S_n, where k is a constant.
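The learning-rate dynamics above can be illustrated with a short simulation. The following is a minimal sketch, not the authors' implementation. Everything the poster leaves open is an assumption made here for illustration: the form of f (a thresholded tanh), the discretization of α̇ (one unit-time exponential decay step followed by adding S_n), the running update of σ̂, the window used for û = std(µ̂), all parameter values, and the helper names `make_rewards`, `surprise`, and `track_mean`.

```python
import math
import random
import statistics


def make_rewards(seed=0, n_per_block=100, means=(0.0, 5.0), sd=0.5):
    """Hypothetical task: Gaussian rewards whose mean jumps between blocks."""
    rng = random.Random(seed)
    return [rng.gauss(m, sd) for m in means for _ in range(n_per_block)]


def surprise(delta, sigma_hat, threshold=3.0):
    """Assumed f: zero while |delta|/sigma_hat is below threshold, then saturating at 1."""
    z = abs(delta) / max(sigma_hat, 1e-6)
    return math.tanh(max(0.0, z - threshold))


def track_mean(rewards, k=5.0, alpha_min=0.05, sigma0=1.0, beta=0.1, window=10):
    """Track the mean reward with a surprise-modulated learning rate.

    Returns (mu_hist, alpha_hist); entry n is the state after reward n.
    """
    mu_hat, sigma_hat, alpha = rewards[0], sigma0, alpha_min
    mu_hist, alpha_hist = [mu_hat], [alpha]
    for r in rewards[1:]:
        delta = r - mu_hat                      # delta_n = r_n - mu_hat_{n-1}
        s = surprise(delta, sigma_hat)          # S_n uses sigma_hat_{n-1}
        # u_hat_{n-1}: spread of the recent mean estimates (assumed window).
        u_hat = max(statistics.pstdev(mu_hist[-window:]), 1e-6)
        # Assumed discretization of d(alpha)/dt = -alpha/(k*u_hat) + S_n:
        # decay for one unit of time, then add the surprise input.
        alpha = alpha * math.exp(-1.0 / (k * u_hat)) + s
        alpha = min(1.0, max(alpha_min, alpha))  # keep alpha in [alpha_min, 1]
        mu_hat += alpha * delta                  # delta-rule update of the mean
        # Assumed running estimate of the risk (standard deviation of delta).
        sigma_hat = math.sqrt((1.0 - beta) * sigma_hat ** 2 + beta * delta ** 2)
        mu_hist.append(mu_hat)
        alpha_hist.append(alpha)
    return mu_hist, alpha_hist


if __name__ == "__main__":
    rewards = make_rewards()
    mu_hist, alpha_hist = track_mean(rewards)
    for n in (95, 100, 105, 120):
        print(n, round(mu_hist[n], 2), round(alpha_hist[n], 2))
```

In this sketch the learning rate stays near `alpha_min` while the environment is stable, jumps toward 1 when a reward falls far outside the estimated risk, and then decays at a speed set by k û, so that a fluctuating estimate keeps the rate elevated for longer.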
The agent estimates the probability of reward delivery in a reversal task (upper-left). We altered a standard SARSA learning algorithm (blue line) such that when the agent detects an unexpected event beyond the stochasticity of the environment, the ensuing surprise signal (bottom-left) temporarily accelerates learning, leading to more accumulated reward for the surprise-based reinforcement learner (red line). In the dynamic decision making task (upper-right), the learner observes samples from a Gaussian distribution with varying mean (black line). Modulating the learning rate by surprise signals (bottom-right) leads to faster detection of change points (red line) than in the SARSA model (blue line).
These results hold equally well for different surprise measures: Shannon surprise −log P(r_n | µ̂_{n−1}, σ̂_{n−1}), the Bayesian approach D_KL[P(µ̂_{n+1} | r_n) || P(µ̂_n)], and the model-free |r_n − µ̂_{n−1}|/σ̂_{n−1}.
Free energy in neural networks
The variational free energy (blue line), used for estimating the likelihood of the input patterns (digits from the MNIST dataset) in a Boltzmann machine, can be used as a surprise measure.
Neural signatures of surprise
[Figure, panels A and B: insula activation at y=14 (left insula, right insula); mean beta (df=18) plotted against risk prediction error, after first card and after second card.]
BOLD response to adaptive learning rates during a dynamic decision making task (left) [1]. BOLD response to surprise when measured as risk prediction error in a gambling task (right) [2].
Conclusion
Surprise-driven modulations can enhance the learning performance
at both the behavioral and neural network level. In two decision
making tasks, surprise-based SARSA accelerated learning. A
surprise-based clustering algorithm can trigger new clusters if it
judges a pattern to be novel. Further, we simulated a classic
Boltzmann machine to use the network activity itself to measure
how much a new pattern is surprising. Since the surprise signal is
generated by the network itself, it can be used as a biologically
plausible third factor in multi-factor learning rules.
References
1. Guo et al. "Neural Correlates of the Learning Rate in a Changing Environment", Cosyne abstract, 2012.
2. Preuschoff et al. "Human insula activation reflects risk prediction errors as well as risk.", J. Neuroscience 28.11 (2008): 2745-2752.
Research was supported by the ERC (grant no. 268 689, H.S.)

Chaohui Guo¹, Yosuke Morishima¹, Peter Bossaerts³, Kerstin Preuschoff¹,²
1. University of Zurich, Switzerland, 2. École Polytechnique Fédérale de Lausanne, Switzerland, 3. California Institute of Technology, USA
Abstract
In reward learning, the learning rate is a fundamental parameter that has been shown to adapt to the characteristics of a changing environment. How the learning rate is implemented in the human brain and how humans adjust their learning rate in a dynamic environment remains unclear. Here, we study the underlying neural mechanisms of learning in a changing environment by combining computational models of reward learning and functional magnetic resonance imaging (fMRI).
Paradigm
Twenty-one healthy subjects participated in an fMRI study. Each subject viewed a series of samples drawn from a normal distribution, the mean and standard deviation of which could change over time. Subjects were asked to make a series of estimations of the true mean of the hidden distribution based on the samples provided (Figure 1). The learning rates as well as prediction errors and measures of uncertainty were calculated based on reinforcement learning models. The first model uses objective measures such as the true underlying standard deviation. The second model estimates hidden variables such as the subjects' belief about the noise level of the hidden distribution and the subjective uncertainty of their estimations of the true mean. These two variables capture different aspects of subjects' hidden beliefs: one describes sub-
Results I
Using a standard reinforcement learning model we first estimated the trial-by-trial learning rate directly from the prediction errors. During the feedback stage, we find a correlation of the learning rate with the BOLD response in anterior insula, medial frontal gyrus and striatum (Figure 2). In addition, two measures of uncertainty, the absolute prediction error and the entropy as derived from the true standard deviation, are reflected in a network that includes the striatum, medial frontal gyrus and bilateral anterior insula (Figure 3).
Figure 2. Learning rate. BOLD response in bilateral anterior insula and medial frontal gyrus correlates with trial-by-trial learning rates (p<0.001, uncorrected).
Results II
In a second step, we used particle filtering to access subjective estimates of the mean and standard deviation of the underlying distribution as well as the underlying learning rate. During the feedback stage, the learning rate correlates with the medial frontal gyrus, bilateral anterior insula and dorsal striatum (Figure 4). During the guessing stage, the representation of the learning rate shifts toward more frontal regions including the medial frontal cortex and superior/medial frontal gyrus (Figure 5).
Figure 4. BOLD response to learning rate during feedback stage using particle filter estimates (p<0.05, corrected at cluster level).
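The particle filtering step mentioned in Results II can be illustrated with a generic bootstrap particle filter. This is a sketch under stated assumptions, not the model used in the study. The state (mean and log standard deviation of the hidden Gaussian), the random-walk and jump dynamics, the prior ranges, all parameter values, and the definition of the effective learning rate as the shift of the mean estimate divided by the prediction error are choices made here for illustration. The function name `particle_filter` is hypothetical.

```python
import math
import random


def particle_filter(samples, n_particles=500, mu_range=(-10.0, 10.0),
                    walk_mu=0.05, walk_log_sigma=0.05, p_jump=0.02, seed=0):
    """Bootstrap particle filter for a Gaussian with drifting mean and sd.

    Returns one (mu_est, sigma_est, alpha_eff) tuple per sample.
    """
    rng = random.Random(seed)
    lo, hi = mu_range
    ls_min, ls_max = math.log(0.05), math.log(10.0)
    # Each particle is a hypothesis (mu, log_sigma) about the hidden Gaussian.
    particles = [(rng.uniform(lo, hi), rng.uniform(math.log(0.2), math.log(5.0)))
                 for _ in range(n_particles)]
    prev_mu = sum(p[0] for p in particles) / n_particles
    estimates = []
    for x in samples:
        # Propagate: small random walk, plus rare jumps of the mean
        # (assumed way to model change points).
        moved = []
        for mu, ls in particles:
            if rng.random() < p_jump:
                mu = rng.uniform(lo, hi)
            else:
                mu += rng.gauss(0.0, walk_mu)
            ls = min(ls_max, max(ls_min, ls + rng.gauss(0.0, walk_log_sigma)))
            moved.append((mu, ls))
        # Weight by the Gaussian log-likelihood of the sample; subtracting
        # the maximum avoids an all-zero weight vector from underflow.
        log_w = [-ls - 0.5 * ((x - mu) / math.exp(ls)) ** 2 for mu, ls in moved]
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        mu_est = sum(wi * p[0] for wi, p in zip(w, moved)) / total
        sigma_est = sum(wi * math.exp(p[1]) for wi, p in zip(w, moved)) / total
        # Effective learning rate implied by the update (assumed definition).
        err = x - prev_mu
        alpha_eff = (mu_est - prev_mu) / err if abs(err) > 1e-9 else 0.0
        estimates.append((mu_est, sigma_est, alpha_eff))
        prev_mu = mu_est
        # Resample particles in proportion to their weights.
        particles = rng.choices(moved, weights=w, k=n_particles)
    return estimates


if __name__ == "__main__":
    rng = random.Random(1)
    samples = ([rng.gauss(0.0, 1.0) for _ in range(100)]
               + [rng.gauss(4.0, 1.0) for _ in range(150)])
    for n, (mu, sigma, alpha) in enumerate(particle_filter(samples)):
        if n % 25 == 0:
            print(n, round(mu, 2), round(sigma, 2), round(alpha, 2))
```

The weighted particle averages play the role of the subjective estimates of the mean and standard deviation. The jump probability is one simple way to let the filter recover after an abrupt change in the hidden mean.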