Surprise-based learning: Neuromodulation by surprise in multi-factor learning rules

Mohammad Javad Faraji (1), Kerstin Preuschoff (1), and Wulfram Gerstner (1)
1. School of Life Sciences, Brain Mind Institute & School of Computer and Communication Sciences, EPFL, Switzerland

Abstract

The role of surprise in synaptic learning rules in neural networks is largely undetermined. We address (1) how surprise affects learning and (2) how surprise signals are reflected in neural networks. We show that surprise can, in principle, improve learning by
- modulating the learning rate,
- regulating the exploration-exploitation trade-off, and
- generating new environmental states.

Modulating the learning rate by surprise

In our computational model for the learning rate, the reward prediction error δ_n = r_n − µ̂_{n−1} and the estimated risk σ̂_n of the environment are used to measure surprise, S_n = f(|δ_n| / σ̂_{n−1}). The dynamics of the learning rate α are then controlled by S_n and by the level of estimation uncertainty û_n = std(µ̂), which quantifies the variation of the estimated mean reward µ̂: α̇ = −α / (k û_{n−1}) + S_n, where k is a constant. In words, the learning rate decays on a timescale set by the estimation uncertainty and is transiently boosted whenever the agent is surprised.
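To make these learning-rate dynamics concrete, the following is a minimal numerical sketch, not the authors' simulation code: the choice of the surprise function f, the constant k, the discretization step, and the running estimators of µ̂, σ̂, and û (as well as the function names) are illustrative assumptions.

```python
import numpy as np

def surprise(delta, sigma_hat):
    # S_n = f(|delta_n| / sigma_hat_{n-1}); the poster leaves f general,
    # so a saturating form is assumed here for illustration.
    z = abs(delta) / max(sigma_hat, 1e-6)
    return z**2 / (1.0 + z**2)

def surprise_modulated_estimation(rewards, k=5.0, dt=0.1, window=20):
    """Track the mean reward with a surprise-modulated learning rate.

    Discrete-time version of alpha_dot = -alpha / (k * u_hat_{n-1}) + S_n,
    where u_hat is the standard deviation of recent mu_hat estimates
    (estimation uncertainty) and S_n is the surprise signal.
    """
    mu_hat, sigma_hat, alpha = 0.0, 1.0, 0.1
    mu_trace, alpha_trace = [mu_hat], [alpha]
    for r in rewards:
        delta = r - mu_hat                              # reward prediction error
        S = surprise(delta, sigma_hat)                  # surprise S_n
        u_hat = max(np.std(mu_trace[-window:]), 1e-3)   # estimation uncertainty (running window, assumed)
        # Discrete update of alpha: decay toward zero plus surprise drive, clipped to [0, 1].
        alpha = float(np.clip(alpha * np.exp(-dt / (k * u_hat)) + dt * S, 0.0, 1.0))
        mu_hat += alpha * delta                         # update the estimated mean reward
        sigma_hat += 0.1 * (abs(delta) - sigma_hat)     # crude running risk estimate (assumed)
        mu_trace.append(mu_hat)
        alpha_trace.append(alpha)
    return np.array(mu_trace), np.array(alpha_trace)

# The mean reward jumps at trial 100; surprise transiently raises alpha,
# so the estimate re-converges faster than with a fixed small learning rate.
rng = np.random.default_rng(0)
rewards = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
mu_trace, alpha_trace = surprise_modulated_estimation(rewards)
print(alpha_trace[95:105].round(2), mu_trace[-1].round(2))
```

The key design point this sketch illustrates is that a large, unexpected prediction error drives S_n up and hence transiently increases α, while in stationary stretches α relaxes back at a rate set by the estimation uncertainty.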
Adaptive reinforcement learning in dynamic environments

The agent estimates the probability of reward delivery in a reversal task (upper left). We altered a standard SARSA learning algorithm (blue line) such that, when the agent detects an unexpected event beyond the stochasticity of the environment, the dynamic surprise signal (bottom left) temporarily accelerates learning, leading to more accumulated reward for the surprise-based reinforcement learner (red line). In a second decision-making task (upper right), the learner observes samples from a Gaussian distribution with a varying mean (black line). Modulating the learning rate by surprise signals (bottom right) leads to faster detection of change points (red line) than in the SARSA model (blue line). These results hold equally well for different surprise measures: the Shannon surprise −log P(r_n | µ̂_{n−1}, σ̂_{n−1}), the Bayesian approach D_KL[P(µ̂_{n+1} | r_n) || P(µ̂_n)], and the model-free measure |r_n − µ̂_{n−1}| / σ̂_{n−1}.

Surprise triggers new clusters

A classic K-means clustering algorithm is modified such that, if the total number of clusters is initially unknown, an agent (classifier) equipped with surprise can add more clusters (black circles) whenever it judges a pattern (colored data point) to be surprising, i.e., a pattern that may belong to none of the existing clusters. This represents an agent that is able to generate (trigger) new states, an essential feature for learning new environments.

Free energy in neural networks

The variational free energy (blue line), used for estimating the likelihood of the input patterns (digits from the MNIST dataset) in a Boltzmann machine, can be used as a surprise measure.

Neural signatures of surprise

BOLD response to adaptive learning rates during a dynamic decision-making task (left) [1]. BOLD response to surprise when measured as a risk prediction error in a gambling task (right) [2].
[Panels A, B: left and right anterior insula (y = 14); mean beta (df = 18) after the first and after the second card as a function of the risk prediction error.]

Conclusion

Surprise-driven modulations can enhance learning performance at both the behavioral and the neural-network level. In two decision-making tasks, surprise-based SARSA accelerated learning. A surprise-based clustering algorithm can trigger new clusters if it judges a pattern to be novel. Further, we simulated a classic Boltzmann machine to use the network activity itself to measure how surprising a new pattern is. Since the surprise signal is generated by the network itself, it can be used as a biologically plausible third factor in multi-factor learning rules.

References

1. Guo et al., "Neural Correlates of the Learning Rate in a Changing Environment", Cosyne abstract, 2012.
2. Preuschoff et al., "Human insula activation reflects risk prediction errors as well as risk", J. Neuroscience 28.11 (2008): 2745-2752.

Research was supported by the ERC (grant no. 268 689, H.S.).


Neural Correlates of the Learning Rate in a Changing Environment

Chaohui Guo (1), Yosuke Morishima (1), Peter Bossaerts (3), Kerstin Preuschoff (1,2)
1. University of Zurich, Switzerland; 2. École Polytechnique Fédérale de Lausanne, Switzerland; 3. California Institute of Technology, USA
Contact: [email protected]

Abstract

In reward learning, the learning rate is a fundamental parameter that has been shown to adapt to the characteristics of a changing environment. How the learning rate is implemented in the human brain and how humans adjust their learning rate in a dynamic environment remains unclear. Here, we study the underlying neural mechanisms of learning in a changing environment by combining computational models of reward learning and functional magnetic resonance imaging (fMRI).

Paradigm

Twenty-one healthy subjects participated in an fMRI study. Each subject viewed a series of samples drawn from a normal distribution whose mean and standard deviation could change over time. Subjects were asked to make a series of estimations of the true mean of the hidden distribution based on the samples provided (Figure 1). The learning rates as well as prediction errors and measures of uncertainty were calculated based on reinforcement learning models. The first model uses objective measures such as the true underlying standard deviation. The second model estimates hidden variables such as the subjects' belief about the noise level of the hidden distribution and the subjective uncertainty of their estimations of the true mean. These two variables capture different aspects of subjects' hidden beliefs: one describes sub-

Results I

Using a standard reinforcement learning model, we first estimated the trial-by-trial learning rate directly from the prediction errors. During the feedback stage, we find a correlation of the learning rate with the BOLD response in anterior insula, medial frontal gyrus, and striatum (Figure 2). In addition, two measures of uncertainty - the absolute prediction error and the entropy as derived from the true standard deviation - are reflected in a network that includes the striatum, medial frontal gyrus, and bilateral anterior insula (Figure 3).

Results II

In a second step, we used particle filtering to access subjective estimates of the mean and standard deviation of the underlying distribution as well as the underlying learning rate. During the feedback stage, the learning rate correlates with the medial frontal gyrus, bilateral anterior insula, and dorsal striatum (Figure 4). During the ensuing guessing stage, the representation of the learning rate shifts toward more frontal regions, including the medial frontal cortex and superior/medial frontal gyrus (Figure 5).

Figure 2. Learning rate: BOLD response in bilateral anterior insula and medial frontal gyrus correlates with trial-by-trial learning rates (p<0.001, uncorrected).
Figure 4. BOLD response to the learning rate during the feedback stage, using particle filter estimates (p<0.05, corrected at cluster level).
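Results II refers to particle filtering for subjective estimates of the mean and standard deviation of the changing distribution. The sketch below is a generic bootstrap particle filter under assumed dynamics (slow drift of the mean plus occasional change points), not the model used in the study; the particle count, drift rate, change-point probability, and function name are illustrative assumptions. An effective trial-by-trial learning rate is read off as the ratio between the update of the mean estimate and the prediction error.

```python
import numpy as np

rng = np.random.default_rng(1)

def particle_filter(samples, n_particles=2000, drift=0.05, p_change=0.02):
    """Bootstrap particle filter tracking the hidden mean and std of a Gaussian.

    Each particle carries a state (mu, sigma). Assumed dynamics: mu drifts
    slowly and occasionally jumps to a new value (change point); sigma is
    jittered slightly so it can be re-estimated after resampling.
    """
    mu = rng.normal(0.0, 5.0, n_particles)
    sigma = rng.uniform(0.5, 5.0, n_particles)
    est_mu, est_sigma, alphas = [], [], []
    prev = 0.0
    for x in samples:
        # Propagate particles: small drift plus occasional change points.
        jump = rng.random(n_particles) < p_change
        mu = np.where(jump, rng.normal(0.0, 5.0, n_particles),
                      mu + rng.normal(0.0, drift, n_particles))
        sigma = np.abs(sigma + rng.normal(0.0, 0.02, n_particles))
        # Reweight particles by the likelihood of the observed sample.
        w = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / sigma
        w /= w.sum()
        # Posterior-mean estimates and the effective trial-by-trial learning rate
        # alpha_n = (new estimate - old estimate) / prediction error.
        m, s = float(w @ mu), float(w @ sigma)
        delta = x - prev
        alphas.append((m - prev) / delta if abs(delta) > 1e-9 else 0.0)
        est_mu.append(m)
        est_sigma.append(s)
        prev = m
        # Multinomial resampling to avoid weight degeneracy.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        mu, sigma = mu[idx], sigma[idx]
    return np.array(est_mu), np.array(est_sigma), np.array(alphas)

# Samples from a hidden distribution whose mean and std change halfway through;
# the inferred learning rate rises transiently around the change point.
samples = np.concatenate([rng.normal(2.0, 1.0, 60), rng.normal(-3.0, 1.5, 60)])
est_mu, est_sigma, alphas = particle_filter(samples)
print(round(est_mu[-1], 2), round(est_sigma[-1], 2), alphas[58:66].round(2))
```

Reading the learning rate out of the filter in this way mirrors the first model's definition of the trial-by-trial learning rate as the estimate update divided by the prediction error.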