Threshold Learning for Optimal Decision Making
Poster #63, NIPS 2016 (pp. 3756-3764)
Nathan F. Lepora ([email protected], www.lepora.com)
Department of Engineering Mathematics and Bristol Robotics Laboratory, University of Bristol, U.K.

Introduction
• The standard psychological model of two-alternative forced-choice (2AFC) decision making, the drift-diffusion model [1], sets decision thresholds to tune the speed-accuracy trade-off and choice balance.
• It is unknown how animals learn these decision thresholds.
• We consider rewards based on the cost function of an equivalent statistical test, the sequential probability ratio test (SPRT) [2]. However, these rewards are stochastic and hence challenging to optimize.
• Here we introduce two successful optimization methods based on reinforcement learning and discuss their relation to animal learning.

Drift-diffusion model, SPRT and reward
• Drift-diffusion model: the evidence update is $z(t+1) = z(t) + \Delta z$ with increments $\Delta z \sim N(\mu, \sigma^2)$; the decision terminates at one of two choices when $z(t) < -\theta_0$ or $z(t) > \theta_1$.
• Sequential probability ratio test (SPRT): a Bayesian two-hypothesis test, equivalent to the drift-diffusion model for Gaussian likelihoods. It is optimal under a cost function of expected error $e \in \{0,1\}$ and decision time $T$:
  $C_{\mathrm{risk}} = \tfrac{1}{2}\,(W_0\,\mathrm{E}_0[e] + c\,\mathrm{E}_0[T]) + \tfrac{1}{2}\,(W_1\,\mathrm{E}_1[e] + c\,\mathrm{E}_1[T])$
• Optimization problem: find the decision thresholds $\theta_0$ and $\theta_1$ that minimize $C_{\mathrm{risk}}$.
• Reinforcement learning: equivalently, maximize the expected reward $\mathrm{E}(R) = -C_{\mathrm{risk}}$.
• Reward function: the equivalent single-trial rewards are $R = -W_0 - cT$ for an incorrect decision under $H_0$, $R = -W_1 - cT$ for an incorrect decision under $H_1$, and $R = -cT$ for a correct decision.
• Issue: these rewards are highly stochastic and hence challenging to optimize (see the reward figure: standard methods will fail). A simulation sketch of these single-trial rewards is given after the two methods below.
[Figure: reward over the two thresholds $(\theta_0, \theta_1)$, averaged over 100 trials.]

REINFORCE
Consider the thresholds as linear combinations
$\theta_0 = \sum_{j=1}^{n_{\mathrm{units}}} s_j y_j$, $\theta_1 = \sum_{j=1}^{n_{\mathrm{units}}} s_j y_{j+n_{\mathrm{units}}}$,
with coefficients $s_j$ exponentially spread (n-ary coding) over stochastic binary units $y_j \in \{0,1\}$ with $p(y_j = 1) = f(w_j) = 1/(1 + e^{-w_j})$. Then the REINFORCE learning rule
$\Delta w_j = \beta\,(y_j - f(w_j))\,R$
implements a policy gradient method [3] (see the sketch below).

Bayesian optimization
The dependence of reward on the decision thresholds is modelled with a Gaussian process,
$R(\theta_0, \theta_1) \sim \mathrm{GP}[\,m(\theta_0, \theta_1),\; k(\theta_0, \theta_1; \theta'_0, \theta'_1)\,]$,
fitting a mean $m$ and squared-exponential covariance $k$. Sampled thresholds are chosen to maximize the normal CDF
$\alpha(\theta_0, \theta_1) = \Phi\!\left[\frac{m(\theta_0, \theta_1) - R(\theta_0^*, \theta_1^*)}{\sqrt{k(\theta_0, \theta_1; \theta_0, \theta_1)}}\right]$,
i.e. the probability of improvement over the current best estimate $R(\theta_0^*, \theta_1^*)$. Bayesian optimization iterates the GP fit and this data acquisition [4] (see the sketch below).
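To make the reward function concrete, here is a minimal Python sketch (ours, not from the poster) simulating one drift-diffusion trial to threshold and returning the single-trial reward defined above. The drift and noise values, the drift sign convention under each hypothesis, and the name ddm_trial are illustrative assumptions; the costs default to the poster's $c = 0.05$, $W_0 = 0.1$, $W_1 = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def ddm_trial(theta0, theta1, mu=0.1, sigma=1.0, c=0.05, W0=0.1, W1=1.0,
              h1_true=True, max_steps=10_000):
    """Simulate one drift-diffusion trial and return its single-trial reward.

    Evidence z accumulates increments dz ~ N(+/-mu, sigma^2) until it crosses
    -theta0 (decide H0) or theta1 (decide H1). The reward is -c*T if correct,
    -W0 - c*T for an error under H0, and -W1 - c*T for an error under H1.
    """
    drift = mu if h1_true else -mu            # assumed sign convention
    z, T = 0.0, 0
    while -theta0 <= z <= theta1 and T < max_steps:
        z += rng.normal(drift, sigma)
        T += 1
    decide_h1 = z > theta1
    if decide_h1 == h1_true:                  # correct decision: time cost only
        return -c * T
    return -(W1 if h1_true else W0) - c * T   # error penalty W0 or W1

# The trial-to-trial spread of these rewards is large, which is why direct
# optimization of the noisy reward surface is hard (as the poster notes).
rewards = [ddm_trial(1.0, 1.0, h1_true=rng.random() < 0.5) for _ in range(100)]
print(f"mean reward {np.mean(rewards):.3f}, std {np.std(rewards):.3f}")
```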
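Next, a minimal sketch of the REINFORCE method, reusing ddm_trial from the sketch above. The number of units, the base-2 spacing of the coefficients $s_j$, and the learning rate $\beta$ are illustrative assumptions; only the rule $\Delta w_j = \beta (y_j - f(w_j)) R$ itself is from the poster.

```python
import numpy as np

# Assumes ddm_trial and rng from the previous sketch are in scope.

n_units = 5
s = 2.0 ** -np.arange(n_units)    # coefficients s_j exponentially spread
w = np.zeros(2 * n_units)         # unit weights: first half theta0, rest theta1
beta = 0.1                        # assumed learning rate

def f(w):
    """Unit firing probability p(y_j = 1) = 1 / (1 + exp(-w_j))."""
    return 1.0 / (1.0 + np.exp(-w))

for trial in range(5000):
    y = (rng.random(2 * n_units) < f(w)).astype(float)  # sample binary units
    theta0, theta1 = s @ y[:n_units], s @ y[n_units:]   # decode thresholds
    R = ddm_trial(theta0, theta1, h1_true=rng.random() < 0.5)
    w += beta * (y - f(w)) * R    # REINFORCE: dw_j = beta (y_j - f(w_j)) R

# Expected thresholds under the learned unit probabilities
print("thresholds:", s @ f(w)[:n_units], s @ f(w)[n_units:])
```

Because each update is the product of a reward factor $R$ and a local eligibility factor $(y_j - f(w_j))$, this is the two-factor, single-trial rule the Discussion refers to; it needs no reward baseline to give an unbiased policy gradient [3], though a baseline would reduce its variance.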
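Finally, a sketch of the Bayesian optimization loop, again reusing ddm_trial. The scikit-learn GP, the grid search range, the per-point trial count, and the kernel and noise settings are illustrative assumptions (the poster's exact GP settings are not given); the acquisition is the probability of improvement $\alpha(\theta_0, \theta_1)$ defined above, which trades off the GP mean against its uncertainty rather than greedily maximizing the mean.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Assumes ddm_trial and rng from the first sketch are in scope.

def avg_reward(theta, n_trials=10):
    """Noisy objective: mean single-trial reward at thresholds (theta0, theta1)."""
    return np.mean([ddm_trial(theta[0], theta[1], h1_true=rng.random() < 0.5)
                    for _ in range(n_trials)])

# Candidate grid over the two thresholds (assumed search range)
grid = np.stack(np.meshgrid(np.linspace(0.1, 3.0, 30),
                            np.linspace(0.1, 3.0, 30)), -1).reshape(-1, 2)

X = grid[rng.choice(len(grid), 5, replace=False)]   # initial random samples
y = np.array([avg_reward(x) for x in X])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.05)
for _ in range(50):
    gp.fit(X, y)                                    # GP fit ...
    m, sd = gp.predict(grid, return_std=True)
    pi = norm.cdf((m - y.max()) / np.maximum(sd, 1e-9))  # prob. of improvement
    x_next = grid[np.argmax(pi)]                    # ... then data acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, avg_reward(x_next))

print("best thresholds:", X[np.argmax(y)], "best average reward:", y.max())
```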
Threshold learning
[Figure: single learning episode, REINFORCE, with decision costs $c = 0.05$, $W_0 = 0.1$, $W_1 = 1$.]
[Figure: single learning episode, Bayesian optimization, with decision costs $c = 0.05$, $W_0 = 0.1$, $W_1 = 1$.]
[Figure: multiple learning episodes, REINFORCE with equal thresholds $\theta_0 = \theta_1$ and costs $0 < c/W_0 = c/W_1 < 0.1$.]
[Figure: multiple learning episodes, REINFORCE with two thresholds and costs $0 < c/W_0 < 1$, $0 < c/W_1 < 1$.]

REINFORCE vs Bayesian optimization

Multiple episodes    REINFORCE              Bayesian optimization
Computation time     0.5 s (5000 trials)    50 s (500 trials)
Uncertainty Δθ       0.23                   0.75

Comparison with animal experiments
• Rodent data [5] over the acquisition phase of a vibrotactile 2AFC task (sessions of 200 trials) show learning over ~5000 trials, resembling the REINFORCE learning rule.
• However, the animals progressively improve in accuracy (slowing their decision times), whereas both threshold learning methods do the opposite.

Discussion
• We proposed two learning rules for the decision thresholds in the drift-diffusion model of decision making.
• The key step is to use single-trial rewards based on the trial-averaged cost function.
• REINFORCE is a simple two-factor, single-trial learning rule for neural networks, based on a policy gradient method (equivalent to using trial-averaged gradients).
• Bayesian optimization fits a probabilistic (GP) model to the rewards and samples thresholds that maximize the probability of improvement over the current best estimate.
• REINFORCE needs more trials to learn, but is more accurate and computationally faster than Bayesian optimization. It also better resembles animal learning.

References
[1] R. Ratcliff. A theory of memory retrieval. Psychological Review, 85:59-108, 1978.
[2] A. Wald. Sequential Analysis. John Wiley and Sons (NY), 1947.
[3] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.
[4] E. Brochu, V. M. Cora, N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions. arXiv:1012.2599, 2010.
[5] J. Mayrhofer et al. Novel two-alternative forced choice paradigm for bilateral vibrotactile whisker frequency discrimination in head-fixed mice and rats. Journal of Neurophysiology, 109(1):273-284, 2013.