Threshold Learning for Optimal Decision Making
Poster #63
NIPS 2016, pp. 3756–3764
Nathan F. Lepora ([email protected], www.lepora.com)
Department of Engineering Mathematics and Bristol Robotics Laboratory, University of Bristol, U.K.
Introduction
• The standard psychological model of 2AFC decision making (the drift-diffusion model [1]) sets decision thresholds to tune the speed-accuracy and choice balance.
• It is unknown how animals learn these decision thresholds.
• We consider rewards based on the cost function of an equivalent statistical test, the SPRT [2]. However, these rewards are stochastic and hence challenging to optimize.
• Here we introduce two successful optimization methods based on reinforcement learning and discuss their relation to animal learning.
REINFORCE
Consider the thresholds as a linear combination
θ0 = Σ_{j=1}^{n_units} s_j y_j ,   θ1 = Σ_{j=1}^{n_units} s_j y_{j+n_units}
with coefficients s_j exponentially spread (n-ary coding) over stochastic binary units y_j ∈ {0, 1} with
p(y_j = 1) = f(w_j) = 1 / (1 + e^{−w_j}).
Then the REINFORCE learning rule is
Δw_j = β (y_j − f(w_j)) R
and implements a policy gradient method [3].
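A minimal NumPy sketch of this parameterization and update, assuming the single-trial reward R is supplied by the simulated decision task; the unit count, learning rate β and coefficient scaling are illustrative choices, not values from the poster.

```python
import numpy as np

rng = np.random.default_rng(0)

n_units = 10                      # illustrative number of units per threshold
beta = 0.1                        # learning rate (assumed value)
s = 2.0 ** np.arange(n_units)     # exponentially spread coefficients (n-ary coding, base 2)
s /= s.sum()                      # rescale so thresholds stay in a convenient range (assumption)
w = np.zeros(2 * n_units)         # weights of the 2*n_units stochastic units

def sample_thresholds(w):
    """Sample binary units y_j ~ Bernoulli(f(w_j)) and form the two thresholds."""
    p = 1.0 / (1.0 + np.exp(-w))                 # f(w_j)
    y = (rng.random(w.size) < p).astype(float)   # stochastic units y_j in {0, 1}
    theta0 = np.dot(s, y[:n_units])
    theta1 = np.dot(s, y[n_units:])
    return theta0, theta1, y, p

def reinforce_update(w, y, p, R):
    """REINFORCE rule: dw_j = beta * (y_j - f(w_j)) * R, applied once per trial."""
    return w + beta * (y - p) * R
```

Each trial, thresholds are sampled, one decision trial is run to obtain R, and the weights are updated; over many trials the weights climb the gradient of the expected reward.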
Bayesian optimization
Reward dependence on the decision thresholds is modelled with a Gaussian process
R(θ0, θ1) ~ GP[ m(θ0, θ1), k(θ0, θ1; θ′0, θ′1) ]
fitting a mean m and squared-exponential covariance k. Sampled thresholds are chosen to maximize the normal CDF
α(θ0, θ1) = Φ[ (m(θ0, θ1) − R(θ0*, θ1*)) / √k(θ0, θ1; θ0, θ1) ],
i.e. the chance of improvement over the current best estimate R(θ0*, θ1*). Bayesian optimization iterates the GP fit and data acquisition [4].
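A minimal sketch of this loop using scikit-learn's GP regressor and a probability-of-improvement acquisition over a grid of candidate thresholds. The function run_episode (the noisy mean reward of a block of trials at fixed thresholds), the grid range and the kernel settings are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def bayes_opt_thresholds(run_episode, n_iters=100, grid_max=2.0, grid_n=21, seed=0):
    """Bayesian optimization of (theta0, theta1) against noisy episode rewards.

    run_episode(theta0, theta1) -> mean reward over a block of trials (assumed supplied).
    """
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.01, grid_max, grid_n)
    cand = np.array([[t0, t1] for t0 in grid for t1 in grid])   # candidate thresholds

    # a few random evaluations to seed the GP fit
    X = cand[rng.choice(len(cand), size=5, replace=False)]
    y = np.array([run_episode(t0, t1) for t0, t1 in X])

    gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=0.5),
                                  alpha=1e-2, normalize_y=True)
    for _ in range(n_iters):
        gp.fit(X, y)
        m, sd = gp.predict(cand, return_std=True)       # GP mean and std at the candidates
        best = y.max()                                  # current best estimate R(theta0*, theta1*)
        prob_improve = norm.cdf((m - best) / (sd + 1e-9))
        x_next = cand[np.argmax(prob_improve)]          # acquisition: probability of improvement
        X = np.vstack([X, x_next])
        y = np.append(y, run_episode(*x_next))
    return X[np.argmax(y)]                              # thresholds with the best observed reward
```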
Threshold learning
[Figure: single learning episode, REINFORCE, with decision costs c = 0.05, W0 = 0.1, W1 = 1.]
[Figure: single learning episode, Bayesian optimization, with decision costs c = 0.05, W0 = 0.1, W1 = 1.]
[Figure: multiple learning episodes, REINFORCE, with equal thresholds θ0 = θ1 and 0 < c/W0 = c/W1 < 0.1.]
[Figure: multiple learning episodes, REINFORCE, with two thresholds and costs 0 < c/W0 < 1, 0 < c/W1 < 1.]
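As a rough sketch of how one such learning episode can be run, combining the REINFORCE sketch above with a per-trial reward; single_trial_reward is the helper sketched alongside the reward function below, and w, sample_thresholds and reinforce_update are from the REINFORCE sketch. The decision costs match the single-episode figures and the trial count matches the comparison table.

```python
import numpy as np

rng = np.random.default_rng(1)
for trial in range(5000):                         # episode length, as in the comparison table
    theta0, theta1, y, p = sample_thresholds(w)   # sample thresholds from the stochastic units
    true_h = int(rng.integers(2))                 # H0 or H1 with equal probability
    R = single_trial_reward(theta0, theta1, true_h=true_h,
                            c=0.05, W0=0.1, W1=1.0, rng=rng)
    w = reinforce_update(w, y, p, R)              # single-trial REINFORCE update
```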
REINFORCE vs Bayesian optimization
Comparison over multiple learning episodes:

                    REINFORCE                Bayesian optimization
Computation time    0.5 sec (5000 trials)    50 sec (500 trials)
Uncertainty Δθ      0.23                     0.75
Drift-diffusion model, SPRT and reward
Drift-diffusion model: evidence update z(t+1) = z(t) + Δz with Gaussian increment Δz ~ N(μ, σ²); the decision terminates at one of the two choices when z(t) < −θ0 or z(t) > θ1.
Sequential probability ratio test (SPRT): Bayesian 2-hypothesis test, equivalent to the drift-diffusion model for Gaussian likelihoods. Optimal under a cost function of expected error e ∈ {0, 1} and decision time T:
C_risk = ½ (W0 E0[e] + c E0[T]) + ½ (W1 E1[e] + c E1[T])
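A minimal sketch of one such trial under these definitions; the drift μ, noise σ and unit time step are illustrative assumptions, with the sign of the drift set by which hypothesis is true.

```python
import numpy as np

def ddm_trial(theta0, theta1, mu=0.2, sigma=1.0, rng=None):
    """Simulate one drift-diffusion trial: accumulate z until it crosses -theta0 or +theta1.

    Returns (choice, T): choice 0 if z(t) < -theta0, choice 1 if z(t) > theta1,
    and T, the number of evidence samples taken (decision time in steps).
    """
    rng = rng or np.random.default_rng()
    z, T = 0.0, 0
    while -theta0 <= z <= theta1:
        z += rng.normal(mu, sigma)   # increment dz ~ N(mu, sigma^2)
        T += 1
    return (0 if z < -theta0 else 1), T
```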
• Optimization problem: find the decision thresholds θ0 and θ1 that optimize C_risk.
• Reinforcement learning: equivalently, maximize the expected reward E(R) = −C_risk.
Reward function: equivalent single-trial rewards
R = −W0 − cT for an incorrect decision of H0,
R = −W1 − cT for an incorrect decision of H1,
R = −cT for a correct decision.
[Figure: reward as a function of the two thresholds θ0 and θ1, averaged over 100 trials; rewards range from 0 down to −1.]
Issue: these rewards are highly stochastic and hence challenging to optimize (see figures: standard optimization methods will fail).
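A sketch of this single-trial reward, assuming the hypothetical ddm_trial helper from the drift-diffusion sketch above and taking positive drift to correspond to H1; the default costs are the values used in the single-episode figures (c = 0.05, W0 = 0.1, W1 = 1).

```python
def single_trial_reward(theta0, theta1, true_h=1, c=0.05, W0=0.1, W1=1.0, rng=None):
    """Run one drift-diffusion trial and score it with the single-trial reward."""
    mu = 0.2 if true_h == 1 else -0.2             # drift sign set by the true hypothesis (assumption)
    choice, T = ddm_trial(theta0, theta1, mu=mu, rng=rng)
    if choice == true_h:
        return -c * T                             # correct decision: pay only the time cost
    return (-W0 if choice == 0 else -W1) - c * T  # incorrect decision of H0 or H1
```

Averaging this reward over many trials, with the two hypotheses equally likely, estimates −C_risk, so maximizing the expected reward minimizes the Bayes risk, which is what both learning rules exploit.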
Comparison with animal experiments
Rodent data [5] over the acquisition phase of a vibrotactile 2-alternative forced-choice (2AFC) task shows learning over ~5000 trials, which resembles the REINFORCE learning rule. However, the animal progressively improves in accuracy (slowing its decision time), whereas both threshold learning methods do the opposite.
[Figure: rodent learning data [5] over the acquisition phase, per session of 200 trials.]
Discussion
Proposed two learning rules for the decision thresholds in the drift-diffusion model of decision making. The key step is to use single-trial rewards based on the trial-averaged cost function.
• REINFORCE is a simple 2-factor, single-trial learning rule for neural networks, based on a policy gradient method (equivalent to using trial-averaged gradients).
• Bayesian optimization fits a probabilistic (GP) model to the rewards and samples thresholds that maximize the improvement over the current best estimate.
• REINFORCE uses more trials to learn, but is more accurate and computationally faster than Bayesian optimization. It also better resembles animal learning.
References
[1] R. Ratcliff. A theory of memory retrieval. Psychological Review, 85:59–108, 1978.
[2] A. Wald. Sequential analysis. John Wiley and Sons (NY), 1947.
[3] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[4] E. Brochu, V. Cora, N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions. 2011.
[5] J. Mayrhofer, et al. Novel two-alternative forced choice paradigm for bilateral vibrotactile whisker frequency discrimination in head-fixed mice and rats. Journal of Neurophysiology, 109(1):273–284, 2013.