The optimism bias may support rational action
Falk Lieder, Sidharth Goel, Ronald Kwan, Thomas L. Griffiths
University of California, Berkeley
1 Introduction
People systematically overestimate the probability of good outcomes [1] and systematically underestimate how long it will take to achieve them [2]. From an epistemic perspective, optimism is
irrational because it misrepresents the information that we have been given. Yet, surprisingly, optimistic people often perform better than their more realistic peers [1]. How can it be that being
irrational leads to better performance than being more rational? In this abstract, we explore a potential solution to this puzzle. Concretely, we investigate the hypothesis that overestimating the
probability of achieving good outcomes can compensate for the cognitive limitations that prevent
us from looking far ahead into the future and fully considering the long-term consequences of our
actions. Previous work in reinforcement learning has used different notions of optimism to promote
learning through exploration and thereby benefit the returns of future decisions [3–7]. Here, we
investigate the immediate benefits of optimism for the returns of present actions rather than learning and future returns, and explore a different notion of optimism that formalizes the psychological
theory that people overestimate the probability of good events relative to bad events.
2 Model
We model the decision environment E as a Markov Decision Process (MDP) [8]
$$E = (S, A, T, \gamma, r), \qquad (1)$$
where S are the states, A are the actions, T are the transition probabilities, γ = 1 is the discount factor, and r is the reward function. We model the agent's internal model M of the environment E by the MDP
$$M = (S, A, \hat{T}, \gamma, r), \qquad (2)$$
whose transition probabilities T̂ may differ from the true probabilities T. Concretely, we use this distortion of the transition probabilities to model optimism and pessimism according to
$$\hat{T}_{\alpha}(s' \mid s, a) \propto T(s' \mid s, a) \cdot \mathrm{sig}\big(V_E(s') - V_E(s)\big)^{\alpha}, \qquad (3)$$
where sig is the sigmoid function, VE is the optimal value function of the MDP E, α = 0 corresponds to realism, α > 0 corresponds to optimism, and α < 0 corresponds to pessimism.
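As a concrete illustration, Equation 3 could be implemented as follows; this is a minimal sketch, and the function name `distort_transitions` and the array layout are our own choices rather than published code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def distort_transitions(T, V, alpha):
    """Subjective transition probabilities per Equation 3.

    T     : (S, A, S) true transition probabilities T(s'|s,a)
    V     : (S,) optimal values V_E(s) of the true environment
    alpha : 0 = realism, > 0 = optimism, < 0 = pessimism
    """
    # sig(V_E(s') - V_E(s))^alpha, broadcast over current states s and successors s'
    weight = sigmoid(V[None, None, :] - V[:, None, None]) ** alpha
    T_hat = T * weight
    # the weights are strictly positive, so every (s, a) row can be renormalized
    return T_hat / T_hat.sum(axis=-1, keepdims=True)
```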
We model the implications of bounded cognitive resources on people's performance in sequential decision-making by assuming that people can look only h steps ahead and therefore act according to the policy
$$\pi_h(s) = \arg\max_a Q_h(s, a), \qquad (4)$$
$$Q_h(s_t, a) = \mathbb{E}_{\hat{T}}\!\left[ r(s_t, a, S_{t+1}) + \max_{\pi} \sum_{i=t+1}^{t+h-1} r\big(S_i, \pi(S_i), S_{i+1}\big) \right], \qquad (5)$$
where the expectation is taken with respect to the subjective transition probabilities T̂. We compute this solution using backward induction with planning horizon h [9].
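A corresponding sketch of the h-step lookahead policy of Equations 4–5 is shown below; the names are again hypothetical, γ = 1 is hard-coded, values beyond the horizon are set to zero, and the finite 12-round episode length is ignored for simplicity.

```python
def plan_myopic(T_hat, R, h):
    """Greedy policy of an agent that looks only h >= 1 steps ahead (Equations 4-5).

    T_hat : (S, A, S) subjective transition probabilities
    R     : (S, A, S) rewards r(s, a, s')
    Returns pi, an (S,) array with the chosen action in every state.
    """
    n_states = T_hat.shape[0]
    V = np.zeros(n_states)                # value beyond the horizon is taken to be 0
    for _ in range(h):                    # backward induction over the planning horizon
        Q = (T_hat * (R + V[None, None, :])).sum(axis=-1)   # (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)               # pi_h(s) = argmax_a Q_h(s, a)
```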
Figure 1: Illustration of the MDP structure used in the simulations and the experiment.
3 Simulation of sequential decision-making with delayed rewards
One of the key challenges in decision-making is that immediate rewards are sometimes misaligned
with long-term rewards. This problem is exacerbated by the limited number of steps that agents
can plan ahead. Hence, decision-makers tend to underweight distant uncertain rewards relative to
immediate certain rewards [10]. Our theory predicts that optimism can compensate for this problem.
To illustrate this prediction, we simulated the effect of optimism on a bounded agent's performance in the sequential decision problem illustrated in Figure 1. The states s0, . . . , s100 correspond to having completed 0% to 100% of the work needed to reach the goal. At each point in time the
agent can choose between working towards a goal (a1 ) which costs effort and resources (r1 = −1)
versus leisure (a0 ) which generates a small reward (r0 = +1) but does not advance the agent’s
progress. The states represent the agent’s progress towards the goal. Once the goal has been achieved
(St = 100%) the agent can reap a large reward (r2 = +100). This task is a finite-horizon MDP
lasting 12 rounds. The agent can plan only 5 rounds ahead (h = 5), and its average rate of progress
when working towards the goal for one round is 20%:
$$T(s' \mid s, a_1) = \mathrm{Binomial}(s' - s;\, 100,\, 0.2). \qquad (6)$$
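For reference, this environment could be encoded as follows. This is a sketch under our reading of the task description: in particular, we assume that once the agent is at 100% it collects the large reward in every remaining round, and that progress overshooting 100% is lumped into reaching 100%; all names are our own.

```python
import numpy as np
from scipy.stats import binom

def build_chain_mdp(n_states=101, p=0.2, r_work=-1.0, r_leisure=1.0, r_goal=100.0):
    """Chain MDP of Figure 1: state s = percent of work completed (0..100)."""
    S, A = n_states, 2
    T = np.zeros((S, A, S))
    R = np.zeros((S, A, S))
    for s in range(S - 1):
        # leisure (a0): progress is unchanged, small reward r0 = +1
        T[s, 0, s] = 1.0
        R[s, 0, s] = r_leisure
        # work (a1): progress grows by a Binomial(100, 0.2) number of points (Equation 6)
        inc = np.arange(S - s)
        probs = binom.pmf(inc, 100, p)
        probs[-1] += binom.sf(inc[-1], 100, p)   # overshoot is lumped into reaching 100%
        T[s, 1, s + inc] = probs / probs.sum()   # renormalize to guard against rounding error
        R[s, 1, :] = r_work
    # goal state (assumption): the agent stays there and reaps r2 = +100 each remaining round
    T[S - 1, :, S - 1] = 1.0
    R[S - 1, :, S - 1] = r_goal
    return T, R
```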
To simulate the effects of optimism and pessimism on decision-making in this environment, we computed the policy π5 (Equations 4–5) for the internal models Mpessimism, Mrealism, and Moptimism with T̂α for αpessimism = −10, αrealism = 0, and αoptimism = 10, respectively, and simulated the performance of the resulting myopic policies in the environment E.
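Putting the pieces together, the comparison of the three agents could look roughly like the usage sketch below. It assumes the hypothetical helpers sketched above (`build_chain_mdp`, `distort_transitions`, `plan_myopic`) are in scope and approximates V_E by 12-step backward induction under the true model; it is not the authors' actual code.

```python
def rollout(policy, T, R, n_rounds=12, start=0, seed=0):
    """Simulate one 12-round episode of the true environment under a fixed policy."""
    rng = np.random.default_rng(seed)
    s, total = start, 0.0
    for _ in range(n_rounds):
        a = policy[s]
        s_next = rng.choice(T.shape[-1], p=T[s, a])
        total += R[s, a, s_next]
        s = s_next
    return total

T, R = build_chain_mdp()

# Approximate the optimal values V_E of the true 12-round environment by backward induction.
V_true = np.zeros(T.shape[0])
for _ in range(12):
    V_true = (T * (R + V_true[None, None, :])).sum(axis=-1).max(axis=1)

for label, alpha in [("pessimism", -10.0), ("realism", 0.0), ("optimism", 10.0)]:
    T_hat = distort_transitions(T, V_true, alpha)    # internal model M_alpha
    policy = plan_myopic(T_hat, R, h=5)              # myopic 5-step policy
    returns = [rollout(policy, T, R, seed=i) for i in range(200)]
    print(f"{label:10s} mean return: {np.mean(returns):7.1f}")
```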
We found that the bounded agent whose model of the world was accurate (myopic realism) performed much worse than the optimal policy. While the optimal policy chooses action a1 until it reaches the goal state in almost all cases, the myopic realistic agent always chooses action a0 in state s0 (0%) and consequently never reaches the goal state (s100). By contrast, the optimistic bounded agent always chose to invest effort (a1) in the initial state and consequently performed optimally (see Figure 2A). The pessimistic bounded agent performed at the same level as the realistic one. Note that the optimistic agent exhibited the planning fallacy [2]: while the expected
completion time following the optimal policy was 4.5 months the optimistic agent’s estimate of the
expected completion time was only 3.9 months. This made it worthwhile for the optimistic bounded
agent to pursue the goal even though it was thinking only 5 steps ahead. Thus, the irrational planning
fallacy led to rational action. Hence, according to our theory, people with a very accurate model of
the world might perform worse in sequential decision problems with the chain structure shown in
Figure 1 than their more optimistic peers. To test this prediction, we have planned an experiment
that will induce people to be either optimistic, realistic, or pessimistic and then measure their performance in the chain-structured MDP shown in Figure 1, and we conducted a pilot experiment to tune the proposed experimental design.
4 Pilot Experiment
To evaluate the effectiveness of our experimental manipulations and to determine the decisions that our model would predict for the resulting internal models, we conducted a pilot experiment.
Figure 2: A: Performance of the optimistic, realistic, and pessimistic myopic bounded agents and the optimal policy in the decision environment shown in Figure 1. B: Simulation of the main experiment by condition, depending on how many steps people can plan ahead.
4.1 Methods
We recruited 200 adult participants on Amazon Mechanical Turk. Participants received $0.50 for
about 6 minutes of work. Eight participants were excluded since their answers to the survey questions indicated that they had not performed the task.
Participants solved a sequential decision problem with the structure shown in Figure 1. To convey
this structure to our participants we created a game called Product Manager. In this game participants play the manager of a car company. In each round (month) the participant decides whether
the company will focus on marketing the existing product SportsCar (a0) or invest in the development of a new product HoverCar (a1). Participants started with a capital of $1 000 000 and their task was to maximize the company's capital after 4, 12, 24, or 72 months. The reward for marketing the old product was drawn from a normal distribution with mean $10 000 and standard deviation $1 000 (r0 ∼ N(µ = 10 000, σ = 1 000)), the reward for investing in development was drawn from a normal distribution with mean −$15 000 and standard deviation $1 500 (r1 ∼ N(µ = −15 000, σ = 1 500)), and the reward for marketing the new product was normally distributed with a mean of $135 000 and standard deviation $13 500 (r2 ∼ N(µ = 135 000, σ = 13 500)). In each
round the participant was shown the current state (e.g., “HoverCar is currently 0% developed.”), the
number of the current round and the total number of rounds, their current balance, and the rewards
of their most recent decision.
The experiment was structured into three blocks: instructions, a training block, and a survey. The instructions introduced the game and informed the participants about the number of rounds, the return on marketing the existing product, the cost of developing the new product, and the return on marketing the new product. In the training block, the participants' task was to explore the effects of investing in development versus marketing in three simulations lasting 10 rounds each. Finally, the survey asked participants to estimate the average rate of progress that occurred when they decided to invest in development or marketing, respectively.
Each participant was randomly assigned to one of three experimental conditions. The three experimental conditions differed in the rate of progress used in the simulations of the training phase and were designed to induce pessimism, realism, and optimism, respectively. In the pessimism condition, the average rate of progress in the training block was half the true rate of progress; in the realism condition, it was equal to the true rate of progress; and in the optimism condition, it was twice the true rate of progress. The true rate of progress was set such that the expected number of investments needed to reach 100% development was 80% of the total duration of the game.
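For concreteness (our own arithmetic, reading "rate of progress" as the expected percentage-point gain per investment), a game of N rounds therefore used a true rate p satisfying
$$\frac{100\%}{p} = 0.8\,N \quad\Longrightarrow\quad p = \frac{125\%}{N}, \qquad \text{e.g., } p \approx 10.4\% \text{ per investment for } N = 12.$$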
4.2 Results and Discussion
To examine the effectiveness of our experimental manipulations we compared the three groups’
estimates of the average amount of progress achieved by a single investment in product development.
We found that people estimated the rate of progress to be higher in the optimism condition than
in the realism condition (t(123.7) = 2.30, p = 0.01) and the pessimism condition (t(126.2) =
2.59, p < 0.01). The difference between the realism condition and the pessimism condition was not
statistically significant (t(122.5) = 0.53, p = 0.30). In conclusion, our experimental manipulation successfully induced optimism, creating significant differences between the optimism condition and the other two conditions.
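The non-integer degrees of freedom above suggest Welch-type two-sample t-tests; such a comparison could be run as in the sketch below, which uses hypothetical data arrays (the real pilot data are not reproduced here) and requires SciPy ≥ 1.6 for the one-sided alternative.

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant estimates of the rate of progress (in percent).
est_optimism = np.array([14.2, 12.5, 11.0, 13.8, 15.1, 12.9])
est_realism = np.array([11.2, 10.4, 12.0, 10.9, 11.5, 10.1])

# Welch's t-test (unequal variances), one-sided: optimism > realism
t, p = stats.ttest_ind(est_optimism, est_realism, equal_var=False, alternative="greater")
print(f"t = {t:.2f}, p = {p:.3f}")
```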
Furthermore, we found that each group’s estimate of the rate of progress was significantly higher
than the rate of progress in the examples they had observed. In the pessimism condition, people
overestimated the rate of progress by 12.4% (t(65) = 4.16, p < 0.0001). In the realism condition,
participants overestimated the presented rate of progress by 7.4% (t(61) = 2.94, p = 0.0023), and
in the optimism condition, people overestimated the presented rate of progress by 5.2% (t(63) =
1.66, p = 0.05). The overestimation of the frequency of positive events is consistent with the
optimism bias [1]. Interestingly, the magnitude of the overestimation decreased as the presented rate of progress increased. Hence, at
least in our study, the optimism bias could result from Bayesian inference with an optimistic prior.
5 Planned Main Experiment
The main experiment will use the same paradigm as the pilot experiment with the addition of a test
block following the training block. In the test block participants will play the Product Manager game
for 4, 12, 24, or 72 rounds starting at 0% progress. Participants will receive a financial bonus of up
to $2 proportional to their capital at the end of the test phase.
5.1 Model Predictions
We used the subjective transition probabilities induced by our experimental manipulations in the
pilot experiment to derive our model’s prediction for the results of the main experiment. As shown in
Figure 2B, our simulation shows that optimism reduces the number of steps that people have to plan ahead to realize that they should invest in product development. Our model predicts that when the game lasts only four rounds, participants in the optimism condition will invest but participants in the realism and pessimism conditions will not. When the game lasts 12 or more rounds, the benefit of optimism over realism depends on how many steps people can plan ahead.
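In outline, the prediction of Figure 2B can be reproduced by sweeping the planning horizon, as in the sketch below. It reuses the hypothetical helpers from Sections 2 and 3 (and the chain environment's rewards in place of the exact Product Manager payoffs), with illustrative subjective rates of progress per condition rather than the rates actually induced in the pilot.

```python
# Illustrative subjective per-investment rates of progress by condition
# (the rates actually induced depend on the game duration and the manipulation).
subjective_p = {"pessimism": 0.05, "realism": 0.10, "optimism": 0.20}

for h in range(1, 13):                                  # candidate planning horizons
    row = []
    for label, p_hat in subjective_p.items():
        T_hat, _ = build_chain_mdp(p=p_hat)             # subjective model of progress
        policy = plan_myopic(T_hat, R, h=h)
        returns = [rollout(policy, T, R, seed=i) for i in range(200)]
        row.append(f"{label}: {np.mean(returns):7.1f}")
    print(f"h = {h:2d} | " + " | ".join(row))
```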
5.2 Next Steps
Since it appears plausible that people plan fewer than 6 steps ahead, we will refine the experimental manipulations to induce larger differences between the three groups' subjective transition
probabilities such that our model predicts a benefit of optimism in the 12-month condition even if
people plan only 5 steps ahead. We will test this prediction by running the main experiment with the
experimental manipulations determined through iterative piloting. We will use the data to test our
theory’s prediction that optimism improves people’s performance in sequential decision problems in
which determining the best action requires planning ahead many steps.
6 Discussion
Our theory suggests that the optimism bias might serve to improve people’s decisions in environments in which the rewards for prolonged cumulative effort justify forgoing immediate gratification. According to our model the optimism bias achieves this by compensating for the cognitive
limitations that prevent us from looking far enough into the future to fully consider long-term consequences unless we underestimate how long it will take to achieve them. Hence, at least in some
cases, the planning fallacy [2] helps us make better decisions, and it may therefore be a sign of bounded rationality rather than irrationality.
This abstract continues the line of work begun by [11] by generalizing the definition of optimism
proposed therein from a specific class of decision problems to general decision problems and testing its predictions empirically. We have demonstrated that this extension can capture the benefits
of optimism when obtaining a high reward requires persistent effort. Furthermore, the proposed
experiment will be the first to empirically test our boundedly rational theory of optimism.
The beneficial effects of optimism illustrate that for bounded agents there is a tension between
epistemic rationality (having beliefs that are as accurate as possible) and instrumental rationality
(choosing the actions that maximize one’s expected utility). Concretely, our simulations suggest
that bounded agents have to be epistemically irrational to achieve instrumental rationality [12]. This
might be the deeper reason why we are optimistic for ourselves but not for others [2]. In conclusion, our theory suggests that optimism and the planning fallacy might not be irrational after all but
reflect the rational use of limited cognitive resources [13, 14].
Acknowledgments. This work was supported by ONR MURI N00014-13-1-0341.
References
[1] T. Sharot, “The optimism bias,” Current Biology, vol. 21, no. 23, pp. R941–R945, 2011.
[2] R. Buehler, D. Griffin, and M. Ross, "Exploring the 'planning fallacy': Why people underestimate their task completion times," Journal of Personality and Social Psychology, vol. 67, no. 3, p. 366, 1994.
[3] R. S. Sutton, “Integrated architectures for learning, planning, and reacting based on approximating
dynamic programming,” in Proceedings of the seventh international conference on machine learning,
pp. 216–224, 1990.
[4] P. Auer, “Using confidence bounds for exploitation-exploration trade-offs,” J. Mach. Learn. Res., vol. 3,
pp. 397–422, Mar. 2003.
[5] I. Szita and A. Lőrincz, “The many faces of optimism: a unifying approach,” in Proceedings of the 25th
international conference on Machine learning, pp. 1048–1055, ACM, 2008.
[6] P. Sunehag and M. Hutter, “Rationality, optimism and guarantees in general reinforcement learning,”
Journal of Machine Learning Research, vol. 16, pp. 1345–1390, 2015.
[7] P. Sunehag and M. Hutter, “A dual process theory of optimistic cognition,” in Proceedings of the 36th Annual Conference of the Cognitive Science Society (P. Bello, M. Guarini, M. McShane, and B. Scassellati,
eds.), (Austin, TX), Cognitive Science Society, 2014.
[8] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge: MIT Press, 1998.
[9] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley &
Sons, 2014.
[10] J. Myerson and L. Green, "Discounting of delayed rewards: Models of individual choice," Journal of the Experimental Analysis of Behavior, vol. 64, pp. 263–276, Nov. 1995.
[11] R. Neumann, A. N. Rafferty, and T. L. Griffiths, “A bounded rationality account of wishful thinking,”
in Proceedings of the 36th Annual Conference of the Cognitive Science Society (P. Bello, M. Guarini,
M. McShane, and B. Scassellati, eds.), (Austin, TX), Cognitive Science Society, 2014.
[12] F. Lieder, M. Hsu, and T. L. Griffiths, “The high availability of extreme events serves resource-rational
decision-making," in Proceedings of the 36th Annual Conference of the Cognitive Science Society (P. Bello,
M. Guarini, M. McShane, and B. Scassellati, eds.), (Austin, TX), Cognitive Science Society, 2014.
[13] T. L. Griffiths, F. Lieder, and N. D. Goodman, “Rational use of cognitive resources: Levels of analysis
between the computational and the algorithmic,” Topics in Cognitive Science, vol. 7, no. 2, pp. 217–229,
2015.
[14] F. Lieder, T. L. Griffiths, and N. D. Goodman, “Burn-in, bias, and the rationality of anchoring,” in Adv.
Neural Inf. Process. Syst. 25 (P. Bartlett, F. C. N. Pereira, L. Bottou, C. J. C. Burges, and K. Q. Weinberger,
eds.), 2013.