Statistical Models for the Beads Task
Problem in Psychiatry
Benedicte Meløy Christensen
Master of Science in Physics and Mathematics
Submitted: June 2016
Main supervisor: Håkon Tjelmeland, MATH
Norges teknisk-naturvitenskapelige universitet
Institutt for matematiske fag
MASTER THESIS
Industrial Mathematics
Department of Mathematical Sciences
Norwegian University of Science and Technology
Preface
This thesis is written as a part of my Master of Science degree at the Norwegian University of
Science and Technology. It concludes the five-year study programme in "Applied Physics and
Mathematics" with specialization in "Industrial Mathematics". I would like to thank my supervisor Håkon Tjelmeland for continually following up on my work and giving me guidance. I
would also like to express my gratitude to Robert Biegler and Gerit Pfühl for introducing me to
their work within psychology and for providing me with a data set.
Abstract
The tendency to "jump to conclusions" and make decisions on the basis of little evidence has been
linked with being prone to delusions. This behavior is particularly seen in the "beads task",
where participants are presented with a sequence of beads and asked to decide which jar the
beads are drawn from. In this thesis, we construct statistical models that explain the observed
behavior in the beads task. Our models are based on work previously done by Moutoussis et al.
The models in our thesis contain quantifiable parameters that capture the bias towards making
hasty decisions and the noise in decision-making. Two of the models also incorporate the noise
in subjective probability estimates. We describe how the model parameters can be estimated
through Bayesian analysis and perform a simulation study which shows that the parameters
can be accurately recovered. Finally, we fit the models to a real data set. These models may be
used to investigate whether there are group differences between deluded patients and healthy controls.
Contents
1 Introduction
  1.1 Background
  1.2 The Beads Task
2 Statistical Theory
  2.1 Hierarchical Bayesian Models
  2.2 The Metropolis-Hastings Algorithm
  2.3 MCMC Diagnostics
3 The Optimal Stopping Framework
  3.1 Ideal Bayesian Agent's Approach
  3.2 Agent with Behavioral Uncertainty
  3.3 Hierarchical Costed Bayesian Models
    3.3.1 The HCB Model
    3.3.2 The HCBU Model
    3.3.3 Incorporating the Likelihood Ratings - The HCBP and HCBUP Models
4 Parameter Estimation
  4.1 Maximum Likelihood Estimation
  4.2 Bayesian Parameter Estimation
    4.2.1 Choice of Hyperprior
    4.2.2 Choice of Proposal Distribution for M-H
  4.3 Parameter Estimation in the HCBP and HCBUP Models
5 Simulation Study
  5.1 M-H Simulations Under Different Scenarios
6 Data Set and Results
  6.1 The Raw Data
  6.2 Model Results and Discussion
7 Conclusion and Future Work
A Results
  A.1 HCBU Model
  A.2 HCBP Model
  A.3 HCBUP Model
B Acronyms
Chapter 1
Introduction
The "beads in a jar task" is a psychology experiment that is frequently used to assess
how human beings make statistical inferences. We describe this task in the coming section. It has
been discovered that delusion-proneness is associated with making hasty decisions based on little evidence when performing this task. In this report, we construct four alternative models that
describe the decision-making process in the beads task. The report is structured as follows: In
the first chapter, we look at the broader context and explain why it is relevant to make statistical
models that describe how human beings perform the beads task. The second chapter presents
some statistical concepts that are needed later in the report. In the third chapter, we build four
statistical models, while in the fourth chapter, we derive a procedure for estimating the parameters in these models. We test our parameter-estimation procedure on simulated data sets with
known parameters in chapter five. In chapter six, we fit the models to actual data from human beings performing the beads task. We also discuss the most important findings. Finally, in
chapter seven, we sum up our findings, discuss the limitations of our models, and present some
suggestions for further work.
1.1 Background
Schizophrenia (SZ) is characterized by positive symptoms such as hallucinations and delusions,
negative symptoms such as lack of motivation, and cognitive symptoms such as trouble focusing. Delusions are considered one of the key traits of schizophrenia, and they occur in about
three fourths of those diagnosed [9]. A delusion is a fixed belief held by a person in spite of
the emergence of strong conflicting evidence [1]. In order to improve treatments of mental disorders in which delusions are present, such as SZ, researchers have attempted to understand the
psychological mechanisms behind delusions. This involves understanding how delusions are
formed and how they are maintained [9]. A number of theories have been suggested [7]. It has
been hypothesized that reasoning biases cause people to arrive at false conclusions, which in
turn causes them to form or maintain delusions. The word 'bias' in this context means the tendency to systematically behave in a way that differs from some reference group. Such reference
groups may be healthy controls or psychiatric controls. In the following, we introduce several
reasoning biases that researchers have suggested to be present in deluded patients. Some of
these biases contradict each other, some elaborate on each other, and some complement each
other.
A reasoning bias that has frequently been reported is the tendency to make hasty decisions
on the basis of little evidence. In the literature, this reasoning bias is referred to as the jumping to
conclusions (JTC) bias [7]. The JTC bias can explain the formation of delusions because an implausible hypothesis may be prematurely accepted and therefore prevent more realistic alternatives
from being considered. Probabilistic inference tasks, for example beads in a jar tasks, are often
utilized to observe the JTC behavior. There are many versions of the beads task, but a typical
version designed to capture the JTC bias goes as follows: Participants are told that beads will be
drawn with replacement from one of two jars. Each jar contains black and white beads. One
of the jars contains substantially more white than black beads and the other jar contains the
opposite ratio of white to black beads. This ratio is known to the participants. The beads are
drawn one by one from a hidden jar, and the participants can request another bead until they are
certain about which of the jars the beads are being drawn from. All the participants are given
the same predetermined sequences of beads, and the number of beads drawn before a decision
is made is recorded. The participants are given several such tasks, and the ratios of black to
white beads may vary from task to task. This version of the beads task is often called the draws
to decision (DTD) beads task, as the main outcome is the number of beads drawn before a decision is made [7]. Numerous studies have concluded that delusion-prone individuals in fact
collect less evidence than healthy controls before making a decision in the DTD beads task. A
meta-analysis on the DTD beads task, based on 55 studies, has been performed by Dudley et al.
[5]. They concluded that people with psychosis collect significantly less evidence than controls
before making a decision in the two-jar DTD task.
Recent studies performed by Moritz et al. [12] challenge the idea that deluded patients systematically gather less evidence than controls before reaching a conclusion. They agree that a
JTC bias is present in the two-jar DTD beads task, but they claim that deluded patients do not
generally show a tendency to jump to a conclusion. Instead, they suggest the liberal acceptance
(LA) bias. This account claims that deluded patients have a lower threshold for accepting an alternative as viable than controls; however, this does not necessarily imply a premature response,
since multiple alternatives may be considered plausible, so that no single option can be selected.
In the case of two jars, the LA bias explains the tendency to JTC because one of the options surpasses
the threshold early while the other option does not, causing the former to be chosen. Under
ambiguity, however, the LA account predicts that the decision in fact may be delayed. For example, in a DTD beads task with four possible jars, the lowered threshold may cause two of the
jars to be deemed plausible in situations where a higher threshold would only have deemed one
as plausible. Here, the person with the high threshold is ready to conclude, while the person
with the low threshold needs more evidence [12]. Moritz et al. put the LA account to the test by
letting deluded patients and healthy controls perform beads experiments with two jars and with
four jars. Their conclusion was in agreement with the LA account: the JTC bias was indeed present
in the two-jar task while there was no group difference in the four-jar task. The LA account may
explain why a delusion is formed: a low acceptance threshold may cause an unlikely hypothesis
to be accepted as tenable.
Yet another reasoning bias that is postulated to be associated with delusion is the so-called
overadjustment bias, or bias toward disconfirmatory evidence [15]. This bias is the tendency to
be very responsive to disconfirmatory evidence, and it has been reported in the draws to certainty (DTC) version of the beads task. In the DTC version of the beads task, participants are
shown a predetermined sequence of beads as in the DTD version. However, after each bead
is drawn, participants are asked to provide a probability estimate for one of the jars being the
source of the beads. New beads are drawn until all the beads in the predetermined sequence
have been displayed. Some studies have found that deluded participants change the probability rating
more than controls when, for example, a white bead is shown after a long streak of black beads.
This is the bias toward disconfirmatory evidence: the participant reacts strongly when evidence
against their currently favored hypothesis is presented. The overadjustment bias stands in contradiction to another frequently reported bias, called the bias against disconfirmatory evidence
(BADE). This is a bias where individuals show a strong commitment to the initially favored
hypothesis even when presented with evidence strongly supporting another hypothesis. This
bias has been discovered in tasks where participants are presented with pictures from a comic
strip, one by one. They are asked to interpret what is going on in the comic strip, and they are
given several alternative interpretations to choose among. The first picture is designed to favor
one or two of the available alternatives, while the subsequent pictures strongly support
one of the other alternatives. It has been discovered that deluded participants are less willing to
move away from the interpretation that initially seemed plausible [16]. The LA account and the
BADE may together explain why delusions are formed and maintained. While the LA account
may explain why delusions are formed, BADE may explain why they are maintained: when an
unlikely idea has been embraced, incoming evidence against this idea is for some reason resisted.
As noted earlier, delusions are beliefs that are held in spite of strong evidence contradicting
them. However, the overadjustment bias, which has been reported in the beads task, says that
deluded patients are easily persuaded by incoming disconfirmatory evidence. This bias seems
to be in disagreement with the very nature of delusions, namely that beliefs are resistant to disconfirmatory evidence. Speechley et al. [15] performed a study to gain more insight into
the overadjustment bias, and they proposed what they refer to as the hypersalience account.
This account may make the overadjustment bias compatible with delusions. Before we explain
this account, we take a look at the experimental procedure that gave data supporting the account. Instead of the regular DTC procedure with one probability slider, Speechley et al. provided one slider for each of the jars. These sliders, or likelihood ratings, were on a continuous
scale ranging from "Very Unlikely" to "Very Likely". The experimental data showed that the deluded group had an exaggerated increase in likelihood ratings relative to the control group for
whichever jar¹ matched the current bead; however, there was no exaggerated decrease in
the likelihood ratings for the jar that did not match the current bead. Based on this, Speechley et al. suggest that the overadjustment reported in other studies is not caused by a strong
reaction to the disconfirmatory evidence; instead, they suggest that the overadjustment is a
consequence of the deluded participants placing more faith in the option that is supported by
the newest piece of evidence [7]. Putting greater trust in the hypothesis supported by the most
recent piece of information is what Speechley et al. call the hypersalience account.
As we have seen, several cognitive biases that may explain delusions have been proposed. These
biases have been supported by experimental data, often through variants of the beads task. The
data analyses have mainly been done by calculating simple summary statistics and then using,
for example, ANOVA to test whether there are group differences. Less attention has been
directed towards making statistical models for the decision-making process in the beads task,
containing key parameters that can be quantified. Moutoussis et al. [13], however, have presented one such statistical model. In this report, we introduce their model, as well as several
modified versions of it, and estimate the model parameters based on a data set from a group of
healthy subjects. The model aims to capture parameters that can explain the JTC behavior seen
in the beads task. The goal of this report is to lead the way for building statistical models that
describe the decision-making process in the beads task.
1.2 The Beads Task
As described in the previous section, the classical DTD beads task works as follows: There are
two or more sources/jars of beads, each with a given ratio of white to black beads. The participants are told that beads will be drawn with replacement from one of the sources and that the
source will remain the same throughout the trial. The color of the new bead, as well as the colors of the previously
drawn beads, is shown to the participant at all times. After each bead is drawn, the participant
¹ Instead of jars and beads, Speechley et al. used lakes and fish. Their computer program looks similar to the one shown in Figure 1.1.
Table 1.1: The tasks and the proportions of white (W), black (B) and red (R) beads in each of the
jars. In the first three tasks there are two possible jars. In the fourth task, there are four possible
jars. A dash marks a jar that is not part of the task.

Task  Sequence      Jar A        Jar B        Jar C        Jar D
                    W     B      W     B      W     B      W     B     R
1     BBBWBBBBWB    0.2   0.8    0.8   0.2    -     -      -     -     -
2     WWWWBWWWWW    0.9   0.1    0.1   0.9    -     -      -     -     -
3     BWBWWBWBBW    0.5   0.5    0.8   0.2    -     -      -     -     -
4     WWWBWWWWWW    0.1   0.9    0.5   0.5    0.9   0.1    0     0.1   0.9
can either 1) state that he is sure about the source of the beads and which jar this is or 2) state
that he wants to see more beads before deciding. The participants are told that they can see a
maximum of $n$ beads. In the version of the beads task that we study in this report, the participants additionally have to provide likelihood estimates for each of the jars being the source
after each bead is drawn, as in the DTC task with one slider for each jar. This is done by having
separate sliders ranging continuously from "Can't be this lake" to "Must be this lake" for each
of the jars. Prior to the experiment, the participants are instructed that even after they have
decided on a jar, new beads will be drawn and probability estimates must be provided until all
the $n = 10$ beads are drawn. They do not have to commit to any jar during the sequence; when
10 beads have been drawn, they can state that they still are not sure about which of the jars
is the source. The experiment is performed on a computer. Instead of jars and beads, the participants are presented with lakes and fish. A screen shot of the interface is displayed in Figure 1.1.
Each participant is given four tasks. These are summarized in Table 1.1. The tasks are exactly the same for all the participants. In the first three tasks, there are only two possible sources
of beads. In the last task, however, there are four possible sources. More specifically, in Task 1,
jar A has 20% white beads and 80% black beads while jar B has 80% white beads and 20% black
beads. The sequence "BBBWBBBBWB" is given to the participants, where "W" and "B" indicate
white and black beads, respectively. This task is the same as task number one in [12]. In Task
2, jar A has 90% white and 10% black beads while jar B has 10% white and 90% black beads.
The sequence provided is "WWWWBWWWWW". This is the same as task number two in [12].
In Task 3, jar A has 50% white and 50% black beads while jar B has 80% white and 20% black
Figure 1.1: Two screen shots from the experiment. In the left panel, the participant can choose
to commit to one of the jars or to postpone the commitment. Whichever of these actions
the participant chooses, he is asked to provide likelihood estimates for each of the jars. This is
shown in the right panel. Source: Robert Biegler.
beads, and the sequence "BWBWWBWBBW" is drawn. This is the same task as task number
one in [15]. In the last task, there are four possible sources of beads instead of only two. Jar A
contains 10% white and 90% black beads, jar B contains 50% white and 50% black beads, jar
C contains 90% white and 10% black beads and jar D contains 10% black and 90% red beads.
The sequence "WWWBWWWWWW" is provided. This is the same as task number three in [12].
Researchers at the Departments of Psychology at the Norwegian University of Science and Technology and the Arctic University of Norway designed this experiment and
gathered the experimental data.
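To make the tasks concrete, the following sketch computes the posterior probability of each jar after every bead in Task 1. It assumes a uniform prior over the jars and independent draws with replacement; the function name `jar_posteriors` and the uniform prior are illustrative assumptions and not part of the experimental design.

```python
import numpy as np

def jar_posteriors(sequence, jar_probs):
    """Posterior over jars after each bead, assuming a uniform prior.

    sequence  : string of bead colors, e.g. "BBBWBBBBWB"
    jar_probs : list of dicts mapping color -> proportion, one dict per jar
    Returns an array of shape (len(sequence), number of jars).
    """
    n_jars = len(jar_probs)
    post = np.full(n_jars, 1.0 / n_jars)  # uniform prior (an assumption)
    history = []
    for bead in sequence:
        # Likelihood of the observed color under each jar (0 if the color is absent).
        like = np.array([jar.get(bead, 0.0) for jar in jar_probs])
        post = post * like
        post = post / post.sum()  # Bayes' rule: normalize over the jars
        history.append(post.copy())
    return np.array(history)

# Task 1: jar A is 20% white / 80% black, jar B is the reverse.
task1 = jar_posteriors("BBBWBBBBWB", [{"W": 0.2, "B": 0.8}, {"W": 0.8, "B": 0.2}])
print(np.round(task1[:, 0], 4))  # posterior probability of jar A after each bead
```

The same function can be applied to the four-jar task by passing four dictionaries, including one with an entry for red beads.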
Chapter 2
Statistical Theory
In this chapter, we introduce some statistical concepts that are utilized in this report. In particular, we discuss hierarchical Bayesian models, the Metropolis-Hastings algorithm and Markov
chain Monte Carlo diagnostics.
2.1 Hierarchical Bayesian Models
Hierarchical models, or multilevel models, are useful in statistical applications where there is believed to be some dependency between the parameters. Figure 2.1 shows a typical hierarchical
structure. Let us use this figure and an example to illustrate the idea behind hierarchical models.
For instance, let us say we have $n$ individuals, and each of these individuals is characterized by
some parameter $\tau_i$, where $i$ denotes the index of the individual. This parameter is not observed
itself, but there is a known stochastic model that maps $\tau_i$ to some observed output of the individual, $y_i$. That is, the density function $p(y_i \mid \tau_i)$ is known or specified. The observable output
constitutes the first level of the hierarchical structure. The unobserved individual parameters
$\tau_i$ make up the second level of the model. Let us say that all the individuals are expected to be
somewhat similar. We can model this by imposing a common prior distribution on the parameters $\tau_i$, namely a population distribution [10]. The population distribution is specified by some
hyperparameter, let us call it $\alpha$. In other words, we have the density function $p(\tau_i \mid \alpha)$, which is
common for all $i$. At this point, the model is specified such that the likelihood $p(y_1, \dots, y_n \mid \alpha)$ can
be expressed by integrating over the parameter $\tau_i$ for each individual. In order for this model to
Figure 2.1: A typical hierarchical structure. Credit: David M. Blei. [Figure labels: fixed hyperparameter; shared hyperparameter; per-group parameters; multiple groups of observations.]
be fully Bayesian, however, we need to view the hyperparameter as a random variable by putting
a prior on it, i.e. a hyperprior $p(\alpha)$. The parameters of the hyperprior are known and they make
up the top level of the hierarchical structure. Then, we have a probability model on the entire
set of parameters. The joint posterior distribution of all the model parameters can be written
out as

$$p(\tau_1, \dots, \tau_n, \alpha \mid y_1, \dots, y_n) = \frac{p(\alpha) \prod_{i=1}^{n} p(\tau_i \mid \alpha)\, p(y_i \mid \tau_i)}{p(y_1, \dots, y_n)} \tag{2.1}$$
After having obtained the posterior distribution in a hierarchical model, it may be desirable
to look at some quantities that summarize its information. One may for example be interested
in location parameters, e.g. median and mode, and dispersion parameters, e.g. variance and
precision [8]. These measures give us information about the parameters in the model. We can
look at each individual in the group by investigating the measures concerned with the marginal
distribution of the $\tau_i$'s, but we can also say something about the group as a unit by looking at
the measures concerned with the marginal distribution of the hyperparameter $\alpha$.
If one believes that there is some similarity across the individuals, constructing a Bayesian
hierarchical model is advantageous. A hierarchical structure allows borrowing of strength across
individuals. This means that observations from one individual contribute information
to the individual-specific parameter of another individual, through the hyperparameter. The
individual-specific parameters are shrunken towards a common mean, and this common mean
is itself driven by the observations. As a consequence, we may make better inferences about the
individual parameters than in the case where the individual-specific parameters are estimated only
from the observations of the individual in question. This is particularly advantageous if there
are few observations per individual.
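The shrinkage effect can be illustrated with a small numerical sketch. It assumes a normal-normal model that is not used elsewhere in this report: $y_{ij} \mid \tau_i \sim N(\tau_i, \sigma^2)$ and $\tau_i \mid \alpha \sim N(\alpha, \omega^2)$, with $\sigma$, $\omega$ and $\alpha$ treated as known for simplicity. Under these assumptions, the posterior mean of $\tau_i$ is a precision-weighted average of the individual's sample mean and the population mean $\alpha$. All names in the code are illustrative.

```python
import numpy as np

def shrunken_means(y, alpha, sigma, omega):
    """Posterior means of tau_i in a normal-normal model with known alpha, sigma, omega.

    y : array of shape (n_individuals, n_obs), with y_ij ~ N(tau_i, sigma^2)
    and tau_i ~ N(alpha, omega^2). The model is an illustrative assumption.
    """
    n_obs = y.shape[1]
    ybar = y.mean(axis=1)
    # Weight on the individual's own data: data precision over total precision.
    w = (n_obs / sigma**2) / (n_obs / sigma**2 + 1.0 / omega**2)
    return w * ybar + (1.0 - w) * alpha

rng = np.random.default_rng(0)
alpha, sigma, omega = 0.0, 2.0, 0.5
tau = rng.normal(alpha, omega, size=200)            # individual parameters
y = rng.normal(tau[:, None], sigma, size=(200, 3))  # three observations each

raw = y.mean(axis=1)
pooled = shrunken_means(y, alpha, sigma, omega)
print("MSE of raw individual means:", np.mean((raw - tau) ** 2))
print("MSE of shrunken estimates: ", np.mean((pooled - tau) ** 2))
```

With only three noisy observations per individual, the shrunken estimates should have a clearly smaller mean squared error than the raw individual means, which is the benefit of borrowing strength described above.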
2.2 The Metropolis-Hastings Algorithm
Markov chain Monte Carlo (MCMC) methods are powerful tools for sampling from high-dimensional
probability distributions. These methods rely on constructing a Markov chain whose stationary
distribution is the target density function that we wish to sample from. The Monte Carlo part
refers to utilizing random sampling to obtain empirical results, while the Markov chain part refers
to the fact that the sample is obtained by simulating a Markov chain. MCMC methods are useful
for numerically calculating high-dimensional integrals that are difficult to solve analytically, for
example the expected value or variance of a multidimensional random variable [8].
The Metropolis-Hastings (M-H) algorithm is an MCMC procedure that we utilize in this report.
We do not go into the theory behind the algorithm, but instead we describe how the algorithm works and refer to [8] for a more thorough introduction. Let us denote the target density
function from which we want to sample as $\pi(\theta)$. This may for example be the posterior density
function in (2.1). Also, let $\theta^{(t)}$ be the $t$-th value sampled by the M-H algorithm. The algorithm
goes as follows: First, choose an arbitrary¹ initial state for the chain, $\theta^{(0)}$. Also, choose an arbitrary proposal distribution $q(\cdot \mid \theta^{(t)})$ from which the candidate states will be sampled, given the
current state. The only requirement for the proposal distribution is that it must ensure that
the resulting Markov chain is aperiodic and irreducible, as these are sufficient conditions for
convergence to a unique stationary distribution [11]. Then, for each iteration $t$, sample a candidate value $\tilde{\theta}$ from the proposal distribution $q(\cdot \mid \theta^{(t)})$. Calculate the Metropolis-Hastings ratio,

$$R(\tilde{\theta} \mid \theta^{(t)}) = \frac{\pi(\tilde{\theta})\, q(\theta^{(t)} \mid \tilde{\theta})}{\pi(\theta^{(t)})\, q(\tilde{\theta} \mid \theta^{(t)})}.$$

Set $\theta^{(t+1)} = \tilde{\theta}$ with probability $\alpha = \min\{1, R(\tilde{\theta} \mid \theta^{(t)})\}$, called the acceptance probability. If the candidate is not accepted, set $\theta^{(t+1)} = \theta^{(t)}$. The draws $\theta^{(t)}$
become increasingly close to being draws from the limiting distribution $\pi(\theta)$ as $t$ gets larger [8].

¹ The initial state must be chosen such that $\pi(\theta^{(0)}) > 0$.
In the case where $\theta$ is a vector, one can use component-wise updating, which means that only
one of the components of the parameter vector $\theta$ may change value at each iteration while the
other components remain as they were in the previous iteration. One can either cycle through
the components of the parameter vector or choose a component at random, with probabilities weighted by how
often we would like the different components to be sampled [8]. This is a special case of block
updating.
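The steps of the M-H algorithm can be sketched in a few lines of code. The sketch below assumes a one-dimensional target and a Gaussian random-walk proposal; both are illustrative choices and not necessarily the proposal used later in this report. Because this proposal is symmetric, the $q$ terms in the M-H ratio cancel, and the ratio is evaluated on the log scale for numerical stability.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iter, step, rng):
    """Random-walk M-H sampler for a one-dimensional target density.

    log_target : function returning log pi(theta) up to an additive constant
    theta0     : initial state, which must satisfy pi(theta0) > 0
    step       : standard deviation of the Gaussian proposal (illustrative)
    Returns the chain and the fraction of accepted proposals.
    """
    chain = np.empty(n_iter + 1)
    chain[0] = theta0
    n_accepted = 0
    for t in range(n_iter):
        current = chain[t]
        candidate = current + step * rng.normal()
        # Symmetric proposal: q cancels, so log R = log pi(candidate) - log pi(current).
        log_ratio = log_target(candidate) - log_target(current)
        if rng.uniform() < np.exp(min(0.0, log_ratio)):
            chain[t + 1] = candidate
            n_accepted += 1
        else:
            chain[t + 1] = current  # rejected: repeat the current state
    return chain, n_accepted / n_iter

# Example target: a standard normal density, known only up to a constant.
rng = np.random.default_rng(1)
chain, acc_rate = metropolis_hastings(lambda th: -0.5 * th**2, 3.0, 20000, 2.4, rng)
print("acceptance rate:", round(acc_rate, 2))
print("mean, variance after burn-in:", chain[2000:].mean(), chain[2000:].var())
```

For a non-symmetric proposal, the log of the ratio $q(\theta^{(t)} \mid \tilde{\theta}) / q(\tilde{\theta} \mid \theta^{(t)})$ would have to be added to `log_ratio`.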
2.3 MCMC Diagnostics
In MCMC procedures, the draws become increasingly close to draws from the stationary distribution as $t$ increases [8]. In order to obtain a sample that represents the target distribution, the
sample should be taken from the chain after it has converged to its equilibrium distribution [8].
The word converged in this context means that the draws produced by the chain approximate
the target density well enough. A poorly chosen starting point for the chain may cause the chain
to take a long time to converge. For that reason, it is crucial to know whether the chain has converged.
There exists some theory on how to obtain quantitative bounds on the number of iterations needed
for convergence, but it has had little impact on practical work [4]. Instead, one usually has to
rely on diagnostic tools to get some idea of whether convergence is reached. This involves applying graphical techniques and calculating statistical properties of the output of the chain.
The samples generated by an MCMC algorithm are dependent by construction, since the next
state in the chain depends on the current state. The mixing of a chain is concerned with how
far apart in the chain two sampled values must be in order to be considered approximately independent [11]. If the values that are sampled by the MCMC procedure are strongly correlated,
the sample needs to be large in order to represent the target density well. The mixing of a chain
influences how quickly the chain forgets its starting value and explores the support of the target density [11]. For this reason, the rate of convergence in an MCMC algorithm is affected by
the mixing property of the chain. It is common practice to discard some burn-in period in the
beginning of the chain in order to exclude the draws that are generated before convergence is
reached.
As noted above, obtaining a good sample through MCMC simulations requires the user to have
control of two properties of the chain, namely the rate of convergence and the mixing. The first
of these governs the burn-in period, and the latter governs the run length (and the burn-in
period). The user must 1) have an idea of whether convergence is reached so that representative
values can be collected, and 2) know how good the mixing of the chain is, so that he knows how
large the sample should be. The rate of convergence and the mixing of a chain are properties
that overlap. Therefore, many techniques can be used to assess both these properties at once.
In the following, we introduce some of these techniques.
The trace plot shows a component of the sampled value $\theta^{(t)}$ plotted against the sample number $t$.
These plots can provide a hint of whether convergence is reached, as convergence is characterized by rapid fluctuation around a stable mean [3]. Bad mixing can also be revealed through
a trace plot. A chain that is mixing poorly maintains the same, or nearly the same, values over
many successive iterations [11]. Another graphical tool that may be useful in revealing potential
problems with an MCMC algorithm is the autocorrelation plot. This plot shows the correlation in
the sequence of $\theta^{(t)}$ at different lags. If $\theta^{(t)}$ is a vector, the autocorrelation plot can be drawn
for each component. A slow decay may suggest bad mixing [11]. Monitoring the cross-correlation
of the chain, namely the correlation between parameters, may also be useful. If two parameters
are strongly correlated, it may suggest a poor choice of parameterization or overparameterization [4] and that the convergence is slow. The cross-correlations can be presented through scatter plots of pairs of parameters in the parameter vector $\theta^{(t)}$. They can also be visualized through a
correlogram, which is an image of the correlation matrix.
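As a sketch of how the values in an autocorrelation plot can be computed, the following function estimates the sample autocorrelation of a chain at a range of lags. The AR(1) series below stands in for MCMC output and is purely illustrative; for such a series the autocorrelation at lag $k$ is approximately $\phi^k$, so a coefficient close to one mimics a poorly mixing chain.

```python
import numpy as np

def autocorrelation(chain, max_lag):
    """Sample autocorrelation of a one-dimensional chain at lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)])

# Illustrative stand-in for MCMC output: an AR(1) series with coefficient phi.
rng = np.random.default_rng(2)
phi, n = 0.9, 50000
noise = rng.normal(size=n)
series = np.empty(n)
series[0] = noise[0]
for t in range(1, n):
    series[t] = phi * series[t - 1] + noise[t]

acf = autocorrelation(series, max_lag=20)
print(np.round(acf[:5], 2))  # a slow decay over the lags indicates poor mixing
```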
For the Metropolis-Hastings algorithm, the acceptance rate of the suggestions from the proposal
distribution should be monitored. This is because the acceptance rate says something about
whether the spread of the proposal distribution is appropriate, and the spread of the proposal
influences the mixing and the convergence of the chain. A spread that is too large causes many
proposals to be rejected, and the chain will move slowly because the same state is visited many
times in a row (causing slow convergence), whereas a spread that is too small will cause the chain
to explore the support of the target distribution slowly because each step taken in the chain is
small [2]. The literature suggests an acceptance rate between 20% and 50% as a rule of thumb [8].
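The effect of the proposal spread on the acceptance rate can be sketched as follows, assuming a standard normal target and a Gaussian random-walk proposal (both illustrative choices, not the setup used later in this report). A very small spread gives an acceptance rate close to one, while a very large spread causes most proposals to be rejected.

```python
import numpy as np

def acceptance_rate(step, n_iter=20000, seed=3):
    """Acceptance rate of random-walk M-H on a standard normal target."""
    rng = np.random.default_rng(seed)
    theta, accepted = 0.0, 0
    for _ in range(n_iter):
        candidate = theta + step * rng.normal()
        # Log of pi(candidate) / pi(theta) for the standard normal target.
        log_ratio = -0.5 * candidate**2 + 0.5 * theta**2
        if rng.uniform() < np.exp(min(0.0, log_ratio)):
            theta = candidate
            accepted += 1
    return accepted / n_iter

for step in (0.1, 2.4, 25.0):
    print(f"proposal sd = {step:5.1f}  acceptance rate = {acceptance_rate(step):.2f}")
```

In practice, one would adjust the proposal spread in pilot runs until the acceptance rate falls in the desired range.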
There exist a number of convergence diagnostics that are more formal and that do not rely
on visual inspection, e.g. the Gelman and Rubin diagnostic. However, for the analysis in this
report, the visual techniques introduced above will suffice. We refer to [4] for an introduction to
other convergence diagnostics.
Chapter 3
The Optimal Stopping Framework
The DTD experiment is a task in which the agent has to determine when to stop collecting evidence and instead make a decision about the source of the beads. This can be viewed as an
optimal stopping problem. When we talk about the optimal strategy for this problem, we mean
the strategy that maximizes the expected utility. So what is the utility in the DTD task? In the experiment instructions, no information is provided about the reward ($r$) of answering correctly,
the cost of answering incorrectly ($c_w$), or the cost of sampling another bead ($c_s$). However,
we may assume that each participant has his own internal values for these parameters. For example, some participants may be very concerned about answering incorrectly, and thus have
a large reward for correct answers, a large cost for incorrect answers and a small cost of sampling.
The utility in this framework is the units gained by answering correctly minus the units spent
on sampling and answering incorrectly. Given a set of parameters $(r, c_w, c_s)$, there is an optimal
strategy for the DTD task. The optimal stopping framework for the DTD problem was introduced
by Moutoussis et al. in [13]. In the following sections we present, elaborate on and make some
modifications to their work.
3.1 Ideal Bayesian Agent’s Approach
In this section, we introduce the optimal strategy for the DTD problem. We call an agent following this strategy an ideal Bayesian agent (IBA). We provide detailed derivations for the case
where there are only two possible sources of beads, and where these sources contain white and
black beads only. However, the calculations are similar for the case where there are more than two sources and more than two colors. We define the rules to be such that the agent has to commit to a jar at some point during the course of the ten beads. This is different from the rules given to the participants in the data set that we study; however, in Section 6.2 we describe how we can handle this difference. Let $r_A$ and $r_B$ be the proportions of white beads in jar A and B, respectively. Furthermore, let $x_i$ denote the color of the $i$th bead drawn, taking the value 0 if the bead is black and 1 if the bead is white, and let $x = \{x_1, ..., x_n\}$ be the whole sequence given in the task. Then, the set $\{x, r_A, r_B\}$ defines the task. Given the parameters $(r, c_w, c_s)$, the task has an optimal solution, namely the pair $(d, m)$, where $d \in \{d_A, d_B\}$ is the jar chosen and $m \in \{1, ..., n\}$ is the number of beads displayed before deciding on the jar. Let us denote the solution strategy of an individual as $S$, and in particular, let $S_{IBA}$ be the solution strategy of the IBA. Furthermore, let us define the utility ($U$) as the reward ($R$) minus the cost ($C$) in the task, namely
\[ U = R - C. \tag{3.1} \]
Next, let us define a state as $s_i = n_w$ for $0 \le n_w \le i \le n$, where $i$ is the total number of beads drawn and $n_w$ is the number among these that are white. Since only white or black beads are drawn, it is implicitly given that the number of black beads is $i - n_w$. Also, let $d_i$ denote the action taken by the agent after the $i$th bead is drawn. $d_i$ can take on values among $d_A$ (decide jar A), $d_B$ (decide jar B), and $d_S$ (decide to sample one more bead), depending on how many beads have been drawn; $d_S$ is only an available choice when strictly fewer than $n$ beads have been drawn. Let $\mathcal{D}_i$ denote the set of available actions for a given task when $i$ beads have been drawn. Now, we define the action value, $Q$, for taking an action $d$ in the state $s_i = n_w$ under strategy $S$ as the expected additional utility of taking action $d$ in that state under strategy $S$; that is,
\[ Q(d, i, n_w, S) = E[U \mid S_i = n_w, D_i = d, S]. \tag{3.2} \]
The optimal strategy for the DTD problem, $S_{IBA}$, is to calculate the action values in state $s_i = n_w$ (knowing that you behave as an IBA) and deterministically choose the action that has the largest action value. This is the strategy that in the long run will yield the largest gain (or smallest cost). We proceed by deriving the expression for the action values in (3.2) under the IBA strategy. We start by looking at the action values for deciding on each of the jars. Let $V$ be the jar that is being drawn from, taking values in $\mathcal{V} = \{A, B\}$. We use the letter $V$ for vase. Then, the action value for choosing jar $v \in \mathcal{V}$ in any state $s_i = n_w$ can be written as
\[
\begin{aligned}
Q(d_v, i, n_w, S) &= E[U \mid S_i = n_w, D_i = d_v, S] \\
&= E[U \mid S_i = n_w, D_i = d_v] \\
&= r \cdot P(V = v \mid S_i = n_w) - c_w \cdot P(V \neq v \mid S_i = n_w) \\
&= r \cdot P(V = v \mid S_i = n_w) - c_w \cdot (1 - P(V = v \mid S_i = n_w)) \\
&= (r + c_w) P(V = v \mid S_i = n_w) - c_w.
\end{aligned}
\tag{3.3}
\]
We note that the action value for choosing a jar does not depend on the strategy of the agent. Let us take a look at the posterior probability of the source being jar $v$, which appears in the first term of the final expression above. In the experiment, the prior probabilities of the jars are equal, so we have $P(V = A) = P(V = B) = \frac{1}{2}$ in tasks one through three. Let $r_v$ be the proportion of white beads in jar $v$. Then, the probability of the source of beads being jar $v \in \mathcal{V} = \{A, B\}$, when $i \in \{1, ..., n\}$ beads are drawn and $n_w \in \{0, ..., i\}$ of these are white, is
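\[
\begin{aligned}
P(V = v \mid S_i = n_w) &= \frac{P(V = v) \cdot P(S_i = n_w \mid V = v)}{P(S_i = n_w)} \\
&= \frac{P(V = v) \cdot P(S_i = n_w \mid V = v)}{\sum_{\tilde v \in \mathcal{V}} P(V = \tilde v) \cdot P(S_i = n_w \mid V = \tilde v)} \\
&= \frac{P(S_i = n_w \mid V = v)}{\sum_{\tilde v \in \mathcal{V}} P(S_i = n_w \mid V = \tilde v)} \\
&= \frac{r_v^{n_w} (1 - r_v)^{i - n_w}}{\sum_{\tilde v \in \mathcal{V}} r_{\tilde v}^{n_w} (1 - r_{\tilde v})^{i - n_w}},
\end{aligned}
\tag{3.4}
\]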
where we in the third equality used that the prior probabilities for each jar are equal, and in the fourth that the binomial coefficient $\binom{i}{n_w}$ is common to all terms and cancels. Equations (3.3) and (3.4) are what we need in order to calculate the action values for choosing jar A and jar B in any given state. Now, let us move on to deriving the expression for the action value of sampling another bead. This action value depends on the strategy of the agent, since the expected future return depends on the behavior of the agent. As we will see, the agent must "search" through all the future outcomes to calculate this quantity. Define the probability of choosing action $d$ in state $s_i = n_w$ under the strategy $S$ as the weight
\[ w(d, i, n_w, S) = P(D_i = d \mid S_i = n_w, S). \tag{3.5} \]
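As a small numerical illustration of (3.3) and (3.4), the action values for committing to a jar can be computed directly from the bead counts. The following is a minimal sketch; the function names and the 85:15 jar proportions are hypothetical and chosen only for illustration, while $r = 0$ and $c_w = 100$ are the reference values used in this chapter.

```python
def posterior_jar(i, n_w, ratios):
    """Posterior P(V = v | S_i = n_w) from (3.4), assuming equal priors.

    `ratios` maps a jar label to its proportion of white beads r_v.
    """
    lik = {v: r_v ** n_w * (1.0 - r_v) ** (i - n_w) for v, r_v in ratios.items()}
    total = sum(lik.values())
    return {v: value / total for v, value in lik.items()}


def q_commit(p_v, r=0.0, c_w=100.0):
    """Action value (3.3) for committing to a jar with posterior probability p_v."""
    return (r + c_w) * p_v - c_w


ratios = {"A": 0.85, "B": 0.15}  # hypothetical jar proportions
post = posterior_jar(i=3, n_w=2, ratios=ratios)
q = {v: q_commit(p) for v, p in post.items()}
print(post, q)
```

With two white beads and one black bead, the evidence nets out to a single white bead, so the posterior for jar A in this sketch equals the proportion of white beads in jar A.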
The IBA deterministically chooses the action with the largest action value, so all the weight is given to that action, and we have
\[
w(d, i, n_w, S_{IBA}) =
\begin{cases}
1 & \text{if } d = \arg\max_{a \in \mathcal{D}_i} Q(a, i, n_w, S_{IBA}), \\
0 & \text{otherwise.}
\end{cases}
\tag{3.6}
\]
If there are two or more actions with equal action values and these are larger than all the other action values, it does not matter which of these an agent chooses in order to be ideal; therefore, we define the IBA to choose randomly between the two jars if the action values for these are equal and larger than the action value for sampling. If the action value for choosing one of the jars is equal to the action value for sampling, and this value is larger than the action value for choosing the other jar, we define the IBA to decide on the jar instead of sampling. Thus, the IBA will choose as quickly as possible. Next, we note that the expected utility of an agent in state $s_i = n_w$ given that the true source is jar $v \in \{A, B\}$ can be written as
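\[
\begin{aligned}
E[U \mid S_i = n_w, V = v, S] &= \sum_{d \in \mathcal{D}_i} P(D_i = d \mid S_i = n_w, V = v, S) \cdot E[U \mid S_i = n_w, V = v, D_i = d, S] \\
&= \sum_{d \in \mathcal{D}_i} P(D_i = d \mid S_i = n_w, S) \cdot E[U \mid S_i = n_w, V = v, D_i = d, S] \\
&= \sum_{d \in \mathcal{D}_i} w(d, i, n_w, S) \cdot h(i, n_w, v, d, S),
\end{aligned}
\tag{3.7}
\]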
where we have defined the function $h(\cdot) = E[U \mid S_i = n_w, V = v, D_i = d, S]$ because it will later turn up recursively in the expression for the action value of sampling. We have
\[
h(i, n_w, v, d, S) = E[U \mid S_i = n_w, V = v, D_i = d, S] =
\begin{cases}
r \cdot I[v = A] - c_w \cdot I[v \neq A] & \text{if } d = d_A, \\
r \cdot I[v = B] - c_w \cdot I[v \neq B] & \text{if } d = d_B,
\end{cases}
\]
and for the case where $d = d_S$, we have
\[
\begin{aligned}
h(i, n_w, v, d_S, S) &= E[U \mid S_i = n_w, V = v, D_i = d_S, S] \\
&= -c_s + \sum_{s \in \{n_w, n_w + 1\}} P(S_{i+1} = s \mid V = v, S_i = n_w, D_i = d_S, S) \cdot E[U \mid V = v, S_i = n_w, D_i = d_S, S_{i+1} = s, S] \\
&= -c_s + \sum_{s \in \{n_w, n_w + 1\}} P(S_{i+1} = s \mid S_i = n_w, V = v) \cdot E[U \mid V = v, S_{i+1} = s, S] \\
&= -c_s + \sum_{s \in \{n_w, n_w + 1\}} P(S_{i+1} = s \mid S_i = n_w, V = v) \sum_{d \in \mathcal{D}_{i+1}} P(D_{i+1} = d \mid V = v, S_{i+1} = s, S) \, E[U \mid V = v, S_{i+1} = s, D_{i+1} = d, S] \\
&= -c_s + \sum_{s \in \{n_w, n_w + 1\}} P(S_{i+1} = s \mid S_i = n_w, V = v) \sum_{d \in \mathcal{D}_{i+1}} w(d, i + 1, s, S) \cdot h(i + 1, s, v, d, S).
\end{aligned}
\tag{3.8}
\]
Here, we see a recursive pattern; in every state where there is an option of sampling another bead, the function $h(i, n_w, v, d_S, S)$ makes use of $h(i+1, n_w, v, d, S)$ and $h(i+1, n_w+1, v, d, S)$ for the actions $d \in \mathcal{D}_{i+1}$, including $d = d_S$. This recursion terminates when $n$ beads are drawn, because at this point it is not possible to sample another bead and the agent is forced to choose a jar. When $i = n$ in (3.7), $d_S$ is no longer contained in the sum, so the recursive part stops.
Finally, the action value for sampling another bead in state $s_i = n_w$ where $0 \le n_w \le i < n$ can be written out as
\[
\begin{aligned}
Q(d_S, i, n_w, S) &= E[U \mid S_i = n_w, D_i = d_S, S] \\
&= \sum_{v \in \mathcal{V}} P(V = v \mid S_i = n_w, D_i = d_S, S) \cdot E[U \mid S_i = n_w, D_i = d_S, V = v, S] \\
&= \sum_{v \in \mathcal{V}} P(V = v \mid S_i = n_w) \cdot h(i, n_w, v, d_S, S).
\end{aligned}
\tag{3.9}
\]
The optimal stopping strategy for an agent that assumes the game parameters $(r, c_w, c_s)$ is to calculate the action values $Q(d, i, n_w, S_{IBA})$ for each of the available actions $d$, and then to choose the action that yields the largest value. This model is in fact a one-parameter model: adding a constant to all utilities or multiplying them by a positive constant does not change which action has the largest action value, so we can fix two of the parameters and let the last one be flexible. We choose $r = 0$ and $c_w = 100$ as a reference for the model.
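The recursion in (3.6) to (3.9) can be sketched as a short memoized program. This is an illustration only: the jar proportions (0.85 and 0.15) and the sampling cost $c_s = 1$ are assumptions, the names are hypothetical, and ties are detected with a small numerical tolerance rather than exact equality.

```python
from functools import lru_cache

RATIO = {"A": 0.85, "B": 0.15}   # hypothetical proportions of white beads
N = 10                            # number of beads in the task
R, C_W, C_S = 0.0, 100.0, 1.0     # r and c_w as in the text; c_s is illustrative
TOL = 1e-9                        # tolerance used when detecting ties


def posterior(i, n_w):
    """Posterior over jars from (3.4), equal priors."""
    lik = {v: rv ** n_w * (1.0 - rv) ** (i - n_w) for v, rv in RATIO.items()}
    z = sum(lik.values())
    return {v: value / z for v, value in lik.items()}


@lru_cache(maxsize=None)
def action_values(i, n_w):
    """Q(d, i, n_w, S_IBA) for every available action d."""
    post = posterior(i, n_w)
    q = {jar: (R + C_W) * post[jar] - C_W for jar in RATIO}          # (3.3)
    if i < N:                                                         # d_S available
        q["S"] = sum(post[v] * h_sample(i, n_w, v) for v in RATIO)    # (3.9)
    return q


def iba_weights(i, n_w):
    """Deterministic weights (3.6); a jar wins ties with sampling, tied jars split."""
    q = action_values(i, n_w)
    best = max(q.values())
    winners = [d for d in RATIO if best - q[d] < TOL] or ["S"]
    return {d: (1.0 / len(winners) if d in winners else 0.0) for d in q}


@lru_cache(maxsize=None)
def h_sample(i, n_w, v):
    """h(i, n_w, v, d_S, S_IBA) from (3.8)."""
    total = -C_S
    for s, p_s in ((n_w + 1, RATIO[v]), (n_w, 1.0 - RATIO[v])):
        for d, w_d in iba_weights(i + 1, s).items():
            h = h_sample(i + 1, s, v) if d == "S" else (R if d == v else -C_W)
            total += p_s * w_d * h
    return total


print(action_values(0, 0))
print(iba_weights(3, 3))
```

Under these assumed values the agent samples before any bead is seen and commits to jar A after three white beads in a row, since with $r = 0$ the value of sampling can never exceed $-c_s$.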
3.2 Agent with Behavioral Uncertainty
Human beings certainly do not follow the strategy of the IBA. First of all, the strategy requires heavy computations that involve searching through all possible future states. It is unlikely that human beings are able to perform these. Secondly, humans do not choose as deterministically as an ideal agent; provided with the same information several times, the choices may still vary. To account for the behavioral uncertainty, we need to introduce some noise in the model. This can be done by, for example, assuming that the human agents pick an action randomly, but that the probability of picking each action is weighted based on the magnitude of the action value. We use the same weight function that is used in [13], namely the Softmax function with an individual-specific parameter $\tau$ that specifies how arbitrary the choices tend to be. With Softmax weighting, the probability of choosing action $d$ in state $s_i = n_w$ becomes
\[ w(d, i, n_w, S) = \frac{e^{Q(d, i, n_w, S)/\tau}}{\sum_{\tilde d \in \mathcal{D}_i} e^{Q(\tilde d, i, n_w, S)/\tau}}. \tag{3.10} \]
We can see that as τ goes to zero, the probability of choosing the action with the largest action
value goes to 1, and the agent gets closer to being an IBA. On the other hand, as τ goes to infinity, the probabilities for each of the actions become equal, namely one divided by the number
of possible actions, and the agent makes arbitrary choices.
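The two limits of (3.10) can be checked numerically. The sketch below is illustrative: the action values are made-up numbers, and subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the weights unchanged.

```python
import math


def softmax_weights(q, tau):
    """Softmax choice probabilities (3.10) for a dict of action values."""
    q_max = max(q.values())  # shift by the maximum for numerical stability
    expq = {d: math.exp((value - q_max) / tau) for d, value in q.items()}
    z = sum(expq.values())
    return {d: e / z for d, e in expq.items()}


q = {"A": -15.0, "B": -85.0, "S": -6.0}  # hypothetical action values
w_small = softmax_weights(q, 0.1)         # nearly deterministic
w_mid = softmax_weights(q, 30.0)
w_large = softmax_weights(q, 1e6)         # nearly uniform
for label, w in (("tau=0.1", w_small), ("tau=30", w_mid), ("tau=1e6", w_large)):
    print(label, w)
```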
As we can see in (3.10), the probability of some agent choosing an action $d$ depends on the action values for the different actions. The action values for choosing actions $d_A$ and $d_B$ are calculated straightforwardly as shown in (3.3). Calculating the action value for sampling another bead, however, requires the agent to iterate through all possible future outcomes. These outcomes are not only related to what color the future beads have; they are also related to what action the agent chooses to perform when the beads appear. Therefore, in order to calculate the action value for sampling, the agent must have some idea, which may be correct or incorrect, of how he behaves. That is, he must have an idea of his own strategy, which may or may not coincide with his true strategy. Under behavioral uncertainty, the agent can either 1) know that he makes choices according to the Softmax choice function or 2) think that he behaves as an IBA. Therefore we get two alternative models. In the first model, the agent takes into account that he in fact behaves stochastically when calculating the action value for sampling another bead. Then, the choice is made stochastically based on the action values. We refer to this model as the CB model, for Costed Bayesian model. In the second model, the agent calculates the action values for sampling as if he were an IBA, and then an action is chosen stochastically based on the weight of each action. We call this the CBU model, for Costed Bayesian Unaware model, since the agent is unaware of his own decision noise. We stress that these models do not use the probability estimates provided by the participants in any manner. The only observations taken into account are the DTDs and the final jar chosen. In Section 3.3.3, we incorporate the subjective probability estimates into the framework already introduced above. When incorporating these estimates, we will arrive at two new models for how an agent acts in the beads task. We call these the CBP and the CBUP models. For now, however, we omit the subjective probability estimates in the models.
In the models described above, each agent has the two parameters $c_s$ and $\tau$. The sampling cost parameter captures the eagerness to stop collecting evidence and instead make a decision. It may be seen as an equivalent to a decision threshold; a large sampling cost corresponds to a small decision threshold and vice versa. The noise parameter accounts for the randomness in the decision-making. If each individual provided responses for a large number of sequences, we would be able to accurately estimate the specific parameters for each individual, for example through maximum likelihood estimation. However, for the data that we analyze in this report, there are only a few trials for each participant. As a consequence, there will be a large uncertainty associated with each individual’s parameter estimates. We can alleviate this issue by constructing a hierarchical model, as done by Moutoussis et al. In the next section, we describe this model.
3.3 Hierarchical Costed Bayesian Models
In line with Moutoussis et al., let us call the sampling costs ($c_k^s$) and noise parameters ($\tau_k$) micro-parameters. These are individual-specific parameters. Within one population¹, it is reasonable to assume that the individuals are somewhat similar, so we expect some similarities between the individuals’ micro-parameters. Thus, we can view the pairs of micro-parameters $(c_k^s, \tau_k)$ as stochastic variables drawn independently from some common prior distribution governed by some unknown hyper-parameters. This hierarchical structure allows us to investigate the hyper-parameters instead of, or in addition to, the micro-parameters.
¹ A population may for example be a group of schizophrenic individuals or a group of healthy individuals.
In the hierarchical structure, the sampling costs and noise parameters are considered to be drawn from some prior. As done by Moutoussis et al., we can assume that the sampling costs are drawn independently from the gamma distribution with parameters $\alpha$ and $\beta$ and that the noise parameters are drawn independently from the gamma distribution with parameters $\eta$ and $\delta$, namely
\[ c_k^s \sim \text{Gamma}(\alpha, \beta), \qquad \tau_k \sim \text{Gamma}(\eta, \delta), \qquad k = 1, ..., N_{ind}, \]
where $N_{ind}$ is the number of individuals. Thus, the expected value of the sampling cost is $\mu_1 = E[c_k^s] = \alpha\beta$ and the standard deviation (SD) is $\sigma_1 = SD(c_k^s) = \sqrt{\alpha}\beta$, i.e. $\alpha$ acts as a shape parameter and $\beta$ as a scale parameter. Similarly, for the noise parameters, we have $\mu_2 = \eta\delta$ and $\sigma_2 = \sqrt{\eta}\delta$. Under this hierarchical structure, we can estimate the hyper-parameters $\alpha$, $\beta$, $\eta$, and $\delta$ to make inferences about the population as a whole. The hierarchical structure is illustrated in Figure 3.1.
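The relationship between the gamma parameters and the mean and SD can be checked by simulation. The following sketch inverts $\mu = \alpha\beta$ and $\sigma = \sqrt{\alpha}\beta$ and draws micro-parameters for a hypothetical population; the function names and the numerical values are illustrative assumptions, and Python's `random.gammavariate` takes the shape first and the scale second.

```python
import random
import statistics


def gamma_from_mean_sd(mu, sigma):
    """Shape and scale such that mean = shape * scale and SD = sqrt(shape) * scale."""
    shape = (mu / sigma) ** 2
    scale = sigma ** 2 / mu
    return shape, scale


def draw_micro_parameters(n_ind, mu1, sigma1, mu2, sigma2, seed=0):
    """Draw (c_k^s, tau_k) pairs independently from the two gamma priors."""
    rng = random.Random(seed)
    alpha, beta = gamma_from_mean_sd(mu1, sigma1)
    eta, delta = gamma_from_mean_sd(mu2, sigma2)
    return [(rng.gammavariate(alpha, beta), rng.gammavariate(eta, delta))
            for _ in range(n_ind)]


# Hypothetical hyper-parameter values, chosen only for illustration.
pairs = draw_micro_parameters(50000, mu1=2.0, sigma1=1.0, mu2=10.0, sigma2=5.0)
costs = [c for c, _ in pairs]
noises = [t for _, t in pairs]
print(statistics.mean(costs), statistics.stdev(costs))
print(statistics.mean(noises), statistics.stdev(noises))
```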
As noted in the previous section, the optimal stopping problem with behavioral uncertainty gives rise to two alternative models. We get the first model by assuming that the agents themselves know that they choose between the actions based on Softmax weights instead of choosing deterministically. In this case, the agent will calculate the action values for sampling another bead, knowing that he behaves stochastically. The agent is behaving ideally in the sense that he maximizes the expected reward in the setting where there is decision noise present. This is the same model as the one introduced in [13]. We will from now on refer to this model as the HCB model, for Hierarchical Costed Bayesian model. We get the second model by assuming that the agents are unaware of the fact that they behave stochastically when calculating action values for sampling. In this case, the agents calculate the action values as if they were IBAs but still choose randomly among the actions based on Softmax weights. This model will be referred to as the HCBU model, which is short for Hierarchical Costed Bayesian Unaware. Let us take a closer look at these two models.
[Figure 3.1: diagram with the nodes $\lambda_1$, $\lambda_2$, $\mu_1$, $\sigma_1$, $\mu_2$, $\sigma_2$, $c_1^s, ..., c_N^s$, $\tau_1, ..., \tau_N$ and $y_1, ..., y_N$.]
Figure 3.1: The structure of the HCB and the HCBU models. Instead of considering the parameters $\alpha$, $\beta$, $\eta$, and $\delta$, we consider $\mu_1$, $\sigma_1$, $\mu_2$, and $\sigma_2$. The constants $\lambda_1$ and $\lambda_2$ are concerned with the hyper-prior that is used in the Bayesian analysis discussed in Section 4.2.
3.3.1 The HCB Model
In the HCB model, the agent is aware of the fact that he has decision noise when choosing an action. Therefore, when he calculates the action value of sampling (i.e. the expected additional gain if he chooses to sample), he takes into account his stochastic behavior in the future. This means that he uses (3.10) when calculating (3.8). Consequently, his action value for sampling will be lower than an IBA’s action value for sampling. As $\tau$ increases, the action value for sampling will decrease while the action values for each of the jars remain the same. Therefore, the probability of choosing to sample will decrease as $\tau$ increases, converging to one over the number of possible actions. Thus, we see that increasing noise implies a greater probability of committing to a jar. The probability of committing to the jar that is posteriorly more likely to be the source is always greater than the probability of committing to any of the other jars. We know that the action value of sampling decreases as $c_s$ increases. As a result, early decisions in this model can be caused by having a large $c_s$ and/or by having a large $\tau$. If there are a great number of instances of choosing the wrong² jar, however, it can only be explained by a large $\tau$. This model is the same as the one introduced by Moutoussis et al.
[Figure 3.2: two panels, "HCB Model Action Probabilities, Task 3" and "HCBU Model Action Probabilities, Task 3", each for an individual with $c_s = 1$, $\tau = 30$. Horizontal axis: Bead (1 to 10); vertical axis: Action Probability (0 to 1); curves: Commit A, Commit B, Sample.]
Figure 3.2: HCB vs HCBU model. The IBA with $c_s = 1$ would have chosen to sample until the 9th bead was drawn. There, he would have chosen jar A. In the HCBU model, the relative ranking of the actions is preserved. If the IBA thinks that sampling is the best option, choosing jar A is the second best, and choosing jar B is the worst alternative, then the agent in the HCBU model will have the largest probability of choosing to sample, the second largest probability of choosing jar A, and the smallest probability of choosing jar B. In the HCB model, however, the noise may cause the probability of choosing jar A to move beyond the probability of sampling. So in the HCB model, large noise provokes faster decisions.
3.3.2 The HCBU Model
In the HCBU model, the agent assumes that he is an IBA when he calculates the additional gain expected to be obtained when sampling another bead. He uses (3.6) when calculating (3.8), which does not involve $\tau$. Therefore, two agents with the same $c_s$ but different $\tau$ will obtain the exact same action values for sampling another bead. As $\tau$ increases, the probabilities of choosing each of the available actions get closer to each other, as we can see from (3.10). This may cause the probability of sampling to decrease. However, in contrast to the agent in the HCB model, the probability of sampling does not decrease as a consequence of the action value for sampling decreasing. In this manner, the HCBU model removes some of the "overlap" between the parameters $c_s$ and $\tau$ that exists in the HCB model.
² Wrong here means choosing a jar that is not the posteriorly more likely jar.
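The difference between the aware (HCB) and unaware (HCBU) agents can be sketched for a single individual by changing only the weights used inside the recursion (3.8). The code below is an illustration under assumptions: the jar proportions are hypothetical, the names are not from the thesis, ties for the IBA weights are detected with a small tolerance, and the hierarchical layer is left out.

```python
import math
from functools import lru_cache

RATIO = {"A": 0.85, "B": 0.15}   # hypothetical proportions of white beads
N, R, C_W = 10, 0.0, 100.0


def posterior(i, n_w):
    lik = {v: rv ** n_w * (1.0 - rv) ** (i - n_w) for v, rv in RATIO.items()}
    z = sum(lik.values())
    return {v: value / z for v, value in lik.items()}


def softmax(q, tau):
    """Softmax weights (3.10)."""
    m = max(q.values())
    e = {d: math.exp((x - m) / tau) for d, x in q.items()}
    z = sum(e.values())
    return {d: x / z for d, x in e.items()}


def greedy(q):
    """IBA weights (3.6): a jar wins ties with sampling, tied jars split equally."""
    best = max(q.values())
    winners = [d for d in RATIO if best - q[d] < 1e-9] or ["S"]
    return {d: (1.0 / len(winners) if d in winners else 0.0) for d in q}


def make_model(c_s, tau, aware):
    """Return (action_probabilities, q_values) for one agent."""
    believed = (lambda q: softmax(q, tau)) if aware else greedy

    @lru_cache(maxsize=None)
    def q_values(i, n_w):
        post = posterior(i, n_w)
        q = {jar: (R + C_W) * post[jar] - C_W for jar in RATIO}
        if i < N:
            q["S"] = sum(post[v] * h_sample(i, n_w, v) for v in RATIO)
        return q

    @lru_cache(maxsize=None)
    def h_sample(i, n_w, v):
        total = -c_s
        for s, p_s in ((n_w + 1, RATIO[v]), (n_w, 1.0 - RATIO[v])):
            # The agent's belief about his own future behavior enters here.
            for d, w_d in believed(q_values(i + 1, s)).items():
                h = h_sample(i + 1, s, v) if d == "S" else (R if d == v else -C_W)
                total += p_s * w_d * h
        return total

    def action_probabilities(i, n_w):
        # The actual choice is always made with Softmax weights.
        return softmax(q_values(i, n_w), tau)

    return action_probabilities, q_values


hcb_probs, hcb_q = make_model(c_s=1.0, tau=30.0, aware=True)
hcbu_probs, hcbu_q = make_model(c_s=1.0, tau=30.0, aware=False)
for state in ((0, 0), (1, 1), (2, 1)):
    print(state, hcb_probs(*state)["S"], hcbu_probs(*state)["S"])
```

In this sketch the aware agent assigns a lower action value to sampling than the unaware agent in every state, while the action values for the jars coincide, so the aware agent is less likely to sample.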
[Figure 3.3: diagram with the nodes $\lambda_1$, $\lambda_2$, $\mu_1$, $\sigma_1$, $\mu_2$, $\sigma_2$, $c_1^s, ..., c_N^s$, $\tau_1, ..., \tau_N$, $\omega_1, ..., \omega_N$, $\tilde p_1, ..., \tilde p_N$ and $y_1, ..., y_N$.]
Figure 3.3: The structure of the HCBP and HCBUP models. Instead of considering the parameters $\alpha$, $\beta$, $\eta$, and $\delta$, we consider $\mu_1$, $\sigma_1$, $\mu_2$, and $\sigma_2$. The constants $\lambda_1$ and $\lambda_2$ are concerned with the hyper-prior that is used in the Bayesian analysis as discussed in Section 4.2. The $y$’s and $\tilde p$’s are observations, the $\lambda$’s are chosen constants, and the rest of the parameters are latent variables.
3.3.3 Incorporating the Likelihood Ratings - The HCBP and HCBUP Models
The models we have built so far try to account for the participants’ eagerness to make a decision as well as their randomness in choosing between options. Now, we would like to extend the model so that it also captures the participants’ noise in estimating probabilities. We do this by incorporating the likelihood ratings, which the participants provide after each bead is drawn, in the model. Let us denote the likelihood ratings for jar A and jar B as $L_A$ and $L_B$, respectively. These estimates are provided by the participants through sliders ranging from "Can’t be this lake" to "Must be this lake". There are no constraints regarding the relationship between the likelihood ratings; for example, both ratings may be set to the value "Must be this lake". Therefore, it is not obvious how we should interpret these ratings. Some participants may assume that the sliders range from 0 to 1, and that they represent the posterior probability of the corresponding jar being the source of the beads. In this case, the likelihood ratings provided should sum to 1. In order for us to incorporate these ratings in the model, we need to transform them to something meaningful. We cannot simply let them range from 0 to 1 and say that they represent the posterior probability of each of the jars. This is because the ratings provided by the participants rarely sum to one, which does not make sense theoretically. Instead, we define the posterior probability estimate for jar $v$ to be
\[ \tilde p_v = \frac{L_v}{\sum_{\tilde v \in \{A, B\}} L_{\tilde v}}, \qquad v \in \{A, B\}. \tag{3.11} \]
We see that for the individuals who in fact interpreted the slider values as posterior probability estimates, so that $L_A + L_B = 1$, we get $\tilde p_v = L_v$. Under this transformation, a person who rates the two jars as "Can’t be this lake" and a person who rates the two jars as "Must be this lake" are both saying that there is a 50% chance of each of the jars being the source. The person who rates both jars as very unlikely may do so because he thinks that both jars are equally likely to be the source, but at the same time he thinks that the displayed beads are not very representative of either of the two jars. The person who rates both as very likely may do so because he thinks that they are equally likely, and that the displayed beads are very representative of both of the jars.
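The transformation (3.11) amounts to a simple normalization of the two slider values. The sketch below assumes that the ratings are coded on a scale from 0 to 1, which is an assumption about the data format; the handling of two ratings of exactly zero, where (3.11) would be 0/0, is likewise an assumption that follows the 50% interpretation given above.

```python
def normalise_ratings(l_a, l_b):
    """Posterior probability estimates (3.11) from two slider ratings."""
    total = l_a + l_b
    if total == 0:
        # Assumption: (3.11) is undefined here; following the interpretation in
        # the text, two "Can't be this lake" ratings are treated as 50% each.
        return 0.5, 0.5
    return l_a / total, l_b / total


print(normalise_ratings(1.0, 1.0))   # both "Must be this lake"
print(normalise_ratings(0.0, 0.0))   # both "Can't be this lake"
print(normalise_ratings(0.9, 0.3))   # ratings that do not sum to one
```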
In contrast to the IBA, participants are not able to correctly calculate the posterior probability of a jar being the source of the beads. Participants’ probability estimates are subject to noise. As noted in Chapter 2, we can add normally distributed noise to the log-odds transformation of the true probability of jar A. Let $p_A$ be the true probability that jar A is the source of the beads in a given state, and let $\tilde p_A$ be the individual’s estimate of this probability. Then, we get
\[ \log\left(\frac{\tilde p_A}{1 - \tilde p_A}\right) = \log\left(\frac{p_A}{1 - p_A}\right) + \omega \cdot U, \tag{3.12} \]
where $U \sim N(0, 1)$. This is equivalent to
\[ \tilde p_A = \frac{p_A}{p_A + (1 - p_A) e^{-\omega \cdot U}}. \]
The probability estimate of jar B being the source is $\tilde p_B = 1 - \tilde p_A$. The parameter $\omega$ says how much the probability estimate provided by a participant tends to deviate from the true probability. The IBA has $\omega = 0$, while an individual who has difficulties in estimating the posterior probability has a large $\omega$. In the case where there are more than two jars, we can make a similar transformation by setting one of the jars as baseline, and use the fact that the probabilities for the jars sum to one. This is the same technique that is used in the multinomial logit model, and we refer to [6] for the technical details.
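The equivalence between (3.12) and the closed-form expression for $\tilde p_A$ can be verified numerically. The sketch below is illustrative; the function names and the values of $p_A$ and $\omega$ are arbitrary choices.

```python
import math
import random


def noisy_estimate(p_a, omega, u):
    """Closed form: p_A / (p_A + (1 - p_A) * exp(-omega * u))."""
    return p_a / (p_a + (1.0 - p_a) * math.exp(-omega * u))


def noisy_estimate_via_log_odds(p_a, omega, u):
    """Add omega * u on the log-odds scale as in (3.12), then transform back."""
    log_odds = math.log(p_a / (1.0 - p_a)) + omega * u
    return 1.0 / (1.0 + math.exp(-log_odds))


rng = random.Random(1)
draws = [rng.gauss(0.0, 1.0) for _ in range(5)]   # realisations of U ~ N(0, 1)
for u in draws:
    print(noisy_estimate(0.85, 0.5, u), noisy_estimate_via_log_odds(0.85, 0.5, u))
```

Both functions return values strictly between 0 and 1, and with $\omega = 0$ the estimate coincides with the true probability.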
So how should we incorporate the noise in probability estimates in the model? As noted earlier, the process of calculating the action value of sampling another bead requires the agent to calculate the expected gain in all the future states that may be visited. When performing these calculations, the IBA utilizes the value of $p_A$ for the current state and all the possible future states. Originally, we wanted to construct a model where there is noise every time the agent calculates a $p_A$. However, this requires us to integrate out all the unobserved $U$’s, which turns out to be a non-trivial task. Instead, we simplify the model by letting the agent have noise only when calculating $p_A$ for the current state that he finds himself in. These estimates are actually "observed" as $\tilde p_A$. We write observed in quotation marks because they are strictly speaking not observed; instead, $L_A$ and $L_B$ are observed, and these are utilized to find the individual’s estimates of $p_A$. The structure is illustrated in Figure 3.3. Under this model, when an agent is in state $s_i = n_w$, he calculates the action value for choosing a jar $v \in \{A, B\}$ as
\[ Q(d_v, i, n_w, S) = (r + c_w) \cdot \tilde p_v - c_w, \tag{3.13} \]
where $\tilde p_v$ corresponds to the estimate the participant provides in state $s_i = n_w$. This is a modified version of (3.3). Furthermore, an agent in state $s_i = n_w$ calculates the action value for sampling another bead as
\[ Q(d_S, i, n_w, S) = \sum_{v \in \{A, B\}} \tilde p_v \cdot h(i, n_w, v, d_S, S), \tag{3.14} \]
which is (3.9) where $P(V = v \mid S_i = n_w)$ has been replaced by the subjective estimate $\tilde p_v$. This setting gives rise to two alternative models, as for the HCB and HCBU models. In the first model, the agents know that they have decision noise. We call this model HCBP, where the "P" is there to denote that the subjective probability estimates are taken into account. In the second model, the agents are unaware that they have decision noise. We refer to this model as the HCBUP model.
Chapter 4
Parameter Estimation
So far, we have used the IBA as a reference and constructed four models that describe the behavior of human beings performing the beads task. Two of these models, the HCB and HCBU models, account for bias towards making a hasty conclusion (through $c_s$) and decision noise (through $\tau$). The two other models, the HCBP and HCBUP models, additionally try to capture the noise in estimating probabilities (through $\omega$). In this chapter, we describe how we estimate the parameters in these models. First, we only look at the HCB and HCBU models. We start off by looking at how the parameters can be obtained through maximum likelihood estimation (MLE). However, for reasons that we discuss later, we do not carry out the MLE in practice. Then, we move on to Bayesian parameter estimation. In this setting, we view the model parameters as stochastic variables instead of fixed parameters. We make inferences about the parameters based on the resulting posterior density, by calculating the marginal means and variances. As we cannot perform the integration of the posterior density analytically, we utilize the M-H algorithm as a tool for numerical integration. We round off this chapter by showing how we estimate the parameters in the HCBP and HCBUP models.
Before we look at the parameter estimation, we need to make a couple of modifications to the models so that they are compatible with the experimental data we possess. In the models we have looked at so far, the participants must make a decision at some point in the task. As noted earlier, however, the participants in the experiment are not required to make a commitment to a jar during the task. They can sample through all the $n$ beads and say that they are still not sure regarding the source of the beads when the $n$th bead is drawn. We handle this by imagining that when the participants calculate the action values, they assume that there is no limit on the number of beads they can see before making a decision. In the existing models, we can let the individuals assume that they can see $n_{assumed}$ beads instead of only the $n$ beads in the task. As $n_{assumed}$ goes to infinity, the individuals assume that there is no limit on the number of beads they are allowed to see before making a decision, and they act accordingly. In theory, we can choose $n_{assumed}$ as large as we want in order to approximate that there is no limit to the number of beads that can be seen. In practice, however, the parameter estimation becomes very computationally expensive as $n_{assumed}$ increases. Therefore, we need to give it a moderately large value. In the following, the action values for sampling, $Q(d_S, \cdot, \cdot, \cdot)$, are calculated as if $n_{assumed}$ beads can be drawn instead of only $n$. We also need to make a modification to the data set; some participants provide probability estimates $\tilde p_v$ that are zero or one. Our model for how these are generated does not allow them to actually reach 0 or 1, since the inverse of the log-odds transformation only approaches 0 and 1 asymptotically. We handle this by adding $10^{-4}$ to the instances that are 0, and subtracting this value from the instances that are 1.
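The adjustment of estimates that are exactly 0 or 1 can be sketched as follows. The function names are hypothetical; the offset $10^{-4}$ is the value stated above.

```python
import math

EPS = 1e-4  # offset stated in the text


def adjust_estimate(p, eps=EPS):
    """Move estimates of exactly 0 or 1 inside (0, 1); leave the rest unchanged."""
    if p == 0.0:
        return eps
    if p == 1.0:
        return 1.0 - eps
    return p


def log_odds(p):
    """Log-odds transformation; only defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


raw = [0.0, 0.3, 1.0]                      # hypothetical estimates
adjusted = [adjust_estimate(p) for p in raw]
print(adjusted, [log_odds(p) for p in adjusted])
```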
4.1 Maximum Likelihood Estimation
In this section, we derive the likelihood function of the hyper-parameters in the HCB model. Let us first look at the likelihood function for the parameters of a single individual under the HCB and HCBU models. We use the notation introduced earlier, but we add an extra $t$ to the subscript to denote which task we are considering, ranging from 1 to $N_{tasks}$. That is, we denote $x_{ti}$ as the color of bead $i$ in task $t$ and $x_t = \{x_{t1}, ..., x_{tn}\}$ as the entire displayed sequence in task $t$. Additionally, we introduce $x = \{x_t\}$ as the collection of all the displayed sequences that are given to the agents. Also, we let $r = \{(r_{tA}, r_{tB})\}$ contain the proportion of white beads in each of the jars in each of the trials. Now, we let the subscript $k \in \{1, ..., N_{ind}\}$ denote the agent in question. Then, we denote $d_{tik}$ as the action taken by agent $k$ in trial $t$ after $i$ beads have been drawn. Furthermore, we let $y_k = \{(m_{tk}, d_{tk})\}$ denote the responses of individual $k$ on each of the tasks, containing $m_{tk}$, which is the number of beads displayed before a decision was made in task $t$, and $d_{tk}$, which says which jar was eventually chosen in task $t$. If the individual did not commit to a jar in the task, we define $m_{tk}$ to take the value $n + 1$ for the sake of notation and $d_{tk}$ to take the value 0. Finally, let $y = \{y_k : k = 1, ..., N_{ind}\}$ be the collection of the responses from all the individuals. Then, the likelihood function for the micro-parameters of one single individual can be written as
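\[
\begin{aligned}
p(y_k \mid c_k^s, \tau_k, x, r, S)
&= \prod_{t=1}^{N_{tasks}} \Big( P(D_{t m_{tk} k} = d_{tk} \mid x, r, S) \Big)^{I[m_{tk} \neq n+1]} \prod_{i=1}^{m_{tk}-1} P(D_{tik} = d_S \mid x, r, S) \\
&= \prod_{t=1}^{N_{tasks}} \Big( w\big(d_{tk}, m_{tk}, \textstyle\sum_{s=1}^{m_{tk}} x_{ts}, r, S\big) \Big)^{I[m_{tk} \neq n+1]} \prod_{i=1}^{m_{tk}-1} w\big(d_S, i, \textstyle\sum_{s=1}^{i} x_{ts}, r, S\big) \\
&= \prod_{t=1}^{N_{tasks}} \left( \frac{e^{Q(d_{tk}, m_{tk}, \sum_{s=1}^{m_{tk}} x_{ts}, S)/\tau_k}}{\sum_{d \in \mathcal{D}_{t m_{tk}}} e^{Q(d, m_{tk}, \sum_{s=1}^{m_{tk}} x_{ts}, S)/\tau_k}} \right)^{I[m_{tk} \neq n+1]} \prod_{i=1}^{m_{tk}-1} \frac{e^{Q(d_S, i, \sum_{s=1}^{i} x_{ts}, S)/\tau_k}}{\sum_{d \in \mathcal{D}_{ti}} e^{Q(d, i, \sum_{s=1}^{i} x_{ts}, S)/\tau_k}},
\end{aligned}
\]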
where we have used the indicator function I [·] in order to remove the first term from the product
if the participant does not commit to any jar in the task. In the hierarchical model, the contribution to the likelihood function from a single individual becomes
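\[
\begin{aligned}
L_k^{HCB}(\alpha, \beta, \eta, \delta \mid y_k, x, r) &= p(y_k \mid \alpha, \beta, \eta, \delta, x, r) \\
&\overset{LTP}{=} \int_{c^S} p(c^S \mid \alpha, \beta, \eta, \delta, x, r) \cdot p(y_k \mid c^S, \alpha, \beta, \eta, \delta, x, r) \, dc^S \\
&= \int_{c^S} p(c^S \mid \alpha, \beta) \cdot p(y_k \mid c^S, \eta, \delta, x, r) \, dc^S \\
&\overset{LTP}{=} \int_{c^S} p(c^S \mid \alpha, \beta) \int_{\tau} p(\tau \mid \eta, \delta) \, p(y_k \mid c^S, \tau, x, r) \, d\tau \, dc^S \\
&= \int_{c^S} \int_{\tau} p(c^S \mid \alpha, \beta) \, p(\tau \mid \eta, \delta) \, p(y_k \mid c^S, \tau, x, r) \, d\tau \, dc^S
\end{aligned}
\]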
where $p(c^S \mid \alpha, \beta)$ and $p(\tau \mid \eta, \delta)$ are both probability density functions of the gamma distribution. The full likelihood function is the product of the likelihood functions for each individual, and it can be evaluated by using numerical integration to evaluate the double integrals. The MLEs can be found by maximizing this function numerically with respect to α, β, η, and δ. However, as the integrands are computationally expensive to evaluate, maximizing this function numerically turns out to be too time consuming. Instead, we would like an estimation procedure that does not require integrating out the latent individual-specific parameters $c^s$ and τ. Moutoussis et al. rely on the expectation-maximization algorithm to obtain the MLE parameters. We, on the other hand, turn to Bayesian estimation and look at the posterior distribution of all the model parameters.
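To make the structure of the single-individual likelihood in Section 4.1 concrete, the following Python sketch evaluates the log-likelihood contribution of one task. It is only an illustration under stated assumptions: the table `q_table` of action values Q is taken as given (how Q is computed is not shown here), the function names are hypothetical, and the "sample again" action is assumed to sit at the same position `sample_idx` at every step.

```python
import math

def log_softmax(q_values, tau):
    """Log of exp(Q_d / tau) / sum_d' exp(Q_d' / tau), computed stably."""
    scaled = [q / tau for q in q_values]
    top = max(scaled)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scaled))
    return [s - log_norm for s in scaled]

def task_log_likelihood(q_table, sample_idx, m, d_idx, n, tau):
    """Log-likelihood of the response (m, d) in a single task.

    q_table[i - 1] holds the assumed action values after bead i,
    sample_idx is the position of the 'sample again' action d^S,
    m is the number of beads shown at the decision (n + 1 = no commitment),
    d_idx is the position of the chosen jar in q_table[m - 1].
    """
    total = 0.0
    # The agent chose to keep sampling after beads 1, ..., m - 1.
    for i in range(1, m):
        total += log_softmax(q_table[i - 1], tau)[sample_idx]
    # Indicator I[m != n + 1]: the commitment term is dropped
    # when the participant never committed to a jar.
    if m != n + 1:
        total += log_softmax(q_table[m - 1], tau)[d_idx]
    return total
```

Summing `task_log_likelihood` over the tasks of one individual would give the logarithm of the product over $t$ in the likelihood above.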
4.2 Bayesian Parameter Estimation
In the Bayesian setting, we treat the model parameters as random variables. We set a prior on the model parameters and, by combining this prior with the likelihood, obtain their posterior distribution. Then, we can make inference about the parameters by calculating the mean and variance of the posterior distribution. Before performing the Bayesian analysis, we make a convenient re-parametrization of the model: instead of dealing with α and β as the parameters of the gamma distribution, we use the mean and standard deviation (SD) of the gamma distribution, i.e. $\mu_1 = \alpha\beta$ and $\sigma_1 = \sqrt{\alpha}\,\beta$. Similarly, we work with $\mu_2 = \eta\delta$ and $\sigma_2 = \sqrt{\eta}\,\delta$ instead of η and δ. These parameters are more intuitive to deal with because they represent the mean and SD of the sampling costs and the mean and SD of the noise parameters, quantities we have some understanding of.
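The re-parametrization can be inverted in closed form: from $\mu = \alpha\beta$ and $\sigma = \sqrt{\alpha}\,\beta$ it follows that $\alpha = (\mu/\sigma)^2$ and $\beta = \sigma^2/\mu$. A small Python sketch of both directions is given below; the function names are illustrative only, and β is assumed to be the scale parameter of the gamma distribution.

```python
import math

def gamma_shape_scale(mean, sd):
    """Map (mean, SD) of a gamma distribution to (shape alpha, scale beta)."""
    alpha = (mean / sd) ** 2
    beta = sd ** 2 / mean
    return alpha, beta

def gamma_mean_sd(alpha, beta):
    """Map (shape alpha, scale beta) back to (mean, SD)."""
    return alpha * beta, math.sqrt(alpha) * beta
```

For instance, a mean of 2 and an SD of 1 correspond to shape 4 and scale 0.5.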
4.2.1 Choice of Hyperprior
Let us treat the hyper-parameters µ1 , σ1 , µ2 and σ2 as random variables. We need to choose
a prior for the hyper-parameters, i.e. a hyperprior. There is no obvious choice for this prior.
We could use a diffuse hyperprior that gives vague information about the parameters. In this
case, the posterior is mainly driven by the observations. However, as Moutoussis et al. have performed a similar analysis before us, we possess some prior knowledge about the parameters.
We may use this information when constructing our hyperprior. In particular, we may choose a
hyperprior with a mean that has a value close to the value of the MLE parameters obtained by
Moutoussis et al. Furthermore, there is no reason to believe that there is a dependency between
the governing parameters for the sampling cost and the governing parameters for the noise parameter, so we choose a prior such that (µ1 , σ1 ) and (µ2 , σ2 ) are independent. Additionally, we
note that the sampling costs and the noise parameters are non-negative. With this in mind, we
choose that the mean of the sampling cost ($\mu_1$) is exponentially distributed with a mean determined by some chosen constant $\lambda_1$. That is,
\[
\mu_1 \sim \text{Exp}\left(\frac{1}{\lambda_1}\right), \tag{4.1}
\]
so that the parameter is positive with an expected value $\frac{1}{\lambda_1}$. We use results obtained by Moutoussis et al. to set a value for $\lambda_1$. Also, we would like the SD of the sampling cost ($\sigma_1$) to be of the same magnitude as the mean of the sampling cost ($\mu_1$). Therefore, we choose
\[
\sigma_1 \mid \mu_1 \sim \text{Exp}\left(\frac{1}{k\mu_1}\right), \tag{4.2}
\]
so that $E[\sigma_1 \mid \mu_1] = k\mu_1$, where we can set a constant $0 < k < 1$. We use the same reasoning to obtain the hyperprior related to $\mu_2$ and $\sigma_2$, and we get
\[
\mu_2 \sim \text{Exp}\left(\frac{1}{\lambda_2}\right) \tag{4.3}
\]
and
\[
\sigma_2 \mid \mu_2 \sim \text{Exp}\left(\frac{1}{k\mu_2}\right). \tag{4.4}
\]
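As an illustration of how the hyperprior in (4.1) to (4.4) might be evaluated in practice, the Python sketch below computes its logarithm. To avoid committing to a rate or a mean convention for Exp(·), every exponential density is parametrized directly by its mean; the arguments `mean_mu1` and `mean_mu2` (the prior means of $\mu_1$ and $\mu_2$) and the function names are assumptions made for this example.

```python
import math

def log_exp_pdf(x, mean):
    """Log density of an exponential distribution with the given mean."""
    if x < 0 or mean <= 0:
        return -math.inf
    return -math.log(mean) - x / mean

def log_hyperprior(mu1, sigma1, mu2, sigma2, mean_mu1, mean_mu2, k):
    """Log of p(mu1) p(sigma1 | mu1) p(mu2) p(sigma2 | mu2).

    The conditional priors have means k * mu1 and k * mu2, so that
    E[sigma1 | mu1] = k * mu1 and E[sigma2 | mu2] = k * mu2.
    """
    return (log_exp_pdf(mu1, mean_mu1)
            + log_exp_pdf(sigma1, k * mu1)
            + log_exp_pdf(mu2, mean_mu2)
            + log_exp_pdf(sigma2, k * mu2))
```

Returning minus infinity outside the support means that a non-positive candidate would automatically be rejected by a sampler working on the log scale.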
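The equations (4.1) to (4.4) give us the prior density function of the hyperparameters, namely
\[
p(\mu_1, \sigma_1, \mu_2, \sigma_2) = p(\mu_1)\,p(\sigma_1 \mid \mu_1)\,p(\mu_2)\,p(\sigma_2 \mid \mu_2), \tag{4.5}
\]
where all the densities are the PDF of the exponential distribution. Now, let us write out the expression for the joint posterior density of all the parameters in the model, $p(\theta \mid y, x, r)$, where $\theta = (\mu_1, \sigma_1, \mu_2, \sigma_2, c_1^s, ..., c_{N_{\text{ind}}}^s, \tau_1, ..., \tau_{N_{\text{ind}}})$ is the vector containing all the model parameters. By Bayes Rule (BR) and by the construction of the model, we have
\[
\begin{aligned}
p(\theta \mid y, x, r)
&\overset{BR}{=} \frac{p(\theta \mid x, r)\,p(y \mid \theta, x, r)}{p(y \mid x, r)} \\
&= \frac{p(\theta)\,p(y \mid c_1^s, ..., c_{N_{\text{ind}}}^s, \tau_1, ..., \tau_{N_{\text{ind}}}, x, r)}{p(y \mid x, r)} \\
&= \frac{p(\mu_1, \sigma_1, \mu_2, \sigma_2)\,p(c_1^s, ..., c_{N_{\text{ind}}}^s \mid \mu_1, \sigma_1)\,p(\tau_1, ..., \tau_{N_{\text{ind}}} \mid \mu_2, \sigma_2)\,p(y \mid c_1^s, ..., c_{N_{\text{ind}}}^s, \tau_1, ..., \tau_{N_{\text{ind}}}, x, r)}{p(y \mid x, r)} \\
&= \frac{p(\mu_1)\,p(\sigma_1 \mid \mu_1)\,p(\mu_2)\,p(\sigma_2 \mid \mu_2) \prod_{i=1}^{N_{\text{ind}}} p(c_i^s \mid \mu_1, \sigma_1)\,p(\tau_i \mid \mu_2, \sigma_2)\,p(y_i \mid c_i^s, \tau_i, x, r)}{p(y \mid x, r)} \\
&\propto p(\mu_1)\,p(\sigma_1 \mid \mu_1)\,p(\mu_2)\,p(\sigma_2 \mid \mu_2) \prod_{i=1}^{N_{\text{ind}}} p(c_i^s \mid \mu_1, \sigma_1)\,p(\tau_i \mid \mu_2, \sigma_2)\,p(y_i \mid c_i^s, \tau_i, x, r),
\end{aligned} \tag{4.6}
\]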
where all the densities involved have already been presented. This density function provides us with posterior information about the model parameters. Ideally, we would calculate the posterior mean and variance analytically by integrating this function. However, analytical solutions cannot be obtained in this case. Instead, we use the M-H algorithm to sample from the distribution. Through this sample, we empirically obtain information about the posterior distribution of the parameters in the model. For example, the mean of the distribution is estimated by calculating the sample average. We have omitted the normalizing constant in the last line of (4.6) since the M-H algorithm only needs the ratio of the posterior densities, causing the normalizing constant to cancel.
4.2.2 Choice of Proposal Distribution for M-H
We use the M-H algorithm to sample from the posterior in (4.6). As noted in Section 2.2, the algorithm requires us to choose a proposal distribution, which together with the acceptance probability generates the next state of the Markov chain. Though this distribution can be chosen arbitrarily, the choice may strongly affect the rate of convergence and the mixing of the chain. We recall that the rate of convergence refers to how many iterations it takes for the chain to reach the stationary distribution from an arbitrary starting point. The mixing concerns how quickly the chain explores the support of the target distribution and how correlated the samples produced by the chain are. Both the rate of convergence and the mixing are influenced by the spread of the proposal distribution. Therefore, we should construct a proposal distribution for which the spread can be tuned. One way of checking whether the spread of the proposal density is appropriate is to calculate the acceptance rate of the proposals. We bear this in mind when selecting a proposal distribution for our implementation.
As we would like to sample from the posterior density of θ, we set $\pi(\theta) = p(\theta \mid y, x, r)$ in the M-H algorithm described in Section 2.2. We use component-wise updating. This means that only one of the components of the parameter vector θ may change value at each iteration while the other components remain as they were in the previous time step. This is convenient because it allows us to work with univariate distributions. Since all the parameters in the parameter vector are non-negative by definition, we would like to have proposal distributions that only sample positive candidates; a candidate with a negative value is guaranteed to be rejected as the posterior density is zero, which in turn causes the acceptance probability to be zero. Letting $\theta_k$ be the k-th parameter in the parameter vector and $\tilde{\theta}_k$ be the k-th element of the candidate vector (with the rest of the elements in this candidate vector being equal to the corresponding elements in $\theta^{(t)}$), we use a uniform proposal distribution that covers the current point,
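\[
\tilde{\theta}_k \mid \theta_k^{(t)} \sim \text{Unif}\left(\frac{\theta_k^{(t)}}{1 + \epsilon_k},\; \theta_k^{(t)}(1 + \epsilon_k)\right), \quad \epsilon_k > 0, \quad k = 1, ..., 2 \cdot N_{\text{ind}} + 4. \tag{4.7}
\]
Thus, we have the proposal density functions
\[
q_k(\tilde{\theta} \mid \theta^{(t)}) =
\begin{cases}
\dfrac{1}{\left(1 + \epsilon_k - \frac{1}{1 + \epsilon_k}\right)\theta_k^{(t)}} & \text{if } \dfrac{1}{1 + \epsilon_k}\,\theta_k^{(t)} < \tilde{\theta}_k < (1 + \epsilon_k)\,\theta_k^{(t)}, \\[2ex]
0 & \text{otherwise.}
\end{cases} \tag{4.8}
\]
In other words, the candidate parameters are sampled uniformly within the interval $\left(\frac{\theta_k^{(t)}}{1 + \epsilon_k},\; \theta_k^{(t)}(1 + \epsilon_k)\right)$, which covers the current parameter value, $\theta_k^{(t)}$. These proposals will 1) ensure that the candidate parameters always are positive, and 2) allow for adjusting the spread through the tuning parameters $\epsilon_k$. As $\epsilon_k$ increases, the spread/variance of the proposal gets larger since the support of the uniform proposal increases.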
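A minimal Python sketch of the proposal in (4.7) and its density (4.8) is given below. The function names are illustrative. The sketch also shows why the proposal is not symmetric: the ratio $q_k(\theta \mid \tilde{\theta}) / q_k(\tilde{\theta} \mid \theta)$ equals $\theta_k / \tilde{\theta}_k$, because the width of the interval is proportional to the point it is centered on.

```python
import random

def propose(theta_k, eps_k, rng=random):
    """Draw a candidate uniformly between theta_k / (1 + eps_k) and theta_k * (1 + eps_k)."""
    return rng.uniform(theta_k / (1.0 + eps_k), theta_k * (1.0 + eps_k))

def proposal_density(candidate, theta_k, eps_k):
    """Density q_k(candidate | theta_k) from (4.8)."""
    lower = theta_k / (1.0 + eps_k)
    upper = theta_k * (1.0 + eps_k)
    if lower < candidate < upper:
        return 1.0 / (upper - lower)
    return 0.0

# Example: current value 2, candidate 3, tuning parameter eps = 1.
forward = proposal_density(3.0, 2.0, 1.0)   # q(candidate | current)
reverse = proposal_density(2.0, 3.0, 1.0)   # q(current | candidate)
hastings_correction = reverse / forward     # equals current / candidate = 2 / 3
```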
We proceed by writing out the Metropolis-Hastings ratio we get when using the proposal distributions given in (4.7). By inserting the posterior density given in (4.6), we get
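\[
\begin{aligned}
R_k(\tilde{\theta} \mid \theta)
&= \frac{p(\tilde{\theta} \mid y, x, r, S)\,q_k(\theta \mid \tilde{\theta})}{p(\theta \mid y, x, r, S)\,q_k(\tilde{\theta} \mid \theta)} \\
&= \frac{p(\tilde{\mu}_1)\,p(\tilde{\sigma}_1 \mid \tilde{\mu}_1)\,p(\tilde{\mu}_2)\,p(\tilde{\sigma}_2 \mid \tilde{\mu}_2) \prod_{i=1}^{N_{\text{ind}}} p(\tilde{c}_i^s \mid \tilde{\mu}_1, \tilde{\sigma}_1)\,p(\tilde{\tau}_i \mid \tilde{\mu}_2, \tilde{\sigma}_2)\,p(y_i \mid \tilde{c}_i^s, \tilde{\tau}_i, x, r, S)}{p(\mu_1)\,p(\sigma_1 \mid \mu_1)\,p(\mu_2)\,p(\sigma_2 \mid \mu_2) \prod_{i=1}^{N_{\text{ind}}} p(c_i^s \mid \mu_1, \sigma_1)\,p(\tau_i \mid \mu_2, \sigma_2)\,p(y_i \mid c_i^s, \tau_i, x, r, S)} \cdot \frac{\theta_k}{\tilde{\theta}_k}.
\end{aligned}
\]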
Since only parameter number k is sampled in each iteration, all the terms that do not involve
this parameter cancel. Therefore, if k = 1 then θ1 = µ1 is the parameter that is sampled, and all
the terms that do not involve this parameter vanish. Thus, we get
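\[
R_1(\tilde{\theta} \mid \theta) = \frac{p(\tilde{\mu}_1)}{p(\mu_1)} \cdot \frac{p(\sigma_1 \mid \tilde{\mu}_1)}{p(\sigma_1 \mid \mu_1)} \prod_{i=1}^{N_{\text{ind}}} \frac{p(c_i^s \mid \tilde{\mu}_1, \sigma_1)}{p(c_i^s \mid \mu_1, \sigma_1)} \cdot \frac{\mu_1}{\tilde{\mu}_1}.
\]
When $k = 2$, the parameter $\theta_2 = \sigma_1$ is sampled and the M-H ratio is
\[
R_2(\tilde{\theta} \mid \theta) = \frac{p(\tilde{\sigma}_1 \mid \mu_1)}{p(\sigma_1 \mid \mu_1)} \prod_{i=1}^{N_{\text{ind}}} \frac{p(c_i^s \mid \mu_1, \tilde{\sigma}_1)}{p(c_i^s \mid \mu_1, \sigma_1)} \cdot \frac{\sigma_1}{\tilde{\sigma}_1}.
\]
For $k = 3$, the parameter $\theta_3 = \mu_2$ is sampled and we get
\[
R_3(\tilde{\theta} \mid \theta) = \frac{p(\tilde{\mu}_2)}{p(\mu_2)} \cdot \frac{p(\sigma_2 \mid \tilde{\mu}_2)}{p(\sigma_2 \mid \mu_2)} \prod_{i=1}^{N_{\text{ind}}} \frac{p(\tau_i \mid \tilde{\mu}_2, \sigma_2)}{p(\tau_i \mid \mu_2, \sigma_2)} \cdot \frac{\mu_2}{\tilde{\mu}_2}.
\]
For $k = 4$, the parameter $\theta_4 = \sigma_2$ is sampled, and we get
\[
R_4(\tilde{\theta} \mid \theta) = \frac{p(\tilde{\sigma}_2 \mid \mu_2)}{p(\sigma_2 \mid \mu_2)} \prod_{i=1}^{N_{\text{ind}}} \frac{p(\tau_i \mid \mu_2, \tilde{\sigma}_2)}{p(\tau_i \mid \mu_2, \sigma_2)} \cdot \frac{\sigma_2}{\tilde{\sigma}_2}.
\]
These first four elements are the hyper-parameters. For $k \in \{5, ..., 4 + N_{\text{ind}}\}$, one of the sampling cost micro-parameters is sampled. So $\theta_k = c_i^s$, where $i = k - 4$, and we get
\[
R_k(\tilde{\theta} \mid \theta) = \frac{p(\tilde{c}_i^s \mid \mu_1, \sigma_1)}{p(c_i^s \mid \mu_1, \sigma_1)} \cdot \frac{p(y_i \mid \tilde{c}_i^s, \tau_i, x, r, S)}{p(y_i \mid c_i^s, \tau_i, x, r, S)} \cdot \frac{c_i^s}{\tilde{c}_i^s}.
\]
For $k \in \{5 + N_{\text{ind}}, ..., 4 + 2N_{\text{ind}}\}$, one of the noise micro-parameters is sampled. So $\theta_k = \tau_i$, where $i = k - 4 - N_{\text{ind}}$, and we get
\[
R_k(\tilde{\theta} \mid \theta) = \frac{p(\tilde{\tau}_i \mid \mu_2, \sigma_2)}{p(\tau_i \mid \mu_2, \sigma_2)} \cdot \frac{p(y_i \mid c_i^s, \tilde{\tau}_i, x, r, S)}{p(y_i \mid c_i^s, \tau_i, x, r, S)} \cdot \frac{\tau_i}{\tilde{\tau}_i},
\]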
where $i = k - 4 - N_{\text{ind}}$. The M-H ratios $R_1$, $R_2$, $R_3$ and $R_4$ are relatively inexpensive to calculate as they only involve the gamma and exponential PDFs. For $k > 4$, however, $R_k$ involves calculating $p(y_i \mid c_i^s, \tau_i, x, r, S)$, which requires some computation time. When implementing the M-H algorithm, we use two 'tricks' to make the algorithm as efficient as possible. First of all, we make sure that $\theta_1$, $\theta_2$, $\theta_3$ and $\theta_4$ are sampled more frequently than $\theta_k$, $k > 4$. Specifically, we have chosen that 50% of the iterations are devoted to the first four parameters. Secondly, we continually store the likelihood values $p(y_i \mid c_i^s, \tau_i, x, r, S)$ for each of the individuals. Then, we only have to evaluate the computationally expensive likelihood value once instead of twice every time we calculate the M-H ratio for $k > 4$. We also note that we always work on the logarithmic scale when calculating the probabilities, ratios of probabilities and products of probabilities in the implementation. We do this to avoid problems caused by numerical instability. In the next chapter, we perform a simulation study and sensitivity analysis to see how well the algorithm works.
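The component-wise update, the log-scale computation and the caching of the current density value can be sketched in Python as follows. This is not the implementation used in this thesis: the target is a toy stand-in for (4.6) (two independent gamma-distributed components with shape 3 and scale 1), the component to update is picked uniformly at random rather than with the 50% rule described above, and all names are hypothetical.

```python
import math
import random

def mh_component_step(theta, k, eps, log_target, current_logp, rng):
    """One M-H update of component k. Returns (theta, log density, accepted)."""
    candidate = list(theta)
    candidate[k] = rng.uniform(theta[k] / (1.0 + eps[k]), theta[k] * (1.0 + eps[k]))
    cand_logp = log_target(candidate)  # the only density evaluation in this step
    # log R_k = log pi(candidate) - log pi(theta) + log(theta_k / candidate_k)
    log_ratio = cand_logp - current_logp + math.log(theta[k]) - math.log(candidate[k])
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    if math.log(1.0 - rng.random()) < log_ratio:
        return candidate, cand_logp, True
    return theta, current_logp, False

def toy_log_target(theta):
    """Unnormalized log density of independent Gamma(shape=3, scale=1) components."""
    return sum(2.0 * math.log(t) - t for t in theta)

def run_chain(n_iter, seed=0):
    rng = random.Random(seed)
    theta, eps = [10.0, 10.0], [1.0, 1.0]
    logp = toy_log_target(theta)  # cached value of the current state
    draws = []
    for _ in range(n_iter):
        k = rng.randrange(len(theta))
        theta, logp, accepted = mh_component_step(theta, k, eps, toy_log_target, logp, rng)
        draws.append(list(theta))
    return draws
```

Because the log density of the current state is carried along from step to step, each iteration evaluates the target once, which mirrors the second 'trick' described above.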
4.3 Parameter Estimation in the HCBP and HCBUP Models
As we recall from section 3.3.3, the HCBP model adds yet another individual-specific parameter to each individual. In order for the HCBP model to be fully Bayesian, we need to put a prior on the parameter vector $\theta^* = (\omega_1, ..., \omega_{N_{\text{ind}}})$, and write out the posterior density of the model parameters. (Alternatively, we could build a hierarchical structure by having a prior with unknown parameters and putting a hyperprior with known parameters on top.) As before, we let $y$ be the collection of all the observations that have to do with the DTD responses, and $\theta = (\mu_1, \sigma_1, \mu_2, \sigma_2, c_1^s, ..., c_{N_{\text{ind}}}^s, \tau_1, ..., \tau_{N_{\text{ind}}})$. Furthermore, we let $\tilde{\rho}_k$ be all the probability estimates provided by participant $k$ and $\tilde{\rho}$ be the collection of all the participants' estimates. Then, the posterior density in this model can be written as
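\[
\begin{aligned}
p(\theta, \theta^* \mid y, \tilde{\rho}, x, r) &\propto p(y, \tilde{\rho} \mid \theta, \theta^*, x, r)\,p(\theta, \theta^*) \\
&= p(\tilde{\rho} \mid \theta, \theta^*, x, r)\,p(y \mid \theta, \theta^*, \tilde{\rho}, x, r)\,p(\theta, \theta^*) \\
&= p(\tilde{\rho} \mid \theta^*, x, r)\,p(y \mid \theta, \tilde{\rho}, x, r)\,p(\theta)\,p(\theta^*) \\
&= p(\tilde{\rho} \mid \theta^*, x, r)\,p(\theta^*) \cdot p(y \mid \theta, \tilde{\rho}, x, r)\,p(\theta).
\end{aligned}
\]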
As the posterior density can be written as a product of one function that involves $\theta^*$ and one that involves θ, these parameter vectors are a posteriori independent. As a consequence, we can make inference on these parameter vectors separately. The estimate of $\theta^*$ obtains the exact same value in the HCBP and HCBUP models. We perform a Bayesian analysis on θ, using the same prior as we did for the HCB and HCBU models in section 4.2.1. To access information about the posterior density, we utilize the M-H algorithm with the same proposal distribution as for the HCB and HCBU models. To infer about $\theta^*$, on the other hand, we use MLE. We could have estimated this vector as Bayesians, but we find MLE more convenient because the estimates have a simple analytical solution in this case. Next, we write out the expression for the MLE of $\omega^2$ for a single individual. Let $\rho_{A,l}$ be the probability of jar A being the source of the beads, where $l$ ranges over all the $N$ states where the participant provides a probability estimate. Similarly, let $\tilde{\rho}_{A,l}$ be the probability estimate that the particular individual actually provides. Then, define $z_l = \log\left(\frac{\rho_{A,l}}{1 - \rho_{A,l}}\right)$ and $\tilde{z}_l = \log\left(\frac{\tilde{\rho}_{A,l}}{1 - \tilde{\rho}_{A,l}}\right)$, which are the probabilities mapped to the real line. In our model, we have $\tilde{z}_l \sim N(z_l, \omega^2)$. By writing out the log-likelihood function, differentiating with respect to $\omega^2$ and setting the expression equal to zero, we obtain
\[
\hat{\omega}^2 = \frac{1}{N} \sum_{l=1}^{N} (\tilde{z}_l - z_l)^2. \tag{4.9}
\]
This expression gives us the estimates of $\omega^2$, and hence of ω, for each of the individuals in the data set.
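The estimator in (4.9) is simple enough to sketch directly in Python. The function names are illustrative, and the sketch assumes that all probabilities lie strictly between 0 and 1, since the log-odds transform is undefined at the endpoints.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def omega_squared_mle(true_probs, reported_probs):
    """MLE of omega^2 for one individual, following (4.9)."""
    diffs = [logit(rep) - logit(p) for p, rep in zip(true_probs, reported_probs)]
    return sum(d * d for d in diffs) / len(diffs)
```

An individual who reports the true probabilities exactly gets an estimate of zero, and larger deviations on the log-odds scale give larger estimates.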
In this chapter, we have proposed a procedure for estimating the parameters in the four models. Before applying these estimation procedures to the actual data set, we would like to see how well they work. In order to do so, we perform a simulation study. We implicitly also get to check whether there seem to be any programming errors. In the next chapter, we carry out the simulation study.
Chapter 5
Simulation Study
In this chapter, we simulate some experiment responses under the model(s) built in Chapter 3 and use our implementation of the M-H algorithm to sample from the resulting posterior distribution. There are two main purposes of performing this analysis. First of all, we would like to see whether our choice of proposal distribution in the algorithm works well. This involves investigating the convergence and the mixing of the Markov chain. Secondly, we aim to figure out how accurately we are able to recover the parameters in the model when using data from a reasonable number of participants and tasks per participant. This involves checking how sensitive the algorithm is to the choice of hyperprior. In addition, we implicitly get to confirm that the implementation is made correctly. If there are any major programming errors, these should be detected during this analysis. We have carried out a simulation study on all four models, but as the results are fairly similar, we only present the simulation study on the HCB model in this report.
5.1 M-H Simulations Under Different Scenarios
We run a total of seven simulations in this study. Each simulation consists of two steps. The first step is response generation. The responses are generated by first choosing some hyperparameters, then sampling micro-parameters for a chosen number of individuals, and finally simulating responses to some chosen tasks for each of these individuals. The second step is parameter estimation. Here, we specify some parameters for the hyperprior, and run the M-H algorithm with some chosen tuning parameters on the responses that were generated.
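The first part of the response generation step, sampling the micro-parameters of the individuals from the chosen hyperparameters, can be sketched in Python as below. The function name is hypothetical, the gamma distributions are parametrized by mean and SD as in Section 4.2, and the final step of simulating responses is left out because it requires the decision model from Chapter 3.

```python
import random

def sample_individuals(mu1, sigma1, mu2, sigma2, n_ind, seed=0):
    """Draw (c_s, tau) pairs from gamma distributions with the given means and SDs."""
    rng = random.Random(seed)

    def draw(mean, sd):
        shape = (mean / sd) ** 2
        scale = sd ** 2 / mean
        # random.gammavariate takes (shape, scale); its mean is shape * scale.
        return rng.gammavariate(shape, scale)

    return [(draw(mu1, sigma1), draw(mu2, sigma2)) for _ in range(n_ind)]

# Hyperparameters of the kind used in the first scenario below.
individuals = sample_individuals(2.0, 1.0, 5.0, 3.0, n_ind=100)
```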
The setup for each of the simulations is summarized in Table 5.1. They all have in common that $n = n_{\text{assumed}} = 10$. Hence, ten beads can be drawn, and the participants know that they must make a commitment to a jar at some point in the sequence of ten beads. As we recall from Section 4.2.2, the proposal distribution is tuned by the $\epsilon_k$ parameters, each of these specifying the possible step size when there is a proposal for component $k$ of $\theta^{(t)}$. It is not necessary to tune the proposal distributions for the $c^s$ parameters individually, so we set $\epsilon_5 = ... = \epsilon_{4+N_{\text{ind}}}$. The same goes for the τ parameters, so we give all of them the same tuning by setting $\epsilon_{5+N_{\text{ind}}} = ... = \epsilon_{4+2 \cdot N_{\text{ind}}}$. Now, we define the vector $\epsilon = (\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4, \epsilon_5, \epsilon_{5+N_{\text{ind}}})$, which contains all the information about the tuning of the proposal distribution.
Each of the simulations serves to illustrate some point, either by itself or when compared to
another simulation. From one simulation to another, we modify only one component in order
to see the effect of changing that particular component. Before we go into the details of each
simulation, let us give a brief overview of what points we are trying to bring forth through these
simulations. Simulation 1 and simulation 2 aim to illustrate the effect of tuning the proposal
distribution. Simulation 1 and simulation 3 are designed to show the impact of including more
individuals in the data set. Simulation number 1 and 4 show the consequence of giving more
tasks per individual. Simulation number 1 and 5 aim to illustrate how sensitive the parameter
estimation is to the choice of hyperprior parameters λ1 and λ2 .
In the first scenario, we have chosen some moderately large values for the hyperparameters, namely $\mu_1 = 2$, $\sigma_1 = 1$, $\mu_2 = 5$, and $\sigma_2 = 3$. (Recall that we have chosen r = 0 and c w 100 as references for the model.) We sample 100 individuals based on these hyperparameters, and let them respond to the four tasks given in Table 1.1. The "average individual" generated under this scenario has parameters $c^s = \mu_1 = 2$ and $\tau = \mu_2 = 5$. With these parameters, the probabilities of committing to jar A, committing to jar B, and choosing to continue sampling are 14%, 0.0001%, and 86%, respectively, after one bead is drawn in task 1. In Figure 5.1, we present the responses/observations that were produced. For the parameter estimation, we choose the values
Table 5.1: The scenarios in the simulation study. The symbol '*' means that ε = (1, 1, 1, 1, 1, 1), and '**' means that ε = (0.5, 0.5, 0.5, 0.5, 2, 2).

        Data Generation      Experiment                      Hyperprior          Metropolis-Hastings
SIM #   µ1   σ1   µ2   σ2    Nind   n_trials   n = n_assumed   λ1     λ2     k      ε     θ(0)   N_it
1       2    1    5    3     100    4          10              0.25   0.25   0.5    *     10     1.0 · 10^6
2       2    1    5    3     100    4          10              0.25   0.25   0.5    **    10     1.0 · 10^6
3       2    1    5    3     200    4          10              0.25   0.25   0.5    *     10     1.2 · 10^6
4       2    1    5    3     100    8          10              0.25   0.25   0.5    *     10     1.0 · 10^6
5       2    1    5    3     100    4          10              0.1    0.1    0.5    *     10     1.0 · 10^6
$\lambda_1 = \lambda_2 = 0.25$ in the hyperprior. Therefore, the hyperprior says that $E[\mu_1] = E[\mu_2] = 1/0.25 = 4$, which is reasonably close to the true values. To study how fast the M-H algorithm converges, we let all the components of $\theta^{(0)}$ have the value 10 so that they are reasonably far away from the values that were used to generate the responses. Let us take a look at the M-H simulation for this scenario. In Figure 5.2, we show trace plots of the Markov chain for some selected parameters. On the first row, we have trace plots of the hyperparameters governing the sampling costs (namely $\mu_1$ and $\sigma_1$) as well as the trace plot of an individual-specific sampling cost parameter ($c_1^s$). Similarly, on the second row, we have the trace plots of $\mu_2$, $\sigma_2$, and $\tau_1$. We ran $1 \cdot 10^6$ iterations of the M-H algorithm. From these trace plots, it seems that it takes a while before convergence is reached. The chain makes some large leaps in the first iterations, and then it gradually moves towards a state where the parameters have values that are close to those used for response generation. It looks like about $5 \cdot 10^4$ iterations are needed before all the parameters seem to fluctuate randomly about some value. These plots also give us an indication of bad mixing; we can see that the samples fluctuate about some mean in a cyclic pattern with low frequency. In other words, the samples seem to have a large auto-correlation. This is particularly seen for the hyper-parameters. As we recall, we have decided that as much as 50% of the iterations of the M-H algorithm should perturb one of the hyperparameter components of θ. Lowering this percentage may decrease the auto-correlation of the hyper-parameters.
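The auto-correlation that is read off the trace plots by eye can also be estimated numerically. The Python sketch below computes the sample auto-correlation of a single chain at a given lag; it is a generic estimator with a hypothetical function name, not a diagnostic reported in this thesis.

```python
def autocorrelation(chain, lag):
    """Sample auto-correlation of a sequence of draws at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    if var == 0.0:
        raise ValueError("auto-correlation is undefined for a constant chain")
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag)) / n
    return cov / var
```

Values that stay close to 1 over many lags indicate slow mixing, while values near 0 already at small lags indicate that consecutive draws carry almost independent information.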
In Figure 5.3, we have plotted the acceptance rate for each parameter. The acceptance rate for
a parameter is the proportion of times the candidate was accepted when there was a proposal
for that particular parameter. As we recall, the rule of thumb is that acceptance rates should lie
[Figure 5.1: four panels, Task 1 to Task 4. x-axis: Beads Drawn (i), from 1 to 10 plus 'No'; left y-axis: Number of Commitments; right y-axis: Probability. Legend: No decision, Jar A, Jar B (one panel also lists Jar C and Jar D).]
Figure 5.1: SIM1. The responses in simulation number 1. The plot shows how many individuals responded after the first, second, etc. bead was shown. There were in total 100 individuals in the generated data set. The true probability of each of the jars being the source is displayed on the right y-axis. Table 1.1 shows the ratios of black, white, and red beads in each of these tasks. Since the participants had to make a commitment during each task, the 'No Commitment' bar is empty.
[Figure 5.2: six trace plot panels. First row: Trace Plot of µ1, Trace Plot of σ1, Trace Plot of c^s_1. Second row: Trace Plot of µ2, Trace Plot of σ2, Trace Plot of τ1. x-axis: t (×10^5); y-axes: θ1, θ2, θ5, θ3, θ4, θ105. Legend: M-H sampled parameter, True parameter, Mean of hyperprior (chosen λ1 or λ2), Estimated prior mean (µ̂1 or µ̂2), Burn-in.]
Figure 5.2: SIM1. Trace plots for some of the parameters in simulation study number 1. In the first row, the trace plots of the hyperparameters of $c^s$ and one of the $c^s$ parameters are plotted. The plots on the second row show the traces of the hyperparameters of τ as well as one of the τ parameters. We have chosen a burn-in period of $2 \cdot 10^5$ iterations.
[Figure 5.3: Acceptance Rates for the θk's. x-axis: k, from 1 to 204; y-axis: Acceptance Rates, from 0 to 1.]
Figure 5.3: SIM1. Acceptance rates for the parameters in simulation study number 1.
between 20% and 50%. In this simulation, the acceptance rates are about 10% for the hyperparameters and 50% for the individual-specific parameters. Therefore, tuning the proposal by decreasing the ε's for the hyperparameter proposals and increasing them for the proposals of the $c^s$'s and τ's should be considered.
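Acceptance rates per parameter are straightforward to compute if the sampler records, for every iteration, which component was proposed and whether the candidate was accepted. The Python sketch below assumes such a log exists; the function names and the zero-based component indices are choices made for this example.

```python
def acceptance_rates(proposed_component, accepted, n_params):
    """Proportion of accepted candidates for each component of theta."""
    proposals = [0] * n_params
    accepts = [0] * n_params
    for k, ok in zip(proposed_component, accepted):
        proposals[k] += 1
        accepts[k] += int(ok)
    # Components that were never proposed get NaN instead of a rate.
    return [a / p if p else float("nan") for a, p in zip(accepts, proposals)]

def outside_rule_of_thumb(rate, low=0.2, high=0.5):
    """True if the rate falls outside the 20% to 50% range quoted above."""
    return not (low <= rate <= high)
```

A rate below the range suggests decreasing the corresponding $\epsilon_k$, and a rate above it suggests increasing $\epsilon_k$.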
Yet another way to monitor the Markov chain is by checking the correlation between the parameters in $\theta^{(t)}$. The correlation matrix with the correlation between all pairs of parameters is visualized in the left panel of Figure 5.4. In the right panel of this figure, we can see the correlation between the hyperparameters only. These plots show that there is some positive correlation between $\mu_1$ and $\sigma_1$, and between $\mu_2$ and $\sigma_2$. These sample correlations are 0.64 and 0.55, respectively. The correlation seen for these parameters is caused by the construction of the model. We chose $\mu_1$ and $\sigma_1$ to be a priori dependent, as we recall from Figure 3.1. This dependency is reflected in the posterior distribution. The same holds for $\mu_2$ and $\sigma_2$. Furthermore, there appears to be some positive correlation between $\mu_1$ and several of the $c_k^s$ parameters, and between $\mu_2$ and several of the $\tau_k$ parameters. These correlations are also caused by the structure of the model. Looking at the left panel of Figure 5.4, we see that there generally is some negative correlation between $c_i^s$ and $\tau_i$. Since $c^s$ and τ are a priori independent, the correlation seen in the posterior must be caused by the likelihood function of θ. We recall that the model parameters $c^s$ and τ explain some of the same observed behavior. As $c^s$ increases, the individual will tend to commit to a jar early. But as τ increases, the person will also tend to commit earlier. Therefore, a person who exhibits many early commitments can either have a large $c^s$ and a moderate τ, or a large τ and a moderate $c^s$. This may cause some negative correlation between $c_i^s$ and $\tau_i$ in the posterior density. In Figure 5.5, we show scatter plots of some selected pairs of parameters. We can clearly see the positive correlation between $\mu_1$ and $\sigma_1$ and between $\mu_2$ and $\sigma_2$. We also see that there is some negative correlation between $c^s$ and τ, especially for individual 100.
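The sample correlations quoted above are ordinary Pearson correlations between the sampled values of two components of θ. A minimal Python sketch is given below, where `xs` and `ys` stand for two columns of the stored chain; the function name is hypothetical.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long sequences of draws."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Applying this function to every pair of components gives a correlation matrix of the kind visualized in Figure 5.4.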
Let us look at the parameter estimates we obtain from this sample. We remove the first $2 \cdot 10^5$ iterations as a burn-in period. The histograms of the remaining sample are displayed in Figure 5.6. These histograms show the (empirical) marginal distribution of the parameters. By taking
[Figure 5.4: two panels titled Correlation Matrix Visualization. Left panel: axes k, with ticks at 5, 105 and 204. Right panel: axes µ1, σ1, µ2, σ2. Color scale from -1 to 1.]
Figure 5.4: Simulation 1: The left panel visualizes the correlation matrix for all the parameters.
The right panel has zoomed in on the top left corner of the correlation plot in the left panel. It
shows the correlation matrix of the hyperparameters only.
Figure 5.5: SIM1. Scatter plot of pairs of parameters. In the top left panel we have $\mu_1$ versus $\sigma_1$ and in the top right we have $\mu_2$ versus $\sigma_2$. In the bottom left and bottom right we have $c_1^s$ versus $\tau_1$ and $c_{100}^s$ versus $\tau_{100}$, respectively.
Figure 5.6: SIM1. Histograms of some of the parameters after removing the burn-in period.
The true parameter values that were used to generate the individuals and their responses are
plotted as red horizontal lines. These histograms represent the shape of estimated marginal
distributions of the posterior p(θ|y, x, r ) given in (4.6).
Table 5.2: Summary statistics for some marginal posterior distributions in simulation studies 1, 3, and 4. From simulation 1 to simulation 3, we double the number of individuals. From simulation 1 to simulation 4, we double the number of tasks per individual.

                         Sample Mean (Sample SD)
θk            True      SIM 1             SIM 3             SIM 4
µ1            2.0000    1.9517 (0.2209)   1.8543 (0.1379)   1.8811 (0.1779)
σ1            1.0000    1.0904 (0.2850)   0.9678 (0.1727)   1.0995 (0.2118)
µ2            5.0000    4.4157 (0.4221)   4.9354 (0.3647)   4.9321 (0.3536)
σ2            3.0000    2.0387 (0.4709)   2.6645 (0.4154)   2.5256 (0.3475)
c^s_1         0.3178    1.2755 (0.5765)   1.1225 (0.5081)   1.1570 (0.5664)
τ1            3.8371    3.7476 (1.4606)   3.8785 (1.5321)   3.8673 (1.1590)
Sample Size   -         0.8 · 10^6        0.6 · 10^6        0.8 · 10^6
the sample averages, we get the empirical Bayes estimates of the parameters. These estimates and the estimates of the SD of the marginal distributions are shown in Table 5.2. Based on this table and the histograms in Figure 5.6, it seems that by using 100 individuals and four tasks per individual, we are able to estimate the hyperparameters reasonably accurately. In order to visualize how well we are able to recover all the model parameters, we have plotted the true parameters versus the estimates in Figure 5.7. From this plot, we can clearly see the "shrinkage towards the mean" effect of the hierarchical model. The estimates are somewhat squeezed vertically towards the estimated hyperparameter that specifies the mean of the prior distribution. In Figure 5.8, we see a scatter plot of the individuals in the data set; the true $c^s$ is plotted against the true τ for each individual. In the same plot, we have also included a scatter plot of the individuals based on their estimated parameters. This figure also clearly illustrates the shrinkage effect. The individuals that have extreme parameters are pulled towards the estimated mean of the prior. So individuals with a large $c^s$ tend to get strongly pulled towards $\hat{\mu}_1$, and those with a large τ are likely to be pulled towards $\hat{\mu}_2$. We note that the sample size is probably somewhat slim. Even though it contains $8 \cdot 10^5$ values, many of these are duplicates or near duplicates that give little extra information. Since the mixing is rather poor, it takes a very large sample to represent the distribution well.
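The estimates in Table 5.2 are sample means and sample SDs of the chain after the burn-in period has been discarded. A short Python sketch of this summary for a single parameter is given below; the function name is illustrative, and `chain` stands for the stored draws of one component of θ.

```python
import math

def posterior_summary(chain, burn_in):
    """Sample mean and sample SD of the draws after the burn-in period."""
    kept = chain[burn_in:]
    n = len(kept)
    mean = sum(kept) / n
    var = sum((x - mean) ** 2 for x in kept) / (n - 1)
    return mean, math.sqrt(var)
```

Note that the sample SD describes the spread of the marginal posterior. Because the draws are auto-correlated, it does not by itself quantify the Monte Carlo error of the estimated mean.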
Figure 5.7: SIM1. The true parameters that generated the responses versus the estimates obtained through the Bayesian analysis in simulation study number 1. We can clearly see the "shrinkage" effect on the c_s parameters. All the individuals’ sampling cost estimates have a value that is close to the estimated mean of the prior. [Scatter plot "True vs. Estimated Parameters": estimated θ_k against true θ_k for the hyperparameters µ1, σ1, µ2, σ2 and for the individual-specific c_s and τ.]
Figure 5.8: SIM1. A scatter plot of the true individuals and the estimated individuals in simulation number 1. We have included the ID of some of the individuals in order to show how the estimated parameters compare to the true parameters. [Scatter plot "Scatter Plot of the Individuals": τ against c_s, showing the true values, the estimated values, and (µ̂1, µ̂2); individuals 16, 47, 49, 59, and 74 are labelled.]
Figure 5.9: SIM2. Trace plots for some of the parameters in simulation study number 2. [Panels: trace plots of µ1, σ1, c_s1, µ2, σ2, and τ1 against iteration t (×10^5), each marking the true parameter and the burn-in period.]
Now, we move on to simulation number 2. As we saw in simulation number 1, the acceptance rates were somewhat small for the hyperparameters and somewhat large for the individual-specific parameters. Let us tune the M-H algorithm by decreasing the spread of the former and increasing the spread of the latter. Specifically, we set ε = (0.5, 0.5, 0.5, 0.5, 2, 2) instead of ε = (1, 1, 1, 1, 1, 1). We run the M-H algorithm on exactly the same data as in simulation study number 1. That is, we have the same individuals (an individual in this context means a pair (c_s, τ)) and the same responses for each individual. We want to see how this affects the acceptance rate, mixing, and convergence of the algorithm. We start the chain in the same state as for simulation 1 and run the same number of iterations. The resulting trace plots are shown in Figure 5.9. This tuning does change the acceptance rates. For µ1, σ1, µ2, and σ2 they have increased to 0.16, 0.25, 0.12, and 0.26, respectively. The average acceptance rates for c_s and τ have decreased to 0.44 and 0.31, respectively. Comparing the trace plots of simulations 1 and 2 in Figures 5.2 and 5.9, respectively, it does not seem like the tuning improved the mixing. In fact, it turns out that the "improvement" in the tuning actually slowed down the rate of convergence: it takes about 4 · 10^5 iterations before stationarity appears to be reached. What we take home from these two simulations is that the tuning may greatly affect the rate of convergence. However, we should be careful about drawing general conclusions from this single example.
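The link between proposal spread and acceptance rate can be illustrated with a minimal random-walk Metropolis sampler. This is a sketch under simplifying assumptions, not the sampler used in the simulations: the target is a one-dimensional standard normal chosen purely as a stand-in, and the single scalar `spread` plays the role that one component of the tuning vector plays in the text.

```python
import math
import random


def log_target(x):
    """Log-density (up to a constant) of a standard normal; a stand-in target, not the thesis model."""
    return -0.5 * x * x


def random_walk_mh(n_iter, spread, rng, start=0.0):
    """Random-walk Metropolis sampler; returns the chain and the acceptance rate."""
    x = start
    chain = []
    accepted = 0
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, spread)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
            accepted += 1
        chain.append(x)
    return chain, accepted / n_iter


rng = random.Random(1)
rates = {}
for spread in (0.5, 2.0, 8.0):
    _, rates[spread] = random_walk_mh(20000, spread, rng)
    print(spread, round(rates[spread], 2))
```

A smaller spread gives a higher acceptance rate because proposals stay close to the current state, but each accepted move is also small. A high acceptance rate alone therefore says little about mixing or about how quickly the chain reaches stationarity.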
In simulation 3, we extend the generated data set from simulation 1. We include the same 100 individuals and their responses as in simulation 1, but we also add another 100 individuals sampled with the same hyperparameters as in simulation 1. We would like to see how this improves the parameter estimation. In advance, we expect an improvement in the hyperparameter estimates. This is because we include more individuals in the data set, which should give us more information about the distribution from which they are sampled. At the same time, we may expect the rate of convergence of the M-H algorithm to decrease, since the number of parameters is almost doubled. The trace plots for the M-H samples obtained from this data set look reasonably similar to those of simulation 1. It appears that it takes longer for the chain to converge, as anticipated. The estimates of the marginal sample means and SDs, after removing an appropriate burn-in period, are printed in Table 5.2. We can see that the estimate of µ1 actually is somewhat further away from the true value in simulation 3 than in simulation 1. However, the sample SD of the marginal distribution has decreased, as we expect. For the other hyperparameters, the estimates have improved and the SDs have decreased as we move from simulation 1 to simulation 3. This agrees with what we expect.
Figure 5.10: SIM4. The true parameters that generated the responses versus the estimates obtained through the Bayesian analysis in simulation study number 4. The shrinkage is less than in simulation number 1. [Scatter plot "True vs. Estimated Parameters": estimated θ_k against true θ_k for the hyperparameters µ1, σ1, µ2, σ2 and for the individual-specific c_s and τ.]
In simulation 4, we would like to see what happens if we increase the number of tasks per individual. We use the same individuals, tasks, and responses as in simulation 1, but we give the individuals four additional tasks and record their responses. In three of these additional tasks, there are two sources of beads, each with some ratio of white to black beads. In the last additional task, there are four possible sources, where three of the sources have black and white beads, and the fourth has red and black beads. The fourth jar is not the true source in the last task, so no red beads are displayed. The four extra tasks are the same for all the individuals. With respect to the rate of convergence and mixing, the trace plots of simulation 4 look fairly similar to those of simulation 1. We expect an improvement in the individual-specific parameters, since more information is gathered from each individual. In turn, we may expect this to improve the hyperparameter estimates, because more precise information about the individuals should lead to more precise information about the population they come from. The values of some of the parameter estimates are printed in Table 5.2. The estimates of the hyperparameters are reasonably similar to those in simulation 1, and the sample SDs are slightly smaller than in simulation 1. In Figure 5.10, we have plotted the true parameters versus the estimates. We can clearly see that the shrinkage effect is smaller than in simulation 1, especially for the τ parameters. It appears that the individual-specific parameters in general are recovered more accurately in simulation 4 than in simulation 1, as expected when more information is gathered from each individual.
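Why more tasks per individual reduce the shrinkage can be seen in a much simpler setting than the models of this thesis. The sketch below assumes a conjugate normal model (normal observations with known variance and a normal prior on the individual's mean), which is only an analogy; all numbers and argument names are hypothetical.

```python
def posterior_mean(sample_mean, n_obs, obs_var, prior_mean, prior_var):
    """Posterior mean of a normal mean under a normal prior (conjugate case, assumed for illustration)."""
    data_precision = n_obs / obs_var
    prior_precision = 1.0 / prior_var
    weight_on_data = data_precision / (data_precision + prior_precision)
    return weight_on_data * sample_mean + (1.0 - weight_on_data) * prior_mean


# An individual whose observed mean (10.0) lies far from the population mean (4.0).
for n_tasks in (4, 8):
    print(n_tasks, posterior_mean(10.0, n_tasks, obs_var=16.0, prior_mean=4.0, prior_var=4.0))
```

With these made-up numbers, the estimate moves from 7.0 with four observations to 8.0 with eight: the weight on the individual's own data grows with the number of observations, so the estimate is pulled less strongly towards the prior mean.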
In simulation 5, we want to check how sensitive the model is to the hyperprior. We use the
same individuals and responses as in simulation 1. However, we make changes to the Bayesian
analysis by changing the values of the parameters in the hyperprior. More specifically, we decrease λ1 from 0.25 to 0.1 and λ2 from 0.25 to 0.1. This makes the hyperprior less informative,
as the variance is increased. It turns out that this modification has very little effect on the parameter estimates. As seen in Figure 5.11, the marginal posterior densities are almost identical
in simulation 1 and simulation 5.
This simulation study has given us an impression of how well the M-H algorithm works and how accurately we are able to recover the true model parameters. We have seen that the mixing is rather poor. Consequently, we need to perform many iterations of the M-H algorithm. The poor mixing may be due to a poor choice of proposal distribution for the M-H algorithm. We have also seen that we can make good inference about the hyperparameters when the data set consists of 100 individuals with four tasks per individual. The real data set that we analyze in this thesis consists of 73 individuals who have performed four tasks each. There is reason to believe that this is enough data to make good inference about the population. However, with only four tasks per participant, the individuals with extreme parameters are not so easily detected. We are now ready to take a look at the data set of real human beings performing the beads task.
Figure 5.11: Simulation 1 versus simulation 5. We can see that the change in the parameters of the hyperprior has very little effect on the posterior distribution.
Chapter 6
Data Set and Results
In this chapter, we fit the models to experimental data from the Department of Psychology at the Norwegian University of Science and Technology. The data set consists of the responses of 73 participants performing the tasks in Table 1.1. To our knowledge, none of these participants have been diagnosed with psychosis. We start by looking at the raw data set. Then, we proceed by estimating the parameters under the four models and discuss what we observe.
6.1 The Raw Data
In this section, we present the raw data from a group of people performing the experiment. We recall that there are two types of responses recorded in this experiment. The first type has to do with the DTD part of the experiment, i.e. the number of beads drawn before a commitment to a jar is made and which jar is chosen. The second type has to do with the DTC part of the experiment, and it consists of the probability estimates for each of the jars after each bead is drawn.
Figure 6.1: DTD responses in the data set. The plot shows how many individuals responded after the first, second, etc. bead was shown. There were in total 73 individuals in the data set. The true probability of each of the jars being the source is displayed on the right y-axis. Table 1.1 shows the ratios of black, white, and red beads in each of these tasks. [Four panels, Task 1 to Task 4: bars give the number of commitments (left y-axis) against beads drawn (i), with a final category "No" for no decision, coloured by the chosen jar (A and B; A to D in Task 4); solid lines give the probability of each jar (right y-axis).]
Figure 6.2: DTC responses in the data set. The solid line is the posterior probability of the jar being the source of the beads. The dashed line is the probability estimate, p̃, averaged over all the individuals. [Four panels, Task 1 to Task 4: posterior probability against bead number (1 to 10), showing p and the mean p̃ for each jar.]
Figure 6.1 visualizes all the DTD responses for each of the four tasks. The x-axis shows the i-th bead that is drawn in the task. The left y-axis represents how many participants responded after seeing the i-th bead, and the vertical bars are associated with this axis. The color of the bars indicates which of the jars was chosen. The right y-axis is associated with the solid lines, which are the true probabilities of each of the jars being the source when the first i beads have been drawn. The color of the drawn bead is indicated on the x-axis. As we can see, the participants choose the jar with the largest posterior probability of being the source in most of the cases where a commitment is made. There are only a few instances where a jar that is not the most likely is chosen. It is also worth noticing that for each of the tasks, a large proportion of the individuals do not commit to any jar during the course of the ten beads. For the first task, we can see that the probability of jar A being the true jar is 0.9 already after the first bead, and the probability only gets larger as more beads are drawn. Still, as many as 21% of the participants do not commit to any jar in this task.
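The true probabilities shown as solid lines follow from repeated application of Bayes' rule. A minimal sketch of that computation is given below. The jar compositions and the bead sequence are hypothetical and are not those of Table 1.1; they are chosen so that one jar has probability 0.9 after the first bead, as described for Task 1.

```python
def jar_posteriors(beads, jar_ratios, prior=None):
    """Posterior probability of each jar after each bead, by repeated use of Bayes' rule.

    beads      : sequence of colours, e.g. "wwwb"
    jar_ratios : dict mapping jar name -> dict mapping colour -> proportion
    prior      : dict mapping jar name -> prior probability (uniform if omitted)
    """
    jars = list(jar_ratios)
    current = dict(prior) if prior else {jar: 1.0 / len(jars) for jar in jars}
    history = []
    for bead in beads:
        unnormalised = {jar: current[jar] * jar_ratios[jar].get(bead, 0.0) for jar in jars}
        total = sum(unnormalised.values())
        current = {jar: value / total for jar, value in unnormalised.items()}
        history.append(current)
    return history


# Hypothetical two-jar task: jar A is 90% white, jar B is 90% black.
ratios = {"A": {"w": 0.9, "b": 0.1}, "B": {"w": 0.1, "b": 0.9}}
history = jar_posteriors("wwwb", ratios)
for step, post in enumerate(history, start=1):
    print(step, round(post["A"], 4))
```

In this toy sequence, the probability of jar A rises with each white bead and drops when the black bead appears, which is the kind of movement the solid lines in the figures trace out.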
In Figure 6.2, we present the DTC responses in the data set. The solid line shows the posterior probability of the source being a certain jar. The dashed line, on the other hand, shows the average of the probability estimates provided by the participants. We can see that, on average, the probabilities provided by the participants are less extreme than the true probabilities. Furthermore, the participants tend to react strongly when a bead of a different color than its preceding beads shows up. This is seen, for example, when the fourth bead is drawn in Task 1.
Table 6.1: The empirical Bayes estimates of some of the model parameters. The sample SDs are shown in parentheses.
Model   µ̂1               σ̂1               µ̂2                σ̂2                ĉ_s^1             τ̂1
HCB     0.0004 (0.0007)   0.0008 (0.0012)   11.4663 (2.4699)   19.3062 (4.6289)   0.0001 (0.0004)   2.8622 (1.2025)
HCBU    0.0201 (0.0148)   0.0511 (0.0397)   16.2805 (3.6602)   30.9884 (7.4775)   0.0028 (0.0081)   3.8465 (2.0799)
HCB-P   0.0149 (0.0034)   0.0055 (0.0030)   9.4664 (1.3642)    8.9336 (1.8960)    0.0158 (0.0072)   6.2683 (1.0397)
HCBU-P  2.9378 (1.3144)   8.6428 (4.2397)   10.0232 (1.0897)   7.3853 (1.3855)    0.2825 (0.3585)   9.8286 (2.9514)
6.2 Model Results and Discussion
We fit the model parameters under the HCB, HCBU, HCB-P, and HCBU-P models. To account for the fact that the participants are not required to make a commitment to any jar during the task, we set n_assumed = 20. We use the M-H algorithm with the same tuning parameters as in simulation 1 in Table 5.1. The resulting (empirical) means and SDs of the marginal distributions of some selected parameters are shown in Table 6.1.
Figure 6.3: Trace plots for the HCB model results. [Panels: trace plots of µ1, σ1, c_s1, µ2, σ2, and τ1 against iteration t (×10^5), each marking the burn-in period.]
Let us start by analyzing the posterior distribution of the model parameters obtained under the HCB model. This is the same model as the one introduced by Moutoussis et al. In Figure 6.3, we present the trace plots for the hyperparameters µ1, σ1, µ2, and σ2, and the trace plots for c_s and τ for one of the individuals. After checking the trace plot of every single parameter, it looks like convergence is attained after about 4 · 10^5 iterations. The corresponding histograms after removing the burn-in period are shown in Figure 6.4. In agreement with Moutoussis et al., we find that the mean of the sampling costs (µ1) is near zero. Furthermore, we find that the mean of the decision noise parameters (µ2) is of the same magnitude as the one obtained by Moutoussis et al. Specifically, we find µ̂2 ≈ 11.47, while Moutoussis et al. obtained the value 5.09 for the control group and the value 12.07 for the paranoid group.
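The point estimates and SDs reported for the marginal posteriors are summaries of the sampled values that remain after the burn-in period is discarded. A minimal sketch of that step is shown below; the function name and the short toy chain are hypothetical, and a real chain would be far longer.

```python
import statistics


def summarise_chain(chain, burn_in):
    """Mean and sample SD of the draws that remain after dropping the first burn_in iterations."""
    kept = chain[burn_in:]
    return statistics.mean(kept), statistics.stdev(kept)


# Toy chain: a transient phase followed by draws fluctuating around 11.
toy_chain = [2.0, 5.0, 8.0, 10.0, 11.5, 10.5, 11.0, 12.0, 10.0, 11.0]
mean, sd = summarise_chain(toy_chain, burn_in=4)
print(mean, sd)
```

Including the transient phase would bias the mean downwards and inflate the SD, which is why the burn-in length is read off the trace plots before any summaries are computed.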
Even though each individual only responds to four tasks, it is interesting to take a look at the estimates of the individual-specific parameters. In Figure 6.5, we show the histograms of the ĉ_s and τ̂. All the individuals get an estimated sampling cost that is less than 1.3 · 10^−3, which is basically zero since it has a negligible effect on the action value for sampling. For the decision noise parameter τ, the estimates range from 0.0011 to 60. We can see that there is a large group of individuals that obtain a τ̂ very close to zero: 11 individuals have a τ̂ of around 0.0011. A person with τ = 0.0011 basically behaves as an IBA. Taking a closer look at the 11 individuals with the smallest τ parameters, we find that all of them sampled all 10 beads without committing to any jar in the end, in each of the four tasks. It makes sense that these individuals get very small estimates of τ, because an IBA with c_s = 0 would with probability 1 sample through all the beads without making any commitment to a jar. The prior, however, prevents these individuals from obtaining τ̂ = 0.
Figure 6.4: HCB model results. Histograms of some of the parameters after removing the burn-in period.
Figure 6.5: HCB model results. From left to right, the plots show a histogram of the estimated sampling costs ĉ_s (×10^−3) and a histogram of the estimated decision noise parameters τ̂.
Figure 6.6: Action probabilities for different individuals under the HCB model. The task is Task 3 in Table 1.1. These plots help us interpret the values of τ̂ that we obtain in the data set. [Four panels, "Action Probabilities, Task 3", for individuals with c_s = 0 and τ = 0.0011267, 10, 30, and 59.664: the probabilities of Commit A, Commit B, and Sample against bead number (1 to 10).]
Figure 6.7: In the left panel, we see the ω̂^2 as well as their 95% confidence intervals. In the right panel, we have plotted a histogram of the estimated ω̂^2. [Left panel: "Estimated ω^2 for Each Individual", ω̂^2 against individual.]
The individuals have estimated sampling costs that are virtually zero. The decision noise parameters range from 0.0011 to 60. So what does this imply about the individuals’ behavior? We are curious to know if τ = 60 is so large that the agent gives a weight of 1/3 to each action and therefore makes arbitrary choices. We recall that an agent chooses stochastically between the actions, and the probability of each action is weighted based on the action values. Let us take a closer look at the expected behavior of agents with c_s = 0 for different values of τ. In Figure 6.6, we have plotted the action probabilities in Task 3 for individuals with τ = 0.0011, τ = 10, τ = 30, and τ = 60. As we can see, the agent with τ = 0.0011 deterministically chooses to sample another bead in each state. As τ increases, the agent becomes less prone to sampling. Instead, the probability of choosing a jar increases. The probability of choosing the more likely jar is always larger than the probability of choosing the less likely jar, as we know theoretically from (3.10). For an individual with τ = 60, the probability of sampling after the first bead is shown (in Task 3) is about 0.35, which means that the probability of committing is about 0.65. This probability remains reasonably stable throughout the task. However, the weight given to each of the jars fluctuates as new beads are drawn. After the first bead is drawn, the probability of choosing the jar that is more likely to be the source is about 0.45. These plots give us some understanding of the τ̂ that we obtain in the data set.
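The qualitative effect of τ on the action probabilities can be reproduced with a small softmax sketch. Two things here are assumptions rather than the thesis's definitions: the rule is taken to have the form P(a) proportional to exp(Q_a / τ), and the action values in `values` are invented for illustration, so the printed numbers are not those of Figure 6.6.

```python
import math


def action_probabilities(action_values, tau):
    """Softmax over action values with noise parameter tau (assumed form: P(a) ~ exp(Q_a / tau))."""
    best = max(action_values.values())
    # Subtracting the largest value avoids overflow without changing the probabilities.
    weights = {a: math.exp((q - best) / tau) for a, q in action_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical action values in one state: sampling is slightly better than committing to A.
values = {"sample": 95.0, "commit A": 90.0, "commit B": 10.0}
for tau in (0.0011, 10.0, 30.0, 60.0, 1e6):
    probs = action_probabilities(values, tau)
    print(tau, {a: round(p, 3) for a, p in probs.items()})
```

Under this assumed rule, a very small τ puts essentially all probability on the best action, a very large τ approaches a weight of 1/3 on each action, and for any finite τ the action with the higher value keeps the higher probability.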
Figure 6.8: Residual plots. [Two panels, "Residual Plot for Individual 1" and "Residual Plot for Individual 21": residuals (x̃_l − x_l) against l, from 0 to 50.]
Next, we look at the maximum likelihood estimates we obtain for θ* = (ω_1, ..., ω_Nind) in the HCB-P and HCBU-P models. In the left panel of Figure 6.7, we have plotted each ω̂_i^2 and its estimated 95% confidence interval based on the observed Fisher information. All ω̂ lie within the range (3.24, 7.19). As we recall from (3.12), we have assumed that the noise term is additive and normally distributed. Therefore, if the assumption is reasonable, we should see that the residuals are normally distributed with a mean equal to zero and a variance equal to ω_i for each individual i. In Figure 6.8, we show residual plots for two arbitrary individuals. The residual plots for the rest of the individuals are reasonably similar to these. We can see that the normality assumption is not very appropriate. The grey vertical lines show where a new trial starts. For the first two trials, we can see that the residuals increase in absolute value as more beads are drawn. This trend is caused by our choice of using the log-odds function for mapping ρ to the real line; when ρ departs from 0.5 towards 0 or 1, the difference between log(ρ/(1−ρ)) and log((ρ−∆)/(1−(ρ−∆))) increases. As a result, if an individual for example underestimates the probability ρ by ∆ = 0.05, the effect of this error is much larger (on the transformed scale) if ρ = 0.90 than if ρ = 0.5. Therefore, an error in the probability estimate that is made when the true probability is close to 0 or 1 has a very large impact on ω̂.
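The growth of the transformed error near the boundaries can be checked numerically. The sketch below uses the natural logarithm and plain probabilities in (0, 1); any further scaling applied in the fitted model is not reproduced here, so only the relative sizes are meaningful.

```python
import math


def log_odds(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def transformed_error(rho, delta):
    """Size of an underestimate of rho by delta, measured on the log-odds scale."""
    return log_odds(rho) - log_odds(rho - delta)


for rho in (0.5, 0.9, 0.99):
    print(rho, transformed_error(rho, 0.05))
```

The same underestimate of 0.05 produces a larger gap on the log-odds scale the closer ρ is to 1, which is why residuals computed on the transformed scale grow as the evidence for one jar accumulates.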
Let us compare the results obtained under the four models. Comparing the hyperparameter estimates in Table 6.1, we note the following. First of all, we see that when moving from the HCB model to the HCB-P model, µ̂2 decreases. The same is seen when moving from the HCBU to the HCBU-P model. In other words, the expected value of the noise parameter decreases. This may be because the decisions made by the participants are more reasonable when we consider how likely they think each jar is. Secondly, we note that σ̂2 is reasonably large in all the models, especially for the HCB and HCBU models. This suggests that there is large variation within the population. Thirdly, it appears that the sampling costs actually play a role in the HCBU-P model, in contrast to the other three models.
Chapter 7
Conclusion and Future Work
In this report, we have introduced four alternative models that attempt to explain the decision-making process of individuals performing the beads task. These models use the IBA as a reference and add noise to its behavior. In two of the models, the HCB and the HCBU models, there is one parameter (c_s) that captures the eagerness to make a commitment to a jar and one parameter (τ) that accounts for the randomness in choice for each individual. The two other models, the HCBUP and the HCBP models, additionally try to take into consideration that the individuals have noise when they calculate the posterior probability of each of the jars being the source of the displayed beads. This is done by introducing yet another individual-specific parameter (ω) in the model.
In the Background section, we discussed several reasoning biases that may explain why delusions are formed and maintained. Now, let us take a look at how the models we have introduced in this report may account for - or fail to account for - two of these reasoning biases. We recall that the JTC bias is the tendency to make hasty decisions. The models we have introduced may explain why some people make hasty decisions by the fact that they have a large c_s. However, a large decision noise parameter may also boost hasty decisions, since the probability of choosing to sample another bead decreases as τ increases. When a large decision noise parameter is the cause of hasty decisions, the models predict that there should be more wrong commitments. Here, when we say "wrong commitment", we refer to choosing a jar that is not the most likely source of the drawn beads. We suggest modifying these models such that large decision noise has less impact on hastiness. Perhaps there should be separate procedures for determining 1) whether to decide on a jar or keep sampling and 2) which of the jars to pick. The LA account, in contrast to the JTC account, says that the hastiness relative to a control group may vanish under ambiguity. This is because multiple alternatives may seem plausible enough, causing none of them to be chosen and instead more evidence to be sought. Our models, however, do not by design account for the LA bias. If a person has a greater tendency to make a fast decision than another person in the two-jar case (either as a result of a larger c_s or a larger τ), the models predict that this person will have a greater tendency to make a faster decision than the other person in the four-jar case as well.
When applying our model of the probability estimates, p(ρ̃|ω), to the real data set, we discovered that it had some flaws. First of all, it had some undesirable effects when the true probability for either of the jars was close to 0 or 1. Secondly, it appeared that we should include a parameter that takes bias into account; it seemed like the participants were conservative in their probability ratings. Therefore, we suggest making a new model which includes a bias parameter that systematically pulls the probability estimate towards 0.5.
In our report, we did not compare the fit of the four models. We suggest performing model selection to figure out which of the four models is better at accounting for the decision-making process of human beings. Lastly, we suggest fitting the models to a healthy control group and a group of deluded patients in order to see if there are any significant group differences.
Appendix A
Results
In this appendix, we present some plots of the results of applying the HCBU, HCBP, and HCBUP models to the real data set. These plots correspond to those presented for the HCB model in the main text.
A.1 HCBU Model
Figure A.1: HCBU model results. [Panels: trace plots of µ1, σ1, c_s1, µ2, σ2, and τ1 against iteration t (×10^5), each marking the burn-in period.]
Figure A.2: HCBU model results. Histograms of some of the parameters after removing the burn-in period.
Figure A.3: HCBU model results. [Panels: histograms of the individuals’ ĉ_s and τ̂.]
A.2 HCBP Model
Figure A.4: HCBP model results. [Panels: trace plots of µ1, σ1, c_s1, µ2, σ2, and τ1 against iteration t (×10^5), each marking the burn-in period.]
Figure A.5: HCBP model results. Histograms of some of the parameters after removing the burn-in period.
Figure A.6: HCBP model results. [Panels: histograms of the individuals’ τ̂ and ĉ_s.]
A.3 HCBUP Model
Figure A.7: HCBUP model results. [Panels: trace plots of µ1, σ1, c_s1, µ2, σ2, and τ1 against iteration t (×10^5), each marking the burn-in period.]
Figure A.8: HCBUP model results. Histograms of some of the parameters after removing the burn-in period.
Figure A.9: HCBUP model results. [Panels: histograms of the individuals’ ĉ_s and τ̂.]
Appendix B
Acronyms
BADE Bias Against Disconfirmatory Evidence
BR Bayes’ Rule
DTD Draws to Decision
IBA Ideal Bayesian Agent
JTC Jumping to Conclusions
LA Liberal Acceptance
MCMC Markov chain Monte Carlo
M-H Metropolis-Hastings
MLE Maximum Likelihood Estimation
SZ Schizophrenia
HCB Hierarchical Costed Bayesian
HCBU Hierarchical Costed Bayesian, Unaware
HCBP Hierarchical Costed Bayesian with Probability estimates
HCBUP Hierarchical Costed Bayesian, Unaware with Probability estimates
Bibliography
[1] American Psychiatric Association et al. Diagnostic and Statistical Manual of Mental Disorders (DSM-5®). American Psychiatric Pub, 2013.
[2] Siddhartha Chib and Edward Greenberg. Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327–335, 1995.
[3] Peter Congdon. Bayesian Statistical Modelling, volume 704. John Wiley & Sons, 2007.
[4] Mary Kathryn Cowles and Bradley P Carlin. Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association, 91(434):883–904, 1996.
[5] Robert Dudley, Peter Taylor, Sophie Wickham, and Paul Hutton. Psychosis, delusions and the “jumping to conclusions” reasoning bias: a systematic review and meta-analysis. Schizophrenia Bulletin, page sbv150, 2015.
[6] Julian J Faraway. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press, 2005.
[7] Niall Galbraith. Aberrant Beliefs and Reasoning. Psychology Press, 2014.
[8] D. Gamerman and H.F. Lopes. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2006.
[9] Philippa Garety, Daniel Freeman, Suzanne Jolley, Kerry Ross, Helen Waller, and Graham Dunn. Jumping to conclusions: the psychology of delusional reasoning. Advances in Psychiatric Treatment, 17(5):332–339, 2011.
[10] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian Data Analysis, volume 2. Taylor & Francis, 2014.
[11] G.H. Givens and J.A. Hoeting. Computational Statistics. Wiley Series in Computational Statistics. Wiley, 2012.
[12] Steffen Moritz, Todd S Woodward, and Martin Lambert. Under what circumstances do patients with schizophrenia jump to conclusions? A liberal acceptance account. British Journal of Clinical Psychology, 46(2):127–137, 2007.
[13] Michael Moutoussis, Richard P Bentall, Wael El-Deredy, and Peter Dayan. Bayesian modelling of jumping-to-conclusions bias in delusional patients. Cognitive Neuropsychiatry, 16(5):422–447, 2011.
[14] V.K. Rohatgi and A.K.M.E. Saleh. An Introduction to Probability and Statistics. Wiley Series in Probability and Statistics. Wiley, 2011.
[15] William J Speechley and Jennifer C Whitman. The contribution of hypersalience to the "jumping to conclusions" bias associated with delusions in schizophrenia. Journal of Psychiatry & Neuroscience: JPN, 35(1):7, 2010.
[16] Todd S Woodward, Lisa Buchy, Steffen Moritz, and Mario Liotti. A bias against disconfirmatory evidence is associated with delusion proneness in a nonclinical sample. Schizophrenia Bulletin, 33(4):1023–1028, 2007.
