Optimal Group Decision: A Matter of Confidence Calibration∗
Sébastien Massoni1† and Nicolas Roux2
1 QuBE - School of Economics and Finance & ACE, Queensland University of Technology
2 Max Planck Institute for Research on Collective Goods
Abstract
The failure of groups to make optimal decisions is an important topic in
the human sciences. Recently, this issue has been studied in perceptual settings
where the problem could be reduced to the question of an optimal integration
of multiple signals. The main result of these studies asserts that inefficiencies in
group decisions increase with the heterogeneity of a group's members in terms of performance. We assume that the ability of agents to appropriately combine their
private information depends on how well they evaluate the relative precision of
their information. We run two perceptual experiments with dyadic interaction
and confidence elicitation. The results show that predicting the performance of
a group is improved by taking into account its members’ confidence in their own
precision. Doing so allows us to revisit previous results on the relation between
the performance of a group and the heterogeneity of its members’ abilities.
Keywords: group decision, social interactions, confidence, metacognition, signal detection theory, psychophysics.
∗ Research for this paper was supported by a grant from Paris School of Economics. The authors
are grateful to Scott Brown, Steve Fleming, Chris Frith, Nicolas Jacquemet, Asher Koriat, Rani
Moran, Jean-Marc Tallon, Jean-Christophe Vergnaud and anonymous reviewers for their insightful
comments.
† Corresponding author: School of Economics and Finance, Queensland University of Technology,
2 George St., 4000, Brisbane, QLD, Australia. Email: [email protected].
“For difficult problems, it is good to have 10 experts in the same room, but
it is far better to have 10 experts in the same head.” John Von Neumann.
1 Introduction
Groups are often trusted to make decisions because they gather the information of
their members. However, the extent to which groups are able to combine information
coming from different sources remains an open question.
One robust finding is that group decisions are governed by a confidence
heuristic (Yaniv, 1997; Price and Stone, 2004): group discussions are dominated by
the more confident members in a group.1 In general, as shown by Koriat (2012b), the
response of the more confident member is more likely to be correct than that of the
less confident member. However, the ability to form reliable confidence judgments varies considerably
between individuals (Pallier et al., 2002), and these individual differences are weakly
correlated with the accuracy of judgments (Williams et al., 2013). Thus differences in
confidence within a group are not always linked to differences in accuracy. This raises
the question of whether differences in confidence calibration negatively
affect the benefits of group decisions over individual ones.
Recently, the ability of groups to combine individual information has been carried
into the field of psychophysics by studying group decision making in signal detection
experiments (Sorkin et al., 2001; Bahrami et al., 2010, 2012b,a, 2013; Koriat, 2012b,
2015; Bang et al., 2014; Mahmoodi et al., 2015; Pescetelli et al., 2016). Signal
detection experiments consist of asking subjects to make a binary decision based
on noisy perceptive information (Faisal et al., 2008). A typical signal detection
experiment in groups consists of asking subjects individually and then as a group,
to determine which one of two visual stimuli was the strongest.
As a long-standing literature has shown, people's decisions in these kinds of situations can be modeled as those of a Bayesian decision maker equipped with
1 Even if Trouche et al. (2014) recently showed that arguments, more than confidence, explain
group performance in a cognitive task.
some (perceptive) information structure (Green and Swets, 1966; Ma et al., 2006).
The modeling of perceptive information makes it possible to determine what would
be the performance of a group if it perfectly combined its members’ information
(Sorkin et al., 2001).
Comparing actual group performance to this benchmark, Bahrami et al. (2010,
2012b) find that groups whose members are heterogeneous in terms of perceptive
abilities (that is one of them has a higher probability of finding the strongest stimulus) tend to perform poorly. The failure of heterogeneous groups suggests that the
precision of individual information is not well accounted for in the way it is aggregated. Bahrami et al. (2010) propose, as explained by Ernst (2010), that groups
use a suboptimal decision rule that overweights the recommendations of the least
able member. The resulting efficiency loss increases with the difference in group
members’ information reliabilities. This model therefore postulates the existence of
a systematic failure in the way private information is aggregated.
We propose an alternative explanation that relates these results to biases in subjects’ confidence calibrations. We assume that subjects’ beliefs about their perceptive abilities are initially not related to their actual perceptive abilities. If everyone
holds similar beliefs about their performance, the most able subjects tend to be relatively underconfident compared to the least able subjects (Kruger and Dunning,
1999). Consequently, a group will put too much weight on the least able member, so
that heterogeneity induces greater collective inefficiencies. Therefore our explanation of collective inefficiencies does not rely on the incapacity of humans to aggregate
heterogeneous information. We rather see these inefficiencies as an inevitable consequence of the
lack of information subjects have access to. Thus we conjecture that a dyad inefficiently combines information only when its evaluation of the relative reliability of
its information sources is biased. If one member performs better at the task than
the other and both are aware of this fact, then the dyad should not exhibit any
inefficiency. On the contrary, if both believe they are equally good, then the opinion
of the least able member will be followed too often, resulting in a lower success rate
for the dyad.
This paper contributes to the literature on collective decisions in perceptual
tasks. Models of perceptual collective decisions are based on the idea that members communicate their confidence to reach a decision. This confidence carries
information about both the strength and the reliability of individual observations.
Sorkin et al. (2001) propose a first model predicting that groups will perform better
than their members. Unfortunately their model cannot account for certain types of
failures of collective decisions. Bahrami et al. (2010) propose a new model in
which groups of two can fail to reach the optimal decision. They show that if the
performances of the two members are too different, dyads will fail to outperform
the best member. Their prediction is confirmed in different experiments and settings. The robustness of the failures of groups with heterogeneous performances has
been established for multiple treatments: feedback (Bahrami et al., 2010, 2012a);
social interactions (Bahrami et al., 2012a; Bang et al., 2014); and confidence heuristic (Bang et al., 2014). Mahmoodi et al. (2015) explain this sub-optimality by an
equality bias in which individuals assign equal weights to other members' decisions
regardless of their abilities. Alternatively Koriat (2012a, 2015) shows that a different mechanism can also impair collective decisions. If groups base their decisions
on the confidence heuristic when facing consensually wrong items, the group will perform
worse than its members (Koriat, 2012a). In the case of social interactions and confidence sharing, this effect is amplified: performance decreases
while confidence increases for inaccurate answers (Koriat, 2015). All these studies
of group decisions are based on the idea that confidence is shared between members.
But one important point is missing: how reliable is the information that confidence conveys? This aspect has recently been studied by Pescetelli et al. (2016) in terms
of metacognitive sensitivity. They show that the average discrimination ability of
members is positively correlated with the group performance. In this paper we are
interested in another type of metacognitive ability: the calibration of confidence.
This aspect has never been integrated into the group decision process; here we show
that having similar levels of calibration, measured in terms of overconfidence, is crucial to reaching collective benefits. This study of calibration heterogeneity allows us
to take two steps toward understanding group decision processes. First
we offer an alternative explanation of the performance heterogeneity effect with the
inclusion of confidence miscalibration. Then we emphasize the important aspect of
the quality of confidence in the confidence sharing mechanism. A dyad will combine
information inefficiently when its evaluation of the relative reliability of its information sources is biased. Thus the worse members are calibrated, the worse group
performance will be.
Since our aim is to show that inefficiencies are related to subjects’ beliefs about
their perceptive abilities, we conduct two signal detection experiments with group
decisions in which we elicit subjects’ confidence at each trial. Confidence is defined
as the subject’s belief that he chose the right stimulus. In the first study we provided some feedback after each trial on the accuracy of the decision. Furthermore
to stay in line with the design of Bahrami et al. (2010) we added an artificial heterogeneity of performances with a time difference of the stimuli presentation. To
avoid the potential effect of the feedback and to increase the ecological validity of
our design we suppressed these two aspects of the setting in a second study. Both
experiments provide results confirming that the failure of group decisions can be
linked to a difference of miscalibration. It offers an alternative explanation of the
performance heterogeneity effect: heterogeneous dyads are more inefficient because
their evaluation is more biased. Subjects indeed have no experience with the task,
so we suspect that, on average, they treat opinions equally whether they are truly
equally reliable or not (in line with the equality bias found in Mahmoodi et al.,
2015). Those dyads in which subjects do not hold equally reliable information then
exhibit more inefficiencies.
2 Methods

2.1 Theory
It is well established in Signal Detection Theory that the perceptive information
subjects receive can be fruitfully modeled as a Bayesian information structure (Green
and Swets, 1966). A subject draws at each trial a perceptive signal x ∈ (−∞, +∞).
Signals are drawn from a normal distribution whose mean, θ, depends on the actual
contrast difference between the two stimuli. The variance σi2 captures the precision
of subject i’s perception. We will often talk about a subject’s precision parameter
as the inverse of his variance, τi = 1/σi2 . As we use only one level of difficulty, the
contrast difference θ can take one of two values, µ (right stimulus stronger) and −µ
(left stimulus stronger), which are equally likely to occur. Subjects are asked to
say whether θ is positive or negative. Individually, their decision rule is to follow
the sign of the signal they receive. The probability that subject i makes the right
decision corresponds to the probability that he receives a positive signal conditional
on θ = µ. It is thus given by Φ(µ√τi), where Φ(·) is the standard normal cumulative
distribution function.
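The individual part of this setup is easy to simulate. The following sketch (our illustration, not the authors' code; function names are ours) computes the predicted success rate Φ(µ√τi) and inverts it to recover a precision from an observed success rate, an inversion that is well defined here because a single difficulty level is used:

```python
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution

def individual_success_rate(mu, tau):
    # P(correct) = Phi(mu * sqrt(tau)): probability of a positive signal given theta = mu
    return _N.cdf(mu * tau ** 0.5)

def precision_from_success_rate(mu, s):
    # Invert Phi(mu * sqrt(tau)) = s to recover the precision parameter tau
    z = _N.inv_cdf(s)
    return (z / mu) ** 2
```

Higher precision yields a higher success rate, and the two functions are exact inverses of each other for a fixed contrast difference µ.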
As a group, if subjects perfectly combine their private information, they make
decisions based on the sign of the sum of their signals weighted by the precision
of their information: xG = τ1 x1 + τ2 x2 . Note that this statistic is positive if and
only if the likelihood of (x1 , x2 ) given µ is greater than the likelihood given −µ.
The probability of an (optimal) group making a correct choice is thus given by the
probability that xG is positive conditional on µ. xG is normally distributed with
mean (τ1 + τ2)µ and precision 1/(τ1 + τ2 + 2ρ√(τ1τ2)), where ρ is the correlation
coefficient between the group members’ signals.2 It follows that the ideal group’s
2 Other models do not take this correlation into account, but as our data show a correlation
between signals, we have integrated it into our model and the alternative ones. It is computed as
follows: we observe the probability that the two group members make the right decision simultaneously. According to the model, this probability should be equal to the probability of both individual
signals being positive given µ = 1. Conditional on µ = 1, the distribution of the pair of signals is a
bivariate normal with mean (1, 1) and covariance matrix ((σ1², ρσ1σ2), (ρσ1σ2, σ2²)). ρ takes the value that
equalizes the theoretical and observed probabilities of group members being simultaneously right.
information precision is given by
τG∗ = (τ1 + τ2)² / (τ1 + τ2 + 2ρ√(τ1τ2)).   (1)
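Equation (1) can be written as a one-line function; a minimal sketch with illustrative names:

```python
from math import sqrt

def optimal_group_precision(tau1, tau2, rho=0.0):
    # Equation (1): precision of a dyad that weights each signal by its true precision
    return (tau1 + tau2) ** 2 / (tau1 + tau2 + 2 * rho * sqrt(tau1 * tau2))
```

For ρ = 0 this reduces to τ1 + τ2, the usual additivity of independent precisions, while a positive correlation between signals lowers the group's effective precision.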
According to the findings of Bahrami et al. (2010), the comparison of the observed group’s success rate and its ideal success rate, τG∗ , reveals that inefficiencies
are positively related to the heterogeneity of the group with respect to the precisions
of its members. They propose an alternative model based on a suboptimal decision rule (hereafter the suboptimal model). Groups make decisions
based on the sign of the statistic xG^sub = √τ1 x1 + √τ2 x2. Weighting each member's
signal by the square root of its precision instead of the precision induces groups to
follow the individual with the lowest precision too often (Ernst, 2010). The group
precision as a function of its members’ precisions is
τG^sub = (√τ1 + √τ2)² / (2(1 + ρ))   (2)
which corresponds to the optimal case when τ1 = τ2 but gets lower as τ1 and τ2
become different, i.e. in case of group heterogeneity.
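The heterogeneity penalty implied by equation (2) is easy to check numerically; another illustrative sketch (ours, not the authors' code):

```python
from math import sqrt

def suboptimal_group_precision(tau1, tau2, rho=0.0):
    # Equation (2): the dyad weights each signal by the square root of its precision,
    # which overweights the less precise member
    return (sqrt(tau1) + sqrt(tau2)) ** 2 / (2 * (1 + rho))
```

For example, with τ1 = 1, τ2 = 4 and ρ = 0, equation (2) gives 4.5 whereas equation (1) gives 5; with τ1 = τ2 the two coincide.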
We propose an alternative model, the belief model, in which the failures of heterogeneous groups come from a lack of information about their members' precisions.
Assume that subject i holds some beliefs about his precision parameter whose expectation is noted τi,e . We make the approximation that a group decision rule is
based on the expected values of precision parameters of its members, i.e. a group
chooses right when τ1,e x1 + τ2,e x2 is positive.3 In other words, the group behaves
Note that Sorkin et al. (2001) also propose to take this correlation into account in their model. An estimation of this correlation gives a mean value of 0.2573 (sd 0.0191, min -0.1861, max
0.6166) for Study 1 and a mean value of 0.3290 (sd 0.201, min -0.0791, max 0.6356) for Study 2.
3 The optimal decision rule is a much more complex object to study, as it must take into account
the full distribution of beliefs about subjects' precisions. In our model we make the approximation that a group
decision rule is based on the expected values of its members' precision parameters, i.e. a group
chooses right when τ1,e x1 + τ2,e x2 is positive. Denoting subject i's beliefs about τi by Γi, the optimal
as if it were sure that these expected precisions are true. Given that xi , i = 1, 2, is
actually distributed with precision τi, the group statistic τ1,e x1 + τ2,e x2 is normally distributed
with mean (τ1,e + τ2,e)µ and precision τ1τ2/(τ1,e²τ2 + τ2,e²τ1 + 2ρτ1,e τ2,e √(τ1τ2)). It
follows that the precision of such a belief-based group is given by
τG^bel = τ1τ2(τ1,e + τ2,e)² / (τ1,e²τ2 + τ2,e²τ1 + 2ρ√(τ1τ2) τ1,e τ2,e)   (3)
If subjects’ expectations are well calibrated, i.e. τi,e = τi for i = 1, 2, the belief-based
group reaches its optimal precision level, i.e. τG^bel = τG∗. Actually, subjects may have
biased expectations and still reach their optimal collective precision: group decisions
are optimal as long as τ1,e /τ2,e = τ1 /τ2 . These expectations could be estimated by
eliciting the level of confidence of subjects in their choices. This belief model predicts
that the heterogeneity of a group with respect to the precision parameters of its
members has no direct impact on the group performance. However, since subjects do
not initially know their precision, subjects’ expected precisions should be (at least)
initially unrelated to actual precisions. This assumption is supported by empirical
evidence showing that miscalibration (over/under-confidence) is a common trait of
many individuals (Lichtenstein et al., 1982; Wallsten and Budescu, 1983; Harvey,
1997; Kruger and Dunning, 1999).
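Equation (3) nests both previous models, which the following sketch (ours, with illustrative names) makes easy to verify:

```python
from math import sqrt

def belief_group_precision(tau1, tau2, tau1_e, tau2_e, rho=0.0):
    # Equation (3): the dyad weights signals by the *expected* precisions tau_i,e
    num = tau1 * tau2 * (tau1_e + tau2_e) ** 2
    den = (tau1_e ** 2 * tau2 + tau2_e ** 2 * tau1
           + 2 * rho * sqrt(tau1 * tau2) * tau1_e * tau2_e)
    return num / den
```

Any beliefs proportional to the true precisions (τ1,e/τ2,e = τ1/τ2) reproduce equation (1), and setting τi,e = √τi reproduces equation (2), so the suboptimal model corresponds to one particular pattern of miscalibrated beliefs.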
Overall these three models offer different predictions about the accuracy of group
decisions. The optimal model states that group decisions will perfectly integrate
member’s information. The suboptimal model assumes that the weaker member
will be followed too often and thus the more members are heterogeneous in terms
of performance, the worse the group performance will be. The belief model assumes
that group decision incorporates expectations of its members and thus the more
heterogeneous in terms of confidence calibration the members are, the worse the
decision rule depends on whether the group posterior about θ = µ
P(x1, x2) = ∫0^∞ ∫0^∞ P(x1, x2; τ1, τ2) dΓ(τ1; α1, β1) dΓ(τ2; α2, β2)
is higher or lower than .5.
group performance will be.
2.2 Experiments
In order to show that collective inefficiency is related to subjects’ beliefs about their
perceptive abilities, we conducted two perceptual experiments with individual and
group decisions in which we elicit subjects’ confidence at each trial. Confidence was
defined as the subject’s belief about his own decision accuracy. In the first study we
provided feedback after each trial on the accuracy of the decision. We also added
an artificial heterogeneity of performances by presenting the stimuli to the members
of the dyad during different time intervals. To avoid the potential effects of the
feedback and the artificial heterogeneity we suppressed feedback and time difference
of the setting in a second study.
2.2.1 Stimuli
The experiment was conducted on MATLAB using Psychophysics Toolbox version 3
(Brainard, 1997) and was displayed on computers with a resolution of 1920×1080. We
use a 2AFC density task, which is known to be convenient to fit SDT models (Bogacz
et al., 2006; Nieder and Dehaene, 2009) and was previously used for metacognitive
experiments (e.g. Hollard et al., 2016; Fleming et al., 2016). The stimuli consisted
of two circles with a certain number of dots in each circle (see Figure 1B). All dots
were of the same size (diameter 0.4◦ ) and the average distance between dots was kept
constant. They appeared at random positions inside two outline circles (diameter
5.1◦ ) first displayed with fixation crosses at their centers at eccentricities of ±8.9◦ .
One of the two circles contained 50 dots while the other contained 52 dots. The
position of the circle containing the greater number of dots was randomly assigned
to be on the left or right in each trial. Note that using only one level of difficulty
prevents us from estimating psychophysical functions as Bahrami et al. (2010) did, but
it allows us to express all our analyses directly in terms of success rates.
Figure 1: Design of the experiment. (A) Time line of a group trial: facing the fixation
cross, subjects initiate the trial; the stimuli then appear for a certain amount of time (different for the two members in Study 1, identical in Study 2); subjects make an individual choice and
report their confidence; after reaching an agreement by free discussion, they report a
group choice and confidence; finally, individuals give their own
individual choice and confidence anew. At the end, feedback on the accuracy of the answer is given
in Study 1 but not in Study 2. (B) Example of our density stimuli. Subjects observe the
stimuli for a short period of time and then have to decide which circle contains more
dots. (C) Principle of the Matching Probability mechanism used to elicit the level of confidence.
2.2.2 Experiment phase and payments
The experimental design comprised two kinds of trials, individual and group. In
the individual trials, each trial proceeded as follows: first, two outline circles were
displayed with fixation crosses at their center. The subject initiated the trial by
pressing the “space” key on a standard computer keyboard. The dot stimuli then
appeared for a certain amount of time, and subjects were asked to respond whether
the left or right circle contained more dots by pressing the “f” or “j” keys, respectively. There was no time limit for responding. After responding, subjects were
asked to indicate their level of confidence in their choice on a gauge from 0% to
100% with steps of 5%, using the up and down keys, again with no time limit on the
response. In the group trials the time line was different (Figure 1A): participants
first conducted the trial in isolation. After their decisions participants had to discuss
with the other member of their dyad to reach an agreement. The two members were
seated in front of two nearby computers but could not see each other's
screen. During the discussion they had to reveal their choice and confidence
but they were free to discuss decisions as long as they wanted with their partner.4
Then they reported the group’s choice and confidence (which have to be identical
between members), and finally they gave again their individual choice and confidence in order to check if they agreed with the group choice. Overall a group trial
is composed of the following sequence: stimuli; individual decision and confidence;
free discussion; group decision and confidence; individual decision and confidence.
Subjects’ payment comprised e 5 for participation and they accumulated points
according to the accuracy of their stated confidence. The incentive mechanism used
was the Matching Probability mechanism (Figure 1C) by which they won or lost
points on each trial. This rule has several advantages: it is a proper scoring rule, i.e. subjects have an incentive to reveal their true beliefs; it is robust to risk aversion; and
it provides a very good fit with signal detection models (Massoni et al. (2014)
4 Note that the individual responses were revealed by the subjects themselves. Thus they could
lie about their own answers. But as all decisions were incentivized, subjects had no interest in
such misreporting.
and references therein). The principle of this proper scoring rule is to exchange a
risky bet based on their subjective probability, p, with a bet based on an objective
probability, l1, which is a number uniformly drawn from [1, 100]. If p ≥ l1, a subject
wins 1 point if the judgment is correct, and loses 1 point if incorrect. If p < l1, a new
number, l2, is drawn from the same uniform distribution. If l2 ≤ l1, 1 point is won; if l2 > l1,
1 point is lost. Prior to the experiment we explained in detail to subjects how their
stated confidence would determine their payment and showed how different rating
strategies would lead to different earning schemes. The accumulated points were
paid at the exchange rate of 1 point = €0.05.
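A single trial of this elicitation can be sketched as follows (our simplified simulation, with confidence and the lottery draws rescaled from [1, 100] to [0, 1]; names are illustrative):

```python
import random

def matching_probability_points(p, correct, rng=None):
    # p: stated confidence in [0, 1]; correct: whether the judgment was right
    rng = rng or random.Random()
    l1 = rng.random()          # objective lottery number (rescaled from [1, 100])
    if p >= l1:
        # bet on the judgment itself
        return 1 if correct else -1
    # otherwise bet on the objective lottery: win iff a second draw falls at or below l1
    l2 = rng.random()
    return 1 if l2 <= l1 else -1
```

Truthful reporting maximizes expected points: the subject keeps the bet on his own judgment exactly when its subjective probability of success exceeds the lottery's objective one.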
- Study 1
In Study 1, participants performed 50 individual trials first and then 150 group
trials. At the end of both types of trials, feedback on accuracy was given. In order
to replicate the design of Bahrami et al. (2010) we created an artificial heterogeneity
between subjects of a dyad by changing the presentation time of the stimuli. One
subject saw the stimulus for 550 ms while the other saw it for 850 ms. This
time difference applied to both individual and group trials. Subjects were unaware of
such differences.
- Study 2
In Study 2, participants performed 50 individual trials first and then 100 group
trials. As we believed that the feedback given in Study 1 might affect the subsequent
decisions and the level of confidence, we did not include any feedback on the accuracy
of decisions in this study. Furthermore the difference in presentation time reduces the
ecological validity of the previous design as well as the generalization of the results
to realistic situations. Thus in this study the presentation time was identical for
both subjects and was set to 700 ms.
2.2.3 Participants
Study 1 was conducted in May and June 2012 in the Laboratory of Experimental
Economics in Paris (LEEP) of the University of Paris 1. Participants were recruited
by standard procedure from the LEEP's database. 35 dyads, i.e. 70 healthy subjects (36
men; age 18-45 years, mean age 22.1 years, most of them enrolled as undergraduate
students at the University of Paris) participated in the experiment for pay. We lost
the data of two groups due to a problem with a computer during the experiment.
The experiment ran for approximately 120 minutes and subjects were paid at the
end of the experiment on average €17 (min 14, max 21).
Study 2 was conducted at the same place in December 2013 on 27 dyads, i.e. 54
healthy subjects (23 men; age 19-72 years, mean age 38.1 years). The experiment ran
for approximately 90 minutes and subjects were paid at the end of the experiment
on average €13 (min 10, max 15).
2.3 Data analysis
This section will present the main variables of interest as well as the statistical methods used. Note that we use a double approach when testing our main hypotheses: a
frequentist analysis based on a null-hypothesis-significance-testing approach and a
Bayesian analysis based on posterior estimations and Bayes Factor computations.
Both approaches are complementary and allow us to check the robustness of our results.
This dual approach will be applied to the test of the mediation
effect and to the model comparison.5
Variables of interest. All the variables of interest are aggregated over trials at
the individual level. The success rate is defined as the mean accuracy. Each model's
predictions are expressed in terms of the average success rate per individual.6 The calibration
5 See Mulder and Wagenmakers (Eds.) for recent advances in Bayesian statistics applied to
psychological research.
6 Note that we have presented the models in terms of precision parameters τ. We will present our
results using direct success rates, s, which is equivalent since the precision parameter completely
determines success rate. Indeed our experiment features only one level of dots’ difference between the
two stimuli so that a subject’s success rate fully characterizes his perceptive information precision.
is computed as the bias (or reliability-in-the-large – Yates, 1982) rather than the
calibration index (reliability-in-the-small) to emphasize the importance of overconfidence
in group decisions. It is computed as (1/n) Σi=1..n (pi − xi), with n the
number of trials, pi the level of confidence and xi an indicator variable of accuracy. We measure the collective inefficiency as the difference between the optimal
group success rate and the actual one. This inefficiency will be explained by the
performance heterogeneity (the difference between the two members’ success rates)
and the calibration heterogeneity (the difference of over/under-confidence between
members). We measure the collective benefit in its strong form as the difference
between the actual group success rate and the success rate of the best member of
the dyad; and in its weak form as the difference between the actual group success
rate and the mean success rate of the two members of the dyad.
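Assuming trial-level data (confidence as a proportion, accuracy as 0/1), the variables above can be computed as in this sketch (our illustrative code; we read "difference" as an absolute difference):

```python
def bias(confidences, accuracies):
    # Calibration as reliability-in-the-large: mean(confidence) - mean(accuracy)
    n = len(confidences)
    return sum(p - x for p, x in zip(confidences, accuracies)) / n

def dyad_measures(s1, s2, s_group, s_optimal, bias1, bias2):
    # s1, s2: members' success rates; s_group: actual group rate; s_optimal: ideal group rate
    return {
        "collective_inefficiency": s_optimal - s_group,
        "performance_heterogeneity": abs(s1 - s2),
        "calibration_heterogeneity": abs(bias1 - bias2),
        "strong_collective_benefit": s_group - max(s1, s2),
        "weak_collective_benefit": s_group - (s1 + s2) / 2,
    }
```

A positive bias indicates overconfidence, a negative one underconfidence, and the two heterogeneity measures are the regressors used in the mediation analysis below.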
Mediation analysis. The relationships between different measures are assessed
by performing Ordinary Least Squares (OLS) regressions with robust estimations
of standard errors. As we aggregated variables at the individual level, estimations
are based on one observation per group. In order to disentangle the effects of performance heterogeneity and calibration heterogeneity on collective inefficiency we
perform a mediation analysis using calibration heterogeneity as a mediator. This
methodology provides a new explanation of how performance heterogeneity affects
group performance via miscalibration of confidence (see MacKinnon et al., 2007,
for a review of mediation analysis methods). To assess this mediation we follow
the requirements of Baron and Kenny (1986): to confirm the effect of performance
heterogeneity on collective inefficiency; to confirm that performance heterogeneity is
linked to calibration heterogeneity; to test that calibration heterogeneity is a significant predictor of collective inefficiency while controlling for performance heterogeneity. We observe a full mediation if inclusion of the mediator drops the relationship
between performance heterogeneity and group inefficiency to zero. In addition to
this standard procedure we test the statistical power of the mediation effect by bootstrapping with case re-sampling (5000 replications) and a percentile estimate of the
confidence interval (Preacher and Hayes, 2004). To complement this analysis we use
a Bayesian approach to mediation (Yuan and MacKinnon, 2009) and a default
Bayesian hypothesis test (Nuijten et al., 2014) computed with the BayesMed package in R. This test is based on a combination of evidence from both paths of the
mediation and uses Jeffreys–Zellner–Siow priors for testing correlations.
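A bare-bones version of the product-of-paths estimate and the case-resampling bootstrap can be sketched with ordinary least squares (our simplified illustration; the paper's robust standard errors and the Bayesian test via BayesMed are not reproduced here):

```python
import numpy as np

def _coefs(y, *regressors):
    # OLS with an intercept; returns the slope coefficients in the order given
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

def indirect_effect(inefficiency, perf_het, calib_het):
    # a: perf. heterogeneity -> mediator; b: mediator -> inefficiency, controlling for perf. het.
    (a,) = _coefs(calib_het, perf_het)
    _, b = _coefs(inefficiency, perf_het, calib_het)
    return a * b

def bootstrap_ci(inefficiency, perf_het, calib_het, reps=5000, seed=0):
    # percentile confidence interval by case re-sampling (Preacher and Hayes, 2004)
    rng = np.random.default_rng(seed)
    n = len(inefficiency)
    draws = []
    for _ in range(reps):
        i = rng.integers(0, n, n)  # resample dyads with replacement
        draws.append(indirect_effect(inefficiency[i], perf_het[i], calib_het[i]))
    return np.percentile(draws, [2.5, 97.5])
```

Full mediation corresponds to the direct coefficient on performance heterogeneity vanishing once the mediator is included, with the bootstrap interval on the indirect effect excluding zero.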
Model comparison. In order to assess the fit of the three models of group decisions, we first perform independent beta regressions of the group success rate as
predicted by our three models (Equations 1, 2 and 3) on the actual group success rate. These beta regressions are well suited for bounded dependent variables
(the group success rate being a proportion with values in [0; 1]). We then measure their goodness-of-fit. As we have non-nested models with identical complexity
(one parameter) the different information criteria give the same information. We
compute the difference between the BIC (Bayesian Information Criterion, Schwarz,
1978) of each model and use Raftery (1995) grades of evidence to compare models.
The difference in BIC provides support for one model against the other with the
following levels: none for a negative difference; weak for values between 0 and 2;
positive for values between 2 and 6; strong for values between 6 and 10; very strong
for values higher than 10. Again we check the robustness of the results by performing
Bayesian estimations. After a Bayesian beta regression, we compute the LOO information criterion (leave-one-out cross-validation – Vehtari et al., 2016b). This index
allows us to estimate pointwise out-of-sample prediction accuracy from the Bayesian
estimations using the log-likelihood evaluated at the posterior simulations of the parameter. The difference between expected log pointwise predictive density (elpd) of
each model and its standard error define the difference in their expected predictive
accuracy. This Bayesian analysis is computed using the R packages rstanarm (Stan
Development Team, 2016) and loo (Vehtari et al., 2016a).
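The frequentist part of this comparison reduces to BIC differences and Raftery's grades, as in this sketch (illustrative code; we assign the boundary values 0, 2, 6, and 10 to the lower grade by assumption):

```python
from math import log

def bic(loglik, k, n):
    # Bayesian Information Criterion: -2 log L + k log n (Schwarz, 1978)
    return -2.0 * loglik + k * log(n)

def raftery_grade(delta_bic):
    # Raftery (1995) grades of evidence for a BIC difference between two models
    if delta_bic < 0:
        return "none"
    if delta_bic <= 2:
        return "weak"
    if delta_bic <= 6:
        return "positive"
    if delta_bic <= 10:
        return "strong"
    return "very strong"
```

Since the three competing models have identical complexity (one parameter), the BIC ordering here coincides with the ordering by log-likelihood alone.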
3 Results
After presenting the main descriptive statistics of the two studies, we will perform
three kinds of analyses: showing how the effect of performance heterogeneity on
optimal collective inefficiencies can be mediated by the difference in confidence
calibration; showing that this difference also mitigates the actual benefits of the
collective decision; and comparing the different models in their ability to predict the
actual group success rate.
3.1 Descriptive statistics
Table 1 summarizes the main statistics about the different decisions in both
studies.

Study 1                    Individual Decision 1    Collective Decision    Individual Decision 2
Success Rate               66.3% (.06)              69.9% (.06)            71.3% (.07)
Confidence                 67.6% (.07)              71.1% (.09)            73.1% (.11)
Calibration                +1.4% (.08)              +1.2% (.08)            +1.9% (.09)
ROC Area                   0.6041 (.07)             0.6545 (.07)           0.6572 (.08)
Agreement in choice        62.6% (.05)              .                      85.7% (.08)
Agreement in confidence    Higher: 37.6% (.13)      .                      Higher: 29.4% (.18)
                           Identical: 21.1% (.08)                          Identical: 33.8% (.18)
                           Lower: 41.3% (.14)                              Lower: 36.8% (.21)

Study 2                    Individual Decision 1    Collective Decision    Individual Decision 2
Success Rate               65.6% (.06)              66.6% (.06)            66.5% (.05)
Confidence                 67.9% (.07)              68.8% (.07)            68.9% (.07)
Calibration                +2.3% (.08)              +2.2% (.08)            +2.4% (.08)
ROC Area                   0.5951 (.07)             0.6234 (.07)           0.6271 (.06)
Agreement in choice        65.6% (.09)              .                      87.4% (.08)
Agreement in confidence    Higher: 36.1% (.14)      .                      Higher: 26.5% (.13)
                           Identical: 23.7% (.12)                          Identical: 42.0% (.20)
                           Lower: 36.1% (.14)                              Lower: 26.5% (.18)
Table 1: Mean levels, with standard deviations in parentheses, of accuracy, confidence, metacognitive abilities (calibration and area under the ROC curve) and rates of agreement during the different stages of Study 1 and Study 2. Agreement in choice means that subjects facing the same trial report the same circle in their individual decisions. Agreement in confidence means that subjects report higher, identical or lower levels of confidence in their individual decisions.
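The calibration figures in Table 1 are consistent with calibration measured as mean confidence minus mean accuracy, positive values indicating overconfidence; this reading of the table is ours, and the trial data below are made up for illustration:

```python
def calibration(confidences, corrects):
    """Calibration as mean confidence minus mean accuracy:
    positive values indicate overconfidence, negative underconfidence."""
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(corrects) / len(corrects)
    return mean_confidence - accuracy

# Hypothetical trials: confidence reports in [0.5, 1] and correctness flags
confidence_reports = [0.75, 0.5, 1.0, 0.75]
outcomes = [1, 0, 1, 1]  # 1 = correct choice
print(calibration(confidence_reports, outcomes))  # → 0.0 (well calibrated)
```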
- Study 1
We observe that the mean success rate of the group is statistically higher than the individual one (difference +3.6, t(65) = 5.3882, P < 0.001). Interestingly we also note that the individual success rate measured after the group decision is statistically higher than the group one (difference +1.4, t(65) = 3.2121, P = 0.001). We find the same statistically significant increase in the average level of confidence between individual decisions and group decisions. These two evolutions lead to a non-significant difference in calibration between the three decisions. Lastly we note that some disagreements about the correct choice persist after the group decision (the second individual decision differs in 14.3% of the trials). This experiment was run with different viewing times between members. The difference in time exposure (650ms vs. 850ms) leads to a statistically significant difference in mean accuracy between group members (difference +4.0, t(65) = 3.1972, P = 0.001).
- Study 2
Without feedback the pattern of results is quite different. We did not observe an increase in the success rate between the group and the first individual decision (difference +1.0, t(53) = 1.4737, P = 0.073), nor between the group decision and the second individual decision (difference -0.1, t(53) = -0.4034, P = 0.344). This result is in line with Bahrami et al. (2012a), but contrary to this study we did not observe a significant improvement with practice (difference +0.86 in the first half of the trials, +1.04 in the second half, t-test of these differences: t(53) = 0.0017, P = 0.872). Thus not providing feedback clearly decreases the integration of information between group members. Note that Study 2 also removes the difference in time exposure. However, as the difference in mean performance between group members is statistically non-significant between Study 1 and Study 2 (difference -0.2, t(58) = -0.2075, P = 0.836), we can assume that the difference in group performance is driven by the removal of feedback and not of the time exposure difference.
3.2 Collective inefficiency relative to optimal performance
This first analysis links the failure of a group to make an optimal decision to a difference in calibration rather than to a heterogeneity of performances. With the help of a mediation analysis, using the observed difference in confidence calibration as a mediator, we show that the effect of performance heterogeneity on group inefficiency is in fact mediated by calibration heterogeneity.
Figure 2: Mediation analysis. These path diagrams show the regression coefficients of the relationship between performance heterogeneity and collective inefficiency as mediated by calibration heterogeneity. The regression coefficient between performance heterogeneity and collective inefficiency, controlling for calibration heterogeneity, is in parentheses. ** and * denote statistical significance at the 1% and 5% levels. (A) Values for Study 1. (B) Values for Study 2.
- Study 1
We start by checking that the effect of performance heterogeneity on group inefficiency holds in our data. We therefore regress collective inefficiency on performance heterogeneity at the group level. We find a positive and statistically significant coefficient (0.3205, t(32) = 1.72, P = 0.047, Fig 2A path 1) that confirms the effect of performance heterogeneity on collective losses. We confirm that calibration heterogeneity is a potential mediator by regressing it on performance heterogeneity. The effect is positive and significant (0.7540, t(32) = 3.38, P = 0.001, Fig 2A path 2). To investigate whether the effect on group performance is in fact due to a problem of confidence calibration rather than to the difference in performances,
we add calibration heterogeneity to the regression as an explanatory variable. In this case the effect of performance heterogeneity is no longer significant (0.1330, t(32) = 0.63, P = 0.535, Fig 2A path 1 in parentheses), while the calibration heterogeneity has a positive and statistically significant impact (0.2485, t(32) = 1.70, P = 0.049, Fig 2A path 3). Thus the requirements to identify a mediation effect are fulfilled
by our results: a significant effect of performance heterogeneity on collective losses, a significant effect of performance heterogeneity on calibration heterogeneity, and a significant effect of calibration heterogeneity on collective losses, while the effect of performance heterogeneity is lower in absolute value than the original effect. Furthermore the non-significant effect of performance heterogeneity in the last regression indicates a full mediation through calibration heterogeneity. We confirm the significance of this mediation effect by bootstrapping (replications = 5000, percentile estimate of 99% confidence interval = [0.0069; 0.9239]). As a robustness check we perform a Bayesian mediation analysis. The test of the mediated effect gives only anecdotal evidence in favor of the mediation (Bayes Factor = 1.28 with one-sided path tests). But it confirms the completeness of the mediation, with a lack of direct effect of performance heterogeneity when controlling for calibration heterogeneity (Bayes Factor = 0.32). These results provide only partial support for our idea that the effect of performance heterogeneity on failures of group decision is (completely) mediated by the difference in confidence calibration.
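The regression logic used here follows the classic three-step mediation scheme (total effect, effect on the mediator, then both regressors together); a minimal sketch on synthetic group-level data, with variable names that are ours and not the paper's dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 33  # number of groups, as in Study 1

# Synthetic data with a built-in mediation structure:
# performance heterogeneity drives calibration heterogeneity,
# which in turn drives collective inefficiency.
perf_het = rng.normal(size=n)
calib_het = 0.8 * perf_het + rng.normal(scale=0.5, size=n)
inefficiency = 0.5 * calib_het + rng.normal(scale=0.5, size=n)

def ols_slopes(y, *xs):
    """OLS coefficients (intercept first) of y on regressors xs."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

path1 = ols_slopes(inefficiency, perf_het)[1]   # total effect (path 1)
path2 = ols_slopes(calib_het, perf_het)[1]      # effect on the mediator (path 2)
direct, path3 = ols_slopes(inefficiency, perf_het, calib_het)[1:]  # with mediator
print(f"total={path1:.2f}, mediator={path2:.2f}, direct={direct:.2f}, b-path={path3:.2f}")
# Under full mediation, the direct effect shrinks toward zero
```

In the paper the significance of the indirect effect is then assessed by bootstrapping the product of the two mediation paths, which this sketch omits.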
- Study 2
We reproduce the same analysis on the Study 2 data, in which subjects received no feedback and had no time difference in stimuli presentation. We again find the relationship between collective losses and performance heterogeneity. The OLS regression of collective inefficiency on performance heterogeneity gives a significant positive coefficient of 0.4630 (t(26) = 3.46, P = 0.001, Fig 2B path 1). Calibration heterogeneity is also a potential mediator, with a positive and significant effect of performance heterogeneity on calibration heterogeneity (0.8981, t(26) = 6.74, P < 0.001, Fig 2B path 2). When we add calibration heterogeneity to the first regression, the previous effect is mitigated by this variable: performance heterogeneity is no longer significant (0.0225, t(26) = 0.10, P = 0.918, Fig 2B path 1 in parentheses) while calibration heterogeneity has a positive effect (0.4905, t(26) = 2.44, P = 0.012, Fig 2B path 3). This result confirms that the calibration effect is a full mediation, given the non-significance of the performance effect in the mediation model. However a bootstrap test leads only to a weakly significant effect (7%) for the indirect mediation effect (replications = 5000, percentile estimate of 93% confidence interval = [0.0186; 0.6958]). This weaker mediation is confirmed by a Bayesian test of the mediated effect, which provides moderate evidence against the mediation (Bayes Factor = 0.26 with one-sided path tests). This shows that our results only partially hold even in a framework without feedback and artificial heterogeneity.
- Discussion
This analysis presents our main result: the well-known effect of performance heterogeneity on optimal collective loss is in fact fully mediated by the effect of confidence calibration heterogeneity. Unfortunately this result does not seem robust to a Bayesian analysis. The anecdotal evidence in favor of the mediation effect in Study 1 and the moderate evidence against it in Study 2 show that the frequentist identification of the mediation is debatable. This divergence of conclusions between the two statistical methods can be linked to potential weaknesses in both approaches (e.g. the tendency of frequentist analysis to overestimate the evidence against the null hypothesis in small samples; the use of non-informative default priors in our Bayesian analysis). As defining the best statistical approach is beyond the scope of this paper, we should therefore interpret the mediation effect of the heterogeneity of confidence calibration cautiously. Still we argue that the importance of confidence calibration for reaching optimal group decisions is quite intuitive: worse than performance heterogeneity itself is sharing unreliable information about one's own performance with the other members. This result is in line with recent papers studying the importance of confidence in group decisions (Bang et al., 2014; Mahmoodi et al., 2015; Pescetelli et al., 2016). While these papers show how crucial confidence can be in a social context, they focus not explicitly on confidence calibration but rather on the level of confidence or on confidence sensitivity.
We measure heterogeneity effects on the whole set of trials. But it can be interesting to see how these measures change over time and whether we can identify some trends in miscalibration and performance heterogeneity. Our design, with the removal of feedback from Study 1 to Study 2, should allow us to test different predictions derived from the belief and suboptimal models in the case of learning effects. Indeed feedback on performances may affect differently the evolution of performance heterogeneity and calibration heterogeneity. It is thus possible to examine how group performances are affected by these changes in the models' parameters. Unfortunately we are unable to provide such tests, as we did not observe any learning effects on heterogeneities in either study. Comparing results for the first and second halves of trials, the trends in Study 1 point toward an increase of both performance heterogeneity and calibration heterogeneity, while in Study 2 there is a trend toward a decrease of performance heterogeneity and no change in the difference of miscalibration. None of these changes is statistically significant at the 10% level. We are thus unable to discriminate between heterogeneity effects in a dynamic approach.
3.3 Actual collective benefits
We have shown that calibration heterogeneity prevents a group from making an optimal decision. However some groups perform better than their members do individually. It is thus interesting to see whether the difference in calibration also reduces the actual benefits of a collective decision.
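The two benefit measures used in this subsection reduce to simple differences of success rates; a minimal sketch with hypothetical rates for a dyad:

```python
def strict_benefit(group_rate, member_rates):
    """Strict collective benefit: group success rate minus
    the best member's success rate."""
    return group_rate - max(member_rates)

def weak_benefit(group_rate, member_rates):
    """Weak collective benefit: group success rate minus
    the members' average success rate."""
    return group_rate - sum(member_rates) / len(member_rates)

# Hypothetical dyad: members at 64% and 68%, group at 70%
print(round(strict_benefit(0.70, [0.64, 0.68]), 3))  # → 0.02
print(round(weak_benefit(0.70, [0.64, 0.68]), 3))    # → 0.04
```

The strict measure is the harder test: the group must beat its best member, not merely the average of its members.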
- Study 1
The actual strict collective benefit, measured as the difference between the group success rate and the best success rate of its members, is very low and not statistically different from 0 (mean 0.0063, sd 0.008, t(32) = 0.7525, P = 0.457). But we can link this lack of benefits to the difference in calibration between group members. Indeed, regressing the actual group benefits on calibration heterogeneity shows a significant negative effect of calibration heterogeneity on the benefits (-0.3736, t(32) = -2.77, P = 0.009, Figure 3A). If we use a weaker measure of collective benefit (defined as the difference between the group success rate and the average success rate of its members) we find a statistically significant benefit (mean 0.0362, sd 0.008, t(32) = 4.8241, P < 0.001). The difference in calibration between members has a negative effect that is no longer significant (-0.1953, t(32) = -1.49, P = 0.146).

Figure 3: Observations and linear fit with 95% confidence interval of the calibration heterogeneity on the actual strong collective benefits. (A) Plot for Study 1. (B) Plot for Study 2.
It can also be interesting to see whether the collective benefits are impacted by the sensitivity of confidence. Pescetelli et al. (2016) recently showed that the mean level of discrimination of members (measured by the area under the ROC curve) is highly correlated with the group collective benefits (defined as the difference between group performance and the average performance of its members). We tried to replicate this effect on our data using the area under the ROC curve (Galvin et al., 2003) or the difference between meta-d' and d' (Maniscalco and Lau, 2012)^7 on strong and weak collective benefits. None of the four regressions identifies a significant effect of the mean discrimination at the 5% level.^8
- Study 2
Even if the actual strict benefits of a group decision are negative without feedback (mean -0.0235, sd 0.007, t(26) = -3.1203, P = 0.004), we again find the negative effect of calibration heterogeneity on the collective benefits with an OLS estimation (-0.4103, t(26) = -4.13, P < 0.001, Figure 3B). In terms of weak collective benefits, even though these are no longer negative, we find only weakly significant benefits (mean 0.0103, sd 0.005, t(26) = 1.9406, P = 0.063). Heterogeneity of calibration between members has a negative effect but it is still not significant (-0.0731, t(26) = -0.84, P = 0.411).

^7 This measure uses Signal Detection Theory tools to measure the discrimination of confidence without performance confounds.
^8 We only obtained a weakly significant effect for one of the regressions: the average difference between meta-d' and d' has an effect on the strong collective benefits (coefficient 0.0205, se 0.011, t(30) = 1.81, P = 0.080).
We also checked for a mean discrimination effect on the collective benefits. Again, neither the mean area under the ROC curve nor the mean difference between meta-d' and d' has a significant effect on strong or weak collective benefits.
- Discussion
These results shed light on the importance of feedback for collective benefits. The design changes between Study 1 and Study 2 involve the removal of feedback and of the difference in time exposure. But as performances are not affected (in terms of individual success as well as the difference of success between group members – see above), we can assume that the differences in observed behaviors are linked to the presence (or absence) of feedback. While there is evidence that communication of confidence estimates between members is necessary to obtain collective benefits (Bahrami et al., 2010, 2012b; Pescetelli et al., 2016), our results tend to show that providing feedback about the accuracy of the group's decisions also matters. This result is in line with the literature linking feedback and group performance in more general settings (e.g. Tindale, 1989).
Interestingly we also observed a surprising behavior in Study 1 only. Beyond the comparison of the group with its best member, we found that the second individual decision is better than the collective one (difference +1.4, t(65) = 3.2121, P = 0.001), while the two decisions are similar in Study 2 (difference -0.1, t(53) = -0.4034, P = 0.344). Furthermore this effect is weakly correlated with the agreement rate of the members' first individual decisions (r = 0.2157, P = 0.082), meaning that the more often the first individual decisions were similar, the more often the second individual decision outperformed the collective one.
These results provide evidence in favor of the importance of feedback for reaching collective benefits: efficiently using others' opinions to make a group decision requires that members have access to the true reliability of past group decisions. When they do, we can even observe an increase in decision quality in an individual decision taken after the collective deliberation.
3.4 Model comparison
Previous results provide evidence for the importance of calibration heterogeneity for group decisions. With the help of signal detection models we can also make predictions according to three different models (optimal, suboptimal, belief) and compare the ability of each model to predict the actual collective success rate. The group decision processes for each model are the following: optimal, s*_G, where the group perfectly incorporates all the information of its members; suboptimal, s^sub_G, where the group cannot handle performance heterogeneity and follows the member with the lowest precision too often; and belief, s^bel_G, where the group takes into account the individual confidence of its members to evaluate the quality of their information.
- Study 1
We test the different models in terms of their power to explain the behavioral data. All three models make a prediction about the group success rate, so we can compare the explanatory power of the three models on the observed success rate s_G. We perform separate beta regressions of s_G on s*_G, s^sub_G and s^bel_G and obtain the respective marginal effects: 0.8509, 0.8785, and 0.9317 (all statistically significant with p-values < 0.001). We can then compare the three models according to their goodness-of-fit in terms of Raftery's grades of evidence. The differences in BIC provide positive support for the belief model against the suboptimal one (+3.59) and the optimal one (+4.71).^9 This comparison in favor of the belief model is robust to a Bayesian analysis. The Bayesian beta regressions provide the same range of coefficients and the LOO information criterion confirms the better fit of the belief model (LOOIC_bel = -114.6, LOOIC_sub = -110.8, LOOIC* = -109.5; elpd difference = +2.6, se 1.9, against the optimal model; = +1.9, se 1.7, against the suboptimal one).

^9 Note that other indexes of goodness-of-fit are also in favor of the belief model (log-likelihood, pseudo-R2).
- Study 2
If we study the explanatory powers of the three models without feedback we obtain a different result. Beta regressions of the actual group success rate on each predicted success rate provide the following marginal effects: 0.9124 for the optimal model, 0.9587 for the suboptimal and 0.7995 for the belief (all statistically significant with p-values < 0.0001). But this time the BIC comparison according to Raftery's grades of evidence provides positive support for the suboptimal model against the belief one (difference 4.438) and weak support for the belief model against the optimal model (difference 1.564). This better fit of the suboptimal model is confirmed by the Bayesian analysis. The LOO information criterion after the Bayesian beta regressions is in favor of the suboptimal model (LOOIC_bel = -99.1, LOOIC_sub = -105.2, LOOIC* = -99.3; elpd difference = -3.1, se 4.2, against the suboptimal model; = +0.1, se 5.6, against the optimal one).
- Discussion
Model comparison reveals that the belief model outperforms the suboptimal one in Study 1, while the reverse happens in Study 2. This difference in explanatory power may result from the suppression of feedback from one study to the other. But another factor may be involved in this change. The belief model is designed to capture group inefficiencies with uncorrelated signals between members, while the suboptimal model should capture inefficiencies in the presence of such correlations. Estimation of this correlation shows that mean values are weakly significantly greater in Study 2 than in Study 1 (0.2572 vs 0.3290, unpaired t-test: t(58) = 1.4115, P = 0.0817). In order to test the potential effect of correlation levels, we perform the model comparison in both studies on two sub-samples of groups according to their correlation estimates (lower than 0.33: 12 groups in Study 1 and 12 in Study 2; greater than or equal to 0.33: 21 groups in Study 1 and 15 in Study 2). This distinction implies only a weak change in Study 1, with the belief model still outperforming the suboptimal one (difference of BIC = 1.423, weak support, elpd difference = +1.0, se 1.0, for the low-correlation groups; difference of BIC = 1.445, weak support, elpd difference = +0.1, se 2.5, for the high-correlation groups). But in Study 2 these sub-sample estimations lead to a change in the previous results. In highly correlated groups the suboptimal model still provides a better fit than the belief one (difference of BIC = 8.366, strong support, elpd difference = +4.4, se 3.0). But in groups exhibiting a low correlation between members' signals, the belief model now outperforms the suboptimal one (difference of BIC = 2.153, positive support, elpd difference = 0.6, se 2.5).
This result tends to support the idea that the belief model better captures group inefficiencies in the absence of correlated signals between members, while the suboptimal model provides a better fit for highly correlated signals.
4 Conclusion
We propose a model of optimal group decision making that incorporates group members' calibration. Our results show that the failure of group decisions can be linked to a difference in confidence calibration between the group's members. Our result contributes to a growing literature on the difficulties groups have in making optimal decisions in perceptual tasks. A group will make better decisions if its members have similar levels of performance (Bahrami et al., 2010; Mahmoodi et al., 2015), if the decision is reached after social interactions (Bahrami et al., 2012a; Bang et al., 2014), and if members have a high metacognitive sensitivity (Pescetelli et al., 2016). We add a new explanation of this observed behavior: the importance of having similar levels of confidence calibration between members.
Our evidence relies on the measurement of the subjects' confidence, which may not be innocuous. There is another way of testing our hypothesis, without measuring a subject's confidence. Our hypothesis states that heterogeneous groups under-perform because their members have very different calibrations. A way of testing this hypothesis is to manipulate the subjects' calibration, which can be done by providing more or less information about their performances. For example, one could regularly provide some groups with summary statistics of their members' performances, while other groups are kept in the dark. If our hypothesis applies, then we should expect the relationship between group performance heterogeneity and collective inefficiencies to vanish as subjects receive more information about their relative performances.
In this study we are only interested in problems of calibration, but we can expect that discrimination abilities also play a role in group decisions.^10 Discrimination refers to the ability of an agent to distinguish between two signals of different values. Limited discrimination abilities suggest that perceptive signals are filtered before being accessible to the individual, so that the assumption that signals are perfectly observed should be weakened. An optimal model of group decision based on confidence should incorporate these two aspects of metacognitive abilities. Koriat (2012b) proposes a model in which the group decision is led by the most confident member. This optimal model, computed on our data, leads to an underestimation of group performances. It would therefore be worth investigating the relation between group members' metacognitive abilities (calibration and discrimination) and group performance. Indeed the idea that group decisions are generally dominated by the most confident members implies that the failures of a group are driven by the correspondence between confidence and accuracy (Bang et al., 2014). Two kinds of relationship should be taken into account. First, individuals have access to a significant level of discrimination and thus can discriminate between correct and incorrect choices (Koriat, 2012a). Thus when two members of a group disagree, the choice of the most confident member is the most likely to be correct. But at the same time, at an individual level we did not observe comparable levels of calibration, which indicates a very low correlation between mean confidence and mean accuracy across individuals. Given these two observations, our prediction is straightforward: for a homogeneous group, i.e. one whose members have similar mean confidence and accuracy, group performance should be higher than individual performance if the group follows the most confident member. For a heterogeneous group, the most confident member may not have the best probability of being correct. Overall this question could be linked to the problem of meta-metacognition, i.e. the lack of knowledge an individual has about the quality of his own confidence in terms of discrimination or calibration.

^10 Note that in the present study we were unable to reproduce the effect identified by Pescetelli et al. (2016) of the average discrimination level on collective benefits. But this lack of effect may be due to our design. Indeed we did not calibrate the task according to individual abilities. This difference of performances between members affects the reliability of the area under the ROC Curve and may also affect the quality of meta-d'-d' estimations.
Finally, one question about our theoretical and empirical results is whether they hold in larger groups. We have referred to dyads as groups in this paper, but there are controversial debates about the nature of dyadic interactions with respect to group exchanges. Moreland (2010) argues that dyads cannot be considered groups, as their mechanisms are simpler while the social interactions felt inside them are stronger. But Williams (2010) shows that in most cases dyads are groups of two and are led by the same principles and theories that explain larger group processes. The robustness of optimal decision models in larger groups has been studied by Migdal et al. (2012). They find that voting can be as efficient as more complex rules in groups of size n with members of similar performances. Denkiewicz et al. (2013) confirm empirically that groups of three follow the majority rule, but even if the group decision is better than the members' average decisions, it cannot outperform the best member. Considering the importance of confidence in collective decisions, we can expect that in a more complex situation (more information to aggregate) confidence should still play a major role in assessing the reliability of these multiple sources of information and in reaching an optimal collective decision. This idea is confirmed by Juni and Eckstein (2015), who show that groups of three in a changing environment abandon the majority rule to follow the member with the highest confidence. Thus it is expected that calibration of confidence should still matter for reaching optimal decisions in larger groups.
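The claim that majority voting in triads beats the average member but not necessarily the best one is easy to illustrate; the accuracies below are hypothetical, and the independence of members' errors is an assumption of the sketch, not a result of the paper:

```python
from itertools import product

def majority_accuracy(accuracies):
    """Probability that a simple majority of independent members
    is correct, given each member's individual accuracy."""
    n = len(accuracies)
    total = 0.0
    for outcome in product([0, 1], repeat=n):  # 1 = member correct
        if sum(outcome) > n / 2:
            prob = 1.0
            for correct, p in zip(outcome, accuracies):
                prob *= p if correct else 1 - p
            total += prob
    return total

# Hypothetical triad with heterogeneous performances
p = [0.60, 0.65, 0.85]
group = majority_accuracy(p)
print(round(group, 4))     # → 0.7895
print(group > sum(p) / 3)  # majority beats the members' average: True
print(group > max(p))      # ...but not the best member: False
```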
References
Bahrami, B., Didino, D., Frith, C.D., Butterworth, B. and Rees, G.
(2013). ‘Collective Enumeration’, Journal of Experimental Psychology: Human
Perception and Performance 39(2), 338–347.
Bahrami, B., Olsen, K., Bang, D., Roepstorff, A., Rees, G. and Frith,
C. D. (2012a). ‘Together, slowly but surely: the role of social interaction and
feedback on the build-up of benefit in collective decision-making’, Journal of Experimental Psychology: Human Perception and Performance 38(1), 3–8.
Bahrami, B., Olsen, K., Bang, D., Roepstorff, A., Rees, G. and Frith,
C. D. (2012b). ‘What failure in collective decision-making tells us about metacognition’, Philosophical Transactions of the Royal Society B: Biological Sciences
367(1594), 1350–1365.
Bahrami, B., Olsen, K., Latham, P. E., Roepstorff, A., Rees, G. and Frith,
C. D. (2010). ‘Optimally interacting minds’, Science 329(5995), 1081–1085.
Bang, D., Fusaroli, R., Tylen, K., Olsen, K., Latham, P. E., Lau, J.Y.F.,
Roepstorff, A., Rees, G., Frith, C.D. and Bahrami, B. (2014). ‘Does
interaction matter? Testing whether a confidence heuristic can replace interaction
in collective decision-making’, Consciousness and Cognition 26, 13–23.
Baron, R.M. and Kenny, D.A. (1986). 'The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations', Journal of Personality and Social Psychology 51(6), 1173–1182.
Bogacz, R., Brown, E., Moehlis, J., Holmes, P. and Cohen, J. D. (2006). 'The physics of optimal decision making: a formal analysis of models of performance in two-alternative forced choice tasks', Psychological Review 113(4), 700–765.
Brainard, D. (1997). ‘The psychophysics toolbox’, Spatial Vision 10, 433–436.
Denkiewicz, M., Raczaszek-Leonardi, J., Migdal, P. and Plewczynski, D. (2013), Information-Sharing in Three Interacting Minds Solving a Simple Perceptual Task, in M. Knauff, M. Pauen, N. Sebanz and I. Wachsmuth, eds, 'Proceedings of the 35th Annual Conference of the Cognitive Science Society', Austin, TX: Cognitive Science Society, pp. 2172–2176.
Ernst, M. O. (2010). ‘Decisions made better’, Science 329(5995), 1022–1023.
Faisal, A. A., Selen, L. P. J. and Wolpert, D. M. (2008). ‘Noise in the nervous
system’, Nature Reviews Neuroscience 9, 292–303.
Fleming, S.M., Massoni, S., Gajdos, T. and Vergnaud, J.-C. (2016).
‘Metacognition about the past and future: quantifying common and distinct influences on prospective and retrospective judgments of self-performance’, Neuroscience of Consciousness 2016(1), niw018.
Galvin, S. J., Podd, J. V., Drga, V. and Whitmore, J. (2003). ‘Type 2 tasks
in the theory of signal detectability: discrimination between correct and incorrect
decisions’, Psychonomic Bulletin and Review 10, 843–876.
Green, D. M. and Swets, J. A. (1966), Signal Detection Theory and Psychophysics, John Wiley and Sons.
Harvey, N. (1997). 'Confidence in judgment', Trends in Cognitive Sciences 1(2), 78–82.
Hollard, G., Massoni, S. and Vergnaud, J.-C. (2016). ‘In Search of Good
Probability Assessors: An Experimental Comparison of Elicitation Rules for Confidence Judgments.’, Theory and Decision 80(3), 363–387.
Juni, M.Z. and Eckstein, M.P. (2015). 'Flexible human collective wisdom', Journal of Experimental Psychology: Human Perception and Performance 41(6), 1588–1611.
Koriat, A. (2012a). ‘The self-consistency model of subjective confidence’, Psychological Review 119, 80–113.
Koriat, A. (2012b). ‘When are two heads better than one and why?’, Science
336(6079), 360–362.
Koriat, A. (2015). ‘When two heads are better than one and when they can
be worse: The amplification hypothesis’, Journal of Experimental Psychology:
General 144(5), 934–950.
Kruger, J. and Dunning, D. (1999). ‘Unskilled and unaware of it: how difficulties
in recognizing one’s own incompetence lead to inflated self-assessments’, Journal
of Personality and Social Psychology 77(6), 1121–1134.
Lichtenstein, S., Fischhoff, B. and Phillips, L. (1982), Calibration of probabilities: the state of the art to 1980, in D. Kahneman, P. Slovic and A. Tversky, eds, 'Judgment under Uncertainty: Heuristics and biases', Cambridge, UK: Cambridge University Press, pp. 306–334.
MacKinnon, D.P., Fairchild, A.J. and Fritz, M.S. (2007). 'Mediation Analysis', Annual Review of Psychology 58, 593–614.
Mahmoodi, A., Bang, D., Olsen, K., Zhao, Y.A., Shi, Z., Broberg, K., Safavi, S., Han, S., Ahmadabadi, M.N., Frith, C.D., Roepstorff, A., Rees, G. and Bahrami, B. (2015). 'Equality bias impairs collective decision-making across cultures', Proceedings of the National Academy of Sciences 112(12), 3835–3840.
Maniscalco, B. and Lau, H. (2012). ‘A signal detection theoretic approach for
estimating metacognitive sensitivity from confidence ratings’, Consciousness and
Cognition 21(1), 422–430.
Massoni, S., Gajdos, T. and Vergnaud, J.-C. (2014). ‘Confidence Measurement in the Light of Signal Detection Theory’, Frontiers in Psychology 5, 1455.
Ma, W. J., Beck, J. M., Latham, P. E. and Pouget, A. (2006). ‘Bayesian
inference with probabilistic population codes’, Nature Neuroscience 9, 1432–1438.
Migdal, P., Raczaszek-Leonardi, J., Denkiewicz, M. and Plewczynski,
D. (2012). ‘Information-sharing and aggregation models for interacting minds’,
Journal of Mathematical Psychology 56(6), 417–426.
Moreland, R.L. (2010). ‘Are Dyads Really Groups?’, Small Group Research 41(2), 251–267.
Mulder, J. and Wagenmakers, E.-J., eds (2016). ‘Special Issue - Bayes Factors for Testing Hypotheses in Psychological Research: Practical Relevance and New Developments’, Journal of Mathematical Psychology 72, 1–220.
Nieder, A. and Dehaene, S. (2009). ‘Representation of number in the brain’,
Annual Review of Neuroscience 32(1), 185–208.
Nuijten, M.B., Wetzels, R.W., Matzke, D., Dolan, C.V. and Wagenmakers, E.-J. (2014). ‘A default Bayesian hypothesis test for mediation’, Behavior
Research Methods 47(1), 85–97.
Pallier, G., Wilkinson, R., Danthiir, V., Kleitman, S., Knezevic, G.,
Stankov, L. and Roberts, R.D. (2002). ‘The role of individual differences in
the accuracy of confidence judgments’, Journal of General Psychology 129(3), 257–
299.
Pescetelli, N., Rees, G. and Bahrami, B. (2016). ‘The perceptual and social components of metacognition’, Journal of Experimental Psychology: General 145(8), 949–965.
Preacher, K.J. and Hayes, A.F. (2004). ‘SPSS and SAS procedures for estimating indirect effects in simple mediation models’, Behavior Research Methods, Instruments, & Computers 36(6), 717–731.
Price, P.C. and Stone, E.R. (2004). ‘Intuitive evaluation of likelihood judgment
producers: Evidence for a confidence heuristic’, Journal of Behavioral Decision
Making 17, 39–57.
Raftery, A.E. (1995). ‘Bayesian model selection in social research’, Sociological
Methodology 25, 111–163.
Schwarz, G. E. (1978). ‘Estimating the dimension of a model’, Annals of Statistics
6(2), 461–464.
Sorkin, R. D., Hays, C. J. and West, R. (2001). ‘Signal detection analysis of
group decision making’, Psychological Review 108, 183–203.
Stan Development Team. (2016), ‘rstanarm: Bayesian applied regression modeling via Stan.’. R package version 2.13.1.
URL: http://mc-stan.org/
Tindale, R.S. (1989). ‘Group vs individual information processing: The effects
of outcome feedback on decision making’, Organizational Behavior and Human
Decision Processes 44(3), 454–473.
Trouche, E., Sander, E. and Mercier, H. (2014). ‘Arguments, More Than
Confidence, Explain the Good Performance of Reasoning Groups’, Journal of
Experimental Psychology: General 143(5), 1958–1971.
Vehtari, A., Gelman, A. and Gabry, J. (2016a), ‘loo: Efficient leave-one-out
cross-validation and WAIC for Bayesian models’. R package version 1.1.0.
URL: https://CRAN.R-project.org/package=loo
Vehtari, A., Gelman, A. and Gabry, J. (2016b). ‘Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC’, Statistics and Computing
pp. 1–20.
Wallsten, T. S. and Budescu, D. V. (1983). ‘Encoding subjective probabilities:
a psychological and psychometric review’, Management Science 29(2), 151–173.
Williams, E.F., Dunning, D. and Kruger, J. (2013). ‘The Hobgoblin of Consistency: Algorithmic Judgment Strategies Underlie Inflated Self-Assessments of
Performance’, Journal of Personality and Social Psychology 104(6), 976–994.
Williams, K.D. (2010). ‘Dyads Can Be Groups (and Often Are)’, Small Group Research 41(2), 268–274.
Yaniv, I. (1997). ‘Weighting and trimming: Heuristics for aggregating judgments under uncertainty’, Organizational Behavior and Human Decision Processes 69(3), 237–249.
Yates, J.F. (1982). ‘External correspondence: decompositions of the mean probability score’, Organizational Behavior and Human Performance 30(1), 132–156.
Yuan, Y. and MacKinnon, D.P. (2009). ‘Bayesian mediation analysis’, Psychological Methods 14(4), 301–322.