
The Theory is Predictive, but is it Complete?
An Application to Human Perception of Randomness
Jon Kleinberg, Annie Liang, and Sendhil Mullainathan
Extended Abstract
When we test theories, we most often focus on what one might call correctness: do the predictions
of the theory match what we see in the data? For example, if human capital theory says that wages
are determined by one’s knowledge and capabilities, one test of the theory is whether higher education
predicts higher wages in labor data. Such a finding suggests that education affects wages, but provides
little insight into whether human capital theory explains a small or large fraction of earnings variation.
Beyond correctness we also care about what one might call completeness: how much of the explainable
variation in the data is captured by the theory?
Despite an interest in completeness, we focus on correctness in social science for a pragmatic reason.
We can measure the fit of any given theory to data, but we have no intuition for what constitutes a
“good” fit. Suppose we are interested in predicting a binary variable and a theory predicts accurately
in .55 of observed trials. For certain problems — prediction of changes in stock returns next period
given the past history of returns, for example — this is a stunning success. In others — prediction of
college matriculation given socioeconomic and other personal characteristics — it is only mediocre. Most
social phenomena cannot be perfectly predicted, due to irreducible noise in the problem, but there exists
(possibly quite dramatic) variation in the extent of noise across problems. Testing completeness therefore
requires constructing a more realistic benchmark than perfect accuracy: we hope to
understand how well a theory’s predictive power lines up against some best achievable accuracy for the
problem.
Recent advances in machine learning provide a way to generate this benchmark. These advances have
produced valuable practical contributions, enabling substantial progress in problems such as computer
vision and gene expression analysis. But they are often criticized for using empirical variables in an ad hoc
and atheoretical way, searching for the best prediction function over a large set of explanatory variables.
The resulting prediction functions perform well empirically but are almost always hard to interpret and
rarely reveal a deep theoretical structure.
This paper suggests that even under the most pessimistic view of the interpretability of models in
machine learning, its techniques can still be useful for testing theory completeness. In fact, permissiveness
towards atheoretical models is an asset for this purpose, providing rough guidance on the “maximal”
achievable accuracy. The proposed approach is simple: compare the performance of existing (interpretable
and economically meaningful) models to the performance of an atheoretical machine learning algorithm.
We test this approach on a simple problem with a long history of study in psychology and behavioral
economics: human perception of randomness. It is well known that humans misperceive randomness
(Bar-Hillel & Wagenaar, 1991; Kahneman & Tversky, 1972), a phenomenon with relevance to prediction
of stock returns and performance of mutual fund managers, among other economic problems. A common
approach to studying the nature of this misperception is through human generation of random sequences.
Following this literature, we collect 14,050 eight-length binary strings generated by subjects on the Mechanical Turk platform “as if” they were the results of eight flips of a fair coin. Consistent with the literature, we find
that strings with long runs and extreme ratios of Heads to Tails are under-generated by our subjects.
Two well-known behavioral models for this misperception are Rabin (2002) and Rabin & Vayanos
(2010). We test these models on our data for two (string-by-string) predictive tasks. First, what is the
probability that the eighth generated flip is Heads, given the first seven flips? Second, given a set of strings,
half generated by a true Bernoulli(0.5) process and half generated by experimental subjects, what is the
probability that a given string is generated by a subject? We measure prediction error using a squared
loss function and 10-fold cross validation.
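As a concrete illustration of the continuation task and the loss measure, the sketch below uses hypothetical stand-ins (`strings`, an N x 8 array of 0/1 flips generated by subjects, and a generic model interface), since the abstract does not specify the implementation:

```python
# A minimal sketch of Task 1 (continuation) under squared loss and 10-fold CV.
import numpy as np
from sklearn.model_selection import KFold

def squared_loss(p_hat, y):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((np.asarray(p_hat) - np.asarray(y)) ** 2))

def cv_continuation_error(strings, fit, n_splits=10, seed=0):
    """Predict the 8th flip from the first seven, scored by 10-fold cross validation.
    `fit(train_strings)` returns a function mapping (N, 7) prefixes to P(Heads)."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in folds.split(strings):
        model = fit(strings[train_idx])
        p_hat = model(strings[test_idx, :7])
        errors.append(squared_loss(p_hat, strings[test_idx, 7]))
    return float(np.mean(errors))

# The naive benchmark corresponding to a true Bernoulli(0.5) process: always predict 0.5.
naive_fit = lambda train: (lambda prefixes: np.full(len(prefixes), 0.5))
# Task 2 (classification) is scored analogously, with the 0/1 outcome indicating
# whether a string was subject-generated rather than drawn from Bernoulli(0.5).
```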
The naive approach corresponding to the true Bernoulli process, in which each probability estimate is
0.5, performs with a prediction error of 0.25. In Table 1, we compare this quantity against the prediction
error achieved using Rabin (2002) and Rabin & Vayanos (2010). The crux of our problem of interest is
illustrated here: both behavioral models are clearly more predictive than the naive benchmark (and hence
in some sense the theories are “correct”), but the size of the improvement in prediction error is nearly impossible
to assess. Is failure to achieve a significantly lower prediction error due to incompleteness of the available
theories, in which case better theories could greatly reduce prediction error, or is the achievable prediction
error simply bounded far away from 0?
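To see why the naive benchmark sits at exactly 0.25: under squared loss, a constant forecast of 0.5 for a binary outcome incurs the same loss on every observation,

\[
(y - 0.5)^2 = 0.25 \quad \text{for every } y \in \{0, 1\},
\]

so no amount of data moves this benchmark; the open question is only how far below 0.25 the best achievable error lies.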
Table 1: Prediction error for the continuation and classification tasks.

                      Continuation       Classification
Guessing 50-50        0.25               0.25
Rabin (2002)          0.2495 (0.0001)    0.2493 (0.0001)
Rabin (2010)          0.2491 (0.0001)    0.2495 (0.0001)
We use our proposed approach to provide an answer. The specific learning algorithm we use is
referred to as table lookup — our model is the empirical distribution of strings in the training data.
Under the assumption that strings are i.i.d., this model approximates the “best possible” prediction
error with arbitrary precision as the quantity of training data is increased. Using this approach, we
achieve a predictive error of approximately 0.243 (see Table 2). A simple measure of “completeness” of
existing theories is then the ratio of improvement in prediction error of the best behavioral model (over
the naive approach) to the improvement in prediction error using table lookup. Table 2 suggests that
existing behavioral models produce roughly 10% of the achievable improvement in prediction error for
this problem.
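The sketch below illustrates the table-lookup predictor and the completeness ratio; the fallback of 0.5 for prefixes unseen in the training data is an assumption made for the sketch, not a detail given in the abstract.

```python
# Table lookup for the continuation task: the predicted probability that the 8th flip
# is Heads, given a 7-flip prefix, is the empirical frequency of Heads after that
# prefix in the training data.
from collections import defaultdict
import numpy as np

def fit_table_lookup(train_strings):
    counts = defaultdict(lambda: [0, 0])          # prefix -> [# Tails, # Heads]
    for s in train_strings:
        counts[tuple(s[:7])][int(s[7])] += 1
    def predict(prefixes):
        probs = []
        for prefix in map(tuple, prefixes):
            tails, heads = counts[prefix]
            probs.append(heads / (heads + tails) if heads + tails else 0.5)  # assumed fallback
        return np.array(probs)
    return predict

def completeness_ratio(naive_err, model_err, lookup_err):
    """Improvement of a behavioral model over the 0.25 naive benchmark, as a share
    of the improvement achieved by table lookup."""
    return (naive_err - model_err) / (naive_err - lookup_err)

# Continuation column of Table 2: (0.25 - 0.2491) / (0.25 - 0.2427) ≈ 0.12
```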
Table 2: Prediction error and completeness ratio for the continuation and classification tasks.

                      Continuation       Classification
Guessing 50-50        0.25               0.25
Rabin (2002)          0.2495 (0.0001)    0.2493 (0.0001)
Rabin (2010)          0.2491 (0.0001)    0.2495 (0.0001)
Table Lookup          0.2427 (0.0001)    0.2430 (0.0001)
Ratio of improvement  0.1233             0.0857
One might be concerned that the estimated ratio is specific to the problem of predicting eight-length coin-flip strings, and not representative of how well behavioral models predict human generation of randomness more generally. As a robustness check, we collect two new datasets in which
we vary string length and string alphabets: (a) 3,000 15-length coin-flip strings generated by human subjects, and (b) 6,200 8-length binary strings over the alphabets {r, 2} and {@, !}. We then repeat the above
prediction tasks, training on the eight-length coin flip data and predicting outcomes in the 15-length coin
flip data (first four columns of Table 3) and in the 8-length binary data with new alphabets (final two
columns of Table 3).
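For the 15-length data, one natural way to apply a model fit on 7-flip histories is to condition each target flip on the seven flips immediately preceding it; the windowing below is an assumption, since the abstract does not spell out the exact conditioning scheme.

```python
import numpy as np

# Score a prefix model (7 flips -> P(Heads)) on flips 9, 11, 13, and 15 of the
# 15-length data, conditioning on the preceding seven flips (assumed windowing).
def windowed_errors(model, long_strings, targets=(8, 10, 12, 14)):  # 0-indexed
    errors = {}
    for t in targets:
        p_hat = model(long_strings[:, t - 7:t])
        errors[f"s{t + 1}"] = float(np.mean((p_hat - long_strings[:, t]) ** 2))
    return errors
```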
                      s9                s11               s13               s15               {r, 2}            {@, !}
Guessing 50-50        0.25              0.25              0.25              0.25              0.25              0.25
Rabin (2002)          0.2474 (0.0001)   0.2482 (0.0001)   0.2480 (0.0001)   0.2500 (0.0001)   0.2493 (0.0001)   0.2499 (0.0001)
Rabin (2010)          0.2459 (0.0001)   0.2475 (0.0001)   0.2468 (0.0001)   0.2462 (0.0001)   0.2491 (0.0001)   0.2497 (0.0001)
Table Lookup          0.2389 (0.0001)   0.2400 (0.0001)   0.2404 (0.0001)   0.2369 (0.0001)   0.2451 (0.0001)   0.2456 (0.0001)
Ratio of improvement  0.3693            0.2500            0.3333            0.2900            0.1836            0.0682

Table 3: Left. Predicting the 9th, 11th, 13th, and 15th flip in the 15-length {H, T} data. Right. Predicting the final flip in the 8-length {r, 2} and {@, !} data.
This table suggests that existing behavioral models produce up to approximately one-third of the achievable
improvement in prediction error, so that the estimated ratio in Table 1 varies slightly but is roughly robust
across “nearby problems”. As a final robustness check, we contrast the prediction error achieved using
table lookup with that of a LASSO regression model using only variables motivated by the behavioral
literature.
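A sketch of this LASSO comparison is below; the behaviorally motivated features used here (Heads count, alternation count, most recent flip) are illustrative assumptions rather than the paper's exact variable list.

```python
# LASSO on hand-built features of the 7-flip prefix (feature set is illustrative).
import numpy as np
from sklearn.linear_model import Lasso

def behavioral_features(prefixes):
    prefixes = np.asarray(prefixes)
    n_heads = prefixes.sum(axis=1)                                 # Heads/Tails balance
    alternations = np.abs(np.diff(prefixes, axis=1)).sum(axis=1)   # number of switches
    last_flip = prefixes[:, -1]                                    # most recent outcome
    return np.column_stack([n_heads, alternations, last_flip])

def fit_lasso(train_strings, alpha=1e-3):
    X = behavioral_features(train_strings[:, :7])
    y = train_strings[:, 7]
    model = Lasso(alpha=alpha).fit(X, y)
    # Clip to [0, 1] so the output can be read as a probability of Heads.
    return lambda prefixes: np.clip(model.predict(behavioral_features(prefixes)), 0.0, 1.0)
```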
Table 4: Prediction error of a LASSO regression on behaviorally motivated variables, compared to table lookup.

                                                 Continuation       Classification
Bernoulli                                        0.25               0.25
Rabin (2002)                                     0.2495 (0.0001)    0.2494 (0.0001)
Rabin (2010)                                     0.2491 (0.0001)    0.2495 (0.0001)
LASSO                                            0.2444 (0.0001)    0.2445 (0.0001)
Table Lookup                                     0.2425 (0.0001)    0.2430 (0.0001)
Ratio of improvement using LASSO as a benchmark  0.1607             0.0909
Ratio of improvement using TL as a benchmark     0.1233             0.0857
Table 4 suggests that the prediction error achieved using table lookup is indeed “achievable” using other
(less demanding) machine learning algorithms, and possibly using interpretable models.
References
Rabin, Matthew. 2002. “Inference by Believers in the Law of Small Numbers.” The Quarterly Journal of Economics.
Rabin, Matthew & Dimitri Vayanos. 2010. “The Gambler’s and Hot-Hand Fallacies: Theory and Applications.” Review of Economic Studies.