
Simultaneous learning of categories and classes of categories:
acquiring multiple overhypotheses
Amy Perfors
School of Psychology
University of Adelaide
Daniel J. Navarro
School of Psychology
University of Adelaide
Joshua B. Tenenbaum
Department of Brain & Cognitive Sciences
Massachusetts Institute of Technology
Word count: 11,474
Abstract
This work investigates people’s ability to learn on multiple levels of abstraction at once, by figuring out how to categorize novel items at the same time
as making higher-order inferences about that categorization (e.g., that categories tend to be organized by shape). We show not only that people are
capable of this sort of learning, called overhypothesis learning, but also that
people can learn multiple different overhypotheses at once (while still simultaneously figuring out the categories). We present a computational model
that illuminates how this sort of learning is possible and sheds light on
why people might fail to achieve it in certain circumstances.
Introduction
Learning is sometimes thought of as acquiring knowledge, as if it simply consists of
gathering facts like pebbles scattered on the ground. Very often, however, effective learning
also requires learning how to learn: forming abstract inferences about how those pebbles
are scattered – how that knowledge is organized – and using those inferences to guide one’s
future learning. Indeed, most learning operates on many levels at once. We do gather facts
about specific objects and actions, and we also learn about categories of objects and actions.
But an even more powerful form of human learning, evident throughout development, extends to even higher levels of abstraction: learning about classes of categories and making
inferences about what categories are like in general. This knowledge enables us to learn
entirely new categories quickly and effectively, because it guides the generalizations we can
make about even small amounts of input.
Consider a learner acquiring knowledge about different classes of animal. After encountering many animals he or she notices that cats generally have four legs and a tail,
spiders have eight legs and no tail, and monkeys have two legs and a tail. Because animals
in the same category tend to have the same number of legs, whereas animals in different
categories can have different numbers of legs, the feature is diagnostic. The learner can use
this diagnostic information to infer that a new animal with eight legs is a spider and not a
cat. In this inference, the learner compares a novel item to a familiar category, which we
refer to as a first-order generalization.
The majority of studies of human category learning tend to look at first-order generalization problems: the participant is required to categorize new items into one of several
categories that he or she has already learned. However, learning about the diagnosticity of
different features allows the learner to go beyond first-order generalization, and draw inferences about entirely novel categories, which we refer to as second-order generalization. In
the scenario described above, a learner who recognizes that “number of legs” is a diagnostic
feature for animal categories will tend to infer that an animal that has six legs does not
belong to any of the familiar categories (i.e., not a spider, cat or monkey). Moreover, he
or she would expect that when additional members of this animal category are encountered,
they too will have six legs. This second-order generalization relies on a much more abstract
form of knowledge. In the first example, the learner classifies a new animal as a spider
because it has eight legs, as do other spiders. In the second example, the learner draws
inferences based on the abstract idea that “number of legs” is a diagnostic property for
all (or most) animal categories, even those animal categories that have not yet been encountered. This abstract knowledge is typically referred to as an overhypothesis (Goodman,
1955) because it represents knowledge that is applied to an entire kind or class of category.
Overhypotheses can be powerful tools for guiding inductive inferences about unfamiliar categories. However, the inferences suggested by a particular overhypothesis are only
sensible if the new category is of the same qualitative class as previous ones. For instance,
while “number of legs” is fairly diagnostic for animal categories – and therefore supports an
overhypothesis about animals in general – it is much less useful when applied to furniture.
Tables and stools can have one, three, four, six, or more legs and still be considered tables
and stools. As a consequence, if the first example of a new class of furniture turns out to
have three legs, there is no reason to expect that future examples will also have three legs.
In other words, in order to apply an overhypothesis sensibly, a learner needs to infer the
class of category to which it is applicable. Number of legs varies little within categories
of animal, but varies more within categories of furniture. Similarly, taste is an important
feature for categories of foods but not social categories; color is (unfortunately, and for some
at least) an important feature for social categories but not artifact categories; function is
an important feature for artifact categories but not biological categories; and shape is an
important feature for solid categories but not substances.
The fact that different overhypotheses are applicable to different classes of categories
(simply referred to as “classes” from now on) makes the learning problem – already challenging – much more challenging. Instead of learning a single overhypothesis that is applicable to
all categories, learners need to determine how many classes of category exist, determine the
nature of the overhypothesis that applies to each class, and also figure out which categories
belong to each class. All of this learning must occur over and above the task of assigning
Table 1: Different features matter for different classes of categories. Object categories such as cat
and ball tend to contain objects of a similar shape, but category members can vary a great deal in
color. In contrast, substance categories such as mud and flour tend not to be organised in terms
of typical shapes, but in terms of typical colors and textures.
Class       Category   Instance   Shape       Color
Object      Cat        1          Quadruped   Tabby
Object      Cat        2          Quadruped   White
Object      Cat        3          Quadruped   Grey
Object      Ball       1          Sphere      Blue
Object      Ball       2          Sphere      Red
Object      Ball       3          Sphere      Green
Substance   Mud        1          Pile        Brown
Substance   Mud        2          Puddle      Brown
Substance   Mud        3          Spherical   Brown
Substance   Flour      1          Pile        White
Substance   Flour      2          Cloud       White
Substance   Flour      3          Scatter     White
individual items to the correct categories. The problem is illustrated schematically in Table 1, which shows 12 stimuli that belong to four categories: cat, ball, mud and flour.
The categories fall into two classes: cat and ball are object categories, whereas mud and
flour are substance categories. For stimuli that belong to one of the object categories,
the diagnostic feature is shape (e.g., balls are generally round). But substances are easily
moldable (e.g., flour can be encountered in a pile or scattered on a surface), so this is not
a particularly important feature. For substance categories, the diagnostic features tend to
be color or texture. In order to learn effectively in an environment that has this structure,
the learner needs to be able to identify that cat and ball are categories of a different class
as mud and flour, and should make inferences accordingly.
Given the complexity of the learning task, one might think that human learners
would struggle to solve it. Nevertheless, even 24-month-olds – younger in some cases – are
able to form abstract overhypotheses about how categories are organized, forming different
overhypotheses for different classes of things. They realize that categories corresponding to
count nouns tend to have a common shape, but not a common texture or color (Landau,
Smith, & Jones, 1988; Soja, Carey, & Spelke, 1991), whereas categories corresponding to
foods often have a common color but not shape (e.g., Macario, 1991; Booth & Waxman,
2002). The advantages of acquiring this sort of overhypothesis are clear: teaching children a
few novel categories strongly organized by shape results in early acquisition of the shape bias
as well as faster learning even of other, non-taught words (Smith, Jones, Landau, Gershkoff-Stowe, & Samuelson, 2002). This is a noteworthy result because it demonstrates that
overhypotheses can rapidly be acquired on the basis of little input, but it raises questions
about what enables such rapid acquisition. The work in this paper is motivated by these
questions about how knowledge is acquired on multiple different higher levels of abstraction,
and how that kind of learning interacts with lower-level learning about specific items.
There are two central questions addressed in this paper. First, how is it possible to
simultaneously learn on multiple different higher levels of abstraction? Second, are people
capable of this sort of learning quickly, in the lab, on novel stimuli with novel features,
as opposed to slowly over years of development? We evaluate human behavior in a series of
experiments in which people must learn on multiple levels at once, in both a supervised
and an unsupervised fashion. We also present a computational model of this learning.
Comparing model performance to human behavior allows us to investigate how learning
multiple overhypotheses depends on the ability to learn categories. Is the former possible
only if the categories are given, or is this sort of learning possible in an unsupervised fashion
as well? How do factors like category coherence affect what can be learned? How do these
factors impact the nature of the generalizations (both first-order and second-order) that people make?
For computational theories of learning, the central question – the ability to learn on
multiple levels at once – poses something of a chicken-and-egg problem: the learner cannot
acquire overhypotheses without having attained some specific item-level knowledge first,
but acquiring specific item-level knowledge would be greatly facilitated by already having a
correct overhypothesis about how that knowledge might be structured. This chicken-and-egg problem is exacerbated even more if it is not known how many classes (and separate
overhypotheses) there are. Often it is simply presumed that acquiring knowledge on the
higher (overhypothesis) level must always follow the acquisition of more specific knowledge.
A computational framework called hierarchical Bayesian modelling can help to explain how learning on multiple levels might be possible. This framework has been applied
to domains as disparate as learning conceptual structure (Kemp & Tenenbaum, 2008),
causal reasoning (Kemp & Tenenbaum, 2009), decision making (M. D. Lee, 2006), the acquisition of abstract syntactic principles (Perfors, Tenenbaum, & Regier, 2011), acquiring
verb knowledge (Perfors, Tenenbaum, & Wonnacott, 2010), learning word-by-word statistics
(Teh, 2009), and learning about feature variability (Kemp, Perfors, & Tenenbaum, 2007).
In this framework, inferences about data are made on multiple levels: the lower level corresponds to specific item-based information, and the overhypothesis level corresponds to
abstract inferences about the lower-level knowledge.
In this paper we present a model of category learning which acquires knowledge about
how specific items should be categorized as well as multiple higher-order overhypotheses
(classes) about how categories in general are organized. This model differs from other
models which can either categorize items but not learn multiple classes (Kruschke, 1992;
Love, Medin, & Gureckis, 2004; Perfors & Tenenbaum, 2009) or learn multiple classes but
not simultaneously categorize items (Kemp et al., 2007; Sanborn, Chater, & Heller, 2009;
Perfors et al., 2010). Our new model can discover how to cluster items at the category level
on the basis of their featural similarity, at the same time that it makes inferences about
higher-level parameters (the overhypotheses) indicating which features are most important
for organizing items into basic-level categories. It can learn these things in addition to
how many classes (overhypotheses) there are in the first place. The model provides a basis
against which to compare human performance.
The structure of this paper is as follows. First, we describe this new model of category
learning and how it is capable of solving the “classes and categories” problem. We then
present two studies investigating people’s abilities to do the same, in a supervised learning
task and an unsupervised learning task. The first study considers only the simplest version
of the problem, in which only a single latent class exists. The second study examines
a more complicated problem, one in which categories are organized into multiple classes.
In both studies we consider first-order generalization and second-order generalization. By
comparing human performance to the Bayesian model, we find that there are some versions
of the problem for which human performance is near optimal. On the other hand, there are
versions of the problem for which people can perform the task, but are substantially worse
than the model. The implications, both theoretical and methodological, are discussed.
A computational analysis of the learning problem
In this section we outline a Bayesian model of the “classes and categories” problem
(see Appendix A for the complete specification). It can be viewed as an extension to
existing Bayesian category learning models (Anderson, 1991; Perfors & Tenenbaum, 2009;
Kemp et al., 2007; Perfors et al., 2010; Sanborn, Griffiths, & Navarro, 2010), one that
is able to organize stimuli into categories, sort categories into classes, and to learn the
abstract regularities that characterize different classes of categories. As with most Bayesian
models, it is best viewed as a computational analysis of the learning problem. It specifies a
probabilistic model that makes reasonable assumptions about the structure of the task, and
learns in an optimal fashion given those assumptions. This approach to modelling human
cognition is referred to as computational analysis, and it has been applied successfully in
areas as diverse as vision, reasoning, and decision-making (see Griffiths, Chater, Kemp,
Perfors, & Tenenbaum, 2010).
The model, which is schematically depicted in Figure 1, assumes that each stimulus is
described in terms of a set of discrete features (e.g., number of legs, shape, colour, label, etc)
each of which can take on many different values. Formally, each stimulus is characterized
by the feature vector y, where the i-th element of y indicates which value the i-th feature
takes on. Given a set of n stimuli described by the vectors y1 , y2 , . . . , yn , the model needs
to solve two problems simultaneously.
The primary task to be solved is the classification problem, in which the learner
must infer the structure that organises the stimuli. As described here, this learning takes
place at two different levels: the stimuli need to be partitioned into a set of categories, and
the categories need to be partitioned into classes. We let c be a vector specifying which items
belong in which category and similarly k be a vector indicating which categories belong to
each class. In real life the learner does not know ahead of time how many categories or
classes exist, and this uncertainty is mirrored in the model. In the model this uncertainty
is captured by placing a prior distribution P (c) over all possible ways of sorting the stimuli
into categories, and another prior P (k) over the assignments of categories to classes. As
with previous Bayesian analyses (e.g., Anderson, 1991; Sanborn et al., 2010) we use the
Chinese restaurant process to specify this prior: the primary function of this prior is to
ensure that the model tries to fit the data using no more classes and categories than necessary.
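To make the role of this prior concrete, the following is a minimal sketch of how a partition can be sampled from a Chinese restaurant process. The concentration value and the use of a single shared parameter are illustrative assumptions, not the settings used in Appendix A.

```python
import numpy as np

def sample_crp_partition(n_items, concentration=1.0, rng=None):
    """Sample a partition of n_items from a Chinese restaurant process.

    Each new item joins an existing block with probability proportional to
    that block's size, or starts a new block with probability proportional
    to the concentration parameter; small concentrations favour few blocks.
    """
    rng = np.random.default_rng(rng)
    assignments = []          # block index for each item
    block_sizes = []          # number of items currently in each block
    for i in range(n_items):
        weights = np.array(block_sizes + [concentration], dtype=float)
        block = rng.choice(len(weights), p=weights / weights.sum())
        if block == len(block_sizes):
            block_sizes.append(1)     # item starts a new block (category or class)
        else:
            block_sizes[block] += 1
        assignments.append(block)
    return assignments

# e.g., one prior sample over ways of sorting 16 stimuli into categories
print(sample_crp_partition(16, concentration=1.0, rng=0))
```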
The second problem to be solved is the distributional learning problem: inferring
the distribution over features associated with different classes and categories. First, consider
the category level. For every category c and every feature f , the model must learn a vector
θ^(cf) that specifies how likely each possible feature value is. For instance, the learner might
infer that spiders have eight legs with probability 0.95, seven legs with probability 0.01, and
so on. Formally, the v-th element of this vector, θ_v^(cf), describes the probability that a member
of category c has value v for feature f. The θ vectors collectively describe a probability
distribution over features for each category. Knowledge about θ allows the learner to view
Figure 1. Our hierarchical Bayesian model. Each setting of (α, β) is an overhypothesis: β represents
the distribution of features across items within categories, and α represents the variability/uniformity
of features within categories (i.e., the degree to which each category tends to be coherently organized with respect to a given feature, or not). The model is given data consisting of the features
yi corresponding to individual items i, depicted here as abstract shapes. Learning categories corresponds to identifying the correct assignment of items to categories, and learning classes corresponds
to identifying the correct assignment of categories to classes. In this schematic example, items are
identified as being in one class or another according to whether they are drawn with solid or dotted
lines; items with solid lines are categorized by a different feature (shape) than items with dotted
lines (color). Thus, learning at all of these levels involves learning what feature(s) define different
classes (the lines), as well as learning a conditional dependency between the value on that feature
and which features matter for categorization.
each category as a statistical ensemble of features, as is typical for family resemblance
categories. It is this knowledge that allows the model to make first-order generalizations.
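As a concrete illustration of how knowledge of θ supports first-order generalization, the sketch below scores a new item against two familiar categories using hypothetical point estimates of θ. The category names, feature values, and probabilities are invented for illustration and are not taken from the experiments.

```python
import numpy as np

# theta[c][f][v]: estimated probability that a member of category c
# has value v on feature f (illustrative numbers, not fitted values)
theta = {
    "wug": [np.array([0.90, 0.05, 0.05]),   # shape: mostly value 0 (circle)
            np.array([0.34, 0.33, 0.33])],  # colour: roughly uniform
    "dax": [np.array([0.05, 0.90, 0.05]),   # shape: mostly value 1 (square)
            np.array([0.33, 0.34, 0.33])],
}

def category_scores(item, theta):
    """Log-probability of the item's feature values under each category."""
    return {c: sum(np.log(probs[f][v]) for f, v in enumerate(item))
            for c, probs in theta.items()}

# A reddish circular object: shape value 0 is diagnostic, colour value 2 is not,
# so the item is scored as far more likely to be a wug than a dax.
print(category_scores((0, 2), theta))
```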
At the class level the distributional learning problem is more complicated. A class is
associated with a set of β vectors that play much the same role that θ plays for categories.
For class k and feature f, β_v^(kf) specifies the probability that stimuli that belong to categories
of class k have value v on feature f (roughly speaking). For instance, in the animals scenario
described in the introduction the learner might infer that 40% of animals have four legs,
10% are two legged, and so on. However, classes are richer knowledge structures than
categories. At the category level, it is perfectly sufficient simply to describe the proportion
of stimuli that take on a particular feature value: if a category has θ = 0.5 for some feature
value, then 50% of category members should have that value on that feature. There is
nothing else that needs to be stated.
At the class level the story is different. There are multiple qualitatively different ways
that a class can have β = 0.5 for some feature value. For instance, a class might consist
of two categories that both have θ = 0.5 for that same feature value. In this situation,
the categories have the same distribution over features (e.g., chimpanzees and bonobos
might both show the same degree of variability in intelligence). A second possibility is
that one category has θ = 1 and the other has θ = 0 (e.g., chimpanzees are almost always
aggressive, bonobos are almost never aggressive). At the class level, the overall distribution
of features is much the same, but the variability is expressed at a different level: in the
first case the variability occurs within categories, in the second case the variability occurs
between categories. The model needs to be able to express this distinction, which it does
by means of learned parameters α^(kf), which (roughly speaking) capture the extent to
which categories that belong to class k tend to be homogeneous with respect to feature f.
When α is small, it implies that category members all tend to have the same feature values,
and most variation occurs between categories rather than within them. When α is large,
categories tend to be less homogeneous with respect to that feature. Broadly speaking, it
captures the extent to which categories of a particular class are organized by a particular
feature, and it is this knowledge that enables second-order generalizations to occur.
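One standard way to formalize the relationship between α, β, and θ (used, for example, in the Dirichlet-multinomial construction of Kemp et al., 2007) is to draw each category’s θ^(cf) from a Dirichlet distribution with parameter vector α^(kf)·β^(kf). The sketch below illustrates how a small α yields categories that are each concentrated on one feature value, while a large α yields categories that all resemble the class-level distribution; the particular parameterization used in Appendix A may differ in detail.

```python
import numpy as np

def sample_category_distributions(alpha, beta, n_categories=4, rng=None):
    """Draw per-category feature-value distributions theta for one class.

    Each theta is Dirichlet(alpha * beta): small alpha concentrates each
    category on a single feature value (variability lies between categories),
    large alpha makes every category look like the class-level distribution
    beta (variability lies within categories).
    """
    rng = np.random.default_rng(rng)
    return rng.dirichlet(alpha * np.asarray(beta), size=n_categories)

beta = [0.25, 0.25, 0.25, 0.25]            # class-level distribution over 4 shapes
print(sample_category_distributions(0.1, beta, rng=0))   # shape organizes the categories
print(sample_category_distributions(50.0, beta, rng=0))  # shape varies within categories
```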
All of the theoretically interesting distributional learning in the model takes place
at the class level and the category level. However, the model also allows a third level of
knowledge representation, formalized in terms of two parameters λ and µ that describe what
the model has learned about the regularities that are common to all classes of categories.
This top-level knowledge describes the learner’s beliefs about how homogeneous classes
themselves tend to be. The learning that takes place at this level is not qualitatively
important for our experiments, but in our simulations we found that the model behaves more
appropriately when it learns the values of λ and µ from the data (as in Perfors et al.,
2010), rather than fixing them at arbitrary values (as in Kemp et al., 2007). Nevertheless,
the results were qualitatively the same either way.
The overall structure of the knowledge acquired by this model is outlined in Figure 1.
At the bottom level it learns which items go in which categories (i.e., it learns c) and it
learns a distribution over features for each category (i.e., θ). At the next level it learns
which categories belong to which class (i.e., k), it learns which features are characteristic
of each class (i.e., the β values) and it also learns how homogeneous categories within each
class tend to be on each of the features (i.e. it learns the α values). Finally, at the top-level,
it also learns some high-level expectations about classes and categories (i.e., λ and µ), though
as noted above this third level is less important than the other two for our investigations.
In order to acquire this structured knowledge, the Bayesian model specifies a joint
prior distribution over all unobserved variables, namely c, k, λ, µ, β, α, and θ. When the
model observes the features y1 , y2 , . . . , yn that describe the stimuli, this prior is updated
via Bayes’ rule to a joint posterior distribution
P(c, k, λ, µ, β, α, θ | y1, . . . , yn)
that describes the learner’s beliefs about how the stimuli should be organized into classes and
categories, and what properties those classes and categories have. The particular choice of
prior distribution and the numerical methods used to approximate the posterior distribution
are discussed in Appendix A.
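To give a sense of what this posterior update involves computationally, the following sketch scores a candidate assignment of items to categories by combining a CRP log prior over partitions with a Dirichlet-multinomial marginal likelihood in which the θ vectors have been integrated out. It covers only one level of the hierarchy with a symmetric prior; the full model additionally scores the assignment of categories to classes and the class-level parameters α, β, λ, and µ, as described in Appendix A.

```python
import numpy as np
from math import lgamma
from collections import Counter

def crp_log_prior(assignments, concentration=1.0):
    """Log CRP prior probability of a partition given as a list of block labels."""
    sizes, logp = Counter(), 0.0
    for i, block in enumerate(assignments):
        denom = i + concentration
        logp += np.log((sizes[block] if sizes[block] else concentration) / denom)
        sizes[block] += 1
    return logp

def dirichlet_multinomial_loglik(counts, prior):
    """Marginal log-likelihood of feature-value counts with theta integrated out."""
    prior = np.asarray(prior, dtype=float)
    n = counts.sum()
    return (lgamma(prior.sum()) - lgamma(prior.sum() + n)
            + sum(lgamma(a + c) - lgamma(a) for a, c in zip(prior, counts)))

def score_partition(items, assignments, n_values, concentration=1.0, prior_strength=1.0):
    """Log posterior score (up to a constant) of sorting `items` as `assignments`."""
    logp = crp_log_prior(assignments, concentration)
    for category in set(assignments):
        members = [y for y, a in zip(items, assignments) if a == category]
        for f in range(len(items[0])):                      # one count vector per feature
            counts = np.bincount([y[f] for y in members], minlength=n_values)
            logp += dirichlet_multinomial_loglik(counts, np.full(n_values, prior_strength))
    return logp

items = [(0, 2), (0, 1), (1, 0), (1, 2)]                    # 4 items, 2 features, 3 values
print(score_partition(items, [0, 0, 1, 1], n_values=3))     # grouping by the first feature
print(score_partition(items, [0, 1, 0, 1], n_values=3))     # an alternative grouping
```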
Why use this model?
As discussed earlier, the Bayesian model is best viewed as a computational analysis of the learning problem, and as such it provides a normative standard for inference.
When presenting such analyses it is common to focus on how human performance matches
the predictions made by the model, thereby providing evidence that people are learning
in an optimal fashion. However, normative standards for inductive tasks are often useful
for highlighting differences between model predictions and human behaviour. This is especially relevant in the current context. Because of the highly structured representations it
acquires, this model is capable of drawing powerful inferences from the data with which it
is presented. Given the appropriate data, it would be capable of solving the developmental
problem described earlier: it could, without being told what any of the categories were,
simultaneously acquire two different biases (e.g., that categories corresponding to count
nouns are usually organized by shape and not by color, whereas food categories are more
likely to share a common color). However, the very feature that makes the model powerful –
learning rich representations – also makes it quite computationally intensive: it must track
a large number of latent variables and it searches over a huge space of possible structured
representations of the domain. Using modern computational statistics it is not too difficult
to implement this model on a powerful enough machine, but it is less clear whether people
are able to solve the problem with the same effectiveness.
As this discussion illustrates, the problem of learning multiple different classes simultaneously,
without being told the correct categorization, is a daunting one. It is clear from the developmental literature that people solve some version of the problem (how else could children
learn the flexible biases they do?) but it is not clear what information they require to do
so, or how that information should be presented. Can people solve this sort of task over
a shorter time frame, as in a standard laboratory-based study, or is it something that is
learnable only over many months or years? If people can learn to solve such problems,
how efficiently do they do so? It might be a problem that people can solve to a near-optimal
standard in the lab, like multisensory integration (Shams & Beierholm, 2011) or predictions
about familiar events (Griffiths & Tenenbaum, 2006). Alternatively, it may be something
that people can do to some extent but struggle to do well, like generating or detecting
random sequences (Williams & Griffiths, 2013). In order to determine which is the case, it
is useful to compare performance to the predictions of an “optimal” learner.
Study 1: Learning multiple categories of the same class
Our initial exploration is restricted to the case where all categories are of the same
class. In this situation, a feature that tends to be consistent within one category will also
tend to be consistent in all the other categories: for instance, most wugs are circular, most
daxes are square and most feps are triangular. In contrast, a feature that varies within
one category will tend to vary within others: in the example above, wugs, daxes and
feps can all be many different colors. The experiment is designed on the basis of several
observations about what an ideal learner should be able to do when presented with data of
this sort.
Can people learn to make first- and second-order generalizations?
The first thing that the learner needs to do in this situation is learn to correctly
classify new members of existing categories on the basis of familiar feature values. For
instance, when shown a reddish circular object, the learner should use the shape information
to guess that it is a wug, and ignore the nondiagnostic color information. Because this
inference relies on familiar categories (wug) and uses familiar feature values (circles), it is
an example of first-order generalization. The learner should also be able to perform second-order generalization: if shown a green pentagon, and asked whether it belongs to the same
category as a blue pentagon or a green octagon, the learner should recognize that shape
is more important than color, and select the blue pentagon. Because none of the specific
feature values are familiar ones – the learner has not previously seen pentagons or octagons
– this is second-order inference. If the Bayesian model described in the previous section
provides a good description of human performance, we should expect that people will be
able to make sensible first-order and second-order generalizations. In fact, as we show later,
the Bayesian model presented here predicts that these two kinds of generalizations should
be equally easy for people to make.
One reason why this provides an interesting empirical test is that – somewhat counterintuitively – existing category learning models that include a selective attention component
do not predict that second-order generalization should be as easy as first-order. To illustrate this point, consider the classic ALCOVE model of category learning (Kruschke, 1992),
which has the ability to learn to selectively attend to diagnostic features and to ignore nondiagnostic ones. The original ALCOVE model relied exclusively on stimulus representations
that encoded items in terms of continuous stimulus dimensions, but was later formally extended to handle stimuli that are best described in terms of a collection of discrete features
(M. Lee & Navarro, 2002), as is the case here. However, it is not obvious how selective
attention should operate over discrete features that take on more than two values.
For instance, suppose the learner encounters objects that take on three possible shapes
(circular, square, or triangular). How should “shape” be encoded as a feature within the
model? One possibility is that shape corresponds to a single feature with three possible
values. If the stimulus representation takes this form then the error-driven learning rules
in ALCOVE allow it to learn to attend to the abstract notion of shape (required to make
good second-order generalizations), but they do not allow it to pay special attention to
some shapes and not others (required for good first-order generalization). This version of
ALCOVE was implemented by M. Lee and Navarro (2002) but – like the original version
of ALCOVE – was unable to learn to make first-order generalizations about very simple
categories defined in terms of two three-valued features in a human-like fashion.1
The other alternative is for the feature shape to be viewed as a collection of three
binary features: “is-circular”, “is-square” and “is-triangular.” When this occurs the model
is able to attend separately to specific shapes, and as demonstrated by M. Lee and Navarro
(2002) this ability is necessary to mimic human performance even in simple category learning tasks: people do recognize that in some situations a specific shape is relevant to the
1 This variant of ALCOVE was implemented by M. Lee and Navarro (2002) but was removed from the final version of the paper because its performance was qualitatively identical to the original ALCOVE model, and both performed very poorly on the learning tasks.
classification. However, when this stimulus representation is used ALCOVE cannot make
second-order generalizations at all. The model can learn that “is-circular” is important for
distinguishing wugs from daxes and “is-green” is not. But it cannot leverage this knowledge to infer that “is-pentagonal” is more likely to be useful than “is-purple” when asked to
classify a purple pentagon. Because each separate feature value is treated as an independent
representational unit, the model cannot learn higher-order generalizations at all.
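The representational difference at issue can be stated very simply. The sketch below is our own illustration (not code from M. Lee and Navarro, 2002) of the two encodings: with a single multi-valued feature, an attention weight attaches to “shape” as a whole, whereas with one-hot binary features each weight attaches to a specific value, so nothing learned about “is-circular” can transfer to a novel value such as “is-pentagonal”.

```python
# Encoding 1: one multi-valued "shape" feature; a single attention weight
# covers every shape value, familiar or novel.
shape_values = ["circle", "square", "triangle"]
item_as_multivalued = {"shape": "circle"}

# Encoding 2: one binary feature per known value; attention weights attach to
# specific values, so a novel value ("pentagon") falls outside the representation.
def one_hot(shape, known_values=("circle", "square", "triangle")):
    return {f"is-{v}": int(shape == v) for v in known_values}

print(one_hot("circle"))    # {'is-circle': 1, 'is-square': 0, 'is-triangle': 0}
print(one_hot("pentagon"))  # all zeros: the novel value is invisible to learned weights
```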
These considerations mean that – depending on which stimulus representation is
chosen to describe discrete features – the selective attention mechanisms in ALCOVE either
allow it to make good first-order generalizations or good second-order generalizations, but
never both at once. In order to do both things simultaneously, ALCOVE would need a far
more drastic redesign of the representational structure and/or learning rules, changes that
would make it more akin to the Bayesian model presented here.
The idea that first-order and second-order generalizations should be equally strong
is a non-trivial and powerful prediction of our model. As we have seen, it runs counter to
the predictions of other models; it also runs counter to our intuition, which suggests that
it should be easier to learn about the specific items that you actually see than it should
be to learn about abstract properties of those items. This model, however, explains why
first-order and second-order generalizations are equally strong: although the inferences at
higher levels are more abstract, there is actually more evidence relevant to them (since all
individual items in all categories are pertinent). This “blessing of abstraction” is made
evident by the computational-level analysis offered here.
This discussion leads naturally to one question that Study 1 investigates. Do people
treat first-order generalizations the same as second-order ones? Our model predicts that
they should. Taken literally, ALCOVE and related models predict that people should be
able to do one or the other but not both.2 It is therefore natural to ask what humans do.
How do people’s first- and second-order generalizations compare?
What kind of data do people need to do so?
A related question pertains to the quality of data that the learner receives. As the
category-learning experiments of Smith et al. (2002) demonstrated, it is possible for children
to acquire an overhypothesis about the role of shape in categorization after being taught only
a few novel nouns; however, it is not clear precisely what aspects of the input enabled such
rapid acquisition. Was it the fact that the categories were organized on the basis of highly
coherent features, or because the individual items were consistently labelled, effectively
providing strong evidence about category assignments?
Consider the issue of labelling. In many category learning tasks, learning is supervised: people are explicitly told which items go in which categories. Do we require this kind
of supervision to be able to learn how to make second-order generalizations? Although some
category learning models are restricted to supervised learning (Kruschke, 1992), there are
others that can infer categories using only the stimulus features and do not require explicit
2 As noted, ALCOVE could probably be adapted to make different predictions, but such adaptation would be far from trivial. As it stands, it and other models that rely on learned selective attention and operate similarly cannot. The theoretical point of interest is not a “Bayesian versus connectionist” or “computational level vs process level” distinction, but rather what kinds of inferences people can and cannot make, and what mental representations these inferences imply.
labelling (Anderson, 1991; Love et al., 2004; Kurtz, 2007; Perfors & Tenenbaum, 2009;
Sanborn et al., 2010). In the experiment discussed below, the stimuli are designed in such a
way that there is only a slight difference in the predictions made by the Bayesian model in a
supervised and unsupervised context. The correct categorization is more or less determined
by the stimulus features, so an ideal learner should not treat the two situations differently.
However, performing the unsupervised task correctly is more demanding: the learner not
only needs to make the correct generalizations to new items, she also needs to figure out
how the old items should be organized into categories. It is not clear whether people will be
able to make second-order generalizations when given an unsupervised learning problem,
much less whether they will be able to solve this more complicated task as well as our ideal
learner model.
A related issue pertains to the overall “coherence” of the stimuli. It may be that in
order to make complex generalizations, people require very clean data, which is why the
intervention in Smith et al. (2002) was so effective. For instance, perhaps it is not sufficient
to observe that many wugs are circular in shape (and so on) in order for people to be
willing to infer the abstract rule that categories are organized by shape. Perhaps it needs to
be the case that all or most wugs are circular. If people are sensitive to category coherence,
comparison to our model can illustrate whether they need to be this sensitive or not. In
other words, is poor generalization when categories are less coherent a result of the fact
that there is just less information contained in the stimuli about the proper categorization?
Or does poor generalization reflect cognitive limitations on the part of the human learner
in tracking and using the information that is there?
Experiment
We presented people with stimuli that vary in terms of eight discrete features, each
of which can take on one of ten possible values. We varied three different factors, namely
(a) the coherence of the diagnostic features within categories; (b) whether the learning task
was supervised or unsupervised; and (c) the category structure to be learned. All factors
were varied within-subject.
Method
Participants. 18 subjects were recruited from a paid participant pool largely consisting
of undergraduate psychology students and their acquaintances. The experiment took 1 hour
to complete and participants were paid $12 for their time.
Design. We varied the level of supervision involved, the quality of the data, and the
category structure.
• Supervision. There were two levels, supervised and unsupervised.
• Quality of the data. There were three coherence levels: 60%, 80%, or 100%. Coherence is defined and explained below.
• Category structure. There were five levels comprising an incomplete 2 x 3 design.
There were either 16 exemplars (divided evenly into either 8, 4 or 2 categories), or 8 exemplars (divided evenly into 4 or 2 categories). The design was incomplete because we could
not fit more than 16 exemplars easily on the computer screen, and dividing 8 exemplars
into 8 categories (so each item was its own category) made little sense. We manipulated
Figure 2. A schematic depiction of the nature of different datasets presented to both humans and
our model in Study 1. Items are associated with four coherent features (fC ) and four random ones
(fR ); for illustrative purposes we depict each feature as a digit and its value as the digit value,
although the actual data given to humans consisted of items with different visual features, as in
Figure 3. (a) An example dataset in the supervised condition with 16 items, four of whose fC
features are 100% coherent (all items in the category share the same feature value). For clarity, the
coherent features are shown in bold and are the four leftmost features, but neither humans nor the
model were told which features were coherent; this had to be learned and was randomized differently
on each trial. (b) An example dataset whose four fC features are 75% coherent: for each feature
and item, there is 25% probability that its value will differ from the value shared by most members
in the category. (c) The same dataset as in (b), but in the unsupervised condition. Here the
learner must ascertain the proper categorization and also draw the correct higher-order inference
about which features are coherent. (d) A sample first-order generalization task: given an item seen
already, which of the test items is in the same category: the one sharing the coherent features (top)
or the one sharing the random features (bottom)? (e) A sample second-order generalization, which
is the same except the model is presented with entirely new items with entirely new feature values.
category structure primarily so that participants did not see the same number of exemplars
or categories on every trial, thus limiting their ability to apply learning from previous trials.
For space reasons, we do not analyse the results broken down by this factor.
Overall, these three factors yielded a 2 x 3 x 5 within-subjects design, corresponding
to a total of 30 conditions completed by each subject in a random order. This sounds more
onerous than it was: each condition corresponded to only a single “trial”, and could be
completed relatively quickly.
Stimulus appearance. The appearance of the stimuli is illustrated in Figure 3. Each
stimulus consisted of a square with four characters (one in each quadrant) surrounded by
circles at the corners, each containing a character of its own. The characters corresponded
to the features of the items in the model datasets, and were designed to ensure that they
were distinguishable and discrete. The complete list of possible values for each feature is
provided in Appendix B.
Category description. The categories were designed around a “family resemblance”
scheme, where four of the features are coherent (denoted fC ) and formed the basis of the
family resemblance. For the coherent features, every category had a prototypical value for
that feature: a coherence level of c implies that the proportion of observed exemplars that
possessed the prototypical value is c. Non-prototypical values were selected randomly from
the other feature values. A schematic illustration of what these category structures looked
like is given in Figure 2.
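For concreteness, the sketch below generates exemplars for a single category under this family-resemblance scheme. It treats coherence as a per-exemplar probability of showing the prototypical value (as in the description of Figure 2); the exact procedure used to construct the experimental stimuli, including how precise proportions were enforced, is our own simplification.

```python
import numpy as np

def make_category(n_items, coherence, n_coherent=4, n_random=4, n_values=10, rng=None):
    """Generate exemplars for one category under the family-resemblance scheme.

    Each coherent feature has a prototypical value; each exemplar keeps that
    value with probability `coherence` and otherwise takes a random other value.
    The remaining (random) features are drawn uniformly at random.
    """
    rng = np.random.default_rng(rng)
    prototype = rng.integers(n_values, size=n_coherent)
    items = []
    for _ in range(n_items):
        coherent = [v if rng.random() < coherence
                    else rng.choice([u for u in range(n_values) if u != v])
                    for v in prototype]
        random_part = rng.integers(n_values, size=n_random).tolist()
        items.append(coherent + random_part)
    return items

# e.g., a category of four exemplars whose coherent features are 80% consistent
for item in make_category(4, coherence=0.8, rng=1):
    print(item)
```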
Procedure
As noted above, each of the 30 conditions constituted a single extended trial, and
each participant completed all 30 trials in a random order. Each trial had several phases.
In the sorting phase, participants were shown a set of novel objects on a computer screen.
On the unsupervised trials, they were asked to sort the objects into categories by moving
them around the screen with a mouse, and drawing boxes around the ones they thought
would be in the same category. On a supervised trial, the stimuli were already sorted into
the correct categories, with boxes drawn around stimuli to indicate which items belonged
together. Figure 3(a) illustrates part of a typical trial after the participant has correctly
sorted the eight items into four piles (categories) of two each and drawn a box with the
mouse around each category.
After the sorting phase, each participant was asked two generalization questions,
presented in random order. In the first-order generalization questions, they were shown a
stimulus that was one they had already seen during the sorting phase, and asked which of
two novel items was most likely to belong in the same category as that one. The second-order generalization questions were identical except that the participants were presented
with stimuli and feature values they had not seen before. All of the sorted items were
visible to participants throughout the entire trial. Sample generalization questions are
shown in Figures 3(b) and 3(c).
The real trials were preceded by two practice trials that had the same structure as
the real ones, but used stimuli that were easier to categorize: six of the eight features were
100% coherent with respect to the relevant category. In these practice trials, people were
informed what the correct classification would have been; this was done to ensure that the
participants understood the task. For the real trials, people were simply told how many of
the two questions they got correct, but not which ones.
Results
Computing model predictions. In order to assess how well people learned the categories and made the correct inferences, we compare human responses to the behavior of the
idealized Bayesian learner outlined earlier in the paper. In order to compute model predictions on the generalization questions we computed the posterior probability that each of the
two choice items belong in the same category as the original item, as described in Appendix
A. In first-order generalization, the original item already occurs in the dataset and the query
Figure 3. (a) End of the first phase of an unsupervised trial. The learner has correctly sorted
the eight objects into four categories composed of two items each. In this trial, coherence is 80%,
which means that the four coherent features (in this case, bottom-left circle, top-left square, bottom-left square, and top-right circle) do not perfectly align with the categories. (b) Sample first-order
generalization question for this trial. The test item corresponds to one of the items from the first
phase. The two options each differ from the test item by four features; in this case, the correct
answer is the item on the left (which shares the coherent features) rather than the item on the
right (which shares the random features). Participants selected their choice by clicking on it. (c)
Second-order generalization trials are identical except that none of the items contain feature values
that have been seen before in that trial. Here, the correct answer is on the right, since that choice
shares the coherent features with the test item.
is whether it is more likely to be in the same category as an item that shares a coherent
feature fC (a “correct” generalization) or a random feature fR . In second-order generalization the situation is identical, except the original item and the two choice items contain
feature values that have not occurred before. Generalization is evaluated independently for
each trial, and additional technical details are given in Appendix A.
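As an illustration of how such a generalization question can be scored, the sketch below computes the posterior-predictive probability of each choice item’s features given the items already assigned to the test item’s category, with θ integrated out under a symmetric prior. This captures only the category-level (first-order) part of the computation; the full model’s second-order predictions additionally depend on the learned class-level parameters α and β, and the feature values used here are invented for illustration.

```python
import numpy as np

def same_category_logscore(choice, category_items, n_values, prior_strength=1.0):
    """Log posterior-predictive probability of `choice`'s feature values, given
    the items already assigned to the test item's category (theta integrated out)."""
    logp = 0.0
    for f, v in enumerate(choice):
        counts = np.bincount([y[f] for y in category_items], minlength=n_values)
        logp += np.log((prior_strength + counts[v]) /
                       (n_values * prior_strength + counts.sum()))
    return logp

# The test item's category so far (coherent features first, then random features):
category = [(3, 3, 7, 1), (3, 3, 2, 9)]
# The choice sharing the coherent features scores higher than the one sharing the random ones.
print(same_category_logscore((3, 3, 0, 0), category, n_values=10))
print(same_category_logscore((0, 0, 7, 1), category, n_values=10))
```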
Comparing human performance to model predictions. Figure 4 displays the probability of making a correct generalization for the model (panel a) and the human participants
(panel b). These generalization probabilities are broken down by two factors: whether the
correct classifications were provided in the task (i.e., supervised versus unsupervised)
and whether the test question asked for a first-order generalization or a second-order one.
As Figure 4a illustrates, the model predicts that any effects of supervision or generalization
should be negligible: model performance is very similar in all four cases. Human generalization performance is somewhat different. Like the model, people showed almost identical
performance for first-order and second-order generalization items (79% and 77% correct
respectively). However, unlike the model, human performance was noticeably better in the
Figure 4. (a) Model generalization averaged across all datasets based on the nature of the category
information given. There is no significant difference between first- and second-order generalization.
Category information aids in generalization, but the effect is small. (b) Humans also do not show
a difference between first- and second-order generalization, although they benefit more from being
given category information.
supervised than the unsupervised condition (84% vs 73% correct respectively).
The third factor of theoretical interest is the different coherence levels of the data
provided to learners. As shown in Figure 5, the model predicts that generalization performance will degrade as the coherence of the data decreases. Human performance exhibits
the same qualitative pattern. Overall, when the data were 100% coherent, people generalized
correctly 83% of the time: this figure drops to 78% for data that are 80% coherent, and
74% for 60% coherent data.
To quantify the extent to which model predictions are supported, we note that the
theoretical predictions include both null effects (generalization type and supervision) and
actual effects (coherence), and as such it is inappropriate to rely on orthodox null hypothesis
tests because these do not allow us to quantify the strength of evidence for a null hypothesis.
With this in mind, we turn to Bayesian data analysis. The experiment is a standard
repeated measures design, with coherence acting as a continuous variable and generalization
and supervision treated as binary factors.3 Applying this analysis to the human data we find very
strong evidence for an effect of coherence (odds of about 36000:1), strong evidence for
an effect of supervision (odds of around 100:1) and modest evidence for a null effect of
generalization type (odds of around 4.5:1).4 In other words, human learners appear to
3 To be precise: consistent with repeated measures ANOVA, we assume a random intercept for each subject. Bayes factors reported rely on the Bayesian equivalent of a Type II test in ANOVA: for each main effect the null model corresponds to the full model minus the relevant predictor. Bayes factors were calculated using the BayesFactor package in R (Morey & Rouder, 2014, v 0.9.7). Because it is typical to obtain a range of possible factors within a confidence interval, for simplicity we report the approximate factor.
4 It is worth noting that (a) smaller odds ratios are typical for null effects because it is fundamentally more difficult to obtain such evidence. Nevertheless, (b) odds of 5:1, although modest, are essentially equivalent to the p < .05 standard in orthodox tests (Johnson, 2013).
Figure 5. (a) Model generalization averaged across all datasets based on coherence of the categories. Coherence affects generalization, especially in the unsupervised condition. (b) Human
generalization is also affected by category coherence.
mirror the Bayesian analysis in two respects: first-order generalization and second-order
generalization are equally easy, and the effect of low-quality data is similar for both model
and humans. Where human performance differs from the ideal observer model, however, is
in the usefulness of supervision, a topic to which we now turn.
Why does supervision matter more to humans than to the model? On two of the three
factors considered, human performance closely matches the predictions of an idealized
Bayesian analysis of the learning problem. The one point of discrepancy is that the model
performance on unsupervised learning problems differs only trivially from its performance
on the supervised version, implying that the featural information presented in the task is
(in principle) sufficiently rich to allow people to infer the correct category structure, and
hence make appropriate generalizations. However, people do not meet this standard, with
human performance dropping from 84% correct to 73% correct when stimuli are not
already appropriately grouped.
Why does this decline in performance occur? One possibility is that people simply
fail to identify the correct categories, and in these cases generalize incorrectly. Another
possibility is that they succeed in identifying the correct categories most of the time, but are
less confident in those categories or are for some other reason less able to make appropriate
generalizations on the basis of the categories that they have inferred. To investigate these
possibilities, we evaluate the correctness of category assignments using the Rand index,
which is a measure of similarity between two clusterings (in this case, the correct categories
vs. the category assignments made by the participants). In order to naturally correct for
guessing we use an adjusted measure (adjR), as in Hubert and Arabie (1985). Values range
Figure 6. (a) Participant performance on categorization task based on categorization success. The
highest group succeeded in finding the correct categories (had adjR scores above 0.5); the middle
group had adjR scores above chance, but not substantially so; and the lowest group was below
chance in sorting items into categories. Participants who succeeded in finding the
correct categories had high generalization performance, suggesting that people’s relatively poorer
performance in the unsupervised condition was probably due to a difficulty in identifying the
correct categories. (b) Among trials in which the correct categories were found (i.e., the highest
adjR group), overall generalization (collapsed across first- and second-order) was uniformly high,
regardless of coherence.
between -1 and 1, where 1 indicates perfect agreement between two clusterings, 0 indicates
the amount of agreement one would expect by chance, and negative values indicate less than
chance agreement. To understand human performance in the unsupervised condition, it
is useful to examine the relationship between adjR and generalization performance.
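The adjusted Rand index itself is straightforward to compute; as a minimal sketch, a widely available implementation is the one in scikit-learn, shown below on a hypothetical sorting of eight items into four two-item categories.

```python
from sklearn.metrics import adjusted_rand_score

# Correct category labels for eight items, and a participant's hypothetical sorting
# that merges two of the categories: adjR = 1 means perfect agreement, 0 is chance level.
true_labels        = [0, 0, 1, 1, 2, 2, 3, 3]
participant_labels = [0, 0, 1, 1, 2, 2, 2, 2]

print(adjusted_rand_score(true_labels, participant_labels))
```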
How well did people classify the stimuli in the unsupervised condition? Overall,
classification accuracy was high: 92% of trials showed classification above chance, 67% of
trials had adjR values above 0.5, and 47% of trials produced perfect classifications. When
we divide subject responses into bins based on these three categories (shown in Figure
6a), it is clear that there is a very strong effect of classification accuracy on subsequent
generalization performance. People who classified well also generalized well. A Bayes factor
analysis indicates that the strength of evidence favoring an effect of adjR over and above
effects discussed earlier is on the order of 10^32:1, indicating that it is virtually certain that
the difference is not due to chance. Moreover, if we restrict the analysis to those cases where
classification accuracy was high (adjR > 0.5) we find that the effect of coherence vanishes:
as illustrated in Figure 6b, the proportion of correct responses in the 60%, 80% and 100%
coherent conditions for those trials was 87%, 88% and 88% respectively (Bayes factor: 6:1
in favour of a null effect).
This suggests that although the featural information in the stimuli is sufficient to
learn the correct classifications (since the model does so), people require more information
to arrive at the correct categorization. This is why human performance is poorer in the
unsupervised condition. However, in those cases when people were able to find the correct
classification, their generalization performance was just as good as the model’s. A reasonable
hypothesis is that people have certain capacity limitations in processing and comparing all of
the features at once; thus when there are many features, being given the category information helps to highlight the important ones. If this hypothesis is true we can predict that
being given the categories should help people much more if there are lots of features than
if there are only a few (whereas it shouldn’t make a difference to the model). We test this
prediction, among others, in Study 2.
Summary
Overall, Study 1 demonstrates that – like our Bayesian model – people make second-order generalizations just as well as first-order generalizations. As previously noted, this has
implications for connectionist models such as ALCOVE in terms of how discrete features
are encoded and dimensional attention rules work. More generally, this result implies that
learning about features is more than learning to associate particular feature values with
particular categories; it is also about learning that a particular feature tends to be relevant
for an entire class of category. Novel feature values on a class-relevant feature are far more
likely to indicate a new category than novel feature values on a class-irrelevant feature.
The other key finding of Study 1 is that people are capable of learning categories as
well as forming an overhypothesis about how those categories are organized – but they are nevertheless greatly helped by being given the categories. Our model showed very
little improvement in the supervised condition, when told the categories, whereas
people showed a much larger improvement. In fact, when people identified the correct categories, they generalized as well as the model. This suggests that the problem people have
in the unsupervised condition is not inherent in the information in the stimuli. The difficulty lies in finding the categories in the first place, and arises perhaps out of limitations
in working memory or processing capacity on the part of our participants.
Study 2 arises naturally out of the questions that remain. First, how much of human
performance is constrained by capacity limitations relative to the model? We investigate
this by manipulating the visual complexity of the stimuli (i.e., the salience and number of
the features). Second, in the previous task, all categories were of the same class, and as a
result the same inductive bias held for all of them. Are people capable of learning multiple
different classes at the same time as learning the categories themselves?
Study 2: Learning multiple classes of category
Here we address some of the predictions suggested by the first study, along with
the more complicated learning problem of identifying both the correct category and the
correct class for individual objects (i.e., learning more than one overhypothesis or (α, β)
pair). We therefore present the model and humans with data in which some of the items are
exemplars from one class and some are exemplars from another. No learner is ever given the
class information, since that is rarely available in real life; however, as before, in some trials
the category information is given (supervised) while in some it is not (unsupervised).
Study 2 has two main goals. First, can human learners acquire both class and category
information simultaneously in a short time, and can the model explain how such learning
might be possible? Second, how does generalization depend on the number and salience
of the features learners are presented with? Our results indicate that both the model
and humans can learn both classes and categories, at least when there are fewer, more
coherent features; as the number of features increases and their salience decreases, human
performance plummets while model performance does not. The high model performance
indicates that even in this more difficult condition, there is sufficient information in the
input to arrive at the correct categorization; the low human performance suggests that
people fail to use this information, probably because of process-level limitations (e.g., on
working memory).
Method
In this experiment there were two between-subjects conditions varying in difficulty
level (easy vs hard). Within-subject, we varied two additional factors: (a) whether the
learning task was supervised or unsupervised; and (b) the category structure to be learned.
As before, the experiment was designed to present participants with a task and dataset as close as possible to those presented to the model.
Participants. 30 participants (15 per condition) were recruited from a paid participant
pool largely consisting of undergraduate psychology students and their acquaintances. The
experiment took 1 hour to complete (slightly less in the easy condition) and participants
were paid $12 for their time.
Design. As noted above, we varied the level of supervision involved, the level of
difficulty, and the category structure.
• Supervision. There were two levels, supervised and unsupervised.
• Difficulty level. There were two levels, easy and hard, which varied according to
stimulus appearance as described below.
• Category structure. There were five levels in an incomplete 2x3 design: there could be 16 exemplars (divided into 2 classes and 8 or 4 categories), 12 exemplars (divided into 2 classes and 6 or 4 categories), or 8 exemplars (divided into 2 classes and 4 categories). The structure differed slightly from Study 1 in order to ensure that each trial had two classes as well as at least two items per category. As before, we varied this factor so that there were fewer trial-to-trial regularities for our participants to learn, and it is not analyzed in more detail.
Overall, these three factors yielded a 2 x 2 x 5 design with 20 cells in total. Because difficulty was manipulated between subjects, each participant completed the 10 within-subject conditions (2 supervision levels x 5 category structures) at his or her difficulty level, in a random order.
Stimulus appearance. The items differed somewhat between the easy and hard conditions. The easy condition was designed to tax participants’ working memory the least, so it had only five features, all of which were 100% coherent. Four of the features were the four characters in the interior square part of the items from Study 1. The fifth feature, the color of the item, always indicated the class (in all other conditions and in Study 1, all items were white). We chose color because it was likely to be perceptually salient – hence making the task easier – but participants were unlikely to have the a priori expectation that it would denote classes, since it does not in the real world. To ease processing, the possible feature values for the other four features were restricted
Figure 7. Sample items in Study 2. (a) Items from the easy condition. The top four light items
belong to class 1, in which (in this example) the two leftmost features organize the categories.
The bottom four dark items correspond to class 2, in which the two rightmost features organize
the categories. (b) Items from the hard condition. The top four items belong to class 1 and are
distinguished from those in class 2 by their values on the features in the upper left circle and upper
right square. Within class 1, in this example the features that organize the categories are the upper
left square and the lower right circle. Within class 2, the features that organize the categories are
the lower left square and the lower left circle. Because coherence is 90%, occasionally feature values
will not be completely consistent within a category or class.
to characters the participants were more likely to be familiar with, like letters and numbers
rather than esoteric mathematical symbols. These four features were randomly allocated
to correspond to category organization for one of the two classes, with the constraint that
the two features for one class could not be on the diagonal (that is, they would be either
the top/bottom two or the right/left two). A list of the possible characters in the easy
condition is shown in Appendix B, and items from a sample trial are shown in Figure 7(a).
The items in the hard condition were nearly identical to those from Study 1, although
the features all had 90% coherence. Items appeared the same and the eight features and
feature values were the same. Which features indicated classes and categories was randomly
assigned from trial to trial. Figure 7(b) shows items for a sample trial.
Category description. Figure 8 schematically presents the data given to our learners. In all conditions, one feature (easy) or a group of features (hard) indicates which class the item belongs to; these features are analogous to solidity, the property that distinguishes non-solid substances from solid objects. These “class” features are indicated in the figure by underlining, but are not distinguished as class features to the learner, who must figure this out.
Figure 8. A schematic depiction of the nature of different datasets presented to both humans and
our model in Study 2. In all conditions, one or a group of features indicates which class the item
belongs to. These features are indicated in the figure by underlining, but do not appear different
to the learner. Depending on the value of the class feature(s), a different set of alternate features,
indicated in bold, are the important ones for categorization (just as solids are organized by shape
while non-solids are organized by material). In the actual datasets, which features indicate what
varies randomly from trial to trial. (a) In the easy condition, one feature (here indicated by the
middle number) indicates the class, two features (the first two numbers) organize categories within
class 1, and the other two features (the last two numbers) organize categories within class 2. (b)
In the hard condition, there are eight features, two of which are always random. Two others
indicate the class (in this case, they are the second and third numbers in the sequence). As in the
easy condition, two different features organize the categories in class 1 and class 2. The coherence
of the features is 90%, rather than 100% as in the easy condition. (c) There are two first-order
generalization questions, one for each class; examples from the easy condition are shown here. (d)
There are also two second-order generalization questions, again one for each class.
Classes are also distinguished by which features are important for categorizing within that class (e.g., within solids, shape is important, but within non-solids, material is more important).
These factors are the same for both conditions, which differ in how many features there are and in how coherent those features are. Among the five features in the easy condition, all of which are 100% coherent, one feature (the color) indicates the class. Within class 1, two of the other features (chosen randomly) are coherent with respect to categories in that class; within class 2, the remaining two features are coherent with respect to categories in that class. The hard condition has the same logical structure, but with eight rather than five features and 90% coherence. Two features are random with respect to all classes and categories, while two features (analogous to color in the easy condition) are
coherent with respect to classes. As in the easy condition, two of the remaining features
are coherent with respect to categories within class 1, and two others are coherent with
respect to categories within class 2.
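To make this logical structure concrete, the following sketch generates items with the structure of the easy condition: one class-indicating feature, two category-organizing features per class, and 100% coherence. It is an illustration only, not the experimental software; feature values are abstract integers rather than colors and characters, the function name and parameters are ours, and letting the non-organizing features vary freely within the other class is a simplifying assumption.

```python
import random

def make_easy_trial(n_per_category=2, n_categories_per_class=2, n_values=10, seed=0):
    """Generate items with the logical structure of the easy condition:
    feature 0 plays the role of the class indicator (color in the experiment),
    features 1-2 organize categories within class 0, and features 3-4 organize
    categories within class 1, all 100% coherent."""
    rng = random.Random(seed)
    items = []
    for klass in (0, 1):
        organizing = (1, 2) if klass == 0 else (3, 4)
        for cat in range(n_categories_per_class):
            # the organizing features take a single shared value within the category
            shared = {f: rng.randrange(n_values) for f in organizing}
            for _ in range(n_per_category):
                features = [rng.randrange(n_values) for _ in range(5)]  # other features vary freely (a simplification)
                features[0] = klass
                for f, v in shared.items():
                    features[f] = v
                items.append({"class": klass, "category": (klass, cat), "features": features})
    return items

for item in make_easy_trial():
    print(item)
```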
Procedure
Trial structure in both the easy and hard conditions was very similar to that of
Study 1. In the first phase subjects were asked to sort items (unsupervised trials) or
to observe the items with boxes already drawn around them (supervised). In all cases,
the boxes corresponded to the category structure; classes were never made explicit in any
way. As before, within each trial, the features that organized classes and categories were
randomly assigned (except that in the easy condition the color feature always indicated
class). The test trials were preceded by two sample trials with similar items that were
easier to categorize. All sample trials contained only one class, and the instructions asked
participants to categorize the items, making no reference to different ways of categorizing
different things. This was done intentionally in order to ensure as much as possible that
the participants were not biased to look for different classes or ways of categorizing within
items in a single trial.
Generalization questions were also similar to Study 1, except that four rather than
two were asked for each trial (as depicted in Figure 8). There was one question for each
class and order of generalization, and they were asked in random order. After completing
all of the questions, participants were told how many of the questions they got correct, but
not which ones. As in Study 1, there were two practice (sample) trials; neither contained more than one class, so participants could not have learned from them that there were two classes.
Results
Our main question was how well people performed in the two conditions. Overall, the easy condition was substantially easier than the hard condition: participants classified 93% of stimuli correctly in the easy condition, but were only slightly above chance in the hard condition, classifying 54% of items correctly (Bayes factor of 10^13:1 in favor of an effect). As before, there was no evidence for any effect of generalization type, with 75% of first-order items and 72% of second-order items classified correctly (a Bayes factor of 1:1 favors neither conclusion). Somewhat surprisingly, however, participants performed similarly in the supervised and unsupervised conditions, with accuracy of 75% and 73% respectively (a Bayes factor of 2:1 provides weak evidence in favor of a null effect).
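The Bayes factors reported in this paper were computed with standard tools (Morey & Rouder, 2014). Purely to illustrate the logic of a Bayes factor for accuracy against chance, the following sketch compares a point null (accuracy exactly at chance) against an alternative with a uniform prior on accuracy; the counts are hypothetical and this is not the analysis used in the paper.

```python
from math import comb

def bf_accuracy_vs_chance(correct, total, chance=0.5):
    """Bayes factor for 'accuracy differs from chance' vs 'accuracy = chance'.
    Under the alternative, accuracy has a uniform Beta(1,1) prior, so the
    marginal likelihood of the observed count is 1 / (total + 1)."""
    marginal_h1 = 1.0 / (total + 1)
    likelihood_h0 = comb(total, correct) * chance**correct * (1 - chance)**(total - correct)
    return marginal_h1 / likelihood_h0

# hypothetical counts: 600 correct responses out of 1000 yield a very large Bayes factor
print(bf_accuracy_vs_chance(600, 1000))
```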
How do we explain this performance? One possibility is simply that the easy condition was so easy that people scored at ceiling, and the hard condition so hard that even when it was supervised people were unable to pick out the underlying regularities. This would result in no difference between the supervised and unsupervised conditions in either case. However, an alternative possibility is that in the easy condition people were leveraging information from the supervised trials to use in the unsupervised ones. If one of the most difficult parts was realising that one had to sort both by class and within the classes, the supervised trials could communicate this information, which could then have been applied on the unsupervised trials.
Figure 9. Human performance in Study 2 by condition and supervision level. Within the easy and hard conditions, some participants completed both supervised and unsupervised trials (denoted supervised-paired and unsupervised-paired). In the hard condition people performed at chance, but they did significantly better in all versions of the easy condition. In order to explore whether people in that condition were using supervised trials to leverage their learning on unsupervised trials, an additional set of participants in the easy condition were shown only unsupervised trials (denoted unsupervised-standalone). People were able to leverage their learning from supervised trials on the unsupervised-paired trials: performance was better on those than on the unsupervised-standalone trials, in which such leveraging was impossible.
We tested this by running another version of the easy condition that contained only the unsupervised trials. 27 participants were recruited from either the first-year Psychology course at the University of Adelaide or the paid participant pool. The experiment took 30 minutes to complete and people were either paid $5 or received course credit. The procedure and instructions were exactly the same as in the easy condition of Study 2 except that participants saw only the unsupervised trials (including only the unsupervised sample trial).
As Figure 9 shows, performance on these unsupervised-standalone trials was markedly poorer than performance in the easy condition in which people also had some supervised trials to learn from (the unsupervised-paired trials). In the original version of the task, people classified 94% of items correctly in the unsupervised easy condition; in the standalone version of the task, only 68% of items were classified correctly. Although performance in the standalone condition declines relative to the original version (Bayes factor of 1900:1 in favor of an effect), it remains well above chance (Bayes factor on the order of 500:1). This above-chance performance implies that people were able to learn to categorize on multiple levels even when there was no information telling them that such categorization was called for. Rather, they were able to notice from the distribution of features that the most sensible category structure involved multiple levels of sorting.
[Figure 10 appears here: a grid of item co-assignment matrices, one panel per category structure (16 items/4 categories, 16 items/8 categories, 12 items/4 categories, 12 items/6 categories, and 8 items/4 categories), with one row of panels each for the easy-paired, easy-standalone, and hard conditions.]
Figure 10. Sorting performance by condition in the unsupervised trials. The x and y axes show each of the n items for each category structure. White colors in the graphs indicate that those
items were always sorted in the same categories as each other; black indicates that those items were
never sorted together. The top row shows participant assignments in the easy-paired condition.
In all cases the assignments people made conform very strongly to the true category structure. Thus, for instance, in the 16 item, 4 category structure, the first four items shared a
category, as did the next four, and so forth. The bottom row shows participant assignments in the
hard condition. The correct category assignments are exactly the same as in the easy condition,
making it evident that people are not generally clustering correctly in the hard condition. They
do often appear to notice the differentiation between the two classes, capturing the regularity found
in all structures in which the first half of the items are in one class and the second half are in the
other. People did pick up on this pattern but usually made no further categorization within class,
especially when the categories were small. The middle row shows the easy-standalone condition,
in which people made more correct assignments than in the hard condition but did not do as well
as in the easy-paired condition.
How does this performance relate to the categorizations people made? Figure 10
shows the clusters people made in each of the three unsupervised conditions (hard, easy-paired, and easy-standalone). It is clear that in the easy-paired condition most of the
clusters are correct: people correctly identify all of the categories. In the easy-standalone
condition they did less well but still often sorted correctly. By contrast, in the hard
condition, almost none of the clusterings identify the correct categories. People do appear
to cluster items into classes, but make no further differentiation. Since performance on the
test questions requires differentiating between categories within classes, it is not a surprise
that people perform at chance.
Does this explain the poor performance in the hard condition? Not entirely; recall that people are at chance even in the supervised trials in the hard condition, where they were given the correct categories. This suggests that, even when the categories are shown to them, people are not picking up on the regularities that define the categories.
Figure 11. Generalization by condition, split by adjR score. In both easy conditions, performance
on the test questions was higher when people sorted the items more correctly (there were no trials
with low adjR scores in the easy-paired condition). In the hard condition, by contrast, test
performance was unrelated to adjR score. This, in combination with the fact that test performance
was no higher in the supervised trials, suggests that even having correct category information did
not make the regularities that defined the categories apparent.
They do not realise that within one class certain features matter for defining the category, while within the other class different features matter. Given the high performance in the easy-paired condition, this probably does not reflect an inability to conceptualize the idea, but rather an inability to identify which features are important, and in which way, out of the many possible features that make up the stimuli in the hard condition. Consistent with this, if we break down performance by adjR value, we see that in both easy conditions there was a significant correlation between test performance and the correctness of people’s clusterings as measured by the adjR value (paired: r = 0.78, Bayes factor of 47:1 for an effect; standalone: r = 0.73, Bayes factor of 1173:1 for an effect), but there was no relationship between the two in the hard condition (r = −0.06, Bayes factor of 2.25:1 in favor of a null effect). Figure 11 illustrates this as well: performance in the hard condition did not vary between trials with different adjR values, but performance in both easy conditions did.
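The adjR measure referred to here is the adjusted Rand index of Hubert and Arabie (1985). As an illustration, a minimal sketch of computing it for a participant's sort against the true categories is shown below; the example partitions are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items
    (Hubert & Arabie, 1985): 1 = identical sorts, about 0 = chance agreement."""
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(n, 2) for n in contingency.values())
    sum_rows = sum(comb(n, 2) for n in Counter(labels_a).values())
    sum_cols = sum(comb(n, 2) for n in Counter(labels_b).values())
    total = comb(len(labels_a), 2)
    expected = sum_rows * sum_cols / total
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

# hypothetical example: a participant who merges two of the four true categories
truth = [0, 0, 1, 1, 2, 2, 3, 3]
sort  = [0, 0, 0, 0, 1, 1, 2, 2]
print(adjusted_rand_index(truth, sort))   # about 0.59
```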
Model performance, shown in Figure 12, contrasts interestingly with human performance. It illuminates which aspects of the task are learnable and suggests some possibilities for the source of the difficulties people had in the hard condition. The main finding is that the model did equally well in the easy and hard conditions but significantly better in the supervised trials than in the unsupervised ones, with no interaction. The improved performance when the categories are given is consistent with human performance in the easy condition (at least when compared to the unsupervised trials in which people could not leverage information from the supervised trials, which is appropriate since the model did not have that information either). As in Study 1, the supervised trials are easier because the category information is already given.
Figure 12. Model performance on Study 2 stimuli by condition and supervision level. The model qualitatively replicated the pattern shown by humans in the easy-standalone condition, performing well in the supervised trials and worse in the unsupervised trials. Unlike people, it did equally well in the hard condition. This suggests that people's poor performance in the hard condition is not because there was too little information in the task to learn the correct category structure; rather, performance limitations of some sort, which did not affect the model, prevented people from learning the correct categories or using that information to perform inferences about novel items.
The main difference between human and model performance is in the hard condition.
Humans performed at chance, even in the supervised trials, suggesting that even when the
categories were given they could not detect the regularities behind the category structure.
The fact that the model could still learn the categories quite well in the hard condition
suggests that the problems humans were having were not inherent to the categories or stimuli themselves. In other words, there was enough information and structure in the data for an optimal learner to make the correct generalizations; whatever prevented humans from learning, it was not a lack of information. The fact that an optimal computational-level model could learn but
humans could not suggests that humans were probably limited by process-level constraints,
perhaps on working memory.
One final thing worth noting is that people did better in the easy supervised trials
than the model did. How is this possible, if the model represents optimal performance?
There are probably two reasons for this. The first is that people, unlike the model, were
presented with many trials one after the other; for the model, each trial was calculated
independently of the others. Although the trials differed in the number of categories and
items per category, there were still trial-to-trial regularities that people could have learned:
for instance, that all trials had two classes, that the color feature was always the class
indicator, or that no categories contained an odd number of items. These regularities could
have helped substantially with the inference problem. In fact, it is interesting that people
might be capable of learning such regularities: they are another kind of overhypothesis
learning.
A second possibility has to do with a somewhat minor detail about how the model
calculates second-order generalization. It does so by marginalizing over all possible assignments of the test item to categories and classes. Occasionally the test item is assigned to
an entirely separate class and category; when this occurs, the model performs at chance on
the corresponding generalization. Although this performance is averaged in with the generalizations that occur when the test item is assigned to the correct class, the net effect is to lower the overall generalization probability relative to people, who (presumably
due to demand characteristics of the experiment) probably rarely if ever believe the test
item is in an entirely different class. Although this behavior by the model does affect the
quantitative generalization probability, it is not a fundamental problem with the model; it
could be fixed by simply changing the prior on γ. We chose not to do so for consistency,
because previous work used the existing prior.
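As a purely hypothetical numerical illustration of this dilution: if some fraction of posterior samples place the test item in a brand-new class, and the model generalizes at chance on those samples, the marginal generalization probability is a weighted average of the two cases. The numbers below are invented for illustration.

```python
def diluted_generalization(p_correct_class, p_gen_given_correct, chance=0.5):
    """Overall generalization probability when some posterior samples place the
    test item in an entirely new class and generalization is at chance there."""
    return p_correct_class * p_gen_given_correct + (1 - p_correct_class) * chance

# e.g. generalizing at 0.9 when the class is right, but positing a new class on 20% of samples
print(diluted_generalization(0.8, 0.9))   # 0.82, lower than for a learner who never posits a new class
```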
In sum, Study 2 is the first evidence we are aware of that people are capable of quickly
learning two different classes at the same time as categories in those classes, when given
completely novel objects with novel features. This learning is not trivial: people were only
capable of it when the features were salient and not too numerous, as in the easy condition.
The model demonstrates on a computational level how such regularities might be acquired,
and also shows that people’s failure when features were too numerous or noisy was probably
due to failures at the process level rather than inherent limitations on the information in
the data.
Discussion
This paper addressed two central questions. How it is possible to simultaneously learn
on multiple higher levels of abstraction? And are people capable of this sort of learning
quickly, given novel stimuli, or does it require natural stimuli, full supervision, or greater
time periods? We find that people can indeed learn one or even many classes, even if the
lower-level categories are not given (although learning is better when the categories are
given). Interestingly, second-order generalization appears to emerge in tandem with firstorder generalization; as soon as people learn an overhypothesis about a class or classes, they
are capable of using that information to form intelligent inferences about completely novel
items. As discussed previously, this “blessing of abstraction” is a nontrivial prediction made
by our model but not other common models of categorization. In addition to illustrating
how this sort of learning is possible, our model explains why first-order and second-order
generalization are equally easy, why being given category information is easier, and why
performance degrades as the number of items in a category decreases or the categories
themselves grow less coherent.
The model fails to predict human behavior in two ways in Study 2, but both of those failures are quite revealing. First, people performed far better in the unsupervised-paired trials than the model did in its unsupervised trials. This is probably because people were doing another kind of overhypothesis learning, leveraging information about the task from the supervised trials that they also saw. When another group of participants in the unsupervised-standalone condition was given only unsupervised trials, they performed much more in line with model predictions. This is compelling evidence that people are capable of interesting cross-trial learning, enabling them to go even beyond what would otherwise be possible. Second, the model also predicted much better performance in the hard condition than people achieved. Since the model operates on the computational level and is not constrained by limited memory or time, this suggests that human failure when there are numerous, more complex features is probably due to limitations of memory or other process-level abilities. A full account of performance on this task would need to address this level as well; however, our computational-level approach has demonstrated both that this learning problem exists and how a solution to it is possible.
Although there is already an array of existing models that can achieve aspects of the
learning tasks in this paper, to our knowledge ours is the only model that can simultaneously
learn overhypotheses about multiple classes at once, while also learning about the categories
themselves, in an unsupervised fashion. Other models can categorize items but not learn
overhypotheses about classes (Anderson, 1991) or learn multiple classes (Kruschke, 1992;
Love et al., 2004; Perfors & Tenenbaum, 2009); some learn multiple classes but cannot
simultaneously categorize items (Kemp et al., 2007; Sanborn et al., 2009; Perfors et al.,
2010). Overhypothesis learning has a great deal in common with dimensional attention
(e.g., Kruschke, 1992; Love et al., 2004), but dimensional-attention learning models such
as ALCOVE or SUSTAIN do not simultaneously learn attention to different dimensions in
multiple classes at once. Furthermore, as we have seen, they also do not predict that first-order and second-order generalization (at least involving categories with discrete features) should be equally easy (as they are here).
The empirical findings reported here are also, to our knowledge, the first evidence
that people are capable of learning multiple classes at once over novel features while also
learning the underlying category. The closest similar work is in the area of knowledge
partitioning, which investigates people’s ability to learn one regularity in one context at the
same time as a very different regularity in another (e.g., Lewandowsky & Kirsner, 2000;
Kalish, Lewandowsky, & Kruschke, 2004; Yang & Lewandowsky, 2004; Navarro, 2010).
However, both models and experiments in this area generally either do not involve category learning at all, or involve people learning multiple contexts (or classes) only in
a supervised way. They do not show people, as we do, learning on multiple levels at once:
figuring out how to put (unlabelled) items in categories while at the same time figuring out
different regularities for different classes of items or different contexts.
A final open question is how the ability to learn on multiple levels at once depends
on having all of the information about all of the categories visible at all times, as in our
experiments. In real life, people are not often given examples of many category members
from several classes all at once; rather, they experience individual exemplars one-by-one,
in multiple contexts, sometimes with labels and sometimes without. This greatly increases
the load on working memory which, our evidence suggests, is key to successful learning of
this sort. It is possible that people will be unable to simultaneously learn on multiple levels
if the items are presented more like typical category learning experiments, with each item
seen and labelled (or not) one-by-one. If this is the case, perhaps children take years to
acquire multiple overhypotheses because that length of time is necessary to overcome other
constraints on working memory, or for different kinds of long-term learning to occur. We
are exploring this possibility with additional experiments in our lab.
Appendix A
Structure learning
The structural component of the model organizes observed stimuli into a two layer
hierarchy, with stimuli assigned to categories and categories assigned to a class. Formally,
we let c denote a vector that assigns each stimulus to a category, such that ci = j if the
i-th stimulus belongs to the j-th category. Similarly we use k to denote a vector assigning
each category to a class, such that kj = z if the j-th category belongs to the z-th class.
The learner does not know the number of categories or the number of classes. In simpler
models where objects are assigned only to categories but the number of such categories is
not known in advance, it is typical to make the assignments using a simple method known
as the Chinese restaurant process (CRP). Although the CRP describes a sequential process,
the assignments generated by it are exchangeable: the actual order in which assignments
are generated does not affect the probability of the overall partition.
The natural analog of the CRP for the classes-and-categories model is known as the nested Chinese restaurant process (nCRP), which we now describe. This distribution consists of two separate CRPs, one associated with classes and the other associated with categories. Suppose the learner has a set of assignments c categorizing the first n stimuli and a class vector k that organizes the existing categories. The prior probability that observation n + 1 belongs to the j-th existing category is given by

P(c_{n+1} = j \mid c_1, \ldots, c_n) = \frac{n_j}{n + \gamma}

where n_j denotes the number of observations already assigned to that category. However, there is some probability that the new observation belongs to a hitherto unseen category. If we let m denote the number of categories that have been seen up to this point, then the probability that item n + 1 belongs to category m + 1 is

P(c_{n+1} = m + 1 \mid c_1, \ldots, c_n) = \frac{\gamma}{n + \gamma}

However, when a new observation is assigned to a new category, this new category must itself be assigned to a class. This assignment is made using the same rule. The new category is assigned to the z-th existing class with probability proportional to the number of existing categories that belong to that class,

P(k_{m+1} = z \mid k_1, \ldots, k_m) = \frac{m_z}{m + \gamma}

It is assigned to an entirely new class with probability proportional to γ. If the total number of classes observed so far is q, then

P(k_{m+1} = q + 1 \mid k_1, \ldots, k_m) = \frac{\gamma}{m + \gamma}
In the general version of the nested CRP, the value of γ for the category-level assignments can differ from the value for the class-level assignments, but for the current paper we keep them the same. In statistical notation, this process describes a joint prior over c and k that is written

c, k \sim \mathrm{nCRP}(\gamma)

We fix γ = 1 for all model-fitting exercises in the paper.
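A minimal sketch (ours, not the original implementation) of drawing category and class assignments from this nested CRP with a shared γ, just to make the sequential sampling rule concrete:

```python
import random

def sample_ncrp(n_items, gamma=1.0, seed=0):
    """Draw (c, k): category assignments for items and class assignments for
    categories, from the nested Chinese restaurant process described above."""
    rng = random.Random(seed)
    c, k = [], []                        # c[i] = category of item i; k[j] = class of category j
    category_sizes, class_sizes = [], []
    for n in range(n_items):
        # item joins existing category j with prob n_j / (n + gamma), a new one with prob gamma / (n + gamma)
        weights = category_sizes + [gamma]
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == len(category_sizes):     # brand-new category: it must be assigned to a class
            category_sizes.append(0)
            m = len(k)                   # number of categories seen so far
            cls_weights = class_sizes + [gamma]
            z = rng.choices(range(len(cls_weights)), weights=cls_weights)[0]
            if z == len(class_sizes):    # brand-new class
                class_sizes.append(0)
            class_sizes[z] += 1
            k.append(z)
        category_sizes[j] += 1
        c.append(j)
    return c, k

c, k = sample_ncrp(16)
print("categories of items:", c)
print("classes of categories:", k)
```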
Distributional learning
The rest of the model is defined conditional on the class and category assignments. Let y_i denote the observed feature vector for the i-th stimulus, where y_{if} = v implies that the i-th stimulus has value v on feature f. As above, let c_i denote the category to which item i is assigned. Then the probability with which this feature value would be observed is denoted \theta_v^{(c_i, f)}. In statistical terms, we say that the value for y_{if} is sampled from a multinomial distribution with size 1 and probability vector \theta^{(c_i, f)},

y_{if} \sim \mathrm{Multinomial}(\theta^{(c_i, f)}, 1)

Of course, the learner does not know the probability distribution over feature values associated with any of the categories, and must learn them from the observations. For the sake of clarity, we now simplify the notation somewhat. Because the model is identical for all features f, we drop the dependence on the specific feature, and simplify the notation that describes the category and class assignments, referring in generic terms to category c and the class k to which it belongs. In this notation, instead of using \theta^{(c_i, f)} to refer to the distribution over values for feature f associated with the category c_i to which the i-th item belongs, we use \theta_c to refer to one such distribution.
As described in the main text, the learner's prior beliefs about a category are shaped by the class to which that category belongs. Specifically, \theta_c depends on the distribution over feature values \beta_k associated with class k, and on \alpha_k, the parameter that describes the homogeneity of categories of class k. We adopt the standard Dirichlet distribution to describe this belief:

\theta_c \sim \mathrm{Dirichlet}(\alpha_k \beta_k)

By multiplying the vector \beta_k that describes the feature distribution by the homogeneity parameter \alpha_k, this prior can encompass the full range of possible Dirichlet distributions, using a parameterization that makes more sense psychologically than the usual method.
In order for a Bayesian learner to acquire the class-level knowledge that \alpha_k and \beta_k provide, the prior uncertainty about these parameters must also be described. The prior over the feature distribution \beta_k for a class takes the same form as the prior over features for a category, and is described by a Dirichlet distribution,

\beta_k \sim \mathrm{Dirichlet}(\mathbf{1}\mu)

where \mathbf{1} denotes a vector consisting entirely of 1s, indicating that the model has no a priori biases to expect some feature values over others, and \mu describes the expected homogeneity across classes. It plays the same role for classes that \alpha plays for categories, but is not central
to the current paper. The prior over \alpha favors small values (homogeneity), with the strength of that preference captured by the parameter \lambda:

\alpha_k \sim \mathrm{Exponential}(\lambda)
Finally, because the parameters \lambda and \mu play an important role in controlling the expectations that the learner has at the most general level, we place diffuse priors over them and allow the model to learn them from data:

\lambda \sim \mathrm{Exponential}(1)
\mu \sim \mathrm{Exponential}(1)
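The following sketch illustrates the distributional part of the generative model for a single feature, given fixed category and class assignments. It is an illustration only: the sizes are made up, the function name is ours, and note that numpy's exponential takes a scale (1/rate) rather than a rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_values = 10   # possible values for this feature (illustrative)

def sample_feature_column(categories, classes_of_categories, lam=1.0, mu=1.0):
    """Generate one feature's value for every item, following the hierarchy:
    beta_k ~ Dirichlet(mu * 1), alpha_k ~ Exponential(lam),
    theta_c ~ Dirichlet(alpha_k * beta_k), y_i ~ Multinomial(theta_c, 1)."""
    n_classes = max(classes_of_categories) + 1
    beta = [rng.dirichlet(mu * np.ones(n_values)) for _ in range(n_classes)]
    alpha = [rng.exponential(1.0 / lam) for _ in range(n_classes)]   # scale = 1/rate
    theta = [rng.dirichlet(alpha[k] * beta[k]) for k in classes_of_categories]
    return [int(rng.choice(n_values, p=theta[c])) for c in categories]

# four categories, the first two in class 0 and the last two in class 1; four items per category
categories = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4
classes_of_categories = [0, 0, 1, 1]
print(sample_feature_column(categories, classes_of_categories))
```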
Inference
In this section, we simplify the notation further for the sake of clarity. The data available to the learner consist of a collection of feature vectors y. The learner must infer the category and class assignments (c, k) along with the distributional information at both levels (\theta, \beta, \alpha) and the top-level information (\lambda, \mu). The description above outlines a joint prior distribution over all these parameters, P(c, k, \theta, \beta, \alpha, \lambda, \mu). Given data y, a Bayesian learner infers the posterior distribution,

P(c, k, \theta, \beta, \alpha, \lambda, \mu \mid y) = \frac{P(y \mid c, k, \theta, \beta, \alpha, \lambda, \mu)\, P(c, k, \theta, \beta, \alpha, \lambda, \mu)}{P(y)}

and the structure of the model allows the prior to be factorized as follows:

P(c, k, \theta, \beta, \alpha, \lambda, \mu) = P(\theta \mid \beta, \alpha)\, P(\beta \mid \mu)\, P(\alpha \mid \lambda)\, P(\lambda)\, P(\mu)\, P(c, k)
Given the complexity of the model, computing exact values for the posterior probabilities is not feasible, but we can use Markov chain Monte Carlo methods to draw samples
from this posterior distribution (Gilks, Richardson, & Spiegelhalter, 1996). When c and
k are not given, the process of inference alternates between fixing c and fixing k; while
each is fixed, the other is sampled, along with the hyperparameters α, β, λ, and µ. The
hyperparameters are estimated with an MCMC sampler that uses Gaussian proposals on
log(α), log(λ), and log(µ); proposals for β are drawn from a Dirichlet distribution with the
current β as its mean. Each Markov chain was run for 8,000 iterations with 1,000 discarded
as burn-in. Proposals on c and k included splitting, merging, shifting one element, and adding or deleting categories or classes, respectively.
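As an illustration of the kind of update involved (not the original sampler), here is one random-walk Metropolis step on log(α) with a Gaussian proposal. The function name is ours, and log_posterior is a placeholder for the unnormalized log posterior density of α with everything else in the model held fixed.

```python
import math, random

def mh_update_log_alpha(alpha, log_posterior, step=0.5, rng=random):
    """One random-walk Metropolis step on log(alpha). The change of variables
    from alpha to log(alpha) contributes a factor of alpha to the target density."""
    log_a = math.log(alpha)
    log_a_new = log_a + rng.gauss(0.0, step)
    alpha_new = math.exp(log_a_new)
    # density of log(alpha) is density of alpha times alpha, hence the added log terms
    log_ratio = (log_posterior(alpha_new) + log_a_new) - (log_posterior(alpha) + log_a)
    if math.log(rng.random()) < log_ratio:
        return alpha_new
    return alpha

# toy target: Exponential(1) prior on alpha with no likelihood term
alpha = 1.0
for _ in range(1000):
    alpha = mh_update_log_alpha(alpha, lambda a: -a)
print(alpha)
```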
Given the ability to draw samples from the posterior distribution over parameters,
calculating first- and second-order generalization probabilities is relatively straightforward.
For two stimuli a and b in a test trial, the quantity of interest is P(c_a = c_b \mid y), the posterior probability that items a and b belong to the same category given all of the observed feature vectors y. However, since every sample is a draw from the full posterior distribution P(c, k, \theta, \beta, \alpha, \lambda, \mu \mid y), all we need to do to estimate P(c_a = c_b \mid y) is count the proportion of
posterior samples that assigned these items to the same category. The only difference between first- and second-order generalization, from the perspective of this model, is whether
or not the learner has previously seen one of the items (or, more precisely, any of the feature values belonging to any of the items) as part of the training set. The model was tested on
items with one of the coherent features fC , the class feature, and one of the random features fR , rather than the full complement; this was done in order to prevent the model from
placing all test items into entirely new categories and classes (an outcome people avoided
due to task demands). The same goal could have been accomplished simply by forcing the
model not to place the test items into new categories, or by changing the CRP prior γ, but
we deemed this option the most straightforward.
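Reading the generalization probability off the samples is then a matter of counting, as the following sketch illustrates; the posterior samples of the assignment vector c shown here are hypothetical.

```python
def coassignment_probability(samples, a, b):
    """Estimate P(c_a = c_b | y) as the fraction of posterior samples in which
    items a and b are assigned to the same category."""
    return sum(c[a] == c[b] for c in samples) / len(samples)

# hypothetical posterior samples of the assignment vector c over five items
samples = [
    [0, 0, 1, 1, 2],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 2, 2],
    [0, 1, 1, 2, 2],
]
print(coassignment_probability(samples, 0, 1))   # 0.75
print(coassignment_probability(samples, 3, 4))   # 0.75
```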
Results for Study 1 represent averages across 16 different simulated datasets for each
category structure and each order of generalization; results for Study 2, which was computationally much more intensive, represent averages across 4 simulated datasets for each category structure, order of generalization, and class membership for each item.
Appendix B
The stimuli in Study 1 and the hard condition of Study 2 had eight features, each of
which could take ten possible feature values. Stimuli varied from trial to trial in terms of
which features were coherent and which feature values occurred together in the same item,
but feature values of a given type always occurred in the same location (for instance, the
feature values in the lower right-hand circle always corresponded to a capital letter taken
from the first half of the alphabet). Table 2 shows all of the possible feature values for each
feature. Stimuli presented in the first phase drew from up to eight of those feature values
for each feature; the final two were reserved for the second-order generalization questions, which required new feature values.

Feature   Feature location      Possible feature values
1         upper-left circle     ! ? # @ $ % & + * >
2         upper-right circle    1 2 3 4 5 6 7 8 9 0
3         lower-left circle     q r s t u v w x y z
4         lower-right circle    A B C D E F G H J K
5         upper-left square     α β γ δ η θ λ ξ π σ
6         upper-right square    Γ ∆ Θ Ξ Π Σ Φ Ψ Ω ∀
7         lower-left square     c ↔ ≈ < ℵ ∞ ÷ ♣ ♥ ♠
8         lower-right square    √ ⊇ ∇ ⊥ ≡ ⊕ • ↓ ∂

Table 2: List of possible feature values for each of the eight features in Study 1 and the hard condition of Study 2.
The stimuli in the easy condition in Study 2 had five features, each of which could
also take ten possible feature values, as shown in Table 3. The feature indicating class was
always color, and the other four varied randomly from trial to trial (along with the specific
feature values of each). Colors were chosen to be maximally distinct from each other, and
which color corresponded to each class varied randomly from trial to trial. For instance, in
Trial 1, blue might indicate those items for which the two bottom features were coherent
with respect to categories; in Trial 2, there might be no blue items at all; and in Trial 8,
blue might indicate those items for which the two rightmost features were coherent with
respect to categories.

Feature   Feature location      Possible feature values
1         color                 red, green, blue, yellow, cyan, magenta, gray, orange, light blue, light brown
2         upper-left square     ! ? # @ $ % & + * >
3         upper-right square    1 2 3 4 5 6 7 8 9 0
4         lower-left square     m n r s t u v w x z
5         lower-right square    A B C D E F G H J K

Table 3: List of possible feature values for each of the five features in the easy condition of Study 2.
References
Anderson, J. (1991). The adaptive nature of human categorization. Psychological Review, 98(3), 409–429.
Booth, A., & Waxman, S. (2002). Word learning is ‘smart’: Evidence that conceptual information
affects preschoolers’ extension of novel words. Cognition, 84 , B11-B22.
Gilks, W., Richardson, S., & Spiegelhalter, D. (1996). Markov chain Monte Carlo in practice.
Chapman & Hall.
Goodman, N. (1955). Fact, fiction, and forecast. Cambridge, MA: Harvard Univ Press.
Griffiths, T., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. B. (2010). Probabilistic models
of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences,
14 , 357-364.
Griffiths, T., & Tenenbaum, J. B. (2006). Optimal predictions in everyday cognition. Psychological
Science, 17 , 767-773.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 193–218.
Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National
Academy of Sciences, 110 (48), 19313–19317.
Kalish, M., Lewandowsky, S., & Kruschke, J. (2004). Population of linear experts: Knowledge
partitioning in function learning. Psychological Review , 111 , 1072–1099.
Kemp, C., Perfors, A., & Tenenbaum, J. B. (2007). Learning overhypotheses with hierarchical
Bayesian models. Developmental Science, 10 (3), 307–321.
Kemp, C., & Tenenbaum, J. B. (2008). The discovery of structural form. Proceedings of the National
Academy of Sciences, 105 (31), 10687-10692.
Kemp, C., & Tenenbaum, J. B. (2009). Structured statistical models of inductive reasoning. Psychological Review , 116 (1), 20–58.
Kruschke, J. (1992). ALCOVE: An exemplar-based connectionist model of category learning.
Psychological Review , 99 , 22-44.
Kurtz, K. (2007). The divergent autoencoder (DIVA) model of category learning. Psychonomic
Bulletin and Review , 14 , 560-576.
Landau, B., Smith, L., & Jones, S. (1988). The importance of shape in early lexical learning.
Cognitive Development, 3 , 299–321.
Lee, M., & Navarro, D. (2002). Extending the ALCOVE model of category learning to featural stimulus
domains. Psychonomic Bulletin and Review , 9 (1), 43-58.
Lee, M. D. (2006). A hierarchical Bayesian model of human decision-making on an optimal stopping
problem. Cognitive Science, 30 , 555–580.
Lewandowsky, S., & Kirsner, K. (2000). Knowledge partitioning: Context-dependent use of expertise. Memory and Cognition, 28 , 295–305.
Love, B., Medin, D., & Gureckis, T. (2004). SUSTAIN: A network model of category learning.
Psychological Review , 111 (2), 309–332.
Macario, J. F. (1991). Young children’s use of color in classification: Foods as canonically colored
objects. Cognitive Development, 6 , 17–46.
Morey, R. D., & Rouder, J. N. (2014). BayesFactor: Computation of Bayes factors for common designs (R package version 0.9.7) [Computer software manual].
Navarro, D. (2010). Learning the context of a category. In J. Lafferty, C. Williams, J. Shawe-Taylor,
R. Zemel, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 23 (pp.
1795–1803). Curran Associates, Inc.
Perfors, A., Tenenbaum, J., & Wonnacott, E. (2010). Variability, negative evidence, and the
acquisition of verb argument constructions. Journal of Child Language, 37 , 607-642.
Perfors, A., & Tenenbaum, J. B. (2009). Learning to learn categories. In N. Taatgen, H. v. Rijn,
L. Schomaker, & J. Nerbonne (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 136–141). Austin, TX: Cognitive Science Society.
Perfors, A., Tenenbaum, J. B., & Regier, T. (2011). The learnability of abstract syntactic principles.
Cognition, 118 (3), 306-338.
Sanborn, A., Chater, N., & Heller, K. (2009). Hierarchical learning of dimensional biases in human
categorization. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, & A. Culotta (Eds.),
Advances in Neural Information Processing Systems 22 (Vol. 23, pp. 727–735). Curran Associates, Inc.
Sanborn, A., Griffiths, T., & Navarro, D. (2010). Rational approximations to rational models:
Alternative algorithms for category learning. Psychological Review , 117 , 1144–1167.
Shams, L., & Beierholm, U. (2011). Early integration and Bayesian causal inference in multisensory
perception. In J. Trommershauser, K. Kording, & M. Landy (Eds.), Sensory cue integration
(p. 251-262). Oxford University Press.
Smith, L., Jones, S., Landau, B., Gershkoff-Stowe, L., & Samuelson, L. (2002). Object name learning
provides on-the-job training for attention. Psychological Science, 13 (1), 13–19.
Soja, N., Carey, S., & Spelke, E. (1991). Ontological categories guide young children’s inductions
of word meaning. Cognition, 38 , 179–211.
Teh, Y.-W. (2009). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual
Meeting of the ACL (p. 985-992).
Williams, J., & Griffiths, T. (2013). Why are people bad at detecting randomness? A statistical
argument. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39 , 1473–
1490.
Yang, L.-X., & Lewandowsky, S. (2004). Knowledge partitioning in categorization: Constraints on
exemplar models. Journal of Experimental Psychology: Learning, Memory, and Cognition,
30, 1045–1064.