Simultaneous learning of categories and classes of categories: acquiring multiple overhypotheses Amy Perfors School of Psychology University of Adelaide Daniel J. Navarro School of Psychology University of Adelaide Joshua B. Tenenbaum Department of Brain & Cognitive Sciences Massachusetts Institute of Technology Word count: 11,474 Abstract This work investigates people’s ability to learn on multiple levels of abstraction at once by figuring out how to categorize novel items at the same time as making higher-order inferences about that categorization (e.g., that categories tend to be organized by shape). We show that not only are people capable of this sort of learning, called overhypothesis learning – but also that people can learn multiple different overhypotheses at once (while still simultaneously figuring out the categories). We present a computational model that illuminates both how this sort of learning is possible and sheds light on why people might fail to achieve it in certain circumstances. Introduction Learning is sometimes thought of as acquiring knowledge, as if it simply consists of gathering facts like pebbles scattered on the ground. Very often, however, effective learning also requires learning how to learn: forming abstract inferences about how those pebbles are scattered – how that knowledge is organized – and using those inferences to guide one’s future learning. Indeed, most learning operates on many levels at once. We do gather facts about specific objects and actions, and we also learn about categories of objects and actions. But an even more powerful form of human learning, evident throughout development, extends to even higher levels of abstraction: learning about classes of categories and making inferences about what categories are like in general. This knowledge enables us to learn entirely new categories quickly and effectively, because it guides the generalizations we can make about even small amounts of input. MULTIPLE OVERHYPOTHESES 2 Consider a learner acquiring knowledge about different classes of animal. After encountering many animals he or she notices that cats generally have four legs and a tail, spiders have eight legs and no tail, and monkeys have two legs and a tail. Because animals in the same category tend to have the same number of legs, whereas animals in different categories can have different numbers of legs, the feature is diagnostic. The learner can use this diagnostic information to infer that a new animal with eight legs is a spider and not a cat. In this inference, the learner compares a novel item to a familiar category, which we refer to as a first-order generalization. The majority of studies of human category learning tend to look at first-order generalization problems: the participant is required to categorize new items into one of several categories that he or she has alread learned. However, learning about the diagnosticity of different features allows the learner to go beyond first-order generalization, and draw inferences about entirely novel categories, which we refer to as second-order generalization. In the scenario described above, a learner who recognizes that “number of legs” is a diagnostic feature for animal categories will tend to infer that an animal that has six legs does not belong to any of the familiar categories (i.e., not a spider, cat or monkey). Moreover, he or she would expect that when additional members of this animal category are encountered, they too will have six legs. 
This second-order generalization relies on a much more abstract form of knowledge. In the first example, the learner classifies a new animal as a spider because it has eight legs, as do other spiders. In the second example, the learner draws inferences based on the abstract idea that "number of legs" is a diagnostic property for all (or most) animal categories, even those animal categories that have not yet been encountered. This abstract knowledge is typically referred to as an overhypothesis (Goodman, 1955) because it represents knowledge that is applied to an entire kind or class of category. Overhypotheses can be powerful tools for guiding inductive inferences about unfamiliar categories. However, the inferences suggested by a particular overhypothesis are only sensible if the new category is of the same qualitative class as previous ones. For instance, while "number of legs" is fairly diagnostic for animal categories – and therefore supports an overhypothesis about animals in general – it is much less useful when applied to furniture. Tables and stools can have one, three, four, six, or more legs and still be considered tables and stools. As a consequence, if the first example of a new class of furniture turns out to have three legs, there is no reason to expect that future examples will also have three legs. In other words, in order to apply an overhypothesis sensibly, a learner needs to infer the class of category to which it is applicable. Number of legs varies little within categories of animal, but varies more within categories of furniture. Similarly, taste is an important feature for categories of foods but not social categories; color is (unfortunately, and for some at least) an important feature for social categories but not artifact categories; function is an important feature for artifact categories but not biological categories; and shape is an important feature for solid categories but not substances. The fact that different overhypotheses are applicable to different classes of categories (simply referred to as "classes" from now on) makes the learning problem – already challenging – considerably harder. Instead of learning a single overhypothesis that is applicable to all categories, learners need to determine how many classes of category exist, determine the nature of the overhypothesis that applies to each class, and also figure out which categories belong to each class. All of this learning must occur over and above the task of assigning individual items to the correct categories. The problem is illustrated schematically in Table 1, which shows 12 stimuli that belong to four categories: cat, ball, mud and flour. The categories fall into two classes: cat and ball are object categories, whereas mud and flour are substance categories.

Table 1: Different features matter for different classes of categories. Object categories such as cat and ball tend to contain objects of a similar shape, but category members can vary a great deal in color. In contrast, substance categories such as mud and flour tend not to be organised in terms of typical shapes, but in terms of typical colors and textures.

Class       Category   Instance   Shape       Color
Object      Cat        1          Quadruped   Tabby
Object      Cat        2          Quadruped   White
Object      Cat        3          Quadruped   Grey
Object      Ball       1          Sphere      Blue
Object      Ball       2          Sphere      Red
Object      Ball       3          Sphere      Green
Substance   Mud        1          Pile        Brown
Substance   Mud        2          Puddle      Brown
Substance   Mud        3          Spherical   Brown
Substance   Flour      1          Pile        White
Substance   Flour      2          Cloud       White
Substance   Flour      3          Scatter     White
For stimuli that belong to one of the object categories, the diagnostic feature is shape (e.g., balls are generally round). But substances are easily moldable (e.g., flour can be encountered in a pile or scattered on a surface), so this is not a particularly important feature. For substance categories, the diagnostic features tend to be color or texture. In order to learn effectively in an environment that has this structure, the learner needs to be able to identify that cat and ball are categories of a different class as mud and flour, and should make inferences accordingly. Given the complexity of the learning task, one might think that human learners would struggle to solve it. Nevertheless, even 24-month-olds – younger in some cases – are able to form abstract overhypotheses about how categories are organized, forming different overhypotheses for different classes of things. They realize that categories corresponding to count nouns tend to have a common shape, but not a common texture or color (Landau, Smith, & Jones, 1988; Soja, Carey, & Spelke, 1991), whereas categories corresponding to foods often have a common color but not shape (e.g., Macario, 1991; Booth & Waxman, 2002). The advantages of acquiring this sort of overhypothesis is clear: teaching children a few novel categories strongly organized by shape results in early acquisition of the shape bias as well as faster learning even of other, non-taught words (Smith, Jones, Landau, GershkoffStowe, & Samuelson, 2002). This is a noteworthy result because it demonstrates that overhypotheses can rapidly be acquired on the basis of little input, but it raises questions about what enables such rapid acquisition. The work in this paper is motivated by these questions about how knowledge is acquired on multiple different higher levels of abstraction, and how that kind of learning interacts with lower-level learning about specific items. There are two central questions addressed in this paper. First, how is it possible to simultaneously learn on multiple different higher levels of abstraction? Second, are people capable of this sort of learning quickly, in the lab, on novel stimuli with novel features, MULTIPLE OVERHYPOTHESES 4 as opposed to slowly over years of development? We evaluate human behavior in series of experiments in which people must learn on multiple levels at once, in both a supervised and an unsupervised fashion. We also present a computational model of this learning. Comparing model performance to human behavior allows us to investigate how learning multiple overhypotheses depends on the ability to learn categories. Is the former possible only if the categories are given, or is this sort of learning possible in an unsupervised fashion as well? How do factors like category coherence affect what can be learned? How do these impact on the nature of the generalizations (both first-order and second-order) people make? For computational theories of learning, the central question – the ability to learn on multiple levels at once – poses something of a chicken-and-egg problem: the learner cannot acquire overhypotheses without having attained some specific item-level knowledge first, but acquiring specific item-level knowledge would be greatly facilitated by already having a correct overhypothesis about how that knowledge might be structured. This chicken-andegg problem is exacerbated even more if it is not known how many classes (and separate overhypotheses) there are. 
Often it is simply presumed that acquiring knowledge on the higher (overhypothesis) level must always follow the acquisition of more specific knowledge. A computational framework called hierarchical Bayesian modelling can help to explain how learning on multiple levels might be possible. This framework has been applied to domains as disparate as learning conceptual structure (Kemp & Tenenbaum, 2008), causal reasoning (Kemp & Tenenbaum, 2009), decision making (M. D. Lee, 2006), the acquisition of abstract syntactic principles (Perfors, Tenenbaum, & Regier, 2011), acquiring verb knowledge (Perfors, Tenenbaum, & Wonnacott, 2010), learning word-by-word statistics (Teh, 2009), and learning about feature variability (Kemp, Perfors, & Tenenbaum, 2007). In this framework, inferences about data are made on multiple levels: the lower level corresponds to specific item-based information, and the overhypothesis level corresponds to abstract inferences about the lower-level knowledge. In this paper we present a model of category learning which acquires knowledge about how specific items should be categorized as well as multiple higher-order overhypotheses (classes) about how categories in general are organized. This model differs from other models which can either categorize items but not learn multiple classes (Kruschke, 1992; Love, Medin, & Gureckis, 2004; Perfors & Tenenbaum, 2009) or learn multiple classes but not simultaneously categorize items (Kemp et al., 2007; Sanborn, Chater, & Heller, 2009; Perfors et al., 2010). Our new model can discover how to cluster items at the category level on the basis of their featural similarity, at the same time that it makes inferences about higher-level parameters (the overhypotheses) indicating which features are most important for organizing items into basic-level categories. It can learn these things in addition to how many classes (overhypotheses) there are in the first place. The model provides a basis against which to compare human performance. The structure of this paper is as follows. First, we describe this new model of category learning and how it is capable of solving the “classes and categories” problem. We then present two studies investigating people’s abilities to do the same, in a supervised learning task and an unsupervised learning task. The first study considers only the simplest version of the problem, in which only a single latent class exists. The second study examines a more complicated problem, one in which categories are organized into multiple classes. In both studies we consider first-order generalization and second-order generalization. By MULTIPLE OVERHYPOTHESES 5 comparing human performance to the Bayesian model, we find that there are some versions of the problem for which human performance is near optimal. On the other hand, there are versions of the problem for which people can perform the task, but are substantially worse than the model. The implications, both theoretical and methodological, are discussed. A computational analysis of the learning problem In this section we outline a Bayesian model of the “classes and categories” problem (see Appendix A for the complete specification). 
It can be viewed as an extension to existing Bayesian category learning models (Anderson, 1991; Perfors & Tenenbaum, 2009; Kemp et al., 2007; Perfors et al., 2010; Sanborn, Griffiths, & Navarro, 2010), one that is able to organize stimuli into categories, sort categories into classes, and to learn the abstract regularities that characterize different classes of categories. As with most Bayesian models, it is best viewed as a computational analysis of the learning problem. It specifies a probabilistic model that makes reasonable assumptions about the structure of the task, and learns in an optimal fashion given those assumptions. This approach to modelling human cognition is referred to as computational analysis, and it has been applied successfully in areas as diverse as vision, reasoning, and decision-making (see Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010). The model, which is schematically depicted in Figure 1, assumes that each stimulus is described in terms of a set of discrete features (e.g., number of legs, shape, colour, label, etc.), each of which can take on many different values. Formally, each stimulus is characterized by the feature vector y, where the i-th element of y indicates which value the i-th feature takes on. Given a set of n stimuli described by the vectors y1, y2, ..., yn, the model needs to solve two problems simultaneously. The primary task to be solved is the classification problem, in which the learner must infer the structure that organises the stimuli. As described here, this learning takes place at two different levels: the stimuli need to be partitioned into a set of categories, and the categories need to be partitioned into classes. We let c be a vector specifying which items belong in which category, and similarly k be a vector indicating which categories belong to each class. In real life the learner does not know ahead of time how many categories or classes exist, and this uncertainty is mirrored in the model, where it is captured by placing a prior distribution P(c) over all possible ways of sorting the stimuli into categories, and another prior P(k) over the assignments of categories to classes. As with previous Bayesian analyses (e.g., Anderson, 1991; Sanborn et al., 2010) we use the Chinese restaurant process to specify these priors: their primary function is to ensure that the model tries to fit the data using as few classes and categories as necessary. The second problem to be solved is the distributional learning problem: inferring the distribution over features associated with different classes and categories. First, consider the category level. For every category c and every feature f, the model must learn a vector θ^(cf) that specifies how likely each possible feature value is. For instance, the learner might infer that spiders have eight legs with probability 0.95, seven legs with probability 0.01, and so on. Formally, the v-th element of this vector, θ_v^(cf), describes the probability that a member of category c has value v for feature f. The θ vectors collectively describe a probability distribution over features for each category. Knowledge about θ allows the learner to view each category as a statistical ensemble of features, as is typical for family resemblance categories. It is this knowledge that allows the model to make first-order generalizations.
Figure 1. Our hierarchical Bayesian model.
Each setting of (α, β) is an overhypothesis: β represents the distribution of features across items within categories, and α represents the variability/uniformity of features within categories (i.e., the degree to which each category tends to be coherently organized with respect to a given feature, or not). The model is given data consisting of the features yi corresponding to individual items i, depicted here as abstract shapes. Learning categories corresponds to identifying the correct assignment of items to categories, and learning classes corresponds to identifying the correct assignment of categories to classes. In this schematic example, items are identified as being in one class or another according to whether they are drawn with solid or dotted lines; items with solid lines are categorized by a different feature (shape) than items with dotted lines (color). Thus, learning at all of these levels involves learning what feature(s) define different classes (the lines), as well as learning a conditional dependency between the value on that feature and which features matter for categorization.
At the class level the distributional learning problem is more complicated. A class is associated with a set of β vectors that play much the same role that θ plays for categories. For class k and feature f, β_v^(kf) specifies the probability that stimuli that belong to categories of class k have value v on feature f (roughly speaking). For instance, in the animals scenario described in the introduction the learner might infer that 40% of animals have four legs, 10% are two-legged, and so on. However, classes are richer knowledge structures than categories. At the category level, it is perfectly sufficient simply to describe the proportion of stimuli that take on a particular feature value: if a category has θ = 0.5 for some feature value, then 50% of category members should have that value on that feature. There is nothing else that needs to be stated. At the class level the story is different. There are multiple qualitatively different ways that a class can have β = 0.5 for some feature value. For instance, a class might consist of two categories that both have θ = 0.5 for that same feature value. In this situation, the categories have the same distribution over features (e.g., chimpanzees and bonobos might both show the same degree of variability in intelligence). A second possibility is that one category has θ = 1 and the other has θ = 0 (e.g., chimpanzees are almost always aggressive, bonobos are almost never aggressive). In both cases, the overall class-level distribution of features is much the same, but the variability is expressed at a different level: in the first case the variability occurs within categories, in the second case the variability occurs between categories. The model needs to be able to express this distinction, which it does by means of learned parameters α^(kf), each of which (roughly speaking) captures the extent to which categories that belong to class k tend to be homogeneous with respect to feature f. When α is small, it implies that category members all tend to have the same feature values, and most variation occurs between categories rather than within them. When α is large, categories tend to be less homogeneous with respect to that feature.
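The roles of α and β just described can be stated compactly. The display below is one plausible formalization, following the Dirichlet-multinomial parameterization of Kemp et al. (2007) that this model extends; the exact priors used here are specified in Appendix A, so this should be read as an illustrative sketch rather than the model's precise equations.

```latex
% One plausible formalization of the alpha/beta/theta relationship, in the
% style of Kemp, Perfors & Tenenbaum (2007); the priors actually used by the
% present model are given in Appendix A.
% For a category c assigned to class k, and a feature f with values v = 1..V:
\[
  \theta^{(cf)} \sim \mathrm{Dirichlet}\big(\alpha^{(kf)}\,\beta^{(kf)}\big),
  \qquad
  y_{if} \mid (c_i = c) \sim \mathrm{Categorical}\big(\theta^{(cf)}\big).
\]
% Small alpha^(kf): each category in class k is nearly "pure" on feature f
% (members share a value; variation lies between categories).
% Large alpha^(kf): category members are heterogeneous on feature f, and each
% category's distribution stays close to the class-level mean beta^(kf).
```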
Broadly speaking, α captures the extent to which categories of a particular class are organized by a particular feature, and it is this knowledge that enables second-order generalizations to occur. All of the theoretically interesting distributional learning in the model takes place at the class level and the category level. However, the model also allows a third level of knowledge representation, formalized in terms of two parameters λ and µ that describe what the model has learned about the regularities that are common to all classes of categories. This top-level knowledge describes the learner's beliefs about how homogeneous classes themselves tend to be. The learning that takes place at this level is not qualitatively important for our experiments, but in our simulations we found that the model behaves more appropriately when it learns the values of λ and µ from the data (as in Perfors et al., 2010), rather than fixing them at arbitrary values (as in Kemp et al., 2007). Nevertheless, the results were qualitatively the same either way. The overall structure of the knowledge acquired by this model is outlined in Figure 1. At the bottom level it learns which items go in which categories (i.e., it learns c) and it learns a distribution over features for each category (i.e., θ). At the next level it learns which categories belong to which class (i.e., k), it learns which features are characteristic of each class (i.e., the β values) and it also learns how homogeneous categories within each class tend to be on each of the features (i.e., it learns the α values). Finally, at the top level, it also learns some high-level expectations about classes and categories (i.e., λ and µ), though as noted above this third level is less important than the other two for our investigations. In order to acquire this structured knowledge, the Bayesian model specifies a joint prior distribution over all unobserved variables, namely c, k, λ, µ, β, α, and θ. When the model observes the features y1, y2, ..., yn that describe the stimuli, this prior is updated via Bayes' rule to a joint posterior distribution P(c, k, λ, µ, β, α, θ | y1, ..., yn) that describes the learner's beliefs about how the stimuli should be organized into classes and categories, and what properties those classes and categories have. The particular choice of prior distribution and the numerical methods used to approximate the posterior distribution are discussed in Appendix A.
Why use this model?
As discussed earlier, the Bayesian model is best viewed as a computational analysis of the learning problem, and as such it provides a normative standard for inference. When presenting such analyses it is common to focus on how human performance matches the predictions made by the model, thereby providing evidence that people are learning in an optimal fashion. However, normative standards for inductive tasks are also useful for highlighting differences between model predictions and human behaviour. This is especially relevant in the current context. Because of the highly structured representations it acquires, this model is capable of drawing powerful inferences from the data with which it is presented.
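To make the structure of these representations concrete, the sketch below shows one way the generative assumptions described above might be written in code. It is an illustration rather than the implementation used in our simulations: the CRP concentration parameters and the priors placed on α and β (standing in for the λ, µ level) are placeholder choices, and the actual specification and inference procedure are those of Appendix A.

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_assign(n, concentration):
    """Sequentially assign n items to groups under a Chinese restaurant process."""
    assignments = []
    for i in range(n):
        counts = np.bincount(assignments) if assignments else np.array([])
        probs = np.append(counts, concentration).astype(float)
        probs /= probs.sum()
        assignments.append(rng.choice(len(probs), p=probs))
    return np.array(assignments)

def sample_dataset(n_items=16, n_features=8, n_values=10,
                   crp_items=1.0, crp_categories=1.0):
    """Sample items from the class -> category -> item hierarchy (illustrative priors)."""
    # Partition items into categories, and categories into classes (CRP priors).
    c = crp_assign(n_items, crp_items)                 # item -> category
    n_categories = c.max() + 1
    k = crp_assign(n_categories, crp_categories)       # category -> class
    n_classes = k.max() + 1

    # Each class has, for every feature, a mean distribution beta and a
    # homogeneity parameter alpha (placeholder priors stand in for the
    # lambda/mu level described in the text).
    beta = rng.dirichlet(np.ones(n_values), size=(n_classes, n_features))
    alpha = rng.exponential(scale=1.0, size=(n_classes, n_features))

    # Each category draws a feature distribution theta ~ Dirichlet(alpha * beta),
    # so a small alpha yields nearly "pure" categories on that feature.
    theta = np.empty((n_categories, n_features, n_values))
    for cat in range(n_categories):
        for f in range(n_features):
            theta[cat, f] = rng.dirichlet(alpha[k[cat], f] * beta[k[cat], f])

    # Each item samples one value per feature from its category's distribution.
    y = np.empty((n_items, n_features), dtype=int)
    for i in range(n_items):
        for f in range(n_features):
            y[i, f] = rng.choice(n_values, p=theta[c[i], f])
    return y, c, k

# In the model proper, only y is observed; c, k, alpha, beta, theta (and the
# top-level lambda and mu) are recovered by inverting this process with Bayes'
# rule, using the sampling methods described in Appendix A.
y, c, k = sample_dataset()
```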
Given the appropriate data, it would be capable of solving the developmental problem described earlier: it could, without being told what any of the categories were, simultaneously acquire two different biases (e.g., that categories corresponding to count nouns are usually organized by shape and not by color, whereas food categories are more likely to share a common color). However, the very feature that makes the model powerful – learning rich representations – also makes it quite computationally intensive: it must track a large number of latent variables and it searches over a huge space of possible structured representations of the domain. Using modern computational statistics it is not too difficult to implement this model on a powerful enough machine, but it is less clear whether people are able to solve the problem with the same effectiveness. As this discussion illustrates, the problem of learning multiple different classes simultaneously, without being told the correct categorization, is a daunting one. It is clear from the developmental literature that people solve some version of the problem (how else could children learn the flexible biases they do?) but it is not clear what information they require to do so, or how that information should be presented. Can people solve this sort of task over a shorter time frame, as in a standard laboratory-based study, or is it something that is learnable only over many months or years? If people can learn to solve such problems, how efficiently do they do so? It might be a problem that people can solve to a near-optimal standard in the lab, like multisensory integration (Shams & Beierholm, 2011) or predictions about familiar events (Griffiths & Tenenbaum, 2006). Alternatively, it may be something that people can do to some extent but struggle to do well, like generating or detecting random sequences (Williams & Griffiths, 2013). In order to determine which is the case, it is useful to compare performance to the predictions of an "optimal" learner.
Study 1: Learning multiple categories of the same class
Our initial exploration is restricted to the case where all categories are of the same class. In this situation, a feature that tends to be consistent within one category will also tend to be consistent in all the other categories: for instance, most wugs are circular, most daxes are square and most feps are triangular. In contrast, a feature that varies within one category will tend to vary within others: in the example above, wugs, daxes and feps can all be many different colors. The experiment is designed on the basis of several observations about what an ideal learner should be able to do when presented with data of this sort.
Can people learn to make first- and second-order generalizations?
The first thing that the learner needs to do in this situation is learn to correctly classify new members of existing categories on the basis of familiar feature values. For instance, when shown a reddish circular object, the learner should use the shape information to guess that it is a wug, and ignore the nondiagnostic color information. Because this inference relies on familiar categories (wug) and uses familiar feature values (circles), it is an example of first-order generalization.
The learner should also be able to perform secondorder generalization: if shown a green pentagon, and asked whether it belongs to the same category as a blue pentagon or a green octagon, the learner should recognize that shape is more important than color, and select the blue pentagon. Because none of the specific feature values are familiar ones – the learner has not previously seen pentagons or octagons – this is second-order inference. If the Bayesian model described in the previous section provides a good description of human performance, we should expect that people will be able to make sensible first-order and second-order generalizations. In fact, as we show later, the Bayesian model presented here predicts that these two kinds of generalizations should be equally easy for people to make. One reason why this provides an interesting empirical test is that – somewhat counterintuitively – existing category learning models that include a selective attention component do not predict that second-order generalization should be as easy as first-order. To illustrate this point, consider the classic ALCOVE model of category learning (Kruschke, 1992), which has the ability to learn to selectively attend to diagnostic features and to ignore nondiagnostic ones. The original ALCOVE model relied exclusively on stimulus representations that encoded items in terms of continuous stimulus dimensions, but was later formally extended to handle stimuli that are best described in terms of a collection of discrete features (M. Lee & Navarro, 2002), as is the case here. However, it is not obvious how selective attention should operate over discrete features that take on more than two values. For instance, suppose the learner encounters objects that take on three possible shapes (circular, square, or triangular). How should “shape” be encoded as a feature within the model? One possibility is that shape corresponds to a single feature with three possible values. If the stimulus representation takes this form then the error-driven learning rules in ALCOVE allow it to learn to attend to the abstract notion of shape (required to make good second-order generalizations), but they do not allow it to pay special attention to some shapes and not others (required for good first-order generalization). This version of ALCOVE was implemented by M. Lee and Navarro (2002) but – like the original version of ALCOVE – was unable to learn to make first-order generalizations about very simple categories defined in terms of two three-valued features in a human-like fashion.1 The other alternative is for the feature shape to be viewed as a collection of three binary features: “is-circular”, “is-square” and “is-triangular.” When this occurs the model is able to attend separately to specific shapes, and as demonstrated by M. Lee and Navarro (2002) this ability is necessary to mimic human performance even in simple category learning tasks: people do recognize that in some situations a specific shape is relevant to the 1 This variant of ALCOVE was implemented by M. Lee and Navarro (2002) but was removed from the final version of the paper because its performance was qualitatively identical to the original ALCOVE model, and both performed very poorly on the learning tasks. MULTIPLE OVERHYPOTHESES 10 classification. However, when this stimulus representation is used ALCOVE cannot make second-order generalizations at all. The model can learn that “is-circular” is important for distiguishing wugs from daxes and “is-green” is not. 
But it cannot leverage this knowledge to infer that “is-pentagonal” is more likely to be useful than “is-purple” when asked to classify a purple pentagon. Because each separate feature value is treated as an independent representational unit, the model cannot learn higher-order generalizations at all. These considerations mean that – depending on which stimulus representation is chosen to describe discrete features – the selective attention mechanisms in ALCOVE either allow it to make good first-order generalizations or good second-order generalizations, but never both at once. In order to do both things simultaneously, ALCOVE would need a far more drastic redesign of the representational structure and/or learning rules, changes that would make it more akin to the Bayesian model presented here. The idea that first-order and second-order generalizations should be equally strong is a non-trivial and powerful prediction of our model. As we’ve seen, it runs counter to the predictions of other models; it also runs counter to our intuition, which suggests that it should be easier to learn about the specific items that you actually see than it should be to learn about abstract properties of those items. This model, however, explains why first-order and second-order generalizations are equally strong: although the inferences at higher levels are more abstract, there is actually more evidence relevant to them (since all individual items in all categories are pertinent). This “blessing of abstraction” is evident because of the computational-level analysis offered here. This discussion leads naturally to one question that Study 1 investigates. Do people treat first-order generalizations the same as second-order ones? Our model predicts that they should. Taken literally, ALCOVE and related models predict that people should be able to do one or the other but not both.2 It is therefore natural to ask what humans do. How do people’s first- and second-order generalizations compare? What kind of data do people need to do so? A related question pertains to the quality of data that the learner receives. As the category-learning experiments of Smith et al. (2002) demonstrated, it is possible for children to acquire an overhypothesis about the role of shape in categorization after being taught only a few novel nouns; however, it is not clear precisely what aspects of the input enabled such rapid acquisition. Was it the fact that the categories were organized on the basis of highly coherent features, or because the individual items were consistently labelled, effectively providing strong evidence about category assignments? Consider the issue of labelling. In many category learning tasks, learning is supervised: people are explicity told which items go in which categories. Do we require this kind of supervision to be able to learn how to make second-order generalizations? Although some category learning models are restricted to supervised learning (Kruschke, 1992), there are others that can infer categories using only the stimulus features that do not require explicit 2 As noted, ALCOVE could probably be adapted to make different predictions, but such adaptation would be far from trivial. As it stands, it and other models that rely on learned selective attention that operate similarly cannot. 
The theoretical point of interest is not a “Bayesian versus connectionist” or “computational level vs process level” distinction, but rather what kinds of inferences people can and cannot make, and what mental representations these inferences imply. MULTIPLE OVERHYPOTHESES 11 labelling (Anderson, 1991; Love et al., 2004; Kurtz, 2007; Perfors & Tenenbaum, 2009; Sanborn et al., 2010). In the experiment discussed below, the stimuli are designed in such a way that there is only a slight difference in the predictions made by the Bayesian model in a supervised and unsupervised context. The correct categorization is more or less determined by the stimulus features, so an ideal learner should not treat the two situations differently. However, performing the unsupervised task correctly is more demanding: the learner not only needs to make the correct generalizations to new items, she also needs to figure out how the old items should be organized into categories. It is not clear whether people will be able to make second-order generalizations when given an unsupervised learning problem, much less whether they will be able to solve this more complicated task as well as our ideal learner model. A related issue pertains to the overall “coherence” of the stimuli. It may be that in order to make complex generalizations, people require very clean data, which is why the intervention in Smith et al. (2002) was so effective. For instance, perhaps it is not sufficient to observe that many wugs are circular in shape (and so on) in order for people to be willing to infer the abstract rule that categories are organized by shape. Perhaps it needs to be the case that all or most wugs are circular. If people are sensitive to category coherence, comparison to our model can illustrate whether they need to be this sensitive or not. In other words, is poor generalization when categories are less coherent a result of the fact that there is just less information contained in the stimuli about the proper categorization? Or does poor generalization reflect cognitive limitations tracking and using the information that is there on the part of the human learner? Experiment We presented people with stimuli that vary in terms of eight discrete features, each of which can take on one of ten possible values. We varied three different factors, namely (a) the coherence of the diagnostic features within categories; (b) whether the learning task was supervised or unsupervised; and (c) the category structure to be learned. All factors were varied within-subject. Method Participants. 18 subjects were recruited from a paid participant pool largely consisting of undergraduate psychology students and their acquaintances. The experiment took 1 hour to complete and participants were paid $12 for their time. Design. We varied the level of supervision involved, the quality of the data, and the category structure. • Supervision. There were two levels, supervised and unsupervised. • Quality of the data. There were three coherence levels: 60%, 80%, or 100%. Coherence is defined and explained below. • Category structure. There were five levels comprising an incomplete 2 x 3 design. There were either 16 exemplars (divided evenly into either 8, 4 or 2 categories), or 8 exemplars (divided evenly into 4 or 2 categories). The design was incomplete because we could not fit more than 16 exemplars easily on the computer screen, and dividing 8 exemplars into 8 categories (so each item was its own category) made little sense. 
We manipulated MULTIPLE OVERHYPOTHESES 12 Figure 2. A schematic depiction of the nature of different datasets presented to both humans and our model in Study 1. Items are associated with four coherent features (fC ) and four random ones (fR ); for illustrative purposes we depict each feature as a digit and its value as the digit value, although the actual data given to humans consisted of items with different visual features, as in Figure 3. (a) An example dataset in the supervised condition with 16 items, four of whose fC features are 100% coherent (all items in the category share the same feature value). For clarity, the coherent features are shown in bold and are the four leftmost features, but neither humans nor the model were told which features were coherent; this had to be learned and was randomized differently on each trial. (b) An example dataset whose four fC features are 75% coherent: for each feature and item, there is 25% probability that its value will differ from the value shared by most members in the category. (c) The same dataset as in (b), but in the unsupervised condition. Here the learner must ascertain the proper categorization and also draw the correct higher-order inference about which features are coherent. (d) A sample first-order generalization task: given an item seen already, which of the test items are in the same category the one sharing the coherent features (top) or the one sharing the random features (bottom)? (e) A sample second-order generalization, which is the same except the model is presented with entirely new items with entirely new feature values. category structure primarily so that participants did not see the same number of exemplars or categories on every trial, thus limiting their ability to apply learning from previous trials. For space reasons, we do not analyse the results broken down by this factor. Overall, these three factors yielded a 2 x 3 x 5 within-subjects design, corresponding to a total of 30 conditions completed by each subject in a random order. This sounds more onerous than it was: each condition corresponded to only a single “trial”, and could be completed relatively quickly. Stimulus appearance. The appearance of the stimuli is illustrated in Figure 3. Each stimulus consisted of a square with four characters (one in each quadrant) surrounded by circles at the corner, each containing a character of its own. The characters corresponded MULTIPLE OVERHYPOTHESES 13 to the features of the items in the model datasets, and were designed to ensure that they were distinguishable and discrete. The complete list of possible values for each feature is provided in Appendix B. Category description. The categories were designed around a “family resemblance” scheme, where four of the features are coherent (denoted fC ) and formed the basis of the family resemblance. For the coherent features, every category had a prototypical value for that feature: a coherence level of c implies that the proportion of observed exemplars that possessed the prototypical value is c. Non-prototypical values were selected randomly from the other feature values. A schematic illustration of what these category structures looked like is given in Figure 2. Procedure As noted above, each of the 30 conditions constituted a single extended trial, and each participant completed all 30 trials in a random order. Each trial had several phases. In the sorting phase, participants were shown a set of novel objects on a computer screen. 
On the unsupervised trials, they were asked to sort the objects into categories by moving them around the screen with a mouse, and drawing boxes around the ones they thought would be in the same category. On a supervised trial, the stimuli were already sorted into the correct categories, with boxes drawn around stimuli to indicate which items belonged together. Figure 3(a) illustrates part of a typical trial after the participant has correctly sorted the eight items into four piles (categories) of two each and drawn a box with the mouse around each category. After the sorting phase, each participant was asked two generalization questions, presented in random order. In the first-order generalization questions, they were shown a stimulus that was one they had already seen during the sorting phase, and asked which of two novel items were most likely to belong in the same category as that one. The secondorder generalization questions were identical except that the participants were presented with stimuli and feature values they had not seen before. All of the sorted items were visible to participants throughout the entire trial. Sample generalization questions are shown in Figures 3(b) and 3(c). The real trials were preceded by two practice trials that had the same structure as the real ones, but used stimuli that were easier to categorize: six of the eight features were 100% coherent with respect to the relevant category. In these practice trials, people were informed what the correct classification would have been; this was done to ensure that the participants understood the task. For the real trials, people were simply told how many of the two questions they got correct, but not which ones. Results Computing model predictions. In order to assess how well people learned the categories and make the correct inferences, we compare human responses to the behavior of the idealized Bayesian learner outlined earlier in the paper. In order to compute model predictions on the generalization questions we computed the posterior probability that each of the two choice items belong in the same category as the original item, as described in Appendix A. In first-order generalization, the original item already occurs in the dataset and the query MULTIPLE OVERHYPOTHESES 14 Figure 3. (a) End of the first phase of an unsupervised trial. The learner has correctly sorted the eight objects into four categories composed of two items each. In this trial, coherence is 80%, which means that the four coherent features (in this case, bottom-left circle, top-left square, bottomleft square, and top-right circle) do not perfectly align with the categories. (b) Sample first-order generalization question for this trial. The test item corresponds to one of the items from the first phase. The two options each differ from the test item by four features; in this case, the correct answer is the item on the left (which shares the coherent features) rather than the item on the right (which shares the random features). Participants selected their choice by clicking on it. (c) Second-order generalization trials are identical except that none of the items contain feature values that have been seen before in that trial. Here, the correct answer is on the right, since that choice shares the coherent features with the test item. is whether it is more likely to be in the same category as an item that shares a coherent feature fC (a “correct” generalization) or a random feature fR . 
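To give a sense of what this comparison involves, the sketch below scores the two choice items against a single category's feature distribution θ. It is a deliberately simplified illustration: the model's actual predictions integrate over the full posterior on c, k, θ, α and β as described in Appendix A, and for second-order questions (novel feature values) the comparison runs through the class-level α and β rather than a familiar category's θ. The numbers and items below are hypothetical.

```python
import numpy as np

def p_same_category(item, theta_category):
    """Likelihood of an item's feature values under one category's feature
    distributions: theta_category[f][v] = P(feature f takes value v)."""
    return float(np.prod([theta_category[f][v] for f, v in enumerate(item)]))

def pick_match(theta_category, option_a, option_b):
    """Return which choice item is more likely to share the test item's category."""
    pa = p_same_category(option_a, theta_category)
    pb = p_same_category(option_b, theta_category)
    return "A" if pa >= pb else "B"

# Toy category: the first feature is coherent (value 0 with probability .9),
# the second is effectively random (uniform over four values).
theta = [np.array([0.90, 0.05, 0.03, 0.02]),
         np.array([0.25, 0.25, 0.25, 0.25])]
# Test item = (0, 1). Option A shares its coherent feature value; option B
# shares only the random feature value.
option_a = (0, 3)
option_b = (2, 1)
print(pick_match(theta, option_a, option_b))   # -> "A"
```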
In second-order generalization the situation is identical, except the original item and the two choice items contain feature values that have not occurred before. Generalization is evaluated independently for each trial, and additional technical details are given in Appendix A.
Comparing human performance to model predictions.
Figure 4 displays the probability of making a correct generalization for the model (panel a) and the human participants (panel b). These generalization probabilities are broken down by two factors: whether the correct classifications were provided in the task (i.e., supervised versus unsupervised) and whether the test question asked for a first-order generalization or a second-order one. As Figure 4a illustrates, the model predicts that any effects of supervision or generalization should be negligible: model performance is very similar in all four cases. Human generalization performance is somewhat different. Like the model, people showed almost identical performance for first-order and second-order generalization items (79% and 77% correct respectively). However, unlike the model, human performance was noticeably better in the supervised than the unsupervised condition (84% vs 73% correct respectively).
Figure 4. (a) Model generalization averaged across all datasets based on the nature of the category information given. There is no significant difference between first- and second-order generalization. Category information aids in generalization, but the effect is small. (b) Humans also do not show a difference between first- and second-order generalization, although they benefit more from being given category information.
The third factor of theoretical interest is the different coherence levels of the data provided to learners. As shown in Figure 5, the model predicts that generalization performance will degrade as the coherence of the data decreases. Human performance exhibits the same qualitative pattern. Overall, when the data were 100% coherent, people generalized correctly 83% of the time; this figure drops to 78% for data that are 80% coherent, and 74% for 60% coherent data.
Figure 5. (a) Model generalization averaged across all datasets based on coherence of the categories. Coherence affects generalization, especially in the unsupervised condition. (b) Human generalization is also affected by category coherence.
To quantify the extent to which model predictions are supported, we note that the theoretical predictions include both null effects (generalization type and supervision) and actual effects (coherence), and as such it is inappropriate to rely on orthodox null hypothesis tests because these do not allow us to quantify the strength of evidence for a null hypothesis. With this in mind, we turn to Bayesian data analysis. The experiment is a standard repeated measures design, with coherence acting as a continuous variable and generalization and supervision treated as binary factors.3 Applying this analysis to the human data, we find very strong evidence for an effect of coherence (odds of about 36000:1), strong evidence for an effect of supervision (odds of around 100:1) and modest evidence for a null effect of generalization type (odds of around 4.5:1).4 In other words, human learners appear to mirror the Bayesian analysis in two respects: first-order generalization and second-order generalization are equally easy, and the effect of low-quality data is similar for both model and humans. Where human performance differs from the ideal observer model, however, is in the usefulness of supervision, a topic to which we now turn.
3 To be precise: consistent with repeated measures ANOVA, we assume a random intercept for each subject. Bayes factors reported rely on the Bayesian equivalent of a Type II test in ANOVA: for each main effect the null model corresponds to the full model minus the relevant predictor. Bayes factors were calculated using the BayesFactor package in R (Morey & Rouder, 2014, v 0.9.7). Because it is typical to obtain a range of possible factors within a confidence interval, for simplicity we report the approximate factor.
4 It is worth noting that (a) smaller odds ratios are typical for null effects because it is fundamentally more difficult to obtain such evidence. Nevertheless, (b) odds of 5:1, although modest, are essentially equivalent to the p < .05 standard in orthodox tests (Johnson, 2013).
Why does supervision matter to humans more than the model?
On two of the three factors considered, human performance closely matches the predictions of an idealized Bayesian analysis of the learning problem. The one point of discrepancy is that the model's performance on unsupervised learning problems differs only trivially from its performance on the supervised version, implying that the featural information presented in the task is (in principle) sufficiently rich to allow people to infer the correct category structure, and hence make appropriate generalizations. However, people do not meet this standard, with human performance dropping from 84% correct to 73% correct when stimuli are not already appropriately grouped.
Why does this decline in performance occur? One possibility is that people simply fail to identify the correct categories, and in these cases generalize incorrectly. Another possibility is that they succeed in identifying the correct categories most of the time, but are less confident in those categories or are for some other reason less able to make appropriate generalizations on the basis of the categories that they have inferred. To investigate these possibilities, we evaluate the correctness of category assignments using the Rand index, which is a measure of similarity between two clusterings (in this case, the correct categories vs. the category assignments made by the participants). In order to naturally correct for guessing we use an adjusted measure (adjR), as in Hubert and Arabie (1985). Values range between -1 and 1, where 1 indicates perfect agreement between two clusterings, 0 indicates the amount of agreement one would expect by chance, and negative values indicate less than chance agreement.
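For concreteness, a small self-contained implementation of this adjusted measure is sketched below (equivalent functions exist in standard statistics libraries); the participant sort in the example is hypothetical.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(true_labels, inferred_labels):
    """Adjusted Rand index (Hubert & Arabie, 1985): 1 = identical partitions,
    0 = chance-level agreement, negative = worse than chance."""
    n = len(true_labels)
    # Contingency counts: how many items fall in each (true, inferred) pair.
    pair_counts = Counter(zip(true_labels, inferred_labels))
    a = Counter(true_labels)       # true category sizes
    b = Counter(inferred_labels)   # participant's category sizes

    index = sum(comb(nij, 2) for nij in pair_counts.values())
    sum_a = sum(comb(ai, 2) for ai in a.values())
    sum_b = sum(comb(bj, 2) for bj in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

# Hypothetical participant who merges two of the four correct categories:
truth        = ["w", "w", "x", "x", "y", "y", "z", "z"]
participant  = ["A", "A", "A", "A", "B", "B", "C", "C"]
print(round(adjusted_rand_index(truth, participant), 2))   # -> 0.59
```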
To understand human performance in the unsupervised condition, it is useful to examine the relationship between adjR and generalization performance. How well did people classify the stimuli in the unsupervised condition? Overall, classification accuracy was high: 92% of trials showed classification above chance, 67% of trials had adjR values above 0.5, and 47% of trials produced perfect classifications. When we divide subject responses into bins based on these three categories (shown in Figure 6a), it is clear that there is a very strong effect of classification accuracy on subsequent generalization performance. People who classified well also generalized well. A Bayes factor analysis indicates that the strength of evidence favoring an effect of adjR over and above effects discussed earlier is on the order of 10^32:1, indicating that it is virtually certain that the difference is not due to chance. Moreover, if we restrict the analysis to those cases where classification accuracy was high (adjR > 0.5) we find that the effect of coherence vanishes: as illustrated in Figure 6b, the proportion of correct responses in the 60%, 80% and 100% coherent conditions for those trials was 87%, 88% and 88% respectively (Bayes factor: 6:1 in favour of a null effect). This suggests that although the featural information in the stimuli is sufficient to learn the correct classifications (since the model does so), people require more information to arrive at the correct categorization. This is why human performance is poorer in the unsupervised condition. However, in those cases when people were able to find the correct classification, their generalization performance was just as good as the model's.
Figure 6. (a) Participant performance on the categorization task based on categorization success. The highest group succeeded in finding the correct categories (had adjR scores above 0.5); the middle group had adjR scores above chance, but not substantially; and the lowest group were below chance performance in sorting items into categories. Participants who succeeded in finding the correct categories had high generalization performance, suggesting that people's relatively poorer performance in the unsupervised condition was probably due to a difficulty in identifying the correct categories. (b) Among trials in which the correct categories were found (i.e., the highest adjR group), overall generalization (collapsed across first- and second-order) was uniformly high, regardless of coherence.
A reasonable hypothesis is that people have certain capacity limitations in processing and comparing all of the features at once; thus when there are many features, being given the category information helps to highlight the important ones. If this hypothesis is true, we can predict that being given the categories should help people much more if there are lots of features than if there are only a few (whereas it shouldn't make a difference to the model). We test this prediction, among others, in Study 2.
Summary
Overall, Study 1 demonstrates that – like our Bayesian model – people make second-order generalizations just as well as first-order generalizations. As previously noted, this has implications for connectionist models such as ALCOVE in terms of how discrete features are encoded and how dimensional attention rules work. More generally, this result implies that learning about features is more than learning to associate particular feature values with particular categories; it is also about learning that a particular feature tends to be relevant for an entire class of category. Novel feature values on a class-relevant feature are far more likely to indicate a new category than novel feature values on a class-irrelevant feature. The other key finding of Study 1 is that people are capable of learning categories as well as forming an overhypothesis about how those categories are organized – but, nevertheless, they are greatly helped by being given the categories instead.
Our model showed very little improvement in the supervised condition, when being told the categories, whereas people showed a much larger improvement. In fact, when people identified the correct category, they generalized as well as the model. This suggests that the problem people have in the unsupervised condition is not inherent in the information in the stimuli. The difficulty lies in finding the categories in the first place, and arises perhaps out of limitations in working memory or processing capacity on the part of our participants. Study 2 arises naturally out of the questions that remain. First, how much of human performance is constrained by capacity limitations relative to the model? We investigate this by manipulating the visual complexity of the stimuli (i.e., the salience and number of the features). Second, in the previous task, all categories were of the same class, and as a result the same inductive bias held for all of them. Are people capable of learning multiple different classes simultaneously with learning the categories as well? Study 2: Learning multiple classes of category Here we address some of the predictions suggested by the first study, along with the more complicated learning problem of identifying both the correct category and the correct class for individual objects (i.e., learning more than one overhypothesis or (α, β) pair). We therefore present the model and humans with data in which some of the items are exemplars from one class and some are exemplars from another. No learner is ever given the class information, since that is rarely available in real life; however, as before, in some trials the category information is given (supervised) while in some it is not (unsupervised). Study 2 has two main goals. First, can human learners acquire both class and category information simultaneously in a short time, and can the model explain how such learning MULTIPLE OVERHYPOTHESES 19 might be possible? Second, how does generalization depend on the number and salience of the features learners are presented with? Our results indicate that both the model and humans can learn both classes and categories, at least when there are fewer, more coherent features; as the number of features increases and their salience decreases, human performance plummets while model performance does not. The high model performance indicates that even in this more difficult condition, there is sufficient information in the input to arrive at the correct categorization; the low human performance suggests that people fail to use this information, probably because of process-level limitations (e.g., on working memory). Method In this experiment there were two between-subjects conditions varying in difficulty level (easy vs hard). Within-subject, we varied two additional factors: (a) whether the learning task was supervised or unsupervised; and (b) the category structure to be learned. As before, the experiment was designed to present participants with as close as possible to the exact task and dataset presented to the model. Participants. 30 participants (15 per condition) were recruited from a paid participant pool largely consisting of undergraduate psychology students and their acquaintances. The experiment took 1 hour to complete (slightly less in the easy condition) and participants were paid $12 for their time. Design. As noted above, we varied the level of supervision involved, the level of difficulty, and the category structure. • Supervision. 
There were two levels, supervised and unsupervised.
• Difficulty level. There were two levels, easy and hard, which varied according to stimulus appearance as described below.
• Category structure. There were five levels in an incomplete 2 x 3 design. There could be either 16 exemplars (divided into 2 classes and 8 or 4 categories), 12 exemplars (divided into 2 classes and 6 or 4 categories), or 8 exemplars (divided into 2 classes and 4 categories). The structure was slightly different from Study 1 in order to ensure that each trial had two classes as well as at least two items per category. As before, we varied this so that there were fewer trial-to-trial regularities for our participants to learn, and this is not analyzed in more detail.

Overall, these three factors yielded a 2 x 2 x 5 within-subjects design, corresponding to a total of 20 conditions completed by each subject in a random order.

Stimulus appearance. The items differed somewhat between the easy and hard conditions. The easy condition was designed to tax participants' working memory the least, so it had only five features, all of which were 100% coherent. Four of the features were the four characters in the interior square part of the items from Study 1. The fifth feature, the color of the item, always indicated the class (in all other conditions and in Study 1 all items were white). We chose color because it was likely to be perceptually salient – hence making the task easier – but it was unlikely that participants had the a priori expectation that it would denote classes, since it does not in the real world. To ease processing, the possible feature values for the other four features were restricted to characters the participants were more likely to be familiar with, such as letters and numbers rather than esoteric mathematical symbols. These four features were randomly allocated to correspond to category organization for one of the two classes, with the constraint that the two features for one class could not be on the diagonal (that is, they would be either the top/bottom two or the right/left two). A list of the possible characters in the easy condition is shown in Appendix B, and items from a sample trial are shown in Figure 7(a).

The items in the hard condition were nearly identical to those from Study 1, although the features all had 90% coherence. The items appeared the same, and the eight features and their possible values were the same. Which features indicated classes and categories was randomly assigned from trial to trial. Figure 7(b) shows items for a sample trial.

Figure 7. Sample items in Study 2. (a) Items from the easy condition. The top four light items belong to class 1, in which (in this example) the two leftmost features organize the categories. The bottom four dark items correspond to class 2, in which the two rightmost features organize the categories. (b) Items from the hard condition. The top four items belong to class 1 and are distinguished from those in class 2 by their values on the features in the upper left circle and upper right square. Within class 1, in this example the features that organize the categories are the upper left square and the lower right circle. Within class 2, the features that organize the categories are the lower left square and the lower left circle. Because coherence is 90%, occasionally feature values will not be completely consistent within a category or class.

Category description.
Figure 8 schematically presents the data given to our learners. In all conditions, one (easy) or a group of (hard) features indicates which class the item belongs to; these features are analogous to solidity as the feature that distinguishes substances from solids. These "class" features are indicated in the figure by underlining, but are not distinguished as class features to the learner, who must figure this out. Classes are also distinguished by which features are important for categorizing within that class (e.g., within solids, shape is important, but within non-solids, material is more important). These factors are the same for both conditions, which differ according to how many features there are and how coherent those features are. Among the five features in the easy condition, all of which are 100% coherent, one feature (the color) indicates the class. Within class 1, two of the other features (also chosen randomly) are coherent with respect to categories in that class; within class 2, the remaining two features are coherent with respect to categories in that class. The hard condition has the same logical structure, but with eight rather than five features and 90% coherence. Two features are random with respect to all classes and categories, while two features (analogous to color in the easy condition) are coherent with respect to classes. As in the easy condition, two of the remaining features are coherent with respect to categories within class 1, and two others are coherent with respect to categories within class 2.

Figure 8. A schematic depiction of the nature of different datasets presented to both humans and our model in Study 2. In all conditions, one or a group of features indicates which class the item belongs to. These features are indicated in the figure by underlining, but do not appear different to the learner. Depending on the value of the class feature(s), a different set of alternate features, indicated in bold, are the important ones for categorization (just as solids are organized by shape while non-solids are organized by material). In the actual datasets, which features indicate what varies randomly from trial to trial. (a) In the easy condition, one feature (here indicated by the middle number) indicates the class, two features (the first two numbers) organize categories within class 1, and the other two features (the last two numbers) organize categories within class 2. (b) In the hard condition, there are eight features, two of which are always random. Two others indicate the class (in this case, they are the second and third numbers in the sequence). As in the easy condition, two different features organize the categories in class 1 and class 2. The coherence of the features is 90%, rather than 100% as in the easy condition. (c) There are two first-order generalization questions, one for each class; examples from the easy condition are shown here. (d) There are also two second-order generalization questions, again one for each class.

Procedure

Trial structure in both the easy and hard conditions was very similar to that of Study 1. In the first phase subjects were asked to sort items (unsupervised trials) or to observe the items with boxes already drawn around them (supervised). In all cases, the boxes corresponded to the category structure; classes were never made explicit in any way.
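To make the structure just described concrete, the following sketch generates data with the same logical organization as an easy-condition trial. The feature indices, integer values, and the treatment of non-diagnostic features as fully random are illustrative assumptions, not the stimulus-generation code actually used in the experiment.

```python
import random

def make_easy_trial(n_categories_per_class=2, n_per_category=2, seed=0):
    """Sketch of the logical structure of an easy-condition trial (assumed).

    Feature 0 stands in for the class indicator (color in the experiment);
    features 1-2 organize categories within class 1 and features 3-4
    organize categories within class 2. Coherence is 100%, so the
    category-relevant values are identical for every member of a category;
    features that are not diagnostic for a class are filled in at random
    here. Integer values stand in for the characters and colors shown to
    participants.
    """
    rng = random.Random(seed)
    items = []
    for cls in (1, 2):
        relevant = (1, 2) if cls == 1 else (3, 4)
        irrelevant = (3, 4) if cls == 1 else (1, 2)
        for cat in range(n_categories_per_class):
            for _ in range(n_per_category):
                features = [0] * 5
                features[0] = cls                    # class indicator
                for f in relevant:
                    features[f] = cat                # shared by all category members
                for f in irrelevant:
                    features[f] = rng.randrange(10)  # uninformative within this class
                items.append({"class": cls, "category": (cls, cat), "features": features})
    return items

for item in make_easy_trial():
    print(item)
```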
As before, within each trial, the features that organized classes and categories were randomly assigned (except that in the easy condition the color feature always indicated class). The test trials were preceded by two sample trials with similar items that were easier to categorize. All sample trials contained only one class, and the instructions asked participants to categorize the items, making no reference to different ways of categorizing different things. This was done intentionally in order to ensure as much as possible that the participants were not biased to look for different classes or ways of categorizing within items in a single trial. Generalization questions were also similar to Study 1, except that four rather than two were asked for each trial (as depicted in Figure 8). There was one question for each class and order of generalization, and they were asked in random order. After completing all of the questions, participants were told how many of the questions they got correct, but not which ones. As in Study 1, there were two practice trials, neither of which contained more than one class, so participants could not have learned from them that there were two classes.

Results

Our main question was how well people performed in both of the conditions. Overall, the easy condition was substantially easier than the hard condition: participants classified 93% of stimuli correctly in the easy condition, but were only slightly above chance in the hard condition, with 54% of items being classified correctly (Bayes factor: 10^13:1 in favor of an effect). As before, there is no evidence for any effect of generalization type, with 75% of first-order items and 72% of second-order items classified correctly (Bayes factor of 1:1 favors neither conclusion). Somewhat surprisingly, however, participants performed similarly in the supervised versus unsupervised conditions, with accuracy of 75% and 73% respectively (Bayes factor of 2:1 provides weak evidence in favor of a null effect). How do we explain this performance? One possibility is simply that the easy condition was so easy that people scored at ceiling, and the hard condition was so hard that even when it was supervised people were unable to pick out the underlying regularities. This would result in no difference between the supervised and unsupervised condition in both cases. However, an alternate possibility is that in the easy condition people were leveraging information from the supervised trials to use in the unsupervised ones. If one of the most difficult parts was realising that one had to sort both by class and within the classes, the supervised trials could communicate this information, which could then have been applied on the unsupervised trials.

Figure 9. Human performance in Study 2 by condition and supervision level. Within the easy and hard conditions, some participants completed both supervised and unsupervised trials (these are denoted supervised-paired and unsupervised-paired). In the hard condition people performed at chance, but they did significantly better in all versions of the easy condition. In order to explore whether people in that condition were using supervised trials to leverage their learning in unsupervised trials, an additional set of participants in the easy condition were shown only unsupervised trials (these are denoted unsupervised-standalone).
People in the paired version were indeed able to leverage their learning from the supervised trials on the unsupervised-paired trials: performance was better on those than on the unsupervised-standalone trials, in which such leveraging was impossible. We tested this by running another version of the easy condition that contained just the unsupervised trials. 27 participants were recruited from either the first-year Psychology course at the University of Adelaide or the paid participant pool. The experiment took 30 minutes to complete and people were either paid $5 or received course credit. The procedure and instructions were exactly the same as in the easy condition of Study 2 except that they only saw the unsupervised trials (including only the unsupervised sample trial). As Figure 9 shows, performance on these unsupervised-standalone trials was markedly poorer than performance in the easy condition in which people also had some supervised trials to learn from (the unsupervised-paired trials). In the original version of the task, people classified 94% of items correctly in the unsupervised easy condition. However, in the standalone version of the task, only 68% of items were classified correctly. Although performance in the standalone condition declined relative to the original version (Bayes factor: 1900:1 in favor of an effect), it remained well above chance (Bayes factor of the order of 500:1). The above-chance performance in the standalone condition implies that people were able to learn to categorize on multiple levels even when there was no information telling them that such categorization was called for. Rather, they were able to notice from the distribution of features that the most sensible category structure involved multiple levels of sorting.

Figure 10. Sorting performance by condition in the unsupervised trials. The x and y axes show each of the n items for each category structure. White colors in the graphs indicate that those items were always sorted in the same categories as each other; black indicates that those items were never sorted together. The top row shows participant assignments in the easy-paired condition. In all cases the assignments people made conform very strongly to the actual category structure. Thus, for instance, in the 16 item, 4 category structure, the first four items shared a category, as did the next four, and so forth. The bottom row shows participant assignments in the hard condition. The correct category assignments are exactly the same as in the easy condition, making it evident that people are not generally clustering correctly in the hard condition. They do often appear to notice the differentiation between the two classes, capturing the regularity found in all structures in which the first half of the items are in one class and the second half are in the other.
People did pick up on this pattern but usually made no further categorization within class, especially when the categories were small. The middle row shows the easy-standalone condition, in which people made more correct assignments than in the hard condition but did not do as well as in the easy-paired condition.

How does this performance relate to the categorizations people made? Figure 10 shows the clusters people made in each of the three unsupervised conditions (hard, easy-paired, and easy-standalone). It is clear that in the easy-paired condition most of the clusters are correct: people correctly identify all of the categories. In the easy-standalone condition they did less well but still often sorted correctly. By contrast, in the hard condition, almost none of the clusterings identify the correct categories. People do appear to cluster items into classes, but make no further differentiation. Since performance on the test questions requires differentiating between categories within classes, it is not a surprise that people perform at chance. Does this explain the poor performance in the hard condition? Not entirely; recall that people are at chance even in the supervised trials in the hard condition, where they were given the correct categories. This suggests that, even when the categories are shown to them, people are not picking up on the regularities that define the categories – they do not realise that within one class, certain features matter for defining the category, and within the other class, different features matter. Given the high performance in the easy-paired condition, this probably does not reflect an inability to conceptualize the idea – but rather, the inability to identify which features are important, in which way, out of the many possible features that make up the stimuli in the hard condition. Consistent with this, if we break down performance by adjR value, we see that in both easy conditions there was a significant correlation between test performance and the correctness of people's clusterings as measured by the adjR value (paired: r = 0.78, Bayes factor: 47:1 for an effect; standalone: r = 0.73, Bayes factor: 1173:1 for an effect), but there was no relationship between the two within the hard condition (r = −0.06, Bayes factor: 2.25:1 in favor of a null effect). Figure 11 illustrates this as well: performance in the hard condition did not vary between trials with different adjR values, but performance in both easy conditions did.

Figure 11. Generalization by condition, split by adjR score. In both easy conditions, performance on the test questions was higher when people sorted the items more correctly (there were no trials with low adjR scores in the easy-paired condition). In the hard condition, by contrast, test performance was unrelated to adjR score. This, in combination with the fact that test performance was no higher in the supervised trials, suggests that even having correct category information did not make the regularities that defined the categories apparent.

Model performance, shown in Figure 12, contrasts interestingly with human performance. It illuminates what aspects of the task are learnable and suggests some possibilities for the source of the difficulties people had in the hard condition. The main finding is that the model did equally well in the easy and hard conditions but significantly better in the supervised trials than the unsupervised ones, with no interaction.
The improved performance when the categories are given is consistent with human performance in the easy condition (at least when compared to the unsupervised trials in which people couldn't leverage information from the supervised trials, which is appropriate since the model didn't have that information either). As in Study 1, the supervised trials are easier because the category information is already given.

Figure 12. Model performance on Study 2 stimuli by condition and supervision level. The model qualitatively replicated the same pattern as humans in the easy-standalone condition, performing well in the supervised trials and worse in the unsupervised trials. Unlike people, it did just as well in the hard condition. This suggests that people's poor performance in the hard condition is not because there was not enough information in the task to learn the correct category structure; rather, performance limitations of some sort, which did not affect the model, prevented people from learning the correct categories or using that information to perform inferences about novel items.

The main difference between human and model performance is in the hard condition. Humans performed at chance, even in the supervised trials, suggesting that even when the categories were given they could not detect the regularities behind the category structure. The fact that the model could still learn the categories quite well in the hard condition suggests that the problems humans were having weren't inherent to the categories or stimuli themselves. In other words, there was enough information and structure in the data that an optimal learner could make the correct generalizations; whatever prevented humans from learning was not a lack of information. The fact that an optimal computational-level model could learn but humans could not suggests that humans were probably limited by process-level constraints, perhaps on working memory. One final thing worth noting is that people did better in the easy supervised trials than the model did. How is this possible, if the model represents optimal performance? There are probably two reasons for this. The first is that people, unlike the model, were presented with many trials one after the other; for the model, each trial was calculated independently of the others. Although the trials differed in the number of categories and items per category, there were still trial-to-trial regularities that people could have learned: for instance, that all trials had two classes, that the color feature was always the class indicator, or that no categories contained an odd number of items. These regularities could have helped substantially with the inference problem. In fact, it is interesting that people might be capable of learning such regularities: they are another kind of overhypothesis learning. A second possibility has to do with a somewhat minor detail about how the model calculates second-order generalization. It does so by marginalizing over all possible assignments of the test item to categories and classes. Occasionally the test item is assigned to an entirely separate class and category; when this occurs, the model performs at chance on the corresponding generalization.
Although this performance is averaged in with the generalizations that occur when the test item is assigned to the correct class, the net effect is to lower the overall generalization probability relative to people, who (presumably due to demand characteristics of the experiment) probably rarely if ever believed the test item was in an entirely different class. Although this behavior by the model does affect the quantitative generalization probability, it is not a fundamental problem with the model; it could be fixed by simply changing the prior on γ. We chose not to do so for consistency, because previous work used the existing prior. In sum, Study 2 is the first evidence we are aware of that people are capable of quickly learning two different classes at the same time as categories in those classes, when given completely novel objects with novel features. This learning is not trivial: people were only capable of it when the features were salient and not too numerous, as in the easy condition. The model demonstrates on a computational level how such regularities might be acquired, and also shows that people's failure when features were too numerous or noisy was probably due to failures at the process level rather than inherent limitations on the information in the data.

Discussion

This paper addressed two central questions. How is it possible to simultaneously learn on multiple higher levels of abstraction? And are people capable of this sort of learning quickly, given novel stimuli, or does it require natural stimuli, full supervision, or longer time periods? We find that people can indeed learn one or even many classes, even if the lower-level categories are not given (although learning is better when the categories are given). Interestingly, second-order generalization appears to emerge in tandem with first-order generalization; as soon as people learn an overhypothesis about a class or classes, they are capable of using that information to form intelligent inferences about completely novel items. As discussed previously, this "blessing of abstraction" is a nontrivial prediction made by our model but not other common models of categorization. In addition to illustrating how this sort of learning is possible, our model explains why first-order and second-order generalization are equally easy, why the task is easier when category information is given, and why performance degrades as the number of items in a category decreases or the categories themselves grow less coherent.

The model fails to predict human behavior in two ways in Study 2, but both of those are quite revealing. First, people performed far better in the unsupervised-paired trials than the model did in its unsupervised trials. This is probably because people were doing another kind of overhypothesis learning, leveraging information about the task based on the supervised trials that they also saw. When another group of participants in the unsupervised-standalone condition was given only unsupervised trials, they performed much more in line with model predictions. This is compelling evidence that people are capable of interesting cross-trial learning, enabling them to go even beyond what would be possible otherwise. Second, the model also predicted much better performance in the hard condition than people achieved.
Since the model operates on the computational level and is not constrained by limited memory or time, this suggests that human failure when the features are numerous and more complex is probably due to limitations of memory or other process-level abilities. A full account of performance on this task would need to address this level as well; however, our computational-level approach has demonstrated both that this learning problem exists and how a solution to it is possible. Although there is already an array of existing models that can achieve aspects of the learning tasks in this paper, to our knowledge ours is the only model that can simultaneously learn overhypotheses about multiple classes at once, while also learning about the categories themselves, in an unsupervised fashion. Other models can categorize items but not learn overhypotheses about classes (Anderson, 1991) or learn multiple classes (Kruschke, 1992; Love et al., 2004; Perfors & Tenenbaum, 2009); some learn multiple classes but cannot simultaneously categorize items (Kemp et al., 2007; Sanborn et al., 2009; Perfors et al., 2010). Overhypothesis learning has a great deal in common with dimensional attention (e.g., Kruschke, 1992; Love et al., 2004), but dimensional-attention learning models such as ALCOVE or SUSTAIN do not simultaneously learn attention to different dimensions in multiple classes at once. Furthermore, as we have seen, they also do not predict that first-order and second-order generalization (at least involving categories with discrete features) should be equally easy (as they are here). The empirical findings reported here are also, to our knowledge, the first evidence that people are capable of learning multiple classes at once over novel features while also learning the underlying category. The closest similar work is in the area of knowledge partitioning, which investigates people's ability to learn one regularity in one context at the same time as a very different regularity in another (e.g., Lewandowsky & Kirsner, 2000; Kalish, Lewandowsky, & Kruschke, 2004; Yang & Lewandowsky, 2004; Navarro, 2010). However, models and experiments in this area generally either do not involve category learning at all, or involve people learning multiple contexts (or classes) only in a supervised way. They do not show people, as we do, learning on multiple levels at once: figuring out how to put (unlabelled) items in categories while at the same time figuring out different regularities for different classes of items or different contexts. A final open question is how the ability to learn on multiple levels at once depends on having all of the information about all of the categories visible at all times, as in our experiments. In real life, people are not often given examples of many category members from several classes all at once; rather, they experience individual exemplars one-by-one, in multiple contexts, sometimes with labels and sometimes without. This greatly increases the load on working memory which, our evidence suggests, is key to successful learning of this sort. It is possible that people will be unable to simultaneously learn on multiple levels if the items are presented more like typical category learning experiments, with each item seen and labelled (or not) one-by-one.
If this is the case, perhaps children take years to acquire multiple overhypotheses because that length of time is necessary to overcome other constraints on working memory, or for different kinds of long-term learning to occur. We are exploring this possibility with additional experiments in our lab.

Appendix A

Structure learning

The structural component of the model organizes observed stimuli into a two-layer hierarchy, with stimuli assigned to categories and categories assigned to a class. Formally, we let c denote a vector that assigns each stimulus to a category, such that c_i = j if the i-th stimulus belongs to the j-th category. Similarly we use k to denote a vector assigning each category to a class, such that k_j = z if the j-th category belongs to the z-th class. The learner does not know the number of categories or the number of classes. In simpler models where objects are assigned only to categories but the number of such categories is not known in advance, it is typical to make the assignments using a simple method known as the Chinese restaurant process (CRP). Although the CRP describes a sequential process, the assignments generated by it are exchangeable: the actual order in which assignments are generated does not affect the probability of the overall partition. The natural analog of the CRP for the classes-and-categories model is known as the nested Chinese restaurant process (nCRP), which we now describe. This distribution comprises two separate CRPs, one associated with classes and the other associated with categories. Suppose the learner has a set of assignments c that categorizes the first n stimuli and a class vector k that organizes the existing categories. The prior probability that observation n + 1 belongs to the j-th existing category is given by

P(c_{n+1} = j \mid c_1, \ldots, c_n) = \frac{n_j}{n + \gamma}

where n_j denotes the number of observations already assigned to that category. However, there is some probability that the new observation belongs to a hitherto unseen category. If we let m denote the number of categories that have been seen up to this point, then the probability that item n + 1 belongs to category m + 1 is

P(c_{n+1} = m + 1 \mid c_1, \ldots, c_n) = \frac{\gamma}{n + \gamma}

However, when a new observation is assigned to a new category, then this new category must itself be assigned to a class. This assignment is made using the same rule. The new category is assigned to the z-th existing class with probability proportional to the number of existing categories that belong to that class,

P(k_{m+1} = z \mid k_1, \ldots, k_m) = \frac{m_z}{m + \gamma}

It is assigned to an entirely new class with probability proportional to γ. If the total number of classes observed so far is q, then

P(k_{m+1} = q + 1 \mid k_1, \ldots, k_m) = \frac{\gamma}{m + \gamma}

In the general version of the nested CRP, the value of γ for the category-level assignments can be different from the value for the class-level assignments, but for the current paper we keep them the same. In statistical notation, this process describes a joint prior over c and k that is written

c, k \sim \mathrm{nCRP}(\gamma)

We fix γ = 1 for all model fitting exercises in the paper.
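To illustrate the sequential assignment rule, here is a minimal sketch of drawing category and class assignments from the nested CRP with a single concentration parameter γ. It illustrates the prior only and is not the fitting code used in the paper.

```python
import random

def sample_ncrp(n_items, gamma=1.0, seed=0):
    """Draw category and class assignments (c, k) from a nested CRP.

    Categories are created by an ordinary CRP over items; whenever a new
    category is created, it is assigned to a class by a second CRP over
    categories, using the same concentration parameter gamma.
    """
    rng = random.Random(seed)
    c = []               # c[i] = category of item i
    k = []               # k[j] = class of category j
    n_per_category = []  # number of items in each existing category
    n_per_class = []     # number of categories in each existing class

    for n in range(n_items):
        # Existing category j with prob n_j / (n + gamma);
        # a brand-new category with prob gamma / (n + gamma).
        weights = n_per_category + [gamma]
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == len(n_per_category):
            # New category: assign it to a class, where existing class z has
            # prob m_z / (m + gamma) and a new class has prob gamma / (m + gamma).
            class_weights = n_per_class + [gamma]
            z = rng.choices(range(len(class_weights)), weights=class_weights)[0]
            if z == len(n_per_class):
                n_per_class.append(0)
            n_per_class[z] += 1
            k.append(z)
            n_per_category.append(0)
        n_per_category[j] += 1
        c.append(j)
    return c, k

c, k = sample_ncrp(12)
print("category of each item:", c)
print("class of each category:", k)
```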
Distributional learning

The rest of the model is defined conditional on the class and category assignments. Let y_i denote the observed feature vector for the i-th stimulus, where y_{if} = v implies that the i-th stimulus has value v on feature f. As above, let c_i denote the category to which item i is assigned. Then the probability with which this feature value would be observed is denoted θ_v^{(c_i, f)}. In statistical terms, we say that the value of y_{if} is sampled from a multinomial distribution with size 1 and probability vector θ^{(c_i, f)},

y_{if} \sim \mathrm{Multinomial}(\theta^{(c_i, f)}, 1)

Of course, the learner does not know the probability distribution over feature values associated with any of the categories, and must learn them from the observations. For the sake of clarity, we now simplify the notation somewhat. Because the model is identical for all features f, we drop the dependence on the specific feature, and simplify the notation that describes the category and class assignments, referring in generic terms to category c and the class k to which it belongs. In this notation, instead of using θ^{(c_i, f)} to refer to the distribution over values for feature f associated with the category c_i to which the i-th item belongs, we use θ_c to refer to one such distribution. As described in the main text, the learner's prior beliefs about a category are shaped by the class to which that category belongs. Specifically, θ_c depends on the distribution over feature values β_k associated with class k, and on α_k, the parameter that describes the homogeneity of categories of class k. We adopt the standard Dirichlet distribution to describe this belief, specifically:

\theta_c \sim \mathrm{Dirichlet}(\alpha_k \beta_k)

By multiplying the vector that describes the feature distribution β_k by the homogeneity parameter α_k, this prior can encompass the full range of possible Dirichlet distributions, using a parameterization that makes more sense psychologically than the usual method. In order for a Bayesian learner to acquire the class-level knowledge that α_k and β_k provide, the prior uncertainty about these parameters must also be described. The prior over the feature distribution β_k for a class takes the same form as the prior over features for a category, and is described by a Dirichlet distribution,

\beta_k \sim \mathrm{Dirichlet}(\mu \mathbf{1})

where \mathbf{1} denotes a vector consisting entirely of 1s, indicating that the model has no a priori biases to expect some feature values over others, and µ describes the expected homogeneity across classes. It plays the same role for classes that α plays for categories, but is not central to the current paper. The prior over α favors small values (homogeneity), with the strength of that preference captured by the parameter λ:

\alpha_k \sim \mathrm{Exponential}(\lambda)

Finally, because the parameters λ and µ play an important role in controlling the expectations that the learner has in the most general sense, we place diffuse priors over these and allow the model to learn them from data:

\lambda \sim \mathrm{Exponential}(1) \qquad \mu \sim \mathrm{Exponential}(1)
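The distributional component can be illustrated with a short sketch that draws the class-level parameters and then feature values for one category on a single feature. The numerical settings are illustrative only, not the values used for model fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n_values = 10   # possible values per feature (as in the experiments)
lam = 1.0       # rate of the exponential prior on alpha (illustrative)
mu = 1.0        # class-level homogeneity parameter (illustrative)

# Class-level knowledge: a feature distribution beta_k and homogeneity alpha_k.
beta_k = rng.dirichlet(mu * np.ones(n_values))  # beta_k ~ Dirichlet(mu * 1)
alpha_k = rng.exponential(1.0 / lam)            # alpha_k ~ Exponential(lam); numpy takes a scale

# Category-level distribution: theta_c ~ Dirichlet(alpha_k * beta_k).
# Small alpha_k concentrates theta_c on a few values (homogeneous categories);
# large alpha_k makes categories look like the class-level distribution.
theta_c = rng.dirichlet(alpha_k * beta_k)

# Observed feature values for items in this category:
# y_if ~ Multinomial(theta_c, 1), i.e. one draw per item.
items = rng.choice(n_values, size=5, p=theta_c)
print("feature values for five category members:", items)
```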
Inference

In this section, we simplify the notation further for the sake of clarity. The data available to the learner consists of a collection of feature vectors y. The learner must infer the category and class assignments (c, k) along with the distributional information at both levels (θ, β, α) and the top-level information (λ, µ). The description above outlines a joint prior distribution over all these parameters, P(c, k, θ, β, α, λ, µ). Given data y, a Bayesian learner infers the posterior distribution,

P(c, k, \theta, \beta, \alpha, \lambda, \mu \mid y) = \frac{P(y \mid c, k, \theta, \beta, \alpha, \lambda, \mu) \, P(c, k, \theta, \beta, \alpha, \lambda, \mu)}{P(y)}

and the structure of the model allows the prior to be factorized as follows:

P(c, k, \theta, \beta, \alpha, \lambda, \mu) = P(\theta \mid \beta, \alpha) \, P(\beta \mid \mu) \, P(\alpha \mid \lambda) \, P(\lambda) \, P(\mu) \, P(c, k)

Given the complexity of the model, computing exact values for the posterior probabilities is not feasible, but we can use Markov chain Monte Carlo methods to draw samples from this posterior distribution (Gilks, Richardson, & Spiegelhalter, 1996). When c and k are not given, the process of inference alternates between fixing c and fixing k; while each is fixed, the other is sampled, along with the hyperparameters α, β, λ, and µ. The hyperparameters are estimated with an MCMC sampler that uses Gaussian proposals on log(α), log(λ), and log(µ); proposals for β are drawn from a Dirichlet distribution with the current β as its mean. Each Markov chain was run for 8,000 iterations with 1,000 discarded as burn-in. Proposals on c and k included splitting, merging, shifting one, adding, and deleting categories or classes, respectively.

Given the ability to draw samples from the posterior distribution over parameters, calculating first- and second-order generalization probabilities is relatively straightforward. For two stimuli a and b in a test trial the quantity of interest is P(c_a = c_b | y), the posterior probability that items a and b belong to the same category given all of the observed feature vectors y. However, since every sample is a draw from the full posterior distribution P(c, k, θ, β, α, λ, µ | y), all we need to do to estimate P(c_a = c_b | y) is count the proportion of posterior samples that assigned these items to the same category. The only difference between first- and second-order generalization, from the perspective of this model, is whether or not the learner has previously seen one of the items (or, more precisely, any of the feature values belonging to any of the items) as part of the training set. The model was tested on items with one of the coherent features f_C, the class feature, and one of the random features f_R, rather than the full complement; this was done in order to prevent the model from placing all test items into entirely new categories and classes (an outcome people avoided due to task demands). The same goal could have been accomplished simply by forcing the model not to place the test items into new categories, or by changing the CRP prior γ, but we deemed this option the most straightforward.
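In code, the generalization computation described above amounts to counting agreements across posterior samples. A minimal sketch with made-up samples, not output from the actual sampler:

```python
def prob_same_category(samples, a, b):
    """Estimate P(c_a = c_b | y) as the proportion of posterior samples in
    which items a and b are assigned to the same category.

    `samples` is a list of posterior samples, each represented here by just
    its category assignment vector c. In the full model a sample also carries
    k, theta, beta, alpha, lambda and mu, but only c is needed for this quantity.
    """
    same = sum(1 for c in samples if c[a] == c[b])
    return same / len(samples)

# Hypothetical posterior samples over the category assignments of 6 items
# (e.g. what an MCMC sampler might return after burn-in):
samples = [
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 2, 2, 2],
    [0, 0, 1, 1, 1, 2],
]
print(prob_same_category(samples, 4, 5))  # items 4 and 5 agree in 3 of 4 samples -> 0.75
```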
Results for Study 1 represent averages across 16 different simulated datasets for each category structure and each order of generalization; results for Study 2, which was computationally much more intensive, represent averages across 4 simulated datasets for each category structure, order of generalization, and class membership for each item.

Appendix B

The stimuli in Study 1 and the hard condition of Study 2 had eight features, each of which could take ten possible feature values. Stimuli varied from trial to trial in terms of which features were coherent and which feature values occurred together in the same item, but feature values of a given type always occurred in the same location (for instance, the feature values in the lower right-hand circle always corresponded to a capital letter taken from the final half of the alphabet). Table 2 shows all of the possible feature values for each feature. Stimuli presented in the first phase drew from up to eight of those feature values for each feature; the final two were reserved for the second-order generalization questions, which required new feature values.

Table 2: List of possible feature values for each of the eight features in Study 1 and the hard condition of Study 2.
Feature 1 (upper-left circle): ! ? # @ $ % & + * >
Feature 2 (upper-right circle): 1 2 3 4 5 6 7 8 9 0
Feature 3 (lower-left circle): q r s t u v w x y z
Feature 4 (lower-right circle): A B C D E F G H J K
Feature 5 (upper-left square): α β γ δ η θ λ ξ π σ
Feature 6 (upper-right square): Γ ∆ Θ Ξ Π Σ Φ Ψ Ω ∀
Feature 7 (lower-left square): c ↔ ≈ < ℵ ∞ ÷ ♣ ♥ ♠
Feature 8 (lower-right square): √ ⊇ ∇ ⊥ ≡ ⊕ • ↓ ∂

The stimuli in the easy condition in Study 2 had five features, each of which could also take ten possible feature values, as shown in Table 3. The feature indicating class was always color, and the other four varied randomly from trial to trial (along with the specific feature values of each). Colors were chosen to be maximally distinct from each other, and which color corresponded to each class varied randomly from trial to trial. For instance, in Trial 1, blue might indicate those items for which the two bottom features were coherent with respect to categories; in Trial 2, there might be no blue items at all; and in Trial 8, blue might indicate those items for which the two rightmost features were coherent with respect to categories.

Table 3: List of possible feature values for each of the five features in the easy condition of Study 2.
Feature 1 (color): red, green, blue, yellow, cyan, magenta, gray, orange, light blue, light brown
Feature 2 (upper-left square): ! ? # @ $ % & + * >
Feature 3 (upper-right square): 1 2 3 4 5 6 7 8 9 0
Feature 4 (lower-left square): m n r s t u v w x z
Feature 5 (lower-right square): A B C D E F G H J K

References

Anderson, J. (1991). The adaptive nature of human categorization. Psychological Review, 98(3), 409–429.
Booth, A., & Waxman, S. (2002). Word learning is 'smart': Evidence that conceptual information affects preschoolers' extension of novel words. Cognition, 84, B11–B22.
Gilks, W., Richardson, S., & Spiegelhalter, D. (1996). Markov chain Monte Carlo in practice. Chapman & Hall.
Goodman, N. (1955). Fact, fiction, and forecast. Cambridge, MA: Harvard University Press.
Griffiths, T., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. B. (2010). Probabilistic models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14, 357–364.
Griffiths, T., & Tenenbaum, J. B. (2006). Optimal predictions in everyday cognition. Psychological Science, 17, 767–773.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48), 19313–19317.
Kalish, M., Lewandowsky, S., & Kruschke, J. (2004). Population of linear experts: Knowledge partitioning in function learning. Psychological Review, 111, 1072–1099.
Kemp, C., Perfors, A., & Tenenbaum, J. B. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3), 307–321.
Kemp, C., & Tenenbaum, J. B. (2008). The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31), 10687–10692.
Kemp, C., & Tenenbaum, J. B. (2009). Structured statistical models of inductive reasoning. Psychological Review, 116(1), 20–58.
Kruschke, J. (1992).
ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22–44.
Kurtz, K. (2007). The divergent autoencoder (DIVA) model of category learning. Psychonomic Bulletin and Review, 14, 560–576.
Landau, B., Smith, L., & Jones, S. (1988). The importance of shape in early lexical learning. Cognitive Development, 3, 299–321.
Lee, M., & Navarro, D. (2002). Extending the ALCOVE model of category learning to featural stimulus domains. Psychonomic Bulletin and Review, 9(1), 43–58.
Lee, M. D. (2006). A hierarchical Bayesian model of human decision-making on an optimal stopping problem. Cognitive Science, 30, 555–580.
Lewandowsky, S., & Kirsner, K. (2000). Knowledge partitioning: Context-dependent use of expertise. Memory and Cognition, 28, 295–305.
Love, B., Medin, D., & Gureckis, T. (2004). SUSTAIN: A network model of category learning. Psychological Review, 111(2), 309–332.
Macario, J. F. (1991). Young children's use of color in classification: Foods as canonically colored objects. Cognitive Development, 6, 17–46.
Morey, R. D., & Rouder, J. N. (2014). BayesFactor: Computation of Bayes factors for common designs (R package version 0.9.7) [Computer software manual].
Navarro, D. (2010). Learning the context of a category. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 23 (pp. 1795–1803). Curran Associates, Inc.
Perfors, A., Tenenbaum, J., & Wonnacott, E. (2010). Variability, negative evidence, and the acquisition of verb argument constructions. Journal of Child Language, 37, 607–642.
Perfors, A., & Tenenbaum, J. B. (2009). Learning to learn categories. In N. Taatgen, H. van Rijn, L. Schomaker, & J. Nerbonne (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 136–141). Austin, TX: Cognitive Science Society.
Perfors, A., Tenenbaum, J. B., & Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118(3), 306–338.
Sanborn, A., Chater, N., & Heller, K. (2009). Hierarchical learning of dimensional biases in human categorization. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 22 (pp. 727–735). Curran Associates, Inc.
Sanborn, A., Griffiths, T., & Navarro, D. (2010). Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117, 1144–1167.
Shams, L., & Beierholm, U. (2011). Early integration and Bayesian causal inference in multisensory perception. In J. Trommershauser, K. Kording, & M. Landy (Eds.), Sensory cue integration (pp. 251–262). Oxford University Press.
Smith, L., Jones, S., Landau, B., Gershkoff-Stowe, L., & Samuelson, L. (2002). Object name learning provides on-the-job training for attention. Psychological Science, 13(1), 13–19.
Soja, N., Carey, S., & Spelke, E. (1991). Ontological categories guide young children's inductions of word meaning. Cognition, 38, 179–211.
Teh, Y.-W. (2009). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (pp. 985–992).
Williams, J., & Griffiths, T. (2013). Why are people bad at detecting randomness? A statistical argument. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39, 1473–1490.
Yang, L.-X., & Lewandowsky, S. (2004).
Knowledge partitioning in categorization: Constraints on exemplar models. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 1045–1064.