Random Choice and Learning

Paulo Natenzon∗
Washington University in Saint Louis
December 2012

Abstract

We study a decision maker who gradually learns the utility of available alternatives. This learning process is interrupted when a choice must be made. From the observer’s perspective, this formulation yields a random choice model. We propose and analyze one such learning process, the Bayesian Probit Process. We apply our model to the similarity puzzle: Tversky’s similarity hypothesis asserts that if two options, x and y, have roughly equal utility and z is similar to x, then making z available will reduce the probability of choosing x more than it reduces the probability of choosing y. However, there is also evidence that introducing a similar but inferior z may hurt y more than x and can even increase the probability of choosing x (attraction effect). We provide a definition of similarity based only on the random choice process and show that z is more similar to x than to y if and only if the correlation between z’s and x’s signals is larger than the correlation between z’s and y’s signals. Our main result establishes that when z and x are correlated and sufficiently close in utility, introducing z hurts y more than x and may even increase the probability of choosing x early in the learning process, but eventually hurts x more than y. Hence, the attraction effect arises when the decision maker is relatively uninformed, while Tversky’s hypothesis holds when she becomes sufficiently familiar with the options. We then show that if z is similar to x and sufficiently inferior, the attraction effect never disappears.

∗ E-mail: [email protected]. This paper is based on the first chapter of my doctoral dissertation at Princeton University. I wish to thank my advisor, Faruk Gul, for his continuous guidance and dedication. I am grateful to Wolfgang Pesendorfer for many discussions in the process of bringing this project to fruition.
I also benefited from numerous conversations with Dilip Abreu, Roland Bénabou, Meir Dan-Cohen, Daniel Gottlieb, Justinas Pelenis, and seminar participants at Alicante, Arizona State, Berkeley, D-TEA Workshop, IESE, IMPA, Johns Hopkins, Kansas University, Haifa, Hebrew University, NYU, Princeton, the RUD Conference at the Collegio Carlo Alberto, SBE Meetings, Toronto, UBC, Washington University in St. Louis and Yale. All remaining errors are my own.

1 Introduction

The similarity puzzle is best introduced with an example. Let’s consider a consumer who is uncertain about her preference over two comparable computers: a Mac and a PC. She hasn’t shopped for computers in a few years, she doesn’t follow technology blogs, and she is not sure which one will turn out to be better for her office. If pressed to make a decision, she is equally likely to choose either one:

P{she chooses the Mac} / P{she chooses the PC} = 1

Suppose we add a third computer to the menu, and the new option is similar to one of the existing alternatives. To fix ideas, suppose the original PC option is a Dell, and we introduce a similar PC made by Toshiba. What should happen to the ratio above? If the Toshiba is chosen with any positive probability when the three options are available, it will necessarily ‘hurt’ the probability that the Mac is chosen, or the probability that the Dell is chosen, or both. But which existing option should the Toshiba hurt proportionally more? The similarity puzzle arises from disparate answers to this question. On the one hand, there is a large body of evidence and theoretical work, culminating in Tversky’s similarity hypothesis, based on the idea that the ratio should increase once the Toshiba is introduced. But there is also a large body of empirical evidence showing that in many situations the ratio decreases, the most prominent example being the attraction effect. Let’s now look at each side of this puzzle in turn.
Tversky (1972b) proposes the following similarity hypothesis: The addition of an alternative to an offered set ‘hurts’ alternatives that are similar to the added alternative more than those that are dissimilar to it. According to this hypothesis, the consumer substitutes more intensely among similar options, so the Toshiba should hurt the Dell proportionally more. Tversky’s similarity hypothesis summarizes a large body of work from the 1960s and 1970s which tried to understand how similarity affects choices. The similarity hypothesis is one of the underlying principles in most of the modern tools used in discrete choice estimation (see McFadden (2001) and Train (2009)). While it may be tempting to treat Tversky’s similarity hypothesis as a general principle, it is incompatible with a large amount of empirical evidence originating in the marketing literature, the most prominent example being the attraction effect. The attraction effect refers to a situation in which the probability of an existing alternative being chosen increases with the introduction of a similar but inferior alternative.1 In our consumer example, suppose the Toshiba computer, while being almost identical to the Dell, is clearly worse in minor ways, such as having lower processing power, a slightly smaller screen, and costing a little more. In other words, while the Dell and the Toshiba share many features, the Toshiba is clearly a dominated option. It would not be surprising if almost no one chose the Toshiba in this situation. What is perhaps surprising is that in many experimental settings the presence of the dominated option increases the probability that the dominant option is chosen. In the example, adding a dominated Toshiba helps boost the sales of the Dell to the detriment of the Mac. The attraction effect stands in stark contrast to Tversky’s similarity hypothesis. In the attraction effect, the introduction of the Toshiba hurts the Mac proportionally more.
Moreover, while the probability of choosing the Mac is reduced, the probability of choosing the Dell actually increases. This constitutes a violation of monotonicity, a very basic property for models of discrete choice. Monotonicity (also known as regularity) states that the probability of existing alternatives cannot decrease when an option is removed from the choice set. Every random utility model, including the multinomial logit, the probit, and the family of generalized extreme value models, is monotone; virtually all models currently used in discrete choice estimation are therefore incompatible with the attraction effect. In this paper, we propose a model that allows us to understand and resolve the similarity puzzle. The model explicitly incorporates two frictions in the decision-making process. First, the decision maker gradually learns the utility of the available alternatives. A parameter t ≥ 0 captures the amount of information available to the decision maker at the time a choice is made. Second, for any given amount of information the decision maker has about the alternatives (for example, if the decision maker is allowed to contemplate the alternatives for a fixed amount of time), similarity determines how certain the decision maker is about the relative ranking of each pair of alternatives. In the model, similarity is captured by a parameter γij for each pair of alternatives (i, j). Our main results show how these two new dimensions, similarity and information, interact to explain the similarity puzzle.

1 Huber et al. (1982) and Huber and Puto (1983) provide the first experimental evidence for the attraction effect. Since then, it has been confirmed in many different settings, including choice over political candidates, risky alternatives, investment decisions, medical decisions and job candidate evaluation. For these and other examples, see Ok et al. (2012) and the references therein.
While Tversky’s similarity hypothesis holds for more informed choices (for a sufficiently large level of information), the opposite holds for less informed choices (for a sufficiently small level of information), which may even lead to the attraction effect. In our model, stochastic choice arises from the decision maker’s incomplete information about the choice alternatives. The decision maker lacks, ex ante, any information about the alternatives that would favor a choice of one alternative over another. When contemplating the menu of choices, the decision maker gradually learns the utility of the alternatives. This learning process is interrupted when a choice must be made. From the point of view of the observer, this yields a random choice model. A random choice rule describes frequencies of choice in each menu for a fixed level of information precision. By indexing a family of random choice rules according to information precision, we model observable choice behavior as a random choice process. The decision maker can ultimately be described by a rational preference relation represented by a utility function µ. Observed choices increasingly reflect this rational preference as the decision maker’s information about the alternatives becomes more precise. Formally, the model can be described as follows. Let A be the universe of choice alternatives. The decision maker has a standard rational preference relation on A. This preference is represented by the numeric utility µi of each alternative i ∈ A. In addition, each pair of alternatives i, j is described by a correlation parameter γij. The decision maker knows γ but does not know µ. Her ex ante beliefs about the values of µi are i.i.d. Gaussian. When facing a choice from menu b, the decision maker first observes a noisy signal about the true utilities, and updates her beliefs about µ using Bayes’ rule. The signal correlation is given by γ.
The probability of choosing option i in menu b is equal to the probability that alternative i has the highest posterior mean belief among the alternatives in b. We call the resulting random choice rule the Bayesian probit rule. We allow the precision of the signals received to vary continuously from pure noise to arbitrarily precise signals. The precision of the signals is captured by the parameter t ≥ 0. A family of Bayesian probit rules ordered according to signal precision t is a Bayesian probit process. The key component of the model proposed in this paper is a new behavioral definition of similarity. This new notion of similarity is subjective and identified (revealed) from choices, much as utility is identified in classical revealed preference theory. The theory allows us to identify, from choice data, one ranking of alternatives according to preference, and one ranking of pairs of alternatives according to similarity. The behavioral notion of similarity that we propose in this paper abstracts from (and is orthogonal to) utility. Our definition is best motivated with the following geometric example. Suppose the universe of choice alternatives is the set of all triangles in the Euclidean plane. Suppose further that your tastes over triangles are very simple: you always prefer triangles that have larger areas. So your preferences over triangles can be represented by the utility function that assigns to each triangle the numerical value of its area. Among the options shown in Figure 1, which triangle would you choose? Figure 1 illustrates that it may be hard to discriminate among comparable options. Since the triangles in Figure 1 have comparable areas, they are close in your preference ranking, and you may have a hard time picking the best (largest) one. If you were forced to make your choice with little time to examine the options, you would have a good chance of making a mistake.
If the difference in area were more pronounced, i.e., if one triangle were much larger than the other, you probably wouldn’t have any trouble discriminating among the options. Likewise, a consumer presented with two products, of which one is much cheaper and of much better quality, should seldom make a mistake. Hence, any probabilistic model of choice should include the gap in desirability (measured as the difference in utility) as an important explanatory variable for the ability of subjects to discriminate among the options. But the gap in utility alone cannot tell the whole story. Even if the utility of a pair of objects is kept exactly the same, it is possible to increase your ability to discriminate among a pair of options by changing some of their other features. For example, suppose you still prefer larger triangles, and consider the pair shown in Figure 2. The triangle shown on the left in Figure 2 has exactly the same area as the triangle on the left in Figure 1, while the triangle on the right is the same in both Figures. Hence, from the point of view of desirability, the pair shown in Figure 1 is identical to the pair shown in Figure 2. However, note that you are less likely to make a mistake when choosing among the options in Figure 2 (make your pick). You probably chose the triangle on the left. You probably also found this choice task much easier than the task in Figure 1. Why? It certainly has nothing to do with the gap in the desirability of the two options: as we pointed out, the pairs in both Figures exhibit exactly the same difference in area. So what happened? When we substituted triangle i′ for triangle i, we kept desirability (the area) the same, but we made every other feature of triangle i′ more closely resemble triangle j. What changed from the pair in Figure 1 to the pair in Figure 2 is that we kept constant the features you care about (the area), while increasing the overlap of other features.
And this increased overlap clearly helped improve your ability to discriminate among the options. In the theory that we propose in this paper, we build on this intuition and call the pair (i′, j) of Figure 2 more similar than the pair (i, j) of Figure 1. Similarity is measured independently from utility, and is another important factor determining the ability of a decision maker to discriminate among the choice options. The pair of triangles in Figure 2 is in fact similar according to the mathematical definition for geometric figures. In Euclidean geometry, two triangles are similar when they have the same internal angles. The concept of geometric similarity abstracts from size, and only refers to the shape of the objects being compared. In the same spirit, the notion of similarity that we propose in this paper abstracts from utility. But instead of defining the similarity of two alternatives based on any of their observable attributes, we infer the level of similarity perceived by the decision maker from her choices. Similarity in our theory is a behavioral notion: it is subjective and inferred from choice data. To understand the formal definition of similarity, suppose all choice data is obtained under a fixed level of information precision t and consider two pairs of alternatives {i, j} and {k, ℓ}. Under the assumptions of the model, the utility of the alternatives can be inferred from pairwise choices: in any given pair, and for any positive level of information, the more desirable alternative is chosen more often. When each object in a pair is chosen exactly 50 percent of the time, for every level of information precision, we infer that they are equally desirable. Without loss of generality, suppose that i is better than j. Suppose moreover that k is at least as good as i, and that j is at least as good as ℓ. In other words, the gap in desirability is wider in the pair {k, ℓ}.
Without any difference in similarity across the pairs, the gap in desirability alone should allow the decision maker to discriminate among k and ℓ at least as well as among i and j. In other words, based solely on the gap in utility in each pair, the better option k should be chosen from {k, ℓ} at least as often as the better option i is chosen from {i, j}. But it is exactly when k is not chosen from {k, ℓ} at least as often as i is chosen from {i, j} that the pair {i, j} is revealed to have a larger degree of similarity.2 A second visual example helps illustrate this point.3 Suppose the universe of choice alternatives is the set of all star-shaped figures on the Euclidean plane. Your tastes over star-shaped figures are again very simple. This time, you only care about the number of points on a star. For example, when the options are a six-pointed star and a five-pointed star, you always strictly prefer the six-pointed star. With just a glance at the options in Figure 3, which star would you pick? Again, if pressed to make a choice in a short amount of time, you may be likely to make a mistake. The probability of making a mistake would certainly decrease if you were given more time. You would also be more likely to make the correct choice if one of the alternatives had a much larger number of points than the other. Given a limited amount of time, mistakes are likely, because the options are comparable and therefore difficult to discriminate. Now consider the options in Figure 4. If you still prefer stars that have more points, which of the two options would you choose? Here, similarity comes to the rescue. The star on the left is the same in both Figures and the star on the right has the same number of points in both Figures. Hence the gap in desirability is the same in both Figures. But even though the difference in utility is the same, there is a much greater overlap of features in the pair shown in Figure 4.
2 The attentive reader will notice that pairwise choices may only partially reveal the similarity ranking. As we show later, choices over menus of three options are sufficient to fully identify similarity.
3 I am grateful to David K. Levine for suggesting this example.

Figure 1: Which triangle has the largest area?
Figure 2: Which triangle has the largest area?
Figure 3: Which star has more points?
Figure 4: Which star has more points?

Hence it is much easier to discriminate among the options in Figure 4. According to our definition of similarity, we infer from this pattern of choices that the pair in Figure 4 is more similar than the pair in Figure 3. To see how our model allows us to reconcile Tversky’s similarity hypothesis with the attraction effect, suppose the decision maker faces a choice between two alternatives i and j. Let k be a third alternative with the same utility as i and let the pair i, k be more similar than the pairs i, j and j, k. In Theorem 3, we show that introducing k reduces the probability of choosing j more than it reduces the probability of choosing i, and may even increase the probability of choosing i, for sufficiently low levels of information precision (i.e., at the beginning of the random choice process). Eventually, as information becomes more precise, the opposite occurs: the introduction of k hurts the similar i more than it hurts the dissimilar j. To better isolate the role of similarity, suppose that all three alternatives i, j and k have the same utility. In Corollary 4 we show that, when the signal correlation between i and k is taken to be arbitrarily close to one, both effects obtained above become more extreme. For times close to zero, i and k are each chosen with probability close to one half, while j is chosen with probability close to zero.
As time goes to infinity, or as information becomes arbitrarily precise, i and k are each chosen with probability close to 1/4, and j is chosen with probability close to 1/2. Hence when all three alternatives are equally desirable, similarity generates a form of attraction effect early in the process, and that effect tends to disappear as the decision maker learns the utilities of the alternatives with more precision. As the correlation between i and k goes to one, these alternatives are eventually treated as duplicates in the random choice process. In Theorem 5 we tackle the attraction effect. We show that, if alternative k is sufficiently inferior and correlated with i, the attraction effect never disappears: starting arbitrarily early in the random choice process, introducing k increases the probability of choosing i to a value strictly larger than one half (violating monotonicity), and this remains true even as time goes to infinity.

1.1 Related Literature

The attraction effect and other related decoy effects are examples of choice behavior that systematically depend on the context in which the decisions are made. Context dependent choice challenges the principle of utility maximization in individual decision making. In particular, violations of the weak axiom of revealed preference break the identification of preferences and choices in standard revealed preference theory. To accommodate context dependence, recent advances in decision theoretic models have relaxed the weak axiom while maintaining the assumption of deterministic choice behavior. In Manzini and Mariotti (2007), a decision maker may exhibit context dependent behavior when the choice procedure consists of applying two or more different rationales sequentially. In Masatlioglu et al. (2012), context-dependence arises from limited attention.
For example, a consumer who prefers the Mac may instead choose the Dell when the Toshiba is introduced, because the presence of a third alternative induces her to fail to pay attention to the Mac. In a similar vein, Ok et al. (2012) study context-dependent behavior that arises when one of the alternatives in the menu serves as a reference point, again limiting the attention of the decision maker to a subset of the choices. By maintaining the assumption of deterministic behavior while relaxing the weak axiom of revealed preference, this literature pushes the scope of deterministic models to their limit. But it is possible to go further. As pointed out by Gul et al. (2012), to understand context dependent behavior, there are great advantages in modeling choice as a stochastic phenomenon. The argument is simple: while it may be reasonable to expect a decision maker to choose alternative i from the set {i, j} on one occasion and to choose j from the set {i, j, k} on another occasion, it is unlikely that we would observe these choices on every occasion. Therefore, models which rely on deterministic violations of the weak axiom are at best as likely to find empirical support as the standard model. In this sense, a probabilistic model of choice may be more appropriate for measuring an individual’s tendency to choose i over j. In this paper, modeling choices as stochastic is natural, because the similarity puzzle is formulated in probabilistic terms. All nuanced effects of similarity on choice would disappear once we formulate the problem deterministically. Random choice models have a long history, having originated in psychophysics. Until the mid-twentieth century, theoretical developments were limited to models of binary choice.
In his seminal work on individual choice behavior as a probabilistic phenomenon, Luce (1959) introduces the first theoretical restriction for choice across menus with more than two alternatives, in the form of his random choice hypothesis: the ratio of the probability with which option j is chosen from a set of options to the probability with which k is chosen from the same set is constant across all sets that contain j and k. This hypothesis leads to the Luce model of random choice. McFadden (1974) constructs a random utility for the Luce model, which gives rise to the widespread use of conditional logit analysis in discrete choice estimation. The random choice hypothesis imposes a strong restriction on the patterns of substitutability between alternatives. Debreu (1960) provides an early example of this limitation of the Luce rule, later referred to as the “duplicates problem” in the discrete choice estimation literature. A classic version of Debreu’s example is the red bus/blue bus paradox: suppose that the decision maker is indifferent between taking a train and a blue bus, so that each option is chosen with probability one half. She is also indifferent between taking the blue bus or an identical bus that happens to be red, so she also chooses each bus with probability one half. Intuitively, since the decision maker doesn’t care about the color of the bus, the train should still be chosen with probability one half when both buses are available. However, in the Luce model this pattern of choices implies that the train is chosen with probability 1/3 when the three options are available. Debreu’s example illustrates an extreme case in which two options are treated like a single option by the decision maker. But even when options are not exactly treated as duplicates, the decision maker may exhibit a pattern of substitutability that is incompatible with the Luce rule.
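The red bus/blue bus arithmetic can be checked in a few lines of Python (a sketch; the option names and unit weights are our own illustration of the Luce rule, not notation from the paper):

```python
def luce(weights, menu):
    """Luce rule: the probability of choosing an option from a menu is
    proportional to its fixed positive weight, regardless of the menu."""
    total = sum(weights[i] for i in menu)
    return {i: weights[i] / total for i in menu}

# Train, blue bus, and red bus are all equally attractive: weight 1 each.
w = {"train": 1.0, "blue_bus": 1.0, "red_bus": 1.0}

# Pairwise menus: each option is chosen with probability 1/2, as in Debreu's story.
print(luce(w, ["train", "blue_bus"]))

# Full menu: the Luce rule forces P(train) = 1/3, not the intuitive 1/2.
print(luce(w, ["train", "blue_bus", "red_bus"]))
```

The binary choice data pin down equal weights, and equal weights mechanically split the three-option menu into thirds; this is exactly the rigidity that the duplicates problem exposes.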
It is in this vein that Tversky (1972b) proposes the similarity hypothesis: The addition of an alternative to an offered set ‘hurts’ alternatives that are similar to the added alternative more than those that are dissimilar to it. Many random choice models allow for substitutability patterns in line with Tversky’s similarity hypothesis. Commonly used random utility models in discrete choice estimation, such as the Generalized Extreme Value family and the multinomial probit, accommodate complex substitution patterns by allowing correlation in the error terms of random utility (see for example Train (2009)). In a more theoretical vein, Tversky’s (1972a) elimination-by-aspects model and, more recently, Gul et al.’s (2012) attribute rule are two other examples. All the random choice models we mentioned above are random utility maximizers, and therefore they all share the property of monotonicity. A random choice rule satisfies monotonicity if the probability of an existing alternative being chosen can only decrease when a new alternative is added. Thus, the attraction effect presents a challenge to random utility models. Any probabilistic model of individual decision making that accommodates behavior compatible with the attraction effect must therefore depart from random utility. The model proposed in this paper departs from random utility maximization and allows violations of monotonicity. The majority of the recent decision theoretic developments that accommodate behavior compatible with the attraction effect posit some form of bounded rationality or limited attention on the part of the decision maker. In contrast, our model describes a decision maker who is fully aware of every option in her opportunity set, but only learns the utility of the available alternatives gradually. In our model, context dependence arises when similarity allows the decision maker to learn the relative ranking of some pairs of alternatives faster than others.
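The mechanism can be seen in the simplest case with standard Gaussian algebra: if the utilities of two alternatives are observed with jointly Gaussian noise of variance 1/t and correlation γ, the difference of the two signals has variance 2(1 − γ)/t, so a higher correlation makes the pairwise comparison less noisy. The sketch below is our own illustration of this point (the closed form is textbook Gaussian arithmetic, not a formula taken from the paper, and it requires γ < 1 and t > 0):

```python
from math import erf, sqrt

def correct_choice_prob(delta, gamma, t):
    """Probability of picking the better of two alternatives when each
    utility is observed with noise of variance 1/t and the two noise
    terms have correlation gamma (gamma < 1).  The signal difference is
    N(delta, 2*(1 - gamma)/t); the choice is correct iff it is positive."""
    sd = sqrt(2.0 * (1.0 - gamma) / t)
    return 0.5 * (1.0 + erf(delta / (sd * sqrt(2.0))))

# Same utility gap, same precision: the more similar (more correlated)
# pair is discriminated more accurately, as in the triangle example.
print(correct_choice_prob(0.5, 0.0, t=1.0))   # dissimilar pair
print(correct_choice_prob(0.5, 0.9, t=1.0))   # similar pair: closer to 1
```

Raising either the correlation γ or the precision t pushes the probability of a correct pairwise choice toward one; the paper's point is that these two forces operate on different pairs at different speeds.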
The paper is organized as follows. In the next Section we introduce the model and provide a baseline characterization. In Section 3 we characterize the Bayesian probit process, define our notion of similarity and show that it is captured by signal correlation. Section 4 contains our main results and addresses the similarity puzzle. In Section 5 we compare the Bayesian probit process to existing random utility models in the literature. In Section 6 we offer an application of the model to a setting where noise correlation depends on observable attributes of the choice alternatives. We conclude in Section 7. All proofs are in the Appendix.

2 Learning Process

Let A be the grand set of choice objects and let A denote the collection of all finite subsets of A. We write A+ = A \ {∅} for the non-empty finite subsets. Random choice rules are a generalization of deterministic choice that explicitly treats choice behavior as probabilistic.

2.1 Random Choice Rules

Definition. A function ρ : A × A+ → [0, 1] is a random choice rule if for all b ∈ A+ we have ρ(b, b) = 1 and ρ(a, b) = Σ_{i∈a} ρ(i, b) for every a ⊆ b. To simplify the statements below, we identify each i ∈ A with the singleton {i}. The value ρ(i, b) specifies the probability of choosing i ∈ A, given that the selection must be made from the choice set b ∈ A+. The first equation in the definition of a random choice rule is a feasibility constraint: ρ must choose among the options available in b. The second requirement makes ρ(·, b) a probability over b. A random choice rule ρ is monotone4 if for every a ⊂ b ∈ A+ and i ∈ a we have ρ(i, a) ≥ ρ(i, b). For monotone random choice rules, the probability of choosing an alternative i can only decrease when new alternatives are added to a menu. The three examples below illustrate classic random choice rules on which applications have relied in the literature. All of them are monotone. Example 1 (Rational Choice). The decision maker has a complete and transitive preference relation ≽ ⊂ A × A.
Her random choice rule is specified as ρ(i, b) = 1 whenever i ∈ b is the unique best alternative according to ≽. If b has exactly two alternatives which are best according to ≽, they are chosen with the same probability. When b has more than two best alternatives according to ≽, we only require each of them to be chosen with strictly positive probability, every other alternative to be chosen with zero probability, and the resulting rule to be monotone.5 ♦ Example 2 (Random Utility). Most econometric models of discrete choice such as logit and probit are special examples of random utility maximizers. To fix ideas, suppose the grand set of alternatives A is finite. Let U be a collection of strict utility functions on A. A random utility π is a probability distribution on U. A random choice rule maximizes a random utility π if the probability of choosing alternative i from menu b equals the probability that π assigns to utility functions that attain their maximum in b at i. ♦

4 Monotone random choice rules are sometimes called regular in the literature.
5 Alternatively, we could have imposed uniform tie-breaks for all sets with multiple best alternatives. The resulting rule would automatically satisfy monotonicity. This alternative definition turns out to be too restrictive for the purposes of the theory developed later on.

Example 3 (Probit Rule). The probit rule is a random utility maximizer with Gaussian errors. Suppose the set A has n alternatives. The probit rule specifies a random vector X : Ω → Rn with a joint normal distribution, X ∼ N(µ, Σ). In each menu a ⊂ A, alternative i ∈ a is chosen with probability ρ(i, a) = P{Xi > Xj, ∀j ∈ a, j ≠ i}. ♦ We will sometimes restrict our attention to binary choices. A binary random choice rule is obtained by restricting the domain of a random choice rule to sets of two alternatives. To simplify the expression for choice probabilities in binary menus we will write ρ(i, j) := ρ(i, {i, j}). Definition.
A utility index µ : A → R and a function over pairs γ : A × A → [−1, 1] are utility signal parameters on A if (i) For each i ∈ A, γ(i, i) = 1; (ii) For each i, j ∈ A, γ(i, j) = γ(j, i); (iii) For each enumeration of a finite menu b = {1, 2, . . . , n} ⊂ A, the n × n symmetric matrix whose entry in row i and column j is given by γ(i, j) is positive definite. To simplify notation, we write µi = µ(i) and γij = γ(i, j). Before introducing our model, it is useful to understand how utility signal parameters (µ, γ) can represent the probit rule of Example 3. Let (µ, γ) be utility signal parameters. For each enumeration of a menu b = {1, 2, . . . , n}, let the random utility for the alternatives in b be given by the n-dimensional random vector X, where X is jointly normally distributed and has mean (µ1, µ2, . . . , µn) and covariance matrix (γij). Then by construction ρ(i, b) = P{Xi > Xj, ∀j ≠ i} defines a probit rule. The classic interpretation in econometric work for a random utility maximizer ρ is that it describes the frequencies of choices of a population of deterministic utility maximizers. The randomness arises from the modeler’s imperfect knowledge of the tastes of each individual. Even though at the individual level the choice is deterministic, due to unobservable characteristics the econometrician is only able to predict the choices of the individual in probabilistic terms. We can alternatively interpret random utility maximizers as a model of one-shot learning. When confronted with a set of alternatives, the decision maker receives a precise signal of how those alternatives are ranked according to her preferences. While tastes are stochastic, each choice realization is interpreted as perfectly reflecting the ranking of alternatives at the moment in which the choice was made. Our model of random choice departs from random utility maximization and is based on a gradual learning interpretation.
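The probit construction above is straightforward to approximate by simulation. A minimal Monte Carlo sketch (the utilities and the correlation value below are our own illustrative choices, not parameters from the paper):

```python
import numpy as np

def probit_choice_probs(mu, cov, n_draws=200_000, seed=0):
    """Estimate probit choice probabilities rho(i, b) = P{X_i > X_j for all j != i}
    by drawing X ~ N(mu, cov) and counting how often each coordinate is largest."""
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(mu, cov, size=n_draws)
    winners = X.argmax(axis=1)
    return np.bincount(winners, minlength=len(mu)) / n_draws

# Three alternatives with utilities 1.0, 0.5, 0.0; the second and third
# have correlated signal noise (gamma_23 = 0.6), the first is independent.
mu = [1.0, 0.5, 0.0]
gamma = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.6],
         [0.0, 0.6, 1.0]]
probs = probit_choice_probs(mu, gamma)
print(probs)  # higher-utility alternatives are chosen more often
```

Note that γ enters only as the covariance of the joint draw; the estimated probabilities automatically form a valid random choice rule on the menu.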
Ultimately preferences are described by a rational, complete and transitive ranking of all alternatives. But ex-ante the decision maker has no information about how the alternatives in a particular menu b are ranked. Instead, the decision maker has prior beliefs about the distribution of her preferences in the same kind of choice situation. These prior beliefs are symmetric, i.e., all alternatives are perceived in exactly the same way ex-ante. Before making a choice from the menu b, the individual has the opportunity to contemplate the alternatives. This contemplation takes the form of a signal that conveys partial information about the true utility of the alternatives. From the point of view of the observer, this formulation yields the following random choice rule.

2.2 Bayesian Probit Rule

The model can be described as the following modification of the probit rule. The decision maker is described by utility signal parameters (µ, γ) and additionally by prior parameters (m0, s0²) where m0 ∈ R and s0 > 0. The parameter µ describes the true utility of each alternative and is unknown to the decision maker. The decision maker has a prior belief about µ that is jointly Gaussian, where the utility of each alternative is seen as an independent draw from the same Gaussian distribution with mean m0 and variance s0². When facing a choice from a menu b = {1, 2, . . . , n}, the individual chooses as if observing a draw from the random utility vector X, updating beliefs about the utility of each alternative according to Bayes' rule, and picking the alternative with the highest posterior mean. We refer to the resulting random choice rule as a Bayesian probit rule with utility signal parameters (µ, γ) and prior parameters (m0, s0²). How much the decision maker learns about the alternatives before making a choice will depend on the precision of the signals.
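The one-signal Bayesian probit rule just described can be sketched numerically. The sketch below assumes the signal is a single draw X ∼ N(µ, γ) (the unit-time signal), combines it with the i.i.d. N(m0, s0²) prior by standard Gaussian conjugacy, and maximizes the posterior mean; the parameter values are hypothetical:

```python
import numpy as np

def bayesian_probit_probs(mu, gamma, m0=0.0, s0=1.0, n_draws=200_000, seed=0):
    """Monte Carlo sketch of a one-signal Bayesian probit rule: the decision
    maker observes one draw of X ~ N(mu, gamma), where the true mu is unknown
    to her, updates the i.i.d. N(m0, s0^2) prior by Bayes' rule, and chooses
    the alternative with the highest posterior mean."""
    n = len(mu)
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(mu, gamma, size=n_draws)   # signal draws
    # Conjugate Gaussian update: E[mu | X] = m0 + s0^2 (s0^2 I + gamma)^{-1} (X - m0)
    K = s0**2 * np.linalg.inv(s0**2 * np.eye(n) + gamma)
    post_means = m0 + (X - m0) @ K.T
    winners = post_means.argmax(axis=1)
    return np.bincount(winners, minlength=n) / n_draws

# Hypothetical menu: alternative 1 is better, noise is orthogonal.
p = bayesian_probit_probs(np.array([1.0, 0.0, 0.0]), np.eye(3))
```

When γ is orthogonal, the posterior mean is a common monotone transformation of the signal, so the ranking of posterior means coincides with the ranking of signals and the rule behaves like the probit of Example 3; correlated γ makes the update mix coordinates, which is the source of the context effects studied below.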
A natural question to ask in this model is how random choice behavior changes when the information about utility is made more precise. For example, one could do comparative statics analysis of random choice behavior when the decision maker is allowed to draw an additional independent signal from the same distribution. Choices arising from more precise information should exhibit fewer mistakes. More informed behavior can be interpreted as arising, for example, from a smarter decision maker, from a decision maker contemplating the alternatives at a closer distance, or from a decision maker who contemplated the alternatives for a longer period of time before making a decision. Our model describes random choice behavior as the precision of the signal for a random choice rule varies continuously from pure noise to arbitrarily precise signals. To allow such comparisons, we study an entire family of random choice rules, indexed by an ordered set. We call the resulting object a random choice process.

2.3 Random Choice Process

To investigate the effects of increasing the precision of the utility signal on random choice behavior, we now introduce a family of random choice rules indexed by an ordered set T. For concreteness, we can think of T = [0, ∞) as time.

Definition. A function ρ : A × A+ × T → [0, 1] is a random choice process if for every time t ∈ T the restriction ρ(·, ·, t) is a random choice rule.

A random choice process describes the probability distribution of choices for each finite set of alternatives b ∈ A+ and each point in time t ∈ T. Time can be seen merely as an abstract ordering of the different choice rules that compose the random choice process, but it can also be interpreted in the more literal sense of a temporal dimension.
For example, a random choice process can be thought of as encoding observable behavior in experiments of individual decision making where both the set of alternatives and the time available to contemplate the alternatives can be manipulated. With the literal interpretation of T as time, it is important to note that we do not treat the entire path of each realization of a choice process as observable. While this stronger enrichment of choice data would be sufficient, and is considered in theoretical work at least since Campbell (1978), and in recent experimental work in Caplin and Dean (2010), it is not necessary here.

Henceforward, we write ρt(i, a) instead of ρ(i, a, t). We use ρt to refer to the random choice rule obtained from the random choice process ρ by restricting its domain to a fixed time t. A random choice process ρ is continuous if for every i ∈ a ∈ A+ the mapping t ↦ ρt(i, a) is continuous.

We are interested in the random choice process of a decision maker whose choices are based on better information as time increases, approaching rationality as time goes to infinity. We say that a random choice process ρ ultimately learns if there is a random choice rule ρ∞ satisfying the conditions of rational choice in Example 1, such that for each i ∈ a ∈ A+ we have ρt(i, a) → ρ∞(i, a) as t → ∞. A random choice process ρ starts uninformed when for all menus b ∈ A+ and all alternatives i, j ∈ b, we have ρ0(i, b) = ρ0(j, b). In particular, for a binary random choice process we have ρ0(i, j) = ρ0(j, i) for all i, j.

Definition. A random choice process ρ is a learning process if it is continuous, starts uninformed, ultimately learns, and for all pairs of alternatives i, j ∈ A the mapping t ↦ ρt(i, j) is either strictly increasing, strictly decreasing, or constant.

2.4 Bayesian Probit Process

This subsection completes the description of our model.
The Bayesian probit process is a random choice process ρ in which every component random choice rule ρt is a Bayesian probit rule with the same utility signal parameters (µ, γ). As t increases, ρt is based on more precise information. To introduce the formal description, consider for a moment the case of discrete time t ∈ {1, 2, 3, . . . }. In this case ρ1 is just the Bayesian probit rule described in Subsection 2.2 above. Given ρt, we let ρt+1 correspond to the random choice rule obtained from ρt as if the decision maker could observe an additional independent utility signal realization from the same distribution. The continuous time case is analogous to the discrete case when we let the time interval go to zero.

Utility signals as a diffusion process

The decision maker's prior beliefs about the utility of each alternative are identically and independently distributed Gaussian with mean m0 ∈ R and variance s0² > 0. This prior reflects her information about the alternatives before starting the learning process. Ex-ante, all alternatives look the same, reflecting the fact that no signal has been observed. When the grand set of choice objects A has enough variety, we can think of the prior as the actual distribution of the utility parameter when the decision maker faces choice problems with alternatives from A. With this interpretation, the decision maker has correct beliefs, given the information available, and assuming that each finite menu is composed of an independent random sample of objects from A.

Let {1, 2, . . . , n} be an enumeration of the alternatives in a menu b ∈ A+. We model choices from b at each point in time as being governed by a covert utility signal process. The signal is interpreted as representing the decision maker's noisy perception of the desirability of each alternative over time.
We model the signal process X as a Brownian motion with drift, given by

dX(t) = µ dt + Λ dW(t),  X(0) = 0  (1)

where X(t) is an n-dimensional vector, µ ∈ Rn is a constant drift vector, Λ is a constant n × n matrix with full rank and W(t) = (W1(t), . . . , Wn(t)) is a Wiener process. The signal starts at X(0) = 0 almost surely. For each fixed time t > 0, the vector X(t) has a joint Gaussian distribution with mean µt and covariance matrix tΛΛ′. Each coordinate of X is a one-dimensional signal that corresponds uniquely to one of the alternatives in the menu. When we refer to the signal X for menu b ∈ A+, we have in mind an enumeration of the alternatives in b that corresponds to the ordering of the coordinates in X. We will assume that the signal processes that the decision maker observes when facing different menus are consistent: the drift and covariance parameters are always taken from the corresponding utility signal parameters (µ, γ) on A. For example, when one of the alternatives is deleted from menu b, the corresponding coordinate of X is removed, but otherwise the signal process remains the same.

2.5 Baseline characterization

We start by characterizing a baseline Bayesian probit process in which choices evolve as if pairs of alternatives did not vary in similarity.

We now introduce some additional notation. When ρ is a random choice rule, the number ρ(i, a) is a probability value between zero and one. To simplify the statements below, it will sometimes be convenient to express this quantity in the scale [−∞, ∞] by using the standard Gaussian cumulative distribution function Φ. Given a random choice rule ρ, we define ρ̂ : A × A+ → [−∞, ∞] by

ρ̂(i, a) := −∞ if ρ(i, a) = 0;  Φ⁻¹(ρ(i, a)) if ρ(i, a) ∈ (0, 1);  +∞ if ρ(i, a) = 1.  (2)

Definition. The random choice rule ρt satisfies the Gaussian triangle condition when, for all i, j, k ∈ A, ρ̂t(i, k) = ρ̂t(i, j) + ρ̂t(j, k).
(3)

A random choice process ρ satisfies the Gaussian triangle condition when it is satisfied by every ρt. The Gaussian triangle condition is a cross-restriction on how the random choice rule is able to discriminate among pairs of alternatives. It says that if we know how well the random choice rule discriminates among the pairs (i, j) and (j, k), then we should also know how it discriminates between i and k. Note that the condition is expressed using the Gaussian transformation ρ̂ defined in (2). This condition arises naturally in a model with Gaussian noise. Analogous conditions would hold for other distributions, e.g., the logit model.

Definition. A random choice process ρ has a Gaussian learning rate when for all alternatives i, j, k, l ∈ A and all times s, t ∈ T

ρ̂t(i, j) × ρ̂s(k, l) = ρ̂s(i, j) × ρ̂t(k, l).  (4)

To understand the substance of this last property, suppose for a moment that t > s and note that when the denominators are different from zero we can rewrite condition (4) as

ρ̂t(i, j) / ρ̂s(i, j) = ρ̂t(k, l) / ρ̂s(k, l).

The left hand side ratio is a measure of how the discrimination between alternatives i and j evolved from time s to time t. The Gaussian learning rate condition requires that this ratio be constant across pairs of alternatives: if we know how the discrimination between i and j changed from s to t, and we know how alternatives k and l were discriminated at time s, then we must know how those two alternatives are discriminated at t.

Finally, we introduce a simple notion of time change. A function f : T → T is a time change when it is strictly increasing and continuous. We say that ρ is a Bayesian learning process up to a time change when there is a time change f such that the process ρ′ given by ρ′t = ρf⁻¹(t) is a Bayesian learning process. Of course, every Bayesian learning process ρ is a Bayesian learning process up to a time change, by taking f(t) = t.

Theorem 1 (Baseline Model).
A binary random choice process ρ is a simple Bayesian probit process (up to a time change) if and only if it is a learning process, has a Gaussian learning rate and satisfies the Gaussian triangle condition.

3 Revealed Similarity

In classical revealed preference theory, choice data takes the form of a deterministic choice function. A deterministic choice function picks one or more elements from a given menu of alternatives with probability one. The Weak Axiom of Revealed Preference (WARP) imposes consistency on choice data across different menus of alternatives. If an element j is picked from menu a when k was also available, then WARP does not allow k to be picked in any other menu that contains j, unless j is picked as well. The theory allows us to tease out a preference relation from choice data that satisfies WARP.

Our choice data takes the form of a random choice process ρ. Since the choice data is stochastic, a random choice process finely captures the tendency of a decision maker to pick an alternative j over an alternative k. Our theory goes beyond classical revealed preference to explain comparability: how easy is it for a decision maker to choose the best alternative out of the pair j, k? Comparability measures how easy it is to compare two options, and it depends on preference. Consider a choice from the pair j, k where j is actually preferred to k. Comparability is directly proportional to how likely the decision maker is to choose j. In other words, the less likely the decision maker is to make a mistake and pick k, the easier it is to compare the pair j, k.

Let ρ be a random choice rule. We define a preference relation % on A from the binary random choices in ρ in the following manner. Let i % j and say that alternative i is at least as good as j if

ρ(i, j) ≥ 1/2 ≥ ρ(j, i).

Write ∼ for the symmetric part of % as usual. When ρ satisfies the assumptions of Theorem 1, % is complete and transitive and represented by µ.
Hence under those assumptions the random choice rule reveals the preference ranking %. Analogously, from a given ρ we can define a similarity relation %̇ over pairs of alternatives on A × A. A pair of alternatives i, j ∈ A is more similar than a pair i′, j′ ∈ A, denoted (i, j) %̇ (i′, j′), when i ∼ i′, j ∼ j′ and

ρ(i, j)ρ(j, i) ≤ ρ(i′, j′)ρ(j′, i′)  (5)

and we say that i is more similar to j than to j′ when we take i = i′ above. To interpret this notion of similarity, note that i and i′ are equally ranked according to preference, and also j and j′ are equally ranked according to preference. Insofar as preference is concerned, the alternatives in the pair (i, j) are compared exactly as the alternatives in the pair (i′, j′). Expression (5) says that the random choice rule discriminates the pair (i, j) at least as well as the pair (i′, j′). A strict inequality in (5) therefore indicates that (i, j) must be easier to compare.

For a random choice process ρ satisfying the assumptions of Theorem 1, there is never strict inequality in (5). In this sense, Theorem 1 characterizes a baseline case in which all pairs of alternatives are equally similar. The characterization is useful because it allows us to identify the Gaussian triangle condition as the property responsible for all alternatives being equally similar. To understand the role of the Gaussian triangle condition, let i ∼ i′ and j′ ∼ j. This implies that ρ(i, i′) = ρ(j, j′) = 1/2 and therefore ρ̂(i, i′) = ρ̂(j, j′) = 0. By the Gaussian triangle condition we have

ρ̂(i, j) = ρ̂(i, i′) + ρ̂(i′, j) = ρ̂(i, i′) + ρ̂(i′, j′) + ρ̂(j′, j) = ρ̂(i′, j′)

and therefore ρ(i, j) = ρ(i′, j′), so that the inequality in (5) is never strict. Hence, in order to study the consequences of varying degrees of similarity for random choice behavior, we need to relax the Gaussian triangle condition. In the next subsection, we will require the Gaussian triangle condition to fail.
Moreover, in order to identify the similarity relation, we will require that the grand set of alternatives A have a pivotal alternative and enough variety in the way that the Gaussian triangle condition is violated by triples that include the pivotal alternative.

3.1 A distinct alternative

In order to facilitate the introduction of the next assumption, we will restate the Gaussian triangle condition in a more convenient form. Recall that the Gaussian triangle condition requires that for all triples of alternatives i, j, k ∈ A, ρ̂(i, k) = ρ̂(i, j) + ρ̂(j, k). Therefore, when ρ(i, k) ≠ 1/2, so that ρ̂(i, k) ≠ 0, we can rewrite this condition as a ratio

[ρ̂(i, j) + ρ̂(j, k)] / ρ̂(i, k) = 1

or equivalently in the following convenient form

1 − ( [ρ̂(i, j) + ρ̂(j, k)] / ρ̂(i, k) )² = 0.

Definition. A random choice rule ρ has a distinct alternative k∗ when, for every pair of alternatives i, j ∈ A with i ≁ j and every ε > 0, there exists a pair of alternatives j′, j′′ ∈ A such that j′ ∼ j ∼ j′′, ρ(j, k∗) = ρ(j′, k∗) = ρ(j′′, k∗) and

1 − ( [ρ̂(i, k∗) + ρ̂(k∗, j′′)] / ρ̂(i, j′′) )² < ε − 1 < 1 − ε < 1 − ( [ρ̂(i, k∗) + ρ̂(k∗, j′)] / ρ̂(i, j′) )².

To understand the role of the distinct alternative k∗ in the condition above, note that it requires that for every pair of alternatives i, j in which i is not equally preferred to j, we can find two alternatives j′, j′′ equally preferred to j, such that the triples i, j′, k∗ and i, j′′, k∗ sufficiently violate the Gaussian triangle condition in opposite directions. So this condition embeds both an assumption about the existence of a distinct or pivotal alternative k∗, and also a condition about a rich variety of options in the set A. Substituting the distinct alternative assumption for the Gaussian triangle condition, one can obtain a Bayesian probit process in which γ is not orthogonal.

Suppose M is a property of a random choice rule.
We say that a random choice process ρ eventually has property M when there is a time T ∈ T such that for all times t > T the random choice rule ρt has property M.

Proposition 2 (Ranking stabilization). For every menu b ∈ A+ and every pair of alternatives i ≻ j in b, the Bayesian learning process eventually chooses i more often than j in b.

4 Similarity Puzzle

In order to state the main results, throughout this section we assume that ρ is a Bayesian probit process with a distinct alternative. We use % to denote the preference relation over alternatives corresponding to ρ, and we use %̇ to denote the similarity relation over pairs corresponding to ρ. Choice alternatives are labeled 1, 2 and 3. In the Theorems below, alternatives 1 and 2 are equally desirable: 1 ∼ 2. So alternatives 1 and 2 are chosen with probability one-half from menu {1, 2} at any date. We also assume that (1, 3) ≻̇ (1, 2) and (1, 3) ≻̇ (2, 3), i.e., the pair (1, 3) is strictly more similar than the pairs (1, 2) and (2, 3). In other words, alternative 3 is more similar to alternative 1 than to alternative 2.

Theorem 3 (Similarity Puzzle). There are times T1, T2 > 0 such that introducing 3 hurts 2 more than it hurts 1 for all times before T1, and the reverse occurs for all times greater than T2.

Early in the learning process, the decision maker is relatively unfamiliar with the options. The relative ranking of similar options will be learned faster than that of dissimilar ones. Theorem 3 shows that this can go against Tversky's similarity hypothesis early in the learning process, but that as the decision maker becomes more familiar with the options this "attraction effect" disappears. The following Corollary illustrates both effects in their most extreme form.

Corollary 4. Let 1 ∼ 2 ∼ 3 and ε > 0.
If the pairs (1, 2) and (2, 3) are equally similar and the pair (1, 3) is taken sufficiently similar, we have (i) for all t sufficiently small, ρt(1, 123), ρt(3, 123) > 1/2 − ε; and (ii) for all t sufficiently large, ρt(1, 123), ρt(3, 123) ∈ [1/4, 1/4 + ε).

Item (i) in the Corollary says that, when the decision maker is uninformed, she will choose one of the alternatives of the very similar pair with probability close to one. To understand this result, consider what happens at times close to zero, when utility signals are very imprecise. As Proposition 7 below shows, in the limit as t goes to zero utility becomes irrelevant, and choice probabilities depend solely on similarity. When the similarity of the pair (1, 3) is extreme, the decision maker rapidly learns whether 1 is a better alternative than 3 or vice versa, before learning much about anything else, and therefore alternative 2 will seldom be the most attractive.

Item (ii) says that, in the limit as t goes to infinity, the alternatives in the very similar pair (1, 3) are treated as a single option. In general, any small difference in the utility of the alternatives will eventually be learned by the decision maker, and the best alternative will be chosen with probability close to one. When the three alternatives have the exact same utility, it is similarity, rather than utility, that determines how ties are broken: alternative 2 is chosen with probability close to one-half and an alternative in the very similar pair (1, 3) is chosen with probability close to one-half.

Our final result shows that the initial positive effect of similarity can persist indefinitely when the similar options are not equally desirable. Moreover, similarity can cause the choice probability of an existing option to increase, leading to a violation of monotonicity and to the classic attraction effect. Recall that alternatives 1 and 2 are equally desirable, and therefore chosen from the menu {1, 2} with probability one-half at any date.
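The time-t choice frequencies of a Bayesian probit process can be explored by simulation: draw X(t) ∼ N(tµ, tΣ), form the posterior mean of µ under the i.i.d. N(m0, s0²) prior by Gaussian conjugacy, and record which alternative maximizes it. A sketch, with parameter values chosen only for illustration (two equally good alternatives and an inferior third one whose signal is correlated with the first):

```python
import numpy as np

def bayes_probit_at_t(mu, Sigma, t, m0=0.0, s0=1.0, n_draws=200_000, seed=0):
    """Choice frequencies of a Bayesian probit process at time t:
    X(t) ~ N(t*mu, t*Sigma); posterior mean of mu given X(t) under the
    i.i.d. N(m0, s0^2) prior; pick the coordinate with highest posterior mean."""
    n = len(mu)
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(t * np.asarray(mu), t * Sigma, size=n_draws)
    # E[mu | X(t)] = m0 + s0^2 (t s0^2 I + Sigma)^{-1} (X(t) - t m0)
    K = s0**2 * np.linalg.inv(t * s0**2 * np.eye(n) + Sigma)
    post = m0 + (X - t * m0) @ K.T
    return np.bincount(post.argmax(axis=1), minlength=n) / n_draws

# Illustrative parameters: 1 and 2 equally good, 3 inferior but similar
# to 1 (signal correlation 1/2); these values are assumptions, not estimates.
mu = [3.0, 3.0, -3.0]
Sigma = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.0],
                  [0.5, 0.0, 1.0]])
p = bayes_probit_at_t(mu, Sigma, t=1.0)
```

Already at t = 1 this parameterization chooses alternative 1 noticeably more often than the equally good alternative 2, and more often than one-half, the kind of monotonicity violation described by Theorem 5 below.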
Theorem 5 (Attraction effect). For every ε > 0, if alternative 3 is sufficiently inferior, then alternative 1 is chosen from menu {1, 2, 3} with probability strictly larger than one-half for all t > ε.

Hence adding a sufficiently inferior alternative 3 to the menu {1, 2} will boost the probability of the similar alternative 1 being chosen, the phenomenon known as the attraction effect. The Theorem shows that the attraction effect appears arbitrarily early in the choice process, and lasts indefinitely. For a concrete illustration of how the attraction effect arises in the context of our model, consider the following numerical example.

Example 4 (Attraction Effect). The set of alternatives is given by A = {1, 2, 3}. The decision maker's prior beliefs are i.i.d. standard normally distributed. Alternatives 1 and 2 are equally desirable with utility µ1 = µ2 = 3. Alternative 3 is less desirable with µ3 = −3. The inferior alternative 3 is similar to alternative 1 with γ13 = 1/2, while γ12 = γ23 = 0. When choosing from the menu {1, 2} the decision maker can never discriminate between the two options: each option is chosen with probability one-half at any given time. Figure 5 describes what happens once the inferior but similar alternative is introduced.

[Figure 5: The attraction effect. Choice probabilities from menu {1, 2, 3} plotted against time on a logarithmic scale.]

Figure 5 shows time in logarithmic scale, to better visualize the start of the choice process. Note that while the graph shows 10,000 units of time, the first half covers the first unit of time. The top curve (solid line) is the probability that alternative 1 is chosen; the middle curve (dotted line) corresponds to the choice probability of alternative 2; and the bottom curve (dashed line) corresponds to the choice probability of alternative 3.
At the beginning of the choice process, when choices are based on relatively little information about the alternatives, the probability of choosing the inferior alternative 3 rapidly decreases, while the probability of choosing alternative 1 is boosted up. Similarity allows the decision maker to learn the relative ranking of alternatives 1 and 3 faster than the relative ranking of the other pairs. This leads to the attraction effect: the addition of the inferior but similar alternative 3 causes the probability of choosing alternative 1 to increase.

The attraction effect is sustained as information gets more precise. Since alternatives 1 and 2 are equally desirable, eventually each one is chosen with probability converging to one-half. But while the choice probability of alternative 1 tends to one-half from above, the choice probability of alternative 2 tends to one-half from below. The choice probability of alternative 1 jumps above one-half very early, violating monotonicity, and remains there for the entire length of the choice process. By Theorem 5, we know this violation of monotonicity can happen arbitrarily early in the choice process, if the added alternative 3 is taken sufficiently inferior. ♦

5 Naïve probit versus Bayesian probit

Proposition 6 (Equivalence of binary choices). A binary random choice process ρ is a Bayesian probit process if and only if it is a naïve probit process. In the affirmative case, they can share the same utility signal structure.

Binary choices alone do not allow us to distinguish a Bayesian probit process from a naïve probit process, in either the baseline characterization or the main characterization above. We now show that while behavior is indistinguishable when restricted to binary choices, the overlap between the two models disappears when choices from menus with three or more alternatives are considered. To see this it is sufficient to consider ternary choice.
The next result shows that in menus of three alternatives, the Bayesian probit process and the naïve probit process exhibit radically different choice behavior from the very beginning. Let ρ be a Bayesian probit process and let ρ̈ be a naïve probit process sharing the utility signal parameters (µ, γ). Let b = {1, 2, 3}.

Proposition 7 (Ternary choice at time zero). In the limit as t → 0+, the naïve probit rule and the Bayesian probit rule choose alternative 1 with probability

ρ̈0(1, b) = 1/4 + (1/2π) arctan[ (1 + γ23 − γ12 − γ13) / √( 2(1 − γ12) · 2(1 − γ13) − (1 + γ23 − γ12 − γ13)² ) ]

ρ0(1, b) = 1/4 + (1/2π) arctan[ ( (1 + γ12)(1 + γ13) − γ23(1 + γ12 + γ13 + γ23) ) / √( (3 + 2γ12 + 2γ13 + 2γ23)(1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) ) ]

and analogous expressions hold for alternatives 2 and 3.

Corollary 8. The Bayesian probit process and the naïve probit process are indistinguishable if and only if they share utility signal parameters and γ is orthogonal.

Proposition 7 and its corollary show that when we observe choice behavior from menus with more than two alternatives, the Bayesian probit process and the naïve probit process exhibit incompatible behavior, except for the knife-edge case of a common parameterization with orthogonal noise. Figure 6 illustrates how choices for both processes start at time zero from menu b = {1, 2, 3}. The Figure shows how the probability of choosing alternative 1 changes as a function of the correlation γ12 between alternatives 1 and 2. The remaining correlation parameters γ13 and γ23 are fixed at zero. The solid line corresponds to the Bayesian probit rule, and the dashed line corresponds to the naïve probit rule. Note that the curves intersect when γ12 = 0, where noise is orthogonal and both processes start with a uniform distribution, i.e., each alternative is chosen with probability 1/3. As can be seen in Proposition 7, choices at time zero do not depend on µ.
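The two time-zero expressions can be checked directly. A sketch implementing them (the probabilities for alternatives 2 and 3 follow by permuting indices, so for alternative i the first two arguments are the correlations of the pairs containing i):

```python
from math import atan, sqrt, pi

def naive_p0(g12, g13, g23):
    """Time-zero probability of choosing alternative 1 under the naive probit
    (first display of Proposition 7)."""
    num = 1 + g23 - g12 - g13
    den = sqrt(2 * (1 - g12) * 2 * (1 - g13) - num**2)
    return 0.25 + atan(num / den) / (2 * pi)

def bayes_p0(g12, g13, g23):
    """Time-zero probability of choosing alternative 1 under the Bayesian probit
    (second display of Proposition 7)."""
    num = (1 + g12) * (1 + g13) - g23 * (1 + g12 + g13 + g23)
    den = sqrt((3 + 2*g12 + 2*g13 + 2*g23)
               * (1 + 2*g12*g13*g23 - g12**2 - g13**2 - g23**2))
    return 0.25 + atan(num / den) / (2 * pi)

# With orthogonal signals both rules start uniform: 1/4 + arctan(1/sqrt(3))/(2*pi) = 1/3.
u_naive = naive_p0(0.0, 0.0, 0.0)   # -> 1/3
u_bayes = bayes_p0(0.0, 0.0, 0.0)   # -> 1/3
```

A useful sanity check is that, for any admissible γ, the three probabilities of each rule sum to one; for example, with γ12 = 1/2 and γ13 = γ23 = 0 the three permuted evaluations of either formula add up to exactly one.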
Proposition 7 and its corollary relied on choice behavior at the start of the learning process to distinguish between the Bayesian probit and the naïve probit. We can in general distinguish them from observed choice behavior, except in the knife-edge case of orthogonal utility signals. Note that since each random choice rule ρ̈t in the naïve probit is a random utility maximizer, the naïve probit process must satisfy monotonicity. Our main results in the previous section demonstrated another distinction of the Bayesian probit from the naïve probit in terms of observable choices, namely, the violation of monotonicity.

6 Application: Relating Noise to Observables

In experimental settings where the attraction and compromise effects are observed, choice alternatives are often described by two attributes. For example, in Simonson (1989) cars are described by 'ride quality' and 'miles per gallon', brands of beer are described by 'price of a six-pack' and 'quality rating' and brands of mouthwash by 'fresh breath effectiveness' and 'germ-killing effectiveness'. Assume that each alternative xi is described by a vector (xi1, xi2) ∈ R². The extra structure allows us to build a model in which the noise correlation depends explicitly on the observable characteristics. This model allows us to distinguish violations of monotonicity based on the location of the new alternatives in the attribute space. For example, it allows us to distinguish the attraction effect from the compromise effect.

[Figure 6: Probability of alternative 1 being chosen from menu b = {1, 2, 3} as a function of the correlation γ12 between the utility signals of alternatives 1 and 2, in the limit as t is taken arbitrarily close to zero. The solid line corresponds to the Bayesian learning model. The dashed line corresponds to the naïve learning probit. The correlations γ13 and γ23 are fixed equal to zero.]
Since choices at time zero are determined solely by γ, and the parameterization is symmetric from the point of view of alternatives 1 and 2, the graph simultaneously describes the choice probabilities for alternative 2. When γ12 is zero the utility signal structure is simple, choices at time zero correspond to the uniform distribution for both models, and therefore the two lines cross at exactly one third. As γ12 reaches one, alternatives 1 and 2 become maximally similar and the naïve learner chooses each of them with probability 1/4. In contrast, the Bayesian learner chooses alternatives 1 and 2 with probability 1/2 and alternative 3 with probability zero.

Example 5 (Parameterization of covariance matrix in multinomial probit). Hausman and Wise (1978) propose the following parameterization of the covariance matrix for a multinomial probit, when each choice alternative xi is described by two positive coordinates (xi1, xi2) ∈ R². In their specification, the random utility for alternative xi is given by

U(xi) = β1 xi1 + β2 xi2

where β1, β2 are independent and normally distributed with the same variance σ² > 0. This can be interpreted as a consumer with a linear utility over objects in a two-dimensional commodity space, where the linear weights of each dimension are random. The common weights will cause the utility for different alternatives to be correlated. Specifically we have

Corr(U(xi), U(xj)) = (xi1 xj1 σ² + xi2 xj2 σ²) / ( √(σ²(xi1² + xi2²)) · √(σ²(xj1² + xj2²)) ) = (xi1 xj1 + xi2 xj2) / (‖xi‖ ‖xj‖) = cos(xi, xj)

so that the correlation coefficient for the utilities of alternatives xi and xj corresponds to the cosine of the angle formed by the two alternatives in the two-dimensional commodity space. ♦

Our behavioral notion of similarity translates immediately to this particular setting under the simple parameterization of Example 5.
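Under this parameterization the similarity parameter is just the cosine of the angle between the attribute vectors; a minimal sketch (the attribute values are hypothetical):

```python
import numpy as np

def attribute_correlation(xi, xj):
    """Hausman-Wise parameterization (Example 5): the utility-signal
    correlation gamma_ij equals the cosine of the angle between the
    attribute vectors xi and xj."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    return float(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

# Alternatives on the same ray are maximally similar (cosine one);
# alternatives with no shared attribute direction are dissimilar (cosine zero).
same_ray = attribute_correlation([1.0, 2.0], [2.0, 4.0])      # -> 1.0
orthogonal = attribute_correlation([1.0, 0.0], [0.0, 3.0])    # -> 0.0
```

The function makes the mapping from observables to the model's γ explicit: feeding the attribute vectors of a menu pairwise through it yields the off-diagonal entries of the covariance matrix used in the probit.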
Since similarity is captured by noise correlation, two alternatives are maximally similar when they lie on the same ray departing from the origin. In this case they form an angle of zero radians, and the correlation γij is equal to the cosine of zero, which is one. This corresponds to the intuition that when one alternative clearly dominates the other in all attributes, comparison is facilitated. On the other hand, if alternatives form an angle close to ninety degrees, the cosine of their angle is close to zero, and noise is uncorrelated. Intuitively, since the alternatives do not share any physical attributes, their comparison is hard.

[Figure 7: The noise correlation parameter γij equals the cosine of the angle θ formed by i and j in R².]

7 Conclusion

We proposed a model of random choice in which the decision maker initially has a coarse and symmetric perception of the utility of the alternatives and, before making a choice, engages in a learning process about her own ranking of the available options. The learning process is interrupted when a choice must be made. We characterized the random choice behavior of such a decision maker when information about utility follows a diffusion process. We showed conditions that allowed us to identify a preference ranking over the alternatives and, in addition, a similarity ranking over pairs of alternatives. Both rankings are subjective and revealed by the choice process.

In the model, the intensity with which a random choice rule discriminates between two alternatives in a menu depends on three factors: the relative ranking of the alternatives according to preference; the similarity of the alternatives; and the overall precision of the information available to the decision maker. We modeled choices as a random choice process in which the precision of the information can be interpreted as the amount of time available to contemplate the options.
Discrimination improves when the alternatives are more distant according to preference, when the alternatives are more similar, and with the amount of time the decision maker is allowed to contemplate the alternatives. The model captures similarity in the form of utility signal correlation. Signal correlation causes the learning process to be context dependent, and therefore choices are also context dependent. By identifying a notion of similarity that is independent of preference, this model allowed us to solve the similarity puzzle of random choice. By allowing context dependence through noise correlation, the model has the potential to address other empirical regularities found in the literature and related to context dependence, such as the status quo bias and the compromise effect. Applications of the model to these phenomena are left for future work.

References

Campbell, Donald E., "Realization of Choice Functions," Econometrica, 1978, 46 (1), 171–180.

Caplin, Andrew and Mark Dean, "Search, Choice, and Revealed Preference," Theoretical Economics, forthcoming, 2010.

Debreu, Gerard, "Review: Individual Choice Behavior," The American Economic Review, 1960, 50 (1), 186–188.

Gul, Faruk, Paulo Natenzon, and Wolfgang Pesendorfer, "Random Choice as Behavioral Optimization," Working Paper, Princeton University, February 2012.

Hausman, Jerry A. and David A. Wise, "A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences," Econometrica, 1978, 46 (2), 403–426.

Huber, Joel and Christopher Puto, "Market Boundaries and Product Choice: Illustrating Attraction and Substitution Effects," The Journal of Consumer Research, 1983, 10 (1), 31–44.

Huber, Joel, J. W. Payne, and C. Puto, "Adding Asymmetrically Dominated Alternatives: Violations of Regularity and the Similarity Hypothesis," Journal of Consumer Research, 1982, 9 (1), 90–98.
Keller, Godfrey, "Brownian Motion and Normally Distributed Beliefs," Working Paper, University of Oxford, December 2009.

Luce, R. Duncan, Individual Choice Behavior: A Theoretical Analysis, New York: Wiley, 1959.

Manzini, Paola and Marco Mariotti, "Sequentially Rationalizable Choice," American Economic Review, 2007, 97 (5), 1824–1839.

Masatlioglu, Y., D. Nakajima, and E. Y. Ozbay, "Revealed Attention," The American Economic Review, 2012, 102 (5), 2183–2205.

McFadden, Daniel, "Conditional Logit Analysis of Qualitative Choice Behavior," in Paul Zarembka, ed., Frontiers in Econometrics, New York: Academic Press, 1974, chapter 4, 105–142.

McFadden, Daniel, "Economic Choices," The American Economic Review, 2001, 91 (3), 351–378.

Ok, E. A., P. Ortoleva, and G. Riella, "Revealed (P)reference Theory," Working Paper, August 2012.

Simonson, Itamar, "Choice Based on Reasons: The Case of Attraction and Compromise Effects," The Journal of Consumer Research, 1989, 16 (2), 158–174.

Train, Kenneth, Discrete Choice Methods with Simulation, 2nd ed., Cambridge University Press, 2009.

Tversky, A., "Choice by Elimination," Journal of Mathematical Psychology, 1972, 9 (4), 341–367.

Tversky, A., "Elimination by Aspects: A Theory of Choice," Psychological Review, 1972, 79 (4), 281–299.

8 Appendix: Proofs

Proof of Theorem 1 — Necessity

Suppose ρ is the Bayesian learning process with prior parameters (m0, s0²) and utility signal structure (μ, γ). Consider an enumeration of a finite menu b = {1, 2, . . . , n}. Prior beliefs about μ are jointly Gaussian with mean m(0) = (m0, m0, . . . , m0) and covariance matrix s(0) = s0² I, where I is the n × n identity matrix.
An application of Bayes' rule gives that posterior beliefs about μ, after observing the signal process X up to time t > 0, are jointly Gaussian with mean m(t) and covariance s(t) given by

\[
s(t) = \left[ s(0)^{-1} + t(\Lambda\Lambda')^{-1} \right]^{-1},
\qquad
m(t) = s(t)\left[ s(0)^{-1}m(0) + (\Lambda\Lambda')^{-1}X(t) \right]
\]

where it is immediate that s(t) is a deterministic function of time, while m(t) is random and has a joint Gaussian distribution itself. The parameters of the posterior mean m(t) are given by

\[
E[m(t)] = s(t)\left[ s(0)^{-1}m(0) + t(\Lambda\Lambda')^{-1}\mu \right],
\qquad
\mathrm{Var}[m(t)] = t\, s(t)(\Lambda\Lambda')^{-1}s(t)'.
\]

The following Lemma is easy and its proof is omitted. For a discussion of the one-dimensional case see Keller (2009).

Lemma 9. Beliefs converge to the true value: E[m(t)] → μ, Var[m(t)] → 0 and s(t) → 0.

Now consider a binary menu enumerated as b = {1, 2}. When facing the alternatives in menu b, the decision maker's choices at t > 0 depend on the realization of the random vector m(t) = (m1(t), m2(t)). Alternative 1 is chosen if and only if m2(t) − m1(t) < 0. Since m(t) is Gaussian, m2(t) − m1(t) is also Gaussian, with mean and variance given by

\[
\frac{t s_0^2 (\mu_2 - \mu_1)}{t s_0^2 + 1 - \gamma_{12}}
\qquad\text{and}\qquad
\frac{2 t s_0^4 (1 - \gamma_{12})}{\left(t s_0^2 + 1 - \gamma_{12}\right)^2},
\]

hence

\[
\rho_t(1,2) = P\{m_2(t) - m_1(t) < 0\}
= \Phi\!\left( \sqrt{t}\;\frac{\mu_1 - \mu_2}{\sqrt{2(1-\gamma_{12})}} \right)
\]

where again Φ denotes the cdf of the standard Gaussian distribution. When the utility signal structure (μ, γ) is simple, this reduces to

\[
\rho_t(i,j) = \Phi\!\left( \sqrt{t}\;\frac{\mu_i - \mu_j}{\sqrt{2}} \right).
\]

From this expression it is straightforward to verify that ρ is continuous, monotonic, starts uninformed, ultimately learns, and has a Gaussian learning rate. Finally, it is also easy to verify that it satisfies the Gaussian triangle condition:

\[
\hat\rho_t(i,j) + \hat\rho_t(j,k)
= \sqrt{t}\;\frac{(\mu_i - \mu_j) + (\mu_j - \mu_k)}{\sqrt{2}}
= \hat\rho_t(i,k).
\]

Proof of Theorem 1 — Sufficiency

Suppose ρ′ is a continuous, monotonic random choice process that starts uninformed, ultimately learns, has a Gaussian learning rate and satisfies the Gaussian triangle condition. Suppose ρ′t∗(i∗, j∗) > 1/2 for some i∗, j∗, t∗.
Step 1: Time change. Let f : T → T be given by

\[
f(s) = \left( \frac{\hat\rho'_s(i^*, j^*)}{\hat\rho'_{t^*}(i^*, j^*)} \right)^{\!2}, \quad \forall s \in T,
\]

and note that it is well-defined since ρ̂′t∗(i∗, j∗) ≠ 0. Since ρ′ is continuous and monotonic, f is continuous and strictly increasing. So f is a time change. In particular we have f(0) = 0 and f(t∗) = 1. We use this time change to define the random choice process ρ by ρt(i, a) = ρ′f⁻¹(t)(i, a) for each alternative i, menu a and time t.

Step 2: The time change sets the learning rate. For any pair of alternatives k, l ∈ A we have

\[
\hat\rho_t(k,l) = \hat\rho'_{f^{-1}(t)}(k,l)
= \left( \frac{\hat\rho'_{f^{-1}(t)}(i^*,j^*)}{\hat\rho'_{t^*}(i^*,j^*)} \right) \hat\rho'_{t^*}(k,l)
= \sqrt{t}\;\hat\rho'_{f^{-1}(1)}(k,l)
= \sqrt{t}\;\hat\rho_1(k,l),
\]

hence

\[
\sqrt{s}\;\hat\rho_t(k,l) = \sqrt{st}\;\hat\rho_1(k,l) = \sqrt{t}\;\hat\rho_s(k,l).
\]

Figure 8: The time change f(s) = (ρ̂′s(i∗, j∗)/ρ̂′t∗(i∗, j∗))², with f(t∗) = 1.

Step 3: Gaussian triangle condition. Note that ρ satisfies the Gaussian triangle condition, just like ρ′: the condition is a restriction on each random choice rule ρt taken in isolation, and is therefore not affected by the time change.

Step 4: Construct the utility signal structure. Take an alternative i∗ ∈ A and define μ : A → R by

\[
\mu(j) := \sqrt{2}\;\hat\rho_1(j, i^*), \quad \text{for all } j \in A.
\]

Since we use the convention ρt(i, i) = 1/2 for all i ∈ A, this implies μ(i∗) = 0. Define γ : A × A → [−1, 1] so that (μ, γ) is simple: let γ(i, i) = 1 for all i ∈ A and γ(i, j) = 0 for all i ≠ j.

Step 5: (μ, γ) is a representation for ρ. By the Gaussian learning rate and the time change, for each t > 0 and j ∈ A we have

\[
\sqrt{t}\,\mu(j) = \sqrt{2}\,\sqrt{t}\;\hat\rho_1(j, i^*) = \sqrt{2}\;\hat\rho_t(j, i^*).
\]

By the Gaussian triangle condition, for every i, j ∈ A and t > 0,

\[
\rho_t(i,j) = \Phi(\hat\rho_t(i,j))
= \Phi(\hat\rho_t(i,i^*) + \hat\rho_t(i^*,j))
= \Phi(\hat\rho_t(i,i^*) - \hat\rho_t(j,i^*))
= \Phi\!\left( \sqrt{t}\;\frac{\mu(i) - \mu(j)}{\sqrt{2}} \right),
\]

which together with the necessity part shows that ρ is a Bayesian learning process with a simple utility signal structure.

Proof of Proposition 2. Since i ≻ j we have μi > μj, so E[mi(t) − mj(t)] → μi − μj > 0.
Since Var[m(t)] → 0 as t goes to infinity, the probability that mi(t) > mj(t) goes to one.

Proof of Proposition 7. In general, for random utility models that specify Gaussian error terms, calculating choice probabilities from the parameters of the model involves the numerical computation of multivariate integrals. Here we show that, for the starting time of a random choice process, we can obtain expressions for ternary choices in terms of elementary (trigonometric) functions. We interpret choices at time zero as the limit of the distribution of choices at time t as t is taken arbitrarily small. We will obtain expressions showing that a Bayesian learning model and a naïve learning probit model sharing the same utility signal structure (μ, γ) will in general exhibit different choice behavior for times close to zero. The two coincide only in the knife-edge case in which the common utility signal structure is simple. In this knife-edge case, both choice processes start with a uniform distribution over the available alternatives.

Let (μ, γ) be a utility signal structure and let b = {1, 2, 3}. First consider the naïve learning probit model. The utility signal X corresponding to b has three dimensions. For every time t > 0 we have X(t) ∼ N(tμ, tΛΛ′). Let L1 be the 2 × 3 matrix given by

\[
L_1 = \begin{bmatrix} -1 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix}
\]

so that L1X(t) = (X2(t) − X1(t), X3(t) − X1(t)). Then

\[
\rho_t(1, b) = P\{X_1(t) > X_2(t) \text{ and } X_1(t) > X_3(t)\} = P\{L_1 X(t) \le 0\}.
\]

The vector L1X(t) is jointly Gaussian with mean tL1μ and covariance tL1ΛΛ′L1′. So L1X(t) has the same distribution as tL1μ + √t MB, where B is a bi-dimensional standard Gaussian random vector and MM′ = L1ΛΛ′L1′ has full rank.
Take M to be the Cholesky factor

\[
M = \begin{bmatrix}
\sqrt{2(1-\gamma_{12})} & 0 \\[6pt]
\dfrac{1+\gamma_{23}-\gamma_{12}-\gamma_{13}}{\sqrt{2(1-\gamma_{12})}} &
\sqrt{2(1-\gamma_{13}) - \dfrac{(1+\gamma_{23}-\gamma_{12}-\gamma_{13})^2}{2(1-\gamma_{12})}}
\end{bmatrix};
\]

then we can write, for each t > 0,

\[
\rho_t(1, b) = P\{L_1 X(t) \le 0\}
= P\{tL_1\mu + \sqrt{t}\,MB \le 0\}
= P\{MB \le -\sqrt{t}\,L_1\mu\},
\]

and taking t → 0 we obtain

\[
\rho_0(1, b) = P\{MB \le 0\}.
\]

Now MB ≤ 0 if and only if

\[
0 \ge \sqrt{2(1-\gamma_{12})}\,B_1
\quad\text{and}\quad
0 \ge \frac{1+\gamma_{23}-\gamma_{12}-\gamma_{13}}{\sqrt{2(1-\gamma_{12})}}\,B_1
+ \sqrt{2(1-\gamma_{13}) - \frac{(1+\gamma_{23}-\gamma_{12}-\gamma_{13})^2}{2(1-\gamma_{12})}}\,B_2,
\]

if and only if

\[
B_1 \le 0
\quad\text{and}\quad
B_2 \le -B_1\,\frac{1+\gamma_{23}-\gamma_{12}-\gamma_{13}}
{\sqrt{2(1-\gamma_{12})\,2(1-\gamma_{13}) - (1+\gamma_{23}-\gamma_{12}-\gamma_{13})^2}},
\]

which describes a cone in R² and, due to the circular symmetry of the standard Gaussian distribution, we have

\[
\rho_0(1, b) = \frac14 + \frac{1}{2\pi}\arctan\!\left(
\frac{1+\gamma_{23}-\gamma_{12}-\gamma_{13}}
{\sqrt{2(1-\gamma_{12})\,2(1-\gamma_{13}) - (1+\gamma_{23}-\gamma_{12}-\gamma_{13})^2}}
\right),
\]

with analogous expressions for ρ0(2, b) and ρ0(3, b).

Now consider the Bayesian learning model with the same utility signal structure and prior parameters (m0, s0²). Let L1 be defined as before, so that L1m(t) = (m2(t) − m1(t), m3(t) − m1(t)). Then the Bayesian learner chooses alternative one at time t > 0 with probability

\[
\rho_t(1, b) = P\{m_1(t) > m_2(t) \text{ and } m_1(t) > m_3(t)\} = P\{L_1 m(t) \le 0\}.
\]

The vector L1m(t) is jointly Gaussian with mean L1s(t)[s(0)⁻¹m(0) + t(ΛΛ′)⁻¹μ] and covariance tL1s(t)(ΛΛ′)⁻¹s(t)′L1′. The two-dimensional random vector L1m(t) has the same distribution as

\[
L_1 s(t)s(0)^{-1}m(0) + tL_1 s(t)(\Lambda\Lambda')^{-1}\mu + \sqrt{t}\,M(t)B,
\]

where B is a bi-dimensional standard Gaussian and M(t)M(t)′ = L1s(t)(ΛΛ′)⁻¹s(t)′L1′. Notice the contrast with the naïve learner model, where M did not depend on t.
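Before turning to the Bayesian learner's limit, the naïve learner's closed form just obtained can be spot-checked numerically. The sketch below (a rough check with parameter values I chose, not from the paper) compares the arctangent expression with a Monte Carlo estimate of the cone probability:

```python
import numpy as np

def rho0_naive(g12, g13, g23):
    """Time-zero probability that alternative 1 is chosen from {1, 2, 3}
    by the naive probit learner, using the closed form derived above."""
    num = 1.0 + g23 - g12 - g13
    den = np.sqrt(2.0 * (1.0 - g12) * 2.0 * (1.0 - g13) - num ** 2)
    return 0.25 + np.arctan(num / den) / (2.0 * np.pi)

# simple utility signal structure: choices are uniform, 1/3 each
assert abs(rho0_naive(0.0, 0.0, 0.0) - 1.0 / 3.0) < 1e-9

# Monte Carlo check of the cone probability P{X1 >= X2 and X1 >= X3}
g12, g13, g23 = 0.3, 0.5, 0.2
cov = np.array([[1.0, g12, g13], [g12, 1.0, g23], [g13, g23, 1.0]])
X = np.random.default_rng(0).multivariate_normal(np.zeros(3), cov, size=400_000)
freq = np.mean((X[:, 0] >= X[:, 1]) & (X[:, 0] >= X[:, 2]))
assert abs(freq - rho0_naive(g12, g13, g23)) < 5e-3
```

With a simple utility signal structure (all correlations zero) the expression evaluates to exactly one third, the uniform distribution over the three alternatives.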
We can take M(t) to be the Cholesky factor

\[
M(t) = \begin{bmatrix} M_{11}(t) & 0 \\ M_{21}(t) & M_{22}(t) \end{bmatrix}
\]

given by

\[
M_{11}(t) = \sqrt{\frac{C_1(t)}
{\left[\,-1 - 3s^4t^2 - s^6t^3 + \gamma_{12}^2 + \gamma_{13}^2 + \gamma_{23}^2 - 2\gamma_{12}\gamma_{13}\gamma_{23}
+ s^2t\left(-3 + \gamma_{12}^2 + \gamma_{13}^2 + \gamma_{23}^2\right)\right]^2}}
\]

where

\[
\begin{aligned}
C_1(t) = s^4\Big[
&-2s^8t^4(-1+\gamma_{12})
- 2s^6t^3\left(-4 + 2\gamma_{12}(1+\gamma_{12}) + (\gamma_{13}-\gamma_{23})^2\right) \\
&- 4s^2t(2+\gamma_{12})\left(-1 + \gamma_{12}^2 + \gamma_{13}^2 - 2\gamma_{12}\gamma_{13}\gamma_{23} + \gamma_{23}^2\right) \\
&- s^4t^2\left(-12 + 7\gamma_{13}^2 + 2\gamma_{12}\left(\gamma_{12}(5+\gamma_{12}) + \gamma_{13}^2\right)
- 6(1+2\gamma_{12})\gamma_{13}\gamma_{23} + (7+2\gamma_{12})\gamma_{23}^2\right) \\
&- \left(-1 + \gamma_{12}^2 + \gamma_{13}^2 - 2\gamma_{12}\gamma_{13}\gamma_{23} + \gamma_{23}^2\right)
\left(2 + 2\gamma_{12} - (\gamma_{13}+\gamma_{23})^2\right)
\Big]
\end{aligned}
\]

and the expressions for M21(t) and M22(t) are similarly cumbersome and omitted. Now we can write, for each t > 0,

\[
\begin{aligned}
\rho_t(1, b) &= P\{L_1 m(t) \le 0\} \\
&= P\{L_1 s(t)s(0)^{-1}m(0) + tL_1 s(t)(\Lambda\Lambda')^{-1}\mu + \sqrt{t}\,M(t)B \le 0\} \\
&= P\{\sqrt{t}\,M(t)B \le -L_1 s(t)s(0)^{-1}m(0) - tL_1 s(t)(\Lambda\Lambda')^{-1}\mu\} \\
&= P\Big\{M(t)B \le -\tfrac{1}{\sqrt{t}}\,L_1 s(t)s(0)^{-1}m(0) - \sqrt{t}\,L_1 s(t)(\Lambda\Lambda')^{-1}\mu\Big\}.
\end{aligned}
\]

Lemma 10. In the limit as t goes to zero,

\[
\lim_{t\to 0^+} \frac{1}{\sqrt{t}}\,L_1 s(t)s(0)^{-1}m(0) = 0,
\qquad
\lim_{t\to 0^+} M_{11}(t) = s^2\sqrt{\frac{2 + 2\gamma_{12} - \gamma_{13}^2 - 2\gamma_{13}\gamma_{23} - \gamma_{23}^2}
{1 + 2\gamma_{12}\gamma_{13}\gamma_{23} - \gamma_{12}^2 - \gamma_{13}^2 - \gamma_{23}^2}} > 0,
\]

and

\[
\lim_{t\to 0^+} \frac{M_{21}(t)}{M_{22}(t)} =
\frac{(1+\gamma_{12})(1+\gamma_{13}) - \gamma_{23}(1+\gamma_{12}+\gamma_{13}+\gamma_{23})}
{\sqrt{(3+2\gamma_{12}+2\gamma_{13}+2\gamma_{23})(1+2\gamma_{12}\gamma_{13}\gamma_{23}-\gamma_{12}^2-\gamma_{13}^2-\gamma_{23}^2)}}.
\]

Proof. Long and cumbersome, omitted.

Using Lemma 10 we have

\[
\begin{aligned}
\rho_0(1, b) &= \lim_{t\to 0^+} \rho_t(1, b)
= \lim_{t\to 0^+} P\Big\{M(t)B \le -\tfrac{1}{\sqrt{t}}\,L_1 s(t)s(0)^{-1}m(0) - \sqrt{t}\,L_1 s(t)(\Lambda\Lambda')^{-1}\mu\Big\} \\
&= P\Big\{B_1 \lim_{t\to 0^+} M_{11}(t) \le 0 \ \text{ and } \ B_2 \le -B_1 \lim_{t\to 0^+}\frac{M_{21}(t)}{M_{22}(t)}\Big\} \\
&= P\left\{B_1 \le 0 \ \text{ and } \ B_2 \le -B_1\,
\frac{(1+\gamma_{12})(1+\gamma_{13}) - \gamma_{23}(1+\gamma_{12}+\gamma_{13}+\gamma_{23})}
{\sqrt{(3+2\gamma_{12}+2\gamma_{13}+2\gamma_{23})(1+2\gamma_{12}\gamma_{13}\gamma_{23}-\gamma_{12}^2-\gamma_{13}^2-\gamma_{23}^2)}}\right\}
\end{aligned}
\]

and by the circular symmetry of the standard Gaussian distribution we obtain

\[
\rho_0(1, b) = \frac14 + \frac{1}{2\pi}\arctan\!\left(
\frac{(1+\gamma_{12})(1+\gamma_{13}) - \gamma_{23}(1+\gamma_{12}+\gamma_{13}+\gamma_{23})}
{\sqrt{(3+2\gamma_{12}+2\gamma_{13}+2\gamma_{23})(1+2\gamma_{12}\gamma_{13}\gamma_{23}-\gamma_{12}^2-\gamma_{13}^2-\gamma_{23}^2)}}
\right),
\]

with analogous expressions for ρ0(2, b) and ρ0(3, b).

Proof of Corollary 8. Immediate from Proposition 7.
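The Bayesian learner's time-zero formula can be explored numerically. The following sketch (function name and parameter values are mine, not from the paper) evaluates the closed form and confirms three properties discussed in the text: choices are uniform under a simple structure; when alternatives 1 and 2 are nearly perfectly correlated and 3 is uncorrelated with them, the Bayesian learner's time-zero probabilities approach (1/2, 1/2, 0); and when the pair (1,3) is more similar than the pair (2,3), alternative 1 starts with a higher choice probability than alternative 2:

```python
import numpy as np

def rho0_bayes(g_a, g_b, g_rival):
    """Time-zero probability that an alternative is chosen from a ternary
    menu by the Bayesian learner. g_a, g_b are its correlations with the two
    rivals; g_rival is the rivals' correlation with each other."""
    num = (1 + g_a) * (1 + g_b) - g_rival * (1 + g_a + g_b + g_rival)
    det = 1 + 2 * g_a * g_b * g_rival - g_a**2 - g_b**2 - g_rival**2
    den = np.sqrt((3 + 2 * g_a + 2 * g_b + 2 * g_rival) * det)
    return 0.25 + np.arctan(num / den) / (2 * np.pi)

# simple structure: uniform over the three alternatives
assert abs(rho0_bayes(0.0, 0.0, 0.0) - 1.0 / 3.0) < 1e-9

# alternatives 1 and 2 almost perfectly correlated, 3 uncorrelated:
# probabilities approach 1/2 for each of 1 and 2, and 0 for alternative 3
g = 1.0 - 1e-9
p1 = rho0_bayes(g, 0.0, 0.0)    # alternative 1: correlated with 2 only
p3 = rho0_bayes(0.0, 0.0, g)    # alternative 3: its rivals 1, 2 are correlated
assert abs(p1 - 0.5) < 1e-3 and p3 < 1e-3

# time-zero part of Theorem 3: gamma13 > gamma23 favors alternative 1 over 2
g12, g13, g23 = 0.2, 0.6, 0.3
assert rho0_bayes(g12, g13, g23) > rho0_bayes(g12, g23, g13)
```

The second check reproduces the limit discussed for the figure above: as γ12 approaches one, the Bayesian learner allocates essentially all probability to the two similar alternatives, while the naïve learner would choose each of them with probability only 1/4.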
Proof of Theorem 3. When the menu of available alternatives is {1, 2}, both 1 and 2 are equally likely to be chosen at the start of the random choice process, i.e., in the limit as t → 0+. By Proposition 7, when the menu includes alternative 3, the probability that alternative 1 is chosen at the start of the random choice process is given by

\[
\frac14 + \frac{1}{2\pi}\arctan\!\left(
\frac{(1+\gamma_{12})(1+\gamma_{13}) - \gamma_{23}(1+\gamma_{12}+\gamma_{13}+\gamma_{23})}
{\sqrt{(3+2\gamma_{12}+2\gamma_{13}+2\gamma_{23})(1+2\gamma_{12}\gamma_{13}\gamma_{23}-\gamma_{12}^2-\gamma_{13}^2-\gamma_{23}^2)}}
\right)
\]

and the probability for alternative 2 is given by

\[
\frac14 + \frac{1}{2\pi}\arctan\!\left(
\frac{(1+\gamma_{12})(1+\gamma_{23}) - \gamma_{13}(1+\gamma_{12}+\gamma_{13}+\gamma_{23})}
{\sqrt{(3+2\gamma_{12}+2\gamma_{13}+2\gamma_{23})(1+2\gamma_{12}\gamma_{13}\gamma_{23}-\gamma_{12}^2-\gamma_{13}^2-\gamma_{23}^2)}}
\right).
\]

Since the function arctan is strictly increasing, the probability of alternative 1 is larger than the probability of alternative 2 if and only if

\[
(1+\gamma_{12})(1+\gamma_{13}) - \gamma_{23}(1+\gamma_{12}+\gamma_{13}+\gamma_{23})
> (1+\gamma_{12})(1+\gamma_{23}) - \gamma_{13}(1+\gamma_{12}+\gamma_{13}+\gamma_{23}),
\]

which holds if and only if

\[
(\gamma_{13}-\gamma_{23})(2+2\gamma_{12}+\gamma_{13}+\gamma_{23}) > 0,
\]

which holds since all γij are positive and, since the pair (1,3) is more similar than the pair (2,3), we have γ13 > γ23. By continuity, this shows that for t close to zero introducing alternative 3 hurts alternative 2 more than it hurts alternative 1. The second part of the Theorem follows from the fact that, as t goes to infinity, ρt converges to a standard probit random choice rule with E[m(t)] → μ and tVar[m(t)] → ΛΛ′.

Proof of Corollary 4. Fix utility signal parameters γ12, γ13, γ23 ∈ [−1, 1] such that γ13 > γ12 and γ13 > γ23. Since the γij form a positive definite matrix for every finite menu, the following determinant is positive:

\[
1 + 2\gamma_{12}\gamma_{13}\gamma_{23} - \gamma_{12}^2 - \gamma_{13}^2 - \gamma_{23}^2 > 0,
\]

therefore

\[
\gamma_{12}\gamma_{23} - \sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)}
< \gamma_{13} <
\gamma_{12}\gamma_{23} + \sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)},
\]

which provides bounds for γ13 for fixed values of γ12 and γ23. The next Lemma shows that, when γ13 is the largest of the three parameters, the upper bound is strictly positive.

Lemma 11.
Let γ be utility signal parameters with γ13 > γ12 and γ13 > γ23. Then γ12γ23 + √((1 − γ12²)(1 − γ23²)) > 0.

Proof. If γ12 and γ23 are both negative then γ12γ23 is positive and we are done. Suppose γ12 ≥ 0; then γ12γ23 + √((1 − γ12²)(1 − γ23²)) > γ13 > γ12 ≥ 0 and we are done. The same applies if γ23 ≥ 0.

By Proposition 7, to prove (i) it suffices to show that, as γ13 approaches the limit γ12γ23 + √((1 − γ12²)(1 − γ23²)) from the left, we have

\[
\frac{(1+\gamma_{12})(1+\gamma_{13}) - \gamma_{23}(1+\gamma_{12}+\gamma_{13}+\gamma_{23})}
{\sqrt{(3+2\gamma_{12}+2\gamma_{13}+2\gamma_{23})(1+2\gamma_{12}\gamma_{13}\gamma_{23}-\gamma_{12}^2-\gamma_{13}^2-\gamma_{23}^2)}}
\;\longrightarrow\; +\infty \qquad (6)
\]

so it is sufficient to show that the numerator above converges to a strictly positive number, while the denominator is positive and converges to zero.

To verify that the numerator in (6) converges to a strictly positive number, first suppose γ23 ≤ 0. Then as γ13 → γ12γ23 + √((1 − γ12²)(1 − γ23²)), the numerator tends to

\[
\begin{aligned}
&(1+\gamma_{12})\left(1 + \gamma_{12}\gamma_{23} + \sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)}\right)
- \gamma_{23}\left(1 + \gamma_{12} + \gamma_{12}\gamma_{23} + \sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)} + \gamma_{23}\right) \\
&\quad= (1+\gamma_{12})\left(1 - \gamma_{23}^2 + \gamma_{12}\gamma_{23} + \sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)}\right)
- \gamma_{23}(1+\gamma_{12}) - \gamma_{23}\sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)}
\end{aligned}
\]

which is strictly positive since 1 + γ12 > 0, 1 − γ23² > 0, γ12γ23 + √((1 − γ12²)(1 − γ23²)) > 0 by the previous Lemma, and, since we are assuming γ23 ≤ 0, the terms −γ23(1 + γ12) and −γ23√((1 − γ12²)(1 − γ23²)) are also nonnegative.

Now suppose instead that γ23 > 0. Since the pair (1,3) is more similar than the pair (2,3) we have γ12γ23 + √((1 − γ12²)(1 − γ23²)) > γ23, which holds if and only if γ23 < √(1 + γ12)/√2.
The numerator in (6) converges to

\[
\begin{aligned}
&(1+\gamma_{12})\left(1 - \gamma_{23}^2 + \gamma_{12}\gamma_{23} + \sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)}\right)
- \gamma_{23}(1+\gamma_{12}) - \gamma_{23}\sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)} \\
&\quad> (1+\gamma_{12})\left(1 - \gamma_{23}^2 + \gamma_{23}\right)
- \gamma_{23}(1+\gamma_{12}) - \gamma_{23}\sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)} \\
&\quad= (1+\gamma_{12})(1 - \gamma_{23}^2) - \gamma_{23}\sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)} \\
&\quad> (1+\gamma_{12})(1 - \gamma_{23}^2) - \frac{\sqrt{1+\gamma_{12}}}{\sqrt{2}}\sqrt{(1-\gamma_{12}^2)(1-\gamma_{23}^2)} \\
&\quad= (1+\gamma_{12})(1 - \gamma_{23}^2) - \frac{1}{\sqrt{2}}(1+\gamma_{12})\sqrt{1-\gamma_{12}}\sqrt{1-\gamma_{23}^2} \\
&\quad= (1+\gamma_{12})\sqrt{1-\gamma_{23}^2}\left(\sqrt{1-\gamma_{23}^2} - \frac{\sqrt{1-\gamma_{12}}}{\sqrt{2}}\right)
\end{aligned}
\]

where (1 + γ12) > 0, √(1 − γ23²) > 0, and the expression in parentheses is strictly positive since 0 < γ23 < √(1 + γ12)/√2 implies γ23² < (1 + γ12)/2, which implies

\[
\sqrt{1-\gamma_{23}^2} > \sqrt{1 - \frac{1+\gamma_{12}}{2}} = \frac{\sqrt{1-\gamma_{12}}}{\sqrt{2}}.
\]

Hence the numerator in (6) converges to a strictly positive number.

Now consider the denominator in (6). The determinant (1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) is positive and tends to zero as γ13 goes to the upper bound γ12γ23 + √((1 − γ12²)(1 − γ23²)) from below. Moreover, it is easy to check that the bounded set of utility signal parameters {(γ12, γ13, γ23) ∈ [−1, 1]³ : 1 + 2γ12γ13γ23 − γ12² − γ13² − γ23² > 0} is entirely contained in the half space {(γ12, γ13, γ23) ∈ [−1, 1]³ : 3 + 2γ12 + 2γ13 + 2γ23 ≥ 0}. Therefore the expression (3 + 2γ12 + 2γ13 + 2γ23) in the denominator is positive and bounded, so the entire denominator is positive and goes to zero, as desired.

To prove (ii), note that as t → ∞ we have tVar[m(t)] → ΛΛ′ = (γij) and E[m(t)] → μ. Since 1 ∼ 2 ∼ 3 we have μ1 = μ2 = μ3 and therefore, as t → ∞, ρt(i, {1, 2, 3}) tends to the ρ0(i, {1, 2, 3}) given in Proposition 7. In other words, in this knife-edge case in which the μ parameters of all three alternatives are identical, the Bayesian Probit Process behaves, in the limit as t goes to infinity, exactly as the naïve probit behaves at time zero. The result then follows from Proposition 7.

Proof of Theorem 5. Let ε > 0 and recall that m(t) is the (random) vector of posterior mean beliefs at time t, which has a joint normal distribution.
The probability that alternative 1 is chosen at time t is equal to the probability that m1(t) > m2(t) and m1(t) > m3(t). This happens if and only if the bi-dimensional vector (m2(t) − m1(t), m3(t) − m1(t)) has negative coordinates. This vector has a joint normal distribution with mean given by

\[
\begin{aligned}
E[m_2(t) - m_1(t)] = s^2t\,\Big[\,
&\mu_3(1+s^2t+\gamma_{12})(\gamma_{13}-\gamma_{23}) \\
&+ \mu_1\left((-1-s^2t)(1+s^2t+\gamma_{12}) + \gamma_{13}\gamma_{23} + \gamma_{23}^2\right) \\
&+ \mu_2\left(1+\gamma_{12}+s^2t(2+s^2t+\gamma_{12}) - \gamma_{13}(\gamma_{13}+\gamma_{23})\right)
\Big]\Big/\,D(t)
\end{aligned}
\]

and

\[
\begin{aligned}
E[m_3(t) - m_1(t)] = s^2t\,\Big[\,
&\mu_2(1+s^2t+\gamma_{13})(\gamma_{12}-\gamma_{23}) \\
&+ \mu_1\left((-1-s^2t)(1+s^2t+\gamma_{13}) + \gamma_{12}\gamma_{23} + \gamma_{23}^2\right) \\
&+ \mu_3\left(1+\gamma_{13}+s^2t(2+s^2t+\gamma_{13}) - \gamma_{12}(\gamma_{12}+\gamma_{23})\right)
\Big]\Big/\,D(t)
\end{aligned}
\]

where the common denominator is

\[
D(t) = (1+s^2t)\left((1+s^2t)^2 - \gamma_{12}^2 - \gamma_{13}^2 - \gamma_{23}^2\right) + 2\gamma_{12}\gamma_{13}\gamma_{23}.
\]

The denominator in both expressions can be rewritten as

\[
D(t) = s^2t\left(s^2t(3+s^2t) + 3 - \gamma_{12}^2 - \gamma_{13}^2 - \gamma_{23}^2\right)
+ \left(1 + 2\gamma_{12}\gamma_{13}\gamma_{23} - \gamma_{12}^2 - \gamma_{13}^2 - \gamma_{23}^2\right)
\]

which is clearly positive since s, t > 0, γij² < 1 and 1 + 2γ12γ13γ23 − γ12² − γ13² − γ23² is the positive determinant of the covariance matrix of X(t).

In both numerators, the expression multiplying the coefficient μ3 is positive. In the first case, note that 1 + γ12 + s²t > 0 and that γ13 > γ23 since the pair (1,3) is more similar than the pair (2,3). In the second case, the expression multiplying μ3 can be written as

\[
\left[1-\gamma_{12}^2\right] + \left[2s^2t + s^4t^2\right] + \left[\gamma_{13}(1+s^2t) - \gamma_{12}\gamma_{23}\right]
\]

where each expression in brackets is positive. Therefore, for fixed t, both coordinates of the mean vector (E[m2(t) − m1(t)], E[m3(t) − m1(t)]) can be made arbitrarily negative by taking μ3 negative and sufficiently large in absolute value.

The covariance matrix Var[m(t)] = t s(t)(ΛΛ′)⁻¹s(t)′ does not depend on μ. Since μ1 = μ2, both ρt(1, {1, 2, 3}) and ρt(2, {1, 2, 3}) converge to 1/2 as t goes to infinity.
Note that, while increasing the absolute value of the negative parameter μ3 does not change Var[m(t)] for any t, it decreases both E[m2(t) − m1(t)] and E[m3(t) − m1(t)] for every t > 0, and therefore increases ρt(1, {1, 2, 3}) for every t > 0. Moreover, for fixed t > 0, ρt(1, {1, 2, 3}) can be made arbitrarily close to 1 by taking μ3 sufficiently negative. This guarantees that the attraction effect can start arbitrarily early in the random choice process. Finally, since E[m2(t) − m1(t)] converges to zero from below as t goes to infinity, ρt(1, {1, 2, 3}) converges to 1/2 from above, while ρt(2, {1, 2, 3}) converges to 1/2 from below.
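The mechanism of this proof can be illustrated numerically. The sketch below (a rough illustration with parameter values I chose, and assuming a zero prior mean m0 = 0; not the paper's code) builds the exact distribution of the posterior mean m(t) from the updating formulas in the proof of Theorem 1 and estimates ρt(1, {1, 2, 3}) by Monte Carlo. Making the inferior alternative 3, which is more similar to alternative 1 since γ13 > γ23, worse raises the probability that alternative 1 is chosen at a fixed time:

```python
import numpy as np

def rho_t_one(t, mu3, g12=0.2, g13=0.6, g23=0.3, s0sq=1.0,
              n=400_000, seed=1):
    """Monte Carlo estimate of P(alternative 1 chosen at time t) when
    mu1 = mu2 = 0 and the decoy alternative 3 has utility mu3 < 0.
    E[m(t)] and Var[m(t)] come from the Bayesian updating formulas,
    with prior mean m0 = 0 (an assumption made for this illustration)."""
    mu = np.array([0.0, 0.0, mu3])
    Sig = np.array([[1.0, g12, g13], [g12, 1.0, g23], [g13, g23, 1.0]])
    Sig_inv = np.linalg.inv(Sig)
    s_t = np.linalg.inv(np.eye(3) / s0sq + t * Sig_inv)  # posterior covariance s(t)
    mean = s_t @ (t * Sig_inv @ mu)                      # E[m(t)] with m0 = 0
    cov = t * s_t @ Sig_inv @ s_t.T                      # Var[m(t)]
    m = np.random.default_rng(seed).multivariate_normal(mean, cov, size=n)
    return np.mean((m[:, 0] > m[:, 1]) & (m[:, 0] > m[:, 2]))

# a worse decoy that is similar to alternative 1 makes 1 more likely to be chosen
p_mild, p_strong = rho_t_one(1.0, -1.0), rho_t_one(1.0, -5.0)
assert p_strong > p_mild > 1.0 / 3.0
```

Consistent with the proof, the boost to alternative 1 appears at any fixed time and grows as μ3 becomes more negative, while for large t the probabilities of alternatives 1 and 2 both approach 1/2.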