Journal of Mathematical Psychology 46, 531–553 (2002) doi:10.1006/jmps.2001.1409 Aspiration-Based and Reciprocity-Based Rules in Learning Dynamics for Symmetric Normal-Form Games Dale O. Stahl University of Texas at Austin and Ernan Haruvy University of Texas at Dallas Psychologically based rules are important in human behavior and have the potential of explaining equilibrium selection and separatrix crossings to a payoff dominant equilibrium in coordination games. We show how a rule learning theory can easily accommodate behavioral rules such as aspirationbased experimentation and reciprocity-based cooperation and how to test for the significance of additional rules. We confront this enhanced rule learning model with experimental data on games with multiple equilibria and separatrix-crossing behavior. Maximum likelihood results do not support aspirationbased experimentation or anticipated reciprocity as significant explanatory factors, but do support a small propensity for non-aspiration-based experimentation by random belief and non-reciprocity-based cooperation. © 2002 Elsevier Science (USA) Key Words: aspiration; reciprocity; rules; learning; games. In recent years, the field of learning in games has evolved considerably, with recent models able to make robust predictions in a variety of interesting games. Examples in the reinforcement learning arena include Erev & Roth (1998), Roth & Erev (1995), and Sarin & Vahid (1999). In belief learning, Fudenberg & Levine (1998) and Cheung & Friedman (1997, 1998) provide some comprehensive reviews. In hybrid models, Camerer & Ho (1998, 1999a, 1999b) have led the field. These models, though different in structure, are based on players adjusting, with noise and Partial funding of this research was provided by Grants SBR-9410501 and SBR-9631389 from the National Science Foundation and the Malcolm Forsman Centennial Professorship, but nothing herein reflects their views. Address correspondence and reprint requests to Dale O. Stahl, Department of Economics, University of Texas, Austin, TX 7812. E-mail: [email protected]. ⁄ 531 0022-2496/02 $35.00 © 2002 Elsevier Science (USA) All rights reserved. 532 STAHL AND HARUVY inertia, in the direction of the best response to past observations. In that sense, players’ sophistication is fairly limited. When hypotheses are permitted (in belief and hybrid models), players’ hypotheses on others’ frequencies of choice are limited to a weighted average of others’ past frequencies. In reality, players are far more sophisticated and are able to reason iteratively, as Ho, Camerer & Weigelt (1998), Nagel (1995), and Stahl (1996) have shown. For example, a sophisticated player might believe that other players will tend to choose a best reply to the historical trend (which could be different from a weighted average of others’ past frequencies) and hence choose a best reply to this belief. Action reinforcement dynamics cannot capture such behavior, since the behavior is not simply the tendency to choose an action. Rather the behavior is a function (or behavioral rule) that maps game information and histories into available actions. Stahl’s (1999, 2000a, 2000b, 2001) rule learning theory assumes that players probabilistically select behavioral rules and that the propensities to use specific rules are guided by the law of effect: the propensity of rules that perform better increase over time and vice versa. Precursors of this approach are Newell, Shaw, & Simon (1958). 
To operationalize the theory, a class of rules is assumed which includes the sophisticated rule mentioned above (see Section 2.3 for details). While this specification of the rule learning theory has been remarkably successful empirically (compared to other learning models), a shortcoming is that all the rules are either a form of trending or myopically rational in the sense of being (perhaps noisy) best replies to a current belief. We say this is a shortcoming because a growing number of laboratory experiments have documented that individuals deviate systematically from myopic self-interest behavior when there are possible future gains from influencing others. Specifically, such nonmyopic actions may be motivated by expectations of future reciprocity by others (e.g., Engle-Warnick & Slonim, 2001; Fehr & Gächter, 1998; Gneezy, Guth, & Verboven, 2001), attempts to teach others (e.g., Camerer, Ho, & Chong, 2000), or attempts to signal intended future actions in coordination games (e.g., Cooper, Dejong, Forsythe, & Ross, 1993). For example, consider the coordination game presented in Fig. 1. The matrix of Fig. 1 gives the row player’s payoffs, and the triangular graph depicts the aggregate choice frequencies of actions A and B for an experiment session. The dotted line depicts the separatrix between the area for which B is the best reply (upper area) and the area for which A is the best reply; C is a dominated action. The experiment session had 25 participants mean-matched for 12 periods, with population feedback after each period (see the Methods section for details). The first period is represented by the dot labeled with a ‘‘1’’, which lies in A’s best-reply area. In the second and third periods (labeled ‘‘2’’ and ‘‘3’’ respectively), the choice frequencies move strongly in the direction of the best reply. Then in period 4 there is a surprising reversal of direction, followed in period 5 by the crossing to action B’s best-reply area and eventual convergence to the payoff-dominant equilibrium (B). This separatrix-crossing phenomenon is robust (9 out of 20 sessions) and is incompatible with myopic rationality. Thus, we have compelling reasons to develop better models that go beyond trending and myopically rational behavior to include other psychologically based modes of behavior. The primary purpose of this paper is to show how the rule ASPIRATION-BASED RULES IN LEARNING 533 FIG. 1. Evidence of nonmyopically rational behavior. learning theory can easily incorporate diverse psychologically based rules and how to test for the significance of additional rules. Having motivated the need with the example in Fig. 1, we will also use this example to motivate the behavioral rules that we will use to achieve our purpose. However, we will leave the mystery of the separatrix crossing unresolved, and we will not answer the quest for the best class of rules since that is not our primary goal. Nonetheless, the methodology we develop is a promising approach for future research. Binmore & Samuelson (1997) offered a theoretical explanation for separatrixcrossing behavior based on aspirations and experimentation. Applied to the above example, players begin with aspirations of payoffs, which during the first three periods depicted in Fig. 1 are not fulfilled, inducing the players to experiment with other strategies. If experimentation leads to enough B (but not C) choices, then the separatrix can be crossed. 
Another possible explanation of separatrix crossings is that some players make short-term sacrifices to teach others to coordinate on the payoff-dominant equilibrium (an interesting model of teaching behavior in repeated games is given in 534 STAHL AND HARUVY Camerer et al., 2000). We could think about these players as anticipating positive reciprocity from the other players. In general models of reciprocity (e.g., Dufwenberg & Kirchsteiger, 1998; Rabin, 1993), a costly action by one party which is favorable to another is due to intentional actions by the other party perceived to be favorable (positive reciprocity). Similarly, an unfavorable action is chosen in response to intentional actions perceived to be unfavorable (negative reciprocity). Though these responses may be costly for the individual, they could serve to enforce social norms, which may ultimately result in everyone being better off (Brandts & Charness, 1999; Fehr & Gächter, 1998; Hoffman, McCabe, & Smith, 1998). We can incorporate Binmore & Samuelson’s insight into the rule learning theory by introducing an experimentation rule whose propensity is sensitive to the difference between player aspirations and actual payoffs. Players start with an initial aspiration level and an initial propensity toward experimentation (loosely some random deviation from myopic best-reply behavior), and if the actual payoffs fall short of the aspiration level, the propensity to experiment increases, and vice versa. Allowing the aspiration level to be updated gradually could then explain separatrixcrossing behavior. Alternatively, we can incorporate anticipated reciprocity into the rule learning theory by introducing a cooperative rule (in the current examples, to choose the symmetric payoff dominant action), an initial propensity to cooperate (i.e., use this rule), and an initial anticipated-reciprocity payoff (the payoff that would be obtained if everyone cooperates). The anticipated-reciprocity payoff would be updated only gradually, thereby representing patience in waiting for others to reciprocate. Positive reciprocal behavior by others would increase the propensity to cooperative, but if others failed to reciprocate, eventually the propensity to cooperate would plummet. We also consider an alternative modeling of anticipated reciprocity in the form of a tit-for-tat (TFT) rule. The next section presents the basic rule learning model and the following section presents three enhancements. Following that, we present Methods, Results, and Discussion. THE RULE LEARNING FRAMEWORK The Game Environment Consider a finite, symmetric, two-player game G — (N, A, U) in normal form, where N — {1, 2} is the set of players, A — {1, ..., J} is the set of actions available to each player, U is the J × J matrix of expected utility payoffs for the row player, and UŒ, the transpose of U, is the payoff matrix for the column player. For notational convenience, let p 0 — (1/J, ..., 1/J)Œ denote the uniform distribution over A. We focus on single population situations in which each player is matched in every period with every other player; hence, the payoff relevant statistic for any given player is the probability distribution of the choices of the population, and this information is available to the players. To this end, p t will denote the empirical frequency of all players’ actions in period t. 
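To fix notation before the formal definitions that follow, here is a minimal sketch of this game environment in Python. The 3 × 3 payoff matrix is a hypothetical placeholder, not one of the experimental games.

```python
import numpy as np

# Hypothetical 3 x 3 symmetric game: U is the row player's payoff matrix,
# and the column player's payoff matrix is its transpose U'.
# These numbers are illustrative only; they are not taken from Fig. 1 or Fig. 2.
U = np.array([[60.0, 30.0, 20.0],
              [10.0, 70.0,  0.0],
              [40.0, 40.0, 40.0]])

J = U.shape[0]             # number of actions, A = {1, ..., J}
p0 = np.full(J, 1.0 / J)   # uniform distribution over A

# Under mean-matching, the payoff-relevant statistic is the population choice
# frequency p_t, and the expected payoff to each own action is U @ p_t.
p_t = np.array([0.5, 0.3, 0.2])   # hypothetical empirical frequency in period t
expected_payoffs = U @ p_t        # J-vector, one expected utility per action
```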
It is also convenient to define

h^t ≡ {p^0, ..., p^(t−1)}

as the history of all players' choices up to period t, with the novelty that p^0 is substituted for the null history. Thus, the information available to a representative player at the beginning of period t is W^t ≡ (G, h^t).

Behavioral Rules and the Rule Learning Dynamic

A behavioral rule is a mapping from information W^t to D(A), the set of probability measures on the actions A. For the purposes of presenting the abstract model, let r ∈ R denote a generic behavioral rule in a space of behavioral rules R; r(W^t) is the mixed strategy generated by rule r given information W^t. The second element in the model is a probability measure over the rules: j(r, t) denotes the probability of using rule r in period t. Because of the nonnegativity restriction on probability measures, it is more convenient to specify the learning dynamics in terms of a transformation of j that is unrestricted in sign. To this end, we define w(r, t) as the log-propensity to use rule r in period t, such that

j(r, t) ≡ exp(w(r, t)) / ∫_R exp(w(z, t)) dz.   (1)

Given a space of behavioral rules R and probabilities j, the induced probability distribution over actions for period t is

p(t) ≡ ∫_R r(W^t) dj(r, t).   (2)

In other words, pj(t) is the probability that action j will be chosen in period t and is obtained by integrating the probability of j for each rule, rj(W^t), with respect to the probability distribution over rules, j(r, t). Computing this integral is the major computational burden of this model.

The third element of the model is the equation of motion. The law of effect states that rules which perform well are more likely to be used in the future. This law is captured by the following dynamic on log-propensities,

w(r, t+1) = b0 w(r, t) + g(r, W^(t+1)),   for t ≥ 1,   (3)

where g(·) is the reinforcement function for rule r conditional on information W^(t+1) = (G, h^(t+1)), and b0 ∈ (0, 1] is an inertia parameter. We specify g(r, W^(t+1)) as the rescaled expected utility that rule r would have generated in period t:

g(r, W^(t+1)) = b1 r(W^t)′ U p^t,   (4)

where b1 > 0 is a scaling parameter. While the ex ante rule propensities j(r, t) are the same for all individuals, diversity in the population is generated by independent draws from j(r, t) for each individual in each period. Since we are specifying a model of population averages rather than individual responses, the reinforcement function represents the cumulated effects of experience throughout the population.

Given a space of rules R and initial conditions w(r, 1), the law of motion, Eq. (3), completely determines the behavior of the system for all t > 1. The remaining operational questions are (1) how to specify R, and (2) how to specify w(r, 1). An attractive feature of this general model is that it encompasses a wide variety of learning theories. For instance, to obtain replicator dynamics, we can simply let R be the set of J constant rules that always choose one unique action in A for all information states. Fictitious play and Cournot dynamics can be seen as very special cases in which R is a singleton rule that chooses a (possibly noisy) best response to a belief that is a deterministic function of the history of play (in fictitious play, the belief is the unweighted empirical frequency of observed choices; in Cournot dynamics, it is the most recent observed empirical frequency). Moreover, the general model can include these constant rules, best-response rules, and other rules.
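As a concrete illustration of Eqs. (1)-(4), the sketch below runs the rule-learning dynamic for a small finite set of rules; the paper instead integrates over a continuous rule space on a grid. The payoff matrix, the rules, and all numerical values are hypothetical placeholders rather than the estimated model.

```python
import numpy as np

def rule_probabilities(w):
    """Eq. (1) for a finite rule set: probabilities proportional to exp(log-propensity)."""
    z = np.exp(w - w.max())          # subtract max for numerical stability
    return z / z.sum()

def induced_choice_probs(rule_outputs, phi):
    """Eq. (2): mix each rule's mixed strategy r(W^t) with the rule probabilities."""
    return phi @ rule_outputs        # rule_outputs: one row per rule, one column per action

def update_log_propensities(w, rule_outputs, U, p_t, b0=1.0, b1=0.005):
    """Eqs. (3)-(4): law of effect with inertia b0 and rescaled expected utility b1 * r'Up^t."""
    reinforcement = b1 * (rule_outputs @ (U @ p_t))
    return b0 * w + reinforcement

# Hypothetical example: three rules in a 3-action game.
U = np.array([[60.0, 30.0, 20.0],
              [10.0, 70.0,  0.0],
              [40.0, 40.0, 40.0]])
rule_outputs = np.array([[1.0, 0.0, 0.0],      # constant rule: always action 1
                         [1/3, 1/3, 1/3],      # uniformly random (tremble) rule
                         [0.1, 0.8, 0.1]])     # a noisy best-reply rule (placeholder numbers)
w = np.log(np.array([0.45, 0.10, 0.45]))       # initial log-propensities w(r, 1)
p_t = np.array([0.5, 0.3, 0.2])                # observed population frequencies in period t

phi = rule_probabilities(w)                    # the rule probabilities j(r, t)
p_model = induced_choice_probs(rule_outputs, phi)   # induced action distribution for the period
w = update_log_propensities(w, rule_outputs, U, p_t)
```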
The Family of Evidence-Based Rules

Our approach to specifying the space of rules is to specify a finite number of empirically relevant discrete rules that can be combined to span a much larger space of rules. (Alternative approaches could use the production rules of ACT-R, Anderson, 1993, or fast and frugal heuristics, Gigerenzer, Todd, & the ABC Research Group, 1999.) In Stahl (1999, 2000a), the family of evidence-based rules was introduced as an extension of the Stahl & Wilson (1995) level-n rules. Evidence-based rules are derived from the notion that a player considers evidence for and against the available actions and tends to choose the action that has the most net favorable evidence based on the available information. Further, evidence is generated from three archetypal "theories of mind" (i.e., models of other players).

The first kind of evidence comes from a null model of the other players. The null model provides no reason for the other players to choose any particular action, so for the first period of play, by virtue of insufficient reason, the belief is that all actions are equally likely. The expected utility payoff to each available action given the null model is y1(W^1) ≡ U p^0. We interpret y1j as evidence in favor of action j stemming from the null model and no prior history. For later periods (t > 1), the players have empirical data about the past choices of the other players. It is reasonable for a player to use simple distributed-lag forecasting: (1 − h) p^0 + h p^1 for period 2, with h ∈ [0, 1]. Letting q^t(h) denote the forecast for period t and defining q^0(h) ≡ p^0, the following forecasting equation applies for all t ≥ 1:

q^t(h) ≡ (1 − h) q^(t−1)(h) + h p^(t−1).   (5)

The expected utility payoff given this belief is y1(W^t; h) ≡ U q^t(h). We can interpret y1j(W^t; h) as level-1 evidence in favor of action j stemming from the null model and history h^t.

The second kind of evidence is based on the Stahl & Wilson (1995) level-2 player who believes all other players are level-1 players and hence believes that the distribution of play will be b(q^t(h)), where b(q^t(h)) ∈ D(A) puts equal probability on all best responses to q^t(h) and zero probability on all inferior responses. This is a second-order theory of mind in which other minds are believed to be first order (i.e., level 1). Indeed, humans may have brain structures for this kind of representation (Churchland & Sejnowski, 1992). The expected utility conditional on this belief is y2(W^t; h) ≡ U b(q^t(h)). We can interpret y2j(W^t; h) as level-2 evidence in favor of action j.

The third theory of mind is Nash equilibrium theory. Letting p^NE denote a Nash equilibrium of G, y3 = U p^NE provides a third kind of evidence on the available actions. For games with multiple NE, the evidence for each NE is equally weighted (see Haruvy & Stahl, 1999).

So far we have defined three kinds of evidence: Y ≡ {y1, ..., y3}. The next step is to weigh this evidence and specify a probabilistic choice function. Let nk ≥ 0 denote a scalar weight associated with evidence yk. We define the weighted evidence vector

ȳ(W^t; n, h) ≡ Y(W^t; h) n,   (6)

where n ≡ (n1, ..., n3)′.
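The following sketch shows one way to compute the three kinds of evidence and the weighted evidence vector of Eq. (6). Variable names are ours (theta stands for the forecasting parameter written h in the text), the payoff matrix and parameter values are hypothetical, and ties in the best-reply correspondence are split evenly, as in the definition of b(q).

```python
import numpy as np

def forecast(history, theta, J):
    """Eq. (5): distributed-lag forecast q^t(theta); history = [p^1, ..., p^(t-1)]."""
    q = np.full(J, 1.0 / J)              # q^0 = p^0, the uniform distribution
    for p in history:
        q = (1.0 - theta) * q + theta * p
    return q

def best_reply_mix(U, q):
    """b(q): equal probability on all best replies to q, zero on inferior responses."""
    payoffs = U @ q
    best = np.isclose(payoffs, payoffs.max())
    return best / best.sum()

def weighted_evidence(U, history, theta, nu, nash_profiles):
    """Eq. (6): ybar = Y(W^t; theta) nu, with level-1, level-2, and Nash evidence."""
    J = U.shape[0]
    q = forecast(history, theta, J)
    y1 = U @ q                                            # level-1 evidence
    y2 = U @ best_reply_mix(U, q)                         # level-2 evidence
    y3 = np.mean([U @ p for p in nash_profiles], axis=0)  # Nash evidence, NE weighted equally
    Y = np.column_stack([y1, y2, y3])
    return Y @ np.asarray(nu)

# Hypothetical usage with one observed period and one pure-strategy equilibrium.
U = np.array([[60.0, 30.0, 20.0],
              [10.0, 70.0,  0.0],
              [40.0, 40.0, 40.0]])
ybar = weighted_evidence(U, history=[np.array([0.5, 0.3, 0.2])], theta=0.9,
                         nu=(0.8, 0.3, 0.0), nash_profiles=[np.array([0.0, 1.0, 0.0])])
```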
There are many ways to go from such a weighted evidence measure to a probabilistic choice function. We opt for the multinomial logit specification because of its computational advantages when it comes to empirical estimation. The implicit assumption is that the player assesses the weighted evidence with some error and chooses the action that from his or her perspective has the greatest net favorable evidence. Hence, the probability of choosing action j is

p̂j(W^t; n, h) ≡ exp[ȳj(W^t; n, h)] / Σ_a exp[ȳa(W^t; n, h)].   (7)

Note that, given the four-dimensional parameter vector (n, h), Eq. (7) defines a mapping from W^t to D(A), and hence p̂(·; n, h) is a behavioral rule as defined abstractly above. By putting zero weight on all but one kind of evidence, Eq. (7) defines an archetypal rule, one for each kind of evidence corresponding to the underlying model of other players. The four-dimensional parameter vector (n, h) generates the space of rules spanned by these archetypal rules. To show the behavioral effect of various n values, Table 1 gives the predicted initial-period choice frequencies for games 13 and 16 for several values of (n1, n2, n3).

Next, we represent behavior that is random in the first period and follows the herd in subsequent periods. Following the herd does not mean exactly replicating the most recent past, but rather following the past with inertia as represented by q^t(h). Hence, Eq. (5) represents herd behavior as well as the beliefs of level-1 types. Finally, we allow for uniform trembles by introducing the uniformly random rule. Thus, the base model consists of a four-dimensional space of evidence-based rules (n, h), a herd rule characterized by h, and uniform trembles.

TABLE 1
Choice Frequencies for Various Evidence-Based Rules

Game 13
  (n1, n2, n3)       A      B      C
  (0.23, 0, 0)     0.760  0.076  0.164
  (1.08, 0, 0)     1.0    0      0
  (0.81, 0.27, 0)  0.937  0      0.063
  (0.81, 0, 0.23)  1.0    0      0

Game 16
  (n1, n2, n3)       A      B      C
  (0.23, 0, 0)     0.659  0.151  0.151
  (1.08, 0, 0)     1.0    0      0
  (0.81, 0.27, 0)  1.0    0      0
  (0.81, 0, 0.23)  0.492  0.492  0.017

The Initial Distribution of Log-Propensities

Letting dh denote the initial probability of the herd rule, and letting e denote the initial probability of the trembles, we set w(rherd, 1) = ln(dh) and w(rtremble, 1) = ln(e). The initial log-propensity function for the evidence-based rules is specified as

w(n, h, 1) ≡ −0.5 ||(n, h) − (n̄, h̄)||^2 / s^2 + A,   (8)

where ||(n, h) − (n̄, h̄)|| denotes the distance between rule (n, h) and the mean of the distribution (n̄, h̄) (a log-linear Euclidean distance: the n variables are measured on a logarithmic scale, while h is on a linear scale; see Appendix A1 of Stahl, 2001, for computational details), and A is determined by the requirement that Eq. (1) integrated over the space of evidence-based rules is exactly 1 − dh − e. Hence, the initial propensity over the evidence-based rules is essentially a normal distribution with mean (n̄, h̄) and standard deviation s.
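A minimal sketch of the logit choice rule of Eq. (7) and of the Gaussian-shaped initial propensities of Eq. (8) over a discretized rule space. The grid, the mean, and the other values are hypothetical placeholders, and plain Euclidean distance is used here instead of the paper's log-linear distance (Stahl, 2001, Appendix A1).

```python
import numpy as np

def logit_choice(ybar):
    """Eq. (7): multinomial logit over the weighted evidence ybar."""
    z = np.exp(ybar - ybar.max())
    return z / z.sum()

def initial_grid_propensities(grid, mean, sigma, delta_h, eps):
    """Eq. (8): initial probabilities over a grid of (n1, n2, n3, h) rules, shaped like a
    normal density around `mean` and normalized to total mass 1 - delta_h - eps."""
    grid = np.asarray(grid, dtype=float)
    d2 = ((grid - np.asarray(mean, dtype=float)) ** 2).sum(axis=1)   # squared Euclidean distance
    weights = np.exp(-0.5 * d2 / sigma ** 2)
    return (1.0 - delta_h - eps) * weights / weights.sum()

# Hypothetical usage: a tiny grid of evidence-based rules.
grid = [(0.23, 0.0, 0.0, 0.5), (1.08, 0.0, 0.0, 0.5),
        (0.81, 0.27, 0.0, 0.9), (0.81, 0.0, 0.23, 0.9)]
phi0 = initial_grid_propensities(grid, mean=(0.4, 0.0, 0.0, 0.9),
                                 sigma=0.8, delta_h=0.44, eps=0.12)
choice_probs = logit_choice(np.array([45.0, 38.0, 30.0]))   # placeholder weighted evidence
```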
Transference

Since this theory is about rules that use game information as input, we should be able to predict behavior in a temporal sequence that involves a variety of games. For instance, suppose an experiment consists of one run with one game for T periods, followed by a second run with another game for T periods. How is what is learned about the rules during the first run transferred to the second run with the new game? A natural assumption would be that the log-propensities at the end of the first game are simply carried forward to the new game. Another extreme assumption would be that the new game is perceived as a totally different situation, so the log-propensities revert to their initial state. We opt for a convex combination

w(r, T+1) = (1 − y) w(r, 1) + y w(r, T+),   (9)

where T+ indicates the update after period T of the first run, and y is the transference parameter. If y = 0, there is no transference, so period T+1 has the same initial log-propensity as period 1; and if y = 1, there is complete transference, so the first period of the second run has the log-propensity that would prevail if it were period T+1 of the first run (with no change of game). This specification extends the model to any number of runs with different games without requiring additional parameters.

The Likelihood Function

The theoretical model involves 10 parameters: b ≡ (n̄1, n̄2, n̄3, h̄, s, dh, e, b0, b1, y). The first four parameters (n̄1, n̄2, n̄3, h̄) represent the mean of the population's initial propensity w(r, 1) over the evidence-based rules, and s is the standard deviation of that propensity; the next two parameters (dh, e) are the initial propensities of the herd and tremble rules, respectively; b0 and b1 are the learning parameters of Eqs. (3) and (4); and y is the transference parameter in Eq. (9) for the initial propensity of the subsequent runs. Substituting Eq. (7) into Eq. (2), the rule propensities and law of motion yield population choice probabilities:

pj(t | b) = ∫_R p̂j(W^t; n, h) j(n, h, t | b) d(n, h).   (10)

To compute this integral, a grid was placed on the (n, h)-space of rules (see Appendix A1 of Stahl, 2001, for details). Let n_j^t denote the number of participants who choose action j in period t. Then the logarithm of the joint probability of the data conditional on b is

LL(b) ≡ Σ_t Σ_j n_j^t log[pj(t | b)].   (11)

To find a b vector that maximizes LL(b), we use a simulated annealing algorithm (Goffe, Ferrier, & Rogers, 1994) for high but declining temperatures and then feed the result into the Nelder & Mead (1965) algorithm. The simulated annealing algorithm is effective in exploring the parameter space and finding neighborhoods of global maxima but very slow to converge, while the latter algorithm converges much faster.
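The estimation step can be sketched as follows. The function `model_probs` stands in for the full computation of Eq. (10) and is hypothetical, and SciPy's `dual_annealing` and Nelder-Mead routines are used here merely as stand-ins for the Goffe, Ferrier, and Rogers (1994) simulated annealing code and the original Nelder-Mead implementation.

```python
import numpy as np
from scipy.optimize import dual_annealing, minimize

def log_likelihood(beta, counts_by_period, model_probs):
    """Eq. (11): LL(b) = sum_t sum_j n_j^t log p_j(t | b).
    counts_by_period[t-1] is the vector of observed choice counts n^t;
    model_probs(beta, t) returns the model's p(t | b), i.e., Eq. (10) evaluated on a grid."""
    ll = 0.0
    for t, counts in enumerate(counts_by_period, start=1):
        p = np.clip(model_probs(beta, t), 1e-12, 1.0)   # guard against log(0)
        ll += counts @ np.log(p)
    return ll

def estimate(counts_by_period, model_probs, bounds):
    """Two-stage search as in the text: global annealing, then Nelder-Mead polishing."""
    neg_ll = lambda b: -log_likelihood(b, counts_by_period, model_probs)
    rough = dual_annealing(neg_ll, bounds=bounds, maxiter=200)
    return minimize(neg_ll, rough.x, method="Nelder-Mead")
```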
ENHANCEMENTS TO THE RULE LEARNING MODEL

Aspiration-Based Experimentation

Two components of an aspiration-based experimentation rule are the experimentation act and the aspiration level. We let p^x(t) denote the probability distribution of the experimentation act in period t. Two obvious candidates for p^x(t) are the uniform distribution p^0 for experimentation by random choice, and the recent past p^(t−1) for experimentation by imitation. The third alternative we will consider is experimentation by random beliefs, in which p^x(t) is the proportion of the probability simplex for which each choice is the best response. Such choice probabilities would arise from beliefs that are uniformly distributed over the probability simplex.

Let a^x(t) denote the aspiration level for period t. We assume an adaptive expectations dynamic for the aspiration level

a^x(t+1) = h2 a^x(t) + (1 − h2) p^x(t)′ U p^t   (12)

with initial condition a^x(1). With 0 < h2 < 1, the aspiration level will adjust gradually to the actual expected utility that the experimentation act would have generated each period. If initial play is diverse and the experimentation act is diverse, then p^x(t)′ U p^t will be highly correlated with the population average payoff. Hence, if the latter is much less than the initial aspiration level and if h2 < 1, then the aspiration level will gradually decline but remain above the population average payoff, and this positive difference will increase the likelihood that the experimentation rule will be used. Conversely, if the population average payoff is higher than the initial aspiration level, then the propensity to experiment will decline (don't rock the boat).

The rule learning framework can represent this dependence of the propensity to experiment on the aspiration level by using b1 a^x(t+1) as the reinforcement, g(rx, W^(t+1)), for the experimentation rule in Eq. (3). If initial aspirations are disappointed, then the log-propensity to experiment will increase, and vice versa. Asymptotically, as long as h2 < 1, Eq. (12) has the property that the long-run propensity to use the experimentation act, p^x(t), will depend on the actual expected payoff from that act, since initial aspirations are geometrically discounted. Conversely, the larger h2, the longer the effect of the initial aspiration level lasts. For the game in Fig. 1, simulations of rule learning with an initial aspiration level of 50 or higher and a value of h2 > 0.9 produce separatrix crossings.

In addition to the specification of p^x(t), we need to specify an initial aspiration level, a^x(1), and an initial propensity, dx, to experiment (i.e., j(rx, 1) = dx). It is natural to set a^x(1) equal to the mean of the range of payoffs in the game: [max_jk Ujk − min_jk Ujk]/2. We let dx be a free parameter to be estimated from the data. We further assume that the transference equation applies to the aspiration level, so

a^x(T+1) = (1 − y) a^x(1) + y a^x(T+).   (9′)

While this model of aspiration-based experimentation will suffice to illustrate the richness of the rule learning theory, it does not exhaust all possible models of aspiration-based experimentation, which would go beyond the purpose of this paper.

Reciprocity-Based Cooperation

For the class of symmetric normal-form games, we define the symmetric cooperative action as the action with the largest diagonal payoff, and we restrict attention to the subclass of games with a unique symmetric cooperative action. Let p^c denote the degenerate probability distribution that puts unit mass on this cooperative action; we will refer to this as the cooperative rule, rc. But why would a rational player choose the cooperative rule? A rational player who believes that almost everyone else will select the cooperative action would anticipate a payoff equal to the maximum diagonal payoff if he or she chooses the cooperative action. Then, if the cooperative action is a best response to this belief (i.e., if it is a payoff-dominant Nash equilibrium), the rational player will choose the cooperative action. If the cooperative action is not a best reply to itself, a myopically rational player will not choose it, but a forward-looking rational player could still choose it if he or she believes that others will reciprocate, leading to a higher payoff than the path that would follow from myopic best replies. If others do not cooperate at the start, the realized payoff to cooperation will be much less than anticipated, and so the player's belief about the likelihood of reciprocation may decline (especially if the player is impatient).
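Both the aspiration level of Eq. (12) and the anticipated-reciprocity payoff of Eq. (13) introduced next follow the same adaptive-expectations recursion. A minimal sketch, with variable names of our choosing (theta2 corresponds to h2 in the text) and a hypothetical payoff matrix:

```python
import numpy as np

def update_anticipation(a, theta2, rule_mix, U, p_t):
    """Eqs. (12)/(13): a(t+1) = theta2 * a(t) + (1 - theta2) * rule_mix' U p_t.
    For aspiration-based experimentation rule_mix is p^x(t); for anticipated
    reciprocity it is the degenerate cooperative distribution p^c."""
    return theta2 * a + (1.0 - theta2) * (rule_mix @ (U @ p_t))

def anticipation_reinforcement(a_next, b1=0.005):
    """The rule is reinforced in Eq. (3) by b1 * a(t+1) instead of its realized expected
    payoff, so unmet aspirations keep its propensity relatively high."""
    return b1 * a_next

# Hypothetical usage for the experimentation rule.
U = np.array([[60.0, 30.0, 20.0],
              [10.0, 70.0,  0.0],
              [40.0, 40.0, 40.0]])
a_x = (U.max() - U.min()) / 2.0        # initial aspiration: half the payoff range, as in the text
p_x = np.full(3, 1.0 / 3.0)            # experimentation by uniformly random choice
p_t = np.array([0.5, 0.3, 0.2])        # observed population frequencies
a_x = update_anticipation(a_x, theta2=0.9, rule_mix=p_x, U=U, p_t=p_t)
g_x = anticipation_reinforcement(a_x)
```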
Let h2 denote the level of patience of such a player, so the anticipated payoff from cooperation (via future reciprocity), a^c(t), follows the simple adaptive expectations dynamic

a^c(t+1) = h2 a^c(t) + (1 − h2) p^c′ U p^t,   (13)

where a^c(1) = p^c′ U p^c, the largest diagonal payoff. This initial anticipated payoff is geometrically discounted over time; so provided h2 < 1, the anticipated payoff from cooperation will tend toward a geometrically weighted average of actual expected payoffs to cooperation. The larger h2, the slower the adjustment, so if the propensity to cooperate is positively correlated with a^c(t), cooperative efforts could persist despite the lack of reciprocation.

The rule learning framework can represent this dependence of the propensity to cooperate on anticipated reciprocity by using b1 a^c(t+1) as the reinforcement, g(rc, W^(t+1)), for the cooperative rule in Eq. (3). When h2 = 1, we would have a stubborn, irrational cooperative rule, but otherwise, the propensity to cooperate will depend on the difference between the anticipated payoff and the actual expected payoffs of alternative rules. When cooperation is reciprocated, so that its payoff stays high relative to alternative rules (which recommend different actions), the propensity to cooperate will increase, and vice versa. We further assume that the transference equation applies to the anticipated payoff, so

a^c(T+1) = (1 − y) a^c(1) + y a^c(T+).   (9″)

Finally, we let dc denote the initial propensity to cooperate (i.e., j(rc, 1) = dc). While this model of reciprocity-based cooperation will suffice to illustrate the richness of the rule learning theory, it does not exhaust all possible models of reciprocity-based cooperation, which would go beyond the purpose of this paper. Nonetheless, we consider one more model in the next section.

A Tit-for-Tat Rule

TFT behavior gained prominence following Axelrod's (1981, 1984) experimental prisoner's dilemma tournaments, in which TFT emerged as the winning strategy, and Kreps, Milgrom, Roberts, & Wilson's (1982) rational underpinning of TFT behavior in the finitely repeated prisoner's dilemma with incomplete information. In a two-player prisoner's dilemma, the TFT rule specifies cooperation in the first period and then mimicking the most recent action of one's opponent thereafter. The TFT rule we examine here begins cooperatively and thereafter cooperates if a threshold number of players have cooperated in the most recent period; otherwise it chooses a best reply to the recent period. Formally, let fc(t) denote the fraction of the population that cooperated in period t, and let c ∈ (0, 1) denote the threshold. Then, we define the TFT probability of cooperating in period t+1 as

k(t+1; fc(t), c) ≡ 0.5 [fc(t)/c]^γ                    if fc(t) ≤ c,
                   1 − 0.5 {[1 − fc(t)]/[1 − c]}^γ     if fc(t) > c,   (14)

where γ ≥ 1 is a precision parameter to be estimated along with c. This probability has the property that (i) it is 0 when fc(t) = 0, (ii) it is 1 when fc(t) = 1, (iii) it is 0.5 when fc(t) = c, and (iv) it is S-shaped and monotonically increasing in fc(t). For large values of γ, the probability function is close to a step function at fc(t) = c. When c ≈ 0, the cooperative action is chosen for almost all realizations of fc(t), while when c ≈ 1, the noncooperative best-reply action is chosen for almost all realizations of fc(t). From Eq.
(14), it is straightforward to define the TFT behavioral rule rTFT (W t+1) which plays the cooperative action with probability k(t+1; fc (t), c) and plays the best reply to p t with probability 1 − k(t+1; fc (t), c). In accordance with Eq. (4), the reinforcement function g(rTFT , W t+1)=b1 rTFT (W t)Œ Up t. To complete the rule learning model with this TFT rule, we specify the initial propensity to use the TFT rule as dTFT . METHODS Participants We conducted 25 sessions, each involving 25 different participants. All participants were inexperienced in this or similar experiments, and all were upper division undergraduates or noneconomics graduate students. Two sessions were conducted at Texas A&M University, and the rest were conducted at the University of Texas. Apparatus Participants were seated at private computer terminals separated so that no participant could observe the choices of other participants. The relevant game, or decision matrix, was presented on the computer screen. Each participant could make a choice by clicking the mouse button on any row of the matrix, which then became highlighted. In addition, each participant could make hypotheses about the choices of the other players. An on-screen calculator would then calculate and display the hypothetical payoffs to each available action given each hypothesis. Participants were allowed to make as many hypothetical calculations and choice ASPIRATION-BASED RULES IN LEARNING 543 revisions as time permitted. Following each time period, each participant was shown the aggregate choices of all other participants and could view a record screen with the history of the aggregate choices of other participants for the entire run. Design and Procedures Each experiment session consisted of two runs of 12 periods each. In the first run, a single 3 × 3 symmetric game was played for 12 periods, and in the second run, a different 3 × 3 symmetric game was played for 12 periods. A mean-matching protocol was used. In each period, a participant’s token payoff was determined by his or her choice and the percentage distribution of the choices of all other participants, p(t), as follows: The row of the payoff matrix corresponding to the participant’s choice was multiplied by the vector of choice distribution of the other participants. Token payoffs were in probability units for a fixed prize of $2.00 per period. In other words, the token payoff for each period gave the percentage chance of winning $2 for that period. The lotteries that determined final monetary payoffs were conducted following the completion of both runs using dice. Specifically, a random number uniformly distributed on [00.0, 99.9] was generated by the throw of three 10-sided dice. A player won $2.00 if and only if his or her token payoff exceeded his or her generated dice number. Payment was made in cash immediately following each session. The Games We select 10 games (see Fig. 2) from those investigated in Haruvy & Stahl (1999), with properties that make them ideal for a thorough study of learning dynamics. These are games 1, 4, 9, 12, 13, 14, 16, 18, and 19 of Haruvy & Stahl (1999). In addition, we have a 3 × 3 version of a 5 × 5 game used in Stahl (1999), for which cooperative behavior seemed to persist (labeled game 21). What makes these games ideal? We begin by noting that games 1, 4, 13, 14, 16, and 19 are characterized by multiple equilibria which are Pareto ranked. 
The selection principles of payoff dominance, risk dominance, and security do not all coincide in any of these six games, and no pair of principles coincides in all of these six games. Most important the payoff dominant equilibrium action never coincides with the level-1 action of Stahl & Wilson (1994, 1995). This is important given the prediction by Haruvy & Stahl (1999) that level-1 is a heavy component in the determination of initial conditions. According to the predictions generated by the Haruvy & Stahl (1999) model, initial conditions in these six games will generally fall in a basin of attraction (for popular models of adaptive dynamics) inconsistent with the payoff-dominant choice. Games 1, 14, and 19 begin with initial conditions far from uniform and very little movement is observed thereafter, with the exception of one run of game 14, where a significant proportion of the group goes against their best response to cross the separatrix toward the payoff dominant Nash equilibrium. Game 13 has a long dynamic path from initial conditions to the final outcome. The final period outcomes of Game 16 are almost equally split between the two equilibria, although the initial conditions fall usually in the non-payoff-dominant basin, hence resulting in 544 STAHL AND HARUVY FIG. 2. Games. separatrix crossings. Game 18 was investigated because its theoretical dynamics involve a persistent cycle. Games 9 and 12 have only a mixed-strategy equilibrium and hence potentially interesting dynamic paths. We also investigate three additional versions of game 13 (with 20 added to each cell, ‘‘13+’’; with 20 subtracted from each cell, ‘‘13 − ’’; and with payoffs rescaled to [0, 100]; ‘‘13r’’), and an additional version of game 16 (with 20 added to each cell, ‘‘16+’’). The motivation for this variation was that if the initial aspiration level, a x(1), is fixed independent of the game payoffs (for example, determined by the participant’s anticipated compensation for the whole experimental session), then adding 20 to all payoffs should change behavior. For instance, if a x(1)=50, then in game 16 the ‘‘A’’ payoff of 20 is substantially less than the initial aspiration level ASPIRATION-BASED RULES IN LEARNING 545 TABLE 2 List of Games by Experimental Session Date First Run Second Run 2/17/98 2/19/98 4/7/98 4/9/98 6/18/98 7/16/98 7/30/98 1/28/99 3/2/99 6/9/99 6/10/99 8/6/99 8/11/99 1/27/00 2/16/00 2/24/00 4/13/00 7/13/00 7/19/00 7/20/00 7/26/00 7/27/00 8/2/00a 8/2/00b 8/3/00 16 19 14 13 16 16+ 16+ 16 16+ 16 16+ 16 16+ 16 16+ 16 16+ 16 16 16+ 16 16+ 16+ 19 1 9 12 1 21 13 13 − 13r 19 19 13 13 13+ 13+ 14 14 1 14 4 14 1 19 1 18 4 4 thereby inducing experimentation, while with 20 added to each cell, since then 40 is much closer to 50, experimentation (and hence separatrix crossings) should occur less often. The experimental evidence, however, is 5 and 4 crossings out of ten sessions of each version of game 16 respectively—a statistically insignificant difference at all commonly accepted significance levels. We therefore discard the hypothesis of a fixed initial aspiration level independent of the game payoffs. The games used in each session are listed in Table 2. The data from the first four sessions were analyzed in Stahl (2000a, 2000b, 2001). The data from the other 21 sessions have not been analyzed before. The entire data set is available upon request from the authors. Note that the range of payoffs for the games in Fig. 2 differs substantially from [20, 70] to [0, 100]. 
In Haruvy & Stahl (1999), we found support for the hypothesis that a single payoff rescaling parameter, b1 , applies to a wide range of games provided the payoffs are first rescaled to [0, 100]. On the current data set, we also found that the maximized LL increased with such rescaling. Therefore, in this paper we report only results obtained from rescaling to [0, 100]. This implies that all the 546 STAHL AND HARUVY TABLE 3 PERIOD Individual Choices for Session 6/18/1998, Run 1 (Game 16) 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 SUBJECT 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 C A B B A A B B B B B B A A A A B B B B B B B B A A A A A B B B C B B B A A A A A A B B B B B B B A A A A A B B B B B B A A A C B B B B B B B B A A B B B B B B B B B B B C C C A B B B B B B B A A A A A A B B A B B B B B A B C A A A C A A A A B A A A A A C A A B A A A A B A B A A A C A A A A A A A A A A A A A A A B A A C B A A A A A A A B A A A B A B B A B A A A A B A A B B B B B B A A B B B A B B B B B B B B B B B B B B B B B B B B B B B B B B B A B B B B B B B C B B B B B B B A B B B B B B B A B B B B B B B B B B B B B B B B B A B B B B B B B B B B B B B C B B B B B B B B B B B B B B B B B B versions of games 13 and 16 are ‘‘equivalent,’’ an assumption that has been verified by independent analysis of the choice behavior across the different versions. 5 RESULTS In addition to the plot of the dynamic path for one of the sessions shown in Fig. 1, we also report the individual choice behavior for that data in Table 3. A cursory look at Table 3 suggests that participants, for the most part, do not persist in cooperative behavior (action B in that game) unless others are also simultaneously cooperative. For instance, although six participants chose B in period 1 (putting the initial play close to the separatrix), all but one abandoned cooperation in period 2, and none of the three who cooperated in period 2 continued to do so in period 3. Even in the critical period 4, of the five who chose B, only two persisted with B in period 5. While nine of the ten participants who cooperated in period 5 continued to cooperate in period 6, by then B was the myopic best response. Similar patterns can be observed throughout the data. Hence, an ocular analysis of the data favors the experimentation explanation over the anticipated reciprocity explanation. We will rigorously test this and other hypotheses next. The Basic Rule Learning Model. We estimated the 10 parameters of the basic rule learning model using all the data from the 25 sessions. We pool over games in order to try to identify regular features of learning dynamics that are general and not game-specific, so we can be more confident that these features will be important in predicting out-of-sample This rescaling implies that the initial aspiration level in rescaled units is a x(1)=[maxjk Ujk − maxjk Ujk ]/2=50, for all games, which implies that the initial aspiration level in un-rescaled units varies across games depending on the original game payoffs. 5 ASPIRATION-BASED RULES IN LEARNING 547 TABLE 4 Parameter Estimates of Basic Rule Learning Model n̄1 n̄2 n̄3 h̄ s dh e b0 b1 y LL Stahl (2000b) Current data set 0.797 (0.207) 0.0696 (0.021) 0.000 (10 −8) 0.647 (0.026) 0.767 (0.086) 0.533 (0.028) 0.090 (0.023) 1.000 (0.002) 0.00790 (0.00145) 1.000 (0.049) 0.376 (0.201) 0.000 (10 −8) 0.000 (10 −7) 0.938 (0.049) 0.800 (0.104) 0.441 (0.021) 0.123 (0.015) 1.000 (0.003) 0.00465 (0.00097) 0.308 (0.145) -8550.21 behavior. 
The ML parameter estimates and standard errors 6 are given in Table 4, and the maximized LL is −8550.21. There are similarities and differences between these parameter estimates and those for a different data set reported in Stahl (2000b). 7 First, as in Stahl (2000b), we find b0 =1, so past propensity differences between rules persist unless affected by performance. The initial propensities for herd behavior (dh ) and trembles (e), and hence the total propensity to use some evidence-based rule, are similar. Also the standard deviation of the initial distribution over evidence-based rules, s, is similar, but the mean (n̄, h̄) is notably different. n̄2 is zero and n̄1 is less. Despite the low t-ratio for n̄1 , the lower bound on the 95% confidence interval was 0.296, which is still large enough to put substantial probability on the best response to the level-1 belief. However, n̄3 is still precisely zero. h̄ is much higher, implying that beliefs adjust quicker, while the rescaling parameter for the reinforcement function shrinks 38%. However, the most substantial difference is the transference parameter that was estimated to be 1 in Stahl (2000b), but only 0.308 here; also the standard error is much larger. Despite these differences, the contribution of rule learning is robust. Restricting b0 =1 and b1 =0 yields a model with diverse rules but no rule learning. The maximized LL decreases to −8569.93. Since the transference parameter (y) no longer plays a role, the likelihood-ratio test has three degrees of freedom, and the hypothesis of no-rule-learning has a p-value of less than 3 × 10 −9. To illustrate the amount of rule learning that takes place (hence the behavioral effect of b1 ), Fig. 3 shows selected j(n, h, t | b) values aggregated over the domain of h for the initial period (t=1), after the 12 periods of the first run (t=12), and after the 12 periods of the second run (t=24) for a typical session (namely 6/18/98 6 Standard errors and confidence intervals were obtained via parametric bootstrapping (Efron & Tibshirani, 1993, pp. 53–55, 169–177). 7 To ensure stability of the dynamic process, an upper bound of 1 is imposed on b0 . 548 STAHL AND HARUVY FIG. 3. j(t) for 6/18/98 session. for which the first-run data are given in Fig. 1 and Table 3). These j values come from the model of the Reciprocity-Based Cooperation Results section. Seven rule categories are shown, accounting for over 96% of the probability mass. The ‘‘herd&tremble’’ category on the left aggregates the herd rules and the tremble rule and shows a substantial decline in the propensities to use these rules. The next five rules are evidence-based rules where the three numbers that label each category are the values of (n1 , n2 , n3 ); 8 thus, the first three categories are level-1 rules with increasing degrees of precision. These rules show a substantial increase in propensity over time. The next rule combines level-1 and level-2 evidence, and the next rule combines level-1 and Nash evidence; both rules show a modest increase in propensity over time. The final category is a pure cooperation rule, the propensity of which appears fairly constant over the first run and then declines dramatically over the second run. Aspiration-Based Experimentation Results We investigated three versions of experimentation rules: (1) uniformly random choice, (2) imitation of the recent population distribution, and (3) uniformly random beliefs. The parameter estimates, standard errors, and maximized LL for each version are given in Table 5. 
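The model comparisons reported above and below are nested likelihood-ratio tests; a generic sketch of such a comparison (the numerical values are placeholders, not entries from Table 4 or Table 5):

```python
from scipy.stats import chi2

def lr_test(ll_unrestricted, ll_restricted, df):
    """Likelihood-ratio test for nested models: 2*(LL_u - LL_r) ~ chi-square(df)."""
    stat = 2.0 * (ll_unrestricted - ll_restricted)
    return stat, chi2.sf(stat, df)

stat, p_value = lr_test(ll_unrestricted=-8544.0, ll_restricted=-8550.0, df=2)  # placeholder values
```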
In terms of LL, experimentation by imitation has the lowest LL (−8549.69; a statistically insignificant improvement of only 0.52). 8 Note that these are the same values as used in Table 1. 549 ASPIRATION-BASED RULES IN LEARNING TABLE 5 Parameter Estimates of Enhanced Models Aspiration-based experimentation n̄1 n̄2 n̄3 h̄ s dh dx e h2 b0 b1 y LL Random choice Imitation Random beliefs Anticipated reciprocity 0.361 (0.195) 0.000 (10 −7) 0.000 (10 −3) 0.920 (0.051) 0.747 (0.108) 0.448 (0.022) 0.108 (0.016) 0.000 (0.002) 0.757 (0.186) 1.000 (0.003) 0.00500 (0.00102) 0.208 (0.168) −8546.83 0.393 (0.241) 0.000 (0.002) 0.000 (0.002) 0.908 (0.057) 0.828 (0.105) 0.258 (0.143) 0.171 (0.133) 0.131 (0.018) 0.588 (0.352) 1.000 (0.003) 0.00486 (0.00099) 0.290 (0.132) −8549.69 0.324 (0.061) 0.0067 (0.007) 0.000 (0.0005) 0.923 (0.044) 0.587 (0.111) 0.426 (0.021) 0.0329 (0.014) 0.115 (0.013) 0.860 (0.321) 1.000 (0.003) 0.00558 (0.00100) 0.213 (0.136) −8544.70 0.324 (0.160) 0.000 (0.0004) 0.000 (0.0003) 0.859 (0.043) 0.715 (0.115) 0.430 (0.023) 0.037 (0.006) 0.084 (0.014) 0.000 (0.001) 1.000 (0.002) 0.00328 (0.00067) 0.888 (0.179) −8522.41 This poor performance is due to the fact that the basic rule learning model already includes imitation via the herd rules. We also find that the estimates for (dh , dx , e) for the imitation version are quite different from any other version, which is attributable to multicolinearity between the imitation rule and the herd rules. Note the low t-ratio for dx , indicating that we cannot reject the hypothesis that dx =0. Experimentation by uniformly random choice shows a LL gain of 3.38; with two degrees of freedom this gain has a p-value of 0.034, which is not impressive considering there are 625 participants in the data set. Moreover, the (dx , e) estimates are quite different from the basic rule learning model, which is probably attributable to the multicolinearity between random-choice experimentation and the tremble rule in the basic rule-learning model. Despite the seemingly respectable t-ratio for h2 , the 95% confidence interval is (0.0007, 0.921), shedding doubt on the role of aspirations in this model. The largest gain in LL is for experimentation by random beliefs. This gain of 5.51 has a p-value of 0.004 and so is statistically significant. Again, we find a large confidence interval for h2 (0.0001, 1.0), shedding doubt on the role of aspirations. We can test the hypothesis that aspirations are unimportant by setting h2 =0; then the initial aspiration level is fully discounted so the experimentation act, p x(t), is reinforced by its actual expected payoff to the most recent population choices—like all other behavioral rules in the model. The maximized LL of this restricted model decreases by only 1.10 and hence is not statistically significant at any common acceptance level. Therefore, we cannot reject the hypothesis that experimentation by random beliefs does not depend on aspirations in this data set. Yet, there is support for a simple experimentation-by-random-belief rule (3.3%) that is reinforced like any other behavioral rule. Since none of the other versions of aspirationbased experimentation does substantially better, we conclude more generally that aspirations (as modeled) play essentially no role in this data set. 550 STAHL AND HARUVY Reciprocity-Based Cooperation Results Table 5 also gives the parameter estimates, standard errors, and maximized LL for the anticipated reciprocity model. 
We note that the estimate of the patience parameter, h2 , is zero, indicating that players immediately react to the payoff from the cooperative action rather than reinforcing cooperative propensities by anticipated future reciprocity. The individual choice behavior (e.g., Table 3) is consistent with this finding. On the other hand, the estimated initial propensity to cooperate (dc ) is about 3.7% and statistically different from 0 (p-value < 10 −13). Therefore, we can strongly reject the hypothesis that there is no cooperative behavior in this data set (i.e., dc =0). Another curious finding is that the transference estimate increases to 0.888. 9 The gain in LL due to a non-reciprocity-based cooperative rule is 27.81, which is much larger than the gain from rule learning without any cooperative rule (19.72). Out of curiosity, we estimated the model with a non-reciprocity-based cooperative rule but without rule learning (b1 =0) and obtained a LL value of −8550.69; hence, adding a non-reciprocity-based cooperative rule to the no-rule-learning models yields a LL gain of 19.24, while the further addition of rule learning yields a LL gain of 28.28. Therefore, we can even more strongly reject the hypothesis of no rule learning in the presence of this cooperative rule. Because it is irrational to anticipate reciprocity in the last period of an experiment, we considered alternative versions of anticipated reciprocity that eliminated the cooperative rule for the last period, 10 but the results for these versions were essentially the same. Tit-for-Tat Results Estimating the TFT model, we found a maximized LL value of −8530.01, which is a substantial gain from the basic rule learning model, but still less than rule learning with a pure cooperative rule. Furthermore, the estimates for the TFT parameters are (c, c, dTFT )=(0.04, 5.03, 0.047), respectively. Given these parameters, k(t+1; fc (t), c) is a sharp step function at 0.04, implying that as long as one person out of 25 in the population chose the cooperative action in the period t, the TFT rule will cooperate in period t+1. This behavior is not much different from unconditional cooperation. Since our specification of k(t+1; fc (t), c) with c > 0 excludes pure cooperation, we decided to test another specification that would admit pure cooperation. This was accomplished simply by letting c take on negative values: c ¥ [−0.01, 1). Not surprising, the maximum LL was achieved for parameter values that mimicked a pure cooperation rule. Therefore, we cannot reject the hypothesis that the TFT rule adds nothing to explaining the data over a pure cooperation rule. 9 At the suggestion of a referee, we considered an encompassing model with a separate transference parameter for the anticipated reciprocity variable; however, we found no improvement in the maximized LL whatsoever and therefore cannot reject the hypothesis of a single transference parameter. 10 For the last period of each run in another version. ASPIRATION-BASED RULES IN LEARNING 551 DISCUSSION We have demonstrated how the framework of rule learning can accommodate behavioral rules like aspiration-based experimentation and reciprocity-based cooperation. In the data set considered consisting of 50 runs of 12 periods each involving 625 subjects, we found no evidence of aspiration-based experimentation or reciprocity-based cooperation. 
We did, however, find weak evidence for a simple experimentation by random belief rule that is reinforced by its actual expected payoff, and strong evidence for a simple cooperation rule that is also reinforced by its actual expected payoff like all other rules in the rule learning model. Further, we found no evidence for a tit-for-tat rule. However, our conclusions regarding a simple cooperation rule must be softened. While the maximized LL dramatically increases with the addition of a simple cooperation rule, this gain comes largely by moving the mean prediction for the first period of Game 16 closer to the separatrix. Since the mean starting point for the dynamics is so close to the separatrix, random fluctuations around the mean generate starting points on both sides of the separatrix, which lead to final outcomes being a fairly even split between the two equilibria. However, simulated paths rarely exhibit the reversal and separatrix crossing pattern (like Fig. 1) seen in the data. Since the leading theoretical explanations of such behavior (aspirationbased experimentation and anticipated reciprocity) fail to be significant on this data set, we still have a behavioral pattern without an adequate explanation. REFERENCES Anderson, J. (1993). Rules of the mind. Hillsdale, NJ: Erlbaum. Axelrod, R. (1984). The evolution of cooperation. New York: Basic Books. Axelrod, R., & Hamilton, W. (1981). The evolution of cooperation. Science, 211, 1390–1396. Binmore, K., & Samuelson, L. (1997). Muddling through: noisy equilibrium selection. Journal of Economic Theory, 74, 235–265. Brandts, J., & Charness, G. (1999). Punishment and reward in a cheap-talk game. Unpublished manuscript. Camerer, C., & Ho, T. (1998). EWA learning in coordination games: probability rules, heterogeneity, and time-variation. Journal of Mathematical Psychology, 42, 305–326. Camerer, C., & Ho, T. (1999a). EWA learning in games: preliminary estimates from weak-link games. In D. Budescu, I. Erev, & R. Zwick, (Eds.), Games and human behavior: Essays in honor of Amnon Rapoport. Hillsdale, NJ: Erlbaum. Camerer, C., & Ho, T. (1999b). Experience-weighted attraction learning in normal form games. Econometrica, 67, 827–874. Camerer, C., Ho, T., & Chong, J. (2000). Sophisticated EWA learning and strategic teaching in repeated games. (Working paper 00-005). Wharton: Univ. of Pennsylvania. Cheung, Y-W., & Friedman, D. (1997). Individual learning in normal form games: some laboratory results. Games and Economic Behavior, 19, 46–76. Cheung, Y-W., & Friedman, D. (1998). Comparison of learning and replicator dynamics using experimental data. Journal of Economic Behavior and Organization, 35, 263–280. Churchland, P., & Sejnowski, T. (1992). The computational brain. Cambridge, UK: MIT Press. Cooper, R., Dejong, D., Forsythe, R., & Ross, T. (1993). Forward induction in the battle-of-the-sexes games. American Economic Review, 83, 1303–1316. 552 STAHL AND HARUVY Dufwenberg, M., & Kirchsteiger, G. (1998). A theory of sequential reciprocity. (CentER Discussion Paper 9837). The Netherlands: Tilburg University. Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall. Engle-Warnick, J., & Slonim, R. L. (2001). The fragility and robustness of trust in repeated games. (working paper). Case Western Reserve University. Erev, I., Bereby-Meyer, Y., & Roth, A. (1999). The effect of adding a constant to all payoffs: experimental investigation, and implications for reinforcement learning models. 
Journal of Economic Behavior and Organization, 39, 111–128. Erev, I., & Roth, A. (1998). Predicting how people play games: reinforcement learning in experimental games with unique mixed strategy equilibria. American Economic Review, 88, 848–881. Fehr, E., & Gächter, S. (1998). Reciprocity and economics: the economic implications of homo reciprocans. European Economic Review, 42, 845–859. Fudenberg, D., & Levine, D. (1998). The theory of learning in games. Cambridge, MA: MIT Press. Gigerenzer, G., Todd, P., & the ABC Research Group, (1999). Simple heuristics that make us smart. Oxford: Oxford Univ. Press. Gneezy, U., Guth, W., & Verboven, F. (2000). Presents or investments? An experimental analysis. Journal of Economic Psychology, 21, 481–493. Goffe, W., Ferrier, G., & Rogers, J. (1994). Global optimization of statistical functions with simulated annealing. Journal of Econometrics, 60, 65–99. Harsanyi, J., & Selten, R. (1988). A general theory of equilibrium selection in Games. Cambridge, MA: MIT Press. Haruvy, E., & Stahl, D. (1999). Empirical tests of equilibrium selection based on player heterogeneity, mimeograph. Ho, T-H., Camerer, C., & Weigelt, K. (1998). Iterated dominance and iterated best-response in experimental p-beauty contests. American Economic Review, 88, 947–969. Hoffman, E., McCabe, K., & Smith, V. (1998). Behavioral foundations of reciprocity: experimental economics and evolutionary psychology. Economic Inquiry, 36, 335–352. Kreps, D., Milgrom, P., Roberts, J., & Wilson, R. (1982). Rational cooperation in the finitely repeated prisoner’s dilemma. Journal of Economic Theory, 27, 245–252. Nagel, R. (1995). Unraveling in guessing games: an experimental study. American Economic Review, 85, 1313–1326. Nelder, J., & Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7, 308–313. Newell, A., Shaw, J., & Simon, H. (1958). Elements of a theory of human problem solving. Psychological Review, 65, 151–166. Rabin, M. (1993). Incorporating fairness into game theory and economics. American Economic, 83, 1281–1302. Roth, A., & Erev, I. (1995). Learning in extensive form games: experimental data and simple dynamic models in the intermediate term. Games en Economic Behavior, 8, 164–212. Sarin, R., & Vahid, F. (1999). Payoff assessments without probabilities: a simple dynamic model of choice. Games and Economic Behavior, 28, 294–309. Sen, A. (1977). Rational fools: a critique of the behavioral foundations of economic theory. Journal of Philosophy and Public Affairs, 6, 317–344. Stahl, D. (1996). Boundedly rational rule learning in a guessing game. Games and Economic Behavior, 16, 303–330. Stahl, D. (1999). Evidence based rules and learning in symmetric normal form games. International Journal of Game Theory, 28, 111–130. Stahl, D. (2000a). Rule learning in symmetric normal-form games: theory and evidence. Games and Economic Behavior, 32, 105–138. ASPIRATION-BASED RULES IN LEARNING 553 Stahl, D. (2000b). Action-reinforcement learning versus rule learning. Department of Economics, University of Texas. Stahl, D. (2001). Population rule learning in symmetric normal-form games: theory and evidence. Journal of Economic Behavior and Organization, 45, 19–35. Stahl, D., & Wilson, P. (1994). Experimental evidence of players’ models of other players. Journal of Economic Behavior and Organization, 25, 309–327. Stahl, D., & Wilson, P. (1995). On players models of other players: theory and experimental evidence. Games and Economic Behavior, 10, 218–254. 
Received: December 19, 2000; revised: October 2, 2001; published online: June 13, 2002