
Journal of Mathematical Psychology 46, 531–553 (2002)
doi:10.1006/jmps.2001.1409
Aspiration-Based and Reciprocity-Based Rules
in Learning Dynamics for Symmetric
Normal-Form Games
Dale O. Stahl
University of Texas at Austin
and
Ernan Haruvy
University of Texas at Dallas
Psychologically based rules are important in human behavior and have the
potential of explaining equilibrium selection and separatrix crossings to a
payoff dominant equilibrium in coordination games. We show how a rule
learning theory can easily accommodate behavioral rules such as aspiration-based experimentation and reciprocity-based cooperation and how to test for the significance of additional rules. We confront this enhanced rule learning model with experimental data on games with multiple equilibria and separatrix-crossing behavior. Maximum likelihood results do not support aspiration-based experimentation or anticipated reciprocity as significant explanatory factors, but do support a small propensity for non-aspiration-based experimentation by random belief and non-reciprocity-based cooperation. © 2002 Elsevier Science (USA)
Key Words: aspiration; reciprocity; rules; learning; games.
Partial funding of this research was provided by Grants SBR-9410501 and SBR-9631389 from the National Science Foundation and the Malcolm Forsman Centennial Professorship, but nothing herein reflects their views.
Address correspondence and reprint requests to Dale O. Stahl, Department of Economics, University of Texas, Austin, TX 78712. E-mail: [email protected].

In recent years, the field of learning in games has evolved considerably, with recent models able to make robust predictions in a variety of interesting games. Examples in the reinforcement learning arena include Erev & Roth (1998), Roth & Erev (1995), and Sarin & Vahid (1999). In belief learning, Fudenberg & Levine (1998) and Cheung & Friedman (1997, 1998) provide some comprehensive reviews. In hybrid models, Camerer & Ho (1998, 1999a, 1999b) have led the field. These models, though different in structure, are based on players adjusting, with noise and
inertia, in the direction of the best response to past observations. In that sense,
players’ sophistication is fairly limited. When hypotheses are permitted (in belief
and hybrid models), players’ hypotheses on others’ frequencies of choice are limited
to a weighted average of others’ past frequencies. In reality, players are far more
sophisticated and are able to reason iteratively, as Ho, Camerer & Weigelt (1998),
Nagel (1995), and Stahl (1996) have shown. For example, a sophisticated player
might believe that other players will tend to choose a best reply to the historical
trend (which could be different from a weighted average of others’ past frequencies)
and hence choose a best reply to this belief. Action reinforcement dynamics cannot
capture such behavior, since the behavior is not simply the tendency to choose an
action. Rather the behavior is a function (or behavioral rule) that maps game
information and histories into available actions.
Stahl’s (1999, 2000a, 2000b, 2001) rule learning theory assumes that players probabilistically select behavioral rules and that the propensities to use specific rules
are guided by the law of effect: the propensities of rules that perform better increase
over time and vice versa. Precursors of this approach are Newell, Shaw, & Simon
(1958). To operationalize the theory, a class of rules is assumed which includes the sophisticated rule mentioned above (see the Family of Evidence-Based Rules section below for details). While this specification of the rule learning theory has been remarkably successful empirically
(compared to other learning models), a shortcoming is that all the rules are either a
form of trending or myopically rational in the sense of being (perhaps noisy) best
replies to a current belief. We say this is a shortcoming because a growing number
of laboratory experiments have documented that individuals deviate systematically
from myopic self-interest behavior when there are possible future gains from
influencing others. Specifically, such nonmyopic actions may be motivated by
expectations of future reciprocity by others (e.g., Engle-Warnick & Slonim, 2001;
Fehr & Gächter, 1998; Gneezy, Guth, & Verboven, 2001), attempts to teach others
(e.g., Camerer, Ho, & Chong, 2000), or attempts to signal intended future actions in
coordination games (e.g., Cooper, Dejong, Forsythe, & Ross, 1993).
For example, consider the coordination game presented in Fig. 1. The matrix of
Fig. 1 gives the row player’s payoffs, and the triangular graph depicts the aggregate
choice frequencies of actions A and B for an experiment session. The dotted line
depicts the separatrix between the area for which B is the best reply (upper area)
and the area for which A is the best reply; C is a dominated action. The experiment
session had 25 participants mean-matched for 12 periods, with population feedback
after each period (see the Methods section for details). The first period is represented by the dot labeled with a "1", which lies in A's best-reply area. In the second and third periods (labeled "2" and "3" respectively), the choice frequencies move
strongly in the direction of the best reply. Then in period 4 there is a surprising
reversal of direction, followed in period 5 by the crossing to action B’s best-reply
area and eventual convergence to the payoff-dominant equilibrium (B). This
separatrix-crossing phenomenon is robust (9 out of 20 sessions) and is incompatible
with myopic rationality.
Thus, we have compelling reasons to develop better models that go beyond trending and myopically rational behavior to include other psychologically based modes of behavior. The primary purpose of this paper is to show how the rule learning theory can easily incorporate diverse psychologically based rules and how to test for the significance of additional rules. Having motivated the need with the example in Fig. 1, we will also use this example to motivate the behavioral rules that we will use to achieve our purpose. However, we will leave the mystery of the separatrix crossing unresolved, and we will not answer the quest for the best class of rules since that is not our primary goal. Nonetheless, the methodology we develop is a promising approach for future research.

FIG. 1. Evidence of nonmyopically rational behavior.
Binmore & Samuelson (1997) offered a theoretical explanation for separatrix-crossing behavior based on aspirations and experimentation. Applied to the above
example, players begin with aspirations of payoffs, which during the first three
periods depicted in Fig. 1 are not fulfilled, inducing the players to experiment with
other strategies. If experimentation leads to enough B (but not C) choices, then the
separatrix can be crossed.
Another possible explanation of separatrix crossings is that some players make
short-term sacrifices to teach others to coordinate on the payoff-dominant equilibrium (an interesting model of teaching behavior in repeated games is given in
Camerer et al., 2000). We could think about these players as anticipating positive
reciprocity from the other players. In general models of reciprocity (e.g., Dufwenberg
& Kirchsteiger, 1998; Rabin, 1993), a costly action by one party which is favorable to
another is due to intentional actions by the other party perceived to be favorable
(positive reciprocity). Similarly, an unfavorable action is chosen in response to intentional actions perceived to be unfavorable (negative reciprocity). Though these
responses may be costly for the individual, they could serve to enforce social norms,
which may ultimately result in everyone being better off (Brandts & Charness, 1999;
Fehr & Gächter, 1998; Hoffman, McCabe, & Smith, 1998).
We can incorporate Binmore & Samuelson’s insight into the rule learning theory
by introducing an experimentation rule whose propensity is sensitive to the difference between player aspirations and actual payoffs. Players start with an initial
aspiration level and an initial propensity toward experimentation (loosely some
random deviation from myopic best-reply behavior), and if the actual payoffs fall
short of the aspiration level, the propensity to experiment increases, and vice versa.
Allowing the aspiration level to be updated gradually could then explain separatrix-crossing behavior.
Alternatively, we can incorporate anticipated reciprocity into the rule learning
theory by introducing a cooperative rule (in the current examples, to choose the
symmetric payoff dominant action), an initial propensity to cooperate (i.e., use this
rule), and an initial anticipated-reciprocity payoff (the payoff that would be
obtained if everyone cooperates). The anticipated-reciprocity payoff would be
updated only gradually, thereby representing patience in waiting for others to
reciprocate. Positive reciprocal behavior by others would increase the propensity to cooperate, but if others failed to reciprocate, eventually the propensity to cooperate would plummet. We also consider an alternative modeling of anticipated
reciprocity in the form of a tit-for-tat (TFT) rule.
The next section presents the basic rule learning model and the following section
presents three enhancements. Following that, we present Methods, Results, and
Discussion.
THE RULE LEARNING FRAMEWORK
The Game Environment
Consider a finite, symmetric, two-player game G ≡ (N, A, U) in normal form, where N ≡ {1, 2} is the set of players, A ≡ {1, ..., J} is the set of actions available to each player, U is the J × J matrix of expected utility payoffs for the row player, and U′, the transpose of U, is the payoff matrix for the column player. For notational convenience, let p 0 ≡ (1/J, ..., 1/J)′ denote the uniform distribution over A.
We focus on single population situations in which each player is matched in every
period with every other player; hence, the payoff relevant statistic for any given
player is the probability distribution of the choices of the population, and this
information is available to the players. To this end, p t will denote the empirical
frequency of all players’ actions in period t. It is also convenient to define
h t ≡ {p 0, ..., p t − 1} as the history of all players' choices up to period t with the novelty that p 0 is substituted for the null history. Thus, the information available to a representative player at the beginning of period t is W t ≡ (G, h t).
Behavioral Rules and the Rule Learning Dynamic
A behavioral rule is a mapping from information W t to D(A), the set of probability measures on the actions A. For the purposes of presenting the abstract model,
let r ∈ R denote a generic behavioral rule in a space of behavioral rules R; r(W t) is
the mixed strategy generated by rule r given information W t.
The second element in the model is a probability measure over the rules: j(r, t)
denotes the probability of using rule r in period t. Because of the nonnegativity
restriction on probability measures, it is more convenient to specify the learning
dynamics in terms of a transformation of j that is unrestricted in sign. To this end,
we define w(r, t) as the log-propensity to use rule r in period t, such that
j(r, t) ≡ exp(w(r, t)) / [∫R exp(w(z, t)) dz].   (1)
Given a space of behavioral rules R and probabilities j, the induced probability
distribution over actions for period t is
p(t) ≡ ∫R r(W t) dj(r, t).   (2)
In other words, pj (t) is the probability that action j will be chosen in period t and is
obtained by integrating the probability of j for each rule, rj (W t), with respect to the
probability distribution over rules, j(r, t). Computing this integral is the major
computational burden of this model.
The third element of the model is the equation of motion. The law of effect states
that rules which perform well are more likely to be used in the future. This law is
captured by the following dynamic on log-propensities,
w(r, t+1) = b0 w(r, t) + g(r, W t+1),   for t ≥ 1,   (3)
where g(·) is the reinforcement function for rule r conditional on information W t+1 = (G, h t+1), and b0 ∈ (0, 1] is an inertia parameter. We specify g(r, W t+1) as the rescaled expected utility that rule r would have generated in period t:

g(r, W t+1) = b1 r(W t)′ U p t,   (4)

where b1 > 0 is a scaling parameter.
While the ex ante rule propensities j(r, t) are the same for all individuals, diversity in the population is generated by independent draws from j(r, t) for each
individual in each period. Since we are specifying a model of population averages
rather than individual responses, the reinforcement function represents the
cumulated effects of experience throughout the population.
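To make the bookkeeping of Eqs. (1)-(4) concrete, the following sketch implements the rule-learning dynamic on a finite set of rules (the model integrates over a continuum of rules and approximates that integral on a grid). The payoff matrix, the rules, and the parameter values below are illustrative placeholders, not the paper's specification or estimates.

```python
import numpy as np

def rule_probabilities(w):
    """Eq. (1) on a finite set of rules: softmax of the log-propensities w."""
    ex = np.exp(w - w.max())               # subtract max for numerical stability
    return ex / ex.sum()

def induced_choice_distribution(phi, rule_strategies):
    """Eq. (2): population choice probabilities as the phi-weighted average of
    the mixed strategies r(W t) prescribed by the rules."""
    return phi @ rule_strategies

def update_log_propensities(w, rule_strategies, U, p_t, b0, b1):
    """Eqs. (3)-(4): law-of-effect update of the log-propensities.
    rule_strategies is an (n_rules, J) array whose row r is r(W t);
    p_t is the observed choice-frequency vector for period t."""
    reinforcement = b1 * rule_strategies @ (U @ p_t)   # Eq. (4): b1 * r(W t)' U p t
    return b0 * w + reinforcement                      # Eq. (3)

# Illustrative use with three constant rules in a hypothetical 3 x 3 game.
U = np.array([[60.0, 20.0, 30.0],
              [30.0, 70.0, 10.0],
              [40.0, 40.0, 40.0]])
rules = np.eye(3)                          # constant rule j always plays action j
w = np.log(np.array([0.4, 0.4, 0.2]))      # initial log-propensities
p_t = np.array([0.5, 0.3, 0.2])            # observed population frequencies
w = update_log_propensities(w, rules, U, p_t, b0=1.0, b1=0.005)
print(rule_probabilities(w))
print(induced_choice_distribution(rule_probabilities(w), rules))
```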
Given a space of rules R and initial conditions w(r, 1), the law of motion,
Eq. (3), completely determines the behavior of the system for all t > 1. The remaining operational questions are (1) how to specify R, and (2) how to specify w(r, 1).
An attractive feature of this general model is that it encompasses a wide variety
of learning theories. For instance, to obtain replicator dynamics, we can simply let
R be the set of J constant rules that always choose one unique action in A for all
information states. Fictitious play and Cournot dynamics can be seen as very
special cases in which R is a singleton rule which chooses a (possibly noisy) best response to a belief that is a deterministic function of the history of play. 1 Moreover, the general model can include these constant rules, best-response rules, and
other rules.
The Family of Evidence-Based Rules
Our approach to specifying the space of rules is to specify a finite number of
empirically relevant discrete rules that can be combined to span a much larger space
of rules. 2 In Stahl (1999, 2000a), the family of evidence-based rules was introduced
as an extension of the Stahl & Wilson (1995) level-n rules. Evidence-based rules are
derived from the notion that a player considers evidence for and against the available actions and tends to choose the action that has the most net favorable evidence
based on the available information. Further, evidence is generated from three
archetypal ‘‘theories of mind’’ (i.e., models of other players).
The first kind of evidence comes from a null model of the other players. The null
model provides no reason for the other players to choose any particular action, so
for the first period of play, by virtue of insufficient reason, the belief is that all
actions are equally likely. The expected utility payoff to each available action given
the null model is y1 (W 1) ≡ U p 0. We interpret y1j as evidence in favor of action j
stemming from the null model and no prior history.
For later periods (t > 1), the players have empirical data about the past choices
of the other players. It is reasonable for a player to use simple distributed-lag forecasting: (1 − h) p 0 + h p 1 for period 2 with h ∈ [0, 1]. Letting q t(h) denote the forecast for period t and defining q 0(h) ≡ p 0, the following forecasting equation applies for all t ≥ 1:

q t(h) ≡ (1 − h) q t − 1(h) + h p t − 1.   (5)

The expected utility payoff given this belief is y1 (W t ; h) ≡ U q t(h). We can interpret
y1j (W t ; h) as level-1 evidence in favor of action j stemming from the null model and
history h t.
The second kind of evidence is based on the Stahl & Wilson (1995) level-2 player who believes all other players are level-1 players and hence believes that the distribution of play will be b(q t(h)), where b(q t(h)) ∈ D(A) puts equal probability on all best responses to q t(h) and zero probability on all inferior responses. This is a second-order theory of mind in which other minds are believed to be first order (i.e., level 1). Indeed, humans may have brain structures for this kind of representation (Churchland & Sejnowski, 1992). The expected utility conditional on this belief is y2 (W t ; h) ≡ U b(q t(h)). We can interpret y2j (W t ; h) as level-2 evidence in favor of action j.

1 In fictitious play, the belief is the unweighted empirical frequency of observed choices, and in Cournot dynamics the belief is the most recent observed empirical frequency.
2 Alternative approaches could use the production rules of ACT-R (Anderson, 1993), or fast and frugal heuristics (Gigerenzer, Todd, & the ABC Research Group, 1999).
The third theory of mind is Nash equilibrium theory. Letting p NE denote a Nash
equilibrium of G, y3 =Up NE provides a third kind of evidence on the available
actions. For games with multiple NE, the evidence for each NE is equally
weighted. 3
So far we have defined three kinds of evidence: Y ≡ {y1 , ..., y3 }. The next step is to weigh this evidence and specify a probabilistic choice function. Let nk ≥ 0 denote a scalar weight associated with evidence yk . We define the weighted evidence vector

ȳ(W t ; n, h) ≡ Y(W t ; h) n,   (6)

where n ≡ (n1 , ..., n3 )′.
There are many ways to go from such a weighted evidence measure to a probabilistic choice function. We opt for the multinomial logit specification because of its
computational advantages when it comes to empirical estimation. The implicit
assumption is that the player assesses the weighted evidence with some error and
chooses the action that from his or her perspective has the greatest net favorable
evidence. Hence, the probability of choosing action j is
p̂j (W t ; n, h) ≡ exp[ȳj (W t ; n, h)] / Σa exp[ȳa (W t ; n, h)].   (7)
Note that, given the four-dimensional parameter vector (n, h), Eq. (7) defines a
mapping from W t to D(A), and hence p̂(• ; n, h) is a behavioral rule as defined abstractly above. By putting zero weight on all but one kind of evidence, Eq. (7)
defines an archetypal rule—one for each kind of evidence corresponding to the
underlying model of other players. The four-dimensional parameter vector (n, h)
generates the space of rules spanned by these archetypal rules. To show the behavioral effect of various n values, Table 1 gives the predicted initial-period choice
frequencies for games 13 and 16 for several values of (n1 , n2 , n3 ).
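As an illustration of how Eqs. (5)-(7) combine, the sketch below computes the choice probabilities of a single evidence-based rule (n, h). The payoff matrix, history, and Nash equilibrium are inputs chosen by the analyst; the equal weighting over multiple Nash equilibria (footnote 3) is omitted, and nothing here reproduces the authors' estimation code.

```python
import numpy as np

def forecast(history, theta, J):
    """Eq. (5): distributed-lag forecast q t(theta), seeded with the uniform p 0."""
    q = np.full(J, 1.0 / J)
    for p in history:                      # history = [p 1, ..., p t-1]
        q = (1.0 - theta) * q + theta * np.asarray(p)
    return q

def best_reply_mix(U, belief):
    """b(q): equal probability on all best responses to the belief, zero elsewhere."""
    eu = U @ belief
    br = np.isclose(eu, eu.max()).astype(float)
    return br / br.sum()

def evidence_based_rule(U, history, nu, theta, p_nash):
    """Eqs. (5)-(7): logit choice probabilities of the rule indexed by (nu, theta)."""
    J = U.shape[0]
    q = forecast(history, theta, J)
    Y = np.column_stack([U @ q,                      # level-1 evidence
                         U @ best_reply_mix(U, q),   # level-2 evidence
                         U @ np.asarray(p_nash)])    # Nash evidence
    ybar = Y @ np.asarray(nu)                        # Eq. (6): weighted evidence
    ex = np.exp(ybar - ybar.max())
    return ex / ex.sum()                             # Eq. (7): logit choice

# Illustrative call (U and p_nash must be supplied by the analyst):
# probs = evidence_based_rule(U, history=[], nu=[1.08, 0.0, 0.0], theta=0.5, p_nash=[1.0, 0.0, 0.0])
```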
Next, we represent behavior that is random in the first period and follows the
herd in subsequent periods. Following the herd does not mean exactly replicating
the most recent past, but rather following the past with inertia as represented by
q t(h). Hence, Eq. (5) represents herd behavior as well as the beliefs of level-1 types.
Finally, we allow for uniform trembles by introducing the uniformly random
rule. Thus, the base model consists of a four-dimensional space of evidence-based
rules (n, h), a herd rule characterized by h, and uniform trembles.
3 See Haruvy & Stahl (1999).
TABLE 1
Choice Frequencies for Various Evidence-Based Rules

Game 13
(n1 , n2 , n3 )      A        B        C
(0.23, 0, 0)        0.760    0.076    0.164
(1.08, 0, 0)        1.0      0        0
(0.81, 0.27, 0)     0.937    0        0.063
(0.81, 0, 0.23)     1.0      0        0

Game 16
(n1 , n2 , n3 )      A        B        C
(0.23, 0, 0)        0.659    0.151    0.151
(1.08, 0, 0)        1.0      0        0
(0.81, 0.27, 0)     1.0      0        0
(0.81, 0, 0.23)     0.492    0.492    0.017
The Initial Distribution of Log-Propensities
Letting dh denote the initial probability of the herd rule, and letting e denote the initial probability of the trembles, we set w(rherd , 1) = ln(dh ) and w(rtremble , 1) = ln(e). The initial log-propensity function for the evidence-based rules is specified as

w(n, h, 1) ≡ −0.5 ||(n, h) − (n̄, h̄)||² / s² + A,   (8)

where ||(n, h) − (n̄, h̄)|| denotes the distance 4 between rule (n, h) and the mean of the distribution (n̄, h̄), and A is determined by the requirement that Eq. (1) integrated over the space of evidence-based rules is exactly 1 − dh − e. Hence, the initial propensity over the evidence-based rules is essentially a normal distribution with mean (n̄, h̄) and standard deviation s.
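A discrete analogue of Eq. (8) on a grid of evidence-based rules can be written as below; for simplicity the sketch uses a plain Euclidean distance rather than the log-linear distance of Stahl (2001, Appendix A1), so it is only an approximation of the paper's specification.

```python
import numpy as np

def initial_rule_distribution(grid, mean, sigma, delta_h, eps):
    """Discrete analogue of Eq. (8): Gaussian-shaped weights over a grid of
    (n1, n2, n3, h) rules, scaled so the total mass equals 1 - delta_h - eps.
    grid: (n_rules, 4) array of rule points; mean: length-4 array (n-bar, h-bar).
    A plain Euclidean distance is used here instead of the paper's log-linear one."""
    d2 = ((grid - mean) ** 2).sum(axis=1)
    unnorm = np.exp(-0.5 * d2 / sigma ** 2)
    return (1.0 - delta_h - eps) * unnorm / unnorm.sum()
```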
Transference
Since this theory is about rules that use game information as input, we should be
able to predict behavior in a temporal sequence that involves a variety of games.
For instance, suppose an experiment consists of one run with one game for T
periods, followed by a second run with another game for T periods. How is what is
learned about the rules during the first run transferred to the second run with the
new game? A natural assumption would be that the log-propensities at the end of
the first game are simply carried forward to the new game. Another extreme
assumption would be that the new game is perceived as a totally different situation
so the log-propensities revert to their initial state. We opt for a convex combination

w(r, T+1) = (1 − y) w(r, 1) + y w(r, T+),   (9)

where T+ indicates the update after period T of the first run, and y is the transference parameter. If y = 0, there is no transference, so period T+1 has the same initial log-propensity as period 1; and if y = 1, there is complete transference, so the first period of the second run has the log-propensity that would prevail if it were period T+1 of the first run (with no change of game). This specification extends the model to any number of runs with different games without requiring additional parameters.

4 This is a log-linear Euclidean distance: the n variable is measured on a logarithmic scale, while h is on a linear scale; see Appendix A1 of Stahl (2001) for computational details.
The Likelihood Function
The theoretical model involves 10 parameters: b ≡ (n̄1 , n̄2 , n̄3 , h̄, s, dh , e, b0 , b1 , y).
The first four parameters (n̄1 , n̄2 , n̄3 , h̄) represent the mean of the population’s initial
propensity w(r, 1) over the evidence-based rules, and s is the standard deviation of
that propensity; the next two parameters (dh , e) are the initial propensities of the
herd and tremble rules, respectively; b0 and b1 are the learning parameters of Eqs.
(3) and (4); and y is the transference parameter in Eq. (9) for the initial propensity
of the subsequent runs. Substituting Eq. (7) into Eq. (2), the rule propensities and
law of motion yield population choice probabilities:
pj (t | b) = ∫R p̂j (W t ; n, h) j(n, h, t | b) d(n, h).   (10)
To compute this integral, a grid was placed on the (n, h)-space of rules (see
Appendix A1 of Stahl, 2001 for details). Let n tj denote the number of participants
who choose action j in period t. Then the logarithm of the joint probability of the
data conditional on b is
LL(b) ≡ Σt Σj n tj log[pj (t | b)].   (11)
To find a b vector that maximizes LL(b), we use a simulated annealing algorithm
(Goffe, Ferrier, & Rogers, 1994) for high but declining temperatures and then feed
the result into the Nelder & Mead (1965) algorithm. The simulated annealing algorithm is effective in exploring the parameter space and finding neighborhoods of
global maxima but very slow to converge, while the latter algorithm converges
much faster.
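A minimal sketch of Eq. (11) and of the second-stage simplex search follows. Here `predict`, `beta_start`, and `counts` are hypothetical placeholders for the model's mapping from parameters to the (T × J) array of choice probabilities, a starting parameter vector, and the observed choice counts; SciPy's Nelder-Mead routine stands in for the Nelder & Mead (1965) algorithm, and the simulated-annealing stage is not shown.

```python
import numpy as np
from scipy.optimize import minimize

def log_likelihood(choice_probs, counts):
    """Eq. (11): sum over periods t and actions j of n tj * log p_j(t | b).
    choice_probs and counts are (T, J) arrays."""
    return float(np.sum(counts * np.log(np.clip(choice_probs, 1e-12, 1.0))))

def neg_ll(beta, counts, predict):
    """predict(beta) should return the (T, J) array of model probabilities p_j(t | b)."""
    return -log_likelihood(predict(beta), counts)

# Second-stage polish with the simplex method (the simulated-annealing stage is omitted):
# result = minimize(neg_ll, beta_start, args=(counts, predict), method="Nelder-Mead")
```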
ENHANCEMENTS TO THE RULE LEARNING MODEL
Aspiration-Based Experimentation
Two components of an aspiration-based experimentation rule are the experimentation act and the aspiration level. We let p x(t) denote the probability distribution of the experimentation act in period t. Two obvious candidates for p x(t) are the uniform distribution p 0 for experimentation by random choice, and the recent past p t − 1 for experimentation by imitation. The third alternative we will consider is experimentation by random beliefs, in which each component of p x(t) is the proportion of the probability simplex for which that choice is the best response. Such choice probabilities would arise from beliefs that are uniformly distributed over the probability simplex.
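One way to compute this experimentation act is a Monte Carlo approximation over uniformly distributed beliefs, as sketched below; for 3 × 3 games the simplex shares could also be obtained analytically. This is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def random_belief_act(U, n_draws=100_000, seed=0):
    """Experimentation by random beliefs: the probability of each action is the
    share of the belief simplex on which it is the best reply, approximated by
    Monte Carlo over uniform (Dirichlet(1, ..., 1)) beliefs."""
    rng = np.random.default_rng(seed)
    J = U.shape[0]
    beliefs = rng.dirichlet(np.ones(J), size=n_draws)   # uniform on the simplex
    best = np.argmax(beliefs @ U.T, axis=1)             # best reply to each belief
    return np.bincount(best, minlength=J) / n_draws
```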
Let a x(t) denote the aspiration level for period t. We assume an adaptive expectations dynamic for the aspiration level
a x(t+1) = h2 a x(t) + (1 − h2) p x(t)′ U p t   (12)

with initial condition a x(1). With 0 < h2 < 1, the aspiration level will adjust gradually to the actual expected utility that the experimentation act would have generated each period. If initial play is diverse and the experimentation act is diverse, then p x(t)′ U p t will be highly correlated with the population average payoff. Hence,
if the latter is much less than the initial aspiration level and if h2 < 1, then the
aspiration level will gradually decline but remain above the population average
payoff, and this positive difference will increase the likelihood that the experimentation rule will be used. Conversely, if the population average payoff is higher than
the initial aspiration level, then the propensity to experiment will decline (don’t rock
the boat).
The rule learning framework can represent this dependence of the propensity to
experiment on the aspiration level by using b1 a x(t+1) as the reinforcement,
g(rx , W t+1), for the experimentation rule in Eq. (3). If initial aspirations are disappointed, then the log-propensity to experiment will increase, and vice versa.
Asymptotically, as long as h2 < 1, Eq. (12) has the property that the long-run propensity to use the experimentation act, p x(t), will depend on the actual expected
payoff from that act, since initial aspirations are geometrically discounted. Conversely, the larger h2 , the longer will be the effect of the initial aspiration level. For the
game in Fig. 1, simulations of rule learning with an initial aspiration level of 50 or
higher and a value of h2 > 0.9 produce separatrix crossings.
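The aspiration dynamic and its use as a reinforcement can be written compactly; the sketch below assumes the aspiration level, the experimentation act, and the observed frequencies are passed in as NumPy arrays and scalars, and it is a schematic rendering of Eq. (12) rather than the authors' code.

```python
import numpy as np

def update_aspiration(a_x, p_x, U, p_t, h2):
    """Eq. (12): adaptive-expectations update of the aspiration level a x(t).
    p_x is the experimentation act, p_t the observed choice frequencies."""
    realized = float(p_x @ U @ p_t)        # payoff the act would have earned in period t
    return h2 * a_x + (1.0 - h2) * realized

def experimentation_reinforcement(a_x_next, b1):
    """Reinforcement g(r_x, W t+1) = b1 * a x(t+1): when aspirations exceed the
    realized payoff, the experimentation rule is reinforced by more than its
    actual performance, so its propensity rises relative to the other rules."""
    return b1 * a_x_next
```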
In addition to specification of p x(t), we need to specify an initial aspiration level, a x(1), and an initial propensity, dx , to experiment (i.e., j(rx , 1) = dx ). It is natural to set a x(1) equal to the mean of the range of payoffs in the game: [maxjk Ujk − minjk Ujk ]/2. We let dx be a free parameter to be estimated from the data. We further assume that the transference equation applies to the aspiration level, so

a x(T+1) = (1 − y) a x(1) + y a x(T+).   (9′)
While this model of aspiration-based experimentation will suffice to illustrate the
richness of the rule learning theory, it does not exhaust all possible models of
aspiration-based experimentation—which would go beyond the purpose of this
paper.
Reciprocity-Based Cooperation
For the class of symmetric normal-form games, we define the symmetric cooperative action as the action with the largest diagonal payoff, and we restrict attention
to the subclass of games with a unique symmetric cooperative action. Let p c denote
the degenerate probability distribution that puts unit mass on this cooperative
action; we will refer to this as the cooperative rule, rc . But why would a rational
player choose the cooperative rule?
A rational player who believes that almost everyone else will select the cooperative action would anticipate a payoff equal to the maximum diagonal payoff if he
or she chooses the cooperative action. Then, if the cooperative action is a best
response to this belief (i.e., if it is a payoff-dominant Nash equilibrium), the rational player will choose the cooperative action. If the cooperative action is not a best reply to itself, a myopically rational player will not choose it, but a forward-looking
rational player could still choose it if he or she believes that others will reciprocate
leading to a higher payoff than the path that would follow from myopic best
replies. If others do not cooperate at the start, the realized payoff to cooperation
will be much less than anticipated, and so the player’s belief about the likelihood of
reciprocation may decline (especially if the player is impatient).
Let h2 denote the level of patience of such a player, so the anticipated payoff
from cooperation, a c(t) (via future reciprocity), follows the simple adaptive expectations dynamic

a c(t+1) = h2 a c(t) + (1 − h2) p c′ U p t,   (13)

where a c(1) = p c′ U p c, the largest diagonal payoff. This initial anticipated payoff is geometrically discounted over time; so provided h2 < 1, the anticipated payoff from cooperation will tend toward a geometrically weighted average of actual expected payoffs to cooperation. The larger h2 , the slower the adjustment, so if the propensity to cooperate is positively correlated with a c(t), cooperative efforts could
persist despite the lack of reciprocation.
The rule learning framework can represent this dependence of the propensity to
cooperate on anticipated reciprocity by using b1 a c(t+1) as the reinforcement,
g(rc , W t+1), for the cooperative rule in Eq. (3). When h2 =1, we would have a
stubborn, irrational cooperative rule, but otherwise, the propensity to cooperate
will depend on the difference between the anticipated payoff and the actual
expected payoffs of alternative rules. When cooperation is reciprocated, so its
payoff stays high relative to alternative rules (which recommend different actions),
then the propensity to cooperate will increase, and vice versa.
We further assume that the transference equation applies to the anticipated
payoff, so
a c(T+1) = (1 − y) a c(1) + y a c(T+).   (9″)
Finally, we let dc denote the initial propensity to cooperate (i.e., j(rc , 1)=dc ).
While this model of reciprocity-based cooperation will suffice to illustrate the
richness of the rule learning theory, it does not exhaust all possible models of
reciprocity-based cooperation—which would go beyond the purpose of this paper.
Nonetheless, we consider one more model in the next section.
A Tit-for-Tat Rule
TFT behavior gained prominence following Axelrod’s (1981, 1984) experimental
prisoner’s dilemma tournaments in which TFT emerged as the winning strategy,
and Kreps, Milgrom, Roberts, & Wilson’s (1982) rational underpinning of TFT
behavior in the finitely repeated prisoner’s dilemma with incomplete information.
In a two-player prisoner’s dilemma, the TFT rule specifies cooperation in the first
period and then mimicking the most recent action of one’s opponent thereafter. The
TFT rule we examine here begins cooperatively and thereafter cooperates if a
threshold number of players have cooperated in the most recent period; otherwise it
chooses a best reply to the recent period. Formally, let fc (t) denote the fraction of the population that cooperated in period t, and let c ∈ (0, 1) denote the threshold. Then, we define the TFT probability of cooperating in period t+1 as

k(t+1; fc (t), c) ≡ 0.5 [fc (t)/c]^γ if fc (t) ≤ c, and 1 − 0.5 {[1 − fc (t)]/[1 − c]}^γ if fc (t) > c,   (14)

where γ ≥ 1 is a precision parameter to be estimated along with c. This probability has the property that (i) it is 0 when fc (t) = 0, (ii) it is 1 when fc (t) = 1, (iii) it is 0.5 when fc (t) = c, and (iv) it is S-shaped and monotonically increasing in fc (t). For large values of γ, the probability function is close to a step function at fc (t) = c. When c ≈ 0, the cooperative action is chosen for almost all realizations of fc (t), while when c ≈ 1, the noncooperative best-reply action is chosen for almost all realizations of fc (t).
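A direct transcription of Eq. (14) is given below, with the precision parameter written here as gamma; the numerical values in the example are illustrative, not the estimates reported later.

```python
def tft_coop_prob(f_c, c, gamma):
    """Eq. (14): probability that the TFT rule cooperates in period t+1, given the
    cooperation share f_c in period t, threshold c in (0, 1), and precision gamma >= 1."""
    if f_c <= c:
        return 0.5 * (f_c / c) ** gamma
    return 1.0 - 0.5 * ((1.0 - f_c) / (1.0 - c)) ** gamma

# With c = 0.5 and gamma = 5 the function is 0 at f_c = 0, 0.5 at f_c = 0.5, 1 at f_c = 1,
# and it approaches a step function at c as gamma grows.
print(tft_coop_prob(0.6, 0.5, 5.0))
```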
From Eq. (14), it is straightforward to define the TFT behavioral rule rTFT (W t+1)
which plays the cooperative action with probability k(t+1; fc (t), c) and plays the
best reply to p t with probability 1 − k(t+1; fc (t), c). In accordance with Eq. (4), the
reinforcement function g(rTFT , W t+1) = b1 rTFT (W t)′ U p t. To complete the rule
learning model with this TFT rule, we specify the initial propensity to use the TFT
rule as dTFT .
METHODS
Participants
We conducted 25 sessions, each involving 25 different participants. All participants were inexperienced in this or similar experiments, and all were upper division
undergraduates or noneconomics graduate students. Two sessions were conducted
at Texas A&M University, and the rest were conducted at the University of Texas.
Apparatus
Participants were seated at private computer terminals separated so that no participant could observe the choices of other participants. The relevant game, or
decision matrix, was presented on the computer screen. Each participant could
make a choice by clicking the mouse button on any row of the matrix, which then
became highlighted. In addition, each participant could make hypotheses about the
choices of the other players. An on-screen calculator would then calculate and
display the hypothetical payoffs to each available action given each hypothesis.
Participants were allowed to make as many hypothetical calculations and choice
revisions as time permitted. Following each time period, each participant was
shown the aggregate choices of all other participants and could view a record screen
with the history of the aggregate choices of other participants for the entire run.
Design and Procedures
Each experiment session consisted of two runs of 12 periods each. In the first run,
a single 3 × 3 symmetric game was played for 12 periods, and in the second run, a
different 3 × 3 symmetric game was played for 12 periods.
A mean-matching protocol was used. In each period, a participant’s token payoff
was determined by his or her choice and the percentage distribution of the choices
of all other participants, p(t), as follows: The row of the payoff matrix corresponding to the participant’s choice was multiplied by the vector of choice distribution of
the other participants. Token payoffs were in probability units for a fixed prize of
$2.00 per period. In other words, the token payoff for each period gave the percentage chance of winning $2 for that period. The lotteries that determined final
monetary payoffs were conducted following the completion of both runs using dice.
Specifically, a random number uniformly distributed on [00.0, 99.9] was generated
by the throw of three 10-sided dice. A player won $2.00 if and only if his or her
token payoff exceeded his or her generated dice number. Payment was made in cash
immediately following each session.
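The payoff protocol can be summarized in a short sketch: the token payoff is the participant's payoff row times the empirical distribution of the other participants' choices, and payment is a binary lottery on that token payoff. The function names are ours, and the continuous uniform draw is a stand-in for the three ten-sided dice.

```python
import numpy as np

def token_payoff(U, own_choice, others_choices):
    """Mean-matching: the payoff row for the participant's choice times the
    distribution of the other participants' choices. Token payoffs are the
    percentage chance of winning the $2.00 prize for that period."""
    J = U.shape[0]
    dist = np.bincount(others_choices, minlength=J) / len(others_choices)
    return float(U[own_choice] @ dist)

def lottery_payment(token, rng):
    """Binary lottery: win $2.00 iff the token payoff exceeds a uniform draw on
    [0.0, 99.9] (a continuous stand-in for the three ten-sided dice)."""
    return 2.0 if token > rng.uniform(0.0, 99.9) else 0.0
```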
The Games
We select 10 games (see Fig. 2) from those investigated in Haruvy & Stahl (1999),
with properties that make them ideal for a thorough study of learning dynamics.
These are games 1, 4, 9, 12, 13, 14, 16, 18, and 19 of Haruvy & Stahl (1999). In
addition, we have a 3 × 3 version of a 5 × 5 game used in Stahl (1999), for which
cooperative behavior seemed to persist (labeled game 21).
What makes these games ideal? We begin by noting that games 1, 4, 13, 14, 16,
and 19 are characterized by multiple equilibria which are Pareto ranked. The selection principles of payoff dominance, risk dominance, and security do not all coincide in any of these six games, and no pair of principles coincides in all of these six
games. Most important, the payoff dominant equilibrium action never coincides
with the level-1 action of Stahl & Wilson (1994, 1995). This is important given the
prediction by Haruvy & Stahl (1999) that level-1 is a heavy component in the
determination of initial conditions. According to the predictions generated by the
Haruvy & Stahl (1999) model, initial conditions in these six games will generally fall
in a basin of attraction (for popular models of adaptive dynamics) inconsistent with
the payoff-dominant choice.
Games 1, 14, and 19 begin with initial conditions far from uniform and very little
movement is observed thereafter, with the exception of one run of game 14, where a
significant proportion of the group goes against their best response to cross the
separatrix toward the payoff dominant Nash equilibrium. Game 13 has a long
dynamic path from initial conditions to the final outcome. The final period outcomes of Game 16 are almost equally split between the two equilibria, although the
initial conditions fall usually in the non-payoff-dominant basin, hence resulting in
separatrix crossings. Game 18 was investigated because its theoretical dynamics involve a persistent cycle. Games 9 and 12 have only a mixed-strategy equilibrium and hence potentially interesting dynamic paths.

FIG. 2. Games.
We also investigate three additional versions of game 13 (with 20 added to each
cell, ‘‘13+’’; with 20 subtracted from each cell, ‘‘13 − ’’; and with payoffs rescaled
to [0, 100]; ‘‘13r’’), and an additional version of game 16 (with 20 added to each
cell, ‘‘16+’’). The motivation for this variation was that if the initial aspiration
level, a x(1), is fixed independent of the game payoffs (for example, determined by
the participant’s anticipated compensation for the whole experimental session), then
adding 20 to all payoffs should change behavior. For instance, if a x(1)=50, then in
game 16 the ‘‘A’’ payoff of 20 is substantially less than the initial aspiration level
thereby inducing experimentation, while with 20 added to each cell, since then 40 is much closer to 50, experimentation (and hence separatrix crossings) should occur less often. The experimental evidence, however, is 5 and 4 crossings out of ten sessions of each version of game 16 respectively—a statistically insignificant difference at all commonly accepted significance levels. We therefore discard the hypothesis of a fixed initial aspiration level independent of the game payoffs.
The games used in each session are listed in Table 2. The data from the first four sessions were analyzed in Stahl (2000a, 2000b, 2001). The data from the other 21 sessions have not been analyzed before. The entire data set is available upon request from the authors.

TABLE 2
List of Games by Experimental Session

Date       First Run   Second Run
2/17/98    16          1
2/19/98    19          9
4/7/98     14          12
4/9/98     13          1
6/18/98    16          21
7/16/98    16+         13
7/30/98    16+         13−
1/28/99    16          13r
3/2/99     16+         19
6/9/99     16          19
6/10/99    16+         13
8/6/99     16          13
8/11/99    16+         13+
1/27/00    16          13+
2/16/00    16+         14
2/24/00    16          14
4/13/00    16+         1
7/13/00    16          14
7/19/00    16          4
7/20/00    16+         14
7/26/00    16          1
7/27/00    16+         19
8/2/00a    16+         1
8/2/00b    19          18
8/3/00     1           4
Note that the range of payoffs for the games in Fig. 2 differs substantially from
[20, 70] to [0, 100]. In Haruvy & Stahl (1999), we found support for the hypothesis that a single payoff rescaling parameter, b1 , applies to a wide range of games
provided the payoffs are first rescaled to [0, 100]. On the current data set, we also
found that the maximized LL increased with such rescaling. Therefore, in this paper
we report only results obtained from rescaling to [0, 100]. This implies that all the
versions of games 13 and 16 are "equivalent," an assumption that has been verified by independent analysis of the choice behavior across the different versions. 5

TABLE 3
Individual Choices for Session 6/18/1998, Run 1 (Game 16)
(Rows are periods 1-12; columns are subjects 1-25.)

Period  1:  C A A A B A A B A B B A B C A A A C A A A A B A A
Period  2:  A A A A A A A C A A A A C A A B A A A A B A B A A
Period  3:  B A A A A A B C A A C A A A A A A A A A A A A A A
Period  4:  B A A A A C B C A A B A A C B A A A A A A A B A A
Period  5:  A B A A A B B A A A B A B B A B A A A A B A A B B
Period  6:  A B B A A B B B A B B B B A A B B B A B B B B B B
Period  7:  B B B B B B B B B B B B B B B B B B B B B B B B B
Period  8:  B B B B B B B B B B B B B B A B B B B B B B C B B
Period  9:  B B C B B B B B A B B B B B A B B B B B B B A B B
Period 10:  B B B B B B B B B B B B B B B B B B B B B B B B A
Period 11:  B B B B B B B B B B B B B B B B B B B B B B C B B
Period 12:  B B B B B B B B B B B B B B B B B B B B B B B B B
RESULTS
In addition to the plot of the dynamic path for one of the sessions shown in
Fig. 1, we also report the individual choice behavior for that data in Table 3. A
cursory look at Table 3 suggests that participants, for the most part, do not persist
in cooperative behavior (action B in that game) unless others are also simultaneously cooperative. For instance, although six participants chose B in period 1
(putting the initial play close to the separatrix), all but one abandoned cooperation
in period 2, and none of the three who cooperated in period 2 continued to do so in
period 3. Even in the critical period 4, of the five who chose B, only two persisted
with B in period 5. While nine of the ten participants who cooperated in period 5
continued to cooperate in period 6, by then B was the myopic best response. Similar
patterns can be observed throughout the data. Hence, an ocular analysis of the data
favors the experimentation explanation over the anticipated reciprocity explanation.
We will rigorously test this and other hypotheses next.
The Basic Rule Learning Model.
We estimated the 10 parameters of the basic rule learning model using all the data from the 25 sessions. We pool over games in order to try to identify regular features of learning dynamics that are general and not game-specific, so we can be more confident that these features will be important in predicting out-of-sample behavior. The ML parameter estimates and standard errors 6 are given in Table 4, and the maximized LL is −8550.21.

TABLE 4
Parameter Estimates of Basic Rule Learning Model

         Stahl (2000b)        Current data set
n̄1       0.797 (0.207)        0.376 (0.201)
n̄2       0.0696 (0.021)       0.000 (10⁻⁸)
n̄3       0.000 (10⁻⁸)         0.000 (10⁻⁷)
h̄        0.647 (0.026)        0.938 (0.049)
s        0.767 (0.086)        0.800 (0.104)
dh       0.533 (0.028)        0.441 (0.021)
e        0.090 (0.023)        0.123 (0.015)
b0       1.000 (0.002)        1.000 (0.003)
b1       0.00790 (0.00145)    0.00465 (0.00097)
y        1.000 (0.049)        0.308 (0.145)
LL                            −8550.21

5 This rescaling implies that the initial aspiration level in rescaled units is a x(1) = [maxjk Ujk − minjk Ujk ]/2 = 50, for all games, which implies that the initial aspiration level in un-rescaled units varies across games depending on the original game payoffs.
There are similarities and differences between these parameter estimates and
those for a different data set reported in Stahl (2000b). 7 First, as in Stahl (2000b),
we find b0 =1, so past propensity differences between rules persist unless affected
by performance. The initial propensities for herd behavior (dh ) and trembles (e),
and hence the total propensity to use some evidence-based rule, are similar. Also
the standard deviation of the initial distribution over evidence-based rules, s, is
similar, but the mean (n̄, h̄) is notably different. n̄2 is zero and n̄1 is less. Despite the
low t-ratio for n̄1 , the lower bound on the 95% confidence interval was 0.296, which
is still large enough to put substantial probability on the best response to the level-1
belief. However, n̄3 is still precisely zero. h̄ is much higher, implying that beliefs
adjust quicker, while the rescaling parameter for the reinforcement function shrinks
38%. However, the most substantial difference is the transference parameter that
was estimated to be 1 in Stahl (2000b), but only 0.308 here; also the standard error
is much larger.
Despite these differences, the contribution of rule learning is robust. Restricting
b0 =1 and b1 =0 yields a model with diverse rules but no rule learning. The
maximized LL decreases to −8569.93. Since the transference parameter (y) no
longer plays a role, the likelihood-ratio test has three degrees of freedom, and the
hypothesis of no-rule-learning has a p-value of less than 3 × 10⁻⁹.
To illustrate the amount of rule learning that takes place (hence the behavioral effect of b1 ), Fig. 3 shows selected j(n, h, t | b) values aggregated over the domain of h for the initial period (t=1), after the 12 periods of the first run (t=12), and after the 12 periods of the second run (t=24) for a typical session (namely 6/18/98, for which the first-run data are given in Fig. 1 and Table 3). These j values come from the model of the Reciprocity-Based Cooperation Results section. Seven rule categories are shown, accounting for over 96% of the probability mass. The "herd&tremble" category on the left aggregates the herd rules and the tremble rule and shows a substantial decline in the propensities to use these rules. The next five rules are evidence-based rules where the three numbers that label each category are the values of (n1 , n2 , n3 ); 8 thus, the first three categories are level-1 rules with increasing degrees of precision. These rules show a substantial increase in propensity over time. The next rule combines level-1 and level-2 evidence, and the next rule combines level-1 and Nash evidence; both rules show a modest increase in propensity over time. The final category is a pure cooperation rule, the propensity of which appears fairly constant over the first run and then declines dramatically over the second run.

FIG. 3. j(t) for 6/18/98 session.

6 Standard errors and confidence intervals were obtained via parametric bootstrapping (Efron & Tibshirani, 1993, pp. 53–55, 169–177).
7 To ensure stability of the dynamic process, an upper bound of 1 is imposed on b0.
Aspiration-Based Experimentation Results
We investigated three versions of experimentation rules: (1) uniformly random
choice, (2) imitation of the recent population distribution, and (3) uniformly
random beliefs. The parameter estimates, standard errors, and maximized LL for
each version are given in Table 5. In terms of LL, experimentation by imitation has
the lowest LL (−8549.69; a statistically insignificant improvement of only 0.52).
8 Note that these are the same values as used in Table 1.

TABLE 5
Parameter Estimates of Enhanced Models

                   Aspiration-based experimentation
         Random choice        Imitation            Random beliefs       Anticipated reciprocity
n̄1       0.361 (0.195)        0.393 (0.241)        0.324 (0.061)        0.324 (0.160)
n̄2       0.000 (10⁻⁷)         0.000 (0.002)        0.0067 (0.007)       0.000 (0.0004)
n̄3       0.000 (10⁻³)         0.000 (0.002)        0.000 (0.0005)       0.000 (0.0003)
h̄        0.920 (0.051)        0.908 (0.057)        0.923 (0.044)        0.859 (0.043)
s        0.747 (0.108)        0.828 (0.105)        0.587 (0.111)        0.715 (0.115)
dh       0.448 (0.022)        0.258 (0.143)        0.426 (0.021)        0.430 (0.023)
dx       0.108 (0.016)        0.171 (0.133)        0.0329 (0.014)       0.037 (0.006)
e        0.000 (0.002)        0.131 (0.018)        0.115 (0.013)        0.084 (0.014)
h2       0.757 (0.186)        0.588 (0.352)        0.860 (0.321)        0.000 (0.001)
b0       1.000 (0.003)        1.000 (0.003)        1.000 (0.003)        1.000 (0.002)
b1       0.00500 (0.00102)    0.00486 (0.00099)    0.00558 (0.00100)    0.00328 (0.00067)
y        0.208 (0.168)        0.290 (0.132)        0.213 (0.136)        0.888 (0.179)
LL       −8546.83             −8549.69             −8544.70             −8522.41

(The first three columns are versions of aspiration-based experimentation; in the anticipated reciprocity column the propensity row is dc rather than dx .)
This poor performance is due to the fact that the basic rule learning model already
includes imitation via the herd rules. We also find that the estimates for (dh , dx , e)
for the imitation version are quite different from any other version, which is attributable to multicollinearity between the imitation rule and the herd rules. Note the
low t-ratio for dx , indicating that we cannot reject the hypothesis that dx =0.
Experimentation by uniformly random choice shows a LL gain of 3.38; with two
degrees of freedom this gain has a p-value of 0.034, which is not impressive considering there are 625 participants in the data set. Moreover, the (dx , e) estimates are
quite different from the basic rule learning model, which is probably attributable to
the multicollinearity between random-choice experimentation and the tremble rule
in the basic rule-learning model. Despite the seemingly respectable t-ratio for h2 , the
95% confidence interval is (0.0007, 0.921), shedding doubt on the role of aspirations in this model.
The largest gain in LL is for experimentation by random beliefs. This gain of 5.51
has a p-value of 0.004 and so is statistically significant. Again, we find a large confidence interval for h2 (0.0001, 1.0), shedding doubt on the role of aspirations. We
can test the hypothesis that aspirations are unimportant by setting h2 =0; then the
initial aspiration level is fully discounted so the experimentation act, p x(t), is reinforced by its actual expected payoff to the most recent population choices—like all
other behavioral rules in the model. The maximized LL of this restricted model
decreases by only 1.10 and hence is not statistically significant at any common
acceptance level. Therefore, we cannot reject the hypothesis that experimentation
by random beliefs does not depend on aspirations in this data set. Yet, there is
support for a simple experimentation-by-random-belief rule (3.3%) that is reinforced like any other behavioral rule. Since none of the other versions of aspiration-based experimentation does substantially better, we conclude more generally that
aspirations (as modeled) play essentially no role in this data set.
Reciprocity-Based Cooperation Results
Table 5 also gives the parameter estimates, standard errors, and maximized LL
for the anticipated reciprocity model. We note that the estimate of the patience
parameter, h2 , is zero, indicating that players immediately react to the payoff from
the cooperative action rather than reinforcing cooperative propensities by anticipated future reciprocity. The individual choice behavior (e.g., Table 3) is consistent with this finding. On the other hand, the estimated initial propensity to cooperate (dc ) is about 3.7% and statistically different from 0 (p-value < 10 −13). Therefore, we can strongly reject the hypothesis that there is no cooperative behavior in
this data set (i.e., dc =0). Another curious finding is that the transference estimate
increases to 0.888. 9
The gain in LL due to a non-reciprocity-based cooperative rule is 27.81, which is
much larger than the gain from rule learning without any cooperative rule (19.72).
Out of curiosity, we estimated the model with a non-reciprocity-based cooperative
rule but without rule learning (b1 =0) and obtained a LL value of −8550.69;
hence, adding a non-reciprocity-based cooperative rule to the no-rule-learning
models yields a LL gain of 19.24, while the further addition of rule learning yields a
LL gain of 28.28. Therefore, we can even more strongly reject the hypothesis of no
rule learning in the presence of this cooperative rule.
Because it is irrational to anticipate reciprocity in the last period of an experiment, we considered alternative versions of anticipated reciprocity that eliminated
the cooperative rule for the last period, 10 but the results for these versions were
essentially the same.
Tit-for-Tat Results
Estimating the TFT model, we found a maximized LL value of −8530.01, which
is a substantial gain from the basic rule learning model, but still less than rule
learning with a pure cooperative rule. Furthermore, the estimates for the TFT
parameters are (c, γ, dTFT ) = (0.04, 5.03, 0.047), respectively. Given these parameters, k(t+1; fc (t), c) is a sharp step function at 0.04, implying that as long as one person out of 25 in the population chose the cooperative action in period t, the TFT rule will cooperate in period t+1. This behavior is not much different from
unconditional cooperation.
Since our specification of k(t+1; fc (t), c) with c > 0 excludes pure cooperation,
we decided to test another specification that would admit pure cooperation. This
was accomplished simply by letting c take on negative values: c ∈ [−0.01, 1). Not surprisingly, the maximum LL was achieved for parameter values that mimicked a
pure cooperation rule. Therefore, we cannot reject the hypothesis that the TFT rule
adds nothing to explaining the data over a pure cooperation rule.
9 At the suggestion of a referee, we considered an encompassing model with a separate transference parameter for the anticipated reciprocity variable; however, we found no improvement in the maximized LL whatsoever and therefore cannot reject the hypothesis of a single transference parameter.
10 For the last period of each run in another version.
DISCUSSION
We have demonstrated how the framework of rule learning can accommodate
behavioral rules like aspiration-based experimentation and reciprocity-based cooperation. In the data set considered, consisting of 50 runs of 12 periods each and involving 625 subjects, we found no evidence of aspiration-based experimentation or
reciprocity-based cooperation. We did, however, find weak evidence for a simple
experimentation by random belief rule that is reinforced by its actual expected
payoff, and strong evidence for a simple cooperation rule that is also reinforced by
its actual expected payoff like all other rules in the rule learning model. Further, we
found no evidence for a tit-for-tat rule.
However, our conclusions regarding a simple cooperation rule must be softened.
While the maximized LL dramatically increases with the addition of a simple
cooperation rule, this gain comes largely by moving the mean prediction for the
first period of Game 16 closer to the separatrix. Since the mean starting point for
the dynamics is so close to the separatrix, random fluctuations around the mean
generate starting points on both sides of the separatrix, which lead to final outcomes being a fairly even split between the two equilibria. However, simulated
paths rarely exhibit the reversal and separatrix crossing pattern (like Fig. 1) seen in
the data. Since the leading theoretical explanations of such behavior (aspiration-based experimentation and anticipated reciprocity) fail to be significant on this data
set, we still have a behavioral pattern without an adequate explanation.
REFERENCES
Anderson, J. (1993). Rules of the mind. Hillsdale, NJ: Erlbaum.
Axelrod, R. (1984). The evolution of cooperation. New York: Basic Books.
Axelrod, R., & Hamilton, W. (1981). The evolution of cooperation. Science, 211, 1390–1396.
Binmore, K., & Samuelson, L. (1997). Muddling through: noisy equilibrium selection. Journal of Economic Theory, 74, 235–265.
Brandts, J., & Charness, G. (1999). Punishment and reward in a cheap-talk game. Unpublished
manuscript.
Camerer, C., & Ho, T. (1998). EWA learning in coordination games: probability rules, heterogeneity,
and time-variation. Journal of Mathematical Psychology, 42, 305–326.
Camerer, C., & Ho, T. (1999a). EWA learning in games: preliminary estimates from weak-link games. In
D. Budescu, I. Erev, & R. Zwick, (Eds.), Games and human behavior: Essays in honor of Amnon
Rapoport. Hillsdale, NJ: Erlbaum.
Camerer, C., & Ho, T. (1999b). Experience-weighted attraction learning in normal form games. Econometrica, 67, 827–874.
Camerer, C., Ho, T., & Chong, J. (2000). Sophisticated EWA learning and strategic teaching in repeated
games. (Working paper 00-005). Wharton: Univ. of Pennsylvania.
Cheung, Y-W., & Friedman, D. (1997). Individual learning in normal form games: some laboratory
results. Games and Economic Behavior, 19, 46–76.
Cheung, Y-W., & Friedman, D. (1998). Comparison of learning and replicator dynamics using experimental data. Journal of Economic Behavior and Organization, 35, 263–280.
Churchland, P., & Sejnowski, T. (1992). The computational brain. Cambridge, MA: MIT Press.
Cooper, R., Dejong, D., Forsythe, R., & Ross, T. (1993). Forward induction in the battle-of-the-sexes
games. American Economic Review, 83, 1303–1316.
Dufwenberg, M., & Kirchsteiger, G. (1998). A theory of sequential reciprocity. (CentER Discussion
Paper 9837). The Netherlands: Tilburg University.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Engle-Warnick, J., & Slonim, R. L. (2001). The fragility and robustness of trust in repeated games.
(working paper). Case Western Reserve University.
Erev, I., Bereby-Meyer, Y., & Roth, A. (1999). The effect of adding a constant to all payoffs: experimental investigation, and implications for reinforcement learning models. Journal of Economic Behavior and Organization, 39, 111–128.
Erev, I., & Roth, A. (1998). Predicting how people play games: reinforcement learning in experimental
games with unique mixed strategy equilibria. American Economic Review, 88, 848–881.
Fehr, E., & Gächter, S. (1998). Reciprocity and economics: the economic implications of homo reciprocans. European Economic Review, 42, 845–859.
Fudenberg, D., & Levine, D. (1998). The theory of learning in games. Cambridge, MA: MIT Press.
Gigerenzer, G., Todd, P., & the ABC Research Group, (1999). Simple heuristics that make us smart.
Oxford: Oxford Univ. Press.
Gneezy, U., Guth, W., & Verboven, F. (2000). Presents or investments? An experimental analysis.
Journal of Economic Psychology, 21, 481–493.
Goffe, W., Ferrier, G., & Rogers, J. (1994). Global optimization of statistical functions with simulated
annealing. Journal of Econometrics, 60, 65–99.
Harsanyi, J., & Selten, R. (1988). A general theory of equilibrium selection in Games. Cambridge, MA:
MIT Press.
Haruvy, E., & Stahl, D. (1999). Empirical tests of equilibrium selection based on player heterogeneity,
mimeograph.
Ho, T-H., Camerer, C., & Weigelt, K. (1998). Iterated dominance and iterated best-response in experimental p-beauty contests. American Economic Review, 88, 947–969.
Hoffman, E., McCabe, K., & Smith, V. (1998). Behavioral foundations of reciprocity: experimental
economics and evolutionary psychology. Economic Inquiry, 36, 335–352.
Kreps, D., Milgrom, P., Roberts, J., & Wilson, R. (1982). Rational cooperation in the finitely repeated
prisoner’s dilemma. Journal of Economic Theory, 27, 245–252.
Nagel, R. (1995). Unraveling in guessing games: an experimental study. American Economic Review, 85,
1313–1326.
Nelder, J., & Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7,
308–313.
Newell, A., Shaw, J., & Simon, H. (1958). Elements of a theory of human problem solving. Psychological
Review, 65, 151–166.
Rabin, M. (1993). Incorporating fairness into game theory and economics. American Economic Review, 83, 1281–1302.
Roth, A., & Erev, I. (1995). Learning in extensive form games: experimental data and simple dynamic
models in the intermediate term. Games and Economic Behavior, 8, 164–212.
Sarin, R., & Vahid, F. (1999). Payoff assessments without probabilities: a simple dynamic model of
choice. Games and Economic Behavior, 28, 294–309.
Sen, A. (1977). Rational fools: a critique of the behavioral foundations of economic theory. Journal of
Philosophy and Public Affairs, 6, 317–344.
Stahl, D. (1996). Boundedly rational rule learning in a guessing game. Games and Economic Behavior, 16,
303–330.
Stahl, D. (1999). Evidence based rules and learning in symmetric normal form games. International
Journal of Game Theory, 28, 111–130.
Stahl, D. (2000a). Rule learning in symmetric normal-form games: theory and evidence. Games and
Economic Behavior, 32, 105–138.
Stahl, D. (2000b). Action-reinforcement learning versus rule learning. Department of Economics, University of Texas.
Stahl, D. (2001). Population rule learning in symmetric normal-form games: theory and evidence.
Journal of Economic Behavior and Organization, 45, 19–35.
Stahl, D., & Wilson, P. (1994). Experimental evidence of players’ models of other players. Journal of
Economic Behavior and Organization, 25, 309–327.
Stahl, D., & Wilson, P. (1995). On players' models of other players: theory and experimental evidence.
Games and Economic Behavior, 10, 218–254.
Received: December 19, 2000; revised: October 2, 2001; published online: June 13, 2002