Lecture 4: Repeated Games III

MA300.2 Game Theory II, LSE
Summary of Lecture 4
More on Collusion and Punishments in Repeated Games
1. Punishments More Severe Than Nash Reversion
In the previous section we provided a detailed example of a “real-life” game — the Cournot
oligopoly — in which the one-shot Nash equilibrium fails to push firms down to their security
level. This raises the question of whether more severe credible punishments are available.
Example. [Osborne, p. 456.] Consider the following game:
          A2        B2        C2
A1      4, 4      3, 0      1, 0
B1      0, 3      2, 2      1, 0
C1      0, 1      0, 1      0, 0

(Rows are player 1’s actions, columns are player 2’s; the first entry in each cell is player 1’s payoff.)
The unique Nash equilibrium of the above game involves both parties playing A. But it
is easy to check that each player’s security level is 1. So are there equilibrium payoffs “in
between”?
Consider the following strategy profile, described in two phases:
The Ongoing Path. (Phase O) Play (B1 , B2 ) at every date.
The Punishment Phase. (Phase P) Play (C1 , C2 ) for two periods; then return to Phase O.
Start with Phase O. If there is any deviation, start up Phase P. If there is any deviation from
that, start Phase P again.
To check whether this strategy profile forms a SGPE, it suffices to check one-shot deviations.
Phase O yields a lifetime payoff of 2. A deviation gets her the payoff

(1 − β)[3 + β·0 + β²·0] + β³·2.

Noting that 1 − β³ = (1 − β)(1 + β + β²), we see that a deviation in Phase O is not worthwhile if

2 + 2β + 2β² ≥ 3,

or β ≥ (√3 − 1)/2.
What about the first date of Phase P? Lifetime utility in this phase is β²·2 (why?). If
she deviates she can get 1 today, and then the phase is started up again. So deviation is not
worthwhile if

β²·2 ≥ (1 − β)·1 + β·β²·2,        (1)

or if β ≥ √2/2. This is a stronger restriction than the one for Phase O, so hold on to this one.
Finally, notice without doing the calculations that it is even harder to gain by deviating in date 2 of Phase
P (why?). So these strategies form a SGPE if β ≥ √2/2.
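To see the arithmetic concretely, here is a quick numerical sketch (my addition, not part of the original calculation) that evaluates the two one-shot deviation conditions above at a few test discount factors; the function names and the grid of β values are illustrative choices only.

```python
# Check the two one-shot deviation conditions for the Osborne example.
# Phase O: stay and get 2 forever, or deviate for 3, suffer two periods of 0, then return to 2.
# Phase P: conform for a value of 2*b**2, or grab 1 today and restart the punishment.

def phase_O_ok(b):
    return 2 >= (1 - b) * 3 + b**3 * 2

def phase_P_ok(b):
    return 2 * b**2 >= (1 - b) * 1 + b * (2 * b**2)

for b in [0.30, 0.37, 0.50, 0.70, 0.71, 0.90]:
    print(f"beta = {b:.2f}: Phase O holds? {phase_O_ok(b)}  Phase P holds? {phase_P_ok(b)}")
# The switch points match (sqrt(3) - 1)/2 ~ 0.366 and sqrt(2)/2 ~ 0.707.
```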
Several remarks are of interest here.
1. The equilibrium payoff from this strategy profile is 2. But in fact, the equilibrium bootstraps off another equilibrium: the one that actually starts at Phase P. The return to that
equilibrium is even lower: it is β²·2.
2. Indeed, at that lowest value of β for which this second equilibrium is sustainable, the
punishment equilibrium exactly attains the minimax value for each player! And so everything that can
conceivably be sustained in this example can be done with this punishment equilibrium, at
least at this threshold discount factor.
3. Notice that the ability to sustain this security value as an equilibrium payoff is not
exactly “monotonic” in the discount factor. In fact if the discount factor rises a bit above
the minimum threshold you cannot find an equilibrium with security payoffs. But this is
essentially an integer problem — you can punish for two periods but the discount factor
may not be “good enough” for a three-period punishment. Ultimately, as the discount factor
becomes close to 1 we can edge arbitrarily close to the security payoff and stay in that close
zone; this insight will form the basis of the celebrated folk theorem.
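To illustrate the integer problem, the following sketch (again my own addition) searches, for each discount factor, over punishment lengths T and reports the lowest sustainable value of the punishment equilibrium, namely 2β^T from T periods of (C1 , C2 ) followed by a permanent return to (B1 , B2 ); the no-deviation conditions used are the natural T -period versions of the two just checked.

```python
# For a T-period punishment, the Phase-P value is v = 2*b**T. The strategy profile is an
# equilibrium iff (i) v >= 1 (no gain from grabbing 1 inside Phase P) and
# (ii) 2 >= 3*(1 - b) + b*v (no gain from deviating in Phase O).

def lowest_punishment_value(b, max_T=200):
    values = []
    for T in range(1, max_T + 1):
        v = 2 * b**T
        if v >= 1 and 2 >= 3 * (1 - b) + b * v:
            values.append(v)
    return min(values) if values else None

for b in [0.70, 0.71, 0.75, 0.80, 0.90, 0.99]:
    print(f"beta = {b:.2f}: lowest punishment value = {lowest_punishment_value(b)}")
# The lowest attainable value dips to (roughly) the security level of 1 only at special
# discount factors, but it approaches 1 for every beta close enough to 1.
```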
Example. [Abreu.] Here is a simple, stripped-down version of the Cournot example in which
we can essentially try out the same sort of ideas. The nice feature about this example (in
contrast to the previous one, the role of which was purely pedagogical) is that it has some
collusive outcome, better than the Nash, which players are trying to sustain.
           L2        M2          H2
L1     10, 10     3, 15        0, 7
M1      15, 3      7, 7       −4, 5
H1       7, 0     5, −4    −15, −15
Think of L, M and H as low, medium and high outputs respectively. Now try and interpret
the payoffs to your satisfaction.
Notice that each player’s maximin payoff is 0, but of course, no one-shot Nash equilibrium
achieves this payoff.
1. You can support the collusive outcome using Nash reversion. To check when this works,
notice that sticking to collusion gives 10, while the best deviation followed by Nash reversion
yields
(1 − β)15 + β7.
It is easy to see that this strategy profile forms an equilibrium if and only if β ≥ 5/8 . For
lower values of β Nash reversion will not work.
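A one-line numerical confirmation (my own sketch; the test values of β are arbitrary):

```python
# Nash reversion: collude for 10 forever, or deviate for 15 and then get the Nash payoff 7 forever.
def nash_reversion_ok(b):
    return 10 >= (1 - b) * 15 + b * 7

print(nash_reversion_ok(0.62))   # False: just below 5/8
print(nash_reversion_ok(0.625))  # True: exactly 5/8
print(nash_reversion_ok(0.70))   # True: above 5/8
```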
2. But here is another one that works for somewhat lower values of β. Start with (L1 , L2 ).
If there is any deviation play (H1 , H2 ) once and then revert to (L1 , L2 ). If there is any
deviation from that, start the punishment up again. Check this out. The punishment value is

−15(1 − β) + 10β ≡ p,        (2)
and so the no-deviation constraint in the punishment phase is
p ≥ (1 − β)0 + βp,
or p ≥ 0. This yields the condition β ≥ 3/5 .
What about the collusive phase? In that phase, the no-deviation condition tells us that
10 ≥ (1 − β)15 + βp
but (2) assures us that this restriction is always satisfied (why?). So the collusive phase
is not an issue, and our restriction is indeed β ≥ 3/5, the one that’s needed to support the
punishment phase.
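Here is a short sketch checking both constraints (my addition; the β test points are arbitrary):

```python
# Carrot and stick: one period of (H1, H2), then back to (L1, L2). The punishment value is p.
def p(b):
    return -15 * (1 - b) + 10 * b

def punishment_phase_ok(b):     # deviating inside the punishment yields 0 today, then p again
    return p(b) >= (1 - b) * 0 + b * p(b)

def collusive_phase_ok(b):      # deviating from (L1, L2) yields 15 today, then p
    return 10 >= (1 - b) * 15 + b * p(b)

for b in [0.55, 0.58, 0.61, 0.65]:
    print(f"beta = {b}: punishment ok? {punishment_phase_ok(b)}  collusion ok? {collusive_phase_ok(b)}")
# Only the punishment-phase condition binds; it switches on at beta = 3/5, below 5/8.
```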
3. For even lower values of β, the symmetric punishment described above will not work. But
here is something else that will: punishments tailored to the deviator! Think of two
punishment phases, one for player 1 and one for player 2. The punishment phase for player
i (where i is either 1 or 2) looks like this:
(Mi , Hj ); (Li , Mj ), (Li , Mj ), (Li , Mj ), . . .
Now we have to be more careful in checking the conditions on the discount factor. First write
down the payoff to players i and j from punishment phase Pi , the one that punishes i. The
“punishee” player i gets −4 in stage 1 and 3 in each stage thereafter, for a lifetime value of

p ≡ −4(1 − β) + 3β,

while the “punisher” player j gets 5 in stage 1 and 15 in each stage thereafter, for a lifetime value of

5(1 − β) + 15β.
Now, if i deviates in the first stage of his punishment he gets 0 and then is punished again.
So the no-deviation condition is
p ≥ (1 − β)0 + βp,
or just plain p ≥ 0, which yields the restriction β ≥ 4/7 .
What if i deviates in some future stage of his punishment? The condition there is
3 ≥ (1 − β)7 + βp = (1 − β)7 − β(1 − β)4 + β²·3,
but it is easy to see that this is taken care of by the β ≥ 4/7 restriction.
Now we must check j’s deviation from i’s punishment! In the second and later stages there
is nothing to check (why?). In stage 1, the condition is
5(1 − β) + 15β ≥ (1 − β)7 + βp.
[Notice how j’s punishment is started off if she deviates from i’s punishment!] Compared
with the previous inequality, this one is easier to satisfy.
Finally, we must see that no deviation is profitable from the original cooperative path. This
condition is just
10 ≥ (1 − β)15 + βp,
and reviewing the definition of p we see that no further restrictions on β are called for.
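A sketch that checks all four constraints at once (my addition; the β test points are arbitrary):

```python
# Player-specific punishment P_i: (M_i, H_j) once, then (L_i, M_j) forever.
# p is the punishee's lifetime value at the start of P_i; by symmetry it is the same for P_j.
def p(b):
    return -4 * (1 - b) + 3 * b

def all_conditions(b):
    return {
        "i deviates at stage 1 of P_i": p(b) >= (1 - b) * 0 + b * p(b),
        "i deviates later in P_i":      3 >= (1 - b) * 7 + b * p(b),
        "j deviates at stage 1 of P_i": (1 - b) * 5 + b * 15 >= (1 - b) * 7 + b * p(b),
        "deviation from (L1, L2)":      10 >= (1 - b) * 15 + b * p(b),
    }

for b in [0.55, 0.58, 0.65]:
    print(b, all_conditions(b))
# The two conditions on the punishee both switch on at beta = 4/7 (about 0.571);
# the remaining two are slack well before that.
```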
4. Can we do still better? We can! The following punishment exactly attains the minimax
value for each agent for all β ≥ 8/15 .
To punish player i, simply play the path
(Li , Hj ); (Li , Hj ), . . .
forever. Notice that this pushes player i down to minimax. Moreover, player i cannot
profitably deviate from this punishment.
But player j can! The point is, however, that in that case we will start punishing player j
with the corresponding path
(Lj , Hi ); (Lj , Hi ), . . .
which gives her zero. All we need to do now is to check that a one-shot deviation by j is
unprofitable. Given the description above, this is simply the condition that
7 ≥ (1 − β)15 + β·(punishment payoff) = (1 − β)15.
This condition is satisfied for all β ≥ 8/15.
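One final quick check (my addition; test values arbitrary):

```python
# Stationary player-specific minimax punishment (L_i, H_j) forever. The punishee is already at
# his minimax of 0, so the only temptation is the punisher's: conform for 7 forever, or take 15
# once and then be pushed to her own punishment value of 0.
def punisher_willing(b):
    return 7 >= (1 - b) * 15 + b * 0

for b in [0.50, 0.54, 0.60]:
    print(f"beta = {b:.2f}: punisher willing? {punisher_willing(b)}")
# The switch occurs at beta = 8/15 (about 0.533), the weakest requirement so far.
```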
So you see that in general, we can punish more strongly than Nash reversion, and what is
more, there is a variety of such punishments, all involving either a nonstationary time structure
(“carrot-and-stick”, as in part 2) or a family of player-specific punishments (as in part 4) or
both (as in part 3). This leads to the Pandora’s Box of too many equilibria. The repeated
game, in its quest to explain why players cooperate, also ends up “explaining” why they
might fare even worse than one-shot Nash!
2. The Folk Theorem
It turns out that the above observations can be generalized substantially provided players
are patient enough. Recall the one-shot game G, and look at the set of all payoff vectors
p ∈ IR^n such that p = f (a) for some action profile a. This is the set of all feasible payoffs; call it F . Define F ∗ to be
the convex hull of F . Notice that normalized payoffs in the infinitely repeated game generate
values in F ∗ .
[Explain this by drawing the set of feasible payoffs for the PD or for a coordination game.]
2.1. The Two-Player Folk Theorem. First assume that n = 2. Then the following result
is true.
Theorem 1. [Two Player Folk Theorem.] Consider any payoff vector p ∈ F ∗ such that
p ≫ m (that is, pi > mi for each i), where m is the vector of security levels. Then for any
ε > 0 there exists a threshold discount factor β ∗ such that if all players are more patient
than β ∗ then a payoff vector ε-close to p is sustainable as a subgame perfect equilibrium
payoff outcome.
Proof. The proof involves some subtleties which are best absorbed step by step. So in the
first, formal part of the proof I am going to assume that there is an action profile a with
payoff f (a) exactly equal to p. Then I indicate later how to extend the argument.
For player i, let âi be an action that succeeds in minimaxing his opponent player j. That is,
âi minimizes dj (ai ) over different choices of ai .
Denote by â the vector of these two actions (â1 , â2 ). Also, denote by M the largest possible
payoff in the one-shot game.
The following claim is subtle and crucial.
Claim. There exists β ∗ and a length of time T (an integer) such that for all β ≥ β ∗ ,

pi > (1 − β)M + β p̃i ,        (3)

where

p̃i = (1 − β^T )fi (â) + β^T pi        (4)

with

p̃i > mi        (5)

for i = 1, 2.
To see why such a β ∗ and T must exist, first substitute (4) into the right-hand side of (3) to
get the expression
h(β) ≡ (1 − β)M + β(1 − β^T )fi (â) + β^(T+1) pi ,
and notice that this equals pi when β exactly equals 1. I want the derivative of this expression
with respect to β to be positive evaluated at β = 1. (You’ll see why in a minute.) Take the
derivative first:
h′(β) = −M + [1 − (T + 1)β^T ]fi (â) + (T + 1)β^T pi ,
so that
h′(1) = −M + fi (â) + (T + 1)[pi − fi (â)],
which can be guaranteed to be positive for T chosen large enough (since pi > mi ≥ fi (â)). Because
h(1) = pi and h′(1) > 0, we have h(β) < pi for β just below 1, which is exactly (3). So for β
close enough to 1, (3)–(5) all hold (the last because pi > mi and so p̃i > mi for β sufficiently
close to 1).
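To see that such a pair (T, β ∗ ) really exists, here is a small numerical sketch (my own illustration, not part of the proof) that plugs in the numbers of the Abreu stage game from Section 1: target payoff pi = 10, mutual minimax â = (H1 , H2 ) so fi (â) = −15, security level mi = 0, and largest payoff M = 15. For each candidate T it locates, on a grid, the smallest β at which (3) and (5) both hold.

```python
# Conditions of the Claim, evaluated for the Abreu stage-game numbers (an illustration only).
P_I, F_HAT, M_I, M_MAX = 10.0, -15.0, 0.0, 15.0

def claim_holds(b, T):
    p_tilde = (1 - b**T) * F_HAT + b**T * P_I    # equation (4)
    cond3 = P_I > (1 - b) * M_MAX + b * p_tilde  # equation (3)
    cond5 = p_tilde > M_I                        # equation (5)
    return cond3 and cond5

for T in [1, 2, 5, 10]:
    betas = [k / 1000 for k in range(1, 1000)]
    feasible = [b for b in betas if claim_holds(b, T)]
    print(f"T = {T:2d}: (3)-(5) hold for beta >= {feasible[0]:.3f}" if feasible else f"T = {T:2d}: never")
```

Note that with T = 1 this mutual-minimax punishment is essentially the carrot-and-stick scheme of part 2 of the previous section, and the cutoff found (just above 3/5) matches the one computed there.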
Now consider the following strategy profile. Begin by playing some action profile a such that
f (a) = p. If there is any unilateral deviation, play the mutual minimaxing action profile â for
T periods, and then start up a again. If there is any deviation, simply restart the T -period
“punishment phase” all over again.
To see that this is SGP, first consider the initial phase. If player i deviates, she gets
(1 − β)di (a) + β·(punishment payoff) ≤ (1 − β)M + β p̃i ,
and this inequality follows because di (a) ≤ M (the maximum payoff) and p̃i is precisely i’s
payoff from punishment (examine (4)). But (3) tells us that the right hand side of the above
inequality is itself no larger than pi , and so we must conclude that there is no profitable
one-shot deviation for i during the initial phase.
What about the punishment phase? Well, notice that it suffices to check deviations at the
very first date of the punishment phase (why?). But at that date all that i can get is mi ,
because player j is minimaxing him! Consequently, the required no-deviation constraint is
p̃i ≥ (1 − β)mi + β p̃i ,
where the second term on the right hand side simply follows from the fact that we start the
punishment up again. But this inequality follows right away from (5), and by the one-shot
deviation principle, we are done.
What follows is a precise but somewhat informal description of how the argument is extended
to the case in which there is no action profile that exactly hits p. In this case, there certainly
are a finite number of action profiles (no more than three, actually, for 2 players) such that
p is a convex combination of their payoffs.
Call these action profiles a1 , a2 and a3 , with associated payoffs p1 , p2 and p3 , so that for
some nonnegative weights (λ1 , λ2 , λ3 ) summing to one,
λ1 p1 + λ2 p2 + λ3 p3 = p.
Now we will have to choose β ∗ a little more tightly. Certainly, (3)–(5) will have to be satisfied
as before, but now we need some more properties.
We are going to play (along the “initial phase”) a1 , a2 and a3 in rotation, with relative time
periods roughly proportional to (λ1 , λ2 , λ3 ). If we then take β very close to one, then the
overall payoff generated will be very close to p. Indeed, the overall lifetime payoff for each
player no matter which part of the “rotation” we are in will be very close to p. Now we
will replace the initial phase in the formal part of the proof above by the rotated play of
these three action profiles (as also in the second part of each punishment phase). The same
arguments then go through.
Notice the carrot-and-stick structure of punishments, very standard in repeated games. The
good stuff typically comes later, the bad stuff comes first. It is the promise of the rewards
later that makes players stick to their punishments (at least for symmetric punishments of
the kind considered here).
Also, do notice that we are not considering the strongest punishments possible. But it does
not matter, because the statement of the folk theorem simply cannot be strengthened. You
cannot drive players below their minimax values.
2.2. Remarks on Three or More Players. With three or more players, the folk theorem
runs into some problems. Indeed, the theorem is generally false in this case. Consider the
following
Example. Player 1 chooses rows, player 2 chooses columns, and player 3 chooses matrices:
Matrix 1:
         L          R
U     1, 1, 1    0, 0, 0
D     0, 0, 0    0, 0, 0

Matrix 2:
         L          R
U     0, 0, 0    0, 0, 0
D     0, 0, 0    1, 1, 1
Each player’s minmax value is 0, but notice that there is no action combination that simultaneously minmaxes all three players. E.g., to minmax player 3, players 1 and 2 play (U, R). To minmax
player 2, players 1 and 3 play (U, Matrix 2). To minmax player 1, players 2 and 3 play (L, Matrix 2). Nothing works to simultaneously minmax all
three players.
Let α be the lowest equilibrium payoff for any player. Note that

α ≥ (1 − β)D + βα,

where D is the largest deviation payoff to some player against any first-period action profile
supporting α; the continuation value after such a deviation is itself an equilibrium payoff, hence
at least α. It can be shown (even with the use of observable mixed strategies) that D ≥ 1/4. So
α ≥ 1/4. No folk theorem.
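The claim D ≥ 1/4 can be checked numerically. Writing q1 , q2 , q3 for the probabilities that player 1 plays U , player 2 plays L and player 3 plays Matrix 1, each player’s best stage deviation payoff is the larger of two products of the other two players’ probabilities; the sketch below (my own, on an arbitrary grid) confirms that the largest of the three such payoffs is never below 1/4.

```python
# Best stage payoff each player can secure against the others' (observable) mixed strategies.
def largest_deviation_payoff(q1, q2, q3):
    d1 = max(q2 * q3, (1 - q2) * (1 - q3))   # player 1: U pays off with prob q2*q3, D with (1-q2)*(1-q3)
    d2 = max(q1 * q3, (1 - q1) * (1 - q3))   # player 2
    d3 = max(q1 * q2, (1 - q1) * (1 - q2))   # player 3
    return max(d1, d2, d3)

grid = [k / 50 for k in range(51)]
worst = min(largest_deviation_payoff(q1, q2, q3) for q1 in grid for q2 in grid for q3 in grid)
print(worst)   # 0.25, attained at q1 = q2 = q3 = 1/2
```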
The problem is that we cannot separately minmax each deviator and provide incentives to
the other players to carry out the minmaxing, because all payoffs are common. If there is
enough “wiggle-room” to separately reward the players for going into the various punishment
phases, then we can get around this problem. A sufficient condition for this is that F ∗ has
full dimensionality. We omit the details.