Lecture 3: Prisoner’s Dilemma, Tit-for-tat, and ZD strategies
http://gawker.com/5903692/must-watch-golden-balls-contestant-wins-with-most-ballsy-move-ever
The Prisoner’s Dilemma (PD): Two players each cooperate (C) or defect (D). If A cooperates and B defects, A loses; if A
defects and B cooperates, A wins; if both cooperate, both gain; if both defect, both lose.
Standard PD payoff matrix: in joint payoffs, CC > DC/CD > DD, and CC > [DC + CD]/2 (so an alternating strategy is not best). Since
CC produces the maximum (6 > 5 > 2), PD is not a zero-sum game. Cooperation has a productive value. CC need not be
socially positive: it is positive for a cartel ripping off consumers. The payoffs are often written for the individual as
T(DC) > R(CC) > P(DD) > S(CD), with R = 3, S = 0, T = 5, P = 1.
In a one-shot game, the solution is D. If I play C, you will play D in the single encounter, and vice versa. The
same is true if we know the game will end in period T. Why? We both play D in T. So what about T-1? It’s a one-shot
game again, because we know what T will give us. Working backward, a known number of interactions yields ALLD. But
we observe much less defecting in the world. To explain this we need:
1. Expected future dealings. Rip off the tourist because there is no future exchange, but cooperate with your
spouse since you expect to interact again with no certain endpoint. This is the repeated or iterated PD game: IPD.
There is no best action. Your best strategy depends on what others do and on the discount factor for future payoffs.
With low discount rates, cooperative strategies do better than D-strategies because, if cooperating today leads to
cooperation tomorrow, the 6 produced by joint cooperation exceeds the 5 produced by CD.
But a cooperative strategy must be retaliatory. You play D and gain 5 against C, but if C retaliates by shifting
to D, you get 5+1 in two rounds and 5+1+1 in three rounds, while if you keep playing C with another cooperator
you get 3+3 in two rounds and 3+3+3 in three rounds. Retaliation gives D 5, 1, 1, ... while each cooperator gains 3, 3, 3, ...
D could still score more if the future were heavily discounted. Without retaliation, you keep playing D,
the other player keeps playing C, and you get 5, 5, 5, ...
Let w = 1/(1+r) < 1, where r is the interest or discount rate for the next period. Then w^2 is the value of a payoff two
periods from now, w^3 the value three periods on, etc. The future matters more when w is large. A cooperative strategy
that does not play all C can beat all D. (Play all C and D gets 5 every time.)
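The trade-off between the one-time gain from defecting and the stream of cooperative payoffs can be checked numerically. The sketch below is only an illustration: the function names and the 200-period horizon are assumptions, not part of the lecture. It discounts the streams 5, 1, 1, ... (defecting against a retaliator) and 3, 3, 3, ... (mutual cooperation) by w.

```python
# Illustrative sketch: function names and the 200-period horizon are assumptions.
R, S, T, P = 3, 0, 5, 1  # payoffs from the lecture

def present_value(stream, w):
    """Sum of payoffs discounted by w per period (first period undiscounted)."""
    return sum(x * w**t for t, x in enumerate(stream))

def defect_vs_cooperate(w, periods=200):
    """Present values of defecting against a retaliator vs. mutual cooperation."""
    defect = present_value([T] + [P] * (periods - 1), w)
    cooperate = present_value([R] * periods, w)
    return defect, cooperate

for w in (0.3, 0.5, 0.9):
    d, c = defect_vs_cooperate(w)
    print(f"w={w}: defect={d:.2f}, cooperate={c:.2f}")
```

With an unlimited horizon the two streams are worth 5 + w/(1-w) and 3/(1-w), which are equal at w = 1/2 for these payoffs; above that, cooperation against a retaliator pays more.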
2. Opponents have strategies beyond all D or all C. If the other player is all D, the best reply is D. If the other player is all C, the best reply is
D. But if the other player reacts to D/C, it may be better to be nicer. TFT, tit-for-tat, cooperates until the opponent defects, then
defects until the opponent changes. An eye for an eye, a tooth for a tooth.
Opponent's last move:         C    D    C/D
TFT's reply (opens with C):   C    D    C/D
Consider three periods (the minimum for TFT to do better than All D given the payoff matrix above), where
each player meets TFT half the time and D half the time. For simplicity let w = 1 so the future is worth as much as the present.
TFT meets TFT: rewards = 3(1 + w + w^2) = 9
All D meets TFT: rewards = 5 + w + w^2 = 7
TFT meets All D: rewards = 0 + w + w^2 = 2
All D meets All D: rewards = 1 + w + w^2 = 3
TFT averages (9+2)/2 = 11/2 vs (7+3)/2 = 10/2 for D. Future cooperation gives a higher present value than the gain from defecting.
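The four totals can be reproduced by playing the strategies against each other. This is a minimal sketch; the function names and the history-based interface are assumptions made for illustration.

```python
# Illustrative sketch; function names are assumptions, payoffs are the lecture's.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tft(opponent_history):
    """Tit-for-tat: cooperate first, then copy the opponent's last move."""
    return opponent_history[-1] if opponent_history else "C"

def all_d(opponent_history):
    """Always defect."""
    return "D"

def play(strategy_a, strategy_b, rounds=3, w=1.0):
    """Return the discounted total payoffs of A and B over the given rounds."""
    moves_a, moves_b = [], []
    total_a = total_b = 0.0
    for t in range(rounds):
        a = strategy_a(moves_b)  # A reacts to B's past moves
        b = strategy_b(moves_a)  # B reacts to A's past moves
        total_a += PAYOFF[(a, b)] * w**t
        total_b += PAYOFF[(b, a)] * w**t
        moves_a.append(a)
        moves_b.append(b)
    return total_a, total_b

print(play(tft, tft))      # (9.0, 9.0)
print(play(all_d, tft))    # (7.0, 2.0)
print(play(all_d, all_d))  # (3.0, 3.0)
```

Calling `play(..., rounds=2)` gives 6 and 1 for TFT against 6 and 2 for All D, which is why three periods is the minimum for TFT to come out ahead in an even mix.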
3. The winning strategy varies with the distribution of strategies. In a D world the best is all-D. In a TFT world, the best
is a TFT-type strategy. Consider how payoffs vary with the mix of all-D and TFT in the population in a 3-period model:
%D      TFT                          D                            Result
1/3     20/3 (= 1/3·2 + 2/3·9)       17/3 (= 1/3·3 + 2/3·7)       TFT WINS
1/2     11/2 (= 1/2·2 + 1/2·9)       10/2                         TFT WINS
2/3     13/3                         13/3                         EQUAL SCORES
3/4     15/4                         16/4                         D WINS
So when %D > 2/3, D wins; when %D < 2/3, D loses; at 2/3 there is an unstable mixed equilibrium. Note that TFT
requires a smaller share of the population to win (more than 1/3) than D does (more than 2/3). The reason is 6 > 5.
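The break-even share can be checked with exact fractions. The sketch below uses the lecture's three-period totals; the function name and dictionary layout are assumptions.

```python
from fractions import Fraction

# Three-period totals from the lecture: (own strategy, opponent) -> score.
SCORE = {("TFT", "TFT"): 9, ("TFT", "D"): 2, ("D", "TFT"): 7, ("D", "D"): 3}

def expected_scores(share_d):
    """Expected three-period scores of TFT and All D when a share share_d plays All D."""
    p = Fraction(share_d)
    tft = p * SCORE[("TFT", "D")] + (1 - p) * SCORE[("TFT", "TFT")]
    d = p * SCORE[("D", "D")] + (1 - p) * SCORE[("D", "TFT")]
    return tft, d

for p in (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(3, 4)):
    tft, d = expected_scores(p)
    print(p, tft, d)
```

Setting the two expressions equal, 9 - 7p = 7 - 4p, gives p = 2/3, the share of D at which the scores tie.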
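4. Addition of All Cooperate (turn the other cheek) helps all-D and hurts TFT: too many suckers destroy the world.
Three-period score of the column strategy against each opponent, and its average score in each population mix:
Opponent / population mix        TFT      C        D
TFT                              9        9        7
C                                9        9        15
D                                2        0        3
1/3 of each                      20/3     18/3     25/3*
2/5 TFT, 2/5 C, 1/5 D            38/5     36/5     47/5*
7/10 TFT, 2/10 C, 1/10 D         8.3*     8.1      8.2
* = winner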
D wins because it exploits C. With 2/5 TFT and 2/5 C (and 1/5 D), D wins. With 7/10 TFT and 2/10 C (and 1/10 D), TFT wins.
Thus, there is no best choice in PD. Success depends on the ecology of strategies. For any payoff matrix, there is a
distribution of All D, All C, and TFT in which D wins and another in which TFT wins, even though TFT never wins one on one. Nice
strategies gain from interactions with nice strategies. TFT gains more than D through interactions with TFT.
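The population averages with All C added can be recomputed for any mix. A minimal sketch, assuming the lecture's three-period totals; the names `population_scores` and `winner` are illustrative.

```python
from fractions import Fraction

# Three-period totals, SCORE[own][opponent], from the lecture's payoffs.
SCORE = {
    "TFT": {"TFT": 9, "C": 9, "D": 2},
    "C":   {"TFT": 9, "C": 9, "D": 0},
    "D":   {"TFT": 7, "C": 15, "D": 3},
}

def population_scores(shares):
    """Average score of each strategy when opponents are drawn according to shares."""
    return {
        own: sum(Fraction(shares[opp]) * SCORE[own][opp] for opp in shares)
        for own in SCORE
    }

def winner(shares):
    """Strategy with the highest average score in this population mix."""
    scores = population_scores(shares)
    return max(scores, key=scores.get)

third = Fraction(1, 3)
print(winner({"TFT": third, "C": third, "D": third}))
print(winner({"TFT": Fraction(7, 10), "C": Fraction(2, 10), "D": Fraction(1, 10)}))
```

Trying other mixes shows how quickly the result flips: each extra All C player adds 15 to D's potential score but only 9 to TFT's.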
2. Axelrod's 1979 Computer Tournament asked experts to submit computer code giving responses to any action by
another player. Fifteen entrants. Someone puts in D. Someone puts in C. Several programs try to infer and exploit the opponent's
strategy. Anatol Rapoport enters TFT and wins. Axelrod announces the results and holds a second contest. Analysis of
round 1 showed that a more forgiving strategy could have beaten TFT: Tit for two tats (TFTT), which retaliates against
DD but not a single D. There were 63 entrants in the 2nd tournament and TFT (Rapoport) won again. Simulation found that if higher-scoring
strategies increase their share of the population over time, TFT and other nice rules do well over time.
Why? Because TFT and other nice strategies never defect first and retaliate quickly against D, which limits how much D can gain.
Can a TFT world survive/outscore an invasion of D? It depends on %D (p). Using the three-period totals, TFT scores 9(1-p)
+ 2p, while D gets 7(1-p) + 3p, so TFT beats D when 2(1-p) - p > 0 ---> p < 2/3. An invasion of fewer than 2/3 Ds would fail.
Can a world of Ds survive an invasion of TFTs? Yes: 1/3 or more TFTs are needed with the given matrix. But TFT cannot
stop an invasion of Cs because TFT and C score the same against each other. And with Cs in the world, the door is open for Ds to invade.
The Axelrod and similar tournaments are round-robins: each strategy plays every other one and grows in proportion
to its total score. If instead the contest operated like a tennis tournament, with the winner of each match going on to the next
round, the results would be completely different. TFT meets all D → D proceeds, and the result is a world of D.
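The rule that a strategy grows in proportion to its total score can be sketched as a simple generation-by-generation update. This is an illustration under stated assumptions, not Axelrod's actual procedure: the update rule, the function names, and the 1000-generation horizon are all choices made here, and the scores are the lecture's three-period totals.

```python
# Illustrative sketch: proportional-growth update with assumed names and horizon.
SCORE = {
    "TFT": {"TFT": 9, "C": 9, "D": 2},
    "C":   {"TFT": 9, "C": 9, "D": 0},
    "D":   {"TFT": 7, "C": 15, "D": 3},
}

def step(shares):
    """One generation: each share grows in proportion to its average score."""
    fitness = {
        own: sum(shares[opp] * SCORE[own][opp] for opp in shares)
        for own in shares
    }
    mean = sum(shares[s] * fitness[s] for s in shares)
    return {s: shares[s] * fitness[s] / mean for s in shares}

def evolve(shares, generations=1000):
    """Apply the proportional-growth update repeatedly."""
    for _ in range(generations):
        shares = step(shares)
    return shares

print(evolve({"TFT": 1 / 3, "C": 1 / 3, "D": 1 / 3}))
print(evolve({"TFT": 0.7, "C": 0.2, "D": 0.1}))
```

Under these assumptions the equal-thirds world ends up almost entirely D, while the world that starts with 7/10 TFT drives D out, which matches the point that outcomes depend on the starting ecology.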
Spatial interactions and n-hoods
If TFTs interact more with each other in a local n-hood than with the entire population, TFT is
more likely to survive. Say 10 TFTs enter a world of 100 All-Ds but have 1/2 of their interactions with TFTs. For the TFTs this is
equivalent to a world with 50% TFTs. The Ds still interact largely with Ds, so TFT could win.
CA (cellular automata) models show how n-hood interactions affect outcomes in spatial PD games. Assume that players
interact with others in their n-hood and change strategy depending on what they can win in the n-hood. Surrounded by
Ds you turn D. Surrounded by TFTs you play TFT. Conflicts occur on the borders. Compare three cells, each with four neighbors:
one with 3 TFT and 1 D, one with 2 TFT and 2 D, and one with 1 TFT and 3 D.
       TFT                  TFT                 TFT
TFT * ?? * TFT          D * ?? * D          D * ?? * D
        D                   TFT                  D
The rule for ?? is to compute the profits from D and TFT and pick the more profitable. Consider the rewards using payoffs
for three-period interactions: TFT-TFT 9, TFT-D 2, D-D 3, D-TFT 7. The above n-hoods produce:
N-hood       1 D, 3 TFT     2 D, 2 TFT     3 D, 1 TFT
TFT          29             22             15
D            24             20             16
Decision     TFT            TFT            D
Surrounded by 2 or 3 TFTs, choose TFT. Surrounded by 3 or more Ds, choose D.
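The decision rule for the ?? cell is a sum over neighbors followed by a comparison. A minimal sketch with the three-period payoffs; the function names are assumptions.

```python
# Three-period totals: (own strategy, neighbor's strategy) -> own score.
SCORE = {("TFT", "TFT"): 9, ("TFT", "D"): 2, ("D", "D"): 3, ("D", "TFT"): 7}

def neighborhood_profit(own, neighbors):
    """Total score of playing `own` against every neighbor."""
    return sum(SCORE[(own, n)] for n in neighbors)

def best_response(neighbors):
    """Pick whichever of TFT or D earns more against these neighbors."""
    return max(("TFT", "D"), key=lambda s: neighborhood_profit(s, neighbors))

for hood in (["TFT"] * 3 + ["D"], ["TFT"] * 2 + ["D"] * 2, ["TFT"] + ["D"] * 3):
    print(hood,
          neighborhood_profit("TFT", hood),
          neighborhood_profit("D", hood),
          best_response(hood))
```

Applying `best_response` to every cell of a grid at once, generation after generation, is the kind of update a cellular automaton model of the spatial game would use.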
Go to http://www.taumoda.com/web/PD/lab.html and experiment with the PD games.
4. Better than TFT: Nicer and Conditional
TFT has problems with errors in communication. If TFT meets TFT and one errs, the pair falls into an alternating cycle, with
lower rewards than C. TFTT is more forgiving. The sequences below show play after an erroneous D:
Error D:   TFT   CCC D CDCD ...     TFTT   CCC D CC ...      CCC DD CC
           TFT   CCC C DCDC ...     TFTT   CCC C CC ...      CCC CC DD
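The effect of a single mistaken D can be traced by forcing one player's move in one round. This is an illustrative sketch; the function names and the choice of round 4 for the error are assumptions.

```python
def tft(opponent_moves):
    """Tit-for-tat: cooperate first, then copy the opponent's last move."""
    return opponent_moves[-1] if opponent_moves else "C"

def tftt(opponent_moves):
    """Tit-for-two-tats: defect only after two consecutive defections."""
    return "D" if opponent_moves[-2:] == ["D", "D"] else "C"

def play_with_error(strategy, rounds=8, error_round=4):
    """Both players use `strategy`; player A's move comes out as D once."""
    a_moves, b_moves = [], []
    for t in range(1, rounds + 1):
        a = strategy(b_moves)
        b = strategy(a_moves)
        if t == error_round:
            a = "D"  # A's intended move is replaced by a mistaken D
        a_moves.append(a)
        b_moves.append(b)
    return "".join(a_moves), "".join(b_moves)

print(play_with_error(tft))   # ('CCCDCDCD', 'CCCCDCDC')
print(play_with_error(tftt))  # ('CCCDCCCC', 'CCCCCCCC')
```

TFT locks into alternating D and C after the slip, while TFTT absorbs the single D and both players return to C.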
To generalize strategies in terms of conditional probabilities, let p be the probability that you cooperate if X cooperated
and q be the probability that you cooperate if X defected. Then All C is (1,1), All D is (0,0), and TFT is (1,0) (Sigmund, Games of Life, 1994).
Nowak and Sigmund simulate a world of (p,q) strategies with random ps and qs and NO neighborhoods. PAVLOV
responds to the previous round by switching if it loses: if its D leads to a D, it tries C, while if its C meets a D, it tries D.
WIN-STAY, LOSE-SHIFT. Pavlov would fail in an Axelrod tournament until TFT has destroyed most Ds.
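Win-stay, lose-shift reduces to a one-line rule: cooperate when both players made the same move last round, defect otherwise. The sketch below is an illustration (the function names and the error round are assumptions) of how two Pavlov players behave after one mistaken D.

```python
def pavlov(my_last, opp_last):
    """Win-stay, lose-shift: cooperate after CC or DD, defect after CD or DC."""
    if my_last is None:  # first round: open with C
        return "C"
    return "C" if my_last == opp_last else "D"

def pavlov_pair_with_error(rounds=7, error_round=4):
    """Two Pavlov players; player A's move comes out as D once."""
    a_last = b_last = None
    a_hist, b_hist = "", ""
    for t in range(1, rounds + 1):
        a = pavlov(a_last, b_last)
        b = pavlov(b_last, a_last)
        if t == error_round:
            a = "D"  # mistaken defection by A
        a_hist += a
        b_hist += b
        a_last, b_last = a, b
    return a_hist, b_hist

print(pavlov_pair_with_error())  # ('CCCDDCC', 'CCCCDCC')
```

After the error the pair passes through one round of DD, both "lose", both shift, and cooperation resumes.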
5. Psychology Experiments -- Framing matters
Study 1: More cooperation in the “Community Game” PD than in the “Wall Street Game” in an Israeli Air Force
experiment. Results are compared with instructors' guesses, based on behavior during training, of who will cooperate.
(Liberman, V., S. M. Samuels, and L. Ross. 2004. Personality and Social Psychology Bulletin 30:1175-85.)
Study 2: Interpretive labels for the game, the choices, and the outcomes led to different outcomes. (Zhong,
Loewenstein, Murnighan, Journal of Conflict Resolution, Vol. 51, No. 3, 431-456 (2007).)
This experiment provided clear and strongly worded labels for the PD game, for the players’ choices,
and for their outcomes: blunt labels (trust and cutthroat) described the game in starkly different ways; the
choice labels reflected researchers’ common interpretative descriptions of participants’ choices, for example,
cooperate, defect, rational, and choose for the group; and labels for the participants’ outcomes also reflected
common research usage, for example, winner, sucker, saint, punishment, or group maximum.
6. One Strategy to Rule Them All: the ZD Condition.
“It would be surprising if any significant mathematical feature of IPD has remained undescribed, but that
appears to be the case” (Freeman Dyson and William Press, 2012). The new feature is Zero Determinant (ZD)
strategies that give a player “control” over outcomes regardless of what an opposing non-ZD strategy does.
ZD plays C with a conditional probability between 0 and 1 that depends on the last period's play: a memory-one (mem 1) strategy.
Dyson and Press show that against a mem 1 player, an opponent who considers earlier encounters does no better
than an opponent with a mem 1 strategy, so they need only consider strategies that remember the previous
round, which is critical for making the problem manageable.
Is it surprising? Not to Drew Fudenberg, who noted that the only reason to condition play on the past is if
your opponent does so, which means that an opponent of a mem 1 player gains nothing from going back
further in trying to “guess” that player's next move. There is no link across plays in a pure repeated game. If
your opponent uses no memory, there is no need for you to do so.
Here is ZD compared to four major strategies that give 0/1 responses to the previous round.
Conditional probability of playing C after each outcome of the previous round (own move first; own payoff in parentheses):
                     CC (R)    CD (S)    DC (T)    DD (P)
ZD Strategy          Pcc       Pcd       Pdc       Pdd
All Coop             1         1         1         1
All Defect           0         0         0         0
TFT                  1         0         1         0
“Pavlov – WSLC”      1         0         0         1
To do well ZD will likely set Pcc high, Pcd low, Pdc high, and Pdd low but not 0. Why? Note that All C, All
D, TFT, and Pavlov are mem 1 strategies of the same form with extreme/simple conditional probabilities.
Let Qcc, Qcd, Qdc, Qdd represent the conditional probabilities of the 2nd player, indexed from that player's own point of view. Then the four probabilities
Pij and the four Qij characterize each player's strategy. Putting them together gives a probability distribution for
the outcome of each round, conditional on the outcome of the previous round: a 4 by 4 Markov chain
transition matrix M from the four outcomes in this period (columns) to the four outcomes in the next (rows).
Next \ This period    CC                   CD                   DC                   DD
CC                    Pcc Qcc              Pcd Qdc              Pdc Qcd              Pdd Qdd
CD                    Pcc (1-Qcc)          Pcd (1-Qdc)          Pdc (1-Qcd)          Pdd (1-Qdd)
DC                    (1-Pcc) Qcc          (1-Pcd) Qdc          (1-Pdc) Qcd          (1-Pdd) Qdd
DD                    (1-Pcc) (1-Qcc)      (1-Pcd) (1-Qdc)      (1-Pdc) (1-Qcd)      (1-Pdd) (1-Qdd)
Press and Dyson find the 4-element vector v = (vcc, vcd, vdc, vdd), the distribution of outcomes among CC, CD, DC, DD
that is the stationary distribution (equilibrium) of the encounter between P and Q, from v = M v. Multiplying the
elements of the distribution by the rewards each player gets from the outcomes gives their payoffs:
Player P (row player) gets S(p,q) = vcc R + vcd S + vdc T + vdd P
Player Q (column player) gets S(q,p) = vcc R + vcd T + vdc S + vdd P
where the distribution parameters vij measure the frequency of the outcomes. The equations show that both
get the same amount from CC and DD and differ in their rewards in the CD and DC cases: S(p,q) - S(q,p) = (vdc - vcd)(T - S). Since T > S,
player P wins when vdc > vcd and player Q wins in the opposite case. For player P, more DC than CD is good.
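The stationary distribution and the two payoffs can be approximated by applying the transition probabilities repeatedly. The sketch below is an illustration only: the function names, the number of iterations, and the uniform starting distribution are assumptions. When both strategies are deterministic the chain can have several stationary distributions, so the result may then depend on the starting point.

```python
# Illustrative sketch; names, iteration count, and uniform start are assumptions.
R, S, T, P = 3, 0, 5, 1  # lecture payoffs

def stationary(p, q, steps=10000):
    """Approximate the long-run distribution over (CC, CD, DC, DD).

    p and q give each player's probability of cooperating after CC, CD, DC, DD,
    each indexed from that player's own point of view (own move first).
    States are written from player P's point of view.
    """
    # When P sees CD, Q sees DC (and vice versa), so swap Q's middle entries.
    q_seen = (q[0], q[2], q[1], q[3])
    v = [0.25, 0.25, 0.25, 0.25]
    for _ in range(steps):
        nxt = [0.0, 0.0, 0.0, 0.0]
        for s in range(4):
            pc, qc = p[s], q_seen[s]
            nxt[0] += v[s] * pc * qc              # next outcome CC
            nxt[1] += v[s] * pc * (1 - qc)        # next outcome CD
            nxt[2] += v[s] * (1 - pc) * qc        # next outcome DC
            nxt[3] += v[s] * (1 - pc) * (1 - qc)  # next outcome DD
        v = nxt
    return v

def payoffs(p, q):
    """Long-run average payoffs S(p,q) and S(q,p)."""
    v = stationary(p, q)
    s_p = v[0] * R + v[1] * S + v[2] * T + v[3] * P
    s_q = v[0] * R + v[1] * T + v[2] * S + v[3] * P
    return s_p, s_q

ALL_C, ALL_D, TFT = (1, 1, 1, 1), (0, 0, 0, 0), (1, 0, 1, 0)
print(payoffs(ALL_C, ALL_D))
print(payoffs(TFT, ALL_D))
```

All C against All D sits permanently in CD, so the row player earns S and the column player earns T; TFT against All D settles into DD, where both earn P.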
Since the stationary distribution parameters v depend on the conditional probabilities the players choose, the
ZD solution links the conditional probabilities of the ZD player to the R, S, T, P rewards. This link yields a
linear equation that connects the rewards of the first player, S(p,q), to the rewards of the other player, S(q,p):
αS(p,q) + βS(q,p) + γ = 0
The equation allows a player to force a given linear relation between the two payoffs independently of
whatever strategy the other might choose.
Press and Dyson obtain the solution by setting the determinant of a matrix that depends on the game
matrix and a vector of rewards equal to zero, hence the name ZD. Taking this equation as given, if the
first player knows the ZD strategy and the second does not, the first can use the linear relation to set the
average score of the opponent regardless of the opponent's strategy. The following is the solution for choosing
conditional probabilities that set the opponent's score to a fixed level (outcomes list the ZD player's own move first).
If CC, cooperate with probability 1-e
If DC, cooperate with probability (e+3f)/2
If CD, cooperate with probability 1-2e-f
If DD, cooperate with probability f
http://s3.boskent.com/prisoners-dilemma/fixed.html solves the Press and Dyson equations for a target of 2
and finds that e = 1/3 and f = 1/3, so that
CC → cooperate with probability 2/3.
DC → cooperate with probability 2/3.
CD → cooperate with probability 0.
DD → cooperate with probability 1/3.
Whatever the non-ZD player does, its long-term outcome is 2. Similarly, one can find conditional probabilities
that force the opponent to have any score between 1 and 3.
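The claim that the opponent's score is pinned at 2 can be checked numerically against several opponents. This is a sketch under stated assumptions: the function names, the power-iteration approach, and the sample opponent strategies are illustrative, and outcomes are ordered (CC, CD, DC, DD) with each player's own move first. Working through the linear relation with R = 3, S = 0, T = 5, P = 1 suggests the forced score is (e+3f)/(e+f), which equals 2 when e = f.

```python
# Illustrative sketch; names, iteration count, and sample opponents are assumptions.
R, S, T, P = 3, 0, 5, 1  # lecture payoffs

def set_score_strategy(e, f):
    """Lecture's formulas for cooperating after (CC, CD, DC, DD), own move first."""
    probs = (1 - e, 1 - 2 * e - f, (e + 3 * f) / 2, f)
    # Clamp tiny floating-point overshoots into [0, 1].
    return tuple(min(1.0, max(0.0, x)) for x in probs)

def long_run_scores(p, q, steps=10000):
    """Approximate long-run average payoffs of players P and Q."""
    q_seen = (q[0], q[2], q[1], q[3])  # Q sees CD where P sees DC
    v = [0.25] * 4
    for _ in range(steps):
        nxt = [0.0] * 4
        for s in range(4):
            for i, (a, b) in enumerate(((1, 1), (1, 0), (0, 1), (0, 0))):
                pa = p[s] if a else 1 - p[s]
                qb = q_seen[s] if b else 1 - q_seen[s]
                nxt[i] += v[s] * pa * qb
        v = nxt
    s_p = v[0] * R + v[1] * S + v[2] * T + v[3] * P
    s_q = v[0] * R + v[1] * T + v[2] * S + v[3] * P
    return s_p, s_q

zd = set_score_strategy(1 / 3, 1 / 3)
for opponent in ((1, 1, 1, 1), (0, 0, 0, 0), (0.9, 0.5, 0.3, 0.1)):
    print(opponent, round(long_run_scores(zd, opponent)[1], 6))
```

The ZD player's own score is not fixed by this strategy; only the opponent's is, so the first element of the returned pair varies with the opponent.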
ZD can also “extort” gains by defecting enough times to win in any one-on-one contest. There is a set of
conditional ZD probabilities such that ZD gains χ > 1 times as much above the mutual defect
outcome (1 in the standard model) as the opponent gains above the mutual defect
outcome. The general equation for this is: (S(p,q) - P) = χ (S(q,p) - P), where χ > 1 is the rate of extortion.
The extortion conditional probabilities (ZD player's own move listed first, with e > 0 a free scale parameter) are:
If CC, cooperate with probability 1 - e (χ-1)(R-P)/(P-S)
If CD, cooperate with probability 1 - e (1 + χ(T-P)/(P-S))
If DC, cooperate with probability e (χ + (T-P)/(P-S))
If DD, cooperate with probability 0
EXTORT-2 says give the ZD Row Player twice the gain above P that the Column Player gets, i.e. χ = 2 in the
equations. The Column Player tries to maximize its own payoff, which occurs when it plays all C. So the
extortion is that ZD says: maximize your own interest, and with my chosen conditional probabilities you will play
so that I get twice the gain above P that you get. This strategy is an ULTIMATUM game strategy … We divide $1000
and I have the chip that says I get $1000 if you agree. I offer you 1 cent; your maximizing choice is … 1 ct.
Using R = 3, S = 0, T = 5, P = 1 and χ = 2, the conditional probabilities for getting twice as much above P as you are:
CC → 1 - 2e, CD → 1 - 9e, DC → 6e, DD → 0.
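The extortion relation can be checked the same way. In this sketch the value e = 1/18 is an assumption chosen only so that every probability lies between 0 and 1; the function names and sample opponents are also illustrative. Outcomes follow Press and Dyson's ordering (CC, CD, DC, DD), with each player's own move listed first.

```python
# Illustrative sketch; e = 1/18, the names, and the sample opponents are assumptions.
R, S, T, P = 3, 0, 5, 1  # lecture payoffs

def extortion_strategy(chi, e):
    """Cooperation probabilities after (CC, CD, DC, DD) for extortion factor chi."""
    return (
        1 - e * (chi - 1) * (R - P) / (P - S),
        1 - e * (1 + chi * (T - P) / (P - S)),
        e * (chi + (T - P) / (P - S)),
        0.0,
    )

def long_run_scores(p, q, steps=10000):
    """Approximate long-run average payoffs of players P and Q."""
    q_seen = (q[0], q[2], q[1], q[3])  # Q sees CD where P sees DC
    v = [0.25] * 4
    for _ in range(steps):
        nxt = [0.0] * 4
        for s in range(4):
            for i, (a, b) in enumerate(((1, 1), (1, 0), (0, 1), (0, 0))):
                pa = p[s] if a else 1 - p[s]
                qb = q_seen[s] if b else 1 - q_seen[s]
                nxt[i] += v[s] * pa * qb
        v = nxt
    s_p = v[0] * R + v[1] * S + v[2] * T + v[3] * P
    s_q = v[0] * R + v[1] * T + v[2] * S + v[3] * P
    return s_p, s_q

extort2 = extortion_strategy(chi=2, e=1 / 18)
for opponent in ((1, 1, 1, 1), (0, 0, 0, 0), (0.9, 0.5, 0.3, 0.1)):
    s_zd, s_opp = long_run_scores(extort2, opponent)
    print(opponent, round(s_zd - P, 6), round(s_opp - P, 6))
```

Against All C the extortioner's surplus over P is twice the opponent's; against All D both surpluses are zero, which is the "To hell with you" outcome.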
Revolution in Game Theory?
Dong, Zhi-Hai, and Tao (Chin. Phys. B, 2014): “ZD ... fundamentally changes the research paradigm
of game theory. In the framework of ZD … are dozens of ingenious ideas and untraditional approaches for
analyzing not only prisoner’s dilemma but also bi-matrix games, which dramatically expand our
understanding of the stochastic process, the mutual benefit, the cooperation incentive, and even the optimal
control in the repeated games.”
William Press: “When both players have a theory of mind (that is, are not just evolving to maximize
their own score) are all games in some deep way, actually Ultimatum Games?”
Freeman Dyson: “Cooperation loses and defection wins ... My view of the evolution of cooperation
is colored by my memories of childhood … Christmas and Guy Fawkes (days). Christmas was the
festival of love and forgiveness. Guy Fawkes was the festival of hate and punishment... (for) the guy who
tried to blow up the King and the Parliament in 1605 and was gruesomely punished by torture and
burning. For the children, Christmas was boring and Guy Fawkes was fun. We were born with an innate
reward system that finds joy in punishing cheaters. The system evolved to give cooperative tribes an
advantage over noncooperative tribes, using punishment to give cooperation an evolutionary advantage
within the tribe. This double selection of tribes and individuals goes way beyond the Prisoners'
Dilemma model.”
xxxx: “To me this provides the demonstrable benefit of unions and governments … companies in
capitalist societies make use of ZD strategies to exploit their shorter term acting employees
(“employees who do not know ZD strategies”). It is therefore in the interests of workers to create their
ZD strategy organization (aka a union).”
How does the ZD analysis change the PD model?
ZD assumes a fundamental difference between the players: the player who uses ZD knows ZD while the
unwitting player does not and just adjusts to gain the best it can. By giving information to one player, it turns PD
into an ultimatum game with ZD able to make the first offer. The column player either does what is in its
best interest … or says “To hell with you, I will play D in every turn” … and the result is that no one gets
above the P outcome.
The general result in ultimatum games is that people often offer to share rewards equally in the belief that
both will stick to that outcome, but if the player who goes first divides the rewards 60-40 or even 70-30, the
second player follows maximizing behavior and accepts.
Thus, ZD strategies provide the first player with strong unilateral control in games, but if both
players know ZD and neither blindly maximizes their returns, the ZD strategy collapses.
What happens to ZD in evolutionary games with many strategies?
In lecture 4, we will see that ZD does poorly in an Axelrod tournament, in part because when two ZD
extortionary strategies meet, they end up at DD, the lowest value, unless they decide to move their conditional
probabilities of cooperation higher to be generous and play like TFT.
Conclusion
1. In one-on-one contests where the winner lives and the loser dies, D triumphs. ALL D wins or draws all the
time. It beats TFT in round one and ties afterwards. TFT never wins one on one.
2. Outcomes in a tournament with many strategies depend on the ecology/population of strategies and how
they evolve over time --> analysis of evolutionarily stable strategies (ESS) next class, where strategies evolve
depending on their total points relative to others.
3. Both cooperative and defect strategies score most when playing against C strategies, but D scores
more. Cooperative strategies outscore D by playing against TFT-type strategies. Even though a TFT-type strategy may
lose every one on one, it will “win” by scoring relatively more points than a defect-type strategy.
4. ZD strategies allow a player to determine the outcomes of another player who seeks to maximize their
rewards, and thus to win one-on-one against that player, but they falter when both players know ZD and are willing to
sacrifice rewards to change the game (and in an evolutionary context, as we see next class).