Lecture 3: Prisoner’s Dilemma, Tit-for-tat, and ZD strategies
The Prisoner’s Dilemma (PD): Two players each cooperate (C) or defect (D). If A cooperates and B defects, A loses; if A
defects and B cooperates, A wins; if both cooperate, both gain; if both defect, both lose.
Standard PD payoff matrix (payoff to A, payoff to B):
                 B cooperates    B defects
A cooperates     3, 3            0, 5
A defects        5, 0            1, 1
Often written as T (DC) > R (CC) > P (DD) > S (CD), with T = 5, R = 3, P = 1, S = 0.
In total payoffs to the pair, CC > DC/CD > DD (6 > 5 > 2), and CC > [DC + CD]/2, i.e., R > (T + S)/2 (3 > 2.5), so an alternating strategy is not best. Since the total depends on the outcome and CC produces the
max, PD is not a zero-sum game. Cooperation has a productive value. CC can still be socially negative, e.g., a cartel ripping off consumers.
In a one-shot game, the solution is D. If I do C, you will do D and win, and vice versa. So we both play D. The same
holds if we know when the game ends: the last period is a one-shot game, so both play D. But then the next-to-last period is a one-shot game again. And so on. Thus
a known number of interactions yields ALLD.
But we see much less defecting in the world. To explain this we need:
1. Expected future dealings – expect to interact again with no certain endpoint. This is repeated or iterated PD
game: IPD, for which there is no best action. Returns depend on what others do and discount of future payoffs.
Rip off the tourist but cooperate with your spouse.
2. Low discount rate so future dealings matter. C today to encourage C tomorrow. Total value CC 6 > CD 5.
Discount factor w = 1/(1+r) < 1, where r = interest rate. w^2 is the value of a payoff two periods from now, w^3 is the value 3
periods on, etc. The future matters more when w is large, so a strategy that pays off in the future can beat All D.
3. Conditional retaliatory strategies. If I play D against your C and you do not change to D, I win. If you shift to
D against me, I get 5+1 = 6 in two rounds and 5+1+1 = 7 in three rounds, while if you play C with another C player you get
6 in two rounds and 9 in three rounds, and 9 > 7. Retaliation drops D to 1 in the next rounds. Spiteful monkeys.
4. World of strategies beyond All D or All C. In an all-D world, best is D. In an all-C world, best is D. But with other
strategies it may be better to be nicer. The key other strategy is TFT, tit-for-tat: cooperate until the opponent defects, then
defect until the opponent changes. An eye for an eye, a tooth for a tooth.
Period             1     2     3     4
Opponent plays     C     D     C/D
TFT plays          C     C     D     C/D
Consider three periods (the minimum for TFT to work better than All D given the payoff matrix above), where each player meets TFT
and D half the time each. For simplicity let w = 1 so the future is worth as much as the present.
TFT meets TFT: rewards = 3(1 + w + w^2) = 9
All D meets TFT: rewards = 5 + w + w^2 = 7
TFT meets All D: rewards = 0 + w + w^2 = 2
All D meets All D: rewards = 1 + w + w^2 = 3
TFT gets 11 (= 9 + 2) from playing D and TFT; D gets 10 (= 7 + 3). TFT cooperation > defect.
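The four three-period totals above can be reproduced with a minimal Python sketch. The function names (`tft`, `all_d`, `play`) are illustrative and not from the lecture; payoffs follow the matrix above (T = 5, R = 3, P = 1, S = 0) and `w` is the discount factor.

```python
# Payoff to the first player of the pair: T=5 (DC), R=3 (CC), P=1 (DD), S=0 (CD).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tft(opponent_history):
    """Tit-for-tat: cooperate first, then copy the opponent's last move."""
    return "C" if not opponent_history else opponent_history[-1]


def all_d(opponent_history):
    """Always defect."""
    return "D"


def play(strategy_a, strategy_b, rounds=3, w=1.0):
    """Return the discounted total payoffs (a, b) over a fixed number of rounds."""
    hist_a, hist_b = [], []
    total_a = total_b = 0.0
    for t in range(rounds):
        move_a = strategy_a(hist_b)
        move_b = strategy_b(hist_a)
        total_a += (w ** t) * PAYOFF[(move_a, move_b)]
        total_b += (w ** t) * PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a, total_b


if __name__ == "__main__":
    for name_a, a in [("TFT", tft), ("All D", all_d)]:
        for name_b, b in [("TFT", tft), ("All D", all_d)]:
            print(name_a, "meets", name_b, "->", play(a, b)[0])
```

With `rounds=2` the same code gives TFT 6 + 1 = 7 and All D 6 + 2 = 8, which is why three periods is the minimum for TFT to come out ahead with this matrix.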
5. Winning strategy varies with the distribution of strategies in the world. In an all-D world, best is all-D. In a TFT
world, best is a TFT-type strategy. Consider how payoffs vary with the all-D and TFT population in a 3-period model:
%D      TFT score                  D score                    Result
1/3     20/3 (1/3·2 + 2/3·9)       17/3 (1/3·3 + 2/3·7)       TFT WINS
1/2     11/2 (1/2·2 + 1/2·9)       10/2                       TFT WINS
2/3     13/3                       13/3                       EQUAL SCORES
3/4     15/4                       16/4                       D WINS
So when %D > 2/3, D wins; when %D < 2/3, D loses; at 2/3 we get an unstable mixed equilibrium. Note TFT
requires a smaller proportion of itself to win (more than 1/3) than D does (more than 2/3). Reason is 6 > 5.
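The table and the 2/3 threshold can be checked with exact fractions. This is a small sketch under the three-period payoffs used above (9, 2, 7, 3); the names `expected_scores` and `break_even_share_d` are made up for illustration.

```python
from fractions import Fraction

# Three-period totals from the text (w = 1): (own strategy, opponent) -> own score.
SCORE = {("TFT", "TFT"): 9, ("TFT", "D"): 2, ("D", "TFT"): 7, ("D", "D"): 3}


def expected_scores(share_d):
    """Expected score of TFT and of D when a fraction share_d of the world plays D."""
    p = Fraction(share_d)
    tft = p * SCORE[("TFT", "D")] + (1 - p) * SCORE[("TFT", "TFT")]
    d = p * SCORE[("D", "D")] + (1 - p) * SCORE[("D", "TFT")]
    return tft, d


def break_even_share_d():
    """Share of D at which TFT and D score the same.

    TFT's edge against a TFT opponent is 9 - 7 = 2; D's edge against a D
    opponent is 3 - 2 = 1. Scores are equal when (1 - p) * 2 = p * 1.
    """
    edge_vs_tft = SCORE[("TFT", "TFT")] - SCORE[("D", "TFT")]
    edge_vs_d = SCORE[("D", "D")] - SCORE[("TFT", "D")]
    return Fraction(edge_vs_tft, edge_vs_tft + edge_vs_d)


if __name__ == "__main__":
    for share in ("1/3", "1/2", "2/3", "3/4"):
        print(share, expected_scores(share))
    print("break-even share of D:", break_even_share_d())
```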
6. Addition of all Cooperate (turn other cheek) helps all-D and hurts TFT: Too many suckers destroys world
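Three-period score of each strategy (columns) against each opponent (rows), then average scores for three population mixes (* marks the winner):
Opponent / mix                 TFT      C        D
TFT                            9        9        7
C                              9        9        15
D                              2        0        3
1/3 of each                    20/3     18/3     25/3*
2/5 TFT, 2/5 C, 1/5 D          38/5     36/5     47/5*
7/10 TFT, 2/10 C, 1/10 D       8.3*     8.1      8.2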
D wins because it exploits C. With 2/5 TFT and 2/5 C (and 1/5 D), D wins. With 7/10 TFT and 2/10 C, TFT wins.
Thus, NO BEST CHOICE IN iterated PD. SUCCESS DEPENDS ON ECOLOGY OF STRATEGIES. For
any payoff matrix, there is a distribution of All D, All C, and TFT so that D wins and that TFT wins. C never wins.
One on one, TFT never wins. When TFT meets D, D scores more.
Nice strategies gain from interactions with nice strategies. TFT beats D through its interaction with TFT.
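The dependence on the ecology of strategies can be recomputed for any mix of TFT, C, and D. The sketch below assumes the three-period scores from the table above; `ecology_scores` and `winner` are illustrative names.

```python
from fractions import Fraction

# SCORE[me][opponent] = my three-period total against that opponent (w = 1).
SCORE = {
    "TFT": {"TFT": 9, "C": 9, "D": 2},
    "C": {"TFT": 9, "C": 9, "D": 0},
    "D": {"TFT": 7, "C": 15, "D": 3},
}


def ecology_scores(shares):
    """Average score of each strategy given population shares that sum to 1."""
    return {
        me: sum(Fraction(shares[opp]) * SCORE[me][opp] for opp in shares)
        for me in SCORE
    }


def winner(shares):
    """Name of the strategy with the highest average score in this population."""
    scores = ecology_scores(shares)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    mixes = [
        {"TFT": "1/3", "C": "1/3", "D": "1/3"},
        {"TFT": "2/5", "C": "2/5", "D": "1/5"},
        {"TFT": "7/10", "C": "2/10", "D": "1/10"},
    ]
    for mix in mixes:
        print(mix, "->", winner(mix), ecology_scores(mix))
```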
PD game on TV http://gawker.com/5903692/must-watch-golden-balls-contestant-wins-with-most-ballsy-move-ever
Axelrod 1979 Computer Tournament
R. Axelrod asked experts to submit programs for the PD: code giving responses to any action by another. Fifteen
programs entered, including D and C. Several complex programs tried to infer and exploit the opponent's strategy. Anatol
Rapoport entered TFT. TFT won.
Axelrod announced the results and held a second contest. Analysis of round 1 showed that a more generous/
forgiving strategy could beat TFT: Tit for two tats (TFTT), which retaliates against DD but not D. There were 63 entrants in
the 2nd tournament and TFT (Rapoport) won again. Axelrod then simulated what would happen to the population of
strategies in the next generation if higher scoring strategies increase their share of the population. TFT and other
nice rules did well over time.
TFT/nice strategies win because they never defect first but retaliate quickly to D, which limits D's points.
Can a TFT world survive invasion of Ds, where survive means outscore D? Depends on the share of Ds that invades (p). Over
three periods with w = 1, TFT scores 9(1-p) + 2p, while D gets 7(1-p) + 3p, so TFT beats D when 2(1-p) > p, i.e., p < 2/3. So if
population change depends on relative scores, an initial invasion of fewer than 2/3 Ds would fail.
Can a world of Ds survive invasion of TFTs? Yes, unless the TFTs are numerous: 1/3 or more is needed with the given matrix.
Can a TFT world survive invasion of Cs? No, because TFT and C score the same, so nothing drives the Cs out. Cs open the door to D invasion.
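The invasion questions can be explored with a simple ecological update in which each strategy's share grows in proportion to its average score. This is only a sketch: the proportional (replicator-style) rule and the names `next_generation` and `evolve` are assumptions for illustration, not necessarily the exact procedure Axelrod used.

```python
# SCORE[me][opponent] = my three-period total against that opponent (w = 1).
SCORE = {
    "TFT": {"TFT": 9, "C": 9, "D": 2},
    "C": {"TFT": 9, "C": 9, "D": 0},
    "D": {"TFT": 7, "C": 15, "D": 3},
}


def next_generation(shares):
    """One step: new share = old share * own average score / population average score."""
    fitness = {
        me: sum(shares[opp] * SCORE[me][opp] for opp in shares) for me in shares
    }
    mean = sum(shares[s] * fitness[s] for s in shares)
    return {s: shares[s] * fitness[s] / mean for s in shares}


def evolve(shares, generations=200):
    """Apply next_generation repeatedly and return the final shares."""
    for _ in range(generations):
        shares = next_generation(shares)
    return shares


if __name__ == "__main__":
    # D starts below the 2/3 threshold: the invasion fails.
    print(evolve({"TFT": 0.5, "D": 0.5}))
    # D starts above the 2/3 threshold: D takes over.
    print(evolve({"TFT": 0.25, "D": 0.75}))
```

Adding a "C" entry to the starting shares lets one experiment with how unconditional cooperators change the outcome.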
Spatial interactions and n-hoods: CA models of PD
If TFTs interact more with each other in a local n-hood rather than with the entire population, TFT is
more likely to survive. Say a share 1-p of TFTs enters an All-D world and they have 2 of their 4 interactions with TFTs. Then their score is
equivalent to that in a world with 50% TFTs. But the Ds still interact largely with Ds, so TFT could win.
CA (cellular automata) models show how n-hood interactions affect outcomes in spatial PD games. Assume that players
interact with others in their n-hood and change strategy depending on what they win in the n-hood. Surrounded by Ds
you turn D. Surrounded by TFTs you play TFT. Conflicts occur on the borders. Compare a player with 3 TFT neighbors and 1 D,
a player with half TFT neighbors (2 TFT, 2 D), and a player with 1 TFT neighbor and 3 Ds.
       TFT                    TFT                    TFT
 TFT * ?? * TFT           D * ?? * D             D * ?? * D
        D                     TFT                     D
The rule for ?? is to compute profits from D and TFT and pick the most profitable. Consider the rewards using payoffs
for three-period interactions: TFT-TFT 9, TFT-D 2, D-D 3, D-TFT 7.
NEIGHBORHOOD      PICK TFT    PICK D    Decision
1 D, 3 TFT        29          24        TFT
2 D, 2 TFT        22          20        TFT
3 D, 1 TFT        15          16        D
Surrounded by 2 or 3 TFTs, choose TFT. Surrounded by 3 or more Ds, choose D.
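The decision rule for the ?? cell can be written out directly. A minimal sketch, assuming four neighbors and the three-period payoffs above; `neighborhood_profit` and `best_choice` are illustrative names.

```python
# Three-period totals: (own strategy, neighbor's strategy) -> own score.
SCORE = {("TFT", "TFT"): 9, ("TFT", "D"): 2, ("D", "TFT"): 7, ("D", "D"): 3}


def neighborhood_profit(choice, neighbors):
    """Total score from playing `choice` against every neighbor."""
    return sum(SCORE[(choice, n)] for n in neighbors)


def best_choice(neighbors):
    """Return the more profitable strategy and the profits of both options."""
    profits = {c: neighborhood_profit(c, neighbors) for c in ("TFT", "D")}
    return max(profits, key=profits.get), profits


if __name__ == "__main__":
    for n_tft in (3, 2, 1):
        neighbors = ["TFT"] * n_tft + ["D"] * (4 - n_tft)
        print(neighbors, "->", best_choice(neighbors))
```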
Go to http://ccl.northwestern.edu/netlogo/models/PDBasicEvolutionaryl and experiment with the PD games.
New Material on Spatial PD
1) Review of experiments on Prisoner’s Dilemmas on lattices to test interpretations of human behavior. The authors
find that “moody conditional cooperation” (MCC; see footnote 1), not non-innovative game dynamics such as imitate-the-best or pairwise comparison rules, fits the data. The results suggest that an imposed lattice structure does not influence
global cooperation. (Grujić et al., 2014, “A comparative analysis of spatial Prisoner’s Dilemma
experiments: Conditional cooperation and payoff irrelevance,” www.nature.com/articles/srep04615)
2) The existence of a zealot who stays a cooperator irrespective of the result of an interaction has been reported
to add “social viscosity” to a population and thereby help increase the cooperation level in prisoner's dilemma
games, a result that presumes the so-called well-mixed situation of a population. The authors found that this is not always true
when a spatial structure, i.e., connections among agents, is introduced. Deploying zealots is counterproductive, especially
when the underlying topology is homogeneous, similar to that of a lattice. Their simulation reveals how the existence
of never-converting cooperators destroys rather than boosts cooperation. (Matsuzawa et al., “Spatial prisoner’s
dilemma games with zealous cooperators,” Physical Review E 94, 022114 (2016))
Better than TFT: Nicer and Conditional
TFT has problems with errors in communication (a mistaken defection, D'). If TFT meets TFT and one errs, play falls into an alternating cycle, with
lower rewards than C. More forgiving is TFTT.
TFT vs TFT      Player 1: CCC D' CDCD ...
                Player 2: CCC C  DCDC ...
TFTT vs TFTT    Player 1: CCC D' CC CCC DD CC
                Player 2: CCC C  CC CCC CC DD
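The effect of a single mistaken defection can be reproduced in a few lines. This sketch assumes the error hits player 1 once, in the fourth round, and that both players see the moves actually played; `play_with_error` is an illustrative name, and `tftt` here defects only after two consecutive defections by the opponent.

```python
def tft(opponent_history):
    """Tit-for-tat: cooperate first, then copy the opponent's last move."""
    return "C" if not opponent_history or opponent_history[-1] == "C" else "D"


def tftt(opponent_history):
    """Tit-for-two-tats: defect only after two defections in a row."""
    return "D" if opponent_history[-2:] == ["D", "D"] else "C"


def play_with_error(strategy, rounds=8, error_round=3):
    """Two copies of `strategy` play; player 1's move in error_round (0-based) comes out as D."""
    hist_1, hist_2 = [], []
    for t in range(rounds):
        move_1 = strategy(hist_2)
        move_2 = strategy(hist_1)
        if t == error_round:
            move_1 = "D"  # the mistaken defection D'
        hist_1.append(move_1)
        hist_2.append(move_2)
    return "".join(hist_1), "".join(hist_2)


if __name__ == "__main__":
    print("TFT :", play_with_error(tft))   # ('CCCDCDCD', 'CCCCDCDC')
    print("TFTT:", play_with_error(tftt))  # ('CCCDCCCC', 'CCCCCCCC')
```

TFT locks into the alternating CD/DC cycle after the error, while TFTT absorbs the single D and returns to mutual cooperation.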
To generalize strategies via conditional probabilities, let P be the probability you cooperate if X cooperated and Q
be the probability you cooperate if X defected. This gives strategies below (Sigmund, Games of Life).
Footnote 1: Moody conditional cooperation (MCC) is conditional cooperation (CC) if the player cooperated the last time. If the player defected the last time, a player adopting MCC decides without
taking into account what the neighbors in the contact network have done previously: mutation rather than copying neighbors.
Nowak and Sigmund simulate world of (p,q) strategies with random ps and qs and NO neighborhoods.
PAVLOV responds to the previous round by switching if it loses: if its D leads to a D, it tries C, while if its C
meets a D, it tries D. WIN-STAY, LOSE-SHIFT. Pavlov would fail in an Axelrod tournament until TFT has
destroyed most Ds.
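A short sketch shows why Pavlov struggles while many Ds are around: against All D it keeps shifting back to C and is exploited every other round, whereas TFT is exploited only once. The function names and the six-round horizon are illustrative choices, and the Pavlov rule is coded in its usual compact form (cooperate only if both players made the same move last round).

```python
# Payoff to the first player of the pair: T=5, R=3, P=1, S=0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def pavlov(own, opp):
    """Win-stay, lose-shift: open with C, then cooperate only after CC or DD."""
    if not own:
        return "C"
    return "C" if own[-1] == opp[-1] else "D"


def tft(own, opp):
    """Tit-for-tat: cooperate first, then copy the opponent's last move."""
    return "C" if not opp else opp[-1]


def all_d(own, opp):
    """Always defect."""
    return "D"


def match(strategy_a, strategy_b, rounds=6):
    """Return (moves of a, score of a, score of b) with no discounting."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return "".join(hist_a), score_a, score_b


if __name__ == "__main__":
    print("Pavlov vs All D:", match(pavlov, all_d))   # ('CDCDCD', 3, 18)
    print("TFT    vs All D:", match(tft, all_d))      # ('CDDDDD', 5, 10)
    print("Pavlov vs Pavlov:", match(pavlov, pavlov)) # ('CCCCCC', 18, 18)
```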
Psychology Experiments-- Framing matters
Study 1: More cooperation in ‘‘Community Game’’ PD than ‘‘Wall Street Game’’ in Israeli Air force.
Instructors guessed who will cooperate based on behavior during training. (Liberman, V., S. M. Samuels, and
L. Ross. 2004. Personality and Social Psychology Bulletin 30:1175-85.)
Study 2: Interpretive labels of the game, the choices, and the outcomes led to different outcomes. (Zhong,
Loewenstein, and Murnighan, Journal of Conflict Resolution, Vol. 51, No. 3, 431-456 (2007))
6. One strategy to Rule Them All: the ZD Condition
“It would be surprising if any significant mathematical feature of IPD has remained undescribed, but that
appears to be the case” (Freeman Dyson and William Press, 2012). Also surprising: 93-year-old Dyson added it!
Dyson also is a climate skeptic: “he thinks the computer-generated models being used to predict long-term climate consequences are flawed because scientists have too little information about many of the variables
that must be taken into account” (http://noconsensus.org/scientists/freeman_dyson.php). See the 2015 Dyson
interview: www.youtube.com/watch?v=BiKfWdXXfIs.
www.realclimate.org/index.php/archives/2008/05/freeman-dysons-selective-vision/ criticizes him.
Suggested paper: How valid are his criticisms of climate change models? Putting other problems first?
ZD as slogan : "Robert Axelrod's 1980 tournaments of iterated prisoner's dilemma strategies have been
condensed into the slogan, Don't be too clever, don't be unfair. Press and Dyson have shown that cleverness
and unfairness triumph after all." — William Poundstone
The Zero Determinant (ZD) model presents a strategy that “controls” outcomes regardless of what an
opposing non-ZD strategy does. ZD plays C with a conditional probability between 0 and 1 depending on the last
period's play: a memory-1 strategy. Dyson & Press show that an opponent who considers earlier encounters
does no better against a memory-1 player than an opponent using a memory-1 strategy, so the analysis need only consider
strategies that remember the previous round. Here is ZD compared to four major strategies that give 0/1
responses to the previous round.
Conditional Probability of Playing C, given the previous round's outcome (own move first, own payoff in parentheses)
Strategy           CC (R)    CD (S)    DC (T)    DD (P)
ZD Strategy        Pcc       Pcd       Pdc       Pdd
All Coop           1         1         1         1
All Defect         0         0         0         0
TFT                1         0         1         0
Pavlov (WSLS)      1         0         0         1
To do well ZD will likely set Pcc high, Pcd low, Pdc high, and Pdd low but not 0. Why? Since ZD can replicate other
strategies, it has to have some “edge”.
Let Qcc, Qcd, Qdc, Qdd represent the conditional probabilities of the 2nd player, indexed from that player's own point of view. Then the four probabilities are
each player's strategy. Putting them together gives a probability distribution for the outcome of each
round, conditional on the outcome of the previous round: a 4 by 4 Markov chain transition matrix M
from the four outcomes in this period (rows) to the next (columns):
        CC            CD                DC                DD
CC      Pcc Qcc       Pcc (1-Qcc)       (1-Pcc) Qcc       (1-Pcc) (1-Qcc)
CD      Pcd Qdc       Pcd (1-Qdc)       (1-Pcd) Qdc       (1-Pcd) (1-Qdc)
DC      Pdc Qcd       Pdc (1-Qcd)       (1-Pdc) Qcd       (1-Pdc) (1-Qcd)
DD      Pdd Qdd       Pdd (1-Qdd)       (1-Pdd) Qdd       (1-Pdd) (1-Qdd)
Let v be the 4-element row vector of the distribution of outcomes among CC, CD, DC, and DD, which pay the row player R, S,
T, P. The v that solves v = v M gives the stationary distribution/equilibrium of the four outcomes, in which the row
player gets v times rewards (R, S, T, P) while the column player gets v times rewards (R, T, S, P).
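The transition matrix and its stationary distribution can be computed numerically in plain Python. This is a sketch under stated assumptions: strategies are tuples (cc, cd, dc, dd) written from each player's own viewpoint; in state CD the row player cooperated and was defected on, so it earns S while the column player earns T; and the stationary vector is approximated by iterating the "lazy" chain (M + I)/2, a choice made here to avoid periodic cycling. If the chain has several closed sets of states (for example TFT against TFT), the result depends on the uniform starting vector.

```python
def transition_matrix(p, q):
    """4 x 4 matrix over states CC, CD, DC, DD (row player's move first)."""
    q_seen = (q[0], q[2], q[1], q[3])  # the column player sees CD as DC and vice versa
    return [
        [pi * qi, pi * (1 - qi), (1 - pi) * qi, (1 - pi) * (1 - qi)]
        for pi, qi in zip(p, q_seen)
    ]


def stationary(M, steps=5000):
    """Approximate the row vector v with v = v M by iterating (M + I)/2."""
    v = [0.25] * 4
    for _ in range(steps):
        vM = [sum(v[i] * M[i][j] for i in range(4)) for j in range(4)]
        v = [(a + b) / 2 for a, b in zip(v, vM)]
    return v


def payoffs(p, q, R=3, S=0, T=5, P=1):
    """Long-run average payoff per round for the row and the column player."""
    v = stationary(transition_matrix(p, q))
    row = sum(vi * x for vi, x in zip(v, (R, S, T, P)))
    col = sum(vi * x for vi, x in zip(v, (R, T, S, P)))
    return row, col


if __name__ == "__main__":
    all_c, all_d = (1, 1, 1, 1), (0, 0, 0, 0)
    print(payoffs(all_d, all_c))  # approximately (5, 0): D exploits C every round
    print(payoffs((0.9, 0.1, 0.9, 0.1), (0.8, 0.3, 0.6, 0.2)))
```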
Press and Dyson show that the row player's score, v times its reward vector Sx, can be expressed as a determinant in which one column involves
only the four probabilities of one player’s strategy and another column involves only the probabilities of
the other player's strategy. This allows a player to force a given linear relation between the outcomes of
both players independently of whatever strategy the other might choose. This control is obtained by
setting the determinant to zero, hence the name ZD.
A ZD strategy can use the linear relation to set the average score of the opponent regardless of the opponent's
strategy. See http://s3.boskent.com/prisoners-dilemma/fixed.html, which plays conditional probabilities
obtained by solving the Press and Dyson equations for a target of 2.
If you cooperated last time, it cooperates with probability 2/3.
If you defected while it cooperated, it cooperates with probability 0.
If you defected last time and it defected, it cooperates with probability 1/3.
Whatever the non-ZD player does, its long-term outcome is 2.
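The claim that the opponent's long-run score is pinned at 2 can be checked by simulation. The sketch below plays the three rules above against a few memory-1 opponents; the opponents, the number of rounds, the starting state, and the function name `average_opponent_payoff` are all illustrative assumptions, and the averages are Monte Carlo estimates, so they only come out close to 2.

```python
import random

# Payoff to the first player of the pair: T=5, R=3, P=1, S=0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# ZD cooperation probability, keyed by (ZD's last move, opponent's last move).
ZD = {("C", "C"): 2 / 3, ("C", "D"): 0.0, ("D", "C"): 2 / 3, ("D", "D"): 1 / 3}

# Example opponents, keyed by (opponent's own last move, ZD's last move).
ALL_C = {k: 1.0 for k in ZD}
ALL_D = {k: 0.0 for k in ZD}
TFT = {("C", "C"): 1.0, ("C", "D"): 0.0, ("D", "C"): 1.0, ("D", "D"): 0.0}
MIXED = {("C", "C"): 0.9, ("C", "D"): 0.2, ("D", "C"): 0.7, ("D", "D"): 0.4}


def average_opponent_payoff(opponent, rounds=100_000, seed=0):
    """Average per-round payoff of the opponent when it plays against ZD."""
    rng = random.Random(seed)
    zd_move, opp_move = "C", "C"  # arbitrary starting state
    total = 0
    for _ in range(rounds):
        next_zd = "C" if rng.random() < ZD[(zd_move, opp_move)] else "D"
        next_opp = "C" if rng.random() < opponent[(opp_move, zd_move)] else "D"
        zd_move, opp_move = next_zd, next_opp
        total += PAYOFF[(opp_move, zd_move)]
    return total / rounds


if __name__ == "__main__":
    for name, opp in [("All C", ALL_C), ("All D", ALL_D), ("TFT", TFT), ("Mixed", MIXED)]:
        print(name, round(average_opponent_payoff(opp), 2))
```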
ZD can also “extort” gains by defecting enough times to win in any one-on-one contest with the other
player. Extort-2 below forces a relationship in which ZD gains twice the share of payoffs above P compared
with those received by the opponent, where P is … .
Where does the solution come from?
Press and Dyson prove that v · SX equals the determinant of a matrix which is obtained
by replacing the last column of M - I by SX, where I is the identity matrix.
Denote this determinant as D(p,q,SX); then player X’s expected payoff is EX = D(p,q,SX)/D(p,q,1), where 1
is an all-ones vector. In the determinant, the second column is determined solely by the strategy of player X and the
third column is solely determined by the strategy of player Y. We record the second column
as p̃ = (-1 + p1, -1 + p2, p3, p4) and the third column as q̃ = (-1 + q1, q3, -1 + q2, q4), where (p1, p2, p3, p4) = (Pcc, Pcd, Pdc, Pdd) and likewise for q.
The ZD solution links the conditional probabilities to the R, S, T, P rewards by a, b, and v parameters (here v is a constant, not the stationary vector):
Pcc = a R + b R + v + 1
Pcd = a S + b T + v + 1
Pdc = a T + b S + v
Pdd = a P + b P + v
which yields a linear relation: the rewards to the ZD player, A(p,q), are connected to the rewards of the other
player, A(q,p), by the equation a A(p,q) + b A(q,p) + v = 0.
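The four equations above can be turned into a strategy and the resulting linear relation, a A(p,q) + b A(q,p) + v = 0 in Press and Dyson's result, checked exactly with fractions. This is a sketch under assumptions: the parameter values are chosen here for illustration (a = 0, b = -1/3, v = 2/3 reproduces the "target of 2" strategy; a = 1/18, b = -1/9, v = 1/18 gives an extortion factor of 2 with P = 1), the opponent `q` is arbitrary, and the exact solver assumes the stationary distribution is unique.

```python
from fractions import Fraction as F

R, S, T, P = 3, 0, 5, 1
SX = (R, S, T, P)  # ZD (row) player's payoff in states CC, CD, DC, DD
SY = (R, T, S, P)  # opponent's payoff in the same states


def zd_strategy(a, b, v):
    """Return (Pcc, Pcd, Pdc, Pdd) built from the a, b, v parameters."""
    p = [F(a) * x + F(b) * y + F(v) for x, y in zip(SX, SY)]
    p[0] += 1
    p[1] += 1
    if not all(0 <= pi <= 1 for pi in p):
        raise ValueError("a, b, v do not give valid probabilities")
    return tuple(p)


def stationary_exact(p, q):
    """Solve v = v M with sum(v) = 1 exactly (assumes the solution is unique)."""
    q_seen = (q[0], q[2], q[1], q[3])  # opponent sees CD as DC and vice versa
    M = [
        [F(pi) * F(qi), F(pi) * (1 - F(qi)), (1 - F(pi)) * F(qi), (1 - F(pi)) * (1 - F(qi))]
        for pi, qi in zip(p, q_seen)
    ]
    # Three balance equations plus the normalisation, as an augmented 4 x 5 system.
    A = [[M[i][j] - (1 if i == j else 0) for i in range(4)] + [F(0)] for j in range(3)]
    A.append([F(1)] * 5)
    for col in range(4):  # Gauss-Jordan elimination
        pivot = next(r for r in range(col, 4) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        piv = A[col][col]
        A[col] = [x / piv for x in A[col]]
        for r in range(4):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][4] for i in range(4)]


def long_run_payoffs(p, q):
    """Exact long-run payoffs (ZD player, opponent)."""
    v = stationary_exact(p, q)
    return sum(vi * x for vi, x in zip(v, SX)), sum(vi * y for vi, y in zip(v, SY))


if __name__ == "__main__":
    q = (F(9, 10), F(1, 5), F(7, 10), F(2, 5))  # an arbitrary memory-1 opponent
    equalizer = zd_strategy(0, F(-1, 3), F(2, 3))
    print(equalizer, long_run_payoffs(equalizer, q))
    extort2 = zd_strategy(F(1, 18), F(-1, 9), F(1, 18))
    sx, sy = long_run_payoffs(extort2, q)
    print(extort2, sx - P, 2 * (sy - P))
```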
Revolution in Game Theory?: Response of Researchers to New Solution
Hao Dong, Rong Zhi-Hai, and Zhou Tao, “Zero-determinant strategy: An underway revolution in game
theory” (Chinese Physics B, 2014): “ZD ... fundamentally changes the research paradigm of game theory. In
the framework of ZD … are dozens of ingenious ideas and untraditional approaches for analyzing not only
prisoner’s dilemma but also bi-matrix games, which dramatically expand our understanding of the stochastic
process, the mutual benefit, the cooperation incentive, and even the optimal control in the repeated games.”
William Press: “When both players have a theory of mind (that is, are not just evolving to maximize
their own score) are all games in some deep way, actually Ultimatum Games.”
Freeman Dyson: “Cooperation loses and defection wins ... My view of the evolution of cooperation
is colored by my memories of childhood … two important days, Christmas and Guy Fawkes. Christmas
was the festival of love and forgiveness. Guy Fawkes was the festival of hate and punishment... (for) the
guy who tried to blow up the King and the Parliament in 1605 and was gruesomely punished by torture
and burning. For the children, Christmas was boring and Guy Fawkes was fun. We were born with an
innate reward system that finds joy in punishing cheaters. The system evolved to give cooperative
tribes an advantage over noncooperative tribes, using punishment to give cooperation an evolutionary
advantage within the tribe. This double selection of tribes and individuals goes way beyond the
Prisoners' Dilemma model.”
Chad English (comment on PLOS Blog, http://blogs.plos.org/neuroanthropology/2012/06/24/
prisoners-dilemma-and-the-evolution-of-inequality-does-unfairness-triumph-after-all/): “this provides
the demonstrable benefit of unions and governments … companies in capitalist societies make use of
ZD strategies to exploit their shorter term acting employees (“employees who do not know ZD
strategies”). It is therefore in the interests of workers to create their ZD strategy organization (aka a
union).”
What is the change in the PD model from the previous model? ZD assumes different information: the player
who uses ZD knows ZD and the other does not and just adjusts to gain the best it can. ZD strategies provide the
player with strong unilateral control in games, but this does not mean that ZD triumphs in evolutionary
games. They are strategies that provide a unilateral advantage to sentient players pitted against unwitting opponents.
As we shall see in lecture 4, ZD does poorly in evolutionary settings, in part because if two extortionary ZD strategies
meet, they end up at DD, the lowest value.
How does it work with noise? (Hao et al., “Extortion under uncertainty: Zero-determinant strategies in noisy
games,” Phys. Rev. E 91, 052803): “The original ZD strategy does not capture the notion of robustness when
the game is subjected to stochastic errors. We find that ZD strategies have high robustness against errors. We
further derive the pinning strategy under noise, by which the ZD strategy player coercively sets the
opponent's expected payoff to his desired level, although his payoff control ability declines with the increase
of noise strength. Due to the uncertainty caused by noise, the ZD strategy player cannot ensure his payoff to
be permanently higher than the opponent's, which implies dominant extortions do not exist even under low
noise. But the ZD strategy player can establish a novel kind of extortion, named contingent extortions, where
any increase of his own payoff always exceeds that of the opponent's by a fixed percentage; the conditions
under which the contingent extortions can be realized are more stringent as the noise becomes stronger.”
Conclusion
1. In one-on-one contests where the winner lives and the loser dies, D triumphs because it beats TFT (and other
Reciprocal Coop strategies) on round one and ties afterwards. TFT/Coop never wins one on one.
2. In a tournament where a strategy interacts with many others, including strategies like itself,
cooperation can score more than D-strategies. Both cooperative and defect strategies score most when
playing against C strategies, but D scores more. But cooperative strategies gain more against TFT-type
strategies than defect-type strategies do.
3. Outcomes in a tournament environment depend on the ecology/population of strategies and how it
evolves over time --> analysis of evolutionarily stable strategies (ESS) next class, where strategies evolve
depending on their total points relative to others. TFT-type may lose every one on one, yet it will “win” by
scoring relatively more points than a defect-type strategy.
4. Issues not treated in the basic model: robustness to mistakes in playing (mis-communication); cost of
developing more complex strategies (# parameters): ZD requires more “brainpower” than All D;
prevalence of PD games vs others (ultimatum, dictator, etc.) in the world?