Lecture 3: Prisoner’s Dilemma, Tit-for-tat, and ZD strategies

http://gawker.com/5903692/must-watch-golden-balls-contestant-wins-with-most-ballsy-move-ever

The Prisoner’s Dilemma (PD): Two players cooperate (C) or defect (D). If A cooperates and B defects, A loses; if A defects and B cooperates, A wins; if both cooperate, both gain; if both defect, both lose.

Standard PD payoff matrix (payoff to A, payoff to B):

               B plays C    B plays D
A plays C        3, 3         0, 5
A plays D        5, 0         1, 1

The matrix of payoffs: in joint totals CC > DC/CD > DD (6 > 5 > 2), and for each player CC > (DC + CD)/2 (3 > 2.5), so an alternating strategy is not best. Since the joint total varies with the outcome and CC produces the maximum, PD is not a zero-sum game. Cooperation has a productive value. CC need not be socially positive: it is positive for a cartel ripping off consumers. For the individual the ordering is often written T (DC) > R (CC) > P (DD) > S (CD), with R = 3, S = 0, T = 5, P = 1.

In a one-shot game, the solution is D. If I do C, you will do D in the single encounter, and vice versa. The same is true if we know the game will end in period T (here T is the last period, not the temptation payoff). Why? We both play D in T. So what about T-1? It is a one-shot game again, because we know what T will give us. By backward induction, a known number of interactions yields ALLD.

But we observe much less defecting in the world. To explain this we need:

1. Expected future dealings
Rip off the tourist because there is no future exchange, but cooperate with your spouse since you expect to interact again with no certain endpoint. This is the repeated or iterated PD game: IPD. There is no single best action. Your best strategy depends on what others do and on the discount factor for future payoffs. With low discount rates, cooperative strategies do better than D-strategies, because if cooperating today leads to cooperation tomorrow, the 6 produced jointly by CC exceeds the 5 produced by CD. But a cooperative strategy must be retaliatory. You play D and gain 5 against C, but if C retaliates by shifting to D, you get 5+1 in two rounds and 5+1+1 in three rounds, while if you keep playing C with another cooperator you get 3+3 in two rounds and 3+3+3 in three rounds.
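The round-by-round arithmetic above can be reproduced with a small simulation. This is a minimal sketch, assuming the standard payoffs R = 3, S = 0, T = 5, P = 1 from the notes. The names `play`, `all_d`, and `retaliator` are illustrative and not from the lecture, and the retaliator here is a hypothetical "cooperate until the other player defects, then defect" rule chosen only to match the 5, 1, 1 stream.

```python
# Payoff to the first player for (my_move, other_move); values from the notes.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def all_d(my_history, other_history):
    """Always defect."""
    return "D"

def retaliator(my_history, other_history):
    """Hypothetical retaliatory rule: cooperate until the other player defects."""
    return "D" if "D" in other_history else "C"

def play(strategy_a, strategy_b, rounds):
    """Return the per-round payoff streams of players A and B."""
    hist_a, hist_b, pay_a, pay_b = [], [], [], []
    for _ in range(rounds):
        # Both players choose simultaneously, seeing only past rounds.
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        hist_a.append(move_a)
        hist_b.append(move_b)
        pay_a.append(PAYOFF[(move_a, move_b)])
        pay_b.append(PAYOFF[(move_b, move_a)])
    return pay_a, pay_b

defector_stream, _ = play(all_d, retaliator, 3)
cooperator_stream, _ = play(retaliator, retaliator, 3)
print(defector_stream, sum(defector_stream))      # [5, 1, 1] 7
print(cooperator_stream, sum(cooperator_stream))  # [3, 3, 3] 9
```

Changing the number of rounds to 2 shows the tie (5+1 against 3+3), which is why the advantage of cooperation only appears from the third round on.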
Retaliation gives D the stream 5, 1, 1 while each cooperator gains 3, 3, 3 and so on. D could still score more if the future were heavily discounted. Without retaliation, you keep playing D while the other player plays C, and you get 5, 5, 5 ...

Let w = 1/(1+r) < 1, where r is the interest or discount rate for the next period. Then w^2 is the value of a payoff two periods from now, w^3 the value three periods on, etc. The future matters more when w is large. A cooperative strategy that does not play all C can beat All D. (Play All C and D gets 5 every time.)

2. Opponents have strategies beyond All D or All C
If the other player is All D, best is D. If the other player is All C, best is D. But if the other player reacts to your D/C, it may be better to be nicer.

TFT, tit-for-tat, opens with C and then copies the opponent's previous move: it cooperates until the opponent defects, then defects until the opponent changes. An eye for an eye, a tooth for a tooth.

Round       1    2    3    4
Opponent    C    D    C/D
TFT         C    C    D    C/D

Consider three periods (the minimum for TFT to do better than All D given the payoff matrix above), where each player meets TFT half the time and All D half the time. For simplicity let w = 1, so the future is worth as much as the present.

TFT meets TFT:      rewards = 3(1 + w + w^2) = 9
All D meets TFT:    rewards = 5 + w + w^2 = 7
TFT meets All D:    rewards = 0 + w + w^2 = 2
All D meets All D:  rewards = 1 + w + w^2 = 3

TFT averages (9+2)/2 = 11/2 vs (7+3)/2 = 10/2 for D. Future cooperation gives a higher present value than the gain from defecting.

3. The winning strategy varies with the distribution of strategies
In a D world, best is All D. In a TFT world, best is a TFT-type strategy. Consider how payoffs vary with the All D and TFT shares of the population in a 3-period model:

%D        1/3                     1/2                     2/3             3/4
TFT       20/3 (1/3*2 + 2/3*9)    11/2 (1/2*2 + 1/2*9)    13/3            15/4
D         17/3 (1/3*3 + 2/3*7)    10/2                    13/3            16/4
Result    TFT wins                TFT wins                equal scores    D wins

So when %D > 2/3, D wins; when %D < 2/3, D loses; at 2/3 there is an unstable mixed equilibrium. Note that TFT requires a smaller proportion of itself to win (more than 1/3) than D does (more than 2/3). The reason is 6 > 5.
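The table of scores by %D can be recomputed exactly. The sketch below assumes the notes' payoffs and the three-period setup. The function names `match_scores` and `expected_scores` are illustrative, and the discount factor w is left as a parameter so that values other than w = 1 can be tried.

```python
from fractions import Fraction

R, S, T, P = 3, 0, 5, 1  # payoffs from the notes

def match_scores(w=1, periods=3):
    """Discounted totals, keyed by (me, opponent), for TFT and All D."""
    later = sum(Fraction(w) ** t for t in range(1, periods))  # w + w^2 + ...
    return {
        ("TFT", "TFT"): R * (1 + later),  # mutual cooperation every round
        ("D", "TFT"): T + P * later,      # exploit once, then mutual defection
        ("TFT", "D"): S + P * later,      # exploited once, then mutual defection
        ("D", "D"): P * (1 + later),      # mutual defection every round
    }

def expected_scores(share_d, w=1, periods=3):
    """Average score of TFT and of All D when a share `share_d` of opponents is All D."""
    s = match_scores(w, periods)
    share_d = Fraction(share_d)
    tft = share_d * s[("TFT", "D")] + (1 - share_d) * s[("TFT", "TFT")]
    d = share_d * s[("D", "D")] + (1 - share_d) * s[("D", "TFT")]
    return tft, d

for share in (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(3, 4)):
    tft, d = expected_scores(share)
    print(f"%D = {share}: TFT {tft}, D {d}")
```

Using `Fraction` keeps the tie at %D = 2/3 exact instead of relying on floating-point comparison.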
4. Addition of All Cooperate (turn the other cheek) helps All D and hurts TFT: too many suckers destroy the world

Three-period score of each strategy (columns) against each opponent, and its average score in three population mixes:

                              TFT      C        D
vs TFT                        9        9        7
vs C                          9        9        15
vs D                          2        0        3
1/3 of each                   20/3     18/3     25/3*
2/5 TFT, 2/5 C, 1/5 D         38/5     36/5     47/5*
7/10 TFT, 2/10 C, 1/10 D      8.3*     8.1      8.2
* marks the winner

D wins because it exploits C. With 2/5 TFT and 2/5 C (and 1/5 D), D wins. With 7/10 TFT and 2/10 C, TFT wins. Thus, there is no best choice in PD. Success depends on the ecology of strategies. For any payoff matrix, there is a distribution of All D, All C, and TFT in which D wins and another in which TFT wins, even though TFT never wins one on one. Nice strategies gain from interactions with nice strategies. TFT gains more than D through interactions with TFT.

2. Axelrod 1979 Computer Tournament
Axelrod asked experts to submit computer code giving responses to any action by another player. Fifteen entrants. Someone puts in D. Someone puts in C. Several programs try to infer and exploit the opponent's strategy. Anatol Rapoport enters TFT and wins. Axelrod announces the results and holds a second contest. Analysis of round 1 showed that a more forgiving strategy could beat TFT: Tit for two tats (TFTT), which retaliates against DD but not against a single D. There were 63 entrants in the 2nd tournament and TFT (Rapoport) won again.

Simulation found that if higher-scoring strategies increase their share of the population over time, TFT and other nice rules do well over time. Why? Because TFT and other nice strategies never defect first and retaliate quickly against D, which limits how much D can earn.

Can a TFT world survive/outscore an invasion of D? It depends on the share of D, p. Using the three-period scores, TFT gets 9(1-p) + 2p while D gets 7(1-p) + 3p, so TFT beats D when 9 - 7p > 7 - 4p, i.e. 2 > 3p, i.e. p < 2/3. An invasion by fewer than 2/3 Ds would fail. Can a world of Ds survive an invasion of TFTs? Yes, unless the TFTs make up more than 1/3 of the population with the given matrix. But TFT cannot stop an invasion of Cs, because TFT and C score the same against each other. And with Cs in the world, the door is open for Ds to invade.
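The three-strategy ecology can be checked the same way. This is a minimal sketch that takes the three-period scores from the notes as given. The names `SCORE`, `ecology_scores`, and `winner` are illustrative, and ties are broken arbitrarily in favor of the strategy listed first.

```python
from fractions import Fraction

# Three-period scores from the notes: SCORE[me][opponent].
SCORE = {
    "TFT": {"TFT": 9, "C": 9, "D": 2},
    "C":   {"TFT": 9, "C": 9, "D": 0},
    "D":   {"TFT": 7, "C": 15, "D": 3},
}

def ecology_scores(shares):
    """Expected score of each strategy when opponents are drawn with the given shares."""
    return {
        me: sum(Fraction(shares[opp]) * SCORE[me][opp] for opp in shares)
        for me in SCORE
    }

def winner(shares):
    """Strategy with the highest expected score (first listed wins ties)."""
    scores = ecology_scores(shares)
    return max(scores, key=scores.get)

mixes = [
    {"TFT": Fraction(1, 3), "C": Fraction(1, 3), "D": Fraction(1, 3)},
    {"TFT": Fraction(2, 5), "C": Fraction(2, 5), "D": Fraction(1, 5)},
    {"TFT": Fraction(7, 10), "C": Fraction(2, 10), "D": Fraction(1, 10)},
]
for mix in mixes:
    print(winner(mix), {k: str(v) for k, v in ecology_scores(mix).items()})
```

Setting the share of C to zero and the share of D to p recovers the two-strategy invasion condition, with TFT and D tied at p = 2/3.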
The Axelrod and similar tournaments are round-robins: each strategy plays every other one and grows in proportion to its total score. If instead the contest were run like a tennis tournament, with the winner of each match going on to the next round, the results would be completely different. TFT meets All D → D proceeds, and the result is a world of D.

Spatial interactions and n-hoods
If TFTs interact more with each other in a local n-hood (neighborhood) than with the entire population, TFT is more likely to survive. Say 10 TFTs enter a world of 100 All-Ds but have 1/2 of their interactions with TFTs. For the TFTs this is equivalent to a world with 50% TFTs. The Ds still interact largely with Ds, so TFT could win.

Cellular automaton (CA) models show how n-hood interactions affect outcomes in spatial PD games. Assume that players interact with others in their n-hood and change strategy depending on what they can win in the n-hood. Surrounded by Ds you turn D. Surrounded by TFTs you play TFT. Conflicts occur on the borders. Compare a cell ?? with four neighbors in three n-hoods: 3 TFTs and 1 D, 2 TFTs and 2 Ds, and 1 TFT and 3 Ds.

[Figure: three n-hoods, each showing the cell ?? surrounded by four neighbors playing TFT or D.]

The rule for ?? is to compute the profits from D and from TFT and pick the more profitable. Consider the rewards using payoffs for three-period interactions: TFT-TFT 9, TFT-D 2, D-D 3, D-TFT 7. The above n-hoods produce:

N-hood      1 D, 3 TFT    2 D, 2 TFT    3 D, 1 TFT
TFT         29            22            15
D           24            20            16
Decision    TFT           TFT           D

Surrounded by 2 or 3 TFTs, choose TFT. Surrounded by 3 or more Ds, choose D.

Go to http://www.taumoda.com/web/PD/lab.html and experiment with the PD games.

4. Better than TFT: Nicer and Conditional
TFT has problems with errors in communication. If TFT meets TFT and one of them errs, the pair falls into an alternating cycle, with lower rewards than mutual C. TFTT is more forgiving.

Error D, TFT meets TFT:
TFT     CCC D CDCD ...
TFT     CCC C DCDC ...

Error D, TFTT meets TFTT:
TFTT    CCC D CC ...
TFTT    CCC C CC ...

Two Ds in a row:
        CCC DD CC
        CCC CC DD

To generalize strategies in terms of conditional probabilities, let p be the probability that you cooperate if X cooperated and q be the probability that you cooperate if X defected. Each strategy is then a point (p, q): for example, All C is (1, 1), All D is (0, 0), and TFT is (1, 0) (Sigmund, Games of Life, 1994). Nowak and Sigmund simulate a world of (p, q) strategies with random ps and qs and NO neighborhoods.

PAVLOV responds to the previous round by switching if it loses: if its D meets a D, it tries C, while if its C meets a D, it tries D. WIN-STAY, LOSE-SHIFT. Pavlov would fail in an Axelrod tournament until TFT has destroyed most Ds.

5. Psychology Experiments: Framing matters
Study 1: More cooperation in a PD labeled the “Community Game” than in one labeled the “Wall Street Game” in an Israeli Air Force experiment. Compare the results with instructors' guesses of who will cooperate, based on behavior during training. (Liberman, V., S. M. Samuels, and L. Ross. 2004. Personality and Social Psychology Bulletin 30:1175-85.)

Study 2: Interpretive labels for the game, the choices, and the outcomes led to different outcomes. (Zhong, Loewenstein, and Murnighan, Journal of Conflict Resolution, Vol. 51, No. 3, 431-456 (2007).) This experiment provided clear and strongly worded labels for the PD game, for the players’ choices, and for their outcomes: blunt labels (trust and cutthroat) described the game in starkly different ways; the choice labels reflected researchers’ common interpretative descriptions of participants’ choices, for example, cooperate, defect, rational, and choose for the group; and labels for the participants’ outcomes also reflected common research usage, for example, winner, sucker, saint, punishment, or group maximum.

6. One strategy to Rule Them All: the ZD Condition
“It would be surprising if any significant mathematical feature of IPD has remained undescribed, but that appears to be the case” (Freeman Dyson and William Press, 2012).
The new feature is Zero Determinant (ZD) strategies, which give a player “control” over outcomes regardless of what an opposing non-ZD strategy does. ZD plays C with a conditional probability between 0 and 1 that depends on the last period's play: a memory-one (mem 1) strategy. Dyson-Press show that against a mem 1 player, an opponent who considers earlier encounters does no better than an opponent with a mem 1 strategy, so they need only consider strategies that remember the previous round, which is critical for making the problem manageable.

Is it surprising? Not to Drew Fudenberg, who noted that the only reason to condition play on the past is if your opponent does so, which means that an opponent of a mem 1 player gains nothing from going back further in trying to “guess” the next move. There is no link across plays in a pure repeated game. If the opponent uses no memory, there is no need for you to do so.

Here is ZD compared to four major strategies that give 0/1 responses to the previous round. Outcomes list the player's own move first, with the player's own payoff in parentheses.

Conditional probability of playing C after each outcome of the previous round:

Previous round     CC (R)    CD (S)    DC (T)    DD (P)
ZD strategy        Pcc       Pcd       Pdc       Pdd
All Coop           1         1         1         1
All Defect         0         0         0         0
TFT                1         0         1         0
Pavlov (WSLS)      1         0         0         1

To do well, ZD will likely set Pcc high, Pcd low, Pdc high, and Pdd low but not 0. Why? Note that All C, All D, TFT, and Pavlov are mem 1 strategies of the same form, with extreme (0 or 1) conditional probabilities.

Let Qcc, Qcd, Qdc, Qdd represent the conditional probabilities of the 2nd player, indexed from that player's own point of view. Then the four probabilities Pij and the four Qij characterize the two players' strategies. Putting them together gives a probability distribution for the outcome of each round, conditional on the outcome of the previous round: a 4 by 4 Markov chain transition matrix M from the four outcomes in this period to the four outcomes in the next.
Columns give the outcome of the previous round and rows the outcome of this round, both from player P's point of view:

       CC                  CD                  DC                  DD
CC     Pcc Qcc             Pcd Qdc             Pdc Qcd             Pdd Qdd
CD     Pcc (1-Qcc)         Pcd (1-Qdc)         Pdc (1-Qcd)         Pdd (1-Qdd)
DC     (1-Pcc) Qcc         (1-Pcd) Qdc         (1-Pdc) Qcd         (1-Pdd) Qdd
DD     (1-Pcc)(1-Qcc)      (1-Pcd)(1-Qdc)      (1-Pdc)(1-Qcd)      (1-Pdd)(1-Qdd)

Press and Dyson find the 4-element vector v giving the distribution of outcomes among CC, CD, DC, DD that is the stationary distribution (equilibrium) of the encounter between P and Q, from v = M v. Multiplying the elements of the distribution by the rewards each player gets from the outcomes gives their payoffs, S:

Player P (row player) gets      S(p,q) = v_cc R + v_cd S + v_dc T + v_dd P
Player Q (column player) gets   S(q,p) = v_cc R + v_cd T + v_dc S + v_dd P

where the distribution parameters v_ij measure the frequency of the outcomes. The equations show that both players get the same amount from CC and DD and differ in their rewards in the CD and DC cases. Since T > S, player P wins when v_dc > v_cd and player Q wins in the opposite case. For player P, more DC than CD is good.

Since the stationary distribution v depends on the conditional probabilities the players choose, the ZD solution links the conditional probabilities of the ZD player to the R, S, T, P rewards. This link yields a linear equation connecting the rewards of the first player, S(p,q), to the rewards of the other player, S(q,p):

α S(p,q) + β S(q,p) + γ = 0

The equation allows a player to force a given linear relation between the two scores independently of whatever strategy the other might choose. Press and Dyson obtain the solution by setting the determinant of a matrix that depends on the game matrix and a vector of rewards equal to zero, hence the name ZD.

Taking this equation as given, if the first player knows ZD strategy and the second does not, the first can use the linear relation to set the average score of the opponent regardless of the opponent's strategy. The following is the solution for choosing conditional probabilities that set the opponent's score to a fixed level.
If CC, cooperate with probability 1 - e
If DC, cooperate with probability (e + 3f)/2
If CD, cooperate with probability 1 - 2e - f
If DD, cooperate with probability f

http://s3.boskent.com/prisoners-dilemma/fixed.html solves the Press and Dyson equations for a target of 2 and finds that e = 1/3 and f = 1/3, so that
CC → cooperate with probability 2/3
DC → cooperate with probability 2/3
CD → cooperate with probability 0
DD → cooperate with probability 1/3

Whatever the non-ZD player does, its long-term outcome is 2. Similarly, one can find conditional probabilities that force the opponent to have any score between 1 and 3.

ZD can also “extort” gains by defecting enough times to win in any one-on-one contest. There is a set of conditional ZD probabilities such that ZD's gain above the mutual-defect outcome (1 in the standard model) is χ > 1 times the opponent's gain above the mutual-defect outcome. The general equation for this is:

S(p,q) - P = χ (S(q,p) - P), where χ > 1 is the rate of extortion.

The extortion conditional probabilities (own move listed first, as in the strategy table) are:
If CC, cooperate with probability 1 - e (χ - 1)(R - P)/(P - S)
If CD, cooperate with probability 1 - e (1 + χ (T - P)/(P - S))
If DC, cooperate with probability e (χ + (T - P)/(P - S))
If DD, cooperate with probability 0

EXTORT-2 says give the ZD row player twice the gain above P that the column player gets, i.e. χ = 2 in the equations. The column player tries to maximize its own score, which occurs when it plays All C. So the extortion is that ZD says: maximize your own interest, and with my chosen conditional probabilities you will play so that I get twice the gain above P that you get. This strategy is an ULTIMATUM game strategy … We divide $1000 and I have the chip that says I get $1000 if you agree. I offer you 1 cent; your maximizing choice is … 1 ct.

Using R = 3, S = 0, T = 5, P = 1 and χ = 2 in the formulas above, the conditional probabilities for getting twice the gain above P that you get are CC → 1 - 2e, CD → 1 - 9e, DC → 6e, DD → 0, for any 0 < e ≤ 1/9 (for example, e = 1/18 gives 8/9, 1/2, 1/3, 0).

Revolution in Game Theory?
Dong, Zhi-Hai, and Tao (Chin. Phys. B, 2014): “ZD ... fundamentally changes the research paradigm of game theory. In the framework of ZD … are dozens of ingenious ideas and untraditional approaches for analyzing not only prisoner’s dilemma but also bi-matrix games, which dramatically expand our understanding of the stochastic process, the mutual benefit, the cooperation incentive, and even the optimal control in the repeated games.”

William Press: “When both players have a theory of mind (that is, are not just evolving to maximize their own score) are all games in some deep way, actually Ultimatum Games?”

Freeman Dyson: “Cooperation loses and defection wins ... My view of the evolution of cooperation is colored by my memories of childhood … Christmas and Guy Fawkes (days). Christmas was the festival of love and forgiveness. Guy Fawkes was the festival of hate and punishment... (for) the guy who tried to blow up the King and the Parliament in 1605 and was gruesomely punished by torture and burning. For the children, Christmas was boring and Guy Fawkes was fun. We were born with an innate reward system that finds joy in punishing cheaters. The system evolved to give cooperative tribes an advantage over noncooperative tribes, using punishment to give cooperation an evolutionary advantage within the tribe. This double selection of tribes and individuals goes way beyond the Prisoners' Dilemma model.”

xxxx: “To me this provides the demonstrable benefit of unions and governments … companies in capitalist societies make use of ZD strategies to exploit their shorter term acting employees (“employees who do not know ZD strategies”). It is therefore in the interests of workers to create their ZD strategy organization (aka a union).”

How does the ZD analysis change the PD model?
ZD assumes a fundamental difference in the players: the player who uses ZD knows ZD, while the unwitting player does not and just adjusts to gain the best it can.
By giving information to one player, it turns PD into an ultimatum game, with ZD able to make the first offer. The column player either does what is in its best interest … or says “To hell with you” and plays D in every turn … the result is that no one gets above the P outcome. The general result in ultimatum games is that people often offer to share rewards equally in the belief that both will stick to that outcome, but if the player who goes first divides the rewards 60-40 or even 70-30, the second player follows maximizing behavior and accepts. Thus, ZD strategies provide the first player with strong unilateral control in games, but if both players know ZD and neither blindly maximizes their returns, the ZD strategy collapses.

What happens to ZD in evolutionary games with many strategies? In lecture 4, we will see that ZD does poorly in an Axelrod tournament, in part because when two ZD extortionary strategies meet, they end up at DD, the lowest value, unless they decide to move their conditional probabilities of cooperation higher to be generous and play TFT.

Conclusion
1 -- In one-on-one contests where the winner lives and the loser dies, D triumphs. ALL D wins or draws all the time. It beats TFT in round one and ties afterwards. TFT never wins one on one.
2 -- Outcomes in a tournament with many strategies depend on the ecology/population of strategies and how they evolve over time --> analysis of evolutionarily stable strategies (ESS) next class, where strategies evolve depending on their total points relative to others.
3 -- Both cooperative and defect strategies score most when playing against C strategies, but D scores more. Cooperative strategies outscore D by playing against TFT-type strategies. Even though a TFT-type strategy may lose every one-on-one contest, it will “win” by scoring relatively more points than a defect-type strategy.
4 -- ZD strategies allow a player to determine the outcomes of another player who seeks to maximize its rewards, and thus to win one-on-one against that player, but they falter when both players know ZD and are willing to sacrifice rewards to change the game (and in an evolutionary context, as we see next class).
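The claim from the ZD section that the e = f = 1/3 strategy pins the opponent's long-run score at 2 can be checked numerically. The sketch below builds the 4 by 4 transition matrix M from two mem 1 strategies, approximates the stationary distribution v = M v by repeated multiplication, and computes both scores. All function names are illustrative, the "mixed" opponent is an arbitrary made-up strategy, and the iteration assumes the chain settles to a single stationary distribution, which holds for the opponents tried here.

```python
from fractions import Fraction

R, S, T, P = 3, 0, 5, 1   # payoffs from the notes
SWAP = [0, 2, 1, 3]       # CC, CD, DC, DD as seen by the other player

def transition_matrix(p, q):
    """M[i][j]: probability of outcome i this round given outcome j last round.

    Outcomes are ordered CC, CD, DC, DD from the row player's side (own move
    first); p and q are each indexed by their owner's own view of the outcome.
    """
    M = [[0.0] * 4 for _ in range(4)]
    for j in range(4):
        pc, qc = p[j], q[SWAP[j]]
        M[0][j] = pc * qc
        M[1][j] = pc * (1 - qc)
        M[2][j] = (1 - pc) * qc
        M[3][j] = (1 - pc) * (1 - qc)
    return M

def stationary(M, steps=5000):
    """Approximate v = M v by repeated multiplication from a uniform start."""
    v = [0.25] * 4
    for _ in range(steps):
        v = [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]
    return v

def long_run_scores(p, q):
    """Long-run average payoffs (row player, column player)."""
    v = stationary(transition_matrix(p, q))
    row = v[0] * R + v[1] * S + v[2] * T + v[3] * P
    col = v[0] * R + v[1] * T + v[2] * S + v[3] * P
    return row, col

def equalizer(e, f):
    """Probabilities of C after CC, CD, DC, DD from the e, f recipe in the notes."""
    e, f = Fraction(e), Fraction(f)
    return [float(x) for x in (1 - e, 1 - 2 * e - f, (e + 3 * f) / 2, f)]

zd = equalizer(Fraction(1, 3), Fraction(1, 3))   # 2/3, 0, 2/3, 1/3
opponents = {
    "All C": [1, 1, 1, 1],
    "All D": [0, 0, 0, 0],
    "TFT": [1, 0, 1, 0],
    "mixed": [0.9, 0.5, 0.2, 0.1],   # arbitrary example opponent
}
for name, q in opponents.items():
    row, col = long_run_scores(zd, q)
    print(f"{name:6s} ZD gets {row:.3f}, opponent gets {col:.3f}")
```

The opponent's score stays at 2 in every case while the ZD player's own score varies, which is the sense in which the strategy controls the other player's outcome but not its own. With these payoffs the recipe appears to pin the opponent at (e + 3f)/(e + f), so other choices of e and f move the target between 1 and 3.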