July 28, 2006  4:38  WSPC/Book Trim Size for 9.75in x 6.5in  ws-book975x65

Chapter 4

How to Design a Strategy to Win an IPD Tournament

Jiawei Li
Harbin Institute of Technology, China
Email: lijiawei [email protected]
Imagine that a player in an IPD tournament knows the strategy of each of his opponents beforehand; he will defect against opponents like ALLC or ALLD and cooperate with opponents such as GRIM or TFT in order to maximize his payoff. This means that he can interact with each opponent optimally and receive higher payoffs. Although such a priori information is not available, one can identify a strategy during the game. For example, if a strategy cooperated in each of the previous 10 rounds while its opponent defected, it seems sensible to deduce that it will always cooperate. In fact, each strategy gradually reveals itself during an IPD game; moreover, we can identify the strategy not only after the game but possibly after just a few rounds. With an efficient identification mechanism, it is possible for a strategy to interact with most of its opponents optimally.
However, two main problems must be solved in designing an efficient identification mechanism. Firstly, it is impossible, in theory, for a strategy to identify an opponent within a finite number of rounds, because the number of possible strategies is huge. Only strategies belonging to a preconcerted finite set can be identified, and this set may be just a small proportion of all possible strategies, because identification is of no use if it takes too long. Secondly, there is a risk that exploring an opponent puts the player into a much worse position. In other words, such an action may have a negative effect on future rewards. For example, in order to distinguish ALLC from GRIM, a strategy has to defect at least once, thereby losing the chance to cooperate with GRIM in the future.
In this chapter we will discuss how to resolve these problems, how to design an identification mechanism for IPD games, and how the Adaptive Pavlov strategy, which ranked first in Competition 4 of the 2005 IPD tournament, was designed.
4.1  Analysis of strategies involved in IPD games
Each strategy has disadvantages as well as advantages. A strategy may receive high payoffs when its opponent belongs to one set of strategies, and lower payoffs when its opponent belongs to another. However, some strategies consistently do better than others in IPD tournaments.
The strategies involved in IPDs can be classified according to whether or not they respond to their opponents. One set of strategies is fixed, playing a predetermined action no matter what the opponent has done; ALLD, ALLC and RAND are typical. Other strategies are more complicated, and their actions depend on their opponent's behavior. TFT, for example, starts with COOPERATE and then repeats its opponent's last move. The second set is obviously superior to the first, since strategies like TFT, TFTT and GRIM have always performed better than 'fixed' strategies in past IPD tournaments.
The question, then, is what the optimal response to each opponent is. Is TFT's imitation of the opponent's last move the best response? Although TFT has been shown to be superior to many other strategies, it is not good enough to win every IPD tournament.
Let's consider a simulated IPD tournament with 9 players: ALLC, ALLD, RAND, GRIM, TFT, STFT, TFTT, TTFT, and Pavlov. These strategies are described in Table 4.1. Each is simple and representative, and each has appeared in past IPD tournaments.
Table 4.1  Description of the players of the IPD simulation.

Players   Descriptions
ALLC      Always plays COOPERATE.
ALLD      Always plays DEFECT.
RAND      Plays DEFECT or COOPERATE with probability 1/2 each.
GRIM      Starts with COOPERATE, but after one defection always plays DEFECT.
TFT       Starts with COOPERATE, and then repeats the opponent's moves.
TFTT      Like TFT, but it plays DEFECT only after two consecutive defections.
STFT      Like TFT, but its first move is DEFECT.
TTFT      Like TFT, but it plays two DEFECTs after an opponent's defection.
Pavlov    The result of each round is classified as SUCCESS (payoff 5 or 3) or
          DEFEAT (payoff 1 or 0). If the last result was a SUCCESS it repeats
          its move; otherwise it plays the other move.
The rule of the simulation is that each strategy plays a 200-round IPD game with every strategy (including itself). The payoffs for each round are as shown in Fig. 4.1. The total payoff received by any given strategy is the sum of its payoffs throughout the tournament.
                     Player 2's choice
                     COOPERATE      DEFECT
Player 1's choice
    COOPERATE        (3, 3)         (0, 5)
    DEFECT           (5, 0)         (1, 1)

Fig. 4.1  Payoff table of the IPD tournament. The numbers in brackets denote the payoffs the two players receive in a round of a game.

The results of the tournaments vary because there are random choices in the strategies of Pavlov and RAND. In order to decrease the variability of the results, the tournament is repeated several times and the average score of each strategy is calculated. Simulation results show that TFT, TFTT and GRIM acquire obviously higher scores than the others, and their average scores across several tournaments are quite close. TFTT, however, wins more single tournaments than the others: for example, TFTT wins 11 tournaments out of a total of 20, while TFT wins 4 and GRIM wins 5. In addition, if Pavlov and RAND are removed, TFTT always wins.
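The round-robin setup just described can be sketched in Python. The payoff table follows Fig. 4.1 and the strategies follow Table 4.1; the function and variable names are my own, and only a few of the nine strategies are shown.

```python
import random

C, D = "C", "D"

# Payoff table from Fig. 4.1: PAYOFF[(my_move, opp_move)] = (my_payoff, opp_payoff).
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

def allc(my_hist, opp_hist):
    return C

def alld(my_hist, opp_hist):
    return D

def rand(my_hist, opp_hist):
    return random.choice([C, D])

def tft(my_hist, opp_hist):
    # Starts with COOPERATE, then repeats the opponent's last move.
    return C if not opp_hist else opp_hist[-1]

def play_game(s1, s2, rounds=200):
    h1, h2, p1, p2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        r1, r2 = PAYOFF[(m1, m2)]
        h1.append(m1); h2.append(m2)
        p1 += r1; p2 += r2
    return p1, p2

def tournament(strategies, rounds=200):
    # Every strategy plays one game against every strategy, including itself;
    # total payoff is summed over all of a strategy's games.
    totals = {name: 0 for name in strategies}
    for name1, s1 in strategies.items():
        for name2, s2 in strategies.items():
            score1, _ = play_game(s1, s2, rounds)
            totals[name1] += score1
    return totals
```

Adding the remaining strategies of Table 4.1 is a matter of writing one more history-to-move function each.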
One of the limitations of TFT is that it inevitably runs into a defecting-defected cycle (TFT plays COOPERATE while its opponent defects, and then TFT plays DEFECT while its opponent cooperates) when its opponent happens to be STFT. However, cooperation, with its higher payoffs, would be achieved if TFT cooperated once more after its opponent defects. TFTT is superior to TFT in this regard, and this is why TFTT wins more tournaments than TFT in the above IPD simulation. It is easy to verify that TFT will not get lower scores than TFTT if STFT is removed from the simulation.
Thus, we can improve TFT in this way: when TFT enters a defecting-defected cycle (for example, a sequence of 3 defecting-defected pairs), it chooses COOPERATE in two consecutive rounds. This modified TFT (MTFT) achieves higher payoffs than TFT when its opponent is STFT. Substituting MTFT for TFT, IPD experiments show that MTFT gets the highest average score and wins more single tournaments than the others.
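A sketch of this modification, assuming the three-pair detection window given in the text; the `pending` counter and the function shape are my own representation:

```python
C, D = "C", "D"

def mtft_move(my_hist, opp_hist, pending):
    # pending: one-element list counting forced COOPERATE moves still owed,
    # so the function stays stateless apart from this counter.
    if pending[0] > 0:
        pending[0] -= 1
        return C
    # A defecting-defected cycle: in each of the last three rounds the two
    # players chose opposite moves.
    if len(my_hist) >= 3 and all(m != o for m, o in zip(my_hist[-3:], opp_hist[-3:])):
        pending[0] = 1  # this COOPERATE plus one forced one = two in a row
        return C
    return C if not opp_hist else opp_hist[-1]  # otherwise plain TFT
```

Against STFT the cycle is detected in round 4, and mutual cooperation is locked in from then on.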
MTFT uses an identification technique: it identifies STFT by detecting defecting-defected cycles during an IPD game. When the opponent is considered to be STFT, the optimal action (cooperating in two consecutive rounds) is carried out in order to maximize future payoffs. It is natural to deduce that MTFT can be further improved so that it identifies more strategies and then interacts with them optimally.
In the following sections, an approach to identifying each strategy in a finite set will be introduced. A strategy can interact with its opponents almost optimally by using this identification mechanism.
4.2  Estimation of possible strategies in an IPD tournament
In this section, we seek to define a finite set of strategy types to be identified. Since the number of possible IPD strategies is infinite, it is impossible to identify each of them in a finite number of rounds. For example, suppose that a strategy cooperated with its opponent in 10 consecutive rounds while its opponent defected continuously. Although it is very likely to be ALLC, there are always other possibilities: it may be a GRIM whose trigger is 11 defections; it may be a RAND that just happened to play 10 consecutive COOPERATEs; or it may be a combination of ALLC and TFT that will behave as TFT in the following rounds. However, if only ALLC belongs to the identification set, those other possibilities are eliminated.
How to choose the identification set depends on prior knowledge and subjective estimation. Some strategies, like TFT, are likely to appear, while others are designated as default strategies.
There are numerous strategies one can design for an IPD tournament. However, most of them seldom appear because their chances of winning are very small. For example, consider a strategy that cooperates in the first two rounds, defects in the following two rounds, and then continues to cooperate and defect alternately. Few players will apply such a strategy because it is unlikely to win any IPD tournament. In short, the strategies that usually win appear frequently and the others appear infrequently.
We define two classes of IPD strategies: cooperating and defecting. Cooperating strategies, for example TFT and TFTT, wish to cooperate with their opponents and are never the first to defect. Defecting strategies, for example ALLD and Pavlov beginning with DEFECT (PavlovD), wish to defect in order to maximize their payoffs, and they always defect first.
The cooperating strategies differ in how they respond to the opponent's defections. For example, TFTT is more forgiving than TFT, as it retaliates only if its opponent has defected twice; GRIM is sterner than TFT, as it never forgives a defection. These strategies can therefore be ordered by their responses to the opponent's defections, as shown in Fig. 4.2; their rules are the same as those described in the previous simulation.
Forgiving                                        Stern
ALLC        TFTT        TFT        TTFT        GRIM

Fig. 4.2  The cooperating strategies.
The defecting strategies differ in how strongly they insist on defecting. PavlovD is a representative strategy in this set. It starts with DEFECT; if the opponent is too forgiving to retaliate, it defects forever, and otherwise it tries to cooperate with the opponent.1 The defecting strategies can be ordered as shown in Fig. 4.3.
Other simple strategies, which lack a clear objective, differ from the cooperating and defecting strategies and hardly ever get high scores in IPD tournaments.
Most of the players in an IPD tournament at the present time will be cooperating strategies, since cooperating strategies have been dominant in most past tournaments. There will also be a small number of defecting strategies. Based on this idea, we have designed the Adaptive Pavlov strategy, which applies a simple mechanism to distinguish cooperating strategies and several representative defecting strategies.

Defect less                          Defect more
PavlovD        STFT        ALLD

Fig. 4.3  The defecting strategies.

1 Although PavlovD tries to cooperate with an opponent when the opponent retaliates upon its defection, it seldom succeeds. For example, even if PavlovD meets a forgiving strategy like TFTT, they cannot keep cooperating in the game. In fact, if PavlovD cooperated just one more time, cooperation could be achieved. We have examined a modified PavlovD (MPavlovD) strategy that starts with DEFECT and cooperates twice when the opponent retaliates. Simulation results show that MPavlovD always scores higher than PavlovD.
4.3  Interacting with a strategy optimally
For any strategy there must be another strategy that deals with it optimally. Because ALLC, ALLD and RAND are independent of the opponent's behavior, ALLD is the optimal response to them. Because GRIM, TFT, STFT and TTFT retaliate as soon as their opponent defects, the optimal response is to always cooperate but defect in the last round. TFTT is more charitable and forgives a single defection; therefore, its opponent can maximize the payoff by alternately choosing DEFECT and COOPERATE. If Pavlov starts with COOPERATE, its opponent should always cooperate except in the last round; otherwise, its opponent should start with DEFECT and then always cooperate except in the last round. Table 4.2 shows the optimal strategy for dealing with each strategy of Table 4.1.
Table 4.2  Optimal strategies to interact with a known strategy.

Strategies   Optimal strategy of opponent
ALLC         Always plays DEFECT.
ALLD         Always plays DEFECT.
RAND         Always plays DEFECT.
GRIM         Always plays COOPERATE, except DEFECT in the last move.
TFT          Always plays COOPERATE, except DEFECT in the last move.
TFTT         Starts with DEFECT, and then plays COOPERATE and DEFECT in turn.
STFT         Always plays COOPERATE, except DEFECT in the last move.
TTFT         Always plays COOPERATE, except DEFECT in the last move.
Pavlov       If Pavlov starts with DEFECT, it starts with DEFECT and then always
             plays COOPERATE, except that it plays DEFECT in the last round; if
             Pavlov starts with COOPERATE, it always plays COOPERATE, except that
             it plays DEFECT in the last round.
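Table 4.2 can be encoded as a map from an identified opponent to a move policy. This is a minimal sketch under my own conventions: rounds are 0-indexed, N is the known game length, and the Pavlov entry is split into two keys by its known first move.

```python
C, D = "C", "D"

def always_defect(t, N):
    # Against ALLC, ALLD and RAND: their behavior is fixed, so always defect.
    return D

def cooperate_defect_last(t, N):
    # Against GRIM, TFT, STFT and TTFT: cooperate, defecting only in round N-1.
    return D if t == N - 1 else C

def alternate(t, N):
    # Against TFTT: start with DEFECT, then alternate DEFECT and COOPERATE.
    return D if t % 2 == 0 else C

def exploit_pavlov(first_move):
    # Against Pavlov with a known first move: mirror that first move (a DEFECT
    # flips PavlovD into cooperating), then cooperate until the last round.
    def policy(t, N):
        if t == 0:
            return first_move
        return D if t == N - 1 else C
    return policy

OPTIMAL = {
    "ALLC": always_defect, "ALLD": always_defect, "RAND": always_defect,
    "GRIM": cooperate_defect_last, "TFT": cooperate_defect_last,
    "STFT": cooperate_defect_last, "TTFT": cooperate_defect_last,
    "TFTT": alternate,
    "PavlovC": exploit_pavlov(C), "PavlovD": exploit_pavlov(D),
}
```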
Given an IPD tournament with n players, a player will win the tournament if it interacts with each of its opponents optimally. For example, a unique ALLD will win when the other n − 1 players in an IPD tournament are all ALLC. Hence, the winning strategy of an IPD tournament must be optimal in interacting with most of the others.
Although a player's strategy is unknown to his opponent before a game, the strategy gradually emerges as the game progresses. It is not difficult for a human player to identify the strategy of his opponent, but it is harder for a computer program to gain this ability. To make it feasible, a method is needed to distinguish each type of strategy from the others, so that a computer program can respond to each type with the relevant response. Under the assumption that every player belongs to a pre-defined finite set of strategies, an example is given to show how the method of identification is realized and how the winning strategy is designed.
Consider an IPD tournament with 10 players. Besides the players shown in Table 4.1, let's add a new player, MyStrategy (MS), which applies an identification mechanism to identify its opponent. The rules are the same as those described in the previous simulation.
MS starts with DEFECT. If its opponent chooses DEFECT in the first round, MS chooses COOPERATE in round two; otherwise MS chooses DEFECT. MS always chooses COOPERATE in the third round. In this way, most of the strategies can be identified after just three rounds.
For example, suppose that the choices of MS and its opponent in the first 3 rounds are as shown in Fig. 4.4. The strategy of the opponent can then be confirmed to be RAND. Because the opponent starts with DEFECT, it must be one of ALLD, STFT, RAND and Pavlov. Since MS defects in the first round and the opponent cooperates in round two, it cannot be ALLD or STFT. Since MS and the opponent both cooperate in the second round, the opponent would not defect in the third round if it were Pavlov. Therefore, the opponent must be RAND. The optimal strategy for interacting with RAND is ALLD, so MS will behave as ALLD in the following rounds of the game.
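The deduction in this worked example can be written as a small decision procedure. It covers only the branch where the opponent opens with DEFECT; as the text notes, RAND can mimic any prefix, so each verdict is provisional, and the STFT branch is my own inference from MS's rules.

```python
C, D = "C", "D"

def identify_defector_start(opp_moves):
    # opp_moves: the opponent's moves in rounds 1-3, given that MS played
    # D, C, C (the opponent's round-1 DEFECT makes MS cooperate in round 2).
    # An opening DEFECT narrows the candidates to ALLD, STFT, RAND and Pavlov.
    assert opp_moves[0] == D
    if opp_moves[1] == C:
        # Cooperating right after MS's round-1 DEFECT rules out ALLD and STFT.
        if opp_moves[2] == C:
            return "Pavlov"  # repeats its successful round-2 COOPERATE
        return "RAND"        # Pavlov would not defect after mutual cooperation
    # The opponent defected again in round 2: ALLD keeps defecting, while
    # STFT would mirror MS's round-2 COOPERATE.
    return "ALLD" if opp_moves[2] == D else "STFT"
```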
                   Round 1      Round 2      Round 3
MS's moves         Defect       Cooperate    Cooperate
Opponent's moves   Defect       Cooperate    Defect

Fig. 4.4  A possible process of a game (showing that the opponent is RAND).
Some possible results of identification for the 9 strategies are listed in Table 4.3, where 'C' denotes COOPERATE and 'D' denotes DEFECT. Because RAND chooses its moves randomly, it may behave like any other strategy during a short period; therefore, more rounds are needed to distinguish RAND from the other strategies. If a process differs from all of those shown in Table 4.3, the strategy of the opponent must be RAND. In this way, a strategy can be identified after several rounds of a game, and then the optimal strategy can be applied.
Table 4.3  Identification of the 9 strategies.

Players         Possible moves in an IPD game    Identification result
MyStrategy      D C C
The opponent    D C C                            Pavlov (RAND)

MyStrategy      D C C
The opponent    D D D                            ALLD (RAND)

MyStrategy      D C C
The opponent    D D D C                          STFT (RAND)

MyStrategy      D D C
The opponent    C C C C                          ALLC (RAND)

MyStrategy      D D C
The opponent    C C C D                          TFTT (RAND)

MyStrategy      D D C
The opponent    C C D C                          Pavlov (RAND)

MyStrategy      D D C C
The opponent    C C D D C                        TFT (RAND)

MyStrategy      D D C C C
The opponent    C D D D C                        TTFT (RAND)

MyStrategy      D D C C C
The opponent    C D D D D                        GRIM (RAND)
Ten IPD tournaments with the above 10 players were carried out.2 The simulation results are shown in Fig. 4.5: MS gains the highest average payoff and achieves the highest score in each single tournament. The reason for MS's success is that it has interacted almost optimally with most of the strategies in this IPD tournament.
Most IPD strategies, such as TFT or Pavlov, are memory-one strategies, which can only respond to the opponent's last move; however, the past course of the game contains more information. The identification mechanism of MS extracts information about the opponent's strategy, so MS responds not just to the opponent's past moves but to the opponent's strategy. By identifying different opponents, MS makes use of more information than the simple strategies do. This is why MS is able to win IPD tournaments.
Different identification approaches may lead to different results for MS. For example, the strategies GRIM, TFT and ALLC all start with COOPERATE, and they will not defect if their opponents don't. To identify each of these strategies, MS starts with DEFECT and so loses the chance to cooperate with GRIM. On the other hand, if MS doesn't defect first, it cannot distinguish these 3 strategies and cannot interact with ALLC optimally. The risk involved in exploring the opponent must be considered in order to choose an efficient, payoff-maximizing identification approach.

2 The number of rounds in an IPD game is usually not fixed, in order to prevent the players from knowing when the game will end. The simulation applies a fixed number of rounds in order to decrease the complexity of computation. However, the strategy of MS does not make use of this to get extra payoff; that is to say, MS does not purposely choose DEFECT in the last round of a game.

Players   Points in 10 tournaments                                       Average   Rank
MS        6134 6213 6179 6127 6202 6175 6152 6172 6212 6187              6175.3    1
TFTT      5957 5996 5970 6003 5994 5959 5965 5969 5966 5976              5975.5    2
TFT       5961 5936 5919 5946 5959 5938 5940 5929 5954 5978              5946.0    3
Pavlov    5718 5691 5725 5775 5816 5763 5748 5763 5733 5745              5747.7    4
TTFT      5725 5723 5725 5717 5719 5725 5746 5732 5722 5716              5725.0    5
GRIM      5404 5394 5416 5410 5440 5468 5322 5400 5390 5384              5402.8    6
ALLC      5115 5091 5103 5127 5103 5103 5103 5082 5109 5091              5102.7    7
RAND      4339 4349 4254 4340 4216 4219 4258 4241 4228 4274              4271.8    8
STFT      4165 4187 4160 4169 4179 4144 4173 4158 4142 4158              4163.5    9
ALLD      3800 3792 3852 3792 3848 3856 3832 3864 3832 3832              3830.0    10

Fig. 4.5  Simulation results of 10 IPD tournaments.
4.4  Escape from the trap of defection
When a player begins to explore the opponent, there is a risk that the identifying process puts the player into a much worse position. Some strategies, especially those with a trigger mechanism such as GRIM, change their behavior at the trigger point. For example, the strategy MS described in the previous section defects at the beginning of IPD games in order to distinguish among the cooperating strategies ALLC, TFT and GRIM; as a result, the chance to cooperate with GRIM is lost. In IPD games, the risk of identification is mainly the trap of defection: an identifying process that leads the opponent to keep defecting, with nothing that can be done to rescue the situation.
It might appear that a strategy will never run into the trap of defection if it never defects first, but this is not the case. Suppose a strategy keeps playing COOPERATE as long as its opponent defects, and defects forever once its opponent cooperates; then any cooperating strategy will be defected against when interacting with it, while most defecting strategies keep it cooperating. If this reverse-GRIM strategy is as likely to appear in a game as GRIM is, then cooperating and defecting carry equal risk of invoking future defection. This means that the risk of the defection trap always exists, whether or not an identification mechanism is applied.
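The reverse-GRIM thought experiment can be made concrete; the class name and move encoding are my own:

```python
class ReverseGrim:
    # Keeps COOPERATING while the opponent defects; once the opponent
    # cooperates, it is "triggered" and defects forever.
    def __init__(self):
        self.triggered = False

    def move(self, opp_last):
        # opp_last: the opponent's previous move ("C"/"D"), or None in round 1.
        if opp_last == "C":
            self.triggered = True
        return "D" if self.triggered else "C"
```

Against such an opponent, a never-defect-first strategy walks straight into the defection trap, which is the point of the example.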
One may argue that reverse-GRIM types of strategies will not appear as frequently as GRIMs in IPDs, so cooperating is safer than defecting, and the MS strategy is more likely to run into the defection trap than TFT. That is right, but it does not show that the defection trap is unavoidable for a strategy with an identification mechanism, because many identification approaches can be applied. For example, a simple way to avoid retaliation from GRIM is not to defect first. The identification mechanism that Adaptive Pavlov used in the 2005 IPD tournament only explored defecting strategies, in order to keep cooperating with each of the cooperating strategies.
Again, what kind of identification mechanism should be applied depends on prior knowledge and subjective estimation. If there are enough ALLC strategies in an IPD game, it is worth distinguishing them from the other cooperating strategies. But if GRIMs prevail, it is better not to defect first. Generally speaking, we can compare different identification approaches and choose the most efficient one, although uncertainty still exists.
4.5  Adaptive Pavlov and Competition 4 of the 2005 IPD tournament
The 2005 IPD tournament comprised 4 competitions, of which Competition 4 mirrors Axelrod's original competition. There were 50 players in total, including 8 default strategies. The Adaptive Pavlov (AP) strategy, which ranked first in Competition 4, is analyzed in this section.
AP groups every 6 consecutive rounds into a period and applies different tactics in different periods. It behaves as a TFT strategy in the first period, and then changes its strategy according to the identification of its opponent.
AP classifies the possible opponents into 5 categories: cooperating strategies, STFT, PavlovD, ALLD and RAND.3 By identifying the opponent's strategy at the end of a period, AP shifts its strategy in the new period in order to deal with each opponent optimally.
AP is never the first to defect, and thus it will cooperate with each of the cooperating strategies. AP tries to cooperate with STFT and PavlovD, and defects against strategies such as ALLD or RAND. The processes of AP's interaction with cooperating strategies, ALLD, STFT and PavlovD in the first 6 rounds (during which AP behaves as TFT) are shown in Fig. 4.6. For example, when an interaction like that of Fig. 4.6(c) occurs, the opponent is identified as STFT and AP will cooperate twice in the next period in order to achieve cooperation. If the opponent is determined to be PavlovD, AP will defect once and then always cooperate in the next period. If the process of interaction differs from all of those shown in Fig. 4.6, the opponent is identified as RAND; in this way, any strategy that is not defined in the identification set is likely to be identified as RAND. Once cooperation has been established, AP will always cooperate unless a defection occurs. Identification of the opponent is performed in each period throughout the IPD tournament, in order to correct misidentification and to deal with players who change their strategies during a game.
As we have mentioned, most of the players in an IPD tournament will be cooperating strategies. The

3 RAND is claimed to be a default strategy.
Round      1   2   3   4   5   6
AP         C   C   C   C   C   C
Co-op      C   C   C   C   C   C
           (a)

Round      1   2   3   4   5   6
AP         C   D   D   D   D   D
ALLD       D   D   D   D   D   D
           (b)

Round      1   2   3   4   5   6
AP         C   D   C   D   C   D
STFT       D   C   D   C   D   C
           (c)

Round      1   2   3   4   5   6
AP         C   D   D   C   D   D
PavlovD    D   D   C   D   D   C
           (d)

Fig. 4.6  Identifying the opponent according to the process of interaction in six rounds. (a) AP cooperates with any cooperating strategy. (b) ALLD always defects. (c) If a strategy alternately plays D and C when interacting with TFT, it is identified as STFT. (d) If a strategy periodically plays D-D-C when interacting with TFT, it is identified as PavlovD.
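The six-round patterns of Fig. 4.6 and the reactions described in the text can be sketched as a lookup; the category names and list encoding are my own, and any unlisted pattern falls through to RAND.

```python
C, D = "C", "D"

# The opponent's six moves against AP's first-period TFT behaviour (Fig. 4.6).
PATTERNS = {
    (C, C, C, C, C, C): "cooperating",
    (D, D, D, D, D, D): "ALLD",
    (D, C, D, C, D, C): "STFT",
    (D, D, C, D, D, C): "PavlovD",
}

def ap_identify(opp_moves):
    return PATTERNS.get(tuple(opp_moves), "RAND")

def ap_next_period_opening(category):
    # Reactions from the text: cooperate twice to pacify STFT; defect once and
    # then cooperate against PavlovD; keep cooperating with cooperators; defect
    # against ALLD and RAND.
    return {
        "cooperating": [C, C],
        "STFT": [C, C],
        "PavlovD": [D, C],
    }.get(category, [D, D])
```

Re-running this classification at the end of every period gives the misidentification-correcting behaviour described above.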
The results show that there were 34 cooperating strategies in Competition 4 (including the 4 default strategies TFT, TFTT, GRIM and ALLC). Apart from the default strategies, there were 3 strategies that behaved like ALLD, 5 that behaved like STFT, and 2 that behaved like NEG. As shown in Table 4.4, AP can identify most of the strategies involved in Competition 4.4
Table 4.4  Categories of the strategies in Competition 4.

Categories                  Number of the strategies
Cooperating strategies      34
Strategies like STFT        6
Strategies like ALLD        4
Strategies like NEG         3
Strategies like RAND        1
Others                      2

4.6  Discussion and conclusion
AP belongs to the class of adaptive automata for IPD. However, it differs from other adaptive strategies in how adaptation is achieved. The approach of AP belongs to the set of artificial intelligence approaches. Rather than adjusting parameters when computing responses, as most adaptive strategies do, AP uses an identification mechanism that acts as an expert system. Knowledge about different opponents is expressed in 'if-then' form; for example, if the opponent cooperates in 6 rounds then it is determined to be ALLC. In this way, the information acquired and used is expressed transparently, and AP can tell what strategy its opponent is.

4 AP regards NEG as RAND. This maximizes the score when interacting with strategies like NEG, because the optimal strategy for interacting with either NEG or RAND is ALLD.
Recent years have seen many AI approaches applied to evolutionary game theory and the IPD, for example reinforcement learning, artificial neural networks, and fuzzy logic (Sandholm and Crites (1996), Macy and Carley (1996), Fort and Pérez (2005)). Computing a best response to an unknown strategy has been one of the objectives of these AI approaches. The problem is in general intractable because of its computational complexity, and finding the best response to an arbitrary strategy can be non-computable (Papadimitriou (1992), Nachbar and Zame (1996)). Reinforcement learning, based on the idea that the tendency to produce an action should be reinforced if it produces favourable results and weakened if it produces unfavourable results (Gilboa (1988), Gilboa and Zemel (1989)), is widely used for automata that learn from interaction with others. For the IPD, several approaches have been developed to learn the optimal response to a deterministic or mixed strategy (Carmel and Markovitch (1998), Darwen and Yao (2002)). However, computational complexity is still the main difficulty in applying these approaches to real IPD tournaments. AP's identification mechanism is implemented in a simple way by making use of a priori knowledge, which greatly reduces the computational complexity and makes it practical for AP to respond to the opponent almost optimally. First, a priori knowledge about which strategies are more likely to appear in the IPD tournament is used in determining the identification set; the size of the identification set is restricted in order to reduce computational complexity. Second, a priori knowledge about how well different identification approaches will work in a given environment is used in selecting an efficient identification approach, with which AP can avoid the risks of identification and maximize its payoffs. Third, a priori knowledge about how to identify the opponent from the process of interaction is used in constructing the identification rules. With these simple rules, the AP strategy is easy to understand and improve.
It is obvious that the identification set can be extended to include more identifiable strategies; however, more computation is involved as the size of the identification set increases. We have to make a tradeoff between the wish to identify any strategy and the wish to develop a less complicated strategy. Compared to the NP-completeness of the reinforcement learning approaches (Papadimitriou (1992)), AP's computational complexity is between O(√n) and O(n), depending on the similarities of the strategies to be identified. Therefore, the algorithm of AP is well suited to real IPD tournaments.
The identification mechanism can also work in an environment with noise, where each strategy might, with some probability, misperceive the outcome of a game. Noise blurs the boundaries between different strategies. However, identification is still applicable if a small identification error is admitted. In this circumstance, we can set a threshold value such that the opponent is considered identified once the probability of misidentification is smaller than this value. Just as in the case of identifying RAND, the probability of misidentifying a strategy decreases toward zero as the process of observation and identification repeats.
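As an illustration of this threshold scheme, consider ruling out RAND: RAND reproduces any fixed k-round move sequence with probability (1/2)^k, so the misidentification probability can be driven below any threshold after enough consistent rounds. The sketch below rests on that single assumption; the function name is my own.

```python
def rounds_to_rule_out_rand(threshold):
    # Smallest k such that (1/2)**k < threshold: after k rounds in which the
    # opponent's moves exactly match a deterministic strategy, the chance that
    # a RAND player produced them by luck is below the threshold.
    k, p = 0, 1.0
    while p >= threshold:
        k += 1
        p /= 2.0
    return k
```

For a 1% threshold, seven consistent rounds suffice, since 2^-7 ≈ 0.008.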
Information plays a key role in intelligent activities; individuals with more information gain an advantage over others in most circumstances. With an identification mechanism, strategies like AP acquire information about their opponents, and they are more intelligent than simple strategies such as TFT or Pavlov. This type of strategy is suitable for modeling the decision-making processes of human beings, in which learning and improving frequently happen.
References

[1] Ben-Porath E. (1990) "The complexity of computing a best response automaton in repeated games with mixed strategies," Games and Economic Behavior, 2: 1-12.
[2] Carmel D. and Markovitch S. (1998) "How to explore your opponent's strategy (almost) optimally," Proceedings of the International Conference on Multi Agent Systems, 64-71.
[3] Darwen P. and Yao X. (2002) "Coevolution in iterated prisoner's dilemma with intermediate levels of cooperation: Application to missile defense," International Journal of Computational Intelligence and Applications, 2(1): 83-107.
[4] Fort H. and Pérez N. (2005) "The fate of spatial dilemmas with different fuzzy measures of success," Journal of Artificial Societies and Social Simulation, 8(3).
[5] Gilboa I. (1988) "The complexity of computing best response automata in repeated games," Journal of Economic Theory, 45: 342-352.
[6] Gilboa I. and Zemel E. (1989) "Nash and correlated equilibria: some complexity considerations," Games and Economic Behavior, 1: 80-93.
[7] Macy M. and Carley K. (1996) "Natural selection and social learning in prisoner's dilemma: co-adaptation with genetic algorithms and artificial neural networks," Sociological Methods and Research, 25(1): 103-137.
[8] Nachbar J. and Zame W. (1996) "Non-computable strategies and discounted repeated games," Economic Theory, 8: 103-122.
[9] Papadimitriou C. (1992) "On players with a bounded number of states," Games and Economic Behavior, 4: 122-131.
[10] Sandholm T. and Crites R. (1996) "Multiagent reinforcement learning in the Iterated Prisoner's Dilemma," Biosystems, 37(1-2): 147-166.